[ajug-members] PDF Program

efried at bellsouth.net efried at bellsouth.net
Thu Mar 17 13:27:09 EST 2005


Ghostscript certainly would do the trick but it's heavyweight.

I was referring to a utility program, written in C I think, that extracts text from a PDF file. Its name was something like pdf2text.
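
Once a tool like that has produced plain text, pulling out the value Tim describes is a small regex job in Java. Here is a rough sketch, not a tested solution: the class and method names are made up for illustration, and the pattern assumes the word "Identifier" is followed by optional punctuation and whitespace and then a run of digits. It also assumes the PDFs actually contain extractable text; image-only scans would need OCR first.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IdentifierFinder {
    // Assumed layout: "Identifier", an optional ':' or '#', optional whitespace, then digits.
    private static final Pattern ID_PATTERN =
            Pattern.compile("Identifier\\s*[:#]?\\s*(\\d+)");

    /** Returns the first run of digits following the word "Identifier", if there is one. */
    public static Optional<String> findIdentifier(String extractedText) {
        Matcher m = ID_PATTERN.matcher(extractedText);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical sample of what a text extractor might emit for one document.
        String sample = "ACME Corp\nIdentifier 104532\nPage 1 of 3\n";
        System.out.println(findIdentifier(sample).orElse("not found"));
    }
}
```

The value is kept as a String rather than parsed to an int so that any leading zeros survive; whether that matters depends on how the identifiers are used in the database.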

Good luck,

Eric
> 
> From: Ray Elenteny <ray.elenteny at gmail.com>
> Date: 2005/03/17 Thu PM 01:14:22 EST
> To: "General AJUG membership forum (100-200 messages/month)"
> 	<ajug-members at ajug.org>
> Subject: Re: [ajug-members] PDF Program
> 
> Tim,
> 
> I'm guessing that the utility mentioned by Eric on Linux is probably
> ghostscript.  It does a good job of handling PDF files.
> 
> Essentially, PDF is an object representation of a page.  Getting to
> the text on line "x" is not a straightforward thing to do.  To parse a
> PDF file, one needs to understand and interpret the object model.  In
> a basic interpretation, you can know what text is on what page.  To
> get more precise, you need to start interpreting the object model to
> determine where text is drawn on a page.  I've seen it where each
> character is given a specific location.  To top this all off, many PDF
> files compress objects, so that a given object needs to be read into
> memory and "uncompressed" before it can be interpreted.
> 
> At the company I'm at, I've written what you're looking for. 
> Unfortunately, it's for a commercial application, so I can't share
> the code.  A quick Google search turned up a project on
> SourceForge (http://www.pdfbox.org/) that looks like it might work for
> you.  Depending on your situation, ghostscript may or may not be an
> option.  There are GPL licensing issues to consider if you're
> redistributing your code commercially.
> 
> BTW, the PDF specification may be found at
> http://partners.adobe.com/public/developer/pdf/index_reference.html.
> 
> Ray
> 
> 
> On Thu, 17 Mar 2005 12:47:08 -0500, efried at bellsouth.net
> <efried at bellsouth.net> wrote:
> > Tim,
> > 
> > I did this on Linux a year ago or so and found a utility that converts pdf to text. That is probably the easiest way to accomplish this - though not exactly what you are requesting.
> > 
> > Google for pdf to text - I came up with a number of hits.
> > 
> > Eric
> > >
> > > From: "Tim Scott" <tscott at WilsonLLP.com>
> > > Date: 2005/03/17 Thu AM 11:09:43 EST
> > > To: <ajug-members at www.ajug.org>
> > > Subject: [ajug-members] PDF Program
> > >
> > > Dear ajug members,
> > >
> > > Does anyone have experience reading text from a PDF file (with some sort
> > > of java utility of course)?
> > >
> > > I'm trying to write a little application that looks into an existing PDF
> > > file's text and pulls out a certain value identifying the file (the
> > > identifying text is a numeric value immediately following the word
> > > "Identifier" in the 2nd or 3rd line of the PDF file).  Based on the
> > > identifier, it moves the file to a certain location and makes several
> > > database entries (this is the easy part).  I need to do this with 20,000
> > > scanned documents saved as PDF.
> > >
> > > I've found a few tools advertising PDF text extraction, but would like
> > > some opinions from anybody actually performing a similar type of action.
> > >
> > > Any suggestions or recommendations would be greatly appreciated.
> > >
> > > Thanks,
> > > TRS
> > >
> > >
> > >
> > 
> > 
> > _______________________________________________
> > ajug-members mailing list
> > ajug-members at ajug.org
> > http://www.ajug.org/mailman/listinfo/ajug-members
> > 
> > 
> 




