[ajug-members] PDF Program
efried at bellsouth.net
Thu Mar 17 13:27:09 EST 2005
Ghostscript certainly would do the trick but it's heavyweight.
I was referring to a utility program, I think written in C, that extracts text from a PDF file. Its name was something like pdf2text.
Good luck,
Eric
>
> From: Ray Elenteny <ray.elenteny at gmail.com>
> Date: 2005/03/17 Thu PM 01:14:22 EST
> To: "General AJUG membership forum (100-200 messages/month)"
> <ajug-members at ajug.org>
> Subject: Re: [ajug-members] PDF Program
>
> Tim,
>
> I'm guessing that the utility mentioned by Eric on Linux is probably
> ghostscript. It does a good job of handling PDF files.
>
> Essentially, PDF is an object representation of a page. Getting to
> the text on line "x" is not a straightforward thing to do. To parse a
> PDF file, one needs to understand and interpret the object model. In
> a basic interpretation, you can know what text is on what page. To
> get more precise, you need to start interpreting the object model to
> determine where text is drawn on a page. I've seen it where each
> character is given a specific location. To top this all off, many PDF
> files compress objects, so that a given object needs to be read into
> memory and "uncompressed" before it can be interpreted.
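
The "uncompress before interpreting" step described above can be sketched with nothing but the JDK. This is only an illustration, assuming the stream body uses the common zlib/deflate (FlateDecode) filter and has already been cut out of the file; the class and method names are made up for the example, and the sample content is invented rather than taken from a real PDF.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper; not part of any PDF library.
final class FlateSketch {

    // Inflates a zlib-compressed stream body (assumed to be FlateDecode data).
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                // No progress and nothing left to read: the data is truncated or unsupported.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported stream");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid zlib data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Demo helper only: compresses bytes so there is something to inflate.
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Invented page content: the PDF operators that draw one string of text.
        byte[] content = "BT (Identifier 12345) Tj ET".getBytes(StandardCharsets.US_ASCII);
        byte[] restored = inflate(deflate(content));
        System.out.println(new String(restored, StandardCharsets.US_ASCII));
    }
}
```

Even after inflating, what comes back is a sequence of drawing operators, not lines of text, which is why finding "the text on line x" still takes interpretation of the object model.
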
>
> At the company I'm at, I've written what you're looking for.
> Unfortunately, it's for a commercial application, therefore, I can't
> share the code. In a quick Google search, there's a project on
> SourceForge (http://www.pdfbox.org/) that looks like it might work for
> you. Depending on your situation, ghostscript may or may not be an
> option. There are GPL licensing issues to consider if you're
> redistributing your code commercially.
>
> BTW. The PDF specification may be found at
> http://partners.adobe.com/public/developer/pdf/index_reference.html.
>
> Ray
>
>
> On Thu, 17 Mar 2005 12:47:08 -0500, efried at bellsouth.net
> <efried at bellsouth.net> wrote:
> > Tim,
> >
> > I did this on Linux a year ago or so and found a utility that converts pdf to text. That is probably the easiest way to accomplish this - though not exactly what you are requesting.
> >
> > Google for pdf to text - I came up with a number of hits.
> >
> > Eric
> > >
> > > From: "Tim Scott" <tscott at WilsonLLP.com>
> > > Date: 2005/03/17 Thu AM 11:09:43 EST
> > > To: <ajug-members at www.ajug.org>
> > > Subject: [ajug-members] PDF Program
> > >
> > > Dear ajug members,
> > >
> > > Does anyone have experience reading text from a PDF file (with some sort
> > > of java utility of course)?
> > >
> > > I'm trying to write a little application that looks into an existing PDF
> > > file's text and pulls out a certain value identifying the file (the
> > > identifying text is a numeric value immediately following the word
> > > "Identifier" in the 2nd or 3rd line of the PDF file). Based on the
> > > identifier, it moves the file to a certain location and makes several
> > > database entries (this is the easy part). I need to do this with 20,000
> > > scanned documents saved as PDF.
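
Once some tool has turned a PDF into plain text, pulling out the value described above is a small regex job. A minimal sketch, assuming the text is already in a String and that the number is separated from the word "Identifier" by whitespace and possibly a colon; the class name, method name and sample text are all made up for the example.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the separator handling is a guess at the document layout.
final class IdentifierSketch {

    // The word "Identifier", an optional colon, then a run of digits.
    private static final Pattern ID =
            Pattern.compile("\\bIdentifier\\b\\s*:?\\s*(\\d+)");

    // Returns the first identifier found, kept as a String so leading zeros survive.
    static Optional<String> findIdentifier(String text) {
        Matcher m = ID.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Invented sample of extracted text.
        String text = "Sample Company\nIdentifier 004217\nPage 1 of 3";
        System.out.println(findIdentifier(text).orElse("(none found)"));
    }
}
```

Returning an Optional makes the "no identifier on this page" case explicit, which matters when the result decides where a file gets moved.
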
> > >
> > > I've found a few tools advertising PDF text extraction, but would like
> > > some opinions from anybody actually performing a similar type action.
> > >
> > > Any suggestions or recommendations would be greatly appreciated.
> > >
> > > Thanks,
> > > TRS
> > >
> > >
> > >
> >
> >
> > _______________________________________________
> > ajug-members mailing list
> > ajug-members at ajug.org
> > http://www.ajug.org/mailman/listinfo/ajug-members
>