[ajug-members] PDF Program
Tim Scott
tscott at WilsonLLP.com
Thu Mar 17 14:39:08 EST 2005
Thanks for everyone's help. I'm getting closer, but I'm wondering whether
there might be an easier solution (I'm the only developer at my
company, and it is nice to get other perspectives). Here is how the
whole process works (I probably should have explained it in more detail
right off the bat).
The documents have all been printed and they are all in the same format
(the field containing the identification number is on the same spot for
all the documents). The paper copy is shipped to individuals where they
fill out some information on the sheet and send it back to us. We
review the copy and enter the information into our system (we will
eventually automate this step, but not this year). We then need a PDF
copy of the document for our records.
An employee scans the document and saves it as a PDF (it could be saved
as a different format, but must end up in the system as a PDF).
Multiple documents are scanned at the same time to save time. The
employee then opens a doc management program on the computer, identifies
the individual identified on the document, selects the document type,
selects the newly scanned document, and hits the enter button. The doc
management program then inserts the document into the correct location,
etc. The user does this for every document.
The identification number really identifies the individual filling out
the form and the document type. My goal is for the end user to scan the
document one time and the system takes care of the rest (automatically
looks through all the files in the "Z" drive where scanned documents are
saved, reads the identification number, moves file to appropriate
location and other appropriate steps). I'm just trying to figure out
the best method to retrieve this identification number (as everything
else is easily programmable once this number is retrieved). We will be
scanning about 10,000 of these documents next month, so a one-step
process would be a significant time saver. Is it better to save the file
in some other format and then programmatically convert it to PDF? Or to
grab the info out of the PDF? Any ideas are appreciated.
Thanks for all the help!
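
A minimal sketch of the one-step sweep described above, assuming the
identification number can be read by some extraction routine. The names
ScanSweeper, sweep and readIdentifier are hypothetical, and the archive
layout (one folder per identifier) is only an illustration. The
extraction step is passed in as a function because choosing it is the
open question here.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

final class ScanSweeper {

    /**
     * Moves every *.pdf in inbox to archiveRoot/<identifier>/<file name>.
     * readIdentifier is a placeholder for whatever extraction method is
     * chosen; files it cannot identify stay in the inbox for manual filing.
     */
    static List<Path> sweep(Path inbox, Path archiveRoot,
                            Function<Path, Optional<String>> readIdentifier)
            throws IOException {
        // Collect the file list first so nothing is moved mid-iteration.
        List<Path> pdfs = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(inbox, "*.pdf")) {
            for (Path p : stream) {
                pdfs.add(p);
            }
        }

        List<Path> moved = new ArrayList<>();
        for (Path pdf : pdfs) {
            Optional<String> id = readIdentifier.apply(pdf);
            if (!id.isPresent()) {
                continue; // unidentified: leave it where it is
            }
            // Assumed layout: one directory per identification number.
            Path targetDir = archiveRoot.resolve(id.get());
            Files.createDirectories(targetDir);
            Path target = targetDir.resolve(pdf.getFileName());
            Files.move(pdf, target, StandardCopyOption.REPLACE_EXISTING);
            moved.add(target);
        }
        return moved;
    }
}
```

The database entries and document-type lookup would hang off the same
loop. Files that come back without an identifier are deliberately left
alone so that a person can file them by hand.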
-----Original Message-----
From: ajug-members-bounces at ajug.org
[mailto:ajug-members-bounces at ajug.org] On Behalf Of efried at bellsouth.net
Sent: Thursday, March 17, 2005 1:27 PM
To: ajug-members at ajug.org
Subject: Re: Re: [ajug-members] PDF Program
Ghostscript certainly would do the trick, but it's heavyweight.
I was referring to a utility program, I think written in C, that extracts
text from a PDF file. Its name was something like pdf2text.
Good luck,
Eric
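
Once a utility like that has produced plain text, pulling out the number
is a small regular-expression job. This is a sketch only: it assumes the
text contains the word "Identifier" followed by digits, optionally
separated by a colon or whitespace, and the class name IdentifierFinder
is made up for illustration. It also assumes the PDF carries real text;
a scanned page that is only an image would need OCR before any text
extractor returns something useful.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IdentifierFinder {

    // The word "Identifier", an optional colon and whitespace, then the digits.
    private static final Pattern ID =
            Pattern.compile("Identifier\\s*:?\\s*(\\d+)");

    /** Returns the first number following "Identifier", if there is one. */
    static Optional<String> find(String extractedText) {
        Matcher m = ID.matcher(extractedText);
        if (m.find()) {
            return Optional.of(m.group(1));
        }
        return Optional.empty();
    }
}
```

The number is returned as a String rather than an int so that leading
zeros survive, in case they matter to the document management system.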
>
> From: Ray Elenteny <ray.elenteny at gmail.com>
> Date: 2005/03/17 Thu PM 01:14:22 EST
> To: "General AJUG membership forum (100-200 messages/month)"
> <ajug-members at ajug.org>
> Subject: Re: [ajug-members] PDF Program
>
> Tim,
>
> I'm guessing that the utility mentioned by Eric on Linux is probably
> ghostscript. It does a good job of handling PDF files.
>
> Essentially, PDF is an object representation of a page. Getting to
> the text on line "x" is not a straightforward thing to do. To parse a
> PDF file, one needs to understand and interpret the object model. In
> a basic interpretation, you can know what text is on what page. To
> get more precise, you need to start interpreting the object model to
> determine where text is drawn on a page. I've seen it where each
> character is given a specific location. To top this all off, many PDF
> files compress objects, so that a given object needs to be read into
> memory and "uncompressed" before it can be interpreted.
>
> At the company I'm at, I've written what you're looking for.
> Unfortunately, it's for a commercial application, therefore, I can't
> share the code. In a quick Google search, there's a project on
> SourceForge (http://www.pdfbox.org/) that looks like it might work for
> you. Depending on your situation, ghostscript may or may not be an
> option. There are GPL licensing issues to consider if you're
> redistributing your code commercially.
>
> BTW. The PDF specification may be found at
> http://partners.adobe.com/public/developer/pdf/index_reference.html.
>
> Ray
>
>
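
To illustrate the "uncompress" step Ray mentions: PDF streams are
commonly compressed with the FlateDecode filter, which is zlib/deflate
and can be undone with java.util.zip.Inflater. The sketch below is not a
PDF parser. It assumes the raw stream bytes have already been located in
the file, and the class name FlateDemo and the sample content stream are
made up for the demonstration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class FlateDemo {

    /** Inflates zlib-compressed bytes, as found in a FlateDecode stream. */
    static byte[] inflate(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                // No progress and nothing left to read: the data is incomplete.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new DataFormatException("truncated or unsupported stream");
                }
                out.write(buf, 0, n);
            }
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Compresses bytes with zlib; used here only to build sample input. */
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws DataFormatException {
        // A tiny page content stream: place text at (72, 720) and draw it.
        String content = "BT /F1 12 Tf 72 720 Td (Identifier 48213) Tj ET";
        byte[] packed = deflate(content.getBytes(StandardCharsets.US_ASCII));
        String unpacked = new String(inflate(packed), StandardCharsets.US_ASCII);
        System.out.println(unpacked);
    }
}
```

The sample stream also shows why position matters: the text operator
(Tj) only draws a string, while the Td operands before it say where on
the page it lands.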
> On Thu, 17 Mar 2005 12:47:08 -0500, efried at bellsouth.net
> <efried at bellsouth.net> wrote:
> > Tim,
> >
> > I did this on Linux a year ago or so and found a utility that
> > converts pdf to text. That is probably the easiest way to accomplish
> > this - though not exactly what you are requesting.
> >
> > Google for pdf to text - I came up with a number of hits.
> >
> > Eric
> > >
> > > From: "Tim Scott" <tscott at WilsonLLP.com>
> > > Date: 2005/03/17 Thu AM 11:09:43 EST
> > > To: <ajug-members at www.ajug.org>
> > > Subject: [ajug-members] PDF Program
> > >
> > > Dear ajug members,
> > >
> > > Does anyone have experience reading text from a PDF file (with
> > > some sort of java utility of course)?
> > >
> > > I'm trying to write a little application that looks into an
> > > existing PDF file's text and pulls out a certain value identifying
> > > the file (the identifying text is a numeric value immediately
> > > following the word "Identifier" in the 2nd or 3rd line of the PDF
> > > file). Based on the identifier, it moves the file to a certain
> > > location and makes several database entries (this is the easy
> > > part). I need to do this with 20,000 scanned documents saved as
> > > PDF.
> > >
> > > I've found a few tools advertising PDF text extraction, but would
> > > like some opinions from anybody actually performing a similar type
> > > of action.
> > >
> > > Any suggestions or recommendations would be greatly appreciated.
> > >
> > > Thanks,
> > > TRS
> > >
> > >
> > >
> >
> >
> > _______________________________________________
> > ajug-members mailing list
> > ajug-members at ajug.org
> > http://www.ajug.org/mailman/listinfo/ajug-members