Re: [ale] Mining PDF's
Have you looked at Lucene
(http://jakarta.apache.org/lucene/docs/) and a PDF indexer
such as PDF Box (http://www.pdfbox.org/)?
I've never used PDF Box, but I messed around with Lucene to
index/search HTML and found it to be pretty easy to use.
chris
> -----Original Message-----
> From: Kevin O'Neill Stoll [mailto:kevinostoll@yahoo.com]
> Sent: Friday, December 06, 2002 5:41 PM
> To: ale@ale.org
> Cc: ajug-members@www.ajug.org
> Subject: Re: [ale] Mining PDF's
>
>
> It seems that my solution to this problem comes down to
> the age-old question of time or money.
>
>
> If I have lots of money, then there seem to be quite a few
> products available that would allow me to index and
> catalog an archive of PDFs, with some limitations for
> PDFs that were scanned in, since those contain images
> and not text.
>
> If I have lots of time, then my solution leans towards
> Linux with the use of xpdf: convert the PDFs to text, use
> a script to load that text into a database table, then
> build a search application to perform a full-text search
> on the table I just built.
>
>
> That's what I came up with. I'm open to any comments or
> critiques that anyone may have. This may not be the most
> elaborate solution, but it does meet all of the
> requirements that my supervisor asked for and is
> fairly savvy.
>
>
> thanks for the help.
>
> =====
> Kevin Stoll
> http://kevinstoll.org
>
> OpenSource Software...FREE!
> Angering Bill Gates...priceless.
>
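
For what it's worth, the "lots of time" route in the quoted message (xpdf to text, a script to load a table, then a search over it) might be sketched roughly as below. This is only an illustration under stated assumptions, not anything from the thread: it assumes xpdf's pdftotext is on the PATH, uses SQLite in place of whatever database you would actually pick, and the table name (docs) and all the function names are made up for the example. The search is a plain substring match, which is a crude stand-in for real full-text search.

```python
import sqlite3
import subprocess
from pathlib import Path


def extract_text(pdf_path):
    """Return a PDF's text layer via xpdf's pdftotext ("-" writes to stdout).

    Scanned, image-only PDFs will come back empty or nearly so.
    """
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", str(pdf_path), "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def open_index(db_path):
    """Open (or create) the database holding one row per document."""
    conn = sqlite3.connect(db_path)
    # Hypothetical schema: file path plus the extracted text.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs (path TEXT PRIMARY KEY, body TEXT)"
    )
    return conn


def add_document(conn, path, text):
    """Store or refresh the extracted text for one file."""
    conn.execute(
        "INSERT OR REPLACE INTO docs (path, body) VALUES (?, ?)",
        (str(path), text),
    )
    conn.commit()


def search(conn, term):
    """Return paths whose text contains term (naive substring match).

    LIKE wildcards (% and _) in term are not escaped in this sketch.
    """
    rows = conn.execute(
        "SELECT path FROM docs WHERE body LIKE ? ORDER BY path",
        ("%" + term + "%",),
    )
    return [row[0] for row in rows]


def index_directory(conn, directory):
    """Run every *.pdf in a directory through pdftotext and load it."""
    for pdf in sorted(Path(directory).glob("*.pdf")):
        add_document(conn, pdf, extract_text(pdf))
```

Usage would be something like index_directory(open_index("pdfs.db"), "/path/to/archive") followed by search(conn, "some phrase"). Keeping the extraction step separate from the loading step means the table and search parts can be tried without any PDFs at hand, and the substring search could later be swapped for the database's own full-text feature or for Lucene.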