Re: [ale] Mining PDF's



Have you looked at Lucene
(http://jakarta.apache.org/lucene/docs/) and a PDF indexer
such as PDF Box (http://www.pdfbox.org/)?

I've never used PDF Box, but I messed around with Lucene to
index/search HTML and found it to be pretty easy to use.
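
For comparison, the xpdf-and-database route described in the quoted
message below might look roughly like the following Python sketch. It
is only an illustration under several assumptions: that xpdf's
pdftotext command is on the PATH, that the local SQLite build includes
the FTS5 full-text extension, and that a single table is enough. The
table name "docs" and all function names are made up for this example.

```python
import sqlite3
import subprocess
from pathlib import Path


def pdf_to_text(pdf_path):
    """Extract text with xpdf's pdftotext; '-' sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def open_index(db_path=":memory:"):
    """Open (or create) the full-text table. Assumes SQLite has FTS5."""
    conn = sqlite3.connect(db_path)
    # 'path' is stored but not searched; only 'body' is indexed.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS docs "
        "USING fts5(path UNINDEXED, body)"
    )
    return conn


def add_document(conn, path, body):
    """Load one document's extracted text into the table."""
    conn.execute(
        "INSERT INTO docs (path, body) VALUES (?, ?)", (str(path), body)
    )
    conn.commit()


def index_directory(conn, directory):
    """Convert every *.pdf in a directory to text and load it."""
    for pdf in sorted(Path(directory).glob("*.pdf")):
        add_document(conn, pdf, pdf_to_text(pdf))


def search(conn, query):
    """Return paths of documents whose text matches the query."""
    rows = conn.execute(
        "SELECT path FROM docs WHERE docs MATCH ? ORDER BY rank", (query,)
    )
    return [row[0] for row in rows]
```

Calling index_directory(conn, "archive") and then search(conn, "invoice")
would be the typical use, where "archive" is a hypothetical directory
name. Scanned, image-only PDFs would yield little or no text from
pdftotext, so they would not be findable this way without a separate
OCR step.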

chris

> -----Original Message-----
> From: Kevin O'Neill Stoll [mailto:kevinostoll@yahoo.com]
> Sent: Friday, December 06, 2002 5:41 PM
> To: ale@ale.org
> Cc: ajug-members@www.ajug.org
> Subject: Re: [ale] Mining PDF's
> 
> 
> It seems that my solution to this problem comes down to the
> age-old question of time or money.
> 
> 
> If I have lots of money, then there seem to be quite a few
> products available that would allow me to index and
> catalog an archive of PDFs, with some limitations for
> PDFs that were scanned in, so that they are
> images and not text.
> 
> If I have lots of time, then my solution leans towards
> Linux with the use of xpdf: convert the PDFs to text,
> use a script to load that text into a
> database table, then build a search application to perform
> a full-text search on that table.
> 
> 
> That's what I came up with. I'm open to any comments or
> critiques that anyone may have. This may not be the most
> elaborate solution, but it does meet all of the
> requirements that my supervisor asked for and is
> fairly savvy.
> 
> 
> Thanks for the help.
> 
> =====
> Kevin Stoll
> http://kevinstoll.org
> 
> OpenSource Software...FREE!
> Angering Bill Gates...priceless.
> ============================================================