[ajug-members] html / website screen scraper API?

WILLIAM SIGGELKOW bsiggelkow at mac.com
Mon Feb 19 17:43:59 EST 2007


Use Hpricot with JRuby :)
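
If pulling in JRuby isn't an option, the simple table case Curt describes
below can be roughed out with nothing but the JDK. This is only a sketch,
not a recommendation over Hpricot or JTidy: it assumes the lenient HTML
parser that ships with Swing (javax.swing.text.html.parser.ParserDelegator)
copes with the pages in question, and the class and method names
(TableScraper, extractRows) are made up for illustration. It collects the
text of every <td>/<th> cell, grouped by <tr>.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: name and shape are illustrative only.
final class TableScraper {

    // Returns one list of cell strings per <tr> found in the markup.
    static List<List<String>> extractRows(String html) throws IOException {
        final List<List<String>> rows = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private List<String> row;    // row being collected, or null
            private StringBuilder cell;  // cell being collected, or null

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.TR) {
                    row = new ArrayList<>();
                } else if (tag == HTML.Tag.TD || tag == HTML.Tag.TH) {
                    cell = new StringBuilder();
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Only keep text that appears inside a table cell.
                if (cell != null) {
                    cell.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if ((tag == HTML.Tag.TD || tag == HTML.Tag.TH) && cell != null) {
                    if (row != null) {
                        row.add(cell.toString().trim());
                    }
                    cell = null;
                } else if (tag == HTML.Tag.TR && row != null) {
                    rows.add(row);
                    row = null;
                }
            }
        };

        // 'true' tells the parser to ignore any charset declared in the page.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return rows;
    }
}
```

Calling TableScraper.extractRows(pageSource) on a page with a two-column
table should give back one two-element list per row. Nested tables, logins
and cookies are deliberately left out; for those a real HTTP client plus a
proper parser is the better bet.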
 
On Monday, February 19, 2007, at 12:02PM, "Dan Marchant" <driedtoast at gmail.com> wrote:
>Try using a combination of:
>
>JTidy (for getting content) - http://jtidy.sourceforge.net/
>HttpClient - to handle the protocol and connections.
>
>Also, XMLUnit works better than HtmlUnit for some reason on HTML-based documents.
>
>Hope this helps.
>
>- Dan
>
>
>On 2/19/07, Curt Smith <csmith at javadepot.com> wrote:
>> Greetings ajug'ers,
>>
>> I need to scrape info off a dozen different public websites and incoming
>> email that's in HTML format. The info is typically a table or single
>> values next to labels, but it'll get more complex, I'm sure. Some info
>> will require logging in via custom login pages, cookies, etc.
>>
>> There are two SourceForge projects: HtmlUnit and HttpUnit. Both would be
>> good for simply scraping values and tables.
>>
>> Googling "scraping public websites" finds commercial APIs (links on
>> the right side of the Google results page).
>>
>> Does anyone have experience with this technology or these APIs?
>>
>> Thanks, Curt Smith
>>
>>
>>
>> _______________________________________________
>> ajug-members mailing list
>> ajug-members at ajug.org
>> http://www.ajug.org/mailman/listinfo/ajug-members
>>