[ajug-members] html / website screen scraper API?
Dan Marchant
driedtoast at gmail.com
Mon Feb 19 14:46:09 EST 2007
Try using a combination of:
JTidy (for cleaning up the HTML and getting at the content) - http://jtidy.sourceforge.net/
HttpClient - to handle the protocol and connections.
Also, XMLUnit works better than HtmlUnit on HTML-based documents, for some reason.
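As a rough sketch of the "label next to value" case: once a page has been fetched and tidied into well-formed markup, the JDK's own XML parser and XPath can pull the pairs out of a table. This uses only standard JDK classes rather than the JTidy or HttpClient APIs. The class name, the helper name and the sample markup are made up for illustration, and it assumes the tidied markup has no DOCTYPE or XML namespace declaration.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TableScrapeSketch {

    // Hypothetical helper: collects (label, value) pairs from two-cell table rows.
    // Assumes the input is already well-formed, e.g. the output of an HTML tidier.
    static Map<String, String> extractPairs(String wellFormedMarkup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc =
                builder.parse(new InputSource(new StringReader(wellFormedMarkup)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Only rows with exactly two cells are treated as label/value rows.
            NodeList rows = (NodeList) xpath.evaluate(
                "//tr[count(td) = 2]", doc, XPathConstants.NODESET);

            // LinkedHashMap keeps the pairs in page order.
            Map<String, String> pairs = new LinkedHashMap<>();
            for (int i = 0; i < rows.getLength(); i++) {
                Node row = rows.item(i);
                // normalize-space() trims and collapses whitespace in the cell text.
                String label = xpath.evaluate("normalize-space(td[1])", row);
                String value = xpath.evaluate("normalize-space(td[2])", row);
                pairs.put(label, value);
            }
            return pairs;
        } catch (Exception e) {
            throw new IllegalArgumentException("markup could not be parsed as XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample page standing in for tidied output.
        String page = "<html><body><table>"
            + "<tr><td>Name:</td><td>Widget</td></tr>"
            + "<tr><td>Price:</td><td> 9.99 </td></tr>"
            + "</table></body></html>";
        System.out.println(extractPairs(page));
    }
}
```

Raw HTML from a real site will usually not parse this way until it has been cleaned up first, which is the part JTidy is for.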
Hope this helps.
- Dan
On 2/19/07, Curt Smith <csmith at javadepot.com> wrote:
> Greetings ajug'ers,
>
> I need to scrape info off a dozen different public websites and incoming
> email that's in html format. The info is typically a table or single
> values next to labels but it'll get more complex I'm sure. Some info
> will require logging in via custom login pages, cookies, etc.
>
> There are two SourceForge projects: HtmlUnit and HttpUnit. Both would be
> good for simple scraping of values and tables.
>
> Googling "scraping public websites" finds commercial APIs (links on
> the right side of the Google results page).
>
> Is there any experience or discussion on this technology or APIs?
>
> Thanks, Curt Smith