Web Scraping (Newbie)

Anonymous — Sat, 16 Nov 2024 13:10:38 GMT

Hi
There is a web site that I use regularly that presents a table based on search criteria. I know how to structure the URI to return the page with the table of data on it. The web site, however, requires that I log in first.
To automate this I am trying to use the tFileFetch component. I have set the protocol to "http", put in the URI (that I know works as I've tested it in a browser), set the Destination directory and filename, un-selected the POST Method and Die on error check boxes. I have then set the Need authentication box to checked and entered my username/pwd combination (confirmed that I've entered them correctly).
The saved output from this is a file with "<h1>Incorrect access</h1> You are not logged in." - a total of 48 bytes.
I have tried this in 4.1.1 and now in 4.2 and I get the same results. In 4.2 I also tried using the tHttpRequest component to access the web site's login form first and then running the tFileFetch, but that failed as well.
I'm stuck! I watched the Web Scraping webinar this afternoon and it all looked so easy 😞
The normal sequence I go through is to go to the web site's home page, click on the "Log In" link, log in, then go to the search page, search and then I get my table. Any ideas on how to automate this with TOS would be gratefully received.
TIA
Stephen

Re: Web Scraping (Newbie)

Anonymous — Fri, 27 Mar 2015 01:16:37 GMT

This is maybe a bit of a late response, but I have a tutorial on this here.
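
For anyone landing on this thread: the "Incorrect access ... You are not logged in" page suggests the site uses a form-based login that sets a session cookie. Sending credentials at the HTTP level, which is what the "Need authentication" option appears to do, would probably not satisfy that kind of login. Below is a minimal Python sketch of the sequence from the original post (log in, then request the search URI), done outside Talend with only the standard library. The URLs and the form field names (username, password) are assumptions for illustration; the real ones have to be read from the site's login form.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def make_session():
    """Return (opener, jar); the opener stores cookies and resends them."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def login_and_fetch(opener, login_url, form_fields, search_url):
    """POST the login form, then GET the search page with the same opener."""
    body = urllib.parse.urlencode(form_fields).encode("utf-8")
    # Passing data= makes urllib send a POST with a form-encoded body.
    with opener.open(login_url, data=body, timeout=30) as resp:
        resp.read()  # any session cookie set by the site is now in the jar
    # Same opener, so the cookie from the login response is sent here too.
    with opener.open(search_url, timeout=30) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


# Hypothetical usage (placeholder URLs and field names, not the real site):
#
#   opener, _ = make_session()
#   html = login_and_fetch(
#       opener,
#       "http://example.com/login",
#       {"username": "stephen", "password": "secret"},
#       "http://example.com/search?criteria=abc",
#   )
#   with open("table.html", "w", encoding="utf-8") as fh:
#       fh.write(html)
```

The point of the sketch is that the login request and the page fetch share one cookie store. Whatever tool is used, a fetch that does not carry the session cookie from the login step will most likely get the "not logged in" page again. Some sites also put hidden fields (for example a token) in the login form, in which case those have to be sent along too.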
