
How to extract a CSV file from a public URL/website using Talend DI?
Hi All,
We have a requirement, the first of its kind in my Talend career, to extract CSV files from a government public portal: https://vaers.hhs.gov/data/datasets.html? (This website is open to the public and doesn't require authentication. We can access it directly and download the year-wise CSVs using Talend DI.) The requirement is to download the 3 different CSVs below, which are available on their website:
- CSV File (VAERS Data)
- CSV File (VAERS Symptoms)
- CSV File (VAERS Vaccine)
When we download a data file manually, the site asks for a captcha, and once the captcha is provided, the file is downloaded to the local machine. Given that, when we automate fetching the yearly files with Talend, how do we handle the captcha step? If the solution is to use REST components, can we POST a "Year" parameter as the HTTP body to the website and download the year-wise CSV files?
I tried tFileFetch and thttpparse, but they don't help much with reading the CSV file and parsing the data in Talend DI. With tFileFetch, only the HTML page source of the website ends up in the generated output CSV or Excel file.
I appended /eSubDownload/index.jsp?fn=2020VAERSData.csv after the URL https://vaers.hhs.gov/data/datasets.html? in the URI because, when I explored the page source of the website (shown below), I found this link for downloading the 2020VAERSData CSV.
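A minimal sketch of how such a link resolves, written in Python purely for illustration (the same string logic could live in a Talend context variable or expression). It assumes the link in the page source is exactly `/eSubDownload/index.jsp?fn=2020VAERSData.csv` and that other years follow the same `<year>VAERSData.csv` naming, which this thread does not confirm; `download_url` is a hypothetical helper name. Because the link starts with `/`, it resolves against the site root rather than being appended after `datasets.html?`.

```python
from urllib.parse import urljoin

# Page on which the download link was found (taken from the post).
PAGE_URL = "https://vaers.hhs.gov/data/datasets.html?"


def download_url(year: int, page_url: str = PAGE_URL) -> str:
    """Resolve the site-relative download link for one year's VAERSData file.

    Assumption: every year uses the <year>VAERSData.csv naming seen for 2020.
    """
    relative_link = f"/eSubDownload/index.jsp?fn={year}VAERSData.csv"
    # A link starting with "/" replaces the whole path of the page URL.
    return urljoin(page_url, relative_link)


if __name__ == "__main__":
    for year in (2019, 2020):
        print(download_url(year))
```

Whether the server answers this URL with the CSV or with the captcha page is a separate question; the sketch only shows which address the link points to.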
I may be wrong but trying to explore the possibilities.
CSV Output when using tfilefetch:
Can someone explain how to do this using Talend DI and then move the data to Azure Data Lake?
Please find attached screenshot of the CSV which I am referring from the government public portal.
Please assist
Regards
Sasidharan

@Sasidharan Udayakumar , you need to download the file from the URL to a local directory as shown below.
Check the screenshots below.
Thanks,
Manohar

@Sasidharan Udayakumar ,

Hi @Manohar B ,
Thanks for your help and support in this issue.
Could you also please share a screenshot of the output data file that tFileFetch generated in the destination directory? Also, why did you select the POST method in the tFileFetch properties? Since we are trying to GET the file, why use POST here? Apologies if this is a layman's question.
Thanks
Sasidharan

@Sasidharan Udayakumar , please find the attached downloaded file.

@Sasidharan Udayakumar , regarding POST: checking it is not necessary. You can uncheck it; the job is still able to download the file.
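To illustrate the GET versus POST point, here is a rough sketch in Python (not Talend) that sends nothing over the network. It assumes the download link resolves to the URL below, and the `year=2020` body is a made-up example rather than a parameter the site is known to accept. The file name already travels in the query string, so a plain GET carries it; a request only becomes a POST once a body is attached.

```python
from urllib.parse import parse_qs, urlsplit
from urllib.request import Request

# Assumed form of the download URL; no request is actually sent here.
URL = "https://vaers.hhs.gov/eSubDownload/index.jsp?fn=2020VAERSData.csv"


def query_params(url: str) -> dict:
    """Return the query-string parameters of a URL as a dict of lists."""
    return parse_qs(urlsplit(url).query)


# Without a body, urllib treats the request as GET.
get_request = Request(URL)
# Attaching a body (hypothetical "year" field) turns it into a POST.
post_request = Request(URL, data=b"year=2020")

if __name__ == "__main__":
    print(query_params(URL))
    print(get_request.get_method(), post_request.get_method())
```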

Hi @Manohar B,
When I checked the CSV file you attached, I did not see the required data in it. It contains only the page source.
Regards
Sasidharan

Hi @Sasidharan Udayakumar , manodwhb,
If you look at the HTML in the generated CSV files, it is the page where you enter the captcha for verification. This captcha is in image format, as shown below. If we can somehow capture this captcha in a readable format, we can use it for further processing.
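Since the fetched file can silently turn out to be this captcha page rather than data, a job could check the payload before passing it downstream. The following is a rough heuristic sketched in Python for illustration only; `looks_like_html` is a hypothetical helper, the sample payloads are made up, and the check does nothing about the captcha itself.

```python
def looks_like_html(payload: bytes) -> bool:
    """Heuristic: True if the payload starts like an HTML page, not a CSV."""
    # Drop a UTF-8 byte-order mark and leading whitespace, then inspect the start.
    head = payload[:512].lstrip(b"\xef\xbb\xbf \t\r\n").lower()
    return head.startswith((b"<!doctype html", b"<html")) or b"<head" in head


if __name__ == "__main__":
    # Made-up sample payloads standing in for a fetched file.
    captcha_page = b"<!DOCTYPE html>\n<html><head><title>Verify</title></head></html>"
    csv_data = b"id,value\n1,a\n2,b\n"
    for name, payload in (("captcha_page", captcha_page), ("csv_data", csv_data)):
        verdict = "HTML, not the CSV" if looks_like_html(payload) else "plausibly CSV"
        print(f"{name}: {verdict}")
```

A check like this lets the job fail early instead of loading page source into the target as if it were data.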
