Sasidharan_Udayakumar
Contributor

How to extract a CSV file from a public URL/website using Talend DI?

Hi All,

We have a requirement, a first in my Talend career, to extract CSV files from a government public portal (https://vaers.hhs.gov/data/datasets.html?). The website is open to the public and does not require authentication, so we can access it directly and download the year-wise CSV files. We want to do this using Talend DI.

The requirement is to download the following three CSV files available on the website:

  • CSV File (VAERS Data)
  • CSV File (VAERS Symptoms)
  • CSV File (VAERS Vaccine)

0693p000009TKf0AAG.png

0693p000009TKZCAA4.png

When we download a data file manually, the site asks for a captcha, and once the captcha is entered the file is downloaded to the local machine. When we automate fetching the yearly files with Talend, how do we handle the captcha step? If the solution is to use REST components, can we POST a "Year" parameter in the HTTP body to the website and download the year-wise CSV files?

0693p000009TKZWAA4.png

0693p000009TKZbAAO.png

I explored tFileFetch and thttpparse, but they did not help much with reading the CSV file and parsing the data in Talend DI. With tFileFetch, the generated CSV or Excel output contains only the HTML page source of the website.

0693p000009TKbSAAW.png

In the URI, I appended /eSubDownload/index.jsp?fn=2020VAERSData.csv to the URL https://vaers.hhs.gov/data/datasets.html? because, when I explored the page source of the website (shown below), I found this link for downloading the 2020VAERSData CSV.

I may be wrong, but I am trying to explore the possibilities.

0693p000009TKe2AAG.png
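On the URL construction described above: a link found in a page's source is normally resolved against the page URL rather than concatenated onto it. The sketch below compares the two using Python's standard library. It assumes the href appears in the page source exactly as quoted in the post (root-relative, starting with `/`). It does not contact the site, and it does not address the captcha step.

```python
from urllib.parse import urljoin, urlsplit, parse_qs

PAGE_URL = "https://vaers.hhs.gov/data/datasets.html?"
# Link as quoted in the post; assumed to be a root-relative href in the page source.
HREF = "/eSubDownload/index.jsp?fn=2020VAERSData.csv"


def resolve_link(page_url: str, href: str) -> str:
    """Resolve an href against the page URL, the way a browser would."""
    return urljoin(page_url, href)


# Plain string concatenation, as described for the tFileFetch URI.
naive = PAGE_URL + HREF
resolved = resolve_link(PAGE_URL, HREF)

print("concatenated path:", urlsplit(naive).path)
print("resolved URL:     ", resolved)
print("query parameters: ", parse_qs(urlsplit(resolved).query))
```

With concatenation, the request path is still `/data/datasets.html`, so the server would be expected to return the HTML page. This may be one reason the output contains page source; the captcha page is another possible cause.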

CSV Output when using tfilefetch:

0693p000009TKjHAAW.png
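The symptom above (a ".csv" output that actually contains page source) can be detected automatically before the file is passed downstream. The following is a minimal heuristic sketch in Python, not a Talend component. The function name `looks_like_html` and both sample payloads are hypothetical and exist only for illustration.

```python
def looks_like_html(payload: bytes) -> bool:
    """Heuristic: True if the first bytes of the payload look like an HTML document."""
    head = payload[:512].lstrip().lower()
    return head.startswith((b"<!doctype html", b"<html")) or b"<head" in head


# Hypothetical payloads for illustration only.
html_page = b"<!DOCTYPE html>\n<html><head><title>Download</title></head><body></body></html>"
csv_file = b"ID,DATE,VALUE\n1,2020-01-01,42\n"

for name, payload in (("html_page", html_page), ("csv_file", csv_file)):
    kind = "HTML page" if looks_like_html(payload) else "probably data"
    print(f"{name}: {kind}")
```

A check like this could make a job fail early instead of loading an HTML page into the target as if it were data.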

Can someone advise how to do this using Talend DI and then move the data to Azure Data Lake?

Please find attached a screenshot of the CSV files I am referring to on the government public portal.

Please assist.

Regards

Sasidharan

7 Replies
manodwhb
Champion II

@Sasidharan Udayakumar, you can download the file from the URL to a local directory as shown in the screenshots below.

Thanks,

Manohar

manodwhb
Champion II

@Sasidharan Udayakumar,

0693p000009TLqeAAG.png

0693p000009TM09AAG.png

0693p000009TM1lAAG.png

 

Sasidharan_Udayakumar
Contributor
Author

Hi @Manohar B,

Thanks for your help and support on this issue.

Could you also please share a screenshot of the output data file that tFileFetch generated in the destination directory? And why did you select the POST method in the tFileFetch properties? Since we are trying to GET the file, why POST here? Apologies if this is a layman's question.

Thanks

Sasidharan

manodwhb
Champion II

@Sasidharan Udayakumar, please find the downloaded file attached.

manodwhb
Champion II

@Sasidharan Udayakumar, regarding POST: checking it is not necessary. You can uncheck it, and the job is still able to download the file.
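For context on GET versus POST: a plain file download sends no request body, which makes it a GET, while attaching a body turns the request into a POST. The sketch below shows this with Python's `urllib.request.Request` without sending anything over the network. The URL and the `year=2020` body are hypothetical placeholders, not the portal's real interface.

```python
from urllib.request import Request

# Hypothetical URL, for illustration only.
URL = "https://example.com/files/report.csv"

# No body: the request method defaults to GET.
get_request = Request(URL)

# A body is attached: the request method defaults to POST.
# "year=2020" is a hypothetical form body echoing the "Year" parameter idea.
post_request = Request(URL, data=b"year=2020")

print(get_request.get_method())
print(post_request.get_method())
```

Whether a server accepts a POST with a year parameter depends entirely on that server, so this illustrates only the difference between the two methods.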

Sasidharan_Udayakumar
Contributor
Author

Hi @Manohar B,

When I checked the CSV file you attached, I did not see the required data in it. It contains only the page source.

0693p000009TMX8AAO.png

Regards

Sasidharan

vikramk
Creator II

Hi @Sasidharan Udayakumar, manodwhb,

If you look at the HTML in the generated CSV files, it is the page where you enter the captcha for verification. The captcha is an image, as shown below:

0693p000009oSktAAE.png

If we could somehow capture this captcha in a readable format, we could use it for further processing.