SanyaBLR
Contributor

Enrich data from the website based on the CSV-file (REST, CSV)

I would like to scrape the website http://cti.voa.gov.uk/cti/inits.asp to get the council tax band for every address that matches the first 4 characters of a postcode.

I need to create a job that gets data from the website based on the data in my CSV file.

The source file looks like below:

0695b00000PKONdAAP.png

1) For every postcode (row) in the file (postcode.csv), the first 4 characters of the postcode (not counting the space) should be entered in the search field on http://cti.voa.gov.uk/cti/inits.asp. For example, for the postcode "S3 7AY" I should enter "S3 7A", as in the picture below:

0695b00000PKOKkAAP.png

2) Then I need to write all the information from the search results to a CSV file (one file per postcode) with the structure "Address", "Council Tax band", "Improvement indicator", "Local authority reference number". But I have no idea how to loop over all the result pages and collect the data from each one.

0695b00000PKONnAAP.png

3) The file should be named {POSTCODE}_{DATE}.csv, e.g. S37AY_20220314.csv

4) ZIP all files into one archive.
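The file handling in steps 1, 3 and 4 can be sketched outside Talend in a few lines of Python. This is only an illustration under assumptions: the function names are hypothetical, the search-term rule (keep the first 4 non-space characters and any space between them) is inferred from the "S3 7AY" to "S3 7A" example, and the rows passed to `write_results` are assumed to be dictionaries keyed by the four column names from step 2.

```python
import csv
import zipfile
from datetime import date
from pathlib import Path

# Column names taken from step 2 of the question.
FIELDS = ["Address", "Council Tax band", "Improvement indicator",
          "Local authority reference number"]


def search_term(postcode: str, n: int = 4) -> str:
    """Keep the first n non-space characters, preserving spaces between them."""
    out, seen = [], 0
    for ch in postcode.strip():
        if seen == n:
            break
        out.append(ch)
        if not ch.isspace():
            seen += 1
    return "".join(out)


def output_name(postcode: str, day: date) -> str:
    """Build {POSTCODE}_{DATE}.csv with the space removed from the postcode."""
    return f"{postcode.replace(' ', '')}_{day:%Y%m%d}.csv"


def write_results(folder: Path, postcode: str, rows, day: date) -> Path:
    """Write one CSV file per postcode; rows are dicts keyed by FIELDS."""
    path = folder / output_name(postcode, day)
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS, quoting=csv.QUOTE_ALL)
        writer.writeheader()
        writer.writerows(rows)
    return path


def zip_results(folder: Path, archive: Path) -> None:
    """Add every CSV file in folder to a single ZIP archive."""
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for csv_path in sorted(folder.glob("*.csv")):
            zf.write(csv_path, arcname=csv_path.name)
```

In a Talend job the same roles would presumably be played by file output and archive components; the sketch only pins down the naming and truncation rules.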

Could you help me implement this? The most important question is the second step: how to loop over all the pages.

I suppose I need to use tFileInputDelimited and then tRESTClient with POST, but I don't know how to set that up, loop, and fetch the results.

3 Replies
Anonymous
Not applicable

Hi

Is there an API available for querying the information from the site with the postcode passed as a parameter? If so, try tREST or tHttpRequest to call the API. Also check how many results the API returns per call, and whether it has parameters such as limit and offset, which are usually used to loop over and return all pages. So for now, you need to find out more about the API.
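The limit/offset loop mentioned above usually looks like the following Python sketch. It is generic and hypothetical: no such API has been confirmed for this site, `fetch_page` stands in for whatever call returns one page of records, and the stop condition assumes a short page means there are no more results.

```python
from typing import Callable, Iterator, List


def fetch_all(fetch_page: Callable[[int, int], List[dict]],
              limit: int = 50) -> Iterator[dict]:
    """Yield every record by requesting pages until a short page is returned."""
    offset = 0
    while True:
        page = fetch_page(offset, limit)  # hypothetical call: one page of records
        yield from page
        if len(page) < limit:  # a short (or empty) page is taken as the last one
            break
        offset += limit
```

If the API reports a total count instead, the loop can stop once `offset` reaches that total rather than relying on a short page.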

 

Regards

Shong

SanyaBLR
Contributor
Author

Hi @Shicong Hong​ 

 

Unfortunately, I can't find an API.

 

I suppose that all data can be retrieved from the website using a simple POST with the appropriate parameters.

 

The form expects the first 4 characters of the postcode, and the results are paginated (20 or 50 per page), so the process will need to loop to fetch all addresses.

 

I inspected the request that the form sends. Maybe the following can help, but right now I'm a bit stuck.

 

curl 'http://cti.voa.gov.uk/cti/RefSResp.asp?lcn=0' \
  --data-raw 'lstPageSize=50&UARN=&txtDoeCode=0435&txtNameNum=+&txtStreet=&txtPostalDistrict=+&txtPDSpecific=&txtTown=&txtRefSPostCode=MK8+1&txtBillRef=+&txtStartKey=10&txtPageNum=2&txtBack=0&lstBand=+&lstCourtCode=+&lstBandStatus=+&lstPartDomestic=+&txtPageSize=50&txtLastStreetResp=&txtLastPDResp=&txtLastTownResp=&txtStreetSelected=+&lstBA=0435&txtPostCode=MK8+1&txtUpdateDate=04%2F08%2F2021&txtPF=0&txtPickedSubSt=&txtPickedStreet=&txtPickedTown=&blnBAChosen=&intNumFound=1836&intNumStreets=0&blnPaging=1&txtBAName=MILTON+KEYNES&txtBAWeb=http%3A%2F%2Fwww.mkweb.co.uk%2F&txtRedirectTo=InitS.asp' \
  --compressed \
  --insecure
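As a rough Python sketch of how this captured request could be parameterised per page: the field names are copied from the curl command, but which of them the server actually requires, and how `txtStartKey` relates to `txtPageNum`, is not known from the capture, so anything beyond the page number and postcode has to be supplied through the hypothetical `extra` argument. The function names are illustrative only.

```python
import math
from urllib.parse import urlencode
from urllib.request import Request, urlopen

URL = "http://cti.voa.gov.uk/cti/RefSResp.asp?lcn=0"  # taken from the curl command


def page_count(num_found: int, page_size: int) -> int:
    """Number of pages needed, e.g. intNumFound=1836 at 50 per page."""
    return math.ceil(num_found / page_size)


def build_body(prefix: str, page_num: int, page_size: int = 50,
               extra: dict = None) -> bytes:
    """URL-encode a form body; 'extra' carries fields copied from a capture."""
    fields = {
        "lstPageSize": page_size,
        "txtPageSize": page_size,
        "txtRefSPostCode": prefix,
        "txtPostCode": prefix,
        "txtPageNum": page_num,
        "blnPaging": 1,
    }
    # Assumption: other fields (txtStartKey, lstBA, intNumFound, ...) may be needed.
    fields.update(extra or {})
    return urlencode(fields).encode("ascii")


def fetch_page(prefix: str, page_num: int, page_size: int = 50,
               extra: dict = None) -> str:
    """POST one page request and return the raw HTML (not parsed here)."""
    req = Request(URL, data=build_body(prefix, page_num, page_size, extra),
                  headers={"Content-Type": "application/x-www-form-urlencoded"})
    with urlopen(req, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

With the values in the capture, `page_count(1836, 50)` gives 37, so under these assumptions the loop would request pages 1 to 37 for that postcode prefix.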

 

Kind regards,

Sanya

Anonymous
Not applicable

@Sania Oreshkevich​ , first make sure you can retrieve data from the website using a Talend component: please test sending a POST request with tHttpRequest, or use tSystem to execute a curl command. Can you confirm that this step works?

Next, we will see how to loop over each postcode and retrieve the data from all pages.
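The overall shape of that loop, an outer iteration over postcodes and an inner iteration over result pages, can be sketched in Python as follows. This is not the Talend job itself: the column header `postcode` is an assumption (the source file is only shown as a screenshot), and `fetch_pages` is a hypothetical callable that yields one list of rows per result page.

```python
import csv
from typing import Callable, Dict, Iterable, List, TextIO


def read_postcodes(fh: TextIO, column: str = "postcode") -> List[str]:
    """Read non-empty postcodes from a CSV with a header row (column name assumed)."""
    values = ((row.get(column) or "").strip() for row in csv.DictReader(fh))
    return [value for value in values if value]


def collect(postcodes: Iterable[str],
            fetch_pages: Callable[[str], Iterable[List[dict]]]
            ) -> Dict[str, List[dict]]:
    """Outer loop over postcodes, inner loop over that postcode's result pages."""
    results: Dict[str, List[dict]] = {}
    for postcode in postcodes:
        rows: List[dict] = []
        for page in fetch_pages(postcode):  # hypothetical: one list of rows per page
            rows.extend(page)
        results[postcode] = rows
    return results
```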

 

Regards

Shong