Skip to main content
Announcements
See what Drew Clarke has to say about the Qlik Talend Cloud launch! READ THE BLOG
cancel
Showing results for 
Search instead for 
Did you mean: 
Gourav_King_of_DataLand
Contributor II
Contributor II

Iterative Data extraction (Pagination and Polling) from REST API using tRestClient

Hi All,

Need you expert help.

 

The requirement is to pull all chat data from REST API (one time full data dump) and then pull chat on daily basis.The output is spread across 180K pages with each page giving URL to next and previous (except first page which have only 'nex_url' and last page with have only 'prev_url').

 

So Far I have been able to use the API/URL to extract information from first page first page

 

tRestClient ->tJavaRow->tJsonExtract->tOracleOut

 

How do I modify the job to

1) Pull all data for one time data dump, 180k pages

2) Pull data on daily basis for current day or extract data until the timestamp is current day.

 

Example output from API

 

Page 1 gives
{
    "chats": [

                  all chat related attributes that needs to imported

                 ],
    "count": 179451,
    "next_url": "next_url_here"

}


Page2 gives

{
    "chats": [

                all chat related attributes that needs to imported

                 ],
    "count": 179451
    "prev_url": "previous_url_here"
    "next_url": "next_url_here"

}

Page 3 gives ......next page 

 

 

Labels (5)
79 Replies
Gourav_King_of_DataLand
Contributor II
Contributor II
Author

@Parikhharshal

 

Few points

1) For "As mentioned if you could share screenshot of these values that would be fantastic" - The cope/screenshot I uploaded are the actual values that I am using 

https://community.talend.com/t5/Design-and-Development/Iterative-Data-extraction-Pagination-and-Poll...

 

2) Break down the job and try to execute standalone components to identify if they are working as expected. For example 

   A) Deactivate all components except tRestClient_1 and manually put the URL what you have set in global variable in  URL of tRestClient and check if the        component is successfully connecting to source.

   B) deactivate everything except tSetGlobalVar_1 and tRestClient_1 and see is this is working. This is one of the most important step as it will validate if the URL passed from tSetGlobalVar_1 is causing any issue to tRestClient_1

   and so on...

3) I see that you have put the code for assigning next URL in tJava_1. You should move it to tJavaFlex_1 

 

Breakdown the solution and run individual components or combination of components to identify what is working correctly and what's not.

 

Parikhharshal
Creator III
Creator III

@gr44: I think there is a disconnect there. What is Tjava component doing in your job? Is it just dummy one?

 

If that's the case then I might add the below code in tjavaflex as mentioned in your solution:

 

globalMap.put("VApiUrl",row4.NextUrl);

 

if  ((globalMap.get("VApiUrl"))==null)

    {

    globalMap.put("VLoop", false);

    };

Parikhharshal
Creator III
Creator III

@gr44: I did this change and I was able to run the job but it didn't return anything at all and just ran. Bit weird though.

 

0683p000009M0pg.png

 

 

Gourav_King_of_DataLand
Contributor II
Contributor II
Author

@Parikhharshal did you configure the tExtractJSONFields properly. You should have a key/attribute in your JSON output which contains the URL for next page. For me the key/attribute/property name "next_url" so I extracted that using tExtractJSONField and used it further. The solution will not work if fields are not properly configured in the tExtractJSONField

Parikhharshal
Creator III
Creator III

@gr44: I was about to reply on the same. Seems I am not configuring them properly?

 

This is what I have got in field mapping and not having the fields which are required for output. Do you manually add fields in mapping here?

 

0683p000009M0pl.png 

Gourav_King_of_DataLand
Contributor II
Contributor II
Author

 

The fields in the json are not directly available as column here. You have to use the json query to extract the required information from incoming JSON string.

1) Set the JSON Field to 'String'

2) Edit Schema and add the column_name that you want to extract. Column names can be whatever you want, you will be using these later 

3) Once you create columns in the edit schema, they will appear in the mapping pallet as shown below.

4) use proper json query for each column, json query will define what value the columns will hold.

 

0683p000009M0pq.png

 

 

 

 

 

 

 

 

 

 

 

 

if you are not familiar with JSON structure and JSON query I suggest you read 

https://goessner.net/articles/JsonPath/

 

I used this to understand structure of my json output

http://jsonpath.com/

 

in the very beginning I used postman app in google chrome to see the output of my REST call,  understood the output and identified the properties I was interested. Then created the json queries there and tested the same in Talend. 

Parikhharshal
Creator III
Creator III

@gr44: Thanks a lot for for your explanation. One last question hopefully.....In my case I have for many fields to be extracted from jason input. However, in this case I still need to keep next_url as a column right otherwise there is no way I can pass next_url. May be I did not understand well though you explained the use of next_url in your case.

Gourav_King_of_DataLand
Contributor II
Contributor II
Author

Yes, you need to extract the attribute from JSON output containing the URL for next page. In my json output the URL to go to next page was called "next_url" and I created the column with the same name.

Parikhharshal
Creator III
Creator III

@gr44: In my case, for going to next URL is going to be per_page entry ie 100 and no. of page=2 in this example and also student_id which I need to populate dynamically. https://abc.com/api/v1/courses/160/analytics/student_summaries?per_page=100&page=2

 

And apart from this other attributes not for URL but writing data to DB. Does it mean I will have to capture number of pages and student id as column?

 

What about the other columns which exist at source which needs to be written to DB?

Anonymous
Not applicable

I'm not sure I understand the parameters. What is the difference between "per_page" and "page"?

 

To write other attributes back to a database, you have several options. You can dump the entire JSON to a tHashOutput (set to append mode) and then read it all back and interrogate it from a tHashInput component linked to it. This will be a subsequent subjob and may be easier to understand for a future developer to inherit.

 

Alternatively, you could use the tExtractJSONFields to extract all of the data you wish to extract and write this straight to your DB or (similar to the above) write it to a tHashOutput and deal with loading it to your DB in a later subjob.