Gourav_King_of_DataLand
Contributor II

Iterative Data extraction (Pagination and Polling) from REST API using tRestClient

Hi All,

Need your expert help.

 

The requirement is to pull all chat data from a REST API (a one-time full data dump) and then pull chats on a daily basis. The output is spread across 180K pages, with each page giving the URL of the next and previous pages (except the first page, which has only 'next_url', and the last page, which has only 'prev_url').

 

So far I have been able to use the API/URL to extract information from the first page:

 

tRestClient -> tJavaRow -> tJsonExtract -> tOracleOut

 

How do I modify the job to:

1) Pull all data for the one-time data dump (180K pages)

2) Pull data on a daily basis for the current day, i.e. extract data until the timestamp reaches the current day.

 

Example output from API

 

Page 1 gives
{
    "chats": [
        ... all chat-related attributes that need to be imported ...
    ],
    "count": 179451,
    "next_url": "next_url_here"
}


Page 2 gives

{
    "chats": [
        ... all chat-related attributes that need to be imported ...
    ],
    "count": 179451,
    "prev_url": "previous_url_here",
    "next_url": "next_url_here"
}

Pages 3 onwards follow the same pattern, each giving the URL of the next page.
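
In other words, what I need amounts to a loop like the following. This is only a rough standalone Java sketch to illustrate walking the next_url chain: the endpoint is a placeholder and the regex parsing is deliberately crude; the actual job uses the tRestClient flow above.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChatPager {

    // Crude extraction of "next_url" from the page body; the real job would use
    // tExtractJSONFields or a JSON library instead of a regex.
    private static final Pattern NEXT_URL =
            Pattern.compile("\"next_url\"\\s*:\\s*\"([^\"]+)\"");

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String url = "https://example.com/api/v2/chats"; // placeholder starting URL

        while (url != null) {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

            // ... extract the "chats" array here and load it into Oracle ...

            // The last page has no next_url, which ends the loop.
            Matcher m = NEXT_URL.matcher(body);
            url = m.find() ? m.group(1) : null;
        }
    }
}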

 

 

79 Replies
Parikhharshal
Creator III

@rhall: Thanks a lot for your reply. In the header at the source, two things are mentioned which restrict the number of record entries per page.

 

So to get the maximum records per page I have to add per_page=100 to the URL, and there will be a maximum number of pages I have to move through to get all the data entries (in the example it is 2 pages).

 

On top of this, an attribute like course_id also forms part of the dynamic URL, which I have to deal with too.

 

https://abc.com/api/v1/courses/<course_id>/analytics/student_summaries?per_page=100&page=2

 

How do I achieve this? Or will whatever logic handles the pagination URL take care of this too? If so, how do I add the course_id to the dynamic URL?

Anonymous
Not applicable

I see. It sounds to me like you only have to worry about the Page parameter then. Your base URL will be this....

https://abc.com/api/v1/courses/<course_id>/analytics/student_summaries?per_page=100&page=

....and you just need to add 1 to whatever the previous page number was and append it. That is even simpler than having to check the JSON for the next URL....although it might lead to an error if you go beyond the max number of pages.
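
As a rough sketch of that URL construction (the course_id value and the three-page run below are placeholders, just to show the page number being incremented and appended):

public class PageUrlBuilder {

    // Build the paged URL: fixed base + per_page + an incrementing page number.
    static String buildUrl(String courseId, int page) {
        return "https://abc.com/api/v1/courses/" + courseId
             + "/analytics/student_summaries?per_page=100&page=" + page;
    }

    public static void main(String[] args) {
        for (int page = 1; page <= 3; page++) {               // placeholder range
            System.out.println(buildUrl("course_id1", page)); // placeholder course id
        }
    }
}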

 

 

 

 
 
Parikhharshal
Creator III

@rhall: Thanks for your reply. Yes, you are correct, but I am not sure whether the number of entries per page might change too, so that might need to be parameterised at some point as well.

 

Having said that, along with the page attributes, how do I dynamically feed the course_id attribute into the URL and loop over them? I am unable to see how to do this from the existing design, or do I need to change it completely?

Anonymous
Not applicable

First of all, don't worry about the number of entries per page unless it is a requirement. If it becomes a requirement, deal with it then. 

For the course_id, it depends on how you get hold of the list of course_ids. Does this also require a service call? If so, you should get this data first, store it in a tHash component, and then use that (maybe with a tFlowToIterate) to iterate over each course_id before your tLoop. So the flow would be....

 

Subjob 1

1) Identify all of the course_ids and add them to a tHashOutput

Subjob 2
1) Use the tHashOutput data with a tHashInput linked to a tFlowToIterate

2) Link the tFlowToIterate component to your tLoop

3) Carry on in the same way as you are, but use the globalMap variable created by the tFlowToIterate to pass the course_id to your URL

 

In case you are not aware, if you have a tHashInput with a column called "myData" linked to a tFlowToIterate using a row called "row1", the globalMap variable created will have a key of "row1.myData".
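
For example, a sketch of what a tJava placed after the tFlowToIterate could contain (globalMap is supplied by Talend at runtime, so this only runs inside the job; "row1" and "myData" are the names from the example above, assumed here to carry the course_id):

// Inside a tJava on the iterate link.
String courseId = (String) globalMap.get("row1.myData");

// Store the base URL so the tLoop / tRestClient step only has to append the page number.
String baseUrl = "https://abc.com/api/v1/courses/" + courseId
               + "/analytics/student_summaries?per_page=100&page=";
globalMap.put("baseUrl", baseUrl);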

Parikhharshal
Creator III

@rhall: Thanks a lot for your reply.

 

The course_id is not provided by a service call, so I should be good. I can even read the IDs from SQL and feed them in. So I am planning to use tRedshiftInput -> tFlowToIterate -> tJava (read the globalMap variable so the course_id can be passed to the URL) -> tLoop and then the usual logic.

 

However, I have one question about what you mentioned. You did not mention the page number. Will I have to take care of that too, or is that already taken care of by the next_url logic? If so, what about the tExtractJSONFields mapping where I have mapped next_url? Does it stay as it is?

 

[screenshot: tExtractJSONFields mapping showing the next_url field]

 

And do I add the other attributes in addition to next_url here?

Anonymous
Not applicable

Which page variable are you talking about, the page or per_page variable? If it is the per_page variable, just hard code it to 100 for now. It sounds like it is just used to set a max number of records. 100 is fine. If you are talking about the page variable, just start at 1 and keep incrementing until your service runs out of records. I expect there to be something in the JSON to identify the last page.
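
For example, if the JSON returns a total record count (like the "count" field in the chats example earlier in this thread), you can derive the number of pages up front; otherwise you can stop once a page comes back with no records. A rough sketch of both checks, with placeholder numbers:

public class LastPageCheck {
    public static void main(String[] args) {
        int totalCount = 179451; // e.g. the "count" field returned by the service
        int perPage = 100;

        // Option 1: derive the number of pages by rounding up.
        int totalPages = (totalCount + perPage - 1) / perPage;
        System.out.println("Pages to fetch: " + totalPages); // 1795

        // Option 2: stop when a page returns no records.
        int recordsOnPage = 0; // would come from parsing the current response
        boolean pastLastPage = recordsOnPage == 0;
        System.out.println("Past last page: " + pastLastPage);
    }
}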

Parikhharshal
Creator III

@rhall: OK, cool. So is it safe to say that in this case I can replace the next_url logic with page-number logic?

Anonymous
Not applicable

I suspect so. But you need to be able to create a new URL after each iteration AND know when to stop creating URLs for the course_id. For example, your flow might go like below (pay attention to the course_id and page numbers)....

 

 

https://abc.com/api/v1/courses/course_id1/analytics/student_summaries?per_page=100&page=1
https://abc.com/api/v1/courses/course_id1/analytics/student_summaries?per_page=100&page=2
https://abc.com/api/v1/courses/course_id1/analytics/student_summaries?per_page=100&page=3
https://abc.com/api/v1/courses/course_id2/analytics/student_summaries?per_page=100&page=1
https://abc.com/api/v1/courses/course_id3/analytics/student_summaries?per_page=100&page=1
https://abc.com/api/v1/courses/course_id3/analytics/student_summaries?per_page=100&page=2
https://abc.com/api/v1/courses/course_id3/analytics/student_summaries?per_page=100&page=3
https://abc.com/api/v1/courses/course_id3/analytics/student_summaries?per_page=100&page=4

.....due to this, you will need a clean way of identifying when the max number of pages has been hit for each course_id. This will have to be part of your processing.
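
A rough sketch of that nested iteration and stop condition (the course ids and the fetchPage stub are placeholders; in the job the outer loop corresponds to the tFlowToIterate and the inner one to the tLoop):

import java.util.List;

public class CourseUrlGenerator {

    // Placeholder for the real call: fetch one page and report how many records
    // it returned (in the job this is the tRestClient + tExtractJSONFields step).
    static int fetchPage(String url) {
        System.out.println("GET " + url);
        return 0; // pretend every course returns no data in this sketch
    }

    public static void main(String[] args) {
        List<String> courseIds = List.of("course_id1", "course_id2", "course_id3"); // placeholders
        int perPage = 100;

        for (String courseId : courseIds) {   // outer iteration: one pass per course_id
            int page = 1;
            while (true) {                    // inner iteration: pages for this course
                String url = "https://abc.com/api/v1/courses/" + courseId
                           + "/analytics/student_summaries?per_page=" + perPage
                           + "&page=" + page;
                int records = fetchPage(url);
                if (records < perPage) {
                    break;                    // fewer than per_page records => last page
                }
                page++;
            }
        }
    }
}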

Parikhharshal
Creator III

@rhall: Is there a straightforward way of identifying the max number of pages, by any chance?

Anonymous
Not applicable

I'm sure there is, but I don't have access to the web service or details about it to identify that. I suspect that something will be returned in the JSON identifying either the max number of pages or whether it is the last page.