Gourav_King_of_DataLand
Contributor II

Iterative Data extraction (Pagination and Polling) from REST API using tRestClient

Hi All,

Need your expert help.

 

The requirement is to pull all chat data from a REST API (a one-time full data dump) and then pull chats on a daily basis. The output is spread across 180K pages, with each page giving the URL of the next and previous pages (except the first page, which has only 'next_url', and the last page, which has only 'prev_url').

 

So far I have been able to use the API/URL to extract information from the first page:

 

tRestClient -> tJavaRow -> tJsonExtract -> tOracleOut

 

How do I modify the job to:

1) Pull all data for the one-time data dump (180K pages)

2) Pull data on a daily basis for the current day, i.e. extract data until the timestamp reaches the current day.

 

Example output from API

 

Page 1 gives
{
    "chats": [
        ... all chat-related attributes that need to be imported ...
    ],
    "count": 179451,
    "next_url": "next_url_here"
}


Page 2 gives

{
    "chats": [
        ... all chat-related attributes that need to be imported ...
    ],
    "count": 179451,
    "prev_url": "previous_url_here",
    "next_url": "next_url_here"
}

Pages 3 onwards follow the same pattern, each giving the URL of the next page.
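
In other words, what I need amounts to a loop like the following. This is only a rough standalone Java sketch to illustrate walking the next_url chain: the endpoint is a placeholder and the regex parsing is deliberately crude; the actual job uses the tRestClient flow above.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChatPager {

    // Crude extraction of "next_url" from the page body; the real job would use
    // tExtractJSONFields or a JSON library instead of a regex.
    private static final Pattern NEXT_URL =
            Pattern.compile("\"next_url\"\\s*:\\s*\"([^\"]+)\"");

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String url = "https://example.com/api/v2/chats"; // placeholder starting URL

        while (url != null) {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

            // ... extract the "chats" array here and load it into Oracle ...

            // The last page has no next_url, which ends the loop.
            Matcher m = NEXT_URL.matcher(body);
            url = m.find() ? m.group(1) : null;
        }
    }
}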

 

 

79 Replies
Parikhharshal
Creator III

@rhall: Thanks a lot for your reply. In the header at the source, two things are mentioned which restrict the number of record entries per page.

 

So to get the maximum records per page I have to add per_page=100 to the URL, and there will be a maximum number of pages I have to move through to get all the data entries (in the example it is 2 pages).

 

On top of this, an attribute like course_id also forms part of the dynamic URL, which I have to deal with too.

 

https://abc.com/api/v1/courses/<course_id>/analytics/student_summaries?per_page=100&page=2

 

How do I achieve this? Or will whatever logic handles the pagination URL take care of this too? If so, how do I add the course_id to the dynamic URL?

Anonymous
Not applicable

I see. It sounds to me like you only have to worry about the Page parameter then. Your base URL will be this....

https://abc.com/api/v1/courses/<course_id>/analytics/student_summaries?per_page=100&page=

....and you just need to add 1 to whatever the previous page number was and append it. That is even simpler than having to check the JSON for the next URL....although it might lead to an error if you go beyond the max number of pages.
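
As a rough sketch of that URL construction (the course_id value and the three-page run below are placeholders, just to show the page number being incremented and appended):

public class PageUrlBuilder {

    // Build the paged URL: fixed base + per_page + an incrementing page number.
    static String buildUrl(String courseId, int page) {
        return "https://abc.com/api/v1/courses/" + courseId
             + "/analytics/student_summaries?per_page=100&page=" + page;
    }

    public static void main(String[] args) {
        for (int page = 1; page <= 3; page++) {               // placeholder range
            System.out.println(buildUrl("course_id1", page)); // placeholder course id
        }
    }
}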

 

 

 

 
 
Parikhharshal
Creator III

@rhall: Thanks for your reply. Yes, you are correct, but I am not sure whether the number of entries per page might change too, so that might need to be parameterised at some point as well.

 

Having said that, along with the page attributes, how do I dynamically feed the course_id attribute into the URL and loop over them? I am unable to see how to do this from the existing design, or do I need to change it completely?

Anonymous
Not applicable

First of all, don't worry about the number of entries per page unless it is a requirement. If it becomes a requirement, deal with it then. 

For the course_id, it depends on how you get hold of the list of course_ids. Does this also require a service call? If so, you should get this data first, store it in a tHash component, and then use that (maybe with a tFlowToIterate) to iterate over each course_id before your tLoop. So the flow would be....

 

Subjob 1

1) Identify all of the course_ids and add them to a tHashOutput

Subjob 2
1) Use the tHashOutput data with a tHashInput linked to a tFlowToIterate

2) Link the tFlowToIterate component to your tLoop

3) Carry on in the same way as you are, but use the globalMap variable created by the tFlowToIterate to pass the course_id to your URL

 

In case you are not aware, if you have a tHashInput with a column called "myData" linked to a tFlowToIterate using a row called "row1", the globalMap variable created will have a key of "row1.myData".
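
For example, a sketch of what a tJava placed after the tFlowToIterate could contain (globalMap is supplied by Talend at runtime, so this only runs inside the job; "row1" and "myData" are the names from the example above, assumed here to carry the course_id):

// Inside a tJava on the iterate link.
String courseId = (String) globalMap.get("row1.myData");

// Store the base URL so the tLoop / tRestClient step only has to append the page number.
String baseUrl = "https://abc.com/api/v1/courses/" + courseId
               + "/analytics/student_summaries?per_page=100&page=";
globalMap.put("baseUrl", baseUrl);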

Parikhharshal
Creator III

@rhall: Thanks a lot for your reply.

 

The course_id is not provided by a service call, so I should be good. I can even read the IDs from SQL and feed them in. So I am planning to use tRedshiftInput -> tFlowToIterate -> tJava (read the globalMap variable so the course_id can be passed to the URL) -> tLoop and then the usual logic.

 

However, I have one question about what you mentioned. You did not mention the page number. Will I have to take care of that too, or is that already taken care of by the next_url logic? If so, what about the tExtractJSONFields mapping where I have mapped next_url? Does it stay as it is?

 

[screenshot: tExtractJSONFields mapping showing the next_url field]

 

And do I add the other attributes in addition to next_url here?

Anonymous
Not applicable

Which page variable are you talking about, the page or per_page variable? If it is the per_page variable, just hard code it to 100 for now. It sounds like it is just used to set a max number of records. 100 is fine. If you are talking about the page variable, just start at 1 and keep incrementing until your service runs out of records. I expect there to be something in the JSON to identify the last page.
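
For example, if the JSON returns a total record count (like the "count" field in the chats example earlier in this thread), you can derive the number of pages up front; otherwise you can stop once a page comes back with no records. A rough sketch of both checks, with placeholder numbers:

public class LastPageCheck {
    public static void main(String[] args) {
        int totalCount = 179451; // e.g. the "count" field returned by the service
        int perPage = 100;

        // Option 1: derive the number of pages by rounding up.
        int totalPages = (totalCount + perPage - 1) / perPage;
        System.out.println("Pages to fetch: " + totalPages); // 1795

        // Option 2: stop when a page returns no records.
        int recordsOnPage = 0; // would come from parsing the current response
        boolean pastLastPage = recordsOnPage == 0;
        System.out.println("Past last page: " + pastLastPage);
    }
}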

Parikhharshal
Creator III

@rhall: OK, cool. So is it safe to say that in this case I can replace the next_url logic with page-number logic?

Anonymous
Not applicable

I suspect so. But you need to be able to create a new URL after each iteration AND know when to stop creating URLs for the course_id. For example, your flow might go like below (pay attention to the course_id and page numbers)....

 

 

https://abc.com/api/v1/courses/course_id1/analytics/student_summaries?per_page=100&page=1
https://abc.com/api/v1/courses/course_id1/analytics/student_summaries?per_page=100&page=2
https://abc.com/api/v1/courses/course_id1/analytics/student_summaries?per_page=100&page=3
https://abc.com/api/v1/courses/course_id2/analytics/student_summaries?per_page=100&page=1
https://abc.com/api/v1/courses/course_id3/analytics/student_summaries?per_page=100&page=1
https://abc.com/api/v1/courses/course_id3/analytics/student_summaries?per_page=100&page=2
https://abc.com/api/v1/courses/course_id3/analytics/student_summaries?per_page=100&page=3
https://abc.com/api/v1/courses/course_id3/analytics/student_summaries?per_page=100&page=4

.....due to this, you will need a clean way of identifying when the max number of pages has been hit for each course_id. This will have to be part of your processing.
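
A rough sketch of that nested iteration and stop condition (the course ids and the fetchPage stub are placeholders; in the job the outer loop corresponds to the tFlowToIterate and the inner one to the tLoop):

import java.util.List;

public class CourseUrlGenerator {

    // Placeholder for the real call: fetch one page and report how many records
    // it returned (in the job this is the tRestClient + tExtractJSONFields step).
    static int fetchPage(String url) {
        System.out.println("GET " + url);
        return 0; // pretend every course returns no data in this sketch
    }

    public static void main(String[] args) {
        List<String> courseIds = List.of("course_id1", "course_id2", "course_id3"); // placeholders
        int perPage = 100;

        for (String courseId : courseIds) {   // outer iteration: one pass per course_id
            int page = 1;
            while (true) {                    // inner iteration: pages for this course
                String url = "https://abc.com/api/v1/courses/" + courseId
                           + "/analytics/student_summaries?per_page=" + perPage
                           + "&page=" + page;
                int records = fetchPage(url);
                if (records < perPage) {
                    break;                    // fewer than per_page records => last page
                }
                page++;
            }
        }
    }
}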

Parikhharshal
Creator III

@rhall: Is there a straightforward way of identifying the max number of pages, by any chance?

Anonymous
Not applicable

I'm sure there is, but I don't have access to the web service or details about it to identify that. I suspect that something will be returned in the JSON identifying either the max number of pages or whether it is the last page.