karandama2006
Creator

downloading JSON in structured format

Hi All,

I'm working with web services: I need to download a JSON response using HTTP GET requests, then read those JSON files and parse them.

This works out well for smaller files. The issue is that the JSON document I get back is downloaded as a single line, so I end up with 200 MB of data in one row.

A file like that is impossible to open in an editor, and when I try to parse it in Talend I get an out-of-memory exception.

Is there a workaround for this issue?

Does the JSON response from an HTTPS GET request always have to be on a single line? Can we break it up and apply some formatting before saving it to disk?

What is the best way to parse a large JSON file, greater than 200 MB in size?
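As a side note, whitespace is optional in JSON, so most services send the whole document on one line. A minimal Python sketch (with a hypothetical payload) showing how a one-line response can be re-indented before saving; note that `json.loads` still holds the entire document in memory, so this only helps for files small enough to fit:

```python
import json

# A one-line payload like the one the service returns (hypothetical sample).
single_line = '{"ExportSetGuid":"a7480ae9","Actions":[{"ActionId":410766}]}'

# Re-serialize with indentation so the saved file opens in a normal editor.
# This parses the whole document in memory; for a 200 MB file a streaming
# parser is the safer route.
pretty = json.dumps(json.loads(single_line), indent=2)
print(pretty)
```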

 

 

12 Replies
Anonymous
Not applicable

I've looked at your file, and the example you gave was not valid JSON. I suspect that is an artifact of how you transferred and trimmed it for posting, though. Before this code....

 

{ "ActionId":410768,

....you needed a comma. A small thing, but if your source does not provide it, parsing will fail.
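To illustrate (with a hypothetical fragment, not the actual file), any compliant parser rejects array elements that are not separated by a comma:

```python
import json

# Two objects in an array with the separating comma missing -- invalid JSON.
bad = '[{"ActionId":410766} {"ActionId":410768}]'
good = '[{"ActionId":410766}, {"ActionId":410768}]'

try:
    json.loads(bad)
except json.JSONDecodeError as err:
    print("rejected:", err.msg)

actions = json.loads(good)
print(actions[1]["ActionId"])
```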

 

Now, I changed the values in your file to make it easier for me to see whether it was working. The modified file looks like this....

{
   "ExportSetGuid":"a7480ae9-5045-4e33-b6f5-875cd17e1711",
   "Actions":[
      {
         "ActionId":410766,
         "Allocations":[
            {
               "AllocationTargetOrgUnitId":82373,
               "FiscalYears":[
                  {
                     "FiscalYear":2017,
                     "PercentValue":1.0300
                  },
                  {
                     "FiscalYear":2018,
                     "PercentValue":2.0300
                  }

               ]
            }
         ]
      },
      { "ActionId":410768, 
         "Allocations":[ 
             { "AllocationTargetOrgUnitId":82377,
               "FiscalYears":
                [ 
                { "FiscalYear":2020, "PercentValue":3.0600 },
                { "FiscalYear":2021, "PercentValue":5.0900 }
                ]
             }
          ]
       }
   ]
}

I then built a job like this.....

[screenshot: overall job layout (0683p000009M2dJ.png)]

 

Ignore the deactivated tLogRows; I added those to see what I was getting after each component, which is a really good way of checking you are on the right track. The following screenshots show each of the components from left to right, in order (ignoring the tLogRows)....

[screenshot: tFileInputJson configuration (0683p000009M2nX.png)]

First I extract the outermost loop information. The Actions query in the screenshot above is essentially pulling a JSON snippet out of the bigger JSON file and sending it forward to be processed.

[screenshot: first tExtractJsonFields configuration (0683p000009M2nc.png)]

The above tExtractJsonFields component is passing through the ExportSetGuid value (hence it is left blank in the Json query) and is extracting the ActionId value and the Allocations array JSON snippet. Notice the Json Field value is set to Actions. Also notice the loop: it is simply looping over the array.

[screenshot: second tExtractJsonFields configuration (0683p000009M2ft.png)]

The above tExtractJsonFields component is passing through the ExportSetGuid and ActionId values and is extracting the AllocationTargetOrgUnitId value and the FiscalYears array JSON snippet. Notice the Json Field value is set to Allocations. Also notice the loop: it is simply looping over the array.

[screenshot: third tExtractJsonFields configuration (0683p000009M2Vy.png)]

The above tExtractJsonFields component is passing through the ExportSetGuid, ActionId and AllocationTargetOrgUnitId values and is extracting the FiscalYear and PercentValue values. This is the lowest leaf on the tree. Notice the Json Field value is set to FiscalYears. Also notice the loop: it is simply looping over the array.
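Taken together, the three chained tExtractJsonFields components amount to nested loops over Actions, Allocations, and FiscalYears, with each level passing its parent's columns through. A plain Python sketch of that flattening logic (using the sample file above, trailing zeros trimmed; this illustrates the idea, it is not what Talend generates):

```python
import json

doc = json.loads("""
{"ExportSetGuid":"a7480ae9-5045-4e33-b6f5-875cd17e1711",
 "Actions":[
   {"ActionId":410766,"Allocations":[
     {"AllocationTargetOrgUnitId":82373,"FiscalYears":[
       {"FiscalYear":2017,"PercentValue":1.03},
       {"FiscalYear":2018,"PercentValue":2.03}]}]},
   {"ActionId":410768,"Allocations":[
     {"AllocationTargetOrgUnitId":82377,"FiscalYears":[
       {"FiscalYear":2020,"PercentValue":3.06},
       {"FiscalYear":2021,"PercentValue":5.09}]}]}]}
""")

rows = []
# Each loop level corresponds to one tExtractJsonFields component,
# carrying the parent columns through to the next level.
for action in doc["Actions"]:
    for alloc in action["Allocations"]:
        for fy in alloc["FiscalYears"]:
            rows.append("|".join(str(v) for v in (
                doc["ExportSetGuid"], action["ActionId"],
                alloc["AllocationTargetOrgUnitId"],
                fy["FiscalYear"], fy["PercentValue"])))

print("\n".join(rows))
```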

 

The output from this was.....

Starting job ExampleFilteringtXML at 11:28 26/02/2019.

[statistics] connecting to socket on port 3677
[statistics] connected
a7480ae9-5045-4e33-b6f5-875cd17e1711|410766|82373|2017|1.03
a7480ae9-5045-4e33-b6f5-875cd17e1711|410766|82373|2018|2.03
a7480ae9-5045-4e33-b6f5-875cd17e1711|410768|82377|2020|3.06
a7480ae9-5045-4e33-b6f5-875cd17e1711|410768|82377|2021|5.09
[statistics] disconnected

Job ExampleFilteringtXML ended at 11:28 26/02/2019. [exit code=0]

This is not necessarily a quick and easy way to achieve this, but it is a methodical way to work. 

karandama2006
Creator
Author

Thank you very much @rhall for such a detailed explanation.

The approach you suggested is considerably faster for parsing JSON than the XPath one I was using.

 

It also solved another issue: an invalid XML Unicode character error I was getting with some of the files due to the XPath approach.

 

For the 208 MB file I at first got the heap-space error (that one was immediate; I didn't have to wait 4 hours for it).

I increased the job's memory to 12 GB (ran it on a system which actually had more RAM) and then got the GC-overhead error. I found a way to fix that in another Talend Community post: add a JVM parameter to disable the GC overhead limit. After that, everything worked.
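For reference, the JVM arguments in question would look something like this (hypothetical values; in Talend Studio these typically go under the job's Run tab > Advanced settings as extra JVM arguments, and the heap size is whatever the machine allows):

```shell
# Raise the maximum heap, and disable the check that aborts the JVM
# when garbage collection takes up most of the runtime.
-Xmx12g
-XX:-UseGCOverheadLimit
```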

The file was processed within 2 minutes!

 

[screenshot: successful job run statistics (0683p000009M2bI.png)]

 

Thanks again , really appreciate your help .

 

I can't imagine how much time I invested parsing this JSON the wrong way. I think this approach could be added as a scenario under tFileInputJSON on help.talend.com (the first page you land on when you google Talend + JSON).

The existing scenarios there are too basic (assuming this isn't already documented somewhere else; I couldn't find it).

Anonymous
Not applicable

Hi @karandama2006, I am pleased that this worked for you. I will see if we can get this post reworded and promoted to an article.