Hi All,
I have a .dat file which contains complex JSON data. The file size is 8GB. I have created a job to extract the data and load it into a database table, but it takes too much time to parse the JSON and load it into the DB. Currently it is taking more than 20 hours to load.
tFileUnarchive --> tFileInputDelimited --> tFlowToIterate --> tJavaFlex --> tExtractJSONFields --> tJavaRow --> tMap --> tOracleOutput
In tFileInputDelimited I read each line as a single string column (id), and then I iterate over it only 50 records at a time with parallel execution enabled.
Most of the time goes into parsing the data and extracting the JSON fields, so the bottleneck is basically the tExtractJSONFields component.
Can anyone please assist me with improving the performance of this job? I don't have much knowledge of Java; I am still learning.
I have attached the sample file.
Hi @Ajay,
Greetings of the day,
You can try enabling "Set Parallelization" on the main source component (most likely the tFileInputDelimited). Another option, assuming the link from the delimited file to tFlowToIterate is a main flow, is to change the parallelization to "Partition row" and then increase the number of child threads and the queue size.
If you want, you can also check the hash-function option, but be wary of this checkbox; it is effectively an indirect "Set Parallelization". Next, on the iterate link you can enable the "Set Parallelization" option and increase the number of parallel executions; by doing this you split the records across several threads. On the tExtractJSONFields link (main row) you can then merge the child threads back together, which will ultimately improve the performance. A rough plain-Java sketch of this partition-and-merge idea follows below.
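(This is not how Talend implements it internally, just an illustration of the pattern those settings describe: a fixed pool of worker threads consuming rows from a bounded queue, with a single merge point at the end. The queue size, thread count, and parse() helper are all made up for the sketch.)

```java
import java.util.List;
import java.util.concurrent.*;

public class PartitionMergeSketch {

    public static void main(String[] args) throws InterruptedException {
        int childThreads = 4;                                         // "number of child threads"
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000); // "queue size"
        BlockingQueue<String> merged = new LinkedBlockingQueue<>();   // merge point

        ExecutorService pool = Executors.newFixedThreadPool(childThreads);
        for (int i = 0; i < childThreads; i++) {
            pool.submit(() -> {
                try {
                    String line;
                    // Each worker parses rows in parallel until it sees a poison pill.
                    while (!"EOF".equals(line = queue.take())) {
                        merged.put(parse(line));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Feed some rows, then one poison pill per worker to shut them down.
        for (String row : List.of("{\"id\":1}", "{\"id\":2}", "{\"id\":3}")) {
            queue.put(row);
        }
        for (int i = 0; i < childThreads; i++) {
            queue.put("EOF");
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        merged.forEach(System.out::println);      // single merged output
    }

    static String parse(String json) {
        // Stand-in for the expensive JSON extraction step.
        return "parsed:" + json;
    }
}
```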
I'm sure you must have already tried the above approach.
One other option is to take a .zip export of the job and execute it as a standalone job via Task Scheduler; this removes the dependency on Talend Studio and is another way to improve the performance.
Please do get back with your comments.
Thanks,
Ankit
The problem is the poor JSON capabilities of the Studio.
I had similar problems with extremely huge JSON files and therefore built a lightweight streaming JSON parser.
Please try the user component tJSONDocInputStream from the tJSONDoc component suite.
In this component I implemented a very simple JSON parser which has only one task to solve: extracting JSON documents by a path (a simple JSON path only, without any complex queries).
This parser does not require the tFlowToIterate component; it can take the file input directly and provides an output stream of JSON documents (as String or as a JSON object of the Jackson JSON library).
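For anyone curious what such streaming extraction looks like in plain Java, here is a minimal sketch using Jackson's streaming API (the tJSONDoc components do more than this). It assumes the big file holds one top-level JSON array of record objects; the file name and the process() helper are illustrative only:

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.File;
import java.io.IOException;

public class StreamingJsonReader {

    public static void main(String[] args) throws IOException {
        ObjectMapper mapper = new ObjectMapper();
        JsonFactory factory = mapper.getFactory();

        // Assumption: the file holds one top-level array of record objects,
        // e.g. [ {...}, {...}, ... ]. Adjust the token handling for other layouts.
        try (JsonParser parser = factory.createParser(new File("records.dat"))) {
            if (parser.nextToken() != JsonToken.START_ARRAY) {
                throw new IllegalStateException("Expected a top-level JSON array");
            }
            // readValueAsTree() materializes only one document at a time,
            // so memory stays flat regardless of the file size.
            while (parser.nextToken() == JsonToken.START_OBJECT) {
                JsonNode doc = parser.readValueAsTree();
                process(doc);
            }
        }
    }

    private static void process(JsonNode doc) {
        // Placeholder for the per-document work (field extraction, DB insert, ...).
        System.out.println(doc.path("id").asText());
    }
}
```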
One of the biggest problems of your design is the tFlowToIterate component in your job. This component starts the initialization of the following components for EVERY single record in your file! That means, e.g., the tOracleOutput will create its statement anew for every record and cannot use batch mode (regardless of whether you have switched it on, it has no effect in your design!).
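To illustrate what batch mode buys you once the rows arrive in one continuous flow, here is a minimal plain-JDBC sketch: the statement is prepared once and rows are flushed in batches. The connection details, table, and loadRecords() helper are placeholders, not taken from your job:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BatchInsertExample {

    public static void main(String[] args) throws SQLException {
        // Connection details are placeholders.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password")) {
            conn.setAutoCommit(false);

            // One statement, prepared once and reused for every record --
            // this is what batch mode needs, and what per-record iteration prevents.
            String sql = "INSERT INTO target_table (id, payload) VALUES (?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                int count = 0;
                for (Record r : loadRecords()) {   // hypothetical row source
                    ps.setString(1, r.id());
                    ps.setString(2, r.payload());
                    ps.addBatch();
                    if (++count % 1000 == 0) {
                        ps.executeBatch();         // flush every 1000 rows
                    }
                }
                ps.executeBatch();                 // flush the remainder
            }
            conn.commit();
        }
    }

    // Minimal stand-ins so the sketch compiles.
    record Record(String id, String payload) {}

    static java.util.List<Record> loadRecords() {
        return java.util.List.of(new Record("1", "{\"a\":1}"));
    }
}
```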
So my suggestion is in the attached picture...