Anonymous

Reusing multiple pieces of data from one job to another

We have a solution built in Talend Open Studio 5.6.2 that we're currently optimizing for performance. The solution comprises the following jobs, each of which parses a JSON blob to extract data for a specific purpose:
  • Base Dimension Job = parses the JSON for dimensions considered common to other solutions

  • Batch Dimension Job = parses the JSON for dimensions specific to the Batch solution

  • SCD Job = parses the JSON for slowly changing dimensions. This is separate for now because there is no JDBC version of the SCD component.

  • Fact Job = re-parses the JSON to extract values and then does key lookups against the dimension tables to populate the fact tables. Hashmaps are used to retain some of this data in memory for the fact-table processing (sketched just below).
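To make that last point concrete, here is a minimal plain-Java sketch of the hashmap lookup the Fact Job performs; the map, keys, and values are made up for illustration, and the real jobs hold one such map per dimension:

    import java.util.HashMap;
    import java.util.Map;

    public class FactLookupSketch {
        // Business key -> surrogate key for one dimension, loaded once into memory.
        static Map<String, Integer> productDimKeys = new HashMap<>();

        public static void main(String[] args) {
            // In the real job, these entries come from the dimension table.
            productDimKeys.put("SKU-1001", 1);
            productDimKeys.put("SKU-1002", 2);

            // For each record parsed from the JSON blob, resolve the surrogate
            // key in memory instead of hitting the database per row.
            String businessKey = "SKU-1001"; // value extracted from the JSON
            Integer surrogateKey = productDimKeys.get(businessKey);
            if (surrogateKey == null) {
                System.err.println("No dimension row for " + businessKey);
            } else {
                System.out.println("Fact row gets product_key = " + surrogateKey);
            }
        }
    }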



I guess I have a few questions regarding this implementation:
  1. Is this a typical architecture of jobs when extracting data from JSON?

  2. Is there a better/other way to pass data values from one job to be used in another within Talend Open Studio?

  3. Is it better to parse all the JSON once for all dimensions at the outset, hold all of the information in buffer/hash/memory, and then piecemeal it out as needed?



I've been doing some searching in the forums and have found comparisons of tHashOutput vs tBufferOutput, but I'm really not seeing any good examples of people passing more than one value from one job to another job.
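For what it's worth, the closest plain-Java analogue I can picture for passing more than one value at a time is a child routine handing whole multi-column rows back to its parent, as in the hypothetical sketch below. In Talend terms this is roughly what tBufferOutput in a child job feeding the tRunJob output flow does, though the generated code differs:

    import java.util.ArrayList;
    import java.util.List;

    public class PassRowsSketch {
        // A "row" carrying more than one value, like a tBufferOutput schema.
        static class DimRow {
            final String name;
            final int key;
            DimRow(String name, int key) { this.name = name; this.key = key; }
        }

        // "Child job": buffers several multi-column rows rather than one scalar.
        static List<DimRow> childJob() {
            List<DimRow> buffer = new ArrayList<>();
            buffer.add(new DimRow("region", 10));
            buffer.add(new DimRow("channel", 20));
            return buffer;
        }

        // "Parent job": consumes every buffered row from the child.
        public static void main(String[] args) {
            for (DimRow row : childJob()) {
                System.out.println(row.name + " -> " + row.key);
            }
        }
    }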

2 Replies
Anonymous
Author

Hi
If the data set is not very large, you can cache it in memory with tHashOutput for later use. You don't need to split the work into different child jobs, because every business flow will read the same JSON data.
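Conceptually, tHashOutput/tHashInput act like a shared in-memory table that is written once and then read by several flows in the same job. A simplified plain-Java sketch of that pattern (not the code Talend actually generates):

    import java.util.ArrayList;
    import java.util.List;

    public class HashCacheSketch {
        // Written once (the tHashOutput step), read by every downstream flow.
        static final List<String[]> cache = new ArrayList<>();

        public static void main(String[] args) {
            // Parse the JSON once and cache the extracted rows.
            cache.add(new String[] {"dim_base", "valueA"});
            cache.add(new String[] {"dim_batch", "valueB"});

            // Each business flow re-reads the same cached rows (like tHashInput).
            for (String[] row : cache) { // base-dimension flow
                System.out.println("base:  " + row[0] + "=" + row[1]);
            }
            for (String[] row : cache) { // batch-dimension flow
                System.out.println("batch: " + row[0] + "=" + row[1]);
            }
        }
    }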
Anonymous
Author

That's the approach I had begun to test, and I've got the tHashOutput(s) doing exactly what you recommended. You mention this should be done "if the data set is not very large." Can you expand on that part? 
  • Is it that at <1,000 rows we're OK, but at ~10k rows we start hitting bottlenecks?

  • Is there a threshold I should try to stay under? (I've put together a rough heap check, shown after this list, to test this empirically.)

  • Is there any documentation or best practice I can follow for this?

  • Sorry for all the questions; I would just hate to re-architect this thing only to find that I've made things worse in real-world scenarios!
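Here is that rough heap check, using standard JVM calls; the row count and row contents are placeholders, so it only gives a ballpark figure:

    import java.util.ArrayList;
    import java.util.List;

    public class HeapCheckSketch {
        public static void main(String[] args) {
            Runtime rt = Runtime.getRuntime();
            long before = rt.totalMemory() - rt.freeMemory();

            // Simulate caching N rows (placeholder count and contents).
            List<String[]> cache = new ArrayList<>();
            for (int i = 0; i < 100_000; i++) {
                cache.add(new String[] {"key" + i, "some dimension value " + i});
            }

            // Approximate: GC timing affects these numbers, so treat as a ballpark.
            long after = rt.totalMemory() - rt.freeMemory();
            System.out.printf("~%d MB for %d cached rows (max heap %d MB)%n",
                    (after - before) / (1024 * 1024),
                    cache.size(),
                    rt.maxMemory() / (1024 * 1024));
        }
    }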

    Oh, and as another aside: the initial solution was built in 5.6.2, but I'm rebuilding it in 6.1.0. I'm not sure whether there's anything in the newer version I can take advantage of to help with this, or, for that matter, anything in the Enterprise version that might help out.
    Thanks!