Anonymous
Not applicable

Concerning Talend and Garbage Collection

Hi,

I posted this question in the Open-Source forum yesterday, but since we are using the Enterprise edition, I think this is a more appropriate place for it.

I've got a job that looks something like this (sorry I can't just post a screencap, but it's for work and I'm not sure if it would get me in trouble):

OracleInput1 -main-> tMap -main-> tAggregateRow -main-> tFileOutput(file_1)
         |                           ^
         |                           |
         |                       lookup
         |                           |
         |                 DB2Input1
         |                             
OnSubjobOk                               
         |            
         |                                tFileOutputDelimited(duplicate_file)    
         |                                                     ^   
         |                                                     |        
         |                                                duplicates                  
        V                                                     |
OracleInput2 -main-> tMap -main-> tUniqueRow -main-> tSortRow -main-> tAggrRow  -main-> tFileOutput(file_2)
         |                           ^
         |                           |
         |                       lookup
         |                           |
         |                 DB2Input2
         |                             
OnSubjobOk                             
         |                          
         |
         |
         |
         V
tFileInputDelimited(file_1) -main-> tMap -main-> tAggregateRow -main-> OracleOutput1
                                                            |
                                                         lookup
                                                            |
                                             tFileInputDelimited(File_2)




Some other details:
  • There are a few more sort/aggregate steps in the job than I'm showing above. We need the sorts because some of the aggregations pick out the first and last records after the sort happens.
  • Each of the DB input components pulls between 800k and 2M records.
  • The combined size of the two flat files is less than 1GB.


To run this job takes close to 6 GB of memory, and I don't understand why. 

If I watch the memory usage on my machine, during the first subjob (which is processing the largest amount of data), Talend is using between 2 and 3 GB of memory.  I see no issue with that.

However, once the job dumps its data into the first flat file and moves on to the second subjob, the memory used by the first subjob is not released. The second subjob works with far less data than the first, but instead of dropping, memory usage steadily increases. By the end of the second subjob, I'm giving Talend between 4 and 5 GB of memory. The same thing happens in subjob 3 (where I join the two delimited files). By the time subjob 3 begins loading my Oracle table, Talend is using over 6 GB of memory on my machine.

I am not referencing this data at any other point in the job, and the subjob has finished, so my understanding is that the data should be eligible for garbage collection. Subjob 1 works with the largest amount of data of the three, so I would not expect memory usage to climb much above the 2-3 GB used while subjob 1 is running. Our senior dev thinks the tSortRow or tAggregateRow components may be holding references, preventing the garbage collector from freeing the memory used in the main flows of the first two subjobs.
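My mental model of the situation, sketched in plain Java (this is a toy stand-in, not Talend's actual generated code; the map and key names are assumptions): as long as a long-lived map like globalMap still holds a strong reference to a subjob's buffered data, the garbage collector cannot reclaim it, even after the subjob has finished.

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

// Toy model: a long-lived map (like Talend's globalMap) keeps a subjob's
// data strongly reachable until the entry is explicitly removed.
public class GlobalMapDemo {
    static boolean reachableWhileInMap;
    static boolean collectedAfterRemove;

    public static void main(String[] args) {
        Map<String, Object> globalMap = new HashMap<>();
        byte[] buffered = new byte[8 * 1024 * 1024]; // stand-in for sorted/aggregated rows
        globalMap.put("tAggregateRow_1", buffered);  // hypothetical component key

        WeakReference<byte[]> probe = new WeakReference<>(buffered);
        buffered = null; // subjob is done; only the map still points at the data
        System.gc();
        reachableWhileInMap = (probe.get() != null); // map keeps the data alive

        globalMap.remove("tAggregateRow_1"); // drop the last strong reference
        for (int i = 0; i < 5 && probe.get() != null; i++) {
            System.gc(); // request (not force) a collection
        }
        collectedAfterRemove = (probe.get() == null);
        System.out.println("reachable while in map: " + reachableWhileInMap);
        System.out.println("collected after remove: " + collectedAfterRemove);
    }
}
```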

Can anyone shed some light on this and help me understand what's going on behind the scenes, here?

1 Reply
Anonymous
Not applicable
Author

You *may* have discovered a bug here. If your representation of your job is correct (by the way, I doubt your company would be cross with you for sharing a screenshot; you aren't really revealing anything), there is no reason the memory should be held. You can see whether a little code in a tJava, connected by an OnSubjobOk after the first subjob, corrects it.

Try the following:

globalMap.remove("tAggregateRow_1");
System.gc();


I am assuming your tAggregateRow is the first one in the job (with a label of tAggregateRow_1).

Basically, your components' data is stored in the globalMap HashMap, and it should be released when the subjob ends. The code above removes the entry and then calls the garbage collector.

Try it and see what it does.
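To see whether the removal actually frees anything, you can log used heap before and after the tJava runs. A minimal sketch (the class and method names are my own, and the numbers printed will vary by JVM; note System.gc() is only a request, not a guarantee):

```java
// Small probe for watching heap usage around a cleanup step.
public class HeapProbe {
    // Approximate used heap in megabytes.
    public static long usedMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("used heap before: " + usedMb() + " MB");
        byte[] data = new byte[32 * 1024 * 1024]; // stand-in for subjob buffers
        System.out.println("used heap with buffer: " + usedMb() + " MB");
        data = null;     // drop the reference, as globalMap.remove(...) would
        System.gc();     // request a collection
        System.out.println("used heap after gc: " + usedMb() + " MB");
    }
}
```

In a real job, the two println calls would bracket the globalMap.remove(...) and System.gc() lines in the tJava.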