Hi everybody.
Again, I REALLY need your help
I'm having the error "Exception in thread "main" java.lang.OutOfMemoryError: Java heap space".
My OS is Windows 7 Professional Service Pack 1, 64-bit, with 8 GB of RAM. I'm using Talend Open Studio for Data Integration, version 7.0.1.
Right now, I'm trying to read an Excel file with approx. 1 million rows and 34 columns, full of data, using tFileInputExcel and a tLogRow, but my job only reads the first row (the header) and then I get the error.
If I can read all the data (I hope you can help me with this), I'll process the information with components such as tMap, tAggregateRow, tPivotToColumnsDelimited, tFilterRow, tHashInput and tHashOutput, sending the full results to a tFileOutputExcel and a tFileOutputDelimited.
My advanced settings are:
-Xms256M
-Xmx2048m
-XX:-UseGCOverheadLimit
Any suggestions? If somebody thinks it would help for me to send the file I'm trying to read (to check whether the problem is my computer), just tell me (it's an Excel .xlsx file, 153 MB).
Thanks!
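To confirm that the -Xmx value from the advanced settings actually reaches the job's JVM, something like the following could be run (for example from a tJava component, which is an assumption about how it would be wired in). It is only a sketch using the standard java.lang.Runtime API; the class name HeapReport is made up for illustration.

```java
final class HeapReport {
    private static final long MIB = 1024L * 1024L;

    // Maximum heap the JVM will try to use, in MiB (reflects -Xmx when it is set).
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / MIB;
    }

    // Heap currently in use, in MiB.
    static long usedHeapMiB() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MIB;
    }

    public static void main(String[] args) {
        System.out.println("Max heap (MiB):  " + maxHeapMiB());
        System.out.println("Used heap (MiB): " + usedHeapMiB());
    }
}
```

If the reported maximum is well below 2048 MiB, the -Xmx2048m setting is probably not being applied to the run.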
Hi!
I get the same error (after 24 minutes). Any other suggestions? Thanks!
You can reduce the required memory space by replacing tHash components by files.
You can also store temporary data required by tMap components on disk.
For this, click the third icon in the upper-left corner of the tMap, then set the "Temp data directory path" and the buffer size.
And never connect a tLogRow to a 1M-row source: it kills your Studio!
Hi @TRF
The data stored in the tHash components isn't large. Also, since that information is produced by the same job and then used as input, Talend apparently forces me to create metadata to reprocess it with a tMap (and this could be a problem when I run the .bat file on another PC).
On the other hand, I tried to reduce the use of memory with your suggestion but I get the same error:
My job should look like this, and I'm pretty sure that my problem is reading the Excel file with the source data (because it works just fine with a small amount of data in the Excel file). By the way, @vapukov, I was using the tLogRow just to test the reading part, but you are right, it's not the best idea.
Please tell me that you have any other idea.
I don't know if this is crazy, but maybe with Talend I can split the file and then read the data separately? Or what do you think I should try?
Thank you in advance!
Hi,
Change all sets of tAggregateRow into a tSortRow (sort by all group by criteria) and tAggregateSortedRow. Ensure you set the Use disk option on all components where possible (giving a more sensible buffer size of 100,000), including tMaps.
Also consider splitting it into 2 subjobs around the tFileOutputDelimited_3 (make the lookup a tFileInputDelimited of what tFileOutputDelimited_3 has just output).
Hi @dsoulalioux . Thanks for your answer. I'm sorry to bother you, but before making the changes you suggest, I wanted to tell you that even without those components (tMap or tAggregateRow) the job generates the error. I've even tried to just read the Excel file, filter the columns I need (with tFilterColumns) and then filter the rows I need (with tFilterRow) to save this data to a new Excel file (for example), and the error also appeared. With this context, do you still think that I should replace the tAggregateRow with the components you mention? Thank you!
Did you try Event mode in tFileInputExcel?
Also, please load-test the job on the job server and see if it still gives the error.
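For background, Event mode is, as far as I know, a streaming (SAX-style) read that handles one row at a time instead of loading the whole workbook, which is why it needs far less heap. The sketch below illustrates that idea with plain JDK classes only; it is not Talend's or Apache POI's actual code. The class name XlsxRowCounter and the entry path passed to it are assumptions (the real worksheet entry name depends on the workbook).

```java
import java.io.InputStream;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

final class XlsxRowCounter {

    // Counts <row> elements in one worksheet part of an .xlsx file.
    // The XML is streamed through a SAX handler, so the sheet is never held in memory as a whole.
    static long countRows(Path xlsx, String sheetEntry) throws Exception {
        final long[] rows = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // Accept both the unprefixed and the prefixed form of the element name.
                if ("row".equals(qName) || qName.endsWith(":row")) {
                    rows[0]++;
                }
            }
        };
        // An .xlsx file is a ZIP archive; each worksheet is an XML entry inside it.
        try (ZipFile zip = new ZipFile(xlsx.toFile())) {
            ZipEntry entry = zip.getEntry(sheetEntry);
            if (entry == null) {
                throw new IllegalArgumentException("No such entry: " + sheetEntry);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
            }
        }
        return rows[0];
    }
}
```

A call such as XlsxRowCounter.countRows(path, "xl/worksheets/sheet1.xml") would walk the first sheet of a typical workbook; a real reader would also have to resolve shared strings and cell styles, which this sketch skips.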
Hi @uganesh
I am really grateful for your answer, because so far it is the only approach that has avoided the memory error and let the job read the data from the source Excel file. I had not tried that alternative (because I did not know it existed); I tried it and the reading part worked.
However, a new problem has appeared: the source file (Excel .xlsx) has a date column in the format "14-10-2017 01:42:12" (dd-MM-yyyy HH:mm:ss). When I select Event mode in the tFileInputExcel, the job fails with an error saying that this value is not a date, which forces me to change the column to String (and the date then becomes something like "05:15.3"). This column is very important for the calculations I have to do in the job: after some filters, I need the date part (the time is irrelevant) to calculate statistics such as frequency and repetition. Is there any way, while using Event mode, to keep that column as a Date type? Thank you!
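One possible workaround, sketched under assumptions: keep the column as a String in Event mode and convert it afterwards (for instance in a tMap expression or a routine). The sketch assumes the cell arrives either as the raw Excel serial number (days since 1899-12-30 in the default 1900 date system, e.g. 43022.0709722 for 14-10-2017 01:42:12) or as text in the dd-MM-yyyy HH:mm:ss pattern; neither has been confirmed for this file. The names ExcelDates, fromSerial and parseCell are hypothetical.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

final class ExcelDates {
    // Day zero of Excel's default 1900 date system (valid for dates from March 1900 onwards).
    private static final LocalDateTime EPOCH = LocalDateTime.of(1899, 12, 30, 0, 0);
    // Text pattern of the source column; HH is the 24-hour clock.
    private static final DateTimeFormatter TEXT = DateTimeFormatter.ofPattern("dd-MM-yyyy HH:mm:ss");

    // Converts an Excel serial number (whole days plus fraction of a day) to a date-time.
    static LocalDateTime fromSerial(double serial) {
        long days = (long) Math.floor(serial);
        long seconds = Math.round((serial - days) * 86_400d);
        return EPOCH.plusDays(days).plusSeconds(seconds);
    }

    // Returns only the date part, whichever of the two assumed forms the cell text has.
    static LocalDate parseCell(String raw) {
        String s = raw.trim();
        try {
            return fromSerial(Double.parseDouble(s)).toLocalDate();
        } catch (NumberFormatException notASerial) {
            return LocalDateTime.parse(s, TEXT).toLocalDate();
        }
    }
}
```

Since only the date matters for the frequency and repetition statistics, parseCell drops the time and returns a LocalDate; if the String column really contains the display value "05:15.3", the date information is already lost at read time and this conversion cannot recover it.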