jeoste
Creator II

Job optimization - Excel file reading

Hello all,

 

--- Short version ---
I'm hitting a serious optimization problem: the same Excel 2007 file is read multiple times, which causes an OutOfMemory error.
How can I improve performance by reading the Excel file only once per sheet rather than once per table to insert?

 

--- The long version ---
I'm encountering an "out of memory" error in Talend. I solved it once by increasing the Xmx and Xms values in my .ini file.
But the problem is back, because the program will not be launched from my computer but from a server that only has half of the memory I allocated on my local machine.

There's no possibility to increase Xmx and Xms on the server, so my only option is to optimize my job (in my opinion, but if you have a genius idea I'll take it).
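To know how much memory the server really gives the job, I can log the heap from a tJava step (or any other spot that runs plain Java). A minimal check, nothing specific to my job:

```java
// Minimal heap check, e.g. inside a tJava component, to see the effective -Xmx
// the job runs with on the server (values printed in MB).
long maxMb  = Runtime.getRuntime().maxMemory() / (1024 * 1024);
long usedMb = (Runtime.getRuntime().totalMemory()
               - Runtime.getRuntime().freeMemory()) / (1024 * 1024);
System.out.println("Heap: " + usedMb + " MB used / " + maxMb + " MB max");
```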

 

My jobs do the following (roughly sketched in the code below):
- Reading an Excel 2007 file (.xlsx)
- Transforming some data to match the management rules
- Inserting into a table
- Committing
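Stripped of the Talend components, each of these jobs boils down to something like the plain Java below (Apache POI for the sheet, JDBC for the inserts). The file name, JDBC URL, table and columns are only illustrative, not my real ones:

```java
import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

// Rough sketch of one job: read one sheet of the .xlsx, apply a small
// transformation, batch-insert into one table, commit once at the end.
public class SingleTableLoad {
    public static void main(String[] args) throws Exception {
        DataFormatter fmt = new DataFormatter();
        try (Workbook wb = WorkbookFactory.create(new File("input.xlsx"));
             Connection con = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/demo", "user", "pwd")) {
            con.setAutoCommit(false);                      // single commit at the end
            Sheet sheet = wb.getSheet("Sheet1");
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO tableA (colA, colB, colC) VALUES (?, ?, ?)")) {
                for (Row row : sheet) {
                    if (row.getRowNum() == 0) continue;    // skip header row
                    // "management rule" stands in for the real transformation
                    ps.setString(1, fmt.formatCellValue(row.getCell(0)).trim());
                    ps.setString(2, fmt.formatCellValue(row.getCell(1)));
                    ps.setString(3, fmt.formatCellValue(row.getCell(2)));
                    ps.addBatch();
                }
                ps.executeBatch();
            }
            con.commit();
        }
    }
}
```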

First optimization problem:


I'm reading the Excel file at the beginning of each job.

In my file I have columns, let's say A, B, C, D and E.
I need to insert the data from A, B, C into tableA,
then the data from D into tableB,
and then the data from E into tableC.

This order cannot be changed: because of the foreign key constraints in my database, I have to insert the data in that specific order.


For the rest of my program I have many other tables with many columns and need to do the same thing; that's why I need to optimize the process.

So I read the Excel file and insert into tableA; then I call a subjob that inserts the data from column D into tableB, and then another subjob that inserts column E into tableC.
In the end I have 17 jobs, with 17 reads of the same Excel file (though not always the same sheets).

See the attachments with these 3 jobs.

[Screenshot: First job]

---

[Screenshot: Second job]

---

[Screenshot: Third job]

---

How can I optimize this pattern?

 

I have already tried multiple solutions:

 

First, rather than inserting into tableA, then tableB, then tableC sequentially, I used a tMap with Main order 1 to insert into tableA, Main order 2 for tableB and Main order 3 for tableC.
But that doesn't enforce the order: Talend just parallelizes the flows, so my job inserts into B before A and C, which causes database errors due to the foreign key constraints.

 

Second, I used the tReplicate component to duplicate the input flow into three, which let me avoid reading the Excel file three times; but the behavior is the same as with tMap: the order is not respected and I hit the same issue.

 

Third, I grouped the loads of my three tables A, B and C into the same job, so I avoid connecting to and disconnecting from the database several times.
Committing the inserts without closing the connection works on my machine; but when I build the job and launch the .sh, I get the OutOfMemory error.

 

Fourth, same as the third solution, but built with the tFileExcelInput component instead of the tFileExcelWorkbookOpen -> tFileExcelSheetInput components I currently use.

 

Right now I don't really know what to do. I've heard about tHashOutput/tHashInput and tFlowToIterate/tFixedFlowInput but have never used them; do you think they might be a good alternative? (A sketch of the pattern I have in mind is below.)
I have around 3 tLogRow components per job; might that be a bad habit too?
And all the console output is exported to a log file in a specific folder.
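For reference, the pattern I'm imagining with tHashOutput/tHashInput (or a tFixedFlowInput fed from memory) is, conceptually, "read the sheet once, then feed the three tables in the right order from the cached rows". A rough Java equivalent, with invented file, table and column names, just to show the idea:

```java
import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

// Read the sheet once, cache the rows in memory, then run three ordered
// insert passes (tableA, tableB, tableC) inside a single transaction so the
// foreign key order is respected.
public class ReadOnceLoadThree {
    public static void main(String[] args) throws Exception {
        DataFormatter fmt = new DataFormatter();
        List<String[]> rows = new ArrayList<>();

        // 1) Single read of the sheet: keep only the columns A..E we need.
        try (Workbook wb = WorkbookFactory.create(new File("input.xlsx"))) {
            for (Row r : wb.getSheet("Sheet1")) {
                if (r.getRowNum() == 0) continue; // header
                rows.add(new String[] {
                    fmt.formatCellValue(r.getCell(0)), fmt.formatCellValue(r.getCell(1)),
                    fmt.formatCellValue(r.getCell(2)), fmt.formatCellValue(r.getCell(3)),
                    fmt.formatCellValue(r.getCell(4)) });
            }
        }

        // 2) Three ordered insert passes over the cached rows, one commit.
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/demo", "user", "pwd")) {
            con.setAutoCommit(false);
            insert(con, "INSERT INTO tableA (a, b, c) VALUES (?, ?, ?)", rows, 0, 1, 2);
            insert(con, "INSERT INTO tableB (d) VALUES (?)", rows, 3);
            insert(con, "INSERT INTO tableC (e) VALUES (?)", rows, 4);
            con.commit();
        }
    }

    // Batch-insert the given cached columns into one table.
    private static void insert(Connection con, String sql, List<String[]> rows, int... cols)
            throws Exception {
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            for (String[] row : rows) {
                for (int i = 0; i < cols.length; i++) ps.setString(i + 1, row[cols[i]]);
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }
}
```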

 

If you have an idea, whatever passes through your mind, I'll be glad to hear it and test it ASAP! Any optimization I can get, I'll take!
Thanks in advance; if you need more details, tell me and I'll add more screenshots, etc.

10 Replies
jeoste
Creator II
Author

As promised, here's a little update on this topic.

The code above improved the execution time by about 50%, but I've discovered that the main problem came from the Excel file itself. It contains lots of formulas and links between sheets, and those cost a ton of RAM (2.5 GB with the formulas, 50 MB without).
To optimize further, I copied each entire sheet and pasted it back "as values" in Excel. I did that for all the sheets (the sketch below does the same thing programmatically).
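For anyone who prefers to do the "paste as values" step outside of Excel: as far as I know the Talend Excel components rely on Apache POI, and POI can replace every formula with its computed value. A rough one-off sketch (POI 4.x API, invented file names; this still loads the whole workbook in memory once, so treat it as a preparation step, not something to run inside each job):

```java
import java.io.File;
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.CellType;
import org.apache.poi.ss.usermodel.FormulaEvaluator;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

// Evaluate every formula in the workbook and replace it with its result,
// then save a "values only" copy that the jobs can read cheaply.
public class FlattenFormulas {
    public static void main(String[] args) throws Exception {
        try (Workbook wb = WorkbookFactory.create(new File("input.xlsx"))) {
            FormulaEvaluator eval = wb.getCreationHelper().createFormulaEvaluator();
            for (Sheet sheet : wb) {
                for (Row row : sheet) {
                    for (Cell cell : row) {
                        if (cell.getCellType() == CellType.FORMULA) {
                            eval.evaluateInCell(cell); // replaces the formula with its value
                        }
                    }
                }
            }
            try (FileOutputStream out = new FileOutputStream("input_values_only.xlsx")) {
                wb.write(out);
            }
        }
    }
}
```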

Execution now takes around 12 seconds against my local DB instead of 14 minutes.
In the customer's environment it's 10 minutes instead of 30+ minutes.
The RAM used for the entire Excel file dropped from 2 GB to 1 GB, which is perfect!

If you run into this case, don't forget to optimize your code, but check your input file at the same time! Excel is always guilty.