Hello all,
--- Short version ---
I'm running into a major optimisation problem: reading the same Excel 2007 file multiple times causes an OutOfMemory error.
How can I improve performance by reading the Excel file only once per sheet, rather than once per table to insert into?
--- The long version ---
I'm encountering an "out of memory" error in Talend. I initially solved it by increasing Xmx and Xms in my .ini file.
But the error is back, because my program will not be launched from my computer but from a server that only has half of the memory I allotted on my local machine.
There's no possibility to increase Xmx and Xms on the server, so my only option is to optimize my job (in my opinion, but if you have a better idea I'll take it).
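(For reference, here is a tiny self-contained Java snippet I can use to check the heap limit the job actually runs with on the server; it's only an illustration, the class name is made up, and the same lines could just as well be pasted into a tJava component.)

public class HeapCheck {
    public static void main(String[] args) {
        // Print the heap limits the current JVM is running with (maxMemory roughly corresponds to -Xmx).
        Runtime rt = Runtime.getRuntime();
        System.out.println("Max heap (Xmx): " + rt.maxMemory() / (1024 * 1024) + " MB");
        System.out.println("Allocated heap: " + rt.totalMemory() / (1024 * 1024) + " MB");
        System.out.println("Free heap:      " + rt.freeMemory() / (1024 * 1024) + " MB");
    }
}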
My jobs do the following:
- Reading an Excel 2007 file (.xlsx)
- Transforming some data to comply with business rules
- Inserting in a table
- Committing
First optimization problem:
I'm reading the Excel file at the beginning of each job.
In my file, I have columns, let's say A,B,C,D and E.
I need to insert the data from A,B,C into tableA.
Then inserting data from D into tableB.
And inserting data from E into tableC.
This order cannot be changed: because of the foreign key constraints in my database, I need to insert the data in a specific order.
For the rest of my program I have many other tables, with many columns, and I need to do the same; that's why I need to optimize the process.
So I read the Excel file and insert into tableA; then I call a subjob which inserts the data from column D into tableB, and then another subjob which inserts column E into tableC.
In the end I have 17 jobs, each reading the same Excel file (but not the same sheets).
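(To spell out why the order is fixed, here is a minimal hypothetical example of the kind of schema involved, written as a plain-Java/JDBC sketch against an in-memory H2 database; all table and column names are invented for illustration.)

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class FkOrderDemo {
    public static void main(String[] args) throws Exception {
        // In-memory H2 database, only used to demonstrate the ordering constraint.
        try (Connection con = DriverManager.getConnection("jdbc:h2:mem:demo");
             Statement st = con.createStatement()) {
            st.execute("CREATE TABLE tableA (id INT PRIMARY KEY, b VARCHAR(50), c VARCHAR(50))");
            st.execute("CREATE TABLE tableB (id INT PRIMARY KEY, d VARCHAR(50), a_id INT REFERENCES tableA(id))");

            // This order works: the parent row exists before the child row references it.
            st.execute("INSERT INTO tableA VALUES (1, 'b-value', 'c-value')");
            st.execute("INSERT INTO tableB VALUES (1, 'd-value', 1)");

            // Inserting into tableB first would fail with a foreign key violation,
            // which is exactly what happens when the outputs run in parallel.
        }
    }
}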
See the attachments with these 3 jobs.
First job
---
Second job
---
Third job
---
How can I optimize this pattern ?
I have already tried multiple solutions:
First, rather than inserting into tableA, then tableB, then tableC, I used a tMap with Main order 1 to insert into tableA, Main order 2 for tableB and Main order 3 for tableC.
But it doesn't respect the order: Talend just parallelizes the flows, so my job inserts into tableB before tableA and tableC, which causes database errors due to the foreign key constraints.
Second, I used the tReplicate component to duplicate the input flow into three, which lets me avoid reading the Excel file 3 times; but the behaviour is the same as with tMap, the order is not respected, and I have the same issue...
Third, I grouped the inserts for my 3 tables A, B and C into the same job, so I avoid repeatedly connecting to and disconnecting from the database.
Committing the inserts without closing the connection works on my machine; but when I build the job and launch the .sh, I get the OutOfMemory error.
Fourth, same as the third solution, but I built it with the tFileInputExcel component instead of the current tFileExcelWorkbookOpen -> tFileExcelSheetInput components.
Right now I don't really know what to do. I've heard about the tHash components and tFlowToIterate/tFixedFlowInput, but I've never used them. Do you think they might be a good alternative?
I also have around 3 tLogRow components per job; might that be a bad habit too?
And all the console output is exported to a log file in a specific folder.
If you have an idea, whatever passes through your mind, I'll be glad to hear it and test it ASAP! I'll take any optimization I can get!
Thanks in advance. If you need more details, tell me and I'll add more screenshots, etc.
I am unsure about a few things; can you clarify, please?
1) Are the jobs called by subsequent jobs? So does job 1 start job 2 using a tRunJob and does job 2 start job 3 using a tRunJob?
2) What components are you using? I can't quite make them out from your images. You appear to maybe have an Excel component iterating straight to another file component. This looks weird.
3) How are the components you are iterating to configured? How are you passing the data and are they set to be created brand new on each iteration?
If you could answer the above it should give me a better idea. It may also lead to more questions though 🙂
Sorry, I just saw the tags on your last post. That answers the component questions. @lli may be able to advise as I believe you are using his components. Could you please answer the other questions?
An observation: you appear to be using OnComponentOK links where it would probably be better to use OnSubJobOK links. You certainly need to adjust this; at the moment the flow of your job is very difficult to follow.
Thanks for your answer!
Sure, let me clarify your questions.
@rhall wrote:
1) Are the jobs called by subsequent jobs? So does job 1 start job 2 using a tRunJob and does job 2 start job 3 using a tRunJob?
That's right, my first job starts job 2 (the subjob called "sports") if the component "Commit t1" succeeds.
Then job 2 starts job 3, etc.
@rhall wrote:
2) What components are you using? I can't quite make them out from your images. You appear to maybe have an Excel component iterating straight to another file component. This looks weird.
That's right, these components come from Talend Exchange!
@rhall wrote:
3) How are the components you are iterating to configured? How are you passing the data and are they set to be created brand new on each iteration?
The first component is tFileExcelWorkbookOpen, which contains the path to my Excel file.
The second one, tFileExcelSheetInput, is configured like this:
"Structur my sport" is the name of the sheet inside the Excel file.
"4" is the header setting: it starts reading the data at line 4 of my Excel file.
"R" is the column the data is read from.
See the screenshot below.
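(Roughly, this configuration amounts to something like the following plain-Java sketch using Apache POI; this is only an illustration of the settings, not the component's actual code, and the file path is made up.)

import java.io.File;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;
import org.apache.poi.ss.util.CellReference;

public class ReadSportSheet {
    public static void main(String[] args) throws Exception {
        // Open the workbook once (conceptually what tFileExcelWorkbookOpen does).
        try (Workbook wb = WorkbookFactory.create(new File("/path/to/file.xlsx"))) {
            Sheet sheet = wb.getSheet("Structur my sport");           // sheet name from the config
            int firstDataRow = 4;                                     // reading starts at line 4
            int columnR = CellReference.convertColStringToIndex("R"); // column "R"
            for (int r = firstDataRow - 1; r <= sheet.getLastRowNum(); r++) {
                Row row = sheet.getRow(r);
                if (row == null) continue;                            // skip empty rows
                Cell cell = row.getCell(columnR);
                System.out.println(cell == null ? "" : cell.toString());
            }
        }
    }
}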
@rhall wrote:
3) How are you passing the data and are they set to be created brand new on each iteration?
I think this is a good point: the file might be read again at each iteration. When I run this job I can see the iteration count increasing over time, and it takes a minute and a half to read the entire file (with another sheet of 1600 lines).
But when I do the same with tFileInputExcel, it takes 4 seconds.
@rhall wrote:
An observation: you appear to be using OnComponentOK links where it would probably be better to use OnSubJobOK links. You certainly need to adjust this; at the moment the flow of your job is very difficult to follow.
Thank you, I changed all the links I could to OnSubJobOK instead of OnComponentOK.
OK, we might have found a way in which your job is not very efficient. I'm not entirely sure what the iterate link from the tFileExcelWorkbookOpen component will do, but since you have hardcoded your sheet in the next component, you should probably use another link. The tFileExcelWorkbookOpen simply provides a "connection" to your Excel file (by the look of it). I presume that this could go at the very beginning of your job (along with any other connections). You can connect these with OnSubJobOK links.
The other issue I have spotted is that you are opening this Excel file multiple times at once. You open it for each job, and since each of the jobs is "inside" another one, you are being really inefficient with your memory. Why do you need 3 jobs? Could you not retrieve all of the data in one job? There are components called tHashInput and tHashOutput. These store data in memory. What you could do is load the data into a tHashOutput (one per sheet, maybe) and then read it back using corresponding tHashInput components in the same job. You can load the data in order this way by reading from the tHashInputs in order. Something like this flow:
DB Connection
|
OnSubJobOK
|
tFileExcelWorkbookOpen
|
OnSubJobOK
|
tFileExcelSheetInput ----> tHashOutput (Sheet1)
|
OnSubJobOK
|
tFileExcelSheetInput ----> tHashOutput (Sheet2)
|
OnSubJobOK
|
tFileExcelSheetInput ----> tHashOutput (Sheet3)
|
OnSubJobOK
|
tHashInput ----> Database (Sheet 1)
|
OnSubJobOK
|
tHashInput ----> Database (Sheet 2)
|
OnSubJobOK
|
tHashInput ----> Database (Sheet 3)
The above is obviously a very high level description. But what you are doing there is reading the data only once from the Excel file and only creating one connection.
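(To make the pattern concrete outside of Talend, here is a rough plain-Java sketch of the same idea: open the workbook once, read each sheet once into memory, then insert in foreign-key order over a single connection with one commit. All file paths, sheet names, table names and the JDBC URL are placeholders, and this is not the code the components actually generate.)

import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public class LoadWorkbookOnce {
    public static void main(String[] args) throws Exception {
        List<List<String>> sheet1 = new ArrayList<>();
        List<List<String>> sheet2 = new ArrayList<>();

        // 1. Open the workbook once and read every sheet needed (the role the tHashOutputs play).
        try (Workbook wb = WorkbookFactory.create(new File("/path/to/file.xlsx"))) {
            readSheet(wb.getSheet("Sheet1"), sheet1);
            readSheet(wb.getSheet("Sheet2"), sheet2);
        } // the Excel file is closed here; everything below works from memory

        // 2. One database connection, inserts in foreign-key order, one commit at the end.
        try (Connection con = DriverManager.getConnection("jdbc:...", "user", "password")) {
            con.setAutoCommit(false);
            insert(con, "INSERT INTO tableA (a) VALUES (?)", sheet1);
            insert(con, "INSERT INTO tableB (d) VALUES (?)", sheet2);
            con.commit();
        }
    }

    private static void readSheet(Sheet sheet, List<List<String>> target) {
        DataFormatter fmt = new DataFormatter();
        for (Row row : sheet) {
            List<String> values = new ArrayList<>();
            for (Cell cell : row) values.add(fmt.formatCellValue(cell));
            target.add(values);
        }
    }

    private static void insert(Connection con, String sql, List<List<String>> rows) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            for (List<String> row : rows) {
                ps.setString(1, row.get(0));
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }
}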
Your design is OK, but actually you do not need to cache the data in the tHash components. You can write it directly in the flow, starting from the tFileExcelSheetInput components.
You're right @lli, I was thinking from the perspective of the Talend Excel components (not yours) and that the data would be used multiple times. Thanks for picking that up 😉
Hello,
Once again, thanks for your responses
@rhall I tried your design and followed @lli's tips, so I didn't use the tHash components, and it works perfectly on my machine!
I have to wait until the end of the week to test it on the server, because my colleague is the only one who can access it, but I'm very confident about the result. I'll accept your solution as the answer!
More info about the optimization results:
So far I've run 3 tests. The first one, with all the components (shown in the previous attachments), took 4 GB of RAM and failed with the OutOfMemory error.
For the second one, with the first optimization, I just removed many of the tFileExcelWorkbookOpen components as you recommended; here's the graph in jvisualvm.
There is still a peak, which corresponds to some other jobs. I decided to refactor that code too, and this is the final result:
From 4 GB to 1.2 GB and 50% less time, just for one run.