I am trying to create multiple worksheets in an Excel workbook, where the number of worksheets is not known at the start of the job: it depends on the values of a field from my Oracle input. I followed the post to design this, and it works well when there is very little data. But with 18k records from my tOracleInput, a whole night's run completes only about 500 iterations.
I tried modifying the approach using tOracleRow (with a prepared statement) and tParseRecordSet, but saw no improvement. I have attached the designs for both approaches.
The rest of the job completes in about 10 minutes; only the final Excel creation part is now the problem. How can I improve the performance? My tOracleInput runs a simple SELECT query that returns 18k records.
There may be a component in the Exchange which will do this for you. I believe @lli has built a couple of Excel components. Maybe he can point you in the right direction.
Alternatively, if you want to use standard Talend components, you can do this by grouping your data by prospective "sheet" in your query, then running a query for each group and loading the Excel worksheets in an iterative fashion. For example, let's say I have some data that looks like this....
Class | Name | Age |
A | Ben | 12 |
A | John | 12 |
A | Jane | 12 |
A | Gemma | 12 |
C | Mohammad | 13 |
D | Abi | 13 |
B | Mike | 11 |
B | Safron | 13 |
B | Jade | 14 |
B | Abdul | 12 |
C | Henry | 13 |
C | Sam | 13 |
C | Mikela | 12 |
D | Seb | 13 |
D | Giles | 12 |
D | Jess | 13 |
D | Rachel | 12 |
Now, if I want a sheet per group, I can do it this way.
tDBInput_1-----> tFlowToIterate_1--iterate--->tDBInput_2------>tFileOutputExcel
The tDBInput_1 just performs a query on the data above and returns a distinct list of the classes. That is it. It passes that to the tFlowToIterate.
The tFlowToIterate allows us to iterate over the classes for the next step AND saves the current class to the globalMap. This can be used in the WHERE clause of tDBInput_2, which will return all records for that class. It can also be used in the Excel file sheet name.
When this job runs, it will go through an iteration for each class, and it is this that will allow you to dynamically add your sheets.
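The same pattern can be sketched outside Talend. This is a minimal illustration using Python's built-in sqlite3 as a stand-in for the Oracle source (the table and column names are made up): the first query returns only the DISTINCT groups, and each iteration then runs a bulk query parameterised by the current group, analogous to referencing the globalMap value in tDBInput_2's WHERE clause.

```python
import sqlite3

# In-memory stand-in for the Oracle source (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (class TEXT, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [("A", "Ben", 12), ("A", "John", 12), ("B", "Mike", 11),
     ("B", "Safron", 13), ("C", "Mohammad", 13), ("D", "Abi", 13)],
)

# tDBInput_1 equivalent: one small query returning just the distinct groups.
classes = [row[0] for row in
           conn.execute("SELECT DISTINCT class FROM students ORDER BY class")]

# tFlowToIterate equivalent: iterate over the groups only, never over rows.
sheets = {}
for cls in classes:
    # tDBInput_2 equivalent: one bulk query per group, filtered by the
    # current group (the Talend version reads this value from the globalMap).
    rows = conn.execute(
        "SELECT name, age FROM students WHERE class = ? ORDER BY name", (cls,)
    ).fetchall()
    sheets[cls] = rows  # tFileOutputExcel would append these to sheet `cls`

print(len(classes), "iterations for", sum(len(r) for r in sheets.values()), "rows")
```

Note that the iteration count equals the number of sheets, while each sheet's rows arrive as a single bulk result set.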
Thank you for your response.
For very small data sets, tFlowToIterate works absolutely great. But even with 18k records, tDBInput_1 --> tFlowToIterate processes only around 1-2 rows/second. Is this the expected performance? I have used "Use the default (key, value) in global variables" in the tFlowToIterate basic settings.
You only iterate over the groups. The main bulk of your data rows should be supplied by the queries that are triggered by each iteration. For example, given the dataset I gave you earlier.....
Iteration 1: You will have a query for class A, which will return rows for every student in class A.
Iteration 2: You will have a query for class B, which will return rows for every student in class B.
Iteration 3: You will have a query for class C, which will return rows for every student in class C.
Iteration 4: You will have a query for class D, which will return rows for every student in class D.
Iterating is a little slower, but you will only be iterating over the classes (in this example). All of the data will be returned per class as a normal flow. You are unlikely to notice any performance issues.
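To make the scaling point concrete, here is a small sketch (with invented data mirroring the thread's example) showing that even with 18,000 rows, only one iteration per class is needed; the rows themselves move in bulk.

```python
from collections import defaultdict

# Hypothetical data: 18,000 rows spread evenly over the four classes A-D.
rows = [("ABCD"[i % 4], f"student_{i}", 10 + i % 5) for i in range(18_000)]

# Group once; each group becomes one sheet and hence one "iteration".
groups = defaultdict(list)
for cls, name, age in rows:
    groups[cls].append((name, age))

# Only len(groups) iterations are needed, however many rows there are.
for cls, members in sorted(groups.items()):
    pass  # write all of `members` to sheet `cls` as one bulk flow

print(f"{len(groups)} iterations, {len(rows)} rows")
```

Four iterations instead of eighteen thousand is why the per-group design avoids the row-by-row slowdown seen in the original job.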