Hi there,
I think I'm doing something wrong, but I don't know what.
Here is my job:
The tDBInput_1 does a full select of the source table.
The tMap component routes the columns to the components HashTheKey (which produces a hash over the two key columns) and HashTheRow (which produces a hash over all the columns).
At the end I want to update the rows I have selected with the two columns HashKey and HashRow. This update is on the same table that is used in tDBInput_1.
The table in the database has primary key columns defined, including a clustered index on those fields. Throughout the job the fields are marked as key.
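The hashing logic itself isn't shown in the screenshots; as a minimal sketch, a Talend routine along these lines is the kind of thing HashTheKey and HashTheRow could call from a tMap expression. The class name, method name, and the choice of MD5 are assumptions, not what the job necessarily uses:

```java
package routines;

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class HashRoutines {

    // Hypothetical helper: hex-encoded MD5 over the concatenated column values.
    // A separator is appended so ("ab","c") and ("a","bc") do not collide.
    public static String md5OfColumns(String... values) {
        StringBuilder sb = new StringBuilder();
        for (String v : values) {
            sb.append(v == null ? "" : v).append('|');
        }
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(sb.toString().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }
}
```

In the tMap, HashKey would then be something like `HashRoutines.md5OfColumns(row1.keyCol1, row1.keyCol2)` and HashRow the same call over all columns (the column names here are placeholders).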
After waiting a long time (more than 1.5 minutes) the execution looks like this:
My questions are:
1. Why are only 1000 rows selected?
2. Why does it take so long to update just 1000 rows?
3. Is there a workaround?
4. Does anybody know what mistake I'm making?
Please help!
Thanks in advance.
Jacco
Okay, answering my own question... I found out why only 1000 rows seem to be selected. It has to do with the batch size of the tDBOutput component. I changed that to 10000, and now 10000 rows are selected.
But the problem of the long wait still remains 😔
Well spotted regarding the commit. Sorry I didn't see this earlier.
About the time it takes, can you say whether the time is spent reading the data or writing it? You can test this by replacing the output with a tLogRow and running the job to see how fast the data is written to the output window. If it is definitely the writing, could you show us your output DB component settings?
I should say that you will likely get better performance if you allow the DB components to create their own connections and allow the output component to commit by itself. It doesn't look like you need to wait until the end to commit (going by your job design). This might be worth a try.
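Outside of the Talend components, the difference amounts to roughly this: commit every batch instead of holding one transaction open for the whole run. A minimal JDBC sketch, where the connection string, credentials, table, and column names are all placeholders and the Microsoft JDBC driver is assumed to be on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Arrays;
import java.util.List;

public class BatchCommitSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string and credentials.
        Connection conn = DriverManager.getConnection(
                "jdbc:sqlserver://localhost;databaseName=demo", "user", "password");
        conn.setAutoCommit(false);

        // Placeholder rows: {key_col1, key_col2, HashKey, HashRow};
        // in the job these come out of the tMap.
        List<String[]> rows = Arrays.asList(
                new String[] {"A", "1", "a1b2c3", "d4e5f6"},
                new String[] {"B", "2", "0708ab", "cdef01"});

        PreparedStatement ps = conn.prepareStatement(
                "UPDATE my_table SET HashKey = ?, HashRow = ? "
                + "WHERE key_col1 = ? AND key_col2 = ?");

        int batchSize = 10000;
        int count = 0;
        for (String[] r : rows) {
            ps.setString(1, r[2]);
            ps.setString(2, r[3]);
            ps.setString(3, r[0]);
            ps.setString(4, r[1]);
            ps.addBatch();
            if (++count % batchSize == 0) {
                ps.executeBatch();
                conn.commit();   // commit per batch instead of once at the very end
            }
        }
        ps.executeBatch();       // flush and commit the remainder
        conn.commit();
        conn.close();
    }
}
```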
Thanks for the reply!
I tried to strip the job down to its essence. It looks like this now:
When executing the job, it hangs on the first 10000 rows, as shown above.
After waiting half a minute the next 10000 come through, then there is a long wait, and then the following 10000 again.
When I look at the Activity Monitor in SQL Server Management Studio, I see that the insert is an expensive query. The execution plan looks like this:
Using an index on the columns, the execution plan looks like this:
What seems strange to me is that the index is being updated, so now I'm lost.
Jacco
Just so that I understand: I take it that you have set the component to commit by itself here and to use its own connection? If so, did you see any improvement?
Also, how big is the table being updated? Updates generally do take a bit longer, but from what you said I presume you already knew that. How often is the table reindexed? Your plans do look expensive for a simple update. Do you have similar performance issues if you replicate the updates using just SQL Server (i.e. without Talend)?
Another thing you can try is to modify the batch size (advanced settings). Also, are you using the "update or insert" or "insert or update" option for the table action? This can put a pretty heavy load on the job if the table is massive. In that case it is more efficient to determine whether each row is an insert or an update before loading it, and then use two DB components, one to insert and one to update, as sketched below.
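For illustration, the pre-classification could look roughly like this outside Talend (in the job itself a tMap lookup join plays the role of the key set). Table, column, and class names are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SplitInsertUpdateSketch {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:sqlserver://localhost;databaseName=demo", "user", "password");

        // 1. Load the existing keys once (what a tMap lookup would normally do).
        Set<String> existingKeys = new HashSet<>();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT key_col1, key_col2 FROM my_table")) {
            while (rs.next()) {
                existingKeys.add(rs.getString(1) + "|" + rs.getString(2));
            }
        }

        // Placeholder incoming rows: {key_col1, key_col2}.
        List<String[]> incoming = Arrays.asList(
                new String[] {"A", "1"}, new String[] {"B", "2"});

        // 2. Route each row to an insert flow or an update flow instead of
        //    letting one output component decide with "insert or update".
        List<String[]> toInsert = new ArrayList<>();
        List<String[]> toUpdate = new ArrayList<>();
        for (String[] row : incoming) {
            if (existingKeys.contains(row[0] + "|" + row[1])) {
                toUpdate.add(row);   // feed a DB component set to plain UPDATE
            } else {
                toInsert.add(row);   // feed a DB component set to plain INSERT
            }
        }
        conn.close();
    }
}
```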
Yes, the components have their own connection.
Right now the table holds about 900,000 records, but this is the test phase. In production there will be an estimated 5 million records in the table, which need to be updated with a hash over the key fields and a hash over the whole record.
The table is not reindexed often. The index is only created once, after the initial insert of all the records.
Could it be a problem that the index is on two character fields (one is char, the other varchar)? I will test whether an integer index speeds things up a bit.
I will also check whether I have the same performance issue with another job doing the same actions.
I think I've got it!
I changed the initial load of the table and added a new identity field with auto-increment.
Created a clustered index on the new ID field.
Used the ID field as the key in my job, and... it went from 145 rows/s to 8192 rows/s!
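For reference, the schema change described above amounts to roughly the following, here expressed as T-SQL run through JDBC. The table, column, and index names are placeholders, and it assumes the old clustered index has already been dropped or rebuilt as nonclustered so the new ID can become the clustered key:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class IdentityKeySketch {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:sqlserver://localhost;databaseName=demo", "user", "password");
        try (Statement st = conn.createStatement()) {
            // Add an auto-incrementing surrogate key.
            st.execute("ALTER TABLE my_table ADD id BIGINT IDENTITY(1,1) NOT NULL");
            // Cluster the table on the new ID (assumes no other clustered index remains).
            st.execute("CREATE UNIQUE CLUSTERED INDEX ix_my_table_id ON my_table (id)");
        }
        conn.close();
    }
}
```

The update flow can then key on the ID column, so each row is a single integer seek against the clustered index instead of a lookup on the two character columns.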
Thanks for the hints!
Jacco
I think you did all the work there 😉
Glad you got it sorted and thanks for coming back to give your solution!