Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Is Talend making any effort to improve Talend performance with MS SQL Server?
Having searched this community and old talendforge forums entries, it appears there are performance issues with Talend using SQL Server. I found this article (Data loading from MS SQL Server DB to other MS SQL Server DB is very low compared to SSIS) which confirms my own experience in comparing Talend to Datastage (2005 Server Edition -- an old, no longer supported version).
I built a Talend job to mirror what the old Datastage job is doing -- syncing content between two tables: a source table on one SQL Server and a target table on another SQL Server. The data is about 220k rows. The Datastage jobs does this (with MSOLEDB stage) in < 2 minutes with 2238 rows/sec. Talend is poking along (with tMSSqlOutput component) and barely able to make less than 30 rows/sec! Really?? Almost 100x slower? What the heck is it doing?
I tried changing the batch size, changing to use a component connection and changing the Commit Every setting; tried using select top to limit rows (just to see the impact). None of these made significant differences in performance. Still achieves < 30 rows/sec.
I chased this problem down for the past day and a half. The issue is that Talend recommends downloading an old JDBC driver from Microsoft. I downloaded the latest https://docs.microsoft.com/en-us/sql/connect/jdbc/download-microsoft-jdbc-driver-for-sql-server?view... Version 7 driver and magic!
To uninstall the old SQL Server module (probably v6.x) follow these instructions https://community.talend.com/t5/Design-and-Development/uninstall-an-external-jar/td-p/30541 aka delete the folder labeled studio\configuration\.m2\repository\org\talend\libraries\mssql-jar
Hello everyone !
I'm digging up the topic since, the issue is not solved !
I'm facing real issues with UPSERT in MSSQL in TOS 7.x...
I read the whole topic and managed to do some improvements :
And finally I analyzed fine Events in SQLServer logs :
As you can see, the statement is "instant" for it's execution, but the latency between the Talend runtime and the Database is the real problem : 3 milliseconds.
It's all it takes to go down drastically in performance. You cannot basically go faster than 300/500 row per sec
I know pretty much how Datastage works, and I now understand why Talend is so slow on MSSQL Server !!
Datastage approach :
Talend approach :
In terms of performance, with a pretty slow machine these are the metrics :
You can do whatever you want, UPSERT is never going to be a valid option with the current JTDS version (which is quite old now) because of the batch size.
You have to either :
Delimited ==> DBOutput (Insert) ==> DBOuput (Update).
I'm going to try this one because it's to **bleep** painful to do a lookup for each table you are going to update ...
And even there, I'm not sure that the update will be fast (insert are faster in jTDS)
EDIT : as stated before, it's indeed far better with INSERT => UPDATE method (it's instant). But it creates a ton of logs ...
And with Insert if not exists :
Performance drops again.
Ok, I figured it out !
It's pretty simple in the end ... sadly :
For having precise numbers :
It takes :
It's a clear flaw in design : the connector should declare all asked statements even if no lines are used...
Is there a way to change the component to force creation of the prepared statement (sp prepare) ?
It seems mandatory to me...
I think I understand what you are saying, but I am curious by what you mean by....
"It's a clear flaw in design : the connector should declare all asked statements even if no lines are used...
Is there a way to change the component to force creation of the prepared statement (sp prepare) ?"
Are you asking for an option to force all DB components to communicate regarding the creation of prepared statements?
In some way yes. But not really.
Like "Prepare statement" checkbox, which can be even unchecked for a simple insertion where it's not needed.
If checked, it prepares all selected statement for this component (DBOutput).
I hope it's clearer !
Sincerely,
Ah I see. That is certainly an interesting and useful feature request. I will point R&D to this thread. Could I ask you to raise this as a Jira? Coming from a customer might give it that extra sense of importance 🙂
Sure !
I'll prepare a full detailed request for this.
It's so usefull in many cases that I don't see why it's not already here ^^
Hi @rhall ,
I've created an issue in JIRA.
Can you check if I submitted it properly ?
Performance issue are very serious when on upsert... it's a factor x1000
And so my jobs are ... I need 45 minutes to run my simulation, instead of 1 or 2 minutes (20 runs of 8 jobs lasting normally around 2-3 seconds - 10 max).
Actually each batch of 8 jobs take around 2-3 minutes depending greatly on the volume of data.
Thank you in advance.