Skip to main content
Announcements
SYSTEM MAINTENANCE: Thurs., Sept. 19, 1 AM ET, Platform will be unavailable for approx. 60 minutes.
cancel
Showing results for 
Search instead for 
Did you mean: 
Anonymous
Not applicable

Slow SQL Server updates

Is Talend making any effort to improve Talend performance with MS SQL Server?

 

Having searched this community and old talendforge forums entries, it appears there are performance issues with Talend using SQL Server.  I found this article (Data loading from MS SQL Server DB to other MS SQL Server DB is very low compared to SSIS) which confirms my own experience in comparing Talend to Datastage (2005 Server Edition -- an old, no longer supported version).

 

I built a Talend job to mirror what the old Datastage job is doing -- syncing content between two tables:  a source table on one SQL Server and a target table on another SQL Server.  The data is about 220k rows.  The Datastage jobs does this (with MSOLEDB stage) in < 2 minutes with 2238 rows/sec.  Talend is poking along (with tMSSqlOutput component) and barely able to make less than 30 rows/sec!  Really??  Almost 100x slower?  What the heck is it doing?

 

I tried changing the batch size, changing to use a component connection and changing the Commit Every setting; tried using select top to limit rows (just to see the impact).  None of these made significant differences in performance.  Still achieves < 30 rows/sec.

Labels (2)
43 Replies
Anonymous
Not applicable
Author

I chased this problem down for the past day and a half.  The issue is that Talend recommends downloading an old JDBC driver from Microsoft.  I downloaded the latest https://docs.microsoft.com/en-us/sql/connect/jdbc/download-microsoft-jdbc-driver-for-sql-server?view...  Version 7 driver and magic!  0683p000009MACn.png 0683p000009MACn.png 0683p000009MACn.png 0683p000009MACn.png 0683p000009MACn.png 0683p000009MACn.png 0683p000009MACn.png 0683p000009MACn.png 0683p000009MACn.png 

 

To uninstall the old SQL Server module (probably v6.x) follow these instructions https://community.talend.com/t5/Design-and-Development/uninstall-an-external-jar/td-p/30541  aka delete the folder labeled studio\configuration\.m2\repository\org\talend\libraries\mssql-jar

rmartin2
Creator II
Creator II

Hello everyone !

 

I'm digging up the topic since, the issue is not solved !

I'm facing real issues with UPSERT in MSSQL in TOS 7.x...

 

I read the whole topic and managed to do some improvements :

  • Using NONCLUSTERED index is better than CLUSTERED in the end, and you still have a sorted index
  • I read some interesting digging up there : https://github.com/microsoft/mssql-jdbc/issues/166 : basically, JTDS, in some cases, JTDS was unpreparing the statemement at each row.
  • putting "not updatable" on my primary key nearly doubled my throughput (from 120 to 250 row/s still...)

And finally I analyzed fine Events in SQLServer logs :

0683p000009M6pA.jpg

As you can see, the statement is "instant" for it's execution, but the latency between the Talend runtime and the Database is the real problem : 3 milliseconds.

It's all it takes to go down drastically in performance. You cannot basically go faster than 300/500 row per sec 0683p000009MPcz.png

 

I know pretty much how Datastage works, and I now understand why Talend is so slow on MSSQL Server !!

 

Datastage approach :

  • You declare a "batch size" (generally 2k to 10k)
  • Those 2k lines are sent in a single batch (an array of prepared statements in each insert and update statements):
    • Step 1 : update statements are sent
    • Step 2 : all failed updates are marked and corresponding insert statements are sent
  • UPDATE then Insert is prefered to INSERT then UPDATE, because UPDATE is much more faster (not changing primary key)
  • It's up to you to not have a duplicate key in your batch

Talend approach :

  • You can't use batch size in update/Insert operations : it's a 1 row batch
  • It's send as a prepared statement atomically
  • if the insert or update fails, it launch the other statement
  • you don't have to worry about having duplicate key in you your input data

 

In terms of performance, with a pretty  slow machine these are the metrics :

  • Datastage SE : 2k to 5k row/s
  • Datastage PX : 100k to 400k row/s (depends on how many parallel channels you're using)
  • Talend : 50 row/s

 

You can do whatever you want, UPSERT is never going to be a valid option with the current JTDS version (which is quite old now) because of the batch size.

You have to either :

  • separate insert and update
  • use insert first, then a second connector just after like that :

Delimited ==> DBOutput (Insert) ==> DBOuput (Update).

 

I'm going to try this one because it's to **bleep** painful to do a lookup for each table you are going to update ...

And even there, I'm not sure that the update will be fast (insert are faster in jTDS)

 

 

EDIT : as stated before, it's indeed far better with INSERT => UPDATE method (it's instant). But it creates a ton of logs ... 

0683p000009M6pF.jpg

 

And with Insert if not exists :

0683p000009M6QR.jpg

 

Performance drops again.

rmartin2
Creator II
Creator II

Ok, I figured it out !

 

It's pretty simple in the end ... sadly :

  • if you use a "many operations" like update then insert, update then insert, or insert if not exists, it does not prepare the statement. It only performs a "sp_execute"
  • if you use a "single operation" option, it prepares your statement ("sp_prepare"), and then is followed by "sp_execute" which is much more faster !!

 

For having precise numbers :

It takes :

  • It takes 4110 milliseconds to insert/update 7k row with the prepared statement (creating a bunch of PK exception, so you can account for less in reality)
  • it takes 25 seconds (tStatCatcher) to do the same without prepared statement

 

It's a clear flaw in design : the connector should declare all asked statements even if no lines are used...

Is there a way to change the component to force creation of the prepared statement  (sp prepare) ?

 

It seems mandatory to me...

 

Anonymous
Not applicable
Author

I think I understand what you are saying, but I am curious by what you mean by....

"It's a clear flaw in design : the connector should declare all asked statements even if no lines are used...

Is there a way to change the component to force creation of the prepared statement  (sp prepare) ?"

 

Are you asking for an option to force all DB components to communicate regarding the creation of prepared statements?

rmartin2
Creator II
Creator II

In some way yes. But not really.

 

Like "Prepare statement" checkbox, which can be even unchecked for a simple insertion where it's not needed.

If checked, it prepares all selected statement for this component (DBOutput).

 

I hope it's clearer !

 

Sincerely,

 

Anonymous
Not applicable
Author

Ah I see. That is certainly an interesting and useful feature request. I will point R&D to this thread. Could I ask you to raise this as a Jira? Coming from a customer might give it that extra sense of importance 🙂

rmartin2
Creator II
Creator II

Sure !

 

I'll prepare a full detailed request for this. 

It's so usefull in many cases that I don't see why it's not already here ^^

rmartin2
Creator II
Creator II

Hi @rhall ,

 

I've created an issue in JIRA.

Can you check if I submitted it properly ?

 

Performance issue are very serious when on upsert... it's a factor x1000 

And so my jobs are ... I need 45 minutes to run my simulation, instead of 1 or 2 minutes (20 runs of 8 jobs lasting normally around 2-3 seconds - 10 max).

Actually each batch of 8 jobs take around 2-3 minutes depending greatly on the volume of data.

 

Thank you in advance.

Anonymous
Not applicable
Author

Hi @mhodent could you give me a link and I will check this tomorrow first thing

 

Regards

 

Richard

rmartin2
Creator II
Creator II

Hi,

Sorry, I was a bit overcharged by all other topics I'm working on 0683p000009MACn.png

Here it is : https://jira.talendforge.org/browse/TDI-42913


Thank you in advance.

I found the piece of code responsible for building the sp_prepare, but I'm not confident enough to modify the code directly. you may help me through it ?