Solved: Slow SQL Server updates - Page 4 - Qlik Community

Anonymous · ‎2018-03-21

Is Talend making any effort to improve Talend performance with MS SQL Server?

Having searched this community and old talendforge forums entries, it appears there are performance issues with Talend using SQL Server. I found this article (Data loading from MS SQL Server DB to other MS SQL Server DB is very low compared to SSIS) which confirms my own experience in comparing Talend to Datastage (2005 Server Edition -- an old, no longer supported version).

I built a Talend job to mirror what the old Datastage job is doing -- syncing content between two tables: a source table on one SQL Server and a target table on another SQL Server. The data is about 220k rows. The Datastage jobs does this (with MSOLEDB stage) in < 2 minutes with 2238 rows/sec. Talend is poking along (with tMSSqlOutput component) and barely able to make less than 30 rows/sec! Really?? Almost 100x slower? What the heck is it doing?

I tried changing the batch size, changing to use a component connection and changing the Commit Every setting; tried using select top to limit rows (just to see the impact). None of these made significant differences in performance. Still achieves < 30 rows/sec.

Anonymous · ‎2018-09-18

I chased this problem down for the past day and a half. The issue is that Talend recommends downloading an old JDBC driver from Microsoft. I downloaded the latest https://docs.microsoft.com/en-us/sql/connect/jdbc/download-microsoft-jdbc-driver-for-sql-server?view... Version 7 driver and magic!

To uninstall the old SQL Server module (probably v6.x) follow these instructions https://community.talend.com/t5/Design-and-Development/uninstall-an-external-jar/td-p/30541 aka delete the folder labeled studio\configuration\.m2\repository\org\talend\libraries\mssql-jar

rmartin2 · ‎2019-08-29

Hello everyone !

I'm digging up the topic since, the issue is not solved !

I'm facing real issues with UPSERT in MSSQL in TOS 7.x...

I read the whole topic and managed to do some improvements :

Using NONCLUSTERED index is better than CLUSTERED in the end, and you still have a sorted index
I read some interesting digging up there : https://github.com/microsoft/mssql-jdbc/issues/166 : basically, JTDS, in some cases, JTDS was unpreparing the statemement at each row.
putting "not updatable" on my primary key nearly doubled my throughput (from 120 to 250 row/s still...)

And finally I analyzed fine Events in SQLServer logs :

As you can see, the statement is "instant" for it's execution, but the latency between the Talend runtime and the Database is the real problem : 3 milliseconds.

It's all it takes to go down drastically in performance. You cannot basically go faster than 300/500 row per sec

I know pretty much how Datastage works, and I now understand why Talend is so slow on MSSQL Server !!

Datastage approach :

You declare a "batch size" (generally 2k to 10k)
Those 2k lines are sent in a single batch (an array of prepared statements in each insert and update statements):
- Step 1 : update statements are sent
- Step 2 : all failed updates are marked and corresponding insert statements are sent
UPDATE then Insert is prefered to INSERT then UPDATE, because UPDATE is much more faster (not changing primary key)
It's up to you to not have a duplicate key in your batch

Talend approach :

You can't use batch size in update/Insert operations : it's a 1 row batch
It's send as a prepared statement atomically
if the insert or update fails, it launch the other statement
you don't have to worry about having duplicate key in you your input data

In terms of performance, with a pretty slow machine these are the metrics :

Datastage SE : 2k to 5k row/s
Datastage PX : 100k to 400k row/s (depends on how many parallel channels you're using)
Talend : 50 row/s

You can do whatever you want, UPSERT is never going to be a valid option with the current JTDS version (which is quite old now) because of the batch size.

You have to either :

separate insert and update
use insert first, then a second connector just after like that :

Delimited ==> DBOutput (Insert) ==> DBOuput (Update).

I'm going to try this one because it's to **bleep** painful to do a lookup for each table you are going to update ...

And even there, I'm not sure that the update will be fast (insert are faster in jTDS)

EDIT : as stated before, it's indeed far better with INSERT => UPDATE method (it's instant). But it creates a ton of logs ...

And with Insert if not exists :

Performance drops again.

rmartin2 · ‎2019-08-29

Ok, I figured it out !

It's pretty simple in the end ... sadly :

if you use a "many operations" like update then insert, update then insert, or insert if not exists, it does not prepare the statement. It only performs a "sp_execute"
if you use a "single operation" option, it prepares your statement ("sp_prepare"), and then is followed by "sp_execute" which is much more faster !!

For having precise numbers :

It takes :

It takes 4110 milliseconds to insert/update 7k row with the prepared statement (creating a bunch of PK exception, so you can account for less in reality)
it takes 25 seconds (tStatCatcher) to do the same without prepared statement

It's a clear flaw in design : the connector should declare all asked statements even if no lines are used...

Is there a way to change the component to force creation of the prepared statement (sp prepare) ?

It seems mandatory to me...

Anonymous · ‎2019-08-29

I think I understand what you are saying, but I am curious by what you mean by....

"It's a clear flaw in design : the connector should declare all asked statements even if no lines are used...

Is there a way to change the component to force creation of the prepared statement (sp prepare) ?"

Are you asking for an option to force all DB components to communicate regarding the creation of prepared statements?

rmartin2 · ‎2019-08-29

In some way yes. But not really.

Like "Prepare statement" checkbox, which can be even unchecked for a simple insertion where it's not needed.

If checked, it prepares all selected statement for this component (DBOutput).

I hope it's clearer !

Sincerely,

Anonymous · ‎2019-08-29

Ah I see. That is certainly an interesting and useful feature request. I will point R&D to this thread. Could I ask you to raise this as a Jira? Coming from a customer might give it that extra sense of importance 🙂

rmartin2 · ‎2019-08-30

Sure !

I'll prepare a full detailed request for this.

It's so usefull in many cases that I don't see why it's not already here ^^

rmartin2 · ‎2019-09-03

Hi @rhall ,

I've created an issue in JIRA.

Can you check if I submitted it properly ?

Performance issue are very serious when on upsert... it's a factor x1000

And so my jobs are ... I need 45 minutes to run my simulation, instead of 1 or 2 minutes (20 runs of 8 jobs lasting normally around 2-3 seconds - 10 max).

Actually each batch of 8 jobs take around 2-3 minutes depending greatly on the volume of data.

Thank you in advance.

Anonymous · ‎2019-09-03

Hi @mhodent could you give me a link and I will check this tomorrow first thing

Regards

Richard

rmartin2 · ‎2019-09-06

Hi,

Sorry, I was a bit overcharged by all other topics I'm working on

Here it is : https://jira.talendforge.org/browse/TDI-42913

Thank you in advance.

I found the piece of code responsible for building the sp_prepare, but I'm not confident enough to modify the code directly. you may help me through it ?

Slow SQL Server updates

Talend Data Integration

v6.x

Related Topics