Anonymous
Not applicable

Slow SQL Server updates

Is Talend making any effort to improve Talend performance with MS SQL Server?

 

Having searched this community and old talendforge forums entries, it appears there are performance issues with Talend using SQL Server.  I found this article (Data loading from MS SQL Server DB to other MS SQL Server DB is very low compared to SSIS) which confirms my own experience in comparing Talend to Datastage (2005 Server Edition -- an old, no longer supported version).

 

I built a Talend job to mirror what the old Datastage job is doing -- syncing content between two tables: a source table on one SQL Server and a target table on another SQL Server.  The data is about 220k rows.  The Datastage job does this (with an MSOLEDB stage) in < 2 minutes at 2238 rows/sec.  Talend is poking along (with a tMSSqlOutput component) and barely manages 30 rows/sec!  Really??  Almost 100x slower?  What the heck is it doing?

 

I tried changing the batch size, changing to use a component connection, and changing the Commit Every setting; I also tried using SELECT TOP to limit rows (just to see the impact).  None of these made a significant difference in performance.  It still achieves < 30 rows/sec.

1 Solution

Accepted Solutions
Anonymous
Not applicable
Author

By adding "sendStringParametersAsUnicode=false" to the Advanced Settings of the update component, it now produces an sp_prepare using varchar(8000) instead of nvarchar(4000), and the performance difference was dramatic -- almost 4k rows/sec instead of a couple hundred.  The problem is that the clustered key index in SQL Server is varchar(), so it doesn't match the nvarchar() parameter value and forces a table scan.
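
For anyone who wants to see the mechanism outside of Talend, here is a minimal JDBC sketch of the same idea; the host, database and credentials are hypothetical, and the only relevant piece is the sendStringParametersAsUnicode=false URL property, which makes the driver bind string parameters as varchar rather than nvarchar so they can match a varchar index:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class UnicodeParamDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical host, database and credentials, for illustration only.
        // Without the property, the Microsoft driver binds string parameters as
        // nvarchar(4000); compared against a varchar clustered key, that forces an
        // implicit conversion and a scan. With it, parameters are sent as varchar.
        String url = "jdbc:sqlserver://dbhost:1433;databaseName=sales;"
                + "sendStringParametersAsUnicode=false";
        try (Connection con = DriverManager.getConnection(url, "user", "password");
                PreparedStatement ps = con.prepareStatement(
                        "UPDATE customer SET customer_name = ? WHERE cmf = ?")) {
            ps.setString(1, "ACME Corp");
            ps.setString(2, "C000123"); // now compared as varchar, so the index can be used
            ps.executeUpdate();
        }
    }
}

In the component itself it is just that one name=value pair added to the Advanced settings, as described above.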

 

The engineer assigned to the defect I opened (TDI-40281) was able to confirm the performance issue.

 

On a side note:  It isn't clear to me why the component wouldn't use the size of the schema columns for the parameters, but that probably doesn't matter much.

 


43 Replies
Anonymous
Not applicable
Author

All I can say is that your job or database is poorly configured if those numbers are true. I'm currently working with MS SQL Server and am able to get around 2500 rows per second without even looking into optimisation of my job.
Anonymous
Not applicable
Author

I appreciate you confirming you are able to see decent results.  That gives me hope that it is something misconfigured in the job.  It is clearly not the database as Datastage can achieve respectable rates similar to what you stated.

 

The input side of the job all performs very well.  The job uses a tMSSqlInput to select all rows in the source table, another tMSSqlInput to select all rows from the target table into a tHashOutput, and then compares the two streams in a tMap, sending the output to either an insert path or an update path depending upon whether the record already exists in the target with the same CRC (calculated in a tAddCRCRow to detect changes to the columns).
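
To make that concrete, here is a rough Java sketch of the equivalent logic with hypothetical column names; tAddCRCRow plus the tMap lookup effectively amount to computing a checksum over the row's columns and comparing it to the value stored for the matching key in the target:

import java.util.Map;
import java.util.zip.CRC32;

public class ChangeDetector {

    // CRC over the columns that participate in change detection (hypothetical columns).
    static long rowCrc(String customerName, String shipAddress, String shipCity) {
        CRC32 crc = new CRC32();
        // The delimiter keeps "ab" + "c" from colliding with "a" + "bc".
        crc.update((customerName + "|" + shipAddress + "|" + shipCity).getBytes());
        return crc.getValue();
    }

    // Decide which output path a source row should follow.
    static String route(String key, long sourceCrc, Map<String, Long> targetCrcByKey) {
        Long targetCrc = targetCrcByKey.get(key);
        if (targetCrc == null) {
            return "insert";   // key not in target yet
        }
        if (targetCrc != sourceCrc) {
            return "update";   // key exists but the row has changed
        }
        return "skip";         // unchanged, nothing to write
    }
}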

 

If I direct the insert and update paths to simple delimited files, the output is also performant, reaching 51k rows/sec.  (Of course, I realize a flat file is going to be far faster than a DB update; I'm just using it to isolate where the issue lies.)

 

When I enable the tMSSqlOutput components on the insert and update paths, the performance drops to < 30 rows/sec.  Thus, the reason I've been playing around with the various settings for that component.

Anonymous
Not applicable
Author

I built a new job that does nothing but select rows from one table and insert them into a new table in the same DB on the same server.  I see the same results with this new, simpler job even though it is on a different DB and different server than the previous job was running against.

 

Based upon the displayed statistics, it cranks along fine at about 1k-2k rows/sec until it hits 10k rows and then just apparently hangs.  If I wait, it will eventually finish and I'll then see the 30 rows/sec rate.

 

Given the 10k limit, I tried changing the "Batch Size" and "Commit Every" values (which appear to be the same as the "Array Size" and "Transaction Size" in Datastage) to something other than 10k.  As expected, when I do this it then hangs at whatever the larger of those two values is.
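
For what it's worth, those two settings appear to correspond to standard JDBC batching and transaction control.  A minimal sketch of that pattern, with a hypothetical connection and table, just to show what the counters relate to underneath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class BatchCommitDemo {

    static final int BATCH_SIZE = 10000;   // "Batch Size": rows buffered per executeBatch()
    static final int COMMIT_EVERY = 10000; // "Commit Every": rows per transaction commit

    public static void main(String[] args) throws Exception {
        String url = "jdbc:jtds:sqlserver://dbhost:1433/sales"; // hypothetical server/database
        List<String[]> rows = List.of(new String[] { "C000123", "ACME Corp" }); // sample data
        try (Connection con = DriverManager.getConnection(url, "user", "password")) {
            con.setAutoCommit(false);
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO customer_copy (cmf, customer_name) VALUES (?, ?)")) {
                int n = 0;
                for (String[] r : rows) {
                    ps.setString(1, r[0]);
                    ps.setString(2, r[1]);
                    ps.addBatch();
                    n++;
                    if (n % BATCH_SIZE == 0) {
                        ps.executeBatch();  // one round trip to the server per batch
                    }
                    if (n % COMMIT_EVERY == 0) {
                        con.commit();       // end of the current transaction
                    }
                }
                ps.executeBatch();          // flush any remaining buffered rows
                con.commit();
            }
        }
    }
}

The counter jumping straight to the batch boundary and then pausing would be consistent with rows being buffered client-side until executeBatch() and commit() actually run.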

 

[screenshot: 0683p000009Lsnt.png]

Anonymous
Not applicable
Author

Right, I have just set up a job based on a table I have with 37668 rows. First, I essentially used Talend to read everything into a tHash component and write it to a new version of the table (but with no indexes or primary key). I used Talend to "create table if does not exist" and it built the table and loaded all 37668 rows in 1.74 seconds. This was with standard "out of the palette" settings.

 

[screenshot: 0683p000009LswX.png]

 

However, inserting into an empty table and updating are two different beasts, so I then changed the job to update every single one of the 37668 rows. The only change I made was to the "Action on data" option and I set the update key using the Advanced settings' "Use field options". I then ran the job again....and I managed to get 47.66 rows per second....

[screenshot: 0683p000009Lsy1.png]

Pretty terrible, but I barely changed anything and the database table has no keys or indexes.

 

I next played around with the "Use batch size" and "Commit every" settings and got marginal gains, but nothing substantial.

 

Now remember I said I have no keys or indexes on the table. Well, I decided to add a clustered index on the update key field. This is an entirely reasonable thing to do; in fact, it could be argued that you should index your tables appropriately for how they are used. If you update by a combination of columns as a composite key, you should create an index based on that combination of columns. Anyway, I added the index and ran exactly the same bog-standard update with default settings (other than the "field options" settings). This was the result....

[screenshot: 0683p000009Lt9S.png]

19,598.34 rows per second, up from 47.66, just by adding an index on the update key.
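
For anyone reproducing this, the index in question would be along these lines (the server, database, table and column names are hypothetical), executed once against the target database:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class AddUpdateKeyIndex {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:jtds:sqlserver://dbhost:1433/sales"; // hypothetical server/database
        try (Connection con = DriverManager.getConnection(url, "user", "password");
                Statement st = con.createStatement()) {
            // A clustered index on the update-key column lets each
            // "UPDATE ... WHERE <key> = ?" seek straight to the row instead of scanning.
            st.execute("CREATE CLUSTERED INDEX ix_customer_copy_key ON customer_copy (cmf)");
        }
    }
}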

 

Now I understand you may not be able to add any new indexes on your table, but I assume you have some already. Are you able to set the field options of the db component to match any existing indexes? Can you rebuild the index before you run? Essentially you have to performance tune the Talend job like you would performance tune a SQL query (going back to ETL with just database links 🙂 ).

Anonymous
Not applicable
Author

UPDATE: I just realized that this simple job is doing a straight insert, so the update key won't have any effect. DOH!  I built the same simple job in Datastage for comparison and will do further testing.  I'll try the update key in the job doing the updates and see if that is an improvement.

 

Well, that is interesting.  (Note that this simple job is just doing an insert, not update.)  Refer to images, below:

  • The key in my target table is already a primary clustered key.
  • I have that column marked as the key in my Talend schemas.
  • I enabled "Field Options" and checked the box under "Update Key" beside that column.
  • I set "Commit Every" and "Batch Size" back to the default 10k.
  • Performance hasn't changed.  It reaches 10k fine, and then hangs.

[screenshot: 0683p000009Lt5H.png]

[screenshot: 0683p000009LspB.png]

[screenshot: 0683p000009LsxP.png]

[screenshot: 0683p000009Lt9m.png]

[screenshot: 0683p000009LsZd.png]

 

Anonymous
Not applicable
Author

Figured out one thing:  I had been using the Microsoft-provided JDBC provider.  Apparently, that was a bad idea.  In the simple insert job, I switched to the open-source jTDS provider and it is now producing very respectable numbers.  However, making that switch in my other job (the one doing the update) has not had the same effect.  The performance drops dramatically once it hits the first commit/batch limit.  Dang...
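
For anyone else comparing the two, my understanding is that the provider option boils down to a different driver and URL prefix; a sketch with a hypothetical host and database:

public class DriverUrls {
    // Microsoft JDBC driver (what I had been using):
    static final String MS_URL = "jdbc:sqlserver://dbhost:1433;databaseName=sales";
    // Open-source jTDS driver (what the provider option switches to):
    static final String JTDS_URL = "jdbc:jtds:sqlserver://dbhost:1433/sales";
}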

Anonymous
Not applicable
Author

One more thing to check (I do this automatically and forgot to mention): did you untick the "Updatable" field for the update key column in the field options? This can make a big difference, since updating a key field (even with the same value) incurs a performance hit because the index technically has to be recalculated. Since it is a key value, it won't change....so you may as well untick the updatable field.
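
To illustrate the difference roughly, with hypothetical table and column names, the two statement shapes would look something like this:

public class UpdateStatementShapes {
    // With "Updatable" still ticked on the key column, the key appears in the SET list,
    // which touches the clustered index on every row even though the value never changes:
    static final String KEY_STILL_UPDATED =
            "UPDATE customer SET cmf = ?, customer_name = ?, crc = ? WHERE cmf = ?";

    // With "Updatable" unticked on the key column, only the non-key columns are set:
    static final String KEY_LEFT_ALONE =
            "UPDATE customer SET customer_name = ?, crc = ? WHERE cmf = ?";
}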

 

If that doesn't make a difference then I can only assume there is something funky going on with your database config. My database is not on the same machine as my Studio. In fact, it is in a different country. My Studio machine is a virtual machine with 8GB RAM (not ideal). However I am processing more data in a fraction of the time with what appear to be the same (or very similar) Talend settings. 

 


In your last post you say "Note that this simple job is just doing an insert, not update" and show the status of the job hung. But you also show the field options with an update key specified. You only want to use the update key option if you are updating....not that it will make much of a difference in an insert.

 

By the way, I don't believe your job is hanging at 10000 records. That is just a user interface "feature": it is displaying the results pre-emptively according to the "Batch Size" setting. Switch that off and you will see the result increment in smaller, more frequent increments. Alternatively, if you left your job running it would eventually get through the data (probably in a similar sort of time to if you left the job running with the "Batch Size" switched off).

 

Another slight difference between yours and mine is that you are using a primary key. When inserting a value into a primary key field the database essentially has to ensure that the value being inserted is unique, otherwise the constraint is broken. This will cause a little latency when inserting a large amount of data with existing PK data. Having said that, it shouldn't cause the issues you are seeing.

 

I really do not believe that Talend is doing anything weird here. Firstly because I have never seen any issues on countless customer sites. Secondly because Talend are simply using third-party Jars to provide the database functionality. These are the same Jars used by many other Java applications with database interactions. All Talend does is prepare insert/update/select statements which are sent to the DB using these Jars. Even if the statements were inefficient, most RDBMSs ultimately rewrite the statements internally to make them more efficient. Any latency caused by Talend is more likely to be caused by complex in-memory transformations requiring a lot of memory. This is the main cause of any performance tuning I need to carry out, no matter what database I am using.

Anonymous
Not applicable
Author

NOTE: I added the topic Performance impact of tMSSqlOutput Commit Every and Batch Size which shows the performance of inserts and how Commit and Batch affect the outcome.  I did this testing with regard to the experience I was describing in this thread.

 

First, thank you very much for all the info you have shared and provided.  It is really appreciated.  I've got years of Datastage experience and now we're moving to Talend.  Much of what I know still applies but the details are often different.

 


@rhall wrote:

One more thing to check (I do this automatically and forgot to mention), did you untick the "updatable" field for the update key column in the field options?

 


Yes, I saw that list and made that same change at the time.  I unchecked everything except the primary key in the "Update key" column and checked everything except the primary key in the "Updatable" column.  I even unchecked everything in the "Insertable" column, just in case.

 


@rhall wrote:

If that doesn't make a difference then I can only assume there is something funky going on with your database config. My database is not on the same machine as my Studio. In fact, it is in a different country. My Studio machine is a virtual machine with 8GB RAM (not ideal). However I am processing more data in a fraction of the time with what appear to be the same (or very similar) Talend settings.

 


For this "update" job, the source and target databases are remote from me, both in Chicago.  My Studio machine is a Macbook Pro 16G Ram, 2.8 GHz i7.  As noted in the article from the link at top, my inserts appear to be performing fine.

 


@rhall wrote:

In your last post you say "Note that this simple job is just doing an insert, not update" and show the status of the job hung. But you also show the field options with an update key specified. You only want to use the update key option if you are updating....not that it will make much of a difference in an insert.

 


Yeah, I caught that just after I posted and then made an update to that post.  Oversight on my part, getting confused between the two jobs.

 


@rhall wrote:

By the way, I don't believe your job is hanging at 10000 records. That is just a user interface "feature": it is displaying the results pre-emptively according to the "Batch Size" setting. Switch that off and you will see the result increment in smaller, more frequent increments. Alternatively, if you left your job running it would eventually get through the data (probably in a similar sort of time to if you left the job running with the "Batch Size" switched off). 


I only meant "hanging" in the sense of the visual behavior I saw.  Of course, what's happening is that the first "Batch" of rows gets processed very quickly (so the count jumps to that number), and it then waits while that batch is committed.   As noted in the other article, "Batch" clearly has an impact on performance.  Disabling it resulted in a huge drop in performance for the inserts.

 


@rhall wrote:

Another slight difference between yours and mine is that you are using a primary key. When inserting a value into a primary key field the database essentially has to ensure that the value being inserted is unique, otherwise the constraint is broken. This will cause a little latency when inserting a large amount of data with existing PK data. Having said that, it shouldn't cause the issues you are seeing.

 


I think you are referring to the fact that one of the columns in the stream has "Key" checked?  Assuming I understand that correctly, I had wondered why setting that column as the key still required me to do something else to tell the update to use it.  Seems odd.  In DS, setting a column as the key automatically causes it to be used in the update statement.

 


@rhall wrote:

I really do not believe that Talend is doing anything weird here. Firstly because I have never seen any issues on countless customer sites. Secondly because Talend are simply using third-party Jars to provide the database functionality. These are the same Jars used by many other Java applications with database interactions. All Talend does is prepare insert/update/select statements which are sent to the DB using these Jars. Even if the statements were inefficient, most RDBMSs ultimately rewrite the statements internally to make them more efficient. Any latency caused by Talend is more likely to be caused by complex in-memory transformations requiring a lot of memory. This is the main cause of any performance tuning I need to carry out, no matter what database I am using. 


Your point about using 3rd party components is an important one and certainly validates the thought that Talend isn't really at fault here, wherever else the fault might lie.

 

I'm still stumped, though still investigating.  It doesn't make sense that this Talend job would perform so poorly compared to a Datastage job doing the same thing to the same tables in the same database.

 

I will report back when and if I find out anything conclusive on the update performance issue.

 

Again, thanks.

Anonymous
Not applicable
Author

I'm still seeing 31 rows/sec in Talend vs. 3898 rows/sec in Datastage -- over 131 minutes vs. 1 minute. 

 

In Datastage, I can look at the actual statement the stage is going to use:

UPDATE customer SET customer_name=?,ship_address=?,ship_postalcode=?,ship_city=?,ship_county=?,ship_state=?,ship_country=?,mktsegcode=?,crc=? WHERE cmf=?;

I'd really like to see if the SQL used by the Talend component actually includes the WHERE clause to employ the update key.

 

The docs indicate there is a global variable QUERY, but I've not been able to access it successfully.  Probably a syntax issue on my part...
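
In case it helps, the usual pattern for reading a component's global variables is through globalMap in a tJava linked after the component with an OnComponentOk trigger.  A sketch, assuming the output component is named tMSSqlOutput_1 and that it really does publish QUERY (which I haven't confirmed yet):

// Code placed in a tJava connected to tMSSqlOutput_1 via an OnComponentOk trigger:
String lastQuery = (String) globalMap.get("tMSSqlOutput_1_QUERY");
System.out.println("Generated SQL: " + lastQuery);

If that prints null, the variable presumably isn't populated for the output component, and tracing the session with SQL Server Profiler would be the fallback for seeing the actual statement and whether it carries the WHERE clause.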