Anonymous
Not applicable

Slow SQL Server updates

Is Talend making any effort to improve performance with MS SQL Server?

 

Having searched this community and old talendforge forum entries, it appears there are performance issues with Talend and SQL Server.  I found an article (Data loading from MS SQL Server DB to other MS SQL Server DB is very low compared to SSIS) which confirms my own experience comparing Talend to Datastage (2005 Server Edition -- an old, no-longer-supported version).

 

I built a Talend job to mirror what the old Datastage job does -- syncing content between two tables: a source table on one SQL Server and a target table on another SQL Server.  The data is about 220k rows.  The Datastage job does this (with the MSOLEDB stage) in under 2 minutes at 2238 rows/sec.  Talend (with the tMSSqlOutput component) is poking along, struggling to reach even 30 rows/sec!  Really??  Almost 100x slower?  What the heck is it doing?

 

I tried changing the batch size, switching to a component connection, and changing the Commit Every setting; I also tried using SELECT TOP to limit rows (just to see the impact).  None of these made a significant difference in performance -- it still achieves < 30 rows/sec.

43 Replies
Anonymous
Not applicable
Author

I started from scratch and built a completely new job and new tables in a new database -- see the image at the bottom. I am seeing essentially the same results.  Most importantly, the problem is clearly something to do with pushing the data to the database: once the job hits the first "commit" limit (when it begins to send data to the database), it visually hangs and I can watch the row count in SQL Profiler just crawling along...

 

Interesting note #1:  The open-source jTDS JDBC driver gives marginally better performance (100 rows/sec) than the Microsoft JDBC driver (30 rows/sec), though still nothing like an acceptable level.
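For anyone wanting to compare the two drivers outside of Talend, their JDBC connection URL formats differ; this is only a sketch with placeholder host, port, and database names (not values from this job):

// Placeholder values only -- host, port and database name are illustrative
String msJdbcUrl = "jdbc:sqlserver://dbhost:1433;databaseName=ODS";   // Microsoft JDBC driver
String jtdsUrl = "jdbc:jtds:sqlserver://dbhost:1433/ODS";             // open-source jTDS driver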

 

Interesting note #2:  From reading the docs, I understand that I should be able to use a single tMSSqlConnection with multiple tMSSqlInput or tMSSqlOutput components.  So, in this job, I initially tried that.  The first run was with zero rows in the target table: the job ran and completed quickly, indicating that all 20k rows had been inserted.  Yet nothing appeared in the target table; it was empty.  After playing around with this, I added a second tMSSqlConnection (thinking I must have misunderstood the docs) and set the tMSSqlOutput components to use it.  I ran the test again: it indicated it had inserted 20k rows but the target table was still empty.  (NOTE: Using SQL Profiler during this, I confirmed that nothing was being sent to the database during the tests.)  So, I changed the tMSSqlOutput components NOT to use an existing connection but to define their own connections.  This time, when I ran the test, it showed 20k rows inserted, and those rows were actually in the table.  Similarly, though slow, the update worked as well only when configured not to use an existing connection.  However, the tMSSqlConnection component works fine for multiple tMSSqlInput components (there are two in this job).  It is just the tMSSqlOutput components that will not work with it.

 

Should anyone care to try and replicate this, I've also attached a zip of the exported job (sans context) and a SQL script that will facilitate building the tables and managing the data between tests.  The general test is to:

  1. Create the tables and populate the source table
  2. Run the job -- confirm that all 20k rows take the insert path and that the target table is populated.
  3. Run the job again -- confirm that no inserts or updates occur.
  4. Run the section of the script to reset the crc values in the target table.
  5. Run the job again -- confirm that all 20k rows take the update path and that the target table crc values are all non-zero.

 

 

0683p000009LsvO.png

 

 

 


ODS_test.zip
ODS_test_tables.txt
Anonymous
Not applicable
Author

@rhall wrote:

One more thing to check (I do this automatically and forgot to mention), did you untick the "updatable" field for the update key column in the field options? This can make a big difference since updating a key field (even with the same value) will incur an update performance hit since the index is technically having to be recalculated. Since it is a key value, it won't change....so you may as well untick the updatable field.

 

In fact, it appears the tMSSqlOutput component is indeed using the key(s) defined in the schema as the column(s) for the WHERE clause, and the Field Options have no effect.

 

If I uncheck the key column in the schema and attempt to run, it replies with this:

 

[FATAL]: workspace.testupdate_0_1.TestUpdate - tMSSqlOutput_2 For update, Schema must have a key

 

If I then check the key column and add a second key column, when I run the job, I get this in the sp_prepare -- note that both columns are listed:

 

declare @p1 int set @p1=1 exec sp_prepare @p1 output,N'@P0 nvarchar(4000),@P1 nvarchar(4000),@P2 bigint,@P3 nvarchar(4000),@P4 nvarchar(4000)',N'UPDATE [dbo].[test_target] SET [strval2] = @P0 ,[strval3] = @P1 ,[crc] = @P2 WHERE [keyval] = @P3 AND [strval1] = @P4 ',1 select @p1

 

I then unchecked the second column from above (leaving only the first column which is the real key) and played around with the Field Options.  It has no effect on the sp_prepare statement:

  

Field Options checked, Updatable checked:

declare @p1 int set @p1=1 exec sp_prepare @p1 output,N'@P0 nvarchar(4000),@P1 nvarchar(4000),@P2 nvarchar(4000),@P3 bigint,@P4 nvarchar(4000)',N'UPDATE [dbo].[test_target] SET [strval1] = @P0 ,[strval2] = @P1 ,[strval3] = @P2 ,[crc] = @P3 WHERE [keyval] = @P4 ',1 select @p1

 

Field Options unchecked (leaving Updatable checked):

declare @p1 int set @p1=1 exec sp_prepare @p1 output,N'@P0 nvarchar(4000),@P1 nvarchar(4000),@P2 nvarchar(4000),@P3 bigint,@P4 nvarchar(4000)',N'UPDATE [dbo].[test_target] SET [strval1] = @P0 ,[strval2] = @P1 ,[strval3] = @P2 ,[crc] = @P3 WHERE [keyval] = @P4 ',1 select @p1

 

Field Options unchecked (leaving Updatable unchecked):

declare @p1 int set @p1=1 exec sp_prepare @p1 output,N'@P0 nvarchar(4000),@P1 nvarchar(4000),@P2 nvarchar(4000),@P3 bigint,@P4 nvarchar(4000)',N'UPDATE [dbo].[test_target] SET [strval1] = @P0 ,[strval2] = @P1 ,[strval3] = @P2 ,[crc] = @P3 WHERE [keyval] = @P4 ',1 select @p1

 

 

Anonymous
Not applicable
Author

I'm afraid your last post has some clear errors in it. I shall explain. The code (SQL) that Talend fires is not hardcoded and hidden; it changes with the component configuration, and you can see it change if you look at the generated code. There is absolutely no way that the same query is being fired at the database with different Field Options configurations. The Field Options DO make a difference....unless we are going to believe that the generated code is just for show and Talend does something mysterious behind the scenes. The code below shows the changes with different settings. I replicated the sort of changes you made and am showing the code that is generated....

 

Field options checked, no update field checked, updatable field catalogue_id checked (amongst others)...

String update_tMSSqlOutput_1 = "UPDATE ["
+ tableName_tMSSqlOutput_1
+ "] SET [catalogue_id] = ?,[datasource] = ?,[start_time] = ?,[end_time] = ?,[status_id] = ?,[root_talend_process_id] = ?,[rerun] = ?,[batch_repaired] = ?,[comment] = ? WHERE ";

You will see there is an empty WHERE clause, which is what you noticed. This will not run due to the empty WHERE clause.

Field options checked, update field id checked, updatable field catalogue_id checked (amongst others)....

String update_tMSSqlOutput_1 = "UPDATE ["
+ tableName_tMSSqlOutput_1
+ "] SET [catalogue_id] = ?,[datasource] = ?,[start_time] = ?,[end_time] = ?,[status_id] = ?,[root_talend_process_id] = ?,[rerun] = ?,[batch_repaired] = ?,[comment] = ? WHERE [id] = ?";
				

You will notice that the WHERE CLAUSE now has the id field specified.

Field options checked, update field id checked and updatable field catalogue_id unchecked....

String update_tMSSqlOutput_1 = "UPDATE ["
+ tableName_tMSSqlOutput_1
+ "] SET [datasource] = ?,[start_time] = ?,[end_time] = ?,[status_id] = ?,[root_talend_process_id] = ?,[rerun] = ?,[batch_repaired] = ?,[comment] = ? WHERE [id] = ?";
				

In the code above you will notice that the catalogue_id field is removed from the list of fields to be updated. That is what the updatable field does: it specifies whether the field can be updated. It has nothing to do with the update key. I mentioned it because there is no point updating your key field, as the value will not change. If your key is id and your data row contains 21 for the key, the row to update will have a key of 21, and if you update the key field as well you are updating 21 with 21. That was the point I was making.

Field options unchecked, but everything else left the same as above...

String update_tMSSqlOutput_1 = "UPDATE ["
+ tableName_tMSSqlOutput_1
+ "] SET [id] = ?,[catalogue_id] = ?,[datasource] = ?,[start_time] = ?,[end_time] = ?,[status_id] = ?,[root_talend_process_id] = ?,[rerun] = ?,[batch_repaired] = ?,[comment] = ? WHERE ";

Since we have switched off the Field Options, the settings are now ignored. We are back to having no WHERE CLAUSE again. This will not run. You will get an error similar to the following....

Exception in component tMSSqlOutput_1
java.lang.RuntimeException: For update, Schema must have a key
	at allport_dq.testjob_0_1.TestJob.tHashInput_1Process(TestJob.java:1769)
	at allport_dq.testjob_0_1.TestJob.tMSSqlInput_1Process(TestJob.java:3073)
	at allport_dq.testjob_0_1.TestJob.runJobInTOS(TestJob.java:3425)
	at allport_dq.testjob_0_1.TestJob.main(TestJob.java:3166)
[FATAL]: allport_dq.testjob_0_1.TestJob - tMSSqlOutput_1 For update, Schema must have a key

If you have Field Options unchecked and are trying to carry out an update, there is no way your job would have run. So your SQL Profiler data showing what was fired at the database when Field Options were switched off is not actually showing that. It might be showing a cached query or perhaps just reshowing the last successful run. But it is impossible for it to be showing a query that could never have been fired at the database.

 

I'm sensing either aggravation at being forced to use a tool other than Datastage, or some eagerness to "prove" that Datastage is better. The problem is that you cannot do that until you know how to use Talend. You clearly do not know Talend and are not trying to learn it....just trying to find fault. The problem with doing that is that you look very silly when you make a schoolboy error because you have omitted to RTFM. An example of this can be seen below....

"Interesting note #2:  From reading the docs, I understand that I should be able to use a single tMSSqlConnection with multiple tMSSqlInput or tMSSqlOutput components.  So, in this job, I initially tried that.  The first run was with zero rows in the target table: the job ran, and completed quickly, indicating that all 20k rows had been inserted.  Yet nothing appeared in the target table; it was empty.  After playing around with this, I then added a second tMSSqlConnection (thinking I must have misunderstood the docs) and set the tMSSqlOutput components to use it.  I ran the test again:  It indicated it had inserted 20k rows but the target table was still empty.  (NOTE: Using SQL Profiler during this, I confirmed that nothing was being sent to the database during the tests.)  So, I changed the tMSSqlOutput components to NOT use a connection but to use their a connection from themselves.  This time, when I ran the test, it showed 20k rows inserted, and those rows were actually in the table.  Similarly, though slow, the update worked as well only when configured to not use an existing connection.   However, the tMSSqlConnection component works fine for multiple tMSSqlInput connections (there are two in this job).  It is just the tMSSqlOutput components which will not work with it."

 

Now someone who had read the manual would know that with the tMSSQLConnection component you need to either set Auto Commit on the Advanced Settings tab OR use a tMSSQLCommit component. Had you done this, your data would have been committed. Here is the section of the manual you missed (and I believe I pointed you to): https://help.talend.com/reader/iYcvdknuprDzYycT3WRU8w/tR5sAkr~80fPFhrYZNlNWQ
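To illustrate the point in plain JDBC (just a sketch with placeholder connection details, not the code Talend generates): with auto commit off, an insert executes and reports a row count, but nothing becomes visible in the table until commit() is called -- which is exactly what the Auto Commit option or a tMSSQLCommit component does for you....

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class CommitSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details
        Connection conn = DriverManager.getConnection(
                "jdbc:sqlserver://dbhost:1433;databaseName=ODS", "user", "password");
        conn.setAutoCommit(false); // a shared connection without Auto Commit behaves like this
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO dbo.test_target (keyval, crc) VALUES (?, ?)")) {
            ps.setString(1, "A001");
            ps.setLong(2, 0L);
            ps.executeUpdate(); // reports 1 row inserted, but the row is not yet committed
        }
        conn.commit(); // without this, the insert is discarded when the connection closes
        conn.close();
    }
}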

 

 

I may have come across as a bit harsh here, but I'm afraid I am a little cheesed off that I tried to help you only for you to play passive-aggressive developer one-upmanship with me, while trying to pull apart a tool you clearly have little understanding of (and haven't really bothered to try to learn). Do you REALLY think that Talend would have reached the adoption levels it has if it could only update at 30 rows a second?!? I mean, seriously? I have posted screenshots showing you updates on a database running in a different country from my Studio, running a thousand times faster than the rate you claim to be stuck at. But you insist it is a flaw with Talend and have spent your time trying to prove that Talend is a smoke-and-mirrors application. Maybe you should stick with Datastage......

 

Anonymous
Not applicable
Author

@rhall: I intended no offense at all.  That you are offended saddens me.   I had no intent to make you look bad or whatever you felt.  I was only reporting what I experienced.  I said many times that I appreciated your help and I still do.  My apologies for whatever you mistakenly thought I intended.

 

My sole purpose has been, and continues to be, to understand why Talend performs so poorly compared to my own experience with Datastage.  I would be very happy for someone to identify what in the job (which I attached in the prior post) is being done incorrectly to account for the experience I am having.  The goal is to leave Datastage behind, but I'm having trouble doing that when I cannot justify a 100x difference in performance.

 

I came here seeking help.  That has not changed.  I will repeat my tests from the last post on the chance that I did, indeed, make some errors -- and this time I will include the generated code for comparison; I neglected to include that.

 

You needn't feel obligated to respond, but I hope that you will, as your prior information has been beneficial.

Anonymous
Not applicable
Author

My previous post turns out to have suffered from a misunderstanding of how the key checkbox and Field Options work, and some copy/paste errors.  Below are the details of what I've found but I'm still left trying to understand why updates are so slow compared to my experience in Datastage.  This, after creating multiple jobs from scratch, and testing with different databases on different servers (trying to isolate potential contributors).  I would still appreciate someone attempting to duplicate these tests using the job and script I attached.  I'd really like to understand what I may be doing wrong so I can finally move forward using Talend.

 

After confirming, again, that the key checkbox is used by an Update action to supply the key, I found this section in the documentation of tMSSqlOutput:

 

"It is necessary to specify at least one column as a primary key on which the Update and Delete operations are based. You can do that by clicking Edit Schema and selecting the check box(es) next to the column(s) you want to set as primary key(s). For an advanced use, click the Advanced settings view where you can simultaneously define primary keys for the Update and Delete operations. To do that: Select the Use field options check box and then in the Key in update column, select the check boxes next to the column names you want to use as a base for the Update operation. Do the same in the Key in delete column for the Delete operation."

 

I was able to confirm that the key checkboxes are used unless the Field Options checkbox is checked, in which case it uses whatever is set in the Field Options columns.  So, while the key checkbox is sufficient for most cases, there may be times when checking the Field Options checkbox is necessary to achieve a greater level of control.

 

I was able to confirm this both in the generated Java code as well as in the sp_prepare statements (which suffered from copy/paste errors in my previous post).  Details listed below.

 

Field Options -- one updatable key:

String update_tMSSqlOutput_2 = "UPDATE ["
+ tableName_tMSSqlOutput_2
+ "] SET [keyval] = ?,[strval1] = ?,[strval2] = ?,[strval3] = ?,[crc] = ? WHERE [keyval] = ?";

 

declare @p1 int
set @p1=1
exec sp_prepare @p1 output,N'@P0 nvarchar(4000),@P1 nvarchar(4000),@P2 nvarchar(4000),@P3 nvarchar(4000),@P4 bigint,@P5 nvarchar(4000)',N'UPDATE [dbo].[test_target] SET [keyval] = @P0 ,[strval1] = @P1 ,[strval2] = @P2 ,[strval3] = @P3 ,[crc] = @P4 WHERE [keyval] = @P5 ',1
select @p1

 

Field Options -- no updatable key:

String update_tMSSqlOutput_2 = "UPDATE ["
+ tableName_tMSSqlOutput_2
+ "] SET [keyval] = ?,[strval1] = ?,[strval2] = ?,[strval3] = ?,[crc] = ? WHERE ";

 

Job errors out - no sp_prepare sent

 

Field Options -- two updatable keys:

String update_tMSSqlOutput_2 = "UPDATE ["
+ tableName_tMSSqlOutput_2
+ "] SET [keyval] = ?,[strval1] = ?,[strval2] = ?,[strval3] = ?,[crc] = ? WHERE [keyval] = ? AND [strval1] = ?";

 

declare @p1 int
set @p1=1
exec sp_prepare @p1 output,N'@P0 nvarchar(4000),@P1 nvarchar(4000),@P2 nvarchar(4000),@P3 nvarchar(4000),@P4 bigint,@P5 nvarchar(4000),@P6 nvarchar(4000)',N'UPDATE [dbo].[test_target] SET [keyval] = @P0 ,[strval1] = @P1 ,[strval2] = @P2 ,[strval3] = @P3 ,[crc] = @P4 WHERE [keyval] = @P5 AND [strval1] = @P6 ',1
select @p1

 

Key checkbox -- one key:

String update_tMSSqlOutput_2 = "UPDATE ["
+ tableName_tMSSqlOutput_2
+ "] SET [strval1] = ?,[strval2] = ?,[strval3] = ?,[crc] = ? WHERE [keyval] = ?";

 

declare @p1 int
set @p1=1
exec sp_prepare @p1 output,N'@P0 nvarchar(4000),@P1 nvarchar(4000),@P2 nvarchar(4000),@P3 bigint,@P4 nvarchar(4000)',N'UPDATE [dbo].[test_target] SET [strval1] = @P0 ,[strval2] = @P1 ,[strval3] = @P2 ,[crc] = @P3 WHERE [keyval] = @P4 ',1
select @p1

 

Key checkbox -- no key:

String update_tMSSqlOutput_2 = "UPDATE ["
+ tableName_tMSSqlOutput_2
+ "] SET [keyval] = ?,[strval1] = ?,[strval2] = ?,[strval3] = ?,[crc] = ? WHERE ";

 

Job errors out - no sp_prepare sent

 

Key checkbox -- two keys:

String update_tMSSqlOutput_2 = "UPDATE ["
+ tableName_tMSSqlOutput_2
+ "] SET [strval2] = ?,[strval3] = ?,[crc] = ? WHERE [keyval] = ? AND [strval1] = ?";

 

declare @p1 int
set @p1=1
exec sp_prepare @p1 output,N'@P0 nvarchar(4000),@P1 nvarchar(4000),@P2 bigint,@P3 nvarchar(4000),@P4 nvarchar(4000)',N'UPDATE [dbo].[test_target] SET [strval2] = @P0 ,[strval3] = @P1 ,[crc] = @P2 WHERE [keyval] = @P3 AND [strval1] = @P4 ',1
select @p1

 

 

Anonymous
Not applicable
Author

So it seems I may have got the wrong impression from the last post I replied to. I guess it is very easy to give and get the wrong impression when communicating in just text, so I apologise for misunderstanding your motives.

 

Regarding the job you uploaded to test, I have had a look at it and tested it on my machine. I downloaded a new version of Talend 6.5.1 (since I am using 6.2.1 at the moment and your job was written in 6.4.1). I am running Microsoft SQL Server 2014 (SP1-GDR). Here is what I found.

 

1) The reason your data wasn't inserting or updating is that no commit was specified when using the tMSSQLConnection component. This can be done either with a tMSSQLCommit component or by setting the Auto Commit option on the Advanced Settings of the tMSSQLConnection.

2) You only need one tMSSQLConnection for this. Adding two does nothing for the performance in this case. However, I suspect you added a second in an effort to solve the problem with inserts and updates not occurring.

3) When I sorted the Auto Commit on the tMSSQLConnection component, the update ran successfully at a rate of 1320.9 rows per second. This seemed a little slow, but apparently is faster than you were getting. How busy is your database out of interest?

4) I then left everything in the job the same but changed the Update DB component to manage its own connection. I set the connection details for that component and left everything else the same, so the shared connection was responsible for the reads and the update had its own connection. I managed to get a performance of 3545 rows per second. This was with no Field Options (so the schema keys were automatically used).

5) I then switched the Field Options on and ticked the update key (keyval), leaving everything else the same. The performance was worse doing this, at 2471 rows per second.

6) My suspicion was that updating the keyval was likely causing this issue (since it was left as updateable), so I unticked it and ran again. I got a row rate of 3563 rows per second.

7) Since the point of this update is just to update one field, I unticked all updateable fields except the crc field, so the only fields ticked were the key (keyval) and the field to update (crc). This increased the rate to 3926 rows per second (see the sketch below).
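With that last configuration, the generated update should reduce to something of roughly this shape (a sketch only, using the column names from your test tables):

String update_tMSSqlOutput_1 = "UPDATE ["
+ tableName_tMSSqlOutput_1
+ "] SET [crc] = ? WHERE [keyval] = ?";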

0683p000009LtKH.png

The screenshot above shows the slightly changed layout of the job (I removed the extra connection components) and also shows the row rate I got on the last test.

0683p000009LtBC.png

The screenshot above shows the configuration of the update component's field options for the fastest performance. 

 

What sort of performance do you get on your system with the same configuration? Even if it is slower, do you see the same sort of performance improvements when you make the changes I have made?

 

 

Anonymous
Not applicable
Author


@rhall wrote:

The reason your data wasn't inserting or updating is that no commit was specified when using the tMSSQLConnection component. This can be done either with a tMSSQLCommit component or by setting the Auto Commit option on the Advanced Settings of the tMSSQLConnection.


Dang!  I actually knew this but had forgotten it in the midst of focusing on why the updates weren't performing well.  Sometimes you can't see the forest for the trees.

 


@rhall wrote:

 

When I sorted the Auto Commit on the tMSSQLConnection component, the update ran successfully at a rate of 1320.9 rows per second. This seemed a little slow, but apparently is faster than you were getting. How busy is your database out of interest?

 


The database is completely idle except for my testing.  I am the only one on it.

 


@rhall wrote:

 

I then left everything in the job the same but changed the Update DB component to manage its own connection. I set the connection details for that component and left everything else the same, so the shared connection was responsible for the reads and the update had its own connection. I managed to get a performance of 3545 rows per second. This was with no Field Options (so the schema keys were automatically used).


Well, I'm not surprised.  You had indicated that you've had no issues with updates.  And, certainly, those rates are very reasonable.  There is clearly something that I am missing.

 


@rhall wrote:

What sort of performance do you get on your system with the same configuration? Even if it is slower, do you see the same sort of performance improvements when you make the changes I have made?


The rates I'm seeing are so low that, whether I use the key checkbox or the Field Options, I don't see enough difference to determine if either is better than the other.  Both are < 100 rows/sec.

 

Another associate has begun working with Talend and I asked him to run this job.  He did so and got numbers similar to mine.  That would lead me to believe that it has to be something about the database.  But he also changed the job to point at an Oracle database and, though he saw a little better performance, he still saw < 500 rows/sec.  And when I run other statements in SQL Management Studio or run the Datastage job I originally started with, I see no obvious performance issues.  I just don't get it.

 

Thanks for your willingness to try to help me resolve this.

Anonymous
Not applicable
Author

With Oracle you will get very different tuning requirements. I've found that reads are often the bottleneck with Oracle. Using a cursor for reads can make a big difference....but you aren't really concerned about that right now.

 

One more thing I've thought of: have you tried compiling your job to a Jar (Build Job) and running it from the command line? You won't be able to see the rows per second as easily, but you will be able to tell from the running time whether it improves things. On some systems the Studio can be a bit slow. This can be because of virus software (scanning the Studio filesystem for changes), slow disk reads/writes (best to use an SSD), low memory (I insist upon at least 8GB, preferably 16GB) and non-i7 processors.

 

You can also try switching out the Jars for newer ones. I was using 6.5.1 so may very well have newer MS DB Jars, although this will be fiddly.

 

If you are assessing Talend for a potential purchase, I would advise raising this with their presales team. I have to admit that I have never seen the sort of performance you describe without having been able to improve it with a bit of tinkering. I used to work for Talend in professional services, so I have seen A LOT of examples of most of the supported databases. If you get a presales guy/girl on site, they should be able to find your issue with a little time tinkering.

 

One more performance thing: your Studio will have the bog-standard JVM settings, I imagine. You will probably want to improve those. Look for an ini file in your Studio folder; it will be named the same as the executable you use to start Studio. Change the contents of the ini file to something like this....

 

-vmargs
-Xms512m
-Xmx8192m
-Dfile.encoding=UTF-8
-Djava.net.preferIPv4Stack=true

You will see that the standard mem settings are pretty low.

 

 

Anonymous
Not applicable
Author

By adding "sendStringParametersAsUnicode=false" to the Advanced Settings of the update component, it now produces an sp_prepare using varchar(8000) instead of nvarchar(4000), and the performance difference was dramatic -- almost 4k rows/sec instead of a couple hundred.  The problem is that the clustered key index in SQL Server is varchar(), so it doesn't match an nvarchar() parameter value and forces a table scan.
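In raw JDBC terms the workaround is equivalent to appending the property to the connection string; this is only a sketch with placeholder connection details (in the Studio the property was added via the component's Advanced Settings, as described above):

import java.sql.Connection;
import java.sql.DriverManager;

public class UnicodeParamSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder host/database/credentials -- only the driver property matters here.
        // With sendStringParametersAsUnicode=false the driver sends varchar parameters
        // instead of nvarchar, so the varchar() clustered index can be used for a seek
        // instead of being bypassed by a table scan.
        String url = "jdbc:sqlserver://dbhost:1433;databaseName=ODS;"
                + "sendStringParametersAsUnicode=false";
        try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
            // ... run the prepared UPDATE statements as before ...
        }
    }
}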

 

The engineer assigned to the defect I opened (TDI-40281) was able to confirm the performance issue.

 

On a side note:  It isn't clear to me why the component wouldn't use the size of the schema columns for the parameters, but that probably doesn't matter much.

 

Herb_Appirio
Contributor

I have a MUCH SIMPLER job and I am having the same performance issue with tMSSqlOutput doing a straight Insert into a new table -- it writes the first 10,000 rows fairly fast, and then slows to a crawl. I initially saw this problem in a more complex job, so I built two very simple jobs to try to find the cause/bottleneck.

 

Simple Job properties:

  • Talend-Studio v6.5.1 on local Windows 10 PC
  • Sql Server 2016 running on an AWS EC2 (connect using TCP connection to IP address w/ Sql Authentication)
  • Table of Account data: 69,000 rows and about 15 fields (address fields and few string attributes)
  • Sql Connection stored in Repository (both sql components below using this Repository connection)
  • Action 1:  tMSSqlInput - Read all fields all rows from the above table, using "Microsoft JDBC"
  • Action 2: tMSSqlOutput - Create a new table (drop if exist and create), doing Insert; no Keys defined, default settings commit every 10K, using "Microsoft JDBC"

Comparison:

  • the input Read operation by itself takes 7 seconds to read 69,000 rows == good (10K rows/sec)
  • Trial 1:  wire the tMSSqlInput data straight into tMSSqlOutput  (or through tMap) -- writes 10,000 rows in a few seconds, then slows to a CRAWL, finishes in 45 minutes (avg 25 rows/sec)
  • Trial 2:  change the "commit every" settings, make it 1,000, etc. -- it writes only about 24~28 rows/sec
  • Trial 3:  separate the tMSSqlInput and tMSSqlOutput components into 2 subjobs -- the 1st subjob reads tMSSqlInput and writes to a CSV file (tFileOutputDelimited), then triggers On Component Ok --> the 2nd subjob reads the CSV file (tFileInputDelimited) and writes to tMSSqlOutput -- this completes the read in 7 sec (10K rows/s = good) and the write/insert step takes only 1 minute (1000 rows/s = decent)

As you can see, I am doing nothing special with the data: read from a table, write to a new, different table, no keys, no transformation, just a straight read then write of the data as-is. SOMETHING IS NOT RIGHT with the tMSSqlOutput component; it should not take longer to read and write directly to/from SQL Server than it takes to write the data to a CSV file in the middle and re-read it.