Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Is Talend making any effort to improve Talend performance with MS SQL Server?
Having searched this community and old talendforge forums entries, it appears there are performance issues with Talend using SQL Server. I found this article (Data loading from MS SQL Server DB to other MS SQL Server DB is very low compared to SSIS) which confirms my own experience in comparing Talend to Datastage (2005 Server Edition -- an old, no longer supported version).
I built a Talend job to mirror what the old Datastage job is doing -- syncing content between two tables: a source table on one SQL Server and a target table on another SQL Server. The data is about 220k rows. The Datastage jobs does this (with MSOLEDB stage) in < 2 minutes with 2238 rows/sec. Talend is poking along (with tMSSqlOutput component) and barely able to make less than 30 rows/sec! Really?? Almost 100x slower? What the heck is it doing?
I tried changing the batch size, changing to use a component connection and changing the Commit Every setting; tried using select top to limit rows (just to see the impact). None of these made significant differences in performance. Still achieves < 30 rows/sec.
I can help you here. On your workspace you will see a a tab called "Code" in the bottom left corner. Click on that and you will see the code generated. It is a bit messy, but you can learn a lot from it. Now your update component will have a name something like ....
tMSSqlOutput_
...after the _ will be a number. You can see the name when you click on the component and then click on the component tab. Lets assume your component is numbered 1. What you want to search for when looking in the code tab is for a code like this....
String update_tMSSqlOutput_1 =
This will take you to where your update query is built as a String.
I selected the component and then went to the "Code Viewer" window. I didn't see any way to search there so I selected all the content and pasted it into notepad++. I see references for "whetherReject_tMSSqlOutput_2", "pstmt_tMSSqlOutput_2", "updatedCount_tMSSqlOutput_2", and several others but no "update_tMSSqlOutput_2". So I checked my "insert" path (tMSSqlOutput_1) and had the same result -- lots of references but no "insert_tMSSqlOutput_1". (Yes, I was searching case-insensitive.) I am clearly missing something...
I can see where it uses "pstmt_tMSSqlOutput_2" as an arg of type java.sql.PreparedStatement, and can see where it builds that variable, but it only shows the column values being added -- I assume as parameters to the statement.
/** * [tMSSqlOutput_2 main ] start */ currentComponent="tMSSqlOutput_2"; if(log.isTraceEnabled()){ log.trace("row17 - " + (row17==null? "": row17.toLogString())); } whetherReject_tMSSqlOutput_2 = false; if(row17.customer_name == null) { pstmt_tMSSqlOutput_2.setNull(1, java.sql.Types.VARCHAR); } else {pstmt_tMSSqlOutput_2.setString(1, row17.customer_name); } if(row17.ship_address == null) { pstmt_tMSSqlOutput_2.setNull(2, java.sql.Types.VARCHAR); } else {pstmt_tMSSqlOutput_2.setString(2, row17.ship_address); } if(row17.ship_postalcode == null) { pstmt_tMSSqlOutput_2.setNull(3, java.sql.Types.VARCHAR); } else {pstmt_tMSSqlOutput_2.setString(3, row17.ship_postalcode); } if(row17.ship_city == null) { pstmt_tMSSqlOutput_2.setNull(4, java.sql.Types.VARCHAR); } else {pstmt_tMSSqlOutput_2.setString(4, row17.ship_city); } if(row17.ship_county == null) { pstmt_tMSSqlOutput_2.setNull(5, java.sql.Types.VARCHAR); } else {pstmt_tMSSqlOutput_2.setString(5, row17.ship_county); } if(row17.ship_state == null) { pstmt_tMSSqlOutput_2.setNull(6, java.sql.Types.VARCHAR); } else {pstmt_tMSSqlOutput_2.setString(6, row17.ship_state); } if(row17.ship_country == null) { pstmt_tMSSqlOutput_2.setNull(7, java.sql.Types.VARCHAR); } else {pstmt_tMSSqlOutput_2.setString(7, row17.ship_country); } if(row17.mktsegcode == null) { pstmt_tMSSqlOutput_2.setNull(8, java.sql.Types.VARCHAR); } else {pstmt_tMSSqlOutput_2.setString(8, row17.mktsegcode); } if(row17.crc == null) { pstmt_tMSSqlOutput_2.setNull(9, java.sql.Types.INTEGER); } else {pstmt_tMSSqlOutput_2.setLong(9, row17.crc); } if(row17.cmf == null) { pstmt_tMSSqlOutput_2.setNull(10 + count_tMSSqlOutput_2, java.sql.Types.VARCHAR); } else {pstmt_tMSSqlOutput_2.setString(10 + count_tMSSqlOutput_2, row17.cmf); } pstmt_tMSSqlOutput_2.addBatch(); nb_line_tMSSqlOutput_2++; if(log.isDebugEnabled()) log.debug("tMSSqlOutput_2 - " + ("Adding the record ") + (nb_line_tMSSqlOutput_2) + (" to the ") + ("UPDATE") + (" batch.") ); batchSizeCounter_tMSSqlOutput_2++; //////////batch execute by batch size/////// class LimitBytesHelper_tMSSqlOutput_2{ public int limitBytePart1(int counter,java.sql.PreparedStatement pstmt_tMSSqlOutput_2) throws Exception { try { if(log.isDebugEnabled()) log.debug("tMSSqlOutput_2 - " + ("Executing the ") + ("UPDATE") + (" batch.") ); for(int countEach_tMSSqlOutput_2: pstmt_tMSSqlOutput_2.executeBatch()) { if(countEach_tMSSqlOutput_2 == -2 || countEach_tMSSqlOutput_2 == -3) { break; } counter += countEach_tMSSqlOutput_2; } if(log.isDebugEnabled()) log.debug("tMSSqlOutput_2 - " + ("The ") + ("UPDATE") + (" batch execution has succeeded.") ); }catch (java.sql.BatchUpdateException e){ int countSum_tMSSqlOutput_2 = 0; for(int countEach_tMSSqlOutput_2: e.getUpdateCounts()) { counter += (countEach_tMSSqlOutput_2 < 0 ? 0 : countEach_tMSSqlOutput_2); } log.error("tMSSqlOutput_2 - " + (e.getMessage()) ); System.err.println(e.getMessage()); } return counter; } public int limitBytePart2(int counter,java.sql.PreparedStatement pstmt_tMSSqlOutput_2) throws Exception { try { if(log.isDebugEnabled()) log.debug("tMSSqlOutput_2 - " + ("Executing the ") + ("UPDATE") + (" batch.") ); for(int countEach_tMSSqlOutput_2: pstmt_tMSSqlOutput_2.executeBatch()) { if(countEach_tMSSqlOutput_2 == -2 || countEach_tMSSqlOutput_2 == -3) { break; } counter += countEach_tMSSqlOutput_2; } if(log.isDebugEnabled()) log.debug("tMSSqlOutput_2 - " + ("The ") + ("UPDATE") + (" batch execution has succeeded.") ); }catch (java.sql.BatchUpdateException e){ for(int countEach_tMSSqlOutput_2: e.getUpdateCounts()) { counter += (countEach_tMSSqlOutput_2 < 0 ? 0 : countEach_tMSSqlOutput_2); } log.error("tMSSqlOutput_2 - " + (e.getMessage()) ); System.err.println(e.getMessage()); } return counter; } } if ((batchSize_tMSSqlOutput_2 > 0) && (batchSize_tMSSqlOutput_2 <= batchSizeCounter_tMSSqlOutput_2)) { updatedCount_tMSSqlOutput_2 = new LimitBytesHelper_tMSSqlOutput_2().limitBytePart1(updatedCount_tMSSqlOutput_2,pstmt_tMSSqlOutput_2); batchSizeCounter_tMSSqlOutput_2 = 0; } ////////////commit every//////////// commitCounter_tMSSqlOutput_2++; if(commitEvery_tMSSqlOutput_2 <= commitCounter_tMSSqlOutput_2) { if ((batchSize_tMSSqlOutput_2 > 0) && (batchSizeCounter_tMSSqlOutput_2 > 0)) { updatedCount_tMSSqlOutput_2 = new LimitBytesHelper_tMSSqlOutput_2().limitBytePart1(updatedCount_tMSSqlOutput_2,pstmt_tMSSqlOutput_2); batchSizeCounter_tMSSqlOutput_2 = 0; } if(log.isDebugEnabled()) log.debug("tMSSqlOutput_2 - " + ("Connection starting to commit ") + (commitCounter_tMSSqlOutput_2) + (" record(s).") ); conn_tMSSqlOutput_2.commit(); if(log.isDebugEnabled()) log.debug("tMSSqlOutput_2 - " + ("Connection commit has succeeded.") ); commitCounter_tMSSqlOutput_2=0; } tos_count_tMSSqlOutput_2++; /** * [tMSSqlOutput_2 main ] stop */
You should be able to switch to the code screen and hit ctrl+F to bring up a find box. You can edit the code, but you can search it. The code you copied and pasted isn't all of it for the tMSSqlOutput. There is a lot more. Here is an example I quickly knocked up. This isn't configured for use, but just dumped onto a job and configured to carry out an update. You will notice there is a section of code like this....
String update_tMSSqlOutput_1 = "UPDATE [" + tableName_tMSSqlOutput_1 + "] SET [newColumn] = ?,[newColumn1] = ?,[newColumn2] = ?,[newColumn3] = ?,[newColumn4] = ? WHERE [newColumn3] = ? AND [newColumn4] = ?";
I've highlighted in bold what you should search for in your job. I assume you are using Talend 6.something? If so this code will be there if you have your component configured for an update.
@rhallwrote:
You should be able to switch to the code screen and hit ctrl+F to bring up a find box.
Must be a difference between Windows and Mac. On Mac, it would be command-F to bring up search -- which works in most places but nothing happens in Talend. I tried control-F, also, but no response.
@rhallwrote:
The code you copied and pasted isn't all of it for the tMSSqlOutput.
The code I pasted is all that appears in the Code Viewer window when I have the tMSSqlOutput_2 job selected. You can see the start and stop comments from top and bottom of what displays in the viewer -- pasted again, below. Maybe another difference between Windows and Mac? Hmmm... I'm going to use SQL Server Profiler to try and look at the SQL coming in. But I'd really like to know why I don't see what you see.
/** * [tMSSqlOutput_2 main ] start */
...
/** * [tMSSqlOutput_2 main ] stop */
I was using a Mac at the time but was assuming you were using Windows when I suggested that.
I've just realised what you are doing wrong. I was not talking about the Code Viewer (its useless in my eyes). I was talking about about the "Code" tab at the bottom left of the design window....
Click that and you will see the whole code for the job. It is also useful when looking for compilation errors or trying to identify errors in error stacks. In fact, you should be spending a lot of time looking at this to quickly debug issues.
Ah, yes, that is more helpful. I agree about the "Code Viewer" window -- seems pointless given the ability to view the "Code" tab of the designer.
Clearly the key column is being used in the WHERE clause -- it appears in the Talend statement:
String update_tMSSqlOutput_2 = "UPDATE [" + tableName_tMSSqlOutput_2 + "] SET [customer_name] = ?,[ship_address] = ?,[ship_postalcode] = ?,[ship_city] = ?,[ship_county] = ?,[ship_state] = ?,[ship_country] = ?,[mktsegcode] = ?,[crc] = ? WHERE [cmf] = ?";
And in the resulting line shown in SQL Profiler:
exec sp_prepare @p1 output,N'@P0 nvarchar(4000),@P1 nvarchar(4000),@P2 nvarchar(4000),@P3 nvarchar(4000),@P4 nvarchar(4000),@P5 nvarchar(4000),@P6 nvarchar(4000),@P7 nvarchar(4000),@P8 bigint,@P9 nvarchar(4000)',N'UPDATE [dbo].[customer] SET [customer_name] = @P0 ,[ship_address] = @P1 ,[ship_postalcode] = @P2 ,[ship_city] = @P3 ,[ship_county] = @P4 ,[ship_state] = @P5 ,[ship_country] = @P6 ,[mktsegcode] = @P7 ,[crc] = @P8 WHERE [cmf] = @P9 ',1
Which is followed by (for each row):
exec sp_execute 1,...
NOTE: It is interesting that the sp_prepare shows nvarchar(4000) for every string parameter despite each column being defined explicitly as varchar(n) with an appropriate length. But this is probably a generic thing done for all parameters since it requires NVARCHAR.
I did find the article Watch out those prepared SQL statements. It indicates that use of sp_prepare/sp_execute is typical of ODBC, OLEDB, and, apparently, JDBC. But it also points out a potential issue with the sp_unprepare not always being triggered. I'm not sure of the impact of that but I wouldn't suspect it is related to what I am seeing.
I'm completely stumped at this point. I have no explanation for the poor performance compared to Datastage.
I wish I could offer more help, but without actually seeing the issue and being able to hack around, it is very hard. As my screenshots showed you, Talend is capable of much better performance than you are seeing for both inserts and updates. I'd actually be interested to try out Datastage to get some comparisons of my own (I came from a pure SQL with DB links ETL to Informatica background). If you have a Talend license, I would advise contacting Talend support and asking them for a WebEx session so that they can see exactly what you have and maybe have a bit of a look around.
I appreciate all you have contributed to this conversation. I'll end this thread for now until I (hopefully) come up with a resolution. Meanwhile, I'll close with this information:
Using SQL Profiler I traced each of the Talend and Datastage jobs and saved the trace which I include below. (I have included only a single sp_execute as they are all the same for each row except for the content being passed in.)
The only differences in their output was this:
Note that Datastage does reset the IMPLICIT_TRANSACTIONS to off and closes with sp_unprepare
Talend
-- network protocol: TCP/IP set quoted_identifier on set arithabort off set numeric_roundabort off set ansi_warnings on set ansi_padding on set ansi_nulls on set concat_null_yields_null on set cursor_close_on_commit off set implicit_transactions off set language us_english set dateformat mdy set datefirst 7 set transaction isolation level read committed SELECT @@MAX_PRECISION SET TRANSACTION ISOLATION LEVEL READ COMMITTED SET IMPLICIT_TRANSACTIONS OFF SET QUOTED_IDENTIFIER ON SET TEXTSIZE 2147483647 SET IMPLICIT_TRANSACTIONS ON declare @p1 int set @p1=1 exec sp_prepare @p1 output,N'@P0 nvarchar(4000),@P1 nvarchar(4000),@P2 nvarchar(4000),@P3 nvarchar(4000),@P4 nvarchar(4000),@P5 nvarchar(4000),@P6 nvarchar(4000),@P7 nvarchar(4000),@P8 bigint,@P9 nvarchar(4000)',N'UPDATE [dbo].[customer] SET [customer_name] = @P0 ,[ship_address] = @P1 ,[ship_postalcode] = @P2 ,[ship_city] = @P3 ,[ship_county] = @P4 ,[ship_state] = @P5 ,[ship_country] = @P6 ,[mktsegcode] = @P7 ,[crc] = @P8 WHERE [cmf] = @P9 ',1 select @p1 exec sp_execute 1,N'... IF @@TRANCOUNT > 0 COMMIT TRAN
Datastage
-- network protocol: TCP/IP set quoted_identifier on set arithabort off set numeric_roundabort off set ansi_warnings on set ansi_padding on set ansi_nulls on set concat_null_yields_null on set cursor_close_on_commit off set implicit_transactions off set language us_english set dateformat mdy set datefirst 7 set transaction isolation level read committed set implicit_transactions on declare @p1 int set @p1=1 exec sp_prepare @p1 output,N'@P1 varchar(50),@P2 varchar(240),@P3 varchar(10),@P4 varchar(60),@P5 varchar(60),@P6 char(2),@P7 char(4),@P8 varchar(2),@P9 bigint,@P10 char(8)',N'UPDATE customer SET customer_name=@P1,ship_address=@P2,ship_postalcode=@P3,ship_city=@P4,ship_county=@P5,ship_state=@P6,ship_country=@P7,mktsegcode=@P8,crc=@P9 WHERE cmf=@P10',1 select @p1 exec sp_execute 1,'... IF @@TRANCOUNT > 0 COMMIT TRAN set implicit_transactions off exec sp_unprepare 1
Not wishing to steal the last word, but I just had a thought which might help with your investigation. At the moment you are using the components in a way which lets them handle commits for you. You can take control of this and manually create a connection, perform your updates/inserts/deletes and then commit or rollback at the end. I mainly work in this way if I have to ensure a successful atomic transaction or rollback. But it *might* help with your problem. Take a look at this page which shows how to do this.....
https://help.talend.com/reader/iYcvdknuprDzYycT3WRU8w/tR5sAkr~80fPFhrYZNlNWQ
I originally had this job using a shared connection but had moved back from that to connections in the components in an effort to simplify the job and determine why it was slow. I'll keep in mind the option you have suggested here.