You may have faced Job performance challenges. Poor performance can stem from multiple factors: poor Job design, limited hardware, network conditions, and so on. So where should you start looking for improvements? Start with benchmarking.
This article explains how to measure throughput between Talend and an MS SQL Server database on the Azure Cloud. The results can be used as a benchmark to compare the performance of similar Jobs, and can help pinpoint an issue to a specific Job or environment. Talend recommends creating a baseline of environment performance before building complex Jobs. Benchmarking helps you establish the optimum performance of the environment and makes future performance issues easier to troubleshoot.
| Environment component | Configuration |
| --- | --- |
| Talend Studio | Version 6.4.1 |
| Azure SQL database | Standard 100 DTU (S3), 200 GB storage |
| Network | No proxy or VPN connection |
This article uses two Jobs to demonstrate insert and update performance, respectively, in batch mode. Job 1 creates a new table and inserts 100,000 records. Job 2 randomly updates 20,000 of the 100,000 records inserted by Job 1.
- Inserts: observed throughput of ~1,300 rows/sec
- Updates: observed throughput of ~1,200 rows/sec
Note: When comparing the performance of a real Job with this benchmark, remember that factors such as Job design, hardware configuration, and network conditions each directly affect Job performance.
Job 1 creates a connection to the MS SQL Server database on Azure Cloud and creates a new table in the database. Once the table is created, the Job inserts 100,000 records into the table in 72 seconds, then commits those records to the database table. Note the order of the OnSubjobOK links; it matters, because it ensures the data is persisted in the table.
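As a sanity check, 100,000 rows in 72 seconds works out to roughly 1,389 rows/sec, in line with the ~1,300 rows/sec insert figure above.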
The Job 1 table schema has nine columns, and records basic information such as first name, last name, and address. It uses a tRowGenerator component to generate random values for these attributes. Notice that idpk is the primary key column. The Action column can have a value of I for an inserted row, or a value of U for an updated row. This helps to identify the updated rows in the table.
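For reference, a Talend Job compiles to Java, and the tRowGenerator logic for this schema is roughly equivalent to the sketch below. The column values other than idpk and the Action flag are illustrative assumptions, not the exact schema from the Job:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Minimal sketch of what tRowGenerator produces for this benchmark.
// Names and addresses are illustrative assumptions.
public class RowGeneratorSketch {
    static final String[] FIRST = {"Ann", "Bob", "Carol", "Dan"};
    static final String[] LAST  = {"Smith", "Jones", "Lee", "Brown"};

    public static void main(String[] args) {
        Random rnd = new Random();
        List<Object[]> rows = new ArrayList<>();
        for (int idpk = 1; idpk <= 100_000; idpk++) {
            rows.add(new Object[] {
                idpk,                             // primary key
                FIRST[rnd.nextInt(FIRST.length)], // first name
                LAST[rnd.nextInt(LAST.length)],   // last name
                rnd.nextInt(9999) + " Main St",   // address
                "I"                               // Action flag: I = inserted row
            });
        }
        System.out.println("Generated " + rows.size() + " rows");
    }
}
```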
Select Repository to connect to the Azure Cloud MS SQL Server.
On the Advanced settings tab, make sure Auto Commit is not selected. The auto commit function commits each SQL statement as a single transaction immediately after the statement is executed, while the commit component does not commit until all the statements are executed. For batch operations, it is better to use the commit component explicitly.
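In plain JDBC, which Talend uses underneath, the difference looks roughly like this sketch; the connection string, credentials, and demo_table name are placeholders, and the two-column insert is a simplification of the nine-column schema:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CommitSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string; substitute your Azure SQL server, database, and credentials.
        String url = "jdbc:sqlserver://<server>.database.windows.net:1433;databaseName=<db>";
        try (Connection conn = DriverManager.getConnection(url, "<user>", "<password>")) {
            conn.setAutoCommit(false); // equivalent to clearing Auto Commit on the connection component
            try (Statement st = conn.createStatement()) {
                st.executeUpdate("INSERT INTO demo_table (idpk, action) VALUES (1, 'I')");
                st.executeUpdate("INSERT INTO demo_table (idpk, action) VALUES (2, 'I')");
            }
            conn.commit(); // one transaction for all statements, like the commit component
        }
    }
}
```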
The tMSSqlOutput component inserts the records into the database.
Select the Use an existing connection check box to reuse the connection details you already defined.
On the Advanced settings tab, select the Use Batch Size check box and define the Batch Size as 10000. This activates the batch mode for inserting data.
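Batch mode corresponds to JDBC prepared-statement batching. A minimal sketch, assuming the hypothetical demo_table schema and placeholder connection details from above:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchInsertSketch {
    private static final int BATCH_SIZE = 10_000;

    public static void main(String[] args) throws Exception {
        String url = "jdbc:sqlserver://<server>.database.windows.net:1433;databaseName=<db>";
        try (Connection conn = DriverManager.getConnection(url, "<user>", "<password>")) {
            conn.setAutoCommit(false);
            String sql = "INSERT INTO demo_table (idpk, first_name, last_name, address, action)"
                       + " VALUES (?, ?, ?, ?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                for (int idpk = 1; idpk <= 100_000; idpk++) {
                    ps.setInt(1, idpk);
                    ps.setString(2, "first_" + idpk); // the real Job uses tRowGenerator values
                    ps.setString(3, "last_" + idpk);
                    ps.setString(4, idpk + " Main St");
                    ps.setString(5, "I");
                    ps.addBatch();
                    if (idpk % BATCH_SIZE == 0) {
                        ps.executeBatch(); // one round trip per 10,000 rows
                    }
                }
                ps.executeBatch(); // flush any remainder
            }
            conn.commit(); // single commit at the end, as with the commit component
        }
    }
}
```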
Because it uses a single shared connection, the commit component commits everything in one transaction instead of committing every row or every batch, which improves performance.
Select Close Connection to close the database connection once the commit is done.
Job 2 creates a connection to the MS SQL Server database on Azure Cloud, and its tRowGenerator component generates 20,000 rows with a random IDPK between 1 and 100,000. The Action column value is set to U for update. As in Job 1, the order of the OnSubjobOK links matters for the updates to persist in the database.
Select Repository to connect to the Azure Cloud MS SQL Server.
On the Advanced settings tab, make sure Auto Commit is not selected. The auto commit function commits each SQL statement as a single transaction immediately after the statement is executed, while the commit component does not commit until all the statements are executed. For batch operations, it is better to use the commit component explicitly.
In this configuration, the tMSSqlOutput component finds and updates the matching rows in the database table.
On the Advanced settings tab, select Use Batch Size, and define Batch Size as 10000. This activates the batch mode for updating data.
When updating the table, it is important to select the clustered index column as the Key. Selecting the wrong column as the key results in a table scan, which in turn causes a significant decrease in performance. Additionally, to improve update performance, submit only the columns that are being updated. This reduces the row payload and increases throughput.
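The update path can be sketched the same way in JDBC: the statement keys on idpk (the clustered primary key) and submits only the columns assumed to change here (address and the Action flag); table and connection details remain placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Random;

public class BatchUpdateSketch {
    private static final int BATCH_SIZE = 10_000;

    public static void main(String[] args) throws Exception {
        String url = "jdbc:sqlserver://<server>.database.windows.net:1433;databaseName=<db>";
        Random rnd = new Random();
        try (Connection conn = DriverManager.getConnection(url, "<user>", "<password>")) {
            conn.setAutoCommit(false);
            // Key on the clustered primary key and send only the changed columns.
            String sql = "UPDATE demo_table SET address = ?, action = ? WHERE idpk = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                for (int i = 1; i <= 20_000; i++) {
                    ps.setString(1, rnd.nextInt(9999) + " Oak Ave");
                    ps.setString(2, "U");                   // mark the row as updated
                    ps.setInt(3, rnd.nextInt(100_000) + 1); // random IDPK in 1..100,000
                    ps.addBatch();
                    if (i % BATCH_SIZE == 0) {
                        ps.executeBatch();
                    }
                }
                ps.executeBatch();
            }
            conn.commit();
        }
    }
}
```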
Clustered indexes sort and store the data rows in the table or view based on their key values. There can be only one clustered index per table because the data rows themselves can be sorted in only one order.
Nonclustered indexes have a structure separate from the data rows. If a table has no clustered index, its data rows are stored in an unordered structure called a heap.
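To make the distinction concrete, the following sketch creates the hypothetical demo_table with a clustered primary key on idpk and adds a separate nonclustered index; the index name and the choice of indexed column are illustrative:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class IndexSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:sqlserver://<server>.database.windows.net:1433;databaseName=<db>";
        try (Connection conn = DriverManager.getConnection(url, "<user>", "<password>");
             Statement st = conn.createStatement()) {
            // The primary key becomes the clustered index: rows are stored in idpk order.
            st.execute("CREATE TABLE demo_table ("
                     + " idpk INT NOT NULL PRIMARY KEY CLUSTERED,"
                     + " first_name VARCHAR(50), last_name VARCHAR(50),"
                     + " address VARCHAR(100), action CHAR(1))");
            // A nonclustered index keeps its own structure, separate from the data rows.
            st.execute("CREATE NONCLUSTERED INDEX ix_demo_lastname ON demo_table (last_name)");
        }
    }
}
```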