Anonymous

Performance in Jobs

Hello Community,

I'm having serious issues with a job.

Facts:

* Table rows in DB: ~2 million
* SELECT * FROM the table: approx. 15 seconds runtime in MSSQL Management Studio
* The job gets 8 GB of RAM
* -Dfile.encoding=UTF-8
* Java version 7, update 79
* Talend version: 5.2.2 - Build id: r99271-20130307-0242
* It's a job for a dimension
* I have other jobs that run at 60k+ rows/second

The job starts at 1,500 rows/second, immediately loses 50%, and then steadily drops until it hits ~100 rows/second.

What I've tried so far:

* setting an offset with a limit of 10k / 100k rows
* removing nearly all of the selected fields from the statement
* removing nearly all of the fields from the tMap component
* adding new indexes to my DB table
* splitting the job into smaller pieces (200k rows per job)

Any idea how I can "debug" this problem - to find and resolve the bottleneck?


I'd appreciate any help - my boss is starting to give me a hard time.


Attachment: D_ARTIKEL_… .zip

3 Replies
Anonymous
Author

First, are you sure the bottleneck is the SCD component? One way to confirm this is to replace it with a tJavaRow that does nothing -- if the job then runs fast, you can be sure the problem is on the output side.

Once you've confirmed the SCD component is the issue, start looking at the queries it sends to the DB. From the way you describe the performance (starts fast, then slows down), it feels like a database issue. You may notice that the indexes you created earlier are not actually used by these queries.
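For example, capture one of the lookup queries the component fires and run it by hand in Management Studio -- roughly like this (the table and column names below are placeholders I made up, not taken from your job):

-- Placeholder names: dbo.D_ARTIKEL = the dimension, SK_ARTIKEL = surrogate key,
-- ARTIKEL_NR = business key, ACTIVE = current-row flag.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT SK_ARTIKEL, ACTIVE
FROM   dbo.D_ARTIKEL
WHERE  ARTIKEL_NR = 4711   -- the business key the SCD lookup probes for each incoming row
  AND  ACTIVE = 1;

-- Also enable "Include Actual Execution Plan" (Ctrl+M) before running.
-- You want an Index Seek on the business-key index here; a scan over ~2 million rows for
-- every incoming record would explain why the job gets slower and slower as the table fills.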
If you've got good indexes (you can confirm by capturing some queries and checking their plans as sketched above -- EXPLAIN on other databases, the execution plan on SQL Server) and it's still too slow, you may need to implement custom SCD logic in your job to get further performance out of it. That means that instead of using the SCD component, you replicate its logic with direct SQL statements that you write and can optimize yourself. There are many good techniques for building a slowly changing dimension quickly; here are some I've used in the past for dimension tables in the >100 GB range.
  • create an insert-only process: by joining your incoming data with your existing dimension, build a temporary dimension table, then swap the "real" one for the temporary one. This can be more efficient than direct UPDATE statements when your tables are big.

  • separate inserts and updates into two different steps: first process all your inserts, then run a single UPDATE query to set the ACTIVE column for the appropriate records. This helps because databases are generally faster when you modify as many rows as possible in a single query. (A rough T-SQL sketch of both ideas follows below.)
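To make that a bit more concrete, here is a rough T-SQL sketch of the second point (every table and column name is a placeholder I made up -- dbo.STG_ARTIKEL stands for a staging table loaded by your job, dbo.D_ARTIKEL for the dimension, ARTIKEL_NR for the business key):

-- Step 1 (insert-only): add one new active version for every incoming row that is
-- either brand new or differs from the currently active version.
INSERT INTO dbo.D_ARTIKEL (ARTIKEL_NR, BEZEICHNUNG, VALID_FROM, ACTIVE)
SELECT s.ARTIKEL_NR, s.BEZEICHNUNG, GETDATE(), 1
FROM   dbo.STG_ARTIKEL AS s
LEFT JOIN dbo.D_ARTIKEL AS d
       ON d.ARTIKEL_NR = s.ARTIKEL_NR AND d.ACTIVE = 1
WHERE  d.ARTIKEL_NR IS NULL              -- brand-new article
   OR  d.BEZEICHNUNG <> s.BEZEICHNUNG;   -- changed article -> new version

-- Step 2 (one set-based update): close every superseded version in a single statement
-- instead of firing one UPDATE per incoming row.
UPDATE old
SET    ACTIVE = 0
FROM   dbo.D_ARTIKEL AS old
JOIN   dbo.D_ARTIKEL AS newer
       ON  newer.ARTIKEL_NR = old.ARTIKEL_NR
       AND newer.ACTIVE = 1
       AND newer.VALID_FROM > old.VALID_FROM
WHERE  old.ACTIVE = 1;

-- For the first point you would write the same result into dbo.D_ARTIKEL_TMP instead of the
-- live table and swap it in afterwards, e.g. EXEC sp_rename 'dbo.D_ARTIKEL', 'D_ARTIKEL_OLD';
-- followed by EXEC sp_rename 'dbo.D_ARTIKEL_TMP', 'D_ARTIKEL';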

Anonymous
Author

Thank you for your response. I will try these methods in the future.

Right now, I have "solved" the problem by adding more JVM arguments:

-Xms raised from 512M to 8192M (same as -Xmx)
-XX:+UseParallelGC enabled

Voilà, the job now runs at 10k+ rows/sec.
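For reference, the complete set of JVM arguments the job now uses (in my Studio these go into the Run view > Advanced settings > "Use specific JVM arguments" box; the encoding flag was already set before):

-Xms8192M
-Xmx8192M
-XX:+UseParallelGC
-Dfile.encoding=UTF-8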
Anonymous
Author

Okay... new info:

Same server, different databases (DWH-Test / DWH-Prod), same settings.

On my test DB the job runs at 10k+ rows/sec; the table was truncated before filling it.

On my prod DB the job runs at 500 rows/sec; the table was not truncated beforehand.

Difference: my prod DB still contains some old records from MS SQL Server 2008, but the server is now MS SQL Server 2014.

COULD this be the problem / the reason for it being slow? Because of old datasets from an older SQL Server version?
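For what it's worth, this is how I plan to check whether the prod database is still running at the old compatibility level after the upgrade and whether its statistics are stale (database names written without the hyphen just for the example):

-- A database restored from SQL Server 2008 keeps compatibility level 100 unless it is raised manually.
SELECT name, compatibility_level
FROM   sys.databases
WHERE  name IN (N'DWH_Test', N'DWH_Prod');

-- Refresh the optimizer statistics on the old, never-truncated prod tables.
USE DWH_Prod;
EXEC sp_updatestats;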