Skip to main content
Announcements
SYSTEM MAINTENANCE: Thurs., Sept. 19, 1 AM ET, Platform will be unavailable for approx. 60 minutes.
cancel
Showing results for 
Search instead for 
Did you mean: 
Nickster19
Contributor II
Contributor II

tPostgresqlInput Query Slow Through Talend, Yet Fast When Running on DB Visualizer

Hey all,

 

I have a tough question about one of my components that is not scaling well.  I am running a query using the tPostgresqlInput component.  The query, shown below, checks for circular references between an employee and the manager in a single table.  For testing purposes, I have been using a 25,000 record dataset.  When I run this query using PG Admin or DB Visualizer, it will take less than 2 seconds to run.  Additionally, using SAP BODS (our current ETL tool) the query will also run in less than 2 seconds.  However, when I run this query through tPostgresqlInput it will take over 5 minutes to run.

 

I have tried a few different things to speed this query up, but to no avail.  Inside Talend, I have tried adding additional memory on the JVM and running the subjob in parallel.  Comparing the run time of the query between Talend and at the database level proves that their is something hampering the performance when ran through Talend.  Any help understanding why Talend struggles with this query would be greatly appreciated.  We need to be able to scale this query to process up to 500,000 records.

 

Thank you for any help!  I truly appreciate it!

 

Running Job In DB Visualizer: 0.003 seconds

Running Query in Talend: 346 seconds

 

Query:

WITH RECURSIVE circular_managers(unique_id, mgr_unique_id, depth, path, cycle) AS (
SELECT u.unique_id, u.mgr_unique_id, 1,
ARRAY[u.unique_id || '']::varchar[],
false
FROM table1 u
WHERE u.unique_id IS NOT NULL
UNION ALL
SELECT u.unique_id, u.mgr_unique_id, cm.depth + 1,
path || u.unique_id,
u.unique_id = ANY(path)
FROM table1 u, circular_managers cm
WHERE u.unique_id = cm.mgr_unique_id
AND u.unique_id IS NOT NULL
AND NOT cycle
)
SELECT
depth,
array_to_string(path, ' > ') circular_managers
FROM circular_managers
WHERE cycle
AND array_to_string(path, ' > ') NOT LIKE ' >%'
AND array_to_string(path, ' > ') NOT LIKE '% > > %'
AND path[1] = path[array_upper(path, 1)]
group by 1,2

 

Labels (2)
3 Replies
Anonymous
Not applicable

Hello,

 What does your ETL job look like? Are you able to run your job successfully in talend studio?

 Could you please clarify in which Talend version/edition you are?

Best regards

Sabrina

Nickster19
Contributor II
Contributor II
Author

Sabrina,

 

I am using Enterprise Edition, version is 6.4.1.

 

The job will run successfully, but when I run it for larger datasets, it is exponentially slower.  For example, a 5,000 record dataset takes 2 seconds to run.  A 25,000 record dataset takes 5 minutes to run.  A 100,000 record dataset has taken over 3 hours to run.  When I run the input query using PG Admin or DB Visualizer, the query takes less than a second.

 

Using different logging options, I can see the component queryCircRef is the component taking a long time to run.  I have also tried creating a custom function at the database level to run, but see the same results.  Running the custom function at the db level is quick, but run it on Talend and it is super slow.

 

Thanks for your help!

 

 


Long Running Job.PNG
Anonymous
Not applicable

Nickster19, Were you able to find any solution to the performance issue? We are having the same issue extracting data from PostgreSQL database. After enabling parallelization at subjob level and increasing JVM -Xmx to almost 18 GB, the job still runs at 70 rows/sec for each thread - which comes to around 350 rows/second.

 

Thanks

Raj