tDataShuffling - improving performance

Anonymous — Tue, 27 Nov 2018 14:02:23 GMT

Hi,

I am using the tDataShuffling component to shuffle a column that is 8 characters long, partitioned on the first 3 characters of the column, e.g.:

SELECT field_1, substr(field_1, 1, 3) from table_name;

shuffle column value: py13456

partition column value: py1
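
To make the intent concrete, here is a minimal Python sketch of what shuffling within partitions means: values are only swapped among rows that share the same 3-character prefix. This is an illustration of the general idea, not Talend's implementation. The function name `shuffle_within_partitions` and every sample value except py13456 are made up for the example.

```python
import random
from collections import defaultdict


def shuffle_within_partitions(values, key_len=3, seed=12345678):
    """Shuffle values only among rows that share the same key prefix.

    Hypothetical helper for illustration; not a Talend API.
    """
    rng = random.Random(seed)  # fixed seed makes the shuffle repeatable

    # Collect the row positions that belong to each partition key.
    positions = defaultdict(list)
    for i, value in enumerate(values):
        positions[value[:key_len]].append(i)

    result = list(values)
    for idx in positions.values():
        # Shuffle the values of this partition, then write them back
        # into the same row positions.
        group = [values[i] for i in idx]
        rng.shuffle(group)
        for i, value in zip(idx, group):
            result[i] = value
    return result


# Sample data: only py13456 comes from the post, the rest is invented.
rows = ["py13456", "py19999", "py20001", "py10002", "py27777"]
shuffled_rows = shuffle_within_partitions(rows)
```

Every output row keeps the prefix it had on input (py1 stays with py1, py2 with py2), and running it again with the same seed gives the same result.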

This runs very slowly, at about 3 rows/s.

The table has around 6 million records. The buffer size of the tDataShuffling component is 100000 and the seed is set to 12345678.
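
To put the numbers in perspective, here is a quick back-of-the-envelope calculation in Python using the figures above (about 6 million rows at about 3 rows/s). The one-hour target is only a hypothetical reference point, not a stated requirement.

```python
TOTAL_ROWS = 6_000_000      # approximate table size from the post
OBSERVED_RATE = 3           # rows per second, as observed

# Time needed at the observed rate.
seconds_needed = TOTAL_ROWS / OBSERVED_RATE
days_needed = seconds_needed / 86_400  # 86,400 seconds in a day

# Rate needed to finish within a hypothetical one-hour window.
TARGET_SECONDS = 3_600
required_rate = TOTAL_ROWS / TARGET_SECONDS

print(f"{seconds_needed:,.0f} s, roughly {days_needed:.1f} days at {OBSERVED_RATE} rows/s")
print(f"{required_rate:,.0f} rows/s needed to finish in one hour")
```

At the observed rate the full table would take on the order of weeks, so the job needs to run several hundred times faster to be practical.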

At the job level I have enabled multi-thread execution with Parallelize Buffer Unit Size set to 25000.

Kindly suggest ways to improve the performance of this component.

Thanks.

Re: tDataShuffling - improving performance

Anonymous — Tue, 27 Nov 2018 15:08:09 GMT

Hi,

I have tried the below:

Cursor: 100000 for tDBInput

rownum < 100000

Max heap size set to 2048M at the job level (Job Run JVM settings)

Is there anything I could do at the tDataShuffling component level?

Could you kindly reply?

Re: tDataShuffling - improving performance

Anonymous — Tue, 27 Nov 2018 15:22:05 GMT

Hi,

The Job flow has:

tDBInput (with cursor) ----> tDataShuffling ----> tDBOutput (update operation)

Db input component Cursor: 100000

Db input query Rownum: 100000

Shuffling Buffer size: 100000

Job Multi thread Parallelize Buffer Unit Size: 25000

Job Min heap space: -Xms1024M

Job Max heap space: -Xmx4096M

Re: tDataShuffling - improving performance

Anonymous — Tue, 27 Nov 2018 16:23:32 GMT

I have used Db output Batch size: 50000.

The job has been running for more than 30 minutes and has not completed.

Would caching the data with tHashOutput before tDataShuffling improve performance?

Re: tDataShuffling - improving performance

Anonymous — Wed, 28 Nov 2018 06:31:53 GMT

Hello,

Would you mind posting screenshots of your current job design on the forum? They would help us understand your workflow.

Best regards

Sabrina