<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How can a large table be processed efficiently? in Talend Studio</title>
    <link>https://community.qlik.com/t5/Talend-Studio/How-can-a-large-table-be-processed-efficiently/m-p/2317618#M88045</link>
    <description>Hi Tom,&lt;BR /&gt;1) The bulk DB components are designed for high-volume loads; they write the data to a temporary file and then bulk-load it.&lt;BR /&gt;    tInfoBrightOutput is a custom component. Its performance may be similar to that of tMysqlOutput (inefficient).&lt;BR /&gt;2) If the table has a sequential primary key, you can restrict the query of tMysqlInput:&lt;BR /&gt;    &lt;PRE&gt;        ....  where id &amp;lt; 1000000&lt;/PRE&gt;&lt;BR /&gt;Regards,&lt;BR /&gt;Pedro</description>
    <pubDate>Fri, 10 Feb 2012 06:42:43 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2012-02-10T06:42:43Z</dc:date>
    <item>
      <title>How can a large table be processed efficiently?</title>
      <link>https://community.qlik.com/t5/Talend-Studio/How-can-a-large-table-be-processed-efficiently/m-p/2317617#M88044</link>
      <description>Hi,
&lt;BR /&gt;I have a challenge (or two) that someone can hopefully help me out with.
&lt;BR /&gt;I have a job that reads a table of approximately 20 million rows (via a tMysqlInput component) and writes those rows to a table that resides in another database. A tInfoBrightOutput component is used for the table being loaded.
&lt;BR /&gt;When the job ran, it failed with an out-of-memory error. The 'Enable stream' option on the tMysqlInput component was then activated to get around the out-of-memory error. This worked, BUT the job took about 5 hours to load the 20 million rows. A similar test using a tMysqlOutputBulkExec component took 15 minutes to run.
&lt;BR /&gt;With the above in mind, there are a couple of things I would like to understand.
&lt;BR /&gt;1) Are there any obvious settings that should be put in place when using the tInfoBrightOutput component? The difference in the time it takes to load the same table using different component types is so glaring that I suspect I am overlooking something.
&lt;BR /&gt;2) Is there a way to read the data from the input table (using the tMysqlInput component) in sections, e.g. read and process a million rows at a time? I am thinking that if the table can be read in sections, the data can be held in memory and the 'Enable stream' option will not be needed.
&lt;BR /&gt;Also, why use the tInfoBrightOutput component at all if tMysqlOutputBulkExec is so much more performant? We want to make use of the InfoBright Knowledge Grid.
&lt;BR /&gt;Thank you in advance for any help you can offer on this.
&lt;BR /&gt;Regards,
&lt;BR /&gt;Tom</description>
      <pubDate>Sat, 16 Nov 2024 12:23:28 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/How-can-a-large-table-be-processed-efficiently/m-p/2317617#M88044</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2024-11-16T12:23:28Z</dc:date>
    </item>
    <item>
      <title>Re: How can a large table be processed efficiently?</title>
      <link>https://community.qlik.com/t5/Talend-Studio/How-can-a-large-table-be-processed-efficiently/m-p/2317618#M88045</link>
      <description>Hi Tom,&lt;BR /&gt;1) The bulk DB components are designed for high-volume loads; they write the data to a temporary file and then bulk-load it.&lt;BR /&gt;    tInfoBrightOutput is a custom component. Its performance may be similar to that of tMysqlOutput (inefficient).&lt;BR /&gt;2) If the table has a sequential primary key, you can restrict the query of tMysqlInput:&lt;BR /&gt;    &lt;PRE&gt;        ....  where id &amp;lt; 1000000&lt;/PRE&gt;
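&lt;BR /&gt;For illustration, a complete query along those lines might look like this (the table and column names are just placeholders):&lt;BR /&gt;&lt;PRE&gt;"SELECT `TableName`.`id`, `TableName`.`name` FROM `TableName` where id &amp;lt; 1000000"&lt;/PRE&gt;&lt;BR /&gt;Regards,&lt;BR /&gt;Pedro</description>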
      <pubDate>Fri, 10 Feb 2012 06:42:43 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/How-can-a-large-table-be-processed-efficiently/m-p/2317618#M88045</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2012-02-10T06:42:43Z</dc:date>
    </item>
    <item>
      <title>Re: How can a large table be processed efficiently?</title>
      <link>https://community.qlik.com/t5/Talend-Studio/How-can-a-large-table-be-processed-efficiently/m-p/2317619#M88046</link>
      <description>Hi Pedro,
&lt;BR /&gt;Thank you for your response. For now, I am going to try to improve the way the job works with the InfoBright Loader component, because we want to make use of the InfoBright Knowledge Grid, unless there is a way to still do that while using the bulk DB component.
&lt;BR /&gt;I still have a question, though, about processing the input table in sections. Do I need to place the read from the table inside an iteration or loop so that the process reads 'x' rows at a time? I believe the example you listed has me reading the first 1 million rows and not the rest of the table.
&lt;BR /&gt;If the read should be placed inside an iteration or loop, could you explain or show how that is done (or point me in the right direction)?
&lt;BR /&gt;Thank you and have a great day.
&lt;BR /&gt;Tom</description>
      <pubDate>Fri, 10 Feb 2012 11:41:43 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/How-can-a-large-table-be-processed-efficiently/m-p/2317619#M88046</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2012-02-10T11:41:43Z</dc:date>
    </item>
    <item>
      <title>Re: How can a large table be processed efficiently?</title>
      <link>https://community.qlik.com/t5/Talend-Studio/How-can-a-large-table-be-processed-efficiently/m-p/2317620#M88047</link>
      <description>Hi Tom,
&lt;BR /&gt;I think you can use tLoop and read 2 million (or 1 million) rows per loop iteration, which will improve performance:
&lt;BR /&gt;tLoop -&amp;gt; tMysqlInput -&amp;gt; tInfoBrightOutput -&amp;gt; tJava.
&lt;BR /&gt;Use context variables to set the query in tMysqlInput.
&lt;BR /&gt;For example:
&lt;BR /&gt;
&lt;PRE&gt;"SELECT &lt;BR /&gt;  `TableName`.`id`, &lt;BR /&gt;  `TableName`.`name`&lt;BR /&gt;FROM `TableName`&lt;BR /&gt;where id&amp;gt;="+context.new1+" and id&amp;lt;="+context.new2&lt;/PRE&gt;
&lt;BR /&gt;Regards, 
&lt;BR /&gt;Pedro</description>
      <pubDate>Mon, 13 Feb 2012 05:31:08 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/How-can-a-large-table-be-processed-efficiently/m-p/2317620#M88047</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2012-02-13T05:31:08Z</dc:date>
    </item>
    <item>
      <title>Re: How can a large table be processed efficiently?</title>
      <link>https://community.qlik.com/t5/Talend-Studio/How-can-a-large-table-be-processed-efficiently/m-p/2317621#M88048</link>
      <description>Thank you, Pedro!</description>
      <pubDate>Mon, 13 Feb 2012 20:05:04 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/How-can-a-large-table-be-processed-efficiently/m-p/2317621#M88048</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2012-02-13T20:05:04Z</dc:date>
    </item>
  </channel>
</rss>