Hi,
I have around a billion records in a MySQL database. I want to read the records batch-wise (say 1,000 records at a time), map each batch to the desired columns, and convert it to an Excel sheet or XML file. Each file should contain 1 lakh (100,000) rows. I found this link,
https://help.talend.com/reader/Dx4Xf8ykIvnjCCGyFrnUWw/aqGkWGJGV3o7u_MYna4K4A
which is similar to what I wanted, so I followed it and created a job as shown below:
However, it is taking a lot of time even for a small number of records; for example, running 2 lakh (200,000) records took almost 1 hour 45 minutes. Please suggest another approach, or any changes I should make, to achieve this.
Thanks and regards.
Hi,
You can use the tParallelize component to speed up the data extraction and run the subjobs after it in parallel.
The first step is to identify the total number of records and split them based on some specific criterion (say, a data load time range). The first subjob after tParallelize should process the first 25%, the second subjob the next 25%, and so on.
Also, increase the batch extraction size from MySQL to a higher value than the current 1,000 records, since you want to store 1 lakh (100,000) records per file. This will also increase your throughput. A rough sketch of the partitioning idea is shown below.
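Outside the Talend canvas, the partitioning idea looks roughly like the sketch below. This is only a minimal JDBC illustration, assuming a numeric primary key named id, a hypothetical table source_table, and placeholder connection details; in Talend you would model the same split with tParallelize and one subjob per id range.

```java
import java.sql.*;
import java.util.concurrent.*;

public class ParallelExtractSketch {
    // Placeholder connection settings - replace with your own.
    static final String URL = "jdbc:mysql://localhost:3306/mydb?useCursorFetch=true";
    static final String USER = "user", PASS = "pass";

    public static void main(String[] args) throws Exception {
        long minId, maxId;
        try (Connection c = DriverManager.getConnection(URL, USER, PASS);
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery("SELECT MIN(id), MAX(id) FROM source_table")) {
            rs.next();
            minId = rs.getLong(1);
            maxId = rs.getLong(2);
        }

        int partitions = 4;  // four 25% slices, like four subjobs after tParallelize
        long span = (maxId - minId + partitions) / partitions;
        ExecutorService pool = Executors.newFixedThreadPool(partitions);

        for (int p = 0; p < partitions; p++) {
            long lo = minId + p * span;
            long hi = Math.min(lo + span - 1, maxId);
            pool.submit(() -> extractRange(lo, hi));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }

    static void extractRange(long lo, long hi) {
        // Each "subjob" reads only its own id slice.
        String sql = "SELECT id, col1, col2 FROM source_table WHERE id BETWEEN ? AND ?";
        try (Connection c = DriverManager.getConnection(URL, USER, PASS);
             PreparedStatement ps = c.prepareStatement(sql)) {
            ps.setFetchSize(20_000);  // larger fetch; needs useCursorFetch=true with MySQL Connector/J
            ps.setLong(1, lo);
            ps.setLong(2, hi);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // map to the desired columns and write to the output file here
                }
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}
```

The point is that each slice is read independently with its own WHERE clause, so the extractions can run at the same time instead of one after another.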
Warm Regards,
Nikhil Thampi
Hi,
No worries. We can still do a lot of fine tuning.
a) The first step is to avoid reading the files multiple times. You are currently reading 1,000 records from MySQL into a delimited file and then copying it from there into an Excel file. Instead, read the data from MySQL and push it directly to the target Excel file. You can use the "Append existing file" option to add data to an existing file (use tFileTouch to create an empty file at the beginning). This avoids the multiple-read issue in your current job.
b) Change the query to increase the fetch size to 20,000 records, then raise it in 20k increments and check the performance. If the data fetch works fine for 100,000 records in one go, you can avoid the loop when writing to the same target Excel file (see the sketch after this list).
c) In the Run tab of the job, go to Advanced settings and increase the -Xms and -Xmx parameters to give the job more memory. But please do not set astronomically high values; use your server capacity as a reference point.
d) Another pain point could be the data fetch query itself. Check whether the query performance is good. If the query joins are badly written, the data fetch will take longer. Look at the explain plan and do the necessary tuning there as well.
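To make points (a) and (b) concrete, here is a rough sketch of a direct MySQL-to-Excel write in plain Java. It assumes Apache POI's streaming SXSSFWorkbook, a hypothetical table big_table with columns id, col1 and col2, and placeholder connection details; in a Talend job the equivalent is configured on tDBInput (cursor/fetch size) and tFileOutputExcel (append mode).

```java
import java.io.FileOutputStream;
import java.sql.*;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

public class DirectExcelWriteSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://localhost:3306/mydb?useCursorFetch=true"; // assumption: cursor fetch enabled
        String sql = "SELECT id, col1, col2 FROM big_table";                 // only the columns you actually need

        // SXSSFWorkbook keeps a limited number of rows in memory and flushes the rest to disk,
        // so a large Excel file can be written without exhausting the heap.
        try (Connection con = DriverManager.getConnection(url, "user", "pass");
             PreparedStatement ps = con.prepareStatement(sql);
             SXSSFWorkbook wb = new SXSSFWorkbook(1000)) {                   // keep 1,000 rows in memory

            ps.setFetchSize(20_000);                                         // larger fetch, as in point (b)
            Sheet sheet = wb.createSheet("data");
            int rowNum = 0;

            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // For 1 lakh rows per file, roll over to a new workbook when rowNum reaches 100,000.
                    Row row = sheet.createRow(rowNum++);
                    row.createCell(0).setCellValue(rs.getLong("id"));
                    row.createCell(1).setCellValue(rs.getString("col1"));
                    row.createCell(2).setCellValue(rs.getString("col2"));
                }
            }
            try (FileOutputStream out = new FileOutputStream("output.xlsx")) {
                wb.write(out);
            }
            wb.dispose();  // remove the temporary files SXSSF keeps on disk
        }
    }
}
```

The streaming workbook keeps only a small window of rows in memory, so the rows for each file can be written in one pass without the intermediate delimited file.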
If these details have helped you to find the path ahead, please mark the topic as solved to help the Talend community.
Warm Regards,
Nikhil Thampi
Hi Nikhil,
Thanks for your quick reply.
Please help me out on this.
Thanks and Regards
Hi,
If you are using only 50 of the 200 columns from the MySQL table, you should not fetch the remaining columns from the DB at all. This will improve the efficiency of the data fetch at the DB layer. Fetch a column from the DB only if you really need it.
Warm Regards,
Nikhil Thampi
Then how am I supposed to do this? Kindly suggest the flow.
Warm Regards
Hi,
Assume there is an employee table with columns like id, employee_name, salary, DoB, joining_date and many more columns in the source table.
If you need only id and employee_name from that table, your select query in tDBInput should contain only those columns, in both the query and the schema area. A minimal example is shown below.
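For illustration only, using the example table above, the query you put in tDBInput would be trimmed like this (and the component's schema should list only these two columns):

```java
// Trimmed query for tDBInput: fetch only the columns the job actually uses.
// (employee is the example table from above; adjust names to your real schema.)
String query = "SELECT id, employee_name FROM employee";

// Avoid this: it drags all 200 columns across the network and into the job.
// String query = "SELECT * FROM employee";
```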
Warm Regards,
Nikhil Thampi