Hi,
I have around a billion records in a MySQL database. I want to read the records batch-wise (say 1,000 records at a time), map each batch to the desired columns, and convert it to an Excel sheet or XML file. Each file should contain 1 lakh (100,000) rows. I found this link,
https://help.talend.com/reader/Dx4Xf8ykIvnjCCGyFrnUWw/aqGkWGJGV3o7u_MYna4K4A
which is similar to what I wanted, so I followed it and created a job as shown below:
However, it is taking a lot of time even for a small number of records; for example, running 2 lakh (200,000) records took almost 1 hour 45 minutes. Please suggest another approach, or any changes I should make, to achieve this.
Thanks and regards.
Hi,
You can use the tParallelize component to speed up the data extraction and run the subjobs after it in parallel.
The first step is to identify the total number of records and split them based on some specific criterion (say, a data load time range). The first subjob after tParallelize should process the first 25%, the second subjob the next 25%, and so on.
Also, increase the batch extraction size from MySQL to a higher value than the current 1,000 records, since you want to store 1 lakh (100,000) records per file. This will also increase your throughput. A rough sketch of the partitioning idea is shown below.
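Outside the Talend canvas, the partitioning idea looks roughly like the sketch below. This is only a minimal JDBC illustration, assuming a numeric primary key named id, a hypothetical table source_table, and placeholder connection details; in Talend you would model the same split with tParallelize and one subjob per id range.

```java
import java.sql.*;
import java.util.concurrent.*;

public class ParallelExtractSketch {
    // Placeholder connection settings - replace with your own.
    static final String URL = "jdbc:mysql://localhost:3306/mydb?useCursorFetch=true";
    static final String USER = "user", PASS = "pass";

    public static void main(String[] args) throws Exception {
        long minId, maxId;
        try (Connection c = DriverManager.getConnection(URL, USER, PASS);
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery("SELECT MIN(id), MAX(id) FROM source_table")) {
            rs.next();
            minId = rs.getLong(1);
            maxId = rs.getLong(2);
        }

        int partitions = 4;  // four 25% slices, like four subjobs after tParallelize
        long span = (maxId - minId + partitions) / partitions;
        ExecutorService pool = Executors.newFixedThreadPool(partitions);

        for (int p = 0; p < partitions; p++) {
            long lo = minId + p * span;
            long hi = Math.min(lo + span - 1, maxId);
            pool.submit(() -> extractRange(lo, hi));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }

    static void extractRange(long lo, long hi) {
        // Each "subjob" reads only its own id slice.
        String sql = "SELECT id, col1, col2 FROM source_table WHERE id BETWEEN ? AND ?";
        try (Connection c = DriverManager.getConnection(URL, USER, PASS);
             PreparedStatement ps = c.prepareStatement(sql)) {
            ps.setFetchSize(20_000);  // larger fetch; needs useCursorFetch=true with MySQL Connector/J
            ps.setLong(1, lo);
            ps.setLong(2, hi);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // map to the desired columns and write to the output file here
                }
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}
```

The point is that each slice is read independently with its own WHERE clause, so the extractions can run at the same time instead of one after another.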
Warm Regards,
Nikhil Thampi
Hi,
No worries. We can still do a lot of fine tuning.
a) The first step is to avoid reading the files multiple times. You are currently reading 1,000 records from MySQL into a delimited file and then copying it from there into an Excel file. Instead, read the data from MySQL and push it directly to the target Excel file. You can use the "Append existing file" option to add data to an existing file (use tFileTouch to create an empty file at the beginning). This avoids the multiple-read issue in your current job.
b) Change the query to increase the fetch size to 20,000 records, then raise it in 20k increments and check the performance. If the data fetch works fine for 100,000 records in one go, you can avoid the loop when writing to the same target Excel file (see the sketch after this list).
c) In the Run tab of the job, go to Advanced settings and increase the -Xms and -Xmx parameters to give the job more memory. But please do not set astronomically high values; use your server capacity as a reference point.
d) Another pain point could be the data fetch query itself. Check whether the query performance is good. If the query joins are badly written, the data fetch will take longer. Look at the explain plan and do the necessary tuning there as well.
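To make points (a) and (b) concrete, here is a rough sketch of a direct MySQL-to-Excel write in plain Java. It assumes Apache POI's streaming SXSSFWorkbook, a hypothetical table big_table with columns id, col1 and col2, and placeholder connection details; in a Talend job the equivalent is configured on tDBInput (cursor/fetch size) and tFileOutputExcel (append mode).

```java
import java.io.FileOutputStream;
import java.sql.*;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

public class DirectExcelWriteSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://localhost:3306/mydb?useCursorFetch=true"; // assumption: cursor fetch enabled
        String sql = "SELECT id, col1, col2 FROM big_table";                 // only the columns you actually need

        // SXSSFWorkbook keeps a limited number of rows in memory and flushes the rest to disk,
        // so a large Excel file can be written without exhausting the heap.
        try (Connection con = DriverManager.getConnection(url, "user", "pass");
             PreparedStatement ps = con.prepareStatement(sql);
             SXSSFWorkbook wb = new SXSSFWorkbook(1000)) {                   // keep 1,000 rows in memory

            ps.setFetchSize(20_000);                                         // larger fetch, as in point (b)
            Sheet sheet = wb.createSheet("data");
            int rowNum = 0;

            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // For 1 lakh rows per file, roll over to a new workbook when rowNum reaches 100,000.
                    Row row = sheet.createRow(rowNum++);
                    row.createCell(0).setCellValue(rs.getLong("id"));
                    row.createCell(1).setCellValue(rs.getString("col1"));
                    row.createCell(2).setCellValue(rs.getString("col2"));
                }
            }
            try (FileOutputStream out = new FileOutputStream("output.xlsx")) {
                wb.write(out);
            }
            wb.dispose();  // remove the temporary files SXSSF keeps on disk
        }
    }
}
```

The streaming workbook keeps only a small window of rows in memory, so the rows for each file can be written in one pass without the intermediate delimited file.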
If these details have helped you to find the path ahead, please mark the topic as solved to help the Talend community.
Warm Regards,
Nikhil Thampi
Hi Nikhil,
Thanks for your quick reply.
Please help me out on this.
Thanks and Regards
Hi,
If you are using only 50 of the 200 columns from the MySQL table, you should not fetch the remaining columns from the DB at all. This will improve the efficiency of the data fetch at the DB layer. Fetch a column from the DB only if you really need it.
Warm Regards,
Nikhil Thampi
Then how am I supposed to do this? Kindly suggest the flow.
Warm Regards
Hi,
Assume there is an employee table with columns like id, employee_name, salary, DoB, joining_date and many more columns in the source table.
If you need only id and employee_name from that table, your select query in tDBInput should contain only those columns, in both the query and the schema area. A minimal example is shown below.
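For illustration only, using the example table above, the query you put in tDBInput would be trimmed like this (and the component's schema should list only these two columns):

```java
// Trimmed query for tDBInput: fetch only the columns the job actually uses.
// (employee is the example table from above; adjust names to your real schema.)
String query = "SELECT id, employee_name FROM employee";

// Avoid this: it drags all 200 columns across the network and into the job.
// String query = "SELECT * FROM employee";
```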
Warm Regards,
Nikhil Thampi