Hi,
I would like to know how Talend Open Studio handles huge amounts of data.
For instance, if I want to select a 20 GB MySQL table, apply some transformations to the fields, and load the data into another database, how does Talend Open Studio do it?
In my example, I have a tMysqlInput --> tMap --> tMysqlOutput
I tried the following two things:
-> Limited the JVM memory usage with the option -Xmx6144M (6 GB of memory).
-> In the tMap component, I specified the "temp data directory" on disk, so I guess Talend writes data there to free up some memory.
By doing this, my job seems to work. If I don't, the job crashes with a memory exception.
But what exactly is Talend doing?
Does it just store the data in a temporary file and perform read/write access on that file?
Does it use a specific algorithm to store the data temporarily?
Can it still crash if there is too much data?
Is the Talend Platform for Big Data much more optimized than the Open Studio version?
I know that's a lot of questions, but if someone has a few answers, it would really help me.
Best regards,
Bertrand.
Hi nicolasdiogo,
I'm using a MySQL database.
Which component would be more suitable?
I actually found a very useful option on the tMysqlInput that I didn't see at first: "Enable stream".
On the tMysqlOutput, I also changed the "Number of rows per insert" and "Commit every" options.
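For reference, my understanding is that those two options roughly correspond to batched inserts with a periodic commit in plain JDBC. Here is a sketch only, not the actual code Talend generates; the connection details, table and column names are made up:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchInsertSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical target database and table; sketch of what
        // "Number of rows per insert" + "Commit every" amount to.
        try (Connection out = DriverManager.getConnection(
                "jdbc:mysql://target-host:3306/targetdb", "user", "password")) {
            out.setAutoCommit(false);                      // commit manually
            try (PreparedStatement insert = out.prepareStatement(
                    "INSERT INTO target_table (id, val) VALUES (?, ?)")) {
                int batchSize = 100;       // ~ "Number of rows per insert"
                int commitEvery = 10000;   // ~ "Commit every"
                for (int row = 1; row <= 1_000_000; row++) {
                    insert.setInt(1, row);
                    insert.setString(2, "value-" + row);
                    insert.addBatch();
                    if (row % batchSize == 0) insert.executeBatch();
                    if (row % commitEvery == 0) out.commit();
                }
                insert.executeBatch();                     // flush the last partial batch
                out.commit();
            }
        }
    }
}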
If you have any suggestions, please let me know.
Best regards,
Your idea seems correct. *but* what happens if, during the last batch, some rows fail processing? Do you leave all the other rows already loaded in this target table?
There is a common problem with the MySQL database input components (the same applies to tPostgresqlInput): by default, the JDBC driver collects all the data and only then starts sending it to the application (your Talend job). Check "Enable stream" in the advanced settings and that's it.
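For reference, in plain JDBC a streaming read with MySQL Connector/J looks roughly like this (connection details and table are made up); as far as I know, this is what "Enable stream" sets up for you:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamingReadSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical source database; sketch of MySQL Connector/J
        // result-set streaming (rows are read one by one instead of
        // buffering the whole result in memory).
        try (Connection in = DriverManager.getConnection(
                "jdbc:mysql://source-host:3306/sourcedb", "user", "password");
             Statement stmt = in.createStatement(
                 ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            stmt.setFetchSize(Integer.MIN_VALUE);  // tells Connector/J to stream
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT id, val FROM big_table")) {
                while (rs.next()) {
                    // process one row at a time; memory usage stays flat
                    long id = rs.getLong("id");
                    String val = rs.getString("val");
                }
            }
        }
    }
}

For PostgreSQL the equivalent is to disable auto-commit and set a positive fetch size (e.g. setFetchSize(1000)), which makes the driver use a cursor instead of buffering the whole result.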
Hi,
Thank you for your answers.
Nicolasdiogo,
If one row of a commit fails, the whole commit is aborted. However, all the previous commits succeeded, so the rows associated with those commits are loaded into the target table.
Jlolling,
I activated "Enable stream" and the results were great for memory usage. My job was able to transfer all 25 GB of data while using only a few GB of memory (depending on the commit size of the tMysqlOutput component).
However, the transfer speed doesn't seem to be very good (I reached a maximum of 3 GB/s by playing with the commit size of the tMysqlOutput component).
I also tried the bulk components. With these, the job took about the same time as with tMysqlOutput, but almost the whole job time was spent writing the file to the hard drive (as a reminder, the bulk components write all the data to a file and then import that file into the database). So the job spends a long time writing the file, and something like 3 seconds importing the file into the database.
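As far as I understand it, that final import step is essentially a LOAD DATA INFILE; a rough equivalent in plain JDBC would be something like this (the file path, table name, and field separator are just examples):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class BulkLoadSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical target database and file; sketch of the bulk-import step.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://target-host:3306/targetdb?allowLoadLocalInfile=true",
                "user", "password");
             Statement stmt = conn.createStatement()) {
            // The slow part in my job is producing the file itself;
            // this statement only takes a few seconds.
            stmt.execute(
                "LOAD DATA LOCAL INFILE '/tmp/bulk_data.csv' " +
                "INTO TABLE target_table " +
                "FIELDS TERMINATED BY ';' LINES TERMINATED BY '\\n'");
        }
    }
}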
So my question is: why does it take so long to write the file? I checked my hard drive and it wasn't under heavy load, so what is limiting the job?
Does the bulk component really need that much time to write to disk?
Is the incoming stream of data slow (due to poor optimization of the tMysqlInput component)?
If you have any ideas, optimization suggestions, or anything else, please let me know.
Best regards,
Bertrand
I am facing the same issue: I have a billion records as input from a MySQL table, and the tMap looks up against 15 dimensions for keys and writes to the temp directory. It has been writing to disk for 8 days and the inserts have not started yet. We have enabled streaming and commit every 10,000 records. What more optimization can be done here? Or will it have to load all the looked-up data, store it on the temp disk, and only then start inserting into the MySQL output table? I'm really stuck here and need some help.
Thanks in advance.
Hi pankaj,
Can you create a master table for your input table, based on a timestamp, key, etc., and segment the data into batches using a query, say with a BETWEEN clause? Use this master table and iterate over the input data for those batches. This may be slower, but insertion will start from the first execution itself. Also try using the ELT components. A rough sketch of the batching idea is below.
Thanks,
Vaibhav
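A sketch only, in plain JDBC, with hypothetical table, column, and range values: the input is split into key ranges so that each slice can be looked up, transformed, and inserted before the next slice is read.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class RangeBatchSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical: iterate over key ranges of the input table so that
        // inserts can start with the first range instead of after a full scan.
        long minId = 1L, maxId = 1_000_000_000L, rangeSize = 10_000_000L;
        try (Connection in = DriverManager.getConnection(
                "jdbc:mysql://source-host:3306/sourcedb", "user", "password");
             PreparedStatement select = in.prepareStatement(
                 "SELECT id, val FROM input_table WHERE id BETWEEN ? AND ?")) {
            for (long lo = minId; lo <= maxId; lo += rangeSize) {
                select.setLong(1, lo);
                select.setLong(2, Math.min(lo + rangeSize - 1, maxId));
                try (ResultSet rs = select.executeQuery()) {
                    while (rs.next()) {
                        // look up, transform, and insert this slice, then commit
                    }
                }
            }
        }
    }
}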