Read one file in parallel

_AnonymousUser · ‎2016-06-12

Experts, could you please help me implement a solution to read a file in parallel? For example, I have a 10 GB file and I want multiple partitions to read that file. Is that possible?

Anonymous · ‎2016-06-17

Hi,
You could use a sequence in tMap to break up your file into smaller chunks. What kind of data do you have in this file?
Do you want to load your big file into a DB? Could you please give us more information about your current job?
Best regards
Sabrina

Anonymous · ‎2016-06-18

I receive full refresh files from my source team, each containing 160M records. Because these are full refresh files, I have to read the file, compare it with the previously loaded data, identify inserts, updates and deletes, and apply the delta to the DB table. As an example, take the data below, where customer_id is the PK.

Today's file contains
customer_id   customer_name
100           Sam
102           Alex
105           David

Previously loaded table
customer_id   customer_name
100           Sam
102           Alexy
104           John

So with the above data, I need to mark customer_id 102 for update, 105 for insert and 104 for delete. In other words, the final table must end up with the content of the latest file.
I don't want to truncate and reload the table because it is used by the client almost all the time. I can implement the logic for identifying the delta with tMap, but the problem is processing 160M records, which takes a long time. Sample file content is posted below.
In the file below, the first 2 columns are the PK.

6014|A26904c676|0.0186370|61
6014|A27da32789|0.0154096|55
6014|A287f20d2c|0.0219631|55
6014|A2dfe8c97e|0.0408455|61
6014|A3b52342f8|0.0243586|61
6014|A3e7ac480f|0.0260668|61
6014|A5abde4f3b|0.0398880|55
6014|A5c54eed1b|0.0293591|55
6014|A5e4e4d111|0.0312439|61
6014|X14b34ecd508|0.0263314|61
6014|X14b34ecd529|0.0263314|61
6014|X14b34ecd53c|0.0263314|61
6014|X14b34ecd594|0.0464095|61
6014|X14b3f396fa8|0.0163314|58
6014|X14b53d31504|0.0207230|58
6014|X14c174dc981|0.0311294|55
6014|X14c174dc9f6|0.0224165|55
6014|X14c2be79613|0.0270148|55
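
The insert/update/delete rule from the customer example above can be sketched in a few lines. This is only an illustration in plain Python, not a Talend job; the function name `classify_delta` and the in-memory dictionaries are assumptions made for the example, and at 160M records the comparison would not be held in memory like this.

```python
def classify_delta(previous, current):
    """Compare two {primary_key: row} mappings and return the keys to insert, update and delete."""
    # Keys present only in the new file are inserts.
    inserts = sorted(k for k in current if k not in previous)
    # Keys present only in the previously loaded table are deletes.
    deletes = sorted(k for k in previous if k not in current)
    # Keys present in both, but with different content, are updates.
    updates = sorted(k for k in current if k in previous and current[k] != previous[k])
    return inserts, updates, deletes


# Data taken from the customer example in this post.
previous = {100: "Sam", 102: "Alexy", 104: "John"}
current = {100: "Sam", 102: "Alex", 105: "David"}

inserts, updates, deletes = classify_delta(previous, current)
print("insert:", inserts, "update:", updates, "delete:", deletes)
```

For the sample file, where the first 2 columns are the PK, the dictionary key would be a tuple of those two fields.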

Anonymous · ‎2016-06-18

The way to do this is to load the records into another table and carry out the comparison processing in the database. With your requirement to find deleted and new records, you will need to carry out two lookups using a tMap. Doing a lookup comparison like that, with that many records in a tMap, is going to be slow even on a really powerful system. For comparisons at this scale, Java lookups are nowhere near as fast as a database.
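
A minimal sketch of the "compare in the database" idea, assuming the new file has already been bulk-loaded into a staging table. It uses Python's built-in sqlite3 module with an in-memory database purely so the example is self-contained; the table names `customer` and `customer_stage` are hypothetical, and the same joins would be written in whatever SQL dialect the target DB uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE customer_stage (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    INSERT INTO customer VALUES (100, 'Sam'), (102, 'Alexy'), (104, 'John');
    INSERT INTO customer_stage VALUES (100, 'Sam'), (102, 'Alex'), (105, 'David');
""")

# Rows in staging with no match in the target table: inserts.
to_insert = [r[0] for r in conn.execute("""
    SELECT s.customer_id FROM customer_stage s
    LEFT JOIN customer t ON t.customer_id = s.customer_id
    WHERE t.customer_id IS NULL ORDER BY s.customer_id""")]

# Rows present in both tables whose content differs: updates.
to_update = [r[0] for r in conn.execute("""
    SELECT s.customer_id FROM customer_stage s
    JOIN customer t ON t.customer_id = s.customer_id
    WHERE s.customer_name <> t.customer_name ORDER BY s.customer_id""")]

# Rows in the target table with no match in staging: deletes.
to_delete = [r[0] for r in conn.execute("""
    SELECT t.customer_id FROM customer t
    LEFT JOIN customer_stage s ON s.customer_id = t.customer_id
    WHERE s.customer_id IS NULL ORDER BY t.customer_id""")]

print(to_insert, to_update, to_delete)
conn.close()
```

The point of the design is that the database does the matching on indexed keys, so the job only has to load the staging table and then run set-based statements.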

Anonymous · ‎2016-06-22

But my main problem is reading 160M records from the file into a table. How can I parallelize that? To compare with another ETL tool: Informatica has the concept of partitions, where it splits a big file into logical partitions and reads the file in parallel. Do we have something like that in Talend?

Anonymous · ‎2016-06-22

With Talend you are not limited to what Talend provides. You can also make use of third-party Java APIs and command-line functionality. So, if you are working in a Linux environment you can use split (http://askubuntu.com/questions/54579/how-to-split-larger-files-into-smaller-parts). If you are not (or if you don't want to use split), you can use a bit of Java to split the file (http://stackoverflow.com/questions/19177994/java-read-file-and-split-into-multiple-files).
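
As a rough sketch of the splitting step, here is the same idea in Python rather than the split command or the Java from the links above. The function name `split_file`, the chunk file naming and the tiny demo file are all assumptions for illustration; the rows are taken from the sample posted earlier in the thread.

```python
import os
import tempfile


def split_file(path, lines_per_chunk, out_dir):
    """Write consecutive groups of lines from `path` to numbered chunk files."""
    chunk_paths = []
    out = None
    with open(path, "r", encoding="utf-8") as src:
        for i, line in enumerate(src):
            # Start a new chunk file every `lines_per_chunk` lines.
            if i % lines_per_chunk == 0:
                if out is not None:
                    out.close()
                chunk_path = os.path.join(out_dir, f"chunk_{len(chunk_paths):04d}.txt")
                out = open(chunk_path, "w", encoding="utf-8")
                chunk_paths.append(chunk_path)
            out.write(line)
    if out is not None:
        out.close()
    return chunk_paths


# Small demo file built from rows of the sample data.
rows = [
    "6014|A26904c676|0.0186370|61",
    "6014|A27da32789|0.0154096|55",
    "6014|A287f20d2c|0.0219631|55",
    "6014|A2dfe8c97e|0.0408455|61",
    "6014|A3b52342f8|0.0243586|61",
]
work_dir = tempfile.mkdtemp()
source = os.path.join(work_dir, "full_refresh.txt")
with open(source, "w", encoding="utf-8") as f:
    f.write("\n".join(rows) + "\n")

chunks = split_file(source, 2, work_dir)
print(len(chunks), "chunk files written")
```

Splitting on line boundaries keeps every record whole, so each chunk can be loaded independently of the others.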

Processing in parallel may be a problem if you do not have the Enterprise Edition. That is one of the "paid for" features, but it doesn't stop you from doing this in parallel in the Open Source Edition. You can simply create a job which reads a file (name supplied by a context variable) and then run it as many times concurrently as your system will handle. This won't be the elegant solution that you get with the Enterprise Edition, but since the aim is simply to get the data loaded (I am assuming), it shouldn't matter.
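
The "run it as many times as your system will handle concurrently" approach might look roughly like the sketch below. Everything here is an assumption for illustration: `load_chunk` is a hypothetical stand-in that only parses and counts rows, whereas in a real setup it would launch the Talend job with the chunk file name passed in as the context variable. The worker count of 2 is arbitrary.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def load_chunk(path):
    """Hypothetical stand-in for one job run: parse pipe-delimited rows and return the row count."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("|")
            if len(fields) == 4:  # the sample rows have 4 pipe-delimited fields
                count += 1
    return count


# Create three small chunk files so the example is self-contained.
sample = ["6014|A26904c676|0.0186370|61", "6014|A27da32789|0.0154096|55"]
work_dir = tempfile.mkdtemp()
chunk_paths = []
for n in range(3):
    p = os.path.join(work_dir, f"chunk_{n}.txt")
    with open(p, "w", encoding="utf-8") as f:
        f.write("\n".join(sample) + "\n")
    chunk_paths.append(p)

# Process the chunks concurrently, at most 2 at a time.
with ThreadPoolExecutor(max_workers=2) as pool:
    counts = list(pool.map(load_chunk, chunk_paths))

total = sum(counts)
print(counts, total)
```

Threads are enough when each worker mostly waits on an external process or on the database; the number of workers should be limited to what the machine and the DB can sustain.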

Anonymous · ‎2016-07-04

Sure, thanks for your help.

Anonymous · ‎2016-07-04

Hi bibintjohn1,

You can do this using the Enterprise Edition; otherwise, the other option is to do it manually. You can split your file using one job and then execute multiple jobs in parallel, each on a different split file.

Thanks,
Saurabh.

Talend Data Integration

v6.x