Anonymous
Not applicable

large csv file import

I have many large .CSV files, each containing more than 2 million rows.

I need to import them into a MySQL table gradually, because I cannot load all the data in a CSV file into memory.

Could you guide me on which component I should use to do that?

I tried the tFileInputDelimited component, but it reads the whole file at once and consumes all of my laptop's memory.

15 Replies
Anonymous
Not applicable
Author

Hi Manohar,
Thanks for your suggestion. I'm trying it now.
But do we have a solution for incremental import, as TRF mentioned above? I mean that we don't load all the rows of the file into RAM; instead, we insert a batch of rows, then continue loading the remaining rows.
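Something like this plain-JDBC pattern is what I have in mind; just a sketch, and the connection settings, table, columns, and the ';' delimiter are made up:

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class IncrementalCsvLoad {
    public static void main(String[] args) throws Exception {
        final int BATCH_SIZE = 10000; // rows sent to MySQL per round trip

        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/mydb?rewriteBatchedStatements=true",
                "user", "password");
             BufferedReader reader = new BufferedReader(new FileReader("big_file.csv"));
             PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO my_table (col1, col2, col3) VALUES (?, ?, ?)")) {

            conn.setAutoCommit(false);
            String line;
            int count = 0;
            while ((line = reader.readLine()) != null) {
                String[] f = line.split(";", -1); // assumes ';' delimiter, no quoted fields
                ps.setString(1, f[0]);
                ps.setString(2, f[1]);
                ps.setString(3, f[2]);
                ps.addBatch();
                if (++count % BATCH_SIZE == 0) {
                    ps.executeBatch(); // only BATCH_SIZE rows are ever buffered
                    conn.commit();
                }
            }
            ps.executeBatch(); // flush the last partial batch
            conn.commit();
        }
    }
}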
TRF
Champion II

@phancongphuoc your job design doesn't make sense.

Here is what I suggest:

[Screenshot: suggested job design with two subjobs]

With the first subjob, you'll split your big file into smaller CSV files, thanks to the "Split output in several files" option.

For example, you can generate files of 100,000 records each.

tFileInputFullRow treats the input file as having a single field called "line". Use the same schema for tFileOutputDelimited, and don't include a header in the output files.

For the second subjob, use the tFileList component to iterate over the list of previously generated CSV files.

tFileInputDelimited lets you read each CSV file one by one with the desired schema.

You have to use the following expression for the filename:

((String)globalMap.get("tFileList_1_CURRENT_FILEPATH"))

The content of the current file is pushed to the database by the tMysqlOutput component.
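If it helps to see what the first subjob does, here is a rough plain-Java equivalent; only one row is ever held in memory at a time (the big_file.csv name and the chunks/ directory are made up):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;

public class FileSplitter {
    public static void main(String[] args) throws Exception {
        final int LINES_PER_FILE = 100000; // plays the role of "Split output in several files"
        new File("chunks").mkdirs();       // output directory for the generated files

        BufferedWriter writer = null;
        try (BufferedReader reader = new BufferedReader(new FileReader("big_file.csv"))) {
            String line;
            int lineCount = 0;
            int fileIndex = 0;
            while ((line = reader.readLine()) != null) {
                // start a new chunk file every LINES_PER_FILE rows
                if (lineCount % LINES_PER_FILE == 0) {
                    if (writer != null) writer.close();
                    writer = new BufferedWriter(
                            new FileWriter("chunks/part_" + fileIndex++ + ".csv"));
                }
                writer.write(line); // the whole row is the single "line" field, no parsing
                writer.newLine();
                lineCount++;
            }
        } finally {
            if (writer != null) writer.close();
        }
    }
}

The second subjob then simply iterates over the files in chunks/.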

Anonymous
Not applicable
Author

Hi TRF,

 

I tried your approach, but it seems that the tFileInputFullRow component reads all the rows in the file before separating them into several files.

This means it loads the whole file content into RAM.

So I still get "Exception in thread "main" java.lang.OutOfMemoryError: Java heap space", even in that first step of separating the file into several files.

[Screenshot: the job failing with the OutOfMemoryError]

I also tried increasing the Java heap to more than 3 GB.

[Screenshot: increased JVM heap settings]

None of these attempts worked.
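For reference, the heap is set through the JVM arguments in the Run tab's Advanced settings; the values below are roughly what I tried:

-Xms1024M
-Xmx3072M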


Tarun2
Partner - Contributor

"If you connect tFileInputFullRow to tFileOutputDelimited set the advanced parameter "Split output in several files" to the number of lines your are able to manage at once, it should be an easier solution."

 

Thanks! Your suggestion was useful in my case. Do you know of any limitation or shortcoming of this process?

 

What I am doing is reading a big XML file (1 GB) as a flat file, updating it, and saving it with the same name. I was able to achieve this using the attached flow, but I am not aware of any shortcoming that I might face in the future because of it.


[Screenshot: Read_Big_XML.png, the attached job flow]
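Essentially, the flow is the plain-Java equivalent of the sketch below (the file names and the oldValue/newValue edit are placeholders). One caveat I can already see: this only streams if the XML actually contains line breaks; a document written as a single physical line would still be buffered whole.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class StreamingXmlEdit {
    public static void main(String[] args) throws Exception {
        // Read line by line, apply the edit, write to a temp file,
        // then swap the temp file in, so only one line is in memory.
        try (BufferedReader reader = new BufferedReader(new FileReader("big.xml"));
             BufferedWriter writer = new BufferedWriter(new FileWriter("big.xml.tmp"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line.replace("oldValue", "newValue")); // placeholder edit
                writer.newLine();
            }
        }
        Files.move(Paths.get("big.xml.tmp"), Paths.get("big.xml"),
                StandardCopyOption.REPLACE_EXISTING);
    }
}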
Tarun2
Partner - Contributor

Update to the previous comment: only a 350 MB file processed successfully with that approach. The 1 GB file failed with a Java heap space error and processed only after bumping the JVM heap to 12 GB. So this is not an ideal solution.