Anonymous
Not applicable

File size

I am trying to run some CSV data through the tool and it keeps crashing at the end of the add-dataset process. The file itself is about 2.8GB, which is what I may end up dealing with if I pull from an HDFS datastore. Is there a maximum file size that the tool can ingest, and is there a way to tune the Java stack?
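(For context, "tuning the Java stack" here generally means raising the JVM heap ceiling, which for a desktop Java application is usually set through a -Xmx startup option in its launcher configuration. The snippet below is only a generic illustration of how to see what the running JVM is allowed to use; it is not specific to Data Preparation's own configuration.)

// Generic JVM check (not specific to Talend Data Preparation): prints the
// maximum heap the running JVM may use, which is what -Xmx controls.
public class HeapCheck {
    public static void main(String[] args) {
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.printf("Max heap: %.1f MB%n", maxBytes / (1024.0 * 1024.0));
        // A 2.8GB CSV expands well beyond its on-disk size once parsed into
        // Java objects, so a default heap of 1-2GB is unlikely to hold it all.
    }
}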

14 Replies
Anonymous
Not applicable
Author

The file is being loaded from a local drive. I would characterize the file contents as 72 columns wide (not my data) by 8.8M rows, comma-delimited with quoted text, with the widest column having a max length of 896 characters (average 44). Data Prep reads the file as UTF-8, though it is actually ISO-8859-1 (aka Latin-1).
A slightly different problem I encountered with a smaller, less complicated CSV was with exporting prepped data. That file has about 400k lines, but I could only export 10k lines to a new file.
Anonymous
Not applicable
Author

Thanks Brian. I will generate a file (with Talend Open Studio 😉 ) similar to yours and have it looked into. The Free Desktop version of Talend DP is meant for personal use on reasonably sized files, but I still want the behavior of the software to be sound and predictable on larger files. I need to understand why that isn't the case with yours.
We use heuristics to guess the character encoding. Unfortunately there is no deterministic method, so it may or may not guess correctly. You can manually override the detected encoding by clicking the little "gear" icon next to the dataset name, in the upper left of the preparation screen.
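(To illustrate what such a heuristic can look like, here is a minimal sketch of the general technique, not Talend's actual detection code: attempt a strict UTF-8 decode of a small sample of the file and fall back to ISO-8859-1 if it fails.)

import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class EncodingGuess {
    // Strict UTF-8 decode of the first 64KB; if it fails, assume Latin-1.
    // ISO-8859-1 accepts any byte sequence, which is why UTF-8 is tried first
    // and also why this kind of guess can never be fully deterministic.
    static Charset guess(Path file) throws IOException {
        byte[] sample = new byte[64 * 1024];
        int n;
        try (InputStream in = Files.newInputStream(file)) {
            n = in.read(sample);
        }
        if (n <= 0) {
            return StandardCharsets.UTF_8;
        }
        CharsetDecoder strictUtf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // A real implementation would also tolerate a multi-byte sequence
            // cut off at the end of the sample; this sketch does not.
            strictUtf8.decode(ByteBuffer.wrap(sample, 0, n));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            return StandardCharsets.ISO_8859_1;
        }
    }
}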
Regarding the 10K rows on export: yes, we cut off automatically at 10K to avoid out-of-memory errors, so you can only prepare 10K rows out of your 400K original file, and therefore that is all you have available for export too. This is, by the way, what should have happened with your 8.8M-row file as well.
You could increase or decrease the 10K threshold by editing this file: <INSTALL_DIR>\config\application.properties
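(The exact property key is not quoted in this thread, so the name below is only a placeholder; the idea is a one-line override in <INSTALL_DIR>\config\application.properties, along these lines:)

# Hypothetical key name - check your own application.properties for the real one.
dataset.records.limit=400000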
Note: the commercial, server version of the software, due in June this year, will have no volume limitation. That is NOT because we put an artificial limit in Free Desktop (the limit is freely configurable and our code is open source, not to mention an artificial cap would be contrary to our code of conduct), but because we have invested in more sophisticated scalability techniques in the server version.
Anonymous
Not applicable
Author

I confirm we have an issue: we indeed consume too much memory when assessing the format of the file in certain circumstances. This stage happens before the actual import of the file, which cuts off at 10K rows. A fix is already in the works.
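(To illustrate the kind of fix involved, here is a minimal sketch of the general idea rather than the actual patch: bound the format-assessment pass so it only ever inspects a fixed number of lines instead of buffering the whole file.)

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class FormatSampler {
    // Read at most maxLines lines for delimiter/column inference, so the
    // assessment stage uses bounded memory no matter how large the file is.
    static List<String> sampleLines(Path file, Charset cs, int maxLines) throws IOException {
        List<String> lines = new ArrayList<>(maxLines);
        try (BufferedReader reader = Files.newBufferedReader(file, cs)) {
            String line;
            while (lines.size() < maxLines && (line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Very rough delimiter guess over the bounded sample. A real CSV parser
    // would also have to respect quoted fields; this sketch does not.
    static char guessDelimiter(List<String> sample) {
        char[] candidates = {',', ';', '\t', '|'};
        char best = ',';
        long bestCount = -1;
        for (char c : candidates) {
            long count = 0;
            for (String line : sample) {
                count += line.chars().filter(ch -> ch == c).count();
            }
            if (count > bestCount) {
                bestCount = count;
                best = c;
            }
        }
        return best;
    }
}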
Anonymous
Not applicable
Author

Good to hear that you found what is causing the problem. I was able to adjust the input/output limit to 400k without any problems.
_AnonymousUser
Specialist III

Hello,
If this is still an active topic: I have DP 1.3 on a PC with 8GB RAM, and I see a serious slowdown when I open a 30k-row / 20-column CSV.
I also tried to open a 700k-row file after adjusting the sample size in the config to 700k, and it is not possible to display that amount of data.
Should I, or can I, allocate more Java heap in the DP config to handle bigger samples of data?
Thank you