Hi,
We have a couple of large CSV files (>100 columns and more than 200,000 rows). The schema guessed by Talend is completely wrong because it uses only the first (50?) rows.
Is there an option to increase the number of rows that Talend uses to guess the schema?
Thanks
Ben
Hello,
Are you referring to File Delimited metadata?
Best regards
Sabrina
Yes. I can't find an option to guess the schema from more rows of the CSV.
Hi @Benjamin Gnädig,
I'm afraid that option is really only an aid to get you going quickly. When you are developing a job, you need to know the schema types/sizes of every file you may ever use with it. Granted, with a DB this is much easier than with a flat file, unless you know the settings of the application that produces the flat file.
A helpful "trick" to do this with Talend (if you have some example files) is to quickly knock up a job that processes several files at once and reads each of them as a single-column file (the whole row as one field). Then calculate the length of each row and keep the 50 longest rows across all files. Output those as the test file to build your schema against; a rough sketch of the idea follows below.
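In plain Java (which is what a Talend job compiles to anyway) the idea looks roughly like this. This is only a sketch, not a Talend job: the class name, the output file name (schema_sample.csv) and the sample size of 50 are placeholders to adapt, and it reads whole files into memory, which is fine for a one-off analysis of files this size.

import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

public class LongestRowsSampler {
    public static void main(String[] args) throws IOException {
        int sampleSize = 50; // how many of the longest rows to keep

        // Read every example CSV passed on the command line as a
        // "single column" file: one string per row, no field splitting.
        List<String> allRows = new ArrayList<>();
        for (String path : args) {
            allRows.addAll(Files.readAllLines(Paths.get(path)));
        }

        // Keep the longest rows across all files; these are the rows
        // most likely to show the maximum column sizes.
        List<String> longest = allRows.stream()
                .sorted(Comparator.comparingInt(String::length).reversed())
                .limit(sampleSize)
                .collect(Collectors.toList());

        // Write them out as the test file to build the job against.
        Files.write(Paths.get("schema_sample.csv"), longest);
    }
}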
I know this is not as simple as being able to guess from the whole file while building a job, but if the file you build against does not contain examples of the maximum column sizes, you will run into problems when running the job later. Creating a job to quickly analyse all of the files you have available and extract the maximum-length rows should mitigate this sort of issue. I'd also recommend adding 10% to the sizes, just to be safe, if you do not know the precise limits; a per-column version of the sketch follows.
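If you would rather get the per-column maximums with the 10% margin directly, a rough sketch under the same caveats: the semicolon delimiter is an assumption to adjust for your files, and the naive split ignores quoted fields, so a real CSV parser (opencsv, for example) is the safer choice on production data.

import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class ColumnWidthAnalyser {
    public static void main(String[] args) throws IOException {
        String delimiter = ";"; // assumed delimiter; adjust to your files
        Map<Integer, Integer> maxWidths = new TreeMap<>();

        for (String path : args) {
            for (String line : Files.readAllLines(Paths.get(path))) {
                // Naive split: does not handle quoted/embedded delimiters.
                String[] fields = line.split(delimiter, -1);
                for (int i = 0; i < fields.length; i++) {
                    maxWidths.merge(i, fields[i].length(), Math::max);
                }
            }
        }

        // Report the observed maximum plus a 10% safety margin per column.
        maxWidths.forEach((col, width) ->
                System.out.printf("column %d: max %d, suggested %d%n",
                        col, width, (int) Math.ceil(width * 1.1)));
    }
}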