BenG
Partner - Contributor III

How to tell Talend to use more rows to guess the schema (File delimited)

Hi,

we have a couple of large CSV files (>100 columns and more than 200,000 rows). The schema guessed by Talend is completely wrong because it uses only the first (50?) rows.

Is there an option to expand the number of rows that Talend is using to guess the schema?

Thanks

Ben

3 Replies
Anonymous
Not applicable

Hello,

Are you referring to File Delimited metadata?

https://help.talend.com/r/en-US/8.0/studio-user-guide-open-studio-for-data-integration/centralizing-...

Best regards

Sabrina

BenG
Partner - Contributor III
Author

Yes. I can't find an option to guess the schema from more rows in the CSV.

Anonymous
Not applicable

Hi @Benjamin Gnädig,


I'm afraid that this option is only really intended as an aid to get you going quickly. When you are developing a job you need to know the schema types/sizes of every file you may ever use with the job. Granted, with a database this is much easier than with a flat file, unless you are aware of the settings of the application that creates that flat file.


A helpful "trick" to do this with Talend (if you have some example files) is to quickly knock up a job to process several files at once and read them as single column files (one column per row). Then calculate the length of each row and return the top 50 rows from all files. Then output this as your test file to build against.


I know that this is not as simple as being able to guess from the whole file while building a job, but if the file you use to build your job does not contain examples of the maximum sizes of columns, you will run into a problem when running your job later. Creating a job that quickly analyses all of the files you have available and extracts the maximum-length rows should mitigate this sort of issue. I'd also recommend adding 10% to the sizes just to be safe if you do not know the precise limitations.
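
For illustration, here is a rough standalone sketch of that analysis in plain Java rather than as a Talend job. The file names and the 10% padding factor are hypothetical; it simply reads every sample file line by line, keeps the 50 longest rows across all files, and writes them to a test file to build the schema against.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

// Sketch of the "longest rows" analysis described above (not a Talend job).
// File names and the padding factor are illustrative assumptions.
public class LongestRowsSample {
    public static void main(String[] args) throws IOException {
        List<Path> sampleFiles = List.of(
                Paths.get("sample1.csv"),
                Paths.get("sample2.csv"));

        // Read each line of every sample file as a single "column".
        List<String> allRows = new ArrayList<>();
        for (Path p : sampleFiles) {
            allRows.addAll(Files.readAllLines(p));
        }

        // Keep the 50 longest rows - these are most likely to contain
        // the widest column values.
        List<String> longest = allRows.stream()
                .sorted(Comparator.comparingInt(String::length).reversed())
                .limit(50)
                .collect(Collectors.toList());

        // Write them out as the test file to build the schema against.
        Files.write(Paths.get("schema_test_sample.csv"), longest);

        // The "add 10% for safety" advice: pad an observed maximum column
        // length before fixing it in the schema (120 is a hypothetical value).
        int observedMaxLength = 120;
        int paddedLength = (int) Math.ceil(observedMaxLength * 1.10);
        System.out.println("Use column length: " + paddedLength);
    }
}
```

The same flow can be built directly in Studio with components that read each line as a single column, as described above, so the analysis runs over however many sample files you have available.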