Nickster19
Contributor II

Loading Flat File With Dynamic Schemas

Hey all, I am fairly new to Talend Integration.  This is my first post; I'm looking to gain some insight.  We are migrating some of our current jobs from Business Objects.

 

So here is our scenario: we load flat files (tab delimited) from our customers.  Each file has a header record containing the field names (first_name, last_name, etc.).  We allow our customers to send only the fields that they require, in any order.  I am a bit lost on which components I can use to read each file, dynamically figure out what the schema is, and load the data into a Postgres table.

 

For instance, our first file might contain firstName, lastName, and email. 

The second file might contain email, uniqueId, lastName, firstName, and loginId.

The table we load this into contains the appropriate columns.  Each file could contain a different quantity of columns.  We have over 2000 files and can't build a custom mapping for each client.

 

I guess I am a bit confused about which components are the ideal way to make this work.  I have tried using tJava, but my Java skills are not where they need to be.  I have also tried using tSplitRow to create a data dictionary.  Additionally, I have tried using tExtractDynamicFields, but couldn't get it to work.  Any help in pointing me in the right direction would be greatly appreciated!

 

Lastly, can anyone recommend a course or book to help my Java skills, especially as they relate to Talend?  I don't want to become a Java expert, but would like to expand my knowledge from where it currently is (beginner).

1 Reply
vapukov
Master II

If the target database structure is the same for all files, and you only need to manage a flexible CSV structure, the situation can be resolved.


Variant 1

  • Convert each CSV file to JSON.

This can be done with tJavaFlex (if you are familiar with Java) or with any of the available command-line tools, such as:
https://github.com/darwin/csv2json

http://jeroenjanssens.com/2013/09/19/seven-command-line-tools-for-data-science.html

https://csvkit.readthedocs.io/en/1.0.2/

 

  • Read the file row by row (document by document) and parse it with

tExtractJSONFields

With this approach, any column missing from a file will come through as NULL.
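To make the first step of Variant 1 concrete, here is a minimal standalone sketch of the kind of logic you could put into a tJavaFlex: read the header line of a tab-delimited file and emit each data row as a JSON document keyed by whatever column names that file actually contains. The class and method names are illustrative, not part of any Talend API, and the JSON escaping is deliberately minimal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: turn tab-delimited lines (header first) into
// one JSON document per data row, keyed by that file's own columns.
public class TsvToJson {

    // Minimal JSON string escaping (backslashes and quotes only).
    static String esc(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    static List<String> convert(List<String> lines) {
        String[] header = lines.get(0).split("\t", -1);
        List<String> docs = new ArrayList<>();
        for (int i = 1; i < lines.size(); i++) {
            String[] cells = lines.get(i).split("\t", -1);
            StringBuilder json = new StringBuilder("{");
            for (int c = 0; c < header.length; c++) {
                if (c > 0) json.append(",");
                // Short rows: pad missing trailing cells with empty strings.
                String value = c < cells.length ? cells[c] : "";
                json.append("\"").append(esc(header[c])).append("\":\"")
                    .append(esc(value)).append("\"");
            }
            docs.add(json.append("}").toString());
        }
        return docs;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "firstName\tlastName\temail",
            "Ada\tLovelace\tada@example.com");
        convert(lines).forEach(System.out::println);
    }
}
```

Because each document carries its own keys, a downstream tExtractJSONFields can ask for every target column and simply get NULL for the ones a given file did not send.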

 

Variant 2

Check https://exchange.talend.com for the component tFileInputTextFlat by Jan Lolling.

tFileInputTextFlat

Define the component's schema with all target columns; it then does the same as Variant 1: it extracts each value if the column is present, or NULL otherwise.


In both cases, no doubt about it, the column names must be the same across all files.
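Whichever variant you choose, the underlying idea is the same: the target table's column list is fixed, and each file's header decides which of those columns receive values. A standalone sketch of that header-driven alignment, assuming a made-up target column list (the class and column names are hypothetical, not from Talend):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: align one file row to a fixed target schema.
// Columns the file does not send come out as null (-> NULL in Postgres).
public class HeaderMapper {

    // Assumed target table columns; replace with your real schema.
    static final List<String> TARGET =
        Arrays.asList("firstName", "lastName", "email", "uniqueId", "loginId");

    static Map<String, String> mapRow(String[] header, String[] cells) {
        // Index this file's values by its own header names.
        Map<String, String> byName = new HashMap<>();
        for (int i = 0; i < header.length && i < cells.length; i++) {
            byName.put(header[i], cells[i]);
        }
        // Emit values in the target order; absent columns map to null.
        Map<String, String> row = new LinkedHashMap<>();
        for (String col : TARGET) {
            row.put(col, byName.get(col));
        }
        return row;
    }

    public static void main(String[] args) {
        // A file that sends only three columns, in its own order.
        String[] header = "email\tlastName\tfirstName".split("\t");
        String[] cells = "ada@example.com\tLovelace\tAda".split("\t");
        System.out.println(mapRow(header, cells));
    }
}
```

With the rows aligned like this, a single generic tDBOutput (or a prepared INSERT listing all target columns) can load every customer file without a per-client mapping.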