Finding field separator before processing the file

McJingles · ‎2020-02-11

My folder contains numerous .txt files. The field separator is a comma (,) in some of them, a pipe (|) in others, a semicolon (;) in others, and so on.

Is there any option to extract the columns in the same job?

Thanks in advance

Anonymous · ‎2020-02-11

Yes, what you need to do is use the tFileInputDelimited and set the "Field Separator" field to a context variable. Then you can set the value of the context variable at the beginning of the job or between processing files. This can be done dynamically, but it may require a tiny bit of code.

McJingles · ‎2020-02-11

Thanks for the reply @rhall

Can you please elaborate on this further?

I am new to Talend. Can you please share the code or any link to what you discussed?

Anonymous · ‎2020-02-11

OK, in this example I have created a context variable called "sep". This can be seen here....

I have given it a value of ";" for this example. But context variable values can be set dynamically as well. You can assign the values in numerous ways. This is covered by other questions on the Community.

After doing this, I configured my tFileInputDelimited component as below....

Notice the "Field Separator" field is populated by ....

context.sep

This tells the component to use the value held by context.sep as the separator.
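For readers who want to see the idea in plain Java, here is a minimal sketch of a separator held in a variable and reassigned between files. The `Context` class is a hypothetical stand-in for the context object a Talend job provides, and `splitLine` is an illustrative helper, not a Talend API. Inside a real job the assignment would likely be a one-liner such as `context.sep = "|";` in a tJava component.

```java
import java.util.regex.Pattern;

class ContextSepDemo {
    // Hypothetical stand-in for the job's context object;
    // in a real job the variable is referenced as context.sep.
    static class Context {
        String sep = ";"; // default value, matching the example in this thread
    }

    // Splits one line using whatever separator the context currently holds.
    static String[] splitLine(Context context, String line) {
        // Pattern.quote treats the separator literally ("|" is a regex metacharacter);
        // the -1 limit keeps trailing empty fields.
        return line.split(Pattern.quote(context.sep), -1);
    }

    public static void main(String[] args) {
        Context context = new Context();
        System.out.println(splitLine(context, "a;b;c").length);

        context.sep = "|"; // reassigned before the next file is read
        System.out.println(splitLine(context, "a|b|c").length);
    }
}
```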

McJingles · ‎2020-02-11

Good one @rhall

I have nearly 20 files in the folder, which is defined in the tFileList component. I want to take all of the files as input.

The input files look like this:

AtmosRX | http://www.atmosrx.com/ | Product Feed 04-12-2019 -- Pipe(|) as a delimiter
Bellelily , http://www.bellelily.com/ , Bellelily products feed -- Here, Comma(,) as a delimiter

In this scenario, how can I extract the data using both the pipe (|) and comma (,) delimiters in the same job?

Moreover, I don't know which delimiter each input file uses, only that it will be a pipe, comma, or semicolon.

Anonymous · ‎2020-02-11

I can't tell you exactly how to do this without doing it myself. It doesn't really serve you to do this myself as you will not learn. However, I am happy to give you my considerations for a problem like this.

1) You know the separator types that it could be beforehand

2) With every file from the tFileList, you have the opportunity to pre check the first row of the file

3) You know how many columns there are in each file (otherwise this model for processing files will not work)

As such, you have all of the clues you need to identify which of the 3 separators it could be. For each iteration of the tFileList you do not need to process the data in one subjob. You could load the tFileList data, check the first row of each file, identify the separator (look at the split() function), save the details in a tHashOutput, then use a tHashInput to read the data in, iterate over it and read each file with the correct separator.
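The detection step described above could be sketched roughly like this in plain Java. It is only an illustration under stated assumptions: the class and method names are made up, the candidate list is the three separators mentioned in this thread, and the expected column count of 3 comes from the sample rows posted earlier. Note that split() takes a regular expression, so the pipe has to be quoted.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Pattern;

class SeparatorDetector {
    // Candidate separators named in the thread: pipe, comma, semicolon.
    private static final String[] CANDIDATES = {"|", ",", ";"};

    // Reads only the first row of the file; returns null for an empty file.
    static String readFirstLine(Path file) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return reader.readLine();
        }
    }

    // Returns the first candidate that splits the row into exactly
    // expectedColumns fields, or null if none of them does.
    static String detect(String firstLine, int expectedColumns) {
        if (firstLine == null) {
            return null;
        }
        for (String sep : CANDIDATES) {
            // Pattern.quote is required because "|" is a regex metacharacter;
            // the -1 limit keeps trailing empty fields in the count.
            int fields = firstLine.split(Pattern.quote(sep), -1).length;
            if (fields == expectedColumns) {
                return sep;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Sample rows from this thread, each with 3 columns.
        System.out.println(detect("AtmosRX | http://www.atmosrx.com/ | Product Feed 04-12-2019", 3)); // prints |
        System.out.println(detect("Bellelily , http://www.bellelily.com/ , Bellelily products feed", 3)); // prints ,
    }
}
```

In a job, the result of something like detect(...) would presumably be assigned to context.sep (or stored via tHashOutput) before the file is read. That wiring is an assumption and is not shown in the thread. The check can also be fooled when a data value itself contains one of the other candidate characters in just the right number, so treat a null or surprising result as a file to inspect by hand.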

Talend Data Integration

v7.x