topic Re: Source File Delimiter Capture in Talend Studio

Source File Delimiter Capture

krishu — Tue, 27 Nov 2018 12:08:23 GMT

Hi Team

Could you please help me with the scenario below?

How can I capture the delimiter used in a source file (semicolon, pipe, or comma) and compare it with the existing template in the database?

Thanks in advance

Krishu

Re: Source File Delimiter Capture

vapukov — Tue, 27 Nov 2018 12:20:55 GMT

If you know the structure (because you have the template), it is simple:

read the first line of the file and check what sits between two columns.

For example,

in the template you have

id;name;phone

and in the file

id,name,phone - all you need to do is check the 3rd character.

Of course, this is simplified logic; the real case could be different.
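
A minimal Java sketch of that first-line check (Java because Talend jobs compile to Java). The class and method names are made up for illustration, and it assumes the header line starts with the first column name from the template, with no padding before the delimiter:

```java
import java.util.Optional;

class HeaderDelimiterCheck {
    // Returns the character that directly follows the first expected column
    // name in the header line, or empty if the header does not start with
    // that column name or nothing follows it.
    static Optional<Character> delimiterAfterFirstColumn(String headerLine, String firstColumn) {
        if (headerLine == null || firstColumn == null) {
            return Optional.empty();
        }
        if (!headerLine.startsWith(firstColumn) || headerLine.length() <= firstColumn.length()) {
            return Optional.empty();
        }
        return Optional.of(headerLine.charAt(firstColumn.length()));
    }
}
```

For the header `id,name,phone` and first column `id` this yields the comma, which can then be compared with the `;` stored in the template. A header padded with spaces (such as `Name | ID`) would need trimming first.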

Re: Source File Delimiter Capture

Anonymous — Wed, 28 Nov 2018 02:52:23 GMT

That's an interesting problem. If the delimiter can only be one of several things (e.g. comma, semi-colon, etc.), and the data itself doesn't contain a lot of possible delimiters in close sequence, then you could read the first several lines and count the number of commas, etc. in each. If the first 10 lines each contain 20 commas, but only 3 semi-colons and zero pipes, then your delimiter is probably the comma.

The more potential delimiters you have in the actual data, the more rows you need to read to be sure you've found the actual delimiter; even so, if you read 1000 rows, and every single one has exactly 20 commas, then the probability of the comma *not* being the delimiter is vanishingly small (I'm tempted to estimate it based on the relative density of each possible delimiter, but it's late, and I've had a long day).
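
A rough Java sketch of the counting idea, under a few assumptions: the candidate set is limited to comma, semicolon, and pipe; the sampled lines are already read into a list; and a candidate only qualifies if it appears the same non-zero number of times in every non-empty line. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.Optional;

class DelimiterSniffer {
    // Assumed candidate set; extend it if other delimiters are possible.
    static final char[] CANDIDATES = {',', ';', '|'};

    // Counts how often c occurs in line.
    static int count(String line, char c) {
        int n = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == c) {
                n++;
            }
        }
        return n;
    }

    // Returns the candidate whose count is identical and non-zero across all
    // non-empty sampled lines. If several candidates qualify, the one with
    // the highest per-line count wins. Empty if none qualifies.
    static Optional<Character> sniff(List<String> lines) {
        Character best = null;
        int bestCount = 0;
        for (char c : CANDIDATES) {
            int first = -1;
            boolean consistent = true;
            for (String line : lines) {
                if (line.isEmpty()) {
                    continue; // blank lines carry no information
                }
                int n = count(line, c);
                if (first < 0) {
                    first = n;
                } else if (n != first) {
                    consistent = false;
                    break;
                }
            }
            if (consistent && first > bestCount) {
                best = c;
                bestCount = first;
            }
        }
        return Optional.ofNullable(best);
    }
}
```

Requiring an identical count per line is stricter than simply picking the most frequent character, so it will return nothing for files where the delimiter also appears inside quoted values; in that case a plain majority count may be the better fallback.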

Re: Source File Delimiter Capture

krishu — Thu, 29 Nov 2018 06:17:27 GMT

The template in DB is like this:

File_Name    Header_Info    Delimiter_Type
------------------------------------------
ABD.txt      Y              ;
XYZ.txt      Y              |

The source file (File_Name is XYZ.txt) looks like this:

Name | ID | City

-----------------------

Krishu |10 | Bangalore

I have to capture the delimiter used in the source file (I don't know in advance which delimiter it uses) and compare it with the Delimiter_Type in the template (as above). If they match, I should process the file further; otherwise I should reject the source file.

Thanks in advance

Krishu
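
One way the accept/reject step could be sketched in plain Java. Everything here is an assumption for illustration: the template rows are held in a `Map` from File_Name to Delimiter_Type rather than fetched from the database, the detected delimiter is simply the most frequent of comma, semicolon, and pipe in the header line, and the class and method names are invented.

```java
import java.util.Map;

class TemplateDelimiterCheck {
    // Assumed candidate set, taken from the delimiters named in the question.
    static final char[] CANDIDATES = {',', ';', '|'};

    // Returns the candidate that occurs most often in the header line,
    // or '\0' if none of the candidates occurs at all.
    static char detect(String headerLine) {
        char best = '\0';
        long bestCount = 0;
        for (char c : CANDIDATES) {
            long n = headerLine.chars().filter(ch -> ch == c).count();
            if (n > bestCount) {
                best = c;
                bestCount = n;
            }
        }
        return best;
    }

    // True if the file has a template entry and the delimiter detected in
    // its header line matches the template's Delimiter_Type.
    static boolean shouldProcess(String fileName, String headerLine, Map<String, Character> template) {
        Character expected = template.get(fileName);
        if (expected == null) {
            return false; // no template row for this file: reject
        }
        return detect(headerLine) == expected;
    }
}
```

With the template above, `shouldProcess("XYZ.txt", "Name | ID | City", template)` accepts the file, while the same header under `ABD.txt` (expected `;`) is rejected.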