Solved: Re: Best way to count number of commas in a row? - Qlik Community

talendtester · ‎2019-07-08

I have a large comma delimited file which has a single column which sometimes contains commas in the data, unfortunately there aren't double quotes around this column.

What is the best way to count the number of commas in a row and add double quotes around the column which needs them?

For example:

Col1,Col2,Col3,Col4

junk,junk,junk,junk

junk,junk,j,u,n,k,junk

junk,junk,junk,junk

junk,junk,ju,nk,junk

talendtester · ‎2019-07-10

I disagree because I know the following:

1. How many commas are supposed to be in a good row

2. I know which specific column has the commas in the data

This is what I am doing:

1. Filter the rows which have too many columns into file1

2. Reverse the data in the rows from file1

3. Put a double quote to close the column which has commas in the data

4. Reverse the data in the corrected rows, put in file2

5. Put a double quote before the start of the column which has commas in the data, put in file3

6. Insert the good rows from the original file into file3

View solution in original post

Anonymous · ‎2019-07-08

Hi @talendtester ,

You can read the file with tFileInputFullRow and in tMap for the fulle input string use StringUtils.countMatches(input_row,',').

Whenever this count+1 is greater than your number of columns, bifurcate those rows to some other files as those will contain extra commas and correct and re-process them.

You cannot correct them automatically as comma is a delimiter as well as it is appearing in the file.

Thanks and Regards,

Subhadip

talendtester · ‎2019-07-08

I probably need to do one pass which counts from the left side and then I need to do another pass which counts the commas starting from the right side of the row because I have a variable number of commas in the column which doesn't have double quotes around it.

Anonymous · ‎2019-07-08

Hi,

You do not have to count two times one from left and another from right. Either way, the number of commas will be more which means the records is not as per standard.

Its always better to agree with source flow during starting of the project to have robust data segregation and schema definition. Else you never know when the issue will hit you and it may even lead to data privacy law suits (For example, due to schema miscalculation, the data from person A went to person B).

Warm Regards,
Nikhil Thampi

Please appreciate our Talend community members by giving Kudos for sharing their time for your query. If your query is answered, please mark the topic as resolved 🙂

talendtester · ‎2019-07-08

The source of the data isn't going to change how they format the data.

Yes, I know there will extra commas.

If I split the good columns file1 and everything else file2, how do I clean up file2 when there is a variable amount of extra commas?

Is there a way to reverse all the data in a row in file2 and then count the number of commas before splitting the column with the extra commas into file3?

Anonymous · ‎2019-07-08

Hi,

You will have to send the data back to source to get the right data.

If you are trying to do the correction, it may (I would rather say will) lead to unwanted data errors.

Lets take your example. You got below data with extra columns. How do you know how to merge the data to which column?

junk,junk,j,u,n,k,junk

You may merge it as below.

junk,junk,junk,junk

But source might have expected the data in below format.

junk,junkju,nk,junk

junk,junkj,unk,junk

junk,junkjun,k,junk

junk,junk,junk,junk

So you never know about a data other than taking some assumptions. So ideal way is to send back to the source or send the data to a Data Steward Team for validation. I hope the answer cleared the fog 🙂

Warm Regards,
Nikhil Thampi

Please appreciate our Talend community members by giving Kudos for sharing their time for your query. If your query is answered, please mark the topic as resolved 🙂

talendtester · ‎2019-07-10

I disagree because I know the following:

1. How many commas are supposed to be in a good row

2. I know which specific column has the commas in the data

This is what I am doing:

1. Filter the rows which have too many columns into file1

2. Reverse the data in the rows from file1

3. Put a double quote to close the column which has commas in the data

4. Reverse the data in the corrected rows, put in file2

5. Put a double quote before the start of the column which has commas in the data, put in file3

6. Insert the good rows from the original file into file3

Best way to count number of commas in a row?

csv file

Ingestion

Java

Talend Data Integration

v7.x