Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
hi guys,
I'm a newbie in Talend and need some help.
I have a job which converts xlsx file to csv file with 100 columns like (col1;col2;col3;...;col100;). There is in the red section (1st) this file.
In the blue section (2nd) I want to reduce this file to by actual number of columns (header)…
I use tFileInputFullRow component ("read raw file") to convert all rows to one string
then use tMap-expression to trim all the last “;” characters and add one “;” like below:
TalendString.talendTrim(row10.newColumn,';',-1)+";"
so, for majority of my files this flow works correct, but some of them is not correct.. cause I have missed some values in some rows.. and its reduce not correct.. for example, I have actually header with 18 columns, and some of rows have cells with missed values.. and as a result it only contains 17 “;” .. and finally I have file with incorrect structure…
1st row is header
2nd and 3rd are correct
4-6th are incorrect and have only 17 instead of 18 separators.
can anybody help me how to do it?
I don't understand why you are removing the last ";" and re-adding it. Is this because you end up with some rows like this....
col1;col2;col3;col4;col5;col6;;;;;;;;;;
aa;bb;cc;dd;ee;ff;;;;;;;;;;
aa;bb;cc;dd;ee;ff;;;;;;;;;;
aa;bb;cc;dd;ee;ff;;;;;;;;;;
aa;bb;cc;dd;ee;ff;;;;;;;;;;
...and you want to remove the excess ";" characters? If so (and I presume that is the reason), I have knocked up a quick example. You will need to extrapolate from this to fit it to what you are doing, but it works for data like above.
Here is a screenshot of my job....
I have simply add a dataset like yours. It looks like this. It is 3 rows and a header. The header is the first row.
The tMap makes use of a routine I have written to do this with a bit of Java. Java is a MASSIVE advantage when trying to accomplish some tricky stuff like this. I'll show the Java next, but first I'll show the tMap......What is happening is 3 things. First, I am identifying the first row. The header row with a bit of code using the Numeric.sequence method. This counts the records that pass through the tMap. This sets the "RowNum" variable. The next thing I do is calculate the number of valid columns from the first row. I use the "RowNum" variable to decide whether to run this code or not. If it is 1, then it runs. Otherwise it uses the same value for the rest of the rows. The code I use for this I have written and will share below. The final thing I do is to use some more code to restrict all rows (even the header) to a strict number of columns (using the "ColumnCount" variable).
The code for each of the variables looks like this.....
RowNum
routines.Numeric.sequence("filename", 1, 1)
"filename" is just a String, but if you are using this for several files, you can use the filename to reset the counter for each file.
ColumnCount
Var.RowNum==1 ? routines.ValidColumnHeader.countValidColumns(row14.newColumn, ";") : Var.ColumnCount
This is an in-line IF condition. It says "IF the RowNum is equal to 1, use the "countValidColumns" method. Otherwise, use the last value of this variable.
FixedRow
routines.ValidColumnHeader.restrictValidColumns(row14.newColumn,";", Var.ColumnCount)
This processes each row (even the header) and limits it to the number of columns counted in the previous variable.
The routine that I have quickly knocked up can be seen below. It basically has two methods. One to count the number of valid columns in the header and the other to restrict all rows to that number of columns.
*******************************************************************************************
package routines;
public class ValidColumnHeader {
public static int countValidColumns(String header, String separator) {
String[] data = header.split(separator);
int counter = 0;
for(int i=0; i<data.length; i++) {
if(data[i].trim().compareTo("")!=0) {
counter++;
}
}
return counter;
}
public static String restrictValidColumns(String row, String separator, int columns) {
String[] data = row.split(separator,-1);
String returnString = "";
for(int i=0; i<columns; i++) {
returnString = returnString + data[i]+";";
}
return returnString;
}
}
*******************************************************************************************
You can create a Routine, call it ValidColumnHeader and paste this code in there to use it.
This should solve your issue, but may need a bit of tweaking to exactly suit your requirement. As an idea of what it does, this is the output of my job running, with the input data just before.....
Input Data
col1;col2;col3;col4;col5;col6;;;;;;;;;;;;;;;
aa;bb;cc;dd;ee;ff;;;;;;;;;;;
aa;bb;cc;dd;ee;ff;;;;;;;;;;;
aa;;cc;;ee;;;;;;;;;;;;;;
Output Data
col1;col2;col3;col4;col5;col6;
aa;bb;cc;dd;ee;ff;
aa;bb;cc;dd;ee;ff;
aa;;cc;;ee;;
I don't understand why you are removing the last ";" and re-adding it. Is this because you end up with some rows like this....
col1;col2;col3;col4;col5;col6;;;;;;;;;;
aa;bb;cc;dd;ee;ff;;;;;;;;;;
aa;bb;cc;dd;ee;ff;;;;;;;;;;
aa;bb;cc;dd;ee;ff;;;;;;;;;;
aa;bb;cc;dd;ee;ff;;;;;;;;;;
...and you want to remove the excess ";" characters? If so (and I presume that is the reason), I have knocked up a quick example. You will need to extrapolate from this to fit it to what you are doing, but it works for data like above.
Here is a screenshot of my job....
I have simply add a dataset like yours. It looks like this. It is 3 rows and a header. The header is the first row.
The tMap makes use of a routine I have written to do this with a bit of Java. Java is a MASSIVE advantage when trying to accomplish some tricky stuff like this. I'll show the Java next, but first I'll show the tMap......What is happening is 3 things. First, I am identifying the first row. The header row with a bit of code using the Numeric.sequence method. This counts the records that pass through the tMap. This sets the "RowNum" variable. The next thing I do is calculate the number of valid columns from the first row. I use the "RowNum" variable to decide whether to run this code or not. If it is 1, then it runs. Otherwise it uses the same value for the rest of the rows. The code I use for this I have written and will share below. The final thing I do is to use some more code to restrict all rows (even the header) to a strict number of columns (using the "ColumnCount" variable).
The code for each of the variables looks like this.....
RowNum
routines.Numeric.sequence("filename", 1, 1)
"filename" is just a String, but if you are using this for several files, you can use the filename to reset the counter for each file.
ColumnCount
Var.RowNum==1 ? routines.ValidColumnHeader.countValidColumns(row14.newColumn, ";") : Var.ColumnCount
This is an in-line IF condition. It says "IF the RowNum is equal to 1, use the "countValidColumns" method. Otherwise, use the last value of this variable.
FixedRow
routines.ValidColumnHeader.restrictValidColumns(row14.newColumn,";", Var.ColumnCount)
This processes each row (even the header) and limits it to the number of columns counted in the previous variable.
The routine that I have quickly knocked up can be seen below. It basically has two methods. One to count the number of valid columns in the header and the other to restrict all rows to that number of columns.
*******************************************************************************************
package routines;
public class ValidColumnHeader {
public static int countValidColumns(String header, String separator) {
String[] data = header.split(separator);
int counter = 0;
for(int i=0; i<data.length; i++) {
if(data[i].trim().compareTo("")!=0) {
counter++;
}
}
return counter;
}
public static String restrictValidColumns(String row, String separator, int columns) {
String[] data = row.split(separator,-1);
String returnString = "";
for(int i=0; i<columns; i++) {
returnString = returnString + data[i]+";";
}
return returnString;
}
}
*******************************************************************************************
You can create a Routine, call it ValidColumnHeader and paste this code in there to use it.
This should solve your issue, but may need a bit of tweaking to exactly suit your requirement. As an idea of what it does, this is the output of my job running, with the input data just before.....
Input Data
col1;col2;col3;col4;col5;col6;;;;;;;;;;;;;;;
aa;bb;cc;dd;ee;ff;;;;;;;;;;;
aa;bb;cc;dd;ee;ff;;;;;;;;;;;
aa;;cc;;ee;;;;;;;;;;;;;;
Output Data
col1;col2;col3;col4;col5;col6;
aa;bb;cc;dd;ee;ff;
aa;bb;cc;dd;ee;ff;
aa;;cc;;ee;;
@rhall, thank you very much for such a detailed explanation! Very understandable! I've never used a routine before. Your solution is pretty good!
No problem, I hope it helps!
Helleo I m new with Talend and i need to get the number of Clumns in csv file Input to make condition
if NumberColumns==y strat do..
if NumberColumns==x strat do..
else die the job
please any person can help
thank you in advance
The routine I posted above will help with that. You would need to use the "countValidColumns" method to get the number.
Thank you rhall for your response , yes i tried with your solution but i don't know how to configure my file csv with the componente tFixedFlowIput
You will need to use a tFileInputDelimited component for the CSV. You can see an example here....
https://help.talend.com/r/en-US/8.0/delimited/tfileinputdelimited-tlogrow-reading-data-from-delimited-file-and-display-output-standard-component-the