[resolved] Collect rejects from tFileInputDelimited
Hello Team,
I need to process a delimited file and collect all the rejects. To do so I use tFileInputDelimited -> Rejects -> tFileOutputDelimited, where tFileOutputDelimited is configured to write to a .txt file. Unfortunately, this output file contains partially parsed rejected lines along with an error message. What I really need in this file is the original line(s) from the input file that were rejected, without any additional information.
Here is an example. The input file contains the following line:
"AAA|BBB"|Literacy 9 Student Book A|NSB2B/NSB2C|Softcover|Softcover|1024118|0176398163|9780176398163|Literacy 9 Student Book A|ITEM-B2B|Online Student Centre, 5 year||Literacy 9 Student Book A Online Student Centre, 5 year|31|PRDONLYSUP|158
This line is rejected because it is not properly formed according to the csv schema that I defined. In the output file, instead of the full line above, I only get the following:
"AAA|BBB"|Literacy 9 Student Book A|NSB2B/NSB2C|Softcover|||||||||||||For input string: "Softcover" - Line: 0
Is there any way that I can collect the original input lines?
Thank you!
Svetlana
Hi,
What does your expected result look like? How did you define the csv schema?
Have you tried the component TalendHelpCenter:tSchemaComplianceCheck to see if it works? It validates all input rows against a reference schema, or checks the types, nullability, and length of rows against reference values.
Best regards
Sabrina
Hi Sabrina,
My expected result will contain the full line from the input file that was rejected.
For example, this line will throw an exception:
"AAA|BBB"|Literacy 9 Student Book A|NSB2B/NSB2C|Softcover|Softcover|1024118|0176398163|9780176398163|Literacy 9 Student Book A|ITEM-B2B|Online Student Centre, 5 year||Literacy 9 Student Book A Online Student Centre, 5 year|31|PRDONLYSUP|158
In this case, my expected result (the file with rejects) will contain this line in full:
"AAA|BBB"|Literacy 9 Student Book A|NSB2B/NSB2C|Softcover|Softcover|1024118|0176398163|9780176398163|Literacy 9 Student Book A|ITEM-B2B|Online Student Centre, 5 year||Literacy 9 Student Book A Online Student Centre, 5 year|31|PRDONLYSUP|158
My schema definition is attached as a screenshot.
I have not tried the tSchemaComplianceCheck component yet, so I am going to take a look at it.
Thank you!
Svetlana
Hi Sabrina,
tSchemaComplianceCheck doesn't seem to help. The line that I want to have in the rejects file fails because its number of columns is different from what is defined in the schema, so it will always fail in tFileInputDelimited without even reaching the tSchemaComplianceCheck element (see attached screenshot).
Thank you!
Svetlana
This field contains string values such as "Softcover", but you are using the "Long" data type to read it.
Try to read this column with the string data type and validate the input rows against a reference schema by using tSchemaComplianceCheck to see if it works.
Best regards
Sabrina
Hi Sabrina,
You are absolutely right: this line is deliberately incorrect. If I remove the "AAA|BBB" part, it is processed without any issues. My goal here is to collect all original lines from the input file that may throw an exception. Our program will process an input csv file from our client application, and there is absolutely no guarantee that all the lines in the file will be well formed. We will need to process what we can and collect the rest in a rejects file to send back to the client, so that they can deal with these rejects, fix them, and resubmit. So I need to be able to send them back the original line as it came from their input file, meaning I need to have this full incorrect line in my rejects file:
"AAA|BBB"|Literacy 9 Student Book A|NSB2B/NSB2C|Softcover|Softcover|1024118|0176398163|9780176398163|Literacy 9 Student Book A|ITEM-B2B|Online Student Centre, 5 year||Literacy 9 Student Book A Online Student Centre, 5 year|31|PRDONLYSUP|158
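To make the goal concrete, here is a rough plain-Java sketch of the behaviour described above, outside of Talend. Everything in it is an assumption for illustration: the class RejectCollector, the method partition, and the parameters expectedColumns and numericColumn are hypothetical names standing in for the real schema, and the split is naive (it does not honour quoted fields the way a text enclosure setting would).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not a Talend component: keeps the raw text of every line that fails parsing.
final class RejectCollector {

    // Holds the parsed rows and the untouched text of the rejected lines.
    static final class Result {
        final List<String[]> accepted = new ArrayList<>();
        final List<String> rejected = new ArrayList<>();
    }

    // expectedColumns and numericColumn are assumptions standing in for the real schema;
    // numericColumn must be smaller than expectedColumns.
    static Result partition(List<String> lines, int expectedColumns, int numericColumn) {
        Result result = new Result();
        for (String line : lines) {
            // Quote the separator so "|" is taken literally; -1 keeps trailing empty fields.
            String[] fields = line.split(Pattern.quote("|"), -1);
            try {
                if (fields.length != expectedColumns) {
                    throw new IllegalArgumentException("unexpected column count");
                }
                Long.parseLong(fields[numericColumn]); // type check for one numeric column
                result.accepted.add(fields);
            } catch (IllegalArgumentException e) { // also catches NumberFormatException
                result.rejected.add(line); // the original line, with no error text appended
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Book A|NSB2B|Softcover|Softcover|1024118|158",
                "\"AAA|BBB\"|Book A|NSB2B|Softcover|Softcover|1024118|158");
        Result result = partition(lines, 6, 4);
        // Print the rejected lines exactly as they were read.
        for (String rejected : result.rejected) {
            System.out.println(rejected);
        }
    }
}
```

The point of the design is only that the raw line is held in a variable before any parsing is attempted, so it is still available when parsing fails.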
Thank you,
Svetlana
Hi,
Could you please try to read this column with the string data type (actually, there is no check for String in Talend) and validate the input rows against a reference schema by using tSchemaComplianceCheck to see if it works?
Best regards
Sabrina
Hi Sabrina,
I am not sure which column you are referring to, but I tried the following:
1. If I modify the schema and read all columns according to their types (first two columns as strings), then yes, everything works fine: all lines reach the tSchemaComplianceCheck element and can be validated.
2. When I modify the input so that it does not match the schema (in this case, one extra string column at the beginning of the line, which is read as a string), the rejects happen in tFileInputDelimited and never reach tSchemaComplianceCheck for schema validation (please see the attached screenshot). The rejects happen because one of the fields of this line was expected to be a long but turned out to be a String.
3. I also tried to read the entire line as one big "input" column of type String and pass it to tSchemaComplianceCheck, hoping it would figure out how to parse it, but it did not; it gave me an "input cannot be resolved or not a field" error.
The only workaround that I see so far is to read each line of the input file as one long string and pass it to tExtractDelimitedFields for parsing. Then, when it fails to parse a line, I can use tJavaRow to collect the value of the row that was passed into tExtractDelimitedFields. But I am facing a strange problem here: for whatever reason, it does not parse my input line correctly. It splits every letter into a separate field (see snapshot below). If you could help me figure this one out, I can use this workaround to collect the info that I need. The configuration of tExtractDelimitedFields is attached.
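One possible explanation for the one-letter-per-field symptom, offered as an assumption since the thread does not confirm how tExtractDelimitedFields interprets its field separator: if the separator is handed to Java as a regular expression, a bare "|" means "empty or empty" and matches between every pair of characters. The plain-Java sketch below shows that behaviour and the escaped form; the class and method names are hypothetical. If this is indeed the cause, entering the separator in escaped form (for example "\\|") would be the thing to try.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class: compares an unescaped and an escaped pipe separator.
final class PipeSplitDemo {

    // "|" is the regex alternation operator, so this matches the empty string everywhere.
    static String[] splitAsRegex(String line) {
        return line.split("|");
    }

    // Pattern.quote makes the pipe a literal character; -1 keeps trailing empty fields.
    static String[] splitLiteral(String line) {
        return line.split(Pattern.quote("|"), -1);
    }

    public static void main(String[] args) {
        String line = "Softcover|1024118|158";
        // One element per character on Java 8 and later.
        System.out.println(Arrays.toString(splitAsRegex(line)));
        // Three elements: Softcover, 1024118, 158.
        System.out.println(Arrays.toString(splitLiteral(line)));
    }
}
```

Writing the separator as "\\|" in Java source has the same effect as Pattern.quote("|") here.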