Hi, I am quite new to Talend and need some assistance.
I have 2 files.
file1: a CSV data file.
file2: a JSON file containing the schema & validation rules.
The goal is to validate the input data file.
The JSON schema rules file specifies the data type, size, and even the valid values that are allowed for each type of field.
The complex part: some of these valid values need to be obtained from an SQL result, with the SQL statement provided in the schema for that field, e.g. the field named "STATE" in the JSON file.
All the rows in the data file need to be checked against the JSON rules file.
When fields are found to be invalid, a list of these errors needs to be built up and sent back so that the data file can be corrected.
I believe it should be easy enough to read the JSON file and the CSV data file.
What I could not find in my research was which Talend components can be used to validate the CSV data file against the JSON rules file.
ATTACHED are my 2 dummy sample files.
DummyData.txt needs to be renamed to DummyData.csv
DummyFieldRules.txt needs to be renamed to DummyFieldRules.json
Note: the following columns/fields in the DummyData.csv file will fail validation and need to be added to the error file, with an explanation as follows:
Example contents of the error output file created:
data row 2 is invalid, reason:
DOB is incorrect format, format expected yyyyMMdd
data row 4 is invalid, reason:
DOB is incorrect format, format expected yyyyMMdd
STATE is not valid
SALARY not in range of 1000.01 to 1000000
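To give an idea of the structure of the rules file without the attachment, it looks roughly like this (the key names and the size value here are only indicative, not the exact file contents):

    {
      "fields": [
        { "name": "DOB",    "type": "date",    "format": "yyyyMMdd" },
        { "name": "STATE",  "type": "string",  "size": 3,
          "validValuesSql": "SELECT state_code FROM tSTATES WHERE country = 'AUSTRALIA'" },
        { "name": "SALARY", "type": "decimal", "min": 1000.01, "max": 1000000 }
      ]
    }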
@MARIO MOJSOVSKI, you can read the JSON using tFileInputJSON, then connect it to a tMap and validate the fields there.
Thanks,
Manohar
Thanks for the quick response.
Questions
So I map the data file and the JSON rules file to tMap for validation?
While in tMap, can I also execute the SQL statement against SQL Server to validate the data for the STATE code, like in my example?
I think it is impossible to read the rules from a file and apply them to validate the data during job execution. We usually use tSchemaComplianceCheck to validate the data against a predefined schema.
Regards
Shong
Hi Shong,
Thanks for your response.
I did some research and wrote a simple job that uses two Talend components.
First I defined the data schema, then I used:
(1) tFileInputDelimited with the "Advanced settings" option "Check each row structure against schema" - this rejects any records with an invalid structure.
Valid records from the above flow into
(2) tSchemaComplianceCheck - this rejects any records where the columns are not compliant with the defined schema.
This leaves only valid records.
Questions
(1) Is this the type of approach you were referring to in your previous response?
(2) For the remaining valid records that are not rejected by the above 2 components - how do I go about checking for valid values in the columns?
e.g. the SALARY value can only be between 100 and 1000, OR the STATE value has to be validated against the result returned from the DB, e.g. SELECT state_code from tSTATES where country = 'AUSTRALIA' (see the rough condition sketched after these questions).
(3) Finally, how will these valid-value rules be associated with the schema defined in Talend?
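(To give an idea of what I mean in question 2: I imagine the SALARY part could be a Java condition in tFilterRow's advanced mode, roughly like the line below - the STATE lookup against the DB result is the part I don't know how to wire up.)

    // hypothetical tFilterRow "advanced mode" condition for the SALARY rule,
    // assuming SALARY is a nullable Double column in the schema
    input_row.SALARY != null && input_row.SALARY >= 100 && input_row.SALARY <= 1000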
Here is what I have done so far, based on your first response.
Hi Manohar
Thanks for your response yesterday; I had a question but accidentally did not post it as a reply.
And now that Shong has also responded, I have an extra question.
Questions
1. With regard to Shong's response, do you think it's still possible to do this the way you suggested?
If so, can you please elaborate by confirming that you mean:
2. to map the data file and the JSON rules file to tMap for validation?
3. While in tMap, can I also execute the SQL statement against SQL Server to validate the data for the STATE code, like in my example?
Hi
Yes, this is the type of approach we usually use. About validating the values, you can use tFilterRow or tMap to filter the rows.
About the query: you can NOT execute any query in tMap; however, you can use a tDBInput component to execute the query and store the result in memory or somewhere for later use.
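For example, a rough sketch of that pattern (the globalMap key "stateCodes" and the row/column names here are placeholders, not from an actual job):

    // In a tJavaRow connected after the tDBInput that runs the SELECT:
    // collect every state_code into a Set shared via globalMap.
    java.util.Set<String> codes = (java.util.Set<String>) globalMap.get("stateCodes");
    if (codes == null) {
        codes = new java.util.HashSet<String>();
        globalMap.put("stateCodes", codes);
    }
    codes.add(input_row.state_code);

    // Later, e.g. in a tMap expression or a tFilterRow advanced condition:
    ((java.util.Set<String>) globalMap.get("stateCodes")).contains(row1.STATE)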
Can you explain why country = 'AUSTRALIA' - does the value 'AUSTRALIA' come from the DummyData file?
Regards
Shong
Hi Shong,
Thanks for getting back to me.
I like the idea of executing the SQL upfront and storing it in memory to access the values later.
The SQL statement is in the JSON rules file for the data validation.
"AUSTRALIA" is just a hardcoded example; the SQL can be anything.
The JSON file is what is going to drive a "validation engine" that I need to build in Talend.
When I read the data file, I cannot define a schema for it, because the number of fields/columns can change dynamically, based on the JSON file.
I have to read it in and then verify the number of columns against the JSON rules file, as well as their data types and valid values.
So it seems I can't even use the components
tFileInputDelimited
tSchemaComplianceCheck
- it all has to be dynamic, based on the JSON file definition.
So a validation engine needs to validate it all.
Looks like I will have to write Java in the Java components.
For example, if I were to code this in C#, the algorithm would be something like this:
For every data_row
    For every data_row_field
        IF data_row_field name = a JSON_rules_fieldName
            Iterate through all columns of the current JSON_rule_fieldName row
            Apply all the JSON checks for the JSON_rule_fieldName row
                to the current data_row_field
            and apply the NOT NULL rules to the data_row_field being processed
ALL invalid rules/errors are collected for later notification
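To make that more concrete, here is a rough, self-contained Java sketch of that algorithm (Java, since that is what the Talend components generate). This is not the actual job code: the Rule class and the hard-coded rules and rows stand in for what would really be read from DummyFieldRules.json and DummyData.csv, and the CSV parsing is deliberately naive:

    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.*;

    public class ValidationEngineSketch {

        // Stand-in for one field entry read from the JSON rules file.
        static class Rule {
            String name;              // field name, e.g. "DOB"
            String dateFormat;        // e.g. "yyyyMMdd", or null if not a date
            Double min, max;          // numeric range, or null
            Set<String> validValues;  // e.g. state codes loaded via the SQL, or null

            List<String> check(String value) {
                List<String> errors = new ArrayList<>();
                if (dateFormat != null) {
                    SimpleDateFormat sdf = new SimpleDateFormat(dateFormat);
                    sdf.setLenient(false); // reject values like 20231399
                    try {
                        sdf.parse(value);
                    } catch (ParseException e) {
                        errors.add(name + " is incorrect format, format expected " + dateFormat);
                    }
                }
                if (min != null && max != null) {
                    try {
                        double d = Double.parseDouble(value);
                        if (d < min || d > max) {
                            errors.add(name + " not in range of " + min + " to " + max);
                        }
                    } catch (NumberFormatException e) {
                        errors.add(name + " is not a number");
                    }
                }
                if (validValues != null && !validValues.contains(value)) {
                    errors.add(name + " is not valid");
                }
                return errors;
            }
        }

        public static void main(String[] args) {
            // In the real job these rules would be parsed from DummyFieldRules.json
            // (and the STATE values loaded from the database beforehand).
            Map<String, Rule> rules = new LinkedHashMap<>();
            Rule dob = new Rule();
            dob.name = "DOB";
            dob.dateFormat = "yyyyMMdd";
            rules.put(dob.name, dob);
            Rule salary = new Rule();
            salary.name = "SALARY";
            salary.min = 1000.01;
            salary.max = 1000000.0;
            rules.put(salary.name, salary);

            // In the real job these lines would come from DummyData.csv.
            List<String> lines = Arrays.asList("DOB,SALARY", "19800101,5000", "1980-01-01,0.5");
            String[] header = lines.get(0).split(",");
            for (int row = 1; row < lines.size(); row++) {
                String[] fields = lines.get(row).split(",");
                List<String> errors = new ArrayList<>();
                for (int col = 0; col < header.length && col < fields.length; col++) {
                    Rule rule = rules.get(header[col]);
                    if (rule != null) {
                        errors.addAll(rule.check(fields[col]));
                    }
                }
                if (!errors.isEmpty()) {
                    // the row numbering convention (header counted or not) is a guess
                    System.out.println("data row " + (row + 1) + " is invalid, reason:");
                    for (String e : errors) {
                        System.out.println("  " + e);
                    }
                }
            }
        }
    }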
I have attached a doc containing the dummy data, dummy data rules and expected results.
Looks to me like a tricky situation 🙂