MOJ
Contributor

Which Talend components can be used to validate a CSV data file against a JSON schema rules file?

Hi, I am quite new to Talend and need some assistance.

I have 2 files.

file 1: CSV data

file 2: JSON file containing the schema & validation rules.

Goal: validating the input data file.

The JSON schema rules file specifies the data type, size, and even the valid values that are allowed for each field.

Complex part - some of these valid values need to be obtained from an SQL result; the SQL statement is provided in the schema for that field, e.g. the field named "STATE" in the JSON file.

All the rows in the data file need to be checked against the JSON rules file.

When fields are found to be invalid, a list of these errors needs to be built up and sent back so that the data file can be corrected.

I believe it should be easy enough to read the JSON file and the CSV data file.

What I could not find in my research was which Talend components can be used to validate the CSV data file against the JSON rules file.

ATTACHED are my 2 dummy sample files.

DummyData.txt needs to be renamed to DummyData.csv

DummyFieldRules.txt needs to be renamed to DummyFieldRules.json

Note: the columns/fields in the DummyData.csv file that will fail validation need to be added to the error file, with an explanation, as follows:

Example contents of the error output file created:

data row 2 is invalid, reason:
DOB is incorrect format, format expected yyyyMMdd

data row 4 is invalid, reason:
DOB is incorrect format, format expected yyyyMMdd
STATE is not valid
SALARY not in range of 1000.01 to 1000000

7 Replies
manodwhb
Champion II

@MARIO MOJSOVSKI, you can read the JSON using tFileInputJson, then connect it to a tMap and validate the fields there.
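For the format checks, a tMap output filter expression along these lines should work (just a sketch: row1.DOB is a hypothetical input column, and TalendDate is Talend's built-in date routine - if isDate is not available in your version, a strict SimpleDateFormat parse does the same job):

// keep the row only when DOB parses strictly as yyyyMMdd
row1.DOB != null && TalendDate.isDate(row1.DOB, "yyyyMMdd")

Rows failing the filter can be routed to a reject output to build the error list.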

 

Thanks,

Manohar
MOJ
Contributor
Author

Thanks for the quick response.

Questions

So do I map the data file and the JSON rules file into tMap for validation?

While in tMap, can I also execute the SQL statement against SQL Server to validate the data for the STATE code, as in my example?

 

Anonymous
Not applicable

I think it is impossible to read the rules from a file and apply them to validate the data during job execution. We usually use tSchemaComplianceCheck to validate the data against a predefined schema.

 

Regards

Shong
MOJ
Contributor
Author

Hi Shong,

Thanks for your response.

I did some research and wrote a simple job that uses two Talend components:

First I defined the data schema, then I used:

(1) tFileInputDelimited with "Advanced settings" > "Check each row structure against schema" - this rejects any records with an invalid structure.

Valid records from above flow into

(2) tSchemaComplianceCheck - this rejects any records whose columns are not compliant with the defined schema,

leaving only valid records.

 

 

Questions

(1) Is this the type of approach you were referring to in your previous response?

(2) For the remaining valid records that are not rejected by the above 2 components - how do I go about checking for valid values in the columns?

e.g. the SALARY value can only be between 100 and 1000, OR the STATE value has to be validated against the result returned from the DB, e.g. SELECT state_code FROM tSTATES WHERE country = 'AUSTRALIA' (see the sketch after these questions)

(3) Finally, how will these valid-value rules be associated with the schema defined in Talend?
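To illustrate what I mean in (2), I imagine a tMap or tFilterRow expression along these lines for the range check (just a sketch, assuming SALARY is a numeric column):

// keep the row only when SALARY falls inside the allowed range
row1.SALARY >= 100 && row1.SALARY <= 1000

though I don't yet see where the STATE lookup result would come from.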

 

 

Here is what I have done so far, based on your first response:

 

[screenshot of the job flow]

MOJ
Contributor
Author

Hi Manohar

 

Thanks for your response yesterday. I had a question, but accidentally did not post it as a reply.

And now that Shong has also responded, I have an extra question.

Questions

1. With regard to Shong's response, do you think it's still possible to do this the way you suggested?

If so, can you please elaborate by confirming that you mean:

2. to map the data file and the JSON rules file into tMap for validation?

3. While in tMap, can I also execute the SQL statement against SQL Server to validate the data for the STATE code, as in my example?

 

Anonymous
Not applicable

Hi

Yes, this is the type of approach we usually use. To validate the values, you can use tFilterRow or tMap to filter the rows.

About the query: you can NOT execute a query in tMap; however, you can use a tDBInput component to execute the query and store the result in memory, or somewhere else, for use later.
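For example, a sketch of that idea (assuming the tDBInput returns a column named state_code, with a tJavaRow placed right after it):

// tJavaRow after tDBInput: collect every state code from the query into a shared set
java.util.Set<String> states = (java.util.Set<String>) globalMap.get("validStates");
if (states == null) {
    states = new java.util.HashSet<String>();
    globalMap.put("validStates", states);
}
states.add(input_row.state_code);

Then, in the main flow, a tMap or tFilterRow expression can check membership:

((java.util.Set<String>) globalMap.get("validStates")).contains(row1.STATE)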

 

Can you explain why country = 'AUSTRALIA'? Does the value 'AUSTRALIA' come from the DummyData file?

 

Regards

Shong

MOJ
Contributor
Author

Hi Shong,

Thanks for getting back to me.

I like the idea of executing the SQL upfront and storing it in memory to access the values later.

The SQL statement is in the JSON rules file for the data validation.

"AUSTRALIA" is just a hardcoded example; the SQL can be anything.

The JSON file is what is going to drive a "validation engine" that I need to build in Talend.

When I read the data file, I cannot define a schema for it, because the number of fields/columns can change dynamically, based on the JSON file.

I have to read it in and then verify the number of columns against the JSON rules file, as well as their data types, and the valid values.

 

So it seems I can't even use the components

tFileInputDelimited
tSchemaComplianceCheck

it all has to be dynamic, based on the JSON file definition.

So a validation engine needs to validate it all.

Looks like I will have to be writing Java in the Java components.

 

For example, if I were to code this in C#, the algorithm would be something like this:

 

For every data_row
    For every data_row_field
        IF data_row_field matches a JSON_rules_fieldName
            Iterate through all columns of the current JSON_rules_fieldName row
            Apply all the JSON checks for the JSON_rules_fieldName row
                to the current data_row_field
            and apply the NOT NULL rules to the data_row_field being processed
ALL invalid rules/errors are collected for later notification
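Translated into the Java I would actually have to write in a routine or tJavaFlex, a minimal sketch could look like this (assuming the rules file is parsed with Jackson, and assuming hypothetical rule keys such as fieldName, required, format, min and max - the real attached JSON layout may differ):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.List;

public class ValidationEngine {
    public static void main(String[] args) throws Exception {
        // Hypothetical rules layout: [{"fieldName":"DOB","required":true,"format":"yyyyMMdd"}, ...]
        JsonNode rules = new ObjectMapper().readTree(new File("DummyFieldRules.json"));

        // One parsed CSV data row, columns in the same order as the rules (dummy values)
        String[] dataRow = {"19801301", "XYZ", "500.0"};
        List<String> errors = new ArrayList<>();

        int col = 0;
        for (JsonNode rule : rules) {                      // for every data_row_field
            String field = rule.path("fieldName").asText();
            String value = col < dataRow.length ? dataRow[col] : null;
            col++;

            // NOT NULL rule
            if (rule.path("required").asBoolean(false) && (value == null || value.isEmpty())) {
                errors.add(field + " is missing");
                continue;
            }
            // date format rule, e.g. yyyyMMdd for DOB
            if (value != null && rule.hasNonNull("format")) {
                SimpleDateFormat sdf = new SimpleDateFormat(rule.get("format").asText());
                sdf.setLenient(false);                     // strict: rejects month 13 etc.
                try {
                    sdf.parse(value);
                } catch (ParseException e) {
                    errors.add(field + " is incorrect format, format expected "
                            + rule.get("format").asText());
                }
            }
            // numeric range rule, e.g. SALARY between min and max
            if (value != null && rule.has("min") && rule.has("max")) {
                try {
                    double d = Double.parseDouble(value);
                    if (d < rule.get("min").asDouble() || d > rule.get("max").asDouble()) {
                        errors.add(field + " not in range of "
                                + rule.get("min").asText() + " to " + rule.get("max").asText());
                    }
                } catch (NumberFormatException e) {
                    errors.add(field + " is not a number");
                }
            }
        }
        errors.forEach(System.out::println);               // collected for later notification
    }
}

The STATE-style checks would plug into the same loop, testing the value against the in-memory result of the rule's SQL statement instead of a fixed range.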

 

I have attached a doc containing the dummy data, dummy data rules and expected results.

Looks to me like a tricky situation 🙂