Best way of dealing with rejected records from data cleaning components?

Hi all,

I have a job where I receive a file with a number of records (each row is a user, with columns for phone number, email address, postcode, etc.) and I want to load these records into a database. Before doing so, I have built a flow with several data cleaning components (tFilterRow, tPatternCheck, ...) that make sure the records follow certain rules (name contains no invalid characters, name is longer than 2 characters, postcode is not overseas, and so on). Records that break a rule get 'flagged', that is, they are sent to the 'reject' flows of the respective data cleaning components.
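For reference, rules of this kind can be written as plain Java predicates. This is only a minimal sketch: the exact regexes and the UK-style postcode pattern are assumptions for illustration, not the actual rules in the job.

```java
import java.util.function.Predicate;

// Hedged sketch of the kind of rules the cleaning components enforce.
public class CleaningRules {

    // "name without invalid chars" and "name longer than 2 chars"
    static final Predicate<String> VALID_NAME =
        n -> n != null
          && n.length() > 2
          && n.matches("[A-Za-z .'-]+");

    // "postcode not overseas" - assuming a UK-style pattern here
    static final Predicate<String> DOMESTIC_POSTCODE =
        p -> p != null
          && p.matches("[A-Z]{1,2}[0-9][A-Z0-9]? ?[0-9][A-Z]{2}");

    public static void main(String[] args) {
        System.out.println(VALID_NAME.test("Jo"));   // false: flagged, yet a false positive
        System.out.println(VALID_NAME.test("Joan")); // true
    }
}
```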

What I am trying to find out is the best way of dealing with these 'flagged' records. When the job finishes running, I would like to analyse the rejects manually (rows whose names contain invalid characters, names that fail the minimum-length rule, etc.) and decide whether they are genuine rejects or false positives (names like 'Jo' are still valid, for example), or even correct them (when they contain invalid characters). Once these corrections are made, I would like to feed the records back into the flow with the 'unflagged' records so that they get added to the database as well.

This can be achieved with an Excel file in a fairly manual process (dump the rejects from each component into a tab of the file and correct them by hand), but I was wondering whether there is a more effective way of doing this. I have tried the Talend Data Stewardship Console, but since it is designed for handling matching rejections it doesn't really fit this case.

Any ideas? 
Best regards

1 Reply

Speaking from my own experience...
In a previous assignment I did this:
1 - Load every row as Strings into a LOAD_<table>, setting a status field to "not processed". This LOAD_<table> is now your source data.
2 - Process all "not processed" rows, changing their status to "processed" or "with error" (see the sketch after this list).
    At the same time, load any row with status "manually validated" without sending it through the validation process.
3 - A team member built a web service (you can also just work directly in the DB) so that support could go through every "with error" or "invalid" row and either:
    fix it, marking it as:
        "manually validated" - if it contains valid data that will always fail the rules, like your 'Jo' example (this should be avoided as much as possible),
        or "not processed" - if they just fixed, say, a postal code and want the row to be validated again;
    or discard it (mark it as "invalid").
4 - Go to 2 (or 1 if you have more data to load).
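Here is a minimal JDBC sketch of step 2, the status-driven pass. The table name LOAD_USERS, the column names, the JDBC URL and the isValid() rules are all assumptions for illustration; in a Talend job this logic would of course live in the components rather than in hand-written code.

```java
import java.sql.*;

// Hedged sketch of step 2: validate pending rows and record the outcome.
public class StatusDrivenLoad {

    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/staging", "loader", "secret")) {
            processPending(con);
        }
    }

    static void processPending(Connection con) throws SQLException {
        String select = "SELECT id, name, postcode, status FROM LOAD_USERS "
                      + "WHERE status IN ('not processed', 'manually validated')";
        String update = "UPDATE LOAD_USERS SET status = ? WHERE id = ?";
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(select);
             PreparedStatement up = con.prepareStatement(update)) {
            while (rs.next()) {
                // "manually validated" rows bypass the rules entirely.
                boolean ok = "manually validated".equals(rs.getString("status"))
                          || isValid(rs.getString("name"), rs.getString("postcode"));
                // The insert into the actual target table is omitted;
                // only the status transition is shown.
                up.setString(1, ok ? "processed" : "with error");
                up.setLong(2, rs.getLong("id"));
                up.executeUpdate();
            }
        }
    }

    // Stand-in for the cleaning rules (what tFilterRow/tPatternCheck enforce).
    static boolean isValid(String name, String postcode) {
        return name != null && name.length() > 2
            && name.matches("[A-Za-z .'-]+")
            && postcode != null && !postcode.trim().isEmpty();
    }
}
```

Keeping the rules in one place means a row that support sends back as "not processed" is re-checked by exactly the same logic on the next run.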


Hope it helps