I want to detect unwanted duplicates in the load script (preferably with an option to abort the script if duplicates are found).
What is the best way of doing this?
I get it done with the script below (*), but I assume there is a more direct way to do it.
//These should be unique
UniquePeriods:
Load
yearMonth as thisfield
, count(yearMonth) as counter
Resident MasterCalendar
group by yearMonth;
//Returning max of counter
TmpMax:
Load
max(counter) as MaxValue
Resident UniquePeriods;
//Setting this value into script
let vMaxValue = Peek('MaxValue', 0, 'TmpMax');
//Abort with comments if duplicate
Call AbortIfDuplicate(vMaxValue,'UniquePeriods');
//This should fail with MaxValue 31
(*) My abort subroutine forces an error by loading a non-existent table, but this seems to be very unstable: often only the Trace part, and not the Load part, is triggered. If someone has other suggestions I'm happy to hear them, but that's a different topic.
Ideally I would like the option to run this check against any table, maybe even on two dimensions (like yearMonth and personID).
I would also prefer not having to plan many extra tables just for the check. Here I used group by yearMonth in the initial load to demonstrate the failure, but to make the example work, another tmp table has to be used for the counter. It seems that even when using "distinct", the counter still relates to the initial table.
Basically I would love to be able to just...
Call CheckIfUnique (yearMonth, personID, [optional dimensions]);
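For what it's worth, here is a rough sketch of how such a reusable check could look in Qlik script. It is only an illustration under my own assumptions: the subroutine name CheckIfUnique, the parameters vTable and vFieldList and the forced-error abort are made up for this example, not an existing Qlik feature. The idea is to compare the row count of the table with the number of distinct combinations of the listed key fields.

// Hypothetical generic uniqueness check (sketch, names are invented)
Sub CheckIfUnique (vTable, vFieldList)

    // Distinct combinations of the key fields, e.g. 'yearMonth' or 'yearMonth, personID'
    _DupCheck:
    NoConcatenate
    Load Distinct
        $(vFieldList)
    Resident $(vTable);

    Let vRows = NoOfRows('$(vTable)');
    Let vDistinctRows = NoOfRows('_DupCheck');

    Drop Table _DupCheck;

    If $(vRows) > $(vDistinctRows) Then
        Trace ### $(vTable): $(vRows) rows but only $(vDistinctRows) distinct combinations of $(vFieldList) - aborting ###;
        // Force an error so the reload fails; Exit Script; would stop the script without failing it
        Load * From [ThisFileDoesNotExist.qvd] (qvd);
    End If

End Sub

// Usage:
Call CheckIfUnique('MasterCalendar', 'yearMonth');
Call CheckIfUnique('Facts', 'yearMonth, personID');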
I understand what you are trying to achieve, but I think it's not possible - at least not as a general solution for automatic data-quality checking, and not with sensible effort, because it could easily take a hundred times the work that a "normal" environment would need.
There are a lot of reasons for this. You would need to know (and store somewhere) the relevant target values for each field and each combination of fields - not only the count of the field values but also the values themselves and their format, to decide whether they are valid or not. In regard to your example, for about half the months a count of 31 underlying days would already be invalid. This means not only the field itself needs to be checked but also its distribution in relation to other fields, as well as where the data come from and when. A full value and format check would be even more complicated.
So from a theoretical point of view such quality-check methods would be desirable, but mostly it's too late. The essential point is that the data-quality work has to be done beforehand, by ensuring that only valid data are created / exchanged / exported and then loaded and merged within the target tool. Detecting a data-quality issue in downstream logic is much more expensive and is not really a solution in itself, because it raises the question: what should be done now? Abort, ignore, clean/fix, whatever ...
That a general solution is rather not sensible doesn't mean that no data-quality check is sensible. But IMO only in a selective way, for certain data and a few KPIs (number of records, file size / file time, ...), and not against the entire environment.
Regarding your example with the master calendar, a programmatic approach isn't needed. It's enough to create a master calendar once and then use it in all applications and for multiple years - maybe 20 years back and 10 into the future. That would result in about 10k records, which could and should be checked manually. IMO it's not very hard work.
Another method to prevent unwanted results in regard to the number of records is to avoid any kind of join within the scripts and to replace joins with mappings.
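To illustrate the mapping idea, here is a minimal sketch with invented table, field and file names: a mapping load plus ApplyMap() adds a single field per key and cannot multiply records, which a join could do if the key is not unique on the joined side.

// Mapping table: exactly two columns, key and value
PersonMap:
Mapping Load
    personID,
    personName
From [lib://Data/Persons.qvd] (qvd);

// Instead of a left join: look up the value per key - the record count of Facts cannot change
Facts:
Load
    yearMonth,
    personID,
    ApplyMap('PersonMap', personID, 'unknown') as personName,
    amount
From [lib://Data/Facts.qvd] (qvd);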
Thanks for your explanation. To me it sounds a bit like "perfect is the enemy of good". Yes, I do understand that not all data quality can or should be checked in the load script, but this was intended for very basic, structural data. And mapping isn't always a suitable approach all the way through a bigger application.
I may not even use it permanently in a "stable" application, but it's convenient when building a new application that collects data from many sources.
This is not something I have to do, but I have found that even with the best intentions I may slip, either because I'm mistaken myself, or because underlying data has changed in a way it shouldn't. The calendar example was used because I accidentally missed "distinct" and suddenly discovered that the months were duplicated, with consequences further down the line. I have also had issues where the assumption that a person is unique failed due to poor underlying data quality.
Hence I like to run my checks on the basic structural elements in the process - for the comfort of knowing that the basics should be fairly intact, and to be notified as soon as possible if they aren't.
It's not my intention to turn this into a bigger discussion of best practices overall. I personally value a couple of these checks and would like to know how neatly such a check can be scripted. I understand that going overboard and checking every possible angle of data quality might not be a good idea, but finding a tool is a separate discussion from when to use that tool.
Of course not every possible data-quality issue or mistake of your own can be caught, but many things can be prevented. As hinted above, a master calendar can be created once, for all years and all applications; once it has been checked and proven, that is enough. I know it's quite common to re-create master-calendar logic again and again, but it's definitely not necessary.
Further, the mentioned mapping approach can be applied in all scenarios where tables are merged horizontally. If needed, mappings can be nested multiple times - horizontally and vertically - and they are simple, fast and very flexible, with no risk to the number of records. Currently I can't recall a single use of join logic within our environment, except where it is intentionally used to create cartesian products (no common key) or for the occasional inner join to filter data - rather seldom, and mostly done with exists() instead.
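As a small illustration of the exists() filtering mentioned above (my own sketch, again with invented names): loading only records whose key already occurs in a previously loaded table gives the same filtering effect as an inner join, without modifying the other table.

// Persons is loaded first, so its personID values are known to the data model
Persons:
Load personID, personName
From [lib://Data/Persons.qvd] (qvd);

// Keep only facts for known persons - same effect as an inner join on personID
Facts:
Load yearMonth, personID, amount
From [lib://Data/Facts.qvd] (qvd)
Where Exists(personID);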
If a few basics are then applied - developing everything in the direction of a star schema, no use of QUALIFY and no wildcards, defining everything explicitly instead - there is not much that can really go wrong inside Qlik.
Beside this you could of course check the data in a more elegant and performant way, for example by looping through all tables and, within that, through all fields, reading the distinct field values from the system tables. For this take a look at the table and field functions like TableName(), NoOfTables(), NoOfFields(), FieldValue() and so on. Compared with resident loads this is much more performant and simpler. You may also include if-conditions within the loops to skip certain tables and fields and/or to perform any checks.
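A minimal sketch of such a loop (my own illustration; the check itself is a made-up rule): it walks over all tables and their fields with the system functions mentioned above and compares row counts with distinct value counts.

For t = 0 to NoOfTables() - 1

    Let vTable = TableName($(t));

    For f = 1 to NoOfFields('$(vTable)')

        Let vField = FieldName($(f), '$(vTable)');
        Let vRows = NoOfRows('$(vTable)');
        // Note: FieldValueCount() counts the distinct values of a field across the whole data model
        Let vDistinct = FieldValueCount('$(vField)');

        // Example rule: a field that should be unique per row must not have fewer distinct values than rows
        If '$(vField)' = 'yearMonth' and $(vDistinct) < $(vRows) Then
            Trace ### $(vTable).$(vField): $(vRows) rows, only $(vDistinct) distinct values ###;
        End If

    Next f

Next t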
What can't be done by reading the system fields is checking dependencies between multiple fields. If that is needed in places, the relevant fields could be combined beforehand, like: Date & '|' & YearMonth as CheckYearMonth, and dropped again within the loop after a successful check.
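A short sketch of that idea (with assumed table and file names): build the combined helper field while loading, let the loop above check it like any other field, then drop it again.

// Helper field so the combination of two fields can be checked like a single field
MasterCalendar:
Load
    *,
    Date & '|' & YearMonth as CheckYearMonth
From [lib://Data/MasterCalendar.qvd] (qvd);

// ... run the field checks here ...

// Remove the helper field once the check has passed
Drop Field CheckYearMonth From MasterCalendar;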
Such logic, like a master calendar, only needs to be developed once; it can be stored in one or several external script files and loaded and executed on demand via an include variable.
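For completeness, this is how such an external routine could be pulled into an app (the path is made up): $(Must_Include=...) fails the reload if the file is missing, while $(Include=...) would silently skip it.

// Load the shared check routines from an external script file
$(Must_Include=lib://Scripts/DataQualityChecks.qvs);

// Afterwards the subroutines defined there can be called, e.g.:
Call CheckIfUnique('MasterCalendar', 'yearMonth');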