Qlik Community

Qlik Design Blog

All about product and Qlik solutions: scripting, data modeling, visual design, extensions, best practices, etc.

Announcements
Support Case Portal has moved to Qlik Community! Read the FAQs to start exploring Support resources.
Henric_Cronström

 

When building business intelligence solutions one problem is that data usually contains errors, e.g. attributes are written in different ways so that data cannot be grouped correctly. The attribute could be written in upper case or not; it could be abbreviated or not; and sometimes several synonyms exist for the same thing.

For instance, ‘United Kingdom’ could be referred to as ‘UNITED KINGDOM’, ‘United Kingdom’, ‘Great Britain’, or just ‘UK’.

Image2.png

As a consequence, what the users really think of as the same instance will appear on several rows in a list box, or be displayed in several bars in a bar chart. This will cause problems in the data analysis, since selections and numbers displaying totals often will be incomplete.

But there are ways to solve this. The best way, is of course to correct it in the source data. But this is not always possible, so it may be that the correction must be made elsewhere.

In QlikView and Qlik Sense there are several ways to do this. The most obvious (but not the best), is to use a hard-coded, conditional expression in the script:

Image4.png

Similar constructions can be made using Replace() or Pick(). These all work and will do the job.

But they are not manageable.

Should you want to add more cases or change some previous ones, you will soon realize that this isn’t a good method. The expressions will become too long and they will be error-prone. So I strongly recommend not doing this.

There is however a solution which is both manageable and simple: Mapping Load. The first step is to create a mapping table with all changes you want to make:

Image5.png

Then you load this table using the Mapping prefix:

    MapTable:
    Mapping Load ChangeFrom, ChangeTo
      From MapTable.xlsx (...) ;

Now you can use this table in the script to correct all field values. The simplest way to use the Map statement: Declare the mapping early in the script before any of the relevant fields are loaded, and the corrections will be made automatically:

     Map Country, Department, Person Using MapTable ;

Alternatively, you can use either ApplyMap() or MapSubstring() when you load the field, which both will make a lookup in the mapping table and if necessary make the appropriate replacement, e.g.:

     ApplyMap( 'MapTable', Country ) as Country ,

The mapping table will be discarded at the end of the script run and not use any memory in the final application.

Using a mapping table is by far the best way to manage this type of data cleansing in QlikView and Qlik Sense:

  • It is easy to add new corrections and to change the old ones
  • The mapping table can be stored separately from the script; in an Excel sheet or even in a database

Good Luck!

HIC

 

Further reading related to this topic:

Don't join - use Applymap instead

Mapping as an Alternative to Joining

12 Comments
emptyfish
Contributor III
Contributor III

I also like to add a default value to the ApplyMap function therefore reducing the risk of null values.

  • ApplyMap('MapTable', Country, '#not known#') as Country,

This is then easier to drill down into the records which don't have a Country, the client will then hopefully be able to cleanse to the data.

It looks ugly but it works

3,548 Views
bill_markham
Champion III
Champion III

I also have a dedicated Data Quality qvw that shows data that does meet agreed data quality rules, e.g Employees that are not assigned to a Business Unit.  This qvw has standard QlikView drill downs, sums, counts etc for slice 'n dice analysis to give visibility of data quality, identify Business Units in need attention then go down to row level to give people the information they need to be able to locate the dodgy data in the source operational system.

3,548 Views
Henric_Cronström

Be careful with the third parameter of Applymap()...

You are right that there are many cases where you should use '#not known#' or something similar as third parameter. But I would not use it for data cleansing.


If you omit the third parameter, the Applymap() will return the input value; it will not return NULL. In other words: if the Country isn't found in the mapping table, the Applymap() will return the un-altered Country - which is what you want.


This means that your solution must have a mapping table which has all possible countries, whereas my solution can have a mapping table that contains only the changes.


HIC

3,548 Views
Not applicable

Is this possible to use a fuzzy search (as Levensthein algorithm) to measure the risk of a hidden duplicate for each record where there is no match in the map table ?

0 Likes
3,548 Views
Henric_Cronström

Not easily, no. But I am fairly sure it would be possible to create an extension object that would do exactly this.

HIC

0 Likes
3,548 Views
IAMDV
Luminary Alumni
Luminary Alumni

Thank you! Nice & simple.

0 Likes
3,548 Views
rwunderlich

Good post HIC. Especially the reminder about only needing to include changes in the mapping table.

One more reminder; MAP USING will not modify values in an optimized QVD load.

I frequently find I want to map using wildcards in the "from" values. It's not part of the native MAPPING, but can be simulated with a pick(wildmatch()) as shown in this example.

Qlikview Cookbook: Mapping With Wildcards http://qlikviewcookbook.com/recipes/download-info/mapping-with-wildcards/

Or if you use QlikView Components:

CALL Qvc.CreateWildMapExpression (vMapExpr, WildMapTable);

2,621 Views
rbecher
MVP
MVP

@Xavier Retaillaud: there is a Data Quality group where I have posted an Levenshtein example: Levenshtein Distance VBScript

Probably a good place to diskuss data quality topics..

2,621 Views
Gabriel
Partner
Partner

Thanks HIC.

One benefit of using Mapping table as well is if the source file is excel or text file users can manage the FieldFrom and FieldTo update.

0 Likes
2,621 Views
datanibbler
Champion
Champion

Yes, that is definitely a benefit - I like to keep as much key-information as possible out of my apps and in some publicly accessible space (or at least accessible to whoever must have that info). That makes everything more robust.

It works with Rob's approach, too since you can load that Wildmap-table from an Excel file that is maintained by whoever is in charge of those associations.

0 Likes
2,621 Views