Ever found yourself stuck with a messy pile of data that seems more like a labyrinth than a pathway to clean insights? You're not alone. Today, we're diving into the world of data cleaning in Qlik Sense to help you uncover the analytical potential hiding behind your data.

The Importance of Data Cleaning:

Imagine you're baking a cake. Would you eyeball the measurements of your ingredients? Probably not, unless you fancy a disaster cake. Just like one poorly measured cup of flour can ruin your entire recipe, a small data error can throw off your whole analysis. That's why, before you dive into the fun part—data analysis—you've got to make sure your key ingredient (data) is as clean and precise as possible.

Why Data Cleaning is More than Just a Chore:

It's not just about tidying up; it's about quality control. Skipped steps or overlooked errors can lead to inaccurate results that could misinform your business decisions.

Data Accuracy:
The accuracy of your analytics depends heavily on your data's quality. Data cleaning helps weed out errors and inconsistencies, ensuring your insights are both trustworthy and actionable. Tools like mapping tables or functions like SubField can be invaluable at this stage.
Data Consistency:
Inconsistent data formats or naming conventions can be a real roadblock. Qlik Sense offers features like the SubField function and mapping tables to help you standardize data for consistent reporting and visualization.
Data Integration:
When you're integrating data from various sources, alignment is crucial. Qlik Sense provides numerous functions that help in aligning these disparate datasets into a cohesive, unified form.
Enhanced Visualization and Performance:
Clean data doesn't just make your visualizations more meaningful; it also enhances the performance of your Qlik applications. Expect faster data retrieval and more efficient analysis when your data is in good shape.

Data Cleaning Techniques in Qlik Sense:

Duplicate removal:
Duplicate records can distort your analysis and reporting. Qlik offers built-in options like the Keep prefix when loading tables or the DISTINCT keyword in your script to load only unique rows.
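For instance, here is a minimal sketch of a distinct load (the Orders.xlsx source and its fields are placeholders, not from the original app):

// DISTINCT keeps only unique combinations of the loaded fields,
// so exact duplicate rows are dropped during the load
Orders:
LOAD DISTINCT
    OrderID,
    CustomerID,
    Amount
FROM [lib://DataFiles/Orders.xlsx]
(ooxml, embedded labels, table is Sheet1);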
Missing values:
You can address missing values by removing records or filling in gaps based on specific criteria. Functions and statements like IsNull(), NullCount(), and NullAsValue come in handy.
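As a quick illustration (the table and field names here are invented for the sketch), you could fill gaps with a default value during the load:

Customers:
LOAD
    CustomerID,
    // Empty cells from inline or file sources often arrive as empty strings
    // rather than true nulls, so check for both before substituting a default
    If(IsNull(Region) or Trim(Region) = '', 'Unknown', Region) as Region
INLINE [
CustomerID, Region
1, North
2,
3, South
];

Alternatively, a Where Not IsNull(Region) clause would drop the incomplete records instead of filling them.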
Data formatting:
Using the numerous string and formatting functions available in Qlik Sense, you can standardize data values to a consistent format. For example, the Upper, Lower, Date, and Num functions can be used to unify text, dates, or numbers.
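For example, here is a small sketch (the field names and format codes are assumptions) that unifies casing, dates, and numbers in one pass:

Products:
LOAD
    Upper(Trim(ProductCode)) as ProductCode, // unify casing and strip stray spaces
    Date(Date#(DateSold, 'MM/DD/YYYY'), 'YYYY-MM-DD') as DateSold, // parse the source format, then reformat
    Num(Num#(Price, '#,##0.00', '.', ','), '0.00') as Price // interpret, then normalize number formatting
INLINE [
ProductCode, DateSold, Price
"  ab-100 ", 01/15/2023, "1,250.00"
"AB-101", 02/20/2023, "980.50"
];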
Data manipulation:
Sometimes the data you import into Qlik Sense doesn’t exactly fit your needs. Qlik offers ways to reshape your data accordingly.

For instance, inconsistent field values often occur when pulling data from multiple tables, and this inconsistency can disrupt the connections between datasets. An efficient solution is to use mapping tables.

Mapping Tables:

These tables behave differently from other tables in that they are stored in a separate area of memory and are used strictly as mapping tables while the script runs. After script execution, they are automatically dropped.

Let’s take a look at how to do this and the different statements and functions that can be used:

  • MAPPING prefix
    This is used to create a mapping table. For instance:

CountryMap:
MAPPING LOAD * INLINE [
Country, NewCountry
U.S.A., US
U.S., US
United States, US
United States of America, US
];

Keep in mind that a mapping table must have exactly two columns: the first contains the comparison values and the second contains the desired mapping values.

  • ApplyMap()

    The ApplyMap function is used to replace data in a field based on a previously created Mapping Table.

CountryMap:
MAPPING LOAD * INLINE [
    Country, NewCountry
    U.S.A., US
    U.S., US
    United States, US
    United States of America, US
];
	
Data:
LOAD
    ID,
    Name,
    ApplyMap('CountryMap', Country) as Country,
    Code
FROM [lib://DataFiles/Data.xlsx]
(ooxml, embedded labels, table is Sheet1);

The first parameter in ApplyMap is the Mapping Table name in quotes. The second parameter is the field containing the data that needs to be mapped.
You can add a third parameter to the ApplyMap function that serves as a default to handle cases when the value doesn’t match one in the Mapping Table.
For instance:
ApplyMap('CountryMap', Country, 'Rest of the world') As Country

[Screenshot: the loaded table before mapping]

after mapping:

[Screenshot: the loaded table after mapping]

  • MapSubstring()
    The MapSubstring function is used to map parts of a field and can serve as an alternative to the Replace() or PurgeChar() functions.

    For instance, let's strip the unwanted characters from these phone number values:

ReplaceMap:
MAPPING LOAD * INLINE [
  char, replace
  ")", ""
  "(", ""
  "“", ""
  "”", ""
  "/", ""
  "-", ""
] (delimiter is ',');

TestData:
LOAD
  DataField as data,
  MapSubString('ReplaceMap', DataField) as ReplacedString
INLINE [
  DataField
  "(415)555-1234"
  "(415)543,4321"
  "“510”123-4567"
  "/925/999/4567"
] (delimiter is ',');

after cleaning:

[Screenshot: phone numbers after cleaning]

  • MAP … USING
    The MAP … USING statement works differently from the ApplyMap() function: ApplyMap performs the mapping every time the field name is encountered, whereas MAP … USING performs the mapping only when a value is stored under the mapped field name in the internal table.

    For instance, in the following load script, the mapping will be applied to the Country field in Data1, but it will not be applied to the Country2 field in the Data2 table. That's because the MAP … USING statement only applies to fields named Country, and in Data2 the field is stored as Country2 in the internal table.

Map Country Using CountryMap;
Data1:
LOAD
    ID,
    Name,
    Country
FROM [lib://DataFiles/Data.xlsx]
(ooxml, embedded labels, table is Sheet1);
					
Data2:
LOAD
    ID,
    Country as Country2
FROM [lib://DataFiles/Data.xlsx]
(ooxml, embedded labels, table is Sheet1);
UNMAP;

Useful Functions for Data Cleaning:

  • SubField()
    Used to extract substrings from a string field that consists of two or more parts separated by a delimiter.
    The arguments it takes are a text (the original string), a delimiter (the character within the input text that divides the string into parts), and a field_no that is either 1 to return the first substring (left) or 2 to return the second substring (right).
    SubField(text, delimiter, field_no)

    For instance: 

UserData:
LOAD * INLINE [
  UserID, FullName
  1, "John,Doe"
  2, "Jane,Doe"
  3, "Alice,Wonderland"
  4, "Bob,Builder"
];
CleanedData:
LOAD
  UserID,
  SubField(FullName, ',', 1) as FirstName,
  SubField(FullName, ',', 2) as LastName
RESIDENT UserData;
Drop Table UserData;

  • Len()
    Returns the length of the input string.
  • Left()
    Returns a string consisting of the first (leftmost) characters of the input string, where the number of characters is determined by the second parameter.
    Left(text, count)
  • Right()
    Similar to Left(), it returns a string consisting of the last (rightmost) characters of the input string. The second parameter determines the number of characters to return.
    Right(text, count)
  • Index()
    The Index function searches a string and returns the starting position of the nth occurrence of a provided substring.
    For instance:
    Index('qwerty', 'ty') will return 5
    Index('qwertywy', 'w', 2) will return the position of the second occurrence of 'w', i.e. 7


Example 1:
Let's use a combination of the functions above to clean up a more complex field and extract the first name and last name.

UserData:
LOAD * INLINE [
  UserID, Object
  1, "37642UI101John.Doe"
  2, "98322UI101Jane.Doe"
  3, "45432UI101Alice.Wonderland"
  4, "32642UI101Bob.Builder"
];

CleanedData:
LOAD
  UserID,
  // 'UI101' is 5 characters long, so Len(Object) - Index(Object, 'UI101') - 4
  // is the number of characters after the marker; Right() grabs "First.Last"
  SubField(Right(Object, Len(Object) - Index(Object, 'UI101') - 4), '.', 1) as FirstName,
  SubField(Right(Object, Len(Object) - Index(Object, 'UI101') - 4), '.', 2) as LastName
RESIDENT UserData;
Drop Table UserData;

after cleaning:

[Screenshot: extracted first and last names after cleaning]

Example 2:
Cleaning HTML in a field

Paragraphs:
LOAD * INLINE [
  Paragraph_ID, Paragraph
  1, "<p>This is a <strong>paragraph</strong>.</p><br><p>This is another <em>paragraph</em>.</p>"
];

// Loop through each paragraph in the Paragraphs table
For vRow = 1 to NoOfRows('Paragraphs')
  Let vID = Peek('Paragraph_ID', vRow-1, 'Paragraphs'); // Get the ID of the current record
  Let vtext = Peek('Paragraph', vRow-1, 'Paragraphs');  // Get the original paragraph of the current record

  // Strip the HTML tags from the paragraph in place
  Do While len(TextBetween(vtext, '<', '>')) > 0
    vtext = Replace(vtext, '<br>', chr(10)); // Replace <br> line breaks with line feeds - improves legibility
    vtext = Replace(vtext, '<' & TextBetween(vtext, '<', '>') & '>', ''); // Find groups within <> and replace them with ''
  Loop;

  // Store the cleaned paragraphs into a temporary table
  Temp:
  Load
   $(vID) as Paragraph_ID,
    '$(vtext)' as cleanParagraph
  AutoGenerate 1;
Next vRow;
// Join the cleaned paragraphs back into the original Paragraphs table
Left Join (Paragraphs)
Load *
Resident Temp;
// Drop the temporary table
Drop Table Temp;

after cleaning:

[Screenshot: paragraphs after HTML cleaning]

I hope you found this post helpful!
Attached you can find a QVF that contains the scripts used in the post.

Happy data cleaning!
