Skip to main content
Announcements
NEW: Seamless Public Data Sharing with Qlik's New Anonymous Access Capability: TELL ME MORE!
cancel
Showing results for 
Search instead for 
Did you mean: 
GeorgeQV
Contributor III
Contributor III

Replace NULL and missing values as part of ETL for data quality?

Sorry for any inconvience with posts, there were a lot of errors when trying to publish this one and you answered faster than I could repair that

Dear all,

Suppose within the ETL process, I extract a table as following:

The NULL resulted from a JOIN, the missing value in Field2 of row "b" is just empty.

My question is: As part of ETL, are the following steps any good or do they even produce more harm later?

Step 1: Replace all NULLs with <NULL>; to make them selectable in Listboxes:

// Treat all NULL fields as a value

NullAsValue *;

// Set the value to replace NULL. In optimized QVD-loads add the condition where 1=1 or it will not work.

Set NullValue = '<NULL>';

New Table:

Step 2: Replace also the missing value with <NULL>:

Set NullInterpret = ''; // This will convert all blank fields to NULL in Excel files

New Table:

These steps in my opinion enable advanced analyses, but I am not sure whether to implement it everywhere. Going further: Do you think it makes sense to fill all NULL or missing values in all tables already when they are extracted from the source systems?

Nachricht geändert durch Georg Gottsmann

1 Solution

Accepted Solutions
marcus_sommer

I think you are right that there could be no clear recommendation to replace NULL in a single field and/or all fields - it will be always depend on the particular requirements of an application if it's useful or not.

In my experience it's often useful to replace NULL within fields which are used as dimensions but not in fields which are measured within the expressions (whereby you could of course react on any replacing-value within the expressions).

- Marcus

View solution in original post

7 Replies
MK_QSL
MVP
MVP

What is your question? I mean what exactly your requirement is !

swuehl
MVP
MVP

Please try to avoid posting multiple times the same request.

This makes it hard to follow a discussion.


If you want to replace the NULL after a JOIN, you would need to perform a subsequent RESIDENT LOAD in which you check for NULL and replace it with something meaningful:


LOAD

     if(Len(Trim(FIELD)), FIELD, 'Formerly known as NULL or empty') as FIELD

RESIDENT YourTable;


Or don't JOIN, use ApplyMap() instead:

Don't join - use Applymap instead

alexdataiq
Partner - Creator III
Partner - Creator III

I think he is asking if it's a good practice to replace NULL values with a "selectable" <NULL> String, but he is unsure if that will have any impact later in his application.

My opinion is that it all comes down to your own particular requirements. Do selectable nulls make sense o not in the context of your application?

Cheers

marcus_sommer

There are different types of NULL which also needs different methods to be handled. A very good explanation could you find here: NULL handling in QlikView.

- Marcus

alexdataiq
Partner - Creator III
Partner - Creator III

That's a very good explanation. Thanks for the heads-up Marcus.

Cheers

GeorgeQV
Contributor III
Contributor III
Author

Thank you, that's right. My question was:

"Does it generally make sense to replace all NULLs with a placeholder, say "unknown", "w/o" or "<NULL>" etc.?

I read all I could find in the community, be it "NULL and Nothing" from HIC orothers. I think that it might lead to following issues:

1) Aggregations like avg() will lead to wrong results because e.g ."0" instad of "NULL" is counted as a valid denominator.

2) When doing JOINS, rows might get duplicated, again distorting mathematical operations

3) Sometimes there are reasons for NULL which is not the same as a "missing value". In that case, interpretations may be effected as well.

So all in all, it really depends on the particular case and you have to ask questions like "What does this NULL mean?", "Why is it there"? and "Does it make sense to replace it e.g. with <NULL> as a workaround or is this the result of bad data quality in the source systems".

Please correct me or add something, but I think that is everything general one can say about it.

marcus_sommer

I think you are right that there could be no clear recommendation to replace NULL in a single field and/or all fields - it will be always depend on the particular requirements of an application if it's useful or not.

In my experience it's often useful to replace NULL within fields which are used as dimensions but not in fields which are measured within the expressions (whereby you could of course react on any replacing-value within the expressions).

- Marcus