Qlik Community

QlikView Scripting

Discussion Board for collaboration on QlikView Scripting.

gottsmann
New Contributor II

Replace NULL and missing values as part of ETL for data quality?

Sorry for any inconvience with posts, there were a lot of errors when trying to publish this one and you answered faster than I could repair that

Dear all,

Suppose within the ETL process, I extract a table as following:

The NULL resulted from a JOIN, the missing value in Field2 of row "b" is just empty.

My question is: As part of ETL, are the following steps any good or do they even produce more harm later?

Step 1: Replace all NULLs with <NULL>; to make them selectable in Listboxes:

// Treat all NULL fields as a value

NullAsValue *;

// Set the value to replace NULL. In optimized QVD-loads add the condition where 1=1 or it will not work.

Set NullValue = '<NULL>';

New Table:

Step 2: Replace also the missing value with <NULL>:

Set NullInterpret = ''; // This will convert all blank fields to NULL in Excel files

New Table:

These steps in my opinion enable advanced analyses, but I am not sure whether to implement it everywhere. Going further: Do you think it makes sense to fill all NULL or missing values in all tables already when they are extracted from the source systems?

Nachricht geändert durch Georg Gottsmann

1 Solution

Accepted Solutions

Re: Replace NULL and missing values as part of ETL for data quality?

I think you are right that there could be no clear recommendation to replace NULL in a single field and/or all fields - it will be always depend on the particular requirements of an application if it's useful or not.

In my experience it's often useful to replace NULL within fields which are used as dimensions but not in fields which are measured within the expressions (whereby you could of course react on any replacing-value within the expressions).

- Marcus

7 Replies

Re: Replace NULL and missing values as part of ETL for data quality?

What is your question? I mean what exactly your requirement is !

MVP
MVP

Re: Replace NULL and missing values as part of ETL for data quality?

Please try to avoid posting multiple times the same request.

This makes it hard to follow a discussion.


If you want to replace the NULL after a JOIN, you would need to perform a subsequent RESIDENT LOAD in which you check for NULL and replace it with something meaningful:


LOAD

     if(Len(Trim(FIELD)), FIELD, 'Formerly known as NULL or empty') as FIELD

RESIDENT YourTable;


Or don't JOIN, use ApplyMap() instead:

Don't join - use Applymap instead

alexdataiq
Contributor III

Re: Replace NULL and missing values as part of ETL for data quality?

I think he is asking if it's a good practice to replace NULL values with a "selectable" <NULL> String, but he is unsure if that will have any impact later in his application.

My opinion is that it all comes down to your own particular requirements. Do selectable nulls make sense o not in the context of your application?

Cheers

Re: Replace NULL and missing values as part of ETL for data quality?

There are different types of NULL which also needs different methods to be handled. A very good explanation could you find here: NULL handling in QlikView.

- Marcus

alexdataiq
Contributor III

Re: Replace NULL and missing values as part of ETL for data quality?

That's a very good explanation. Thanks for the heads-up Marcus.

Cheers

gottsmann
New Contributor II

Re: Replace NULL and missing values as part of ETL for data quality?

Thank you, that's right. My question was:

"Does it generally make sense to replace all NULLs with a placeholder, say "unknown", "w/o" or "<NULL>" etc.?

I read all I could find in the community, be it "NULL and Nothing" from HIC orothers. I think that it might lead to following issues:

1) Aggregations like avg() will lead to wrong results because e.g ."0" instad of "NULL" is counted as a valid denominator.

2) When doing JOINS, rows might get duplicated, again distorting mathematical operations

3) Sometimes there are reasons for NULL which is not the same as a "missing value". In that case, interpretations may be effected as well.

So all in all, it really depends on the particular case and you have to ask questions like "What does this NULL mean?", "Why is it there"? and "Does it make sense to replace it e.g. with <NULL> as a workaround or is this the result of bad data quality in the source systems".

Please correct me or add something, but I think that is everything general one can say about it.

Re: Replace NULL and missing values as part of ETL for data quality?

I think you are right that there could be no clear recommendation to replace NULL in a single field and/or all fields - it will be always depend on the particular requirements of an application if it's useful or not.

In my experience it's often useful to replace NULL within fields which are used as dimensions but not in fields which are measured within the expressions (whereby you could of course react on any replacing-value within the expressions).

- Marcus

Community Browser