jchambers123
Contributor II

Optimizing by splitting tables with null value fields?

Hello,

I have a large facts table with around 100 fields and 50+ million records (along with many dimension tables linked to it). The problem is that many of the fields within the facts table contain a large percentage of null values. In an effort to optimize the data model, I'm wondering whether splitting the fields that are mostly null off into a separate table would help. The example below illustrates:

Field Name        Information Density
patientID         100%
primarydiag       100%
otherdiag1        22%
otherdiag2        15%
otherdiag3        8%
...
otherdiag30       1%
[other fields]    90%-100%

In this example, the fields 'patientID' and 'primarydiag' contain no null values, whereas the fields 'otherdiag1' through 'otherdiag30' are largely null. (This is hospital data: every patient admitted has a primary diagnosis, and some patients have anywhere from 1 to 30 other diagnoses.)
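For context, "information density" here is just the share of non-null values per field, which is easy to compute outside QlikView; a minimal sketch in Python (the sample rows are hypothetical, with None standing in for a null):

```python
# Compute "information density" (fraction of non-null values) per field.
# Sample data is hypothetical; None stands in for a null value.
rows = [
    {"patientID": 1, "primarydiag": "A10", "otherdiag1": "B20", "otherdiag2": None},
    {"patientID": 2, "primarydiag": "C30", "otherdiag1": None,  "otherdiag2": None},
    {"patientID": 3, "primarydiag": "D40", "otherdiag1": "E50", "otherdiag2": "F60"},
    {"patientID": 4, "primarydiag": "G70", "otherdiag1": None,  "otherdiag2": None},
]

def information_density(rows):
    """Return {field: fraction of rows in which the field is not null}."""
    total = len(rows)
    return {
        field: sum(r[field] is not None for r in rows) / total
        for field in rows[0]
    }

density = information_density(rows)
for field, share in density.items():
    print(f"{field}: {share:.0%}")
```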

My idea is to split off the fields otherdiag1-otherdiag30 into their own table, linked to the main facts table through the patientID field -- I'll call it the "OtherDiag" table -- and then reduce that table by eliminating rows where 'otherdiag1' is null (because that field contains the fewest nulls of the otherdiag1-30 fields).

My questions are:

1) would this be a good way of optimizing my data model (and reducing the overall size of my app file)?  And if it is, then,

2) If so, what script would I use to create the second table ("OtherDiag")? Would the script below be correct? (I'm guessing it would involve either "Where Exists" or "Where Not IsNull".)

OtherDiag:
LOAD
    patientID,
    otherdiag1,
    otherdiag2,
    ...
    otherdiag30
RESIDENT FactsTable
WHERE Not IsNull(otherdiag1);

...or something like this?
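In relational terms, the split above keeps only the key plus the sparse columns, and only for rows that have at least one secondary diagnosis; a minimal sketch of the same filter in Python (field names from the post, sample data hypothetical):

```python
# Split the otherdiag columns into a separate table, keeping only rows
# where otherdiag1 is populated (the least-null of the otherdiag fields).
# Sample data is hypothetical; None stands in for a null value.
facts = [
    {"patientID": 1, "primarydiag": "A10", "otherdiag1": "B20", "otherdiag2": "C30"},
    {"patientID": 2, "primarydiag": "D40", "otherdiag1": None,  "otherdiag2": None},
    {"patientID": 3, "primarydiag": "E50", "otherdiag1": "F60", "otherdiag2": None},
]

# Would run to otherdiag30 in the real model.
other_fields = ["otherdiag1", "otherdiag2"]

# Equivalent of:
#   OtherDiag: LOAD patientID, otherdiag1..30
#   RESIDENT FactsTable WHERE Not IsNull(otherdiag1);
other_diag = [
    {"patientID": r["patientID"], **{f: r[f] for f in other_fields}}
    for r in facts
    if r["otherdiag1"] is not None
]
print(other_diag)
```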

Thanks in advance for any suggestions.

-John

1 Solution

Accepted Solutions
marcus_malinow
Partner - Specialist III

John,

That may be useful, though given that you're not going to reduce the overall number of values and pointers in each field, I doubt your data model will reduce in size.

Had you considered a slightly different structure:

patientId, otherdiagNumber, otherdiag

for example

OtherDiag:
LOAD
    patientID,
    1 as otherdiagNumber,
    otherdiag1 as otherdiag
RESIDENT FactsTable
WHERE Not IsNull(otherdiag1);

CONCATENATE
LOAD
    patientID,
    2 as otherdiagNumber,
    otherdiag2 as otherdiag
RESIDENT FactsTable
WHERE Not IsNull(otherdiag2);

...and so on for otherdiag3 through otherdiag30.

with a loop this would be:

For n = 1 to 30

    OtherDiag:
    LOAD
        patientID,
        $(n) as otherdiagNumber,
        otherdiag$(n) as otherdiag
    RESIDENT FactsTable
    WHERE Not IsNull(otherdiag$(n));

Next n

In fact, if this were my data I might go further than this and include the primary diagnosis also...
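The loop above is a wide-to-long unpivot: each otherdiagN column becomes rows of (patientID, otherdiagNumber, otherdiag), with nulls dropped. The same transformation sketched in Python (sample data hypothetical, None for null):

```python
# Unpivot otherdiag1..otherdiagN into (patientID, otherdiagNumber, otherdiag)
# rows, dropping nulls -- the same shape the QlikView loop produces.
# Sample data is hypothetical; None stands in for a null value.
facts = [
    {"patientID": 1, "otherdiag1": "B20", "otherdiag2": "C30"},
    {"patientID": 2, "otherdiag1": None,  "otherdiag2": None},
    {"patientID": 3, "otherdiag1": "F60", "otherdiag2": None},
]

N = 2  # 30 in the real model

other_diag = [
    {
        "patientID": r["patientID"],
        "otherdiagNumber": n,
        "otherdiag": r[f"otherdiag{n}"],
    }
    for n in range(1, N + 1)   # outer loop mirrors "For n = 1 to 30"
    for r in facts
    if r[f"otherdiag{n}"] is not None  # mirrors "WHERE Not IsNull(...)"
]
print(other_diag)
```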


5 Replies

jmvilaplanap
Specialist

Hi,

I asked a similar question a while ago, and the conclusion was that nulls don't use memory; only the distinct values of a field do.

Check it here: How much memory uses a null value in a table?

marcus_sommer

Like Marcus and Jose, I don't believe you would optimize your application much, from a RAM and UI performance point of view, by splitting the fact table into several tables.

But UI handling of the otherdiag fields would be easier if you merged them into a single field by loading that part with a Crosstable prefix and removing the NULLs in a following where-clause. After that you could use set analysis in your expressions to calculate only the otherdiag values you want.

- Marcus

jchambers123
Contributor II
Author

Marcus, Jose, and Marcus,

I appreciate the three of you taking the time to reply. I was unaware that null values consume very little memory, which prompted me to do some additional reading on QlikView's data types (I found this useful article: Memory sizes for data types | Qlikview Cookbook).

Nonetheless, Marcus' suggestion to concatenate the separate fields into a single one using a loop was a clever and elegant solution (it actually solves some other issues I had with them separated) -- much appreciated, Marcus.

Thanks again all,

-John

marcus_malinow
Partner - Specialist III

John,

you're welcome! Nice to have some appreciation on here!

Marcus