jchambers123
Contributor II

Optimizing by splitting tables with null value fields?

Hello,

I have a large facts table with around 100 fields and 50+ million records (along with many dimension tables linked to it). The problem is that many of the fields within the facts table contain a large percentage of null values. In an effort to optimize the data model, I'm wondering whether splitting the fields that are mostly null off into a separate table would help. The example below illustrates:

Field Name        Information Density
patientID         100%
primarydiag       100%
otherdiag1        22%
otherdiag2        15%
otherdiag3        8%
...
otherdiag30       1%
[other fields]    90%-100%

In this example, the fields 'patientID' and 'primarydiag' contain no null values, whereas the fields 'otherdiag1' through 'otherdiag30' are largely null. (This is hospital data: every patient admitted has a primary diagnosis, and some patients have anywhere from 1 to 30 other diagnoses.)
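For context, "information density" here is just the share of non-null values per field, which is easy to compute outside QlikView; a minimal sketch in Python (the sample rows are hypothetical, with None standing in for a null):

```python
# Compute "information density" (fraction of non-null values) per field.
# Sample data is hypothetical; None stands in for a null value.
rows = [
    {"patientID": 1, "primarydiag": "A10", "otherdiag1": "B20", "otherdiag2": None},
    {"patientID": 2, "primarydiag": "C30", "otherdiag1": None,  "otherdiag2": None},
    {"patientID": 3, "primarydiag": "D40", "otherdiag1": "E50", "otherdiag2": "F60"},
    {"patientID": 4, "primarydiag": "G70", "otherdiag1": None,  "otherdiag2": None},
]

def information_density(rows):
    """Return {field: fraction of rows in which the field is not null}."""
    total = len(rows)
    return {
        field: sum(r[field] is not None for r in rows) / total
        for field in rows[0]
    }

density = information_density(rows)
for field, share in density.items():
    print(f"{field}: {share:.0%}")
```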

My idea is to split off the fields otherdiag1-otherdiag30 into their own table, linked to the main facts table through the patientID field -- I'll call it the "OtherDiag" table -- and then reduce that table by eliminating rows where 'otherdiag1' is null (because that field contains the fewest nulls of the otherdiag1-30 fields).

My questions are:

1) would this be a good way of optimizing my data model (and reducing the overall size of my app file)?  And if it is, then,

2) If so, what script would I use to create the second table ("OtherDiag")? Would the script below be correct? (I'm guessing it would involve either "Where Exists" or "Where Not IsNull".)

OtherDiag:
LOAD
    patientID,
    otherdiag1,
    otherdiag2,
    ...
    otherdiag30
RESIDENT FactsTable
WHERE Not IsNull(otherdiag1);

...or something like this?
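In relational terms, the split above keeps only the key plus the sparse columns, and only for rows that have at least one secondary diagnosis; a minimal sketch of the same filter in Python (field names from the post, sample data hypothetical):

```python
# Split the otherdiag columns into a separate table, keeping only rows
# where otherdiag1 is populated (the least-null of the otherdiag fields).
# Sample data is hypothetical; None stands in for a null value.
facts = [
    {"patientID": 1, "primarydiag": "A10", "otherdiag1": "B20", "otherdiag2": "C30"},
    {"patientID": 2, "primarydiag": "D40", "otherdiag1": None,  "otherdiag2": None},
    {"patientID": 3, "primarydiag": "E50", "otherdiag1": "F60", "otherdiag2": None},
]

# Would run to otherdiag30 in the real model.
other_fields = ["otherdiag1", "otherdiag2"]

# Equivalent of:
#   OtherDiag: LOAD patientID, otherdiag1..30
#   RESIDENT FactsTable WHERE Not IsNull(otherdiag1);
other_diag = [
    {"patientID": r["patientID"], **{f: r[f] for f in other_fields}}
    for r in facts
    if r["otherdiag1"] is not None
]
print(other_diag)
```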

Thanks in advance for any suggestions.

-John

1 Solution

Accepted Solutions
marcus_malinow
Partner - Specialist III

John,

That may be useful, though given that you're not going to reduce the overall number of values and pointers in each field, I doubt your data model will reduce in size.

Had you considered a slightly different structure:

patientId, otherdiagNumber, otherdiag

for example

OtherDiag:
LOAD
    patientID,
    1 as otherdiagNumber,
    otherdiag1 as otherdiag
RESIDENT FactsTable
WHERE Not IsNull(otherdiag1);

CONCATENATE
LOAD
    patientID,
    2 as otherdiagNumber,
    otherdiag2 as otherdiag
RESIDENT FactsTable
WHERE Not IsNull(otherdiag2);

...and so on for otherdiag3 through otherdiag30.

with a loop this would be:

For n = 1 to 30

    OtherDiag:
    LOAD
        patientID,
        $(n) as otherdiagNumber,
        otherdiag$(n) as otherdiag
    RESIDENT FactsTable
    WHERE Not IsNull(otherdiag$(n));

Next n

In fact, if this were my data I might go further than this and include the primary diagnosis also...
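The loop above is a wide-to-long unpivot: each otherdiagN column becomes rows of (patientID, otherdiagNumber, otherdiag), with nulls dropped. The same transformation sketched in Python (sample data hypothetical, None for null):

```python
# Unpivot otherdiag1..otherdiagN into (patientID, otherdiagNumber, otherdiag)
# rows, dropping nulls -- the same shape the QlikView loop produces.
# Sample data is hypothetical; None stands in for a null value.
facts = [
    {"patientID": 1, "otherdiag1": "B20", "otherdiag2": "C30"},
    {"patientID": 2, "otherdiag1": None,  "otherdiag2": None},
    {"patientID": 3, "otherdiag1": "F60", "otherdiag2": None},
]

N = 2  # 30 in the real model

other_diag = [
    {
        "patientID": r["patientID"],
        "otherdiagNumber": n,
        "otherdiag": r[f"otherdiag{n}"],
    }
    for n in range(1, N + 1)   # outer loop mirrors "For n = 1 to 30"
    for r in facts
    if r[f"otherdiag{n}"] is not None  # mirrors "WHERE Not IsNull(...)"
]
print(other_diag)
```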


5 Replies

jmvilaplanap
Specialist

Hi,

I asked a similar question a while ago, and the conclusion was that nulls don't use memory; only the distinct values of a field do.

Check it here: How much memory uses a null value in a table?

marcus_sommer

Like Marcus and Jose, I don't believe you would optimize your application much, from a RAM and UI performance point of view, by splitting the fact table into several tables.

But UI handling of the otherdiag fields would be easier if you merged them into a single field by loading that part with a Crosstable prefix and removing the NULLs in a following where-clause. After that you could use set analysis in your expressions to calculate only the otherdiag values you want.

- Marcus

jchambers123
Contributor II
Author

Marcus, Jose, and Marcus,

I appreciate the three of you taking the time to reply. I was unaware that null values consume very little memory, which prompted me to do some additional reading on QlikView's data types (I found this useful article: Memory sizes for data types | Qlikview Cookbook).

Nonetheless, Marcus' suggestion to concatenate the separate fields into a single one using a loop was a clever and elegant solution (it actually solves some other issues I had with them separated) -- much appreciated, Marcus.

Thanks again all,

-John

marcus_malinow
Partner - Specialist III

John,

you're welcome! Nice to have some appreciation on here!

Marcus