Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Hello,
I have a large facts table, with around 100 fields and 50+ million records (along with many dimensions tables linked to it). The problem is that many of the fields within the facts table contain a large percentage of null values. In an effort to optimize the data model, I'm wondering if taking those fields containing mostly nulls and splitting them off into a separate table would help. The example below will help illustrate:
Field Name: Information Density:
patientID, 100%
primarydiag, 100%
otherdiag1, 22%
otherdiag2, 15%
otherdiag3, 8%
...
otherdiag30, 1%
[other fields] 90%-100%
In this example, the fields 'patientID' and 'primarydiag' contain no null values, whereas the fields 'otherdiag1' through 'otherdiag30' contain a large portion. (This is hospital data - Every patient admitted has a primary diagnosis, and some patients have 1 and up to 30 other diagnoses).
My idea is to split off the fields otherdiag1-otherdiag30 into their own table (linked to the main facts table through the patientID field)-- I'll call it the "OtherDiag" table -- and then reduce that table by eliminating the null values within the 'otherdiag1' field (because that field contains the fewest nulls of the otherdiag1-30 fields.
My questions are:
1) would this be a good way of optimizing my data model (and reducing the overall size of my app file)? And if it is, then,
2) what script would I use to create the second table ("OtherDiag")? Would the script below be correct (I'm guessing it would involve the use of some either "where exists" or "Where not is null")?
???
OtherDiag:
LOAD
patientID,
otherdiag1,
otherdiag2,
...
otherdiag30
RESIDENT FactsTable Where Not IsNull(otherdiag1); ????
...or something like this?
Thanks in advance for any suggestions.
-John
John,
That may be useful, though given that you're not going to reduce the overall number of values & pointers in each field I doubt your data model will reduce in size.
Had you considered a slightly different structure:
patientId, otherdiagNumber, otherdiag
for example
LOAD
patientID,
1 as otherdiagNumber,
otherdiag1 as otherdiag
RESIDENT FactsTable
WHERE not isnull(otherdiag1)
CONCATENATE
LOAD
patientID,
2 as otherdiagNumber,
otherdiag2 as otherdiag
RESIDENT FactsTable
WHERE not isnull(otherdiag2)
etc..etc...
with a loop this would be:
for n = 1 to 30
otherdiag:
LOAD
patientID,
$(n) as otherdiagNumber,
otherdiag$(n) as otherdiag
RESIDENT FactsTable
WHERE not isnull(otherdiag$(n))
next n
In fact, if this were my data I might go further than this and include the primary diagnosis also...
John,
That may be useful, though given that you're not going to reduce the overall number of values & pointers in each field I doubt your data model will reduce in size.
Had you considered a slightly different structure:
patientId, otherdiagNumber, otherdiag
for example
LOAD
patientID,
1 as otherdiagNumber,
otherdiag1 as otherdiag
RESIDENT FactsTable
WHERE not isnull(otherdiag1)
CONCATENATE
LOAD
patientID,
2 as otherdiagNumber,
otherdiag2 as otherdiag
RESIDENT FactsTable
WHERE not isnull(otherdiag2)
etc..etc...
with a loop this would be:
for n = 1 to 30
otherdiag:
LOAD
patientID,
$(n) as otherdiagNumber,
otherdiag$(n) as otherdiag
RESIDENT FactsTable
WHERE not isnull(otherdiag$(n))
next n
In fact, if this were my data I might go further than this and include the primary diagnosis also...
Hi,
I asked a similar question time ago, and the conclusion was that null don't use memory, only the diferent values from a field.
Check it here:How much memory uses a null value in a table?
Like Marcus and Jose I don't believe that you would optimize much your application by splitting the fact-table into several tables from a RAM and UI performance point of view.
But the UI handling from the otherdiag-fields would be easier if you merged them into a single-field by loading this part with a crosstable prefix and removing the NULL within a following where-clause. After them you could use a set analysis within your expressions to calculate only those otherdiag-values which you want.
- Marcus
Marcus, Jose, and Marcus,
I appreciate the three of you taking the time to reply. I was unaware that null values consume very little memory, which prompted me to do some additional reading on Qlikviews data types (I found this useful article: Memory sizes for data types | Qlikview Cookbook).
Nonetheless, Marcus' suggestion to concatenate the separate fields into a single one using a looping function was a clever and elegant solution (it actually solves some other issues I had with them separated) -- much appreciated, Marcus.
Thanks again all,
-John
John,
you're welcome! Nice to have some appreciation on here!
Marcus