Hi all,
I have a huge dataset (100M rows) where, in the same table, I have the following fields:
The cost field is already the result of a SUM(Cost) group by [All other dimensions].
Every dimension is independent of the others, so no further simplification can be done.
The question is: which model is more efficient?
1. the one with a single table that contains all the fields
2. A model where I have a key table, computed from the concatenation of all the dimensions, plus 3 tables: one for the dimensions, one for the measure and one with the data. Note that here ALL the tables will have 100M rows, which is why I cannot see any value in using this model (see the sketch just below).
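For concreteness, here is a rough load-script sketch of the two options (all table, field and file names are just placeholders, not the real ones):

// Option 1: one wide table holding everything (placeholder names)
Facts:
LOAD
    Dim1, Dim2, Dim3, Dim4, Date, Cost
FROM [lib://Data/aggregated_costs.qvd] (qvd);

// Option 2 (an alternative, not meant to be loaded together with option 1):
// a concatenated key linking a dimension table and a measure table.
// Because every dimension combination is unique, %Key has ~100M distinct
// values and both tables still have ~100M rows.
DimTable:
LOAD
    Dim1 & '|' & Dim2 & '|' & Dim3 & '|' & Dim4 & '|' & Date as %Key,
    Dim1, Dim2, Dim3, Dim4, Date
FROM [lib://Data/aggregated_costs.qvd] (qvd);

MeasureTable:
LOAD
    Dim1 & '|' & Dim2 & '|' & Dim3 & '|' & Dim4 & '|' & Date as %Key,
    Cost
FROM [lib://Data/aggregated_costs.qvd] (qvd);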
Please assume there are no other fields and there never will be.
Thank you very much
Have a great day
If I understand correctly, all of your dimension values are unique. In that case, I think a single table will be most efficient in terms of performance and memory.
-Rob
http://www.easyqlik.com
http://masterssummit.com
http://qlikviewcookbook.com
Thank you, may I ask you why?
Under the hood, the Qlik Engine usually has to assemble temporary tables of dimension combinations on the fly. If everything is already in one table, this process is already completed and chart calculation is faster.
We usually split dimensions off into tables of their own because they typically have fewer distinct values. Doing so can save a tremendous amount of memory.
In your case you stated that you have as many distinct dimension values as you have fact rows, so there would be no memory savings from moving the dimensions to their own tables.
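For reference, the usual split pattern looks roughly like this - the table, field and file names below are just for illustration, and the pattern only pays off when the dimension really has few distinct values:

// Load the source once (placeholder names throughout)
Source:
LOAD Product, Date, Cost
FROM [lib://Data/facts.qvd] (qvd);

// Small dimension table: only the distinct values
Products:
LOAD DISTINCT
    AutoNumber(Product, 'Prod') as %ProductKey,
    Product
RESIDENT Source;

// Fact table keeps only a compact numeric key
Facts:
LOAD
    AutoNumber(Product, 'Prod') as %ProductKey,
    Date,
    Cost
RESIDENT Source;

DROP TABLE Source;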
-Rob
Yes, exactly: my dimension table would have the very same number of rows as the fact table.
Thank you very much
Please explain your data-set in more detail, because if the single fact-table has 100M records the derived dimension-tables couldn't also have 100M records each - the fact-table is already the aggregation result against all dimensions.
I assume that there is a misinterpretation of the data-set and that it is rather a "normal" one with a lot of fact-records but just a few hundred / thousand distinct values in each dimension. If this is the case I would suggest a star schema with a single fact-table and n dimension-tables.
It's not difficult to find out - just load this data-set as a single table and then take a look at the meta-data, which could also be done in a UI object by using the system-fields like $Table, $Fields and so on.
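As a rough sketch (the table and field names are only assumptions), the same check could also be done directly in the script:

// Load everything into one table and log the cardinality of each field
Fact:
LOAD * FROM [lib://Data/facts.qvd] (qvd);

LET vRows = NoOfRows('Fact');
TRACE Fact rows: $(vRows);

FOR i = 1 to NoOfFields('Fact')
    LET vField = FieldName($(i), 'Fact');
    LET vCard  = FieldValueCount('$(vField)');
    TRACE Field $(vField): $(vCard) distinct values;
NEXT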
Please see the example attached. Hopefully this clarifies the misinterpretation.
Thank you!
It's a "normal" data-set. The combination of all dimensions in the group by leads to 100M of records but it doesn't mean that each dimension-table would also have 100M of records. That's not linear because Qlik stores only distinct field-values in a system-table (for each field an own one) and a binary index to the data-tables.
Therefore I would still tend to apply a star schema, although the number of records for an aggregation against only 5 dimensions is quite large. It might be at the threshold where it's open which data-model and UI would be more performant and usable.
If your example is simplified even further compared to the real data - meaning there are more dimensions included and/or the date is not a date but a timestamp - it would point even more in the direction of a star schema (timestamps should usually be resolved into date and time fields).
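Just as an illustration (the field names are assumptions), resolving a timestamp within the load might look like this:

// Splitting a timestamp keeps the number of distinct values per field small
Facts:
LOAD
    Date(Floor(Timestamp)) as Date,
    Time(Round(Frac(Timestamp), 1/(24*60)), 'hh:mm') as Time,  // rounded to the minute
    Dim1, Dim2, Cost
FROM [lib://Data/raw_facts.qvd] (qvd);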
Beside the above: how many records are there before the aggregation? If the reduction rate isn't very significant the aggregation might not be necessary at all. Of course such measures reduce the data-set and increase UI performance, but they also increase the effort, complexity and run-times within the script while reducing the possibility to drill deeper, or in other ways than the group-by dimensions provide.
The original dataset includes more columns than the example provided. The time field is a timestamp, and a single month can contain up to 40 million rows. However, after applying a GROUP BY (and converting the timestamp to a simple date), two years of data result in nearly 100 million rows. The aggregation is necessary.
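Roughly, the aggregation step looks like this (table and field names are simplified placeholders):

// First pass: reduce the timestamp to a plain date
Raw:
LOAD
    Date(Floor(Timestamp)) as Date,
    Dim1, Dim2, Dim3, Dim4,
    Cost
FROM [lib://Data/raw_facts.qvd] (qvd);

// Second pass: aggregate; NoConcatenate prevents auto-concatenation
// because both tables share the same field names
Facts:
NoConcatenate
LOAD
    Date, Dim1, Dim2, Dim3, Dim4,
    Sum(Cost) as Cost
RESIDENT Raw
GROUP BY Date, Dim1, Dim2, Dim3, Dim4;

DROP TABLE Raw;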
My example is actually a further simplification of the real dataset. The additional fields have already been moved to separate tables in a star schema, where the central table is the one shown in the example (actually with a total of 11 "independent" fields, not only the 5 above).
That said, let’s focus only on this example: how can I structure it into a star schema?
I think that what you have is already a star schema: a single fact-table containing a measure-field and n dimension-fields which act as the keys to the appropriate dimension-tables, because there are surely further period-fields like year and month, or product hierarchies, in those tables. Your original statement about a single table hinted in the direction of a quite wide table in which all fields are included.
In certain scenarios such big tables have a UI performance benefit over other data-models, but I rather doubt that's true in your case. If your data-model really is already a star schema you don't need to change it or worry about it, because it is the officially recommended data-model with the best compromise in regard to effort, maintainability, performance and usability - especially with larger data-sets.
The fact that some more dimensions are included hints that they mostly contain only small numbers of distinct field-values, and with just a single measure-field I would assume that quite simple expressions like sum(Cost) and count(MyDimension) would cover most of the view requirements. If so, the UI performance should be quite smooth with 100M records even on small hardware like an office laptop, and surely no problem on an appropriately sized server environment.
The aggregation seems sensible at a rate of approximately 1:10, but its performance impact may not be that big because it's mainly the data-table index which will increase significantly, not the various system-tables with the distinct field-values. Personally I take at first all records without any aggregation, and only if it's really too big/slow for the available time-frames respectively response-times do I consider an aggregation. In this regard you may also take a look at the idea of using a mixed granularity:
Fact Table with Mixed Granularity - Qlik Community - 1468238
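For illustration only (all names below are placeholders), the mixed-granularity idea boils down to something like this:

// Detailed rows for the recent period, pre-aggregated rows for older
// history, concatenated into one fact table with a flag field
Facts:
LOAD
    Date, Dim1, Dim2, Cost,
    'Detail' as Granularity
FROM [lib://Data/recent_detail.qvd] (qvd);

Concatenate (Facts)
LOAD
    Date, Dim1, Dim2, Cost,
    'Aggregated' as Granularity
FROM [lib://Data/history_aggregated.qvd] (qvd);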