I have a question regarding having large and wide fact tables in the data model vs large fact table in a star/snowfkale schema.
I currently have an application that has a fact table with 1 billion rows and is 30 fields wide.
Some fields could certainly be extracted from the fact table and be made into reference tables to make the fact table more skinny. This would make the fact table 15 rows wide.
However, I have concerns it will slow down the application if done like that.
In instances where the user needs to have everything in a front end table, the hypercube build would have to do joins between the table's billions of rows?
I've read articles that when a fact table is huge, it is better off to have everything in the single fact table.
Is this true? Which data model is most beneficial to have with large datasets?
I think you should remain by your large and wide fact table as it usually provides the shortest response times within the UI even if it has a larger RAM consumption as a star/snowflake-scheme.
But like always it depends on the concrete circumstances in your environment (where is the biggest bottleneck CPU/RAM/storage/network and if you are already hitting some limitations there ...), the data itself (number of distinct fieldvalues, data-quality, ...) and the requirements which results should be displayed in which way (are simple expressions like sum/count(FIELD) enough or are they more complex and expensive and which and how many objects should be used and so on) which solutions is in the end the most suitable one.
This means to be sure you will need to test both mentioned approaches. Especially the bold marked topic is important in regard how heavy the single-threaded creation of the virtual tables (the hypercubes on which the real expressions are performed) would be. Because your mentioned 1 billion records might just contain a few hundreds/thousands of distinct values and AFAIK these virtual tables are build with these system tables and not with the data tables. Also all selections, set analysis and the state tables refers to the system tables. And with that in mind I'm not sure that the large and wide fact table is always the fastest approach.
So, help me understand. You are saying that a fact table with let's say 20 million records, across 7 dimensions, each of which has an average of 30 columns, is better to denormalize all 210 dimensional attributes into the fact table, which, in and of itself may have 30 columns to start with? So, a 300 column fact table.
That's the best design?
Like above mentioned it depends on the data, data-quality, requirements to calculate and visualize the results, access controls (which user should access which data), the available environment (nodes, cores, RAM, network/storage), number of users and time-frames of their average usage which kind of data model and UI objects/calculations and their usability might fit best.
The general recommendation is to develop a datamodel in the direction of a star-scheme (which means a quite heavy denormalization compared against an usual SQL scheme but is not the maximization of this approach) which is usually the best compromise in regard of script and UI performance and the efforts to develop and maintain an application.
If I understand you right you have already a star-scheme with not a really big dataset and therefore I think you should remain by it unless there are serious issues which needs to be optimized (which would at first require to identify the cause of the issue).
Thanks Marcus for responding. I had added this below from another similar post:
For extremely large data dart sets (100m+ row wide table), I am thinking I would change the design to either (1) use a summary table for performance and then document chain to details, or (2) use a "direct discovery" (or the current Qlik technology for that) approach to send targeted queries directly to the database. I am thinking those approaches would allow me to satisfy high performance, while keeping all the design benefits that come with a star schema.
From an application maintainability / extensibility standpoint that a (single concatenated fact) star schema design gives you, I just cannot see using a single large and wide table. But from a pure response time benefit I see your point.
That there are more as 100m of records within a fact-table doesn't mean mandatory that you need further measures to handle such dataset with acceptable response times.
Of course you may get benefits by splitting the application into several ones and/or using a mixed granularity and/or applying of data-reductions in regard to the access rights of the user and/or direct discovery and similar features (AFAIK it's not suitable to pull larger datasets - at least not from a performance point of view).
Unless the access control which is often a must requirement independently of any performance relevance although it could have a quite big performance impact if the average user has only access to a quite small amount of the overall data - all mentioned measures are on top approaches which I would only consider if the end-user performance is too slow and if there are really no further optimizations within the application possibly (and often there is).