Re: Data Model Question - Fat or Thin Fact table? - Qlik Community

mjamesking · ‎2016-07-04

Hello,

Am designing a data model and am torn between two approaches for the fact table. Am wondering if anyone has any advice or preferences (or even an alternate suggestion).

Essentially I want to measure the balance of a client account on a daily basis, and also look at the number and value of orders by type (deposits, withdrawls) made each day.

Is my best approach to have:

1. a single row per account per day in the fact table, with a number of different measures on each row

e.g.

Date	Account	Balance	Deposit #	Deposit Value	Withdrawal #	Withdrawal Value
03/07/2016	1	$100	2	$5	1	$10
03/07/2016	2	$300	3	$23	2	$5

or

2. Multiple fact rows per account each day, containing the result for each particular measure

e.g.

Date	Account	Fact Type	Fact Value
03/07/2016	1	Balance	100
03/07/2016	1	Deposit #	2
03/07/2016	1	Deposit Value	5
03/07/2016	1	Withdrawal #	1
03/07/2016	1	Withdrawal Value	10

Bearing in mind the number of facts per account per day could grow and grow?

Any thoughts?

oknotsen · ‎2016-07-04

Go with option 1. Will be a lot easier for your front-end developers to get the right numbers to show without having to work with unneeded complex expressions.

Next to that, the way that Qlik handles the data I think option 1 will be better for performance (though I doubt you will notice much different the first 10 million rows).

May you live in interesting times!

mjamesking · ‎2016-07-04

Thanks Onno

and if I wanted to also track certain Account 'events' such as the date an account was opened or closed would it be valid to concatenate these event facts to the table, leaving most other columns blank?

e.g.

Date	Account	Balance	Deposit #	Deposit Value	Withdrawal #	Withdrawal Value	Account Opened	Account Closed
01/07/2016	1						1
02/07/2016	1	100	2	$100	0	$0
03/07/2016	1	0	0	$0	1	$100		1

again, there may be a good few more of these event based facts

Cheers

oknotsen · ‎2016-07-05

Don't you already have that information by the minimum date and the maximum date an account has facts for?

If not, you might just want to model this information in on a separate fact table. Having a central LinkTable with your MasterCalendar connecting two facts with your dimensions.

May you live in interesting times!

mjamesking · ‎2016-07-05

For open and close, yes, but I do have a number of other account based events that are not as obvious and can occur at any time between an account coming into existence and an account being closed - for example, interactions, account application dates.

So - on that basis, go with the event facts in a separate table with an entry per event and an event type?

johnw · ‎2016-07-05

I would, yes. An accounts table for facts about the account. An events table for the events that occur on that account. Technically open and close are events that you could store on the events table, but if they only occur once in the life of any account (if reopened it is reopened with a different number), then I would likely just store them on the accounts table. But if you can close and reopen the same account number, then I'd probably keep them on the events table.

I'm not clear why deposit # and deposit value are considered separate events. Those seem like two facts about a single event, a deposit, in which case I'd have a single row on the events table. And I suppose my fields would be [Number] and [Value] since the event type tells you what the number and value mean.

Hard to say without really knowing your data and reporting requirements well, but those are my initial thoughts.

marcus_sommer · ‎2016-07-06

It's a very interesting question then I use both types of fact-tables in different applications and have never checked which type would be perform better. I didn't want to play with my own applications which I couldn't share here and therefore created a small example-application with a variable record-number on randomly created values.

By each run of the script the data will be change and the model is very simple - only the fact-table, no dimension-tables - and it won't cover every possibility and each aspect and both types are included in this example - simply commented the other part out. To get a tendency it should be enough.

I have had run the script with 10 M records and checked the filesizes and the open-times for the pivots (within the document properties) and the results are quite clear. The fat-table need 98 MB and the thin-table 264 MB by the filesize (factor from about 2.5) and the opening-times differ from 312 to 1560 milliseconds (factor from about 5).

I think with other and more complex datamodels the differences could be a bit lower or higher but the tendency that the fat-table performs better will be remain and therefore the suggestion from Onno is right.

- Marcus

johnw · ‎2016-07-06

Thank you for checking the performance. I'll try to keep that in mind. And indeed, I modeled order proceeds in our applications just as you did - a fat table. And yes, I'd model it that way even though I'm aware that in the future, we'll be adding additional proceeds components.

LOAD
@1:10 as "Order Item"
,@11:22 as "Base Price"
,@23:34 as "Price Extras"
,@35:46 as "Price Surcharges"
,@47:58 as "Regular Discounts"
,@59:70 as "Excess Prime Discount"
,@71:82 as "Sales Tax"
,@83:94 as "Freight Charge"
,@95:106 as "Freight Proceeds Deduction"
,@107:118 as "Hub Proceeds Deduction"
,@119:130 as "Delayed Discounts"
,@131:142 as "Terms Discount"
FROM ...

Where the skinny table becomes better is a different case, where we could have an arbitrary number of deposits or withdrawals or other activities on an account in a day. And that would generally be true of such things as deposits or withdrawals. Imagine the headache of trying to add up the deposits if you had [Deposit Amount 1], [Deposit Amount 2], ... [Deposit Amount 500]. You could probably do it fairly easily with a dollar-sign expansion and concat on values of $Field picked with set analysis, or maybe with an appropriate valueloop(), or something along those lines. But that's complicated, and I suspect the performance would be worse than a simple sum.

The "right" answer depends on the nature of the data and the required ways of viewing that data. But I would say that if we can have multiple deposits or withdrawals in a day, I would absolutely NOT just keep adding additional deposit amount fields. That probably wasn't what you were suggesting of course. But maybe I'd still use the fat table, but represent additional deposits or withdrawals with additional rows instead of columns. Something like:

Date,Account,Balance,Deposit #,Deposit Value,Withdrawal #,Withdrawal #,Withdrawal Amount
03/07/2016,1,100,2,5,1,10
03/07/2016,1,,3,10,2,20
03/07/2016,1,,4,30,,

So you can do a sum([Deposit Value]) for simplicity rather than the skinny table sum({<[Type]={'Deposit'}>} [Value]) or sum({<[Type]={'Deposit Value'}>} [Amount]). It's a weird little hybrid, as for any activity that can only happen once per day, there's not specific record that that activity belongs on. I certainly wouldn't design a database that way, but in QlikView, it probably just doesn't matter which one it's on, as long as it's only on one. Idono. Just another thought. Again, it depends on the nature of the data and the required ways of viewing the data.

mjamesking · ‎2016-07-06

Thanks for everyone's input.

To clarify a little about the nature of the data and the required methods of viewing.

1. I have some data that is known only once per day, for example the Account Balance. It isn't as simple as a running total of deposits minus withdrawals as it fluctuates with the share price of holdings.

2. I also have some data that I want to track that may occur more than once per day, or not at all. For instance a deposit or a withdrawal. I want to be able to track the value of these as well as the volume of them. Each of these transactions/ events also has a 'payment method' which I want to be able to break these things down by.

3. As well as these deposits and withdrawal transactions I also have some other account related things I want to count and track, for instance the number of contacts made to client services about a particular account, and break these contacts down by the 'nature' of the contact.

The data volumes are not huge currently, but I do want this to be scalable.

Having thought some more, and read all of you posts.. .then thought again, i'm leaning to the thinner fact table, because of the amount of different types of thing that I need to count.

So a dimension table for each account including all the account detail I require.

A single, thinner (realistically a mid-size) fact table containing a 'Fact Type' column and a fact Fact Type specific columns that would be redundant for other fact types (such as payment method, contact nature but keep these to a minimum).

Anyone disagree? (brave question to ask I know )

johnw · ‎2016-07-06

I think I agree. I might potentially keep contacts (which don't involve money, I gather) and transactions (anything that does involve money) separate. But I'd probably combine them since they're both activities involving the account. I might have both a generic [Value] field and redundant fields like [Account Balance], [Deposit Amount], and [Withdrawal Amount]. But I might well stick with the single [Value] field and use set analysis to look at specific fact types.

Data Model Question - Fat or Thin Fact table?

QlikView App Dev