I have a number of questions about best practices for architecting a solution built on a ledger so as to incorporate incremental loads. Any help would be much appreciated!
Our database is a ledger, which will only ever have rows added. A reduction to a previous entry is applied as a new row with a negative value, such that the aggregation of rows results in the current values. This means the table is ideal for incremental extracts.
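For context on why the append-only property helps, the incremental extract can be a simple high-water-mark load. A minimal sketch in QlikView script, assuming a monotonically increasing EntryID column (all names here are illustrative, not your actual schema):

```
// Illustrative names throughout; assumes EntryID only ever increases.
LET vQvd = 'LedgerFact.qvd';
LET vLastID = 0;

IF NOT IsNull(QvdCreateTime('$(vQvd)')) THEN       // QVD exists: find the watermark
  MaxTmp:
  LOAD Max(EntryID) AS MaxID FROM [$(vQvd)] (qvd);
  LET vLastID = Peek('MaxID', 0, 'MaxTmp');
  DROP TABLE MaxTmp;
END IF

LedgerFact:                                        // new rows only (slow DB hit)
SQL SELECT * FROM Ledger WHERE EntryID > $(vLastID);

IF $(vLastID) > 0 THEN                             // stitch the history back on
  CONCATENATE (LedgerFact)
  LOAD * FROM [$(vQvd)] (qvd);                     // straight QVD load (fast)
END IF

STORE LedgerFact INTO [$(vQvd)] (qvd);
```

Because the ledger is append-only, there is no need for the usual updated-rows handling (no `WHERE NOT Exists()` pass over the old QVD), which is what makes this table so well suited to incremental extracts.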
We want to do extracts broken down by various criteria, some overlapping (e.g. region, year, etc.).
Once the data has been extracted and split into the various QVDs, we will have a suite of dashboards that consume different selections of these QVDs. A large part of this exercise is to ensure that the data is the same across the various dashboards, where currently it might be interpreted in different ways.
Would the best approach to the initial extract be to have multiple QVWs running much the same code with different where clauses, each generating its own QVD, or to have one initial QVW that produces a single huge QVD which multiple QVWs then access?
I see advantages to multiple initial loads, in that
there is no one single point of failure,
fewer QVWs overall to maintain,
there is possibly some time saved generating a particular QVD when there is not much new data. Say one region (A) has only one new row: it will be read and saved off quickly, whereas with the single-QVD approach, another region (B) may have a million new rows, and the dashboard which consumes A would have to wait for B's extract before it can run. I'm not sure this would actually pose a problem in practice.
It also means that should a particular dataset become corrupt or need to be reset for whatever reason, you don't need to re-extract the entire table into a QVD first, only the smaller subsection.
Along these lines, hard disk space might become an issue with the single-extract approach, as we would effectively hold the entire table both in the initial QVD and again separated out into the individual ones.
It means the logic for tracking how much data has already been extracted can live entirely in the DB as stored procedures, rather than being handled in QV, where it is less reliable and harder to track.
I also see advantages to having just one initial extract.
It wouldn't hit the database with multiple connections which can cause blocking.
It would mean that the extract from the DB (slow) is a straight 'SELECT *', moving all of the splitting logic into QVD-to-QVD loads (fast), which should be a performance gain, though I'm not sure how much of a gain to expect doing it this way?
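If you do go the single-extract route, the split step can stay entirely in QVD-to-QVD loads. A rough sketch (the region list and file names are made up):

```
// Split the master QVD into one QVD per region (illustrative names).
FOR EACH vRegion IN 'North', 'South', 'East', 'West'
  Slice:
  NoConcatenate
  LOAD * FROM LedgerFact.qvd (qvd)
  WHERE Region = '$(vRegion)';   // note: a plain WHERE de-optimises the QVD
                                 // load; a single-field WHERE Exists(Region)
                                 // against a small lookup table keeps it
                                 // optimised, at the cost of more script
  STORE Slice INTO LedgerFact_$(vRegion).qvd (qvd);
  DROP TABLE Slice;
NEXT vRegion
```

Even the de-optimised form is usually far faster than hitting the database per region, which is where most of the expected gain would come from.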
Only one stored procedure needs to be created and maintained in the database.
Am I obviously missing anything here?
Once we have the QVDs in place split by region / year etc, we then plan to have another aggregation layer sitting on top of this. While some of our dashboards will be able to directly consume this data, for others we will want to aggregate the data to various levels. As such there will be a bunch of QVWs that take this data and apply aggregation, creating a new selection of tailored QVDs. This starts to get into the territory of ad-hoc logic, which we want to move away from, so ideally these aggregation QVWs will all be copied from a template, with a user interface allowing configuration of the backend script via variables (levels of aggregation, fields excluded, etc). Considerations at this stage:
Will need to offer various levels of aggregation granularity for any applicable fields
Will need to offer checkbox style selection of fields to include / exclude from extract
Offer ability to rename any field (some dashboard users will expect fields named a certain way)
Offer ability to change nulls / blanks to a given token for any field
Offer ability to add ad-hoc logic fields. (While this is undesirable, it is unfortunately always going to be necessary somewhere.)
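One way to keep the template generic for the considerations above is to drive the whole load from variables set in the UI. A rough sketch, where every name below is illustrative rather than a proposal for your actual fields:

```
// Set from input boxes / the template's configuration sheet.
SET vSource   = 'LedgerFact_North.qvd';
SET vTarget   = 'Agg_North_ByYear.qvd';
SET vDims     = 'Region, Year';                   // granularity selection
SET vMeasures = 'Sum(Amount) AS Amount';          // aggregations to apply
SET vRenames  = 'Region TO SalesRegion';          // dashboard-specific names

Agg:
LOAD $(vDims), $(vMeasures)
FROM [$(vSource)] (qvd)
GROUP BY $(vDims);

RENAME FIELDS $(vRenames);                        // optional renaming step

STORE Agg INTO [$(vTarget)] (qvd);
```

Null/blank-to-token replacement could be driven the same way, e.g. a variable holding expressions like `If(Len(Trim(Comment)) = 0, '<blank>', Comment) AS Comment` that gets expanded into the field list, and field inclusion/exclusion is just whichever fields the UI writes into the dims/measures variables.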
Any other common abstractions that I might need to take into account? I'm surely missing something here.
Ideally, this should mean that for every dashboard, we should be able to load the section access from the database, and then load the data from the QVD/s (both fact and relevant dimension data). Given that the QVDs are incrementally loaded, what is the best approach to refreshing the dashboards themselves? I have seen in the past that when a dashboard reloads, it can either kick current users off if it takes too long, or reset all their selections. Is there a way to maintain a user's session through a reload? Would incrementally loading the dashboard from the QVD make the most sense, or simply doing an optimised load of the entire QVD? I imagine this differs case by case and would need testing?
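For what it's worth on the last point: a straight QVD load is usually so fast that incremental logic inside the dashboard itself often isn't worth its complexity, so I'd be inclined to start with the simple form and only get cleverer if reload timings demand it. The simple form, with illustrative file names:

```
// An optimised QVD load: a plain field list (or *), at most simple renames,
// no calculated fields, and no WHERE clause other than a single-field Exists().
Fact:
LOAD * FROM Agg_North_ByYear.qvd (qvd);
```

Any transformation in that LOAD drops it back to a standard (unoptimised) load, so pushing all shaping into the earlier QVD layers is what keeps the dashboard reloads short.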
Thanks for any and all help, I've obviously got quite a lot to investigate.