What in the world is Type 2 and why isn't there a Type 1?

Dalton_Ruer · Jun 5, 2023 2:47:05 PM

History

Imagine you are starting your first Qlik Cloud Data Integration project. You've seen all the videos, your organization had a demo and the POC went great. Now here you are sitting down to begin a real use case you get the connections ok, and then suddenly you are presented with the Settings screen for the Onboarding stage and it defaults the History radio selections to "Data replication with historical data store (Type 2)" or choose "Data replication only."

You may have 2 quick questions:

What in the world is Type 2
Why in the world isn't there a choice for Type 1

The point of this post is to help you understand that fundamental question which is upfront as it really defines why you are moving the data and what you intend to use it for down the road. I would really love for the selections to read

Everyone down the road will only care that the data matches whatever is in the source system
There will be people who look at the data and will need to know what the values were as of the time transactions were conducted.

So which type of people will consume your data. The first type I mentioned or the second type I mentioned? Type 1 or Type 2? Oh sure you could read a bunch of fancy schmancy authors pontificating on Type 1 or Type 2, but in a nutshell that's what it boils down to.

Let's consider a practical example. Imagine that you have healthcare data. An Encounter table that contains information about a Patient's visit to the Hospital as well as information for the Physician who saw the patient. Imagine that a Patient or Physician changes their name or their phone number. Will anyone ever care to know what the Patient or the Physician information used to be? If for any reason the answer is YES, then you want to be sure that "Data replication with historical data store (Type 2)" is selected. If the answer in an emphatic NO, then you can feel free to change the response to "Data replication only" (because all my use cases are the first type, Type 1, where everyone only wants to know the current values).

Mixed Types

For the rest of this post we will pretend that our team declared that they only ever cared about seeing the current Patient information, but will need to see the historical Physician information. Since some of the data needs to be historical, I chose the Type 2 selection.

Which means that all of my tables will then have a history of changes. In my case, that is really no big deal. I mean after all how many Patients will change there names or phone numbers? Those changes will occur very slowly right? [I wrote that simply because when you read about Type 2 from those fancy schmancy authors pontificating they will refer to Type 2 as "Slowly Changing Dimensions." Which is what that means. You don't expect the data to change rapidly, you know the history will accumulate slowly.

I could simply choose to say "Who cares about a few bytes here and there" or perhaps I might say "Look I don't want any history written for this table no matter what the onboarding settings are." Or even "I know the onboarding settings say not to collect history but I want it for this table." The nice thing about Qlik Cloud Data Integration is that you have the flexibility to mix it up.

To override the Onboarding setting for any Dataset simply open the Storage task and where you see the Datasets listed, click on the Dataset you want to override and choose Settings. In this example I am going to ignore the history for the Patients table.

Then under the General section, I will tell Qlik Cloud Data Integration that I do not want it to maintain Historical Data for the table by selecting Off.

In my real world case, entire hospital system with thousands of tables, is it likely that I have a bunch of dimensions that we would never care about history for? Absolutely! Do you think I want the administrative nightmare of overriding the setting for a lot of tables? Absolutely not!

The great news is that I don't have to. Depending on what that threshold may be for you, you are welcome to create 2 different Onboarding tasks. One that will maintain that Type 2 history, and one that will not maintain the history. Then you can simply choose the appropriate tables for each of the tasks.

Changes

Imagine if you will over the course of Onboarding and Storing data for my Encounters, Patients and Physicians sure enough a few changes were made. Patient ID number 2 had a name change from "Patient Two" to "Patient Deux" and subsequently to "Patient Three." Notice that there are columns identifying which are the CURRENT values and which were the PRIOR values. There are also columns that identify when the value was current from and to.

There was also a change to Physician ID 2 over time and it's name changed from Dr. Two to Dr. Deux.

Use current data?

For the sake of your time I skipped over the Transform task and am jumping right to the Data Mart creation task. I have selected the Encounters table as my Fact. Notice at the bottom of the dialog below it wants to know if I want to "Use current data." Without the benefit of this history lesson (pun intended), and without the benefit of the above image you might think that's a really odd question. "No I don't want you to use my current data, use data from a few years ago for my data mart because analysts love that."

But as you do understand you won't ask that kind of silly question. So let's review our situation. We wish to not worry about changes over time for Patients, but we do wish to track changes over time for the Physicians. Thus, for this Data Mart based on Encounters if Physician 2 was name "Dr. Two" at the time of the Encounter we want to see "Dr. Two." but if the encounter was after the name change we want to see "Dr. Deux." But alas, Encounters (and many of your fact tables) has more than 1 date field. So Qlik Cloud Data Integration let's you (and me) select the one we wish to base our "slowly changing dimensional values" on. In my case, I have said "I want you to consider what the Physicians information was as of the Discharge_Date for the Encounter, not the Admission_Date.

To Denormalize or Not Denormalize? That is the question

Ah yes that famous quote from Shakespeare's "Hamlet." Ok, maybe it was slightly different. But we are dealing with much more important decisions like showing end users historical changes or not, and I want to ensure you get it right.

First you need to know that your FACT tables will always, always, always show the CURRENT values in the table. ALWAYS. If I changed an encounter over time from Physician ID = 1, to a 2, it will ALWAYS show that the Physician ID is 2. It's a FACT table. Our Dimension tables, Physicians and Patients can be either depending on all of the above.

Relational databases are all based on the premise that you don't want data duplicated, you want to maintain relationships. You "normalize" the data by using the relationships, thus we have a Physicians table and a Patients table to begin with. In order to see the Patient/Physician information for an encounter you have to utilize JOIN clauses in your SQL statements. Which aren't always fun.

Denormalizing the data means you want to reverse that process. Below I am basically saying "Look here Qlik Cloud Data Integration, for this Data Mart I want you to pull the Patient information into the Encounters FACT table so that my end users won't need to do any JOINS. I am fully aware there will be more BYTES written, but their ease of use is my primary goal."

I know that may be very childlike for you, but here is why I've put it in here. If you denormalize a Dimension to the Fact table ... the data will ALWAYS show the CURRENT value. It will not matter if you maintained history for the table or not. Because at the point you put it into the FACT table ... it becomes Factual data, not Dimensional data.

In my scenario that's exactly what I wanted anyway right. I know that my end users will never care about Patient history so making their life easy and removing a JOIN for them can go hand in hand. But I do not want that for the Physicians. My users will need to see the appropriate Physician information based on the Discharge Date in the Encounter. Thus as part of my modeling for the Data Mart I need to add the Physicians Dimension.

Results

After preparing and running the Data Mart task I end up with an Encounter (FACT) table that looks like this. Notice that despite the fact that Patient_ID 2 changed over time, the patient Name shows as "Patient Three" for every record. Remember, if it's in the Fact table, it's always the CURRENT value.

Type1OrType2_DataForDenormPatient_PhysicianDimension.png If you look closely at the above information for the Physician information you will see several Encounters for Physician_ID 2. The cool thing is that Qlik Cloud Data Integration automatically handles the "slow changes" by creating it's own synthetic key for the Physicians. Thus if the Encounter Discharge_Date was when "Dr. Two" was the active name, it can link to it, and if the Discharge_Date occurred after the name change to "Dr. Deux" it can link to it.

The Physicians table in the Data Mart looks like this. Notice it also generated the synthetic key. Making it the "Primary Key" for the Dimension in our Data Mart. Thus, if you need to associate the two, that hdr__ field is the one to do the association/join on.

Additional Data Marts

At any given time we would only focus on the specific needs for a given Data Mart. Clearly what I just showed was spot on for that Data Mart's needs and the needs I began talking about.

But guess what? In the "real world" there might also be a use case where end users do want to see the Patient information as of the Discharge_Date for the Encounter. If the Patient Data Set was configured to track that Type 2 history, we are in business and can create Data Mart 2 that would end up looking like this. Notice below that there are multiple Encounter records for Patient_ID 2 but hdr_..Patient key values 1, 2 and 4.

Things to Ponder

I rolled that off pretty quickly and there was 1 big assumption. That the Patient history was available in the project. If I had overwritten the Type 2 setting for the Patient table as I showed, or I had brought it in as part of an onboarding where I said to ignore it ... we would have been out of luck. Hence, the reason the Onboarding defaults to Type 2 so that you have the flexibility downstream.

I can't stress enough that your understanding of how Qlik Cloud Data Integrations tracks the historical values for tables is absolutely critical. Hopefully, this post helps gives you a great foundation to work from. The more you understand it, the more you are prepared to support a wide variety of scenarios.

Please feel free to post your comments or questions.

The information in this article is provided as-is and to be used at own discretion. Depending on tool(s) used, customization(s), and/or other factors ongoing support on the solution below may not be provided by Qlik Support.

What in the world is Type 2 and why isn't there a Type 1?