Anonymous
Not applicable

Workflow for cleaning and modelling the data

Hello,

I am a beginner ETL developer in Qlik and data modelling. I have a messy data set and I am pretty confused about how to start analyzing it. It has one big table containing transactions, and several other smaller tables. My questions are:

1. I know what I want to evaluate, but I don't know the data. What is the most suitable way to begin inspecting the data - in other words, how do I load a massive data set to start inspecting it? Loading all of it takes a lot of time, so what approach should I use to get a good grasp of all the fields by loading just a small amount of it?

2. How do I identify which columns to keep and which ones to dismiss? Are there any rules - for example, if a column has 90% null values, should I drop it, or should I check whether it still holds some important information? If the latter, how should I do that?

3. How do I identify the meaning of dimensions when both the dimensions and their values are unclear?

The problem is that I want to know the best practice for reducing the massive data set into a smaller one in order to get insights about the dimensions and values, but I am afraid that I will lose or miss a lot of information with this approach.

Also, any tips on improving as a beginner ETL developer, and links to open data sets to practice these kinds of dimensional modelling problems on, would be greatly appreciated.

Regards,

Juras


3 Replies
petter
Partner - Champion III

What you talk about here is what is often referred to as "The Data Discovery Phase".

First of all, you should have an end goal in mind. You say you already know what you want to evaluate. If your aim is to fully understand all of the messy data, you probably have a huge and unrealistic task ahead of you (how do you eat an elephant?). A BI and analytics project always has a limited scope when it comes to what it should deliver. If you don't get this scope from anyone else, you will have to formulate it yourself and make sure to get it validated by the "customer". Within a defined scope - that is, what kind of charts and tables you need to produce - you can more easily rule out the information you don't need to include. To work efficiently, try to pull out the minimum number of columns/fields you need to satisfy the end result. Of course this demands a highly iterative approach, as you will never get it right the first time round.

1) Limit the one big table (transactions) to a window of time that makes the data manageable - maybe even just a day's worth of data, or a few hours' worth. You didn't state whether you are using SQL-based sources. If you are, you can filter the data with a WHERE clause in your SELECT statement that refers to a date/time column, so you only get the window of data you need. Since the other tables are smaller, maybe you can load them in full?
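
For example, a minimal sketch of such a time-limited extract in a load script (the connection, table and field names here are only placeholders - yours will differ):

    // Assumed DSN, table and field names - replace with your own
    ODBC CONNECT TO SourceDB;

    Transactions:
    LOAD *;
    SQL SELECT *
    FROM Transactions
    WHERE TransactionDate >= '2024-01-01'
      AND TransactionDate <  '2024-01-02';   // one day's worth of data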

You could also use the LIMIT clause, or its equivalent, for SQL-based sources. Most major vendors provide some way of sampling rows from tables; the syntax varies, so you will have to check the source database's or system's documentation.
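
If you prefer to keep it source-agnostic, Qlik's own FIRST prefix limits how many records the following LOAD pulls in - a small sketch (again, the table name is an assumption):

    // FIRST limits the number of records loaded by the following LOAD/SELECT
    SampleTransactions:
    First 10000
    LOAD *;
    SQL SELECT * FROM Transactions;

    // Source-side equivalents (check your database's documentation):
    //   MySQL / PostgreSQL:  SELECT * FROM Transactions LIMIT 10000;
    //   SQL Server:          SELECT TOP 10000 * FROM Transactions;
    //   Oracle 12c+:         SELECT * FROM Transactions FETCH FIRST 10000 ROWS ONLY;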

You can land the tables in QVD-files and then base your development on these QVD-files because reloading from QVD-files are exceptionally much faster than reading from the original sources. Then you could include most or all columns in the QVD-files and for the QlikView "Discovery App" in the loadscript you could select just a few columns and with every iteration it is easy to include more columns without getting the time penalty of hitting the original sources.
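
A minimal sketch of that two-step pattern (the paths and field names are assumptions):

    // Step 1 - extractor script: land the full table in a QVD once
    Transactions:
    LOAD *;
    SQL SELECT * FROM Transactions;

    STORE Transactions INTO [..\QVD\Transactions.qvd] (qvd);
    DROP TABLE Transactions;

    // Step 2 - discovery app: reload only the columns you need right now
    Transactions:
    LOAD
        TransactionID,
        TransactionDate,
        Amount                    // add more fields on each iteration
    FROM [..\QVD\Transactions.qvd] (qvd);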

2) You're right that columns with 90% or more null values might be candidates for leaving out. Try to get hold of a domain expert - a person or persons who know the source systems and data - so they can tell you which columns are essential. Alternatively, try to find documentation or literature on the subject.
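
If you want to quantify that rule first, here is a rough sketch that profiles the share of null/empty values per field in an already loaded table (the table name Transactions is an assumption; the loop uses standard script functions):

    // Share (0-1) of null/empty values per field in the resident table
    LET vRows = NoOfRows('Transactions');

    FOR i = 1 TO NoOfFields('Transactions')
      LET vField = FieldName($(i), 'Transactions');

      NullProfile:
      LOAD
        '$(vField)' AS Field,
        Sum(If(IsNull([$(vField)]) OR Len(Trim([$(vField)])) = 0, 1, 0)) / $(vRows) AS NullShare
      RESIDENT Transactions;
    NEXT i

Sorting the resulting NullProfile table by NullShare gives you a quick shortlist of columns to question.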

3) There is no silver bullet for this ("How do I identify the meaning of dimensions when the dimensions and their values are unclear?"). Focus first on what is clear and try building your data model and application visualizations around that. For the unclear dimensions, look at how their values change over time, how often they change, and what other related data changes when those values change. QlikView can be a fantastic tool in that respect, thanks to its associative model.

Maybe these Community documents could be of help:

   QvM - Advanced Data Discovery

   QvM - Qlikview Modules

If you haven't already studied Ralph Kimball's books on ETL, Data Warehousing and BI, then do so - you will learn a lot, and there are plenty of examples to enhance your understanding:

    The Data Warehouse ETL Toolkit

    The Kimball Group Reader: Relentlessly Practical Tools for Data Warehousing and Business Intelligence (Remastered Collection, 2016)

    The Data Warehouse Toolkit, 3rd Edition: The Definitive Guide to Dimensional Modeling

Other resources:

   Agile Analytics: A Value-Driven Approach to Business Intelligence and Data Warehousing

    Star Schema - The Complete Reference

Anonymous
Not applicable
Author

Wow, thank you very much!!!!

petter
Partner - Champion III

You're welcome - glad to help.

If you considered this an adequate answer to your question, please mark it as both answered and helpful.