meimagine
Contributor

Bulk load and then incremental load of almost 200 GB of data in QlikView

Hey Guys,

We have a very large data set (more than 200 GB) which also needs to be updated incrementally after the initial bulk load. I have loaded a small portion and am already running into performance issues. (I have tried both creating QVDs and direct DB loads, and I have already reduced the number of columns being extracted.) Can someone please suggest how to implement this? The stakeholders actually want to extract 3-4 years of data from the dashboards as well. I am pretty new to QlikView and have not handled this much data in Qlik dashboards before. Any suggestions on what process I can try?

Are there options to archive the data and only load it into the dashboard when someone is trying to extract a specific data set? But even if I keep it to the minimum, the data set for a year would still be pretty huge (20 million rows a month). Thanks.

22 Replies
Vegar
MVP

Yes, that sounds like a lot of data. A Qlik approach would be to create an aggregated view of all the data in memory and connect to the details in the database using Direct Discovery.
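
A minimal sketch of what that could look like, assuming a database connection is already open and using hypothetical table/field names (Orders, OrderDate, Region, Amount):

// DIMENSION fields are loaded into memory; MEASURE fields stay in the
// database and are aggregated there on demand.
DIRECT QUERY
  DIMENSION OrderDate, Region
  MEASURE Amount
FROM Orders;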

marcus_sommer

I think there are several things which should be considered when handling larger datasets, for example:

- using a multi-tier data-architecture
   - means the use of at least 3 data-layers, like generators --> datamodels --> reports
   - each logical part gets its own generator / datamodel
   - those tasks can then run in parallel and/or within different timeframes or frequencies
   - makes it easier to develop and maintain the logic
   - especially with regard to implementing incremental approaches
   - and slicing the data, maybe into monthly slices (see the sketch below)
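
A rough sketch of monthly slicing in the generator layer, with hypothetical table/field names (Orders, OrderID, OrderDate, Amount), a hard-coded year range, and date-literal syntax that will depend on your database:

// Store one QVD per month so later runs only touch the current slice.
FOR vYear = 2021 TO 2024
  FOR vMonth = 1 TO 12
    LET vStart = Date(MakeDate($(vYear), $(vMonth), 1), 'YYYY-MM-DD');
    LET vEnd   = Date(AddMonths(MakeDate($(vYear), $(vMonth), 1), 1), 'YYYY-MM-DD');

    MonthData:
    SQL SELECT OrderID, OrderDate, Amount
    FROM Orders
    WHERE OrderDate >= '$(vStart)' AND OrderDate < '$(vEnd)';

    STORE MonthData INTO [Orders_$(vYear)_$(vMonth).qvd] (qvd);
    DROP TABLE MonthData;
  NEXT vMonth
NEXT vYear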

- loading only the needed data and preparing it properly
  - no record-id's from the database
    (they might be needed to evaluate this or that but aren't necessary in any reports)
  - splitting timestamps into dates and times
  - respectively, if no time part is needed, remove it - and the same with similar fields
  - remove formatting from the fields
     (especially mixed formats within a field can need significantly more resources than pure numerical values)
  - with a quite redundant dataset (meaning few distinct values in the fields) the data-size on disc
     respectively in RAM can become much smaller than the size of the rawdata
  - use mappings rather than joins (see the sketch below)
  - avoid if-loops and aggregations with group by
    (only for essential things - maybe by creating a dataset with a mixed granularity)
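
A small sketch of the timestamp split and a mapping instead of a join, with hypothetical names (Orders, Customers, OrderTimestamp, Amount):

// Mapping table instead of a join - cheaper for simple lookups.
CustomerMap:
MAPPING LOAD CustomerID, CustomerName
FROM Customers.qvd (qvd);

Facts:
LOAD OrderID,
     Date(Floor(OrderTimestamp)) AS OrderDate,    // date part only
     Time(Frac(OrderTimestamp))  AS OrderTime,    // drop this line if the time part isn't needed
     Num(Amount)                 AS Amount,       // pure numeric value, no format
     ApplyMap('CustomerMap', CustomerID, 'unknown') AS CustomerName
FROM Orders.qvd (qvd);

Note that a transformation load like this is not optimized, which is fine in the datamodel layer - the pure incremental appends should stay optimized.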

- using a quite de-normalized datamodel
   - even if Qlik can handle a quite normalized datamodel like databases often use, it's rather adverse,
     especially from a performance point of view
   - recommended is the use of a star-scheme, and with really big datasets a single table might be even more suitable


Even if the above may not sound like it - the aim is to keep it as simple as possible, with a quite high degree of division of work.

- Marcus

meimagine
Contributor
Author

Thank you for your suggestion, Vegar. Direct Discovery seems to have a few limitations, which include calculated dimensions, advanced calculations and the Oracle LONG data type. I have data which has the LONG data type and I am also using calculated dimensions on the dashboard.

meimagine
Contributor
Author

Thanks for such a detailed answer, Marcus. When you say 'at least 3 data-layers', what sort of layers are you talking about? Sorry about my ignorance.

A second thing was loading only the needed data. Our stakeholders run yearly reports every month, which is why we need so much data on the dashboard. Can you please help out further with how we could do this?

QVDs seem to be a bit slow, so I had created direct connections to the tables. These tables have incremental loads every 30 minutes. Is it possible to do some sort of incremental load and also keep all the data on the dashboard without affecting the performance?

meimagine
Contributor
Author

Is there a way to load the full data set once into the dashboard and then only make incremental loads? Would that help with the performance?

Vegar
MVP

I see. You could also look into On-Demand App Generation (ODAG).

This could be a place to start looking: https://community.qlik.com/t5/Qlik-FinServ/On-Demand-App-Generation/gpm-p/1464989

Even if using ODAG, @marcus_sommer's advice is good advice.

meimagine
Contributor
Author

Thanks, Vegar. I'll try it out and see if that's a good option for us.

Vegar
MVP

It would help to reduce database load and reload durations, but you still need to load all your data into memory. 200 GB will be a large app and will need a system that can handle it, even if the final app is somewhat smaller than 200 GB.

marcus_sommer

When handling multiple loadings/tables, nothing is faster than loading OPTIMIZED from a QVD. Of course I don't know all tools/databases, but I doubt that any of them could load data faster than Qlik loads from a QVD.

This means of course that you need to have all historical data within QVDs, and only the new records are pulled from the database and then added to these QVDs. Important is, as already highlighted above, that the loads are optimized - which means that no processing is needed to load the data - i.e. a load-statement like:

load Field1, Field2 as FieldX, ... from Data.qvd (qvd) where not exists(Field1);

This means it's only allowed to rename existing fields and to use a where-clause with exists() with a single parameter (which is needed for the incremental load). All other transformations will prevent an optimized load (with an optimized load the data is transferred directly from disc into RAM - so how fast it is depends entirely on the performance of your storage and RAM). With a normal SSD the load of 200 GB would take less than 2 minutes (with a specially configured RAID system it might be sped up even more).
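
Put together, a classic incremental run could look roughly like this - a sketch with hypothetical names (Orders, OrderID, OrderDate, Amount, ModifiedDate) and a variable vLastReload assumed to hold the last successful reload time:

// 1. Pull only the new/changed records from the database.
NewData:
SQL SELECT OrderID, OrderDate, Amount
FROM Orders
WHERE ModifiedDate >= '$(vLastReload)';

// 2. Append the history with an OPTIMIZED load - only renames and a
//    single-parameter exists()/not exists() are allowed here.
CONCATENATE (NewData)
LOAD OrderID, OrderDate, Amount
FROM Orders.qvd (qvd)
WHERE NOT Exists(OrderID);

// 3. Write the combined table back for the next run.
STORE NewData INTO Orders.qvd (qvd);

The where not exists(OrderID) keeps the historical load optimized and at the same time skips old versions of records that were just reloaded.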

But as mentioned above, 200 GB of rawdata doesn't mean that the QVDs also contain 200 GB. With normally redundant data which isn't enriched with any record-id's or timestamps, the file/RAM size will probably be only 5% - 20% of it.

A simple explanation of the layers within a multi-tier data-architecture can be found here:

Three Tier Architecture

By using more layers you may speed up an incremental approach even more - by loading the report-layer binary from a historical datamodel and then adding the current data with an optimized QVD load (a sketch follows below). But you shouldn't start with it; instead I recommend beginning here:

Get-started-with-developing-qlik-datamodels 

because you won't be successful with a large dataset which needs various (performance) optimizations without mastering the basics.
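
For illustration, the binary-plus-append pattern mentioned above could look roughly like this - a sketch assuming the historical datamodel is stored in a QVW with a fact table named Facts (all names hypothetical):

// BINARY must be the very first statement in the script.
BINARY HistoricalModel.qvw;

// Append only the current slice with an optimized QVD load.
CONCATENATE (Facts)
LOAD OrderID, OrderDate, Amount
FROM Orders_Current.qvd (qvd)
WHERE NOT Exists(OrderID);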

- Marcus