morenoju
Partner - Specialist

Storing millions of records in QVD files

Hi,

I have a table in my DB with approximately 220 million records, and it's growing every day.

In my data load script, I decided to store the records in one QVD per month, but reading the data back from the QVDs is taking more time than I expected (around 15 min) and is consuming a lot of resources on my server.

Would it be better if I had one QVD per year? Is it faster for Qlik to read a single QVD than several of them?
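
For reference, this is roughly the pattern for reading the monthly files back (the path and field names here are simplified placeholders, not my exact script):

// Read every monthly QVD in one pass via a wildcard.
// Loading the fields as-is, with no transformations or WHERE clause,
// keeps each QVD load "optimized".
LinkData:
LOAD
    link_timestamp,
    [link.id],
    speed_mph
FROM [lib://Data/link_data_*.qvd] (qvd);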

Thanks,

Juan

15 Replies
morenoju
Partner - Specialist
Author

Well, I'm supposed to make all records available, which is why I read the QVDs for all months. The problem I'm seeing is that when I load the 220 million records into the table, the app stops working (I can't open it; I just see the loading animation indefinitely).

martinpohl
Partner - Master

Try debugging the script with just a few records to check that the data model is OK (no synthetic keys or circular references).
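
For example, a First prefix (or the limited-load option in the debugger) caps the rows per load while you check for synthetic keys (the file name here is only a placeholder):

// Load just the first 1000 rows of one monthly QVD while testing the model
First 1000
LOAD *
FROM [lib://Data/link_data_2019_01.qvd] (qvd);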

petter
Partner - Champion III

When you have more than a million records or so, it is always important to look at how you can help Qlik get the most out of its built-in compression mechanisms.

  1. The #1 method for doing that is to look at the cardinality of each field. If you can turn a high-cardinality field into a low-cardinality field you can save a lot of space (memory), and it improves calculation time accordingly.
  2. Don't store redundant data - fields that contain just variants of each other. You have several of them:

"link_timestamp",

"link_date",

"link_time",

"link_time_15min",

"link_minute",

"link_hour",

[link.id],

speed,

"speed_mph",

TravelTimeIndex,

traveltime

    • link_timestamp might be stored with high precision, so it is very often a high-cardinality field. You can often round it off to a much lower resolution and save a lot of memory - sometimes as much as a 6-20x reduction in storage.
    • Maybe you don't need link_timestamp at all - why keep it - don't the other fields contain the same information if you combine them?
    • link_time and link_time_15min are redundant. If you need to keep link_time, at least round it off to the lowest acceptable resolution, like seconds or minutes. You might save a lot there too.
    • Be sure that link_date contains only the date part, so truncate it to an integer with Floor(link_date).
    • speed and speed_mph are redundant. Why not just use a conversion factor when you need mph? It's a fixed constant (1 mph = 1.60934 km/h) and is easy to add to any expression.
    • Be sure to round off the speed to the lowest resolution possible to save memory. Do you really need the decimals? If not, round it off and make it an integer. (See the sketch at the end of this reply.)

Splitting a high-cardinality field like a timestamp into its constituent parts - date, hour, minute, second, tenths of a second - will help tremendously, but don't keep the original timestamp. This enables much higher compression under the hood in Qlik, and you might save a lot of memory.
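
Something along these lines - just a sketch, with a placeholder source path and only the fields from the list above - keeps the low-cardinality parts and drops the redundant ones. Ideally do this before the STORE, so the monthly QVDs themselves get smaller:

LinkData:
LOAD
    Date(Floor(link_timestamp))        AS link_date,        // integer date part only
    Hour(link_timestamp)               AS link_hour,
    Round(Frac(link_timestamp), 1/96)  AS link_time_15min,  // time of day in 15-minute steps
    [link.id],
    Round(speed_mph)                   AS speed_mph,        // integer mph; multiply by 1.60934 in an expression when km/h is needed
    TravelTimeIndex,
    traveltime
FROM [lib://Data/link_data.qvd] (qvd);
// The raw link_timestamp, link_time, link_minute and speed fields are never kept in the data model.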

morenoju
Partner - Specialist
Author

Thanks Petter. I've been editing the script following your advice and I've seen some improvement. I'm keeping the timestamp, and I think its cardinality is not that high since all values in my DB table are 15-minute samples (9:00, 9:15, 9:30, 9:45, 10:00, 10:15...), which works out to 96 distinct times per day. But I've got rid of all the other "link" fields and of speed, and I've rounded speed_mph as you suggested.

Still, it takes a long time to load, but I'm now looking at the resources of the machine where I have Qlik Sense installed. I'm using two cores and the CPU usage is at 100% during the load. I'm going to increase it to 4 and see what happens.

Same with the memory, though in this case not during the load but while editing or using the app. The 8 GB I have is in use most of the time, so I'm going to increase the memory too.

Thanks for the help, and I'll let you all know how it goes.

Regards,

Juan

morenoju
Partner - Specialist
Author

This is working very well now. Thanks petter-s for the recommendations to improve the load script, and thanks to all participants for the good ideas.

pbriggs11
Contributor II

Thanks for this solution. I had never heard anyone speak of evaluating the cardinality of fields (or maybe I was just asleep in class that day). I implemented this approach on a QVD with over 100 million rows that was taking just over an hour to load, even though it was reported as an optimized QVD load. It now loads in under 5 minutes.
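
For anyone else who wants to check this, the FieldValueCount() function gives a quick read in the script on how many distinct values a field holds once it has been loaded - for example (using the timestamp field from this thread):

// Log the cardinality of a field after the load
LET vCard = FieldValueCount('link_timestamp');
TRACE link_timestamp has $(vCard) distinct values;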