morenoju
Partner - Specialist

Storing millions of records in QVD files

Hi,

I have a table in my DB with approximately 220 million records, and it is growing every day.

In my load script, I decided to store the records in one QVD per month, but reading the data back from the QVDs is taking more time than I had expected (around 15 min) and it's consuming a lot of resources on my server.

Would it be better if I had one QVD per year? Is it faster for Qlik to read a single QVD than several of them?

Thanks,

Juan

1 Solution

Accepted Solutions
petter
Partner - Champion III

When you have more than about a million records it is always important to look at how you can help Qlik get the most out of its built-in compression mechanisms.

  1. The #1 method is to look at the cardinality of each field. If you can turn a high-cardinality field into a low-cardinality field you can save a lot of space (memory), and it reduces calculation time accordingly.
  2. Don't store redundant data, i.e. fields that are just variants of each other. You have several of them:

"link_timestamp",

"link_date",

"link_time",

"link_time_15min",

"link_minute",

"link_hour",

[link.id],

speed,

"speed_mph",

TravelTimeIndex,

traveltime

    • link_timestamp might be stored with high precision, so it will very often be a high-cardinality field. You can often round it off to a much lower resolution and save a lot of memory, sometimes as much as a 6-20x reduction in storage.
    • Maybe you don't need link_timestamp at all. Why do you keep it? Don't the other fields contain the same information if you combine them?
    • link_time and link_time_15min are redundant. If you need to keep link_time, at least round it off to the lowest acceptable resolution, like seconds or minutes. You might save a lot there too.
    • Be sure that link_date contains only the date part, so truncate it to an integer with Floor(link_date).
    • speed and speed_mph are redundant. Why not just use a conversion factor when you need mph? It's a constant (1.60934 between km/h and mph) and is easy to add to any expression.
    • Be sure to round off speed to the lowest resolution possible to save memory. Do you really need decimals? If not, make it an integer by rounding it off.

Splitting a high-cardinality field like a timestamp into its constituent parts, like date, hour, minute, second and tenths of seconds, will help tremendously, but don't keep the original timestamp. This enables much higher compression under the hood for Qlik, and you might save a lot of memory.
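
To make this concrete, here is a minimal sketch of how those points could be applied, reusing the field and table names from Juan's script; it assumes speed arrives from the source in km/h and that the raw timestamp is not needed afterwards:

// Sketch only: keep just the low-cardinality derived fields and drop the raw timestamp
traffic:
LOAD
  Date(Floor(start_date)) AS link_date,                              // date part only (integer under the hood)
  Hour(start_date) AS link_hour,
  Minute(start_date) AS link_minute,
  Time(Floor(Frac(start_date), 1/96), 'hh:mm') AS link_time_15min,   // 15-minute bucket
  item_id AS [link.id],
  Round(speed) AS speed,                                             // integer resolution
  traveltime;
SQL SELECT start_date, item_id, speed, traveltime
FROM pems."public"."measure_link_15min";

speed_mph is then not stored at all; wherever it is needed it can be computed in an expression, e.g. speed / 1.60934 (again assuming speed is in km/h).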


15 Replies
zhadrakas
Specialist II

Hello Juan,

In general, it doesn't matter much for performance which way you go (monthly or yearly QVDs).

Both ways need the same RAM and CPU if you load all the data.

I think monthly QVDs are a good way. That way you can shrink down the data model if your client only needs the last 6 months.

I think what you need to check here is whether your load is an optimized load.

If not, you can save a lot of time.

Turning Unoptimized Loads into Optimized Loads
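
To illustrate the difference (a rough sketch, not taken from that document; the QVD name is just an example): a QVD read stays optimized only when it is essentially a straight copy of fields, with at most a single Exists() in the WHERE clause.

// Optimized: plain field list (renaming is allowed), no expressions, no other WHERE conditions
traffic:
LOAD link_timestamp, [link.id], speed, traveltime
FROM [traffic_2018-01.qvd] (qvd);

// Unoptimized: any calculation in the field list (or most WHERE clauses) forces the slow row-by-row path
traffic_slow:
LOAD link_timestamp, [link.id], Round(speed) AS speed_rounded, traveltime
FROM [traffic_2018-01.qvd] (qvd);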

regards

tim

morenoju
Partner - Specialist
Author

For the monthly QVDs I'm using a temp table where I put the records for the month, then I store it in a QVD, drop the temp table and load (concatenate) the QVD into my main table.

Looks like I'm doing a lot of steps. Do you think this could be done in a simpler way? I'm using the temp table to avoid storing records of previous months in the monthly QVDs.

pradosh_thakur
Master II

1: Do you use incremental load?

2: Are any aggregation functions used in the script with GROUP BY?

Check this out, it may help:

https://community.qlik.com/docs/DOC-19450
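
For reference, the basic incremental-load pattern from that kind of document looks roughly like this (a sketch only; the QVD name, field names and the timestamp format expected by the database are assumptions):

// 1. Find the newest timestamp already stored in the QVD
MaxDate:
LOAD Max(link_timestamp) AS MaxTimestamp
FROM [traffic.qvd] (qvd);
LET vMaxTimestamp = Timestamp(Peek('MaxTimestamp', 0, 'MaxDate'), 'YYYY-MM-DD hh:mm:ss');
DROP Table MaxDate;

// 2. Pull only the new rows from the database
traffic:
LOAD start_date AS link_timestamp, item_id AS [link.id], speed, traveltime;
SQL SELECT start_date, item_id, speed, traveltime
FROM pems."public"."measure_link_15min"
WHERE start_date > '$(vMaxTimestamp)';

// 3. Append the history already in the QVD (plain field list, so this read stays optimized) and store back
CONCATENATE (traffic)
LOAD link_timestamp, [link.id], speed, traveltime
FROM [traffic.qvd] (qvd);

STORE traffic INTO [traffic.qvd] (qvd);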

Learning never stops.
marcus_sommer

It sounds like your temp-table approach isn't necessary and could be removed by loading the data with a where-exists clause to keep the loads optimized. Take a closer look at the link provided by Tim.
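
Marcus doesn't spell out the script, but the general where-exists pattern he means looks something like this (the QVD name and filter value are made up; the point is that a single Exists() on one field keeps the QVD read optimized, so no temp table is needed for filtering):

// Build a small table holding only the wanted key values
WantedLinks:
LOAD * INLINE [
link.id
11001069
];

// Optimized QVD read: plain field list plus a single Exists()
traffic:
LOAD link_timestamp, link_date, [link.id], speed, traveltime
FROM [traffic_2018-01.qvd] (qvd)
WHERE Exists([link.id]);

DROP Table WantedLinks;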

- Marcus

morenoju
Partner - Specialist
Author

Hello all,

Thanks a lot for your replies. I've had a look at the DOC but I don't see how it applies to my case. I don't want to overwhelm you with code, but I'd like to share the piece of it where I'm doing the following:

1. If the QVD for the month exists, I concatenate its data in the main table (traffic)

2. If the QVD does not exist yet, I put the data in a temporary table (temp_traffic)

  2.1. STORE temp_traffic into new QVD

  2.2 DROP table temp_traffic

  2.3 Concatenate this new QVD to the main table (traffic)

It takes a lot of time, which is why it seems to me I must be doing something wrong.

Any idea is welcome! Thanks!

--

The code:

// If the monthly qvd exists, I concatenate its data to the main table
IF not isnull(qvdcreatetime(_vFileName)) THEN

  [traffic]:
  CONCATENATE LOAD "link_timestamp",
    "link_date",
    "link_time",
    "link_time_15min",
    "link_minute",
    "link_hour",
    [link.id],
    speed,
    "speed_mph",
    TravelTimeIndex,
    traveltime
  FROM $(_vFileName) (qvd);

ELSE

  [temp_traffic]:
  CONCATENATE LOAD "start_date" as "link_timestamp",
    Date(Floor("start_date")) AS "link_date",
    Time("start_date",'hh:mm') AS "link_time",
    Time(Round(Frac(Time("start_date",'hh:mm')),1/96),'hh:mm') as "link_time_15min",
    Minute("start_date") as "link_minute",
    Hour("start_date") as "link_hour",
    "item_id" as [link.id],
    speed,
    "speed_mph",
    Round(freeflow_speed / speed_mph, 0.1) as TravelTimeIndex,
    traveltime;
  SQL SELECT "start_date",
    "item_id",
    speed,
    "speed_mph",
    traveltime,
    link.freeflow_speed
  FROM pems."public"."measure_link_15min" tc
  LEFT JOIN link on tc.item_id = link.link_id
  WHERE (start_date > '$(vStartDate)' AND start_date <= '$(vEndDate)') AND item_id = 11001069;

  STORE temp_traffic into $(_vFileName) (qvd);
  DROP Table temp_traffic;

  [traffic]:
  CONCATENATE LOAD "link_timestamp",
    "link_date",
    "link_time",
    "link_time_15min",
    "link_minute",
    "link_hour",
    [link.id],
    speed,
    "speed_mph",
    TravelTimeIndex,
    traveltime
  FROM $(_vFileName) (qvd);

ENDIF

marcus_sommer

Done this way you will need a temp table to be able to store this data slice. It's not easy to say exactly what could be changed/improved, but I think I would split the task into at least two layers: a first one which reads from the database and creates the monthly QVD slices, and a second one which then reads those QVDs and performs further transformations on them.
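
A sketch of that two-layer structure (month list, file names and fields are only examples, not Marcus's actual code):

// Layer 1 - extract: its only job is to write/refresh the monthly QVD slices
FOR Each vMonthStart in '2018-01-01', '2018-02-01'
  LET vMonthEnd = Date(AddMonths(Date#('$(vMonthStart)', 'YYYY-MM-DD'), 1), 'YYYY-MM-DD');
  traffic_raw:
  SQL SELECT start_date, item_id, speed, traveltime
  FROM pems."public"."measure_link_15min"
  WHERE start_date >= '$(vMonthStart)' AND start_date < '$(vMonthEnd)';
  STORE traffic_raw INTO [traffic_$(vMonthStart).qvd] (qvd);
  DROP Table traffic_raw;
NEXT

// Layer 2 - transform: a separate script that only reads the QVDs
traffic:
LOAD * FROM [traffic_*.qvd] (qvd);   // wildcard load over all monthly slices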

- Marcus

martinpohl
Partner - Master

It looks like you load all your data from your data source and then resident-load it per month.

A resident load is much slower than a SQL SELECT.

So I would load the data per month from the data source and store it into a QVD.

Also, is it necessary to load months from 3 or 4 periods back again? If they don't change (invoices are never changed, only new ones are created) you only have to load 2-3 months each night and store them into QVDs.

Has to be tested.
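
For example, the nightly run could regenerate only the current and the previous month and leave the older monthly QVDs untouched (a sketch with assumed file and field names; it has to be adapted and tested, as Martin says):

// Rebuild only the last two monthly slices; older QVDs are not touched
FOR i = 0 to 1
  LET vMonthStart = Date(AddMonths(MonthStart(Today()), -$(i)), 'YYYY-MM-DD');
  LET vMonthEnd   = Date(AddMonths(MonthStart(Today()), -$(i) + 1), 'YYYY-MM-DD');
  temp_traffic:
  SQL SELECT start_date, item_id, speed, traveltime
  FROM pems."public"."measure_link_15min"
  WHERE start_date >= '$(vMonthStart)' AND start_date < '$(vMonthEnd)';
  STORE temp_traffic INTO [traffic_$(vMonthStart).qvd] (qvd);
  DROP Table temp_traffic;
NEXT i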

Regards

morenoju
Partner - Specialist
Author

Hi Martin,

When you speak of not loading previous months again, do you mean not doing the SQL query?

I always check if the monthly QVD already exists before doing the SQL query (I skip it if there's a QVD). But still, reading the QVDs takes some time.

Thanks

martinpohl
Partner - Master

Yes, I meant (for example today) I would only query Jan 2018 and Dec 2017, not the older months (but they are still available for the data model).