Here is the deal.
This is the statement I need to tune:
TempDates:
LOAD
'D' as AggrLevel,
*,
if(dmDateHour = MonthStart(dmDateHour), ES) as MonthES, // calc ES for Monthly aggregation
if(dmDateHour = WeekStart(dmDateHour), ES) as WeekES, // calc ES for Weekly aggregation
MonthStart(dmDateHour) as MonthStartDay,
WeekStart(dmDateHour) as WeekStartDay
;
LOAD
DPId,
TZId,
StationId,
DayDateHour as dmDateHour,
AggrCountry,
sum(TLH) as TLH,
sum(SS) as SS,
sum(if(DPESDateHour = dmDateHour, ES)) as ES // calc ES for Daily aggregation
Group by
DPId,
TZId,
StationId,
DayDateHour,
AggrCountry
;
LOAD
DPId,
TZId,
StationId,
dmDateHour,
AggrCountry,
DayDateHour,
sum(TLH) as TLH,
sum(SS) as SS,
sum(ES) as ES,
Min(dmDateHour) as DPESDateHour // first DateHour for a day part
Group by
DPId,
TZId,
StationId,
dmDateHour,
AggrCountry,
DayDateHour
;
LOAD
DPId,
TZId,
StationId,
dmDateHour,
AggrCountry,
DayDateHour,
TLH,
SS,
ES
Resident Temp
Where DayDateHour >= $(vDateStart) // daily loop condition
and DayDateHour < $(vDateEnd);
Temp table:
The table has aggregated information about sessions on different stations for predefined day parts.
Field | Description
DPId | 30 possible day parts (e.g. “Whole Week/Whole Day”, “Monday-Friday/10-13”, “Tuesday-Sunday/9-20”, etc.)
TZId | 5 possible time zones (e.g. GMT, EST, etc.)
StationId | About 16,500 station IDs
dmDateHour | A timestamp rounded down to the hour (e.g. 1/1/2011 10:00:00), or in QV syntax = floor(timestamp * 24)/24
AggrCountry | 2 possible countries: Main and Others
DayDateHour | The beginning of the day, derived from dmDateHour (= floor(dmDateHour))
TLH | Total session hours during dmDateHour (> 0)
SS | Number of Started Sessions during dmDateHour
ES | Number of sessions Entered from the previous hour into the current dmDateHour
The table holds 1 month of data. As you can see, it may have up to 30 * 5 * 16,500 * (30 * 24) * 2
= 3,564,000,000 records. Usually it is not that huge, because not all stations have
activity during a month; in practice it is about 30-40 million records.
The main problem is related to ES. I need to derive the number of Active Sessions (AS), which
is Sum(SS) + Sum(ES of the first dmDateHour in a day part), and I need to aggregate it to the day level.
That's why I have these multi-level aggregations.
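For illustration only (made-up numbers): if a day part covers 10:00-13:00 on some day, ES at 10:00 is 5, and SS for 10:00, 11:00 and 12:00 is 3, 2 and 4, then AS for that day part on that day = 5 + (3 + 2 + 4) = 14.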
I've added this condition at the beginning
Where DayDateHour >= $(vDateStart)
and DayDateHour < $(vDateEnd);
to reduce the number of records for the aggregation. I have a loop where I derive vDateStart
and vDateEnd for each day of the month. If I remove this condition, it exhausts the whole memory
on the server (98 GB).
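Roughly, the loop looks like this (a simplified sketch; everything except vDateStart and vDateEnd is just a placeholder):
// Simplified sketch of the daily loop (vMonthStart and vNumDays are placeholders)
LET vMonthStart = Num(MonthStart(Today()));
LET vNumDays = Day(MonthEnd(Today()));

FOR vDay = 0 to $(vNumDays) - 1
    LET vDateStart = $(vMonthStart) + $(vDay);
    LET vDateEnd = $(vDateStart) + 1;
    // ... the TempDates aggregation above runs here for one day ...
NEXT vDay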
So, the daily aggregation takes 45 minutes per day (about 22 hours for the whole month).
If I have just one day of data in the Temp table, the same aggregation for a day takes about
3 minutes. But the most interesting thing is that all the remaining days of the month in this
daily loop take 90 seconds each to retrieve NO DATA from the Temp table. That means a day's
aggregation takes about 90 seconds, and another 90 seconds go to retrieving the data for
the condition.
My assumption is: the more data I have in the table, the longer this data retrieval takes.
(It's a pity that we don't have indexes in QV.)
Is there another way to make this retrieval faster? Or maybe there is a better way
to write my statement.
Please help.
P.S. The Word version is attached.
Nick, it's really hard to give a solution without being able to test things with the real volume and really understand why you're doing each thing.
I think it would be best to work with a consultant on site.
A few ideas that might help:
1) Instead of doing something like this
if(dmDateHour = MonthStart(dmDateHour), ES) as MonthES, // calc ES for Monthly aggregation
if(dmDateHour = WeekStart(dmDateHour), ES) as WeekES, // calc ES for Weekly aggregation
MonthStart(dmDateHour) as MonthStartDay,
WeekStart(dmDateHour) as WeekStartDay
Perhaps you can have a table, built purely in the script using AUTOGENERATE, that only has records for the MonthStart and WeekStart dates (see the sketch after the join below).
So you can do something like
LEFT JOIN(TempDates)
LOAD
dmDateHour,
MonthStart(dmDateHour) as MonthStartDay,
WeekStart(dmDateHour) as WeekStartDay,
1 as FLG_MonthES,
1 as FLG_WeekES
RESIDENT CALENDAR;
And then use the FLGs in the calculations instead of creating new fields, something like SUM(ES*FLG_MonthES).
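For the CALENDAR part, a rough sketch of how such a table could be autogenerated (vPeriodStart and vPeriodEnd are placeholder variables, not from your script):
// Keep only the dates that are month starts or week starts
CALENDAR:
LOAD dmDateHour
Where dmDateHour = MonthStart(dmDateHour)
   or dmDateHour = WeekStart(dmDateHour);
LOAD
    $(vPeriodStart) + IterNo() - 1 as dmDateHour   // one midnight per generated day
AutoGenerate 1
While $(vPeriodStart) + IterNo() - 1 < $(vPeriodEnd);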
2) Instead of doing this
Where DayDateHour >= $(vDateStart) // daily loop condition
and DayDateHour < $(vDateEnd);
You can again create a table with the dates you want to obtain, i.e. all dates in the week or all dates in the month, and do something like this:
LOAD
Date AS DayDateHour
RESIDENT CALENDAR
Where Date >=$(vDateStart)
and Date < $(vDateEnd);
LOAD
DPId,
TZId,
StationId,
dmDateHour,
AggrCountry,
DayDateHour,
TLH,
SS,
ES
FROM File (qvd)
WHERE Exists(DayDateHour);
This will cause the load to be optimized while still filtering the records to the desired dates.
Hi Daniel,
Thank you for your help.
1) That can work if the derived data is based on just one dimension, but in my case I have 6 of them, and the flags can be different for the same dmDateHour.
2) I've tried Exists:
LOAD
DPId,
TZId,
StationId,
dmDateHour,
AggrCountry,
DayDateHour,
TLH,
SS,
ES
Resident Temp
Where Exists(DayDateHour, $(vDateStart));
The result is just as slow.
Exists will be faster than Where when loading from a QVD file and using the single-parameter form, Exists(DayDateHour).
That's good to know, but neither case applies here. It doesn't make sense to store temporary data in a QVD, and the single-parameter Exists is not possible because all possible values of the field already exist.
Thank you
Those were just examples of things you can try to improve performance; you will be better off working with an experienced QlikView consultant on site.
You should comment out the top portion, load just the first load statement, and record the time. Then uncomment each preceding load statement one at a time and perform a reload to capture the time. Use the time differences to figure out where you really start to lose performance. I would also monitor CPU and RAM and write down some notes at each step.
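For the timing, something like this around each step can help (variable names are just placeholders):
// Capture the start time of a step as a plain number
LET vStepStart = Num(Now());

// ... one of the LOAD statements from the script above goes here ...

// Write the elapsed time for this step to the script execution log
LET vStepElapsed = Interval(Now() - $(vStepStart), 'hh:mm:ss');
TRACE This step took $(vStepElapsed);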
Also, as stated above, it's hard to really know the impact of a suggestion without using your hardware and data, but I would try the following:
1) Separate Day, Date and Hour into different fields; this gives each field fewer distinct values and can decrease your memory footprint (see the sketch after this list). Probably only a slight difference, but it may add up for you.
2) All of your date functions return dual values; try putting a Num() around them so only the numeric value is passed to the next load. I'm not sure whether the overhead of the Num() will outweigh the removal of the dual value; only testing will tell.
3) You may want to try using numbers as your IDs rather than codes; 1, 2, ... may perform better than EST, GMT, ... Again, I would test.
4) Try saving the Temp resident table to a QVD, dropping it, and then loading it back into memory with the Where clause (see the sketch after this list). It may sound crazy, but it causes QlikView to compress the data and frees up memory. Loading a large resident table into another resident table is often a bad idea, because until you drop that Temp table (which won't be needed in the end) you're using twice the memory.
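A rough sketch of what 1), 2) and 4) could look like together (the file and field names here are placeholders, not from your actual script):
// 4): store the resident table to a QVD and drop it, so the data is
//     compressed on disk and the memory is freed before the per-day loads
STORE Temp INTO Temp.qvd (qvd);
DROP TABLE Temp;

// 1) and 2): on each reload, keep only the needed day and split the
//     timestamp into purely numeric fields (placeholder field names)
TempDay:
LOAD
    *,
    Num(Floor(dmDateHour)) as DayDateNum,   // date part only, as a number
    Num(Hour(dmDateHour)) as HourNum        // hour 0-23, as a number
FROM Temp.qvd (qvd)
Where DayDateHour >= $(vDateStart)
    and DayDateHour < $(vDateEnd);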
I don't promise any of these suggestions will work, but I would try each one and record the results. Please share the results; I would be interested to know whether each suggestion has a negative or positive impact. There are also other things to consider, like your hardware and software specs. I have seen machines where the QlikView load process slows down if the CPU affinity is set too high (this only started occurring after multi-threaded loads were introduced).
Hi,
is there a document or a blog around with general performance-tuning tips and tricks, i.e. everything one can do to make different parts of a QlikView document run faster?
Thanks a lot!
Best regards,
DataNibbler
P.S.: I've bookmarked one thread where someone put together a collection of such tips. Still, it seems like a good idea to put all of that into a document...