Forget about IntervalMatch

Report Inappropriate Content · ‎2009-12-09

Let's review different pros and cons.

Imagine that we have employee Vacations and a company Calendar, like this:

SET DateFormat='M/D/YYYY';
Vacation:
Load * Inline [
EmployeeId, StartDate, EndDate
1,1/1/2009, 1/15/2009
2,1/10/2009,1/14/2009
3,1/2/2009,1/4/2009
3,1/15/2009,1/25/2009
4,1/1/2009, 1/15/2009
];
Let vStartDate = num(date('1/1/2009'));
Let vEndDate = num(date('1/31/2009'));

Calendar:
load $(vStartDate) + RecNo() - 1 as Date
autogenerate vEndDate - vStartDate + 1;

left join
load Date,
 year(Date) as Year,
 month(Date) as Month
resident Calendar;

How to tie Calendar to Vacations? That's right - use IntervalMatch.

We can do this by different ways:

1)

left join (Vacation)
intervalmatch (Date) load StartDate, EndDate resident Vacation;

pros: no surrogate keys

cons: vacation record duplication for number of dates in the range

2)

VacationToDate:
intervalmatch (Date) load StartDate, EndDate resident Vacation;

pros: no duplications, cross-reference table, easy to read the logic

cons: surrogate key (hate surrogate keys)

3) But there is another way to do this without IntervalMatch. If you need something good, do it by yourself 🙂

left join (Vacation)
load StartDate
 ,EndDate
 ,autonumber(StartDate & '-' & EndDate) as DateRangeId
resident Vacation;

VacationToDate:
load distinct
 DateRangeId,
 StartDate + IterNo() - 1 as Date
resident Vacation
while StartDate + IterNo() - 1 <= EndDate;

pros: all pros mentioned above

cons: no

Good luck

johnw · ‎2009-12-10

Note that QlikView's synthetic key approach is doing something very close to what you're doing. You are both creating a unique ID for each combination of start date and end date. You are both tying two tables together using this new ID.

The difference is that you leave the start and end date on the vacation table. The synthetic key approach stores these dates on a separate table. So if fifty employees go on vacation together, you will duplicate the start and end date information for each employee, while the synthetic key approach will only have one copy of the information for all fifty employees. While I don't consider duplicate data of this sort to be much of an issue in QlikView, and indeed sometimes prefer to denormalize, you DID list "record duplication" as a con. So then it is a con of the method you provided as well, and one that you missed (or perhaps dismissed as irrelevant in QlikView, which I would normally agree with).

Now, we could avoid record duplication while still avoiding synthetic keys by doing this:

Period:
load distinct
autonumber(StartDate & '-' & EndDate) as DateRangeId
,StartDate
,EndDate
resident Vacation;

Vacation2:
load
EmployeeId
,autonumber(StartDate & '-' & EndDate) as DateRangeId
resident Vacation;
drop table Vacation;
rename table Vacation2 to Vacation;

VacationToDate:
load distinct
DateRangeId,
StartDate + IterNo() - 1 as Date
resident Period
while StartDate + IterNo() - 1 <= EndDate;

I believe you would then say this approach doesn't have a con. But look at the table structure. Our new Period table sits EXACTLY where the synthetic key table sat, and is serving EXACTLY the same role. But since we did everything EXPLICITLY, it's a "real" key instead of a "synthetic" key. Does that matter? Here it does not, or at least I can't think of any down side to leaving the key synthetic. Not all synthetic keys are bad, and I think this particular synthetic key is exactly what is needed. So really, all we've done is written a bunch of script to do explicitly what a much shorter and clearer script would have accomplished.

So in my opinion, the simple intervalmatch with no join is the best approach of those presented.

Report Inappropriate Content · ‎2009-12-10

You may be don't know, but autonumber generates the same number for the same content. To avoid duplication I use DISTINCT for VacationToDate. What duplication are you talking about? Run the script and check frequencies. This post was not about NF, if you want to discuss it we can open another topic. This Vacation/Period splitting depends on requirement.

Synthetic key kills more memory than autonumber. And for 10 millions vacation during 10 years this synthetic key will grab a half of your application.

In my real application by doing this I've reduced my QVW from 15M to 10M and performance was raised from 10 sec to 1 sec for a chart, may be less.

In my opinion the less in-memory application the better moreover if it is raising the application performance.

johnw · ‎2009-12-11


orka wrote:What duplication are you talking about? Run the script and check frequencies.

When I do that, I can see that both StartDate and EndDate are duplicated when people share the same vacation days, just like I suggested. Actually, it's WORSE than I thought, as it shows FOUR copies each, not the two I expected. Ah, that's caused by a script error. Where you have this:

left join (Vacation)
load StartDate
,EndDate
,autonumber(StartDate & '-' & EndDate) as DateRangeId
resident Vacation;

It needs to be this:

left join (Vacation)
load DISTINCT StartDate
,EndDate
,autonumber(StartDate & '-' & EndDate) as DateRangeId
resident Vacation;

That fixed, we're back to the two copies I would expect. "So if fifty employees go on vacation together, you will duplicate the start and end date information for each employee", specifically in the Vacation table. That probably wasn't clear.


orka wrote:This post was not about NF, if you want to discuss it we can open another topic.

OK, the word "denormalize" is unnecessary if you think it's off topic. We'll stick to your wording. In regards to your first intervalmatch solution, you say:

"cons: vacation record duplication for number of dates in the range"

In regards to your final solution, I can say:

"cons: vacation date range duplication for number of people with identical vacations"

Now the duplication in your final solution is much less significant, I'll admit. But it seems consistent to at least state it, particularly since it may affect OTHER data sets much more significantly. And if data duplication is a stated con, the synthetic key approach doesn't have this con. We could also solve it without using synthetic keys by using the script in my previous post. So two options for avoiding this con.


orka wrote:Synthetic key kills more memory than autonumber. And for 10 millions vacation during 10 years this synthetic key will grab a half of your application.
In my real application by doing this I've reduced my QVW from 15M to 10M and performance was raised from 10 sec to 1 sec for a chart, may be less.

THIS is what surprised me. I wouldn't have thought the synthetic key would cause problems AT ALL, much less problems of THIS magnitude. Seemed worth testing.

So I wrote a script that generates 10 million vacations for 500,000 employees over a 10 year period. I'm not sure what sort of chart you're using, but I threw together a big line chart that uses most of the data - for each date, how many employees are on vacation? I started with a fresh copy of QlikView to make sure nothing was cached, then opened up the minimized chart, then checked the CalcTime. I'm told it would be more accurate to write some sort of macro using activedocument.getapplication.getcpumilliseconds, but I figured that might be overkill. Going through the various solutions in order, here's the raw data:

load time file KB RAM KB CalcTime
intervalmatch and join out of memory on a 64-bit machine
intervalmatch synthetic key 5:19 200,477 387,080 116,312 ms
your solution fixed 15:26 29,253 106,924 22,938 ms
my solution 14:01 49,507 77,248 46,000 ms

There is a LOT of information to be gained here. We can see that the intervalmatch solutions ARE terrible, both with a join and with a synthetic key. We can see that your solution DOES solve the problem, both from a memory and chart calculation time perspective. My solution appears to do the same, with significantly less RAM but also with double the CalcTime for the chart. Might be specific to that chart, or it might not. Hard to know.

However, looking over my logs as well, I can see that the Period table in "my solution" compressed 10 million vacations down into about 50,000 distinct vacation date ranges. I just used random vacation dates, so in a real world application, you'd probably see even greater compression due to correlation between the vacation dates of different employees (clustering around holidays and weekends, for instance).

Now, your first two solutions joined your calendar either explicitly (left join) or implicitly (synthetic key) to the full table of 10 million vacations instead of to a MUCH smaller table of 50,000 distinct vacation date ranges. Then you proposed a third solution that explicitly created and then joined to a date range table instead of to the vacation table. This produced phenomenally less data, and therefore vastly increased performance and reduced memory requirements. So even from ONLY that perspective, it is no wonder that we're seeing such a big difference.

Still, we haven't nailed down which of these plausible factors is really causing the trouble - intervalmatch, synthetic keys, or joining to the "wrong" table. Fortunately, it's easy to check. "My solution" did away with the intervalmatch and the synthetic key just like yours, but used a STRUCTURE of the data that we could duplicate with an intervalmatch and a synthetic key. Like so:

Period:
LOAD DISTINCT
StartDate
,EndDate
RESIDENT Vacation
;
LEFT JOIN (Period)
INTERVALMATCH (Date)
LOAD
StartDate
,EndDate
RESIDENT Period
;

That leads us to this comparison:

load time file KB RAM KB CalcTime
my solution 14:01 49,507 77,248 46,000 ms
new intervalmatch synthetic key 3:27 49,401 77,160 44,797 ms

As we can see from this, staying away from intervalmatch and synthetic keys did nothing for us. It appears that the synthetic key mimics creating our own table tying together a date range ID with the start and end dates, just like I would expect. They both use similar amounts of memory, and take similar amounts of time to process, just like I would expect. The main difference is that letting QlikView do all the work via intervalmatch and a synthetic key allows us to load the data about four times faster. So I'd just let QlikView do the work.

Still, the chart in your solution loads twice as fast, so it seems that your data structure is probably better for this specific chart. It's possible that your data structure is better for your own real world data and charts as well. QlikView is often happier with certain kinds of data denormalizations. But regardless, the problem you were having wasn't with the intervalmatch or the synthetic key. The problem you were having was with joining data in inefficient ways.

TL;DR - After much testing and thought, I still think that the best general solution is a more normalized data structure with a synthetic key created by intervalmatch. The DateRangeId solution with duplicated StartDates and EndDates on the Vacation table may be better for specific charts if RAM and reload times are not issues.

Report Inappropriate Content · ‎2009-12-11

It's a really good test for all of these solutions. I did it on real application, but without measurements.

load time file KB RAM KB CalcTime
intervalmatch and join out of memory on a 64-bit machine
intervalmatch synthetic key 5:19 200,477 387,080 116,312 ms
your solution fixed 15:26 29,253 106,924 22,938 ms
my solution 14:01 49,507 77,248 46,000 ms
new intervalmatch synthetic key 3:27 49,401 77,160 44,797 ms

Your pointing to duplications of Start/End dates in vacation table is very good. I'll think about it. I've just fabricated the problem. The real one is more complex. And definitely, the performance is a main criteria for my application. I'm going to massage my solution with some of your hints. Thank you.

Report Inappropriate Content · ‎2009-12-11

Hello,

I just followed your interesting discussion about how to avoid intervalmatch, what kills my memory almost every day and I believe it's a problem of more than 10.000 records, but let me explain:

To keep close to your example, even it does not match from the content: I'd like to find out how often at the same minute my employees have been on vacation (to say it quiet frankly, ID 1 from 01/1/2009 07:00 till 15/01/2009 7:30 and ID 2 from 15/01/2009 7:00 to 20/01/2009 7:00) will generate 30 minutes were 2 employees have been on vacation in the same hour. For the hour from 07:00 to 08:00 the maximum would be 2, for the next hour 08:00 - 09:00 =1.

What I'd like to generate is a list which stats for every hour in the period we look at the max number of employees on vacation based upon the minutes. Hope you get this?! I know that this sounds a bit weird.

To do so I have created a "Minutecalender", that allows me to generate every minute in the period I look at.

Timefield :
LOAD ('$(varMinTime)' + ((rowno()-1)/1440)) AS timefield
AUTOGENERATE (1)
WHILE ('$(varMinTime)' + ((rowno()-1)/1440))<=$(varMaxTime);

Minutes:
LOAD
 timefield AS timefield,
 timestamp (timefield , 'DD.MM.YYYY hh:mm') AS MinCal,
 dual(time(floor(timefield, 1/24),'hh:mm') & ' - ' & time(ceil (timefield, 1/24),'hh:mm'), floor(frac(timefield),1/24)) as HourkalenderDate
RESIDENT timefield ;

What I usually do next is to do an intervalmatch on the Start and the End so that I generate for the same record the number of rows as it has minutes. And in a next step I count the ID per minute:

Data1:LOAD
 ID As IDmin,
 Year,
 Month,
 Weekday,
 Start,
 End,
 Date
RESIDENT Main.Data
WHERE "Year" >= 2009;

LEFT JOIN
INTERVALMATCH (MinCal)
LOAD
 Start,
 End
RESIDENT Data1;

Data2:
LOAD
 IDmin,
 Start,
 End,
 Weekday,
 Month,
 Year,
 Date,
 MinCal
RESIDENT Data1;

LEFT JOIN (Data2)
LOAD
 MinCal,
 Count (Distinct IDmin) AS Result
RESIDENT Data2
GROUP BY MinCal;

This generates a lots of records (one for each minute), but the information I need is the maximum per hour ('HourKalenderDate'). So I just need to get the maximum value (24 values for every day). This code just works with less then 10.000 records, what is in my data the year 2009. If I want to do this for 2008 (12.000 records) or even 2008 + 2009 (20.000+ records) together there will be the "out of memory message" during the intervalmatch calculation of the minutes.

The question is: is there maybe a solution to solve my problem without the intervalmatch function?

Best

Nico (who is still a newbie with QV)

johnw · ‎2009-12-11


orka wrote:Your pointing to duplications of Start/End dates in vacation table is very good. I'll think about it. I've just fabricated the problem. The real one is more complex. And definitely, the performance is a main criteria for my application. I'm going to massage my solution with some of your hints. Thank you.

Honestly, I doubt that the duplicated start/end dates in the vacation table are a problem. For my sample data, I'm guessing that that's why that solution took more RAM than the two that didn't do that, but I can't say for sure without looking at the .mem file, and that's probably unnecessary. And in any case, if duplicate data makes your charts go faster, it's probably a good thing instead of a bad thing. I denormalize a lot of data in QlikView to make reporting easier, even to make my data structure more clear. Normally, QlikView's compression does a pretty good job of handling it, and there's little concern for duplicated data being out of sync since there are no updates.

And yes, real world problems are usually more complex than these example problems we solve. You arrived at your solution through actual testing on your actual data. Perhaps what I consider "the best general solution" would help, but it could easily make things worse instead. My suggestion is more of a "this is what I'd start with, and then modify from there while testing performance" rather than "this is what everyone should do in all cases". Hopefully I didn't come across too strong, and I apologize if I did.

johnw · ‎2009-12-11


Droz112 wrote: The question is: is there maybe a solution to solve my problem without the intervalmatch function?

Well, I can't give this the attention it deserves at the moment, but yeah, I agree that intervalmatch is a bad solution. As you observed, this generates a lot of records.

We could consider your problem to be similar to an even bigger problem. Say that instead of minutes, you had timestamps. You wanted to be able to select one particular point in time, down to ten thousands of a second, and know how many employees were on vacation at that time. That is basically impossible to solve with intervalmatch, so it needs a different solution.

I know I've thought about this question before, but I can't remember if I've come up with any solution or not. I'll have to take a look through all of my sample files. If I don't have anything, I'll spend some more time thinking about it. Perhaps someone else will have solved it for you by then.

Report Inappropriate Content · ‎2009-12-14

Normal 0 21 false false false DE X-NONE X-NONE MicrosoftInternetExplorer4

Thanks for your valuable comment, John. If you don't mind I would appreciate if could check your sample files?

At the moment I am trying to solve this question by using a different way. Even the results are not appropriate at the moment, because of the raw data - it's the WAY I'd like to discuss. I am still using the IntervalMatch function but do not create a row for every minute of every record. In the example below I just use IntervalMatch to check if the Start time (Start2, what is similar to Start1) is in-between the Start time (Start1) and End time (End1) of any other record. Doing so I just create less rows then before and QV can handle the whole data without generating separate data sets (e.g. for the year 2008 and 2009, as noted above). I am not sure if the count function in the table Data5 is used appropriate. The result of count (End1) should represent the number of records what have been a start time (Start2) in-between the Start1 and End1. Since Start1 and Start2 possibly have a similar timestamp I use End1 for counting.

Data1:LOAD ID As IDmin,Year,Month,Weekday,Start,End,
Hour(Start) as StartHour, Date
RESIDENT Main.Data;

Data2:
Load
Start as Start1,
End as End1,
IDmin
Resident Data1;

Data3:
Load
Start as Start2,
Date as Date2,
Month as Month2,
Weekday as Weekday2,
StartHour,
Date,
IDmin
Resident Data1;


Data4:
Intervalmatch (Start2) Load Start1, End1 Resident Data2;

Drop Table Data1;

Data5:
LOAD
Start2, IDmin, Month2, StartHour, Weekday2
Resident Data3;

LEFT JOIN (Data5)
LOAD
Start2, Start1, End1
RESIDENT Data4;

LEFT JOIN (Data5)
LOAD
Start2,
Count(End1) AS CountX
Resident Data4
Group by Start2;

Drop table Data2;
Drop table Data3;
Drop table Data4;

MAXIMUM:
LOAD
StartHour,
Date,
dual(time(floor(StartHour, 1/24),'hh:mm') & ' - ' & time(ceil (StartHour, 1/24),'hh:mm'),floor(frac(StartHour),1/24)) as Hours
Resident Main.Data;

inner join
LOAD
StartHour, max(CountX) AS MaxPerHour
RESIDENT Data5
GROUP BY StartHour;

Even this way seems to be sort of a work-around; I am not satisfied with it. Any suggestions how to make this tangled mass of tables more comfortable?

Report Inappropriate Content · ‎2010-07-05

Hi Jonh and All,

Can some one please help me out here. I am facing the same problem here. I struggling to overcome from this problem. Hope some can help me out here. I have also posted the sample application in the below link.

http://community.qlik.com/forums/t/31125.aspx

QlikView App Dev