Not applicable

Reference Date & Dashboard Performance

Hello Everyone,

I had to create a daily follow-up of events, so I read the very interesting Reference Date tutorial.

As mentioned in the post, this method creates a lot of data, since it attaches a date to the event ID for every day the event exists.

Ex: If a ticket is opened on Jan 1st and closed on Jan 30th, the ticket ID will be replicated 30 times, with each day of January attached to it (30 rows).

I followed the tutorial and it works well, but I have two points:

1) In the tutorial, there is an option to link the Reference Date to a Master Calendar. What is the advantage of this?

2) The method works, but since the dashboard handles millions of events, the Reference Date field becomes very large, and my graphs have to go through it both as the time dimension and in the Expressions. That is why dashboard performance is very poor (close to 20 seconds to load a new tab).

My assumption is that this is because I created the Reference Date field in the script (so on the QV server), when it should be done beforehand (in the data lake, for example). Once the Reference Date field is created in the data lake, I would only need to load it in the script and then apply a simple "Count Distinct Event ID" in the Expressions of the graphs.

Can anyone confirm my assumption?

If you can think of any other way to improve the performance, feel free to share it.

Thank you!

M.

7 Replies
johnw
Champion III

If your events are active over long time frames, here's one way to reduce the number of rows and hopefully improve performance, but it's a bit complicated:

Re: While loop for creating dates between 1985 and 2016 - performance issues

Master calendars are just a fairly straightforward way to associate Months, Years, Quarters, whatever you want, with a date field. You can always add that data to your main table, and QlikView handles that pretty well, but I usually have a master calendar instead. In your case I wouldn't add that data to the main table, because with such a large number of rows it might be impractical in load time and memory usage.
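For illustration, a minimal master calendar could look something like the sketch below (just an illustration; MyEventTable and ReferenceDate are placeholder names, so substitute your own):

MasterCalendar:
LOAD DISTINCT
    ReferenceDate,
    Year(ReferenceDate)                AS Year,
    Month(ReferenceDate)               AS Month,
    'Q' & Ceil(Month(ReferenceDate)/3) AS Quarter    // quarter derived from the month number
RESIDENT MyEventTable;

It links to the main table only through the date field, so the derived fields are stored once per date rather than once per row.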

I'm not clear on what you mean about the reference date field being built in the script versus built beforehand. Either you create it yourself or you load it from some source, but either way the resulting data model in QlikView could be expected to be the same, and thus performance could be expected to be the same. But yes, if you have all those reference dates for the events, then you should be able to just count(distinct [Event ID]) in an expression.

Not applicable
Author

Hey John,

Thank you for your reply.

I read the link that you gave me. It seems very similar to what I have done for what you call the "DateKey".

Here is my script below. What it does is associate a date with an event_id for every day the event is "active"; this is the field I called "ReferenceDate". As you can see, I need daily granularity. Also, in the same table I create the Year & Month fields. FYI, EVENT_TABLE is the data source I loaded.

EVENT_DATES:
LOAD
    event_id,
    ReferenceDate,
    Year(ReferenceDate)  as Year,
    Month(ReferenceDate) as Month
;
LOAD
    Date(event_creation_date + IterNo() - 1) as ReferenceDate,
    event_id,
    event_creation_date,
    event_completion_date
RESIDENT EVENT_TABLE
While IterNo() <= ceil(event_completion_date - event_creation_date + 1);

The output is something like this with millions of rows:

Event_id    ReferenceDate
12345       01/01/2015
12345       02/01/2015
12345       03/01/2015
12346       04/03/2015
12346       05/03/2015
12346       06/03/2015

Now, what I do from this is use ReferenceDate as the time dimension on the x axis of my charts.

And then in the Expressions, I use simple expressions like, for example: if(ReferenceDate > event_completion_date, count(distinct event_id))

When I mentioned ReferenceDate being built beforehand, I meant that I could ask the data engineers to build this field upstream, in the data lake, and then all I would need to do is load it as a data source in the script.

So I was hoping that if I only imported this previously created ReferenceDate from the data source, and if the only thing I had to do in the Expressions for the graphs was a simple "if" and "count distinct", then performance would be good. But you seem to think it won't improve the current poor performance.

So I am hoping your solution might help !

But what I don't follow is your last step:

" And build a table that links all date keys to associated dates:

DateKey, Date
1985-01-01, 1985-01-01 "

For me, DateKey is ReferenceDate, so what is the purpose of building a new table linking all the associated dates, since they would be exactly the same dates as in your example?

Thank you for your time John !

johnw
Champion III

Thank you for the script. It looks very sensible, a classic model for this data, tried and true. But yes, with enough rows, I can understand it causing performance problems, and I'm not sure anything I've said will help with that.

Right, I don't think having the data engineers build the ReferenceDate will change performance at all, at least if I've understood your situation. Either way, you would end up with a table with fields event_id, ReferenceDate, Month, and Year, and with exactly the same number of rows and exactly the same data in each row. So it should perform exactly the same because it IS the same. QlikView can create the ReferenceDate very efficiently with the script you show, so it would be a waste to have the engineers add a new field.

You would be better off with a master calendar. Let's say you have 20 million rows and 2000 dates. Right now, you have 20 million copies of the Year and Month field cluttering up your main data table. If you move them into their own table connected only by ReferenceDate, you only have 2000 copies of the Year and Month fields. It will use less memory, though the savings probably won't be as good as you might think, and I don't know if it will improve performance. QlikView deals very well with denormalized data.
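A sketch of what that split could look like, using the table and field names from your script (only an illustration, adjust to your real load):

EVENT_DATES:
LOAD
    event_id,
    Date(event_creation_date + IterNo() - 1) AS ReferenceDate    // no Year/Month here any more
RESIDENT EVENT_TABLE
While IterNo() <= ceil(event_completion_date - event_creation_date + 1);

Calendar:
LOAD DISTINCT
    ReferenceDate,
    Year(ReferenceDate)  AS Year,
    Month(ReferenceDate) AS Month
RESIDENT EVENT_DATES;

The Calendar table links to EVENT_DATES only through ReferenceDate, so Year and Month exist once per date instead of once per event row.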

Here's the difference between your ReferenceDate and my DateKey, and it's really only meaningful if you have long date ranges spanning months or years. So let's say event 12345 occurred between November 28, 2014 and February 2, 2016. You would have this:

event_id ReferenceDate Month Year
12345    28/11/2014    11    2014
12345    29/11/2014    11    2014
12345    30/11/2014    11    2014
12345    01/12/2014    12    2014
12345    02/12/2014    12    2014
... hundreds of rows ...
12345    31/01/2016    01    2016
12345    01/02/2016    02    2016
12345    02/02/2016    02    2016

The structure I'm proposing as a possibility would look like this:

Events:
event_id DateKey
12345    28/11/2014
12345    29/11/2014
12345    30/11/2014
12345    12/2014
12345    2015
12345    01/2016
12345    01/02/2016
12345    02/02/2016

DateKeys:
DateKey    ReferenceDate
28/11/2014 28/11/2014
29/11/2014 29/11/2014
30/11/2014 30/11/2014
12/2014    01/12/2014
12/2014    02/12/2014
... 28 rows ...
12/2014    31/12/2014
2015      01/01/2015
2015      02/01/2015
... hundreds of rows ...
2015      31/12/2015
01/2016    01/01/2016
01/2016    02/01/2016
... 28 rows ...
01/2016    31/01/2016
01/02/2016 01/02/2016
02/02/2016 02/02/2016

Calendar:
ReferenceDate Month Year
28/11/2014    11    2014
29/11/2014    11    2014
30/11/2014    11    2014
01/12/2014    12    2014
02/12/2014    12    2014
... hundreds of rows ...
31/01/2016    01    2016
01/02/2016    02    2016
02/02/2016    02    2016

On the surface, it looks like it has even MORE data, as well as being much more complicated. And it is much more complicated, but in practice, the number of rows will be much lower in the proposed model. Your main table has shrunk from hundreds of rows to 8 rows. And while both your DateKeys and Calendar tables now have hundreds of rows, those rows do not include the event_id, so they will be reused across all event_ids. So they won't be very large.

But it may not improve performance. If your date ranges are large, it will significantly improve the amount of memory you are using, but that may not translate at all into improved performance unless the data model is so large that the computer is swapping to disk instead of processing entirely in memory. So if memory isn't an issue, it might not be helpful. And even if memory is an issue, adding memory may be a cheaper solution than creating and maintaining a more complicated data model.

With either data model, with ReferenceDate as a chart dimension, a simple count(distinct event_id) would tell you the number of events that were active on that date. Expressions from the current data model should produce the same results in the other data model.

johnw
Champion III

It's always possible that your performance problems lie elsewhere, too. For instance, if your actual expression is this:

count(distinct if(ReferenceDate > event_completion_date,event_id))

That would perform disastrously badly on such a big data model. It appears that you don't even need the condition, as all event_ids should have a ReferenceDate > their event_completion_date. But if you need a condition like this, it should either be baked into the data model or done with set analysis.

Not applicable
Author

Hey John,

First, a big thank you for the time you took to give me a very detailed reply.

I have to admit the data model you proposed seems complicated to me, and I am not sure I understand the purpose of the mix of different date formats in the DateKey field:

12345    12/2014

12345    2015

12345    01/2016

But anyway, I think the last point you made is, as you pointed out, where the reason for the poor performance lies:

I actually have around 10 graphs where I use similar calculations with the ReferenceDate, like:

  • To display non-closed events carried over from one day to the next, I do something like:
    • count(distinct if(ReferenceDate > event_creation_date and ReferenceDate < event_completion_date, event_id))
  • To display new events created on a specific day:
    • count(distinct if(ReferenceDate = event_creation_date, event_id))

So, based on what you say, I am trying to think of another way of calculating the graph expressions.

But I am not sure how to do it. Would creating the logic in the script and then applying a 1 or 0 type flag work?

For example, in the script I would have: if(ReferenceDate = event_creation_date, 1, 0) AS flag_new_event

And then in the expression I would have: count(distinct if(flag_new_event = 1, event_id))

I cannot test it now, but if you think this method would significantly improve the performance, I will try it out as soon as possible.

Thank you once again. Let me know!

johnw
Champion III

OK, so just to be sure I understand, your data model has ReferenceDates per event_id ranging from the event_creation_date to the event_completion_date, inclusive.

If you have ReferenceDate as a dimension in your chart, you can just count(event_id) to tell you all events active on each day. This should perform pretty well.

But then you have other cases, such as counting events created on that day, for which you count(distinct if(ReferenceDate = event_creation_date,event_id)). The performance of these expressions should be horrible. Any time you combine a large number of rows with an if() inside your aggregation function, you're toast. This is probably the true root of your problem.

The solution you're looking for is set analysis, but you're also going to need the flags you mentioned. So yes, create your flag_new_event. But then write your "if" in set analysis notation:

count({<flag_new_event={1}>} event_id)

That should perform quite well. As well as possible for such a large table, anyway.
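As a sketch of how that could fit into your existing script (flag_carried_over is just an illustrative name for the second flag you described; adjust to your real fields):

EVENT_DATES:
LOAD
    event_id,
    ReferenceDate,
    If(ReferenceDate = event_creation_date, 1, 0)   AS flag_new_event,
    If(ReferenceDate > event_creation_date
       and ReferenceDate < event_completion_date, 1, 0) AS flag_carried_over
    // Year/Month omitted here; they can live in the master calendar instead
;
LOAD
    Date(event_creation_date + IterNo() - 1) AS ReferenceDate,
    event_id,
    event_creation_date,
    event_completion_date
RESIDENT EVENT_TABLE
While IterNo() <= ceil(event_completion_date - event_creation_date + 1);

And then the chart expressions become, for example:

count({<flag_new_event = {1}>} distinct event_id)
count({<flag_carried_over = {1}>} distinct event_id)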

To explain the difference, if you do a count(if()), QlikView evaluates it by going through every row in the table (or for the dimension value, anyway), and then comparing them to the if() one by one. If you do a count({set expression}), QlikView uses the same very efficient engine it uses for selections. In the case above, it "selects" a value of flag_new_event = 1, removes all other rows from consideration, and then evaluates the count. It can do this quite efficiently.

And though it's probably irrelevant to your problem, let me try to explain the purpose of the mix of different date formats in my DateKey field. The purpose is to reduce the number of rows in the primary table. In the posted example, the average number of dates per Key is about 4000, but the average number of DateKeys per Key is only about 40. The data model only has about 1/100 the number of rows in the primary table as a simpler data model would have. Let's say your date range includes all of 2015 in it. It might have more dates, but as long as it has at least ALL dates from 2015 in it, we will replace all 365 rows in the main table with a single row, with a DateKey value of 2015. A separate table will explode 2015 out into all 365 dates, but it only has to do so once, no matter how many key values we have in the original table.
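If it helps to picture the mechanics, here is a rough sketch of how the DateKeys bridge could be generated once a Calendar table of all dates exists. Splitting each event's date range into year, month and day keys is the harder part and is covered in the thread I linked; the table names, and the assumption that DateKey is stored as text in these exact formats, are only for illustration:

DateKeys:
LOAD ReferenceDate,
     Text(Date(ReferenceDate, 'DD/MM/YYYY')) AS DateKey    // day-level keys map to themselves
RESIDENT Calendar;

Concatenate (DateKeys)
LOAD ReferenceDate,
     Text(Date(ReferenceDate, 'MM/YYYY'))    AS DateKey    // month-level keys map to every date of the month
RESIDENT Calendar;

Concatenate (DateKeys)
LOAD ReferenceDate,
     Text(Year(ReferenceDate))               AS DateKey    // year-level keys map to every date of the year
RESIDENT Calendar;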

Not applicable
Author

Hey John,

Sorry for the delay in my response.

Thank you very much for your advice. I will try out Set Analysis, and once I can see whether it improves the performance, I will let you know!