Discussion Board for collaboration related to QlikView App Development.
Next week I'll be working on a massive dataset (at least by my experience with QV): around 35 million records with 12 columns. I'd like to know the best practices for speeding up the document. Can someone please share good practices for efficient development?
I know this is a vague question, but it's really important for me...
Many thanks in advance.
This is on the same topic, so I'll go ahead and ask it here rather than making a new thread. I've got a 2.35 GB .csv with ~10 columns and something like 22 million rows, and I load from it every day:
//already tried removing amount_due, but it still ran out of memory without even trying
LOAD
    @2 as 'date',
    sum(@3) as 'amount_billed',
    sum(@4) as 'amount_due'
FROM [...]
(txt, codepage is 1252, no labels, delimiter is ',', msq)
GROUP BY @2;
My laptop isn't very powerful (about 3 GB of RAM and a 2 GHz dual-core), so this usually takes about 15 minutes (the result only loads about 1,000 rows; if I tried to load all 22 million rows it would run out of memory). I figured my laptop could barely handle this and that one day the file would get too big, and I think that day has finally come.
Now when I try to reload (the EXACT same document I've been reloading for 6 months), it thinks for about 10 seconds and I get the out-of-memory error (for some reason it says it's allocating 512 MB, when it usually said something like 2 or 4 MB when it ran out of memory after working for a long time).
Does anyone have any tips at all on how to get this to reload (without getting a new computer)? If I created a .qvd on my home computer, I'm pretty sure my laptop could handle this easily, but as it is, all I have is a .csv. That makes me think there could possibly be some solution, but I'm just not sure what it is (and I've thought about this for quite a while...).
Any help would be beyond awesome.
Edit: I tried a FIRST 5 and a FIRST 50 load with all 10 columns and they both worked fine. Then I tried a FIRST 500 load and it instantly gave me the out-of-memory error (but allocating 256 MB this time). I guarantee I could load a .csv with 5 million rows just fine right now, but for some reason, since this one has 22 million, it won't even try to load the first 500? I don't get it.
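For what it's worth, one low-memory variation on the .qvd idea above is to aggregate the .csv in chunks rather than in one pass, using "header is N lines" to skip ahead and a WHERE on RecNo() to cap each chunk, then collapse the partial sums at the end. This is only a sketch under assumptions: the file name billing.csv, the 1,000,000-row chunk size, and the ~22M total are all illustrative, not the poster's actual values.

```qvs
SET vChunk = 1000000;   // rows per pass (assumed)
SET vTotal = 22000000;  // approximate row count (assumed)

// Pass 1: partial sums per chunk. Each pass re-scans the file from the
// top (there is no early exit), so this trades load time for memory.
// Chunks with identical fields auto-concatenate into one table.
FOR vOffset = 0 TO $(vTotal) STEP $(vChunk)
    Partial:
    LOAD @2      as date,
         sum(@3) as amount_billed,
         sum(@4) as amount_due
    FROM [billing.csv]            // assumed file name
    (txt, codepage is 1252, no labels, delimiter is ',', msq,
     header is $(vOffset) lines)
    WHERE RecNo() <= $(vChunk)
    GROUP BY @2;
NEXT vOffset

// Pass 2: collapse the per-chunk partial sums into final totals.
Result:
LOAD date,
     sum(amount_billed) as amount_billed,
     sum(amount_due)    as amount_due
RESIDENT Partial
GROUP BY date;
DROP TABLE Partial;

// Optional: persist the small result so future reloads skip the .csv.
STORE Result INTO [billing_agg.qvd] (qvd);
```

Because each pass only ever aggregates one chunk's worth of distinct dates, peak memory stays far below a single GROUP BY over all 22 million rows.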
So, almost a year after this post was written, I'm adding another comment, only because I'm now dealing with what I consider to be high volumes of data.
I have a raw trade fact table of 250 million rows, and the fact table is 210 columns wide. And yes, we've been through all of the "is this necessary" questions on the 210 columns; not a single one can be removed.
The reason for adding this comment is that I didn't see aggregation mentioned in the threads above as a constructive method to improve performance. It can obviously only be applied in the right environment, but in our case the management dashboard is not interested in trade-level data, so aggregation in the ETL stage reduces those 250 million rows to approximately 60 million.
This was our golden step, which produced a .qvw document that could actually be used.
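For anyone landing here later, the ETL-stage aggregation described above amounts to a GROUP BY load from the raw fact into a smaller QVD at the grain the dashboard actually needs. A minimal sketch; every table, field, and file name here is invented for illustration, not taken from the actual trade model:

```qvs
// Sketch: collapse trade-level rows to dashboard grain in the ETL layer.
// All names below are illustrative assumptions.
AggFact:
LOAD TradeDate,
     Desk,
     Portfolio,
     sum(Notional)  as Notional,
     sum(PnL)       as PnL,
     count(TradeId) as TradeCount
FROM [trade_fact_raw.qvd] (qvd)
GROUP BY TradeDate, Desk, Portfolio;

STORE AggFact INTO [trade_fact_agg.qvd] (qvd);
DROP TABLE AggFact;
```

The front-end document then loads only trade_fact_agg.qvd, so the 250M-row table never enters the user-facing app. Note that a QVD load with GROUP BY is necessarily non-optimized, which is why this belongs in a scheduled ETL document rather than the dashboard reload.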
A number of other things have also been done, and bear in mind that I have a rather complex document of some 30+ sheets with 2,500+ objects.
1. I didn't know this, but a variable defined with an = at the start of its definition is re-evaluated every time anything in the document changes. Use a dollar expansion rather than an =.
2. I also didn't know that a field event trigger on an "On Change" event is likewise re-evaluated every time anything in the document changes, so avoid this too.
3. At this document size, moving away from a star schema and using ApplyMap() to produce (as far as possible) a single very large fact table degraded performance so badly that the document was unworkable. We reverted to a star schema but removed the snowflaking.
4. At this document size, avoid any use of set analysis at all, and any use of IF(...) in expressions. Part of the reason for such a wide fact table is that we hold approximately eighty 1/0 flags in the fact to speed up expressions.
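To make points 1 and 4 concrete, here is an illustrative sketch (all names invented, not the actual trade document). A variable whose definition starts with = is recalculated on every change anywhere in the app; a plain definition expanded with $() is only evaluated where it is used. And a 1/0 flag built in the script lets chart expressions avoid both IF() and set analysis:

```qvs
// Point 1: variable definitions.
// Defined as  =sum(Sales)      -> re-evaluated on every click in the app.
// Defined as   sum(Sales)      -> inert text; evaluated only where an
//                                 expression expands it as  $(vSales).

// Point 4: precompute a 1/0 flag in the load (names are assumptions).
Fact:
LOAD *,
     if(Year = year(today()), 1, 0) as IsCurrentYearFlag
FROM [fact.qvd] (qvd);

// A chart expression can then be written as
//   sum(Amount * IsCurrentYearFlag)
// instead of either of these slower forms:
//   sum(if(Year = year(today()), Amount, 0))      // row-by-row IF
//   sum({<Year = {$(=year(today()))}>} Amount)    // set analysis
```

Multiplying by a 1/0 flag keeps the aggregation a single linear scan over the fact table, with no per-row branching and no set construction per chart.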
Hope this proves of some use to somebody...
Thank you very much for sharing. Very useful, and it doesn't matter how long ago this was posted, because QlikView is going more and more enterprise every day. Actually, some of the product features, such as Direct Discovery, are heading this way. I'm sure your points will help a lot of people as well.
Just one question about your scenario: which client are you deploying? Will users be on the IE Plugin or Ajax, and if Ajax, mobile or desktop? Take into account that event triggers do not work (as designed) in an Ajax environment. Actions will fire (i.e., when you click a button), but events will not.
Great suggestions Nigel.
But sometimes you need to use the "=" in variables to evaluate a condition at the document level. I'm curious why this has a performance impact, because QlikView takes the cost of calculation into account when caching data: if an expression puts a lot of overhead on the processor, it will be cached according to their proprietary caching algorithm, while a simple expression will just be recalculated, with QlikView using multi-threading for the calculations. But I still buy your points; as a rule of thumb, we should use "=" only when it's really required.
I'm really not sure about the issues relating to caching, but none of the variable expressions I had with an equals sign at the start was in any way complex; they were all very simple.
In a document of the size I'm dealing with, the impact of removing them and replacing them with dollar expansion cannot be overstated; it was massive. We had delays of many seconds just clicking a button that changed a variable and then produced a list box. The variable in question was not one with an equals sign at the start, but that is the point: every single click we made in the document re-evaluated every variable with an equals sign at the start.
In smaller documents with much lower volumes of data, I seriously doubt anybody would be able to tell this was going on, but when you get to larger volumes you're forced to chase down the smallest issues to improve performance.
There are also a couple of points on this thread that I haven't yet looked at, so I might be spending some time on those soon.
Deepak, I work with large datasets all the time (100m up to 1bn rows with 60-70 columns) and generally have no issues. Understanding the end goal is the key; you should easily be able to aggregate 35m rows down to achieve summary results.
Normal response time for queries on a 100m-row document should be ~1 second. Ensure your data is aggregated and standardised, using set analysis where possible.
The key, as Nigel said in part, is finding what gives you the optimal performance:
Order By: let the SQL server do it; sorting is extremely bad for QV on large datasets, in my experience.
Aggregation and Group By: test which performs better; generally I use QV, because the aggregation happens in a two-tier extract.
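A minimal sketch of the two-tier extract pattern mentioned above (all table, field, and file names here are assumptions for illustration): tier 1 is an extract document that pulls raw rows straight to QVD with no ORDER BY, and tier 2 is the user-facing app that aggregates from the QVD rather than from the database.

```qvs
// Tier 1 (extract app): raw pull to QVD, deliberately no ORDER BY.
Raw:
SQL SELECT order_date, region, amount
FROM sales;                         // assumed source table

STORE Raw INTO [sales_raw.qvd] (qvd);
DROP TABLE Raw;

// Tier 2 (front-end app): aggregate from the QVD, not the database.
Sales:
LOAD order_date,
     region,
     sum(amount) as amount
FROM [sales_raw.qvd] (qvd)
GROUP BY order_date, region;
```

Splitting the work this way keeps the slow database pull on a schedule of its own, and the front-end reload only ever touches the local QVD.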
Thanks Felim. My original question was posted 1.5 years ago, and I hadn't aggregated the data in that app. It works fine with 35m rows and no aggregation at all.
Thanks for your inputs.