Solved: *** 6 Weeks in to QV Development, 30 Million Recor... - Page 2 - Qlik Community

IAMDV · ‎2011-04-15

Dear All,

Next week I'll be working on massive dataset (atleast for my experience with QV) which is around 35 million records with 12 columns and I wanted to know the best practices to speed up the document? Please can someone share good practices for efficient development.

I know this is a vague question but this is really important for me...

Many thanks in advance.

Cheers!

johnw · ‎2011-04-15

(Edit: Oh, and TEST. I'm putting this at the top because it's very important. What you THINK is fastest may well not be fastest. What we tell you is fastest most of the time may not be fastest in your case. Try different approaches and SEE which one is fastest. I knocked my load time in half on a recent application by just continually playing with different ways of loading the data, even the ways I thought would be slower. Some of them were faster for my specific case, not slower. Everyone's situation is a little bit different.)

When loading from your QVD, make sure you're getting an optimized load. Doing so often requires some strange gyrations, because the only condition you can put on an optimized load is a single where exists(). So if you want something like this:

[MyTable]:
LOAD a bunch of fields
FROM MyData.qvd (QVD)
WHERE "Status" = 'Active'
AND "Date" >= yearstart(today(),-1)
AND match("Product",'A','B','C');

It will often be faster to break this up to get an optimized load. First, what's your most restrictive condition? That's the one we want to focus on first. Let's say it's the date condition. We can generate all the dates that we care about, and do an exists() on that. The other conditions can be added as an inner join after the fact. So something like this:

[MyTable]:
LOAD date(today()+1-recno()) as "Date"
AUTOGENERATE today() - yearstart(today(),-1) + 1
;
INNER JOIN ([My Table])
LOAD a bunch of fields
FROM MyData.qvd (QVD)
WHERE exists("Date")
;
INNER JOIN ([MyTable])
LOAD * INLINE [
Status, Product
Active, A
Active, B
Active, C
];

In my experience, the second version, while significantly more complicated, is likely to execute faster. On the other hand, it may also use more memory because it is temporarily loading records that you will later throw out. So which version you use may depend on whether your CPU time or RAM is more plentiful. Experiment to see what works best for you.


Miguel Angel Baeyens wrote:3.- Avoid IF() conditionals in any object expression, use instead set analysis.

And where reasonable, consider object model changes instead of even set analysis. In my experience, object model solutions are fastest of all, then set analysis solutions, and then everything else. And from a maintenance standpoint, I prefer seeing the complexity in my load script instead of in my tables. I'm not sure why it matters to me, since in our shop the same people are responsible for both script and charts. But in some shops, this might actually be important.


Miguel Angel Baeyens wrote:9.- Avoid RESIDENT loads. If needed, load twice from a QVD.

This may seem counterintuitive, but yes, in many or maybe most case, it seems that loading again from the QVD is faster than loading from a resident table. I've seen exceptions, but those are the exceptions, not the rule. It's a strange rule, but something to keep in mind for certain.

That's not to say "avoid resident loads" - I use them all the time. More like "if you have a choice between a resident load and an optimized QVD load, in my experience the optimized QVD load is usually faster". The exceptions tend to be when the resident table has already been significantly reduced in size and number of columns, I believe.

Ah, didn't realize this had been improved in version 10. We haven't upgraded yet. I'll have to keep an eye out for that.


CheenuJanakiram wrote:3. If needed, iron out synthetic keys (even create your own link tables) if you get them, as this might also minimise reload time. Although, QV can easily deal with synkeys, by taking this process in your hands and incorporating it in the script, you will prevent longer reload times.

I'll strongly emphasize the "if needed" part of this. IF it is appropriate for your databases to be linked by two fields, it is appropriate for that link to be modeled using a synthetic key. In my experience, as long as your data model is correct, converting synthetic keys to their link table equivalents does nothing but waste script time and use a TINY (basically negligible, but I thought I'd mention it) bit more memory and CPU time in the application itself.

To me, the main reason to remove synthetic keys in a good data model is simply practice. You should understand what the synthetic key IS, and a good way to get that understanding is to model the connection yourself rather than relying on QlikView to do it for you automatically. Also, people less experienced with data modeling and/or with QlikView may often find that synthetic keys are a symptom of an underlying data model problem that SHOULD be fixed. But in a case like that, merely replacing the syntheic key with a link table equivalent is not going to fix the underlying data problem. You have to think about it a bit more deeply than that.

More information on synthetic keys here:

http://community.qlik.com/forums/t/31028.aspx


CheenuJanakiram wrote:5. If need to use link field in count computation, then enter a "1 as Linkfieldcounter" in the relevant table. On front-end, go sum(Linkfieldcounter), you will get the same result as count(counterID) and it is less expensive to do a sum over 1 values.

I've seen tests go both ways in this regard. It seems to depend on the QlikView version. I haven't tested version 10, and don't remember which is faster in version 9. I typically do a count(distinct LinkField), but strongly suspect that's the slowest option of all. Basically, if you need a count of the keys, and the expression is slow, try both a sum() and a count(), and use the faster one for your version.

Miguel_Angel_Baeyens · ‎2011-04-15


John Witherspoon wrote:Ah, didn't realize this had been improved in version 10. We haven't upgraded yet. I'll have to keep an eye out for that.

I may have got that wrong (I'm not English native), but in one of the sessions of last Qonnections I took note that Brad Peterman from QlikTech agreed to one of the attendees when he said that RESIDENT loads in QlikView 10 perform as fast as (or equivalent to) the graphic engine rendering a chart. It's quite a lot.

But yes, basically, I mentioned the avoidance of RESIDENT loads when they have a considerable amount of records. Of course, what is a considerable amount will depend on the hardware, the analyses that will be done based on that data, the charts to be rendered...


John Witherspoon wrote:When loading from your QVD, make sure you're getting an optimized load.

The optimized LOAD is exactly where the power of loading from QVD lies. That's exactly what I meant and I forgot to mention that, and that's a very good point and the most important, since loading from a QVD file with filters, concatenations, formats... may be as slow as any other source, thus losing the advantage of the QVD idea itself.

Thanks John for your expert contribution, as usual!

johnw · ‎2011-04-15


Miguel Angel Baeyens wrote:The optimized LOAD is exactly where the power of loading from QVD lies. That's exactly what I meant and I forgot to mention that, and that's a very good point and the most important, since loading from a QVD file with filters, concatenations, formats... may be as slow as any other source, thus losing the advantage of the QVD idea itself.

Filters I don't see as a huge problem as I've had good success handling them as per the example above. But data transformation during the load would be a problem, and of course we often find ourselves needing to do that.

What try to do instead is put as much of that as practical into the QVD itself. Let's say I'm storing a QVD of employee information. Different people may want to see employee names in different ways in different applications. While I think it would be best to standardize, that isn't always practical. So as required by user applications, I might store ALL of the following values as separate fields in the QVD:

John
Witherspoon
John Witherspoon
John E Witherspoon
JOHN WITHERSPOON
John Witherspoon (12345)
Witherspoon, John
Witherspoon, John E

Now, with any luck, people can load the employees in whatever way is appropriate for their application, and avoid deoptimizing the QVD load or needing to do some after-the-load adjustments. Naming all those similar fields in an understandable way, of course, can be a bit of a chore. "Employee Name Last First With Initial", for instance. Fortunately, renaming fields during a QVD load doesn't cause problems.

Conveniently, for us the same people build the user applications and the QVDs, so there isn't really any guesswork going on by the QVD developers. When I find myself transforming QVD fields during the load in a user application, I can just decide on the fly whether to leave it there or push it back into the QVD.

So I think with a little work, you should be able to get an optimized load a good portion of the time.

rwunderlich · ‎2011-04-17

This is a great thread. Thanks to all.

Report Inappropriate Content · ‎2011-04-18

Learned lot of things from this. Thanks to everyone

IAMDV · ‎2011-04-18

John - Big thanks to you for the valuable information. This is one of the best posts I had read...

You are a star 🙂

montubhardwaj · ‎2011-04-27

Very nice thread with lots of information in it.

Thanks John and Miguel for your effort. You people are ROCKING ....

Report Inappropriate Content · ‎2011-04-27

This one is really helpful post......Thanks Evrybudy

Thanks

Report Inappropriate Content · ‎2011-09-14

A must read for all the Qlikkers......!!!

Thanks.

gandalfgray · ‎2011-10-04

Great tips in this thread!

Another thing you may consider:

don't include unnecessary details. A typical example is to read a date column which includes timestamp information from a sql database as it is, when you really are only interested in the date part of the column. Make sure to get rid of the time-part by using floor() or similar functionality before storing it in a qvd (if you are not interested in the time of course...)

*** 6 Weeks in to QV Development, 30 Million Records QV Document and Help Needed!!! ****

* 6 Weeks in to QV Development, 30 Million Records QV Document and Help Needed!!! **