Unique key - question - Qlik Community

Report Inappropriate Content · ‎2010-07-14

Hello,

I extract data from the source systems every night and append it to existing QVD. I would like to create unique key values during extraction instead of creating them every time I load them into a new app. If I got it right, Autonumberhash128 can create different results for the same input in different scripts.

Is the best way to create the key's while loading the QVD into the final app or is there a good way of creating unique keys?

Could num(hash(Field1 & Field2 & xxx)) be an option?

All input appreciated.

Cheers

🙂 Fredrik

Report Inappropriate Content · ‎2010-07-14

Hi Fredrik,

I have been playing around with this concept recently (although not with incremental loads).

In a test I created keys with the autonumber field for QVD's in multiple scripts. The fact that I created these keys in multiple script was a no-go for me because the keys were not the same and resulted in some undesired behavior. You have to be sure that creating autonumber is occuring within the same script to avoid this (unless someone has a solution for this).

I think you can achieve this by the autonumberhash128(Field1 & Field2 & xxx)

Report Inappropriate Content · ‎2010-07-14

Hello,

Reading in the forums and in the help where it says "...for each distinct hash value encountered during the script execution." I think that autonumberhash128 can produce different results in different scripts - can someone confirm this?

Cheers // Fredrik

johnw · ‎2010-07-14

Yes, a hash function CAN produce different results in different scripts for the same key value.

To explain how it happens, note that a hash function maps data to a set of possible output values. Most types of hash functions (and certainly the type being employeed by autonumberhash128()) CAN map two different input keys to the exact same output value. When this happens, they employ "collision resolution", and assign the SECOND key encountered a NEW value.

If this "second" key is then encountered FIRST in a different script, it will not collide, there will be no collision resolution, and it will be mapped to the same value as the "first" key from the other script. You then have two different scripts producing different values for the same key.

The chance of this happening should be EXTREMELY low with a good 128-bit hash function being applied to the type of data you're likely hashing in QlikView. But since problems are POSSIBLE, however unlikely, I would never use autonumberhash128() in this way, just like you more obviously would not use autonumber() in this way.

I personally think that the best way to create unique keys is to create them in your data source, before QlikView ever encounters the data. If that isn't practical, then I think you're stuck creating the keys only when you load everything into the final application, at least if you want 100% reliability. Well, unless you just want to concatenate all of the fields that guarantee uniqueness, and just leave them as a long string value. But that would be a very inefficient key.

rwunderlich · ‎2010-07-15

I think this discussion is mixing up the AutoNumber*() functions with the Hash*() functions.

The AutoNumber*() functions return an integer value -- 0 for the first unique parameter seen, 1 for the next and so on. So therefore can only be used if all values are created within the same script. The function is non-deterministic. The returned values are dependent on the order of the function calls, not the parameter values.

The Hash*() functions (hash128, hash160, hash256) are deterministic. They will always return the same value for a given set of parameters, regardless of the calling sequence. They can therefore be used to create the same values across different scripts.

Using a hash*() function, it is possible to generate a collision -- that is, two different parameter values may produce the same value. The higher the number of bits (Hashnnn) the lower the probability for collision. The tradeoff for lower probablility is increased Field size. There is information and robust disagreement available on the 'Net about collision probabilities. Some of our more math savvy member may chime in here.

The only absolutey safe compound key is a concatenated string: F1 & '|' &F2

When I can't afford the storage to store a compound key, I generate them as autonumber() keys on the fly when I'm loading data -- both incremental loads and application loads.

-Rob

Report Inappropriate Content · ‎2010-07-15

Hi,

Thanks guys really interesting input. I'm a bit surprised that QlikView doesn't provide a collision resistant hash function.

The reason I started to look into this is that I'm under the impression that it is better to have integers as key or is the compound/string key just as fast? If not, is it possible to say anything about how much faster slower each option is?

Cheers

Fredrik

rwunderlich · ‎2010-07-15


fredrik.tham wrote:I'm a bit surprised that QlikView doesn't provide a collision resistant hash function

I think Qlikview has provided good options, Autonumber is 100% collision resistant, with the restriction that all keys are created in the same script execution. The Hash*() functions are highly collision resistant and can be used across script executions

.


fredrik.tham wrote: I'm under the impression that it is better to have integers as key or is the compound/string key just as fast? If not, is it possible to say anything about how much faster slower each option is?

The primary performance difference between Integer and String keys is RAM utilization. Integers occupy must less storage than Strings. A consecutive series of Integers, such as is built using AutoNumber(), occupies almost no storage.

-Rob

johnw · ‎2010-07-15


Rob Wunderlich wrote:The Hash*() functions (hash128, hash160, hash256) are deterministic. They will always return the same value for a given set of parameters, regardless of the calling sequence. They can therefore be used to create the same values across different scripts.

I'm not sure if we're disagreeing, or I'm merely not understanding your point. I suspect the latter, since your next paragraph discusses that it is possible to generate a collision. But just to clarify for myself and others...

Let's say that 'Ann' and 'Bob' both return hash value 123, and the collision resolution would produce value 456. If you load Ann then Bob, then Ann = 123 and Bob = 456. If you load Bob then Ann, then Ann = 456 and Bob = 123. They won't return the same value regardless of the calling sequence. So hash functions cannot with 100% reliability be used to create the same values across different scripts.

However, if keys are not being specifically chosen to defeat the hash function, and if the hash function is a good one, I'd be willing to bet that no QlikView customer has ever or will ever have a hash collision. Even hash128() can generate over 10^38 values. These values should be "randomly" distributed. It is probably jillions of times more likely that a bug will cause you problems than you having a key collision using a properly-written hash function.

Actually, let me just quote myself from an earlier post on the subject.


John Witherspoon wrote:With a 128 bit hash function, you have a VAST number of possible key values, 2^128, or over 3.4*10^38.
Let's see, looking up the probability since it's been way too long since I had a probability class, and assuming I'm getting this right:
The probability of a collision for N random keys = 1 - (2^128)! / ((2^128)^N * (2^128 - N)!)
And that's, uh, really not going to be computable, directly, on any computer. So, OK, what about an approximation function?
The probability of a collision for N random keys is ABOUT 1-e^(-N^2/(2*2^128)) is ABOUT 1-2.7182818^(-N^2/2^129).
So let's say you have ten million random keys. The chance of a collision with an ideal hash algorithm is about 1-2.7182818^(-10000000^2/2^129) = 0 to as many decimal places as Excel will calculate. Microsoft's calculator seems to be indicating 0 out to 25 decimal places or so. In other words, the chance of collision is ZERO for all practical purposes.

Now, this is an overly-simplistic analysis assuming some things that simply aren't true - random keys (unlikely) and a perfectly "random" distribution from the hash function (not happening). The chances of a collision in practice are therefore higher, and I'm sure that people more math-savy than myself could give an even better answer. I still suspect the probability is extremely low.

But still, problems are POSSIBLE. So I wouldn't solve the "generated keys across multiple scripts" problem in this way. But mostly it's because it offends me mathematically, not because there would EVER be a problem in practice.