Hash128

Report Inappropriate Content · ‎2010-04-28

Hello,

We have table A linked to table B by a field (key). We create two separate QVD for table A and table B in separate QVW application

These keys are complex and I use hash128 function in each QVW.

But it seems hash128 algorith is different between QVW because some keys don't match (test have been made to verify key matching without hash128) when loading A.QVD and B.QVD in new QVW application.

Matching OK if table A and table B hash128 is made in the same QVW application.

Matching not the same if table A and table B hash128 is made in different QVW applications.

Have you ever noticed that ?

Best regards

Régis

johnw · ‎2010-04-28

Hash functions can involve collisions and collision resolution. If the keys are in two different documents, the collisions and collision resolution can result in different hash values for the same keys between the documents. I haven't tried this, I haven't noticed this, and I don't know that it's the problem in your case. But it COULD be the problem. And since it COULD be the problem, I simply wouldn't use hashing in this way. You're asking for trouble, whether or not it's the cause of your current trouble.

Report Inappropriate Content · ‎2010-04-29

Thanks a lot.

What does it mean "collisions" and "collisions resolutions" please ?

Régis

Report Inappropriate Content · ‎2010-04-29

I believe John is correct - you can never rely on the same values input to hashing algorithms resulting in the same value in different QVWs. The result is simply the next unused number for the key values and it will depend on eg the same retrieval sequence in the data buffers, same key values etc etc.

Whether practical or not I think you are just going to have to use concatenated values of the key fields.

Regards,

Gordon

Report Inappropriate Content · ‎2010-04-29

ok thanks

johnw · ‎2010-04-29


rgluzman wrote:What does it mean "collisions" and "collisions resolutions" please ?

Wikipedia can explain it much better and more thoroughly than I can:

http://en.wikipedia.org/wiki/Hash_function

But I'll give it a shot. You're mapping some big compound key, say 100 bytes, into only 128 bits (16 bytes). It is IMPOSSIBLE to map EVERY 100 byte value into 16 bytes. So the hash function is simply trying to distribute your values as evenly as possible, to assign a random-looking but easily-reproduced value to each input value. A good hash function will tend to do a very good job of mapping your keys to unique values, but this is ultimately an impossible task unless it has as many bytes available as your input. So sometimes, two different input values will map to the same output value. For instance, "ABC" might map to 5, but "DEF" might ALSO map to 5. That's a collision. Well, it can't assign 5 to BOTH of them because it must give you a unique value. So it uses a second algorithm to search for an available value that hasn't been used yet. So it might then map "DEF" to 7 because nobody has used 7 yet. That's the collision resolution.

And that's all just fine if everything's in one application. Even though it isn't what the hash function assigned, DEF will still always be mapped to 7 in that one application.

But what happens if both ABC and DEF are in different applications? Well, in your first application, ABC will map to 5. In your second application, DEF could map to 5 as well, because it's likely that nothing else will have taken the 5 spot in your second application. Now you have, I suppose you could say, an "unresolved collision".

With a 128 bit hash function, you have a VAST number of possible key values, 2^128, or over 3.4*10^38.

Let's see, looking up the probability since it's been way too long since I had a probability class, and assuming I'm getting this right:

The probability of a collision for N random keys = 1 - (2^128)! / ((2^128)^N * (2^128 - N)!)

And that's, uh, really not going to be computable, directly, on any computer. So, OK, what about an approximation function?

The probability of a collision for N random keys is ABOUT 1-e^(-N^2/(2*2^128)) is ABOUT 1-2.7182818^(-N^2/2^129).

So let's say you have ten million random keys. The chance of a collision with an ideal hash algorithm is about 1-2.7182818^(-10000000^2/2^129) = 0 to as many decimal places as Excel will calculate. Microsoft's calculator seems to be indicating 0 out to 25 decimal places or so. In other words, the chance of collision is ZERO for all practical purposes.

Now, your data isn't perfectly random, and the hash algorithm isn't some ideal thing that perfectly distributes your input data, so perhaps the chances are higher. However, it does still seem EXTREMELY unlikely that this is actually causing your problem. I STILL wouldn't use a hash algorithm in this way, because it can cause problems in theory, but it'll probably never cause problems in practice.

So we probably need another explanation for what you're seeing.

Related Topics