Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Hi,
I came across an existing dashboard with some composite keys at load script. To reduce the size and improve performance I thought to use autonumberhash256. I read in some blogpost that it has chances of collision.App is having 3 years of data with 35 million records.
I tried understanding hash function but I didn't got it.
Count(distinct keyname) of these keys are being used in calculations of some measures.
All keys are present in same table.
1) autonumberhash256(driver_name &'-'& "driver_id") as [Driver Name Key]
2) autonumberhash256(driver_name&vehicle_number&event_type&event_start_date_time&event_end_date_time)
as [Event Unique Key]
3)autonumber(country_name&'-'& pco &'-'&cluster_name &'-'& plant_code&'-'&month_year &'-'&driver_name
&'-'& transporter_name&'-'& plant_name &'-'& business_segment) as action_Key
Can I use autonumber or autonumberhash256 functions without any issues .
Please provide some suggestions on how i can tackle this thing.
Hi, I don't have the numbers but the probability of colision should be very low, the main advantadge of the hash version is that it's faster, but only if you use commas to separate the fields:
autonumberhash256(driver_name,"driver_id")
Because the costly operation is the concatenation of strings.
I usually use the autonumber, but as a statement, not as a funtion.
There is an excellent blog-posting about the risks of a hash-collision from Barry Harmsen which seems unfortunately not available for the public anymore: www.qlikfix.com/2014/03/11/hash-functions-collisions/ Like Ruben already mentioned is the probability of a collision very low and could be neglected for the most scenarios.
Beside this has a hash no benefits in regard to the performance because it creates large strings which need more resources as numbers. Therefore just using autonumber as statement like hinted would be fine and working well.
But I suggest to consider a change of the data-model because such composite key means that you want to link fact-tables (directly or per link-table) together. Very often this could be avoided by merging all facts per joins/mappings and/or concatenate into a single table and creating a star-scheme data-model which is also the official recommended data-model.
- Marcus
I think there is no general rule if a key-field is used as dimension or another field. It depends always on the data, the data-quality and the view-requirements which dimensions and appropriate developed expressions are expedient or not.
Hi, if discrepancy is something that worries you, use the autonumber statement.
And the count would be like any other key, it depends of the use, maybe just a count(AutoKey) returns an higher value than the facts because there could be a key in the dimensions table that doesn't have any fact, usually with just a selection or adding a filter to set analysis this kind of error is avoided.
Hi, I don't have the numbers but the probability of colision should be very low, the main advantadge of the hash version is that it's faster, but only if you use commas to separate the fields:
autonumberhash256(driver_name,"driver_id")
Because the costly operation is the concatenation of strings.
I usually use the autonumber, but as a statement, not as a funtion.
There is an excellent blog-posting about the risks of a hash-collision from Barry Harmsen which seems unfortunately not available for the public anymore: www.qlikfix.com/2014/03/11/hash-functions-collisions/ Like Ruben already mentioned is the probability of a collision very low and could be neglected for the most scenarios.
Beside this has a hash no benefits in regard to the performance because it creates large strings which need more resources as numbers. Therefore just using autonumber as statement like hinted would be fine and working well.
But I suggest to consider a change of the data-model because such composite key means that you want to link fact-tables (directly or per link-table) together. Very often this could be avoided by merging all facts per joins/mappings and/or concatenate into a single table and creating a star-scheme data-model which is also the official recommended data-model.
- Marcus
does count of such hash key will have some discrepancy and can we show such key in straight table as a dimension. I know key are typically used to link two or more tables but customer want to show them in tables also.
I think there is no general rule if a key-field is used as dimension or another field. It depends always on the data, the data-quality and the view-requirements which dimensions and appropriate developed expressions are expedient or not.
Hi, if discrepancy is something that worries you, use the autonumber statement.
And the count would be like any other key, it depends of the use, maybe just a count(AutoKey) returns an higher value than the facts because there could be a key in the dimensions table that doesn't have any fact, usually with just a selection or adding a filter to set analysis this kind of error is avoided.
Now I don't have doubt wrt use of Hash key's. Thanks for the valuable suggestions. 👍
Anurag