Skip to main content
Announcements
A fresh, new look for the Data Integration & Quality forums and navigation! Read more about what's changed.
cancel
Showing results for 
Search instead for 
Did you mean: 
kishore_p
Contributor

Is there a better alogorithm alternative to CRC32 ?

I'm currently using CRC32 to generate unique check sum values. I have learnt that, theoretically CRC32 could (in very rare cases à 1 in 2^32 cases) could produce a duplicate. Is there any better algorithm alternative to CRC32 that I could use?

Labels (1)
  • v7.x

3 Replies
manodwhb
Champion II

@kishore_p , you can try with MD5 or SHA-256 ,please check the below link to know more about these.

 

https://www.codejava.net/coding/how-to-calculate-md5-and-sha-hash-values-in-java

 

Anonymous
Not applicable

Hi,

 

   Could you please check below links? Apart from MD5, there are some other interesting options.

 

https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-un...

https://stackoverflow.com/questions/16122067/md5-vs-crc32-which-ones-better-for-common-use/16125537

 

Warm Regards,
Nikhil Thampi

Please appreciate our Talend community members by giving Kudos for sharing their time for your query. If your query is answered, please mark the topic as resolved

PhilHibbs
Creator II

All hash algorithms can produce duplicates. You are taking something with thousands or potentially millions of bytes, and squashing it down to a very small number of bytes, four to be precise. Those four bytes can only contain just over 4 billion different values, so if you hash more than four billion different documents then you are guaranteed to have a duplicate. If you hash a million documents, you have a 1 in 4,000 chance of a duplicate.

 

The only way to reduce the likelihood of a clash is to use a longer hash. CRC32 generates a 32 bit hash, MD5 generates a 128 bits. You might think that that only increases the range by a factor of four, but in fact it is much, much, much bigger than that. You go from a 1 in 4 billion chance of a clash to a 1 in 340 million million million million million million. So long as your source value is more than 128 bits, you would need to generate a million million million hashes per second for the entire existence of the universe so far before you had a noticeable chance of a clash.

 

CRC64 may be enough for you, that would reduce the chance of a clash to 1 in a million million million.

 

Note that I said "reduce". The very nature of hashing means that a clash is possible, it's just a case of making it sufficiently unlikely.