Magic hash maths: Dedupe does not have to mean high compute. Wait, what?

A new and deduping X-IO ISE 900 all-flash array has puzzlingly puny processors yet kicks out good performance when deduping. We wondered about that, and were pointed to a video of X-IO chief scientist Richard Lary presenting at a Storage Field Day in Denver earlier this year. The maths is complex but the points made are logical …

  1. Lysenko

    I won't pretend to understand buckets vs. blooms...

    ... but otherwise it makes sense. We had an analogous data storage issue and had to compare the aggregate efficiency of computing a fully unique hash for every item vs. a simple CRC32 and brute-forcing the occasional collisions. The latter was clearly the more efficient overall.

    The same sort of thing happens with many RDBMS indexing strategies. You accept that there will probably be hundreds of "John Smiths" and probably a few dozen born on the same day, so you use indexing to get you 99% of the way to your target and then table scan the rest.
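
    A minimal sketch of that CRC32-then-verify approach, assuming fixed-size blocks and Python's standard-library zlib.crc32 (the class and names here are invented for illustration, not the commenter's actual code):

    ```python
    # Cheap-hash dedupe sketch: CRC32 as the index, brute-force byte
    # comparison to settle the occasional collision.
    import zlib

    class CheapHashDedupe:
        def __init__(self):
            self.index = {}  # crc32 value -> list of stored blocks sharing that CRC

        def store(self, block: bytes) -> bytes:
            crc = zlib.crc32(block)
            candidates = self.index.setdefault(crc, [])
            for existing in candidates:      # brute-force the rare collisions
                if existing == block:        # a full byte compare settles it
                    return existing          # duplicate: reuse the stored block
            candidates.append(block)         # genuinely new data
            return block
    ```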

    1. Anonymous Coward
      Anonymous Coward

      Re: I won't pretend to understand buckets vs. blooms...

      The wrinkle is that a useful dedupe requires you to do raw block compares on all hash matches, so it's all the more important to minimise collisions because the cost of checking them is so high. In particular you need to winnow out multiple collisions fast; a hierarchy of cheap hashes with fail-early checks would be my instinctive choice.
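
      A hedged sketch of that "hierarchy of cheap hashes, fail early" idea: check a 32-bit CRC first, only compute a stronger hash if that matches, and only fall back to the raw block compare as a last resort. The tiering shown here is the commenter's suggestion, not X-IO's published design:

      ```python
      # Tiered match check: cheapest test first, bail out on the first mismatch.
      import hashlib
      import zlib

      def blocks_match(candidate: bytes, stored: bytes,
                       stored_crc: int, stored_sha256: bytes) -> bool:
          # Tier 1: a 32-bit CRC rejects almost all non-duplicates very cheaply.
          if zlib.crc32(candidate) != stored_crc:
              return False
          # Tier 2: a strong hash rejects nearly all accidental CRC collisions.
          if hashlib.sha256(candidate).digest() != stored_sha256:
              return False
          # Tier 3: the raw block compare is the only conclusive test.
          return candidate == stored
      ```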

  2. AMBxx Silver badge
    Paris Hilton

    unique mathematical hash

    I must be missing something here. Surely a hash can only be unique if it's the same size as the block of data? OK, you could reduce that with compression and encoding duplicates, but surely it's still too big to be useful?

    Help me out - what am I missing?

    1. Anonymous Coward
      Anonymous Coward

      Re: unique mathematical hash

      > I must be missing something here. Surely a hash can only be unique if it's the same size as the block of data?

      You're quite right that there are an infinite number of input documents, and only a finite number of hash values. What you are missing is probability.

      With a well-designed hash function, the distribution of hashes over input documents is uniform and (apparently) random. The probability of finding two documents with the same hash is vanishingly small.

      With a 256-bit hash, you would need to hash around 2^128 different documents before there is a reasonable chance that any two of your documents will have the same hash. (Google for "birthday paradox")

      Even if you are hashing one trillion documents per second, it would be about 10^19 years before you had hashed 2^128 documents. Plus you need somewhere to store those 2^128 hashes to compare them.
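
      For anyone who wants to check that arithmetic, a quick back-of-the-envelope in Python:

      ```python
      # Back-of-the-envelope check of the figures quoted above.
      docs_needed = 2 ** 128                  # birthday bound for a 256-bit hash
      rate = 10 ** 12                         # one trillion documents per second
      seconds_per_year = 60 * 60 * 24 * 365
      print(docs_needed / (rate * seconds_per_year))   # ~1.08e+19 years
      ```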

      So if you find two blocks in your storage array with the same hash, the probability is overwhelming that they *are* the same block. And if you are paranoid, you can always read them both and compare them before discarding the duplicate.

      (With a broken hash function like MD5 or SHA1, it's possible for an evil user to intentionally submit two different blocks to your storage array which have the same hash)

    2. This post has been deleted by its author

    3. Lennart Sorensen

      Re: unique mathematical hash

      Yes, hashes are clearly not unique (and hence the article is just plain wrong about that).

      Any dedupe system that assumes they are unique and doesn't verify the data is the same after getting a hash match is insane and should not be used. It doesn't matter how low the probability is, I do not want to risk my data getting destroyed because it just happens to have the same hash as some other data.

      And yes, there are programmers out there dumb enough to assume that hashes are effectively unique, because the probability of a collision is so insanely low, and then treat them as if they were actually unique. Some day their customer is going to pay for that mistake, and they might not know it for a long time after it happens.

      1. Adam 1

        Re: unique mathematical hash

        > And yes, there are programmers out there dumb enough to assume that hashes are effectively unique, because the probability of a collision is so insanely low, and then treat them as if they were actually unique.

        So what is your threat model here? That some 733T haxors precompute a collision with a document they think you may one day wish to store, then store it first, so that when you do eventually save and later retrieve that file you get nonsense?

        Remember if the real document is stored first then the colliding fake will not be stored.

        1. Paul Shirley

          Re: unique mathematical hash

          It's the risk of corruption by accident, playing Russian roulette with your data. However many chambers your gun has, 1 still has a bullet.

          Though a hacker could weaponise knowledge of that flaw for mischief.

          1. Adam 1

            Re: unique mathematical hash

            It is irrational to fear an accidental collision in even a broken hash like MD5. A deliberate pre-image attack is a possibility in other contexts like password storage, but such attacks are irrelevant in the context of dedup, because taking advantage of one would require you to predict something I haven't stored yet but will need to store.

            Why irrational? Because even if we used every computer ever made and dedicated their processing time to randomly creating documents for billions of years, the probability shows it would still be unlikely to occur. You are far more likely to suffer corruption due to bit flips induced by a solar flare during a write operation. Russian roulette is not a good analogy in this case. In that game your odds are much worse and there isn't a reward. The reward in dedup is a *lot* less write I/O than you would need if you did a subsequent comparison.
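
            To put rough numbers on "irrational": the birthday approximation puts the chance of any accidental collision among n random blocks hashed with a b-bit hash at roughly n²/2^(b+1). A quick illustration (the exabyte-scale figure is an assumption picked for the example):

            ```python
            # Birthday-bound estimate: P(any collision) ~ n^2 / 2^(b+1)
            # for n random blocks hashed with a b-bit hash. Illustrative only.
            def collision_prob(n_blocks: int, hash_bits: int) -> float:
                return n_blocks ** 2 / 2 ** (hash_bits + 1)

            n = 2 ** 48                        # roughly an exabyte of 4 KiB blocks
            print(collision_prob(n, 128))      # MD5-sized hash: ~1.2e-10
            print(collision_prob(n, 256))      # SHA-256: ~3.4e-49
            ```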

  3. Duncan Macdonald

    Hashes and duplicates

    For a dedupe hash to be useful it needs to be smaller than the data block (otherwise the data block itself would be a better index). This implies that there will be hash clashes where two or more different data blocks have the same hash value. Because of this a dedupe tool has to compare the data blocks when there is a hash match to avoid losing or corrupting data. Using a cheaper hashing algorithm may increase the number of hash collisions but will reduce the time spent calculating the hash. (E.g. using a 64-bit CRC as the hash instead of SHA-512 will slightly increase the number of hash collisions but will require less effort to compute.)

    1. Justicesays

      Re: Hashes and duplicates

      "Because of this a dedupe tool has to compare the data blocks when there is a hash match to avoid losing or corrupting data. "

      Sadly, most methods do not bother doing an actual block comparison.

      This is because the maths shows that a block collision, i.e. two different blocks having the same hash, is less likely than the disk being corrupted by multiple simultaneous bit flips that bypass parity/checksum checks.

      There is also the fact that dedupe systems are limited in the size of the data set they can dedupe, due to the ever-increasing hash lookup table.

      This less computationally expensive method might be one that does cheaper, less exact hashing plus full block comparisons when it gets a possible match.

      1. ibmalone

        Re: Hashes and duplicates

        The maths is right when you are dealing with normal data, but if there's potential for someone to be trying to corrupt the backup then either a cryptographically strong hash is needed, or block comparison is required.

  4. Anonymous Coward
    Anonymous Coward

    Dunno what he’s on about...

    All storage systems calculate a hash or CRC for error correction anyway. Two pieces of data with the same CRC may be the same. At a later time you go back, figure out whether they really are the same, and dedupe.

    No system is going to be dumb enough to hold up the OS while it is trying to write. Am I missing something? Even then: SHA-512 isn't *that* computationally expensive...

    1. dinsdale54

      Re: Dunno what he’s on about...

      It's not calculating the hash that's the hard work, it's scanning the hash table looking for matches for every incoming write. Say you have 1TB of data and 50% of the 4K blocks are unique. If you are using a 256-bit hash (SHA-256) that's around 8GB of space for the hash table, and you have to search it for a match for every incoming block. Hence the need for clever ways to minimize the above. Some implementations are much more efficient than others.

      That's an overly simplistic example but it points out where the issue lies.
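
      For what it's worth, the rough sizing works out (binary units assumed; the 8GB figure corresponds to keeping a hash for every block, about half that if only unique blocks are kept):

      ```python
      # Rough sizing behind the example above (binary units, illustrative only).
      blocks = 2 ** 40 // 2 ** 12                  # 1 TiB in 4 KiB blocks = 268,435,456
      hash_bytes = 256 // 8                        # SHA-256 digest size
      print(blocks * hash_bytes / 2 ** 30)         # 8.0 GiB if every block's hash is kept
      print(blocks // 2 * hash_bytes / 2 ** 30)    # 4.0 GiB if only the 50% unique blocks are
      ```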

      1. John Gamble
        Boffin

        Re: Dunno what he’s on about...

        "It's not calculating the hash that's the hard work, it's scanning the hash table looking for matches for every incoming write."

        Which, continuing this explanation, is where the Bloom Filter comes in (although reading the article it looks like they're tweaking the Filter for their own needs).

        Not a bad explanation here (Bloom Filters Explained), and here (Bloom Filters for Dummies).
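
        For readers who want the gist without following the links, a minimal Bloom filter sketch (parameters are illustrative, not X-IO's): it answers "definitely not seen before" or "possibly seen", so most brand-new blocks can skip the expensive hash-table search entirely.

        ```python
        # Minimal Bloom filter: k bit positions per item carved out of one strong digest.
        # A negative answer is guaranteed correct; a positive answer may be a false alarm.
        import hashlib

        class BloomFilter:
            def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 4):
                self.size = size_bits
                self.k = num_hashes
                self.bits = bytearray(size_bits // 8)

            def _positions(self, item: bytes):
                digest = hashlib.sha256(item).digest()
                for i in range(self.k):
                    chunk = digest[i * 4:(i + 1) * 4]          # 4 bytes per derived index
                    yield int.from_bytes(chunk, "big") % self.size

            def add(self, item: bytes) -> None:
                for pos in self._positions(item):
                    self.bits[pos // 8] |= 1 << (pos % 8)

            def might_contain(self, item: bytes) -> bool:
                return all(self.bits[pos // 8] & (1 << (pos % 8))
                           for pos in self._positions(item))
        ```

        A miss in the filter means the block cannot be a duplicate, so only "possibly seen" blocks pay for the real table lookup; the occasional false positive just costs one unnecessary probe.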

  5. John Smith 19 Gold badge
    Unhappy

    I thought about dedupe but only down to the file level.

    Obviously operating down to the block level is going to be more involved.

  6. foxyshadis

    Their trade secret route to reducing short-lived file overhead

    Making every hash default to all zero, and actually hashing dirty blocks for real during periods of lower disk contention or after a set time expires? Seems straightforward enough. (Obviously also communicating with the OS, though interesting possibilities if you could get the OS to send a Trim when a file is deleted.) That would suck for blocks that randomly do hash out to zero, but they just get put in the "sorry, you don't get dedup" bucket. Even a 32-bit key pretty much obviates any need to care about that, losing one billionth of a percent of theoretical efficiency overall.
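
    Purely as a sketch of that guess (not X-IO's actual design), deferred hashing with an all-zero sentinel might look something like this; the names and structure are invented for illustration:

    ```python
    # Sketch of deferred hashing: writes take a "no hash yet" sentinel,
    # and a background pass fills in real hashes during quiet periods.
    import hashlib

    UNHASHED = b"\x00" * 32                      # sentinel: block not yet hashed

    class LazyHashStore:
        def __init__(self):
            self.blocks = {}                     # block_id -> (data, digest or sentinel)

        def write(self, block_id: int, data: bytes) -> None:
            # Hot path: no hashing at write time, so short-lived data never pays for it.
            self.blocks[block_id] = (data, UNHASHED)

        def background_hash_pass(self) -> None:
            # Idle-time pass: compute real hashes for blocks still carrying the sentinel.
            for block_id, (data, digest) in self.blocks.items():
                if digest == UNHASHED:
                    self.blocks[block_id] = (data, hashlib.sha256(data).digest())
    ```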

    ZFS was an amazing feat of engineering, but "overengineered" doesn't even begin to scratch the surface. All of its competitors have struggled to achieve 90% of its efficiency while reducing the huge disk and memory footprint it requires, and it looks like X-IO might have really cracked open the nut.

    Sadly, this just means NetApp, EMC, or Oracle is going to buy them out and silo their tech forever.

  7. albegadeep

    Bucket or Bouquet?

    "Bouquet residence, the lady of the house speaking..." Hyacinth Bucket, Keeping Up Appearances

  8. BillXIO

    If you buy or are considering flash arrays, you should test X-IO's ISE900 series

    The benefit of this for X-IO ISE900 series Flash arrays is that it produces the same flash array effective capacity and performance at about 60% of the cost of competitive products. If you buy or are looking at buying flash arrays, you should get a quote from X-IO and compare. If what you see looks too good to be true, ask for a PoC system and test it yourself. Richie Lary has been doing amazing things in computer science since Gordon Bell hired him to work on the design of the VAX 11/780 when he was just a kid. (If you don't know what I'm talking about, look it up: http://archive.computerhistory.org/resources/text/Oral_History/Bell_Gordon_1/102702036.05.01.pdf). Richie is a legend, and still a kid at heart. As you can tell from the video, he loves this stuff. This is the second time I have had the opportunity to work with Richie. The first was when he was CTO at Compaq storage (DEC storage) and Howard Elias asked him to join the technical advisory board at StorageNetworks, where I was cofounder and CTO. Richie and Andy Bechtolsheim were both part of that group. That was fun! We are fortunate to have Richie inventing with us as we have re-invented and re-invigorated X-IO. If you haven't looked at X-IO lately, there has never been a better time. Bill Miller, CEO, Chairman, and investor at X-IO Technologies
