Git sprints carefully towards SHA-1 deprecation

Following the February controversy over whether or not Google's SHA-1 collision broke Git, its community has taken the first small steps towards replacing the ancient hash function. For context: the Chocolate Factory last month produced the first reproducible SHA-1 collision needing relatively low computing power – something …

  1. Deltics

    "it's no longer possible to prove that (for example) a hashed document is unique"

    It has *never* been possible to prove that. Quite the opposite. It is a mathematical certainty that a hash function that reduces an arbitrarily sized input to a fixed size smaller than the input itself will have collisions.
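
    One way to see the pigeonhole argument in action is to shrink the digest. A minimal sketch in Python (truncating SHA-1 to 16 bits purely for illustration) finds a collision almost immediately:

        import hashlib
        from itertools import count

        def tiny_hash(data):
            # Truncate SHA-1 to 2 bytes (16 bits) so collisions are easy to find.
            return hashlib.sha1(data).digest()[:2]

        seen = {}
        for i in count():
            msg = str(i).encode()
            h = tiny_hash(msg)
            if h in seen:
                print("collision:", seen[h], "and", msg, "both hash to", h.hex())
                break
            seen[h] = msg

    A full-width SHA-1 makes that search astronomically more expensive, but the counting argument is the same: more possible inputs than outputs guarantees collisions exist.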

    1. streaky

      Using two separate hash functions on the same doc would do it, to be fair. If you used, say, md5 AND sha-512, the chances of there being a collision for the same input in both functions are infinitesimally small. It's mathematically effectively zero.

      Yeah, it's not the intent of hashing in git, arguably a side effect to a certain standard though. I'd pay good money to see somebody produce a collision and the resulting file not be garbage, mind - a garbage file would be detectable in other ways.
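
      For what it's worth, a minimal sketch of that idea in Python (hashlib only; the helper name is just for illustration) - compute both digests over the same bytes and treat the pair as the fingerprint:

          import hashlib

          def dual_fingerprint(data):
              # A forger would have to collide MD5 and SHA-512 on the same input.
              return hashlib.md5(data).hexdigest() + ":" + hashlib.sha512(data).hexdigest()

          print(dual_fingerprint(b"hello world"))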

      1. frank ly

        @streaky

        All you've done there is change the probability by making it smaller. All previous arguments still apply. It's effectively zero for practical purposes, at the moment. It can't be mathematically zero because you calculated it to be non-zero.

      2. bazza Silver badge

        @Streaky,

        Google's demonstration of a SHA1 collision was two valid PDFs which hashed to the same value. That's a lot harder to achieve than just two arbitrary, unintelligible byte streams that hash the same, but it has been achieved now.

        It is a mathematical certainty that you could have two pieces of source code that compile and have the same hash, but it's still tremendously hard to do that.

        1. Androgynous Cupboard Silver badge

          @bazza

          I took apart the two PDF documents they created, and I believe they started with two files containing an arbitrary binary stream - in this case, a JPEG with an embedded binary blob. They then diverged the content of both files until they had the same hash.

          The two key points here are:

          1. Both files had to be modified. Creating two files with the same hash is different to creating one file with the same hash as another, and much easier.

          2. The JPEG embedded in the PDF has a binary blob which is of considerable length, and this blob was modified to engineer the hash collision. The nature of PDF means these modifications will still give a valid file, and I imagine you could say the same about any format which allows an arbitrary binary marker, e.g. TIFF, JPEG, PNG, but not something like XML or - and I'd want to confirm this before I staked my life on it - ASN.1 encoded X.509. So your point about modifying PDF being harder than modifying "two arbitrary byte streams" is true, but not by much, as PDF is allowed to contain arbitrary byte streams.

          Point 1 is the key and personally I think some of the panic on this one is not yet warranted. SHA-1 is badly damaged, but 6000 CPU years to create two files which demonstrate a hash collision does not make an attack vector. Not yet.

      3. Anonymous Coward

        @streaky

        Using a second hash function is no different than using a single hash function that generates a signature/digest of the same size. Running two different hash functions over an input is no different than running one hash function twice, with the second run using the same input except pre-encrypted with a fixed key.

        1. Anonymous Coward

          Re: Using a second hash function is no different than using a single hash [twice]

          I've just designed this hash function that ROT13's the first 32 bytes... :-)

          1. petef
            Coat

            Re: Using a second hash function is no different than using a single hash [twice]

            Hah, my scheme applies a second ROT13 to the first 32 bytes which makes it twice as secure.

      4. John Robson Silver badge

        To all those saying bah to the concept of dual hashing...

        The idea is that you reduce the target to the area of changes which collide in BOTH MD5 and SHA1 (other hashes are available). And that collision also has to make sense in terms of being a file - hopefully of the expected type.

        The odds do 'only' go down. But they've only ever been 'long odds' - so making them longer is exactly what we are trying to do...

        1. Frumious Bandersnatch

          > To all those saying bah to the concept of dual hashing...

          I think that those doing the pooh-poohing are imagining the scenario hash2(hash1(message)), and they're right to do so. If you use hash = hash1(message).hash2(message) then the strength of the resulting hash should be the product of the strength of both constituent hashes.
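
          A minimal sketch of the two constructions in Python (hashlib, with MD5 and SHA-1 standing in for the two hashes), just to make the distinction concrete:

              import hashlib

              msg = b"some message"

              # hash2(hash1(message)): the outer hash only ever sees a 16-byte digest,
              # so the combined output can never be stronger than that inner stage.
              chained = hashlib.sha1(hashlib.md5(msg).digest()).hexdigest()

              # hash1(message).hash2(message): both digests cover the full message,
              # so a forger must collide both functions on the same input.
              concatenated = hashlib.md5(msg).hexdigest() + hashlib.sha1(msg).hexdigest()

              print(chained)
              print(concatenated)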

          1. Anonymous Coward

            @Frumious Bandersnatch

            Yes, that's exactly what I was saying, except you managed to explain it much better than I did, using that functional notation!

      5. TheVogon

        "is infinitesimally small. It's mathematically effectively zero."

        Nope, can be calculated, not infinitesimally small, not effectively zero. But quite unlikely.

    2. Paul J Turner

      @Deltics

      I know what you mean but I'm wondering if there is an exception.

      Consider if you used ZIP or 7z as the hash algorithm.

      You end up with fewer bits (which you can pack to a fixed size) but there is only one possibility for the source file.

      You can even verify that by unpacking and comparing, which makes it a bit useless unless you can lock it down from tampering somehow, say with a hash function... ;-)

      1. bazza Silver badge

        Re: @Deltics

        @Paul J Turner,

        You end up with fewer bits (which you can pack to a fixed size) but there is only one possibility for the source file.

        If you fix the output size, that's no longer ZIP or 7z or any other compression. The only way you could fix the output size is to pad the output up to a given size, and then fail with an error if the input were too large for the compression output to fit within that given size.

        An interesting feature of a perfect compression is that the output bit stream is (if one did not know that it was a compressor's output) perfectly random.

        You can even verify that by unpacking and comparing, which makes it a bit useless

        I wouldn't say that was useless at all, at least not from the point of view of certainty. If you have a zipped up file, and someone else is claiming to have the same file, being able to do a bit comparison on the two uncompressed files is a mathematically certain test of equality in a way that is stronger than any hash test.

        What makes it useless is that it's an inefficient use of storage and bandwidth, and instead we use a hash function to allow us to be almost certain of equality.
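
        For reference, this is roughly what Git itself does when it turns content into an identifier - a sketch of the blob hash in Python (the real thing is git hash-object):

            import hashlib

            def git_blob_sha1(content):
                # Git hashes a short header ("blob <size>\0") followed by the raw content.
                header = b"blob %d\0" % len(content)
                return hashlib.sha1(header + content).hexdigest()

            # Should match: echo -n 'hello' | git hash-object --stdin
            print(git_blob_sha1(b"hello"))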

        1. Robert Carnegie Silver badge

          Re: @Deltics

          Practically it's useful to keep a zip file of an important data file or document as you received it. For instance - an Office document (old-compatibility Excel in particular) is liable to be changed by viewing it without "saving" it: presumably not intentional as Excel 2010 doesn't do this to its own files. If it's liable to be important to prove what you received and when, zip it. You also get a convenient CRC32 signature of the file, but you may have to beg for it. This isn't secure against serious attack or if you have a very large number of files to distinguish (say 2^30, or just every source file that has ever existed in Linux), but it serves for basic use.
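
          You don't actually have to beg for the CRC32 - Python's standard zipfile module will read it straight out of the archive (a sketch; 'received.zip' is a made-up filename):

              import zipfile

              # 'received.zip' stands in for the archive of the file as you received it.
              with zipfile.ZipFile("received.zip") as zf:
                  for info in zf.infolist():
                      # CRC-32 as stored in the zip's central directory.
                      print(info.filename, format(info.CRC, "08x"))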

          It's happened; we downloaded an important data set, then the compiler changed it without telling us, or anybody at all. How professional is this transaction? Well, the data set was in Excel... but we think we're professional. So when a partner asked why their later data download was different from ours, we had a riddle. But we also had document revision history. #smugface as it says in the commercial.

        2. Frumious Bandersnatch

          Re: @Deltics

          An interesting feature of a perfect compression is that the output bit stream is (if one did not know that it was a compressor's output) perfectly random.

          Not true, for two main reasons.

          First, you're not defining what "perfectly random" means. You have a general idea of what this means, but I'm pretty sure that you're not able to back that up with maths. Also, why the hedge "if you don't know it's a compressed file"?

          If you take a simple compressor that takes an input stream and does something like Lempel-Ziv (LZ77) compression, then you have a stream of output tokens that either encode a verbatim section or a pointer + length symbol that refers back to a previously-seen section of the file. Both have patterns that make them distinguishable from random data.

          Second, this "perfect" compression. Again, there's this terminology deficit, but a fundamental theorem of compression is that not all data can be compressed using a given compression algorithm. If a compression algorithm is to be reversible (and can accept arbitrary input) then some inputs will compress to smaller outputs (or the same size) while others will compress to larger outputs. These larger outputs (and even smaller outputs, to some degree) will tend to be at least partly systemic, meaning that some input symbols will appear verbatim in the output. The only thing that running something through a compressor proves is how good the compressor is at exploiting certain kinds of redundancy/structure in the input file, not how much entropy the input/output files have in absolute terms.
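
          That counting argument is easy to see empirically. A quick sketch with zlib: feed it data that is already random (or already compressed) and the output typically comes out slightly larger, not smaller:

              import os
              import zlib

              random_data = os.urandom(1_000_000)           # already incompressible
              text_like = b"the quick brown fox " * 50_000   # highly redundant

              for label, data in [("random", random_data), ("repetitive", text_like)]:
                  out = zlib.compress(data, 9)
                  print(label, len(data), "->", len(out), "bytes")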

          This false line of thinking (that compression outputs have to be random and uncorrelated with input) has been used before in attacks on SSL connections:

          https://en.wikipedia.org/wiki/CRIME_%28security_exploit%29

        3. Paul J Turner

          @bazza

          'Pad', 'pack', same difference - and a data length or end marker is used in some compression systems.

          Zip and 7z were just used for reference; you could consider the old disk-compression systems, which do output files obviously padded to the end of the last sector.

          Obviously the size of the resulting hash might make it impractical, but my point is that I think the best theoretically achievable lossless compression must be the minimum size possible for a guaranteed collision-free hash.

          Also, the bigger the file the bigger the hash, which might disqualify it if only considering fixed-size hashes.

      2. Deltics

        Re: @Deltics

        You cannot just take any arbitrary data processing algorithm and call it a "hash".

        Neither Zip nor 7z is a hash algorithm. Apart from the lack of fixed size in the output, neither actually even guarantees an output that is smaller than the input (not that this is a pre-requisite for a hash).

        You would have to pick a ridiculously large size for the pseudo-hash value to which to pad/pack all your zip outputs, and no matter how ridiculously large a size you chose, there would always be some input which could not compress into it.

    3. bombastic bob Silver badge
      Devil

      so basically to "prove" that a document is unique you'd at least need the file's size as well as the hash, and a 2nd hash using an independent algorithm [which could be as simple as a CRC]. And mathematically, even THAT has "fewer bits" than the original. So it's not really "proof".

      In any case, there's STILL a use for SHA1: as a secondary 'check' in verifying a file's contents. A crack to make the file size, SHA-256, _AND_ SHA-1 all match (and still have the code compile and do whatever malicious thing you wanted it to do) would be one hell of an accomplishment! Or, you could use MD5 or another "known to be insecure" method as the secondary method. Whichever.

      FreeBSD's ports system uses size, SHA256, _AND_ MD5 in this way, last I checked.
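
      A minimal sketch of that belt-and-braces check (Python; the function and file names are just examples) - size plus two independent digests, computed in a single pass over the file:

          import hashlib
          import os

          def describe(path):
              sha256 = hashlib.sha256()
              md5 = hashlib.md5()
              with open(path, "rb") as f:
                  # Read in 1 MiB chunks and feed both digests in one pass.
                  for chunk in iter(lambda: f.read(1 << 20), b""):
                      sha256.update(chunk)
                      md5.update(chunk)
              return {"size": os.path.getsize(path),
                      "sha256": sha256.hexdigest(),
                      "md5": md5.hexdigest()}

          # Example: describe("some-distfile.tar.gz")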

    4. Adam 1

      > "it's no longer possible to prove that (for example) a hashed document is unique"

      Agreed with OP. This is not the goal. It relies on the fact that the amount of hardware, CPU/GPU time, electricity and opportunity cost involved in "reversing" the hash back to the original bytestream means that a suitably resourced attacker wouldn't bother.

      Whilst it is much harder to hide junk bytes of your choosing unnoticed in source code than in a PDF file, it wouldn't be impossible to use comment blocks or constants.

  2. a_yank_lurker

    Purpose

    Security measures will never be absolutely perfect, as the various measures can be hacked with sufficient computing power and time. The key question is how much time it takes, with available or reasonably foreseeable computing power, to break measure X. If that time is very long relative to the value of the data, or it takes very long to produce hash collisions, then the measures are effective. But there will come a time when the measures will be easily shredded; it may be years, decades, centuries, etc. in the future.

    In the case of hash functions, how different the two documents that produce the same hash are is important. I do not know enough about the generation of hash values to know how likely the two documents are to be very similar - e.g. a minor change to a will, versus a will and a photo of the Eiffel Tower.

    1. phuzz Silver badge

      Re: Purpose

      Generally hash functions are chosen so that a small change between the files results in a large difference in the hash.
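
      A minimal demonstration in Python (SHA-1, since that's the hash under discussion) - change one character and roughly half of the digest bits flip:

          import hashlib

          a = hashlib.sha1(b"hello world").digest()
          b = hashlib.sha1(b"hello worle").digest()  # one character changed

          # Count how many of the 160 digest bits differ.
          diff_bits = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
          print(diff_bits, "of 160 bits differ")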
