9,223,372,036,854,775,808 SHA-1 calculations
to get a collision. It's not ideal, but I'd hardly call it a security hole. If they could publish the two documents, I'll try not to use them as passwords.
Google researchers and academics have today demonstrated it is possible – following years of number crunching – to produce two different documents that have the same SHA-1 hash signature. This proves what we've long suspected: that SHA-1 is weak and can't be trusted. This is bad news because the SHA-1 hashing algorithm is used …
They have published the two documents, I've downloaded them and confirmed their claims.
They are 422,435-byte PDFs, differing in 62 of those bytes; they have the same SHA-1 hash but show different contents (the background colour). I haven't checked whether there are any other differences in the display.
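For anyone who wants to repeat the check, a minimal Python sketch (the `shattered-1.pdf` / `shattered-2.pdf` file names are an assumption about what you saved the downloads as):

```python
import hashlib

def sha1_of(path, chunk_size=65536):
    """Stream a file through SHA-1 and return the hex digest."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def count_differing_bytes(path_a, path_b):
    """Count the byte positions at which two equal-length files differ."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return sum(x != y for x, y in zip(fa.read(), fb.read()))

# With the two published PDFs saved locally (file names are an assumption):
# sha1_of("shattered-1.pdf") == sha1_of("shattered-2.pdf")     # same digest
# count_differing_bytes("shattered-1.pdf", "shattered-2.pdf")  # 62, per the claim above
```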
> They are 422435 byte PDFs, differ in 62 of those bytes, have the same sha1 hash, and show different contents (the background colour).
Still not much of an attack, IMO. Unless someone can arbitrarily set the difference to something meaningful, it only proves that it's possible to overcome SHA-1. It's still really really REALLY improbable to produce a meaningful difference.
> It's still really really REALLY improbable to produce a meaningful difference.
Remember the bit in the article about this getting easier over time? That means that SHA-1 is broken and will only get more so until it is trivial to roll it like an old carpet. This attack is just the tip... not what you want to hear if your security depends on this algorithm.
You need to read up some of the discussions that took place when similar attacks were first identified for MD5 hashes. This is a serious problem - I could get you to digitally sign a contract; and if I can produce a similar document with the word 'not' inserted (say) and the same SHA-1 hash, you're in trouble. Similarly using SHA-1 to authenticate a valid web server and then make a slight change to the URL, but keep the hash the same ...
Right now it needs a massive amount of computing power to defeat SHA-1, but the NSA can probably do it already. And these attacks only ever get quicker - not just by Moore's Law, but because someone clever will read the paper and think of a way of doing it 10x faster.
But if you can insert "not" AND just stash away the "KJ"BIUE_D H£(*ERNY£" in a garbage area, you're sorted.
What I want to know is whether this is more than a collision: is it a preimage attack (or, more severely, a second-preimage attack that found a collision with a specific target)?
> But if you can insert "not" AND just stash away the "KJ"BIUE_D H£(*ERNY£" in a garbage area, you're sorted.
While true, the likelihood of doing this AND getting a collision is highly improbable. Yes, it's only a matter of time, as many have pointed out. But then again, our solar system only has a matter of time before the Sun engulfs it. What's important isn't what is eventual. It's the time scale on which it will happen. For now, that time scale is still very far out for turning this finding into something meaningful. I doubt it happens in my lifetime, even with the looming presence of quantum computing around the bend.
Re: "While true, the likelihood of doing this AND getting a collision is highly improbable"
As long as you can pad the document in addition to making the required changes, AND making the change results in a financial advantage for you of more than US$130,000, I would be reluctant to call this highly improbable.
Based on previous hashes, the discovery of collisions has led to more weaknesses being found, and usually to large collision spaces within a hash function that need to be avoided. What costs ~US$130,000 today will likely cost less than US$10,000 within 5 years.
Does it mean we have to throw away existing SHA-1 hashes? Probably not unless the financial incentive to attack them exists.
Does it mean we need to start addressing SHA-1's discovered weaknesses by deploying SHA-2 and later hash functions now for verifying important documents or software versions? Definitely - mainly because if we don't, it just doesn't happen * stares at MD5 hashes *
Someone mentioned financial websites - all current browsers already require SHA-2 hash functions on certificates since January 2017. The only real exception I am aware of is old and cruddy Java 7 (or earlier) apps that refuse to upgrade or to die.
Why does the size have to be identical? As long as the artifact is a believable size you can launch an attack.
Imagine an executable file download through some sort of update mechanism that uses sha1 to validate the binary before executing it. No-one will notice that the 64MB upgrade.exe is 25KB larger. But now the attacker has replaced the intended payload with their own. It would be interesting to know about the collision algorithm. Like does it require the two files to be nearly identical? Does a large file take longer to generate a collision?
Just a few thoughts.
PDFs are compressed containers, right? That being the case, you can change quite a lot of the payload; all you have to do is ensure the compressed output of your manipulated file has the same size and hash. The input to the hash generator has already been transformed in a way that essentially obfuscates the source document.
Of course, that is a narrow application but still seems practical. Going a similar route, look at any compressed file you might download: .zip, .7z, .tgz, etc. For full confidence a check for all parts should be applied, not just the download but the individual hashes of each and every part. For instance, the hash of the archive, the hash of all parts combined, the hash of each individual file, hunk, etc. Doing so gets heavy and resource expensive rather quickly.
More broadly, think of the Open Document formats which are zipped containers, among others.
But what of TLS sessions? Consider that a spook or nefarious agency (arguably one and the same) has a packet capture of a TLS-encrypted session signed with a SHA-1 certificate. We already know sessions signed by a certificate generated with poor entropy (the debacle from a few years back) can be undone. $130k is nothing to throw at this for such agencies - maybe even enough to get the GPU calculation requirements down to something more reasonable than a year (how much was paid for the San Bernardino iPhone hack?). Whatever was in that session is at risk, be it an email, web search, forum posting, or penis-enhancement purchase confirmation page.
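The per-part hashing idea above can be sketched in a few lines. A minimal Python illustration, using SHA-256 rather than SHA-1 per the thread's own advice (the helper name and the `__archive__` key are my own, not a standard):

```python
import hashlib
import io
import zipfile

def hash_archive_members(zip_bytes: bytes) -> dict:
    """Hash the container AND every individual member, not just the download."""
    digests = {"__archive__": hashlib.sha256(zip_bytes).hexdigest()}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            digests[name] = hashlib.sha256(zf.read(name)).hexdigest()
    return digests

# Build a toy archive in memory and hash all of its parts.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("contract.txt", "I will pay 100 pounds.")
    zf.writestr("notes.txt", "nothing to see here")
print(sorted(hash_archive_members(buf.getvalue())))
# ['__archive__', 'contract.txt', 'notes.txt']
```

As the comment above says, doing this for every hunk gets expensive quickly, but it does mean an attacker has to find a collision per part, not just one for the whole container.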
If you are checking for tampering, checking the file size is a lot easier than checking the SHA1.
That assumes that you know what the size should be. If you have two documents with the same apparently correct signature you know that one is a forgery, but you have no idea which.
Checking the size does not tell you that -- unless you have some out-of-band mechanism for obtaining the expected size of the genuine document ... and how do you check the veracity of the out-of-band data? Well, it'll be signed, using a hash ...
To verify the hash meaningfully, no matter what kind is used, you need to know what the output of your hash computation should be. That is no different from knowing what the size should be. That you might need to get that out-of-band may not be a major objection.
> To verify the hash meaningfully, no matter what kind is used, you need to know what the output of your hash computation should be.
We're talking about signed PDFs, here. You know exactly what the hash should be, because it is part of a signature block that's been signed with the document author's private key.
[The article is about weakness in the hash process, not about any shortcomings of PKI, so the validity of the hash in the signature block can be taken as a given.]
> That is no different from knowing what the size should be.
It's very different, because the size of the document is not part of the signature block.
> What I want to know if this is more than a collision but a preimage attack (or more severely, a SECOND-preimage attack that found a collision with a specific target).
One presumes so.
If they were just going for birthday collisions, I would think that the chances of getting two random documents which are both valid PDFs with readable text in them is rather low.
OTOH, the chances of it being a valid Perl script are quite high :-)
>Yes, it's a problem if you can insert 'not'. It's less of a problem if you can only produce the collision if you insert 'KJ"BIUE_D H£(*ERNY£'
Hey, you know that, when it comes to PDF, you could place 'KJ"BIUE_D H£(*ERNY£' behind the image or something ...
Remember, this is bytes, so you want the hash of 000011100010001xxxxxxx0001000111001100[...] to equal the hash of 00001111101000100100010001000111001100[...], where x can be anything because it is located behind an image, or the font colour is white or the same as the background, etc. ... you name it. This has now been proven to be feasible - expensive, $130k or such, but feasible - and it will only get cheaper. In 10 years' time, you and I might be able to do it over a weekend for the lulz: better hw, better algo to compute this ... ;-)
You could also say the random bits you want are for an image with total transparency ... you know, the sky's the limit.
@Hans 1 You are almost right *but* what they can do is get
000011100010001xxxxxxx0001000111001100[...] to equal the hash of 000011111010001yyyyyyy0001000111001100[...], where x and y can be anything.
This is the difference between a collision (which we now have), and a second pre-image attack (which we don't have - yet).
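The collision/second-preimage distinction is easy to feel on a deliberately weakened hash. A toy Python sketch (truncating SHA-1 to a handful of bits so the searches finish instantly; all names are illustrative): a birthday-style collision takes roughly 2^(bits/2) attempts, a second preimage roughly 2^bits.

```python
import hashlib
from itertools import count

def tiny_hash(data, bits=24):
    """SHA-1 truncated to `bits` bits - a deliberately weakened stand-in."""
    digest = hashlib.sha1(data).digest()
    return int.from_bytes(digest, "big") >> (160 - bits)

def find_collision(bits=24):
    """Birthday search: ANY two inputs sharing a tiny hash (~2**(bits/2) work)."""
    seen = {}
    for i in count():
        msg = str(i).encode()
        h = tiny_hash(msg, bits)
        if h in seen and seen[h] != msg:
            return seen[h], msg
        seen[h] = msg

def find_second_preimage(target, bits=16):
    """Second preimage: match one SPECIFIC target's hash (~2**bits work)."""
    goal = tiny_hash(target, bits)
    for i in count():
        msg = str(i).encode()
        if msg != target and tiny_hash(msg, bits) == goal:
            return msg

a, b = find_collision()  # quick: only thousands of hashes at 24 bits
print(a != b, tiny_hash(a) == tiny_hash(b))  # True True
```

Scale those exponents up to real SHA-1 (160 bits) and the gap between the two attacks is the whole point of the thread: the researchers did the 2^(n/2)-ish job, not the 2^n one.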
>"But if you can insert "not" AND just stash away the "KJ"BIUE_D H£(*ERNY£" in a garbage area, you're sorted."
> "Hey, you know that, when it comes to PDF, you could place 'KJ"BIUE_D H£(*ERNY£' behind the image or something ..."
It seems to me this is an argument for deprecating the pdf format (and anything else which has undefined buffers or hidden data) for contracts and anything else which needs to be digitally signed, to be honest.
If the document was in a form where you couldn't tweak it by tinkering with effectively 'wild' bits, it would be enormously harder to produce a document of the same size.
Larger documents would still be almost as easy, of course, so you'd need to add an exact filesize to the hash.
No, sorry, it's not that easy.
For a very long time it's been known that a SHA-1 collision could potentially occur. It was just that no one had found one.
While a random collision can occur, creating a specific modified document and then figuring out how to create a SHA-1 hash equivalent that is also the same size? Not that easy.
"Still not much of an attack, IMO. Unless someone can arbitrarily set the difference to something meaningful, it only proves that its possible to overcome SHA-1. It's still really really REALLY improbable to produce a meaningful difference."
Sometimes, just a few characters can be enough to change the whole meaning of a document, such as the inclusion or exclusion of a single "not".
"Sometimes, just a few characters can be enough to change the whole meaning of a document, such as the inclusion or exclusion of a single "not"."
But again, they have to be meaningful characters you've changed. You have to include/exclude that "not" and *then* run through 9 quintillion variations of meaningless padding at the end of the document and run the sha1 calculations until you get a collision.
Not saying it's impossible, we've always known that. It's just becoming less-difficult. Which is why we're already migrating to SHA-256.
"
Sometimes, just a few characters can be enough to change the whole meaning of a document, such as the inclusion or exclusion of a single "not".
"
The point is that after inserting the "not" you will have to change a large amount of data in addition in order to achieve the same hash. In a text document those changes will not make any sense, and so the change would be exposed. This can only work on data that can have many changes applied without being noticed - e.g. a BMP or JPG image where lots of bytes can be changed but still leave a recognisable image.
In a certificate or other small or well-structured file with few bytes that could be changed without invalidating the format, it will not be possible to create a viable but altered file with the same hash.
> It's still really really REALLY improbable to produce a meaningful difference.
Not really any more improbable than the demonstration. If you can make changes to the background colour (say) until you get the hash you want, what's to stop you changing the content to whatever you want to say and then tweaking the background colour until the hash matches that of the original document? The difficulty must be about the same as the demonstration.
What I'd like to see is an estimate of the cost of those 9,223,372,036,854,775,808 sha1 calculations and umpty-bazillion hours of GPU time. How much value does a document have to have for it to be worth doctoring it in this way?
There's nothing much new here apart from the fact that some folk have demonstrated something that was already known to be theoretically possible. We all know that SHA-1 is deprecated and that SHA-2 and SHA-3 are more secure. It's rather sad if it takes an actual demonstration to prove that the theoretical risk is real.
> It's still really really REALLY improbable to produce a meaningful difference.
No it's not. You make a meaningful difference, and put some hidden padding somewhere to make the SHA1 match the desired value.
This is the same process as when MD5 was broken - in that case they used it to create a forged SSL certificate. It had a "tumor" in it to fudge the MD5 hash, but was still a valid certificate.
https://www.win.tue.nl/hashclash/rogue-ca/
In short, it is now practical to forge a certificate with a SHA-1 hash. You should be worried - efforts to disable SHA-1 in browsers have not come early enough.
But did they change the account number and sort code of an account to receive gazillions, rather than some random shite? As for certs, your OS already trusts many government-controlled CAs, so that claim is pointless.
Anyone care to share a financial institution with a sha-1 cert?
The point of a hash function is that even if you change just a single bit, it should give a completely different output. There are far more possible documents than there are hash function outputs, so collisions will always exist - but it should be computationally impossible to find them. This attack proves that it no longer is.
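That "single bit" property (the avalanche effect) is easy to demonstrate in a few lines of Python:

```python
import hashlib

def bit_diff(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

msg = bytearray(b"pay Alice 100 pounds")
h1 = hashlib.sha1(bytes(msg)).digest()
msg[0] ^= 0x01                      # flip a single input bit: 'p' -> 'q'
h2 = hashlib.sha1(bytes(msg)).digest()

# Roughly half of the 160 digest bits change for a 1-bit input change.
print(bit_diff(h1, h2))
```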
No hash function makes finding collisions computationally /impossible/, the best you can hope for is /practically useless/.
As in, too impractical to be of any utility to a miscreant.
As others have said, the point is not the ability to contrive a collision, but to establish a means to change a document or file in a way that has a practical result (changing the terms of a contract, introducing functional malware or a backdoor) without changing the byte count of the file (nobody relies on JUST the hash, I hope), and still contrive to create a collision.
AND the result of the change also has to remain undetectable when subject to an eyeball test. i.e. It's no good being able to change the terms of a contract if to achieve that whilst preserving the file size and the hash you have also introduced what is obvious junk content as well.
Changing the background color of a PDF ? I'm not sure what the possible malicious application of that could be. This doesn't prove that an effective attack is possible, only that a pointless attack is. Surely ?
I can drive my car into a concrete wall at 100mph and effectively demonstrate that the collision protection offered by my vehicle is inadequate to protect the occupants in the event of such a collision. But that doesn't mean that the collision protection in the vehicle is inadequate.
I think* that PDFs (and Word documents and almost any other document format except plain text) have a sufficient number of 'filler' characters that can be set to any value you like without significantly changing the document format. This gives scope for a clever algorithm to produce the necessary collisions.
*I haven't read the paper, but this is how it worked for MD5
AND the result of the change also has to remain undetectable when subject to an eyeball test. i.e. It's no good being able to change the terms of a contract if to achieve that whilst preserving the file size and the hash you have also introduced what is obvious junk content as well.
Take a 1,000 or so word document. One you know well, and as you're in HR have read a half a dozen times today alone with new hires coming into the company. Spot the one small change in the dry legalese that is numbing to the mind of even the most dedicated HR drone.
A decent text-comparison program should pick it up. The human eyeball probably never will, unless something else stands out - e.g. a paragraph that should have one word on the last line has two. Dropping a character elsewhere could fix that.
Remember that research has shown that we respond more to the shape of words than the actual letters used in the words, and we can speed-read badly spelt documents without even noticing the errors. A human eye will not spot a small change in a large document or bit of source code.
> There are far more possible documents than there are hash function outputs
Known as the pigeonhole principle. Where the size of the input exceeds the size of the hash output, there not only "can be" but "must be" collisions. To those harping on about file size differences: Windows Explorer rounds to the closest KB unless you specifically check the details in the properties. For most use cases a slightly different file size is unimportant, but it is impressive nonetheless that they could locate a collision even while constraining themselves to a valid file format and the same file size.
Designing effective hash functions is really hard. I had to last year and stuffed it big time. It wasn't a cryptographic hash but one that would see hundreds of thousands of our objects get hashed into different buckets for faster dictionary lookups. So basically you had an infinite combination of inputs that had to hash to 32 bits (about 4 billion). I managed to create an accidental swarm to zero which meant that whilst there was good distribution generally, a substantial proportion of real world objects would end up in bucket 0. After fixing it, the worst I saw was in the 3 objects per bucket out of a million range.
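A cheap way to catch that kind of "swarm to zero" before it ships is simply to histogram the buckets. A minimal Python sketch (the function name and toy hashes are my own, purely illustrative):

```python
from collections import Counter

def bucket_histogram(items, n_buckets, hash_fn=hash):
    """Count how many items land in each bucket - reveals hot spots early."""
    return Counter(hash_fn(x) % n_buckets for x in items)

# A deliberately bad hash: every input collapses into bucket 0.
bad = bucket_histogram(range(0, 10000, 10), 1000, hash_fn=lambda x: x % 10)
print(len(bad), bad[0])  # 1 1000 - the 'swarm to zero'

# A reasonable hash spreads the same workload out across the buckets.
good = bucket_histogram((f"obj{i}" for i in range(10000)), 256)
print(max(good.values()) < 100)  # True - no bucket is badly overloaded
```

Running this over a representative sample of real-world objects, rather than synthetic keys, is what exposes accidental clustering like the one described above.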
+1 for naming the problem.
Publishing different hash signatures for the same document, and/or partitioning the document and using the same hash method to "cover" overlapping portions of it* have got to be the only way of minimising the effect of the Pigeonhole Principle without moving to a different scheme which has a higher number of pigeonholes.
Using this multiplicity technique effectively increases the number of pigeonholes used to "describe" a given mapping, and has a similar effect to triangulating measurements for a map using a theodolite (i.e., reduces the chance of error in positioning).
With multiple hashes, let's be clear on this: Every single one has to match before provenance can be established. So long as the statistics show a suitably low collision probability for the number of hashes utilised I don't think it matters how secure the underlying crypto used to generate each of those hashes is. However, using different "salts" on the same method is probably inadvisable.
*Commentators here have mentioned background colouring and formatting obfuscation techniques to render an identical hash: Why not "layer" the hashing mechanism in such a way that the document is one layer and the presentation is a separate layer? This will mean each layer bearing its own hash. Only when the two hashes are identical are we looking at the authentic document. Of course, this method of hash generation would have to be put into the public domain for it to be useful.
> or partitioning the document and using the same hash method to "cover" overlapping portions of it
Just be aware that, whilst the pigeonhole principle shows it is mathematically certain that two colliding documents must exist where the input size exceeds the hash output size, it does not follow that two inputs smaller than the hash output cannot collide. In a well-designed algorithm, such collisions should be both very rare and computationally infeasible to find, i.e. nothing short of brute force.
Not just computationally impossible to find them, which may be a big ask as you could always brute force with enough computing power.
It is key that you cannot premeditatedly change a document and manipulate it to produce the same hash, which is why this shows SHA-1 as broken.
You may just find, unlikely but possible, that your picture of your cat creates the same hash as your contract for your house. However no one would expect that your solicitor got you to buy a house based on a cat picture.
If however you can work out that by changing a little bit of text you can then add extra text/bytes/graphics and work out what they should be to force a collision then it is broken.
If the only other hashes that exist for a document are all random garbage or completely unconnected to the original document as well as being impossible with current resources to premeditatedly work out then it isn't broken.