Thursday, June 04, 2009


Hash algorithm strength

In his “Practical Security” column in the May/June issue of IEEE Internet Computing magazine, Stephen Farrell talks about strength of cipher-suites, and the fact that the overall strength is not simply measured by the key length of which the user is aware. It’s a useful thing to understand if you’re responsible for implementing encryption, and it’s something that we often get wrong, using strong asymmetric encryption up front, but weaker symmetric encryption behind it, or the other way around.

As often happens, though, when we try to state things simply, he says something that’s a little confusing, which I’d like to look at more closely here. In talking about the strength of the hash (digest) algorithm, Stephen says this:

Similarly, if a digest operation’s output were small, say only 32 bits long, then an adversary could simply guess possible inputs until it found a matching output, probably only after roughly 65,000 attempts.

The number, 65,000, comes from the fact that 16 bits represents about 65,000 possible values (65,536, to be exact), and that 16 bits are half as many as 32 bits. On the surface, though, this looks like an error. If we were picking random values from a 32-bit range (which represents a bit more than 4 billion possible values), we’d expect the average number of attempts to crack it to be half that value, not half the number of bits: 31 bits, not 16 bits, or about 2 billion attempts.

The trick, though, is that the hash algorithms we’re using have a theoretical collision-resistance strength equal to (at most) half the number of bits in their output. So, theoretically, the strength of a hash algorithm that’s similar to SHA but has a 32-bit output would be 16 bits. An that means that a brute-force attack would crack it in an average of about 65,000 attempts.

But why is the strength only equal to half the number of bits in the first place?

The fault in what we expect is in the intuitive — but incorrect — expectation of the rate in which random values will collide. We’re mapping all plain-text messages into the set of numbers represented by 160 bits for SHA-1, or, in Stephen’s hypothetical example, 32 bits. Let’s simplify it a little, and consider a similar situation. Let’s map all people into the set of numbers from 1 to 366.

In other words, let’s consider people and their birthdays.

The “birthday paradox” — which isn’t really a paradox — is an unexpected result that I talked about in my first “Paradox” post. It hinges on the idea that we expect to need a large number of people before it’s likely that two have the same birthday — that we get, in hash terms, a collision — but that, in fact, we need relatively few.

Comparing it to our hash expectation, we’d figure we’d need 366 / 2, or 183 people, to have a 50% chance of a collision. In fact, though, we only need 23. Thinking of that in terms of bits, it’s a difference between an expected 7.5 bits and an actual 4.5 bits. (You can get lots more information about the birthday paradox at the Wikipedia entry.)

The effect here is the same. Because the probability of two randomly chosen source strings hashing to the same value goes up rapidly, not in the linear way we expect, it turns out that the strength of a hash algorithm with an output of n bits is not 2n / 2, but, rather, 2n/2 — that is, not half the number of values, but half the number of bits.


Frisky070802 said...

Val Henson used this argument a few years back to argue that file systems that dedup based on hashes of content could be in for a rude surprise ...

Jim Fenton said...

I don't have Stephen's entire article, but what you're describing sounds more like a preimage attack than a birthday attack. Is the attacker here trying to come up with a message that hashes to a particular value, or to the same value as any of 2^16 other messages? If the attacker has a particular message that they want the hash to agree with, then the work is on the order of 2^31, not 2^16, because the birthday attack doesn't apply. If, on the other hand, the attacker has a large number (~2^16 or more) of messages (perhaps credit card numbers, or key hash values...) then a birthday attack is possible if getting a match anywhere is considered to be a success.

Barry Leiba said...

Right, that's why I said that the whole thing became confusing in an attempt to oversimplify. Stephen wasn't really describing a specific attack — as Jim points out, there's a tremendous difference in the resistance to different attacks — but to give a general measure of the strength of a has algorithm.

Finding two messages at random that collide is much easier than finding a message that collides with a given specific message. Hardest yet is finding a way to make a specific modification that generates a collision, and the computational feasibility of do that would make a hash algorithm completely useless.

Cracks in the other two areas merely weaken things, and lead to the idea that we should begin a medium-term phase-out... which is why NIST is conducting a study to come up with a new "SHA-3" algorithm.