SHA1 is Dead, Long Live SHA1
Saturday, 25 January 2020
Earlier this month, a paper described a new attack against the SHA1 checksum calculation. As Ars Technica wrote:
There is a cryptographic system called PGP that can encrypt or digitally sign data. Using the vulnerability, attackers can manufacture PGP signatures that appear to be valid. As far as I can tell, this doesn't impact the encryption; it only attacks the digital signatures. So if you have an unencrypted file that includes a PGP signature for validation, then the signature could be forged.
As security, privacy, and cryptography goes, this is pretty bad. But does it mean that SHA1 is dead?
A simple random number generator takes an initial value (a seed) and runs it through a complex calculation to generate the next number. The calculation maintains some internal state values. The seed initializes the state, and each subsequent call updates the state. For example:
The output from this program (using PHP) looks like:
The first three calls show three random values. And if you initialize it with a different seed, then you'll see different random values. However, as long as you start with the same random seed, you will generate the same sequence of random numbers.
We could make the random number generator a little more complicated. For example, we could add entropy (noise) into the random generator. So rather than calling rand() to generate a random value, we might call the function with a parameter to add more variability. In this case, you would need to start with the same seed and add the same entropy data in the same order to generate the same sequence of random numbers.
This is how cryptographic hash functions work. Functions like CRC32, MD5, and SHA1 each start with a well-defined initial state (seed). Then they read a chunk of data (usually the bytes from a file). Each data chunk adds entropy in a known order. When all of the data is processed, the final random value is generated. The same data in the same order will generate the same value, but different data or a different ordering will generate a different value. This final random value is the checksum for the data. CRC32, MD5, and SHA1 use the same approach but with different random number generators.
Well, almost. If the checksums do not match, then you can be 100% certain that they are different files. However, CRC32 is only 32-bits long. If your data is smaller than 32-bits (4 bytes long), then two different 4-byte files will have different CRC32 values. But what if your file is larger than 4 bytes? Well, there's a 1 in 4,294,967,296 chance that two files will map into the same CRC32 checksum. This is why these checksums are often called digests or summaries; they are not unique per file. A missed checksum means they are different, but a match does not mean they are the same.
CRC32 is only 4 bytes long, but MD5 is much longer. The MD5 algorithm reads data in 512-bit chunks (64 bytes) and generates a 128-bit checksum (16 bytes long). Two files could generate the same MD5 checksum, but the odds of that happening naturally are extremely slim (1 in 340,282,366,920,938,463,463,374,607,431,768,211,456). SHA1 generates a bigger hash value; it reads data in 512-bit chunks (64 bytes) and generates a 160-bit checksum (20 bytes). Basically, the likelihood of two different files having the same MD5 or SHA1 checksum are unlikely in the extreme. (You're way better off playing the Powerball Lottery. With the grand prize odds of 1 in 292,201,338, you could win 1,164,547,600,124,057,143,993,552,169,948 times before coincidentally finding an MD5 match.) For this reason, a matched checksum is typically treated as representing the same file.
There's one little issue with this brute force attack: the file size. If I track files by SHA1 and file size, then a brute force attack will eventually match the SHA1 checksum but it won't match the file size.
Back in 2017, a group of researchers produced the first SHAttered collision using something called the birthday paradox (collisions do not require an exhaustive search). They produced two different files that had the same SHA1 checksum and same file size. Moreover, both were valid files (one was a JPEG and the other was a PDF). However, this proof of concept required over 9,223,372,036,854,775,808 SHA1 computations and took the equivalent of 110 years of GPU processing. (And they got lucky; the search could have taken much longer.)
When SHAttered first came out, there were declarations about "SHA1 declared dead" and "SHA1 function is now dead". However, 'dead' is relative. The use of SHA1 never went away. Old systems that use SHA1 still use SHA1, and newer systems may use SHA1 or may move on to stronger algorithms.
Shambles starts with the assumption that you already know the data and you only need to generate the signature. Fortunately, there are some mitigating factors:
In contrast, newer (non-human) systems have proceeded past SHA1. For example, TLS 1.2 deprecated MD5 and SHA1; most browsers and VPNs use SHA256 or SHA384 for encrypted communications, but SHA1 is often supported as a fallback -- just in case someone has really old software. Bitcoin algorithms use SHA256, RIPEMD160, or some other long checksum. And many patch systems and package managers rely on long hash functions.
Although there is some progress toward using longer hashing algorithms, the momentum is far from established. Despite reports to the contrary, SHA1 is not dead, not being rapidly replaced, and will be around for a very long time.
Three years ago, Ars declared the SHA1 cryptographic hash algorithm officially dead after researchers performed the world's first known instance of a fatal exploit known as a "collision" on it. On Tuesday, the dead SHA1 horse got clobbered again as a different team of researchers unveiled a new attack that's significantly more powerful.
There is a cryptographic system called PGP that can encrypt or digitally sign data. Using the vulnerability, attackers can manufacture PGP signatures that appear to be valid. As far as I can tell, this doesn't impact the encryption; it only attacks the digital signatures. So if you have an unencrypted file that includes a PGP signature for validation, then the signature could be forged.
As security, privacy, and cryptography goes, this is pretty bad. But does it mean that SHA1 is dead?
What is SHA1?
Let's back up a moment. What are SHA1 and other cryptographic hash functions?A simple random number generator takes an initial value (a seed) and runs it through a complex calculation to generate the next number. The calculation maintains some internal state values. The seed initializes the state, and each subsequent call updates the state. For example:
echo "Init with 3\n";
srand(3); // initialize the seed with the value "3"
echo rand() . "\n"; // generate the 1st random number
echo rand() . "\n"; // generate the 2nd random number
echo rand() . "\n"; // generate the 3rd random number
echo "\nReinit with 3\n";
srand(3); // reset the initialization back to the value "3"
echo rand() . "\n"; // generate the 1st random number
echo rand() . "\n"; // generate the 2nd random number
echo rand() . "\n"; // generate the 3rd random number
The output from this program (using PHP) looks like:
Init with 3
1182829493
151880524
1520735868
Reinit with 3
1182829493
151880524
1520735868
The first three calls show three random values. And if you initialize it with a different seed, then you'll see different random values. However, as long as you start with the same random seed, you will generate the same sequence of random numbers.
We could make the random number generator a little more complicated. For example, we could add entropy (noise) into the random generator. So rather than calling rand() to generate a random value, we might call the function with a parameter to add more variability. In this case, you would need to start with the same seed and add the same entropy data in the same order to generate the same sequence of random numbers.
This is how cryptographic hash functions work. Functions like CRC32, MD5, and SHA1 each start with a well-defined initial state (seed). Then they read a chunk of data (usually the bytes from a file). Each data chunk adds entropy in a known order. When all of the data is processed, the final random value is generated. The same data in the same order will generate the same value, but different data or a different ordering will generate a different value. This final random value is the checksum for the data. CRC32, MD5, and SHA1 use the same approach but with different random number generators.
Validating Files
If I were to share a file with you, how could you be sure that the file is the same as the one I gave you, and that it hasn't been tampered with or altered? Well, I could share the file with you and the SHA1 checksum, or MD5 checksum, or some other cryptographic checksum. You could then perform the same calculation over the file. If your computed checksum matches the one I supplied, then you know that the algorithm received the same data in the same order.Well, almost. If the checksums do not match, then you can be 100% certain that they are different files. However, CRC32 is only 32-bits long. If your data is smaller than 32-bits (4 bytes long), then two different 4-byte files will have different CRC32 values. But what if your file is larger than 4 bytes? Well, there's a 1 in 4,294,967,296 chance that two files will map into the same CRC32 checksum. This is why these checksums are often called digests or summaries; they are not unique per file. A missed checksum means they are different, but a match does not mean they are the same.
CRC32 is only 4 bytes long, but MD5 is much longer. The MD5 algorithm reads data in 512-bit chunks (64 bytes) and generates a 128-bit checksum (16 bytes long). Two files could generate the same MD5 checksum, but the odds of that happening naturally are extremely slim (1 in 340,282,366,920,938,463,463,374,607,431,768,211,456). SHA1 generates a bigger hash value; it reads data in 512-bit chunks (64 bytes) and generates a 160-bit checksum (20 bytes). Basically, the likelihood of two different files having the same MD5 or SHA1 checksum are unlikely in the extreme. (You're way better off playing the Powerball Lottery. With the grand prize odds of 1 in 292,201,338, you could win 1,164,547,600,124,057,143,993,552,169,948 times before coincidentally finding an MD5 match.) For this reason, a matched checksum is typically treated as representing the same file.
Intentional Collisions
A hash-collision attack tries to intentionally create a file that has a known cryptographic checksum. This type of attack has been known for years. The brute-force method allocates extra data that is systematically altered until it matches the checksum. For example, if SHA1 works on 64-byte blocks, then you can add in an extra 64 bytes and iterate through every possible value until you generate the checksum that you want. This type of attack can take seconds for CRC-32, but thousands of centuries for SHA1.There's one little issue with this brute force attack: the file size. If I track files by SHA1 and file size, then a brute force attack will eventually match the SHA1 checksum but it won't match the file size.
Back in 2017, a group of researchers produced the first SHAttered collision using something called the birthday paradox (collisions do not require an exhaustive search). They produced two different files that had the same SHA1 checksum and same file size. Moreover, both were valid files (one was a JPEG and the other was a PDF). However, this proof of concept required over 9,223,372,036,854,775,808 SHA1 computations and took the equivalent of 110 years of GPU processing. (And they got lucky; the search could have taken much longer.)
When SHAttered first came out, there were declarations about "SHA1 declared dead" and "SHA1 function is now dead". However, 'dead' is relative. The use of SHA1 never went away. Old systems that use SHA1 still use SHA1, and newer systems may use SHA1 or may move on to stronger algorithms.
Shambles
The new attack is called Shambles. When generating a cryptographic signature, you start with the data and add a key value. You can only validate the hash (signature) if you have both the data and the key value.Shambles starts with the assumption that you already know the data and you only need to generate the signature. Fortunately, there are some mitigating factors:
- Most PGP uses that I have seen use encryption, or encryption with signatures. Except for validating downloaded software packages (like Tor), I cannot recall the last time I saw a signature without encryption. So if the forged signatures test as valid, the data will still fail to decrypt. (Tor is at risk, but your encrypted email is not.)
- The attack works for any fixed prefix encoding that uses SHA1. However, SHA1 is usually computed over an entire file. Two different files are going to have two different prefixes (not fixed), so this attack does not apply to full-file checksums. The only time this would apply is if you have two files where the only differences are at the end of the files.
Dead! Dead! Dead!
I chuckle whenever I read declarations about this being the end of SHA1. There are some alternatives, such as SHA-256, SHA-512, and SHA-3, but none are widely used beyond a few niche purposes. In addition, MD5 has had known collision attacks for decades, and it is still widely used, so I expect SHA1 to be around for a very long time. And frankly, there are a couple of reasons for it to continue being widely used:- Old and New. It is easy enough to generate a different hash value. But most software packages either don't supply checksums for humans to read, or supply both older and newer hashes. For example, Open Office supplies signatures using PGP (broken by Shambles), MD5 (long considered weak and outdated), and SHA-256 (newer and considered good for now). Similarly, ClamAV uses MD5, SHA1, and SHA256.
Currently, there are no known methods for forcing a collision that works across multiple hash function, such as MD5 and SHA1, or MD5 and SHA-256, or SHA1 and SHA256. So if the software provider supplies hash values for two different algorithms, then that's good enough (for now) to say that it is the same file. - Replacing. Many systems, such as source code control systems like Github and Subversion, rely on SHA1 to track code changes. Updating the hash function isn't as simple as a drop-in replacement. Not only do you need to update the hash function, you also need to back-port all previous changes to the new hash function. Moreover, users often link to comments or code by referring to the hashed value; changing the hash function will break all reference links.
- Speed. As computation time goes, SHA1 and MD5 are pretty comparable. Usually SHA1 is a little faster than MD5, but there are a few cases where MD5 is faster. However, SHA256 is slower than SHA1, and SHA512 is slower than SHA256. If you're only computing one hash, then the speed difference is negligible. However, if you're computing lots of hashes (e.g., real-time cryptography, streaming video, or validating millions of files), then the speed difference is significant. Sometime faster is better than more bytes. And if you're on a system with limited resources (e.g., an embedded or IoT device), then SHA1 might be "good enough".
- Better tomorrow? Today, SHA1 and MD5 both have known vulnerabilities, but the vulnerabilities have limited applications. Newer algorithms, such as SHA256, SHA512, and SHA3, are considered 'stronger' by today's standards, but they also haven't had as much scrutiny. It is possible for a new attack to come out that completely breaks SHA512 or SHA3 or some other algorithm, but leaves SHA1 and MD5 alone. (E.g., the new Shambles attack against SHA1 appears to have zero impact on MD5.) So do you stick with what you know, or do you migrate to something newer that could be deemed completely ineffective tomorrow?
- Usability. If it were just up to a machine, then we could easily provide lots of long hashes. However, often hashes are displayed for humans to evaluate. For example, if you've ever seen a forensic report on a digital analysis, then you've probably seen lists of MD5 or SHA1 hashes. For evidence tracking, you're supposed to list every file and every checksum. This way, some other analyst can view the checksums and confirm that they are evaluating the same data.
With MD5 and SHA1, hashes are short enough to list in a table since MD5 only requires 32 characters (2 characters per byte) to display the hash, and SHA1 only needs 40 characters. However, SHA256 needs 64 characters, and anything longer than 40 characters usually means word-wrapping. SHA512 is even longer! At 128 characters, I have yet to see a report that includes SHA512 without word-wrapping. (SHA3 supports different lengths, ranging from 56 characters to 128 characters.)
Humans usually don't manually compare extremely long byte sequences. Typically, they will compare the first few bytes and last few bytes and just assume that the middles match. From a human usability aspect, longer is usually not more secure. - Legal. The Department of Justice still recognizes MD5 and SHA1 as viable checksum algorithms for tracking digital evidence (e.g., 2011) and they are still used today (e.g., United States vs Anthony Davis Stagnitta and United States vs Carlos Santiago Gomez). Ironically, back in 2015 NIST announced that Federal agencies should stop using SHA1 except for verifying old digital signatures. (I guess nobody heard them. Maybe NIST should say this louder.)
In contrast, newer (non-human) systems have proceeded past SHA1. For example, TLS 1.2 deprecated MD5 and SHA1; most browsers and VPNs use SHA256 or SHA384 for encrypted communications, but SHA1 is often supported as a fallback -- just in case someone has really old software. Bitcoin algorithms use SHA256, RIPEMD160, or some other long checksum. And many patch systems and package managers rely on long hash functions.
Although there is some progress toward using longer hashing algorithms, the momentum is far from established. Despite reports to the contrary, SHA1 is not dead, not being rapidly replaced, and will be around for a very long time.
Comments
#1
David Oftedal
(Homepage)
on
2020-01-27 14:25
(Reply)
I'm not sure how much cryptanalysis has been done on this yet, but BLAKE3 ( https://github.com/BLAKE3-team/BLAKE3/ ) seems to be doing really well performance-wise. According to its authors, it's both faster and more secure than both SHA-1 and MD5. Perhaps an interesting hash function for future uses.
Add Comment
Enclosing asterisks marks text as bold (*word*), underscore are made via _word_.
Standard emoticons like :-) and ;-) are converted to images.
E-Mail addresses will not be displayed and will only be used for E-Mail notifications.

