Wednesday, February 04, 2009


Digital signatures on a large scale

Researchers at the University of Washington have been working on a simple system for putting digital signatures on recordings that are intended to become part of a historical record. Because it’s easy to falsify or tamper with digital recordings, their idea is to make the recordings verifiable. The New York Times reported on it last week:

A Tool to Verify Digital Records, Even as Technology Shifts

On Tuesday a group of researchers at the University of Washington are releasing the initial component of a public system to provide authentication for an archive of video interviews with the prosecutors and other members of the International Criminal Tribunal for the Rwandan genocide. The group will also release the first portion of the Rwandan archive.

This system is intended to be available for future use in digitally preserving and authenticating first-hand accounts of war crimes, atrocities and genocide.

It’s a laudable concept, but it’s an odd mix. The Times article has a few inaccuracies that I’ll point out here, but we can mostly work through them and figure out what’s meant. Even so, I’m a little puzzled; I haven’t been able to find the original UW information yet.

On the surface, this should be very standard stuff, and not worth writing home about. The typical way one sets up digital “authentication” like this is as follows (a rough code sketch appears after the list):

  1. The tribunal (for example) obtains a digital certificate from a recognized certificate authority. This certificate identifies the tribunal uniquely and securely, provided the private key for it is kept secure.
  2. The certificate and its private key are used to create a digital signature to go with the electronic document in question. See my series on digital signatures for more about how that works.
  3. Later, when the document is read/heard/viewed, it can be verified against the signature. If there is no signature, or the signature doesn’t match the document, then we can assume that the document’s been falsified or tampered with. If the signature matches, then we have confirmation that this is an unmodified document that was signed by the tribunal.
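
To make steps 2 and 3 concrete, here’s a minimal sketch in Python using the third-party cryptography package. Treat it as an illustration only: in a real deployment the private key would belong to a CA-issued certificate, whereas here a freshly generated RSA key and a short byte string stand in for the tribunal’s key and the recording.

    # Minimal sketch of steps 2 and 3; assumes the "cryptography" package.
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import padding, rsa

    # Stand-in for step 1: in practice this key pair would belong to the
    # tribunal's CA-issued certificate rather than being generated here.
    private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    public_key = private_key.public_key()

    document = b"bytes of a recorded interview"  # hypothetical recording

    # Step 2: sign the document with the private key.
    signature = private_key.sign(document, padding.PKCS1v15(), hashes.SHA256())

    # Step 3: anyone holding the public key (from the certificate) can verify.
    try:
        public_key.verify(signature, document, padding.PKCS1v15(), hashes.SHA256())
        print("Signature matches: the document is unmodified and was signed by this key.")
    except InvalidSignature:
        print("Signature does not match: the document was altered or the signature is bogus.")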

That mechanism is here now, has been around for years, and doesn’t need researchers to get it set up. So why is further research and development needed?

First, there are technical difficulties in doing this with large documents on a large scale, and we’ll get to those later.

Second, the whole process of obtaining and maintaining digital certificates, renewing expired ones, keeping them properly secured, and making sure that no one else can get a fake certificate identifying the tribunal is a cumbersome one. This definitely needs to be made easier, and doing so will benefit all uses of digital signatures.

Third, the whole certification model assumes a relatively short life of the certificates and the signatures. I sign an email message, I send it, you receive it, you verify the signature, and that happens within a few minutes, or hours, or days. Do we really expect that a digital signature will be secure and verifiable years from now — perhaps a great many years? Certificate authorities will come and go; keys may be compromised, resulting in revoked certificates; encryption algorithms and cryptographic hashes will fall to attacks; and myriad other things can happen over time that can render a digital signature meaningless over years and decades.

Fourth, data formats and encodings change over time. We develop better ones, and old ones become obsolete. Data files of today might no longer be readable decades from now, because the hardware and software needed might no longer be around. Who will be responsible for keeping track of this stuff over the years, and for making sure that we still have what we need to read it and verify it? Will it have to be periodically converted to new formats or media? Will we be able to make sure that the signatures go with it (that it’s re-signed, securely, each time it’s converted)?

The researchers are clearly not trying to address all of these issues, at least not now. From what I can gather from the Times article, it’s the first two that they’re working on. So let’s look at the inaccuracies in the article and some of the technical difficulties they have encountered or will encounter.

At the heart of the system is an algorithm that is used to compute a 128-character number known as a cryptographic hash from the digital information in a particular document. Even the smallest change in the original document will result in a new hash value.

No standard cryptographic hash algorithm produces “a 128-character number”. The MD5 algorithm produces a 128-bit number — that’s 16 characters (actually bytes), or 32 hexadecimal digits. But they surely aren’t using MD5, because that’s solidly broken now, and not reliable. SHA-1, commonly used now but starting to show weaknesses, produces 160 bits (20 bytes, 40 hex digits).
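
The bit/byte/hex-digit arithmetic is easy to check with Python’s standard hashlib module (the input bytes below are arbitrary):

    import hashlib

    for name in ("md5", "sha1"):
        h = hashlib.new(name, b"some recorded testimony")
        print(name, h.digest_size * 8, "bits =", h.digest_size, "bytes =",
              len(h.hexdigest()), "hex digits")
    # md5: 128 bits = 16 bytes = 32 hex digits
    # sha1: 160 bits = 20 bytes = 40 hex digits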

But the next paragraph addresses that and tells us that they’re not using either of those:

In recent years researchers have begun to find weaknesses in current hash algorithms, and so last November the National Institute of Standards and Technology began a competition to create stronger hashing technologies. The University of Washington researchers now use a modern hash algorithm called SHA-2, but they have designed the system so that it can be easily replaced with a more advanced algorithm.

SHA-2 is not an algorithm, but a collective name for a group of algorithms. Most likely, they’re using SHA-256 or SHA-512, which produce 256-bit and 512-bit hashes, respectively. Perhaps they’re using SHA-512, and the Times’s “128-character number” is really referring to the 128-hex-digit number that will come from that algorithm.
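
If that guess is right, the “128 characters” are just the hexadecimal spelling of a 512-bit value. All of the SHA-2 variants are available in Python’s hashlib, and one simple way to get the “easily replaced” property the article mentions is to make the algorithm name a parameter. This is only an illustration of the idea, not a description of how the UW system actually does it:

    import hashlib

    data = b"contents of an archived interview"  # arbitrary example bytes

    # The SHA-2 family: same construction, different output sizes.
    for name in ("sha224", "sha256", "sha384", "sha512"):
        print(name, len(hashlib.new(name, data).hexdigest()), "hex digits")
    # sha512 gives 128 hex digits, presumably the Times's "128-character number"

    # Making the algorithm a parameter so it can be swapped out later:
    def archive_hash(payload: bytes, algorithm: str = "sha512") -> str:
        return hashlib.new(algorithm, payload).hexdigest()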

One technical problem we might eventually run up against is the maximum data size that can be hashed at once. With SHA-512, that limit is about 2^125 bytes, or something on the order of 4 followed by 37 zeroes. It’s big. In contrast, one terabyte is about 1 followed by 12 zeroes. It’s probably big enough. But we have a history of saying that and being wrong, and if it’s not big enough, the data will need to be split into pieces, each piece signed, and then the record of how the pieces fit back together must itself be signed. Granted, though, they don’t have to deal with this any time soon.
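
If a single recording ever did exceed what one hash invocation can cover, or simply became unwieldy to verify in one piece, a common approach is a hash list: hash each chunk, then hash and sign the concatenated chunk hashes. Here is a sketch of that idea, not anything the UW system is known to do:

    import hashlib

    CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB per piece; an arbitrary choice

    def hash_list_digest(path: str) -> str:
        """Hash a file in chunks, then hash the concatenation of the chunk
        hashes. Signing this top-level digest covers every piece and the
        order in which the pieces must be reassembled."""
        chunk_hashes = []
        with open(path, "rb") as f:
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                chunk_hashes.append(hashlib.sha512(chunk).digest())
        return hashlib.sha512(b"".join(chunk_hashes)).hexdigest()

    # e.g. hash_list_digest("interview.mp4")  # hypothetical recording file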

The main technical problem they seem to be dealing with is making the process automated and usable by non-technical people. Someone from the tribunal needs to take, say, a video recording that has been in her custody the whole time, put it into the system, and attest to its accuracy through a mechanism she can understand and trust well enough to stand behind that attestation.

I’d love to see the user interface they’ve set up for that, and the processing behind it.
