Staring At Empty Pages: On compression and sound quality

Actor and filmmaker Adrian Grenier was on a local radio talk show the other day. As an addendum to the show, they posted a brief Q&A on the web, in which he says this:

Q: What are you listening to right now?
A: I just reunited with my record collection. Records sound better than MP3’s. I was just listening to Toots and the Maytals on vinyl.

Now, there’s certainly been a lot of debate about whether analogue sounds better than digital when it comes to recorded music. When you hear live music, the vibrations reach your ears, your ears pick them up and send them to your brain through your nerves, and you hear exactly what someone sitting at that spot (and with your hearing capabilities) would hear. It’s perfect, in the sense that you can’t get more like really being there than... really being there.

Any recording provides a different experience, and whether that experience is better than the live one depends upon a lot of things, including where you were sitting and where the microphones were, how much extraneous stuff was heard by each (you and the microphones), and so on... along with the social experience and energy of being there to see it.

That said, there have always been those who say that digital recordings sound digital, changing the sound in unpleasant ways. Music is recorded on a CD, for example, by sampling the actual sound at frequent intervals (44,100 times a second), and by encoding the sampled sounds as numerical values (16-bit numbers, for CDs). The choice of sampling rate and of the number of bits per sample sets the maximum quality of the sound.
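To make sampling concrete, here is a minimal Python sketch that samples a pure tone at the CD rate and rounds each sample to a 16-bit integer. The function name `sample_tone`, the 440 Hz tone, and the 10 ms duration are my own choices for illustration; a real recording chain involves filtering and other steps that are left out.

```python
import math

SAMPLE_RATE = 44_100   # CD sampling rate, in samples per second
BITS = 16              # CD sample size, in bits


def sample_tone(freq_hz, millis):
    """Sample a sine wave and quantize each sample to a signed 16-bit integer."""
    full_scale = 2 ** (BITS - 1) - 1            # 32767, the largest positive value
    count = SAMPLE_RATE * millis // 1000        # number of samples in the interval
    return [
        round(full_scale * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE))
        for n in range(count)
    ]


samples = sample_tone(440, 10)   # 10 ms of a 440 Hz tone (hypothetical input)
print(len(samples))              # 441
print(samples[:5])
```

Ten milliseconds of sound already takes 441 numbers per channel, which is why uncompressed audio gets big so quickly.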

I’m not going to get into the debate, here, and I’ll only note that since CDs have given way to other ways of listening, we’ve increased both the sampling rate and the number of bits per sample in some recordings. Whether or not CDs sound good, there’s better digital source material out there.

MP3 files, though, are not original source material: they’re compressed from the original, and their quality can vary greatly. Let’s look at why.

Broadly, there are two kinds of compression: lossless, and lossy. We use lossless compression in computer work all the time, such as when we make ZIP files. To do compression losslessly, we take advantage of redundancies, repetitions, and nearness that show up naturally in data, use alternative representations for common sequences, and that sort of thing. Any lossless compression algorithm works best on the kind of data it’s designed for, and doesn’t work well on certain other kinds.
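A quick way to see both halves of that claim is to run a general-purpose lossless compressor over two inputs of the same size. This sketch uses Python's standard `zlib` module (DEFLATE, the method most ZIP files use); the sample sentence and the seeded random bytes are made up for the demonstration.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable

# Highly repetitive text versus random bytes of the same length.
text = b"the rain in Spain stays mainly in the plain. " * 200
noise = bytes(rng.randrange(256) for _ in range(len(text)))

packed_text = zlib.compress(text)
packed_noise = zlib.compress(noise)

# Lossless means the round trip gives back exactly what went in.
assert zlib.decompress(packed_text) == text
assert zlib.decompress(packed_noise) == noise

print(len(text), len(packed_text), len(packed_noise))
```

The repetitive text shrinks to a small fraction of its size, while the random bytes do not shrink at all, because there is no redundancy for the compressor to exploit.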

Here, for instance, is a lossless algorithm I’m making up as I type this, designed for compressing English text made up of letters, numbers, and a few punctuation marks and symbols:

In normal English text (US-ASCII), each character is represented by one byte. We know that the most common 11 letters in English are, in order, e, t, a, o, i, n, s, h, r, d, and l. So we’re going to represent each of those, plus the space character, with a half-byte instead of a full byte (shown here in binary, for clarity):

0000 = (space), 0001 = e, 0010 = t, 0011 = a, 0100 = o, 0101 = i,
0110 = n, 0111 = s, 1000 = h, 1001 = r, 1010 = d, 1011 = l

That leaves 15 lower-case letters, 26 upper-case letters, and 10 numerals to be represented, and we can introduce one-byte patterns whose first half-byte has the form 11xx, unused above. We’ll reserve 1100 0000 for now, and assign the remaining 63 one-byte patterns (11xx xxxx, where the x’s are not all zeroes) arbitrarily to those 51 characters and to the 12 most common punctuation marks:

1100 0001 = b, 1100 0010 = c, 1100 0011 = f, 1100 0100 = g, ...,
1100 1111 = z, 1101 0000 = A, 1101 0001 = B, 1101 0010 = C, ...,
1110 1001 = Z, 1110 1010 = 0, 1110 1011 = 1, 1110 1100 = 2, ...,
1111 0011 = 9, 1111 0100 = (comma), 1111 0101 = (period), .....

Finally, we’ll represent anything else by using our reserved 1100 0000 as an escape, so that the byte immediately following it represents its normal US-ASCII value:

1100 0000 0010 0100 = (dollar), 1100 0000 0010 0101 = (percent),
1100 0000 0010 0110 = (ampersand), ....

Therefore, our system represents the space and the eleven most common letters in half the normal number of bits (4), and the remaining letters and numerals, along with twelve punctuation marks, in the normal number of bits (8)... but takes twice the number of bits (16) to represent anything else. It would be horrible for music files, which are made up of arbitrary binary data and which would get much bigger if put through this compression. But for plain English text, here’s an example, using hexadecimal notation to be more concise:

Original: This is a compressed string.
US-ASCII: 54 68 69 73 20 69 73 20 61 20 63 6F 6D 70 72 65 73 73 65 64 20 73 74 72 69 6E 67 2E
Our algorithm: E3 85 70 57 03 0C 24 C7 C8 91 77 1A 07 29 56 C4 F5

We’ve reduced the string from 28 bytes to 17, and it’s fully reversible (once we’ve dealt with padding needed when we end in the middle of a byte, but that’s easy enough and we don’t need to get into that here).
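For anyone who wants to check the arithmetic, here is a rough Python sketch of the scheme. The names (`COMMON`, `ONE_BYTE`, `encode`, `decode`) are mine. Two things are assumptions rather than part of the description above: only the comma and period are assigned one-byte codes, since the other ten punctuation marks were never listed (everything else goes through the escape), and an odd number of half-bytes is padded with a single 1100 half-byte, which the decoder recognizes as padding because it cannot start a complete code on its own.

```python
COMMON = " etaoinshrdl"                  # half-byte codes 0x0 to 0xB
RARE_LOWER = "bcfgjkmpquvwxyz"           # one-byte codes 0xC1 to 0xCF
UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"     # 0xD0 to 0xE9
DIGITS = "0123456789"                    # 0xEA to 0xF3
PUNCT = ",."                             # 0xF4, 0xF5; the rest are unspecified
ESCAPE = 0xC0                            # next byte is a plain US-ASCII value
PAD = 0xC                                # padding half-byte (assumption)

ONE_BYTE = {ch: 0xC1 + i for i, ch in enumerate(RARE_LOWER + UPPER + DIGITS + PUNCT)}
REVERSE = {code: ch for ch, code in ONE_BYTE.items()}


def encode(text):
    """Compress US-ASCII text into packed half-bytes."""
    nibbles = []
    for ch in text:
        if ch in COMMON:
            nibbles.append(COMMON.index(ch))
            continue
        codes = [ONE_BYTE[ch]] if ch in ONE_BYTE else [ESCAPE, ord(ch)]
        for byte in codes:
            nibbles += [byte >> 4, byte & 0x0F]
    if len(nibbles) % 2:
        nibbles.append(PAD)
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[::2], nibbles[1::2]))


def decode(data):
    """Reverse encode(); assumes well-formed input."""
    nibbles = []
    for byte in data:
        nibbles += [byte >> 4, byte & 0x0F]
    out, i = [], 0
    while i < len(nibbles):
        n = nibbles[i]
        if n < 0xC:                       # one of the twelve half-byte codes
            out.append(COMMON[n])
            i += 1
        elif i + 1 == len(nibbles):       # lone trailing half-byte: padding
            break
        else:
            code = (n << 4) | nibbles[i + 1]
            if code == ESCAPE:            # escaped character: read one more byte
                out.append(chr((nibbles[i + 2] << 4) | nibbles[i + 3]))
                i += 4
            else:
                out.append(REVERSE[code])
                i += 2
    return "".join(out)


packed = encode("This is a compressed string.")
print(packed.hex(" ").upper())
print(len(packed), decode(packed))
```

Run on the example string, this reproduces the 17 bytes shown above and decodes back to the original text.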

OK, that was fun to play with, but what about compressing music?

We can’t rely on common byte values for music, because the value of a given sample can be anything — silent, super-loud, or somewhere in between. But we can rely on the fact that in normal music, sounds don’t come in and go out instantaneously, and, therefore, adjacent samples are most likely to be relatively close to one another. If we optimize the algorithm for aspects like that, we can get fairly efficient compression. We can even get lossless compression to a point.

For example, suppose we’re using 32-bit samples. If the difference between one sample and the next is small enough to fit in 15 bits, we can replace the second sample with a 16-bit value: one sign bit (0 for plus, 1 for minus) followed by the 15-bit difference from the reference sample. The next sample could similarly be coded as a delta from the second, and so on. We’d have to do some futzing around to signal that we’d gone back to a full 32-bit sample again, and we’d probably want to do that periodically, whether we need to or not, to set up resync points in case something goes wrong with the data streaming. This is neither an ideal nor a complete mechanism... just the beginnings of an example.
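Here is one way that delta idea might look in Python, purely as a sketch. The "futzing around" is filled in with an assumption of my own: the 16-bit word 0x8000 (sign bit set, difference zero, or "minus zero") is never needed for a real delta, so it serves as a marker meaning "a full 32-bit sample follows". The resync interval of 1024 samples is arbitrary, and the stream always starts with a marked full sample.

```python
import struct

ESCAPE = 0x8000       # "minus zero": marks a full 32-bit sample (assumption)
MAX_DELTA = 0x7FFF    # largest difference that fits in 15 bits
RESYNC_EVERY = 1024   # force a full sample this often (arbitrary choice)


def encode(samples):
    """Encode signed 32-bit samples as 16-bit deltas where possible."""
    out = bytearray()
    prev = 0
    for i, s in enumerate(samples):
        delta = s - prev
        if i % RESYNC_EVERY == 0 or abs(delta) > MAX_DELTA:
            out += struct.pack(">Hi", ESCAPE, s)          # marker + full sample
        else:
            sign = 0x8000 if delta < 0 else 0
            out += struct.pack(">H", sign | abs(delta))   # sign bit + magnitude
        prev = s
    return bytes(out)


def decode(data):
    """Reverse encode(); assumes well-formed input."""
    samples, pos, prev = [], 0, 0
    while pos < len(data):
        (word,) = struct.unpack_from(">H", data, pos)
        pos += 2
        if word == ESCAPE:
            (prev,) = struct.unpack_from(">i", data, pos)
            pos += 4
        else:
            magnitude = word & 0x7FFF
            prev += -magnitude if word & 0x8000 else magnitude
        samples.append(prev)
    return samples


wave = [1_000_000, 1_000_100, 999_900, -2_000_000, -2_000_001, -2_000_001]
packed = encode(wave)
print(len(packed), "bytes instead of", 4 * len(wave))   # 20 bytes instead of 24
```

The saving is small here because the made-up data has a big jump in the middle; on smooth audio nearly every sample would shrink from four bytes to two.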

But for the high levels of compression that we need to use to turn music or video into tolerably sized files, we need to go for lossy compression methods. That means that we can’t turn the compressed file back into the full version, because some information has been lost in the process. And information loss means quality loss — the compressed file is no longer a faithful copy of the original, and any playback is measurably different from the original.
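The simplest possible illustration of "lossy" is to throw away the low-order bits of each sample. To be clear, this is not how MP3 works (MP3 encoders decide what to discard using a model of human hearing); it is only a crude stand-in, with function names of my own invention, to show that discarded information cannot be recovered.

```python
def squeeze(samples, keep_bits=8, total_bits=16):
    """Keep only the top keep_bits of each sample, halving the storage."""
    shift = total_bits - keep_bits
    return [s >> shift for s in samples]


def expand(small, keep_bits=8, total_bits=16):
    """Scale back up to the original range; the dropped bits come back as zeros."""
    shift = total_bits - keep_bits
    return [s << shift for s in small]


original = [12345, -12345, 300, 255, 0]   # made-up 16-bit samples
restored = expand(squeeze(original))

print(restored)                                      # [12288, -12544, 256, 0, 0]
print([o - r for o, r in zip(original, restored)])   # [57, 199, 44, 255, 0]
```

Every restored value is close to the original, but only the zero comes back exactly, and no amount of cleverness on the decoding side can tell 255 from 0 once the low bits are gone.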

But is it noticeably different from the original?

That, of course, depends upon how sensitive one is. Still, while the difference between CDs and vinyl records could be (and was) hotly debated, this one’s pretty straightforward: the difference is, in general, discernible on good equipment. There are a lot of compromises on the way from ten megabytes or more per minute to one megabyte or less per minute. That heavy compression is what makes it possible for us to put entire music collections in our pockets, so it’s worth it to many.
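Those per-minute figures are easy to check. The sketch below assumes two-channel (stereo) CD audio and, for the compressed side, a 128 kilobit-per-second MP3; that bit rate is just a common setting I picked for illustration, and other rates scale the result accordingly.

```python
SAMPLE_RATE = 44_100            # CD samples per second, per channel
BYTES_PER_SAMPLE = 2            # 16 bits
CHANNELS = 2                    # stereo (assumption)
MP3_BITS_PER_SECOND = 128_000   # illustrative MP3 bit rate (assumption)

cd_bytes_per_minute = SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS * 60
mp3_bytes_per_minute = MP3_BITS_PER_SECOND // 8 * 60

print(cd_bytes_per_minute)    # 10584000
print(mp3_bytes_per_minute)   # 960000
print(round(cd_bytes_per_minute / mp3_bytes_per_minute, 1))   # 11.0
```

So under these assumptions the compressed file is roughly one-eleventh the size of the CD audio it came from.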

And when we’re putting the music in our pockets and listening to it with ear buds as we ride the subway and walk down noisy streets, we’re not noticing the digital and compression artifacts, the reduced frequency range, the lower sound quality. It’s entertaining us and keeping the world of chatty commuters and jackhammers and car horns away.

With docking stations, though, we’ve brought that model into our living rooms, and we’re often listening to MP3 files at home, through loudspeakers the size of a paperback book. Where we used to show off audiophile equipment stacked up on shelves and massive speakers that dominated the room and rattled the walls with great sound, we’re now boasting about how compact it all is.

It’s compact, though, at the cost of sound quality, and Adrian Grenier is going back to vinyl.