Monday, April 30, 2012

Audio Misconceptions around "Mastered for iTunes"

Ars Technica, among others, has been talking about Apple's new "Mastered for iTunes" product campaign. They talked to some real mastering engineers and got some real information about audio compression and how carefully tweaking the master before compression might make a difference to sound quality after compression.

It's an interesting article and worth a read. Mostly, I think the conclusions are probably correct, although I think "Mastered for iTunes" fails to address the real problem of poor audio quality in most of the music we listen to today, which has absolutely nothing to do with the delivery format.

Unfortunately, they also managed to let loose some audio myths. Here are some corrections:

Using 16 bits for each sample allows a maximum dynamic range of 96dB. (It's even possible with modern signal processing to accurately record and playback as much as 120dB of dynamic range.) Since the most dynamic modern recording doesn't have a dynamic range beyond 60dB, 16-bit audio accurately captures the full dynamic range of nearly any audio source.
This is basically correct, but it sure is confusing. If you want to learn more, you can read all the gory details about the process, called dithering, at Bob Katz's website. (I am not sure where they got 60 dB from. That's HUGE, even for orchestral music. If they are citing this source, they are confusing dB of dynamic range with dB of absolute volume. I am also not sure where the 120 dB figure comes from; that seems like a very contrived laboratory condition.)
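
For anyone wondering where the 96 dB figure comes from, here is a minimal sketch. It assumes the usual rule of thumb that dynamic range is the ratio between full scale and a single quantization step, and it ignores dither and noise shaping entirely. The function name is my own.

```python
import math

def dynamic_range_db(bits: int) -> float:
    """Ratio of full scale to one quantization step, in dB: 20 * log10(2 ** bits)."""
    return 20 * math.log10(2 ** bits)

for bits in (16, 24):
    print(f"{bits}-bit: {dynamic_range_db(bits):.1f} dB")
# 16-bit: 96.3 dB
# 24-bit: 144.5 dB
```

Each extra bit buys about 6 dB, which is all the "96 dB" claim really says.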

Reality vs Theory
The maximum frequency that can be captured in a digital recording is exactly one-half of the sampling rate. This fact of digital signal processing life is brought to us by the Nyquist-Shannon sampling theorem, and is an incontrovertible mathematical truth. Audio sampled at 44.1kHz can reproduce frequencies up to 22.05kHz. Audio sampled at 96kHz can reproduce frequencies up to 48kHz. And audio sampled at 192kHz—some studios are using equipment and software capable of such high rates—can reproduce frequencies up to 96kHz.
Unfortunately, there's a big difference between "incontrovertible mathematical truth" and what can actually be implemented in hardware and software. In the real world, we need to filter out all frequencies above the so-called Nyquist limit (one half the sample rate), or we get nasty artifacts called "aliasing". And, in the real world, there is no filter that lets us keep everything below the limit and reject everything above the limit, so if we want this to work, we need a buffer between what we can hear and the Nyquist limit. That's why 44.1 kHz and not 40 kHz was chosen for CDs to reproduce up to 20 kHz audio. (Ideal filters could be designed if we relaxed certain constraints, such as one known formally as "causality", and if we had an infinite amount of data to work with.)
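
To make aliasing concrete, here is a small sketch of what happens to a tone above the Nyquist limit when nothing filters it out before sampling. The 25 kHz test tone and the helper names are my own choices for illustration.

```python
import math

FS = 44_100  # CD sample rate in Hz

def alias_frequency(f_hz, fs_hz):
    """Frequency at which an unfiltered tone of f_hz shows up after sampling at fs_hz."""
    f = f_hz % fs_hz
    return fs_hz - f if f > fs_hz / 2 else f

def samples(f_hz, count=64):
    """The first `count` samples of a cosine at f_hz, sampled at FS."""
    return [math.cos(2 * math.pi * f_hz * n / FS) for n in range(count)]

print(alias_frequency(25_000, FS))  # 19100

# The sampled 25 kHz tone is indistinguishable from a 19.1 kHz tone.
identical = all(abs(a - b) < 1e-9 for a, b in zip(samples(25_000), samples(19_100)))
print(identical)  # True
```

Once the samples are taken there is no way to tell the two tones apart, which is why the filtering has to happen first and why the filter needs some room to work in.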

Typical Hearing
However, human ears have a typical frequency range of about 20Hz to 20kHz. This range varies from person to person—some people can hear frequencies as high as 24kHz—and the frequency response of our ears also diminishes with age. For the vast majority of listeners, a 44.1kHz sampling rate is capable of producing all frequencies that they can hear.
Haha. Sure, maybe my 9-week-old son can hear 24kHz, but I doubt it. The range of human hearing, which is so often cited as 20Hz to 20kHz, does vary from person to person (last time I checked, a few years ago, my hearing went up to about 17kHz), but the 20Hz to 20kHz range is anything but typical. An acoustics textbook puts this more accurately: "a person who can hear over the entire audible range of 20-20000 Hz is unusual." I would go further and say such a person is not living in the modern world, reading Ars Technica and buying pop or rock albums. Modern life and aging destroy the tiny hairs in our ears that are sensitive to those frequencies, and that's all there is to it. Some people think they have better hearing because they are audiophiles. In fact, they may have superior hearing, but that has nothing to do with how well their ears work: exposure and critical listening improve our ability to hear. We exercise the appropriate parts of our brain and our hearing improves ("Golden Ears" is an example of a product designed for just that purpose).

Some people are reportedly sensitive to "supersonic" frequencies (they may get headaches from them, for example). This is not the same as hearing.

Ultrasonics in Analog

Furthermore, attempting to force high-frequency, ultrasonic audio files through typical playback equipment actually results in more distortion, not less.
"Neither audio transducers nor power amplifiers are free of distortion, and distortion tends to increase rapidly at the lowest and highest frequencies," according to Xiph Foundation founder Chris Montgomery, who created the Ogg Vorbis audio format. "If the same transducer reproduces ultrasonics along with audible content, any nonlinearity will shift some of the ultrasonic content down into the audible range as an uncontrolled spray of intermodulation distortion products covering the entire audible spectrum. Nonlinearity in a power amplifier will produce the same effect."
Chris Montgomery is surely a genius, but I don't think he should be considered the authority on analog electronics. I think many analog engineers will tell a different story: when ultrasonics are pushed through most analog equipment, they are steeply attenuated. Their phase might be altered, and they may produce some IM distortion, but at a very low level. For the most part, supersonics might as well not be there. On the other hand, a higher sample rate gives the benefit of allowing less stringent Nyquist filters, which reduces the amount of distortion in the DAC. I think compelling arguments could be made either way, although I'm not a proponent of 96 kHz consumer formats. Even in the studio, well designed DSP mitigates the need for high sample rates, though frequent ADA conversion may sound better at a high sample rate.
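
Both sides of this argument can be seen in a toy model. The sketch below pushes two ultrasonic tones through a hypothetical weakly nonlinear stage, y = x + k2 * x^2, and measures the difference tone that lands in the audible range. The tone frequencies, the value of k2, and the function names are all assumptions picked for illustration; nothing here is measured from real equipment.

```python
import math

FS = 96_000  # simulation sample rate in Hz (assumption)
N = 960      # 10 ms: a whole number of cycles of every tone involved

def two_tone(amplitude, f1=24_000, f2=27_000):
    """Two equal-level ultrasonic tones."""
    return [amplitude * (math.cos(2 * math.pi * f1 * n / FS)
                         + math.cos(2 * math.pi * f2 * n / FS))
            for n in range(N)]

def nonlinear(x, k2=0.1):
    """Hypothetical weakly nonlinear stage: y = x + k2 * x**2."""
    return [v + k2 * v * v for v in x]

def level_at(x, freq_hz):
    """Amplitude of the freq_hz component, via a single-bin DFT."""
    re = sum(v * math.cos(2 * math.pi * freq_hz * n / FS) for n, v in enumerate(x))
    im = sum(v * math.sin(2 * math.pi * freq_hz * n / FS) for n, v in enumerate(x))
    return 2 * math.hypot(re, im) / len(x)

clean = level_at(two_tone(0.5), 3_000)             # nothing audible going in
loud = level_at(nonlinear(two_tone(0.5)), 3_000)   # 27 kHz - 24 kHz = 3 kHz comes out
quiet = level_at(nonlinear(two_tone(0.05)), 3_000) # same tones, 20 dB lower
print(clean, loud, quiet)
print(20 * math.log10(quiet / loud))  # change in the 3 kHz product, in dB
```

In this model the 3 kHz product scales with the product of the two tone amplitudes, so attenuating the ultrasonics by 20 dB pushes the audible product down by about 40 dB. The distortion mechanism is real, but how much it matters depends heavily on how much ultrasonic energy actually survives to reach the nonlinearity.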

What Mastering is
When mastering engineers create a master file for CD reproduction, they downsample the 24/96 file submitted by the recording studio to 16/44.1. During this process, the mastering engineer typically adjusts levels, dynamic compression, and equalization to extract as much "good" audio from the source while eliminating as much "bad" audio, or noise, as possible.
Filtering as much useful dynamic range from 24/96 studio files into 16/44.1 CD master files is, in a nutshell, the mastering process.
This is a pretty poor representation of what mastering is, and it's sad that an article on mastering doesn't really bother to explain mastering. I've known top mastering engineers (even ones who have worked at Masterdisk) who do all their work at 16/44.1. Many still prefer to work with analog as much as possible, where the bit depth and sample rate don't mean much. Mastering engineers are all happy to deliver a wide variety of formats as the end product. Moreover, equating "bad" audio with noise, and talking about level changes, dynamics, and EQ as if they have something to do with "extraction", is all wrong, and none of that has anything to do with format. Fundamentally, mastering is about balancing levels, dynamics, and frequencies of a finished mix.

...since iTunes Plus tracks are also 16/44.1, it seems logical to use the files created for CD mastering to make the compressed AAC files sold via iTunes.
iTunes Plus tracks, if sourced from 24/96, never become 16/44.1. As you explain in the next paragraph, they go from 24/96 to float/44.1 to AAC/44.1. (They usually are played at 16/44.1, but with the volume control in between, so the effective bit depth is usually lower.)
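
A rough sketch of the volume control point, under some loud assumptions: the volume control is digital, it is applied before a 16-bit output stage, and dither and noise shaping are ignored. The function name is hypothetical.

```python
import math

DB_PER_BIT = 20 * math.log10(2)  # about 6.02 dB per bit

def effective_bits(output_bits, attenuation_db):
    """Bits still spanned by the signal after a digital volume cut (assumed model)."""
    return output_bits - attenuation_db / DB_PER_BIT

for cut_db in (0, 12, 24):
    print(f"{cut_db:2d} dB cut: {effective_bits(16, cut_db):.1f} effective bits")
```

Under those assumptions, every 6 dB of attenuation costs roughly one bit, so a 24 dB cut leaves a 16-bit output using only about 12 bits.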

Null Test
Shepard performed what is known as a "null test" to prove his theory that specially mastering songs for iTunes to sound more like the CD version is "BS." 
About the only thing a "null test" is good for is determining if two files are identical. It's sort of the audio engineer's equivalent of the "diff" command-line tool. The Ars Technica article quotes Scott Hull arguing against the null test on artistic and perceptual grounds: "...objective tests give us some guide, but they don't account for the fact that our hearing still has an emotional element. We hear emotionally, and you can't measure that." But there are also very sound technical reasons why the null test is simply inappropriate here. When comparing perceptual coding, or even basic EQ or other effects, the null test becomes useless because it is nothing more than subtracting two files sample by sample and seeing what's left. Unfortunately, one of the basic operations you can perform on audio is to shift it in time, which means that the data no longer corresponds sample by sample. Minute shifts in time are the only way to achieve EQ and other frequency domain changes ("Aha," you say, "but FIR filters don't shift in time," but actually they do, they just don't do so recursively). Most other effects, including most dynamics changes and perceptual coding, make drastic changes in time as well (although it's possible to make these kinds of changes without time shifts), so a null test on anything processed this way is really comparing apples to oranges (apples to televisions?).
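
Here is a minimal sketch of why a time shift wrecks a null test. It builds pure test tones, delays a copy by a single sample, and reports the residual level relative to the original. The tones, the one-sample delay, and the helper names are assumptions for illustration; real codecs and EQs do far messier things than a clean delay.

```python
import math

FS = 44_100  # sample rate in Hz
N = 4_410    # 0.1 s, a whole number of cycles of both test tones

def tone(freq_hz, delay_samples=0):
    """A sine tone, optionally delayed by a number of samples."""
    return [math.sin(2 * math.pi * freq_hz * (n - delay_samples) / FS)
            for n in range(N)]

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

def null_residual_db(a, b):
    """Null test: subtract sample by sample, report the residual relative to a, in dB."""
    residual = rms([x - y for x, y in zip(a, b)])
    if residual == 0:
        return float("-inf")  # a perfect null
    return 20 * math.log10(residual / rms(a))

print(null_residual_db(tone(1_000), tone(1_000)))                     # identical files
print(null_residual_db(tone(1_000), tone(1_000, delay_samples=1)))    # 1 kHz, one sample late
print(null_residual_db(tone(10_000), tone(10_000, delay_samples=1)))  # 10 kHz, one sample late
```

Identical files null perfectly. Delay the copy by one sample (about 23 microseconds, utterly inaudible) and the 1 kHz residual is only around 17 dB below the signal, while at 10 kHz the residual comes out louder than the signal itself. The test is measuring the shift, not any audible difference.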


Phew, that's enough for now. I think I got the big ones. Like I said, the conclusions are mostly correct, even if the details above are wrong, but the whole "Mastered for iTunes" thing does seem to miss the point. (Unless the point is marketing, in which case, cheers!)

Updated 5/5/2012: fixed typo and included Scott Hull quote on null test along with some clarifications to that section.
