Thursday, May 30, 2013

The ABCs of PCM (Uncompressed) digital audio

Digital audio can be stored in a wide range of formats. If you are a developer interested in doing anything with audio, whether it's changing the volume, editing chunks out, looping, mixing, or adding reverb, you absolutely must understand the format you are working with. That doesn't mean you need to understand all the details of the file format, which is just a container for the audio which can be read by a library. It does mean you need to understand the data format you are working with. This blog post is designed to give you an introduction to working with audio data formats.

Compressed and Uncompressed Audio

Generally speaking, audio comes in two flavors: compressed and uncompressed. Compressed audio can further be subdivided into different kinds of compression: lossless, which preserves the original content exactly, and lossy which achieves more compression at the expense of degrading the audio. Of these, lossy is by far the most well known and includes MP3, AAC (used in iTunes), and Ogg Vorbis. Much information can be found online about the various kinds of lossy and lossless formats, so I won't go into more detail about compressed audio here, except to say that there are many kinds of compressed audio, each with many parameters.

Uncompressed PCM audio, on the other hand, is defined by two parameters: the sample rate and the bit-depth. Loosely speaking, the sample rate limits the maximum frequency that can be represented by the format, and the bit-depth determines the maximum dynamic range that can be represented by the format. You can think of bit-depth as determining how much noise there is compared to signal.

CD audio is uncompressed and uses a 44,100 Hz sample rate and 16 bit samples. What this means is that audio on a CD is represented by 44,100 separate measurements, or samples, taken per second. Each sample is stored as a 16-bit number. Audio recorded in studios often use a bit depth of 24 bits and sometimes a higher sample rate.

WAV and AIFF files support both compressed and uncompressed formats, but are so rarely used with compressed audio that these formats have become synonymous with uncompressed audio. The most common WAV files use the same parameters as CD audio: 44,100 Hz and bit depth of 16-bits, but other sample rates and bit depths are supported.

Converting From Compressed to Uncompressed Formats

As you probably already know, lots of audio in the world is stored in compressed formats like MP3. However, it's difficult to do any kind of meaningful processing on compressed audio. So, in order to change a compressed file, you must uncompress, process, and re-compress it. Every compression step results in degradation, so compressing it twice results in extra degradation. You can use lossless compression to avoid this, but the extra compression and decompression steps are likely to require a lot of CPU time, and the gains from compression will be relatively minor. For this reason, compressed audio is usually used for delivery and uncompressed audio is usually used in intermediate steps.

However, the reality is that sometimes we process compressed audio. Audiofiles and music producers may scoff, but sometimes that's life. For example, it you are working on mobile applications with limited storage space, telephony and VOIP applications with limited bandwidth, and web applications with many free users, you might find yourself need to store intermediate files in a compressed format. Usually the first step in processing compressed audio, like MP3, is to decompress it. This means converting the compressed format to PCM. Doing this involves a detailed understanding of the specific format. I recommend using a library such as libsoundfileffmpeg or lame for this step.

Uncompressed Audio

Most stored, uncompressed audio is 16-bit. Other bit depths, like 8 and 24 are also common and many other bit-depths exist. Ideally, intermediate audio would be stored in floating point format, as is supported by both WAV and AIFF formats, but the reality is that almost no one does this.

Because 16-bit is so common, let's use that as an example to understand how the data is formatted. 16-bit audio is usually stored as packed 16-bit signed integers. The integers may be big-endian (most common for AIFF) or little-endian (most common for WAV). If there are multiple channels, the channels are usually interleaved. For example, in stereo audio (which has two channels, left and right), you would have one 16-bit integer representing the left channel, followed by one 16-bit integer representing the right channel. These two samples represent the same time and the two together are sometimes called a sample frame or simply a frame.

Sample Frame 1:
Left MSB Left LSB Right MSB Right LSB
Sample Frame 2:
Left MSB Left LSB Right MSB Right LSB
2 sample frames of big-endian, 16-bit interleaved audio. Each box represents one 8-bit byte.

The above example shows 2 sample frames of big-endian, 16-bit interleaved audio. You can tell it's big-endian because the most significant byte (MSB) comes first. It's 16-bit because 2 8-bit bytes make up a single sample. It's interleaved because each left sample is followed by a corresponding right sample in the same frame.

In Java, and most C environments, a 16 bit signed integer is represented with the short datatype. Therefore, to read raw 16 bit data, you will usually want to get the data into an array of shorts. If you are only dealing with C, you can do your IO directly with short arrays, or simply use casting or type punning from a raw char array. In Java, you can use readShort() from DataInputStream.

To store 16-bit stereo interleaved audio in C, you might use a structure like this:

struct {
   short l;
   short r;
} stereo_sample_frame_t ;

or you might simply have an array of shorts:

short samples[];

In the latter case, you would just need to be aware that when you index an even number it's the left channel, and when you index an odd number it's the right channel. Iterating through all your data and finding the max on each channel would look something like this:

int sampleCount = ...//total number of samples = sample frames * channels
int frames = sampleCount / 2 ;
short samples[]; //filled in elsewhere

short maxl = 0;
short maxr = 0;
for( int i=0; i<SIZE; ++i )
   maxl = (short) MAX( maxl, abs( samples[2*i] ) );
   maxr = (short) MAX( maxr, abs( samples[r*i+1] ) );
printf( "Max left %d, Max right %d.", maxl, maxr );

Note how we find the absolute value of each sample. Usually when we are interested in the maximum, we are looking for the maximum deviation from zero, and we don't really care if it's positive or negative -- either way is going to sound equally loud.

Processing Raw Data

You may be able to do all the processing you need to do in the native format of the file. For example, once you have an array of shorts representing the data, you could divide each short by two to cut the volume in half:

int sampleCount; //total number of samples = sample frames * channels
short samples[]; //filled in elsewhere

for( int i=0; i
   samples[i] /= 2 ;

A few things to watch out for:

  • You must actually use the native format of the file or the proper conversion. You can't simply deal with the data as a stream of bytes. I've seen many questions on stack overflow where people make the mistake of dealing with 16-bit audio data byte-by-byte, even though each sample of 16-bit audio is composed of 2 bytes. This is like adding a multidigit number without the carry.
  • You must watch out for overflow. For example, when increasing the volume, be aware that some samples my end up out of range. You must ensure that all samples remain in the correct range for their datatype. The simplest way to handle this is with clipping (discussed below), which will result in some distortion, but is better than "wrap-around" that will happen otherwise. (the example above does not have to watch out for overflow because we are dividing not multiplying.)
  • Round-off error is virtually inevitable. If you are working in an integer format, eg 16-bit, it is almost impossible to deal with roundoff error. The effects of round-off will be minor but ugly. Eventually these errors will accumulate and be noticeable  The example above will definitely have problems with roundoff error.
As long as studio quality isn't your goal, however, you can mix, adjust volume and do a variety of other basic operations without needing to worry too much.

Converting and Using Floating Point Samples

If you need more powerful or flexible processing, you are probably going to want to convert your samples to floating point. Generally speaking, the nominal range used for audio when audio is represented as floating point numbers is [-1,1].

You don't have to abide by this convention. If you like, you can simply convert your raw data to float by casting:

short s = ... // raw data
float f = (float) s;

But if you have some files that are 16-bit and some that are 24-bit or 8-bit, you will end up with unexpected results:

char d1 = ... //data from 8-bit file
float f1 = (float) d1; // now in range [ -128, 127 ]
short d2 = ... //data from 16-bit file
float f2 = (float) d2; // now in range [ -32,768, 32,767 ]

It's hard to know how to use f1 and f2 together since their ranges are so different. For example, if you want to mix the two, you most likely won't be able to hear the 8-bit file. This is why we usually scale audio into the [-1,1] range.

There is much debate about the right constants to use when scaling your integers, but it's hard to go wrong with this:

int i = //data from n-bit file
float f = (float) i ;
f /= M;

where M is 2^(n-1). Now, f is guaranteed to be in the range [-1,1]. After you've done your processing, you'll usually want to convert back. To do so, use the same constant and check for out of range values:

float f  = // processed data
f *= M;
if( f < - M ) f = -2^(n-1);
if( f > M-1)  f = M-1;
i = (int) f;

Distortion and Noise

It's hard to avoid distortion and noise when processing audio. In fact, unless what you are doing is trivial or represents a special case, noise and/or distortion are inevitable. The key is to minimize it, but doing so is not easy. Broadly speaking, noise happens every time you are forced to round and distortion happens when you change values nonlinearly. We potentially created distortion in the code where we converted from a float to an integer with a range check, because any values outside the range boundary would have been treated differently than values inside the range boundary. The more of the signal is out of range the more distortion this will introduce. We created noise in the code where we lowered the volume because we introduced round-off error when we divided by two. We also introduce noise when we convert from floating point to integer. In fact, many mathematical operations will introduce noise.

Any time you are working with integers, you need to watch out for overflows. For example, the following code will mix two input signals represented as an array of shorts. We handle overflows in the same way we did above, by clipping:

short input1[] = ...//filled in elsewhere
short input2[] = ...//filled in elsewhere
// we are assuming input1 and input2 have size SIZE or greater
short output[ SIZE ];

for( int i=0; i<SIZE; ++i )
   int tmp = (int)input1[i] + (int)input2[i];
   if( tmp > SHRT_MAX ) tmp = SHRT_MAX;
   if( tmp < SHRT_MIN ) tmp = SHRT_MIN; 
   output[i] = tmp ;

If it so happens that the signal frequently "clips", then we will hear a lot of distortion. If we want to get rid of distortion altogether, we can eliminate it by dividing by 2. This will reduce the output volume and introduce some round-off noise, but will solve the distortion problem:

for( int i=0; i<SIZE; ++i )
   int tmp = (int)input1[i] + (int)input2[i];
   tmp /= 2;
   output[i] = tmp ;


A few final notes:
  • For some reason, WAV files don't support signed 8-bit format, so when reading and writing WAV files, be aware that 8-bits means unsigned, but in virtually all other cases it's safe to assume integers are signed.
  • Always remember to swap the bytes if the native endian-ness doesn't match the file endian-ness. You'll have to do this again before writing.
  • When reducing the resolution of data (eg, casting from float to int; multiplying an integer by a non-integer, etc), you are introducing noise because you are throwing out data. It might seem as though this will not make much difference, but it turns out that for sampled data in a time-series (like audio) it has a surprising impact. This impact is small enough that for simple audio applications you probably don't need to worry, but for anything studio-quality you will want to understand something called dither, which is the only correct way to solve the problem.
  • You may have come across one of these unfortunate posts, which claims to have found a better way to mix two audio signals. Here's the thing: there is no secret, magical formula that allows you to mix two audio signals and keep them both at the same original volume, but have the mix still be within the same bounds. The correct formula for mixing two signals is the one I described. If volume is a problem, you can either turn up the master volume control on your computer/phone/amplifier/whatever or use some kind of processing like a limiter, which will also degrade your signal, but not as badly as the formula in that post, which produces a terrible kind of distortion (ring modulation).