bjorg

Wednesday, November 20, 2013

Solving acoustics problems

A "waterfall plot" like this one is one of many tools used by
acousticians to determine the problems with a room.
Photo from RealTraps, which provides high-quality bass traps,
an important type of acoustic treatment.
I recently received the following letter (edited):
      

Greetings,

The echo in my local church is really bad.  I am lucky if I can understand 10% of what’s being said.   I have checked with other members of the congregation and without exception they all have the same problem.

The church is medium size with high vaulted ceiling, very large windows with pillars spaced throughout.  The floor is mostly wood.   The speakers are flat against the side walls, spaced approx 15 metres apart and approx 10 feet above the floor.

The speakers are apparently ‘top of the range’… I just wonder if a graphic equalizer was used between the microphone and speaker, would this ‘clean up’ the sound a little?

I know that lining the walls with acoustic tiles and carpeting the floor would lessen the echo, but, we don’t want to do that if we can avoid it.

With regard to putting carpet on the floor, my thoughts are that instead of sound being absorbed by the carpet, the congregation present would absorb just as much as the carpet?.  One other theory I have is regarding the speakers.

If  the speakers were moved…

Michael

Hey Michael,

I sympathize with you. Going to service every week and not being able to understand what is being said must be very frustrating. While this is not the kind of thing I do every day, I do have some training  in this area and will do my best to give you something helpful.

Most churches are built with little attention to acoustics and old churches were built before there was any understanding of what acoustics is. With all those reflective surfaces and no care taken to prevent the acoustic problems that they create, problems are inevitable, and sometimes, such as in your church, they are simply out of hand. In a situation like that, even a great sound-system won't be able to solve the problem.

I recommend you hire a professional in your area to come look at the space and give some more specific feedback. Having them improve the situation may cost anywhere from hundreds to tens of thousands of dollars (or even more) depending on the cause of the problem. However, it's helpful to have some idea of what the solutions are, so that when you hire that professional you are prepared for what's to come. You might also be able to do some more research and take a stab at solving these issues yourself.

For example, it might be useful to listen to the room and conjecture, even without measurements, whether the problem is confined to specific frequencies or is simply a matter of too many echoes. If you are a trained listener, you might stand in various places in the room, clap loudly, and listen to get a sense of this. Even a trained listener would never substitute such methods for actual measurements, but I often find this approach useful for developing a hypothesis (e.g., I might listen and say "I believe there is a problem in the low frequencies" before measuring, then use measurements to confirm or reject this hypothesis). Also, look at the room: are there lots of parallel walls? If so, you are likely suffering from problems at specific frequencies, and it's possible that a targeted, and probably less expensive, approach will help.

Another thing you can do is find someone with some stage acting experience and have them speak loudly and clearly at the pulpit. Have them do this both with and without the sound system and listen to the results. If they sound much clearer without the sound system than with it, that suggests the sound system may be causing at least some of the problems.

If you can't afford an acoustician, but you are willing to experiment a bit, this kind of testing might lead you to something. For example, maybe you notice some large open parallel walls and you agree that covering one or both of them with some heavy draperies is either acceptable or would look nice. You could try it and see if it helps. It's no guarantee, but it might make a difference. Draperies are, of course, unlikely to make that much difference by themselves, so you might consider putting acoustic absorbing material behind them.

Be warned, however, that acoustic treatments done by amateurs without measurements are often beset with problems. For example, you may reduce the overall reverberation time but leave long echoes at certain frequencies. This can yield results that are no better than where you started -- possibly even worse (although in your case I think that's unlikely).

Here are the types of things a professional is likely to recommend. You've already alluded to all of them, but I'll repeat them with some more detail. I put them roughly in order of how likely they are to help, but it does depend on your specific situation:
  • Acoustic treatments. Churches like the one you describe are notorious for highly reflective surfaces like stone and glass, and as you surmised, adding absorptive materials to the walls, floors and ceiling will reduce the echo significantly. Also as you surmised, floor covering may be of limited effectiveness, since people do absorb and diffuse sound themselves; of course, it depends on how much of the floor they cover and where. I understand your hesitation to go this route, since it may affect the aesthetics of the church and it may be expensive. But, as I mentioned above, depending on the specific situation you may be able to achieve a dramatic acoustic improvement with relatively little visual impact, and depending on the treatment needed you may be able to keep costs under control. You should also be able to collaborate with someone who can create acoustic treatments that are either unobtrusive or actually enhance the aesthetics of your space. (Of course, you'll also need someone familiar with things like local fire codes!)
  • Adjusting the speakers. It's certainly possible that putting the speakers in another location would help. If they were hung by a contractor or someone else who did not take acoustics into account, they are likely to be placed poorly; location matters more than the quality of the speakers themselves. Also, if the speakers are not in one cluster at the front, adding the appropriate delay to each set of speakers may help ensure that sound arrives "coherently" from all speakers, which can improve intelligibility significantly. Devices that provide this kind of delay, along with lots of other features, are sold under various names such as "speaker processors," "speaker array controllers," etc.
  • Electronic tools. Although this is likely to be the least effective option, you can usually achieve some improvement with EQ, as you suggested. For permanent installations, I prefer parametric EQs, but a high-quality graphic EQ will also work. An ad-hoc technique for setting the EQ is to increase the gain until you hear feedback, and then notch out the frequency that causes the feedback. Continue increasing the gain until you are happy with the results. You must be very careful to protect your speakers and your hearing when using this technique; both can be easily damaged if you don't know what you are doing. Most speaker processors have built-in parametric EQs, and some even come with a calibrated mike that the device can use to adjust the settings for you automatically. I've done this, and it works great, especially with a little manual tweaking, but you do have to know what you are doing. And, of course, you can't work miracles in a bad room.

Saturday, September 21, 2013

Mapping Parameters


Visualizing a Linear Mapping
Very often we need to "map" one set of values to another. For example, we might have a slider that ranges from 0 to 1 that we want to use to control a frequency setting, or the output of a sine wave (which ranges from -1 to 1) that we want to use to control the intensity of an EQ. In these cases and many more, we can use a linear mapping to get from one range of values to another.

A linear mapping is simply a linear equation, such as y = mx + b, that takes an input, your slider value for example, and gives you back an output. The input is x, and the output is y. The trick is to find the values of m and b.

Let's take a concrete example. Let's say you have the output of a sine wave (say from an LFO) that oscillates between -1 and 1. Now we want to use those values to control a frequency setting from 200 to 2000. In this case, x from the equation above represents the oscillator, and y represents the frequency setting.

We know two things: we want x = -1 to map to y = 200, and x = 1 to map to y = 2000. That gives us two equations for the two unknowns in y = mx + b (m and b), so we can solve for both:

Original equation with both unknowns:
y = mx + b

Substituting our known values for x and y:
200 = (-1)m + b
2000 = (1)m + b

Adding the two equations (the m terms cancel) and solving for b:
2200 = 2b
1100 = b

Substituting b = 1100 into the second equation and solving for m:
2000 = m + 1100
900 = m

Final equation:
y = 900x + 1100

You can check the final equation by substituting -1 and 1 for x and making sure you get 200 and 2000 respectively for y.

So in our LFO/frequency example, we would take our LFO value, say 0.75, and use it as x. Plugging that into the formula gives y = 900(0.75) + 1100 = 1775, the final value for our frequency setting.
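Here's what that might look like in C (a sketch; the helper name linear_map is mine, not from this post). It computes m and b from the endpoints of both ranges and then applies y = mx + b:

#include <stdio.h>

/* Map x from [in_min, in_max] to [out_min, out_max] using y = m*x + b.
   Hypothetical helper, for illustration only. */
static float linear_map( float x, float in_min, float in_max,
                         float out_min, float out_max )
{
    const float m = (out_max - out_min) / (in_max - in_min);
    const float b = out_min - m * in_min;
    return m * x + b;
}

int main( void )
{
    /* An LFO value of 0.75 mapped from [-1,1] to [200,2000] should print 1775. */
    printf( "%f\n", linear_map( 0.75f, -1.0f, 1.0f, 200.0f, 2000.0f ) );
    return 0;
}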

Sunday, July 21, 2013

Peak Meters, dBFS and Headroom

The level meter from Audiofile Engineering's
Spectre program accurately shows peak values
in dBFS
Level meters are one of the most basic features of digital audio software. In software, they are very often implemented as peak meters, which are designed to track the maximum amplitude of the signal. Other kinds of meters, such as VU meters, are often simulations of analog meters. Loudness meters, which attempt to estimate our perception of volume rather than volume itself, are also becoming increasingly common. You may also come across RMS and average meters. In this post, I'm only going to talk about peak meters.

Peak Meters

Peak meters are useful in digital audio because they show the user information that is closely associated with the limits of the medium and because they are efficient and easy to implement. Under normal circumstances, we can expect peak meters to correspond pretty well with our perception of volume, but not perfectly. The general expectation users have when looking at peak meters is that if a signal goes above a certain level at some point, that level should be indicated on the meters. In other words, if the signal goes as high as, say -2 dBFS, over some time period, then someone watching the peak meter during that time will see the meter hit the -2 dBFS mark (see below for more on dBFS). Many peak meters have features such as "peak hold" specifically designed so that the user does not need to stare at the meter.

Beyond that, there are rarely any specifics. Some peak meters show their output linearly, some show their output in dB. Some use virtual LEDs, some a bar graph. In general, if there is a numeric readout or units associated with the meter, the unit should be dBFS.

Now that we know the basics of peak meters, let's figure out how to implement them.

Update Time

Peak meters should feel fast and responsive. However, they don't update instantly. In software, it is not uncommon to have audio samples run at 44100 samples per second while the display refreshes at only 75 times per second, so there is absolutely no point in showing the value of each sample (not to mention the fact that our eyes couldn't keep up). Clearly we need to figure out how to represent a large number of samples with only one value. For peak meters, we do this as follows:

  1. Figure out how often we want to update. For example, every 100 ms (.1s) is a good starting point, and will work well most of the time.
  2. Figure out how many samples we need to aggregate for each update. If we are sampling at 44100 Hz, a common rate, and want to update every .1s, we need N = 44100 * .1 = 4410 samples per update.
  3. Loop on blocks of size N. Find the peak in each block and display that peak. If the graphics system does not allow us to display a given peak, the next iteration should display the max of any undisplayed peaks.

Finding the Peak

Sound is created by air pressure swinging both above
and below the mean pressure.
Finding the peak of each block of N samples is the core of peak metering. To do so, we can't simply find the maximum value of all samples because sound waves contain not just peaks, but also troughs. If those troughs go further from the mean than the peaks, we will underestimate the peak.

The solution to this problem is simply to take the absolute value of each sample, and then find the max of those absolute values. In code, it would look something like this:



// buf holds one block of float samples (e.g. a std::vector<float>); needs <cmath>
float max = 0;
for( size_t i = 0; i < buf.size(); ++i ) {
   const float v = std::fabs( buf[i] );  // absolute value: troughs count as much as peaks
   if( v > max )
      max = v;
}

At the end of this loop, max is your peak value for that block, and you can display it on the meter, or, optionally, calculate its value in dBFS first.

Calculating dBFS or Headroom

(For a more complete and less "arm wavy" intro to decibels, try here or here.) The standard unit for measuring audio levels is the decibel, or dB. But the dB by itself is something of an incomplete unit because, loosely speaking, instead of telling you the amplitude of something, dB tells you the amplitude of something relative to something else. Therefore, to say something has an amplitude of 3 dB is meaningless. Even saying it has an amplitude of 0 dB is meaningless. You always need some point of reference. In digital audio, the standard point of reference is "Full Scale", i.e., the maximum value that digital audio can take on without clipping. If you are representing your audio as floats, 0 dB is nominally calibrated to +/- 1.0. We call this scale dBFS. To convert the above max value (which is always positive because it comes from an absolute value) to dBFS, use this formula:

dBFS = 20 * log10(max);

You may find it odd that the loudest a signal can normally be is 0 dBFS, but this is how it is. You may find it useful to think of dBFS as "headroom", i.e., answering the question "how many dB can I add to the signal before it reaches the maximum?" (Headroom is actually equal to -dBFS, but I've often seen headroom labeled as dBFS when the context makes it clear.)
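Putting the pieces together, here's a small self-contained sketch in C (the function names are mine, chosen for illustration) that finds the peak of one block and reports it in dBFS and as headroom. A completely silent block has no finite dB value, so it returns negative infinity:

#include <math.h>
#include <stdio.h>

/* Peak of one block of samples: the max of the absolute values, as described above. */
static float block_peak( const float *buf, int n )
{
    float max = 0.0f;
    for( int i = 0; i < n; ++i ) {
        const float v = fabsf( buf[i] );
        if( v > max )
            max = v;
    }
    return max;
}

/* Convert a non-negative peak value to dBFS. */
static float peak_to_dbfs( float peak )
{
    return peak > 0.0f ? 20.0f * log10f( peak ) : -INFINITY;
}

int main( void )
{
    const float block[4] = { 0.1f, -0.5f, 0.25f, -0.05f };
    const float peak = block_peak( block, 4 );
    printf( "peak %.2f = %.1f dBFS (headroom %.1f dB)\n",
            peak, peak_to_dbfs( peak ), -peak_to_dbfs( peak ) );  /* about -6 dBFS */
    return 0;
}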

Thursday, May 30, 2013

The ABCs of PCM (Uncompressed) digital audio

Digital audio can be stored in a wide range of formats. If you are a developer interested in doing anything with audio, whether it's changing the volume, editing chunks out, looping, mixing, or adding reverb, you absolutely must understand the format you are working with. That doesn't mean you need to understand all the details of the file format, which is just a container for the audio and can be read by a library. It does mean you need to understand the data format you are working with. This blog post is designed to give you an introduction to working with audio data formats.

Compressed and Uncompressed Audio

Generally speaking, audio comes in two flavors: compressed and uncompressed. Compressed audio can further be subdivided into two kinds of compression: lossless, which preserves the original content exactly, and lossy, which achieves more compression at the expense of degrading the audio. Of these, lossy is by far the most well known and includes MP3, AAC (used in iTunes), and Ogg Vorbis. Much information can be found online about the various lossy and lossless formats, so I won't go into more detail about compressed audio here, except to say that there are many kinds, each with many parameters.

Uncompressed PCM audio, on the other hand, is defined by two parameters: the sample rate and the bit-depth. Loosely speaking, the sample rate limits the maximum frequency that can be represented by the format, and the bit-depth determines the maximum dynamic range that can be represented by the format. You can think of bit-depth as determining how much noise there is compared to signal.

CD audio is uncompressed and uses a 44,100 Hz sample rate and 16-bit samples. What this means is that audio on a CD is represented by 44,100 separate measurements, or samples, taken per second. Each sample is stored as a 16-bit number. Audio recorded in studios often uses a bit depth of 24 bits and sometimes a higher sample rate.

WAV and AIFF files support both compressed and uncompressed formats, but are so rarely used with compressed audio that these formats have become synonymous with uncompressed audio. The most common WAV files use the same parameters as CD audio: 44,100 Hz and bit depth of 16-bits, but other sample rates and bit depths are supported.

Converting From Compressed to Uncompressed Formats

As you probably already know, lots of audio in the world is stored in compressed formats like MP3. However, it's difficult to do any kind of meaningful processing on compressed audio. So, in order to change a compressed file, you must uncompress, process, and re-compress it. Every compression step results in degradation, so compressing it twice results in extra degradation. You can use lossless compression to avoid this, but the extra compression and decompression steps are likely to require a lot of CPU time, and the gains from compression will be relatively minor. For this reason, compressed audio is usually used for delivery and uncompressed audio is usually used in intermediate steps.

However, the reality is that sometimes we process compressed audio. Audiophiles and music producers may scoff, but sometimes that's life. For example, if you are working on mobile applications with limited storage space, telephony and VOIP applications with limited bandwidth, or web applications with many free users, you might find yourself needing to store intermediate files in a compressed format. Usually the first step in processing compressed audio, like MP3, is to decompress it. This means converting the compressed format to PCM. Doing this involves a detailed understanding of the specific format, so I recommend using a library such as libsndfile, ffmpeg, or lame for this step.

Uncompressed Audio

Most stored, uncompressed audio is 16-bit. Other bit depths, like 8 and 24, are also common, and many others exist. Ideally, intermediate audio would be stored in floating-point format, as is supported by both WAV and AIFF, but the reality is that almost no one does this.

Because 16-bit is so common, let's use that as an example to understand how the data is formatted. 16-bit audio is usually stored as packed 16-bit signed integers. The integers may be big-endian (most common for AIFF) or little-endian (most common for WAV). If there are multiple channels, the channels are usually interleaved. For example, in stereo audio (which has two channels, left and right), you would have one 16-bit integer representing the left channel, followed by one 16-bit integer representing the right channel. These two samples represent the same time and the two together are sometimes called a sample frame or simply a frame.

Sample Frame 1:  [ Left MSB ][ Left LSB ][ Right MSB ][ Right LSB ]
Sample Frame 2:  [ Left MSB ][ Left LSB ][ Right MSB ][ Right LSB ]
2 sample frames of big-endian, 16-bit interleaved audio. Each box represents one 8-bit byte.

The above example shows 2 sample frames of big-endian, 16-bit interleaved audio. You can tell it's big-endian because the most significant byte (MSB) comes first. It's 16-bit because 2 8-bit bytes make up a single sample. It's interleaved because each left sample is followed by a corresponding right sample in the same frame.

In Java and most C environments, a 16-bit signed integer is represented with the short datatype. Therefore, to read raw 16-bit data, you will usually want to get the data into an array of shorts. If you are only dealing with C, you can do your IO directly with short arrays, or simply use casting or type punning from a raw char array. In Java, you can use readShort() from DataInputStream.
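For example, here's a small C sketch (my own illustration, with made-up byte values) that assembles little-endian 16-bit samples from a raw byte buffer, the way you might after reading the data portion of a WAV file:

#include <stdint.h>
#include <stdio.h>

/* Assemble one little-endian 16-bit sample from two raw bytes. */
static int16_t sample_from_le_bytes( const unsigned char *p )
{
    return (int16_t)( p[0] | (p[1] << 8) );
}

int main( void )
{
    /* Two interleaved stereo frames of raw little-endian bytes (made-up values). */
    const unsigned char raw[8] = { 0x34, 0x12,  0xCC, 0xFF,   /* frame 1: L, R */
                                   0x00, 0x40,  0x01, 0x80 }; /* frame 2: L, R */
    for( int frame = 0; frame < 2; ++frame ) {
        int16_t l = sample_from_le_bytes( raw + 4*frame );
        int16_t r = sample_from_le_bytes( raw + 4*frame + 2 );
        printf( "frame %d: left %6d right %6d\n", frame, l, r );
    }
    return 0;
}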

To store 16-bit stereo interleaved audio in C, you might use a structure like this:

typedef struct {
   short l;
   short r;
} stereo_sample_frame_t;

or you might simply have an array of shorts:

short *samples;  /* interleaved: L, R, L, R, ... */

In the latter case, you would just need to be aware that when you index an even number it's the left channel, and when you index an odd number it's the right channel. Iterating through all your data and finding the max on each channel would look something like this:

int sampleCount = ... ; // total number of samples = sample frames * channels
int frames = sampleCount / 2;
short *samples = ... ;  // interleaved data, filled in elsewhere

#define MAX( a, b ) ((a) > (b) ? (a) : (b))

short maxl = 0;
short maxr = 0;
for( int i = 0; i < frames; ++i ) {
   maxl = (short) MAX( maxl, abs( samples[2*i] ) );     // even indices: left channel
   maxr = (short) MAX( maxr, abs( samples[2*i+1] ) );   // odd indices: right channel
}
printf( "Max left %d, Max right %d.\n", maxl, maxr );

Note how we find the absolute value of each sample. Usually when we are interested in the maximum, we are looking for the maximum deviation from zero, and we don't really care if it's positive or negative -- either way is going to sound equally loud.

Processing Raw Data

You may be able to do all the processing you need to do in the native format of the file. For example, once you have an array of shorts representing the data, you could divide each short by two to cut the volume in half:

int sampleCount = ... ; // total number of samples = sample frames * channels
short *samples = ... ;  // filled in elsewhere

for( int i = 0; i < sampleCount; ++i ) {
   samples[i] /= 2;
}


A few things to watch out for:

  • You must actually use the native format of the file, or do the proper conversion. You can't simply deal with the data as a stream of bytes. I've seen many questions on Stack Overflow where people make the mistake of dealing with 16-bit audio data byte-by-byte, even though each sample of 16-bit audio is composed of 2 bytes. This is like adding a multi-digit number without the carry.
  • You must watch out for overflow. For example, when increasing the volume, be aware that some samples may end up out of range. You must ensure that all samples remain in the correct range for their datatype. The simplest way to handle this is with clipping (discussed below), which will result in some distortion, but that's better than the "wrap-around" that will happen otherwise. (The example above does not need to watch out for overflow because we are dividing, not multiplying.)
  • Round-off error is virtually inevitable. If you are working in an integer format, e.g. 16-bit, it is almost impossible to avoid round-off error. The effects of round-off will be minor but ugly, and eventually these errors will accumulate and become noticeable. The example above will definitely have problems with round-off error.
As long as studio quality isn't your goal, however, you can mix, adjust volume and do a variety of other basic operations without needing to worry too much.

Converting and Using Floating Point Samples

If you need more powerful or flexible processing, you are probably going to want to convert your samples to floating point. Generally speaking, the nominal range used for audio when audio is represented as floating point numbers is [-1,1].

You don't have to abide by this convention. If you like, you can simply convert your raw data to float by casting:

short s = ... // raw data
float f = (float) s;

But if you have some files that are 16-bit and some that are 24-bit or 8-bit, you will end up with unexpected results:

char d1 = ... //data from 8-bit file
float f1 = (float) d1; // now in range [ -128, 127 ]
short d2 = ... //data from 16-bit file
float f2 = (float) d2; // now in range [ -32,768, 32,767 ]

It's hard to know how to use f1 and f2 together since their ranges are so different. For example, if you want to mix the two, you most likely won't be able to hear the 8-bit file. This is why we usually scale audio into the [-1,1] range.

There is much debate about the right constants to use when scaling your integers, but it's hard to go wrong with this:

int i = ... ; // data from an n-bit file
float f = (float) i;
f /= M;

where M is 2^(n-1). Now, f is guaranteed to be in the range [-1,1]. After you've done your processing, you'll usually want to convert back. To do so, use the same constant and check for out of range values:

float f = ... ; // processed data
f *= M;
if( f < -M )    f = -M;      // clip to the valid n-bit range
if( f > M - 1 ) f = M - 1;
int i = (int) f;

Distortion and Noise

It's hard to avoid distortion and noise when processing audio. In fact, unless what you are doing is trivial or represents a special case, noise and/or distortion are inevitable. The key is to minimize them, but doing so is not easy. Broadly speaking, noise happens every time you are forced to round, and distortion happens when you change values nonlinearly. We potentially created distortion in the code where we converted from a float to an integer with a range check, because any values outside the range boundary are treated differently than values inside it. The more of the signal that is out of range, the more distortion this introduces. We created noise in the code where we lowered the volume, because we introduced round-off error when we divided by two. We also introduce noise when we convert from floating point to integer. In fact, many mathematical operations will introduce noise.

Any time you are working with integers, you need to watch out for overflows. For example, the following code will mix two input signals represented as an array of shorts. We handle overflows in the same way we did above, by clipping:

short input1[] = ...//filled in elsewhere
short input2[] = ...//filled in elsewhere
// we are assuming input1 and input2 have size SIZE or greater
short output[ SIZE ];

for( int i = 0; i < SIZE; ++i ) {
   int tmp = (int) input1[i] + (int) input2[i];
   if( tmp > SHRT_MAX ) tmp = SHRT_MAX;   // SHRT_MAX/SHRT_MIN are from <limits.h>
   if( tmp < SHRT_MIN ) tmp = SHRT_MIN;
   output[i] = (short) tmp;
}

If it so happens that the signal frequently "clips", then we will hear a lot of distortion. If we want to get rid of that distortion altogether, we can divide the sum by 2. This will reduce the output volume and introduce some round-off noise, but it solves the distortion problem:

for( int i = 0; i < SIZE; ++i ) {
   int tmp = (int) input1[i] + (int) input2[i];
   tmp /= 2;
   output[i] = (short) tmp;
}

Notes:

A few final notes:
  • For some reason, WAV files don't support signed 8-bit format, so when reading and writing WAV files, be aware that 8-bits means unsigned, but in virtually all other cases it's safe to assume integers are signed.
  • Always remember to swap the bytes if the native endian-ness doesn't match the file endian-ness. You'll have to do this again before writing.
  • When reducing the resolution of data (e.g., casting from float to int, multiplying an integer by a non-integer, etc.), you are introducing noise because you are throwing out data. It might seem as though this will not make much difference, but it turns out that for sampled data in a time series (like audio) it has a surprising impact. This impact is small enough that for simple audio applications you probably don't need to worry, but for anything studio-quality you will want to understand something called dither, which is the only correct way to solve the problem.
  • You may have come across one of these unfortunate posts, which claims to have found a better way to mix two audio signals. Here's the thing: there is no secret, magical formula that allows you to mix two audio signals and keep them both at the same original volume, but have the mix still be within the same bounds. The correct formula for mixing two signals is the one I described. If volume is a problem, you can either turn up the master volume control on your computer/phone/amplifier/whatever or use some kind of processing like a limiter, which will also degrade your signal, but not as badly as the formula in that post, which produces a terrible kind of distortion (ring modulation).

Tuesday, November 27, 2012

Audio IIR v FIR EQs


Digital filters come in two flavors: IIR (or "Infinite Impulse Response") and FIR (or "Finite Impulse Response"). Those complex acronyms may confuse you, so let's shed a little light on the situation by defining both and explaining the differences.

Some people are interested in which is better. Unfortunately, as with many things, there is no easy answer to that question other than "it depends", and sometimes what it depends on is your ears. I won't stray too deep into the field of opinions, but I will try to mention why some people claim one is better than the other, and what some of the advantages and disadvantages are in different situations.

How Filters Work

When you design a filter, you start with a set of specifications. To audio engineers, this might be a bit vague, like "boost 1 kHz by 3 dB", but electrical engineers are usually trained to design filters with very specific constraints. However you start, there's usually some long set of equations and rules used to "design" the filter, depending on what type of filter you are designing and what the specific constraints are (to see one way you might design a filter, see this post on audio EQ design). Once the filter is "designed", you can actually process audio samples.

IIR Filters

Once the filter is designed, the filter itself is implemented as difference equations, like this:

    y[i] = a0*x[i] + a1*x[i-1] + ... + an*x[i-n] - b1*y[i-1] - ... - bm*y[i-m]

In this case, y is an array storing the output, and x is an array storing the input. Note that each output is a linear function of previous inputs and outputs, as well as the current input.

In order to know the current value of y, we need to know the last value of y, and to know that, we must know still earlier values of y, and so on, all the way back until we reach our initial conditions. For this reason, this kind of filter is sometimes called a "recursive" filter. In principle, this filter can be given a finite input and it will produce output forever. Because its response is infinite, we call this filter an IIR, or "Infinite Impulse Response", filter.

(To further confuse the terminology, IIR filters are often designed with certain constraints that make them "minimum phase." While IIR filters are not all minimum phase, many people use the terms "recursive", "IIR" and "minimum phase" interchangeably.)

Digital IIR filters are often modeled after analog filters, and in many ways analog-modeled IIR filters sound like analog filters. They are very efficient, too: for audio purposes, they usually only require a few multiplies.
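As a tiny illustration (my own sketch, not from this post; a complete second-order implementation appears in the "Basic Audio EQs" post below), here is a one-pole low-pass, about the simplest recursive filter there is. Each output feeds back into the next one, so an impulse keeps producing ever smaller output forever:

#include <stdio.h>

int main( void )
{
    /* One-pole low-pass: y[i] = a*x[i] + (1 - a)*y[i-1].
       The y[i-1] feedback term is what makes the filter recursive (IIR). */
    const float a = 0.1f;
    const float x[8] = { 1, 0, 0, 0, 0, 0, 0, 0 };  /* an impulse */
    float y_prev = 0.0f;

    for( int i = 0; i < 8; ++i ) {
        const float y = a * x[i] + (1.0f - a) * y_prev;
        printf( "y[%d] = %f\n", i, y );
        y_prev = y;
    }
    return 0;
}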

FIR Filters

FIR filters, on the other hand, are usually implemented with a difference equation that looks like this:

    y[i] = a0*x[i] + a1*x[i-1] + a2*x[i-2] + ... + an*x[i-n] + an*x[i-n-1] + ... + a1*x[i-2n] + a0*x[i-2n-1]

In this case, we don't use previous outputs: to calculate the current output, we only need to know a fixed number of the most recent inputs. This can improve the numerical stability of the filter because round-off errors are not accumulated inside the filter. Generally speaking, however, FIR filters are much more CPU-intensive for a comparable response, and have some other problems, such as high latency and both pass-band and stop-band ripple.

If an FIR filter can be implemented using a difference equation that is symmetrical, like the one above, it has a special property called "linear phase." Linear phase filters delay all frequencies in the signal by the same amount, which is not possible with IIR filters.
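Here's a minimal sketch of an FIR filter as a plain convolution (my own example; the five symmetric taps are chosen only for illustration, not designed for any particular response):

#include <stdio.h>

int main( void )
{
    /* Symmetric (linear-phase) FIR: each output is a weighted sum of recent inputs only. */
    const float a[5] = { 0.1f, 0.2f, 0.4f, 0.2f, 0.1f };  /* symmetric taps */
    const float x[10] = { 0, 0, 1, 1, 1, 1, 1, 0, 0, 0 };
    float y[10];

    for( int i = 0; i < 10; ++i ) {
        y[i] = 0.0f;
        for( int k = 0; k < 5; ++k )
            if( i - k >= 0 )              /* treat samples before the start as 0 */
                y[i] += a[k] * x[i-k];
        printf( "y[%d] = %.2f\n", i, y[i] );
    }
    return 0;
}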

Which Filter?

When deciding which filter to use, there are many things to take into account. Here are some of those things:

  • Some people feel that linear phase FIR filters sound more natural and have fewer "artifacts".
  • FIR filters are usually much more processor intensive for the same response.
  • FIR filters have "ripple" in both the passband and stopband, meaning the response is "jumpy". IIR filters can be designed without any ripple.
  • IIR filters can be easily designed to sound like analog filters.
  • IIR filters require careful design to ensure stability and good numerical error properties; however, that art is fairly advanced.
  • FIR filters generally have a higher latency.

Thursday, August 23, 2012

Basic Audio EQs

In my last post, I looked at why it's usually better to do EQ (or filtering) in the time domain than the frequency domain as far as audio is concerned, but I didn't spend much time explaining how you might implement a time-domain EQ. That's what I'm going to do now.

The theory behind time-domain filters could fill a book. Instead of trying to cram you full of theory we'll just skip ahead to what you need to know to do it. I'll assume you already have some idea of what a filter is.

Audio EQ Cookbook

The Audio EQ Cookbook by Robert Bristow-Johnson is a great, albeit very terse, description of how to build basic audio EQs. These EQs can be described as second-order digital filters, sometimes called "biquads" because the equation that describes them contains two quadratics. In audio, we sometimes use other kinds of filters, but second-order filters are a real workhorse. First-order filters don't do much: they generally just allow us to adjust the overall balance of high and low frequencies. This can be useful in "tone control" circuits, like you might find on some stereos and guitars, but not much else. Second-order filters give us more control -- we can "dial in" a specific frequency, or increase or decrease frequencies above and below a certain threshold, with a fair degree of accuracy, for example. If we need even more control than a second-order filter offers, we can often simply take several second-order filters and place them in series to simulate the effect of a single higher-order filter.

Notice I said series, though. Don't try putting these filters in parallel, because they not only alter the frequency response, but also the phase response, so when you put them in parallel you might get unexpected results. For example, if you take a so-called all-pass filter and put it in parallel with no filter, the result will not be a flat frequency response, even though you've combined the output of two signals that have the same frequency response as the original signal.

Using the Audio EQ Cookbook, we can design a peaking, high-pass, low-pass, band-pass, notch (or band-stop), or shelving filter. These are the basic filters used in audio. We can even design that crazy all-pass filter I mentioned which actually does come in handy if you are building a phaser. (It has other uses, too, but that's for another post.)

Bell Filter

Let's design a "bell", or "peaking", filter using RBJ's cookbook. Most other filters in the cookbook are either similar to the bell or simpler, so once you understand the bell, you're golden. To start with, you will need to know the sample rate of the audio going into and coming out of your filter, and the center frequency of your filter. The center frequency, in the case of the bell filter, is the frequency that is most affected by your filter. You will also want to define the width of the filter, which can be done in a number of ways, usually with some variation on "Q" or "quality factor" and "bandwidth". RBJ's filters define bandwidth in octaves, and you want to be careful that you don't extend the top of the bandwidth above the Nyquist frequency (1/2 the sample rate), or your filter won't work. We also need to know how much of our center frequency to add, in dB (to cut instead of boost, use a negative value; for no change, use 0).

Fs = Sample Rate
F0 = Center Frequency (always less than Fs/2)
BW = Bandwidth in octaves
g = gain in dB

Great! Now we are ready to begin our calculations. First, RBJ suggests calculating some intermediate values:

A = 10^(g/40)
w0 = 2*pi*F0/Fs
c = cos(w0)
s = sin(w0)
alpha = s*sinh( ln(2)/2 * BW * w0/s )

This is a great chance to use that hyperbolic sine button on your scientific calculator that, until now, has only been collecting dust. Now that we've done that, we can finally calculate the filter coefficients, which we use when actually processing data:

b0 = 1 + alpha*A
b1 = -2*c
b2 = 1 - alpha*A
a0 = 1 + alpha/A
a1 = -2*c
a2 = 1 - alpha/A

Generally speaking, we want to "normalize" these coefficients, so that a0 = 1. We can do this by dividing each coefficient by a0. Do this in advance or the electrical engineers will laugh at you:

b0 /= a0
b1 /= a0
b2 /= a0
a1 /= a0
a2 /= a0
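In C, those calculations might look something like this (a sketch; the struct and function names are my own, not from the cookbook):

#include <math.h>

#define PI 3.14159265358979323846

typedef struct {
    double b0, b1, b2, a1, a2;   /* normalized coefficients (a0 already divided out) */
} biquad_coeffs;

/* Bell/peaking filter coefficients, following the formulas above. */
static biquad_coeffs bell_coeffs( double Fs, double F0, double BW, double gain_db )
{
    const double A     = pow( 10.0, gain_db / 40.0 );
    const double w0    = 2.0 * PI * F0 / Fs;
    const double c     = cos( w0 );
    const double s     = sin( w0 );
    const double alpha = s * sinh( log( 2.0 ) / 2.0 * BW * w0 / s );  /* log() is ln */

    const double a0 = 1.0 + alpha / A;
    biquad_coeffs k;
    k.b0 = (1.0 + alpha * A) / a0;
    k.b1 = (-2.0 * c)        / a0;
    k.b2 = (1.0 - alpha * A) / a0;
    k.a1 = (-2.0 * c)        / a0;
    k.a2 = (1.0 - alpha / A) / a0;
    return k;
}

For example, bell_coeffs( 44100, 1000, 1.0, 3.0 ) would give you the coefficients for "boost 1 kHz by 3 dB" with a one-octave bandwidth.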

Now, in pseudocode, here's how we process our data, one sample at a time using a "process" function that looks something like this:

number xmem1, xmem2, ymem1, ymem2;

void reset() {
   xmem1 = xmem2 = ymem1 = ymem2 = 0;
}

number process( number x ) {
   number y = b0*x + b1*xmem1 + b2*xmem2 - a1*ymem1 - a2*ymem2;

   xmem2 = xmem1;
   xmem1 = x;
   ymem2 = ymem1;
   ymem1 = y;

   return y;
}

You'll probably have some kind of loop that your process function goes in, since it will get called once for each audio sample.
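For example, the loop might look like this (pseudocode again, with in, out and numSamples standing in for whatever your audio framework gives you):

for( i = 0; i < numSamples; ++i ) {
   out[i] = process( in[i] );
}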

There's actually more than one way to implement the process function given that particular set of coefficients. This implementation is called "Direct Form I" and happens to work pretty darn well most of the time. "Direct form II" has some admirers, but those people are either suffering from graduate-school-induced trauma or actually have some very good reason for doing what they are doing that in all likelihood does not apply to you. There are of course other implementations, but DFI is a good place to start.

You may have noticed that the output of the filter, y, is stored and used as an input to future iterations. The filter is therefore "recursive". This has several implications:

  • The filter is fairly sensitive to errors in the recursive values and coefficients. Because of this, we need to take care with the error in our y values. In practice, on computers, we usually just need to use a high-resolution floating-point type (i.e., double precision) to store them (on fixed-point hardware, it is often another matter).
  • Another issue is that you can't just blindly set the values of your coefficients, or your filter may become unstable. Fortunately, the coefficients that come out of RBJ's equations always result in stable filters, but don't go messing around. For example, you might be tempted to interpolate coefficients from one set of values to another to simulate a filter sweep. Resist this temptation or you will unleash the numerical fury of hell! The values in between may be "unstable", meaning that your output will run off to infinity. Madness, delirium, vomiting and broken speakers are often the unfortunate casualties.
  • On some platforms you will have to deal with something called "denormal" numbers. This is a major pain in the ass, I'm sorry to say. Basically it means your performance can be between 10 and 100 times worse than it should be because the CPU is busy calculating tiny numbers you don't care about. This is one of the rare cases where I would advocate optimizing before you measure a problem, because your code moves around, the issue comes and goes, and it's very hard to trace. In this case, the easiest solution is probably to do something like this (imagine we are in C for a moment):


// Treat the float as its bit pattern: if the exponent bits are all zero, the value is
// denormal (or zero), so we flush it to zero to avoid the performance penalty.
#define IS_DENORMAL(f) (((*(unsigned int *)&(f)) & 0x7f800000) == 0)

float xmem1, xmem2, ymem1, ymem2;

void reset() {
   xmem1 = xmem2 = ymem1 = ymem2 = 0;
}

float process( float x ) {
   float y = b0*x + b1*xmem1 + b2*xmem2 - a1*ymem1 - a2*ymem2;

   if( IS_DENORMAL( y ) )
      y = 0;

   xmem2 = xmem1;
   xmem1 = x;
   ymem2 = ymem1;
   ymem1 = y;

   return y;
}

Okay, happy filtering!

Wednesday, August 8, 2012

Why EQ Is Done In the Time Domain

In my last post, I discussed how various audio processing may be best done in the frequency or time domain. Specifically, I suggested that EQ, which is a filter that alters the frequency balance of a signal, is best done in the time domain, not the frequency domain. (See my next post if you want to learn how to implement a time-domain filter.)

If this seems counterintuitive to you, rest assured you are not alone. I've been following the "audio" and "FFT" tags (among others) on Stack Overflow, and it's clear that many people attempt to implement EQs in the frequency domain, only to find that they run into a variety of problems.

Frequency Domain Filters

Let's say you want to eliminate or reduce high frequencies in your signal. This is called a "low-pass" filter or, less commonly, a "high-cut" filter. In the frequency domain, high frequencies get "sorted" into designated "bins", where you can manipulate them or even set them to zero. This seems like an ideal way to do low-pass filtering, but let's explore the process to see why it might not work out so well.

Our first attempt at a low-pass filter, implemented with the FFT, might look something like this:
  • loop on audio input
  • when enough audio has been received, perform the FFT, which gives us the audio in the frequency domain
    • in the frequency domain, perform the manipulations we want; in the case of eliminating high frequencies, we set the bins representing high frequencies to 0
    • perform the inverse FFT to get the audio back in the time domain
    • output that chunk of audio

But there are quite a few problems with that approach:
  • We must wait for a chunk of audio before we can even begin processing, which means we will incur latency. The higher quality the filter we want, the more audio we need to wait for. If the input buffer size does not match the FFT size, extra buffering needs to be done.
  • The FFT, though efficient compared to the DFT (which is the FFT without the "fast" part), performs worse than linear time, and we need to do both the FFT and its inverse, which is computationally similar. EQing with the FFT is therefore generally very inefficient compared to comparable time-domain filters.
  • Because our output chunk has been processed in the frequency domain independently of the samples in neighboring chunks, the audio in neighboring chunks may not be continuous. One solution is to process the entire file as one chunk (which only works for offline, rather than real-time, processing, and is computationally expensive). The better solution is the OLA, or overlap-add, method, but this involves complexity that many people miss when implementing a filter this way.
  • Filters implemented via the FFT, as well as time-domain filters designed via the inverse FFT, often do not perform the way people expect. For example, many people expect that if they set all values in bins above a certain frequency to 0, then all frequencies above the given frequency will be eliminated. This is not the case. Instead, the frequency response at the bin frequencies will be 0, but the response between those frequencies is free to fluctuate -- and it does fluctuate, often greatly. This fluctuation is called "ripple." There are techniques for reducing ripple, but they are complex, and they don't eliminate it. Note that, in general, frequencies across the entire spectrum are subject to ripple, so even manipulating a small frequency band may create ripple across the entire spectrum.
  • FFT filters suffer from so-called "pre-echo", where sounds can be heard before the main sound hits. In and of itself, this isn't really a problem, but sounds are "smeared" so badly by many designs that many in the audio world feel these filters can affect the impact of transients and stereo imaging if not implemented and used correctly.
So it's clear that FFT filters may not be right, or if they are, they involve much more complexity than many people first realize.

As a side note, one case where it might be worth all that work is a special case of so-called FIR filters (also sometimes called "linear phase" filters). These are sometimes used in audio production and in other fields. In audio, they are usually used only in mastering because of their high latency and computational cost, but even then, many engineers don't like them (while others swear by them). FIR filters are best implemented in the time domain as well, until the number of "taps" in the filter becomes enormous, which it sometimes does, at which point it actually becomes more efficient to implement them using an FFT with overlap-add. FIR filters suffer from many of the problems mentioned above, including pre-echo, high computational cost and latency, but they do have some acoustical properties that make them desirable in some applications.

Time Domain Filters

Let's try removing high frequencies in the time domain instead. In the time domain, high frequencies are represented by the parts of the signal that change quickly, and low frequencies are represented as the parts that change slowly. One simple way to remove high frequencies, then, would be to use a moving average filter:

y(n) = { x(n) + x(n-1) + .... + x(n-M) } / (M+1)

where x(i) is your input sample at time i, and y(i) is your output sample at time i. No FFT required. (This is not the best filter for removing high frequencies -- in fact we can do WAY better -- but it is my favorite way to illustrate the point, and the moving-average filter is common in economics, image processing and other fields partly for this reason. A short code sketch follows the list below.) Several advantages are immediately obvious, and some are not so obvious:
  • Each input sample can be processed one at a time to produce one output sample without having to chunk or wait for more audio. Therefore, there are also no continuity issues and minimal latency.
  • It is extremely efficient, with only a few multiplies, adds and memory stores/retrievals required per sample.
  • These filters can be designed to closely mimic analog filters.
A major disadvantage is that it is not immediately obvious how to design a high-quality filter in the time domain. In fact, it can take some serious math to do so. It's also worth noting that many time-domain filters, like frequency domain filters, also suffer from ripple, but for many design methods, this ripple is well defined and can be limited in various ways.
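As promised, here's a sketch of the moving-average low-pass in C (my own example; M = 3, so each output averages the current sample and the previous three):

#include <stdio.h>

#define M 3   /* average the current sample and the previous M samples */

int main( void )
{
    const float x[10] = { 0, 0, 1, -1, 1, -1, 1, 0, 0, 0 };  /* a fast-changing burst */
    float y[10];

    for( int n = 0; n < 10; ++n ) {
        float sum = 0.0f;
        for( int k = 0; k <= M; ++k )
            if( n - k >= 0 )              /* treat samples before the start as 0 */
                sum += x[n-k];
        y[n] = sum / (M + 1);
        printf( "y[%d] = %.2f\n", n, y[n] );
    }
    return 0;
}

The rapidly alternating (high-frequency) part of the input comes out much smaller than it went in, which is exactly the low-pass behavior we wanted.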

In the end, the general rule is that, for a given level of performance, you can get much better results in the time domain than in the frequency domain.