It's not necessarily as simple as it seems to find the pitch from an FFT. Some pre-processing is required, as well as some knowledge of how the data is organized.

The FFT is not actually the best method available for tracking or detecting the pitch of an audio signal. While it is possible to make a good pitch tracker using the FFT, doing it right requires a tremendous amount of work. The algorithm shown here works, and works pretty well, but if you need something that converges on the correct pitch really quickly, is very accurate, or tracks multiple notes simultaneously, you need something else.

Still, you can create a decent pitch tracking algorithm that's reasonably easy to understand using the FFT. It doesn't require too much work, and I've explained it and provided code in the form of a command-line guitar tuner app, written in C, which you can get from GitHub. It compiles and runs on Mac OS X, and you should be able to get it to run on other platforms without much trouble. If you want to port it to other languages, that shouldn't be too hard either. It's worth noting that I specifically designed this app to be similar to the tuner described by Craig A. Lindley in Digital Audio with Java, so if you are looking for Java source code, you can check out his code (although there are differences between his code and mine).

## The Big Picture

To do our pitch detection, we basically loop on the following steps:

- Read enough data to fill the FFT
- Low-pass the data
- Apply a window to the data
- Transform the data using the FFT
- Find the peak value in the transformed data
- Compute the peak frequency from the index of the peak value in the transformed data

This is the main processing loop for the tuner, with some stuff left out:

    while( running )
    {
        // read some data
        err = Pa_ReadStream( stream, data, FFT_SIZE );

        // low-pass (two passes through the same second-order filter)
        for( int j=0; j<FFT_SIZE; ++j ) {
            data[j] = processSecondOrderFilter( data[j], mem1, a, b );
            data[j] = processSecondOrderFilter( data[j], mem2, a, b );
        }

        // window
        applyWindow( window, data, FFT_SIZE );

        // do the fft (the input is real, so the imaginary part starts at zero)
        for( int j=0; j<FFT_SIZE; ++j )
            datai[j] = 0;
        applyfft( fft, data, datai, false );

        // find the peak (only the first half of the spectrum is meaningful for real input)
        float maxVal = -1;
        int maxIndex = -1;
        for( int j=0; j<FFT_SIZE/2; ++j ) {
            float v = data[j] * data[j] + datai[j] * datai[j] ;
            if( v > maxVal ) {
                maxVal = v;
                maxIndex = j;
            }
        }
        float freq = freqTable[maxIndex];
        //...
    }

Let's go over each of the steps and see how they work.

## Audio Data

We always need to start with a sequence of numbers representing the amplitude of audio over time (sometimes called "linear PCM audio"). This is what we get from most uncompressed audio formats like AIFF and WAV. It's also what you get from audio APIs like ASIO, CoreAudio and ALSA. In this case, we are using PortAudio, which acts like a portable wrapper around these and other APIs. If you have a compressed format such as MP3 or OGG, you will have to convert it to uncompressed audio first.

Your data might be 16-bit integer, 8-bit integer, 32-bit floating point or any number of other formats. We'll assume you know how to get your data to floating point representation in the range from -1 to 1. PortAudio takes care of this for us when we specify these input parameters:

    inputParameters.device = Pa_GetDefaultInputDevice();
    inputParameters.channelCount = 1;
    inputParameters.sampleFormat = paFloat32;
    inputParameters.suggestedLatency = Pa_GetDeviceInfo( inputParameters.device )->defaultHighInputLatency ;
    inputParameters.hostApiSpecificStreamInfo = NULL;
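If your samples come from somewhere that does not do this conversion for you (a 16-bit WAV file, for instance), the scaling is straightforward. The following is a minimal sketch, not part of the tuner; the function name `int16ToFloat` is made up for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not in the tuner's source): convert signed 16-bit
   PCM samples to floats in the range [-1, 1). */
void int16ToFloat( const int16_t *in, float *out, size_t count )
{
    assert( in != NULL && out != NULL );
    for( size_t i = 0; i < count; ++i )
        out[i] = in[i] / 32768.0f;   /* 32768 = 2^15, so -32768 maps to -1.0 */
}
```

Dividing by 32768 rather than 32767 means the largest positive sample lands just below 1.0, which is harmless for pitch detection.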

You'll also need to know how often your audio is sampled. For a tuner, less is more, so we'll use a sample rate of 8 kHz, which is available on most hardware. This is extremely low for most audio applications (44.1 kHz is considered standard for audio and 48 kHz is standard for video), but for a tuner, 8 kHz is plenty.

    #define SAMPLE_RATE (8000)

## Low-Pass Filtering

There's no hard and fast rule about low-pass filtering (or simply "low-passing") your audio data. In fact, it's not even strictly necessary, but doing so can get rid of unwanted noise and the higher frequencies that sometimes masquerade as the fundamental frequency. This is important because some instruments have component frequencies called harmonics that are more powerful than the fundamental frequency, and usually we are interested in the fundamental. Filtering, therefore, can improve the reliability of the rest of the pitch tracker significantly. Without filtering, some noise might appear to be the dominant pitch, or, more likely, a harmonic might appear to be the dominant pitch instead of the actual fundamental frequency.

A good choice for the filter is a low-pass filter with a cutoff frequency around or a little above the highest pitch you expect to detect. For a guitar tuner, this might be the high E string, or about 330 Hz. So that's what we'll use -- in fact, we low-pass it twice. If you are modifying the code for another purpose, you can set the cutoff frequency to something that makes sense for your application.

If you aren't sure, or you want something less aggressive, you could try a moving average filter, which simply outputs the average of the current input and some number of previous inputs. Intuitively, we can understand that this filter reduces high frequencies because signals that change quickly get "smoothed" out.

    // Process every sample of your input with this function
    // (this is not used in our guitar tuner)
    float twoPointMovingAverageFilter( float input ) {
        static float lastInput = 0;   // remembers the previous sample between calls
        float output = ( input + lastInput ) / 2 ;
        lastInput = input;
        return output;
    }

The moving average filter won't make a huge difference, but if the low pass filter I used in my code doesn't suit you and you don't have the degree in electrical engineering required to design the right digital filter (or don't know what the right filter is), it might be better than nothing. I haven't tested the moving average filter myself.

## Windowing

Generally speaking, FFTs work in chunks of data, but your input is a long or even continuous stream. To fit this round peg into this square hole, you need to break off chunks of your input, and process the chunks. However, doing so without proper treatment may prove detrimental to your results. In rough terms, the problem is that the edges get lopped off very sloppily, creating artifacts at frequencies that aren't actually present in your signal. These artifacts, a form of spectral leakage usually described in terms of "sidelobes", cause problems for many applications. I know that some tuners are designed without special treatment, so you can skip this step, but I strongly recommend you keep reading because it's easy to deal with this problem.

To reduce the sidelobes, we premultiply each chunk of audio with another signal called a window, or window function. Two simple and popular choices for window functions are the Hamming window and the Hann window. I put code for both in the tuner, but I used the Hann window.

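    // fill window[] with a Hann window of the given size
    void buildHanWindow( float *window, int size )
    {
        for( int i=0; i<size; ++i )
            window[i] = .5 * ( 1 - cos( 2 * M_PI * i / (size-1.0) ) );
    }

    // multiply each sample by the corresponding window value, in place
    void applyWindow( float *window, float *data, int size )
    {
        for( int i=0; i<size; ++i )
            data[i] *= window[i] ;
    }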

For a tuning app, the windows may overlap, or there may be gaps in between them, depending on your needs and your available processing power. For example, by overlapping and performing more FFTs, and then averaging the results, you may get more accurate results more quickly, at the cost of more CPU time.

**I strongly recommend overlapping and averaging in real apps. To keep the code easier to follow, I did not do this in my app, and as a result you'll see that the values sometimes jump around and don't respond smoothly.**
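One way to overlap chunks is to keep a sliding buffer: discard the oldest samples, shift the rest down, and append the newly read samples. The helper below is a hypothetical sketch (the name `slideBuffer` and the `hop` parameter are not from the tuner); with `hop` equal to half the buffer size you get 50% overlap.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: drop the oldest `hop` samples from `buffer`,
   shift the remainder to the front, and append `hop` fresh samples. */
void slideBuffer( float *buffer, int size, const float *fresh, int hop )
{
    assert( hop > 0 && hop <= size );
    memmove( buffer, buffer + hop, ( size - hop ) * sizeof( float ) );
    memcpy( buffer + ( size - hop ), fresh, hop * sizeof( float ) );
}
```

Because filtering and windowing in the main loop modify the data in place, a version that overlaps would need to copy the sliding buffer into a scratch array before processing each chunk, so the raw samples survive for the next one.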

## FFT

The FFT, or Fast Fourier Transform, is an algorithm for quickly computing the frequencies that comprise a given signal. By quickly, we mean O( N log N ). This is much faster than the O( N^2 ) time that computing the Fourier transform directly took before the "fast" algorithm was worked out, but it is still not linear, so you are going to have to be mindful of performance when you use it. Because the FFT is now the standard way to compute the Fourier transform, many people use the terms interchangeably, even though this is not strictly correct.

The FFT works on a chunk of samples at a time. You don't get more or less data out of a Fourier Transform than you put into it, you just get it in another form. That means that if you put ten audio samples in, you get ten data-points out. The difference is that these ten data points now represent energy at different frequencies instead of energy at different times. Since our data uses real numbers, and not complex, the FFT will also contain some redundancies -- specifically, only the first half of the spectrum contains relevant data. That means that for ten samples in, we really only get five relevant data-points out.

Clearly, the more frequency resolution you need, the more time data you need to give it. However, at some point you will run into the problem of not being able to return results quickly enough, either because you are waiting for more input, or because it takes too long to process. Choosing the right size FFT is critical: too big and you consume lots of CPU and delay getting a response, too small and your results lack resolution.

How do we know how big our FFT should be? You can determine the accuracy of your FFT with this simple formula:

    binSize = sampleRate/N ;

For example, with an FFT size of 8192 (most implementations of the FFT work best with powers of 2), and a sample rate of 44100, you can expect to get results that are accurate to within about 5.38 Hz. Not great for a tuner, but, hey, that's why we are sampling at 8000 Hz, which gives us an accuracy of better than 1 Hz. Still not perfect for, say, a 5 string bass, but you can always use a larger N if you need to. Keep in mind that getting enough samples to get that much accuracy takes longer than a second (8192 samples at 8000 Hz is about 1.02 seconds), so our display only updates about once a second. That's yet another reason you might want to overlap your windows.
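The trade-off between resolution and waiting time is easy to tabulate. The two helpers below are a small sketch with hypothetical names; they simply restate the formula above and its inverse.

```c
#include <assert.h>

/* Width of one FFT bin in Hz: the frequency resolution. */
double binSizeHz( double sampleRate, int fftSize )
{
    assert( fftSize > 0 );
    return sampleRate / fftSize;
}

/* Seconds of audio needed to fill one FFT of the given size. */
double chunkSeconds( double sampleRate, int fftSize )
{
    assert( sampleRate > 0 );
    return fftSize / sampleRate;
}
```

With N = 8192, `binSizeHz( 44100, 8192 )` is about 5.38 Hz and `binSizeHz( 8000, 8192 )` is about 0.98 Hz, while `chunkSeconds( 8000, 8192 )` is about 1.02 seconds. Halving the bin size always doubles the time you wait for a chunk.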

The output of the FFT is an array of N complex numbers. It is possible to use both the real and imaginary part to get very accurate frequency information, but for now we'll settle for something simpler and much easier to understand: we simply look at the magnitude. To find the magnitude of each frequency component, we use the distance formula:

    // real[] and imag[] hold the real and imaginary parts of the FFT output
    for( int i=0; i<N/2; ++i )
        magnitude[i] = sqrt( real[i]*real[i] + imag[i]*imag[i] );

Now that we know the magnitude of each FFT bin, finding the frequency is simply a matter of finding the bin with the maximum magnitude. The frequency will then be the bin number times the bin size, which we computed earlier. Note that we don't actually need to compute the square root to find the maximum magnitude, so our actual code skips that step.
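Put together, the peak search and the bin-to-frequency conversion might look like the sketch below. The function names are hypothetical, and the tuner itself looks the frequency up in a precomputed `freqTable` rather than multiplying each time.

```c
#include <assert.h>

/* Return the index of the bin with the largest squared magnitude,
   searching only the first half of the spectrum (real input). */
int findPeakBin( const float *real, const float *imag, int fftSize )
{
    assert( fftSize >= 2 );
    float maxVal = -1;
    int maxIndex = -1;
    for( int i = 0; i < fftSize / 2; ++i ) {
        /* squared magnitude: the square root is not needed to compare bins */
        float v = real[i] * real[i] + imag[i] * imag[i];
        if( v > maxVal ) {
            maxVal = v;
            maxIndex = i;
        }
    }
    return maxIndex;
}

/* Frequency at the start of a bin: bin number times bin size. */
float binToFrequency( int bin, float sampleRate, int fftSize )
{
    return bin * sampleRate / fftSize;
}
```

For example, at a sample rate of 8000 Hz with an FFT size of 8192, a peak in bin 338 corresponds to roughly 330 Hz, close to the high E string.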

This is a fantastic little post! It is written so even if someone knows music theory but is not an EE, they can still get it.

Fortunately, as an EE, I was able to really appreciate this fully :)

Thanks Bjorn!

Thanks, Paul.

I've found that autocorrelation is a much more accurate way of determining pitch than the FFT. This assumes the input has a single primary pitch. If there are multiple pitches, autocorrelation does not work. However, for a guitar tuner type app, autocorrelation is great. You need far fewer samples to get an accurate autocorrelation than you need for an FFT.

Autocorrelation can be computed as F'(F(x) * conj(F(x))) where F' is the inverse Fourier transform and F is the forward Fourier transform. Also, check out FFTW - a fantastic, fast Fourier transform library.

Yes, autocorrelation-based techniques are great, and, as I said in my second paragraph, the techniques outlined here are not necessarily the best, but they work, and, based on my following of Stack Overflow, many people already have an intuitive concept of the Fourier transform, but not autocorrelation, so this is easier.

As for FFTW, yes, it's an excellent library. It is "better" in that it's usually faster, but it's worse in that it has a more restrictive license that prevents many users from using it commercially or in something closed source (unless they want to pay), and the library I used here is more than fast enough.

FROM what i read,, were supposed to do FFT and then correlation to see if 2 audio files are the same... is that true? or am i lost?

Eman, I think you are lost. auto-correlation is the cross-correlation of a signal with itself. See the Wikipedia entry: https://en.wikipedia.org/wiki/Autocorrelation

Bjorn, i know that autocorrelation is the cross-correlation of a signal with itself. i meant that we would get the 2 signals, find the FFT of each one, and then do cross-correlation between them.

Eman, sorry for the misunderstanding. Unfortunately, I still don't understand what you are asking. If you are trying to figure out something to do with autocorrelation, the comments on a blog post about something else may not be the best forum.

Really nice blog. Did you also post the code? I am new to portaudio and it takes me a long time to work with it

thanks,

Rafi

The link to the code is under the word "Guitar tuner".

Nice Blog....

How to cross compile guitar tuner with arm in ubuntu 12.04. Followed al steps as u have written in README.While doing make i getting below error:

src/main.c:56:21: error: storage size of 'action' isn't known.

Plz reply

Sorry I don't know how to help you with cross compiling. You'll need to edit the makefile or maybe even rewrite it from scratch. It's not a lot of code, though, and should be easy to compile.

isn't that an IIR filter that you implemented?

Yes, the low-pass filters are IIR.

oh thank you didn t know that , can you share a good tutorial about IIR? and based on what formula is implemented yours? I don t understand much from wikipedia

My entry on the topic is here: http://blog.bjornroche.com/2012/08/basic-audio-eqs.html

Thanks I will read it now. Btw I mentioned you in my bachelor thesis because of this topic. Thank you very much

Why there is a need to get input data from -1 to 1 range? Can't sampled data be processed straighforward?

It's not necessary, but it's standard practice when working with floating point audio for it to be in the range [-1,1].
