Wednesday, December 9, 2009

Linearity and dynamic range in Int->Float->Int

Update: some comments.

In my last blog post, I discussed converting audio from integer to floating point back to integer, mostly from a programming perspective. I showed how there are a lot of ways to do the conversion. Most audio folks would say, "huh, I thought there were only two ways to convert floating point numbers to integers." And they'd be right: with and without dither. So what's all the fuss about?

Indeed, that's a good question. Most audio folks have these expectations:
  1. When I have dither off and no effects (including volume, etc) I expect to be able to get out exactly what I put in.
  2. When I have dither on, I expect it to sound good.
Point 1 is what we referred to as bit transparency in the previous post, and we found lots of ways to do that. Point 2 is a bit more subtle. How do you make something sound good? In this case, we mean transparent, and what's especially critical is that we eliminate truncation and IM distortion which are the hallmarks of cold, harsh digital audio.
Figure 1. Comparison of 16-bit conversion using the same scaling factor (matched) vs. different scaling factors (mismatched). Mismatched scaling factors come from Method 3 from previous post and matched are Method 2.

What we need when it comes to transparency and avoiding that cold harsh sound is linearity. In this regard, the methods discussed in my last post, transparent or not, don't stack up equally. You might think you could judge them by inspection, but the mathematics is a bit more complex. Let's be clear about what we need to test: what we don't care about is how accurately a given conversion method responds to a DC signal: we aren't measuring the temperature or the amount of fuel in a tank. Rather, when we talk about linearity in audio we are referring to the ability to accurately translate dynamic information. Think about it: when you buy an analog-to-digital converter, you aren't concerned about its ability to accurately measure a certain input voltage, are you? No, you care about its frequency response and dynamic range. In the same way, we must ensure maximum signal-to-noise ratio and dynamic range in our conversions. It turns out not all the conversions from my last post have good dynamic performance.

Tests

It is sometimes claimed that the percent error introduced by "mismatched" conversion (i.e., Method 3 from the previous post) is small, and therefore of little concern. Percent error is not what matters in a dynamic system such as audio, however, so we will set it aside and investigate the dynamic performance instead. In Figure 1 we show the results of "mismatched" conversion. In this case we are converting from a source signal of 2 sine waves in double precision to 16-bit integer (to simulate A/D conversion), then to single-precision floating point and back to 16-bit integer (to simulate a standard editing workflow), and finally back to double precision (to simulate D/A conversion). This is more or less the minimum error we can expect with the mismatched method if we use audio editing software but do not use DSP, and therefore represents a best-case scenario. In the dynamic analysis, it becomes clear that using different scaling factors produces more noise whether dither is used or not. In fact, the difference made by dither is dwarfed by the difference in techniques. Just as importantly, the quality of the noise is bad: rather than shifting the noise floor up, we see spikes indicating that the noise is likely to be audible even at low levels. These results also suggest that it is important to use the same scaling factors throughout the processing chain.
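The effect of mixing scaling factors can be seen even without an FFT. The following is a minimal Python sketch, not the code behind Figure 1: it pushes every 16-bit value through int -> single-precision float -> int with no dither, once with matched factors (0x8000 both ways) and once with the Method 3 factors (0x8000 in, 0x7FFF out). The helper names and the use of `struct` to emulate single-precision storage are assumptions made for illustration.

```python
import struct

def to_float32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def round_trip(sample, int_to_float_scale, float_to_int_scale):
    """int16 -> float32 -> int16 with the given scaling factors, no dither."""
    as_float = to_float32(sample / int_to_float_scale)
    return int(round(as_float * float_to_int_scale))

samples = range(-0x8000, 0x8000)  # every 16-bit two's complement value

# Count how many samples come back changed in each case.
matched_errors = sum(1 for s in samples if round_trip(s, 0x8000, 0x8000) != s)
mismatched_errors = sum(1 for s in samples if round_trip(s, 0x8000, 0x7FFF) != s)
```

Under these assumptions the matched round trip leaves every sample unchanged, while the mismatched one returns samples in the outer half of the range one LSB closer to zero. That signal-dependent error is the kind of nonlinearity the dynamic tests are meant to expose.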

Figure 2. Quantization and dithering from float to int and back to float is tested at 16 bits (a,b) and 24 bits (c,d) using a full-scale sine (a,c) and the sum of two sines (b,d). Notes: the sum of two sines does not clip; clipped signal and raw quantized signal are not shown in a.
Figure 2 shows the dynamic performance of conversion using 2^n, (2^n)-1 and "asymmetrical" conversion (i.e., Method 4 from my previous post). We will discuss below why "asymmetrical" is a misnomer. We also looked at dithered and non-dithered versions.

Two types of tests were run: first, a full-scale sine wave was generated, converted to int, and back to float for FFT analysis. The second test was the same except that two sines, each at 1/2 full scale were summed together. Each test was run at 16 and 24 bits. Note that the full-scale sine wave cannot be accurately represented in some of these conversion methods, resulting in some clipping.

As you can see, all dithered converters performed fine at 16-bit as long as nothing was out of scale. At 24-bit, the weakness of the (2^n)-1 converter becomes clear: it actually performs worse than rounding (i.e., no dithering). Clearly (2^n)-1 is not an acceptable transformation for 24-bit integers and single precision floating point numbers. The 2^n converter performed admirably on all tests except the 16-bit full-scale test (Figure 2a). Those small spikes line up perfectly with the spikes caused by clipping, as expected (results not shown), meaning that it is harmonic distortion -- not the worst thing that could happen, but, still, the asymmetric converter does outperform it in this regard.

As mentioned, I'm calling Method 4 from my previous post the "asymmetric" method, but it is only asymmetric in the sense that you apply different math to positive and negative numbers. As these results show, it is linear. Moreover, it is symmetric with respect to dither amplitude, which is what ensures its linear behavior.
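One way to read "symmetric with respect to dither amplitude" is that the dither is added in LSB units after the sign-dependent scaling, so positive and negative samples see the same dither. The Python sketch below illustrates that reading. The triangular (TPDF) dither, the clamping and the function names are assumptions chosen for illustration; they are not taken from the tests above.

```python
import random

def tpdf_dither(rng):
    """Triangular-PDF dither spanning +/-1 LSB (sum of two uniform values)."""
    return rng.uniform(-0.5, 0.5) + rng.uniform(-0.5, 0.5)

def float_to_int16_asym_dithered(f, rng):
    """Method 4 style conversion with dither added after scaling."""
    # Different scale for positive and negative input...
    scale = 0x7FFF if f > 0 else 0x8000
    # ...but the dither amplitude (in LSBs) is the same on both sides.
    value = f * scale + tpdf_dither(rng)
    # Clamp to the 16-bit range in case dither pushes a peak over the edge.
    return max(-0x8000, min(0x7FFF, int(round(value))))

def average_output(f, count, seed=0):
    """Average many dithered conversions of the same input value."""
    rng = random.Random(seed)
    return sum(float_to_int16_asym_dithered(f, rng) for _ in range(count)) / count
```

Averaging many conversions of a value that falls between two integer steps, for example `average_output(100.3 / 0x7FFF, 20000)`, should land close to 100.3, and the same holds for a negative input scaled by 0x8000. That is the behavior expected of a linear, dithered quantizer.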

Conclusions

Clearly the two winners here are the so-called asymmetric method and the (2^n) method. Both methods excel in the critical areas of bit transparency and linearity. Even their un-dithered performance is quite good, and they are obviously superior to other methods.

The one area in which the asymmetric model outperforms the (2^n) model is in terms of clipping signals that originated from higher resolution. Even with dither, we still see incorrect behavior with the (2^n) model because dither only finds its way to 1/2 LSB, whereas +1 clips by going 1 LSB over. The question is whether or not this matters. Indeed there is some debate about the importance of +1. My opinion? +1 is a value that occurs in the real world and it's not always possible for the code that's producing the +1 to know what the output resolution is going to be. For example, a VST synth plugin has no way of knowing what the output resolution is going to be, so it can't be expected to know what to scale its output to. When converting from 24 bit to 16 bit and using float as an intermediary, there is no simple way to solve this problem.
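The +1 case is easy to check by hand, and the short Python sketch below does the arithmetic for 16 bits. The function names and the explicit clamp are assumptions for illustration; real converters may saturate differently.

```python
INT16_MIN, INT16_MAX = -0x8000, 0x7FFF

def clamp(value):
    """Saturate to the 16-bit two's complement range."""
    return max(INT16_MIN, min(INT16_MAX, value))

def float_to_int16_pow2(f):
    """(2^n) style conversion: one scale factor, 0x8000."""
    return clamp(int(round(f * 0x8000)))

def float_to_int16_asym(f):
    """Asymmetric conversion: 0x7FFF for positive input, 0x8000 otherwise."""
    scale = 0x7FFF if f > 0 else 0x8000
    return clamp(int(round(f * scale)))

def overshoot_pow2(f):
    """How many LSBs the (2^n) conversion has to clip off for input f."""
    return int(round(f * 0x8000)) - float_to_int16_pow2(f)
```

With the (2^n) scale, +1.0 maps to 32768, one LSB past the largest 16-bit value, so `overshoot_pow2(1.0)` is 1 and the sample has to be clipped. The asymmetric version maps +1.0 to 32767 and -1.0 to -32768 with nothing clipped.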

On the other hand, non-pro A/D converters frequently clip around -0.5 dBFS, which is below +1 - 1 LSB anyway. Conceivably, you could also correct for this by introducing a level shift at the output equal to 1/2 LSB, but that's equivalent to turning your converter into a (2^n)-.5 converter -- it solves one problem, but introduces another. All that said, there is no reason not to develop software, especially libraries, drivers and other software intended for use by multiple types of users including audiophiles and pro audio engineers, that is convenient to use while meeting the highest audio standards: just use the asymmetric converters.

Given the potential hazards found in mixing and matching conversion methods, I recommend that all libraries (and drivers, if possible) offer options for various conversion settings, both to minimize bit transparency problems and unnecessary quantization noise, until all libraries and drivers can standardize on the asymmetric conversion method. This is the only way to guarantee transparency and maximize linearity. As these results show, this issue may be more important than dither.

Wednesday, December 2, 2009

Int->Float->Int: It's a jungle out there!

It turns out that the simple operation of converting from float to integer and back is not so simple. When it comes to audio, this operation should be done with care, and most programmers do, in fact, put a lot of thought into it. The problem most programmers observe is that audio, when stored (or processed) as an integer, is usually stored in what's called "two's complement" notation, which always gives us 1 more negative number than positive. When we process or store floating point numbers, we use a nominal range of -1 to +1.

The fact that there are more negative numbers than positive numbers has caused some confusion amongst programmers, and a number of different conversion methods have been proposed. Here is my survey of how a number of existing software and hardware packages handle this conversion. In these examples, I show conversions for 16-bit integers, but they all extend in the obvious way to other bit depths. It is important to consider how these methods extend to larger integers, especially how they extend to 24-bit integers, so I've tested bit transparency for these methods up to 24-bit using single precision floating point intermediaries, correcting for the fact that IEEE allows for extended precisions to be used in computations. Endianness is irrelevant here, because everything works for big and little endian systems.

Transparency is only required or possible when the data has not been created synthetically or altered via DSP (including such simple operations as volume changes, mixing, etc). In cases where transparency is not possible, dither must be applied when converting to integer or reducing the resolution. In many software packages it is up to the end-user to make this determination and manually switch dither on or off. In my next post I will discuss dithering and linearity.


| Method | Int to Float | Float to Int* | Transparency | Used By |
|--------|--------------|---------------|--------------|---------|
| 0 | (integer + .5)/(0x7FFF+.5) | float*(0x7FFF+.5)-.5 | Up to at least 24-bit | DC DAC Modeled |
| 1 | (integer / 0x8000) | float * 0x8000 | Up to at least 24-bit | Apple (Core Audio) [1], ALSA [2], MatLab [2], sndlib [2] |
| 2 | (integer / 0x7FFF) | float * 0x7FFF | Up to at least 24-bit | Pulse Audio [2] |
| 3 | (integer / 0x8000) | float * 0x7FFF | Non-transparent | PortAudio [1,2], Jack [2], libsndfile [1,3] |
| 4 | (integer>0?integer/0x7FFF:integer/0x8000) | float>0?float*0x7FFF:float*0x8000 | Up to at least 24-bit | At least one high end DSP and A/D/A manufacturer [2,4]. XO Wave 1.0.3. |
| 5 | Unknown | float*(0x7FFF+.49999) | Unknown | ASIO [2] |

*Obviously, rounding or dithering may be required here.
Note that in the case of IO APIs, drivers are often responsible for conversions. The conversions listed here are provided by the API.

Method 0 is one possible method for preserving the DC accuracy of a DAC, and is included here for reference.
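The transparency column can be spot-checked with a short script. This is a hedged Python sketch rather than the test harness used for the survey: it rounds the intermediate value to single precision with `struct` (an assumption standing in for "single precision floating point intermediaries"), uses Python's built-in `round` for the float-to-int step, and covers Methods 0 through 4 only, since Method 5's int-to-float half is unknown.

```python
import struct

def f32(x):
    """Round a double to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def methods(bits):
    """Return {method number: (int_to_float, float_to_int)} for a bit depth."""
    full = 1 << (bits - 1)   # 0x8000 at 16 bits
    fm1 = full - 1           # 0x7FFF at 16 bits
    return {
        0: (lambda i: (i + 0.5) / (fm1 + 0.5), lambda f: f * (fm1 + 0.5) - 0.5),
        1: (lambda i: i / full, lambda f: f * full),
        2: (lambda i: i / fm1, lambda f: f * fm1),
        3: (lambda i: i / full, lambda f: f * fm1),
        4: (lambda i: i / fm1 if i > 0 else i / full,
            lambda f: f * fm1 if f > 0 else f * full),
    }

def is_transparent(bits, method, step=1):
    """True if every tested integer survives int -> float32 -> int unchanged."""
    to_float, to_int = methods(bits)[method]
    full = 1 << (bits - 1)
    for i in range(-full, full, step):
        if int(round(to_int(f32(to_float(i))))) != i:
            return False
    return True
```

For example, `[is_transparent(16, m) for m in range(5)]` checks all five methods exhaustively at 16 bits. A full 24-bit sweep is slow in pure Python, so the `step` argument allows a coarser spot check such as `is_transparent(24, 1, step=997)`.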

Edited December 6, 2009: Fixed Method 3. (0x8000 and 0x7FFF were backwards)

Sources:
1 Mailing list
2 Perusing the source code (this, of course, is subject to mistakes due to following old, conditional or optional code)
3 libsndfile FAQ goes into detail about this.
4 Personal communication.

Wednesday, November 11, 2009

WAVE64 vs RF64 vs CAF

Right now I am choosing a new default internal audio file format for XO Wave, and I'd like to choose a format that offers large file sizes and high resolution. I'd like to use an existing popular standard rather than inventing my own or using RAW audio. The pro audio industry is finally moving towards 64-bit file formats, and the three options supported by most pro software are:

  • Wave64, aka Sony Wave64, originally developed by Sonic Foundry before 2003, is an open standard and a true 64-bit format: all 32-bit fields are replaced with 64-bit fields, and all chunks are 8-byte word aligned. Instead of the dreaded FourCC it uses GUID. Other than that, it is pretty much the same as WAV, so the spec is barely 4 pages long, although in my opinion it could stand to be a bit longer, as many aspects of WAV are so poorly devised it really wouldn't hurt for someone to put it all in one place. Some people have criticized the use of GUID on the grounds that there will never be that many chunks, but this misses the point: the point of using GUIDs is that anyone can define their own chunk without having to check with Sony or register a chunk ID. It's actually rather clever.
  • RF64 was proposed in 2005 by the EBU with full knowledge of Wave64. Although the proposal stated basic requirements that could have easily been met by a few minor extensions to Wave64, and they stated a desire to "join forces" with the developers of Wave64, they made no effort to do so other than to say they hoped they'd be involved. Moreover, the same document proposes RF64 as an alternative, incompatible 64-bit extension to the WAV format. Unlike Wave64, RF64 is not a true 64-bit format. All existing "chunks" remain 32-bit, so, for example, markers, regions and loops will no longer work past a certain number of samples. Even EBU's levl chunk will not work with RF64 because it uses a 32-bit address for pointing to the "peak-of-peaks" in the raw data. RF64 offers the much made-of promise of backwards compatibility via a "junk chunk", but, of course, this is possible with Wave64 as well, as pointed out in the Wave64 spec.
  • CAF, or Core Audio Format was Apple's entry into the ring. Apple didn't want to be left out of the 64-bit game, after all, and around the same time in 2005 they released CAF. Since they are Apple, they figured people would adopt it (Logic would, if no one else), even if there were competing specs. Their approach, however, was to start from scratch, and it's pretty refreshing. Indeed, the spec addresses practical issues to ensure that important features are implemented, and it even makes that tiny little bit of extra effort required to avoid file corruption by not requiring a header rewrite to finalize a recording of unknown length (Anyone who's ever recorded using software knows that once in a while something goes wrong and a file ends up corrupted. It's so nice that someone finally addressed this in a spec.).
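To put rough numbers on the 32-bit limits mentioned above, here is a small Python calculation. The sample rate, channel count and bit depth are example values chosen for illustration, and the fields are assumed to be unsigned 32-bit; software that treats them as signed hits the limits at half these figures.

```python
def hours_until_limit(limit, units_per_second):
    """Hours of audio that fit before a counter reaches `limit`."""
    return limit / units_per_second / 3600

SAMPLE_RATE = 96_000      # example: 96 kHz
CHANNELS = 2              # example: stereo
BYTES_PER_SAMPLE = 3      # example: 24-bit PCM

# A 32-bit byte count (plain WAV data size) fills up at 4 GiB.
bytes_per_second = SAMPLE_RATE * CHANNELS * BYTES_PER_SAMPLE
wav_hours = hours_until_limit(2**32, bytes_per_second)

# A 32-bit sample position (as in 32-bit marker or peak fields)
# runs out after 2**32 sample frames, whatever the file size.
marker_hours = hours_until_limit(2**32, SAMPLE_RATE)
```

With these example values the 4 GiB byte limit arrives after a little over two hours, and a 32-bit sample position stops being addressable after roughly half a day of audio.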
The WAVE format is problematic in many, many ways. For example, in some places it uses zero-based indexing, in others it uses one-based indexing. Sometimes it uses signed integers for raw audio data, other times unsigned. That may not seem so bad, but considering how simple the data it's trying to carry is, and adding the fact that Microsoft had to use format extensions just to clear up ambiguous documentation (and they've still got an ambiguously documented "fact" chunk), it's really not good territory. It is a shame that both Sonic Foundry/Sony and the EBU chose WAVE as the format to extend. Moreover, it's annoying that EBU designed their own, incompatible 64-bit extension to WAVE when a superior one already existed.

Some people think the whole "backwards compatibility" thing is a bunch of hooey because it puts an undue burden on the people writing the libraries. Erik de Castro Lopo, author of the popular LGPL'ed libsndfile, says:

Quite honestly, its stuff like this that makes me think the people who write these specs smoke crack!
If I were to follow the ... insane advice [about retaining backwards compatibility], the test suite would have to write > 4Gig files in order to write a real RF64 file instead of just a normal WAV file.
In order to avoid this insanity, libsndfile, when told to write an RF64 file does exactly as its told.
I would add that the backwards compatibility adds another point of failure in the recording process, in the same way that header rewrites are a point of failure in most current formats (except for CAF and "chunkless" formats like RAW and AU).

All that aside, RF64 is gaining some popularity and support -- probably more than Wave64. As for CAF, it's less popular, but since it's an Apple standard it's probably not going anywhere even if it's not going to be the "next big thing." It could be a fine place to work from, but just scanning the docs turned up a few issues that worried me. For example:



  • The CAFMarker data-type has three design flaws I noticed. One is that the frame position is a floating point number. I might be missing something here, but in a format where everything else counts frames and bytes as 64-bit integers, why are we suddenly using floats? Sure, that will represent integers exactly up to pretty big numbers since it's 64-bit, but it's still a float. I didn't choose a format like this to get pretty accurate big numbers when I could get completely accurate big numbers! Internally, most apps are going to be converting 64-bit integers to 64-bit floats, which is insane. Another problem is mChannel, which is the channel (starting at 1) that the marker refers to, or zero if the marker refers to all channels. Okay, seems reasonable, except that the spec also defines a channel mapping with a 32-bit channel layout bitmask. Why not use that? Granted you might have more than 32 channels, but that's not going to be the most common case, and you could give your users a choice. Consistency is important in APIs. Also, let's face it, the CAFMarker, if not all the basic chunks, should be versioned and extensible. Sure all that takes a few more bits (well, not the float/integer thing), but it's really nothing compared to the sea of data in most audio files.
  • In the SMPTE timecode types they define kCAF_SMPTE_TimeType30Drop. Now, the fact is that there's really no such thing as 30 Drop, but I can see an argument for including it out of completeness. However, the documentation describes it as: "30 video frames per second, with video-frame-number counts adjusted to ensure that the timecode matches elapsed clock time." Which is wrong. If you actually had 30 Drop it would run ahead of elapsed, or "wall-clock" time. "Aha!" you say, "they really mean 29 Drop, which is often just called 30 Drop because everyone knows there's no such thing as 30 Drop." But, I'm afraid you are wrong, because there's another constant for that, kCAF_SMPTE_TimeType2997Drop, with pretty much the same documentation, only in this case, it's correct to say that the timecode matches elapsed time. (well, it's very close anyway)
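The "30 Drop runs ahead" claim can be checked with the usual drop-frame numbering rule: frame numbers 00 and 01 are skipped at the start of every minute except each tenth minute. The Python sketch below applies that rule to a raw frame count. The function name is hypothetical, and this is an illustration of the counting scheme, not code from the CAF spec.

```python
def drop_frame_label(frame_count):
    """Label a zero-based frame count using 30-frame drop-frame numbering."""
    frames_per_10min = 17982   # 10 * 60 * 30 minus 9 minutes * 2 skipped numbers
    frames_per_min = 1798      # 60 * 30 minus 2 skipped numbers
    tens, rem = divmod(frame_count, frames_per_10min)
    # 18 numbers skipped per full ten-minute block, plus 2 for every
    # minute boundary already crossed inside the current block.
    skipped = 18 * tens + 2 * max(0, (rem - 2) // frames_per_min)
    n = frame_count + skipped
    ff = n % 30
    ss = n // 30 % 60
    mm = n // 1800 % 60
    hh = n // 108000
    return f"{hh:02d}:{mm:02d}:{ss:02d};{ff:02d}"

# One hour of wall-clock time at a true 30 frames per second:
label_at_30fps = drop_frame_label(30 * 3600)

# The frame count that drop-frame numbering labels as exactly one hour:
label_at_2997 = drop_frame_label(107892)
```

At a true 30 fps, one hour of wall-clock time is 108000 frames, which drop-frame numbering labels 01:00:03;18, about 3.6 seconds ahead. At 30000/1001 fps, the 107892 frames labeled 01:00:00;00 take about 3599.996 seconds, which is the "very close anyway" case.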
So CAF might be flawed, but probably no more so than WAVE and anything built on it. The reliability factor is sweet. Really. The fact that many people, especially in broadcast, seem to be wanting RF64 support is a detraction, though.
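On the float-versus-integer point about CAFMarker frame positions: a 64-bit IEEE float carries a 53-bit significand, so integer frame counts stop being exactly representable above 2^53. A quick Python check of that boundary follows (Python's `float` is a 64-bit double; the helper name is made up for illustration).

```python
def survives_float64(frame_count):
    """True if an integer frame count converts to float64 and back unchanged."""
    return int(float(frame_count)) == frame_count

LAST_SAFE = 2**53   # every integer up to here is exactly representable

exact_at_boundary = survives_float64(LAST_SAFE)
exact_past_boundary = survives_float64(LAST_SAFE + 1)
exact_at_int64_max = survives_float64(2**63 - 1)
```

Frame counts up to 2^53 round-trip exactly; 2^53 + 1 and the largest signed 64-bit value do not. So a float field cannot address everything a 64-bit integer frame count can, even though the exact range is very large in practice.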

Of course, I might just be over-engineering it. The AU format has been around forever, is super simple and provides high resolution, uncompressed audio of ANY length (it's not even limited to 64-bit). Of course, it lacks metadata which might be useful for BWF-style info as well as region data, but hey, it's wicked simple.


An interesting side note is that by choosing an appropriately sized junk/empty chunk in the header, Wave64, RF64 and CAF can actually be converted from one to another in-place.