A better FFT-based audio visualization

There are a lot of FFT-based audio visualizations available, but they usually make the mistake of displaying a raw FFT-based bar graph. This leads to a display which flickers wildly and doesn’t appear to move in time with the audio. This problem can be fixed with a few simple modifications.

How it works

Like all FFT-based visualizations, this one starts by dividing the incoming audio stream into fixed-size slices. These slices contain a power-of-two number of samples, and the output video frame rate is locked to the slice rate. In this implementation, the slice/frame rate is always between 32 and 64 Hz, depending on the incoming audio sample rate.
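The slice size selection can be sketched as follows. This is only an illustration: the function names and the exact rule (largest power of two that keeps the rate at or above 32 Hz) are assumptions chosen to reproduce the 32 to 64 Hz range described above, not code taken from the plugin.

```python
def slice_size(sample_rate, min_rate=32.0):
    """Largest power-of-two slice size whose slice rate is still >= min_rate.

    Hypothetical helper: doubling the size halves the rate, so the result
    always lands in the range [min_rate, 2 * min_rate).
    """
    size = 1
    while sample_rate / (size * 2) >= min_rate:
        size *= 2
    return size


def slice_rate(sample_rate):
    """Slices (and therefore video frames) per second for a sample rate."""
    return sample_rate / slice_size(sample_rate)
```

Under this rule, 44.1 kHz audio gives 1024-sample slices at roughly 43 frames per second.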

For each slice, a Hamming window is applied, and then the slice is transformed to give a spectrum. The spectrum is divided into a fixed number of frequency bins, and a vertical bar is drawn for each bin. So far, this is exactly how any other FFT visualization works. The key differences are explained in the following sections.
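As a rough sketch of this common first stage, the following applies a Hamming window and a textbook recursive radix-2 FFT, then returns the power at each non-negative frequency. The plugin itself uses the FFT from the GStreamer base plugins; the pure-Python transform and the function names here are stand-ins for illustration only.

```python
import cmath
import math


def hamming(n):
    """Hamming window coefficients for a slice of n samples (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def power_spectrum(samples):
    """Window one slice, transform it, and return power per frequency.

    Only the first half of the spectrum is kept, since the second half
    mirrors it for real-valued input.
    """
    n = len(samples)
    windowed = [s * w for s, w in zip(samples, hamming(n))]
    return [abs(c) ** 2 for c in fft(windowed)[: n // 2]]
```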

Gamma-corrected frequency range

First, the frequency ranges are not uniform. Human perception of pitch is logarithmic: high C sounds the same “distance” from middle C as middle C does from low C, although the actual frequencies involved follow a geometric progression. Ideally, frequencies should be mapped to bin indexes logarithmically, with the number of frequencies covered by a bar always being a fixed multiple of the number covered by the previous bar.

However, due to the size of typical slices, we don’t have enough frequencies to do this without introducing large gaps. As a compromise, we use the same function used to correct for perceived brightness on CRT monitors:

B_i = ((f_i / f_max) ** (1 / gamma)) * B_max

Here f_i ranges over all frequencies obtained from the FFT, f_max is the highest of those frequencies, B_i is the bin index corresponding to f_i, and B_max is the highest bin index. A gamma value of 1 produces a linear division. In this implementation, we use a gamma value of 2. Higher values tend to hide too much of the detail at the high end of the spectrum.
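A minimal sketch of this mapping is shown below. The function name, the use of FFT output indexes in place of frequencies in Hz, the truncation to an integer and the clamp on the last bin are all assumptions made for illustration.

```python
def bin_index(freq_index, n_freqs, n_bins, gamma=2.0):
    """Map an FFT output index to a bar index using the gamma curve.

    freq_index / n_freqs plays the role of f_i / f_max, and n_bins plays
    the role of B_max. The result is truncated and clamped so that it is
    always a valid index into a list of n_bins bars.
    """
    position = (freq_index / n_freqs) ** (1.0 / gamma)
    return min(int(position * n_bins), n_bins - 1)
```

With a gamma of 2, the lowest quarter of the spectrum is spread over the first half of the bars: bin_index(128, 512, 32) gives 16, whereas a gamma of 1 would give 8.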

Power scale

The obvious way to obtain an instantaneous bar height would be to plot the average power in each frequency range (where power is the square of the magnitude of the complex amplitude). However, since human hearing has such a wide dynamic range, it’s much better to take the logarithm of the power instead.

A further improvement is to take the logarithm of the maximum power in each range, instead of the average. Real music often contains narrow but perceptible peaks, which are swamped when averaged with neighbouring frequency powers.
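The difference between the two approaches can be illustrated with a short sketch. The decibel scaling (10 times the base-10 logarithm) and the small floor used to avoid taking the logarithm of zero are assumptions; the text above only specifies that a logarithm of the maximum power is taken.

```python
import math


def peak_height_db(powers, floor=1e-12):
    """Bar height from the maximum power in one frequency range."""
    return 10.0 * math.log10(max(max(powers), floor))


def mean_height_db(powers, floor=1e-12):
    """Bar height from the average power, shown only for comparison."""
    return 10.0 * math.log10(max(sum(powers) / len(powers), floor))
```

For a range containing one narrow peak, such as [1.0, 100.0, 1.0, 1.0], the peak-based height is 20 dB while the average-based height is only about 14 dB, and the gap widens as the range covers more frequencies.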

Time smoothing

This is the most important improvement. The instantaneous bar heights vary quite wildly, and even after applying the two improvements above, they flicker if plotted directly. Instead, we blend the instantaneous values with the smoothed values from the previous frame:

B_i' = B_(i-1)' * s' + B_i * (1 - s')

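Here the subscript i indexes slices in time rather than bins, and the formula is applied to each bar independently. B_i represents the instantaneous value for slice i, and B_i' represents the smoothed value. s' is a smoothing constant which is adjusted for the actual slice rate so as to give approximately the same results over a range of different sample rates: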

s' = s ** (1 / R)

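In the above, s is the base smoothing constant, and R is the slice rate in Hz. Since s' raised to the power R equals s, the total decay applied over one second of audio is the same whatever the slice rate.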
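The two formulas above translate directly into code. In this sketch the function names are hypothetical, and the caller is assumed to keep the smoothed heights from the previous frame and pass them back in.

```python
def smoothing_factor(base, slice_rate):
    """Per-slice constant s' = s ** (1 / R) for base constant s and rate R."""
    return base ** (1.0 / slice_rate)


def smooth_bars(previous, instantaneous, factor):
    """Blend last frame's smoothed heights with this frame's raw heights.

    Applies B_i' = B_(i-1)' * s' + B_i * (1 - s') to every bar.
    """
    return [
        prev * factor + cur * (1.0 - factor)
        for prev, cur in zip(previous, instantaneous)
    ]
```

A factor closer to 1 gives slower, steadier bars, and a factor closer to 0 follows the instantaneous values more tightly.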

Implementation

This visualization has been implemented as a GStreamer plugin, which can be downloaded here. To compile, you’ll need development headers for GStreamer and the GStreamer plugins from the base set (for the FFT implementation). On Debian, these packages are named libgstreamer0.10-dev and libgstreamer-plugins-base0.10-dev.

Compile and install with:

make
su
make install

This will add a new visualization type in GStreamer-based media players (such as Totem) called “DLBFFT”.

Note that the version of Totem which ships with Debian 6.0 will always prefer displaying embedded cover art where available, rather than the selected visualization. There’s currently no option to disable this, but this patch can be applied to disable it permanently.