Audio Feature Extraction for Music vs Speech Classification: A Step-by-Step Guide
Learn how to extract time-domain and frequency-domain features from audio files for music vs speech classification using Python, NumPy, and SciPy. This tutorial covers RMS, ZCR, spectral centroid, roll-off, flatness, and MFCCs with practical code examples.
Introduction
Audio classification is a fundamental task in machine learning, with applications ranging from automatic music genre recognition to voice assistants. In this tutorial, you'll learn how to extract time-domain and frequency-domain features from audio files to distinguish between music and speech. Using the Marsyas Music & Speech dataset, we'll implement feature extraction in Python with NumPy and SciPy. This guide is inspired by assignments from CS 4347, but the techniques apply broadly to any audio classification project.
Understanding the Dataset
The Marsyas dataset contains 64 music and 64 speech files, each 30 seconds long, sampled at 22050 Hz with 16-bit signed integers. The ground truth file maps each filename to a label (music or speech). Before diving into feature extraction, ensure you have the dataset downloaded and organized into music-wav/ and speech-wav/ directories.
Loading Audio Files
Use scipy.io.wavfile.read() to load each WAV file. Convert the integer samples to floats by dividing by 32768.0 (the magnitude of the most negative 16-bit signed value; the maximum is 32767). This maps every sample into the range [-1.0, 1.0), so all files share a consistent numerical scale.
import scipy.io.wavfile as wav
import numpy as np
sample_rate, data = wav.read('music-wav/sample.wav')
data_float = data.astype(np.float32) / 32768.0
Buffer Splitting with 50% Overlap
To analyze audio over time, split each file into buffers (frames) of length 1024 samples with a hopsize of 512 (50% overlap). Only keep complete buffers; discard any trailing incomplete buffer. Use NumPy array slicing for efficiency.
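As a minimal sketch of the slicing idea, the helper below builds all complete buffers at once with NumPy's sliding_window_view (available in NumPy 1.20 and later). The function name split_into_buffers and the demo signal are illustrative assumptions, not part of the original assignment.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def split_into_buffers(samples, buffer_length=1024, hopsize=512):
    """Return a 2-D array of complete, overlapping buffers (one per row)."""
    samples = np.asarray(samples)
    if len(samples) < buffer_length:
        # Not enough samples for even one complete buffer.
        return np.empty((0, buffer_length), dtype=samples.dtype)
    # Build every window of buffer_length samples, then keep every hopsize-th one.
    # The result is a read-only view, so no sample data is copied.
    windows = sliding_window_view(samples, buffer_length)
    return windows[::hopsize]

# Hypothetical demo signal: 3000 samples -> (3000 - 1024) // 512 + 1 = 4 buffers.
demo = np.arange(3000, dtype=np.float32)
demo_buffers = split_into_buffers(demo)
print(demo_buffers.shape)  # (4, 1024)
```

An explicit loop over the same start indices produces identical buffers and may be easier to follow when stepping through the algorithm.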
buffer_length = 1024
hopsize = 512
num_buffers = (len(data_float) - buffer_length) // hopsize + 1
buffers = []
for i in range(num_buffers):
    start = i * hopsize
    end = start + buffer_length
    buffer = data_float[start:end]
    buffers.append(buffer)
Time-Domain Features
For each buffer, compute two fundamental time-domain features: Root Mean Square (RMS) and Zero Crossing Rate (ZCR).
Root Mean Square (RMS)
RMS represents the energy of the signal. It is calculated as the square root of the mean of squared samples.
rms = np.sqrt(np.mean(buffer**2))
Zero Crossing Rate (ZCR)
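ZCR measures how often the signal changes sign, expressed here as the fraction of adjacent sample pairs whose signs differ. Speech alternates between voiced sounds (low ZCR) and unvoiced sounds (high ZCR), so its ZCR tends to vary more from buffer to buffer than that of music. This makes the standard deviation of ZCR at least as informative as its mean.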
zcr = np.sum(np.abs(np.diff(np.sign(buffer))) > 0) / (len(buffer) - 1)
Frequency-Domain Features
Before frequency analysis, apply a Hamming window to each buffer to reduce spectral leakage. Then compute the Discrete Fourier Transform (DFT) using scipy.fft.fft() and discard the negative frequencies (keep only the first N/2 + 1 bins).
from scipy.fft import fft
from scipy.signal.windows import hamming
windowed = buffer * hamming(buffer_length)
spectrum = fft(windowed)
magnitude = np.abs(spectrum[:buffer_length//2 + 1])
Spectral Centroid (SC)
The spectral centroid indicates the 'center of mass' of the spectrum. It is often associated with the brightness of a sound.
freqs = np.linspace(0, sample_rate/2, len(magnitude))
sc = np.sum(freqs * magnitude) / np.sum(magnitude)
Spectral Roll-Off (SRO)
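The roll-off point is the frequency below which a specified percentage (e.g., 85%) of the total spectral content is contained. The code here accumulates magnitudes rather than squared magnitudes, so "energy" refers to the sum of the magnitude spectrum.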
cumulative = np.cumsum(magnitude)
total_energy = cumulative[-1]
roll_off_index = np.where(cumulative >= 0.85 * total_energy)[0][0]
sro = freqs[roll_off_index]
Spectral Flatness Measure (SFM)
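SFM measures how noise-like versus tone-like a spectrum is. It is the ratio of the geometric mean to the arithmetic mean of the magnitude spectrum: values near 1 indicate a flat, noise-like spectrum, and values near 0 indicate a peaky, tonal one. Noisy segments such as unvoiced speech tend to score higher, while sustained musical tones tend to score lower. Compute the geometric mean in the log domain, since multiplying hundreds of magnitudes directly would overflow or underflow.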
geometric_mean = np.exp(np.mean(np.log(magnitude + 1e-10)))
arithmetic_mean = np.mean(magnitude)
sfm = geometric_mean / arithmetic_mean
Aggregating Features per File
For each file, compute the mean and uncorrected sample standard deviation of each feature across all buffers. This yields 10 features per file: 5 means and 5 standard deviations.
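To see how the pieces fit together, here is a rough end-to-end sketch that takes all buffers of one file as a 2-D array and returns the 10-value vector (5 means followed by 5 standard deviations). The function name file_features, the eps guard against silent buffers, and the use of scipy.fft.rfft (which returns the N/2 + 1 non-negative bins directly) are assumptions made for this illustration; results for non-silent buffers should match the per-buffer formulas up to that tiny eps.

```python
import numpy as np
from scipy.fft import rfft
from scipy.signal.windows import hamming

def file_features(buffers, sample_rate, rolloff=0.85, eps=1e-10):
    """Return [5 means, 5 stds] of RMS, ZCR, SC, SRO, SFM across buffers.

    `buffers` is a 2-D float array with one buffer per row (assumed input).
    """
    n = buffers.shape[1]

    # Time-domain features, computed for every row at once.
    rms = np.sqrt(np.mean(buffers ** 2, axis=1))
    zcr = np.sum(np.diff(np.sign(buffers), axis=1) != 0, axis=1) / (n - 1)

    # Windowed magnitude spectrum: rfft keeps only bins 0 .. n//2.
    magnitude = np.abs(rfft(buffers * hamming(n), axis=1))
    freqs = np.linspace(0, sample_rate / 2, magnitude.shape[1])

    # Spectral centroid; eps avoids division by zero on silent buffers.
    sc = (magnitude * freqs).sum(axis=1) / (magnitude.sum(axis=1) + eps)

    # Roll-off: first bin where the cumulative sum reaches the threshold.
    cumulative = np.cumsum(magnitude, axis=1)
    idx = np.argmax(cumulative >= rolloff * cumulative[:, -1:], axis=1)
    sro = freqs[idx]

    # Flatness: geometric mean (via logs) over arithmetic mean.
    geometric = np.exp(np.mean(np.log(magnitude + eps), axis=1))
    sfm = geometric / (magnitude.mean(axis=1) + eps)

    per_buffer = np.column_stack([rms, zcr, sc, sro, sfm])
    # np.std defaults to ddof=0, the uncorrected standard deviation.
    return np.concatenate([per_buffer.mean(axis=0), per_buffer.std(axis=0)])

# Hypothetical demo input: 8 buffers of low-level noise.
rng = np.random.default_rng(0)
demo_features = file_features(rng.standard_normal((8, 1024)) * 0.1, 22050)
print(demo_features.shape)  # (10,)
```

The output order (all means, then all standard deviations) is chosen to match the attribute order used in the ARFF header.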
# After processing all buffers for a file
rms_mean = np.mean(rms_list)
rms_std = np.std(rms_list) # uncorrected (ddof=0)
# repeat for ZCR, SC, SRO, SFM
Writing the ARFF File
ARFF is a standard format for Weka and other ML tools. The header defines the relation and attributes. The data section lists comma-separated values per file, ending with the class label.
@RELATION music_speech
@ATTRIBUTE RMS_MEAN NUMERIC
@ATTRIBUTE ZCR_MEAN NUMERIC
@ATTRIBUTE SC_MEAN NUMERIC
@ATTRIBUTE SRO_MEAN NUMERIC
@ATTRIBUTE SFM_MEAN NUMERIC
@ATTRIBUTE RMS_STD NUMERIC
@ATTRIBUTE ZCR_STD NUMERIC
@ATTRIBUTE SC_STD NUMERIC
@ATTRIBUTE SRO_STD NUMERIC
@ATTRIBUTE SFM_STD NUMERIC
@ATTRIBUTE class {music,speech}
@DATA
0.057447,0.191595,128.656296,239.404651,0.329993,0.027113,0.036597,13.206525,27.957121,0.087828,music
0.062831,0.082504,78.481380,145.886047,0.198849,0.032323,0.070962,39.388633,66.942115,0.133545,speech
Ensure at least 6 decimal places for numeric values.
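The layout above could be produced with a small writer like the following sketch. The function name write_arff and the (feature_vector, label) row format are assumptions for illustration; the demo row reuses the first data line shown above.

```python
import io

FEATURE_NAMES = ["RMS", "ZCR", "SC", "SRO", "SFM"]

def write_arff(out, rows, relation="music_speech"):
    """Write (feature_vector, label) pairs to a file-like object as ARFF."""
    out.write(f"@RELATION {relation}\n")
    # All means first, then all standard deviations, as in the header above.
    for suffix in ("MEAN", "STD"):
        for name in FEATURE_NAMES:
            out.write(f"@ATTRIBUTE {name}_{suffix} NUMERIC\n")
    out.write("@ATTRIBUTE class {music,speech}\n")
    out.write("@DATA\n")
    for values, label in rows:
        # Fixed-point formatting with 6 decimal places per value.
        fields = [f"{v:.6f}" for v in values]
        out.write(",".join(fields + [label]) + "\n")

demo_rows = [
    ([0.057447, 0.191595, 128.656296, 239.404651, 0.329993,
      0.027113, 0.036597, 13.206525, 27.957121, 0.087828], "music"),
]
arff_buffer = io.StringIO()
write_arff(arff_buffer, demo_rows)
print(arff_buffer.getvalue())
```

Passing an open file instead of io.StringIO writes the same text to disk.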
Going Further: MFCCs
Mel-Frequency Cepstral Coefficients (MFCCs) are widely used in speech and music processing. The steps are pre-emphasis, framing, windowing, FFT, mel filterbank, log, and DCT. For this dataset, compute 26 MFCCs per buffer, then take the mean and standard deviation of each coefficient across buffers for each file.
Pre-emphasis
Pre-emphasis applies a first-order high-pass filter that boosts high frequencies relative to low ones: y[t] = x[t] - 0.95 * x[t-1].
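A minimal sketch of this filter in NumPy follows. The function name pre_emphasis is illustrative, and passing the first sample through unchanged (that is, treating x[-1] as 0) is an assumption, since the formula does not define the boundary.

```python
import numpy as np

def pre_emphasis(x, coeff=0.95):
    """Apply y[t] = x[t] - coeff * x[t-1] to a 1-D signal."""
    x = np.asarray(x, dtype=np.float64)
    if x.size == 0:
        return x
    y = np.empty_like(x)
    y[0] = x[0]                      # assumption: x[-1] is treated as 0
    y[1:] = x[1:] - coeff * x[:-1]   # each sample minus the scaled previous one
    return y

# A constant (0 Hz) signal is strongly attenuated after the first sample,
# while a signal alternating at the Nyquist rate is amplified.
constant_out = pre_emphasis([1.0, 1.0, 1.0, 1.0])
alternating_out = pre_emphasis([1.0, -1.0, 1.0, -1.0])
```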
Mel Filterbank
Convert frequency to mel scale: mel = 1127 * ln(1 + f/700). Create 26 triangular filters covering 0 Hz to Nyquist frequency. Map filter edges to FFT bins using floor, round, and ceil for left, center, right.
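One way to build such a filterbank is sketched below. The helper names, the conversion of frequencies to fractional bin positions as f * N / sample_rate, and the exact triangle shape (linear ramps from 0 at the outer edges to 1 at the center) are assumptions for illustration; the assignment's reference implementation may differ in these details.

```python
import numpy as np

def hz_to_mel(f):
    return 1127.0 * np.log(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (np.exp(m / 1127.0) - 1.0)

def mel_filterbank(sample_rate, buffer_length=1024, num_filters=26):
    """Return a (num_filters, buffer_length // 2 + 1) matrix of triangular filters."""
    num_bins = buffer_length // 2 + 1
    # num_filters + 2 points equally spaced in mel from 0 Hz to Nyquist;
    # filter i uses points i, i + 1, i + 2 as left edge, center, right edge.
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), num_filters + 2)
    bin_pos = mel_to_hz(mel_points) * buffer_length / sample_rate

    bank = np.zeros((num_filters, num_bins))
    for i in range(num_filters):
        left = int(np.floor(bin_pos[i]))
        center = int(np.round(bin_pos[i + 1]))
        # Clamp in case rounding error pushes the last edge past Nyquist.
        right = min(int(np.ceil(bin_pos[i + 2])), num_bins - 1)
        if center > left:
            bank[i, left:center + 1] = np.linspace(0.0, 1.0, center - left + 1)
        if right > center:
            bank[i, center:right + 1] = np.linspace(1.0, 0.0, right - center + 1)
        bank[i, center] = 1.0
    return bank

demo_bank = mel_filterbank(22050)
print(demo_bank.shape)  # (26, 513)
```

Because the filters are equally spaced in mel, they are narrow at low frequencies and progressively wider toward Nyquist.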
DCT
Apply scipy.fftpack.dct() to the log filterbank energies to obtain MFCCs.
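The last three steps (filterbank, log, DCT) might look like the sketch below for a single magnitude spectrum. The function name, the natural log, the eps offset, applying the filters to magnitudes rather than squared magnitudes, and the type-2 orthonormal DCT are all assumptions; the text above does not pin these down. The identity-like filterbank in the demo is only a stand-in to keep the example self-contained.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_magnitude(magnitude, filterbank, eps=1e-10):
    """Turn one magnitude spectrum into MFCCs: filterbank -> log -> DCT."""
    # One energy value per filter (rows of filterbank are the filters).
    energies = filterbank @ magnitude
    # eps keeps the log finite when a filter sees no energy.
    log_energies = np.log(energies + eps)
    return dct(log_energies, type=2, norm="ortho")

# Stand-in inputs (hypothetical): 26 filters that each pick out one bin
# of a flat spectrum, so every log energy is identical.
stand_in_bank = np.eye(26, 513)
flat_magnitude = np.full(513, 2.0)
demo_mfcc = mfcc_from_magnitude(flat_magnitude, stand_in_bank)
print(demo_mfcc.shape)  # (26,)
```

With identical log energies, only the first coefficient is non-zero, which illustrates that the higher coefficients capture the shape of the log spectrum rather than its overall level.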
Why This Matters
Audio feature extraction underpins many audio applications. Features like these have long been used in music information retrieval tasks such as genre recognition, MFCCs have been a standard front end for speech recognition systems, and similar pipelines appear in real-time sound event detection. Mastering these techniques gives you a solid foundation for machine-learning-based audio analysis.
Conclusion
You've learned how to extract time-domain (RMS, ZCR) and frequency-domain (spectral centroid, roll-off, flatness) features from audio, and how to write them into an ARFF file for classification. These skills are directly applicable to projects in music information retrieval, speech processing, and beyond. Experiment with different buffer sizes, overlap percentages, and additional features like MFCCs to improve your classifier's accuracy.