Audio Feature Extraction for Music vs Speech Classification

Introduction

Audio classification is a fundamental task in machine learning, with applications ranging from automatic music genre recognition to voice assistants. In this tutorial, you'll learn how to extract time-domain and frequency-domain features from audio files to distinguish between music and speech. Using the Marsyas Music & Speech dataset, we'll implement feature extraction in Python with NumPy and SciPy. This guide is inspired by assignments from CS 4347, but the techniques apply broadly to any audio classification project.

Understanding the Dataset

The Marsyas dataset contains 64 music and 64 speech files, each 30 seconds long, sampled at 22050 Hz with 16-bit signed integers. The ground truth file maps each filename to a label (music or speech). Before diving into feature extraction, ensure you have the dataset downloaded and organized into music-wav/ and speech-wav/ directories.

Loading Audio Files

Use scipy.io.wavfile.read() to load each WAV file. Convert the integer samples to floats by dividing by 32768.0 (2^15, the magnitude of the most negative 16-bit signed value; the largest positive value is 32767). This scales every sample into the range [-1.0, 1.0).

import scipy.io.wavfile as wav
import numpy as np

sample_rate, data = wav.read('music-wav/sample.wav')
data_float = data.astype(np.float32) / 32768.0

Buffer Splitting with 50% Overlap

To analyze audio over time, split each file into buffers (frames) of length 1024 samples with a hopsize of 512 (50% overlap). Only keep complete buffers; discard any trailing incomplete buffer. Use NumPy array slicing for efficiency.

buffer_length = 1024
hopsize = 512
num_buffers = (len(data_float) - buffer_length) // hopsize + 1
buffers = []
for i in range(num_buffers):
    start = i * hopsize
    end = start + buffer_length
    buffer = data_float[start:end]
    buffers.append(buffer)
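The loop above can also be written without a Python-level loop. The following is a minimal sketch, assuming a 1-D NumPy signal; the helper name split_into_buffers and the synthetic demo signal are illustrative and not part of the original assignment.

```python
import numpy as np

def split_into_buffers(signal, buffer_length=1024, hopsize=512):
    """Return a 2-D array with one complete buffer per row."""
    if len(signal) < buffer_length:
        # No complete buffer fits, so return an empty (0, buffer_length) array.
        return np.empty((0, buffer_length), dtype=signal.dtype)
    num_buffers = (len(signal) - buffer_length) // hopsize + 1
    # Row i holds the sample indices i*hopsize ... i*hopsize + buffer_length - 1.
    starts = hopsize * np.arange(num_buffers)[:, None]
    offsets = np.arange(buffer_length)[None, :]
    return signal[starts + offsets]

# Synthetic stand-in for data_float: 2600 samples hold 4 complete buffers.
demo_signal = np.arange(2600, dtype=np.float32)
demo_buffers = split_into_buffers(demo_signal)
print(demo_buffers.shape)
```

Storing the buffers as rows of one array makes it possible to compute each feature for all buffers in a single vectorized call.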

Time-Domain Features

For each buffer, compute two fundamental time-domain features: Root Mean Square (RMS) and Zero Crossing Rate (ZCR).

Root Mean Square (RMS)

RMS measures the average amplitude of the signal within a buffer and is a common proxy for loudness. It is calculated as the square root of the mean of the squared samples.

rms = np.sqrt(np.mean(buffer**2))

Zero Crossing Rate (ZCR)

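ZCR is the fraction of adjacent sample pairs whose signs differ. Speech alternates between voiced sounds (low ZCR) and unvoiced sounds such as fricatives (high ZCR), so its ZCR tends to fluctuate more from buffer to buffer than that of music. This is why the standard deviation across buffers is worth keeping alongside the mean.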

zcr = np.sum(np.abs(np.diff(np.sign(buffer))) > 0) / (len(buffer) - 1)
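Both time-domain features can be computed for every buffer at once. This sketch assumes the buffers are stacked in a 2-D array with one buffer per row, which is a layout choice made here for illustration; the function names are hypothetical.

```python
import numpy as np

def rms_per_buffer(buffers):
    """Root mean square of each row."""
    return np.sqrt(np.mean(buffers ** 2, axis=1))

def zcr_per_buffer(buffers):
    """Fraction of adjacent sample pairs in each row whose signs differ."""
    signs = np.sign(buffers)
    changes = np.abs(np.diff(signs, axis=1)) > 0
    return np.sum(changes, axis=1) / (buffers.shape[1] - 1)

# Tiny hand-checkable example: an alternating buffer and a constant buffer.
demo_buffers = np.array([
    [1.0, -1.0, 1.0, -1.0],
    [0.5, 0.5, 0.5, 0.5],
])
demo_rms = rms_per_buffer(demo_buffers)
demo_zcr = zcr_per_buffer(demo_buffers)
print(demo_rms, demo_zcr)
```

Note that np.sign returns 0 for samples that are exactly zero, so a transition into or out of a zero sample also counts as a sign change under this formula.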

Frequency-Domain Features

Before frequency analysis, apply a Hamming window to each buffer to reduce spectral leakage. Then compute the Discrete Fourier Transform (DFT) with scipy.fft.fft() and discard the redundant negative-frequency bins: for a real-valued buffer of N samples, keep only the first N/2 + 1 bins.

from scipy.fft import fft
from scipy.signal.windows import hamming

windowed = buffer * hamming(buffer_length)
spectrum = fft(windowed)
magnitude = np.abs(spectrum[:buffer_length//2 + 1])
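As an alternative, scipy.fft.rfft() returns only the N/2 + 1 non-negative-frequency bins of a real input, so no slicing is needed. The sketch below compares the two approaches on a synthetic random buffer, which stands in for real audio.

```python
import numpy as np
from scipy.fft import fft, rfft
from scipy.signal.windows import hamming

buffer_length = 1024
rng = np.random.default_rng(0)
demo_buffer = rng.standard_normal(buffer_length)  # synthetic stand-in for one buffer

windowed = demo_buffer * hamming(buffer_length)

# Full FFT, then keep the first N/2 + 1 bins.
magnitude_full = np.abs(fft(windowed)[:buffer_length // 2 + 1])
# Real-input FFT, which computes only those bins.
magnitude_rfft = np.abs(rfft(windowed))

print(magnitude_rfft.shape, np.allclose(magnitude_full, magnitude_rfft))
```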

Spectral Centroid (SC)

The spectral centroid indicates the 'center of mass' of the spectrum. It is often associated with the brightness of a sound.

freqs = np.linspace(0, sample_rate/2, len(magnitude))
sc = np.sum(freqs * magnitude) / np.sum(magnitude)
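A silent buffer has an all-zero magnitude spectrum, which makes the division above produce NaN. The following sketch wraps the calculation in a hypothetical helper that returns 0 Hz in that case; that fallback value is an assumption, not something the dataset or assignment prescribes.

```python
import numpy as np

def spectral_centroid(magnitude, sample_rate):
    """Magnitude-weighted mean frequency in Hz."""
    freqs = np.linspace(0, sample_rate / 2, len(magnitude))
    total = np.sum(magnitude)
    if total == 0:
        return 0.0  # assumption: report 0 Hz for an all-zero (silent) spectrum
    return float(np.sum(freqs * magnitude) / total)

# Synthetic spectrum with two equal peaks at bins 100 and 300:
# the centroid should fall halfway between them, at bin 200.
sample_rate = 22050
demo_magnitude = np.zeros(513)
demo_magnitude[100] = 1.0
demo_magnitude[300] = 1.0
demo_sc = spectral_centroid(demo_magnitude, sample_rate)
print(demo_sc)
```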

Spectral Roll-Off (SRO)

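The roll-off point is the frequency below which a specified percentage (e.g., 85%) of the total spectral magnitude is contained. Some definitions accumulate the power spectrum (squared magnitude) instead; the code here accumulates magnitude.

cumulative = np.cumsum(magnitude)
total_magnitude = cumulative[-1]
# Index of the first bin where the running sum reaches 85% of the total
roll_off_index = np.where(cumulative >= 0.85 * total_magnitude)[0][0]
sro = freqs[roll_off_index]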
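The roll-off calculation can be checked by hand on a flat spectrum, where the answer is simply 85% of the way along the frequency axis. This is a minimal sketch; the helper name spectral_rolloff and the toy sample rate of 2000 Hz are chosen only to keep the arithmetic easy.

```python
import numpy as np

def spectral_rolloff(magnitude, sample_rate, fraction=0.85):
    """Frequency in Hz of the first bin where the cumulative magnitude
    reaches the given fraction of the total."""
    freqs = np.linspace(0, sample_rate / 2, len(magnitude))
    cumulative = np.cumsum(magnitude)
    index = np.where(cumulative >= fraction * cumulative[-1])[0][0]
    return float(freqs[index])

# Flat spectrum with 101 bins spaced 10 Hz apart (0, 10, ..., 1000 Hz).
demo_magnitude = np.ones(101)
demo_sro = spectral_rolloff(demo_magnitude, sample_rate=2000)
print(demo_sro)
```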

Spectral Flatness Measure (SFM)

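SFM measures how noise-like versus tone-like a spectrum is: it is the ratio of the geometric mean to the arithmetic mean of the magnitude spectrum. Values near 1 indicate a flat, noise-like spectrum; values near 0 indicate a spectrum dominated by a few peaks, as in tonal sounds. Compute the geometric mean in the log domain, because multiplying hundreds of magnitudes directly quickly underflows or overflows. The small constant 1e-10 guards against log(0).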

geometric_mean = np.exp(np.mean(np.log(magnitude + 1e-10)))
arithmetic_mean = np.mean(magnitude)
sfm = geometric_mean / arithmetic_mean
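To see the two extremes, the sketch below evaluates the same formula on a perfectly flat spectrum and on a spectrum with a single peak. The helper name and the choice to return 0.0 for a silent buffer are assumptions made for this illustration.

```python
import numpy as np

def spectral_flatness(magnitude, eps=1e-10):
    """Geometric mean over arithmetic mean of the magnitude spectrum."""
    arithmetic_mean = np.mean(magnitude)
    if arithmetic_mean == 0:
        return 0.0  # assumption: treat an all-zero (silent) spectrum as 0
    # Geometric mean computed in the log domain; eps avoids log(0).
    geometric_mean = np.exp(np.mean(np.log(magnitude + eps)))
    return float(geometric_mean / arithmetic_mean)

flat_spectrum = np.ones(513)       # noise-like: equal magnitude in every bin
peaked_spectrum = np.zeros(513)    # tone-like: all magnitude in one bin
peaked_spectrum[100] = 1.0

sfm_flat = spectral_flatness(flat_spectrum)
sfm_peaked = spectral_flatness(peaked_spectrum)
print(sfm_flat, sfm_peaked)
```

The flat spectrum scores very close to 1 and the single-peak spectrum very close to 0, which matches the interpretation of the measure.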

Aggregating Features per File

For each file, compute the mean and the uncorrected standard deviation (dividing by N rather than N - 1) of each feature across all buffers. This yields 10 features per file: 5 means and 5 standard deviations.

# After processing all buffers for a file
rms_mean = np.mean(rms_list)
rms_std = np.std(rms_list)  # uncorrected (ddof=0)
# repeat for ZCR, SC, SRO, SFM
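The "repeat for each feature" step can be expressed as one small function. This sketch assumes the per-buffer values are collected in a dictionary keyed by feature name; the dictionary layout and the names FEATURE_ORDER and aggregate_features are illustrative.

```python
import numpy as np

FEATURE_ORDER = ["RMS", "ZCR", "SC", "SRO", "SFM"]

def aggregate_features(per_buffer):
    """Return 5 means followed by 5 uncorrected standard deviations."""
    means = [float(np.mean(per_buffer[name])) for name in FEATURE_ORDER]
    stds = [float(np.std(per_buffer[name])) for name in FEATURE_ORDER]  # ddof=0
    return means + stds

# Two buffers per feature with values 1 and 3: mean 2, uncorrected std 1.
demo_per_buffer = {name: np.array([1.0, 3.0]) for name in FEATURE_ORDER}
demo_vector = aggregate_features(demo_per_buffer)
print(demo_vector)
```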

Writing the ARFF File

ARFF (Attribute-Relation File Format) is the text format used by Weka and supported by some other ML tools. The header defines the relation and its attributes. The data section lists one line of comma-separated values per file, ending with the class label.

@RELATION music_speech
@ATTRIBUTE RMS_MEAN NUMERIC
@ATTRIBUTE ZCR_MEAN NUMERIC
@ATTRIBUTE SC_MEAN NUMERIC
@ATTRIBUTE SRO_MEAN NUMERIC
@ATTRIBUTE SFM_MEAN NUMERIC
@ATTRIBUTE RMS_STD NUMERIC
@ATTRIBUTE ZCR_STD NUMERIC
@ATTRIBUTE SC_STD NUMERIC
@ATTRIBUTE SRO_STD NUMERIC
@ATTRIBUTE SFM_STD NUMERIC
@ATTRIBUTE class {music,speech}
@DATA
0.057447,0.191595,128.656296,239.404651,0.329993,0.027113,0.036597,13.206525,27.957121,0.087828,music
0.062831,0.082504,78.481380,145.886047,0.198849,0.032323,0.070962,39.388633,66.942115,0.133545,speech

Write each numeric value with at least 6 decimal places.
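One way to produce this layout is to build the file contents as a string and write it out in a single step. The sketch below assumes each row is a pair of (list of 10 feature values, label); the function name format_arff and the demo row are hypothetical.

```python
def format_arff(rows, relation="music_speech"):
    """Build ARFF text from (features, label) pairs.

    Each features list is expected to hold 5 means followed by 5 stds.
    """
    names = ["RMS", "ZCR", "SC", "SRO", "SFM"]
    lines = [f"@RELATION {relation}"]
    lines += [f"@ATTRIBUTE {name}_MEAN NUMERIC" for name in names]
    lines += [f"@ATTRIBUTE {name}_STD NUMERIC" for name in names]
    lines.append("@ATTRIBUTE class {music,speech}")
    lines.append("@DATA")
    for features, label in rows:
        values = ",".join(f"{value:.6f}" for value in features)  # 6 decimal places
        lines.append(f"{values},{label}")
    return "\n".join(lines) + "\n"

# Placeholder values, not real measurements.
demo_rows = [([0.5] * 10, "music")]
demo_text = format_arff(demo_rows)
print(demo_text)
```

The returned string can then be saved with a regular open(path, "w") call.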

Going Further: MFCCs

Mel-Frequency Cepstral Coefficients (MFCCs) are widely used in speech and music processing. The steps are pre-emphasis, framing, windowing, FFT, mel filterbank, log, and DCT. For this dataset, compute 26 MFCCs per buffer, then take the mean and standard deviation of each coefficient across buffers for each file.

Pre-emphasis

Pre-emphasis is a first-order high-pass filter that boosts high frequencies relative to low ones: y[t] = x[t] - 0.95 * x[t-1].
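The filter maps directly onto array slicing. This is a minimal sketch for a non-empty floating-point signal; how to treat the first sample, which has no predecessor, is not specified above, so passing it through unchanged (equivalent to assuming x[-1] = 0) is an assumption.

```python
import numpy as np

def pre_emphasis(signal, coefficient=0.95):
    """Apply y[t] = x[t] - coefficient * x[t-1] to a float signal."""
    emphasized = np.empty_like(signal)
    emphasized[0] = signal[0]  # assumption: x[-1] is treated as 0
    emphasized[1:] = signal[1:] - coefficient * signal[:-1]
    return emphasized

demo_signal = np.array([1.0, 1.0, 1.0, 2.0])
demo_emphasized = pre_emphasis(demo_signal)
print(demo_emphasized)
```

Constant stretches of the signal are strongly attenuated, while the jump at the last sample is largely preserved.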

Mel Filterbank

Convert frequency to the mel scale: mel = 1127 * ln(1 + f/700). Create 26 triangular filters, equally spaced on the mel scale, covering 0 Hz to the Nyquist frequency. Map each filter's left, center, and right edges to FFT bins using floor, round, and ceil, respectively.
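A filterbank along these lines might be built as follows. This is a sketch under several assumptions: the filters are equally spaced in mel, each triangle peaks at 1.0, and the right edge is clipped to the last FFT bin. The function names are illustrative, and other implementations normalize the triangles differently.

```python
import numpy as np

def hz_to_mel(frequency_hz):
    return 1127.0 * np.log(1.0 + frequency_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (np.exp(mel / 1127.0) - 1.0)

def mel_filterbank(num_filters=26, buffer_length=1024, sample_rate=22050):
    """Return a (num_filters, buffer_length // 2 + 1) matrix of triangles."""
    num_bins = buffer_length // 2 + 1
    nyquist = sample_rate / 2
    # num_filters + 2 points equally spaced in mel; filter i uses points i, i+1, i+2.
    mel_points = np.linspace(0.0, hz_to_mel(nyquist), num_filters + 2)
    bin_points = mel_to_hz(mel_points) / nyquist * (num_bins - 1)  # fractional bins
    filters = np.zeros((num_filters, num_bins))
    for i in range(num_filters):
        left = int(np.floor(bin_points[i]))
        center = int(np.round(bin_points[i + 1]))
        # Clip so floating-point round-off cannot push the edge past the last bin.
        right = min(int(np.ceil(bin_points[i + 2])), num_bins - 1)
        if center > left:   # rising slope from 0 at left to 1 at center
            filters[i, left:center + 1] = np.linspace(0.0, 1.0, center - left + 1)
        if right > center:  # falling slope from 1 at center to 0 at right
            filters[i, center:right + 1] = np.linspace(1.0, 0.0, right - center + 1)
        filters[i, center] = 1.0
    return filters

demo_filterbank = mel_filterbank()
print(demo_filterbank.shape)
```

Multiplying this matrix by a magnitude spectrum of 513 bins gives one filterbank value per filter.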

DCT

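Apply scipy.fftpack.dct() to the log filterbank energies to obtain MFCCs. In current SciPy releases, scipy.fftpack is a legacy module and scipy.fft.dct() is the recommended equivalent.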
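The final two steps can be sketched as below. Several details are assumptions because the text does not fix them: the filterbank is applied to the magnitude spectrum (some implementations use the power spectrum), the natural logarithm is used, a small constant avoids log(0), and the DCT is the unnormalized type 2 that scipy.fftpack.dct() computes by default. The demo uses a toy identity-like filter matrix rather than a real mel filterbank so the result can be verified by hand.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_spectrum(magnitude, filters, eps=1e-10):
    """Log filterbank energies followed by a type-2 DCT."""
    energies = filters @ magnitude          # one value per filter
    log_energies = np.log(energies + eps)   # eps avoids log(0)
    return dct(log_energies, type=2)

# Toy input: every filter picks out one bin whose magnitude is e,
# so every log energy is 1 and only the first coefficient is non-zero.
demo_filters = np.eye(26, 513)
demo_magnitude = np.full(513, np.e)
demo_mfcc = mfcc_from_spectrum(demo_magnitude, demo_filters)
print(demo_mfcc[0])
```

With the unnormalized type-2 DCT, a constant input of 1 over 26 values gives a first coefficient of 2 * 26 = 52 and zeros elsewhere.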

Why This Matters

Audio feature extraction underpins many applications. Features like these have long been used in music information retrieval tasks such as genre recognition and recommendation, and MFCCs were for many years a standard front end for speech recognition systems. Audio classification also supports real-time sound event detection, for example in games. Mastering these techniques gives you a foundation for more advanced audio analysis.

Conclusion

You've learned how to extract time-domain (RMS, ZCR) and frequency-domain (spectral centroid, roll-off, flatness) features from audio, and how to write them into an ARFF file for classification. These skills are directly applicable to projects in music information retrieval, speech processing, and beyond. Experiment with different buffer sizes, overlap percentages, and additional features like MFCCs to improve your classifier's accuracy.