A Review of Physical and Perceptual Feature Extraction Techniques for Speech, Music and Environmental Sounds
Endowing machines with sensing capabilities similar to those of humans is a long-standing quest in engineering and computer science. In the pursuit of making computers sense their surroundings, considerable effort has been devoted to enabling machines to acquire, process, analyze, and understand their environment in a human-like way. Focusing on the sense of hearing, the ability of computers to sense their acoustic environment as humans do is known as machine hearing. To achieve this ambitious aim, the representation of the audio signal is of paramount importance. In this paper, we present an up-to-date review of the most relevant audio feature extraction techniques developed to analyze the most common audio signals: speech, music, and environmental sounds. Besides revisiting classic approaches for completeness, we include the latest advances in the field based on new domains of analysis together with novel bio-inspired proposals. These approaches are described following a taxonomy that organizes them according to their physical or perceptual basis, and further divides them by domain of computation (time, frequency, wavelet, image-based, cepstral, or other domains). The description of each approach is accompanied by recent examples of its application to machine hearing problems.
Summary
This paper surveys the main physical and perceptual audio feature extraction techniques used for speech, music and environmental sounds, emphasizing how representations influence machine hearing tasks. Readers will learn comparative strengths, typical implementations (spectral, cepstral, time‑frequency and wavelet methods), and considerations for robustness and computational cost when selecting features.
Key Takeaways
- Identify common physical and perceptual audio features (e.g., STFT/spectrogram, MFCC, PLP, chroma, and wavelet coefficients) and their typical uses (a minimal MFCC sketch follows this list).
- Compare feature suitability across domains — speech, music, and environmental sounds — with respect to discrimination power and perceptual relevance.
- Apply guidelines for choosing time‑frequency parameters, windowing and filterbank designs to balance resolution, complexity and robustness.
- Assess noise robustness and preprocessing needs (e.g., normalization, voice activity detection, and augmentation) for reliable real-world performance (see the VAD sketch after this list).
- Evaluate computational and implementation trade‑offs when integrating features into classification or detection pipelines.
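To make the filterbank and cepstral ideas above concrete, here is a minimal MFCC sketch in Python using only NumPy and SciPy. The 512-sample frames, 160-sample hop, 26 mel bands, and 13 coefficients are common textbook defaults assumed for illustration, not values prescribed by the review.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    """Convert frequency in Hz to the mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Convert a mel value back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Minimal MFCC: Hamming-windowed power spectrum -> triangular
    mel filterbank -> log energies -> DCT-II (cepstral decorrelation)."""
    window = np.hamming(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i*hop : i*hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2

    # Triangular filters spaced uniformly on the mel scale up to Nyquist.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge

    log_mel = np.log(power @ fbank.T + 1e-10)  # log mel energies
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_ceps]
```

Changing n_fft and hop trades spectral resolution against temporal resolution, which is exactly the windowing trade-off the third takeaway refers to.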
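Likewise, a crude illustration of the preprocessing in the fourth takeaway: peak normalization followed by a fixed-threshold, short-time-energy voice activity detector. The -35 dB threshold (relative to the loudest frame) and frame sizes are assumptions for demonstration; real systems usually prefer adaptive thresholds or model-based VAD.

```python
import numpy as np

def normalize_and_vad(signal, sr=16000, frame_ms=25, hop_ms=10,
                      threshold_db=-35.0):
    """Peak-normalize the signal, then flag frames whose short-time
    energy exceeds a threshold relative to the loudest frame."""
    signal = signal / (np.max(np.abs(signal)) + 1e-10)   # peak normalization
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame) // hop
    energy_db = np.array([
        10.0 * np.log10(np.mean(signal[i*hop : i*hop + frame] ** 2) + 1e-12)
        for i in range(n_frames)
    ])
    voiced = energy_db > (energy_db.max() + threshold_db)  # per-frame mask
    return signal, voiced
```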
Who Should Read This
Intermediate engineers, researchers, and graduate students working on audio/speech processing, acoustic scene analysis, or machine hearing who need to select or compare feature extraction methods.
Still Relevant · Intermediate
Related Documents
- A New Approach to Linear Filtering and Prediction Problems (Timeless · Advanced)
- A Quadrature Signals Tutorial: Complex, But Not Complicated (Timeless · Intermediate)
- An Introduction To Compressive Sampling (Timeless · Intermediate)
- Lecture Notes on Elliptic Filter Design (Timeless · Advanced)
- Computing FFT Twiddle Factors (Timeless · Advanced)