Monday, December 14, 2009

Audio features in two levels

Audio features are usually extracted at two levels: the short-term (frame) level and the long-term (clip) level.

A frame is a group of neighboring samples lasting about 10-40 ms. Within a frame, the audio signal is assumed to be stationary, so short-term features such as energy and Fourier transform coefficients can be extracted.
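
As a rough illustration of frame-level extraction, here is a minimal Python sketch that cuts a signal into frames and computes the energy of each. The function names, the 25 ms frame length, the non-overlapping framing, and the synthetic 440 Hz test tone are all assumptions made for the example; real systems often use overlapping frames and a window function.

```python
import math

def split_into_frames(samples, sample_rate, frame_ms=25.0):
    """Split samples into non-overlapping frames of about frame_ms milliseconds.

    A trailing partial frame is dropped (an assumption made for simplicity).
    """
    frame_len = max(1, int(round(sample_rate * frame_ms / 1000.0)))
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def short_term_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

# Synthetic input: one second of a 440 Hz sine sampled at 8 kHz.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 440 * n / sample_rate)
          for n in range(sample_rate)]

frames = split_into_frames(signal, sample_rate, frame_ms=25.0)
energies = [short_term_energy(f) for f in frames]
```

With these settings each frame holds 200 samples, so one second of audio yields 40 frames and 40 energy values.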

For features that reveal semantic meaning, we use audio clips lasting from one second to several tens of seconds, sometimes called ‘windows’.
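
One common way to build clip-level features is to summarize the frame-level values that fall inside each clip. The sketch below does this with the mean and variance of per-frame energy. The choice of statistics, the one-second clip length, the function name, and the made-up energy values are assumptions for illustration, not something the text above prescribes.

```python
from statistics import mean, pvariance

def clip_features(frame_values, frames_per_clip):
    """Summarize consecutive groups of frame-level values as clip-level statistics.

    A trailing group shorter than frames_per_clip is dropped.
    """
    clips = []
    last_start = len(frame_values) - frames_per_clip
    for start in range(0, last_start + 1, frames_per_clip):
        window = frame_values[start:start + frames_per_clip]
        clips.append({"mean": mean(window), "variance": pvariance(window)})
    return clips

# Hypothetical per-frame energies: 40 frames per second (25 ms frames),
# one steady second followed by one fluctuating second.
frame_energies = [0.5] * 40 + [0.1, 0.9] * 20
clip_stats = clip_features(frame_energies, frames_per_clip=40)
```

Both one-second clips here have the same mean energy, but only the second has a non-zero variance, which is the kind of longer-term behavior a single frame cannot show.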

