Audio features are usually extracted at two levels: the short-term (frame) level and the long-term (clip) level.
A frame is a group of neighboring samples lasting about 10-40 ms. Over such a short interval the audio signal can be assumed to be stationary, so short-term features such as energy and Fourier transform coefficients can be extracted from it.
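As a rough sketch of frame-level extraction, the snippet below splits a signal into non-overlapping frames and computes the short-term energy of each one. The 16 kHz sample rate, the 25 ms frame length, the test tone, and the function names are all illustrative assumptions; real systems often use overlapping frames and a window function as well.

```python
import math


def split_into_frames(samples, sample_rate, frame_ms=25):
    """Split a list of samples into non-overlapping frames of frame_ms milliseconds.

    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def short_term_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


# Example input (assumed): one second of a 440 Hz sine tone sampled at 16 kHz.
sample_rate = 16000
signal = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]

frames = split_into_frames(signal, sample_rate, frame_ms=25)
energies = [short_term_energy(f) for f in frames]
```

With these assumed values each frame holds 400 samples, and `energies` contains one value per frame.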
For a feature to reveal semantic meaning, it has to be computed over a longer audio clip, sometimes called a ‘window’, lasting from one second to several tens of seconds.
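The text does not say how a clip-level feature is computed. One common approach, assumed here purely for illustration, is to summarise a frame-level feature over each clip by its mean and standard deviation. The function name, the frame values, and the clip size below are hypothetical.

```python
import statistics


def clip_level_features(frame_values, frames_per_clip):
    """Group consecutive frame-level values into clips and summarise each clip.

    Returns a list of (mean, standard deviation) tuples, one per clip.
    Trailing frames that do not fill a whole clip are dropped.
    """
    if frames_per_clip <= 0:
        raise ValueError("frames_per_clip must be positive")
    n_clips = len(frame_values) // frames_per_clip
    features = []
    for c in range(n_clips):
        chunk = frame_values[c * frames_per_clip:(c + 1) * frames_per_clip]
        # Population standard deviation: the clip is treated as the whole population.
        features.append((statistics.fmean(chunk), statistics.pstdev(chunk)))
    return features


# Example input (made-up frame energies, 4 frames per clip to keep it short).
frame_energies = [0.1, 0.3, 0.1, 0.3, 0.5, 0.5, 0.5, 0.5]
clips = clip_level_features(frame_energies, frames_per_clip=4)
```

In practice a one-second clip of 25 ms non-overlapping frames would use `frames_per_clip=40` rather than 4.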