Hand-crafted versus learned representations for audio event detection


Kucukbay S. E., YAZICI A., KALKAN S.

MULTIMEDIA TOOLS AND APPLICATIONS, vol.81, no.21, pp.30911-30930, 2022 (SCI-Expanded)

  • Publication Type: Article
  • Volume: 81 Issue: 21
  • Publication Date: 2022
  • Doi Number: 10.1007/s11042-022-12873-5
  • Journal Name: MULTIMEDIA TOOLS AND APPLICATIONS
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, FRANCIS, ABI/INFORM, Applied Science & Technology Source, Compendex, Computer & Applied Sciences, INSPEC, zbMATH
  • Page Numbers: pp.30911-30930
  • Keywords: Audio event detection, Audio event classification, Deep learning, Log mel spectrogram, Mel spectrogram, Spectrogram, MFCC, CLASSIFICATION
  • Middle East Technical University Affiliated: Yes

Abstract

Audio Event Detection (AED) pertains to identifying the types of events in audio signals. AED is essential for applications that require decisions based on audio signals, which can be critical in, for example, health, surveillance and security applications. Despite the proven benefits of deep learning in obtaining the best representation for solving a problem, AED studies still generally employ hand-crafted representations even when deep learning is used to solve the AED task. Intrigued by this, we investigate whether hand-crafted representations (i.e., spectrogram, mel spectrogram, log mel spectrogram and mel frequency cepstral coefficients) are better than a representation learned using a Convolutional Autoencoder (CAE). To the best of our knowledge, our study is the first to ask this question and to thoroughly compare feature representations for AED. To this end, we first find the best hop size and window size for each hand-crafted representation and then compare the optimized hand-crafted representations with CAE-learned representations. Our extensive analyses on a subset of the AudioSet dataset confirm the common practice: hand-crafted representations do perform better than learned features by a large margin (approximately 30 AP). Moreover, we show that the commonly used window and hop sizes do not yield optimal performance for the hand-crafted representations.
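
To illustrate what the window size and hop size control in a hand-crafted representation, the following is a minimal NumPy sketch of a magnitude spectrogram and its log-compressed version. It is not the authors' pipeline: the function names, the Hann window, the 16 kHz sample rate, the synthetic 440 Hz tone, and the two window/hop settings are all assumptions chosen for illustration, and the mel filterbank and MFCC steps are omitted.

```python
import numpy as np


def stft_magnitude(signal, window_size, hop_size):
    """Magnitude spectrogram from overlapping Hann-windowed frames (no padding).

    Hypothetical helper for illustration; returns shape (freq_bins, n_frames).
    """
    if len(signal) < window_size:
        raise ValueError("signal is shorter than one analysis window")
    # The hop size sets how far consecutive frames are shifted apart.
    n_frames = 1 + (len(signal) - window_size) // hop_size
    window = np.hanning(window_size)
    frames = np.stack([
        signal[i * hop_size : i * hop_size + window_size] * window
        for i in range(n_frames)
    ])
    # rfft keeps the non-negative frequencies: window_size // 2 + 1 bins per frame.
    return np.abs(np.fft.rfft(frames, axis=1)).T


def log_compress(spectrogram, eps=1e-10):
    """Log-compress magnitudes; eps avoids log(0). Illustrative only."""
    return np.log(spectrogram + eps)


# One second of a synthetic 440 Hz tone at an assumed 16 kHz sample rate.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)

# Two illustrative settings (samples): a short window with a small hop,
# and a long window with a large hop.
spec_short = stft_magnitude(tone, window_size=400, hop_size=160)
spec_long = stft_magnitude(tone, window_size=1024, hop_size=512)
log_spec_short = log_compress(spec_short)

print(spec_short.shape, spec_long.shape)
```

The two settings trade time resolution against frequency resolution: the shorter window yields more frames with fewer frequency bins, while the longer window yields fewer frames with finer frequency bins. This is why the choice of window and hop size can change what a downstream classifier sees.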