Hand-crafted versus learned representations for audio event detection

Kucukbay, Selver; Yazıcı, Adnan; Kalkan, Sinan

doi:10.1007/s11042-022-12873-5

Kucukbay S. E., Yazıcı A., Kalkan S.

Multimedia Tools and Applications, vol. 81, no. 21, pp. 30911-30930, 2022 (SCI-Expanded)

Publication Type: Article / Full Article
Volume: 81 Issue: 21
Publication Date: 2022
DOI: 10.1007/s11042-022-12873-5
Journal Name: Multimedia Tools and Applications
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, FRANCIS, ABI/INFORM, Applied Science & Technology Source, Compendex, Computer & Applied Sciences, INSPEC, zbMATH
Pages: pp. 30911-30930
Keywords: Audio event detection, Audio event classification, Deep learning, Log mel spectrogram, Mel spectrogram, Spectrogram, MFCC, Classification
Middle East Technical University Affiliated: Yes

Abstract

Audio Event Detection (AED) pertains to identifying the types of events in audio signals. AED is essential for applications that make decisions based on audio signals, and those decisions can be critical, for example in health, surveillance and security applications. Despite the proven benefits of deep learning in obtaining the best representation for solving a problem, AED studies still generally employ hand-crafted representations, even when deep learning is used to solve the AED task. Intrigued by this, we investigate whether hand-crafted representations (i.e. spectrogram, mel spectrogram, log mel spectrogram and mel frequency cepstral coefficients) are better than a representation learned using a Convolutional Autoencoder (CAE). To the best of our knowledge, our study is the first to ask this question and to thoroughly compare feature representations for AED. To this end, we first find the best hop size and window size for each hand-crafted representation, and then compare the optimized hand-crafted representations with CAE-learned representations. Our extensive analyses on a subset of the AudioSet dataset support the common practice: hand-crafted representations perform better than learned features by a large margin (approximately 30 AP). Moreover, we show that the commonly used window and hop sizes do not provide the optimal performance for the hand-crafted representations.
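
For readers unfamiliar with the four hand-crafted representations named in the abstract, the following is a minimal NumPy sketch of how each one depends on the window size and hop size. It is not the authors' implementation. The function names, the 16 kHz sample rate, the 40 mel bands, the 13 cepstral coefficients, the Hann window, and the two window/hop pairs are all assumptions chosen for illustration; the abstract does not state any of these values.

```python
import numpy as np

def power_spectrogram(signal, window_size, hop_size):
    """Short-time power spectrogram, shape (num_frames, window_size // 2 + 1)."""
    num_frames = 1 + (len(signal) - window_size) // hop_size
    window = np.hanning(window_size)  # assumed window type
    frames = np.stack([
        signal[i * hop_size: i * hop_size + window_size] * window
        for i in range(num_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(num_mels, window_size, sample_rate):
    """Triangular mel filters, shape (num_mels, window_size // 2 + 1)."""
    fft_freqs = np.fft.rfftfreq(window_size, d=1.0 / sample_rate)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             num_mels + 2)
    hz_points = mel_to_hz(mel_points)
    bank = np.zeros((num_mels, len(fft_freqs)))
    for m in range(num_mels):
        lower, center, upper = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - lower) / (center - lower)
        falling = (upper - fft_freqs) / (upper - center)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def dct_ii(x, num_coeffs):
    """Unnormalized DCT-II along the last axis, keeping num_coeffs terms."""
    n = x.shape[-1]
    k = np.arange(num_coeffs)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
    return x @ basis.T

def handcrafted_features(signal, sample_rate, window_size, hop_size,
                         num_mels=40, num_mfcc=13):
    """Return the four representations as a dict of 2-D arrays (time first)."""
    spec = power_spectrogram(signal, window_size, hop_size)
    mel = spec @ mel_filterbank(num_mels, window_size, sample_rate).T
    log_mel = np.log(mel + 1e-10)      # small offset avoids log(0)
    mfcc = dct_ii(log_mel, num_mfcc)   # cepstral coefficients from log mel
    return {"spectrogram": spec, "mel": mel, "log_mel": log_mel, "mfcc": mfcc}

# Illustrative comparison of two (window, hop) settings on a synthetic tone.
sample_rate = 16000                       # assumed, not from the abstract
t = np.arange(sample_rate) / sample_rate  # one second of audio
tone = np.sin(2 * np.pi * 440.0 * t)
for window_size, hop_size in [(400, 160), (1024, 512)]:
    feats = handcrafted_features(tone, sample_rate, window_size, hop_size)
    shapes = {name: arr.shape for name, arr in feats.items()}
    print(window_size, hop_size, shapes)
```

The window size sets the frequency resolution (number of FFT bins) and the hop size sets the time resolution (number of frames), so changing either one changes the shape and content of every representation derived from the spectrogram. This is why a search over both parameters, as described in the abstract, can affect downstream detection performance.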