Automated Audio Captioning With Topic Modeling

Eren, Aysegul O.; Sert, M.

doi:10.1109/access.2023.3235733

Eren A. O., Sert M.

IEEE Access, vol. 11, pp. 4983-4991, 2023 (SCI-Expanded)

Publication Type: Article / Full Article
Volume: 11
Publication Date: 2023
DOI: 10.1109/access.2023.3235733
Journal Name: IEEE Access
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, INSPEC, Directory of Open Access Journals
Pages: pp. 4983-4991
Keywords: Semantics, Bit error rate, Feature extraction, Transformers, Event detection, Audio systems, Predictive models, Audio captioning, audio event detection, PANNs, topic modeling, BERTopic
Middle East Technical University Affiliated: No

Abstract

Automatic audio captioning (AAC) is an important area of research aimed at generating meaningful descriptions for audio clips. Most existing methods use relevant semantic information to improve AAC performance and have demonstrated the feasibility of semantic information extraction. Audio events and keywords are commonly used for this purpose. Unlike previous studies, this study uses topic modeling to obtain relevant semantic content, since topic models explore the main themes of documents. To this end, we present a framework that integrates audio embeddings with audio topics in a transformer-based encoder-decoder architecture. First, we represent each audio clip with a set of topics using a pre-trained topic model, BERTopic. Then, we design a multilayer perceptron (MLP)-based multi-label classifier to predict the topics of audio clips in the testing phase. Finally, we feed the audio embedding and the extracted topics into the transformer model to generate captions. The results show that the proposed model improves performance and competes with the most advanced methods that utilize additional external data for training. We believe that topic modeling can be used to extract semantic content in the AAC task.
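
The topic-prediction and fusion steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the layer sizes, random weights, 0.5 threshold, and the choice to fuse by concatenating the audio embedding with a multi-hot topic vector are all assumptions made for illustration, and the helper names (`predict_topics`, `fuse`) are hypothetical. The sketch only shows the general shape of a multi-label MLP (independent sigmoid outputs, one per topic) whose predictions are combined with the audio embedding before a caption decoder.

```python
import math
import random


def relu(v):
    return [max(0.0, x) for x in v]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights, bias, v):
    # weights holds one row per output unit; computes W v + b.
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def init_layer(n_out, n_in, rng):
    # Random weights stand in for trained parameters (assumption).
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def predict_topics(embedding, hidden_layer, output_layer, threshold=0.5):
    # Multi-label prediction: each topic gets its own sigmoid score,
    # so a clip can be assigned zero, one, or several topics.
    hidden = relu(linear(*hidden_layer, embedding))
    probs = [sigmoid(z) for z in linear(*output_layer, hidden)]
    multi_hot = [1.0 if p >= threshold else 0.0 for p in probs]
    return probs, multi_hot


def fuse(embedding, multi_hot):
    # Assumed fusion: plain concatenation of embedding and topic vector.
    return list(embedding) + list(multi_hot)


rng = random.Random(0)
EMBED_DIM, HIDDEN_DIM, NUM_TOPICS = 8, 16, 5  # illustrative sizes only

hidden_layer = init_layer(HIDDEN_DIM, EMBED_DIM, rng)
output_layer = init_layer(NUM_TOPICS, HIDDEN_DIM, rng)

# Stand-in for a pre-trained audio embedding of one clip.
audio_embedding = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]

probs, topics = predict_topics(audio_embedding, hidden_layer, output_layer)
decoder_input = fuse(audio_embedding, topics)
```

In a real system, the classifier would be trained against topic labels derived from the training captions, and `decoder_input` would be passed to the transformer in place of the raw audio embedding.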