Digital Signal Processing: A Review Journal, vol.158, 2025 (SCI-Expanded)
Audio detection is the process of identifying and analyzing audio signals to extract useful information or to detect specific events or patterns in the audio data, using computational techniques such as signal processing and machine or deep learning. This study provides a comprehensive overview of the applications of transformers in audio detection tasks. Transformers, originally designed for natural language processing, have shown remarkable performance in capturing long-range dependencies and complex patterns in audio signals across tasks including sound event detection, deepfake audio detection, and speech recognition. The review begins with an overview of the fundamental concepts: data preprocessing techniques that convert raw audio waveforms into spectrogram representations for effective transformer-based processing, and the architecture of transformer encoders, with emphasis on the self-attention mechanisms and feed-forward neural networks that enable them to model dependencies between time steps and frequency components in audio sequences. We then explore how task-specific heads map the learned representations to the specific events, patterns, or features in audio signals that are relevant to various audio detection tasks. Next, we summarize the current state of audio detection using transformers, identifying potential research directions and highlighting the significance of transformers in advancing the field of audio analysis, particularly in the context of the growing threat of deepfake audio manipulation. We compare the methods based on input data, models, datasets, performance, and application areas, and speculate on future directions, offering insights and details not covered by other surveys on this topic.
Overall, through an exploration of the existing literature and research advancements, this review provides valuable insights into the evolving landscape of audio detection techniques using transformers and paves the way for future advancements in this dynamic domain.
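
As an illustration of the pipeline summarized in the abstract (waveform to spectrogram, transformer encoder with self-attention and a feed-forward network, then a task-specific head), the following is a minimal NumPy sketch and not the method of any surveyed paper. All names (`log_spectrogram`, `self_attention`, `detect`, `init_params`), the layer sizes, and the random untrained weights are assumptions made for illustration. It uses a single attention head and omits positional encodings, layer normalization, and training.

```python
import numpy as np

def log_spectrogram(wave, n_fft=256, hop=128):
    """Window overlapping frames and return log-magnitude STFT frames."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, n_fft // 2 + 1)
    return np.log(magnitude + 1e-6)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over time frames."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights

def init_params(n_bins=129, d_model=32, d_ff=64, n_classes=2, seed=0):
    """Random, untrained weights (hypothetical sizes, for illustration only)."""
    rng = np.random.default_rng(seed)
    shapes = {
        "w_in": (n_bins, d_model), "w_q": (d_model, d_model),
        "w_k": (d_model, d_model), "w_v": (d_model, d_model),
        "w_ff1": (d_model, d_ff), "w_ff2": (d_ff, d_model),
        "w_head": (d_model, n_classes),
    }
    return {name: rng.normal(0.0, 0.1, shape) for name, shape in shapes.items()}

def detect(wave, params):
    """Spectrogram -> one encoder block -> mean pooling -> classification head."""
    spec = log_spectrogram(wave)                 # (frames, bins)
    tokens = spec @ params["w_in"]               # project each frame to d_model
    attended, weights = self_attention(
        tokens, params["w_q"], params["w_k"], params["w_v"])
    hidden = tokens + attended                   # residual connection
    ff = np.maximum(0.0, hidden @ params["w_ff1"]) @ params["w_ff2"]
    hidden = hidden + ff                         # residual connection
    pooled = hidden.mean(axis=0)                 # average over time frames
    return softmax(pooled @ params["w_head"]), weights

params = init_params()
wave = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)  # 1 s tone at 8 kHz
probs, weights = detect(wave, params)
```

Here `probs` holds class probabilities for a hypothetical two-class task (for example, genuine versus deepfake), and `weights` is the frame-by-frame attention matrix. Because the weights are untrained, the probabilities carry no meaning; the sketch only shows how data flows through the stages.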