Bimodal automatic speech segmentation and boundary refinement techniques

EREN AKDEMİR


Thesis Type: Doctorate

Institution: Middle East Technical University, Faculty of Engineering, Department of Electrical and Electronics Engineering, Türkiye

Thesis Approval Date: 2010

Student: EREN AKDEMİR

Supervisor: TOLGA ÇİLOĞLU

Open Archive Collection: AVESİS Open Access Collection

Abstract:

Automatic speech segmentation is essential for building the large speech databases used in speech processing applications. This study proposes a bimodal automatic speech segmentation system that combines auditory information with either articulator motion information (AMI) or visual information captured by a camera. The visual modality has been shown to benefit speech recognition applications, improving both their performance and their noise robustness. In this dissertation, a bimodal approach yields a significant increase in the performance of the automatic speech segmentation system. Automatic speech segmentation systems face a tradeoff between precision and the number of gross errors. Boundary refinement techniques are used to increase the precision of these systems without degrading overall performance. Two novel boundary refinement techniques are proposed in this thesis: a hidden Markov model (HMM) based fine-tuning system and an inverse filtering based fine-tuning system. These techniques further improve the segment boundaries produced by the bimodal speech segmentation system. To meet these goals, a complete two-stage automatic speech segmentation system was built and tested on two different databases. A phonetically rich Turkish audiovisual speech database, containing acoustic data and camera recordings of 1600 Turkish sentences uttered by a male speaker, was built from scratch for use in the experiments. The visual features of the recordings were extracted, and the database was manually aligned at the phonetic level to serve as ground truth for performance tests of the automatic speech segmentation systems.
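The tradeoff between precision and gross errors can be made concrete with a small scoring sketch that compares automatic boundaries against manually aligned ones. The abstract does not state which metrics or thresholds the thesis uses, so everything here is an assumption for illustration: the function name `boundary_report`, the 20 ms tolerance, the 50 ms gross-error threshold, and the boundary times themselves. It also assumes both alignments share the same phone sequence, so boundaries can be paired by index.

```python
def boundary_report(auto_ms, manual_ms, tolerance_ms=20.0, gross_ms=50.0):
    """Score automatic boundaries against manual ones, paired by index.

    tolerance_ms and gross_ms are illustrative thresholds, not values
    taken from the thesis.
    """
    if len(auto_ms) != len(manual_ms):
        raise ValueError("boundary lists must have the same length")
    if not auto_ms:
        raise ValueError("at least one boundary is required")

    # Absolute deviation of each automatic boundary from its manual reference.
    errors = [abs(a - m) for a, m in zip(auto_ms, manual_ms)]
    n = len(errors)
    return {
        "mean_abs_error_ms": sum(errors) / n,
        # Precision-style measure: share of boundaries within the tolerance.
        "within_tolerance": sum(e <= tolerance_ms for e in errors) / n,
        # Share of boundaries that miss by more than the gross threshold.
        "gross_error_rate": sum(e > gross_ms for e in errors) / n,
    }


# Made-up boundary times in milliseconds, for illustration only.
manual = [120.0, 310.0, 455.0, 690.0]
automatic = [125.0, 300.0, 520.0, 688.0]
report = boundary_report(automatic, manual)
print(report)
```

Under these assumptions, a refinement stage would aim to raise `within_tolerance` and lower `mean_abs_error_ms` without increasing `gross_error_rate`.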