Estimation of Articulatory Trajectories Based on Gaussian Mixture Model (GMM) With Audio-Visual Information Fusion and Dynamic Kalman Smoothing

ÖZBEK İ. Y., Hasegawa-Johnson M., DEMİREKLER M.

IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, vol.19, no.5, pp.1180-1195, 2011 (SCI-Expanded)

  • Publication Type: Article
  • Volume: 19 Issue: 5
  • Publication Date: 2011
  • Doi Number: 10.1109/tasl.2010.2087751
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus
  • Page Numbers: pp.1180-1195
  • Keywords: Audiovisual fusion, audiovisual-to-articulatory inversion, Gaussian mixture model (GMM), Kalman smoother, maximum-likelihood trajectory estimation, VOCAL-TRACT SHAPE, FREQUENCIES, INVERSION, MOVEMENTS, HMM
  • Middle East Technical University Affiliated: No


This paper presents a detailed framework for Gaussian mixture model (GMM)-based articulatory inversion equipped with special postprocessing smoothers and with the capability to perform audiovisual information fusion. The effects of different acoustic features on the GMM inversion performance are investigated, and it is shown that the integration of various types of acoustic (and visual) features improves the performance of the articulatory inversion process. Dynamic Kalman smoothers are proposed to adapt the cutoff frequency of the smoother to data and noise characteristics; Kalman smoothers also enable the incorporation of auxiliary information such as phonetic transcriptions to improve articulatory estimation. Two types of dynamic Kalman smoothers are introduced: global Kalman (GK) and phoneme-based Kalman (PBK). The GK smoother uses the same dynamic model for all phonemes; it is shown that GK improves articulatory inversion performance more than the conventional low-pass (LP) smoother. However, the PBK smoother, which uses one dynamic model for each phoneme, gives significantly better results than the GK smoother. Different methodologies to fuse the audio and visual information are examined. A novel modified late fusion algorithm, designed to consider the observability degree of the articulators, is shown to give better results than either the early or the late fusion methods. Extensive experimental studies are conducted with the MOCHA database to illustrate the performance gains obtained by the proposed algorithms. The average RMS error and correlation coefficient between the true (measured) and the estimated articulatory trajectories are 1.227 mm and 0.868 using audiovisual information fusion and GK smoothing, and 1.199 mm and 0.876 using audiovisual information fusion together with PBK smoothing based on a phonetic transcription of the utterance.
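To make the smoothing and evaluation steps in the abstract concrete, the following is a minimal sketch of a Kalman filter followed by a Rauch-Tung-Striebel backward pass, using one dynamic model for the whole trajectory (similar in spirit to the global smoother described above), together with the RMS error and correlation coefficient used as metrics. The abstract does not specify the state-space model, so everything here is an assumption for illustration: the constant-velocity model, the noise parameters `q` and `r`, the synthetic sinusoidal "articulator" trajectory, and all function names. It is not the authors' implementation.

```python
import numpy as np

def rts_smooth(z, F, H, Q, R, x0, P0):
    """Forward Kalman filter plus Rauch-Tung-Striebel backward smoothing.

    z: (T, m) observations; returns (T, n) smoothed state means.
    """
    T, n = len(z), F.shape[0]
    x_pred = np.zeros((T, n)); P_pred = np.zeros((T, n, n))
    x_filt = np.zeros((T, n)); P_filt = np.zeros((T, n, n))
    for t in range(T):
        # Prediction step (the prior is used directly at t = 0).
        if t == 0:
            xp, Pp = x0, P0
        else:
            xp = F @ x_filt[t - 1]
            Pp = F @ P_filt[t - 1] @ F.T + Q
        x_pred[t], P_pred[t] = xp, Pp
        # Measurement update.
        S = H @ Pp @ H.T + R
        K = Pp @ H.T @ np.linalg.inv(S)
        x_filt[t] = xp + K @ (z[t] - H @ xp)
        P_filt[t] = (np.eye(n) - K @ H) @ Pp
    # Backward pass: refine each filtered estimate using future frames.
    x_smooth = x_filt.copy()
    for t in range(T - 2, -1, -1):
        G = P_filt[t] @ F.T @ np.linalg.inv(P_pred[t + 1])
        x_smooth[t] = x_filt[t] + G @ (x_smooth[t + 1] - x_pred[t + 1])
    return x_smooth

def rmse(a, b):
    """Root-mean-square error between two trajectories."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

def corr(a, b):
    """Pearson correlation coefficient between two trajectories."""
    return float(np.corrcoef(a, b)[0, 1])

def demo(seed=0, q=0.01, r=1.0):
    """Smooth a noisy synthetic trajectory (hypothetical data, not MOCHA)."""
    rng = np.random.default_rng(seed)
    T, dt = 200, 1.0
    truth = 5.0 * np.sin(2 * np.pi * np.arange(T) / 100.0)
    z = (truth + rng.normal(0.0, 1.0, T)).reshape(T, 1)
    # Assumed constant-velocity model: state = [position, velocity].
    F = np.array([[1.0, dt], [0.0, 1.0]])
    H = np.array([[1.0, 0.0]])
    Q = q * np.array([[dt**3 / 3, dt**2 / 2], [dt**2 / 2, dt]])
    R = np.array([[r]])
    x0 = np.array([z[0, 0], 0.0])
    P0 = 10.0 * np.eye(2)
    smoothed = rts_smooth(z, F, H, Q, R, x0, P0)
    return truth, z[:, 0], smoothed[:, 0]

truth, raw, smooth = demo()
print("RMSE raw:", rmse(raw, truth), "smoothed:", rmse(smooth, truth))
print("corr raw:", corr(raw, truth), "smoothed:", corr(smooth, truth))
```

In this sketch the ratio of `q` to `r` plays the role of the smoother's effective cutoff: a larger `q` trusts the observations more and smooths less. A phoneme-dependent variant would, under the same assumptions, switch `F` and `Q` per frame according to a phonetic alignment.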