Combining Structural Analysis and Computer Vision Techniques for Automatic Speech Summarization


IEEE International Symposium on Multimedia, California, United States of America, 15 - 17 December 2008, pp.515-516

  • Publication Type: Conference Paper / Full Text
  • DOI Number: 10.1109/ism.2008.90
  • City: California
  • Country: United States of America
  • Page Numbers: pp.515-516


Just as verse and chorus sections appear as repetitive structures in musical audio, the key concept (or topic) of some speech recordings (e.g., presentations and lectures) may also repeat over time. Accurate detection of these repetitions may therefore help automatic speech summarization. Based on this motivation, we consider the applicability of music structural analysis methods to speech summary generation. Our method transforms a 1-D time-domain speech signal into a 2-D image representation, namely a (dis)similarity matrix, and detects possible repetitions within the matrix by using suitable computer vision techniques. The method does not transcribe the speech signal into words, phrases, or sentences. Hence, it can be regarded as a speech-to-speech summarization method, in which summarization results are presented as speech instead of text. Furthermore, the method does not need prior knowledge of the language or grammar of the speech signal. Experiments show that our method can capture the main theme of speech signals when compared against the ideal transcription sections defined by experts, and computational analysis shows that the proposed method performs well.
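
The pipeline described in the abstract (1-D signal, then 2-D (dis)similarity matrix, then repetition detection) can be illustrated with a minimal sketch. The abstract does not name the features, the distance measure, or the computer vision techniques used, so everything below is an assumption made for illustration: log-magnitude spectra of short frames as features, cosine distance as the dissimilarity, and a simple mean over each off-diagonal as a stand-in for the stripe detection that an image-based method would perform. The function names and parameters (frame_features, dissimilarity_matrix, best_repetition_lag, frame_len, hop, min_lag) are hypothetical and do not come from the paper.

```python
import numpy as np


def frame_features(signal, frame_len=256, hop=128):
    """Cut a 1-D signal into overlapping windowed frames and return
    one log-magnitude spectrum per frame (assumed feature choice)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([
        signal[i * hop: i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    return np.log1p(np.abs(np.fft.rfft(frames, axis=1)))


def dissimilarity_matrix(features):
    """Pairwise cosine distance between frame features.
    Entry (i, j) is near 0 when frames i and j are alike."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)
    return 1.0 - unit @ unit.T


def best_repetition_lag(dissim, min_lag):
    """Score each off-diagonal by its mean dissimilarity and return the
    lag (in frames) with the lowest score, plus all scores.
    A repeated segment shows up as a dark stripe parallel to the main
    diagonal; averaging along diagonals is a crude stripe detector."""
    n = dissim.shape[0]
    lags = np.arange(min_lag, n - min_lag)
    scores = np.array([np.diagonal(dissim, offset=lag).mean() for lag in lags])
    return int(lags[np.argmin(scores)]), scores
```

To try it, one could build a synthetic signal in which a segment recurs (for example, two noise segments concatenated twice), pass it through frame_features and dissimilarity_matrix, and check that best_repetition_lag recovers the spacing between the repeats in frames. Real speech would need more robust features and a detector that tolerates tempo and wording variation, which is presumably where the image-processing step of the actual method comes in.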