Speech emotion recognition using auditory models

Thesis Type: Postgraduate

Institution Of The Thesis: Orta Doğu Teknik Üniversitesi, Graduate School of Informatics, Cognitive Science, Turkey

Approval Date: 2013




With the advent of computational technology, human computer interaction (HCI) has gone beyond simple logical calculations. Affective computing aims to improve human computer interaction in a mental state level allowing computers to adapt their responses according to human needs. As such, affective computing aims to recognize emotions by capturing cues from visual, auditory, tactile and other biometric signals recorded from humans. Emotions play a crucial role in modulating how humans experience and interact with the outside world and have a huge effect on the human decision making process. They are an essential part of human social relations and take role in important life decisions. Therefore detection of emotions is crucial in high level interactions. Each emotion has unique properties that make us recognize them. Acoustic signal generated for the same utterance or sentence changes primarily due to biophysical changes (such as stress-induced constriction of the larynx) triggered by emotions. This relation between acoustic cues and emotions made speech emotion recognition one of the trending topics of the affective computing domain. The main purpose of a speech emotion recognition algorithm is to detect the emotional state of a speaker from recorded speech signals. Human auditory system is a non-linear and adaptive mechanism which involves frequency dependent filtering as well as temporal and simultaneous masking. While emotion can be manifested in acoustic signals recorded using a high quality microphone and extracted using high resolution signal processing techniques, a human listener has access only to cues which are available to him/her via the auditory system. This type of limited access to emotion cues also reduces the subjective emotion recognition accuracy. A speech emotion recognition algorithm based on a model of the human auditory system is developed and its accuracy is evaluated in this thesis. A state-of-the-art human auditory filter bank model is used to process clean speech signals. Simple features are then extracted from the output signals and used to train binary classifiers for seven different classes (anger, fear, happiness, sadness, disgust, boredom and neutral) of emotions. The classifiers are then tested using a validation set to assess the recognition performance. Three emotional speech databases for German, English and Polish languages are used in testing the proposed method and recognition rates as high as 82% are achieved for the recognition of emotion from speech. A subjective experiment using the German emotional speech database carried out on non-German speaker subjects indicates that the performance of the proposed system is comparable to human emotion recognition.