An algorithm for speech-based emotion classification
December 5, 2012

MATLAB GUI prototype for speech signal input and predicted emotion output on a valence-arousal coordinate (credit: University of Rochester)
University of Rochester engineers have developed a computer program that gauges human feelings by analyzing 12 features of speech, such as pitch and volume, to identify one of six emotions from a sound recording with 81 percent accuracy.
The program has been used to develop a prototype of an app that displays either a happy or sad face after it records and analyzes the user’s voice. The app was built by one of Heinzelman’s graduate students, Na Yang, during a summer internship at Microsoft Research.
“The research is still in its early days,” said Wendi Heinzelman, professor of electrical and computer engineering, “but it is easy to envision a more complex app that could use this technology for everything from adjusting the colors displayed on your mobile to playing music fitting to how you’re feeling after recording your voice.”
Rochester psychologist Melissa Sturge-Apple, who is using the program in research on parent-teen communication, explained that emotion affects the way people speak by altering the volume, pitch, and even the harmonics of their speech.
The researchers established 12 specific features in speech that were measured in each recording at short intervals. The researchers then categorized each of the recordings and used them to teach the computer program what “sad,” “happy,” “fearful,” “disgusted,” or “neutral” sound like.
The system then analyzed new recordings and tried to determine whether the voice in the recording portrayed any of the known emotions. If the computer program was unable to decide between two or more emotions, it just left that recording unclassified.
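The article does not say which classifier the researchers used, so the following is only a minimal sketch of the general idea: learn from labeled recordings, then leave a new recording unclassified when two emotions are too close to call. The nearest-centroid method, the two-dimensional feature vectors, and the `min_margin` threshold are all illustrative assumptions, not details from the Rochester system.

```python
import math


def train_centroids(samples):
    """Average the feature vectors of each emotion label.

    samples: list of (feature_vector, label) pairs.
    Returns a dict mapping label -> mean feature vector.
    """
    sums, counts = {}, {}
    for features, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def classify(features, centroids, min_margin=0.5):
    """Return the nearest emotion label, or None (unclassified).

    min_margin is a hypothetical threshold: if the two closest
    centroids are nearly equidistant, no decision is made.
    """
    distances = sorted(
        (math.dist(features, centroid), label)
        for label, centroid in centroids.items()
    )
    if len(distances) > 1 and distances[1][0] - distances[0][0] < min_margin:
        return None  # cannot decide between two or more emotions
    return distances[0][1]


# Toy data: two made-up features per recording.
training = [
    ([1.0, 0.0], "happy"),
    ([1.0, 0.2], "happy"),
    ([-1.0, 0.0], "sad"),
    ([-1.0, -0.2], "sad"),
]
centroids = train_centroids(training)
clear_case = classify([0.9, 0.1], centroids)
ambiguous_case = classify([0.0, 0.0], centroids)
```

In this toy example the first recording sits close to the "happy" examples and is labeled accordingly, while the second lies midway between the two groups and is left unclassified.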
Previous research has shown that emotion classification systems are highly speaker-dependent; they work much better if the system is trained by the same voice it will analyze. The new results confirm this finding: when the speech-based emotion classification was used on a voice different from the one that trained the system, the accuracy dropped from 81 percent to about 30 percent.
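One common way to measure this kind of speaker dependence is to hold out every recording from one speaker, train on the rest, and test only on the held-out voice. The sketch below shows such a split; the record layout (`speaker`, `features`, `label`) and the function name are hypothetical, and the article does not describe how the researchers organized their own evaluation.

```python
def leave_one_speaker_out(recordings, held_out_speaker):
    """Split recordings so the test voice never appears in training.

    recordings: list of dicts with hypothetical keys
    "speaker", "features", and "label".
    """
    train = [r for r in recordings if r["speaker"] != held_out_speaker]
    test = [r for r in recordings if r["speaker"] == held_out_speaker]
    return train, test


# Toy recordings from two made-up speakers.
recordings = [
    {"speaker": "A", "features": [0.9, 0.1], "label": "happy"},
    {"speaker": "A", "features": [-0.8, 0.0], "label": "sad"},
    {"speaker": "B", "features": [0.4, 0.3], "label": "happy"},
]
train_set, test_set = leave_one_speaker_out(recordings, "B")
```

Comparing accuracy on a split like this with accuracy when the same speaker appears in both sets would expose a gap of the kind the article reports.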
The researchers are now looking at ways of minimizing this effect, for example, by training the system with a voice in the same age group and of the same gender. As Heinzelman said, “there are still challenges to be resolved if we want to use this system in an environment resembling a real-life situation, but we do know that the algorithm we developed is more effective than previous attempts.”
More information on the project is available here.
Once the algorithm is further developed (so it’s more speaker-independent and operates in real time), perhaps this system could do the impossible: improve automated customer service? For example, an angry customer could be routed to a supervisor. Also, how about combining this data with electrodermal-response data from Affectiva’s Q Sensor, which uses the same two dimensions of emotion (arousal, from calm to excited, and valence, from negative to positive)?
Comments (4)
by Arctic Poppy
Imagine using a conflicting accent and cadence. Speaking Spanish w/an American accent. Good luck with that one.
by Bri
I’m surprised that they make no mention of the changes in speed and cadence. It would be interesting to see how well this algorithm correlates the fluctuation of pitch and emphasis with the sentences’ actual meanings. Much of the effect is in subtle variations in meaning, sarcasm being the most obvious. I find this problem all the time while writing things. There is so much meaning in how we speak the words. Trying to convey this layer of meaning, we often use clunky modifiers such as LOL; otherwise the words can easily be misinterpreted.
by GatorALLin
…very good points…
by Gorden Russell
Right Bri, and as for “…improve automated customer service?” I certainly hope this tech gets into the system that answers the phone for the local Regal Theaters. I am so tired of the robot telling me that it does not understand me. I repeat myself speaking real slow and clear, enunciating perfectly, yet the damn robot hangs up on me. I just don’t go to the movies anymore.
My wife and I just wait for the movie to come out on DVD.
And as for sarcasm, Bri, I can see it going right over the heads of these robots for some time to come. Can you see a health care robot giving a sponge bath to somebody bedridden with arthritis? When the person complains angrily, “Oh you’re doing such a damn good job!” the robot will just answer, “Oh thank you for the wonderful compliment!”