Cambridge University & Toshiba | Zoe the emotional avatar of the future

April 13, 2015

Cambridge University & Toshiba | Meet Zoe, a digital talking head that can express human emotions on demand with unprecedented realism and could herald a new era of human-computer interaction.

A virtual talking head that can express a full range of human emotions and could be used as a digital personal assistant, or to replace texting with face messaging, has been developed by researchers. The lifelike face can display emotions such as happiness, anger, and fear, and changes its voice to suit any feeling the user wants it to simulate.

Users can type in any message, specifying the requisite emotion as well, and the face recites the text. According to its designers, it is the most expressive controllable avatar ever created, replicating human emotions with unprecedented realism.

The system, called Zoe, is the result of a collaboration between researchers at Toshiba’s Cambridge Research Lab and the University of Cambridge’s Department of Engineering. Students spotted a striking resemblance between the head and the ship’s computer “Holly” in science fiction comedy Red Dwarf.

(credit: Cambridge University & Toshiba Cambridge Research Lab)

The face belongs to Zoe Lister, an actress known for playing Zoe Carpenter in the series Hollyoaks.

To recreate her face and voice, researchers spent several days recording Zoe’s speech and facial expressions.

The result is a system that is light enough to work in mobile technology.

It could also be used as a personal assistant in smartphones, or to “face message” friends.

The framework behind Zoe is a template that could enable people to upload their own faces and voices — but in a matter of seconds, rather than days.

That means in the future, users will be able to customize and personalize their own, emotionally realistic, digital assistants. For example, you could text the message “I’m going to be late” and ask it to set the emotion to “frustrated.”

Your friend would then receive a “face message” that looked like the sender, repeating the message in a frustrated way. The team who created Zoe are currently looking for applications and are working with a school for autistic and deaf children, where the technology could help pupils read emotions and lip-read.

Ultimately, the system could have multiple uses — including gaming, audio-visual books, online lectures, and other user interfaces.

“This could be the start of a new generation of interfaces that make computer interaction more like talking to a human,” said Professor Roberto Cipolla, from the Department of Engineering, Cambridge University.

“It took days to create Zoe. We had to teach the system to understand language and expression. Now that it understands those things, it shouldn’t be too hard to transfer the same blueprint to a different voice and face.”

As well as being more expressive than any previous system, Zoe is also remarkably data-light. The program used to run her is just tens of megabytes in size and can be easily incorporated into the smallest computer devices, including tablets and smartphones.

It works by using a set of fundamental, “primary color” emotions. Zoe’s voice, for example, has six basic settings — Happy, Sad, Tender, Angry, Afraid and Neutral. The user can adjust the settings to different levels, as well as altering the pitch, speed and depth of the voice itself.

By combining these levels, it becomes possible to pre-set or create almost infinite emotional combinations. For example, combining happiness with tenderness and slightly increasing the speed and depth of the voice makes it sound friendly and welcoming. A combination of speed, anger and fear makes Zoe sound as if she is panicking. This allows for a level of emotional subtlety which, the designers say, has not been possible in other avatars like Zoe until now.
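The blending described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the researchers' implementation: the class name `VoiceSettings`, the 0-to-1 level range, the numeric values in the two presets, and the choice to normalize emotion levels into weights are all assumptions made for the example. Only the six emotion names and the pitch, speed and depth controls come from the description above.

```python
from dataclasses import dataclass, field

# The six basic voice settings named in the article.
EMOTIONS = ("happy", "sad", "tender", "angry", "afraid", "neutral")


@dataclass
class VoiceSettings:
    """Hypothetical container for a blend of emotion levels and voice controls."""

    # Emotion levels, assumed here to lie in [0, 1].
    emotions: dict = field(default_factory=dict)
    # Voice adjustments relative to a default of 0.0 (assumed scale).
    pitch: float = 0.0
    speed: float = 0.0
    depth: float = 0.0

    def __post_init__(self):
        unknown = set(self.emotions) - set(EMOTIONS)
        if unknown:
            raise ValueError(f"unknown emotions: {sorted(unknown)}")
        for name, level in self.emotions.items():
            if not 0.0 <= level <= 1.0:
                raise ValueError(f"level for {name!r} must be in [0, 1]")

    def weights(self):
        """Return one weight per emotion, normalized to sum to 1.

        Normalization is an assumption of this sketch. With no emotion
        set, the blend falls back to fully neutral.
        """
        total = sum(self.emotions.values())
        if total == 0:
            return {n: (1.0 if n == "neutral" else 0.0) for n in EMOTIONS}
        return {n: self.emotions.get(n, 0.0) / total for n in EMOTIONS}


# Presets loosely following the two examples in the text; the numbers are invented.
friendly = VoiceSettings(emotions={"happy": 0.5, "tender": 0.5}, speed=0.1, depth=0.1)
panicking = VoiceSettings(emotions={"angry": 0.5, "afraid": 0.5}, speed=0.5)

if __name__ == "__main__":
    print(friendly.weights())
    print(panicking.weights())
```

Keeping the emotion levels separate from pitch, speed and depth mirrors the way the settings are described: a small set of basic emotions that can be mixed, plus independent controls for the voice itself.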

To make the system as realistic as possible, the research team collected a data set of thousands of sentences, which they used to train the speech model with the help of real-life actress Zoe Lister. They also tracked Lister’s face while she was speaking using computer vision software. This was converted into voice- and face-modelling mathematical algorithms, which gave them the voice and image data they needed to recreate expressions on a digital face directly from the text alone.

The effectiveness of the system was tested with volunteers via a crowdsourcing website. The participants were each given either a video or audio clip of a single sentence from the test set, and asked to identify which of the six basic emotions it was replicating. Ten sentences were evaluated, each by 20 different people.

Volunteers who only had video and no sound successfully recognized the emotion in 52% of cases. When they only had audio, the success rate was 68%. The two together, however, produced a successful recognition rate of 77%, slightly higher than the recognition rate for the real-life Zoe, which was 73%.
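To put these figures in context, it helps to compare them with the rate a volunteer would achieve by guessing. The short Python sketch below does that arithmetic. The recognition rates, the six emotions, and the ten sentences rated by 20 people each are taken from the text; the assumption that a guess is uniform across the six emotions, and the names used in the code, are illustrative only. How the evaluations were divided between conditions is not stated, so the sketch does not try to recover per-condition counts.

```python
# Recognition rates reported in the article, as fractions.
REPORTED_RATES = {
    "video only": 0.52,
    "audio only": 0.68,
    "audio and video": 0.77,
    "real-life Zoe": 0.73,
}

NUM_EMOTIONS = 6             # six basic emotions to choose from
CHANCE = 1 / NUM_EMOTIONS    # assumed uniform guessing, about 16.7%

# Ten sentences, each evaluated by 20 different people.
EVALUATIONS = 10 * 20


def times_chance(rate):
    """Return how many times better than uniform guessing a rate is."""
    return rate / CHANCE


if __name__ == "__main__":
    print(f"chance level: {CHANCE:.1%}")
    print(f"evaluations: {EVALUATIONS}")
    for condition, rate in REPORTED_RATES.items():
        print(f"{condition}: {rate:.0%} ({times_chance(rate):.1f}x chance)")
```

Under the uniform-guessing assumption, every reported rate sits well above chance, including the video-only condition.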

This higher rate of success compared with real life is probably because the synthetic talking head is deliberately more stylized in its manner. As well as finding applications for their new creation, the research team will now work on creating a version of the system which can be personalized by users themselves.

“Present-day human-computer interaction still revolves around typing at a keyboard or moving and pointing with a mouse,” Cipolla added. “For a lot of people, that makes computers difficult and frustrating to use. In the future, we will be able to open up computing to far more people if they can speak and gesture to machines in a more natural way.

“That is why we created Zoe: a more expressive, emotionally responsive face that human beings can actually have a conversation with.”

