Amazingly realistic digital screen characters are finally here

March 21, 2013

Virtual talking head “Zoe” uses a basic set of six simulated emotions that can then be modified and combined (credit: Toshiba Cambridge Research Lab/Department of Engineering, University of Cambridge)

Meet Zoe: a digital talking head. She can express a range of human emotions on demand with “unprecedented realism” and could herald a new era of human-computer interaction, according to researchers at Toshiba’s Cambridge Research Lab and the University of Cambridge’s Department of Engineering, who created her.

Zoe, or her offspring, could be used as a visible version of Siri, as a personal assistant in smartphones, or to replace mobile phone texting with “face messaging,” in which a talking head delivers your message to friends.

The lifelike face can display emotions such as happiness, anger, and fear, and changes its voice to suit any feeling the user wants it to simulate. Users can type in any message, specifying the required emotion, and the face recites the text. According to its designers, it is the most expressive controllable avatar ever created, replicating human emotions with unprecedented realism.

To recreate her face and voice, researchers recorded British actress Zoe Lister’s speech and facial expressions.

DIY digital assistants

The framework behind “Zoe” could in the near future enable people to upload their own faces and voices to customize and personalize their own emotionally realistic digital assistants. A user could, for example, text the message “I’m going to be late” and set the emotion to “frustrated.” A friend would then receive a “face message” that looked like the sender, repeating the message in a frustrated way.

The team that created Zoe is currently looking for applications, and is also working with a school for autistic and deaf children, where the technology could be used to help pupils “read” emotions and lip-read. Ultimately, the system could have multiple uses, including gaming, audio-visual books, online lectures, and other user interfaces.

“This technology could be the start of a whole new generation of interfaces which make interacting with a computer much more like talking to another human being,” Professor Roberto Cipolla, from the Department of Engineering, University of Cambridge, said.

The program used to run Zoe is just tens of megabytes in size, which means that it can be easily incorporated into even the smallest computer devices, including tablets and smartphones.

It works by using a set of fundamental emotions. Zoe’s voice, for example, has six basic settings: Happy, Sad, Tender, Angry, Afraid and Neutral. The user can adjust these settings to different levels, as well as altering the pitch, speed and depth of the voice itself.

By combining these levels, it becomes possible to pre-set or create almost infinite emotional combinations. For instance, combining happiness with tenderness and slightly increasing the speed and depth of the voice makes it sound friendly and welcoming.

A combination of speed, anger and fear makes Zoe sound as if she is panicking. This allows for a level of emotional subtlety that, the designers say, has not been possible in onscreen characters until now.
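The blending scheme described above can be sketched in a few lines of code. This is a minimal illustration, not the actual Toshiba/Cambridge implementation: the six basic settings and the pitch, speed and depth controls come from the article, but the class, method names and clamping behavior are assumptions for the sake of the example.

```python
# Hypothetical sketch of combining Zoe's basic emotion settings.
# The six emotions and pitch/speed/depth controls are from the article;
# everything else (names, ranges, blending) is assumed for illustration.
from dataclasses import dataclass, field

BASIC_EMOTIONS = ("happy", "sad", "tender", "angry", "afraid", "neutral")


@dataclass
class VoiceSettings:
    # One adjustable level per basic emotion, each clamped to [0, 1].
    emotions: dict = field(
        default_factory=lambda: {e: 0.0 for e in BASIC_EMOTIONS}
    )
    pitch: float = 1.0  # multiplier on baseline pitch
    speed: float = 1.0  # multiplier on speaking rate
    depth: float = 1.0  # multiplier on voice depth

    def set_emotion(self, name: str, level: float) -> None:
        """Set one basic emotion's level, clamped to the [0, 1] range."""
        if name not in self.emotions:
            raise ValueError(f"unknown emotion: {name}")
        self.emotions[name] = max(0.0, min(1.0, level))


# "Friendly and welcoming": happiness plus tenderness, with the speed
# and depth of the voice slightly increased, as in the article's example.
friendly = VoiceSettings()
friendly.set_emotion("happy", 0.7)
friendly.set_emotion("tender", 0.5)
friendly.speed = 1.1
friendly.depth = 1.1

# "Panicking": anger and fear combined with high speed.
panic = VoiceSettings()
panic.set_emotion("angry", 0.6)
panic.set_emotion("afraid", 0.9)
panic.speed = 1.4
```

Because each of the six levels and the three voice controls can vary independently, even this toy model shows how a small set of dials yields a very large space of combined expressions.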

To make the system as realistic as possible, the research team trained the speech model on a dataset of thousands of sentences recorded by Lister. They also tracked Lister’s face while she was speaking, using computer vision software. The recordings were converted into voice- and face-modelling algorithms that generate the voice and image data needed to recreate expressions on a digital face, directly from text alone.

Face Works

In related news, at the annual GPU Technology Conference, NVIDIA demonstrated “Face Works,” running on its Titan graphics card, Forbes reported Wednesday.

NVIDIA is able to take 32GB of facial data (bump maps, texture maps, lighting, expressions, etc.) and compress it down to 400MB, a new approach to rendering highly realistic facial (and voice) expressions.

Applications include virtual actors for animated video and films.

Here’s the demo (forward the video to 8:37):

UPDATE: graphics card corrected 3/21/2013