Amazingly realistic digital screen characters are finally here
March 21, 2013

Virtual talking head “Zoe” uses a basic set of six simulated emotions that can then be modified and combined (credit: Toshiba Cambridge Research Lab/Department of Engineering, University of Cambridge)
Meet Zoe: a digital talking head. She can express a range of human emotions on demand with “unprecedented realism” and could herald a new era of human-computer interaction, according to researchers at Toshiba’s Cambridge Research Lab and the University of Cambridge’s Department of Engineering, who created her.
Zoe, or her offspring, could be used as a visible version of Siri, as a personal assistant in smartphones, or to replace mobile phone texting with “face messaging” in which you “face-message” friends.
The lifelike face can display emotions such as happiness, anger, and fear, and changes its voice to suit any feeling the user wants it to simulate. Users can type in any message, specifying the required emotion, and the face recites the text. According to its designers, it is the most expressive controllable avatar ever created, replicating human emotions with unprecedented realism.
To recreate her face and voice, researchers recorded British actress Zoe Lister’s speech and facial expressions.
DIY digital assistants
The framework behind “Zoe” could in the near future enable people to upload their own faces and voices to customize and personalize their own emotionally realistic, digital assistants. A user could, for example, text the message “I’m going to be late” and set her emotion to “frustrated.” A friend would then receive a “face message” that looked like the sender, repeating the message in a frustrated way.
The team that created Zoe is currently looking for applications, and is also working with a school for autistic and deaf children, where the technology could be used to help pupils to “read” emotions and lip-read. Ultimately, the system could have multiple uses, including gaming, audio-visual books, online lectures, and other user interfaces.
“This technology could be the start of a whole new generation of interfaces which make interacting with a computer much more like talking to another human being,” Professor Roberto Cipolla, from the Department of Engineering, University of Cambridge, said.
The program used to run Zoe is just tens of megabytes in size, which means that it can be easily incorporated into even the smallest computer devices, including tablets and smartphones.
It works by using a set of fundamental emotions. Zoe’s voice, for example, has six basic settings: Happy, Sad, Tender, Angry, Afraid and Neutral. The user can adjust these settings to different levels, as well as altering the pitch, speed and depth of the voice itself.
By combining these levels, it becomes possible to pre-set or create almost infinite emotional combinations. For instance, combining happiness with tenderness and slightly increasing the speed and depth of the voice makes it sound friendly and welcoming.
A combination of speed, anger and fear makes Zoe sound as if she is panicking. This allows for a level of emotional subtlety which, the designers say, has not been possible in other characters like Zoe until now.
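The mixing idea can be pictured with a short sketch. This is not the researchers' code: the class name, the 0-to-1 level scale, and every numeric value below are assumptions made purely for illustration. Only the six emotion names and the pitch, speed and depth controls come from the description above.

```python
from dataclasses import dataclass, field

# The six basic voice settings named in the article.
EMOTIONS = ("happy", "sad", "tender", "angry", "afraid", "neutral")


@dataclass
class VoiceSettings:
    """Hypothetical container for one blended voice preset (not Zoe's real API)."""
    emotions: dict = field(default_factory=dict)  # emotion name -> level in [0, 1] (assumed scale)
    pitch: float = 1.0   # 1.0 means "unchanged" (assumed convention)
    speed: float = 1.0
    depth: float = 1.0

    def normalized(self):
        """Return emotion weights that sum to 1; fall back to neutral if all are 0."""
        for name, level in self.emotions.items():
            if name not in EMOTIONS:
                raise ValueError(f"unknown emotion: {name}")
            if not 0.0 <= level <= 1.0:
                raise ValueError(f"level for {name} must be in [0, 1]")
        total = sum(self.emotions.values())
        if total == 0:
            return {"neutral": 1.0}
        return {name: level / total
                for name, level in self.emotions.items() if level > 0}


# "Friendly and welcoming": happiness plus tenderness, slightly faster and deeper.
# The specific levels are illustrative guesses, not published values.
friendly = VoiceSettings(emotions={"happy": 0.6, "tender": 0.4}, speed=1.1, depth=1.1)

# "Panicking": anger plus fear, spoken quickly (again, illustrative values).
panicking = VoiceSettings(emotions={"angry": 0.5, "afraid": 0.5}, speed=1.4)

print(friendly.normalized())
print(panicking.normalized())
```

Normalizing the levels into weights is one simple way to blend a handful of base settings into a continuous range of presets. The actual system may combine them quite differently.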
To make the system as realistic as possible, the research team recorded Lister speaking thousands of sentences and used the resulting dataset to train the speech model. They also tracked Lister’s face while she was speaking, using computer vision software. The recordings were converted into voice and face-modelling algorithms that provide the voice and image data needed to recreate expressions on a digital face directly from text alone.
Face Works
In related news, at the annual GPU Technology Conference, NVIDIA demonstrated “Face Works,” running on their Titan graphics card, Forbes reported Wednesday.
NVIDIA is able to take 32GB of facial data (the bump maps, texture maps, lighting, expressions, etc.) and compress it down to 400MB, in a new way of rendering highly realistic facial (and voice) expressions.
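For a sense of scale, the reported figures imply a compression ratio of roughly 80 to 1. The short calculation below uses only the two sizes quoted above; treating 1 GB as 1000 MB is an assumption, and binary units would give a slightly higher ratio.

```python
# Sizes as reported in the article; decimal units (1 GB = 1000 MB) are an assumption.
raw_mb = 32 * 1000      # 32 GB of facial data
compressed_mb = 400     # size after compression

ratio = raw_mb / compressed_mb
print(f"compression ratio: about {ratio:.0f}:1")
```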
Applications include virtual actors for animated video and films.
Here’s the demo (forward the video to 8:37):
UPDATE: graphics card corrected 3/21/2013
Comments (29)
by Alejandra Lopez
This is amazing! How technology is advancing! Although the voice and the tongue of the woman didn’t look very realistic.
by Diego
Great video, but for my level of specialty in graphics it’s a little low quality. Great job for beginners.
by Diego
you are so right! -ale
by Cru
Captain Planet.
by Cybernettr
“Show me white guy dancing…Only all the Asians laughed.”
Good technology, but not sure about the tastefulness of that joke…
by Blake Senftner
Sounds like they have a few years to go before they catch up to us: http://www.3D-Avatar-Store.com. We perform photo-realistic 3D Reconstruction given as little as one photo of a person. Given more photos and our reconstruction quality improves. We’re neural net driven, so we’re super fast: original reconstructions in under a second, and refinements require 40% less time. Multiple geometry outputs, including Maya performance-animation rigs, full body skin-tone corrected texture maps, perspective distortion correction and more.
by GeorgeV
There’s also the stuff from OTOY:
http://www.brightsideofnews.com/news/2013/1/24/has-otoy-bridged-the-gap-between-cgi-and-reality.aspx
http://www.fxguide.com/featured/octane-render-realtime-ray-tracing/
by H.K. Fauskanger
So soon we can communicate via an avatar of ourselves that always has perfect hair, perfect make-up and never ages. Some people may come to prefer ALWAYS communicating via a screen, so that they never have to expose their true looks. I see a new syndrome coming up …
by Ian
Amazing stuff. Voices would still appear to be a problem though. – when are they going to crack a realistic voice emulator?
by Allan Carson
Fantastic – well done guys. This is how we will communicate our memories to our Great Grand children. Digital rendition is the new face of machine to human communication and pretty soon we will choose our digital friends, assistants and maybe even lovers.
by melajara
I wonder if the underlying polygon mesh is based on real anatomical consideration of all the muscles of the face.
Nature’s solution is key here for realistic rendering.
A principled approach would be to record muscle activity (extension, contraction, tensile strength, timing) in each involved emotion, the effect (mostly torsion) on enclosing tissues (bones as sticky part below, then fat, dermis, skin) and combine the local results up to a complete face.
IMHO, this would give us a more faithful rendering of human faces than a pure engineered (i.e. VR) approach.
by Craig
These guys must have watched too much Red Dwarf.
by Vin
Yes, I was thinking there must be a slider for ‘Hollyness’ and one for ‘Max-ine Headroom’ in combination. It’s not bad, but they need to work on the interpolation to make the samples presented smoother. Salespersons or politicians or anyone who wants to control their non-verbal face/voice cues to sell something (I wonder what will happen to the Facial Action Coding System?) will love this as their presentation interface. Actually, it might save a ton on makeup overheads and having to shave etc. Those industries probably ought to start worrying.
Also, imagine all the remakes of old movies and tv with actors as they were when they made them back again today? Could be a revival of cult tv shows of the past!
by Bruce Thomson
Quantic Dream’s Kara (see video on YouTube) is better than this.
by andmar74
Her tongue looks fake. Not very realistic, and of course the voice is horrible.
by Walter Baltzley
It would appear that we have finally crossed the “Uncanny Valley”…she seems far less creepy than other digital avatars I have encountered…now all they have to do is get the voice right. However, it does make one wonder how this technology might be abused.
by Frank
Lookout Hollywood! Enter, the AviStar!
by Frank
Oops! Make that, AvaStar.
by Bob Vasquez
Pretty neat stuff but, she’s not my type.
by GatorALLin
The eyes and the teeth seemed off the most in this model. There was an obvious skip when they first started to switch to the live input. It would have been interesting to let the audience ask a question vs. see only a pre-rendered reply. I hope they can fix the teeth in these models in the future. The teeth to me are often the creepiest parts; the eyes are second (if done where they are foggy or not realistic). Still, this is amazing progress, but I am not sure I agree with the speaker that you want to use this wonderful technology for teleconferencing and use an avatar vs. yourself. Maybe for some training videos, but even that seems a stretch.
by anon
Makes one wonder why is it that the ethnic minorities are always simulated first?
by de Broglie
What do you mean? The faerie looks white and I guess Ira is jewish. I am guessing you are being sarcastic. I think they simulate caucasians because the light skin makes the face more defined. In addition, caucasians have more prominent facial features, so expressions are more prominent and interesting.
by Vin
Or, it’s just easier for people to empathize with people of their own kind. Since expression is also a function of the brain, and the brain is independent of looks, the definition of ‘own kind’ becomes less apparent and more diverse.
by EvilRobots
It looks like all they did was apply a picture of a real woman onto a very low poly model. I’m honestly not impressed by that aspect at all. It is ‘lifelike’ish because it is a picture of a woman! haha.
The emotional voice is interesting. I must say though, ‘she’ is kind of terrifying. The uncanny valley is brutal here, and she is talking like a nightmare evil robot lady from any number of bad Hollywood films, hahaha.
I’m sure they’ll get there eventually.
by Pete
Wonderful. This may be useful for synthesizing very convincing fictional videos.
by Oneironaut
Correction to the article: Titan’s memory bandwidth is 288.4 GB/s. The 1 TB/s bandwidth is for Volta, which uses DRAM stacking and will be released in 2016+.
by Editor
Corrected, thanks.
by Oneironaut
It’s still not correct, though…
“NVIDIA demonstrated “Face Works,” a technology that will be made possible by a new graphics card that is capable of 1TB/s of memory bandwidth”
The demo is running in real-time on Titan, not the upcoming Volta. So it’s 288.4 GB/s bandwidth, like I said. I hope this was helpful.
by Editor
Fixed, thanks.