How to create a synthesized actor performance in post-production

More Disney Research magic
December 11, 2015

Given a pair of facial performances from two takes (arranged along the horizontal and vertical axes, left), a new performance (film strip, right) can be blended (credit: Charles Malleson et al./Disney Research)

Disney Research has devised a way to blend an actor’s facial performances from just a few takes, allowing a director to get exactly the right emotion without re-shooting the scene multiple times.

“It’s not unheard of for a director to re-shoot a crucial scene dozens of times, even 100 or more times, until satisfied,” said Markus Gross, vice president of research at Disney Research. “That not only takes a lot of time — it also can be quite expensive. Now our research team has shown that a director can exert control over an actor’s performance after the shoot with just a few takes, saving both time and money.”

And the work can be done in post-production, rather than on an expensive film set.

How FaceDirector works

Developed jointly with the University of Surrey, the system, called FaceDirector, works with normal 2D video input acquired by standard cameras, without the need for additional hardware or 3D face reconstruction.

“The central challenge for combining an actor’s performances from separate takes is video synchronization,” said Jean-Charles Bazin, associate research scientist at Disney Research. “But differences in head pose, emotion, expression intensity, as well as pitch accentuation and even the wording of the speech, are just a few of many difficulties in syncing video takes.”

The system analyzes both facial expressions and audio cues and uses a graph-based framework to identify corresponding frames between the takes. Once the takes are synchronized, a director can control the performance by choosing the desired facial expressions and timing from either video; the selected frames are then blended together using facial landmarks, optical flow, and compositing.
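The article doesn't spell out the graph formulation, but the core idea, finding a least-cost path of frame correspondences through a grid of pairwise audio/visual distances, can be sketched with ordinary dynamic time warping. In the minimal sketch below, the per-frame audio and facial-landmark features, the frame_cost helper, and the equal cue weighting are illustrative assumptions, not the published method:

```python
import numpy as np

def frame_cost(audio_a, audio_b, face_a, face_b, w_audio=0.5):
    """Pairwise frame-distance matrix mixing audio and facial cues."""
    d_audio = np.linalg.norm(audio_a[:, None, :] - audio_b[None, :, :], axis=-1)
    d_face = np.linalg.norm(face_a[:, None, :] - face_b[None, :, :], axis=-1)
    # Normalize each cue so neither dominates, then blend them.
    d_audio /= d_audio.max() + 1e-8
    d_face /= d_face.max() + 1e-8
    return w_audio * d_audio + (1.0 - w_audio) * d_face

def synchronize(cost):
    """Least-cost monotonic path through the cost grid (dynamic time warping)."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # advance both takes
                acc[i - 1, j],      # advance take A, hold take B's frame
                acc[i, j - 1])      # advance take B, hold take A's frame
    # Backtrack to recover frame-to-frame correspondences.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]  # list of (frame index in take A, frame index in take B)
```

Calling synchronize(frame_cost(...)) returns a monotonic mapping between the two takes' frames, which is the kind of dense temporal correspondence the blending stage needs.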


Disney Research | FaceDirector: Continuous Control of Facial Performance in Video

To test the system, actors performed several lines of dialog, repeating each performance to convey different emotions: happiness, sadness, excitement, fear, anger, etc. The line readings were captured in HD using standard compact cameras. The system synchronized the videos automatically and in real time on a standard desktop computer, and users could generate novel versions of the performances by interactively blending the video takes.
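As a rough illustration of that interactive blending, the sketch below cross-dissolves two synchronized frames after warping each toward an intermediate pose with dense optical flow. The OpenCV calls and the single blend weight t are assumptions for illustration, not FaceDirector's landmark-driven compositing pipeline:

```python
import cv2
import numpy as np

def blend_frames(frame_a, frame_b, t=0.5):
    """Blend two synchronized frames: t=0 returns take A, t=1 returns take B."""
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
    # Dense optical flow in both directions (Farneback's method).
    flow_ab = cv2.calcOpticalFlowFarneback(gray_a, gray_b, None,
                                           0.5, 3, 15, 3, 5, 1.2, 0)
    flow_ba = cv2.calcOpticalFlowFarneback(gray_b, gray_a, None,
                                           0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = gray_a.shape
    grid_x, grid_y = np.meshgrid(np.arange(w, dtype=np.float32),
                                 np.arange(h, dtype=np.float32))
    # Backward-warp each frame partway toward the intermediate pose
    # (a standard frame-interpolation approximation).
    warped_a = cv2.remap(frame_a,
                         grid_x - t * flow_ab[..., 0],
                         grid_y - t * flow_ab[..., 1], cv2.INTER_LINEAR)
    warped_b = cv2.remap(frame_b,
                         grid_x - (1.0 - t) * flow_ba[..., 0],
                         grid_y - (1.0 - t) * flow_ba[..., 1], cv2.INTER_LINEAR)
    # Cross-dissolve the two pose-aligned frames.
    return cv2.addWeighted(warped_a, 1.0 - t, warped_b, t, 0)
```

Sweeping t from 0 to 1 over a sequence of synchronized frame pairs gives the kind of continuous transition between takes that the researchers describe.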

Multiple uses

The researchers showed how the system could be used for a variety of purposes, including generating multiple performances from just a few video takes (for reuse elsewhere in the video), correcting and editing the script, and switching between voices (for example, to create an entertaining performance by pairing a sad voice with a happy face).

Speculation: it might also be possible to use this technique to create a fake video in which a person’s facial expressions are recombined with audio clips, making the person appear to display inappropriate emotions.

The researchers will present their findings at ICCV 2015, the International Conference on Computer Vision, Dec. 11–18, in Santiago, Chile.


Abstract of FaceDirector: Continuous Control of Facial Performance in Video

We present a method to continuously blend between multiple facial performances of an actor, which can contain different facial expressions or emotional states. As an example, given sad and angry video takes of a scene, our method empowers a movie director to specify arbitrary weighted combinations and smooth transitions between the two takes in post-production. Our contributions include (1) a robust nonlinear audio-visual synchronization technique that exploits complementary properties of audio and visual cues to automatically determine robust, dense spatio-temporal correspondences between takes, and (2) a seamless facial blending approach that provides the director full control to interpolate timing, facial expression, and local appearance, in order to generate novel performances after filming. In contrast to most previous works, our approach operates entirely in image space, avoiding the need of 3D facial reconstruction. We demonstrate that our method can synthesize visually believable performances with applications in emotion transition, performance correction, and timing control.