The technology of Ramona

February 21, 2001

Ramona is “the world’s first live virtual performing and musical recording artist,” according to her developer and inventor Ray Kurzweil.

Kurzweil’s inventions have dramatically extended human senses and enhanced artistic expression. They include firsts in optical character recognition of multiple fonts, the CCD scanner, speech recognition and synthesis devices, and the revolutionary grand-piano-quality music synthesizer.

Ramona is Kurzweil’s newest invention. “Virtual actors (‘vactors’) have performed live on stage before,” he says, “but this is the first time a singer and dancer have been transformed into a virtual person in real time.” To accomplish this feat, he assembled a team of engineers, sound and video technicians, dancers and musicians, plus an array of sophisticated computer graphics, video and sound systems. In performance, Kurzweil and a dancer (his daughter Amy in the TED11 performance) are wired up with motion-capture sensors that transmit their movements to a motion-capture system.

Illustration by Thomas Reis

This motion-capture data is sent to an array of computers that generate 3-D moving images of Ramona and the dancers (including TED producer Richard Saul Wurman in the TED11 performance) in real time.
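The core of that real-time step is turning each frame of joint rotations into character poses. A minimal forward-kinematics sketch of the idea (the joint names, bone lengths, and 2-D simplification here are illustrative assumptions, not details of Ramona's actual system):

```python
import math

# Each mocap frame supplies one rotation angle per joint; we compute
# 2-D world positions for a three-joint "arm" by walking down the chain.
# Bone names and lengths are hypothetical.
BONES = [("shoulder", 1.0), ("elbow", 0.8), ("wrist", 0.4)]

def pose_skeleton(frame_angles):
    """frame_angles: dict of joint name -> rotation in radians.
    Returns the world-space (x, y) position of each joint end."""
    x, y, heading = 0.0, 0.0, 0.0
    positions = []
    for name, length in BONES:
        heading += frame_angles[name]   # rotations accumulate down the chain
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        positions.append((name, (x, y)))
    return positions

# One captured "frame": the arm fully extended along the x-axis.
frame = {"shoulder": 0.0, "elbow": 0.0, "wrist": 0.0}
for name, (x, y) in pose_skeleton(frame):
    print(f"{name}: ({x:.2f}, {y:.2f})")
```

A production system does this in 3-D with full rotation matrices, for dozens of joints, sixty times a second.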

Meanwhile, Kurzweil’s voice is computer-processed to shift its gender. Reverb is added, and the result is mixed with sound from the musicians and sent to the speakers. At the same time, phonemes (basic speech sounds) are extracted and used to control Ramona’s lip movements and facial expressions.
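The lip-sync step amounts to mapping each recognized phoneme to a mouth shape (a “viseme”). A hedged sketch of the idea, with an illustrative phoneme set and shape names that are assumptions rather than Kurzweil's actual tables:

```python
# Map recognized phonemes to mouth shapes ("visemes") that drive the
# face model. The phoneme symbols and shape names are hypothetical.
VISEMES = {
    "AA": "open",    # as in "father"
    "IY": "smile",   # as in "see"
    "UW": "round",   # as in "you"
    "M":  "closed",  # bilabials close the lips
    "B":  "closed",
    "P":  "closed",
    "F":  "teeth",   # labiodentals touch teeth to lip
    "V":  "teeth",
}

def mouth_shapes(phoneme_stream):
    """Convert a stream of recognized phonemes into mouth shapes,
    falling back to a neutral shape for anything unmapped."""
    return [VISEMES.get(p, "neutral") for p in phoneme_stream]

print(mouth_shapes(["M", "AA", "M", "AA"]))  # ['closed', 'open', 'closed', 'open']
```

In a live system, each shape would be cross-faded into the face model over a few video frames rather than switched abruptly.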

Ramona’s final image is then rendered and converted to video, combined with video backgrounds to create a music-video effect, and displayed to the audience on a video projector.

Creating the Model

But this raises the question: How did Ramona’s image get created in the first place? Weeks before the performance, Ramona began her incarnation as a set of 3-D laser scans of her face and body. These scans produced a “point cloud” of colored dots representing her basic shape.

The point cloud data was then tessellated (points joined to form polygons, or faceted surfaces) and missing data filled in. For the face scan, mouth and facial movements were defined for each phoneme and a set of facial expressions was developed.
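Tessellation joins neighboring scan points into triangles. A simplified sketch, under the assumption that the scanner's points lie on a regular rows-by-cols grid (real scan data is irregular and needs full surface reconstruction):

```python
# Split each cell of a regular rows x cols point grid into two triangles.
# Points are indexed row-major; this grid assumption is a simplification.
def tessellate_grid(rows, cols):
    """Return triangles as index triples into a row-major point list."""
    triangles = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            i = r * cols + c  # top-left corner of the cell
            triangles.append((i, i + 1, i + cols))             # upper triangle
            triangles.append((i + 1, i + cols + 1, i + cols))  # lower triangle
    return triangles

tris = tessellate_grid(3, 3)
print(len(tris))  # a 3x3 grid of points yields 8 triangles
```

The “missing data filled in” step corresponds to patching holes where the laser got no return, for instance under the chin or in hair.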

Both models (body and face) were then integrated into a rough 3-D model.

This model was then enhanced for realistic movement (using rigging and weighting) and for realistic skin and clothing (using texturing and deformation curves). The final model was loaded onto computers for real-time image rendering during the performance.
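Rigging and weighting tie each vertex of the mesh to one or more skeleton bones. A minimal sketch of linear blend skinning, one common weighting technique (whether Ramona’s team used exactly this method is an assumption; rotations are omitted for brevity):

```python
# Linear blend skinning sketch: each vertex is moved by a weighted blend
# of its bones' transforms. Transforms are reduced to 2-D translations
# here for clarity; a real rig uses full 3-D rotation matrices.
def skin_vertex(vertex, bone_transforms, weights):
    """vertex: (x, y); bone_transforms: list of (dx, dy) translations;
    weights: one weight per bone, summing to 1.
    Returns the deformed vertex position."""
    x, y = vertex
    out_x = sum(w * (x + dx) for w, (dx, _) in zip(weights, bone_transforms))
    out_y = sum(w * (y + dy) for w, (_, dy) in zip(weights, bone_transforms))
    return (out_x, out_y)

# A vertex near the elbow, influenced 50/50 by an upper-arm bone that
# stays put and a forearm bone that lifts by one unit.
v = skin_vertex((1.0, 0.0), [(0.0, 0.0), (0.0, 1.0)], [0.5, 0.5])
print(v)  # (1.0, 0.5)
```

The weights are what make joints bend smoothly: a vertex weighted entirely to one bone follows it rigidly, while shared weights let the skin stretch between bones.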