How to teach machines to see

May lead to smarter future vision systems for driverless cars, smartphones, and cameras
December 22, 2015

In a test by KurzweilAI using a Google Maps image of Market Street in San Francisco, the SegNet system accurately identified the various elements, even hard-to-see pedestrians (shown in brown on the left) and road markings. (credit: KurzweilAI/Cambridge University/Google)

Two new technologies that use deep-learning techniques to help machines see and analyze images (such as roads and people) could improve visual performance for driverless cars and create a new generation of smarter smartphones and cameras.

Designed by University of Cambridge researchers, the systems can recognize their own location and surroundings. Most driverless cars currently in development use radar and LIDAR sensors, which often cost more than the car itself. (See “New laser design could dramatically shrink autonomous-vehicle 3-D laser-ranging systems” for another solution.)

One of the systems, SegNet, can identify a user’s location and orientation, including places where GPS does not function, and can identify the various components of a road scene in real time on a regular camera or smartphone (see image above or try it yourself here).

SegNet can take an image of a street scene it hasn’t seen before and classify it, sorting objects into 12 different categories — such as roads, street signs, pedestrians, buildings, and cyclists — in real time. It can deal with light, shadow, and night-time environments, and currently labels more than 90% of pixels correctly, according to the researchers. Previous systems using expensive laser- or radar-based sensors have not been able to reach this level of accuracy while operating in real time, the researchers say.
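At inference time, pixel-wise labelling of this kind comes down to picking the highest-scoring category at every pixel of the network’s output. A minimal NumPy sketch of that final step (the class list and score maps here are illustrative, not SegNet’s actual outputs):

```python
import numpy as np

# Hypothetical subset of road-scene categories; SegNet distinguishes 12.
CLASSES = ["road", "sign", "pedestrian", "building", "cyclist"]

def segment(scores):
    """scores: (num_classes, H, W) array of per-pixel class scores.
    Returns an (H, W) label map by taking the argmax over classes."""
    return np.argmax(scores, axis=0)

# Toy example: a 2x2 "image" with random scores for each class.
rng = np.random.default_rng(0)
scores = rng.random((len(CLASSES), 2, 2))
labels = segment(scores)
assert labels.shape == (2, 2)
assert int(labels.max()) < len(CLASSES)
```

The hard part, of course, is producing good score maps in the first place; that is what the trained convolutional network provides.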

To create SegNet, Cambridge undergraduate students manually labeled every pixel in each of 5000 images, with each image taking about 30 minutes to complete. Once the labeling was finished, the researchers “trained” the system, which was successfully tested on both city roads and motorways.

“It’s remarkably good at recognizing things in an image, because it’s had so much practice,” said Alex Kendall, a PhD student in the Department of Engineering. “However, there are a million knobs that we can turn to fine-tune the system so that it keeps getting better.”

SegNet was primarily trained in highway and urban environments, so it still has some learning to do for rural, snowy, or desert environments. The system is not yet at the point where it can be used to control a car or truck, but it could be used as a warning system, similar to the anti-collision technologies currently available on some passenger cars.

But teaching a machine to see is far more difficult than it sounds, said Professor Roberto Cipolla, who led the research. “There are three key technological questions that must be answered to design autonomous vehicles: where am I, what’s around me and what do I do next?”

SegNet addresses the second question. The researchers’ Visual Localization system answers the first question. Using deep learning, it can determine a user’s location and orientation from a single color image in a busy urban scene. The researchers say the system is far more accurate than GPS and works in places where GPS does not, such as indoors, in tunnels, or in cities where a reliable GPS signal is not available.

In a KurzweilAI test of the Visual Localization system (using an image in the Central Cambridge UK demo), the system accurately identified a Cambridge building, displaying the correct Google Maps street view, and marked its location on a Google map (credit: KurzweilAI/Cambridge University/Google)

It has been tested along a kilometer-long stretch of King’s Parade in central Cambridge, and it is able to determine both location and orientation within a few meters and a few degrees, which is far more accurate than GPS — a vital consideration for driverless cars, according to the researchers. (Try it here.)
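Accuracy claims like “within a few meters and a few degrees” compare an estimated pose against ground truth: a translation error (straight-line distance) and an angular error between the two orientations. A sketch of that comparison, assuming poses are given as a translation vector plus a unit quaternion (a common convention, not necessarily the researchers’ exact evaluation code):

```python
import numpy as np

def pose_error(t_pred, q_pred, t_true, q_true):
    """Return (position error, angular error in degrees) between two poses,
    each given as a translation vector and a quaternion."""
    pos_err = np.linalg.norm(np.asarray(t_pred, float) - np.asarray(t_true, float))
    # Angle between unit quaternions: theta = 2 * arccos(|<q1, q2>|)
    d = abs(np.dot(np.asarray(q_pred, float) / np.linalg.norm(q_pred),
                   np.asarray(q_true, float) / np.linalg.norm(q_true)))
    ang_err = 2.0 * np.degrees(np.arccos(np.clip(d, -1.0, 1.0)))
    return pos_err, ang_err

# Identical poses give zero error.
pos, ang = pose_error([0, 0, 0], [1, 0, 0, 0], [0, 0, 0], [1, 0, 0, 0])
```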

The localization system uses the geometry of a scene to learn its precise location, and is able to determine, for example, whether it is looking at the east or west side of a building, even if the two sides appear identical.

“In the short term, we’re more likely to see this sort of system on a domestic robot — such as a robotic vacuum cleaner, for instance,” said Cipolla. “It will take time before drivers can fully trust an autonomous car, but the more effective and accurate we can make these technologies, the closer we are to the widespread adoption of driverless cars and other types of autonomous robotics.”

The researchers are presenting details of the two technologies at the International Conference on Computer Vision in Santiago, Chile.

Cambridge University | Teaching machines to see

Abstract of PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization

We present a robust and real-time monocular six-degree-of-freedom relocalization system. Our system trains a convolutional neural network to regress the 6-DOF camera pose from a single RGB image in an end-to-end manner with no need of additional engineering or graph optimisation. The algorithm can operate indoors and outdoors in real time, taking 5 ms per frame to compute. It obtains approximately 2 m and 3° accuracy for large-scale outdoor scenes and 0.5 m and 5° accuracy indoors. This is achieved using an efficient 23-layer deep convnet, demonstrating that convnets can be used to solve complicated out-of-image-plane regression problems. This was made possible by leveraging transfer learning from large-scale classification data. We show that the PoseNet localizes from high-level features and is robust to difficult lighting, motion blur, and different camera intrinsics where point-based SIFT registration fails. Furthermore, we show how the pose feature that is produced generalizes to other scenes, allowing us to regress pose with only a few dozen training examples.
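Regressing a 6-DOF pose end-to-end means training against a loss that combines position and orientation error. A hedged sketch of such a joint loss, with a weighting factor balancing the two terms (the exact form and weighting used by PoseNet may differ; the value of beta here is illustrative):

```python
import numpy as np

def pose_regression_loss(x_pred, q_pred, x_true, q_true, beta=500.0):
    """Joint pose-regression loss: Euclidean position error plus a
    beta-weighted quaternion (orientation) error. beta is a tunable
    hyperparameter trading off position vs. orientation accuracy."""
    x_pred, x_true = np.asarray(x_pred, float), np.asarray(x_true, float)
    q_pred, q_true = np.asarray(q_pred, float), np.asarray(q_true, float)
    q_true = q_true / np.linalg.norm(q_true)  # compare against a unit quaternion
    return (np.linalg.norm(x_pred - x_true)
            + beta * np.linalg.norm(q_pred - q_true))

# A perfect prediction incurs zero loss.
assert pose_regression_loss([1, 2, 3], [1, 0, 0, 0],
                            [1, 2, 3], [1, 0, 0, 0]) == 0.0
```

In a real training setup this scalar would be minimized by backpropagation through the convnet that produces `x_pred` and `q_pred`.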