A computer-vision algorithm that can describe photos

Machine learning takes computer vision to the next level with a system that can describe objects and put them into context. Coming soon: better visual search?
November 19, 2014

A Stanford machine-learning algorithm generated captions for these images as follows: “baseball player is throwing ball in game”; “woman is holding bunch of bananas”; “surfer is riding wave on his surfboard”; “little girl is eating piece of cake”; “dog is playing frisbee in field” (credit: Creative Commons images from the yfcc100m dataset, http://labs.yahoo.com/news/yfcc100m/)

Computer software only recently became smart enough to recognize objects in photographs. Now, Stanford researchers using machine learning have created a system that takes the next step, writing a simple story of what’s actually happening in any digital image.

“The system can analyze an unknown image and explain it in words and phrases that make sense,” said Fei-Fei Li, a professor of computer science and director of the Stanford Artificial Intelligence Lab.

“This is an important milestone,” Li said. “It’s the first time we’ve had a computer vision system that could tell a basic story about an unknown image by identifying discrete objects and also putting them into some context.”

At the heart of the Stanford system are algorithms that improve their accuracy by scanning scene after scene, looking for patterns, and then using the accumulation of previously described scenes to extrapolate what is depicted in the next unknown image.
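To make that accumulate-and-extrapolate idea concrete, here is a minimal sketch in Python: scenes the system has already described are stored as feature vectors paired with captions, and an unknown image borrows the caption of its nearest stored scene. The function names and toy four-number feature vectors are invented for illustration; this is not the Stanford team’s actual model.

```python
# A minimal sketch of "learn from accumulated scenes", not the authors' model:
# store feature vectors of described scenes, then caption a new scene by
# reusing the caption of its nearest neighbor. All values are toy examples.
import numpy as np

described_scenes = []  # list of (feature_vector, caption) pairs seen so far

def remember(features, caption):
    """Accumulate a described scene for later extrapolation."""
    described_scenes.append((np.asarray(features, dtype=float), caption))

def describe(features):
    """Extrapolate a caption for an unknown image from the closest known scene."""
    query = np.asarray(features, dtype=float)
    distances = [np.linalg.norm(query - vec) for vec, _ in described_scenes]
    return described_scenes[int(np.argmin(distances))][1]

remember([0.9, 0.1, 0.0, 0.2], "dog is playing frisbee in field")
remember([0.1, 0.8, 0.7, 0.0], "surfer is riding wave on his surfboard")
print(describe([0.85, 0.15, 0.1, 0.1]))  # -> "dog is playing frisbee in field"
```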

“It’s almost like the way a baby learns,” Li said.

She and her collaborators, including Andrej Karpathy, a graduate student in computer science, describe their approach in a paper submitted in advance of a forthcoming conference on cutting-edge research in computer vision.

Eventually these advances could lead to robotic systems that can navigate unknown situations. In the near term, machine-based systems that can discern the story in a picture promise to enable people to search photo or video archives and find specific images.

“Most of the traffic on the Internet is visual data files, and this might as well be dark matter as far as current search tools are concerned,” Li said. “Computer vision seeks to illuminate that dark matter.”

These findings are based on two years of effort that grew out of research Li has been pursuing for a decade. Her work builds on advances that have come, slowly at times, in the nearly 50 years since MIT scientist Seymour Papert convened a “summer project” to create computer vision in 1966.

It took researchers 20 years to create systems that could take the relatively simple first step of recognizing discrete objects in photographs.

Machine learning algorithms

More recently the emergence of the Internet has helped to propel computer vision. On one hand, the growth of photo and video uploads has created a demand for tools to sort, search and sift visual information. On the other, sophisticated algorithms running on powerful computers have led to electronic systems that can train themselves by performing repetitive tasks, improving as they go.

Computer scientists call this machine learning, and Li likened it to the way a child learns soccer by getting out and kicking the ball. A coach might demonstrate how to kick and comment on the child’s technique, but improvement occurs from within as the child’s eyes, brain, nerves and muscles make tiny adjustments.

Li’s latest algorithms incorporate work that her researchers and others have done. This includes training their system on a visual dictionary, using a database of more than 14 million objects. Each object is described by a mathematical term, or vector, that enables the machine to recognize the shape the next time it is encountered. Those mathematical definitions are linked to the words humans would use to describe the objects, be they cars, carrots, men, mountains or zebras.
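As a rough illustration of that object dictionary, the sketch below stores a vector for each word and labels an unknown shape with the word whose vector it most resembles, using cosine similarity. The prototype vectors are made-up toy values, not ImageNet data, and the matching scheme is a stand-in for whatever the real system uses.

```python
# A hedged sketch of the object-dictionary idea: each object class is
# represented by a vector, and an unknown shape gets the word whose vector
# it most resembles. The three-number prototypes below are toy values.
import numpy as np

visual_dictionary = {
    "car":    np.array([0.9, 0.1, 0.0]),
    "carrot": np.array([0.2, 0.8, 0.1]),
    "zebra":  np.array([0.1, 0.2, 0.9]),
}

def label_object(features):
    """Return the word whose stored vector is most similar (cosine) to the input."""
    features = np.asarray(features, dtype=float)
    def cosine(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(visual_dictionary,
               key=lambda word: cosine(visual_dictionary[word], features))

print(label_object([0.85, 0.15, 0.05]))  # -> "car"
```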

Li played a leading role in creating this training tool, the ImageNet project, but her current work goes well beyond memorizing this visual dictionary.

Her team’s new computer vision algorithm also trained itself by looking for patterns in a visual dictionary, but this time a dictionary of scenes, a more complicated task than looking at objects alone.

This was a smaller database, made up of tens of thousands of images. Each scene was described in two ways: in mathematical terms that the machine could use to recognize similar scenes, and in a phrase that humans would understand. For instance, one image might be “cat sits on keyboard” while another could be “girl rides on horse in field.”
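A minimal sketch of that two-way description might look like the following, with each training scene carrying both a machine-readable vector and a human-readable phrase. The field names and numbers are illustrative, not the team’s actual data format.

```python
# A toy data structure for the paired scene descriptions above: one
# machine-facing vector and one human-facing phrase per training scene.
from dataclasses import dataclass
import numpy as np

@dataclass
class DescribedScene:
    features: np.ndarray  # mathematical description the machine matches against
    phrase: str           # the caption a human would understand

scene_database = [
    DescribedScene(np.array([0.7, 0.2, 0.1]), "cat sits on keyboard"),
    DescribedScene(np.array([0.1, 0.6, 0.9]), "girl rides on horse in field"),
]
```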

These two databases – one of objects and the other of scenes – served as training material. Li’s machine-learning algorithm analyzed the patterns in these pre-described pictures, then applied what it had learned to unknown images, identifying individual objects and providing some rudimentary context. In other words, it told a simple story about the image.
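Putting the pieces together, here is a hedged end-to-end sketch of the two-stage idea the article describes: name the most similar object from the object dictionary, then borrow the phrase of the most similar known scene to supply context. All vectors and examples are toy values, and this nearest-neighbor scheme stands in for, rather than reproduces, the Stanford algorithm.

```python
# A self-contained toy version of the final step: after "training" on object
# and scene examples, caption an unknown image by naming its closest object
# and borrowing the phrase of its closest scene. Toy values throughout.
import numpy as np

objects = {"dog": np.array([0.9, 0.1]), "girl": np.array([0.1, 0.9])}
scenes = [(np.array([0.8, 0.2]), "dog is playing frisbee in field"),
          (np.array([0.2, 0.9]), "girl rides on horse in field")]

def simple_story(image_vector):
    """Identify a discrete object, then add rudimentary context from the scenes."""
    vec = np.asarray(image_vector, dtype=float)
    word = min(objects, key=lambda w: np.linalg.norm(objects[w] - vec))
    _, phrase = min(scenes, key=lambda s: np.linalg.norm(s[0] - vec))
    return f"object: {word}; story: {phrase}"

print(simple_story([0.85, 0.2]))  # -> "object: dog; story: dog is playing frisbee in field"
```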