New search program aims to teach itself everything about any visual concept
June 13, 2014
Computer scientists from the University of Washington and the Allen Institute for Artificial Intelligence in Seattle have created an automated computer program that they claim “teaches everything there is to know about any visual concept.”
Called Learning Everything about Anything (LEVAN), the program searches millions of books and images on the Web to learn all possible variations of a concept, then displays the results to users as a comprehensive, browsable list of images, helping them explore and understand topics quickly in great detail.
“It is all about discovering associations between textual and visual data,” said Ali Farhadi, a UW assistant professor of computer science and engineering. “The program learns to tightly couple rich sets of phrases with pixels in images. This means that it can recognize instances of specific concepts when it sees them.”
LEVAN is a fully-automated system that learns everything visual about any concept, by processing lots of books and images on the web. It acts as a visual encyclopedia for you, helping you explore and understand any topic that you are curious about, in great detail.
The program learns which terms are relevant by looking at the content of the images found on the Web and identifying characteristic patterns across them using object recognition algorithms. It’s different from online image libraries because it draws upon a rich set of phrases to understand and tag photos by their content and pixel arrangements, not simply by words displayed in captions.
Users can browse the current library of roughly 175 concepts, which range from “airline” to “window” and include “beautiful,” “breakfast,” “shiny,” “cancer,” “innovation,” “skateboarding,” “robot,” and “horse.”
If the concept you’re looking for doesn’t exist, you can submit any search term and the program will automatically begin generating an exhaustive list of subcategory images that relate to that concept. For example, a search for “dog” brings up the obvious collection of subcategories: photos of “Chihuahua dog,” “black dog,” “swimming dog,” “scruffy dog” and “greyhound dog.” But it also returns “dog nose,” “dog bowl,” “sad dog,” “ugliest dog,” “hot dog” and even “down dog,” as in the yoga pose.
The technique works by searching the text from millions of books written in English and available on Google Books, scouring for every occurrence of the concept in the entire digital library. Then, an algorithm filters out words that aren’t visual. For example, with the concept “horse,” the algorithm would keep phrases such as “jumping horse,” “eating horse” and “barrel horse,” but would exclude non-visual phrases such as “my horse” and “last horse.”
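The phrase-gathering step described above can be sketched roughly as follows. This is a minimal illustration and not the researchers' code: the `ngram_counts` input format, the `min_count` cutoff, and the `NON_VISUAL_MODIFIERS` stoplist are all assumptions made for the example. In particular, the stoplist is only a stand-in for the real filter, whose details the article does not give.

```python
from collections import Counter

# Hypothetical stand-in for the "is this phrase visual?" decision.
# The real system's filter is not described in this article.
NON_VISUAL_MODIFIERS = {"my", "last"}


def candidate_phrases(ngram_counts, concept, min_count=2):
    """Collect multi-word phrases that contain the concept as a whole word.

    ngram_counts: iterable of (phrase, count) pairs (assumed format).
    Returns phrases sorted from most to least frequent.
    """
    concept = concept.lower()
    totals = Counter()
    for phrase, count in ngram_counts:
        words = phrase.lower().split()
        # Whole-word match, so "horseshoe crab" does not count as "horse".
        if concept in words and len(words) > 1:
            totals[" ".join(words)] += count
    return [p for p, c in totals.most_common() if c >= min_count]


def keep_visual(phrases):
    """Drop phrases whose modifiers are all on the placeholder stoplist."""
    kept = []
    for phrase in phrases:
        words = set(phrase.split())
        if not (words & NON_VISUAL_MODIFIERS):
            kept.append(phrase)
    return kept


# Toy counts invented for illustration only.
toy_counts = [
    ("jumping horse", 40), ("Jumping horse", 5), ("my horse", 90),
    ("horseshoe crab", 12), ("last horse", 7), ("barrel horse", 3),
    ("horse", 500), ("eating horse", 1),
]

candidates = candidate_phrases(toy_counts, "horse")
visual = keep_visual(candidates)
print(visual)
```

With the toy counts above, “my horse” and “last horse” are collected as candidates but dropped by the placeholder filter, while “jumping horse” and “barrel horse” survive.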
Once it has learned which phrases are relevant, the program does an image search on the Web, looking for uniformity in appearance among the photos retrieved. When the program is trained to find relevant images of, say, “jumping horse,” it then recognizes all images associated with this phrase.
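One simple way to picture “uniformity in appearance” is to compare feature vectors for the images retrieved for each phrase. The sketch below is a hypothetical illustration, not the method used in LEVAN: it assumes every image has already been reduced to a numeric feature vector, and it swaps in mean pairwise cosine similarity with an arbitrary `threshold` as the uniformity measure.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def uniformity(features):
    """Mean pairwise cosine similarity across one phrase's image features."""
    pairs = list(combinations(features, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def visually_consistent(phrase_features, threshold=0.8):
    """Return, in sorted order, the phrases whose images look alike.

    phrase_features: dict mapping phrase -> list of feature vectors
    (assumed to come from some image descriptor; hypothetical here).
    """
    return sorted(
        phrase
        for phrase, feats in phrase_features.items()
        if uniformity(feats) >= threshold
    )


# Toy two-dimensional features invented for illustration only.
toy_features = {
    "jumping horse": [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]],
    "my horse": [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
}

print(visually_consistent(toy_features))
```

In this toy data the “jumping horse” vectors point in nearly the same direction and pass the threshold, while the “my horse” vectors are scattered and do not.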
The researchers launched the program in March with only a handful of concepts and have watched it grow since then to tag more than 13 million images with 65,000 different phrases.
Right now, the program is limited in how fast it can learn about a concept because of the computational power needed to process each query; some broad concepts take up to 12 hours. The researchers are working on increasing the processing speed and capabilities.
The team wants the open-source program to be both an educational tool and an information bank for researchers in the computer vision community. The team also hopes to offer a smartphone app that can run the program to automatically parse out and categorize photos.
The research team will present the project and a related paper this month at the Computer Vision and Pattern Recognition annual conference in Columbus, Ohio.
This research was funded by the U.S. Office of Naval Research, the National Science Foundation and the UW.
Abstract of the Computer Vision and Pattern Recognition annual conference paper
Recognition is graduating from labs to real-world applications. While it is encouraging to see its potential being tapped, it brings forth a fundamental challenge to the vision researcher: scalability. How can we learn a model for any concept that exhaustively covers all its appearance variations, while requiring minimal or no human supervision for compiling the vocabulary of visual variance, gathering the training images and annotations, and learning the models? In this paper, we introduce a fully-automated approach for learning extensive models for a wide range of variations (e.g. actions, interactions, attributes and beyond) within any concept. Our approach leverages vast resources of online books to discover the vocabulary of variance, and intertwines the data collection and modeling steps to alleviate the need for explicit human supervision in training the models. Our approach organizes the visual knowledge about a concept in a convenient and useful way, enabling a variety of applications across vision and NLP. Our online system has been queried by users to learn models for several interesting concepts including breakfast, Gandhi, beautiful, etc. To date, our system has models available for over 50,000 variations within 150 concepts, and has annotated more than 10 million images with bounding boxes.