Human vs. deep-neural-network performance in object recognition

… and how they can teach each other
September 27, 2017

(credit: UC Santa Barbara)

Before you read this: look for toothbrushes in the photo above.

Did you notice the huge toothbrush on the left? Probably not. That’s because when humans search through scenes for a particular object, we often miss objects whose size is inconsistent with the rest of the scene, according to scientists in the Department of Psychological & Brain Sciences at UC Santa Barbara.

The scientists are investigating this phenomenon in an effort to better understand how humans and computers compare in doing visual searches. Their findings are published in the journal Current Biology.

Hiding in plain sight

“When something appears at the wrong scale, you will miss it more often because your brain automatically ignores it,” said UCSB professor Miguel Eckstein, who specializes in computational human vision, visual attention, and search.

The experiment used computer-generated images of everyday scenes in which target objects varied in color, viewing angle, and size, mixed with “target-absent” scenes that contained no target. The researchers asked 60 viewers to search for these objects (e.g., a toothbrush, a parking meter, a computer mouse) while eye-tracking software recorded their gaze paths.

The researchers found that people tended to miss the target more often when it was mis-scaled (too large or too small) — even when looking directly at the target object.

Computer vision, by contrast, doesn’t have this issue, the scientists reported. However, in the experiments the researchers found that even the most advanced computer-vision systems, deep neural networks, have limitations of their own.

Human search strategies that could improve computer vision

A red rectangle marks an object incorrectly identified as a cell phone by a deep-learning algorithm (credit: UC Santa Barbara)

For example, a deep-learning convolutional neural network (CNN) incorrectly identified a computer keyboard as a cell phone, based on its similar shape and its proximity to a human hand (where a cell phone would be expected). To a human observer, however, the object’s size relative to the nearby hands makes it clearly inconsistent with a cell phone.

“This strategy allows humans to reduce false positives when making fast decisions,” the researchers note in the paper.

“The idea is when you first see a scene, your brain rapidly processes it within a few hundred milliseconds or less, and then you use that information to guide your search towards likely locations where the object typically appears,” Eckstein said. “Also, you focus your attention on objects that are actually at the size that is consistent with the object that you’re looking for.”

That is, human brains use the relationships between objects and their context within the scene to guide their eyes — a useful strategy to process scenes rapidly, eliminate distractors, and reduce false positives.

This finding might suggest ways to improve computer vision by implementing some of the tricks the brain utilizes to reduce false positives, according to the researchers.
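One way to picture such a “trick” is a post-processing step that discounts detections whose apparent size is implausible relative to a reference object in the same scene (here, a detected hand). The sketch below is purely illustrative and is not the method from the paper; the detection format, the expected size ratios, and the penalty rule are all assumptions.

```python
# Illustrative sketch (not from the paper): down-weight detector outputs whose
# apparent size is inconsistent with a reference object in the same scene.

from dataclasses import dataclass

@dataclass
class Detection:
    label: str          # e.g., "cell phone"
    score: float        # detector confidence in [0, 1]
    box: tuple          # (x_min, y_min, x_max, y_max) in pixels

def box_area(box):
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)

# Assumed typical area ratios of each object class relative to a human hand.
# These numbers are made up for illustration only.
EXPECTED_RATIO_TO_HAND = {"cell phone": 1.0, "keyboard": 4.0, "toothbrush": 0.5}

def rescore(detections, tolerance=2.0, penalty=0.2):
    """Penalize detections whose size disagrees with the hand-relative prior."""
    hands = [d for d in detections if d.label == "hand"]
    if not hands:
        return detections  # no reference object: leave scores untouched
    hand_area = sum(box_area(h.box) for h in hands) / len(hands)
    rescored = []
    for d in detections:
        expected = EXPECTED_RATIO_TO_HAND.get(d.label)
        if expected is not None and hand_area > 0:
            ratio = box_area(d.box) / hand_area
            # If the object is far larger or smaller than expected, discount it.
            if ratio > expected * tolerance or ratio < expected / tolerance:
                d = Detection(d.label, d.score * penalty, d.box)
        rescored.append(d)
    return rescored

# Example: a keyboard-sized region mislabeled "cell phone" next to a hand
# gets its confidence reduced because it is about 4x too large for a phone.
dets = [Detection("hand", 0.95, (10, 10, 110, 110)),
        Detection("cell phone", 0.90, (150, 20, 350, 220))]
print([(d.label, round(d.score, 2)) for d in rescore(dets)])
```

In this toy example, the mislabeled keyboard keeps its shape-based evidence but loses most of its confidence because its size is far from what a cell phone next to a hand should occupy, roughly the kind of contextual discounting the researchers attribute to human vision.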

Future research

“There are some theories that suggest that people with autism spectrum disorder focus more on local scene information and less on global structure,” said Eckstein, who is contemplating a follow-up study. “So there is a possibility that people with autism spectrum disorder might miss the mis-scaled objects less often, but we won’t know that until we do the study.”

In the more immediate future, the team’s research will look into the brain activity that occurs when we view mis-scaled objects.

“Many studies have identified brain regions that process scenes and objects, and now researchers are trying to understand which particular properties of scenes and objects are represented in these regions,” said postdoctoral researcher Lauren Welbourne, whose current research concentrates on how objects are represented in the cortex, and how scene context influences the perception of objects.

“So what we’re trying to do is find out how these brain areas respond to objects that are either correctly or incorrectly scaled within a scene. This may help us determine which regions are responsible for making it more difficult for us to find objects if they are mis-scaled.”


Abstract of “Humans, but Not Deep Neural Networks, Often Miss Giant Targets in Scenes”

Even with great advances in machine vision, animals are still unmatched in their ability to visually search complex scenes. Animals from bees [1, 2] to birds [3] to humans [4–12] learn about the statistical relations in visual environments to guide and aid their search for targets. Here, we investigate a novel manner in which humans utilize rapidly acquired information about scenes by guiding search toward likely target sizes. We show that humans often miss targets when their size is inconsistent with the rest of the scene, even when the targets were made larger and more salient and observers fixated the target. In contrast, we show that state-of-the-art deep neural networks do not exhibit such deficits in finding mis-scaled targets but, unlike humans, can be fooled by target-shaped distractors that are inconsistent with the expected target’s size within the scene. Thus, it is not a human deficiency to miss targets when they are inconsistent in size with the scene; instead, it is a byproduct of a useful strategy that the brain has implemented to rapidly discount potential distractors.