‘Data smashing’ could automate discovery, untouched by human hands

October 28, 2014


From recognizing speech to identifying unusual stars, new discoveries often begin with comparison of data streams to find connections and spot outliers. But simply feeding raw data into a data-analysis algorithm is unlikely to produce meaningful results, say the authors of a new Cornell study.

That’s because most data comparison algorithms today have one major weakness: somewhere, they rely on a human expert to specify what aspects of the data are relevant for comparison, and what aspects aren’t.

But human experts can't keep up with the growing volume and complexity of big data.

So the Cornell computing researchers have come up with a new principle they call “data smashing” for estimating the similarities between streams of arbitrary data without human intervention, and even without access to the data sources.

How ‘data smashing’ works

Data smashing is based on a new way to compare data streams. The process involves two steps:

  1. The data streams are algorithmically “smashed” to “annihilate” the information in each other.
  2. The process then measures what information survives the collision. The more information that remains, the less likely it is that the streams originated from the same source (a simplified sketch of this idea follows below).
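To make the idea concrete, here is a minimal sketch for the simplest possible case: independent 0/1 streams, where “inverting” a stream reduces to flipping its bits and “colliding” two streams means keeping only the positions where they agree. The function names (invert, collide, smash_distance) and the deviation measure are illustrative assumptions; the paper’s actual algorithm handles Markov sources over arbitrary alphabets via operations on probabilistic finite-state machines.

```python
# Toy illustration of the data-smashing idea for i.i.d. binary streams.
# A simplified sketch under the assumptions above, not the authors' full algorithm.

import random

def invert(stream):
    """For memoryless 0/1 sources, flipping each symbol produces the 'inverse'
    stream: colliding it with a stream from the same source yields
    (approximately) unbiased coin flips, i.e. flat white noise."""
    return [1 - s for s in stream]

def collide(s1, s2):
    """'Smash' two streams: keep a symbol only at positions where both
    streams show the same symbol; discard everything else."""
    return [a for a, b in zip(s1, s2) if a == b]

def smash_distance(s1, s2):
    """Estimate how dissimilar the sources of s1 and s2 are: collide s1 with
    the inverse of s2 and measure how far the residue deviates from unbiased
    noise. Near 0 suggests the same source; larger values suggest different sources."""
    residue = collide(s1, invert(s2))
    if not residue:
        return 1.0  # degenerate case: nothing survived the collision
    ones = sum(residue) / len(residue)
    return abs(ones - 0.5) * 2.0  # scaled deviation from a fair coin, in [0, 1]

def bernoulli(p, n, rng):
    """Generate n symbols from a biased coin with P(1) = p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

rng = random.Random(0)
a = bernoulli(0.7, 100_000, rng)
b = bernoulli(0.7, 100_000, rng)  # same source as a
c = bernoulli(0.3, 100_000, rng)  # different source

print(smash_distance(a, b))  # close to 0: the information annihilates
print(smash_distance(a, c))  # clearly larger: structure remains after the collision
```

Run on two long streams drawn from the same biased coin, the residue looks like fair coin flips and the distance is near zero; against a differently biased coin, the residue stays far from unbiased noise, signaling different sources.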

Data-smashing principles could open the door to understanding increasingly complex observations, especially when experts don’t know what to look for, according to the researchers.

The researchers — Hod Lipson, associate professor of mechanical engineering and of computing and information science, and Ishanu Chattopadhyay, a former postdoctoral associate with Lipson now at the University of Chicago — demonstrated this idea with data from real-world problems, including detection of anomalous cardiac activity from heart recordings and classification of astronomical objects from raw photometry.

In all cases, and without access to any domain knowledge, the researchers demonstrated that the performance of these general algorithms was on par with the accuracy of specialized algorithms and heuristics devised and tuned by domain experts.

The researchers described the work in the Journal of the Royal Society Interface. The work was supported by DARPA and the U.S. Army Research Office.


Abstract of Data smashing: uncovering lurking order in data

From automatic speech recognition to discovering unusual stars, underlying almost all automated discovery tasks is the ability to compare and contrast data streams with each other, to identify connections and spot outliers. Despite the prevalence of data, however, automated methods are not keeping pace. A key bottleneck is that most data comparison algorithms today rely on a human expert to specify what ‘features’ of the data are relevant for comparison. Here, we propose a new principle for estimating the similarity between the sources of arbitrary data streams, using neither domain knowledge nor learning. We demonstrate the application of this principle to the analysis of data from a number of real-world challenging problems, including the disambiguation of electro-encephalograph patterns pertaining to epileptic seizures, detection of anomalous cardiac activity from heart sound recordings and classification of astronomical objects from raw photometry. In all these cases and without access to any domain knowledge, we demonstrate performance on a par with the accuracy achieved by specialized algorithms and heuristics devised by domain experts. We suggest that data smashing principles may open the door to understanding increasingly complex observations, especially when experts do not know what to look for.