An AI tool for understanding and indexing sound

November 15, 2011

Audio engineers at Imagine Research have developed a novel AI system called MediaMined for understanding and indexing sound, a unique tool for both finding and matching previously unlabeled audio files.

“It allows computers to index, understand and search sound — as a result, we have made millions of media files searchable,” says Imagine Research’s founder and CEO Jay LeBoeuf.

For recording artists and others in music production, MediaMined enables quick scanning of a large set of tracks and recordings, automatically labeling the inputs.

“It acts as a virtual studio engineer,” says LeBoeuf, as it chooses tracks with features that best match qualities the user defines as ideal. “If your software detects male vocals,” LeBoeuf adds, “then it would also respond by labeling the tracks and acting as intelligent studio assistant — this allows musicians and audio engineers to concentrate on the creative process rather than the mundane steps of configuring hardware and software.”

Three processing stages

The technology uses three tiers of analysis to process audio files. First, the software detects the properties of the complex sound wave represented by an audio file’s data. The raw data contains a wide range of information, from simple amplitude values to the specific frequencies that form the sound. The data also reveals more musical information, such as the timing, timbre and spatial positioning of sound events.
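As a rough illustration of this first stage, the sketch below pulls a few low-level descriptors (peak amplitude, RMS level and dominant frequency) from a short synthetic tone. It is not MediaMined's code: the function name `describe_signal`, the test tone and the choice of a naive Fourier transform are all assumptions made for the example.

```python
import cmath
import math

def describe_signal(samples, sample_rate):
    """Return simple amplitude and frequency descriptors for a mono signal."""
    n = len(samples)
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / n)
    # Naive discrete Fourier transform, O(n^2); acceptable for a short clip.
    magnitudes = []
    for k in range(n // 2):
        coeff = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        magnitudes.append(abs(coeff))
    # Skip bin 0 (the constant offset) when looking for the strongest frequency.
    dominant_bin = max(range(1, len(magnitudes)), key=lambda k: magnitudes[k])
    return {
        "peak": peak,
        "rms": rms,
        "dominant_hz": dominant_bin * sample_rate / n,
    }

# Hypothetical input: 256 samples of a 1,000 Hz tone sampled at 8,000 Hz.
SAMPLE_RATE = 8000
tone = [0.5 * math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(256)]
features = describe_signal(tone, SAMPLE_RATE)
```

A production system would use a fast Fourier transform and many more descriptors, such as timbre and spatial cues, but the idea of reducing a waveform to measurable properties is the same.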

In the second stage of processing, the software applies statistical techniques to estimate how the characteristics of the sound file might relate to other sound files. For example, the software looks at the patterns represented by the sound wave in relation to data from sound files already in the MediaMined database, the degree to which that sound wave differs from others, and specific characteristics such as component pitches, peak volume levels, tempo and rhythm.
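One simple way to express "how much a file differs from others" is a z-score: the distance of each feature from the database average, measured in standard deviations. The article does not say which statistics MediaMined uses, so the sketch below is only an assumed stand-in, and the feature names and values are invented.

```python
import statistics

def deviation_from_database(candidate, database):
    """Return per-feature z-scores of `candidate` relative to `database`."""
    scores = {}
    for name, value in candidate.items():
        column = [entry[name] for entry in database]
        mean = statistics.mean(column)
        spread = statistics.pstdev(column)  # population standard deviation
        # A feature with no spread cannot be scaled; report zero deviation.
        scores[name] = 0.0 if spread == 0 else (value - mean) / spread
    return scores

# Hypothetical database of previously analyzed files.
database = [
    {"tempo_bpm": 100.0, "peak_db": -6.0},
    {"tempo_bpm": 120.0, "peak_db": -3.0},
    {"tempo_bpm": 140.0, "peak_db": -9.0},
]
candidate = {"tempo_bpm": 120.0, "peak_db": -1.0}
scores = deviation_from_database(candidate, database)
```

Here the candidate's tempo sits exactly at the database average, while its peak level is roughly two standard deviations louder than typical.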

In the final stage of processing, a number of machine learning processes and other analysis tools assign various labels to the sound wave file and output a user-friendly breakdown. The output delineates the actual contents of the file, such as male speech, applause or rock music. The third stage of processing also highlights which parts of a sound file represent which components, such as when a snare drum hits or when a vocalist starts singing lyrics.
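Locating events in time can be sketched without any machine learning at all. The example below swaps in a much simpler technique, a threshold on short-time energy, to report when loud events begin. The function name, frame size and threshold are assumptions for illustration and say nothing about how MediaMined itself detects a snare hit.

```python
def find_onsets(samples, sample_rate, frame_size=100, threshold=0.01):
    """Return start times (seconds) of frames where energy rises above threshold."""
    onsets = []
    was_loud = False
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        energy = sum(s * s for s in frame) / frame_size
        is_loud = energy > threshold
        # Record only the transition from quiet to loud, not every loud frame.
        if is_loud and not was_loud:
            onsets.append(start / sample_rate)
        was_loud = is_loud
    return onsets

# Hypothetical one-second signal at 1,000 Hz with two short bursts.
SAMPLE_RATE = 1000
signal = [0.0] * 1000
for hit_start in (200, 700):
    for i in range(100):
        signal[hit_start + i] = 0.8 if i % 2 == 0 else -0.8
onsets = find_onsets(signal, SAMPLE_RATE)
```

The result is a list of time stamps, one per burst, which a labeling stage could then pair with a class such as "snare drum" or "vocals".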

“MediaMined listens to audio files that are uploaded to our servers, and we generate an XML output with the low-level perceptual content, a universal sound signature and a high-level description of the audio in the file,” says LeBoeuf. “When software applications understand what they are listening to, they can do a better job processing audio and help users discover new content.”
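The XML format itself is not published in this article, so the following is only a guess at what a report with those three parts might look like. Every element and attribute name (`analysis`, `lowLevel`, `feature`, `signature`, `description`, `label`) is hypothetical, as are the sample values.

```python
import xml.etree.ElementTree as ET

def build_report(filename, low_level, signature, labels):
    """Assemble a three-part analysis report as an XML string."""
    root = ET.Element("analysis", {"file": filename})
    # Part 1: low-level perceptual measurements.
    low = ET.SubElement(root, "lowLevel")
    for name, value in low_level.items():
        ET.SubElement(low, "feature", {"name": name}).text = str(value)
    # Part 2: a compact signature used for matching.
    ET.SubElement(root, "signature").text = signature
    # Part 3: human-readable labels.
    high = ET.SubElement(root, "description")
    for label in labels:
        ET.SubElement(high, "label").text = label
    return ET.tostring(root, encoding="unicode")

report = build_report(
    "clip.wav",
    {"peak": 0.5, "dominant_hz": 1000.0},
    "a1b2c3",
    ["male speech", "applause"],
)
```

A client application could parse such a report with the same library and act on the labels, for example by routing a track tagged as vocals to a vocal processing chain.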

One of the key innovations of the new technology is the ability to perform sound-similarity searches. Now, when a musician wants a track with a matching feel to mix into a song, or an audio engineer wants a slightly different sound effect to work into a film, the process can be as simple as uploading an example file and browsing the detected matches.
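A similarity search of this kind can be pictured as ranking stored files by how close their feature vectors are to the uploaded example. The sketch below uses cosine similarity, a common choice but not one the article attributes to MediaMined. The file names and three-number vectors are made up.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def most_similar(query, library, top_k=2):
    """Return the names of the `top_k` library entries closest to `query`."""
    ranked = sorted(
        library,
        key=lambda name: cosine_similarity(query, library[name]),
        reverse=True,
    )
    return ranked[:top_k]

# Hypothetical library mapping file names to precomputed feature vectors.
library = {
    "door_slam.wav": [0.9, 0.1, 0.0],
    "car_door.wav": [0.8, 0.2, 0.1],
    "birdsong.wav": [0.0, 0.2, 0.9],
}
matches = most_similar([1.0, 0.1, 0.0], library)
```

With these vectors, the two door sounds rank ahead of the birdsong, which is the behavior a user browsing for "a slightly different sound effect" would expect.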

The developers believe MediaMined may also help with other, more complex tasks. For example, the technology could be used to enable mobile devices to detect their acoustic surroundings and support new means of interaction. Or, physicians could use the system to collect data on such sounds as coughing, sneezing or snoring and not only characterize the qualities of such sounds, but also measure duration, frequency and intensity. Such information could potentially aid disease diagnosis and guide treatment.