DNA data overload
July 17, 2013
The source: the rapidly increasing speed and declining cost of DNA sequencing machines, which chop extremely long strands of genetic material into more manageable short segments.
But these sequencers do not yield important biological information that researchers “can read like a book,” say Michael C. Schatz, an assistant professor of quantitative biology at Cold Spring Harbor Laboratory, in New York state; and Ben Langmead, an assistant professor of computer science in Johns Hopkins’ Whiting School of Engineering.
“Instead, [they] generate something like an enormous stack of shredded newspapers, without any organization of the fragments. The stack is far too large to deal with manually, so the problem of sifting through all the fragments is delegated to computer programs.”
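To make the "shredded newspapers" analogy concrete, here is a deliberately simplified sketch of the task those programs face: given short fragments, find where they overlap and stitch them back together. The fragments, function names, and greedy merge strategy below are invented for illustration; real assembly software uses far more sophisticated algorithms and must cope with errors and enormous data volumes.

```python
def overlap(a, b):
    """Length of the longest suffix of a that matches a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n < min_overlap:
            break  # no confident join left
        merged = reads[i] + reads[j][n:]  # glue b onto a, dropping the overlap
        reads = [r for k, r in enumerate(reads) if k not in (i, j)]
        reads.append(merged)
    return reads

# Three overlapping fragments of a made-up sequence, "ATGGCGTGCAATG"
fragments = ["ATGGCGT", "GCGTGCA", "TGCAATG"]
print(greedy_assemble(fragments))  # → ['ATGGCGTGCAATG']
```

Even this toy version compares every fragment against every other, which is why the computation balloons as sequencers produce ever more fragments.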
Therein lies the problem, Schatz and Langmead say: improvements in these computer programs have not kept pace with the enhancements and widespread use of the sequencers that are cranking out huge amounts of data.
“Computing, not sequencing, is now the slower and more costly aspect of genomics research,” Schatz and Langmead say.
In his own research at Johns Hopkins, Langmead is working on remedies. “The battle is really taking place on two fronts,” he said. “We need algorithms that are more clever at solving these data issues, and we need to harness more computing power.”
For the latter option, Langmead said, scientists may be able to do their work more quickly by tapping into the huge cloud computing centers run by companies such as Amazon and “renting” time on these systems.