Gene-expression data to hit one million deposited data sets
July 20, 2012

DNA microarrays allow researchers to analyze the expression of a huge number of genes simultaneously (credit: Guillaume Paumier/Wikimedia Commons)
With close to one million gene-expression data sets now in publicly accessible repositories, researchers can identify disease trends without ever having to enter a laboratory, Nature News reports.
Entering the search term “breast cancer” into a public repository called the Gene Expression Omnibus (GEO), a postdoctoral researcher retrieves a list of 1,170 experiments, representing nearly 33,000 samples and a hoard of gene-expression data that could reveal previously unseen patterns.
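For readers who want to try a similar search programmatically, the sketch below queries GEO through NCBI's public E-utilities `esearch` endpoint, using the `gds` (GEO DataSets) database. The helper names (`build_geo_search_url`, `parse_search_result`, `search_geo`) are hypothetical and chosen only for illustration, and the counts returned today will not match the 2012 figures quoted above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities search endpoint; "gds" is the GEO DataSets database.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_geo_search_url(term, retmax=20):
    """Build an esearch URL for a free-text GEO query (hypothetical helper)."""
    params = {"db": "gds", "term": term, "retmode": "json", "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


def parse_search_result(payload):
    """Extract the total hit count and the returned record IDs from the JSON reply."""
    result = json.loads(payload)["esearchresult"]
    return int(result["count"]), result["idlist"]


def search_geo(term, retmax=20):
    """Run the query over the network and return (total_hits, record_ids)."""
    with urlopen(build_geo_search_url(term, retmax)) as response:
        return parse_search_result(response.read().decode("utf-8"))
```

Calling `search_geo("breast cancer")` needs network access and returns the total number of matching GEO records together with the first few record IDs. Note that these are GEO record counts, which may not correspond one-to-one with the "experiments" and "samples" tallied in the article.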
And those numbers are rising rapidly. Since 2002, many scientific journals have required that data from gene-expression studies be deposited in public databases such as GEO, which is maintained by the National Center for Biotechnology Information in Bethesda, Maryland, and ArrayExpress, a large gene-expression repository at the European Bioinformatics Institute (EBI) in Hinxton, UK.
Some time in the next few weeks, the number of deposited data sets will top one million (see ‘Data dump’).
The result is an unprecedented resource that promises to drive down costs and speed up progress in understanding disease. Gene-sequence data are already shared extensively, but expression data are more complex and can reveal which genes are the most active in, say, liver versus brain cells, or in diseased versus healthy tissue. And because studies often look at many genes, researchers can repurpose the data sets, asking questions other than those posed by the original researchers.
Gustavo Stolovitzky, a computational biologist at the IBM Thomas J. Watson Research Center in Yorktown Heights, New York, has used publicly available data to train algorithms to recognize gene signatures for diseases such as lung cancer, chronic obstructive pulmonary disease (COPD) and psoriasis.
Not only can the algorithms distinguish lung cancer from COPD, they can also tell squamous-cell carcinoma from adenocarcinoma. “There is enough info in existing databases to predict disease in samples that algorithms have never seen before,” Stolovitzky says.
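To make the idea of learning a gene signature from labeled samples concrete, here is a minimal sketch of a nearest-centroid classifier. This is not Stolovitzky's method, which the article does not describe; it is a deliberately simple stand-in. The function names and the three-gene expression values are invented for illustration and are not real data.

```python
from math import dist
from statistics import fmean


def train_centroids(samples):
    """Average the expression vectors of each label into one centroid per label.

    samples: list of (label, expression_vector) pairs.
    """
    by_label = {}
    for label, vector in samples:
        by_label.setdefault(label, []).append(vector)
    return {
        label: [fmean(column) for column in zip(*vectors)]
        for label, vectors in by_label.items()
    }


def predict(centroids, vector):
    """Return the label whose centroid is closest (Euclidean distance) to the sample."""
    return min(centroids, key=lambda label: dist(centroids[label], vector))


# Made-up expression levels for three hypothetical genes (illustration only).
training = [
    ("lung_cancer", [8.0, 2.0, 5.0]),
    ("lung_cancer", [7.0, 3.0, 5.5]),
    ("COPD", [2.0, 7.0, 5.0]),
    ("COPD", [3.0, 8.0, 4.5]),
]
centroids = train_centroids(training)
unseen_sample = [7.2, 2.4, 5.1]
predicted_label = predict(centroids, unseen_sample)
```

In this toy setup the unseen sample sits closest to the lung-cancer centroid, so that label is returned. Real signatures involve thousands of genes, normalization across platforms and careful validation, none of which is attempted here.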
Comments (3)
by melajara
Data is useless if you don’t have a “mind” behind it to “compress” the data into meaningful, interrelated concepts.
I see a job here for Watson. It is no coincidence that the article mentions a researcher from IBM Thomas J. Watson Research Center. This could very well be a target application of the DeepQA project at the center of Watson’s Jeopardy feat last year.
Now imagine a “MetaDeepQA” layer: a reasoning engine that produces a string of answerable DeepQA queries, some of them possibly depending on automated scientific experiments that have not yet been run (here, new gene-expression sampling). Then you have the blueprint of a quite powerful “automated scientist”.
I make the bold prediction that such a scientist (but still guided by human operators) should be awarded a Nobel Prize before 2020!
Actually, such “centaur” teams (as already seen at the highest level of chess) should become more and more common in scientific research until the computer is truly autonomous (up to 2040).
by Gorden Russell
This is what Ray talked about when he wrote his last book. The curve of progress is getting steeper and steeper to the right. The curve is going to look like the blade of a hockey stick. In 18 months it will be even steeper. As computers double in speed they will gather twice the data. As their processors double in capacity the numbers they crunch will double in size.
by GatorALLin
This is so awesome, and the database will keep growing and really help broaden the understanding of how different cancers work or act like other diseases. The idea that others can search the same data on a massive scale is genius.