AI program predicts key disease-associated genetic mutations for hundreds of complex diseases

June 15, 2015

A depiction of the double-helical structure of DNA. Its four nucleotide bases (A, T, C, G) are color-coded in pink, orange, purple and yellow (credit: NHGRI)

A decade of work at Johns Hopkins has yielded a computer program that predicts, with far more accuracy than current methods, which mutations are likely to have the largest effect on the activity of the “dimmer switches” (which alter the cell’s gene activity) in DNA — suggesting new targets for diagnosis and treatment of many diseases.

A summary of the research was published online today (June 15) in the journal Nature Genetics.

“Our computer program can comb through the genetic information from a specific cell type and predict which ‘dimmer switch’ mutations are most likely to alter the cell’s gene activity, and therefore its function,” says Michael Beer, Ph.D., associate professor of biomedical engineering at the Johns Hopkins University School of Medicine.

Which genetic mutations matter?

“The plan is to continually improve the formula as we learn more about these regulatory regions,” he says, “but already it can narrow down a list of disease-associated mutations by a factor of 20, allowing researchers to focus on the ones that are most likely to matter.”

Researchers have sequenced the genomes of many patients suffering from common multigene diseases, looking for shared mutations in their control regions. The trouble is, Beer says, that these studies yield hundreds of mutations, most of them benign. So he and his team of researchers designed a computer program that could learn the difference between mutations that are likely to affect gene activity levels and those that likely won’t.

“There are a lot of common diseases, like diabetes, that are probably the result of several different mutations in control regions. The mutations don’t directly cause a change in the proteins [that are] made, but they impact their abundance,” he says. Sorting out which mutations matter most in a given disease is key to advancing treatments.

The task has been difficult, Beer explains, because a single alteration, say from a cytosine (C) to a guanine (G) in the four-letter alphabet of DNA, can have drastically different effects depending on where it occurs in the genome.

“If it occurs in the middle of a gene that encodes a crucial protein, it could alter the code in such a way that no protein is made and the organism dies, or it could have no effect whatsoever if the function of the protein isn’t altered by the change,” he says. The same extremes could be true if the C to G mutation occurred outside of a gene, in a control region: The mutation could cause the region to stop working altogether, or it could have no effect. And between those extremes is everything else.

Training the program

Part of the first step in creating a mutation-impacts classifier program, using a positive training set of putative regulatory sequences and a negative training set of matched negative-control sequences (credit: Dongwon Lee et al./Nature Genetics)

To develop the new formula, Beer says his team first “trained” its classifier program to recognize potential control regions using a property called DNase sensitivity. DNase is an enzyme that cuts DNA wherever it is not tightly wound.

The openness of particular sequences of DNA varies among different types of cells, and only control regions in open DNA can be active. How vulnerable certain stretches of DNA are to DNase is therefore an indication of which control regions are important in a given cell type, Beer says.

Dongwon Lee, Ph.D., then a graduate student in Beer’s laboratory, taught the computer program to recognize the features of DNase-sensitive sequences in a type of cancer cell by giving the computer a list of already known sequences. The program then predicted the rest of the DNase-sensitive sequences and measured how much individual sections of a sequence contributed to that region’s overall DNase sensitivity.
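The training step can be illustrated with a deliberately simplified sketch. The published classifier is a gapped k-mer support vector machine (gkm-SVM); the code below swaps in a much simpler stand-in, plain k-mer log-odds weights, only to show the idea of learning a sequence "vocabulary" from a positive set and a matched negative set. The function names and the toy sequences are hypothetical and are not taken from the study.

```python
from collections import Counter
from math import log


def kmers(seq, k):
    """Yield every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def train_kmer_weights(positives, negatives, k=3, pseudocount=1.0):
    """Return {k-mer: log-odds weight} learned from two sequence sets.

    A simplified stand-in for gkm-SVM training, not the authors' method.
    """
    pos = Counter(km for s in positives for km in kmers(s, k))
    neg = Counter(km for s in negatives for km in kmers(s, k))
    vocab = set(pos) | set(neg)
    # Pseudocounts keep the logarithm defined for k-mers seen in one set only.
    pos_total = sum(pos.values()) + pseudocount * len(vocab)
    neg_total = sum(neg.values()) + pseudocount * len(vocab)
    return {
        km: log((pos[km] + pseudocount) / pos_total)
        - log((neg[km] + pseudocount) / neg_total)
        for km in vocab
    }


def score(seq, weights, k=3):
    """Sum the weights of all k-mers in seq; unseen k-mers contribute 0."""
    return sum(weights.get(km, 0.0) for km in kmers(seq, k))


# Hypothetical toy data: "open" sequences share an ACGTG-like motif.
positives = ["ACGTGACGTG", "TTACGTGAAC", "GACGTGACCA"]
negatives = ["ATATATATAT", "TTTTAAAATT", "ATTTAATATA"]
weights = train_kmer_weights(positives, negatives)

print(score("CACGTGA", weights))  # motif-containing sequence scores above 0
print(score("ATATTTA", weights))  # background-like sequence scores below 0
```

In this toy version, each k-mer's weight plays the role of a section's contribution to the region's overall predicted DNase sensitivity.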

The computer then simulated “mutating” every DNA letter in turn and recalculated each section’s contribution to DNase sensitivity. The larger the change in sensitivity after a given mutation, the more likely it is that the mutation will affect gene activity levels in the cell, Beer says.

To test the validity of the formula, the team compared their computer predictions to the predictions made by alternative programs. When the programs’ “rules” were set to be equally thorough in their searches, Beer’s program was 56 percent accurate — 10 times more accurate than the next best program.

To test the formula more directly, Beer worked with Andrew McCallion, Ph.D., an associate professor at the McKusick-Nathans Institute of Genetic Medicine at the Johns Hopkins University School of Medicine, to predict the impact of mutations in the control regions for two pigment-related genes in mouse melanocytes (skin pigment cells). They then selected 40 mutations with different levels of predicted impact and tested their effect in melanocytes grown in the laboratory. When they measured the activity levels of the two genes, they found that there was a strong correlation between the program’s prediction and the actual change experienced by the cells.
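A comparison of this kind is commonly summarized with a correlation coefficient. The sketch below computes a Pearson correlation between predicted and measured effects; the article does not say which statistic the team used, and the numbers are invented purely for illustration, not data from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values: predicted score change vs. measured change in gene activity.
predicted = [-3.0, -2.0, -1.0, 0.0, 0.5]
measured = [-1.4, -1.1, -0.3, 0.1, 0.2]

r = pearson(predicted, measured)
print(f"Pearson r = {r:.2f}")
```

A value of r close to 1 would indicate that mutations with larger predicted effects also showed larger measured changes.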

Predicting the effects of undecipherable mutations

“My group has been working for over a decade to shed some light on the nature of regulatory mutations in common disease,” McCallion says. “The synergy of our careers and our strategies bring the Beer group and mine to an exciting place in this effort. By training the computer program with the right cellular material, we can now predict the consequences of previously undecipherable regulatory sequence mutations.”

Beer and his team repeated this targeted testing of their formula in mouse and human liver cells and in human leukemia cells, with similar results. They also tested their formula on three control region mutations already known to affect cholesterol levels, hemoglobin levels and prostate cancer. Again they found that these mutations drew higher computer scores than other mutations in the same control regions.

Finally, the team examined the control regions for T helper cells, a type of immune cell that can contribute to autoimmune diseases when its genes become dysregulated. Their calculations identified 15 different control region mutations associated with nine different immune system disorders, from allergies to multiple sclerosis and Crohn’s disease. Importantly, Beer says, previous studies had associated nine of the same control regions with immune disorders, but they had not been able to home in on the exact mutation that mattered.

Beer says: “The next step is to collect cells from patients with these autoimmune diseases, test their gene activity levels and find out if our predictions were right. If so, it should help us determine how the activity is being perturbed and how we can fix it.” The same process can theoretically be repeated on many other diseases, providing timesaving insights for each.

This work was supported by grants from the National Human Genome Research Institute and the National Institute of Neurological Disorders and Stroke.

Abstract of A method to predict the impact of regulatory variants from DNA sequence

Most variants implicated in common human disease by genome-wide association studies (GWAS) lie in noncoding sequence intervals. Despite the suggestion that regulatory element disruption represents a common theme, identifying causal risk variants within implicated genomic regions remains a major challenge. Here we present a new sequence-based computational method to predict the effect of regulatory variation, using a classifier (gkm-SVM) that encodes cell type–specific regulatory sequence vocabularies. The induced change in the gkm-SVM score, deltaSVM, quantifies the effect of variants. We show that deltaSVM accurately predicts the impact of SNPs on DNase I sensitivity in their native genomic contexts and accurately predicts the results of dense mutagenesis of several enhancers in reporter assays. Previously validated GWAS SNPs yield large deltaSVM scores, and we predict new risk-conferring SNPs for several autoimmune diseases. Thus, deltaSVM provides a powerful computational approach to systematically identify functional regulatory variants.