Last week, as the 2014 Big Data in Biomedicine conference came to a close, a related story about the importance of computing across disciplines was posted on the Stanford University homepage. The article describes research making use of the new Stanford Research Computing Center, or SRCC (which we blogged about here). We're now running excerpts from that piece about the role computation, as well as big data, plays in medical advances.
The human genome is essentially a gigantic data set. Deep within each person's 6 billion data points are minute variations that tell the story of human evolution, and provide clues to how scientists can combat modern-day diseases.
To better understand the causes and consequences of these genetic variations, Jonathan Pritchard, PhD, a professor of genetics and of biology, writes computer programs that can investigate those linkages. “Genetic variation affects how cells work, both in healthy variation and in response to disease, which ultimately regulates organism-level phenotypes,” Pritchard says. “How natural selection acts on phenotypes, that’s what causes evolutionary changes.”
Consider, for example, variation in the gene that codes for lactase, an enzyme that allows mammals to digest milk. Most animals don’t express lactase after they’ve been weaned from their mother's milk. In populations that have historically revolved around dairy farming, however, Pritchard's algorithms have shown that there has been strong long-term selection for expressing the genes that allow people to process milk. There has been similarly strong selection in non-Africans for skin pigmentation that allows better synthesis of vitamin D in regions where people are exposed to less sunlight.
The methods used in these types of investigations have the potential to yield powerful medical insights. Studying variations in gene regulation within a population could reveal how and where particular proteins bind to DNA, or which genes are expressed in different cell types, information that could be applied to design novel therapies. These inquiries can generate hundreds of thousands of data sets, which can only be parsed with clever algorithms and machine learning.
Pritchard, who is also a Stanford Bio-X affiliate, is bracing for an even bigger explosion of data; as genome sequencing technologies become less expensive, he expects the number of individual genomes to jump by as much as a hundredfold in the next few years. “There are not a lot of problems that we're fundamentally unable to handle with computers, but dealing with all of the data and getting results back quickly is a rate-limiting step,” Pritchard says. “Having access to SRCC will make our inquiries go easier and more quickly, and we can move on faster to making the next discovery.”