"If our genes represent the blueprint of what all cells in our bodies can do, the proteome represents everything the cells actually are doing," said Elias, assistant professor of chemical and systems biology.
A proteome is the set of proteins in a given location in the body at a given time. Since proteins carry out most of the body's business (for example, fighting a flu infection, contracting as part of heart muscle, or helping cells divide), a readout of every protein on the job would be very telling.
Until now, it's been hard to discern which proteins are being produced, in part because there are millions of them in any given cell, and each can have thousands of possible variations.
Elias and his team have recently developed a computational tool that gives researchers a way to quickly identify each protein and distinguish between even slight variations. It's a new statistical method for analyzing data produced by instruments called mass spectrometers, the most widely used tool for identifying the proteins in a sample en masse.
As a result, researchers now have a way to answer some of biology's especially vexing questions. Elias, for example, plans to use it to see which proteins are produced by the microbes that live within our gut, and which proteins help the immune system detect cells that are infected or mutated.
Elias and his collaborators describe the method in a report published in the journal Nature Biotechnology.
The analysis starts with the proteins from a sample, such as blood.
The researcher adds an enzyme to cut each protein into smaller fragments, referred to as peptides, which are then fed into a mass spectrometer. The instrument breaks these peptides into even smaller pieces, sorts them according to their weight (strictly, their mass-to-charge ratio) and presents a readout of all these pieces. This mass spectrum serves as a sort of fingerprint that can be interpreted to reveal the identity and amount of the original protein in the blood.
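To make the cutting and weighing steps concrete, here is a minimal Python sketch. It assumes the enzyme is trypsin (the article does not name one) and uses the common simplified rule that trypsin cuts after lysine (K) or arginine (R) unless the next residue is proline (P). The protein sequence and the function names `digest` and `peptide_mass` are made up for illustration.

```python
# Monoisotopic masses of amino acid residues, in daltons (standard reference values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water molecule is added per free peptide


def digest(protein):
    """Cut after K or R unless followed by P (simplified trypsin rule, an assumption)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # trailing piece with no cut site
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide: residue masses plus one water."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


if __name__ == "__main__":
    for pep in digest("AAKGGRPLKSS"):  # hypothetical sequence
        print(pep, round(peptide_mass(pep), 4))
```

A real instrument reports mass-to-charge ratios for many fragments of each peptide, so this only illustrates the bookkeeping behind the "fingerprint", not the measurement itself.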
To identify the proteins, many researchers use a statistical method called target-decoy error estimation, which Elias and colleagues developed. The software that interprets these fingerprints gets the answers wrong quite often, Elias told me, so the method is used to estimate which protein identifications are actually correct.
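The idea behind target-decoy error estimation can be sketched in a few lines of Python. In the usual form of the technique, spectra are searched against real ("target") sequences plus deliberately wrong ("decoy") sequences, such as reversed ones. Any decoy hit is known to be wrong, so the number of decoy hits above a score cutoff approximates the number of wrong target hits above it. The scores below and the function names are hypothetical, and real tools differ in the details.

```python
def estimate_fdr(matches, threshold):
    """Estimate the false discovery rate among target hits scoring >= threshold.

    `matches` is a list of (score, is_decoy) pairs. The decoy count stands in
    for the unknown number of incorrect target hits.
    """
    targets = sum(1 for score, is_decoy in matches if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in matches if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


def lowest_threshold(matches, max_fdr=0.01):
    """Return the most permissive score cutoff whose estimated FDR is within max_fdr."""
    best = None
    for score in sorted({s for s, _ in matches}, reverse=True):
        if estimate_fdr(matches, score) <= max_fdr:
            best = score
    return best


if __name__ == "__main__":
    # Made-up search results: (score, is_decoy)
    matches = [(9.0, False), (8.0, False), (7.0, True),
               (6.0, False), (5.0, False), (4.0, True)]
    print(estimate_fdr(matches, 5.0))
    print(lowest_threshold(matches, max_fdr=0.01))
```

Researchers then keep only the identifications above the chosen cutoff, accepting a small, known proportion of errors.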
But that method has some weaknesses. Particularly challenging is efficiently identifying slight variations, especially additions of small molecules to the larger protein molecule.
"You can think of these modifications as decorations -- decorations that change how proteins associate with one another," said Elias.
Elias discussed a strategy to solve this problem with graduate student Arun Devabhaktuni, who wrote the code to carry out the new analysis on protein molecules. The model, which they've named TagGraph, makes use of a strategy pioneered in genome sequencing. It starts with a first guess at the identity of a molecule, in this case a peptide, based on the masses of its fragments. This is called de novo sequencing. "We know that guess will be mostly wrong, but that's OK," said Elias.
Next, the de novo sequence is fed into an algorithm that searches for matches with known peptides, or protein pieces. Usually the algorithm will indicate just a few close matches. From there, the new computational tool proposes new possibilities for the peptide's identity.
When the mass of the altered de novo sequence matches the mass of the hypothetical match, they've hit the jackpot, and they claim they've found a correct match. And instead of taking months to complete, the analysis takes only hours. The program uses Bayesian statistics, which Elias described as "a fancy way of saying 'how much do I believe in this result, given everything I know about it and the overall experiment.'"
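Two of the ideas in this step, matching masses and weighing belief, can be illustrated with a toy Python sketch. This is not TagGraph's actual algorithm, which the article does not spell out. It assumes that a gap between a measured peptide mass and a candidate's mass can be compared against a short table of well-known modification masses, and it applies Bayes' rule to a single piece of evidence. The function names, tolerance and probabilities are hypothetical.

```python
# Monoisotopic mass shifts of a few common modifications, in daltons.
KNOWN_SHIFTS = {
    "unmodified": 0.0,
    "methylation": 14.01565,
    "oxidation": 15.99491,
    "acetylation": 42.01057,
    "phosphorylation": 79.96633,
}


def explain_delta(observed_mass, candidate_mass, tol=0.01):
    """Name the modification whose mass shift accounts for the gap, or None."""
    delta = observed_mass - candidate_mass
    for name, shift in KNOWN_SHIFTS.items():
        if abs(delta - shift) <= tol:
            return name
    return None


def posterior(prior, p_evidence_if_correct, p_evidence_if_wrong):
    """Bayes' rule: belief that a match is correct after seeing one piece of evidence."""
    numerator = prior * p_evidence_if_correct
    return numerator / (numerator + (1 - prior) * p_evidence_if_wrong)


if __name__ == "__main__":
    # Hypothetical peptide measured about 79.966 Da heavier than its database candidate.
    print(explain_delta(212.01981, 132.05348))
    # Start undecided (prior 0.5); the evidence is nine times likelier if the match is right.
    print(posterior(0.5, 0.9, 0.1))
```

The second function is the "how much do I believe in this result" calculation in miniature: strong evidence moves an undecided starting belief most of the way toward certainty.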
In the recent paper they described using TagGraph to re-analyze a human proteomic dataset of 25 million mass spectra first published in 2014.
"Some of the beauty of this paper is we started off with mass spectrum data that someone else had acquired -- from 30 different tissues from humans -- and were able to identify thousands of modification types and triple the protein identifications compared to the original analysis," said Elias. "And there are many terabytes of old mass spec data available. We could reanalyze them and discover a lot more."
Elias has made the software available for download at no cost from SourceForge.
The image, by A. Valm, S. Cohen, J. Lippincott-Schwartz, National Institute of Child Health and Human Development, NIH, shows a cell treated so that many of its proteins glow in different colors.