In a first demonstration of the power of really big data, Stanford bioinformatician Nigam Shah, MBBS, PhD, and an international team asked some simple questions about health using the electronic medical records of 250 million patients.
The team — a multi-institution collaboration called the Observational Health Data Sciences and Informatics program, which is centered at Columbia University — came up with surprising results and likely set a record for the largest medical dataset ever studied.
“These kinds of numbers are going to become increasingly common,” Shah told me. “Most networks like this are already in the 100-million range.”
The current 250-million-patient, four-country study, publishing today in PNAS, combined data from 11 sites, including the Stanford Translational Research Integrated Database Environment (2 million patients), Columbia University Medical Center (4 million patients), MarketScan Commercial Claims and Encounters (119 million patients) and General Electric Centricity (33 million patients).
The researchers directed each of the 11 sites to identify the treatment protocols — the ordered sequence of medications a patient was prescribed — for each patient diagnosed with hypertension, Type 2 diabetes and depression so they could map the different ways patients were being treated.
Amazingly, 24 percent of hypertension patients received a sequence of medications that was unique. That is, thousands of patients were given a combination of medications received by no other patient.
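The analysis behind this finding amounts to counting how many patients' ordered medication sequences appear exactly once in the dataset. A minimal sketch of that calculation, using made-up patient IDs and drug names (not the study's actual data), might look like this:

```python
from collections import Counter

# Hypothetical example data: each patient's ordered sequence of
# prescribed hypertension medications (names are illustrative).
patient_sequences = {
    "p1": ("lisinopril", "amlodipine"),
    "p2": ("lisinopril", "amlodipine"),
    "p3": ("amlodipine", "lisinopril"),   # same drugs, different order
    "p4": ("hydrochlorothiazide",),
}

# Count how often each exact ordered sequence occurs.
counts = Counter(patient_sequences.values())

# A patient's treatment is "unique" if no other patient shares
# the same drugs in the same order.
unique = [p for p, seq in patient_sequences.items() if counts[seq] == 1]
share_unique = len(unique) / len(patient_sequences)

print(unique)        # ['p3', 'p4']
print(share_unique)  # 0.5
```

Note that p3 counts as unique even though p1 and p2 received the same two drugs: as Shah points out below, the order of prescription matters, which is part of why the number of distinct treatment pathways grows so large.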
When I asked Shah, “How is that even possible?” he answered, “There are that many different kinds of drugs!” He was quick to add that differences in the order in which drugs were prescribed helped drive up the number of unique treatments. Still, exactly why so many hypertension patients received seemingly idiosyncratic treatment is not yet clear. The high figure might indicate a lack of agreement on the most effective treatment.
Among the diabetes and depression patients, only 10 percent received unique treatment sequences.
A major hurdle the team had to tackle was standardizing how data from the different databases are handled. Shah credited the paper’s second author, Patrick Ryan, PhD, with creating a way to uniformly handle data that had been generated by so many different entities. Shah compared the approach, called the common data model, to a form used for a tax return. “Everybody uses a standard form and you can have all sorts of activities going on — you know, wages, and rental properties and businesses — but you report it all in one format.”
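The tax-form analogy can be sketched in code: each site writes a small converter from its local record format into one shared schema, and every downstream query then runs unchanged across all sites. The field names, site formats, and schema below are illustrative assumptions, not OHDSI's actual common data model:

```python
from dataclasses import dataclass

@dataclass
class CommonRecord:
    """The shared 'tax form': one schema every site reports into."""
    patient_id: str
    condition: str
    drug: str
    start_date: str  # ISO 8601 date string

def from_site_a(row: dict) -> CommonRecord:
    # Hypothetical site A stores drugs under "medication", dates as "rx_date".
    return CommonRecord(row["pid"], row["dx"], row["medication"], row["rx_date"])

def from_site_b(row: dict) -> CommonRecord:
    # Hypothetical site B uses different column names for the same facts.
    return CommonRecord(row["patient"], row["diagnosis"], row["drug_name"], row["date"])

# Once converted, one query works across every site's data.
records = [
    from_site_a({"pid": "a1", "dx": "hypertension",
                 "medication": "lisinopril", "rx_date": "2015-03-01"}),
    from_site_b({"patient": "b7", "diagnosis": "hypertension",
                 "drug_name": "amlodipine", "date": "2015-04-12"}),
]
hypertension_drugs = [r.drug for r in records if r.condition == "hypertension"]
print(hypertension_drugs)  # ['lisinopril', 'amlodipine']
```

The design point is that the per-site mapping is written once, while analysis code is written against the common schema only — which is what lets the team query hundreds of millions of records across institutions with a single piece of code.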
Since 2010, Ryan, an adjunct assistant professor at Columbia and director of epidemiology at Janssen R&D, and others have refined the common data model to the point where the team can now rapidly write computer code to query data from hundreds of millions of patient records, Shah said.
In their next study, Shah and the OHDSI team will examine which treatments resulted in the best outcomes for patients. And this time they hope to be looking at the full set of 650 million patient records available in the common data model.
Assembling the data in the electronic health records of virtually everyone in the world, the researchers write, is now technically feasible.