It’s not often in science that a doctoral dissertation becomes the basis for a lifetime of work. Rarer still is one that becomes the infrastructure for international efforts to build effective databases, used by a community that includes the World Health Organization, eBay and thousands of individuals, and the seed for an active open-source community of developers.
For Mark Musen, MD, PhD, a professor of biomedical informatics, his lifelong commitment to his graduate research at Stanford — an open-source tool for building intelligent computer systems called Protégé — is a source of both pride and embarrassment.
“One can argue that it’s stupid to stick with a project and not get rid of it after 25 years. Most people get bored or they want to move on to something else,” he told me.
But as I found when I visited Musen in his lab recently, he is immensely proud of Protégé, which supports the work of thousands of individuals; Fortune 500 companies, such as eBay and MTV; and intergovernmental bodies, such as the World Health Organization, whose International Classification of Diseases, ICD-10, classifies mortality into nearly 70,000 possible causes.
Protégé is so popular that in April, Musen’s lab registered its 300,000th user and marked the occasion with much fanfare, including a free Protégé T-shirt and other swag for the lucky winner.
Protégé is a software system that helps users build classifications of information. It’s an important component of the infrastructure that supports data science around the world. That might “sound boring,” says Musen, but it’s about as boring as public roads, bridges, airports, electric power lines, water supplies and sewer systems. All of them are boring until they break or do something unexpected. And then they become really interesting.
For example, when the U.S. recently adopted the World Health Organization’s disease classification system, ICD-10, it created an uproar among the country’s health-care providers.
The ICD-10 system includes seemingly bizarre classifications such as “Unspecified balloon accident injuring occupant, sequela,” “Sibling rivalry” or “Sucked into jet engine, subsequent encounter,” not to mention “Other superficial bite of other specified part of neck, initial encounter.” Indeed, the ICD-10 classifications are so outré that many of the new codes have been illustrated in a popular art book titled Struck by Orca.
In the U.S., physicians use ICD-10 codes not only to identify the cause of death on death certificates, but also for billing. In other words, providers use the ICD classification system many times a day, and any reclassification can affect both their income and the burden on office workers. One survey showed that 38 percent of health-care providers believed the new ICD-10 codes would reduce their income, compared with only 6 percent who thought ICD-10 would increase it. Now WHO is using Protégé to build an entirely new classification system, ICD-11, which will have its initial release in 2018.
With 300,000 users, Protégé is put to use in thousands of other ways. At Stanford, for example, it’s used as a tool for building computer systems that support clinical decisions. It is used by eBay to create merchandise catalogs, by MTV to organize its video library, and by many other companies to build ontologies that describe different kinds of content.
There will be a lot more on ways to classify, store and share massive amounts of data at the upcoming Big Data in Biomedicine conference, at which Musen spoke last year. This year’s event will be held May 25 and 26 and will focus on precision health.