Rethinking large language models in medicine

Stanford Medicine researchers and leaders discuss the need for medical and health professionals to shape the creation of large language models.

Hanae Armitage | August 7, 2023 | February 2, 2024

By now, you've probably turned to a large language model (think ChatGPT) to shoulder a linguistic burden. Maybe you tasked it with conjuring a heartfelt letter, or had it summarize a long transcript.

A large language model, or LLM, is a form of artificial intelligence that can generate human-like text, and the technology has exploded in popularity over the past year. As the world taps into its wealth of generative potential, doctors and medical researchers are asking themselves, "Are LLMs right for me?"

"It's easy to get caught up in the excitement of large language models," said Nigam Shah, PhD, MBBS, chief data scientist for Stanford Health Care. "We call it 'LLM bingo,' where people check off what these models can and can't do. 'Can it pass the medical exams? Check. Can it summarize a patient's history? Check.' While the answer may be yes on the surface, we're not asking the most important questions: 'How well is it performing? Does it positively impact patient care? Does it increase efficiency or decrease cost?'"

Those questions are critical, said Shah, who also is a professor of medicine and of biomedical data sciences. In a recent piece published by JAMA, Shah, along with Stanford Health Care's president and CEO, David Entwistle, and its chief information officer, Michael Pfeffer, MD, discussed how the medical field can best harness LLMs -- from evaluating the accuracy of algorithms to ensuring they fit with a provider's workflow. Simply put, the medical field needs to start shaping the creation and training of its own LLMs and not rely on those that come pre-packaged from tech companies.

It works, but is it accurate?

Large language models function as a sort of input-output machine. A text prompt goes in (say, "What should I do if I sprain my ankle?") and an answer comes out (such as "Rest, ice, compression, elevation and over-the-counter pain medications"). That output, generated in a few seconds, comes from an algorithm that has condensed patterns from billions of data points and uses them to produce the most probable answer, based on the available information. In the case of large language models like ChatGPT, that information was drawn from a vast swath of the internet.
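To make the "most probable answer" idea concrete, here is a deliberately tiny sketch. It is not how ChatGPT or any production model is built; it only counts which word tends to follow another in a made-up handful of sentences and returns the likeliest one. The corpus and the function name `most_probable_next` are invented for illustration.

```python
from collections import Counter, defaultdict

# A tiny, made-up corpus (an assumption for illustration only).
# A real LLM learns from vastly more text with a neural network.
corpus = (
    "rest the ankle . ice the ankle . compress the ankle . "
    "elevate the ankle . rest the knee ."
).split()

# Count how often each word follows each other word.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1


def most_probable_next(word):
    """Return (next_word, estimated_probability), or None if the word is unseen."""
    counts = following.get(word)
    if not counts:
        return None
    total = sum(counts.values())
    best, n = counts.most_common(1)[0]
    return best, n / total


print(most_probable_next("the"))
```

In this toy corpus "the" is followed by "ankle" four times out of five, so that is the word the sketch picks. The point is only that the answer reflects whatever text the model was given, which is why the source of that text matters.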

While that might be helpful for some questions, for medical purposes such as diagnosing an illness or summarizing patient electronic health records, it just won't do, Shah said. Crunching terabytes of data from the World Wide Web won't yield reliable answers needed to guide therapies or detect a suspicious mark on a scan. "Would you want your medical advice based on what's floating around Reddit?" he asked.

Probably not. Instead, the team suggested flipping the script. "Come with a use case, like identifying whether an X-ray is normal or abnormal, or condensing electronic health records into easily digestible summaries, then use de-identified patient data to train the model from the ground up," Shah said.

Shah said it's almost like training a new intern: "You teach an intern how to do a specific task. Then you check their work, and then they operate with minimal supervision."
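The "check their work" step can be sketched in a few lines, taking the normal-versus-abnormal X-ray use case as an example. This is a minimal sketch, not the Stanford team's evaluation method: the labels below are made up, and the function name `evaluate` is hypothetical. It compares a model's calls with a clinician's reference calls and reports how often the model catches abnormal scans (sensitivity), how often it correctly clears normal ones (specificity), and its overall accuracy.

```python
def evaluate(predicted, reference, positive="abnormal"):
    """Compare model labels with reference labels for a two-class task."""
    pairs = list(zip(predicted, reference))
    tp = sum(p == positive and r == positive for p, r in pairs)  # caught abnormal
    fn = sum(p != positive and r == positive for p, r in pairs)  # missed abnormal
    tn = sum(p != positive and r != positive for p, r in pairs)  # cleared normal
    fp = sum(p == positive and r != positive for p, r in pairs)  # false alarm
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        "specificity": tn / (tn + fp) if (tn + fp) else None,
        "accuracy": (tp + tn) / len(pairs) if pairs else None,
    }


# Made-up labels for eight scans (illustrative only, not real results).
reference = ["abnormal", "normal", "abnormal", "normal",
             "normal", "abnormal", "normal", "normal"]
predicted = ["abnormal", "normal", "normal", "normal",
             "abnormal", "abnormal", "normal", "normal"]

result = evaluate(predicted, reference)
print(result)
```

With these invented labels the model misses one of three abnormal scans and raises one false alarm, the kind of detail a simple "can it do the task?" checkbox would hide.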

As with any new tool or technology, it is incumbent on those who implement it to ensure it functions smoothly and accurately. "The question isn't, 'How will LLMs change medicine,'" Shah said. "It's, 'What do we in health care need to do to better build and evaluate these models so that they're useful for medicine?'"

Making a reliable LLM

As researchers and doctors shape LLMs into credible, dependable tools, Shah and others already see some reliable use cases in related fields, such as generating practice questions for the U.S. Medical Licensing Examination. "Students spend thousands of dollars on these practice questions, and we showed that there are publicly available LLMs that can create reliable, accurate practice questions. Teaching institutions and students could save a lot of money if they didn't have to buy those," Shah said.

Regardless of direction, Shah emphasized the need to balance innovation with responsible use in medicine by encouraging questions such as "Who made this? What data was it trained on? Is that data applicable across multiple populations?"

"We also have to know and be ready to address AI hallucinations," Shah said. "We know that language models make mistakes -- they literally make everything up. We need to figure out how to spot those inaccuracies and preempt them, because in health care you can't afford factual errors."
