When a large language model first passed the United States Medical Licensing Exam in 2023, it was a big deal. But two years later, what was once a notable milestone in artificial intelligence progress is more of a bare minimum.
"It's not enough for a large language model to simply answer medical test questions accurately," said Nigam H. Shah, MBBS, PhD, chief data scientist at Stanford Health Care. "That type of evaluation doesn't tell us anything about what matters."
In other words, Shah said, it says nothing about how a model might perform in a real-world clinic or hospital.
That's why Shah and a team of researchers have devised a framework to help fill that gap. It's called MedHELM; the HELM part stands for holistic evaluation of language models. It's a resource for accurate and reliable evaluations of LLMs, supporting the core principles that power the RAISE Health Initiative.
Shah discussed how MedHELM is best used, how it can help researchers adapt and create their own models, and why it's crucial to evaluate LLMs in the context in which they're used. This interview was edited for clarity and length.
How did your team develop MedHELM?
Stanford's Center for Research on Foundation Models had already developed HELM, an LLM evaluation infrastructure that allows people in any domain to create various scenarios in which AI is being used. For instance, one could create a prompt for an LLM within the HELM framework that says, "Summarize this patient's medical record." A successful summary of that patient's information needs to already exist, giving you the gold-standard answer. You could then evaluate how well the model's answer agreed with the gold standard. That's a complete scenario.
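To make the scenario idea concrete, here is a minimal sketch in Python of what a single scenario and its evaluation could look like. This is not HELM's actual API; the prompt, patient record, gold-standard summary, and the crude token-overlap score are all illustrative assumptions.

```python
# A minimal sketch of a single evaluation scenario -- not HELM's actual API.
# The prompt, patient record, and gold-standard summary are hypothetical.

def token_f1(prediction: str, reference: str) -> float:
    """Crude agreement score: F1 over the sets of overlapping word tokens."""
    pred = set(prediction.lower().split())
    ref = set(reference.lower().split())
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

scenario = {
    "prompt": "Summarize this patient's medical record:\n{record}",
    "record": "58-year-old with type 2 diabetes, admitted for chest pain...",
    "gold_standard": "58-year-old diabetic admitted with chest pain.",
}

def evaluate(model_fn, sc) -> float:
    """Run one scenario: prompt the model, score agreement with the gold standard."""
    answer = model_fn(sc["prompt"].format(record=sc["record"]))
    return token_f1(answer, sc["gold_standard"])

# model_fn would wrap a real LLM API call; a stub stands in here.
print(evaluate(lambda prompt: "58-year-old diabetic admitted with chest pain.", scenario))
```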
We took the HELM infrastructure and created more than 120 scenarios across 22 categories of health-related tasks related to clinical decision support, clinical note creation, patient communication and education, assisting with medical research, and administrative support. We tested all scenarios on six separate foundation models, including OpenAI, Llama and Gemini models, among others. The outcome of running all these scenarios is a stratification of which models perform the best for specific scenarios in health and medicine.
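Continuing the sketch above, that stratification step is essentially a loop over model-scenario pairs followed by a per-scenario ranking. The model names and stub responses below are placeholders, not the actual systems or results the team tested.

```python
# Sketch of per-scenario stratification, reusing evaluate() and scenario
# from the previous snippet. Model names and outputs are placeholders.

scenarios = {
    "note_summarization": scenario,
    # ...the real benchmark spans 120+ scenarios across 22 task categories
}

models = {
    "model_a": lambda prompt: "58-year-old diabetic admitted with chest pain.",
    "model_b": lambda prompt: "Patient has diabetes.",
}

# Score every model on every scenario, then rank best-first per scenario.
for name, sc in scenarios.items():
    scores = {m: evaluate(fn, sc) for m, fn in models.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    print(name, ranked)
```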
What value does MedHELM bring to the health care community?
About 95% of LLM evaluations that are reported in the literature are not done using electronic health record data -- and that context is really important. MedHELM does the back-end work that pulls in relevant datasets and executes hypothetical, but common, use cases for how people in health and medicine might use an LLM to inform their work. It then shows us which of six commonly available models performs best for a given task.
Computer scientists and AI researchers can use that information to start building their own models using the strongest, most accurate LLM available for their specific use case. It's important to make a distinction: MedHELM doesn't approve models or guarantee a specific level of accuracy -- it's a tool that helps point users to the best starting place for developing a model specific to their needs.
Can you give an example of how this might work in practice?
If you're building a language model today, you start by asking, "Which model do I want to use as my base?" Say you're hoping to build a model that identifies alcohol dependence from the patient's medical history. You can sift through the existing evaluations and find a scenario that most closely matches your task, then use that information to build your model. At this point we've evaluated only six LLM foundation models, but we've noticed larger trends.
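In code, that sifting step might amount to filtering an evaluation table for scenarios resembling your task and taking the top-scoring base model. The scenario names, model names, and scores below are invented placeholders, not actual MedHELM results.

```python
# Sketch of choosing a base model from published evaluations. All scenario
# names, model names, and scores are invented placeholders -- not MedHELM data.

leaderboard = [
    {"scenario": "substance-use screening from patient history", "model": "model_a", "score": 0.81},
    {"scenario": "substance-use screening from patient history", "model": "model_b", "score": 0.74},
    {"scenario": "discharge note summarization", "model": "model_b", "score": 0.88},
]

# Crude relevance filter for an alcohol-dependence detection task:
# keep scenarios that share key terms with the task description.
keywords = ("substance", "screening", "history")
candidates = [e for e in leaderboard if any(k in e["scenario"] for k in keywords)]

best = max(candidates, key=lambda e: e["score"])
print(f"Start from {best['model']} (scored {best['score']} on '{best['scenario']}')")
```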
For example, some models appear to perform worse than they actually can because guardrails prevent them from answering certain types of questions. Some companies have safety instructions that produce responses like, "This is a medical question, and I'm not authorized to answer." It's overly cautious, but you can remove those limitations and performance will improve.
What's next for MedHELM?
We plan to add more models and more scenarios to make it even more powerful for users. We hope researchers contribute their own datasets to bolster the infrastructure, and we encourage users who don't see their task represented to create their own scenarios and add to the mix. Already, we're receiving positive feedback and requests from external labs and companies to adopt the framework and contribute datasets.
