Researchers create guide for fair and equitable AI in health care

Over the past several decades, health care systems have amassed giant stores of patient data through electronic health records, logging disease-linked genetic aberrations, drug interactions, success rates of cancer therapies and more.

Now, doctors and researchers with access to this trove of health information are at an inflection point. Through artificial intelligence, health systems can tap these once-amorphous aggregates of data to predict health outcomes for patients.

It's time to put that technology to work widely -- in ways that prioritize conscientious protocols designed to prevent bias in data gathering and use in patient care, said Nigam Shah, MBBS, PhD, a Stanford Medicine professor of biomedical data science and of medicine.

"Broadly speaking, automated detection of disease or evaluation of a set of symptoms could save health care workers time and hospitals money, all while maintaining a standard of excellent care," said Shah, who was recently appointed Stanford Health Care's inaugural chief data scientist. "The data are there, and the incentive to use it is there, and that creates this sense of urgency to deploy AI. It's that eagerness that means we have to implement guidelines to ensure we're doing it the right way, with fairness and equity at the forefront."

Algorithms are not yet used routinely in patient care, but this may soon change as the dribs and drabs of machine learning swell and flood the health care system. So the time to get these guidelines right is now.

Using artificial intelligence in the clinic means harnessing automated recognition of patterns in health care data to inform diagnoses and predict medical outcomes. Did cancer patients with a certain set of mutations fare better on drug A or drug B? Do certain characteristics of an MRI signal symptoms of a condition? Does a lesion on the skin look cancerous?

Among those questions, an even larger one looms: Are the outputs of clinical algorithms fair and accurate for everyone? Shah and others are putting that question under a microscope, honing a set of standard principles that can guide any algorithm to be used in patient care -- something that's yet to be done.

It's not for a lack of trying -- artificial intelligence researchers feel an immense sense of responsibility when devising and implementing algorithms that impact human health, said Shah. The hitch is in the "how."

I spoke with Shah about this challenge and the discourse around standards that prioritize equity and fairness for clinical care algorithms, and about the solutions he and others are proposing. The following Q&A is based on our conversation.

What does it mean for an algorithm to be fair and how does one measure that?

Through the Coalition for Health AI, I'm working with a group of researchers from several institutions, including the Mayo Clinic and Duke and Johns Hopkins universities, to set and streamline guidelines for AI fairness.

Just like in regular clinical care, care informed by an algorithm must result in all patients being treated equitably. So that means ensuring that algorithms aren't biased toward certain demographics and that the data we use to train the algorithms are sufficiently inclusive. For instance, does the algorithm perform the same for men and women, and for Black and non-Black patients? Ideally, we want a model that's calibrated for all patient subgroups, too. That means the models perform with the same dependability in a primary care setting as they do in an oncology clinic.

To help evaluate this, we do something called a "fairness audit." A financial audit basically asks if your credits and debits balance. An AI fairness audit instead asks if the way the algorithm performs is balanced across patient demographics, performing with the same accuracy for any and all people who may benefit from it.

If the algorithm does not create a systematic difference in the way care is assigned -- say, recommending the prescription of a statin -- then it can be dubbed "fair."
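A fairness audit of the kind described above can be pictured with a small Python sketch. This is an illustration only, not the audit procedure used by Shah's team or the Coalition for Health AI: the function names, the 0.5 decision threshold, the record layout and the toy data are all assumptions made for the example. It compares accuracy, the rate at which care is recommended, and a crude calibration check (mean predicted risk versus observed event rate) across subgroups.

```python
from collections import defaultdict


def fairness_audit(records, threshold=0.5):
    """Summarize model behavior per subgroup.

    records: iterable of (group, predicted_risk, outcome), outcome is 0 or 1.
    threshold: hypothetical cutoff above which care is recommended.
    """
    stats = defaultdict(
        lambda: {"n": 0, "correct": 0, "flagged": 0, "risk_sum": 0.0, "events": 0}
    )
    for group, risk, outcome in records:
        s = stats[group]
        flagged = risk >= threshold
        s["n"] += 1
        s["correct"] += int(flagged == bool(outcome))
        s["flagged"] += int(flagged)
        s["risk_sum"] += risk
        s["events"] += outcome

    report = {}
    for group, s in stats.items():
        n = s["n"]
        report[group] = {
            "accuracy": s["correct"] / n,
            "flag_rate": s["flagged"] / n,  # how often care is recommended
            "mean_predicted_risk": s["risk_sum"] / n,  # compare with next line
            "observed_event_rate": s["events"] / n,
        }
    return report


def max_gap(report, metric):
    """Largest difference in a metric between any two subgroups."""
    values = [group_metrics[metric] for group_metrics in report.values()]
    return max(values) - min(values)


# Toy, made-up records: (subgroup label, predicted risk, observed outcome).
records = [
    ("A", 0.9, 1), ("A", 0.2, 0), ("A", 0.7, 1), ("A", 0.1, 0),
    ("B", 0.8, 1), ("B", 0.6, 0), ("B", 0.3, 1), ("B", 0.2, 0),
]
report = fairness_audit(records)
accuracy_gap = max_gap(report, "accuracy")
flag_rate_gap = max_gap(report, "flag_rate")
```

In this toy data both subgroups are recommended care at the same rate, yet the model is less accurate for subgroup B, which is the sort of imbalance an audit is meant to surface. How large a gap is acceptable is a judgment the sketch does not make.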

What are the challenges in creating new guidelines for broad implementation of AI in health care and what conclusions did you reach?

We started looking at the existing guidance from the research community -- basically the accepted research commandments for AI. What shalt thou do to achieve responsible AI in health care? And we analyzed 16 or so publications that collectively proposed 220 things that thou shalt do. And let's just be honest -- no one is going to do 220 things every time. It's just not feasible. There was scant focus on fairness and almost none on how to assess usefulness.

So instead, we asked, how many of these 220 things is it reasonable to ask researchers to adhere to? Which are the elements that scientists most often include when reporting data in papers? Which do we agree are the most important? In the end, there were about 12 recommendations that were most commonly asked for and, fortunately, commonly reported in manuscripts, too.

Broadly speaking, this analysis informed the design of something we now call the FURM assessment, in which we seek to assess whether we are providing fair, useful and reliable model-guided care. To assess usefulness, we created a simulation-based framework that also factors in work capacity limits.

What kinds of algorithms are in use at Stanford Health Care and how do you ensure that they're fair?

We have two algorithms -- both operating under IRB approval -- helping to guide clinical care. The first is one that my team devised, and it's a mortality prediction tool to help doctors predict who might be at risk of dying in the next year and hence might benefit from having a goals-of-care conversation.

The second is an algorithm created by radiation oncologist and professor Michael Gensheimer, MD, that predicts survival for metastatic cancer patients based on certain treatment or drug regimens. It essentially helps the physician choose the treatment path that is most likely to help the patient survive the longest.

For both of these, and any algorithms in the future, we start by running the algorithm in "silent mode," which allows us to examine how it is performing without having it impact patient care. We essentially run it in the background and ensure that it's producing outcomes -- whether that's a treatment recommendation or a prediction of hospital length of stay -- that are equally correct across different subgroups. Once we've convinced ourselves that the algorithm's guidance is sufficiently reliable, we can deploy it in clinic and continue to monitor its utility.
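The "silent mode" idea can be sketched in a few lines of Python. Everything here is hypothetical: the `SilentModeRunner` class, the toy age-based model and the patient records are invented for illustration and do not describe Stanford Health Care's systems. The point is only the shape of the process: predictions are logged but never shown to clinicians, and once real outcomes are known, correctness is compared across subgroups.

```python
class SilentModeRunner:
    """Run a model in the background: log predictions, surface nothing."""

    def __init__(self, model):
        self.model = model  # any callable mapping features to a prediction
        self.log = []

    def observe(self, patient_id, group, features):
        prediction = self.model(features)
        self.log.append(
            {"patient_id": patient_id, "group": group,
             "prediction": prediction, "outcome": None}
        )
        return None  # deliberately nothing is returned to the care team

    def record_outcome(self, patient_id, outcome):
        """Attach the real-world outcome once it is known."""
        for entry in self.log:
            if entry["patient_id"] == patient_id:
                entry["outcome"] = outcome

    def accuracy_by_group(self):
        """Share of correct predictions per subgroup, skipping open cases."""
        totals, correct = {}, {}
        for entry in self.log:
            if entry["outcome"] is None:
                continue
            group = entry["group"]
            totals[group] = totals.get(group, 0) + 1
            correct[group] = correct.get(group, 0) + int(
                entry["prediction"] == entry["outcome"]
            )
        return {group: correct[group] / totals[group] for group in totals}


def toy_model(features):
    """Made-up stand-in for a real model: flags patients aged 80 or older."""
    return features["age"] >= 80


runner = SilentModeRunner(toy_model)
surfaced = runner.observe("p1", "A", {"age": 85})
runner.observe("p2", "A", {"age": 60})
runner.observe("p3", "B", {"age": 82})
runner.observe("p4", "B", {"age": 70})
runner.observe("p5", "B", {"age": 90})  # outcome not yet known

runner.record_outcome("p1", True)
runner.record_outcome("p2", False)
runner.record_outcome("p3", False)
runner.record_outcome("p4", False)

silent_accuracy = runner.accuracy_by_group()
```

Keeping `observe` from returning the prediction is the design choice that makes the run "silent": the log can be audited later, but nothing the model says can influence care while it is still being evaluated.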

We deemed both algorithms in use at Stanford Health Care fair during their "silent" phase, and recently completed a fairness audit using our new guidelines for responsible implementation of AI. Preliminary results from these new audits confirm our confidence.

Is a biased algorithm harmful? Can it be fixed?

First, you always want to ensure that your data are inclusive and representative. But when a case of bias arises, we look for the source of the bias. I ask a two-part question: Is there a difference in the numeric output that the algorithm produces? And how big is the impact of those numbers on consequences in the clinic?

That last question is important, because sometimes an algorithm can produce a number that might indeed be technically biased -- but it doesn't change the recommendation for patient care.

For example, a person's risk score for having a heart attack during the next 10 years is based on several different data points, all collected and analyzed in a study called the Framingham Heart Study, which began with a cohort that was mostly white men and was subsequently updated to include three additional cohorts. However, it's well known that heart attack risk scores aren't as well-calibrated for Asian people, Black people and women as they are for white men.

But it doesn't create a fairness issue because the bias in the risk score is largely negligible. To put it into context, if a score of 7 means the patient should be prescribed statins to reduce their risk of heart disease, and an Asian person has a score of 7.3 with the uncalibrated algorithm, and a score of 7.5 with an adjusted algorithm, their clinical outcome will still be the same: They receive statins.

It's when the bias in numbers changes a treatment or care protocol among subgroups that algorithms need to be rethought or retrained with more data. That's a tractable problem.
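The statin example reduces to a simple check, sketched here in Python. The function names are invented for illustration, and the rule "a score of 7 or higher leads to a statin prescription" is an assumed reading of the example, not a clinical guideline. The 6.8 and 7.2 pair is a made-up contrast case that does not come from the interview.

```python
STATIN_THRESHOLD = 7.0  # assumed cutoff taken from the example in the text


def recommends_statin(score, threshold=STATIN_THRESHOLD):
    """Return True when the risk score meets the (assumed) treatment cutoff."""
    return score >= threshold


def bias_changes_care(uncalibrated_score, adjusted_score,
                      threshold=STATIN_THRESHOLD):
    """True only when the two scores lead to different recommendations."""
    return (recommends_statin(uncalibrated_score, threshold)
            != recommends_statin(adjusted_score, threshold))


# Example from the text: 7.3 uncalibrated vs. 7.5 adjusted.
same_decision = not bias_changes_care(7.3, 7.5)

# Hypothetical contrast: the scores fall on opposite sides of the cutoff.
different_decision = bias_changes_care(6.8, 7.2)
```

In the first case both scores sit above the cutoff, so the numeric bias has no effect on care. In the second, the same size of numeric shift flips the recommendation, which is the situation that calls for rethinking or retraining the algorithm.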

The next questions that are starting to arise are: Given the adoption of algorithms into care, how will it change the doctor-patient relationship? Will it fragment care or make it more cohesive? And do health care systems have the capacity to achieve the benefit that seems to exist on paper?

These are the big questions that the Stanford Health Care Data Science team is tackling now and in the future.

The Coalition for Health AI is funded in part by the Gordon and Betty Moore Foundation.

Stanford Health Care's Program for AI was initiated by a gift from Mark and Debra Leslie.

Photo by Chor muang