AI steps into the looking glass with synthetic data
Medical data scientists are using generative AI to create new data from scratch
A few years ago, generative artificial intelligence image technology, including DALL-E and Stable Diffusion, first emerged, allowing anyone to type in whimsical text prompts and produce original images, seemingly from thin air. Intrigued, a group of students and postdocs training under Stanford Medicine radiologist Curtis Langlotz, MD, PhD, and computer scientist Akshay Chaudhari, PhD, decided to see what generative AI might do if asked to create a chest X-ray from scratch. So, they gave it a go.
“What they got back looked a little bit like a chest X-ray, but it really was not anything close to what we would think of as a clinically realistic X-ray,” Langlotz said. “Then the students asked themselves: Can we make it better?”
That thought experiment led Langlotz, professor of radiology, medicine and biomedical data science, Chaudhari, assistant professor of radiology in the department’s integrative biomedical imaging informatics section — and several of their students — to create the RoentGen text-to-image generative model for X-rays. The RoentGen model creates lifelike and convincing X-rays from medically relevant text prompts. Chaudhari and Langlotz published a paper describing RoentGen in August in Nature Biomedical Engineering.
“I can type in, ‘Moderate bilateral pleural effusion and mild pulmonary edema,’” said Langlotz, offering a concrete example of a prompt he might use in the course of his everyday work. “RoentGen will produce medically accurate X-ray images that are nearly indistinguishable from those taken from humans,” even to the trained eyes of medical professionals.
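For readers curious what typing in such a prompt looks like in practice, here is a minimal sketch of how a text-conditioned diffusion model is typically queried using the open-source Hugging Face diffusers library. The checkpoint name is a hypothetical placeholder, not the published RoentGen weights, and the code is a generic illustration of the workflow rather than the team’s own implementation.

```python
# Minimal sketch: prompting a text-to-image diffusion model with a
# radiology-style prompt. The checkpoint name is a placeholder and
# is NOT the published RoentGen model.
import torch
from diffusers import StableDiffusionPipeline

MODEL_ID = "your-org/chest-xray-diffusion"  # hypothetical fine-tuned checkpoint

pipe = StableDiffusionPipeline.from_pretrained(MODEL_ID, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# The kind of clinically phrased prompt Langlotz describes above.
prompt = "Moderate bilateral pleural effusion and mild pulmonary edema"

# Generate one synthetic chest X-ray for the prompt and save it to disk.
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("synthetic_cxr.png")
```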
Data from scratch
RoentGen is a glimpse into the future of medical AI, in which a considerable share of the data used to train new AI models is synthesized, and those models in turn churn out synthetic data to solve problems. This might include helping visualize inoperable cancers or sorting through potential drug candidates to identify only the most promising for further study.
The term synthetic data may sound like an oxymoron or even an impossibility, but to medical AI experts it is very real and very promising, even while introducing ethical and scientific gray areas.
Leaders of this emerging field say that synthetic data could enhance medical AI, helping flesh out incomplete datasets, supplementing data from key demographics to eliminate bias and addressing privacy concerns of patients who fear AI could reveal their personal medical histories — all in a single stroke.
Yet many leaders also urge a go-slow approach as the field evolves, saying medicine must wrestle with synthetic data’s risks before it is too late. Over-reliance on synthetic data could breed a false sense of confidence, said Tina Hernandez-Boussard, PhD, an associate dean of research at the School of Medicine. Hernandez-Boussard and participants in Stanford’s Responsible AI for Safe and Equitable Health (RAISE Health) initiative are among those helping researchers interested in synthetic data think deeply about the ethical and societal implications of this new field.
We talked with a few Stanford Medicine researchers who are tapping into the potential of synthetic data about how it is being used now, where it might lead and what remedies the field might have at its disposal to manage risks as this new frontier in medicine evolves.
X-ray vision
While RoentGen is impressive in its ability to turn text into medically accurate X-ray images, Chaudhari noted that its potential goes well beyond that. RoentGen’s synthetic X-rays could, for instance, be used to correct bias or address patient privacy concerns, he said. If a dataset lacks adequate representation of women, RoentGen can generate synthetic X-rays of female patients to fill gaps in the data. Similar approaches could address gender, socioeconomic, geographic, age and other demographic inequities. And because the images are not of any living person, synthetic data could help circumvent patient privacy concerns.
The generative model could also ease another challenge for medical AI: labeling, the time-consuming and expensive process in which highly trained medical professionals annotate images to, in essence, tell the AI what it is looking at. With patient permission, the RoentGen team trained their algorithm on a public library of more than 200,000 digitized X-rays, matching each image with the corresponding written report in the patient’s electronic medical record to serve as its label.
“We collected retrospective data from a hospital where the images already existed and where a trained radiologist had already written everything about the image,” Chaudhari said. “No additional or specialized human labor was needed to create that generative model. Because we leveraged what’s in the hospital already, it’s the closest that we can get to having a free lunch labeling-wise.”
Seeing the unseeable
Drug discovery is another promising application of synthetic data and could be a boon in the study of rare, inoperable cancers where existing data is scant and biopsies can be dangerous or impossible to conduct. In one of many avenues of his cancer research, Olivier Gevaert, PhD, associate professor of medicine and of biomedical data science, studies an elusive type of inoperable, untreatable cancer of the brainstem.
Gevaert is using the generative powers of AI to test the effectiveness of new drugs on these cancers. With other, more accessible cancers, drug efficacy is verified by taking tissue samples from the patient to see if the drugs are killing tumors. In the brainstem, however, such biopsies are not possible. Instead, Gevaert uses generative AI to synthesize the biopsy slides from genetic data.
His latest model is RNA-CDM, which allows cancer researchers to manipulate the genes in a patient’s RNA profile computationally, turning certain genes on and others off on a computer rather than in a person. RNA-CDM then creates synthetic biopsy slides, simulating the effect of new drugs on the unreachable, unseeable cancer. There are no real drugs being tested, no side effects for patients and no invasive biopsies necessary.
“Imagine if we now do this for all genes in the human genome and all drug candidates,” Gevaert said. “We can do computer-based experiments and rank results according to what the investigator wants to see in the images, that is … to see dead tumor cells … and quickly sort through drug candidates.” He and his co-authors described the method and how they tested it in Nature Biomedical Engineering in March.
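In broad strokes, the workflow is to take a patient’s measured gene-expression profile, edit it in software, and feed the edited profile to a generative model trained to render matching tissue images. The sketch below is a toy, self-contained illustration of that loop: the gene names and values are invented, and the tiny untrained “decoder” is only a stand-in for a trained model such as RNA-CDM.

```python
# Schematic sketch of in-silico gene perturbation driving a conditional image
# generator. Gene names and expression values are invented, and toy_decoder is
# an untrained stand-in for a real trained model such as RNA-CDM.
import numpy as np

GENES = ["TP53", "EGFR", "MYC", "PTEN"]                       # illustrative gene panel
expression = {"TP53": 5.2, "EGFR": 8.9, "MYC": 7.1, "PTEN": 4.4}

def perturb(profile, knock_out=(), overexpress=(), factor=4.0):
    """Turn genes 'off' (set to zero) or 'on' (scale up) in a copy of the profile."""
    edited = dict(profile)
    for gene in knock_out:
        edited[gene] = 0.0
    for gene in overexpress:
        edited[gene] = edited[gene] * factor
    return edited

def toy_decoder(profile, size=64, seed=0):
    """Stand-in for a trained RNA-to-image generator: maps an expression vector
    to a small grayscale 'tile' via fixed random projections. A real model would
    be a trained conditional generative network."""
    rng = np.random.default_rng(seed)
    x = np.array([profile[gene] for gene in GENES])
    weights = rng.normal(size=(size * size, len(GENES)))
    return (weights @ x).reshape(size, size)

# Simulate 'turning EGFR off' and compare the baseline and perturbed tiles.
baseline_tile = toy_decoder(expression)
perturbed_tile = toy_decoder(perturb(expression, knock_out=["EGFR"]))
print("Mean pixel change after knockout:", np.abs(baseline_tile - perturbed_tile).mean())
```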
Other targets, other applications
Cancer-killing drugs are but one avenue of drug discovery benefiting from synthetic data. James Zou, PhD, associate professor of biomedical data science, recently developed an AI model that can generate and reason about synthetic small molecules that have never been seen in nature. He used this approach to design potential new antibiotics at a time when antibiotic-resistant bacteria are a major concern for the medical community.
Using his model, Zou designed compounds to kill the bacterium Acinetobacter baumannii, a major source of drug-resistant infections. The outcome was not one or even a handful of new candidates, but 58 potential antibacterial drugs.
Zou’s team then had those candidates manufactured for testing in lab mice. Six molecules proved to have low toxicity while showing promising antibacterial effects on A. baumannii and other pathogens. Zou and his collaborators described the approach in Nature Machine Intelligence in March.
In another direction, Zou is using synthetic data to increase access to a promising but expensive new imaging technique that can analyze cancer cells and their immediate environment. The technology, CODEX, can detect 50 to 100 biomarkers in a tissue sample at once — each one a potential target for new drugs. “The technology is powerful, but it is slow and costs thousands or tens of thousands of dollars for a single patient, limiting clinical applications,” Zou said.
Zou’s answer is 7-UP, a fast, inexpensive synthetic approach that expands the data obtained through a less powerful imaging technology: multiplex immunofluorescence, or mIF. From tissue stained with just seven biomarkers and imaged via mIF, 7-UP builds a robust picture of 40 or more additional biomarkers that can be used to classify cell types and predict patient survival from various drug interventions.
“The generative AI does something almost like virtual reality for cancer, turning low-resolution data into this very rich data visualization on the computer. But it only costs a few dollars per sample,” said Zou, who co-authored an article on the strategy in PNAS Nexus in June 2023. “It puts this promising technology within reach for more clinicians and researchers than ever.”
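At its core, 7-UP is an imputation problem: given a handful of measured marker intensities per cell, predict the rest. The following sketch shows that idea with a generic multi-output regressor on simulated numbers; it is not the published 7-UP pipeline, which was trained on matched imaging data and is considerably more involved.

```python
# Toy illustration of the 7-UP idea: predict many marker intensities per cell
# from a 7-marker panel. All numbers are simulated stand-ins for real paired
# measurements, and the model choice is generic, not the published pipeline.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_cells, n_measured, n_imputed = 5000, 7, 40

# Fake "measured" 7-marker intensities, plus fake "ground truth" extra markers
# that depend noisily on them so there is signal to learn.
X = rng.lognormal(mean=0.0, sigma=1.0, size=(n_cells, n_measured))
mixing = rng.normal(size=(n_measured, n_imputed))
Y = X @ mixing + rng.normal(scale=0.5, size=(n_cells, n_imputed))

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

# One multi-output regressor predicts all 40 additional markers at once.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, Y_train)
print("Held-out R^2 averaged over imputed markers:", round(model.score(X_test, Y_test), 3))
```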
Risks and rewards
Promise aside, the concept of synthetic data seems risky to many. Hernandez-Boussard and her collaborators, including Arman Koul, a Stanford medical student, and Deborah Duran, PhD, senior adviser to the director at the National Institute on Minority Health and Health Disparities, are developing a framework to guide synthetic data research through an ethical and scientific minefield. The framework highlights the considerable risks while offering a measured pathway forward.
Over-reliance on AI could lead to what they call synthetic trust — a false sense of confidence in the models. Synthetic data could perpetuate biases rather than lessen them and produce nonexistent correlations that might lead to model degradation and misrepresentations that harm patients, they said.
“Generative AI is shown to preserve and, in some cases, worsen biases and inaccuracies in datasets,” said Hernandez-Boussard, professor of medicine, of biomedical data science and of surgery.
“We think a go-slow, cautious approach is warranted in using synthetic data to train clinical algorithms. We must ensure data integrity, fairness and transparency to promote equitable outcomes for all sectors of society in health care applications.”
Synthetic data advocate Zou does not disagree and counseled vigilance in the face of the risks. “We want to be extremely careful in evaluating the quality and potential biases in the synthetic data,” he said. “It’s really important for us to rigorously evaluate our models. Ask: Do I get to the same final outcomes and insights as if I would have done the same analysis on real data alone?”
“While we can use synthetic data for pretraining medical AI models,” Chaudhari said, “it is critical to evaluate the performance of our methods on real datasets to understand what current gaps synthetic data can minimize and what gaps they maintain or even exacerbate. There is no shortcut to robust evaluation and validation.”
Ground truths
Langlotz concurred. He is a co-lead of the faculty research council for the RAISE Health initiative — a collaboration between Stanford Medicine and the Stanford Institute for Human-Centered Artificial Intelligence to encourage ethical and responsible use of AI in biomedical research, education and patient care. The initiative, launched last year, is convening AI experts, stakeholders and decision makers to explore what it means to bring the technology into the medical realm and to define a structured framework for ethical health AI standards and safeguards. It is also curating high-quality tools and datasets to help guide ethical medical AI development.
“People have made very strong conclusions about the utility of synthetic data, but they don’t quantify the quality of their underlying generative models,” Chaudhari added, pointing out that an over-reliance on synthetic data produces a phenomenon known as model collapse. “If you train a model using real data and then use that model to create lots of synthetic data, only to train yet another model on the synthetic data alone, over time your model just falls apart.”
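Model collapse can be illustrated without any medical imaging at all. In the toy simulation below, a “model” is just a Gaussian fit to data; each new generation is trained only on samples drawn from the previous generation’s fit, with no fresh real data.

```python
# Toy illustration of model collapse: each "generation" is fit only to samples
# generated by the previous generation's model, with no fresh real data.
# The fitted statistics drift away from the original distribution over time.
import numpy as np

rng = np.random.default_rng(0)

# Generation 0 trains on real data: mean 0, standard deviation 1.
data = rng.normal(loc=0.0, scale=1.0, size=100)

for generation in range(15):
    mu, sigma = data.mean(), data.std()          # "train" on the current data
    print(f"generation {generation:2d}: mean={mu:+.3f}  std={sigma:.3f}")
    # The next generation never sees real data, only synthetic samples.
    data = rng.normal(loc=mu, scale=sigma, size=100)
```

Left to run, the chain’s estimates drift further from the real data with each generation, which is why the researchers stress evaluating on, and mixing in, real ground-truth data.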
To verify RoentGen’s performance, for instance, Langlotz and Chaudhari asked two radiologists, one with seven years of experience reading chest X-rays and the other with nine, to conduct an audit of RoentGen’s output. Those professionals reviewed and rated both real patient and synthetic X-rays for quality and accuracy. Additionally, they gauged RoentGen’s alignment with highly specific medical language and concepts.
In reviewing those evaluations, the researchers found that the greatest uptick in classification performance was achieved when models were trained on a combination of real and synthetic data. They also noted that keeping humans in the evaluation loop is critical to improving results. Bottom line, Langlotz said: “The real test of any model is in whether your model gets at the ground truth. RoentGen does that.”
Future directions
These Stanford Medicine researchers point to several promising research avenues synthetic data might open, the most discussed being the “digital twin,” a computational facsimile of a given patient on which drugs and other interventions could be tested in silico — on the computer — without risk to the patient.
“We could run simulations on these digital twins or against disease models to rapidly figure out insights that can then be used to improve results for the real patients,” said Zou, who is among several Stanford Medicine researchers, including Gevaert and Hernandez-Boussard, developing digital twins.
Another promising avenue is the use of “synthetic arms” to supplement and speed clinical trials. Zou noted that pharmaceutical companies are already expediting drug discovery with digital surrogates. Instead of gathering 100 treatment patients for a trial, plus 100 controls, synthetic arms modeled on real patients from previous studies might be substituted for a large number of the controls, reducing time, effort and cost. “If you use 50 real controls and 50 digital arms, it could cut trial costs by a quarter,” Zou pointed out. The arithmetic is simple: replacing half of the 100 controls with digital surrogates means recruiting and following 50 fewer of the trial’s 200 participants, roughly a quarter of the total.
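One common ingredient in such external-control designs is matching historical patients to the current trial population on baseline covariates. The sketch below illustrates just that step on simulated data, as a generic illustration rather than any company’s actual method; real synthetic-arm designs involve far more, from eligibility criteria to propensity-score weighting and regulatory review.

```python
# Toy sketch of borrowing an external ("synthetic") control arm from historical
# records by nearest-neighbor matching on baseline covariates. All numbers are
# simulated; column order is age, a biomarker level and disease stage.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# 100 enrolled treatment patients and 5,000 historical patients.
treated = rng.normal(loc=[65, 1.0, 2.0], scale=[8, 0.3, 0.7], size=(100, 3))
historical = rng.normal(loc=[64, 1.1, 2.1], scale=[10, 0.4, 0.9], size=(5000, 3))

# Standardize covariates so each contributes comparably to the distance metric.
scaler = StandardScaler().fit(historical)
nn = NearestNeighbors(n_neighbors=1).fit(scaler.transform(historical))

# Borrow a historical look-alike for 50 of the treated patients to form the
# synthetic half of the control arm (the other 50 controls are real people).
# A real design would also guard against reusing the same historical patient.
_, idx = nn.kneighbors(scaler.transform(treated[:50]))
synthetic_arm = historical[idx.ravel()]

print("Synthetic arm size:", len(synthetic_arm))
print("Covariate means, treated vs. synthetic arm:")
print(treated.mean(axis=0).round(2), synthetic_arm.mean(axis=0).round(2))
```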
All the interviewed researchers agree on the potential of digital twins and synthetic arms, though Hernandez-Boussard does not classify either as “synthetic data” per se.
“We’re using real patient data points. Only the outcomes are synthesized,” Hernandez-Boussard said. “When you think about the cost of clinical trials, if you can reduce costs by a significant amount using digital twin or synthetic controls, that’s a huge win for research, for medicine as a whole and for patients.”