Can you repeat that?

The crisis in research reliability

It was a doctor’s worst nightmare. Throughout the late ’80s and ’90s, clinicians had prescribed hormone therapy to millions of menopausal women to help with hot flashes and other unpleasant symptoms of menopause. The treatment was also believed to lower a woman’s risk of heart attack or stroke. It seemed like a win-win.

But in 2002 a new study based on the Women’s Health Initiative delivered some jaw-dropping results.

The study, which was steered by a national committee headed by professor of medicine Marcia Stefanick, PhD, of the Stanford Prevention Research Center, showed that women taking a combination therapy using both estrogen and progestin (the synthetic form of the hormone progesterone) actually had an increased risk of heart attack and stroke within the first five years. They also had a significantly increased risk of breast cancer.

In other words, doctors who thought they were helping their patients were doing the exact opposite.

“I found myself apologizing to my patients,” recalls Adam Cifu, MD, a professor of medicine and general internist at the University of Chicago Medicine. “To this day I have patients who, when I recommend a prescription, say, ‘This isn’t like that estrogen therapy you gave me 20 years ago that you then told me to stop taking, is it?’”

The experience caused Cifu to begin questioning the evidence upon which medical practice is based, and to wonder how many other established medical protocols had been proven not just ineffective, but actively harmful.

In 2005, John Ioannidis, MD, DSc, published a paper in PLOS Medicine provocatively titled “Why Most Published Research Findings Are False.” In it Ioannidis, who came to Stanford in 2010 as director of the Stanford Prevention Research Center, asserted that small study sizes, research bias, competition among scientists and financial conflicts of interest render the outcome of over half of scientific studies suspect.
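The logic behind Ioannidis’ claim can be made concrete with a back-of-the-envelope calculation (a sketch with illustrative numbers, not figures from his paper): if only a small fraction of the hypotheses scientists test are actually true, then even studies run at the conventional significance threshold will produce a large share of false-positive “findings” — and low statistical power, typical of small studies, makes the problem far worse.

```python
# Illustrative sketch (hypothetical numbers): the positive predictive
# value of a "statistically significant" result, i.e. the probability
# that a significant finding reflects a true effect.
def ppv(prior, power=0.8, alpha=0.05):
    """prior: fraction of tested hypotheses that are actually true.
    power: probability a true effect is detected.
    alpha: false-positive rate for a null effect."""
    true_positives = prior * power          # true effects that test significant
    false_positives = (1 - prior) * alpha   # null effects that test significant
    return true_positives / (true_positives + false_positives)

# If only 1 in 10 tested hypotheses is true, even a well-powered study's
# significant result is right only about 64% of the time:
print(round(ppv(prior=0.10), 2))

# With the low power of a small study, most significant findings are false:
print(round(ppv(prior=0.10, power=0.2), 2))
```

The exact numbers are assumptions chosen for illustration; Ioannidis’ paper works through this kind of arithmetic with bias and multiple competing teams added on top, which pushes the share of false findings higher still.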

Over the subsequent decade, rumbles of concern from researchers, clinicians and pharmaceutical companies about the reproducibility of published research findings have grown into a drumbeat of alarm that’s reached the popular press and even infiltrated late-night television.

As comedian John Oliver quipped recently during a Last Week Tonight segment on scientific inaccuracy, “There’s no Nobel Prize for fact-checking” already-published research. And pressure from stakeholders is mounting. Earlier this year, for example, the pharmaceutical giant Merck suggested that universities be required to return licensing fees paid to them by drug companies if the original experiments can’t be replicated.

An economic study published in 2015 in PLOS Biology suggests that, if Ioannidis’ estimates are correct, taxpayers, who foot the bill for government grants for scientific research, are pouring billions of dollars down the drain. Every year.

These numbers have attracted the attention of major research funding agencies such as the National Institutes of Health and the editors of high-profile journals such as Science and Nature. In the past year, these organizations have launched efforts to ensure the accuracy of published research and to modify an incentive system that places greater value on splashy results than on behind-the-scenes grunt work: the multiple researchers who take the time to verify previous findings and the care to design methodologically robust follow-up studies — steps that are often necessary to home in on scientific truth.

The overall aim is to issue a course correction to the U.S. medical research behemoth by bringing back into focus its core mission: benefiting those who need it most.

“I think that, somewhere along the line, we began to miss the big picture,” muses Ioannidis. “What have we been rewarding in science? Quantity of publications? Findings that appear statistically significant? We should be rewarding quality research that will make an impact on real people’s lives — both those who are sick and those who wish to remain healthy.”

How to remain healthy, and how to treat those who aren’t, is the crux of medical research. In 2015, the Stanford Prevention Research Center launched the Wellness Living Laboratory to determine the lifestyle and environmental factors that contribute to a person’s ability to live a long, healthy life. But as the hormone therapy experience shows, it’s often far from clear how to accomplish these goals because it can be hard to know which studies to trust.

In August 2015, Cifu and a colleague, Vinayak Prasad, MD, an assistant professor at Oregon Health & Science University, published the results of a review they and their colleagues conducted of research published over 10 years in one of the most highly regarded medical publications, the New England Journal of Medicine. They identified 146 cases of what they termed “medical reversal,” in which previously accepted medical practices — including stenting for stable coronary artery disease, arthroscopic surgery for osteoarthritis of the knee and vertebroplasty for osteoporotic fracture — were proven to be, well, wrong.

The problems can have deep roots. As Ioannidis’ research suggests, scientific errors can occur in the laboratory during basic and preclinical research, as well as in clinical trials. In 2012, the former head of global cancer research at Amgen Inc., C. Glenn Begley, PhD, tried something new: Rather than spending resources to build directly on published research, he first sought to confirm it. He chose 53 well-known cancer studies published in high-profile journals and tried to duplicate their findings. He and his team were unsuccessful 47 out of 53 times.

‘We should be rewarding quality research that will make an impact on real people’s lives.’

So how is it possible that the conclusions of seemingly reputable scientific studies fail to hold up, and often even appear to contradict one another? Ioannidis points to an episode from his past as one example of how things can go awry.

As a young assistant professor, he found himself scratching his head at a puzzling result. He was attempting to develop a mathematical model correlating the amount of virus in the blood of a person infected with HIV and that person’s expected life span. Logic, and science, dictate that the higher the number of viral particles in the blood, the closer to death that patient would be. But Ioannidis’ equation wasn’t working.

“At some point, my calculations estimated that someone with an extremely high viral load would live 800 years,” he says. When he double-checked, Ioannidis found that in one equation a positive sign had mistakenly been converted to a negative sign — a slip-up we’ve all probably made at some point. “Once I corrected the error, the result was much more reasonable, predicting that the person would likely live less than one year,” he says. But that experience changed his career.

“I began to wonder what would have happened if the error was such that the result, although still wrong, was not so implausible,” he says. “There are many reasons why we might get results that are incorrect, but not easily recognizable at first as wrong. Particularly if these results fit our own expectations and biases.”

Ioannidis found himself increasingly devoting his time to studying when, how and why mistakes are made in a variety of biomedical fields. And in doing so he found some glaring problems.

“It seemed to me after reading a lot of research papers that it was not at all uncommon to encounter major weaknesses in study design, analysis and conclusion,” he says. “This is the core of scientific investigation, but it’s not easy. You need a lot of training and experience, and even then you can still get things wrong if you’re not very careful.”

Errors in study design, if not detected, then creep into clinical recommendations as physicians succumb to the all-too-human desire to help people, Cifu says. “Doctors want to have something to offer that they think will help their patients. But problems can occur when physicians put their trust in data that are not terribly robust. This is easy to do when there is a physiologic rationale for the potential treatment.”

For example, the initial study suggesting that hormone therapy might reduce the risk of heart attack or stroke was observational. Clinicians tracked the health outcomes of women who chose whether to take hormone pills and found that those who did had lower rates of heart attack and stroke than those who decided to forgo the treatment. (In contrast, the Women’s Health Initiative randomly assigned women to receive either hormone therapy or a placebo, in a study format known as a randomized controlled trial.)

Although the observational study wasn’t rock-solid evidence — randomized controlled trials are considered to be superior — the lure of hormone therapy was hard to resist. Doctors were inclined to accept the findings of the less-robust observational study rather than demanding more-solid evidence of the treatment’s efficacy because the reasons why it should work seemed so sensible.

“There was good biochemical and biophysical rationale for why this therapy would work,” says Cifu. “The problem is, it didn’t. In retrospect, the women who took estrogen were younger, thinner and had better cholesterol levels than those who didn’t take it. So it was probably not the hormone therapy that benefited them, but everything else.”
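The confounding Cifu describes can be sketched in a toy simulation (entirely hypothetical numbers, not data from the study): if healthier women are more likely to choose the therapy, an observational comparison will show a “benefit” even when the treatment itself does nothing.

```python
import random

random.seed(1)

# Toy model (hypothetical): each woman has a baseline health score;
# healthier women are more likely to opt for the therapy, but the
# therapy itself has ZERO effect on her risk of a cardiac event.
def simulate(n=100_000):
    events_treated = treated = events_untreated = untreated = 0
    for _ in range(n):
        health = random.random()                  # 0 = least healthy, 1 = healthiest
        takes_therapy = random.random() < health  # healthier -> more likely to take it
        risk = 0.2 * (1 - health)                 # risk depends only on health
        event = random.random() < risk
        if takes_therapy:
            treated += 1
            events_treated += event
        else:
            untreated += 1
            events_untreated += event
    return events_treated / treated, events_untreated / untreated

treated_rate, untreated_rate = simulate()
# The treated group looks protected (roughly half the event rate),
# even though the therapy did nothing at all:
print(f"treated: {treated_rate:.3f}  untreated: {untreated_rate:.3f}")
```

Randomly assigning treatment, as the Women’s Health Initiative did, breaks the link between baseline health and who gets the therapy — which is exactly why the randomized trial overturned the observational result.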

Clearly, helping both physicians and researchers understand how to design and evaluate published studies is an important component of increasing research accuracy. In 2014, Ioannidis helped launch the Meta-Research Innovation Center at Stanford, or METRICS. Co-directed by Ioannidis and associate dean of clinical and translational research Steven Goodman, MD, PhD, the center is the first in the country to devote itself to making the practice of research more accurate and efficient across many scientific fields.

One of the center’s aims is to educate scientists on how best to design preclinical and clinical studies, and how to choose appropriate statistical methods to analyze the resulting data.

“A big problem in much published research is the use of suboptimal methods,” says Ioannidis. “Over the years, the quantitative component of biomedical research, as well as research in general, has become more prominent. There are now very few disciplines in which researchers can do high-quality, influential work without also incorporating high-quality quantitative analysis.”

Collaboration among researchers and statisticians or computer scientists skilled in handling large amounts of data is one way to ensure that a study’s findings are robust and accurate. Another is to increase transparency and to encourage researchers to cross-check one another’s results. The journals Science, Science Translational Medicine and Nature, as well as major funding organizations including the National Institutes of Health, have launched efforts to promote data sharing and open access to scientific articles. The journals have agreed to eliminate the word limit on sections of an article devoted to describing in detail how the research was conducted, to encourage authors to provide more raw data to others in their field and to ask editors to partner with statisticians when necessary to assess how the study was analyzed.

The NIH’s newly created Rigor and Reproducibility website includes a training module for researchers that emphasizes enhanced transparency and good study design. In addition, several NIH institutes have deployed a checklist that reminds grant reviewers to review the key components of proposed research, including any plans for randomization and data analysis. They are also considering assigning reviewers to assess whether the proposed research is built on a strong foundation of previously verified studies.

“Sometimes some sloppiness creeps in,” said NIH director Francis Collins, MD, PhD, at a 2015 conference in Washington, DC. Collins cited the hypercompetitiveness of many scientific fields and the scarcity of available research funding as potential reasons for poor experimental design. “Maybe the right controls were not quite done, or you had a control but it wasn’t the perfect one. Or you didn’t repeat the experiment two or three times to be sure you always got the same result. And maybe you just sort of glossed that over when you finally submitted your paper, either because there was a space constraint, or because you were tired of writing that section, or, sorry to say, in some instances people don’t really want to give away a few of their lab secrets.”

In early 2014, Collins and the NIH’s principal deputy director, Lawrence Tabak, DDS, PhD, described the organization’s plans to enhance the accuracy of scientific research in an article in Nature.

“We need to renew our attention and our commitment that we’re about doing science that’s rigorous, that’s going to hold up, that we’re looking for truth,” says Collins.

That commitment, however, will require upending the culture of how science is conducted and rewarded in this country. For decades, career capital has been amassed through prompt, frequent publication of one’s results, preferably in high-profile scientific journals. Numerous prestigious papers are parlayed into increases in grant funding and academic promotions. But the process doesn’t include incentives for collaboration or for verifying others’ work. It also discourages the publication of negative results.

Former National Cancer Institute director Harold Varmus, MD, has suggested changes to the summary of a researcher’s achievements that is required as part of a grant application. Rather than listing major publications, Varmus proposes instead that researchers include a narrative that describes five major accomplishments. This structure, which is used by the Howard Hughes Medical Institute, would give grant reviewers a more holistic view of a researcher’s career.

So how will we know if, or when, these efforts will succeed? Ioannidis is confident that the tide is shifting but thinks it will take time to see meaningful change.

“Until the early ’90s, there was not even an effort to look at the biomedical literature in terms of the totality of the evidence,” he says. Nowadays, “we’re seeing action on nearly all possible fronts, from scientific journals to funding agencies to professional organizations.”

Earlier this year, Ioannidis published an analysis of a random sample of 441 studies published in biomedical journals between 2000 and 2014 to ascertain how many followed any of the practices recently suggested to increase research transparency and accuracy. He found that none shared all the raw data, only one shared the full protocol and many did not report on funding and conflicts of interest.

But there are some promising signs.

“There are focused areas where in the last few years data sharing has improved,” says Ioannidis. “For example, more genetic data is being deposited in the National Center for Biotechnology Information’s database of genotypes and phenotypes, which relates gene variants to disease states. There’s also a clear push to share more data for clinical trials and to make protocols routinely available to others.”

Gradual shifts are to be expected when recalibrating an entire culture, however.

“It’s an evolution,” Ioannidis says. “Science is a process of accumulating evidence. By scrutinizing that evidence carefully, we can begin to develop a gradient of truth. If we can make biomedical research more efficient, and more reliable, we can improve the health outcomes of real people.”

Krista Conger is a science writer for the medical school's Office of Communication & Public Affairs. Email her at kristac@stanford.edu.

