By Erin Digitale
Illustration by Shannon May
On a spring evening in 2000, Henry Lowe, MD, had just finished dinner at a downtown Palo Alto restaurant with a group of Stanford biomedical informatics students. A master at assembling large, complex databases from multiple sources of patient data, Lowe was in town to interview for a position on the faculty. As they strolled under the plane trees along University Avenue, one student took him by surprise, asking: “Are you sure you want to come to Stanford?” “Yes, I think I do,” Lowe remembers saying.
“Well, it’s nearly impossible to get access to any clinical data for research,” the student said. “When our researchers want clinical data, they have to go to Kaiser or the VA.”
Telling the story now, Lowe pauses, choosing his words.
“I came to Stanford with the goal of changing that,” he says.
It wouldn’t be easy. “Back then, the idea that I was going to go to the hospitals and say, ‘Give us all your clinical data,’ seemed absurd,” Lowe says. “Hospitals are, for all the right reasons, very sensitive about protecting that data.”
Not only did Stanford’s two hospitals — Lucile Packard Children’s Hospital and Stanford Hospital & Clinics — lack a tradition of using their patient records for research, but the records were housed in several unrelated databases. Organizing the information would require a major investment of time and money, and new technical and legal structures to accommodate re-use of the data. Yet other institutions — including the National Institutes of Health and a few academic medical centers — were establishing such systems. Despite the obstacles, Lowe felt a Stanford clinical data warehouse was essential. “We have an amazing opportunity to use our patient-care-delivery system as a sort of natural experiment,” he says.
“If you look at the FDA statistics, the time required to get an idea to bedside is years,” says Elaine Ayres, deputy chief of the Laboratory for Informatics Development at the NIH Clinical Center, which has a clinical research data repository of its own. “These systems have the ability to shorten that cycle.” The rich lode of clinical data allows researchers to tackle questions that are logistically and ethically impossible to address with clinical trials, and enables much larger studies than traditional clinical trials. As demands for evidence-based medicine increase, the data provide a tremendous evidence base to tap.
Today Stanford is a national leader in the use of clinical databases for research. Lowe, senior associate dean for information resources and technology at the School of Medicine, spearheaded the development of a world-class clinical data warehouse at Stanford — the Stanford Translational Research Integrated Database Environment, known as STRIDE.
The fast, intuitive searches the database makes possible are engendering a new kind of scientific creativity at Stanford, as researchers preview the contours of Stanford’s population data — more than a million patient records and growing — to help brainstorm hypotheses and develop studies. “People look at this and they salivate over it,” says Lowe, who also directs the Stanford Center for Clinical Informatics. Scientists from as far away as Australia, New Zealand, China, Korea, Norway, Finland, Great Britain and Ireland have come to learn how STRIDE was built. He’s happy to share Stanford’s know-how, though he’s the first to admit that even after a decade, in many ways the project is still a work in progress.
When Lowe arrived at Stanford in 2001, the School of Medicine had an ambitious new dean, Philip Pizzo, MD, who wanted to promote translation of basic discoveries into clinical care. Pizzo was fresh from a stint at NIH, which led the world in tapping electronic medical records for research. His enthusiastic advocacy for a clinical data warehouse for research was integral to the success of Lowe’s effort.
Pizzo helped convince leaders of the two Stanford-affiliated hospitals that a research database based on patient records would give the entire medical center a powerful advantage. In addition to the ability to advance medical science, the hospitals would gain an excellent platform for assessing patient safety and quality of care. Pizzo also committed to financing STRIDE’s development and operation, at an eventual cost of several million dollars.
Todd Ferris, MD, now director of informatics services at the Center for Clinical Informatics, soon joined Lowe’s team to help determine how to protect the privacy of patients in an ever-growing pool of live clinical data while providing scientists access for research. The team spent a year and a half crafting an agreement with the hospitals, then subjected their plan to an external legal review.
With administrative, financial and legal backing in place, Lowe and his colleagues tackled logistics. In 2003, after a fruitless hunt for a database that could integrate the many types of data in clinical records — such as patients’ diagnoses, demographic information, prescriptions, lab values, radiology images and pathology reports — they decided to design and build a database instead.
“Slogging through the data was painful and time-consuming but it also engendered a huge amount of knowledge for the team,” Ferris says. “If we had bought a database and pointed a pipe at it, we might not have developed the same level of understanding.”
“We were acquiring data from dozens and dozens of clinical systems,” Lowe adds. “Then we had the challenge of unifying all of it, figuring out which patient each piece belonged to, and assembling it in the database.”
The sheer volume of data was part of the problem. STRIDE now contains 1.7 million patient records. “We are a small medical system in comparison to some others,” Lowe says. “But that’s still a very large data set.”
The team developed a wish list of functions for their database, many of which are now incorporated into the slick user interface that Stanford researchers use to access STRIDE, which went live in 2005.
“Because this database is comprehensive, centralized and continuously adding new information, it provides a unique and visionary resource to our community,” Pizzo says. “STRIDE anticipated the incredible changes now occurring as medicine becomes increasingly quantitative.”
After almost a decade of development, STRIDE is ready for the big time. It’s on par with the best similar databases in the country at institutions such as the Mayo Clinic, Harvard/Partners Healthcare and Vanderbilt, says the NIH’s Ayres. A few large medical systems, such as Kaiser Permanente and the Veterans Health Administration, also have regional databases, but these systems so far lack the comprehensive capabilities of those at large academic medical centers. When Lowe dreams really big, he thinks about how to link STRIDE data to medical records from other institutions across the country. For starters, his team is working on meshing STRIDE with databases from other health-care systems in Northern California. It’s daunting because the total data set is likely “1,000 times bigger and more diverse” than STRIDE, he says.
Today, when a researcher wants to determine if STRIDE contains the right data for a particular research project, the first stop is the system’s Cohort Discovery Tool, built to quickly query medical records without revealing the identity of any individual patient. “Let’s look for a group of patients that could be the basis for a study,” says Lowe, as he logs in to the online STRIDE interface. “How about young heart attack sufferers?” He drags “Diagnosis” from a menu on the left side of the Cohort Discovery Tool’s screen to the work space in the center, typing “myocardial infarction” in the search box that pops up.
Behind the scenes, the system matches his query to ICD-9 diagnosis codes, the standardized descriptors used in medical records. Hitting the “Go” button, he sees that 9,625 Stanford patients have had an acute myocardial infarction.
“Let’s say we’re interested in clinical events from just the last five years,” Lowe says. He chooses “event date is after” and enters “01/01/2007.” The system spits out a new number: 3,350 patients. Narrowing again, he specifies “age at event is equal to or less than 55 years.” About 685 patients meet the narrowed criteria. In less than two minutes, he has assessed the electronic medical record data in a way that would be almost impossible by hand. Now he can further refine his search — by specifying patients taking particular medications, for instance, or those who have a certain type of lab result on file. As he searches, the system automatically displays bar graphs of the cohort’s demographics: gender, current age, race and most recent address.
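The narrowing steps Lowe walks through could be sketched in a few lines of code. This is a toy illustration with invented records and a single hypothetical ICD-9 code, not STRIDE's actual schema or query engine:

```python
from datetime import date

# Hypothetical, simplified patient events -- STRIDE's real records are far richer.
events = [
    {"patient_id": 1, "icd9": "410.9", "event_date": date(2008, 3, 14), "age_at_event": 48},
    {"patient_id": 2, "icd9": "410.9", "event_date": date(2004, 6, 2),  "age_at_event": 62},
    {"patient_id": 3, "icd9": "410.9", "event_date": date(2010, 1, 20), "age_at_event": 71},
    {"patient_id": 4, "icd9": "250.0", "event_date": date(2009, 9, 9),  "age_at_event": 50},
]

# ICD-9 codes that a text query like "myocardial infarction" might map to (illustrative).
MI_CODES = {"410.9"}

# Step 1: diagnosis filter
cohort = [e for e in events if e["icd9"] in MI_CODES]
# Step 2: event date after 01/01/2007
cohort = [e for e in cohort if e["event_date"] > date(2007, 1, 1)]
# Step 3: age at event is 55 or younger
cohort = [e for e in cohort if e["age_at_event"] <= 55]

print([e["patient_id"] for e in cohort])  # → [1]
```

Each filter simply shrinks the candidate pool, which is why the counts drop at every step of Lowe's demonstration.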
For scientists building a study, the next step is a data review. Following a formal review by the school’s privacy officer, researchers may obtain permission to review data from all patients in the cohort to see whether those records will really answer the questions the study poses. “There’s an awful lot of noise in clinical data; you can’t assume the answer from the Cohort Discovery Tool is correct,” Lowe says.
To conduct a study using STRIDE data, researchers need to get approval from the Institutional Review Board — the organization that gives the thumbs up or down to any research on campus involving human subjects. As required by federal privacy statutes, the IRB assesses the minimum amount of information necessary for scientists to answer their research question. Many studies can be conducted with de-identified data — stripped by STRIDE of the patient’s name and anything else that could be traced back to the individual. Other studies require identified data. In some such cases, patients are contacted individually to request their permission to include them in the study. However, as with any proposed study of human subjects, if the overall risk to patients is judged to be negligible, the IRB sometimes grants a waiver that permits scientists to proceed without contacting individual patients first.
The IRB is the key patient-privacy gatekeeper, Lowe says, emphasizing that the IRB has the ultimate control in ensuring that researchers get only the data they need and no more. STRIDE also has several methods of ensuring that its interfaces cannot be used to underhandedly triangulate back to a specific patient. Another privacy safeguard is software that enables researchers to analyze patient data without downloading it from STRIDE’s secure servers.
The Notice of Privacy Practices that Stanford and Packard Children’s patients receive informs them about potential re-use of clinical data for research, and patients have the option to request in writing that their data never be used. Yet Lowe hopes patients will see the big-picture value of this type of research, adding “Here’s an opportunity for your data to be a small part of a much bigger data set that could lead to dramatic discoveries and breakthroughs in understanding.”
To help scientists build good studies, in 2005 Lowe’s team inaugurated the Stanford Center for Clinical Informatics. Demand for its services is steadily increasing: The center received 318 requests for “substantial informatics consultations” to help scientists use STRIDE in 2011, more than double the requests per year in 2008 and 2009, and up from 198 in 2010. Clinical informatics research is catching on at Stanford, says Lowe, though he still thinks the gold mine of information in STRIDE’s data has barely been touched. So far, only a handful of studies have been published using STRIDE data.
Among STRIDE’s converts is Russ Altman, MD, PhD, whose team used the database to explore adverse drug interactions, turning up several.
“Initially, many of us were very nervous that medical record data would not be good for research because it would be so biased by the purposes for which it was designed,” says Altman, a professor of bioengineering, of genetics and of medicine. Although electronic medical records are indeed biased, for instance by clinicians’ tendency to favor diagnosis codes with higher insurance reimbursement rates, Altman says that with careful planning to deal with such bias, “I believe they’re extremely valuable.”
One of Altman’s studies, published this year in Science Translational Medicine, revealed potentially dangerous interactions between widely used drugs by tapping both STRIDE and the Food and Drug Administration’s database of more than 4 million adverse-event reports collected across the country.
“It’s only with the emergence of very large public databases that you can begin to ask questions about drug-drug interactions,” he says.
The scientists designed an algorithm that trawled through the FDA’s adverse-event data seeking pairs of patients who were extremely similar except for one drug prescription. Comparing these matched patients made it easy to spot side effects produced by drug combinations. The research found several dozen interactions; the biggest discovery was that patients taking both SSRI antidepressants and the thiazide type of blood pressure medication are at increased risk for a potentially deadly form of heart arrhythmia, long QT syndrome. The team used STRIDE data to check their findings. Sure enough, patients in STRIDE who were taking SSRIs and thiazides had increased risk of the arrhythmia.
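The matching idea can be sketched roughly. In this toy version, reports are bucketed by two invented covariates and exposed reports are paired with controls; the published algorithm matches on many more characteristics, and the data here are fabricated for illustration:

```python
from collections import defaultdict

# Hypothetical adverse-event reports: covariates for matching, the exposure of
# interest (taking drug B on top of drug A), and whether long QT was reported.
reports = [
    {"age_band": "50s", "sex": "F", "drug_b": True,  "long_qt": True},
    {"age_band": "50s", "sex": "F", "drug_b": False, "long_qt": False},
    {"age_band": "60s", "sex": "M", "drug_b": True,  "long_qt": True},
    {"age_band": "60s", "sex": "M", "drug_b": False, "long_qt": False},
    {"age_band": "40s", "sex": "F", "drug_b": True,  "long_qt": False},
]

# Bucket reports by the matching key (every covariate except the exposure).
buckets = defaultdict(lambda: {"exposed": [], "control": []})
for r in reports:
    key = (r["age_band"], r["sex"])
    group = "exposed" if r["drug_b"] else "control"
    buckets[key][group].append(r)

# Pair each exposed report with a matched control; unmatched reports are
# simply discarded, as the article describes. Then count events in each arm.
pairs = exposed_events = control_events = 0
for b in buckets.values():
    n = min(len(b["exposed"]), len(b["control"]))
    for e, c in zip(b["exposed"][:n], b["control"][:n]):
        pairs += 1
        exposed_events += e["long_qt"]
        control_events += c["long_qt"]

print(pairs, exposed_events, control_events)  # → 2 2 0
```

The last report has no matched control, so it contributes nothing, which is exactly the "throw away data that's not perfectly useful" discipline Altman describes below.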
One advantage to big data: “We’re slowly learning that you can throw away data that’s not perfectly useful,” Altman says. For instance, the adverse-event study used only data from well-matched pairs of patients, ignoring data from patients who could not be matched according to the algorithm his team designed. “Traditionally, statisticians weren’t able to insist on that because they didn’t have big enough data sets.”
But doesn’t throwing out data open the possibility of bias in what is kept? Not really, says Altman — because of the sheer size of the data sets. Unlike the findings from a traditional study, these numbers won’t fit into a spreadsheet. There is no way for scientists to see the numbers before they decide how to analyze them. As Altman puts it, “It would be very hard to cherry-pick the data even if we wanted to.”
Lowe and his colleagues are now tackling what he calls “the hard problem in medical informatics”: extracting useful information from typed or dictated text within medical records. This data trove remains difficult to tap because the computational challenge is far more complex than processing the structured portion of the clinical record.
There’s an irony in the situation: A big motivation for creating electronic medical records was to make their data computable, but so far that remains extremely difficult for the richest portion of the information, the unstructured text.
“So the quest continues,” says Lowe. “We have the electronic health record, but we still can’t compute much of the really important data it contains.”
But they’re making progress. The text-mining tool they’ve begun developing can already parse pathology reports to find patients with a specific cancer diagnosis. And it can pull out descriptions of cancer types and biopsy sites. It even has rudimentary “negation detection,” the ability to understand a notation that the patient does not have the diagnosis of interest.
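Rudimentary negation detection can be surprisingly simple in concept. Here is a toy, sentence-level sketch in the spirit of cue-phrase approaches, with an invented cue list; it is emphatically not Stanford's actual text-mining tool:

```python
import re

# A tiny, illustrative list of negation cues; real systems use far larger
# vocabularies and account for the scope of each cue within the sentence.
NEGATION_CUES = r"\b(no|not|denies|negative for|without|ruled out)\b"

def mentions_diagnosis(report_text, diagnosis):
    """Return True if the report asserts the diagnosis, False if it is
    absent or negated -- a toy approximation of negation detection."""
    for sentence in re.split(r"[.!?]", report_text.lower()):
        if diagnosis.lower() in sentence:
            # A negation cue in the same sentence flips the finding.
            return not re.search(NEGATION_CUES, sentence)
    return False

print(mentions_diagnosis("Biopsy confirms adenocarcinoma of the colon.", "adenocarcinoma"))  # → True
print(mentions_diagnosis("Findings negative for adenocarcinoma.", "adenocarcinoma"))         # → False
```

Real clinical text defeats such simple rules constantly ("cannot rule out...", negations spanning sentences), which is why Lowe calls this the hard problem.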
Still, the tool is much less powerful than a human reader. But just give it a little time, says Lowe.