by Bruce Goldman
Photography by Colin Clark
Medical researcher Atul Butte is at his computer collecting tissue samples, in this case for a study of leukemia. No need for sterile technique here — these days such tasks can be easily accomplished online. So he Googles. “Ah, here’s a company in Huntsville, Alabama,” he says. “OK, let’s see: ‘bladder cancer,’ ‘brain cancer,’ ‘breast cancer.’ Ah, here’s ‘leukemia.’ Let’s click on that.”
Butte, MD, PhD, an associate professor and chief of systems medicine in the Department of Pediatrics, has no lab in the orthodox sense. His discoveries, and there are plenty of them, pour out of a warren of cubicles housing computers and anywhere from 10 to 25 people. However untraditional, his lab is amazingly productive, averaging a new publication every two weeks — from new uses for old drugs to insights into the genetics of diabetes.
A few more clicks and he’s ordered 15 serum samples from leukemia patients, and for comparison 15 serum samples from healthy people. “I get info on each patient’s age, race, sex, alcohol and tobacco status, what medications they’re on.” And it’s only $55 per sample.
Butte is visibly excited about this, as he is about many things. (He’s a very happy man.) His inflection rises. “They show up in 72 hours on dry ice. If we tried gathering them in my lab, it would take a year and they’d cost me about $1,000 apiece once I factor in all the labor that’s going to be involved,” says Butte.
Not that these samples are necessarily going to show up at Butte’s doorstep. He’ll probably outsource the analysis too — because it’s faster, less expensive and, in many cases, carried out with more expertise than he could muster even if he tried.
Traditional biological research has relied on painstaking work with experimental systems involving animal models: mice, rats, worms, frogs and flies, to name a few. But there’s a new experimental system on campus. An explosion of biomedical data, particularly molecular data, is piling up exponentially in databases whose numbers are themselves increasing exponentially. The new experimental system is the universe of all these databases, many of which can be accessed and exploited via the Internet.
Butte is a walking window through which to watch the data revolution in bioscience unfold. A proven master at mining this medical data, he’s now throwing his considerable energy into persuading other scientists to try his approach. It’s the fastest, least costly, most effective path to improving people’s health that he knows.
Medical science is being swept up in a revolution, says Butte — a revolution as big as the microscope, or the breaking of the DNA code.
Although beginnings can be arbitrary, you could say this revolution — call it the digitalization of biomedical research — began with the Human Genome Project: a massive $3 billion effort, begun in 1990 and projected to take 15 years, to construct a linear catalog of all 3 billion chemical letters in a generic human genome. The Genome Project holds a hallowed place as one of the few large-scale government efforts that was completed early, in 2003. Only nine years later, we are fast approaching the era of the $1,000, 15-minute personal genome, a result of rapid technological improvements.
Butte thinks this means we now have to look at science in a new way. “In traditional biology research, people ask a key question, or run a trial. They make clinical and molecular measurements to address that question. They use some statistics or computation. Then they validate what they’ve found in another, more advanced trial. I would argue that three out of these four steps are now completely commoditized. We can outsource all that stuff and save a lot of money. But what you’ll never outsource is asking good questions. As scientists, that’s what we’re really supposed to do best.”
A lot of the answers to important medical questions are already here, Butte says, trapped inside a matrix of voluminous data gathering dust in myriad repositories — many of them accessible even to a teenager, and many of them free. The trick is to figure out what questions to ask to get the data to divulge their secrets.
“I don’t think enough people study the measurements that have already been made,” says Butte. “Hiding within those mounds of data is knowledge that could change the life of a patient, or change the world. If I don’t analyze those data and show others how to do it, too, I fear that no one will.”
Butte started writing code as a kid in New Jersey. He wrote it in longhand, inside notebooks. Tagging along with his parents on shopping trips, he would steal off to the computer-sales area and type his programs into the demo models. Eventually one of those shopping trips paid off big time. He got his very own computer at age 12 and never looked back. Butte picked Brown University for his undergraduate studies specifically because that school had an eight-year program allowing him to major in whatever he wanted (that was easy: computer science) then attend medical school there. On his own, the 43-year-old academic has spun off six companies during his career and is now spinning out his seventh. If you count the three more founded by his PhD students in the past year and not directly involving Butte, it’s an even 10 so far. In 2011, he created the Department of Pediatrics’ division of systems medicine, whose goal is to harvest the vast troves of biomedical data that researchers are pouring into public repositories. Thanks to the Internet, this data is now available to all comers (albeit with appropriate safeguards).
The Human Genome Project was just a starting point. A genome, by itself, reveals only some of the mysteries of the organism. Biologists really want to know which of our many different proteins are at work in our cells, and in what quantities, but proteins are tricky to keep tabs on. Because proteins vary radically in their structures and functions, quantifying them requires radically different biochemical procedures.
There’s a proxy for that, though, stemming from the fact that genes carry the instructions for making proteins. In a living cell, the quantity of each type of protein being made can be estimated by measuring the quantities of a molecular intermediary, messenger RNA, that links individual genes to the proteins those genes specify just as a waiter’s ticket links specific menu items to the plates of food the menu describes in words.
Different cell types in an organism “order” different amounts of the tens of thousands of different human proteins, as does the same cell at various stages of development or decrepitude, or in disease versus health. So you can learn a lot about a cell’s identity or condition by seeing what amounts of different proteins it’s ordering up. This is known in the business as a gene-expression analysis.
Instruments pioneered by Stanford in the mid-1990s can simultaneously measure amounts of individual messenger RNA molecules (cells’ order forms for specific proteins) corresponding to each of the roughly 23,000 genes in the cell’s genome. These once-exotic devices, called microarray chips, are now a “throwaway commodity item,” says Butte, retailing for as little as $200 apiece. The advent of these tools has triggered such a torrent of gene-expression experiments that “about a decade ago, the top peer-reviewed journals finally started saying, ‘We won’t publish your study unless you deposit its raw data in public repositories,’” he says. Government funding sources also now insist on such data dumps by grantees. The result: an explosion of data from studies employing genomics, gene-expression and other molecular methodologies. Thus, experimental data can be shared among not only the researcher and teammates but hundreds of other researchers they’ve never met or even heard of, and whose fresh approach may yield answers to questions that the guys who initially posted the data never thought to ask.
Health-related data can come from sources besides molecular measurements, including electronic health records, hospital orders, admissions and census data. Others are taking notice. Last year the McKinsey Global Institute noted in a report, Big Data: The Next Frontier for Innovation, Competition and Productivity, that analyzing the large data sets produced by the U.S. health-care system could create more than $300 billion in value a year.
Not everybody is in love with this new way of doing things. The sheer volume and novelty of data mining’s output — a paper every 14 days! — is off-putting to more traditional bench scientists who earned their stripes mastering entire batteries of complex wet-lab techniques and overseeing or performing every step of the data-collection process themselves.
In April, at an off-campus retreat hosted by the Stanford Cancer Institute, Butte had an opportunity to sway doubters. He was accorded five minutes (the same as all the other speakers at the event) to talk about his methodology. There was little time for him to discuss his own studies in any detail, but most in the audience of about 100 were already at least broadly familiar with it. So, after a brief summary of all the major publicly accessible databases of possible interest to cancer researchers, he plunged into the question foremost on his mind. “OK, a show of hands,” he said. “How many of you have actually taken advantage of these databases for your work?” About five hands went up.
“My second question is, why not? Your postdocs and grad students are perfectly capable of mining this data,” Butte said. Up jumped the hand of distinguished cancer researcher Ronald Levy, MD, chief of the Division of Oncology. “Here is my answer to your question. Why should I trust data from experiments I haven’t done, or overseen, myself?”
Butte’s response: First, by definition, more than half of the data you pull from these databases is going to be quite recent. This follows directly from the fact that the contents of the biological databases are doubling every 15 months or so. So the data you download is going to have been derived, by and large, via up-to-the-minute techniques and instruments.
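The arithmetic behind that first point is easy to check: if an archive doubles every 15 months, everything that existed 15 months ago makes up only half of today's total, so the other half must be newer. A minimal sketch (the 15-month doubling period comes from Butte's figure above):

```python
# Fraction of an exponentially growing archive that was added within the
# last `months`, given its doubling period. With 15-month doubling, half
# of everything in the archive is less than 15 months old.
def fraction_newer_than(months, doubling_months=15):
    return 1 - 2 ** (-months / doubling_months)

print(fraction_newer_than(15))  # 0.5  -> half the archive is recent
print(fraction_newer_than(30))  # 0.75 -> three-quarters under 30 months old
```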
Second, Butte answered, “I can’t vouch for the accuracy of the data in any given experiment pulled up from the database, any more than I can vouch for the accuracy of the data from any specific researcher who’s not a member of my own lab. But for the most part, the guys who are posting this data are the ones who are getting all the funding, so they tend to have histories of competence. And if their study was in a top-tier government-sponsored database, they had [NCI director] Harold Varmus and [NIH director] Francis Collins looking over their shoulders. That’s got to be worth something.”
Did Butte’s rebuttal change Levy’s mind? “He made some good points, but I don’t think it will convince many people to switch from doing their own experiments to relying on data from others who were doing their own experiments and asking their own specific question,” says Levy.
And do these criticisms faze Butte? “Not a bit,” he replies. “I probably have heard all the common criticisms. Successful scientists aren’t necessarily going to change how they do science, but I do try to convince junior students and investigators that this data-driven approach is just as acceptable a way to build a career.”
In any case, these database searches let you look at data aggregated from a huge number of independent experiments, not just from one lab. Just a week before the SCI retreat, Butte published a study in Proceedings of the National Academy of Sciences in which his group implicated a hitherto-unsuspected gene in type-2 diabetes. For the study, Butte checked out results from 130 independent experiments comparing gene activity levels in diabetic versus healthy tissue — in four tissues (fat, liver, muscle and insulin-producing pancreatic beta cells) and three species (mice, rats and humans). The same gene jumped over the moon in 78 out of the 130 experiments, a result whose chances of occurring randomly are less than one in 10 million-trillion.
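The flavor of that long-odds calculation can be reproduced with an exact binomial tail sum. The null model below — a flat, generous 1-in-10 chance of a gene being flagged in any single experiment — is an invented illustration, not the model used in the actual study:

```python
from math import comb

def binom_tail(n, k, p):
    """Exact P(X >= k) for X ~ Binomial(n, p), by direct summation."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Even granting a gene a 1-in-10 chance of standing out in any one
# experiment, the same gene standing out in 78 of 130 independent
# experiments is astronomically unlikely under chance alone.
p_value = binom_tail(130, 78, 0.1)
print(p_value)  # a vanishingly small number
```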
Interestingly, the gene codes for a receptor found on the surfaces of macrophages, primitive immune cells that abound in corpulent people’s potbellies. An online search revealed that a famed experimental-animal facility, the Jackson Laboratory in Bar Harbor, Maine, had a strain of mice lacking the receptor. Butte ordered some of these mice along with their normal counterparts and had them sent out to an animal-testing consultant. Performing experiments designed by the Butte team, the consultant showed that while normal mice developed diabetes from a high-fat diet, otherwise identical mice lacking the receptor didn’t. The team then tested a proto-drug blocking the receptor on the mice carrying the gene. That prevented these mice, too, from getting diabetes after waxing tubby on a high-fat diet. This proto-drug could turn out to be therapeutic for human patients, as well, and Stanford’s Office of Technology Licensing is attempting to license the intellectual property related to the study.
It’s discoveries like this that spur Butte to spread the gospel of data mining. In the past 12 months, he’s given 28 invited talks in places as far-flung as London, Vienna and Seoul. It seems the word is getting out.
Butte says a 15-year-old can go to, say, the National Center for Biotechnology Information’s GEO (Gene Expression Omnibus) database, type in “breast cancer,” and get more than 31,000 experimental readouts — on more samples than any single breast-cancer researcher has ever put through a study — about as easily as chasing down a bunch of songs on iTunes.
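That search can be scripted as easily as clicked. The sketch below builds a query against NCBI's public E-utilities endpoint for the GEO DataSets index; the search term and result cap are placeholders, and fetching the resulting URL returns a list of matching record IDs:

```python
from urllib.parse import urlencode

# NCBI E-utilities: an esearch request against the GEO DataSets ("gds") index.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def geo_search_url(term, retmax=20):
    """Build an esearch URL for GEO; fetching it returns matching record IDs."""
    return EUTILS_ESEARCH + "?" + urlencode(
        {"db": "gds", "term": term, "retmax": retmax})

url = geo_search_url("breast cancer")
print(url)
```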
Just as any youthful geek might create a music “mash-up” by syncing the vocal track of one song with the instrumental track of another, you can pair up data sets and see if any interesting correlations jump out at you. There are innumerable ways to match them up. You can cross-compare gene expression against blood chemistry (e.g., pollutant or vitamin levels), or census data against patient reports or patient care (what the diagnosis was, what was prescribed, what procedures were performed, what medications the patient bought and how consistently he or she took them, and patient outcomes), and tell the computer to let you know what correlates with what.
That’s systems medicine, folks. And Butte et al.’s virtual-lab tests yield some very real-world results. For example, once you know which genes’ activity is elevated or depressed by a specific disease, and if in addition you know how those genes’ activity is amped up or tamped down by various drugs, you can perform the molecular equivalent of a Match.com search, pairing drugs and disease indications according to the time-honored “opposites attract” principle.
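A toy version of that “opposites attract” matching fits in a few lines. Everything here — the gene names, the ±1 signatures, the scoring rule — is an invented illustration of the idea, not Butte's actual algorithm:

```python
# A signature maps genes to +1 (activity elevated) or -1 (activity depressed).
def opposition_score(disease_sig, drug_sig):
    """Higher score = the drug pushes the disease's genes the opposite way."""
    shared = disease_sig.keys() & drug_sig.keys()
    return -sum(disease_sig[g] * drug_sig[g] for g in shared)

disease = {"GENE_A": +1, "GENE_B": -1, "GENE_C": +1}
drug_x  = {"GENE_A": -1, "GENE_C": -1}   # reverses the disease pattern
drug_y  = {"GENE_A": +1, "GENE_C": +1}   # mimics the disease pattern

print(opposition_score(disease, drug_x))  # 2: a candidate match
print(opposition_score(disease, drug_y))  # -2: the opposite of what we want
```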
In August 2011, Butte and his teammates published two papers in Science Translational Medicine describing how they showed just that. They used an algorithm designed in-house, which they made freely available to all researchers, hooking up drugs and diseases that had opposing effects on gene expression levels. The two papers detailed two separate such pairings: In one case, their algorithm predicted that a safe, old, off-patent ulcer drug, cimetidine, could be effective against lung adenocarcinoma, the most common form of lung cancer. In the other study, the digital Ouija board predicted that another safe, off-patent seizure drug, topiramate, could fight Crohn’s disease, an inflammatory autoimmune condition affecting the intestinal tract.
But Butte and his associates didn’t just stop at the formulation of a prediction. “At some point,” he says, “you have to turn off the computer and actually try it.” So they took the next step of testing their predictions in animal models. At first, they worked with other colleagues in the medical school who were equipped with animals and facilities. But to more rigorously verify their predictions, Butte wanted to avail himself of expertise not found on campus. For the topiramate study, he went online and found two companies that would conduct extensive trials in rats: induce a rat version of Crohn’s disease, administer the drug, then analyze its effect by performing rat colonoscopies, captured on videotape.
That procedure, to say the least, requires technical sophistication. “No one at Stanford knew how to do rat colonoscopies,” Butte says. Instead of picking just one company, Butte went with both, the better to demonstrate that the results, produced outside Stanford, were indeed reproducible. He clicked on “Add to Shopping Cart,” provided his credit card information, and it was off to the rat races. The resulting video footage, as well as more routine histological analyses, showed that topiramate worked even better than steroids, which also have all sorts of potentially nasty side effects.
That study, which came out in August 2011, landed with a splash. “Essentially every major pharmaceutical company and biotech called within a month,” says Butte. The academic community certainly took notice. “We’re getting a couple of citations each week.”
The federal government’s research establishment noticed, too. “Historically, people have looked mainly at what a drug does, for good or ill, in a particular organ — the eye, the liver — and maybe not asked what this drug does in other organs. Nobody would have guessed topiramate was going to be useful in an inflammatory bowel disease,” says Christopher Austin, MD, the director of the National Center for Advancing Translational Sciences’ division of preclinical innovation at the National Institutes of Health. “Of course, it remains to be seen whether this is going to work in humans. But a lot of the data Atul used to get his prediction was from humans. So these findings may turn out to be accurate.”
Those who originally generated the data that Butte mined “had no idea what Atul was going to do with it all,” Austin says. Butte’s studies, he adds, show the importance of making the data public and “letting smart people like Atul, who had nothing to do with generating the data, follow up.”
Long before the topiramate study was published, the molecular/malaise match-up algorithm was licensed by Stanford and became the core platform of a spin-off, NuMedii, cofounded in 2008 by Butte’s graduate student Joel Dudley, PhD, and Butte’s wife, Gini Deshpande, who has a PhD in molecular biology and biochemistry and whose background includes work for biotech companies and venture-capital firms.
The Mountain View, Calif., company “really took off last year,” says Deshpande. “Atul and his group showed that old drugs can have surprising new uses. But we’ve been fielding calls from companies asking us if we can help them identify a first use for a drug they’re developing.” It’s not always apparent which of perhaps four or five potential indications is the one in which a new drug is most likely to succeed, she says. And with the prospect of hundreds of millions of dollars going down the drain should a drug fail, the stakes in identifying the best drug/disease match are high.
Several other San Francisco Bay Area companies owe their existence to Butte’s entrepreneurial instincts and data-mining discoveries. Among them is Carmenta Bioscience, a startup that blends database searches with observations of protein activities to address maternal and fetal health problems. Another, Personalis, aims to vastly improve the medical accuracy with which personal genomic test results are interpreted.
You might say Butte is fond of companies. He also likes company. In May 2012, Butte gave a talk to a group of Northern California science writers. What was billed as a 40-minute talk and Q&A became a tour de force that lasted well over an hour, at which point the restaurant started closing the place down. Forced out into the hallway, Butte cheerfully fielded waves of questions from a score of new fans, blissfully unaware that he’d left behind one of his props, a demonstration microarray chip, until someone noticed it on a table and brought it out to him. He shrugged, smiled, gave his thanks and said, “Did I mention this is a commodity item?”