By Rosanne Spector
Illustration by Doug Ross
The idea behind most diagnostic tests is simple: Identify a telltale chemical and look for it in a blood sample. The PSA test for prostate cancer is the best-known cancer diagnostic, but diagnostics exist for other cancers too, ovarian and colorectal among them. And while the tests are not infallible, they can help find hard-to-detect, early-stage cancers and monitor treatment.
So it takes years of hard work and serious cash to create one of these “simple” tests, right? Not anymore.
“All you really need is a computer browser and Excel,” says computer scientist Purvesh Khatri, PhD, who, working with Atul Butte, MD, PhD, associate professor of systems medicine in pediatrics, identified telltale chemicals (aka biomarkers) for three types of cancer all in the span of one year.
How was this possible? By analyzing some of the vast amount of genetic information from tumor cell samples that has been amassed over the past decade in free, publicly accessible databases, and by outsourcing the lab work.
“We say ‘outsource everything except the genius,’” says Butte. “You come up with the question and the target, and let everyone else do the work.”
As Khatri walked me through the discovery process, I learned there’s a little more to it than that. Some work and cash are involved, not to mention high-school-level biology. And basic statistics will be a big help. But with those tools, skills and about five days’ work, plus $4,000 to confirm through blood tests, you’re on your way.
You’ll need a computer that runs Windows, a browser, an Internet connection and a spreadsheet program like Microsoft Excel.
Go to an online repository that offers gene expression data from cancer tumors. We’ll use the National Center for Biotechnology Information’s GEO database for our example.
At GEO, enter a type of cancer in the DataSets window, surround it with quotation marks (for instance, “pancreatic cancer”) and click “go.” That pulls up a list of experiments. You’ll want to download the data from five to 10 of these; the more data, the better. Here’s what to look for: experiments with mRNA results (not miRNA or SNPs) with lots of samples, including some from non-tumor cells. To download, click on the record number, and in the new window click on “Series Matrix Files.” You’ll get a file that you can open in Microsoft Excel or another spreadsheet program. Do this for each experiment.
Resource: NCBI GEO (ncbi.nlm.nih.gov/geo)
Time needed: Up to two hours
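If you would rather script this step than open each file by hand, a series matrix file is plain tab-separated text: metadata lines start with “!”, and the expression table sits between the lines “!series_matrix_table_begin” and “!series_matrix_table_end”. The Python sketch below reads one such file into memory. It is only an illustration, not part of Khatri’s workflow; the function name, the toy file contents and the sample and probe IDs are all made up for the example.

```python
import csv
import io


def to_float(value):
    """Convert a table cell to a float, or None if it is blank or non-numeric."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_series_matrix(text):
    """Split series matrix text into (metadata, sample_ids, expression).

    metadata:   field name -> list of rows (a field can appear more than once)
    sample_ids: column headers of the expression table
    expression: row ID -> list of values, one per sample
    """
    metadata, sample_ids, expression = {}, [], {}
    in_table = False
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or not row[0]:
            continue  # skip blank lines
        key = row[0]
        if key == "!series_matrix_table_begin":
            in_table = True
        elif key == "!series_matrix_table_end":
            in_table = False
        elif in_table and key == "ID_REF":
            sample_ids = row[1:]
        elif in_table:
            expression[key] = [to_float(v) for v in row[1:]]
        elif key.startswith("!"):
            metadata.setdefault(key[1:], []).append(row[1:])
    return metadata, sample_ids, expression


# Toy file contents (hypothetical IDs and values, for illustration only).
SAMPLE = (
    '!Series_title\t"Toy pancreatic cancer experiment"\n'
    '!Sample_title\t"tumor 1"\t"tumor 2"\t"normal 1"\n'
    "!series_matrix_table_begin\n"
    '"ID_REF"\t"GSM_A"\t"GSM_B"\t"GSM_C"\n'
    '"probe_A"\t9.1\t8.7\t4.2\n'
    '"probe_B"\t5.0\t\t5.3\n'
    "!series_matrix_table_end\n"
)

metadata, sample_ids, expression = parse_series_matrix(SAMPLE)
print(sample_ids)
print(expression["probe_A"])
```

For a real download you would pass in the text of the file itself, for example `open(path, encoding="utf-8").read()` after unzipping it.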
Take a look at the files you’ve downloaded. At the top of the file is general information about the experiment. Below that is the table, in which each row gives the expression level of one mRNA across the samples. Now read through the info at the top (or use the “find” function) to figure out whether at least some of the samples were taken from healthy tissue, and whether the results have been normalized. If the experiment had no healthy tissue samples, it’s out of the running. You need healthy samples for comparison. If the data aren’t normalized, either take the experiment out of the running or normalize the data yourself. Good, free software for normalizing results is available at the open-source bioinformatics tools site Bioconductor.
Time needed: Maximum eight hours
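The screening described above can be roughed out in a few lines of Python. This is a crude sketch under stated assumptions, not a substitute for reading the experiment description: the keyword list is a guess that will need adjusting for each experiment, the 100 cutoff simply echoes the “hundreds or thousands” rule of thumb for unlogged values, and comparing per-sample medians is only a quick hint about normalization (samples that were normalized together tend to have similar medians). The sample descriptions and numbers are invented.

```python
from statistics import median

# Assumed keywords; a substring match is crude ("abnormal" contains "normal"),
# so always check the labels by eye.
NORMAL_KEYWORDS = ("normal", "healthy", "non-tumor")


def flag_normal(descriptions):
    """Return True for each sample whose description mentions a normal-tissue keyword."""
    return [any(k in d.lower() for k in NORMAL_KEYWORDS) for d in descriptions]


def median_spread(columns):
    """Gap between the largest and smallest per-sample median."""
    medians = [median(col) for col in columns]
    return max(medians) - min(medians)


def looks_unlogged(columns, threshold=100.0):
    """True if any value exceeds the threshold, hinting the data are not log-transformed."""
    return any(v > threshold for col in columns for v in col)


# Hypothetical experiment: three samples, three genes each.
descriptions = [
    "pancreatic tumor, patient 1",
    "pancreatic tumor, patient 2",
    "normal pancreas, donor 1",
]
columns = [[9.1, 5.0, 7.2], [8.7, 5.4, 7.0], [4.2, 5.3, 7.1]]

is_normal = flag_normal(descriptions)
if not any(is_normal):
    print("No healthy samples found: drop this experiment")
print("normal samples:", is_normal)
print("median spread:", round(median_spread(columns), 2))
print("looks unlogged:", looks_unlogged(columns))
```

A large median spread relative to the overall range of the values is a reason to look harder at whether the experiment was normalized, not proof either way.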
Next you’ll look for mRNA sequences that are produced at high levels in tumor cells but not in ordinary ones. This is the part where some knowledge of statistics is helpful: You’ll need to run a t-test for each experiment. Luckily, Stanford professor Rob Tibshirani, PhD, developed SAM, a free Excel plug-in that makes it easy. So download SAM, open Excel and click on the SAM ribbon. Choose “two class unpaired” as the response type, select “unlogged” (if any of the values are in the hundreds or thousands) or “logged” (if they’re not) and click OK for a chart of the results. Set the delta value that corresponds to a 5 percent false discovery rate, click “list significant genes” and then you’ll see them: Genes that are over-expressed in tumors compared with normal samples will be red. These are the ones to choose from. Pick those with the highest fold change (greater than 2 is best) and the lowest false discovery rate.
Now check whether each of the finalists is over-expressed in tumor tissue in the majority of the experiments. Rule out those that appear in just a few. Those that remain are your contenders for the diagnostic test.
Resource: SAM (http://www-stat.stanford.edu/~tibs/SAM/)
Time needed: Maximum 24 hours
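To see what this step is doing under the hood, here is a bare-bones Python sketch of the same idea: score each gene in each experiment, keep those with a fold change above 2, then keep only the genes that pass in a majority of experiments. It is not SAM. SAM uses its own modified statistic and estimates the false discovery rate by permutation, while this sketch swaps in an ordinary Welch’s t statistic with an arbitrary cutoff of 3 and does no false discovery rate control at all. The gene names and values are invented, and the values are assumed to be unlogged.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(tumor, normal):
    """Welch's t statistic for two unpaired groups (positive when tumor > normal).

    Raises ZeroDivisionError if both groups have zero variance.
    """
    se = sqrt(variance(tumor) / len(tumor) + variance(normal) / len(normal))
    return (mean(tumor) - mean(normal)) / se


def fold_change(tumor, normal):
    """Mean tumor expression divided by mean normal expression (unlogged values)."""
    return mean(tumor) / mean(normal)


def candidates(experiment, min_fold=2.0, min_t=3.0):
    """Genes over-expressed in tumors in one experiment.

    experiment maps gene -> (tumor_values, normal_values).
    min_t is an assumed cutoff, not a 5 percent false discovery rate.
    """
    return {
        gene
        for gene, (tumor, normal) in experiment.items()
        if fold_change(tumor, normal) > min_fold and welch_t(tumor, normal) > min_t
    }


def in_majority(experiments, **cutoffs):
    """Genes that are candidates in more than half of the experiments."""
    counts = {}
    for experiment in experiments:
        for gene in candidates(experiment, **cutoffs):
            counts[gene] = counts.get(gene, 0) + 1
    return sorted(g for g, n in counts.items() if n > len(experiments) / 2)


# Three hypothetical experiments: gene -> (tumor values, normal values).
exp1 = {"GENE_A": ([400, 420, 380], [100, 110, 90]),
        "GENE_B": ([100, 105, 95], [98, 102, 100])}
exp2 = {"GENE_A": ([300, 330, 270], [120, 130, 110]),
        "GENE_B": ([250, 260, 240], [100, 95, 105])}
exp3 = {"GENE_A": ([500, 520, 480], [200, 210, 190]),
        "GENE_B": ([100, 110, 90], [100, 105, 95])}

print(in_majority([exp1, exp2, exp3]))  # ['GENE_A']
```

In the toy data, GENE_B clears both cutoffs in only one of three experiments, so it is ruled out, which is exactly the kind of one-off result the majority check is meant to catch.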
Since you’re developing a diagnostic that tests for tumors, you want to make sure the gene is not produced elsewhere in the body. Go to gene portal BioGPS, type each mRNA ID into the search box and look at the result across body tissues. Ideally, your gene is not made elsewhere, showing a level close to zero.
Time needed: Five minutes or less
You’ll need blood samples from 10 cancer patients and from 10 healthy patients for comparison. You can order these online. Two reliable companies are Conversant Bio and US Biomax.
Cost: About $50 each, total $1,000
Time needed: A week or two for delivery
Then ship the samples to a contract lab to assay them using an ELISA test based on your top mRNA pick. Two reliable companies are Assay Depot and Science Exchange.
Cost: $2,000 to $3,000 per gene
Time needed: Four to six weeks’ wait
If the assays find the protein in the cancer patients’ blood samples and not in the others’, you have found yourself a biomarker, the basis for a diagnostic test. Now, you can get out the word by publishing your results. A peer-reviewed journal is best, though competitive. Try a publication related to the type of cancer you’ve singled out, or a general open-access journal like PLoS One.