Without question, next-generation sequencing (NGS) has been immensely successful, even if it hasn’t transformed medicine just yet. Sequencing itself has become nearly trivial, and declining costs have made it widely accessible. Recently, the rise of affordable benchtop NGS instruments promises to democratize the technology further, extending it from large sequencing centers into smaller labs and clinics.
Today, the bigger challenge is deciphering the vast NGS datasets of wide-ranging data types—RNA-Seq, ChIP-Seq, exome, etc.—in order to inform biomedical questions ranging from evolutionary heritage to the functioning of cellular molecular machinery. NGS data analysis was among the many topics discussed at last month’s “NGX: Applying Next-Generation Sequencing” meeting.
Integrating various NGS data into networks that are both manageable in size and likely to be true was the core of a talk from MIT’s Ernest Fraenkel, Ph.D., associate professor of biological engineering. Interpreting high-throughput data, he noted, can be like reading “The Hitchhiker’s Guide to the Galaxy,” in which the answer to the ultimate question of life, the universe, and everything turns out to be the number 42.
“Suddenly you realize you didn’t understand what the question was,” Dr. Fraenkel said. “That’s often true of high-throughput data. We get different answers and we don’t know what they mean. Our integrative approach is to try to discover a biological process that gives rise to the experimental data we detect.”
Two recent papers describe his approach, which does not rely on published literature or traditional pathway analysis. Rather, the method uses only physical interaction data. The basic idea is to connect those interactions into true biological networks of manageable size. Dr. Fraenkel uses a graph-based Prize-Collecting Steiner Tree (PCST) algorithm to build the networks.
Drawing upon a compendium of roughly a quarter-million experimentally reported physical interactions, together with the data from a given NGS experiment, this PCST-based method identifies highly probable networks. Every interaction in the compendium is assigned a probability based on reliability factors such as the experimental method used and the number of times it has been reported. The number of possible network connections is huge; PCST winnows them down.
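The article doesn’t spell out how those reliability factors are weighted, but a confidence score of this general shape can be sketched in a few lines. Everything below (the per-method reliability values, the independence assumption, and the combining rule) is a hypothetical illustration of how such probabilities might be assigned, not the group’s actual model:

```python
# Hypothetical reliability scores for common interaction-detection methods;
# these numbers are illustrative only.
METHOD_RELIABILITY = {
    "yeast_two_hybrid": 0.35,
    "coimmunoprecipitation": 0.60,
    "crystal_structure": 0.95,
}

def interaction_confidence(methods_and_counts):
    """Combine repeated, independent reports into one probability.

    Treats each reported detection as independent evidence and returns
    the probability that at least one report is correct:
        p = 1 - prod((1 - r_method) ** n_reports)
    so an interaction seen many times, or by a reliable method, scores high.
    """
    p_all_wrong = 1.0
    for method, n_reports in methods_and_counts.items():
        r = METHOD_RELIABILITY[method]
        p_all_wrong *= (1.0 - r) ** n_reports
    return 1.0 - p_all_wrong

# An interaction reported twice by yeast two-hybrid and once by co-IP:
p = interaction_confidence({"yeast_two_hybrid": 2, "coimmunoprecipitation": 1})
print(round(p, 3))  # → 0.831
```

Under this toy rule, even a low-reliability method accumulates confidence if it reports the same interaction repeatedly, which matches the article’s point that both the method used and the number of reports feed into the probability.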
“You collect as many prizes—high-confidence interactions—as you can in the final network,” he said. “But if you just do that, you still get a hairball”—a dense, unusable network. So, “you tell it one more thing. You say every time you use an edge to connect something, you have to pay a price for the edge, and the price goes up the less reliable it is,” Dr. Fraenkel continued. “High-confidence edges are cheap; low-confidence edges are expensive. You ask it to collect as many prizes as possible while paying as little as possible for those edges. That forces the algorithm to decide whether or not it’s worth connecting something through a chain of edges to reach other data points—that whole chain of connections has to be really high confidence.”
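The trade-off Dr. Fraenkel describes can be made concrete on a toy graph. In the sketch below, every detail (the five-node interactome, the confidence values, the prize scores, and the brute-force search) is an illustrative assumption rather than his actual implementation; the point is the objective itself, where edge cost is taken as -log(confidence) so that high-confidence edges are cheap:

```python
import itertools
import math

# Toy interactome: edges with confidence p in (0, 1]. Edge cost = -log(p),
# so high-confidence edges are cheap and low-confidence edges expensive.
edges = {
    ("A", "B"): 0.90,
    ("B", "C"): 0.80,
    ("C", "D"): 0.30,
    ("A", "D"): 0.95,
    ("D", "E"): 0.60,
    ("B", "E"): 0.20,  # a tempting low-confidence shortcut to E
}
# Node "prizes", e.g. hit scores from a hypothetical NGS experiment.
prizes = {"A": 2.0, "B": 0.0, "C": 1.5, "D": 0.0, "E": 1.0}

def is_tree(edge_set):
    """True if the edges form a single connected, acyclic component."""
    nodes = {n for e in edge_set for n in e}
    if not edge_set or len(edge_set) != len(nodes) - 1:
        return False  # a tree on k nodes has exactly k - 1 edges
    seen, stack = set(), [next(iter(nodes))]
    while stack:                      # flood fill to check connectivity
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        for u, v in edge_set:
            if u == n:
                stack.append(v)
            elif v == n:
                stack.append(u)
    return seen == nodes

def score(edge_set, lam=1.0):
    """PCST objective: collected prizes minus (scaled) edge costs."""
    nodes = {n for e in edge_set for n in e}
    return sum(prizes[n] for n in nodes) - lam * sum(
        -math.log(edges[e]) for e in edge_set
    )

# Brute force over all edge subsets: feasible only on toy graphs. The real
# problem is NP-hard, which is why practical tools use dedicated PCST solvers.
best = max(
    (frozenset(s)
     for r in range(1, len(edges) + 1)
     for s in itertools.combinations(edges, r)
     if is_tree(frozenset(s))),
    key=score,
)
print(sorted(best))
```

The winning tree collects all three prize nodes but reaches E via the longer high-confidence route A-D-E rather than the single cheap-looking, low-confidence edge B-E: exactly the behavior described in the quote, where a whole chain of connections must be high confidence to justify its cost.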
Dr. Fraenkel and his colleagues have set up a website with links to several tools, including their PCST tool. “You can upload a list of genes and press a button and it sends an email back when it’s solved the problem,” he said.