Mining for Multivariate Markers
One of the biggest changes in recent years is the migration to multivariate biomarkers. It is becoming clear that the serum proteome does not offer many clear, individual biomarkers of disease, and that further advances will require looking for panels or profiles that can track changes in a number of molecules simultaneously.
Darius Dziuda, Ph.D., a professor in the department of mathematical sciences at Central Connecticut State University, is working to educate scientists about using data-mining methods to identify multivariate biomarkers. Using a multivariate approach is critically important, according to Dr. Dziuda.
“Too many studies are still limited to the univariate approach. If some of them result in efficient classifiers, it’s OK. However, the univariate approach not only neglects correlations between genes, but also removes from consideration genes that are not significant univariately but are very important in combination with other genes.”
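A small worked example can make this point concrete. The sketch below is not taken from Dr. Dziuda's work; the two genes, their values, and the helper `best_cutoff_accuracy` are hypothetical and chosen only for illustration. Each gene alone barely beats chance with any single cutoff, yet the difference between the two genes separates the classes perfectly.

```python
def best_cutoff_accuracy(values, labels, positive="disease"):
    """Best accuracy any single-cutoff rule achieves on one numeric feature."""
    n = len(values)
    best = 0.0
    for cutoff in sorted(set(values)):
        # Rule: call a sample positive when its value is <= cutoff.
        correct = sum((v <= cutoff) == (lab == positive)
                      for v, lab in zip(values, labels))
        # The mirrored rule (positive when value > cutoff) gets the rest right.
        best = max(best, correct / n, (n - correct) / n)
    return best


# Hypothetical expression values: (gene_a, gene_b, class label).
samples = [
    (1.0, 1.3, "disease"), (2.0, 2.3, "disease"),
    (3.0, 3.3, "disease"), (4.0, 4.3, "disease"),
    (1.3, 1.0, "healthy"), (2.3, 2.0, "healthy"),
    (3.3, 3.0, "healthy"), (4.3, 4.0, "healthy"),
]
gene_a = [a for a, _, _ in samples]
gene_b = [b for _, b, _ in samples]
labels = [lab for _, _, lab in samples]

# Univariate view: each gene on its own.
acc_a = best_cutoff_accuracy(gene_a, labels)
acc_b = best_cutoff_accuracy(gene_b, labels)
# Multivariate view: a joint pattern, here the difference gene_b - gene_a.
acc_joint = best_cutoff_accuracy(
    [b - a for a, b in zip(gene_a, gene_b)], labels
)

print(acc_a, acc_b, acc_joint)  # 0.625 0.625 1.0
```

A univariate filter would likely discard both genes, since neither separates the classes well alone, and the perfectly discriminating pair would never be examined.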
Using a multivariate approach means looking for a set of genes or variables that can differentiate between classes or disease states. The focus of Dr. Dziuda’s paper at the Barcelona meeting will be his methods for identification of stable multivariate biomarkers. “First, using heuristic multivariate methods, we identify the informative set of genes that includes all significant discriminatory information. There are typically a few hundred genes in such a set. Some of them are univariately significant, others could not be identified by univariate methods.
“Then, we build a large number of bootstrap-based classifiers, which are used to vote for variables and to identify the most important expression patterns. Finally, feature selection performed on these patterns leads to small multivariate biomarkers that are stable and biologically interpretable.”
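The voting step can be sketched in a few lines. The article does not specify which classifiers or selection heuristics are used, so everything below is an assumption made for illustration: the toy data generator, the stratified bootstrap, and the use of a two-gene Fisher-style separation score as the "classifier" built on each resample. Each resample votes for the genes in its best-scoring pair, and the vote totals indicate which genes are selected most consistently.

```python
import random
from collections import Counter
from itertools import combinations
from statistics import mean


def make_toy_data(rng, n_per_class=20, n_genes=6):
    """Hypothetical data: genes 0 and 1 separate the classes only jointly."""
    rows, labels = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            latent = rng.gauss(0.0, 2.0)  # shared signal, so genes 0 and 1 correlate
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]  # noise genes
            row[0] = latent + rng.gauss(0.0, 0.1)
            row[1] = latent + rng.gauss(0.0, 0.1) + (1.0 if label == 1 else 0.0)
            rows.append(row)
            labels.append(label)
    return rows, labels


def pair_score(rows, labels, i, j):
    """Fisher-style separation of the two classes using genes i and j together."""
    groups = {0: [], 1: []}
    for row, label in zip(rows, labels):
        groups[label].append((row[i], row[j]))
    means = {c: (mean(x for x, _ in pts), mean(y for _, y in pts))
             for c, pts in groups.items()}
    # Pooled within-class covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sxy = syy = 0.0
    for c, pts in groups.items():
        mx, my = means[c]
        for x, y in pts:
            sxx += (x - mx) ** 2
            sxy += (x - mx) * (y - my)
            syy += (y - my) ** 2
    dof = len(rows) - 2
    # Tiny ridge on the diagonal keeps the matrix invertible.
    sxx, sxy, syy = sxx / dof + 1e-9, sxy / dof, syy / dof + 1e-9
    det = sxx * syy - sxy * sxy
    dx = means[1][0] - means[0][0]
    dy = means[1][1] - means[0][1]
    # d^T W^-1 d for a 2x2 matrix, written out explicitly.
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


def bootstrap_votes(rows, labels, n_boot=100, seed=0):
    """Count how often each gene appears in the best pair across resamples."""
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    n_genes = len(rows[0])
    votes = Counter()
    for _ in range(n_boot):
        # Stratified bootstrap: resample with replacement inside each class.
        sample = [rng.choice(idxs) for idxs in by_class.values() for _ in idxs]
        b_rows = [rows[k] for k in sample]
        b_labels = [labels[k] for k in sample]
        best_pair = max(combinations(range(n_genes), 2),
                        key=lambda p: pair_score(b_rows, b_labels, *p))
        votes.update(best_pair)  # one vote for each gene in the winning pair
    return votes


rows, labels = make_toy_data(random.Random(42))
votes = bootstrap_votes(rows, labels)
print(votes.most_common())  # genes ranked by how often they were selected
```

In this toy setup genes 0 and 1 are weak individually but strong together, so they should collect nearly all of the votes. A real analysis would use far more genes, larger gene subsets, and a search heuristic instead of the exhaustive pair scan used here.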
The next step is validating the resulting multivariate biomarker using external data. Validation of biomarkers is a somewhat contentious subject. There is an argument to be made that a biomarker panel does not need to be validated or mechanistically characterized in order to be useful—that the pattern alone is sufficient for clinical or research purposes.
It is becoming increasingly apparent, however, that making the best use of a set of genes requires understanding their functions and relationships. (The function of an individual molecule does not necessarily translate to the biological interpretation of the set.) So, while multivariate biomarkers can be useful without a biological context, that situation will inevitably be temporary.
One of the projects Dr. Dziuda has finished uses publicly available data from acute lymphoblastic leukemia. “After filtering noise, we had about 7,000 genes. The informative set of genes included about 200 genes. Using ensembles of classifiers, we identified the most frequently used genes and the most important expression patterns. Then, heuristic feature selection identified a multivariate biomarker of five genes.
“This biomarker worked well on independent test data. This and other case studies indicate that this approach works very well and results in robust multivariate biomarkers.” The method is applicable not just to early diagnosis of disease but also to prognosis, therapeutic response, and many other situations.
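Checking a biomarker against independent test data can be sketched as follows. This is a minimal illustration, not the procedure used in the leukemia study: the expression values, the two-gene `biomarker`, and the nearest-centroid classifier are all hypothetical. The essential point is that the test samples play no part in choosing the genes or fitting the classifier.

```python
from math import dist
from statistics import mean

# Hypothetical two-gene biomarker: column indices into each expression row.
biomarker = [0, 2]

# Toy training set: (expression row, class label).
train = [
    ([1.0, 9.0, 1.0], "A"), ([2.0, 3.0, 1.0], "A"), ([1.0, 7.0, 2.0], "A"),
    ([5.0, 8.0, 5.0], "B"), ([6.0, 2.0, 5.0], "B"), ([5.0, 6.0, 6.0], "B"),
]
# Independent test set, never used while selecting genes or fitting.
test = [
    ([1.5, 4.0, 1.5], "A"), ([2.0, 8.0, 2.0], "A"),
    ([5.5, 1.0, 5.5], "B"), ([4.0, 5.0, 5.0], "B"), ([3.0, 6.0, 3.0], "B"),
]


def project(row):
    """Keep only the biomarker genes from a full expression row."""
    return [row[g] for g in biomarker]


def fit_centroids(data):
    """Average the biomarker genes within each class."""
    by_class = {}
    for row, label in data:
        by_class.setdefault(label, []).append(project(row))
    return {label: [mean(col) for col in zip(*pts)]
            for label, pts in by_class.items()}


def predict(centroids, row):
    """Assign the class whose centroid is closest in biomarker space."""
    point = project(row)
    return min(centroids, key=lambda label: dist(point, centroids[label]))


centroids = fit_centroids(train)
hits = sum(predict(centroids, row) == label for row, label in test)
test_accuracy = hits / len(test)
print(test_accuracy)  # 0.8
```

One test sample is deliberately placed nearer the wrong centroid, so the held-out accuracy is 0.8. Training data alone cannot reveal that kind of shortfall.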
“Whenever you have a case that has a number of classes that are not that easy to differentiate, it is possible that there’s a multivariate gene- or protein-expression pattern that can be used for efficient classification.”