The comparison of samples belonging to different physiological states is vital in the search for putative biomarkers, and proteomics provides suitable methods for this purpose, through the quantitation of proteins. Proteomics provides some advantages over transcriptomics as it can both be used in cell-free biological fluids, such as serum, urine, and synovial fluid, and provide further knowledge such as through post-translational modifications. Quantitative methods, to identify the amounts of proteins, can also be applied, which can be an advantage over quantifying levels of gene expression, depending on the purpose of the study, as gene expression does not necessarily correlate with protein levels. However, the value of this technology is dependent on the quality of the analysis methods used to process the generated data (Bantscheff et al., 2007, 2012).
Machine learning techniques have been utilized broadly to analyze data from many areas of biology; in particular, various machine learning methods have been applied to data generated by the analytical techniques of transcriptomics and metabolomics for classification of unknown samples and identification of genes relevant to the disease state. Similar methods are now being applied to the field of proteomics and, more specifically, the analysis of data generated from tandem mass spectrometry (Sun and Markey, 2011).
There are numerous technologies used to extract quantitative protein information from biological samples. These techniques cover a broad spectrum of approaches, balancing throughput (one/many proteins at a time) and quality of the extracted data.
Commonly used techniques include two-dimensional gel electrophoresis, enzyme-linked immunosorbent assays (ELISAs), protein arrays, affinity separation, and mass spectrometry-based technologies (Ray et al., 2011; Schiess et al., 2009).
A number of these methods, including gels and ELISAs, are limited in the number of proteins they can analyze because of time requirements. They also require specific proteins of interest to be chosen when designing the study and suitable cross-reactive antibodies to be available; this can be challenging for non-model organisms. In comparison, mass spectrometry (MS) techniques can be used as a high-through-put discovery based method; lists of proteins can be identified from samples that are analyzed (Perkins et al., 1999). This means that tandem mass spectrometry can be used to find proteins that may not have previously been considered, provided the proteins can be found within protein sequence databases. A combination of multiple proteomics methods can also be utilized to form an effective analysis pipeline. Ray et al. (2011) showed that MS, in various guises, has been pivotal to biomarker discovery for a range of different diseases. This review will discuss the applications of machine learning for analysis of proteomic mass spectrometry data and the challenges involved. MS-specific challenges, including the identification of proteins using sequence database searching software and protein quantitation or pre-processing for peak analysis, will be covered. As will machine learning-specific considerations, such as the small numbers of samples that can often result from an MS investigation, and the types of machine learning most suited for the required task, either sample classification or biomarker identification. A survey of articles involving the combination of mass spectrometry and machine learning will be followed by a brief discussion of post-machine learning analysis, including literature mining and pathway analysis. An overview of the areas to be covered in this review is provided in Figure 1.
To read the rest of this article, CLICK HERE.