As with any systems-biology effort, metabolomics faces a challenge from the inherent biological noise of the system under analysis: the signal-to-noise conundrum.
Successfully identifying and quantifying readily detectable small-molecule biochemicals and metabolites in a set of biological samples (e.g., disease vs. nondisease) requires a robust, repeatable protocol: uniform collection and processing of samples plus rigorous quality-control systems, so that increased and decreased compound levels can be distinguished with a low process coefficient of variation.
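The process coefficient of variation mentioned above is straightforward to compute from repeated injections of a pooled quality-control sample. A minimal sketch, with hypothetical peak-area values:

```python
import numpy as np

def process_cv(qc_intensities):
    """Percent coefficient of variation for one compound across
    repeated QC injections (sample standard deviation / mean)."""
    qc = np.asarray(qc_intensities, dtype=float)
    return 100.0 * qc.std(ddof=1) / qc.mean()

# Hypothetical peak areas for one metabolite in six pooled-QC injections
cv = process_cv([10400, 9800, 10150, 10600, 9900, 10250])
```

A compound whose QC replicates scatter only a few percent around the mean, as here, can be trusted to report real fold changes between disease and nondisease groups.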
Because mass spectrometry is such a sensitive bioanalytical process, discovery of disease biomarkers (subclinical disease, disease presence, or progression) by biochemical profiling requires well-powered clinical studies with appropriate controls as well as balanced demographic parameters. In addition to test samples, quality-control samples are prepared and incorporated into a LIMS tracking system.
Test and control samples are processed and analyzed side by side with a run-order randomization protocol, with QC and QA samples representing about 30% of the sample set. Such quality-control samples include process and solvent blanks, recovery and internal standards, derivatization standards, library reference compounds, and dilutions of test samples in the appropriate test matrix.
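One way to realize such a run-order randomization with interleaved QC injections is sketched below; the sample names, QC types, and one-QC-per-two-test-samples spacing (which yields roughly the 30% QC fraction described above) are illustrative assumptions, not a prescribed schedule.

```python
import random

def build_run_order(test_samples, qc_samples, seed=0):
    """Randomize the run order of test samples, then interleave QC
    injections (roughly one per two test samples, ~30% of the run)."""
    rng = random.Random(seed)
    order = list(test_samples)
    rng.shuffle(order)  # randomization breaks any link between group and run position
    run = []
    remaining_qc = list(qc_samples)
    for i, sample in enumerate(order, 1):
        run.append(sample)
        if i % 2 == 0 and remaining_qc:
            run.append(remaining_qc.pop(0))
    return run

# Hypothetical injection sequence: 10 test samples plus 5 QC injections
run = build_run_order(
    [f"S{i}" for i in range(10)],
    ["solvent_blank", "pooled_QC", "internal_std", "process_blank", "matrix_dilution"],
)
```

Randomizing before interleaving ensures that instrument drift over the run affects disease and control samples equally rather than confounding the comparison.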
For consistent isolation of the small-molecule biochemicals in each sample, sample preparation is a straightforward protein precipitation with an organic solvent. After the proteins are pelleted, the supernatant is divided into aliquots that are dried down; one is reconstituted in an acidic buffer (UHPLC, positive ionization), another in a basic buffer (UHPLC, negative ionization), and another is subjected to trimethylsilylation derivatization (GC).
Basic molecules (e.g., some amino acids, sugars, nucleotides, carnitines, and phospholipids) ionize efficiently in positive ionization mode while acidic molecules (e.g., some amino acids, phosphates, sulfates, fatty acids, and steroids) ionize efficiently in negative ionization mode. GC covers compounds too hydrophobic or polar for the UHPLC (including small organic acids, diacyl lipids, some amino acids, and certain sugars).
The wide variety of ionic species measured in this process appears as separated peaks, which are integrated and quantified as they elute from the column. Associated with each peak is a mass spectrum, which the software compares against a database containing thousands of standard biochemical mass spectra. Using sophisticated algorithms, the spectra are filtered to reduce noise and to positively identify each biochemical in the sample.
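The source does not specify the matching algorithm, but a common approach to comparing an observed spectrum against library spectra is cosine similarity of binned intensity vectors. A minimal sketch, with a hypothetical spectrum format of (m/z, intensity) pairs:

```python
import numpy as np

def cosine_match(query, reference, bin_width=1.0, max_mz=500.0):
    """Score a query spectrum against a library spectrum by cosine
    similarity of intensity vectors binned along the m/z axis."""
    n_bins = int(max_mz / bin_width)

    def binned(spectrum):
        v = np.zeros(n_bins)
        for mz, intensity in spectrum:
            v[min(int(mz / bin_width), n_bins - 1)] += intensity
        return v

    q, r = binned(query), binned(reference)
    denom = np.linalg.norm(q) * np.linalg.norm(r)
    return float(q @ r / denom) if denom else 0.0

# Hypothetical library entry: three fragment ions
lib_entry = [(89.0, 100.0), (145.1, 40.0), (263.2, 15.0)]
score_self = cosine_match(lib_entry, lib_entry)          # identical pattern
score_disjoint = cosine_match([(50.0, 10.0)], lib_entry) # no shared fragments
```

A score near 1 supports a positive identification, while low scores are left for the data-curation step to reject as false positives.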
Data curation involves exclusion of false positives and confirmatory identification and relative quantitation of authentic, discrete biochemicals in each sample, referenced against the automated software calculations. Subsequent statistical analyses of the curated data are the next steps in the biomarker discovery and selection process, ultimately leading to biological interpretations and further mechanistic understanding of metabolic pathways under normal and diseased states.
Curated data with relative quantitation from the discovery screening process undergoes univariate statistical analysis to assess the statistical strength of individual small molecules as potential biomarker candidates.
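A per-metabolite univariate screen of this kind is often a two-group test with multiple-testing correction. The sketch below uses Welch's t-test and Benjamini-Hochberg FDR control on simulated data; the specific test and the data layout (samples x metabolites, log-scaled intensities) are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def univariate_screen(disease, control):
    """Welch's t-test per metabolite (columns of samples x metabolites
    arrays), with Benjamini-Hochberg FDR correction of the p-values."""
    _, p = stats.ttest_ind(disease, control, equal_var=False, axis=0)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)   # BH step-up adjustment
    q = np.empty(m)
    q[order] = np.minimum.accumulate(ranked[::-1])[::-1]
    return p, np.clip(q, 0.0, 1.0)

# Simulated cohort: metabolite 0 is elevated in disease, the rest are noise
rng = np.random.default_rng(0)
disease = rng.normal(0.0, 1.0, (20, 5))
control = rng.normal(0.0, 1.0, (20, 5))
disease[:, 0] += 3.0
p_values, q_values = univariate_screen(disease, control)
```

Metabolites surviving the FDR threshold become the individual biomarker candidates carried into the multivariate analyses.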
To look for additive or synergistic biomarkers, subsequent multivariate statistical analyses are carried out on the data including variable selection procedures such as Random Forest and LASSO regression analyses. Top-ranked variables and models that appear the most frequently are selected.
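The Random Forest and LASSO selection steps can be sketched with scikit-learn; intersecting the two rankings is one simple stand-in (an assumption here) for keeping the variables that "appear the most frequently" across methods, which real pipelines assess over repeated resamples.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LassoCV

def select_features(X, y, top_k=5):
    """Rank metabolites by Random Forest importance and by nonzero
    cross-validated LASSO coefficients; keep variables both surface."""
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    rf_top = set(np.argsort(rf.feature_importances_)[::-1][:top_k])
    lasso = LassoCV(cv=5, random_state=0).fit(X, y)
    lasso_top = set(np.flatnonzero(lasso.coef_))
    return sorted(rf_top & lasso_top)

# Simulated data: metabolites 0 and 1 carry the disease signal
rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, (100, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0.0, 0.5, 100) > 0).astype(int)
selected = select_features(X, y)
```

Requiring agreement between a tree-based and a sparse linear method is one guard against any single algorithm's selection bias.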
Multiple biomarker candidates at this stage are selected for targeted analysis by the three aforementioned mass spectrometric methods, yielding analytical quantitation. Absolute quantitation entails running stable isotope-labeled internal standards and calibration standard samples in order to construct calibration curves and calculate the biomarker concentrations in the test samples.
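The calibration-curve calculation reduces to fitting the analyte-to-internal-standard peak-area ratio against the spiked standard concentrations, then back-calculating unknowns from the fit. A minimal linear sketch with hypothetical numbers:

```python
import numpy as np

def quantify(cal_conc, cal_ratio, sample_ratio):
    """Fit a linear calibration curve of analyte/internal-standard
    peak-area ratio vs. concentration, then back-calculate unknowns."""
    slope, intercept = np.polyfit(cal_conc, cal_ratio, 1)
    return (np.asarray(sample_ratio, dtype=float) - intercept) / slope

# Hypothetical calibration standards (1-100 uM) with responses
# normalized to a stable isotope-labeled internal standard
concentrations = quantify(
    [1, 5, 10, 50, 100],
    [0.02, 0.10, 0.21, 1.01, 1.99],
    [0.50],  # area ratio measured in an unknown test sample
)
```

Normalizing each analyte response to its co-eluting labeled internal standard cancels injection-to-injection and matrix variability before the curve is fit.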
The best biomarker candidates and biomarker algorithms are identified from the confirmatory targeted-data results using a multitude of statistical techniques including, but not limited to, multiple linear, logistic, and spline regression models. These models are evaluated, and the best ones are rationally selected.
The biomarker algorithms derived from a training set must then be validated against a test set from the same clinical study cohort. Moreover, this final biomarker algorithm must be secondarily validated in independent study cohorts.
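The train/test validation within a cohort can be sketched as follows; logistic regression and ROC AUC are assumed stand-ins for whichever model and performance metric a given study uses, and independent cohorts would be scored against the frozen model the same way.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def validate_panel(X, y, test_size=0.3):
    """Fit a logistic-regression biomarker algorithm on the training
    split and score it by ROC AUC on the held-out test split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

# Simulated two-marker panel with moderate biological noise
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, (200, 3))
y = (X[:, 0] - X[:, 1] + rng.normal(0.0, 0.8, 200) > 0).astype(int)
auc = validate_panel(X, y)
```

Holding out the test split before any model fitting is what makes the within-cohort AUC an honest, if optimistic, estimate; the independent-cohort validation then checks that the algorithm generalizes beyond the original study population.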
The algorithm may be a collection of biomarkers, each of which independently contributes a statistically significant and additive correlation to a reference gold-standard diagnostic process or procedure. Such gold-standard diagnostic procedures may be clinically impractical due to labor-, cost-, or time-intensive reasons.