Graham A. A. McGibbon Ph.D. Director, Strategic Partnerships ACD/Labs
Andrew Anderson Vice President Advanced Chemistry Development
Practical Method for Managing Modern Scientific Information
For scientific investigation, observational and instrument-derived data form the lineage of information that yields knowledge, which in turn enables managers to make strategic and tactical data-based decisions that maximize benefits and limit risks. Data exchange between organizations and data sharing within organizations are both necessary to communicate this “define-to-decide” lineage effectively. Digital data standards are intended to help manage the data deluge together with valuable master data, such as reference spectra. This deluge often results from the increasing volume, velocity, variety, and variability of newly acquired analytical data, all of which is needed for confident, comprehensive material characterization, as described below. Such characterization necessitates digital representation of analytical data, chemical processes, and molecular structures.
In data workflows (not only Big Data, but also analyses generating a variety of not-quite-so-big data), two factors contribute greatly to the deluge. The first is the automation and/or parallelization of specific high-throughput analyses on particular instruments. The second is the challenging implementation of the so-called “Internet of Things” (IoT), owing to the tremendous assortment of computer-based data sources and the diversity of their parts, performance attributes, and analytical data output formats.
Ongoing heterogeneity of analytical-data formats is thus a natural hallmark of technology advancement. Some of the analytical-data-standardization efforts that go beyond human-parsable, purely open generic formats such as ASCII text or XML are summarized in a recent white paper. Standardization of ontologies is one of the interesting, yet challenging, endeavors of modern scientific information management.
Digital Representation of Analytical Data
Analytical data is generally used for qualitative (what’s in my sample?) and quantitative (how much of each analyte is in my sample?) investigations. Depending upon the sample composition, and the physical characteristics or properties of the analytes of interest, a wide range of techniques may be used to gather analytical data. Among the most common are separations (especially liquid, gas, and supercritical fluid chromatography for small molecules, and electrophoresis, particularly CE or, for biomolecules, PAGE), mass spectrometry, and spectroscopies (NMR, UV, IR, CD, Raman). Typically, each vendor creates and promotes its own proprietary format for data acquisition and data handling, but several widely used open formats also exist for exchanging and storing data. A detailed review of open formats is provided in the aforementioned white paper.
Mass spectrometry (MS) equipment is often considered especially challenging in terms of the data it generates. MS commonly produces larger and more complex datasets than other detectors. As an additional complication, experiments often include further data dimensions or types at the same time (e.g., ion mobility or imaging MS). MS hardware technologies are developing rapidly, and there is a wide variety of experiments and workflows. All major MS instrument vendors (Agilent, Bruker, LECO, PerkinElmer, Sciex, Shimadzu, Thermo, Waters) keep their data formats closed but provide SDKs to access the data. MS complexities and innovations, along with dreams of Big Data science, have put pressure on older data standards such as the AIA/netCDF format, which was the de facto standard in the industry. As a result, there are initiatives to supersede those standards with newer, more capable file formats.
Digital Representation of Chemical Structures
Significant efforts have been made by many scientists over the last 40 years to represent chemical processes digitally in an effective way. The fundamental units of the process are:
• Chemical Structure(s): representing the substances used or made during the process
• Reaction Arrow(s): representing the unit operation(s)
• Process Instructions: defined by their location relative to reaction arrows and their timing during the unit operation. These can be either text or, for certain operations (e.g., heat, stir), graphical representations
• Relative Position of Entities: the “location” of chemical structures and process instructions relative to reaction arrows imparts meaning. For example:
º Right of arrow: reaction products
º Left of arrow: reaction starting materials
º Above arrow: catalysts
º Below arrow: instructions and reaction solvents
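The positional convention listed above can be captured in a small data structure. The following is a minimal sketch, not a description of any existing file format or product: the names `Position`, `SchemeEntity`, and `assign_roles` are hypothetical, and the esterification example is only illustrative.

```python
from dataclasses import dataclass
from enum import Enum


class Position(Enum):
    """Where an entity is drawn relative to the reaction arrow."""
    LEFT = "left of arrow"
    RIGHT = "right of arrow"
    ABOVE = "above arrow"
    BELOW = "below arrow"


# Role implied by each position, following the convention in the list above.
ROLE_BY_POSITION = {
    Position.LEFT: "starting material",
    Position.RIGHT: "product",
    Position.ABOVE: "catalyst",
    Position.BELOW: "instruction or solvent",
}


@dataclass(frozen=True)
class SchemeEntity:
    label: str          # a structure name or an instruction text
    position: Position  # location relative to the reaction arrow


def assign_roles(entities):
    """Group entity labels by the role that their position implies."""
    roles = {}
    for entity in entities:
        role = ROLE_BY_POSITION[entity.position]
        roles.setdefault(role, []).append(entity.label)
    return roles


# Illustrative scheme: acid-catalyzed esterification.
scheme = [
    SchemeEntity("acetic acid", Position.LEFT),
    SchemeEntity("ethanol", Position.LEFT),
    SchemeEntity("sulfuric acid", Position.ABOVE),
    SchemeEntity("heat under reflux", Position.BELOW),
    SchemeEntity("ethyl acetate", Position.RIGHT),
]

roles = assign_roles(scheme)
print(roles["starting material"])  # ['acetic acid', 'ethanol']
print(roles["product"])            # ['ethyl acetate']
```

Because the meaning is carried by position rather than by free text, a representation of this kind can be searched or compared by role, which a drawing alone does not allow.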
A comprehensive summary of the efforts to enable digital representation of chemical processes is given in the white paper mentioned earlier. The use of these digital representations (searching, comparing, etc.) is also described there.
Digital Representation of the Fourth Paradigm: “Assemblies” of Analytical Data from Multiple Sources, Chemical Structures, and Interpretations
For a large range of scientific experiments, a variety of data types contribute to the understanding of material disposition and behavior: the so-called unification of theory, experiment, and simulation.
The interrelatedness of certain analytical information is especially important. For many materials/substances, compositional profiling by hyphenated chromatography–mass spectrometry analyses may be captured by one data file. However, more comprehensive characterizations commonly require multiple techniques from different instruments that generate data in different formats that are collected at different times, in different places, by a colleague, contractor, or collaborative partner outside the scientists’ own organization.
To support such interrelatedness, systems in the fourth paradigm must not only index individual analytical data files sufficiently but also afford “analysis assembly” capabilities that give users a comprehensive “story” for relevant analyses.
Consider formulation profiling as an example. The following list of “related data” must be “assembled” to present a comprehensive assessment of a product formulation:
• LC-UV/MS (and other detector types)
• GC-FID/MS (and other detector types)
• Chemical, biological, formulation schematics
• Chemical structures for process-related impurities/degradants
• 1D and 2D NMR data for isolated formulation components, with references to separations information (e.g., retention time)
• XRPD, DSC, TGA, particle size distribution, and a variety of other material characterization datasets
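One way to picture such an “analysis assembly” is as an index of dataset records grouped around a single subject. The sketch below is an assumption-laden illustration only: the class names, fields, file names, dates, and the list of required techniques are all hypothetical and do not describe any particular system.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatasetRecord:
    path: str         # location of the raw or exported file (hypothetical)
    technique: str    # e.g. "LC-UV/MS", "2D NMR", "XRPD"
    file_format: str  # vendor-proprietary or open format name
    source: str       # instrument, site, or partner that produced the data
    acquired: str     # acquisition date as an ISO 8601 string


@dataclass
class AnalysisAssembly:
    subject: str                       # what the assembled data characterizes
    records: list = field(default_factory=list)

    def add(self, record):
        self.records.append(record)

    def by_technique(self):
        """Index the assembled records by analytical technique."""
        grouped = defaultdict(list)
        for record in self.records:
            grouped[record.technique].append(record)
        return dict(grouped)

    def missing(self, required):
        """Return the required techniques that have no record yet."""
        present = {record.technique for record in self.records}
        return [technique for technique in required if technique not in present]


assembly = AnalysisAssembly(subject="formulation lot A (hypothetical)")
assembly.add(DatasetRecord("lcms_run01.raw", "LC-UV/MS", "vendor", "site 1", "2020-01-15"))
assembly.add(DatasetRecord("nmr_hsqc.jdx", "2D NMR", "JCAMP-DX", "partner lab", "2020-01-20"))

REQUIRED = ["LC-UV/MS", "GC-FID/MS", "2D NMR", "XRPD"]
print(assembly.missing(REQUIRED))  # ['GC-FID/MS', 'XRPD']
```

Keeping source, format, and acquisition date on each record reflects the point made earlier that the contributing data may arrive in different formats, at different times, and from outside the organization.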
Finally, the ability to have explicit digital representations of a scientist’s interpretations, specifically beyond alphanumeric descriptions, is also necessary in the fourth paradigm. In the product formulation example above, digital representations of interpretation would include:
• Chemical structures “assigned” to spectral and chromatographic components directly within data architectures
• Association of experimental metadata to assembled analytical data architectures
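The two items above can be sketched as a record that ties a chemical structure to a specific chromatographic component and carries experimental metadata with it. This is a hypothetical illustration: the class and function names are invented, structures are written as SMILES strings purely for convenience, and the retention times and metadata values are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Peak:
    dataset_id: str            # identifier of the dataset containing the peak
    retention_time_min: float  # chromatographic retention time, in minutes
    metadata: dict = field(default_factory=dict)  # experimental metadata


@dataclass
class Assignment:
    peak: Peak
    smiles: str       # structure assigned to the peak
    assigned_by: str  # who or what made the interpretation
    note: str = ""


def find_assignments(assignments, retention_time_min, tolerance_min=0.05):
    """Return assignments whose peak lies within the retention-time tolerance."""
    return [
        a for a in assignments
        if abs(a.peak.retention_time_min - retention_time_min) <= tolerance_min
    ]


# Made-up values for illustration only.
method = {"column": "C18", "detector": "UV/MS"}
assignments = [
    Assignment(Peak("lcms_run01", 4.21, method),
               "CC(=O)Oc1ccccc1C(=O)O", "analyst", "main component"),
    Assignment(Peak("lcms_run01", 2.85, method),
               "O=C(O)c1ccccc1O", "analyst", "degradant"),
]

hits = find_assignments(assignments, 4.20)
print([a.smiles for a in hits])  # ['CC(=O)Oc1ccccc1C(=O)O']
```

Storing the structure against the peak, rather than in a free-text comment, is what allows the interpretation itself to be queried later.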
The key to achieving effective decide-to-design workflows using digital environments per the fourth paradigm is to recognize that the processes under investigation have certain essential requirements and outputs, and that these are assembled to deliver value. Specifically, every IND Application for a New Chemical Entity (NCE) requires a completed section on Chemistry, Manufacturing, and Controls (CMC) for Drug Substance and Drug Product. The information for these sections is currently assembled by humans and driven by documents. Moreover, systems supporting these document-based assemblies do not manage the underlying analytical and chemical structure data for all detected and anticipated compounds across all processes. Modern systems should therefore address this primary need first and foremost. Second, they should allow a “line-of-sight” between process-control decisions and the underlying, supporting data from which they were derived, along with any relevant human or machine interpretation that was added. Any other uses the data might find are undoubtedly nice to have but of indeterminate value.
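The “line-of-sight” requirement can be illustrated with a decision record that keeps explicit links to its supporting data and interpretations. The following is a minimal sketch under stated assumptions: the names `Evidence`, `Decision`, and `line_of_sight` are hypothetical, and the identifiers and statements are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Evidence:
    dataset_id: str      # identifier of the supporting analytical dataset
    interpretation: str  # what was concluded from that dataset
    interpreted_by: str  # "human" or "machine"


@dataclass
class Decision:
    decision_id: str
    statement: str
    evidence: list = field(default_factory=list)


def line_of_sight(decision):
    """List a decision followed by each dataset and interpretation behind it."""
    lines = [f"{decision.decision_id}: {decision.statement}"]
    for item in decision.evidence:
        lines.append(
            f"  <- {item.dataset_id} ({item.interpreted_by}): {item.interpretation}"
        )
    return lines


# Invented example values.
decision = Decision(
    "PC-001",
    "tighten in-process limit for impurity X",
    [
        Evidence("lcms_run01", "impurity X detected above target", "human"),
        Evidence("nmr_hsqc", "structure of impurity X confirmed", "machine"),
    ],
)

for line in line_of_sight(decision):
    print(line)
```

The design choice here is simply that the decision holds references to data rather than copies of conclusions, so the supporting evidence remains traceable after the document that reported it has been filed.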
ACD/Spectrus, which uses proprietary formats, is a widely known and widely used data-management platform with components capable of reading analytical data from over 100 other major open and vendor-proprietary formats (via SDKs), not only for MS but also for chromatography, spectroscopy, and a variety of other XY-type data.