January 15, 2017 (Vol. 37, No. 2)
Thomas Hill, Ph.D., Executive Director, Statistica
Angela Waner, Product Manager, Statistica
When Will Big Data Analytics Change Biopharmaceutical and Pharma Manufacturing?
The GEN Special Section on Big Data consists of four articles:
Precision Medicine Research in the Million-Genome Era
Utilizing Machine-Learning Capabilities
NGS Big Data Issues for Biomanufacturing
Visualization for Advanced Big Data Analysis
Machine learning and pattern-recognition algorithms can deliver predictive and analytic models, however complex they may need to be, that provide an accurate line of sight from raw material characteristics, through manufacturing processes and settings, to final product quality, and ultimately to performance (risk) in the field.
These methods can deliver a detailed understanding of important process input parameters and how they affect product outcomes, customer satisfaction, and consumer risk. Just like self-driving cars, automated manufacturing and “lights-out manufacturing” (no humans need to be present on the shop floor) today rely on big data machine learning and pattern-recognition methods for robust, low-variability, highly optimized, and efficient manufacturing processes.
Pharma Manufacturing Is Different
Biopharma and pharmaceutical batch manufacturing is mostly automated and well instrumented. So why are big data machine learning algorithms not widely implemented? The reason is that the manufacture of drugs, vaccines, and medical devices is strictly regulated by the FDA in the U.S. and by equivalent organizations worldwide. The standard for managing consumer risk, and for understanding all critical variables throughout the manufacturing process that affect consumer risk, is much higher than for most consumer goods. Understandably, if a consumer item like a washing machine does not last as long as predicted, customers might not be happy but will be mostly unhurt; if the active ingredient in a life-sustaining drug doesn’t last as long as predicted (i.e., its shelf life), the consequences to the consumer can be much more serious.
The regulatory framework and guidance for Good Manufacturing Practices (GMP), and more generally Good “anything” Practices (GxP), are all about demonstrating that the entire process is well understood and robust, and that it has been shown to produce an acceptable (negligible) number of defects with no or extremely low risk of harm to consumers (see, for example, Snee, 2016).
Over the decades, GMP and “validated manufacturing processes” have evolved into widely accepted best practices for process design, initial process qualification, continued process verification, etc. Standard methods are based on statistical analyses, such as analysis of variance, multiple regression, univariate quality control charting, and more recently also multivariate principal components analysis (PCA) and partial least squares (PLS) based quality control.
An argument can be made that the long history and collective industry experience with statistical data analyses to design, qualify, and continuously monitor the quality of production processes means that this approach is well understood, and therefore proven to be “near-optimal.”
Or Is It Not So Different?
However, an argument can also be made that established best practices and significant accumulated experience exist regarding the proper use of, and inferences from, machine learning and pattern-recognition algorithms. After all, many manufacturing industries operate in extremely competitive environments, and if these methods did not repeatedly show value and little risk in, for example, semiconductor manufacturing, they would quickly be abandoned.
At the same time, the statistical analysis approach, based on data models, can easily lead to misleading or plainly wrong results. If the a priori assumptions about the data model are not met (that is, if the data model is a poor emulation of nature), then conclusions may be wrong (as pointed out by Breiman, 2001).
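A small worked example may make the point about data models concrete. The sketch below is purely illustrative and uses synthetic data: a hypothetical process whose outcome is best at a setting of 5 and degrades on either side. A straight-line data model is compared with a simple nearest-neighbor predictor, which stands in here for a flexible algorithm. All function and variable names are assumptions made for this illustration.

```python
# Illustrative sketch with synthetic data; no real process data is used.

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def fit_line(xs, ys):
    """Least-squares fit of the data model y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def r_squared(ys, predictions):
    """Share of outcome variance explained by the predictions."""
    my = mean(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def leave_one_out_knn(xs, ys, k=2):
    """Predict each point from its k nearest neighbors, excluding the point itself."""
    predictions = []
    for i, x in enumerate(xs):
        others = [j for j in range(len(xs)) if j != i]
        nearest = sorted(others, key=lambda j: abs(xs[j] - x))[:k]
        predictions.append(mean(ys[j] for j in nearest))
    return predictions

# Hypothetical process: outcome is best at a setting of 5, worse on both sides.
xs = [i * 0.5 for i in range(21)]
ys = [(x - 5.0) ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
r2_linear = r_squared(ys, [intercept + slope * x for x in xs])
r2_knn = r_squared(ys, leave_one_out_knn(xs, ys))

print(f"linear model: slope = {slope:.3f}, R^2 = {r2_linear:.3f}")
print(f"nearest-neighbor predictor: R^2 = {r2_knn:.3f}")
```

In this constructed case the fitted slope is essentially zero, so the linear data model would suggest that the setting has no effect on the outcome, while the flexible predictor recovers most of the relationship. Real process data will rarely be this clean, so the contrast should be read as an illustration of the argument, not as evidence for it.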
Machine learning and pattern-recognition algorithms for small and big data have by now been in use for decades (e.g., neural nets), and commonly accepted best practices exist and are widely documented for how to estimate model predictive power and accuracy, sensitivity of predictions to input variables, how to estimate error or noise variability, and so on (see for example Nisbet, Miner, and Elder, 2009, for many different application examples; Miner et al., 2014, for applications in healthcare; or Hastie, Tibshirani, and Friedman, 2013, for technical details on approaches and algorithms).
As with traditional statistical analysis, one can define the steps necessary to derive a prediction model, evaluate the quality of the model, and determine how to make decisions based on it. Thus, specific analytic processes can be validated and documented in standard operating procedures to create a repeatable, robust analytic process that will yield more information about important inputs, better predictions of problems, quicker identification of root causes of actual quality problems, and generally lower cost, better quality, and lower consumer risk.
Details on how to build machine-learning models can now be found in literally thousands of publications. Most importantly, because these algorithms are so flexible in representing any type of relationship in the data, it is extremely important to evaluate any model in at least one hold-out validation sample, before the model is deployed to production for driving process decisions. Other methods like v-fold cross-validation, simulation, target-shuffling (Elder, 2014), and others exist to help create models from small or big data that can be expected to be accurate in new data.
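The validation ideas mentioned above can be sketched in a few lines. The following is a minimal illustration, assuming a simple nearest-neighbor predictor and synthetic data rather than any particular platform or real process. It estimates prediction error by v-fold cross-validation, then applies target shuffling: the outcomes are randomly permuted many times, and the share of permutations that score as well as the real data indicates how likely the result is to be a fluke. Function names, parameter values, and the data are assumptions made for this example.

```python
import random

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def knn_predict(train_x, train_y, x, k=5):
    """Average the outcomes of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return mean(train_y[i] for i in nearest)

def v_fold_cv_error(xs, ys, v=5, seed=0):
    """Mean squared error on held-out folds, averaged over v folds."""
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)
    fold_errors = []
    for f in range(v):
        held_out = order[f::v]            # every v-th case forms one fold
        held = set(held_out)
        train = [i for i in order if i not in held]
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        fold_errors.append(mean(
            (ys[i] - knn_predict(train_x, train_y, xs[i])) ** 2
            for i in held_out
        ))
    return mean(fold_errors)

def target_shuffling(xs, ys, n_shuffles=200, seed=0):
    """Share of shuffled-outcome runs whose CV error is as low as the real one."""
    rng = random.Random(seed)
    real_error = v_fold_cv_error(xs, ys)
    as_good = 0
    for _ in range(n_shuffles):
        shuffled = ys[:]
        rng.shuffle(shuffled)             # break any true input-outcome link
        if v_fold_cv_error(xs, shuffled) <= real_error:
            as_good += 1
    return real_error, as_good / n_shuffles

# Synthetic data: the outcome depends on one input plus random noise.
rng = random.Random(42)
xs = [rng.uniform(0.0, 10.0) for _ in range(60)]
ys = [2.0 + 0.5 * x + rng.gauss(0.0, 0.5) for x in xs]

cv_error, shuffle_share = target_shuffling(xs, ys)
print(f"5-fold cross-validated error: {cv_error:.3f}")
print(f"share of shuffles doing as well: {shuffle_share:.3f}")
```

A shuffle share near zero suggests the model has found a real relationship; a large share would suggest that the apparent accuracy could arise by chance. A final hold-out sample, untouched during model building, would still be needed before deployment.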
There are many different types of algorithms that have proven useful to solve certain types of analytics problems. For example, neural nets are particularly good at representing continuous relationships in dynamic systems of measurements. Recursive partitioning algorithms or “trees” are very good at classifying observations into buckets (e.g., good-bad), and for root-cause analysis and interaction detection.
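To illustrate how recursive partitioning supports root-cause analysis, the sketch below performs the first step of a tree algorithm: it searches every input and every candidate threshold for the single split that best separates good batches from bad ones, using Gini impurity as the criterion. The batch records, the parameter names temperature_c and ph, and the helper functions are hypothetical and chosen only for illustration; a full tree would repeat the search within each resulting bucket.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0 = good batch, 1 = bad batch)."""
    if not labels:
        return 0.0
    p_bad = sum(labels) / len(labels)
    return 2.0 * p_bad * (1.0 - p_bad)

def best_split(rows, labels):
    """Return (feature, threshold, impurity) for the purest two-way split."""
    best = None
    n = len(rows)
    for feature in rows[0]:
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds lie midway between adjacent observed values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2.0
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or impurity < best[2]:
                best = (feature, threshold, impurity)
    return best

# Hypothetical batch records; the values are invented for this example.
batches = [
    {"temperature_c": 36.0, "ph": 7.1},
    {"temperature_c": 36.5, "ph": 6.9},
    {"temperature_c": 37.0, "ph": 7.3},
    {"temperature_c": 37.2, "ph": 7.0},
    {"temperature_c": 38.0, "ph": 7.2},
    {"temperature_c": 38.4, "ph": 6.8},
    {"temperature_c": 38.9, "ph": 7.1},
    {"temperature_c": 39.5, "ph": 7.0},
]
bad_batch = [0, 0, 0, 0, 1, 1, 1, 1]

feature, threshold, impurity = best_split(batches, bad_batch)
print(f"best split: {feature} <= {threshold:.2f} (impurity {impurity:.3f})")
```

In this invented data set, the search singles out temperature as the input that separates good from bad batches, which is the kind of candidate root cause an engineer would then investigate. Interaction detection follows from repeating the split within each bucket, since a second-level split that matters only on one side of the first reveals an interaction.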
Deep-learning neural networks have proven extremely useful for detecting reliable patterns based on semi-structured and unstructured data, including recorded sounds or pictures.
The main point here is that a large, cumulative body of experience is documented in articles, books, and specialized websites on how best to apply machine learning and pattern-recognition algorithms to various types of data, including manufacturing data, in pursuit of more robust, higher-quality, less expensive, and safer products.
There are other barriers to the adoption of these techniques into manufacturing environments.
• Lack of expertise, cost of resources. So-called data scientists are difficult and expensive to find, hire, and retain. On the other hand, validated manufacturing environments where analytic procedures are well documented provide an excellent environment where specific analytic approaches can be developed once, and then used by specifically trained engineers, process owners and stakeholders, or operators.
In fact, in typical current practice using statistical process monitoring techniques, only a few expert statisticians are usually involved in day-to-day operations, while operators and process stakeholders routinely rely on sometimes very complex statistical procedures accessed through standard interfaces.
For example, using software solutions like Statistica, analytic workflow templates can be deployed to empower trained operators to use these methods and interfaces, which are developed and validated following established best practices.
• Model transparency, and interpretability of results. Sometimes, statistical models may appear to be more transparent and simpler to understand when compared to machine learning models.
For example, the workflows around multivariate continuous batch process monitoring based on PCA or PLS methods (see for example Nomikos and MacGregor, 1995; Wold, Kettaneh, Friden, and Holmberg, 1998) rely on well-defined and fairly easy-to-follow steps for drill-down and root-cause analysis, should out-of-control conditions be encountered.
However, the computations underlying those steps are actually quite complex, and depend on specific and restrictive assumptions about the data and their relationships (linear relationships). If those assumptions are wrong, conclusions may be wrong. Using machine learning and pattern-recognition methods, similar steps for drill-down and root-cause analysis can be designed for more flexible, automated, and transparent root-cause analysis.
• Introducing new methods into a conservative and risk-averse culture. This is perhaps the biggest practical hurdle to overcome before big data machine learning and pattern-recognition methods can become widely adopted for analytic support in validated biopharma or pharmaceutical manufacturing.
Process stakeholders and reviewers understandably prefer to err on the conservative side, and reject new approaches. This makes sense in many ways, and can benefit the consumer who can be assured that the same “proven methods” were used in the manufacture of any given vaccine or drug. However, as drugs become more expensive, or when the availability of vaccines in a crisis is insufficient due to slow and expensive production methods, innovation may be inevitable.
• The cost of change. The “price” of big data machine learning is actually surprisingly low, which is perhaps one reason these methods have been adopted across many industries with such speed. Most mature analytics platforms support useful standard methods and algorithms, such as recursive partitioning or tree methods, neural nets, support vector machines, PCA/PLS, and so on. Statistica supports those, and in addition can incorporate and manage (for validation) the large collections of methods available through open-source projects like R (see cran.r-project.org), Python, or specialized big-data analytics libraries like MLlib, H2O, and others.
However, experts (also called “data scientists”) who understand best practices regarding machine learning from big data are in demand, scarce, and not cheap. Most importantly, the biggest costs are associated with the organizational and cultural changes that will need to happen to support the paradigm shift from statistical modeling and inference to pattern-recognition and machine-learning methods.
In summary, big data machine learning or pattern-recognition techniques have transformed nearly every industry, including nonregulated discrete, batch, and process manufacturing. Compared to statistical methods for process monitoring, optimization, and root-cause analysis, machine-learning techniques are more flexible, robust, and often much more accurate, in particular when relationships among parameters and outcomes are complex, nonlinear, and not easy to represent via statistical models. Regulated pharmaceutical and biopharmaceutical manufacturing has been slow to adopt these techniques, primarily because of a very conservative approach to the management of consumer risk, regulatory compliance concerns, and also because, on the surface, these modern methods appear complex and somewhat opaque.
However, it can be argued that validated and repeatable analytic processes can be built using machine-learning techniques, adhering to established best practices based on nearly a decade of experience in other industries. Perhaps the increased pressures for manufacturing efficiency and cost control, to understand consumer risk more accurately, and to deliver new treatments to patients more quickly, based on new products of better quality, will make adoption and acceptance of these modern approaches more appealing, or even necessary. For sure, there are opportunities to get started today.