Guy Cavet, Kaggle
Data analytics has the potential to do much more if applied across the pharmaceutical enterprise.
Intelligent use of large-scale data has become fundamental to other industries: finance, insurance—even sports. But despite its importance in areas of research, data analytics has the potential to do much more if applied across the pharmaceutical enterprise.
In the past 15 years, biology has been transformed by the availability of large-scale genetic and genomic data. The first ten years of work on the human genome yielded one draft genome. The last ten years have yielded over ten thousand. Advances in technology have enabled high-throughput gene expression profiling, cancer genome analysis, and other disciplines to change the way biology is studied. Cheminformatics allows companies like Numerate to screen millions of compounds for activity by purely computational prediction. However, there are much greater opportunities for data-driven transformation across the broader pharmaceutical enterprise.
These opportunities arise, in part, because of the broad trend toward data being tracked and recorded in new and far-reaching ways. Importantly, many of these data sources lie outside the pharma industry. Medical records are collected electronically on an unprecedented scale, driven in part by federal “meaningful use” programs. These records reveal how diseases manifest and how treatments are used in the real world. Social media also contains vast amounts of information on real patient experiences with both diseases and treatments. And in the sales and marketing of drugs, data on program effectiveness is collected in real time by reps, and companies like Aktana are interpreting it to understand where physicians perceive value.
Getting data is only the first step. The true value arises from analytics that generate actionable insights. In many cases, this means predictive modeling: developing algorithms that reveal what drives an outcome of interest (such as response to therapy or drug choice) and that allow that outcome to be predicted in the future. The data scientists who can carry out this type of analysis are multidisciplinary experts with skills from statistics, computer science, biology, chemistry, and other fields, and they are highly sought after.
The conventional ways to engage data analysts involve building internal teams of scientists or buying time from consultants. However, data analytics is also particularly well-suited to crowdsourcing, which opens up a problem for many people to address. It’s inevitable that most of the world’s experts in any domain are outside any single pharma company. Even with strong internal teams, as Bill Joy of Sun Microsystems insightfully noted, “Most of the smartest people work for someone else.” Crowdsourcing allows those people to be tapped in a highly flexible and cost-effective manner. A team of experts can coalesce around a problem, working on it only as long as necessary, and then move on.
In 2006, Netflix used crowdsourcing to improve its ability to suggest movies to its customers. Rather than just inviting people to work in isolation, the company set up an online competition in which people submitted entries in real time and vied to come up with the best solution. This is a particularly effective approach to predictive modeling analytics. Seeing their rivals above them on a leaderboard drives people to continuously generate better results. In the Netflix competition, the company’s internal method was surpassed within six days, and the eventual winner was more than 10% better.
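As a rough illustration of how such a leaderboard can work, the sketch below scores each submission against held-out answers and ranks teams by error. It is a minimal sketch under stated assumptions, not a description of any actual competition platform: the choice of root-mean-square error as the metric, the function names, the team names, and all of the numbers are hypothetical and chosen only for illustration.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def leaderboard(submissions, actual):
    """Return (team, score) pairs sorted best (lowest error) first."""
    scores = {team: rmse(preds, actual) for team, preds in submissions.items()}
    return sorted(scores.items(), key=lambda item: item[1])


def relative_improvement(baseline_score, new_score):
    """Fractional reduction in error relative to a baseline score."""
    return (baseline_score - new_score) / baseline_score


# Hypothetical held-out outcomes and three made-up submissions.
actual = [4.0, 3.0, 5.0, 2.0]
submissions = {
    "internal_baseline": [3.0, 3.0, 4.0, 3.0],
    "team_a": [4.0, 3.0, 4.0, 2.0],
    "team_b": [3.5, 3.0, 4.5, 2.5],
}

board = leaderboard(submissions, actual)
for rank, (team, score) in enumerate(board, start=1):
    print(rank, team, round(score, 3))
```

Under this sketch, a claim such as "more than 10% better" would correspond to `relative_improvement(baseline_score, winner_score)` exceeding 0.10, where both scores are computed on the same held-out data.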
The competition approach is equally applicable to pharmaceutical industry problems. For example, Boehringer Ingelheim sponsored a contest to develop methods to predict small-molecule safety that resulted in a 25% improvement over an industry-standard approach. In the Heritage Health Prize competition, methods are being developed to predict which patients will require hospitalization, and for how long, over the next twelve months. Other competitions have been used to predict patient outcomes, sales patterns, and clinical outcomes. In each case, the results were better than any methods that had previously existed.
The full potential of data analytics requires accessing and using data in creative ways. For example, after a drug launches, information about the drug is rapidly generated in the outside world through patient and physician experiences. This information is currently largely untapped. It is entered into electronic medical records, tweeted, posted on Facebook, and entered into community sites such as PatientsLikeMe. This data is often unstructured and very noisy, but companies such as Israeli startup Treato are beginning to systematically organize it. Despite the complexities of working with data like this, skilled data scientists can extract meaningful patterns about drug-drug interactions, what drives patients to start and stop medications, or which patients will not adhere to their prescriptions, to name a few.
Predictive models even have the potential to tackle some of the most critical decisions in drug development, such as whether a clinical trial will be successful or whether a licensing deal will eventually lead to a drug. Billions of dollars rest on these decisions, but it is rare that all available relevant data is systematically employed to predict the probability of success. Of course, no algorithm can make such predictions with perfect accuracy, and no computation can replace a clinical trial. However, for an organization deciding between multiple costly development programs, any improvement in the ability to predict results is immensely valuable.
Putting data beyond the company firewall for outside experts to use may not be a natural step for organizations that are accustomed to carefully protecting their sensitive information. However, with the appropriate steps, the confidentiality and privacy of pharmaceutical and medical data can be carefully preserved. For example, when Boehringer Ingelheim sponsored a competition to predict small molecule activity, neither the structures of the molecules nor the specifics of the activity were revealed. In a competition to identify patients with type 2 diabetes using electronic medical records, the data was carefully de-identified to meet HIPAA standards. Privacy and confidentiality concerns can also be addressed by restricting access to trained and trusted individuals.
With drug development costs rising and approvals declining, new approaches are sorely needed. It’s too simplistic to see “big data” as a knight in shining armor, but the intelligent use of rich data, regardless of size, has the potential to help dramatically with problems from basic research to commercial operations.
Guy Cavet is vice president of life sciences at Kaggle, a platform for data science competitions. He has a Ph.D. in molecular biology and has worked in computational biology at Rosetta Inpharmatics, Merck, Genentech, and Crescendo Bioscience.