Alexa B. Kimball, MD

In 2019, investors reportedly poured $4 billion into more than 90 healthcare artificial intelligence (AI) startups specializing in various technologies—from drug discovery to diagnostics—that may empower individual clinicians as well as entire healthcare systems. Although the technologies are promising, the enthusiasm they arouse should be tempered. In healthcare applications of AI, serious data challenges remain, and predictive modeling still exhibits shortcomings.

We are all familiar with the many promises of AI. For example, there are multiple claims that radiologists and dermatologists will soon be replaced by computers powered by AI to better discern the differences between malignant and benign masses and lesions. There are also projections that the technology will, in the near future, ensure that treatments are more precise, that patients are given access to customized engagement tools, and that operational challenges in healthcare are overcome.

These projections, however, may be overly optimistic. Why? First, it’s important to underscore that AI refers to computer algorithms that perform analyses on datasets to generate insights and predictions, which require well-curated and clean datasets. But despite the use of electronic medical records systems, healthcare data is far from “clean.” There are many different types of data that come in a variety of formats—making the data difficult to aggregate or normalize. Without investment in these datasets to find a faster, easier way to reliably combine data, there’s simply no way to trust information that’s being used to train AI algorithms. And in the end, “trash in” equals “trash out.”

A tricky equation

Although the lack of quality datasets is a serious problem in healthcare applications of AI, a more fundamental problem lies in statistics. The fact is that the scenarios in which AI could have a clinically meaningful impact are surprisingly limited. This scarcity is rather like the one that has plagued pharmacogenomics, a technology that, like AI, was introduced to healthcare with great enthusiasm.

Pharmacogenomics assesses how a person’s genes affect that person’s response to medications. The extent to which pharmacogenomics can be helpful depends on (1) how common a gene of interest is in the population (prevalence), and (2) how important that gene is. If the gene is highly prevalent, then there isn’t much value in testing for it. If the prevalence is very low, testing can result in a lot of false positives. So, the ideal prevalence range is about 25–40% in any given population.

But even gene prevalence is not sufficient on its own. The impact of that gene on clinical effects also matters to pharmacogenomics. It turns out that most genetic traits do not, in isolation, cause huge clinical effects. There have been a few notable exceptions, such as the one addressed by Herceptin, in which a single gene had a meaningful prevalence and an important impact on disease. But the past 20 years have, in fact, revealed very few genes that are, on their own, sufficiently prevalent and impactful to justify a pharmacogenomics approach.

AI will face the same challenges as pharmacogenomics. First, there is the same problem of common versus rare. Sophisticated models may marginally refine predictions regarding outcomes, but current algorithms can do the same thing. For diagnosis, common things are still common, and AI may not add much. Electrocardiograms (ECGs) have been subjected to computer-based predictions for decades. If the ECG reveals a dramatic life-threatening rhythm, the machine issues an alert, but such a rhythm is also very obvious to the clinician. Meanwhile, if the change in a patient’s condition is subtle, detecting it may not be important, or the attempt to detect it may create a large proportion of meaningless warnings like the ones that are often found on ECG reports today.

Issues such as these aren’t limited to signal processing and waveform data such as ECGs. Similar issues are encountered in assessments of imaging data. For example, it would be highly desirable to be able to take a picture of a nevus (a birthmark or mole) and use it to assess the nevus’ malignant potential. Many companies have tried and failed to do this well. Again, the positive predictive value suffers when the tool is used in a general population at low risk, because most positive results will then be false alarms.

In this case, there is another complication: a dermatologist examining a patient doesn’t look at just one nevus; he or she looks at all of them. They may number in the hundreds. The phenotype of overall mole pattern matters a lot, but a single photo can’t possibly show that pattern. So, due to these two phenomena, the false-positive rate for malignant moles detected using AI can be overwhelming.
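The effect of low prevalence on false positives can be illustrated with a small calculation. The sketch below applies Bayes' rule to a hypothetical classifier. The sensitivity and specificity of 95% and the prevalence values are assumptions chosen for illustration only, not figures from any real product or study.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Share of positive results that are true positives (Bayes' rule)."""
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * (1.0 - specificity)
    return true_positives / (true_positives + false_positives)


# Hypothetical classifier: 95% sensitivity and 95% specificity (assumed values).
for prevalence in (0.001, 0.01, 0.10, 0.30):
    ppv = positive_predictive_value(prevalence, sensitivity=0.95, specificity=0.95)
    print(f"prevalence {prevalence:6.1%} -> positive predictive value {ppv:5.1%}")
```

Under these assumed numbers, when only 1 lesion in 1,000 is malignant, fewer than 2 in 100 positive results are true positives, even though the classifier itself is unchanged.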

AI is most helpful when it can detect uncommon phenomena and avoid oversignaling. Such opportunities for AI are still limited.

Disappointing findings

Just as it is in pharmacogenomics, the next hurdle for AI is impact. For example, if AI technology increases the certainty of a diagnosis by only two points, such a change won’t matter enough to make a clinical difference. Moreover, the success of interventions based on such diagnostic cues has been limited.

For example, the inadequacy of strategic data analyses was demonstrated in a recent randomized trial that was used to evaluate whether a “hot spotters” program had succeeded in reducing healthcare costs. In this program, a team of doctors, nurses, and social workers in Camden, NJ, relied on data analyses to provide targeted care management for “super utilizers,” that is, patients with complex social and medical needs.

An early study indicated that hospital readmissions had been reduced by nearly 40%. However, the randomized trial revealed that comparable reductions in rehospitalization rates were obtained with patients who were not part of the program. It also appeared that a sort of regression to the mean was at play, one in which patients with high medical costs tend to see their expenses decline over time, whether or not targeted care is implemented. Ultimately, the goal to reduce medical costs was not met.
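Regression to the mean can be shown with a toy simulation. In the sketch below, every number (the cost distribution, the year-to-year noise, the 5% cutoff) is a hypothetical assumption and none of it comes from the Camden trial. Each simulated patient has a stable underlying cost plus random year-to-year variation, and no intervention is applied to anyone.

```python
import random


def simulate_top_spenders(n_patients=20_000, top_fraction=0.05, seed=0):
    """Return mean year-1 and year-2 costs for the top year-1 spenders.

    No intervention is modeled: any change between years is pure chance.
    Costs are a toy model and can even go negative for some patients.
    """
    rng = random.Random(seed)
    year1, year2 = [], []
    for _ in range(n_patients):
        baseline = rng.gauss(10_000, 3_000)          # stable underlying cost (assumed)
        year1.append(baseline + rng.gauss(0, 6_000))  # year-specific noise (assumed)
        year2.append(baseline + rng.gauss(0, 6_000))

    # Select "super utilizers" using year-1 costs only.
    n_top = int(n_patients * top_fraction)
    top = sorted(range(n_patients), key=lambda i: year1[i], reverse=True)[:n_top]

    mean_year1 = sum(year1[i] for i in top) / n_top
    mean_year2 = sum(year2[i] for i in top) / n_top
    return mean_year1, mean_year2


mean_year1, mean_year2 = simulate_top_spenders()
print(f"top spenders, year 1 mean cost: {mean_year1:,.0f}")
print(f"same patients, year 2 mean cost: {mean_year2:,.0f}")
```

Because patients are selected on an unusually expensive year, their average cost falls the following year without any program at all. That is why an uncontrolled before-and-after comparison can overstate a program's effect, and why a randomized comparison group is needed.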

Lastly, there is the issue of cognitive bias. There is already a body of literature around how physicians are biased to choose a suggested diagnosis and fail to consider others. AI may simply point to the most likely diagnosis for the population, but not the most likely diagnosis for that individual. AI, then, may yield an ironic outcome—a new source of error.

This is not to say that there are no potential applications for AI that could be valuable to healthcare. The area where AI may soon make a large impact is likely not clinical care but healthcare operations: optimizing staff and resources in areas such as operating room scheduling and claims adjudication. Given the computing power of AI, it is also possible that AI could play a critical role in providing more objective data for largely clinical diagnoses such as endometriosis, or that it could use natural language processing to identify risks for bias in clinical decision making.

We may even see AI applications bear fruit in settings with low clinical capacity and resources. For example, AI may be able to improve cervical cancer screening in low-resource settings where pap tests that require a human expert reviewer are not available.

So, while AI will no doubt change some things in medicine—maybe even change things dramatically—it will be far less transformational than we think, or hope.


Alexa B. Kimball, MD, is the president of Physician Performance, CEO of Harvard Medical Faculty Physicians at Beth Israel Deaconess Medical Center, and a professor of dermatology at Harvard Medical School.