Medical A.I. Retakes the Test

Artificial intelligence (AI), including large language models (e.g., ChatGPT), is gaining traction. After a bit of “teaching to the test,” one system passed the United States Medical Licensing Examination (USMLE), the three-step test required for medical licensure. Will doctors be among the first white-collar (white-coat?) workers to be replaced by automation?

Short answer: no. Algorithms, for all their ability to be personalized, are simply not capable of the “hands-on” magic of medicine.

I’ve previously written about AI systems taking on humans, and just now, ACSH is launching a column entitled Let’s Ask ChatGPT. A new study on arXiv reports on a large language model (LLM) trained specifically on medical information in addition to the wealth of the Internet. It passed the three steps of the USMLE with a score of 67%; 60% is generally considered passing.

Meanwhile, ChatGPT has been credited as an author on several papers, provoking consternation and concern from editorial boards.

“An attribution of authorship carries with it accountability for the work, which cannot be effectively applied to LLMs.”

- Magdalena Skipper, editor-in-chief of Nature 

“We would not allow AI to be listed as an author on a paper we published, and use of AI-generated text without proper citation could be considered plagiarism,”

- Holden Thorp, editor-in-chief of the Science family of journals

Leaving this academic self-absorption aside, the authors of the arXiv paper raise some legitimate concerns regarding AI entering the real world.

“…the safety-critical nature of the domain necessitates thoughtful development of evaluation frameworks, enabling researchers to meaningfully measure progress and capture and mitigate potential harms.”

Scientific consensus

Whether we call it consensus or settled science is unimportant, but AI models are trained on “web documents, books, Wikipedia, code, natural language tasks, and medical tasks at a given point of time.” One of the lessons of COVID surely must have been that our scientific beliefs and dogma can change over short time frames. Can we rely on an AI model whose knowledge reflects the past rather than the present? Electronic health records (EHRs) already lag in keeping up with national guidelines. Assuming there is some regulatory agency overseeing these systems, how, and how often, will it verify and update them? Once again, the experience of COVID shows the difficulty, if not impossibility, of timely updates.

Incorrect or missing content

The researchers found that content written by living, breathing clinicians contained inappropriate or incorrect information in 1.4% of their responses to questions that patients might ask. Their AI system, which passed those exams, produced inappropriate or incorrect information in 16–18% of its responses. Whether you agree or not, state medical boards are disciplining physicians for spreading inappropriate or inaccurate content. Will these same boards, or some other regulator, discipline the AI, and how might that be done?

Human clinicians committed the sin of omission, leaving out information important to patient concerns 11.1% of the time; the AI system did so 15% of the time. When a group of patients was asked to rate the responses, the clinicians’ answers were judged more helpful 91% of the time, versus 80% for the AI. As to whether the responses actually answered their questions, it was essentially a tie, at roughly 95%.

Do No Harm

Raters were asked to identify harm that might come from acting upon the information provided. The AI system gave answers that could result in harm 5.9% of the time, essentially the same as physicians at 5.7%. But we have strategies to manage physician error, from self-reflective morbidity and mortality conferences to malpractice litigation. Will the harms from AI be treated similarly? Can you actually sue an AI, or will liability devolve, as it most frequently does, to the human wielding the instrument rather than the instrument itself?

Adding another layer of complexity is that the definition of harm is not invariant. It may change based on the population studied, for example, the overwhelming “whiteness” of the UK Biobank participants. Or their “lived experience,” a reason given for the low uptake of COVID vaccines in some communities. Or their cultural beliefs, e.g., the feelings towards circumcision in the Jewish community.

“…our human evaluation clearly suggests these models are not at clinician expert level on many clinically important axes.”

To meet those needs, the researchers suggest that AI responses account for “the time varying nature of medical consensus,” improve the ability to “detect and communicate uncertainty effectively,” and respond in multiple languages. Responding in various languages seems to be the low-hanging fruit of the three.

Media and expert commentary often focus on the latest shiny object, and LLMs certainly fit the bill. As AI systems increasingly emulate physicians, we must consider how they will be used and regulated. We failed to do that with electronic health records, and the downstream consequences of that rush to implementation remain a tremendous cost, both financially and in the well-being of the humans made to use these “tools.”

Unlike the academicians arguing over whether AI can be rightly considered an author, these computer scientists engaged in the real world recognize the real issues. They get the last word.

“Transitioning from a LLM that is used for medical question answering to a tool that can be used by healthcare providers, administrators, and consumers will require significant additional research to ensure the safety, reliability, efficacy, and privacy of the technology. Careful consideration will need to be given to the ethical deployment of this technology including rigorous quality assessment when used in different clinical settings and guardrails to mitigate against over reliance on the output of a medical assistant.”

Sources: Large Language Models Encode Clinical Knowledge, arXiv. DOI: 10.48550/arXiv.2212.13138

AI Passes US Medical Licensing Exam, MedPage Today