A doctor using a chatbot on a smartphone, with background icons representing digital medicine, pills and tablets, a stethoscope, the emergency service network, and telemedicine. The new study found that LLMs tended to provide inaccurate and inconsistent information to people seeking medical advice. Image credit: everythingpossible, Getty Images.

New study warns of risks in AI chatbots giving medical advice

The largest user study of large language models (LLMs) for assisting the general public with medical decisions has found that they present risks to people seeking medical advice due to their tendency to provide inaccurate and inconsistent information. The results have been published in Nature Medicine.

The new study, led by the Oxford Internet Institute and the Nuffield Department of Primary Care Health Sciences at the University of Oxford and carried out in partnership with MLCommons and other institutions, reveals a major gap between the promise of LLMs and their usefulness for people seeking medical advice. While these models now excel at standardised tests of medical knowledge, they pose risks to real users seeking help with their own medical symptoms.

In the study, participants used LLMs to identify health conditions and decide on an appropriate course of action, such as seeing a GP or going to the hospital, based on information provided in a series of specific medical scenarios developed by doctors.

A key finding was that LLMs offered no advantage over traditional methods: participants who used LLMs did not make better decisions than those who relied on online searches or their own judgment.

The study also revealed a two-way communication breakdown. Participants often didn’t know what information the LLMs needed to offer accurate advice, and the responses they received frequently combined good and poor recommendations, making it difficult to identify the best course of action.

In addition, existing tests fall short: current evaluation methods for LLMs do not reflect the complexity of interacting with human users. Just as new medications undergo clinical trials, LLM systems should be tested with real users before being deployed.

‘These findings highlight the difficulty of building AI systems that can genuinely support people in sensitive, high-stakes areas like health,’ said Dr Rebecca Payne, a GP and the lead medical practitioner on the study (Nuffield Department of Primary Care Health Sciences, University of Oxford, and Bangor University).

‘Despite all the hype, AI just isn't ready to take on the role of the physician. Patients need to be aware that asking a large language model about their symptoms can be dangerous, giving wrong diagnoses and failing to recognise when urgent help is needed.’

In the study, researchers conducted a randomised trial involving nearly 1,300 online participants, who were asked to identify potential health conditions and recommend a course of action based on personal medical scenarios. The detailed scenarios, developed by doctors, ranged from a young man developing a severe headache after a night out with friends to a new mother feeling constantly out of breath and exhausted.

One group used an LLM to assist their decision-making, while a control group used traditional sources of information. The researchers then evaluated how accurately participants identified the likely medical issues and the most appropriate next step, such as visiting a GP or going to A&E. They also compared these outcomes with the results of standard LLM testing strategies, which do not involve real human users. The contrast was striking: models that performed well on benchmark tests faltered when interacting with people.

They found evidence of three types of challenge:

  • Users often didn’t know what information they should provide to the LLM;
  • LLMs provided very different answers based on slight variations in the questions asked;
  • LLMs often provided a mix of good and bad information that users struggled to distinguish.

Lead author Andrew Bean, a DPhil student at the Oxford Internet Institute, said: ‘Designing robust testing for large language models is key to understanding how we can make use of this new technology. In this study, we show that interacting with humans poses a challenge even for top LLMs. We hope this work will contribute to the development of safer and more useful AI systems.’

Senior author Associate Professor Adam Mahdi (Oxford Internet Institute) said: ‘The disconnect between benchmark scores and real-world performance should be a wake-up call for AI developers and regulators. Our recent work on construct validity in benchmarks shows that many evaluations fail to measure what they claim to measure, and this study demonstrates exactly why that matters. We cannot rely on standardised tests alone to determine if these systems are safe for public use. Just as we require clinical trials for new medications, AI systems need rigorous testing with diverse, real users to understand their true capabilities in high-stakes settings like healthcare.’

The study ‘Clinical knowledge in LLMs does not translate to human interactions’ has been published in Nature Medicine.

For more information about this story, or to republish this content, please contact [email protected]