Textbook-level medical knowledge in large language models: comparative evaluation using Japanese National Medical Examination.
Published version
Peer-reviewed
Repository URI
Repository DOI
Type
Change log
Authors
Abstract
Background
The accuracy of the latest reasoning-enhanced large language models (LLMs) on national medical licensing examinations remains unknown, yet it is crucial for determining how close these models are to serving as effective knowledge sources for medical education. This study aimed to evaluate the performance of four reasoning-enhanced LLMs (GPT-5, Grok-4, Claude Opus 4.1, and Gemini 2.5 Pro) on the Japanese National Medical Examination (JNME), providing insights into their potential as educational resources and their future applicability in medical practice.

Methods
We evaluated LLM performance on the 2019 and 2025 JNME (n = 793 questions). Questions were entered into each model with chain-of-thought prompting enabled. Accuracy was assessed overall and by question type, and incorrect responses were qualitatively reviewed by a licensed physician and a medical student.

Results
From highest to lowest, overall accuracies were 97.2% for Gemini 2.5 Pro, 96.3% for GPT-5, 96.1% for Claude Opus 4.1, and 95.6% for Grok-4, with no significant pairwise differences. On image-based and non-image-based items, Gemini 2.5 Pro achieved the highest accuracies (96.1% and 97.6%, respectively), with no significant difference between the two, whereas the other three LLMs were significantly less accurate on image-based items. Across difficulty levels, Gemini 2.5 Pro again achieved the highest accuracy (98.4% on easy, 97.3% on moderate, and 93.2% on difficult items). Within each LLM, accuracy on difficult questions was significantly lower than on easy questions. Common error patterns included selecting unnecessary additional options in single-choice questions, misdiagnosing X-ray or computed tomography images (primarily due to confusion over left–right laterality), and difficulty prioritizing appropriate actions in clinical questions with complex contextual information.

Conclusions
All four LLMs released in 2025 surpassed the 95% benchmark on the JNME, and their near-perfect (approximately 99%) performance on basic medical knowledge questions highlights substantial potential as learning resources in foundational medical education. Gemini 2.5 Pro demonstrated the most consistent performance across question types, whereas Grok-4 showed greater variability. The concentration of errors in clinical questions indicates that LLMs still require substantial refinement and validation before their use can be extended to clinical reasoning or patient care.
Description
Acknowledgements: Not applicable.
Publication status: Published
Journal Title
Conference Name
Journal ISSN
1472-6947