Every clinician and patient advocate tracking the AI-in-medicine wave should pause here: the benchmarks used to certify health AI readiness are fundamentally misleading. A large language model acing a medical licensing exam is not the same as one that can safely guide a physician's decision or answer a patient's urgent health question — yet the field has been treating these as equivalent. That conflation now has rigorous documentation behind it.
Published in Nature Medicine, this analysis draws on adversarial stress-testing of large language models deployed or proposed for health applications. When probed beyond standard benchmark conditions, the models exhibited three distinct failure patterns: shortcut reliance, in which answers derive from statistical surface patterns rather than genuine clinical reasoning; fragile visual grounding, in which diagnostic image interpretation degrades sharply under even minor perturbations; and fabricated reasoning traces, in which the model generates plausible-sounding but internally inconsistent or factually unsupported chains of justification. High aggregate benchmark scores masked all three vulnerabilities, which standard evaluations do not detect, and so created a false signal of readiness.
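To make the masking effect concrete, here is a minimal sketch of how an aggregate score can hide shortcut reliance. It is not the study's protocol: the items, the `shortcut_model` stand-in, and the `rotate_options` perturbation are all hypothetical, chosen only to show a clean-versus-perturbed comparison on a toy multiple-choice set.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Item:
    question: str
    options: List[str]
    answer: str  # text of the correct option


def shortcut_model(item: Item) -> str:
    """Hypothetical stand-in for a model that leans on a surface pattern:
    it always picks whichever option is listed first."""
    return item.options[0]


def rotate_options(item: Item) -> Item:
    """Perturbation that changes presentation only: rotate the options by one.
    The question and the correct answer text are unchanged."""
    return Item(item.question, item.options[1:] + item.options[:1], item.answer)


def accuracy(model: Callable[[Item], str], items: List[Item]) -> float:
    """Fraction of items for which the model returns the correct option text."""
    return sum(model(it) == it.answer for it in items) / len(items)


def stress_report(
    model: Callable[[Item], str],
    items: List[Item],
    perturb: Callable[[Item], Item],
) -> Dict[str, float]:
    """Compare accuracy on the original items with accuracy after perturbation."""
    clean = accuracy(model, items)
    perturbed = accuracy(model, [perturb(it) for it in items])
    return {"clean": clean, "perturbed": perturbed, "gap": clean - perturbed}


# Toy benchmark (placeholder content, not clinical data). The correct option
# happens to be listed first in three of the four items, which is exactly the
# kind of regularity a shortcut can exploit.
ITEMS = [
    Item("item-1", ["correct-1", "distractor-1a", "distractor-1b"], "correct-1"),
    Item("item-2", ["correct-2", "distractor-2a", "distractor-2b"], "correct-2"),
    Item("item-3", ["correct-3", "distractor-3a", "distractor-3b"], "correct-3"),
    Item("item-4", ["distractor-4a", "correct-4", "distractor-4b"], "correct-4"),
]

if __name__ == "__main__":
    print(stress_report(shortcut_model, ITEMS, rotate_options))
    # {'clean': 0.75, 'perturbed': 0.25, 'gap': 0.5}
```

In this toy setup the clean score of 0.75 looks respectable, yet it falls to 0.25 once the options are reordered, even though nothing about the underlying questions has changed. Reporting the gap alongside the headline number is one simple way to surface the kind of fragility an aggregate score conceals.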
This finding lands in a research landscape already wrestling with AI validation methodology. The broader machine-learning community has long understood that benchmark overfitting, in which models implicitly optimize for test-set patterns, inflates apparent capability. Medicine raises the stakes dramatically because errors propagate to real patient outcomes. The fabricated reasoning problem is particularly concerning: unlike a bare wrong answer, a confident, well-structured but incorrect clinical rationale can override a clinician's own judgment. The work has limitations, including the inherent difficulty of standardizing adversarial tests across diverse clinical domains and the possibility that next-generation training approaches may partially address shortcut reliance. Still, the editorial assessment is clear: this is a paradigm-shifting challenge to how the field defines and measures AI clinical readiness, not an incremental refinement.