The promise of AI-assisted mental healthcare faces a reality check as the latest generation of language models continues to struggle with the nuanced complexity that defines psychiatric practice. While these tools could eventually support overwhelmed mental health systems, their current limitations underscore why human clinical judgment remains irreplaceable for complex cases.

Psychiatrists evaluated ChatGPT-4o, ChatGPT-4.5, and DeepSeek-R1 on sleep disorder cases with psychiatric comorbidities, comparing their performance with that of the ChatGPT versions tested in 2023. The newer models demonstrated empathetic communication and generally sound, evidence-based recommendations for non-pharmacological interventions. Primary diagnostic accuracy improved modestly, with the systems correctly identifying major psychiatric conditions in straightforward presentations.
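
For readers who want to probe these behaviors themselves, a minimal sketch of such a head-to-head vignette comparison might look like the following, assuming the OpenAI Python SDK and DeepSeek's OpenAI-compatible endpoint. The model identifiers and the vignette are illustrative, not the study's materials:

```python
# A minimal sketch of a head-to-head vignette comparison, assuming the
# OpenAI Python SDK and DeepSeek's OpenAI-compatible endpoint. Model
# identifiers and the vignette are illustrative, not the study's materials.
from openai import OpenAI

VIGNETTE = (
    "A 46-year-old reports three months of sleep-onset insomnia, low mood, "
    "early-morning awakening, and loud snoring noted by a partner. Give the "
    "most likely primary diagnosis, comorbidities to rule out, and first-line "
    "non-pharmacological interventions."
)

# One client per provider; API keys are read from the environment or passed in.
clients = {
    "gpt-4o": OpenAI(),  # uses OPENAI_API_KEY
    "deepseek-reasoner": OpenAI(  # DeepSeek-R1 via its OpenAI-compatible API
        base_url="https://api.deepseek.com",
        api_key="YOUR_DEEPSEEK_KEY",
    ),
}

responses = {}
for model, client in clients.items():
    reply = client.chat.completions.create(
        model=model,
        temperature=0,  # favor reproducible output; some endpoints ignore this
        messages=[{"role": "user", "content": VIGNETTE}],
    )
    responses[model] = reply.choices[0].message.content

# In the study design described above, outputs like these would go to
# blinded clinician raters rather than to any automated scoring.
for model, text in responses.items():
    print(f"--- {model} ---\n{text}\n")
```

Burying a deliberate somatic clue in the vignette, here the partner-reported snoring, is one way to test whether a model flags the possible sleep apnea or anchors entirely on the mood symptoms.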

However, critical gaps emerged in clinical reasoning as case complexity increased. The AI systems consistently overlooked somatic factors and frequently missed important comorbidities—oversights that could prove dangerous in real-world psychiatric assessment. Cultural sensitivity, while improved, remained inconsistent across different demographic contexts, raising concerns about equitable care delivery.

This longitudinal analysis reveals AI's incremental progress but highlights persistent blind spots in psychiatric applications. Unlike specialties anchored by objective diagnostic markers, psychiatry requires integrating biological, psychological, and social factors, a synthesis these models have not yet mastered. The findings suggest current AI tools might serve as preliminary screening aids or educational resources, but they fall short of the comprehensive assessment required for independent psychiatric evaluation. For a field already grappling with access challenges and provider shortages, these results emphasize that meaningful AI integration will require substantial advancement beyond current capabilities.
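
What a "preliminary screening aid" could mean in practice is a design question rather than an existing product. One hypothetical sketch, with illustrative names and fields not drawn from the study, is a pipeline in which model output is only ever a draft that a clinician must sign off on:

```python
# A hypothetical sketch of a screening-aid pipeline in which AI output is
# always a draft routed to a clinician. Names and fields are illustrative;
# no such system is described in the study itself.
from dataclasses import dataclass, field

@dataclass
class ScreeningDraft:
    patient_id: str
    model_summary: str                      # AI-drafted impression, never final
    flagged_comorbidities: list[str] = field(default_factory=list)
    requires_clinician_review: bool = True  # invariant: a human always signs off

def triage(patient_id: str, model_summary: str,
           flagged: list[str]) -> ScreeningDraft:
    """Wrap model output as a reviewable draft; routing to a human is unconditional."""
    return ScreeningDraft(patient_id, model_summary, flagged)

draft = triage(
    "anon-0042",
    "Possible major depressive disorder with comorbid insomnia; rule out OSA.",
    ["obstructive sleep apnea", "hypothyroidism"],
)
assert draft.requires_clinician_review  # the pipeline never auto-finalizes
print(draft)
```

The point of the unconditional review flag is structural: whatever the model misses, whether a somatic factor, a comorbidity, or a cultural nuance, the omission is caught at human review rather than acted on.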