PuSH - Publication Server of Helmholtz Zentrum München

Dani, M.* ; Prakash, M.J.* ; Rosa, F.* ; Akata, Z. ; Liebe, S.*

Evaluating large language models for diagnostic reasoning from unstructured clinical narratives in epilepsy.

Commun. Med. 6:303 (2026)
Open Access Gold
Creative Commons license
BACKGROUND: Large Language Models (LLMs) have been shown to encode clinical knowledge. Many evaluations, however, rely on structured question-answer benchmarks and overlook the critical challenges of interpreting and reasoning about unstructured clinical narratives in real-world settings.
METHODS: In this study, we task eight LLMs, including two medical models (GPT-3.5, GPT-4, Mixtral-8x7B, Qwen-72B, LlaMa2, LlaMa3, OpenBioLLM, Med42), with a core diagnostic task in epilepsy: mapping seizure description phrases (after targeted filtering and standardization) to one of seven possible seizure onset zones using likelihood estimates. We conduct quantitative and qualitative analyses, measuring correctness, confidence, calibration, expert-evaluated reasoning quality, and source citation accuracy. Through systematic prompt-engineering and ablation studies, we assess how model performance depends on variations in prompt strategy, clinical role impersonation, narrative length, and language context.
RESULTS: After prompt engineering, most models achieve accuracy well above chance, even approaching clinician-level performance. Specifically, clinician-guided chain-of-thought reasoning leads to the most consistent improvements. Performance is also strongly modulated by clinical in-context impersonation, narrative length, and language context (13.7%, 32.7%, and 14.2% performance variation, respectively). However, reasoning analysis by clinical experts reveals that correct predictions can be based on hallucinated knowledge and inaccurate source citation, underscoring the need to improve the interpretability of LLMs in clinical use.
CONCLUSIONS: Overall, SemioLLM provides a scalable, domain-adaptable framework for evaluating LLMs in clinical disciplines where unstructured verbal descriptions encode diagnostic information. By identifying both the strengths and limitations of LLMs, our work contributes to testing the applicability of foundational AI systems for healthcare.
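The scoring described in METHODS (likelihood estimates over seven onset zones, evaluated for correctness, confidence, and calibration) can be illustrated with a minimal sketch. This is not the authors' implementation: the zone labels, the function names, and the choice of expected calibration error (ECE) as the calibration measure are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import List, Sequence, Tuple

# Placeholder labels (assumption): the abstract states there are seven
# seizure onset zones but does not name them.
ZONES: List[str] = [f"zone_{i}" for i in range(1, 8)]


def normalize(likelihoods: Sequence[float]) -> List[float]:
    """Rescale raw likelihood estimates so they sum to 1."""
    total = sum(likelihoods)
    if total <= 0:
        raise ValueError("likelihoods must sum to a positive value")
    return [x / total for x in likelihoods]


def predict(likelihoods: Sequence[float]) -> Tuple[int, float]:
    """Return (index of the most likely zone, its normalized likelihood)."""
    probs = normalize(likelihoods)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


def accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of descriptions whose top-ranked zone matches the label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Weighted mean gap between confidence and accuracy over equal-width bins."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        bin_acc = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(bin_acc - mean_conf)
    return ece


if __name__ == "__main__":
    # Two made-up model outputs (assumption): one likelihood per zone.
    outputs = [[75, 5, 5, 5, 5, 5, 0], [10, 65, 5, 5, 5, 5, 5]]
    labels = [0, 2]  # hypothetical ground-truth zone indices
    results = [predict(o) for o in outputs]
    preds = [zone for zone, _ in results]
    confs = [conf for _, conf in results]
    hits = [p == a for p, a in zip(preds, labels)]
    print("chance level:", 1 / len(ZONES))
    print("accuracy:", accuracy(preds, labels))
    print("ECE:", expected_calibration_error(confs, hits))
```

With seven candidate zones, uniform guessing gives an accuracy of 1/7, which is the baseline that "well above chance" is measured against. A low ECE would indicate that the stated confidence of a model tracks how often it is actually correct.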
Publication type Article: Journal article
Document type Scientific Article
Keywords Interpretability ; Narrative ; Context (archaeology) ; Task (project Management) ; Language Model ; Vocabulary ; Epilepsy ; Comprehension
ISSN (print) / ISBN 2730-664X
e-ISSN 2730-664X
Citation details Volume: 6, Issue: 1, Article Number: 303
Publisher Springer
Reviewing status Peer reviewed
Grants Else Kröner-Fresenius-Stiftung (Else Kroner-Fresenius Foundation)