PuSH - Publication Server of Helmholtz Zentrum München: Evaluating large language models for diagnostic reasoning from unstructured clinical narratives in epilepsy.

Dani, M.* ; Prakash, M.J.* ; Rosa, F.* ; Akata, Z. ; Liebe, S.*

Evaluating large language models for diagnostic reasoning from unstructured clinical narratives in epilepsy.

Commun. Med. 6:303 (2026)

Open Access Gold

Abstract

BACKGROUND: Large Language Models (LLMs) have been shown to encode clinical knowledge. Many evaluations, however, rely on structured question-answer benchmarks, overlooking the critical challenges of interpreting and reasoning about unstructured clinical narratives in real-world settings.

METHODS: In this study we task eight Large Language Models, including two medical models (GPT-3.5, GPT-4, Mixtral-8x7B, Qwen-72B, Llama 2, Llama 3, OpenBioLLM, Med42), with a core diagnostic task in epilepsy: mapping seizure description phrases, after targeted filtering and standardization, to one of seven possible seizure onset zones using likelihood estimates. We conduct quantitative and qualitative analyses, measuring correctness, confidence, calibration, expert-evaluated reasoning quality, and source citation accuracy. Through systematic prompt-engineering and ablation studies, we assess how model performance depends on variations in prompt strategy, clinical role impersonation, narrative length, and language context.

RESULTS: After prompt engineering, most models yield accuracy well above chance that even approaches clinician-level performance. Specifically, clinician-guided chain-of-thought reasoning leads to the most consistent improvements. Performance is further strongly modulated by clinical in-context impersonation, narrative length, and language context (13.7%, 32.7%, and 14.2% performance variation, respectively). However, reasoning analysis by clinical experts reveals that correct predictions can be based on hallucinated knowledge and inaccurate source citation, underscoring the need to improve the interpretability of LLMs in clinical use.

CONCLUSIONS: Overall, SemioLLM provides a scalable, domain-adaptable framework for evaluating LLMs in clinical disciplines where unstructured verbal descriptions encode diagnostic information. By identifying both the strengths and limitations of LLMs, our work contributes to testing the applicability of foundational AI systems for healthcare.
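The abstract names correctness, confidence, and calibration as evaluation measures but does not say how they are computed. The following is a minimal sketch of one common way to do it, assuming each model returns seven likelihood estimates per seizure description. The toy numbers, the zone indices 0 to 6, the function names, and the choice of expected calibration error (ECE) with five equal-width bins are all illustrative assumptions, not the authors' published procedure.

```python
from typing import List, Sequence, Tuple

NUM_ZONES = 7  # the abstract states seven possible seizure onset zones


def predict(likelihoods: Sequence[float]) -> Tuple[int, float]:
    """Return (predicted zone index, confidence) for one description."""
    total = sum(likelihoods)
    probs = [p / total for p in likelihoods]  # normalize so values sum to 1
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of descriptions whose top-ranked zone matches the label."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def expected_calibration_error(
    confs: Sequence[float], correct: Sequence[bool], n_bins: int = 5
) -> float:
    """Weighted mean gap between confidence and accuracy over equal-width bins."""
    n = len(confs)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confs) if lo < c <= hi]
        if not idx:
            continue  # empty bins contribute nothing
        bin_acc = sum(correct[i] for i in idx) / len(idx)
        bin_conf = sum(confs[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(bin_acc - bin_conf)
    return ece


# Hypothetical likelihood estimates (in percent) for four descriptions.
likelihoods: List[List[float]] = [
    [70, 10, 5, 5, 5, 3, 2],
    [10, 60, 10, 5, 5, 5, 5],
    [20, 20, 40, 5, 5, 5, 5],
    [5, 5, 5, 80, 2, 2, 1],
]
labels = [0, 1, 3, 3]  # hypothetical ground-truth zone indices

results = [predict(row) for row in likelihoods]
preds = [zone for zone, _ in results]
confs = [conf for _, conf in results]
correct = [p == y for p, y in zip(preds, labels)]

acc = accuracy(preds, labels)
ece = expected_calibration_error(confs, correct)
chance = 1 / NUM_ZONES  # uniform-guess baseline, valid only if zones are balanced

print(f"accuracy={acc:.3f}  chance={chance:.3f}  ECE={ece:.3f}")
```

A lower ECE means the stated confidence tracks observed accuracy more closely. The uniform chance level of 1/7 is only a reference point; if onset zones are unevenly represented in the data, a majority-class baseline would be the fairer comparison.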

Publication type Article: Journal article

Document type Scientific Article

Keywords Interpretability; Narrative; Context (archaeology); Task (project management); Language Model; Vocabulary; Epilepsy; Comprehension; Seizure Semiology

ISSN (print) / ISBN 2730-664X

e-ISSN 2730-664X

Journal Communications Medicine

Source details Volume: 6, Issue: 1, Article Number: 303

Publisher Springer

Publishing Place Campus, 4 Crinan St, London, N1 9XW, England

Reviewing status Peer reviewed

Institute(s) Helmholtz Artificial Intelligence Cooperation Unit (HAICU)

Grants Else Kröner-Fresenius-Stiftung (Else Kroner-Fresenius Foundation)