The use of large language models (LLMs) in clinical diagnostics and intervention planning is expanding, yet their utility for generating personalized longevity intervention recommendations remains unclear. We extended the BioChatter framework to benchmark LLMs' ability to produce personalized longevity intervention recommendations from biomarker profiles while adhering to key medical validation requirements. Using 25 individual profiles across three age groups, we generated 1,000 diverse test cases covering interventions such as caloric restriction, fasting, and supplementation. Evaluating 56,000 model responses via an LLM-as-a-Judge system with clinician-validated ground truths, we found that proprietary models outperformed open-source models, particularly in comprehensiveness. However, even with Retrieval-Augmented Generation (RAG), all models exhibited limitations in meeting key medical validation requirements, maintaining prompt stability, and handling age-related biases. Our findings highlight the limited suitability of LLMs for unsupervised longevity intervention recommendations. Our open-source framework offers a foundation for advancing AI benchmarking in various medical contexts.