TY - JOUR
AB - The use of Large Language Models (LLMs) in mental health highlights the need to understand their responses to emotional content. Previous research shows that emotion-inducing prompts can elevate "anxiety" in LLMs, affecting behavior and amplifying biases. Here, we found that traumatic narratives increased Chat-GPT-4's reported anxiety while mindfulness-based exercises reduced it, though not to baseline. These findings suggest managing LLMs' "emotional states" can foster safer and more ethical human-AI interactions.
AU - Ben-Zion, Z.*
AU - Witte, K.
AU - Jagadish, A.K.
AU - Duek, O.*
AU - Harpaz-Rotem, I.*
AU - Khorsandian, M.C.*
AU - Burrer, A.*
AU - Seifritz, E.*
AU - Homan, P.*
AU - Schulz, E.
AU - Spiller, T.R.*
C1 - 74451
C2 - 57485
TI - Assessing and alleviating state anxiety in large language models.
JO - NPJ Digit. Med.
VL - 8
IS - 1
PY - 2025
SN - 2398-6352
ER -

TY - JOUR
AB - The World Health Organization increasingly highlights the role of digital health technologies in supporting prenatal care. Despite this potential, the real-world implementation of such technologies remains limited, even in high-income countries with established analog systems. We developed a comprehensive digital pregnancy care framework, SMART Start, and evaluated it in a prospective study involving 528 pregnant individuals in Germany. This study is registered at the German Clinical Trials Register (DRKS00036867). Participants were equipped with a mobile app and self-examination technologies. The mobile app featured study functionality, pregnancy-related questionnaires, digital maternity records, and pregnancy-supportive content. Self-examination technologies included a standard care kit for home measurements of routine prenatal care parameters (weight, blood pressure, urinalysis), and an innovative kit with novel sensors (smartwatch, sleep analyzer). Here, we analyzed the adherence to digital pregnancy care and present the lessons learned from a clinical and technical perspective. Among all participants, 49% engaged with at least one digital package. Weekly weight tracking reached adherence rates up to 67% in the first 14 weeks. Adherence to blood pressure and urinalysis measurements was lower, peaking at 20 and 28%, respectively, but remained stable over time. Questionnaire completion rates varied depending on their length and complexity. 31% of users disengaged at the time of registration. While overall retention time did not significantly differ across participant subgroups (all p > 0.05), adherence analyses revealed meaningful group-level differences in engagement with specific self-examination protocols. This discrepancy underscores that continued participation does not necessarily imply consistent engagement with all components of the digital care model. The adherence to the study schedule demonstrated that pregnant individuals are generally willing and capable of engaging in home-based, multimodal self-monitoring; however, the importance of adaptive scheduling, patient-centered feedback, agile development, and interdisciplinary collaboration should be addressed by future studies. The presented SMART Start framework offers a pathway towards data-driven, personalized pregnancy care while potentially reducing the demand for conventional healthcare infrastructure.
AU - Jaeger, K.M.*
AU - Nissen, M.*
AU - Leutheuser, H.*
AU - Danzberger, N.*
AU - Titzmann, A.*
AU - Pontones, C.A.*
AU - Goossens, C.*
AU - Ziegler, P.*
AU - Uhrig, S.*
AU - Haeberle, L.*
AU - Bleher, H.*
AU - Kast, K.*
AU - Kornhuber, J.*
AU - Schoeffski, O.*
AU - Braun, M.*
AU - Fasching, P.A.*
AU - Beckmann, M.W.*
AU - Eskofier, B.M.
AU - Huebner, H.*
C1 - 75435
C2 - 58008
CY - Heidelberger Platz 3, Berlin, 14197, Germany
TI - Adherence to digital pregnancy care - lessons learned from the SMART start feasibility study.
JO - NPJ Digit. Med.
VL - 8
IS - 1
PB - Nature Portfolio
PY - 2025
SN - 2398-6352
ER -

TY - JOUR
AB - The use of large language models (LLMs) in clinical diagnostics and intervention planning is expanding, yet their utility for personalized recommendations for longevity interventions remains opaque. We extended the BioChatter framework to benchmark LLMs' ability to generate personalized longevity intervention recommendations based on biomarker profiles while adhering to key medical validation requirements. Using 25 individual profiles across three different age groups, we generated 1000 diverse test cases covering interventions such as caloric restriction, fasting and supplements. Evaluating 56000 model responses via an LLM-as-a-Judge system with clinician validated ground truths, we found that proprietary models outperformed open-source models, especially in comprehensiveness. However, even with Retrieval-Augmented Generation (RAG), all models exhibited limitations in addressing key medical validation requirements, prompt stability, and handling age-related biases. Our findings highlight limited suitability of LLMs for unsupervised longevity intervention recommendations. Our open-source framework offers a foundation for advancing AI benchmarking in various medical contexts.
AU - Jarchow, H.*
AU - Bobrowski, C.*
AU - Falk, S.*
AU - Hermann, A.*
AU - Kulaga, A.*
AU - Põder, J.C.*
AU - Unfried, M.*
AU - Usanov, N.*
AU - Zendeh, B.*
AU - Kennedy, B.K.*
AU - Lobentanzer, S.
AU - Fuellen, G.*
C1 - 75878
C2 - 58175
TI - Benchmarking large language models for personalized, biomarker-based health intervention recommendations.
JO - NPJ Digit. Med.
VL - 8
IS - 1
PY - 2025
SN - 2398-6352
ER -

TY - JOUR
AB - The applicability of vision-language models (VLMs) for acute care in emergency and intensive care units remains underexplored. Using a multimodal dataset of diagnostic questions involving medical images and clinical context, we benchmarked several small open-source VLMs against GPT-4o. While open models demonstrated limited diagnostic accuracy (up to 40.4%), GPT-4o significantly outperformed them (68.1%). Findings highlight the need for specialized training and optimization to improve open-source VLMs for acute care applications.
AU - Kurz, C.*
AU - Merzhevich, T.*
AU - Eskofier, B.M.
AU - Kather, J.N.*
AU - Gmeiner, B.*
C1 - 75121
C2 - 57772
CY - Heidelberger Platz 3, Berlin, 14197, Germany
TI - Benchmarking vision-language models for diagnostics in emergency and critical care settings.
JO - NPJ Digit. Med.
VL - 8
IS - 1
PB - Nature Portfolio
PY - 2025
SN - 2398-6352
ER -

TY - JOUR
AB - Generative artificial intelligence is revolutionizing digital twin development, enabling virtual patient representations that predict health trajectories, with large language models (LLMs) showcasing untapped clinical forecasting potential. We developed the Digital Twin-Generative Pretrained Transformer (DT-GPT), extending LLM-based forecasting solutions to clinical trajectory prediction. DT-GPT leverages electronic health records without requiring data imputation or normalization and overcomes real-world data challenges such as missingness, noise, and limited sample sizes. Benchmarking on non-small cell lung cancer, intensive care unit, and Alzheimer's disease datasets, DT-GPT outperformed state-of-the-art machine learning models, reducing the scaled mean absolute error by 3.4%, 1.3% and 1.8%, respectively. It maintained distributions and cross-correlations of clinical variables, and demonstrated explainability through a human-interpretable interface. Additionally, DT-GPT's ability to perform zero-shot forecasting highlights potential advantages of LLMs as clinical forecasting platforms, proposing a path towards digital twin applications in clinical trials, treatment selection, and adverse event mitigation.
AU - Makarov, N.
AU - Bordukova, M.
AU - Quengdaeng, P.
AU - Garger, D.
AU - Rodriguez-Esteban, R.*
AU - Schmich, F.*
AU - Menden, M.P.
C1 - 75691
C2 - 57920
TI - Large language models forecast patient health trajectories enabling digital twins.
JO - NPJ Digit. Med.
VL - 8
IS - 1
PY - 2025
SN - 2398-6352
ER -

TY - JOUR
AB - During pregnancy, almost all women experience pregnancy-related symptoms. The relationship between symptoms and their association with pregnancy outcomes is not well understood. Many pregnancy apps allow pregnant women to track their symptoms. To date, the resulting data are primarily used from a commercial rather than a scientific perspective. In this work, we aim to examine symptom occurrence, course, and their correlation throughout pregnancy. Self-reported app data of a pregnancy symptom tracker is used. In this context, we present methods to handle noisy real-world app data from commercial applications to understand the trajectory of user and patient-reported data. We report real-world evidence from patient-reported outcomes that exceeds previous works: 1,549,186 tracked symptoms from 183,732 users of a smartphone pregnancy app symptom tracker are analyzed. The majority of users track symptoms on a single day. These data are generalizable to those users who use the tracker for at least 5 months. Week-by-week symptom report data are presented for each symptom. There are few or conflicting reports in the literature on the course of diarrhea, fatigue, headache, heartburn, and sleep problems. A peak in fatigue in the first trimester, a peak in headache reports around gestation week 15, and a steady increase in the reports of sleeping difficulty throughout pregnancy are found. Our work highlights the potential of secondary use of industry data. It reveals and clarifies several previously unknown or disputed symptom trajectories and relationships. Collaboration between academia and industry can help generate new scientific knowledge.
AU - Nissen, M.*
AU - Barrios Campo, N.*
AU - Flaucher, M.*
AU - Jaeger, K.M.*
AU - Titzmann, A.*
AU - Blunck, D.*
AU - Fasching, P.A.*
AU - Engelhardt, V.*
AU - Eskofier, B.M.
AU - Leutheuser, H.*
C1 - 68667
C2 - 54872
CY - Heidelberger Platz 3, Berlin, 14197, Germany
TI - Prevalence and course of pregnancy symptoms using self-reported pregnancy app symptom tracker data.
JO - NPJ Digit. Med.
VL - 6
IS - 1
PB - Nature Portfolio
PY - 2023
SN - 2398-6352
ER -

TY - JOUR
AB - Consumer wearables and sensors are a rich source of data about patients’ daily disease and symptom burden, particularly in the case of movement disorders like Parkinson’s disease (PD). However, interpreting these complex data into so-called digital biomarkers requires complicated analytical approaches, and validating these biomarkers requires sufficient data and unbiased evaluation methods. Here we describe the use of crowdsourcing to specifically evaluate and benchmark features derived from accelerometer and gyroscope data in two different datasets to predict the presence of PD and severity of three PD symptoms: tremor, dyskinesia, and bradykinesia. Forty teams from around the world submitted features, and achieved drastically improved predictive performance for PD status (best AUROC = 0.87), as well as tremor- (best AUPR = 0.75), dyskinesia- (best AUPR = 0.48) and bradykinesia-severity (best AUPR = 0.95).
AU - Sieberts, S.K.*
AU - Schaff, J.*
AU - Duda, M.*
AU - Pataki, B.Á.*
AU - Sun, M.*
AU - Snyder, P.*
AU - Daneault, J.F.*
AU - Parisi, F.*
AU - Costante, G.*
AU - Rubin, U.*
AU - Banda, P.*
AU - Chae, Y.*
AU - Chaibub Neto, E.*
AU - Dorsey, E.R.*
AU - Aydın, Z.*
AU - Chen, A.*
AU - Elo, L.L.*
AU - Espino, C.*
AU - Glaab, E.*
AU - Goan, E.*
AU - Golabchi, F.N.*
AU - Görmez, Y.*
AU - Jaakkola, M.K.*
AU - Jonnagaddala, J.*
AU - Klén, R.*
AU - Li, D.*
AU - McDaniel, C.*
AU - Perrin, D.*
AU - Perumal, T.M.*
AU - Rad, N.M.*
AU - Rainaldi, E.*
AU - Sapienza, S.*
AU - Schwab, P.*
AU - Shokhirev, N.*
AU - Venäläinen, M.S.*
AU - Vergara-Diaz, G.*
AU - Zhang, Y.*
AU - Abrami, A.*
AU - Adhikary, A.*
AU - Agurto, C.*
AU - Bhalla, S.*
AU - Bilgin, H.*
AU - Caggiano, V.*
AU - Cheng, J.*
AU - Deng, E.*
AU - Gan, Q.*
AU - Girsa, R.*
AU - Han, Z.*
AU - Heisig, S.*
AU - Huang, K.*
AU - Jahandideh, S.*
AU - Kopp, W.*
AU - Kurz, C.F.
AU - Lichtner, G.*
AU - Norel, R.*
AU - Raghava, G.P.S.*
AU - Sethi, T.*
AU - Shawen, N.*
AU - Tripathi, V.*
AU - Tsai, M.*
AU - Wang, T.*
AU - Wu, Y.*
AU - Zhang, J.*
AU - Zhang, X.*
AU - Wang, Y.*
AU - Guan, Y.*
AU - Brunner, D.*
AU - Bonato, P.*
AU - Mangravite, L.M.*
AU - Omberg, L.*
C1 - 61660
C2 - 50371
TI - Crowdsourcing digital health measures to predict Parkinson’s disease severity: The Parkinson’s Disease Digital Biomarker DREAM Challenge.
JO - NPJ Digit. Med.
VL - 4
IS - 1
PY - 2021
SN - 2398-6352
ER -