PuSH - Publication Server of Helmholtz Zentrum München: MM-DINOv2: Adapting Foundation Models for Multi-modal Medical Image Analysis.


Scholz, D.* ; Erdur, A.C.* ; Ehm, V.* ; Meyer-Baese, A.C.* ; Peeken, J.C. ; Rueckert, D.* ; Wiestler, B.*

MM-DINOv2: Adapting Foundation Models for Multi-modal Medical Image Analysis.

Lect. Notes Comput. Sc. 15967 LNCS, 320-330 (2026)


Open Access Green as soon as Postprint is submitted to ZB.

Abstract

Vision foundation models like DINOv2 demonstrate remarkable potential in medical imaging despite their origin in natural image domains. However, their design inherently works best for uni-modal image analysis, limiting their effectiveness for multi-modal imaging tasks that are common in many medical fields, such as neurology and oncology. While supervised models perform well in this setting, they fail to leverage unlabeled datasets and struggle with missing modalities, a frequent challenge in clinical settings. To bridge these gaps, we introduce MM-DINOv2, a novel and efficient framework that adapts the pre-trained vision foundation model DINOv2 for multi-modal medical imaging. Our approach incorporates multi-modal patch embeddings, enabling vision foundation models to effectively process multi-modal imaging data. To address missing modalities, we employ full-modality masking, which encourages the model to learn robust cross-modality relationships. Furthermore, we leverage semi-supervised learning to harness large unlabeled datasets, enhancing both the accuracy and reliability of medical predictions. We demonstrate our approach on glioma subtype classification from multi-sequence brain MRI, achieving a Matthews Correlation Coefficient (MCC) of 0.6 on an external test set and surpassing state-of-the-art supervised approaches by 11.1%. Beyond this specific application, our framework provides a scalable and robust blueprint for various multi-modal medical imaging problems, effectively leveraging vision foundation models pre-trained on natural images while addressing real-world clinical challenges such as missing data and limited annotations. The code is publicly available at https://github.com/daniel-scholz/mm-dinov2.
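For readers unfamiliar with the metric reported in the abstract, the following is a minimal sketch of how a Matthews Correlation Coefficient can be computed for a multi-class problem, using the standard multi-class generalization of the MCC. It is not taken from the authors' repository: the function name `multiclass_mcc` and the toy labels are illustrative assumptions, and the abstract does not say which implementation or how many classes were used.

```python
import math
from collections import Counter


def multiclass_mcc(y_true, y_pred):
    """Multi-class Matthews Correlation Coefficient (illustrative sketch).

    Uses the standard generalization
        MCC = (c*s - sum_k p_k*t_k)
              / sqrt((s^2 - sum_k p_k^2) * (s^2 - sum_k t_k^2))
    where s is the number of samples, c the number of correct predictions,
    t_k how often class k truly occurs, and p_k how often it is predicted.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    s = len(y_true)
    c = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    t_counts = Counter(y_true)  # true occurrences per class
    p_counts = Counter(y_pred)  # predicted occurrences per class

    labels = set(t_counts) | set(p_counts)
    cross = sum(t_counts[k] * p_counts[k] for k in labels)

    numerator = c * s - cross
    denominator = math.sqrt(
        (s * s - sum(v * v for v in p_counts.values()))
        * (s * s - sum(v * v for v in t_counts.values()))
    )
    # Convention: return 0.0 when the score is undefined, e.g. when every
    # prediction (or every true label) falls into a single class.
    if denominator == 0:
        return 0.0
    return numerator / denominator


if __name__ == "__main__":
    # Made-up labels for three hypothetical classes; not data from the paper.
    y_true = [0, 0, 1, 1, 2, 2]
    y_pred = [0, 0, 1, 2, 2, 2]
    print(round(multiclass_mcc(y_true, y_pred), 3))
```

The score ranges from -1 to 1, with 1 for perfect agreement and 0 for chance-level prediction. Because it accounts for the class distribution of both labels and predictions, it is less flattering than plain accuracy on imbalanced data.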

Impact Factor: 0.000
Scopus SNIP: 0.555