PuSH - Publikationsserver des Helmholtz Zentrums München

Zanca, D.* ; Zugarini, A.* ; Dietz, S.* ; Altstidl, T.R.* ; Ndjeuha, M.A.T.* ; Chakraborty, M.* ; Jami, N.V.S.J.* ; Schwinn, L.* ; Eskofier, B.M.

Contrastive language-image pretrained models are zero-shot human scanpath predictors.

IEEE Transactions on Artificial Intelligence, DOI: 10.1109/TAI.2025.3612905 (2025)
Understanding human attention mechanisms is crucial for advancing both vision science and artificial intelligence. While numerous computational models of free-viewing have been proposed, less is known about the mechanisms underlying task-driven image exploration. To address this gap, we introduce NevaClip, a novel zero-shot method for predicting visual scanpaths. NevaClip leverages contrastive language-image pretrained (CLIP) models in conjunction with human-inspired neural visual attention (NeVA) algorithms. By aligning the representation of foveated visual stimuli with associated captions, NevaClip uses gradient-driven visual exploration to generate scanpaths that simulate human attention. We also present CapMIT1003, a new dataset comprising captions and click-contingent image explorations collected from participants engaged in a captioning task. Based on the established MIT1003 benchmark, which includes eye-tracking data from free-viewing conditions, CapMIT1003 provides a valuable resource for studying human attention across both free-viewing and task-driven contexts. Additionally, we demonstrate NevaClip’s performance on the publicly available AiR-D dataset, which includes visual question answering (VQA) tasks. Experimental results show that NevaClip outperforms existing unsupervised computational models in scanpath plausibility across captioning, VQA, and free-viewing tasks. Furthermore, we demonstrate that NevaClip’s performance is sensitive to caption accuracy, with misleading captions leading to inaccurate scanpath behaviors. This underscores the importance of caption guidance in attention prediction and highlights NevaClip’s potential to advance our understanding of task-driven human attention mechanisms. Together, NevaClip and CapMIT1003 offer significant contributions to the field, providing new tools for studying and simulating human visual attention.
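The gradient-driven exploration described in the abstract can be illustrated with a short sketch: a candidate fixation is optimised so that a foveated (sharp near the fixation, blurred elsewhere) version of the image maximises CLIP similarity with the caption, and fixations are appended greedily. The code below is a minimal, hypothetical Python sketch of that idea, not the authors' released implementation: the checkpoint choice (openai/clip-vit-base-patch32), the Gaussian acuity mask, the average-pooling blur, the random restarts, and the Adam loop over fixation coordinates are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
for p in model.parameters():      # freeze CLIP; we only need gradients
    p.requires_grad_(False)       # with respect to the fixation point

def foveate(image, fixations, sigma=0.15):
    """Blend the sharp image with a blurred copy, keeping full acuity
    only around the given fixations (normalised (x, y) in [0, 1])."""
    _, _, h, w = image.shape
    ys = torch.linspace(0, 1, h, device=image.device).view(h, 1)
    xs = torch.linspace(0, 1, w, device=image.device).view(1, w)
    mask = torch.zeros(h, w, device=image.device)
    for f in fixations:
        g = torch.exp(-((xs - f[0]) ** 2 + (ys - f[1]) ** 2) / (2 * sigma**2))
        mask = torch.maximum(mask, g)
    blurred = F.avg_pool2d(image, kernel_size=15, stride=1, padding=7)
    return mask * image + (1 - mask) * blurred

@torch.no_grad()
def embed_caption(caption):
    tokens = tokenizer(caption, return_tensors="pt").to(device)
    return F.normalize(model.get_text_features(**tokens), dim=-1)

def predict_scanpath(image, caption, n_fixations=5, steps=25, lr=0.05):
    """Greedily choose fixations: each new fixation is optimised so the
    cumulatively foveated image best matches the caption under CLIP."""
    target = embed_caption(caption)
    past = []
    for _ in range(n_fixations):
        # random restart for each new fixation (an assumption of this sketch)
        fix = torch.rand(2, device=device, requires_grad=True)
        opt = torch.optim.Adam([fix], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            fov = foveate(image, past + [fix])
            emb = F.normalize(model.get_image_features(pixel_values=fov), dim=-1)
            loss = -(emb * target).sum()   # maximise cosine similarity
            loss.backward()
            opt.step()
            with torch.no_grad():
                fix.clamp_(0.0, 1.0)       # keep the fixation inside the image
        past.append(fix.detach())
    return torch.stack(past)               # (n_fixations, 2) scanpath

# Usage: `image` is assumed to be a CLIP-preprocessed (1, 3, 224, 224) tensor.
# scanpath = predict_scanpath(image, "a dog catching a frisbee in a park")
```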
Impact Factor: 0.000
Scopus SNIP: 2.239
Publication type: Article: Journal article
Document type: Scientific article
Keywords: Captioning ; Human Attention ; Human-inspired Modeling ; Multimodal ; Visual Scanpath ; Zero-shot
Language: English
Year of publication: 2025
HGF reporting year: 2025
e-ISSN: 2691-4581
Publisher: IEEE
Review status: Peer reviewed
POF Topic(s): 30205 - Bioengineering and Digital Health
Research field(s): Enabling and Novel Technologies
PSP element(s): G-540008-001
Scopus ID: 105019103139
Date of record: 2025-10-27