Investigating the performance of foundation models on human 3'UTR sequences.
Nucleic Acids Res. 53:gkaf871 (2025)
Foundation models, such as DNABERT and Nucleotide Transformer, have recently shaped a new direction in DNA research. Trained in an unsupervised manner on a vast quantity of genomic data, they can be used for a variety of downstream tasks, such as promoter prediction, DNA methylation prediction, gene network prediction, or functional variant prioritization. However, these models are often trained and evaluated on entire genomes, neglecting genome partitioning into different functional regions. In our study, we investigate the efficacy of various unsupervised approaches, including genome-wide and 3' untranslated region (3'UTR)-specific foundation models, on human 3'UTRs. To this end, we train a set of popular transformer architectures on a 3'UTR-specific dataset comprising 3 783 714 3'UTR sequences (6.6B bp) of 241 Zoonomia species. Our evaluation includes downstream tasks specific to RNA biology, such as recognition of binding motifs of RNA-binding proteins, detection of functional genetic variants, prediction of expression levels in massively parallel reporter assays, and estimation of messenger RNA half-life. Notably, models trained specifically on 3'UTR sequences outperform established genome-wide foundation models in three of the four downstream tasks. Our results underscore the importance of considering genome partitioning into distinct functional regions when training and evaluating foundation models. In addition, the proposed set of 3'UTR-specific tasks can be used for benchmarking future models.
Publication type
Article: Journal article
Document type
Scientific Article
Language
English
Publication Year
2025
HGF-reported in Year
2025
ISSN (print) / ISBN
0305-1048
e-ISSN
1362-4962
Source details
Volume: 53
Issue: 17
Article Number: gkaf871
Publisher
Oxford University Press
Publishing Place
Great Clarendon St, Oxford OX2 6DP, England
Reviewing status
Peer reviewed
POF-Topic(s)
30205 - Bioengineering and Digital Health
Research field(s)
Enabling and Novel Technologies
PSP Element(s)
G-553500-001
Grants
Deutsche Forschungsgemeinschaft
Date of record entry
2025-10-21