Green Open Access is possible once the postprint has been submitted to the ZB.
MolEncoder: Towards optimal masked language modeling for molecules.
In: (34th International Conference on Artificial Neural Networks, ICANN 2025, 9-12 September 2025, Kaunas). Berlin [et al.]: Springer, 2026. 42-44 (Lect. Notes Comput. Sc. ; 16072 LNCS)
Predicting molecular properties is an important
challenge in drug discovery. Machine learning methods, particularly
those based on transformer architectures, have become increasingly
popular for this task by learning molecular representations directly
from chemical structure [1, 2]. Motivated by progress in natural
language processing, many recent approaches apply models of the BERT
(Bidirectional Encoder Representations from Transformers) architecture
[3] to molecular data using SMILES as the input format [4–9].
In this study, we revisit core design assumptions that originate
in natural language processing but are often carried over to molecular
tasks without modification. We explore how variations in masking
strategies, pretraining dataset size, and model size influence
downstream performance in molecular property prediction. Our findings
suggest that common practices inherited from natural language processing
do not always yield optimal results in this setting. In particular, we
observe that increasing the masking ratio can lead to significant
improvements, while scaling up the model or dataset size results in
stagnating gains despite higher computational cost (Fig. 1). Building on
these observations, we develop MolEncoder, a BERT-style model that
achieves improved performance on standard benchmarks while remaining
more efficient than existing approaches. These insights highlight
meaningful differences between molecular and textual learning settings.
By identifying design choices better suited to chemical data, we aim to
support more effective and efficient model development for researchers
working in drug discovery and related fields.
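
To make the role of the masking ratio concrete, the following is a minimal sketch of masked language modeling input preparation for a SMILES string. It is not the MolEncoder implementation: the regular-expression tokenizer, the `[MASK]` token name, and the helper names `tokenize_smiles` and `mask_tokens` are assumptions made for illustration, and every selected token is simply replaced by the mask token (the original BERT recipe also leaves some selected tokens unchanged or swaps them for random tokens). The ratios 0.15 (the BERT default) and 0.30 are used only to show the effect of the parameter and are not results from the paper.

```python
import random
import re

# Hypothetical SMILES tokenizer (an assumption, not the paper's):
# bracket atoms, two-letter halogens, two-digit ring closures, then single characters.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

MASK_TOKEN = "[MASK]"  # assumed mask token name


def tokenize_smiles(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_PATTERN.findall(smiles)


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a fraction `mask_ratio` of tokens with MASK_TOKEN.

    Returns the masked token list and a dict mapping each masked
    position to the original token the model would have to predict.
    """
    if not 0.0 < mask_ratio < 1.0:
        raise ValueError("mask_ratio must be strictly between 0 and 1")
    # Mask at least one token whenever the sequence is non-empty.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_ratio)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = MASK_TOKEN
    return masked, labels


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    tokens = tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
    for ratio in (0.15, 0.30):  # illustrative values only
        masked, labels = mask_tokens(tokens, ratio, rng)
        print(ratio, "".join(masked), len(labels))
```

A higher ratio hides more of the molecule per training example, so the model sees less context and must predict more tokens; the abstract reports that raising this ratio above the value common in natural language processing can improve downstream performance.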
Publication type
Article: Conference contribution
ISSN (print) / ISBN
0302-9743
e-ISSN
1611-3349
Conference title
34th International Conference on Artificial Neural Networks, ICANN 2025
Conference date
9-12 September 2025
Conference location
Kaunas
Journal
Lecture Notes in Computer Science
Source details
Volume: 16072 LNCS,
Pages: 42-44
Publisher
Springer
Place of publication
Berlin [et al.]
Institute(s)
Institute of Structural Biology (STB)