The current generation of large language models (LLMs) has limited chemical knowledge. Recently, it has been shown that these LLMs can learn and predict chemical properties through fine-tuning. Using natural language to train machine learning models makes this approach accessible to a wider chemistry audience, because field-specific featurization techniques can be omitted. In this work, we explore the potential and limitations of this approach. We study the performance of three fine-tuned open-source LLMs (GPT-J-6B, Llama-3.1-8B, and Mistral-7B) on a range of different chemical questions. We benchmark their performance against "traditional" machine learning models and find that, in most cases, the fine-tuning approach is superior for simple classification problems. Depending on the size of the dataset and the type of question, we also successfully address more sophisticated problems. The most important conclusions of this work are that, for all datasets considered, conversion into an LLM fine-tuning training set is straightforward, and that fine-tuning with even relatively small datasets leads to predictive models. These results suggest that the systematic use of LLMs to guide experiments and simulations will be a powerful technique in any research study, significantly reducing the number of unnecessary experiments or computations.
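To illustrate what converting a dataset into a fine-tuning training set can look like, the following is a minimal sketch, not the authors' pipeline. The prompt template, the function names, the `prompt`/`completion` field names, and the two toy rows are all assumptions made for illustration; the abstract does not specify any of them.

```python
import json

# Hypothetical prompt template; the exact wording is an assumption.
PROMPT_TEMPLATE = "What is the {prop} of {name}?"


def to_finetune_records(rows, prop):
    """Turn (name, label) pairs into prompt/completion dictionaries."""
    records = []
    for name, label in rows:
        records.append(
            {
                "prompt": PROMPT_TEMPLATE.format(prop=prop, name=name),
                "completion": str(label),
            }
        )
    return records


def to_jsonl(records):
    """Serialize the records as JSON Lines, one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


# Toy rows for illustration only, with labels binned into two classes.
toy_rows = [("ethanol", "high"), ("hexane", "low")]

records = to_finetune_records(toy_rows, prop="water solubility")
jsonl_text = to_jsonl(records)
print(jsonl_text)
```

Each row becomes one question-answer pair in plain language, so no chemistry-specific featurization is needed. Many fine-tuning toolchains accept this kind of JSON Lines file, although the expected field names differ between them.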
Funding: UK Research and Innovation (UKRI) under the UK government's Horizon Europe European Research Council (ERC); Data Sciences Institute at the University of Toronto; Novo Nordisk Foundation; USorb-DAC Project through Grantham Foundation for the Protection of the Environment; Carl Zeiss Foundation; European Regional Development Fund (ERDF); Galician Government; Spanish Ministry of Science and Innovation; Spanish National Research Council (CSIC); European Union NextGenerationEU/PRTR; Spanish Agencia Estatal de Investigacion (AEI) - MICIU/AEI; European Research Council under the European Union's Horizon 2020 research and innovation program through the ERC grant DiProPhys; National Institutes of Health Oxford-Cambridge Scholars Program; Cambridge Trust's Cambridge International Scholarship; European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme; European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant; NCCR MARVEL, a National Centre of Competence in Research - Swiss National Science Foundation; Italian MUR; St. John's College Research Fellowship programme; Rhodes Trust; Schmidt Science Fellowship; Frances and Augustus Newman Foundation; European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013); Intramural Research Program of the National Institute of Diabetes and Digestive and Kidney Diseases at the National Institutes of Health; Swiss Science Foundation