Assessment of fine-tuned large language models for real-world chemistry and material science applications
Abstract
The current generation of large language models (LLMs) has limited chemical knowledge. Recently, it has been shown that these LLMs can learn and predict chemical properties through fine-tuning. Using natural language to train machine learning models opens doors to a wider chemical audience, as field-specific featurization techniques can be omitted. In this work, we explore the potential and limitations of this approach. We study the performance of three fine-tuned open-source LLMs (GPT-J-6B, Llama-3.1-8B, and Mistral-7B) on a range of chemical questions. We benchmark their performance against “traditional” machine learning models and find that, in most cases, the fine-tuning approach is superior for a simple classification problem. Depending on the size of the dataset and the type of question, we also successfully address more sophisticated problems. The most important conclusions of this work are that, for all datasets considered, conversion into an LLM fine-tuning training set is straightforward, and that fine-tuning with even relatively small datasets leads to predictive models. These results suggest that the systematic use of LLMs to guide experiments and simulations will be a powerful technique in any research study, significantly reducing unnecessary experiments or computations.
Description
Funder: National Center of Competence in Research Materials’ Revolution: Computational Design and Discovery of Novel Materials; doi: https://doi.org/10.13039/501100009150; Grant(s): Unassigned
Journal ISSN
2041-6539
Sponsorship
Seventh Framework Programme (FP7/2007–2013)
Cambridge Trust (Unassigned)
National Institutes of Health (Unassigned)
Agencia Estatal de Investigación (TED2021-131693B-I00, CNS2022-135474)
Consejo Superior de Investigaciones Científicas (ILINK23047)
Ministerio de Ciencia e Innovación (Unassigned)
European Regional Development Fund (Unassigned)
Carl-Zeiss-Stiftung (Unassigned)
Grantham Foundation for the Protection of the Environment (Unassigned)
Novo Nordisk Fonden (NNF23OC0081359)
H2020 European Research Council (101106377, 101001615, 949229)
UK Research and Innovation (EP)
Intramural Research Program (Unassigned)
National Institute of Diabetes and Digestive and Kidney Diseases (Unassigned)
Frances and Augustus Newman Foundation (Unassigned)
NCCR Catalysis (Unassigned)
Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung (205602)
H2020 Marie Skłodowska-Curie Actions (945363)

