A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists.

Large language models (LLMs) have gained widespread interest owing to their ability to process human language and perform tasks on which they have not been explicitly trained. However, we possess only a limited systematic understanding of the chemical capabilities of LLMs, which would be required to improve models and mitigate potential harm. Here we introduce ChemBench, an automated framework for evaluating the chemical knowledge and reasoning abilities of state-of-the-art LLMs against the expertise of chemists. We curated more than 2,700 question-answer pairs, evaluated leading open- and closed-source LLMs and found that the best models, on average, outperformed the best human chemists in our study. However, the models struggle with some basic tasks and provide overconfident predictions. These findings reveal LLMs' impressive chemical capabilities while emphasizing the need for further research to improve their safety and usefulness. They also suggest adapting chemistry education and show the value of benchmarking frameworks for evaluating LLMs in specific domains.

Description

Acknowledgements: This work was supported by the Carl Zeiss Foundation, and a ‘Talent Fund’ of the ‘Life’ profile line of the Friedrich Schiller University Jena. In addition, M.S.-W.’s work was supported by Intel and Merck via the AWASES programme. Parts of A.M.’s work were supported as part of the ‘SOL-AI’ project funded by the Helmholtz Foundation model initiative. K.M.J. is part of the NFDI consortium FAIRmat funded by the Deutsche Forschungsgemeinschaft (the German Research Foundation) project no. 460197019. K.M.J. thanks FutureHouse (a non-profit research organization supported by the generosity of Eric and Wendy Schmidt) for supporting PaperQA2 runs via access to the API. We also thank Stability.AI for the access to its HPC cluster. M.R.-G. and M.V.G. acknowledge financial support from the Spanish Agencia Estatal de Investigación (AEI) through grants TED2021-131693B-I00 and CNS2022-135474, funded by Ministerio de Ciencia, Innovación y Universidades (MICIU)/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR. M.V.G. acknowledges support from the Spanish National Research Council through the Programme for internationalization i-LINK 2023 (project no. ILINK23047). A.A. gratefully acknowledges financial support for this research by the Fulbright US Student Programme, which is sponsored by the US Department of State and German-American Fulbright Commission. Its contents are solely the responsibility of the author and do not necessarily represent the official views of the Fulbright Programme, the Government of the USA or the German-American Fulbright Commission. M.A. expresses gratitude to the European Research Council for evaluating the project with the reference no. 101106377 titled ‘CLARIFIER’, and accepting it for funding under the HORIZON TMA MSCA Postdoctoral Fellowships - European Fellowships. Furthermore, M.A. acknowledges the funding provided by UK Research and Innovation under the UK government’s Horizon Europe funding guarantee (grant reference EP/Y023447/1; organization reference 101106377). M.R. and U.S.S. thank the ‘Deutsche Forschungsgemeinschaft’ for funding under the regime of the priority programme SPP 2363 ‘Utilization and Development of Machine Learning for Molecular Applications - Molecular Machine Learning’ (SCHU 1229/63-1; project no. 497115849). A.D.D.W. acknowledges funding from the European Union Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement no. 101107360. P.S. acknowledges support from the National Centre of Competence in Research Catalysis (grant no. 225147), a National Centre of Competence in Research grant funded by the Swiss National Science Foundation. In addition, we thank the OpenBioML.org community and their ChemNLP project team for valuable discussions. Moreover, we thank P. Márquez for discussions and support and J. Kimmig for feedback on the web app. In addition, we acknowledge support from S. Kumar with an initial prototype of the web app. We thank B. Smit for feedback on an early version of the manuscript.

Funder: Carl-Zeiss-Stiftung (Carl Zeiss Foundation); doi: https://doi.org/10.13039/501100007569

Funder: Helmholtz Association; doi: https://doi.org/10.13039/501100009318

Funder: Fulbright Association; doi: https://doi.org/10.13039/501100010629

Keywords

Humans, Language, Chemistry, Large Language Models

Journal Title

Nat Chem

Journal ISSN

1755-4330
1755-4349

Volume Title

17

Publisher

Springer Nature

Publisher DOI

https://doi.org/10.1038/s41557-025-01815-x

Rights and licensing

Except where otherwise noted, this item's license is described as http://creativecommons.org/licenses/by/4.0/

Sponsorship

Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung (Swiss National Science Foundation) (225147)
EC | Horizon 2020 Framework Programme (EU Framework Programme for Research and Innovation H2020) (101106377)
Ministry of Economy and Competitiveness | Agencia Estatal de Investigación (Spanish Agencia Estatal de Investigación) (CNS2022-135474)
Deutsche Forschungsgemeinschaft (German Research Foundation) (497115849)

Collections

Jisc Publications Router