How Beneficial Is Pretraining on a Narrow Domain-Specific Corpus for Information Extraction about Photocatalytic Water Splitting?

Published version
Peer-reviewed

Authors

paragon-plus: 5608081  ORCID: https://orcid.org/0000-0003-0475-403X
paragon-plus: 314732  ORCID: https://orcid.org/0000-0002-1552-8743

Abstract

Language models trained on domain-specific corpora have been employed to improve performance on specialized tasks. However, little previous work has been reported on how specific a “domain-specific” corpus should be. Here, we test a number of language models trained on corpora of varying specificity by employing them in the task of extracting information about photocatalytic water splitting. We find that more specific corpora can benefit performance on downstream tasks. Furthermore, PhotocatalysisBERT, a model pretrained from scratch on scientific papers on photocatalytic water splitting, demonstrates improved performance over previous work in associating the correct photocatalyst with the correct photocatalytic activity during information extraction, achieving a precision of 60.8(+11.5)% and a recall of 37.2(+4.5)%.

Description

Publication status: Published

Journal Title

Journal of Chemical Information and Modeling

Journal ISSN

1549-9596
1549-960X

Volume Title

64

Publisher

American Chemical Society

Sponsorship

BASF (NA)
Science and Technology Facilities Council (NA)
Royal Academy of Engineering (RCSRF1819/7/10)