OpticalBERT and OpticalTable-SQA: Text- and Table-Based Language Models for the Optical-Materials Domain.

Published version


Text mining in the optical-materials domain is becoming increasingly important as the number of scientific publications in this area grows rapidly. Language models such as Bidirectional Encoder Representations from Transformers (BERT) have substantially advanced the state of the art across natural-language-processing (NLP) tasks. In this paper, we present two "materials-aware" text-based language models for optical research, OpticalBERT and OpticalPureBERT, which are trained on a large corpus of scientific literature in the optical-materials domain. These two models outperform BERT and previous state-of-the-art models in a variety of text-mining tasks about optical materials. We also release the first "materials-aware" table-based language model, OpticalTable-SQA, a question-answering model that answers questions about optical materials from tabular information in this scientific domain. The OpticalTable-SQA model was produced by fine-tuning the Tapas-SQA model on a manually annotated data set, OpticalTableQA, which was curated specifically for this work. While preserving its sequential question-answering performance on general tables, the OpticalTable-SQA model significantly outperforms Tapas-SQA on optical-materials-related tables. All models and data sets are available to the optical-materials-science community.



Data Mining, Electric Power Supplies, Language, Materials Science, Natural Language Processing

Journal Title

J Chem Inf Model

American Chemical Society (ACS)
Royal Academy of Engineering (RCSRC\1819\7\10)
Christ's College, University of Cambridge (NA)
Cambridge Trust (NA)
China Scholarship Council (NA)