Extracting and re-using research data from chemistry e-theses: the SPECTRa-T project


Type

Article

Authors

Morgan, Peter 
Downing, Jim 
Murray-Rust, Peter 
Stewart, Diana 
Tonge, Alan 

Abstract

Scientific e-theses are data-rich resources, but much of the information they contain is not readily accessible. For chemistry, the SPECTRa-T project has addressed this problem by developing data-mining techniques to extract experimental data, creating RDF (Resource Description Framework) triples for exposure to sophisticated Semantic Web searches.

We used OSCAR3, an Open Source chemistry text-mining tool, to parse and extract data from theses in PDF and from theses in Office Open XML document format (.docx).

Theses in PDF suffered data corruption and a loss of formatting that prevented the identification of chemical objects. Theses in .docx yielded semantically rich SciXML that enabled the additional extraction of associated data. Chemical objects were placed in a data repository, and RDF triples deposited in a triplestore.
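
As an illustration of the final step, the sketch below shows how extracted chemical data might be expressed as RDF triples and serialized in N-Triples form before loading into a triplestore. It is a minimal example under stated assumptions, not the project's implementation: the `example.org` namespaces, the property names, and the sample compound record are all hypothetical, and the helper functions `literal` and `to_ntriples` are introduced here only for illustration.

```python
# Hypothetical namespaces for illustration only; these are NOT the
# vocabulary used by SPECTRa-T.
EX = "http://example.org/"
PROP = "http://example.org/property/"


def literal(value):
    """Format a plain string as an N-Triples literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples as N-Triples lines.

    Simplifying assumption: an object that starts with "http://" is treated
    as a URI reference; anything else is written as a plain literal.
    """
    lines = []
    for subject, predicate, obj in triples:
        obj_term = f"<{obj}>" if obj.startswith("http://") else literal(obj)
        lines.append(f"<{subject}> <{predicate}> {obj_term} .")
    return "\n".join(lines)


# An invented record of the kind a text-mining step might produce.
compound = EX + "compound/1"
triples = [
    (compound, PROP + "name", "benzaldehyde"),
    (compound, PROP + "formula", "C7H6O"),
    (compound, PROP + "reportedIn", EX + "thesis/42"),
]

print(to_ntriples(triples))
```

Each printed line is one subject-predicate-object statement. Linking the compound to the thesis that reports it is what would allow a semantic query to move from a chemical object back to its source document.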

Data-mining from chemistry e-theses is both desirable and feasible, but the use of PDF, the de facto format standard for deposit in most repositories, prevents the optimal extraction of data for semantic querying. To facilitate such extraction, we recommend that universities also require deposition of chemistry e-theses in an XML document format. Further work is required to clarify the complex intellectual property rights (IPR) issues and ensure that they do not become an unwarranted barrier to data extraction and re-use.

Keywords

e-theses, chemistry research, text-mining, Open Data

Publisher

11th International Symposium on Electronic Theses and Dissertations
