Extracting and re-using research data from chemistry e-theses: the SPECTRa-T project
Authors
Morgan, Peter
Downing, Jim
Murray-Rust, Peter
Stewart, Diana
Tonge, Alan
Townsend, Joseph A
Harvey, Matt
Rzepa, Henry S
Publication Date
2008-06
Publisher
11th International Symposium on Electronic Theses and Dissertations
Language
English
Type
Article
Citation
Morgan, P., Downing, J., Murray-Rust, P., Stewart, D., Tonge, A., Townsend, J. A., Harvey, M., & Rzepa, H. S. (2008). Extracting and re-using research data from chemistry e-theses: the SPECTRa-T project. http://www.dspace.cam.ac.uk/handle/1810/230116
Abstract
Scientific e-theses are data-rich resources, but much of the information they contain is not readily accessible. For chemistry, the SPECTRa-T project has addressed this problem by developing data-mining techniques to extract experimental data, creating RDF (Resource Description Framework) triples for exposure to sophisticated Semantic Web searches.
We used OSCAR3, an Open Source chemistry text-mining tool, to parse and extract data from theses in PDF, and from theses in Office Open XML document format.
Theses in PDF suffered data corruption and a loss of formatting that prevented the identification of chemical objects. Theses in .docx yielded semantically rich SciXML that enabled the additional extraction of associated data. Chemical objects were placed in a data repository, and RDF triples deposited in a triplestore.
Data-mining from chemistry e-theses is both desirable and feasible, but the use of PDF, the de facto standard format for deposit in most repositories, prevents optimal extraction of data for semantic querying. To facilitate such extraction, we recommend that universities also require deposition of chemistry e-theses in an XML document format. Further work is required to clarify the complex intellectual property rights (IPR) issues and to ensure that they do not become an unwarranted barrier to data extraction and re-use.
Keywords
e-theses, chemistry research, text-mining, Open Data
Identifiers
This record's URL: http://www.dspace.cam.ac.uk/handle/1810/230116