
dc.contributor.author: Morgan, Peter
dc.contributor.author: Downing, Jim
dc.contributor.author: Murray-Rust, Peter
dc.contributor.author: Stewart, Diana
dc.contributor.author: Tonge, Alan
dc.contributor.author: Townsend, Joseph A
dc.contributor.author: Harvey, Matt
dc.contributor.author: Rzepa, Henry S
dc.date.accessioned: 2011-02-14T15:13:08Z
dc.date.available: 2011-02-14T15:13:08Z
dc.date.issued: 2008-06
dc.identifier.citation: Morgan, P., Downing, J., Murray-Rust, P., Stewart, D., Tonge, A., Townsend, J., Harvey, M., Rzepa, H., 2008. Extracting and re-using research data from chemistry e-theses: the SPECTRa-T Project. In: 11th International Symposium on Electronic Theses and Dissertations, "ETD 2008: Spreading the Light", Robert Gordon University, Aberdeen, 4-7 June 2008. 9 pp.
dc.identifier.uri: http://www.dspace.cam.ac.uk/handle/1810/230116
dc.description.abstract: Scientific e-theses are data-rich resources, but much of the information they contain is not readily accessible. For chemistry, the SPECTRa-T project has addressed this problem by developing data-mining techniques to extract experimental data, creating RDF (Resource Description Framework) triples for exposure to sophisticated Semantic Web searches. We used OSCAR3, an Open Source chemistry text-mining tool, to parse and extract data from theses in PDF, and from theses in Office Open XML document format. Theses in PDF suffered data corruption and a loss of formatting that prevented the identification of chemical objects. Theses in .docx yielded semantically rich SciXML that enabled the additional extraction of associated data. Chemical objects were placed in a data repository, and RDF triples deposited in a triplestore. Data-mining from chemistry e-theses is both desirable and feasible; but the use of PDF, the de facto format standard for deposit in most repositories, prevents the optimal extraction of data for semantic querying. In order to facilitate this, we recommend that universities also require deposition of chemistry e-theses in an XML document format. Further work is required to clarify the complex IPR issues and ensure that they do not become an unwarranted barrier to data extraction and re-use.
dc.language.iso: en
dc.publisher: 11th International Symposium on Electronic Theses and Dissertations
dc.rights: All Rights Reserved
dc.rights.uri: https://www.rioxx.net/licenses/all-rights-reserved/
dc.subject: e-theses
dc.subject: chemistry research
dc.subject: text-mining
dc.subject: Open Data
dc.title: Extracting and re-using research data from chemistry e-theses: the SPECTRa-T project
dc.type: Article
dc.type.version: accepted version
pubs.declined: 2017-10-11T13:54:29.24+0100


