Project Documentation - Unilever Centre


Recent Submissions

  • Item (Open Access)
    SPECTRa-T / TheOREM Test Corpus
    (University of Cambridge, 2007) Day, Nick; Townsend, Joseph A
  • Item (Open Access)
    An Introduction to SPECTRa
    (2006-10) Downing, Jim
    SPECTRa is delivering tools that enable chemists to prepare and submit Open Data into DSpace Institutional Repositories
  • Item (Open Access)
    Maximum Entropy Models for Text mining from the Life Sciences Literature
    (DAMTP and Department of Chemistry, 2009-09-18) Nikolov, Nikolay
    The life sciences are currently characterized by rapid growth. With the number of publications per year in the hundreds of thousands and growing, it is becoming increasingly difficult for researchers to stay abreast of the latest developments. Thus, automated methods of analysing scientific information are growing in importance. Text mining in the life sciences aims at extracting information from textual data (usually abstracts or full texts of scientific publications, but also non-publications such as clinical histories or patents). It normally involves some kind of machine learning technique that requires training data from the given thematic domain. Our case study concerns the automatic identification of chemical named entities (e.g. compounds, reaction names) in the life science literature. We investigate the impact of data heterogeneity on the performance of Maximum Entropy Markov models and explore possible solutions to this problem. This is, to the best of our knowledge, the first study to explore thematic heterogeneity in the chemistry-related life science literature and its impact on named entity recognition. It is therefore necessarily general: its role is to collect evidence, establish basic facts and explore possible solutions. Our study suggests that genre structure is especially important for high-precision recognition. It also suggests that, for a system aiming at recall rather than precision, transferring training data from one domain to another is a useful strategy (especially for domains with smaller training datasets). Most importantly, this study provides motivation for a model that explicitly represents the thematic heterogeneity of the life science literature, and it explores possible solutions and the practical issues of implementing such a model.
  • Item (Open Access)
    Web Feeds and Repositories
    (2008-12-09T12:18:02Z) Downing, Jim
    Web feeds are an important way of adding value to repository resources. This presentation introduces web feeds, shows some examples of what can be published and consumed, and covers some technical details of using conditional GET, archived feeds, etc.
  • Item (Open Access)
    Embedding Metadata and Other Semantics In Word-Processing Documents
    (2008-12-08) Sefton, Peter; Barnes, Ian; Ward, Ron; Downing, Jim
    This paper describes a technique for embedding document metadata, and potentially other semantic references, inline in word processing documents, which the authors have implemented with the help of a software development team. Several assumptions underlie the approach: it must be available across computing platforms and work with both Microsoft Word (because of its user base) and (because of its free availability). Further, the application needs to be acceptable to and usable by users, so the initial implementation covers only a small number of features, which will be extended only after user testing. Within these constraints the system provides a mechanism not only for encoding simple metadata, but also for inferring hierarchical relationships between metadata elements from a "flat" word processing file. The paper includes links to open source code implementing the techniques as part of a broader suite of tools for academic writing. This addresses tools and software, semantic web and data curation, and integrating curation into research workflows, and will provide a platform for integrating work on ontologies, vocabularies and folksonomies into word processing tools.
  • Item (Open Access)
    Results files for organic solid-state PM6 calculations from Nick Day's PhD thesis
    (2008-07-31T05:35:20Z) Day, Nicholas E
    Results files for organic solid-state PM6 calculations from Nick Day's PhD thesis. For each calculation, there are CIF, MOP, OUT and CML files available.
  • Item (Open Access)
    Results files for inorganic solid-state PM6 calculations from Nick Day's PhD thesis
    (2008-07-31T05:30:59Z) Day, Nicholas E
    Results files for inorganic solid-state calculations using the PM6 method in MOPAC2007 in Nick Day's PhD thesis. Contains CIF, MOP, OUT and CML files for each calculation.
  • Item (Open Access)
    CrystalEye - From Desktop to Data Repository
    (2008-04-11) Downing, OJ; Day, Nicholas E; Murray-Rust, Peter
    CrystalEye is a public data system consisting of processed open crystallographic data. Its development and evolution hold some lessons for the development of other data repositories.
  • Item (Open Access)
    A preview of the TheOREM project
    (2008-04-11T09:12:49Z) Downing, O J
    This presentation was delivered at the European roll-out meeting of the Open Archives Initiative standard for Object Re-use and Exchange (ORE). It introduces an experiment, funded by the Joint Information Systems Committee (JISC), that looks to apply ORE to doctoral theses.