CHIC - Converting hamburgers into cows


Type

Presentation

Authors

Townsend, JA 
Downing, J 
Murray-Rust, P 

Abstract

We have developed a methodology and workflow (CHIC) for the automatic semantification and structuring of legacy textual scientific documents. CHIC imports common document formats (PDF, DOCX and (X)HTML) and uses a number of toolkits to extract their components and convert them into SciXML. The SciXML is split into text-rich and data-rich streams, and stand-off annotation (SAF) is created for each. Embedded domain-specific objects can be converted into XML (Chemical Markup Language). The workflow streams can then be recombined and are typically converted into RDF (Resource Description Framework).
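
As an illustration only, the split, annotate and recombine steps described in the abstract might be sketched as follows. This is not the CHIC implementation: every name here (`Stream`, `Annotation`, `split_streams`, `annotate`, `to_triples`), the digit-density heuristic and the `ex:` predicates are assumptions made for the example, and plain tuples stand in for real RDF.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Stand-off annotation: points at a character span, leaves the text untouched."""
    start: int
    end: int
    label: str


@dataclass
class Stream:
    """One workflow stream; `kind` is "text" or "data" (hypothetical structure)."""
    kind: str
    blocks: list = field(default_factory=list)
    annotations: list = field(default_factory=list)


def split_streams(blocks):
    """Route each block to the text-rich or data-rich stream.

    Toy heuristic (an assumption): a block is data-rich when more than a
    quarter of its characters are digits.
    """
    text, data = Stream("text"), Stream("data")
    for block in blocks:
        digits = sum(ch.isdigit() for ch in block)
        target = data if digits > len(block) / 4 else text
        target.blocks.append(block)
    return text, data


def annotate(stream, term, label):
    """Record where `term` first occurs in each block without modifying the block."""
    for index, block in enumerate(stream.blocks):
        pos = block.find(term)
        if pos != -1:
            stream.annotations.append((index, Annotation(pos, pos + len(term), label)))


def to_triples(doc_id, streams):
    """Recombine the streams as (subject, predicate, object) tuples."""
    triples = []
    for stream in streams:
        for index, ann in stream.annotations:
            span = stream.blocks[index][ann.start:ann.end]
            triples.append((f"{doc_id}#{stream.kind}{index}", f"ex:{ann.label}", span))
    return triples


# Two invented blocks: one prose sentence and one line of spectral data.
blocks = [
    "The product was recrystallised from ethanol.",
    "1H NMR: 7.26, 3.71, 2.05",
]
text, data = split_streams(blocks)
annotate(text, "ethanol", "mentionsCompound")
annotate(data, "1H NMR", "hasSpectrum")
triples = to_triples("doc1", [text, data])
```

Keeping annotations as offsets into unmodified blocks is what makes them stand-off: each stream can be processed independently and the results merged afterwards without reconciling edited copies of the text.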

Keywords

4605 Data Management and Data Science, 46 Information and Computing Sciences

Publisher

IEEE