
Automatic data interpretation from scientific literatures in the portable document format with information-extraction tools for the advancement of materials discovery





Zhu, Miao 


This thesis addresses the development of scientific literature mining software tools that automatically extract metadata, text, images, and chemistry information from scientific literature in PDF (portable document format), a format widely used in academic and scientific communities, in order to accelerate data-driven materials discovery. The nature of PDF files presents specific challenges for data extraction. A PDF file is not designed to be edited or to have its data interpreted by software, and it usually provides no semantic tags. This creates a barrier to efficiently accessing and utilising the wealth of information that such files contain, especially in domains like materials science and chemistry where data extraction and analysis are crucial. In the materials discovery domain, efficiently accessing and analysing chemical and property data is of paramount importance. More crucially, the relationships between these data points need to be understood and exploited, since identifying correlations between different materials and their properties can lead to significant scientific breakthroughs. By addressing these challenges, this thesis contributes to enabling data science in areas like data-driven materials discovery. It demonstrates how extracting and analysing data from scientific literature can provide new insights and accelerate the process of discovery in this field. The software tools developed as part of this thesis represent significant technical innovations: they are designed to navigate the complexities of PDF files and extract valuable data with precision. This makes the tools not only valuable for researchers in materials science but also potentially applicable in other scientific domains where data extraction from literature is a key part of the research process.

Chapter 1 discusses the fundamentals and history of the portable document format and the current literature on data extraction from PDF files for materials discovery.

Chapter 2 introduces relevant methodologies and frameworks used throughout the thesis.

Chapter 3 introduces PDFDataExtractor, a highly automated data and information extraction toolkit that detects the layout of scientific articles in portable document format. From the detected layout, semantic information is extracted and interpreted to reconstruct the logical structure of each article in a machine-readable format. The tool was tested on a self-created evaluation set, on which key metadata are extracted with nearly 60% precision.

Chapter 4 describes a complete workflow for extracting images, schemes, and figures, together with their corresponding captions, from scientific literature and supplementary information in portable document format. This workflow also extracts chemical and property data by channelling its results to ImageDataExtractor1, a toolkit that extracts quantitative data from microscopy images.

Chapter 5 details a bespoke workflow for extracting metadata, text, images, chemistry information, and property data from the physics literature.

Chapter 6 concludes the work and discusses future research opportunities.





Cole, Jacqueline




Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge