Semantic text classification for cancer text mining

Baker, Simon

doi:10.17863/CAM.23105

Repository URI

https://www.repository.cam.ac.uk/handle/1810/275838

Repository DOI

https://doi.org/10.17863/CAM.23105

Files

Primary Thesis (17.87 MB)

Type

Thesis

Authors

Baker, Simon

https://orcid.org/0000-0002-0998-438X

Abstract

Cancer researchers and oncologists benefit greatly from text mining major knowledge sources in biomedicine such as PubMed. Fundamentally, text mining depends on accurate text classification. In conventional natural language processing (NLP), this requires experts to annotate scientific text, which is costly and time-consuming and results in small labelled datasets. Making full use of such small datasets in turn demands extensive feature engineering and handcrafting, which is again time-consuming and is not portable between tasks and domains.

In this work, we explore emerging neural network methods to reduce the burden of feature engineering while outperforming the accuracy of conventional pipeline NLP techniques. We focus specifically on applications in the cancer domain, where we introduce two NLP classification tasks and datasets. The first task is semantic text classification according to the Hallmarks of Cancer (HoC), which enables text mining of scientific literature assisted by a taxonomy that explains the processes by which cancer starts and spreads in the body. The second task is the classification of the exposure routes by which chemicals enter the body, which may lead to exposure to carcinogens.

We present several novel contributions. We introduce two new semantic classification tasks (the hallmarks and exposure routes) at both sentence and document levels along with accompanying datasets, and implement and investigate a conventional pipeline NLP classification approach for both tasks, performing both intrinsic and extrinsic evaluation. We propose a new approach to classification using multilevel embeddings and apply this approach to several tasks; we subsequently apply deep learning methods to the task of hallmark classification and evaluate the outcome. Utilising our text classification methods, we develop two novel text mining tools targeting real-world cancer researchers. The first tool is a cancer hallmark text mining tool that identifies associations between a search query and cancer hallmarks; the second tool is a new literature-based discovery (LBD) system designed for the cancer domain. We evaluate both tools with end users (cancer researchers) and find they demonstrate good accuracy and promising potential for cancer research.

Date

2017-09-28

Advisors

Korhonen, Anna

Keywords

Cancer, Text Mining, Machine Learning, Classification, Literature-based Discovery, Hallmarks of Cancer, Deep Learning, Artificial Intelligence

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Rights and licensing

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

Sponsorship

Commonwealth Scholarship Cambridge Trust

Collections

Theses - Computer Science and Technology
