Retrieval of research-level mathematics via joint modelling of text and types

Stathopoulos, Yiannos

doi:10.17863/CAM.94992

Repository URI

https://www.repository.cam.ac.uk/handle/1810/347577

Repository DOI

https://doi.org/10.17863/CAM.94992

Files

Thesis (16.85 MB)

Type

Thesis

Authors

Stathopoulos, Yiannos

Abstract

Recent work in Mathematical Information Retrieval (MIR) emphasises the retrieval of symbolic formulae but ignores their interactions with the text. However, mathematicians think and communicate ideas that take the form of mathematical objects and structures using both modalities. An important component of the interaction between the modalities is the mathematical type. Types are technical terms that occur in the textual modality and are used to refer to mathematical ideas. When linked to the symbolic modality, types assume the role of denotations to mathematical expressions. Existing MIR retrieval models ignore this connection and treat types in the same way as text-based IR systems would: as a bag of independent words, rather than potential multi-word units. In this thesis I ask two questions: can MIR of research mathematics benefit from the recognition of types in the textual modality, and can it benefit from linking the two modalities by assigning denotations from the textual to the symbolic modality? To investigate these questions I develop a method for constructing a test collection for research-level mathematics, which relies on observing the MathOverflow online community (this aspect of my work has resulted in the CUMTC collection with 160 queries), and use machine learning to link variables in context to their type. I then develop models that jointly model the modalities by combining type-based textual retrieval with typed formula retrieval. A key idea is typed unification: matching formulae by their structure and by the types of their constituents. My experiments show that textual types in queries improve textual retrieval significantly over traditional term-based IR methods when queries are expanded with similar types from an embedding space. My best model of this class beats the best commonly available MIR model by a margin of 144% on my test collection (MAP=.173 vs. MAP=.071).
However, I was unable to prove the additional benefit of my joint typed models, although I was able to empirically describe some retrieval situations where retrieval benefits from these more complex models.

Date

2022-03-02

Advisors

Teufel, Simone

Keywords

information retrieval, machine learning, mathematical information retrieval, mathematical language processing, natural language processing

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Sponsorship

EPSRC (1345079)

EPSRC departmental grant

Collections

Theses - Computer Science and Technology
