Retrieval of research-level mathematics via joint modelling of text and types
Recent work in Mathematical Information Retrieval (MIR) emphasises the retrieval of symbolic formulae but ignores their interactions with the text. However, mathematicians think about and communicate ideas that take the form of mathematical objects and structures using both modalities. An important component of the interaction between the modalities is the mathematical type. Types are technical terms that occur in the textual modality and are used to refer to mathematical ideas. When linked to the symbolic modality, types assume the role of denotations of mathematical expressions. Existing MIR models ignore this connection and treat types in the same way as text-based IR systems would: as a bag of independent words, rather than as potential multi-word units. In this thesis I ask two questions: can MIR of research mathematics benefit from the recognition of types in the textual modality, and can it benefit from linking the two modalities by assigning denotations from the textual to the symbolic modality? To investigate these questions, I develop a method for constructing a test collection for research-level mathematics, which relies on observing the MathOverflow online community (this aspect of my work has resulted in the CUMTC collection with 160 queries), and I use machine learning to link variables in context to their types. I then develop models that jointly model the modalities by combining type-based textual retrieval with typed formula retrieval. A key idea is typed unification: matching formulae by their structure and by the types of their constituents. My experiments show that textual types in queries improve textual retrieval significantly over traditional term-based IR methods when queries are expanded with similar types from an embedding space. My best model of this class beats the best commonly available MIR model by a margin of 144% on my test collection (MAP=.173 vs. MAP=.071).
However, I was unable to demonstrate an additional benefit from my joint typed models, although I was able to describe empirically some retrieval situations in which retrieval benefits from these more complex models.
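To make the idea of typed unification concrete, the following is a minimal illustrative sketch, not the implementation used in the thesis. It assumes a simplified setting: formulae are small operator trees, query variables act as pattern variables, matching is one-way (query against candidate), and two types agree only if their names are identical. The names `Var`, `App` and `typed_unify` are hypothetical and chosen for this example only.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Var:
    """A variable, optionally annotated with a textual type (assumed representation)."""
    name: str
    type: Optional[str] = None


@dataclass(frozen=True)
class App:
    """An operator applied to a tuple of argument terms."""
    op: str
    args: tuple


Term = Union[Var, App]


def typed_unify(query: Term, cand: Term, subst: Optional[dict] = None) -> Optional[dict]:
    """One-way matching of a query formula against a candidate formula.

    Returns a mapping from query variable names to candidate subterms,
    or None if the structures or the types disagree.
    """
    subst = dict(subst or {})
    if isinstance(query, Var):
        # Type check: a typed query variable only matches a variable of the same type.
        if query.type is not None:
            if not isinstance(cand, Var) or cand.type != query.type:
                return None
        # Consistency check: a repeated query variable must bind to the same subterm.
        if query.name in subst:
            return subst if subst[query.name] == cand else None
        subst[query.name] = cand
        return subst
    # Structural check: same operator and same number of arguments.
    if not isinstance(cand, App) or query.op != cand.op or len(query.args) != len(cand.args):
        return None
    for q_arg, c_arg in zip(query.args, cand.args):
        subst = typed_unify(q_arg, c_arg, subst)
        if subst is None:
            return None
    return subst


# Query "A / B" where A is a group and B is a normal subgroup.
query = App("/", (Var("A", "group"), Var("B", "normal subgroup")))
# Two candidates with identical structure but different types.
quotient_group = App("/", (Var("G", "group"), Var("N", "normal subgroup")))
quotient_ring = App("/", (Var("R", "ring"), Var("I", "ideal")))

group_match = typed_unify(query, quotient_group)  # a substitution
ring_match = typed_unify(query, quotient_ring)    # None: the types differ
```

In this sketch the two candidates are structurally indistinguishable, so purely structural matching would accept both; only the type annotations separate the quotient group from the quotient ring. A practical system would presumably need a softer notion of type agreement than exact string equality, which is beyond what this example attempts.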