Automatic induction of verb classes using clustering

Lin Sun

University of Cambridge Computer Laboratory
Girton College

December 2012

This dissertation is submitted for the degree of Doctor of Philosophy

Declaration

This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text. This dissertation does not exceed the regulation length of 60 000 words, including tables and footnotes.

Automatic induction of verb classes using clustering
Lin Sun

Summary

Verb classifications have attracted a great deal of interest in both linguistics and natural language processing (NLP). They have proved useful for important tasks and applications, including e.g. computational lexicography, parsing, word sense disambiguation, semantic role labelling, information extraction, question-answering, and machine translation (Swier and Stevenson, 2004; Dang, 2004; Shi and Mihalcea, 2005; Kipper et al., 2008; Zapirain et al., 2008; Rios et al., 2011). Particularly useful are classes which capture generalizations about a range of linguistic properties (e.g. lexical, (morpho-)syntactic, semantic), such as those proposed by Beth Levin (1993). However, full exploitation of such classes in real-world tasks has been limited because no comprehensive or domain-specific lexical classification is available.

This thesis investigates how Levin-style lexical semantic classes could be learned automatically from corpus data. Automatic acquisition is cost-effective when it involves either no or minimal supervision, and it can be applied to any domain of interest where adequate corpus data is available. We improve on earlier work on automatic verb clustering. We introduce new features and new clustering methods to improve the accuracy and coverage. We evaluate our methods and features on well-established cross-domain datasets in English, on a specific domain of English (the biomedical domain) and on another language (French), reporting promising results. Finally, our task-based evaluation demonstrates that the automatically acquired lexical classes enable new approaches to some NLP tasks (e.g. metaphor identification) and help to improve the accuracy of existing ones (e.g. argumentative zoning).

Acknowledgements

I wish to thank my supervisor, Anna Korhonen. She patiently provided the vision, encouragement and advice necessary for me to proceed with my doctoral program and complete my dissertation. I want to thank Anna for her unflagging encouragement and for serving as a role model to me as a member of academia. She has always given me great freedom to pursue independent work.

I am grateful to Yuval Krymolowski and Thierry Poibeau for their practical support and advice. Yuval made various datasets available and answered my endless queries when I started my research. Thierry Poibeau provided valuable advice on French verb clustering and constructed the gold standard for me to start experimenting. In addition, I would like to thank Cédric Messiant for making the French subcategorization lexicon available. I would like to thank Ekaterina Shutova and Yufan Guo for collaboration on task-based evaluations. Special thanks to Ilona Silins, Ulla Stenius, Johan Hogberg, Diarmuid Ó Séaghdha and Ian Lewin for a great collaboration on cancer risk assessment.

I would also like to thank my examiners, Simone Teufel and Diana McCarthy, who provided constructive feedback and detailed comments on this thesis.
I am grateful to Girton College and my college supervisor Frances Gandy. The financial support from British Telecom and the EPSRC is gratefully acknowledged.

I wish to thank my wife Yu Zhang and my parents for their continuous encouragement. I would like to thank my friends Yichi Zhang, Shuguang Li, Yixi Sui, Likan Fu and Hui Li for their support and understanding, and for the extra burden they have had to shoulder.

Contents

1 Introduction 11
1.1 Automatic lexical acquisition 12
1.2 The role of a verb in a sentence 12
1.3 Verb classification 13
1.4 Our contributions 14
1.4.1 Clustering methods 15
1.4.2 Features 16
1.4.3 Language and domain transferability 16
1.4.4 Task-based evaluation 16
1.5 Overview of subsequent chapters 17
2 Background to verb classification 19
2.1 Verb resources in NLP 19
2.2 Levin's verb classes 20
2.2.1 VerbNet 23
2.2.2 Levin's verb classification and NLP applications 23
2.3 Automatic verb classification 24
2.3.1 Classification methods and performance 25
2.3.2 Features and feature extraction 28
2.3.3 Evaluation measures 35
2.4 The current challenges 38
2.4.1 Semantic information 38
2.4.2 Hierarchical classification 39
2.4.3 Task-based evaluation 39
2.4.4 Polysemy 39
3 Verb clustering using selectional preferences 41
3.1 Introduction 41
3.1.1 Feature extraction 42
3.1.2 Feature sets 43
3.2 Clustering methods 44
3.2.1 Pairwise clustering 45
3.2.2 Spectral clustering 45
3.3 Experimental evaluation 47
3.3.1 Test sets 47
3.3.2 Data processing 48
3.3.3 Evaluation measures 49
3.3.4 Statistical significance test 49
3.4 Results 50
3.4.1 Quantitative evaluation 50
3.4.2 Qualitative evaluation 52
3.5 Discussion and related work 54
3.6 Summary 56
4 Hierarchical verb clustering using graph factorization 57
4.1 Introduction 57
4.2 Target classification and test sets 58
4.3 Method 59
4.3.1 Features and feature extraction 59
4.3.2 Clustering 59
4.4 Experimental evaluation 65
4.4.1 Evaluation methods 65
4.4.2 Quantitative evaluation 65
4.4.3 Qualitative evaluation 67
4.5 Discussion and conclusion 68
5 Verb classification in the biomedical domain 71
5.1 Introduction 71
5.2 Classification methods 73
5.3 Data 73
5.3.1 Test Verbs and Gold Standard 73
5.3.2 Test Data 73
5.4 Features 75
5.5 Experimental evaluation 75
5.5.1 Experimental settings 75
5.5.2 Measures 76
5.5.3 Results 76
5.6 Summary 80
5.7 Future Work 80
6 Cross-linguistic potential of verb classification 82
6.1 Introduction 82
6.2 French verb classes and the gold standard 83
6.3 Verb clustering 85
6.3.1 Data: the LexSchem lexicon 85
6.4 Features 86
6.5 Clustering methods 87
6.6 Experimental evaluation 88
6.6.1 Data and pre-processing 88
6.6.2 Evaluation measures 88
6.7 Evaluation 88
6.7.1 Quantitative evaluation 88
6.7.2 Qualitative evaluation 91
6.8 Discussion and conclusion 92
7 Task-based evaluation of verb classification 95
7.1 Use of verb and noun classification in metaphor identification 95
7.1.1 Introduction 95
7.1.2 Data 97
7.1.3 Method 97
7.1.4 Evaluation and discussion 99
7.1.5 Conclusion 100
7.2 Use of verb classes in argumentative zoning 100
7.2.1 Introduction 100
7.2.2 Data and experiments 101
7.2.3 Summary 102
7.3 Release of large-scale verb classifications 103
7.4 Summary 104
8 Conclusions 105
8.1 Contributions of this thesis 105
8.1.1 Clustering methods 105
8.1.2 Semantic features 106
8.1.3 Cross-lingual and cross-domain study 106
8.1.4 Task-based evaluation 107
8.2 Directions for future research 107
8.2.1 Diathesis alternation as a new feature for verb classification 108
8.2.2 Partial membership model for verb polysemy 113
8.2.3 Data requirement for reliable clustering 115
8.2.4 Further task and application based evaluations 115
A Features used in this thesis 117
B Similarity measures used in this thesis 118
C A proof of the impact of diathesis alternation features on the kernel 119
D The list of verbs in T1 and T3 120

List of Abbreviations

ACC Accuracy
AGG Agglomerative Clustering
ANLT Alvey Natural Language Tools
ASSCI A Subcategorization Frame Acquisition System for French Verbs
AZ Argumentative Zoning
BL Baseline
BNC The British National Corpus
CO Co-occurrence
COMLEX COMLEX Syntax Dictionary
DA Diathesis Alternation
EM Expectation-Maximisation Algorithm
F F-Measure
GENIA The GENIA project
GR Grammatical Relation
HGFC Hierarchical Graph Factorization Clustering
IGNG Incremental Growing Neural Gas
JSD Jensen–Shannon divergence
K-Means K-Means Clustering Algorithm
LexSchem LexSchem Subcategorization Lexicon for French Verbs
LP Lexical Preference
MDL Minimum Description Length
MNCut Multiway Normalized Cut Spectral Clustering
mPUR Modified Purity
MRD Machine Readable Dictionary
NLP Natural Language Processing
NMI Normalized Mutual Information
NP Noun Phrase
OBJ Object
P Precision
PC Pairwise Clustering
POS Part-of-Speech
PP Prepositional Phrase
R Recall
Radj Adjusted Rand Index
RASP Robust Accurate Statistical Parsing System
SCF Subcategorization Frame
SP Selectional Preference
SPEC Spectral Clustering
SUBJ Subject
VALEX VALEX Subcategorization Lexicon for English Verbs
VN VerbNet
VP Verb Phrase

Chapter 1

Introduction

Verb classifications have attracted a great deal of interest in both linguistics and natural language processing (NLP). Verb classes which capture generalizations about a wide range of linguistic properties (e.g. lexical, (morpho-)syntactic, semantic), such as those proposed by Beth Levin (1993), have been of particular interest for NLP. They have proved useful for various important tasks and applications, including e.g.
computational lexicography, parsing, word sense disambiguation, semantic role labelling, information extraction, question-answering, and machine translation (Swier and Stevenson, 2004; Dang, 2004; Shi and Mihalcea, 2005; Kipper et al., 2008; Zapirain et al., 2008; Rios et al., 2011). However, full exploitation of such classes in real-world tasks has been limited because no comprehensive or domain-specific lexical classification is available.

This thesis investigates how Levin-style classes could be learned automatically from corpus data. Unsupervised or minimally supervised acquisition is cost-effective and can be applied to any domain of interest provided adequate corpus data is available. We improve on earlier work on automatic verb clustering. We introduce new features and new clustering methods to improve the accuracy and coverage. We evaluate our methods and features on general English, a specific domain (the biomedical domain) and another language (French), as well as in the context of NLP tasks, reporting promising results. Our task-based evaluation demonstrates that the automatically acquired lexical classes enable new approaches to some NLP tasks (e.g. metaphor identification) and help to improve the accuracy of existing ones (e.g. argumentative zoning).

This introductory chapter discusses the need for automatic lexical acquisition and the approaches proposed so far (section 1.1). It then discusses the central role of a verb in the syntactic structure and the semantics of a sentence (section 1.2). Section 1.3 gives an overview of the linguistic theory behind Levin-style classification, describes its importance for NLP, and discusses the automatic acquisition of verb classifications from corpus data. Section 1.4 summarizes the contributions of our work and section 1.5 provides an overview of the organization of this thesis.

1.1 Automatic lexical acquisition

The lexicon plays a central role in NLP. Rich lexical knowledge bases are an important component of NLP applications (e.g. Word-Sense Disambiguation and Information Extraction).

Building large and sufficiently explicit lexicons for NLP use has proved difficult. Manual construction of a large-scale lexicon involves many years of lexicographic work. The use of computers has enabled automating parts of the task. Much of the early work on computational lexicography exploited information in machine readable dictionaries (MRDs). Since MRDs were constructed primarily for human use, the conversion to a satisfactory computational lexicon proved difficult (Tuells, 1997). Also, it turned out that manually built lexicons have gaps and inconsistencies which are difficult to detect automatically (e.g. Boguraev and Briscoe (1987)). It is also costly to extend such resources to cover new information (e.g. statistical information).

Researchers therefore started to acquire lexical information automatically from corpus data. This automatic approach can solve the above-mentioned problems: it can achieve good coverage, it can gather statistical information, and it can easily be applied to sub-languages and domains. Automatic methods have been developed for many areas of the lexicon, e.g.
subcategorization frames (SCFs) (Briscoe and Carroll, 1997; Korhonen, 2002; Messiant et al., 2008; Lippincott et al., 2012), selectional preferences (SPs) (Brockmann and Lapata, 2003; Erk, 2007; Bergsma et al., 2008; Ó Séaghdha, 2010; Ó Séaghdha and Korhonen, 2012), diathesis alternations (Lapata, 1999; McCarthy and Korhonen, 1998) and word sense induction (Pantel and Lin, 2002; Navigli and Crisafulli, 2010; Ponzetto and Navigli, 2010; Lau et al., 2012). Many approaches have aimed to minimise data annotation and employed semi-supervised or unsupervised methods. Many current methods need further refinement, but resources produced by some methods have already proved useful for NLP tasks and applications, e.g. question answering (Lin and Pantel, 2001), unsupervised POS tagging (Clark and Tim, 2003) and text entailment (Zhang et al., 2010).

1.2 The role of a verb in a sentence

The verb is central to the syntax and semantics of a sentence. The main verb of a sentence determines the number and the role of participants in the event the sentence is describing. For example, the verb give may select as its subject an entity that is able to give (AGENT), as its object an entity which is being given (THEME), and as another object an entity which is being given to (RECIPIENT). From the syntactic point of view, these requirements can be realized as a dative construction, e.g. John gives an apple to Lucy, or a double object construction, e.g. John gives Lucy an apple.

When constructing a sentence, the speaker will find the participants that are compatible with the selectional restrictions or preferences of the verb in question, and will compose the sentence according to the syntactic mapping rules. Therefore, verbs play a key role in the meaning and the structure of sentences.

1.3 Verb classification

Verbs can be classified according to their syntax (Boguraev et al., 1987; Grishman et al., 1994), semantics (Fellbaum, 1998) or other properties. A number of linguists (Pinker, 1989; Jackendoff, 1990; Levin, 1993) have shown that verbs which share commonality in meaning often also share commonality in their (morpho-)syntactic behaviour, and can be grouped into lexical classes according to a wider range of linguistic properties. Such classes (often called lexical-semantic classes) generalize over a range of linguistic properties of verbs without defining the idiosyncratic details for each verb. For example, the Levin class of COOK verbs includes the verbs cook, bake, boil, roast and heat, which share similar meaning and the following syntactic frames and alternations between frames (Levin, 1993):

• Causative/Inchoative Alternation
1. Jennifer baked the potatoes.
2. The potatoes baked.

• Middle Alternation
1. Jennifer baked Idaho potatoes.
2. Idaho potatoes bake beautifully.

• Instrument Subject Alternation
1. Jennifer baked the potatoes in the oven.
2. This oven bakes potatoes well.

The largest and the most widely deployed English verb classification in NLP is the classification of Levin (1993). This classification provides a summary of the variety of theoretical research done on lexical-semantic verb classification over the past decades. Verbs which display the same or a similar set of diathesis alternations are assumed to share certain meaning components and are organized into a semantically coherent class.
Although alternations are chosen as the primary means for identifying verb classes, additional properties related to subcategorization, morphology and extended meanings of verbs are taken into account as well. VerbNet (Kipper-Schuler, 2005)1 – an extensive on-line lexicon for English verbs – provides detailed syntactic-semantic descriptions of Levin's classes as well as additional classes organized into a refined taxonomy. The resulting taxonomy classifies 6272 verbs into 270 first-level classes.

1See http://verbs.colorado.edu/verb-index/index.php for details.

Levin-style verb classes are interesting for NLP because they can help to reduce the redundancy in the lexicon (since verbs in a class share similar properties). They can also alleviate the problem of data sparseness which affects many NLP tasks, by predicting the properties of member verbs when not enough empirical evidence is available. VerbNet classes have been used to help tasks such as parsing, word sense disambiguation, semantic role labelling, information extraction, question-answering, and machine translation (Swier and Stevenson, 2004; Dang, 2004; Shi and Mihalcea, 2005; Zapirain et al., 2008; Rios et al., 2011).

Such verb classifications could be particularly useful for specific domains, one of which is the biomedical domain. According to Ananiadou and McNaught (2005), current manually built lexical resources (e.g. the UMLS specialist lexicon (Browne et al., 2003)) do not provide enough coverage of domain-specific language properties, e.g. word usage and word relations.

Because verb classes are sensitive to meaning variations between different text types and domains, and manual classification of large numbers of verbs is not practical, automatic verb classification has received a considerable amount of interest (Schulte im Walde, 2006; Joanis et al., 2007; Sun et al., 2008b; Li and Brew, 2008; Korhonen et al., 2008; Ó Séaghdha and Copestake, 2008; Vlachos et al., 2009b). Automatic classification is not only cost-effective but also gathers important statistical information from data, and it can easily be applied to new domains and usage patterns. Like manual classification, automatic classification is based on the assumption that the shared semantic behaviour of verbs can be largely inferred from their shared (morpho-)syntactic behaviour.

The methods proposed so far use machine learning techniques to classify syntactic and semantic features extracted from corpus data using part-of-speech (POS) tagging or statistical parsing techniques. Both supervised and unsupervised approaches have been proposed. Primarily unsupervised approaches are particularly interesting because they are easy to port between different tasks and domains. However, the accuracy of such approaches shows room for improvement. Also, because most approaches have been evaluated on general language, on one language only, and against pre-determined gold standards, we do not know how useful they are for practical tasks.

1.4 Our contributions

The aim of this thesis is to improve the accuracy of automatic, primarily unsupervised verb classification and to evaluate the approach not only against well-established gold standards for English, but also on a different domain and language, and in the context of NLP tasks. Our contributions include:

• Introducing two clustering methods, MNCut spectral clustering (SPEC) and hierarchical graph factorization clustering (HGFC), to the field of NLP.
The methods outperform the best-performing methods in previous work.

• Exploring a wide range of features, including selectional preference (SP) features. Verb clustering performance is improved by using the SP features.

• Investigating whether the methods and features developed for general English can be applied to a new language (French) and a new domain (biomedicine). The results show that both the methods and the features transfer well to French and to the biomedical domain.

• Performing two task-based evaluations of the verb clusters. To the author's knowledge, these are the first task-based evaluations of automatically acquired verb clusters. The results demonstrate that the automatically acquired verb clusters can be very useful for NLP applications.

The following subsections describe these contributions in detail.

1.4.1 Clustering methods

We introduce two novel clustering methods – a new variation of spectral clustering and HGFC – which are new not only to verb classification but also to the field of NLP. SPEC has previously been used for a verb classification task (Brew and Schulte im Walde, 2002), but we use a new version (Meila and Shi, 2001) which consistently outperforms other previous methods in our studies. We also introduce a method to automatically detect the number of clusters for SPEC (Zelnik-Manor and Perona, 2004). In addition, we introduce HGFC (Yu et al., 2006) as a hierarchical verb clustering method. Hierarchical verb clustering is important as all the existing manually built classifications are hierarchical in nature (e.g. VerbNet). All previous hierarchical verb clustering experiments (Schulte im Walde and Brew, 2001; Stevenson and Joanis, 2003; Ferrer, 2004) have used linkage-based clustering methods. We address two problems with the linkage method: 1) error propagation (i.e. when a verb is misclassified at one level, the error propagates to all the upper/lower levels) and 2) local pairwise merging (only two clusters can be combined at any level). HGFC solves these two problems and shows improved performance on small and large-scale verb classification tasks. The method can also be modified to integrate prior knowledge about the task in the form of soft constraints. In other words, HGFC can be adjusted to integrate semi-supervision in situations where some gold standard data is available.

1.4.2 Features

We explore a wide range of features in our studies, focussing in particular on SP features. Previous works have used SPs acquired from WordNet or GermaNet and have reported that the feature offers no significant improvement over syntactic features (Schulte im Walde, 2006; Joanis, 2002). This contradicts manual verb classification, where selectional preferences are often discussed. Using new clustering and SP acquisition methods, we demonstrate that such semantic information can be very useful. The performance is greatly improved over the use of purely syntactic features.

1.4.3 Language and domain transferability

We investigate whether a verb clustering approach initially developed using general English can be transferred to a new language and to a new domain. For many languages there is no Levin-style verb classification. We conduct experiments on French using SPEC and features similar to those in our work on English. The result confirms that the clustering techniques and features can be applied to a new language without substantial changes.
However, to obtain better results, a larger corpus and improved language-specific NLP tools (e.g. tagger, parser) would be ideal. We also apply our methods (SPEC and HGFC) and features to the biomedical domain. We perform experiments on Korhonen et al. (2008)'s dataset. The results show that our clustering outperforms previous results on the same data using similar features. The SP feature is the best feature in all the experiments, demonstrating that our methods can achieve good performance in a new domain. Domain-specific features could be tried to further improve performance.

1.4.4 Task-based evaluation

VerbNet has proved useful for many practical NLP tasks (e.g. Swier and Stevenson, 2004; Dang, 2004; Shi and Mihalcea, 2005; Zapirain et al., 2008). Automatically acquired classifications, however, have not previously been applied to any NLP task. In collaboration with other researchers, we apply our clustering results to two tasks: metaphor identification and argumentative zoning. In the metaphor identification task, automatically obtained noun and verb classes are used to detect the source and target concept domains. The system is the first of its kind and is capable of identifying metaphorical expressions with a high precision (0.79). The coverage is also better than that of the WordNet baseline. In the argumentative zoning task, verb classes are used as an additional feature for detecting the categories of scientific discourse (e.g. Introduction, Background and Conclusion) in biomedical abstracts. The results show that this feature improves performance over the raw verb feature. In fact, it makes the raw verb feature redundant: adding the raw verb feature on top of the verb class feature actually decreases the performance.

1.5 Overview of subsequent chapters

The chapters of this thesis are organized as follows:

Chapter 2 (Background to verb classification) introduces the background and motivates our work. We discuss current verb resources and the benefits of Levin-style verb classification, including its use in NLP applications. We survey approaches to automatic verb classification, discussing the data, features and methods used in each approach. We finally identify the problems which need to be addressed in current approaches and define the scope of our work.

Chapter 3 (Verb clustering using selectional preferences) examines the role of semantic features (SPs) for verb clustering. We use a clustering method – MNCut SPEC – which is new to NLP and particularly useful for handling high-dimensional features. We also introduce a method for SP acquisition which is based on the same clustering algorithm. Using this approach, we show on two well-established test sets that automatically acquired SPs can be highly useful for verb clustering. This result contrasts with most previous works but is in line with theoretical work on verb classification, which relies not only on syntactic but also on semantic features (Levin, 1993).

Chapter 4 (Hierarchical verb clustering using graph factorization) investigates hierarchical clustering of verbs. Most previous research on verb clustering has focused on acquiring flat classifications, although many manually built classifications are taxonomic in nature. We introduce a new graph-based method – HGFC – for hierarchical verb clustering which avoids some of the problems reported with the frequently used agglomerative method (AGG)2.
We modify HGFC so that it can automatically determine the tree structure for clustering, and propose two extensions to it. The first involves automatically determining the number of clusters to be produced. The second involves adding soft constraints to guide the clustering, useful for domains where some prior classification is available. The results are promising. On a flat test set, the unconstrained version of HGFC outperforms AGG and performs similarly to the best current flat clustering method (SPEC). On the hierarchical test sets, the unconstrained and constrained versions of HGFC clearly outperform AGG at all levels. The constrained version of HGFC detects the missing hierarchy from the existing gold standards with high accuracy. When the number of clusters and levels is learned automatically, the unconstrained method produces a fairly accurate multi-level hierarchy. Finally, our qualitative evaluation shows that both constrained and unconstrained versions of HGFC are capable of learning valuable novel information not included in gold standards.

2We also use the name linkage method in this thesis.

Chapter 5 (Verb classification in the biomedical domain) explores the domain-specific application of verb clustering methods developed using general, cross-domain data. We apply our two clustering methods (SPEC and HGFC) to a three-level hierarchical gold standard which consists of verbs in biomedical texts. We use the same features as in Korhonen et al. (2008), and extract the features from biomedical journals. We demonstrate that SPEC outperforms the previous best method (PC) on all three levels. We then use HGFC to produce a three-level hierarchy. We show that the resulting hierarchy is more accurate than the three-level flat clustering produced by PC. We conclude that both clustering methods developed for the general domain can be applied to the biomedical domain without significant change.

Chapter 6 (Cross-linguistic potential of verb classification) investigates the cross-linguistic potential of Levin-style verb clustering. We apply the SPEC clustering method developed for general English to French. We develop our initial gold standard based on the translation of a widely used English gold standard (Sun et al., 2008b). The SCF features are acquired from a French subcategorization lexicon – LexSchem (Messiant et al., 2008). The results show that not only the general methodology but also the best-performing features are transferable between the languages, making it possible to learn useful VerbNet-style classes for French automatically without language-specific tuning.

Chapter 7 (Task-based evaluation of verb classification) evaluates the use of verb clustering in two NLP tasks: metaphor identification and argumentative zoning.3 For metaphor identification, the verb and noun clusters enable a novel approach for unrestricted text. Starting from a limited set of metaphorical seeds, we use noun and verb clustering to harvest the target concepts (noun clusters) associated with the same source domain (verb clusters). A selectional strength filter is used to filter out the verbs that are not prone to metaphoricity. Finally, we show that the system is capable of capturing the regularities behind metaphor production and of annotating a wider range of previously unseen metaphors. In the argumentative zoning task, verb classes are used as an additional feature to identify the argumentative discourse categories in scientific abstracts.
The verb class feature greatly improves the classification accuracy.

Chapter 8 (Conclusion) summarises the contributions of our work and proposes directions for future research.

3These are collaborative works with Ekaterina Shutova and Yufan Guo. Our contribution is to provide the verb and noun clusters.

Chapter 2

Background to verb classification

2.1 Verb resources in NLP

Verbs are central to the syntactic structure and the meaning of sentences. Many computational resources and classifications have been developed for verbs. They can be classified into three types:

Syntactic resources: The Comlex (Grishman et al., 1994) and ANLT (Boguraev et al., 1987) dictionaries are examples of syntactic verb resources. These dictionaries are mainly manually developed. A verbal entry comprises verb forms and subcategorization information.

Semantic resources: FrameNet (Baker et al., 1998), PropBank (Kingsbury and Palmer, 2002) and WordNet (Miller, 1995) are mainly semantic resources of English verbs. FrameNet groups verbs according to conceptual structures (frames) and their combinatorial patterns. PropBank provides information about the predicate-argument structures of verbal predicates. The core arguments of each verb are numbered; for a given verb, the argument with the same number always has the same semantic meaning across different syntactic frames. WordNet groups words into synsets (synonym sets), and records the semantic relations between synsets. These resources contain little syntactic information, or the syntactic information is bound to the semantics of individual verbs (e.g. PropBank and WordNet).

Syntactic-semantic resources: In Levin (1993)'s verb classification, verbs are grouped in terms of shared meaning components, similar (morpho-)syntactic behaviour and a number of other properties. VerbNet (Kipper-Schuler, 2005) extends this classification with a large number of additional verbs and classes.

For many NLP applications, a Levin-style verb classification is useful for its ability to capture generalizations about a range of linguistic properties. For example, in the semantic role labelling task, Zapirain et al. (2008) discovered that a system trained on PropBank was too focussed on verb-specific knowledge and that VerbNet classes provided a better starting point. WordNet, in turn, has been criticized for a lack of generalization: for example, Palmer (2000) argues that WordNet's level of sense distinction is too fine-grained for a computational lexicon. For example, the word lose has the sense fail to keep or to maintain: "She lost her purse when she left it unattended on her seat", and the sense miss from one's possessions: "I have lost my glasses." In both cases, the word lose takes an animate subject and a solid thing as object. The sense distinction comes from a future event specifying whether the object can be found, which is a very fine-grained distinction for a computational lexicon. In contrast, Levin's classification is more consistent with the ideal sense distinction criteria proposed by Palmer:

• different predicate argument structures
• different semantic class constraints on verb arguments
• different lexical co-occurrences, such as prepositions

We will discuss the benefits of Levin's classes further in section 2.3, after first introducing Levin's classification.

2.2 Levin's verb classes

Different subcategories of verbs impose different constraints on the number and the type of arguments they take.
This syntactic variation can be captured by frames called subcategorization frames (SCFs). For example, the verb eat can take a simple NP frame which consists of one subject and one direct object (Example 1).

Example 1 I ate an apple.

On the other hand, the verb put cannot take this frame (Example 2). It requires at least three syntactic arguments: a subject, a direct object and an indirect object, as shown in the SCF NP-PP in Example 3.

Example 2 *I put the apple.1

Example 3 I put the apple on the table.

1An asterisk (*) indicates that the form or construction is not found in natural language.

The semantics of a verb is said to determine its syntactic behaviour. To demonstrate this inter-dependency, Hale and Keyser (1987) presented an example showing that a verb's semantics is a key to its syntactic behaviour. Suppose we do not know the meaning of the archaic English word 'gally', as in The sailor gallied the whales. Some people might suggest that the word means 'see', as in The sailor saw the whales, while others might suggest that it means 'frighten', as in The sailor frightened the whales. At this point, we do not know which meaning is correct. To solve this problem, Hale and Keyser looked at the middle transitivity alternation, where the subject of the intransitive usage of the verb can be placed as the object of the transitive usage. For example, for the verb 'read', we can say 'this book reads well'; or we can move the subject 'book' to the object position, as in 'I read this book'. The people who believe 'gally' means 'see' would not allow the sentence 'Whales gally easily', because the middle transitivity alternation is not allowed for 'see' (*Whales see easily). The people who believe the meaning is 'frighten' would allow the middle transitivity alternation, as in Whales frighten easily. In fact, the verbs taking the middle transitivity alternation normally denote a change of state, e.g. frighten, open, split and crush. From this example, we can see that a verb's behaviour has a very strong relationship with its semantics.

On the basis of the inter-dependency between this type of diathesis alternation (DA) behaviour of verbs and their meanings, Levin (1993) manually created a classification for English verbs. The verbs within the same class share some meaning component, similar syntactic frames and possibly other linguistic features (e.g. zero nominals). Levin's central thesis is that "the behaviour of a verb, particularly with respect to the expression and interpretation of its arguments, is to a large extent determined by its meaning". According to Levin, a verb's syntactic behaviour and semantics are linked. The syntactic frames are understood as a direct reflection of the underlying meaning components, which have selectional restrictions on arguments. For example:

• Heat radiates from the sun
• The sun radiates heat

According to Levin, verbs taking this alternation (the substance/source alternation) express substance emission (e.g. bleed, leak, sweat). They take two arguments: (i) a source (e.g. sun) and (ii) the substance emitted from the source (e.g. heat). The semantic role of the subject of the intransitive use of the verb is the same as the semantic role of the object of the transitive use, and the semantic role of the prepositional object of the intransitive use is the same as that of the subject of the transitive use.

Based on previous research on DAs (e.g. Pinker (1989); Jackendoff (1990)), Levin defined 78 possible DA types.
These alternations concern changes in verbs' transitivity or within the arguments of the VP, or involve the introduction of oblique complements, reflexives, passives, there-insertion, different forms of inversion or specific words. They are mainly restricted to verbs taking NP and PP complements.

Levin then analysed 3104 verbs according to the alternations, and associated each verb with a set of relevant alternations. She argued that verbs which behave similarly with respect to alternations share certain meaning components, and can thus be grouped together to form a semantically coherent class. Finally, the verbs with the same or similar alternation behaviour were assigned to the same class. Levin classified the verbs into 49 broad, semantically motivated verb classes, some of which divide further into subclasses, making the total number of classes 192. For example, the class of COOK verbs includes verbs such as cook, bake, boil, roast, heat . . . which share the following syntactic behaviour and alternations (Levin, 1993):

1. Causative/Inchoative Alternation
(a) Jennifer baked the potatoes.
(b) The potatoes baked.

2. Middle Alternation
(a) Jennifer baked Idaho potatoes.
(b) Idaho potatoes bake beautifully.

3. Instrument Subject Alternation
(a) Jennifer baked the potatoes in the oven.
(b) This oven bakes potatoes well.

4. Resultative Phrase
(a) Jennifer boiled the pot dry.
(b) Jennifer baked the potatoes to a crisp.

Besides listing the relevant alternations, Levin also included some alternations that the verbs cannot take. Levin demonstrated that the resulting verb classes can capture the inter-dependency between syntax and verb meaning.

Despite the popularity of the classification in NLP applications, Levin's original resource has a few limitations:

1. The classification is not exhaustive in terms of breadth or depth of coverage. More work is needed to cover a larger set of DAs and to further extend and refine the verb classification.

2. The classification lacks a comprehensive hierarchical organization of the type found in other computational lexical resources, such as WordNet and FrameNet. The taxonomy has only three levels, with both the upper and lower parts missing, and only a few classes have subclasses. A more complete taxonomy could benefit various NLP applications.

3. The classification does not provide an explicit description of the syntactic and semantic properties of member verbs (e.g. only some of the syntactic frames that participate in alternations are listed).

2.2.1 VerbNet

VerbNet was designed to address these limitations in Levin's original classification. It is currently the most extensive on-line verb lexicon available for English. It provides detailed syntactic-semantic descriptions of Levin classes organized into a refined taxonomy (additional classes and subclasses). Each verb class is described by thematic roles, selectional restrictions, semantic predicates, and argument and frame types containing syntactic descriptions. The resource has recently been extended with additional classes for verbs not covered by Levin (Kipper et al., 2006b). It now provides 270 fine-grained classes, which cover a larger number of verb senses and member verbs (6272).
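This taxonomy can also be browsed programmatically, which gives a concrete feel for its organization. The sketch below is a minimal illustration using NLTK's VerbNet corpus reader (assuming NLTK and its VerbNet corpus have been installed via nltk.download('verbnet')); the exact class IDs returned depend on the VerbNet version shipped with NLTK.

```python
# Minimal sketch: querying VerbNet through NLTK's corpus reader.
# Assumes the corpus has been downloaded with nltk.download('verbnet').
from nltk.corpus import verbnet

# Class IDs for 'bake' (Levin's COOK verbs correspond to class 45.3).
class_ids = verbnet.classids(lemma='bake')
print(class_ids)

# Member verbs of each class; for a cooking class these should
# include lemmas such as cook, boil and roast.
for cid in class_ids:
    print(cid, verbnet.lemmas(cid)[:10])
```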
VerbNet has been used to aid a number of NLP applications such as automatic verb acquisition (Swift, 2005), semantic role labelling (Swier and Stevenson, 2004), robust semantic parsing (Shi and Mihalcea, 2005), word sense disambiguation (Dang, 2004), building conceptual graphs (Hensman and Dunnion, 2004), and creating a unified lexical resource for knowledge extraction (Crouch and King, 2005), among others. However, the resource is still not comprehensive. It provides no statistical information about the likelihood of different classes for individual verbs, and it is not helpful for processing texts in specific domains where verb senses (and thus classes) may only partially overlap with the ones in general language.

2.2.2 Levin's verb classification and NLP applications

Verb classes can help to reduce the redundancy in the lexicon, since verbs in a class share similar properties. They can alleviate the problem of data sparseness which affects many NLP tasks, by predicting the properties of member verbs when not enough empirical evidence is available. One such task is semantic role labelling. Most work on this task takes a supervised approach which relies on a large amount of manually tagged corpora, such as PropBank and FrameNet, as training data. Swier and Stevenson (2004) instead take an unsupervised approach which takes advantage of the classification in VerbNet.

Experiencer V Cause
Experiencer V Cause Prep(in) Oblique
Experiencer V Oblique Prep(for) Cause

Table 2.1: An example VerbNet entry for the class containing the verb admire

The probability model is p(r|v, s, n), which is the probability of a semantic role r given the verb v, the slot s and the noun n. Consider the following example sentence: Kiva[EXPERIENCER] admires Mat[CAUSE]. The model estimates p(EXPERIENCER|admire, SUBJECT, Kiva) for the subject slot.

The bootstrapping algorithm initializes the probability model by making initial unambiguous role assignments according to the frames in the relevant VerbNet entry (an example is shown in table 2.1). It then iteratively updates the probability model and makes further assignments to the ambiguous slots. When the evidence for a slot is below a threshold, the model falls back on the statistics over the verb and noun classes. Thus, the backoff model is p(r|VerbClass, s, NounClass). This creates a simple probabilistic model with the emphasis on class generalization. This unsupervised method achieves an error reduction of 50-65% over an informed baseline2, demonstrating the potential of the approach for a task that has previously relied on large amounts of manually tagged training data.

2The baseline assigns all slots the role with the highest probability given the slot class (subject, object, indirect object and PP object).
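To make the class-based backoff concrete, here is a small sketch of the estimation step, with invented counts and hypothetical names; it is an illustration in the spirit of Swier and Stevenson (2004), not a reconstruction of their implementation.

```python
from collections import Counter, defaultdict

# Toy role model with class-based backoff. All counts are invented.
verb_counts = defaultdict(Counter)   # (verb, slot, noun) -> role counts
class_counts = defaultdict(Counter)  # (verb_class, slot, noun_class) -> role counts

def p_role(role, verb, slot, noun, verb_class, noun_class, threshold=5):
    """Estimate p(role | verb, slot, noun); when verb-specific evidence
    is below the threshold, back off to p(role | verb_class, slot, noun_class)."""
    dist = verb_counts[(verb, slot, noun)]
    if sum(dist.values()) < threshold:
        dist = class_counts[(verb_class, slot, noun_class)]
    total = sum(dist.values())
    return dist[role] / total if total > 0 else 0.0
```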
Verb classifications have also been used to help tasks such as parsing, word sense disambiguation, information extraction, question-answering, and machine translation (Swier and Stevenson, 2004; Dang, 2004; Shi and Mihalcea, 2005; Zapirain et al., 2008; Rios et al., 2011).

2.3 Automatic verb classification

Although Levin-style classes have proved helpful for a number of NLP tasks, large-scale exploitation in real-world or highly domain-sensitive tasks has been limited because no fully accurate or comprehensive lexical classification is available. There is no such resource because manual classification of large numbers of words has proved very time-consuming. In addition, class-based differences are typically manifested in differences in the statistics over usages of syntactic-semantic features. This statistical information is difficult to collect by hand as it is highly domain-sensitive, i.e. it varies with predominant word senses, which change across corpora and domains (e.g. the biomedical domain).

In recent years, automatic induction of verb classes from corpus data has become increasingly popular (Merlo and Stevenson, 2001; Korhonen et al., 2003; Schulte im Walde, 2006; Joanis et al., 2007; Li and Brew, 2008; Ó Séaghdha and Copestake, 2008; Vlachos et al., 2009b). Some works have focussed on multilingual (Ferrer, 2004; Falk et al., 2012) and domain-specific classification (Korhonen et al., 2006b, 2008).

Automatic verb classification is important because it provides the opportunity to acquire and tune classifications for the application and domain in question. Automatic classification is not only cost-effective but also gathers important statistical information from corpora, and it can be applied to new domains, provided relevant data is available.

Various approaches have been proposed for verb classification. Both supervised and unsupervised (including semi-supervised) approaches have been used to classify features extracted from corpus data (raw, tagged or parsed). Although the results have been generally encouraging, the accuracy of automatic classification shows room for improvement. In the following, we review the previous works on automatic verb classification. For each work, the data, features, classification methods and results are described.

2.3.1 Classification methods and performance

Supervised approaches

A supervised approach assigns verbs to one of several pre-defined verb classes. Supervised methods yield optimal performance where adequate and accurate training data are available. However, such methods require a pre-defined verb classification as part of the training data. Therefore, they are unable to detect new classes, and the cost of applying the method to a new domain is high.

Merlo and Stevenson (2001) presented an automatic classification of English intransitive verbs into just three syntactic classes: unergative, unaccusative and object-drop verbs. The classification was based on the argument structure and the thematic relations of verbs. The features used were transitivity, causativity, animacy, passive voice and verb tense. Each verb was associated with a vector of five features. The vectors served as input to a decision tree classification algorithm. 60 verbs were classified into the three classes with an accuracy3 of 69.8%.

3The evaluation measures used in this section are described in section 2.3.3.

Joanis et al. (2007) classified 845 verbs into 14 (some coarse, some fine-grained) Levin classes. Eleven evaluations were reported, for 2- to 14-way classifications. The features included syntactic slot, syntactic slot overlap, tense, voice, aspect and animacy. Support Vector Machines (Vapnik, 1995) were used for classification. The best accuracy on the 14-way test was 58.4%.
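The supervised set-up in these works can be summarized in a few lines: each verb becomes a feature vector (e.g. normalized frequencies of syntactic slots or SCFs) paired with a gold-standard class label, and a discriminative classifier is trained on the pairs. The sketch below is a minimal illustration with invented vectors and labels, not a reconstruction of any cited system.

```python
# Minimal sketch of supervised verb classification with an SVM.
# The three feature dimensions and all values are invented; real
# systems use hundreds of SCF/slot dimensions per verb.
import numpy as np
from sklearn.svm import SVC

X_train = np.array([[0.60, 0.30, 0.10],   # e.g. 'bake'
                    [0.55, 0.35, 0.10],   # e.g. 'boil'
                    [0.10, 0.20, 0.70],   # e.g. 'give'
                    [0.15, 0.15, 0.70]])  # e.g. 'send'
y_train = ['COOK', 'COOK', 'GIVE', 'GIVE']

clf = SVC(kernel='linear').fit(X_train, y_train)
print(clf.predict([[0.50, 0.40, 0.10]]))  # -> ['COOK']
```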
Sun et al. (2008b) classified 204 verbs into 17 Levin classes using rich SCF-based features (SCFs parameterized for prepositions, tense and voice). The features were extracted from the automatically acquired VALEX lexicon (Korhonen et al., 2006a). Three classification methods were used: Support Vector Machines, Maximum Entropy and a Gaussian method. The best classification accuracy was obtained with the Gaussian method (64%). The best-performing feature was SCFs parameterized with prepositions. Ó Séaghdha and Copestake (2008) classified Sun et al.'s best-performing feature using distributional kernel methods, obtaining a better accuracy of 67%.

Li and Brew (2008) investigated a range of feature sets for classifying English verbs. Joanis et al. (2007)'s test set and a large-scale test set of 48 Levin classes involving 1300 verbs were used. Bayesian Multinomial Regression (Genkin et al., 2008) was employed for classification. The feature sets included Joanis et al.'s feature sets, COs, dependency relations, SCFs and their combinations. The SCFs combined with COs gave the best result, 66.3% accuracy on Joanis et al.'s 14-way test set.

Unsupervised and semi-supervised approaches

Unsupervised methods use clustering to infer verb classes based on the similarity between verbs4. Such methods require minimal prior knowledge about the resulting classification: a similarity measure and the number of classes (optional for some methods). This approach has the benefit that it can also be used to discover novel information in corpus data. This is particularly useful for supplementing or improving existing classifications, or for learning new classifications for languages and domains where no manually built classifications are available. The resulting classification reflects the natural groupings of the data (Jain et al., 1999). This can help to understand the set of features that characterizes the verb classes and the underlying linguistic phenomena. For example, if a clustering experiment shows that DAs naturally group verbs into Levin-style classes, this can serve as empirical support for Levin's verb classification.

4There are unsupervised methods other than clustering methods, for example dimensionality reduction techniques such as non-negative matrix factorization (Seung and Lee, 2001), principal component analysis (Wold et al., 1987) and singular value decomposition (Golub and Reinsch, 1970), and neural network models (Kohonen, 1990). Because verb classification is a task that groups verbs together, clustering is the most straightforward approach. However, note that some methods that are not traditional clustering methods can be used for clustering, e.g. non-negative matrix factorization (Ding et al., 2010; Yang and Oja, 2012).

Sometimes, partial information about the verb classification is available prior to clustering. This information can be incorporated as a guideline for the clustering algorithm. This forms the basis of a semi-supervised approach5. Since in the existing verb clustering experiments the supervision (e.g. constraints) is always added on top of an unsupervised method, we include semi-supervised approaches in the current section.

5We use the term semi-supervised instead of weakly or lightly supervised learning. In some works, the two terms are used interchangeably (e.g. Jones (2004); Pham et al. (2005)). Yet in weakly-supervised learning, the training dataset (the seeds) is typically small: at least one labelled sample is required for each class. In semi-supervised learning, the definition tends to be more flexible: the training data does not need to be small, but the use of a large amount of unlabelled data is the focus (e.g. Søgaard (2011)). In verb clustering, the supervision is often not available for all the classes. Therefore, we use the term semi-supervised learning.
Brew and Schulte im Walde (2002) used a spectral clustering algorithm (SPEC) (Ng et al., 2002) to cluster 57 German verbs into 14 classes on the basis of cues related to SCF frequency information. A set of 38 SCFs was employed, some of which were parameterized for prepositions. The frequency information was extracted automatically from data parsed using a statistical parser. The performance of SPEC was compared to that of the K-means algorithm. The results showed that SPEC outperformed K-means in all the experiments, yielding an F-measure of 0.48 at best.

Schulte im Walde (2003) performed a larger experiment with 168 German verbs belonging to 43 classes. The features were similar to the ones used in Brew and Schulte im Walde (2002). Some frames were parameterized for SPs. An adjusted Rand index of 0.182 was obtained using K-means for classification.

Stevenson and Joanis (2003) classified 260 verbs into 13 Levin classes using agglomerative hierarchical clustering. The features were similar to those used in Joanis et al. (2007). They experimented with unsupervised, manual and semi-supervised feature selection. The semi-supervised method samples a small set of seed verbs whose classification is known; the features were then selected on the basis of the supervised classification result. The semi-supervised feature selection gave the best accuracy, 38%, outperforming the manually selected features.

Korhonen et al. (2003) conducted an experiment classifying 110 highly polysemous English verbs into 34 classes. They achieved 60% modified purity. Like Schulte im Walde (2003), they clustered verbs according to SCF frequency information, but employed a technique capable of dealing with sense variation. SCFs were extracted from corpus data using the SCF acquisition system of Briscoe and Carroll (1997). They parameterized two of the SCFs for prepositions, and applied the Information Bottleneck and the Nearest Neighbours methods to assign verbs to classes corresponding to (any of) their senses in the corpus data.

Korhonen et al. (2006b) extended this system with additional clustering methods (Probabilistic Latent Semantic Analysis and a modified version of the Information Bottleneck) and applied it to the biomedical domain. 192 medium-to-high frequency verbs were selected from biomedical journals for the experiment. The resulting classification was highly accurate (77% F-measure) and domain-specific. Using the same gold standard, Korhonen et al. (2008) investigated a wide range of feature sets (e.g. SCF, voice, SP). A novel clustering algorithm, pairwise clustering (Puzicha et al., 2000), was used. The best-performing feature was SCFs parameterized for prepositions and lexical preferences (LPs). The LPs were acquired automatically from corpus data using an unsupervised method.

Vlachos et al. (2009b) used a Dirichlet process mixture model to cluster the verbs in Sun et al. (2008b)'s test set, using SCFs parameterized for prepositions. The result was a V-measure (Rosenberg and Hirschberg, 2007) of 57% for the 30-cluster setting against the 17 gold standard classes. Furthermore, a small number of pairwise constraints specifying whether two verbs must or must not link were added to the algorithm. This improved the performance by 8%.
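The clustering recipe shared by several of these works can be illustrated compactly: represent each verb as an SCF distribution, derive pairwise affinities, and apply spectral clustering. The sketch below uses scikit-learn and invented distributions; it is a generic illustration of this recipe, not the MNCut variant introduced later in this thesis.

```python
# Sketch: spectral clustering of verbs represented as SCF distributions.
# The four verbs and their distributions are invented placeholders.
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.cluster import SpectralClustering

verbs = ['bake', 'boil', 'give', 'send']
scf = np.array([[0.60, 0.30, 0.10],
                [0.55, 0.35, 0.10],
                [0.10, 0.20, 0.70],
                [0.15, 0.15, 0.70]])  # each row sums to 1

n = len(verbs)
affinity = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        # jensenshannon() returns the square root of the JS divergence,
        # a common basis for comparing SCF distributions
        affinity[i, j] = np.exp(-jensenshannon(scf[i], scf[j]) ** 2)

labels = SpectralClustering(n_clusters=2, affinity='precomputed',
                            random_state=0).fit_predict(affinity)
print(dict(zip(verbs, labels)))  # e.g. {'bake': 0, 'boil': 0, 'give': 1, 'send': 1}
```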
Falk et al. (2012) used existing syntactic and semantic lexical resources to cluster 2183 French verbs. These verbs are from the gold standard classes we developed and introduce in chapter 6 of this thesis. The following features were used: SCF, VerbNet thematic grids, syntactic features (symmetric arguments, predicate, sentential argument, optional object and passive built with it) and semantic features (location role, concrete object, asset role and plural role). Incremental Growing Neural Gas (IGNG) with Feature Maximization was used as the clustering method (Lamirel et al., 2011). The method was shown to outperform K-means because it is suited to the relatively small and clean feature sets extracted from the lexicon. An F-measure of 0.70 was achieved using a combination of thematic grid, SCF and syntactic features. A detailed discussion of this work can be found in section 6.8.

In addition to flat clustering, Ferrer (2004) applied hierarchical clustering to 514 Spanish verbs and evaluated against a hierarchical gold standard resembling that of Levin's classification in English (Vázquez et al., 2000). An Radj of 0.07 was reported for a 15-way classification, which is comparable to the result of Stevenson and Joanis (2003).

2.3.2 Features and feature extraction

The main feature of manual verb classification is the DA, which manifests at the level of syntax as alternating sets of SCFs. Since automatic detection of DAs is challenging (McCarthy, 2001), most work on automatic classification has focused on syntactic features, exploiting the fact that similar alternations tend to result in similar syntactic behaviour. The syntactic features have been shallow syntactic slots (e.g. NPs preceding or following the verb) extracted using a lemmatizer or a chunker, or verb SCFs extracted using a chunker or a parser. These feature types have been refined with information about prepositional preferences (PPs) of verbs. Joanis et al. (2007) have reported better results using syntactic slots, while several others have obtained good results using SCFs, e.g. (Schulte im Walde, 2006; Li and Brew, 2008). While SCFs correspond better (than syntactic slots) to the features used in manual work, optimal results have required including in SCFs additional information about adjuncts (not only arguments) of verbs (Sun et al., 2008a), which is typically not used in manual classification.

Recent research has also experimented with replacing or supplementing SCFs with information about the basic lexical context (co-occurrences (COs)) of verbs, or LPs in specific grammatical relations (GRs) associated with verbs in parsed data (for example, the type and frequency of prepositions in the indirect object relation) (Li and Brew, 2008). Some experiments have also explored the usefulness of verb tense (e.g. the part-of-speech tags of verbs), voice (whether the verb was used in the active or the passive) and/or aspect for verb classification (Joanis et al., 2007; Korhonen et al., 2008).

While most work has focussed on syntactic or lexical features, a few attempts have also been made to refine syntactic features with semantic information about verb selectional preferences (SPs). Following Merlo and Stevenson (2001), Joanis et al. (2007) used a simple 'animacy' feature which was determined by classifying e.g. pronouns and proper names in the data into this single SP class.
Joanis (2002) employed as SP models the top-level WordNet (Miller, 1995) classes (Schulte im Walde (2006) tried a similar approach for German). Finally, combinations of lexical, syntactic, semantic and other features have been explored. We describe below the features and the feature extraction methods used in previous works.

Transitivity

The transitive and intransitive usage of verbs can be used to distinguish certain groups of verbs (Merlo and Stevenson, 2001). For example, among the three groups of verbs unergative (e.g. race), unaccusative (e.g. melt) and object-drop (e.g. play), we expect the frequency of transitive usage to be ordered as follows: object-drop > unaccusative > unergative. An approximation of this feature was used by Merlo and Stevenson (2001): a verb occurrence preceded by a form of the verb be, or immediately followed by a potential object (e.g. a noun, pronoun or determiner), was counted as transitive; otherwise, the occurrence was counted as intransitive. As the underlying corpus was tagged but not parsed, the identified transitive/intransitive uses are expected to include some noise.

Causativity

Causativity is related to the causative alternation. Consider the example The sun melted the ice./The ice melted. In this alternation, the thematic role of the subject of the intransitive is the same as the thematic role of the object of the transitive (Merlo and Stevenson, 2001). For verbs that do not take this alternation, this pattern is infrequent. For example, in the unexpressed object alternation (Mike ate the cake./Mike ate.), the thematic role of the subject in the intransitive use is different from the thematic role of the object in the transitive use.

In Merlo and Stevenson (2001), the causativity feature was approximated as follows. The subjects and objects of a verb were collected into two multisets, respectively. The overlap of these two multisets was calculated in order to find out the number of times the same noun was used as both subject and object. For example, given the subject and object multisets {a,a,a,b} and {a}, the overlap is {a,a,a}. The feature value is the ratio between the cardinality of the overlap and the sum of the cardinalities of the subject and object multisets. For this example, the ratio would be 3/5.

Animacy

Certain groups of verbs take agentive subjects more frequently. For example, the unergative and object-drop verbs take agentive subjects in both transitive and intransitive uses. However, the unaccusative verbs assign an agentive role only in the transitive; in the intransitive use, they assign a theme role to the subject. Thus, an agentive subject is less frequently found with unaccusatives than with unergative or object-drop verbs. Since agents are more likely to be animate nouns than themes, the unaccusative verbs are expected to be less likely to take animate nouns as subjects when compared to unergative and object-drop verbs (Merlo and Stevenson, 2001).

Merlo and Stevenson approximated the animacy of subjects by all pronouns except it, following the hypothesis, presented in Silverstein (1976) and Dixon (1994), that these pronouns most often refer to animate entities. The pronouns that occur in the subject position were counted, and the feature value is the ratio of the count of pronoun subjects to the count of all subjects.
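To make the causativity overlap computation concrete, the following is a minimal Python sketch that follows the worked example above; the input format (lists of subject and object head lemmas) is an assumption for illustration, and Merlo and Stevenson's exact formulation may differ in detail.

    from collections import Counter

    def causativity(subject_heads, object_heads):
        """Overlap ratio as in the worked example: subject tokens whose
        lemma also occurs in the object multiset, divided by the total
        size of both multisets."""
        subj, obj = Counter(subject_heads), Counter(object_heads)
        overlap = sum(n for lemma, n in subj.items() if lemma in obj)
        total = sum(subj.values()) + sum(obj.values())
        return overlap / total if total else 0.0

    # The example above: subjects {a,a,a,b}, objects {a} -> 3/5
    print(causativity(["a", "a", "a", "b"], ["a"]))  # 0.6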
Stevenson and Joanis (2003) and Joanis et al. (2007) extended this approach by including proper noun phrases that are labelled as a person by a chunker (Abney, 1991). The chunker includes a crude named entity recognition system which recognizes person names by matching against a list of English first names and titles.

Voice

The passive/active voice of the verb is related to the transitivity alternations: the passive use of a verb implies a transitive use. In Merlo and Stevenson (2001), verbs tagged with VBD (the past tense tag) were considered active voice, while a token tagged as VBN (the past participle tag) together with the closest preceding auxiliary be was considered a passive use. This feature was also used by Stevenson and Joanis (2003) and Joanis et al. (2007).

Verb Tense

In analogy to the voice feature, the VBN/VBD tag is related to the transitivity alternation: the past participle form of the verb implies a transitive use. In addition, the middle voice (found in the middle alternation) is often in the present tense (Joanis et al., 2007). In Merlo and Stevenson (2001), the occurrences of VBD/VBN tags are simply counted within the tagged corpus. In Stevenson and Joanis (2003) and Joanis et al. (2007), the tags VB (base form), VBP (present tense, not third person singular), VBZ (present tense, third person singular) and VBG (present participle) are also included.

Lexical Preferences

This feature encodes the frequency of the syntactic slots corresponding to verbal arguments (e.g. the subject and the direct and indirect object slots). These argument slots are components of a subcategorization frame, but are here considered independently of their co-occurrence with other slots.

In Stevenson and Joanis (2003) and Joanis et al. (2007), these features were extracted from the tagged and NP-chunked BNC corpus. In Joanis et al. (2007), the subject slots found in transitive and intransitive usages were considered separately. In addition, the frequencies of two word classes in the slots were considered which reflect the number of DAs: the reflexive pronouns (e.g. Jill dressed hurriedly./Jill dressed herself hurriedly.) and the words it/there (e.g. A problem developed./There developed a problem.). Furthermore, the overlap of the word lemmas in the slots was also used as a feature. This overlap feature was inspired by the causativity feature (Merlo and Stevenson, 2001) described earlier in this section. Since the same semantic argument can occur in different slots in alternating frames, the degree to which two slots contain the same entities is an indicator of the verb's participation in an alternation. For example, given the alternation The sky cleared/The clouds cleared from the sky, an overlap feature of the subject and the indirect object slots is added. The overlap value was calculated in the same way as in Merlo and Stevenson (2001) (described above in the causativity section).

In more recent works, the heads of arguments were detected from grammatical relations found by statistical parsers. In Li and Brew (2008), the lexical heads of arguments were identified using grammatical relations produced by the C&C CCG parser (Clark and Curran, 2007). Korhonen et al. (2008) used the RASP parser (Briscoe et al., 2006) to extract grammatical relations.

Subcategorization Frames

DAs manifest as alternating sets of SCFs (e.g. in the causative/inchoative alternation, an NP frame alternates with an intransitive frame: Tony broke the window ↔ The window broke).
Most work (e.g. Korhonen et al. (2003, 2006b, 2008); Li and Brew (2008)) on automatic classification has exploited the fact that similar alternations tend to result in similar SCFs. Many recent verb classification works (Korhonen et al., 2003, 2006b; Joanis et al., 2007; Korhonen et al., 2008) have used some version of the comprehensive SCF acquisition system originally developed by Briscoe and Carroll (1997). The system makes use of the RASP toolkit (Briscoe et al., 2006). The corpus data are first tokenized, tagged, lemmatised and parsed using RASP. SCF patterns are then extracted from the parser output and classified using a classifier which distinguishes between over 160 SCF types, a superset of those in the ANLT (Boguraev et al., 1987) and COMLEX (Grishman et al., 1994) syntax dictionaries. The SCFs abstract over lexically-governed particles, prepositions and specific predicate selectional preferences. A statistical filtering component may optionally be applied which removes noisy SCFs from the lexicon. A large SCF lexicon, VALEX (Korhonen et al., 2006a), has been constructed using this system. The VALEX lexicon (typically an unfiltered, noisy version of the lexicon, as this produced the best results for the task) was used for verb classification by Sun et al. (2008b) and Sun et al. (2008a).

Li and Brew (2008) extracted SCFs by matching the labels of grammatical relations to a list of syntactic constituents that are found in subcategorization frame definitions. For example, NP1 is the subject of the verb, NP2 is the object of the verb and PP is the prepositional phrase. In the first step, a lexical frame is constructed. For the sentence he broke the door with a hammer, the list of grammatical relations includes (ncsubj broke 1 he 0), (dobj broke 1 door 3), (iobj broke 1 with 4) and (dobj with 4 hammer 6); the identified lexical frame is NP1(he)-V-NP2(door)-PP(with:hammer). Then, the SCF is constructed as NP1-NP2-PP(with).

SCFs can be further refined by adding lexical, syntactic and semantic information:

SCF+preposition

Some of the SCFs can be parameterized with the prepositional phrase involved. For example, consider the SCF (NP-PP): she puts the flowers on the table./she removed the flowers from the table. As the examples show, put and remove have the same SCF NP-PP. However, put prefers a PP headed by on, while remove prefers a PP headed by from. Therefore, to tell the difference between put and remove (which belong to different verb classes, those of PUTTING and REMOVING verbs) it helps to know the prepositions occurring in their SCFs. In this example, the parameterized SCFs would be NP-PP(on) and NP-PP(from).

In Korhonen et al. (2003), only two high frequency SCFs (PP and NP-PP) were parameterized with preposition information. In more recent works, the parameterization was extended to all the SCFs that involve PPs (e.g. Sun et al. (2008b); Korhonen et al. (2008)). For example, the SCF NP-P-NP-ING has different variations depending on the preposition in question (e.g. he attributed his failure to buying his books; he told her about climbing the mountain).
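As an illustration of this construction, the following minimal sketch derives a preposition-parameterized SCF label from simplified (relation, head, dependent) triples; the triple format is a hypothetical simplification of parser output, not Li and Brew's actual code.

    def scf_from_grs(verb, grs):
        """Build a preposition-parameterized SCF label for one sentence."""
        scf = []
        if any(r == "ncsubj" and h == verb for r, h, d in grs):
            scf.append("NP1")                     # subject slot
        if any(r == "dobj" and h == verb for r, h, d in grs):
            scf.append("NP2")                     # direct object slot
        for r, h, d in grs:
            if r == "iobj" and h == verb:         # PP slot: keep the preposition
                scf.append("PP(%s)" % d)
        return "-".join(scf)

    # "he broke the door with a hammer"
    grs = [("ncsubj", "broke", "he"), ("dobj", "broke", "door"),
           ("iobj", "broke", "with"), ("dobj", "with", "hammer")]
    print(scf_from_grs("broke", grs))             # NP1-NP2-PP(with)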
SCF+tense

This feature supplements a SCF with the POS tag of the main verb. The tense feature is related to the passive voice and middle voice of a verb, which are in turn related to certain DAs (see the verb tense section above). In previous works, two variants of this feature have been used. In the first, the frequencies of POS tags are calculated over all the SCFs of a verb; this feature is simply a concatenation of SCF features and tense features. In the second variant, the frequencies of POS tags are calculated specific to each SCF of the verb. The two types of features are exemplified in table 2.2.

(a)
Feature name   Feature value   Type of feature
NP-PP          10              SCF features
PP             20
VBD            15              Tense features
VBN            15

(b)
Feature name   Feature value   Type of feature
NP-VBD-PP      6               NP-PP parameterized by POS tag
NP-VBN-PP      4
VBD-PP         9               PP parameterized by POS tag
VBN-PP         11

Table 2.2: Example of two types of SCF+tense features

SCF+voice

The active and passive voices are related to the transitivity of the verb use (described in the voice feature section). In previous works, two sub-types of this feature were used: 1) the frequency of the active and passive occurrences of the verb calculated over all the SCFs of the verb; 2) the frequency of the active and passive occurrences of the verb calculated specific to each SCF of the verb. Table 2.3 illustrates these two feature types.

(a)
Feature name   Feature value   Type of feature
NP-PP          10              SCF features
PP             20
ACTIVE         25              Voice features
PASSIVE        5

(b)
Feature name    Feature value   Type of feature
NP-ACTIVE-PP    9               NP-PP parameterized by voice
NP-PASSIVE-PP   1
ACTIVE-PP       16              PP parameterized by voice
PASSIVE-PP      4

Table 2.3: Example of two types of SCF+voice features

SCF+lexical preference

This feature supplements a SCF with information about the LPs of the verb in the following slots: subject, direct object, second object, and the NP within the PP complement. For the example sentence The sun melted the ice, the SCF NP is parameterized by the lexical heads of the arguments in the subject and direct object slots: after parameterization, the SCF is transformed to nsubj:sun-NP-dobj:ice.

The LP features are extracted from the grammatical relations produced by a statistical parser (Li and Brew, 2008; Korhonen et al., 2008). To relieve the sparse data problem, a frequency threshold can be applied to remove noise (Korhonen et al., 2008).
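The two parameterization variants can be sketched as follows, using voice as the example (tense works the same way with POS tags); the input format, a list of (SCF, voice) pairs for one verb, is an assumption for illustration.

    from collections import Counter

    def scf_voice_features(occurrences, per_scf=False):
        """Variant (A): SCF counts plus global voice counts.
        Variant (B): voice counts kept specific to each SCF."""
        feats = Counter()
        for scf, voice in occurrences:
            if per_scf:
                feats["%s:%s" % (scf, voice)] += 1   # variant (B)
            else:
                feats[scf] += 1                      # variant (A)
                feats[voice] += 1
        return feats

    occ = ([("NP-PP", "ACTIVE")] * 9 + [("NP-PP", "PASSIVE")]
           + [("PP", "ACTIVE")] * 16 + [("PP", "PASSIVE")] * 4)
    print(scf_voice_features(occ))                # matches table 2.3 (a)
    print(scf_voice_features(occ, per_scf=True))  # matches table 2.3 (b)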
SCF+selectional preference

To overcome the data sparseness problem of the LP feature, selectional preferences (SPs) can be used instead of actual nouns. SPs can be strong indicators of DAs (McCarthy and Korhonen, 1998), and fairly precise semantic descriptions, including information about verb selectional restrictions, can be assigned to the majority of Levin classes, as demonstrated by VerbNet (Kipper-Schuler, 2005). SP acquisition from undisambiguated corpus data is arguably difficult (Brockmann and Lapata, 2003; Erk, 2007; Bergsma et al., 2008). The traditional way to acquire SPs is to use a lexical resource: Joanis (2002) and Schulte im Walde (2006) employed top-level WordNet (Miller, 1995) and GermaNet (Kunze and Lemnitzer, 2002) classes as SP models in verb classification. Joanis (2002) obtained no improvement over syntactic features, whereas Schulte im Walde (2006) obtained an insignificant improvement. Korhonen et al. (2008) combined SPs with SCFs when clustering biomedical verbs. The SPs were acquired automatically from the syntactic slots of SCFs using PC clustering. The SP clusters offered no improvement over the SCF+LP features.

As an example of the representation: if word w1 is in SP cluster c1 and word w2 is in cluster c2, the original SCF+LP feature ncsubj:w1 NP dobj:w2 is transformed to ncsubj:c1 NP dobj:c2. If w3 is also in cluster c2, then the SCF+LP feature ncsubj:w1 NP dobj:w3 has the same representation in SCF+SP.

Co-occurrence

Co-occurrence (CO) features record the words that occur in the context of a verb. According to the distributional hypothesis (Harris, 1954), verbs that occur in similar contexts tend to have similar meanings. However, COs are generally not considered particularly sensitive to argument structure (Rohde et al., 2004). This feature was introduced to verb classification by Li and Brew (2008): the word lemmas that occur in a fixed-length window around the verb are collected as features, and a stopword list is used to filter out function words. Li and Brew also used an extended feature which integrates some syntactic information: 1) all prepositions are kept, as they are known to carry information about the lexical meaning of the verb; 2) all verbs that occur in the context of the target verb are replaced with their POS tags. Li and Brew assumed that most verbs tend to have a strong selectional preference for their nominal arguments, but not for their verbal arguments.

Table 2.4 summarizes all the existing features and the extraction methods that are reused in this thesis. Table 2.5 shows an example for each feature type.

Feature Name     Description and extraction method
F-CO             Co-occurrence (Li and Brew, 2008)
F-LP             Lexical preference, extracted as in Korhonen et al. (2008) using the RASP parser
F-PP             Prepositional preference, a subset of F-LP which only includes the type and frequency of prepositions in the indirect object relation
F-SCF            Basic SCF, extracted using Preiss et al. (2007)'s system
F-SCF(B)         Basic SCF feature, as in Sun et al. (2008b), extracted from the VALEX lexicon
F-SCF+CO         The concatenation of F-SCF and F-CO
F-SCF+TENSE(A)   F-SCF with the tense of the verb. The frequency of verbal POS tags is calculated over all SCFs (Korhonen et al., 2008).
F-SCF+TENSE(B)   Same as above, but the frequency of verbal POS tags is calculated specific to each SCF.
F-SCF+VOICE(A)   F-SCF with the active/passive voice of the verb. The frequency of the voice is calculated over all SCFs (Korhonen et al., 2008).
F-SCF+VOICE(B)   Same as above, but the frequency of voice is calculated specific to each SCF.
F-SCF+PP(A)      F-SCF with two high frequency PP frames parameterized for prepositions: the PP and NP-PP frames (Korhonen et al., 2008).
F-SCF+PP(B)      F-SCF with all PP frames parameterized for prepositions (Korhonen et al., 2008).
F-SCF+LP(A)      F-SCF parameterized by F-LP in all argument slots (Korhonen et al., 2008).
F-SCF+LP(B)      F-SCF+LP(A) filtered by keeping only those raw argument head types which occur with four or more verbs with frequency ≥ 3 (Korhonen et al., 2008).
F-SCF+SP(A)      F-SCF parameterized by F-SP in all argument slots. As in Korhonen et al. (2008), the SPs are acquired automatically by clustering the argument heads. The number of clusters was set to 10.
F-SCF+SP(B)      As above, with the number of clusters set to 20.
F-SCF+SP(C)      As above, with the number of clusters set to 50.

Table 2.4: Summary of previously proposed features that are reused in this thesis

Feature Name     Example features
F-CO             she -1, flower +2, on +3 ...
F-LP             nsubj:she, dobj:flower, iobj:on, idobj:table
F-PP             iobj:on
F-SCF            NP-PP
F-SCF+CO         NP-PP, she -1, flower +2, on +3 ...
F-SCF+TENSE(A)   NP-PP, VBZ
F-SCF+TENSE(B)   NP-PP VBZ
F-SCF+VOICE(A)   NP-PP, ACTIVE
F-SCF+VOICE(B)   NP-PP ACTIVE
F-SCF+PP         NP-PP:on
F-SCF+LP         she-NP:flower-PP:on table
F-SCF+SP         SP1-NP:SP5-PP:on SP7

Table 2.5: Example of features extracted from the sentence She puts the flower on the table. SP noun cluster 1 contains pronouns like he, she and they; cluster 5 contains nouns like flower, vine and weed; cluster 7 contains nouns like table, chair and bench.

2.3.3 Evaluation measures

Many evaluation measures have been used in verb classification experiments. In this section, we focus on clustering evaluation measures that have been used in unsupervised verb classification experiments. Typical objective functions in clustering aim at high intra-cluster similarity and low inter-cluster similarity. This is an internal criterion for the quality of the clustering (Manning et al., 2008). However, a good result by an internal criterion might not match the gold standard, which is the indicator of effectiveness in application. We therefore describe external criteria that evaluate how well the clustering matches the gold standard, and omit evaluation measures which do not consider the gold standard, for example the mean silhouette (Stevenson and Joanis, 2003) and cumulative micro-precision (Falk et al., 2012); these are forms of internal criterion which do not reflect the effectiveness of the clustering in application.

Modified purity, weighted class accuracy and F-measure have been used in many previous verb clustering experiments, e.g. by Korhonen et al. (2008) and Ó Séaghdha and Copestake (2008). Modified purity (mPUR) is a global measure which evaluates the mean precision of clusters. Each cluster is associated with its prevalent class. The number of verbs in a cluster K that take this class is denoted by n_prevalent(K). Verbs that do not take it are considered as errors. Clusters where n_prevalent(K) = 1 are disregarded, so as not to introduce a bias towards singletons:
mPUR = \frac{\sum_{n_{prevalent}(k_i) \geq 2} n_{prevalent}(k_i)}{\text{number of verbs}}

The second measure is weighted class accuracy (ACC): the proportion of members of the dominant clusters DOM-CLUST_i within all classes c_i.

ACC = \frac{\sum_{i=1}^{C} \text{verbs in DOM-CLUST}_i}{\text{number of verbs}}

mPUR and ACC can be seen as measures of precision (P) and recall (R), respectively. F-measure is calculated as the harmonic mean of P and R:

F = \frac{2 \cdot mPUR \cdot ACC}{mPUR + ACC}

If the number of clusters is not pre-defined, a high F-measure is easily achieved, as mPUR and ACC tend to increase when the number of clusters is large.
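A minimal sketch of these three measures, assuming each verb maps to exactly one induced cluster and one gold class; singleton prevalent-class matches are discarded per the n_prevalent(K) ≥ 2 condition above.

    from collections import Counter

    def mpur_acc_f(clusters, gold):
        """mPUR, ACC and their harmonic mean. `clusters` and `gold`
        map each verb to a cluster id / gold class."""
        n = len(gold)
        by_cluster, by_class = {}, {}
        for v in gold:
            by_cluster.setdefault(clusters[v], Counter())[gold[v]] += 1
            by_class.setdefault(gold[v], Counter())[clusters[v]] += 1
        mpur = sum(c.most_common(1)[0][1] for c in by_cluster.values()
                   if c.most_common(1)[0][1] >= 2) / n
        acc = sum(c.most_common(1)[0][1] for c in by_class.values()) / n
        f = 2 * mpur * acc / (mpur + acc) if mpur + acc else 0.0
        return mpur, acc, f

    gold = {"run": "51.3.2", "walk": "51.3.2", "put": "9.1", "place": "9.1"}
    pred = {"run": 0, "walk": 0, "put": 1, "place": 1}
    print(mpur_acc_f(pred, gold))   # (1.0, 1.0, 1.0)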
Information theory based evaluation measures (e.g. NMI and V-MEASURE (Rosenberg and Hirschberg, 2007)) have been used to compare verb clustering results with different numbers of clusters (Vlachos et al., 2009b). NMI measures the amount of statistical information shared by two random variables representing the clustering result and the gold standard labels. Given random variables A and B:

NMI(A,B) = \frac{I(A;B)}{[H(A) + H(B)]/2}

I(A;B) = \sum_k \sum_j \frac{|v_k \cap c_j|}{N} \log \frac{N |v_k \cap c_j|}{|v_k||c_j|}

where |v_k \cap c_j| is the number of shared members between cluster v_k and gold standard class c_j. Geiß (2011) proved that V-MEASURE is equivalent to NMI (they are identical under a certain condition).

The adjusted Rand index (Radj) was employed in Schulte im Walde (2006). Radj views clustering as a series of decisions, one for each pair of verbs, and measures the agreement and disagreement over these pairs. Given a gold standard with G classes and a clustering with C clusters, a C × G contingency table N defines the agreement between the gold standard and the clustering, where n_{ij} is the size of the intersection between class i and cluster j. The formula of Radj is (Hubert and Arabie, 1985):

R_{adj} = \frac{\sum_{i,j} \binom{n_{ij}}{2} - \left[\sum_i \binom{n_{i\cdot}}{2} \sum_j \binom{n_{\cdot j}}{2}\right] / \binom{n}{2}}{\frac{1}{2}\left[\sum_i \binom{n_{i\cdot}}{2} + \sum_j \binom{n_{\cdot j}}{2}\right] - \left[\sum_i \binom{n_{i\cdot}}{2} \sum_j \binom{n_{\cdot j}}{2}\right] / \binom{n}{2}}
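For NMI with the [H(A)+H(B)]/2 normalization, V-measure and the adjusted Rand index, scikit-learn provides reference implementations (assuming scikit-learn is available); with arithmetic averaging, NMI and V-measure coincide, as noted by Geiß (2011).

    from sklearn.metrics import (adjusted_rand_score,
                                 normalized_mutual_info_score,
                                 v_measure_score)

    gold = [0, 0, 0, 1, 1, 2, 2, 2]   # gold class per verb (toy data)
    pred = [1, 1, 0, 0, 0, 2, 2, 2]   # induced cluster per verb

    print(normalized_mutual_info_score(gold, pred,
                                       average_method="arithmetic"))
    print(v_measure_score(gold, pred))  # identical to the NMI above
    print(adjusted_rand_score(gold, pred))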
2.4 The current challenges

In table 2.6, we list a few previous classification results on two small gold standards⁶.

Data                  Work                              Method           Result (in F)
Joanis et al. (2007)  Li and Brew (2008)                supervised       66.3
                      Joanis et al. (2007)              supervised       58.4
                      Stevenson and Joanis (2003)       semi-supervised  29
                                                        unsupervised     31
Sun et al. (2008b)    Sun et al. (2008b)                supervised       62.50
                                                        unsupervised     51.6
                      Ó Séaghdha and Copestake (2008)   supervised       67.3

Table 2.6: Previous verb classification results on two small gold standards

Although improved accuracy can be observed in more recent works, there is still much room for further improvement: even the supervised approaches do not achieve an accuracy better than 70%. Large-scale, cross-lingual and domain-specific experiments are also needed. We discuss a few important challenges currently faced by verb classification. The first three challenges are addressed in this thesis; the last challenge, polysemy, is discussed in the future work section (section 8.2).

6 Note that although the results are obtained using the same gold standards, they are not mutually fully comparable because of differences in corpus data, features, and the number of test verbs actually used in the experiments. However, they serve to show the upper bound of previous works in English.

2.4.1 Semantic information

The current features used in verb classification are mainly syntactic in nature, e.g. SCF-based and lexical features. However, the member verbs of Levin's classes are not only similar in terms of their syntactic behaviour, but also share meaning components. Levin provides fairly precise semantic descriptions, including selectional preferences, for many verb classes, and information about verb selectional restrictions has been assigned to the majority of VerbNet classes. It therefore seems intuitive that semantic features should be useful for verb classification.

Previous works have mainly experimented with SPs acquired from a lexicon, because the automatic acquisition of SPs from corpora is challenging (Brockmann and Lapata, 2003; Erk, 2007; Bergsma et al., 2008). In these previous works, SPs did not offer an improvement over the syntactic features (Joanis, 2002; Schulte im Walde, 2006; Korhonen et al., 2008). In verb classification, SPs are needed for the argument slots of SCFs. The data may be sparse, and the resulting feature space is very high-dimensional. This large feature space is a challenge for the current clustering methods. New clustering methods that are good at handling high-dimensional feature spaces may help.

2.4.2 Hierarchical classification

The existing gold standards for verb classification (e.g. Levin's classification and VerbNet) are hierarchical in nature: different levels of sub-classes form a tree structure. Yet current works mostly focus on flat classification of verbs. Some perform multiple levels of flat clustering and evaluate against a hierarchical gold standard (Korhonen et al., 2008), and others perform hierarchical clustering but evaluate against a flat gold standard (e.g. Stevenson and Joanis (2003)). We found that only Ferrer (2004) performed hierarchical clustering (on Spanish verbs) and evaluated against a small hierarchical gold standard. Moreover, all previous works use linkage hierarchical clustering. This method has a few problems, e.g. the cut-off value is difficult to determine (Stevenson and Joanis, 2003). A recent graph-based method (Yu et al., 2006) avoids some problems of the linkage method and performs better on many tasks. Yu et al.'s method can be a good starting point for improving the performance of hierarchical verb clustering.

2.4.3 Task-based evaluation

The manually created verb classification in VerbNet has proved useful for many practical NLP tasks. However, to our knowledge, automatic verb classification has not yet been evaluated in the context of an NLP task. Automatic classification is typically evaluated against a manual gold standard. Although gold standard evaluation gives some idea of the accuracy of a method, few gold standards are perfect, and the required accuracy or ideal granularity of a verb classification may change from one task to another. Moreover, some tasks involve specific domains for which no manually built classifications are available that could serve as a gold standard. It is therefore important to evaluate verb clustering in the context of tasks.

2.4.4 Polysemy

Polysemy is frequent in language. It is estimated that 25% of the verbs in VerbNet are polysemous (Abend et al., 2008). In particular, many high frequency verbs have several senses and can be members of several classes. Most work on automatic classification has bypassed this issue by assuming a single class for each verb, usually the one corresponding to its predominating (most frequent) sense in language according to e.g. WordNet. Not only is this an oversimplified model for the real-world application of verb classes, but the predominating sense is also not static: it varies across domains and sub-languages.

Few attempts have been made to address this problem. Korhonen et al. (2003) performed a clustering experiment with highly polysemous verbs. They constructed a polysemous gold standard for around 200 English verbs and examined whether a soft clustering method (Information Bottleneck) could be used to assign these verbs to several classes. The resulting clustering turned out to be hard, with the majority of verbs being assigned to one class only. Yet the investigation showed that polysemy has a considerable impact on verb classification: optimal results were obtained when clustering was evaluated against the polysemous gold standard, not against the monosemous version of it which assumed the predominant sense according to WordNet.

Clearly polysemy is an issue that needs to be dealt with, and this amounts both to extending gold standards to capture non-predominant senses and to finding a suitable ML method.
A multi-label classification method was used for supervised adjective classification (Boleda et al., 2007) and might yield useful results also with verbs. For unsupervised learning, we argue that the common soft clustering methods (e.g. the Gaussian mixture model trained with EM) are not suitable for modelling polysemy: the probability of a verb belonging to more than one cluster cannot be modelled adequately by these approaches. Probabilistic clustering models which explicitly model the overlap between lexical categories might be of use (Heller et al., 2008). We discuss this in detail in section 8.2.2.

Chapter 3

Verb clustering using selectional preferences

3.1 Introduction¹

As discussed in the previous chapters, both supervised and unsupervised machine learning (ML) methods have been proposed for verb classification and used to classify a variety of features extracted from raw, tagged and/or parsed corpus data. The best performing features on cross-domain verb classification have been syntactic in nature (e.g. syntactic slots, SCFs). Disappointingly, semantic features have not yielded significant additional improvement, although they play a key role in manual and theoretical work on verb classification and could thus be expected to offer a considerable contribution to classification performance.

We further investigate the potential of semantic features – verb SPs – for the task. We introduce a novel approach to verb clustering which involves the use of (i) the SCF acquisition system of Preiss et al. (2007), which produces rich lexical, SCF and syntactic data, (ii) novel syntactic-semantic feature sets extracted from this data which incorporate a variety of linguistic information, including SPs, and (iii) a new variation of spectral clustering based on the MNCut algorithm (Meila and Shi, 2001) which is well suited to dealing with the resulting high dimensional feature space.

Using this approach, we show on two well-established test sets that automatically acquired SPs can be highly useful for verb clustering: they yield high performance when used in combination with syntactic features. We obtain our results using a fully unsupervised approach to SP acquisition which differs from previous approaches employed in verb classification in that it does not exploit WordNet (Miller, 1995) or other lexical resources. It is based on clustering argument head data in the grammatical relations associated with verbs.

1 The research reported in this chapter was published in Sun and Korhonen (2009).

We describe our features in this section and the clustering methods in section 3.2. Experimental evaluation and results are reported in sections 3.3 and 3.4, respectively. Section 3.5 provides discussion and describes related work, and section 3.6 concludes.

SP acquisition from undisambiguated corpus data is arguably challenging (Brockmann and Lapata, 2003; Erk, 2007; Bergsma et al., 2008). It is especially so in the context of verb classification, where SP models are needed for specific syntactic slots for which the data may be sparse, and where the resulting feature vectors integrating both syntactic and semantic features may be high dimensional. However, we wanted to investigate whether better results could be obtained if the features were optimised for richness, the feature extraction for accuracy, and a clustering method capable of dealing with the resulting high dimensional feature space was employed.
3.1.1 Feature extraction

We adopted an SCF acquisition system which has proved more accurate than previous comparable systems for English² but which has not been employed for verb clustering before: the system of Preiss et al. (2007). This system tags, lemmatizes and parses corpus data using the current version of the RASP toolkit (Briscoe et al., 2006), and on the basis of the resulting grammatical relations (GRs) assigns each occurrence of a verb to one of 168 verbal SCF classes³. The system provides a filter which can be used to remove adjuncts from the resulting lexicon. We do not employ this filter, since adjuncts have proved informative for verb classification (Sun et al., 2008b; Joanis et al., 2007). However, we do apply frequency-based thresholding to minimise noise (e.g. erroneous SCFs) and data sparsity and to ensure that only features supported by several verbs are used in classification: we only consider SCFs and GRs which have a frequency larger than 40 and occur with 5 or more verbs⁴.

The system produces a rich lexicon which includes the raw and processed input sentences and provides a variety of material for verb clustering, including e.g. (statistical) information related to the POS tags, GRs, SCFs, argument heads, and adjuncts of verbs. Using this material, we constructed a wide range of feature sets for experimentation, covering both shallow and deep syntactic and semantic features. As described below, some of the feature types have been employed in previous works and some are novel.

2 See Preiss et al. (2007) for the details of the evaluation.
3 We used an implementation of the SCF classifier provided by Paula Buttery.
4 These and other threshold values mentioned in this chapter were determined empirically on corpus data.

3.1.2 Feature sets

Table 3.1 provides the mapping of our features to the features shown earlier in table 2.4, which were used in previous experiments.

F1: F-CO
F2: F-PP
F3: F-LP
F4: F-SCF
F5: F-SCF+CO
F6: F-SCF+TENSE(B)
F7: F-SCF+PP(B)
F8: F-SCF(B)

Table 3.1: The mapping to the features in table 2.4. F8 is extracted from the VALEX lexicon (Korhonen et al., 2006a) for comparison with Preiss et al. (2007)'s SCF acquisition system.

F8 was included to enable comparing the contribution of the SCF system to that of an older, comparable system which was used for constructing the VALEX lexicon (Korhonen et al., 2006a). The following 9 feature sets are novel, so they are not in table 2.4. They build on F7, refining it further. F9-F11 refine F7 with information about LPs:

F9: F7 with F3 (subject only)
F10: F7 with F3 (object only)
F11: F7 with F3 (subject, object, indirect object)

F12-F17 refine F7 with SPs. We adopt a fully unsupervised approach to SP acquisition (a sketch follows below). We acquire the SPs by

1. taking the GR relations (subject, object, indirect object) associated with verbs,
2. extracting all the argument heads in these relations which occur with frequency > 20 with more than 3 verbs, and
3. clustering the resulting N most frequent argument heads into M classes using the SPEC method described in the following section.

We tried the N settings {200, 500} and the M settings {10, 20, 30, 80}. The best settings, N = 200, M = 20 and N = 500, M = 30, are reported.
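A minimal sketch of this acquisition procedure, with scikit-learn's SpectralClustering standing in for the MNCut SPEC variant and skew-divergence kernel described in section 3.2; the input format (argument head → verb co-occurrence counts) is an assumption for illustration.

    import numpy as np
    from sklearn.cluster import SpectralClustering

    def acquire_sps(head_verb_counts, n_heads=200, m_clusters=20):
        """Cluster the N most frequent argument heads into M SP classes.
        `head_verb_counts` maps an argument head lemma to a Counter of
        the verbs it occurs with."""
        heads = sorted(head_verb_counts,
                       key=lambda h: sum(head_verb_counts[h].values()),
                       reverse=True)[:n_heads]
        verbs = sorted({v for h in heads for v in head_verb_counts[h]})
        X = np.array([[head_verb_counts[h][v] for v in verbs]
                      for h in heads], dtype=float)
        labels = SpectralClustering(n_clusters=m_clusters,
                                    affinity="nearest_neighbors").fit_predict(X)
        return dict(zip(heads, labels))   # head lemma -> SP cluster id

The resulting mapping is then used to replace argument heads in the SCF+LP features with their SP cluster identifiers, as in the SCF+SP examples of section 2.3.2.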
We enforce the features to be shared by all the potential members of a verb class: the expected class size is approximately N/K, and we allow for 10% outliers, so features occurring with fewer than (N/K) × 0.9 verbs are removed. The resulting SPs are combined with SCFs in the same fashion as LPs are combined with SCFs in F9-F11:

F12-F14: as F9-F11, but SPs (20 clusters from 200 argument heads) are used instead of LPs
F15-F17: as F9-F11, but SPs (30 clusters from 500 argument heads) are used instead of LPs

All the features (including features used in other chapters) are summarized in appendix A.

3.2 Clustering methods

We use two clustering methods: (i) pairwise clustering (PC), which obtained the best performance in a comparison with several other methods in work on biomedical verb clustering (Korhonen et al., 2008), and (ii) a method which is new to the task (and, to the best of our knowledge, to NLP): a variation of spectral clustering which exploits the MNCut algorithm (Meila and Shi, 2001) (SPEC). SPEC has been shown to be effective for high dimensional and non-convex data in NLP (Chen et al., 2006), and it has been applied to German verb clustering by Brew and Schulte im Walde (2002). However, previous work has used Ng et al. (2002)'s algorithm, while we adopt the MNCut algorithm. The latter has shown a wider applicability (von Luxburg, 2007; Verma and Meila, 2003) and can be justified from the random walk view, which has a clear probabilistic interpretation.

Clustering groups a given set of items (verbs in our experiment) V = \{v_n\}_{n=1}^{N} into a disjoint partition of K classes I = \{I_k\}_{k=1}^{K}. Both our algorithms take a similarity matrix as input. We construct this from the skew divergence (Lee, 2001), following Brew and Schulte im Walde (2002)'s work on German verb clustering. An alternative distributional similarity measure is the Jensen-Shannon divergence (Lin, 1991); the two measures are compared in appendix B. The skew divergence between two feature vectors v and v' is:

d_{skew}(v, v') = D(v' \| a \cdot v + (1 - a) \cdot v')    (3.1)

where D is the KL divergence. v is smoothed with v'; the level of smoothing is controlled by a, which is set to a value close to 1 (e.g. 0.9999). We symmetrize the skew divergence as follows:

d_{sskew}(v, v') = \frac{1}{2} (d_{skew}(v, v') + d_{skew}(v', v))

SPEC is typically used with the Radial Basis Function (RBF) kernel. We adopt a new kernel similar to the symmetrized KL divergence kernel (Moreno et al., 2004), which avoids the need for scale parameter estimation:

w(v, v') = \exp(-d_{sskew}(v, v'))

The similarity matrix W is constructed with W_{ij} = w(v_i, v_j).
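A minimal NumPy sketch of the symmetrized skew divergence and the resulting kernel, assuming the feature vectors have already been normalized to probability distributions:

    import numpy as np

    def kl(p, q):
        """KL divergence D(p || q), restricted to the support of p."""
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    def skew(v, w, a=0.9999):
        """Skew divergence d_skew(v, w) = D(w || a*v + (1-a)*w) (eq. 3.1)."""
        return kl(w, a * v + (1 - a) * w)

    def similarity(v, w):
        """Symmetrized skew divergence turned into a kernel value."""
        d = 0.5 * (skew(v, w) + skew(w, v))
        return np.exp(-d)

    v = np.array([0.5, 0.3, 0.2])
    w = np.array([0.4, 0.4, 0.2])
    print(similarity(v, w))   # close to 1 for similar distributions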
3.2.1 Pairwise clustering

PC (Puzicha et al., 2000) is a method in which a cost criterion guides the search for a suitable partition. This criterion is realized through a cost function H(S, M) where

(i) S = \{sim(a, b)\}, a, b \in A, is a collection of pairwise similarity values, each of which pertains to a pair of data elements a, b \in A;
(ii) M = (A_1, ..., A_k) is a candidate clustering configuration, specifying the assignment of all elements into disjoint clusters (that is, \cup A_j = A and A_j \cap A_{j'} = \emptyset for every 1 \leq j < j' \leq k).

The main idea underlying the clustering criterion is a preference for configurations in which the similarity of elements within each cluster is generally high and the similarity of elements that are not in the same cluster is correspondingly low. The cost function is defined as follows:

H = -\sum_j n_j \cdot Avgsim_j, \quad Avgsim_j = \frac{1}{n_j (n_j - 1)} \sum_{a, b \in A_j} sim(a, b)

where n_j is the size of the jth cluster and Avgsim_j is the average similarity between cluster members. We used the skew divergence as the similarity measure.

3.2.2 Spectral clustering

In SPEC, the similarities W_{ij} are viewed as the weights on the edges ij of a graph G over V. The similarity matrix W is thus the adjacency matrix of G. The degree of a vertex i is d_i = \sum_{j=1}^{N} w_{ij}. A cut between two partitions A and A' is defined as Cut(A, A') = \sum_{m \in A, n \in A'} W_{mn}.

In the MNCut algorithm, the similarity matrix W is transformed into a stochastic matrix P:

P = D^{-1} W    (3.2)

where the degree matrix D is a diagonal matrix with D_{ii} = d_i. It was shown by Meila and Shi (2001) that if P has K leading eigenvectors that are piecewise constant⁵ with respect to a partition I* and their eigenvalues are not zero, then I* minimizes the multiway normalized cut (MNCut):

MNCut(I) = K - \sum_{k=1}^{K} \frac{Cut(I_k, I_k)}{Cut(I_k, I)}    (3.3)

P_{mn} can be interpreted as the transition probability between vertices m and n. The criterion can thus be expressed as:

MNCut(I) = \sum_{k=1}^{K} (1 - P(I_k \to I_k | I_k))    (3.4)

which is the sum of the transition probabilities across different clusters. The criterion finds the partition in which random walks are most likely to remain within the same cluster. In practice, the K leading eigenvectors of P are not piecewise constant, but we can extract the partition by finding the approximately equal elements in the eigenvectors using a clustering algorithm like K-means.

The numerator of MNCut is similar to the cost function of PC. The main differences between the two algorithms are: 1) MNCut takes cross-cluster similarity into account, while PC does not; 2) PC optimizes its cost function using deterministic annealing, whereas SPEC uses eigensystem decomposition. The SPEC algorithm is based on the MNCut algorithm (Meila and Shi, 2001):

Input: Dataset S, number of clusters K
1. Compute the similarity matrix W and degree matrix D.
2. Construct the stochastic matrix P using equation 3.2.
3. Compute the eigenvalues and eigenvectors \{\lambda_n, x_n\}_{n=1}^{N} of P, where \lambda_n \geq \lambda_{n+1}, and form a matrix X = [x_2, ..., x_K] by stacking the eigenvectors in columns.
4. Form a matrix Y from X by normalizing each row of X to unit length: Y_{ij} = X_{ij} / (\sum_j X_{ij}^2)^{1/2}.
5. Treat each row of Y as the transformed feature vector of a verb and cluster the rows into clusters C_1 ... C_K using the K-means algorithm.
Output: Clusters C_1 ... C_K

5 The eigenvector v is piecewise constant with respect to I if v(i) = v(j) for all i, j \in I_k and k \in \{1, 2, ..., K\}.
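A compact NumPy/scikit-learn sketch of this algorithm (a simplified reimplementation for illustration, not the code used in the experiments):

    import numpy as np
    from sklearn.cluster import KMeans

    def mncut_spectral(W, k, n_init=50):
        """Steps 1-5 above: row-normalize W into the stochastic matrix P,
        embed each item with the 2nd..kth leading eigenvectors of P,
        length-normalize the rows and run K-means with restarts."""
        P = W / W.sum(axis=1, keepdims=True)        # P = D^{-1} W (eq. 3.2)
        vals, vecs = np.linalg.eig(P)
        order = np.argsort(-vals.real)               # leading eigenvalues first
        X = vecs[:, order[1:k]].real                 # x_2 ... x_k as columns
        Y = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-length rows
        return KMeans(n_clusters=k, n_init=n_init).fit_predict(Y)

    # Toy similarity matrix with two obvious groups
    W = np.array([[1.0, 0.9, 0.1, 0.1],
                  [0.9, 1.0, 0.1, 0.1],
                  [0.1, 0.1, 1.0, 0.9],
                  [0.1, 0.1, 0.9, 1.0]])
    print(mncut_spectral(W, 2))   # e.g. [0 0 1 1]

The n_init parameter corresponds to running the K-means step multiple times and keeping the lowest-distortion solution, as done in the experiments below.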
3.3 Experimental evaluation

3.3.1 Test sets

We employed two test sets which have been used to evaluate previous work on English verb classification:

T1 The test set of Joanis et al. (2007) provides a classification of 835 verbs into 15 (some coarse, some fine-grained) Levin classes. 11 tests are provided for 2-14 way classifications. We employ the 14-way classification because this corresponds most closely to our target (Levin's fine-grained) classification⁶. We select 586 verbs according to Joanis et al.'s selection criteria, resulting in 10-120 verbs per class. We restrict the class imbalance to 1:1.5⁷. This yields 205 verbs (10-15 verbs per class), which is similar to the subset of T1 employed by Stevenson and Joanis (2003).

T2 The test set of Sun et al. (2008b) classifies 204 verbs into 17 fine-grained Levin classes, so that each class has 12 member verbs.

Table 3.2 shows the classes in T1 and T2. The class names in the first column and the numbers in the second column correspond to classes in Levin's classification. For example, the class 51.3.2 (Run) contains the verbs fly, gallop, glide, jog, march, run, slide, stroll, swim, travel, trot, walk. Joanis et al. (2007) did not include the actual verbs used in the experiment; we therefore list the verbs used in T1 in appendix D.

T1
Object Drop            26.1, 26.3, 26.7
Recipient              13.1, 13.3
Admire                 31.2
Amuse                  31.1
Run                    51.3.2
Sound                  43.2
Light & Substance      43.1, 43.4
Cheat                  10.6
Steal & Remove         10.5, 10.1
Wipe                   10.4.1, 10.4.2
Spray / Load           9.7
Fill                   9.8
Putting                9.1-6
Change of State        45.1, 45.2, 45.3, 45.4

T2
Remove                 10.1
Send                   11.1
Get                    13.5.1
Hit                    18.1
Amalgamate             22.2
Characterize           29.2
Peer                   30.3
Amuse                  31.1
Correspond             36.1
Manner of speaking     37.3
Say                    37.7
Nonverbal expression   40.2
Light                  43.1
Other change of state  45.4
Mode with motion       47.3
Run                    51.3.2
Put                    9.1

Table 3.2: Levin classes in T1 and T2

6 However, the correspondence is not perfect, with half of the classes including two or more of Levin's fine-grained classes.
7 Otherwise, in the case of a large class imbalance, the evaluation measure would be dominated by the classes with a large population.

3.3.2 Data processing

For each verb in T1 and T2, we extracted all the occurrences (up to 10,000) from the raw corpus data gathered originally for constructing the VALEX lexicon (Korhonen et al., 2006a). The data was gathered from five corpora: the BNC (Leech, 1992), the Guardian corpus, the Reuters corpus (Rose et al., 2002), the North American News Text Corpus (Graff, 1995) and the data used for two Text Retrieval Evaluation Conferences⁸ (TREC-4 and TREC-5). The average frequency of the verbs was 1448 in T1 and 2166 in T2, showing that T1 is the sparser dataset.

The data was first processed using the feature extraction module. Table 3.3 shows (i) the total number of features in each feature set and (ii) the average per verb in the resulting lexicons for T1 and T2.

                                 T1              T2
                            total    avg    total    avg
CO              F1          1328     764    743      382
LP (p)          F2          61       37     55       25
LP (all)        F3          2521     526    1481     295
SCF             F4          88       46     86       38
SCF+CO          F5          1466     833    856      422
SCF+POS         F6          319      114    299      87
SCF+P           F7          282      96     273      76
SCF (V)         F8          -        -      92       45
SCF+LP (s)      F9          1747     324    1474     225
SCF+LP (o)      F10         2817     424    2319     279
SCF+LP (all)    F11         4250     649    3515     426
SCF+SP20 (s)    F12         821      235    690      145
SCF+SP20 (o)    F13         792      218    706      135
SCF+SP20 (all)  F14         1333     357    1200     231
SCF+SP30 (s)    F15         977      274    903      202
SCF+SP30 (o)    F16         1026     273    1012     205
SCF+SP30 (all)  F17         1720     451    1640     330

Table 3.3: (i) The total number of features and (ii) the average per verb for all the feature sets. In the feature names, s means the subject slot; o means the object and indirect object slots; p means the preposition; and all indicates information on the subject, object and indirect object slots. Appendix A contains more details on each feature.

We normalized the feature vectors by the sum of their feature values before applying the clustering techniques (a sketch of this step is given below). Since both clustering algorithms have an element of randomness, we ran them multiple times: step 5 of SPEC (K-means) was run 50 times and the result that minimizes the distortion (the sum of distances to the cluster centroids) is reported, while PC was run 20 times and the results were averaged.

8 http://trec.nist.gov/data/docs_eng.html
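The feature normalization is a simple row-wise L1 scaling; a minimal sketch (the restart protocol for K-means corresponds to the n_init parameter in the spectral sketch of section 3.2.2):

    import numpy as np

    def l1_normalize(F):
        """Normalize each verb's feature vector by the sum of its values."""
        return F / F.sum(axis=1, keepdims=True)

    F = np.array([[4.0, 1.0, 0.0],
                  [2.0, 2.0, 6.0]])
    print(l1_normalize(F))   # each row now sums to 1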
3.3.3 Evaluation measures

To facilitate meaningful comparisons, we employed mPUR, ACC and F as used e.g. by Korhonen et al. (2008) and Ó Séaghdha and Copestake (2008). These measures are described earlier in section 2.3.3. The random baseline (BL) is calculated as follows:

BL = 1 / number of classes

3.3.4 Statistical significance test

We performed the one-tailed McNemar's test (McNemar, 1947) on the major findings. The test is widely used in previous NLP (Escudero et al., 2000; Chambers et al., 2007; Guo et al., 2011a) and machine learning studies (Dietterich (1998); Salojärvi et al. (2005)). The result of the test is equivalent to Cochran's Q test (Cochran, 1950) when the number of treatments (the number of clusterings) is two (Tate and Brown, 1970). Since the test requires the response variable to be binary, we extend the test as in Dietterich (1998).

For two clustering results A and B, we record how each verb pair was clustered and construct the contingency table shown in table 3.4. The values in the table are calculated as follows: we convert each clustering result to a set of binary indicators on verb pairs. For example, given verbs v1, v2 and v3, we have the pairs (v1, v2), (v1, v3) and (v2, v3). A pair is considered to be clustered correctly if two verbs that are in the same cluster are also in the same class in the gold standard, or if two verbs that are not in the same cluster are also not in the same class in the gold standard.

Number of verb pairs misclustered in both A and B    Number of verb pairs misclustered in A but not in B
Number of verb pairs misclustered in B but not in A  Number of verb pairs clustered correctly in both A and B

Table 3.4: The example contingency table for McNemar's test

We report the statistical test decision directly, e.g. "significant at p < 0.05" or simply "p < 0.05".
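A sketch of the pairwise conversion and the resulting test statistic, using the continuity-corrected chi-squared form of McNemar's test; the exact extension used in the thesis follows Dietterich (1998) and may differ in detail.

    from itertools import combinations

    def pair_correct(assign, gold, a, b):
        """A pair is correct iff the clustering and the gold standard agree
        on whether the two verbs belong together."""
        return (assign[a] == assign[b]) == (gold[a] == gold[b])

    def mcnemar(assign_a, assign_b, gold):
        """Off-diagonal counts of table 3.4 and the continuity-corrected
        chi-squared statistic (1 degree of freedom)."""
        n01 = n10 = 0
        for u, v in combinations(list(gold), 2):
            ca = pair_correct(assign_a, gold, u, v)
            cb = pair_correct(assign_b, gold, u, v)
            if ca and not cb:
                n10 += 1   # misclustered in B but not in A
            elif cb and not ca:
                n01 += 1   # misclustered in A but not in B
        return (abs(n01 - n10) - 1) ** 2 / (n01 + n10) if n01 + n10 else 0.0

    gold = {"run": 0, "walk": 0, "put": 1, "place": 1}
    a = {"run": 0, "walk": 0, "put": 1, "place": 1}   # perfect clustering
    b = {"run": 0, "walk": 1, "put": 1, "place": 0}   # poor clustering
    print(mcnemar(a, b, gold))   # 2.25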
3.4 Results

3.4.1 Quantitative evaluation

Table 3.5 includes the F-measure results for all feature sets when the two methods (PC and SPEC) are used to cluster the verbs in test sets T1 and T2, respectively.

                                  T1                T2
                             PC      SPEC      PC      SPEC
BL                           7.14    7.14      5.88    5.88
CO              F1           15.62   33.85     17.86   40.94
LP (p)          F2           40.40   38.97     50.98   49.02
LP (all)        F3           42.94   47.50     41.08   74.55
SCF             F4           34.22   36.16     52.33   57.78
SCF+CO          F5           26.43   28.70     19.52   29.10
SCF+POS         F6           36.14   34.75     44.44   46.70
SCF+P           F7           43.57   43.85     63.40   63.28
SCF (V)         F8           -       -         34.08   38.30
SCF+LP (s)      F9           47.72   56.09     65.94   71.65
SCF+LP (o)      F10          43.09   48.43     57.11   73.97
SCF+LP (all)    F11          45.87   54.63     56.30   72.97
SCF+SP20 (s)    F12          46.67   57.75     39.52   71.67
SCF+SP20 (o)    F13          44.95   51.70     40.76   70.78
SCF+SP20 (all)  F14          48.19   55.12     39.68   73.09
SCF+SP30 (s)    F15          45.89   56.10     64.44   80.35
SCF+SP30 (o)    F16          42.01   48.74     52.75   70.52
SCF+SP30 (all)  F17          46.66   52.68     51.07   68.67

Table 3.5: Results on test sets T1 and T2

A number of tendencies can be observed in the results. Firstly, the results for T2 are clearly better than those for T1. Including a higher number of lower-frequency verbs from classes of variable granularity, T1 is probably a more challenging test set than T2. T2 is controlled for the number and frequency of verbs to facilitate cross-class comparisons. While this may contribute to the better results, T2 is a more accurate test set for us in the sense that it offers a better correspondence with our target (fine-grained Levin) classes.

Secondly, the difference between the two clustering methods is clear: the new SPEC outperforms PC on both test sets and across all the feature sets except F2, F6 and F7. The average improvement in F is 10.49 (p < 0.05). The performance of the two methods is still fairly similar with the more basic, less sparse feature sets (F1-F2, F4, F6-F7), but when the more sophisticated feature sets are used (F3, F5, F9-F17) SPEC performs considerably better. This demonstrates that it is the better suited method for high dimensional feature sets.

Comparing the feature sets, the simple co-occurrence based F1 performs significantly better than the random baseline (p < 0.05). F2 and F3, which exploit lexical data in the argument head positions of GRs, prove significantly better than F1 (p < 0.05). F3 yields surprisingly good results on T2: it is the second best feature set on this test set. Also on T1, F3 performs significantly better than the SCF-based feature sets F4-F7 (p < 0.05). This demonstrates the usefulness of lexical data obtained from argument positions in the relevant GRs.

Our basic SCF feature set F4 performs significantly better than the comparable feature set F8 obtained from the VALEX lexicon (p < 0.05); the difference is 19.50 in F-measure. As both lexicons were extracted from the same corpus data, the improvement can be attributed to improved parser and SCF acquisition performance (Preiss et al., 2007).

F5-F7 refine the basic SCF feature set F4 further. F5, which combines SCFs with CO information, proved the best feature set in the supervised verb classification experiment of Li and Brew (2008). In our experiment, F5 produces a significantly lower result than CO and SCF alone (i.e. F1 and F4) using SPEC (p < 0.05). However, our corpus is smaller (Li and Brew used the large Gigaword corpus), our SCFs are different, and our approach is unsupervised, making meaningful comparisons difficult.

F6 combines F4 with information about verb tense. This was not helpful: F6 produces worse results than F4. The difference is not significant on T1 (p = 0.27), but is significant on T2 (p < 0.05). F7, on the other hand, yields better results than F4 on both test sets (p < 0.05). This confirms what previous research has shown: SCFs perform better when parameterized for prepositions.

Looking at our novel feature sets F9-F17: F9-F11 combine the most accurate SCF feature set F4 with the LP-based features F2-F3. Although the feature space becomes sparser, all these feature sets outperform F2-F3 on T1 (p < 0.05). On T2, F3 performs exceptionally well and thus yields a better result than F9-F11, but F9-F11 nevertheless perform significantly better than the best SCF-based feature set F4 alone (p < 0.05). The differences among F9, F10 and F11 are small on T2, but on T1 F9 yields the best performance. It could be that F9 works best for the sparser T1 because it suffers the least from data sparsity (it uses LPs only for the subject relation).

F12-F17 replace the LPs in F9-F11 with semantic SPs. When only 20 clusters are used as SP models, acquired from the smaller sample of (200) argument heads (F12-F14), SPs do not perform better than LPs on T2. A small improvement can be observed on T1, especially with F12, which uses only the subject data (yielding the best F-measure on T1: 57.75%; p < 0.05 when compared to the result of the second best non-SP feature F9).
However, when 30 more fine-grained clusters are acquired from the bigger sample of (500) argument heads (F15-F17), lower results are seen on T1. On T2, on the other hand, F15 yields a dramatic improvement and gives the best performance for this test set: 80.35% F-measure (p < 0.05 when compared to the result of the second best feature F3). The fact that no improvement is observed when using F16 and F17 on T2 could be explained by SPs being stronger for the subject position, which also suffers less from the sparse data problem than e.g. the object position. The fact that no improvement is observed on T1 is likely due to verbs having strong SPs only at the finer-grained levels of Levin's classification; recall that in T1 as many as half of the classes are coarser-grained.

3.4.2 Qualitative evaluation

The best performing feature sets on both T1 and T2 were thus our new SP-based feature sets. We conducted a qualitative analysis of the best 30 SP clusters in the T2 data created using SPEC to find out whether these clusters were really semantic in nature, i.e. whether they captured semantically meaningful preferences. As no gold standard specific to our verb classification task was available, we did a manual cluster analysis using VerbNet (VN) as an aid. In VN, Levin classes are assigned semantic descriptions: the arguments of the SCFs involved in DAs are labelled with thematic roles, some of which are labelled with selectional restrictions.

Of the 30 thematic role types in VN, as many as 20 are associated with the 17 Levin classes in T2. The most frequent role in T2 is agent, followed by theme, location, patient, recipient, and source. Of the 36 possible selectional restriction types, 7 appear in T2, the most frequent ones being +animate and +organization, followed by +concrete, +location, and +communication.

As SP clusters capture selectional preferences rather than restrictions, we examined manually whether the 30 clusters (i) capture semantically meaningful classes, and (ii) are plausible given the VN semantic descriptions/restrictions for the classes in T2. The analysis revealed that all 30 clusters had a predominant, semantically motivated SP supported by the majority of the member nouns.
Although many clusters could be further divided into more specific SPs (and despite the fact that some nouns were clearly misclassified), we were able to assign each cluster a descriptive label characterizing its predominant SP. Table 3.6 shows 20 sample clusters, the SP labels assigned to them, and a number of example nouns in these clusters. When comparing each SP cluster against the VN semantic descriptions/restrictions for T2, we found that each predominant SP was plausible. Also, the SPs frequent in our data were also frequent among the 17 classes according to VN. For example, the many SP clusters labeled as arrangements, issues, ideas and other abstract concepts were also frequent in T2, e.g. among COMMUNICATION (37), CHARACTERISE (29.2), AMALGAMATE (22.2) and other classes.

This analysis showed that the SP models which performed well in verb clustering were semantically meaningful for our task. An independent evaluation using one of the standard datasets available for SP acquisition research (Brockmann and Lapata, 2003) is of course needed to determine how well the acquisition method performs in comparison with other existing methods.

Finally, we evaluated the quality of the verb clusters created using the SP-based features. We found that some of the errors were similar to those seen on T2 when using syntactic features: errors due to polysemy and syntactic idiosyncrasy. However, a new error type clearly due to the SP-based features was detected. A small number of classes got confused because of strongly similar SPs in the subject (agent) position. For example, some PEER (30.3) verbs (e.g. look, peer) were found in the same cluster as SAY (37.7) verbs (e.g. shout, yell) – an error which purely syntactic features do not produce. Such errors were not numerous and could be addressed by developing more balanced SP models across different GRs.

3.5 Discussion and related work

Although features incorporating semantic information about verb SPs make theoretical sense, they have not proved equally promising in previous experiments, which have compared them against syntactic features in verb classification. Joanis et al. (2007) incorporated an 'animacy' feature (a kind of SP) which was determined by classifying e.g. pronouns and proper names in the data into this single SP class. A small improvement was obtained when this feature was used in conjunction with syntactic features in supervised classification.

Joanis (2002) and Schulte im Walde (2006) experimented with more conventional SPs alongside syntactic features in English and German verb classification, respectively. They employed top level WordNet (Miller, 1995) and GermaNet (Kunze and Lemnitzer, 2002) classes as SP models. Joanis (2002) obtained no improvement over syntactic features, whereas Schulte im Walde (2006) obtained an insignificant improvement.

Korhonen et al. (2008) combined SPs with SCFs when clustering biomedical verbs. The SPs were acquired automatically from syntactic slots of SCFs (not from GRs as in our experiment) using PC clustering. A small improvement was obtained using LPs extracted from the same syntactic slots, but the SP clusters offered no improvement. Schulte im Walde et al. (2008) proposed an interesting SP acquisition method which combines EM training and the MDL principle for a verb classification incorporating SPs. However, no comparison against purely syntactic features is provided.
In our experiment, we obtained a considerable improvement over syntactic features, despite using a fully unsupervised approach to both verb clustering and SP acquisition. In addition to the rich syntactic-semantic feature sets, our good results can be attributed to a clustering technique capable of dealing with them. The potential of SPEC for the task was recognised earlier by Brew and Schulte im Walde (2002). Although a different version of the algorithm was employed and applied to German (rather than to English), and although no SP features were used, these earlier experiments did demonstrate the ability of the method to perform well in a high dimensional feature space.

     Method                                      Result
T1   Li et al. 2008          supervised          66.3
     Joanis et al. 2008      supervised          58.4
     Stevenson et al. 2003   semi-supervised     29
                             unsupervised        31
     SPEC                    unsupervised        57.55
T2   Sun et al. 2008         supervised          62.50
                             unsupervised        51.6
     Ó Séaghdha et al. 2008  supervised          67.3
     SPEC                    unsupervised        80.35

Table 3.7: Previous verb classification results

To get an idea of how our performance compares with that of related approaches, we examined works on verb classification (supervised and unsupervised) which were evaluated on the same test sets using comparable evaluation measures. These works are summarized in table 3.7. ACC and F-measure are shown for T1 and T2, respectively. It is important to note that although the gold standards (T1 and T2) employed by these works are the same, the feature extraction methods and the corpora used by each method are different, and therefore the results are not directly comparable. However, they serve to indicate the current best performances on the task.

On T1, the best performing supervised method reported so far is that of Li and Brew (2008), who used Bayesian Multinomial Regression for classification. A range of feature sets integrating COs, SCFs and/or LPs were evaluated; the combination of COs and SCFs gave the best result, shown in the table. Joanis et al. (2007) report the second best supervised result on T1, using Support Vector Machines for classification and features derived from linguistic analysis: syntactic slots, slot overlaps, tense, voice, aspect, and the animacy of NPs. Stevenson and Joanis (2003) report a semi-supervised and an unsupervised experiment on T1. A feature set similar to that of Joanis et al. (2007) was employed (features were selected in a semi-supervised fashion) and hierarchical clustering was used. Our unsupervised method SPEC performs substantially better than the unsupervised method of Stevenson and Joanis, and nearly as well as the supervised approach of Joanis et al. (2007) (note, however, that the different experiments involved different subsets of T1, so they are not entirely comparable).

On T2, the best performing supervised method so far is that of Ó Séaghdha and Copestake (2008), which employs a distributional kernel method to classify SCF features parameterized for prepositions in the automatically acquired VALEX lexicon. Using exactly the same data and feature set, Sun et al. (2008b) obtain a slightly lower result when using a supervised method (Gaussian) and a notably lower result when using an unsupervised method (PC clustering). Our method performs considerably better, outperforming even the supervised method of Ó Séaghdha and Copestake (2008).
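To make the clustering pipeline concrete, the following is a minimal Python sketch of spectral clustering over a verb-by-feature matrix. It follows the standard normalized spectral clustering recipe rather than the exact SPEC variant described in section 3.2.2, and the JSD-based similarity, the toy feature matrix and the library choices are illustrative assumptions, not the implementation used in the experiments.

    import numpy as np
    from sklearn.cluster import KMeans

    def jsd(p, q):
        # Jensen-Shannon divergence between two probability distributions
        m = 0.5 * (p + q)
        def kl(a, b):
            mask = a > 0
            return np.sum(a[mask] * np.log(a[mask] / b[mask]))
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    def spectral_clusters(features, n_clusters, seed=0):
        # features: one row per verb, each row a distribution over e.g. SCFs
        n = features.shape[0]
        W = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                W[i, j] = W[j, i] = np.exp(-jsd(features[i], features[j]))
        np.fill_diagonal(W, 1.0)
        # Normalized graph Laplacian embedding: D^{-1/2} W D^{-1/2}
        d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
        L = W * np.outer(d_inv_sqrt, d_inv_sqrt)
        _, eigvecs = np.linalg.eigh(L)          # eigenvalues in ascending order
        U = eigvecs[:, -n_clusters:]            # k largest eigenvectors
        U = U / np.linalg.norm(U, axis=1, keepdims=True)  # row-normalize
        # Cluster the spectral embedding with k-means
        return KMeans(n_clusters=n_clusters, n_init=100,
                      random_state=seed).fit_predict(U)

The key design point is that clustering is performed in the low-dimensional spectral embedding rather than in the original high dimensional feature space, which is what makes the method robust to the rich feature sets discussed above.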
3.6 Summary

We introduced a new approach to verb clustering which involves the use of (i) rich lexical, SCF and GR data produced by an SCF system, (ii) novel syntactic-semantic feature sets which combine a variety of linguistic information, and (iii) a new variation of SPEC which is particularly suited to the resulting high dimensional feature space. Using this approach, we showed on two well-established test sets that automatically acquired SPs can be highly useful for verb clustering. This result contrasts with most previous works, but is in line with theoretical work on verb classification which relies not only on syntactic but also on semantic features (Levin, 1993).

In addition to the ideas mentioned earlier, future work could look into optimal ways of acquiring SPs for verb classification. Considerable research has been done on SP acquisition, most of which has involved collecting argument headwords from data and generalizing them to WordNet classes. Brockmann and Lapata (2003) have shown that WordNet-based approaches do not always outperform simple frequency-based models, and a number of techniques have been proposed which may offer ideas for refining our current unsupervised approach (Erk, 2007; Bergsma et al., 2008). The number and type (and combination) of GRs for which SPs can be reliably acquired, especially when the data is sparse, also requires further investigation. In addition, it would be interesting to investigate other potentially useful features for verb classification (e.g. named entities and preposition classes) and to explore semi-automatic ML technology and active learning for guiding the classification.

Chapter 4

Hierarchical verb clustering using graph factorization

4.1 Introduction

Most works on verb classification have focussed on acquiring and evaluating flat classifications (Schulte im Walde, 2006; Joanis et al., 2007; Sun et al., 2008b; Li and Brew, 2008; Korhonen et al., 2008; Ó Séaghdha and Copestake, 2008; Vlachos et al., 2009b). Levin's classification is not flat, but taxonomic in nature, which is practical for NLP purposes since applications may differ in terms of the granularity they require from a classification. In this chapter, we experiment with hierarchical Levin-style clustering.

We adopt as our baseline method a well-known hierarchical method – agglomerative clustering (AGG) – which has previously been used to acquire flat Levin-style classifications (Stevenson and Joanis, 2003) as well as hierarchical verb classifications not based on Levin (Ferrer, 2004; Schulte im Walde, 2008). The method has also been used in the related task of noun clustering (Ushioda, 1996; Matsuo et al., 2006; Bassiou and Kotropoulos, 2011).

We then introduce a new method called Hierarchical Graph Factorization Clustering (HGFC) (Yu et al., 2006). This graph-based, probabilistic clustering algorithm has some clear advantages over AGG (e.g. it delays the decision on a verb's cluster membership at any level until a full graph is available, minimising the problem of error propagation) and it has been shown to perform better than several other hierarchical clustering methods in recent comparisons (Yu et al., 2006). The method has been applied to the identification of social network communities (Lin et al., 2008), but, to the best of our knowledge, it has not been used in NLP before. We modify HGFC with a new tree extraction algorithm which ensures a more consistent result, and we propose two novel extensions to it.
The first is a method for automatically determining the tree structure (i.e. the number of clusters to be produced for each level of the hierarchy). This avoids the need to pre-determine the number of clusters manually. The second is the addition of soft constraints to guide the clustering (Vlachos et al., 2009b). This is useful for situations where a partial (e.g. a flat) verb classification is available and the goal is to extend it. (The research reported in this chapter was published in Sun and Korhonen (2011).)

Adopting a set of lexical and syntactic features which have performed well in previous works, we compare the performance of the two methods on test sets extracted from Levin and VerbNet. When evaluated on a flat clustering task, HGFC outperforms AGG and performs very similarly to the best flat clustering method reported on the same test set in chapter 3. When evaluated on a hierarchical task, HGFC performs considerably better than AGG at all levels of the gold standard classification. The constrained version of HGFC performs the best, as expected, demonstrating the usefulness of soft constraints for extending partial classifications. Our qualitative analysis shows that HGFC is capable of detecting novel information not included in our gold standards. The unconstrained version can be used to acquire novel classifications from scratch, while the constrained version can be used to extend existing ones with additional class members, classes and levels of the hierarchy.

4.2 Target classification and test sets

The taxonomy of Levin (1993) classifies over 3000 verbs in 57 top level classes, some of which divide further into subclasses. The extended version of the taxonomy in VerbNet (Kipper-Schuler, 2005) classifies 5757 verbs. Its 5-level taxonomy includes 101 top level and 369 subclasses. We used three gold standards (and corresponding test sets) extracted from these resources in our experiments:

T3: The first gold standard is a flat gold standard which includes 13 classes appearing in Levin's original taxonomy (Stevenson and Joanis, 2003). We included this small gold standard in our experiments so that we could compare the flat version of our method against previously published methods. Stevenson and Joanis (2003) did not include the actual list of verbs used in their experiment. Therefore, we selected 20 verbs from each class which occur at least 100 times in our corpus; this is also the approach used by Stevenson and Joanis. This gave us 260 verbs in total. The actual verbs are listed in appendix D.

T4: The second gold standard is a large, hierarchical gold standard which we extracted from VerbNet as follows: 1) We removed all the verbs that have fewer than 1000 occurrences in our corpus. 2) In order to minimise the problem of polysemy, we assigned each verb to the class which, according to VerbNet, corresponds to its predominant sense in WordNet (Miller, 1995). 3) In order to minimise the sparse data problem with very fine-grained classes, we converted the resulting classification into a 3-level representation, so that the classes at the 4th and 5th level were combined; for example, the sub-classes of Declare verbs (numbered as 29.4.1.1.{1,2,3}) were combined into 29.4.1 (a sketch of this truncation is given below). 4) The classes that have fewer than 5 members were discarded. The total number of verb senses in the resulting gold standard is 1750, which is 33.2% of the verbs in VerbNet.
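As an illustration of step 3 (and step 4), the sketch below truncates VerbNet-style class identifiers to three levels and drops small classes. The input format (verb, class-ID pairs) and the helper names are hypothetical conveniences for illustration, not the actual extraction code used to build T4.

    from collections import defaultdict

    def truncate_class(class_id, depth=3):
        # Truncate a VerbNet-style class ID, e.g. '29.4.1.1.2' -> '29.4.1'
        return '.'.join(class_id.split('.')[:depth])

    def build_gold_standard(verb_classes, min_members=5):
        # verb_classes: (verb, class_id) pairs, one per verb sense, already
        # filtered to verbs with at least 1000 corpus occurrences (step 1)
        classes = defaultdict(set)
        for verb, class_id in verb_classes:
            classes[truncate_class(class_id)].add(verb)
        # Discard classes with fewer than min_members verbs (step 4)
        return {c: vs for c, vs in classes.items() if len(vs) >= min_members}

    # Toy example: sub-classes 29.4.1.1.{1,2} merge into 29.4.1
    pairs = [('declare', '29.4.1.1.1'), ('proclaim', '29.4.1.1.2'),
             ('believe', '29.4.1.1.1'), ('rule', '29.4.1.1.2'),
             ('judge', '29.4.1'), ('deem', '29.4.1')]
    print(build_gold_standard(pairs))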
T4 has 51 top level, 117 second level, and 133 third level classes.

T5: The third gold standard is a subset of T4 in which singular classes (top level classes which do not divide into subclasses) are removed. This gold standard was constructed to enable a proper evaluation of the constrained version of HGFC (introduced in the following section), where we want to compare the impact of constraints across several levels of classification. T5 provides a classification of 357 verbs into 11 top level, 14 second level, and 32 third level classes.

For each verb appearing in T3-T5, we extracted all the occurrences (up to 10,000) from the raw corpus data used for constructing VALEX (Korhonen et al., 2006a), including the BNC (Leech, 1992), the Guardian corpus, the Reuters corpus (Rose et al., 2002), the North American News Text Corpus (Graff, 1995) and the data used for two Text Retrieval Evaluation Conferences (TREC-4 and TREC-5).

4.3 Method

4.3.1 Features and feature extraction

In the experiments we used the features that proved the best in the earlier experiments reported in chapter 3: F-SCF, F-SCF+PP(B) and F-SCF+LP(A). The description of these features and the feature extraction methods are given in table 2.4 (all the features used in this thesis are summarized in appendix A). However, we only used the best syntactic features here. Although the semantic feature – verb SPs – was the best feature (when used in combination with syntactic features) in chapter 3, we left it for future work because we noticed that different levels of classification are likely to require semantic features at different granularities.

4.3.2 Clustering

We introduce the agglomerative clustering (AGG) and Hierarchical Graph Factorization Clustering (HGFC) methods in the following two subsections. The subsequent two subsections present our extensions to HGFC: (i) automatically determining the cluster structure and (ii) adding soft constraints to guide clustering performance.

Agglomerative clustering

AGG treats each verb as a singleton cluster and then successively merges the two closest clusters until all the clusters have been merged into one. We used SciPy's implementation of the algorithm (Oliphant, 2007); a minimal sketch is given below. The cluster distance is measured using a linkage criterion. We experimented with four commonly used linkage criteria: Single, Average, Complete and Ward's (Ward Jr., 1963). Ward's criterion, which measures the increase in variance after two clusters are merged, performed the best and was used in all the experiments in this chapter.

The output of AGG tends to have an excessive number of levels. Cut-based methods (Wu and Leahy, 1993; Shi and Malik, 2000) are frequently applied to extract a simplified view. We followed previous verb clustering works and cut the AGG hierarchy manually.

AGG suffers from two problems. The first is error propagation: when a verb is misclassified at a lower level, the error propagates to all the upper levels. The second is local pairwise merging, i.e. the fact that only two clusters can be combined at any level. For example, in order to group clusters representing Levin classes 9.1, 9.2 and 9.3 into a single cluster representing class 9, the method has to produce intermediate clusters, e.g. 9.{1,2} and 9.3. Such clusters do not always have a semantic interpretation.
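For concreteness, the AGG baseline can be reproduced in a few lines with SciPy's hierarchical clustering module; the random feature matrix and the automatic cut below are illustrative assumptions (as noted above, the cut was in fact made manually in our experiments).

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Illustrative input: one row of feature values per verb, e.g.
    # relative frequencies of SCFs (F-SCF+PP(B) in the experiments).
    features = np.random.rand(260, 163)   # 260 verbs as in T3, dummy features

    # Agglomerative clustering with Ward's linkage criterion, which
    # measures the increase in variance after two clusters are merged.
    Z = linkage(features, method='ward')

    # Cut the dendrogram into the number of gold standard classes (13 for T3).
    labels = fcluster(Z, t=13, criterion='maxclust')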
Although such intermediate clusters can be removed using a cut-based method, this requires a pre-defined cut-off value which is difficult to set (Stevenson and Joanis, 2003). In addition, a significant amount of information is lost in pairwise clustering. In the above example, only the clusters 9.{1,2} and 9.3 are considered, while the alternative clusters 9.{1,3} and 9.2 are ignored. Ideally, information about all the possible intermediate clusters should be aggregated, but this is intractable in practice.

Hierarchical Graph Factorization Clustering

Our new method HGFC derives a probabilistic bipartite graph from the similarity matrix (Yu et al., 2006). The local and global clustering structures are learned via the random walk properties of the graph. The method does not suffer from the above problems with AGG. Firstly, there is no error propagation, because the decision on a verb's membership at any level is delayed until the full bipartite graph is available and a tree structure can be extracted from it by aggregating probabilistic information from all the levels. Secondly, the bipartite graph enables the construction of a hierarchical structure without any intermediate classes. For example, we can group classes 9.{1,2,3} directly into class 9.

We calculated the similarity matrix using JSD instead of skew divergence. Compared to JSD, skew divergence has an extra parameter (the smoothing factor), whose value is difficult to set automatically, as there is no labelled training data available for parameter estimation. The details of JSD and other similarity measures are summarized in appendix B.

Given a set of verbs $V = \{v_n\}_{n=1}^{N}$, we compute a similarity matrix $W$ where $W_{ij} = \exp(-d_{jsd}(v_i, v_j))$. $W$ can be encoded by an undirected graph $G$ (Figure 4.1(a)), where the verbs are mapped to vertices and $W_{ij}$ is the edge weight between vertices $i$ and $j$. The graph $G$ and the cluster structure can be represented by a bipartite graph $K(V, U)$, where $V$ are the vertices on $G$ and $U = \{u_p\}_{p=1}^{m}$ represent the $m$ hidden clusters. For example, looking at Figure 4.1(b), $V$ on $G$ can be grouped into three clusters $u_1$, $u_2$ and $u_3$. The matrix $B$ denotes the $n \times m$ adjacency matrix, with $b_{ip}$ being the connection weight between vertex $v_i$ and cluster $u_p$. Thus, $B$ represents the connections between clusters at an upper and a lower level of clustering. A flat clustering algorithm can be induced by computing $B$.

[Figure 4.1: (a) An undirected graph G representing the similarity matrix; (b) the bipartite graph showing three clusters on G; (c) the induced clusters U; (d) the new graph G1 over clusters U; (e) the new bipartite graph over G1]

The bipartite graph $K$ also induces a similarity $W'$ between $v_i$ and $v_j$:

    $w'_{ij} = \sum_{p=1}^{m} \frac{b_{ip} b_{jp}}{\lambda_p} = (B \Lambda^{-1} B^T)_{ij}$, where $\Lambda = diag(\lambda_1, ..., \lambda_m)$.

Therefore, $B$ can be found by approximating the similarity matrix $W$ of $G$ using $W'$ derived from $K$. Given a distance function $\zeta$ between two similarity matrices, $B$ approximates $W$ by minimizing the cost function $\zeta(W, B \Lambda^{-1} B^T)$. The coupling between $B$ and $\Lambda$ is removed by setting $H = B \Lambda^{-1}$:

    $\min_{H,\Lambda} \zeta(W, H \Lambda H^T)$  s.t.  $\sum_{i=1}^{n} h_{ip} = 1$    (4.1)

We use the divergence distance: $\zeta(X, Y) = \sum_{ij} (x_{ij} \log \frac{x_{ij}}{y_{ij}} - x_{ij} + y_{ij})$.
Yu et al. (2006) showed that this cost function is non-increasing under the update rules:

    $\tilde{h}_{ip} \propto h_{ip} \sum_{j} \frac{w_{ij}}{(H \Lambda H^T)_{ij}} \lambda_p h_{jp}$  s.t.  $\sum_{i} \tilde{h}_{ip} = 1$    (4.2)

    $\tilde{\lambda}_p \propto \lambda_p \sum_{ij} \frac{w_{ij}}{(H \Lambda H^T)_{ij}} h_{ip} h_{jp}$  s.t.  $\sum_{p} \tilde{\lambda}_p = \sum_{ij} w_{ij}$    (4.3)

$w_{ij}$ can be interpreted as the probability of a direct transition between $v_i$ and $v_j$: $w_{ij} = p(v_i, v_j)$ when $\sum_{ij} w_{ij} = 1$. Similarly, $b_{ip}$ can be interpreted probabilistically, which gives a similarity between clusters:

    $p(u_p, u_q) = \sum_{i=1}^{n} \frac{b_{ip} b_{iq}}{d_i} = (B^T D^{-1} B)_{pq}$    (4.4)

where $D = diag(d_1, ..., d_n)$ and $d_i = \sum_{p=1}^{m} b_{ip}$.

$p(u_p, u_q)$ is the similarity between the clusters. It takes into account a weighted average of contributions from all the data. This is different from the linkage method, where only the data from two clusters are considered. Given the cluster similarity $p(u_p, u_q)$, we can construct a new graph $G_1$ (Figure 4.1(d)) with the clusters $U$ as vertices. The clustering algorithm can then be applied again (Figure 4.1(e)). This process can go on iteratively, leading to a hierarchical graph. (The hierarchical graph is constructed only for inducing the tree of clusters; it is not motivated by any linguistic phenomenon such as verb polysemy. Although HGFC is a soft clustering approach, we argue that soft clustering is not a proper model for polysemy; see section 8.2.2.)

Additional steps need to be performed in order to extract a tree from the hierarchical graph. Yu et al. (2006) perform the extraction via a propagation of probabilities from the bottom level clusters. For a verb $v_i$, the probability of assigning it to cluster $v_p^{(l)}$ at level $l$ is given by:

    $p(v_p^{(l)} | v_i) = \sum_{V_{l-1}} \cdots \sum_{V_1} p(v_p^{(l)} | v^{(l-1)}) \cdots p(v^{(1)} | v_i) = (D_1^{-1} B_1 D_2^{-1} B_2 D_3^{-1} B_3 \cdots D_l^{-1} B_l)_{ip}$    (4.5)

Algorithm 1 HGFC algorithm (Yu et al., 2006)
Require: N verbs V, number of clusters m_l for L levels
  Compute the similarity matrix W_0 from V
  Build the graph G_0 from W_0, and set m_0 <- n
  for l = 1 to L do
    Factorize G_{l-1} to obtain the bipartite graph K_l with adjacency matrix B_l (eq. 4.1, 4.2 and 4.3)
    Build a graph G_l with similarity matrix W_l = B_l^T D_l^{-1} B_l according to equation 4.4
  end for
  return B_L, B_{L-1}, ..., B_1

This method may not extract a consistent tree structure, because the cluster membership at lower levels does not constrain the upper level membership. This prevented us from extracting a Levin-style hierarchical classification in our initial experiments. For example, where two verbs were grouped together at a lower level, they could belong to separate clusters at an upper level. We therefore propose a new tree extraction algorithm (Algorithm 2). The new algorithm starts from the top level bipartite graph, and generates consistent labels for each level by taking into account the tree constraints set at upper levels.

Algorithm 2 Tree extraction algorithm for HGFC
Require: N verbs, (B_l, m_l) on each level, for L levels
  On the top level L, collect the labels T^L (eq. 4.5)
  Define C to be an (m_{L-1} x m_L) zero matrix; set C_{ij} <- 1 where (i, j) = argmax_{i,j} {(B_L)_{ij}}
  for l = L-1 down to 1 do
    for i = 1 to N do
      Compute p(v_p^{(l)} | v_i) for each cluster p (eq. 4.5)
      t_i^l = argmax_p { p(v_p^{(l)} | v_i) | p = 1, ..., m_l, C_{p, t_i^{l+1}} != 0 }
    end for
    Redefine C to be an (m_{l-1} x m_l) zero matrix; set C_{ij} <- 1 where (i, j) = argmax_{i,j} {(B_l)_{ij}}
  end for
  return tree-consistent labels T^L, T^{L-1}, ..., T^1

Automatically determining the number of clusters for HGFC

HGFC needs the number of levels and the number of clusters at each level as input. However, this information is not always available (e.g. when the goal is to actually learn this information automatically).
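Before turning to that, the factorization step itself (eqs. 4.1-4.3) is compact enough to sketch in code. The following is a minimal NumPy rendering of one HGFC factorization for a given similarity matrix W and cluster count m; the random initialization, fixed iteration count and numerical smoothing are illustrative choices, not the exact implementation used in the experiments.

    import numpy as np

    def factorize(W, m, n_iter=200, seed=0):
        # Approximate W by H diag(lam) H^T under the divergence distance
        # (eq. 4.1), using the multiplicative updates of eqs. 4.2 and 4.3.
        rng = np.random.default_rng(seed)
        n = W.shape[0]
        H = rng.random((n, m))
        H /= H.sum(axis=0)                  # enforce sum_i h_ip = 1
        lam = np.full(m, W.sum() / m)
        for _ in range(n_iter):
            S = (H * lam) @ H.T + 1e-12     # (H Lam H^T)_ij
            R = W / S                       # ratio w_ij / (H Lam H^T)_ij
            H = H * (R @ (H * lam))         # eq. 4.2, then re-normalize
            H /= H.sum(axis=0)
            lam = lam * np.einsum('ij,ip,jp->p', R, H, H)   # eq. 4.3
            lam *= W.sum() / lam.sum()
        return H * lam                      # adjacency matrix B = H Lam

    # The similarity matrix over the induced clusters (eq. 4.4), which
    # serves as the input graph for the next level of the hierarchy:
    # B = factorize(W, m)
    # W_next = B.T @ np.diag(1.0 / B.sum(axis=1)) @ B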
To address the latter problem, we propose a method for inferring the cluster structure from the data. As shown in figure 4.1, a similarity matrix $W$ models one-hop transitions that follow the links from vertices to their neighbours. A walker can also reach other vertices via multi-hop transitions. According to the chain rule of the Markov process, the multi-hop transitions induce a decaying similarity function on the graph (Yu et al., 2006). After $t$ transitions, the similarity matrix becomes:

    $W_t = W_{t-1} D_0^{-1} W_0$

Yu et al. (2006) proved the correspondence between the HGFC levels $l$ and the random walk time: $t = 2^{l-1}$. So the vertices at level $l$ induce a similarity matrix of verbs after $t$-hop transitions. The decaying similarity function captures the different scales of clustering structure in the data (Azran and Ghahramani, 2006b). The upper levels have a smaller number of clusters, which represent a more global structure. After several levels, all the verbs are expected to be grouped into one cluster. The number of levels and the number of clusters at each level can thus be learned automatically.

We therefore propose a method that uses the decaying similarity function to learn the hierarchical clustering structure. One simple modification to algorithm 1 is to set the number of clusters at level $l$ ($m_l$) to $m_{l-1} - 1$, where $m$ denotes the number of clusters that have at least one member according to eq. 4.5. We start by treating each verb as a cluster at the bottom level. The algorithm stops when all the data points have been merged into one cluster. The increasingly decaying similarity causes many clusters to have zero members, especially at lower levels; these are pruned in the tree extraction.

Adding constraints to HGFC

The basic version of HGFC makes no prior assumptions about the classification. It is useful for learning novel verb classifications from scratch. However, when wishing to extend an existing classification (e.g. VerbNet), it may be desirable to guide the clustering on the basis of information that is already known. We propose a constrained version of HGFC which makes use of labels at the bottom level to learn upper level classifications. We do this by adding soft constraints to the clustering, following Vlachos et al. (2009b). We modify the similarity matrix $W$ as follows: if two verbs have different labels ($l_i \neq l_j$), the similarity between them is decreased by a factor $a < 1$. We set $a$ to 0.5 in the experiments. The resulting tree is generally consistent with the original classification, and the influence of the underlying data (domain or features) is reduced according to $a$. As discussed in section 2.3.1, adding the constraints is a form of semi-supervised learning that enables us to make use of the kind of vague prior knowledge we have available, e.g. knowledge about pairs of verbs that cannot be in the same class, even when we have no idea how they should be grouped.

4.4 Experimental evaluation

We applied the clustering methods introduced in section 4.3 to the test sets described in section 4.2 and evaluated them both quantitatively and qualitatively, as described in the subsequent sections.

4.4.1 Evaluation methods

We used ACC and Radj to evaluate the results on the flat test set T3 (see section 4.2 for details of T3-T5).
Since NMI can compare clusterings with different numbers of clusters, and since we also want to compare to the F-measure used in previous experiments, we used NMI and F to evaluate the hierarchical clustering results on T4 and T5. Finally, we supplemented the quantitative evaluation with a qualitative evaluation of the clusters produced by the different methods. We used McNemar's test (McNemar, 1947), as described in section 3.3.4, to verify the statistical significance of the major findings.

4.4.2 Quantitative evaluation

We first evaluated AGG and the basic (unconstrained) HGFC on the small flat test set T3. The main purpose of this evaluation was to compare the results of our methods against previously published results on the same test set. The number of clusters (K) and levels (L) were inferred automatically for HGFC, as described in section 4.3.2. However, to make the results comparable with previously published ones, we cut the resulting hierarchy at the level of closest match (12 clusters) to the K (13) in the gold standard. For AGG, we cut the hierarchy at 13 clusters.

Method                               ACC           Radj
HGFC                                 41.2 (+8.5)   17.4 (+7.5)
AGG (reproduced)                     32.7           9.9
AGG (Stevenson and Joanis, 2003)     31.0           9.0

Table 4.1: Comparison against Stevenson and Joanis (2003)'s result on T3 (using similar features).

Table 4.1 shows our results and the results of Stevenson and Joanis (2003) on T3 when employing AGG with Ward as the linkage criterion. In this experiment, we used the same feature set as Stevenson and Joanis (2003) (F-SCF+PP(B)) and were therefore able to reproduce their AGG result with a difference smaller than 2%. When using this simple feature set, HGFC outperforms the best performing AGG significantly: by 8.5% in ACC and 7.5% in Radj (p < 0.05).

Nc    Nl    HGFC unconstrained     AGG
            NMI      F             NMI      F
130   133   57.31    36.65         54.22    32.62
114   117   54.67    37.96         51.35    32.44
 50    51   37.75    40.00         32.61    32.78

Table 4.2: Performance on T4 using a pre-defined tree structure.

Nc   Nl   HGFC unconstrained   HGFC constrained   AGG
          NMI      F           NMI      F         NMI      F
31   32   51.65    42.01       91.47    92.07     49.70    40.30
15   14   42.75    47.70       82.16    82.80     39.19    43.69
11   11   38.91    51.17       71.69    75.00     34.88    44.80

Table 4.3: Performance on T5 using a pre-defined tree structure.

We also compared HGFC against the best clustering method reported on T3 to date – SPEC, introduced in chapter 3 – using the feature set F-SCF+LP(A). HGFC obtains an F of 49.93% on T3, which is 5% lower than the SPEC result (p < 0.05). The difference comes from the tree consistency requirement: when HGFC is forced to produce a flat clustering (a one-level tree), it achieves an F of 52.55%, which is very close to the performance of SPEC.

We then evaluated our methods on the hierarchical test sets T4 and T5. We used the best-performing feature set F-SCF+LP(A) for these tasks. In the first set of experiments, we pre-defined the tree structure for HGFC by setting L to 3 and the K at each level to the K in the hierarchical gold standard. The hierarchy produced by AGG was cut into 3 levels according to the Ks in the gold standard. This enabled direct evaluation of the results against the 3-level gold standards using both NMI and F. The results are reported in tables 4.2 and 4.3. In these tables, Nc is the number of clusters in the HGFC clustering, while Nl is the number of classes in the gold standard (the two do not always correspond perfectly because a few clusters have zero members).
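Before turning to the results, note that standard implementations of these scores are readily available; the toy sketch below uses scikit-learn's versions of Radj and NMI. This is an assumption about tooling rather than the thesis's actual implementation, and NMI in particular has several normalization variants.

    from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

    # gold[i] and pred[i] are the gold class and the induced cluster of verb i
    gold = [0, 0, 0, 1, 1, 2, 2, 2]
    pred = [1, 1, 0, 0, 0, 2, 2, 2]

    print(adjusted_rand_score(gold, pred))           # Radj
    print(normalized_mutual_info_score(gold, pred))  # NMI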
Table 4.2 compares the results of the unconstrained version of HGFC against those of AGG on our largest test set T4. As with T3, HGFC outperforms AGG significantly (p < 0.05). The benefit can now be seen at three different levels of the hierarchy. On average, HGFC outperforms AGG by 3.5% in NMI and 4.8% in F. The difference between the methods becomes clearer when moving towards the upper levels of the hierarchy.

Table 4.3 shows the results of both the unconstrained and constrained versions of HGFC and those of AGG on the test set T5 (where singular classes are removed to enable a proper evaluation of the constrained method). The results are generally better on this test set than on T4, which is to be expected since T5 is a refined subset of T4. (NMI is higher on T4, however, because NMI has a higher baseline for a larger number of clusters (Vinh et al., 2009); NMI is not ideal for comparing the results of T4 and T5.)

        T4                       T5
Nc     Nl     HGFC        Nc     Nl     HGFC
148    133    53.26       64     32     54.91
 97    117    49.85       35     32     50.83
 46     51    33.55       20     14     44.02
 19     51    25.80       10     14     34.41
  9     51    19.17        6     11     32.27
  3     51    13.06

Table 4.4: NMI of unconstrained HGFC when trees for T4 and T5 are inferred automatically.

Recall that the constrained version of HGFC learns the upper levels of classification on the basis of soft constraints set at the bottom level, as described earlier in section 4.3.2. As a consequence, NMI and F are both greater than 90% at the bottom level, and the results at the top level are notably lower, because the impact of the constraints degrades the further one moves from the bottom level. Yet the relatively good result across all levels (p < 0.05) shows that the constrained version of HGFC can be a useful method for extending the hierarchical structure of known classifications.

Finally, table 4.4 shows the results for the unconstrained HGFC on T4 and T5. Here, the tree structure is not pre-defined but inferred fully automatically, as described in section 4.3.2. Six levels are learned for T4 and five for T5. The number of clusters produced ranges from 3 to 148 for T4 and from 6 to 64 for T5. We can see that the automatically detected cluster numbers distribute evenly across the different levels. The scale of the clustering structure is more complete here than in the gold standards. In the table, Nc indicates the number of clusters in the inferred tree, while Nl indicates the closest match to the number of classes in the gold standard. This evaluation is not fully reliable because the match between the gold standard and the clustering is poor at some levels of the hierarchy. However, it is encouraging to see that the results do not drop dramatically until the match between the two is really poor.

4.4.3 Qualitative evaluation

To gain better insight into the performance of HGFC, we conducted further qualitative analysis of the clusters produced for T5 by the two versions of this method. We focussed on the top level of 11 clusters (in the evaluation against the hierarchical gold standard, see table 4.3), as the impact of soft constraints is the weakest for the constrained method at this level.

As expected, the constrained HGFC kept many individual verbs belonging to the same VerbNet subclass together (e.g. the verbs enjoy, hate, disdain, regret, love, despise, detest, dislike and fear for the class 31.2.1), so that most clusters simply group lower level classes and their members together.
Three nearly clean clusters were produced which only include sub-classes of the same class (e.g. 31.2.0 and 31.2.1, which both belong to 31.2 Admire verbs). However, the remaining 8 clusters group together sub-classes (and their members) belonging to unrelated parent classes. Interestingly, 6 of these make both syntactic and semantic sense. For example, several 37.7 Say verbs and 29.5 Conjecture verbs are found together. These verbs share the meaning of communication and take similar sentential complements.

In contrast, none of the clusters produced by the unconstrained HGFC represents a single VerbNet class. The majority represent a high number of classes and fewer members per class. Yet many of the clusters make syntactic and semantic sense. A good example is a cluster which includes member verbs from 9.7 Spray/Load verbs, 21.2 Carve verbs, 51.3.1 Roll verbs, and 10.4 Wipe verbs. The verbs in this cluster share the meaning of a specific type of motion and show similar syntactic behaviour.

A thorough Levin-style investigation, particularly of the unconstrained method, would require looking at shared diathesis alternations between cluster members. We left this for future work. However, the analysis we conducted confirmed that the constrained method could indeed be used for extending known classifications, while the unconstrained method is required for acquiring novel classifications from scratch. The errors in the clusters produced by both methods were mostly due to syntactic idiosyncrasy and the lack of semantic information in this clustering.

4.5 Discussion and conclusion

We have introduced a new graph-based method, HGFC, for hierarchical verb clustering which avoids some of the problems (e.g. error propagation, pairwise cluster merging) reported with the frequently used AGG method. We modified HGFC with a new tree extraction algorithm so that it produces a consistent tree structure for clustering, and proposed two extensions to it which make it even more suitable for our task. The first involves automatically determining the number of clusters to be produced, which is useful when this is not known in advance. The second involves adding soft constraints to guide the clustering, which is useful when aiming to extend an existing classification.

The results reported are promising. On a flat test set (T3), the unconstrained version of HGFC outperforms AGG and performs very similarly to the best current flat clustering method (SPEC) evaluated on the same dataset. On the hierarchical test sets (T4 and T5), the unconstrained and constrained versions of HGFC outperform AGG clearly at all levels of classification. The constrained version of HGFC detects the missing hierarchy in the existing gold standards with high accuracy. When the number of clusters and levels is learned automatically, the unconstrained method produces a multi-level hierarchy. Our evaluation against a 3-level gold standard shows that such a hierarchy is fairly accurate. Finally, the results of our qualitative evaluation show that both the constrained and unconstrained versions of HGFC are capable of learning valuable novel information not included in the gold standards.
Previous work on Levin-style verb classification has mostly focussed on flat classifications, using methods suitable for flat clustering (Schulte im Walde, 2006; Joanis et al., 2007; Sun et al., 2008b; Li and Brew, 2008; Korhonen et al., 2008; Ó Séaghdha and Copestake, 2008; Vlachos et al., 2009b). However, some works have employed hierarchical clustering as a method to infer a flat clustering. For example, Schulte im Walde and Brew (2001) employed AGG to initialize K-Means clustering for German verbs; this gave better results than random initialization. Stevenson and Joanis (2003) used AGG for flat clustering on T3. They cut the hierarchy at the number of classes in the gold standard and found that it is difficult to automatically determine a good cut-off. Our evaluation in the previous section shows that HGFC outperforms their implementation of AGG.

AGG was also used by Ferrer (2004), who performed hierarchical clustering of 514 Spanish verbs. The results were evaluated against a hierarchical gold standard resembling Levin's classification in English (Vázquez et al., 2000). An Radj of 0.07 was reported for a 15-way classification, which is comparable to the result of Stevenson and Joanis (2003).

Hierarchical clustering has also been performed for the related task of semantic verb classification. For example, Basili et al. (1993) identified the problems of AGG and applied a conceptual clustering algorithm (Fisher, 1987) to Italian verbs. They used semi-automatically acquired semantic roles and concept types as features. No quantitative results were reported; the qualitative evaluation shows that the resulting clusters are very fine-grained.

Schulte im Walde (2008) performed hierarchical clustering of German verbs using human verb associations as features and AGG as the method. They focussed on two small collections of 56 and 104 verbs and evaluated the results against flat gold standards extracted from GermaNet (Kunze and Lemnitzer, 2002) and German FrameNet (Erk et al., 2003), respectively. They reported an F of 62.69% for the 56 verbs, and an F of 34.68% for the 104 verbs.

In the future, this line of research could be extended in several directions. One would be to try to determine optimal features for different levels of clustering. For example, general syntactic features (e.g. SCF) may perform best at the top levels of a hierarchy, while more specific or refined features (e.g. SCF+pp) may be optimal at lower levels. Another idea is to investigate incorporating semantic features, like verb SPs, in our feature set. It is likely that different levels of clustering require more or less specific SPs; one way to obtain the latter is hierarchical clustering of relevant noun data.

In addition, unconstrained HGFC could be applied to specific domains to investigate its capability to learn novel, previously unknown classifications. As for the constrained version of HGFC, a larger scale experiment on the VerbNet data could be conducted to investigate what kind of upper level hierarchy it can propose for this resource (which currently has over 270 top level classes).

Finally, HGFC could be compared to other hierarchical clustering methods that are relatively new to NLP but have proved promising in other fields, including Bayesian Hierarchical Clustering (Heller and Ghahramani, 2005; Teh et al., 2008) and the method of Azran and Ghahramani (2006a) based on SPEC.
Chapter 5

Verb classification in the biomedical domain

5.1 Introduction

In recent years, the application of NLP techniques to biomedicine has become increasingly popular due to the urgent need for techniques which can automatically process information in the growing volume of literature in this field. Remarkable progress has been made in many areas of BIO-NLP, including information retrieval, information extraction and basic text processing (e.g. POS-tagging and parsing). The current challenge is to improve these techniques with richer and deeper analysis. Large scale lexical resources which specify the syntax and semantics of words are needed for this (Ananiadou and McNaught, 2005). General lexical resources such as WordNet, VerbNet and Comlex provide only limited coverage of biomedical words, and manually developed domain-specific lexical resources (e.g. the UMLS SPECIALIST lexicon (Browne et al., 2003)) are insufficient, in particular for verbs, and costly to extend.

Automatically acquiring or updating lexical information from corpora is a better approach for a rapidly developing domain such as biomedicine. The automatic approach is cost-effective and can easily handle sub-domain variation (i.e. the fact that the sub-domains of biomedicine differ in their lexical characteristics (Lippincott et al., 2011)). It can also gather statistical information, which is highly useful for biomedical NLP applications but difficult to collect by manual means.

A few large lexical resources exist for biomedicine, such as the UMLS metathesaurus (Nelson et al., 2002) (e.g. MeSH (Lipscomb, 2000)). Such resources mainly focus on biomedical concepts, which are nouns. A few lexical resources cover verbs in biomedical texts. The UMLS SPECIALIST lexicon (Browne et al., 2000) includes both scientific and biomedical vocabulary, with a particular emphasis on medical and health-related vocabulary. It was created manually by lexicographers. It contains a small number of syntactic complementation patterns for verbs, but there is no statistical information on the usage of the patterns. The BioLexicon (Sasaki et al., 2008) is a corpus-driven lexical resource which contains syntactic and semantic frame information for verbs. The grammatical frames of verbs are acquired from the output of the Enju (Miyao et al., 2008) deep syntactic parser in the E. coli subdomain, so only a limited number of verbs and frames are represented. There are also smaller corpus-driven lexical resources, BioFrameNet (Dolbey et al., 2006) and PASBio (Wattarujeekrit et al., 2004), but e.g. PASBio covers only 30 verbs.

The work presented in this chapter is most closely related to the work of Korhonen et al. (2006b, 2008), because it deals with the same task: Levin-style biomedical verb classification. Biomedical texts have domain-specific verb classes. Some of them capture senses that only or mainly appear in biomedical texts. For example, verbs in the ACTIVATE class (e.g. activate / up-regulate / induce / stimulate) (Korhonen et al., 2008) take similar SCFs and SPs, and share a similar underlying predicate-argument structure (e.g. PROTEINS: TP53 ACTIVATE GENES: CIP1). Others capture general or general-scientific senses (e.g. DEMONSTRATE verbs). Korhonen et al.
(2006b, 2008) created a three-level gold standard containing 192 verbs and 50 fine-grained classes for biomedicine and, using biomedical journals as corpus data, performed a preliminary verb clustering experiment using pairwise clustering (Puzicha et al., 2000). Korhonen et al. (2008) took this work further and systematically compared a range of syntactic and semantic features for the task. These works showed that despite the challenging nature of biomedical texts (e.g. frequent use of passives, anaphora and long, embedded sentences) and the fact that the unsupervised lexical acquisition did not rely on the best performing, lexicalised parsers in biomedicine, it was still possible to obtain good results, because biomedical texts tend to be quite uniform in terms of word senses. The sense uniformity results in cleaner features, e.g. clearer selectional preferences, which can aid clustering.

Using the same data, feature types and gold standard as those employed by Korhonen et al. (2006b, 2008), we investigate how our clustering methods perform on the biomedical data in comparison with those used in previous works. In the previous work of Korhonen et al. (2008), PC was the best performing clustering method. In our earlier experiment on general English data (chapter 3), SPEC outperformed PC. In this experiment, we investigate the performance of SPEC and HGFC on the biomedical data.

Korhonen et al. (2008) produced a three-level hierarchical result for biomedical data. However, the resulting hierarchy is not a tree but a graph. In other words, verbs in the same cluster can be divided into different clusters at upper levels. This does not match the tree-consistent structure of Levin-style verb classification, and it also makes the resulting hierarchy difficult for humans to interpret. We showed in chapter 4 how our HGFC method produces a tree-consistent hierarchical verb classification for general language; here we investigate how it performs in the biomedical domain.

5.2 Classification methods

We used the SPEC and HGFC clustering methods described earlier in sections 3.2 and 4.3.2, respectively. The JSD similarity measure is employed for calculating the similarity matrix. The details of JSD, with a comparison to other similarity measures, are given in appendix B.

5.3 Data

5.3.1 Test Verbs and Gold Standard

We employed in our experiments the same gold standard as that used by Korhonen et al. (2006b, 2008). The gold standard includes 192 test verbs (typically frequent verbs in biomedical journal articles) classified at three levels into 16, 34 and 50 classes. This three-level gold standard was created by 4 domain experts and 2 linguists, who were asked to examine whether the test verbs are similar in terms of their syntactic properties (i.e. whether they have similar SCF distributions) and in terms of their semantics (i.e. whether they share a common meaning). If a group of verbs matched the criteria, a verb class was identified and named. The classes created by the domain experts were labeled as BIO and those created by the linguists as GEN. BIO classes include 116 verbs whose analysis required domain knowledge (e.g. activate, solubilize, harvest). GEN classes include 76 general or scientific text verbs (e.g. demonstrate, hypothesize and appear). Each class is associated with 1-30 member verbs. The linguists used Levin (1993) classes as gold standard classes whenever possible and created new ones when needed.
The domain experts used two semantic classifications of biomedical verbs (Friedman et al., 2002; Spasic et al., 2005) as a starting point. Only those classes/memberships on which all the experts agreed were included. Table 5.1 shows all the gold standard classes with their member verbs.

1 HAVE AN EFFECT ON ACTIVITY
  1.1 Activate / Inactivate
    1.1.1 Change activity: abolish accelerate activate arrest block disrupt enhance inactivate inhibit
    1.1.2 Suppress: repress suppress
    1.1.3 Stimulate: stimulate
    1.1.4 Inactivate: compromise delay diminish
  1.2 Affect
    1.2.1 Modulate: alter modulate stabilize
    1.2.2 Regulate: affect control induce influence regulate support
  1.3 Increase / decrease: decrease elevate increase
  1.4 Modify: catalyze modify
2 BIOCHEMICAL EVENTS
  2.1 Express: express overexpress
  2.2 Modification
    2.2.1 Biochemical modification: dephosphorylate phosphorylate
    2.2.2 Cleave: cleave
  2.3 Interact: coincide colocalize cooperate correlate interact interfere react
3
  3.1 Omit: deplete displace omit
  3.2 Subtract: dissect draw subtract
4 EXPERIMENTAL PROCEDURES
  4.1 Prepare
    4.1.1 Wash: rinse wash
    4.1.2 Mix: mix
    4.1.3 Label: fix immunoblot label probe stain
    4.1.4 Incubate: incubate preincubate
    4.1.5 Elute: elute
  4.2 Precipitate: coimmunoprecipitate coprecipitate precipitate
  4.3 Solubilize: lyse solubilize
  4.4 Dissolve: dissolve freeze homogenize resuspend suspend
  4.5 Place: deposit embed load locate mount place plate seed spot
5 PROCESS: align cut fill linearize overlap
6 TRANSFECT: cotransfect inject microinject transfect
7 COLLECT
  7.1 Collect: collect harvest select
  7.2 Process: centrifuge process recover
8 PHYSICAL RELATION BETWEEN MOLECULES
  8.1.0 Binding: attach bind conjugate couple fuse hybridize tether
  8.2 Translocate and Segregate
    8.2.1 Translocate: redistribute shift switch translocate
    8.2.2 Segregate: export segregate
  8.3 Transmit
    8.3.1 Transport: deliver extend transmit transport
    8.3.2 Link: connect link map
9 REPORT
  9.1 Investigate
    9.1.1 Examine: analyze assess estimate evaluate examine explore
    9.1.2 Establish: establish investigate test
    9.1.3 Confirm: confirm determine verify
  9.2 Suggest
    9.2.1 Presentational: argue assume conclude hypothesize note reason speculate
    9.2.2 Cognitive: believe consider mean postulate predict propose think
  9.3 Indicate: demonstrate imply indicate suggest
10 PERFORM
  10.1 Quantify
    10.1.1 Quantitate: measure monitor quantify quantitate
    10.1.2 Calculate: calculate record
    10.1.3 Conduct: conduct perform
  10.2 Score: count score
11 RELEASE: detach dissociate excise release
12 USE: employ exploit use utilize
13 INCLUDE
  13.1 Encompass: bear comprise encompass harbor possess span
  13.2 Include: carry constitute contain include underlie
14 CALL: call designate name
15 MOVE
  15.1 Proceed: move pass point proceed progress recycle traffic
  15.2 Emerge: arise come disappear emerge originate
16 APPEAR: appear become occur prove remain seem

Table 5.1: The gold standard classification

5.3.2 Test Data

We used the same corpus data as that used by Korhonen et al. (2008). The data was downloaded from the MEDLINE database, from eight journals covering various areas of biomedicine (rather than focusing on any particular sub-domain of biomedicine). The first column in table 5.2 lists each journal, the second shows the years from which the articles were downloaded, and the third indicates the size of the data. According to Korhonen et al. (2008), the data is sufficient in size for the various feature sets to have good quality frequency distributions.
Journal                            Years    Words
Genes & Development                2003-5   4.7M
Journal of Biological Chemistry    2004     5.2M (Vol. 1-9)
The Journal of Cell Biology        2003-5   5.6M
Cancer Research                    2005     6.5M
Carcinogenesis                     2003-5   3.4M
Nature Immunology                  2003-5   2.3M
Drug Metabolism and Disposition    2003-5   2.3M
Toxicological Sciences             2003-5   3.1M
Total                                       33.1M

Table 5.2: Data from Korhonen et al. (2008) downloaded from MEDLINE

5.4 Features

Korhonen et al. (2008) investigated optimal features for biomedical verb classification, including lexical, syntactic and semantic ones. Their experiments showed that feature sets containing both syntactic and semantic information perform the best on this task. To facilitate a direct comparison of our results against those of Korhonen et al. (2008), we adopted the same feature sets for our experiments. We briefly describe each feature set below. Table 5.3 provides the mapping of our features to the features shown earlier in table 2.4. These features, along with the other features used in this thesis, are listed in appendix A.

F1:  F-SCF
F2:  F-SCF+PP(A)
F3:  F-SCF+PP(B)
F4:  F-SCF+TENSE(A)
F5:  F-SCF+TENSE(B)
F6:  F-SCF+VOICE(A)
F7:  F-SCF+VOICE(B)
F8:  F-SCF+SP(A)
F9:  F-SCF+SP(B)
F10: F-SCF+SP(C)

Table 5.3: The mapping to the features in table 2.4

The basic SCF features were extracted using the system of Preiss et al. (2007), as in the earlier experiments. The system tags, lemmatizes and parses the corpus data using the RASP toolkit (Briscoe et al., 2006). RASP is a domain-independent parser. We did not use the bio-tuned Enju parser and GENIA tagger used for creating the BioLexicon, because we wanted to evaluate the new clustering methods while keeping the feature sets similar to those used in the previous works. Moreover, the biomedical version of Enju does not outperform general parsers (GENIA-retrained) in biomedical event extraction tasks (Miyao et al., 2008), so using a bio-tuned Enju parser might not give optimal performance anyway. The SPs are acquired using the method described in section 3.1.2.

5.5 Experimental evaluation

5.5.1 Experimental settings

We set the number of clusters (Ks) to the number of classes in the 3-level gold standard: 16, 34 and 50. Both SPEC and HGFC have a random element. We therefore repeat the K-Means module of SPEC 100 times with random initialization, and the result that minimizes the average distance to the cluster centroids is used. Likewise, the HGFC algorithm is run 100 times and the result that minimizes the objective function of HGFC is used.

5.5.2 Measures

We employed the same measures (mPUR, ACC and F) as previously employed by Korhonen et al. (2008), in order to facilitate a meaningful comparison of results. We performed McNemar's statistical significance test (McNemar, 1947) for the major findings. The details of the test are described in section 3.3.4.

5.5.3 Results

Tables 5.4 and 5.5 show the results for SPEC and HGFC, respectively. The results (first given for each individual feature set) are directly comparable with the PC clustering results in Korhonen et al. (2008), where the same corpus and gold standard were used.

Recall that F1-F3 include the basic SCF features (F1) refined with information about prepositions (F2-F3). As in Korhonen et al. (2008), F2-F3 mostly outperform F1. However, while in Korhonen et al.'s experiment F3 did not outperform F2, in our case the improvement is significant (p < 0.05).
Parameterizing additional frames with prepositions is clearly helpful, yielding useful class distinctions (for example, the SCF NP-P-NP-ING has different variations depending on the preposition in question, e.g. he attributed his failure to buying his books vs. he told her about climbing the mountain).

Recall that F4-F10 refine F3 further with additional information. In our case, the performance is not improved by F4 and F5 (the verb tense features). This contrasts with the result of Korhonen et al., but is in line with our result on general language verb classification in chapter 3. As in Korhonen et al.'s experiment, F6-F7 (the verb voice features) do not prove helpful. F4 combines verb tense with the SCF feature by simple concatenation. Since a concatenated feature can no longer be seen as a probability distribution, it is expected that a distributional similarity measure like JSD will underperform. F5 parameterizes the SCF feature with verb tense information. According to Merlo and Stevenson (2001), verb tense and voice features are related to the transitive/intransitive alternation, since the use of a past participle or a passive voice implies a transitive use of the verb. If the voice feature does not improve the result over F3, the verb tense feature should not provide any improvement either, as the two provide similar information. In addition, the SCFs already indicate the transitive/intransitive use of a verb, so the verb tense information provides nothing new.

Recall that F8-F10 parameterize F3 with additional information about LPs and SPs. F8 and F9 parameterize F3 with LPs, while F10(a-c) supplement F3 with automatically acquired information about SPs. As in chapter 3, these feature sets are the most useful. When SPEC is used, F8-F10 outperform the other feature sets by a large margin. In the case of HGFC, the improvement is smaller but still consistent.

F10a is the best-performing feature with SPEC. It achieves F-measures of 82.2, 76.4 and 74.0 at the three levels of the gold standard (16, 34 and 50 classes), respectively. This shows that SP information can perform very well with SPEC – a finding we reported earlier also on general English (chapter 3). With HGFC, the best performing feature is also an SP-based feature, but F10c gives the best performance at 16 and 50 classes, while F10a performs best at 34 classes. The largest improvement over F8 is found at the coarse-grained level of 16 classes (from 72.8 to 76.1). This indicates that SPs can offer a larger improvement over LP features at the coarse-grained level than at the other levels.

Comparing our own two methods, SPEC outperforms HGFC on almost every feature set, except at 50 classes using F5 and at 16 classes using F10c. The differences in the best performances at the three levels (16, 34 and 50 classes) are 10.8, 3.5 and 4.1 in F, respectively. This is expected, since HGFC produces a tree-consistent hierarchy, whereas SPEC and other flat clustering algorithms (e.g. K-Means and PC) produce three flat clustering results for the three levels which are not tree-consistent. Tree-consistency is important, as all the existing Levin-style verb classification gold standards are tree-consistent. Tree-consistency is also important for humans to understand and interpret the resulting hierarchy.
For example, if two verbs v1 and v2 are assigned to the same cluster at the 50-class level, they cannot belong to different clusters at the upper levels (34 and 16 classes). In chapter 4, we showed that without the tree-consistency constraint, HGFC performs similarly to SPEC.

When comparing our results against those of Korhonen et al., SPEC outperforms PC (p < 0.05) on almost all feature sets and class number settings (except on F4 and F5 at 34 and 50 classes, where the results worsen slightly or the improvement is not significant (p > 0.05)). The biggest improvement can be observed at the level of 16 classes (7.9 on average over all feature sets). At the levels of 34 and 50 classes, the average improvement over all feature sets is 3.2 and 3.8 in F, respectively. Using SPEC, the best performance improves on Korhonen et al.'s best result by 11.9, 4.0 and 5.2 in F for 16, 34 and 50 classes.

For HGFC, the performance is not always improved. Small improvements and declines can be found for F1-F9. However, consistent improvements can be observed using F10a-c (2.26 on average, p < 0.05 except F10b and F10c at 34 classes). The best performance at 16, 34 and 50 classes is then improved by 4.4, 3.0 and 2.9, respectively. As discussed above, the tree-consistency requirement is the main reason that HGFC does not always perform better than PC.

We further investigated whether combining the best features might yield higher overall performance. We constructed F8+F10(a-c), which combine LP and SP information. For SPEC, the performance at 16 classes is further improved by 4.7 in F when compared to the best performance with F8 and F10(a-c). For 34 and 50 classes, the best performance stays the same. For HGFC, the performance is not further improved by the combination of LP and SP. For both methods, F8+F10c outperforms F8 consistently (p < 0.05 except at 50 classes using SPEC), showing that adding SP information on top of LP is useful. However, F8+F10a-c do not consistently outperform F10a-c, demonstrating that adding LP information on top of SP is not always helpful. A possible explanation is that since SPs were acquired automatically and are therefore noisy, the actual LPs can sometimes provide more detailed information which the SPs lack. At the level of 16 classes, SPs are much less clear than at the levels of 34 and 50 classes, and therefore LPs can make a real difference.
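To make the tree-consistency constraint discussed above concrete, the following minimal sketch (our own illustration, not part of the HGFC implementation; the function name and data layout are invented) checks whether a set of nested flat clusterings forms a tree-consistent hierarchy: any two verbs merged at a finer level must remain together at every coarser level.

from itertools import combinations

def is_tree_consistent(levels):
    # levels: list of dicts mapping verb -> cluster id, ordered from
    # the finest level (e.g. 50 clusters) to the coarsest (e.g. 16)
    for finer, coarser in zip(levels, levels[1:]):
        for v1, v2 in combinations(finer, 2):
            if finer[v1] == finer[v2] and coarser[v1] != coarser[v2]:
                return False  # merged at the fine level, split above
    return True

# Toy example with two levels: the two verbs sharing a fine-grained
# cluster also share a coarse-grained one, so the check passes.
fine = {'diminish': 0, 'decrease': 0, 'inhibit': 1}
coarse = {'diminish': 0, 'decrease': 0, 'inhibit': 0}
print(is_tree_consistent([fine, coarse]))  # True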
          16 Classes                 34 Classes                 50 Classes
          mPUR  ACC   F              mPUR  ACC   F              mPUR  ACC   F
SCF
F1        76.6  74.5  75.6 (+15)     65.8  65.1  65.6 (+9.7)    61.6  61.5  61.6 (+8.6)
F2        75.6  76.0  75.9 (+4.8)    68.2  67.2  67.7 (+2.1)    65.4  63.5  64.5 (+1.3)
F3        77.9  78.6  78.3 (+7.3)    74.7  75.0  74.8 (+5)      69.9  69.8  69.8 (+5.7)
F3+tense
F4        74.7  75.0  74.8 (+3.8)    68.7  69.3  69.0 (-0.8)    64.5  64.6  64.6 (+0.5)
F5        76.6  76.0  76.3 (+5.3)    70.1  70.3  70.2 (+0.4)    64.1  63.5  63.8 (-0.3)
F3+voice
F6        73.5  77.1  77.2 (+6.8)    70.3  69.8  70.0 (+4.9)    67.5  68.2  67.9 (+3.5)
F7        79.2  78.0  78.6 (+10.4)   70.3  70.4  70.4 (+5.2)    66.7  67.1  66.9 (+2.2)
F3+SP
F8        79.6  79.2  79.4 (+9.5)    74.9  72.9  73.9 (+3.7)    73.7  71.9  72.7 (+6.1)
F9        78.6  76.6  77.4 (+7.5)    76.7  74.5  75.5 (+3.7)    73.7  72.4  72.9 (+4.1)
F10a      83.7  80.7  82.2 (+11.6)   78.7  74.0  76.4 (+6.5)    74.4  71.9  74.0 (+5.2)
F10b      84.2  79.7  81.9 (+9)      74.5  68.8  71.5 (+1.3)    71.6  71.4  71.5 (+4.9)
F10c      79.4  72.9  75.9 (+4.2)    74.2  69.8  71.9 (+1.2)    72.8  70.8  71.8 (+4.8)
F8+F10a   84.2  82.8  83.5           78.1  71.9  74.9           74.9  72.9  73.9
F8+F10b   86.6  84.8  85.7           78.2  73.4  75.7           73.3  71.9  72.6
F8+F10c   89.6  84.3  86.9           78.4  74.5  76.4           74.0  72.5  73.2

Table 5.4: SPEC clustering results in comparison to the results in Korhonen et al. (2008)

          16 Classes                 34 Classes                 50 Classes
          mPUR  ACC   F              mPUR  ACC   F              mPUR  ACC   F
SCF
F1        70.0  66.7  68.3 (+7.7)    59.7  59.9  59.8 (+4)      59.5  57.8  58.7 (+4.8)
F2        69.4  65.1  67.2 (-3.9)    65.8  64.1  64.9 (-0.7)    59.2  59.4  59.3 (-3.9)
F3        75.9  68.8  72.1 (-0.1)    68.4  67.2  67.8 (+2.8)    65.6  66.7  66.1 (+3.2)
F3+tense
F4        74.0  70.8  72.4 (+1.4)    69.3  67.7  68.5 (-1.3)    60.3  59.9  60.1 (-4)
F5        72.4  71.4  71.9 (-1.5)    67.9  66.1  67.0 (-0.1)    66.6  65.6  66.1 (-0.7)
F3+voice
F6        73.2  68.8  70.9 (+0.5)    69.0  66.7  67.8 (+1.9)    66.5  66.7  66.6 (+2.2)
F7        76.7  69.8  73.1 (+4.9)    68.8  67.2  68.0 (+2.8)    66.1  65.6  65.9 (+1.2)
F3+SP
F8        76.1  69.8  72.8 (+3.9)    71.4  69.8  70.6 (+0.4)    67.6  67.9  67.8 (+2.1)
F9        82.5  53.1  64.6 (-5.3)    70.9  63.0  66.7 (-5.1)    65.9  61.5  63.6 (-5.2)
F10a      76.3  73.9  75.1 (+4.5)    74.7  71.4  72.9 (+3)      68.1  68.8  68.4 (+2.2)
F10b      77.1  70.8  73.8 (+0.9)    70.7  67.7  69.2 (+0.07)   69.3  67.2  68.2 (+2.3)
F10c      77.7  74.5  76.1 (+4.4)    72.8  69.8  71.3 (+0.06)   70.7  69.3  69.9 (+2.9)
F8+F10a   69.0  66.7  67.8           69.5  67.7  68.6           65.3  65.1  65.2
F8+F10b   77.1  70.8  73.8           70.7  67.7  69.2           69.3  67.2  68.2
F8+F10c   77.7  74.5  76.0           72.8  69.8  71.3           70.7  69.3  70.0

Table 5.5: HGFC clustering results in comparison to the results in Korhonen et al. (2008)

We performed further qualitative analysis of the clusters produced by SPEC and HGFC with their best features. The most common error type was misclustering due to purely syntactic similarity. For example, the APPEAR verb occur showed up in the same cluster as the MOVE verbs pass, proceed and progress, because it frequently appears with locative arguments, like MOVE verbs tend to do. However, an equally common error type was misclassification due to genuine semantic similarity which was not apparent in the gold standard. Some such cases were due to polysemy. The gold standard was not created like previous similar gold standards, where verbs were classified on the basis of their predominant sense in language; this was not possible because no sense-annotated data was available. However, Korhonen et al. (2006b) relied on the assumption that verbs tend to have one predominating sense in a domain. As the annotators were given syntactically similar verb distributions to consider as a starting point, it is indeed possible that if a predominating sense exists, the gold standard captures it in most cases. Yet we found several examples which showed that polysemy plays a role in this data as well.
For example, while the verb diminish belongs to the gold standard class INACTIVATE, it is also a perfectly valid member of the class INCREASE/DECREASE, and indeed gets clustered together with its member verbs. Another type of semantic "error" was due to the members of distinct but semantically related gold standard classes being clustered together. Although the gold standard is hierarchical, it fails to relate some classes which share aspects of meaning. Even looking at the clustering output at the finest-grained level of 50 classes, our methods confuse classes such as EXPERIMENTAL PROCEDURES and COLLECT, which are clearly both related to experiments but are distinct in the gold standard. Other such examples include the INVESTIGATE and QUANTIFY classes, which include very similar general scientific verbs (e.g. determine, estimate, measure). Although the clustering methods tend to respect the basic division between biomedical and general verbs made by the annotators, usually assigning them to different classes, in some cases both general and biomedical verbs are found in the same cluster. Sometimes this confusion is due to the genuine relatedness of the classes, e.g. the biomedical class TRANSPORT is related in meaning to the general class MOVE. Looking at the clusterings at the higher levels of the hierarchy reveals further, larger groupings of semantically related classes which are distinct in the gold standard. This demonstrates how clustering can be used to hypothesise both flat and hierarchical verb classifications.

5.6 Summary

In this work, we applied our flat and hierarchical clustering methods (SPEC and HGFC) to the verb classification task in the biomedical domain. Using the same features and dataset as in previous work (Korhonen et al., 2008), both of our methods outperformed the best previous method (PC). In particular, HGFC produced a tree-consistent hierarchy, while PC can only output three levels of flat clusterings. Tree-consistency is important, as all the manual classifications are tree-consistent. As in the general domain, our experiments demonstrated that the feature sets containing both syntactic and semantic information performed the best. A qualitative error analysis revealed that our classification includes some genuine classes that are missing from the gold standard. In sum, this experiment showed that our clustering approach can be applied to the biomedical domain without any change, indicating that the approach could be used to construct a large-scale verb resource for the biomedical domain.

5.7 Future Work

Future work could investigate using a bio-tuned tagger and parser for more accurate processing of the data. The currently available Enju parser does not outperform other parsers that are retrained on the same biomedical corpus (e.g. Charniak (2000) and Sagae and Tsujii (2007)). However, according to Miyao et al. (2008), retrained parsing models outperform their original models. Because GENIA contains only molecular biology publications related to E. coli, lexicalized parser models might not generalize well to other biomedical subdomains. Therefore, one could first perform domain-specific tuning of the unlexicalized RASP parser on the GENIA corpus, and then select the best performing parser among the retrained RASP parser, the Enju parser and Charniak's parser for processing the data.
In addition, the resulting verb classification could be evaluated in the context of BioNLP tasks. For example, semantic role labelling of biomedical text has employed VerbNet classes as features in the classifier (Chou et al., 2006). Verb clustering could improve this approach further by yielding classes specific to biomedicine that are missing in VerbNet.

Chapter 6

Cross-linguistic potential of verb classification

6.1 Introduction

Real-world exploitation of Levin-style classes has been limited because, for most languages, no such classes are available.1 To date, most work on Levin-type classification has focussed on English. Large-scale research on other languages such as German (Schulte im Walde, 2006) and Japanese (Suzuki and Fukumoto, 2009) has focussed on semantic classification. Although there are similarities between the two types of classification system, studies comparing the overlap between VerbNet and WordNet (Miller, 1995) have reported that the mapping is only partial and many-to-many, due to the fine-grained nature of classes based on synonymy (Kipper-Schuler, 2005; Shi and Mihalcea, 2005; Abend et al., 2008).

Only a few studies have been conducted on Levin-style classification for languages other than English. In their experiment involving 59 verbs and three classes, Merlo et al. (2002) applied a supervised approach developed for English to Italian, obtaining high accuracy (86.3%). In another experiment with 60 verbs and three classes, they showed that features extracted from Chinese translations of English verbs can improve English classification. These results are promising, but those from a later experiment by Ferrer (2004) are not. Ferrer applied a clustering approach developed for English to Spanish, and evaluated it against the manual classification of Vázquez et al. (2000), constructed using criteria similar (but not identical) to Levin's. This experiment, involving 514 verbs and 31 classes, produced results only slightly better than the random baseline.

In this chapter, we investigate the cross-linguistic potential of Levin-style classification further. In recent years, verb classification techniques – in particular unsupervised ones – have improved considerably, making investigations for a new language more feasible. We take the SPEC verb clustering method developed for English (see section 3.2) and apply it to French – a major language for which no such experiment has been conducted yet. Basic NLP resources (corpora, taggers, parsers and subcategorization acquisition systems) are now sufficiently developed for this language for the application of a state-of-the-art verb clustering approach to be realistic.

Our investigation reveals similarities between the English and French classifications, supporting the linguistic hypothesis (Jackendoff, 1990) and the earlier result of Merlo et al. (2002) that Levin classes have a strong cross-linguistic basis. Not only the general methodology but also the best performing features are transferable between the languages, making it possible to learn useful classes for French automatically and without the need for language-specific tuning.

This is joint work with Thierry Poibeau and Cédric Messiant. The French gold standard was constructed by Thierry Poibeau. The SCF lexicon LexSchem was provided by Cédric Messiant.

1 The research reported in this chapter was published in Sun et al. (2010).
The rest of the work was performed by the author of this thesis.

6.2 French verb classes and the gold standard

The development of an automatic verb classification approach requires at least an initial gold standard. Some syntactic (Gross, 1975) and semantic (Vossen, 1998) verb classifications exist for French, along with ones which aim to integrate aspects of both (Saint-Dizier, 1998). Although such resources could be combined to hypothesise Levin-style classes for French (using e.g. an approach similar to that employed by Kipper et al. (2008)), we adopted a more direct approach: following the idea of Merlo et al. (2002), we translated a number of Levin classes from English to French.

We chose an English gold standard which has been used to evaluate several recent clustering works – that of Sun et al. (2008b). It includes 17 fine-grained Levin classes. Each class has 12 member verbs whose predominant sense (according to the WordNet frequency data) belongs to that class. We evaluated each class in this resource as follows:

1. Member verbs were first translated to French. Where several relevant translations were identified, each of them was considered.

2. For each candidate verb, SCFs were identified and possible diathesis alternations were considered using the criteria of Levin (1993): alternations must result in the same or an extended verb sense. Only verbs sharing diathesis alternations were kept in the class – others were discarded.

For example, the gold standard class 31.1 AMUSE includes the following English verbs: stimulate, threaten, shock, confuse, upset, overwhelm, scare, disappoint, delight, exhaust, intimidate and frighten. Relevant French translations were identified for all of them: abattre, accabler, briser, déprimer, consterner, anéantir, épuiser, exténuer, écraser, ennuyer, éreinter, inonder.
The majority of these verbs take similar SCFs and diathesis alternations, e.g. Cette affaire écrase Marie (de chagrin), Marie est écrasée par le chagrin, Le chagrin écrase Marie. However, stimuler (stimulate) and menacer (threaten) do not, and they were therefore removed.

40% of translations were discarded from classes after step 2 was applied. The final version of the gold standard (shown in table 6.1) includes 171 verbs in 16 classes. Each class is named according to the original Levin class. The smallest class (30.3) includes 7 verbs and the largest (37.3) 16. The average number of verbs per class is 10.7.

Class No  Class                Verbs
9.1       PUT                  accrocher, déposer, mettre, placer, répartir, réintégrer, empiler, emporter, enfermer, insérer, installer
10.1      REMOVE               ôter, enlever, retirer, supprimer, retrancher, débarrasser, soustraire, décompter, éliminer
11.1      SEND                 envoyer, lancer, transmettre, adresser, porter, expédier, transporter, jeter, renvoyer, livrer
13.5.1    GET                  acheter, prendre, saisir, réserver, conserver, garder, préserver, maintenir, retenir, louer, affréter
18.1      HIT                  cogner, heurter, battre, frapper, fouetter, taper, rosser, brutaliser, éreinter, maltraiter, corriger
22.2      AMALGAMATE           incorporer, associer, réunir, mélanger, mêler, unir, assembler, combiner, lier, fusionner
29.2      CHARACTERIZE         appréhender, concevoir, considérer, décrire, définir, dépeindre, désigner, envisager, identifier, montrer, percevoir, représenter, ressentir
30.3      PEER                 regarder, écouter, examiner, considérer, voir, scruter, dévisager
31.1      AMUSE                abattre, accabler, briser, déprimer, consterner, anéantir, épuiser, exténuer, écraser, ennuyer, éreinter, inonder
36.1      CORRESPOND           coopérer, participer, collaborer, concourir, contribuer, prendre part, s'associer, travailler
37.3      MANNER OF SPEAKING   râler, gronder, crier, ronchonner, grogner, bougonner, maugréer, rouspéter, grommeler, larmoyer, gémir, geindre, hurler, gueuler, brailler, chuchoter
37.7      SAY                  dire, révéler, déclarer, signaler, indiquer, montrer, annoncer, répondre, affirmer, certifier, répliquer
43.1      LIGHT EMISSION       briller, étinceler, flamboyer, luire, resplendir, pétiller, rutiler, rayonner, scintiller
45.4      CHANGE OF STATE      mélanger, fusionner, consolider, renforcer, fortifier, adoucir, polir, atténuer, tempérer, pétrir, façonner, former
47.3      MODES OF BEING       trembler, frémir, osciller, vaciller, vibrer, tressaillir, frissonner, palpiter, grésiller, trembloter
51.3.2    RUN                  voyager, aller, se promener, errer, circuler, se déplacer, courir, bouger, naviguer, passer

Table 6.1: A Levin-style gold standard for French

6.3 Verb clustering

We performed an experiment where we

• took a French corpus and a SCF lexicon automatically extracted from that corpus using French NLP technology,

• extracted from these resources a range of features (lexical, syntactic and semantic) – a representative sample of those employed in recent English experiments (Joanis et al., 2007; Li and Brew, 2008; Ó Séaghdha and Copestake, 2008; Vlachos et al., 2009b) and also those used in chapter 3,

• clustered the features using a method which has proved promising in both English and German experiments: SPEC,

• evaluated the clusters both quantitatively (using the gold standard) and qualitatively,

• and finally, compared the performance of individual features to that recently obtained for English, in order to gain a better understanding of the cross-linguistic and language-specific properties of verb classification.

6.3.1 Data: the LexSchem lexicon

We extracted the features for clustering from LexSchem (Messiant et al., 2008). This large subcategorization lexicon provides SCF frequency information for 3,297 French verbs. It was acquired fully automatically from the Le Monde newspaper corpus (200M words from the period 1991-2000) using ASSCI – a recent subcategorization acquisition system for French (Messiant, 2008).

Systems similar to ASSCI have been used in recent verb classification works (e.g. Schulte im Walde, 2006; Li and Brew, 2008; Ó Séaghdha and Copestake, 2008). Like these other systems, ASSCI takes raw corpus data as input. The data is first tagged and lemmatized using the Tree-Tagger and then parsed using the Syntex parser (Bourigault et al., 2005). Syntex is a shallow parser which employs a combination of statistics and heuristics to identify grammatical relations (GRs) in sentences. It is a state-of-the-art French parser: it obtained the best precision and F-measure for written texts in the recent EASY evaluation.2

2 See http://w3.univ-tlse2.fr/erss/textes/pagespersos/bourigault/syntex.html for details.

ASSCI considers those GRs where the target verbs occur and constructs SCFs from nominal, prepositional and adjectival phrases, and infinitival and subordinate clauses. When a verb has no dependency, its SCF is considered intransitive. Otherwise, ASSCI assumes no pre-defined list of SCFs: almost any combination of permitted constructions can appear as a candidate SCF. The number of automatically generated SCF types in LexSchem is 336.
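To illustrate this SCF construction step, the sketch below (our own simplification, not ASSCI code; the GR tuple format and function name are invented for the example) builds a candidate SCF label for one verb occurrence from the phrase types of its dependants:

from collections import Counter

def candidate_scf(deps):
    # deps: list of (relation, phrase_type) pairs for one verb token,
    # e.g. [('obj', 'NP'), ('iobj', 'PP-a')]; no dependants means the
    # occurrence is treated as intransitive
    if not deps:
        return 'INTRANS'
    # order the dependants canonically so that equivalent frames map
    # to the same SCF string
    return '_'.join(sorted(ptype for _, ptype in deps))

# Counting candidate SCFs over all occurrences of a verb gives the
# kind of SCF frequency distribution stored in LexSchem.
occurrences = [[('obj', 'NP')], [('obj', 'NP'), ('iobj', 'PP-a')], []]
print(Counter(candidate_scf(occ) for occ in occurrences))
# Counter({'NP': 1, 'NP_PP-a': 1, 'INTRANS': 1})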
Many of the candidate SCFs are noisy due to processing errors and the difficulty of the argument-adjunct distinction. Most SCF systems operate on the assumption that true arguments occur in argument positions more frequently than adjuncts, and many of them also integrate sophisticated filters for removing noise from the system output. When LexSchem was evaluated using a relative frequency and heuristics-based filter, its F-measure was 69, which is similar to those of other current SCF systems (Messiant et al., 2008). However, we used the unfiltered version of LexSchem, because previous work on English verb classification has shown that information about adjuncts can actually help verb clustering (Sun et al., 2008b).

6.4 Features

Lexical entries in LexSchem provide a variety of material for verb clustering, including (statistical) information related to the POS tags, SCFs, argument heads, and adjuncts of verbs. Using this material, we constructed a range of features for experimentation. The first three include basic information about SCFs:

F1: F-SCF
F2: F-SCF+TENSE(B)
F3: F2, with SCFs parameterized for prepositions (PP).

The following six features include information about the lexical context (co-occurrences) of verbs. We adopt the best method of Li and Brew (2008), where collocations (COs) are extracted from the window of words immediately preceding and following a lemmatized verb. Stop words are removed prior to extraction.

F4, F6, F8: COs are extracted from a window of 4, 6 and 8 words, respectively. The relative word position is ignored.
F5, F7, F9: F4, F6 and F8 with the relative word position recorded.

The next four features include information about the lexical preferences (LPs) of verbs in argument head positions of specific GRs associated with the verb:

F10: LP(PREP): the type and frequency of prepositions in the preposition (PREP) relation.
F11: LP(SUBJ): the type and frequency of nouns in the subject (SUBJ) relation.
F12: LP(IOBJ): the type and frequency of nouns in the object (OBJ) and indirect object (IOBJ) relations.
F13: LP(ALL): the combination of F10-F12.

The final two features refine SCF features with LPs and semantic information about verb SPs:

F14-F16: F1-F3 parameterized for LPs.
F17: F3 refined with SPs.

We adopt the fully unsupervised approach to SP acquisition described in chapter 3, with the difference that we determine the optimal number of SP clusters automatically following Zelnik-Manor and Perona (2004). The method is introduced in the following section. The approach involves (i) taking the GRs (subj, obj, iobj) associated with verbs, (ii) extracting all the argument heads in these GRs, and (iii) clustering the N most frequent argument heads into M classes. An empirically determined N of 200 was used. The method produced 40 SP clusters.

Features F1, F2, F4-F9, F10-F13 and F14 were used in previous verb classification experiments. More details and the feature extraction methods are listed in table 2.4. All the above features, and the other features used in this thesis, are summarized in appendix A.

6.5 Clustering methods

We used the SPEC method described in chapter 3 with JSD as the similarity measure. The reason why JSD is used, and details of other similarity measures, can be found in appendix B.
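For reference, JSD is the symmetrised, smoothed variant of KL divergence. The sketch below gives the standard definition for two normalized feature vectors (a generic implementation of the textbook formula, not the code used in the thesis):

import numpy as np

def jsd(p, q, eps=1e-12):
    # Jensen-Shannon divergence between two discrete distributions
    # over the same support; smaller values mean more similar verbs
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# e.g. two verbs' SCF distributions over the same frame inventory
print(jsd([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))  # small value: similar verbs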
As the number of clusters is not known beforehand, we use Zelnik-Manor and Perona (2004)'s method to estimate it. This method finds the optimal value by minimizing a cost function based on the eigenvector structure of the similarity matrix. Like Brew and Schulte im Walde (2002), we compare SPEC against a K-Means baseline. We used the Matlab implementation with Euclidean distance as the distance measure.

6.6 Experimental evaluation

6.6.1 Data and pre-processing

Our initial plan was to experiment with all the 171 verbs in the gold standard (see table 6.1). However, we decided to exclude phrasal verbs (e.g. faire disparaître) and drop one class (40.2 NON-VERBAL EXPRESSION) which included reflexive verbs in French (e.g. s'amuser, se moquer), since multiword units would have been challenging for our method. Verbs assigned to several classes due to polysemy were also excluded. This left us with 147 verbs in 15 classes (10 verbs per class on average).

The SCF-based features (F1-F3 and F14-F17) were extracted directly from LexSchem. The CO (F4-F9) and LP features (F10-F13) were extracted from the raw and parsed corpus sentences, respectively, which were used for creating the lexicon. Features that only appeared once were removed. Feature vectors were normalized by the sum of the feature values before clustering. Since our clustering algorithms have an element of randomness, we repeated clustering multiple times. We report the results that minimize the distortion (the distance to the cluster centroid).

6.6.2 Evaluation measures

We employed the same measures (mPUR, ACC and F) for evaluation as in our experiments on English (chapter 3). We performed McNemar's statistical significance test (McNemar, 1947) to verify the major findings. The details of the test are described in section 3.3.4.

6.7 Evaluation

6.7.1 Quantitative evaluation

In our first experiment, we evaluated 116 verbs – those which appeared in LexSchem a minimum of 150 times. We did this because English experiments had shown that, due to the Zipfian nature of SCF distributions, 150 corpus occurrences are typically needed to obtain a sufficient number of frames for clustering (Sun et al., 2008b). The amount of corpus evidence needed for reliable clustering is discussed in section 8.2.3.

Table 6.2 shows F-measure results for all the features. For comparison, the 4th column of the table shows the results obtained for English (in section 3.4). The results for English were obtained using SPEC with the same features, evaluated against the English version of the same gold standard, also using F-measure.3

3 Note that the results for the two languages are not mutually comparable due to differences in test sets, data sizes, and feature extraction systems. The results for English are included so that we can compare the relative performance of individual features in the two languages in question.

As expected, SPEC (the 2nd column) outperforms K-Means (the 3rd column) throughout the feature sets, with an average improvement of 14.1 in F (p < 0.05). Looking at the basic SCF features F1-F3, we can see that they perform substantially better than the baseline (BL). F3 performs the best among the three features both in French (50.6 F, p < 0.05) and in English (63.3 F). We therefore use F3 as the SCF feature in F14-F17 (the same was done for English). In French, most CO features (F6-F9) outperform the SCF features (p < 0.05). The best result is obtained with F7: 55.1 F. This is clearly better than the best SCF result of 50.6 (F3). This result is interesting, since SCFs correspond better than COs to the features used in the manual Levin classification.
Also, SCFs perform considerably better than COs in the English experiment (we only have the result for F4 available, but it is considerably lower than the result for F3). However, earlier English studies have reported contradictory results (e.g. Li and Brew (2008) showed that CO performs better than SCF in supervised verb classification), indicating that the role of CO features in verb classification requires further investigation.

Looking at the LP features, F13 produces the best F (52.7) for French, which is slightly better than the best SCF result for the language. Also in English, F13 performs the best in this feature group and yields a higher result than the best SCF-based feature F3.

Parameterizing the best SCF feature F3 with LPs (F14-F16) and SPs (F17) yields better performance in French. F15 and F17 have F-measures of 54.5 and 54.6, respectively. These results are so close to the result of the best CO feature F7 (55.1 – the highest result in this experiment) that the differences are not statistically significant (p > 0.05). In English, the results of F14-F17 are similarly good; however, only F17 beats the already high performance of F13.

On the basis of this experiment, it is difficult to tell whether shallow CO features or more sophisticated SCF-based features are better for French. In the English experiment sophisticated features performed better, and the SCF-SP feature F17 was the best one. However, the English experiment employed a much larger dataset. The more sophisticated features may suffer from data sparseness in our French experiment: although we required a minimum of 150 occurrences per verb in LexSchem, verb clustering performance tends to improve when more data is available, and given the fine-grained nature of LexSchem SCFs, it is likely that more data is required for optimal performance.

We therefore performed another experiment with French on the full set of 147 verbs, using SPEC, where we investigated the effect of instance filtering on the performance of the best features from each feature group: F3, F7, F13 and F17. The results shown in table 6.3 reveal that the performance of the features remains fairly similar up to the instance threshold of 1000. When 2000 occurrences per verb are used, the differences become clearer, until at the threshold of 4000 it is obvious that the most sophisticated SCF-SP feature F17 is by far the best feature for French (65.4 F, p < 0.05 when compared to F3), and the SCF feature F3 the second best (60.5 F). The CO feature F7 and the LP feature F13 are not nearly as good (53.4 and 51.0 F).

Although the results at different thresholds are not comparable due to the different numbers of verbs and classes (see columns 2-3), the results for features at the same threshold are. Those results suggest that when 2000 or more occurrences per verb are used, most features perform as they did for English in the chapter 3 experiment, with CO being the least informative4 and SCF-SP the most informative feature. The only exception is the LP feature, which performed relatively better than CO in English.

4 However, it is worth noting that CO is not a useless feature. As table 6.3 shows, when 150 or fewer occurrences are available for a verb, CO outperforms all the other features in French, compensating for data sparseness.
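To make the CO features concrete, the following sketch (our own illustration; the function and its parameters are invented, and the exact windowing used by Li and Brew (2008) may differ) extracts window-based collocations with and without relative position, corresponding to feature pairs such as F6/F7:

def co_features(tokens, verb_index, window=3, record_position=False,
                stopwords=frozenset({'the', 'a', 'of'})):
    # Collect co-occurrences from `window` words on each side of the
    # lemmatized verb. With record_position=True the relative offset
    # is kept (as in F5/F7/F9); otherwise only the word itself is
    # used (as in F4/F6/F8). Stop words are removed.
    feats = []
    for i in range(max(0, verb_index - window),
                   min(len(tokens), verb_index + window + 1)):
        if i == verb_index or tokens[i] in stopwords:
            continue
        feats.append((tokens[i], i - verb_index) if record_position
                     else tokens[i])
    return feats

sent = ['the', 'committee', 'examine', 'the', 'proposal', 'carefully']
print(co_features(sent, 2))
# ['committee', 'proposal', 'carefully']
print(co_features(sent, 2, record_position=True))
# [('committee', -1), ('proposal', 2), ('carefully', 3)]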
                        French              English
                        SPEC    K-Means     SPEC
BL                      6.7     6.7         6.7
F1  SCF                 42.4    39.3        57.8
F2  SCF(POS)            45.9    40.3        46.7
F3  SCF(PP)             50.6    36.9        63.3
F4  CO(4)               50.3    38.2        40.9
F5  CO(4+loc)           48.8    26.3        -
F6  CO(6)               52.7    29.2        -
F7  CO(6+loc)           55.1    33.8        -
F8  CO(8)               54.2    36.4        -
F9  CO(8+loc)           54.6    37.2        -
F10 LP(PREP)            35.5    32.8        49.0
F11 LP(SUBJ)            33.7    23.6        -
F12 LP(OBJ)             50.1    33.3        -
F13 LP(ALL)             52.7    40.1        74.6
F14 SCF+LP(SUBJ)        50.3    40.1        71.7
F15 SCF+LP(OBJ)         54.5    35.6        74.0
F16 SCF+LP(SUBJ+OBJ)    53.4    36.2        73.0
F17 SCF+SP              54.6    39.8        80.4

Table 6.2: Results for all features for French (SPEC and K-Means) and English (SPEC)

THR    Verbs  Cls   F3    F7    F13   F17
0      147    15    43.7  57.5  43.3  50.1
50     137    15    47.9  56.1  44.8  49.1
100    125    15    49.2  54.3  44.8  49.5
150    116    15    50.6  55.1  52.7  54.6
200    110    15    54.9  52.9  49.7  52.5
400    96     15    52.7  52.9  43.9  53.2
1000   71     15    51.4  54.0  44.8  54.5
2000   59     12    52.3  45.9  42.7  53.5
3000   51     12    55.7  49.0  46.8  59.2
4000   43     10    60.5  53.4  51.0  65.4

Table 6.3: The effect of verb frequency on performance

6.7.2 Qualitative evaluation

We conducted a qualitative analysis of the clusters for French, focusing on those created using SPEC with F17 and F3. Verbs in the gold standard classes 29.2, 36.1, 37.3, 37.7 and 47.3 (table 6.1) performed particularly well, with the majority of member verbs found in the same cluster. These verbs are ideal for clustering because they have distinctive syntactic-semantic characteristics. For example, verbs in the 29.2 CHARACTERIZE class (e.g. concevoir, considérer, dépeindre) not only have a very specific meaning but also take high-frequency SCFs involving the preposition comme (Eng. as), which is not typical of many other classes. Interestingly, Levin classes 29.2, 36.1, 37.3 and 37.7 were among the best performing classes also in the English supervised verb classification experiment of Sun et al. (2008b) (which employed the English version of our gold standard), because these classes have distinctive characteristics in English as well.

The benefit of sophisticated features which also integrate semantic (SP) information (F17) is particularly evident for classes with non-distinctive syntactic characteristics. For example, the intransitive verbs in the 43.1 LIGHT EMISSION class (e.g. briller, étinceler, flamboyer) are difficult to cluster based on syntax only, but semantic features work because the verbs place strong SPs on their subjects (entities capable of light emission). In the experiment of Sun et al. (2008b), 43.1 was the worst performing class for English, possibly because no semantic features were used in that experiment.

The most frequent source of error is syntactic idiosyncrasy. This is particularly evident for the classes 10.1 REMOVE and 45.4 CHANGE OF STATE. Although verbs in these classes can take similar SCFs and alternations, only some of them are frequent in the data. For example, the SCF ôter X à Y is frequent for verbs in 10.1, but ôter X de Y is not. Although class 10.1 did not suffer from this problem in the English experiment of Sun et al. (2008b), class 45.4 did. Class 45.4 performs particularly badly in French also because its member verbs are low in frequency.
Some errors are due to polysemy, caused partly by the fact that the French version of the gold standard was not controlled for this factor. Some verbs have their predominant senses in classes which are missing from the gold standard; e.g. the most frequent sense of retenir is memorize, not keep as in the gold standard class 13.5.1 GET.

Finally, some errors are not true errors but demonstrate the capability of clustering to learn novel information. For example, the CHANGE OF STATE class 45.4 includes many antonyms (e.g. weaken vs. strengthen). Clustering (using F17) separates these antonyms, so that the verbs adoucir, atténuer and tempérer appear in one cluster and consolider and renforcer in another. Although these verbs share the same alternations, their SPs are different. For the same reason, verbs in the LIGHT EMISSION class 43.1 end up in different clusters, depending on whether they describe abstract or concrete light emission. The opposite effect can be observed when clustering maps together classes which are actually semantically and syntactically related (e.g. 36.1 CORRESPOND and 37.7 SAY). Such classes are distinct in Levin and VerbNet, because these resources do not draw links between semantically similar classes belonging to different main classes. Cases such as these show the potential of clustering in discovering novel, valuable information in data. It is encouraging that we have observed this effect in this first clustering experiment on French.

6.8 Discussion and conclusion

We have seen that when sufficient corpus data is available, there is a strong correlation between the types of features which perform the best in English and French. Interestingly, we have also seen that when the best features are used, many individual Levin classes have similar performance in the two languages.

Due to differences in language-specific datasets and sizes, a direct comparison of the actual performance figures for English and French is not possible. When considering the general level of performance, our best performance for French (65.4 F) is clearly lower than the best performance for English in chapter 3. However, it compares favourably to the performance of other state-of-the-art (even supervised) systems for English verb classification (Joanis et al., 2007; Li and Brew, 2008; Ó Séaghdha and Copestake, 2008; Vlachos et al., 2009b). This is impressive considering that we experimented with a fully unsupervised method originally developed for another language.

Our experiment suggests that when aiming to improve performance further, employing larger data is critical. Most recent experiments on English have employed bigger test sets and datasets, and unlike us, some of them have only considered the predominant senses of medium-to-high frequency verbs (Ó Séaghdha and Copestake, 2008; Vlachos et al., 2009b). As seen in section 6.7.1, such differences in data can have a significant impact on performance. However, parser and feature extraction performance can also play a big role in overall accuracy, and should therefore be investigated further. When we evaluated our basic SCF feature (equivalent to F1) using the same corpus data and gold standard but an older version of the RASP parser and the SCF extraction system in section 3.4, the F dropped dramatically: from 57.8 to 38.3. The relatively low performance of the basic LP features in French suggests that at least some of the current errors are due to parsing.
Future research should therefore investigate the sources of error at different stages of processing. In the future, it would also be interesting to investigate whether performance on French can be further enhanced by language-specific tuning (e.g. by experimenting with language-specific features such as auxiliary classes).

Methodology similar to ours has yielded promising results on semantic verb classification in German (Schulte im Walde, 2006) and Japanese (Suzuki and Fukumoto, 2009). However, these studies have not focussed on Levin-style classes, and have not explored cross-linguistic transfer. The works most related to ours are those of Merlo et al. (2002) and Ferrer (2004). Our results contrast with those of Ferrer, who showed that a clustering approach does not transfer well from English to Spanish. However, her experiment used basic SCF and named entity features only, and a clustering algorithm less suitable for high-dimensional data. Like us, Merlo et al. (2002) created a gold standard by translating Levin classes to another language (Italian). They also applied a classification approach developed for English to Italian, and reported good overall performance using features developed for English. Although the experiment was very small in scale (involving three classes and only a few features), and although it involved the use of a supervised classification technique, the results are in agreement with our results from this larger, unsupervised experiment with French.

In their recent experiment, Falk et al. (2012) built on some of the work we have described in this chapter (Sun et al., 2010). They made use of existing syntactic and semantic lexical resources to cluster 2183 French verbs into our gold standard classes (a superset of our gold standard). They experimented with a new clustering method and new feature sets. They obtained a better result (70 F), but this result is not comparable with ours, because the gold standard was not identical. Also, manually specified rather than automatically acquired features were used in the experiment. In addition, we found two potential flaws in the experiment which can affect the results:5

1. In order to obtain the thematic grid feature from VerbNet, a classifier was trained to map French verbs to VerbNet classes. The gold standard verbs and classes were used to train the classifier (see footnote 3 on page 2 of their paper). In other words, the gold standard was used for feature extraction. This makes the clustering result higher than in fully automatic work, as the thematic grid feature is already implicitly encoded in the class label.

2. F-measure was used to select the number of clusters for K-Means and IGNG (see page 4 of their paper). This means that the gold standard was used to assist the clustering. This also makes the result unrealistically high from the perspective of automatic acquisition, as the reported best F-measure cannot be obtained when the gold standard is unknown.

5 The second point was confirmed with the first author via personal communication. We were not able to get a clarification regarding the first point.

In sum, the experiments reported in this chapter further support the linguistic hypothesis that Levin-style classification can be cross-linguistically applicable or overlapping (Levin, 1993).
A clustering technique such as the one presented here could be used as a helpful tool to investigate this hypothesis further, and to find out whether classifications are similar across a wider range of more diverse languages. From the NLP perspective, the fact that an unsupervised technique developed for one language can be applied to another language without substantial changes in the methodology means that automatic techniques can be used to hypothesise useful Levin-style classes in a cost-effective manner (Kipper et al., 2008). This, in turn, can facilitate the creation of VerbNets for new languages.

Chapter 7

Task-based evaluation of verb classification

VerbNet has proved useful for many practical NLP tasks, including automatic verb acquisition (Swift, 2005), semantic role labelling (Swier and Stevenson, 2004), robust semantic parsing (Shi and Mihalcea, 2005), word sense disambiguation (Dang, 2004), building conceptual graphs (Hensman and Dunnion, 2004), and creating a unified lexical resource for knowledge extraction (Crouch and King, 2005). To our knowledge, automatically acquired classifications have not yet been evaluated in the context of an NLP task, although such an evaluation would be important. We apply our automatically acquired verb and noun classes (SPs) to two NLP tasks: metaphor identification and argumentative zoning.

We did this work in collaboration with Ekaterina Shutova and Yufan Guo. The project plan, system design, experiments and evaluation were carried out by Ekaterina Shutova and Yufan Guo, respectively. The author's contribution was to provide the lexical classifications and the related statistics for the two tasks. We summarise the resulting work in this chapter. Details of the work can be found in the following publications: Shutova et al. (2010); Guo et al. (2010, 2011b). All the examples, figures and tables in this chapter were originally authored by Ekaterina Shutova and Yufan Guo for the publications above.

7.1 Use of verb and noun classification in metaphor identification

7.1.1 Introduction

Shutova (2011) proposed an approach to automatic metaphor identification based on noun and verb classification. According to Shutova and Teufel (2010), the phenomenon of metaphor is frequent in all types of discourse. A metaphor identification system capable of distinguishing between literal and metaphorical expressions in unrestricted text would be useful for many NLP applications.

Shutova followed the hypothesis proposed by Wilks (1978): metaphor demonstrates a violation of selectional restrictions in a given context. For example:

(1) My car drinks gasoline. (Wilks, 1978)

The verb drink normally takes an animate subject and a liquid object. Therefore, drink taking a car as a subject violates the normal selectional restriction, which indicates that drink is used metaphorically. This approach was automated by a few systems, but they are limited in the following respects: 1) Fass (1991)'s system overgenerates with respect to metaphor, as it detects any kind of non-literalness or anomaly in language (metaphors, metonymies and others). 2) Fass's system and Krishnakumaran and Zhu (2007)'s system are mainly based on hand-coded knowledge, which affects the coverage of the systems. 3) Approaches like that of Gedigan et al. (2006) only identify metaphors for a specific domain in a specific type of discourse.
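As a toy illustration of Wilks's cue (our own sketch; the noun classes and verb preferences below are invented for the example and are not part of Shutova's system), a verb-subject pair can be flagged as potentially metaphorical when the subject's class falls outside the verb's preferred subject classes:

# Hypothetical noun classes and verb selectional preferences,
# invented purely to illustrate the idea of a preference violation.
NOUN_CLASS = {'man': 'ANIMATE', 'horse': 'ANIMATE', 'car': 'ARTIFACT'}
SUBJECT_PREF = {'drink': {'ANIMATE'}}

def violates_preference(verb, subject):
    # True if the subject's class is outside the verb's preferred
    # subject classes -- Wilks's cue for a metaphorical use
    prefs = SUBJECT_PREF.get(verb)
    cls = NOUN_CLASS.get(subject)
    if prefs is None or cls is None:
        return False  # no evidence either way
    return cls not in prefs

print(violates_preference('drink', 'man'))  # False: literal use
print(violates_preference('drink', 'car'))  # True: 'my car drinks gasoline'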
In contrast to these previous works, the scope of this experiment is the whole of the BNC (Leech, 1992), and the domain of the expressions the system can identify is unrestricted.

The motivation for using clustering methods for metaphor identification lies in the nature of metaphorical reasoning, which is based on association. In the following examples, both the marriage and the political regime target concepts are mapped to the source domain of mechanism, despite having quite different meanings.

(2) Our relationship is not really working.

(3) Diana and Charles did not succeed in mending their marriage.

(4) The wheels of Stalin's regime were well oiled and already turning.

Shutova's expectation is that such relatedness of mappings from distinct target concepts to the same source concept should appear in similar lexico-syntactic environments. Thus, clustering concepts using GRs and lexical features would allow the system to capture their relatedness by association and acquire metaphorical expressions beyond the seed set. For example, if the sentence in (2) is in the seed set, the system should be able to identify the metaphors in both (3) and (4).

In summary, the system (1) requires an initial seed set of metaphorical expressions (source-target domain mappings); (2) performs unsupervised noun clustering in order to acquire target concepts associated with the same source domain; (3) creates a source domain verb lexicon using unsupervised verb clustering; and (4) identifies metaphorical expressions in the BNC that describe the target domain concepts using verbs from the source domain lexicon; a sketch of this cluster-based expansion follows below. The noun and verb clusters used for steps 2 and 3 were supplied by the author of this thesis.

7.1.2 Data

Shutova used Shutova (2010)'s dataset as a seed set. The dataset is composed of 62 phrases that are single-word metaphors representing verb-subject and verb-object relations, where a verb is used metaphorically. The seed phrases include e.g. verb-direct object constructions (stir excitement, reflect enthusiasm, accelerate change, grasp theory) and subject-verb constructions (campaign surged, factor shaped). The system carried out metaphor identification on the BNC, parsed using the RASP parser of Briscoe et al. (2006). The GR output of RASP for the BNC created by Andersen et al. (2008) was used.

7.1.3 Method

The system generalizes over the metaphorical expressions in the seed set by means of unsupervised verb and noun clustering. As demonstrated by examples (2)-(4) in section 7.1.1, Shutova expects the obtained clusters to represent potential source and target concepts between which metaphorical associations hold. The system searches for the source and target domain terms within object and subject relations in the BNC parsed by RASP. These GRs can be classified as metaphorical or non-metaphorical according to the source and target domain vocabulary in the related clusters. In addition, Shutova employs a selectional preference strength filter to remove candidate expressions associated with verbs that are less likely to be used metaphorically. The hypothesis behind this filter is that only verbs exhibiting strong SPs are prone to metaphoricity. The SP strength measure proposed by Resnik (1993) was used. We discuss the work performed by the author of this thesis in the remainder of this section.
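The following toy sketch illustrates the cluster-based generalization at the heart of steps 2-4 (the cluster assignments and function are invented for illustration; this is not Shutova's code): a seed verb-noun pair is expanded to all candidate GR pairs whose verb and noun fall into the same clusters as the seed's.

# Illustrative-only cluster assignments; the real clusters are
# produced by SPEC, as described below.
VERB_CLUSTER = {'work': 0, 'mend': 0, 'oil': 0, 'eat': 1}
NOUN_CLUSTER = {'relationship': 0, 'marriage': 0, 'regime': 0, 'dinner': 1}

def expand_seed(seed_verb, seed_noun, candidate_pairs):
    # Return the candidate (verb, noun) GR pairs whose verb and noun
    # share clusters with the seed metaphor's verb and noun
    vc, nc = VERB_CLUSTER[seed_verb], NOUN_CLUSTER[seed_noun]
    return [(v, n) for v, n in candidate_pairs
            if VERB_CLUSTER.get(v) == vc and NOUN_CLUSTER.get(n) == nc]

candidates = [('mend', 'marriage'), ('oil', 'regime'), ('eat', 'dinner')]
print(expand_seed('work', 'relationship', candidates))
# [('mend', 'marriage'), ('oil', 'regime')]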
Verb and noun clustering

We adopt the verb clustering approach described in chapter 3, which uses rich syntactic and semantic features extracted using a shallow parser, and a clustering method suitable for the resulting high-dimensional feature space. JSD was used as the similarity measure. The reason JSD is used instead of skew divergence is described in appendix B. Recall that when we evaluated our approach on 204 verbs from 17 Levin classes in chapter 3, we obtained an F-measure of 80.4 (which is high, particularly for an unsupervised approach). In this experiment, we apply this approach to a much larger set of 1610 verbs: all the verb forms appearing in VerbNet (Kipper et al., 2006a), with the exception of highly infrequent ones. In addition, we applied the approach to noun clustering.

Feature Extraction

Our verb dataset is a subset of VerbNet compiled as follows. For all the verbs in VerbNet, we extracted their occurrences (up to 10,000) from the raw corpus data collected originally by Korhonen et al. (2006a) for the construction of the VALEX lexicon. Only the verbs found in this data more than 150 times were included in the experiment. For verb clustering, we adopted the best performing features in chapter 3: automatically acquired verb SCFs parameterized by their SPs and prepositions (see section 3.1.2 for details of the feature extraction).

Our noun dataset consists of the 2000 most frequent nouns in the BNC. Following previous work on semantic noun classification (Pantel and Lin, 2002; Bergsma et al., 2008), we used GRs as features for noun clustering. We employed all the argument heads and verb lemmas appearing in the subject, direct object and indirect object relations in the RASP-parsed BNC. The feature vectors were first constructed from the corpus counts, and subsequently normalized by the sum of the feature values before applying clustering.

Clustering Algorithm

We used the SPEC method for clustering. See chapter 3 for the details of how this was done for both verbs and nouns. We experimented with different numbers of clusters (50, 100, 200, 300, 400) for both noun and verb clusterings. Shutova found 200 to be the most suitable setting for both nouns and verbs by means of a qualitative analysis of the clusters as representations of source and target domains. Some of the clusters obtained as a result of applying the algorithm to our noun and verb datasets are shown in Figures 7.1 and 7.2, respectively. The noun clusters represent target concepts that Shutova expects to be associated with the same source concept (some suggested source concepts are given in Figure 7.1, although the system only captures those implicitly). The verb clusters contain coherent lists of source domain vocabulary.

Source: MECHANISM
Target Cluster: consensus relation tradition partnership resistance foundation alliance friendship contact reserve unity link peace bond myth identity hierarchy relationship connection balance marriage democracy defense faith empire distinction coalition regime division

Source: STORY; JOURNEY
Target Cluster: politics practice trading reading occupation profession sport pursuit affair career thinking life

Source: LOCATION; CONTAINER
Target Cluster: lifetime quarter period century succession stage generation decade phase interval future

Source: LIVING BEING; END
Target Cluster: defeat fall death tragedy loss collapse decline disaster destruction fate

Figure 7.1: Clustered target concepts
Source Cluster: sparkle glow widen flash flare gleam darken narrow flicker shine blaze bulge

Source Cluster: gulp drain stir empty pour sip spill swallow drink pollute seep flow drip purify ooze pump bubble splash ripple simmer boil tread

Source Cluster: polish clean scrape scrub soak

Source Cluster: kick hurl push fling throw pull drag haul

Source Cluster: rise fall shrink drop double fluctuate dwindle decline plunge decrease soar tumble surge spiral boom

Figure 7.2: Clustered verbs (source domains)

7.1.4 Evaluation and discussion

Shutova compared the system's output to that of a baseline using WordNet synsets as source and target domains. The precision of the two systems was determined with the help of human judges. The coverage was calculated based on the number of metaphorical expressions found by each system.

Comparison against the WordNet baseline

The baseline system employs the synonymy information from WordNet to expand on the seed set. It assumes that the source and target vocabularies are represented by the synonyms of the verbs and nouns in the seed expressions. The system then searches for phrases composed of lexical items belonging to those vocabularies. The coverage of the two systems is compared by estimating the number of WordNet synsets in the metaphorical expressions captured by the two systems. The baseline covers only 13% of the metaphors identified using clustering. Compared to the output of the baseline, the metaphors tagged by the system represent a considerably wider range of meanings: e.g. given the seed metaphors stir excitement, throw remark and cast doubt, the system identifies previously unseen expressions such as swallow anger, hurl comment and spark enthusiasm as metaphorical.

Comparison with human judgments

Shutova used the help of annotators in order to assess the precision of metaphor identification by both systems. The annotators were presented with 78 randomly sampled sentences containing metaphorical expressions as annotated by the system and by the baseline. They were asked to decide whether the tagged expressions were metaphorical or not. The kappa score (Siegel and Castellan, 1988) of this annotation task is 0.63 (n=2, N=78, k=5). Precision is calculated as the percentage of metaphorical expressions that were tagged correctly out of all those that were tagged. A tagged metaphorical expression is considered to be correct if at least three annotators agree that this is the case. The resulting precision of the system is 0.63, while the baseline obtains 0.44. For details on the annotation process and an error analysis of the two systems, please consult Shutova (2011).

7.1.5 Conclusion

We presented Shutova (2011)'s novel metaphor identification system for unrestricted text using verb and noun clusterings. The system has better coverage and precision than a baseline using WordNet synsets. The resulting metaphors go far beyond synonymy and generalize well over the source and target domains. In this task, we demonstrated that our automatically acquired Levin-style verb classifications can be very useful for an NLP task. To our knowledge, this is the first task-based evaluation of automatic Levin-style classification. VerbNet lacks the statistical information required by the experiment we have reported, so it could not have been used for it anyway.
Since our clustering methods and features can transfer to new languages and domains, the metaphor identification system can be ported to a new language or domain at a low cost.

7.2 Use of verb classes in argumentative zoning

7.2.1 Introduction

Argumentative zoning (AZ) is the analysis of the argumentative structure of a scientific paper. It has proved useful for many NLP tasks (Teufel and Moens, 2002; Mizuta et al., 2006; Tbahriti et al., 2005; Ruch et al., 2007). Most approaches to AZ are fully supervised, which means they are difficult to apply to new scientific domains. Also, many of their features (e.g. individual verbs) tend to be sparse, so a large amount of annotated data is needed to obtain good features. We experimented with features with more generalization power (including verb classes) and also employed weakly supervised machine learning, which relies on only a small amount of labelled data.

This is joint work with Yufan Guo. The author of this thesis was responsible for providing the verb classes. The work summarised here is published in two papers (Guo et al., 2010, 2011b). The 2010 paper focusses on fully supervised learning, and the more recent 2011 paper focusses on weakly supervised learning. We will only discuss the latter work here, because it aims to minimise the need for data annotation and therefore relies more heavily on less sparse features. We also focus mainly on the parts relevant to the verb class feature. See the relevant paper for further details of the experiment.

7.2.2 Data and experiments

AZ provides a sentence-based classification of scientific text into categories that capture the information structure or scientific discourse of a paper. The scheme was originally developed by Simone Teufel (Teufel, 1999); we employed a version of the AZ scheme modified for biomedical papers (Mizuta et al., 2006). The AZ corpus built by Guo et al. (2010) contains 1000 biomedical abstracts. Sentences in each abstract are annotated with one of the seven categories of biomedical AZ appearing in abstracts: Background (BKG), Objective (OBJ), Method (METH), Result (RES), Conclusion (CON), Related work and Future work. Figure 7.3 shows an example of an annotated abstract.

Figure 7.3: Example of an annotated abstract in Guo et al. (2010)'s corpus

To obtain the verb class features, we extracted 60 verb classes by clustering all verbs with frequency greater than 150 in Korhonen et al. (2008)'s biomedical dataset (see section 5.3 for details). The acquisition method is the same as in section 7.1.3: the SPEC method with SCFs parameterized with prepositions and SPs was employed. The feature extraction method is described in section 2.3.2. The other features used for AZ include location, word, bi-gram, POS, GR, voice and verb. In table 7.1, All refers to all of these features, and the verb feature is compared to the verb class feature.

We experimented with fully supervised classification methods (SVM and CRF) and weakly supervised methods: active SVM, active SVM with self-training (ASSVM), transductive SVM and semi-supervised CRF. The results were evaluated using accuracy (ACC), precision (P), recall (R) and F:

ACC = no. of correctly classified sentences / total no. of sentences in the corpus

P = no. of sentences correctly identified as the class / total no. of sentences identified as the class

R = no. of sentences correctly identified as the class / total no. of sentences in the class
Features      Acc         MF    BKG   OBJ   METH  RES   CON
All           0.81        0.76  0.86  0.56  0.76  0.88  0.76
¬Verb         0.81 (-0%)  0.79  0.84  0.77  0.73  0.87  0.75
¬Verb class   0.79 (-2%)  0.75  0.86  0.62  0.72  0.84  0.70

Table 7.1: Results of leaving one feature out for ASSVM. The difference in accuracy relative to All is also shown.

MF is the macro-average of F across the five high-frequency zone categories. All results were produced using 10-fold cross-validation in order to avoid bias from any particular training split. In addition, only 10% of the labelled data was used as training data; the rest was given to the classifier as unlabelled training data.

With all the features, ASSVM produced the best result: 0.81 in ACC and 0.76 in MF. It outperforms the best fully supervised method, SVM, by 0.04 in ACC and 0.02 in MF.

Next, we conducted an analysis to investigate the usefulness of the verb class feature. We took the best-performing method (ASSVM) and performed leave-one-out ablation of the features, using the 10% of labelled data. The results are compared to the results of using all the features (table 7.1). When the verb feature is excluded, ACC stays the same, but MF actually improves by 3%. The raw verb feature was regarded as useful for AZ in previous experiments (Guo et al., 2010; Liakata et al., 2012). This interesting result shows that, when combined with the verb class feature, the raw verb feature offers no additional valuable information. We can also observe that ACC and MF decrease when the verb class feature is left out. The verb class information is particularly useful for the Method, Result and Conclusion categories: a 4-6% decrease can be observed when the verb class feature is removed. Moreover, the verb class feature is the third most important feature overall: only the location and POS features are more important. It is more important than many features commonly used in AZ, e.g. words and bigrams.

7.2.3 Summary

The results reported here show that the weakly supervised classifier (ASSVM) outperforms the fully supervised classifier by making use of both labelled and unlabelled data. The verb class feature was shown to be very useful for weakly supervised learning. In contrast, the raw verb feature actually decreases performance when the verb class feature is present. The verb class feature alleviates the data sparseness problem of the raw verb feature, which lowers the amount of training data that needs to be annotated. This makes it a particularly suitable feature for a weakly supervised classifier when the labelled training data is small.

7.3 Release of large-scale verb classifications

To enable the use of automatically induced verb classes in further NLP tasks, we took our best-performing clustering method and feature combination and created two large-scale classifications: one for general English, and one for the biomedical domain. For both clusterings, we used the SPEC method. The features we used were SCFs parameterized with prepositions and SPs. Chapter 3 provides details on the method, features and feature extraction techniques.

We constructed the general English classification from the English Gigaword corpus (Graff et al., 2003). We clustered all the verbs appearing in VerbNet which occurred at least 1500 times in Gigaword.
For verbs with more than one class, we assigned each verb to the class which, according to VerbNet, corresponds to its predominant sense in WordNet. Verbs belonging to classes with fewer than three member verbs were discarded. The resulting gold-standard hierarchy has three levels, with 170, 144 and 60 classes respectively. The verbs were clustered according to the number of clusters on each level of the gold standard. When evaluated against the three top levels of VerbNet, the results are 59.1, 54.8 and 52.3 in F respectively. On datasets of similar scale, the following results have been reported: 36.7, 37.9 and 40.0 in F for hierarchical clustering (with tree constraint) in chapter 4 and Sun and Korhonen (2011), and 52.8 in macro-averaged recall (48 classes) for supervised learning in Li and Brew (2008). These results are not directly comparable to each other, as the evaluation measures, features and corpora used differ.

For the biomedical classification, we employed the biomedical corpus of Korhonen et al. (2008). Korhonen et al. (2008)'s gold standard was extended with additional verbs suggested by a linguist and domain experts. We kept only the verbs that occur at least 150 times in the corpus. These 399 verbs were clustered into three levels. We set the number of clusters to 78, 46 and 17 respectively for each level, the same as in the gold standard. The results against the gold standard on the three levels are 68.8, 64.7 and 62.5 in F respectively.

We make these two resources publicly available in order to allow researchers to use automatically acquired classes for their tasks. The classification can be tuned to different tasks or extended easily by re-running the clustering with different settings (e.g. features, corpora, frequency cut). The resources are available at:

http://www.cl.cam.ac.uk/users/ls418/resource_release/

The clustering code is available on request.

7.4 Summary

In this chapter, we evaluated the automatically acquired verb classes on two NLP tasks: metaphor identification and AZ. To our knowledge, these are the first task-based evaluations of automatic verb classification. In the first task, the automatically acquired verb and noun clusters enabled a new approach to identifying metaphorical expressions in language. The clusters were used to represent the source and target concepts. The experimental results are promising in terms of recall and precision. This is an example of a task that requires an automatically acquired rather than manually built classification, because it makes use of statistical information. In the AZ task, we used our verb classes as an additional feature for AZ of biomedical abstracts. The results show that verb classes can improve performance over the raw verb feature. Among all features, it is the third most important. An automatically acquired classification is needed here because no manually developed biomedical verb classification is available.

Finally, using our best clustering methods, we created and evaluated two large verb classifications: one for general English and another for the biomedical domain. We have made these classifications publicly available so that researchers wishing to use them for NLP tasks and applications can do so.

Chapter 8

Conclusions

In this chapter, we summarize the contributions of this thesis (section 8.1) and outline directions for future research (section 8.2).
8.1 Contributions of this thesis

The main contribution of this thesis is to advance the state of the art in automatic verb classification by improving its accuracy and its applicability across domains and languages. We improved the accuracy of verb clustering by introducing novel clustering methods and semantic features. We evaluated the classification methodology on established datasets. We also applied the methods to a different language (French) and to a different domain (biomedical) and evaluated them there. For the first time, we performed task-based evaluation of automatic verb classification on two NLP tasks. The results show that automatically acquired classifications can be very useful for NLP applications. Our research resulted in several experimental findings and methodological proposals, which we discuss in the following sections.

8.1.1 Clustering methods

We introduced two new clustering methods to the task and to the NLP field.

SPEC was previously used for verb classification by e.g. Brew and Schulte im Walde (2002). However, we used an alternative version: the MNCut algorithm (Meila and Shi, 2001), which has a probabilistic interpretation. This algorithm is particularly useful for handling the high-dimensional feature space of verb classification (Brew and Schulte im Walde, 2002). In addition, we used this algorithm to acquire SPs. We employed Zelnik-Manor and Perona (2004)'s method for automatically detecting the number of clusters. In our experiments, the method outperformed the clustering methods that had been used for verb classification in previous work (K-means and PC).

HGFC is a graph-based hierarchical clustering algorithm (Yu et al., 2006). Most previous verb classification work has focused on acquiring and evaluating flat classifications. Levin's classification is not flat but taxonomic in nature, which is practical for NLP purposes, since applications may differ in the granularity they require from a classification. All previous work using hierarchical clustering had used linkage-based (AGG) hierarchical clustering. We addressed two problems of the linkage method: error propagation (when a verb is misclassified at one level, the error propagates to all other levels) and local pairwise merging (only two clusters can be combined at any level). We demonstrated that HGFC can avoid both problems. We also modified HGFC so that it can be used to automatically determine the tree structure for clustering, and proposed two extensions: automatically determining the number of clusters and adding soft constraints. In the experiments we reported, HGFC greatly outperformed AGG on all test sets, and it performed similarly to the current best flat clustering method, SPEC. The constrained version of HGFC detects the hierarchy missing from the existing gold standards with high accuracy. In addition, HGFC produces a fairly accurate multi-level hierarchy when the number of clusters and levels are detected automatically. Our qualitative evaluation showed that both the constrained and unconstrained versions of HGFC are capable of learning valuable novel information not included in the gold standards.
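For illustration, the following is a generic sketch of spectral clustering over a verb similarity matrix, in the spirit of the random-walk (MNCut) view; it is not our exact implementation. The similarity values and verbs are invented, and scikit-learn's KMeans is assumed for the final discretisation step:

# Generic spectral clustering sketch: embed items using the leading
# eigenvectors of the random-walk matrix P = D^-1 W, then run k-means.
import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster(W, k):
    """Cluster items from a symmetric similarity matrix W into k groups."""
    d = W.sum(axis=1)
    P = W / d[:, None]                  # row-normalised transition matrix
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)      # top-k eigenvectors span the
    embedding = vecs[:, order[:k]].real # relaxed cut indicator space
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embedding)

# Toy example: two blocks of mutually similar verbs.
W = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.8],
              [0.1, 0.1, 0.8, 1.0]])
verbs = ["sparkle", "glow", "rise", "fall"]
labels = spectral_cluster(W, k=2)
print(dict(zip(verbs, labels.tolist())))
# e.g. {'sparkle': 0, 'glow': 0, 'rise': 1, 'fall': 1} (cluster ids may swap)

In practice, the similarity matrix would be built from feature distributions, e.g. as exp(-JSD) over SCF distributions (see Appendix B).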
8.1.2 Semantic features

We investigated the role of the semantic feature SP in verb classification. In previous work, SPs acquired from WordNet/GermaNet offered no significant improvement over syntactic features (Schulte im Walde, 2006; Joanis, 2002). However, in manual verb classification (e.g. VerbNet), detailed selectional restrictions are assigned to verbs in many classes. We introduced a method for acquiring SPs from corpus data automatically. We demonstrated that SPs can be very useful for verb classification. Using the SPEC method for clustering, verb classification performance was greatly improved over the mere use of syntactic features on two established datasets and also in the biomedical domain.

8.1.3 Cross-lingual and cross-domain study

We applied our methods and features to a different language (French) and to the biomedical domain in order to investigate the portability of our verb clustering approach.

Cross-linguistic study We took the SPEC verb clustering method and the features (including SPs) developed for English and applied them to French, which had no verb classification available before the experiment was performed. Our investigation revealed similarities between the English and French classifications, supporting the argument that Levin classes have a strong cross-linguistic basis. We demonstrated that both our best methods and our best features are transferable between the two languages.

Cross-domain study We applied our best clustering methods (SPEC and HGFC) to Korhonen et al. (2008)'s biomedical dataset, using the same features as Korhonen et al. The gold standard contains both general scientific and biomedical verbs. Both our methods outperformed the PC method used by Korhonen et al. (2008). In all the experiments, the SP feature was the best feature (as in the general domain). We demonstrated that our methods can achieve good performance in the biomedical domain without any modification.

8.1.4 Task-based evaluation

We applied our verb classification to two NLP tasks. Although the manual verb classification in VerbNet has proved useful for various application tasks, automatically acquired verb classifications had not previously been used for any NLP task.

Metaphor identification We used the automatically acquired noun and verb clusters (obtained with SPEC) to identify metaphorical expressions in language. The clusters were used to represent the source and target concepts. We obtained promising experimental results. This is an example of a task that requires an automatically acquired rather than manually built classification, because it makes use of statistical information.

Argumentative zoning We used our verb classes as an additional feature for AZ of scientific abstracts. The results show that verb classes can improve performance over the raw verb feature; using the raw verb feature in addition to the verb class feature actually decreases performance. This demonstrates that the verb class feature provides a generalisation over the raw verb feature. For this task, too, an automatically obtained classification is needed because no manually developed one is available.

8.2 Directions for future research

Future work could further improve verb classification performance. We outline below some ideas for future research. Some extensions of our existing research were already described in earlier sections:
Section 3.6 (Verb clustering using selectional preferences)
Section 4.5 (Hierarchical verb clustering using graph factorization)
Section 6.8 (Cross-linguistic potential of verb classification)

8.2.1 Diathesis alternation as a new feature for verb classification

Encouraged by the good results obtained using the SP feature, we conducted a preliminary experiment with a semantic feature that is important in manual classification but had not previously been used in automatic verb classification: diathesis alternations (DAs).

DAs are regular alternations in the syntactic expression of verbal arguments, sometimes accompanied by a change in meaning. For example:

• The man broke the window.
• The window broke.

In Levin's classification, a verb class is usually characterized in terms of DAs. For example, COOK verbs (e.g. bake, cook, fry, toast . . . ) can take DAs such as the causative alternation, middle alternation and instrument subject alternation. There have been a few studies on automatic DA detection (McCarthy and Korhonen, 1998; Lapata, 1999; McCarthy, 2000; Tsang and Stevenson, 2004), but they all rely on WordNet. There is no prior work on incorporating automatically acquired DAs to aid verb classification.

We can define two approaches to DA acquisition: detection and approximation. Detection is similar to the previous work (McCarthy and Korhonen, 1998; Lapata, 1999; McCarthy, 2000; Tsang and Stevenson, 2004): SCFs are first acquired and then supplemented with semantic features (e.g. SPs) in order to detect whether a DA is present. We can replace the WordNet SPs with automatically acquired SPs, which have the benefit that they can be ported across tasks. Specific approximations of features related to DAs have been attempted in an earlier verb classification experiment; the causativity feature in Merlo and Stevenson (2001) (section 2.3.2) is one example. Since our goal is to improve verb classification, we do not have to know which sentences are alternating in order to make use of DAs; we only need to make DAs part of the classification model. One way is to model DAs as correlations between frames: if we observe that two types of frames co-occur frequently enough, we assume a potential occurrence of a DA. One drawback of the approximation approach is false positives (pairs of frames that co-occur frequently but do not participate in a DA). In what follows, we evaluate the potential usefulness of approximation (with false positives) for verb classification. We discuss the two approaches in the subsequent sections.

Frame         Example sentence                       Example frequency
NP+PP(on)     Jessica sprayed paint on the wall.     40
NP+PP(with)   Jessica sprayed the wall with paint.   30
PP(with)      *The wall sprayed with paint.          0
PP(on)        Jessica sprayed paint on the wall.     30

Table 8.1: Example frames for the verb spray

Figure 8.1: Graphical model for the joint probability of pairs of frames. v represents a verb, a represents a DA and f represents a specific frame out of M possible frames.

Diathesis alternation approximation

A DA can be approximated by a pair of frames. We define a frame as an SCF parameterized for the preposition. Example frames for the verb "spray" are shown in table 8.1. The feature value of a single frame feature is the frequency of the frame. Given two frames f_v(i), f_v(j) of a verb v, they can be transformed into a feature pair (f_v(i), f_v(j)) as an approximation to a DA.
The feature value of the DA feature (f_v(i), f_v(j)) is approximated by the joint probability of the pair of frames, p(f_v(i), f_v(j) | v), obtained by integrating over all possible DAs. In other words, the key assumption is that the joint probability of two frames has a strong correlation with the DAs, provided the joint probability is properly modelled by taking account of the hidden DAs. We use the DA feature (f_v(i), f_v(j)) with its value p(f_v(i), f_v(j) | v) as a new feature for verb clustering.

As a comparison point, we can ignore the DA and make a frame independence assumption. The joint probability then decomposes as:

p(f_v(i), f_v(j) | v) := p(f_v(i) | v) · p(f_v(j) | v)    (8.1)

Since SCFs are generated by the underlying meaning components (Levin and Hovav, 2006), they are in fact dependent. The dependency of the frames is represented by the simple graphical model shown in figure 8.1. The verb v and the frames f are observed, and the alternation a is hidden. The aim is to approximate rather than detect a DA, so a is summed out:

p(f_v(i), f_v(j) | v) = Σ_a p(f_v(i), f_v(j) | a) · p(a | v)    (8.2)

In order to evaluate this sum, we make a relaxation¹: the sum in equation 8.2 is replaced with the maximum (max). This is a reasonable relaxation, as a pair of frames rarely participates in more than one type of DA.

p(f_v(i), f_v(j) | v) ≈ max_a p(f_v(i), f_v(j) | a) · p(a | v)    (8.3)

The second relaxation replaces the max with the least upper bound (sup): if f_v(i) occurs s times, f_v(j) occurs t times and t < s, then the number of times that a DA occurs between f_v(i) and f_v(j) must be smaller than or equal to t.

p(f_v(i), f_v(j) | v) ≈ sup{p(f_v(i), f_v(j) | a)} · sup{p(a | v)}    (8.4)

where

sup{p(f_v(i), f_v(j) | a)} = Z⁻¹ · min(f_v(i), f_v(j))
sup{p(a | v)} = 1
Z = Σ_m Σ_n min(f_v(m), f_v(n))

So we end up with a simple form:

p(f_v(i), f_v(j) | v) ≈ Z⁻¹ · min(f_v(i), f_v(j))    (8.5)

The equation is intuitive: if f_v(i) occurs 40 times and f_v(j) occurs 30 times, the DA between f_v(i) and f_v(j) occurs at most 30 times. This upper bound is used as the feature value of the DA feature. The original feature vector f of dimension M is transformed into a feature vector f̃ of dimension M². Table 8.2 shows the transformed feature space for the example verb spray. We can see that the feature space matches our expectations well: the valid DAs have values greater than 0 and the invalid DA is assigned the value 0.

¹A relaxation is a method used in mathematical optimization for relaxing a strict requirement, either by substituting an easier requirement for it or by dropping it completely.

Frame pair                  Possible alternation  Occurrence  Feature value
NP+PP(on), NP+PP(with)      Locative              30          0.158
NP+PP(on), PP(with)         Causative(with)       0           0
NP+PP(on), PP(on)           Causative(on)         30          0.158
NP+PP(with), PP(with)       ?                     0           0
NP+PP(with), PP(on)         ?                     30          0.158
PP(with), PP(on)            ?                     0           0
NP+PP(on), NP+PP(on)        ?                     40          0.211
NP+PP(with), NP+PP(with)    ?                     30          0.158
PP(with), PP(with)          ?                     0           0
PP(on), PP(on)              ?                     30          0.158

Table 8.2: Example frame pair features for the verb spray
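A minimal sketch of the transform in equation 8.5, using the spray counts from table 8.1; the helper itself is illustrative, but running it reproduces the feature values in table 8.2 (Z = 190 for this example):

# Map frame frequencies to DA feature values Z^-1 * min(f_i, f_j),
# over all unordered frame pairs including self-pairs.
from itertools import combinations_with_replacement

def da_features(frame_counts):
    frames = list(frame_counts)
    pairs = list(combinations_with_replacement(frames, 2))
    mins = {p: min(frame_counts[p[0]], frame_counts[p[1]]) for p in pairs}
    z = sum(mins.values())                       # Z = 190 for the spray example
    return {p: m / z for p, m in mins.items()}

spray = {"NP+PP(on)": 40, "NP+PP(with)": 30, "PP(with)": 0, "PP(on)": 30}
for pair, value in da_features(spray).items():
    print(pair, round(value, 3))
# ('NP+PP(on)', 'NP+PP(with)') 0.158  -- matches table 8.2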
Preliminary experiment

In order to evaluate the usefulness of this model, a preliminary verb clustering experiment was performed using three feature sets:

• F1: F-SCF+PP(B) (see table 2.4)
• F2: the frame pair features built from F1 under the frame independence assumption (equation 8.1). This is not a proper DA feature, as it ignores the interdependency of the frames produced by the underlying DA.
• F3: the frame pair features (DAs) built from F1 under the frame dependency assumption (equation 8.4).

The datasets are test sets 7-11 (3-14 classes) in Joanis et al. (2007) and Sun et al. (2008b)'s 17-class test set (T2) introduced in chapter 3. The SPEC clustering algorithm was used. In the work described in chapter 3, we used a divergence-based distributional similarity measure. Due to the high dimensionality of the quadratic feature space, the computational cost of the divergence similarity measure (e.g. equation 3.1) is prohibitive here. We therefore use the Bhattacharyya kernel (Jebara and Kondor, 2003) to improve computational efficiency:

w_b(v, v′) = Σ_{d=1}^{D} (v_d · v′_d)^{1/2}    (8.6)

The mean-field bound of the Bhattacharyya kernel is very similar to the KL divergence kernel (Jebara et al., 2004). The form of the Bhattacharyya kernel is relatively simple, which also helps the theoretical analysis in the next section.
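The kernel itself is inexpensive to compute; a minimal sketch of equation 8.6, assuming the feature vectors are normalised frequency distributions:

# Bhattacharyya kernel: w_b(v, v') = sum_d sqrt(v_d * v'_d).
# For identical distributions it attains its maximum of 1.
import numpy as np

def bhattacharyya(v, w):
    return float(np.sum(np.sqrt(np.asarray(v) * np.asarray(w))))

v1 = np.array([0.7, 0.2, 0.1])   # invented frame distributions
v2 = np.array([0.6, 0.3, 0.1])
print(bhattacharyya(v1, v1))            # 1.0
print(round(bhattacharyya(v1, v2), 3))  # 0.993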
To further reduce the computational complexity, a set of high-frequency features over instances was used: for the 3-6 way classifications (Joanis et al.'s test sets 7-9), 50 features were used, and the 7-17 way classifications employed 100 features. In the next section, we show that F3 outperforms F1 regardless of the feature number setting.

The results are shown in table 8.3. The result of F2 is lower than that of F3, and even lower than that of F1 for the 3-6 way classifications. This indicates that the frame independence assumption is a poor one. F3 yields substantially better results than F2 and F1. This experiment shows that DA features are clearly more effective than frame features on these two datasets, even when relaxations are used.

Feature set   Joanis et al.                         Sun et al.
              7       8       9       10      11
F1            54.54   49.97   35.77   46.61   38.81   60.03
F2            50.00   49.50   32.79   54.13   40.61   64.00
F3            56.36   53.79   52.90   66.32   50.97   69.62

Table 8.3: Results when using F3 (DA), F2 (pairs of independent frames) and F1 (single frame) features with the Bhattacharyya kernel

Analysis with the Bhattacharyya kernel

In this section, we examine the effect of the DA features by investigating their impact on the kernel, especially the correlation with feature frequency.

Figure 8.2: Comparison between frame features (in blue) and DA features (in red) with different feature number settings. DA features clearly outperform frame features. The left figure shows the result on test set 10 (8 ways), the right figure on test set 11 (14 ways). The x axis is the number of features; the y axis is the F-measure.

We prove that the DA features increase the impact of the middle-range frequency frames on the Bhattacharyya kernel, while the high-frequency features still have a larger impact in general than the low-frequency ones. This is a nice property, as high-frequency features are often considered more reliable than low-frequency ones. The details of the mathematical proof are given in Appendix C.

An experiment was carried out using the F1 and F3 features on Joanis et al. (2007)'s test sets 10 and 11. The frequency-ranked frames were added to the clustering one at a time, starting from the most frequent one. The results are shown in figure 8.2. F3 (in red) clearly outperforms F1 (in blue) under all feature number settings. After adding a number of highly frequent frames (30 for test set 10 and 60 for test set 11), the performance of F1 is not further improved. This is in line with the mathematical proof in equation C.2: the kernel value is dominated by the top-frequency frames in F1. The performance of F3, in contrast, is generally improved by almost all the frames, including the mid-range frequency frames, although the improvement becomes less significant for frames with relatively low frequency.

In conclusion, this experiment demonstrates that the performance of the frame features is dominated by the high-frequency frames, whereas the DA features reduce this dominance by enabling the mid-range frequency frames to further improve performance.

Future work

Our preliminary experiment shows, for the first time, that automatically acquired DAs can provide a useful feature for verb classification. In the future, we plan to evaluate the performance of DA features in a larger-scale experiment. We have not yet been able to perform large-scale experiments, because the dimensionality of the transformed feature space is too high (quadratic in the size of the original feature space). An unsupervised dimensionality reduction technique (e.g. Zhao and Liu (2007)) will need to be used to improve the computational efficiency. Moreover, we plan to integrate the DA features with other features (e.g. SPs) in order to further improve the accuracy of verb clustering.

Detecting diathesis alternations from selectional preferences

A few studies, including McCarthy and Korhonen (1998), Lapata (1999) and McCarthy (2000), have attempted DA detection using SPs, employing WordNet (Miller, 1995) classes as SP classes. We plan to investigate whether SPs acquired using our new unsupervised technique (section 3.2) could be used for DA detection. Compared to the latent variable model, this approach aims to actually detect DAs and find the participating instances, instead of just approximating DAs.

We will investigate the best approaches to DA detection using automatically acquired SPs. The method needs to be general enough to cover most types of DAs and efficient enough for a large-scale experiment. One of the main problems in previous work on DA detection has been data sparseness in the syntactic slots for which SPs are acquired. In order to detect DAs reliably, we plan to experiment with a very large corpus (e.g. the Gigaword corpus (Graff et al., 2003)). We will compare the resulting DAs to the DAs listed in Levin (1993). We will also evaluate the usefulness of DA features in the verb classification task.

8.2.2 Partial membership model for verb polysemy

Polysemy is a pervasive phenomenon in language, particularly among high-frequency verbs. For example, in Levin's verb classification (Levin, 1993), the verb cut belongs to the CUT class and the SPLIT class, among others. Earlier work on automatic verb classification has largely ignored polysemy by assuming a single class for each verb (e.g. Sun et al. (2008b); Li and Brew (2008); Joanis et al. (2007)). Few attempts have been made to address the problem. Multi-label classification was used for supervised adjective classification (Boleda et al., 2007). For verbs, the use of an unsupervised soft clustering method was investigated by Korhonen et al. (2003), but the output was still based on a single class per verb.

Figure 8.3: The product of two blue Gaussians with means (-1, 1) and variances (0.16, 0.25) is a new Gaussian (in red) with mean -0.219 and variance 0.097

Soft clustering has a few issues when modelling overlapping clusters (Heller et al., 2008). If two senses are modelled as two clusters, a soft clustering method assigns probabilities to the latent cluster memberships, e.g. 0.5 for sense 1 and 0.5 for sense 2. But what is the probability of the word having both senses? Soft clustering does not give a probabilistic interpretation of this. The traditional method is to apply a threshold on the probability. For example, if the threshold is set to 0.1, a word with sense probabilities (0.95, 0.05) would have sense 1 only, but a word with probabilities (0.8, 0.2) would have both senses. The problem is that the threshold is highly data-dependent, and we lose the probabilistic interpretation of sense membership. Ideally, for a word with two senses, the method should not only give the probability of having sense 1 or sense 2, but also the probability of having both senses.

A novel unsupervised clustering method can be applied to model the overlap between lexical categories. A few clustering algorithms (Heller et al., 2008; Fu and Banerjee, 2008) have been proposed for modelling overlapping clusters. They are all based on the products of experts model (Hinton, 2002). In this model, each cluster is modelled as an exponential family distribution, and the product of the distributions can represent a cluster overlap. Figure 8.3 shows the product of two Gaussian distributions. We can use a single exponential family distribution for each class and use products of distributions to model verb polysemy.
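Figure 8.3 can be verified with a short calculation: the product of two Gaussian densities is, up to normalisation, another Gaussian whose precision is the sum of the input precisions:

# Mean and variance of the (renormalised) product of two Gaussians.
def gaussian_product(m1, var1, m2, var2):
    var = 1.0 / (1.0 / var1 + 1.0 / var2)   # precisions add
    mean = var * (m1 / var1 + m2 / var2)    # precision-weighted means
    return mean, var

mean, var = gaussian_product(-1.0, 0.16, 1.0, 0.25)
print(round(mean, 4), round(var, 4))  # -0.2195 0.0976

With means -1 and 1 and variances 0.16 and 0.25, this gives a mean of about -0.2195 and a variance of about 0.0976, matching the (truncated) values in the caption of figure 8.3.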
The inference is achieved using Markov chain Monte Carlo methods. We will investigate the use of efficient inference methods, e.g. hybrid Markov chain Monte Carlo (Bonet-Cunha et al., 1998) or variational Bayes (Attias and Ar, 1999). We have already implemented a version of hybrid Markov chain Monte Carlo in MATLAB which is more efficient than Gibbs sampling. The next step would be to evaluate and tune the method on existing datasets.

8.2.3 Data requirements for reliable clustering

Frequency-based filtering is needed in almost every verb classification experiment. Two types of filtering are needed: feature filtering, to remove features which have a low number of occurrences, and verb filtering, to ignore highly infrequent verbs. One purpose of filtering is to improve clustering accuracy by removing noisy features and verbs for which there is insufficient evidence in the corpus data; table 6.3 shows the effect of verb frequency on clustering performance. Another purpose of filtering is to improve computational efficiency. However, the computational efficiency requirement depends on the complexity of the clustering algorithm and the available computational resources, which differ among experiments.

An interesting question is how to set the frequency cut value. The value can be set automatically for supervised classification: in Li and Brew (2008), it is determined automatically by running cross-validation on the labelled training data with different cut settings, and this can be applied to both features and verbs. Unsupervised learning cannot use this method, as no labelled data is available. Future work could carry out an empirical investigation of the optimal frequency cut for large-scale verb clustering. We can observe the relation between clustering accuracy and the frequency cut value on both features and verbs. Experiments should be carried out across domains, languages and clustering methods.
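For the supervised setting of Li and Brew (2008) mentioned above, the cross-validation search over cut values is simple to set up. The sketch below assumes scikit-learn components and synthetic placeholder data, not our actual experimental pipeline:

# Choose the feature-frequency cut by cross-validation on labelled data.
# The classifier, cut grid and data are illustrative placeholders.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def select_frequency_cut(X, y, feature_freq, cuts=(1, 5, 10, 50, 100)):
    best = (None, -1.0)
    for cut in cuts:
        keep = feature_freq >= cut          # drop features rarer than the cut
        if not keep.any():
            continue
        score = cross_val_score(LinearSVC(), X[:, keep], y, cv=5).mean()
        if score > best[1]:
            best = (cut, score)
    return best

rng = np.random.RandomState(0)
X = rng.rand(60, 20)                        # synthetic feature matrix
y = np.arange(60) % 3                       # synthetic class labels
feature_freq = rng.randint(1, 200, size=20) # synthetic corpus frequencies
print(select_frequency_cut(X, y, feature_freq))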
An alternative approach to filtering features is to use unsupervised feature selection methods (Zhao and Liu, 2007) to select the most useful features. Previous experiments (e.g. Sun et al. (2008b)) showed that low-frequency features can also benefit verb classification. It would be interesting to compare the performance of feature selection methods to that of frequency-based filtering.

8.2.4 Further task and application based evaluations

We applied automatically acquired verb classifications to two NLP tasks: metaphor identification and argumentative zoning. There are many other important NLP tasks that could benefit from such classifications.

One such task is POS tagging. According to the ACL wiki², fully supervised tagger performance on Penn Treebank sections 22-24 had not improved since Toutanova et al. (2003). However, Manning (2011) was able to improve over Toutanova et al.'s results. Among the other improvements, the unknown-word error was reduced by 13% by using word classes induced with Clark (2003)'s method. Those clustering features are shallow: co-occurrences and morphological features. Future work could investigate the use of Levin-style verb clusters and noun clusters acquired using more sophisticated or deeper features (e.g. SCFs and GRs).

²http://aclweb.org/aclwiki/index.php?title=POS_Tagging_%28State_of_the_art%29

We have just released an automatically induced verb classification for the biomedical domain (see section 7.3). This classification could be used to help event extraction, as in the work by Kolya et al. (2011), where VerbNet was used to identify event actors. In that work, all the thematic roles and frames of the verbs in VerbNet were extracted. For a parsed sentence, the verb's argument structure was compared to its frames in VerbNet. If a match was found, the event actor corresponding to each event verb was tagged with the actor information in the appropriate slot of the sentence. Here is an example from Kolya et al. (2011):

Sentence: Ram killed Shyam with a knife.
Parser output: (ROOT (S (NP (NNP Ram)) (VP (VBD killed) (NP (NNS Shyam)) (PP (IN with) (NP (DT a) (NN knife)))) (. .)))
Acquired argument structure: [NP VP NP PP-with]
Matching frame from VerbNet: [ ]

The system was compared against a strong baseline: the noun in the subject relation in the output of the Stanford parser (De Marneffe et al., 2006) (an approach similar to that employed in Vlachos et al. (2009a)). The F of the baseline is 65.98, and the F of the approach using VerbNet is 67.99. Kolya et al.'s method could be extended with automatically acquired verb classes, which can cover verb types frequent in biomedical texts that are missing from VerbNet. In addition, the statistical information related to the frames in the clustering input can make the frame matching process probabilistic (e.g. we know the probability of a verb taking a certain frame).
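As a sketch of this idea, the lookup below returns the probability mass of matching frames rather than a yes/no answer; the frame inventory and its probabilities are hypothetical, not drawn from VerbNet or from our released classifications:

# Probabilistic frame matching: how likely is this verb to take the
# parsed argument structure? FRAME_LEXICON is a hypothetical inventory.
FRAME_LEXICON = {
    "kill": [("NP VP NP", 0.7), ("NP VP NP PP-with", 0.3)],  # invented values
}

def match_frame(verb, arg_structure, lexicon):
    """Probability mass of the verb's frames matching the parsed structure."""
    return sum(p for frame, p in lexicon.get(verb, []) if frame == arg_structure)

print(match_frame("kill", "NP VP NP PP-with", FRAME_LEXICON))  # 0.3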
The work presented in this thesis has taken research on automatic verb classification much closer to the point where it can realistically be used to benefit this task and many other NLP tasks that can exploit Levin-style classes tuned to the domain, language or task in question.

Appendix A

Features used in this thesis

Chapter 3:
F1 (F-CO): Co-occurrence (Li and Brew, 2008)
F2 (F-PP): Prepositional preference; a subset of F-LP which includes only the type and frequency of prepositions in the indirect object relation
F3 (F-LP): Lexical preference, extracted as in Korhonen et al. (2008) using the RASP parser
F4 (F-SCF): Basic SCF, extracted using Preiss et al. (2007)'s system
F5 (F-SCF+CO): The concatenation of F-SCF and F-CO
F6 (F-SCF+TENSE(B)): F-SCF with the tense of the verb; the frequency of verbal POS tags is calculated specific to each SCF
F7 (F-SCF+PP(B)): F-SCF with all PP frames parameterized for prepositions (Korhonen et al., 2008)
F8 (F-SCF(B)): Basic SCF feature, as in Sun et al. (2008b), extracted from the VALEX lexicon
F9: F7 with F3 (subject only)
F10: F7 with F3 (object only)
F11: F7 with F3 (subject, object, indirect object)
F12-14: F9-F11 with 20 clusters from 200 argument heads
F15-17: F9-F11 with 30 clusters from 500 argument heads

Chapter 4:
F1 (F-SCF): Basic SCF, extracted using Preiss et al. (2007)'s system
F2 (F-SCF+PP(B)): F-SCF with all PP frames parameterized for prepositions (Korhonen et al., 2008)
F3 (F-SCF+LP(A)): F-SCF parameterized by F-LP in all argument slots (Korhonen et al., 2008)

Chapter 5:
F1 (F-SCF): Basic SCF, extracted using Preiss et al. (2007)'s system
F2 (F-SCF+PP(A)): F-SCF with two high-frequency PP frames parameterized for prepositions: the PP and NP-PP frames (Korhonen et al., 2008)
F3 (F-SCF+PP(B)): F-SCF with all PP frames parameterized for prepositions (Korhonen et al., 2008)
F4 (F-SCF+TENSE(A)): F-SCF with the tense of the verb; the frequency of verbal POS tags is calculated over all SCFs (Korhonen et al., 2008)
F5 (F-SCF+TENSE(B)): Same as above, but the frequency of verbal POS tags is calculated specific to each SCF
F6 (F-SCF+VOICE(A)): F-SCF with the active/passive voice of the verb; the frequency of the voice is calculated over all SCFs (Korhonen et al., 2008)
F7 (F-SCF+VOICE(B)): Same as above, but the frequency of voice is calculated specific to each SCF
F8 (F-SCF+SP(A)): F-SCF parameterized by F-SP in all argument slots. As in Korhonen et al. (2008), the SPs are acquired automatically by clustering the argument heads. The number of clusters was set to 10
F9 (F-SCF+SP(B)): The number of clusters was set to 20
F10 (F-SCF+SP(C)): The number of clusters was set to 50

Chapter 6:
F1 (F-SCF): Basic SCF, extracted using Preiss et al. (2007)'s system
F2 (F-SCF+TENSE(B)): F-SCF with the tense of the verb; the frequency of verbal POS tags is calculated specific to each SCF
F3: F2, with SCFs parameterized for prepositions
F4, F6, F8 (F-CO): COs extracted from windows of 4, 6 and 8 words respectively; the relative word position is ignored
F5, F7, F9: F4, F6 and F8 with the relative word position recorded
F10 (F-PP): the type and frequency of prepositions in the preposition relation
F11: the type and frequency of nouns in the subject relation
F12: the type and frequency of nouns in the object and indirect object relations
F13 (F-LP): the combination of F10-F12
F14-F16: F1-F3 parameterized for LPs
F17: F3 refined with SPs

Table A.1: Summary of all the features used in this thesis. The references refer to features used in previous research (table 2.4). Features without references are our new features.

Appendix B

Similarity measures used in this thesis

Kullback-Leibler divergence: D_kl(v, v′) = Σ_{i}^{D} v(i) ln(v(i) / v′(i)). The measure is not symmetric, namely D_kl(v, v′) ≠ D_kl(v′, v).
An asymmetric measure cannot be used for building an undirected graph, so it cannot be used for SPEC and HGFC.

Skew divergence (Lee, 2001): The input vector v is smoothed with v′. The level of smoothing is controlled by a, whose value is set close to 1. The formula is D_skew(v, v′) = D_kl(v′ || a · v + (1 − a) · v′). This measure is asymmetric. In our experiments, we symmetrize the skew divergence as D_sskew(v, v′) = (1/2)(D_skew(v, v′) + D_skew(v′, v)). The measure was only used in the experiments described in chapter 3; we employed JSD for all other experiments. Compared to JSD, this measure has an extra smoothing parameter, whose value is difficult to set automatically without labelled training data.

Jensen-Shannon divergence (Lin, 1991): The average of v and v′ is denoted m. The formula is D_jsd(v, v′) = (1/2) D_kl(v || m) + (1/2) D_kl(v′ || m). This measure is symmetric.

Bhattacharyya kernel (Jebara and Kondor, 2003): The Bhattacharyya kernel is a simple symmetric similarity measure. Its mean-field bound is very similar to the KL divergence kernel (Jebara et al., 2004). The measure was only used for theoretical analysis and fast computation in the DA experiments. The formula is w_b(v, v′) = Σ_{i=1}^{D} √(v(i) · v′(i)).

Table B.1: A list of all the similarity measures used in this thesis. v and v′ are two input vectors with D dimensions. The first three measures are in fact distance measures; a distance measure D can be converted to a similarity measure by w(v, v′) = exp(−D(v, v′)).

Appendix C

A proof of the impact of diathesis alternation features on the kernel

Assume we have two verbs v1 and v2 with feature vector dimensionality D. Let v1's features f_11 . . . f_1D have ranks r_11 . . . r_1D according to their frequency. Consider a low-frequency feature f_11 and a high-frequency feature f_12, so that r_11 > r_12, and denote φ_112 = f_11 / f_12. The impact of f_ij on the DA feature space can be written as

Σ_{m,n ∈ D | f_mn > f_ij} min(f_mn, f_ij) = a_ij × f_ij, where a_ij = 2 r_ij − 1    (C.1)

Let verb v2 have features f_21 and f_22; we assume f_21 = f_22 for simplicity, because the focus is on f_11 and f_12. According to equation 8.6, the ratio of f_11's and f_12's contributions to the Bhattacharyya kernel is:

√(f_11 f_21) / √(f_12 f_22) = √φ_112  (given f_21 = f_22)    (C.2)

This shows that the ratio of the impact of the frame features is proportional to the frequency ratio φ. The corresponding ratio for the DA features' contribution is:

√((2 r_11 − 1) f_11 f_21) / √((2 r_12 − 1) f_12 f_22) = √(a_11 / a_12) · √φ_112    (C.3)

The ratio a_11 / a_12 is denoted A. A is related to the frequency ranks of the frame features, and it is invariant to the underlying frequency distribution given the ordering. It balances the impact of the feature frequency ratio φ: if φ < 1 then A > 1, and vice versa. Therefore, the impact of low-frequency features like f_11 is increased. The range of A is relatively small compared to that of φ, so the high-frequency features, in general, still have a larger impact than the low-frequency features. This is a nice property, as high-frequency features are often considered more reliable than low-frequency ones.
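The effect can be checked numerically. In the sketch below, the ranks and frequencies are invented; the two printed values correspond to the frame-feature ratio of equation C.2 and the DA-feature ratio of equation C.3:

# Numeric sanity check of Appendix C: the DA features scale each frame's
# kernel contribution by sqrt(a_ij) with a_ij = 2*r_ij - 1, boosting the
# relative contribution of lower-ranked (less frequent) frames.
from math import sqrt

f11, r11 = 10.0, 8    # low-frequency frame (rank 8)
f12, r12 = 100.0, 2   # high-frequency frame (rank 2)
phi = f11 / f12

frame_ratio = sqrt(phi)                                     # equation C.2
da_ratio = sqrt((2 * r11 - 1) / (2 * r12 - 1)) * sqrt(phi)  # equation C.3
print(round(frame_ratio, 3), round(da_ratio, 3))  # 0.316 0.707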
Appendix D

The list of verbs in T1 and T3

The verb selection criteria used in T1 and T3 were originally proposed in Joanis et al. (2007) and Stevenson and Joanis (2003). However, the actual verbs are not listed in those two papers, so we include here all the verbs used in T1 and T3.

Putting (9.1-6): dip install lay lean lodge lower mount place put raise rest sit situate suspend tuck
Spray/Load (9.7): cram inject load pack pile plant pump scatter settle smear spray spread stick stuff wrap
Fill (9.8): bind block choke cover decorate edge endow face frame infect line pave spot staff surround
Wipe (10.4.1-2): comb erase filter flush lick pluck polish prune scour scrub shave skim strain suck wear
Steal and Remove (10.1, 10.5): capture discharge dismiss eliminate expel extract grab recover remove rescue seize separate snatch steal withdraw
Cheat (10.6): absolve acquit burgle cheat con cure defraud deprive free rob
Recipient (13.1, 13.3): allocate award extend feed give grant issue leave lend offer owe pay sell serve vote
Object Drop (26.1, 26.3, 26.7): assemble build cast compose dance develop direct fix make mix perform play prepare produce write
Amuse (31.1): affect concern encourage engage impress inspire move relax satisfy threaten throw touch transport try worry
Admire (31.2): admire appreciate enjoy fancy fear hate like love miss respect support tolerate trust value worship
Light and Substance (43.1, 43.4): beam blink emanate flare flash flicker glare gleam glow leak puff radiate seep shed shine
Sound (43.2): bellow buzz clash clatter cling cry groan hiss moan murmur rattle roar scream shriek thump
Change of State (45.1-4): burst change close collapse decrease divide double expand improve increase operate sink strengthen stretch vary
Run (51.3.2): charge crawl creep drift hurry jump leap march race rush slide stumble travel walk wander

Table D.1: Verbs and Levin verb classes in T1

Putting (9.1-6): dip drip drop funnel lean lodge mount place position put rest scoop sit spew spill stash tuck twist wedge wind
Spray/Load (9.7): cram daub drape heap inject load pack pile plant pump smudge sow sprinkle stack stick stock strew string stuff wrap
Fill (9.8): bandage blot contaminate dam dapple deck douse drench emblazon endow face festoon garnish mask pave plug saturate soil swathe taint
Steal and Remove (10.1, 10.5): abduct confiscate delete disgorge dislodge eject evict excise excommunicate expel kidnap liberate lop ostracize pirate purloin reap rescue shoo uproot
Wipe (10.4.1-2): bail buff comb expunge filter flush hose leach mop plow pluck polish scour scratch scrub shave shear strain suck wear
Cheat (10.6): acquit bereave bilk bleed burgle cheat cleanse defraud deplete deprive dispossess divest free gull milk pardon plunder rob sap swindle
Recipient (13.1, 13.3): allocate allot assign cede concede extend grant issue leave lend loan offer owe peddle refund rent sell serve trade will
Object Drop (26.1, 26.3, 26.7): blow carve chant chisel choreograph compile direct fashion grind intone knit mix paint perform play prepare recite sculpt toss write
Amuse (31.1): antagonize bewilder dismay dumbfound humble hypnotize infuriate nauseate outrage peeve reassure repel ruffle satisfy spellbind stun tempt terrorize threaten wound
Admire (31.2): adore cherish deplore detest dislike distrust dread hate lament loathe love prize regret relish resent respect revere treasure venerate worship
Sound (43.2): beep blare buzz chime clank clink clunk creak fizzle gurgle peal ping putter rattle sputter swish tinkle toll toot whir
Change of State (45.1-4): ameliorate blunt braise coddle degrade heat heighten lessen loose narrow perk quadruple quicken scallop sink slow solidify steepen thin toughen
Run (51.3.2): bowl clamber flit inch jog limp lumber parade romp rove scamper scuttle sidle slither streak toddle tramp trek vault wade
Table D.2: Verbs and Levin verb classes in T3

Bibliography

O. Abend, R. Reichart, and A. Rappoport. A supervised algorithm for verb disambiguation into VerbNet classes. In Proceedings of COLING, pages 9-16, Stroudsburg, PA, USA, 2008. ACL.
S. P. Abney. Parsing by chunks. In Principle-Based Parsing, pages 257-278. Kluwer Academic Publishers, 1991.
S. Ananiadou and J. McNaught. Text Mining for Biology and Biomedicine. Artech House, Inc., Norwood, MA, USA, 2005.
O. E. Andersen, J. Nioche, E. Briscoe, and J. Carroll. The BNC parsed with RASP4UIMA. In Proceedings of LREC, Marrakech, Morocco, 2008.
H. Attias and L. Ar. Inferring parameters and structure of latent variable models by variational Bayes. In Proceedings of Uncertainty in Artificial Intelligence, volume 30, 1999.
A. Azran and Z. Ghahramani. A new approach to data driven clustering. In Proceedings of ICML, pages 57-64. ACM, New York, NY, USA, 2006.
A. Azran and Z. Ghahramani. Spectral methods for automatic multiscale data clustering. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Volume 1, pages 190-197. IEEE Computer Society, Washington, DC, USA, 2006.
C. F. Baker, C. J. Fillmore, and J. B. Lowe. The Berkeley FrameNet project. In Proceedings of COLING, pages 86-90, Montreal, 1998.
R. Basili, M. T. Pazienza, and P. Velardi. Hierarchical clustering of verbs. In Proceedings of the Workshop on Acquisition of Lexical Knowledge from Text, 1993.
N. Bassiou and C. Kotropoulos. Long distance bigram models applied to word clustering. Pattern Recognition, 44:145-158, January 2011.
S. Bergsma, D. Lin, and R. Goebel. Discriminative learning of selectional preference from unlabeled text. In Proceedings of EMNLP, 2008.
B. Boguraev and E. Briscoe. Large lexicons for natural language processing: utilising the grammar coding system of the Longman Dictionary of Contemporary English. Computational Linguistics, 13(4):219-240, 1987.
B. Boguraev, T. Briscoe, J. Carroll, D. Carter, and C. Grover. The derivation of a grammatically indexed lexicon from the Longman Dictionary of Contemporary English. In Proceedings of ACL, pages 193-200, Stroudsburg, PA, USA, 1987. ACL.
G. Boleda, S. Schulte im Walde, and T. Badia. Modelling polysemy in adjective classes by multi-label classification. In Proceedings of EMNLP-CoNLL, pages 171-180, Prague, Czech Republic, June 2007. ACL.
L. Bonet-Cunha, D. Oliver, R. Redner, and A. Reynolds. A hybrid Markov chain Monte Carlo method for generating permeability fields conditioned to multiwell pressure data and prior information. SPE Journal, 3(3):261-271, 1998.
D. Bourigault, M.-P. Jacques, C. Fabre, C. Frérot, and S. Ozdowska. Syntex, analyseur syntaxique de corpus. In Actes des 12èmes journées sur le Traitement Automatique des Langues Naturelles, 2005.
C. Brew and S. Schulte im Walde. Spectral clustering for German verbs. In Proceedings of EMNLP, 2002.
T. Briscoe and J. Carroll. Automatic extraction of subcategorization from corpora. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 356-363, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.
E. Briscoe, J. Carroll, and R. Watson. The second release of the RASP system. In Proceedings of the COLING/ACL Interactive Presentation Sessions, pages 77-80, 2006.
C. Brockmann and M. Lapata.
Evaluating and combining approaches to selectional preference acquisition. In Proceedings of EACL, 2003.
A. C. Browne, A. T. McCray, and S. Srinivasan. The SPECIALIST Lexicon. National Library of Medicine Technical Reports, 2000.
A. Browne, G. Divita, C. Lu, L. McCreedy, and D. Nace. Technical Report LHNCBC-TR-2003-003, Lexical Systems: A report to the Board of Scientific Counselors. Lister Hill National Center for Biomedical Communications, National Library of Medicine, 2003.
N. Chambers, D. Cer, T. Grenager, D. Hall, C. Kiddon, B. MacCartney, M. de Marneffe, D. Ramage, E. Yeh, and C. Manning. Learning alignments and leveraging natural logic. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 165-170. ACL, 2007.
E. Charniak. A maximum-entropy-inspired parser. In Proceedings of NAACL, pages 132-139, Stroudsburg, PA, USA, 2000. ACL.
J. Chen, D.-H. Ji, C. L. Tan, and Z.-Y. Niu. Unsupervised relation disambiguation using spectral clustering. In Proceedings of COLING-ACL, 2006.
W. Chou, R. Tsai, Y. Su, W. Ku, T. Sung, and W. Hsu. A semi-automatic method for annotating a biomedical proposition bank. In Proceedings of the Workshop on Frontiers in Linguistically Annotated Corpora 2006, pages 5-12. ACL, 2006.
S. Clark and J. Curran. Formalism-independent parser evaluation with CCG and DepBank. In Proceedings of ACL, pages 248-255, Prague, Czech Republic, June 2007. ACL.
A. Clark. Combining distributional and morphological information for part of speech induction. In Proceedings of EACL, pages 59-66, 2003.
W. Cochran. The comparison of percentages in matched samples. Biometrika, pages 256-266, 1950.
D. Crouch and T. H. King. Unifying lexical resources. In Proceedings of the Interdisciplinary Workshop on the Identification and Representation of Verb Features and Verb Classes, Saarbruecken, Germany, 2005.
H. T. Dang. Investigations into the Role of Lexical Semantics in Word Sense Disambiguation. PhD thesis, CIS, University of Pennsylvania, 2004.
M. De Marneffe, B. MacCartney, and C. Manning. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC, volume 6, pages 449-454, 2006.
T. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895-1923, 1998.
C. Ding, T. Li, and M. Jordan. Convex and semi-nonnegative matrix factorizations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(1):45-55, 2010.
R. Dixon. Ergativity. Cambridge Studies in Linguistics. Cambridge University Press, 1994.
A. Dolbey, M. Ellsworth, and J. Scheffczyk. BioFrameNet: A domain-specific FrameNet extension with links to biomedical ontologies. In Proceedings of the "Biomedical Ontology in Action" Workshop at KR-MED, pages 87-94, 2006.
K. Erk, A. Kowalski, S. Padó, and M. Pinkal. Towards a resource for lexical semantics: a large German corpus with extensive semantic annotation. In Proceedings of ACL, pages 537-544, Stroudsburg, PA, USA, 2003. ACL.
K. Erk. A simple, similarity-based model for selectional preferences. In Proceedings of ACL, 2007.
G. Escudero, L. Màrquez, and G. Rigau. A comparison between supervised learning algorithms for word sense disambiguation. In Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning, pages 31-36. ACL, 2000.
I. Falk, C. Gardent, and J.-C. Lamirel.
Classifying French verbs using French and English lexical resources. In Proceedings of ACL. ACL, 2012.
D. Fass. met*: A method for discriminating metonymy and metaphor by computer. Computational Linguistics, 17(1):49-90, 1991.
C. Fellbaum, editor. WordNet: An Electronic Lexical Database (ISBN: 0-262-06197-X). MIT Press, first edition, 1998.
E. E. Ferrer. Towards a semantic classification of Spanish verbs based on subcategorisation information. In Proceedings of the ACL 2004 Student Research Workshop, Stroudsburg, PA, USA, 2004. ACL.
D. H. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.
C. Friedman, P. Kra, and A. Rzhetsky. Two biomedical sublanguages: a description based on the theories of Zellig Harris. Journal of Biomedical Informatics, 35(4):222-235, 2002.
Q. Fu and A. Banerjee. Multiplicative mixture models for overlapping clustering. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, pages 791-796. IEEE Computer Society, Washington, DC, USA, 2008.
M. Gedigan, J. Bryant, S. Narayanan, and B. Ciric. Catching metaphors. In Proceedings of the 3rd Workshop on Scalable Natural Language Understanding, pages 41-48, New York, 2006.
J. Geiß. Latent semantic sentence clustering for multi-document summarization. Technical Report UCAM-CL-TR-802, University of Cambridge, Computer Laboratory, July 2011.
A. Genkin, D. Lewis, and D. Madigan. BMR: Bayesian Multinomial Regression Software. DIMACS, http://www.stat.rutgers.edu/madigan/BMR/, accessed on 14/02/2008.
G. Golub and C. Reinsch. Singular value decomposition and least squares solutions. Numerische Mathematik, 14(5):403-420, 1970.
D. Graff, J. Kong, K. Chen, and K. Maeda. English Gigaword. Linguistic Data Consortium, Philadelphia, 2003.
D. Graff. North American News Text Corpus. Linguistic Data Consortium, 1995.
R. Grishman, C. Macleod, and A. Meyers. COMLEX syntax: building a computational lexicon. In Proceedings of the International Conference on Computational Linguistics, Kyoto, Japan, 1994.
M. Gross. Méthodes en syntaxe. Hermann, Paris, 1975.
Y. Guo, A. Korhonen, M. Liakata, I. Silins, L. Sun, and U. Stenius. Identifying the information structure of scientific abstracts: an investigation of three different schemes. In Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, pages 99-107, Stroudsburg, PA, USA, 2010. ACL.
Y. Guo, A. Korhonen, I. Silins, and U. Stenius. Weakly supervised learning of information structure of scientific abstracts—is it accurate enough to benefit real-world tasks in biomedicine? Bioinformatics, 27(22):3179-3185, 2011.
Y. Guo, A. Korhonen, and T. Poibeau. A weakly-supervised approach to argumentative zoning of scientific documents. In Proceedings of EMNLP, pages 273-283, Stroudsburg, PA, USA, 2011. ACL.
K. Hale and J. Keyser. A view from the middle. Lexicon Project Working Papers 10, 1987.
Z. Harris. Distributional structure. Word, 10(23):146-162, 1954.
K. A. Heller and Z. Ghahramani. Bayesian hierarchical clustering. In Proceedings of ICML, pages 297-304. ACM, 2005.
K. Heller, S. Williamson, and Z. Ghahramani. Statistical models for partial membership. In Proceedings of ICML, pages 392-399. ACM, New York, NY, USA, 2008.
S. Hensman and J. Dunnion. Automatically building conceptual graphs using VerbNet and WordNet. In Proceedings of the 3rd International Symposium on Information and Communication Technologies (ISICT), pages 115-120, Las Vegas, NV, 2004.
G. Hinton.
Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002.
L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2:193-218, 1985.
R. Jackendoff. Semantic Structures. The M.I.T. Press, Cambridge, MA, 1990.
A. Jain, M. Murty, and P. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3), 1999.
T. Jebara and R. Kondor. Bhattacharyya and expected likelihood kernels. In Learning Theory and Kernel Machines: 16th Annual Conference on Learning Theory and 7th Kernel Workshop, page 57. Springer, 2003.
T. Jebara, R. Kondor, and A. Howard. Probability product kernels. The Journal of Machine Learning Research, 5:819-844, 2004.
E. Joanis, S. Stevenson, and D. James. A general feature space for automatic verb classification. Natural Language Engineering, 2007.
E. Joanis. Automatic Verb Classification Using a General Feature Space. Master's thesis, University of Toronto, 2002.
R. Jones. Semi-supervised learning on small worlds. In Link Discovery Workshop at KDD, 2004.
P. Kingsbury and M. Palmer. From TreeBank to PropBank. In Proceedings of LREC-2002, Gran Canaria, Canary Islands, Spain, 2002.
K. Kipper, A. Korhonen, N. Ryant, and M. Palmer. Extensive classifications of English verbs. In Proceedings of the 12th EURALEX International Congress, 2006.
K. Kipper, A. Korhonen, N. Ryant, and M. Palmer. Extending VerbNet with novel verb classes. In Proceedings of LREC, Genova, Italy, 2006.
K. Kipper, A. Korhonen, N. Ryant, and M. Palmer. A large-scale classification of English verbs. Language Resources and Evaluation, 42:21-40, 2008.
K. Kipper-Schuler. VerbNet: A broad-coverage, comprehensive verb lexicon. PhD thesis, Computer and Information Science Dept., University of Pennsylvania, Philadelphia, PA, June 2005.
T. Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9):1464-1480, 1990.
A. K. Kolya, A. Ekbal, and S. Bandyopadhyay. A hybrid approach for event extraction and event actor identification. In Proceedings of RANLP, pages 592-597, Hissar, Bulgaria, September 2011.
A. Korhonen, Y. Krymolowski, and Z. Marx. Clustering polysemic subcategorization frame distributions semantically. In Proceedings of ACL, pages 64-71, Morristown, NJ, USA, 2003. ACL.
A. Korhonen, Y. Krymolowski, and T. Briscoe. A large subcategorization lexicon for natural language processing applications. In Proceedings of LREC, 2006.
A. Korhonen, Y. Krymolowski, and N. Collier. Automatic classification of verbs in biomedical texts. In Proceedings of ACL, pages 345-352, Morristown, NJ, USA, 2006. ACL.
A. Korhonen, Y. Krymolowski, and N. Collier. The choice of features for classification of verbs in biomedical texts. In Proceedings of COLING, 2008.
A. Korhonen. Subcategorization Acquisition. PhD thesis, University of Cambridge, UK, 2002.
S. Krishnakumaran and X. Zhu. Hunting elusive metaphors using lexical resources. In Proceedings of the Workshop on Computational Approaches to Figurative Language, pages 13-20, Rochester, NY, 2007.
C. Kunze and L. Lemnitzer. GermaNet—representation, visualization, application. In Proceedings of LREC, 2002.
J.-C. Lamirel, R. Mall, P. Cuxac, and G. Safi. Variations to incremental growing neural gas algorithm based on label maximization. In International Joint Conference on Neural Networks (IJCNN 2011), San Jose, United States, July 2011.
M. Lapata. Acquiring lexical generalizations from corpora: A case study for diathesis alternations. In Proceedings of ACL, pages 397-404.
J. H. Lau, P. Cook, D. McCarthy, D. Newman, and T. Baldwin. Word sense induction for novel sense detection. In Proceedings of EACL, pages 591–601, 2012.
L. Lee. On the effectiveness of the skew divergence for statistical language analysis. In Artificial Intelligence and Statistics, pages 65–72, 2001.
G. Leech. 100 million words of English: the British National Corpus. Language Research, 28(1):1–13, 1992.
B. Levin and M. Rappaport Hovav. Argument realization. Computational Linguistics, 32(3):447–450, 2006.
B. Levin. English Verb Classes and Alternations: a preliminary investigation. University of Chicago Press, Chicago and London, 1993.
J. Li and C. Brew. Which are the best features for automatic verb classification. In Proceedings of ACL, 2008.
M. Liakata, S. Saha, S. Dobnik, C. Batchelor, and D. Rebholz-Schuhmann. Automatic recognition of conceptualisation zones in scientific articles and two life science applications. Bioinformatics, 2012.
D. Lin and P. Pantel. Discovery of inference rules for question-answering. Natural Language Engineering, 7(4):343–360, 2001.
Y.-R. Lin, Y. Chi, S. Zhu, H. Sundaram, and B. L. Tseng. FacetNet: a framework for analyzing communities and their evolutions in dynamic networks. In Proceedings of the 17th International Conference on World Wide Web, pages 685–694, New York, NY, USA, 2008. ACM.
J. Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, 1991.
T. Lippincott, D. Ó Séaghdha, and A. Korhonen. Exploring subdomain variation in biomedical language. BMC Bioinformatics, 12(1):212, 2011.
T. Lippincott, D. Ó Séaghdha, and A. Korhonen. Learning syntactic verb frames using graphical models. In Proceedings of ACL. ACL, 2012.
C. E. Lipscomb. Medical Subject Headings (MeSH). Bulletin of the Medical Library Association, 88(3):265–266, 2000.
M. Meila and J. Shi. A random walks view of spectral segmentation. In Proceedings of AISTATS, 2001.
C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.
C. D. Manning. Part-of-speech tagging from 97% to 100%: is it time for some linguistics? In Proceedings of CICLING, pages 171–189, Berlin, Heidelberg, 2011. Springer-Verlag.
Y. Matsuo, T. Sakaki, K. Uchiyama, and M. Ishizuka. Graph-based word clustering using a web search engine. In Proceedings of EMNLP, pages 542–550, 2006.
D. McCarthy and A. Korhonen. Detecting verbal participation in diathesis alternations. In Proceedings of ACL, volume 36, pages 1493–1495. ACL, 1998.
D. McCarthy. Using semantic preferences to identify verbal participation in role switching alternations. In Proceedings of NAACL, pages 256–263. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2000.
D. McCarthy. Lexical Acquisition at the Syntax-Semantics Interface: Diathesis Alternations, Subcategorization Frames and Selectional Preferences. PhD thesis, University of Sussex, UK, 2001.
Q. McNemar. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2):153–157, 1947.
P. Merlo and S. Stevenson. Automatic verb classification based on statistical distributions of argument structure. Computational Linguistics, 27(3):373–408, 2001.
P. Merlo, S. Stevenson, V. Tsang, and G. Allaria. A multilingual paradigm for automatic verb classification. In Proceedings of ACL, 2002.
C. Messiant, T. Poibeau, and A. Korhonen. LexSchem: a large subcategorization lexicon for French verbs. In Proceedings of LREC, 2008.
C. Messiant. ASSCI: a subcategorization frames acquisition system for French. In Proceedings of the ACL Student Research Workshop, pages 55–60, 2008.
G. A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
Y. Miyao, R. Sætre, K. Sagae, T. Matsuzaki, and J. Tsujii. Task-oriented evaluation of syntactic parsers and their representations. In Proceedings of ACL, pages 46–54, Columbus, Ohio, June 2008. ACL.
Y. Mizuta, A. Korhonen, T. Mullen, and N. Collier. Zone analysis in biology articles as a basis for information extraction. International Journal of Medical Informatics, 75(6):468–487, 2006.
P. J. Moreno, P. P. Ho, and N. Vasconcelos. A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications. In Proceedings of NIPS, volume 16, pages 1385–1392, 2004.
R. Navigli and G. Crisafulli. Inducing word senses to improve web search result clustering. In Proceedings of EMNLP, pages 116–126. ACL, 2010.
S. J. Nelson, T. Powell, and B. L. Humphreys. The Unified Medical Language System (UMLS) project. In Encyclopedia of Library and Information Science, pages 369–378. Marcel Dekker, 2002.
A. Y. Ng, M. Jordan, and Y. Weiss. On spectral clustering: analysis and an algorithm. In Proceedings of NIPS, page 849. MIT Press, 2002.
T. E. Oliphant. Python for scientific computing. Computing in Science and Engineering, 9:10–20, 2007.
D. Ó Séaghdha and A. Copestake. Semantic classification with distributional kernels. In Proceedings of COLING, 2008.
D. Ó Séaghdha and A. Korhonen. Modelling selectional preferences in a lexical hierarchy. In Proceedings of the Joint Conference on Lexical and Computational Semantics, 2012.
D. Ó Séaghdha. Latent variable models of selectional preference. In Proceedings of the 48th Annual Meeting of the ACL, pages 435–444. ACL, 2010.
M. Palmer. Consistent criteria for sense distinctions. Computers and the Humanities, pages 217–222, 2000.
P. Pantel and D. Lin. Discovering word senses from text. In Proceedings of SIGKDD, pages 613–619. ACM, 2002.
T. Pham, H. Ng, and W. Lee. Word sense disambiguation with semi-supervised learning. In Proceedings of AAAI, volume 20, page 1093, 2005.
S. Pinker. Learnability and Cognition: The Acquisition of Argument Structure. MIT Press, Cambridge, MA, 1989.
S. P. Ponzetto and R. Navigli. Knowledge-rich word sense disambiguation rivaling supervised systems. In Proceedings of ACL, pages 1522–1531, 2010.
J. Preiss, T. Briscoe, and A. Korhonen. A system for large-scale acquisition of verbal, nominal and adjectival subcategorization frames from corpora. In Proceedings of ACL, volume 45, page 912, 2007.
J. Puzicha, T. Hofmann, and J. M. Buhmann. A theory of proximity based clustering: structure detection by optimization. Pattern Recognition, 33(4):617–634, 2000.
P. Resnik. Selection and Information: A Class-based Approach to Lexical Relationships. PhD thesis, University of Pennsylvania, Philadelphia, PA, USA, 1993.
M. Rios, W. Aziz, and L. Specia. TINE: a metric to assess MT adequacy. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 116–122. ACL, 2011.
D. L. T. Rohde, L. M. Gonnerman, and D. C. Plaut. An improved method for deriving word meaning from lexical co-occurrence. Cognitive Psychology, 7:573–605, 2004.
T. Rose, M. Stevenson, and M. Whitehead. The Reuters Corpus Volume 1: from yesterday's news to tomorrow's language resources. In Proceedings of LREC, pages 29–31, 2002.
A. Rosenberg and J. Hirschberg. V-measure: a conditional entropy-based external cluster evaluation measure. In Proceedings of EMNLP-CoNLL, 2007.
P. Ruch, C. Boyer, C. Chichester, I. Tbahriti, A. Geissbühler, P. Fabry, J. Gobeill, V. Pillet, D. Rebholz-Schuhmann, C. Lovis, and A.-L. Veuthey. Using argumentation to extract key sentences from biomedical abstracts. International Journal of Medical Informatics, 76(2-3):195–200, 2007.
K. Sagae and J. Tsujii. Dependency parsing and domain adaptation with LR models and parser ensembles. In Proceedings of the EMNLP-CoNLL 2007 shared task, pages 1044–1050, Prague, Czech Republic, 2007.
P. Saint-Dizier. Verb semantic classes based on 'alternations' and WordNet-like criteria. In P. Saint-Dizier, editor, Predicative Forms in Natural Language and Lexical Knowledge Bases, pages 247–279. Kluwer Academic, 1998.
J. Salojärvi, K. Puolamäki, and S. Kaski. On discriminative joint density modeling. Machine Learning: ECML 2005, pages 341–352, 2005.
Y. Sasaki, S. Montemagni, P. Pezik, D. Rebholz-Schuhmann, J. McNaught, and S. Ananiadou. BioLexicon: a lexical resource for the biology domain. In T. Salakoski and S. Pyysalo, editors, Proceedings of the Third International Symposium on Semantic Mining in Biomedicine (SMBM 2008), Turku, Finland, pages 109–116. Turku Centre for Computer Science (TUCS), 2008.
S. Schulte im Walde and C. Brew. Inducing German semantic verb classes from purely syntactic subcategorisation information. In Proceedings of ACL, pages 223–230, Morristown, NJ, USA, 2001. ACL.
S. Schulte im Walde, C. Hying, C. Scheible, and H. Schmid. Combining EM training and the MDL principle for an automatic verb classification incorporating selectional preferences. In Proceedings of ACL, pages 496–504, 2008.
S. Schulte im Walde. Experiments on the choice of features for learning verb classes. In Proceedings of EACL, pages 315–322, Morristown, NJ, USA, 2003. ACL.
S. Schulte im Walde. Experiments on the automatic induction of German semantic verb classes. Computational Linguistics, 32(2):159–194, 2006.
S. Schulte im Walde. Human associations and the choice of features for semantic verb classification. Research on Language and Computation, 6:79–111, 2008.
D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Proceedings of NIPS, volume 13, pages 556–562, 2001.
J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
L. Shi and R. Mihalcea. Putting pieces together: combining FrameNet, VerbNet and WordNet for robust semantic parsing. In Proceedings of CICLING, 2005.
E. Shutova and S. Teufel. Metaphor corpus annotated for source-target domain mappings. In Proceedings of LREC 2010, Malta, 2010.
E. Shutova, L. Sun, and A. Korhonen. Metaphor identification using verb and noun clustering. In Proceedings of COLING, pages 1002–1010. ACL, 2010.
E. Shutova. Automatic metaphor interpretation as a paraphrasing task. In Proceedings of NAACL, Los Angeles, USA, 2010.
E. V. Shutova. Computational approaches to figurative language. Technical report, University of Cambridge, Computer Laboratory, August 2011.
S. Siegel and N. J. Castellan. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill Book Company, New York, USA, 1988.
S. Siegel. Nichtparametrische statistische Methoden. Klotz, 5th unrevised edition, 2001.
M. Silverstein. Hierarchy of features and ergativity. Humanities Press, 1976.
A. Søgaard. Semisupervised condensed nearest neighbor for part-of-speech tagging. In Proceedings of ACL (short papers), volume 2, pages 48–52, 2011.
I. Spasic, S. Ananiadou, and J. Tsujii. MaSTerClass: a case-based reasoning system for the classification of biomedical terms. Bioinformatics, 21(11):2748–2758, 2005.
S. Stevenson and E. Joanis. Semi-supervised verb class discovery using noisy features. In Proceedings of HLT-NAACL, pages 71–78, Morristown, NJ, USA, 2003. ACL.
L. Sun and A. Korhonen. Improving verb clustering with automatically acquired selectional preferences. In Proceedings of EMNLP, pages 638–647, 2009.
L. Sun and A. Korhonen. Hierarchical verb clustering using graph factorization. In Proceedings of EMNLP, pages 1023–1033, Edinburgh, UK, July 2011. ACL.
L. Sun, A. Korhonen, and Y. Krymolowski. Automatic classification of English verbs using rich syntactic features. In Proceedings of IJCNLP, Hyderabad, India, 2008.
L. Sun, A. Korhonen, and Y. Krymolowski. Verb class discovery from rich syntactic data. Lecture Notes in Computer Science, 4919:16, 2008.
L. Sun, A. Korhonen, T. Poibeau, and C. Messiant. Investigating the cross-linguistic potential of VerbNet-style classification. In Proceedings of COLING, pages 1056–1064. ACL, 2010.
Y. Suzuki and F. Fukumoto. Classifying Japanese polysemous verbs based on fuzzy c-means clustering. In Proceedings of TextGraphs-4, pages 32–40, 2009.
R. Swier and S. Stevenson. Unsupervised semantic role labelling. In Proceedings of EMNLP, pages 95–102, 2004.
M. Swift. Towards automatic verb acquisition from VerbNet for spoken dialog processing. In Proceedings of the Interdisciplinary Workshop on the Identification and Representation of Verb Features and Verb Classes, Saarbruecken, Germany, 2005.
M. Tate and S. Brown. Note on the Cochran Q test. Journal of the American Statistical Association, 65(329):155–160, 1970.
I. Tbahriti, C. Chichester, F. Lisacek, and P. Ruch. Using argumentation to retrieve articles with similar citations: an inquiry into improving related articles search in the MEDLINE digital library. International Journal of Medical Informatics, 75, 2005.
Y. W. Teh, H. Daumé III, and D. Roy. Bayesian agglomerative clustering with coalescents. In Advances in Neural Information Processing Systems, volume 20, 2008.
S. Teufel and M. Moens. Summarizing scientific articles: experiments with relevance and rhetorical status. Computational Linguistics, 28(4):409–445, December 2002.
S. Teufel. Argumentative Zoning: Information Extraction from Scientific Text. PhD thesis, University of Edinburgh, 1999.
K. Toutanova, D. Klein, C. D. Manning, and Y. Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of NAACL, pages 173–180, Stroudsburg, PA, USA, 2003. ACL.
V. Tsang and S. Stevenson. Using selectional profile distance to detect verb alternations. In HLT/NAACL 2004 Workshop on Computational Lexical Semantics, 2004.
T. Tuells. The derivation of a large computational lexicon for a two-level morphological analyzer for Catalan from a machine-readable dictionary. 1997.
A. Ushioda. Hierarchical clustering of words. In Proceedings of COLING, pages 1159–1162. ACL, 1996.
V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, NY, USA, 1995.
G. Vázquez, A. Fernández, I. Castellón, and M. A. Martí. Clasificación verbal: alternancias de diátesis. Quaderns de Sintagma. Universitat de Lleida, 2000.
D. Verma and M. Meila. Comparison of spectral clustering methods. Advances in Neural Information Processing Systems (NIPS 15), 2003.
N. X. Vinh, J. Epps, and J. Bailey. Information theoretic measures for clusterings comparison: is a correction for chance necessary? In Proceedings of ICML, pages 1073–1080, New York, NY, USA, 2009. ACM.
A. Vlachos, P. Buttery, D. Ó Séaghdha, and T. Briscoe. Biomedical event extraction without training data. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task, pages 37–40, Stroudsburg, PA, USA, 2009. ACL.
A. Vlachos, A. Korhonen, and Z. Ghahramani. Unsupervised and constrained Dirichlet process mixture models for verb clustering. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, pages 74–82, 2009.
U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
P. Vossen. EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, Dordrecht, 1998.
J. H. Ward Jr. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244, 1963.
T. Wattarujeekrit, P. K. Shah, and N. Collier. PASBio: predicate-argument structures for event extraction in molecular biology. BMC Bioinformatics, 5:155, 2004.
Y. Wilks. Making preferences more active. Artificial Intelligence, 11(3):197–223, 1978.
S. Wold, K. Esbensen, and P. Geladi. Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1):37–52, 1987.
Z. Wu and R. Leahy. An optimal graph theoretic approach to data clustering: theory and its application to image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1101–1113, 1993.
Z. Yang and E. Oja. Clustering by low-rank doubly stochastic matrix decomposition. In Proceedings of ICML, 2012.
K. Yu, S. Yu, and V. Tresp. Soft clustering on graphs. In Proceedings of NIPS, volume 18, page 1553. MIT Press, 2006.
B. Zapirain, E. Agirre, and L. Màrquez. Robustness and generalization of role sets: PropBank vs. VerbNet. In Proceedings of ACL, pages 550–558, 2008.
L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. In Proceedings of NIPS, volume 17, page 16. MIT Press, 2004.
X. Zhang, Y. Wang, and P. Li. Word alignment based recognizing textual entailment. In Proceedings of the 2010 IEEE International Conference on Progress in Informatics and Computing (PIC), volume 1, pages 352–355. IEEE, 2010.
Z. Zhao and H. Liu. Spectral feature selection for supervised and unsupervised learning. In Proceedings of ICML, pages 1151–1157, New York, NY, USA, 2007. ACM.