Refinements in Hierarchical
Phrase-Based
Translation Systems
Juan Miguel Pino
Cambridge University Engineering Department
Dissertation submitted to the University of Cambridge for the degree of
Doctor of Philosophy
Declaration
I hereby declare that the contents of this dissertation are original and have
not been submitted in whole or in part for consideration for any other degree
or qualification in this, or any other University. This dissertation is the result
of my own work and includes nothing which is the outcome of work done in
collaboration, except where specifically indicated in the text. This disser-
tation contains less than 65,000 words including appendices, bibliography,
footnotes, tables and equations and has less than 150 figures.
i
Acknowledgements
First, I am truly indebted to my supervisor Prof. Bill Byrne. Bill provided
relentless support throughout my PhD studies. He taught me how to be
a better writer and a better researcher. I am also very grateful that he
provided me with invaluable opportunities such as participating in translation
competitions or carrying out an internship at Google for three months.
I also would like to thank Dr. Stephen Clark and Dr. Miles Osborne for
being part of my thesis committee and providing very valuable feedback on
the first submission of this thesis.
Without the lab Computer Officers, Patrick Gosling and Anna Langley,
experiments reported in this thesis would not have been possible, I am there-
fore grateful for all their hard work on the computer systems.
I also wish to thank all my colleagues at the Machine Intelligence Lab. Dr.
Adria` de Gispert provided me with critical guidance throughout my research:
two chapters of this thesis originated from a very fruitful collaboration with
him. I am grateful to Aurelien Waite for all his help on infrastructure and
for helping me becoming a better engineer. Thanks to Dr. Gonzalo Iglesias
for illuminating discussions on FSTs for machine translation as well as C++
software engineering. Dr. Graeme Blackwood and Dr. Jamie Brunning pro-
vided much support at the start of my studies as well as critical translation
tools such as word alignment and lattice rescoring tools.
I am also in debt to my office colleagues Matt Shannon, Matt Seigel and
Chao Zhang for making PhD studies a pleasurable experience.
Finally, thank you Anna for your infinite patience, guidance and support
and for telling me to apply for at least two jobs at the end of my studies.
Thank you Sofia and Fania: I will try to teach you many languages, just in
case machine translation doesn’t improve quickly enough.
ii
Abstract
The relatively recently proposed hierarchical phrase-based translation model
for statistical machine translation (SMT) has achieved state-of-the-art per-
formance in numerous recent translation evaluations. Hierarchical phrase-
based systems comprise a pipeline of modules with complex interactions. In
this thesis, we propose refinements to the hierarchical phrase-based model
as well as improvements and analyses in various modules for hierarchical
phrase-based systems.
We took the opportunity of increasing amounts of available training data
for machine translation as well as existing frameworks for distributed com-
puting in order to build better infrastructure for extraction, estimation and
retrieval of hierarchical phrase-based grammars. We design and implement
grammar extraction as a series of Hadoop MapReduce jobs. We store the re-
sulting grammar using the HFile format, which offers competitive trade-offs
in terms of efficiency and simplicity. We demonstrate improvements over two
alternative solutions used in machine translation.
The modular nature of the SMT pipeline, while allowing individual im-
provements, has the disadvantage that errors committed by one module are
propagated to the next. This thesis alleviates this issue between the word
alignment module and the grammar extraction and estimation module by
considering richer statistics from word alignment models in extraction. We
use alignment link and alignment phrase pair posterior probabilities for gram-
mar extraction and estimation and demonstrate translation improvements in
Chinese to English translation.
This thesis also proposes refinements in grammar and language modelling
both in the context of domain adaptation and in the context of the interaction
between first-pass decoding and lattice rescoring. We analyse alternative
strategies for grammar and language model cross-domain adaptation. We
also study interactions between first-pass and second-pass language model in
iii
iv
terms of size and n-gram order. Finally, we analyse two smoothing methods
for large 5-gram language model rescoring.
The last two chapters are devoted to the application of phrase-based
grammars to the string regeneration task, which we consider as a means to
study the fluency of machine translation output. We design and implement a
monolingual phrase-based decoder for string regeneration and achieve state-
of-the-art performance on this task. By applying our decoder to the output
of a hierarchical phrase-based translation system, we are able to recover the
same level of translation quality as the translation system.
Contents
1 Introduction 1
1.1 Machine translation . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Challenges for Translation . . . . . . . . . . . . . . . . 1
1.1.2 Machine Translation Current Quality . . . . . . . . . . 2
1.2 Statistical Machine Translation . . . . . . . . . . . . . . . . . 5
1.2.1 The SMT Pipeline . . . . . . . . . . . . . . . . . . . . 5
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Organisation of the thesis . . . . . . . . . . . . . . . . . . . . 8
2 Statistical Machine Translation 9
2.1 Historical Background . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Source-Channel Model . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Word Alignment . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.1 HMM and Word-to-Phrase Alignment Models . . . . . 14
2.3.2 Symmetrisation Heuristics . . . . . . . . . . . . . . . . 16
2.4 Log-Linear Model of Machine Translation . . . . . . . . . . . . 19
2.5 Phrase-Based Translation . . . . . . . . . . . . . . . . . . . . 21
2.5.1 Alignment Template Model . . . . . . . . . . . . . . . 21
2.5.2 Phrase-Based Model . . . . . . . . . . . . . . . . . . . 22
2.5.3 Phrase Pair Extraction . . . . . . . . . . . . . . . . . . 22
2.5.4 Phrase-Based Decoding . . . . . . . . . . . . . . . . . . 24
2.6 Hierarchical Phrase-Based Translation . . . . . . . . . . . . . 26
2.6.1 Introduction and Motivation . . . . . . . . . . . . . . . 27
2.6.2 Hierarchical Grammar . . . . . . . . . . . . . . . . . . 27
2.6.3 Log-linear Model for Hierarchical Phrase-Based Trans-
lation . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.6.4 Rule Extraction . . . . . . . . . . . . . . . . . . . . . . 35
2.6.5 Features . . . . . . . . . . . . . . . . . . . . . . . . . . 35
v
CONTENTS vi
2.7 Language Modelling . . . . . . . . . . . . . . . . . . . . . . . 36
2.7.1 n-gram language models . . . . . . . . . . . . . . . . . 37
2.7.2 Back-off and Interpolated Models . . . . . . . . . . . . 38
2.7.3 Modified Kneser-Ney Smoothing . . . . . . . . . . . . . 39
2.7.4 Stupid Backoff Smoothing . . . . . . . . . . . . . . . . 40
2.8 Optimisation . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.8.1 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . 40
2.8.2 Minimum Error Rate Training . . . . . . . . . . . . . . 42
2.9 Decoding with Finite State Transducers . . . . . . . . . . . . . 42
2.10 Lattice Rescoring . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.10.1 5-gram Language Model Lattice Rescoring . . . . . . . 43
2.10.2 Lattice Minimum Bayes’ Risk Rescoring . . . . . . . . 44
2.10.3 Lattice Minimum Bayes-Risk for System Combination . 45
3 Data Structures for Hierarchical Phrase-Based Translation
Grammars 47
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2.1 Applications of MapReduce to SMT . . . . . . . . . . . 50
3.2.2 SMT Models Storage and Retrieval Solutions . . . . . . 52
3.3 HFile Description . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3.1 HFile Internal Structure . . . . . . . . . . . . . . . . . 55
3.3.2 Record Retrieval . . . . . . . . . . . . . . . . . . . . . 57
3.3.3 Bloom Filter Optimisation for Query Retrieval . . . . . 57
3.3.4 Local Disk Optimisation . . . . . . . . . . . . . . . . . 58
3.3.5 Query sorting optimisation . . . . . . . . . . . . . . . . 58
3.4 Hierarchical Rule Extraction with MapReduce . . . . . . . . . 58
3.4.1 Rule Extraction . . . . . . . . . . . . . . . . . . . . . . 59
3.4.2 Corpus-Level Feature Computation . . . . . . . . . . . 61
3.4.3 Feature Merging . . . . . . . . . . . . . . . . . . . . . 62
3.5 Hierarchical Rule Filtering for Translation . . . . . . . . . . . 62
3.5.1 Task Description . . . . . . . . . . . . . . . . . . . . . 63
3.5.2 HFile for Hierarchical Phrase-Based Grammars . . . . 63
3.5.3 Suffix Arrays for Hierarchical Phrase-Based Grammars 64
3.5.4 Text File Representation of Hierarchical Phrase-Based
Grammars . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.5.5 Experimental Design . . . . . . . . . . . . . . . . . . . 65
3.5.6 Results and Discussion . . . . . . . . . . . . . . . . . . 67
CONTENTS vii
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4 Hierarchical Phrase-Based Grammar Extraction from Align-
ment Posterior Probabilities 72
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3 Rule Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.3.1 General Framework for Rule Extraction . . . . . . . . . 78
4.3.2 Extraction from Viterbi Alignment Links . . . . . . . . 79
4.3.3 Extraction from Posteriors Probabilities over Align-
ment Links . . . . . . . . . . . . . . . . . . . . . . . . 81
4.3.4 Extraction from Posteriors over Phrase Pairs . . . . . . 87
4.3.5 Hierarchical Rule Extraction . . . . . . . . . . . . . . . 91
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.4.1 Experimental Setup: Grammar Pattern Definition . . . 96
4.4.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . 97
4.4.3 Grammar Coverage . . . . . . . . . . . . . . . . . . . . 99
4.4.4 Translation Results . . . . . . . . . . . . . . . . . . . . 101
4.4.5 Comparison between WP and PP . . . . . . . . . . . 104
4.4.6 Symmetrising Alignments of Parallel Text . . . . . . . 104
4.4.7 Additional Language Pair: Russian-English . . . . . . . 106
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5 Domain Adaptation for Hierarchical Phrase-Based Transla-
tion Systems 109
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.2 Description of the System to be Developed . . . . . . . . . . . 110
5.3 Domain Adaptation for Machine Translation . . . . . . . . . . 114
5.3.1 Domain Adaptation . . . . . . . . . . . . . . . . . . . . 114
5.3.2 Previous Work on Domain Adaptation for SMT . . . . 115
5.4 Language Model Adaptation for Machine Translation . . . . . 116
5.5 Domain Adaptation with Provenance Features . . . . . . . . . 118
5.6 Interaction between First Pass Language Model and Rescoring
Language Model . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.7 Language Model Smoothing in Language Model Lattice Rescor-
ing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
CONTENTS viii
6 String Regeneration as Phrase-Based Translation 125
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.3 Phrase-Based Translation Model for String Regeneration . . . 130
6.4 Phrase-Based Translation Rules for String Regeneration . . . . 131
6.4.1 Rule Extraction . . . . . . . . . . . . . . . . . . . . . . 131
6.4.2 Rule Filtering . . . . . . . . . . . . . . . . . . . . . . . 132
6.5 String Regeneration Search . . . . . . . . . . . . . . . . . . . . 132
6.5.1 Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.5.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.5.3 FST Building . . . . . . . . . . . . . . . . . . . . . . . 137
6.5.4 Example . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.6.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . 139
6.6.2 Baseline . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.6.3 Sentence Splitting . . . . . . . . . . . . . . . . . . . . . 140
6.6.4 Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . 141
6.6.5 n-gram Order for n-gram Rules . . . . . . . . . . . . . 142
6.6.6 Overlapping n-gram Rules . . . . . . . . . . . . . . . . 144
6.6.7 Future Cost . . . . . . . . . . . . . . . . . . . . . . . . 145
6.6.8 Rescoring . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
7 String Regeneration Applied to Machine Translation 150
7.1 SMT Baseline System . . . . . . . . . . . . . . . . . . . . . . 151
7.2 Regeneration Decoder Baseline on MT Output . . . . . . . . . 152
7.3 Analysis of the Loss of MT Hypotheses . . . . . . . . . . . . . 154
7.4 Biased Language Model for Regeneration . . . . . . . . . . . . 157
7.5 Translation and Regeneration Hypothesis Combination . . . . 157
7.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
8 Summary and Future Work 161
8.1 Review of Work . . . . . . . . . . . . . . . . . . . . . . . . . . 161
8.1.1 Hierarchical Phrase-Based Grammars: Infrastructure . 162
8.1.2 Hierarchical Phrase-Based Grammars: Modelling . . . 162
8.1.3 Hierarchical Phrase-Based System Development . . . . 163
8.1.4 Fluency in Hierarchical Phrase-Based Systems . . . . . 163
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
Chapter 1
Introduction
1.1 Machine translation
Machine translation is the process of translation from input speech or text in
a natural language into another natural language by some kind of automatic
system. Real world examples include online services such as Google Trans-
late1, Bing Translate2, SDL3, PROMT4, etc. Interacting with any online
automatic translation service with the expectation of a high quality transla-
tion can be a frustrating experience. Indeed, variations in word order across
languages and syntax and the use of real world knowledge for puns or idioms
make translation a very challenging task.
1.1.1 Challenges for Translation
In order to illustrate the difficulties that arise in translation, we present sev-
eral examples that make translation challenging for humans, and a fortiori
for computers. In some languages, some concepts are common enough to be
designated by one word, but in another language, an entire sentence may be
needed to describe that concept. For example, the word sobremesa in Span-
ish can be translated into English as the time spent after a meal, talking to
the people with whom the meal was shared.5 In this situation, a human trans-
1 https://translate.google.com
2 https://www.bing.com/translator
3 http://www.freetranslation.com/
4 http://www.online-translator.com
5 http://blog.maptia.com/posts/untranslatable-words-from-other-cultures
1
CHAPTER 1. INTRODUCTION 2
lator is left with the choice of keeping the translation short but inexact or
long but cumbersome. For a computer, or rather a statistical model, such a
situation will represent an outlier in terms of the ratio of the number of words
that are translation of each other. Self-referential sentences can also present
challenges for translation. For example, the sentence This sentence has five
words has at least two acceptable translations into Russian from the syntac-
tic and semantic point of view: В этом предложении пять слов and Это
предложение состоит из пяти слов. However, the second translation has
six words and therefore cannot be accepted. Another challenge for transla-
tion is word ordering. The first sentence of the novel The Metamorphosis by
Franz Kafka reads Als Gregor Samsa eines Morgens aus unruhigen Tra¨umen
erwachte, fand er sich in seinem Bett zu einem ungeheuren Ungeziefer ver-
wandelt. One possible English translation is As Gregor Samsa awoke one
morning from uneasy dreams, he found himself transformed in his bed into
a gigantic insect-like creature. In German, the words for transformed (ver-
wandelt) and insect (Ungeziefer) come right at the end of the sentence and
create an effect of surprise for the reader. In English, however, the verb
transformed comes in the middle of the sentence and the effect of surprise
is not as great. This example demonstrates how variations in word order-
ing between languages can make translation challenging. Computers have
the additional challenge that the choice of word ordering should produce a
grammatical sentence. This is a challenge for humans too but computers are
particularly bad at producing a grammatical output.
1.1.2 Machine Translation Current Quality
Even though machine translation is a challenging task, it is still useful in
a number of situations. For example, machine translation can be used to
obtain the gist, i.e. the general meaning, of a document in a foreign language.
Machine translation can also be used for post-editing: a document in a foreign
language is first automatically translated and then the translation is corrected
by a human translators. This is precisely the approach taken by translation
companies such as Unbabel6.
Machine translation has therefore gotten to the point where it is actually
useful for practical applications, but it is reasonable to ask how far we are
from perfect translation. We give indications in terms of the BLEU met-
6 https://www.unbabel.com
CHAPTER 1. INTRODUCTION 3
System 1-2-3-4 2-3-4 1-3-4 1-2-4 1-2-3
CUED 38.95 34.80 35.60 35.87 35.28
Reference 1 – 46.19 – – –
Reference 2 – – 47.27 – –
Reference 3 – – – 45.43 –
Reference 4 – – – – 46.90
Table 1.1: BLEU score obtained by the Cambridge University Engineering
Department system and by human translators on the MT08 test set. 4 refer-
ences are provided. The BLEU score is measured against various subsets of
references: all references (1-2-3-4), all but the first (2-3-4), etc. The BLEU
score for a human reference when the set of reference contains the human
reference is not computed and is simply 100.
ric (Papineni et al., 2002), which we will define in Section 2.8.1. For now, we
simply need to know that BLEU measures how well a translation hypothesis
resembles a set of human translation references. The Cambridge Univer-
sity Engineering Department (CUED) submitted a translation system to the
NIST Open Machine Translation 2012 Evaluation.7 We run this system on
the newswire portion of the NIST Open Machine Translation 2008 Evalua-
tion test set (MT08). For this MT08 test set, 4 references are available. We
measure the case insensitive BLEU score of the CUED system against the 4
references and against subsets of 3 references. Similarly, we measure the case
insensitive BLEU score of each reference against the other references. BLEU
scores for humans and for the CUED system are summarised in Table 1.1.
On average, CUED obtains a BLEU score of 35.39 against 3 references. On
average, human references obtain a BLEU score 46.45 against the other ref-
erences. We can draw two conclusions from these observations. First, this
is evidence that many possible translations are acceptable. Second, because
the automatic system performance for this particular setting is approximately
10 BLEU points below human performance, this gives an indication of the
quality of state-of-the-art automatic translation. After manual inspection
of the system output, we show a few examples of output sentences that are
relatively long and still have reasonable quality in Figure 1.1:
7 http://www.nist.gov/itl/iad/mig/openmt12.cfm
CHAPTER 1. INTRODUCTION 4
Minister of agriculture and social affairs, told AFP: “The prime
minister submitted his resignation to Abbas, chairman of the
president, and at the same time asked him to form a new cabinet,
responsible for handling daily affairs of the new government.”
Reference: Minister for Agriculture and Social Affairs Habbash
told AFP: “Prime Minister Fayyad offered his resignation to Pres-
ident Abbas. The president accepted it and at the same time also
asked him to form a new cabinet government to be responsible
for routine administration.”
Chavez has ordered government officials to keep an eye on foreign-
ers visiting venezuela’s speech, when a person is found to have
publicly criticized him or the Venezuelan government, are to be
deported.
Reference: Chavez has already ordered government officials
to closely monitor the speech of foreigners when they visit
Venezuela. If anyone is found publicly criticizing him or the
Venezuelan government, they should be all deported.
US-China strategic economic dialogue “focused on the economic,
environmental and other issues, the most important thing is that
the issue of the RMB exchange rate, the US Congress members
are of the view that the value of the Renminbi underestimated.
Reference: The US-China Strategic Economic Dialogue mainly
discusses economic, environmental protection and other issues.
Yet, the most important is the issue of the renminbi exchange
rate. US congressmen believe that the renminbi is excessively
undervalued.
Figure 1.1: Example output of a translation system, together with one ref-
erence. Even though the sentences are relatively long, they are overall fluent
and intelligible. True casing is restored manually.
CHAPTER 1. INTRODUCTION 5
1.2 Statistical Machine Translation
The current dominant approach to machine translation is a statistical ap-
proach. Given an input in a natural language, a statistical model of trans-
lation will attempt to predict the best translation according to a statistical
criterion, such as the most likely translation under a probabilistic model of
translation. The data used to estimate the parameters for such a model
consists of a parallel corpus, which is a set of sentence pairs in two differ-
ent natural languages that are translation of each other, and a monolingual
corpus which is a set of sentences in the language we would like to translate
into.
Parallel data can be obtained from multilingual institutions proceedings
such as the United Nations (Franz et al., 2013),the Canadian Hansard (Ger-
mann, 2001) or the European Parliament (Koehn, 2005). This type of data
is relatively clean since it is created by professional translators. However, it
may not match the genre of the input sentences we wish to translate, such
as newswire. The importance of having a good match between the data used
for training and testing is discussed in Section 5.3. In order to address this
concern and also in order to obtain more data, parallel data can also be
extracted automatically from comparable corpora (Smith et al., 2013). How-
ever, the widespread availability of machine translation and the development
of automatic techniques to extract parallel corpora automatically increase
the risk of having automatically translated output of poor quality present in
the training data. These concerns have been acknowledged and addressed by
a watermarking algorithm (Venugopal et al., 2011).
1.2.1 The SMT Pipeline
Typically, state-of-the-art SMT systems are organised into a pipeline of de-
terministic or statistical modules, as shown in Figure 1.2. Parallel text is first
preprocessed. This consists of data cleaning and tokenisation. For morpho-
logically poor languages such as English, simple rules for tokenisation, e.g.
separate words by white space and punctuation, are usually good enough.
For morphologically rich languages, more advanced techniques are needed
for effective tokenisation, for example morphological analysis (Habash and
Rambow, 2005), in order to combat data sparsity and work with a vocabu-
lary with reasonable size. In the case of languages without word boundaries,
such as Chinese, word segmentation techniques need to be applied (Zhang
CHAPTER 1. INTRODUCTION 6
and Clark, 2007), in order to be able to break sentences into words.
The preprocessed parallel text is then word-aligned (see Section 2.3). A
word alignment is simply a mapping between source words and target words
in a parallel sentence pair. Word alignment models were originally used to
describe the translation process and to perform translation (Germann et al.,
2001). Word alignment toolkits are available online8,9 as well as a word to
word translation decoder.10 Nowadays, word alignment models are used as
an intermediate step in the translation pipeline. A translation grammar is
then extracted and estimated from the word alignments (see Section 2.6.4)
and translation is performed under the translation grammar.
The monolingual data is also preprocessed in the same fashion as the
target side of the parallel text and one or more language models are esti-
mated from this data (see section Section 2.7). The language models and
the translation grammar are used by the translation decoder for translation.
In order to optimise the SMT model parameters, alternating decoding and
tuning steps are carried out.
Typically, a decoder can output a best possible translation, or an n-
best list of translations or even a lattice of translations, which is a compact
representation of an n-best list with a very large number n. In the latter case,
optional rescoring steps can be carried out in order to include models that
are too complex or not robust enough to be included in first pass decoding
(see Section 2.10). In particular, we study a model of string regeneration that
allows any reordering of the output of the first pass decoder in Chapter 7.
1.3 Contributions
The contributions of this thesis are outlined as follows:
• We describe a novel approach to grammar extraction and estimation.
Our approach ties the models of word alignment and grammar extrac-
tion and estimation more closely. It results in a more robust extraction
and estimation of translation grammars and leads to improvements in
translation quality, as demonstrated in Chinese to English translation
experiments.
8 http://mi.eng.cam.ac.uk/~wjb31/distrib/mttkv1
9 https://code.google.com/p/giza-pp
10 http://www.isi.edu/licensed-sw/rewrite-decoder
CHAPTER 1. INTRODUCTION 7
.Parallel Text
Preprocessing
Word Alignment
Grammar Extraction
Monolingual Text
Preprocessing
Language Modelling
Decoding
Tuning
Rescoring
Figure 1.2: Machine Translation System Development Pipeline
CHAPTER 1. INTRODUCTION 8
• We describe a system that allows the efficient extraction and filtering of
very large grammars. This method has been in continued use at CUED
and was employed for submissions to the NIST 201211 and the WMT
2013 (Bojar et al., 2013) translation evaluations. This was a carefully
engineering effort that required detailed understanding of grammar ex-
traction procedures and how to implement them in a Hadoop frame-
work. There are many implementation strategies that can be taken,
but obtaining the processing performance needed to support the ex-
periments reported in this thesis requires a careful design process.
• We designed and implemented a system for string regeneration, inspired
from phrase-based SMT techniques. We obtain state-of-the-art results
in the string regeneration task and demonstrate potential applications
to machine translation.
1.4 Organisation of the thesis
We now describe the thesis organisation. In Chapter 2, we review statistical
machine translation background: all components of the machine transla-
tion pipeline presented in Figure 1.2 are reviewed in detail. In Chapter 3,
we present our system for efficient extraction of translation grammars from
parallel text and retrieval of translation rules from very large translation
grammars. In Chapter 4, we present our novel grammar extraction proce-
dure that makes use of posterior probabilities from word alignment models.
We then provide hierarchical phrase-based system building recommendations
about what decisions to make in terms of language modelling and grammar
design to obtain the best possible systems for translation in Chapter 5. In
Chapter 6, we introduce our phrase-based decoder for string regeneration
and study fluency through the word reordering task. Finally, in Chapter 7,
we apply our regeneration decoder to the output of a machine translation
system.
11 http://www.nist.gov/itl/iad/mig/openmt12.cfm
Chapter 2
Statistical Machine Translation
Statistical Machine Translation (SMT) (Brown et al., 1993; Lopez, 2008;
Koehn, 2010) has become the dominant approach to machine translation, as
increasing amounts of data and computing power have become available. In
the SMT paradigm, given a sentence in a source language, conceptually all
possible sentences in a target language are assigned a probability, a score, or a
cost and the best translation is picked according to a certain decision criterion
that relates to these probabilities, scores or costs. The research challenge
is to develop models that assign scores that reflect human judgements of
translation quality.
In this chapter, we first review the historical background of SMT in Sec-
tion 2.1. We then present the original source-channel model for SMT in
Section 2.2. Word alignment models, which we review in Section 2.3, were
introduced within the framework of the source-channel model. The original
source-channel model was extended into the log-linear model, presented in
Section 2.4. The field of SMT shifted from word-based models to phrase-
based models, introduced in Section 2.5, while retaining word-based models
in their first formulation as a preliminary step. Phrase-based translation was
extended into hierarchical phrase-based translation, which we review in Sec-
tion 2.6. We then examine various features employed in state-of-the-art de-
coders in Section 2.6.5. The target language model, which is one of the most
important features in translation, is explored in more detail in Section 2.7.
In Section 2.8, we review optimisation techniques that are employed in order
to tune the decoder parameters. We finally present how finite state trans-
ducers can be used in decoding in Section 2.9. Various rescoring procedures
are reviewed in Section 2.10.
9
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 10
2.1 Historical Background
In this section, we present a brief historical background of statistical ma-
chine translation. A more comprehensive account of the history of machine
translation in general can be found elsewhere (Hutchins, 1997; Hutchins and
Lovtskii, 2000).
Warren Weaver can be considered the father of modern SMT. At a time
when the first computers were being developed, he examined their poten-
tial application to the problem of machine translation. In his memoran-
dum (Weaver, 1955), he addressed the problem of multiple meanings of a
source word by considering the context of that source word, which heralds
phrase based translation techniques and the use of context in machine trans-
lation. He was also the first to frame machine translation as a source-channel
model by considering that a sentence in a foreign language is some form of
code that needs to be broken, in analogy to the field of cryptography. Finally,
he also emphasised the statistical aspect of machine translation. However,
he also predicted that the most successful approaches to machine translation
would take advantage of language invariants by using an intermediate lan-
guage representation in the translation process. Even though state-of-the-art
statistical translation systems do not use this kind of approach, we do notice
a resurgence in intermediate language representation techniques (Mikolov
et al., 2013).
The first successful implementations of Warren Weaver’s ideas were car-
ried out by IBM in the 1990s. The source-channel model together with a
series of word alignment models were introduced by Brown et al. (1993)
while Berger et al. (1996) addressed the problem of multiple meanings us-
ing context in a maximum entropy framework. Word-based models were
extended into different variants of phrase-based models in 1999 and at the
beginning of the century (Och et al., 1999; Koehn et al., 2003; Och and Ney,
2004) and later on into synchronous context-free grammar models (Chiang,
2005, 2007).
2.2 Source-Channel Model
Statistical machine translation was originally framed as a source-channel
model (Shannon, 1948; Brown et al., 1990, 1993). Given a foreign sentence
f , we want to find the original English sentence e that went through a noisy
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 11
channel and produced f . Note that in the source-channel model notation,
what we would like to recover—the English sentence—is called the source
while what is observed—the foreign sentence—is called the target. A source-
channel model assigns probabilities from source (English) to target (foreign)
but in translation, the model is used to infer the source that was most likely
to have generated the target.
We do not use this convention here and call the source what we are
translating from and the target what we are translating into. This convention
is frequently adopted (Och et al., 1999; Och and Ney, 2002, 2004) in SMT, and
more so since SMT has been framed as a log-linear model (see Section 2.4).
We use the decision rule in Equation 2.1, which minimises the risk under a
zero-one loss function (see Section 2.10.2):
e^ = argmax
e
p(e | f)
e^ = argmax
e
p(f | e) p(e)
p(f)
(Bayes’ rule)
e^ = argmax
e
p(f | e) p(e) (2.1)
e^ is the hypothesis to be selected. p(f | e) is called the translation model
while p(e) is called the (target) language model.
The translation model and the language model are estimated separately
for practical reasons: the amount of parallel data used to train the trans-
lation model is in general orders of magnitude smaller than the amount of
monolingual data used to train the language model. Another justification is
that using two separate models makes the translation process modular: im-
proving the translation model may help improve adequacy, i.e. how well the
meaning of the source text is preserved in the translated text, while improv-
ing the language model may help improve fluency, i.e. how well-formed the
translation is. It is therefore considered preferable to train both a translation
model and a language model. In these models, parallel sentence pairs and
target sentences are not used directly as parameters because of an obvious
sparsity problem. Parallel sentence pairs are further broken down using word-
based models (see Section 2.3), phrase-based models (see Section 2.5) and
hierarchical phrase-based models (see Section 2.6). For language modelling,
sentences are broken down into windows of consecutive words using n-gram
language models (see Section 2.7). We will see in the next section how to
decompose the translation model using word alignment, which is introduced
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 12
as a latent variable into the source-channel model.
2.3 Word Alignment
In the previous section, we have briefly described the source-channel model,
which describes the translation process. This model cannot be used directly
in practice as it has too many parameters, namely all imaginable sentence
pairs and target sentences. In order to address this issue, the alignment
between source words and target words will be introduced as a latent variable
in the source channel model.
Given a sentence pair (f P e) with source sentence length J = |f | and
target sentence length I = |e|, a word alignment a for this sentence pair is a
mapping between the source and target words. In other words, a is a subset
of the cross product of the set of source words and their positions and the
set of target words and their positions, as defined in Equation 2.2:
a ⊂ {((fuP j)P (ziP i))P (jP i) ∈ [1P J ]× [1P I]} (2.2)
When the context of which sentence pair (f P e) is being word-aligned is obvi-
ous, we may simply consider source word positions and target word positions.
In that case, a is simply defined as a subset (source position, target position),
in Equation 2.3:
a ⊂ [1P J ]× [1P I] (2.3)
Each element of a is called an alignment link. Alignment links between source
and target words correspond to semantic or syntactic equivalences shared by
these words in the source and target language and in a particular sentence
pair. Alignments can present many-to-one and one-to-many mappings as well
as reordering as highlighted by crossing links. An example of word alignment
is shown in Figure 2.1.
Brown et al. (1993) introduce the alignment a as a latent variable in the
translation model p(f | e), as in Equation 2.4:
p(f | e) =
∑
a
p(f Pa | e) (2.4)
We abuse notation by calling a both the latent variable and the set of align-
ment links, which is an instance of the latent variable. For mathematical
convenience and in order to allow simplifications, given a sentence pair (f P e)
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 13
.Son˜e´ con una piedra lunar pa´lida
I dreamt of a pale moonstone
Figure 2.1: Example of word alignment a for a Spanish-English sentence pair.
f is the Spanish sentence, e is the English sentence. The source (Spanish)
length J is 6 as well as the target (English) length I. This alignment exhibits
many-to-one mappings (I and dreamt align to Son˜e´), one-to-many mappings
(moonstone aligns to piedra and lunar), as well as crossing links (the link
pale—pa´lida crosses the link moonstone—lunar).
with source length J and target length I, a is restricted to be a function from
source word positions to target word positions, as in Equation 2.5:
a : [1P J ] −→ [0P I]
j 7−→ vu
(2.5)
The target position zero is included to model source words not aligned to any
target word; these unaligned source words are virtually aligned to a so-called
null word. Note that this definition is not symmetric: it only allows many-
to-one mappings from source to target. Various symmetrisation strategies,
presented in Section 2.3.2, have been devised to address this limitation. Also
note that we did not take into account the null word in our initial definition
of alignments in Equation 2.2 because in general, alignments are obtained
from symmetrisation heuristics (see Section 2.3.2) where the null word is
ignored. We can use the latent variable a to rewrite the translation model
in Equation 2.6, with f = fJ) , e = zT) and a = vJ) :
p(fJ) | zT)) =
∑
lJ1
p(fJ) P v
J
) | zT))
=
∑
lJ1
J∏
u5)
p(fuP vu | f u−)) P vu−)) P zT))
=
∑
lJ1
J∏
u5)
p(fu | f u−)) P vu)P zT)) p(vu | f u−)) P vu−)) P zT))
(2.6)
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 14
Brown et al. (1993) present a series of five translation models of increas-
ing complexity that parameterise the terms p(fu | f u−)) P vu)P zT)) and p(vu |
f u−)) P v
u−)
) P z
T
)). Parameter estimation is carried out with the expectation-
maximisation algorithm (Dempster et al., 1977). Also based on Equation 2.6,
Vogel et al. (1996) introduce an HMM model (Rabiner, 1989) for word align-
ment and Deng and Byrne (2008) extend the HMMmodel to a word-to-phrase
HMM model. We describe these two models in the following sections.
We have described word alignment models in the context of the source-
channel model. In that context, word alignment models can be used di-
rectly for word-based decoding.1 However, nowadays, word alignment models
are used as a preliminary step in the machine translation training pipeline,
namely prior to rule extraction (see Section 2.5.3 and Section 2.6.4). In that
case, the word alignment models are used to produce Viterbi alignments,
defined in Equation 2.7:
xvJ) = argmax
lJ1
p(fJ) P v
J
) | zT)) (2.7)
One contribution of this thesis is to use alignment posterior probabilities
instead of Viterbi alignments for rule extraction (see Chapter 4).
2.3.1 HMM and Word-to-Phrase Alignment Models
We review HMM and word-to-phrase HMM models as these models are used
in experiments throughout this thesis. Vogel et al. (1996) introduce an HMM
alignment model that treats target word positions as hidden states and source
words as observations. The model is written in Equation 2.8:
p(fJ) P v
J
) | zT)) =
J∏
u5)
p(vu | vu−)P I) p(fu | zlj) (2.8)
Word-to-phrase HMM models (Deng and Byrne, 2008) were designed to cap-
ture interesting properties of IBM Model 4 (Brown et al., 1993) in an HMM
framework in order to keep alignment and estimation procedures exact. We
now present this model in more detail, using our usual source/target conven-
tion, which is the reverse than the one adopted in the original publication2.
1 e.g. http://www.isi.edu/licensed-sw/rewrite-decoder
2 s in the publication corresponds to e in this thesis; t corresponds to f ; J corresponds
to J ; I corresponds to I.
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 15
In the word-to-phrase HMM alignment model, the source sentence f is seg-
mented into source phrases vK) . The alignment a is represented by a set of
variables (K) P vK) P hK) P K) where:
• K is the number of source phrases that form a segmentation of the
source sentence f .
• vK) is the alignment from target words to source phrases.
• K) indicates the length of each source phrase.
• hK) is a hallucination sequence that indicates whether a source phrase
was generated by the target null word or by a usual target word.
The general form of the model is presented in Equation 2.9:
p(f Pa | e) = p(vK) P KP vK) P hK) P K) | e)
= p(K | JP e)×
p(vK) P
K
) P h
K
) | KP JP e)×
p(vK) | vK) P hK) P K) P KP JP e)
(2.9)
We now review the modelling decisions taken for each of the components
from Equation 2.9. The first component is simply modelled by:
p(K | JP e) = K (2.10)
where is a threshold that controls the number of segments in the source.
The second component is modelled using the Markov assumption:
p(vK) P
K
) P h
K
) | KP JP e) =
K∏
k5)
p(vkP hkP k | vk−)P k−)P hk−)P KP JPe)
=
K∏
k5)
p(vk | vk−)P hkP I) y(hk)n(k | zlk) (2.11)
As in the HMM word alignment model vk depends only on vk−), the target
length I and the binary value hk. y(hk) is simply controlled by the parameter
p( by y(0) = p(. n(k | zlk) is a finite distribution on source phrase length
that depends on each target word. This parameter is analogous to the fertility
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 16
parameter introduced in IBM Model 3 (Brown et al., 1993) and that controls
how many source words are aligned to a given target word.
The third component from Equation 2.9 is defined in Equation 2.12 and
represents the word-to-phrase translation parameter:
p(vK) |vK) P hK) P K) P KP JP e) =
K∏
k5)
p(vk | zlk P hkP k) (2.12)
One key contribution from the word-to-phrase HMM model is to use bigram
translation probabilities to model one single phrase translation, as shown in
Equation 2.13:
p(vk | zlk P hkP k) = t)(vk[1] | hk · zlk)
k∏
u52
t2(vk[j] | vk[j − 1]P hk · zlk) (2.13)
where hk ·zlk is zlk if hk = 1 and the null word otherwise, t) is a word-to-word
translation probability and t2 is a bigram translation probability.
Figure 2.2 shows a simplified version of the generative story for an HMM
word-to-phrase alignment model: first, pick the number of source phrases K
according to e (K | JP I); then pick a target word given the previously chosen
one; finally generate the target phrase from the source word using fertility
and bigram translation probabilities. For example, we generate the source
phrase les vaches from the target word cows according to Equation 2.14:
p(les vaches | cows) = p(les | cows) p(vaches | cows P les) (2.14)
Thus bigram probabilities take into account the context of the target word
to some extent.
2.3.2 Symmetrisation Heuristics
We have mentioned that the IBM and HMM alignment models are not sym-
metric: they only allow a many-to-one mapping from source words to target
words. In order to address this issue, one can train alignment models in
both source-to-target and target-to-source directions, obtain Viterbi align-
ments from both models and apply symmetrisation strategies (Och et al.,
1999; Och and Ney, 2003; Koehn et al., 2003). Och et al. (1999) designed a
first symmetrisation heuristic that was later on dubbed as the grow heuris-
tic. Koehn et al. (2003) later extended the grow heuristic into the grow-diag
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 17
.The wolf loves fat cows
p(K = 5 | I = 5P J = 6)
Le loup aime les vaches grasses
) = 1 2 = 1 + = 1
5 = 14 = 2
p(2 | 1) p(3 | 2)
p(5 | 3)
p(4 | 5)
Figure 2.2: Simplified generative story for an HMMword-to-phrase alignment
model. Adapted from (Deng and Byrne, 2008). The adjective noun sequence
fat cows is reordered into the noun adjective sequence vaches grasses. The
word cows has fertility 2 as it is translated into the target phrase les vaches.
and grow-diag-final heuristics and examined the impact on translation per-
formance for each heuristic.
Alignments from source to target (i.e. in which the alignment is a func-
tion from source positions to target positions) and target to source are de-
noted af2p and ap2f respectively. Let us consider a sentence pair (f P e), and
source-to-target and target-to-source Viterbi alignments af2p and ap2f . The
intersection and union heuristics are defined as follows:
• intersection: a = ap2f ∩ af2p
• union: a = ap2f ∪ af2p
The intersection heuristics typically produces high precision alignments while
the union heuristics typically produces high recall alignments (Och and Ney,
2003). We now present the grow heuristic and its variants, which are based
on the initial intersection and union heuristics. The grow heuristic algorithm
is presented in Figure 2.3. The input is a sentence pair (fJ) P zT)), a source-to-
target alignment af2p and a target-to-source alignment ap2f . The resulting
alignment a is initialised with the intersection (line 2). Then alignment links
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 18
1: function Grow(fJ) P zJ) Paf2pPap2f )
2: a← af2p ∩ ap2f
3: while true do
4: added ← false
5: for i ∈ [1P I] do
6: for j ∈ [1P J ] do
7: if (jP i) ∈ a then
8: for (kP l) ∈ Neighbours((jP i)) ∩(af2p ∪ ap2f ) do
9: if k not aligned in a or l not aligned in a then
10: a← a ∪ (kP l)
11: added ← true
12: end if
13: end for
14: end if
15: end for
16: end for
17: if not added then
18: break
19: end if
20: end while
21: return a
22: end function
Figure 2.3: Algorithm for the grow symmetrisation heuristic (Koehn, 2010).
The alignment is initialised from the intersection and alignment links that
are neighbours to existing alignment links are iteratively added if the source
or the target is not already aligned.
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 19
that are in the union and that are neighbours of already existing alignment
links (line 8) are considered. If the source or the target word is not already
aligned (line 9), then the link is added to the resulting alignment (line 10).
This is repeated until no more links are added.
In the grow heuristic, neighbours are defined as horizontal or vertical
neighbours. If diagonal neighbours are also considered, then the heuristic
becomes grow-diag. The grow heuristic also has an optional step called final.
Alignment links in the union where the source or the target is not already
aligned can also be added to the resulting alignment. If only links in the
union where the source and the target are not already aligned are considered
for the final procedure, then the optional step is called final-and.
Symmetrisation heuristics have been shown to be beneficial for alignments
both in terms of alignment quality as measured by comparing automatic
alignments to human alignments and in translation quality when alignments
are used as an intermediate step in the translation pipeline.
2.4 Log-Linear Model of Machine Translation
As we have seen in Section 2.2, SMT was historically framed as a source-
channel model. As an alternative, Berger et al. (1996) introduce maximum
entropy models for natural language processing. In maximum entropy mod-
elling, we wish to estimate a conditional probability p(y | x). Given a
training sample, various feature functions deemed to be relevant are picked.
We then constrain p such that the expected value of each feature function f
with respect to the empirical distribution is equal to the expected value of
f with respect to the model p. Finally, p is chosen among all models that
satisfy the constraints defined by the features and such that its entropy is
maximum. Berger et al. (1996) show how a maximum entropy model can be
parameterised as an exponential, or log-linear model. They apply this model
to three machine translation related tasks. First, they use a maximum en-
tropy model to predict the translation of a word using the context for that
word. Then, they use a maximum entropy model to predict the target lan-
guage word order. Finally, they apply maximum entropy modelling in order
to predict the source sentence segmentation.
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 20
Och et al. (1999) notice that using an erroneous version of the source-
channel model, that is using the following equation:
e^ = argmax
e
p(e | f) p(e) (2.15)
gives comparable performance with respect to using the correct formulation
of the source-channel model given in Equation 2.1. Then Och and Ney (2002)
propose the following log-linear model extension:
e^ = argmax
e
p(e | f)
= argmax
e
exp(
∑M
m5) mhm(ePf))∑
e0 exp(
∑M
m5) mhm(e
0Pf))
= argmax
e
exp(
M∑
m5)
mhm(ePf)) (2.16)
where hm are called feature functions and m are called feature weights. The
log-linear model is an extension to the source-channel model because it can
be reduced to the original source-channel model with the following settings:
• b = 2
• h)(ePf) = log(p(f |e))
• h2(ePf) = log(p(e))
• ) = 2 = 1
Log-linear models were originally trained with the maximum likelihood cri-
terion, which precisely makes them equivalent to maximum entropy mod-
els (Berger et al., 1996). More effective training techniques such as mini-
mum error rate training (Och, 2003) were introduced subsequently (see Sec-
tion 2.8.2). An advantage of minimum error rate training is that the criterion
for optimisation and the evaluation metric are consistent. Because minimum
error rate training does not require computing a normalisation constant, in
practice, SMT models effectively become linear models, with the objective
function presented in Equation 2.17:
e^ = argmax
e
M∑
m5)
mhm(ePf) (2.17)
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 21
2.5 Phrase-Based Translation
So far, we have presented two modelling approaches to machine translation:
the original source-channel model and the current log-linear model. We also
have presented word alignment models, which were introduced within the
source-channel model framework and which are instances of word-based mod-
els.
In the phrase-based translation paradigm, the minimal unit of translation
consists of phrases. Phrases are sequences of consecutive words, that need
not be syntactically or semantically motivated, but nevertheless are used as
translation units. Benefits of phrase-based models include:
• effectively disambiguating the translation of a word in a certain local
context;
• effective local reordering within a phrase such as the adjective-noun
inversion from French to English;
• effective translation of multi-word expressions, such as idioms.
Phrase-based models currently achieve state-of-the-art performance for cer-
tain language pairs that do not involve much reordering such as French-
English. They can be defined in the source-channel model framework (see
Section 2.2) or the log-linear model framework (see Section 2.4). Because
the source-channel model is no longer widely used anymore and because it
is a special case of a log-linear model, we will focus our presentation on the
log-linear model.
There are different variations on the phrase-based translation paradigm.
We will focus on two popular approaches, namely the alignment template
model (Och et al., 1999; Och and Ney, 2004) and the phrase-based model (Koehn
et al., 2003; Koehn, 2010).
2.5.1 Alignment Template Model
The alignment template model uses the log-linear model presented in Equa-
tion 2.16 as a starting point and repeated in Equation 2.18:
e^ = argmax
pJ1
exp(
M∑
m5)
mhm(f
J
) P z
T
))) (2.18)
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 22
In order to reduce the number of parameters, two latent variables .K) and
zK) are introduced. zK) is a sequence of alignment templates while .K) is a
permutation of size K. An alignment template is a triple ( ~F P ~EP ~V) where ~F
is a sequence of source word classes, ~E is a sequence of target word classes
and ~V is an alignment between ~F and ~E. .K) together with zK) define a source
segmentation of fJ) into source phrases ~fK) , a target segmentation of zT) into
target phrases ~zK) and a bijective mapping between source phrases and target
phrases where ~fk is mapped to ~zk for k ∈ [1P K]. The alignment templates
zK) are constrained in such a way that the alignment template classes match
the word classes. The alignment template translation model is summarised
in Figure 2.4.
Using the max approximation and making the feature functions depend on
the hidden variables, the translation model can be rewritten in Equation 2.19:
e^ = argmax
pJ1 ;z
K
1 ;
K
1
exp(
M∑
m5)
mhm(f
J
) P z
T
)P .
K
) P z
K
) )) (2.19)
2.5.2 Phrase-Based Model
The phrase-based model is similar to the alignment template model but does
not make use of source and target word classes. Again, we use the latent
variables .K) and zK) . This time, zk is defined as a triple ( ~fP ~zP ~V) where ~f
is a sequence of source words, ~z is a sequence of target words and ~V is an
alignment between the words in ~f and ~z. The reason for using the alignment
information between phrase pairs is to be able to compute the lexical feature
(see Section 2.6.5). Because the lexical feature can be computed by other
means than this alignment information (see again Section 2.6.5), it is also
possible to simply define zk as a phrase pair.
We have presented two variants of the phrase-based model. We will now
describe how to obtain the phrase pairs used for the latent variable z.
2.5.3 Phrase Pair Extraction
A preliminary step to phrase pair extraction is to obtain word aligned parallel
text. One possibility is to train source-to-target and target-to-source word
alignment models, obtain Viterbi alignments in both directions, and apply
symmetrisation heuristics, as described in Section 2.3.2. Then phrase pairs
are extracted from each word aligned sentence pair.
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 23
.f) f2 f+ f4 f5 f6 f7
f) f2 f+ f4 f5 f6 f7
f˜) f˜2 f˜+ f˜4
z˜) z˜2 z˜+ z˜4
z2z) z+ z4 z5 z6
z) z2 z+ z4 z5 z6
z) z2 z+ z4
Figure 2.4: Alignment template translation process. Adapted from (Och and
Ney, 2004). The source word sequence f 7) is first transformed into a source
class sequence f 7). The source classes are then segmented into source phrases
f˜ 4) . The source phrases are then reordered and aligned to target phrases z˜4).
For example, the source phrase f˜) is aligned to the target phrase z˜2 through
z). This means that the permutation .4) has value .2 = 1. This also means
that z) define a word alignment between the source words f) and f2 and the
target word z+. Finally, the target phrases z˜4), which encode the target class
sequence z6) generate the target word sequence z6).
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 24
.El mundo es grande
The world is big
Figure 2.5: Rule extraction for a sentence pair. For example, the phrase
(El mundo, The world) is extracted. The phrase pair (es grande, big) is not
extracted because it is not consistent with the alignment.
Let us consider a sentence pair (f P e) and an alignment a. We extract all
phrase pairs that are consistent with the alignment. This means that we ex-
tract all phrase pairs (f u2u1 P z
i2
i1
) that satisfy Equation 2.20 and Equation 2.21:
∀(jP i) ∈ aP (j ∈ [j)P j2]⇔ i ∈ [i)P i2]) (2.20)
[j)P j2]× [i)P i2] ∩ a 6= ∅ (2.21)
Equation 2.20 requires that no alignment link be between a word inside the
phrase pair and a word outside the phrase pair. Equation 2.21 requires that
there be at least one alignment link between a source word in the source
phrase and a target word in the target phrase. For example, in Figure 2.5,
the phrase pair 〈El mundo, The world〉 is extracted while the phrase pair 〈es
grande, big〉 is not because the word es (is) is aligned outside the phrase pair.
Note that the consistency constraint sometimes refers to only Equation 2.20
rather than both Equation 2.20 and Equation 2.21.
2.5.4 Phrase-Based Decoding
We have introduced two types of phrase-based models and described a tech-
nique to extract phrase-pairs, which are an essential component of these
models. We will now describe the decoding process, which is an effective
means to obtain the optimal translation of a source sentence.
We first describe a naive strategy for decoding, in order to motivate the
need for a more efficient decoder. A naive decoder may follow the following
steps, given a source sentence f of length J to be translated:
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 25
• Consider all possible segmentations of f into source phrases. There are
2J−) such segmentations.
• For each segmentation, consider all possible permutations of the source
phrases. For a segmentation of size K, there are K; such permutations.
• For each permutation of the source phrases, consider all translations of
each source phrase, and concatenate the target phrases according to the
permutation. The source phrases translations are given by the phrase
pairs extracted from the training data (see Section 2.5.3). If there are
10 translations per source phrase, we obtain 10K possible translation
(for the segmentation and permutation considered).
• Rank the translations by their score and pick the highest scoring trans-
lation.
We can see that the search space is too large for this naive approach to
be feasible. We now present a practical solution to the decoding process
in phrase-based translation. We will first introduce the translation process.
Then, we will describe how translation hypotheses are built incrementally.
We will then motivate the need for pruning and how pruning is carried out.
Finally, we will describe how future cost estimation may reduce search errors.
Translation Process Given a source sentence, the translation process is
to iteratively pick a source phrase that has not been translated yet, translate
that source phrase into a target phrase and append the target phrase to
the translation, and repeat this process until the source sentence has been
entirely covered by source phrases. While the process is not complete, the
concatenation of target phrases is called a partial hypothesis.
Hypothesis Expansion The decoding process starts from an initial empty
partial hypothesis. This empty partial hypothesis is extended by picking
source phrases, appending their translations to the empty hypothesis. At this
stage, we have obtained several partial hypotheses. The partial hypotheses
are repeatedly extended until all source words have been covered. Partial
hypotheses are represented by states that contain the information necessary
to compute the cost of an extension. If we use an n-gram language model
as a feature, the state will encode the cost of the partial hypothesis and the
last n− 1 words of the partial hypothesis.
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 26
Hypothesis Recombination When two partial hypotheses share the same
n− 1 words, only the partial hypothesis with the lower cost can lead to the
best final hypothesis. Therefore, the partial hypothesis with higher cost can
be discarded, or alternatively, it is possible to make these two partial hy-
potheses share the same state for rescoring purposes.
Stack Based Decoding and Pruning The decoding search space is very
large as seen above. Approximations therefore need to be made for an ef-
fective search. The partial hypotheses are grouped in stacks by the number
of source words covered. This allows pruning. Each time a hypothesis ex-
pansion produces a hypothesis that belongs to a certain stack, that stack is
pruned. There are two commonly used types of pruning: histogram prun-
ing and threshold pruning (Koehn, 2010). Histogram pruning enforces a
maximum number of partial hypotheses in each stack. Threshold pruning
examines the cost of the best partial hypothesis in a stack and discards all
partial hypotheses in that stack whose cost is greater than the best cost
plus a threshold. Histogram pruning provides a simple guarantee in terms
of computational complexity. On the other hand, because it relies on cost,
threshold pruning ensures a consistent quality for partial hypotheses that
survive the pruning threshold.
Future Cost Partial hypotheses that cover the same number of source
words are grouped together for pruning purposes. However, their cost may
not be directly comparable, for example partial hypotheses that correspond
to the translation of frequent words in the source might have a smaller cost
than partial hypotheses that correspond to the translation of rare words in
the source. To address this issue, a future cost that represents how difficult
it is to translate the rest of the sentence is added to the model cost of each
partial hypothesis.
2.6 Hierarchical Phrase-Based Translation
In the previous section, we have described the phrase-based translation
paradigm. In this section, we present the hierarchical phrase-based trans-
lation model. This model relies on the synchronous context free grammar
formalism. Reordering between source and target languages is modelled by
the grammar rather than by an ad hoc reordering model, although using both
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 27
the grammar and a reordering model may be beneficial (Huck et al., 2013). A
closely related formalism, inversion transduction grammars, was previously
introduced (Wu, 1995, 1997). However, because hierarchical phrase-based
grammar only contain lexicalised rules, translation decoding is more practi-
cal. An alternative to hierarchical phrase-based translation that also models
discontinuous phrases but does not use the same grammar formalism has also
been introduced recently (Galley and Manning, 2010).
2.6.1 Introduction and Motivation
Phrase-based models generally impose a limit on the size of the phrase pairs
extracted while, in decoding, phrases are typically reordered within a cer-
tain limit. These restrictions may be problematic for language pairs such as
German-English or Chinese-English that allow arbitrary reordering in some
instances. Hierarchical phrase-based translation was introduced in order to
overcome the reordering limitations from phrase-based models (Chiang, 2005,
2007). For example, the Chinese sentence with English gloss in Figure 2.6
requires “nested” reordering (Chiang, 2007):
• The phrase with North Korea have diplomatic relations must be re-
ordered into the phrase have diplomatic relations with North Korea.
• The phrase few countries one of must be reordered into the phrase one
of (the) few countries.
• After the two previous segments are reordered, have diplomatic rela-
tions with North Korea that one of the few countries must be reordered
into one of the few countries that have diplomatic relations with North
Korea.
Phrase-based systems can model this type of movement but they require very
long phrase pairs, which is impractical to rely upon because of data sparsity.
On the other hand, hierarchical phrase-based grammars do model this type
of movement using shorter phrase pairs but with more complex rules.
2.6.2 Hierarchical Grammar
2.6.2.1 Hierarchical Grammar Definition
A hierarchical phrase-based grammar, or hierarchical grammar, or Hiero
grammar, which is a particular instance of a synchronous context free gram-
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 28
.澳洲 是 与 北韩 有 邦交 的 少数 国家 之一
Australia is with NorthKorea have
diplomatic
relations that few countries
one
of
Australia is
one
of
the
few countries that have
diplomatic
relations with
North
Korea
Figure 2.6: Example of Chinese sentence that needs nested reordering in
order to be translated into English. Adapted from (Chiang, 2007).
mar (Lewis II and Stearns, 1968; Aho and Ullman, 1969), is a set of rewrite
rules of the following type:
m → 〈
P P∼〉
where m is a nonterminal,
and are sequences of terminals and nontermi-
nals and ∼ is an alignment between nonterminals. Terminals that appear in
are words in the source language while terminals that appear in are words
in the target language. Nonterminals are chosen from a finite set disjoint
from the set of terminals. The alignment between nonterminals indicates
which nonterminals in the source and target languages correspond to each
other. The alignment ∼ can be written with a set of matching indices.
2.6.2.2 Types of Rules
A rule may or may not contain any nonterminals. A rule without nonter-
minals (on the right hand side) is called a phrase-based rule. A rule with
nonterminals is called a hierarchical rule. A hierarchical grammar also usu-
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 29
g) : h → 〈mPm〉
g2 : h → 〈hmP hm〉
g+ : m → 〈m) 的 m2P the m2 that m)〉
g4 : m → 〈m 之一P one of m〉
g5 : m → 〈与 m) 有 m2P have m2 with m)〉
g6 : m → 〈澳洲P Australia〉
g7 : m → 〈是P is〉
g0 : m → 〈北韩P North Korea〉
g1 : m → 〈邦交P diplomatic relations〉
g)( : m → 〈少数 国家P few countries〉
Figure 2.7: Example of hierarchical grammar. Adapted from (Chiang, 2007).
With this grammar, it is possible to obtain a derivation for the sentence pair
from Figure 2.6. The derivation is shown in Figure 2.8.
ally contains the following rules called glue rules:
h →〈mPm〉
h →〈hmP hm〉
with h the start symbol. The first glue rule is necessary to be able to start
a derivation. The second glue rule allows concatenation of phrase-based or
hierarchical rules.
2.6.2.3 Hierarchical Grammar Example
Let us consider for example the grammar in Figure 2.7 where each rewrite rule
is given a name gi. With this grammar, it is possible to write a derivation,
i.e. a sequence of rules, that rewrites the start symbol h into the sentence
pair presented in Section 2.6.1 (Chiang, 2007). For example we can apply
the derivation g2P g2P g)P g6P g7P g4P g+P g)(P g5P g1P g0 as in Figure 2.8.
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 30
S → 〈SX; SX〉
→ 〈SX1X2; SX1X2〉
→ 〈X1X2X3; X1X2X3〉
→ 〈澳洲 X2X3;Australia X2X3〉
→ 〈澳洲 是 X3;Australia is X3〉
→ 〈澳洲 是 X 之一;Australia is one of X〉
→ 〈澳洲 是 X1 的 X2 zhiyi;Australia is one of the X2 that X1〉
→ 〈澳洲 是 X1 的 少数 国家 之一;Australia is one of the few countries that X1〉
→ 〈澳洲 是 与 X1 有 X2 的 少数 国家 之一;
Australia is one of the few countries that have X2 with X1〉
→ 〈澳洲 是 与 X1 有 邦交 的 少数 国家 之一;
Australia is one of the few countries that have diplomatic relations with X1〉
→ 〈澳洲 是 与 北韩 有 邦交 的 少数 国家 之一;
Australia is one of the few countries that have diplomatic relations with North Korea〉
Figure 2.8: Example of hierarchical grammar derivation. Adapted from (Chi-
ang, 2007). A derivation is simply a sequence of rules that rewrite the start
symbol h into a sentence pair. This particular derivation produces the sen-
tence pair from Figure 2.6.
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 31
2.6.2.4 Constraints on Hierarchical Grammars
The definition given for a hierarchical grammar is very general. In prac-
tice, systems impose constraints on the types of rule the grammar con-
tains for efficiency reasons. We first introduce the concept of rule pat-
tern (Iglesias et al., 2009a) in order to describe these constraints in terms
of patterns. A rule pattern is simply a pair of regular expressions that
match the source and target sides of a hierarchical rule. For example,
the rule m → 〈Buenas tardes mP Good afternoon m〉 has a rule pattern
〈+mP#+m〉 where is the source vocabulary and # is the target vocabu-
lary. For ease of notation, we use the notation w for either + or #+. Thus
w simply represents a sequence of terminals. In our example, the pattern for
the rule m → 〈Buenas tardes mP Good afternoon m〉 is 〈wmPwm〉. Chiang
(2007) defines the following set of pattern-related constraints that must be
satisfied by the rules in a hierarchical grammar:
• If a rule m → 〈
P 〉 has a pattern 〈wPw〉, then |
| ≤ 10 and || ≤ 10.
• A rule m → 〈
P 〉 must satisfy |
| ≤ 5.
• Rules have at most 2 nonterminals.
• The source side of a rule cannot have adjacent nonterminals. Setiawan
and Resnik (2010) relax this constraint in order to model Chinese-
English reordering phenomena that may be not always captured in
training because of data sparsity. Note that the target side can still
have adjacent nonterminals.
More fine-grained constraints can be defined on rule patterns, which are ob-
tained from rules by replacing terminal sequences by the placeholder w. For
example, we use the configuration described in Table 2.1 for all translation
experiments in this thesis, unless specified otherwise. This grammar was
obtained following a greedy strategy of adding in turn the most beneficial
patterns for Arabic-English translation.
Another restriction on hierarchical grammars is the extent of reordering
allowed. de Gispert et al. (2010a) investigate the use of shallow-c grammars
that precisely control the depth of reordering in translation. We describe here
shallow-1 grammars (Iglesias et al., 2009a; de Gispert et al., 2010a) as they
will be used for translation experiments in this thesis. Shallow-1 grammars
allow only one level of reordering, they do not allow nested reordering as in
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 32
〈source , target〉 include 〈source , target〉 include
〈w X , w X〉 no 〈X w , w X〉 yes
〈w X , X w〉 yes 〈X w , w X w〉 yes
〈w X , w X w〉 yes 〈X w , X w〉 no
〈w X w , w X〉 yes 〈w X w , X w〉 yes
〈w X w , w X w〉 yes
〈X1 w X2 , w X1 w X2〉 no 〈X2 w X1 , w X1 w X2〉 yes
〈X1 w X2 , w X1 w X2 w〉 no 〈X2 w X1 , w X1 w X2 w〉 yes
〈X1 w X2 , w X1 X2〉 no 〈X2 w X1 , w X1 X2〉 yes
〈X1 w X2 , w X1 X2 w〉 no 〈X2 w X1 , w X1 X2 w〉 yes
〈X1 w X2 , X1 w X2〉 no 〈X2 w X1 , X1 w X2〉 yes
〈X1 w X2 , X1 w X2 w〉 no 〈X2 w X1 , X1 w X2 w〉 yes
〈X1 w X2 , X1 X2 w〉 no 〈X2 w X1 , X1 X2 w〉 yes
〈w X1 w X2 , w X1 w X2〉 no 〈w X2 w X1 , w X1 w X2〉 yes
〈w X1 w X2 , w X1 w X2 w〉 yes 〈w X2 w X1 , w X1 w X2 w〉 yes
〈w X1 w X2 , w X1 X2〉 yes 〈w X2 w X1 , w X1 X2〉 yes
〈w X1 w X2 , w X1 X2 w〉 yes 〈w X2 w X1 , w X1 X2 w〉 yes
〈w X1 w X2 , X1 w X2〉 yes 〈w X2 w X1 , X1 w X2〉 yes
〈w X1 w X2 , X1 w X2 w〉 yes 〈w X2 w X1 , X1 w X2 w〉 yes
〈w X1 w X2 , X1 X2 w〉 yes 〈w X2 w X1 , X1 X2 w〉 yes
〈X1 w X2 w , w X1 w X2〉 yes 〈X2 w X1 w , w X1 w X2〉 yes
〈X1 w X2 w , w X1 w X2 w〉 yes 〈X2 w X1 w , w X1 w X2 w〉 yes
〈X1 w X2 w , w X1 X2〉 yes 〈X2 w X1 w , w X1 X2〉 yes
〈X1 w X2 w , w X1 X2 w〉 yes 〈X2 w X1 w , w X1 X2 w〉 yes
〈X1 w X2 w , X1 w X2〉 yes 〈X2 w X1 w , X1 w X2〉 yes
〈X1 w X2 w , X1 w X2 w〉 no 〈X2 w X1 w , X1 w X2 w〉 yes
〈X1 w X2 w , X1 X2 w〉 yes 〈X2 w X1 w , X1 X2 w〉 yes
〈w X1 w X2 w , w X1 w X2〉 no 〈w X2 w X1 w , w X1 w X2〉 yes
〈w X1 w X2 w , w X1 w X2 w〉 no 〈w X2 w X1 w , w X1 w X2 w〉 yes
〈w X1 w X2 w , w X1 X2〉 no 〈w X2 w X1 w , w X1 X2〉 yes
〈w X1 w X2 w , w X1 X2 w〉 no 〈w X2 w X1 w , w X1 X2 w〉 yes
〈w X1 w X2 w , X1 w X2〉 no 〈w X2 w X1 w , X1 w X2〉 yes
〈w X1 w X2 w , X1 w X2 w〉 no 〈w X2 w X1 w , X1 w X2 w〉 yes
〈w X1 w X2 w , X1 X2 w〉 no 〈w X2 w X1 w , X1 X2 w〉 yes
Table 2.1: Rule patterns included in a baseline hierarchical grammar. A
pattern is obtained from a rule by replacing terminal sequences by the place-
holder w. The decision whether to include each pattern was based on prelim-
inary experiments in Arabic-English. Most beneficial patterns were added
incrementally. Unless specified otherwise, this configuration is used in all
subsequent translation experiments.
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 33
the example presented in Section 2.6.1. A shallow-1 grammar is defined as
follows:
h → 〈mPm〉
h → 〈hmP hm〉
m → 〈
sP s〉(
sP s ∈ (i ∪ {k })+)
m → 〈kP k 〉
k → 〈sP t〉(s ∈ i+P t ∈ i ∗)
where h is the start symbol, i is the set of terminals and k is the set of
nonterminals. There are two nonterminals apart from the start symbol: m
and k . The rule type m → 〈
sP s〉 corresponds to all hierarchical rules. It
is possible to apply this type of rule only once in any derivation. Indeed,
the right hand side contains only one type of nonterminal, k , which can
be rewritten only with a phrasal rule corresponding to the line k → 〈sP t〉.
Note that for rules of the type k → 〈sP t〉, t can be the empty word, thus
these rules, called deletion rules, allow deletion on the target side. Shallow-1
grammar are used for language pairs that do not present much reordering.
Shallow-1 grammars were previously shown to work as well as full hierarchical
grammars for the Arabic-English language pair (Iglesias et al., 2009a) and
for the Spanish-English language pair (Iglesias et al., 2009c). In addition,
shallow-1 grammars reduce the search space of the decoder greatly, resulting
in a much faster decoding time, a reduced memory use and potentially fewer
search errors under the translation grammar.
2.6.3 Log-linear Model for Hierarchical Phrase-Based
Translation
We now define in more detail the log-linear model for hierarchical translation,
which usually makes a maximum (max) approximation, i.e. replaces a sum
by the maximum term in the sum and assumes that the other terms are
negligible. We follow the original description (Chiang, 2007). For a sentence
pair (f P e), let us define D the set of possible derivations D of this sentence
pair under a hierarchical grammar. We will use the following notation for a
derivation D:
• the foreign yield f . We define f(D) = f
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 34
• the English yield e. We define z(D) = e
We can now derive the log-linear model for hierarchical translation:
e^ = argmax
e
p(e | f)
= argmax
e
∑
O∈D
p(DP e | f) (marginalisation)
= argmax
e
argmax
O∈D
p(DP e | f) (max approximation)
= z(argmax
e;O∈D
p(DP e|f))
= z(argmax
O|f(O)5f
p(D)) (2.22)
Thanks to the max approximation, the distribution over derivations in-
stead of the distribution over English sentences is modelled log-linearly and
we obtain finally the following decoding equation:
xe = z(argmax
O|f(O)5f
exp(
M∑
m5)
mhm(D))) (2.23)
One of the features, the language model, plays a particular role. The language
model feature can be written as:
hM(D) = pLM(z(D)) (2.24)
where b is the index of the language model feature, pLM is the language
model and z(D) is the English yield of the derivation D. It is not possible to
compute the language model using dynamic programming since the language
model needs context in order to be computed, therefore the language model
feature is typically computed after a parsing step.
Note that Equation 2.23 is an approximation and that there have been
attempts to perform marginalisation over the latent variable D while keep-
ing the translation process tractable (Blunsom et al., 2008a; de Gispert
et al., 2010a). This can give gains over the max approximation, although
subsequent rescoring steps (see Section 2.10) can produce similar perfor-
mance (de Gispert et al., 2010a).
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 35
2.6.4 Rule Extraction
We have so far given the definition of a hierarchical grammar and explained
how it is used with statistical models. It is also necessary to extract an
appropriate grammar, defined by its rules. The extraction is performed on
a parallel corpus. The parallel corpus is first word-aligned, then rules are
extracted from the alignment.
We first extract phrase pairs as described in Section 2.5.3. For each
extracted phrase pair 〈f u2u1 P zi2i1〉, we define the following rule: m → 〈f u2u1 P zi2i1〉.
These rules are called initial rules. We extend the set of initial rules with
the following recursion: given a rule m → 〈
P 〉 and an initial rule m →
〈f u2u1 P zi2i1〉 such that
=
)f u2u1
2 and = )zi2i12, then extract the rule
m → 〈
)m
2P )m2〉. Note that
) or
2, but not both, can be the empty
word, and similarly for ) and 2.
2.6.5 Features
The following features are commonly used in log-linear models for machine
translation:
• Source-to-target and target-to-source translation models. As described
above, the translation process produces a derivation D. A derivation
D can be seen as a sequence of n rules m → 〈)P
)〉P OOOP m → 〈nP
n〉.
Then the source-to-target translation model is simply
∏n
i5) p(
i|i)
where the p(
i|i) are typically estimated using relative frequency,
based on the appearance of phrasal and hierarchical rules in the parallel
corpus. The target-to-source translation model is symmetric.
• Source-to-target and target-to-source lexical translation model. Typi-
cally, the source-to-target lexical translation model lx(fJ) P zT)) is com-
puted as following for a foreign sentence fJ) translated into the English
sentence zT):
lx(fJ) P z
T
)) =
1
(I + 1)J
J∏
u5)
T∑
i5(
p)(zi|fu) (2.25)
where p) is the word-to-word translation probability in IBM Model 1
and z( the null word. The target-to-source lexical translation model is
symmetric. One reason to use lexical models in addition to translation
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 36
models is that translation models are relatively sparse compared to
lexical models, thus lexical models can smooth the translation models.
• Number of uses of the glue rule in a derivation. This feature trades off
monotonic translation versus reordering.
• Word insertion penalty. This feature controls the length of the output.
• Rule insertion penalty. This feature controls the number of rules used
in a derivation to generate a translation. Using many rules is closer
to word-to-word translation while using few rules means that longer
phrase pairs are used. This feature is similar to the phrase penalty in
phrase-based statistical machine translation (Koehn, 2010).
• Word deletion scale factor. This feature controls the number of times
a deletion rule is applied.
• Rule count feature. This feature indicates whether a rule occurs once,
twice or more in the parallel data (Bender et al., 2007). Thus it indi-
cates how reliable a rule is.
The feature weights are optimised using minimum error rate training (Och,
2003) under the BLEU score (Papineni et al., 2002) (see also Section 2.8.1).
2.7 Language Modelling
We mentioned in Section 2.6.3 that a language model was one of the fea-
tures of log-linear models for translation. This feature is critical as it helps
to obtain a fluent translation output. A language model is a probability
distribution over word sequences. It can be used to assign a probability to
a sequence of words or to predict the word most likely to appear in a cer-
tain context. Language models have applications in fields where the goal is
to produce fluent output, such as automatic speech recognition or machine
translation. n-gram language models are typically used because they are
robust, can be easily trained on large amounts of data and model local gram-
matical relationships. Since they do not model the language structure nor
long distance relationships between words, work has been conducted (Shen
et al., 2008) to overcome this issue. We first review n-gram language models,
then review different smoothing techniques and we finally describe in more
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 37
detail two smoothing techniques that are used for experiments in this thesis:
Kneser-Ney smoothing and Stupid Backoff smoothing.
2.7.1 n-gram language models
For simplicity, let us first consider the case of a bigram language model.
Let us consider a vocabulary k and the two special symbols and
corresponding respectively to the start-of-sentence symbol and the end-of-
sentence symbol. We define l = k ∪ {P }. Let us now define the
Markov chain (mi)i∈N with values in l and transition probability p. p has
the following properties:
• p(m( = ) = 1: we always start a sentence with a start-of-sentence
symbol.
• p(mn+) = | mn = ) = 1: once we reach the end of a sentence,
we stay in the end-of-sentence state. This is because we do not consider
infinite word sequences.
• p(mn+) = | mn = ) = 0: we cannot stay in the start-of-sentence
state and have to transition to either a word or the end-of-sentence
state.
A bigram model is defined by the conditional independence assumptions of
the Markov chain (mi) and the translation probability p. A bigram model
will therefore assign the probability p(w) to a sequence of words w = w)OOOwn
in Equation 2.26:
p(w) =
n+)∏
i5)
p(wi | wi−)) (2.26)
with the convention w( = and wn+) = .
An n-gram language model can be defined similarly: this time the random
variables mi take values in l n−) instead of l . An n-gram model will assign
the probability p(w) to a sequence of words w = w)OOOwn in Equation 2.27:
p(w) =
n+)∏
i5)
p(wi | wi−)i−n+)) (2.27)
with the same convention that w( = and wn+) = . Parame-
ters can be trained using maximum likelihood estimation, so the parameter
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 38
p(wi|wi−)i−n+)) is computed in Equation 2.28.
p(wi|wi−)i−n+)) =
x(wii−n+))
x(wi−)i−n+))
(2.28)
where x(O) counts the number of occurrences of a particular word sequence in
the training data. Maximum likelihood estimation assigns zero probability
to unseen events, therefore different smoothing strategies have been explored
to address this problem.
2.7.2 Back-off and Interpolated Models
Smoothing strategies for language modelling make use of lower order dis-
tributions either by backoff or interpolation. The general form of a backoff
model (Chen and Goodman, 1998) is presented in Equation 2.29:
pbackoff(wi | wi−)i−n+)) =
{
(wi | wi−)i−n+)) if x(wii−n+)) S 0
(wi−)i−n+))pbackoff(wi | wi−)i−n+2) if x(wii−n+)) = 0
(2.29)
where x is an occurrence count in the monolingual training corpus for an
n-gram. The conditions x(wii−n+)) S 0 and x(wii−n+)) = 0 can be replaced by
x(wii−n+)) ≥ cutoff and x(wii−n+)) < cutoff respectively when n-grams with
an occurrence less than the cutoff threshold are ignored in the training data.
The general form of an interpolated model (Chen and Goodman, 1998)
is presented in Equation 2.30:
pinterpolate(wi | wi−)i−n+)) = wi−1i−G+1pML(wi | w
i−)
i−n+))+
(1− wi−1i−G+1)pinterpolate(wi | w
i−)
i−n+2)
(2.30)
where:
• pML is the maximum likelihood estimate.
• wi−1i−G+1 is the interpolation parameter.
The difference between backoff and interpolated model is that interpolated
models make use of lower order distributions even when the n-gram counts
are greater than zero. However, an interpolated model can be written in the
form of a backoff model. Let us define (wi | wi−)i−n+)) in Equation 2.31:
(wi | wi−)i−n+)) = pinterpolate(wi | wi−)i−n+)) (2.31)
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 39
and
(wi−)i−n+)) in Equation 2.32:
(wi−)i−n+)) = 1− wi−1i−G+1 (2.32)
We can see that by plugging in the definitions of (wi | wi−)i−n+)) and
(wi−)i−n+))
in Equation 2.29, it is possible to write an interpolated model as a backoff
model. This observation is trivial but it is useful in practice in order to make
use of the ARPA file format3 which only supports back-off models.
2.7.3 Modified Kneser-Ney Smoothing
In this section, we present interpolated modified Kneser-Ney smoothing (Chen
and Goodman, 1998), which is the most popular smoothing strategy used for
machine translation. The motivation for Kneser-Ney smoothing (Kneser and
Ney, 1995) as given by Chen and Goodman (1998) is that the use of lower or-
der distributions for smoothing needs to take into account information from
the higher order distribution. For example, let us consider a bigram model.
We want to assign a probability to the word Francisco given its previous
word, say hello. If we have not seen the bigram hello Francisco in the train-
ing corpus, we need to back-off to the unigram Francisco. We assume that
the unigram Francisco is very frequent in our corpus but that it is only seen
in the context of the bigram San Francisco. If we simply use the maximum
likelihood estimate of the unigram Francisco, we will obtain a relatively high
probability for the bigram hello Francisco. This should not be the case be-
cause the corpus provides evidence that Francisco is very unlikely to follow
any other word than San.
The modified Kneser-Ney smoothed probability is defined in Equation 2.33.
pKN(wi | wi−)i−n+)) =
x(wii−n+))−D(x(wii−n+)))
x(wi−)i−n+))
+
(wi−)i−n+))pKN(wi|wi−)i−n+2)
(2.33)
where D and
are defined in Equation 2.34 and Equation 2.35.
D(x) =
0 if x = 0
D) if x = 1
D2 if x = 2
D+ if x ≥ 3
(2.34)
3 http://www.speech.sri.com/projects/srilm/manpages/ngram-format.5.html
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 40
(wi−)i−n+)) =
D)c)(w
i−)
i−n+)•) +D2c2(wi−)i−n+)•) +D++c++(wi−)i−n+)•)
x(wi−)i−n+))
(2.35)
c), c2 and c++ are defined in Equation 2.36.
c)(w
i−)
i−n+)•) = |{wi : x(wi−n+)wi) = 1}|
c2(w
i−)
i−n+)•) = |{wi : x(wi−n+)wi) = 2}|
c++(w
i−)
i−n+)•) = |{wi : x(wi−n+)wi) ≥ 3}|
(2.36)
2.7.4 Stupid Backoff Smoothing
The Stupid Backoff smoothing scheme (Brants et al., 2007) is similar to the
general backoff smoothing scheme presented in Equation 2.29 and presented
in Equation 2.37:
sstupid backoff(wi | wi−)i−n+)) =
n(wii−G+1)
n(wi−1i−G+1)
if x(wii−n+)) S 0
sstupid backoff(wi | wi−)i−n+2) if x(wii−n+)) = 0
(2.37)
We use the notation s to indicate that that the Stupid Backoff Smoothed
score is not a probability. We note the following differences with respect to
the traditional backoff scheme:
• The non backed off score uses no discounting and simply uses relative
frequency.
• The backed off score scaling parameter is independent of the n-gram
history.
• As a result, the Stupid Backoff does not define a probability distribu-
tion.
Stupid Backoff smoothing was designed to fit the MapReduce framework and
build very large language models (see Section 3.1). This scheme will be used
in subsequent translation rescoring experiments throughout this thesis.
2.8 Optimisation
2.8.1 Evaluation Metrics
Evaluating the quality of machine translation output is a challenge in that
very often many possible translations are acceptable and of equal quality. Hu-
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 41
man experts seem to be the most qualified for this task, however their service
is expensive and time consuming. Automatic metrics have been therefore
designed in order to allow rapid development and comparison of machine
translation systems. We describe the BLEU metric since this is the most
widely used metric and because we use it extensively in our experiments. We
refer to publications for alternative metrics such as METEOR (Banerjee and
Lavie, 2005), NIST (Doddington, 2002) or TER (Snover et al., 2006).
The BiLingual Evaluation Understudy (BLEU) score (Papineni et al.,
2002) is defined as the geometric mean of n-gram modified precisions with
respect to one or several translation references times a brevity penalty. More
precisely, given h input sentences s)P OOOP sS, we consider a set of h candidate
translations x)P OOOP xS. For each input sentence si, there is a finite number gi
of reference translations ri;)P OOOP ri;Ri . The reference translation that is closest
in length to the candidate translation xi is denoted rni . The BLEU metric is
defined as following:
BLEU(x)P OOOP xSP r);)P OOOr);R1 P OOOP rS;)P OOOP rS;RS) = BP
N∏
n5)
p
1
N
n (2.38)
BP is the brevity penalty and is defined as:
BP = exp(min(0P 1−
∑S
i5) |rni|∑S
i5) |xi|
)) (2.39)
pn is the modified n-gram precision and is defined as∑S
i5)
∑
g∈ni xountnwip(gP xiP ri;)P OOOP ri;Ri)∑S
i5)
∑
g∈ni xount(gP xi)
(2.40)
where g is an n-gram, xount(gP xi) is the number of times g occurs in xi and
xountnwip(gP xiP ri;)P OOP ri;Ri) = min(xount(gP xi)Pmax(xount(gP ri;))P OOP xount(gP ri;Ri)))
(2.41)
The maximum n-gram order is usually c = 4. The BLEU metric can be
computed efficiently, is language independent and correlates well with human
judgements, especially for statistical translation systems.
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 42
2.8.2 Minimum Error Rate Training
Minimum error rate training (Och, 2003) is a method to optimise the feature
weights in a log-linear model for translation with respect to a particular
evaluation metric. We present this method assuming that the evaluation
metric is BLEU. The objective function is defined as follows:
xM) = argmax
M1
BLEU(x(M) )P r) (2.42)
where x(M) ) are the candidate translations obtained using feature weights
M) and r are the reference translations. Since computing x(M) ) requires
computing the most likely translations given a set of features, this optimisa-
tion problem is complex: it requires two argmax operations. Och therefore
approximates the search space of translations with n-best lists given by a
decoder. The optimisation problem is then solved by computing, along a
particular direction in the feature space, intervals where the BLEU function
is constant and choosing for each direction the optimal feature set. It was
found that optimising feature weights in a consistent way with the evalu-
ation metric, for example BLEU, could improve the translation quality as
measured by this metric significantly.
2.9 Decoding with Finite State Transducers
We use HiFST, a decoder based on finite state transducers (Iglesias et al.,
2009b; Iglesias, 2010). The decoder solves Equation 2.23 in three steps. In
the first step, the source sentence is parsed using a variant of the CYK algo-
rithm (Chappelier and Rajman, 1998). Each cell in the CYK grid contains
a nonterminal symbol and a span represented by a start and end index and
backpointers to other cells in the grid. In the second step, a lattice (Ueffing
et al., 2002), i.e. a compact representation of the possible translations as a
directed acyclic graph where edges are words, is recursively built by following
the backpointers in the grid. In the third step, a language model represented
as an acceptor is composed with the original lattice to give translations a
language model score.
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 43
2.10 Lattice Rescoring
An advantage of the HiFST decoder (see Section 2.9) is that it directly gener-
ates translation lattices that can be rescored in order to improve translation
quality. Rescoring methods are used in machine translation when their in-
tegration into the decoder is challenging in terms of memory use. 5-gram
language model rescoring and lattice minimum Bayes’ risk rescoring are two
such methods. On the other hand, if scalability issues can be overcome,
integrating complex models into first-pass decoding may provide a simpler
and easier to use system. Furthermore, because larger parts of the decoding
search space may be explored, single pass methods may also provide transla-
tion quality gains, however this is not guaranteed unless there is no pruning
during decoding.
2.10.1 5-gram Language Model Lattice Rescoring
Brants et al. (2007) showed that using vast amounts of data for training
language models can improve translation. However, using high order n-gram
language models trained on a large dataset in decoding can create memory
problems. A solution is to use such models to rescore the output lattice of
a first-pass translation. These first-pass lattices are generated using a lower
order language model.
The language model we use for rescoring is a 5-gram zero cutoff Stupid
Backoff language model (see Section 2.7.2). To build this model, n-grams of
order up to 5 and containing words in the target side of the parallel corpus are
extracted in parallel from the monolingual data, resulting in an n-gram count
file. Then, n-grams from the first-pass translation lattice are extracted using
counting transducers (Allauzen et al., 2003). Finally, n-grams from the count
file are filtered according to the n-grams from the first-pass translation lattice.
A sentence specific Stupid Backoff (Brants et al., 2007) 5-gram language
model is built using these n-grams and constitutes a feature along with the
first-pass language model, the translation model, and a length penalty in a
log-linear model that is used for lattice rescoring. The weights are optimised
for BLEU on a development set.
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 44
2.10.2 Lattice Minimum Bayes’ Risk Rescoring
In Section 2.2, we have presented the general objective function for trans-
lation in Equation 2.43 as a decision that minimises the number of er-
rors (Bishop, 2006, p. 39-40):
xe = argmax
e
p(e | f) (2.43)
This objective function is a particular case of Minimum Bayes-Risk (MBR)
decoding for SMT (Kumar and Byrne, 2004). MBR takes the general form
in Equation 2.44:
xe = argmin
e0∈E
∑
e∈E
a(eP e0)e (e | f) (2.44)
where:
• f is the source sentence
• e and e0 are target sentences
• E is the set of possible translations
• a is a loss function
If the loss function a is defined as a zero-one loss function as in Equation 2.45:
∀eP e0 a(eP e0) =
{
−1 if e = e0
0 otherwise
(2.45)
then we obtain the objective defined in Equation 2.43. In that sense, the
minimum error objective is a special case of the MBR objective. If the
translation quality is measure by BLEU, then an appropriate function for
a is 1−SBLEU where SBLEU is the sentence level BLEU score, defined as
BLEU with a corpus containing only one sentence pair. Kumar and Byrne
(2004) use this definition of SBLEU for experiments on n-best list MBR
rescoring. Even though SBLEU between two hypotheses might be zero, it
is less likely that a particular hypothesis have an expected loss of zero. For
other applications such as metrics evaluation, it is beneficial to introduce
smoothing for sentence-level BLEU (Lin and Och, 2004). Kumar and Byrne
(2004) demonstrate gains in translation quality when rescoring n-best lists
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 45
with the MBR objective function. Gains measured by the metric used to
define the loss function are of course more important.
n-best list MBR rescoring requires d(n) sentence-level BLEU computa-
tions for each hypothesis, so d(n2) sentence-level BLEU computations for
each source sentence. This restricts the depth of the n-best list; for example
Kumar and Byrne (2004) use a 1000-best list but n could be a very large num-
ber. Tromble et al. (2008) remedy this issue by introducing a lattice minimum
Bayes’ risk decoding technique where E is a lattice of possible translations
rather than an n-best list. They use Equation 2.46 as an approximation of
Equation 2.44:
xz = argmax
p′∈E
(
(|z′|+
∑
u∈N
u=u(z
′)p(u|E)
)
(2.46)
where
• N is the set of all n-grams in the lattice E
• u is an n-gram specific constant
• =u(z′) is the number of times u occurs in z′
• p(u|E) is the posterior probability of the n-gram u in the lattice E
Decoding is performed using weighted finite state transducers (Tromble et al.,
2008). A more efficient implementation was designed subsequently (Black-
wood et al., 2010; Blackwood, 2010).
2.10.3 Lattice Minimum Bayes-Risk for System Combi-
nation
Machine translation system combination is a paradigm that exploits the dif-
ferences in strength and weaknesses from different machine translation sys-
tems. Given an input source sentence to be translated, the source sentence
is translated by several MT systems which produce one or more translations.
These translations are then rescored and/or combined with the hope to pro-
duce a final translation of equal or better quality than the translations of any
of the individual systems.
System combination techniques can be classified by whether the final
translation can be a novel hypothesis that has not been generated by any of
CHAPTER 2. STATISTICAL MACHINE TRANSLATION 46
the individual systems. System combination techniques that can generate a
novel hypothesis usually employ consensus network decoding (Fiscus, 1997).
In consensus network decoding for MT, n-best lists obtained from individual
systems are aligned to a reference, e.g. the MBR hypothesis, then the final
hypothesis is obtained by majority voting. In practice, it is difficult to apply
consensus network decoding to deep n-best lists or to lattices because of the
expensive alignment operation.
Another broad class of system combination techniques that do not gener-
ate novel hypotheses include n-best list rescoring as well as lattice rescoring
techniques. Blackwood (2010) argues that one possible advantage over con-
sensus network decoding is that fluency is more likely to be preserved and
that it is possible to exploit larger hypothesis spaces coming from individual
systems.
It is possible to extend the lattice Minimum Bayes-Risk decoding strategy
presented in Section 2.10.2 for system combination. We consider two systems
that produce two translation lattices E) and E2. The statistics p(u|E) are
computed by interpolation. The decoding equation becomes:
xe = argmax
e0∈E1∪E2
(
(|e0|+
∑
u∈N1∪N2
u=u(e
0)()p(u|E)) + 2p(u|E2))
)
(2.47)
The weights ()P 2) are such that ) + 2 = 1 and are tuned for BLEU on
a development set.
Chapter 3
Data Structures for Hierarchical
Phrase-Based Translation
Grammars
We have seen in Chapter 2 that the main components of an SMT system
are the translation model and the language model. The translation model is
trained on parallel text while the language model is typically trained on the
target side of the parallel text and possibly additional monolingual target
text. It has been shown that increasing the amount of parallel text (Pino
et al., 2010) and increasing the amount of monolingual text (Brants et al.,
2007) is beneficial for translation quality.
Monolingual and parallel text availability and distributed computing tech-
niques have allowed SMT researchers to build ever larger language models
and translation models, based on very large amounts of data. However, de-
coders need to be able to retrieve information, such as n-gram conditional
probabilities or translation grammar rules, efficiently from these models to
be able to translate an input sentence or a set of input sentences in a rea-
sonable amount of time. For online systems such as the commercial systems
mentioned in Chapter 1, translation even needs to be carried out in real time,
i.e. for example with a speed of 2000 words translated per minute.
Therefore SMT models need to be stored in a data structure that supports
efficient querying and is scalable. We introduce a solution that addresses
these two requirements and that has the advantage that it reuses an existing
framework in order to ease the implementation. We store translation models
47
CHAPTER 3. DATA STRUCTURES FOR HIERO GRAMMARS 48
0
1000
2000
3000
4000
5000
6000
7000
2006 2007 2008 2009 2010 2011 2012 2013
N
um
be
r o
f m
illi
on
s
of
to
ke
ns
Year
EN parallel data
EN monolingual data
Figure 3.1: Number of English tokens (in millions) in parallel and monolin-
gual data available for the WMT French-English constrained track transla-
tion shared task for the years 2006 to 2013.
as a set of key-value pairs in an HFile1. We apply this strategy in order to
retrieve relevant rules from a hierarchical phrase-based grammar at the test
set level. We compare our approach to alternative strategies and show that
our approach offers competitive performance in terms of speed, memory and
simplicity (Pino et al., 2012). We have made software written for this work
available online.2
3.1 Introduction
Current machine translation research is characterised by increasing amounts
of available data. For example, Figure 3.1 shows that for the WMT machine
translation workshop (Bojar et al., 2013) French-English constrained track
translation shared task, the English side of parallel data has increased from
13.8M tokens in 2006 to 1012.7M tokens in 2013, and that available English
monolingual data has increased from 27.5M tokens to 6852.7M tokens over
the same period. Along with growing amounts of data, the use of more power-
1 http://hbase.apache.org
2 https://github.com/jmp84/ruleXtract (branch PBML-2012)
CHAPTER 3. DATA STRUCTURES FOR HIERO GRAMMARS 49
ful computers and distributed computing models such as MapReduce (Dean
and Ghemawat, 2008; Lin and Dyer, 2010) has enabled machine translation
researchers to build larger statistical machine translation models. MapRe-
duce has thus been applied to various steps of the translation pipeline: word
alignment (Dyer et al., 2008), translation model building (Dyer et al., 2008)
and language modelling (Brants et al., 2007). We review these applications
of MapReduce in more detail in Section 3.2. The challenge is to find effective
modelling techniques that can be implemented in the MapReduce framework.
However, once SMT models such as language models and translation
grammars are built, either with MapReduce or some other method, the mod-
els must be made usable by translation decoders, which only need a fraction
of the information contained in those models to be able to translate an input
source sentence or a set of input source sentences. For example, in translation
from French to English with a phrase-based model (see Section 2.5), given an
input sentence Salut toi, we only need to know the translation probabilities
the translation model assigns to translations of the words Salut and toi and
to translations of the phrase Salut toi. Similarly, given a test set, a decoder
only needs to retrieve the rules whose source side matches part of one of
the source sentences in the test set to be able to generate hypotheses. With
large models, simply retrieving relevant rules together with their translation
probabilities becomes a challenge.
In the system described by Iglesias et al. (2009b), given a training parallel
corpus, rules are extracted and target phrases and hierarchical phrases with
counts are stored on disk. Then, given an input test set, rules are extracted
from the parallel corpus a second time. Only rules relevant to the test set are
retained: source sides of phrase-based rules have to match an n-gram from
the test set and consecutive terminals in source sides of hierarchical rules
also have to match an n-gram from the test set. As a consequence, some
hierarchical rules are never used in decoding. Source sides and rules together
with their counts are kept in memory using hash map datastructures, target
sides with counts are loaded from disk as precomputed in the first step, so
that source-to-target and target-to-source probabilities can be computed.
This method becomes progressively slower with larger amounts of data
because it requires extracting rules from the training parallel corpus and
recomputing translation probabilities each time we translate a new test set
or a new sentence. We would like to improve on this design for more rapid
experimentation. We also would like to use a computing infrastructure that
CHAPTER 3. DATA STRUCTURES FOR HIERO GRAMMARS 50
requires minimal maintenance. For example, HBase3, the open source non-
relational database implementation of BigTable (Chang et al., 2008), has
been applied to the use of distributed language models (Yu, 2008). Rule
filtering, i.e. the retrieval of rules relevant to the translation of a test set or
a sentence, is an essential step in our pipeline that can be a bottleneck. Our
goal is to reduce its processing time from several hours to a few minutes.
In this chapter, we address the problem of retrieving relevant rules with
their translation probabilities. We report on investigations into storing the
model in the HFile data structure. To our knowledge, this is the first de-
tailed proposed implementation of translation model storage and filtering
using HFile data structures. We find that this approach offers a good com-
promise between speed, performance and ease of implementation. The in-
frastructure necessary for filtering is lightweight and requires the use of only
one machine. We will apply this approach to test set rule filtering prior to
decoding. We will discuss alternative strategies as well as their strengths
and weaknesses in terms of speed and memory usage. In Section 3.2, we
will review approaches that have been used for model building and model
filtering. The HFile data structure that is used to store models will be pre-
sented in Section 3.3. In Section 3.4, we detail our custom implementation
of rule extraction and translation model estimation using MapReduce. Our
method and alternative strategies for rule filtering are compared empirically
in Section 3.5. We conclude in Section 3.6.
3.2 Related Work
In this section, we review applications of MapReduce to the translation
pipeline as well as techniques for storing and retrieving from SMT models.
3.2.1 Applications of MapReduce to SMT
Brants et al. (2007) introduce a new smoothing method called Stupid Backoff
(see Section 2.7.4). The Stupid Backoff smoothing scheme is recalled in
3 http://hbase.apache.org
CHAPTER 3. DATA STRUCTURES FOR HIERO GRAMMARS 51
Equation 3.1:
pstupid backoff(wi | wi−)i−n+)) =
n(wii−G+1)
n(wi−1i−G+1)
if x(wii−n+)) S 0
pstupid backoff(wi | wi−)i−n+2) if x(wii−n+)) = 0
(3.1)
With respect to the traditional backoff scheme, the Stupid Backoff scheme
uses no discounting and simply uses relative frequency for the non backed-off
score and the backed-off score scaling parameter is independent of the n-
gram history. Therefore this scheme does not define a conditional probability
distribution over a word given its history.
The Stupid Backoff language model building and application fit the MapRe-
duce framework well. The input to language model building is a large mono-
lingual corpus. The first step is to build a vocabulary. This is done with the
canonical example of word counting with MapReduce. Counts are needed
in order to remove words occurring less than a threshold from the vocabu-
lary. The second step is to obtain n-grams and their count. This is done
again with MapReduce, and the map and reduce functions are analogous to
the ones defined for word counting but this time unigrams are replaced by
n-grams. For n-gram counting, the partition, or sharding, function hashes
on the first two words of each n-gram. In addition, unigram counts and the
total size of the corpus is available in each partition, i.e. to each reducer.
This allows relative frequencies to be computed. Brants et al. (2007) also
demonstrate that for large amounts of monolingual data, i.e. above 10 bil-
lion tokens, Stupid Backoff smoothing and Kneser-Ney smoothing perform
comparably. In addition, only Stupid Backoff smoothing can be scaled to
datasets with more than 31 billion tokens. The scalability of Kneser-Ney
smoothing has been improved in recent work (Heafield et al., 2013).4
Dyer et al. (2008) observe that translation model estimation has become
prohibitive on a single core and that existing ad hoc parallelisation algo-
rithms may be more fragile than using an existing framework such as the
Hadoop implementation of MapReduce.5 They provide solutions to word
alignment model estimation and translation rule extraction and estimation
using MapReduce and demonstrate the scalability of their method.
The convenience of the MapReduce framework for parallelisation has led
to the building of end-to-end toolkits for entire phrase-based (Gao and Vogel,
4 see also http://kheafield.com/code/kenlm/estimation/, Scalability section
5 https://hadoop.apache.org/
CHAPTER 3. DATA STRUCTURES FOR HIERO GRAMMARS 52
2010) and hierarchical phrase-based models (Venugopal and Zollmann, 2009)
for translation using the MapReduce framework.
3.2.2 SMT Models Storage and Retrieval Solutions
We now review techniques appearing in the literature that have been used
to store SMT models and to retrieve the information needed in translation
from these models. SMT models are usually discrete probabilistic models and
can therefore be represented as a set of key-value pairs. To obtain relevant
information from a model stored in a certain data structure, a set of keys
called a query set is formed; each key in this query set is then sought in that
datastructure. Strategies include:
• storing the model as a simple data structure in memory
• storing the model in a text file
• storing the model in more complicated data structures such as tries (Fred-
kin, 1960) (in memory or disk)
• storing fractions of the entire model
• storing data as opposed to a precomputed model
• storing models in a distributed fashion
Each of these strategies is discussed below.
In some cases, it may be possible to fit a model into RAM. In this case
the model can be stored as a memory associative array, such as a hash ta-
ble. In-memory storage allows allows for faster query retrieval than on-disk
storage, however only smaller models will fit in memory. In-memory storage
has been used to store model parameters between iterations of expectation-
maximisation for word alignment (Dyer et al., 2008; Lin and Dyer, 2010).
For larger models, the set of key-value pairs can be stored as a table
in a single text file on local disk. Values for keys in the query set can be
retrieved by scanning through the entire file. For each key in the file, its
membership is tested in the query set. This is the approach adopted in the
Joshua 5.0 decoder (Post et al., 2013)6, which uses regular expressions or
6 Inferred from the decoder training scripts available at http://joshua-decoder.org/
5.0/index.html.
CHAPTER 3. DATA STRUCTURES FOR HIERO GRAMMARS 53
n-grams to test membership (see Section 3.5.4). Venugopal and Zollmann
(2009) use MapReduce to scan a file concurrently: the map function tests if
the vocabulary of a rule matches the vocabulary of a test set.
The model can also be stored using a trie associative array (Fredkin,
1960). A trie is a type of tree where each node represents a shared prefix of a
set of keys represented by the child nodes. Each node only stores the prefix it
represents. The keys are therefore compactly encoded in the structure of the
trie itself. Querying the trie is a O(log(n)) operation, where n is the number
of keys in the dataset. The trie may also be small enough to fit in physical
memory to further reduce querying time. Zens and Ney (2007) use tries to
store a phrase-based grammar. All the source phrases are represented in a trie
stored on disk. Only relevant parts of the trie are loaded into memory when
a source phrase is sought. Ganitkevitch et al. (2012) extend this approach
to store hierarchical phrase-based grammars. Both source sides and target
sides are stored in a packed representation of a trie. Packed tries have also
been applied for storing language models (Pauls and Klein, 2011; Heafield,
2011).
It is also possible to create a much smaller approximate version of the
model. Talbot and Osborne (2007a) represent a set of n-grams as a Bloom
filter (Bloom, 1970). They first use a standard Bloom filter to define a binary
feature that indicates whether an n-gram was seen in a monolingual corpus.
They also use a Bloom filter to encode n-grams together with quantised
counts in order to define a multinomial feature, that is a feature with a finite
set of possible values—in this case the quantised values. Both these data
structures can substantially reduce the disk space and memory usage with
respect to lossless representations of language models. However, they allow
false positives for n-gram membership queries and overestimates of n-gram
quantised counts. The authors demonstrate that the features they introduce
are useful in translation, despite this lack of exactness. In a related pub-
lication (Talbot and Osborne, 2007b), the authors demonstrate how these
techniques can be used as a replacement to represent a smoothed language
model. Talbot and Brants (2008) use a Bloomier filter (Chazelle et al., 2004)
to represent n-grams and necessary statistics such as probabilities and back-
off weights. Unlike Bloom filters that only support Boolean characteristic
functions on a set, Bloomier filters support arbitrary functions, in this case a
mapping between an n-gram and its statistics. False positives can still occur
but for true positives, correct statistics are retrieved.
CHAPTER 3. DATA STRUCTURES FOR HIERO GRAMMARS 54
Guthrie and Hepple (2010) propose an extension to previous work on
randomised language models (Talbot and Osborne, 2007b) which prevents
the random corruption of model parameters but does not stop the random
assignment of parameters to unseen n-grams. Levenberg and Osborne (2009)
extend randomised language models to stream-based language models. An-
other way of building a smaller approximate version of a model is to retain
items with high frequency counts from a stream of data (Manku and Mot-
wani, 2002). This technique has been applied to language modelling (Goyal
et al., 2009) and translation rule extraction (Przywara and Bojar, 2011).
Instead of doing some precomputation on a dataset, it is possible to com-
pute the sufficient statistics at query time using a suffix array (Manber and
Myers, 1990), so that the model can be estimated only when needed. A suf-
fix array is a sequence of pointers to each suffix in a training corpus. The
sequence is sorted with respect to the lexicographic order of the referenced
suffixes. Suffix arrays have been used for computing statistics for language
models (Zhang and Vogel, 2006), phrase-based systems (Callison-Burch et al.,
2005; Zhang and Vogel, 2005), and hierarchical phrase-based systems (Lopez,
2007). Callison-Burch et al. (2005) store both the source side and the target
side of a parallel corpus in two suffix arrays. They also maintain an index
between a position in the source or target side of the corpus and the sentence
number. During decoding, all occurrences of a source phrase are located in
the suffix array representing the source side of the corpus. This produces
a source marginal count. For each of these occurrences, the corresponding
sentence pair and word alignment are retrieved and rules are extracted. This
produces a rule count, which is normalised with the source marginal count to
produce a source-to-target probability. Lopez (2007) extends this approach
to address the grammar extraction and estimation of hierarchical rules. Note
that the use of suffix arrays for translation model estimation only supports
the computation of source-to-target probabilities.
Finally, some approaches store language models in a distributed fashion.
We saw in Section 3.2.1 how a Stupid Backoff language model can be built.
After relative frequencies have been computed, another partition function
that hashes on the last two words of an n-gram is applied so that backoff
operations can be done within the same partition. At decoding time, the
decoder architecture is modified in order to request a batch of n-grams rather
than a single n-gram. The Stupid Backoff language model building can be
scaled to a corpus of 2 trillion tokens and the language model distributed
application can be made real time.
CHAPTER 3. DATA STRUCTURES FOR HIERO GRAMMARS 55
Zhang et al. (2006) propose a distributed large language model backed
by suffix arrays. HBase has also been used to build a distributed language
infrastructure (Yu, 2008). The method we propose to use is closely related to
the latter but we use a more lightweight infrastructure than HBase. In addi-
tion, it is also possible to apply our method to language model querying (Pino
et al., 2012), which demonstrates the flexibility of the infrastructure.
Note that there are many alternative solutions to HBase/HFile for storage
of a large number of key-value pairs, such as Berkeley DB7 or Cassandra8.
The purpose of this chapter is not to compare these alternatives but rather to
compare existing solutions employed in the machine translation community
to one of these solutions.
3.3 HFile Description
We now describe the data structure we use to store models and we review
relevant features to the design of our system. To store a model represented
as key-value pairs, we use the HFile file format,9 which is an open source
reimplementation of the SSTable file format (Chang et al., 2008). The HFile
format is a lookup table with key and value columns. The entries are free
to be an arbitrary string of bytes of any length. The table is sorted lexico-
graphically by the key byte string for efficient record retrieval by key.
3.3.1 HFile Internal Structure
High Level Structure As can be seen in Figure 3.2, the HFile internal
structure is divided into blocks. There are various types of block, depending
on the kind of information a block contains. In order to distinguish block
types, the first 8 bytes of a block will indicate its type. The blocks are
grouped into three parts. The first part (top) is organised into data blocks
interleaved with leaf data index blocks and Bloom blocks. The second part
(middle) consists of intermediate data index blocks. The third part (bottom)
consists of metadata: root data index blocks, a file information block and a
Bloom filter index block.
7 https://oss.oracle.com/berkeley-db.html
8 http://cassandra.apache.org
9 http://hbase.apache.org
CHAPTER 3. DATA STRUCTURES FOR HIERO GRAMMARS 56
Data Block
...
Leaf Data Index Block / Bloom Block
...
Data Block
...
Leaf Data Index Block / Bloom Block
...
Data Block
Intermediate Data Index Blocks
Root Data Index Blocks
File Info Block
Bloom Filter Index Block
Figure 3.2: HFile internal structure.11 The structure is divided into three
parts. In the first part, data blocks are interspersed with leaf data index
blocks and Bloom blocks. The second part contains intermediate data index
blocks. The third part consists of metadata: root data index blocks, file
format information and a Bloom filter index block.
Index Blocks The root data index blocks map keys to their relevant in-
termediate data index block, indicated by an offset. In turn, an intermediate
data index block maps keys to their relevant leaf data index block, again
indicated by an offset. Finally, a leaf data index block maps keys to their
relevant data block, still indicated by an offset. The same mechanism is em-
ployed for the Bloom filter. The Bloom filter index block maps keys to their
relevant Bloom block.
Block Size Configuration The block size is configurable, with a default
size of 64KB. Note that HFile blocks are not to be confused with Hadoop
Distributed File System (HDFS) blocks whose default size is 64MB.
Block Compression The HFile format allows for the blocks to be com-
pressed. The choice of compression codec is selected when the file is created.
We choose the GZip compression codec (Deutsch and Gailly, 1996) for all
our experiments.
CHAPTER 3. DATA STRUCTURES FOR HIERO GRAMMARS 57
3.3.2 Record Retrieval
When the HFile is opened for reading, the root data index blocks are loaded
into memory. To retrieve a value from the HFile given a key, the appropriate
intermediate index block is located by a binary search through the root data
index. Binary searches are conducted on the intermediate and leaf index
blocks to identify the data block that contains the key. The identified data
block is then loaded off the disk into memory and the key-value record is
retrieved by scanning the data block sequentially.
3.3.3 Bloom Filter Optimisation for Query Retrieval
It is possible to query for a key that is not contained in the HFile. This very
frequently happens in translation because of language data sparsity: in our
case, because keys are n-grams of terminals and nonterminals, the number of
possible keys is exponential in the size of the vocabulary, and many of these
will not have been observed in any rule extracted from training data.
Querying the existence of a key is expensive as it involves all the op-
erations described in Section 3.3.2: loading and binary searching the root
data index, loading and binary searching an intermediate data index block,
loading and binary searching a leaf data index block and finally loading and
scanning a data block.
For fast existence check queries, the HFile format allows the inclusion of
an optional Bloom filter (Bloom, 1970). A Bloom filter provides a probabilis-
tic, memory efficient representation of the key set with a O(1) membership
test operation. The Bloom filter may provide a false positive, but never a
false negative, for existence of a key in the HFile. Therefore, it is safe to use
in our setting as some keys will be looked up in the HFile even though they
are not present, but the situation where keys that are in the HFile are not
looked up will not happen.
For a large HFile, the Bloom filter may also be very large. Therefore the
Bloom filter is also organised into blocks called Bloom blocks. Each block
contains a smaller Bloom filter that covers a range of keys in the HFile.
Similar to the root data index, a Bloom index is constructed. To check for
the existence of a key, a binary search is conducted on the Bloom index, the
relevant Bloom block is loaded, and the membership test performed.
11After http://hbase.apache.org/book/book.html (simplified)
CHAPTER 3. DATA STRUCTURES FOR HIERO GRAMMARS 58
Unlike work on Bloom filter language models (Talbot and Osborne, 2007a,b),
this filter only tests the existence of a key and does not return any statistics
from the value. If a membership test is positive, a usual key search will still
be carried out. During the execution of a query, two keys may reference the
same index or Bloom blocks. To prevent these blocks from being repeatedly
loaded from disk, the blocks are cached after reading.
3.3.4 Local Disk Optimisation
The HFile format is designed to be used with HDFS, a distributed file system
based on the Google File System (Ghemawat et al., 2003). Large files are
split into HDFS blocks that are stored on many nodes in a cluster. However,
the HFile format can also be used completely independently of HDFS. If
its size is smaller than local disk space, the entire HFile can be stored on
the local disk of one machine and accessed through the machine’s local file
system. We report in Section 3.5 that using local disk is faster than using
HDFS.
3.3.5 Query sorting optimisation
Prior to HFile lookup, we sort keys in the query set lexicographically. If two
keys in the set of queries are contained in the same block, then the block is
only loaded once. In addition, the computer hardware and operating system
allow further automatic improvements to the query execution. Examples
of these automatic improvements include reduced disk seek time, caching
data from disk by the operating system,12 or CPU caching data from main
memory (Patterson and Hennessy, 2009).
3.4 Hierarchical Rule Extraction with MapRe-
duce
In the previous section, we have described the HFile format. We will use
this format to store a translation grammar. In this section, we describe how
we generate this grammar and produce the HFile that represents the gram-
mar. We describe a MapReduce implementation of hierarchical grammar
12 The Linux Documentation Project, The File System, http://tldp.org
CHAPTER 3. DATA STRUCTURES FOR HIERO GRAMMARS 59
extraction and estimation. The implementation is an extension of Method 3
described by Dyer et al. (2008), which was originally developed to estimate
translation probabilities for phrase-based models.
MapReduce (Dean and Ghemawat, 2008) is a framework designed for
processing large amounts of data in a distributed fashion on a computer
cluster. In this framework, the programmer needs to implement a function
called map and a function called reduce. The map function is defined in
Equation 3.2:
map : (k)P v)) 7−→ [(k2P v2)] (3.2)
where (k)P v)) is a key-value pair and [(k2P v2)] is a list of key-value pairs.
We use the [ ] notation to denote a list. Keys and values are arbitrary byte
sequences. As the key-value pairs (k2P v2) are computed by the map function,
the MapReduce framework groups all values v2 by key k2 in order to provide
an input (k2P [v2]) to the reduce function, defined in Equation 3.3:
reduce : (k2P [v2]) 7−→ [(k+P v+)] (3.3)
The input to the reduce function is a key and a list of values. The reduce
function generates a list of key-value pairs.
Using this notation, we can describe the grammar extraction and estima-
tion as a series of MapReduce jobs, summarised in Figure 3.3. This architec-
ture is similar to the one described in the original publication (Dyer et al.,
2008). For source-to-target and target-to-source probability computation,
we used Method 3 described in that publication, which was reported to be
faster than other methods. Keeping each feature computation as a separate
MapReduce job makes the integration of a new feature more convenient.
3.4.1 Rule Extraction
The first step in grammar extraction and estimation is rule extraction. For
this step, v) represents a sentence pair together with a word alignment and
k) contains metadata about the parallel corpus such as what collection the
sentence pair comes from. This kind of information is useful for domain
adaptation as demonstrated in Section 5.5. Themap function simply extracts
hierarchical rules for that sentence pair and outputs the rules. k2 is a rule
and v2 contains the metadata from k). For rule extraction, there is no reduce
step. Equation 3.4 illustrates the map function for rule extraction:
mvp : (newswireP (f P ePa)) 7−→ [(r)P newswire)P (r2P newswire)P OOO] (3.4)
CHAPTER 3. DATA STRUCTURES FOR HIERO GRAMMARS 60
.k) : metadata
k2 : word-aligned sentence pair
Rule Extraction (see Section 2.6.4)
map : extract rules
reduce : none
k) : rule
k2 : metadata
Source-to-target Probability
map : output (src, (trg, 1))
reduce : sum trg counts, normalise
Target-to-source Probability
map : output (trg, (src, 1))
reduce : sum src counts, normalise
k) : rule
k2 : source-to-target pr
k) : rule
k2 : target-to-source pr
Merge
map : output (src, (trg, feature))
reduce : output (src, list of trg and features)
k) : src
k2 : list of trg and features
Figure 3.3: Translation grammar extraction and estimation pipeline as a se-
ries of MapReduce jobs. Ellipses represent (intermediate) input/output of
MapReduce jobs. Rectangles represent MapReduce jobs. Source, target and
probability are abbreviated into src, trg and pr. Rules are first extracted,
then the source-to-target and target-to-source translation models are esti-
mated. Finally, the features, i.e. the translation models, are merged and the
final output has rules sources as keys and lists of targets with features as
values.
CHAPTER 3. DATA STRUCTURES FOR HIERO GRAMMARS 61
3.4.2 Corpus-Level Feature Computation
Rule extraction is followed by several MapReduce jobs that are run simulta-
neously in order to compute corpus-level features. Corpus-level features are
features related to each rule and the computation of which requires looking
at the entire set of extracted rules. In this description, we only consider
source-to-target and target-to-source probabilities, but many other corpus-
level features are possible, for example features that relate to the word align-
ment from which rules were extracted. We use the metadata stored in the
output of the extraction job in order to compute collection specific proba-
bilities. Collection specific probabilities are probabilities computed over a
specific subset of the training corpus. The metadata indicates whether a rule
was extracted from that specific subset.
Let us consider first the collection-specific source-to-target probability
computation. The input to the map function is the output of the extraction
job, that is a rule together with metadata. The map function checks that
the rule comes from the collection we are interested in and if so, outputs the
source of the rule as a key k2 and the target of the rule together with a count
of one as a value v2. Targets (and counts) corresponding to the same source
are grouped together by the MapReduce framework. The input to the reduce
function is a source (k2) and a list of targets with counts ([v2]). The reduce
function aggregates the counts for the distinct targets and then normalises
those counts with the total count corresponding to the source k2. The output
of the reduce function is a list of pairs (rule, probability).
The target-to-source probability computation is very similar. Apart from
inverting the role played by source and target, the map and reduce function
also need to change the rule format for rules that invert the order of nonter-
minals in the source and the target, for example m→〈m)vm2,m2wm)〉. For
this type of rule, the map function has to invert the order of nonterminals
in both the source and target so that probabilities are computed correctly.
For example, the map function will output the target of the rule written as
m)wm2 and the source of the rule written as m2vm). The reduce function will
restore the original notation.
We have described the computation of collection specific translation prob-
abilities. The source-to-target and target-to-source probabilities over the en-
tire corpus are computed in an identical manner with the collection simply
being the entire corpus. Figure 3.3 illustrates only two corpus-level fea-
tures, however, in practice, more features are computed: source-to-target
CHAPTER 3. DATA STRUCTURES FOR HIERO GRAMMARS 62
and target-to-source probability for each collection and for the entire parallel
text, as well as lexical features (see Section 2.6.5).
3.4.3 Feature Merging
Once all MapReduce features have been computed, a MapReduce job that
merges the features is called. The input key k) to the map function is a
rule and the input value v) is a feature. The map function outputs the
source of the rule as a key k2 and the target of the rule together with the
feature as value v2. Targets and features corresponding to the same source
are grouped together by the MapReduce framework. The input to the reduce
function is a source (k2) and a list of targets with features ([v2]). The reduce
function simply merges the features for identical targets. The output key k+
is the source k2 and the output value v+ is a list of targets with all computed
features.
The output of this merging job is converted to an HFile format. Because
the keys in the HFile format need to be sorted, we also want the output of the
merging job to be sorted by key. One way to achieve this is to specify only one
reducer for the job but this solution is very slow because only one core will be
used for the reduce step. A better way is to use a TotalOrderPartitioner. We
first use a small amount of data so that it may be feasible to carry out rule
extraction with only one reducer in the merge step. We use the output of this
small scale extraction as input to an InputSampler. The sampler will generate
a partition file for the partitioner. The use of a TotalOrderPartitioner will
guarantee that all keys are sorted after the reduce step.
3.5 Hierarchical Rule Filtering for Translation
In the previous section, we have described how to obtain an HFile that rep-
resents a hierarchical translation grammar. We will now describe how this
datastructure can be used to retrieve rules for a given test set efficiently.
We describe our system called ruleXtract, and compare it to other methods
through time and memory measurements.
CHAPTER 3. DATA STRUCTURES FOR HIERO GRAMMARS 63
3.5.1 Task Description
Given a test set, defined as a set of sentences to be translated, and a hier-
archical phrase-based translation grammar, we would like to retrieve all the
relevant rules from the model. A phrase-based rule in the HFile is relevant
if its source is a subsequence of a sentence in the test set. A hierarchical
rule in the HFile is relevant if its source, where nonterminals are instantiated
into terminals, is a subsequence of a sentence in the test set. For exam-
ple, with a test set containing one sentence Salut toi, the phrase-based rules
with sources Salut, toi, Salut toi are relevant and the hierarchical rules with
sources Salut X and X toi are relevant, because instantiating Salut X into
Salut toi produces a subsequence of the test sentence and similarly for X toi.
3.5.2 HFile for Hierarchical Phrase-Based Grammars
Given a test set and an HFile storing a hierarchical phrase-based grammar,
we first generate queries from the test set, then retrieve relevant rules along
with their corpus-level features from the HFile. To generate queries, we have
a set of allowed source patterns and instantiate these patterns against the test
set. Rule patterns were defined in Section 2.6.2. A source pattern is simply
the source side of a rule pattern. Here, we briefly redefine source patterns. A
source pattern is a regular expression that matches the source side of a rule.
For example, the pattern +m, with the source vocabulary, represents a
rule source side containing a sequence of terminals followed by the nontermi-
nal X. If the input sentence is Salut a` toi, the pattern will be instantiated as
Salut X, Salut a` X and a` X. We impose the following constraints on source
pattern instantiation where the first four relate to constraints in extraction
and the last one relates to a decoding constraint:
• smax: maximum number of terminals for phrase-based rules.
• smax elements: maximum number of terminals and nonterminals.
• smax terminals: maximum number of consecutive terminals for hierarchi-
cal rules.
• smax NT: maximum nonterminal span in a hierarchical rule.
• smax span: maximum span for the rule. This constraint is not to be
confused with the previous smax NT constraint. While smax NT represents
CHAPTER 3. DATA STRUCTURES FOR HIERO GRAMMARS 64
the maximum span of a nonterminal on the right hand side of a rule,
smax span represents the maximum span of an entire rule, or alternatively
the maximum span of the nonterminal on the left hand side of the rule.
Because the latter constraint is enforced by the decoder, we also enforce
it in source pattern instantiation so that rules that will never be used
by the decoder are not generated.
The source pattern instances are then sorted for more efficient HFile lookup
(see Section 3.3.5). Each query is then looked up in the HFile and if present,
an HFile record is retrieved. We now compare our approach to similar ap-
proaches whose aim is to obtain rules which are relevant to a test set.
3.5.3 Suffix Arrays for Hierarchical Phrase-Based Gram-
mars
One alternative strategy for building a set specific grammar is to use suf-
fix arrays introduced in Section 3.2. We use the cdec software (Dyer et al.,
2010) implementation of suffix arrays for hierarchical phrase-based rule ex-
traction. The cdec implementation is based on earlier work (Lopez, 2007)
which extends suffix array based rule extraction from phrase-based system
to hierarchical phrase-based systems (see again Section 3.2).
Given a test set, a set of source pattern instances is generated similarly to
what is done for ruleXtract. These source pattern instances are then sought
in a suffix array compiled from the source side of a parallel corpus. Rules are
extracted using the word alignment, and source-to-target probabilities are
then computed as described in Section 3.2.
3.5.4 Text File Representation of Hierarchical Phrase-
Based Grammars
We now describe the Joshua decoder (Weese et al., 2011) implementation for
storing and retrieving from a translation model. The first implementation
variant, which we call Joshua, stores the translation model in a text file.
Given a test set, each word in the test set vocabulary is mapped to the list
of sentences in which it appears. Then, each rule in the translation model is
compiled to a regular expression, and each sentence that contains at least a
vocabulary word of the rule is matched against this regular expression. If at
least one match is successful, the rule is retained.
CHAPTER 3. DATA STRUCTURES FOR HIERO GRAMMARS 65
A faster version is provided that matches consecutive terminals in the
source side of a rule to the set of n-grams extracted from the test set. We
call this version Joshua Fast. A parallel version also exists that chunks the
grammar file and distributes each chunk processing as a separate process on
a cluster running Sun Grid Engine (Gentzsch, 2001). We call this version
Joshua Parallel. The parallel version using the faster matching algorithm is
called Joshua Fast Parallel.
3.5.5 Experimental Design
In this section, we describe our experimental setup in terms of data and
various configurations for grammar extraction, grammar filtering and time
and memory measurements.
Parallel Text We use a small parallel corpus of 750,950 word-aligned sen-
tence pairs and a larger corpus of 9,221,421 word-aligned sentence pairs from
the NIST’12 Chinese-English evaluation, in order to investigate how systems
perform with varying amounts of parallel text and whether those systems are
robust when dealing with larger data sets.
Grammar Extraction From the parallel corpora, we extract hierarchical
grammars with the source-to-target probability feature only. This is done be-
cause we do not want feature computation to introduce variability in timing
results when comparing different strategies and software implementations.
In addition, not all systems can generate the same set of features. For exam-
ple, the suffix array implementation of rule extraction is not able to generate
target-to-source probabilities. Note that in practice for translation decod-
ing, given a vector of parameters, we could simply replace multiple features
in the translation model by a single value representing the dot product of
the features with the parameter vector. However, we need to keep separate
feature values for optimisation (see Section 2.8.2).
The rule extraction constraints described in Section 3.5.2 are set to a
maximum source phrase length smax of 9, a maximum source element (termi-
nal and nonterminal) length smax elements of 5, a maximum number of source
consecutive terminals smax terminals of 5 and a maximum source nonterminal
span smax NT of 10. These settings are standard in all our translation systems
and give competitive performance in terms of translation quality.
CHAPTER 3. DATA STRUCTURES FOR HIERO GRAMMARS 66
With these settings, the small grammar contains approximately 60M rules
while the larger grammar contains approximately 726M rules. The grammars
we obtain are converted to the Joshua format in order to be able to filter
them with the Joshua decoder. We do not generate a hierarchical grammar
with Joshua simply because we want to compare the performance of rule
retrieval from the same grammar.
Grammar filtering For ruleXtract, we use the following constraints for
source pattern instantiation: the maximum source phrase length smax is
set to 9, the maximum source element (terminal and nonterminal) length
smax elements is set to 5, the maximum number of source consecutive terminals
smax terminals is set to 5, the maximum source nonterminal span smax NT is
set to ∞ and the maximum rule span smax span is set to ∞. The first three
constraints are standard in all our translation systems. We use the value
∞ for smax NT and smax span so that the number of rules obtained after fil-
tering is identical between ruleXtract and Joshua. Thus, time and memory
measurements are comparable.
Measurements We report time measurements for query processing and
query retrieval and the total time used to obtain a set specific rule file for
a test set of 1755 Chinese sentences and 51008 tokens. We also report peak
memory usage. For ruleXtract, query processing involves generating source
pattern instances and sorting them according to the HFile sorting order. If
we use a Bloom filter, it also involves pre-filtering the queries with the Bloom
filter. Query retrieval involves HFile lookup. For the Joshua configurations,
query processing involves indexing the test set and generating test set n-
grams and query retrieval involves regular expression matching.
For the Joshua Parallel configurations, we use 110 jobs for the larger
grammar on a cluster of 9 machines. For this latter configuration, we report
the maximum time spent on a job (not the sum) and the maximum memory
usage by a job.
Hardware configuration the machine used for query processing has 94GB
of memory and an Intel Xeon X5650 CPU. The distributed file system is
hosted on the querying machine and other machines with the same specifi-
cation, which are used to generate the HFile.
CHAPTER 3. DATA STRUCTURES FOR HIERO GRAMMARS 67
Our experimental setup was designed for accurate comparisons between
alternative strategies for grammar filtering. In practice, an end-to-end trans-
lation system need not use this exact setup. For example the grammar
extracted by Joshua is in general smaller than the grammar extracted by
ruleXtract because Joshua includes target side constraints. On the other
hand, ruleXtract uses threshold cutoffs, such as a minimum source-to-target
translation probability, in order to reduce the size of the test set specific
grammar.
3.5.6 Results and Discussion
Results are summarised in Table 3.1, from which we can draw the following
observations:
Speed The Total Time column shows that ruleXtract is competitive with
alternative strategies in terms of speed.
Memory The Peak Memory column shows that both ruleXtract and Joshua
are memory hungry, with a usage of up to 40GB. In the case of ruleXtract,
this is because we keep all source pattern instances for all test set sentences
are kept in memory. In the case of Joshua, this is due to a caching optimisa-
tion: rule source sides that have already been considered relevant to the test
set are stored in a hash map.
Local disk optimisation Comparing HDFS and Local rows (row 1 vs.
row 3, row 2 vs. row 4, row 7 vs. row 9, row 8 vs. row 10), we can see that
using the local filesystem as opposed to HDFS gives a small decrease in query
retrieval time, which becomes more substantial for the larger grammar. This
is because when the HFile is stored on HDFS, it is composed of several HDFS
blocks that can be located on different data nodes. Since the HFile size is
smaller than the disk space on each data node, it is preferable to store the
HFile on local disk. However, because the HFile was generated on HDFS
through a MapReduce job, this requires copying or moving the HFile from
the HDFS file system to the local disk file system prior to running grammar
retrieval.
CHAPTER 3. DATA STRUCTURES FOR HIERO GRAMMARS 68
Small Grammar
Row System Query Query Total Peak # Rules
Processing Retrieval Time Memory
1 ruleXtract 9m1s 7m36s 16m40s 40.8G 6435124
HDFS
2 ruleXtract 8m57s 2m16s 11m15s 39.9G 6435124
Bloom, HDFS
3 ruleXtract 8m54s 7m33s 16m30s 40.4G 6435124
Local
4 ruleXtract 8m50s 2m19s 11m11s 38.8G 6435124
Bloom, Local
5 Joshua 0.9s 29m51s 29m54s 42.2G 6435124
6 Joshua 0.9s 7m25s 7m28s 40.1G 7493178
Fast
Large Grammar
Row System Query Query Total Peak # Rules
Processing Retrieval Time Memory
7 ruleXtract 8m56s 22m18s 31m17s 42.2G 47978228
HDFS
8 ruleXtract 9m12 15m33s 24m49s 40.7G 47978228
Bloom, HDFS
9 ruleXtract 8m55s 21m3s 30m1s 41.6G 47978228
Local
10 ruleXtract 9m0s 14m43s 23m46s 40.6G 47978228
Bloom, Local
11 Joshua 0.9s out of out of out of out of
memory memory memory memory
12 Joshua 0.9s out of out of out of out of
Fast memory memory memory memory
13 Joshua 0.9s 537m10s 537m11s 10.1G 47978228
No Cache
14 Joshua 0.9s 78m53s 78m54s 10.1G 83339443
Fast No Cache
15 Joshua total time for slowest job: 43m36s 4G 47978228
Parallel
16 Joshua total time for slowest job: 44m29s 4G 83339443
Fast Parallel
Table 3.1: Time and memory measurements for rule filtering with differ-
ent strategies for a small and a large grammar. Various configurations for
ruleXtract and Joshua are compared for time and memory usage. The fast
configuration for Joshua filters rules by matching consecutive terminals to
test set n-grams, which explains that the number of rules obtained is higher
than in all other configurations. This also means that some filtered rules
are never used in decoding. Measurements were carried out twice for each
condition and the second run was recorded. Thus, measurements reported
take advantage of operating system caching.
CHAPTER 3. DATA STRUCTURES FOR HIERO GRAMMARS 69
Bloom filter optimisation Comparing rows with and without Bloom
(row 1 vs. row 2, row 3 vs. row 4, row 7 vs. row 8 and row 9 vs. row 10), we
can see that the use of a Bloom filter gives an important decrease in query
retrieval time. This is due to the fact that the number of source pattern in-
stances queries is 31,552,746 and after Bloom filtering, the number of queries
is 1,146,554 for the small grammar and 2,309,680 for the larger grammar,
reducing the number of time consuming HFile lookups respectively by 96%
and 93%. Note that Bloom filters increase query processing time only for the
large grammar and more so when using HDFS.
Parallelisation In order to run Joshua on the larger grammar and avoid
memory problems, we needed to use parallelisation, which provided compet-
itive speeds and a low memory footprint (maximum 4G per job) (see rows
15 and 16).
For comparison, cdec’s total processing time is 57m40s for the small gram-
mar, which is significantly slower than the other methods. However, we do
not include cdec’s suffix array implementation performance in Table 3.1 be-
cause the suffix array method involves much on-the-fly computation, such
as rule extraction and source-to-target probability estimation, that has al-
ready been precomputed in the case of Joshua and ruleXtract, making direct
comparison difficult.
Despite this apparent slowness, the use of suffix array methods for rule
extraction favours rapid experimentation because no time consuming pre-
computation is required. With our method, the precomputation involved
consists in generating an HFile. This takes about 20 minutes for the small
corpus and about 5 hours for the larger corpus. With the suffix array method,
compiling the small parallel corpus into a suffix array data structure takes
2m49s and extracting a grammar for the test set from that corpus takes
57m40s as mentioned. Compiling the large parallel corpus into a suffix array
takes 84m29s and extracting a grammar for our test set takes 569m54s. This
gives an indication of the total time needed to extract a grammar for an
arbitrary corpus and an arbitrary test set.
In our case, even though HFile generation might be relatively time con-
suming (5 hours for the larger corpus), once the HFile is generated, it is
possible to quickly extract grammars for arbitrary test sets and arbitrary fil-
tering parameters for rapid experimentation. This approach favours repeated
experimentation with varying development and test sets.
CHAPTER 3. DATA STRUCTURES FOR HIERO GRAMMARS 70
Small Grammar
System Extraction Time Retrieval Time Total Time
cdec 2m49s 57m40s 1h29s
ruleXtract 20m 11m11s 31m11s
Large Grammar
cdec 1h24m29s 9h29m54s 10h54m23s
ruleXtract 5h 23m46s 5h23m46s
Table 3.2: Timings for cdec and ruleXtract for both rule extraction and
retrieval.
Table 3.2 compares timings between suffix array rule extraction and re-
trieval as implemented by cdec and our system. For cdec, the extraction time
corresponds to corpus compilation. For both the small and large grammars,
ruleXtract performs twice as fast.
The ruleXtract system works in batch mode and dividing the number of
words in the test set by the total time in the best configuration (ruleXtract,
Bloom, Local row 10) for the large grammar yields a speed of 35.8 words per
second which is a real time system speed for batch processing tasks in which
latency has little effect. However, running the system in that configuration
gives a speed of 2.5 words per second for the longest sentence in the test
set (135 words) and 1.3 words per second for a sentence of length 20. This
may be due to several factors. First, different sentences in a test set may
share the same source pattern instances, which saves computation time for
batch processing. Second, slow I/O operations such as opening the HFile and
reading the HFile metadata is shared by all test sentences in batch processing.
Finally, processing only one sentence at a time may not benefit from caching
(see Section 3.3.3 and Section 3.3.5) as much as processing an entire test set.
Future work will be dedicated to reduce latency and obtain an actual real
time system at the translation instance level.
3.6 Conclusion
We have presented techniques to build large translation grammars and to
filter these grammars to a given test set efficiently. Our strategy is relatively
easy to implement, it is flexible in that it can be applied to other tasks such
CHAPTER 3. DATA STRUCTURES FOR HIERO GRAMMARS 71
as language modelling, and it does not require extensive computing resources
as it is run on one machine. We have demonstrated that its performance in
terms of speed and memory usage is competitive with other current alterna-
tive approaches.
There are several possible extensions to our current infrastructure. First,
rule retrieval can be parallelised by source sentence in order to provide sen-
tence specific grammars rather than test set specific grammar. Rule retrieval
can also be parallelised by processing queries in parallel. The HFile for-
mat does not support concurrent reading with multithreading, but it allows
multiprocessing reading, which makes this option suitable for MapReduce.
Finally, in order to provide a real time system, causes of latency may be
tackled; for example optimising the source pattern instance creation phase
would help reduce latency.
Chapter 4
Hierarchical Phrase-Based
Grammar Extraction from
Alignment Posterior Probabilities
In Chapter 3, we have described how to exploit the MapReduce framework
and the HFile format in order to generate very large translation grammars
and retrieve rules from these grammars efficiently. The main contribution
was at the infrastructure level rather at the modelling level. In this chapter,
we will attempt to improve models for translation grammar extraction.
Standard practice in SMT systems is to decouple the word alignment
phase from the rule extraction phase. Typically, word alignment models
are only used to obtain a set of alignment links, then those alignment links
determine constraints that are followed in the rule extraction step (see Sec-
tion 2.6.4). In this chapter, we attempt to leverage the information contained
in alignment models by extracting rules from alignment posterior probabili-
ties (de Gispert et al., 2010b). These statistics are computed from the HMM
alignment model (Vogel et al., 1996) and they are used both to generate
constraints for rule extraction and for translation model estimation.
This chapter presents two rule extraction methods. With the first method,
rules are extracted from alignment link posterior probabilities. With the sec-
ond method, rules are extracted from alignment posterior probabilities over
phrase pairs. We demonstrate improvements on a medium scale Chinese-
English task with these methods. We also investigate how best to exploit
source-to-target and target-to-source alignment models.
72
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 73
.Er hat den Ball gesehen
He has seen the ball
Figure 4.1: German-English word-aligned sentence pair. The spurious align-
ment link between the German word hat (has) and the English word seen
prevents the phrase pair 〈hat, has〉 to be extracted from this sentence pair.
4.1 Introduction
In state-of-the-art SMT systems, rules are extracted from word-aligned paral-
lel text. The alignments are typically generated by applying symmetrisation
heuristics (see Section 2.3.2) to Viterbi alignments (see Equation 2.7) ob-
tained from source-to-target and target-to-source word alignment models.
Additional information that these models could provide, such as posterior
probabilities over alignment links, is not used. For example, let us consider
the word-aligned German-English sentence pair in Figure 4.1. Intuitively, we
can tell that there is a spurious alignment link between the German word
hat (has) and the English word seen. This link will prevent the extraction
of the useful phrase pair 〈hat, has〉 from this sentence pair. However, it is
possible that the posterior probability of this spurious link according to the
alignment model is relatively low. We hypothesise that posterior probability
information from alignment models is more reliable than the links obtained
from Viterbi alignment.
In this chapter, we use HMM alignment models (see Section 2.3.1) to gen-
erate the statistics needed to both extract rules and estimate the translation
models. We hypothesise that this tighter coupling between alignment and
translation models will provide better translation quality.
We will evaluate the grammar we obtain in two ways. First, we will assess
the grammar’s ability to generate a reference translation from a source sen-
tence. This is determined by the type of reordering allowed by the grammar
and by the choice of translations for each source side of a rule. We will then
evaluate translation quality provided by this grammar.
Conceptually, our extraction method consists in extracting all possible
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 74
phrase pairs and hierarchical phrase pairs given a sentence pair and selecting
only those that satisfy certain statistical criteria related to alignment pos-
terior probabilities. For example, we can select phrase pairs that contain
a link with a high posterior probability; or we can select phrase pairs that
contain a link with a high posterior probability and that have a high phrase
pair posterior probability. The selection process determines which rules the
grammar will contain and will therefore define the ability of the grammar
to generate a reference translation given a source sentence. We can also use
statistics from alignment models to estimate translation models in a novel
way. In this work, we will use phrase pair posterior probability instead of
integer counts to estimate translation models.
4.2 Related Work
The limitations of extracting translation rules from Viterbi alignments, i.e.
that potentially useful information from the alignment models is ignored, has
been addressed previously. Venugopal et al. (2008) extract rules from n-best
lists of alignments and n-best lists of syntactic parses for a syntax-augmented
hierarchical system (Zollmann and Venugopal, 2006). In the alignment step,
an n-best list of alignments a1P OOOPan is produced with posterior probabil-
ities p(a1 | f P e)P OOOP p(an | f P e). These posteriors are normalised to pro-
duce probabilities xp(a1)P OOOP xp(an). Similarly, probabilities xp(1)P OOOP xp(n0)
are obtained for an n′-best list of parses 1P OOOPn0 . For each alignment ai
and parse j , syntax-augmented hierarchical rules are extracted with a count
xp(ai) xp(j).
Alignment n-best lists have also been used to create a structure called
weighted alignment matrix (Liu et al., 2009). Probabilities xp(a1)P OOOP xp(an)
for n-best alignments a1P OOOPan are computed as previously (Venugopal et al.,
2008). Then, for each word pair (fuP zi), the alignment link posterior proba-
bility pm(jP i) is computed in Equation 4.1.
pm(jP i) =
n∑
k5)
xp(ak)(akP iP j) (4.1)
(akP iP j) indicates whether there is a link between i and j in the alignment
ak. Given a sentence pair, all phrase pairs with a maximum source length
and a maximum target length that contain a link with a posterior greater
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 75
than zero are extracted. The fractional counts assigned to these phrase pairs
are computed in terms of the link posteriors and then used to estimate the
translation models by relative frequency. The fractional count computation
approximates the posterior probability of all alignments consistent with the
phrase pair. Our method also uses link posterior probabilities to constrain
the extraction but the posteriors are computed exactly rather than approx-
imated. In addition, posterior probabilities of consistent alignments is also
computed exactly. Finally, our method is also applied to hierarchical phrase-
based translation.
Alignment posterior probabilities without approximation have also been
used. For a given test set, Deng and Byrne (2008) first extract phrase pairs
in a standard manner. Then, source phrases in the test set that do not
have any corresponding target in the list of extracted phrase pairs are se-
lected. For each of these source phrases, sentence pairs where the source
phrase occurs are considered. For each such sentence pair, all target phrases
in the target sentence are assigned phrase pair posterior probabilities (see
Section 4.3.4) according to the source-to-target and target-to-source align-
ment models, then ranked by the geometric average of the two probabilities.
The top phrase pair is retained if its scores are above specific thresholds.
Our definition of phrase pair posterior probabilities and the procedure to
compute them are directly inspired by the work we just described. However,
we do not use the word-to-phrase HMM model but the simpler word-to-word
HMM model. In addition, our method is applied to hierarchical phrase-based
grammars rather than simpler phrase-based grammars. Finally, our grammar
extraction scheme does not consist in first extracting a standard grammar
and then augmenting the grammar with additional rules: we modify the ex-
traction procedure to directly extract a hierarchical grammar from alignment
link posterior probabilities or phrase pair posterior probabilities.
Kumar et al. (2007) also use exact computation of alignment link pos-
teriors in a different application setting. First, instead of using the Viterbi
criterion for word alignment reminded in Equation 4.2,
a^ = argmax
a
p(f Pa | e) (4.2)
the maximum a posteriori criterion (Matusov et al., 2004), shown in Equa-
tion 4.3, is used:
xvu = argmax
i
p(vu = i | f P e) (4.3)
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 76
Then, given a parallel corpus for three languages F , G, E, the link posteriors
for the language pair (F , E) are computed in terms of the posteriors for the
language pair (F , G) and (G, E). G is called a bridge language. The moti-
vation is that alignments for the F -G language pair and the G-E language
pair may inform alignment for F -E. Multiple bridge languages are used and
produce corresponding posterior matrices. The matrices are interpolated and
alignments are extracted for each bridge language and for the interpolation.
Translation gains are obtained in system combination.
We also note approaches to tighter coupling between hierarchical phrase-
based grammars and alignments or even direct modelling of phrase alignment.
Marcu and Wong (2002) introduce a joint phrase-based model that does not
make use of word alignments. In this generative model, a sentence pair is
produced by concatenating phrase pairs, or so-called concepts. The authors
consider a simpler model with only joint phrase pair translation probabili-
ties and a more complex model with translation and distortion probabilities.
The parameter are trained with an approximate version of the expectation-
maximisation algorithm (Dempster et al., 1977). Experiments demonstrate
translation improvements over IBM Model 4. Birch et al. (2006) constrain
this model in order to be able to apply it to larger parallel corpora. When
searching for a set of phrase pairs to cover a training sentence pair, phrase
pairs that are consistent with the intersection of Viterbi alignments (see
Section 2.3.2) are considered first; other phrase pairs are considered only
when the sentence pair cannot be covered entirely. Results close to standard
phrase-based models are obtained.
DeNero and Klein (2008) prove that phrase alignment is an NP-hard
problem. Given a sentence pair (f P e), a bijective phrase alignment a is
defined as a bijective mapping between source phrases that form a partition
of f and target phrases that form a partition of e. A scoring function
is also defined that assigns a real-valued score to any phrase pair 〈source
phrase, target phrase〉. The score of a bijective phrase alignment is simply the
product of the scores of its phrase pairs. Given (f P eP ), the phrase alignment
optimisation problem is to find the best scoring alignment. DeNero and Klein
(2008) show that this problem is NP-hard by showing that the corresponding
decision problem is NP-complete via reduction of the SAT problem. We give
here an indication of the size of the search space. The number of possible
source partitions is 2|f |−). Given a source partition with K+1 phrases, there
are (K + 1); possible permutation of the source phrases and 2(
|e|−1
K ) possible
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 77
target partitions with K + 1 phrases. In conclusion, there is little hope to
solve the phrase alignment problem exactly.
Saers and Wu (2009) report an improvement on a phrase-based system
where word alignment has been trained with an inversion transduction gram-
mar rather than IBM or HMMmodels. Phrase alignment is directly modelled
with an inversion transduction grammar. The phrase alignment search space
is more restrictive than the space considered in DeNero and Klein (2008) and
the expectation maximisation algorithm can be carried out in d(n6) where n
is the number of tokens in a sentence. Pauls et al. (2010) also use an inversion
transduction grammar to directly align phrases to nodes in a string-to-tree
model. Bayesian methods have also been developed to induce a grammar
directly from an unaligned parallel corpus (Blunsom et al., 2008b, 2009). Fi-
nally, Cmejrek et al. (2009) extract rules directly from bilingual chart parses
of the parallel corpus without using word alignments. We take a different
approach in that we aim to start with very strong alignment models and use
them to guide grammar extraction.
Finally, some work on smoothing, which could be complementary to the
approach taken in this thesis, has been conducted to address the shortcomings
of relative frequency estimation for translation models. Foster et al. (2006)
conduct an extensive series of experiments that either replace the relative
frequency estimated phrase table by a smoothed phrase table or add the
smoothed phrase table as a feature and observe improvement in translation
quality.
4.3 Rule Extraction
In Section 4.2, we have reviewed approaches that “widen” the translation
pipeline by using alignment n-best lists. We have also reviewed applications
of exact computation of alignment posterior probabilities and attempts to
directly model phrase alignment. We will now describe our grammar extrac-
tion methods, based on exact computation of alignment posterior probabili-
ties under an alignment model. As in previous work (Hopkins et al., 2011),
we first present a general approach that encompasses both standard methods
based on rule extraction from Viterbi alignments as well as our methods. For
clarity of presentation, we first describe our methods in the simpler case of
phrase-based rule extraction, then extend them to hierarchical phrase-based
rule extraction.
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 78
1: function ExtractRules(fJ) P zT)Pa)
2: for 1 ≤ j) ≤ j2 ≤ J do
3: for 1 ≤ i) ≤ i2 ≤ I do
4: if SourceConstraints(f u2u1 )
∧ AlignConstraints(f u2u1 P zi2i1 Pa)
∧ TargetConstraints(zi2i1) then
5: Extract(m→〈f u2u1 ,zi2i1〉, Count(f u2u1 P zi2i1))
6: end if
7: end for
8: end for
9: end function
Figure 4.2: General procedure for phrase-based rule extraction: both tradi-
tional rule extraction from Viterbi alignment and our method are instances
of this procedure.
4.3.1 General Framework for Rule Extraction
We first describe a general method for the extraction of phrase-based rules.
An extension of this procedure for hierarchical rules is described in Sec-
tion 4.3.5. The algorithm is described in Figure 4.2. Given a sentence pair
(fJ) P z
T
)), for each source index pair (j)P j2) defining a source phrase f
u2
u1
(line
2), for each target index pair (i)P i2) defining a target phrase zi2i1 (line 3), if
source constraints, target constraints and alignment constraints are satisfied
(line 4), then the phrase pair (f u2u1 , z
i2
i1
) is extracted with a certain count (line
5): the phrase pair is added to the list of phrase pairs used in translation, and
the count will be used subsequently to compute translation models by relative
frequency. The purpose of the constraints is to obtain a manageable number
of rules. If we did not impose constraints, we would extract T(T+))J(J+))
4
(not
necessarily distinct) rules for the sentence pair (fJ) P zT)). During translation,
the decoder would need to apply more pruning, which would potentially lead
to more search errors and a decrease in translation quality.
We will now refine this general procedure to make it more practical and
closer to a possible implementation. Let us call the source constraints CS, the
alignment constraints CL and the target constraints CT . These are Boolean
functions used to select phrase pairs. In practice, source constraints are
checked on the source phrases before looking at the target phrases. If source
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 79
constraints are not met, then we need not consider target phrases for that
source phrase. In addition, target phrases are only considered if they satisfy
alignment constraints with the source phrase and, if they do, we rank them
according to a certain ranking function R. Target constraints also depend
on the ranking R, for example we can decide to keep only a certain num-
ber of target phrases per source phrase. When a phrase pair is extracted,
it is assigned a count which will be used to estimate the source-to-target
and target-to-source translation models. The counting function is called C.
With this notation, we obtain the revised extraction procedure in Figure 4.3.
We will now describe different rule extraction strategies in terms of the con-
straints CS, CL, CT , the ranking function R and the counting function C.
4.3.2 Extraction from Viterbi Alignment Links
In this section, we describe the standard extraction procedure within the
framework introduced in Section 4.3.1. Common practice takes a fixed set of
word alignment links L and extracts rules from this set. Alignment links
L are obtained from the alignment model a either by the Viterbi algo-
rithm or by maximum a posteriori estimation (Matusov et al., 2004; Ku-
mar et al., 2007) and possibly using symmetrisation heuristics to combine
links obtained from source-to-target and target-to-source alignment models
(see Section 2.3.2). We can restate this common approach in the framework
proposed in Section 4.3.1 and in Figure 4.3 where constraints, ranking and
counting functions are defined as follows:
• source constraints CS(f u2u1 ):
j2 − j) < smax (4.4)
where smax is a integer threshold defined experimentally. smax repre-
sents the maximum length of a source phrase.
• alignment constraints CL(f u2u1 P zi2i1 Pa):(
∀(jP i) ∈ LP j ∈ [j)P j2]⇔ i ∈ [i)P i2]
)
∧
(
L∩[j)P j2]×[i)P i2] 6= ∅
)
(4.5)
Alignment constraints have already been described in Section 2.5.3 as
the conditions required for phrase pair extraction. The first bracketed
constraint requires that there be no alignment link between a word
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 80
1: function ExtractRules(fJ) P zT)Pa)
2: for 1 ≤ j) ≤ j2 ≤ J do
3: if ¬CS(f u2u1 ) then . Source constraints
4: continue
5: end if
6: i ← ∅ . Sorted target phrases
7: for 1 ≤ i) ≤ i2 ≤ I do
8: if CL(f u2u1 P zi2i1 Pa) then . Alignment constraints
9: i ← i ∪ zi2i1
10: end if
11: end for
12: Sort(iPR) . Target phrases ranked according to R
13: for zi2i1 ∈ i do
14: if CT (zi2i1 P i ) then . Target constraints
15: Extract(m→〈f u2u1 ,zi2i1〉, C(f u2u1 P zi2i1))
16: end if
17: end for
18: end for
19: end function
Figure 4.3: General procedure for phrase-based rule extraction. This version
is more practical and closer to a possible implementation than the algorithm
in Figure 4.2. Source phrases are first considered. Only if source constraints
CS are satisfied, then target phrases are considered. Targets that satisfy
alignment constraints CL with their source are ranked by R. Finally, phrase
pairs where the target satisfies target constraints can be extracted with a
certain count C. Note that the target constraints implicitly depend on the
ranking of the targets by R.
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 81
inside the phrase pair and a word outside of it. The second bracketed
constraint requires that there be at least one alignment link in the
phrase pair. Sometimes, an additional constraint specifies that the
boundary words in the phrase pair should be aligned. In this work,
this constraint is not present. A phrase pair that satisfies Equation 4.5
is said to be consistent with the alignment (see Section 2.5.3).
• target constraints CT (zi2i1 P i ): no constraint is imposed in this work.
Target constraints based on length may be imposed depending on the
implementation.
• ranking and counting functions:
R(f u2u1 P zi2i1) = C(f u2u1 P zi2i1) = 1 (4.6)
The above constraints, ranking and counting functions define the standard
approach to grammar extraction. In the next sections, we depart from this
approach and apply novel functions to rank and count target-side translations
according to their quality in the context of each parallel sentence, as defined
by the word alignment models. We also depart from common practice in that
we do not use a set of links as alignment constraints. We thus have better
control over the number of extracted rules as well as the relative frequency
estimates of the source-to-target and target-to-source translation models.
4.3.3 Extraction from Posteriors Probabilities over
Alignment Links
For presentation, we only consider source-to-target alignment models: the
random variable a that models the alignment process takes values in func-
tions from source word positions to target word positions. However, it it
possible to apply our method with any directional alignment model. We will
use the link posterior probability p(vu = i | fJ) P zT)) to guide rule extraction.
This statistic expresses how likely it is that a word fu in source position j
and a word zi in target position i are aligned given the sentence pair (fJ) P zT)).
The link posterior probability can be computed efficiently for Model 1, Model
2 and HMM. In our experiments, we only use the HMM model to compute
link posteriors but comparisons between link posteriors obtained from var-
ious models may be interesting in the future. We will derive a closed form
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 82
solution for these models to compute the link posterior probability. Applying
the definition of conditional probability, we obtain the general form of the
link posterior probability in Equation 4.7.
p(vu0 = i( | fJ) P zT)) =
p(vu0 = i(P f
J
) | zT))
p(fJ) | zT))
(4.7)
Using Equation 4.7, we will now derive the link posterior probability pM1(vu0 =
i( | fJ) P zT)) for Model 1, pM2(vu0 = i( | fJ) P zT)) for Model 2 and pHMM(vu0 =
i( | fJ) P zT)) for the HMM model.
4.3.3.1 Link Posterior Probability for Model 1
Let us derive pM1(vu0 = i( | fJ) P zT)), the link posterior probability under Model
1. We use the notation from (Brown et al., 1993), where t is the word-to-
word translation table and " is a constant. We compute the numerator from
Equation 4.7 by marginalising over all possible alignments and inverting sum
and product signs to obtain Equation 4.8:
pM1(vu0 = i(P f
J
) | zT))
=
T∑
l15(
OOO
T∑
lj0−15(
T∑
lj0+15(
OOO
T∑
lJ5(
pM1(v)OOOvu0−)i(vu0+)OOOvJ P f
J
) | zT))
=
T∑
l15(
OOO
T∑
lj0−15(
T∑
lj0+15(
OOO
T∑
lJ5(
"
(1 + I)J
t(fu0 | zi0)
J∏
u5)
u 65u0
t(fu | zlj)
=
"
(1 + I)J
t(fu0 | zi0)
J∏
u5)
u 65u0
T∑
i5(
t(fu | zi) (4.8)
We compute the denominator from Equation 4.7 similarly (see Equation (15)
in (Brown et al., 1993)) and obtain Equation 4.9:
pM1(f
J
) | zT)) =
"
(1 + I)J
J∏
u5)
T∑
i5(
t(fu | zi) (4.9)
After simplification, we obtain Equation 4.10 from Equation 4.8 and Equa-
tion 4.9:
pM1(vu0 = i( | fJ) P zT)) =
t(fu0 |zi0)∑T
i5( t(fu0 |zi)
(4.10)
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 83
4.3.3.2 Link Posterior Probability for Model 2
We apply the same method to compute pM2(vu0 = i( | fJ) P zT)), the link
posterior probability for Model 2. We also use notation from (Brown et al.,
1993) but we replace the notation for the alignment probability v(i | jP JP I)
by pl(i | jP JP I) for clarity. We compute the numerator from Equation 4.7 in
Equation 4.11:
pM2(vu0 = i(P f
J
) | zT))
=
T∑
l15(
OOO
T∑
lj0−15(
T∑
lj0+15(
OOO
T∑
lJ5(
pM2(v)OOOvu0−)i(vu0+)OOOvJ P f
J
) | zT))
=
T∑
l15(
OOO
T∑
lj0−15(
T∑
lj0+15(
OOO
T∑
lJ5(
" pl(i( | j(P JP I) t(fu0 | zi0)
J∏
u5)
u 65u0
pl(vu | jP JP I) t(fu | zlj)
= " pl(i( | j(P JP I) t(fu0 | zi0)
J∏
u5)
u 65u0
T∑
i5(
pl(i | jP JP I) t(fu | zi) (4.11)
We compute the denominator from Equation 4.7 similarly and obtain Equa-
tion 4.12:
pM2(f
J
) | zT)) = "
J∏
u5)
T∑
i5(
pl(i | jP JP I) t(fu | zi) (4.12)
After simplification, we obtain Equation 4.13 from Equation 4.11 and Equa-
tion 4.12.
pM2(vu0 = i( | fJ) P zT)) =
pl(i( | j(P JP I) t(fu0 | zi0)∑T
i5( pl(i | j(P JP I) t(fu0 | zi)
(4.13)
4.3.3.3 Link Posterior Probability for the HMM Model
We now derive pHMM(vu0 = i(|fJ) P zT)), the link posterior probability for the
HMM model (Vogel et al., 1996; Rabiner, 1989). These derivations are stan-
dard once we realise that the observed sequence is the source sentence fJ) ,
the hidden sequence is vJ) and that in addition to standard presentations
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 84
of HMM, all probabilities are conditioned on the target sentence zT). We
compute the numerator from Equation 4.7 in Equation 4.14:
pHMM(vu0 = i(P f
J
) | zT)) = pHMM(vu0 = i(P f u0) P fJu0+) | zT))
= pHMM(f
J
u0+)
| vu0 = i(P f u0) P zT)) pHMM(vu0 = i(P f u0) | zT))
= pHMM(f
J
u0+)
| vu0 = i(P zT)) pHMM(vu0 = i(P f u0) | zT))
= u0(i() u0(i() (4.14)
where u0(i() and u0(i() are respectively the backward and forward HMM
probabilities defined in Equation 4.15:
u(i) = pHMM(f
J
u+) | vu = iP zT))
u(i) = pHMM(vu = iP f
u
) | zT))
(4.15)
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 85
The forward and backward probabilities can be computed recursively as
shown in Equation 4.16 and Equation 4.17:
u(i) = pHMM(vu = iP f
u
) | zT))
=
T∑
k5(
pHMM(vu = iP vu−) = kP f
u−)
) P fu | zT))
=
T∑
k5(
pHMM(fu | vu = iP vu−) = kP f u−)) P zT)) pHMM(vu = iP vu−) = kP f u−)) | zT))
=
T∑
k5(
pHMM(fu | zi) pHMM(vu = i | vu−) = kP f u−)) P zT)) pHMM(vu−) = kP f u−)) | zT))
=
T∑
k5(
pHMM(fu | zi) pHMM(vu = i | vu−) = kP I) u−)(k) (4.16)
u(i) = pHMM(f
J
u+) | vu = iP zT))
=
T∑
k5(
pHMM(f
J
u+2P vu+) = kP fu+) | vu = iP zT))
=
T∑
k5(
pHMM(f
J
u+2 | vu+) = kP fu+)P vu = iP zT)) pHMM(vu+) = kP fu+) | vu = iP zT))
=
T∑
k5(
pHMM(f
J
u+2 | vu+) = kP zT)) pHMM(fu+) | vu+) = kP vu = iP zT))
pHMM(vu+) = k | vu = iP zT))
=
T∑
k5(
u+)(k) pHMM(fu+) | zk) pHMM(vu+) = k | vu = iP I) (4.17)
The denominator from Equation 4.7 is computed in Equation 4.18:
pHMM(f
J
) | zT)) =
T∑
k5(
pHMM(vJ = kP f
J
) | zT))
=
T∑
k5(
J(k) (4.18)
We will use the link posterior probabilities under the HMM model in
order to define constraints, ranking and counting functions.
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 86
4.3.3.4 Constraints, Ranking and Counting Functions from HMM
Link Posterior Probabilities
We use HMM link posterior probabilities computed in Section 4.3.3.3 in order
to define constraints, ranking and counting functions:
• source constraints CS(f u2u1 ):
j2 − j) < smax (4.19)
This is the same constraint as defined for standard Viterbi extraction
in Section 4.3.2.
• alignment constraints CL(f u2u1 P zi2i1 Pa):
∃(jP i) ∈ [j)P j2]× [i)P i2]P p(vu = i | fJ) P zT)) S (4.20)
∀(jP i) ∈ [1P J ]× [1P I] ∩ {(jP i) : p(vu = i | fJ) P zT)) S } (4.21)
j ∈ [j)P j2]⇔ i ∈ [i)P i2]
where is a threshold defined experimentally. Intuitively, is a high
link posterior probability. The first constraint (Equation 4.20) means
that we require at least one link with a high posterior probability in the
phrase pair considered. The second constraint (Equation 4.21) means
that there should be no link with a high posterior probability that
be inconsistent with the phrase pair. Note that these constraints are
identical to the Viterbi alignment constraints defined in Section 4.3.2
if we choose L to be the set of all links with high posterior defined in
Equation 4.22:
L = {(jP i) ∈ [1P J ]× [1P I] : p(vu = i | fJ) P zT)) S } (4.22)
Also note that the second constraint does not consider links to the null
word (see Section 2.3) relevant. This is because we do not need to
include the null word in a translation rule.
• target constraints CT (zi2i1 P i ): we pick the first k translation candidates
according to the ranking function R.
• ranking function:
R(f u2u1 P zi2i1) =
u2∏
u5u1
i2∑
i5i1
p(vu = i | fJ) P zT))
i2 − i) + 1 (4.23)
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 87
This ranking function is very similar to the score used for lexical fea-
tures described in Section 2.6.5. Here, we use link posteriors instead
of Model 1 translation probabilities. This function favours short target
phrases, therefore we do not use it as a counting function. Preliminary
experiments found that this function is not appropriate for counting
rules and that it gives poor results. We therefore use the same count-
ing function as in standard practice described in Section 4.3.2.
• counting function:
C(f u2u1 P zi2i1) = 1 (4.24)
We have described rule extraction from alignment link posterior prob-
abilities. Next, we will describe rule extraction from alignment posterior
probabilities over phrase pairs. This method will use the same source con-
straints, alignment constraints and target constraints but different ranking
and counting functions.
4.3.4 Extraction from Posteriors over Phrase Pairs
In the previous section, we defined and gave closed form solutions to align-
ment link posterior probabilities for Model 1, Model 2 and the HMM model.
We can also define alignment posterior probabilities over phrase pairs. Let
us consider the phrase pair 〈f u2u1 P zi2i1〉 in the sentence pair (fJ) P zT)). In Equa-
tion 4.25, we define V(j)P j2; i)P i2), the set of alignments that have no links
between inside the phrase pair and outside the phrase pair:
V(j)P j2; i)P i2) = {vJ) : vu ∈ [i)P i2]⇔ j ∈ [j)P j2]} (4.25)
Alignments in V(j)P j2; i)P i2) satisfy the consistency constraint defined in
Equation 4.5 but do not require a link in [j)P j2] × [i)P i2]. The posterior
probability of these alignments given the sentence pair is defined in Equa-
tion 4.26:
p(V(j)P j2; i)P i2) | zT)P fJ) ) =
p(fJ) P V(j)P j2; i)P i2) | zT))
p(fJ) | zT))
=
∑
lJ1∈L(u1;u23i1;i2) p(f
J
) P v
J
) | zT))∑
lJ1
p(fJ) P v
J
) | zT))
(4.26)
We call this quantity the phrase pair posterior probability. We will now derive
formula for the phrase pair posterior probability in the case of Model 1, Model
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 88
2 and the HMM Model. Again, experiments only use phrase pair posteriors
computed from the HMM model, but comparing those with the posteriors
obtained from Model 1 and Model 2 may be interesting for future research.
4.3.4.1 Phrase Pair Posterior Probability for Model 1
Let us first define Jin, Jout, Iin and Iout in Equation 4.27 for a set of indices
i), i2, j), j2:
Jin = [j)P j2]
Jout = [1P J ] \ Jin
Iin = [i)P i2]
Iout = [0P I] \ Iin
(4.27)
For Model 1, the numerator from Equation 4.26 is obtained in Equation 4.28:∑
lJ1∈L(u1;u23i1;i2)
pM1(f
J
) P v
J
) | zT))
=
∑
l1∈Tout
OOO
∑
lj1−1∈Tout
∑
lj1∈Tin
OOO
∑
lj2∈Tin
∑
lj2+1∈Tout
OOO
∑
lJ∈Tout
pM1(v
J
) P f
J
) | zT))
=
∑
l1∈Tout
OOO
∑
lj1−1∈Tout
∑
lj1∈Tin
OOO
∑
lj2∈Tin
∑
lj2+1∈Tout
OOO
∑
lJ∈Tout
"
(1 + I)J
J∏
u5)
t(fu | zlj)
=
"
(1 + I)J
( ∏
u∈Jout
∑
i∈Tout
t(fu | zi)
) (∏
u∈Jin
∑
i∈Tin
t(fu | zi)
)
(4.28)
The denominator from Equation 4.26 has already been computed in Equa-
tion 4.9. Simplifying Equation 4.28 and Equation 4.9, we obtain Equa-
tion 4.29:
pM1(V(j)P j2; i)P i2) | zT)P fJ) )
=
( ∏
u∈Jout
∑
i∈Tout
t(fu | zi)∑T
i′5( t(fu | zi′)
)(∏
u∈Jin
∑
i∈Tin
t(fu | zi)∑T
i′5( t(fu | zi′)
)
=
( ∏
u∈Jout
∑
i∈Tout
pM1(vu = i | fJ) P zT))
)(∏
u∈Jin
∑
i∈Tin
pM1(vu = i | fJ) P zT))
)
(4.29)
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 89
4.3.4.2 Phrase Pair Posterior Probability for Model 2
To avoid repetition, we skip the derivation which is analogous to the deriva-
tion for Model 1. We obtain the phrase pair posterior in Equation 4.30:
pM2(V(j)P j2; i)P i2) | zT)P fJ) )
=
( ∏
u∈Jout
∑
i∈Tout
pM2(vu = i | fJ) P zT))
)(∏
u∈Jin
∑
i∈Tin
pM2(vu = i | fJ) P zT))
)
(4.30)
4.3.4.3 Phrase Pair Posterior Probability for HMM
Let us now compute the phrase pair posterior probability for the HMM
model. The denominator from Equation 4.26 can be computed using the
forward algorithm while the numerator can be computed using a modified
forward algorithm (Deng, 2005). Let us define ~u(i), the modified forward
probability in Equation 4.31:
~u(i) = pHMM(V(j)P j2; i)P i2)P f
u
) P vu = i | zT)) (4.31)
The numerator from Equation 4.26 can be computed in Equation 4.32:
p(V(j)P j2; i)P i2)P f
J
) | zT)) =
T∑
i5(
~J(i) (4.32)
The denominator from Equation 4.26 can be computed using the regular
forward probability in Equation 4.33:
p(fJ) | zT)) =
T∑
i5(
J(i) (4.33)
Like the regular forward probability, the modified forward probability can
also be computed recursively. We can also write the modified forward prob-
ability as in Equation 4.34:
~u(i) = pHMM(V(j)P j2; i)P i2)P f
u
) P vu = i | zT))
=
∑
lj1∈L(u1;u23i1;i2)
pHMM(v
u−)
) P f
u
) P vu = i | zT))
=
∑
lj1∈L(u1;u23i1;i2)
lj5i
pHMM(v
u−)
) P f
u
) P vu = i | zT))
(4.34)
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 90
The computation of ~u(i) is by a constrained forward algorithm where the
constraint is given in Equation 4.35. This is because an alignment inV(j)P j2; i)P i2)
cannot have a link from inside the phrase pair to outside the phrase pair (see
Equation 4.25):
∀(jP i) ∈ Jout × Iin ∪ Jin × IoutP ~u(i) = 0 (4.35)
For a link (jP i) ∈ Jout×Iout∪Jin×Iin that satisfies the constraint from Equa-
tion 4.35, we can derive the modified forward probability in Equation 4.36:
~u(i) = pHMM(V(j)P j2; i)P i2)P f
u
) P vu = i | zT))
=
∑
lj−11 ∈L(u1;u23i1;i2)
pHMM(fuP f
u−)
) P vu = iP v
u−)
) | zT))
=
∑
lj−11 ∈L(u1;u23i1;i2)
pHMM(fu | f u−)) P vu = iP vu−)) P zT))×
pHMM(vu = i | f u−)) P vu−)) P zT))×
pHMM(f
u−)
) P v
u−)
) | zT))
= pHMM(fu | zi)
∑
lj−11 ∈L(u1;u23i1;i2)
pHMM(vu = i | vu−)P I) pHMM(f u−)) P vu−)) | zT))
= pHMM(fu | zi)
T∑
k5(
∑
lj−11 ∈L(u1;u23i1;i2)
lj−15k
pHMM(vu = i | vu−) = kP I)
pHMM(f
u−)
) P vu−) = kP v
u−2
) | zT))
= pHMM(fu | zi)
T∑
k5(
pHMM(vu = i | vu−) = kP I)∑
lj−11 ∈L(u1;u23i1;i2)
lj−15k
pHMM(f
u−)
) P vu−) = kP v
u−2
) | zT))
= pHMM(fu | zi)
T∑
k5(
pHMM(vu = i | vu−) = kP I) ~u−)(k) (4.36)
We will use the phrase pair posterior probabilities under the HMM model in
order to define ranking and counting functions.
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 91
4.3.4.4 Constraints, Ranking and Counting Functions from HMM
Link and Phrase Posterior Probabilities
In order to keep the size of the rule set manageable, we use the same source
constraints, alignment constraints and target constraints as for link posterior
extraction defined in Section 4.3.3. We use the phrase pair posterior proba-
bilities under the HMM model both for ranking and scoring extracted rules.
This approach assigns a fractional count to each extracted rule, which al-
lows finer estimation of the source-to-target and target-to-source translation
models. The ranking and counting functions are defined in Equation 4.37:
R(f u2u1 P zi2i1) = C(f u2u1 P zi2i1) = pHMM(V(j)P j2; i)P i2) | f u) P zJ) ) (4.37)
Under the framework described in Section 4.3.1, we have described stan-
dard rule extraction from Viterbi alignments and two novel approaches to
rule extraction based on link posterior probabilities and phrase pair pos-
terior probabilities. In order to expose concepts with more clarity, we have
restricted the presentation to the extraction of phrase based rules as opposed
to hierarchical rules. We will now show how to generalise the techniques pre-
sented so far to the extraction of hierarchical rules.
4.3.5 Hierarchical Rule Extraction
In this section, we extend the techniques presented so far to hierarchical rule
extraction. In order to avoid repetition, we describe these techniques for
the rule pattern 〈wmwPwmw〉 only (see Section 2.6.2.4 for the definition of
patterns). We first rewrite the algorithm in Figure 4.3 into the algorithm in
Figure 4.4 for this pattern. We will now describe constraints, counting and
ranking functions for standard Viterbi extraction and for extraction from
alignment posteriors.
4.3.5.1 Hierarchical Rule Extraction from Viterbi Alignment Links
We now describe the constraints, ranking and counting functions for hierar-
chical rule extraction from Viterbi alignments:
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 92
1: function ExtractRules(fJ) P zT)Pa)
2: for 1 ≤ j) ≤ j2 < j+ ≤ j4 ≤ J do
3: if ¬CS(f u2u1mf u4u3 ) then . Source constraints
4: continue
5: end if
6: i ← ∅ . Sorted hierarchical target phrases
7: for 1 ≤ i) ≤ i2 < i+ ≤ i4 ≤ I do
8: if CL(f u2u1mf u4u3 P zi2i1mzi4i3 Pa) then . Alignment constraints
9: i ← i ∪ zi2i1mzi4i3
10: end if
11: end for
12: Sort(iPR) . Hierarchical target phrases ranked according to R
13: for zi2i1mz
i4
i3
∈ i do
14: if CT (zi2i1mzi4i3 P i ) then . Target constraints
15: Extract(m→〈f u2u1mf u4u3 ,zi2i1mzi4i3〉, C(f u2u1mf u4u3 P zi2i1mzi4i3))
16: end if
17: end for
18: end for
19: end function
Figure 4.4: General procedure for hierarchical phrase-based rule extraction.
The procedure is presented for the pattern 〈wmwPwmw〉 only. This algo-
rithm is analogous to the algorithm used to extract phrase-based rules and
presented in Figure 4.3.
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 93
• source constraints CS(f u2u1mf u4u3 ):
(j2 − j) + 1) + 1 + (j4 − j+ + 1) ≤ smax elements
j2 − j) < smax terminals
j4 − j+ < smax terminals
span(m) ≤ smax NT
(4.38)
We remind definitions introduced in Section 3.5.2 for the thresholds in
Equation 4.38. smax elements is the maximum number of terminals and
nonterminals in the source. smax terminals is the maximum number of
consecutive terminals in the source. smax NT is the maximum span for
a nonterminal, i.e. the maximum number of terminals covered by a
nonterminal. These thresholds are defined experimentally.
• alignment constraints CL(f u2u1mf u4u3 P zi2i1mzi4i3 Pa):
∀(jP i) ∈ LP j ∈ [j)P j4]⇔ i ∈ [i)P i4]
L ∩ [j)P j4]× [i)P i4] 6= ∅
∀(jP i) ∈ LP j ∈ (j2P j+)⇔ i ∈ (i2P i+)
L ∩ {j2 + 1} × (i2P i+) 6= ∅
L ∩ {j+ − 1} × (i2P i+) 6= ∅
L ∩ (j2P j+)× {i2 + 1} 6= ∅
L ∩ (j2P j+)× {i+ − 1} 6= ∅
(4.39)
The first two constraints require that that the phrase pair (f u4u1 P z
i4
i1
) be
consistent with the links L. The third constraint requires that there
be no link from inside the phrase pair (f u3−)u2+) P z
i3−)
i2+)
), which corresponds
to the nonterminal X, to outside the phrase pair. The last four con-
straints mean that the boundary words fu2+)P fu3−)P zi2+)P zi3−) are not
unaligned.
• target constraints CT (zi2i1mzi4i3 P i ): no constraint. Again, other imple-
mentations may impose for example length-based constraints on the
target.
• ranking and counting functions:
R(f u2u1mf u4u3 P zi2i1mzi4i3) = C(f u2u1mf u4u3 P zi2i1mzi4i3) = 1 (4.40)
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 94
4.3.5.2 Hierarchical Rule Extraction from Link Posterior Proba-
bilities
We now describe the constraints, ranking and counting functions for the link
posterior extraction of hierarchical rules:
• source constraints CS(f u2u1mf u4u3 ):
j2 − j) < shier max
j4 − j+ < shier max
j+ − j2 < smax NT
(4.41)
shier max is the maximum length for consecutive terminals in the source.
smax NT is the maximum number of terminals covered by a nonterminal.
These thresholds are defined experimentally.
• For alignment constraints, let us first define Jin, Jout, Iin and Iout in
Equation 4.42 for a set of indices i), i2, i+, i4, j), j2, j+, j4:
Jin = [j)P j2] ∪ [j+P j4]
Jout = [1P J ] \ Jin
Iin = [i)P i2] ∪ [i+P i4]
Iout = [0P I] \ Iin
(4.42)
Alignment constraints CL(f u2u1mf u4u3 P zi2i1mzi4i3 Pa) are defined in Equation 4.43.
∃(jP i) ∈ (j2P j+)× (i2P i+)P p(vu = i | fJ) P zT)) S
∃(jP i) ∈ [j)P j2]× IinP p(vu = i | fJ) P zT)) S
∃(jP i) ∈ [j+P j4]× IinP p(vu = i | fJ) P zT)) S
∃(jP i) ∈ Jin × [i)P i2]P p(vu = i | fJ) P zT)) S
∃(jP i) ∈ Jin × [i+P i4]P p(vu = i | fJ) P zT)) S
∀(jP i) ∈ [1P J ]× [1P I] ∩ {(jP i) : p(vu = i | fJ) P zT)) S }
j ∈ (j2P j+)⇔ i ∈ (i2P i+)
j ∈ Jin ⇔ i ∈ Iin
(4.43)
• target constraints CT (zi2i1mzi4i3 P i ):
i2 − i) < thier max
i4 − i+ < thier max
(4.44)
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 95
thier max is the maximum number of consecutive terminals in the target
side. Another constraint requires to pick the first k targets per source
according to the ranking function R. In the experiments to follow,
k = 3.
• ranking function:
R(f u2u1mf u4u3 P zi2i1mzi4i3) =
∏
u∈Jin
∑
i∈Tin
p(vu = i | fJ) P zT))
i2 − i) + i4 − i+ + 2 (4.45)
• counting function:
C(f u2u1mf u4u3 P zi2i1mzi4i3) = 1 (4.46)
4.3.5.3 Hierarchical Rule Extraction from Phrase Posterior Prob-
abilities
For hierarchical rule extraction based on phrase pair posteriors, we use the
same constraints as for hierarchical rule extraction based on link posteri-
ors. The ranking and counting functions are defined in terms of alignment
posterior probabilities over hierarchical phrase pairs.
We define V(j)P j2; j+P j4; i)P i2; i+P i4) in Equation 4.47.
V(j)P j2; j+P j4; i)P i2; i+P i4) = {vJ) : vu ∈ Iin ⇔ j ∈ Jin} (4.47)
where Iin and Jin have been defined in Equation 4.42. We can now define the
ranking and counting functions in Equation 4.48.
R(f u2u1mf u4u3 P zi2i1mzi4i3) = pHMM(V(j)P j2; j+P j4; i)P i2; i+P i4) | f u) P zJ) )
C(f u2u1mf u4u3 P zi2i1mzi4i3) = pHMM(V(j)P j2; j+P j4; i)P i2; i+P i4) | f u) P zJ) )
(4.48)
In order to compute these functions for the HMM model, we use the same
constrained forward algorithm from Section 4.3.4. This time, the constraints
are given by Equation 4.49.
∀(jP i) ∈ Jout × Iin ∪ Jin × IoutP ~u(i) = 0 (4.49)
where Iin, Jin, Iout and Jout have been defined in Equation 4.42.
We have presented a framework to extract phrase based rules and hierar-
chical rules. This framework encompasses standard extraction from Viterbi
alignment links as well as two novel methods that use link posterior prob-
abilities and phrase pair posterior probabilities. We will now evaluate our
original methods, first by assessing the expressiveness of the grammars ex-
tracted with these methods and then by measuring translation quality.
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 96
G0 G1 G2 G3 G4
S→〈X,X〉 X→〈w X,X w〉 X→〈w X,X w〉 X→〈w X,X w〉 X→〈w X,X w〉
S→〈S X,S X〉 X→〈X w,w X〉 X→〈X w,w X〉 X→〈X w,w X〉 X→〈X w,w X〉
X→〈w,w〉 X→〈w X,w X〉 X→〈w X,w X〉 X→〈X1wX2,wX2X1〉
X→〈w X w,w X w〉 X→〈X1wX2,X2X1w〉
Table 4.1: Hierarchical phrase-based grammars containing different types
of rules. The grammar expressiveness is greater as more types of rules are
included. In addition to the rules shown in the respective columns, G), G2,
G+ and G4 also contain the rules of G(.
4.4 Experiments
In this section, we carry out experiments in order to demonstrate the ef-
fectiveness of our original grammar extraction method. We will assess the
quality of the grammars extracted in terms of how expressive these grammars
are, and in terms of translation quality. In Section 4.4.1, we define the pat-
terns to be included in the grammars used for experiments. In Section 4.4.2,
we describe our experimental setup. In Section 4.4.3 and Section 4.4.4, we
report results for grammar expressiveness and translation quality. In Sec-
tion 4.4.5, we compare our two original methods: link posterior extraction
vs. phrase pair posterior extraction. Finally, in Section 4.4.6, we explore
various methods to exploit information from source-to-target and target-to-
source alignment models.
4.4.1 Experimental Setup: Grammar Pattern Defini-
tion
In this section we define the hierarchical phrase-based grammars we use for
translation experiments. Each grammar is defined by the patterns it contains
(see Section 2.6.2.4). We first start with a basic grammar G( defined in the
left-most column of Table 4.1. G( is a monotonic phrase-based translation
grammar. It includes all phrase-based rules, represented by the rule pattern
m → 〈wPw〉, and the two glue rules that allow concatenation.
Our approach is to extend this grammar by successively incorporating
sets of hierarchical rules, or patterns. The goal is to obtain a grammar with
few rule patterns but capable of generating a large collection of translation
candidates for a given input sentence.
Patterns were added to the base grammar G( according to their frequency
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 97
of use in a translation experiment. We analysed one iteration of minimum
error rate training. For each sentence, the decoder generates a 1000-best
list of hypotheses. For each hypothesis, the decoder also generates its single
best derivation. We extracted rules from the single best derivations of the
single best hypotheses and analysed their pattern frequency. Table 4.2 shows
the most frequently used rule patterns. We assume that if a rule pattern is
used frequently, this means that it is needed in a grammar to produce an
acceptable translation.
〈source , target〉
〈w , w〉
〈w X w , w X w〉
〈w X , X w〉
〈X w , w X〉
〈X2 w X1 , X1 X2 w〉
〈X2 w X1 , w X1 X2〉
〈X2 w X1 , X1 w X2〉
〈X w , w X w〉
〈X2 w X1 , X1 w X2 w〉
〈X2 w X1 , w X1 w X2〉
〈w X1 w X2 , w X1 X2〉
〈w X , w X w〉
〈w X w , w X〉
〈X1 w X2 w , X1 w X2〉
Table 4.2: Most frequently used rule patterns. Frequency was measured
by running the decoder on a development set, and counting rule patterns
on single best derivations of single best hypotheses. We hypothesise that a
frequent rule pattern is needed to produce a translation with good quality.
The grammars used in experiments are summarised in Table 4.1. Each
of these grammars will first be evaluated for expressiveness: we will mea-
sure how often a target sentence can be recovered from a source sentence in
decoding. We will also evaluate these grammars for translation quality.
4.4.2 Experimental Setup
We report on experiments in Chinese to English translation. Parallel training
data consists of the collections listed in Table 4.3. This is approximately
50M words per language. We report translation results on a development set
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 98
ADSO_v5 LDC2006E26 LDC2007E87
LDC2002L27 LDC2006E34 LDC2008E40
LDC2003E07 LDC2006E85 LDC2008E56
LDC2003E14 LDC2006E92 LDC2008G05
LDC2005E83 LDC2006G05 LDC2009E16
LDC2005T06 LDC2007E06 LDC2009G01
LDC2005T10 LDC2007E101 proj_syndicate
LDC2005T34 LDC2007E103 UMD_NewsExplorer
LDC2006E24 LDC2007E46 Wikipedia
Table 4.3: List of collections used as parallel training data for translation
experiments.
tune-nw and a test set test-nw1. These contain translations produced by the
GALE program and portions of the newswire sections of NIST MT02 through
MT06.1 They contain 1755 sentences and 1671 sentences respectively. Results
are also reported on a smaller held-out test set test-nw2, containing 60% of
the NIST newswire portion of MT06, that is, 369 sentences. In Section 4.4.5,
the entire newswire portion of MT06 is used and the test set is named test-
nw3.
The parallel texts for both language pairs are aligned using MTTK (Deng
and Byrne, 2005, 2008). For decoding, we use HiFST (see Section 2.9).
The language model is a 4-gram language model estimated over the English
side of the parallel text and the AFP and Xinhua portions of the English
Gigaword Fourth Edition (LDC2009T13) for the first pass translation, in-
terpolated with a zero-cutoff Stupid Backoff 5-gram (see Section 2.7.4 and
Section 3.2) estimated using 10.5B words of English newswire text for 5-gram
language model rescoring. In tuning the systems, standard minimum error
rate training iterative parameter estimation under BLEU is performed on
the development set.
Rules with a source-to-target probability less than 0.01 are discarded. In
addition, rules that did not occur more than nzms times are discarded. For
Viterbi extraction, nzms = 1. For link posterior extraction, nzms = 2 and
for phrase pair posterior extraction, nzms = 0O2. These thresholds offer were
defined experimentally and provide competitive performance for the various
1 http://www.itl.nist.gov/iad/mig/tests/mt
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 99
conditions without giving unreasonably slow decoding times. The thresholds
introduced in previous sections are defined as follows:
• smax is set to 9 for Viterbi extraction and to 5 for posterior extraction.
A preliminary experiment showed that for posterior extraction, setting
smax to 9 did not provide any improvements and slowed down decoding
times.
• smax elements is set to 5.
• smax terminals is set to 5.
• smax NT is set to 10.
• is set to 0.5.
• shier max and thier max are set to 3.
The thresholds relating to Viterbi extraction (smax, smax elements, smax terminals,
smax NT) are standard in our translation systems. , shier max and thier max were
chosen experimentally and provide competitive performance without slowing
down decoding unreasonably.
4.4.3 Grammar Coverage
We first evaluate the grammars defined in Section 4.4.1 for expressiveness.
We measure the ability of different grammars to produce a reference transla-
tion given an input sentence. Rules are extracted from the sentence pair we
want to align and we run the decoder in alignment mode (de Gispert et al.,
2010a), which is equivalent to replacing the language model by an accep-
tor for the reference translation. We compare the percentage of references
that can be successfully produced by grammars G(, G), G2 and G+2 for the
following extraction methods:
• Viterbi (V): this is the standard extraction method based on a set
of alignment links. We distinguish four cases, depending on the model
used to obtain the set of links: source-to-target (V-st), target-to-source
(V-ts), and two common symmetrisation strategies: union (V-union)
and grow-diag-final (V-gdf) (see Section 2.3.2).
2We did not include ,4 in coverage analysis
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 100
30
40
50
60
70
80
V-st
V-ts
V-union
V-gdf
V-merge
WP-st
WP-ts
WP-merge
G0
G1
G2
G3
Figure 4.5: Percentage of parallel sentences successfully aligned for various
extraction methods and grammars.
• Word Link Posteriors (WP): the extraction method is based on link
posteriors described in Section 4.3.3. These rules can be obtained either
from the posteriors of the source-to-target (WP-st) or the target-to-
source (WP-ts) alignment models. We apply the constraints described
in Section 4.3.3. We do not report alignment percentages when using
phrase pair posteriors as they are roughly identical to the WP case.
• Finally, in both cases, we also report results when merging the extracted
rules in both directions into a single rule set (V-merge and WP-
merge).
Figure 4.5 shows the results obtained for a random selection of 10,000 parallel
corpus sentences. As expected, we can see that for any extraction method,
the percentage of aligned sentences increases when switching from G( to
G), G2 and G+. Posterior-based extraction is shown to outperform standard
methods based on a Viterbi set of alignment links for nearly all grammars.
The highest alignment percentages are obtained when merging rules obtained
under models trained in each direction (WP-merge), approximately reach-
ing 80% for grammar G+.
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 101
The maximum rule span (smax span defined in Section 3.5.2) in alignment
was allowed to be 15 words, so as to be similar to translation, where the
maximum rule span is 10 words. Relaxing this in alignment to 30 words
yields approximately 90% coverage for WP-merge under G+.
We note that if source constraints, alignment constraints and target con-
straints were not applied, then alignment percentages would be 100% even
for G(, but the extracted grammar would include many noisy rules with poor
generalisation power, for example entire sentence pairs.
4.4.4 Translation Results
In this section we investigate the translation performance of each hierarchical
grammar, as defined by rules obtained from three rule extraction methods:
• Viterbi union (V-union): standard rule extraction from the union
of the source-to-target and target-to-source alignment link sets.
• Word Posteriors (WP-st): extraction based on word posteriors as
described in Section 4.3.3. The posteriors are provided by the source-
to-target alignment model.
• Phrase Posteriors (PP-st): extraction based on alignment poste-
riors over phrase pairs, as described in Section 4.3.4, with fractional
counts equal to the phrase pair posterior probability under the source-
to-target alignment model.
Table 4.4 reports the translation results. It also shows the following decoding
statistics as measured on the tune-nw set: decoding time in seconds per input
word, and number of instances of search pruning per input word. Preliminary
experimental work was conducted on grammars G) and G2. Because of slow
decoding times, G4 was not included in this set of experiments; instead, gram-
mar G4 was used to contrast the WP and PP methods in Section 4.4.5. As
a contrast, we extract rules according to the heuristics introduced by Chiang
(2007) and apply the filters described by Iglesias et al. (2009a) to generate a
standard hierarchical phrase-based grammar GS described in Section 2.6.2.4.
This uses rules with up to two nonadjacent nonterminals, but excludes iden-
tical rule types such as m→〈w m,w m〉 or m→〈w m) w m2,w m) w m2〉,
which were reported to cause computational difficulties without a clear im-
provement in translation (Iglesias et al., 2009a).
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 102
Grammar Extraction # Rules tune-nw test-nw1 test-nw2
time prune BLEU BLEU BLEU
,H V-union 979k 3.7 0.3 35.1 35.6 37.6
V-union 614k 0.4 0.0 33.6 34.6 36.4
,1 WP-st 920k 0.9 0.0 34.3 34.8 37.5
PP-st 894k 1.4 0.0 34.4 35.1 37.7
V-union 735k 1.0 0.0 34.5 35.4 37.2
,2 WP-st 1132k 5.8 0.5 35.1 36.0 37.7
PP-st 1238k 7.8 0.7 35.5 36.4 38.2
V-union 967k 1.2 0.0 34.9 35.3 37.0
,3 WP-st 2681k 8.3 1.1 35.1 36.2 37.9
PP-st 5002k 10.7 2.6 35.5 36.4 38.5
Table 4.4: Chinese to English translation results with alternative grammars
and extraction methods (lower-cased BLEU shown). The number of rules
and time (secs/word) and prune (times/word) measurements are done on
tune-nw set.
Condition tune-nw test-nw1
GS/V-union 34.60/34.56/0.02/0.41 34.89/34.73/0.03/0.52
G2/PP-st 34.74/34.56/0.02/0.44 35.04/34.91/0.06/0.62
Table 4.5: Repeated MERT runs with different random parameter initialisa-
tions. Original first-pass BLEU score, BLEU score mean, variance and spread
(in original/mean/variance/spread format) are reported for two conditions.
The results show that the optimisation is relatively stable.
MERT stability For the results reported in Table 4.4, the MERT param-
eter optimisation was carried out once. Following Clark et al. (2011), we
run the MERT parameter optimisation step 10 times from different initial
random parameters for the following conditions: GS/V-union and G2/PP-
st. We report the original first-pass decoding BLEU score and the first-pass
decoding BLEU score mean, variance and spread (i.e. difference between
minimum and maximum) for the additional 10 runs on the tune-nw and test-
nw1 sets in Table 4.5. The results show that the optimisation is relatively
stable. In addition, we can see that the gain in BLEU for our method on the
test set is consistent for the original run and for the average of the additional
10 runs.
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 103
Grammar complexity As expected, for the standard extraction method
(see rows entitledV-union), grammar G) is shown to underperform all other
grammars due to its structural limitations: in parsing the source, G) does
not allow inversions after concatenating two phrases. Grammar G2 obtains
much better scores, nearly generating the same translation quality as the
baseline grammar GS . Finally, G+ does not prove able to outperform G2
consistently, which suggests that the phrase-disjoint rules with one nonter-
minal are redundant for the translation grammar.
Rule extraction method For all grammars, we find that the proposed ex-
traction methods based on alignment posteriors outperform standard Viterbi-
based extraction, with improvements ranging from 0.5 to 1.1 BLEU points
for test-nw1 (depending on the grammar) and from 1.0 to 1.5 for test-nw2.
In all cases, the use of phrase posteriors PP is the best option. Interestingly,
we find that G2 extracted with WP and PP methods outperforms the more
complex GS grammar as obtained from Viterbi alignments.
Rule set statistics For grammarG2 and for the tune-nw set, Viterbi-based
extraction produces 0.7M rules, while the WP and PP extraction methods
yield 1.1M and 1.2M rules, respectively. We further analyse the sets of rules
m → 〈
P 〉 in terms of the number of distinct source and target sequences
and which are extracted. Viterbi extraction yields 82k distinct source
sequences whereas the WP and PP methods yield 116k and 146k sequences,
respectively. In terms of the average number of target sequences for each
source sequence, Viterbi extraction yields an average of 8.7 while WP and
PP yield 9.7 and 8.4 rules on average. This shows that method PP yields
wider coverage but with sharper source-to-target rule translation probability
distributions than method WP, as the average number of translations per
rule is determined by a threshold of 0.01 for the minimum source-to-target
probability.
Decoding time and pruning in search In connection to the previous
comments, we find an increased need for search pruning, and subsequently
slower decoding speed, as the search space grows larger with methods WP
and PP. A larger search space is created by the larger rule sets, which allows
the system to generate new hypotheses of better quality.
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 104
4.4.5 Comparison between WP and PP
Section 4.4.4 has shown that the PP extraction method gives the best trans-
lation results in our experiments. Reasons for this may be that more rules
were extracted and the translation models were better estimated. These re-
sults were obtained with a fixed value of the parameter nzms that was found to
give good results for each method. Now, we would like to observe the effect
of varying nzms. Given an extraction method, we want to observe the effect of
decreasing nzms, that is augmenting the size of the rule set. Additionally, we
want to obtain two comparable rule sets in terms of statistics such as number
of rules, number of different rule source sides for both phrasal and hierarchi-
cal rules, etc., for two different extraction methods in order to observe the
effect of estimating the translation model with one method versus the other.
Table 4.6 summarises the results. The table shows 4 configurations: the
WP extraction method with two different values of nzms and the PP extrac-
tion method with two different values of nzms. Configurations (WP nzms=2)
and (PP nzms=1) give comparable rule sets. Configurations (WP nzms=1)
and (PP nzms=0.2) also give comparable rule sets in terms of size. We first
study the effect of decreasing nzms for each extraction method. For the WP
method, decreasing nzms from 2 to 1 leads to a average decrease of 0.1 BLEU
computed on the test sets for different grammars. We believe that increasing
the size of the rule set can lead to more pruning and therefore to a degra-
dation in translation performance. For the PP method, decreasing the nzms
from 1 to 0.2 leads to an average gain of 0.2 BLEU. We conclude that the
PP method is more robust than the WP method with larger sets of rules.
We then study the effect of using phrase pair posteriors (in PP) versus
using integer counts (in WP) to estimate translation models for comparable
rule sets. The configurations (PP nzms=1) and (PP nzms=0.2) with respect
to the configurations (WP nzms=2) and (WP nzms=1) present an average
gain of 0.2 BLEU. This shows that the translation model estimation using
phrase pair posteriors is beneficial in translation.
4.4.6 Symmetrising Alignments of Parallel Text
In this section, we investigate extraction from alignments (and posterior dis-
tributions) over parallel text which are generated using alignment models
trained in the source-to-target (st) and target-to-source (ts) directions. Our
motivation is that symmetrisation strategies have been reported to be bene-
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 105
,
)
,
2
,
4
C
on
fig
ur
at
io
n
tu
ne
-n
w
te
st
-n
w
1
te
st
-n
w
3
tu
ne
-n
w
te
st
-n
w
1
te
st
-n
w
3
tu
ne
-n
w
te
st
-n
w
1
te
st
-n
w
3
W
P
S
z
ms
=
2
34
.3
34
.8
22
.7
35
.1
36
.0
23
.0
34
.9
35
.5
23
.0
W
P
S
z
ms
=
1
34
.4
35
.1
22
.4
35
.2
36
.0
23
.2
34
.7
35
.4
22
.4
P
P
S
z
ms
=
1
34
.5
34
.7
22
.4
35
.3
36
.2
23
.2
34
.8
35
.6
23
.1
P
P
S
z
ms
=
0.
2
34
.4
35
.1
22
.8
35
.5
36
.4
23
.3
34
.9
35
.7
22
.9
Ta
bl
e
4.
6:
P
er
fo
rm
an
ce
co
m
pa
ri
so
n
m
ea
su
re
d
by
lo
w
er
ca
se
B
LE
U
ac
ro
ss
di
ffe
re
nt
gr
am
m
ar
s
fo
r
di
ffe
re
nt
va
lu
es
of
n
z
ms
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 106
ficial for Viterbi extraction methods (Koehn et al., 2003).
Results are shown in Table 4.7 for grammar G2. We find that rules
extracted under the source-to-target alignment models (V-st, WP-st and
PP-st) consistently perform better than the V-ts,WP-ts and PP-ts cases.
Also, for Viterbi extraction we find that the source-to-target V-st case per-
forms better than any of the symmetrisation strategies, which is not consis-
tent with previous findings for non-hierarchical phrase-based systems (Koehn
et al., 2003).
We use the PP rule extraction method to extract two sets of rules, under
the st and ts alignment models respectively. We now investigate two ways of
merging these sets into a single grammar for translation. The first strategy is
PP-merge and merges both rule sets by assigning to each rule the maximum
count assigned by either alignment model. We then extend the previous
strategy by adding three binary feature functions to the system, indicating
whether the rule was extracted under the stmodel, the tsmodel or both. The
motivation is that minimum error rate training can weight rules differently
according to the alignment model they were extracted from. However, we do
not find any improvement with either strategy.
Finally, we use linearised lattice minimum Bayes-risk decoding to com-
bine translation lattices (see Section 2.10.3) as produced by rules extracted
under each alignment direction (see rows named LMBR(V-st,V-ts) and
LMBR(PP-st,PP-ts)). Gains are consistent when comparing this to apply-
ing lattice minimum Bayes’ risk to each of the best individual systems (rows
named LMBR(V-st) and LMBR(PP-st)). Overall, the best-performing
strategy is to extract two sets of translation rules under the phrase pair pos-
teriors in each translation direction, and then to perform translation twice
and combine the output translations.
4.4.7 Additional Language Pair: Russian-English
We also investigated our proposed method in the context of the Russian-
English language pair, which presents less reordering challenges than Chinese-
English. We repeated experiments from Table 4.4 with the Viterbi method
as a baseline and the phrase-pair posterior method, which showed the best
results for Chinese-English. We use the same word alignment models and
development sets as for the system described in Section 5.2. Results are pre-
sented in Table 4.8. For this language pair, the translation quality obtained
by the Viterbi method and the phrase-pair extraction method is comparable,
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 107
Rule Extraction tune-nw test-nw1 test-nw2
V-st 34.7 35.6 37.5
V-ts 34.0 34.8 36.6
V-union 34.5 35.4 37.2
V-gdf 34.4 35.3 37.1
WP-st 35.1 36.0 37.7
WP-ts 34.5 35.0 37.0
PP-st 35.5 36.4 38.2
PP-ts 34.8 35.3 37.2
PP-merge 35.5 36.4 38.4
PP-merge-MERT 35.5 36.4 38.3
LMBR(V-st) 35.0 35.8 38.4
LMBR(V-st,V-ts) 35.5 36.3 38.9
LMBR(PP-st) 36.1 36.8 38.8
LMBR(PP-st,PP-ts) 36.4 36.9 39.3
Table 4.7: Translation results under grammar G2 with individual rule sets,
merged rule sets, and rescoring and system combination with lattice-based
minimum Bayes’ risk (lower-cased BLEU shown)
Grammar Extraction # Rules tune test1 test2
GS V-union 886k 33.7 32.7 25.6
V-union 386k 33.5 32.3 25.6
G) PP-st 458k 33.4 32.1 25.5
V-union 500k 33.3 32.2 25.5
G2 PP-st 712k 33.7 32.2 25.6
V-union 901k 33.7 32.4 25.5
G+ PP-st 2787k 33.7 32.3 25.8
Table 4.8: Russian to English translation results with alternative grammars
and extraction methods (lower-cased BLEU shown). Experiments from Ta-
ble 4.4 were repeated for the baseline and the phrase-pair posterior extraction
method.
CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 108
with differences in the order of 0.1 BLEU. We hypothesise that because there
is less reordering between Russian and English, the quality of the Viterbi
alignments, therefore the quality of the rules extracted, is high enough to
already produce relatively high quality translations.
4.5 Conclusion
We have presented new rule extraction methods that lead to larger rule sets
as well as better estimated translation models. In Chinese-English trans-
lation experiments, a simple grammar estimated with these methods was
shown to outperform a baseline hierarchical grammar extracted from Viterbi
alignments. A fine-grained analysis was provided to explain the advantages
of the phrase-pair posterior extraction method. Finally, it was shown that
the best way to exploit both alignment directions is to train separate mod-
els for each direction, translate separately and combine the lattice outputs.
In addition, the same techniques were explored in the context of Russian-
English translation and provided the same translation quality as the baseline
technique.
Chapter 5
Domain Adaptation for
Hierarchical Phrase-Based
Translation Systems
In Chapter 3, we have demonstrated improvements in scalability for genera-
tion of and retrieval from hierarchical phrase-based grammars. In Chapter 4,
we have introduced an original model for hierarchical phrase-based gram-
mar extraction and estimation. In this chapter, we will investigate further
refinements for hierarchical phrase-based grammar modelling in the context
of domain adaptation. We will also investigate possible improvements for
language modelling both in the context of domain adaptation and in terms
of training data size.
5.1 Introduction
As described in Section 1.2.1, SMT state-of-the-art systems consist of a
pipeline of various modules. These modules interact in complex ways, which
makes experimentation challenging, but also provides an opportunity to make
improvements in separate modules with the hope that the end-to-end sys-
tem will be improved overall. In this chapter, we describe investigations
into language model and grammar refinement techniques for SMT. We com-
pare various strategies for grammar and language modelling in the context
of domain adaptation. We also explore the interaction between a first pass
language model used in decoding and a second pass language model used in
109
CHAPTER 5. HIERO DOMAIN ADAPTATION 110
rescoring. Finally, we compare two smoothing strategies for large language
model rescoring. The techniques presented in this chapter may be used in or-
der to build high quality systems for a translation evaluation or to customise
a translation system for a client in the industry setting. Figure 5.1 indicates
which modules we attempt to improve.
Experiments are based on a system submitted to the WMT13 Russian-
English translation shared task (Pino et al., 2013). Because we developed a
very competitive system for this evaluation, refinements on that system give a
good indication of which techniques are worthwhile applying to SMT systems.
We briefly summarise the system building in Section 5.2. In Section 5.3, we
review domain adaptation techniques for machine translation. In Section 5.4,
we show how to adapt a language model to obtain better performance on a
specific domain such as newswire text. In Section 5.5, we show how additional
grammar rule features related to specific domains may help in translation.
In Section 5.6, we investigate the various strategies for combining first pass
translation and language model rescoring in terms of language model training
data size and n-gram language model order. Finally, in Section 5.7, we
compare two smoothing strategies for large language model rescoring.
5.2 Description of the System to be Developed
The experiments reported in this chapter are based on the system submitted
to theWMT13 Russian-English translation shared task (Pino et al., 2013). 19
systems were submitted to this particular task. The CUED system obtained
the best case-sensitive BLEU score and the second best human judgement
score amongst constrained-track systems (Bojar et al., 2013).1 In this section,
we summarise the system building.
We use all the Russian-English parallel data available in the constrained
track. We filter out non Russian-English sentence pairs with the language-
detection library.2 A sentence pair is filtered out if the language detector
detects a different language with probability more than 0.999995 in either
the source or the target. This discards 78,543 sentence pairs from an initial
2,543,462. For example, the sentence in Figure 5.2 was detected as French
with a probability of 0.999997. In addition, sentence pairs where the source
sentence has no Russian character, defined by the Perl regular expression
1 http://matrix.statmt.org/matrix/systems_list/1738
2 http://code.google.com/p/language-detection/
CHAPTER 5. HIERO DOMAIN ADAPTATION 111
.Parallel Text
Preprocessing
Word Alignment
Grammar
Extraction
Monolingual Text
Preprocessing
Language
Modelling
Decoding
Tuning
Language Modelling
for Rescoring
First pass LM /
rescoring LM
interaction
Figure 5.1: Machine Translation System Development Pipeline. Modules
that we attempt to refine are in bold blue. Grammar extraction and first-
pass language modelling are refined in the context of domain adaptation. We
also investigate smoothing for language modelling in rescoring. Finally, we
study the interaction between the first-pass language model and the rescoring
language model.
Sur base des pre´ce´dents avis, il semble que l’hoˆtel a fait un effort.
Figure 5.2: Example of target side sentence of a Russian-English sentence
pair. The sentence was detected as actually being French with 0.999998
probability.
CHAPTER 5. HIERO DOMAIN ADAPTATION 112
[\x0400-\x04ff], are discarded. This regular expression corresponds to the
Unicode block for Cyrillic characters (in hexadecimal notation, code 0400 to
code 04ff). This further discards 19000 sentence pairs. We take the view that
discarding a small portion of training data in order to obtain cleaner data is
beneficial in translation, when such cues are easily and reliably available.
The Russian side of the parallel corpus is tokenised with the Stanford
CoreNLP toolkit.3 The English side of the parallel corpus is tokenised with
a standard English tokeniser, which splits sentences into tokens using white
space and punctuation, and retains abbreviations as single tokens. Both sides
of the parallel corpus are then lowercased. Corpus statistics after filtering
and tokenisation are summarised in Table 5.1.
Language # Tokens # Types
RU 47.4M 1.2M
EN 50.4M 0.7M
Table 5.1: Russian-English parallel corpus statistics after filtering and to-
kenisation. Parallel text contains approximately 50M tokens on each side.
This translation task can be characterised as a medium size translation task.
Parallel data is aligned using the MTTK toolkit (Deng and Byrne, 2008):
we train a word-to-phrase HMM model with a maximum phrase length of 4
in both source-to-target and target-to-source directions (see Section 2.3.1).
The final alignments are obtained by taking the union of alignments obtained
in both directions. A hierarchical phrase-based grammar is extracted from
the alignments as described in Section 3.4. The constraints for extraction
described in Section 3.5.2 are set as follows:
• smax = 9 (maximum number of source terminals for phrase-based rules)
• smax elements = 5 (maximum number of source terminals and nontermi-
nals)
• smax terminals = 5 (maximum number of source consecutive terminals)
• smax NT = 10 (maximum source nonterminal span)
3 http://nlp.stanford.edu/software/corenlp.shtml
CHAPTER 5. HIERO DOMAIN ADAPTATION 113
We are using a shallow-1 hierarchical grammar (see Section 2.6.2.4) in our
experiments. This model is constrained enough that the decoder can build
exact search spaces, i.e. there is no pruning in search that may lead to search
errors under the model.
We use the KenLM toolkit (Heafield et al., 2013) to estimate a single
modified Kneser-Ney smoothed 4-gram language model (see Section 2.7.3) on
all available monolingual data available in the constrained track. Statistics
for the monolingual data are presented in Table 5.2.
Corpus # Tokens
EU + NC + UN + CzEng + Yx 652.5M
Giga + CC + Wiki 654.1M
News Crawl 1594.3M
afp 874.1M
apw 1429.3M
cna + wpb 66.4M
ltw 326.5M
nyt 1744.3M
xin 425.3M
Total 7766.9M
Table 5.2: Statistics for English monolingual corpora used to train lan-
guage models. Abbreviations are used for the following corpora: Europarl
(EU), News Commentary (NC), United Nations (UN), Czech-English corpus
(CzEng), Yandex (Yx), 101 French-English corpus (Giga), Common Crawl
(CC), Wikipedia Headlines (Wiki). Corpora “afp” and below are the various
news agency from the GigaWord corpus.
For first-pass decoding, we use HiFST as described in Section 2.9. First-
pass decoding is followed by a large 5-gram language model rescoring step as
described in Section 2.10. The 5-gram language model also uses all available
monolingual data described in Table 5.2.
Two development sets are available: newstest2012 with 3003 sentences
and newstest2013 with 3000 sentences. We extract odd numbered sentences
from newstest2012 in order to obtain a tuning set newstest2012.tune. Even
numbered sentences from newstest2012 form a test set newstest2012.test.
The newstest2013 set is used as an additional test set. newstest2012.tune,
CHAPTER 5. HIERO DOMAIN ADAPTATION 114
Configuration tune test1 test2
baseline 1st pass 33.22 32.01 25.35
baseline +5g 33.33 32.26 25.53
Table 5.3: WMT13 Russian-English baseline system. Performance is mea-
sured by case-insensitive BLEU. The 1st pass configuration indicates results
obtained after decoding. The +5g configuration indicates results obtained
after 5-gram language model rescoring.
newstest2012.test and newstest2013 are respectively shortened to tune, test1
and test2.
We report our baseline experiment in Table 5.3. We indicate case-insensitive
BLEU score after first-pass decoding and 5-gram language model rescoring.
We will now investigate various strategies in order to improve performance
over the baseline. We will investigate domain adaptation techniques in order
to estimate better language models and grammars for translating newswire
data. We will also study the interaction between the first-pass language
model and the rescoring language model. Finally, we will study an alterna-
tive smoothing technique to Stupid Backoff for the rescoring language model.
5.3 Domain Adaptation for Machine Transla-
tion
In this section, we first introduce in Section 5.3.1 the general problem of
domain adaptation and how certain machine translation settings may be
instances of this problem. We then review previous work on addressing the
problem of domain adaptation in SMT in Section 5.3.2.
5.3.1 Domain Adaptation
Let us first formalise the problem of domain adaptation. In a standard multi-
class classification problem, we are given a training set {(xiP yi) ∈ X ×Y P i ∈
[1P c ]} where X is a set of instances to be labelled and Y is a finite set of
labels. The multi-class classification learning problem is to find a function
f : X → Y that minimises the number of prediction errors. Machine transla-
tion can be seen as a multi-class classification problem where X is the set of
CHAPTER 5. HIERO DOMAIN ADAPTATION 115
In addition, a new seven-year EU budget needs to be passed,
which is very complicated due to the current crisis.
Figure 5.3: Example of English sentence in the newswire domain.
all possible source sentences and Y is the set of all possible target sentences.
The multi-class classification framework is not very intuitive for machine
translation. First, at least in theory, there are an infinite number of possible
target sentences to choose from in translation. In addition, many possible
translations are correct or acceptable in general. However, the multi-class
classification setting is assumed by the original objective function (see Sec-
tion 2.2), which we recall in Equation 5.1. This objective function minimises
the number of misclassifications.
e^ = argmax
e
p(e | f) (5.1)
A domain is a particular distribution D over X . For example, D can be
such that source sentences in X drawn from D are in the newswire domain.
For example, if source sentences are in English, then source sentences drawn
sampled from X with distribution D may look like the sentence in Figure 5.3.
For domain adaptation, we assume an out-of-domain distribution DO and
a in-domain distribution DT . A model is trained on a sample drawn from DO
but the model performance is evaluated on a sample drawn from DT . Domain
adaptation consists in altering the training procedure by using information
from domain DT in order to achieve better performance on DT . The out-
of-domain and in-domain distributions are also called source domain and
target domain respectively. A simple example of domain adaptation setting
for SMT may be that the parallel text is a parallel corpus of tweets4 and
sentences to be translated come from newspaper articles.
5.3.2 Previous Work on Domain Adaptation for SMT
In machine translation, we can distinguish two types of domain adapta-
tion problem: cross-domain adaptation and dynamic adaptation (Foster and
4 https://twitter.com/
CHAPTER 5. HIERO DOMAIN ADAPTATION 116
Kuhn, 2007). In cross-domain adaptation, a small amount of in-domain
training data is available while in dynamic adaptation, no in-domain training
data is available. Dynamic adaptation may be important for online transla-
tion systems for example. Our concern is cross-domain adaptation, where at
least an in-domain development set is available for parameter tuning.
Koehn and Schroeder (2007) explore various techniques for machine trans-
lation domain adaptation. They use Europarl (Koehn, 2005) as out-of-
domain training data and a news commentary parallel corpus5 for in-domain
training data. Training the language model on in-domain training data gives
an improvement of 0.8 BLEU with respect to the baseline. Training two sep-
arated language models on the in-domain and out-of-domain training data
and interpolating them with weights set to minimise perplexity on the tuning
set gives an improvement of 0.4 BLEU. Using these two language models as
separate features to be optimised by MERT (Och, 2003) gives an improve-
ment of 0.6 BLEU. Finally, using two separate translation models trained on
the in-domain and out-of-domain data and tuning the weights with MERT
gives an improvement of 0.9 BLEU. These results are reported on the tuning
set only. In this chapter, we explore very similar techniques.
5.4 Language Model Adaptation for Machine
Translation
In this section, we contrast two simple language model adaptation techniques
for machine translation. Experiments are carried out on the Russian-English
system described in Section 5.2.
We use the KenLM toolkit to estimate separate modified Kneser-Ney
smoothed 4-gram language models for each of the corpora listed in Table 5.2.
The component models are then interpolated with the SRILM toolkit (Stol-
cke, 2002) to form a single LM. The interpolation weights are optimised for
perplexity on the concatenation of the English side of the news-test2008,
newstest2009 and newssyscomb2009 development sets which were provided
for previous translation evaluations, using the compute-best-mix tool, which
is part of the SRILM toolkit. The weights reflect both the size of the compo-
nent models and the genre of the corpus the component models are trained
on, e.g. weights are larger for larger corpora in the news genre (Foster et al.,
5 http://statmt.org/wmt07/shared-task.html
CHAPTER 5. HIERO DOMAIN ADAPTATION 117
Row Configuration tune test1 test2
1 baseline 1st pass 33.22 32.01 25.35
2 baseline +5g 33.33 32.26 25.53
3 linear 1st pass 33.71 32.26 25.47
4 linear +5g 33.74 32.51 25.58
5 loglinear 1st pass 33.23 31.75 25.37
6 loglinear +5g 33.28 31.85 25.48
Table 5.4: Performance comparison, measured by case-insensitive BLEU, be-
tween an uninterpolated language model (baseline), a linearly interpolated
language model (linear) and a log-linearly interpolated language model (log-
linear). Off-line linear interpolation (rows 3 and 4) is the best strategy.
The loglinear configuration performs worst, however this may be due to the
MERT optimisation algorithm not being able to find a high quality set of
parameters with 20 features.
2013). Thus we obtain a linear mixture of language models. We call this
configuration “linear”. As a contrast, we also keep the individual language
models as separate features in optimisation. We call this configuration “log-
linear”.
We compare the baseline, the linear and the loglinear configurations in
Table 5.4. We observe that the best strategy for building a language model
is to do offline linear interpolation to minimise perplexity on a development
set that has the same genre as the translation test sets (row 3). The second
best strategy is to use an uninterpolated language model (row 1). The worst
strategy is to do a log-linear interpolation of the various language model
components by tuning these language language models as separate features
with lattice MERT (row 5). Note that these observations are confirmed even
after rescoring with a large Stupid Backoff 5-gram language model (rows 2,
4 and 6). It is possible that the loglinear configuration, which has 20 fea-
tures as opposed to 12 features for the baseline and the linear configuration,
performs worst because the MERT algorithm becomes unstable with more
features (Foster and Kuhn, 2009). These results are in contrast to those
obtained previously (Koehn and Schroeder, 2007), where the loglinear inter-
polation was found to provide more benefits over the linear interpolation. In
addition, more language models are considered (9 vs. 2). Finally, both tune
and test results are reported here.
CHAPTER 5. HIERO DOMAIN ADAPTATION 118
5.5 Domain Adaptation with Provenance Fea-
tures
In Section 5.4, we have presented domain adaptation strategies for the lan-
guage model. In this section, we will focus on domain adaptation for the
translation model. Chiang et al. (2011) introduce provenance lexical fea-
tures. The concept of provenance was introduced previously (Matsoukas
et al., 2009) and provides metadata information such as genre or collection
for each sentence pair in parallel text. Word translation tables and lexical
features are computed for each provenance. Finally, for each provenance, the
ratio of the provenance specific lexical feature with the global lexical feature
is added as a feature to the translation system.
We use a very similar approach with the following differences and exten-
sions:
• The lexical features are computed with Model 1 rather than Viterbi
alignments, as described in Section 2.6.5.
• The features added to the system are not the ratios between a prove-
nance specific lexical feature and the global lexical feature but simply
the provenance specific lexical features.
• We extend this technique to compute provenance specific source-to-
target and target-to-source translation models as described in Sec-
tion 3.4.
For our experiments, we use 4 provenances that correspond to the 4 sub-
corpora provided for parallel text: the Common Crawl corpus, the News
Commentary corpus, the Yandex corpus and the Wiki Headlines corpus.
This gives an additional 16 features to our system: source-to-target and
target-to-source translation and lexical features for each provenance. This
configuration is called “provenance”.
When retrieving relevant rules for a particular test set, various thresholds
are applied, such as number of targets per source or translation probability
cutoffs. Thresholds involving source-to-target translation scores are applied
separately for each provenance and the union of all surviving rules for each
provenance is kept. This configuration is called “provenance union”. This
specific configuration was used for submission to the translation task com-
petition at WMT13. To our knowledge, this original technique has not been
CHAPTER 5. HIERO DOMAIN ADAPTATION 119
Row Configuration tune test1 test2
1 no provenance 1st pass 33.71 32.26 25.47
2 no provenance +5g 33.74 32.51 25.58
3 provenance 1st pass 33.22 32.14 25.40
4 provenance +5g 33.22 32.14 25.41
5 provenance union 1st pass 33.65 32.36 25.55
6 provenance union +5g 33.67 32.58 25.63
7 no provenance union 1st pass 33.22 31.78 25.59
8 no provenance union +5g 33.22 31.78 25.59
Table 5.5: Effect of using provenance features. Case-insensitive BLEU scores
are reported. The most effective strategy is provenance union. Because the
“no provenance union” does not improve over the baseline, we conclude that
the gains obtained by the provenance union strategy are not only due to
additional rules; instead, they are due to the conjunction of additional rules
with provenance features for better discrimination.
employed previously. Results show that it provides additional gains over the
simple use of provenance features.
The first-pass language model is the linear mixture described in Sec-
tion 5.4. We also compare the various configurations after 5-gram rescoring.
Results are presented in Table 5.5. We can see that adding provenance fea-
tures slightly degrades performance (row 3 vs. row 1, row 4 vs. row 2). This
again may be due to the inability of the MERT optimisation algorithm to
tune more than the 12 basic features. However, if we use the provenance
union strategy, we can see that having additional rules together with prove-
nance features is the best strategy (row 5 vs. row 1, row 6 vs. row 2). In
order to verify whether the gains are only due to additional rules, we rerun the
experiment with the provenance union configuration but remove the prove-
nance features (row 7 and row 8). The performance of the “no provenance
union” configuration is worse or equal to the “no provenance” configuration,
which demonstrates that the provenance union strategy is optimal in this set
of experiments and that the gains provided are not only due to having addi-
tional translation rules. The effectiveness of the provenance union strategy
is due to additional rules in conjunction with additional features for better
discrimination between rules.
CHAPTER 5. HIERO DOMAIN ADAPTATION 120
5.6 Interaction between First Pass Language
Model and Rescoring Language Model
In Section 5.4 and Section 5.5, we have investigated various strategies to
adapt the language model and translation model to the genre of the test
set to be translated. In each case, we verified that the conclusions held
even after rescoring with a large 5-gram language model. In this section, we
study the interaction between the first pass language model and the rescoring
language model in terms of n-gram order and training data size. We give
recommendations on how much data should be used to train a first pass
language model and what order should be chosen for the first pass n-gram
language model.
Our translation systems are typically run in two steps. The first step
usually consists in decoding with a 4-gram language model. The second step
consists in 5-gram lattice rescoring. Brants et al. (2007) argue that single
pass decoding is conceptually simpler and may lose less information. In this
section, we also attempt to verify this claim by incorporating 5-gram language
models directly in first-pass decoding and evaluate performance with respect
to a two-pass strategy.
We describe the various configurations used for experiments and reported
in Table 5.6. For monolingual data size, we use a “large” configuration which
uses all available data described in Table 5.2, and a “medium” configuration
which uses the following corpora: EU, NC, UN, CzEng, Yx, Giga, CC, Wiki,
afp and xin. (The keys for abbreviations are described in Table 5.2). We
also train 4-gram and 5-gram language models with large and medium data
for the first pass language model. The “large 4g” configuration corresponds
to the linear configuration from Table 5.4.
The initial parameter for the MERT optimisation step was the same for
all configurations. This parameter was obtained from a separate Arabic-to-
English experiment carried out earlier. Reusing already optimised parame-
ters between experiments allows us to do faster experimentation since less
iterations are needed until convergence.
Results are presented in Table 5.6. Comparing row 1 and row 3, we can
see that for a 4-gram language model, more training data is beneficial in
first-pass decoding. This conclusion is confirmed after 5-gram rescoring (row
2 vs. row 4). However, when comparing row 5 vs. row 7 and row 6 vs. row
8, we conclude that in the case of a first-pass 5-gram language model, more
CHAPTER 5. HIERO DOMAIN ADAPTATION 121
Row Configuration tune test1 test2
1 medium 4g 1st pass 32.96 31.53 24.60
2 medium 4g +5g 33.43 32.12 25.30
3 large 4g 1st pass 33.71 32.26 25.47
4 large 4g +5g 33.74 32.51 25.58
5 medium 5g 1st pass 32.62 31.62 24.77
6 medium 5g +5g 33.20 31.96 25.27
7 large 5g 1st pass 32.99 31.79 25.18
8 large 5g +5g 32.99 31.79 25.18
Table 5.6: Comparing various data size conditions and n-gram orders for lan-
guage model training and the effect on large 5-gram language model rescor-
ing. Case-insensitive BLEU scores are reported. In conclusion, more lan-
guage model training data is helpful but the use of higher order n-gram
language models is only beneficial in rescoring.
training data is only helpful in first-pass decoding. It is possible that even
more monolingual data is required for a 5-gram language model to generate a
lattice of sufficient quality in first-pass decoding so as to obtain improvements
in rescoring.
Comparing row 2 vs. row 6 and row 4 vs. row 8, we see that for equal
amounts of training data, the best strategy is to use a first-pass 4-gram
language model followed by large 5-gram language model rescoring. One
possible reason is that a first pass 5-gram language model may not have
enough training data to be reliably estimated and therefore may not produce
a first-pass lattice of high quality translations to be rescored, whereas a first
pass 4-gram language model is able to discard poor translations in first pass
decoding. One caveat with this conclusion of course is that the maximum
amount of data experimented with is 7.8B words as opposed to 2 trillion
words reported by Brants et al. (2007). Future work may be dedicated to fill
this gap in experimentation. For example, we plan to exploit a large 5-gram
language model (Buck et al., 2014) trained on the Common Crawl corpus6
in rescoring experiments. We are not aware of any studies in the machine
translation literature on interactions between first pass and second pass lan-
guage models. The main conclusion, i.e. that a two-pass decoding approach
6 https://commoncrawl.org/
CHAPTER 5. HIERO DOMAIN ADAPTATION 122
is beneficial over a single-pass approach given the amount of training data,
has not been demonstrated experimentally before.
5.7 Language Model Smoothing in Language
Model Lattice Rescoring
In Section 5.6, we have given recommendations about what strategy to adopt
in first-pass decoding in order to obtain optimal results in rescoring. The
conclusion was to use a 4-gram language model trained on large amounts
of data. In this section, we attempt to improve on the 5-gram rescoring
procedure by exploring two alternative smoothing strategies for the 5-gram
language model used for rescoring.
We train a Stupid Backoff 5-gram language model (see Section 2.7.4 and
Section 3.2.1) and a modified Kneser-Ney 5-gram language model on all avail-
able English data for WMT13, described in Table 5.2. We compare the two
smoothing methods in rescoring over various configurations used throughout
this chapter. We did not include the “provenance union no provenance” and
the “large 5g” configurations. On average, over the seven experiments from
Table 5.7, we obtain a gain of 0.11 BLEU on the tuning set tune, a gain of
0.14 BLEU on the first test set test1 and a gain of 0.13 BLEU on the second
test set test2. We conclude that the use of modified Kneser-Ney smoothing
is slightly beneficial in 5-gram lattice rescoring when the amount of train-
ing data is in the order of several billion tokens. This confirms conclusions
from Figure 5 in a previous publication (Brants et al., 2007) that Kneser-Ney
smoothing is superior to Stupid Backoff smoothing with amounts of train-
ing data in a similar order to the work here. These conclusions were made
for first pass decoding while here. This is the first time these two smooth-
ing strategies are compared in the context of 5-gram language model lattice
rescoring. We also note that we obtain an improvement over the system sub-
mitted to the WMT13 translation task using the “provenance union + KN
5g” configuration (row 15 in bold vs. row 14).
5.8 Conclusion
In this chapter, we have investigated various refinements on translation sys-
tem building. We have explored domain adaptation techniques in order to
CHAPTER 5. HIERO DOMAIN ADAPTATION 123
Row Configuration tune test1 test2
1 baseline 1st pass 33.22 32.01 25.35
2 baseline + SB 5g 33.33 32.26 25.53
3 baseline + KN 5g 33.31 32.28 25.65
4 linear 1st pass 33.71 32.26 25.47
5 linear + SB 5g 33.74 32.51 25.58
6 linear + KN 5g 33.73 32.26 25.48
7 loglinear 1st pass 33.23 31.75 25.37
8 loglinear + SB 5g 33.28 31.85 25.48
9 loglinear + KN 5g 33.30 31.92 25.35
10 provenance 1st pass 33.22 32.14 25.40
11 provenance + SB 5g 33.22 32.14 25.41
12 provenance + KN 5g 33.54 32.37 25.75
13 provenance union 1st pass 33.65 32.36 25.55
14 provenance union + SB 5g 33.67 32.58 25.63
15 provenance union + KN 5g 33.89 32.82 25.84
16 medium 5g 1st pass 32.62 31.62 24.77
17 medium 5g + SB 5g 33.20 31.96 25.27
18 medium 5g + KN 5g 33.24 32.29 25.47
19 medium 4g 1st pass 32.96 31.53 24.60
20 medium 4g + SB 5g 33.43 32.12 25.30
21 medium 4g + KN 5g 33.65 32.46 25.55
Table 5.7: Comparison between the gains obtained by Stupid Backoff 5-
gram rescoring and Kneser-Ney 5-gram rescoring. Case-insensitive BLEU
scores are reported. In conclusion, at the level of several billion tokens for
language model training, Kneser-Ney smoothing is beneficial over Stupid
Backoff smoothing in the context of 5-gram lattice rescoring. Row 15 (in
bold) achieves a better performance than row 14, which was our system
submitted to the WMT13 translation shared task.
CHAPTER 5. HIERO DOMAIN ADAPTATION 124
refine the language model and the translation model. For language model
building, we find that the best strategy for cross-domain adaptation is to
build an interpolated language model and tune the interpolation weights in
order to minimise the perplexity on a development set in the target domain,
as opposed to tuning the language model weights with MERT. We also have
examined the effect of provenance features in the translation model. We
found that the use provenance features is beneficial only when these features
are used to discriminate rules from a larger rule set.
After having attempted to refine models used in first-pass decoding, we
have studied the interaction between the first pass language model and the
5-gram language model used in lattice rescoring. We have come to the con-
clusion that in our setting, a two-pass strategy is preferable and that using
large amounts of data to train a 4-gram language model for first-pass decod-
ing yields better results in rescoring.
Finally, we have demonstrated that Kneser-Ney smoothing produces bet-
ter results than Stupid Backoff smoothing when training a 5-gram language
model on 7.8 billion words for rescoring purposes. This confirms previous
research in the context of rescoring as opposed to first-pass decoding.
Chapter 6
String Regeneration as
Phrase-Based Translation
In the previous chapters, we have proposed various improvements to hier-
archical phrase-based translation systems, in terms of infrastructure (Chap-
ter 3), grammar modelling (Chapter 4 and Chapter 5) and language mod-
elling for first pass decoding as well as rescoring (Chapter 5). In the chapter
to follow (Chapter 7), we will also introduce techniques for potential improve-
ments in fluency in the output of hierarchical phrase-based systems. In this
chapter, we lay the ground work to be applied in translation in Chapter 7.
Machine translation output was originally evaluated by human judges in
terms of adequacy, or how well the meaning of the source text is preserved in
the translated text, and fluency, or how well-formed the translation is (White
et al., 1993), i.e. to what extent it is syntactically, morphologically, and ortho-
graphically correct (Reiter and Dale, 1997). Adequacy and fluency motivate
the separation of SMT models into a translation model and a language model
(see Section 2.2). There is increasing concern that translation output often
lacks fluency (Knight, 2007). In this chapter, we study the related prob-
lem of string regeneration: given a sentence where the word order has been
scrambled, how difficult is it to recover the original sentence ? Our approach
is inspired by phrase-based translation techniques and achieves state-of-the-
art performance on the string regeneration task. Software written for this
chapter is available online.1
1 https://github.com/jmp84/NgramGen
125
CHAPTER 6. STRING REGENERATION AS PHRASE-BASED MT 126
6.1 Introduction
Before the widespread use of automatic metrics such as BLEU (see Sec-
tion 2.8.1) or METEOR (Banerjee and Lavie, 2005), machine translation
systems were originally evaluated in terms of adequacy and fluency. Current
popular automatic metrics for translation can be “gamed” in the sense that it
is possible to obtain a high score according to these metrics with an output
that lacks fluency. For example, a translation may obtain a high BLEU score
simply because it contains many n-grams that are also in the reference trans-
lation but this translation may not be well-formed syntactically. Figure 6.1
shows how an improvement of 20 BLEU points at the sentence level between
an MT hypothesis and an oracle MT hypothesis, i.e. the best performing
hypothesis among a set, may not translate into an improvement in fluency.
Of course, BLEU is designed to measure quality at the set level, so hopefully
other oracle MT hypotheses will probably improve fluency. The nature of the
translation metrics may have led research on translation to neglect fluency.
Correct word ordering is an essential part of fluency. Producing a trans-
lation with words correctly ordered is a very difficult task. Different word
ordering between distant language pairs such as Chinese-English is one of the
main reasons for incorrect word ordering in translation. In general, obtain-
ing a correct word order is a fundamental problem in many natural language
processing applications. In machine translation, because source language and
target language have a different word ordering, a machine translation system
needs to model word reordering. In some language pairs such as French-
English, the source language gives good cues to the target word order; how-
ever, in other language pairs such as Chinese-English or Japanese-English,
the word order can be very different in the source and the target language.
In natural language generation tasks, such as paraphrasing or summarisa-
tion, word ordering is also critical at the surface realisation step (Reiter and
Dale, 1997). Word ordering is also one possible type of grammatical error,
especially for foreign language learners (Yu and Chen, 2012).
In this chapter, we study the word ordering problem in isolation through
the task of string regeneration (Wan et al., 2009). Given an input sentence
where the word order has been scrambled, the string regeneration task con-
sists in recovering the original sentence. This task can be seen as one of the
steps in surface realisation for language generation where the input is a list
of content words and function words and the output a grammatical sentence.
It also allows us to study the realisation task independently of the transla-
CHAPTER 6. STRING REGENERATION AS PHRASE-BASED MT 127
SMT System (19.76 Sentence-Level BLEU): enda accord-
ing to jerzy made a televised speech, the president of burundi civil
war of resistance army soldiers in the past ten years, and had to
settle on january 5 before the completion of the new military
leadership will be in place before january 7, four of the former
rebel army leader.
SMT System Oracle (39.91 Sentence-Level BLEU): en, a
us $according to al made a televised speech, said that the presi-
dent of the burundi’s rebel army soldiers in the past 10-year civil
war must be before january 5 and the completion of the new mil-
itary leadership will be in place by january 7, and four of the
former leaders of the rebel forces.
Reference: According to President Ndayizeye’s televised speech,
former rebel fighters in Burundi’s 10-year civil war must be settled
in by January 5. The new army leadership is to be made up of
40 percent former rebels and should be in place by January 7.
Figure 6.1: Example showing how a large improvement in BLEU score does
not imply an improvement in fluency. The first quote is one of the output
sentences produced by an SMT system presented in Chapter 7. The second
quote is the oracle translation for that system, i.e. the best possible transla-
tion to be found in the lattice of translations generated by the system. The
third quote is one of the human reference translations. Note that the oracle
hypothesis was picked to maximise sentence-level BLEU, which shows that,
in principle, it is possible to obtain a translation with a high BLEU score
but that lacks fluency.
CHAPTER 6. STRING REGENERATION AS PHRASE-BASED MT 128
tion task. Finally, it can have alternative applications such as automatically
producing alternative references for machine translation.
We use a simple but extensible approach, inspired from stack-based de-
coding for phrase-based translation (see Section 2.5.4). Given a bag of words
as input, we use n-gram rules extracted from a large monolingual corpus and
concatenate these n-grams in order to form hypotheses. The hypotheses are
scored with a linear model in the same manner as translation hypotheses in
phrase-based translation. One of the critical features in this system is the
language model. In experiments to follow, we only study this main language
model feature, however, our decoder has other features implemented and ar-
bitrary features can be added easily to the system, demonstrating the system
flexibility.
Our strategy achieves state-of-the-art performance as measured by BLEU
(Section 2.8.1) in the string regeneration task. We also study various capa-
bilities of our decoder: in Section 2.5.4, we show how to split the input into
chunks for more rapid experimentation; in Section 6.6.4, we analyse the ef-
fect of stack-based pruning; in Section 6.6.5, we study the effect of including
n-gram rules with a larger n-gram order; in Section 6.6.6, we relax decoding
constraints by allowing n-gram rules to overlap; in Section 6.6.7, we intro-
duce future cost estimation using a unigram language model; finally, in Sec-
tion 6.6.8, we demonstrate how SMT rescoring techniques are also applicable
to the output of our decoder.
6.2 Background
Brown et al. (1990) anecdotally mention the string regeneration task, or bag
of words translation, in order to demonstrate the effectiveness of n-gram
language models. With a 3-gram language model, they recover 63% of a sen-
tences with less than 11 words with brute force search over all permutations.
This is of course impractical for longer sentences.
String regeneration has recently gained interest as a standalone task.Wan
et al. (2009) introduced the string regeneration task as a proxy to measure
the grammaticality of a natural language generation system output. String
regeneration can also be considered as a simplified version of the linguistic
realisation task in a natural language generation system (Reiter and Dale,
1997), which consists in generating a surface form that is syntactically, mor-
phologically, and orthographically correct.
CHAPTER 6. STRING REGENERATION AS PHRASE-BASED MT 129
Wan et al. (2009) use a modified version of the minimum spanning tree
algorithm in order to find an optimal dependency tree that covers an input
bag of words. Once a dependency tree is built, an n-gram language model
is used so as to order siblings in the tree. Performance is measured by the
BLEU metric. Wan et al. (2009) demonstrate BLEU improvements when
regenerating Section 23 of the Penn Treebank using their original algorithm
vs. baseline algorithms based on n-gram language models only. Base nouns
indicated by the Penn Treebank are kept together for experimentation.
He et al. (2009) also use the dependency tree formalism for string real-
isation. They describe a sentence realiser that takes as input an unordered
dependency tree, i.e. a dependency tree that only encode head-modifier re-
lations but not the surface string ordering. Because an exhaustive search
is intractable (due to the number of possible children permutations at each
node), He et al. (2009) adopt a divide-and-conquer approach that recursively
linearises subtrees in a bottom-up fashion. A log-linear model is used to
select the best possible linearisation for each subtree. Features include de-
pendency relations, n-gram language models for the surface string and for
the sequence of heads.
CCG grammars (Zhang and Clark, 2011) have also been used for the
string regeneration task as an alternative to the dependency grammar for-
malism. Given an input bag of words, the best CCG parse that covers the
input is searched for. Similarly to previous work, input base nouns are kept
as single units. In follow-up work (Zhang et al., 2012), CCG grammars are
combined with large language models and provide performance improvements
as measured by the BLEU metric.
Some of the works described so far are instances of tree linearisation:
given a set of unordered dependencies, reconstruct a dependency tree which
gives the word ordering (Belz et al., 2012). Zhang (2013) extends this task
to partial tree linearisation: the input is a set of possibly missing POS tags
and possibly missing unordered dependencies. This last work achieves state-
of-the art performance on string regeneration on the Penn Treebank, when
using dependency information on the input. Zhang (2013) also reports results
on string regeneration without using dependency information on the input
but keeping base noun phrases as single units.
In this chapter, we restrict the input to be a bag of words without any
additional syntactic or semantic information. Our work is directly inspired
by recent work that exploits phrase-based grammars together with a hier-
archical phrase-based decoder based on FSTs (see Section 2.9) (de Gispert
CHAPTER 6. STRING REGENERATION AS PHRASE-BASED MT 130
et al., 2014). We also use phrase-based grammars, but our decoding pro-
cedure is analogous to stack-based decoding for phrase-based translation.
To our knowledge, our method achieves state-of-the-art performance on this
task. We will now present our model, which is analogous to the phrase-based
translation model, but in a monolingual setting.
6.3 Phrase-Based Translation Model for String
Regeneration
In the previous section, we have reviewed various strategies for string regen-
eration. We observe that, in general, the use of syntax was beneficial over
the exclusive use of an n-gram language model. We do not take advantage of
syntactic information, instead, we complement an n-gram language with n-
gram rules extracted from monolingual data. We now describe our method,
which is inspired from phrase-based translation techniques (see Section 2.5).
We employ a feature based linear model (see Section 2.4) as our objective
function:
e^ = argmax
e
· s(e) (6.1)
where:
• e^ is the best possible hypothesis according to the model.
• e is an English sentence.
• is a feature weight vector.
• s(e) is a feature vector for the hypothesis e.
(e) contains a language model g(e) as well as local features '(e) that
are additive over phrases that segment e. Similarly, has a weight g for
the language model and weights ' for local features. As in phrase-based
translation, a hypothesis is decomposed into phrases eT). We can then rewrite
Equation 6.1 into Equation 6.2:
xe = argmax
e5eI1
gg(e) + ' ·'(e)
= argmax
e5eI1
gg(e) + ' ·
T∑
i5)
'(ei)
(6.2)
CHAPTER 6. STRING REGENERATION AS PHRASE-BASED MT 131
Note that this model makes a max approximation: various segmentations
of the hypotheses are considered but the score only depends on one single
segmentation. Experiments to follow only make use of the language model
feature but our implementation has other features integrated, such as word
count or rule count, and arbitrary features can be added easily. Our decoder
algorithm described subsequently in Section 6.5 generates hypotheses left to
right by concatenating phrases e1, ..., eI . For a phrase index i ∈ [1P I], we
define the partial model score sc(ei1) of partial hypothesis ei1 as in Equa-
tion 6.3:
sc(ei1) = gg(e
i
1) + ' ·
i∑
k5)
'(ek) (6.3)
6.4 Phrase-Based Translation Rules for String
Regeneration
In the previous section, we have introduced our model, inspired from phrase-
based translation, for string regeneration. This model assumes that a hy-
pothesis is segmented into phrases. In this section, we describe how to obtain
these phrases from monolingual data. In Section 6.4.1, we review the n-gram
rule extraction process, which is based on the Hadoop implementation of
MapReduce. In Section 6.4.2, we describe how to obtain relevant rules for a
given test set.
6.4.1 Rule Extraction
n-gram rules are extracted from large collections of monolingual text. The
text is tokenised and lowercased. We then employ the MapReduce framework
as implemented by Hadoop in order to extract n-grams together with their
occurrence counts from the corpus.
The input key to the map function is irrelevant. The input value to the
map function is a line of text. The map function simply extracts all n-grams
from the input line of text and outputs these n-grams with a count of one.
The input to the reduce function is an n-gram and a series of constant values
equal to one. The reduce function simply sums the counts and produces
the n-gram with its total count. This MapReduce job is analogous to the
canonical “word count” example except that here n-grams of higher order
CHAPTER 6. STRING REGENERATION AS PHRASE-BASED MT 132
than unigrams are considered. For our experiments, we extract n-grams
from order 1 to 5.
6.4.2 Rule Filtering
Once we have extracted all n-grams from a monolingual corpus, we would
like to retrieve the n-grams relevant to a test set that we wish to experiment
on. An n-gram is relevant if the vocabulary of the n-gram is included in any
of the vocabularies of each sentence in the test set.
Contrary to the process described in Section 3.5, we do not generate
queries and then seek those queries in an HFile. This is because the number
of possible queries is too large to fit in memory. For a sentence of length c
with words distinct from each other, the number of possible relevant 5-grams
is c × (c − 1)× (c − 2)× (c − 3)× (c − 4). For sentence of length 100,
this would correspond to approximately 9B queries. Even with an alternative
implementation where keys in the HFile are lexicographically sorted n-grams
and values are the permutations and counts found in the monolingual data,
the number of 5-gram queries would be
(
N
5
)
, which is approximately 75M
queries for a 100 word sentence.
We therefore take the alternative approach of scanning the set of n-grams
and retaining the relevant ones. For this, we again use a MapReduce job. The
input key to the map function is an n-gram and the input value is a count.
For each test sentence, the map function checks if the n-gram vocabulary is
included in the vocabulary of that sentence and if so, it generates all possible
coverages of this n-gram for that sentence, where a coverage indicates the
positions of the words covered by the n-gram in the input sentence. The map
output key is the n-gram together with a sentence id. The map output value
is a coverage and a count. The reduce function simply removes duplicates in
the list of coverages.
6.5 String Regeneration Search
In Section 6.3, we have described our string regeneration model and in Sec-
tion 6.4, we have described how to extract n-gram rules. We now present
our search algorithm and introduce our string regeneration decoder, Ngram-
Gen, which reorders an input bag of word using the model introduced and
stack-based decoding techniques reviewed in Section 2.5.4.
CHAPTER 6. STRING REGENERATION AS PHRASE-BASED MT 133
6.5.1 Input
The input to the decoder consists of:
• A bag of words, or multiset, w, represented as a vector. For example,
we have w = [do, you, do] for the input sentence “do you do” to be
regenerated.
• A set of n-grams, where the vocabulary of each n-gram is a subset of
the vocabulary of the input bag of words. Each n-gram is also associ-
ated with one or more coverages as computed in Section 6.4.2. For a
given n-gram, its coverage is a bit vector that indicates what positions
the n-gram covers in the input bag of words. In case of repeated words,
an n-gram can have several possible coverages. For example, with the
input bag of words [do, you, do], the n-gram [do, you] has two possible
coverages: 110 and 011. The input n-grams and coverages are repre-
sented as an associative array datastructure. We denote this input by
N . Note that it would be possible for the decoder to compute possible
coverages on the fly, however we choose to precompute those at the
filtering step described in Section 6.4.2.
• A n-gram language model G. The n-gram language model is used as a
the main feature in the model to encourage fluent output.
We will now describe how hypotheses, i.e. permutations of the input, are
generated by the decoder from a bag of words.
6.5.2 Algorithm
Our algorithm is inspired by stack based decoding for phrase-based transla-
tion, reviewed in Section 2.5.4. We first give an overview before outlining
the algorithm details.
6.5.2.1 Overview
In order to represent partial hypotheses, we use a vector of columns (more
commonly called stacks). Each column contains a set of states. Each state
represents a set of partial hypotheses. All states in a column represent hy-
potheses that cover the same number of words in the input.
CHAPTER 6. STRING REGENERATION AS PHRASE-BASED MT 134
We start to build hypotheses from an initial state that represents the
empty hypothesis. This initial state is located in the first column, which
represents the hypotheses that have not covered any word in the input. The
empty hypothesis is then extended with the input n-gram rules and their
coverage. New states are created to take into account these extensions. Then,
partial hypotheses represented by these states are repeatedly extended until
hypotheses that cover the entire input are produced.
6.5.2.2 Algorithm Details
As mentioned in Section 6.5.2.1, we represent the set of partial hypotheses
with a vector of columns that we denote b . For an input w of size |w|, b
has a size of |w| + 1. Elements of b are denoted b(, ..., b|w|. The i-th
columnbi represents hypotheses that have covered i words in the input. The
columnsbi are the analog of stacks in stack based decoding for phrase-based
translation.
Each column contains a set of states that represents a set of partial hy-
pothesis. A state contains the information necessary to extend a partial
hypothesis and represents a set of partial hypotheses that can be recombined
(see Section 2.5.4). This means that if two partial hypotheses h) and h2 are
represented by a state s and h) has a higher model score than h2, then if h)
and h2 are extended with the same rules, then the extension of h) will have
a higher score than the extension h2. In our case, two hypotheses can be
recombined if they share the same last n − 1 words where n is the order of
the n-gram language model.
The states are sorted by their score. For a given state, its score is the best
model score of any of the partial hypotheses that the state represents. The
decoding algorithm is presented in Figure 6.2. We first initialise the matrix
b to be a vector of empty columns of size |w|+1 (line 2). The first column
is filled with an initial state representing an empty hypothesis. Then, we
loop over the column indices (line 3), the states in each column (line 4) and
the n-gram rules (line 5). In each case, we attempt to extend a state with an
n-gram rule. We first test if the n-gram rule r is applicable to state s (line
6). If this is the case, we extend the state s with the rule r (line 7). We
will now describe what kind of information is contained in a state. This will
allow use to also describe the last two operations, CanApply and Extend.
CHAPTER 6. STRING REGENERATION AS PHRASE-BASED MT 135
1: function Decode(wPN PG)
2: Initialize(b)
3: for 0 ≤ i < |w| do
4: for state s ∈bi do
5: for ngram r ∈ N do
6: if CanApply(sP r) then
7: Extend(sP r)
8: end if
9: end for
10: end for
11: end for
12: end function
Figure 6.2: NgramGen decoding algorithm. The input is a bag of words
w, a set N of n-gram rules with their coverage and a language model G.
The column vector b is first initialised with a state representing the empty
hypothesis. Then, the initial state is iteratively extended.
State Definition The states must contain enough information in order to
be able to extend a hypothesis, and states also represent all hypotheses that
can be recombined. Thus, a state contains the following information:
• Coverage: the coverage is a bit vector that indicates which words in
the input have been covered. This implies a sorting of the input bag of
words. The sorting is arbitrary and we simply choose to represent the
bag of words by the sequence of words to be recovered.
• History: in order to compute the n-gram language model score correctly
when extending a partial hypothesis, we store the last n − 1 words of
the partial hypothesis we want to extend. The definition of history is
therefore the same as the definition of history for an n-gram language
model. In our implementation, the history is simply encoded as an
lm::ngram::State as defined in the KenLM toolkit (Heafield, 2011).
CanApply Operation Given a state s with coverage x(s) and an n-gram
rule r with coverage x(r), rule r can be applied to state s if x(s) and x(r) are
disjoint. We will see in Section 6.6.6 that this constrain can be relaxed if we
allow for overlapping n-grams.
CHAPTER 6. STRING REGENERATION AS PHRASE-BASED MT 136
Extend Operation Given a state s with coverage x(s) and and n-gram
rule r with coverage x(r) that can be applied to s, we extend s with r into a
new state s′ as follows:
• The coverage x(s′) of s′ is the bitwise OR of x(s) and x(r):
x(s′) = x(s) | x(r) (6.4)
• The new partial score sc(s′) for s′ is defined in terms of the partial score
sc(s) for s, the history h(s) and the rule r:
sc(s′) = sc(s) + gg(r|h(s)) + ' ·'(r) (6.5)
• If s′ already exists in the column corresponding to its coverage, then
the score of the existing state is updated if sc(s′) is better than the
existing score. Otherwise, s′ is simply added as a new state to the
column corresponding to its coverage.
Pruning The search space for an input bag of size c is the number of
permutations of the input, which is c ; if all input words are distinct. We
therefore need to apply pruning during search to make the decoding process
tractable. We support histogram pruning and threshold pruning (see Sec-
tion 2.5.4). After each extension, if we add a new state, we enforce that
the column where a state was added satisfies the pruning constraints. This
means that for histogram pruning, with m the maximum number of states
per column, if we add a new state to a column that has m states, then the
state with the lowest score is removed from that column.
For threshold pruning with a threshold t, the situation is slightly more
complex. When we want to add a state s with score sc(s) to a column where
the best score is w, we consider three cases:
• sc(s) ≤ w and sc(s) ≥ w− t: the state s is added to the column.
• sc(s) ≤ w and sc(s) < w− t: the state s is not added to the column.
• sc(s) S w: the state s is added to the column and all states in that
column with a score less than sc(s)− t are removed from the column.
We have presented the decoding algorithm and various data structures
to support the algorithm. We will now describe how the set of hypotheses is
encoded with FSTs.
CHAPTER 6. STRING REGENERATION AS PHRASE-BASED MT 137
6.5.3 FST Building
We mentioned in Section 6.5.2.2 that a state represents partial hypotheses
that can be recombined as well as the necessary information to be able to
extend a partial hypothesis. However, since we do not keep backpointers
between states, we cannot recover hypotheses directly from b , the vector
of columns. Instead, we build an FST that contains all hypotheses. For
each extension, an arc is added to the output FST. We make use of the
OpenFst toolkit (Allauzen et al., 2007). This allows us to rescore the output
using tools already integrated with OpenFst (Blackwood, 2010). Another
advantage is that hypotheses that are recombined are not deleted and all
completed hypotheses are stored in the output FST. This allows hypotheses
that would be discarded in recombination to remain in the output and get
an opportunity to become the best hypothesis after rescoring.
6.5.4 Example
We now demonstrate how our algorithm works with an example. Our input
consists of:
• a bag of words: [do, you, do]. Note that the word do is repeated.
• a bigram language model. Thus the history for each state will consist
of the last word of the partial hypothesis.
• monolingual n-gram rules with their coverage of the input bag of words.
We use the rules listed in Table 6.1.
After running our algorithm without pruning, we obtain the data structure
represented in Figure 6.3. We can see that the same state {111, do} is reused
for the two hypotheses do you do, you do do, because these hypotheses have
the same coverage and the same history. We also note that the hypothe-
sis do you do is repeated. This is not an issue as we use an operation of
determinisation on the resulting FST.
We will now present various experiments that demonstrate the effective-
ness of our decoder for the string regeneration task. We will also analyse
various capabilities of the decoder in order to understand which configura-
tions lead to better results.
CHAPTER 6. STRING REGENERATION AS PHRASE-BASED MT 138
N-gram Coverage
do 100
do 001
you do 011
you do 110
do you 110
do you 011
Table 6.1: Example of input monolingual n-gram rules with their coverage,
corresponding to the example input [do, you, do]. The FST representing the
set of hypotheses obtained with these rules is represented in Figure 6.3.
.000,