Refinements in Hierarchical Phrase-Based Translation Systems Juan Miguel Pino Cambridge University Engineering Department Dissertation submitted to the University of Cambridge for the degree of Doctor of Philosophy Declaration I hereby declare that the contents of this dissertation are original and have not been submitted in whole or in part for consideration for any other degree or qualification in this, or any other University. This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration, except where specifically indicated in the text. This disser- tation contains less than 65,000 words including appendices, bibliography, footnotes, tables and equations and has less than 150 figures. i Acknowledgements First, I am truly indebted to my supervisor Prof. Bill Byrne. Bill provided relentless support throughout my PhD studies. He taught me how to be a better writer and a better researcher. I am also very grateful that he provided me with invaluable opportunities such as participating in translation competitions or carrying out an internship at Google for three months. I also would like to thank Dr. Stephen Clark and Dr. Miles Osborne for being part of my thesis committee and providing very valuable feedback on the first submission of this thesis. Without the lab Computer Officers, Patrick Gosling and Anna Langley, experiments reported in this thesis would not have been possible, I am there- fore grateful for all their hard work on the computer systems. I also wish to thank all my colleagues at the Machine Intelligence Lab. Dr. Adria` de Gispert provided me with critical guidance throughout my research: two chapters of this thesis originated from a very fruitful collaboration with him. I am grateful to Aurelien Waite for all his help on infrastructure and for helping me becoming a better engineer. Thanks to Dr. Gonzalo Iglesias for illuminating discussions on FSTs for machine translation as well as C++ software engineering. Dr. Graeme Blackwood and Dr. Jamie Brunning pro- vided much support at the start of my studies as well as critical translation tools such as word alignment and lattice rescoring tools. I am also in debt to my office colleagues Matt Shannon, Matt Seigel and Chao Zhang for making PhD studies a pleasurable experience. Finally, thank you Anna for your infinite patience, guidance and support and for telling me to apply for at least two jobs at the end of my studies. Thank you Sofia and Fania: I will try to teach you many languages, just in case machine translation doesn’t improve quickly enough. ii Abstract The relatively recently proposed hierarchical phrase-based translation model for statistical machine translation (SMT) has achieved state-of-the-art per- formance in numerous recent translation evaluations. Hierarchical phrase- based systems comprise a pipeline of modules with complex interactions. In this thesis, we propose refinements to the hierarchical phrase-based model as well as improvements and analyses in various modules for hierarchical phrase-based systems. We took the opportunity of increasing amounts of available training data for machine translation as well as existing frameworks for distributed com- puting in order to build better infrastructure for extraction, estimation and retrieval of hierarchical phrase-based grammars. We design and implement grammar extraction as a series of Hadoop MapReduce jobs. 
We store the re- sulting grammar using the HFile format, which offers competitive trade-offs in terms of efficiency and simplicity. We demonstrate improvements over two alternative solutions used in machine translation. The modular nature of the SMT pipeline, while allowing individual im- provements, has the disadvantage that errors committed by one module are propagated to the next. This thesis alleviates this issue between the word alignment module and the grammar extraction and estimation module by considering richer statistics from word alignment models in extraction. We use alignment link and alignment phrase pair posterior probabilities for gram- mar extraction and estimation and demonstrate translation improvements in Chinese to English translation. This thesis also proposes refinements in grammar and language modelling both in the context of domain adaptation and in the context of the interaction between first-pass decoding and lattice rescoring. We analyse alternative strategies for grammar and language model cross-domain adaptation. We also study interactions between first-pass and second-pass language model in iii iv terms of size and n-gram order. Finally, we analyse two smoothing methods for large 5-gram language model rescoring. The last two chapters are devoted to the application of phrase-based grammars to the string regeneration task, which we consider as a means to study the fluency of machine translation output. We design and implement a monolingual phrase-based decoder for string regeneration and achieve state- of-the-art performance on this task. By applying our decoder to the output of a hierarchical phrase-based translation system, we are able to recover the same level of translation quality as the translation system. Contents 1 Introduction 1 1.1 Machine translation . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1.1 Challenges for Translation . . . . . . . . . . . . . . . . 1 1.1.2 Machine Translation Current Quality . . . . . . . . . . 2 1.2 Statistical Machine Translation . . . . . . . . . . . . . . . . . 5 1.2.1 The SMT Pipeline . . . . . . . . . . . . . . . . . . . . 5 1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.4 Organisation of the thesis . . . . . . . . . . . . . . . . . . . . 8 2 Statistical Machine Translation 9 2.1 Historical Background . . . . . . . . . . . . . . . . . . . . . . 10 2.2 Source-Channel Model . . . . . . . . . . . . . . . . . . . . . . 10 2.3 Word Alignment . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3.1 HMM and Word-to-Phrase Alignment Models . . . . . 14 2.3.2 Symmetrisation Heuristics . . . . . . . . . . . . . . . . 16 2.4 Log-Linear Model of Machine Translation . . . . . . . . . . . . 19 2.5 Phrase-Based Translation . . . . . . . . . . . . . . . . . . . . 21 2.5.1 Alignment Template Model . . . . . . . . . . . . . . . 21 2.5.2 Phrase-Based Model . . . . . . . . . . . . . . . . . . . 22 2.5.3 Phrase Pair Extraction . . . . . . . . . . . . . . . . . . 22 2.5.4 Phrase-Based Decoding . . . . . . . . . . . . . . . . . . 24 2.6 Hierarchical Phrase-Based Translation . . . . . . . . . . . . . 26 2.6.1 Introduction and Motivation . . . . . . . . . . . . . . . 27 2.6.2 Hierarchical Grammar . . . . . . . . . . . . . . . . . . 27 2.6.3 Log-linear Model for Hierarchical Phrase-Based Trans- lation . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.6.4 Rule Extraction . . . . . . . . . . . . . . . . . . . . . . 35 2.6.5 Features . . . . . . . . . . . . . . . . . . . . . . . . . . 
35 v CONTENTS vi 2.7 Language Modelling . . . . . . . . . . . . . . . . . . . . . . . 36 2.7.1 n-gram language models . . . . . . . . . . . . . . . . . 37 2.7.2 Back-off and Interpolated Models . . . . . . . . . . . . 38 2.7.3 Modified Kneser-Ney Smoothing . . . . . . . . . . . . . 39 2.7.4 Stupid Backoff Smoothing . . . . . . . . . . . . . . . . 40 2.8 Optimisation . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 2.8.1 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . 40 2.8.2 Minimum Error Rate Training . . . . . . . . . . . . . . 42 2.9 Decoding with Finite State Transducers . . . . . . . . . . . . . 42 2.10 Lattice Rescoring . . . . . . . . . . . . . . . . . . . . . . . . . 43 2.10.1 5-gram Language Model Lattice Rescoring . . . . . . . 43 2.10.2 Lattice Minimum Bayes’ Risk Rescoring . . . . . . . . 44 2.10.3 Lattice Minimum Bayes-Risk for System Combination . 45 3 Data Structures for Hierarchical Phrase-Based Translation Grammars 47 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.2.1 Applications of MapReduce to SMT . . . . . . . . . . . 50 3.2.2 SMT Models Storage and Retrieval Solutions . . . . . . 52 3.3 HFile Description . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.3.1 HFile Internal Structure . . . . . . . . . . . . . . . . . 55 3.3.2 Record Retrieval . . . . . . . . . . . . . . . . . . . . . 57 3.3.3 Bloom Filter Optimisation for Query Retrieval . . . . . 57 3.3.4 Local Disk Optimisation . . . . . . . . . . . . . . . . . 58 3.3.5 Query sorting optimisation . . . . . . . . . . . . . . . . 58 3.4 Hierarchical Rule Extraction with MapReduce . . . . . . . . . 58 3.4.1 Rule Extraction . . . . . . . . . . . . . . . . . . . . . . 59 3.4.2 Corpus-Level Feature Computation . . . . . . . . . . . 61 3.4.3 Feature Merging . . . . . . . . . . . . . . . . . . . . . 62 3.5 Hierarchical Rule Filtering for Translation . . . . . . . . . . . 62 3.5.1 Task Description . . . . . . . . . . . . . . . . . . . . . 63 3.5.2 HFile for Hierarchical Phrase-Based Grammars . . . . 63 3.5.3 Suffix Arrays for Hierarchical Phrase-Based Grammars 64 3.5.4 Text File Representation of Hierarchical Phrase-Based Grammars . . . . . . . . . . . . . . . . . . . . . . . . . 64 3.5.5 Experimental Design . . . . . . . . . . . . . . . . . . . 65 3.5.6 Results and Discussion . . . . . . . . . . . . . . . . . . 67 CONTENTS vii 3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4 Hierarchical Phrase-Based Grammar Extraction from Align- ment Posterior Probabilities 72 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.3 Rule Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4.3.1 General Framework for Rule Extraction . . . . . . . . . 78 4.3.2 Extraction from Viterbi Alignment Links . . . . . . . . 79 4.3.3 Extraction from Posteriors Probabilities over Align- ment Links . . . . . . . . . . . . . . . . . . . . . . . . 81 4.3.4 Extraction from Posteriors over Phrase Pairs . . . . . . 87 4.3.5 Hierarchical Rule Extraction . . . . . . . . . . . . . . . 91 4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 4.4.1 Experimental Setup: Grammar Pattern Definition . . . 96 4.4.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . 97 4.4.3 Grammar Coverage . . . . . . . . . . . . . . . . . . . . 99 4.4.4 Translation Results . . . . . . . . . 
. . . . . . . . . . . 101 4.4.5 Comparison between WP and PP . . . . . . . . . . . 104 4.4.6 Symmetrising Alignments of Parallel Text . . . . . . . 104 4.4.7 Additional Language Pair: Russian-English . . . . . . . 106 4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 5 Domain Adaptation for Hierarchical Phrase-Based Transla- tion Systems 109 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 5.2 Description of the System to be Developed . . . . . . . . . . . 110 5.3 Domain Adaptation for Machine Translation . . . . . . . . . . 114 5.3.1 Domain Adaptation . . . . . . . . . . . . . . . . . . . . 114 5.3.2 Previous Work on Domain Adaptation for SMT . . . . 115 5.4 Language Model Adaptation for Machine Translation . . . . . 116 5.5 Domain Adaptation with Provenance Features . . . . . . . . . 118 5.6 Interaction between First Pass Language Model and Rescoring Language Model . . . . . . . . . . . . . . . . . . . . . . . . . . 120 5.7 Language Model Smoothing in Language Model Lattice Rescor- ing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 5.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 CONTENTS viii 6 String Regeneration as Phrase-Based Translation 125 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 6.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 6.3 Phrase-Based Translation Model for String Regeneration . . . 130 6.4 Phrase-Based Translation Rules for String Regeneration . . . . 131 6.4.1 Rule Extraction . . . . . . . . . . . . . . . . . . . . . . 131 6.4.2 Rule Filtering . . . . . . . . . . . . . . . . . . . . . . . 132 6.5 String Regeneration Search . . . . . . . . . . . . . . . . . . . . 132 6.5.1 Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 6.5.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 133 6.5.3 FST Building . . . . . . . . . . . . . . . . . . . . . . . 137 6.5.4 Example . . . . . . . . . . . . . . . . . . . . . . . . . . 137 6.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 6.6.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . 139 6.6.2 Baseline . . . . . . . . . . . . . . . . . . . . . . . . . . 140 6.6.3 Sentence Splitting . . . . . . . . . . . . . . . . . . . . . 140 6.6.4 Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . 141 6.6.5 n-gram Order for n-gram Rules . . . . . . . . . . . . . 142 6.6.6 Overlapping n-gram Rules . . . . . . . . . . . . . . . . 144 6.6.7 Future Cost . . . . . . . . . . . . . . . . . . . . . . . . 145 6.6.8 Rescoring . . . . . . . . . . . . . . . . . . . . . . . . . 147 6.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 7 String Regeneration Applied to Machine Translation 150 7.1 SMT Baseline System . . . . . . . . . . . . . . . . . . . . . . 151 7.2 Regeneration Decoder Baseline on MT Output . . . . . . . . . 152 7.3 Analysis of the Loss of MT Hypotheses . . . . . . . . . . . . . 154 7.4 Biased Language Model for Regeneration . . . . . . . . . . . . 157 7.5 Translation and Regeneration Hypothesis Combination . . . . 157 7.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 8 Summary and Future Work 161 8.1 Review of Work . . . . . . . . . . . . . . . . . . . . . . . . . . 161 8.1.1 Hierarchical Phrase-Based Grammars: Infrastructure . 162 8.1.2 Hierarchical Phrase-Based Grammars: Modelling . . . 162 8.1.3 Hierarchical Phrase-Based System Development . . . . 
163 8.1.4 Fluency in Hierarchical Phrase-Based Systems . . . . . 163 8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164 Chapter 1 Introduction 1.1 Machine translation Machine translation is the process of translation from input speech or text in a natural language into another natural language by some kind of automatic system. Real world examples include online services such as Google Trans- late1, Bing Translate2, SDL3, PROMT4, etc. Interacting with any online automatic translation service with the expectation of a high quality transla- tion can be a frustrating experience. Indeed, variations in word order across languages and syntax and the use of real world knowledge for puns or idioms make translation a very challenging task. 1.1.1 Challenges for Translation In order to illustrate the difficulties that arise in translation, we present sev- eral examples that make translation challenging for humans, and a fortiori for computers. In some languages, some concepts are common enough to be designated by one word, but in another language, an entire sentence may be needed to describe that concept. For example, the word sobremesa in Span- ish can be translated into English as the time spent after a meal, talking to the people with whom the meal was shared.5 In this situation, a human trans- 1 https://translate.google.com 2 https://www.bing.com/translator 3 http://www.freetranslation.com/ 4 http://www.online-translator.com 5 http://blog.maptia.com/posts/untranslatable-words-from-other-cultures 1 CHAPTER 1. INTRODUCTION 2 lator is left with the choice of keeping the translation short but inexact or long but cumbersome. For a computer, or rather a statistical model, such a situation will represent an outlier in terms of the ratio of the number of words that are translation of each other. Self-referential sentences can also present challenges for translation. For example, the sentence This sentence has five words has at least two acceptable translations into Russian from the syntac- tic and semantic point of view: В этом предложении пять слов and Это предложение состоит из пяти слов. However, the second translation has six words and therefore cannot be accepted. Another challenge for transla- tion is word ordering. The first sentence of the novel The Metamorphosis by Franz Kafka reads Als Gregor Samsa eines Morgens aus unruhigen Tra¨umen erwachte, fand er sich in seinem Bett zu einem ungeheuren Ungeziefer ver- wandelt. One possible English translation is As Gregor Samsa awoke one morning from uneasy dreams, he found himself transformed in his bed into a gigantic insect-like creature. In German, the words for transformed (ver- wandelt) and insect (Ungeziefer) come right at the end of the sentence and create an effect of surprise for the reader. In English, however, the verb transformed comes in the middle of the sentence and the effect of surprise is not as great. This example demonstrates how variations in word order- ing between languages can make translation challenging. Computers have the additional challenge that the choice of word ordering should produce a grammatical sentence. This is a challenge for humans too but computers are particularly bad at producing a grammatical output. 1.1.2 Machine Translation Current Quality Even though machine translation is a challenging task, it is still useful in a number of situations. For example, machine translation can be used to obtain the gist, i.e. the general meaning, of a document in a foreign language. 
Machine translation can also be used for post-editing: a document in a foreign language is first automatically translated and then the translation is corrected by human translators. This is precisely the approach taken by translation companies such as Unbabel.6 Machine translation has therefore reached the point where it is actually useful for practical applications, but it is reasonable to ask how far we are from perfect translation. We give indications in terms of the BLEU metric (Papineni et al., 2002), which we will define in Section 2.8.1. For now, we simply need to know that BLEU measures how well a translation hypothesis resembles a set of human translation references. The Cambridge University Engineering Department (CUED) submitted a translation system to the NIST Open Machine Translation 2012 Evaluation.7 We run this system on the newswire portion of the NIST Open Machine Translation 2008 Evaluation test set (MT08). For this MT08 test set, 4 references are available. We measure the case-insensitive BLEU score of the CUED system against the 4 references and against subsets of 3 references. Similarly, we measure the case-insensitive BLEU score of each reference against the other references. BLEU scores for humans and for the CUED system are summarised in Table 1.1.

6 https://www.unbabel.com
7 http://www.nist.gov/itl/iad/mig/openmt12.cfm

System        1-2-3-4   2-3-4   1-3-4   1-2-4   1-2-3
CUED           38.95    34.80   35.60   35.87   35.28
Reference 1      –      46.19     –       –       –
Reference 2      –        –     47.27     –       –
Reference 3      –        –       –     45.43     –
Reference 4      –        –       –       –     46.90

Table 1.1: BLEU score obtained by the Cambridge University Engineering Department system and by human translators on the MT08 test set. 4 references are provided. The BLEU score is measured against various subsets of references: all references (1-2-3-4), all but the first (2-3-4), etc. The BLEU score for a human reference when the set of references contains that reference is not computed, as it would trivially be 100.

On average, CUED obtains a BLEU score of 35.39 against 3 references. On average, human references obtain a BLEU score of 46.45 against the other references. We can draw two conclusions from these observations. First, this is evidence that many possible translations are acceptable. Second, because the automatic system performance for this particular setting is approximately 10 BLEU points below human performance, this gives an indication of the quality of state-of-the-art automatic translation. After manual inspection of the system output, we show in Figure 1.1 a few examples of output sentences that are relatively long and still have reasonable quality:

Minister of agriculture and social affairs, told AFP: "The prime minister submitted his resignation to Abbas, chairman of the president, and at the same time asked him to form a new cabinet, responsible for handling daily affairs of the new government."

Reference: Minister for Agriculture and Social Affairs Habbash told AFP: "Prime Minister Fayyad offered his resignation to President Abbas. The president accepted it and at the same time also asked him to form a new cabinet government to be responsible for routine administration."

Chavez has ordered government officials to keep an eye on foreigners visiting venezuela's speech, when a person is found to have publicly criticized him or the Venezuelan government, are to be deported.
Reference: Chavez has already ordered government officials to closely monitor the speech of foreigners when they visit Venezuela. If anyone is found publicly criticizing him or the Venezuelan government, they should be all deported.

US-China strategic economic dialogue "focused on the economic, environmental and other issues, the most important thing is that the issue of the RMB exchange rate, the US Congress members are of the view that the value of the Renminbi underestimated.

Reference: The US-China Strategic Economic Dialogue mainly discusses economic, environmental protection and other issues. Yet, the most important is the issue of the renminbi exchange rate. US congressmen believe that the renminbi is excessively undervalued.

Figure 1.1: Example output of a translation system, together with one reference. Even though the sentences are relatively long, they are overall fluent and intelligible. True casing is restored manually.

1.2 Statistical Machine Translation

The current dominant approach to machine translation is a statistical approach. Given an input in a natural language, a statistical model of translation will attempt to predict the best translation according to a statistical criterion, such as the most likely translation under a probabilistic model of translation. The data used to estimate the parameters of such a model consists of a parallel corpus, which is a set of sentence pairs in two different natural languages that are translations of each other, and a monolingual corpus, which is a set of sentences in the language we would like to translate into. Parallel data can be obtained from the proceedings of multilingual institutions such as the United Nations (Franz et al., 2013), the Canadian Hansard (Germann, 2001) or the European Parliament (Koehn, 2005). This type of data is relatively clean since it is created by professional translators. However, it may not match the genre of the input sentences we wish to translate, such as newswire. The importance of having a good match between the data used for training and testing is discussed in Section 5.3. In order to address this concern and also in order to obtain more data, parallel data can be extracted automatically from comparable corpora (Smith et al., 2013). However, the widespread availability of machine translation and the development of techniques to extract parallel corpora automatically increase the risk of automatically translated output of poor quality being present in the training data. These concerns have been acknowledged and addressed by a watermarking algorithm (Venugopal et al., 2011).

1.2.1 The SMT Pipeline

Typically, state-of-the-art SMT systems are organised into a pipeline of deterministic or statistical modules, as shown in Figure 1.2. Parallel text is first preprocessed. This consists of data cleaning and tokenisation. For morphologically poor languages such as English, simple rules for tokenisation, e.g. separating words by white space and punctuation, are usually good enough. For morphologically rich languages, more advanced techniques, for example morphological analysis (Habash and Rambow, 2005), are needed for effective tokenisation, in order to combat data sparsity and work with a vocabulary of reasonable size. In the case of languages without word boundaries, such as Chinese, word segmentation techniques (Zhang and Clark, 2007) need to be applied in order to be able to break sentences into words.
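As a rough illustration of the simple rule-based tokenisation mentioned above for morphologically poor languages, the following sketch separates common punctuation marks from adjacent words and splits on white space. It is an assumption-laden toy example, not the preprocessing actually used in this thesis; real pipelines also handle abbreviations, numbers, casing and many other cases.

```python
import re

def simple_tokenise(sentence: str) -> list[str]:
    """Toy whitespace- and punctuation-based tokenisation for English-like text."""
    # Separate common punctuation marks from adjacent words, then split on whitespace.
    sentence = re.sub(r'([.,!?;:"()])', r' \1 ', sentence)
    return sentence.split()

print(simple_tokenise("The prime minister submitted his resignation."))
# ['The', 'prime', 'minister', 'submitted', 'his', 'resignation', '.']
```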
The preprocessed parallel text is then word-aligned (see Section 2.3). A word alignment is simply a mapping between source words and target words in a parallel sentence pair. Word alignment models were originally used to describe the translation process and to perform translation (Germann et al., 2001). Word alignment toolkits are available online8,9 as well as a word-to-word translation decoder.10 Nowadays, word alignment models are used as an intermediate step in the translation pipeline. A translation grammar is then extracted and estimated from the word alignments (see Section 2.6.4) and translation is performed under the translation grammar.

8 http://mi.eng.cam.ac.uk/~wjb31/distrib/mttkv1
9 https://code.google.com/p/giza-pp
10 http://www.isi.edu/licensed-sw/rewrite-decoder

The monolingual data is also preprocessed in the same fashion as the target side of the parallel text and one or more language models are estimated from this data (see Section 2.7). The language models and the translation grammar are used by the translation decoder for translation. In order to optimise the SMT model parameters, alternating decoding and tuning steps are carried out. Typically, a decoder can output a single best translation, an n-best list of translations or even a lattice of translations, which is a compact representation of an n-best list with a very large number n. In the latter case, optional rescoring steps can be carried out in order to include models that are too complex or not robust enough to be included in first-pass decoding (see Section 2.10). In particular, in Chapter 7 we study a model of string regeneration that allows any reordering of the output of the first-pass decoder.

Parallel Text → Preprocessing → Word Alignment → Grammar Extraction
Monolingual Text → Preprocessing → Language Modelling
Decoding → Tuning → Rescoring

Figure 1.2: Machine Translation System Development Pipeline

1.3 Contributions

The contributions of this thesis are outlined as follows:

• We describe a novel approach to grammar extraction and estimation. Our approach ties the models of word alignment and grammar extraction and estimation more closely. It results in a more robust extraction and estimation of translation grammars and leads to improvements in translation quality, as demonstrated in Chinese to English translation experiments.

• We describe a system that allows the efficient extraction and filtering of very large grammars. This method has been in continued use at CUED and was employed for submissions to the NIST 2012 translation evaluation11 and to the WMT 2013 translation evaluation (Bojar et al., 2013). This was a careful engineering effort that required detailed understanding of grammar extraction procedures and how to implement them in a Hadoop framework. There are many implementation strategies that can be taken, but obtaining the processing performance needed to support the experiments reported in this thesis requires a careful design process.

• We designed and implemented a system for string regeneration, inspired by phrase-based SMT techniques. We obtain state-of-the-art results on the string regeneration task and demonstrate potential applications to machine translation.

1.4 Organisation of the thesis

We now describe the thesis organisation. In Chapter 2, we review statistical machine translation background: all components of the machine translation pipeline presented in Figure 1.2 are reviewed in detail.
In Chapter 3, we present our system for efficient extraction of translation grammars from parallel text and retrieval of translation rules from very large translation grammars. In Chapter 4, we present our novel grammar extraction proce- dure that makes use of posterior probabilities from word alignment models. We then provide hierarchical phrase-based system building recommendations about what decisions to make in terms of language modelling and grammar design to obtain the best possible systems for translation in Chapter 5. In Chapter 6, we introduce our phrase-based decoder for string regeneration and study fluency through the word reordering task. Finally, in Chapter 7, we apply our regeneration decoder to the output of a machine translation system. 11 http://www.nist.gov/itl/iad/mig/openmt12.cfm Chapter 2 Statistical Machine Translation Statistical Machine Translation (SMT) (Brown et al., 1993; Lopez, 2008; Koehn, 2010) has become the dominant approach to machine translation, as increasing amounts of data and computing power have become available. In the SMT paradigm, given a sentence in a source language, conceptually all possible sentences in a target language are assigned a probability, a score, or a cost and the best translation is picked according to a certain decision criterion that relates to these probabilities, scores or costs. The research challenge is to develop models that assign scores that reflect human judgements of translation quality. In this chapter, we first review the historical background of SMT in Sec- tion 2.1. We then present the original source-channel model for SMT in Section 2.2. Word alignment models, which we review in Section 2.3, were introduced within the framework of the source-channel model. The original source-channel model was extended into the log-linear model, presented in Section 2.4. The field of SMT shifted from word-based models to phrase- based models, introduced in Section 2.5, while retaining word-based models in their first formulation as a preliminary step. Phrase-based translation was extended into hierarchical phrase-based translation, which we review in Sec- tion 2.6. We then examine various features employed in state-of-the-art de- coders in Section 2.6.5. The target language model, which is one of the most important features in translation, is explored in more detail in Section 2.7. In Section 2.8, we review optimisation techniques that are employed in order to tune the decoder parameters. We finally present how finite state trans- ducers can be used in decoding in Section 2.9. Various rescoring procedures are reviewed in Section 2.10. 9 CHAPTER 2. STATISTICAL MACHINE TRANSLATION 10 2.1 Historical Background In this section, we present a brief historical background of statistical ma- chine translation. A more comprehensive account of the history of machine translation in general can be found elsewhere (Hutchins, 1997; Hutchins and Lovtskii, 2000). Warren Weaver can be considered the father of modern SMT. At a time when the first computers were being developed, he examined their poten- tial application to the problem of machine translation. In his memoran- dum (Weaver, 1955), he addressed the problem of multiple meanings of a source word by considering the context of that source word, which heralds phrase based translation techniques and the use of context in machine trans- lation. 
He was also the first to frame machine translation as a source-channel model, by considering that a sentence in a foreign language is some form of code that needs to be broken, in analogy to the field of cryptography. Finally, he also emphasised the statistical aspect of machine translation. However, he also predicted that the most successful approaches to machine translation would take advantage of language invariants by using an intermediate language representation in the translation process. Even though state-of-the-art statistical translation systems do not use this kind of approach, we do notice a resurgence in intermediate language representation techniques (Mikolov et al., 2013). The first successful implementations of Warren Weaver's ideas were carried out by IBM in the 1990s. The source-channel model together with a series of word alignment models were introduced by Brown et al. (1993), while Berger et al. (1996) addressed the problem of multiple meanings using context in a maximum entropy framework. Word-based models were extended into different variants of phrase-based models in 1999 and at the beginning of the century (Och et al., 1999; Koehn et al., 2003; Och and Ney, 2004) and later on into synchronous context-free grammar models (Chiang, 2005, 2007).

2.2 Source-Channel Model

Statistical machine translation was originally framed as a source-channel model (Shannon, 1948; Brown et al., 1990, 1993). Given a foreign sentence $f$, we want to find the original English sentence $e$ that went through a noisy channel and produced $f$. Note that in the source-channel model notation, what we would like to recover (the English sentence) is called the source, while what is observed (the foreign sentence) is called the target. A source-channel model assigns probabilities from source (English) to target (foreign), but in translation the model is used to infer the source that was most likely to have generated the target. We do not use this convention here: we call the source what we are translating from and the target what we are translating into. This convention is frequently adopted (Och et al., 1999; Och and Ney, 2002, 2004) in SMT, and more so since SMT has been framed as a log-linear model (see Section 2.4). We use the decision rule in Equation 2.1, which minimises the risk under a zero-one loss function (see Section 2.10.2):
\begin{align}
\hat{e} &= \operatorname*{argmax}_{e} \, p(e \mid f) \nonumber \\
        &= \operatorname*{argmax}_{e} \, \frac{p(f \mid e)\, p(e)}{p(f)} \quad \text{(Bayes' rule)} \nonumber \\
        &= \operatorname*{argmax}_{e} \, p(f \mid e)\, p(e) \tag{2.1}
\end{align}
$\hat{e}$ is the hypothesis to be selected. $p(f \mid e)$ is called the translation model while $p(e)$ is called the (target) language model. The translation model and the language model are estimated separately for practical reasons: the amount of parallel data used to train the translation model is in general orders of magnitude smaller than the amount of monolingual data used to train the language model. Another justification is that using two separate models makes the translation process modular: improving the translation model may help improve adequacy, i.e. how well the meaning of the source text is preserved in the translated text, while improving the language model may help improve fluency, i.e. how well-formed the translation is. It is therefore considered preferable to train both a translation model and a language model. In these models, parallel sentence pairs and target sentences are not used directly as parameters because of an obvious sparsity problem.
Parallel sentence pairs are further broken down using word-based models (see Section 2.3), phrase-based models (see Section 2.5) and hierarchical phrase-based models (see Section 2.6). For language modelling, sentences are broken down into windows of consecutive words using n-gram language models (see Section 2.7). We will see in the next section how to decompose the translation model using word alignment, which is introduced as a latent variable into the source-channel model.

2.3 Word Alignment

In the previous section, we briefly described the source-channel model, which describes the translation process. This model cannot be used directly in practice as it has too many parameters, namely all imaginable sentence pairs and target sentences. In order to address this issue, the alignment between source words and target words is introduced as a latent variable in the source-channel model.

Given a sentence pair $(f, e)$ with source sentence length $J = |f|$ and target sentence length $I = |e|$, a word alignment $a$ for this sentence pair is a mapping between the source and target words. In other words, $a$ is a subset of the cross product of the set of source words and their positions and the set of target words and their positions, as defined in Equation 2.2:
\begin{equation}
a \subset \{((f_j, j), (e_i, i)) : (j, i) \in [1, J] \times [1, I]\} \tag{2.2}
\end{equation}
When it is obvious from context which sentence pair $(f, e)$ is being word-aligned, we may simply consider source word positions and target word positions. In that case, $a$ is simply defined as a subset of pairs (source position, target position), as in Equation 2.3:
\begin{equation}
a \subset [1, J] \times [1, I] \tag{2.3}
\end{equation}
Each element of $a$ is called an alignment link. Alignment links between source and target words correspond to semantic or syntactic equivalences shared by these words in the source and target language and in a particular sentence pair. Alignments can present many-to-one and one-to-many mappings as well as reordering, as highlighted by crossing links. An example of word alignment is shown in Figure 2.1.

Soñé con una piedra lunar pálida
I dreamt of a pale moonstone

Figure 2.1: Example of word alignment $a$ for a Spanish-English sentence pair. $f$ is the Spanish sentence, $e$ is the English sentence. The source (Spanish) length $J$ is 6, as is the target (English) length $I$. This alignment exhibits many-to-one mappings (I and dreamt align to Soñé), one-to-many mappings (moonstone aligns to piedra and lunar), as well as crossing links (the link pale–pálida crosses the link moonstone–lunar).

Brown et al. (1993) introduce the alignment $a$ as a latent variable in the translation model $p(f \mid e)$, as in Equation 2.4:
\begin{equation}
p(f \mid e) = \sum_{a} p(f, a \mid e) \tag{2.4}
\end{equation}
We abuse notation by calling $a$ both the latent variable and the set of alignment links, which is an instance of the latent variable. For mathematical convenience and in order to allow simplifications, given a sentence pair $(f, e)$ with source length $J$ and target length $I$, $a$ is restricted to be a function from source word positions to target word positions, as in Equation 2.5:
\begin{equation}
a : [1, J] \longrightarrow [0, I], \quad j \longmapsto a_j \tag{2.5}
\end{equation}
The target position zero is included to model source words not aligned to any target word; these unaligned source words are virtually aligned to a so-called null word. Note that this definition is not symmetric: it only allows many-to-one mappings from source to target.
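To make the two representations concrete, the following minimal sketch (an illustration, not the alignment toolkit used in this thesis) stores the alignment of the Spanish-English example of Figure 2.1 both as a set of links (Equation 2.3) and as a source-to-target function with a null position (Equation 2.5); the 1-based positions are the obvious word indices of that example.

```python
# Word positions (1-based): f = Soñé(1) con(2) una(3) piedra(4) lunar(5) pálida(6)
#                           e = I(1) dreamt(2) of(3) a(4) pale(5) moonstone(6)

# Equation 2.3: an alignment as a set of (source position, target position) links.
links = {(1, 1), (1, 2),        # Soñé aligns to both "I" and "dreamt"
         (2, 3), (3, 4),        # con-of, una-a
         (4, 6), (5, 6),        # piedra and lunar both align to "moonstone"
         (6, 5)}                # pálida-pale (a crossing link)

# Equation 2.5: the functional form a_j, mapping each source position to a single
# target position (0 denotes the null word). Many-to-one mappings such as
# piedra/lunar -> moonstone are preserved, but one-to-many mappings such as
# Soñé -> {I, dreamt} cannot be represented: one of the two links must be dropped.
a = {1: 1, 2: 3, 3: 4, 4: 6, 5: 6, 6: 5}

def links_from_function(a: dict[int, int]) -> set[tuple[int, int]]:
    """Recover the link-set view from the functional view, ignoring null links."""
    return {(j, i) for j, i in a.items() if i != 0}

assert links_from_function(a) < links   # strict subset: the (1, 2) link is lost
```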
Various symmetrisation strategies, presented in Section 2.3.2, have been devised to address this limitation. Also note that we did not take into account the null word in our initial definition of alignments in Equation 2.2 because, in general, alignments are obtained from symmetrisation heuristics (see Section 2.3.2) where the null word is ignored.

We can use the latent variable $a$ to rewrite the translation model in Equation 2.6, with $f = f_1^J$, $e = e_1^I$ and $a = a_1^J$:
\begin{align}
p(f_1^J \mid e_1^I) &= \sum_{a_1^J} p(f_1^J, a_1^J \mid e_1^I) \nonumber \\
&= \sum_{a_1^J} \prod_{j=1}^{J} p(f_j, a_j \mid f_1^{j-1}, a_1^{j-1}, e_1^I) \nonumber \\
&= \sum_{a_1^J} \prod_{j=1}^{J} p(f_j \mid f_1^{j-1}, a_1^{j}, e_1^I)\, p(a_j \mid f_1^{j-1}, a_1^{j-1}, e_1^I) \tag{2.6}
\end{align}
Brown et al. (1993) present a series of five translation models of increasing complexity that parameterise the terms $p(f_j \mid f_1^{j-1}, a_1^{j}, e_1^I)$ and $p(a_j \mid f_1^{j-1}, a_1^{j-1}, e_1^I)$. Parameter estimation is carried out with the expectation-maximisation algorithm (Dempster et al., 1977). Also based on Equation 2.6, Vogel et al. (1996) introduce an HMM model (Rabiner, 1989) for word alignment, and Deng and Byrne (2008) extend the HMM model to a word-to-phrase HMM model. We describe these two models in the following sections.

We have described word alignment models in the context of the source-channel model. In that context, word alignment models can be used directly for word-based decoding.1 However, nowadays, word alignment models are used as a preliminary step in the machine translation training pipeline, namely prior to rule extraction (see Section 2.5.3 and Section 2.6.4). In that case, the word alignment models are used to produce Viterbi alignments, defined in Equation 2.7:
\begin{equation}
\hat{a}_1^J = \operatorname*{argmax}_{a_1^J} \, p(f_1^J, a_1^J \mid e_1^I) \tag{2.7}
\end{equation}
One contribution of this thesis is to use alignment posterior probabilities instead of Viterbi alignments for rule extraction (see Chapter 4).

1 e.g. http://www.isi.edu/licensed-sw/rewrite-decoder

2.3.1 HMM and Word-to-Phrase Alignment Models

We review HMM and word-to-phrase HMM models as these models are used in experiments throughout this thesis. Vogel et al. (1996) introduce an HMM alignment model that treats target word positions as hidden states and source words as observations. The model is written in Equation 2.8:
\begin{equation}
p(f_1^J, a_1^J \mid e_1^I) = \prod_{j=1}^{J} p(a_j \mid a_{j-1}, I)\, p(f_j \mid e_{a_j}) \tag{2.8}
\end{equation}
Word-to-phrase HMM models (Deng and Byrne, 2008) were designed to capture interesting properties of IBM Model 4 (Brown et al., 1993) in an HMM framework, in order to keep alignment and estimation procedures exact. We now present this model in more detail, using our usual source/target convention, which is the reverse of the one adopted in the original publication.2

2 s in the publication corresponds to e in this thesis; t corresponds to f; the sentence length symbols are swapped accordingly.

In the word-to-phrase HMM alignment model, the source sentence $f$ is segmented into source phrases $v_1^K$. The alignment $a$ is represented by a set of variables $(K, a_1^K, h_1^K, \phi_1^K)$ where:

• $K$ is the number of source phrases that form a segmentation of the source sentence $f$.
• $a_1^K$ is the alignment of the source phrases to target word positions.
• $\phi_1^K$ indicates the length of each source phrase.
• $h_1^K$ is a hallucination sequence that indicates whether a source phrase was generated by the target null word or by a usual target word.
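To make these latent variables concrete, consider the French-English pair used in Figure 2.2 below, where the source sentence Le loup aime les vaches grasses is segmented as Le | loup | aime | les vaches | grasses and the target sentence is The wolf loves fat cows. A configuration consistent with that figure (the hallucination values are inferred here, since no phrase is generated by the null word) is:
\[
K = 5, \qquad v_1^5 = (\text{Le},\ \text{loup},\ \text{aime},\ \text{les vaches},\ \text{grasses}),
\]
\[
a_1^5 = (1, 2, 3, 5, 4), \qquad \phi_1^5 = (1, 1, 1, 2, 1), \qquad h_1^5 = (1, 1, 1, 1, 1),
\]
with target positions The = 1, wolf = 2, loves = 3, fat = 4, cows = 5, so that, for example, the fourth source phrase les vaches (of length $\phi_4 = 2$) is generated by the target word cows at position $a_4 = 5$.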
The general form of the model is presented in Equation 2.9:
\begin{align}
p(f, a \mid e) &= p(v_1^K, K, a_1^K, h_1^K, \phi_1^K \mid e) \nonumber \\
&= p(K \mid J, e) \times p(a_1^K, \phi_1^K, h_1^K \mid K, J, e) \times p(v_1^K \mid a_1^K, h_1^K, \phi_1^K, K, J, e) \tag{2.9}
\end{align}
We now review the modelling decisions taken for each of the components of Equation 2.9. The first component is simply modelled by:
\begin{equation}
p(K \mid J, e) = \eta^K \tag{2.10}
\end{equation}
where $\eta$ is a parameter that controls the number of segments in the source. The second component is modelled using the Markov assumption:
\begin{align}
p(a_1^K, \phi_1^K, h_1^K \mid K, J, e) &= \prod_{k=1}^{K} p(a_k, h_k, \phi_k \mid a_{k-1}, \phi_{k-1}, h_{k-1}, K, J, e) \nonumber \\
&= \prod_{k=1}^{K} p(a_k \mid a_{k-1}, h_k, I)\, d(h_k)\, n(\phi_k \mid e_{a_k}) \tag{2.11}
\end{align}
As in the HMM word alignment model, $a_k$ depends only on $a_{k-1}$, the target length $I$ and the binary value $h_k$. $d(h_k)$ is simply controlled by the parameter $p_0$, with $d(0) = p_0$. $n(\phi_k \mid e_{a_k})$ is a finite distribution on source phrase length that depends on each target word. This parameter is analogous to the fertility parameter introduced in IBM Model 3 (Brown et al., 1993), which controls how many source words are aligned to a given target word. The third component of Equation 2.9 is defined in Equation 2.12 and represents the word-to-phrase translation parameter:
\begin{equation}
p(v_1^K \mid a_1^K, h_1^K, \phi_1^K, K, J, e) = \prod_{k=1}^{K} p(v_k \mid e_{a_k}, h_k, \phi_k) \tag{2.12}
\end{equation}
One key contribution of the word-to-phrase HMM model is to use bigram translation probabilities to model a single phrase translation, as shown in Equation 2.13:
\begin{equation}
p(v_k \mid e_{a_k}, h_k, \phi_k) = t_1(v_k[1] \mid h_k \cdot e_{a_k}) \prod_{j=2}^{\phi_k} t_2(v_k[j] \mid v_k[j-1], h_k \cdot e_{a_k}) \tag{2.13}
\end{equation}
where $h_k \cdot e_{a_k}$ is $e_{a_k}$ if $h_k = 1$ and the null word otherwise, $t_1$ is a word-to-word translation probability and $t_2$ is a bigram translation probability.

Figure 2.2 shows a simplified version of the generative story for an HMM word-to-phrase alignment model: first, pick the number of source phrases $K$ according to $p(K \mid J, e)$; then pick a target word given the previously chosen one; finally, generate the source phrase from the target word using fertility and bigram translation probabilities. For example, we generate the source phrase les vaches from the target word cows according to Equation 2.14:
\begin{equation}
p(\text{les vaches} \mid \text{cows}) = p(\text{les} \mid \text{cows})\, p(\text{vaches} \mid \text{cows}, \text{les}) \tag{2.14}
\end{equation}
Thus bigram probabilities take into account the context of the target word to some extent.

The wolf loves fat cows
Le | loup | aime | les vaches | grasses
$p(K = 5 \mid I = 5, J = 6)$; $\phi_1 = 1$, $\phi_2 = 1$, $\phi_3 = 1$, $\phi_4 = 2$, $\phi_5 = 1$; transitions $p(2 \mid 1)\, p(3 \mid 2)\, p(5 \mid 3)\, p(4 \mid 5)$

Figure 2.2: Simplified generative story for an HMM word-to-phrase alignment model. Adapted from (Deng and Byrne, 2008). The adjective-noun sequence fat cows is reordered into the noun-adjective sequence vaches grasses. The word cows has fertility 2 as it is translated into the source phrase les vaches.

2.3.2 Symmetrisation Heuristics

We have mentioned that the IBM and HMM alignment models are not symmetric: they only allow a many-to-one mapping from source words to target words. In order to address this issue, one can train alignment models in both source-to-target and target-to-source directions, obtain Viterbi alignments from both models and apply symmetrisation strategies (Och et al., 1999; Och and Ney, 2003; Koehn et al., 2003). Och et al. (1999) designed a first symmetrisation heuristic that was later dubbed the grow heuristic. Koehn et al. (2003) later extended the grow heuristic into the grow-diag and grow-diag-final heuristics and examined the impact on translation performance of each heuristic.
Alignments from source to target (i.e. in which the alignment is a function from source positions to target positions) and from target to source are denoted $a_{f2e}$ and $a_{e2f}$ respectively. Let us consider a sentence pair $(f, e)$, and source-to-target and target-to-source Viterbi alignments $a_{f2e}$ and $a_{e2f}$. The intersection and union heuristics are defined as follows:

• intersection: $a = a_{e2f} \cap a_{f2e}$
• union: $a = a_{e2f} \cup a_{f2e}$

The intersection heuristic typically produces high-precision alignments, while the union heuristic typically produces high-recall alignments (Och and Ney, 2003).

We now present the grow heuristic and its variants, which are based on the initial intersection and union heuristics. The grow heuristic algorithm is presented in Figure 2.3. The input is a sentence pair $(f_1^J, e_1^I)$, a source-to-target alignment $a_{f2e}$ and a target-to-source alignment $a_{e2f}$. The resulting alignment $a$ is initialised with the intersection (line 2). Then alignment links that are in the union and that are neighbours of already existing alignment links (line 8) are considered. If the source or the target word is not already aligned (line 9), then the link is added to the resulting alignment (line 10). This is repeated until no more links are added.

1:  function Grow($f_1^J$, $e_1^I$, $a_{f2e}$, $a_{e2f}$)
2:      $a \leftarrow a_{f2e} \cap a_{e2f}$
3:      while true do
4:          added ← false
5:          for $i \in [1, I]$ do
6:              for $j \in [1, J]$ do
7:                  if $(j, i) \in a$ then
8:                      for $(k, l) \in$ Neighbours($(j, i)$) $\cap$ ($a_{f2e} \cup a_{e2f}$) do
9:                          if $k$ not aligned in $a$ or $l$ not aligned in $a$ then
10:                             $a \leftarrow a \cup \{(k, l)\}$
11:                             added ← true
12:                         end if
13:                     end for
14:                 end if
15:             end for
16:         end for
17:         if not added then
18:             break
19:         end if
20:     end while
21:     return $a$
22: end function

Figure 2.3: Algorithm for the grow symmetrisation heuristic (Koehn, 2010). The alignment is initialised from the intersection, and alignment links that are neighbours of existing alignment links are iteratively added if the source or the target word is not already aligned.

In the grow heuristic, neighbours are defined as horizontal or vertical neighbours. If diagonal neighbours are also considered, then the heuristic becomes grow-diag. The grow heuristic also has an optional step called final: alignment links in the union where the source or the target word is not already aligned can also be added to the resulting alignment. If only links in the union where both the source and the target word are not already aligned are considered for the final procedure, then the optional step is called final-and. Symmetrisation heuristics have been shown to be beneficial both in terms of alignment quality, as measured by comparing automatic alignments to human alignments, and in terms of translation quality, when alignments are used as an intermediate step in the translation pipeline.

2.4 Log-Linear Model of Machine Translation

As we have seen in Section 2.2, SMT was historically framed as a source-channel model. As an alternative, Berger et al. (1996) introduce maximum entropy models for natural language processing. In maximum entropy modelling, we wish to estimate a conditional probability $p(y \mid x)$. Given a training sample, various feature functions deemed to be relevant are picked. We then constrain $p$ such that the expected value of each feature function $f$ with respect to the empirical distribution is equal to the expected value of $f$ with respect to the model $p$.
Finally, $p$ is chosen among all models that satisfy the constraints defined by the features and such that its entropy is maximum. Berger et al. (1996) show how a maximum entropy model can be parameterised as an exponential, or log-linear, model. They apply this model to three machine translation related tasks. First, they use a maximum entropy model to predict the translation of a word using the context for that word. Then, they use a maximum entropy model to predict the target language word order. Finally, they apply maximum entropy modelling in order to predict the source sentence segmentation.

Och et al. (1999) notice that using an erroneous version of the source-channel model, that is using the following equation:
\begin{equation}
\hat{e} = \operatorname*{argmax}_{e} \, p(e \mid f)\, p(e) \tag{2.15}
\end{equation}
gives comparable performance with respect to using the correct formulation of the source-channel model given in Equation 2.1. Then Och and Ney (2002) propose the following log-linear model extension:
\begin{align}
\hat{e} &= \operatorname*{argmax}_{e} \, p(e \mid f) \nonumber \\
&= \operatorname*{argmax}_{e} \, \frac{\exp\big(\sum_{m=1}^{M} \lambda_m h_m(e, f)\big)}{\sum_{e'} \exp\big(\sum_{m=1}^{M} \lambda_m h_m(e', f)\big)} \nonumber \\
&= \operatorname*{argmax}_{e} \, \exp\Big(\sum_{m=1}^{M} \lambda_m h_m(e, f)\Big) \tag{2.16}
\end{align}
where the $h_m$ are called feature functions and the $\lambda_m$ are called feature weights. The log-linear model is an extension of the source-channel model because it can be reduced to the original source-channel model with the following settings:

• $M = 2$
• $h_1(e, f) = \log p(f \mid e)$
• $h_2(e, f) = \log p(e)$
• $\lambda_1 = \lambda_2 = 1$

Log-linear models were originally trained with the maximum likelihood criterion, which precisely makes them equivalent to maximum entropy models (Berger et al., 1996). More effective training techniques such as minimum error rate training (Och, 2003) were introduced subsequently (see Section 2.8.2). An advantage of minimum error rate training is that the criterion for optimisation and the evaluation metric are consistent. Because minimum error rate training does not require computing a normalisation constant, in practice SMT models effectively become linear models, with the objective function presented in Equation 2.17:
\begin{equation}
\hat{e} = \operatorname*{argmax}_{e} \, \sum_{m=1}^{M} \lambda_m h_m(e, f) \tag{2.17}
\end{equation}

2.5 Phrase-Based Translation

So far, we have presented two modelling approaches to machine translation: the original source-channel model and the current log-linear model. We have also presented word alignment models, which were introduced within the source-channel model framework and which are instances of word-based models. In the phrase-based translation paradigm, the minimal unit of translation is the phrase. Phrases are sequences of consecutive words that need not be syntactically or semantically motivated, but are nevertheless used as translation units. Benefits of phrase-based models include:

• effective disambiguation of the translation of a word in a certain local context;
• effective local reordering within a phrase, such as the adjective-noun inversion from French to English;
• effective translation of multi-word expressions, such as idioms.

Phrase-based models currently achieve state-of-the-art performance for certain language pairs that do not involve much reordering, such as French-English. They can be defined in the source-channel model framework (see Section 2.2) or the log-linear model framework (see Section 2.4). Because the source-channel model is no longer widely used and because it is a special case of a log-linear model, we will focus our presentation on the log-linear model.
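As a minimal illustration of the linear decision rule in Equation 2.17, which is used throughout this thesis, the following sketch scores two candidate translations as a weighted sum of feature values. The feature values, weights and candidate names are invented purely for illustration; in a real system they come from the translation model, language model and other features described in Section 2.6.5.

```python
def score(features: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of feature values: sum over m of lambda_m * h_m(e, f)."""
    return sum(weights[name] * value for name, value in features.items())

# Two candidate translations e of the same source f, with illustrative
# (log-probability) feature values h_m(e, f).
candidates = {
    "hypothesis A": {"log p(f|e)": -10.2, "log p(e)": -21.5},
    "hypothesis B": {"log p(f|e)": -11.0, "log p(e)": -18.3},
}
weights = {"log p(f|e)": 1.0, "log p(e)": 1.0}   # lambda_1 = lambda_2 = 1

best = max(candidates, key=lambda e: score(candidates[e], weights))
print(best)   # hypothesis B: -29.3 > -31.7
```

With both weights set to 1, the rule reduces to the source-channel decision rule of Equation 2.1, as noted above; tuning the weights with minimum error rate training departs from that special case.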
There are different variations on the phrase-based translation paradigm. We will focus on two popular approaches, namely the alignment template model (Och et al., 1999; Och and Ney, 2004) and the phrase-based model (Koehn et al., 2003; Koehn, 2010).

2.5.1 Alignment Template Model

The alignment template model uses the log-linear model presented in Equation 2.16 as a starting point, repeated in Equation 2.18:
\begin{equation}
\hat{e} = \operatorname*{argmax}_{e_1^I} \, \exp\Big(\sum_{m=1}^{M} \lambda_m h_m(f_1^J, e_1^I)\Big) \tag{2.18}
\end{equation}
In order to reduce the number of parameters, two latent variables $\pi_1^K$ and $z_1^K$ are introduced. $z_1^K$ is a sequence of alignment templates while $\pi_1^K$ is a permutation of size $K$. An alignment template is a triple $(\tilde{F}, \tilde{E}, \tilde{A})$ where $\tilde{F}$ is a sequence of source word classes, $\tilde{E}$ is a sequence of target word classes and $\tilde{A}$ is an alignment between $\tilde{F}$ and $\tilde{E}$. $\pi_1^K$ together with $z_1^K$ define a source segmentation of $f_1^J$ into source phrases $\tilde{f}_1^K$, a target segmentation of $e_1^I$ into target phrases $\tilde{e}_1^K$ and a bijective mapping between source phrases and target phrases in which $\tilde{f}_{\pi_k}$ is mapped to $\tilde{e}_k$ for $k \in [1, K]$. The alignment templates $z_1^K$ are constrained in such a way that the alignment template classes match the word classes. The alignment template translation model is summarised in Figure 2.4. Using the max approximation and making the feature functions depend on the hidden variables, the translation model can be rewritten as Equation 2.19:
\begin{equation}
\hat{e} = \operatorname*{argmax}_{e_1^I,\, z_1^K,\, \pi_1^K} \, \exp\Big(\sum_{m=1}^{M} \lambda_m h_m(f_1^J, e_1^I, \pi_1^K, z_1^K)\Big) \tag{2.19}
\end{equation}

2.5.2 Phrase-Based Model

The phrase-based model is similar to the alignment template model but does not make use of source and target word classes. Again, we use the latent variables $\pi_1^K$ and $z_1^K$. This time, $z_k$ is defined as a triple $(\tilde{f}, \tilde{e}, \tilde{A})$ where $\tilde{f}$ is a sequence of source words, $\tilde{e}$ is a sequence of target words and $\tilde{A}$ is an alignment between the words in $\tilde{f}$ and $\tilde{e}$. The reason for using the alignment information between phrase pairs is to be able to compute the lexical feature (see Section 2.6.5). Because the lexical feature can be computed by other means than this alignment information (see again Section 2.6.5), it is also possible to simply define $z_k$ as a phrase pair. We have presented two variants of the phrase-based model. We will now describe how to obtain the phrase pairs used for the latent variable $z$.

2.5.3 Phrase Pair Extraction

A preliminary step to phrase pair extraction is to obtain word-aligned parallel text. One possibility is to train source-to-target and target-to-source word alignment models, obtain Viterbi alignments in both directions, and apply symmetrisation heuristics, as described in Section 2.3.2. Then phrase pairs are extracted from each word-aligned sentence pair.

Figure 2.4: Alignment template translation process. Adapted from (Och and Ney, 2004). The source word sequence $f_1^7$ is first transformed into a source class sequence $F_1^7$. The source classes are then segmented into source phrases $\tilde{f}_1^4$. The source phrases are then reordered and aligned to target phrases $\tilde{e}_1^4$. For example, the source phrase $\tilde{f}_1$ is aligned to the target phrase $\tilde{e}_2$ through $z_1$. This means that the permutation $\pi_1^4$ has value $\pi_2 = 1$. This also means that $z_1$ defines a word alignment between the source words $f_1$ and $f_2$ and the target word $e_3$.
Finally, the target phrases $\tilde{e}_1^4$, which encode the target class sequence $E_1^6$, generate the target word sequence $e_1^6$.

El mundo es grande
The world is big

Figure 2.5: Rule extraction for a sentence pair. For example, the phrase pair (El mundo, The world) is extracted. The phrase pair (es grande, big) is not extracted because it is not consistent with the alignment.

Let us consider a sentence pair $(f, e)$ and an alignment $a$. We extract all phrase pairs that are consistent with the alignment. This means that we extract all phrase pairs $(f_{j_1}^{j_2}, e_{i_1}^{i_2})$ that satisfy Equation 2.20 and Equation 2.21:
\begin{equation}
\forall (j, i) \in a, \quad \big(j \in [j_1, j_2] \Leftrightarrow i \in [i_1, i_2]\big) \tag{2.20}
\end{equation}
\begin{equation}
\big([j_1, j_2] \times [i_1, i_2]\big) \cap a \neq \emptyset \tag{2.21}
\end{equation}
Equation 2.20 requires that no alignment link connect a word inside the phrase pair to a word outside the phrase pair. Equation 2.21 requires that there be at least one alignment link between a source word in the source phrase and a target word in the target phrase. For example, in Figure 2.5, the phrase pair 〈El mundo, The world〉 is extracted while the phrase pair 〈es grande, big〉 is not, because the word es (is) is aligned outside the phrase pair. Note that the consistency constraint sometimes refers to only Equation 2.20 rather than both Equation 2.20 and Equation 2.21.

2.5.4 Phrase-Based Decoding

We have introduced two types of phrase-based models and described a technique to extract phrase pairs, which are an essential component of these models. We will now describe the decoding process, which is an effective means to obtain the optimal translation of a source sentence. We first describe a naive strategy for decoding, in order to motivate the need for a more efficient decoder. A naive decoder may follow these steps, given a source sentence $f$ of length $J$ to be translated:

• Consider all possible segmentations of $f$ into source phrases. There are $2^{J-1}$ such segmentations.
• For each segmentation, consider all possible permutations of the source phrases. For a segmentation of size $K$, there are $K!$ such permutations.
• For each permutation of the source phrases, consider all translations of each source phrase, and concatenate the target phrases according to the permutation. The source phrase translations are given by the phrase pairs extracted from the training data (see Section 2.5.3). If there are 10 translations per source phrase, we obtain $10^K$ possible translations (for the segmentation and permutation considered).
• Rank the translations by their score and pick the highest scoring translation.

We can see that the search space is too large for this naive approach to be feasible. We now present a practical solution to the decoding process in phrase-based translation. We will first introduce the translation process. Then, we will describe how translation hypotheses are built incrementally. We will then motivate the need for pruning and describe how pruning is carried out. Finally, we will describe how future cost estimation may reduce search errors.

Translation Process

Given a source sentence, the translation process is to iteratively pick a source phrase that has not been translated yet, translate that source phrase into a target phrase and append the target phrase to the translation, repeating this process until the source sentence has been entirely covered by source phrases. While the process is not complete, the concatenation of target phrases is called a partial hypothesis.
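The following minimal sketch shows what the state of such a partial hypothesis might hold in a stack-based decoder. The field names, the choice of a 3-gram history and the costs are illustrative assumptions, not the implementation used in this thesis; with an n-gram language model, only the last n − 1 target words are needed to score an extension.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PartialHypothesis:
    """Illustrative state for a partial hypothesis in a stack-based decoder."""
    coverage: frozenset[int]      # source positions already translated
    last_words: tuple[str, ...]   # last n-1 target words of the partial output
    cost: float                   # model cost accumulated so far

    def extend(self, span: range, target_phrase: tuple[str, ...],
               phrase_cost: float, n: int = 3) -> "PartialHypothesis":
        """Translate an uncovered source span and append its target phrase."""
        assert not self.coverage & set(span), "source words already covered"
        new_words = (self.last_words + target_phrase)[-(n - 1):]
        return PartialHypothesis(self.coverage | frozenset(span),
                                 new_words, self.cost + phrase_cost)

empty = PartialHypothesis(frozenset(), (), 0.0)
h1 = empty.extend(range(1, 3), ("the", "world"), 1.7)
print(sorted(h1.coverage), h1.last_words, h1.cost)   # [1, 2] ('the', 'world') 1.7
```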
Hypothesis Expansion The decoding process starts from an initial empty partial hypothesis. This empty partial hypothesis is extended by picking source phrases, appending their translations to the empty hypothesis. At this stage, we have obtained several partial hypotheses. The partial hypotheses are repeatedly extended until all source words have been covered. Partial hypotheses are represented by states that contain the information necessary to compute the cost of an extension. If we use an n-gram language model as a feature, the state will encode the cost of the partial hypothesis and the last n− 1 words of the partial hypothesis. CHAPTER 2. STATISTICAL MACHINE TRANSLATION 26 Hypothesis Recombination When two partial hypotheses share the same n− 1 words, only the partial hypothesis with the lower cost can lead to the best final hypothesis. Therefore, the partial hypothesis with higher cost can be discarded, or alternatively, it is possible to make these two partial hy- potheses share the same state for rescoring purposes. Stack Based Decoding and Pruning The decoding search space is very large as seen above. Approximations therefore need to be made for an ef- fective search. The partial hypotheses are grouped in stacks by the number of source words covered. This allows pruning. Each time a hypothesis ex- pansion produces a hypothesis that belongs to a certain stack, that stack is pruned. There are two commonly used types of pruning: histogram prun- ing and threshold pruning (Koehn, 2010). Histogram pruning enforces a maximum number of partial hypotheses in each stack. Threshold pruning examines the cost of the best partial hypothesis in a stack and discards all partial hypotheses in that stack whose cost is greater than the best cost plus a threshold. Histogram pruning provides a simple guarantee in terms of computational complexity. On the other hand, because it relies on cost, threshold pruning ensures a consistent quality for partial hypotheses that survive the pruning threshold. Future Cost Partial hypotheses that cover the same number of source words are grouped together for pruning purposes. However, their cost may not be directly comparable, for example partial hypotheses that correspond to the translation of frequent words in the source might have a smaller cost than partial hypotheses that correspond to the translation of rare words in the source. To address this issue, a future cost that represents how difficult it is to translate the rest of the sentence is added to the model cost of each partial hypothesis. 2.6 Hierarchical Phrase-Based Translation In the previous section, we have described the phrase-based translation paradigm. In this section, we present the hierarchical phrase-based trans- lation model. This model relies on the synchronous context free grammar formalism. Reordering between source and target languages is modelled by the grammar rather than by an ad hoc reordering model, although using both CHAPTER 2. STATISTICAL MACHINE TRANSLATION 27 the grammar and a reordering model may be beneficial (Huck et al., 2013). A closely related formalism, inversion transduction grammars, was previously introduced (Wu, 1995, 1997). However, because hierarchical phrase-based grammar only contain lexicalised rules, translation decoding is more practi- cal. An alternative to hierarchical phrase-based translation that also models discontinuous phrases but does not use the same grammar formalism has also been introduced recently (Galley and Manning, 2010). 
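Before motivating the hierarchical model further, the stack organisation and the two pruning schemes of Section 2.5.4 can be sketched as follows. This is a skeleton under stated assumptions rather than the implementation of any particular decoder: the hypothesis record, the pruning parameters, the expand function and the toy phrase table in the demo are all invented for illustration.

```python
import heapq
from collections import namedtuple

# A partial hypothesis: covered source positions, last target words (the
# language model state), accumulated cost and the output produced so far.
Hyp = namedtuple("Hyp", ["coverage", "lm_state", "cost", "output"])

def prune(stack, histogram=100, threshold=2.0):
    """Threshold and histogram pruning for one stack (lower cost is better)."""
    if not stack:
        return stack
    best = min(h.cost for h in stack)
    # Threshold pruning: discard hypotheses whose cost exceeds best + threshold.
    kept = [h for h in stack if h.cost <= best + threshold]
    # Histogram pruning: keep at most `histogram` hypotheses.
    return heapq.nsmallest(histogram, kept, key=lambda h: h.cost)

def decode(src, expand):
    """Stack-based decoding skeleton: stacks[k] holds hypotheses covering k words."""
    J = len(src)
    stacks = [[] for _ in range(J + 1)]
    stacks[0].append(Hyp(frozenset(), ("<s>",), 0.0, ()))
    for k in range(J + 1):
        stacks[k] = prune(stacks[k])
        for hyp in stacks[k]:
            # `expand` stands in for phrase table lookup, scoring and recombination.
            for new_hyp in expand(hyp, src):
                stacks[len(new_hyp.coverage)].append(new_hyp)
    return min(stacks[J], key=lambda h: h.cost) if stacks[J] else None

# Tiny demo with a one-entry toy phrase table folded into the expand function.
def expand(hyp, src):
    table = {"hola": [("hello", 1.0), ("hi", 1.2)]}
    for j, word in enumerate(src):
        if j not in hyp.coverage:
            for target, cost in table.get(word, []):
                yield Hyp(hyp.coverage | {j}, (target,),
                          hyp.cost + cost, hyp.output + (target,))

print(decode(["hola"], expand))
```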
2.6.1 Introduction and Motivation

Phrase-based models generally impose a limit on the size of the phrase pairs extracted while, in decoding, phrases are typically reordered only within a certain window. These restrictions may be problematic for language pairs such as German-English or Chinese-English that require long-range reordering in some instances. Hierarchical phrase-based translation was introduced in order to overcome the reordering limitations of phrase-based models (Chiang, 2005, 2007). For example, the Chinese sentence with English gloss in Figure 2.6 requires "nested" reordering (Chiang, 2007):

• The phrase with North Korea have diplomatic relations must be reordered into the phrase have diplomatic relations with North Korea.

• The phrase few countries one of must be reordered into the phrase one of (the) few countries.

• After the two previous segments are reordered, have diplomatic relations with North Korea that one of the few countries must be reordered into one of the few countries that have diplomatic relations with North Korea.

澳洲 是 与 北韩 有 邦交 的 少数 国家 之一
Australia is with North Korea have diplomatic relations that few countries one of
Australia is one of the few countries that have diplomatic relations with North Korea

Figure 2.6: Example of a Chinese sentence that needs nested reordering in order to be translated into English, shown with its English gloss and translation. Adapted from (Chiang, 2007).

Phrase-based systems can model this type of movement but they require very long phrase pairs, which are impractical to rely upon because of data sparsity. Hierarchical phrase-based grammars, on the other hand, do model this type of movement using shorter phrase pairs combined by more complex rules.

2.6.2 Hierarchical Grammar

2.6.2.1 Hierarchical Grammar Definition

A hierarchical phrase-based grammar, also called a hierarchical grammar or Hiero grammar, is a particular instance of a synchronous context-free grammar (Lewis II and Stearns, 1968; Aho and Ullman, 1969). It is a set of rewrite rules of the following type:

X → ⟨γ, α, ∼⟩

where X is a nonterminal, γ and α are sequences of terminals and nonterminals, and ∼ is an alignment between nonterminals. Terminals that appear in γ are words in the source language while terminals that appear in α are words in the target language. Nonterminals are chosen from a finite set disjoint from the set of terminals. The alignment between nonterminals indicates which nonterminals on the source and target sides correspond to each other. The alignment ∼ can be written as a set of matching indices.

2.6.2.2 Types of Rules

A rule may or may not contain nonterminals. A rule without nonterminals (on the right-hand side) is called a phrase-based rule. A rule with nonterminals is called a hierarchical rule. An example grammar containing both types of rule is shown in Figure 2.7.

g1 : S → ⟨X, X⟩
g2 : S → ⟨S X, S X⟩
g3 : X → ⟨X1 的 X2, the X2 that X1⟩
g4 : X → ⟨X 之一, one of X⟩
g5 : X → ⟨与 X1 有 X2, have X2 with X1⟩
g6 : X → ⟨澳洲, Australia⟩
g7 : X → ⟨是, is⟩
g8 : X → ⟨北韩, North Korea⟩
g9 : X → ⟨邦交, diplomatic relations⟩
g10 : X → ⟨少数 国家, few countries⟩

Figure 2.7: Example of a hierarchical grammar. Adapted from (Chiang, 2007). With this grammar, it is possible to obtain a derivation for the sentence pair from Figure 2.6. The derivation is shown in Figure 2.8.

A hierarchical grammar also usually contains the following rules, called glue rules:

S → ⟨X, X⟩
S → ⟨S X, S X⟩

with S the start symbol. The first glue rule is necessary to be able to start a derivation.
The second glue rule allows concatenation of phrase-based or hierarchical rules. 2.6.2.3 Hierarchical Grammar Example Let us consider for example the grammar in Figure 2.7 where each rewrite rule is given a name gi. With this grammar, it is possible to write a derivation, i.e. a sequence of rules, that rewrites the start symbol h into the sentence pair presented in Section 2.6.1 (Chiang, 2007). For example we can apply the derivation g2P g2P g)P g6P g7P g4P g+P g)(P g5P g1P g0 as in Figure 2.8. CHAPTER 2. STATISTICAL MACHINE TRANSLATION 30 S → 〈SX; SX〉 → 〈SX1X2; SX1X2〉 → 〈X1X2X3; X1X2X3〉 → 〈澳洲 X2X3;Australia X2X3〉 → 〈澳洲 是 X3;Australia is X3〉 → 〈澳洲 是 X 之一;Australia is one of X〉 → 〈澳洲 是 X1 的 X2 zhiyi;Australia is one of the X2 that X1〉 → 〈澳洲 是 X1 的 少数 国家 之一;Australia is one of the few countries that X1〉 → 〈澳洲 是 与 X1 有 X2 的 少数 国家 之一; Australia is one of the few countries that have X2 with X1〉 → 〈澳洲 是 与 X1 有 邦交 的 少数 国家 之一; Australia is one of the few countries that have diplomatic relations with X1〉 → 〈澳洲 是 与 北韩 有 邦交 的 少数 国家 之一; Australia is one of the few countries that have diplomatic relations with North Korea〉 Figure 2.8: Example of hierarchical grammar derivation. Adapted from (Chi- ang, 2007). A derivation is simply a sequence of rules that rewrite the start symbol h into a sentence pair. This particular derivation produces the sen- tence pair from Figure 2.6. CHAPTER 2. STATISTICAL MACHINE TRANSLATION 31 2.6.2.4 Constraints on Hierarchical Grammars The definition given for a hierarchical grammar is very general. In prac- tice, systems impose constraints on the types of rule the grammar con- tains for efficiency reasons. We first introduce the concept of rule pat- tern (Iglesias et al., 2009a) in order to describe these constraints in terms of patterns. A rule pattern is simply a pair of regular expressions that match the source and target sides of a hierarchical rule. For example, the rule m → 〈Buenas tardes mP Good afternoon m〉 has a rule pattern 〈+mP#+m〉 where  is the source vocabulary and # is the target vocabu- lary. For ease of notation, we use the notation w for either + or #+. Thus w simply represents a sequence of terminals. In our example, the pattern for the rule m → 〈Buenas tardes mP Good afternoon m〉 is 〈wmPwm〉. Chiang (2007) defines the following set of pattern-related constraints that must be satisfied by the rules in a hierarchical grammar: • If a rule m → 〈 P 〉 has a pattern 〈wPw〉, then | | ≤ 10 and | | ≤ 10. • A rule m → 〈 P 〉 must satisfy | | ≤ 5. • Rules have at most 2 nonterminals. • The source side of a rule cannot have adjacent nonterminals. Setiawan and Resnik (2010) relax this constraint in order to model Chinese- English reordering phenomena that may be not always captured in training because of data sparsity. Note that the target side can still have adjacent nonterminals. More fine-grained constraints can be defined on rule patterns, which are ob- tained from rules by replacing terminal sequences by the placeholder w. For example, we use the configuration described in Table 2.1 for all translation experiments in this thesis, unless specified otherwise. This grammar was obtained following a greedy strategy of adding in turn the most beneficial patterns for Arabic-English translation. Another restriction on hierarchical grammars is the extent of reordering allowed. de Gispert et al. (2010a) investigate the use of shallow-c grammars that precisely control the depth of reordering in translation. 
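To make the notion of rule pattern used in Table 2.1 concrete, the following minimal sketch maps one side of a rule to its pattern by collapsing runs of terminals into the placeholder w. The string representation of rules and the convention that nonterminal symbols start with "X" are illustrative assumptions, not the representation used in the systems described in this thesis.

```python
def side_pattern(symbols):
    """Collapse runs of terminals into 'w'; keep nonterminals such as 'X', 'X1', 'X2'."""
    pattern = []
    for sym in symbols:
        if sym.startswith("X"):            # assumed nonterminal naming convention
            pattern.append(sym)
        elif pattern and pattern[-1] == "w":
            continue                        # still inside the current terminal run
        else:
            pattern.append("w")             # start a new terminal run
    return " ".join(pattern)

def rule_pattern(source, target):
    return "<{} , {}>".format(side_pattern(source), side_pattern(target))

# The example from the text: X -> <Buenas tardes X , Good afternoon X>
print(rule_pattern(["Buenas", "tardes", "X"], ["Good", "afternoon", "X"]))
# prints: <w X , w X>
```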
We describe here shallow-1 grammars (Iglesias et al., 2009a; de Gispert et al., 2010a) as they will be used for translation experiments in this thesis. Shallow-1 grammars allow only one level of reordering, they do not allow nested reordering as in CHAPTER 2. STATISTICAL MACHINE TRANSLATION 32 〈source , target〉 include 〈source , target〉 include 〈w X , w X〉 no 〈X w , w X〉 yes 〈w X , X w〉 yes 〈X w , w X w〉 yes 〈w X , w X w〉 yes 〈X w , X w〉 no 〈w X w , w X〉 yes 〈w X w , X w〉 yes 〈w X w , w X w〉 yes 〈X1 w X2 , w X1 w X2〉 no 〈X2 w X1 , w X1 w X2〉 yes 〈X1 w X2 , w X1 w X2 w〉 no 〈X2 w X1 , w X1 w X2 w〉 yes 〈X1 w X2 , w X1 X2〉 no 〈X2 w X1 , w X1 X2〉 yes 〈X1 w X2 , w X1 X2 w〉 no 〈X2 w X1 , w X1 X2 w〉 yes 〈X1 w X2 , X1 w X2〉 no 〈X2 w X1 , X1 w X2〉 yes 〈X1 w X2 , X1 w X2 w〉 no 〈X2 w X1 , X1 w X2 w〉 yes 〈X1 w X2 , X1 X2 w〉 no 〈X2 w X1 , X1 X2 w〉 yes 〈w X1 w X2 , w X1 w X2〉 no 〈w X2 w X1 , w X1 w X2〉 yes 〈w X1 w X2 , w X1 w X2 w〉 yes 〈w X2 w X1 , w X1 w X2 w〉 yes 〈w X1 w X2 , w X1 X2〉 yes 〈w X2 w X1 , w X1 X2〉 yes 〈w X1 w X2 , w X1 X2 w〉 yes 〈w X2 w X1 , w X1 X2 w〉 yes 〈w X1 w X2 , X1 w X2〉 yes 〈w X2 w X1 , X1 w X2〉 yes 〈w X1 w X2 , X1 w X2 w〉 yes 〈w X2 w X1 , X1 w X2 w〉 yes 〈w X1 w X2 , X1 X2 w〉 yes 〈w X2 w X1 , X1 X2 w〉 yes 〈X1 w X2 w , w X1 w X2〉 yes 〈X2 w X1 w , w X1 w X2〉 yes 〈X1 w X2 w , w X1 w X2 w〉 yes 〈X2 w X1 w , w X1 w X2 w〉 yes 〈X1 w X2 w , w X1 X2〉 yes 〈X2 w X1 w , w X1 X2〉 yes 〈X1 w X2 w , w X1 X2 w〉 yes 〈X2 w X1 w , w X1 X2 w〉 yes 〈X1 w X2 w , X1 w X2〉 yes 〈X2 w X1 w , X1 w X2〉 yes 〈X1 w X2 w , X1 w X2 w〉 no 〈X2 w X1 w , X1 w X2 w〉 yes 〈X1 w X2 w , X1 X2 w〉 yes 〈X2 w X1 w , X1 X2 w〉 yes 〈w X1 w X2 w , w X1 w X2〉 no 〈w X2 w X1 w , w X1 w X2〉 yes 〈w X1 w X2 w , w X1 w X2 w〉 no 〈w X2 w X1 w , w X1 w X2 w〉 yes 〈w X1 w X2 w , w X1 X2〉 no 〈w X2 w X1 w , w X1 X2〉 yes 〈w X1 w X2 w , w X1 X2 w〉 no 〈w X2 w X1 w , w X1 X2 w〉 yes 〈w X1 w X2 w , X1 w X2〉 no 〈w X2 w X1 w , X1 w X2〉 yes 〈w X1 w X2 w , X1 w X2 w〉 no 〈w X2 w X1 w , X1 w X2 w〉 yes 〈w X1 w X2 w , X1 X2 w〉 no 〈w X2 w X1 w , X1 X2 w〉 yes Table 2.1: Rule patterns included in a baseline hierarchical grammar. A pattern is obtained from a rule by replacing terminal sequences by the place- holder w. The decision whether to include each pattern was based on prelim- inary experiments in Arabic-English. Most beneficial patterns were added incrementally. Unless specified otherwise, this configuration is used in all subsequent translation experiments. CHAPTER 2. STATISTICAL MACHINE TRANSLATION 33 the example presented in Section 2.6.1. A shallow-1 grammar is defined as follows: h → 〈mPm〉 h → 〈hmP hm〉 m → 〈 sP s〉( sP s ∈ (i ∪ {k })+) m → 〈kP k 〉 k → 〈sP t〉(s ∈ i+P t ∈ i ∗) where h is the start symbol, i is the set of terminals and k is the set of nonterminals. There are two nonterminals apart from the start symbol: m and k . The rule type m → 〈 sP s〉 corresponds to all hierarchical rules. It is possible to apply this type of rule only once in any derivation. Indeed, the right hand side contains only one type of nonterminal, k , which can be rewritten only with a phrasal rule corresponding to the line k → 〈sP t〉. Note that for rules of the type k → 〈sP t〉, t can be the empty word, thus these rules, called deletion rules, allow deletion on the target side. Shallow-1 grammar are used for language pairs that do not present much reordering. Shallow-1 grammars were previously shown to work as well as full hierarchical grammars for the Arabic-English language pair (Iglesias et al., 2009a) and for the Spanish-English language pair (Iglesias et al., 2009c). 
In addition, shallow-1 grammars greatly reduce the search space of the decoder, resulting in much faster decoding, reduced memory use and potentially fewer search errors under the translation grammar.

2.6.3 Log-linear Model for Hierarchical Phrase-Based Translation

We now define in more detail the log-linear model for hierarchical translation, which usually makes a maximum (max) approximation, i.e. replaces a sum by its maximum term and assumes that the other terms are negligible. We follow the original description (Chiang, 2007). For a sentence pair (f, e), let us define \mathcal{D} as the set of possible derivations D of this sentence pair under a hierarchical grammar. We will use the following notation for a derivation D:

• the foreign yield f: we define f(D) = f

• the English yield e: we define e(D) = e

We can now derive the log-linear model for hierarchical translation:

\hat{e} = \arg\max_{e} p(e \mid f)
        = \arg\max_{e} \sum_{D \in \mathcal{D}} p(D, e \mid f) \quad \text{(marginalisation)}
        = \arg\max_{e} \max_{D \in \mathcal{D}} p(D, e \mid f) \quad \text{(max approximation)}
        = e\big( \arg\max_{e, D \in \mathcal{D}} p(D, e \mid f) \big)
        = e\big( \arg\max_{D : f(D) = f} p(D) \big) \quad (2.22)

Thanks to the max approximation, the distribution over derivations rather than the distribution over English sentences is modelled log-linearly, and we finally obtain the following decoding equation:

\hat{e} = e\Big( \arg\max_{D : f(D) = f} \exp\big( \sum_{m=1}^{M} \lambda_m h_m(D) \big) \Big) \quad (2.23)

One of the features, the language model, plays a particular role. The language model feature can be written as:

h_b(D) = p_{LM}(e(D)) \quad (2.24)

where b is the index of the language model feature, p_{LM} is the language model and e(D) is the English yield of the derivation D. The language model score cannot be computed by dynamic programming alone, since the language model needs word context; the language model feature is therefore typically computed after a parsing step.

Note that Equation 2.23 is an approximation and that there have been attempts to perform the marginalisation over the latent variable D while keeping the translation process tractable (Blunsom et al., 2008a; de Gispert et al., 2010a). This can give gains over the max approximation, although subsequent rescoring steps (see Section 2.10) can produce similar performance (de Gispert et al., 2010a).

2.6.4 Rule Extraction

We have so far given the definition of a hierarchical grammar and explained how it is used with statistical models. It is also necessary to extract an appropriate grammar, defined by its rules. The extraction is performed on a parallel corpus. The parallel corpus is first word-aligned, then rules are extracted from the alignment. We first extract phrase pairs as described in Section 2.5.3. For each extracted phrase pair ⟨f_{j_1}^{j_2}, e_{i_1}^{i_2}⟩, we define the rule X → ⟨f_{j_1}^{j_2}, e_{i_1}^{i_2}⟩. These rules are called initial rules. We extend the set of initial rules with the following recursion: given a rule X → ⟨γ, α⟩ and an initial rule X → ⟨f_{j_1}^{j_2}, e_{i_1}^{i_2}⟩ such that γ = γ_1 f_{j_1}^{j_2} γ_2 and α = α_1 e_{i_1}^{i_2} α_2, we extract the rule X → ⟨γ_1 X γ_2, α_1 X α_2⟩. Note that γ_1 or γ_2, but not both, can be the empty word, and similarly for α_1 and α_2.

2.6.5 Features

The following features are commonly used in log-linear models for machine translation:

• Source-to-target and target-to-source translation models. As described above, the translation process produces a derivation D. A derivation D can be seen as a sequence of n rules X → ⟨γ_1, α_1⟩, ..., X → ⟨γ_n, α_n⟩. The source-to-target translation model is then simply \prod_{i=1}^{n} p(\alpha_i \mid \gamma_i), where the p(\alpha_i \mid \gamma_i) are typically estimated by relative frequency, based on the occurrences of phrasal and hierarchical rules in the parallel corpus. The target-to-source translation model is symmetric.

• Source-to-target and target-to-source lexical translation models. Typically, the source-to-target lexical translation model lex(f_1^J, e_1^I) is computed as follows for a foreign sentence f_1^J translated into the English sentence e_1^I:

lex(f_1^J, e_1^I) = \frac{1}{(I+1)^J} \prod_{j=1}^{J} \sum_{i=0}^{I} p_1(e_i \mid f_j) \quad (2.25)

where p_1 is the word-to-word translation probability in IBM Model 1 and e_0 is the null word. The target-to-source lexical translation model is symmetric. One reason to use lexical models in addition to translation models is that translation models are relatively sparse compared to lexical models, so lexical models can smooth the translation models.

• Number of uses of the glue rule in a derivation. This feature trades off monotonic translation against reordering.

• Word insertion penalty. This feature controls the length of the output.

• Rule insertion penalty. This feature controls the number of rules used in a derivation to generate a translation. Using many rules is closer to word-to-word translation while using few rules means that longer phrase pairs are used. This feature is similar to the phrase penalty in phrase-based statistical machine translation (Koehn, 2010).

• Word deletion scale factor. This feature controls the number of times a deletion rule is applied.

• Rule count feature. This feature indicates whether a rule occurs once, twice, or more than twice in the parallel data (Bender et al., 2007). It thus indicates how reliable a rule is.

The feature weights are optimised using minimum error rate training (Och, 2003) under the BLEU score (Papineni et al., 2002) (see also Section 2.8.1).

2.7 Language Modelling

We mentioned in Section 2.6.3 that a language model is one of the features of log-linear models for translation. This feature is critical as it helps to obtain fluent translation output. A language model is a probability distribution over word sequences. It can be used to assign a probability to a sequence of words or to predict the word most likely to appear in a certain context. Language models have applications in fields where the goal is to produce fluent output, such as automatic speech recognition or machine translation. n-gram language models are typically used because they are robust, can easily be trained on large amounts of data and model local grammatical relationships. Since they model neither the structure of language nor long-distance relationships between words, work has been conducted to overcome these limitations (Shen et al., 2008). We first review n-gram language models, then review different smoothing techniques, and finally describe in more detail two smoothing techniques that are used for experiments in this thesis: Kneser-Ney smoothing and Stupid Backoff smoothing.

2.7.1 n-gram language models

For simplicity, let us first consider the case of a bigram language model. Let us consider a vocabulary V and the two special symbols <s> and </s>, corresponding respectively to the start-of-sentence symbol and the end-of-sentence symbol. We define W = V ∪ {<s>, </s>}. Let us now define the Markov chain (X_i)_{i \in \mathbb{N}} with values in W and transition probability p.
p has the following properties:

• p(X_0 = <s>) = 1: we always start a sentence with the start-of-sentence symbol.

• p(X_{n+1} = </s> \mid X_n = </s>) = 1: once we reach the end of a sentence, we stay in the end-of-sentence state. This is because we do not consider infinite word sequences.

• p(X_{n+1} = <s> \mid X_n = <s>) = 0: we cannot stay in the start-of-sentence state and have to transition to either a word or the end-of-sentence state.

A bigram model is defined by the conditional independence assumptions of the Markov chain (X_i) and the transition probability p. A bigram model therefore assigns the probability p(w) in Equation 2.26 to a sequence of words w = w_1 ... w_n:

p(w) = \prod_{i=1}^{n+1} p(w_i \mid w_{i-1}) \quad (2.26)

with the convention w_0 = <s> and w_{n+1} = </s>. An n-gram language model can be defined similarly: this time the random variables X_i take values in W^{n-1} instead of W. An n-gram model assigns the probability p(w) in Equation 2.27 to a sequence of words w = w_1 ... w_n:

p(w) = \prod_{i=1}^{n+1} p(w_i \mid w_{i-n+1}^{i-1}) \quad (2.27)

with the same convention that w_0 = <s> and w_{n+1} = </s>. Parameters can be trained using maximum likelihood estimation, so the parameter p(w_i \mid w_{i-n+1}^{i-1}) is computed as in Equation 2.28:

p(w_i \mid w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i})}{c(w_{i-n+1}^{i-1})} \quad (2.28)

where c(\cdot) counts the number of occurrences of a particular word sequence in the training data. Maximum likelihood estimation assigns zero probability to unseen events, therefore different smoothing strategies have been explored to address this problem.

2.7.2 Back-off and Interpolated Models

Smoothing strategies for language modelling make use of lower-order distributions, either by backoff or by interpolation. The general form of a backoff model (Chen and Goodman, 1998) is presented in Equation 2.29:

p_{\text{backoff}}(w_i \mid w_{i-n+1}^{i-1}) =
\begin{cases}
\alpha(w_i \mid w_{i-n+1}^{i-1}) & \text{if } c(w_{i-n+1}^{i}) > 0 \\
\gamma(w_{i-n+1}^{i-1})\, p_{\text{backoff}}(w_i \mid w_{i-n+2}^{i-1}) & \text{if } c(w_{i-n+1}^{i}) = 0
\end{cases} \quad (2.29)

where c is the occurrence count of an n-gram in the monolingual training corpus. The conditions c(w_{i-n+1}^{i}) > 0 and c(w_{i-n+1}^{i}) = 0 can be replaced by c(w_{i-n+1}^{i}) ≥ cutoff and c(w_{i-n+1}^{i}) < cutoff respectively when n-grams occurring fewer times than the cutoff threshold are ignored in the training data. The general form of an interpolated model (Chen and Goodman, 1998) is presented in Equation 2.30:

p_{\text{interpolate}}(w_i \mid w_{i-n+1}^{i-1}) = \lambda_{w_{i-n+1}^{i-1}}\, p_{\text{ML}}(w_i \mid w_{i-n+1}^{i-1}) + (1 - \lambda_{w_{i-n+1}^{i-1}})\, p_{\text{interpolate}}(w_i \mid w_{i-n+2}^{i-1}) \quad (2.30)

where:

• p_{\text{ML}} is the maximum likelihood estimate.

• \lambda_{w_{i-n+1}^{i-1}} is the interpolation parameter.

The difference between backoff and interpolated models is that interpolated models make use of lower-order distributions even when the n-gram count is greater than zero. However, an interpolated model can be written in the form of a backoff model. Let us define \alpha(w_i \mid w_{i-n+1}^{i-1}) as in Equation 2.31:

\alpha(w_i \mid w_{i-n+1}^{i-1}) = p_{\text{interpolate}}(w_i \mid w_{i-n+1}^{i-1}) \quad (2.31)

and \gamma(w_{i-n+1}^{i-1}) as in Equation 2.32:

\gamma(w_{i-n+1}^{i-1}) = 1 - \lambda_{w_{i-n+1}^{i-1}} \quad (2.32)

By plugging the definitions of \alpha(w_i \mid w_{i-n+1}^{i-1}) and \gamma(w_{i-n+1}^{i-1}) into Equation 2.29, we can see that it is possible to write an interpolated model as a backoff model. This observation is trivial but it is useful in practice in order to make use of the ARPA file format,3 which only supports backoff models.

2.7.3 Modified Kneser-Ney Smoothing

In this section, we present interpolated modified Kneser-Ney smoothing (Chen and Goodman, 1998), which is the most popular smoothing strategy used for machine translation.
The motivation for Kneser-Ney smoothing (Kneser and Ney, 1995), as given by Chen and Goodman (1998), is that the use of lower-order distributions for smoothing needs to take into account information from the higher-order distribution. For example, let us consider a bigram model. We want to assign a probability to the word Francisco given its previous word, say hello. If we have not seen the bigram hello Francisco in the training corpus, we need to back off to the unigram Francisco. Suppose that the unigram Francisco is very frequent in our corpus but that it is only seen in the context of the bigram San Francisco. If we simply use the maximum likelihood estimate of the unigram Francisco, we will obtain a relatively high probability for the bigram hello Francisco. This should not be the case, because the corpus provides evidence that Francisco is very unlikely to follow any word other than San. The modified Kneser-Ney smoothed probability is defined in Equation 2.33:

p_{\text{KN}}(w_i \mid w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i}) - D(c(w_{i-n+1}^{i}))}{c(w_{i-n+1}^{i-1})} + \gamma(w_{i-n+1}^{i-1})\, p_{\text{KN}}(w_i \mid w_{i-n+2}^{i-1}) \quad (2.33)

where D and \gamma are defined in Equation 2.34 and Equation 2.35:

D(c) =
\begin{cases}
0 & \text{if } c = 0 \\
D_1 & \text{if } c = 1 \\
D_2 & \text{if } c = 2 \\
D_{3+} & \text{if } c \geq 3
\end{cases} \quad (2.34)

\gamma(w_{i-n+1}^{i-1}) = \frac{D_1 N_1(w_{i-n+1}^{i-1}\,\bullet) + D_2 N_2(w_{i-n+1}^{i-1}\,\bullet) + D_{3+} N_{3+}(w_{i-n+1}^{i-1}\,\bullet)}{c(w_{i-n+1}^{i-1})} \quad (2.35)

N_1, N_2 and N_{3+} are defined in Equation 2.36:

N_1(w_{i-n+1}^{i-1}\,\bullet) = |\{w_i : c(w_{i-n+1}^{i-1} w_i) = 1\}|
N_2(w_{i-n+1}^{i-1}\,\bullet) = |\{w_i : c(w_{i-n+1}^{i-1} w_i) = 2\}|
N_{3+}(w_{i-n+1}^{i-1}\,\bullet) = |\{w_i : c(w_{i-n+1}^{i-1} w_i) \geq 3\}| \quad (2.36)

3 http://www.speech.sri.com/projects/srilm/manpages/ngram-format.5.html

2.7.4 Stupid Backoff Smoothing

The Stupid Backoff smoothing scheme (Brants et al., 2007) is similar to the general backoff smoothing scheme of Equation 2.29 and is presented in Equation 2.37:

s_{\text{stupid backoff}}(w_i \mid w_{i-n+1}^{i-1}) =
\begin{cases}
\frac{c(w_{i-n+1}^{i})}{c(w_{i-n+1}^{i-1})} & \text{if } c(w_{i-n+1}^{i}) > 0 \\
\alpha\, s_{\text{stupid backoff}}(w_i \mid w_{i-n+2}^{i-1}) & \text{if } c(w_{i-n+1}^{i}) = 0
\end{cases} \quad (2.37)

We use the notation s to indicate that the Stupid Backoff smoothed score is not a probability. We note the following differences with respect to the traditional backoff scheme:

• The non-backed-off score uses no discounting and is simply the relative frequency.

• The backed-off score scaling parameter \alpha is constant, i.e. independent of the n-gram history.

• As a result, Stupid Backoff does not define a probability distribution.

Stupid Backoff smoothing was designed to fit the MapReduce framework and to allow very large language models to be built (see Section 3.1). This scheme is used in subsequent translation rescoring experiments throughout this thesis.

2.8 Optimisation

2.8.1 Evaluation Metrics

Evaluating the quality of machine translation output is a challenge in that very often many possible translations are acceptable and of equal quality. Human experts seem to be the most qualified for this task; however, their services are expensive and time consuming. Automatic metrics have therefore been designed in order to allow rapid development and comparison of machine translation systems. We describe the BLEU metric since it is the most widely used metric and because we use it extensively in our experiments. We refer to the literature for alternative metrics such as METEOR (Banerjee and Lavie, 2005), NIST (Doddington, 2002) or TER (Snover et al., 2006).
The BiLingual Evaluation Understudy (BLEU) score (Papineni et al., 2002) is defined as the geometric mean of n-gram modified precisions with respect to one or several reference translations, multiplied by a brevity penalty. More precisely, given S input sentences s_1, ..., s_S, we consider a set of S candidate translations c_1, ..., c_S. For each input sentence s_i, there is a finite number R_i of reference translations r_{i,1}, ..., r_{i,R_i}. The reference translation that is closest in length to the candidate translation c_i is denoted r_{b_i}. The BLEU metric is defined as follows:

\text{BLEU}(c_1, ..., c_S, r_{1,1}, ..., r_{1,R_1}, ..., r_{S,1}, ..., r_{S,R_S}) = \text{BP} \cdot \prod_{n=1}^{N} p_n^{\frac{1}{N}} \quad (2.38)

BP is the brevity penalty and is defined as:

\text{BP} = \exp\Big( \min\big(0,\ 1 - \frac{\sum_{i=1}^{S} |r_{b_i}|}{\sum_{i=1}^{S} |c_i|}\big) \Big) \quad (2.39)

p_n is the modified n-gram precision and is defined as:

p_n = \frac{\sum_{i=1}^{S} \sum_{g \in \text{n-grams}(c_i)} \text{count}_{\text{clip}}(g, c_i, r_{i,1}, ..., r_{i,R_i})}{\sum_{i=1}^{S} \sum_{g \in \text{n-grams}(c_i)} \text{count}(g, c_i)} \quad (2.40)

where g is an n-gram, count(g, c_i) is the number of times g occurs in c_i and

\text{count}_{\text{clip}}(g, c_i, r_{i,1}, ..., r_{i,R_i}) = \min\big( \text{count}(g, c_i),\ \max(\text{count}(g, r_{i,1}), ..., \text{count}(g, r_{i,R_i})) \big) \quad (2.41)

The maximum n-gram order is usually N = 4. The BLEU metric can be computed efficiently, is language independent and correlates well with human judgements, especially for statistical translation systems.

2.8.2 Minimum Error Rate Training

Minimum error rate training (Och, 2003) is a method to optimise the feature weights of a log-linear model for translation with respect to a particular evaluation metric. We present the method assuming that the evaluation metric is BLEU. The objective function is defined as follows:

\hat{\lambda}_1^M = \arg\max_{\lambda_1^M} \text{BLEU}(c(\lambda_1^M), r) \quad (2.42)

where c(\lambda_1^M) are the candidate translations obtained using feature weights \lambda_1^M and r are the reference translations. Since computing c(\lambda_1^M) requires finding the most likely translations given a set of feature weights, this optimisation problem is complex: it requires two argmax operations. Och therefore approximates the search space of translations with n-best lists produced by a decoder. The optimisation problem is then solved by computing, along a particular direction in feature space, the intervals on which the BLEU score is constant, and choosing for each direction the optimal feature set. It was found that optimising the feature weights consistently with the evaluation metric, for example BLEU, could significantly improve translation quality as measured by that metric.

2.9 Decoding with Finite State Transducers

We use HiFST, a decoder based on finite state transducers (Iglesias et al., 2009b; Iglesias, 2010). The decoder solves Equation 2.23 in three steps. In the first step, the source sentence is parsed using a variant of the CYK algorithm (Chappelier and Rajman, 1998). Each cell in the CYK grid contains a nonterminal symbol, a span represented by a start and end index, and backpointers to other cells in the grid. In the second step, a lattice (Ueffing et al., 2002), i.e. a compact representation of the possible translations as a directed acyclic graph whose edges are words, is built recursively by following the backpointers in the grid. In the third step, a language model represented as an acceptor is composed with the original lattice to give translations a language model score.

2.10 Lattice Rescoring

An advantage of the HiFST decoder (see Section 2.9) is that it directly generates translation lattices that can be rescored in order to improve translation quality.
Rescoring methods are used in machine translation when integrating a model directly into the decoder is challenging in terms of memory use. 5-gram language model rescoring and lattice minimum Bayes' risk rescoring are two such methods. On the other hand, if scalability issues can be overcome, integrating complex models into first-pass decoding may provide a simpler and easier to use system. Furthermore, because larger parts of the decoding search space may be explored, single-pass methods may also provide translation quality gains; however, this is not guaranteed unless there is no pruning during decoding.

2.10.1 5-gram Language Model Lattice Rescoring

Brants et al. (2007) showed that using vast amounts of data for training language models can improve translation. However, using high-order n-gram language models trained on a large dataset in decoding can create memory problems. A solution is to use such models to rescore the output lattice of a first-pass translation. These first-pass lattices are generated using a lower-order language model. The language model we use for rescoring is a 5-gram zero-cutoff Stupid Backoff language model (see Section 2.7.4). To build this model, n-grams of order up to 5 and containing words in the target side of the parallel corpus are extracted in parallel from the monolingual data, resulting in an n-gram count file. Then, n-grams from the first-pass translation lattice are extracted using counting transducers (Allauzen et al., 2003). Finally, n-grams from the count file are filtered according to the n-grams from the first-pass translation lattice. A sentence-specific Stupid Backoff (Brants et al., 2007) 5-gram language model is built using these n-grams and constitutes a feature along with the first-pass language model, the translation model and a length penalty in a log-linear model that is used for lattice rescoring. The weights are optimised for BLEU on a development set.

2.10.2 Lattice Minimum Bayes' Risk Rescoring

In Section 2.2, we presented the general objective function for translation, repeated here in Equation 2.43, as a decision that minimises the number of errors (Bishop, 2006, p. 39-40):

\hat{e} = \arg\max_{e} p(e \mid f) \quad (2.43)

This objective function is a particular case of Minimum Bayes-Risk (MBR) decoding for SMT (Kumar and Byrne, 2004). MBR takes the general form in Equation 2.44:

\hat{e} = \arg\min_{e' \in \mathcal{E}} \sum_{e \in \mathcal{E}} L(e, e')\, P(e \mid f) \quad (2.44)

where:

• f is the source sentence

• e and e' are target sentences

• \mathcal{E} is the set of possible translations

• L is a loss function

If the loss function L is defined as a zero-one loss function as in Equation 2.45:

\forall e, e' \quad L(e, e') =
\begin{cases}
-1 & \text{if } e = e' \\
0 & \text{otherwise}
\end{cases} \quad (2.45)

then we obtain the objective defined in Equation 2.43. In that sense, the minimum error objective is a special case of the MBR objective. If translation quality is measured by BLEU, then an appropriate choice for L is 1 − SBLEU, where SBLEU is the sentence-level BLEU score, defined as BLEU computed on a corpus containing only one sentence pair. Kumar and Byrne (2004) use this definition of SBLEU for experiments on n-best list MBR rescoring. Even though SBLEU between two hypotheses might be zero, it is less likely that a particular hypothesis will have an expected loss of zero. For other applications such as metric evaluation, it is beneficial to introduce smoothing for sentence-level BLEU (Lin and Och, 2004). Kumar and Byrne (2004) demonstrate gains in translation quality when rescoring n-best lists with the MBR objective function. Gains measured by the metric used to define the loss function are of course larger.

n-best list MBR rescoring requires O(n) sentence-level BLEU computations for each hypothesis, and therefore O(n^2) sentence-level BLEU computations for each source sentence. This restricts the depth of the n-best list; for example, Kumar and Byrne (2004) use a 1000-best list, but n could be a very large number. Tromble et al. (2008) remedy this issue by introducing a lattice minimum Bayes' risk decoding technique where \mathcal{E} is a lattice of possible translations rather than an n-best list. They use Equation 2.46 as an approximation of Equation 2.44:

\hat{e} = \arg\max_{e' \in \mathcal{E}} \Big( \theta_0 |e'| + \sum_{u \in \mathcal{N}} \theta_u\, \#_u(e')\, p(u \mid \mathcal{E}) \Big) \quad (2.46)

where:

• \mathcal{N} is the set of all n-grams in the lattice \mathcal{E}

• \theta_u is an n-gram specific constant

• \#_u(e') is the number of times u occurs in e'

• p(u \mid \mathcal{E}) is the posterior probability of the n-gram u in the lattice \mathcal{E}

Decoding is performed using weighted finite state transducers (Tromble et al., 2008). A more efficient implementation was designed subsequently (Blackwood et al., 2010; Blackwood, 2010).

2.10.3 Lattice Minimum Bayes-Risk for System Combination

Machine translation system combination is a paradigm that exploits the differences in strengths and weaknesses of different machine translation systems. Given an input source sentence to be translated, the source sentence is translated by several MT systems, each producing one or more translations. These translations are then rescored and/or combined in the hope of producing a final translation of equal or better quality than the translations of any of the individual systems.

System combination techniques can be classified by whether the final translation can be a novel hypothesis that has not been generated by any of the individual systems. System combination techniques that can generate a novel hypothesis usually employ consensus network decoding (Fiscus, 1997). In consensus network decoding for MT, n-best lists obtained from individual systems are aligned to a reference, e.g. the MBR hypothesis, and the final hypothesis is then obtained by majority voting. In practice, it is difficult to apply consensus network decoding to deep n-best lists or to lattices because of the expensive alignment operation.

Another broad class of system combination techniques, which do not generate novel hypotheses, includes n-best list rescoring as well as lattice rescoring techniques. Blackwood (2010) argues that one possible advantage over consensus network decoding is that fluency is more likely to be preserved and that it is possible to exploit larger hypothesis spaces coming from the individual systems. It is possible to extend the lattice Minimum Bayes-Risk decoding strategy presented in Section 2.10.2 to system combination. We consider two systems that produce two translation lattices \mathcal{E}_1 and \mathcal{E}_2. The statistics p(u \mid \mathcal{E}) are computed by interpolation. The decoding equation becomes:

\hat{e} = \arg\max_{e' \in \mathcal{E}_1 \cup \mathcal{E}_2} \Big( \theta_0 |e'| + \sum_{u \in \mathcal{N}_1 \cup \mathcal{N}_2} \theta_u\, \#_u(e') \big( \lambda_1 p(u \mid \mathcal{E}_1) + \lambda_2 p(u \mid \mathcal{E}_2) \big) \Big) \quad (2.47)

The weights (\lambda_1, \lambda_2) satisfy \lambda_1 + \lambda_2 = 1 and are tuned for BLEU on a development set.

Chapter 3

Data Structures for Hierarchical Phrase-Based Translation Grammars

We have seen in Chapter 2 that the main components of an SMT system are the translation model and the language model.
The translation model is trained on parallel text while the language model is typically trained on the target side of the parallel text and possibly additional monolingual target text. It has been shown that increasing the amount of parallel text (Pino et al., 2010) and increasing the amount of monolingual text (Brants et al., 2007) is beneficial for translation quality.

Monolingual and parallel text availability and distributed computing techniques have allowed SMT researchers to build ever larger language models and translation models, based on very large amounts of data. However, decoders need to be able to retrieve information, such as n-gram conditional probabilities or translation grammar rules, efficiently from these models in order to translate an input sentence or a set of input sentences in a reasonable amount of time. For online systems such as the commercial systems mentioned in Chapter 1, translation even needs to be carried out in real time, for example at a speed of 2000 words translated per minute. SMT models therefore need to be stored in a data structure that supports efficient querying and is scalable. We introduce a solution that addresses these two requirements and has the advantage of reusing an existing framework, which eases the implementation. We store translation models as a set of key-value pairs in an HFile.1 We apply this strategy in order to retrieve relevant rules from a hierarchical phrase-based grammar at the test set level. We compare our approach to alternative strategies and show that it offers competitive performance in terms of speed, memory and simplicity (Pino et al., 2012). We have made the software written for this work available online.2

Figure 3.1: Number of English tokens (in millions) in parallel and monolingual data available for the WMT French-English constrained track translation shared task for the years 2006 to 2013.

3.1 Introduction

Current machine translation research is characterised by increasing amounts of available data. For example, Figure 3.1 shows that for the WMT machine translation workshop (Bojar et al., 2013) French-English constrained track translation shared task, the English side of the parallel data has increased from 13.8M tokens in 2006 to 1012.7M tokens in 2013, and that the available English monolingual data has increased from 27.5M tokens to 6852.7M tokens over the same period. Along with growing amounts of data, the use of more powerful computers and distributed computing models such as MapReduce (Dean and Ghemawat, 2008; Lin and Dyer, 2010) has enabled machine translation researchers to build larger statistical machine translation models. MapReduce has thus been applied to various steps of the translation pipeline: word alignment (Dyer et al., 2008), translation model building (Dyer et al., 2008) and language modelling (Brants et al., 2007). We review these applications of MapReduce in more detail in Section 3.2. The challenge is to find effective modelling techniques that can be implemented in the MapReduce framework.

1 http://hbase.apache.org
2 https://github.com/jmp84/ruleXtract (branch PBML-2012)
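As a preview of the key-value view of translation grammars taken in this chapter, the toy sketch below stores rules keyed by their source side and forms a query set from the source n-grams of a test sentence; it anticipates the Salut toi example discussed next. The rule entries and scores are invented for illustration, and the actual storage format is the HFile described in Section 3.3.

```python
# Toy key-value view of a grammar: source side -> list of (target side, score).
# All entries and scores below are invented for illustration only.
grammar = {
    "Salut": [("Hi", 0.6), ("Hello", 0.4)],
    "toi": [("you", 1.0)],
    "Salut toi": [("Hi there", 0.7)],
}

def query_keys(sentence, max_len=3):
    """All source n-grams of a test sentence, i.e. the keys a decoder may need."""
    words = sentence.split()
    return {" ".join(words[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(words) - n + 1)}

test_sentence = "Salut toi"
relevant = {k: grammar[k] for k in sorted(query_keys(test_sentence)) if k in grammar}
print(relevant)
```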
However, once SMT models such as language models and translation grammars are built, either with MapReduce or some other method, the mod- els must be made usable by translation decoders, which only need a fraction of the information contained in those models to be able to translate an input source sentence or a set of input source sentences. For example, in translation from French to English with a phrase-based model (see Section 2.5), given an input sentence Salut toi, we only need to know the translation probabilities the translation model assigns to translations of the words Salut and toi and to translations of the phrase Salut toi. Similarly, given a test set, a decoder only needs to retrieve the rules whose source side matches part of one of the source sentences in the test set to be able to generate hypotheses. With large models, simply retrieving relevant rules together with their translation probabilities becomes a challenge. In the system described by Iglesias et al. (2009b), given a training parallel corpus, rules are extracted and target phrases and hierarchical phrases with counts are stored on disk. Then, given an input test set, rules are extracted from the parallel corpus a second time. Only rules relevant to the test set are retained: source sides of phrase-based rules have to match an n-gram from the test set and consecutive terminals in source sides of hierarchical rules also have to match an n-gram from the test set. As a consequence, some hierarchical rules are never used in decoding. Source sides and rules together with their counts are kept in memory using hash map datastructures, target sides with counts are loaded from disk as precomputed in the first step, so that source-to-target and target-to-source probabilities can be computed. This method becomes progressively slower with larger amounts of data because it requires extracting rules from the training parallel corpus and recomputing translation probabilities each time we translate a new test set or a new sentence. We would like to improve on this design for more rapid experimentation. We also would like to use a computing infrastructure that CHAPTER 3. DATA STRUCTURES FOR HIERO GRAMMARS 50 requires minimal maintenance. For example, HBase3, the open source non- relational database implementation of BigTable (Chang et al., 2008), has been applied to the use of distributed language models (Yu, 2008). Rule filtering, i.e. the retrieval of rules relevant to the translation of a test set or a sentence, is an essential step in our pipeline that can be a bottleneck. Our goal is to reduce its processing time from several hours to a few minutes. In this chapter, we address the problem of retrieving relevant rules with their translation probabilities. We report on investigations into storing the model in the HFile data structure. To our knowledge, this is the first de- tailed proposed implementation of translation model storage and filtering using HFile data structures. We find that this approach offers a good com- promise between speed, performance and ease of implementation. The in- frastructure necessary for filtering is lightweight and requires the use of only one machine. We will apply this approach to test set rule filtering prior to decoding. We will discuss alternative strategies as well as their strengths and weaknesses in terms of speed and memory usage. In Section 3.2, we will review approaches that have been used for model building and model filtering. 
The HFile data structure that is used to store models will be pre- sented in Section 3.3. In Section 3.4, we detail our custom implementation of rule extraction and translation model estimation using MapReduce. Our method and alternative strategies for rule filtering are compared empirically in Section 3.5. We conclude in Section 3.6. 3.2 Related Work In this section, we review applications of MapReduce to the translation pipeline as well as techniques for storing and retrieving from SMT models. 3.2.1 Applications of MapReduce to SMT Brants et al. (2007) introduce a new smoothing method called Stupid Backoff (see Section 2.7.4). The Stupid Backoff smoothing scheme is recalled in 3 http://hbase.apache.org CHAPTER 3. DATA STRUCTURES FOR HIERO GRAMMARS 51 Equation 3.1: pstupid backoff(wi | wi−)i−n+)) =  n(wii−G+1) n(wi−1i−G+1) if x(wii−n+)) S 0 pstupid backoff(wi | wi−)i−n+2) if x(wii−n+)) = 0 (3.1) With respect to the traditional backoff scheme, the Stupid Backoff scheme uses no discounting and simply uses relative frequency for the non backed-off score and the backed-off score scaling parameter is independent of the n- gram history. Therefore this scheme does not define a conditional probability distribution over a word given its history. The Stupid Backoff language model building and application fit the MapRe- duce framework well. The input to language model building is a large mono- lingual corpus. The first step is to build a vocabulary. This is done with the canonical example of word counting with MapReduce. Counts are needed in order to remove words occurring less than a threshold from the vocabu- lary. The second step is to obtain n-grams and their count. This is done again with MapReduce, and the map and reduce functions are analogous to the ones defined for word counting but this time unigrams are replaced by n-grams. For n-gram counting, the partition, or sharding, function hashes on the first two words of each n-gram. In addition, unigram counts and the total size of the corpus is available in each partition, i.e. to each reducer. This allows relative frequencies to be computed. Brants et al. (2007) also demonstrate that for large amounts of monolingual data, i.e. above 10 bil- lion tokens, Stupid Backoff smoothing and Kneser-Ney smoothing perform comparably. In addition, only Stupid Backoff smoothing can be scaled to datasets with more than 31 billion tokens. The scalability of Kneser-Ney smoothing has been improved in recent work (Heafield et al., 2013).4 Dyer et al. (2008) observe that translation model estimation has become prohibitive on a single core and that existing ad hoc parallelisation algo- rithms may be more fragile than using an existing framework such as the Hadoop implementation of MapReduce.5 They provide solutions to word alignment model estimation and translation rule extraction and estimation using MapReduce and demonstrate the scalability of their method. The convenience of the MapReduce framework for parallelisation has led to the building of end-to-end toolkits for entire phrase-based (Gao and Vogel, 4 see also http://kheafield.com/code/kenlm/estimation/, Scalability section 5 https://hadoop.apache.org/ CHAPTER 3. DATA STRUCTURES FOR HIERO GRAMMARS 52 2010) and hierarchical phrase-based models (Venugopal and Zollmann, 2009) for translation using the MapReduce framework. 
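As a minimal illustration of the map and reduce functions used for n-gram counting in Section 3.2.1, the sketch below simulates the two functions in plain Python, without Hadoop; the in-memory grouping step is a simplification of the framework's shuffle phase, and the tiny corpus is invented for illustration.

```python
from collections import defaultdict

def ngram_map(sentence, order=3):
    """Map step: emit (n-gram, 1) for every n-gram up to the given order."""
    words = sentence.split()
    for n in range(1, order + 1):
        for i in range(len(words) - n + 1):
            yield (" ".join(words[i:i + n]), 1)

def ngram_reduce(key, values):
    """Reduce step: sum the counts emitted for one n-gram."""
    return key, sum(values)

# Simulate the shuffle phase that groups map outputs by key.
corpus = ["the cat sat", "the cat ran"]
grouped = defaultdict(list)
for sentence in corpus:
    for key, value in ngram_map(sentence):
        grouped[key].append(value)

counts = dict(ngram_reduce(k, v) for k, v in grouped.items())
print(counts["the cat"])  # 2
```

In the setting of Brants et al. (2007), the same pattern is run over a sharded corpus, with the partition function hashing on the first two words of each n-gram so that relative frequencies can be computed within a partition.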
3.2.2 SMT Models Storage and Retrieval Solutions We now review techniques appearing in the literature that have been used to store SMT models and to retrieve the information needed in translation from these models. SMT models are usually discrete probabilistic models and can therefore be represented as a set of key-value pairs. To obtain relevant information from a model stored in a certain data structure, a set of keys called a query set is formed; each key in this query set is then sought in that datastructure. Strategies include: • storing the model as a simple data structure in memory • storing the model in a text file • storing the model in more complicated data structures such as tries (Fred- kin, 1960) (in memory or disk) • storing fractions of the entire model • storing data as opposed to a precomputed model • storing models in a distributed fashion Each of these strategies is discussed below. In some cases, it may be possible to fit a model into RAM. In this case the model can be stored as a memory associative array, such as a hash ta- ble. In-memory storage allows allows for faster query retrieval than on-disk storage, however only smaller models will fit in memory. In-memory storage has been used to store model parameters between iterations of expectation- maximisation for word alignment (Dyer et al., 2008; Lin and Dyer, 2010). For larger models, the set of key-value pairs can be stored as a table in a single text file on local disk. Values for keys in the query set can be retrieved by scanning through the entire file. For each key in the file, its membership is tested in the query set. This is the approach adopted in the Joshua 5.0 decoder (Post et al., 2013)6, which uses regular expressions or 6 Inferred from the decoder training scripts available at http://joshua-decoder.org/ 5.0/index.html. CHAPTER 3. DATA STRUCTURES FOR HIERO GRAMMARS 53 n-grams to test membership (see Section 3.5.4). Venugopal and Zollmann (2009) use MapReduce to scan a file concurrently: the map function tests if the vocabulary of a rule matches the vocabulary of a test set. The model can also be stored using a trie associative array (Fredkin, 1960). A trie is a type of tree where each node represents a shared prefix of a set of keys represented by the child nodes. Each node only stores the prefix it represents. The keys are therefore compactly encoded in the structure of the trie itself. Querying the trie is a O(log(n)) operation, where n is the number of keys in the dataset. The trie may also be small enough to fit in physical memory to further reduce querying time. Zens and Ney (2007) use tries to store a phrase-based grammar. All the source phrases are represented in a trie stored on disk. Only relevant parts of the trie are loaded into memory when a source phrase is sought. Ganitkevitch et al. (2012) extend this approach to store hierarchical phrase-based grammars. Both source sides and target sides are stored in a packed representation of a trie. Packed tries have also been applied for storing language models (Pauls and Klein, 2011; Heafield, 2011). It is also possible to create a much smaller approximate version of the model. Talbot and Osborne (2007a) represent a set of n-grams as a Bloom filter (Bloom, 1970). They first use a standard Bloom filter to define a binary feature that indicates whether an n-gram was seen in a monolingual corpus. 
They also use a Bloom filter to encode n-grams together with quantised counts in order to define a multinomial feature, that is a feature with a finite set of possible values—in this case the quantised values. Both these data structures can substantially reduce the disk space and memory usage with respect to lossless representations of language models. However, they allow false positives for n-gram membership queries and overestimates of n-gram quantised counts. The authors demonstrate that the features they introduce are useful in translation, despite this lack of exactness. In a related pub- lication (Talbot and Osborne, 2007b), the authors demonstrate how these techniques can be used as a replacement to represent a smoothed language model. Talbot and Brants (2008) use a Bloomier filter (Chazelle et al., 2004) to represent n-grams and necessary statistics such as probabilities and back- off weights. Unlike Bloom filters that only support Boolean characteristic functions on a set, Bloomier filters support arbitrary functions, in this case a mapping between an n-gram and its statistics. False positives can still occur but for true positives, correct statistics are retrieved. CHAPTER 3. DATA STRUCTURES FOR HIERO GRAMMARS 54 Guthrie and Hepple (2010) propose an extension to previous work on randomised language models (Talbot and Osborne, 2007b) which prevents the random corruption of model parameters but does not stop the random assignment of parameters to unseen n-grams. Levenberg and Osborne (2009) extend randomised language models to stream-based language models. An- other way of building a smaller approximate version of a model is to retain items with high frequency counts from a stream of data (Manku and Mot- wani, 2002). This technique has been applied to language modelling (Goyal et al., 2009) and translation rule extraction (Przywara and Bojar, 2011). Instead of doing some precomputation on a dataset, it is possible to com- pute the sufficient statistics at query time using a suffix array (Manber and Myers, 1990), so that the model can be estimated only when needed. A suf- fix array is a sequence of pointers to each suffix in a training corpus. The sequence is sorted with respect to the lexicographic order of the referenced suffixes. Suffix arrays have been used for computing statistics for language models (Zhang and Vogel, 2006), phrase-based systems (Callison-Burch et al., 2005; Zhang and Vogel, 2005), and hierarchical phrase-based systems (Lopez, 2007). Callison-Burch et al. (2005) store both the source side and the target side of a parallel corpus in two suffix arrays. They also maintain an index between a position in the source or target side of the corpus and the sentence number. During decoding, all occurrences of a source phrase are located in the suffix array representing the source side of the corpus. This produces a source marginal count. For each of these occurrences, the corresponding sentence pair and word alignment are retrieved and rules are extracted. This produces a rule count, which is normalised with the source marginal count to produce a source-to-target probability. Lopez (2007) extends this approach to address the grammar extraction and estimation of hierarchical rules. Note that the use of suffix arrays for translation model estimation only supports the computation of source-to-target probabilities. Finally, some approaches store language models in a distributed fashion. We saw in Section 3.2.1 how a Stupid Backoff language model can be built. 
After relative frequencies have been computed, another partition function that hashes on the last two words of an n-gram is applied so that backoff operations can be done within the same partition. At decoding time, the decoder architecture is modified in order to request a batch of n-grams rather than a single n-gram. The Stupid Backoff language model building can be scaled to a corpus of 2 trillion tokens and the language model distributed application can be made real time. CHAPTER 3. DATA STRUCTURES FOR HIERO GRAMMARS 55 Zhang et al. (2006) propose a distributed large language model backed by suffix arrays. HBase has also been used to build a distributed language infrastructure (Yu, 2008). The method we propose to use is closely related to the latter but we use a more lightweight infrastructure than HBase. In addi- tion, it is also possible to apply our method to language model querying (Pino et al., 2012), which demonstrates the flexibility of the infrastructure. Note that there are many alternative solutions to HBase/HFile for storage of a large number of key-value pairs, such as Berkeley DB7 or Cassandra8. The purpose of this chapter is not to compare these alternatives but rather to compare existing solutions employed in the machine translation community to one of these solutions. 3.3 HFile Description We now describe the data structure we use to store models and we review relevant features to the design of our system. To store a model represented as key-value pairs, we use the HFile file format,9 which is an open source reimplementation of the SSTable file format (Chang et al., 2008). The HFile format is a lookup table with key and value columns. The entries are free to be an arbitrary string of bytes of any length. The table is sorted lexico- graphically by the key byte string for efficient record retrieval by key. 3.3.1 HFile Internal Structure High Level Structure As can be seen in Figure 3.2, the HFile internal structure is divided into blocks. There are various types of block, depending on the kind of information a block contains. In order to distinguish block types, the first 8 bytes of a block will indicate its type. The blocks are grouped into three parts. The first part (top) is organised into data blocks interleaved with leaf data index blocks and Bloom blocks. The second part (middle) consists of intermediate data index blocks. The third part (bottom) consists of metadata: root data index blocks, a file information block and a Bloom filter index block. 7 https://oss.oracle.com/berkeley-db.html 8 http://cassandra.apache.org 9 http://hbase.apache.org CHAPTER 3. DATA STRUCTURES FOR HIERO GRAMMARS 56 Data Block ... Leaf Data Index Block / Bloom Block ... Data Block ... Leaf Data Index Block / Bloom Block ... Data Block Intermediate Data Index Blocks Root Data Index Blocks File Info Block Bloom Filter Index Block Figure 3.2: HFile internal structure.11 The structure is divided into three parts. In the first part, data blocks are interspersed with leaf data index blocks and Bloom blocks. The second part contains intermediate data index blocks. The third part consists of metadata: root data index blocks, file format information and a Bloom filter index block. Index Blocks The root data index blocks map keys to their relevant in- termediate data index block, indicated by an offset. In turn, an intermediate data index block maps keys to their relevant leaf data index block, again indicated by an offset. 
Finally, a leaf data index block maps keys to their relevant data block, still indicated by an offset. The same mechanism is employed for the Bloom filter: the Bloom filter index block maps keys to their relevant Bloom block.

Block Size Configuration The block size is configurable, with a default size of 64KB. Note that HFile blocks are not to be confused with Hadoop Distributed File System (HDFS) blocks, whose default size is 64MB.

Block Compression The HFile format allows for the blocks to be compressed. The compression codec is selected when the file is created. We choose the GZip compression codec (Deutsch and Gailly, 1996) for all our experiments.

3.3.2 Record Retrieval

When the HFile is opened for reading, the root data index blocks are loaded into memory. To retrieve a value from the HFile given a key, the appropriate intermediate index block is located by a binary search through the root data index. Binary searches are then conducted on the intermediate and leaf index blocks to identify the data block that contains the key. The identified data block is loaded off the disk into memory and the key-value record is retrieved by scanning the data block sequentially.

3.3.3 Bloom Filter Optimisation for Query Retrieval

It is possible to query for a key that is not contained in the HFile. This happens very frequently in translation because of language data sparsity: in our case, because keys are n-grams of terminals and nonterminals, the number of possible keys is exponential in the size of the vocabulary, and many of these will not have been observed in any rule extracted from training data. Querying the existence of a key is expensive as it involves all the operations described in Section 3.3.2: loading and binary searching the root data index, loading and binary searching an intermediate data index block, loading and binary searching a leaf data index block and finally loading and scanning a data block.

For fast existence check queries, the HFile format allows the inclusion of an optional Bloom filter (Bloom, 1970). A Bloom filter provides a probabilistic, memory efficient representation of the key set with an O(1) membership test operation. The Bloom filter may produce a false positive, but never a false negative, for the existence of a key in the HFile. It is therefore safe to use in our setting: some keys will be looked up in the HFile even though they are not present, but the situation where keys that are in the HFile are not looked up will not happen. For a large HFile, the Bloom filter may also be very large. Therefore the Bloom filter is also organised into blocks called Bloom blocks. Each block contains a smaller Bloom filter that covers a range of keys in the HFile. Similar to the root data index, a Bloom index is constructed. To check for the existence of a key, a binary search is conducted on the Bloom index, the relevant Bloom block is loaded, and the membership test is performed.

Unlike work on Bloom filter language models (Talbot and Osborne, 2007a,b), this filter only tests the existence of a key and does not return any statistics from the value. If a membership test is positive, a usual key search will still be carried out. During the execution of a query, two keys may reference the same index or Bloom blocks.
To prevent these blocks from being repeatedly loaded from disk, the blocks are cached after reading.

3.3.4 Local Disk Optimisation

The HFile format is designed to be used with HDFS, a distributed file system based on the Google File System (Ghemawat et al., 2003). Large files are split into HDFS blocks that are stored on many nodes in a cluster. However, the HFile format can also be used completely independently of HDFS. If its size is smaller than local disk space, the entire HFile can be stored on the local disk of one machine and accessed through the machine's local file system. We report in Section 3.5 that using local disk is faster than using HDFS.

3.3.5 Query Sorting Optimisation

Prior to HFile lookup, we sort keys in the query set lexicographically. If two keys in the set of queries are contained in the same block, then the block is only loaded once. In addition, the computer hardware and operating system allow further automatic improvements to the query execution. Examples of these automatic improvements include reduced disk seek time, caching of data read from disk by the operating system (The Linux Documentation Project, The File System, http://tldp.org), or CPU caching of data from main memory (Patterson and Hennessy, 2009).

3.4 Hierarchical Rule Extraction with MapReduce

In the previous section, we have described the HFile format. We will use this format to store a translation grammar. In this section, we describe how we generate this grammar and produce the HFile that represents the grammar. We describe a MapReduce implementation of hierarchical grammar extraction and estimation. The implementation is an extension of Method 3 described by Dyer et al. (2008), which was originally developed to estimate translation probabilities for phrase-based models.

MapReduce (Dean and Ghemawat, 2008) is a framework designed for processing large amounts of data in a distributed fashion on a computer cluster. In this framework, the programmer needs to implement a function called map and a function called reduce. The map function is defined in Equation 3.2:

map : (k_1, v_1) ↦ [(k_2, v_2)]    (3.2)

where (k_1, v_1) is a key-value pair and [(k_2, v_2)] is a list of key-value pairs. We use the [ ] notation to denote a list. Keys and values are arbitrary byte sequences. As the key-value pairs (k_2, v_2) are computed by the map function, the MapReduce framework groups all values v_2 by key k_2 in order to provide an input (k_2, [v_2]) to the reduce function, defined in Equation 3.3:

reduce : (k_2, [v_2]) ↦ [(k_3, v_3)]    (3.3)

The input to the reduce function is a key and a list of values. The reduce function generates a list of key-value pairs.

Using this notation, we can describe the grammar extraction and estimation as a series of MapReduce jobs, summarised in Figure 3.3. This architecture is similar to the one described in the original publication (Dyer et al., 2008). For source-to-target and target-to-source probability computation, we used Method 3 described in that publication, which was reported to be faster than other methods. Keeping each feature computation as a separate MapReduce job makes the integration of a new feature more convenient.

3.4.1 Rule Extraction

The first step in grammar extraction and estimation is rule extraction. For this step, v_1 represents a sentence pair together with a word alignment and k_1 contains metadata about the parallel corpus such as what collection the sentence pair comes from.
This kind of information is useful for domain adaptation, as demonstrated in Section 5.5. The map function simply extracts hierarchical rules for that sentence pair and outputs the rules: k_2 is a rule and v_2 contains the metadata from k_1. For rule extraction, there is no reduce step. Equation 3.4 illustrates the map function for rule extraction:

map : (newswire, (f, e, a)) ↦ [(r_1, newswire), (r_2, newswire), ...]    (3.4)

Figure 3.3: Translation grammar extraction and estimation pipeline as a series of MapReduce jobs. Ellipses represent (intermediate) input/output of MapReduce jobs. Rectangles represent MapReduce jobs. Source, target and probability are abbreviated to src, trg and pr. Rules are first extracted (map: extract rules; no reduce), then the source-to-target translation model (map: output (src, (trg, 1)); reduce: sum target counts and normalise) and the target-to-source translation model (map: output (trg, (src, 1)); reduce: sum source counts and normalise) are estimated. Finally, the features, i.e. the translation models, are merged (map: output (src, (trg, feature)); reduce: output (src, list of targets and features)) and the final output has rule sources as keys and lists of targets with features as values.

3.4.2 Corpus-Level Feature Computation

Rule extraction is followed by several MapReduce jobs that are run simultaneously in order to compute corpus-level features. Corpus-level features are features related to each rule whose computation requires looking at the entire set of extracted rules. In this description, we only consider source-to-target and target-to-source probabilities, but many other corpus-level features are possible, for example features that relate to the word alignment from which rules were extracted. We use the metadata stored in the output of the extraction job in order to compute collection-specific probabilities. Collection-specific probabilities are probabilities computed over a specific subset of the training corpus. The metadata indicates whether a rule was extracted from that specific subset.

Let us consider first the collection-specific source-to-target probability computation. The input to the map function is the output of the extraction job, that is, a rule together with metadata. The map function checks that the rule comes from the collection we are interested in and, if so, outputs the source of the rule as a key k_2 and the target of the rule together with a count of one as a value v_2. Targets (and counts) corresponding to the same source are grouped together by the MapReduce framework. The input to the reduce function is a source (k_2) and a list of targets with counts ([v_2]). The reduce function aggregates the counts for the distinct targets and then normalises those counts by the total count corresponding to the source k_2. The output of the reduce function is a list of pairs (rule, probability).

The target-to-source probability computation is very similar. Apart from inverting the roles played by source and target, the map and reduce functions also need to change the rule format for rules that invert the order of nonterminals between the source and the target, for example X → 〈X_1 α X_2, X_2 β X_1〉, where α and β are sequences of terminals.
For this type of rule, the map function has to invert the nonterminal indices in both the source and the target so that probabilities are computed correctly. For example, the map function will output the target of the rule written as X_1 β X_2 and the source of the rule written as X_2 α X_1. The reduce function will restore the original notation.

We have described the computation of collection-specific translation probabilities. The source-to-target and target-to-source probabilities over the entire corpus are computed in an identical manner, with the collection simply being the entire corpus. Figure 3.3 illustrates only two corpus-level features; however, in practice, more features are computed: source-to-target and target-to-source probabilities for each collection and for the entire parallel text, as well as lexical features (see Section 2.6.5).

3.4.3 Feature Merging

Once all MapReduce features have been computed, a MapReduce job that merges the features is called. The input key k_1 to the map function is a rule and the input value v_1 is a feature. The map function outputs the source of the rule as a key k_2 and the target of the rule together with the feature as a value v_2. Targets and features corresponding to the same source are grouped together by the MapReduce framework. The input to the reduce function is a source (k_2) and a list of targets with features ([v_2]). The reduce function simply merges the features for identical targets. The output key k_3 is the source k_2 and the output value v_3 is a list of targets with all computed features.

The output of this merging job is converted to the HFile format. Because the keys in the HFile format need to be sorted, we also want the output of the merging job to be sorted by key. One way to achieve this is to specify only one reducer for the job, but this solution is very slow because only one core will be used for the reduce step. A better way is to use a TotalOrderPartitioner. We first run extraction on a small amount of data, for which it is feasible to carry out the merge step with only one reducer. We use the output of this small scale extraction as input to an InputSampler. The sampler generates a partition file for the partitioner. The use of a TotalOrderPartitioner guarantees that all keys are sorted after the reduce step.

3.5 Hierarchical Rule Filtering for Translation

In the previous section, we have described how to obtain an HFile that represents a hierarchical translation grammar. We will now describe how this data structure can be used to retrieve rules for a given test set efficiently. We describe our system, called ruleXtract, and compare it to other methods through time and memory measurements.

3.5.1 Task Description

Given a test set, defined as a set of sentences to be translated, and a hierarchical phrase-based translation grammar, we would like to retrieve all the relevant rules from the model. A phrase-based rule in the HFile is relevant if its source is a subsequence of a sentence in the test set. A hierarchical rule in the HFile is relevant if its source, with nonterminals instantiated into terminals, is a subsequence of a sentence in the test set.
For example, with a test set containing the single sentence Salut toi, the phrase-based rules with sources Salut, toi and Salut toi are relevant, and the hierarchical rules with sources Salut X and X toi are relevant, because instantiating Salut X into Salut toi produces a subsequence of the test sentence, and similarly for X toi.

3.5.2 HFile for Hierarchical Phrase-Based Grammars

Given a test set and an HFile storing a hierarchical phrase-based grammar, we first generate queries from the test set, then retrieve relevant rules along with their corpus-level features from the HFile. To generate queries, we have a set of allowed source patterns and instantiate these patterns against the test set. Rule patterns were defined in Section 2.6.2. A source pattern is simply the source side of a rule pattern. Here, we briefly redefine source patterns. A source pattern is a regular expression that matches the source side of a rule. For example, the pattern Σ⁺X, with Σ the source vocabulary, represents a rule source side consisting of a sequence of terminals followed by the nonterminal X. If the input sentence is Salut à toi, the pattern will be instantiated as Salut X, Salut à X and à X. We impose the following constraints on source pattern instantiation, where the first four relate to constraints in extraction and the last one relates to a decoding constraint:

• s_max: maximum number of terminals for phrase-based rules.
• s_max_elements: maximum number of terminals and nonterminals.
• s_max_terminals: maximum number of consecutive terminals for hierarchical rules.
• s_max_NT: maximum nonterminal span in a hierarchical rule.
• s_max_span: maximum span for the rule. This constraint is not to be confused with the previous s_max_NT constraint. While s_max_NT represents the maximum span of a nonterminal on the right-hand side of a rule, s_max_span represents the maximum span of an entire rule, or alternatively the maximum span of the nonterminal on the left-hand side of the rule. Because the latter constraint is enforced by the decoder, we also enforce it in source pattern instantiation so that rules that will never be used by the decoder are not generated.

The source pattern instances are then sorted for more efficient HFile lookup (see Section 3.3.5). Each query is then looked up in the HFile and, if present, an HFile record is retrieved. We now compare our approach to similar approaches whose aim is to obtain rules which are relevant to a test set.

3.5.3 Suffix Arrays for Hierarchical Phrase-Based Grammars

One alternative strategy for building a test-set-specific grammar is to use suffix arrays, introduced in Section 3.2. We use the cdec software (Dyer et al., 2010) implementation of suffix arrays for hierarchical phrase-based rule extraction. The cdec implementation is based on earlier work (Lopez, 2007) which extends suffix array based rule extraction from phrase-based systems to hierarchical phrase-based systems (see again Section 3.2). Given a test set, a set of source pattern instances is generated similarly to what is done for ruleXtract. These source pattern instances are then sought in a suffix array compiled from the source side of a parallel corpus. Rules are extracted using the word alignment, and source-to-target probabilities are then computed as described in Section 3.2.
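The suffix array lookup that underlies this strategy can be sketched in a few lines. The following toy example builds a suffix array over the source side of an invented corpus and retrieves all occurrences of a source phrase by binary search, which is how source marginal counts are obtained; it is only an illustration of the technique of Sections 3.2 and 3.5.3, not cdec's actual implementation.

```python
# Minimal sketch (illustrative only): suffix array over a tokenised corpus
# and binary-search lookup of all occurrences of a source phrase.

def build_suffix_array(tokens):
    """Corpus positions sorted by the lexicographic order of their suffixes."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def find_occurrences(tokens, sa, phrase):
    """Return the corpus positions whose suffix starts with `phrase`."""
    def prefix(i):
        return tokens[i:i + len(phrase)]
    lo, hi = 0, len(sa)
    while lo < hi:                      # lower bound: first suffix >= phrase
        mid = (lo + hi) // 2
        if prefix(sa[mid]) < phrase:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                      # upper bound: first suffix > phrase
        mid = (lo + hi) // 2
        if prefix(sa[mid]) <= phrase:
            lo = mid + 1
        else:
            hi = mid
    return sa[start:lo]

corpus = "le chat dort le chien dort".split()   # invented source-side corpus
sa = build_suffix_array(corpus)
print(find_occurrences(corpus, sa, ["le"]))      # positions 0 and 3
print(len(find_occurrences(corpus, sa, ["dort"])))  # source marginal count: 2
```

Each returned position identifies a training sentence from which rules can then be extracted and counted as described in Section 3.2.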
3.5.4 Text File Representation of Hierarchical Phrase-Based Grammars

We now describe the Joshua decoder (Weese et al., 2011) implementation for storing and retrieving rules from a translation model. The first implementation variant, which we call Joshua, stores the translation model in a text file. Given a test set, each word in the test set vocabulary is mapped to the list of sentences in which it appears. Then, each rule in the translation model is compiled to a regular expression, and each sentence that contains at least one vocabulary word of the rule is matched against this regular expression. If at least one match is successful, the rule is retained.

A faster version is provided that matches consecutive terminals in the source side of a rule against the set of n-grams extracted from the test set. We call this version Joshua Fast. A parallel version also exists that chunks the grammar file and distributes the processing of each chunk as a separate process on a cluster running Sun Grid Engine (Gentzsch, 2001). We call this version Joshua Parallel. The parallel version using the faster matching algorithm is called Joshua Fast Parallel.

3.5.5 Experimental Design

In this section, we describe our experimental setup in terms of data and the various configurations for grammar extraction, grammar filtering and time and memory measurements.

Parallel Text We use a small parallel corpus of 750,950 word-aligned sentence pairs and a larger corpus of 9,221,421 word-aligned sentence pairs from the NIST'12 Chinese-English evaluation, in order to investigate how systems perform with varying amounts of parallel text and whether those systems are robust when dealing with larger data sets.

Grammar Extraction From the parallel corpora, we extract hierarchical grammars with the source-to-target probability feature only. This is done because we do not want feature computation to introduce variability in timing results when comparing different strategies and software implementations. In addition, not all systems can generate the same set of features. For example, the suffix array implementation of rule extraction is not able to generate target-to-source probabilities. Note that in practice, for translation decoding, given a vector of parameters, we could simply replace the multiple features in the translation model by a single value representing the dot product of the features with the parameter vector. However, we need to keep separate feature values for optimisation (see Section 2.8.2).

The rule extraction constraints described in Section 3.5.2 are set to a maximum source phrase length s_max of 9, a maximum number of source elements (terminals and nonterminals) s_max_elements of 5, a maximum number of consecutive source terminals s_max_terminals of 5 and a maximum source nonterminal span s_max_NT of 10. These settings are standard in all our translation systems and give competitive performance in terms of translation quality. With these settings, the small grammar contains approximately 60M rules while the larger grammar contains approximately 726M rules. The grammars we obtain are converted to the Joshua format in order to be able to filter them with the Joshua decoder. We do not generate a hierarchical grammar with Joshua simply because we want to compare the performance of rule retrieval from the same grammar.
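The text-file filtering strategy of Section 3.5.4 can also be sketched briefly. In the sketch below, the rule format, the helper names and the regular expression used for the nonterminal X are our own simplifications; the code is only indicative of the matching strategy and is not Joshua's actual implementation.

```python
import re

# Illustrative sketch of regular-expression based grammar filtering
# (Section 3.5.4): a rule is kept if its source side, with X allowed to
# match any non-empty word sequence, occurs in some test sentence that
# shares at least one terminal with the rule.

def compile_source(source):
    """E.g. 'Salut X' -> regex matching 'Salut' followed by one or more words."""
    parts = [r"\S+(?: \S+)*" if tok == "X" else re.escape(tok)
             for tok in source.split()]
    return re.compile(r"(?:^| )" + " ".join(parts) + r"(?: |$)")

def filter_rules(rules, test_sentences):
    # Inverted index: test set word -> indices of sentences containing it.
    word_to_sents = {}
    for k, sent in enumerate(test_sentences):
        for w in set(sent.split()):
            word_to_sents.setdefault(w, set()).add(k)
    kept = []
    for source, target in rules:
        terminals = [t for t in source.split() if t != "X"]
        candidates = set()
        for t in terminals:
            candidates |= word_to_sents.get(t, set())
        regex = compile_source(source)
        if any(regex.search(test_sentences[k]) for k in candidates):
            kept.append((source, target))
    return kept

rules = [("Salut X", "Hi X"), ("bonjour", "hello"), ("X toi", "X you")]
print(filter_rules(rules, ["Salut toi"]))  # keeps the two matching rules
```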
Grammar filtering For ruleXtract, we use the following constraints for source pattern instantiation: the maximum source phrase length s_max is set to 9, the maximum number of source elements (terminals and nonterminals) s_max_elements is set to 5, the maximum number of consecutive source terminals s_max_terminals is set to 5, the maximum source nonterminal span s_max_NT is set to ∞ and the maximum rule span s_max_span is set to ∞. The first three constraints are standard in all our translation systems. We use the value ∞ for s_max_NT and s_max_span so that the number of rules obtained after filtering is identical between ruleXtract and Joshua. Thus, time and memory measurements are comparable.

Measurements We report time measurements for query processing and query retrieval, as well as the total time needed to obtain a test-set-specific rule file for a test set of 1755 Chinese sentences and 51008 tokens. We also report peak memory usage. For ruleXtract, query processing involves generating source pattern instances and sorting them according to the HFile sorting order. If we use a Bloom filter, it also involves pre-filtering the queries with the Bloom filter. Query retrieval involves HFile lookup. For the Joshua configurations, query processing involves indexing the test set and generating test set n-grams, and query retrieval involves regular expression matching. For the Joshua Parallel configurations, we use 110 jobs for the larger grammar on a cluster of 9 machines. For this latter configuration, we report the maximum time spent on a job (not the sum) and the maximum memory usage of a job.

Hardware configuration The machine used for query processing has 94GB of memory and an Intel Xeon X5650 CPU. The distributed file system is hosted on the querying machine and other machines with the same specification, which are used to generate the HFile.

Our experimental setup was designed for accurate comparisons between alternative strategies for grammar filtering. In practice, an end-to-end translation system need not use this exact setup. For example, the grammar extracted by Joshua is in general smaller than the grammar extracted by ruleXtract because Joshua includes target side constraints. On the other hand, ruleXtract uses threshold cutoffs, such as a minimum source-to-target translation probability, in order to reduce the size of the test set specific grammar.

3.5.6 Results and Discussion

Results are summarised in Table 3.1, from which we can draw the following observations:

Speed The Total Time column shows that ruleXtract is competitive with alternative strategies in terms of speed.

Memory The Peak Memory column shows that both ruleXtract and Joshua are memory hungry, with a usage of up to 40GB. In the case of ruleXtract, this is because all source pattern instances for all test set sentences are kept in memory. In the case of Joshua, this is due to a caching optimisation: rule source sides that have already been considered relevant to the test set are stored in a hash map.

Local disk optimisation Comparing HDFS and Local rows (row 1 vs. row 3, row 2 vs. row 4, row 7 vs. row 9, row 8 vs. row 10), we can see that using the local file system as opposed to HDFS gives a small decrease in query retrieval time, which becomes more substantial for the larger grammar. This is because when the HFile is stored on HDFS, it is composed of several HDFS blocks that can be located on different data nodes.
Since the HFile size is smaller than the disk space on each data node, it is preferable to store the HFile on local disk. However, because the HFile was generated on HDFS through a MapReduce job, this requires copying or moving the HFile from HDFS to the local disk file system prior to running grammar retrieval.

Small Grammar
Row  System                   Query Processing  Query Retrieval  Total Time  Peak Memory  # Rules
1    ruleXtract HDFS          9m1s              7m36s            16m40s      40.8G        6435124
2    ruleXtract Bloom, HDFS   8m57s             2m16s            11m15s      39.9G        6435124
3    ruleXtract Local         8m54s             7m33s            16m30s      40.4G        6435124
4    ruleXtract Bloom, Local  8m50s             2m19s            11m11s      38.8G        6435124
5    Joshua                   0.9s              29m51s           29m54s      42.2G        6435124
6    Joshua Fast              0.9s              7m25s            7m28s       40.1G        7493178

Large Grammar
Row  System                   Query Processing  Query Retrieval  Total Time  Peak Memory  # Rules
7    ruleXtract HDFS          8m56s             22m18s           31m17s      42.2G        47978228
8    ruleXtract Bloom, HDFS   9m12s             15m33s           24m49s      40.7G        47978228
9    ruleXtract Local         8m55s             21m3s            30m1s       41.6G        47978228
10   ruleXtract Bloom, Local  9m0s              14m43s           23m46s      40.6G        47978228
11   Joshua                   0.9s              out of memory    out of memory  out of memory  out of memory
12   Joshua Fast              0.9s              out of memory    out of memory  out of memory  out of memory
13   Joshua No Cache          0.9s              537m10s          537m11s     10.1G        47978228
14   Joshua Fast No Cache     0.9s              78m53s           78m54s      10.1G        83339443
15   Joshua Parallel          total time for slowest job: 43m36s              4G           47978228
16   Joshua Fast Parallel     total time for slowest job: 44m29s              4G           83339443

Table 3.1: Time and memory measurements for rule filtering with different strategies for a small and a large grammar. Various configurations for ruleXtract and Joshua are compared for time and memory usage. The fast configuration for Joshua filters rules by matching consecutive terminals to test set n-grams, which explains why the number of rules obtained is higher than in all other configurations. This also means that some filtered rules are never used in decoding. Measurements were carried out twice for each condition and the second run was recorded. Thus, the measurements reported take advantage of operating system caching.

Bloom filter optimisation Comparing rows with and without Bloom (row 1 vs. row 2, row 3 vs. row 4, row 7 vs. row 8 and row 9 vs. row 10), we can see that the use of a Bloom filter gives a substantial decrease in query retrieval time. This is because the number of source pattern instance queries is 31,552,746 and, after Bloom filtering, the number of queries is 1,146,554 for the small grammar and 2,309,680 for the larger grammar, reducing the number of time consuming HFile lookups by 96% and 93% respectively. Note that Bloom filters increase query processing time only for the large grammar, and more so when using HDFS.

Parallelisation In order to run Joshua on the larger grammar and avoid memory problems, we needed to use parallelisation, which provided competitive speeds and a low memory footprint (a maximum of 4G per job; see rows 15 and 16).

For comparison, cdec's total processing time is 57m40s for the small grammar, which is significantly slower than the other methods. However, we do not include cdec's suffix array implementation performance in Table 3.1 because the suffix array method involves much on-the-fly computation, such as rule extraction and source-to-target probability estimation, that has already been precomputed in the case of Joshua and ruleXtract, making direct comparison difficult.
Despite this apparent slowness, the use of suffix array methods for rule extraction favours rapid experimentation because no time consuming precomputation is required. With our method, the precomputation involved consists of generating an HFile. This takes about 20 minutes for the small corpus and about 5 hours for the larger corpus. With the suffix array method, compiling the small parallel corpus into a suffix array data structure takes 2m49s and extracting a grammar for the test set from that corpus takes 57m40s, as mentioned. Compiling the large parallel corpus into a suffix array takes 84m29s and extracting a grammar for our test set takes 569m54s. This gives an indication of the total time needed to extract a grammar for an arbitrary corpus and an arbitrary test set. In our case, even though HFile generation might be relatively time consuming (5 hours for the larger corpus), once the HFile is generated, it is possible to quickly extract grammars for arbitrary test sets and arbitrary filtering parameters for rapid experimentation. This approach favours repeated experimentation with varying development and test sets.

Small Grammar
System      Extraction Time  Retrieval Time  Total Time
cdec        2m49s            57m40s          1h0m29s
ruleXtract  20m              11m11s          31m11s

Large Grammar
System      Extraction Time  Retrieval Time  Total Time
cdec        1h24m29s         9h29m54s        10h54m23s
ruleXtract  5h               23m46s          5h23m46s

Table 3.2: Timings for cdec and ruleXtract for both rule extraction and retrieval.

Table 3.2 compares timings between suffix array rule extraction and retrieval as implemented by cdec and our system. For cdec, the extraction time corresponds to corpus compilation. For both the small and large grammars, ruleXtract is approximately twice as fast.

The ruleXtract system works in batch mode. Dividing the number of words in the test set by the total time in the best configuration (ruleXtract, Bloom, Local; row 10) for the large grammar yields a speed of 35.8 words per second, which is a real time speed for batch processing tasks in which latency has little effect. However, running the system in that configuration gives a speed of 2.5 words per second for the longest sentence in the test set (135 words) and 1.3 words per second for a sentence of length 20. This may be due to several factors. First, different sentences in a test set may share the same source pattern instances, which saves computation time in batch processing. Second, slow I/O operations, such as opening the HFile and reading the HFile metadata, are shared by all test sentences in batch processing. Finally, processing only one sentence at a time may not benefit from caching (see Section 3.3.3 and Section 3.3.5) as much as processing an entire test set. Future work will be dedicated to reducing latency and obtaining a true real time system at the level of individual translation requests.

3.6 Conclusion

We have presented techniques to build large translation grammars and to filter these grammars for a given test set efficiently. Our strategy is relatively easy to implement, it is flexible in that it can be applied to other tasks such as language modelling, and it does not require extensive computing resources as it is run on one machine. We have demonstrated that its performance in terms of speed and memory usage is competitive with other current alternative approaches. There are several possible extensions to our current infrastructure.
First, rule retrieval can be parallelised by source sentence in order to provide sentence-specific grammars rather than a test-set-specific grammar. Rule retrieval can also be parallelised by processing queries in parallel. The HFile format does not support concurrent reading with multithreading, but it does allow multiprocess reading, which makes this option suitable for MapReduce. Finally, in order to provide a real time system, causes of latency may be tackled; for example, optimising the source pattern instance creation phase would help reduce latency.

Chapter 4

Hierarchical Phrase-Based Grammar Extraction from Alignment Posterior Probabilities

In Chapter 3, we described how to exploit the MapReduce framework and the HFile format in order to generate very large translation grammars and retrieve rules from these grammars efficiently. The main contribution was at the infrastructure level rather than at the modelling level. In this chapter, we attempt to improve the models used for translation grammar extraction.

Standard practice in SMT systems is to decouple the word alignment phase from the rule extraction phase. Typically, word alignment models are only used to obtain a set of alignment links; those alignment links then determine constraints that are followed in the rule extraction step (see Section 2.6.4). In this chapter, we attempt to leverage the information contained in alignment models by extracting rules from alignment posterior probabilities (de Gispert et al., 2010b). These statistics are computed from the HMM alignment model (Vogel et al., 1996) and they are used both to generate constraints for rule extraction and for translation model estimation.

This chapter presents two rule extraction methods. With the first method, rules are extracted from alignment link posterior probabilities. With the second method, rules are extracted from alignment posterior probabilities over phrase pairs. We demonstrate improvements on a medium scale Chinese-English task with these methods. We also investigate how best to exploit source-to-target and target-to-source alignment models.

Figure 4.1: German-English word-aligned sentence pair (Er hat den Ball gesehen / He has seen the ball). The spurious alignment link between the German word hat (has) and the English word seen prevents the phrase pair 〈hat, has〉 from being extracted from this sentence pair.

4.1 Introduction

In state-of-the-art SMT systems, rules are extracted from word-aligned parallel text. The alignments are typically generated by applying symmetrisation heuristics (see Section 2.3.2) to Viterbi alignments (see Equation 2.7) obtained from source-to-target and target-to-source word alignment models. Additional information that these models could provide, such as posterior probabilities over alignment links, is not used.

For example, let us consider the word-aligned German-English sentence pair in Figure 4.1. Intuitively, we can tell that there is a spurious alignment link between the German word hat (has) and the English word seen. This link will prevent the extraction of the useful phrase pair 〈hat, has〉 from this sentence pair. However, it is possible that the posterior probability of this spurious link according to the alignment model is relatively low. We hypothesise that posterior probability information from alignment models is more reliable than the links obtained from Viterbi alignment.
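The following small sketch illustrates this point on the example of Figure 4.1. The link set and, in particular, the posterior values are invented for illustration; the sketch simply contrasts the standard consistency check over a fixed link set (Section 2.5.3) with the same check over the set of links whose posterior probability exceeds a threshold, which is the idea developed later in Section 4.3.3.

```python
# Illustrative sketch with hypothetical numbers: the phrase pair <hat, has>
# is blocked by the consistency constraint under the Viterbi links because of
# the spurious hat-seen link, but survives when links are taken to be those
# whose (invented) posterior probability exceeds a threshold.

def consistent(links, j1, j2, i1, i2):
    """At least one link inside the phrase pair and no link crossing its boundary."""
    inside = any(j1 <= j <= j2 and i1 <= i <= i2 for j, i in links)
    crossing = any((j1 <= j <= j2) != (i1 <= i <= i2) for j, i in links)
    return inside and not crossing

# Positions: German "Er hat den Ball gesehen", English "He has seen the ball".
viterbi_links = {(1, 1), (2, 2), (2, 3), (3, 4), (4, 5), (5, 3)}  # (2, 3) is the spurious hat-seen link
posteriors = {(1, 1): 0.95, (2, 2): 0.90, (2, 3): 0.15,           # hypothetical HMM link posteriors
              (3, 4): 0.90, (4, 5): 0.90, (5, 3): 0.85}
threshold = 0.5
posterior_links = {link for link, p in posteriors.items() if p > threshold}

print(consistent(viterbi_links, 2, 2, 2, 2))    # False: (2, 3) crosses the boundary
print(consistent(posterior_links, 2, 2, 2, 2))  # True: the low-posterior link is filtered out
```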
In this chapter, we use HMM alignment models (see Section 2.3.1) to generate the statistics needed both to extract rules and to estimate the translation models. We hypothesise that this tighter coupling between alignment and translation models will provide better translation quality.

We will evaluate the grammar we obtain in two ways. First, we will assess the grammar's ability to generate a reference translation from a source sentence. This is determined by the type of reordering allowed by the grammar and by the choice of translations for each source side of a rule. We will then evaluate the translation quality provided by this grammar.

Conceptually, our extraction method consists of extracting all possible phrase pairs and hierarchical phrase pairs given a sentence pair and selecting only those that satisfy certain statistical criteria related to alignment posterior probabilities. For example, we can select phrase pairs that contain a link with a high posterior probability; or we can select phrase pairs that contain a link with a high posterior probability and that also have a high phrase pair posterior probability. The selection process determines which rules the grammar will contain and therefore defines the ability of the grammar to generate a reference translation given a source sentence. We can also use statistics from alignment models to estimate translation models in a novel way. In this work, we will use phrase pair posterior probabilities instead of integer counts to estimate translation models.

4.2 Related Work

The limitations of extracting translation rules from Viterbi alignments, i.e. that potentially useful information from the alignment models is ignored, have been addressed previously. Venugopal et al. (2008) extract rules from n-best lists of alignments and n-best lists of syntactic parses for a syntax-augmented hierarchical system (Zollmann and Venugopal, 2006). In the alignment step, an n-best list of alignments a_1, ..., a_n is produced with posterior probabilities p(a_1 | f, e), ..., p(a_n | f, e). These posteriors are normalised to produce probabilities p̂(a_1), ..., p̂(a_n). Similarly, probabilities p̂(π_1), ..., p̂(π_{n′}) are obtained for an n′-best list of parses π_1, ..., π_{n′}. For each alignment a_i and parse π_j, syntax-augmented hierarchical rules are extracted with a count p̂(a_i) p̂(π_j).

Alignment n-best lists have also been used to create a structure called a weighted alignment matrix (Liu et al., 2009). Probabilities p̂(a_1), ..., p̂(a_n) for the n-best alignments a_1, ..., a_n are computed as above (Venugopal et al., 2008). Then, for each word pair (f_j, e_i), the alignment link posterior probability p_m(j, i) is computed as in Equation 4.1:

p_m(j, i) = Σ_{k=1}^{n} p̂(a_k) δ(a_k, i, j)    (4.1)

where δ(a_k, i, j) indicates whether there is a link between i and j in the alignment a_k. Given a sentence pair, all phrase pairs up to a maximum source length and a maximum target length that contain a link with a posterior greater than zero are extracted. The fractional counts assigned to these phrase pairs are computed in terms of the link posteriors and then used to estimate the translation models by relative frequency. The fractional count computation approximates the posterior probability of all alignments consistent with the phrase pair. Our method also uses link posterior probabilities to constrain the extraction, but the posteriors are computed exactly rather than approximated.
In addition, the posterior probabilities of consistent alignments are also computed exactly. Finally, our method is also applied to hierarchical phrase-based translation.

Alignment posterior probabilities without approximation have also been used. For a given test set, Deng and Byrne (2008) first extract phrase pairs in a standard manner. Then, source phrases in the test set that do not have any corresponding target in the list of extracted phrase pairs are selected. For each of these source phrases, sentence pairs where the source phrase occurs are considered. For each such sentence pair, all target phrases in the target sentence are assigned phrase pair posterior probabilities (see Section 4.3.4) according to the source-to-target and target-to-source alignment models, and then ranked by the geometric average of the two probabilities. The top phrase pair is retained if its scores are above specific thresholds. Our definition of phrase pair posterior probabilities and the procedure to compute them are directly inspired by the work we just described. However, we do not use the word-to-phrase HMM model but the simpler word-to-word HMM model. In addition, our method is applied to hierarchical phrase-based grammars rather than simpler phrase-based grammars. Finally, our grammar extraction scheme does not consist of first extracting a standard grammar and then augmenting the grammar with additional rules: we modify the extraction procedure to directly extract a hierarchical grammar from alignment link posterior probabilities or phrase pair posterior probabilities.

Kumar et al. (2007) also use exact computation of alignment link posteriors, in a different application setting. First, instead of using the Viterbi criterion for word alignment, recalled in Equation 4.2,

â = argmax_a p(f, a | e)    (4.2)

the maximum a posteriori criterion (Matusov et al., 2004), shown in Equation 4.3, is used:

â_j = argmax_i p(a_j = i | f, e)    (4.3)

Then, given a parallel corpus for three languages F, G, E, the link posteriors for the language pair (F, E) are computed in terms of the posteriors for the language pairs (F, G) and (G, E). G is called a bridge language. The motivation is that alignments for the F-G language pair and the G-E language pair may inform alignment for F-E. Multiple bridge languages are used and produce corresponding posterior matrices. The matrices are interpolated and alignments are extracted for each bridge language and for the interpolation. Translation gains are obtained in system combination.

We also note approaches to tighter coupling between hierarchical phrase-based grammars and alignments, or even direct modelling of phrase alignment. Marcu and Wong (2002) introduce a joint phrase-based model that does not make use of word alignments. In this generative model, a sentence pair is produced by concatenating phrase pairs, or so-called concepts. The authors consider a simpler model with only joint phrase pair translation probabilities and a more complex model with translation and distortion probabilities. The parameters are trained with an approximate version of the expectation-maximisation algorithm (Dempster et al., 1977). Experiments demonstrate translation improvements over IBM Model 4. Birch et al. (2006) constrain this model in order to be able to apply it to larger parallel corpora.
When searching for a set of phrase pairs to cover a training sentence pair, phrase pairs that are consistent with the intersection of the Viterbi alignments (see Section 2.3.2) are considered first; other phrase pairs are considered only when the sentence pair cannot be covered entirely. Results close to those of standard phrase-based models are obtained.

DeNero and Klein (2008) prove that phrase alignment is an NP-hard problem. Given a sentence pair (f, e), a bijective phrase alignment a is defined as a bijective mapping between source phrases that form a partition of f and target phrases that form a partition of e. A scoring function φ is also defined that assigns a real-valued score to any phrase pair 〈source phrase, target phrase〉. The score of a bijective phrase alignment is simply the product of the scores of its phrase pairs. Given (f, e, φ), the phrase alignment optimisation problem is to find the best scoring alignment. DeNero and Klein (2008) show that this problem is NP-hard by showing that the corresponding decision problem is NP-complete via a reduction from the SAT problem. We give here an indication of the size of the search space. The number of possible source partitions is 2^{|f|−1}. Given a source partition with K+1 phrases, there are (K+1)! possible permutations of the source phrases and C(|e|−1, K) possible target partitions with K+1 phrases. In conclusion, there is little hope of solving the phrase alignment problem exactly.

Saers and Wu (2009) report an improvement on a phrase-based system where word alignment has been trained with an inversion transduction grammar rather than IBM or HMM models. Phrase alignment is directly modelled with an inversion transduction grammar. The phrase alignment search space is more restrictive than the space considered by DeNero and Klein (2008) and the expectation-maximisation algorithm can be carried out in O(n^6), where n is the number of tokens in a sentence. Pauls et al. (2010) also use an inversion transduction grammar to directly align phrases to nodes in a string-to-tree model. Bayesian methods have also been developed to induce a grammar directly from an unaligned parallel corpus (Blunsom et al., 2008b, 2009). Finally, Cmejrek et al. (2009) extract rules directly from bilingual chart parses of the parallel corpus without using word alignments. We take a different approach in that we aim to start with very strong alignment models and use them to guide grammar extraction.

Finally, some work on smoothing, which could be complementary to the approach taken in this thesis, has been conducted to address the shortcomings of relative frequency estimation for translation models. Foster et al. (2006) conduct an extensive series of experiments that either replace the relative-frequency-estimated phrase table with a smoothed phrase table or add the smoothed phrase table as a feature, and observe improvements in translation quality.

4.3 Rule Extraction

In Section 4.2, we reviewed approaches that "widen" the translation pipeline by using alignment n-best lists. We also reviewed applications of exact computation of alignment posterior probabilities and attempts to model phrase alignment directly. We will now describe our grammar extraction methods, based on exact computation of alignment posterior probabilities under an alignment model.
As in previous work (Hopkins et al., 2011), we first present a general approach that encompasses both standard methods based on rule extraction from Viterbi alignments and our methods. For clarity of presentation, we first describe our methods in the simpler case of phrase-based rule extraction, then extend them to hierarchical phrase-based rule extraction.

1: function ExtractRules(f_1^J, e_1^I, a)
2:   for 1 ≤ j_1 ≤ j_2 ≤ J do
3:     for 1 ≤ i_1 ≤ i_2 ≤ I do
4:       if SourceConstraints(f_{j_1}^{j_2}) ∧ AlignConstraints(f_{j_1}^{j_2}, e_{i_1}^{i_2}, a) ∧ TargetConstraints(e_{i_1}^{i_2}) then
5:         Extract(X → 〈f_{j_1}^{j_2}, e_{i_1}^{i_2}〉, Count(f_{j_1}^{j_2}, e_{i_1}^{i_2}))
6:       end if
7:     end for
8:   end for
9: end function

Figure 4.2: General procedure for phrase-based rule extraction: both traditional rule extraction from Viterbi alignments and our method are instances of this procedure.

4.3.1 General Framework for Rule Extraction

We first describe a general method for the extraction of phrase-based rules. An extension of this procedure for hierarchical rules is described in Section 4.3.5. The algorithm is described in Figure 4.2. Given a sentence pair (f_1^J, e_1^I), for each source index pair (j_1, j_2) defining a source phrase f_{j_1}^{j_2} (line 2) and for each target index pair (i_1, i_2) defining a target phrase e_{i_1}^{i_2} (line 3), if the source constraints, target constraints and alignment constraints are satisfied (line 4), then the phrase pair 〈f_{j_1}^{j_2}, e_{i_1}^{i_2}〉 is extracted with a certain count (line 5): the phrase pair is added to the list of phrase pairs used in translation, and the count will be used subsequently to compute the translation models by relative frequency.

The purpose of the constraints is to obtain a manageable number of rules. If we did not impose constraints, we would extract I(I+1)J(J+1)/4 (not necessarily distinct) rules for the sentence pair (f_1^J, e_1^I). During translation, the decoder would need to apply more pruning, which would potentially lead to more search errors and a decrease in translation quality.

We will now refine this general procedure to make it more practical and closer to a possible implementation. Let us call the source constraints C_S, the alignment constraints C_L and the target constraints C_T. These are Boolean functions used to select phrase pairs. In practice, source constraints are checked on the source phrase before looking at the target phrases. If the source constraints are not met, then we need not consider target phrases for that source phrase. In addition, target phrases are only considered if they satisfy alignment constraints with the source phrase and, if they do, we rank them according to a certain ranking function R. Target constraints also depend on the ranking R; for example, we can decide to keep only a certain number of target phrases per source phrase. When a phrase pair is extracted, it is assigned a count which will be used to estimate the source-to-target and target-to-source translation models. The counting function is called C. With this notation, we obtain the revised extraction procedure in Figure 4.3. We will now describe different rule extraction strategies in terms of the constraints C_S, C_L, C_T, the ranking function R and the counting function C.

1: function ExtractRules(f_1^J, e_1^I, a)
2:   for 1 ≤ j_1 ≤ j_2 ≤ J do
3:     if ¬C_S(f_{j_1}^{j_2}) then                          ▷ Source constraints
4:       continue
5:     end if
6:     T ← ∅                                               ▷ Sorted target phrases
7:     for 1 ≤ i_1 ≤ i_2 ≤ I do
8:       if C_L(f_{j_1}^{j_2}, e_{i_1}^{i_2}, a) then       ▷ Alignment constraints
9:         T ← T ∪ {e_{i_1}^{i_2}}
10:      end if
11:    end for
12:    Sort(T, R)                                          ▷ Target phrases ranked according to R
13:    for e_{i_1}^{i_2} ∈ T do
14:      if C_T(e_{i_1}^{i_2}, T) then                     ▷ Target constraints
15:        Extract(X → 〈f_{j_1}^{j_2}, e_{i_1}^{i_2}〉, C(f_{j_1}^{j_2}, e_{i_1}^{i_2}))
16:      end if
17:    end for
18:  end for
19: end function

Figure 4.3: General procedure for phrase-based rule extraction. This version is more practical and closer to a possible implementation than the algorithm in Figure 4.2. Source phrases are considered first. Only if the source constraints C_S are satisfied are target phrases considered. Targets that satisfy the alignment constraints C_L with their source are ranked by R. Finally, phrase pairs whose target satisfies the target constraints can be extracted with a certain count C. Note that the target constraints implicitly depend on the ranking of the targets by R.

4.3.2 Extraction from Viterbi Alignment Links

In this section, we describe the standard extraction procedure within the framework introduced in Section 4.3.1. Common practice takes a fixed set of word alignment links L and extracts rules from this set.
Alignment links L are obtained from the alignment model a either by the Viterbi algorithm or by maximum a posteriori estimation (Matusov et al., 2004; Kumar et al., 2007), possibly using symmetrisation heuristics to combine links obtained from source-to-target and target-to-source alignment models (see Section 2.3.2). We can restate this common approach in the framework proposed in Section 4.3.1 and in Figure 4.3, where the constraints, ranking and counting functions are defined as follows:

• source constraints C_S(f_{j_1}^{j_2}):

j_2 − j_1 < s_max    (4.4)

where s_max is an integer threshold defined experimentally. s_max represents the maximum length of a source phrase.

• alignment constraints C_L(f_{j_1}^{j_2}, e_{i_1}^{i_2}, a):

( ∀ (j, i) ∈ L,  j ∈ [j_1, j_2] ⇔ i ∈ [i_1, i_2] ) ∧ ( L ∩ [j_1, j_2] × [i_1, i_2] ≠ ∅ )    (4.5)

Alignment constraints have already been described in Section 2.5.3 as the conditions required for phrase pair extraction. The first bracketed constraint requires that there be no alignment link between a word inside the phrase pair and a word outside of it. The second bracketed constraint requires that there be at least one alignment link in the phrase pair. Sometimes, an additional constraint specifies that the boundary words in the phrase pair should be aligned. In this work, this constraint is not present. A phrase pair that satisfies Equation 4.5 is said to be consistent with the alignment (see Section 2.5.3).

• target constraints C_T(e_{i_1}^{i_2}, T): no constraint is imposed in this work. Target constraints based on length may be imposed depending on the implementation.

• ranking and counting functions:

R(f_{j_1}^{j_2}, e_{i_1}^{i_2}) = C(f_{j_1}^{j_2}, e_{i_1}^{i_2}) = 1    (4.6)

The above constraints, ranking and counting functions define the standard approach to grammar extraction. In the next sections, we depart from this approach and apply novel functions to rank and count target-side translations according to their quality in the context of each parallel sentence, as defined by the word alignment models. We also depart from common practice in that we do not use a set of links as alignment constraints. We thus have better control over the number of extracted rules as well as the relative frequency estimates of the source-to-target and target-to-source translation models.
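To make this framework concrete, the sketch below instantiates the procedure of Figure 4.3 with the Viterbi-link constraints, ranking and counting functions just described. The function names, the flat data representation and the default values are our own simplifications for illustration; this is not the implementation used in our experiments, and hierarchical rules and target-length constraints are omitted.

```python
# Minimal sketch of the general extraction procedure (Figure 4.3) with the
# Viterbi-link constraints of Section 4.3.2. Sentences are word lists and
# alignment links are (j, i) pairs with 1-based positions.

def viterbi_consistent(links, j1, j2, i1, i2):
    """C_L of Equation 4.5: at least one link inside, no link crossing the boundary."""
    inside = any(j1 <= j <= j2 and i1 <= i <= i2 for j, i in links)
    crossing = any((j1 <= j <= j2) != (i1 <= i <= i2) for j, i in links)
    return inside and not crossing

def extract_rules(f, e, links, s_max=9, C_S=None, C_L=None, C_T=None, R=None, C=None):
    J, I = len(f), len(e)
    C_S = C_S or (lambda j1, j2: j2 - j1 < s_max)                        # Eq. 4.4
    C_L = C_L or (lambda j1, j2, i1, i2: viterbi_consistent(links, j1, j2, i1, i2))
    C_T = C_T or (lambda rank, n_candidates: True)                      # no target constraint
    R = R or (lambda j1, j2, i1, i2: 1.0)                                # Eq. 4.6
    C = C or (lambda j1, j2, i1, i2: 1.0)                                # Eq. 4.6
    rules = []
    for j1 in range(1, J + 1):
        for j2 in range(j1, J + 1):
            if not C_S(j1, j2):
                continue
            targets = [(i1, i2) for i1 in range(1, I + 1) for i2 in range(i1, I + 1)
                       if C_L(j1, j2, i1, i2)]
            targets.sort(key=lambda t: R(j1, j2, *t), reverse=True)
            for rank, (i1, i2) in enumerate(targets):
                if C_T(rank, len(targets)):
                    src = " ".join(f[j1 - 1:j2])
                    trg = " ".join(e[i1 - 1:i2])
                    rules.append(((src, trg), C(j1, j2, i1, i2)))
    return rules

f = "Er hat den Ball gesehen".split()
e = "He has seen the ball".split()
links = {(1, 1), (2, 2), (3, 4), (4, 5), (5, 3)}   # illustrative link set without the spurious link
print(extract_rules(f, e, links)[:3])
```

Swapping in the posterior-based constraints, ranking and counting functions of the following sections only requires replacing the default C_L, R and C arguments.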
4.3.3 Extraction from Posterior Probabilities over Alignment Links

For presentation, we only consider source-to-target alignment models: the random variable a that models the alignment process takes values in functions from source word positions to target word positions. However, it is possible to apply our method with any directional alignment model. We will use the link posterior probability p(a_j = i | f_1^J, e_1^I) to guide rule extraction. This statistic expresses how likely it is that a word f_j in source position j and a word e_i in target position i are aligned, given the sentence pair (f_1^J, e_1^I). The link posterior probability can be computed efficiently for Model 1, Model 2 and the HMM model. In our experiments, we only use the HMM model to compute link posteriors, but comparisons between link posteriors obtained from various models may be interesting in the future. We will derive a closed form solution for these models to compute the link posterior probability. Applying the definition of conditional probability, we obtain the general form of the link posterior probability in Equation 4.7:

p(a_{j_0} = i_0 | f_1^J, e_1^I) = p(a_{j_0} = i_0, f_1^J | e_1^I) / p(f_1^J | e_1^I)    (4.7)

Using Equation 4.7, we will now derive the link posterior probability p_{M1}(a_{j_0} = i_0 | f_1^J, e_1^I) for Model 1, p_{M2}(a_{j_0} = i_0 | f_1^J, e_1^I) for Model 2 and p_{HMM}(a_{j_0} = i_0 | f_1^J, e_1^I) for the HMM model.

4.3.3.1 Link Posterior Probability for Model 1

Let us derive p_{M1}(a_{j_0} = i_0 | f_1^J, e_1^I), the link posterior probability under Model 1. We use the notation from (Brown et al., 1993), where t is the word-to-word translation table and ε is a constant. We compute the numerator from Equation 4.7 by marginalising over all possible alignments and inverting sum and product signs to obtain Equation 4.8:

p_{M1}(a_{j_0} = i_0, f_1^J | e_1^I)
  = Σ_{a_1=0}^{I} ... Σ_{a_{j_0−1}=0}^{I} Σ_{a_{j_0+1}=0}^{I} ... Σ_{a_J=0}^{I} p_{M1}(a_1 ... a_{j_0−1} i_0 a_{j_0+1} ... a_J, f_1^J | e_1^I)
  = Σ_{a_1=0}^{I} ... Σ_{a_{j_0−1}=0}^{I} Σ_{a_{j_0+1}=0}^{I} ... Σ_{a_J=0}^{I} [ε / (1+I)^J] t(f_{j_0} | e_{i_0}) ∏_{j=1, j≠j_0}^{J} t(f_j | e_{a_j})
  = [ε / (1+I)^J] t(f_{j_0} | e_{i_0}) ∏_{j=1, j≠j_0}^{J} Σ_{i=0}^{I} t(f_j | e_i)    (4.8)

We compute the denominator from Equation 4.7 similarly (see Equation (15) in (Brown et al., 1993)) and obtain Equation 4.9:

p_{M1}(f_1^J | e_1^I) = [ε / (1+I)^J] ∏_{j=1}^{J} Σ_{i=0}^{I} t(f_j | e_i)    (4.9)

After simplification, we obtain Equation 4.10 from Equation 4.8 and Equation 4.9:

p_{M1}(a_{j_0} = i_0 | f_1^J, e_1^I) = t(f_{j_0} | e_{i_0}) / Σ_{i=0}^{I} t(f_{j_0} | e_i)    (4.10)

4.3.3.2 Link Posterior Probability for Model 2

We apply the same method to compute p_{M2}(a_{j_0} = i_0 | f_1^J, e_1^I), the link posterior probability for Model 2. We also use the notation from (Brown et al., 1993), but we replace the notation for the alignment probability a(i | j, J, I) by p_a(i | j, J, I) for clarity. We compute the numerator from Equation 4.7 in Equation 4.11:

p_{M2}(a_{j_0} = i_0, f_1^J | e_1^I)
  = Σ_{a_1=0}^{I} ... Σ_{a_{j_0−1}=0}^{I} Σ_{a_{j_0+1}=0}^{I} ... Σ_{a_J=0}^{I} p_{M2}(a_1 ... a_{j_0−1} i_0 a_{j_0+1} ... a_J, f_1^J | e_1^I)
  = Σ_{a_1=0}^{I} ... Σ_{a_{j_0−1}=0}^{I} Σ_{a_{j_0+1}=0}^{I} ... Σ_{a_J=0}^{I} ε p_a(i_0 | j_0, J, I) t(f_{j_0} | e_{i_0}) ∏_{j=1, j≠j_0}^{J} p_a(a_j | j, J, I) t(f_j | e_{a_j})
  = ε p_a(i_0 | j_0, J, I) t(f_{j_0} | e_{i_0}) ∏_{j=1, j≠j_0}^{J} Σ_{i=0}^{I} p_a(i | j, J, I) t(f_j | e_i)    (4.11)

We compute the denominator from Equation 4.7 similarly and obtain Equation 4.12:

p_{M2}(f_1^J | e_1^I) = ε ∏_{j=1}^{J} Σ_{i=0}^{I} p_a(i | j, J, I) t(f_j | e_i)    (4.12)

After simplification, we obtain Equation 4.13 from Equation 4.11 and Equation 4.12:
p_{M2}(a_{j_0} = i_0 | f_1^J, e_1^I) = p_a(i_0 | j_0, J, I) t(f_{j_0} | e_{i_0}) / Σ_{i=0}^{I} p_a(i | j_0, J, I) t(f_{j_0} | e_i)    (4.13)

4.3.3.3 Link Posterior Probability for the HMM Model

We now derive p_{HMM}(a_{j_0} = i_0 | f_1^J, e_1^I), the link posterior probability for the HMM model (Vogel et al., 1996; Rabiner, 1989). These derivations are standard once we realise that the observed sequence is the source sentence f_1^J, that the hidden sequence is a_1^J and that, in addition to standard presentations of HMMs, all probabilities are conditioned on the target sentence e_1^I. We compute the numerator from Equation 4.7 in Equation 4.14:

p_{HMM}(a_{j_0} = i_0, f_1^J | e_1^I)
  = p_{HMM}(a_{j_0} = i_0, f_1^{j_0}, f_{j_0+1}^J | e_1^I)
  = p_{HMM}(f_{j_0+1}^J | a_{j_0} = i_0, f_1^{j_0}, e_1^I) p_{HMM}(a_{j_0} = i_0, f_1^{j_0} | e_1^I)
  = p_{HMM}(f_{j_0+1}^J | a_{j_0} = i_0, e_1^I) p_{HMM}(a_{j_0} = i_0, f_1^{j_0} | e_1^I)
  = β_{j_0}(i_0) α_{j_0}(i_0)    (4.14)

where β_{j_0}(i_0) and α_{j_0}(i_0) are respectively the backward and forward HMM probabilities defined in Equation 4.15:

β_j(i) = p_{HMM}(f_{j+1}^J | a_j = i, e_1^I)
α_j(i) = p_{HMM}(a_j = i, f_1^j | e_1^I)    (4.15)

The forward and backward probabilities can be computed recursively as shown in Equation 4.16 and Equation 4.17:

α_j(i) = p_{HMM}(a_j = i, f_1^j | e_1^I)
  = Σ_{k=0}^{I} p_{HMM}(a_j = i, a_{j−1} = k, f_1^{j−1}, f_j | e_1^I)
  = Σ_{k=0}^{I} p_{HMM}(f_j | a_j = i, a_{j−1} = k, f_1^{j−1}, e_1^I) p_{HMM}(a_j = i, a_{j−1} = k, f_1^{j−1} | e_1^I)
  = Σ_{k=0}^{I} p_{HMM}(f_j | e_i) p_{HMM}(a_j = i | a_{j−1} = k, f_1^{j−1}, e_1^I) p_{HMM}(a_{j−1} = k, f_1^{j−1} | e_1^I)
  = Σ_{k=0}^{I} p_{HMM}(f_j | e_i) p_{HMM}(a_j = i | a_{j−1} = k, I) α_{j−1}(k)    (4.16)

β_j(i) = p_{HMM}(f_{j+1}^J | a_j = i, e_1^I)
  = Σ_{k=0}^{I} p_{HMM}(f_{j+2}^J, a_{j+1} = k, f_{j+1} | a_j = i, e_1^I)
  = Σ_{k=0}^{I} p_{HMM}(f_{j+2}^J | a_{j+1} = k, f_{j+1}, a_j = i, e_1^I) p_{HMM}(a_{j+1} = k, f_{j+1} | a_j = i, e_1^I)
  = Σ_{k=0}^{I} p_{HMM}(f_{j+2}^J | a_{j+1} = k, e_1^I) p_{HMM}(f_{j+1} | a_{j+1} = k, a_j = i, e_1^I) p_{HMM}(a_{j+1} = k | a_j = i, e_1^I)
  = Σ_{k=0}^{I} β_{j+1}(k) p_{HMM}(f_{j+1} | e_k) p_{HMM}(a_{j+1} = k | a_j = i, I)    (4.17)

The denominator from Equation 4.7 is computed in Equation 4.18:

p_{HMM}(f_1^J | e_1^I) = Σ_{k=0}^{I} p_{HMM}(a_J = k, f_1^J | e_1^I) = Σ_{k=0}^{I} α_J(k)    (4.18)

We will use the link posterior probabilities under the HMM model in order to define constraints, ranking and counting functions.

4.3.3.4 Constraints, Ranking and Counting Functions from HMM Link Posterior Probabilities

We use the HMM link posterior probabilities computed in Section 4.3.3.3 in order to define constraints, ranking and counting functions:

• source constraints C_S(f_{j_1}^{j_2}):

j_2 − j_1 < s_max    (4.19)

This is the same constraint as defined for standard Viterbi extraction in Section 4.3.2.

• alignment constraints C_L(f_{j_1}^{j_2}, e_{i_1}^{i_2}, a):

∃ (j, i) ∈ [j_1, j_2] × [i_1, i_2],  p(a_j = i | f_1^J, e_1^I) > λ    (4.20)

∀ (j, i) ∈ [1, J] × [1, I] ∩ {(j, i) : p(a_j = i | f_1^J, e_1^I) > λ},  j ∈ [j_1, j_2] ⇔ i ∈ [i_1, i_2]    (4.21)

where λ is a threshold defined experimentally. Intuitively, λ is a high link posterior probability. The first constraint (Equation 4.20) means that we require at least one link with a high posterior probability in the phrase pair considered. The second constraint (Equation 4.21) means that there should be no link with a high posterior probability that is inconsistent with the phrase pair. Note that these constraints are identical to the Viterbi alignment constraints defined in Section 4.3.2 if we choose L to be the set of all links with high posterior defined in Equation 4.22:

L = {(j, i) ∈ [1, J] × [1, I] : p(a_j = i | f_1^J, e_1^I) > λ}    (4.22)

Also note that the second constraint does not consider links to the null word (see Section 2.3) relevant.
4.3.3.4 Constraints, Ranking and Counting Functions from HMM Link Posterior Probabilities

We use the HMM link posterior probabilities computed in Section 4.3.3.3 in order to define constraints, ranking and counting functions:

• source constraints C_S(f_{j_1}^{j_2}):

j_2 − j_1 < s_max    (4.19)

This is the same constraint as defined for standard Viterbi extraction in Section 4.3.2.

• alignment constraints C_L(f_{j_1}^{j_2}, e_{i_1}^{i_2}, a):

∃(j, i) ∈ [j_1, j_2] × [i_1, i_2] : p(a_j = i | f_1^J, e_1^I) > γ    (4.20)

∀(j, i) ∈ [1, J] × [1, I] ∩ {(j, i) : p(a_j = i | f_1^J, e_1^I) > γ} : j ∈ [j_1, j_2] ⇔ i ∈ [i_1, i_2]    (4.21)

where γ is a threshold defined experimentally. Intuitively, γ is a high link posterior probability. The first constraint (Equation 4.20) means that we require at least one link with a high posterior probability in the phrase pair considered. The second constraint (Equation 4.21) means that there should be no link with a high posterior probability that is inconsistent with the phrase pair. Note that these constraints are identical to the Viterbi alignment constraints defined in Section 4.3.2 if we choose L to be the set of all links with high posterior defined in Equation 4.22:

L = {(j, i) ∈ [1, J] × [1, I] : p(a_j = i | f_1^J, e_1^I) > γ}    (4.22)

Also note that the second constraint does not consider links to the null word (see Section 2.3) relevant. This is because we do not need to include the null word in a translation rule.

• target constraints C_T(e_{i_1}^{i_2}, E): we pick the first k translation candidates according to the ranking function R.

• ranking function:

R(f_{j_1}^{j_2}, e_{i_1}^{i_2}) = ∏_{j=j_1}^{j_2} ( ∑_{i=i_1}^{i_2} p(a_j = i | f_1^J, e_1^I) ) / (i_2 − i_1 + 1)    (4.23)

This ranking function is very similar to the score used for the lexical features described in Section 2.6.5. Here, we use link posteriors instead of Model 1 translation probabilities. This function favours short target phrases, therefore we do not use it as a counting function. Preliminary experiments found that this function is not appropriate for counting rules and that it gives poor results. We therefore use the same counting function as in the standard practice described in Section 4.3.2.

• counting function:

C(f_{j_1}^{j_2}, e_{i_1}^{i_2}) = 1    (4.24)

We have described rule extraction from alignment link posterior probabilities. Next, we will describe rule extraction from alignment posterior probabilities over phrase pairs. This method will use the same source constraints, alignment constraints and target constraints but different ranking and counting functions.

4.3.4 Extraction from Posteriors over Phrase Pairs

In the previous section, we defined and gave closed form solutions to alignment link posterior probabilities for Model 1, Model 2 and the HMM model. We can also define alignment posterior probabilities over phrase pairs. Let us consider the phrase pair 〈f_{j_1}^{j_2}, e_{i_1}^{i_2}〉 in the sentence pair (f_1^J, e_1^I). In Equation 4.25, we define A(j_1, j_2; i_1, i_2), the set of alignments that have no links between the inside of the phrase pair and the outside of the phrase pair:

A(j_1, j_2; i_1, i_2) = {a_1^J : a_j ∈ [i_1, i_2] ⇔ j ∈ [j_1, j_2]}    (4.25)

Alignments in A(j_1, j_2; i_1, i_2) satisfy the consistency constraint defined in Equation 4.5 but do not require a link in [j_1, j_2] × [i_1, i_2]. The posterior probability of these alignments given the sentence pair is defined in Equation 4.26:

p(A(j_1, j_2; i_1, i_2) | e_1^I, f_1^J) = p(f_1^J, A(j_1, j_2; i_1, i_2) | e_1^I) / p(f_1^J | e_1^I)
  = ∑_{a_1^J ∈ A(j_1, j_2; i_1, i_2)} p(f_1^J, a_1^J | e_1^I) / ∑_{a_1^J} p(f_1^J, a_1^J | e_1^I)    (4.26)

We call this quantity the phrase pair posterior probability. We will now derive formulae for the phrase pair posterior probability in the case of Model 1, Model 2 and the HMM model. Again, experiments only use phrase pair posteriors computed from the HMM model, but comparing those with the posteriors obtained from Model 1 and Model 2 may be interesting for future research.

4.3.4.1 Phrase Pair Posterior Probability for Model 1

Let us first define J_in, J_out, I_in and I_out in Equation 4.27 for a set of indices i_1, i_2, j_1, j_2:

J_in = [j_1, j_2]    J_out = [1, J] \ J_in    I_in = [i_1, i_2]    I_out = [0, I] \ I_in    (4.27)

For Model 1, the numerator from Equation 4.26 is obtained in Equation 4.28:

∑_{a_1^J ∈ A(j_1, j_2; i_1, i_2)} p_M1(f_1^J, a_1^J | e_1^I)
  = ∑_{a_1 ∈ I_out} … ∑_{a_{j_1−1} ∈ I_out} ∑_{a_{j_1} ∈ I_in} … ∑_{a_{j_2} ∈ I_in} ∑_{a_{j_2+1} ∈ I_out} … ∑_{a_J ∈ I_out} (ε / (1 + I)^J) ∏_{j=1}^{J} t(f_j | e_{a_j})
  = (ε / (1 + I)^J) ( ∏_{j ∈ J_out} ∑_{i ∈ I_out} t(f_j | e_i) ) ( ∏_{j ∈ J_in} ∑_{i ∈ I_in} t(f_j | e_i) )    (4.28)

The denominator from Equation 4.26 has already been computed in Equation 4.9. Simplifying Equation 4.28 and Equation 4.9, we obtain Equation 4.29:

p_M1(A(j_1, j_2; i_1, i_2) | e_1^I, f_1^J)
  = ( ∏_{j ∈ J_out} ∑_{i ∈ I_out} t(f_j | e_i) / ∑_{i′=0}^{I} t(f_j | e_{i′}) ) ( ∏_{j ∈ J_in} ∑_{i ∈ I_in} t(f_j | e_i) / ∑_{i′=0}^{I} t(f_j | e_{i′}) )
  = ( ∏_{j ∈ J_out} ∑_{i ∈ I_out} p_M1(a_j = i | f_1^J, e_1^I) ) ( ∏_{j ∈ J_in} ∑_{i ∈ I_in} p_M1(a_j = i | f_1^J, e_1^I) )    (4.29)
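Equation 4.29 states that, under Model 1, the phrase pair posterior factorises over source positions into sums of link posteriors restricted to either the inside or the outside of the phrase pair. A minimal sketch of this computation is given below; it assumes a hypothetical precomputed table posterior[j][i] of Model 1 link posteriors (for example from the earlier sketch), with the null word at target index 0.

```python
# Minimal sketch of Equation 4.29. posterior[j][i] is assumed to hold
# p_M1(a_j = i | f, e), with 0-based source index j and target index i,
# where index 0 is the null word. j1, j2, i1, i2 follow the 1-based
# notation of the text.

def model1_phrase_pair_posterior(posterior, J, I, j1, j2, i1, i2):
    j_in = set(range(j1 - 1, j2))        # source positions inside the phrase
    i_in = set(range(i1, i2 + 1))        # target positions inside the phrase
    i_out = set(range(0, I + 1)) - i_in  # includes the null word (index 0)
    prob = 1.0
    for j in range(J):
        inside = i_in if j in j_in else i_out
        prob *= sum(posterior[j][i] for i in inside)
    return prob
```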
4.3.4.2 Phrase Pair Posterior Probability for Model 2

To avoid repetition, we skip the derivation, which is analogous to the derivation for Model 1. We obtain the phrase pair posterior in Equation 4.30:

p_M2(A(j_1, j_2; i_1, i_2) | e_1^I, f_1^J) = ( ∏_{j ∈ J_out} ∑_{i ∈ I_out} p_M2(a_j = i | f_1^J, e_1^I) ) ( ∏_{j ∈ J_in} ∑_{i ∈ I_in} p_M2(a_j = i | f_1^J, e_1^I) )    (4.30)

4.3.4.3 Phrase Pair Posterior Probability for the HMM Model

Let us now compute the phrase pair posterior probability for the HMM model. The denominator from Equation 4.26 can be computed using the forward algorithm while the numerator can be computed using a modified forward algorithm (Deng, 2005). Let us define α̃_j(i), the modified forward probability, in Equation 4.31:

α̃_j(i) = p_HMM(A(j_1, j_2; i_1, i_2), f_1^j, a_j = i | e_1^I)    (4.31)

The numerator from Equation 4.26 can be computed in Equation 4.32:

p(A(j_1, j_2; i_1, i_2), f_1^J | e_1^I) = ∑_{i=0}^{I} α̃_J(i)    (4.32)

The denominator from Equation 4.26 can be computed using the regular forward probability in Equation 4.33:

p(f_1^J | e_1^I) = ∑_{i=0}^{I} α_J(i)    (4.33)

Like the regular forward probability, the modified forward probability can also be computed recursively. We can also write the modified forward probability as in Equation 4.34:

α̃_j(i) = p_HMM(A(j_1, j_2; i_1, i_2), f_1^j, a_j = i | e_1^I)
  = ∑_{a_1^j ∈ A(j_1, j_2; i_1, i_2) : a_j = i} p_HMM(a_1^{j−1}, f_1^j, a_j = i | e_1^I)    (4.34)

The computation of α̃_j(i) is by a constrained forward algorithm where the constraint is given in Equation 4.35. This is because an alignment in A(j_1, j_2; i_1, i_2) cannot have a link from inside the phrase pair to outside the phrase pair (see Equation 4.25):

∀(j, i) ∈ J_out × I_in ∪ J_in × I_out,   α̃_j(i) = 0    (4.35)

For a link (j, i) ∈ J_out × I_out ∪ J_in × I_in that satisfies the constraint from Equation 4.35, we can derive the modified forward probability in Equation 4.36:

α̃_j(i) = ∑_{a_1^{j−1} ∈ A(j_1, j_2; i_1, i_2)} p_HMM(f_j, f_1^{j−1}, a_j = i, a_1^{j−1} | e_1^I)
  = p_HMM(f_j | e_i) ∑_{a_1^{j−1} ∈ A(j_1, j_2; i_1, i_2)} p_HMM(a_j = i | a_{j−1}, I) p_HMM(f_1^{j−1}, a_1^{j−1} | e_1^I)
  = p_HMM(f_j | e_i) ∑_{k=0}^{I} p_HMM(a_j = i | a_{j−1} = k, I) ∑_{a_1^{j−1} ∈ A(j_1, j_2; i_1, i_2) : a_{j−1} = k} p_HMM(f_1^{j−1}, a_{j−1} = k, a_1^{j−2} | e_1^I)
  = p_HMM(f_j | e_i) ∑_{k=0}^{I} p_HMM(a_j = i | a_{j−1} = k, I) α̃_{j−1}(k)    (4.36)

We will use the phrase pair posterior probabilities under the HMM model in order to define ranking and counting functions.
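The constrained forward algorithm of Equations 4.31 to 4.36 is the regular forward recursion with the cells forbidden by Equation 4.35 forced to zero; dividing the constrained total by the unconstrained total then gives the phrase pair posterior of Equation 4.26. A minimal sketch is given below; it reuses the hypothetical emit, trans and init lookups of the earlier HMM sketch and again ignores null-word handling.

```python
# Minimal sketch of the constrained forward algorithm (Equations 4.31-4.36)
# and the HMM phrase pair posterior (Equation 4.26), under the assumptions
# stated above. j1, j2, i1, i2 are 1-based phrase boundaries as in the text;
# target states i run over 0..I-1 here and the null word is ignored.

def hmm_phrase_pair_posterior(f, e, emit, trans, init, j1, j2, i1, i2):
    J, I = len(f), len(e)
    j_in = set(range(j1 - 1, j2))
    i_in = set(range(i1 - 1, i2))

    def allowed(j, i):
        # Equation 4.35: no link may cross the phrase pair boundary.
        return (j in j_in) == (i in i_in)

    def forward(constrained):
        a = [[0.0] * I for _ in range(J)]
        for i in range(I):
            if not constrained or allowed(0, i):
                a[0][i] = init(i, I) * emit[(f[0], e[i])]
        for j in range(1, J):
            for i in range(I):
                if constrained and not allowed(j, i):
                    continue  # alpha-tilde is forced to zero (Equation 4.35)
                a[j][i] = emit[(f[j], e[i])] * sum(
                    trans(i, k, I) * a[j - 1][k] for k in range(I))
        return sum(a[J - 1][i] for i in range(I))

    # Equation 4.26: constrained mass (numerator) over total mass (denominator).
    return forward(True) / forward(False)
```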
4.3.4.4 Constraints, Ranking and Counting Functions from HMM Link and Phrase Posterior Probabilities

In order to keep the size of the rule set manageable, we use the same source constraints, alignment constraints and target constraints as for link posterior extraction, defined in Section 4.3.3. We use the phrase pair posterior probabilities under the HMM model both for ranking and for scoring extracted rules. This approach assigns a fractional count to each extracted rule, which allows finer estimation of the source-to-target and target-to-source translation models. The ranking and counting functions are defined in Equation 4.37:

R(f_{j_1}^{j_2}, e_{i_1}^{i_2}) = C(f_{j_1}^{j_2}, e_{i_1}^{i_2}) = p_HMM(A(j_1, j_2; i_1, i_2) | f_1^J, e_1^I)    (4.37)

Under the framework described in Section 4.3.1, we have described standard rule extraction from Viterbi alignments and two novel approaches to rule extraction based on link posterior probabilities and phrase pair posterior probabilities. In order to expose the concepts with more clarity, we have restricted the presentation to the extraction of phrase-based rules as opposed to hierarchical rules. We will now show how to generalise the techniques presented so far to the extraction of hierarchical rules.

4.3.5 Hierarchical Rule Extraction

In this section, we extend the techniques presented so far to hierarchical rule extraction. In order to avoid repetition, we describe these techniques for the rule pattern 〈wXw, wXw〉 only (see Section 2.6.2.4 for the definition of patterns). We first rewrite the algorithm in Figure 4.3 into the algorithm in Figure 4.4 for this pattern. We will now describe constraints, counting and ranking functions for standard Viterbi extraction and for extraction from alignment posteriors.

1: function ExtractRules(f_1^J, e_1^I, a)
2:   for 1 ≤ j_1 ≤ j_2 < j_3 ≤ j_4 ≤ J do
3:     if ¬C_S(f_{j_1}^{j_2} X f_{j_3}^{j_4}) then    ▷ Source constraints
4:       continue
5:     end if
6:     E ← ∅    ▷ Sorted hierarchical target phrases
7:     for 1 ≤ i_1 ≤ i_2 < i_3 ≤ i_4 ≤ I do
8:       if C_L(f_{j_1}^{j_2} X f_{j_3}^{j_4}, e_{i_1}^{i_2} X e_{i_3}^{i_4}, a) then    ▷ Alignment constraints
9:         E ← E ∪ {e_{i_1}^{i_2} X e_{i_3}^{i_4}}
10:      end if
11:    end for
12:    Sort(E, R)    ▷ Hierarchical target phrases ranked according to R
13:    for e_{i_1}^{i_2} X e_{i_3}^{i_4} ∈ E do
14:      if C_T(e_{i_1}^{i_2} X e_{i_3}^{i_4}, E) then    ▷ Target constraints
15:        Extract(X → 〈f_{j_1}^{j_2} X f_{j_3}^{j_4}, e_{i_1}^{i_2} X e_{i_3}^{i_4}〉, C(f_{j_1}^{j_2} X f_{j_3}^{j_4}, e_{i_1}^{i_2} X e_{i_3}^{i_4}))
16:      end if
17:    end for
18:  end for
19: end function

Figure 4.4: General procedure for hierarchical phrase-based rule extraction. The procedure is presented for the pattern 〈wXw, wXw〉 only. This algorithm is analogous to the algorithm used to extract phrase-based rules and presented in Figure 4.3.

4.3.5.1 Hierarchical Rule Extraction from Viterbi Alignment Links

We now describe the constraints, ranking and counting functions for hierarchical rule extraction from Viterbi alignments:

• source constraints C_S(f_{j_1}^{j_2} X f_{j_3}^{j_4}):

(j_2 − j_1 + 1) + 1 + (j_4 − j_3 + 1) ≤ s_max-elements
j_2 − j_1 < s_max-terminals
j_4 − j_3 < s_max-terminals
span(X) ≤ s_max-NT    (4.38)

We recall the definitions introduced in Section 3.5.2 for the thresholds in Equation 4.38: s_max-elements is the maximum number of terminals and nonterminals in the source, s_max-terminals is the maximum number of consecutive terminals in the source, and s_max-NT is the maximum span for a nonterminal, i.e. the maximum number of terminals covered by a nonterminal. These thresholds are defined experimentally.

• alignment constraints C_L(f_{j_1}^{j_2} X f_{j_3}^{j_4}, e_{i_1}^{i_2} X e_{i_3}^{i_4}, a):

∀(j, i) ∈ L, j ∈ [j_1, j_4] ⇔ i ∈ [i_1, i_4]
L ∩ [j_1, j_4] × [i_1, i_4] ≠ ∅
∀(j, i) ∈ L, j ∈ (j_2, j_3) ⇔ i ∈ (i_2, i_3)
L ∩ {j_2 + 1} × (i_2, i_3) ≠ ∅
L ∩ {j_3 − 1} × (i_2, i_3) ≠ ∅
L ∩ (j_2, j_3) × {i_2 + 1} ≠ ∅
L ∩ (j_2, j_3) × {i_3 − 1} ≠ ∅    (4.39)

The first two constraints require that the phrase pair (f_{j_1}^{j_4}, e_{i_1}^{i_4}) be consistent with the links L. The third constraint requires that there be no link from inside the phrase pair (f_{j_2+1}^{j_3−1}, e_{i_2+1}^{i_3−1}), which corresponds to the nonterminal X, to outside that phrase pair. The last four constraints mean that the boundary words f_{j_2+1}, f_{j_3−1}, e_{i_2+1} and e_{i_3−1} are not unaligned.

• target constraints C_T(e_{i_1}^{i_2} X e_{i_3}^{i_4}, E): no constraint. Again, other implementations may impose, for example, length-based constraints on the target.

• ranking and counting functions:

R(f_{j_1}^{j_2} X f_{j_3}^{j_4}, e_{i_1}^{i_2} X e_{i_3}^{i_4}) = C(f_{j_1}^{j_2} X f_{j_3}^{j_4}, e_{i_1}^{i_2} X e_{i_3}^{i_4}) = 1    (4.40)
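A minimal Python rendering of the loop in Figure 4.4 is given below. It is only a sketch of the control flow: the callbacks source_ok, align_ok, target_ok, rank and count are hypothetical stand-ins for C_S, C_L, C_T, R and C, and none of the indexing or pruning optimisations used in the actual extraction infrastructure are shown.

```python
# Minimal sketch of the extraction loop of Figure 4.4 for the pattern
# <w X w, w X w>. Spans are 1-based and inclusive, as in the text, and all
# constraint/ranking/counting callbacks are hypothetical.

def extract_wXw_rules(f, e, links, source_ok, align_ok, target_ok, rank, count):
    J, I = len(f), len(e)
    rules = []
    for j1 in range(1, J + 1):
        for j2 in range(j1, J + 1):
            for j3 in range(j2 + 1, J + 1):
                for j4 in range(j3, J + 1):
                    src = (j1, j2, j3, j4)
                    if not source_ok(src):
                        continue
                    # Collect target spans satisfying the alignment constraints.
                    targets = [(i1, i2, i3, i4)
                               for i1 in range(1, I + 1)
                               for i2 in range(i1, I + 1)
                               for i3 in range(i2 + 1, I + 1)
                               for i4 in range(i3, I + 1)
                               if align_ok(src, (i1, i2, i3, i4), links)]
                    # Rank target candidates and apply the target constraints.
                    targets.sort(key=lambda tgt: rank(src, tgt), reverse=True)
                    for k, tgt in enumerate(targets):
                        if target_ok(tgt, k, targets):
                            rules.append((src, tgt, count(src, tgt)))
    return rules
```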
4.3.5.2 Hierarchical Rule Extraction from Link Posterior Probabilities

We now describe the constraints, ranking and counting functions for the link posterior extraction of hierarchical rules:

• source constraints C_S(f_{j_1}^{j_2} X f_{j_3}^{j_4}):

j_2 − j_1 < s_hier-max
j_4 − j_3 < s_hier-max
j_3 − j_2 < s_max-NT    (4.41)

s_hier-max is the maximum length for consecutive terminals in the source and s_max-NT is the maximum number of terminals covered by a nonterminal. These thresholds are defined experimentally.

• For the alignment constraints, let us first define J_in, J_out, I_in and I_out in Equation 4.42 for a set of indices i_1, i_2, i_3, i_4, j_1, j_2, j_3, j_4:

J_in = [j_1, j_2] ∪ [j_3, j_4]    J_out = [1, J] \ J_in    I_in = [i_1, i_2] ∪ [i_3, i_4]    I_out = [0, I] \ I_in    (4.42)

The alignment constraints C_L(f_{j_1}^{j_2} X f_{j_3}^{j_4}, e_{i_1}^{i_2} X e_{i_3}^{i_4}, a) are defined in Equation 4.43:

∃(j, i) ∈ (j_2, j_3) × (i_2, i_3) : p(a_j = i | f_1^J, e_1^I) > γ
∃(j, i) ∈ [j_1, j_2] × I_in : p(a_j = i | f_1^J, e_1^I) > γ
∃(j, i) ∈ [j_3, j_4] × I_in : p(a_j = i | f_1^J, e_1^I) > γ
∃(j, i) ∈ J_in × [i_1, i_2] : p(a_j = i | f_1^J, e_1^I) > γ
∃(j, i) ∈ J_in × [i_3, i_4] : p(a_j = i | f_1^J, e_1^I) > γ
∀(j, i) ∈ [1, J] × [1, I] ∩ {(j, i) : p(a_j = i | f_1^J, e_1^I) > γ} :
    j ∈ (j_2, j_3) ⇔ i ∈ (i_2, i_3)   and   j ∈ J_in ⇔ i ∈ I_in    (4.43)

• target constraints C_T(e_{i_1}^{i_2} X e_{i_3}^{i_4}, E):

i_2 − i_1 < t_hier-max
i_4 − i_3 < t_hier-max    (4.44)

t_hier-max is the maximum number of consecutive terminals in the target side. Another constraint requires us to pick the first k targets per source according to the ranking function R. In the experiments to follow, k = 3.

• ranking function:

R(f_{j_1}^{j_2} X f_{j_3}^{j_4}, e_{i_1}^{i_2} X e_{i_3}^{i_4}) = ∏_{j ∈ J_in} ( ∑_{i ∈ I_in} p(a_j = i | f_1^J, e_1^I) ) / (i_2 − i_1 + i_4 − i_3 + 2)    (4.45)

• counting function:

C(f_{j_1}^{j_2} X f_{j_3}^{j_4}, e_{i_1}^{i_2} X e_{i_3}^{i_4}) = 1    (4.46)

4.3.5.3 Hierarchical Rule Extraction from Phrase Posterior Probabilities

For hierarchical rule extraction based on phrase pair posteriors, we use the same constraints as for hierarchical rule extraction based on link posteriors. The ranking and counting functions are defined in terms of alignment posterior probabilities over hierarchical phrase pairs. We define A(j_1, j_2; j_3, j_4; i_1, i_2; i_3, i_4) in Equation 4.47:

A(j_1, j_2; j_3, j_4; i_1, i_2; i_3, i_4) = {a_1^J : a_j ∈ I_in ⇔ j ∈ J_in}    (4.47)

where I_in and J_in have been defined in Equation 4.42. We can now define the ranking and counting functions in Equation 4.48:

R(f_{j_1}^{j_2} X f_{j_3}^{j_4}, e_{i_1}^{i_2} X e_{i_3}^{i_4}) = C(f_{j_1}^{j_2} X f_{j_3}^{j_4}, e_{i_1}^{i_2} X e_{i_3}^{i_4}) = p_HMM(A(j_1, j_2; j_3, j_4; i_1, i_2; i_3, i_4) | f_1^J, e_1^I)    (4.48)

In order to compute these functions for the HMM model, we use the same constrained forward algorithm from Section 4.3.4. This time, the constraints are given by Equation 4.49:

∀(j, i) ∈ J_out × I_in ∪ J_in × I_out,   α̃_j(i) = 0    (4.49)

where I_in, J_in, I_out and J_out have been defined in Equation 4.42. We have presented a framework to extract phrase-based rules and hierarchical rules.
This framework encompasses standard extraction from Viterbi alignment links as well as two novel methods that use link posterior prob- abilities and phrase pair posterior probabilities. We will now evaluate our original methods, first by assessing the expressiveness of the grammars ex- tracted with these methods and then by measuring translation quality. CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 96 G0 G1 G2 G3 G4 S→〈X,X〉 X→〈w X,X w〉 X→〈w X,X w〉 X→〈w X,X w〉 X→〈w X,X w〉 S→〈S X,S X〉 X→〈X w,w X〉 X→〈X w,w X〉 X→〈X w,w X〉 X→〈X w,w X〉 X→〈w,w〉 X→〈w X,w X〉 X→〈w X,w X〉 X→〈X1wX2,wX2X1〉 X→〈w X w,w X w〉 X→〈X1wX2,X2X1w〉 Table 4.1: Hierarchical phrase-based grammars containing different types of rules. The grammar expressiveness is greater as more types of rules are included. In addition to the rules shown in the respective columns, G), G2, G+ and G4 also contain the rules of G(. 4.4 Experiments In this section, we carry out experiments in order to demonstrate the ef- fectiveness of our original grammar extraction method. We will assess the quality of the grammars extracted in terms of how expressive these grammars are, and in terms of translation quality. In Section 4.4.1, we define the pat- terns to be included in the grammars used for experiments. In Section 4.4.2, we describe our experimental setup. In Section 4.4.3 and Section 4.4.4, we report results for grammar expressiveness and translation quality. In Sec- tion 4.4.5, we compare our two original methods: link posterior extraction vs. phrase pair posterior extraction. Finally, in Section 4.4.6, we explore various methods to exploit information from source-to-target and target-to- source alignment models. 4.4.1 Experimental Setup: Grammar Pattern Defini- tion In this section we define the hierarchical phrase-based grammars we use for translation experiments. Each grammar is defined by the patterns it contains (see Section 2.6.2.4). We first start with a basic grammar G( defined in the left-most column of Table 4.1. G( is a monotonic phrase-based translation grammar. It includes all phrase-based rules, represented by the rule pattern m → 〈wPw〉, and the two glue rules that allow concatenation. Our approach is to extend this grammar by successively incorporating sets of hierarchical rules, or patterns. The goal is to obtain a grammar with few rule patterns but capable of generating a large collection of translation candidates for a given input sentence. Patterns were added to the base grammar G( according to their frequency CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 97 of use in a translation experiment. We analysed one iteration of minimum error rate training. For each sentence, the decoder generates a 1000-best list of hypotheses. For each hypothesis, the decoder also generates its single best derivation. We extracted rules from the single best derivations of the single best hypotheses and analysed their pattern frequency. Table 4.2 shows the most frequently used rule patterns. We assume that if a rule pattern is used frequently, this means that it is needed in a grammar to produce an acceptable translation. 〈source , target〉 〈w , w〉 〈w X w , w X w〉 〈w X , X w〉 〈X w , w X〉 〈X2 w X1 , X1 X2 w〉 〈X2 w X1 , w X1 X2〉 〈X2 w X1 , X1 w X2〉 〈X w , w X w〉 〈X2 w X1 , X1 w X2 w〉 〈X2 w X1 , w X1 w X2〉 〈w X1 w X2 , w X1 X2〉 〈w X , w X w〉 〈w X w , w X〉 〈X1 w X2 w , X1 w X2〉 Table 4.2: Most frequently used rule patterns. 
Frequency was measured by running the decoder on a development set and counting rule patterns on the single best derivations of the single best hypotheses. We hypothesise that a frequent rule pattern is needed to produce a translation with good quality.

The grammars used in experiments are summarised in Table 4.1. Each of these grammars will first be evaluated for expressiveness: we will measure how often a target sentence can be recovered from a source sentence in decoding. We will also evaluate these grammars for translation quality.

4.4.2 Experimental Setup

We report on experiments in Chinese to English translation. Parallel training data consists of the collections listed in Table 4.3. This is approximately 50M words per language. We report translation results on a development set tune-nw and a test set test-nw1. These contain translations produced by the GALE program and portions of the newswire sections of NIST MT02 through MT06.1 They contain 1755 sentences and 1671 sentences respectively. Results are also reported on a smaller held-out test set test-nw2, containing 60% of the NIST newswire portion of MT06, that is, 369 sentences. In Section 4.4.5, the entire newswire portion of MT06 is used and the test set is named test-nw3.

1 http://www.itl.nist.gov/iad/mig/tests/mt

ADSO_v5       LDC2006E26    LDC2007E87
LDC2002L27    LDC2006E34    LDC2008E40
LDC2003E07    LDC2006E85    LDC2008E56
LDC2003E14    LDC2006E92    LDC2008G05
LDC2005E83    LDC2006G05    LDC2009E16
LDC2005T06    LDC2007E06    LDC2009G01
LDC2005T10    LDC2007E101   proj_syndicate
LDC2005T34    LDC2007E103   UMD_NewsExplorer
LDC2006E24    LDC2007E46    Wikipedia

Table 4.3: List of collections used as parallel training data for translation experiments.

The parallel texts for both language pairs are aligned using MTTK (Deng and Byrne, 2005, 2008). For decoding, we use HiFST (see Section 2.9). The language model is a 4-gram language model estimated over the English side of the parallel text and the AFP and Xinhua portions of the English Gigaword Fourth Edition (LDC2009T13) for the first pass translation, interpolated with a zero-cutoff Stupid Backoff 5-gram (see Section 2.7.4 and Section 3.2) estimated using 10.5B words of English newswire text for 5-gram language model rescoring. In tuning the systems, standard minimum error rate training iterative parameter estimation under BLEU is performed on the development set.

Rules with a source-to-target probability less than 0.01 are discarded. In addition, rules that did not occur more than nzms times are discarded. For Viterbi extraction, nzms = 1. For link posterior extraction, nzms = 2 and for phrase pair posterior extraction, nzms = 0.2. These thresholds were defined experimentally and provide competitive performance for the various conditions without giving unreasonably slow decoding times.

The thresholds introduced in previous sections are defined as follows:

• s_max is set to 9 for Viterbi extraction and to 5 for posterior extraction. A preliminary experiment showed that for posterior extraction, setting s_max to 9 did not provide any improvements and slowed down decoding times.
• s_max-elements is set to 5.
• s_max-terminals is set to 5.
• s_max-NT is set to 10.
• γ is set to 0.5.
• s_hier-max and t_hier-max are set to 3.

The thresholds relating to Viterbi extraction (s_max, s_max-elements, s_max-terminals, s_max-NT) are standard in our translation systems. γ, s_hier-max and t_hier-max were chosen experimentally and provide competitive performance without slowing down decoding unreasonably.
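As an illustration of the filtering just described, the following minimal sketch applies the source-to-target probability cutoff and the minimum count nzms to hypothetical rule records of the form (rule, count, source-to-target probability); counts may be fractional under phrase pair posterior extraction, which is why nzms can take a value such as 0.2.

```python
# Minimal sketch of the rule filtering described above, over hypothetical
# rule records (rule, count, p_s2t). Not the thesis implementation.

def filter_rules(rules, n_min, p_min=0.01):
    """Keep rules with source-to-target probability >= p_min and count > n_min."""
    return [(rule, count, p_s2t)
            for rule, count, p_s2t in rules
            if p_s2t >= p_min and count > n_min]

# e.g. filter_rules(rules, n_min=1)    for Viterbi extraction
#      filter_rules(rules, n_min=0.2)  for phrase pair posterior extraction
```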
4.4.3 Grammar Coverage

We first evaluate the grammars defined in Section 4.4.1 for expressiveness. We measure the ability of different grammars to produce a reference translation given an input sentence. Rules are extracted from the sentence pair we want to align and we run the decoder in alignment mode (de Gispert et al., 2010a), which is equivalent to replacing the language model by an acceptor for the reference translation. We compare the percentage of references that can be successfully produced by grammars G_0, G_1, G_2 and G_3 (we did not include G_4 in the coverage analysis) for the following extraction methods:

• Viterbi (V): this is the standard extraction method based on a set of alignment links. We distinguish four cases, depending on the model used to obtain the set of links: source-to-target (V-st), target-to-source (V-ts), and two common symmetrisation strategies: union (V-union) and grow-diag-final (V-gdf) (see Section 2.3.2).

• Word Link Posteriors (WP): the extraction method is based on the link posteriors described in Section 4.3.3. These rules can be obtained either from the posteriors of the source-to-target (WP-st) or the target-to-source (WP-ts) alignment models. We apply the constraints described in Section 4.3.3. We do not report alignment percentages when using phrase pair posteriors as they are roughly identical to the WP case.

• Finally, in both cases, we also report results when merging the rules extracted in both directions into a single rule set (V-merge and WP-merge).

Figure 4.5: Percentage of parallel sentences successfully aligned for various extraction methods and grammars.

Figure 4.5 shows the results obtained for a random selection of 10,000 parallel corpus sentences. As expected, we can see that for any extraction method, the percentage of aligned sentences increases when switching from G_0 to G_1, G_2 and G_3. Posterior-based extraction is shown to outperform standard methods based on a Viterbi set of alignment links for nearly all grammars. The highest alignment percentages are obtained when merging rules obtained under models trained in each direction (WP-merge), approximately reaching 80% for grammar G_3.

The maximum rule span (s_max-span defined in Section 3.5.2) in alignment was allowed to be 15 words, so as to be similar to translation, where the maximum rule span is 10 words. Relaxing this in alignment to 30 words yields approximately 90% coverage for WP-merge under G_3. We note that if source constraints, alignment constraints and target constraints were not applied, then alignment percentages would be 100% even for G_0, but the extracted grammar would include many noisy rules with poor generalisation power, for example entire sentence pairs.

4.4.4 Translation Results

In this section we investigate the translation performance of each hierarchical grammar, as defined by rules obtained from three rule extraction methods:

• Viterbi union (V-union): standard rule extraction from the union of the source-to-target and target-to-source alignment link sets.

• Word Posteriors (WP-st): extraction based on word link posteriors as described in Section 4.3.3. The posteriors are provided by the source-to-target alignment model.
• Phrase Posteriors (PP-st): extraction based on alignment posteriors over phrase pairs, as described in Section 4.3.4, with fractional counts equal to the phrase pair posterior probability under the source-to-target alignment model.

Table 4.4 reports the translation results. It also shows the following decoding statistics as measured on the tune-nw set: decoding time in seconds per input word, and number of instances of search pruning per input word. Preliminary experimental work was conducted on grammars G_1 and G_2. Because of slow decoding times, G_4 was not included in this set of experiments; instead, grammar G_4 was used to contrast the WP and PP methods in Section 4.4.5. As a contrast, we extract rules according to the heuristics introduced by Chiang (2007) and apply the filters described by Iglesias et al. (2009a) to generate a standard hierarchical phrase-based grammar G_S described in Section 2.6.2.4. This uses rules with up to two nonadjacent nonterminals, but excludes identical rule types such as X→〈w X, w X〉 or X→〈w X_1 w X_2, w X_1 w X_2〉, which were reported to cause computational difficulties without a clear improvement in translation (Iglesias et al., 2009a).

Grammar  Extraction  # Rules  time  prune  tune-nw BLEU  test-nw1 BLEU  test-nw2 BLEU
G_S      V-union     979k     3.7   0.3    35.1          35.6           37.6
G_1      V-union     614k     0.4   0.0    33.6          34.6           36.4
G_1      WP-st       920k     0.9   0.0    34.3          34.8           37.5
G_1      PP-st       894k     1.4   0.0    34.4          35.1           37.7
G_2      V-union     735k     1.0   0.0    34.5          35.4           37.2
G_2      WP-st       1132k    5.8   0.5    35.1          36.0           37.7
G_2      PP-st       1238k    7.8   0.7    35.5          36.4           38.2
G_3      V-union     967k     1.2   0.0    34.9          35.3           37.0
G_3      WP-st       2681k    8.3   1.1    35.1          36.2           37.9
G_3      PP-st       5002k    10.7  2.6    35.5          36.4           38.5

Table 4.4: Chinese to English translation results with alternative grammars and extraction methods (lower-cased BLEU shown). The number of rules and the time (secs/word) and prune (times/word) measurements are reported on the tune-nw set.

Condition    tune-nw                  test-nw1
G_S/V-union  34.60/34.56/0.02/0.41    34.89/34.73/0.03/0.52
G_2/PP-st    34.74/34.56/0.02/0.44    35.04/34.91/0.06/0.62

Table 4.5: Repeated MERT runs with different random parameter initialisations. Original first-pass BLEU score, BLEU score mean, variance and spread (in original/mean/variance/spread format) are reported for two conditions. The results show that the optimisation is relatively stable.

MERT stability For the results reported in Table 4.4, the MERT parameter optimisation was carried out once. Following Clark et al. (2011), we run the MERT parameter optimisation step 10 times from different initial random parameters for the following conditions: G_S/V-union and G_2/PP-st. We report the original first-pass decoding BLEU score and the first-pass decoding BLEU score mean, variance and spread (i.e. difference between minimum and maximum) for the additional 10 runs on the tune-nw and test-nw1 sets in Table 4.5. The results show that the optimisation is relatively stable. In addition, we can see that the gain in BLEU for our method on the test set is consistent for the original run and for the average of the additional 10 runs.

Grammar complexity As expected, for the standard extraction method (see rows entitled V-union), grammar G_1 is shown to underperform all other grammars due to its structural limitations: in parsing the source, G_1 does not allow inversions after concatenating two phrases. Grammar G_2 obtains much better scores, nearly generating the same translation quality as the baseline grammar G_S.
Finally, G+ does not prove able to outperform G2 consistently, which suggests that the phrase-disjoint rules with one nonter- minal are redundant for the translation grammar. Rule extraction method For all grammars, we find that the proposed ex- traction methods based on alignment posteriors outperform standard Viterbi- based extraction, with improvements ranging from 0.5 to 1.1 BLEU points for test-nw1 (depending on the grammar) and from 1.0 to 1.5 for test-nw2. In all cases, the use of phrase posteriors PP is the best option. Interestingly, we find that G2 extracted with WP and PP methods outperforms the more complex GS grammar as obtained from Viterbi alignments. Rule set statistics For grammarG2 and for the tune-nw set, Viterbi-based extraction produces 0.7M rules, while the WP and PP extraction methods yield 1.1M and 1.2M rules, respectively. We further analyse the sets of rules m → 〈 P 〉 in terms of the number of distinct source and target sequences and which are extracted. Viterbi extraction yields 82k distinct source sequences whereas the WP and PP methods yield 116k and 146k sequences, respectively. In terms of the average number of target sequences for each source sequence, Viterbi extraction yields an average of 8.7 while WP and PP yield 9.7 and 8.4 rules on average. This shows that method PP yields wider coverage but with sharper source-to-target rule translation probability distributions than method WP, as the average number of translations per rule is determined by a threshold of 0.01 for the minimum source-to-target probability. Decoding time and pruning in search In connection to the previous comments, we find an increased need for search pruning, and subsequently slower decoding speed, as the search space grows larger with methods WP and PP. A larger search space is created by the larger rule sets, which allows the system to generate new hypotheses of better quality. CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 104 4.4.5 Comparison between WP and PP Section 4.4.4 has shown that the PP extraction method gives the best trans- lation results in our experiments. Reasons for this may be that more rules were extracted and the translation models were better estimated. These re- sults were obtained with a fixed value of the parameter nzms that was found to give good results for each method. Now, we would like to observe the effect of varying nzms. Given an extraction method, we want to observe the effect of decreasing nzms, that is augmenting the size of the rule set. Additionally, we want to obtain two comparable rule sets in terms of statistics such as number of rules, number of different rule source sides for both phrasal and hierarchi- cal rules, etc., for two different extraction methods in order to observe the effect of estimating the translation model with one method versus the other. Table 4.6 summarises the results. The table shows 4 configurations: the WP extraction method with two different values of nzms and the PP extrac- tion method with two different values of nzms. Configurations (WP nzms=2) and (PP nzms=1) give comparable rule sets. Configurations (WP nzms=1) and (PP nzms=0.2) also give comparable rule sets in terms of size. We first study the effect of decreasing nzms for each extraction method. For the WP method, decreasing nzms from 2 to 1 leads to a average decrease of 0.1 BLEU computed on the test sets for different grammars. 
We believe that increasing the size of the rule set can lead to more pruning and therefore to a degradation in translation performance. For the PP method, decreasing nzms from 1 to 0.2 leads to an average gain of 0.2 BLEU. We conclude that the PP method is more robust than the WP method with larger sets of rules. We then study the effect of using phrase pair posteriors (in PP) versus using integer counts (in WP) to estimate translation models for comparable rule sets. The configurations (PP nzms=1) and (PP nzms=0.2), with respect to the configurations (WP nzms=2) and (WP nzms=1), present an average gain of 0.2 BLEU. This shows that translation model estimation using phrase pair posteriors is beneficial in translation.

Configuration   G_1                            G_2                            G_4
                tune-nw  test-nw1  test-nw3    tune-nw  test-nw1  test-nw3    tune-nw  test-nw1  test-nw3
WP nzms=2       34.3     34.8      22.7        35.1     36.0      23.0        34.9     35.5      23.0
WP nzms=1       34.4     35.1      22.4        35.2     36.0      23.2        34.7     35.4      22.4
PP nzms=1       34.5     34.7      22.4        35.3     36.2      23.2        34.8     35.6      23.1
PP nzms=0.2     34.4     35.1      22.8        35.5     36.4      23.3        34.9     35.7      22.9

Table 4.6: Performance comparison measured by lowercase BLEU across different grammars for different values of nzms.

4.4.6 Symmetrising Alignments of Parallel Text

In this section, we investigate extraction from alignments (and posterior distributions) over parallel text which are generated using alignment models trained in the source-to-target (st) and target-to-source (ts) directions. Our motivation is that symmetrisation strategies have been reported to be beneficial for Viterbi extraction methods (Koehn et al., 2003). Results are shown in Table 4.7 for grammar G_2.

We find that rules extracted under the source-to-target alignment models (V-st, WP-st and PP-st) consistently perform better than the V-ts, WP-ts and PP-ts cases. Also, for Viterbi extraction we find that the source-to-target V-st case performs better than any of the symmetrisation strategies, which is not consistent with previous findings for non-hierarchical phrase-based systems (Koehn et al., 2003).

We use the PP rule extraction method to extract two sets of rules, under the st and ts alignment models respectively. We now investigate two ways of merging these sets into a single grammar for translation. The first strategy is PP-merge and merges both rule sets by assigning to each rule the maximum count assigned by either alignment model. We then extend the previous strategy by adding three binary feature functions to the system, indicating whether the rule was extracted under the st model, the ts model or both. The motivation is that minimum error rate training can weight rules differently according to the alignment model they were extracted from. However, we do not find any improvement with either strategy.

Finally, we use linearised lattice minimum Bayes-risk decoding to combine translation lattices (see Section 2.10.3) as produced by rules extracted under each alignment direction (see rows named LMBR(V-st,V-ts) and LMBR(PP-st,PP-ts)). Gains are consistent when comparing this to applying lattice minimum Bayes' risk to each of the best individual systems (rows named LMBR(V-st) and LMBR(PP-st)).
Overall, the best-performing strategy is to extract two sets of translation rules under the phrase pair pos- teriors in each translation direction, and then to perform translation twice and combine the output translations. 4.4.7 Additional Language Pair: Russian-English We also investigated our proposed method in the context of the Russian- English language pair, which presents less reordering challenges than Chinese- English. We repeated experiments from Table 4.4 with the Viterbi method as a baseline and the phrase-pair posterior method, which showed the best results for Chinese-English. We use the same word alignment models and development sets as for the system described in Section 5.2. Results are pre- sented in Table 4.8. For this language pair, the translation quality obtained by the Viterbi method and the phrase-pair extraction method is comparable, CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 107 Rule Extraction tune-nw test-nw1 test-nw2 V-st 34.7 35.6 37.5 V-ts 34.0 34.8 36.6 V-union 34.5 35.4 37.2 V-gdf 34.4 35.3 37.1 WP-st 35.1 36.0 37.7 WP-ts 34.5 35.0 37.0 PP-st 35.5 36.4 38.2 PP-ts 34.8 35.3 37.2 PP-merge 35.5 36.4 38.4 PP-merge-MERT 35.5 36.4 38.3 LMBR(V-st) 35.0 35.8 38.4 LMBR(V-st,V-ts) 35.5 36.3 38.9 LMBR(PP-st) 36.1 36.8 38.8 LMBR(PP-st,PP-ts) 36.4 36.9 39.3 Table 4.7: Translation results under grammar G2 with individual rule sets, merged rule sets, and rescoring and system combination with lattice-based minimum Bayes’ risk (lower-cased BLEU shown) Grammar Extraction # Rules tune test1 test2 GS V-union 886k 33.7 32.7 25.6 V-union 386k 33.5 32.3 25.6 G) PP-st 458k 33.4 32.1 25.5 V-union 500k 33.3 32.2 25.5 G2 PP-st 712k 33.7 32.2 25.6 V-union 901k 33.7 32.4 25.5 G+ PP-st 2787k 33.7 32.3 25.8 Table 4.8: Russian to English translation results with alternative grammars and extraction methods (lower-cased BLEU shown). Experiments from Ta- ble 4.4 were repeated for the baseline and the phrase-pair posterior extraction method. CHAPTER 4. HIERO EXTRACTION FROM POSTERIORS 108 with differences in the order of 0.1 BLEU. We hypothesise that because there is less reordering between Russian and English, the quality of the Viterbi alignments, therefore the quality of the rules extracted, is high enough to already produce relatively high quality translations. 4.5 Conclusion We have presented new rule extraction methods that lead to larger rule sets as well as better estimated translation models. In Chinese-English trans- lation experiments, a simple grammar estimated with these methods was shown to outperform a baseline hierarchical grammar extracted from Viterbi alignments. A fine-grained analysis was provided to explain the advantages of the phrase-pair posterior extraction method. Finally, it was shown that the best way to exploit both alignment directions is to train separate mod- els for each direction, translate separately and combine the lattice outputs. In addition, the same techniques were explored in the context of Russian- English translation and provided the same translation quality as the baseline technique. Chapter 5 Domain Adaptation for Hierarchical Phrase-Based Translation Systems In Chapter 3, we have demonstrated improvements in scalability for genera- tion of and retrieval from hierarchical phrase-based grammars. In Chapter 4, we have introduced an original model for hierarchical phrase-based gram- mar extraction and estimation. 
In this chapter, we will investigate further refinements for hierarchical phrase-based grammar modelling in the context of domain adaptation. We will also investigate possible improvements for language modelling both in the context of domain adaptation and in terms of training data size. 5.1 Introduction As described in Section 1.2.1, SMT state-of-the-art systems consist of a pipeline of various modules. These modules interact in complex ways, which makes experimentation challenging, but also provides an opportunity to make improvements in separate modules with the hope that the end-to-end sys- tem will be improved overall. In this chapter, we describe investigations into language model and grammar refinement techniques for SMT. We com- pare various strategies for grammar and language modelling in the context of domain adaptation. We also explore the interaction between a first pass language model used in decoding and a second pass language model used in 109 CHAPTER 5. HIERO DOMAIN ADAPTATION 110 rescoring. Finally, we compare two smoothing strategies for large language model rescoring. The techniques presented in this chapter may be used in or- der to build high quality systems for a translation evaluation or to customise a translation system for a client in the industry setting. Figure 5.1 indicates which modules we attempt to improve. Experiments are based on a system submitted to the WMT13 Russian- English translation shared task (Pino et al., 2013). Because we developed a very competitive system for this evaluation, refinements on that system give a good indication of which techniques are worthwhile applying to SMT systems. We briefly summarise the system building in Section 5.2. In Section 5.3, we review domain adaptation techniques for machine translation. In Section 5.4, we show how to adapt a language model to obtain better performance on a specific domain such as newswire text. In Section 5.5, we show how additional grammar rule features related to specific domains may help in translation. In Section 5.6, we investigate the various strategies for combining first pass translation and language model rescoring in terms of language model training data size and n-gram language model order. Finally, in Section 5.7, we compare two smoothing strategies for large language model rescoring. 5.2 Description of the System to be Developed The experiments reported in this chapter are based on the system submitted to theWMT13 Russian-English translation shared task (Pino et al., 2013). 19 systems were submitted to this particular task. The CUED system obtained the best case-sensitive BLEU score and the second best human judgement score amongst constrained-track systems (Bojar et al., 2013).1 In this section, we summarise the system building. We use all the Russian-English parallel data available in the constrained track. We filter out non Russian-English sentence pairs with the language- detection library.2 A sentence pair is filtered out if the language detector detects a different language with probability more than 0.999995 in either the source or the target. This discards 78,543 sentence pairs from an initial 2,543,462. For example, the sentence in Figure 5.2 was detected as French with a probability of 0.999997. In addition, sentence pairs where the source sentence has no Russian character, defined by the Perl regular expression 1 http://matrix.statmt.org/matrix/systems_list/1738 2 http://code.google.com/p/language-detection/ CHAPTER 5. 
[\x0400-\x04ff], are discarded. This regular expression corresponds to the Unicode block for Cyrillic characters (in hexadecimal notation, code 0400 to code 04ff). This further discards 19000 sentence pairs. We take the view that discarding a small portion of training data in order to obtain cleaner data is beneficial in translation, when such cues are easily and reliably available.

Figure 5.1: Machine Translation System Development Pipeline (parallel text preprocessing, word alignment, grammar extraction, monolingual text preprocessing, language modelling, decoding, tuning, language modelling for rescoring, and the first-pass LM / rescoring LM interaction). Modules that we attempt to refine are in bold blue. Grammar extraction and first-pass language modelling are refined in the context of domain adaptation. We also investigate smoothing for language modelling in rescoring. Finally, we study the interaction between the first-pass language model and the rescoring language model.

Sur base des précédents avis, il semble que l'hôtel a fait un effort.

Figure 5.2: Example of target side sentence of a Russian-English sentence pair. The sentence was detected as actually being French with 0.999998 probability.

The Russian side of the parallel corpus is tokenised with the Stanford CoreNLP toolkit.3 The English side of the parallel corpus is tokenised with a standard English tokeniser, which splits sentences into tokens using white space and punctuation, and retains abbreviations as single tokens. Both sides of the parallel corpus are then lowercased. Corpus statistics after filtering and tokenisation are summarised in Table 5.1.

3 http://nlp.stanford.edu/software/corenlp.shtml

Language  # Tokens  # Types
RU        47.4M     1.2M
EN        50.4M     0.7M

Table 5.1: Russian-English parallel corpus statistics after filtering and tokenisation. Parallel text contains approximately 50M tokens on each side. This translation task can be characterised as a medium size translation task.

Parallel data is aligned using the MTTK toolkit (Deng and Byrne, 2008): we train a word-to-phrase HMM model with a maximum phrase length of 4 in both source-to-target and target-to-source directions (see Section 2.3.1). The final alignments are obtained by taking the union of alignments obtained in both directions. A hierarchical phrase-based grammar is extracted from the alignments as described in Section 3.4. The constraints for extraction described in Section 3.5.2 are set as follows:

• s_max = 9 (maximum number of source terminals for phrase-based rules)
• s_max-elements = 5 (maximum number of source terminals and nonterminals)
• s_max-terminals = 5 (maximum number of source consecutive terminals)
• s_max-NT = 10 (maximum source nonterminal span)

We are using a shallow-1 hierarchical grammar (see Section 2.6.2.4) in our experiments. This model is constrained enough that the decoder can build exact search spaces, i.e. there is no pruning in search that may lead to search errors under the model. We use the KenLM toolkit (Heafield et al., 2013) to estimate a single modified Kneser-Ney smoothed 4-gram language model (see Section 2.7.3) on all monolingual data available in the constrained track. Statistics for the monolingual data are presented in Table 5.2.
Corpus # Tokens EU + NC + UN + CzEng + Yx 652.5M Giga + CC + Wiki 654.1M News Crawl 1594.3M afp 874.1M apw 1429.3M cna + wpb 66.4M ltw 326.5M nyt 1744.3M xin 425.3M Total 7766.9M Table 5.2: Statistics for English monolingual corpora used to train lan- guage models. Abbreviations are used for the following corpora: Europarl (EU), News Commentary (NC), United Nations (UN), Czech-English corpus (CzEng), Yandex (Yx), 101 French-English corpus (Giga), Common Crawl (CC), Wikipedia Headlines (Wiki). Corpora “afp” and below are the various news agency from the GigaWord corpus. For first-pass decoding, we use HiFST as described in Section 2.9. First- pass decoding is followed by a large 5-gram language model rescoring step as described in Section 2.10. The 5-gram language model also uses all available monolingual data described in Table 5.2. Two development sets are available: newstest2012 with 3003 sentences and newstest2013 with 3000 sentences. We extract odd numbered sentences from newstest2012 in order to obtain a tuning set newstest2012.tune. Even numbered sentences from newstest2012 form a test set newstest2012.test. The newstest2013 set is used as an additional test set. newstest2012.tune, CHAPTER 5. HIERO DOMAIN ADAPTATION 114 Configuration tune test1 test2 baseline 1st pass 33.22 32.01 25.35 baseline +5g 33.33 32.26 25.53 Table 5.3: WMT13 Russian-English baseline system. Performance is mea- sured by case-insensitive BLEU. The 1st pass configuration indicates results obtained after decoding. The +5g configuration indicates results obtained after 5-gram language model rescoring. newstest2012.test and newstest2013 are respectively shortened to tune, test1 and test2. We report our baseline experiment in Table 5.3. We indicate case-insensitive BLEU score after first-pass decoding and 5-gram language model rescoring. We will now investigate various strategies in order to improve performance over the baseline. We will investigate domain adaptation techniques in order to estimate better language models and grammars for translating newswire data. We will also study the interaction between the first-pass language model and the rescoring language model. Finally, we will study an alterna- tive smoothing technique to Stupid Backoff for the rescoring language model. 5.3 Domain Adaptation for Machine Transla- tion In this section, we first introduce in Section 5.3.1 the general problem of domain adaptation and how certain machine translation settings may be instances of this problem. We then review previous work on addressing the problem of domain adaptation in SMT in Section 5.3.2. 5.3.1 Domain Adaptation Let us first formalise the problem of domain adaptation. In a standard multi- class classification problem, we are given a training set {(xiP yi) ∈ X ×Y P i ∈ [1P c ]} where X is a set of instances to be labelled and Y is a finite set of labels. The multi-class classification learning problem is to find a function f : X → Y that minimises the number of prediction errors. Machine transla- tion can be seen as a multi-class classification problem where X is the set of CHAPTER 5. HIERO DOMAIN ADAPTATION 115 In addition, a new seven-year EU budget needs to be passed, which is very complicated due to the current crisis. Figure 5.3: Example of English sentence in the newswire domain. all possible source sentences and Y is the set of all possible target sentences. The multi-class classification framework is not very intuitive for machine translation. 
First, at least in theory, there are an infinite number of possible target sentences to choose from in translation. In addition, many possible translations are correct or acceptable in general. However, the multi-class classification setting is assumed by the original objective function (see Sec- tion 2.2), which we recall in Equation 5.1. This objective function minimises the number of misclassifications. e^ = argmax e p(e | f) (5.1) A domain is a particular distribution D over X . For example, D can be such that source sentences in X drawn from D are in the newswire domain. For example, if source sentences are in English, then source sentences drawn sampled from X with distribution D may look like the sentence in Figure 5.3. For domain adaptation, we assume an out-of-domain distribution DO and a in-domain distribution DT . A model is trained on a sample drawn from DO but the model performance is evaluated on a sample drawn from DT . Domain adaptation consists in altering the training procedure by using information from domain DT in order to achieve better performance on DT . The out- of-domain and in-domain distributions are also called source domain and target domain respectively. A simple example of domain adaptation setting for SMT may be that the parallel text is a parallel corpus of tweets4 and sentences to be translated come from newspaper articles. 5.3.2 Previous Work on Domain Adaptation for SMT In machine translation, we can distinguish two types of domain adapta- tion problem: cross-domain adaptation and dynamic adaptation (Foster and 4 https://twitter.com/ CHAPTER 5. HIERO DOMAIN ADAPTATION 116 Kuhn, 2007). In cross-domain adaptation, a small amount of in-domain training data is available while in dynamic adaptation, no in-domain training data is available. Dynamic adaptation may be important for online transla- tion systems for example. Our concern is cross-domain adaptation, where at least an in-domain development set is available for parameter tuning. Koehn and Schroeder (2007) explore various techniques for machine trans- lation domain adaptation. They use Europarl (Koehn, 2005) as out-of- domain training data and a news commentary parallel corpus5 for in-domain training data. Training the language model on in-domain training data gives an improvement of 0.8 BLEU with respect to the baseline. Training two sep- arated language models on the in-domain and out-of-domain training data and interpolating them with weights set to minimise perplexity on the tuning set gives an improvement of 0.4 BLEU. Using these two language models as separate features to be optimised by MERT (Och, 2003) gives an improve- ment of 0.6 BLEU. Finally, using two separate translation models trained on the in-domain and out-of-domain data and tuning the weights with MERT gives an improvement of 0.9 BLEU. These results are reported on the tuning set only. In this chapter, we explore very similar techniques. 5.4 Language Model Adaptation for Machine Translation In this section, we contrast two simple language model adaptation techniques for machine translation. Experiments are carried out on the Russian-English system described in Section 5.2. We use the KenLM toolkit to estimate separate modified Kneser-Ney smoothed 4-gram language models for each of the corpora listed in Table 5.2. The component models are then interpolated with the SRILM toolkit (Stol- cke, 2002) to form a single LM. 
The interpolation weights are optimised for perplexity on the concatenation of the English side of the news-test2008, newstest2009 and newssyscomb2009 development sets which were provided for previous translation evaluations, using the compute-best-mix tool, which is part of the SRILM toolkit. The weights reflect both the size of the compo- nent models and the genre of the corpus the component models are trained on, e.g. weights are larger for larger corpora in the news genre (Foster et al., 5 http://statmt.org/wmt07/shared-task.html CHAPTER 5. HIERO DOMAIN ADAPTATION 117 Row Configuration tune test1 test2 1 baseline 1st pass 33.22 32.01 25.35 2 baseline +5g 33.33 32.26 25.53 3 linear 1st pass 33.71 32.26 25.47 4 linear +5g 33.74 32.51 25.58 5 loglinear 1st pass 33.23 31.75 25.37 6 loglinear +5g 33.28 31.85 25.48 Table 5.4: Performance comparison, measured by case-insensitive BLEU, be- tween an uninterpolated language model (baseline), a linearly interpolated language model (linear) and a log-linearly interpolated language model (log- linear). Off-line linear interpolation (rows 3 and 4) is the best strategy. The loglinear configuration performs worst, however this may be due to the MERT optimisation algorithm not being able to find a high quality set of parameters with 20 features. 2013). Thus we obtain a linear mixture of language models. We call this configuration “linear”. As a contrast, we also keep the individual language models as separate features in optimisation. We call this configuration “log- linear”. We compare the baseline, the linear and the loglinear configurations in Table 5.4. We observe that the best strategy for building a language model is to do offline linear interpolation to minimise perplexity on a development set that has the same genre as the translation test sets (row 3). The second best strategy is to use an uninterpolated language model (row 1). The worst strategy is to do a log-linear interpolation of the various language model components by tuning these language language models as separate features with lattice MERT (row 5). Note that these observations are confirmed even after rescoring with a large Stupid Backoff 5-gram language model (rows 2, 4 and 6). It is possible that the loglinear configuration, which has 20 fea- tures as opposed to 12 features for the baseline and the linear configuration, performs worst because the MERT algorithm becomes unstable with more features (Foster and Kuhn, 2009). These results are in contrast to those obtained previously (Koehn and Schroeder, 2007), where the loglinear inter- polation was found to provide more benefits over the linear interpolation. In addition, more language models are considered (9 vs. 2). Finally, both tune and test results are reported here. CHAPTER 5. HIERO DOMAIN ADAPTATION 118 5.5 Domain Adaptation with Provenance Fea- tures In Section 5.4, we have presented domain adaptation strategies for the lan- guage model. In this section, we will focus on domain adaptation for the translation model. Chiang et al. (2011) introduce provenance lexical fea- tures. The concept of provenance was introduced previously (Matsoukas et al., 2009) and provides metadata information such as genre or collection for each sentence pair in parallel text. Word translation tables and lexical features are computed for each provenance. Finally, for each provenance, the ratio of the provenance specific lexical feature with the global lexical feature is added as a feature to the translation system. 
We use a very similar approach with the following differences and exten- sions: • The lexical features are computed with Model 1 rather than Viterbi alignments, as described in Section 2.6.5. • The features added to the system are not the ratios between a prove- nance specific lexical feature and the global lexical feature but simply the provenance specific lexical features. • We extend this technique to compute provenance specific source-to- target and target-to-source translation models as described in Sec- tion 3.4. For our experiments, we use 4 provenances that correspond to the 4 sub- corpora provided for parallel text: the Common Crawl corpus, the News Commentary corpus, the Yandex corpus and the Wiki Headlines corpus. This gives an additional 16 features to our system: source-to-target and target-to-source translation and lexical features for each provenance. This configuration is called “provenance”. When retrieving relevant rules for a particular test set, various thresholds are applied, such as number of targets per source or translation probability cutoffs. Thresholds involving source-to-target translation scores are applied separately for each provenance and the union of all surviving rules for each provenance is kept. This configuration is called “provenance union”. This specific configuration was used for submission to the translation task com- petition at WMT13. To our knowledge, this original technique has not been CHAPTER 5. HIERO DOMAIN ADAPTATION 119 Row Configuration tune test1 test2 1 no provenance 1st pass 33.71 32.26 25.47 2 no provenance +5g 33.74 32.51 25.58 3 provenance 1st pass 33.22 32.14 25.40 4 provenance +5g 33.22 32.14 25.41 5 provenance union 1st pass 33.65 32.36 25.55 6 provenance union +5g 33.67 32.58 25.63 7 no provenance union 1st pass 33.22 31.78 25.59 8 no provenance union +5g 33.22 31.78 25.59 Table 5.5: Effect of using provenance features. Case-insensitive BLEU scores are reported. The most effective strategy is provenance union. Because the “no provenance union” does not improve over the baseline, we conclude that the gains obtained by the provenance union strategy are not only due to additional rules; instead, they are due to the conjunction of additional rules with provenance features for better discrimination. employed previously. Results show that it provides additional gains over the simple use of provenance features. The first-pass language model is the linear mixture described in Sec- tion 5.4. We also compare the various configurations after 5-gram rescoring. Results are presented in Table 5.5. We can see that adding provenance fea- tures slightly degrades performance (row 3 vs. row 1, row 4 vs. row 2). This again may be due to the inability of the MERT optimisation algorithm to tune more than the 12 basic features. However, if we use the provenance union strategy, we can see that having additional rules together with prove- nance features is the best strategy (row 5 vs. row 1, row 6 vs. row 2). In order to verify whether the gains are only due to additional rules, we rerun the experiment with the provenance union configuration but remove the prove- nance features (row 7 and row 8). The performance of the “no provenance union” configuration is worse or equal to the “no provenance” configuration, which demonstrates that the provenance union strategy is optimal in this set of experiments and that the gains provided are not only due to having addi- tional translation rules. 
The effectiveness of the provenance union strategy is due to additional rules in conjunction with additional features for better discrimination between rules.

5.6 Interaction between First Pass Language Model and Rescoring Language Model

In Section 5.4 and Section 5.5, we have investigated various strategies to adapt the language model and the translation model to the genre of the test set to be translated. In each case, we verified that the conclusions held even after rescoring with a large 5-gram language model. In this section, we study the interaction between the first pass language model and the rescoring language model in terms of n-gram order and training data size. We give recommendations on how much data should be used to train a first pass language model and what order should be chosen for the first pass n-gram language model.

Our translation systems are typically run in two steps. The first step usually consists in decoding with a 4-gram language model. The second step consists in 5-gram lattice rescoring. Brants et al. (2007) argue that single pass decoding is conceptually simpler and may lose less information. In this section, we also attempt to verify this claim by incorporating 5-gram language models directly in first-pass decoding and evaluating performance with respect to a two-pass strategy.

We now describe the various configurations used in the experiments reported in Table 5.6. For monolingual data size, we use a "large" configuration which uses all available data described in Table 5.2, and a "medium" configuration which uses the following corpora: EU, NC, UN, CzEng, Yx, Giga, CC, Wiki, afp and xin. (The keys to the abbreviations are given in Table 5.2.) We also train 4-gram and 5-gram language models with large and medium data for the first pass language model. The "large 4g" configuration corresponds to the linear configuration from Table 5.4. The initial parameters for the MERT optimisation step were the same for all configurations; they were obtained from a separate Arabic-to-English experiment carried out earlier. Reusing already optimised parameters between experiments allows us to experiment faster since fewer iterations are needed until convergence.

Results are presented in Table 5.6.

Row  Configuration         tune   test1  test2
1    medium 4g 1st pass    32.96  31.53  24.60
2    medium 4g +5g         33.43  32.12  25.30
3    large 4g 1st pass     33.71  32.26  25.47
4    large 4g +5g          33.74  32.51  25.58
5    medium 5g 1st pass    32.62  31.62  24.77
6    medium 5g +5g         33.20  31.96  25.27
7    large 5g 1st pass     32.99  31.79  25.18
8    large 5g +5g          32.99  31.79  25.18

Table 5.6: Comparing various data size conditions and n-gram orders for language model training and the effect on large 5-gram language model rescoring. Case-insensitive BLEU scores are reported. In conclusion, more language model training data is helpful but the use of higher order n-gram language models is only beneficial in rescoring.

Comparing row 1 and row 3, we can see that for a 4-gram language model, more training data is beneficial in first-pass decoding. This conclusion is confirmed after 5-gram rescoring (row 2 vs. row 4). However, when comparing row 5 vs. row 7 and row 6 vs. row 8, we conclude that in the case of a first-pass 5-gram language model, more training data is only helpful in first-pass decoding.
It is possible that even more monolingual data is required for a 5-gram language model to generate a lattice of sufficient quality in first-pass decoding so as to obtain improvements in rescoring. Comparing row 2 vs. row 6 and row 4 vs. row 8, we see that for equal amounts of training data, the best strategy is to use a first-pass 4-gram language model followed by large 5-gram language model rescoring. One possible reason is that a first pass 5-gram language model may not have enough training data to be reliably estimated and therefore may not produce a first-pass lattice of high quality translations to be rescored, whereas a first pass 4-gram language model is able to discard poor translations in first pass decoding. One caveat with this conclusion, of course, is that the maximum amount of data experimented with is 7.8B words, as opposed to the 2 trillion words reported by Brants et al. (2007). Future work may be dedicated to filling this gap in experimentation. For example, we plan to exploit a large 5-gram language model (Buck et al., 2014) trained on the Common Crawl corpus6 in rescoring experiments. We are not aware of any studies in the machine translation literature on interactions between first pass and second pass language models. The main conclusion, i.e. that a two-pass decoding approach is beneficial over a single-pass approach given the amount of training data, has not been demonstrated experimentally before.

6 https://commoncrawl.org/

5.7 Language Model Smoothing in Language Model Lattice Rescoring

In Section 5.6, we have given recommendations about what strategy to adopt in first-pass decoding in order to obtain optimal results in rescoring. The conclusion was to use a 4-gram language model trained on large amounts of data. In this section, we attempt to improve on the 5-gram rescoring procedure by exploring two alternative smoothing strategies for the 5-gram language model used for rescoring.

We train a Stupid Backoff 5-gram language model (see Section 2.7.4 and Section 3.2.1) and a modified Kneser-Ney 5-gram language model on all available English data for WMT13, described in Table 5.2. We compare the two smoothing methods in rescoring over various configurations used throughout this chapter. We did not include the "no provenance union" and the "large 5g" configurations. On average, over the seven experiments from Table 5.7, we obtain a gain of 0.11 BLEU on the tuning set tune, a gain of 0.14 BLEU on the first test set test1 and a gain of 0.13 BLEU on the second test set test2. We conclude that the use of modified Kneser-Ney smoothing is slightly beneficial in 5-gram lattice rescoring when the amount of training data is in the order of several billion tokens. This confirms conclusions from Figure 5 in a previous publication (Brants et al., 2007) that Kneser-Ney smoothing is superior to Stupid Backoff smoothing with amounts of training data of a similar order to the work here. Those conclusions, however, were drawn for first pass decoding, whereas here we consider rescoring: this is the first time these two smoothing strategies are compared in the context of 5-gram language model lattice rescoring. We also note that we obtain an improvement over the system submitted to the WMT13 translation task by using the "provenance union + KN 5g" configuration (row 15 in bold vs. row 14).

Row  Configuration                  tune   test1  test2
1    baseline 1st pass              33.22  32.01  25.35
2    baseline + SB 5g               33.33  32.26  25.53
3    baseline + KN 5g               33.31  32.28  25.65
4    linear 1st pass                33.71  32.26  25.47
5    linear + SB 5g                 33.74  32.51  25.58
6    linear + KN 5g                 33.73  32.26  25.48
7    loglinear 1st pass             33.23  31.75  25.37
8    loglinear + SB 5g              33.28  31.85  25.48
9    loglinear + KN 5g              33.30  31.92  25.35
10   provenance 1st pass            33.22  32.14  25.40
11   provenance + SB 5g             33.22  32.14  25.41
12   provenance + KN 5g             33.54  32.37  25.75
13   provenance union 1st pass      33.65  32.36  25.55
14   provenance union + SB 5g       33.67  32.58  25.63
15   provenance union + KN 5g       33.89  32.82  25.84
16   medium 5g 1st pass             32.62  31.62  24.77
17   medium 5g + SB 5g              33.20  31.96  25.27
18   medium 5g + KN 5g              33.24  32.29  25.47
19   medium 4g 1st pass             32.96  31.53  24.60
20   medium 4g + SB 5g              33.43  32.12  25.30
21   medium 4g + KN 5g              33.65  32.46  25.55

Table 5.7: Comparison between the gains obtained by Stupid Backoff 5-gram rescoring and Kneser-Ney 5-gram rescoring. Case-insensitive BLEU scores are reported. In conclusion, at the level of several billion tokens for language model training, Kneser-Ney smoothing is beneficial over Stupid Backoff smoothing in the context of 5-gram lattice rescoring. Row 15 (in bold) achieves better performance than row 14, which was our system submitted to the WMT13 translation shared task.
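For reference, Stupid Backoff (Brants et al., 2007) assigns unnormalised relative-frequency scores and backs off to shorter contexts with a constant factor (typically 0.4) instead of applying discounting. The following minimal sketch over raw n-gram counts illustrates the scoring rule; it is not the implementation used to build our 5-gram models.

```python
def stupid_backoff_score(words, counts, total_unigrams, alpha=0.4):
    """Unnormalised Stupid Backoff score S(w_n | w_1 .. w_{n-1}).

    `counts` maps n-gram tuples to corpus counts; `total_unigrams` is the
    corpus token count. Returns a score, not a probability.
    """
    ngram = tuple(words)
    if len(ngram) == 1:
        return counts.get(ngram, 0) / total_unigrams
    history = ngram[:-1]
    if counts.get(ngram, 0) > 0 and counts.get(history, 0) > 0:
        return counts[ngram] / counts[history]
    # back off to the shorter context with a fixed penalty
    return alpha * stupid_backoff_score(ngram[1:], counts,
                                        total_unigrams, alpha)
```

Modified Kneser-Ney smoothing, by contrast, produces a properly normalised distribution with discounted counts and continuation probabilities, which is what the comparison in Table 5.7 measures against this scheme.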
5.8 Conclusion

In this chapter, we have investigated various refinements to translation system building. We have explored domain adaptation techniques in order to refine the language model and the translation model. For language model building, we find that the best strategy for cross-domain adaptation is to build an interpolated language model and tune the interpolation weights in order to minimise the perplexity on a development set in the target domain, as opposed to tuning the language model weights with MERT. We have also examined the effect of provenance features in the translation model. We found that the use of provenance features is beneficial only when these features are used to discriminate rules from a larger rule set.

After having attempted to refine models used in first-pass decoding, we have studied the interaction between the first pass language model and the 5-gram language model used in lattice rescoring. We have come to the conclusion that, in our setting, a two-pass strategy is preferable and that using large amounts of data to train a 4-gram language model for first-pass decoding yields better results in rescoring.

Finally, we have demonstrated that Kneser-Ney smoothing produces better results than Stupid Backoff smoothing when training a 5-gram language model on 7.8 billion words for rescoring purposes. This confirms, in the context of rescoring, conclusions previously drawn for first-pass decoding.

Chapter 6
String Regeneration as Phrase-Based Translation

In the previous chapters, we have proposed various improvements to hierarchical phrase-based translation systems, in terms of infrastructure (Chapter 3), grammar modelling (Chapter 4 and Chapter 5) and language modelling for first pass decoding as well as rescoring (Chapter 5). In the chapter to follow (Chapter 7), we will also introduce techniques for potential improvements in fluency in the output of hierarchical phrase-based systems. In this chapter, we lay the groundwork to be applied in translation in Chapter 7.

Machine translation output was originally evaluated by human judges in terms of adequacy, or how well the meaning of the source text is preserved in the translated text, and fluency, or how well-formed the translation is (White et al., 1993), i.e. to what extent it is syntactically, morphologically, and orthographically correct (Reiter and Dale, 1997).
Adequacy and fluency motivate the separation of SMT models into a translation model and a language model (see Section 2.2). There is increasing concern that translation output often lacks fluency (Knight, 2007). In this chapter, we study the related problem of string regeneration: given a sentence where the word order has been scrambled, how difficult is it to recover the original sentence? Our approach is inspired by phrase-based translation techniques and achieves state-of-the-art performance on the string regeneration task. Software written for this chapter is available online.1

1 https://github.com/jmp84/NgramGen

6.1 Introduction

Before the widespread use of automatic metrics such as BLEU (see Section 2.8.1) or METEOR (Banerjee and Lavie, 2005), machine translation systems were originally evaluated in terms of adequacy and fluency. Current popular automatic metrics for translation can be "gamed", in the sense that it is possible to obtain a high score according to these metrics with an output that lacks fluency. For example, a translation may obtain a high BLEU score simply because it contains many n-grams that are also in the reference translation, but this translation may not be well-formed syntactically. Figure 6.1 shows how an improvement of 20 BLEU points at the sentence level between an MT hypothesis and an oracle MT hypothesis, i.e. the best performing hypothesis among a set, may not translate into an improvement in fluency. Of course, BLEU is designed to measure quality at the set level, and other oracle hypotheses in the set may well be more fluent. Nevertheless, the nature of the translation metrics may have led research on translation to neglect fluency.

Correct word ordering is an essential part of fluency. Producing a translation with words correctly ordered is a very difficult task: divergent word order between distant language pairs such as Chinese-English is one of the main sources of ordering errors in translation. In general, obtaining a correct word order is a fundamental problem in many natural language processing applications. In machine translation, because the source language and the target language have different word orders, a machine translation system needs to model word reordering. In some language pairs, such as French-English, the source language gives good cues to the target word order; however, in other language pairs, such as Chinese-English or Japanese-English, the word order can be very different between the source and the target language. In natural language generation tasks, such as paraphrasing or summarisation, word ordering is also critical at the surface realisation step (Reiter and Dale, 1997). Word ordering is also one possible type of grammatical error, especially for foreign language learners (Yu and Chen, 2012).

In this chapter, we study the word ordering problem in isolation through the task of string regeneration (Wan et al., 2009). Given an input sentence where the word order has been scrambled, the string regeneration task consists in recovering the original sentence. This task can be seen as one of the steps in surface realisation for language generation, where the input is a list of content words and function words and the output is a grammatical sentence. It also allows us to study the realisation task independently of the translation task.
SMT System (19.76 Sentence-Level BLEU): enda according to jerzy made a televised speech, the president of burundi civil war of resistance army soldiers in the past ten years, and had to settle on january 5 before the completion of the new military leadership will be in place before january 7, four of the former rebel army leader.

SMT System Oracle (39.91 Sentence-Level BLEU): en, a us $according to al made a televised speech, said that the president of the burundi's rebel army soldiers in the past 10-year civil war must be before january 5 and the completion of the new military leadership will be in place by january 7, and four of the former leaders of the rebel forces.

Reference: According to President Ndayizeye's televised speech, former rebel fighters in Burundi's 10-year civil war must be settled in by January 5. The new army leadership is to be made up of 40 percent former rebels and should be in place by January 7.

Figure 6.1: Example showing how a large improvement in BLEU score does not imply an improvement in fluency. The first quote is one of the output sentences produced by an SMT system presented in Chapter 7. The second quote is the oracle translation for that system, i.e. the best possible translation to be found in the lattice of translations generated by the system. The third quote is one of the human reference translations. Note that the oracle hypothesis was picked to maximise sentence-level BLEU, which shows that, in principle, it is possible to obtain a translation with a high BLEU score but that lacks fluency.

Finally, it can have alternative applications, such as automatically producing alternative references for machine translation.

We use a simple but extensible approach, inspired by stack-based decoding for phrase-based translation (see Section 2.5.4). Given a bag of words as input, we use n-gram rules extracted from a large monolingual corpus and concatenate these n-grams in order to form hypotheses. The hypotheses are scored with a linear model in the same manner as translation hypotheses in phrase-based translation. One of the critical features in this system is the language model. In the experiments to follow, we only study this main language model feature; however, our decoder has other features implemented and arbitrary features can be added easily to the system, demonstrating its flexibility.

Our strategy achieves state-of-the-art performance, as measured by BLEU (Section 2.8.1), on the string regeneration task. We also study various capabilities of our decoder: in Section 2.5.4, we show how to split the input into chunks for more rapid experimentation; in Section 6.6.4, we analyse the effect of stack-based pruning; in Section 6.6.5, we study the effect of including n-gram rules with a larger n-gram order; in Section 6.6.6, we relax decoding constraints by allowing n-gram rules to overlap; in Section 6.6.7, we introduce future cost estimation using a unigram language model; finally, in Section 6.6.8, we demonstrate how SMT rescoring techniques are also applicable to the output of our decoder.

6.2 Background

Brown et al. (1990) anecdotally mention the string regeneration task, or bag-of-words translation, in order to demonstrate the effectiveness of n-gram language models. With a 3-gram language model, they recover 63% of sentences with fewer than 11 words using brute-force search over all permutations.
This is of course impractical for longer sentences. String regeneration has recently gained interest as a standalone task. Wan et al. (2009) introduced the string regeneration task as a proxy to measure the grammaticality of the output of a natural language generation system. String regeneration can also be considered a simplified version of the linguistic realisation task in a natural language generation system (Reiter and Dale, 1997), which consists in generating a surface form that is syntactically, morphologically, and orthographically correct.

Wan et al. (2009) use a modified version of the minimum spanning tree algorithm in order to find an optimal dependency tree that covers an input bag of words. Once a dependency tree is built, an n-gram language model is used to order siblings in the tree. Performance is measured by the BLEU metric. Wan et al. (2009) demonstrate BLEU improvements when regenerating Section 23 of the Penn Treebank using their original algorithm vs. baseline algorithms based on n-gram language models only. Base nouns indicated by the Penn Treebank are kept together for experimentation.

He et al. (2009) also use the dependency tree formalism for string realisation. They describe a sentence realiser that takes as input an unordered dependency tree, i.e. a dependency tree that only encodes head-modifier relations but not the surface string ordering. Because an exhaustive search is intractable (due to the number of possible children permutations at each node), He et al. (2009) adopt a divide-and-conquer approach that recursively linearises subtrees in a bottom-up fashion. A log-linear model is used to select the best possible linearisation for each subtree. Features include dependency relations and n-gram language models over the surface string and over the sequence of heads.

CCG grammars (Zhang and Clark, 2011) have also been used for the string regeneration task as an alternative to the dependency grammar formalism. Given an input bag of words, the best CCG parse that covers the input is searched for. Similarly to previous work, input base nouns are kept as single units. In follow-up work (Zhang et al., 2012), CCG grammars are combined with large language models and provide performance improvements as measured by the BLEU metric.

Some of the works described so far are instances of tree linearisation: given a set of unordered dependencies, reconstruct a dependency tree which gives the word ordering (Belz et al., 2012). Zhang (2013) extends this task to partial tree linearisation: the input is a set of possibly missing POS tags and possibly missing unordered dependencies. This last work achieves state-of-the-art performance on string regeneration on the Penn Treebank when using dependency information on the input. Zhang (2013) also reports results on string regeneration without using dependency information on the input but keeping base noun phrases as single units.

In this chapter, we restrict the input to be a bag of words without any additional syntactic or semantic information. Our work is directly inspired by recent work that exploits phrase-based grammars together with a hierarchical phrase-based decoder based on FSTs (see Section 2.9) (de Gispert et al., 2014). We also use phrase-based grammars, but our decoding procedure is analogous to stack-based decoding for phrase-based translation.
To our knowledge, our method achieves state-of-the-art performance on this task. We will now present our model, which is analogous to the phrase-based translation model, but in a monolingual setting.

6.3 Phrase-Based Translation Model for String Regeneration

In the previous section, we have reviewed various strategies for string regeneration. We observe that, in general, the use of syntax was beneficial over the exclusive use of an n-gram language model. We do not take advantage of syntactic information; instead, we complement an n-gram language model with n-gram rules extracted from monolingual data. We now describe our method, which is inspired by phrase-based translation techniques (see Section 2.5). We employ a feature-based linear model (see Section 2.4) as our objective function:

\hat{e} = \arg\max_{e} \; \lambda \cdot s(e)    (6.1)

where:

• \hat{e} is the best possible hypothesis according to the model.
• e is an English sentence.
• λ is a feature weight vector.
• s(e) is a feature vector for the hypothesis e.

s(e) contains a language model feature g(e) as well as local features φ(e) that are additive over the phrases that segment e. Similarly, λ has a weight λ_g for the language model and weights λ_φ for the local features. As in phrase-based translation, a hypothesis is decomposed into phrases e_1^I = e_1, ..., e_I. We can then rewrite Equation 6.1 into Equation 6.2:

\hat{e} = \arg\max_{e = e_1^I} \; \lambda_g \, g(e) + \lambda_\varphi \cdot \varphi(e) = \arg\max_{e = e_1^I} \; \lambda_g \, g(e) + \lambda_\varphi \cdot \sum_{i=1}^{I} \varphi(e_i)    (6.2)

Note that this model makes a max approximation: various segmentations of the hypothesis are considered but the score only depends on one single segmentation. The experiments to follow only make use of the language model feature, but our implementation has other features integrated, such as word count or rule count, and arbitrary features can be added easily. Our decoder algorithm, described subsequently in Section 6.5, generates hypotheses left to right by concatenating phrases e_1, ..., e_I. For a phrase index i ∈ [1, I], we define the partial model score sc(e_1^i) of partial hypothesis e_1^i as in Equation 6.3:

\mathrm{sc}(e_1^i) = \lambda_g \, g(e_1^i) + \lambda_\varphi \cdot \sum_{k=1}^{i} \varphi(e_k)    (6.3)

6.4 Phrase-Based Translation Rules for String Regeneration

In the previous section, we have introduced our model for string regeneration, inspired by phrase-based translation. This model assumes that a hypothesis is segmented into phrases. In this section, we describe how to obtain these phrases from monolingual data. In Section 6.4.1, we review the n-gram rule extraction process, which is based on the Hadoop implementation of MapReduce. In Section 6.4.2, we describe how to obtain the rules relevant to a given test set.

6.4.1 Rule Extraction

n-gram rules are extracted from large collections of monolingual text. The text is tokenised and lowercased. We then employ the MapReduce framework, as implemented by Hadoop, in order to extract n-grams together with their occurrence counts from the corpus.

The input key to the map function is irrelevant. The input value to the map function is a line of text. The map function simply extracts all n-grams from the input line of text and outputs these n-grams with a count of one. The input to the reduce function is an n-gram and a series of constant values equal to one. The reduce function simply sums the counts and produces the n-gram with its total count. This MapReduce job is analogous to the canonical "word count" example, except that here n-grams of higher order than unigrams are considered.
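Our implementation of this job is written against the Hadoop MapReduce API; the following Hadoop-Streaming-style Python mapper and reducer are a minimal sketch of the same computation, with a toy in-process driver standing in for the framework.

```python
import sys
from itertools import groupby

MAX_ORDER = 5  # extract n-grams of order 1..5

def mapper(lines):
    """Emit (ngram, 1) for every n-gram in every input line of text."""
    for line in lines:
        tokens = line.strip().lower().split()
        for n in range(1, MAX_ORDER + 1):
            for i in range(len(tokens) - n + 1):
                yield " ".join(tokens[i:i + n]), 1

def reducer(pairs):
    """Sum counts per n-gram; pairs arrive grouped by key, as guaranteed
    by the MapReduce shuffle/sort phase."""
    for ngram, group in groupby(pairs, key=lambda kv: kv[0]):
        yield ngram, sum(count for _, count in group)

if __name__ == "__main__":
    # toy in-process run standing in for the distributed Hadoop job
    mapped = sorted(mapper(sys.stdin))
    for ngram, count in reducer(mapped):
        print("%s\t%d" % (ngram, count))
```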
For our experiments, we extract n-grams of order 1 to 5.

6.4.2 Rule Filtering

Once we have extracted all n-grams from a monolingual corpus, we would like to retrieve the n-grams relevant to a test set that we wish to experiment on. An n-gram is relevant if its vocabulary is included in the vocabulary of at least one sentence in the test set. Contrary to the process described in Section 3.5, we do not generate queries and then seek those queries in an HFile. This is because the number of possible queries is too large to fit in memory. For a sentence of length N with words distinct from each other, the number of possible relevant 5-grams is N × (N − 1) × (N − 2) × (N − 3) × (N − 4). For a sentence of length 100, this corresponds to approximately 9 billion queries. Even with an alternative implementation where keys in the HFile are lexicographically sorted n-grams and values are the permutations and counts found in the monolingual data, the number of 5-gram queries would be \binom{N}{5}, which is approximately 75 million queries for a 100-word sentence.

We therefore take the alternative approach of scanning the set of n-grams and retaining the relevant ones. For this, we again use a MapReduce job. The input key to the map function is an n-gram and the input value is a count. For each test sentence, the map function checks whether the n-gram vocabulary is included in the vocabulary of that sentence and, if so, it generates all possible coverages of this n-gram for that sentence, where a coverage indicates the positions of the words covered by the n-gram in the input sentence. The map output key is the n-gram together with a sentence id. The map output value is a coverage and a count. The reduce function simply removes duplicates in the list of coverages.

6.5 String Regeneration Search

In Section 6.3, we have described our string regeneration model, and in Section 6.4, we have described how to extract n-gram rules. We now present our search algorithm and introduce our string regeneration decoder, NgramGen, which reorders an input bag of words using the model introduced above and the stack-based decoding techniques reviewed in Section 2.5.4.

6.5.1 Input

The input to the decoder consists of:

• A bag of words, or multiset, w, represented as a vector. For example, we have w = [do, you, do] for the input sentence "do you do" to be regenerated.

• A set of n-grams, where the vocabulary of each n-gram is a subset of the vocabulary of the input bag of words. Each n-gram is also associated with one or more coverages, as computed in Section 6.4.2. For a given n-gram, a coverage is a bit vector that indicates which positions the n-gram covers in the input bag of words. In the case of repeated words, an n-gram can have several possible coverages. For example, with the input bag of words [do, you, do], the n-gram [do, you] has two possible coverages: 110 and 011. The input n-grams and coverages are represented as an associative array data structure. We denote this input by N. Note that it would be possible for the decoder to compute possible coverages on the fly; however, we choose to precompute them at the filtering step described in Section 6.4.2.

• An n-gram language model G. The n-gram language model is used as the main feature in the model to encourage fluent output.

We will now describe how hypotheses, i.e. permutations of the input, are generated by the decoder from a bag of words.
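Before describing the search algorithm, the following sketch illustrates the relevance test of Section 6.4.2 and the coverage computation on the [do, you, do] example just given. The function names are illustrative; the actual computation runs inside the MapReduce filtering job.

```python
def is_relevant(ngram, sentence):
    """An n-gram is relevant to a test sentence if its vocabulary is a
    subset of the sentence vocabulary (Section 6.4.2)."""
    return set(ngram) <= set(sentence)

def ngram_coverages(ngram, bag):
    """All coverages (as bit tuples) of `ngram` over the input bag `bag`,
    where a coverage marks which bag positions the n-gram consumes.
    Repeated words can yield several coverages, e.g. ('do', 'you') over
    ['do', 'you', 'do'] gives (1, 1, 0) and (0, 1, 1)."""
    results = []

    def extend(word_idx, used):
        if word_idx == len(ngram):
            results.append(tuple(1 if i in used else 0
                                 for i in range(len(bag))))
            return
        for pos, word in enumerate(bag):
            if pos not in used and word == ngram[word_idx]:
                extend(word_idx + 1, used | {pos})

    extend(0, frozenset())
    return sorted(set(results))

# Example: ngram_coverages(('do', 'you'), ['do', 'you', 'do'])
# returns [(0, 1, 1), (1, 1, 0)], matching the coverages 011 and 110.
```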
6.5.2 Algorithm

Our algorithm is inspired by stack-based decoding for phrase-based translation, reviewed in Section 2.5.4. We first give an overview before describing the algorithm in detail.

6.5.2.1 Overview

In order to represent partial hypotheses, we use a vector of columns (more commonly called stacks). Each column contains a set of states. Each state represents a set of partial hypotheses. All states in a column represent hypotheses that cover the same number of words in the input.

We start to build hypotheses from an initial state that represents the empty hypothesis. This initial state is located in the first column, which represents the hypotheses that have not covered any word in the input. The empty hypothesis is then extended with the input n-gram rules and their coverages. New states are created to take these extensions into account. The partial hypotheses represented by these states are then repeatedly extended until hypotheses that cover the entire input are produced.

6.5.2.2 Algorithm Details

As mentioned in Section 6.5.2.1, we represent the set of partial hypotheses with a vector of columns that we denote B. For an input w of size |w|, B has a size of |w| + 1. The elements of B are denoted B_0, ..., B_{|w|}. The i-th column B_i represents hypotheses that have covered i words in the input. The columns B_i are the analogue of stacks in stack-based decoding for phrase-based translation.

Each column contains a set of states that represents a set of partial hypotheses. A state contains the information necessary to extend a partial hypothesis and represents a set of partial hypotheses that can be recombined (see Section 2.5.4). This means that if two partial hypotheses h_1 and h_2 are represented by a state s, and h_1 has a higher model score than h_2, then if h_1 and h_2 are extended with the same rules, the extension of h_1 will have a higher score than the extension of h_2. In our case, two hypotheses can be recombined if they share the same last n − 1 words, where n is the order of the n-gram language model. The states are sorted by their score. For a given state, its score is the best model score of any of the partial hypotheses that the state represents.

The decoding algorithm is presented in Figure 6.2. We first initialise B to be a vector of empty columns of size |w| + 1 (line 2). The first column is filled with an initial state representing the empty hypothesis. Then, we loop over the column indices (line 3), the states in each column (line 4) and the n-gram rules (line 5). In each case, we attempt to extend a state with an n-gram rule. We first test if the n-gram rule r is applicable to state s (line 6). If this is the case, we extend the state s with the rule r (line 7). We will now describe what kind of information is contained in a state. This will allow us to also describe the last two operations, CanApply and Extend.

1:  function Decode(w, N, G)
2:      Initialize(B)
3:      for 0 ≤ i < |w| do
4:          for state s ∈ B_i do
5:              for ngram r ∈ N do
6:                  if CanApply(s, r) then
7:                      Extend(s, r)
8:                  end if
9:              end for
10:         end for
11:     end for
12: end function

Figure 6.2: NgramGen decoding algorithm. The input is a bag of words w, a set N of n-gram rules with their coverages and a language model G. The column vector B is first initialised with a state representing the empty hypothesis. Then, the initial state is iteratively extended.
State Definition  The states must contain enough information to be able to extend a hypothesis, and states also represent all hypotheses that can be recombined. Thus, a state contains the following information:

• Coverage: the coverage is a bit vector that indicates which words in the input have been covered. This implies an ordering of the input bag of words. The ordering is arbitrary and we simply choose to represent the bag of words by the sequence of words to be recovered.

• History: in order to compute the n-gram language model score correctly when extending a partial hypothesis, we store the last n − 1 words of the partial hypothesis we want to extend. The definition of history is therefore the same as the definition of history for an n-gram language model. In our implementation, the history is simply encoded as an lm::ngram::State as defined in the KenLM toolkit (Heafield, 2011).

CanApply Operation  Given a state s with coverage x(s) and an n-gram rule r with coverage x(r), rule r can be applied to state s if x(s) and x(r) are disjoint. We will see in Section 6.6.6 that this constraint can be relaxed if we allow for overlapping n-grams.

Extend Operation  Given a state s with coverage x(s) and an n-gram rule r with coverage x(r) that can be applied to s, we extend s with r into a new state s′ as follows:

• The coverage x(s′) of s′ is the bitwise OR of x(s) and x(r):

x(s′) = x(s) | x(r)    (6.4)

• The new partial score sc(s′) for s′ is defined in terms of the partial score sc(s) for s, the history h(s) and the rule r:

sc(s′) = sc(s) + λ_g g(r | h(s)) + λ_φ · φ(r)    (6.5)

• If s′ already exists in the column corresponding to its coverage, then the score of the existing state is updated if sc(s′) is better than the existing score. Otherwise, s′ is simply added as a new state to the column corresponding to its coverage.

Pruning  The search space for an input bag of size |w| is the number of permutations of the input, which is |w|! if all input words are distinct. We therefore need to apply pruning during search to make the decoding process tractable. We support histogram pruning and threshold pruning (see Section 2.5.4). After each extension, if we add a new state, we enforce that the column where the state was added satisfies the pruning constraints. For histogram pruning, with m the maximum number of states per column, this means that if we add a new state to a column that already has m states, then the state with the lowest score is removed from that column. For threshold pruning with a threshold t, the situation is slightly more complex. When we want to add a state s with score sc(s) to a column where the best score among the existing states is w, we consider three cases:

• sc(s) ≤ w and sc(s) ≥ w − t: the state s is added to the column.
• sc(s) ≤ w and sc(s) < w − t: the state s is not added to the column.
• sc(s) > w: the state s is added to the column and all states in that column with a score less than sc(s) − t are removed from the column.

We have presented the decoding algorithm and the data structures that support it. We will now describe how the set of hypotheses is encoded with FSTs.
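The following is a minimal Python rendering of the algorithm of Figure 6.2 together with the state definition, the CanApply and Extend operations and histogram pruning. It is a sketch only: the actual NgramGen decoder encodes histories with KenLM, writes hypotheses to an FST, and additionally supports threshold pruning and further features; the lm_score function used here is an assumed stand-in for a real n-gram language model.

```python
from collections import namedtuple

# A state groups recombinable partial hypotheses: same coverage (bit tuple
# over the input bag) and same language model history (last n-1 words).
State = namedtuple("State", ["coverage", "history"])

def can_apply(state, rule_coverage):
    """A rule applies if its coverage is disjoint from the state coverage."""
    return all(not (s and r) for s, r in zip(state.coverage, rule_coverage))

def decode(bag, rules, lm_score, lm_order=2, beam=100):
    """Minimal rendering of Figure 6.2 (histogram pruning only, best-scoring
    hypothesis kept per state, no FST output).

    `rules` is a list of (ngram_tuple, coverage_tuple) pairs and `lm_score`
    is an assumed function lm_score(word, history_tuple) returning a log
    probability; both stand in for the real grammar and language model.
    """
    columns = [dict() for _ in range(len(bag) + 1)]   # states per column
    init = State(tuple(0 for _ in bag), ())
    columns[0][init] = (0.0, [])                      # (score, words so far)

    for i in range(len(bag)):                         # line 3 of Figure 6.2
        for state, (score, words) in sorted(columns[i].items(),
                                            key=lambda kv: -kv[1][0]):
            for ngram, cov in rules:                  # line 5
                if not can_apply(state, cov):         # line 6 (CanApply)
                    continue
                # Extend: new coverage, updated history and partial score
                new_cov = tuple(s | r for s, r in zip(state.coverage, cov))
                new_score, hist = score, state.history
                for w in ngram:
                    new_score += lm_score(w, hist)
                    hist = (hist + (w,))[-(lm_order - 1):]
                new_state = State(new_cov, hist)
                col = columns[sum(new_cov)]
                if new_state not in col or new_score > col[new_state][0]:
                    col[new_state] = (new_score, words + list(ngram))
                # histogram pruning: keep at most `beam` states per column
                if len(col) > beam:
                    worst = min(col, key=lambda s: col[s][0])
                    del col[worst]

    full = columns[len(bag)]
    return max(full.values(), key=lambda v: v[0]) if full else None
```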
6.5.3 FST Building

We mentioned in Section 6.5.2.2 that a state represents partial hypotheses that can be recombined, as well as the information necessary to extend a partial hypothesis. However, since we do not keep backpointers between states, we cannot recover hypotheses directly from B, the vector of columns. Instead, we build an FST that contains all hypotheses. For each extension, an arc is added to the output FST. We make use of the OpenFst toolkit (Allauzen et al., 2007). This allows us to rescore the output using tools already integrated with OpenFst (Blackwood, 2010). Another advantage is that hypotheses that are recombined are not deleted and all completed hypotheses are stored in the output FST. This allows hypotheses that would be discarded in recombination to remain in the output and get an opportunity to become the best hypothesis after rescoring.

6.5.4 Example

We now demonstrate how our algorithm works with an example. Our input consists of:

• a bag of words: [do, you, do]. Note that the word do is repeated.
• a bigram language model. Thus the history for each state consists of the last word of the partial hypothesis.
• monolingual n-gram rules with their coverages of the input bag of words. We use the rules listed in Table 6.1.

N-gram   Coverage
do       100
do       001
you do   011
you do   110
do you   110
do you   011

Table 6.1: Example of input monolingual n-gram rules with their coverages, corresponding to the example input [do, you, do]. The FST representing the set of hypotheses obtained with these rules is shown in Figure 6.3.

After running our algorithm without pruning, we obtain the data structure represented in Figure 6.3. We can see that the same state {111, do} is reused for the two hypotheses do you do and you do do, because these hypotheses have the same coverage and the same history. We also note that the hypothesis do you do is repeated. This is not an issue, as we apply a determinisation operation to the resulting FST.

We will now present various experiments that demonstrate the effectiveness of our decoder on the string regeneration task. We will also analyse various capabilities of the decoder in order to understand which configurations lead to better results.
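Before moving on to the experiments, here is a toy run of the decoding sketch given earlier on the worked example of Section 6.5.4, using the rules of Table 6.1 and a hand-made bigram score table in place of a real language model. The scores are purely illustrative.

```python
# Toy bigram "language model": hand-made log-score preferences, not an
# actual trained model; unseen bigrams receive a low default score.
BIGRAM_PREFS = {("<s>", "do"): -1.0, ("do", "you"): -0.5, ("you", "do"): -0.5,
                ("<s>", "you"): -2.0, ("do", "do"): -3.0}

def toy_lm_score(word, history):
    prev = history[-1] if history else "<s>"
    return BIGRAM_PREFS.get((prev, word), -5.0)

bag = ["do", "you", "do"]
# Rules and coverages from Table 6.1, as (ngram, coverage bit tuple) pairs.
rules = [(("do",), (1, 0, 0)), (("do",), (0, 0, 1)),
         (("you", "do"), (0, 1, 1)), (("you", "do"), (1, 1, 0)),
         (("do", "you"), (1, 1, 0)), (("do", "you"), (0, 1, 1))]

# Using the decode() sketch above; prints (-2.0, ['do', 'you', 'do']),
# i.e. the original sentence "do you do" is recovered as the best hypothesis.
print(decode(bag, rules, toy_lm_score))
```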