Recurrent Neural Network Language Generation for Dialogue Systems

Tsung-Hsien Wen
Department of Engineering
University of Cambridge

This dissertation is submitted for the degree of Doctor of Philosophy

Darwin College
March 2018

Declaration

I hereby declare that, except where specific reference is made to the work of others, the contents of this dissertation are original and have not been submitted in whole or in part for consideration for any other degree or qualification in this, or any other, university. This dissertation is my own work and contains nothing which is the outcome of work done in collaboration with others, except as specified in the text and Acknowledgements. This dissertation contains roughly 40,277 words including appendices, bibliography, footnotes, tables and equations, and has 53 figures and tables.

Some of the work presented here was published at the Special Interest Group on Discourse and Dialogue (Wen et al., 2015a), the Conference on Empirical Methods in Natural Language Processing (Wen et al., 2016a, 2015c), the North American Chapter of the Association for Computational Linguistics (Wen et al., 2016b), the European Chapter of the Association for Computational Linguistics (Wen et al., 2017c), and the International Conference on Machine Learning (Wen et al., 2017a).

Tsung-Hsien Wen
March 2018

Acknowledgements

First, I would like to thank my supervisor Steve Young for his dedicated supervision throughout the course of my PhD. It has always been a pleasure talking to him, and it is from him that I learned how to think clearly and perform effective research. His feedback is always well-considered and to the point, yet he never stops his research group from exploring interesting research ideas. He has been, in general, an inspiration, and it has truly been my pleasure to be his PhD student. I am also grateful to my advisor Milica Gasic for welcoming me into the group and for her help and guidance throughout my studies.
It has been a privilege to work with the Cambridge Dialogue Systems Group, which has provided a highly motivating and friendly environment for research. I would like to thank some of the group's previous members, David Vandyke, Dongho Kim and Pirros Tsiakoulis, for their guidance on the group infrastructure, and Filip Jurcicek for his excellent work setting up Amazon Mechanical Turk. I would also like to thank the current group members Stefan Ultes, Lina Maria Rojas Barahona, Inigo Casanueva, and Pawel Budzianowski for their invaluable discussion and support, Anna Langley and Patrick Gosling for their computing assistance, as well as Rachel Fogg and Diane Hazell for their excellent administrative work.

I would like to extend my gratitude to my peers Nikola Mrksic and Pei-Hao Su for their mutual support. It has been a great pleasure to be able to pursue my PhD with them. I would like to particularly acknowledge Pei-Hao Su for his support over the past ten years, which we have spent studying and working together. It has been a great journey.

Outside of the department, I would like to acknowledge Yishu Miao for his invaluable discussion on latent variable modelling. His cooperation has helped to shape the last part of this thesis. I acknowledge Toshiba Cambridge Research Laboratory for funding, and Yannis Stylianou, Zhuoran Wang, Alex Papangelis, Margarita Kotti, and Norbert Braunschweiler for their valuable comments and suggestions.

Lastly, I would like to offer a sincere thank you to my parents, Wen-Bing and Chiu-Yuan, for their caring encouragement. I could not have finished this thesis without their support. I am also very grateful to my girlfriend, Fang-Yu Chiu, for her companionship throughout this journey.

Abstract

Language is the principal medium for ideas, while dialogue is the most natural and effective way for humans to interact with and access information from machines.
Natural language generation (NLG) is a critical component of spoken dialogue systems and has a significant impact on usability and perceived quality. Many commonly used NLG systems employ rules and heuristics, which tend to generate inflexible and stylised responses without the natural variation of human language. The frequent repetition of identical output forms can quickly make a dialogue tedious for most real-world users. Additionally, these rules and heuristics are not scalable and hence not trivially extensible to other domains or languages. A statistical approach to language generation can learn generation decisions directly from data without relying on hand-coded rules or heuristics, which brings scalability and flexibility to NLG. Statistical models also provide an opportunity to learn in-domain human colloquialisms and to adapt models across domains.

A robust, quasi-supervised NLG model is proposed in this thesis. Motivated by the Long Short-Term Memory (LSTM) network, the model couples a Recurrent Neural Network (RNN)-based surface realiser with a gating mechanism applied to the input semantics, and learns end-to-end generation decisions from pairs of input dialogue acts and sentences. It integrates sentence planning and surface realisation into a single optimisation problem, which not only bypasses costly intermediate linguistic annotations but also produces more natural and human-like responses. Furthermore, a domain adaptation study shows that the proposed model can be readily adapted and extended to new dialogue domains via a proposed recipe.

Continuing the success of end-to-end learning, the second part of the thesis investigates building an end-to-end dialogue system by framing it as a conditional generation problem.
The proposed model encapsulates a belief tracker with a minimal state representation and a generator that takes the dialogue context as input to produce responses. Together, these features enable both comprehension and fast learning: the model is capable of understanding requests and accomplishing tasks after training on only a few hundred human-human dialogues. A complementary Wizard-of-Oz data collection method is also introduced to facilitate the collection of human-human conversations from online workers. The results demonstrate that the proposed model can converse naturally with human judges in a sample application domain. In addition, the results suggest that the introduction of a stochastic latent variable helps the system model intrinsic variation in communicative intention much better.

Table of contents

List of figures
List of tables
1 Introduction
  1.1 Thesis Outline and Contributions
2 Overview of Spoken Dialogue Systems
  2.1 Speech Interface
  2.2 Spoken Language Understanding
  2.3 Dialogue State Tracking
  2.4 Dialogue Management
  2.5 Natural Language Generation
    2.5.1 Template-based Language Generator
    2.5.2 Linguistically Motivated Approaches
  2.6 Conclusion
3 Corpus-based Natural Language Generation for Dialogue Systems
  3.1 Class-based Language Model
  3.2 Phrase-based Factored Language Model
  3.3 Example-based Approaches
  3.4 Attention-based Sequence-to-Sequence Approaches
  3.5 Evaluation Metrics and Difficulties
    3.5.1 BLEU score
    3.5.2 Slot Error Rate
    3.5.3 Language Variability
  3.6 Conclusions
4 Neural Networks and Deep Learning
  4.1 Neural Networks
    4.1.1 Recurrent Neural Networks
    4.1.2 Long Short-term Memory
    4.1.3 Bidirectional Networks
    4.1.4 Convolutional Neural Networks
  4.2 Objective Function
    4.2.1 Maximum Likelihood
    4.2.2 Discriminative Training
  4.3 Stochastic Neural Networks
    4.3.1 Conditional Variational Autoencoder Framework
    4.3.2 Neural Variational Inference
  4.4 Conclusion
5 Recurrent Neural Network Language Generators
  5.1 The Recurrent Language Generation Framework
  5.2 Gating Mechanism
    5.2.1 The Heuristically Gated LSTM Generator
    5.2.2 The Semantically Controlled LSTM Generator
  5.3 Attention Mechanism
    5.3.1 The Attentive Encoder Decoder LSTM Generator
  5.4 Data Collection
    5.4.1 Ontologies
    5.4.2 Methodology
    5.4.3 Dataset Overview
  5.5 Baselines
  5.6 Corpus-based Evaluation
    5.6.1 Experimental Setup and Evaluation Metrics
    5.6.2 Results
  5.7 Human Evaluation
    5.7.1 Experimental Setup and Evaluation Metrics
    5.7.2 Results
  5.8 Conclusions
6 Domain Adaptation
  6.1 Data Augmentation
  6.2 Feature-based Domain Adaptation
  6.3 Model-based Domain Adaptation
  6.4 Baseline Method - Model Fine-Tuning
  6.5 The Adaptation Recipe for Language Generators
    6.5.1 Data Counterfeiting
    6.5.2 Discriminative Training for Adaptation
  6.6 Corpus-based Evaluation
    6.6.1 Experimental Setup and Evaluation Metrics
    6.6.2 Results
  6.7 Human Evaluation
    6.7.1 Experimental Setup and Evaluation Metrics
    6.7.2 Results
  6.8 Conclusions
7 Generation by Conditioning on a Broader Context
  7.1 End-to-End Dialogue Modelling
  7.2 Goal-oriented Dialogue as Conditional Generation
  7.3 Neural Dialogue Model
    7.3.1 Intent Network
    7.3.2 The RNN-CNN Dialogue State Tracker
    7.3.3 Database Operator and Deterministic Policy Network
    7.3.4 Conditional LSTM Generator and Its Variants
  7.4 Wizard-of-Oz Data Collection
  7.5 Evaluation on Neural Dialogue Models
    7.5.1 Experimental Setup and Evaluation Metrics
    7.5.2 Results
    7.5.3 Human Evaluation
  7.6 Conclusions
8 Generation based on a Latent Policy
  8.1 Stochastic Policy as a Discrete Latent Variable
  8.2 Latent Intention Dialogue Model
    8.2.1 Model
    8.2.2 Inference
    8.2.3 Semi-supervised Learning
    8.2.4 Reinforcement Learning
  8.3 Evaluation on Latent Intention Dialogue Models
    8.3.1 Experimental Setup
    8.3.2 Results
    8.3.3 Human Evaluation
  8.4 Conclusions
9 Conclusions
  9.1 Limitations and Future Work
Appendix A Slot-based Dialogue Domains
Appendix B Dialogue Act Format
Appendix C Example of a Template-based Generator
Appendix D The Wizard-of-Oz website
References

List of figures

2.1 Dialogue system architecture
3.1 An example of the sentence for the "inform" DA and its corresponding stack representation and tree equivalent used in Mairesse and Young (2014). Figure borrowed from Mairesse and Young (2014).
3.2 An overview of the phrase-based DBN model for NLG. The lower level is a semantic DBN consisting of both mandatory (blue) and functional (white) stacks, while the upper level is a phrase-based DBN that oversees the mapping of semantic stacks to phrases.
3.3 Examples adopted from Dušek and Jurcicek (2016). Trees encoded as sequences used for the seq2seq generator (top) and the reranker (bottom).
3.4 Figure adopted from Dušek and Jurcicek (2016). The proposed attention-based seq2seq generator architecture.
3.5 Figure adopted from Dušek and Jurcicek (2016). The proposed reranker architecture.
4.1 A feed-forward Artificial Neural Network
4.2 Architecture of the Elman recurrent neural network. The recurrence is fully connected between the two hidden layers in adjacent time steps.
4.3 An unfolded view of an RNN. Each rectangle represents a layer of hidden units in a single time step. The weighted connections from input to hidden layer are Wx, those from hidden to output are Wh, and the hidden-to-hidden weights are Wr. Note that the same weights are reused at every time step.
4.4 The RNN vanishing gradient problem: the colour of each node indicates the sensitivity of that node to the input at time 0. Due to the repeated multiplication of the recurrent weights, the influence of the input will vanish or grow over time, and the network will eventually forget the first input.
4.5 A Long Short-Term Memory cell, an RNN architecture with the memory block Ct. it, ft, ot are the input, forget, and output gates. xt and ht−1 are the controlling signals of the three gates. Typically xt is the input at the current time step and ht−1 is the hidden layer representation at the previous time step.
4.6 The unfolded bidirectional RNN. The backward layer and forward layer encode information from both directions. Six sets of distinct weights are included. Note there are no connections between the two hidden layers.
4.7 Architecture of a CNN model called LeNet-5 (Lecun et al., 1998a). Each plane is a feature map that captures certain patterns of the previous layer. A fully connected FNN is attached on top as the classifier for recognising input digits.
4.8 Gradient estimation in stochastic computation graphs. Figure adapted from "Categorical Re-parameterization with Gumbel-Softmax" by Jang et al. (2016). (a) The gradient ∂f(h)/∂h can be computed directly via back-propagation if h is deterministic and differentiable. (b) The introduction of a stochastic node z precludes back-propagation because the sampling function z(n) ∼ pθ(z|x) does not have a well-defined gradient. (c) The re-parameterisation trick (Kingma and Welling, 2014; Rezende et al., 2014), which creates an alternative gradient path that circumvents the sampling function inside the network, allows the gradient to flow from f(z) to θ. Here the Gaussian re-parameterisation trick is used as an example, but other types of distributions can also be applied. (d) The scoring-function-based gradient estimator obtains an unbiased estimate of ∇f(z) by back-propagating along a surrogate loss fˆ∇log pθ(z|x), where fˆ = f(z) − b(z) and b(z) is a baseline for variance reduction.
5.1 The RNNLG framework for language generation.
5.2 The Heuristically Gated LSTM generator (H-LSTM). At each time step an LSTM cell processes the input token and the previous time step's hidden state. The DA gate is controlled by matching the currently generated token against a predefined semantic dictionary.
5.3 The Semantically Conditioned LSTM generator (SC-LSTM). The upper part is a traditional LSTM cell in charge of surface realisation, while the lower part is a sentence-planning cell based on a sigmoid control gate and a DA.
5.4 The Attention-based Encoder Decoder generator (ENC-DEC). The DA vector is created by concatenating a deterministic DA embedding with an attentive slot-value embedding over potential slot-value pairs. The dashed lines show how the attentive weights are created by the two control signals.
5.5 Examples showing how the SC-LSTM controls the dialogue features flowing into the network through the reading gates. Despite errors due to sparse training data for some slots, each gate generally learned to detect the words and phrases relating to its associated slot-value pair.
6.1 An example of the data counterfeiting approach. The slots highlighted in red are randomly sampled from the target domain, with the prerequisite that the counterfeited slot must be in the same functional class as the original slot.
6.2 Hotel to Restaurant domain adaptation
6.3 Restaurant to Hotel domain adaptation
6.4 TV to Laptop domain adaptation
6.5 Laptop to TV domain adaptation
6.6 Restaurant + Hotel to Laptop + TV domain adaptation
6.7 Laptop + TV to Restaurant + Hotel domain adaptation
7.1 The proposed Neural Dialogue Model framework
7.2 Tied Jordan-type RNN belief tracker with delexicalised CNN feature extractor. The output of the CNN feature extractor is a concatenation of the top-level sentence embedding (green) and several levels of intermediate n-gram-like embeddings (red and blue). If a value cannot be delexicalised in the input, its n-gram-like embeddings are padded with zeros. Vectors are zero-padded (grey) before each convolution operation to ensure the representation at each layer has the same length. The output of each tracker bts is a distribution over the values of a particular slot s.
7.3 Three different conditional generation architectures.
7.4 The action vector embedding zt generated by the vanilla NDM model. Each cluster is labelled with the first three words that the embedding generated.
8.1 LIDM for Goal-oriented Dialogue Modelling
D.1 The user webpage. The worker who plays a user is given a task to follow.
For each MTurk HIT, he/she needs to type an appropriate sentence to carry on the dialogue by looking at both the task description and the dialogue history.
D.2 The wizard page. The wizard's job is slightly more complex: the worker needs to go through the dialogue history, fill in the form (top, green) by interpreting the user input at this turn, and type in an appropriate response based on the history and the DB result (bottom, green). The DB search result is updated when the form is submitted. The form can be divided into informable slots (top) and requestable slots (bottom), which contain all the labels needed to train the trackers.

List of tables

5.1 Ontologies of the Restaurant, Hotel, Laptop and TV domains
5.2 Properties of the four datasets collected
5.3 Corpus-based evaluation on four domains. The results were produced by training each model on 5 random seeds (1-5) and selecting the models with the best BLEU score on the validation set. Results on both the test and the validation set are reported. The approach with the best performance on a metric in a domain is highlighted in bold, while the worst is underlined.
5.4 Samples of the top 5 realisations from the SC-LSTM output.
5.5 Human evaluation of the Restaurant domain. The significance flag in the Utterance Quality Evaluation section indicates whether a model is significantly worse than the reference based on a Wilcoxon signed-rank test. The Pairwise Preference Comparison section indicates whether a model is significantly preferred to its opponent based on a two-tailed binomial test.
5.6 Human evaluation of the Laptop domain. The significance flag in the Utterance Quality Evaluation section indicates whether a model is significantly worse than the reference based on a Wilcoxon signed-rank test, while in the Pairwise Preference Comparison section it indicates whether a model is significantly preferred to its opponent based on a two-tailed binomial test.
6.1 Human evaluation of utterance quality in two adaptation scenarios. Results are shown for two metrics (rated out of 3). Statistical significance was computed using a Wilcoxon signed-rank test between the model trained with full data (scrALL) and all others.
6.2 Pairwise preference test among four approaches in two adaptation scenarios. Statistical significance was computed using a two-tailed binomial test.
7.1 Corpus-based experiment comparing different NDM architectures. The results were obtained by training models with several hyper-parameter settings. The testing performance is based on the best model on the validation set.
7.2 Human assessment of the NDM. The ratings for comprehension and naturalness are both out of 5.
7.3 A comparison of the NDM with a rule-based modular system (HDC). A two-tailed binomial test was used.
7.4 Samples of real text conversations between online judges and the NDM.
8.1 An example of the automatically labelled response seed set for semi-supervised learning during variational inference.
8.2 Corpus-based experiment comparing different NDM and LIDM architectures. The results were obtained by training models with several hyper-parameter settings; the testing performance reported is based on the best model on the validation set. The best performance on each metric in each big block is highlighted in bold.
8.3 Human evaluation based on text-based conversations.
8.4 A sample dialogue from the LIDM, I=100 model, one exchange per block. Each induced latent intention is shown as an (index, probability) tuple, followed by a decoded response. The sample dialogue was produced by following the responses highlighted in bold.
8.5 Two sample dialogues from the LIDM+RL, I=100 model, one exchange per block. Compared to Table 8.4, the RL agent demonstrates a much greedier behaviour toward task success: in blocks 2 and 4 the agent provides the address and phone number even before the user asks.
A.1 Slots in the restaurant, hotel, laptop, and TV domains respectively. All the informable slots are also requestable. The group Sreq\Sinf shows the requestable slots that are not informable. Note the type slot contains only one value, which is the domain string.
B.1 The set of dialogue acts used in RNNLG. Note that the dialogue act definition here differs from Young (2007) in that the inform act is broken down into specific categories to facilitate data collection and learning.
C.1 A sample snapshot of the root rules of the template-based generator used in this thesis. Note that both slot and value can be variables. Matching is done by matching as many slot-value pairs as possible.
C.2 A sample snapshot of the lower-level rules.

Chapter 1

Introduction

The development of conversational skills is an important part of human socialisation. Dialogues (or conversations) between people facilitate the exchange of ideas, experiences and knowledge.
If we consider the entire World Wide Web (WWW) as a database that stores known facts, then the most natural interface for accessing this knowledge would be direct conversation with an Artificial Intelligence (AI) agent that can access the web. This thesis focuses on developing dialogue systems, which are prototypes of such AI agents.

The AI field traces back to the Dartmouth proposal (McCarthy et al., 1955), which claims that human intelligence "can in principle be so precisely described that a machine can be made to simulate it". Turing's imitation game (Turing, 1950) was designed to test this hypothesis by asking human participants to judge whether they are talking to a human or a computer system. The Loebner Prize (Epstein, 1992), a derivative of the Turing test for text-based dialogue systems, has yet to produce a prominent success in the almost 26 years since its inception. An early effort that appeared to pass the Turing test is ELIZA (Weizenbaum, 1966), a chat program developed by Joseph Weizenbaum. ELIZA works by examining keywords in the user's request: if a keyword is found, the response is formed by transforming the user input according to a handcrafted rule; if no keyword is found, ELIZA replies either with a clever riposte or with one of its earlier responses. Although ELIZA could fool some people into believing that they were talking to a human, it was not considered AI because it knows nothing about the real world and only acts upon a set of pre-programmed scripts.

Although open-domain, AI-based dialogue systems that can handle arbitrary conversations with encyclopaedic knowledge are appealing, developing such systems has proven challenging (Lyons, 2007) and the current state of the art remains very far from that goal.
Therefore, this thesis focuses on building goal-oriented dialogue systems for applications whose domains are much more constrained, so that the system can assist users with well-defined tasks such as tourist guidance (Misu and Kawahara, 2007), movie searches (Liu et al., 2012), flight reservations (Seneff and Polifroni, 2000), and information retrieval (Wen et al., 2013b, 2012b).

Due to the complexity of developing end-to-end dialogue systems, state-of-the-art approaches divide the problem into three components: language understanding, dialogue management, and language generation. This modular system design relies on the dialogue act protocol (Traum, 1999) to represent the semantics of, and communication between, system modules. In the past few decades, research on goal-oriented dialogue systems has focused either on building better language comprehension engines (Allen, 1995; Thomson and Young, 2010), so that the system can understand human queries better, or on developing better dialogue managers (Roy et al., 2000a; Young et al., 2010), so that the system can help users accomplish tasks efficiently. Language generation was largely overlooked, and handcrafted templates became common practice in the dialogue system development process (Cheyer and Guzzoni, 2007; Mirkovic and Cavedon, 2011).

The Natural Language Generation (NLG) component provides much of the personality of a dialogue agent, which has a significant impact on a user's impression of the system. Dialogue systems with strong comprehension and decision-making capabilities may be effective taskmasters, but they often fail to engage users in conversation if a rigid, template-based generator is used to render text. Critics argue that, with careful engineering effort and sufficient resources, a handcrafted system can still achieve impressive results. Nevertheless, handcrafted generators remain difficult to scale up to broader domain coverage.
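The rigidity and scaling problems of template-based generation can be illustrated with a minimal sketch. The dialogue act format, slot names, and templates below are hypothetical simplifications, not the actual rules of the generator described in this thesis:

```python
# A toy template-based generator. Each (act type, slot combination)
# maps to exactly one handcrafted surface form, so identical inputs
# always yield identical, stylised output.

TEMPLATES = {
    ("inform", ("food", "name")):
        "%(name)s is a nice restaurant serving %(food)s food.",
    ("request", ("area",)):
        "Which area are you looking for?",
}

def generate(act_type, slots):
    """Render a dialogue act as text via a handcrafted template lookup."""
    key = (act_type, tuple(sorted(slots)))
    template = TEMPLATES.get(key)
    if template is None:
        # Every new slot combination, domain, or language needs another
        # hand-written rule -- the scalability problem noted above.
        raise KeyError("no template for act %r with slots %r"
                       % (act_type, sorted(slots)))
    return template % slots

print(generate("inform", {"name": "Seven Days", "food": "Chinese"}))
# -> Seven Days is a nice restaurant serving Chinese food.
```

However polished each individual rule is, the same dialogue act can never be realised in more than one way, which is precisely the repetitiveness that motivates a statistical generator.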
To allow machine responses to be more natural and to provide a more engaging conversational flow, this thesis adopts a statistical framework trained on data. The key to handling complex structured prediction problems such as NLG with only lightweight labelling effort is the use of Recurrent Neural Network (RNN) language models for sequence generation (Mikolov et al., 2010; Sutskever et al., 2011). An RNN directly learns the mapping from an arbitrary vector representation to surface forms by generating words one by one, each from its hidden state. The word embeddings (Mikolov et al., 2013), which the model learns during training, grant the model the ability to generalise to rare or unseen examples. Meanwhile, the recurrent connection gives the model the potential to capture long-term dependencies in the data. The flexible structural design of neural networks also allows the model to be quickly adapted to specific problems, so that additional features can easily be studied and exploited.

1.1 Thesis Outline and Contributions

The thesis is split into three main sections. After an overview of Spoken Dialogue Systems in Chapter 2, Chapters 3, 4, and 5 focus on Natural Language Generation from dialogue act taxonomies, while Chapter 6 studies the domain scalability of the proposed NLG methods. Chapters 7 and 8 then look at the wider context of NLG and frame it as an end-to-end dialogue modelling problem. Below is a description of each of these chapters and their contributions.

Chapter 2, Overview of Spoken Dialogue Systems
This chapter presents an overview of a modular Spoken Dialogue System architecture. A series of system modules is reviewed and the corresponding state-of-the-art approaches are presented. Two classes of Natural Language Generation (NLG) approaches are also introduced: (1) template-based approaches, and (2) linguistically motivated approaches.
The aim of this chapter is to lay the foundations of statistical dialogue modelling and NLG and to set up the context of the thesis.

Chapter 3, Corpus-based Natural Language Generation for Dialogue Systems This chapter reviews corpus-based NLG methods and their applications to dialogue systems, including the class-based language model, the phrase-based factored language model, and example-based methods for NLG. The evaluation of NLG and its difficulties are also presented. This is where the BLEU score and slot error rate metrics are both introduced. These two metrics are then used as the primary objective metrics for evaluating the NLG models proposed in this thesis.

Chapter 4, Neural Networks and Deep Learning This chapter introduces the basic concepts of Neural Networks (NNs), from model formulation, objective functions, and optimisation, through to variants such as the Recurrent Neural Network (RNN) and the Convolutional Neural Network (CNN). The second part of the chapter extends the deterministic NN models to stochastic ones and introduces the Conditional Variational Autoencoder (cVAE) framework and its optimisation techniques. The goal of this chapter is to present the building blocks and optimisation algorithms used in this thesis.

Chapter 5, Recurrent Neural Network Language Generator This chapter extends the idea of the Recurrent Neural Network Language Model (Mikolov et al., 2010) and proposes Recurrent Neural Network Language Generation (RNNLG), which integrates an RNN LM as the surface realiser to produce sentences sequentially. This generation process is conditioned on a dialogue act representation and managed by either a gating mechanism or an attention mechanism to ensure that it conveys the intended meaning. Three RNNLG models are proposed: the Heuristically Gated LSTM (H-LSTM) and the Semantically Conditioned LSTM (SC-LSTM), which employ controlling gates, and the Attention-based Encoder-Decoder (ENC-DEC), which uses an attention mechanism.
In the experimental section, four corpora are presented. Each of them is collected via Amazon Mechanical Turk and labelled in the form of dialogue act and sentence pairs. The three RNNLG models and baselines are then evaluated, using both a corpus-based evaluation and a human evaluation on the four corpora. The results show that the SC-LSTM performs best by integrating a learnable gate for latent sentence planning.

Chapter 6, Domain Adaptation This chapter investigates the domain scalability of the proposed SC-LSTM generator. A set of domain adaptation approaches are reviewed and a recipe for adapting the NLG model is proposed based on the survey. The recipe encompasses two methods: data counterfeiting and discriminative training, which are both validated by a corpus-based evaluation and a human evaluation. By bootstrapping from data of existing domains, the results demonstrate that the proposed adaptation recipe can produce a very competitive NLG model with only a few hundred in-domain training examples.

Chapter 7, Generation by Conditioning on a Broader Context Speculating on the possibility of training NLG on a broader dialogue context, this chapter presents the Neural Dialogue Model (NDM), which directly learns end-to-end dialogue decisions from data without the provision of dialogue act annotations. The proposed NDM contains several modularly-connected neural network components that mirror those of a Partially Observable Markov Decision Process (POMDP) system. Unlike POMDP-based systems, NDM directly learns dialogue strategies together with language comprehension and generation via supervised learning. The experimental results show that NDM is able to generate appropriate sentences directly from the encoded dialogue context and help the user to accomplish tasks. A series of studies also follows to analyse and understand the representations learnt by NDM.
Chapter 8, Generation based on Latent Policy In this chapter we continue to pursue the end-to-end dialogue modelling problem and extend NDM to the Latent Intention Dialogue Model (LIDM) by introducing a discrete latent variable with the aim of modelling the variations in dialogue responses. The conditional Variational Autoencoder (VAE) framework is employed and proves useful to the model, i.e., meaningful variation does exist in dialogue. In a corpus-based evaluation, LIDM outperforms its deterministic counterpart NDM. Human judges also confirm that LIDM can carry out more natural conversations than NDM. We believe this is a promising step forward, particularly for building autonomous dialogue agents, since the learnt discrete latent variable interface not only enables the agent to perform learning using several paradigms, but also demonstrates competitive performance and strong scaling potential.

Finally, Chapter 9 concludes the thesis, providing a critical analysis of the benefits and limitations of the proposed approaches for NLG. We discuss how these approaches may help to create real-world dialogue systems that are much more natural and scalable, yet with a much shorter development life cycle.

Chapter 2 Overview of Spoken Dialogue Systems

This chapter provides an overview of Spoken Dialogue Systems (SDS) theory. As shown in Figure 2.1, a typical SDS contains five components arranged in a pipeline. There is no common agreement on a single dialogue system architecture in the literature; a representative reference can be found in (Pieraccini and Huerta, 2008). One cycle through the pipeline is a dialogue turn, which involves both a user utterance and a system utterance. This thesis focuses on task-oriented dialogue systems, where the task is information-seeking and the domain can be defined by a set of slots and their corresponding values.
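Concretely, such a slot-based, information-seeking domain can be sketched as a small data structure. The slot names and values below are illustrative only, not the exact ontologies used in this thesis (those are given in the appendices):

```python
# A minimal sketch of a slot-based domain (illustrative slot names).
RESTAURANT_DOMAIN = {
    # slots the user can specify as search constraints
    "informable": {
        "food": ["chinese", "indian", "italian"],
        "pricerange": ["cheap", "moderate", "expensive"],
        "area": ["north", "south", "centre"],
    },
    # slots the user can ask about once an entity is found
    "requestable": ["address", "phone", "postcode"],
}

def is_valid_constraint(domain, slot, value):
    """Check that a user-provided constraint is defined by the domain."""
    return value in domain["informable"].get(slot, [])
```

Defining the domain declaratively like this is what makes the slot-based formulation attractive: the same pipeline can, in principle, be re-targeted to a new domain by swapping the ontology.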
Slots are variables that the user can either specify or ask about in the given domain. For example, in a restaurant booking dialogue system, the user may look for restaurants by specifying the food type and price range slots, and ask for the address and phone number slots. Appendix A provides a more detailed discussion of the slot-based dialogue domains used in this thesis. The subsequent sections look at each of these system components in detail.

Fig. 2.1 The pipeline architecture of a typical dialogue system.

2.1 Speech Interface

Automatic Speech Recognition (ASR) and Speech Synthesis are components at the two ends of the SDS that provide a speech interface with the user. In modern SDS development, these two components are usually implemented separately from the others; in this way, the speech interface can be generally applicable across a variety of dialogue domains. Many traditional ASR systems use generative graphical models such as Hidden Markov Models (HMMs) to model the acoustics of speech signals (a review is given in Gales and Young (2007)) and n-gram-based language models to capture the transitions between word symbols (Goodman, 2001). In recent years, however, this approach has been largely replaced by discriminative artificial neural networks in both acoustic modelling (Deng et al., 2013; Hinton et al., 2012) and language modelling (Bengio et al., 2003; Mikolov et al., 2010). There is even work on end-to-end ASR using Recurrent Neural Networks (RNNs) (Amodei et al., 2016; Graves and Jaitly, 2014). Deep neural networks can learn several layers of input representations along with top-level objectives, so that we do not need to rely on handcrafted feature extractors like MFCCs (Logan, 2000) to represent the raw inputs, but can instead model them directly.
This additional flexibility renders deep neural networks a very powerful function approximator and the current state-of-the-art classifier in many machine perception applications. The output form of an ASR component is particularly important for SDS research because recognition errors impose uncertainty on the dialogue system and decrease its robustness. An ASR component assigns posterior probabilities to the word hypotheses of an utterance given its acoustics; thus, a typical form of output is an N-best list of hypotheses and their corresponding probabilities. The resulting top hypotheses usually have the same meaning; variations only appear in articles and other short function words. Other output forms, such as word lattices (Murveit et al., 1993) and word confusion networks (Mangu et al., 2000), are also possible choices. These multiple hypotheses provide a more robust estimation of the user speech, making the system more resistant to noise. The speech synthesis component, on the other hand, takes the system's response and converts it back to speech. The most common approach is the unit selection method, in which waveforms are constructed by concatenating segments of recordings held in a database, as in the Festival synthesiser (Taylor et al., 1998). Another widely used approach is the HMM-based synthesiser, a statistical method using a generative model of speech (Zen et al., 2007). This statistical method can learn context-sensitive behaviour for dialogue if provided with suitably natural speech data (Tsiakoulis et al., 2014). Recently, deep neural networks have also been applied to speech synthesis (Zen et al., 2013). The most famous approach is WaveNet (van den Oord et al., 2016), a fully convolutional neural network (CNN) whose dilation parameters allow receptive fields to grow exponentially with depth over thousands of time steps.
Despite the challenge of runtime computation efficiency, this approach can generate more natural speech and reduce the gap with human performance by over 50%.

2.2 Spoken Language Understanding

Spoken Language Understanding (SLU), the first component of a dialogue system after the speech interface, identifies and extracts the semantics of an utterance given a user's input text or the output of the ASR. Although many forms of semantic representation are available, most existing spoken dialogue systems use a shallow level of semantics called the dialogue act (DA) (Traum, 1999), which is derived from the concept of the speech act (Searle, 1969). Most DA taxonomies are designed to capture just enough meaning in an utterance to facilitate rational system behaviours within the domain; they are limited in the semantics they can model and therefore constrain the scalability and learnability of the system. An SLU component (or semantic decoder) takes the utterance as input and maps it to an output DA representing the user semantics. Many spoken dialogue systems use a semantic template grammar, such as the Phoenix parser (Ward, 1990), to extract the DA from the utterance (Young and Proctor, 1989; Zue et al., 2000). These approaches first produce a parse tree of the input sentence based on a set of handcrafted rules that map words or phrases to their corresponding semantic concepts. The semantic concepts can then be grouped together to form the DA that represents the sentence. While usually effective, these rules are domain-specific and require multiple iterations of modification before achieving adequate coverage (Young, 2002). More advanced grammars, such as Combinatory Categorial Grammar (CCG) (Steedman, 2000), have also been applied to SLU. Dealing with ungrammatical spontaneous speech and erroneous recognition output requires relaxed grammars (Nguyen et al., 2006; Zettlemoyer and Collins, 2007).
The major advantage of these parsing-based methods is that they can readily model long-range dependencies, which is important when applied to complex language where concepts may be split up in the utterance, e.g., by relative clauses. Modern SLU methods can be roughly grouped into two classes: those with internal sequential labels at the word level and those with only a sentence-level label. The first class of methods usually requires an alignment between the words in the input utterance and the target semantics, while the second does not. Most sequential-labelling SLU methods adopt the BIO tag scheme, which provides a way of aligning spans of words with labels. For example, an input utterance "Seven Days is a nice Chinese restaurant in the north ." can be labelled as "B-restaurant I-restaurant O O O B-foodtype O B-area I-area I-area .", where "B" indicates that the word is at the beginning of a concept, "I" inside a concept, and "O" outside any concept. On the other hand, if the label is at the sentence level, it can simply be the DA representation of the sentence, such as "inform(name=Seven Days, food=chinese, area=north)". Based on the input utterances and their corresponding labels, we can then apply machine learning algorithms to learn the SLU component directly from data. To frame SLU as a machine learning problem, we can use either a generative model or a discriminative model. Given the input x and label y, generative models learn a joint probability distribution P(x,y) and then assign labels based on the derived conditional distribution P(y|x). Dynamic Bayesian Networks (DBNs) were widely applied to SLU by modelling the semantics of the input utterance as a joint probability distribution over the hidden variables and the observed words (He and Young, 2006; Levin, 1995; Miller et al., 1994).
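The span-merging step implied by the BIO scheme above can be sketched as follows; this is a hypothetical helper for illustration, not part of any cited system:

```python
def bio_to_spans(words, tags):
    """Merge BIO tags into (label, phrase) spans -- the step that turns
    word-level SLU output into concept values."""
    spans, cur_label, cur_words = [], None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            if cur_label:                       # close any open span
                spans.append((cur_label, " ".join(cur_words)))
            cur_label, cur_words = tag[2:], [word]
        elif tag.startswith("I-") and cur_label == tag[2:]:
            cur_words.append(word)              # continue the open span
        else:                                   # "O" or inconsistent I- tag
            if cur_label:
                spans.append((cur_label, " ".join(cur_words)))
            cur_label, cur_words = None, []
    if cur_label:
        spans.append((cur_label, " ".join(cur_words)))
    return spans

words = "Seven Days is a nice Chinese restaurant in the north".split()
tags = ["B-restaurant", "I-restaurant", "O", "O", "O",
        "B-foodtype", "O", "B-area", "I-area", "I-area"]
spans = bio_to_spans(words, tags)
# → [('restaurant', 'Seven Days'), ('foodtype', 'Chinese'), ('area', 'in the north')]
```

The extracted spans can then be assembled into a sentence-level DA such as the "inform(...)" form shown above.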
Moreover, to overcome the difficulty of modelling long-term dependencies between words imposed by the Markovian assumption, some approaches introduce a hierarchical hidden structure (He and Young, 2006; Miller et al., 1996) into the generation process. Discriminative models, on the other hand, directly learn the conditional probability distribution P(y|x) of labels given a feature representation of the input utterance. Unlike generative models, which rely on independence assumptions over the feature set, discriminative models are not constrained by these assumptions, and therefore any useful feature can be directly included in the model. Empirical studies have shown that discriminative models tend to outperform generative models on the SLU task (Wang and Acero, 2006). Among the many discriminative approaches, such as Markov Logic Networks (Meza-Ruiz et al., 2008) and Support Vector Machines (Kate and Mooney, 2006; Mairesse et al., 2009), Conditional Random Fields (CRFs) (Lafferty et al., 2001) are a particularly popular choice for SLU (Wang and Acero, 2006; Zhou and He, 2011). CRFs have also been used to create associations with the word confusion network (Tur and Deoras, 2013), to jointly classify utterance topics (Jeong and Geunbae Lee, 2008), or, combined with Convolutional Neural Networks (CNNs), for joint intent detection and slot filling (Celikyilmaz and Hakkani-Tur, 2015). Recently, progress has been made by applying discriminative neural network approaches to language comprehension. Since the input sentence has variable length, Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) are the two most common choices for sentence encoding.
Methods that adopt an RNN for sentence encoding typically read words one by one from the beginning of the sentence; depending on the label type, the model can either predict a label at each time step, as in sequential labelling (Mesnil et al., 2015; Yao et al., 2014), or predict a sentence-level label based on the encoded sentence vector. Although RNNs can potentially capture global sentence meanings, CNNs (Kalchbrenner et al., 2014; Kim, 2014) are also a reasonable sentence encoding method because they can capture both global and local sentence semantics by applying several levels of convolution and pooling operations. Convolution filters allow the model to make predictions based on locally-extracted features while ignoring the global context whenever necessary. Mrkšić et al. (2016b) employed CNNs as the feature extractor for a set of binary semantic classifiers and explored pre-trained word embeddings to improve the semantic decoding of dialogue systems. Recently, Rojas Barahona et al. (2016) also showed that the feature representations of a CNN can be weighted by the ASR confidence score of the input hypothesis to produce a more robust semantic decoder. Furthermore, CNN-based semantic decoders have also been studied for zero-shot intent expansion (Chen et al., 2016).

2.3 Dialogue State Tracking

In a single-turn conversation scenario, such as question answering, the output of the semantic decoder should provide enough information to fully encode the user's request. However, if multi-turn conversations are expected, the dialogue state should be tracked from turn to turn to accumulate the information needed to make system decisions. Therefore, the term dialogue state loosely denotes a full representation of the information the system needs to take its next action. For example, in a slot-filling dialogue system, the dialogue state is typically defined by a list of user-provided constraints.
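Ignoring ASR uncertainty for the moment, this constraint-accumulation view of the dialogue state can be sketched minimally (illustrative slot names only):

```python
def update_state(state, turn_constraints):
    """Fold one turn's user-provided slot-value constraints into the
    dialogue state; a later value for a slot overwrites an earlier one
    (the user changed their mind)."""
    new_state = dict(state)          # keep the previous state unmodified
    new_state.update(turn_constraints)
    return new_state

state = {}
state = update_state(state, {"food": "chinese"})    # turn 1
state = update_state(state, {"area": "north"})      # turn 2
state = update_state(state, {"food": "italian"})    # turn 3: revision
```

After three turns the state holds the accumulated constraints {"food": "italian", "area": "north"}; real trackers must additionally cope with recognition errors, which is exactly where the belief-state formulation comes in.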
When dealing with erroneous ASR output, a distribution over possible dialogue states is usually maintained instead of a point estimate. This distribution is sometimes referred to as the belief state of the dialogue (Rapaport, 1986). The five Dialogue State Tracking Challenges (DSTC 1–5) (Henderson et al., 2014a,b; Kim et al., 2016a,b; Williams et al., 2013) have sparked interest in dialogue state tracking. Baseline methods like the focus tracker use a set of handcrafted rules to update the dialogue state given the observations provided by the SLU (Smith, 2014; Wang and Lemon, 2013). As with SLU, both generative and discriminative models can be applied to track dialogue states. Generative DST approaches generally follow the Hidden Information State model (Young et al., 2010) and use a Dynamic Bayesian Network to model the joint distribution of the observed input semantics (the output of the SLU) and the hidden dialogue state. Although many variants exist (Lee et al., 2014; Williams, 2010), one of the most representative approaches is the Bayesian Update of Dialogue State (BUDS) system (Thomson and Young, 2010). Instead of learning the joint distribution, discriminative approaches to DST directly model the conditional probability distribution of the dialogue state given the input semantic labels from the SLU output. The parameters are then updated by maximising the conditional log-likelihood. In the first few DSTCs, the most popular discriminative methods were Maximum Entropy and neural network models (Henderson et al., 2013; Lee and Eskenazi, 2013; Ren et al., 2014; Sun et al., 2014). However, neural network approaches (Henderson et al., 2014c; Mrkšić et al., 2016b; Perez, 2016) gradually came to dominate the field because of their flexibility and strong discriminative power.
Furthermore, recent research in DST also shows a tendency to integrate DST and SLU and model the pair with a single neural network (Henderson et al., 2014c; Mrkšić et al., 2015). Although these end-to-end DST methods tend to produce models with better performance and mitigate the error propagation problem, they sacrifice the interpretability of the system and impede debugging.

2.4 Dialogue Management

Based on the dialogue state inferred by the DST component, the system's dialogue manager must decide what to say next. Conventional dialogue systems typically maintain a single hypothesis of the dialogue state and make system decisions based on this point estimate (e.g., flowchart-based systems (Lucas, 2000; Sutton et al., 1996), form-filling systems (Goddeau et al., 1996), and systems that make decisions based on logical inference and planning (Larsson and Traum, 2000)). However, none of these approaches suggests a systematic way of learning which actions to take in each dialogue state. Thus, the dialogue manager remains one of the many handcrafted components of the system. By casting the decision-making step as a Markov Decision Process (MDP), the action selection model can be learned directly from interactions (Levin and Pieraccini, 1997; Singh et al., 1999). However, learning action selection based on a point estimate of the dialogue state is not ideal because ASR errors make the estimate unreliable. Therefore, the Partially Observable Markov Decision Process (POMDP) formulation of dialogue systems (Young, 2002) offers a more robust and well-founded framework for statistical dialogue modelling. In a POMDP-based statistical dialogue system, the ASR, SLU, and DST components provide multiple output hypotheses to encode recognition uncertainty. A probability distribution over dialogue states, rather than a point estimate, is then used as the input to the dialogue management component.
Therefore, the dialogue manager can be cast as a POMDP, which considers the distribution over many dialogue states when selecting the next system action (Roy et al., 2000b; Williams and Young, 2005). If we define the dialogue state as s and the action as a, then the belief state is a probability distribution over the possible states, b = P(s). The mapping π from a state to an action is called the policy of the system. Learning the policy can be facilitated by defining a reward function r(s,a) for each state-action pair and optimising the parameters against the expected accumulated reward (or total return)

R = \sum_t r(s_t, a_t)    (2.1)

This learning paradigm is called Reinforcement Learning (RL) (Sutton and Barto, 1998) because good actions are reinforced with positive rewards during training. There are two major approaches to applying RL to dialogue management: value-based and policy-based methods. In value-based RL, a value function Q(s,a) is the expected future reward for taking action a in state s. Once the Q function is learned, the optimal policy can be extracted by applying the maximum operator

\pi(s) = \arg\max_a Q(s,a)    (2.2)

Recently, Gaussian Processes (GPs) have been used to model Q functions (Gasic and Young, 2014) for policy learning in dialogue systems. This technique has shown impressive efficiency in policy optimisation, which allows learning from real human subjects (Gašić et al., 2013) via direct interaction. On the other hand, policy-based methods (or policy gradients) parameterise the policy as a conditional probability distribution of action given state, π = P(a|s), and optimise π directly against the reward function. Though optimisation is typically harder and less efficient (Jurčíček et al., 2011), policy-based methods are attractive because they naturally combine different learning paradigms (Silver et al., 2016; Su et al., 2016a).
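A minimal tabular sketch of the value-based view, with invented states, actions, and Q-values (in practice Q(s,a) is learned, e.g. with a Gaussian Process):

```python
# Hypothetical Q-values for a two-state, two-action toy dialogue.
Q = {
    ("need_area", "request_area"): 1.2,
    ("need_area", "inform"): -0.5,
    ("complete", "request_area"): -1.0,
    ("complete", "inform"): 2.0,
}

def policy(state, actions):
    """Greedy policy extraction (Equation 2.2): pick argmax_a Q(s, a)."""
    return max(actions, key=lambda a: Q[(state, a)])

def total_return(rewards):
    """Total return (Equation 2.1): the accumulated per-turn rewards."""
    return sum(rewards)
```

With these values the policy asks for the missing area slot when the state is "need_area" and informs the user once the state is "complete"; a typical reward shape is a small per-turn penalty plus a large success bonus, e.g. total_return([-1, -1, 20]).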
2.5 Natural Language Generation

Natural Language Generation (NLG) is the final module in the dialogue system pipeline. Its main responsibility is to transform the semantic representation offered by the dialogue manager into natural language. Since language is the primary medium of communication between the user and the dialogue system, NLG is critical: it has a significant impact on usability and perceived quality. However, data-driven NLG for SDS applications remains relatively unexplored due to the lack of an effective algorithm that can learn from structured input-output pairs and the difficulty of collecting semantically aligned corpora (Dušek and Jurcicek, 2015). Most existing NLG approaches commonly employ rules and heuristics (Cheyer and Guzzoni, 2007; Mirkovic and Cavedon, 2011); they also tend to generate rigid and stylised responses without the natural variation of human language. Moreover, these limitations significantly add to development cost and slow down cross-domain, cross-lingual dialogue system deliveries by making them complex and expensive. In a pipelined dialogue system setting, NLG is simplified to a module whose task is to transform the information encoded in the semantic representation output by the dialogue manager (Gasic and Young, 2014; Young et al., 2013) into human-readable text. Therefore, the system does not need to handle content selection, only sentence planning and surface realisation (Konstas and Lapata, 2012; Walker et al., 2002). Typically, the NLG takes a DA representation as its input and outputs one or more sentences in natural language. Since there are many ways to define a set of DAs, this thesis mainly works with the DAs described in Appendix B. For example, a DA such as "inform(name=seven_days, food=chinese)" could be transformed into either "Seven days serves Chinese food" or "There is a Chinese restaurant called Seven days".
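The DA representation used above is simply an act type with slot-value pairs; rendering it in the canonical textual form is straightforward (a sketch for illustration, not the thesis's actual implementation):

```python
def da_to_string(act_type, slots):
    """Render a dialogue act as act(slot1=value1, slot2=value2)."""
    args = ", ".join(f"{slot}={value}" for slot, value in slots.items())
    return f"{act_type}({args})"

da = da_to_string("inform", {"name": "seven_days", "food": "chinese"})
# da == "inform(name=seven_days, food=chinese)"
```

The interesting direction is the inverse one: mapping this structured input to one of its many valid surface forms, which is the task the rest of this section surveys.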
Since there is a one-to-many mapping between the input and its outputs, building a good NLG system is nontrivial; it typically means finding a sweet spot that strikes a balance between language fluency and variation. Due to this tradeoff, supervised learning for NLG is not always an obvious choice, while reinforcement learning has been shown to have the potential to tackle this problem (Rieser and Lemon, 2011). The remainder of this section reviews three existing NLG methods across different application domains: one rule-based approach and two statistical methods that incorporate machine learning to different degrees. The HALOGEN system (Langkilde and Knight, 1998) can be viewed as the first statistical NLG approach, where a language model was employed to re-rank the output of a rule-based generator. The SPaRKy framework (Stent et al., 2004; Walker et al., 2002) suggests a systematic approach by dividing the problem into a pipeline and applying machine learning techniques in its submodules. This background establishes the foundation for the introduction of a family of more data-driven NLG approaches, which we call corpus-based methods, in Chapter 3.

2.5.1 Template-based Language Generator

By definition, template-based NLG systems are natural language generating systems that map non-linguistic semantic input to the linguistic surface structure without intermediate representations (Reiter and Dale, 2000; Van Deemter et al., 2005). Adopting an example from Reiter and Dale (2000), a simple template-based system might associate its semantic input form

Departure(train = 306, location = Aberdeen, time = 10:00)    (2.3)

directly with a template such as

[train] is leaving [town] now,    (2.4)

and then fill in the corresponding values of each individual slot via a database search:

The train is leaving Aberdeen now.
(2.5)

Note that this template will only be used when the time referred to is close to the intended time of speaking; other templates must be used to generate departure expressions that relate to the past or future. In practice, a hierarchical top-down approach is adopted in template-based NLG systems: the final surface form is generated recursively and composed from several individual sub-rules. Example rules of the template-based generator used in our spoken dialogue systems are shown in Appendix C. Although building a template-based NLG system is relatively straightforward, maintaining and updating its rules becomes more and more challenging (Reiter and Dale, 1997) as the target application becomes more complex. Furthermore, most template-based NLG systems do not incorporate linguistic insights (Busemann and Horacek, 1998) or statistical variation; they tend to produce rather rigid and stylised responses without the natural variation found in human language.

2.5.2 Linguistically Motivated Approaches

Unlike template-based approaches, which directly associate a template with an input semantic representation, most mainstream, linguistically-motivated NLG methods rely on an indirect mapping: the input semantics are mapped to an intermediate representation and then to the surface form. For example, this type of NLG system will not directly map Equation 2.3 to Equation 2.4; instead, Equation 2.3 will map to

Leave_present(train = train_demonstrative, location = Aberdeen, time = now),    (2.6)

where lexical items and the style of reference have been determined while linguistic morphology is still absent. Although the details may vary, linguistically-motivated NLG systems can start from the same semantic representation. The system then undergoes several consecutive transformations, in which various NLG submodules operate, until a final surface form is produced.
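The direct mapping of Section 2.5.1 can be sketched as simple placeholder substitution. The lookup that turns the semantic value (train id 306) into a referring expression such as "The train" is the database-search step, glossed over here as a pre-filled dictionary:

```python
import re

def fill_template(template, slots):
    """Replace each [slot] placeholder with its value -- the direct
    semantics-to-surface mapping of a template-based generator."""
    return re.sub(r"\[(\w+)\]", lambda m: str(slots[m.group(1)]), template)

# Values as they would come back from the database-lookup step.
slots = {"train": "The train", "town": "Aberdeen"}
sentence = fill_template("[train] is leaving [town] now", slots)
# sentence == "The train is leaving Aberdeen now"
```

The brittleness discussed above is visible even in this toy: every tense, aggregation, or stylistic variant needs its own template string.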
The SPaRKy Generation Framework The SPaRKy (Sentence Planning with Rhetorical Knowledge) framework (Stent et al., 2004) is an extension of SPoT (Sentence Planning Trainable) (Walker et al., 2002) with the addition of Rhetorical Structure Theory (RST) (Mann and Thompson, 1988). In this framework, the NLG problem is divided into three modules connected in a pipeline: sentence plan generation, sentence plan re-ranking, and surface realisation. In SPaRKy, the sentence planner receives the content plan selected by the dialogue manager (also referred to as the content planner) and applies a sentence plan generator to produce one or more sentence plans. Each sentence plan, which is represented by a set of text plan trees (tp-trees) (Stent et al., 2004), consists of a set of DAs to be communicated and the rhetorical relations that hold between them. A sentence plan re-ranker then takes the sentence plans and selects the most promising one as the input to the surface realiser. The surface realiser converts the chosen sentence plan into a natural language surface form by replacing each of the leaf nodes with a word from the lexicon.

Sentence Plan Generation The basis of the sentence plan generator is a set of clause-combining operations that operate on tp-trees and incrementally transform elementary representations called DSyntS (Melčuk, 1988) that are associated with the DAs on the leaves of the tree. The output of the sentence plan generation component contains two structures: (1) a tp-tree, which is a binary tree with leaves labelled by the assertions from the input tp-tree and interior nodes labelled with clause-combining operations; and (2) one or more DSyntS trees, which reflect the parallel operations on the predicate-argument representations. During the generation phase, the sentence plan generator samples several valid sentence plans based on the input tp-tree.
Most of the sentence plan generators in the literature are purely handcrafted, and their scalability is therefore an issue. Although Stent and Molina (2009) proposed learning sentence planning rules directly from a corpus of utterances labelled with Rhetorical Structure Theory (RST) discourse relations (Mann and Thompson, 1988), the required corpus labelling is expensive, and additional handcrafting is still needed to map the sentence plan to a valid syntactic form.

Sentence Plan Re-ranking In the original SPaRKy paper (Stent et al., 2004), the sentence plan re-ranker is the core component that employs machine learning. Based on the set of tp-trees produced by the generator, the sentence plan re-ranker learns its ranking rules from a labelled set of sentence plan training examples using the RankBoost algorithm (Schapire, 1999). Each input tp-tree is pre-processed and subsequently represented by a feature vector, extracted by handcrafting a variety of templates and rules based on the tree structure. In addition, each input tp-tree is fed into the RealPro surface realiser (Lavoie and Rambow, 1997) to generate a realisation. This realisation is then presented to an operator, who assigns it a label from 1 to 5 as an indicator of the realisation's quality. The objective is then to minimise the ranking loss

F(x) = \sum_i \alpha_i f_i(x)    (2.7)

L = \sum_{(x,y) \in D} \exp(-(F(x) - F(y))),    (2.8)

where x and y are tp-trees, x is preferred over y, f_i(·) are the functions that generate the feature vector, α_i are the model parameters, and D is the entire dataset.

Surface Realisation The surface realiser then takes the selected sentence plan and converts it into natural language.
Typical surface realisers involve at least three kinds of processing: the first is syntactic realisation, where grammar rules are used to choose inflections, add function words, or decide the order of the components; the second is morphological realisation, which computes inflected forms; and the third is orthographic realisation, which deals with casing, punctuation, and formatting. These are very basic steps, but most realisers are capable of considerably more complex processing (Espinosa et al., 2008; Lavoie and Rambow, 1997; White et al., 2007). Recently, research on learning statistical surface realisation rules from data has also been proposed (Cuayahuitl et al., 2014; Dethlefs et al., 2013).

The HALOGEN system The HALOGEN system, developed by Langkilde and Knight (1998), is another example of a linguistically-motivated NLG approach. Like the SPaRKy framework, HALOGEN also employs over-generation and re-ranking strategies. In the over-generation phase, a symbolic generator produces several candidate realisations in a word-lattice form based on a set of grammar rules. These candidate sentences are then re-ranked by an n-gram LM trained on a collection of news articles. This method can be considered the first step in combining a symbolic generator with a statistical re-ranker. It has been successfully applied to a specific dialogue system domain (Chambers et al., 2004). However, like all the other linguistically motivated generators, the major drawback of this method is that the performance still relies heavily on the symbolic generator, which is rule-intensive; machine learning is only applied to a part of the system (the re-ranker) that is relatively indirect to the actual performance.

2.6 Conclusion

In summary, although dialogue system research covers a wide range of topics and has been studied extensively across several communities, the learnability of NLG is relatively under-explored.
Previous popular approaches are either template-based or grammar-based methods that employ a set of rules to transform the original meaning representations given by the dialogue manager into the corresponding linguistic surface forms. Although grammar-based approaches are a good fit for modelling complex linguistic phenomena such as aggregation and discourse relations, few of these sophisticated responses are required in a real-world dialogue system deployment. In contrast, what we need is a rather lean approach that models just enough characteristics of human language but is readily scalable to other domains and languages when provided with sufficient training data. This thesis focuses on studying NLG approaches that integrate statistical and machine learning methods to improve output quality and system scalability.

Chapter 3 Corpus-based Natural Language Generation for Dialogue Systems

Although template- and grammar-based approaches have dominated the NLG research community in the past decade, corpus-based methods (Mairesse et al., 2010; Mairesse and Young, 2014; Oh and Rudnicky, 2000) have received more and more attention as data has become increasingly available. Here we refer to this line of NLG approaches as corpus-based methods because they usually involve learning to generate or select a response from a corpus of sentences. By defining a flexible learning structure, corpus-based methods aim to learn generation decisions directly from data. Compared to grammar-based systems such as SPaRKy and HALOGEN, which assume hierarchical tree-like generation processes, most corpus-based approaches embrace a much flatter model design, such as a language-model-based sequential generation model. Learning directly from data enables the system to mimic human responses more naturally, removes the dependency on predefined rules, and makes the system easier to build and extend to other domains.
However, these approaches suffer from the inherent computational cost incurred in the over-generation and re-ranking phases (Oh and Rudnicky, 2000). This computational cost increases further when semantic errors in the rendered output cannot be tolerated. The approaches also have difficulty collecting sufficient training data with fine-grained semantic alignments (Dušek and Jurcicek, 2015).

3.1 Class-based Language Model

Proposed by Oh and Rudnicky (2000), the class-based language model is the first purely statistical generation model used in dialogue systems. Unlike previous methods, class-based models use an n-gram language model as the generator. It produces sentences word by word based on the language model probability

p(w_j | w_{j-1}, ..., w_0, c) \approx p(w_j | w_{j-1}, ..., w_{j-n}, c), (3.1)

where w_j is a word, c is the class, and n is the number of history steps retained by the Markov assumption. Therefore, the probability of a sentence W can be written as a joint probability distribution over all the words that compose the sentence

p(W | c) = \prod_j p(w_j | w_{j-1}, ..., w_{j-n}, c). (3.2)

Consequently, if we sample words one by one from Equation 3.1 until an end-of-sentence token is generated or a predefined length is reached, we can produce a sentence from the model. To control the meaning of the generated sentences, researchers partitioned the training corpus into several small clusters based on the dialogue act (DA) of each individual sentence. The partition criteria are the DA type and a few slots¹. The sentences in the same class are then used to train an n-gram LM for that class. During the generation phase, the class LM corresponding to the input DA is selected to over-generate a set of candidate outputs, which are then rescored by a rule-based re-ranker.
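As an illustration, the class-conditioned sampling of Equations 3.1–3.2 can be sketched as follows. The toy delexicalised corpus, DA class name, and bigram order are illustrative assumptions, not the original system's data:

```python
import random
from collections import defaultdict

def train_class_lm(corpus_by_class, n=2):
    """Count n-gram statistics separately for each DA class."""
    models = {}
    for cls, sentences in corpus_by_class.items():
        counts = defaultdict(lambda: defaultdict(int))
        for sent in sentences:
            tokens = ["<s>"] * (n - 1) + sent.split() + ["</s>"]
            for i in range(n - 1, len(tokens)):
                context = tuple(tokens[i - n + 1:i])
                counts[context][tokens[i]] += 1
        models[cls] = counts
    return models

def sample(models, cls, n=2, max_len=20):
    """Sample one sentence word by word, as in Equation 3.1."""
    counts = models[cls]
    context = tuple(["<s>"] * (n - 1))
    words = []
    while len(words) < max_len:
        dist = counts[context]
        if not dist:                     # unseen context: stop early
            break
        total = sum(dist.values())
        r, acc = random.random() * total, 0.0
        for w, c in dist.items():
            acc += c
            if r <= acc:
                break
        if w == "</s>":
            break
        words.append(w)
        context = context[1:] + (w,)
    return " ".join(words)

# Over-generation: sample many candidates for the input DA's class.
# Note that n-gram recombination can produce novel (possibly semantically
# wrong) sentences, which is exactly why a re-ranker is needed.
corpus = {"inform(food)": ["SLOT_NAME serves SLOT_FOOD food",
                           "SLOT_NAME is a SLOT_FOOD restaurant"]}
lms = train_class_lm(corpus)
candidates = [sample(lms, "inform(food)") for _ in range(5)]
```

The candidates are then passed to the rule-based re-ranker for rescoring.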
Various penalties are assigned to the candidate sentences in the rescoring phase if the sentence

• is either too short or too long (determined by class-dependent thresholds),
• contains repetitions of any of the slots,
• contains slots whose values are null in the DA,
• does not contain all of the required slots.

Although class-based LM generation provides a way to learn language generation directly from data, there are several drawbacks to this approach: firstly, the sentence class partitions are crude, which impedes model generalisation (e.g., to unseen combinations of DA types and slot-value pairs); secondly, the Markovian assumption of the n-gram LM prevents the model from learning long-range dependencies; finally, the generation is extremely inefficient (Mairesse and Young, 2014; Wen et al., 2015a): thousands of candidate sentences need to be produced in the over-generation phase in order to achieve a reasonable performance. Although the generation process is straightforward and easy to implement, it is difficult to apply in practice due to these defects.

¹In the original paper, Oh and Rudnicky (2000) partitioned the corpus based on only one slot. However, in follow-up work, Mairesse and Young (2014) and Wen et al. (2015a) both show that partitioning based on more slots yields better results.

Fig. 3.1 An example of the sentence for the "inform" DA and its corresponding stack representation and tree equivalent used in Mairesse and Young (2014). Figure borrowed from Mairesse and Young (2014).

3.2 Phrase-based Factored Language Model

Motivated by both linguistic theories and statistical sequential generation models, BAGEL, proposed by Mairesse and Young (2014), is a phrase-based generator built on the Dynamic Bayesian Network (DBN) framework. As shown in Figure 3.1, BAGEL uses a set of stacks to approximate and represent the tree-structure-like nature of human language.
Each stack is either mandatory or functional. A mandatory stack creates phrase-based content using the semantic information provided by the DA input, while a functional stack creates the functional phrases that turn the semantics into a fluent sentence. By modelling semantics and phrases as a sequence of stacks, NLG is cast as a two-sequence decoding problem, one on the semantic level and the other on the phrase level. Given the DA input, represented by a set of unordered mandatory stacks S_m with |S_m| \le J, the goal is to find a sequence of phrases R = (r_1, r_2, ..., r_J) such that the probability of the realisation given the DA, P(R | S_m), is maximised,

R^* = \arg\max_R P(R | S_m). (3.3)

We can then rewrite the objective function by introducing two additional variables: (1) \bar{S}_m, an ordered mandatory stack sequence, and (2) S, a full stack sequence created from the functional stack set:

Fig. 3.2 An overview of the phrase-based DBN model for NLG. The lower level is a semantic DBN consisting of both mandatory (blue) and functional stacks (white), while the upper level is a phrase-based DBN that oversees mapping semantic stacks to phrases.

P(R | S_m) = \sum_{\bar{S}_m} \sum_S P(R, S, \bar{S}_m | S_m) (3.4)
= \sum_{\bar{S}_m} \sum_S P(R | S, \bar{S}_m, S_m) P(S | \bar{S}_m, S_m) P(\bar{S}_m | S_m) (3.5)
= \sum_{\bar{S}_m} P(\bar{S}_m | S_m) \sum_S P(R | S) P(S | \bar{S}_m), (3.6)

where we assume the conditional independences R \perp \{\bar{S}_m, S_m\} \mid S and S \perp S_m \mid \bar{S}_m. However, inference in such a model would require a summation over all possible underlying stack sequences and all possible realisations; this model is therefore not practical for real-time applications such as dialogue systems. The authors instead adopted a greedy approach, breaking the inference down into the following three steps:

1. A content ordering model that chooses the optimal stack order \bar{S}_m^* = \arg\max_{\bar{S}_m} P(\bar{S}_m | S_m),
2. A content planning model that inserts functional stacks in between the mandatory stacks and outputs the full stack sequence S^* = \arg\max_S P(S | \bar{S}_m^*),
3.
A surface realising model that converts each stack into a phrase, R^* = \arg\max_R P(R | S).

The first two models can be viewed as a DBN for modelling semantics, while the last one is another DBN for modelling phrases. The process is illustrated in Figure 3.2. Although the phrase-based DBN model is promising, and the authors showed that it can achieve better performance than the class-based n-gram LM approach, the main limitation of the method is the requirement for additional semantic alignment labels between the semantic components and the phrases. Mairesse et al. (2010) proposed mitigating this additional labelling load by introducing active learning into the framework. The labelling is nevertheless difficult to collect since the annotations are not intuitive; thus, the labellers still need to be trained beforehand.

3.3 Example-based Approaches

Another popular corpus-based method for NLG is the example-based approach, in which the generation process is cast as a template extraction and ranking problem. For example, Angeli et al. (2010) trained a set of log-linear models to make a series of generation decisions that choose the most suitable template from a corpus of sentence examples. Kondadadi et al. (2013) later showed that the output could be further improved by an SVM re-ranker, making the output comparable to human-authored text. The k-Nearest Neighbour (kNN) approach is an example-based method for NLG. The core idea is very similar to the template extraction method proposed by Angeli et al. (2010). During training, the DA is pre-processed and represented by a one-hot vector, as described in Section 5.1. The sentence is delexicalised and mapped into a template-like structure. In the testing phase, the testing DA d_i is processed similarly and used to compute the cosine similarity against each of the examples d_j in the training set

sim(d_i, d_j) = \frac{d_i \cdot d_j}{|d_i| |d_j|}.
(3.7)

This similarity score is then used to select the best-matching template T^* for realising the sentence

T^* = F(\arg\max_{d_j} sim(d_i, d_j)), (3.8)

where F(\cdot) is a one-to-one mapping from the DA to the template. In cases where multiple templates match the same DA, we randomly sample one as the target. Then, a post-processing lexicalisation step is performed by inserting the appropriate slot values from the input DA into the template placeholders.

3.4 Attention-based Sequence-to-Sequence Approaches

Fig. 3.3 Examples adopted from Dušek and Jurcicek (2016). Trees encoded as sequences used for the seq2seq generator (top) and the reranker (bottom).

Fig. 3.4 Figure adopted from Dušek and Jurcicek (2016). The proposed attention-based seq2seq generator architecture.

Recently, deep learning-based approaches have become popular in the NLG community. Two of the works most similar to parts of this thesis are Mei et al. (2015) and Dušek and Jurcicek (2016), both of which are based on a sequence-to-sequence (or encoder-decoder) architecture combined with an attention mechanism. Since Dušek and Jurcicek (2016)'s work is more relevant to this thesis (it also addresses a dialogue NLG problem), this section focuses on introducing their work. The idea of the work is to check whether a deep syntax tree (Dušek and Jurcicek, 2016) encoding of the output sentences can help to improve the generation performance. The proposed model architecture employs the over-generation and reranking idea and includes a generator, as shown in Figure 3.4, and a reranker, shown in Figure 3.5. The job of the generator is to take the input DA, encoded as a sequence of semantic components, and generate a sequence of tokens that represents a sentence in its linearised deep syntax tree representation, as shown in Figure 3.3.
After a set of candidate sentences is over-generated by the generator, a reranker (Figure 3.5) is used to rescore the generated sentences by mapping them back to a DA and adding a penalty if the resulting DA is not equal to the original one. Although encoding the output sentence using a linguistically-motivated structure like the deep syntax tree is a good idea, their results showed that this additional linguistic structure only helps when a reranker is not used. This suggests that although the deep syntax tree can inject additional heuristics into the model, the model is better off simply including an additional machine-learned reranker. Moreover, the BAGEL dataset (Mairesse and Young, 2014) used in the experiment was quite small (around 2k sentences), so the improvement from the added heuristic rules may simply stem from data sparsity.

Fig. 3.5 Figure adopted from Dušek and Jurcicek (2016). The proposed reranker architecture.

In Section 5.3, an attention-based encoder-decoder is introduced and compared to the other methods proposed in this thesis. The implementation of that model largely adopts the lessons learned from both Mei et al. (2015) and Dušek and Jurcicek (2016).

3.5 Evaluation Metrics and Difficulties

Evaluating NLG systems is almost as hard as building them. The major difficulty comes from the fact that there is no single metric that can properly tell whether a generated sentence is a good one. As noted in Stent et al. (2005), a good generator usually depends on the following four factors:

1. Adequacy: to be adequate, a sentence needs to express the exact meaning of the input unambiguously.
2. Fluency: a sentence is only fluent if it is grammatically correct and idiomatic.
3. Readability: a sentence is readable if it is both adequate and fluent in a context.
4. Variability: a generator needs to be able to produce multiple sentences that fulfil the adequacy, fluency, and readability constraints.
The fourth factor is the most difficult for non-statistical approaches because it literally means handcrafting more templates or rules for each input DA.

3.5.1 BLEU score

Corpus BLEU One of the popular metrics used in NLG evaluations is the BLEU score (Papineni et al., 2002). The BLEU score first became popular in the Machine Translation (MT) community, where it is used to compute the similarity between a candidate and a set of reference sentences,

BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right), (3.9)

where n is the n-gram length up to N, w_n are positive weights summing to 1, and the exponential term is the weighted geometric average of the modified n-gram precisions p_n,

p_n = \frac{\sum_{C \in \{Candidates\}} \sum_{ngram \in C} Count_{clip}(ngram)}{\sum_{C' \in \{Candidates\}} \sum_{ngram' \in C'} Count(ngram')}, (3.10)

where C is a candidate sentence, Count(\cdot) is the normal counting function, and Count_{clip}(\cdot) = \min(Count(\cdot), Max\_Ref\_Count) is the truncated counting function. This set of equations says that each n-gram's count is truncated, if necessary, so as not to exceed the largest count observed in any single reference for that n-gram. This term basically serves as a rough measure of the n-gram similarity between the candidate and the reference set. Note that we sum over the candidate set \{Candidates\}; BLEU is therefore a corpus-level metric rather than a sentence-level one. Because precision-based metrics favour short sentences, we need an additional term in Equation 3.9, called the brevity penalty (BP), to penalise overly short sentences,

BP = \begin{cases} 1 & c > r \\ \exp(1 - r/c) & c \le r \end{cases} (3.11)

where r is the length of the best-matching reference (in terms of length) and c is the length of the candidate. This penalty therefore only applies if the generated sentence is shorter than the matched reference. In Papineni et al. (2002), the authors showed that the BLEU score correlates very well with human-perceived quality on MT tasks.
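A minimal corpus-level BLEU sketch following Equations 3.9–3.11, assuming uniform weights w_n = 1/N and a single reference per candidate; a production implementation such as sacreBLEU handles many more edge cases:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(candidates, references, max_n=4):
    log_p = 0.0
    for n in range(1, max_n + 1):
        clipped, total = 0, 0
        for cand, ref in zip(candidates, references):
            c_counts, r_counts = ngrams(cand, n), ngrams(ref, n)
            # clip each candidate n-gram count at its reference count (Eq. 3.10)
            clipped += sum(min(c, r_counts[g]) for g, c in c_counts.items())
            total += sum(c_counts.values())
        if clipped == 0:          # no match at this order; BLEU is 0 unsmoothed
            return 0.0
        log_p += (1.0 / max_n) * math.log(clipped / total)
    # corpus-level brevity penalty (Eq. 3.11)
    c_len = sum(len(c) for c in candidates)
    r_len = sum(len(r) for r in references)
    bp = 1.0 if c_len > r_len else math.exp(1 - r_len / c_len)
    return bp * math.exp(log_p)

cand = ["the restaurant serves cheap chinese food".split()]
ref = ["the restaurant serves cheap chinese food".split()]
print(corpus_bleu(cand, ref))  # exact match -> 1.0
```

Note that the clipped counts and the brevity penalty are aggregated over the whole corpus before combining, which is what makes BLEU a corpus-level rather than sentence-level metric.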
However, studies applying it to NLG (Belz, 2006; Novikova et al., 2017; Stent et al., 2005) were not very positive; in fact, these studies showed that the BLEU score, and other automatic metrics such as ROUGE (Lin and Hovy, 2003) or NIST (Doddington, 2002), either correlate negatively with human perceptions or correlate only slightly. Thankfully, this can be mitigated by having a large collection of references for each candidate (Belz, 2006). These automatic metrics can therefore only serve as development-time indicators; real performance can only be properly assessed by human studies.

Sentence BLEU BLEU is usually computed at the corpus level to assess the quality of a generation system. When computing it at the sentence level, higher-order n-grams are usually harder to match, which results in very poor scores. To mitigate this and obtain a good approximation of the BLEU score at the sentence level, He and Deng (2012) proposed a smoothed version of the BLEU score, the sentence BLEU, which modifies the original n-gram precisions p_n as

p_n = \frac{\sum_{C \in \{Candidates\}} \sum_{ngram \in C} Count_{clip}(ngram) + \eta \cdot p_n^0}{\sum_{C' \in \{Candidates\}} \sum_{ngram' \in C'} Count(ngram') + \eta}, (3.12)

where \eta is a smoothing factor usually set to 5, and p_n^0 is set by p_n^0 = p_{n-1} \cdot p_{n-1} / p_{n-2} for n \ge 3; p_1 and p_2 are estimated empirically. The smoothing term in Equation 3.12 allows the n-gram matching to fall back to lower-order n-grams when higher-order ones are not matched. Moreover, the brevity penalty is not clipped,

BP = \exp(1 - r/c). (3.13)

Sentence BLEU, both in the previous literature (Auli and Gao, 2014; He and Deng, 2012) and in this thesis, is used only as an alternative objective for training models rather than as an evaluation metric. During evaluation, the corpus BLEU is used.

3.5.2 Slot Error Rate

Adequacy can sometimes be captured very well if the application domain is simple. A spoken dialogue system where most of the responses are short and compact is an example.
In this case, the slot error rate (Wen et al., 2015a,c), an evaluation metric based on exact string matching between the candidate surface form and a handcrafted semantic dictionary, can be used to calculate a coarse estimate of the generator's adequacy. This metric was also used in Oh and Rudnicky (2000) for reranking sentences. The slot error rate ERR over the entire corpus is defined as

ERR = \frac{\sum_i |s_{i,missing}| + \sum_i |s_{i,redundant}|}{\sum_i |s_i|}, (3.14)

where |s_i| is the number of slots in the i-th example, and |s_{i,missing}| and |s_{i,redundant}| are the counts of the missing and redundant slots, respectively. Note that the counting of |s_{i,missing}| and |s_{i,redundant}| depends on exact string matching between a semantic dictionary and the output sentence. The slot error rate can therefore only be treated as an approximate metric for assessing a generation system's adequacy when the domain is simple; when the domain becomes bigger and the generated sentences become more complex, its usefulness for assessing adequacy may become obscure. This thesis focuses on assessing NLG in domain-specific dialogue system scenarios, where the slot error rate can still give a good approximation of the generation quality.

3.5.3 Language Variability

Evaluating the language variability of a generation system is the most difficult part of NLG evaluation. Although quite a few works have been proposed in the past to address this problem (Cao and Clark, 2017; Li et al., 2016a), none of the existing metrics correlates well enough with human perceptions. Instead of evaluating with existing metrics, this thesis proposes to assess the quality of an NLG system by evaluating its top-5 generated sentences. If all of the top-5 sentences score highly on the other metrics (BLEU and slot error rate), the system can be considered a good system that can generate unique and high-quality responses.
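The exact-string-matching computation of Equation 3.14 can be sketched as follows. The semantic dictionary entries and the example DAs are illustrative assumptions, not the dictionaries used in this thesis:

```python
def slot_error_rate(examples, sem_dict):
    """ERR = (missing + redundant) / total slots (Equation 3.14).

    examples: list of (da_slots, sentence) pairs, where da_slots maps
    slot name -> value; sem_dict maps slot -> value -> surface forms.
    """
    missing, redundant, total = 0, 0, 0
    for da_slots, sentence in examples:
        total += len(da_slots)
        for slot, value in da_slots.items():
            # a slot is missing if none of its surface forms is realised
            if not any(s in sentence for s in sem_dict[slot][value]):
                missing += 1
        for slot, values in sem_dict.items():
            if slot in da_slots:
                continue
            # a slot is redundant if it is realised but absent from the DA
            if any(s in sentence for forms in values.values() for s in forms):
                redundant += 1
    return (missing + redundant) / total

sem_dict = {"food": {"chinese": ["chinese"]},
            "pricerange": {"cheap": ["cheap", "inexpensive"]}}
examples = [({"food": "chinese"}, "it serves cheap chinese food")]
print(slot_error_rate(examples, sem_dict))  # 1 redundant slot / 1 slot -> 1.0
```

In the example, "cheap" is realised even though the DA contains no price-range slot, so the single redundant slot dominates the score; this is exactly the kind of semantic error the over-generation re-rankers above penalise.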
3.6 Conclusions

This chapter has presented corpus-based methods, which aim at learning generation decisions directly from data. Three approaches of this kind have been introduced: the class-based n-gram language generator, the phrase-based DBN model, and the example-based kNN approach. Different approaches may require different levels of annotation, and complex annotation schemes may limit the scalability of the model. Since we are focusing on building dialogue systems that can easily scale to bigger domains once the data is provided, it is more attractive to focus on corpus-based NLG approaches because they employ few rules and minimal handcrafted components. Although this may mean that we lose part of the generalisation capability brought by linguistic insights, training on large corpora can potentially grant the model a generalisation capability that is less arbitrary. In the long run, this could be the more general way to go. In the following chapters, we focus on corpus-based NLG methods and apply a Recurrent Neural Network (RNN) LM (Mikolov et al., 2010) as the generator to replace the class-based n-gram model (Oh and Rudnicky, 2000). Unlike Mairesse and Young (2014), we focus on models that can learn to generate from a paired DA-sentence corpus without the need for additional semantic alignment annotations.

Chapter 4 Neural Networks and Deep Learning

4.1 Neural Networks

Artificial Neural Networks (ANNs) were inspired by and developed as mathematical models that mimic brain function (McCulloch and Pitts, 1988; Rumelhart et al., 1988). Although it is now clear that ANN processes bear little resemblance to neurological behaviour, they remain popular pattern classifiers. An ANN is composed of many simple processing units, called neurons, joined together by weighted connections. When a neuron is activated by an input signal, the activation spreads throughout the entire network along the weighted connections.
The electrical activity of biological neurons typically follows a series of sharp spikes, and the activation of an ANN node was originally intended to model the average firing rate of these spikes. One very basic type of ANN is the Feed-forward Neural Network (FNN), as shown in Figure 4.1. Each neuron is represented by a circle and each synapse by an arrow connecting two neurons. Notice that the direction of the connection is one-way, from input to output. The input pattern x is introduced to the input layer, then propagated through the hidden layer h, and finally to the output layer y,

h = \sigma(W_x x), (4.1)
y = \phi(W_h h), (4.2)

where W_x and W_h are weight matrices, and \sigma(\cdot) and \phi(\cdot) are the hidden and output layer activation functions. To achieve greater expressive power and to mimic the spike-shaped activation in the biological brain, \sigma(\cdot) is usually chosen to be a sigmoid function

g(x) = \frac{1}{1 + e^{-x}}, (4.3)

Fig. 4.1 A feed-forward Artificial Neural Network

or a hyperbolic tangent

\tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1}. (4.4)

The type of output layer activation depends on the task at hand. The most common approach for classification problems, such as language modelling or sentiment classification, is the soft-max function

\phi(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}, (4.5)

where i is the index of the output neuron. The soft-max activation effectively transforms the network output into a probability distribution over the labels. The process of computing the output given an input is known as the forward pass of the network. Training an NN requires us to re-estimate the parameters (weight matrices) of the network to minimise the network loss. Although many possibilities exist for defining the loss function of a classification network, the most widely adopted today is the cross entropy error between the true target value t and the prediction y,

L(\theta) = -t^\top \log(y).
(4.6)

Because of the complex dependency relations and the nonlinearity inside the network, finding an exact optimal solution is intractable. As a result, NNs are usually optimised using gradient descent with the error backpropagation method (Rumelhart et al., 1986; Werbos, 1988). The basic idea of gradient descent is to find the derivative of the loss function with respect to each of the network weights \theta, then adjust the weights in the direction of the negative slope,

\theta^{(t+1)} = \theta^{(t)} - \alpha^{(t)} \frac{\partial L(\theta)}{\partial \theta^{(t)}}, (4.7)

where \alpha^{(t)} is the learning rate, which typically decreases with training iteration t. If the loss function is computed with respect to the entire training set, the weight update procedure is referred to as batch learning. This contrasts with online learning, where weight updates are performed with respect to individual training examples only. In many studies (Lecun et al., 1998b), online learning tends to be more efficient than batch learning, especially on large datasets with significant redundancy. In addition, the stochastic nature of online learning can help to escape from local minima (Lecun et al., 1998b). As a result, Stochastic Gradient Descent (SGD) is used in all the experiments in this thesis. Backpropagation, on the other hand, is simply a repeated application of the chain rule for partial derivatives. The first step is to calculate the derivatives of the loss function with respect to the output units. Given the cross entropy loss L(\theta), its derivative with respect to an output unit a_o before the soft-max is

\frac{\partial L}{\partial a_o} = -\sum_k t_k \frac{\partial \log y_k}{\partial a_o} = -\sum_k t_k \frac{1}{y_k} \frac{\partial y_k}{\partial a_o}. (4.8)

Since y is the output of a soft-max applied to a, we know

\frac{\partial y_k}{\partial a_o} = \begin{cases} y_o(1 - y_o), & o = k \\ -y_o y_k, & o \ne k \end{cases} (4.9)

Substituting Equation 4.9 into Equation 4.8 and reorganising, we arrive at

\frac{\partial L}{\partial a_o} = y_o - t_o. (4.10)

Now we continue to apply the chain rule, working backwards to compute the error derivatives of the hidden layers.
At this point, it is helpful to introduce the following notation

\delta_j := \frac{\partial L}{\partial a_j}, (4.11)

where j is any unit in the network. For the units in the hidden layer, we have

\delta_h = \frac{\partial L}{\partial v_h} \frac{\partial v_h}{\partial a_h} = \frac{\partial v_h}{\partial a_h} \sum_{o=1}^{O} \frac{\partial L}{\partial a_o} \frac{\partial a_o}{\partial v_h} = \sigma'(a_h) \sum_{o=1}^{O} \delta_o \theta_{ho}, (4.12)

where O is the size of the output layer, and a_h and v_h are the values of hidden unit h before and after passing through the activation function \sigma. Once the \delta terms for all hidden units are obtained, we can calculate the error derivative with respect to each of the network's weights

\frac{\partial L}{\partial \theta_{ij}} = \frac{\partial L}{\partial a_j} \frac{\partial a_j}{\partial \theta_{ij}} = \delta_j v_i. (4.13)

These gradients are then used to update the weights as described in Equation 4.7. The procedure of propagating gradients backwards from the output nodes to the input nodes is known as the backward pass of the network.

4.1.1 Recurrent Neural Networks

In the previous section, we considered feed-forward neural networks, which have no cycles in their connections. If we allow feedback cycles inside the network, we obtain Recurrent Neural Networks (RNNs). Many variants of RNN have been proposed, such as the Elman network (Elman, 1990), the Jordan network (Jordan, 1990), and the echo state network (Jaeger, 2001). The RNN we refer to in this thesis is the Elman RNN containing a single, self-connected hidden layer, as shown in Figure 4.2. The major difference between an RNN and an FNN is that an FNN can only map an input vector to an output vector, whereas an RNN can, in principle, map the entire history of inputs to each output. This property makes the RNN a natural model structure for sequences. Theoretically, an RNN with enough hidden units is a universal approximator that can learn to map between any two arbitrary-length sequences to a certain accuracy (Hammer, 1998). The key point is that the recurrent connection allows the hidden units to "memorise" previous inputs and thereby influence the output of the network.
The forward pass formula for an RNN is like that of an FNN; the only difference is the hidden layer activation. To take the previous hidden units into account at each time step, Equation 4.1 is changed to

h_t = \sigma(W_x x_t + W_r h_{t-1}), (4.14)

where t denotes the time step and W_r is the newly introduced weight matrix representing the recurrent connection. To perform the backward pass, two well-known algorithms have been proposed to efficiently calculate the derivatives of the RNN parameters: real-time recurrent learning (Robinson and Fallside, 1987) and backpropagation through time (BPTT) (Werbos, 1990). The BPTT algorithm is adopted in this thesis because it is conceptually simpler and computationally more efficient.

Fig. 4.2 Architecture of the Elman recurrent neural network. The recurrence is fully connected between two hidden layers in adjacent time steps.

Given an unrolled view of the RNN, shown in Figure 4.3, the BPTT algorithm, like standard backpropagation, also consists of a repeated application of the chain rule. The difference is that, in an RNN, the hidden layer activation influences the loss both through the output layer at the current time step and through the hidden layer at the next time step. Therefore,

\delta_h^t = \sigma'(a_h^t) \left( \sum_{o=1}^{O} \delta_o^t \theta_{ho} + \sum_{h'=1}^{H} \delta_{h'}^{t+1} \theta_{hh'} \right), (4.15)

where

\delta_j^t := \frac{\partial L}{\partial a_j^t}. (4.16)

The sequence of \delta terms can then be calculated recursively using Equation 4.15, beginning with \delta_j^{T+1} = 0 for all j at the end of the sequence. Finally, since the weights are shared across time steps, we need to sum over the entire sequence to obtain the derivatives with respect to the network parameters,

\frac{\partial L}{\partial \theta_{ij}} = \sum_{t=1}^{T} \frac{\partial L}{\partial a_j^t} \frac{\partial a_j^t}{\partial \theta_{ij}} = \sum_{t=1}^{T} \delta_j^t v_i^t. (4.17)

4.1.2 Long Short-term Memory

As noted in the previous section, an important benefit of RNNs is their ability to use arbitrarily long context information when mapping between input and output sequences. In practice, the range of context that a vanilla RNN can process is quite limited.
The reason is that the influence of a given input on the hidden layer, and therefore on the network output, either decays or grows exponentially as it cycles around the network's recurrent connections. This phenomenon is usually referred to in the literature as the vanishing gradient problem (Bengio et al., 1994; Hochreiter et al., 2001), and is illustrated schematically in Figure 4.4.

Fig. 4.3 An unfolded view of an RNN. Each rectangle represents a layer of hidden units in a single time step. The weighted connections from input to hidden layer are Wx, those from hidden to output are Wh, and the hidden-to-hidden weights are Wr. Note that the same weights are reused at every time step.

Many methods have been proposed to overcome the vanishing gradient problem in RNNs, including simulated annealing and discrete error propagation (Bengio et al., 1994), hierarchical sequence compression (Schmidhuber, 1992), and Gated Recurrent Units (Chung et al., 2014). We apply the Long Short-term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) architecture in this thesis. We choose this popular method for sequence modelling because it has been tested on several similar problems, e.g., speech recognition (Graves et al., 2013a), handwriting recognition (Graves et al., 2009), spoken language understanding (Yao et al., 2014), and MT (Sutskever et al., 2014). Although multiple memory cells per block are allowed, Figure 4.5 provides an illustration of an LSTM block with only one memory cell. The memory block is recurrently connected and can be used to store historical information. Each memory block is associated with three multiplicative units: the input, forget, and output gates. These multiplicative units provide the cell with continuous analogues of write, read, and reset operations.
These multiplicative gates allow LSTM memory cells to store and access information over long periods, thereby mitigating the vanishing gradient problem. For example, as long as the input gate remains closed (activation → 0), the value of the memory block is not overwritten by new inputs and can therefore be made available to the network much later in the sequence by opening the output gate.

Fig. 4.4 The RNN vanishing gradient problem: the shading of the nodes indicates the sensitivity of the network to the input at time 0. Due to the repeated multiplication by the recurrent weights, the influence of that input vanishes or grows over time, and the network eventually forgets the first input.

To compute the forward pass of an LSTM, the gate activations are first calculated. Given the notation defined in Figure 4.5,

i_t = σ(W_{xi} x_t + W_{hi} h_{t−1})   (4.18)
f_t = σ(W_{xf} x_t + W_{hf} h_{t−1})   (4.19)
o_t = σ(W_{xo} x_t + W_{ho} h_{t−1}),   (4.20)

where σ is the sigmoid function. The proposed cell value at the current time step is

ĉ_t = tanh(W_{xc} x_t + W_{hc} h_{t−1}).   (4.21)

Combining this with the input and forget gates obtained earlier, the new cell value is calculated via element-wise multiplication of the gate and cell values,

c_t = f_t ⊙ c_{t−1} + i_t ⊙ ĉ_t.   (4.22)

Finally, the hidden layer representation is computed by multiplying the output gate and the squashed cell value element-wise,

h_t = o_t ⊙ tanh(c_t).   (4.23)

Fig. 4.5 A Long Short-term Memory cell is an RNN architecture with a memory block c_t. i_t, f_t, o_t are the input, forget, and output gates, whose controlling signals are x_t, the input at the current time step, and h_{t−1}, the hidden layer representation at the previous time step. The LSTM cell is meant to replace the simple additive hidden layer of vanilla RNNs.
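The LSTM forward pass of Equations 4.18–4.23 can be sketched as a single step function. This is a minimal illustration with randomly initialised weights standing in for trained parameters; all dimensions are arbitrary.

```python
import numpy as np

# A minimal sketch of a single LSTM forward step (Equations 4.18-4.23).
# All dimensions and weight names are illustrative.
rng = np.random.default_rng(1)
I, H = 3, 4
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# One (input-to-gate, hidden-to-gate) weight pair per gate and for the cell proposal
W = {k: (rng.normal(0, 0.1, (H, I)), rng.normal(0, 0.1, (H, H)))
     for k in ("i", "f", "o", "c")}

def lstm_step(x_t, h_prev, c_prev):
    gate = lambda k, act: act(W[k][0] @ x_t + W[k][1] @ h_prev)
    i_t = gate("i", sigmoid)               # input gate          (4.18)
    f_t = gate("f", sigmoid)               # forget gate         (4.19)
    o_t = gate("o", sigmoid)               # output gate         (4.20)
    c_hat = gate("c", np.tanh)             # proposed cell value (4.21)
    c_t = f_t * c_prev + i_t * c_hat       # element-wise update (4.22)
    h_t = o_t * np.tanh(c_t)               # hidden output       (4.23)
    return h_t, c_t

h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, I)):          # run a short sequence
    h, c = lstm_step(x, h, c)
```

The additive form of Equation 4.22 is what lets gradients flow through c_t without the repeated squashing that causes vanishing gradients in a vanilla RNN.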
Consequently, once the hidden layer representation is obtained, it can be substituted into the original RNN output layer; all other parts of the forward pass are unchanged. Since an LSTM, like an RNN, is a differentiable function approximator, the gradient-based methods applied to RNNs can also be used to train LSTMs, and BPTT is therefore again adopted for calculating the gradients.

4.1.3 Bidirectional Networks

In this section, we consider using both the forward and backward context of a sequence classification problem because of the benefits this may provide. For example, consider the two sentences:

1. Seven days is a expensive restaurant.
2. Seven days is an expensive restaurant.

To decide whether to choose a or an, we only need to consider the word after the indefinite article rather than the preceding words. An obvious solution would be to add a time-window of future context to the network input. However, the range of context such an input can cover is limited and is usually decided by heuristics. This in turn constrains the expressive power of the network and may also impose an asymmetric bias between the two contexts. Bidirectional RNNs (Baldi et al., 1999; Schuster and Paliwal, 1997a) offer a more elegant solution. Instead of using only the forward context, two separate RNNs encode the input sequence in both directions, and both the forward and backward hidden layers are connected to a single output layer, as shown in Figure 4.6.

Fig. 4.6 The unfolded bidirectional RNN. The backward and forward layers encode information from the two directions. Six distinct sets of weights are included; note that there are no connections between the two hidden layers.

This structure provides the output layer with context from both the past and the future at every point in the input sequence, without displacing the inputs from the relevant targets.
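The bidirectional structure described above can be sketched as two independent recurrent passes whose states are combined at the output. This is an illustrative sketch with random stand-in weights, not the thesis's actual model.

```python
import numpy as np

# A minimal sketch of a bidirectional RNN forward pass: two independent
# recurrent layers read the sequence in opposite directions and both feed
# the output layer. Names and sizes are illustrative.
rng = np.random.default_rng(2)
I, H, O, T = 3, 4, 2, 6
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def run_rnn(xs, Wx, Wr):
    h, hs = np.zeros(H), []
    for x in xs:
        h = sigmoid(Wx @ x + Wr @ h)
        hs.append(h)
    return hs

Wx_f, Wr_f = rng.normal(0, 0.1, (H, I)), rng.normal(0, 0.1, (H, H))    # forward RNN
Wx_b, Wr_b = rng.normal(0, 0.1, (H, I)), rng.normal(0, 0.1, (H, H))    # backward RNN
Wh_f, Wh_b = rng.normal(0, 0.1, (O, H)), rng.normal(0, 0.1, (O, H))    # hidden-to-output

xs = rng.normal(size=(T, I))
h_fwd = run_rnn(xs, Wx_f, Wr_f)              # left-to-right pass
h_bwd = run_rnn(xs[::-1], Wx_b, Wr_b)[::-1]  # right-to-left pass, re-aligned

# The output at each position sees both past (forward) and future (backward) context
outs = [Wh_f @ hf + Wh_b @ hb for hf, hb in zip(h_fwd, h_bwd)]
```

As the text notes, the output at position t can only be computed once both passes have reached t, which is why the two hidden layers are evaluated in full before the output layer.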
Bidirectional networks (Schuster and Paliwal, 1997b) have been shown to be effective for sequential problems such as protein secondary structure prediction (Chen and Chaudhari, 2004), speech recognition (Graves et al., 2013b), and MT (Sundermeyer et al., 2014). The forward pass of the network is the same as for an RNN, except that the input sequence is also presented in the opposite direction, and the output layer can only be updated once both the forward and backward hidden layers have been computed. Similarly, the backward pass proceeds as for an RNN trained with BPTT, except that the output layer gradients are computed first and then fed back to the two RNNs.

4.1.4 Convolutional Neural Networks

The Convolutional Neural Network (CNN) is a biologically inspired variant of NNs. Early research on the visual cortex (Hubel and Wiesel, 1968) showed that it is composed of a complex arrangement of cells that are sensitive to small sub-regions of the visual field (receptive fields). These sub-regions are tiled to cover the entire visual field. The cells act as local filters over the input space and are well-suited to exploit the strong spatially local correlations present in natural images.

Fig. 4.7 Architecture of a CNN model called LeNet-5 (Lecun et al., 1998a). Each plane is a feature map that captures certain patterns of the previous layer. A fully connected FNN is attached on top as the classifier for recognising the input digits.

CNNs combine three architectural concepts to ensure some degree of shift, scale, and distortion invariance: local receptive fields, shared weights, and spatial or temporal sub-sampling, as shown in Figure 4.7. The inputs of the LeNet CNN (Lecun et al., 1998a) are the pixels of the image to be classified, arranged as a matrix according to their location in the image.
A set of local filters (2D weighted connections) is then applied: sliding each filter over the entire input image, a weighted sum of the input pixels in the filter window is computed at each location, forming a set of feature maps. With these local receptive fields, neurons can extract elementary visual features such as oriented edges, end-points, and corners; subsequent layers then combine these features to detect higher-order features. A further benefit of the convolutional approach is that, because the same weights are applied across the entire image, the feature detector can extract the same feature in different parts of the image, making it robust to slight distortions or translations of the input. Typically, a CNN contains multiple feature detectors per layer, so that multiple features can be captured at the same location.

Once a feature has been detected, its exact location becomes less important; only its approximate position relative to other features is relevant. The precise location may in fact be harmful, because positions are likely to vary across different instances of the same label. This observation motivates the introduction of a spatial or temporal subsampling layer between two convolutional layers, as shown in Figure 4.7. Two well-known subsampling techniques are average pooling and maximum pooling. Although maximum pooling is believed to be more discriminative and computationally efficient in most cases, average pooling seems to work better for certain problems (Wen et al., 2015a). After several convolution-pooling operations, a set of feature maps is extracted.
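The convolution and pooling operations described above can be sketched directly. The filter below is an illustrative hand-crafted edge detector, not a learned one; in a real CNN the kernel values would be trained by backpropagation.

```python
import numpy as np

# A minimal sketch of 2D convolution and max pooling on a single-channel
# image; the filter values and sizes are illustrative.
def conv2d(image, kernel):
    kh, kw = kernel.shape
    H = image.shape[0] - kh + 1
    W = image.shape[1] - kw + 1
    out = np.empty((H, W))
    for i in range(H):                      # slide the local filter over the image
        for j in range(W):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    H, W = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:H * size, :W * size].reshape(H, size, W, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)       # toy 6x6 "image"
edge_filter = np.array([[1.0, -1.0], [1.0, -1.0]])     # crude vertical-edge detector
fmap = conv2d(image, edge_filter)   # shared weights: same response at every location
pooled = max_pool(fmap)             # subsampling keeps only approximate positions
```

Because the toy image has a constant horizontal gradient, the shared filter produces the same response everywhere, which illustrates why weight sharing makes the detector translation-tolerant.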
A fully-connected FNN is then attached on top of the final feature layer; it takes the extracted feature maps as input and determines the label of the image. Another crucial advantage of the pooling operation is that it can transform a variable-size image into a fixed-size vector representation that can be fed into the FNN for classification. Since all the weights in a CNN can be learned by standard backpropagation, the CNN effectively synthesises its own feature extractor. This is important because it removes the need to design feature templates, as in the traditional pattern recognition pipeline, a process that is time-consuming and does not scale; instead, the network discovers the important features on its own. CNNs were first studied in the computer vision literature on topics such as object recognition (Lecun et al., 1998a) and handwriting recognition (Ciresan et al., 2011). More recently, they have also been adopted for problems such as speech recognition (Sainath et al., 2013). The convolutional sentence model (Kalchbrenner et al., 2014; Kim, 2014) uses the same methodology but collapses the two-dimensional convolution and pooling process into a single dimension; the resulting model is claimed to represent the state of the art for many NLP-related tasks (Kalchbrenner et al., 2014; Kim, 2014).

4.2 Objective Function

Neural networks are mathematical models that, once trained, can be used to make predictions on unseen input. During training, given a labelled dataset D = {x_i, y_i}, the goal is to tune the set of available model parameters θ so that a parameterised function f_θ(D) of the data is maximised or minimised. This function f_θ(·) is called the objective function of the optimisation procedure. Two objective functions are introduced below: (1) Maximum Likelihood Estimation, which is used throughout the thesis, and (2) Discriminative Training, which is used for domain adaptation in Chapter 6.
4.2.1 Maximum Likelihood

Maximum Likelihood Estimation (MLE) is probably the most common objective function in modern machine learning. As the name implies, the MLE objective is the likelihood function of the dataset, p(D). In the supervised setting, this likelihood is usually interpreted as the conditional probability of each label y_i given its input x_i,

θ_MLE = argmax_θ ∏_i p(y_i|x_i).   (4.24)

Since a product of many numbers less than one approaches zero as the number of factors grows, we instead work in log space. Therefore, instead of optimising Equation 4.24, an equivalent objective function is

θ_MLE = argmax_θ log ∏_i p(y_i|x_i) = argmax_θ ∑_i log p(y_i|x_i).   (4.25)

4.2.2 Discriminative Training

In contrast to the MLE criterion, whose goal is to maximise the log-likelihood of the correct examples, the Discriminative Training (DT) objective aims at separating correct examples from competing incorrect ones. Given a training instance (x_i, y_i), the training process starts by generating a set of model predictions GEN(x_i) for label y_i using the current model parameters θ and input x_i. The discriminative objective function can therefore be written as

θ_DT = argmax_θ ∑_{ŷ_i ∈ GEN(x_i)} p_θ(ŷ_i|x_i) R(ŷ_i, y_i),   (4.26)

where R(ŷ_i, y_i) is a scoring function that evaluates candidate ŷ_i against the ground truth y_i, and p_θ(ŷ_i|x_i) is the rescaled distribution over candidates, typically computed as

p_θ(ŷ_i|x_i) = exp(γ log p(ŷ_i|x_i, θ)) / ∑_{ŷ_j ∈ GEN(x_i)} exp(γ log p(ŷ_j|x_i, θ)).   (4.27)

γ ∈ [0, ∞) is a tuned scaling factor that flattens the distribution for γ < 1 and sharpens it for γ > 1, and the un-normalised candidate likelihood log p(ŷ_i|x_i, θ) is the model prediction. Since the DT objective presented here is differentiable everywhere w.r.t. the model parameters θ, optimisation algorithms like back-propagation can be applied to calculate the gradients and update the parameters directly.
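The two objectives can be contrasted on toy numbers. All values below (the per-example probabilities, candidate log-likelihoods, and rewards) are illustrative stand-ins, not real model outputs.

```python
import numpy as np

# A minimal sketch contrasting the MLE objective (Equation 4.25) with the
# DT objective (Equations 4.26-4.27) on toy candidate scores; all numbers,
# including the candidate log-likelihoods and rewards, are illustrative.

# MLE: sum of log-probabilities the model assigns to the correct labels
p_correct = np.array([0.7, 0.9, 0.5])       # p(y_i | x_i) for three examples
mle_objective = np.sum(np.log(p_correct))

# DT: expected reward under a gamma-rescaled candidate distribution
log_p = np.array([-1.2, -0.5, -2.0])        # log p(y_hat | x, theta) per candidate
reward = np.array([0.2, 1.0, 0.0])          # R(y_hat, y), e.g. BLEU vs. the reference
gamma = 0.5                                 # gamma < 1 flattens the distribution
scaled = np.exp(gamma * log_p)
p_dt = scaled / scaled.sum()                # Equation 4.27
dt_objective = np.sum(p_dt * reward)        # Equation 4.26
```

Note that the DT objective rewards probability mass placed on high-scoring candidates, whereas MLE only ever looks at the single correct label.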
The DT criterion introduced here can make better use of the available data than the MLE criterion, mainly because its objective function allows the model to explore its own limitations by refining its predictions with both positive and negative examples. This enables the model to correct its own mistakes while adjusting towards the correct examples.

4.3 Stochastic Neural Networks

Deterministic neural networks, such as MLPs and RNNs, are popular models that apply well to regression and classification tasks. For classification, a deterministic NN models the conditional distribution p(y|x) of the predictor variable y given the input variable x. This conditional distribution is deterministic in the sense that its shape depends solely on the input variable x, so there is a single best model prediction y* = argmax_y p(y|x) for each input. This is not the case in many real-world problems such as image or dialogue response generation. For example, given a user's input query, there are many responses that are valid and appropriate. A purely deterministic model therefore fails to capture these multiple modes in dialogue responses and only learns to produce the most frequent, generic responses (Cao and Clark, 2017). One way to tackle this multi-modality in large learning systems is to introduce a latent variable and make the network stochastic. The resulting framework is a Stochastic Neural Network: the output is no longer deterministic, since it depends not only on the input but also on the randomness injected through the latent variables. Owing to recent advances in Neural Variational Inference (NVI) (Mnih and Gregor, 2014), the Variational Autoencoder (VAE) framework (Kingma and Welling, 2014; Rezende et al., 2014), a practical training technique for NN-based generative models with latent variables, has become increasingly popular.
We are more interested in its extension, the Conditional VAE (cVAE) (Doersch, 2016), because it allows the output to be conditioned on the input rather than drawn independently from a prior distribution.

4.3.1 Conditional Variational Autoencoder Framework

Given an input x and output y, our goal is to model the conditional distribution of the output y given the input x by introducing a latent variable z,

p(y|x) = ∫_z p_θ(y|z) p_θ(z|x) dz,   (4.28)

where p_θ(y|z) and p_θ(z|x) are both modelled by neural networks. As in the VAE, the cVAE replaces the conditional distribution p_θ(z|x) in Equation 4.28 with an approximate posterior distribution q_φ(z|x,y). Because the latent distribution is parameterised on both x and y, q_φ(z|x,y) is a powerful surrogate for p_θ(z|x) during inference. If the cVAE were trained with a standard cross-entropy objective, the model would learn to encode its inputs deterministically by letting the variances of q_φ(z|x,y) decay to zero (Raiko et al., 2014). Instead, the variational lower bound on the true data log-likelihood is used as an alternative objective, which encourages the model to keep its posterior distribution close to the original conditional distribution p_θ(z|x),

L(θ, φ) = E_{q_φ(z|x,y)}[log p_θ(y|z)] − D_KL(q_φ(z|x,y) || p_θ(z|x))
        ≤ log ∫_z p_θ(y|z) p_θ(z|x) dz = log p_θ(y|x).   (4.29)

Consequently, from the viewpoint of q_φ(z|x,y), p_θ(z|x) serves as a regularisation term encouraging a multimodal learnt posterior; from the viewpoint of p_θ(z|x), q_φ(z|x,y) acts as a teacher network from which it must learn.

4.3.2 Neural Variational Inference

To obtain a tighter variational lower bound, as shown in Equation 4.29, a common practice in the deep learning community is to parameterise and approximate the intractable true posterior using a neural network q_φ(z|x,y) with parameter set φ.
Therefore, Neural Variational Inference (NVI) (Miao et al., 2016) begins by constructing an expressive inference network q_φ(z|x,y) via the following steps:

1. Construct vector representations of the observed variables, x = f_x(x), y = f_y(y). Depending on the nature of the variables, the embedding functions f_x(·) and f_y(·) can be any kind of neural network suitable for the data, such as CNNs or RNNs.

2. Assemble a joint representation o = g(x, y). g(·) is typically modelled by an MLP that takes the concatenation of the two vector representations as input.

3. Parameterise the variational distribution over the latent variable, q_φ(z|x,y) = l(o), where l(·) is another neural network that projects the vector representation onto a distribution.

The projection function l(·) can take different forms depending on the type of latent variable we want to model. Most of the VAE literature employs a continuous latent variable and parameterises the latent distribution as a diagonal Gaussian, represented by its parameterised mean and standard deviation, µ = l_1(o), log σ = l_2(o), where l_1(·) and l_2(·) are linear transformations that output the parameters of the Gaussian. However, if a discrete latent variable is considered, l(·) could be an MLP that maps the vector representation onto a multinomial distribution or a set of Bernoulli distributions. As a result, by sampling from the variational distribution z ∼ q_φ(z|x,y), we can carry out stochastic back-propagation and optimise the variational lower bound.

Fig. 4.8 Gradient estimation in stochastic computation graphs. Figure adapted from "Categorical Re-parameterization with Gumbel-Softmax" by Jang et al. (2016). (a) The gradient ∂f(h)/∂h can be computed directly via back-propagation if h is deterministic and differentiable.
(b) The introduction of a stochastic node z precludes back-propagation because the sampling function z^(n) ∼ p_θ(z|x) does not have a well-defined gradient. (c) The re-parameterisation trick (Kingma and Welling, 2014; Rezende et al., 2014) creates an alternative gradient path that circumvents the sampling function inside the network, allowing the gradient to flow from f(z) to θ. The Gaussian re-parameterisation trick is used here as an example, but other types of distributions can also be handled. (d) The scoring function-based gradient estimator obtains an unbiased estimate of ∇f(z) by back-propagating along a surrogate loss f̂ ∇log p_θ(z|x), where f̂ = f(z) − b(z) and b(z) is a baseline for variance reduction.

As shown in Figure 4.8(b), the introduction of the stochastic latent variable z precludes back-propagation because the sampling function z ∼ p_θ(z|x) does not have a well-defined gradient¹. Therefore, to carry out variational inference in stochastic neural networks, one can use either the re-parameterisation trick shown in Figure 4.8(c) or the scoring function-based method in Figure 4.8(d).

¹In practice, we sample from the variational distribution z ∼ q_φ(z|x,y), which does not have a well-defined gradient either.

Instead of sampling directly inside the network and blocking the gradient flow as in Figure 4.8(b), the re-parameterisation trick (Figure 4.8(c)) (Kingma and Welling, 2014; Rezende et al., 2014) moves the internal stochastic node out of the computation graph and treats it as a random input node (z in Figure 4.8(b) → ε in Figure 4.8(c)). The encoder network therefore does not produce the latent distribution p_θ(z|x) directly; instead, it outputs only the parameters for constructing the target distribution, for example the mean µ and variance σ² of a Gaussian. We can then reconstruct the sample z^(n) via the expression

z^(n) = µ + σ ⊙ ε^(n),   ε^(n) ∼ N(0, I).
(4.30)

In this way, the gradient can be back-propagated directly through the entire network, because the stochastic node has been moved out of the graph. The re-parameterisation trick is commonly combined with a Gaussian random variable to model a continuous latent distribution inside a neural network. Recently, a Gumbel-Softmax re-parameterisation trick (Jang et al., 2016) was also proposed to approximate the categorical distribution of discrete latent-variable models.

The scoring function-based estimator (also referred to as REINFORCE (Williams, 1992) or the likelihood-ratio estimator (Glynn, 1990)) is illustrated in Figure 4.8(d). It computes the gradient based on the following derivation:

∇_θ E_{p_θ(z|x)}[f(z)] = ∫ ∇_θ p_θ(z|x) f(z) dz   (4.31)
 = ∫ p_θ(z|x) (∇_θ p_θ(z|x) / p_θ(z|x)) f(z) dz   (4.32)
 = ∫ p_θ(z|x) ∇_θ log p_θ(z|x) f(z) dz   (4.33)
 = E_{p_θ(z|x)}[f(z) ∇_θ log p_θ(z|x)],   (4.34)

where from Equation 4.32 to Equation 4.33 we apply the log-derivative trick,

∇_θ log p_θ(z|x) = ∇_θ p_θ(z|x) / p_θ(z|x).   (4.35)

Therefore, based on Equation 4.34, we can approximate this expectation with a Monte Carlo method by first drawing samples from the conditional distribution z^(n) ∼ p_θ(z|x) and then computing the weighted gradient term

E_{p_θ(z|x)}[f(z) ∇_θ log p_θ(z|x)] ≈ (1/N) ∑_{n=1}^{N} f(z^(n)) ∇_θ log p_θ(z^(n)|x).   (4.36)

This unbiased estimator of the gradient does not require f(z) to be differentiable; we only need to be able to evaluate or observe its value for a given z. The real challenge, however, is that this Monte Carlo gradient estimator typically has high variance, so to make it useful we must keep its variance as low as possible. To gain more control over the variance, we can subtract a control variate b(z) from the learning signal f(z).
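The Monte Carlo estimator of Equation 4.36 can be checked on a toy problem where the true gradient is known in closed form. The Bernoulli model, the choice f(z) = z, and the constant baseline below are illustrative assumptions, not the thesis's actual setting.

```python
import numpy as np

# A toy sketch of the score-function (REINFORCE) estimator of Equation 4.36,
# assuming z ~ Bernoulli(p) with p = sigmoid(theta) and learning signal f(z) = z.
# For this choice the true gradient d/dtheta E[f(z)] = p * (1 - p) is known,
# so the Monte Carlo estimate can be checked against it.
rng = np.random.default_rng(3)
theta = 0.4
p = 1.0 / (1.0 + np.exp(-theta))            # Bernoulli parameter
N = 200_000

z = (rng.random(N) < p).astype(float)       # samples z^(n) ~ p_theta(z)
f = z                                       # learning signal f(z)
score = z - p                               # d/dtheta log p_theta(z) for a Bernoulli
grad_est = np.mean(f * score)               # Equation 4.36

# Variance reduction: subtract the empirical mean of f as a baseline b;
# for a fixed constant b the correction term mu_b is exactly zero,
# since E[grad log p] = 0.
b = np.mean(f)
grad_baseline = np.mean((f - b) * score)

true_grad = p * (1 - p)
```

With enough samples both estimates converge to the analytical gradient, but the baselined version does so with markedly less sample-to-sample variance, which is the point of the control variate.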
To keep the estimator unbiased, we add back the analytical expectation µ_b = E_{p_θ(z|x)}[b(z) ∇_θ log p_θ(z|x)]:

∇_θ E_{p_θ(z|x)}[f(z)] = E_{p_θ(z|x)}[(f(z) − b(z)) ∇_θ log p_θ(z|x)] + µ_b.   (4.37)

4.4 Conclusion

This chapter has presented the fundamental building blocks of neural networks and deep learning. It began by introducing neurons, the basic feed-forward network architecture, and the forward propagation used to compute predictions. This was followed by a discussion of network training, from back-propagation and gradient computation to stochastic gradient descent. A series of neural network types was then introduced, such as the RNN, LSTM, and CNN. After a brief description of the two major objective functions used to train deterministic neural networks, stochastic neural networks and the Conditional VAE framework were introduced, followed by a section on optimisation using the NVI framework. These neural network modules and their optimisation techniques are the foundation of the methods proposed in this thesis.

Chapter 5 Recurrent Neural Network Language Generators

Recent results have shown that a machine learning approach with a sufficient amount of training data can outperform pipelined systems built on sophisticated rules and domain theories (Deng et al., 2013; Sutskever et al., 2014). The use of RNNs for language generation is motivated by the observation that a trained Recurrent Neural Network Language Model (RNNLM) (Mikolov et al., 2010) is effectively a compact encoding of all of the utterances used to train it. If an RNNLM is made to generate word sequences randomly by sampling the output distribution with the previous word as input, many of the generated sequences will be syntactically correct even if semantically incoherent (Sutskever et al., 2011). Hence, if the RNNLM is conditioned during training on some abstract representation of the required semantics, it should generate appropriate surface realisations.
This is the basis of the Recurrent Neural Network Language Generation (RNNLG) framework shown in Fig. 5.1.

5.1 The Recurrent Language Generation Framework

The input to the generator is a dialogue act (DA) comprising a DA-type, such as inform, request, or confirm, and a set of zero or more slot-value pairs; the output is an appropriate surface realisation. In Figure 5.1, the input DA is "inform(name=Seven_Days, food=Chinese)" and the output is "Seven Days serves chinese food". To make more efficient use of the training data, slot values are delexicalised by replacing actual values with slot tokens; the substitution is then reversed after the output has been rendered. For any slot, there are five possible slot tokens: VALUE, DONTCARE, YES, NO, and NONE. Thus, in Figure 5.1, the input to the generator is actually "inform(name=VALUE, food=VALUE)" and the output is "name_VALUE serves food_VALUE food". The saved name and food values are then substituted back in as a final post-processing step.

Fig. 5.1 The RNNLG framework for language generation.

The framework itself operates as follows. At each time step t, a 1-hot encoding w_t of a token¹ w_t is input to the model, which is conditioned on the recurrent hidden layer h_{t−1} from the previous step, producing a new state representation h_t. To take the system's intended meaning into account, the recurrent function f is further conditioned on a control vector d_t encoding the system DA,

h_t = f(w_t, h_{t−1}, d_t).   (5.1)

This updated state is then transformed into an output probability distribution, from which the next token in the sequence is sampled,

p(w_{t+1}|w_t, w_{t−1}, ..., w_0, d_t) = softmax(W_{ho} h_t)   (5.2)
w_{t+1} ∼ p(w_{t+1}|w_t, w_{t−1}, ..., w_0, d_t).
(5.3)

By decoding tokens one by one from the output distribution of the RNN until a stop token is generated (Karpathy and Fei-Fei, 2014), or some other constraint is satisfied (Zhang and Lapata, 2014), the network can produce a sequence of tokens which can be lexicalised² to form the required utterance.

¹We use token to emphasise that utterances consist of both words and slot tokens.

As noted in Section 4.1, although there are many choices for the recurrent function f, we opt for an LSTM, since in our preliminary experiments it proved effective and competitive across a variety of configurations. Given the input w_t and previous hidden state h_{t−1}, an LSTM updates its internal state according to the following equations:

i_t = sigmoid(W_{wi} w_t + W_{hi} h_{t−1})   (5.4)
f_t = sigmoid(W_{wf} w_t + W_{hf} h_{t−1})   (5.5)
o_t = sigmoid(W_{wo} w_t + W_{ho} h_{t−1})   (5.6)
ĉ_t = tanh(W_{wc} w_t + W_{hc} h_{t−1})   (5.7)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ ĉ_t   (5.8)
h_t = o_t ⊙ tanh(c_t),   (5.9)

where i_t, f_t, o_t ∈ [0,1]^n are the input, forget, and output gates, respectively, n is the hidden layer size, ĉ_t and c_t are the proposed and true cell values, respectively, and the W_{∗,∗} are the model parameters. The remainder of this section describes three variants of the LSTM, each incorporating a control signal to ensure that the generated surface form is consistent with the required meaning encoded by the input DA.

5.2 Gating Mechanism

Motivated by the LSTM, which uses gates to control the information flow inside the cell, the same idea can readily be applied to control the input DA; this is called the gating mechanism. Based on this concept, two generators are proposed: (1) the Heuristically Gated LSTM generator (H-LSTM) and (2) the Semantically Conditioned LSTM generator (SC-LSTM).
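The token-by-token decoding loop of Section 5.1 (Equations 5.2–5.3) can be sketched as follows. The vocabulary, the stub recurrent function standing in for a trained LSTM, and all weights are illustrative assumptions.

```python
import numpy as np

# A minimal sketch of the RNNLG decoding loop of Equations 5.2-5.3, using a
# stub recurrent function in place of a trained LSTM; the vocabulary, weights,
# and stop token are all illustrative.
rng = np.random.default_rng(4)
vocab = ["<s>", "name_VALUE", "serves", "food_VALUE", "food", "</s>"]
V, H = len(vocab), 8
Who = rng.normal(0, 0.5, (V, H))            # output projection W_ho
d = rng.normal(size=H)                      # control vector encoding the DA

def recurrent_f(w_onehot, h_prev, d):       # stand-in for Equation 5.1
    return np.tanh(0.5 * h_prev + 0.1 * w_onehot.sum() + 0.3 * d)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

h, w = np.zeros(H), np.zeros(V)
w[vocab.index("<s>")] = 1.0                 # start token as a 1-hot vector
tokens = []
for _ in range(20):                         # decode until stop token or length limit
    h = recurrent_f(w, h, d)
    p = softmax(Who @ h)                    # output distribution (Equation 5.2)
    idx = rng.choice(V, p=p)                # sample the next token (Equation 5.3)
    if vocab[idx] == "</s>":
        break
    tokens.append(vocab[idx])
    w = np.zeros(V); w[idx] = 1.0           # feed the sampled token back as input
```

A final lexicalisation step would replace slot tokens such as name_VALUE with the saved slot values to form the surface utterance.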
5.2.1 The Heuristically Gated LSTM Generator

To ensure that the generated utterance represents the intended meaning, a DA control vector d_t is constructed from the concatenation of 1-hot encodings of the required DA-type and its associated slot-value pairs. The auxiliary information provided by this control vector tends to decay over time because of the vanishing gradient problem (Bengio et al., 1994; Mikolov and Zweig, 2012). Hence, d_t is reapplied to the LSTM at every time step, as shown in Figure 5.2, and Equation 5.8 is modified to incorporate it,

c_t = f_t ⊙ c_{t−1} + i_t ⊙ ĉ_t + tanh(W_{dh} d_t),   (5.10)

where W_{dh} is an additional weight matrix to be trained. To prevent undesirable repetition in the output, the control vector d_t is filtered by a reading gate r_t at each step before being input to the network,

d_t = r_t ⊙ d_{t−1}.   (5.11)

Each component of this gating signal r_t corresponds to a slot-value pair and is manually reset to zero when the surface form of that slot-value pair appears in the output. To facilitate this, the possible realisations of each slot-value pair must be stored in a semantic dictionary. This model is dubbed the Heuristically Gated LSTM (H-LSTM) generator (Wen et al., 2015a).

²The process of replacing slot tokens by their saved values.

Fig. 5.2 The Heuristically Gated LSTM generator (H-LSTM). At each time step an LSTM cell processes the input token and the previous hidden state. The DA gate is controlled by matching the currently generated token against a predefined semantic dictionary.

5.2.2 The Semantically Conditioned LSTM Generator

The heuristic gating mechanism in the H-LSTM relies on spotting phrases in the generated surface realisation to detect when a slot-value pair has been rendered; it is therefore dependent on the provision of a semantic dictionary and on exact matching. Cases such as binary slots and slots that take "don't care" values cannot be explicitly delexicalised in this way, and they frequently result in generation errors.

Fig. 5.3 The Semantically Conditioned LSTM generator (SC-LSTM).
The upper part is a traditional LSTM cell in charge of surface realisation, while the lower part is a sentence planning cell based on a sigmoid control gate and the DA.

One way to make the matching of slot-value pairs to their surface forms more flexible is to make the reading gate r_t trainable, parameterising it with a small network,

r_t = sigmoid(W_{wr} w_t + W_{hr} h_{t−1} + W_{dr} d_{t−1}).   (5.12)

Here W_{wr} and W_{hr} act like keyword and key-phrase detectors that learn to associate certain patterns of generated tokens with certain slot-value pairs. This trainable r_t can then be used in Equation 5.11 as a replacement for its heuristic counterpart. The entire network is trained end-to-end using a cross-entropy cost function between the predicted word distribution p_t and the actual word label y_t, with regularisation terms on the DA control vector and its transition dynamics,

L(θ) = ∑_t y_t^⊤ log(p_t) + ∥d_T∥ + ∑_{t=0}^{T−1} η ξ^{∥d_{t+1}−d_t∥},   (5.13)

where θ is the set of all model parameters, d_T is the control vector at the final index T, and η and ξ are constants set to 10⁻⁴ and 100, respectively. The second term ensures that by the end of the sentence all the slot-value pairs have been rendered into text (∥d_T∥ → 0), while the third term encourages the model to render one slot-value pair at a time. As shown in Figure 5.3, the Semantically Conditioned LSTM (SC-LSTM) (Wen et al., 2015c) cell is divided into two parts: the lower part is a limited version³ of sentence planning, which manipulates the control vector features during generation so that the surface realisation accurately encodes the input information; the upper part is a surface realiser that employs the LSTM to generate coherent utterances.
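The trainable reading gate of Equations 5.11–5.12 can be sketched as follows. All weights are random stand-ins for trained parameters, and a stub update replaces the full LSTM; the point is only to show how the control vector decays as decoding proceeds.

```python
import numpy as np

# A minimal sketch of the SC-LSTM reading gate (Equations 5.11-5.12): a small
# network decides how much of each slot-value feature remains to be rendered,
# and the DA control vector decays as slots are realised. All weights are
# random stand-ins for trained parameters.
rng = np.random.default_rng(5)
V, H, D = 10, 8, 3                          # vocab, hidden, and DA-vector sizes
Wwr = rng.normal(0, 0.5, (D, V))
Whr = rng.normal(0, 0.5, (D, H))
Wdr = rng.normal(0, 0.5, (D, D))
Wemb = rng.normal(0, 0.1, (H, V))           # stub token-to-hidden weights
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

d = np.ones(D)                              # one component per slot-value pair
d_norms = [np.linalg.norm(d)]
h = np.zeros(H)
for t in range(6):                          # a few decoding steps
    w = np.zeros(V); w[rng.integers(V)] = 1.0    # current token (1-hot)
    r = sigmoid(Wwr @ w + Whr @ h + Wdr @ d)     # trainable reading gate (5.12)
    d = r * d                                    # filter the control vector (5.11)
    h = np.tanh(Wemb @ w + 0.5 * h)              # stub LSTM state update
    d_norms.append(np.linalg.norm(d))

# Since every component of r lies in (0, 1), ||d_t|| decreases monotonically,
# mimicking the regularisation target ||d_T|| -> 0 in Equation 5.13.
```

In the trained model, the gate components corresponding to already-rendered slot-value pairs are driven towards zero, which is exactly what the third term of Equation 5.13 encourages to happen one slot at a time.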
5.3 Attention Mechanism

Attention mechanisms are very popular in the deep learning community. The core idea of attention is to selectively focus on certain parts of the input when making model decisions; it has been applied to many NLP tasks, such as MT (Bahdanau et al., 2015) and image caption generation (Xu et al., 2015). Based on this framework, we propose the Attention-based Encoder Decoder generator (ENC-DEC) to selectively focus on semantic components during generation.

5.3.1 The Attentive Encoder Decoder LSTM Generator

The RNN Encoder-Decoder architecture was first proposed in the MT literature (Bahdanau et al., 2015): the encoder first encodes the input into a distributed vector representation, and the decoder then decodes it step by step to produce the target output. Adapting an idea from Mei et al. (2015), the encoder uses a separate parameterisation of slots and values. Each slot-value pair is converted into a distributed vector representation z_i via

z_i = s_i + v_i,   (5.14)

where s_i and v_i are the i-th slot and value embeddings, respectively, and i runs over the given slot-value pairs. The DA embedding at each time step t is then formed as

d_t = a ⊕ ∑_i ω_{t,i} z_i,   (5.15)

where a is the embedding of the DA type, ⊕ is vector concatenation, and ω_{t,i} is the weight of the i-th slot-value pair calculated by an attention mechanism,

β_{t,i} = q^⊤ tanh(W_{hm} h_{t−1} + W_{mm} z_i)   (5.16)
ω_{t,i} = e^{β_{t,i}} / ∑_j e^{β_{t,j}},   (5.17)

where q, W_{hm}, and W_{mm} are parameters to be trained.

³It can only handle cases where each slot-value pair is realised exactly once in the sentence.

Fig. 5.4 The Attention-based Encoder Decoder generator (ENC-DEC). The DA vector is created by concatenating a deterministic DA embedding with an attentive slot-value embedding over potential slot-value pairs. The dashed lines show how the attentive weights are created by the two control signals.
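The attention computation of Equations 5.14–5.17 can be sketched for a single decoding step. All embeddings and parameters below are random stand-ins; in the trained model they would be learned jointly with the decoder.

```python
import numpy as np

# A minimal sketch of the attention mechanism of Equations 5.14-5.17:
# slot and value embeddings are summed, scored against the decoder state,
# and normalised with a softmax. All embeddings are illustrative stand-ins.
rng = np.random.default_rng(6)
E, H, n_pairs = 5, 4, 3                     # embedding size, hidden size, slot-value pairs
s = rng.normal(size=(n_pairs, E))           # slot embeddings s_i
v = rng.normal(size=(n_pairs, E))           # value embeddings v_i
z = s + v                                   # Equation 5.14
a = rng.normal(size=E)                      # DA-type embedding
q = rng.normal(size=E)                      # attention query vector
Whm = rng.normal(size=(E, H))
Wmm = rng.normal(size=(E, E))
h_prev = rng.normal(size=H)                 # decoder state h_{t-1}

beta = np.array([q @ np.tanh(Whm @ h_prev + Wmm @ z_i) for z_i in z])  # (5.16)
omega = np.exp(beta) / np.exp(beta).sum()   # softmax over pairs       (5.17)
d_t = np.concatenate([a, omega @ z])        # Equation 5.15: a concat sum_i w_i z_i
```

The weights ω_{t,i} are recomputed at every time step from the decoder state, which is what lets the generator attend to a different slot-value pair as each part of the utterance is produced.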
The DA embedding d_t is then fed into an LSTM as additional information that updates the hidden state:

( i_t )   ( sigmoid )
( f_t ) = ( sigmoid ) W_{4n,3n} ( w_t     )
( o_t )   ( sigmoid )           ( h_{t-1} )
( ĉ_t )   (  tanh   )           ( d_t     )

c_t = f_t ⊙ c_{t-1} + i_t ⊙ ĉ_t
h_t = o_t ⊙ tanh(c_t),

where n is the hidden layer size, w_t is the input word embedding, and W_{4n,3n} are the LSTM parameters. This kind of attention mechanism can also be viewed as a latent sentence planning component, where the slot-value pairs z_i are selected to realise the required output based on the attention weight ω_{t,i} at each time step, as shown in Equations 5.16 and 5.17.

Venue domains:
  Restaurant
    informable slots:  *pricerange, area, near, food, goodformeal, kidsallowed
    requestable slots: *name, *type, *price, phone, address, postcode
  Hotel
    informable slots:  *pricerange, area, near, hasinternet, acceptscards, dogsallowed
    requestable slots: *name, *type, *price, phone, address, postcode
  act types: *inform, *inform_only_match, *inform_on_match, *inform_count, *recommend, *select, *confirm, *request, *request_more, *goodbye

Product domains:
  Laptop
    informable slots:  *pricerange, family, batteryrating, driverange, weightrange, isforbusinesscomputing
    requestable slots: *name, *type, *price, warranty, battery, design, dimension, utility, weight, platform, memory, drive, processor
  Television
    informable slots:  *pricerange, family, screensizerange, ecorating, hdmiport, hasusbport
    requestable slots: *name, *type, *price, resolution, powerconsumption, accessories, color, screensize, audio
  act types: *inform, *inform_only_match, *inform_on_match, *inform_count, *recommend, *select, *confirm, *request, *request_more, *goodbye, inform_all, inform_no_info, compare, suggest

(bold = binary slots, * = overlap between the Product and Venue domains; all informable slots can take the "dontcare" value)

Table 5.1 Ontologies of the Restaurant, Hotel, Laptop and TV domains

5.4 Data Collection

To collect the data required to train our models, the Amazon Mechanical Turk (AMT) service was used.
The following section describes the methodology and settings.

5.4.1 Ontologies

To thoroughly evaluate the effectiveness of our methods, experiments were conducted in four different domains: finding a restaurant, finding a hotel, buying a laptop, and buying a television. The ontologies of the four domains are shown in Table 5.1. There are two product domains (Laptop and TV) and two venue domains (Restaurant and Hotel). As can be seen, the two product domains are more complicated than the two venue domains because they contain more distinct DA types4 and kinds of slots. More details, e.g., the slots and DAs that overlap between domains (marked with a * sign) and the binary slots that take only binary values (highlighted in bold), are presented in Table 5.1. Each DA type represents the intent of the system; it is usually combined with a few domain-specific slot-value pairs to form a DA valid for that domain. For example, "inform(name='alexander b&b', pricerange='cheap', acceptscards='yes')" is a valid DA for the hotel domain.

5.4.2 Methodology

The corpora for the venue and product domains were collected in two separate data collections (Wen et al., 2015c, 2016b). Both used the Amazon Mechanical Turk (AMT) service. AMT workers were recruited and asked to propose an appropriate natural language realisation corresponding to each system DA presented to them. The set of DAs in the venue domains (Restaurant and Hotel) was generated via an actual dialogue system (Wen et al., 2015c) and each DA could be presented to workers several times, resulting in a multiple-reference dataset for both the Restaurant and Hotel domains. In the product domains (Laptop and TV), on the other hand, all possible combinations of DA types and slots were enumerated and only one surface form was collected for each DA (Wen et al., 2016b).
Therefore, the product domain datasets are harder than the venue domain ones, since they contain only one reference for each DA whereas the venue datasets contain multiple references. Detailed statistics of the collected datasets are presented in Table 5.2.

5.4.3 Dataset Overview

The statistics of the four datasets5 collected are shown in Table 5.2. Since there are multiple references for each DA in the Restaurant and Hotel domains, the corpora of these two domains are well suited to evaluating NLG systems with similarity-based metrics like the BLEU score (Papineni et al., 2002). On the other hand, since the corpora collected in the Laptop and TV domains have a much larger input space, they typically contain only one training example for each DA. These corpora are better resources for evaluating the generalisation capability of different methods when faced with unseen semantic inputs. Furthermore, due to the varying characteristics of the venue and product domains, the ability to adapt well between the two is a good indicator of the effectiveness of the adaptation method under test. We will investigate the adaptation of RNNLG in Chapter 6.

4One thing worth noting is the compare DA, which prompts the system to generate SPaRKy-like value-based comparisons.
5Datasets are available together with the code at https://github.com/shawnwun/RNNLG

Property                  Restaurant  Hotel     Laptop  Television
# of unique DAs           248         164       ~13K    ~7K
# of data points          ~5K         ~5K       ~13K    ~7K
# of ref. per unique DA   multiple    multiple  single  single

Table 5.2 Properties of the four datasets collected

5.5 Baselines

To assess the performance of the proposed RNNLG models, we compared them against three non-NN based methods: (1) a handcrafted, template-based generator as described in Section 2.5.1, (2) an example-based generator based on k-Nearest Neighbour (kNN) as mentioned in Section 3.3, and (3) a class-based LM generator introduced in Section 3.1.
Details are provided below:

Handcrafted Generator  A template-based generator (hdc) developed in the Cambridge Dialogue Systems Group is the basis for the handcrafted generator used in this thesis. It employs a lookup table for each configuration of the input DA to extract a surface template containing a set of slot-value related placeholders. Based on the slots and values present in the DA, the generator recursively realises parts of the surface form via additional table lookups. The handcrafted generator was tuned over a long period and has been used frequently to interact with real users (Gasic and Young, 2014). Therefore, it sets a reasonable bar even though it is not an optimal baseline.

K-Nearest Neighbour-based Generator  The k-Nearest Neighbour (knn) approach is an example-based method for NLG as described in Section 3.3. During training, all the sentences are delexicalised and their corresponding DAs represented as one-hot feature vectors. These DA feature vectors are used to compute the cosine similarity between each testing DA and all DAs in the training set, in order to find the highest-scoring template for realisation. If multiple templates are ranked equally, the template used for the final realisation is sampled from among them.

Class-based Language Model  The third baseline approach we compared to is the class-based LM (ngram) generator (Oh and Rudnicky, 2000), as introduced in Section 3.1. The class-based LM generator is relatively inefficient; however, previous studies (Mairesse and Young, 2014; Wen et al., 2015a) have shown that clustering sentences into finer classes while using a longer n-gram can help the model perform better. Consequently, we experimentally compared our models to a class-based LM that has an n-gram length of 5 and divides its classes by the DA type and up to 3 slots (compared to only 1 slot in the original paper). During decoding, we also allowed the n-gram models to over-generate 10 times more sentences than our RNNLG models (200 vs. 20).
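The kNN template selection just described can be sketched in a few lines of NumPy; the function and variable names here are illustrative, not taken from the released code:

```python
import numpy as np

def knn_pick(test_da, train_das, train_templates, rng=np.random.default_rng(0)):
    """Pick the delexicalised training template whose one-hot DA vector is
    closest (cosine similarity) to the test DA; ties between equally ranked
    templates are broken by sampling, as described above."""
    sims = np.array([da @ test_da / (np.linalg.norm(da) * np.linalg.norm(test_da))
                     for da in train_das])
    best = np.flatnonzero(np.isclose(sims, sims.max()))   # all top-ranked templates
    return train_templates[int(rng.choice(best))]         # sample one of them
```

Because the template is copied verbatim from training data, this baseline cannot produce ungrammatical output, but it also cannot generalise to DAs far from anything seen in training.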
Other baselines  This thesis mainly compares the proposed approaches with corpus-based baselines so that the different models can be trained under the same settings with the same data. Approaches like the phrase-based DBN (Section 3.2), SPaRKy, or other linguistically-motivated methods (Section 2.5.2) are not considered because they require additional information beyond the input-output examples, such as tree structure rules and semantic alignments. Note that both Mei et al. (2015) and Dušek and Jurcicek (2016) were also excluded from the experiments here because they are essentially attention-based encoder-decoder neural network architectures whose performance the proposed ENC-DEC network can be expected to represent well.

5.6 Corpus-based Evaluation

5.6.1 Experimental Setup and Evaluation Metrics

The LSTM-based natural language generators were implemented using the Theano library (Bastien et al., 2012; Bergstra et al., 2010)6; they were trained by partitioning each of the collected corpora into training, validation and test sets in the ratio 3:1:1. All the generators were trained by treating each sentence as a mini-batch. The parameters were randomly initialised between -0.3 and 0.3 and an l2 regularisation term was added to the objective function every 10 training examples. The hidden layer size was set to 80 in all cases. Stochastic gradient descent and back propagation through time (Werbos, 1990) were used to optimise the parameters. To prevent overfitting, early stopping was implemented using the validation set. To decode utterances from a trained model, an over-generation and re-ranking approach was adopted (Oh and Rudnicky, 2000). Firstly, 20 utterances were generated using beam search with the beam width set to 10, where the decoding criterion was the average log-likelihood of the utterance.
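The training loop with validation-based early stopping can be sketched abstractly as below; the patience-based stopping rule and all names are assumptions for illustration, since the text does not specify the exact stopping criterion:

```python
def train_with_early_stopping(params, sgd_step, valid_cost, data, patience=5):
    """Toy version of the training loop described above: each sentence is one
    mini-batch, and training stops once the validation cost has failed to
    improve for `patience` consecutive epochs."""
    best, best_cost, waited = params, float("inf"), 0
    while waited < patience:
        for sentence in data:                  # one SGD update per sentence
            params = sgd_step(params, sentence)
        cost = valid_cost(params)
        if cost < best_cost:                   # validation improved: keep going
            best, best_cost, waited = params, cost, 0
        else:
            waited += 1
    return best                                # parameters from the best epoch
```

Returning the parameters from the best validation epoch, rather than the final ones, is what prevents the model from overfitting to the training portion.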
The top 5 realisations were then selected for each DA according to the following re-ranking criterion,

R = -(L(θ) + λ · ERR),   (5.18)

where λ is a trade-off constant, L(θ) is the cost generated by a model with parameters θ, and the slot error rate ERR is computed by exact matching of the slot tokens in the candidate utterances as described in Equation 3.14, Section 3.5.2. λ is set to a large value (10) to severely penalise nonsensical outputs.

6Code available at https://github.com/shawnwun/RNNLG

The various models were tested using both corpus-based evaluation and human evaluation. In the corpus-based evaluations, model performance was assessed using two objective evaluation metrics as noted in Section 3.5: the BLEU score (Papineni et al., 2002) and the slot error rate ERR (Wen et al., 2015c) as described in Equation 3.14, Section 3.5.2. Both metrics were computed based on the top 5 realisations. Slot error rates were calculated by averaging slot errors over each of the top 5 realisations in the entire corpus. Multiple references were used to compute the BLEU scores where available (i.e. for the Restaurant and Hotel domains); for corpora that contain only single references, the scores were computed against that single reference7. Since the generators are stochastic and the trained networks can differ depending on the initialisation, the corpus-based evaluation results shown in Table 5.3 were produced by training each neural network model on 5 different random seeds (1-5) and selecting the models with the best BLEU score on the validation set.

5.6.2 Results

The proposed RNNLG methods described in Section 5.1 (i.e., the Heuristically Gated LSTM (hlstm), the Semantically Conditioned LSTM (sclstm), and the Attentive Encoder-Decoder (encdec)) were first compared with the baseline models in terms of generation quality on the held-out test sets. The results are shown in Table 5.3.
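The over-generation and re-ranking step of Equation 5.18 reduces to a simple sort; the candidate representation below (utterance, model cost, slot error rate) is an assumption for illustration:

```python
def rerank(candidates, lam=10.0, top_n=5):
    """Re-ranking sketch (Eq. 5.18): each candidate is an
    (utterance, model_cost, slot_error_rate) triple and R = -(cost + lam * ERR);
    the top_n highest-R utterances are kept."""
    scored = sorted(candidates, key=lambda c: -(c[1] + lam * c[2]), reverse=True)
    return [utt for utt, _, _ in scored[:top_n]]
```

With λ = 10, even a modest slot error rate dominates the model cost, so semantically incorrect candidates are pushed to the bottom of the list regardless of their fluency.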
The rule-based baseline (hdc) performed the worst in terms of BLEU score across all four domains. Nevertheless, it achieved zero slot error rates. Setting aside the difficulty of scaling to large domains, the handcrafted generator's use of predefined rules yields a fixed set of sentence plans, which can differ markedly from the colloquial human responses collected using AMT. The class-based language model approach (ngram), on the other hand, generally suffers from inaccurate rendering of information, which results in very poor slot error rates across the four domains. However, due to its language modelling, the ngram model can achieve a better BLEU score than the hdc baseline. The kNN model (knn), although not optimal, performed robustly across the four domains on both metrics. Since knn is an example-based method that picks its realisation template directly from the training examples, it cannot go too far wrong, except for DAs that do not exist in the training set (poor generalisation).

7This may not be a reliable metric for more complex scenarios, but we found it good enough for the particular task discussed in this thesis.

Domain      Model   | test set        | validation set
                    | BLEU    ERR(%)  | BLEU    ERR(%)
Restaurant  hdc     | 0.4260  0.00    | -       -
            knn     | 0.594   0.60    | -       -
            ngram   | 0.642   8.73    | -       -
            encdec  | 0.740   2.78    | 0.775   2.75
            hlstm   | 0.747   0.74    | 0.779   0.73
            sclstm  | 0.753   0.38    | 0.786   0.47
Hotel       hdc     | 0.5406  0.00    | -       -
            knn     | 0.675   1.75    | -       -
            ngram   | 0.770   5.87    | -       -
            encdec  | 0.855   4.69    | 0.852   4.56
            hlstm   | 0.850   2.67    | 0.849   2.14
            sclstm  | 0.848   3.07    | 0.851   2.38
Laptop      hdc     | 0.3761  0.00    | -       -
            knn     | 0.414   0.88    | -       -
            ngram   | 0.294   35.64   | -       -
            encdec  | 0.511   4.04    | 0.517   3.83
            hlstm   | 0.513   1.10    | 0.512   1.42
            sclstm  | 0.512   0.79    | 0.516   0.95
TV          hdc     | 0.3919  0.00    | -       -
            knn     | 0.432   3.69    | -       -
            ngram   | 0.298   38.78   | -       -
            encdec  | 0.518   3.18    | 0.520   3.49
            hlstm   | 0.525   2.50    | 0.527   2.17
            sclstm  | 0.527   2.31    | 0.527   2.02

Table 5.3 Corpus-based evaluation on four domains.
The results were produced by training each model on 5 random seeds (1-5) and selecting the models with the best BLEU score on the validation set. Results on both the test and the validation sets are reported. The approach with the best performance on a metric in a domain is highlighted in bold, while the worst is underlined.

(a) An example realisation from the Restaurant domain  (b) An example realisation from the Hotel domain
Fig. 5.5 Examples showing how the SC-LSTM controls the dialogue features flowing into the network through the reading gates. Despite errors due to sparse training data for some slots, each gate generally learned to detect the words and phrases relating to its associated slot-value pair.

Comparing the three baseline methods (hdc, knn, ngram) with the three RNNLG models (encdec, hlstm, sclstm), the RNNLG models consistently outperformed the baselines in terms of BLEU score. This is presumably because the LSTM core efficiently learns the natural language features embedded in the training set and thereby reproduces more natural utterances. Although the BLEU scores do not differ much between the RNNLG models, the different architectures used to handle content selection and planning do have a strong effect on the slot error rate. Overall, the gating mechanisms implemented by the H-LSTM and SC-LSTM are more effective than the attention mechanism used in ENC-DEC at reducing the slot error rate. This is probably because the gating mechanism can effectively prevent the undesirable repetition of semantic elements by turning a slot off permanently once the information for that slot has been rendered, whereas the softer attention mechanism can still allow repetition.
# Example Dialogue Acts and Realizations from Restaurant Domain

1. inform(name="red door cafe", goodformeal="breakfast", area="cathedral hill", kidsallowed="no")
   - red door cafe is a good restaurant for breakfast in the area of cathedral hill and does not allow children .
   - red door cafe is a good restaurant for breakfast in the cathedral hill area and does not allow children .
   - red door cafe is a good restaurant for breakfast in the cathedral hill area and does not allow kids .
   - red door cafe is good for breakfast and is in the area of cathedral hill and does not allow children .
   - red door cafe does not allow kids and is in the cathedral hill area and is good for breakfast .

2. informonly(name="dosa on fillmore and kiss seafood", pricerange="expensive", near="lower pacific heights")
   - there is no place other than dosa on fillmore and kiss seafood that are expensive near to lower pacific heights .
   - dosa on fillmore and kiss seafood is the only expensive restaurant near lower pacific heights .
   - the only listed restaurant near lower pacific heights in the expensive price range is dosa on fillmore and kiss seafood .
   - i apologize , dosa on fillmore and kiss seafood is the only expensive restaurant near lower pacific heights .
   - i apologize , dosa on fillmore and kiss seafood are the only expensive restaurants near lower pacific heights .

# Example Dialogue Acts and Realizations from Hotel Domain

3. inform(type="hotel", count="182", dogsallowed="dontcare")
   - there are 182 hotels if you do not care whether dogs are allowed .
   - there are 182 hotels if you do not care whether they allow dogs .
   - 182 hotels are available if dogs allowed or not is not an issue .
   - there are 182 hotels if allowing dogs or not is not an issue .
   - there are 182 hotels if whether dogs are allowed does not matter .
4. informonly(name="red victorian bed breakfast", acceptscreditcards="yes", near="haight", hasinternet="yes")
   - red victorian bed breakfast is the only hotel near haight and accepts credit cards and has internet .
   - red victorian bed breakfast is the only hotel near haight and has internet and accepts credit cards .
   - red victorian bed breakfast is the only hotel near haight that accept credit cards and offers internet .
   - the red victorian bed breakfast has internet and near haight , it does accept credit cards .
   - the red victorian bed breakfast is the only hotel near haight that accepts credit cards , and offers internet .

Table 5.4 Samples of the top 5 realisations from the SC-LSTM output.

Among the four domains tested, the SC-LSTM performs best in most cases, especially in the Restaurant and TV domains. Its ability to learn the words or phrases that trigger the semantic gates is key. Examples showing how the SC-LSTM controls the reading gates can be seen in Figure 5.5. Table 5.4 shows examples of the output generated by the SC-LSTM in the two venue domains. Although most of the sentences are correct both semantically and syntactically, grammatical errors still occur from time to time. For example, "accept" in the third sentence of the fourth block should be "accepts", to agree with its singular subject.

5.7 Human Evaluation

5.7.1 Experimental Setup and Evaluation Metrics

To confirm the findings of the corpus-based evaluation in Table 5.3, a human evaluation was conducted in the Restaurant and Laptop domains; the results are shown in Tables 5.5 and 5.6, respectively. Two tests were run in each domain: (1) an utterance quality test, in which AMT workers were asked to rate each utterance from 1 to 5 in terms of how informative and how natural it is, and (2) a pairwise preference test, in which the workers were asked to compare two models w.r.t. an associated test example and state their preference.
Utterance Quality Evaluation (rating ranges from 1 to 5)
Metrics          hdc     ngram   knn    hlstm  sclstm  encdec  ref
Informativeness  3.88    3.60**  3.91   4.00   3.99    3.95    3.93
Naturalness      3.46**  3.34**  3.69   3.73   3.80    3.70    3.81

Pairwise Preference Comparison (%)
         hdc     ngram   knn    hlstm  sclstm  encdec  ref
hdc      -       35.5    25.5   26.8   42.6    33.9    38.6
ngram    64.5*   -       37.3   26.6   22.4    35.4    30.9
knn      74.5**  62.7    -      38.6   35.2    42.6    36.4
hlstm    73.2**  73.4**  61.4   -      43.8    50.8    55.3
sclstm   57.4    77.6**  64.8*  56.2   -       59.6    43.4
encdec   66.1*   64.6    57.4   49.2   40.4    -       61.2
ref      61.4    69.1*   63.6   44.7   56.6    38.8    -

* p < 0.05, ** p < 0.005

Table 5.5 Human evaluation of the Restaurant domain. The significance flags in the Utterance Quality Evaluation section indicate whether a model is significantly worse than the reference based on a Wilcoxon signed-rank test. The Pairwise Preference Comparison section indicates whether a model is significantly preferred to its opponent based on a two-tailed binomial test.

Here, informativeness8 measures whether an utterance contains all the information specified in the DA, and naturalness9 measures whether an utterance could plausibly have been produced by a human. The significance flag in the Utterance Quality Evaluation section indicates whether a model is significantly worse than the reference. In the Pairwise Preference Comparison section, it indicates whether a model is significantly preferred to its opponent.

5.7.2 Results

The results for the Restaurant domain are shown in Table 5.5. The ratings of the reference (ref) are significantly better than only the handcrafted (hdc) and class-based language model (ngram) baselines, not the rest. This indicates that the RNNLG models are considered indistinguishable from the human-authored sentences in this domain.
This mainly comes from the ability of the LSTM to model long-range dependencies and its powerful generalisation capability over unseen inputs.

8The exact wording: "A higher informativeness score means the computer conveys exactly the same idea as in the reference sentence."
9The exact wording: "A higher naturalness score means the computer behaves more like a human."

Utterance Quality Evaluation (rating ranges from 1 to 5)
Metrics          hdc     ngram   knn     hlstm   sclstm  encdec  ref
Informativeness  4.14    2.84**  4.04**  3.99**  4.00**  3.97**  4.28
Naturalness      3.51**  2.87**  3.64*   3.73    3.71    3.60**  3.84

Pairwise Preference Comparison (%)
         hdc     ngram   knn    hlstm  sclstm  encdec  ref
hdc      -       86.8**  36.1   16.7   23.5    30.0    36.1
ngram    13.2    -       24.0   14.3   13.8    12.5    11.1
knn      63.9    76.0*   -      40.5   40.5    31.0    60.5
hlstm    83.3**  85.7**  59.5   -      51.9    50.8    48.0
sclstm   76.5**  86.2**  59.5   48.1   -       57.7    59.3
encdec   70.0*   87.5**  69.0   49.2   42.3    -       50.0
ref      63.9    88.9**  39.5   52.0   40.7    50.0    -

* p < 0.05, ** p < 0.005

Table 5.6 Human evaluation of the Laptop domain. The significance flag in the Utterance Quality Evaluation section indicates whether a model is significantly worse than the reference based on a Wilcoxon signed-rank test, while in the Pairwise Preference Comparison section it indicates whether a model is significantly preferred to its opponent based on a two-tailed binomial test.

Among the approaches evaluated, the SC-LSTM (sclstm) seems to be the best choice when both metrics are considered, although the differences are not statistically significant. In the preference test, the judges preferred the SC-LSTM to all the other methods. These observations are consistent with the corpus-based evaluation: the SC-LSTM seems to be the best choice for this task. Furthermore, these results suggest that the gating mechanism is more effective than the attention mechanism, since the H-LSTM and SC-LSTM were both rated higher on the two quality scores and preferred by the judges to the ENC-DEC model.
The results in the Laptop domain in Table 5.6 are less clear-cut. As in the Restaurant domain, the gating mechanism is better than the attention mechanism. However, in this case, the H-LSTM seems to be marginally preferred to the SC-LSTM. This is probably due to the complexity and difficulty of the Laptop domain, which results in the semantic control gates of the SC-LSTM being undertrained; here, the predefined heuristics of the H-LSTM provide a more robust control mechanism.

5.8 Conclusions

This chapter has presented a thorough study comparing the proposed RNNLG models with three baseline methods and against each other. The template-based approach is the most robust in terms of slot error rate, simply because the rendering of information is strictly managed by template rules. Such a deterministic process sacrifices the improvisation found in human language; it also generates surface forms that are much more rigid and stylised than the other approaches. At the other end of the spectrum, the class-based LM samples words and therefore produces a lot of variation in the surface forms, but its inefficient use of the DA information often leads to unnecessary computational overhead and erroneous outputs. Despite its poor generalisation over unseen inputs, the kNN-based approach, which uses a similarity measure between input DAs to select templates for realisation, is the most reliable of the three baseline methods. When comparing the three proposed RNNLG models (i.e., H-LSTM, SC-LSTM, ENC-DEC) to the baselines, the following conclusions can be drawn from the results:

1. The RNNLG models all outperform the baselines in both corpus-based and human evaluations. This can be attributed to the distributed representation and the modelling of long-term dependencies using the LSTM.

2. The proposed gating mechanism is preferred over the popular attention mechanism, especially in terms of the slot error rate metric.
This is because the gating mechanism is more effective at preventing semantic repetition via its hard decision gates.

3. The SC-LSTM is the top choice for most of the domains due to its learnable control gates. However, the H-LSTM can be comparable in complex domains, where the input space is rich (more slots, values and DA types) and the amount of data is relatively insufficient (a single reference per DA).

4. The switching on and off of the SC-LSTM's internal gates can be interpreted as acquired semantic alignments between the semantic components (slots and values) in the meaning representation and the partial realisation of the surface form.

Therefore, the proposed SC-LSTM can be considered the more suitable model for these tasks. A study of its domain scalability will be presented in the next chapter.

Chapter 6  Domain Adaptation

A crucial advantage of a statistical NLG system is that it can be scaled to cover more domains simply by collecting more data, without the need for specific domain knowledge or design know-how. However, data collection itself can become costly and laborious if there are many new domains. Domain adaptation provides a solution to this problem. Domain adaptation problems arise when we have sufficient labelled data in one domain (the source domain S), but little or no labelled data in a related domain (the target domain T). The goal is to design a learning algorithm that can use large source domain datasets to facilitate learning and prediction in the target domain. Depending on whether the target domain data is labelled or not, we can roughly divide the problem into two categories: (1) the fully supervised case, where a large amount of source domain data and a small labelled dataset in the target domain are available; and (2) the semi-supervised case, where the source domain dataset is annotated but the target domain dataset is not. In this thesis, we focus exclusively on the fully supervised case because it suits the NLG scenario better.
From a machine learning perspective, where the type of adaptation applied to the pipeline matters, this yields three major methods for domain adaptation: data augmentation, feature-based domain adaptation, and model-based domain adaptation.

6.1 Data Augmentation

Data augmentation (Frühwirth-Schnatter, 1994; Tanner and Wong, 1987) is a method that is widely applied in machine learning; it creates a rich dataset from a small set of authentic data points. Although it may not necessarily solve the data sparsity problem, it does improve the model's robustness. For example, if an image contains a cat, then a proper scaling, rotation, shift or change of colour of that image should not affect this fact. Therefore, to perform well on these variants, the model should be trained on a set of pictures properly transformed from the original image rather than on the original image alone. Data augmentation is popular in the Computer Vision (CV) community and is often combined with Deep Neural Networks to prevent overfitting (Chatfield et al., 2014; Krizhevsky et al., 2012). The technique has also been applied to document classification in NLP, augmenting text data via a thesaurus (Zhang and LeCun, 2015). Relation extraction in biomedical NLP uses data augmentation to create new instances by replacing headword candidate arguments with equivalent arguments. Moreover, data augmentation has also been used to generate negative examples during training when the dataset contains mostly positive examples and obtaining negative ones is not trivial (Bordes et al., 2015). Although data augmentation-based domain adaptation has not been studied extensively in the past, its application to NLG is intuitive and shows impressive results. The most relevant work was done by Hogan et al. (2008).
They showed that an LFG f-structure-based generator could yield better performance if trained on in-domain sentences paired with pseudo parse-tree inputs generated by a state-of-the-art, but out-of-domain, parser. Furthermore, Cuayahuitl et al. (2014) trained statistical surface realisers from unlabelled data via an automatic slot-labelling technique. Our approach employs a data augmentation-type approach. More details can be found in Section 6.5.1.

6.2 Feature-based Domain Adaptation

Unlike data augmentation, which directly modifies the input data, feature-based domain adaptation relies on manipulating the feature space to adapt models across domains. There are roughly three different families of this approach.

Feature augmentation is the method proposed by Daume III (2007). This method replicates the original feature vector three times, with the three copies representing general, source-specific, and target-specific features. The augmented source data contains only the general and source-specific features, while the augmented target data contains only the general and target-specific versions. The model can then learn to put weight on the general features that are shared across domains, but also to distinguish domain-specific features by separating the source- and target-specific feature spaces. An updated kernel function K̂(x, x′) is then treated as a measure of similarity between a pair of data points and applied to the augmented data points via

K̂(x, x′) = 2K(x, x′)   if x and x′ are from the same domain,
K̂(x, x′) = K(x, x′)    if they are from different domains,   (6.1)

where x and x′ are data points and K(x, x′) is the original kernel function. Equation 6.1 indicates that data points from the target domain have twice as much influence as source data points when making decisions about test data from the target domain.
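The feature-replication trick can be written in a few lines, and the dot product of the augmented vectors reproduces Equation 6.1 when K is the linear kernel; this is a generic sketch of Daume III's method, not code from the thesis:

```python
import numpy as np

def augment(x, domain):
    """'Frustratingly easy' feature augmentation: map x to three blocks
    [general; source-specific; target-specific], zeroing the block that
    belongs to the other domain."""
    zero = np.zeros_like(x)
    return np.concatenate([x, x, zero] if domain == "source" else [x, zero, x])

# With a linear kernel, same-domain pairs share both the general and the
# domain-specific block (2K), cross-domain pairs share only the general one (K).
x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
same = augment(x, "source") @ augment(xp, "source")   # 2 * K(x, x')
diff = augment(x, "source") @ augment(xp, "target")   #     K(x, x')
```

Because the transformation only changes the features, any standard learner can be applied unchanged to the augmented data.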
Based on this method, many algorithms, such as SVMs and maximum entropy models, can be used to optimise the parameters.

Feature ensemble (Xia et al., 2013) is another feature-based adaptation technique that learns ensemble weights over a set of pre-trained models. The approach begins by dividing the feature vector x into sub-groups indexed by k. For each sub-group, a classifier g_k(x_k) is trained on those sub-group features only,

g_k(x_k) = ω_k^⊺ x_k,   (6.2)

where a linear classifier is used as the base classifier for each sub-group. After base classification, a meta-learning approach is stacked on top to learn the ensemble weight θ_k of each base classifier, based on the small amount of annotated data from the target domain,

f(x) = ∑_k θ_k g_k(x_k).   (6.3)

Note this approach differs from the model ensemble described in Section 6.3: here, each base classifier is trained on only a subset of the original feature space, whereas the model ensemble technique combines models trained on subsets of the data points in the full feature space. In the original work, the feature ensemble was also combined with a sample selection step that selects a subset of data points in the source domain whose instance distribution is close to that of the target domain to train the model.

The feature embedding approach has become increasingly popular with the rise of deep learning, even though it has existed for a while (Blitzer et al., 2006; Yang and Eisenstein, 2015). This family of approaches aims at finding a mapping that takes both the source domain and the target domain data points into a shared feature space on which the classifier is trained. Therefore, if a good mapping is found, a classifier trained in the source domain will also be effective in the target domain. Previously, this feature mapping was learnt separately, based on Singular Value Decomposition (SVD), as in Blitzer et al. (2006).
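Equations 6.2 and 6.3 combine into a two-stage prediction; the sketch below is illustrative (the sub-group indices, base weights ω_k and ensemble weights θ_k are toy values, and in practice the θ_k would be learnt on the small target-domain dataset):

```python
import numpy as np

def feature_ensemble(x, subgroups, omegas, thetas):
    """Sketch of Eqs. 6.2-6.3: linear base classifiers g_k over feature
    sub-groups, combined with ensemble weights theta_k."""
    g = [w @ x[idx] for w, idx in zip(omegas, subgroups)]   # Eq. 6.2: base scores
    return sum(t * gk for t, gk in zip(thetas, g))          # Eq. 6.3: weighted sum
```

Stacking the θ_k on top of frozen base classifiers is what lets the small target-domain dataset steer the ensemble without retraining the base models.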
More recently, this feature mapping has typically been an NN-based embedding function, which has become common practice (Collobert and Weston, 2008; Yang and Eisenstein, 2015).

6.3 Model-based Domain Adaptation

Model-based domain adaptation is probably the most popular of the three approaches because it is the most intuitive. It covers a wide range of methods such as modifying the learning signals (Lemon, 2008; Walker et al., 2007), changing the model architecture (Mrkšić et al., 2015; Wen et al., 2013a), model ensembles (Heidel and Lee, 2007; Wen et al., 2012a), and learning a transformation of the model parameters (Gales and Woodland, 1996; Leggetter and Woodland, 1995). Most current speech and language applications rely on this kind of adaptation because it does not always require the model to be re-trained, which allows online domain adaptation.

The most direct approach to model-based domain adaptation is to continue training the model on data points from the target domain. This often requires a change in the model architecture or specific steps during training so that the adapted model can perform reasonably well in the target domain. For example, Mrkšić et al. (2015) trained a general belief tracker by abstracting away the identity of slots; this general tracker was later adapted to track each individual slot. Wen et al. (2013a) personalised an RNN LM by adapting a model trained on a general corpus to specific social network users. The RNN they used was augmented with an additional feature layer to facilitate information sharing between different users (or domains).

In this thesis, the proposed adaptation recipe is compared with a model-based adaptation baseline called model fine-tuning: a general model is first trained on the source domain data and then adapted by fine-tuning its parameters on the adaptation data. More details can be found in Section 6.4.
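To make the fine-tuning baseline concrete, here is a toy sketch with a one-parameter "generator" (real systems fine-tune full network weights; the model, data, and learning rates below are purely illustrative):

```python
def sgd_train(theta, data, lr, epochs):
    """Minimal SGD on a one-parameter model y = theta * x with squared
    error; a stand-in for training the full generator."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (theta * x - y) * x
            theta -= lr * grad
    return theta

# 1. Train a model on the (rich) source domain data.
source = [(1.0, 2.0), (2.0, 4.0)]         # source behaves like y = 2x
theta_S = sgd_train(0.0, source, lr=0.1, epochs=100)

# 2. Initialise the target model from the source parameters, then
# 3. refine on the (sparse) adaptation data with a smaller learning rate;
#    early stopping on a validation split is omitted for brevity.
target = [(1.0, 3.0)]                     # target behaves like y = 3x
theta_T = sgd_train(theta_S, target, lr=0.01, epochs=200)
```

The target model starts from the source solution rather than from scratch, which is exactly the initialisation advantage the baseline relies on.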
Model ensemble is a domain adaptation technique that is especially popular for language modelling. The idea is similar to that of the feature ensemble described in Section 6.2; however, instead of training models on subsets of features, a model ensemble trains models on subsets of the training examples and combines them later by learning a set of model-dependent weights. In language modelling (Heidel and Lee, 2007; Wen et al., 2012a), for example, a set of class-based LMs is first trained on topics like politics, tourism, sports, or economics. During testing, the inference algorithm dynamically decides the weight of each topic LM and combines them into a single LM, which can then be used to estimate the complexity of the test data.

Most of the adaptation work done in NLG has used model-based methods, although most of it did not directly address the problem of domain adaptation. The SPoT-based generator proposed by Walker et al. (2002) has the potential to address domain adaptation problems; however, their published work focused on tailoring user preferences (Walker et al., 2007) and mimicking personality traits (Mairesse and Walker, 2011). In contrast, Mairesse et al. (2010) proposed the use of active learning to mitigate the data sparsity problem when training data-driven NLG systems.

6.4 Baseline Method - Model Fine-Tuning

Given training instances represented by tuples {di, yi} of a DA di and a sentence yi = {wi0, wi1, ..., wiN}, drawn from the source domain S (data rich) and the target domain T (data sparse), our goal is to find a set of network parameters θT that performs acceptably well in the target domain. A straightforward way to adapt NN-based models to a target domain is to continue training, or fine-tuning, a well-trained generator on whatever new target domain data is available. This is one of the model-based approaches described in the previous section. The training procedure is as follows: 1.
Train a source domain generator θS on source domain data {di, yi} ∈ S with all values delexicalised. 2. Set θT = θS as the initial parameters for the target domain. 3. Divide the adaptation data into training and validation sets. Refine the parameters by training on the adaptation data {di, yi} ∈ T with early stopping and a smaller initial learning rate. This yields the target domain generator θT.

This method closely follows the work of Wen et al. (2013a), where adaptation was done by continuing to train the source domain model on target domain data. Although this method can benefit from parameter sharing in the LM part of the network, the parameters of similar input slot-value pairs are not shared¹. Thus, the realisation of any unseen slot-value pair in the target domain can only be learned from scratch, and adaptation offers no benefit in this case.

¹ We have tried training with both delexicalised slots and values and then using the weights to initialise unseen slot-value pairs in the target domain. However, this yielded even worse results since the learned semantic alignment became stuck in a local minimum. Pre-training only the LM parameters did not produce better results.

6.5 The Adaptation Recipe for Language Generators

The feature-based method requires manual manipulation of the input feature space; it is therefore not well suited to combination with NN-based approaches, since the power of NNs comes mainly from their ability to learn from raw features. Moreover, model-based approaches are tricky to implement in practice for slot-value based applications: they either resemble the method described in Section 6.4, which does not make the best use of the source domain data, or they require a complex weight-tying strategy and pre-training, as in Mrkšić et al. (2015), which is more difficult to implement for NLG tasks.
On the other hand, data augmentation suits the problem quite well, since the RNNLG models already require delexicalisation as a pre-processing step. Consequently, the adaptation recipe proposed in this thesis is centred on the idea of data augmentation and the more efficient use of the data at hand. This leads to the combination of two methods, Data Counterfeiting and Discriminative Training, as described in the following two sections.

6.5.1 Data Counterfeiting

To maximise the effect of domain adaptation, the model should be able to (1) generate acceptable realisations of unseen slot-value pairs (based on similar slot-value pairs seen in the training data), and (2) continue to distinguish slot-value pairs that are similar but nevertheless distinct. Instead of exploring weight-tying strategies in different training stages (which is difficult to implement and typically relies on ad-hoc rules), we propose a data counterfeiting approach to synthesise target domain data from source domain data. The procedure is shown in Figure 6.1 and described in the list below:

1. Categorise slots, in both the source and target domain, into classes according to some similarity measure. In our case, we categorise the slots based on their functional type to yield three classes: informable (I), requestable (R), and binary (B)². For more details about the slots in the three classes, please refer back to Table 5.1.

2. Delexicalise all slots and values to produce template-like sentences.

3. For each slot s in a source example (di, yi), randomly select a new slot s′ to replace s. Note that s′ must belong to the target domain and be in the same functional class as s. This yields a pseudo example (d′i, y′i) ∈ T in the target domain.

4. Train a generator θ′T on the counterfeit dataset {d′i, y′i} ∈ T.

5. Refine the parameters on in-domain data. This yields the target domain model θT.
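Step 3 above, the slot substitution, can be sketched as follows (the slot inventory, class labels, and template are hypothetical examples of delexicalised data):

```python
import random

# Slots grouped by functional class (informable I, requestable R, binary B).
SLOT_CLASS = {
    "food": "I", "area": "I",              # source (restaurant) informables
    "screensize": "I", "resolution": "I",  # target (TV) informables
}
TARGET_SLOTS = ["screensize", "resolution"]

def counterfeit(da_slots, template):
    """Replace each delexicalised source slot with a randomly chosen target
    slot of the same functional class, yielding a pseudo target-domain
    example (d'_i, y'_i)."""
    mapping = {}
    for s in da_slots:
        candidates = [t for t in TARGET_SLOTS
                      if SLOT_CLASS[t] == SLOT_CLASS[s]]
        mapping[s] = random.choice(candidates)
    new_slots = [mapping[s] for s in da_slots]
    new_text = template
    for s, t in mapping.items():
        new_text = new_text.replace("<" + s + ">", "<" + t + ">")
    return new_slots, new_text

slots, text = counterfeit(["food"], "the restaurant serves <food> food")
```

Because replacement happens on delexicalised templates, the sentence structure and slot co-occurrence statistics of the source data are carried over into the counterfeit target data.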
The name counterfeiting comes from the fact that the training examples used to train the target domain generator are not authentic data points but are counterfeited by sampling slots from the target domain to replace the original ones. This approach allows the generator to share realisations among the slot-value pairs that have similar functionalities; therefore, it facilitates the transfer learning of rare slot-value pairs in the target domain. Furthermore, the approach also preserves the co-occurrence statistics of slot-value pairs and their realisations. This allows the model to learn a gating or attention mechanism even before adaptation data is introduced.

² The informable class includes all non-binary informable slots, while the binary class includes all binary informable slots.

Fig. 6.1 An example of the data counterfeiting approach. Note the slots highlighted in red are randomly sampled from the target domain, with the prerequisite that the counterfeited slot must be in the same functional class as the original slot.

6.5.2 Discriminative Training for Adaptation

Apart from data counterfeiting, the second approach in the proposed adaptation recipe is Discriminative Training (DT). DT is an alternative objective function to MLE, used when training ASR or language models (Collins, 2002; Kuo et al., 2002) to improve model predictions. Instead of maximising the log-likelihood of the correct examples, its core idea is to separate the correct examples from competing incorrect ones.

Based on the notation introduced in Section 4.2.2, we have the current model parameterised by θ and training examples (di, yi), where di is the input DA and yi is the reference sentence. Following Equation 4.26, the DT objective function is therefore written as

θDT = argmax_θ ∑_{ŷi ∈ GEN(di)} pθ(ŷi|di) R(ŷi, yi)    (6.4)

where ŷi ∈ GEN(di) is a set of candidate examples generated by the current model.
To allow DT with multiple targets, the scoring function R(ŷi, yi) is further generalised to take several metrics into account,

R(ŷi, yi) = ∑j βj Rj(ŷi, yi)    (6.5)

where βj is the weight of the j-th scoring function. Since the cost function presented here (Equation 4.26) is differentiable everywhere w.r.t. the model parameters θ, backpropagation can be applied to calculate the gradients and update the parameters directly.

When the DT objective was applied to the sparse target domain examples in the domain adaptation scenario, a significant performance improvement was observed in almost all cases. This is because DT discriminates in favour of the in-domain examples, not only increasing the probability of the correct examples but also penalising the incorrect ones learned in the data counterfeiting phase. This allows the model to learn biased in-domain behaviours quickly with a limited amount of in-domain data. Although this strong discriminative power is beneficial when the amount of target domain data is limited, it is harmful for a system trained on a vast amount of in-domain data, where the discrimination is so strong that the learned system loses its ability to generate diverse responses. This is the reason why it is only used for domain adaptation rather than for training NLG models in general.

6.6 Corpus-based Evaluation

6.6.1 Experimental Setup and Evaluation Metrics

To assess the effectiveness of the proposed adaptation recipe, a set of corpus-based experiments was conducted. The metrics used to evaluate model performance are the BLEU score and the slot error rate (as described in Section 5.6.1). The datasets are the same as those described in Section 5.4, covering the four application domains: Restaurant, Hotel, Laptop, and TV. All of these experiments use the SC-LSTM generator, since it was shown to be the most robust choice across different domains (Sections 5.6 and 5.7).
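For concreteness, the DT objective of Equations 6.4 and 6.5, with BLEU rewarded (β = 1.0) and slot error rate penalised (β = -1.0) as in the experiments below, can be sketched over a toy candidate batch (the log-probabilities and metric scores are hypothetical):

```python
import math

def dt_objective(candidates, betas):
    """Expected composite reward over candidates y' in GEN(d):
    sum_y' p(y'|d) * sum_j beta_j * R_j(y', y)  (Equations 6.4 and 6.5).
    Each candidate is (log-probability, [metric scores]); probabilities
    are renormalised over the sampled batch."""
    z = sum(math.exp(logp) for logp, _ in candidates)
    obj = 0.0
    for logp, scores in candidates:
        r = sum(b * s for b, s in zip(betas, scores))  # Equation 6.5
        obj += math.exp(logp) / z * r                  # Equation 6.4
    return obj

# Two hypothetical candidates, scored by (sentence BLEU, slot error rate).
cands = [(-1.0, [0.6, 0.0]),   # fluent, no slot errors
         (-1.5, [0.4, 0.5])]   # worse BLEU, with slot errors
obj = dt_objective(cands, betas=[1.0, -1.0])
```

Maximising this objective pushes probability mass towards high-BLEU, low-error candidates, which is the discriminative effect described above.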
The hidden layer size was set to 80 in all cases. Stochastic gradient descent, backpropagation through time, and early stopping were used to optimise the parameters. For further experimental details, such as the beam width and decoding criteria, please refer to Section 5.6.1.

Two-way adaptation was conducted between (1) the Restaurant and Hotel domains, (2) the Laptop and TV domains, and (3) the Venue domain, formed by combining the Restaurant and Hotel domains, and the Product domain, formed by combining the Laptop and TV domains. Scenarios (1) and (2) can be considered easier tasks than (3), because the characteristics of the venue domains are very different from those of the product domains, as described in Section 5.4.3. This resulted in six adaptation scenarios; the results are shown in Figures 6.2 to 6.7.

To facilitate comparison of the results across the different adaptation methods, the BLEU score and slot error rate curves are plotted for varying amounts of adaptation data. Note that the x-axis is presented on a log scale in all these figures. There are two baselines: (1) models trained completely from scratch using only target domain data (scratch), and (2) models trained with the model fine-tuning technique (fine-tune) as described in Section 6.4. The starting point of the scratch curve is therefore the performance of a randomly initialised network, while that of the fine-tune curve indicates the performance of a model pre-trained on source domain data.

Fig. 6.2 Hotel to Restaurant domain adaptation: (a) BLEU score, (b) slot error rate
Fig. 6.3 Restaurant to Hotel domain adaptation: (a) BLEU score, (b) slot error rate

6.6.2 Results

Comparing first the data counterfeiting (counterfeit) approach with the two baselines (scratch and fine-tune), Figures 6.2 to 6.5 show the results of adapting models between similar domains, i.e., between Restaurant and Hotel, and between Laptop and TV.
Because of the parameter sharing in the LM part of the network, model fine-tuning (fine-tune) achieves a better BLEU score than training from scratch (scratch) when the target domain data is limited. However, the data counterfeiting (counterfeit) method achieves a significantly greater BLEU score than either of the baselines, largely because of its more robust handling of unseen slot-value pairs. Furthermore, data counterfeiting (counterfeit) also brings a substantial reduction in slot error rate: counterfeiting preserves the co-occurrence statistics between slot-value pairs and realisations, which allows the model to learn good semantic alignments even before the target domain data is introduced.

Fig. 6.4 TV to Laptop domain adaptation: (a) BLEU score, (b) slot error rate
Fig. 6.5 Laptop to TV domain adaptation: (a) BLEU score, (b) slot error rate

Similar results can be seen in Figures 6.6 and 6.7, in which adaptation was performed on more disjoint domains: between the Restaurant + Hotel joint domain and the Laptop + TV joint domain. The data counterfeiting (counterfeit) method is again superior to the two baselines.

As described in Section 4.2.2, the generator parameters obtained from data counterfeiting and maximum likelihood adaptation can be further tuned by applying discriminative training. In each case, the models were optimised using two objective functions: the BLEU-4 score and the slot error rate. However, the sentence BLEU introduced in Section 3.5.1 was used during training to mitigate the sparse n-gram match problem of BLEU at the sentence level. In these experiments, we set γ to 5.0 and βj to 1.0 and -1.0 (Equations 4.27 and 6.5) for BLEU and ERR, respectively. For each DA, the generator was sampled 50 times to generate candidate sentences, and any repeated candidates were removed. The resulting candidate set was treated as a single batch and the model parameters were updated using the procedure described in Section 4.2.2.

Fig. 6.6 Restaurant + Hotel to Laptop + TV domain adaptation: (a) BLEU score, (b) slot error rate
Fig. 6.7 Laptop + TV to Restaurant + Hotel domain adaptation: (a) BLEU score, (b) slot error rate

As shown by the curves marked counterfeit+DT in Figures 6.2 to 6.7, DT consistently improves generator performance on both metrics. It is interesting to note that the effect on the slot error rate is more pronounced than the effect on BLEU. This may be because the sentence BLEU optimisation criterion is only an approximation to the corpus BLEU score used for evaluation. Overall, however, these results show that the proposed adaptation procedure, incorporating data counterfeiting and discriminative training, was consistently effective across the six adaptation scenarios tested.

Utterance Quality Evaluation (ratings range from 1 to 3)

Method     TV to Laptop                   Laptop to TV
           Informativeness  Naturalness   Informativeness  Naturalness
scr-10%    2.24**           2.03**        2.00**           1.92**
ML-10%     2.51**           2.22**        2.45**           2.22**
DT-10%     2.53*            2.25*         2.51             2.19**
scr-All    2.64             2.37          2.54             2.36

* p < 0.05, ** p < 0.005

Table 6.1 Human evaluation of utterance quality in two adaptation scenarios. Results are shown for two metrics (rated on a 3-point scale). Statistical significance was computed using a Wilcoxon signed-rank test between the model trained with full data (scr-All) and all others.

6.7 Human Evaluation

6.7.1 Experimental Setup and Evaluation Metrics

To confirm the conclusions drawn from the corpus-based evaluation, the adaptation procedure was evaluated by human judges recruited via AMT for two adaptation scenarios: laptop to TV and TV to laptop.
For each task, pairs of systems drawn from the following four were compared: training from scratch using the full dataset (scr-All), adapting with DT using only 10% of the target domain data (DT-10%), adapting with maximum likelihood training using only 10% of the target domain data (ML-10%), and training from scratch using only 10% of the target domain data (scr-10%). To reduce the effects of language variation, each system generated 5 different surface realisations for each input DA. The human judges were asked to score each of them in terms of informativeness and naturalness (rated on a 3-point scale)³, and to state a preference between the two systems. To reduce the amount of information presented to the judges, utterances that were rendered identically by both systems were filtered out. In total, 2000 DAs were tested for each scenario, distributed uniformly among the contrasts, except that 50% more comparisons were provided for the contrast between ML-10% and DT-10% because their results were very close.

³ A 3-point Likert scale was used here to reduce the cognitive load of the MTurk judges.

6.7.2 Results

Table 6.1 shows the subjective quality assessments, which clearly exhibit the same general trend as the corpus-based objective results. If a large amount of target domain data is available, training everything from scratch (scr-All) achieves very good performance and adaptation is not necessary.

Pairwise Preference Comparison (%)

TV to Laptop   scr-10%   ML-10%   DT-10%   scr-All
scr-10%        -         34.5     33.9     22.4
ML-10%         65.5**    -        44.9     36.8
DT-10%         66.1**    55.1     -        35.9
scr-All        77.6**    63.2**   64.1**   -

Laptop to TV   scr-10%   ML-10%   DT-10%   scr-All
scr-10%        -         17.4     14.2     14.8
ML-10%         82.6**    -        44.9     37.1
DT-10%         85.8**    51.9     -        41.6
scr-All        85.2**    62.9**   58.4*    -

* p < 0.05, ** p < 0.005

Table 6.2 Pairwise preference test among the four approaches in two adaptation scenarios. Statistical significance was computed using a two-tailed binomial test.
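The two-tailed binomial significance test used for the pairwise preference comparisons can be sketched with the standard library (the counts below are hypothetical, not the thesis results):

```python
from math import comb

def binom_two_tailed(k, n, p=0.5):
    """Exact two-tailed binomial test: the probability, under the null
    hypothesis of no preference (p = 0.5), of an outcome at least as
    extreme as k wins out of n pairwise comparisons."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    observed = pmf[k]
    # Sum the probabilities of all outcomes no more likely than the
    # observed one (small epsilon guards against float ties).
    return min(1.0, sum(pr for pr in pmf if pr <= observed + 1e-12))

# Hypothetical: one system preferred in 66 of 100 pairwise comparisons.
p_val = binom_two_tailed(66, 100)
```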
However, if only a limited amount of in-domain data is available, efficient adaptation is critical (DT-10% & ML-10% > scr-10%). Moreover, judges mostly preferred the DT-adapted generator (DT-10%) over the ML-adapted generator (ML-10%), especially for informativeness. In the laptop to TV scenario, the informativeness score for DT adaptation (DT-10%) was statistically indistinguishable from that of training on the full training set (scr-All).

The preference test results are shown in Table 6.2. Again, adaptation is essential to bridge the gap between domains when the target domain data is scarce (DT-10% & ML-10% > scr-10%). The results also suggest that the DT training approach (DT-10%) was preferred over ML training (ML-10%), although the preference was not statistically significant. It is also worth noting that the proposed adaptation recipe is quite effective even when only 10% of the target domain data is available for adaptation. This suggests that the combination of the SC-LSTM model with the proposed adaptation recipe allows us to develop a multi-domain language generator that makes more efficient use of the data at hand.

6.8 Conclusions

This chapter has presented a procedure for training multi-domain, RNN-based language generators via data counterfeiting and discriminative training. The chapter began with a big-picture description of domain adaptation, defining its three major directions: data augmentation, feature-based adaptation, and model-based adaptation. Among these, it was argued that data augmentation is the most practical and efficient approach for dealing with unseen slot-value pairs in the NLG problem. This leads to the data counterfeiting method, in which surface forms are delexicalised and each source domain slot-value pair is substituted with a randomly sampled target domain slot-value pair to produce pseudo examples.
These pseudo examples (or counterfeited surface forms) are then used to initialise the model parameters in the target domain. Although the arbitrary replacement of slot-value pairs inevitably causes realisation errors, it does provide the necessary prior knowledge (the LM probabilities and the co-occurrence statistics of slot-value pairs and their surface forms) needed to operate the model in a new domain, even before any domain-specific data is introduced. The second phase then corrects the errors introduced by counterfeiting, by training on the small amount of target domain data with a more effective objective function. Since the adaptation data is sparse and the maximum likelihood criterion does not make the most effective use of the data points to correct the model errors, a discriminative objective is introduced and employed instead.

The proposed adaptation recipe for RNNLG was assessed by both corpus-based evaluation and human evaluation. The objective measures on corpus data demonstrated that, by applying this procedure to adapt models between different dialogue domains, good performance can be achieved with much less training data. Subjective assessment by human judges confirmed the effectiveness of the approach. In addition, although the experiments conducted in this thesis were based on the SC-LSTM model, the procedure is general and applies to any data-driven language generator, since it requires only data augmentation and a change of training objective rather than feature-level or model-level manipulation.

Chapter 7 Generation by Conditioning on a Broader Context

The framework for natural language generation described in Section 5.1 is the basis for the proposed RNNLG. The RNN model generates sentences word-by-word by conditioning on handcrafted semantic representations such as the dialogue act formalism (Traum, 1999).
The input-output space of the NLG module is designed to fit into the pipelined dialogue system framework described in Chapter 2, so that the system can be broken down into several components and its development can be simplified and distributed. The intermediate dialogue act representation plays a crucial role here as well: it is the major communication medium between system modules. It is also the representation that abstracts away from the various surface expressions of the same meaning and captures the intrinsic intentions behind the conversation.

Although these dialogue act taxonomies are meant to summarise the context from the front-end modules for the subsequent system components, they were originally designed to capture just enough of an utterance's meaning to provide rational system behaviour within the domain; thus, they are limited to capturing frequent yet simple conversations. Also, since there is still no agreement on what a common semantic representation should be, different annotation schemes have been applied (e.g., logical forms (Parsons, 1990) and Abstract Meaning Representation (AMR) (Chiang et al., 2013)). Although there has been an effort to standardise the dialogue act formalism (Bunt et al., 2010), it has only been adopted within the dialogue community due to its limited expressivity for general semantics. As a consequence, language generation by conditioning on a dialogue act representation, as in the RNNLG framework of Section 5.1, is constrained by the expressive power of the handcrafted semantic representation itself and can potentially overlook many useful features in the dialogue context.

If we consider the end-to-end text-based dialogue problem as a whole, the end goal of the system is to generate a response that is appropriate given the current dialogue context.
Therefore, the response generator can take the entire dialogue context as input and map it directly to a natural language response, without the provision of an intermediate dialogue act representation. There are several hypothetical benefits of this approach:

• Firstly, it liberates the generation model from the hard-coded dialogue act representation and allows it to produce more contextually natural replies;

• Secondly, it can model more complex dialogue scenarios and is easier to scale;

• Finally, the NLG and dialogue policy can be jointly optimised (Rieser and Lemon, 2010; Rieser et al., 2014) to avoid a suboptimal solution.

Furthermore, it also saves the additional labour of annotating the dialogue act representation, which requires expert knowledge and is hard to scale even by crowdsourcing. If the representation of the dialogue context can be primitive but as complete as possible, then a neural network-based response generator should be able to learn the appropriate dialogue intent via the representations formed by its layers of nonlinear transformations. In this way, the semantic representation of the system can be learnt as a latent component alongside the rest of the system modules, and any potentially impotent dialogue act annotations can be avoided.

7.1 End-to-End Dialogue Modelling

As mentioned in Chapter 2, building an end-to-end goal-oriented dialogue system, such as a hotel booking or technical support service, is difficult because it is application-specific, which results in limited training data. Therefore, machine learning approaches to goal-oriented dialogue systems typically divide the problem into modules arranged in a pipeline and cast it as a partially observable Markov Decision Process (POMDP) (Young et al., 2013). The aim of this approach is to use Reinforcement Learning (RL) to train dialogue policies online via interactions with real users (Gašić et al., 2013).
Although promising results have been achieved, there are still many issues that have not been properly addressed:

• Firstly, although policy optimisation can be learnt quite efficiently through RL in small domains (Gašić et al., 2013, 2015), we do not know whether RL can readily scale to learning policies in more complex state and action settings.

• Secondly, the handcrafted nature of the state and action space (Young et al., 2013, 2010) may restrict the expressive power and learnability of the model.

• Thirdly, the reward functions needed to train such models are difficult to design and hard to measure at run-time; therefore, reward functions generally rely on explicit user feedback (Su et al., 2016b, 2015).

• Finally, even though the policy can be directly optimised online via RL, the language comprehension (Henderson et al., 2014c; Yao et al., 2014) and language generation (Wen et al., 2015c, 2016b) modules still rely on supervised learning and therefore need corpora with additional semantic annotations to train on. As mentioned previously, these semantic annotations demand additional labour and expert knowledge, and cannot easily be obtained even by crowdsourcing.

At the other end of the spectrum, modelling chat-based dialogue as a sequence-to-sequence learning problem (Sutskever et al., 2014) has become a common theme in the deep learning community, owing to recent advances in training RNNs on large corpora and the computational tractability provided by modern hardware. This family of approaches treats dialogue as a source-to-target sequence transduction problem: an encoder network (Cho et al., 2014) encodes a user query into a distributed vector representing its semantics, which then conditions a decoder network to generate each system response.
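A toy version of this encode-then-condition step (the embeddings, vocabulary, and mean-pooling are hypothetical simplifications; real systems use recurrent cells):

```python
import math

# Hypothetical word embeddings for the user query.
EMB = {"book": [1.0, 0.0], "a": [0.1, 0.1], "hotel": [0.0, 1.0]}

def encode(tokens):
    """Mean-pool word embeddings as a stand-in for an RNN encoder state."""
    vecs = [EMB[t] for t in tokens]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def decode_step(context, out_emb):
    """One decoder step: a softmax over output words, conditioned on the
    encoded context vector."""
    scores = {w: sum(c * e for c, e in zip(context, emb))
              for w, emb in out_emb.items()}
    z = sum(math.exp(s) for s in scores.values())
    return {w: math.exp(s) / z for w, s in scores.items()}

# Hypothetical output vocabulary for the system response.
OUT = {"sure": [0.2, 0.2], "which": [0.0, 2.0]}
probs = decode_step(encode(["book", "a", "hotel"]), OUT)
```

The point of the sketch is the data flow: the entire user query is compressed into one vector, and every decoding decision is conditioned on it.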
For example, Vinyals and Le (2015) demonstrated a seq2seq-based model trained on a huge conversation corpus which, based on a qualitative assessment, learns interesting replies conditioned on different user queries and can provide reasonable answers to certain questions. Similar results were also reported by Shang et al. (2015) and Serban et al. (2015b).

Although it is interesting to see what the model can learn by looking at just the dialogue history, the current state of the art is still far from any real-world application. First of all, these models generally suffer from the generic response problem (Li et al., 2016a; Serban et al., 2016), in which the model struggles to produce causal or even diverse responses. Despite efforts to address this issue (e.g., modelling the persona (Li et al., 2016b), reinforcement learning (Li et al., 2016c), and introducing continuous latent variables (Cao and Clark, 2017; Serban et al., 2016)), the generic response remains a deep-rooted problem in many of these systems. Secondly, these models typically require a large amount of data to train but still lack the capability to support specific tasks, such as interacting with databases (Sukhbaatar et al., 2015; Yin et al., 2015) or aggregating useful information into responses. Lastly, the evaluation of a chat-based dialogue system is itself problematic: none of the existing automatic metrics correlates well with human perception (Liu et al., 2016)¹.

¹ The BLEU metric they used is at the sentence level. However, experiments have shown that BLEU should be computed at the corpus level (Papineni et al., 2002) to truly reflect its effectiveness.

This is partly because of the intrinsic difficulty in evaluating NLG
systems, as mentioned in Section 3.5, and partly because the chat-based scenario does not ground the conversation in any kind of knowledge or database. It is worth noting that when chatting with people, we humans constantly draw on the knowledge and experience we have accumulated; every conversation is more or less grounded in some knowledge, an experience, or at least the personality of the speaker. Therefore, hoping to evaluate a non-grounded chat-based NLG system via a test corpus with limited coverage makes little sense.

Another promising research direction is the Memory Network-based (MemNN) approach (Sukhbaatar et al., 2015; Weston et al., 2014) to dialogue modelling. First proposed for Question Answering (QA), a MemNN answers a user query by first embedding it as a vector. This query vector is then matched against a set of supporting vectors, themselves embedded from a set of existing supporting facts, which are retrieved via an attention mechanism. This process has been extended to cater to different tasks such as factoid question answering (Miller et al., 2016), children's book reading comprehension (Hill et al., 2016), and goal-oriented dialogue (Bordes and Weston, 2017). In Bordes and Weston (2017), the dialogue problem was divided into several sub-tasks and evaluated on both turn-level and dialogue-level success. The major benefit of using a MemNN for goal-oriented dialogue modelling is that the entire model can be optimised end-to-end with a single objective. This approach differs from the work done in this thesis, which leverages additional supervision signals to train an explicit belief-tracking component. As a result, their MemNN model is much more data-hungry, requiring more data to reach a useful level of performance for real-world applications. Bordes and Weston (2017) trained their model on DSTC 2 (Henderson et al., 2014a).
They achieved a 40% dialogue success rate by learning from approximately 2000 distinct dialogues. The model proposed in this thesis was experimented using the same domain but only requires about 600 dialogues to achieve a success rate of 80%. Moreover, their model only learns to select a response from a set of replies. In our work, the model learns to generate words to compose sentences. 7.2 Goal-oriented Dialogue as Conditional Generation Given the many different approaches to solve dialogue problems, basing conversations to some sort of knowledge base is a necessity for achieving rational system behaviour and producing contextual responses. For goal-oriented dialogue modelling, grounding the language on symbolic representations is important for two reasons: firstly, the symbolic representation is easier to manipulate and operate for modern computers; secondly, the 7.3 Neural Dialogue Model 83 symbolic representation provides a way to resolve the credit assignment problem2 and allows the model to learn more efficiently with less data. We need to keep in mind that we also want to annotate the input via some simple scheme so that it can be readily understood by non-experts allowing crowdsourcing to be an option for data collection. In this thesis, we propose to model a goal-oriented dialogue system3 as a response generation problem that is conditioned on the dialogue context and a knowledge base. The dialogue context provides clues for reasonable system replies, while the knowledge base encapsulates the domain of interest that the system can talk about. Formally speaking, given the user input utterance ut at turn t and the system’s knowledge base (KB), the machine parses the input into a set of actionable commands Q and accesses the KB to search for useful information to reply to the query. Based on the results, the model needs to summarise its retrieved knowledge and reply with an appropriate response mt in natural language. 
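The turn-level process just described (parse u_t into a query, search the KB, then respond) can be sketched as a minimal, hypothetical pipeline. Everything below is illustrative: the thesis implements each step with learned neural components, not the toy keyword rules used here, and all names (`parse`, `search`, `respond`, the sample KB) are our own.

```python
# A minimal, hypothetical sketch of one dialogue turn: parse the user
# utterance into a KB query, retrieve matching entities, and reply.
# The thesis replaces each toy function below with a learned component.

KB = [
    {"name": "thanh binh", "food": "vietnamese", "area": "west"},
    {"name": "galleria", "food": "european", "area": "centre"},
]

def parse(utterance):
    """Toy stand-in for belief tracking: keyword-match slot values."""
    query = {}
    for slot, values in {"food": ["vietnamese", "european"],
                         "area": ["west", "centre"]}.items():
        for v in values:
            if v in utterance.lower():
                query[slot] = v
    return query

def search(kb, query):
    """Return KB entities consistent with every constraint in the query."""
    return [e for e in kb if all(e.get(s) == v for s, v in query.items())]

def respond(matches):
    """Toy stand-in for the generation network."""
    if not matches:
        return "i am sorry , there are no matching restaurants ."
    return f"{matches[0]['name']} serves {matches[0]['food']} food ."

print(respond(search(KB, parse("I want some Vietnamese food"))))
```

The value of the decomposition is that each stage exposes a simple, inspectable interface (a slot-value query, a list of matching entities) that the neural model in this chapter learns to produce and consume.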
In this case, the coarse intermediate representation can just be the entities and attributes of the knowledge base. Annotations of discourse acts, or of the relations that hold between entities and attributes, can simply be learnt from data. Following this framework, if we can combine a competent learning model with an efficient and simple dialogue collection procedure, we can develop a model that not only learns directly from humans and picks up interesting conversational strategies automatically, but also scales better and evolves faster.

7.3 Neural Dialogue Model

As shown in Figure 7.1, the proposed Neural Dialogue Model (NDM) treats dialogue as a sequence-to-sequence mapping problem (Sutskever et al., 2014) augmented by the dialogue history (modelled by a set of belief trackers (Henderson et al., 2014c)) and the current database search outcome (modelled by a database operator). At each turn, the system takes a sequence of tokens from the user as input and converts it into two internal representations: a distributed representation generated by an intent network, and a probability distribution over slot-value pairs, called the belief state (Young et al., 2013), generated by a set of belief trackers. Note that all sentences are pre-processed via delexicalisation (Henderson et al., 2014c), whereby slot-value specific words are replaced with their corresponding generic tokens based on an ontology. The database operator then selects the most probable values in the belief state to form a query for the DB. The search result, together with the intent representation and belief state, forms the dialogue context representation, which contains all the necessary information to decide the machine response. This context representation is taken by an internal policy network as input to produce a vector representing the next system intent. This system intention vector is then used to condition a response generation network, which generates the required system output token by token in skeletal (delexicalised) form. The final system response is formed by substituting the actual values of the database entries into the skeletal sentence structure. A more detailed description of each component is given below.

Fig. 7.1 The proposed Neural Dialogue Model framework.

7.3.1 Intent Network

The intent network can be viewed as the encoder in the sequence-to-sequence learning framework (Sutskever et al., 2014), whose job is to encode the sequence of input tokens u_t = {w_0^t, w_1^t, ..., w_N^t} into a distributed vector representation u_t at every turn t. A Long Short-term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997) is typically used, and the hidden layer at the last time step is taken as the representation,

u_t = \overrightarrow{u}_t[N] = \overrightarrow{LSTM}(u_t)[N].   (7.1)

However, a bidirectional LSTM, which uses two LSTMs to read the input sentence from both directions, is often used instead, because it offers a more integral view of the input and automatically mitigates the vanishing gradient problem,

u_t = biLSTM(u_t) = \overrightarrow{LSTM}(u_t)[N] ⊕ \overleftarrow{LSTM}(u_t)[N].   (7.2)

Therefore, throughout this thesis, all the intent networks are bidirectional LSTMs. Since slot-value specific information is delexicalised, the encoded vector can be viewed as a distributed intent representation, which replaces the hand-coded dialogue act representation (Traum, 1999) of traditional goal-oriented dialogue systems.

Fig. 7.2 Tied Jordan-type RNN belief tracker with delexicalised CNN feature extractor. The output of the CNN feature extractor is a concatenation of the top-level sentence embedding (green) and several levels of intermediate n-gram-like embeddings (red and blue). However, if a value cannot be delexicalised in the input, its n-gram-like embeddings are padded with zeros. We zero-pad vectors (in grey) before each convolution operation to make sure the representation at each layer has the same length. The output of each tracker b_s^t is a distribution over the values of a particular slot s.

7.3.2 The RNN-CNN Dialogue State Tracker

Belief tracking (also called dialogue state tracking), as discussed in Section 2.3, provides the foundation of a goal-oriented dialogue system (Henderson, 2015b). Current state-of-the-art belief trackers use discriminative models, such as recurrent neural networks, to directly map ASR hypotheses to belief states (Henderson et al., 2014c; Mrkšić et al., 2016b). Although we focus on text-based dialogue systems in this work, we retain belief tracking at the core of our system for several reasons: firstly, belief tracking maps a sequence of free-form natural language sentences into a fixed set of slot-value pairs, which can then be used to query a DB; this mapping can be viewed as a simple version of a semantic parser (Berant et al., 2013). Secondly, by keeping track of the dialogue state, belief tracking avoids learning unnecessarily complicated long-term dependencies from raw inputs. Thirdly, it uses a smart weight-tying strategy that can greatly reduce the data needed to train the model. Lastly, it provides an inherent robustness, which simplifies future extensions to spoken systems.
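The delexicalisation pre-processing referred to above can be sketched as a simple substitution over the ontology. A minimal sketch follows; the placeholder format `<v.slot>` is an illustrative choice of ours, not necessarily the exact token format used in the actual system.

```python
# A minimal sketch of delexicalisation: replace slot-value specific
# words with generic placeholder tokens defined by an ontology.
# The "<v.slot>" placeholder format is illustrative only.

ONTOLOGY = {
    "food": ["vietnamese", "european", "indian"],
    "area": ["north", "south", "centre"],
    "pricerange": ["cheap", "moderate", "expensive"],
}

def delexicalise(sentence):
    tokens = sentence.lower().split()
    out = []
    for tok in tokens:
        for slot, values in ONTOLOGY.items():
            if tok in values:
                tok = f"<v.{slot}>"   # replace the value with a generic token
                break
        out.append(tok)
    return " ".join(out)

print(delexicalise("thanh binh serves cheap Vietnamese food in the centre"))
# "thanh binh serves <v.pricerange> <v.food> food in the <v.area>"
```

Because the generic tokens are shared across all values of a slot, downstream components can generalise across values; the concrete values are substituted back in only at the final response-formation step.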
Taking each user input as new evidence, the task of a belief tracker is to maintain a multinomial distribution b_s over values v ∈ V_s for each informable slot s, and a binary distribution for each requestable slot. (Informable slots are slots that users can use to constrain the search, such as food type or price range, while requestable slots are slots whose value users can ask for, such as address.) Each slot in the ontology G4 has its own specialised tracker: a Jordan-type (recurrence from output to hidden layer) (Jordan, 1989) RNN5 tracker with a CNN-based feature extractor, as shown in Figure 7.2. As in Mrkšić et al. (2015), we tie the RNN weights together across all values v but vary the features f_v^t when updating each pre-softmax activation g_v^t. The update equations for a given slot s are

f_v^t = f_{v,cnn}^t ⊕ b_v^{t−1} ⊕ b_∅^{t−1},   (7.3)
g_v^t = w_s · sigmoid(W_s f_v^t + l_s) + l′_s,   (7.4)
b_v^t = exp(g_v^t) / ( exp(g_{∅,s}) + Σ_{v′∈V_s} exp(g_{v′}^t) ),   (7.5)

where vector w_s, matrix W_s, bias terms l_s and l′_s, and scalar g_{∅,s} are parameters. b_∅^t is the probability that the user has not mentioned that slot up to turn t, and can be calculated by substituting g_{∅,s} for g_v^t in the numerator of Equation 7.5. Note that the top-level weight vector w_s is tied across the individual values v; the tracker can therefore learn to identify rare or even unseen values by leveraging the training data points of the other values.

4 A small knowledge graph defining the slot-value pairs that the system can talk about, for a particular task.
5 We do not use the recurrent connection for requestable slots, since they do not need to be tracked.
To model the discourse context at each turn, the feature vector f_{v,cnn}^t is the concatenation of two CNN-derived features, one from processing the user input u_t at turn t and the other from processing the machine response m_{t−1} at turn t−1,

f_{v,cnn}^t = CNN_{s,v}^{(u)}(u_t) ⊕ CNN_{s,v}^{(m)}(m_{t−1}),   (7.6)

where every token in u_t and m_{t−1} is represented by an embedding of size N, derived from a 1-hot input vector. This can be visualised in the tracker block of Figure 7.1, where two CNNs are used to process the two sentences. To make the tracker aware of any delexicalisation applied to a slot or value, the slot-value specialised CNN operator CNN_{s,v}^{(·)}(·) extracts not only the top-level sentence representation but also the intermediate n-gram-like embeddings determined by the position of the delexicalised token in each utterance. If multiple matches are observed, the corresponding embeddings are summed. If, on the other hand, there is no match for the slot or value, each empty n-gram embedding is padded with zeros. To keep track of the position of delexicalised tokens, both sides of the sentence are padded with zeros before each convolution operation so that each intermediate representation has the same size as the original input; the number of padding vectors is determined by the filter size at each layer. The overall process of extracting several layers of position-specific features is visualised in Figure 7.2.

The belief tracker described above is based on Henderson et al. (2014c), with some modifications: firstly, only probabilities over informable and requestable slots and values are output, because we want to keep the annotations as simple as possible; secondly, the recurrent memory block is removed, since it appears to offer no benefit in this task; and thirdly, the n-gram feature extractor is replaced by the CNN extractor described above to allow for more flexible feature extraction.

By introducing slot-based belief trackers, we essentially add a set of intermediate labels into the system, compared with training a pure end-to-end system. These tracker components have been shown to be critical for achieving task success (Wen et al., 2017b). We will also show that the additional annotation effort they introduce can be successfully mitigated using a novel pipelined Wizard-of-Oz (WoZ) data collection framework.

7.3.3 Database Operator and Deterministic Policy Network

Database Operator

Based on the output b_s^t of the belief trackers, the DB query q_t is formed by

q_t = ⋃_{s′∈S_I} { argmax_v b_{s′}^t },   (7.7)

where S_I is the set of informable slots. This query is then applied to the DB to create a binary truth-value vector x_t over the DB entities, where a 1 indicates that the corresponding entity is consistent with the query (and hence with the most likely belief state). In addition, if x_t is not all-zero, an associated entity pointer is maintained, identifying one of the matching entities selected at random. The entity pointer is updated if the current entity no longer matches the search criteria; otherwise it stays the same. The entity referenced by the entity pointer is used to form the final system response, as described in Section 7.3.4.

Policy Network

The policy network can be viewed as the decision-making centre that takes the dialogue context and transforms it into an action representation of the system. Its output is a single vector z_t representing the system action, and its inputs comprise u_t from the intent network, the belief state b_s^t, and the DB truth-value vector x_t. Since the generation network only generates appropriate sentence forms, the individual probabilities of the categorical values in the informable belief state are immaterial.
They are therefore summed to form a summary belief vector for each slot, b̂_s^t, represented by three components: the summed value probabilities, the probability that the user said they "don't care" about this slot, and the probability that the slot has not been mentioned. Similarly, for the truth-value vector x_t, the number of matching entities matters but not their identity. This vector is therefore compressed to a 6-bin 1-hot encoding x̂_t, which represents varying degrees of matching in the DB (no match, 1 match, ..., or more than 5 matches). Finally, the policy network output is generated by a three-way matrix transformation

z_t = tanh(W_{uz} u_t + W_{bz} b̂_t + W_{xz} x̂_t),   (7.8)

where the matrices {W_{uz}, W_{bz}, W_{xz}} are parameters and b̂_t = ⊕_{s∈G} b̂_s^t is the concatenation of all the summary belief vectors. This can be viewed as deterministic policy modelling, because the policy is completely determined by the input.

Fig. 7.3 Three different conditional generation architectures: (a) language model type LSTM; (b) memory type LSTM; (c) hybrid type LSTM.

Attentive Policy Network

An attention-based mechanism provides an effective approach for aggregating multiple information sources for prediction tasks. Instead of using a simple deterministic matrix transformation for decision-making, we explore the use of an attention mechanism to combine the tracker belief states, in which case the policy network of Equation 7.8 is modified as

z_t^j = tanh(W_{uz} u_t + W_{xz} x_t + Σ_{s∈G} α_s^j W_{bz}^s b_s^t),

where j is the generation step of the decoder and the attention weights α_s^j are calculated via

α_s^j = softmax( r^⊤ tanh( W_r · (v_t ⊕ b_s^t ⊕ w_j^t ⊕ h_{j−1}^t) ) ),

where v_t = u_t + x_t, w_j^t and h_{j−1}^t are the current-step word embedding and the previous-step decoder context, respectively, and {W_r, r} are learnable parameters. By adopting an attention mechanism in the policy network, the action embedding is no longer constant during decoding. This allows the generator to selectively focus on part of the input tracker space, rather than conditioning on the entire belief state, while generating the next word. However, the action embedding z_t^j is still determined solely by the input, so neither the decision variations present in the data nor any intrinsic agent decisions can yet be modelled.

7.3.4 Conditional LSTM Generator and Its Variants

Conditioned on the system action representation z_t provided by the policy network, the decoder module uses a conditional LSTM LM to generate the required system output token by token in skeletal form6. The final system response can then be formed by substituting the actual values of the database entries into the skeletal sentence structure. This is different from the situation in Section 5.1, where the dialogue act taxonomy served as an intermediate representation that could easily be broken down into several meaningful pieces (slots and values) and handled individually by separate sigmoid gates. Here, in contrast, the action representation lives in a high-dimensional, continuous space, which is hard to interpret. We therefore give up the SC-LSTM-like explicit gating controls and instead simply condition the LSTM decoder on the action vector. In this thesis, we study and analyse three variants of LSTM-based conditional generation architecture: the LM-type, the memory-type, and the hybrid-type decoder. More details are provided in the sections that follow.

6 The output is delexicalised by replacing slot-value specific realisations with generic tokens.
Language Model Type Decoder

The most straightforward way to condition the LSTM network on additional source information is to concatenate the conditioning vector z_t with the input word embedding w_j and the previous hidden layer h_{j−1},

[ i_j ; f_j ; o_j ; ĉ_j ] = [ sigmoid ; sigmoid ; sigmoid ; tanh ] ( W_{4n,3n} [ z_t ; w_j ; h_{j−1} ] )
c_j = f_j ⊙ c_{j−1} + i_j ⊙ ĉ_j
h_j = o_j ⊙ tanh(c_j),

where the index j is the generation step, n is the hidden layer size, i_j, f_j, o_j ∈ [0,1]^n are the input, forget, and output gates, respectively, ĉ_j and c_j are the proposed and true cell values at step j, and W_{4n,3n} are model parameters. The model is shown in Figure 7.3a. Since the resulting model does not differ significantly from the original LSTM, we call it the language model type (lm) conditional generation network.

Memory Type Decoder

The memory type (mem) conditional generation network was introduced by Wen et al. (2015c) and is shown in Figure 7.3b. Here the conditioning vector z_t is governed by a standalone reading gate r_j, which decides how much information should be read from the conditioning vector and written directly into the memory cell c_j,

[ i_j ; f_j ; o_j ; r_j ] = sigmoid( W_{4n,3n} [ z_t ; w_j ; h_{j−1} ] )
ĉ_j = tanh( W_c (w_j ⊕ h_{j−1}) )
c_j = f_j ⊙ c_{j−1} + i_j ⊙ ĉ_j + r_j ⊙ z_t
h_j = o_j ⊙ tanh(c_j),

where W_c is an additional weight matrix the network needs to learn. The idea is that the model isolates the conditioning vector from the LM, so that it has more flexibility to learn a trade-off between the two.
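The lm-type update above can be sketched in a few lines of pure Python. The sizes and weight values below are illustrative (n = 2 hidden units, no bias terms), chosen only to make one step of the recurrence concrete.

```python
# Toy sketch of one step of the lm-type conditional LSTM (Fig. 7.3a).
# W has 4n rows and 3n columns; the input is the concatenation
# [z_t ; w_j ; h_{j-1}]. Sizes and weights are illustrative; no biases.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def lm_type_step(z, w, h_prev, c_prev, W):
    n = len(h_prev)
    a = matvec(W, z + w + h_prev)               # 4n pre-activations
    i = [sigmoid(x) for x in a[0:n]]            # input gate
    f = [sigmoid(x) for x in a[n:2*n]]          # forget gate
    o = [sigmoid(x) for x in a[2*n:3*n]]        # output gate
    c_hat = [math.tanh(x) for x in a[3*n:4*n]]  # proposed cell value
    c = [fj*cj + ij*chj for fj, cj, ij, chj in zip(f, c_prev, i, c_hat)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c

n = 2                                            # illustrative size: W is 8 x 6
W = [[0.1 * (r + s) for s in range(3 * n)] for r in range(4 * n)]
h, c = lm_type_step(z=[1.0, 0.0], w=[0.5, 0.5],
                    h_prev=[0.0, 0.0], c_prev=[0.0, 0.0], W=W)
assert len(h) == n and len(c) == n
assert all(-1.0 < x < 1.0 for x in h)   # h = o ⊙ tanh(c) stays in (-1, 1)
```

The mem- and hybrid-type variants differ only in where z_t is injected: the mem type adds r_j ⊙ z_t into the cell update for c_j, while the hybrid type adds it directly into the output h_j.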
Hybrid Type Decoder

Continuing the idea of the memory type network, a complete separation of the conditioning vector and the LM (except for the gate controlling the signal) is provided by the hybrid-type network shown in Figure 7.3c,

[ i_j ; f_j ; o_j ; r_j ] = sigmoid( W_{4n,3n} [ z_t ; w_j ; h_{j−1} ] )
ĉ_j = tanh( W_c (w_j ⊕ h_{j−1}) )
c_j = f_j ⊙ c_{j−1} + i_j ⊙ ĉ_j
h_j = o_j ⊙ tanh(c_j) + r_j ⊙ z_t.

This model was motivated by the fact that the conditioning vector does not need a long-term dependency, because its information is re-applied at every step j anyway. Decoupling the conditioning vector from the LM is attractive because it leads to better interpretability of the results and offers the potential to learn from isolated conditioning vectors and LMs.

7.4 Wizard-of-Oz Data Collection

Arguably the greatest bottleneck for statistical approaches to dialogue system development is the collection of appropriate training data, and this is especially true for task-oriented dialogue systems. Serban et al. (2015a) have catalogued existing corpora for developing conversational agents, and such corpora may be useful for bootstrapping; for task-oriented dialogue systems, however, in-domain data is essential7. To mitigate this problem, we propose a novel crowdsourced version of the Wizard-of-Oz (WoZ) paradigm (Kelley, 1984) for collecting domain-specific corpora. Based on the given ontology, we designed two webpages on Amazon Mechanical Turk, one for wizards and the other for users (see Appendix D for the designs). The users are given a task specifying the characteristics of a particular entity that they must find (e.g. a Chinese restaurant in the north) and are asked to type natural language sentences to fulfil the task. The wizards are given a form to record the information conveyed in the last user turn (e.g. food=Chinese, area=north). They are also provided with a search table showing all the matching entities in the database. We note that these forms contain all the labels needed to train the slot-based belief trackers. The table is updated automatically every time the wizard submits new information. Based on the updated table, the wizard types an appropriate system response and the dialogue continues.

To enable large-scale parallel data collection and avoid the distracting latencies inherent in conventional WoZ scenarios (Bohus and Rudnicky, 1999), users and wizards are asked to contribute only a single turn to each dialogue. To ensure coherence and consistency, users and wizards must review all previous turns in a dialogue before they contribute their own turn. Dialogues thus progress in a pipeline: many dialogues can be active in parallel, and no worker needs to wait for the other party in a dialogue to respond. Even though multiple workers contribute to each dialogue, we observe that the dialogues are generally coherent, yet diverse. To ensure the quality of the collected data, two things are especially important:

• Setting a high bar for worker qualifications8.
• Collecting data iteratively and rejecting HITs whenever the collected example is unacceptable.

Furthermore, the turn-level data collection strategy seems to encourage workers to learn from and correct each other based on previous turns.

In this thesis, the system was designed to assist users in finding a restaurant in the Cambridge, UK area. There are three informable slots (food, pricerange, area), which users can use to constrain the search, and six requestable slots (address, phone, postcode, plus the three informable slots) that the user can ask about once a restaurant has been offered. There are 99 restaurants in the DB. Based on this domain, we ran 3000 HITs (Human Intelligence Tasks) over roughly 3 days and collected roughly 2750 dialogue turns (counting both user and system turns). After cleaning the data, we obtained approximately 680 dialogues in total (some of them unfinished). The total cost of collecting this dataset was ∼400 USD.

7 e.g., technical support for Apple computers may differ completely from that for Windows, due to the many differences in software and hardware.
8 In our case, setting the approval rate above 97% worked best.

7.5 Evaluation on Neural Dialogue Models

To evaluate the effectiveness of the proposed NDM as well as of the WoZ data collection, both a corpus-based evaluation and a human evaluation were performed. Details are provided in the next section.

7.5.1 Experimental Setup and Evaluation Metrics

Training

Training of the NDM is divided into two phases. Firstly, the belief tracker parameters θ_b are trained using the cross-entropy errors between the tracker labels y_s^t and the predictions b_s^t,

L_1(θ_b) = −Σ_t Σ_s (y_s^t)^⊤ log b_s^t.

For the full model, we have three informable trackers (food, pricerange, and area) and six requestable trackers (address, phone, postcode, plus the three informable slots). An additional tracker was also trained to detect user goal changes (handling queries like "is there anything else?"). Having fixed the tracker parameters, the remaining parts of the model θ_{\b} are trained using the cross-entropy errors from the generation network language model,

L_2(θ_{\b}) = −Σ_t Σ_j (y_j^t)^⊤ log p_j^t,

where y_j^t and p_j^t are the target and predicted tokens, respectively, at output step j of turn t. We treated each dialogue as a batch and used the Adam optimiser (Kingma and Ba, 2014) with a small L2-regularisation term to train the model. The collected corpus was partitioned into training, validation, and testing sets in the ratio 3:1:1.
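The tracker objective L_1 above is a standard cross-entropy summed over turns and slots; a toy sketch follows. In practice this is computed over mini-batches with automatic differentiation, whereas here the structures are plain nested lists chosen for illustration.

```python
# Toy sketch of the tracker cross-entropy L_1, summed over turns t
# and slots s. labels[t][s] is a 1-hot target y_s^t; predictions[t][s]
# is the tracker distribution b_s^t over the same values.
import math

def tracker_loss(labels, predictions):
    loss = 0.0
    for y_turn, b_turn in zip(labels, predictions):
        for y, b in zip(y_turn, b_turn):
            loss -= sum(yi * math.log(bi) for yi, bi in zip(y, b))
    return loss

# One turn, one slot with three values; the label is the second value.
labels = [[[0.0, 1.0, 0.0]]]
predictions = [[[0.1, 0.8, 0.1]]]
loss = tracker_loss(labels, predictions)
assert abs(loss - (-math.log(0.8))) < 1e-9   # only the labelled value counts
```

The generation loss L_2 has exactly the same shape, with output steps j in place of slots s and token distributions p_j^t in place of belief distributions.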
Early stopping was implemented based on the validation set, for which the maximum degradation multiplier was set to 2, and gradient clipping was set to 1. All hidden layer sizes were set to 50 and all weights, including the word embeddings, were randomly initialised between −0.3 and 0.3. For both the LSTM encoder and decoder, the bias term of the LSTM forget gate was initialised to 2, as suggested in Jozefowicz et al. (2015). The vocabulary size is approximately 500 for both input and output, after rare words and words that can be delexicalised are removed. We used three convolutional layers for all the belief tracking CNNs, and all filter sizes were set to 3. Pooling operations were applied only after the final convolution layer.

Decoding

To decode without length bias, we decoded each system response m_t based on the average log probability of its tokens,

m_t* = argmax_{m_t} { log p(m_t | θ, u_t) / ‖m_t‖ },   (7.9)

where θ denotes the model parameters. As studied in Wen et al. (2017b), the repetition of slot or value tokens can hurt the model because it may generate placeholders that cannot be properly realised. Therefore, we introduce an additional penalty for repeated tokens that carry slot-value specific information, and modify the decoding criterion to

m_t* = argmax_{m_t} { log p(m_t | θ, u_t) / ‖m_t‖ + α R_t },   (7.10)

where α is a trade-off parameter, empirically set to 1.0. We used a simple heuristic for the scoring function R_t,

R_t = −∞ if a value token repeats; R_t = −(1/2)(r−1)² if a slot token repeats,   (7.11)

where r is the number of repetitions of that token. Note that Equation 7.11 allows slot token repetitions, to some degree, since the loss grows only polynomially.
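The decoding criterion of Equations 7.10-7.11 can be sketched as a beam re-scoring function. As reconstructed here, a repeated value token receives −∞ and a slot token repeated r times receives −(1/2)(r−1)²; the placeholder-token prefixes used to classify tokens below are our own illustrative assumption.

```python
# Sketch of the decoding re-score (Eq. 7.10) with the repetition
# penalty R_t (Eq. 7.11, as reconstructed). The "<v." / "<s." token
# prefixes for value/slot placeholders are illustrative assumptions.
import math

def repetition_penalty(tokens):
    counts = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1
    penalty = 0.0
    for tok, r in counts.items():
        if r > 1 and tok.startswith("<v."):      # repeated value token: prune
            return -math.inf
        if r > 1 and tok.startswith("<s."):      # repeated slot token: soft cost
            penalty -= 0.5 * (r - 1) ** 2
    return penalty

def rescore(log_prob, tokens, alpha=1.0):
    """Length-normalised hypothesis score of Eq. 7.10."""
    return log_prob / len(tokens) + alpha * repetition_penalty(tokens)

assert repetition_penalty(["<s.food>", "<s.food>"]) == -0.5
assert repetition_penalty(["<v.food>", "<v.food>"]) == -math.inf
```

A hypothesis scoring −∞ is effectively removed from the beam, which matches the observation that repeated value placeholders can never be realised correctly.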
However, the heuristic cannot tolerate value token repetitions: such hypotheses are simply pruned from the beam during decoding, because the model would not be able to properly realise the additional value-specific tokens. It is also worth mentioning that, unlike Wen et al. (2017b), we do not apply a fully-weighted decoding strategy here, because we want to keep the decoding as simple as possible so that any performance gain comes mainly from the intrinsic capabilities of the model. Furthermore, the MMI criterion (Li et al., 2016a) was found to be ineffective for this task. During decoding, the beam width was set to 10 and the search stops when an end-of-sentence token is generated.

Metrics

We compared models trained with different recipes by performing a corpus-based evaluation, in which the model is used to predict each system response in the held-out test set. Three evaluation metrics were used: the BLEU score on the top-1 generated candidates (Papineni et al., 2002), the objective task success rate (Su et al., 2015), and a weighted combination of the two. A dialogue is marked as successful if both of the following conditions are true: (1) the offered entity matches the task that was specified by the user, and (2) the system answered all the associated information requests (e.g., what is the address?) from the user. We computed the BLEU scores on the skeletal sentence forms, before substituting in the actual entity values. The weighted metric was computed by multiplying the BLEU score by 0.5 and adding it to the success rate. It serves as the main indicator when comparing models, because we care about the success rate slightly more than BLEU, although both are important.

7.5.2 Results

The corpus-based evaluation results are shown in Table 7.1. Since neural networks are usually sensitive to random seeds and hyper-parameters, we ran a small grid search over the following hyper-parameters: the initial learning rate of Adam, with range [0.008, 0.010, 0.012]; the L2 regularisation term, with range [0.0, 1e−6, 1e−5, 1e−4]; and the random seed, with values ranging from 1 to 5. We trained a model for each of these hyper-parameter combinations, picked the model that performed best on the validation set, and report its performance on the testing set.

Table 7.1 Corpus-based experiment comparing different NDM architectures. The results were obtained by training models with several hyper-parameter settings; the testing performance is based on the best model on the validation set.

Model           Decoder   Success(%)   BLEU    Suc.+0.5 BLEU
Vanilla NDM     lm        72.8         0.237   0.847
                mem       74.3         0.243   0.865
                hybrid    77.9         0.231   0.894
Attentive NDM   lm        72.1         0.246   0.844
                mem       80.1         0.240   0.921
                hybrid    77.9         0.234   0.896

Table 7.2 Human assessment of the NDM. The ratings for comprehension and naturalness are both out of 5.

Metric            NDM
Success           98%
Comprehension     4.11
Naturalness       4.05
# of dialogues: 245

Table 7.3 A comparison of the NDM with a rule-based modular system (HDC). A two-tailed binomial test was used.

Metric                NDM      HDC      Tie
Subj. Success         96.95%   95.12%   -
Avg. # of Turns       3.95     4.54     -
Comparisons(%)
  Naturalness         46.95*   25.61    27.44
  Comprehension       45.12*   21.95    32.93
  Preference          50.00*   24.39    25.61
  Performance         43.90*   25.61    30.49
* p < 0.005, # of comparisons: 164

Fig. 7.4 The action vector embedding z_t generated by the vanilla NDM model. Each cluster is labelled with the first three words that the embedding generated.

As suggested by Table 7.1, the attentive NDMs generally perform better than the vanilla NDMs in terms of both metrics.
This is mainly due to the inclusion of the attention mechanism, which gives the model the freedom to change focus while generating a sentence, and is consistent with the findings of Wen et al. (2016a). However, there is no clear best choice of language decoder, even though Wen et al. (2016a) suggested that the hybrid-type LSTM decoder, with its complete separation of conditioning vector and language model, yielded the best results. This discrepancy can be attributed to two major differences: firstly, in Wen et al. (2016a) the encoder was a single-direction LSTM, whereas in this thesis a bidirectional LSTM encoder is used; secondly, all the LSTM forget gate biases were here set to 2, as suggested in Jozefowicz et al. (2015), which Wen et al. (2016a) did not do. Because both of these methods address the same vanishing gradient problem (Bengio et al., 1994) that arises when training RNNs, the benefit of the decoder separation is offset.

7.5.3 Human Evaluation

To assess operational performance, we tested our model with paid subjects recruited via Amazon Mechanical Turk. Each judge was asked to perform a task and to rate the model's performance. We assessed the subjective success rate and the perceived comprehension ability and naturalness of the responses, the latter two on a scale of 1 to 5. The attentive NDM was used, and the system was tested on a total of 245 dialogues. As shown in Table 7.2, the average subjective success rate was 98%, which suggests that the system was able to complete the majority of tasks. Moreover, the comprehension ability and naturalness scores both averaged more than 4 out of 5. (See Table 7.4 for sample dialogues from this trial.)

We also compared the NDM against a handcrafted, modular baseline system (HDC), consisting of a handcrafted semantic parser, a rule-based policy and belief tracker, and a template-based generator. The results are shown in Table 7.3. The HDC system achieved a ∼95% task success rate, which suggests that it is a strong baseline, even though most of its components were hand-engineered. Over the 164 dialogues tested, the NDM was considered better than the handcrafted system on all the metrics compared. Although both systems achieved similar success rates, the NDM was more efficient and provided a more engaging conversation (lower turn number and higher preference). Moreover, the comprehension ability and naturalness of the NDM were also rated significantly higher, which suggests that the learned system was perceived as more natural than the hand-designed one.

Action Embedding

A major argument for learning end-to-end dialogue systems is that we can learn directly from human-human conversations without the need for an explicitly handcrafted semantic representation, such as dialogue acts. Consequently, analysing what the NDM learns as its internal semantic representation is an important remaining task. To better understand what the conditioning vector z_t represents, we used t-SNE (van der Maaten and Hinton, 2008) to produce a reduced-dimensional view of the z_t embeddings, plotted and labelled by the first three generated output words9. The result is shown in Figure 7.4. We can see soft intent clusters scattered in different places, even though we did not explicitly model them using dialogue acts. This shows the main advantage of the NDM: it learns to group together states and actions that should be close to each other, via the direct supervision signal from the output responses, which helps the dialogue policy generalise through the learnt high-dimensional state and action spaces.

7.6 Conclusions

In this chapter, we investigated the possibility of training a dialogue response generator by conditioning on a broader context, without the provision of a handcrafted semantic representation such as dialogue acts.
This is an important step forward, not only because it allows the generator to produce more contextual and natural replies, but also because it requires less manual annotation labour, and the complementary data collection procedure is much easier to scale using crowdsourcing platforms.

9 The vanilla NDM was analysed here.

In the second section, we reviewed a series of studies on end-to-end dialogue modelling, covering both chat-based and goal-oriented dialogue modelling. We concluded that a dialogue agent must be grounded in real-world knowledge or databases that enable causal and rational system behaviours. Furthermore, to develop systems that have only limited access to training data, we introduced a coarse-grained intermediate representation so that the model can learn effectively from less training data by leveraging the intermediate annotations, without compromising the system's scalability.

The main contribution of this chapter is the Neural Dialogue Model (NDM) for goal-oriented dialogue modelling, which balances the strengths and weaknesses of two dialogue research communities: firstly, the model is end-to-end trainable but remains modularly connected, which allows us to combine the advantages of both end-to-end chat bots and modular task-oriented dialogue systems; secondly, it has an explicit representation of database (DB) attributes (slot-value pairs), which it uses to achieve high task success rates, but a distributed representation of user intent to allow for ambiguous inputs; thirdly, it uses delexicalisation and a weight tying strategy (Henderson et al., 2014c) to reduce the data required to train the model, but still maintains a high degree of freedom should larger amounts of data become available. We showed that the NDM can learn from a human-human conversational corpus via supervised learning and perform a task competitively across several metrics when trained on only a few hundred dialogues.
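The delexicalisation step mentioned above can be illustrated with a minimal sketch. The lexicon entries and placeholder tokens below are illustrative stand-ins, not the actual ontology used in the thesis:

```python
import re

# Hypothetical slot-value lexicon; the real system derives these
# from the DB ontology rather than hard-coding them.
LEXICON = {
    "[v.food]": ["indian", "chinese", "korean"],
    "[v.area]": ["east", "west", "centre"],
    "[v.name]": ["curry prince", "little seoul"],
}

def delexicalise(utterance):
    """Replace known slot values with placeholder tokens so that
    different surface forms share parameters (cf. Henderson et al., 2014c)."""
    out = utterance.lower()
    for token, values in LEXICON.items():
        # Replace longer values first so multi-word names win.
        for v in sorted(values, key=len, reverse=True):
            out = re.sub(r"\b" + re.escape(v) + r"\b", token, out)
    return out

print(delexicalise("Curry Prince serves Indian food in the east"))
# → "[v.name] serves [v.food] food in the [v.area]"
```

Because every restaurant name or food type collapses to the same token, a few hundred dialogues suffice to cover the response patterns, and the placeholders are re-lexicalised from the DB at generation time.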
Furthermore, to train the model for the target application, we introduced a novel pipelined data collection mechanism inspired by the Wizard-of-Oz paradigm (Kelley, 1984) to collect human-human dialogue corpora via crowd-sourcing. This data collection method collects dialogue corpora on a turn-by-turn basis and allows the crowdsourced work to be easily done by non-experts. We found that this process is simple and enables fast data collection online with very low development costs.

Table 7.4 Samples of real text conversations between online judges and NDM.

Human : Restaurant in any area that serves Vietnamese food
Machine : thanh binh serves vietnamese food type , is there anything else i can help you with ?
Human : What is the phone number and post code
Machine : the number is 01223 362456 and the area code is c.b 3, 0 a.f . is there anything else i can do for you ?
Human : Good bye
Machine : thank you for using the cambridge restaurant system .

Human : i am looking for moderately priced halal food
Machine : i am sorry , there are no halal restaurants in the moderate price range . would you like something else ?
Human : how about european food
Machine : galleria serves european food type and is in the moderate price range .
Human : address and postcode
Machine : the address is 33 bridge street and the area code is c.b 2, 1 u.w . is there anything else i can help you with today ?
Human : thanks
Machine : galleria is a european restaurant in the moderate price range . would you like their phone ?
Human : thank you
Machine : thank you for using the cambridge restaurant system . good bye

Human : Hi, I'm hungry for some Indonesian food
Machine : i am sorry , there are no indonesian restaurants . would you like a different food ?
Human : Oh no. How about Indian?
Machine : there are 9 restaurants serving indian types of food . do you have a preference on side of town ?
Human : East side please
Machine : sitar tandoori is a indian restaurant in the east side of town . would you like their information ?
Human : Yes, the address and phone number please
Machine : sitar tandoori is located at 43 high street cherry hinton cherry hinton and their phone number is 01223 249955 .
Human : Okay, thank you
Machine : thank you , goodbye .

Human : hello
Machine : what cuisine would you like ?
Human : afghan
Machine : i am sorry , there are no afghan restaurants . would you like something else ?
Human : korean
Machine : little seoul serves korean food , it is located at 108 regent street city centre .
Human : is it expensive?
Machine : little seoul serves korean food and is in the expensive price range .
Human : what area is that
Machine : little seoul is located in the centre area of town . would you like their number ?
Human : yes
Machine : the telephone number is 01223 308681 . is there anything else i can help you with today ?
Human : no goodbye
Machine : thank you for using the cambridge restaurant system , goodbye .

Human : I want to find a restaurant in the east part of town that serves creative food
Machine : i am sorry , there are no creative restaurants in the east side of town . would you like a different restaurant ?
Human : how about indian food?
Machine : curry prince is a indian restaurant in the east of town . would you like their phone number ?
Human : yes, may i have the address and phone number?
Machine : curry prince is located at 451 newmarket road fen ditton and their phone is 01223 566388 . is there anything else i can help you with today ?
Human : no, thank you goodbye
Machine : thank you for using the cambridge restaurant system . good bye

Machine : what food are you interested ?
Human : english
Machine : i am sorry , there are no english restaurants . would you like to try a different food type ?
Human : chinese
Machine : there are 8 restaurants serving chinese type of cuisine . do you have a preference of area ?
Human : south part
Machine : the good luck chinese food takeaway is a chinese restaurant in the south of town . would you like the phone number ?
Human : no. just address and price range
Machine : the good luck chinese food takeaway is located at 82 cherry hinton road cherry hinton . is there anything else i can help you with ?
Human : what is their price range?
Machine : the good luck chinese food takeaway is expensively priced chinese restaurant in the south area . would you like more information ?
Human : no goodbye
Machine : thank you for using the cambridge restaurant system . good bye .

Human : Is there a gastropub anywhere around here?
Machine : there are 3 gastropub restaurants . would you like more information ?
Human : Is there one on the west side?
Machine : i am sorry , there are no gastropub restaurants in the west side of town . would you like something else ?
Human : How about east?
Machine : royal standard is a gastropub restaurant in the east of town .
Human : What's the address?
Machine : the address is 290 mill road city centre .
Human : Thanks, bye
Machine : thank you for using the cambridge restaurant system . good bye

Chapter 8 Generation based on a Latent Policy

In the previous chapter, we proposed the idea of dialogue response generation by conditioning on the rich dialogue context instead of dialogue acts, learning end-to-end dialogue modelling directly via decoder supervision. Although the proposed NDM has demonstrated that it can learn conversational strategies from the training corpus and performs quite well in achieving task success, it has an apparent defect: it cannot model the intrinsic variation in either dialogue policy or language expression. Like most neural networks, such as LSTMs and sequence-to-sequence models, the NDM belongs to the family of deterministic models, where outcomes are precisely determined via known relationships among states and events, without room for random variation.
Typically, discriminative models are trained to learn only conditional output distributions over strings, and despite the sophisticated architectures and conditioning mechanisms used to ensure salience, they cannot model the underlying actions needed to generate natural dialogues. Thus, these models are limited by their inability to exhibit the intrinsic variability and stochasticity of natural dialogue. In addition, there is often insufficient training data for goal-oriented dialogues, which leads to over-fitting and prevents deterministic models from learning effective and scalable interactions. Another difficulty with end-to-end dialogue models is the lack of an interpretable interface for controlling system responses, which poses a great challenge when debugging real-world applications.

In this chapter, we propose to address the NDM's drawbacks with a latent variable model and introduce the Latent Intention Dialogue Model (LIDM), which can learn complex distributions of communicative intentions in goal-oriented dialogues. In the LIDM, each system action is represented as a dimension of a discrete latent variable, instead of a deterministic vector as in the NDM. The system action (or intention), which is sampled from the latent distribution inferred from the user's input utterance, is combined with the dialogue context representation to guide the generation of the system response. Since the learnt latent space is discrete, it can be interpreted through the responses it generates; the space therefore serves as an interface for human intervention and manipulation. Furthermore, this discrete latent variable also provides the model with an interface for performing different learning paradigms under the same framework. This is important because it provides a stepping stone towards building an autonomous dialogue agent that can continuously improve itself via interaction.
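The contrast between a deterministic action vector and a sampled discrete intention can be sketched as follows; the policy scores and the intention set below are purely illustrative numbers, not model outputs:

```python
import numpy as np

def sample_intention(logits, rng):
    """Draw a discrete intention z ~ Categorical(softmax(logits)),
    instead of always taking a single deterministic action."""
    p = np.exp(logits - logits.max())   # numerically stable softmax
    p /= p.sum()
    return int(rng.choice(len(p), p=p)), p

rng = np.random.default_rng(1)
logits = np.array([2.0, 1.0, 0.1, -1.0])   # illustrative policy scores

z, p = sample_intention(logits, rng)
# Repeated samples exhibit exactly the stochasticity a deterministic
# model lacks: the same state can yield different (but plausible) actions.
counts = np.bincount(
    [sample_intention(logits, rng)[0] for _ in range(1000)], minlength=4)
print(p.round(3), counts)
```

A deterministic model would correspond to always taking `argmax(logits)`; sampling instead lets the system express several valid intentions for the same dialogue state.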
8.1 Stochastic Policy as a Discrete Latent Variable

As mentioned at the beginning of this chapter, if we view end-to-end dialogue modelling as a single problem, then latent variable modelling of the communicative intention must capture the underlying diversity and variation in the conversation. To model intention as a latent variable, we must first decide whether to model it as a continuous or a discrete latent variable. Continuous latent variable modelling is more popular and has been introduced in chat-based scenarios before (Cao and Clark, 2017; Serban et al., 2016). In the chat-based scenario, a continuous Gaussian random variable is sampled and the Gaussian re-parameterisation trick is used for inference. Although this continuous Gaussian random variable injects randomness and encourages diversity in the response, it does not bring the additional advantages that discrete latent variables do. In this thesis, we propose to model the dialogue intention as a discrete latent variable instead.

There are several benefits to modelling dialogue intention as a discrete latent variable (a categorical distribution) rather than a continuous one (a Gaussian distribution): firstly, each dimension of the discrete latent distribution can be interpreted as a DA and the distribution can be considered a latent dialogue policy. Unlike the POMDP-based framework (Young et al., 2013), which handcrafts the set of system intentions based on the DA formalism (Traum, 1999), here we directly infer the underlying dialogue intentions from data. Thus, we can handle intent distributions with long tails by measuring similarities against the existing intentions during variational inference. This frees the model from pre-defined, limited semantic representations and allows it to better capture complex human-human conversations; secondly, this discrete set of intentions can serve as a principled framework for integrating different learning paradigms.
This allows us to develop a dialogue agent with an AlphaGo-like recipe (Silver et al., 2016), by first training on a supervised dataset and then applying reinforcement learning to continuously improve through interaction. This discrete interface is also well suited to semi-supervised learning, which is critical for NLP tasks where additional supervision and external knowledge can be utilised for bootstrapping (Faruqui et al., 2015; Kočiský et al., 2016; Miao and Blunsom, 2016); finally, this discrete latent variable can also serve as an interface for human control and intervention. This is crucial from a practical viewpoint because it not only allows us to better understand the complex learning system, but also provides us with a gateway for debugging it.

8.2 Latent Intention Dialogue Model

The Latent Intention Dialogue Model (LIDM) proposed here is based on the NDM framework described in the previous chapter. However, instead of modelling each system action as a deterministic vector, we introduce a discrete latent variable with a categorical distribution to better capture its underlying stochasticity. In the LIDM, the latent intention is inferred from the user input utterance; the agent draws a sample based on its dialogue context, and the intention then guides the generation of the natural language response. Firstly, within the framework of NVI (Mnih and Gregor, 2014), we construct an inference network to approximate the posterior distribution over the latent intention. Then, by sampling the intentions for each response, we can directly learn a basic intention distribution on a human-human dialogue corpus by optimising the variational lower bound. To further reduce the variance, we utilise a labelled subset of the corpus. The labels are automatically generated by clustering.
Then, the latent intention distribution can be learned in a semi-supervised fashion, where the learning signals come either from direct supervision (the labelled set) or from the variational lower bound (the unlabelled set). From the perspective of reinforcement learning, the latent intention distribution can be interpreted as the intrinsic policy that reflects human decision-making in a conversational scenario. Based on the initial policy (latent intention distribution) learnt within the semi-supervised variational inference framework, the model can easily refine its strategy against alternative objectives, especially by using policy gradient-based reinforcement learning. This is somewhat analogous to the training process used in AlphaGo (Silver et al., 2016) for the game of Go. Based on the LIDM, we show that different learning paradigms can be brought together under the same framework to bootstrap the development of a dialogue agent (Li et al., 2016c; Weston, 2016).

8.2.1 Model

The LIDM is based on the NDM framework proposed in Chapter 7. For the sake of explaining NVI in the dialogue system framework, we give an alternative modular interpretation of the system diagram shown in Figure 8.1, which comprises three principal components: (1) Representation Construction; (2) Policy Network; and (3) Generator.

Fig. 8.1 LIDM for Goal-oriented Dialogue Modelling

Representation Construction. Based on the user input query $u_t$ and the internal knowledge base KB, we first construct a representation of the dialogue context via

\[ \mathbf{s}_t = \mathbf{u}_t \oplus \hat{\mathbf{b}}_t \oplus \mathbf{x}_t. \tag{8.1} \]

As we recall from Section 7.3, $\mathbf{u}_t$ is a bidirectional LSTM encoding of the user input utterance, $\mathbf{u}_t = \mathrm{biLSTM}_\theta(u_t)$, and $\hat{\mathbf{b}}_t$ is a concatenation of a set of summarised probability distributions over domain-specific slot-value pairs.
This set is extracted by a set of pre-trained RNN-CNN belief trackers (Mrkšić et al., 2016b; Wen et al., 2017b), in which $u_t$ and $m_{t-1}$ are processed by two different CNNs as shown in Figure 8.1,

\[ \mathbf{b}_t = \mathrm{RNN\text{-}CNN}(u_t, m_{t-1}, \mathbf{b}_{t-1}), \tag{8.2} \]

where $m_{t-1}$ is the preceding machine response and $\mathbf{b}_{t-1}$ is the preceding belief vector. They are included to model the current turn of the discourse and the long-term dialogue context, respectively. Based on the belief vector, a query $q_t$ is formed by taking the union of the maximum values of each slot. $q_t$ is then used to search the internal KB and return a vector $\mathbf{x}_t$, which represents the degree of matching in the KB. This is produced by counting all the matching venues and restructuring the count into a six-bin one-hot vector. Up to this point, we reuse the NDM's system modules to construct the deterministic dialogue state vector $\mathbf{s}_t$.

Latent Policy Network. Conditioning on the state $\mathbf{s}_t$, the policy network parameterises the latent intention $z_t$ via a single-layer MLP,

\[ \pi_\theta(z_t \mid \mathbf{s}_t) = \mathrm{softmax}(W_2^\top \tanh(W_1^\top \mathbf{s}_t + b_1) + b_2), \tag{8.3} \]

where $W_1$, $b_1$, $W_2$, $b_2$ are model parameters. Since $\pi_\theta(z_t \mid \mathbf{s}_t)$ is a discrete conditional probability distribution based on the dialogue state, we can also interpret the policy network as a latent dialogue management component, as in the traditional POMDP-based framework (Gašić et al., 2013; Young et al., 2013). A latent intention $z_t^{(n)}$ (or an action, in the reinforcement learning literature) can then be sampled from the conditional distribution $z_t^{(n)} \sim \pi_\theta(z_t \mid \mathbf{s}_t)$.
(8.4)

Generator. The sampled intention (or action) $z_t^{(n)}$ and the state vector $\mathbf{s}_t$ are combined into a control vector $\mathbf{d}_t$, which is then used to govern the generation of the system response $m_t$ based on a conditional LSTM language model,

\[ \mathbf{d}_t = W_4^\top \mathbf{z}_t \oplus \big[ \mathrm{sigmoid}(W_3^\top \mathbf{z}_t + b_3) \cdot W_5^\top \mathbf{s}_t \big], \tag{8.5} \]
\[ p_\theta(m_t \mid \mathbf{s}_t, z_t) = \prod_j p(w_{j+1}^t \mid w_j^t, \mathbf{h}_{j-1}^t, \mathbf{d}_t), \tag{8.6} \]

where $b_3$ and $W_{3\sim5}$ are parameters, $\mathbf{z}_t$ is the one-hot representation of $z_t^{(n)}$, $w_j^t$ is the last output token (i.e., a word, a delexicalised slot name or a delexicalised slot value), and $\mathbf{h}_{j-1}^t$ is the decoder's last hidden state. Note that in Equation 8.5, the degree of information flow from the state vector is controlled by a sigmoid gate whose input signal is the sampled intention $z_t^{(n)}$. This prevents the decoder from over-fitting to the deterministic state's information and forces it to consider the sampled stochastic intention. The LIDM can then be formally written in its parameterised form with parameter set $\theta$,

\[ p_\theta(m_t \mid \mathbf{s}_t) = \sum_{z_t} p_\theta(m_t \mid z_t, \mathbf{s}_t)\, \pi_\theta(z_t \mid \mathbf{s}_t). \tag{8.7} \]

8.2.2 Inference

To carry out inference for the LIDM, we introduce an inference network $q_\phi(z_t \mid \mathbf{s}_t, m_t)$ to approximate the posterior distribution $p(z_t \mid \mathbf{s}_t, m_t)$. We then optimise the variational lower bound of the joint probability in the neural variational inference framework (Miao et al., 2016), and thus derive the variational lower bound

\[ \mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(z_t)}[\log p_\theta(m_t \mid z_t, \mathbf{s}_t)] - \lambda D_{KL}(q_\phi(z_t) \,\|\, \pi_\theta(z_t \mid \mathbf{s}_t)) \le \log \sum_{z_t} p_\theta(m_t \mid z_t, \mathbf{s}_t)\, \pi_\theta(z_t \mid \mathbf{s}_t) = \log p_\theta(m_t \mid \mathbf{s}_t), \tag{8.8} \]

where $q_\phi(z_t)$ is shorthand for $q_\phi(z_t \mid \mathbf{s}_t, m_t)$. Note that we use a modified version of the lower bound here by introducing the trade-off constant $\lambda$ (Higgins et al., 2017).
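For intuition, the bound of Equation 8.8 can be evaluated for a toy discrete latent variable; the distributions and log-likelihoods below are illustrative numbers, not model outputs:

```python
import numpy as np

def lower_bound(q, pi, log_lik, lam=1.0):
    """Variational lower bound of Eq. 8.8 for a discrete latent:
    E_q[log p(m|z,s)] - lambda * KL(q || pi).  `log_lik[z]` stands in
    for log p_theta(m|z,s)."""
    kl = float(np.sum(q * np.log(q / pi)))
    return float(np.dot(q, log_lik)) - lam * kl

q  = np.array([0.7, 0.2, 0.1])      # inference network q_phi(z|s,m)
pi = np.array([0.5, 0.3, 0.2])      # generative policy pi_theta(z|s)
log_lik = np.array([-1.0, -2.0, -3.0])

# With lam=1 the bound never exceeds the exact marginal log-likelihood
# log sum_z p(m|z,s) pi(z|s); the thesis uses lam=0.1 as a trade-off,
# which trades this guarantee for a weaker KL penalty.
exact = np.log(np.sum(np.exp(log_lik) * pi))
print(lower_bound(q, pi, log_lik) <= exact)  # True
```

The gap between the bound and the exact marginal closes as $q_\phi$ approaches the true posterior, which is what optimising $\phi$ aims for.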
By following the steps described in Section 4.3.2, the inference network $q_\phi(z_t \mid \mathbf{s}_t, m_t)$ is then constructed by

\[ q_\phi(z_t \mid \mathbf{s}_t, m_t) = \mathrm{Multi}(\mathbf{o}_t) = \mathrm{softmax}(W_6 \mathbf{o}_t), \tag{8.9} \]
\[ \mathbf{o}_t = \mathrm{MLP}_\phi(\hat{\mathbf{b}}_t, \mathbf{x}_t, \mathbf{u}_t, \mathbf{m}_t), \tag{8.10} \]
\[ \mathbf{u}_t = \mathrm{biLSTM}_\phi(u_t), \qquad \mathbf{m}_t = \mathrm{biLSTM}_\phi(m_t), \tag{8.11} \]

where $\mathbf{o}_t$ is the joint representation, and both $\mathbf{u}_t$ and $\mathbf{m}_t$ are modelled by a bidirectional LSTM network. Although both $q_\phi(z_t \mid \mathbf{s}_t, m_t)$ and $\pi_\theta(z_t \mid \mathbf{s}_t)$ are modelled as parameterised multinomial distributions, $q_\phi(z_t \mid \mathbf{s}_t, m_t)$ is an approximation used only during inference, producing samples to compute stochastic gradients, while $\pi_\theta(z_t \mid \mathbf{s}_t)$ is the generative distribution that generates the samples for composing the machine response. Based on the samples $z_t^{(n)} \sim q_\phi(z_t \mid \mathbf{s}_t, m_t)$, we adopt a score function-based gradient estimator, as described in Section 4.3.2. We use different strategies to alternately optimise the parameters $\theta$ and $\phi$ against the variational lower bound (Equation 8.8). To optimise $\theta$, we further divide it into two sets of parameters, $\theta = \{\theta_1, \theta_2\}$. For $\theta_1$ on the decoder side, we directly update them by back-propagating the gradients,

\[ \frac{\partial \mathcal{L}}{\partial \theta_1} = \mathbb{E}_{q_\phi(z_t \mid \mathbf{s}_t, m_t)}\!\left[\frac{\partial \log p_{\theta_1}(m_t \mid z_t, \mathbf{s}_t)}{\partial \theta_1}\right] \approx \frac{1}{N}\sum_n \frac{\partial \log p_{\theta_1}(m_t \mid z_t^{(n)}, \mathbf{s}_t)}{\partial \theta_1}. \tag{8.12} \]

For the parameters $\theta_2$ in the generative network, we update them by minimising the KL divergence via

\[ \frac{\partial \mathcal{L}}{\partial \theta_2} = -\frac{\partial\, \lambda D_{KL}(q_\phi(z_t \mid \mathbf{s}_t, m_t) \,\|\, \pi_{\theta_2}(z_t \mid \mathbf{s}_t))}{\partial \theta_2} = \lambda \sum_{z_t} q_\phi(z_t \mid \mathbf{s}_t, m_t) \frac{\partial \log \pi_{\theta_2}(z_t \mid \mathbf{s}_t)}{\partial \theta_2}. \tag{8.13} \]

The entropy derivative $\partial \mathbb{H}[q_\phi(z_t \mid \mathbf{s}_t, m_t)]/\partial \theta_2 = 0$ and can thus be ignored. Finally, for each parameter $\phi$ in the inference network, we define the learning signal $r(m_t, z_t^{(n)}, \mathbf{s}_t)$ as

\[ r(m_t, z_t^{(n)}, \mathbf{s}_t) = \log p_{\theta_1}(m_t \mid z_t^{(n)}, \mathbf{s}_t) - \lambda\big(\log q_\phi(z_t^{(n)} \mid \mathbf{s}_t, m_t) - \log \pi_{\theta_2}(z_t^{(n)} \mid \mathbf{s}_t)\big). \]
(8.14)

The parameters $\phi$ can then be updated using

\[ \frac{\partial \mathcal{L}}{\partial \phi} = \mathbb{E}_{q_\phi(z_t \mid \mathbf{s}_t, m_t)}\!\left[ r(m_t, z_t, \mathbf{s}_t)\, \frac{\partial \log q_\phi(z_t \mid \mathbf{s}_t, m_t)}{\partial \phi} \right] \approx \frac{1}{N}\sum_n r(m_t, z_t^{(n)}, \mathbf{s}_t)\, \frac{\partial \log q_\phi(z_t^{(n)} \mid \mathbf{s}_t, m_t)}{\partial \phi}. \tag{8.15} \]

This gradient estimator has a large variance because the learning signal $r(m_t, z_t^{(n)}, \mathbf{s}_t)$ relies on samples from the proposal distribution $q_\phi(z_t \mid \mathbf{s}_t, m_t)$. To reduce the variance during inference, we follow the REINFORCE algorithm (Mnih and Gregor, 2014; Mnih et al., 2014) and introduce two baselines, $b$ and $b(\mathbf{s}_t)$: the centred baseline signal and the input-dependent baseline both help reduce the variance. $b$ is a learnable constant and $b(\mathbf{s}_t) = \mathrm{MLP}(\mathbf{s}_t)$ is modelled by an MLP. During training, the two baselines are updated by minimising the distance

\[ L_b = \big[ r(m_t, z_t^{(n)}, \mathbf{s}_t) - b - b(\mathbf{s}_t) \big]^2, \tag{8.16} \]

and the gradient w.r.t. $\phi$ can be rewritten as

\[ \frac{\partial \mathcal{L}}{\partial \phi} \approx \frac{1}{N}\sum_n \big[ r(m_t, z_t^{(n)}, \mathbf{s}_t) - b - b(\mathbf{s}_t) \big] \frac{\partial \log q_\phi(z_t^{(n)} \mid \mathbf{s}_t, m_t)}{\partial \phi}. \tag{8.17} \]

8.2.3 Semi-supervised Learning

Despite the efforts to reduce the variance, there are two major difficulties in learning latent intentions in a completely unsupervised manner: (1) the high variance of the inference network in the early stages of training prevents it from generating sensible intention samples at the beginning; and (2) the overly strong discriminative power of the LSTM language model results in a disconnection phenomenon between the LSTM decoder and the rest of the system components: the decoder learns to ignore the samples and focuses only on optimising the language model. To stabilise training and prevent this disconnection, a semi-supervised learning technique is introduced. Inferring the latent intentions behind utterances resembles an unsupervised clustering task. To mitigate the learning load of the model, we can use off-the-shelf clustering algorithms to pre-process the corpus and generate automatic labels $\hat{z}_t$ for part of the training examples, $(m_t, \mathbf{s}_t, \hat{z}_t) \in \mathbb{L}$.
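The variance reduction achieved by the baselines of Equations 8.15-8.17 can be demonstrated numerically; the proposal distribution and learning signals below are illustrative numbers chosen only to make the effect visible:

```python
import numpy as np

rng = np.random.default_rng(0)
q      = np.array([0.6, 0.3, 0.1])   # proposal q_phi(z|s,m), illustrative
reward = np.array([1.0, 0.2, -0.5])  # learning signal r(m, z, s), illustrative

def grad_estimates(baseline, n=2000):
    """Score-function (REINFORCE) estimates of the gradient of E_q[r]
    w.r.t. the first softmax logit of q, with a baseline subtracted."""
    zs = rng.choice(3, size=n, p=q)
    # d log q(z) / d logit_0 for a softmax parameterisation:
    score = (zs == 0).astype(float) - q[0]
    return (reward[zs] - baseline) * score

raw     = grad_estimates(baseline=0.0)
centred = grad_estimates(baseline=float(np.dot(q, reward)))
# Subtracting the expected reward leaves the mean unchanged
# but typically shrinks the estimator's variance substantially:
print(raw.mean().round(2), centred.mean().round(2))
print(raw.var() > centred.var())
```

In the LIDM the baseline is the sum of a learnt constant $b$ and a state-dependent term $b(\mathbf{s}_t)$, trained by regression on the learning signal (Equation 8.16), rather than the fixed expectation used in this sketch.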
When the model is trained on the unlabelled examples $(m_t, \mathbf{s}_t) \in \mathbb{U}$, we optimise the model against the modified variational lower bound of Equation 8.8,

\[ L_1 = \sum_{(m_t, \mathbf{s}_t) \in \mathbb{U}} \mathbb{E}_{q_\phi(z_t \mid \mathbf{s}_t, m_t)}[\log p_\theta(m_t \mid z_t, \mathbf{s}_t)] - \lambda D_{KL}(q_\phi(z_t \mid \mathbf{s}_t, m_t) \,\|\, \pi_\theta(z_t \mid \mathbf{s}_t)). \tag{8.18} \]

If the model is updated based on examples from the labelled set $(m_t, \mathbf{s}_t, \hat{z}_t) \in \mathbb{L}$, we treat the labelled intention $\hat{z}_t$ as an observed variable and train the model by maximising the joint log-likelihood,

\[ L_2 = \sum_{(m_t, \hat{z}_t, \mathbf{s}_t) \in \mathbb{L}} \log \big[ p_\theta(m_t \mid \hat{z}_t, \mathbf{s}_t)\, \pi_\theta(\hat{z}_t \mid \mathbf{s}_t)\, q_\phi(\hat{z}_t \mid \mathbf{s}_t, m_t) \big]. \tag{8.19} \]

Thus, the final joint objective function can be written as $L' = \alpha L_1 + L_2$, where $\alpha$ is a trade-off constant between the supervised and unsupervised examples.

8.2.4 Reinforcement Learning

One of the main purposes of learning an interpretable, discrete latent intention within a dialogue system is to be able to control and refine the model's future behaviour. The generative network $\pi_\theta(z_t \mid \mathbf{s}_t)$ encodes the policy discovered from the underlying data distribution, but this policy is not necessarily optimal for any given task. Since $\pi_\theta(z_t \mid \mathbf{s}_t)$ is itself a parameterised policy network, any off-the-shelf policy gradient-based reinforcement learning algorithm (Konda and Tsitsiklis, 2003; Williams, 1992) can be used to fine-tune the initial policy against other objective functions that we are more interested in. Based on the initial policy $\pi_\theta(z_t \mid \mathbf{s}_t)$, we revisit the training dialogues and update parameters using the following strategy: when encountering unlabelled examples $\mathbb{U}$ at turn $t$, the system samples an action from the learnt policy, $z_t^{(n)} \sim \pi_\theta(z_t \mid \mathbf{s}_t)$, and receives a reward $r_t^{(n)}$. Conditioning on these, we can directly fine-tune a subset of the model parameters $\theta'$ via the policy gradient method,

\[ \frac{\partial J}{\partial \theta'} \approx \frac{1}{N}\sum_n r_t^{(n)}\, \frac{\partial \log \pi_\theta(z_t^{(n)} \mid \mathbf{s}_t)}{\partial \theta'}, \tag{8.20} \]

where $\theta' = \{W_1, b_1, W_2, b_2\}$ is the set of parameters of the MLP that parameterises the policy network (Equation 8.3).
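A minimal sketch of Equations 8.3 and 8.20 is given below, updating only the policy MLP while everything else stays fixed. The dimensions and learning rate are illustrative, and the gradient w.r.t. $W_1$, $b_1$ is omitted for brevity (it follows by the chain rule):

```python
import numpy as np

rng = np.random.default_rng(0)
S, H, I = 8, 16, 5      # state, hidden and intention dims (illustrative)
W1, b1 = rng.standard_normal((S, H)) * 0.1, np.zeros(H)
W2, b2 = rng.standard_normal((H, I)) * 0.1, np.zeros(I)

def policy(s):
    """pi_theta(z|s) = softmax(W2^T tanh(W1^T s + b1) + b2)  (Eq. 8.3)."""
    hidden = np.tanh(s @ W1 + b1)
    logits = hidden @ W2 + b2
    e = np.exp(logits - logits.max())
    return e / e.sum(), hidden

def policy_gradient_step(s, z, r, lr=0.1):
    """One REINFORCE update (Eq. 8.20) of only the policy parameters;
    the decoder is untouched, so only decision-making is revised."""
    global W2, b2
    p, hidden = policy(s)
    dlogits = -p
    dlogits[z] += 1.0                 # d log pi(z|s) / d logits
    W2 += lr * r * np.outer(hidden, dlogits)
    b2 += lr * r * dlogits

s = rng.standard_normal(S)
p_before, _ = policy(s)
z = int(np.argmax(p_before))
policy_gradient_step(s, z, r=1.0)     # positive reward reinforces z
p_after, _ = policy(s)
print(p_after[z] > p_before[z])  # True
```

With a negative reward the same update pushes probability mass away from the sampled intention, which is how the corpus-based fine-tuning of Section 8.3.1 discourages responses that degrade task success.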
When a labelled example $\in \mathbb{L}$ is encountered, we force the model to take the labelled action, $z_t^{(n)} = \hat{z}_t$, and update the parameters via Equation 8.20. Unlike Li et al. (2016c), who refine the whole model end-to-end using RL, updating only $\theta'$ effectively allows us to revise only the decision-making of the system and makes it more resistant to over-fitting.

8.3 Evaluation on Latent Intention Dialogue Models

8.3.1 Experimental Setup

We tested the LIDM on the CamRest676 corpus, as described in Section 7.4. To make a direct comparison with NDMs, we followed the same experimental setup as in Section 7.5; for more detail about the common experimental settings and hyper-parameters, please refer back to Section 7.5. For the LIDM-specific settings, we trained three types of LIDMs with their latent intention size $I$ set to 50, 70, and 100, respectively. The trade-off constants $\lambda$ and $\alpha$ were both empirically set to 0.1. To produce self-labelled response clusters for semi-supervising the intentions, we first removed function words from all the responses and clustered them according to their word content. We then assigned the responses in the $i$-th most frequent cluster to the $i$-th latent dimension as its supervised set. This results in about 35% ($I$=50) to 43% ($I$=100) of the responses being labelled across the whole dataset. An example of the resulting seed set is shown in Table 8.1. This seed set is loosely related to the DA taxonomies used in this domain, but instead of labelling their semantics manually, in the LIDM we only roughly cluster the responses based on their surface-form content words. During inference, we carried out stochastic estimation by taking a single sample to estimate the stochastic gradients. The model was trained with Adam (Kingma and Ba, 2014) and tuned (early stopping, hyper-parameters) on the held-out validation set. We alternately optimised the generative model and the inference network by fixing the parameters of one while updating the parameters of the other.
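The self-labelling procedure can be sketched roughly as follows; the responses and the stop-word list are illustrative stand-ins for the actual corpus pre-processing:

```python
from collections import Counter

# Hypothetical delexicalised system responses.
responses = [
    "thank you goodbye",
    "thank you for using the system goodbye",
    "the phone is [v.phone] and the address is [v.address]",
    "the address is [v.address]",
    "you are welcome goodbye",
]
STOPWORDS = {"you", "the", "is", "and", "for", "are", "using", "system"}

def content_key(response):
    """Cluster key: the sorted content words of a response after
    removing function words (a rough stand-in for the thesis's method)."""
    words = [w for w in response.split() if w not in STOPWORDS]
    return tuple(sorted(set(words)))

clusters = Counter(content_key(r) for r in responses)
# The i-th most frequent cluster seeds the i-th latent dimension:
seed_set = {i: key for i, (key, _) in enumerate(clusters.most_common())}
for i, key in seed_set.items():
    print(i, key)
```

On this toy input the most frequent cluster is the (thank, goodbye) one, mirroring row 0 of Table 8.1, which dominates the real seed set with 138 responses.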
Due to the difficulty of training end-to-end models directly online with real users, a rather simple corpus-based RL setting was adopted to validate the application of RL on top of the trained latent variable. During corpus-based RL fine-tuning, we generate a sentence $m_t$ from the model to replace the ground truth $\hat{m}_t$ at each turn and define an immediate reward based on whether $m_t$ improves the dialogue success (Su et al., 2015) over $\hat{m}_t$, together with the sentence BLEU (Auli and Gao, 2014),

\[ r_t = \eta \cdot \mathrm{sBLEU}(m_t, \hat{m}_t) + \begin{cases} 1 & m_t \text{ improves,} \\ -1 & m_t \text{ degrades,} \\ 0 & \text{otherwise,} \end{cases} \tag{8.21} \]

where the constant $\eta$ was set to 0.5. We fine-tuned the model parameters using RL for 3 epochs. During testing, we selected the most probable intention greedily and applied a beam search with the beam width set to 10 when decoding the response. The decoding criterion was the average log-probability of the tokens in the response. We then evaluated our model based on task success rate (Su et al., 2015) and BLEU score (Papineni et al., 2002). The model was used to predict each system response in the held-out test set.

Table 8.1 An example of the automatically labelled response seed set for semi-supervised learning during variational inference.

ID | #   | content words
0  | 138 | thank, goodbye
1  | 91  | welcome, goodbye
3  | 42  | phone, address, [v.phone], [v.address]
14 | 17  | address, [v.address]
31 | 9   | located, area, [v.area]
34 | 9   | area, would, like
46 | 7   | food, serving, restaurant, [v.food]
85 | 4   | help, anything, else

8.3.2 Results

As with the results presented in Section 7.5.2, we ran a small grid search over the same set of hyper-parameters for the LIDM: the initial learning rate of Adam ranged over [0.002, 0.004, 0.006]1; the l2-regularisation term over [0.0, 1e-6, 1e-5, 1e-4]; and the random seed values over 1 to 5. We trained a model for each of the hyper-parameter combinations mentioned above.
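The grid search above amounts to the following enumeration (values taken from the text; the training call itself is omitted):

```python
from itertools import product

# Hyper-parameter search space for the LIDM (Section 8.3.2).
learning_rates = [0.002, 0.004, 0.006]
l2_terms       = [0.0, 1e-6, 1e-5, 1e-4]
seeds          = [1, 2, 3, 4, 5]

grid = list(product(learning_rates, l2_terms, seeds))
print(len(grid))  # 60 configurations, one model trained per setting
```

The best of the 60 resulting models is then selected on the validation set, as described in the next paragraph.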
We picked the model that performed best on the validation set and report its performance on the test set. Note that the learning rate range searched here is smaller than the NDM's because NVI requires a smaller initial learning rate to explore the latent space in the first few epochs.

1 For NDM we searched over [0.008, 0.010, 0.012].

Table 8.2 Corpus-based experiment comparing different NDM and LIDM architectures. The results were obtained by training models with several hyper-parameter settings and reporting the test performance of the best model on the validation set. The best performance for each metric in each block is highlighted in bold.

Model         | Decoder     | Addition | Success(%) | BLEU  | Suc.+0.5 BLEU
Vanilla NDM   | lm          | -        | 72.8%      | 0.237 | 0.847
Vanilla NDM   | mem         | -        | 74.3%      | 0.243 | 0.865
Vanilla NDM   | hybrid      | -        | 77.9%      | 0.231 | 0.894
Attentive NDM | lm          | -        | 72.1%      | 0.246 | 0.844
Attentive NDM | mem         | -        | 80.1%      | 0.240 | 0.921
Attentive NDM | hybrid      | -        | 77.9%      | 0.234 | 0.896

Model         | Intent Dim. | Addition | Success(%) | BLEU  | Suc.+0.5 BLEU
LIDM          | I=50        | -        | 78.7%      | 0.226 | 0.900
LIDM          | I=70        | -        | 80.9%      | 0.245 | 0.932
LIDM          | I=100       | -        | 69.1%      | 0.221 | 0.801
LIDM          | I=50        | +RL      | 77.2%      | 0.249 | 0.896
LIDM          | I=70        | +RL      | 83.8%      | 0.258 | 0.967
LIDM          | I=100       | +RL      | 81.6%      | 0.245 | 0.939
Ground Truth  | -           | -        | 91.6%      | 1.000 | 1.416

The corpus-based evaluation results are presented in Table 8.2. Note that the results for the NDMs (the first two blocks of Table 8.2) are borrowed from Table 7.1 for convenience of comparison. As seen in Table 8.2, the Ground Truth block shows the two metrics computed on the human-authored responses; this sets a gold standard for the task. Comparing among the LIDMs (the LIDM block), a dimension of 70 best fits the number of latent intentions on this corpus, since both the success rate and BLEU score are the highest. Furthermore, the initial policy learnt by fitting the latent intention to the underlying data distribution does not necessarily produce good results when compared to the vanilla and attentive NDMs.
This may be because we were optimising the variational lower bound of the dataset rather than task success and BLEU score during the variational inference. However, once we applied RL to optimise the success rate and BLEU score as part of the reward function (Equation 8.21) during the fine-tuning phase, the resulting RL-based LIDM models outperform all the NDM baselines when both the success rate and BLEU score are considered. To observe the general trend across the three types of models, we highlighted the best performance for each metric (each column) of each model (each block) in bold. As we can see, the best metrics of each model suggest that the proposed LIDM performs better in terms of all the metrics considered.

8.3.3 Human Evaluation

To assess human-perceived performance, we evaluated three models, (1) the attentive NDM, (2) the LIDM (without RL), and (3) the LIDM+RL, by recruiting paid subjects on Amazon Mechanical Turk. Each judge was asked to follow a task and carry out a conversation with the machine. At the end of the conversation, the judges were asked to rate and compare the models' performance. We assessed the subjective success rate, the perceived comprehension ability, and the naturalness of responses on a scale of 1 to 5. For each model, we collected 200 dialogues and averaged the resulting scores. During the human evaluation, we sampled from the top-5 intentions of the LIDM models and decoded a response based on the sample.

Metric        | NDM   | LIDM  | LIDM+RL
Success       | 91.5% | 92.0% | 93.0%
Comprehension | 4.21  | 4.40  | 4.40
Naturalness   | 4.08  | 4.29  | 4.28
# of Turns    | 4.45  | 4.54  | 4.29

Table 8.3 Human evaluation based on text-based conversations.

The results are shown in Table 8.3. One interesting fact to note is that although the LIDM did not perform well on the corpus-based task success metric, the human judges rated its subjective success as almost indistinguishable from the others.
This discrepancy between the two experiments suggests a flaw in the corpus-based success metric: it favours greedy policies because the user-side behaviours were fixed rather than interactive². Although the LIDMs were considered only marginally better than the NDM in terms of subjective success, they outperform the NDM on both the comprehension and naturalness scores. This is because the proposed LIDM models can better capture the multi-modality of the communicative intention in dialogue and carry out a more natural conversation by sampling from the latent variable.

In a qualitative experiment, we show three example conversations between a human judge and a machine: one from the LIDM in Table 8.4 and two from the LIDM+RL in Table 8.5. The results are shown as one exchange per block. Each induced latent intention is shown via an (index, probability) tuple followed by a decoded response, and the sample dialogues were produced by following the responses highlighted in bold.

²The system wants to provide as much information as possible in the early turns, in case the fixed user-side behaviours a few turns later do not fit the scenario the system originally planned. A simulated user is not possible here because the LIDM operates directly at the language level, and developing an end-to-end simulated user is itself an AI-complete problem.

Human  : Hi, I'm hungry for some Indonesian
Machine: ( 0 0.71) i am sorry , there are no indonesian restaurants in the area . would you like something else ?
         (46 0.19) there are no restaurants serving indonesian kind of food .
         (34 0.07) what location would you like ?
Human  : Oh no. How about Indian?
Machine: ( 0 0.59) there are 9 restaurants in the east side of town . would you like their address ?
         (30 0.15) curry prince serves indian cuisine .
         (34 0.10) what area would you like ?
         (16 0.09) curry prince serves indian food , is in the east part of town .
Human  : East side please
Machine: ( 0 0.62) curry prince is a indian restaurant in the east side of town . would you like their telephone number ?
         (84 0.19) curry prince is an indian restaurant in the east of town and the number is 01223 566388 .
         (81 0.08) curry prince serves indian type of cuisine and is located in the east part of town .
Human  : yes , the address and phone number please
Machine: ( 3 0.46) the address is 451 newmarket road fen ditton and the phone is 01223 566388 .
         (11 0.16) the address for curry prince is 451 newmarket road fen ditton and the phone number is 01223 566388 .
         ( 0 0.15) curry prince is located at 451 newmarket road fen ditton and their phone is 01223 566388 . is there anything else i can help you with ?
         (84 0.13) curry prince is located at 451 newmarket road fen ditton and the phone is 01223 566388 and is in the moderate price range .
Human  : Okay, thank you goodbye
Machine: ( 0 0.57) thank you , goodbye .
         ( 2 0.13) goodbye .
         ( 1 0.11) you are welcome . goodbye .
         ( 6 0.10) thank you for using the cambridge restaurant system . goodbye .

Table 8.4 A sample dialogue from the LIDM (I=100) model, one exchange per block. Each induced latent intention is shown using an (index, probability) tuple, followed by a decoded response. The sample dialogue was produced by following the responses highlighted in bold.

As we can see, the LIDM dialogue shown in Table 8.4 demonstrates a clear multi-modal distribution over the learnt latent intention variable, and what the variable represents can easily be interpreted from the responses generated. Note that the latent variable varies at both the semantic and the syntactic level. However, some intentions (such as intent 0) can still decode very different responses under different dialogue states, even though they were supervised with a small response set as shown in Table 8.1. This is mainly because of the variance introduced during the variational inference.
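During the human evaluation, responses such as those in the tables were produced by sampling from the top-5 latent intentions and decoding from the sample. A minimal sketch of such top-k sampling over an (index, probability) distribution is shown below; the function and variable names are illustrative, not taken from the thesis code.

```python
import random

def sample_top_k(intent_probs, k=5, rng=random):
    """Sample an intention index from the renormalised top-k of a
    categorical distribution over latent intentions."""
    # Keep the k most probable intentions and renormalise their mass.
    top = sorted(intent_probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    # Inverse-CDF sampling over the truncated distribution.
    r = rng.random() * total
    for idx, p in top:
        r -= p
        if r <= 0:
            return idx
    return top[-1][0]
```

For instance, given the first turn's distribution {0: 0.71, 46: 0.19, 34: 0.07}, the sampler usually returns intention 0 but occasionally 46 or 34, which is the source of the natural variation discussed above.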
Finally, when comparing Table 8.4 and Table 8.5, we can observe the difference between the two dialogue strategies: the LIDM, by inferring its policy from the supervised dataset, reflects the rich multi-modality of the underlying distribution, while the LIDM+RL, which refines its strategy using RL, exhibits much greedier behaviour in pursuit of task success (in blocks 2 and 4 of Table 8.5 the LIDM+RL agent provides the address and phone number even before the user asks). This greedy behaviour is mainly caused by two factors: (1) the sparse nature of the success reward signal, and (2) the fixed user behaviour in the corpus-based RL setting². This is also supported by the human evaluation in Table 8.3, where the LIDM+RL has a much shorter average dialogue length than the other two models. A better model could be achieved by learning from real user interactions with a more expressive reward signal.

Human  : Hi, I'm hungry for some Indonesian .
Machine: ( 0 0.61) i am sorry , there are no indonesian restaurants in the area . would you like something else ?
         (46 0.17) no restaurants in the part of town .
         (34 0.14) what part of town would you like ?
Human  : Oh no. How about Indian?
Machine: (93 0.51) the gandhi is located at 72 regent street city centre and their telephone number is 01223 353942 .
         (72 0.21) the gandhi serves indian kind of food and is in the centre of town and is in the cheap price range .
         (12 0.12) the gandhi is a indian restaurant in the centre location and is in the cheap price range .
Human  : Okay, thank you goodbye
Machine: ( 6 0.28) thank you for using the cambridge restaurant system . goodbye .
         ( 1 0.24) you are welcome . goodbye .
         (85 0.19) is there anything else i can help you with ?

Human  : hey can i have some chinese cruisine in the centre ? and a cheap one please
Machine: (15 0.42) golden house is located at 12 lensfield road city centre and their number is 01842 753771 .
         ( 5 0.38) golden house is located at 12 lensfield road city centre and the number is 01842 753771 . is there ...
         (93 0.10) golden house is located at 12 lensfield road city centre and their phone is 01842 753771 .
Human  : that is all
Machine: ( 6 0.57) thank you for using the cambridge restaurant system . goodbye .
         ( 1 0.20) you are welcome . goodbye .
         (17 0.11) have a nice day !

Table 8.5 Two sample dialogues from the LIDM+RL (I=100) model, one exchange per block. Compared to Table 8.4, the RL agent demonstrates much greedier behaviour toward task success: in blocks 2 and 4 the agent provides the address and phone number even before the user asks.

8.4 Conclusions

In this chapter, we continued our study of end-to-end neural network-based dialogue models by introducing a discrete latent variable as the model's internal decision-making component. The discrete latent variable enables the model to make conversational decisions by sampling from a categorical distribution, which is inferred from a human-human dialogue corpus. This distribution can be interpreted as the intrinsic policy that reflects human decision-making under a particular conversational scenario. The proposed LIDM is based on the Conditional VAE framework illustrated in Section 4.3.1: we introduce a discrete latent variable between the input and output nodes and optimise the model parameters against the variational lower bound. To carry out efficient neural variational inference, we construct an inference network that approximates the true posterior by encoding both the input and the output as observations. Following Section 4.3.2, we opt for the scoring function-based gradient estimator to optimise the model parameters via stochastic back-propagation. During NVI, we alternately optimise the generative model and the inference network by fixing the parameters of one while updating the parameters of the other.
As in the REINFORCE algorithm, baselines are introduced to mitigate the high variance of this estimator. To reduce the variance further, we utilise a labelled subset of the corpus whose intention labels are generated automatically by clustering, so that the latent intention distribution can be learned in a semi-supervised fashion: the learning signal comes either from direct supervision (the labelled set) or from the variational lower bound (the unlabelled set). Based on the same framework, we also show that the model can easily refine its strategy against alternative objectives by using policy gradient-based reinforcement learning.

From a machine learning perspective, the LIDM is attractive because the latent variable can serve as an interface. This interface, which is itself learnt, decomposes the learning of language from the internal dialogue decision-making process. The decomposition effectively helps to resolve the credit assignment problem in end-to-end dialogue model learning, because different learning signals can be efficiently assigned to the relevant system sub-modules to update their parameters: in variational inference, the discrete latent distribution is updated by a reward derived from the variational lower bound, while in reinforcement learning the same latent distribution (i.e., the policy network) is updated by rewards from dialogue success and sentence BLEU. Hence, the latent variable bridges learning paradigms such as Bayesian learning and reinforcement learning and brings them together under the same framework. In summary, this framework provides a more robust neural network-based approach than the NDM introduced in Chapter 7 because it does not depend solely on sequence-to-sequence learning.
Instead, the framework explicitly models the hidden dialogue intentions underlying the user's utterances and allows the agent to learn a dialogue policy directly via interaction. However, the need for additional supervision indicates that the NVI framework still has difficulty converging, and more study is required before it can fully succeed.

Chapter 9 Conclusions

This thesis has shown that, with the right architectural design, an RNN language model can generate high-quality dialogue responses by learning from human-authored examples. Unlike previous approaches that rely on constructing intermediate representations with overly explicit linguistic annotations, the goal of this thesis has been to re-examine the need for these representations and to remove any unused or redundant ones, improving the scalability and learnability of the system. This backward integration, starting from the NLG component, has proven effective in Spoken Dialogue System architectures: it not only mitigates the development load and makes the system more scalable, but also demonstrates a competitive advantage in human-perceived quality over previous approaches.

The original contributions of this thesis include: the development of the RNNLG model, which combines the H-LSTM, SC-LSTM, and ENC-DEC models; the domain adaptation recipe for NLG models, which applies data counterfeiting and discriminative training; and learning-based end-to-end dialogue modelling, including the NDM and its latent-variable variant, the LIDM. The proposed RNNLG models enable corpus-based NLG to be trained on dialogue act-sentence pairs without alignment annotations. These models optimise a surface realisation objective but also learn the alignments jointly, as a by-product of using either the gating or the attention mechanism.
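The gating idea mentioned above can be sketched as a standard LSTM cell augmented with a reading gate that gradually consumes a dialogue-act vector d as words are generated. The following is a simplified, illustrative sketch of one such step; the weight names, shapes, and initialisation are assumptions for exposition, not the thesis implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sc_lstm_step(w_t, h_prev, c_prev, d_prev, W):
    """One step of a semantically conditioned LSTM cell (sketch).
    w_t: current word embedding; d_prev: remaining dialogue-act vector."""
    x = np.concatenate([w_t, h_prev])
    i = sigmoid(W["i"] @ x)           # input gate
    f = sigmoid(W["f"] @ x)           # forget gate
    o = sigmoid(W["o"] @ x)           # output gate
    g = np.tanh(W["g"] @ x)           # candidate cell content
    r = sigmoid(W["r"] @ x)           # reading gate over the DA vector
    d = r * d_prev                    # consume semantic elements as words emerge
    c = f * c_prev + i * g + np.tanh(W["d"] @ d)  # DA vector feeds the cell
    h = o * np.tanh(c)
    return h, c, d
```

Because the reading gate lies in (0, 1), the dialogue-act vector can only shrink over time, which is the mechanism that discourages the generator from repeating or omitting semantic elements.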
A corpus-based and a human evaluation across four dialogue domains demonstrate that all RNNLG models outperform previous approaches, with the top performance achieved by the SC-LSTM generator, which employs a learnable gate to control the semantic elements while generating dialogue responses.

An adaptation recipe is proposed to facilitate the training of RNNLG models under limited-data scenarios. Data counterfeiting allows an out-of-domain generator to be trained on a dataset counterfeited from examples in existing domains by masking (delexicalisation) and replacing the slot-value-specific surface forms. Discriminative training, on the other hand, exploits a small set of in-domain examples to correct inadequate and non-fluent realisations. Experiments on adaptation across the four domains show improved performance compared to baseline approaches, a result also confirmed by a human evaluation.

The NDM further integrates the dialogue components backward up to the dialogue manager (or content planner) and can directly learn its conversational strategies from a human-human corpus via supervised learning. Observing that deterministic models such as the NDM cannot effectively capture the natural variation in human language, we show that the LIDM, as a generative model, can carry out a more natural conversation. Theoretically, the LIDM is well-founded on the conditional VAE and Neural Variational Inference, and is particularly attractive for its ability to bridge different learning paradigms and to resolve the credit assignment problem. In practice, the LIDM and NDM both break with the custom of modelling goal-oriented dialogue as a pipeline of independent modules communicating via dialogue act taxonomies. Indeed, the experimental results showed that removing the hard-coded dialogue act taxonomies reduces the annotation load without jeopardising the learning of rational system behaviour.
Both a corpus-based evaluation and a live user trial show that the NDM and LIDM can engage in natural conversations that help users accomplish their tasks.

9.1 Limitations and Future Work

Arguably the greatest bottleneck for statistical approaches to dialogue system development is the collection of appropriate training data, and this is especially true for goal-oriented dialogue systems, which require in-domain data for optimal performance. For example, technical support for Apple computers may differ completely from that for Windows, due to the many differences in software and hardware. Therefore, the ease of collecting these datasets plays a crucial role in the system development process. Since the RNNLG relies on corpora annotated according to dialogue act taxonomies, it requires experts to create the labels, which makes it impossible to crowd-source the work completely (Henderson, 2015a). The same applies to several other system modules, such as the SLU and the dialogue manager. On the other hand, the proposed Wizard-of-Oz data collection enables us to collect dialogue corpora with coarse-grained annotations to train the NDM and LIDM; because the WoZ procedure does not require expert knowledge, it is much easier to run on crowdsourcing platforms.

Domain scalability of statistical approaches to dialogue systems is still an open question because current state-of-the-art methods focus on improvements in small domains. Although we have shown that the RNNLG can readily be scaled to similar domains by the proposed adaptation recipe, it is not clear whether adaptation remains feasible when an extremely different domain is involved. The situation becomes even more difficult if the generator needs to produce a single description that covers several domains. Arguably, the proposed NDM and LIDM have a chance of scaling better than pipelined systems because of their much more flexible learning ability.
As a result, more work should be done on training the NDM or LIDM on more complex domains to validate this proposition; a hierarchical attention mechanism would be a promising direction.

One limitation of sequential NLG models is their inability to generate long and coherent passages. Whether it is a long sentence encompassing several subjects, or multiple sentences each carrying its own idea, most sequential NLG models cannot consistently produce complex nested sentence structures. This is due to the notorious vanishing gradient problem in RNNs, which biases the model toward short sentences and makes it forget what has been said previously. Therefore, to generate long and coherent texts, better content planning and sentence planning techniques for neural networks need to be investigated. This may involve a hierarchical process with a sequence of intent variables sampled from a semantic-level LSTM before sentence construction, or a structured generation model in which a latent grammar is used to produce the outputs implicitly (Yogatama et al., 2016).

The RNNLG, NDM, and LIDM all rely on delexicalisation (Henderson et al., 2014c; Mrkšić et al., 2015), the process of replacing domain-specific content words or phrases with placeholders so that the model does not need to learn to fill in these content words during training. It is a powerful technique that allows massive parameter sharing between homogeneous slots and values in the ontology. However, delexicalisation is a rather crude process because it usually depends on exact string matching. This restriction prevents the model from generating complex linguistic phenomena, such as referential expressions or value-based comparisons. The simple string matching could also cause problems¹ when the domain becomes large, and therefore limits the scalability of the system.
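The exact-string-matching delexicalisation described above can be sketched in a few lines; the `<v.slot>` placeholder naming is illustrative, not the exact convention used in the thesis.

```python
def delexicalise(sentence, slot_values):
    """Replace slot-value surface forms with slot placeholders by exact
    string matching; longer values are matched first so multi-word
    values are not partially replaced."""
    items = sorted(slot_values.items(), key=lambda kv: len(kv[1]), reverse=True)
    for slot, value in items:
        sentence = sentence.replace(value, f"<v.{slot}>")
    return sentence

# e.g. delexicalise("curry prince serves indian food",
#                   {"name": "curry prince", "food": "indian"})
# -> "<v.name> serves <v.food> food"
```

Note that this tiny example already exposes the crudeness discussed above: if "chinese" appears in a domain both as a food type and as an ethnic group, plain string matching cannot decide which slot placeholder to assign.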
¹For example, in a domain that contains "Chinese" both as a food type and as an ethnic group, delexicalisation can be confusing.

One possible solution is to apply a pointer network (Vinyals et al., 2015) with soft string matching, so that the model can learn to softly match domain-specific words and copy them into the output sentence. The problem of softly matching different surface forms of the same semantics could potentially be addressed with semantically specialised word vectors (Faruqui et al., 2015; Mrkšić et al., 2016a), which are currently lacking in this setting.

For the NDM and LIDM, one interesting direction would be to see how much can be gained from learning via direct interactions with real users using reinforcement learning. Although a simple corpus-based RL setting has been studied for the LIDM in this thesis, RL in the real world is a completely different story, and there are many foreseeable difficulties. Firstly, defining the reward itself is problematic, because the reward is typically implicit (Weston, 2016) and must be inferred by a model (Su et al., 2015). Secondly, knowing when to learn and when not to is also difficult; as system developers, we may want to prevent the model from picking up undesired behaviours such as swearing or impolite responses. Finally, whether the model can learn efficiently in a real-world setting is also an open question, although there has been much recent work on sample-efficient RL approaches (Su et al., 2017).

The discrete latent variable employed in the LIDM can model categorical dialogue intentions. However, the result may still be sub-optimal and insufficient for capturing real-world, complex conversations, since there are many intentions with varying degrees of similarity. Therefore, a structured representation of the latent variables could be beneficial.
For example, one possible configuration of a structured latent variable might consist of a few categorical distributions and a set of Bernoulli distributions, loosely mimicking a dialogue act taxonomy. Such a structured latent variable could also be self-supervised via the snapshot learning technique introduced in this thesis, so that learning can still proceed in a semi-supervised fashion to guarantee the robustness of the model.

Another limitation of the NDM and LIDM, and of end-to-end dialogue modelling in general, is ensuring that the model is robust to speech recognition errors. End-to-end dialogue systems typically learn dialogue strategies concurrently with language from a collection of text corpora that is usually noise-free. This is problematic if the system operates via a speech interface with background noise that makes recognition errors inevitable: the model may behave greedily, as if it were operating in a noise-free environment, even when it is not sure what the user is talking about. One possible solution is to expose the wizards to noisy speech conditions during WoZ data collection, so that the model can also learn, for example, confirmation behaviour from humans.

Despite the current limitations, the proposed RNNLG, NDM, and LIDM may be particularly useful for system designers looking to build dialogue agents with extensive capabilities. The three methods were proposed with a complementary data collection methodology in mind to reduce the cost of data collection and annotation. In addition, they are all based on a flexible neural network framework that allows rapid learning in new domains. An open problem is the extent to which extensible models like these can allow dialogue agents to continuously improve their capabilities and acquire knowledge, by learning either from human-human conversations or via direct interactions.
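As a concrete illustration of the structured latent variable suggested in Section 9.1, the sketch below draws one categorical act type together with a set of independent Bernoulli slot indicators. The distributions, names, and factorisation are purely illustrative assumptions, not a component of the thesis models.

```python
import random

def sample_structured_intent(act_probs, slot_probs, rng=random):
    """Sample a structured latent intention: a categorical draw for the
    act type plus Bernoulli draws for slot inclusion, loosely mimicking
    a dialogue act such as inform(area=..., food=...)."""
    acts, probs = zip(*act_probs.items())
    act = rng.choices(acts, weights=probs)[0]       # categorical component
    slots = [s for s, p in slot_probs.items()
             if rng.random() < p]                   # Bernoulli components
    return act, slots
```

Compared with a single flat categorical variable of dimension I, this factorised form can represent combinatorially many intentions with far fewer parameters, which is the motivation given above.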
Appendix A Slot-based Dialogue Domains

Goal-oriented dialogue systems typically use a slot-filling mechanism to conduct a dialogue. The slots of a slot-based dialogue system specify the domain the system can talk about and the tasks it can help users accomplish. The slots also define the set of actions the system can take, the set of possible semantics of the user utterances, and the possible dialogue states. Most slots come with values drawn from a set of slot-specific attributes in a domain.

In the information-seeking dialogue scenario, the goal is to help users search a database for items by specifying constraints: the slots are the attributes of the items in the database, and a set of slot-value pairs forms a search query. There are typically two types of slots, which together constitute the set of all slots S = Sinf ∪ Sreq, where Sinf is the set of informable slots and Sreq is the set of requestable slots. Informable slots are attributes of the entities in the database that users can use to constrain a search. Requestable slots, on the other hand, are attributes whose values users can ask for, but which may not be used as part of the search constraint. A common example of a requestable slot is the address of an entity, which may not be specified as a search constraint ("I'm looking for a restaurant that is located at 29 North Street.") but whose value the user may ask for ("Can I have the address of the restaurant you just mentioned?"). These two sets of slots are not necessarily disjoint: informable slots are typically also requestable, while some requestable slots are not informable.

The four domains used for evaluating the proposed approaches in this thesis are the restaurant, hotel, laptop, and TV domains. The restaurant and hotel domains were collected and experimented with in Wen et al. (2015c) and concern finding a venue (either a restaurant or a hotel) in the San Francisco area.
The laptop and TV domains were proposed and evaluated in Wen et al. (2015b) and are concerned with finding a product (either a laptop or a TV) to buy. A summary of the four domains is given in Table A.1.

Sinf                 Restaurant  Hotel  Laptop  TV
price range          3           4      3       3
area                 155         155    -       -
near                 39          28     -       -
food                 59          -      -       -
good for meal        4           -      -       -
kids allowed         2           -      -       -
has internet         -           2      -       -
accept cards         -           2      -       -
dogs allowed         -           2      -       -
family               -           -      4       12
battery rating       -           -      3       -
drive range          -           -      3       -
weight range         -           -      3       -
is for business      -           -      2       -
screen size range    -           -      -       3
eco rating           -           -      -       5
hdmi port            -           -      -       4
has usb port         -           -      -       2

Sreq\Sinf            Restaurant  Hotel  Laptop  TV
type                 1           1      1       1
name                 239         180    123     94
price                96          -      79      11
address              239         180    -       -
phone                239         180    -       -
postcode             21          19     -       -
warranty             -           -      3       -
battery              -           -      16      -
design               -           -      20      -
dimension            -           -      19      -
utility              -           -      7       -
weight               -           -      24      -
platform             -           -      5       -
memory               -           -      7       -
drive                -           -      7       -
processor            -           -      10      -
resolution           -           -      -       3
power consumption    -           -      -       24
accessories          -           -      -       7
color                -           -      -       22
screen size          -           -      -       12
audio                -           -      -       3

Table A.1 Slots in the restaurant, hotel, laptop, and TV domains respectively. The numbers are the value set sizes ||Vs||; "-" means the slot is absent from the domain. All the informable slots are also requestable; the Sreq\Sinf group shows the requestable slots that are not informable. Note that the type slot contains only one value, which is the domain string.

Appendix B Dialogue Act Format

A dialogue act can be considered a shallow representation of the semantics of either a user's utterance or a system's prompt. On the input side, the dialogue system understands natural language by mapping text onto one of the dialogue act taxonomies to represent the input semantics. On the output side, dialogue acts represent system actions or intentions that are later transcribed back into text by the Natural Language Generation component. Therefore, both the input and output semantics are constrained to a particular scope, facilitating rational system behaviour within a domain.
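Dialogue acts of this kind are commonly written in a shorthand such as request(address, name = seven days, price range != expensive). Below is a minimal parser for that notation, a sketch only: it assumes values never contain commas or parentheses, which real data may violate.

```python
def parse_dialogue_act(text):
    """Parse shorthand like 'request(address, name = seven days,
    price range != expensive)' into (act_type, slot-value triples).
    The third element of each triple is False for a negated binding."""
    act_type, _, rest = text.partition("(")
    pairs = []
    for item in rest.rstrip(")").split(","):
        item = item.strip()
        if not item:
            continue
        if "!=" in item:                      # negated binding: s != v
            s, v = item.split("!=")
            pairs.append((s.strip(), v.strip(), False))
        elif "=" in item:                     # bound slot: s = v
            s, v = item.split("=")
            pairs.append((s.strip(), v.strip(), True))
        else:                                 # unbound slot: s
            pairs.append((item, None, True))
    return act_type.strip(), pairs
```

Running it on the example above yields act type "request" with the unbound slots address and postcode and the bound (or negated) slot-value pairs preserved.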
Multiple formats have been proposed and employed for representing dialogue acts (Traum, 2000). This appendix describes the dialogue act format used by the Cambridge University Dialogue Systems Group (Young, 2007), which is a relatively general format for representing the semantics of slot-based, goal-oriented dialogues. A dialogue act consists of two components: a dialogue act type da, and a set of slot-value pairs. As mentioned in Appendix A, the set of all possible slots within a domain is the union S = Sinf ∪ Sreq of the set of all informable slots Sinf and the set of all requestable slots Sreq. In a dialogue act, each slot s ∈ S can either bind to a value ("s = v"), bind to a particular value but negated ("s ≠ v") (bounded slots), or bind to no value at all and simply be written as "s" (unbounded slots), where v ∈ Vs and Vs denotes the set of all possible values for slot s.

Consider a simple example with da = request and the set of slot-value pairs {name = seven days, food = chinese, price range != expensive, address, postcode}. This dialogue act can be written in shorthand notation as "request(address, postcode, name = seven days, food = chinese, price range != expensive)". It represents the abstract meaning of the description "Can you tell me the address and postcode of the inexpensive chinese restaurant called Seven Days?". The two unbounded slots are address and postcode, and the other slots are all bounded to a particular value.

Different dialogue act types impose restrictions on what must be contained in their slot-value pairs, and they are used differently depending on whether they occur in a user request or a system prompt. Since this thesis focuses on the Natural Language Generation component of the system, we concentrate on dialogue acts for system prompts, as shown in Table B.1. Note that Table B.1 is grouped by the restrictions imposed on the corresponding slot-value pairs.
Restrictions on the s-v pairs Z, act types, and descriptions:

Z must be empty
  hello              A welcome message to start the dialogue.
  bye                A goodbye message to end the dialogue.
  reqmore            Asking whether more information is required.

Z = {s = v, ...} contains only bounded slots
  inform             Informing the requested information of a given entity as specified in Z.
  inform_all         Informing that all the entities with the specified constraints fulfil a requirement.
  inform_no_info     Informing that the system cannot help with that specific information.
  inform_no_match    Informing that there are no entities matching the constraint.
  confirm            Confirming that the user wants a venue matching the bounded slot constraints.

Z contains one name slot plus a set of bounded slots
  recommend          Offering or recommending an entity and informing its attributes as provided in Z.
  inform_only_match  Informing that the only match is a particular entity.

Z contains one count slot plus a set of bounded slots
  inform_count       Informing that there are count entities matching the constraint.

Z = {s, ...} may contain only unbounded slots
  request            Requesting what the user wants for the unbounded slots.

Z = {s = v0, s = v1, ...} may contain several pairs of the same slot with different values
  select             Asking the user to select between the two suggested values for the slot.
  suggest            Asking the user to select between the three suggested values for the slot.

Z = {n = n0, n = n1, s = v0, s = v1, ...} contains two name slots and two sets of bounded slots
  compare            Comparing two entities with the same set of attributes.

Table B.1 The set of dialogue acts used in RNNLG. Note that the dialogue act definition here differs from Young (2007) in that the inform act is broken down into specific categories to facilitate data collection and learning.

Appendix C Example of a Template-based Generator

The template-based generator introduced here is based on a recursive, top-down realisation strategy.
Each rule is a mapping from a concept-level string either to a final surface form that contains only words, or to an intermediate sentence structure that consists of both vocabulary words and lower-level concept strings. Based on the input dialogue act, the generator first chooses the root rule to apply by matching the input dialogue act against a set of pre-defined rules. Once a matching root rule is found, it is applied to map the dialogue act into the target string. If the mapped string contains only words, it is treated directly as the final surface form and passed on to the speech synthesis component. However, if lower-level concept strings remain, the generator applies the rules to each such string recursively until no abstract concept string is left.

Table C.1 and Table C.2 show a few snapshots of the template-based generator used in the restaurant domain. Table C.1 shows a set of root rules for mapping the input dialogue act into a target sentence structure. Both the slot and the value can be variables (annotated by the $ sign); these variables are realised directly by replacing the $-variable in the sentence with the corresponding value specified in the dialogue act. Whenever a % sign is encountered, the generator looks up another rule in the rule set and recursively maps the result back to natural language, making use of the lower-level concept rules shown in Table C.2. Depending on the value held by the input dialogue act, a concept can be mapped to different realisations.

Root Rules
  request(area)                       What part of town do you have in mind?
  request(name)                       What is the name of the place you are looking for?
  request(pricerange)                 Would you like something in the cheap, moderate, or expensive price range?
  request(food)                       What kind of food would you like?
  ...
  request(area, option=$X)            What part of town do you have in mind? For example, something in the area of $X?
  request(name, option=$X)            What is the name of the place you are looking for? For example, $X?
  request(food, option=$X)            What kind of food would you like? For example, $X?
  ...
  inform(name=$X,$Y=$O)               $X is a nice place %$Y_inf($O).
  inform(name=$X,$Y=$O,$Z=$P)         $X is a nice place, %$Y_inf($O) and %$Z_inf($P).
  inform(name=$X,$Y=$O,$Z=$P,$W=$R)   $X is a nice place, %$Y_inf($O), %$Z_inf($P) and %$W_inf($R).
  ...
  select(food=$X, food=dontcare)      Sorry would you like $X food or you don't care?
  select(food=$X, food=$Y)            Sorry would you like $X or $Y food?
  select($X=$Y,$X=dontcare)           Sorry would you like $Y for $X or you don't care?
  select($X=$Y,$X=$Z)                 Sorry would you like $Y or $Z for $X?
  ...

Table C.1 A sample snapshot of the root rules of the template-based generator used in this thesis. Note that both the slot and the value can be variables. Matching is done by matching as many slot-value pairs as possible.

Lower Level Concept Rules
  area_inf($area)
    $area=centre       it is in the centre of town
    $area=north        it is in the north part of town
    $area=south        it is in the south part of town
    ...
    $area=none         I don't know the area
    $area=$area        it is in the area of $area
  dogsallowed_str($allow)
    $allow=dontcare    if you don't care whether dogs are allowed
    $allow=1           where dogs are allowed
    $allow=0           where no dogs are allowed
    $allow=none        I don't know if it allows dogs
  phone_str($phone)
    $phone=none        I don't know their phone number
    $phone=$phone      Their phone number is $phone
  address_str($address)
    $address=none      I don't know their address
    $address=$addr     Their address is $addr
  postcode_str($code)
    $code=none         I don't know their post code
    $code=$code        Their postcode is $code
  ...

Table C.2 A sample snapshot of the lower-level rules.

Appendix D The Wizard-of-Oz website

Fig. D.1 The user webpage. The worker who plays the user is given a task to follow.
For each MTurk HIT, the worker must type an appropriate sentence to carry on the dialogue, based on both the task description and the dialogue history.

Fig. D.2 The wizard page. The wizard's job is slightly more complex: the worker needs to go through the dialogue history, fill in the form (top green box) by interpreting the user input at this turn, and type in an appropriate response based on the history and the DB result (bottom green box). The DB search result is updated when the form is submitted. The form is divided into informable slots (top) and requestable slots (bottom), which together contain all the labels needed to train the trackers.

Acronyms

AMT Amazon Mechanical Turk. 3, 50, 51, 54, 57
ANN Artificial Neural Network. 27
ASR Automatic Speech Recognition. 6, 7, 9, 10
BPTT Back-propagation through time. 30, 34, 35
CCG Combinatory Categorial Grammar. 7
CNN Convolutional Neural Network. xiv, 7–9, 35–37, 39
CV Computer Vision. 62
DA dialogue act. xiii, xiv, 7, 8, 12, 14, 18, 19, 21, 22, 24, 25, 43–49, 51–54, 57, 59, 60
DBN Dynamic Bayesian Network. xiii, 8, 9, 19, 20, 25, 53
DNN Deep Neural Network. 62
DST Dialogue State Tracking. 9, 10
DSTC Dialogue State Tracking Challenge. 9, 10
ENC-DEC Attention-based Encoder Decoder. xiv, 3, 48, 49, 53, 56, 58, 59
FNN Feed-forward Neural Network. xiv, 27, 30, 36, 37
GP Gaussian Process. 11
H-LSTM Heuristically Gated LSTM. xiv, 3, 45, 46, 56–60
HMM Hidden Markov Model. 6
kNN k-Nearest Neighbour. 21, 52
LIDM Latent Intention Dialogue Model. 4
LSTM Long Short-term Memory. xiv, 32–34, 45–50, 53, 57, 59
MDP Markov Decision Process. 10
MLE Maximum Likelihood Estimation. 37
MLP Multi-layer Perceptron. 38, 39
MT Machine Translation. 22, 23, 32, 35, 48
NDM Neural Dialogue Model. 4
NLG Natural Language Generation. xiii, 2–4, 11–17, 19–25, 52, 61, 62, 64, 66, 74, 77, 78
NLP Natural Language Processing. 48, 62
NN Neural Network. 38, 63
NVI Neural Variational Inference.
38, 39
POMDP Partially Observable Markov Decision Process. 4, 10
RL Reinforcement Learning. 11
RNN Recurrent Neural Network. xiii, 2, 3, 6, 8, 9, 25, 30–35, 38, 39, 43, 44, 48, 64, 74
RNNLG Recurrent Neural Network Language Generation. xiv, xviii, 3, 43, 44, 52–54, 57–59, 66, 74, 79
RST Rhetorical Structure Theory. 14
SC-LSTM Semantically Conditioned LSTM. xiv, 3, 4, 45, 47, 48, 56–60, 68, 73, 74
SDS Spoken Dialogue System. 3, 5, 6, 11
SGD Stochastic Gradient Descent. 29
SLU Spoken Language Understanding. 7–10
SVD Singular Value Decomposition. 63
SVM Support Vector Machine. 21, 63
VAE Variational Autoencoder. 4, 38, 39
WoZ Wizard-of-Oz. xi, 85, 86

References

Allen, J. (1995). Natural language understanding. Pearson.

Amodei, D., Anubhai, R., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Chen, J., Chrzanowski, M., Coates, A., Diamos, G., Elsen, E., Engel, J., Fan, L., Fougner, C., Han, T., Hannun, A. Y., Jun, B., LeGresley, P., Lin, L., Narang, S., Ng, A. Y., Ozair, S., Prenger, R., Raiman, J., Satheesh, S., Seetapun, D., Sengupta, S., Wang, Y., Wang, Z., Wang, C., Xiao, B., Yogatama, D., Zhan, J., and Zhu, Z. (2016). Deep speech 2: End-to-end speech recognition in English and Mandarin. In ICML, pages 173–182. JMLR.org.

Angeli, G., Liang, P., and Klein, D. (2010). A simple domain-independent probabilistic approach to generation. In EMNLP, pages 502–512, Cambridge, MA. ACL.

Auli, M. and Gao, J. (2014). Decoder integration and expected BLEU training for recurrent neural network language models. In Proceedings of the 52nd Annual Meeting of the ACL (Volume 2: Short Papers), pages 136–142, Baltimore, Maryland. ACL.

Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In ICLR.

Baldi, P., Brunak, S., Frasconi, P., Pollastri, G., and Soda, G. (1999). Exploiting the past and the future in protein secondary structure prediction. Bioinformatics, 15:937–946.
Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.

Belz, A. (2006). Comparing automatic and human evaluation of NLG systems. In Proceedings of EACL 2006, pages 313–320.

Belz, A. (2008). Automatic generation of weather forecast texts using comprehensive probabilistic generation-space models. Natural Language Engineering.

Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research.

Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. Neural Networks, IEEE Transactions on.

Berant, J., Chou, A., Frostig, R., and Liang, P. (2013). Semantic parsing on Freebase from question-answer pairs. In EMNLP.

Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation.

Blitzer, J., McDonald, R., and Pereira, F. (2006). Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP '06, pages 120–128, Stroudsburg, PA, USA. Association for Computational Linguistics.

Bohus, D. and Rudnicky, A. I. (1999). Recent Trends in Discourse and Dialogue, chapter Sorry, I didn't catch that! Springer.

Bordes, A., Usunier, N., Chopra, S., and Weston, J. (2015). Large-scale simple question answering with memory networks. CoRR, abs/1506.02075.

Bordes, A. and Weston, J. (2017). Learning end-to-end goal-oriented dialog. In ICLR.
Bunt, H., Alexandersson, J., Carletta, J., Choe, J.-W., Chengyu Fang, A., Hasida, K., Lee, K., Petukhova, V., Popescu-Belis, A., Romary, L., Soria, C., and Traum, D. (2010). Towards an ISO standard for dialogue act annotation. In Seventh Conference on International Language Resources and Evaluation (LREC'10), La Valette, Malta.

Busemann, S. and Horacek, H. (1998). A flexible shallow approach to text generation. CoRR, cs.CL/9812018.

Cao, K. and Clark, S. (2017). Latent variable dialogue models and their diversity. In EACL.

Celikyilmaz, A. and Hakkani-Tur, D. (2015). Convolutional neural network based semantic tagging with entity embeddings. In NIPS Workshop on Machine Learning for SLU and Interaction.

Chambers, N. and Allen, J. (2004). Stochastic language generation in a dialogue system: Toward a domain independent generator. Defense Technical Information Center.

Chatfield, K., Simonyan, K., Vedaldi, A., and Zisserman, A. (2014). Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference.

Chen, J. and Chaudhari, N. S. (2004). Capturing long-term dependencies for protein secondary structure prediction. In Advances in Neural Networks - ISNN 2004, International Symposium on Neural Networks, Part II, volume 3174 of Lecture Notes in Computer Science, pages 494–500. Springer.

Chen, Y., Hakkani-Tür, D. Z., and He, X. (2016). Zero-shot learning of intent embeddings for expansion by convolutional deep structured semantic models. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016, pages 6045–6049.

Cheyer, A. and Guzzoni, D. (2007). Method and apparatus for building an intelligent automated assistant. US Patent App. 11/518,292.

Chiang, D., Andreas, J., Bauer, D., Hermann, K. M., Jones, B., and Knight, K. (2013). Parsing graphs with hyperedge replacement grammars. In Proceedings of the 51st Annual Meeting of the ACL.
Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, pages 1724–1734, Doha, Qatar. ACL.

Chung, J., Gülçehre, Ç., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555.

Ciresan, D., Meier, U., Gambardella, L., and Schmidhuber, J. (2011). Convolutional neural network committees for handwritten character classification. In Document Analysis and Recognition (ICDAR), 2011 International Conference on, pages 1135–1139.

Collins, M. (2002). Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, EMNLP '02, pages 1–8, Stroudsburg, PA, USA. Association for Computational Linguistics.

Collobert, R. and Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 160–167, New York, NY, USA. ACM.

Craven, M. and Kumlien, J. (1999). Constructing biological knowledge bases by extracting information from text sources. In ISMB.

Cuayahuitl, H., Dethlefs, N., Hastie, H., and Liu, X. (2014). Training a statistical surface realiser from automatic slot labelling. In 2014 IEEE Spoken Language Technology Workshop (SLT), pages 112–117.

Daumé III, H. (2007). Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 256–263, Prague, Czech Republic. Association for Computational Linguistics.

Deng, L., Li, J., Huang, J.-T., Yao, K., Yu, D., Seide, F., Seltzer, M., Zweig, G., He, X., Williams, J., Gong, Y., and Acero, A. (2013).
Recent advances in deep learning for speech research at Microsoft. In ICASSP.

van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-SNE. JMLR.

Dethlefs, N., Hastie, H., Cuayahuitl, H., and Lemon, O. (2013). Conditional random fields for responsive surface realisation using global features. In ACL.

Doddington, G. (2002). Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research, HLT '02, pages 138–145, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Doersch, C. (2016). Tutorial on variational autoencoders. ArXiv e-prints.

Dušek, O. and Jurcicek, F. (2015). Training a natural language generator from unaligned data. In ACL, pages 451–461, Beijing, China.

Dušek, O. and Jurcicek, F. (2016). Sequence-to-sequence generation for spoken dialogue via deep syntax trees and strings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 45–51, Berlin, Germany. Association for Computational Linguistics.

Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2):179–211.

Epstein, R. (1992). The quest for the thinking computer. AI Mag., 13(2):81–95.

Espinosa, D., White, M., and Mehay, D. (2008). Hypertagging: Supertagging for surface realization with CCG. In Proceedings of the Annual Meeting of the ACL: Human Language Technologies (HLT), pages 183–191.

Faruqui, M., Dodge, J., Jauhar, S. K., Dyer, C., Hovy, E., and Smith, N. A. (2015). Retrofitting word vectors to semantic lexicons.
In NAACL-HLT, pages 1606–1615, Denver, Colorado. Association for Computational Linguistics.

Frühwirth-Schnatter, S. (1994). Data augmentation and dynamic linear models. Journal of Time Series Analysis, 15(2):183–202.

Gales, M. and Woodland, P. (1996). Mean and variance adaptation within the MLLR framework. Computer Speech and Language, 10(4):249–264.

Gales, M. and Young, S. (2007). The application of hidden Markov models in speech recognition. Found. Trends Signal Process., 1(3):195–304.

Gašić, M., Breslin, C., Henderson, M., Kim, D., Szummer, M., Thomson, B., Tsiakoulis, P., and Young, S. (2013). On-line policy optimisation of Bayesian spoken dialogue systems via human interaction. In ICASSP.

Gašić, M., Mrkšić, N., Su, P.-H., Vandyke, D., Wen, T.-H., and Young, S. J. (2015). Policy committee for adaptation in multi-domain spoken dialogue systems. In ASRU.

Gašić, M. and Young, S. (2014). Gaussian processes for POMDP-based dialogue manager optimization. IEEE/ACM Transactions on Audio, Speech and Language Processing.

Gauvain, J.-L. and Lee, C.-H. (1994). Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing, 2(2):291–298.

Glynn, P. W. (1990). Likelihood ratio gradient estimation for stochastic systems. Commun. ACM, 33(10):75–84.

Goddeau, D., Meng, H., Polifroni, J., Seneff, S., and Busayapongchai, S. (1996). A form-based dialogue manager for spoken language applications. In Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International Conference on, volume 2, pages 701–704.

Goodman, J. T. (2001). A bit of progress in language modeling. Computer Speech and Language, 15(4):403–434.

Graves, A. and Jaitly, N. (2014). Towards end-to-end speech recognition with recurrent neural networks. In Jebara, T. and Xing, E. P., editors, ICML, pages 1764–1772. JMLR Workshop and Conference Proceedings.
Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., and Schmidhuber, J. (2009). A novel connectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on.

Graves, A., Mohamed, A., and Hinton, G. E. (2013a). Speech recognition with deep recurrent neural networks. CoRR, abs/1303.5778.

Graves, A., Mohamed, A.-r., and Hinton, G. (2013b). Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on.

Graves, A., Wayne, G., and Danihelka, I. (2014). Neural Turing machines. CoRR, abs/1410.5401.

Hammer, B. (1998). On the approximation capability of recurrent neural networks. In International Symposium on Neural Computation, pages 12–4.

He, X. and Deng, L. (2012). Maximum expected BLEU training of phrase and lexicon translation models. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL '12, pages 292–301, Stroudsburg, PA, USA. Association for Computational Linguistics.

He, Y. and Young, S. (2006). Spoken language understanding using the hidden vector state model. Speech Communication, 48(3-4):262–275.

Heidel, A. and Lee, L.-s. (2007). Robust topic inference for latent semantic language model adaptation. In 2007 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 177–182.

Henderson, M. (2015a). Discriminative Methods for Statistical Spoken Dialogue Systems. PhD thesis, University of Cambridge.

Henderson, M. (2015b). Machine learning for dialog state tracking: A review. In Machine Learning in Spoken Language Processing Workshop.

Henderson, M., Thomson, B., and Williams, J. (2014a). The Second Dialog State Tracking Challenge. In Proceedings of SIGdial.

Henderson, M., Thomson, B., and Williams, J. (2014b). The Third Dialog State Tracking Challenge. In Proceedings of IEEE Spoken Language Technology.
Henderson, M., Thomson, B., and Young, S. (2013). Deep Neural Network Approach for the Dialog State Tracking Challenge. In Proceedings of SIGdial.

Henderson, M., Thomson, B., and Young, S. (2014c). Word-based dialog state tracking with recurrent neural networks. In SIGdial, pages 292–299, Philadelphia, PA, U.S.A. ACL.

Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. (2017). beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR.

Hill, F., Bordes, A., Chopra, S., and Weston, J. (2016). The goldilocks principle: Reading children's books with explicit memory representations. In ICLR.

Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., and Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition. Signal Processing Magazine.

Hochreiter, S., Bengio, Y., Frasconi, P., and Schmidhuber, J. (2001). Gradient flow in recurrent nets: the difficulty of learning long-term dependencies.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation.

Hoffmann, R., Zhang, C., Ling, X., Zettlemoyer, L., and Weld, D. S. (2011). Knowledge-based weak supervision for information extraction of overlapping relations. In ACL.

Hogan, D., Foster, J., Wagner, J., and van Genabith, J. (2008). Parser-based retraining for domain adaptation of probabilistic generators. In INLG, pages 165–168, Stroudsburg, PA, USA. ACL.

Hubel, D. H. and Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. The Journal of Physiology, 195(1):215–243.

Jaeger, H. (2001). The "echo state" approach to analysing and training recurrent neural networks. GMD Report 148, GMD - German National Research Institute for Computer Science.

Jang, E., Gu, S., and Poole, B. (2016). Categorical reparameterization with Gumbel-softmax. In ICLR.

Jeong, M.
and Geunbae Lee, G. (2008). Triangular-chain conditional random fields. Trans. Audio, Speech and Lang. Proc., 16(7):1287–1302.

Jordan, M. I. (1989). Serial order: A parallel, distributed processing approach. In Advances in Connectionist Theory: Speech. Lawrence Erlbaum Associates.

Jordan, M. I. (1990). Attractor dynamics and parallelism in a connectionist sequential machine. In Diederich, J., editor, Artificial Neural Networks, pages 112–127. IEEE Press, Piscataway, NJ, USA.

Jozefowicz, R., Zaremba, W., and Sutskever, I. (2015). An empirical exploration of recurrent network architectures. Journal of Machine Learning Research.

Jurčíček, F., Thomson, B., and Young, S. (2011). Natural actor and belief critic: Reinforcement algorithm for learning parameters of dialogue systems modelled as POMDPs. ACM Trans. Speech Lang. Process., 7(3):6:1–6:26.

Kalchbrenner, N., Grefenstette, E., and Blunsom, P. (2014). A convolutional neural network for modelling sentences. ACL.

Karpathy, A. and Fei-Fei, L. (2014). Deep visual-semantic alignments for generating image descriptions. CoRR.

Kate, R. J. and Mooney, R. J. (2006). Using string-kernels for learning semantic parsers. In ACL 2006: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the ACL, pages 913–920, Morristown, NJ, USA. Association for Computational Linguistics.

Kelley, J. F. (1984). An iterative design methodology for user-friendly natural language office information applications. ACM Transactions on Information Systems.

Kim, S., D'Haro, L. F., Banchs, R. E., Williams, J., and Henderson, M. (2016a). The Fourth Dialog State Tracking Challenge. In Proceedings of the 7th International Workshop on Spoken Dialogue Systems (IWSDS).

Kim, S., D'Haro, L. F., Banchs, R. E., Williams, J., Henderson, M., and Yoshino, K. (2016b). The Fifth Dialog State Tracking Challenge. In Proceedings of the 2016 IEEE Workshop on Spoken Language Technology (SLT).

Kim, Y.
(2014). Convolutional neural networks for sentence classification. In EMNLP.

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In ICLR.

Konda, V. R. and Tsitsiklis, J. N. (2003). On actor-critic algorithms. SIAM J. Control Optim., 42(4):1143–1166.

Kondadadi, R., Howald, B., and Schilder, F. (2013). A statistical NLG framework for aggregated planning and realization. In ACL, pages 1406–1415, Sofia, Bulgaria. ACL.

Konstas, I. and Lapata, M. (2012). Unsupervised concept-to-text generation with hypergraphs. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT '12, pages 752–761, Stroudsburg, PA, USA. Association for Computational Linguistics.

Kočiský, T., Melis, G., Grefenstette, E., Dyer, C., Ling, W., Blunsom, P., and Hermann, K. M. (2016). Semantic parsing with semi-supervised sequential autoencoders. In EMNLP, pages 1078–1087, Austin, Texas. Association for Computational Linguistics.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc.

Kukich, K. (1987). Where do phrases come from: Some preliminary experiments in connectionist phrase generation. In Natural Language Generation. Springer Netherlands.

Kuo, H. K. J., Fosler-Lussier, E., Jiang, H., and Lee, C. H. (2002). Discriminative training of language models for speech recognition. In 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages I–325–I–328.

Lafferty, J. D., McCallum, A., and Pereira, F. C. N. (2001).
Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 282–289, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Langkilde, I. and Knight, K. (1998). Generation that exploits corpus-based statistical knowledge. In ACL, pages 704–710, Montreal, Quebec, Canada. ACL.

Larsson, S. and Traum, D. R. (2000). Information state and dialogue management in the TRINDI dialogue move engine toolkit. Nat. Lang. Eng., 6(3-4):323–340.

Lavoie, B. and Rambow, O. (1997). A fast and portable realizer for text generation systems. In Proceedings of the Fifth Conference on Applied Natural Language Processing, ANLC '97, pages 265–268, Stroudsburg, PA, USA. Association for Computational Linguistics.

Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998a). Gradient-based learning applied to document recognition. Proceedings of the IEEE.

Lecun, Y., Bottou, L., Orr, G. B., and Muller, K.-R. (1998b). Efficient backprop.

Lee, B.-J., Lim, W., Kim, D., and Kim, K.-E. (2014). Optimizing generative dialog state tracker via cascading gradient descent. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 273–281, Philadelphia, PA, U.S.A. Association for Computational Linguistics.

Lee, S. and Eskenazi, M. (2013). Recipe for building robust spoken dialog state trackers: Dialog state tracking challenge system description. In Proceedings of the SIGDIAL 2013 Conference, pages 414–422, Metz, France. Association for Computational Linguistics.

Leggetter, C. J. and Woodland, P. C. (1995). Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech & Language, 9(2):171–185.

Lemon, O. (2008). Adaptive natural language generation in dialogue using reinforcement learning. In SemDial.

Levin, E. (1995).
Chronus, the next generation. Proc. ARPA Spoken Language Technology Workshop, pages 269–271.

Levin, E. and Pieraccini, R. (1997). A stochastic model of computer-human interaction for learning dialogue strategies. In EUROSPEECH 97, pages 1883–1886.

Li, J., Galley, M., Brockett, C., Gao, J., and Dolan, B. (2016a). A diversity-promoting objective function for neural conversation models. In NAACL-HLT.

Li, J., Galley, M., Brockett, C., Spithourakis, G., Gao, J., and Dolan, B. (2016b). A persona-based neural conversation model. In ACL, pages 994–1003, Berlin, Germany. Association for Computational Linguistics.

Li, J., Monroe, W., Ritter, A., Jurafsky, D., Galley, M., and Gao, J. (2016c). Deep reinforcement learning for dialogue generation. In EMNLP, pages 1192–1202, Austin, Texas. Association for Computational Linguistics.

Lin, C.-Y. and Hovy, E. (2003). Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, pages 71–78, Stroudsburg, PA, USA. Association for Computational Linguistics.

Liu, C., Lowe, R., Serban, I. V., Noseworthy, M., Charlin, L., and Pineau, J. (2016). How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. CoRR, abs/1603.08023.

Liu, J., Cyphers, D. S., Pasupat, P., McGraw, I., and Glass, J. R. (2012). A conversational movie search system based on conditional random fields. In INTERSPEECH.

Logan, B. (2000). Mel frequency cepstral coefficients for music modeling. In International Symposium on Music Information Retrieval.

Lucas, B. (2000). VoiceXML for web-based distributed conversational applications. Commun. ACM, 43(9):53–57.

Lyons, J. (2007). Artificial stupidity. In ACM SIGGRAPH 2007 Computer Animation Festival, SIGGRAPH '07, page 27, New York, NY, USA. ACM.
Mairesse, F., Gasic, M., Jurcicek, F., Keizer, S., Thomson, B., Yu, K., and Young, S. (2009). Spoken language understanding from unaligned data using discriminative classification models. In 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4749–4752.

Mairesse, F., Gasic, M., Jurcicek, F., Keizer, S., Thomson, B., Yu, K., and Young, S. (2010). Phrase-based statistical language generation using graphical models and active learning. In ACL, pages 1552–1561, Uppsala, Sweden. ACL.

Mairesse, F. and Walker, M. A. (2011). Controlling user perceptions of linguistic style: Trainable generation of personality traits. Computational Linguistics.

Mairesse, F. and Young, S. (2014). Stochastic language generation in dialogue using factored language models. Computational Linguistics.

Mangu, L., Brill, E., and Stolcke, A. (2000). Finding consensus among words: lattice-based word error minimisation. Computer Speech and Language, pages 373–400.

Mann, W. C. and Thompson, S. A. (1988). Rhetorical structure theory: Toward a functional theory of text organization. Text.

McCarthy, J., Minsky, M. L., Rochester, N., and Shannon, C. E. (1955). A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence.

McCulloch, W. S. and Pitts, W. (1988). A logical calculus of the ideas immanent in nervous activity. In Anderson, J. A. and Rosenfeld, E., editors, Neurocomputing: Foundations of Research, pages 15–27. MIT Press, Cambridge, MA, USA.

Mei, H., Bansal, M., and Walter, M. R. (2015). What to talk about and how? Selective generation using LSTMs with coarse-to-fine alignment. CoRR, abs/1509.00838.

Mel'čuk, I. (1988). Dependency Syntax: Theory and Practice. State University of New York Press.

Mesnil, G., Dauphin, Y., Yao, K., Bengio, Y., Deng, L., Hakkani-Tur, D., He, X., Heck, L., Tur, G., Yu, D., and Zweig, G. (2015). Using recurrent neural networks for slot filling in spoken language understanding. Trans.
Audio, Speech and Lang. Proc., 23(3):530–539.

Meza-Ruiz, I. V., Riedel, S., and Lemon, O. (2008). Accurate statistical spoken language understanding from limited development resources. In 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5021–5024.

Miao, Y. and Blunsom, P. (2016). Language as a latent variable: Discrete generative models for sentence compression. In EMNLP, pages 319–328, Austin, Texas. Association for Computational Linguistics.

Miao, Y., Yu, L., and Blunsom, P. (2016). Neural variational inference for text processing. In ICML.

Mikolov, T., Karafiát, M., Burget, L., Černocký, J., and Khudanpur, S. (2010). Recurrent neural network based language model. In InterSpeech.

Mikolov, T., Kombrink, S., Burget, L., Černocký, J., and Khudanpur, S. (2011). Extensions of recurrent neural network language model. In ICASSP, pages 5528–5531. IEEE.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26.

Mikolov, T. and Zweig, G. (2012). Context dependent recurrent neural network language model. In IEEE SLT.

Miller, A. H., Fisch, A., Dodge, J., Karimi, A.-H., Bordes, A., and Weston, J. (2016). Key-value memory networks for directly reading documents. In EMNLP.

Miller, S., Schwartz, R., Bobrow, R., and Ingria, R. (1994). Statistical language processing using hidden understanding models. In Proceedings of the Workshop on Human Language Technology, HLT '94, pages 278–282, Stroudsburg, PA, USA. Association for Computational Linguistics.

Miller, S., Stallard, D., Bobrow, R., and Schwartz, R. (1996). A fully statistical approach to natural language interfaces. In Proceedings of the 34th Annual Meeting on Association for Computational Linguistics, ACL '96, pages 55–61, Stroudsburg, PA, USA. Association for Computational Linguistics.

Minsky, M. (1961).
Steps toward artificial intelligence. Proceedings of the IRE, 49(1):8–30.

Mintz, M., Bills, S., Snow, R., and Jurafsky, D. (2009). Distant supervision for relation extraction without labeled data. In ACL.

Mirkovic, D. and Cavedon, L. (2011). Dialogue management using scripts. EP Patent 1,891,625.

Misu, T. and Kawahara, T. (2007). Speech-based interactive information guidance system using question-answering technique. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07, 4:IV–145–IV–148.

Mnih, A. and Gregor, K. (2014). Neural variational inference and learning in belief networks. In ICML.

Mnih, V., Heess, N., Graves, A., and Kavukcuoglu, K. (2014). Recurrent models of visual attention. In NIPS.

Mrkšić, N., Ó Séaghdha, D., Thomson, B., Gasic, M., Su, P.-H., Vandyke, D., Wen, T.-H., and Young, S. (2015). Multi-domain dialog state tracking using recurrent neural networks. In ACL, pages 794–799, Beijing, China. ACL.

Mrkšić, N., Ó Séaghdha, D., Wen, T.-H., Thomson, B., and Young, S. (2016a). Counter-fitting word vectors to linguistic constraints. In Proceedings of NAACL.

Mrkšić, N., Ó Séaghdha, D., Wen, T.-H., Thomson, B., and Young, S. (2016b). Neural belief tracker: Data-driven dialogue state tracking. In Proceedings of ACL.

Murveit, H., Butzberger, J., Digalakis, V., and Weintraub, M. (1993). Large-vocabulary dictation using SRI's DECIPHER speech recognition system: progressive search techniques. In 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 319–322.

Nguyen, L., Shimazu, A., and Phan, X.-H. (2006). Semantic parsing with structured SVM ensemble classification models. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 619–626.

Novikova, J., Dušek, O., Cercas Curry, A., and Rieser, V. (2017). Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2241–2252.
Association for Computational Linguistics.

Oh, A. H. and Rudnicky, A. I. (2000). Stochastic language generation for spoken dialogue systems. In NAACL Workshop on Conversational Systems.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Parsons, T. (1990). Events in the Semantics of English: A Study in Subatomic Semantics. MIT Press.

Perez, J. (2016). Dialog state tracking, a machine reading approach using a memory-enhanced neural network. CoRR, abs/1606.04052.

Pieraccini, R. and Huerta, J. M. (2008). Where do we go from here? Research and commercial spoken dialogue systems. In Recent Trends in Discourse and Dialogue, volume 39 of Text, Speech and Language Technology, chapter 1, pages 1–24. Springer, Dordrecht.

Raiko, T., Berglund, M., Alain, G., and Dinh, L. (2014). Techniques for learning binary stochastic feedforward neural networks. CoRR, abs/1406.2989.

Rapaport, W. J. (1986). Logical foundations for belief representation. Cognitive Science, 10(4):371–422.

Ratnaparkhi, A. (2002). Trainable approaches to surface natural language generation and their application to conversational dialog systems. Computer Speech and Language.

Reiter, E. and Dale, R. (1997). Building applied natural language generation systems. Nat. Lang. Eng., 3(1):57–87.

Reiter, E. and Dale, R. (2000). Building Natural Language Generation Systems. Cambridge University Press, New York, NY, USA.

Ren, H., Xu, W., and Yan, Y. (2014). Markovian discriminative modeling for cross-domain dialog state tracking. In 2014 IEEE Spoken Language Technology Workshop (SLT), pages 342–347.

Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In ICML.

Rieser, V.
and Lemon, O. (2010). Natural language generation as planning under uncertainty for spoken dialogue systems. In EMNLG. Springer-Verlag.
Rieser, V. and Lemon, O. (2011). Reinforcement Learning for Adaptive Dialogue Systems: A Data-driven Methodology for Dialogue Management and Natural Language Generation. Springer Publishing Company, Incorporated.
Rieser, V., Lemon, O., and Keizer, S. (2014). Natural language generation as incremental planning under uncertainty: Adaptive information presentation for statistical dialogue systems. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(5):979–994.
Robinson, A. J. and Fallside, F. (1987). The utility driven dynamic error propagation network. Technical Report CUED/F-INFENG/TR.1, Cambridge University Engineering Department, Cambridge.
Rojas Barahona, L. M., Gašić, M., Mrkšić, N., Su, P.-H., Ultes, S., Wen, T.-H., and Young, S. (2016). Exploiting sentence and context representations in deep neural models for spoken language understanding. In COLING, pages 258–267, Osaka, Japan. The COLING 2016 Organizing Committee.
Roy, N., Pineau, J., and Thrun, S. (2000a). Spoken dialogue management using probabilistic reasoning. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, pages 93–100. Association for Computational Linguistics.
Roy, N., Pineau, J., and Thrun, S. (2000b). Spoken dialogue management using probabilistic reasoning. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, ACL '00, pages 93–100, Stroudsburg, PA, USA. Association for Computational Linguistics.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning internal representations by error propagation. In Rumelhart, D. E., McClelland, J. L., and PDP Research Group, C., editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, pages 318–362. MIT Press, Cambridge, MA, USA.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J.
(1988). Learning internal representations by error propagation. In Anderson, J. A. and Rosenfeld, E., editors, Neurocomputing: Foundations of Research, pages 673–695. MIT Press, Cambridge, MA, USA.
Sainath, T. N., Mohamed, A.-r., Kingsbury, B., and Ramabhadran, B. (2013). Deep convolutional neural networks for LVCSR. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on.
Schapire, R. E. (1999). A brief introduction to boosting. In Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI'99, pages 1401–1406, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
Schmidhuber, J. (1992). Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234–242.
Schuster, M. and Paliwal, K. (1997a). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.
Schuster, M. and Paliwal, K. K. (1997b). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing.
Searle, J. (1969). Speech Acts: An Essay in the Philosophy of Language. Cambridge University Press.
Seneff, S. and Polifroni, J. (2000). Dialogue management in the Mercury flight reservation system. In Proceedings of the 2000 ANLP/NAACL Workshop on Conversational Systems - Volume 3, ANLP/NAACL-ConvSyst '00, pages 11–16, Stroudsburg, PA, USA. Association for Computational Linguistics.
Serban, I. V., Lowe, R., Charlin, L., and Pineau, J. (2015a). A survey of available corpora for building data-driven dialogue systems. arXiv preprint arXiv:1512.05742.
Serban, I. V., Sordoni, A., Bengio, Y., Courville, A. C., and Pineau, J. (2015b). Hierarchical neural network generative models for movie dialogues. In AAAI.
Serban, I. V., Sordoni, A., Lowe, R., Charlin, L., Pineau, J., Courville, A., and Bengio, Y. (2016). A hierarchical latent variable encoder-decoder model for generating dialogues. arXiv preprint arXiv:1605.06069.
Shang, L., Lu, Z., and Li, H.
(2015). Neural responding machine for short-text conversation. In ACL, pages 1577–1586, Beijing, China. ACL.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489.
Singh, S., Kearns, M., Litman, D., and Walker, M. (1999). Reinforcement learning for spoken dialogue systems. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS'99, pages 956–962, Cambridge, MA, USA. MIT Press.
Smith, R. (2014). Comparative error analysis of dialog state tracking. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 300–309, Philadelphia, PA, U.S.A. Association for Computational Linguistics.
Snow, R., Jurafsky, D., and Ng, A. Y. (2004). Learning syntactic patterns for automatic hypernym discovery. In NIPS.
Steedman, M. (2000). The Syntactic Process. MIT Press, Cambridge, MA, USA.
Stent, A., Marge, M., and Singhai, M. (2005). Evaluating evaluation methods for generation in the presence of variation. In Proceedings of the 6th International Conference on Computational Linguistics and Intelligent Text Processing, pages 341–351, Berlin, Heidelberg. Springer-Verlag.
Stent, A. and Molina, M. (2009). Evaluating automatic extraction of rules for sentence plan construction. In SIGdial, pages 290–297, London, UK. ACL.
Stent, A., Prasad, R., and Walker, M. (2004). Trainable sentence planning for complex information presentations in spoken dialog systems. In ACL, pages 79–86, Barcelona, Spain. ACL.
Su, P., Gašić, M., Mrkšić, N., Rojas-Barahona, L. M., Ultes, S., Vandyke, D., Wen, T., and Young, S. J. (2016a). Continuously learning neural dialogue management. CoRR, abs/1606.02689.
Su, P.-H., Budzianowski, P., Ultes, S., Gašić, M., and Young, S. (2017).
Sample-efficient actor-critic reinforcement learning with supervised data for dialogue management. In Proceedings of SIGdial.
Su, P.-H., Gašić, M., Mrkšić, N., Rojas-Barahona, L., Ultes, S., Vandyke, D., Wen, T.-H., and Young, S. (2016b). On-line active reward learning for policy optimisation in spoken dialogue systems. In Proceedings of ACL.
Su, P.-H., Vandyke, D., Gašić, M., Kim, D., Mrkšić, N., Wen, T.-H., and Young, S. J. (2015). Learning from real users: Rating dialogue success with neural networks for reinforcement learning in spoken dialogue systems. In Interspeech.
Sukhbaatar, S., Szlam, A., Weston, J., and Fergus, R. (2015). End-to-end memory networks. In NIPS.
Sun, K., Chen, L., Zhu, S., and Yu, K. (2014). The SJTU system for dialog state tracking challenge 2. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 318–326, Philadelphia, PA, U.S.A. Association for Computational Linguistics.
Sundermeyer, M., Alkhouli, T., Wuebker, J., and Ney, H. (2014). Translation modeling with bidirectional recurrent neural networks. In EMNLP. ACL.
Sutskever, I., Martens, J., and Hinton, G. (2011). Generating text with recurrent neural networks. In ICML, ICML '11, pages 1017–1024, New York, NY, USA. ACM.
Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks. In NIPS, NIPS'14, pages 3104–3112, Cambridge, MA, USA. MIT Press.
Sutton, R. S. and Barto, A. G. (1998). Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition.
Sutton, S., Novick, D., Cole, R., Vermeulen, P., de Villiers, J., Schalkwyk, J., and Fanty, M. (1996). Building 10,000 spoken dialogue systems, volume 2, pages 709–712. IEEE.
Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398):528–540.
Taylor, P., Black, A. W., and Caley, R. (1998).
The architecture of the Festival speech synthesis system. In The Third ESCA Workshop in Speech Synthesis, pages 147–151.
Thomson, B. and Young, S. (2010). Bayesian update of dialogue state: A POMDP framework for spoken dialogue systems. Computer Speech and Language, 24(4):562–588.
Traum, D. R. (1999). Foundations of Rational Agency, chapter Speech Acts for Dialogue Agents. Springer.
Traum, D. R. (2000). 20 questions on dialogue act taxonomies. Journal of Semantics, 17(1):7–30.
Tsiakoulis, P., Breslin, C., Gašić, M., Henderson, M., Kim, D., Szummer, M., Thomson, B., and Young, S. (2014). Dialogue context sensitive HMM-based speech synthesis. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2554–2558.
Tur, G. and Deoras, A. (2013). Semantic parsing using word confusion networks with conditional random fields. In Proc. of Interspeech.
Turing, A. M. (1950). Computing machinery and intelligence. Mind, 59(236):433–460.
Van Deemter, K., Krahmer, E., and Theune, M. (2005). Real versus template-based natural language generation: A false opposition? Computational Linguistics, 31(1):15–24.
van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. CoRR, abs/1609.03499.
Vinyals, O., Fortunato, M., and Jaitly, N. (2015). Pointer networks. In NIPS.
Vinyals, O. and Le, Q. V. (2015). A neural conversational model. In ICML Deep Learning Workshop.
Walker, M., Stent, A., Mairesse, F., and Prasad, R. (2007). Individual and domain adaptation in sentence planning for dialogue. JAIR.
Walker, M. A., Rambow, O. C., and Rogati, M. (2002). Training a sentence planner for spoken dialogue using boosting. Computer Speech and Language.
Wang, Y.-Y. and Acero, A. (2006). Discriminative models for spoken language understanding. In ICSLP.
Wang, Z. and Lemon, O. (2013).
A simple and generic belief tracking mechanism for the dialog state tracking challenge: On the believability of observed information. In Proceedings of the SIGDIAL 2013 Conference, pages 423–432, Metz, France. Association for Computational Linguistics.
Ward, W. (1990). The CMU Air Travel Information Service: Understanding spontaneous speech. In Proceedings of the Workshop on Speech and Natural Language, HLT '90, pages 127–129, Stroudsburg, PA, USA. Association for Computational Linguistics.
Weizenbaum, J. (1966). ELIZA—a computer program for the study of natural language communication between man and machine. Communications of the ACM, 9(1):36–45.
Wen, T.-H., Gašić, M., Kim, D., Mrkšić, N., Su, P.-H., Vandyke, D., and Young, S. (2015a). Stochastic language generation in dialogue using recurrent neural networks with convolutional sentence reranking. In SIGdial, pages 275–284, Prague, Czech Republic. ACL.
Wen, T.-H., Gašić, M., Mrkšić, N., Su, P.-H., Vandyke, D., Rojas-Barahona, L. M., and Young, S. (2015b). Toward multi-domain language generation using recurrent neural networks. In NIPS Workshop of Machine Learning on SLU and Interaction.
Wen, T.-H., Gašić, M., Mrkšić, N., Rojas Barahona, L. M., Su, P.-H., Ultes, S., Vandyke, D., and Young, S. (2016a). Conditional generation and snapshot learning in neural dialogue systems. In EMNLP, pages 2153–2162, Austin, Texas. ACL.
Wen, T.-H., Gašić, M., Mrkšić, N., Su, P.-H., Vandyke, D., and Young, S. (2015c). Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In EMNLP, pages 1711–1721, Lisbon, Portugal. ACL.
Wen, T.-H., Gašić, M., Mrkšić, N., Rojas-Barahona, L. M., Su, P.-H., Vandyke, D., and Young, S. (2016b). Multi-domain neural network language generation for spoken dialogue systems. In NAACL-HLT, pages 120–129, San Diego, California. ACL.
Wen, T.-H., Heidel, A., Lee, H.-Y., Tsao, Y., and Lee, L.-S. (2013a).
Recurrent neural network based language model personalization by social network crowdsourcing. In Interspeech.
Wen, T.-H., Lee, H.-Y., Chen, T.-Y., and Lee, L.-S. (2012a). Personalized language modeling by crowd sourcing with social network data for voice access of cloud applications. In SLT.
Wen, T.-H., Lee, H.-Y., Su, P.-H., and Lee, L.-S. (2013b). Interactive spoken content retrieval by extended query model and continuous state space Markov decision process. In ICASSP, pages 8510–8514.
Wen, T.-H., Miao, Y., Blunsom, P., and Young, S. (2017a). Latent intention dialogue models. In ICML, ICML'17. JMLR.org.
Wen, T.-H., Vandyke, D., Mrkšić, N., Gašić, M., Rojas-Barahona, L. M., Su, P.-H., Ultes, S., and Young, S. (2017b). A network-based end-to-end trainable task-oriented dialogue system. In EACL.
Wen, T.-H., Vandyke, D., Mrkšić, N., Gašić, M., Rojas Barahona, L. M., Su, P.-H., Ultes, S., and Young, S. (2017c). A network-based end-to-end trainable task-oriented dialogue system. In EACL, pages 438–449, Valencia, Spain. Association for Computational Linguistics.
Wen, T.-H., Lee, H.-Y., and Lee, L.-S. (2012b). Interactive spoken content retrieval with different types of actions optimized by a Markov decision process. In Interspeech, pages 2458–2461. ISCA.
Werbos, P. J. (1988). Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4):339–356.
Werbos, P. J. (1990). Backpropagation through time: what it does and how to do it. Proceedings of the IEEE.
Weston, J., Chopra, S., and Bordes, A. (2014). Memory networks. CoRR, abs/1410.3916.
Weston, J. E. (2016). Dialog-based language learning. In NIPS, pages 829–837. Curran Associates, Inc.
White, M., Rajkumar, R., and Martin, S. (2007). Towards broad coverage surface realization with CCG. In Proc. of the Workshop on Using Corpora for NLG: Language Generation and Machine Translation (UCNLG+MT).
Williams, J., Raux, A., Ramachandran, D., and Black, A. (2013).
The dialog state tracking challenge. In Proceedings of the 14th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL).
Williams, J. D. (2010). Incremental partition recombination for efficient tracking of multiple dialog states. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5382–5385.
Williams, J. D. and Young, S. (2005). Scaling up POMDPs for dialog management: The “summary POMDP” method. In IEEE Workshop on Automatic Speech Recognition and Understanding, pages 177–182.
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.
Xia, R., Zong, C., Hu, X., and Cambria, E. (2013). Feature ensemble plus sample selection: Domain adaptation for sentiment classification. IEEE Intelligent Systems, 28(3):10–18.
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A. C., Salakhutdinov, R., Zemel, R. S., and Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. CoRR, abs/1502.03044.
Yang, Y. and Eisenstein, J. (2015). Unsupervised multi-domain adaptation with feature embeddings. In HLT-NAACL.
Yao, K., Peng, B., Zhang, Y., Yu, D., Zweig, G., and Shi, Y. (2014). Spoken language understanding using long short-term memory neural networks. In IEEE SLT.
Yin, P., Lu, Z., Li, H., and Kao, B. (2015). Neural enquirer: Learning to query tables. arXiv preprint arXiv:1512.00965.
Yogatama, D., Blunsom, P., Dyer, C., Grefenstette, E., and Ling, W. (2016). Learning to compose words into sentences with reinforcement learning. CoRR, abs/1611.09100.
Young, S. (2002). Talking to machines (statistically speaking). In ICSLP.
Young, S. (2007). CUED standard dialogue acts: http://mi.eng.cam.ac.uk/research/dialogue/LocalDocs/dastd.pdf. Technical report, Cambridge University Engineering Dept.
Young, S., Gašić, M., Thomson, B., and Williams, J. D. (2013).
POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE.
Young, S., Gašić, M., Keizer, S., Mairesse, F., Schatzmann, J., Thomson, B., and Yu, K. (2010). The hidden information state model: A practical framework for POMDP-based spoken dialogue management. Computer Speech and Language.
Young, S. and Proctor, C. (1989). The design and implementation of dialogue control in voice operated database inquiry systems. Computer Speech and Language, 3(4):329–353.
Zen, H., Nose, T., Yamagishi, J., Sako, S., Masuko, T., Black, A. W., and Tokuda, K. (2007). The HMM-based speech synthesis system (HTS) version 2.0. In Proceedings of the 6th ISCA Workshop on Speech Synthesis, pages 294–299. ISCA.
Zen, H., Senior, A., and Schuster, M. (2013). Statistical parametric speech synthesis using deep neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 7962–7966.
Zettlemoyer, L. S. and Collins, M. (2007). Online learning of relaxed CCG grammars for parsing to logical form. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), pages 678–687.
Zhang, X. and Lapata, M. (2014). Chinese poetry generation with recurrent neural networks. In EMNLP, pages 670–680, Doha, Qatar. ACL.
Zhang, X. and LeCun, Y. (2015). Text understanding from scratch. CoRR, abs/1502.01710.
Zhou, D. and He, Y. (2011). Learning Conditional Random Fields from Unaligned Data for Natural Language Understanding, pages 283–288. Springer Berlin Heidelberg, Berlin, Heidelberg.
Zue, V., Seneff, S., Glass, J., Polifroni, J., Pao, C., Hazen, T. J., and Hetherington, L. (2000). Jupiter: A telephone-based conversational interface for weather information. IEEE Transactions on Speech and Audio Processing, 8:85–96.