Weighting Finite-State Transductions With Neural Context

How should one apply deep learning to tasks such as morphological reinflection, which stochastically edit one string to get another? A recent approach to such sequence-to-sequence tasks is to compress the input string into a vector that is then used to generate the output string, using recurrent neural networks. In contrast, we propose to keep the traditional architecture, which uses a finite-state transducer to score all possible output strings, but to augment the scoring function with the help of recurrent networks. A stack of bidirectional LSTMs reads the input string from left to right and from right to left, in order to summarize the input context in which a transducer arc is applied. We combine these learned features with the transducer to define a probability distribution over aligned output strings, in the form of a weighted finite-state automaton. This reduces hand-engineering of features, allows learned features to examine unbounded context in the input string, and still permits exact inference through dynamic programming. We illustrate our method on the tasks of morphological reinflection and lemmatization.


Introduction
Mapping one character sequence to another is a structured prediction problem that arises frequently in NLP and computational linguistics. Common applications include grapheme-to-phoneme (G2P), transliteration, vowelization, normalization, morphology, and phonology. The two sequences may have different lengths.
Traditionally, such settings have been modeled with weighted finite-state transducers (WFSTs) with parametric edge weights (Mohri, 1997; Eisner, 2002). This requires manual design of the transducer states and the features extracted from those states. Alternatively, deep learning has recently been tried for sequence-to-sequence transduction (Sutskever et al., 2014). While training these systems could discover contextual features that a hand-crafted parametric WFST might miss, they dispense with important structure in the problem, namely the monotonic input-output alignment. This paper describes a natural hybrid approach that marries simple FSTs with features extracted by recurrent neural networks.
Our novel architecture allows efficient modeling of globally normalized probability distributions over string-valued output spaces, simultaneously with automatic feature extraction. We evaluate on morphological reinflection and lemmatization tasks, showing that our approach strongly outperforms a standard WFST baseline as well as neural sequence-to-sequence models with attention. Our approach also compares reasonably with a state-of-the-art WFST approach that uses task-specific latent variables.

Notation and Background
Let Σ_x be a discrete input alphabet and Σ_y be a discrete output alphabet. Our goal is to define a conditional distribution p(y | x) where x ∈ Σ_x^* and y ∈ Σ_y^*; x and y may be of different lengths. We use italics for characters and boldface for strings. x_i denotes the i-th character of x, and x_{i:j} denotes the substring x_{i+1} x_{i+2} ⋯ x_j of length j − i ≥ 0. Note that x_{i:i} = ε, the empty string. Let n = |x|.
Our approach begins by hand-specifying an unweighted finite-state transducer (FST), F, that nondeterministically maps any well-formed input x to all appropriate outputs y. An FST is a directed graph whose vertices are called states, and whose arcs are each labeled with some pair s:t, representing a possible edit of a source substring s ∈ Σ_x^* into a target substring t ∈ Σ_y^*. A path π from the FST's initial state to its final state represents an alignment of x to y, where x and y (respectively) are the concatenations of the s and t labels of the arcs along π.

Figure 2: An example transducer F, whose state remembers the most recent output character (or $ if none). Only a few of the states are shown, with all arcs among them. The Σ wildcard matches any symbol in Σ_x; the "?" wildcard matches the empty string ε or any symbol in Σ_x.

In general, two strings x, y can be aligned through exponentially many paths, via different edit sequences. If we represent x as a straight-line finite-state automaton (Figure 1), then composing x with F (Figure 2) yields a new FST, G (Figure 3). The paths in G are in one-to-one correspondence with exactly those paths in F that have input x. G can have cycles, allowing outputs of unbounded length.
Each path in G represents an alignment of x to some string in Σ * y . We say p(y | x) is the total probability of all paths in G that align x to y (Figure 4).
But how to define the probability of a path? Traditionally (Eisner, 2002), each arc in F would also be equipped with a weight. The weight of a path in F , or the corresponding path in G, is the sum of its arcs' weights. We would then define the probability p(π) of a path π in G as proportional to exp w(π), where w(·) ∈ R denotes the weight of an object.
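To make this convention concrete, here is a minimal sketch (not the paper's code; the paths and arc weights are made up) of how path weights turn into path probabilities: a path's weight is the sum of its arcs' weights, and probabilities are obtained by exponentiating and normalizing over all competing paths in G.

```python
import math

def path_probabilities(paths):
    """Each path is a list of real-valued arc weights.
    w(pi) is the sum of the arcs' weights; p(pi) is proportional to exp(w(pi))."""
    weights = [sum(arcs) for arcs in paths]
    z = sum(math.exp(w) for w in weights)      # normalizing constant over all paths
    return [math.exp(w) / z for w in weights]

# Toy example: three competing alignments (paths), each with made-up arc weights.
print(path_probabilities([[1.2, 0.3, -0.5], [0.8, 0.8], [-1.0, 2.0, 0.1]]))
```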
The weight of an arc h --s:t--> h′ in F is traditionally defined as a function of features of the edit s:t and the names (h, h′) of the source and target states. In effect, h summarizes the alignment between the prefixes of x and y that precede this edit, and h′ summarizes the alignment of the suffixes that follow it.
Thus, while the weight of an edit s:t may depend on context, it traditionally does so only through h and h′. So if F has k states, then the edit weight can only distinguish among k different types of preceding or following context.

Figure 3: An example of the transducer G, which pairs the string x=say with infinitely many possible strings y. This G was created as the composition of the straight-line input automaton (Figure 1) and the transducer F (Figure 2). Thus, the state of G tracks the states of those two machines: the position in x and the most recent output character. To avoid a tangled diagram, this figure shows only a few of the states (the start state plus all states of the form ⟨i, s⟩ or ⟨i, a⟩), with all arcs among them.

Incorporating More Context
That limitation is what we aim to correct in this paper, by augmenting our representation of context. Our contextual weighting approach will assign weights directly to G's arcs, instead of to F's arcs. Each arc of G can be regarded as a "token" of an edit arc in F: it "applies" that edit to a particular substring of x. It has the form ⟨i, h⟩ --s:t--> ⟨j, h′⟩, and represents the replacement of x_{i:j} = s by t. The finite-state composition construction produced this arc of G by combining the arc h --s:t--> h′ in F with the path i --s--> j in the straight-line automaton representing x. The latter automaton uses integers as state names: it is 0 --x_1--> 1 --x_2--> ⋯ --x_n--> n. Our top-level idea is to make the weight of this arc in G depend also on (x, i, j), so that it can consider unbounded input context around the edit's location. Arc weights can now consider arbitrary features of the input x and the positions i, j, exactly like the potential functions of a linear-chain conditional random field (CRF), which also defines p(y | x).
Why not just use a CRF? That would only model a situation that enforced |y| = |x| with each character y_i aligned to x_i, since the emissions of a CRF correspond to edits s:t with |s| = |t| = 1. An FST can also allow edits with |s| ≠ |t|, if desired, so it can be fit to (x, y) pairs of different lengths with unknown alignment, summing over their possible alignments.

Figure 4: A compact lattice of the exponentially many paths in the transducer G of Figure 3 that align input string x=say with output string y=said. To find p(y | x), we must sum over these paths (i.e., alignments). The lattice is created by composing G with y, which selects all paths in G that output y. Note that horizontal movement makes progress through x; vertical movement makes progress through y. The lattice's states specialize states in G so that they also record a position in y.
A standard weighted FST F is similar to a dynamic linear-chain CRF. Both are unrolled against the input x to get a dynamic programming lattice G. But they are not equivalent. By weighting G instead of F , we combine the FST's advantage (aligning unequal-length strings x, y via a latent path) with the CRF's advantage (arbitrary dependence on x).
To accomplish this weighting in practice, sections 4-5 present a trainable neural architecture for an arc weight function w = f(s, t, h, h′, x, i, j). The goal is to extract continuous features from all of x. While our specific architecture is new, we are not the first to replace hand-crafted log-linear models with trainable neural networks (see section 9). Note that as in a CRF, our arc weights cannot consider arbitrary features of y, only of x. Still, a weight's dependence on the states h, h′ does let it depend on a finite amount of information about y (also possible in CRFs/HCRFs) and its alignment to x.
In short, our model p(y | x) makes the weight of an s:t edit, applied to substring x_{i:j}, depend jointly on s:t and two summaries of the edit's context:
• a finite-state summary (h, h′) of its context in the aligned (x, y) pair, as found by the FST F
• a vector-valued summary of the context in x only, as found by a recurrent neural network
The neural vector is generally a richer summary of the context, but it considers only the input-side context. We are able to efficiently extract these rich features from the single input x, but not from each of the very many possible outputs y. The job of the FST F is to compute additional features that also depend on the output. Thus our model of p(y | x) is defined by an FST together with a neural network.

Neural Context Features
Our arc weight function f will make use of a vector γ_{i:j} (computed from x, i, j) to characterize the substring x_{i:j} that is being replaced, in context. We define γ_{i:j} as the concatenation of a left vector α_j (describing the prefix x_{0:j}) and a right vector β_i (describing the suffix x_{i:n}), which characterize x_{i:j} jointly with its left or right context. We use γ_i to abbreviate γ_{i−1:i}, just as x_i abbreviates x_{i−1:i}.
To extract α_j, we read the string x one character at a time with an LSTM (Hochreiter and Schmidhuber, 1997), a type of trainable recurrent neural network that is good at extracting relevant features from strings. α_j is the LSTM's output after j steps (which read x_{0:j}). Appendix A reviews how α_j ∈ R^q is computed for j = 1, . . . , n using the recursive, differentiable update rules of the LSTM architecture.
We also read the string x in reverse with a second LSTM. β_i ∈ R^q is the second LSTM's output after n − i steps (which read reverse(x_{i:n})).
We regard the two LSTMs together as a BiLSTM function (Graves and Schmidhuber, 2005) that reads x (Figure 5). For each bounded-length substring x_{i:j}, the BiLSTM produces a characterization γ_{i:j} of that substring in context, in O(n) total time.
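The following PyTorch sketch illustrates the idea under our own hypothetical dimensions and variable names; it is not the paper's implementation. One LSTM reads x left to right to give the α vectors, a second reads x right to left to give the β vectors, and γ_{i:j} is the concatenation [α_j; β_i].

```python
import torch
import torch.nn as nn

q, d = 15, 10                                      # illustrative LSTM and embedding sizes
vocab = {c: i for i, c in enumerate("$abcdefghijklmnopqrstuvwxyz")}
embed = nn.Embedding(len(vocab), d)
fwd = nn.LSTM(d, q, batch_first=True)              # reads x left to right
bwd = nn.LSTM(d, q, batch_first=True)              # reads x right to left

def context_vectors(x):
    """Return lists alpha[0..n] and beta[0..n] for the string x."""
    e = embed(torch.tensor([[vocab[c] for c in x]]))            # (1, n, d)
    n = e.size(1)
    alpha_out, _ = fwd(e)                                       # step j-1 has read x[0:j]
    beta_out, _ = bwd(torch.flip(e, dims=[1]))                  # reads reverse(x)
    zero = torch.zeros(1, q)
    alpha = [zero] + [alpha_out[:, j] for j in range(n)]        # alpha[0] = empty prefix
    beta = [beta_out[:, n - 1 - i] for i in range(n)] + [zero]  # beta[n] = empty suffix
    return alpha, beta

alpha, beta = context_vectors("say")
gamma_1_2 = torch.cat([alpha[2], beta[1]], dim=-1)  # characterizes x_{1:2} = 'a' in context
print(gamma_1_2.shape)                              # torch.Size([1, 2*q])
```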
We now define a "deep BiLSTM," which stacks up K BiLSTMs. This deepening is aimed at extracting the kind of rich features that Sutskever et al. (2014) obtained from deep LSTMs.

Figure 5: A level-1 BiLSTM reading the word x=say.

The level-1 BiLSTM reads the sequence of character embeddings of x and produces a sequence of vectors γ^(1)_1, . . . , γ^(1)_n. For each level 1 < k ≤ K, the level-k BiLSTM reads the previous level's output sequence γ^(k−1)_1, . . . , γ^(k−1)_n and produces a sequence of vectors γ^(k)_1, . . . , γ^(k)_n.

After this deep generalization, we define γ_{i:j} to be the concatenation of all the γ^(k)_{i:j} (rather than just γ^(K)_{i:j}). This novel deep BiLSTM architecture has more connections than a pair of deep LSTMs, since the level-k forward LSTM reads the full vectors γ^(k−1), which include the outputs of the level-(k−1) backward LSTM (and symmetrically for the backward LSTM at level k). Thus, while we may informally regard α^(k)_i as being a deep summary of the prefix x_{0:i}, it actually depends indirectly on all of x (except when k = 1).
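A compact way to realize this stacking (again a sketch with our own names, using standard bidirectional LSTM layers rather than the paper's exact parameterization): each level reads the per-position output vectors of the level below, so the forward LSTM at level k indirectly sees backward information from level k − 1, and the final per-position feature concatenates all K levels.

```python
import torch
import torch.nn as nn

class DeepBiLSTM(nn.Module):
    """Stack of K BiLSTMs; level k reads level k-1's per-position outputs,
    and the returned feature for each position concatenates all K levels."""
    def __init__(self, d_in, q, K):
        super().__init__()
        sizes = [d_in] + [2 * q] * (K - 1)        # each level emits 2q dims (fwd + bwd)
        self.levels = nn.ModuleList(
            [nn.LSTM(s, q, batch_first=True, bidirectional=True) for s in sizes])

    def forward(self, e):                          # e: (1, n, d_in) character embeddings
        feats, h = [], e
        for lstm in self.levels:
            h, _ = lstm(h)                         # (1, n, 2q)
            feats.append(h)
        return torch.cat(feats, dim=-1)            # (1, n, 2*q*K)

net = DeepBiLSTM(d_in=10, q=15, K=4)
print(net(torch.randn(1, 3, 10)).shape)            # e.g. a 3-character word -> (1, 3, 120)
```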

The Arc Weight Function
Given the vector γ_{i:j}, we can now compute the weight of the edit arc ⟨i, h⟩ --s:t--> ⟨j, h′⟩ in G, namely w = f(s, t, h, h′, x, i, j). Many reasonable functions are possible. Here we use one that is inspired by the log-bilinear language model (Mnih and Hinton, 2007), an inner product

    w = [ e_s ; γ_{i:j} ; local-context embeddings ] · r_{t, h, h′, type(s:t)}    (1)

The first argument to the inner product is an embedding e_s ∈ R^{d^(1)} of the source substring s, concatenated to the edit's neural context and also (for good measure) its local context. For example, if |s| = 1, i.e. s is a single character, then we would use the embedding of that character as e_s. Note that the embeddings e_s for |s| = 1 are also used to encode the local context characters and the level-1 BiLSTM input. We learn these embeddings, and they form part of our model's parameter vector θ.
The second argument is a joint embedding r_{t, h, h′, type(s:t)} of the other properties of the edit: the target substring t, the edit arc's state labels from F, and the type of the edit (INS, DEL, or SUB: see section 8). When replacing s in a particular context, which fixes the first argument, we will prefer those replacements whose r embeddings yield a high inner product w. We will learn the r embeddings as well; note that their dimensionality must match that of the first argument.
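A minimal sketch of such a log-bilinear arc score, under our own simplified names (the paper's first argument also includes local-context character embeddings, which we fold into the concatenation here):

```python
import torch

def arc_weight(e_s, gamma_ij, r_action):
    """Score one edit arc: inner product between the source-side context vector
    (embedding of s concatenated with the neural context gamma_{i:j}) and a
    learned embedding of the 'action' (t, the FST states h, h', and the edit type)."""
    return torch.dot(torch.cat([e_s, gamma_ij], dim=-1), r_action)

# Illustrative dimensions only: 10 for e_s, 120 for gamma, so r_action needs 130.
w = arc_weight(torch.randn(10), torch.randn(120), torch.randn(130))
print(w.item())
```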
The model's parameter vector θ includes the O((d^(K))^2) parameters from the 2K LSTMs, where d^(K) = O(d^(1) + Kq). It also includes O(d^(1) S) parameters for the embeddings e_s of the S different input substrings mentioned by F, and O(d^(K) T) parameters for the embeddings r_{t, h, h′, type(s:t)} of the T "actions" in F.

Training
We train our model by maximizing the conditional log-likelihood objective

    Σ_{(x, y*) ∈ training data} log p(y* | x).    (2)

Recall that p(y* | x) sums over all alignments. As explained by Eisner (2002), it can be computed as the pathsum of the composition G ∘ y* (Figure 4), divided by the pathsum of G (which gives the normalizing constant for the distribution p(y | x)). The pathsum of a weighted FST is the total weight of all paths from the initial state to a final state, and can be computed by the forward algorithm. Eisner (2002) and subsequent work also explain how to compute the partial derivatives of p(y* | x) with respect to the arc weights, essentially by using the forward-backward algorithm. We backpropagate further from these partials to find the gradient of (2) with respect to all our model parameters. We describe our gradient-based maximization procedure in section 10.3, along with regularization.
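For intuition, here is a small self-contained sketch of the pathsum computation (in log space) on an acyclic lattice; it assumes states are numbered in topological order, which holds for our G once consecutive insertions are bounded (section 8). It is not the paper's implementation, and real training would backpropagate through these quantities.

```python
import math
from collections import defaultdict

def logaddexp(a, b):
    if a == float("-inf"): return b
    if b == float("-inf"): return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def log_pathsum(arcs, start, finals, num_states):
    """Forward algorithm: log of the total exp-weight of all start-to-final
    paths in an acyclic lattice whose states are topologically numbered.
    arcs is a list of (src, dst, weight) triples."""
    out = defaultdict(list)
    for src, dst, w in arcs:
        out[src].append((dst, w))
    fwd = [float("-inf")] * num_states
    fwd[start] = 0.0
    for s in range(num_states):
        if fwd[s] == float("-inf"):
            continue
        for dst, w in out[s]:
            fwd[dst] = logaddexp(fwd[dst], fwd[s] + w)
    total = float("-inf")
    for f in finals:
        total = logaddexp(total, fwd[f])
    return total

# log p(y* | x) = log_pathsum(arcs of G composed with y*) - log_pathsum(arcs of G)
toy = [(0, 1, 0.5), (0, 1, -0.2), (1, 2, 1.0)]     # two parallel arcs, then one arc
print(log_pathsum(toy, start=0, finals=[2], num_states=3))
```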
Our model does not have to be trained with the conditional log-likelihood objective. It could be trained with other objectives such as empirical risk or softmax-margin (Gimpel and Smith, 2010), or with error-driven updates such as in the structured perceptron (Collins, 2002).

Inference and Decoding
For a new input x at test time, we can now construct a weighted FST, G, that defines a probability distribution over all aligned output strings. This can be manipulated to make various predictions about y* and its alignment.
In our present experiments, we find the most probable (highest-weighted) path in G (Dijkstra, 1959), and use its output string ŷ as our prediction. Note that Dijkstra's algorithm is exact; no beam search is required as in some neural sequence models.
On the other hand, ŷ may not be the most probable string: extracting that from a weighted FST is NP-hard (Casacuberta and de la Higuera, 1999). The issue is that the total probability of each y is split over many paths. Still, this is a well-studied problem in NLP. Instead of the Viterbi approximation, we could have used a better approximation, such as crunching (May and Knight, 2006) or variational decoding. We actually did try crunching the 10000-best outputs but got no significant improvement, so we do not report those results.
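For concreteness, here is a small sketch of a Viterbi-style best-path search of the kind we use once G is acyclic (section 8); the lattice, labels, and weights below are toy values, not the paper's data.

```python
def best_path(arcs, start, finals, num_states):
    """Highest-weight path through an acyclic lattice with topologically numbered
    states. arcs: list of (src, dst, weight, output_label); returns (weight, string)."""
    NEG = float("-inf")
    best, back = [NEG] * num_states, [None] * num_states
    best[start] = 0.0
    for s in range(num_states):
        if best[s] == NEG:
            continue
        for src, dst, w, lab in (a for a in arcs if a[0] == s):
            if best[s] + w > best[dst]:
                best[dst], back[dst] = best[s] + w, (s, lab)
    f = max(finals, key=lambda state: best[state])
    labels, state = [], f
    while state != start:                  # trace back the winning path
        state, lab = back[state]
        labels.append(lab)
    return best[f], "".join(reversed(labels))

# Toy lattice: 0 -> 1 emits "a" or "b", 1 -> 2 emits "c"; best output is "bc".
print(best_path([(0, 1, 0.3, "a"), (0, 1, 0.9, "b"), (1, 2, 0.1, "c")],
                start=0, finals=[2], num_states=3))
```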

Transducer Topology
In our experiments, we choose F to be a simple contextual edit FST as illustrated in Figure 2. Just as in Levenshtein distance (Levenshtein, 1966), it allows all edits s:t where |s| ≤ 1, |t| ≤ 1, and |s| + |t| > 0. We consider the edit type to be INS if s = ε, DEL if t = ε, and SUB otherwise. Note that copy is a SUB edit with s = t.
For a "memoryless" edit process (Ristad and Yianilos, 1996), the FST would require only a single state. By contrast, we use |Σ_x| + 1 states, where each state records the most recent output character (initially, a special "beginning-of-string" symbol $). That is, the state label h is the "history" output character immediately before the edit s:t, so the state label h′ is the history before the next edit, namely the final character of ht. For edits other than DEL, ht is a bigram of y, which can be evaluated (in context) by the arc weight function w = f(s, t, h, h′, x, i, j).
Naturally, a weighted version of this FST F is far too simple to do well on real NLP tasks (as we show in our experiments). The magic comes from instead weighting G so that we can pay attention to the input context γ_{i:j}.
The above choice of F corresponds to the "(0, 1, 1) topology" in the more general scheme of Cotterell et al. (2014). For practical reasons, we actually modify it to limit the number of consecutive INS edits to 3. This trick bounds |y| to be < 4 · (|x| + 1), ensuring that the pathsums in section 6 are finite regardless of the model parameters. This simplifies both the pathsum algorithm and the gradient-based training (Dreyer, 2011). Less importantly, since G becomes acyclic, Dijkstra's algorithm in section 7 simplifies to the Viterbi algorithm.
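The arcs of such an F can be enumerated mechanically. The sketch below is one hypothetical way to encode this topology (our own data structures, not the paper's code): each state records the most recent output character together with a counter of consecutive insertions, and the counter enforces the limit of 3 consecutive INS edits.

```python
def build_contextual_edit_fst(sigma_x, sigma_y, max_consecutive_ins=3):
    """Enumerate states and arcs of a simple contextual edit FST.
    A state is (last_output_char, consecutive_ins_count); arcs are labeled (s, t)
    with |s| <= 1, |t| <= 1, and (s, t) != (eps, eps)."""
    EPS = ""
    histories = ["$"] + list(sigma_y)               # "$" = beginning-of-string history
    states = [(h, c) for h in histories for c in range(max_consecutive_ins + 1)]
    arcs = []
    for (h, c) in states:
        for x in sigma_x:
            arcs.append(((h, c), (x, EPS), (h, 0)))           # DEL: history unchanged
            for y in sigma_y:
                arcs.append(((h, c), (x, y), (y, 0)))         # SUB (a copy when x == y)
        if c < max_consecutive_ins:
            for y in sigma_y:
                arcs.append(((h, c), (EPS, y), (y, c + 1)))   # INS: bump the counter
    return states, arcs

states, arcs = build_contextual_edit_fst(sigma_x="say", sigma_y="said")
print(len(states), len(arcs))
```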

Related Work
Our model adds to recent work on linguistic sequence transduction using deep learning. Graves and Schmidhuber (2005) combined BiLSTMs with HMMs. Later, "sequence-to-sequence" models were applied to machine translation by Sutskever et al. (2014) and to parsing by Vinyals et al. (2015). That framework did not model any alignment between x and y, but adding an "attention" mechanism provides a kind of soft alignment that has improved performance on MT (Bahdanau et al., 2015). Faruqui et al. (2016) apply these methods to morphological reinflection (the only other application to morphology we know of). Grefenstette et al. (2015) recently augmented the sequence-to-sequence framework with a continuous analog of stack and queue data structures, to better handle long-range dependencies often found in linguistic data.
Some recent papers have used LSTMs or BiLSTMs, as we do, to define probability distributions over action sequences that operate directly on an input sequence. Such actions are aligned to the input. For example, Andor et al. (2016) score edit sequences using a globally normalized model, and related systems evaluate the local probability of a parsing action given past actions (and their result) and future words. These architectures are powerful because their LSTMs can examine the output structure; but as a result they do not permit dynamic programming and must fall back on beam search.
Our use of dynamic programming for efficient exact inference has long been common in non-neural architectures for sequence transduction, including FST systems that allow "phrasal" replacements s:t where |s|, |t| > 1 (Chen, 2003; Jiampojamarn et al., 2007; Bisani and Ney, 2008). Our work augments these FSTs with neural networks, much as others have augmented CRFs. In this vein, Durrett and Klein (2015) augment a CRF parser (Finkel et al., 2008) to score constituents with a feedforward neural network. Likewise, FitzGerald et al. (2015) employ feedforward nets as a factor in a graphical model for semantic role labeling. Many CRFs have incorporated feedforward neural networks (Bridle, 1990; Peng et al., 2009; Do and Artieres, 2010; Vinel et al., 2011; Fujii et al., 2012; Chen et al., 2015, and others). Some work augments CRFs with BiLSTMs: Huang et al. (2015) report results on part-of-speech tagging and named entity recognition with a linear-chain CRF-BiLSTM, and Kong et al. (2015) on Chinese word segmentation and handwriting recognition with a semi-CRF-BiLSTM.

Experiments
We evaluated our approach on two morphological generation tasks: reinflection (section 10.1) and lemmatization (section 10.2). In the reinflection task, the goal is to transduce verbs from one inflected form into another, whereas the lemmatization task requires the model to reduce an inflected verb to its root form.
We compare our WFST-LSTM against two standard baselines, a WFST with hand-engineered features and the Moses phrase-based MT system (Koehn et al., 2007), as well as the more complex latent-variable model of Dreyer et al. (2008). The comparison with Dreyer et al. (2008) is of particular interest since their latent variables are structured specifically for morphological transduction tasks: we are directly testing the ability of the LSTM to structure its hidden layer as effectively as linguistically motivated latent variables. Additionally, we provide detailed ablation studies and learning curves which show that our neural-WFSA hybrid model can generalize even with very small amounts of training data.
Concretely, each task requires us to map a German inflection into another inflection. Consider the 13SIA task and the German verb abreiben ("to rub off"). We require the model to learn to map a past tense form abrieb to a present tense form abreibe; this involves a combination of stem change and affix generation. Sticking with the same verb abreiben, task 2PIE requires the model to transduce abreibt to abreiben; this requires an insertion and a substitution at the end. The tasks 2PKE and rP are somewhat more challenging, since performing well on these tasks requires the model to learn complex transductions: abreiben to abzureiben and abreibt to abgerieben, respectively. These are complex transductions with phenomena like infixation in specific contexts (abzureiben) and circumfixation (abgerieben), along with additional stem and affix changes. See Dreyer (2011) for more details and examples of these tasks.
We use the datasets of Dreyer (2011). Each experiment sampled a different dataset of 2500 examples from CELEX, dividing this into 500 training + 1000 validation + 1000 test examples. Like that work, we report exact-match accuracy on the test examples, averaged over 5 distinct experiments of this form. We also report results when the training and validation data are swapped in each experiment, which doubles the training size.

Lemmatization
Lemmatization is a special case of morphological reinflection where we map an inflected form of a word to its lemma (canonical form), i.e., the target inflection is fixed. This task is quite useful for NLP, as dictionaries typically list only the lemma for a given lexical entry, rather than all possible inflected forms.
In the case of German verbs, the lemma is taken to be the infinitive form, e.g., we map the past participle abgerieben to the infinitive abreiben. Following Dreyer (2011), we use a subset of the lemmatization dataset created by Wicentowski (2002) and perform 10-fold experiments on four languages: Basque (5843), English (4915), Irish (1376) and Tagalog (9545), where the numbers in parentheses indicate the total number of data pairs available. For each experimental fold, the total data was divided into train, development and test sets in the proportion 80:10:10, and we report test accuracy averaged across folds.

Settings and Training Procedure
We set the hyperparameters of our model to K = 4 (stacking depth), d^(1) = 10 (character embedding dimension), and q = 15 (LSTM state dimension). The alphabets Σ_x and Σ_y are always equal; their size is language-dependent, typically ≈ 26 but larger in languages like Basque and Irish where our datasets include properly accented characters. With |Σ| = 26 and the above settings for the hyperparameters, the number of parameters in our models is 352,801.
We optimize these parameters through stochastic gradient descent on the negative log-likelihood objective, and regularize the training procedure through dropout, accompanied by gradient clipping and projection of the parameters onto L2-balls with small radii, which is equivalent to adding a group-ridge regularization term to the training objective. The learning rate decay schedule, gradient clipping threshold, radii of the L2-balls, and dropout frequency were tuned by hand on development data.
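A short sketch of one such regularized update step (the clip-norm and ball-radius values below are placeholders, not the tuned settings; the actual schedule and dropout rate are tuned on development data):

```python
import torch

def sgd_step(model, loss, optimizer, clip_norm=5.0, ball_radius=10.0):
    """Backpropagate, clip the global gradient norm, take the SGD step, then
    project each parameter back onto an L2-ball of fixed radius (the constraint
    corresponding to a group-ridge penalty)."""
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
    optimizer.step()
    with torch.no_grad():
        for p in model.parameters():
            norm = p.norm()
            if norm > ball_radius:
                p.mul_(ball_radius / norm)
```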
In our present experiments, we made one change to the architecture. Treating copy edits like other SUB edits led to poor performance: the system was unable to learn that all SUB edits with s = t were extremely likely. In the experiments reported here, we addressed the problem by simply tying the weights of all copy edits regardless of context, bypassing (1) and instead setting w = c, where c is a learned parameter of the model. See section 11 for discussion.

Tables 1 and 2 show our results. Our proposed BiLSTM-WFST model always outperforms all but the most complex latent-variable model of Dreyer (2011); it is competitive with that model, but only beats it once individually. All of Dreyer's models include output trigram features, while we only use bigrams. Figure 7 shows learning curves for the 13SIA and 2PKE tasks: test accuracy when we train on less data. Curiously, at 300 data points the performance of our model is tied with that of Dreyer (2011).

Table 1: Reinflection results, compared against the baselines of Dreyer (2011), whose experimental setup we copied exactly. The Moses15 result is obtained by applying the SMT toolkit Moses (Koehn et al., 2007) over letter strings with 15-character context windows. Dreyer (Backoff) refers to the ngrams+x model, which has access to all the "backoff features." Dreyer (Lat-Class) is the ngrams+x+latent class model, and Dreyer (Lat-Region) refers to the ngrams+x+latent class+latent change region model. The "Model Ensemble" row displays the performance of an ensemble including our full model and the 7 models that we performed ablation on. In each column, we boldface the highest result and those that were not significantly worse (sign test, p < 0.05). Finally, the last row reports the performance of our BiLSTM-WFST when trained on twice the training data.

Table 2: Lemmatization results. Baselines are taken from Wicentowski (2002), and systems marked with a (D) are taken from Dreyer (2011). We outperform the baselines on all languages and are competitive with the latent-variable approach (ngrams+x+l), beating it in two cases: Irish and Tagalog.

In general, the results from our experiments are a promising indicator that LSTMs are capable of extracting linguistically relevant features for morphology. Our model outperforms all baselines, and is competitive with, and sometimes surpasses, the latent-variable model of Dreyer et al. (2008) without any of the hand-engineered features or linguistically inspired latent variables.

Results
On morphological reinflection, we outperform all of Dreyer et al.'s models on 2PIE, but fall short of the latent-change region model on the other tasks (while outperforming the other models). On lemmatization, we outperform all of Wicentowski's models on all the languages, and all of Dreyer et al.'s models on Irish and Tagalog, but not on Basque and English. This suggests that perhaps further gains are possible through using something like Dreyer's FST as our F. Indeed, this would be compatible with much recent work that gets its best results from a combination of automatically learned neural features and hand-engineered features.

Analysis of Results
We analyzed our lemmatization errors for all the languages on one fold of the datasets. On the English lemmatization task, 7 of our 27 errors simply copied the input word to the output: ate, kept, went, taught, torn, paid, strung. This suggests that our current aggressive parameter tying for copy edits may predict a high probability for a copy edit even in contexts that should not favor it. We also found that the FST sometimes produced non-words while lemmatizing the input verbs. For example, it mapped picnicked → picnick, happen → hapen, exceed → excy, and lining → lin. Since these strings would be rare in a corpus, many such errors could be avoided by a reranking approach that combined the FST's path score with a string frequency feature.
In order to better understand our architecture and the importance of its various components, we performed an ablation study on the validation portions of the morphological reinflection datasets, shown in Table 3. We can see in particular that using a BiLSTM instead of an LSTM, increasing the depth of the network, and including local context all helped to improve the final accuracy.
"Deep BiLSTM w/ Tying" refers to our complete model. The other rows are ablation experimentsarchitectures that are the same as the first row except in the specified way. "Deep BiLSTM (No Context)"