Natural Language Processing
Natural language processing (NLP) is the use of human languages, such as English or French, by a computer. Computer programs typically read and emit specialized languages designed to allow efficient and unambiguous parsing by simple programs. Naturally occurring languages, by contrast, are often ambiguous and defy formal description. Natural language processing includes applications such as machine translation, in which the learner must read a sentence in one human language and emit an equivalent sentence in another human language. Many NLP applications are based on language models that define a probability distribution over sequences of words, characters or bytes in a natural language.
As with the other applications discussed in this chapter, very generic neural network techniques can be successfully applied to natural language processing. However, to achieve excellent performance and to scale well to large applications, some domain-specific strategies become important. To build an efficient model of natural language, we must usually use techniques that are specialized for processing sequential data. In many cases, we choose to regard natural language as a sequence of words, rather than a sequence of individual characters or bytes. Because the total number of possible words is so large, word-based language models must operate on an extremely high-dimensional and sparse discrete space. Several strategies have been developed to make models of such a space efficient, both in a computational and in a statistical sense.
n-grams
A language model defines a probability distribution over sequences of tokens in a natural language. Depending on how the model is designed, a token may be a word, a character, or even a byte. Tokens are always discrete entities. The earliest successful language models were based on models of fixed-length sequences of tokens called $n$-grams. An $n$-gram is a sequence of $n$ tokens.

Models based on $n$-grams define the conditional probability of the $n$-th token given the preceding $n-1$ tokens. The model uses products of these conditional distributions to define the probability distribution over longer sequences:

$$P(x_1, \ldots, x_\tau) = P(x_1, \ldots, x_{n-1}) \prod_{t=n}^{\tau} P(x_t \mid x_{t-n+1}, \ldots, x_{t-1}).$$
This decomposition is justified by the chain rule of probability. The probability distribution over the initial sequence $P(x_1, \ldots, x_{n-1})$ may be modeled by a different model with a smaller value of $n$.
Training $n$-gram models is straightforward because the maximum likelihood estimate can be computed simply by counting how many times each possible $n$-gram occurs in the training set. Models based on $n$-grams have been the core building block of statistical language modeling for many decades (Jelinek and Mercer, 1980; Katz, 1987; Chen and Goodman, 1999).
For small values of $n$, models have particular names: unigram for $n=1$, bigram for $n=2$, and trigram for $n=3$. These names derive from the Latin prefixes for the corresponding numbers and the Greek suffix “-gram” denoting something that is written.
Usually we train both an $n$-gram model and an $n-1$ gram model simultaneously. This makes it easy to compute

$$P(x_t \mid x_{t-n+1}, \ldots, x_{t-1}) = \frac{P_n(x_{t-n+1}, \ldots, x_t)}{P_{n-1}(x_{t-n+1}, \ldots, x_{t-1})}$$

simply by looking up two stored probabilities. For this to exactly reproduce inference in $P_n$, we must omit the final character from each sequence when we train $P_{n-1}$.
As an example, we demonstrate how a trigram model computes the probability of the sentence “THE DOG RAN AWAY.” The first words of the sentence cannot be handled by the default formula based on conditional probability because there is no context at the beginning of the sentence. Instead, we must use the marginal probability over words at the start of the sentence. We thus evaluate $P_3(\text{THE DOG RAN})$. Finally, the last word may be predicted using the typical case, of using the conditional distribution $P(\text{AWAY} \mid \text{DOG RAN})$. Putting this together with the equation above, we obtain:

$$P(\text{THE DOG RAN AWAY}) = P_3(\text{THE DOG RAN}) \, P_3(\text{DOG RAN AWAY}) \, / \, P_2(\text{DOG RAN}).$$
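As a concrete sketch of how raw counts yield the quantities above, the following minimal Python example (the toy corpus and function name are illustrative assumptions, not from the text) estimates $P_3$ and $P_2$ by maximum likelihood counting and then applies the trigram formula to “THE DOG RAN AWAY”:

```python
from collections import Counter

# Toy training corpus (illustrative only).
corpus = "THE DOG RAN AWAY . THE DOG RAN HOME . THE CAT RAN AWAY .".split()

# Maximum likelihood estimates are just normalized n-gram counts.
def ngram_probs(tokens, n):
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}

P3 = ngram_probs(corpus, 3)
P2 = ngram_probs(corpus, 2)

# P(THE DOG RAN AWAY) = P3(THE DOG RAN) * P3(DOG RAN AWAY) / P2(DOG RAN)
p = (P3[("THE", "DOG", "RAN")]
     * P3[("DOG", "RAN", "AWAY")]
     / P2[("DOG", "RAN")])
print(p)
```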
A fundamental limitation of maximum likelihood for $n$-gram models is that $P_n$ as estimated from training set counts is very likely to be zero in many cases, even though the tuple $(x_{t-n+1}, \ldots, x_t)$ may appear in the test set. This can cause two different kinds of catastrophic outcomes. When $P_{n-1}$ is zero, the ratio is undefined, so the model does not even produce a sensible output. When $P_{n-1}$ is non-zero but $P_n$ is zero, the test log-likelihood is $-\infty$.
To avoid such catastrophic outcomes, most $n$-gram models employ some form of smoothing. Smoothing techniques shift probability mass from the observed tuples to unobserved ones that are similar. See Chen and Goodman (1999) for a review and empirical comparisons. One basic technique consists of adding non-zero probability mass to all of the possible next symbol values. This method can be justified as Bayesian inference with a uniform or Dirichlet prior over the count parameters. Another very popular idea is to form a mixture model containing higher-order and lower-order $n$-gram models, with the higher-order models providing more capacity and the lower-order models being more likely to avoid counts of zero. Back-off methods look up the lower-order $n$-grams if the frequency of the context $x_{t-1}, \ldots, x_{t-n+1}$ is too small to use the higher-order model. More formally, they estimate the distribution over $x_t$ by using contexts $x_{t-n+k}, \ldots, x_{t-1}$, for increasing $k$, until a sufficiently reliable estimate is found.
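As a minimal sketch of the first smoothing technique mentioned above (additive smoothing), assuming a bigram model and an illustrative choice of $\alpha$; the function and data are stand-ins, not a prescribed implementation:

```python
from collections import Counter

def smoothed_bigram_prob(tokens, vocab, alpha=1.0):
    """Return a function computing P(w2 | w1) with add-alpha smoothing."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    unigram_counts = Counter(tokens)
    V = len(vocab)

    def prob(w1, w2):
        # Unseen bigrams receive alpha / (count(w1) + alpha * |V|) instead of zero.
        return (bigram_counts[(w1, w2)] + alpha) / (unigram_counts[w1] + alpha * V)

    return prob

tokens = "THE DOG RAN AWAY".split()
vocab = set(tokens) | {"CAT"}
P = smoothed_bigram_prob(tokens, vocab)
print(P("DOG", "RAN"), P("DOG", "CAT"))  # the unseen bigram gets non-zero mass
```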
Classical $n$-gram models are particularly vulnerable to the curse of dimensionality. There are $|\mathbb{V}|^n$ possible $n$-grams and $|\mathbb{V}|$ is often very large. Even with a massive training set and modest $n$, most $n$-grams will not occur in the training set. One way to view a classical $n$-gram model is that it is performing nearest-neighbor lookup. In other words, it can be viewed as a local non-parametric predictor, similar to $k$-nearest neighbors. The statistical problems facing these extremely local predictors are described in section 5.11.2. The problem for a language model is even more severe than usual, because any two different words have the same distance from each other in one-hot vector space. It is thus difficult to leverage much information from any “neighbors”: only training examples that repeat literally the same context are useful for local generalization. To overcome these problems, a language model must be able to share knowledge between one word and other semantically similar words.
To improve the statistical efficiency of $n$-gram models, class-based language models (Brown et al., 1992; Ney and Kneser, 1993; Niesler et al., 1998) introduce the notion of word categories and then share statistical strength between words that are in the same category. The idea is to use a clustering algorithm to partition the set of words into clusters or classes, based on their co-occurrence frequencies with other words. The model can then use word class IDs rather than individual word IDs to represent the context on the right side of the conditioning bar. Composite models combining word-based and class-based models via mixing or back-off are also possible. Although word classes provide a way to generalize between sequences in which some word is replaced by another of the same class, much information is lost in this representation.
Neural Language Models
Neural language models, or NLMs, are a class of language model designed to overcome the curse of dimensionality problem for modeling natural language sequences by using a distributed representation of words (Bengio et al., 2001). Unlike class-based $n$-gram models, neural language models are able to recognize that two words are similar without losing the ability to encode each word as distinct from the other. Neural language models share statistical strength between one word (and its context) and other similar words and contexts. The distributed representation the model learns for each word enables this sharing by allowing the model to treat words that have features in common similarly. For example, if the word dog and the word cat map to representations that share many attributes, then sentences that contain the word cat can inform the predictions that will be made by the model for sentences that contain the word dog, and vice versa. Because there are many such attributes, there are many ways in which generalization can happen, transferring information from each training sentence to an exponentially large number of semantically related sentences. The curse of dimensionality requires the model to generalize to a number of sentences that is exponential in the sentence length. The model counters this curse by relating each training sentence to an exponential number of similar sentences.
We sometimes call these word representations word embeddings. In this interpretation, we view the raw symbols as points in a space of dimension equal to the vocabulary size. The word representations embed those points in a feature space of lower dimension. In the original space, every word is represented by a one-hot vector, so every pair of words is at Euclidean distance $\sqrt{2}$ from each other. In the embedding space, words that frequently appear in similar contexts (or any pair of words sharing some “features” learned by the model) are close to each other. This often results in words with similar meanings being neighbors. Figure 12.3 zooms in on specific areas of a learned word embedding space to show how semantically similar words map to representations that are close to each other.
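The contrast between the two spaces is easy to check numerically. The sketch below (the tiny vocabulary and the random embedding matrix are illustrative assumptions; real embeddings are learned) verifies the $\sqrt{2}$ distance between one-hot vectors and shows that an embedding lookup is just row selection from a matrix:

```python
import numpy as np

vocab = ["cat", "dog", "car", "truck"]
V = len(vocab)

# One-hot space: every pair of distinct words is at distance sqrt(2).
one_hot = np.eye(V)
print(np.linalg.norm(one_hot[0] - one_hot[1]))  # 1.4142... for any pair

# Embedding space: a |V| x d matrix; lookup is row indexing.
d = 3
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))  # in practice, learned jointly with the model

def embed(word):
    return E[vocab.index(word)]

# After training, semantically similar words tend to end up close together.
print(np.linalg.norm(embed("cat") - embed("dog")))
```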
Neural networks in other domains also define embeddings. For example, a hidden layer of a convolutional network provides an “image embedding.” Usually NLP practitioners are much more interested in this idea of embeddings because natural language does not originally lie in a real-valued vector space. The hidden layer has provided a more qualitatively dramatic change in the way the data is represented.

The basic idea of using distributed representations to improve models for natural language processing is not restricted to neural networks. It may also be used with graphical models that have distributed representations in the form of multiple latent variables (Mnih and Hinton, 2007).
High-Dimensional Outputs
In many natural language applications, we often want our models to produce words (rather than characters) as the fundamental unit of the output. For large vocabularies, it can be very computationally expensive to represent an output distribution over the choice of a word, because the vocabulary size is large. In many applications, $\mathbb{V}$ contains hundreds of thousands of words. The naive approach to representing such a distribution is to apply an affine transformation from a hidden representation to the output space, then apply the softmax function. Suppose we have a vocabulary $\mathbb{V}$ with size $|\mathbb{V}|$. The weight matrix describing the linear component of this affine transformation is very large, because its output dimension is $|\mathbb{V}|$. This imposes a high memory cost to represent the matrix, and a high computational cost to multiply by it. Because the softmax is normalized across all $|\mathbb{V}|$ outputs, it is necessary to perform the full matrix multiplication at training time as well as test time: we cannot calculate only the dot product with the weight vector for the correct output. The high computational costs of the output layer thus arise both at training time (to compute the likelihood and its gradient) and at test time (to compute probabilities for all or selected words). For specialized loss functions, the gradient can be computed efficiently (Vincent et al., 2015), but the standard cross-entropy loss applied to a traditional softmax output layer poses many difficulties.
Suppose that $\boldsymbol{h}$ is the top hidden layer used to predict the output probabilities $\hat{\boldsymbol{y}}$. If we parametrize the transformation from $\boldsymbol{h}$ to $\hat{\boldsymbol{y}}$ with learned weights $\boldsymbol{W}$ and learned biases $\boldsymbol{b}$, then the affine-softmax output layer performs the following computations:

$$a_i = b_i + \sum_j W_{ij} h_j \quad \forall i \in \{1, \ldots, |\mathbb{V}|\},$$

$$\hat{y}_i = \frac{e^{a_i}}{\sum_{i'=1}^{|\mathbb{V}|} e^{a_{i'}}}.$$
If $\boldsymbol{h}$ contains $n_h$ elements then the above operation is $O(|\mathbb{V}| n_h)$. With $n_h$ in the thousands and $|\mathbb{V}|$ in the hundreds of thousands, this operation dominates the computation of most neural language models.
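A minimal numpy sketch of this affine-softmax output layer (the sizes are illustrative and smaller than a realistic vocabulary); the point is that both the matrix-vector product and the normalization touch all $|\mathbb{V}|$ outputs:

```python
import numpy as np

n_h = 512            # hidden size (illustrative)
vocab_size = 20_000  # |V|; real vocabularies are often 10x larger or more

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(vocab_size, n_h))  # |V| x n_h weight matrix
b = np.zeros(vocab_size)
h = rng.normal(size=n_h)  # top hidden layer

# a_i = b_i + sum_j W_ij h_j  --  O(|V| * n_h) multiply-adds
a = b + W @ h

# The softmax normalizes over all |V| outputs, so no entry can be skipped.
y_hat = np.exp(a - a.max())
y_hat /= y_hat.sum()
print(y_hat.shape, round(y_hat.sum(), 6))  # (20000,) 1.0
```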
Use of a Short List
The first neural language models (Bengio et al., 2001, 2003) dealt with the high cost of using a softmax over a large number of output words by limiting the vocabulary size to 10,000 or 20,000 words. Schwenk and Gauvain (2002) and Schwenk (2007) built upon this approach by splitting the vocabulary $\mathbb{V}$ into a shortlist $\mathbb{L}$ of most frequent words (handled by the neural net) and a tail $\mathbb{T} = \mathbb{V} \setminus \mathbb{L}$ of more rare words (handled by an $n$-gram model). To be able to combine the two predictions, the neural net also has to predict the probability that a word appearing after context $C$ belongs to the tail list. This may be achieved by adding an extra sigmoid output unit to provide an estimate of $P(i \in \mathbb{T} \mid C)$. The extra output can then be used to achieve an estimate of the probability distribution over all words in $\mathbb{V}$ as follows:

$$P(y = i \mid C) = 1_{i \in \mathbb{L}} \, P(y = i \mid C, i \in \mathbb{L}) \, (1 - P(i \in \mathbb{T} \mid C)) + 1_{i \in \mathbb{T}} \, P(y = i \mid C, i \in \mathbb{T}) \, P(i \in \mathbb{T} \mid C),$$

where $P(y = i \mid C, i \in \mathbb{L})$ is provided by the neural language model and $P(y = i \mid C, i \in \mathbb{T})$ is provided by the $n$-gram model. With slight modification, this approach can also work using an extra output value in the neural language model’s softmax layer, rather than a separate sigmoid unit.
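A sketch of the combination rule above, assuming we already have a neural-net distribution over the shortlist, an $n$-gram distribution over the tail, and the extra sigmoid's estimate of the tail probability (all names and numbers here are illustrative stand-ins):

```python
def shortlist_prob(word, tail_prob, nn_probs, ngram_probs, shortlist):
    """Combine shortlist (neural) and tail (n-gram) predictions into P(y = word | C).

    tail_prob   : estimate of P(i in T | C) from the extra sigmoid unit
    nn_probs    : dict word -> P(y = word | C, word in L) from the neural LM
    ngram_probs : dict word -> P(y = word | C, word in T) from the n-gram model
    """
    if word in shortlist:
        return nn_probs[word] * (1.0 - tail_prob)
    return ngram_probs[word] * tail_prob

# Illustrative numbers only.
shortlist = {"the", "dog", "ran"}
nn_probs = {"the": 0.5, "dog": 0.3, "ran": 0.2}
ngram_probs = {"away": 0.7, "home": 0.3}
p_tail = 0.1
print(shortlist_prob("dog", p_tail, nn_probs, ngram_probs, shortlist))   # 0.27
print(shortlist_prob("away", p_tail, nn_probs, ngram_probs, shortlist))  # 0.07
```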
An obvious disadvantage of the short list approach is that the potential generalization advantage of the neural language models is limited to the most frequent words, where, arguably, it is the least useful. This disadvantage has stimulated the exploration of alternative methods to deal with high-dimensional outputs, described below.
Hierarchical Softmax
A classical approach (Goodman, 2001) to reducing the computational burden of high-dimensional output layers over large vocabulary sets $\mathbb{V}$ is to decompose probabilities hierarchically. Instead of necessitating a number of computations proportional to $|\mathbb{V}|$ (and also proportional to the number of hidden units, $n_h$), the $|\mathbb{V}|$ factor can be reduced to as low as $\log |\mathbb{V}|$. Bengio (2002) and Morin and Bengio (2005) introduced this factorized approach to the context of neural language models.
One can think of this hierarchy as building categories of words, then categories of categories of words, then categories of categories of categories of words, etc. These nested categories form a tree, with words at the leaves. In a balanced tree, the tree has depth $O(\log |\mathbb{V}|)$. The probability of choosing a word is given by the product of the probabilities of choosing the branch leading to that word at every node on a path from the root of the tree to the leaf containing the word. Figure 12.4 illustrates a simple example. Mnih and Hinton (2009) also describe how to use multiple paths to identify a single word in order to better model words that have multiple meanings. Computing the probability of a word then involves summation over all of the paths that lead to that word.
To predict the conditional probabilities required at each node of the tree, we typically use a logistic regression model at each node of the tree, and provide the same context $C$ as input to all of these models. Because the correct output is encoded in the training set, we can use supervised learning to train the logistic regression models. This is typically done using a standard cross-entropy loss, corresponding to maximizing the log-likelihood of the correct sequence of decisions. Because the output log-likelihood can be computed efficiently (as low as $\log |\mathbb{V}|$ rather than $|\mathbb{V}|$), its gradients may also be computed efficiently. This includes not only the gradient with respect to the output parameters but also the gradients with respect to the hidden layer activations.
It is possible but usually not practical to optimize the tree structure to minimize the expected number of computations. Tools from information theory specify how to choose the optimal binary code given the relative frequencies of the words. To do so, we could structure the tree so that the number of bits associated with a word is approximately equal to the logarithm of the frequency of that word. However, in practice, the computational savings are typically not worth the effort because the computation of the output probabilities is only one part of the total computation in the neural language model. For example, suppose there are $l$ fully connected hidden layers of width $n_h$. Let $n_b$ be the weighted average of the number of bits required to identify a word, with the weighting given by the frequency of these words. In this example, the number of operations needed to compute the hidden activations grows as $O(l n_h^2)$ while the output computations grow as $O(n_h n_b)$. As long as $n_b \leq l n_h$, we can reduce computation more by shrinking $n_h$ than by shrinking $n_b$. Indeed, $n_b$ is often small. Because the size of the vocabulary rarely exceeds a million words and $\log_2(10^6) \approx 20$, it is possible to reduce $n_b$ to about 20, but $n_h$ is often much larger, around $10^3$ or more. Rather than carefully optimizing a tree with a branching factor of 2, one can instead define a tree with depth two and a branching factor of $\sqrt{|\mathbb{V}|}$. Such a tree corresponds to simply defining a set of mutually exclusive word classes. The simple approach based on a tree of depth two captures most of the computational benefit of the hierarchical strategy.
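A minimal sketch of the depth-two variant just described: words are partitioned into roughly $\sqrt{|\mathbb{V}|}$ classes, and $P(y \mid C)$ factors into a class probability times a within-class probability. The fixed partition and the two weight matrices are illustrative assumptions, not a particular published implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_h = 10_000, 128
n_classes = int(np.sqrt(vocab_size))             # ~100 classes
word_class = np.arange(vocab_size) % n_classes   # illustrative fixed partition

# One softmax over classes, then one softmax over the words within the chosen class.
W_class = rng.normal(scale=0.01, size=(n_classes, n_h))
W_word = rng.normal(scale=0.01, size=(vocab_size, n_h))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def word_prob(word_id, h):
    c = word_class[word_id]
    p_class = softmax(W_class @ h)[c]
    members = np.flatnonzero(word_class == c)     # words in the same class
    within = softmax(W_word[members] @ h)
    p_word_given_class = within[np.where(members == word_id)[0][0]]
    # Cost is O(n_classes + |class|) ~ O(sqrt(|V|)) instead of O(|V|).
    return p_class * p_word_given_class

h = rng.normal(size=n_h)
print(word_prob(1234, h))
```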
One question that remains somewhat open is how to best define these word classes, or how to define the word hierarchy in general. Early work used existing hierarchies (Morin and Bengio, 2005), but the hierarchy can also be learned, ideally jointly with the neural language model. Learning the hierarchy is difficult. An exact optimization of the log-likelihood appears intractable because the choice of a word hierarchy is a discrete one, not amenable to gradient-based optimization. However, one could use discrete optimization to approximately optimize the partition of words into word classes.

An important advantage of the hierarchical softmax is that it brings computational benefits both at training time and at test time, if at test time we want to compute the probability of specific words.

Of course, computing the probability of all $|\mathbb{V}|$ words will remain expensive even with the hierarchical softmax. Another important operation is selecting the most likely word in a given context. Unfortunately the tree structure does not provide an efficient and exact solution to this problem.
A disadvantage is that in practice the hierarchical softmax tends to give worse test results than sampling-based methods we will describe next. This may be due to a poor choice of word classes.
Importance Sampling
One way to speed up the training of neural language models is to avoid explicitly computing the contribution of the gradient from all of the words that do not appear in the next position. Every incorrect word should have low probability under the model. It can be computationally costly to enumerate all of these words. Instead, it is possible to sample only a subset of the words. Using the notation introduced in the equation $a_i = b_i + \sum_j W_{ij} h_j \; \forall i \in \{1, \ldots, |\mathbb{V}|\}$ above, the gradient can be written as follows:

$$\begin{aligned} \frac{\partial \log P(y \mid C)}{\partial \theta} &= \frac{\partial \log \operatorname{softmax}_y(\boldsymbol{a})}{\partial \theta} \\ &= \frac{\partial}{\partial \theta} \log \frac{e^{a_y}}{\sum_i e^{a_i}} \\ &= \frac{\partial}{\partial \theta} \left( a_y - \log \sum_i e^{a_i} \right) \\ &= \frac{\partial a_y}{\partial \theta} - \sum_i P(y = i \mid C) \frac{\partial a_i}{\partial \theta}, \end{aligned}$$

where the last step follows because the derivative of the log normalizer recovers the softmax probabilities, and $\boldsymbol{a}$ is the vector of pre-softmax activations (or scores), with one element per word. The first term is the positive phase term (pushing $a_y$ up) while the second term is the negative phase term (pushing $a_i$ down for all $i$, with weight $P(i \mid C)$). Since the negative phase term is an expectation, we can estimate it with a Monte Carlo sample. However, that would require sampling from the model itself. Sampling from the model requires computing $P(i \mid C)$ for all $i$ in the vocabulary, which is precisely what we are trying to avoid.
Instead of sampling from the model, one can sample from another distribution, called the proposal distribution (denoted $q$), and use appropriate weights to correct for the bias introduced by sampling from the wrong distribution (Bengio and Sénécal, 2003; Bengio and Sénécal, 2008). This is an application of a more general technique called importance sampling, which will be described in more detail in section 17.2. Unfortunately, even exact importance sampling is not efficient because it requires computing weights $p_i / q_i$, where $p_i = P(i \mid C)$, which can only be computed if all the scores $a_i$ are computed.
The solution adopted for this application is called biased importance sampling, where the importance weights are normalized to sum to 1. When negative word $n_i$ is sampled, the associated gradient is weighted by

$$w_i = \frac{p_{n_i} / q_{n_i}}{\sum_{j=1}^{N} p_{n_j} / q_{n_j}}.$$
These weights are used to give the appropriate importance to the $m$ negative samples from $q$ used to form the estimated negative phase contribution to the gradient:

$$\sum_{i=1}^{|\mathbb{V}|} P(i \mid C) \frac{\partial a_i}{\partial \theta} \approx \frac{1}{m} \sum_{i=1}^{m} w_i \frac{\partial a_{n_i}}{\partial \theta}.$$
A unigram or a bigram distribution works well as the proposal distribution $q$. It is easy to estimate the parameters of such a distribution from data. After estimating the parameters, it is also possible to sample from such a distribution very efficiently.
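The sketch below illustrates the biased importance-sampling estimate with a unigram proposal. The key point is that only the $m$ sampled scores are computed, and the softmax normalizer cancels inside the self-normalized weights; sizes and distributions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_h, m = 50_000, 64, 20   # m negative samples per step (illustrative)

W = rng.normal(scale=0.01, size=(vocab_size, n_h))  # output weights
h = rng.normal(size=n_h)                            # hidden representation of context C
unigram = rng.random(vocab_size)
unigram /= unigram.sum()                            # proposal distribution q (unigram)

# Draw m negative words from q instead of from the model itself.
neg = rng.choice(vocab_size, size=m, p=unigram)

# Use unnormalized model probabilities exp(a_i): the softmax normalizer cancels
# inside the self-normalized weights, so it never has to be computed.
a_neg = W[neg] @ h                                  # only m scores, not |V|
ratios = np.exp(a_neg) / unigram[neg]
w = ratios / ratios.sum()                           # biased importance weights, sum to 1

# Estimated negative-phase gradient for the sampled rows of W:
# each sampled word n_i contributes w_i * (d a_{n_i} / d W[n_i]) = w_i * h.
grad_W_neg = np.zeros_like(W)
np.add.at(grad_W_neg, neg, np.outer(w, h))
print(w.sum())
```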
Importance sampling is not only useful for speeding up models with large softmax outputs. More generally, it is useful for accelerating training with large sparse output layers, where the output is a sparse vector rather than a 1-of-$n$ choice. An example is a bag of words. A bag of words is a sparse vector $\boldsymbol{v}$ where $v_i$ indicates the presence or absence of word $i$ from the vocabulary in the document. Alternately, $v_i$ can indicate the number of times that word $i$ appears. Machine learning models that emit such sparse vectors can be expensive to train for a variety of reasons. Early in learning, the model may not actually choose to make the output truly sparse. Moreover, the loss function we use for training might most naturally be described in terms of comparing every element of the output to every element of the target. This means that it is not always clear that there is a computational benefit to using sparse outputs, because the model may choose to make the majority of the output non-zero, and all of these non-zero values need to be compared to the corresponding training target, even if the training target is zero. Dauphin et al. (2011) demonstrated that such models can be accelerated using importance sampling. The efficient algorithm minimizes the reconstruction loss for the “positive words” (those that are non-zero in the target) and an equal number of “negative words.” The negative words are chosen randomly, using a heuristic to sample words that are more likely to be mistaken. The bias introduced by this heuristic oversampling can then be corrected using importance weights.
In all of these cases, the computational complexity of gradient estimation for the output layer is reduced to be proportional to the number of negative samples rather than proportional to the size of the output vector.
Noise-Contrastive Estimation and Ranking Loss
Other approaches based on sampling have been proposed to reduce the computational cost of training neural language models with large vocabularies. An early example is the ranking loss proposed by Collobert and Weston (2008a), which views the output of the neural language model for each word as a score and tries to make the score of the correct word $a_y$ be ranked high in comparison to the other scores $a_i$. The ranking loss proposed then is

$$L = \sum_i \max(0, 1 - a_y + a_i).$$
The gradient is zero for the $i$-th term if the score of the observed word, $a_y$, is greater than the score of the negative word $a_i$ by a margin of 1. One issue with this criterion is that it does not provide estimated conditional probabilities, which are useful in some applications, including speech recognition and text generation (including conditional text generation tasks such as translation).
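A short sketch of this margin-based ranking loss and its subgradient with respect to the score vector (the toy scores are illustrative):

```python
import numpy as np

def ranking_loss(scores, y):
    """L = sum over i != y of max(0, 1 - a_y + a_i)."""
    margins = 1.0 - scores[y] + np.delete(scores, y)
    return np.maximum(0.0, margins).sum()

def ranking_loss_grad(scores, y):
    """Subgradient of L with respect to the score vector a."""
    grad = np.zeros_like(scores)
    for i in range(len(scores)):
        if i != y and 1.0 - scores[y] + scores[i] > 0:
            grad[i] += 1.0   # active terms push the negative word's score down
            grad[y] -= 1.0   # and push the correct word's score up
    return grad

a = np.array([2.0, 0.5, 1.8, -1.0])  # pre-softmax scores, one per word
print(ranking_loss(a, y=0), ranking_loss_grad(a, y=0))
```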
A more recently used training objective for neural language models is noise-contrastive estimation, which is introduced in section 18.6. This approach has been successfully applied to neural language models (Mnih and Teh, 2012; Mnih and Kavukcuoglu, 2013).
Combining Neural Language Models with n-grams
A major advantage of $n$-gram models over neural networks is that $n$-gram models achieve high model capacity (by storing the frequencies of very many tuples) while requiring very little computation to process an example (by looking up only a few tuples that match the current context). If we use hash tables or trees to access the counts, the computation used for $n$-grams is almost independent of capacity. In comparison, doubling a neural network's number of parameters typically also roughly doubles its computation time. Exceptions include models that avoid using all parameters on each pass. Embedding layers index only a single embedding in each pass, so we can increase the vocabulary size without increasing the computation time per example. Some other models, such as tiled convolutional networks, can add parameters while reducing the degree of parameter sharing in order to maintain the same amount of computation. However, typical neural network layers based on matrix multiplication use an amount of computation proportional to the number of parameters.
One easy way to add capacity is thus to combine both approaches in an ensemble consisting of a neural language model and an $n$-gram language model (Bengio et al., 2001, 2003). As with any ensemble, this technique can reduce test error if the ensemble members make independent mistakes. The field of ensemble learning provides many ways of combining the ensemble members' predictions, including uniform weighting and weights chosen on a validation set. Mikolov et al. (2011a) extended the ensemble to include not just two models but a large array of models. It is also possible to pair a neural network with a maximum entropy model and train both jointly (Mikolov et al., 2011b). This approach can be viewed as training a neural network with an extra set of inputs that are connected directly to the output, and not connected to any other part of the model. The extra inputs are indicators for the presence of particular $n$-grams in the input context, so these variables are very high-dimensional and very sparse. The increase in model capacity is huge (the new portion of the architecture contains up to $|\mathbb{V}|^n$ parameters), but the amount of added computation needed to process an input is minimal because the extra inputs are very sparse.
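The simplest form of such an ensemble is linear interpolation of the two models' conditional probabilities, with the mixing weight chosen on a validation set; a minimal sketch (the weight and the stand-in models are illustrative assumptions):

```python
def interpolated_prob(word, context, neural_lm, ngram_lm, lam=0.5):
    """P(word | context) as a mixture of a neural LM and an n-gram LM.

    neural_lm, ngram_lm: callables returning conditional probabilities.
    lam: mixing weight, typically tuned on a validation set.
    """
    return lam * neural_lm(word, context) + (1.0 - lam) * ngram_lm(word, context)

# Illustrative stand-ins for the two trained models.
neural_lm = lambda w, c: 0.12
ngram_lm = lambda w, c: 0.05
print(interpolated_prob("away", ("dog", "ran"), neural_lm, ngram_lm, lam=0.7))  # 0.099
```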
Neural Machine Translation
Machine translation is the task of reading a sentence in one natural language and emitting a sentence with the equivalent meaning in another language. Machine translation systems often involve many components. At a high level, there is often one component that proposes many candidate translations. Many of these translations will not be grammatical due to differences between the languages. For example, many languages put adjectives after nouns, so when translated to English directly they yield phrases such as “apple red.” The proposal mechanism suggests many variants of the suggested translation, ideally including “red apple.” A second component of the translation system, a language model, evaluates the proposed translations, and can score “red apple” as better than “apple red.”
The earliest use of neural networks for machine translation was to upgrade the language model of a translation system by using a neural language model (Schwenk et al., 2006; Schwenk, 2010). Previously, most machine translation systems had used an $n$-gram model for this component. The $n$-gram based models used for machine translation include not just traditional back-off $n$-gram models (Jelinek and Mercer, 1980; Katz, 1987; Chen and Goodman, 1999) but also maximum entropy language models (Berger et al., 1996), in which an affine-softmax layer predicts the next word given the presence of frequent $n$-grams in the context.
Traditional language models simply report the probability of a natural language sentence. Because machine translation involves producing an output sentence given an input sentence, it makes sense to extend the natural language model to be conditional. As described in section 6.2.1.1, it is straightforward to extend a model that defines a marginal distribution over some variable to define a conditional distribution over that variable given a context $C$, where $C$ might be a single variable or a list of variables. Devlin et al. (2014) beat the state-of-the-art in some statistical machine translation benchmarks by using an MLP to score a phrase $t_1, t_2, \ldots, t_k$ in the target language given a phrase $s_1, s_2, \ldots, s_n$ in the source language. The MLP estimates $P(t_1, t_2, \ldots, t_k \mid s_1, s_2, \ldots, s_n)$. The estimate formed by this MLP replaces the estimate provided by conditional $n$-gram models.
A drawback of the MLP-based approach is that it requires the sequences to be preprocessed to be of fixed length. To make the translation more flexible, we would like to use a model that can accommodate variable length inputs and variable length outputs. An RNN provides this ability. Section 10.2.4 describes several ways of constructing an RNN that represents a conditional distribution over a sequence given some input, and section 10.4 describes how to accomplish this conditioning when the input is a sequence.
In all cases, one model first reads the input sequence and emits a data structure that summarizes the input sequence. We call this summary the “context” $C$. The context $C$ may be a list of vectors, or it may be a vector or tensor. The model that reads the input to produce $C$ may be an RNN (Cho et al., 2014a; Sutskever et al., 2014; Jean et al., 2014) or a convolutional network (Kalchbrenner and Blunsom, 2013). A second model, usually an RNN, then reads the context $C$ and generates a sentence in the target language. This general idea of an encoder-decoder framework for machine translation is illustrated in figure 12.5.
In order to generate an entire sentence conditioned on the source sentence, the model must have a way to represent the entire source sentence. Earlier models were only able to represent individual words or phrases. From a representation learning point of view, it can be useful to learn a representation in which sentences that have the same meaning have similar representations regardless of whether they were written in the source language or the target language. This strategy was explored first using a combination of convolutions and RNNs (Kalchbrenner and Blunsom, 2013). Later work introduced the use of an RNN for scoring proposed translations (Cho et al., 2014a) and for generating translated sentences (Sutskever et al., 2014). Jean et al. (2014) scaled these models to larger vocabularies.
Using an Attention Mechanism and Aligning Pieces of Data
Using a fixed-size representation to capture all the semantic details of a very long sentence of, say, 60 words is very difficult. It can be achieved by training a sufficiently large RNN well enough and for long enough, as demonstrated by Cho et al. (2014a) and Sutskever et al. (2014). However, a more efficient approach is to read the whole sentence or paragraph (to get the context and the gist of what is being expressed), then produce the translated words one at a time, each time focusing on a different part of the input sentence in order to gather the semantic details that are required to produce the next output word. That is exactly the idea that Bahdanau et al. (2015) first introduced. The attention mechanism used to focus on specific parts of the input sequence at each time step is illustrated in figure 12.6. We can think of an attention-based system as having three components (see the code sketch after this list):
1. A process that “reads” raw data (such as source words in a source sentence), and converts them into distributed representations, with one feature vector associated with each word position.
2. A list of feature vectors storing the output of the reader. This can be understood as a “memory” containing a sequence of facts, which can be retrieved later, not necessarily in the same order, without having to visit all of them.
3. A process that “exploits” the content of the memory to sequentially perform a task, at each time step having the ability to put attention on the content of one memory element (or a few, with a different weight).
The third component generates the translated sentence.
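A minimal numpy sketch of the attention step itself: given the memory of encoder feature vectors and the decoder's current state, compute weights over memory positions and a weighted summary. The dot-product scoring function and all sizes are illustrative assumptions; Bahdanau et al. (2015) used a small learned network to produce the scores:

```python
import numpy as np

rng = np.random.default_rng(0)
src_len, d = 6, 8                       # 6 source positions, feature size 8
memory = rng.normal(size=(src_len, d))  # component 2: one feature vector per source word
decoder_state = rng.normal(size=d)      # state of the process generating the translation

def attend(memory, query):
    """Return attention weights over memory slots and the weighted context vector."""
    scores = memory @ query                    # one relevance score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax: weights sum to 1
    context = weights @ memory                 # convex combination of memory elements
    return weights, context

weights, context = attend(memory, decoder_state)
print(weights.round(2), context.shape)  # the decoder focuses mostly on a few positions
```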
When words in a sentence written in one language are aligned with corresponding words in a translated sentence in another language, it becomes possible to relate the corresponding word embeddings. Earlier work showed that one could learn a kind of translation matrix relating the word embeddings in one language with the word embeddings in another (Kočiský et al., 2014), yielding lower alignment error rates than traditional approaches based on the frequency counts in the phrase table. There is even earlier work on learning cross-lingual word vectors (Klementiev et al., 2012). Many extensions to this approach are possible. For example, more efficient cross-lingual alignment (Gouws et al., 2014) allows training on larger datasets.
Historical Perspective
The idea of distributed representations for symbols was introduced by Rumelhart et al. (1986a) in one of the first explorations of back-propagation, with symbols corresponding to the identity of family members and the neural network capturing the relationships between family members, with training examples forming triplets such as (Colin, Mother, Victoria). The first layer of the neural network learned a representation of each family member. For example, the features for Colin might represent which family tree Colin was in, what branch of that tree he was in, what generation he was from, etc. One can think of the neural network as computing learned rules relating these attributes together in order to obtain the desired predictions. The model can then make predictions such as inferring who is the mother of Colin.

The idea of forming an embedding for a symbol was extended to the idea of an embedding for a word by Deerwester et al. (1990). These embeddings were learned using the SVD. Later, embeddings would be learned by neural networks.

The history of natural language processing is marked by transitions in the popularity of different ways of representing the input to the model. Following this early work on symbols or words, some of the earliest applications of neural networks to NLP (Miikkulainen and Dyer, 1991; Schmidhuber, 1996) represented the input as a sequence of characters.

Bengio et al. (2001) returned the focus to modeling words and introduced neural language models, which produce interpretable word embeddings. These neural models have scaled up from defining representations of a small set of symbols in the 1980s to millions of words (including proper nouns and misspellings) in modern applications. This computational scaling effort led to the invention of the techniques described above in section 12.4.3.

Initially, the use of words as the fundamental units of language models yielded improved language modeling performance (Bengio et al., 2001). To this day, new techniques continually push both character-based models (Sutskever et al., 2011) and word-based models forward, with recent work (Gillick et al., 2015) even modeling individual bytes of Unicode characters.

The ideas behind neural language models have been extended into several natural language processing applications, such as parsing (Henderson, 2003, 2004; Collobert, 2011), part-of-speech tagging, semantic role labeling, chunking, etc., sometimes using a single multi-task learning architecture (Collobert and Weston, 2008a; Collobert et al., 2011a) in which the word embeddings are shared across tasks.

Two-dimensional visualizations of embeddings became a popular tool for analyzing language models following the development of the t-SNE dimensionality reduction algorithm (van der Maaten and Hinton, 2008) and its high-profile application to visualizing word embeddings by Joseph Turian in 2009.