Neural Word Segmentation Learning for Chinese
paper code
- Conference: ACL 2016
- Authors: Deng Cai and Hai Zhao
- Affiliation: Department of Computer Science and Engineering, Key Lab of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, China
- Main contribution (what problem it solves): eliminates context windows and can exploit the complete segmentation history
- Main method (what it uses): LSTM
- Datasets: PKU and MSR
- Results: (1) produces a word representation from a given sequence of characters; (2) a sentence-level likelihood evaluation system; (3) an algorithm for finding the best segmentation
- Limitations: performance falls short of SOTA models, and training takes a long time
- Future work: improve model performance
Abstract
Previous approaches: can only capture contextual information within a fixed-size local window and simple interactions between adjacent tags. This paper: thoroughly eliminates context windows and can utilize the complete segmentation history. The model employs a gated combination neural network over characters to produce distributed representations of word candidates, which are then given to a long short-term memory (LSTM) language scoring model.
Introduction
Key points:
- Most East Asian languages, including Chinese, are written without explicit word delimiters; therefore, word segmentation is a preliminary step for processing those languages.
- Since Xue (2003), most methods formalize the Chinese word segmentation (CWS) as a sequence labeling problem with character position tags, which can be handled with supervised learning methods such as Maximum Entropy and Conditional Random Fields. However, those methods heavily depend on the choice of handcrafted features.
Drawbacks of previous methods: the tag-tag transition is insufficient to model the complicated influence of previous segmentation decisions, even though those decisions can be a crucial clue for later ones. The fixed context window size, broadly adopted by these methods for feature engineering, also restricts the flexibility of modeling diverse distances. Moreover, word-level information, a greater-granularity unit as suggested in (Huang and Zhao, 2006), remains unexploited.

This paper's method: the paper re-formalizes CWS as a direct segmentation learning task. Our method does not make tagging decisions on individual characters, but directly evaluates the relative likelihood of different segmented sentences and then searches for a segmentation with the highest score. To featurize a segmented sentence, a series of distributed vector representations (Bengio et al., 2003) are generated to characterize the corresponding word candidates. Such a representation setting makes decoding quite different from previous methods and indeed much more challenging, but more discriminative features can be captured.

Though the vector building is word-centered, our proposed scoring model covers all three processing levels, from character and word up to sentence. First, the distributed representation starts from character embeddings, since in the context of word segmentation the n-gram data sparsity issue makes it impractical to use word vectors directly. Second, because a word candidate's representation is derived from its characters, the internal character structure is also encoded, so it can be used to judge how likely the candidate is to be a word on its own. Third, to evaluate how well a segmented sentence makes sense through word interaction, an LSTM is used to chain word candidates together incrementally and construct a representation of the partially segmented sentence at each decoding step, so that the coherence between the next word candidate and the previous segmentation history can be captured.

To the best of our knowledge, our proposed approach to CWS is the first attempt that explicitly models the entire contents of the segmenter's state, including the complete history of both segmentation decisions and input characters. Comparisons of the feature windows used in different models are shown in Table 1. Compared to both sequence labeling schemes and earlier word-based models, our model thoroughly eliminates context windows and can capture the complete history of segmentation decisions, which offers more possibilities for effectively and accurately modeling segmentation context.
2 Overview
We formulate the CWS problem as finding a mapping from an input character sequence x to a word sequence y, and the output sentence ŷ satisfies:
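A reconstruction of the objective implied here, assuming GEN(x) denotes the set of all candidate segmentations of x (this is the standard structured-prediction form, not copied verbatim from the paper):

$$\hat{y} = \underset{y \in \mathrm{GEN}(x)}{\arg\max}\ \mathrm{score}(y)$$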
Unlike all previous works, our scoring function is sensitive to the complete contents of the partially segmented sentence. As shown in Figure 1, to solve CWS in this way, a neural network scoring model is designed to evaluate the likelihood of a segmented sentence. Based on the proposed model, a decoder is developed to find the segmented sentence with the highest score. Meanwhile, a max-margin method is utilized to perform training by comparing the structured difference between the decoder output and the gold segmentation.
3 Neural Network Scoring Model
The score for a segmented sentence is computed by first mapping it into a sequence of word candidate vectors; the scoring model then takes the vector sequence as input, scoring each word candidate from two perspectives: (1) how likely the word candidate itself is to be a legal word; (2) how reasonably the word candidate follows the previous segmentation history. After that, the word candidate is appended to the segmentation history, updating the state of the scoring system for subsequent judgements. Figure 2 illustrates the entire scoring neural network.
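To make the flow above concrete, here is a minimal sketch of the incremental scoring loop. It assumes a word score of the form u·w and a link score of the form p·w (detailed in the next subsections); embed_word (e.g. the GCNN of Section 3.1), lstm_step, and all parameter names are hypothetical placeholders rather than the released implementation.

```python
import numpy as np

def score_segmentation(words, params, embed_word, lstm_step):
    """words: a segmented sentence, given as a list of word candidates
    (each a list of character vectors). `embed_word` and `lstm_step`
    are placeholder callables for the composition and LSTM update."""
    h, c = params["h0"], params["c0"]                 # LSTM segmentation history
    p = np.tanh(params["Wp"] @ h + params["bp"])      # prediction of the first word
    total = 0.0
    for chars in words:
        w = embed_word(chars)                         # word candidate vector
        total += params["u"] @ w                      # word score: is w a plausible word?
        total += p @ w                                # link score: does w fit the history?
        h, c = lstm_step(w, h, c, params)             # append w to the segmentation history
        p = np.tanh(params["Wp"] @ h + params["bp"])  # predict the next word
    return total
```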
3.1 Word Score
Character Embedding. While the scores are decided at the word level, using word embeddings directly would lead to a notable issue: rare and out-of-vocabulary words would be poorly estimated. In addition, the character-level information inside an n-gram can help judge whether it is a true word. Therefore, a lookup table of character embeddings is used as the bottom layer. Formally, we have a character dictionary D of size |D|. Each character c ∈ D is represented as a real-valued vector (character embedding) c ∈ R^d, where d is the dimensionality of the vector space. The character embeddings are stacked into an embedding matrix M ∈ R^(d×|D|). For a character c ∈ D, its character embedding c ∈ R^d is retrieved by the embedding layer according to its index.

Gated Combination Neural Network. To obtain a word representation from its characters, the simplest strategy integrates the character vectors into a word representation using a weight matrix W^(L) shared across all words of the same length L, followed by a non-linear function g(·). Specifically, if c_i (1 ≤ i ≤ L) are d-dimensional character vectors, the corresponding word vector w is d-dimensional as well (equation (1)). Since a gated structure in a neural network can be useful for hybrid feature extraction, a gated combination neural network (GCNN) is proposed especially for character compositionality; it contains two types of gates, reset gates and update gates. Intuitively, the reset gates decide which parts of the character vectors should be mixed, while the update gates decide what to preserve when combining the character information. Concretely, for words of length L, the word vector w ∈ R^d is computed from these gates. The gated mechanism captures both character and character-interaction characteristics to give an effective word representation.
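Since the combination equations themselves are not reproduced above, the following numpy sketch illustrates the two composition strategies just described. The parameter names (W, R, U), the exact gate parameterization, and the softmax-style normalization of the update gates are illustrative assumptions consistent with the description, not the paper's exact formulas.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def simple_combine(chars, W):
    """Equation (1)-style baseline: a single non-linear layer over the
    concatenated character vectors of an L-character word."""
    return np.tanh(W @ np.concatenate(chars))                     # (d,)

def gcnn_combine(chars, params):
    """GCNN sketch: reset gates decide how the character vectors are mixed
    into a candidate word vector; update gates decide how much of that
    candidate vs. each original character survives."""
    L, d = len(chars), chars[0].shape[0]
    x = np.concatenate(chars)                                     # (L*d,)
    r = [sigmoid(params["R"][i] @ x) for i in range(L)]           # reset gates, each (d,)
    mixed = np.concatenate([r[i] * chars[i] for i in range(L)])   # reset-gated characters
    w_hat = np.tanh(params["W"] @ mixed)                          # candidate word vector, (d,)
    # update gates over {w_hat, c_1, ..., c_L}, normalized to sum to 1 per dimension
    u_in = np.concatenate([w_hat, x])                             # ((L+1)*d,)
    z = np.exp(np.stack([params["U"][j] @ u_in for j in range(L + 1)]))  # (L+1, d)
    z /= z.sum(axis=0)
    return z[0] * w_hat + sum(z[i + 1] * chars[i] for i in range(L))

# toy usage with random parameters for a 2-character word and d = 4
d, L = 4, 2
rng = np.random.default_rng(0)
chars = [rng.standard_normal(d) for _ in range(L)]
params = {
    "R": [rng.standard_normal((d, L * d)) for _ in range(L)],
    "W": rng.standard_normal((d, L * d)),
    "U": [rng.standard_normal((d, (L + 1) * d)) for _ in range(L + 1)],
}
word_vec = gcnn_combine(chars, params)   # a d-dimensional word candidate vector
```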
3.2 Link Score
We utilize an LSTM to capture the coherence within a segmented sentence. Long Short-Term Memory Networks. The LSTM (Hochreiter and Schmidhuber, 1997) is an extension of the recurrent neural network (RNN), an effective tool for sequence modeling that preserves history information in its hidden states. At each time step t, an RNN takes the input x_t and updates its recurrent hidden state h_t. Link Score. LSTMs have been shown to outperform RNNs on many NLP tasks, notably language modeling. In our model, the LSTM is used to chain word candidates together in a left-to-right, incremental manner. At time step t, a prediction p_{t+1} ∈ R^d about the next word y_{t+1} is made based on the hidden state h_t, and the link score for the next word y_{t+1} is computed from this prediction. Because of the LSTM structure, the prediction vector p_{t+1} carries useful information detected from the entire segmentation history, including previous segmentation decisions. In this way, our model gains the ability of sequence-level discrimination rather than local optimization.
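The prediction and the link score can be written out as follows. This is a sketch assuming the prediction vector is a non-linear transformation of the hidden state and the link score is its inner product with the next word candidate vector w_{t+1}; this matches the description but may not be the paper's exact parameterization:

$$p_{t+1} = \tanh(W_p\, h_t + b_p), \qquad \mathrm{LinkScore}(y_{t+1}) = p_{t+1} \cdot w_{t+1}$$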
3.3 Sentence Score
The sentence score for a segmented sentence y with n word candidates is computed by summing up the word scores (2) and the link scores (3).
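Written out (a reconstruction consistent with the text, where WordScore and LinkScore refer to equations (2) and (3) of Sections 3.1 and 3.2):

$$s(y_{[1:n]}, \theta) = \sum_{t=1}^{n} \big( \mathrm{WordScore}(w_t) + \mathrm{LinkScore}(w_t) \big)$$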
4 Decoding
The total number of possible segmented sentences grows exponentially with the length of the character sequence, which makes it impractical to compute the score of every possible segmentation. For exact inference, most sequence-labeling systems address this problem with a Viterbi search, which takes advantage of the hypothesis that tag interactions only exist between adjacent characters (the Markov assumption). However, since our model is intended to capture the complete history of segmentation decisions, such dynamic programming algorithms cannot be adopted here. To make the model efficient in practical use, a beam-search algorithm with dynamic programming motivations is proposed, as shown in Algorithm 1. The main idea is that any segmentation of the first i characters can be separated into two parts: the first part consists of the characters with indexes from 0 to j and is denoted y; the remaining part is the word composed of c[j+1 : i]. The influence of the previous segmentation y can be represented as a triple (y.score, y.h, y.c), where y.score, y.h, and y.c denote the current score, current hidden state vector, and current memory cell vector, respectively. Beam search ensures that the total time for segmenting a sentence of n characters is proportional to w × k × n, where w and k are the maximum word length and the beam size, respectively.
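The following is a hedged sketch of the beam-search decoder described above (Algorithm 1 in the paper). embed_word, lstm_step, and the scoring parameters are the same hypothetical placeholders used in the scoring sketch of Section 3; only the search structure is the point here.

```python
import numpy as np

def beam_search(chars, params, embed_word, lstm_step,
                max_word_len=4, beam_size=8):
    """chars: character vectors of the input sentence. Returns the
    highest-scoring entry (score, word_spans, h, c) found by the beam."""
    n = len(chars)
    # beams[i]: best partial segmentations of the first i characters,
    # each stored as (score, word_spans, h, c) where (h, c) is the LSTM
    # state after consuming those words.
    beams = [[] for _ in range(n + 1)]
    beams[0] = [(0.0, [], params["h0"], params["c0"])]
    for i in range(1, n + 1):
        candidates = []
        for j in range(max(0, i - max_word_len), i):
            w = embed_word(chars[j:i])                  # candidate word c[j+1:i]
            for score, spans, h, c in beams[j]:
                p = np.tanh(params["Wp"] @ h + params["bp"])
                new_score = score + params["u"] @ w + p @ w
                new_h, new_c = lstm_step(w, h, c, params)
                candidates.append((new_score, spans + [(j, i)], new_h, new_c))
        candidates.sort(key=lambda cand: cand[0], reverse=True)
        beams[i] = candidates[:beam_size]               # prune to the beam size
    return beams[n][0]
```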
5 Training
We use the max-margin criterion (Taskar et al., 2005) to train the model. As reported in (Kummerfeld et al., 2015), margin methods generally outperform both likelihood and perceptron methods. For a given character sequence x^(i), denote the correct segmented sentence for x^(i) as y^(i). We define a structured margin loss Δ(y^(i), ŷ) for predicting a segmented sentence ŷ. The margin loss can be regarded as counting the number of incorrectly segmented characters and multiplying it by a fixed discount parameter for smoothing; the loss is therefore proportional to the number of incorrectly segmented characters. Given a training set Ω, the regularized objective function J(θ) is the loss function including an l2-norm term. Because of the hinge loss, the objective function is not differentiable, so we use a subgradient method (Ratliff et al., 2007) that computes a gradient-like direction. Following (Socher et al., 2013), we use the diagonal variant of AdaGrad (Duchi et al., 2011) with minibatches to minimize the objective; the update for the i-th parameter at time step t follows the standard AdaGrad rule.
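The equations referred to in this paragraph can be reconstructed in their standard max-margin and AdaGrad forms. The symbols below (η for the discount parameter, m for sentence length, λ for the regularization strength, α for the learning rate, g for the subgradient) are assumptions consistent with the description rather than the paper's exact notation:

$$\Delta(y^{(i)}, \hat{y}) = \eta \sum_{t=1}^{m} \mathbf{1}\big[\,\hat{y} \text{ segments character } t \text{ incorrectly}\,\big]$$

$$J(\theta) = \frac{1}{|\Omega|} \sum_{(x^{(i)}, y^{(i)}) \in \Omega} l_i(\theta) + \frac{\lambda}{2}\lVert\theta\rVert_2^2, \qquad l_i(\theta) = \max_{\hat{y} \in \mathrm{GEN}(x^{(i)})}\big(s(\hat{y}, \theta) + \Delta(y^{(i)}, \hat{y})\big) - s(y^{(i)}, \theta)$$

$$\theta_{t,i} = \theta_{t-1,i} - \frac{\alpha}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}}\, g_{t,i}$$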
6 Experiments
6.1 Data
To evaluate the proposed segmenter, we use two popular datasets, PKU and MSR, from the second International Chinese Word Segmentation Bakeoff (Emerson, 2005). Both datasets are preprocessed by replacing continuous English characters and digits with a unique token.
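As an illustration of this preprocessing step, the sketch below collapses each run of Latin letters and each run of digits into a single placeholder token; the actual placeholder strings used by the authors are not specified, so "<ENG>" and "<NUM>" are assumptions.

```python
import re

def normalize(sentence):
    # replace runs of Latin letters, then runs of digits, with unique tokens
    sentence = re.sub(r"[A-Za-z]+", "<ENG>", sentence)
    sentence = re.sub(r"[0-9]+", "<NUM>", sentence)
    return sentence

print(normalize("苹果iPhone6s售价5288元"))  # -> 苹果<ENG><NUM><ENG>售价<NUM>元
```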
6.2 Hyper-parameters
To determine a suitable set of hyper-parameters, we divide the training data into two sets: the first 90% of sentences as the training set and the remaining 10% as the development set. We choose the hyper-parameters shown in Table 2. We also apply dropout to the input layer of the model with a dropout rate of 20% to avoid overfitting.
6.3 Model Analysis
Beam Size. Figure 5 shows that a segmenter with beam size 4 is enough to get the best performance, which lets the model strike a good balance between accuracy and efficiency.

GCNN. To reveal the impact of the GCNN, we re-implemented a simplified version of the model which replaces the GCNN part with a single non-linear layer as in equation (1). The results are listed in Table 3 and demonstrate that performance is significantly boosted by the GCNN architecture (94.0% to 95.5% F1-score); the best performance the simplified version can achieve is 94.7%, and only with a much larger character embedding size.

Link Score & Word Score. We conducted several experiments to investigate the individual effects of the link score and the word score, since these two scores estimate the sentence likelihood from two different perspectives: the semantic coherence between words and the existence of individual words. The learning curves of models with different scoring strategies are shown in Figure 6.
The model with only the word score can be regarded as making segmentation decisions based only on local window information. The comparisons show that such a model gives moderate performance. By contrast, the model with only the link score performs much better, close to the joint model, which demonstrates that the complete segmentation history, which cannot be effectively modeled in previous schemes, has great practical value for word segmentation.
6.4 Results
We first compare our model with the latest neural network methods, as shown in Table 4. The results presented in (Chen et al., 2015a; Chen et al., 2015b) used an extra preprocessing step to filter out Chinese idioms according to an external dictionary. Table 4 lists the results (F1-scores) with different dictionaries, which show that our models perform better under the same settings.
Footnote 3: The dictionary used in (Chen et al., 2015a; Chen et al., 2015b) has neither been publicly released nor had its exact source specified. We had to re-run their code using our selected dictionary to make a fair comparison; our dictionary has been submitted along with this submission. Footnote 4: In detail, when a dictionary is used, a preprocessing step is performed before training and testing that scans the original text for Chinese idioms included in the dictionary and replaces them with a unique token.
Table 5 gives comparisons among previous neural network models. In the first block of Table 5, the character embedding matrix M is randomly initialized. The results show that our proposed model outperforms previous neural network methods (to make comparisons fair, we re-ran their code from https://github.com/dalstonChen without using any Chinese idiom dictionary). Previous works have found that performance can be improved by pre-training the character embeddings on large unlabeled data. Therefore, we use the word2vec toolkit (Mikolov et al., 2013) to pre-train the character embeddings on the Chinese Wikipedia corpus and use them for initialization. Table 5 also shows the results with additional pre-trained character embeddings; again, our model achieves better performance than previous neural network models. Table 6 compares our models with previous state-of-the-art systems. Recent systems such as (Zhang et al., 2013), (Chen et al., 2015b), and (Chen et al., 2015a) rely on both extensive feature engineering and external corpora to boost performance, so they are not directly comparable with our models. In the closed-set setting, our models achieve state-of-the-art performance on the PKU dataset but only a competitive result on the MSR dataset, which can be attributed to the overly strict maximum word length setting used for consistency, as it is well known that the MSR corpus has a much longer average word length (Zhao et al., 2010). Table 7 gives the results on the MSR corpus with different maximum decoding word lengths, showing both F1-scores and training time. The results show that segmentation performance can indeed be further improved by allowing longer words during decoding, though longer training time is also needed. When 6-character words are allowed, the F1-score on MSR can be further improved by 0.3%.
7 Related Work
Neural Network Models. Most modern CWS methods, following (Xue, 2003), treat CWS as a sequence labeling problem (Zhao et al., 2006b). Recently, researchers have explored neural network based approaches (Collobert et al., 2011) to reduce the effort of feature engineering. They also model CWS as a tagging problem, scoring tags on individual characters. In those models, tag scores are decided by context information within local windows, and the sentence-level score is obtained via context-independent tag transitions. Pei et al. introduced tag embeddings as input to capture combinations of context and tag history. However, in previous works only the tag of the single previous character was taken into consideration, even though theoretically the complete history of actions taken by the segmenter should be considered.
Alternatives to Sequence Labeling. Besides sequence labeling schemes, Zhang and Clark (2007) proposed a word-based perceptron method, and Zhang et al. (2012) used a linear-time incremental model that can also benefit from various kinds of features, including word-based features. However, both rely heavily on massive handcrafted features. Contemporary to this work, some neural models also leverage word-level information: Liu et al. (2016) use a semi-CRF taking segment-level embeddings as input, and Zhang et al. (2016a) use a transition-based framework. Another notable exception is (Ma and Hinrichs, 2015), which is also an embedding-based model but models CWS as configuration-action matching; however, this method again only uses context information within limited-size windows.
Other Techniques. The proposed model might further benefit from techniques used in recent state-of-the-art systems, such as semi-supervised learning, incorporating global information, and joint models.
8 Conclusion
This paper presents a novel neural framework for Chinese word segmentation with three main components: (1) a factory that produces a word representation when given its governed characters; (2) a sentence-level likelihood evaluation system for segmented sentences; (3) an efficient and effective algorithm to find the best segmentation. The proposed framework is a fresh attempt to formalize word segmentation as a direct structured learning procedure in terms of the recent distributed representation framework. Though the system's results are better than the latest neural network segmenters and comparable to all previous state-of-the-art systems, the framework retains great potential that can be further investigated and improved in the future.
Note: text marked with strikethrough indicates uncertain content.