Table of Contents
Introduction
Motivation
Main Work
3.2 Explicitly N-gram Masked Language Modeling
3.3 Comprehensive N-gram Prediction
3.4 Enhanced N-gram Relation Modeling
Experimental Results
Ablation Studies
Effect of Explicitly N-gram MLM
Size of N-gram Lexicon
Effect of Comprehensive N-gram Prediction and Enhanced N-gram Relation Modeling
Reflections
Introduction
ERNIE-Gram is an explicitly n-gram masking and predicting method designed to eliminate the limitations of previous contiguous masking strategies and to incorporate coarse-grained linguistic information into pre-training more sufficiently. ERNIE-Gram conducts comprehensive n-gram prediction and relation modeling to further enhance the learning of semantic n-grams during pre-training.
Motivation
- BERT’s MLM focuses on the representations of fine-grained text units (e.g. words or subwords in English and characters in Chinese), rarely considering coarse-grained linguistic information (e.g. named entities or phrases in English and words in Chinese), thus incurring inadequate representation learning.
- Many efforts have been devoted to integrating coarse-grained semantic information by independently masking and predicting contiguous sequences of n tokens, namely n-grams, such as named entities, phrases (Sun et al., 2019b), and whole words.
- The authors argue that such contiguous masking strategies are less effective and reliable, since the predictions of the tokens in a masked n-gram are independent of each other, which neglects the intra-dependencies of n-grams.
Main Work
3.2 Explicitly N-gram Masked Language Modeling
- As shown in Figure 1(a): the previous contiguously MLM ignores the dependencies among the tokens inside an n-gram, so at prediction time the individual tokens of a masked n-gram are predicted independently of each other. The loss is computed as follows:
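The formula below is reconstructed from the paper's notation, where $\mathbf{z}_{\mathcal{M}}$ denotes the set of masked n-grams and $\mathbf{x}_{\setminus\mathcal{M}}$ the unmasked context:

$$-\log p_\theta(\mathbf{z}_{\mathcal{M}} \mid \mathbf{x}_{\setminus\mathcal{M}}) = -\sum_{\mathbf{z}\in\mathbf{z}_{\mathcal{M}}}\;\sum_{x\in\mathbf{z}}\log p_\theta\left(x \mid \mathbf{x}_{\setminus\mathcal{M}}\right)$$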
- As shown in Figure 1(b): explicitly N-gram MLM treats each n-gram as a single unit (token), which requires an additional n-gram lexicon, so the whole n-gram is predicted at a single position. The loss is computed as follows:
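Again reconstructed from the paper: $\mathbf{y}_{\mathcal{M}}$ are the n-gram identities from the lexicon, and $\bar{\mathbf{x}}_{\setminus\mathcal{M}}$ is the context in which each masked n-gram is collapsed into a single [MASK] slot:

$$-\log p_\theta(\mathbf{y}_{\mathcal{M}} \mid \bar{\mathbf{x}}_{\setminus\mathcal{M}}) = -\sum_{y\in\mathbf{y}_{\mathcal{M}}}\log p_\theta\left(y \mid \bar{\mathbf{x}}_{\setminus\mathcal{M}}\right)$$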
3.3 Comprehensive N-gram Prediction
- Going one step further, this work predicts the masked n-gram both as a whole segment and token by token at the same time. The authors carefully design the attention mask matrix to make this work; see the original paper for details, and the sketch below for the intuition.
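To make the idea concrete, here is a minimal NumPy sketch of such an attention mask, under my own assumptions about the layout (the function and index names are illustrative, not the paper's implementation): context positions see only the context, while the coarse-grained [MASK] and the fine-grained [MASK]s each see the context plus their own stream, so neither prediction stream can leak answers to the other.

```python
import numpy as np

def build_attention_mask(seq_len, ctx_idx, coarse_idx, fine_idx):
    """Illustrative mask for joint coarse-/fine-grained n-gram prediction.

    True = attention allowed, False = blocked.
    - context positions attend only to context (targets stay hidden),
    - the coarse-grained [MASK] attends to context + itself,
    - the fine-grained [MASK]s attend to context + each other,
    so the coarse and fine prediction streams never see each other.
    """
    allow = np.zeros((seq_len, seq_len), dtype=bool)
    for i in ctx_idx:                 # context <-> context
        allow[i, ctx_idx] = True
    for i in coarse_idx:              # coarse stream
        allow[i, ctx_idx] = True
        allow[i, coarse_idx] = True
    for i in fine_idx:                # fine stream
        allow[i, ctx_idx] = True
        allow[i, fine_idx] = True
    return allow

# Toy example: tokens x1..x3 visible, one n-gram [MASK] at position 3,
# two token-level [MASK]s at positions 4 and 5.
mask = build_attention_mask(6, ctx_idx=[0, 1, 2], coarse_idx=[3], fine_idx=[4, 5])
print(mask.astype(int))
```

In an actual transformer, this boolean matrix would be applied as additive $-\infty$ biases on the attention logits.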
3.4 Enhanced N-gram Relation Modeling
- To explicitly learn the semantic relationships between n-grams, we jointly pre-train a small generator model θ′ with the explicitly n-gram MLM objective to sample plausible n-gram identities. Then we employ the generated identities to perform masking and train the standard model θ to predict the original n-grams from fake ones in coarse-grained and fine-grained manners, as shown in Figure 3(a), which is efficient to model the pair relationships between similar n-grams.
- This models the relationships between n-grams, borrowing part of the idea behind ELECTRA; a conceptual sketch follows.
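Below is a minimal, hypothetical PyTorch sketch of the ELECTRA-style corruption step: a small generator proposes a plausible n-gram identity at each masked slot, and the standard model θ is then trained to recover the original n-gram (both as a whole and token by token). The `generator` interface and all names here are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn.functional as F

def corrupt_with_generator(generator, input_ids, masked_slots):
    """ELECTRA-style corruption: fill each masked slot with an n-gram
    identity sampled from a small generator model (theta').

    generator:    hypothetical callable returning logits over the n-gram
                  lexicon, shape (batch, seq_len, lexicon_size)
    input_ids:    (batch, seq_len) tensor with [MASK] ids at masked slots
    masked_slots: list of (batch_index, position) pairs
    """
    with torch.no_grad():
        logits = generator(input_ids)
    corrupted = input_ids.clone()
    for b, pos in masked_slots:
        probs = F.softmax(logits[b, pos], dim=-1)
        # Sampling (rather than argmax) yields varied yet plausible fakes.
        corrupted[b, pos] = torch.multinomial(probs, num_samples=1).item()
    return corrupted
```

Training θ to tell the original n-gram apart from a semantically close fake is what models the pairwise relations between similar n-grams.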
Experimental Results
Ablation Studies
Effect of Explicitly N-gram MLM
- The improvement of explicitly N-gram MLM over contiguously MLM is not as large as one might expect, only around 0.5 points.
Size of N-gram Lexicon
Effect of Comprehensive N-gram Prediction and Enhanced N-gram Relation Modeling
Reflections
- The whole work feels rather complex; apparently, achieving real gains and topping leaderboards is not easy. Still, the approach does not feel that smooth or elegant; the simplest way is often the best.
- When I worked on related projects, I had no good solution for n-grams or spans either (I wanted to enlarge the vocabulary to include whole words). I had not expected that the crude contiguously MLM would actually work, yet the improvement of explicitly N-gram MLM over contiguously MLM is not as large as I imagined (I was naive). (Incidentally, this also suggests that plain character-level processing performs reasonably well.)