Highly recommended: the open-source learning community Datawhale! GitHub: datawhale. This series of posts consists of my notes from the Datawhale September team-learning program and quotes from the course materials.
Learning objectives
This tutorial uses the state-of-the-art deep learning architecture, the Transformer, to tackle several classic NLP tasks. By working through it, we will understand how Transformers work, become proficient at applying Transformer-based models to practical NLP problems, and achieve strong results on a variety of tasks.
Common NLP tasks include the following; a quick code sketch of these tasks appears right after the list:
- Text classification
- Sequence labeling
- Question answering: extractive QA and multiple-choice QA
- Generation: language modeling, machine translation, and summarization
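As a quick, hedged taste of what these tasks look like in code, the sketch below uses the Hugging Face `pipeline` API. The task names are real pipeline identifiers, but the default checkpoints are downloaded from the model hub and the example inputs are made up purely for illustration.

```python
from transformers import pipeline

# Text classification: the sentiment-analysis pipeline loads a default English checkpoint
classifier = pipeline("sentiment-analysis")
print(classifier("I really enjoy this tutorial!"))

# Sequence labeling: named entity recognition with entity spans merged
ner = pipeline("ner", aggregation_strategy="simple")
print(ner("Hugging Face was founded in New York City."))

# Extractive question answering: the answer is a span copied from the context
qa = pipeline("question-answering")
print(qa(question="Where was Hugging Face founded?",
         context="Hugging Face was founded in New York City."))

# Generation: language modeling with the pipeline's default GPT-2 checkpoint
generator = pipeline("text-generation")
print(generator("Transformers are", max_length=20))
```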
Learning resources (still being compiled)
Natural language processing (NLP) is a crucial part of artificial intelligence (AI), modeling how people share information. In recent years, deep learning approaches have obtained very high performance on many NLP tasks. In this course, students gain a thorough introduction to cutting-edge neural networks for NLP. Natural Language Processing with Deep Learning / Stanford / Winter 2021
The "bible" of NLP, Speech and Language Processing, 3rd edition (SLP3), has been eagerly awaited. Its official publication date has been pushed back repeatedly, but the authors, Daniel Jurafsky and James H. Martin, two leading figures in NLP, have kept posting draft chapters on the book's website. Two days before the end of 2020 they finally released a complete draft with many updated chapters (the 2020.12.30 version); the previous full draft dated from around October 2019, and this is likely the last complete electronic draft before official publication. Speech and Language Processing (3rd ed. draft); The Third Edition of Speech and Language Processing Was Finally Updated at the End of 2020 (《自然语言处理综论(Speech and Language Processing)》第三版终于在2020年年底更新了)
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data. Attention Is All You Need
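To make the "attention mechanisms" mentioned in the abstract concrete, here is a minimal sketch of scaled dot-product attention, the core operation of the Transformer. The tensor shapes and toy inputs are illustrative only, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5    # similarity of each query to each key
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)              # attention distribution over the keys
    return weights @ v                               # weighted sum of the value vectors

# Toy usage: batch of 2 sequences, 4 tokens each, model dimension 8
q = k = v = torch.randn(2, 4, 8)
print(scaled_dot_product_attention(q, k, v).shape)   # torch.Size([2, 4, 8])
```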
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation and more in over 100 languages. Its aim is to make cutting-edge NLP easier to use for everyone. Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and can be modified to enable quick research experiments. Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration between them. It’s straightforward to train your models with one before loading them for inference with the other. State-of-the-art Natural Language Processing for Jax, PyTorch and TensorFlow
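A small sketch of the workflow the paragraph above describes: download a pretrained model and tokenizer by name, run them on a given text, and save them locally for sharing. The checkpoint name is just one example of a model available on the hub.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Download a pretrained checkpoint and its tokenizer from the model hub
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Run the model on a given text
inputs = tokenizer("Transformers makes state-of-the-art NLP easy to use.",
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(dim=-1).item()])  # e.g. "POSITIVE"

# Save the model and tokenizer locally (push_to_hub would share them on the hub)
model.save_pretrained("./my-finetuned-model")
tokenizer.save_pretrained("./my-finetuned-model")
```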
"Fine-tuning" usually refers to a particular way of (or step in) training a deep learning model.
- As shown on the left of Figure 1, suppose we have a Source model (made up of the light-blue Layer boxes on the left) that is first trained on the Source data (this is pre-training).
- As shown on the right of Figure 1, suppose we also have a Target model (made up of the light-blue Layer boxes plus the dark-blue Output Layer on the right).
- The light-blue Layers of the Target model are identical to those of the Source model, so the trained parameters of the Source model can be copied over directly.
- For the dark-blue Output Layer on the right versus the light-blue Output Layer on the left, there are two possibilities: (A) as shown in the figure, they differ, meaning the Target model's task/objective is different from the Source model's; (B) not shown in the figure, they are the same, meaning the Target model's task/objective matches the Source model's.
- In case A, the Output Layer can only be randomly initialized, so we need some labeled samples from the Target data to train it. We can either freeze the light-blue Layer 1 through Layer L-1 or fine-tune those layers as well.
- In case B, the Target model is identical to the Source model, so it can be used on the Target data directly even without any labeled samples (zero-shot, although the results may not be great); we can also "fine-tune" some or all of the Target model's parameters on the Target data.
So "fine-tuning" lives up to its name: we make only small adjustments to the model's parameters. How to Fine-tune Pretrained Models Scientifically in 2021 (2021年如何科学的“微调”预训练模型?)
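Following the Source/Target picture above, here is a minimal PyTorch-style sketch of case A: copy the pretrained backbone, attach a new randomly initialized output layer for the target task, and either freeze the copied layers or fine-tune everything with a small learning rate. The BERT checkpoint name and the number of labels are assumptions chosen only for illustration.

```python
import torch
from transformers import AutoModelForSequenceClassification

# Target model = copied pretrained backbone + a new, randomly initialized
# output layer sized for the target task (case A above).
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",  # the Source model; checkpoint name is illustrative
    num_labels=3,         # the Target task has a different label space, so a new head is created
)

# Option 1: freeze the copied Layers and train only the new output layer.
for param in model.bert.parameters():
    param.requires_grad = False

# Option 2 (fine-tuning proper): leave every parameter trainable and use a
# small learning rate, so the pretrained weights are only slightly adjusted.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-5
)
```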
The topic of this article is pre-training in natural language processing: how pre-training techniques in NLP evolved, step by step, into the BERT model. Along the way we can see quite naturally how the ideas behind BERT took shape, what its lineage is, what it inherited and what it innovated, why it works so well and what the main reasons are, as well as why its model-level innovation is not that large and why BERT can be seen as a synthesis of the major NLP advances of recent years. We will go through this step by step, with the pre-training of natural language as the thread running through the story, but the destination is BERT. And to talk about pre-training for natural language, we have to start with pre-training in computer vision. From Word Embedding to BERT: the History of Pre-training Techniques in NLP (从Word Embedding到Bert模型—自然语言处理中的预训练技术发展史)