[Artificial Intelligence] CS224N NLP
Table of Contents
Abbreviation
Lecture 1 - Introduction and Word Vectors
- NLP: convert one-hot encodings into distributed representations. A one-hot vector cannot represent relations between words, and it is far too large.
- Word2vec: ignores the positions of context words. Uses two vectors per word: a center-word vector and a context-word vector.
- The softmax function (a small sketch follows these notes).
- Training the model with gradient descent: there is a derivation for computing the gradient (39:50-56:40). [ToL] Review that derivation and what follows especially.
- Some results shown with code (5640-h0516).
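As a small illustration of the word2vec pieces above (two vectors per word, a softmax over dot products), here is a minimal NumPy sketch of the skip-gram probability P(o | c) = exp(u_o · v_c) / Σ_w exp(u_w · v_c). The toy vocabulary, dimensions, and variable names are invented for the example; this is a sketch of the idea, not the course's reference code.

```python
import numpy as np

np.random.seed(0)
vocab = ["the", "cat", "sat", "on", "mat"]    # toy vocabulary (made up)
n_vocab, d = len(vocab), 8                    # vocabulary size, embedding dimension

U = 0.01 * np.random.randn(n_vocab, d)   # "outside"/context vectors u_w, one row per word
V = 0.01 * np.random.randn(n_vocab, d)   # center-word vectors v_w, one row per word

def p_context_given_center(center_idx):
    """Skip-gram softmax: P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)."""
    scores = U @ V[center_idx]     # dot product of every outside vector with v_c
    scores -= scores.max()         # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

probs = p_context_given_center(vocab.index("cat"))
for word, p in zip(vocab, probs):
    print(f"P({word} | cat) = {p:.3f}")
```

Training then adjusts U and V by gradient descent so that observed (center, context) pairs get high probability.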
Q&A (Lecture 1)
- Why are there center-word and context-word vectors? (h0650) To avoid a vector taking a dot product with itself in some situations???
- Even synonyms can be merged into one vector (h1215), which differs from Lee, who says synonyms should use different vectors.

Lecture 2 - Word Vectors, Word Senses, and Neural Classifiers
- Bag-of-words models (0245): the model makes the same predictions at each position.
- Gradient descent (0600): not usually used directly because the computation is too large. Step size: neither too big nor too small.
- Stochastic gradient descent, SGD (0920) [TOBELM]: take only part of the corpus; billions of times faster, and it may even give better results. But because it is stochastic, you either need sparse-matrix update operations that only update certain rows of the full embedding matrices U and V, or you need to keep a hash for the word vectors (1344). [ToL]
- More details of word2vec (1400): skip-gram (SG) uses the center word to predict the context; SGNS adds negative sampling [ToBLO], using the logistic function instead of softmax and sampling words from the corpus; CBOW does the opposite.
- Why use two vectors (1500): sometimes a word would otherwise take a dot product with itself. [ToL]
- In the objective, the first term is for the positive word and the last for the negative words (2800). Negative words are sampled; the center word will turn up again on other occasions, and when it does, other words get sampled, so the model learns step by step.
- Why not capture co-occurrence counts directly? (2337)
- SVD (3230) [ToL] (https://zhuanlan.zhihu.com/p/29846048): use SVD to get lower-dimensional representations for words (3451).
- Count-based vs. direct-prediction methods (3900).
- Encoding meaning components in vector differences (3948): this is what makes addition and subtraction meaningful for word vectors.
- GloVe (4313): make the dot product approximate the log of the co-occurrence count.
- How to evaluate word vectors: intrinsic vs. extrinsic (4756).
- Analogy evaluation and hyperparameters (intrinsic) (5515).
- Word vector distances and their correlation with human judgements (5640). Data shows that 300-dimensional word vectors work well (5536).
- The objective function for the GloVe model and what "log-bilinear" means (5739).
- Word senses and word sense ambiguity (h0353): different senses of a word get different vectors, and the word can then be the sum of them all. This works well rather than badly (h1200); the vectors are sparse enough that the different senses can be separated back out (h1402).

Lecture 3 - Gradients by Hand (Matrix Calculus) and Algorithmically (the Backpropagation Algorithm)
- All the math details of doing neural net learning. Needs to be learned again; not fully understood yet.
- Named Entity Recognition (0530); simple NER (0636); how the sample model runs (0836).
- Update equation (1220); Jacobian (1811); chain rule (2015); doing one example step (2650); Hadamard product. [ToL]
- Reusing computation (3402): ds/dW.
- Forward and backward propagation (5000).
- An example (5507): a = x + y, b = max(y, z), f = a * b; compute all gradients at once (h0005). A worked sketch follows these notes.
- Backprop on a general computation graph (h0800). [ToL]
- Automatic differentiation (h1346): many tools can compute the gradients automatically.
- Manual gradient checking: numeric gradients (h1900).
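To make the Lecture 3 example concrete, here is a small sketch that forward-computes f = (x + y) * max(y, z), backpropagates the gradients by hand with the chain rule, and then checks them numerically as in the "numeric gradient" note above. The specific input values are made up for the example.

```python
# Worked example from the Lecture 3 notes: a = x + y, b = max(y, z), f = a * b.
def forward_backward(x, y, z):
    # forward pass
    a = x + y
    b = max(y, z)
    f = a * b
    # backward pass (chain rule, reusing the values from the forward pass)
    df_da, df_db = b, a                 # f = a * b
    da_dx, da_dy = 1.0, 1.0             # a = x + y
    db_dy = 1.0 if y >= z else 0.0      # b = max(y, z) (subgradient at ties)
    db_dz = 1.0 - db_dy
    grads = {
        "x": df_da * da_dx,
        "y": df_da * da_dy + df_db * db_dy,   # y feeds both a and b
        "z": df_db * db_dz,
    }
    return f, grads

def numeric_grad(x, y, z, eps=1e-6):
    """Manual gradient checking with central differences."""
    def f(x, y, z):
        return (x + y) * max(y, z)
    return {
        "x": (f(x + eps, y, z) - f(x - eps, y, z)) / (2 * eps),
        "y": (f(x, y + eps, z) - f(x, y - eps, z)) / (2 * eps),
        "z": (f(x, y, z + eps) - f(x, y, z - eps)) / (2 * eps),
    }

x, y, z = 1.0, 2.0, 0.0                 # arbitrary example inputs
print(forward_backward(x, y, z))        # analytic gradients
print(numeric_grad(x, y, z))            # should match the analytic values closely
```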
Lecture 4 - Dependency Parsing
- Two views of linguistic structure. Constituency = phrase structure grammar = context-free grammars (CFGs) (0331): phrase structure organizes words into nested constituents.
- Dependency structure (1449): shows which words depend on (modify, attach to, or are arguments of) which other words.
- Why do we need sentence structure? (2205) Meaning cannot be expressed one word at a time.
- Prepositional phrase attachment ambiguity (2422). Example sentences: "San Jose cops kill man with knife"; "Scientists count whales from space"; "The board approved [its acquisition] [by Royal Trustco Ltd.] [of Toronto] [for $27 a share] [at its monthly meeting]."
- Coordination scope ambiguity (3614): "Shuttle veteran and longtime NASA executive Fred Gregory appointed to board"; "Doctor: No heart, cognitive issues".
- Adjectival/adverbial modifier ambiguity (3755): "Students get [first hand] job experience" vs. "Students get first [hand job] experience".
- Verb phrase (VP) attachment ambiguity (4404): "Mutilated body washes up on Rio beach to be used for Olympics beach volleyball."
- Dependency grammar and dependency structure (4355): a fake ROOT is added for convenience.
- Dependency grammar history (4742); the rise of annotated data; Universal Dependencies treebanks (5100).
- Treebanks (5400): it is slow to write a grammar by hand, but still worthwhile, because the data can be reused elsewhere, not only in NLP.
- How to build a parser from dependencies (5738).
- Projectivity (h0416); methods of dependency parsing (h0521).
- Greedy transition-based parsing (h0621); a basic transition-based dependency parser (h0808), e.g. for "I ate fish": [root] I ate fish → [root I ate] fish → [root ate] fish → [root ate fish] → [root ate] → [root].
- MaltParser (h1351). [ToL]
- Evaluation of dependency parsing (h1845). [ToL]

Lecture 5 - Language Models and Recurrent Neural Networks (RNNs)
- A neural dependency parser (0624); distributed representations (0945).
- Deep learning classifiers are non-linear classifiers (1210); a simple feed-forward neural network multi-class classifier (1621).
- Neural dependency parser model architecture (1730); graph-based dependency parsers (2044).
- Regularization and overfitting (2529); dropout (3100) [ToL]; vectorization (3333); non-linearities (4000); parameter initialization (4357); optimizers (4617).
- Learning rates (4810): the learning rate can be made smaller as training goes on.
- Language modeling (5036); n-gram language models (5356).
- Sparsity problems (5922): many contexts never occur, so their counts are zero.
- Storage problems (h0117).
- How to build a neural language model (h0609); a fixed-window neural language model (h1100).
- Recurrent neural networks (RNNs) (h1250): x1 -> y1, x2 -> y2, ..., with the same weights W applied at every step.
- A simple RNN language model (h1430).

Lecture 6 - Simple and LSTM Recurrent Neural Networks
- The simple RNN language model (0310).
- Training an RNN language model (0818): RNNs take more time to train. Teacher forcing: penalize the model when it doesn't predict the reference next word. But how do we get the answer? A minimal sketch follows these notes.
- Evaluating language models (2447). [ToL] A language model is a system that predicts the next word (3130).
- Other uses of RNNs (3229): tagging words, classification (3420), as a language encoder module (3500), and generating text (3600).
- Problems with vanishing and exploding gradients (3750). [IMPORTANT][ToL] Why this is a problem (4400): we can clip the gradient to a limit.
- Long Short-Term Memory RNNs (LSTMs) (5000). [ToL]
- Bidirectional RNNs (h2000): we also need information from the words that come after.
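A minimal NumPy sketch of the simple RNN language model and teacher forcing described in the Lectures 5-6 notes above: a hidden state is updated with the same weights at every step, and the training loss penalizes the model for not predicting the reference next word. All sizes and variable names are illustrative assumptions, not the course's reference code.

```python
import numpy as np

np.random.seed(0)
n_vocab, d, h = 10, 6, 8            # toy vocab size, embedding dim, hidden dim (made up)
E  = 0.1 * np.random.randn(n_vocab, d)   # word embeddings
Wh = 0.1 * np.random.randn(h, h)         # hidden-to-hidden weights (shared across steps)
We = 0.1 * np.random.randn(h, d)         # input-to-hidden weights
U  = 0.1 * np.random.randn(n_vocab, h)   # hidden-to-vocabulary output weights

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rnn_lm_loss(token_ids):
    """Teacher forcing: feed the true previous word, score the true next word."""
    h_t = np.zeros(h)
    loss = 0.0
    for prev, nxt in zip(token_ids[:-1], token_ids[1:]):
        h_t = np.tanh(Wh @ h_t + We @ E[prev])   # same Wh, We reused at every step
        probs = softmax(U @ h_t)                 # distribution over the next word
        loss += -np.log(probs[nxt])              # cross-entropy against the reference
    return loss / (len(token_ids) - 1)

sentence = [3, 1, 4, 1, 5, 9]       # a toy token-id sequence
avg_loss = rnn_lm_loss(sentence)
print("average per-word loss:", avg_loss)
print("perplexity:", np.exp(avg_loss))   # the evaluation metric from the Lecture 6 notes
```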
Lecture 7 - Translation, Seq2Seq, Attention
- Machine translation (0245). What do you need (1200): a parallel corpus, and then alignment.
- Decoding for SMT (1748): try many possible sequences.
- What is Neural Machine Translation (NMT) (2130): NMT does machine translation with a single end-to-end neural network. The architecture is called a sequence-to-sequence model (aka seq2seq) and it involves RNNs.
- Seq2seq is more than MT (2600)(2732). [ToL]
- Multi-layer RNNs (3323): lower layers capture basic meaning, higher layers capture overall meaning.
- Greedy decoding (4000); exhaustive search decoding (4200); beam search decoding (4400).
- How do we evaluate machine translation (5550): BLEU. NMT is perhaps the biggest success story of NLP deep learning (h0000).
- Attention (h1300).

Lecture 8 - Final Projects; Practical Tips
- Sequence-to-sequence with attention (0235); attention in equations (0800); several attention variants (1500).
- Attention is a general deep learning technique (2240).
- Final project (3000).

Lecture 9 - Self-Attention and Transformers
- Issues with recurrent models (0434): linear interaction distance; sometimes words are too far apart to learn from each other.
- Lack of parallelizability (0723): GPUs compute in parallel, but RNNs cannot exploit that.
- If not recurrence: word-window models aggregate local contexts (1031); attention (1406); self-attention (1638).
- Self-attention as an NLP building block (2222).
- Fixing the first self-attention problem, sequence order (2423): position representation vectors through sinusoids (2624); sinusoidal position representations (2730); position representations learned from scratch (2830).
- Adding nonlinearities in self-attention (2953).
- Barriers and solutions for self-attention as a building block (2945)(3040)(3428).
- The transformer encoder-decoder (3638). [ToL]
- Key, query, value (4000); see the sketch after these notes.
- Multi-headed attention (4322)(4450).
- Residual connections (4723); layer normalization (5045); scaled dot product (5415).
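A minimal NumPy sketch of the scaled dot-product self-attention described in the Lecture 9 notes above: project the inputs to queries, keys, and values, take a softmax over QKᵀ/√d_k, and average the values. Single head, no masking; the sizes and weight names are made-up assumptions for illustration.

```python
import numpy as np

np.random.seed(0)
T, d_model, d_k = 4, 16, 8          # sequence length, model dim, key/query dim (made up)
X  = np.random.randn(T, d_model)    # one toy "sentence" of T token representations

Wq = np.random.randn(d_model, d_k)  # learned projections (random here, just for the sketch)
Wk = np.random.randn(d_model, d_k)
Wv = np.random.randn(d_model, d_k)

def softmax_rows(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# single-head scaled dot-product self-attention
Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d_k)     # (T, T): how strongly each position attends to each other
A = softmax_rows(scores)            # attention weights; each row sums to 1
output = A @ V                      # weighted average of the value vectors

print(A.round(2))                   # the attention pattern
print(output.shape)                 # (T, d_k)
```

A full transformer layer would run several such heads in parallel (multi-headed attention), project back to d_model, and add the residual connection and layer normalization listed above.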
Lecture 10 - Transformers and Pretraining
- Word structure and subword models (0300): transform, transformerify, taaaasty.
- Byte-pair encoding (0659): subword models get at the structure of words, while the byte-pair merges in between are learned statistically rather than from linguistic structure (0943).
- Motivating word meaning and context (1556).
- Pretraining whole models (2000): word2vec doesn't consider context, but an LSTM can; mask some of the data and pretrain the model on it. These models haven't hit overfitting yet, so hold some data out for testing (2811).
- Transformers for encoding and decoding (3030).
- Pretraining through language modeling (3400); stochastic gradient descent and pretrain/finetune (3740).
- Model pretraining comes in three flavors (4021): decoders see only the history, encoders also see the future, and encoder-decoders may be the best of both.
- Decoders (4300); the Generative Pretrained Transformer, GPT (4818); GPT-2 (5400).
- Pretraining encoders (5545): BERT (5654) masks some words and asks "what did I mask?". Bidirectional Encoder Representations from Transformers (h0100). [ToL]
- Limitations of pretrained encoders (h0900); extensions of BERT (h1000).
- Pretraining encoder-decoders (h1200): T5 (h1500). The model doesn't even know how many words were masked. The model learns a lot during pretraining, but it is not always good.
- GPT-3 (h1800).

Lecture 11 - Question Answering
- What is question answering (0414); there are lots of practical applications (0629); beyond textual QA problems (1100).
- Reading comprehension (1223): useful for many practical applications and an important testbed for evaluating how well computer systems understand human language.
- Stanford Question Answering Dataset (1815).
- Neural models for reading comprehension (2428); LSTM-based vs. BERT models (2713).
- BiDAF (3200): encoding (3200), attention (3400), modeling and output layers (4640).
- BERT for reading comprehension (5227); comparisons between BiDAF and BERT models (2734).
- Can we design better pre-training objectives? (h0000)
- Open-domain question answering (h1000); DPR (h1400); DensePhrases demo (h1800).

Lecture 12 - Natural Language Generation [ToL]
- What is neural language generation? (0300) Machine translation, dialogue systems (e.g. Siri), summarization, visual description, creative generation (e.g. stories).
- Components of NLG systems (0845); basics of natural language generation (0916); a look at a single step (1024); then select and train (1115); teacher forcing needs to be learned.
- Decoding (1317): greedy methods (1432); greedy methods get repetitive (1545); why repetition happens (1613); how we can reduce repetition (1824) [ToL]; people don't always choose greedily (1930).
- Time to get random: sampling (2047). Top-k sampling (2100); issues with top-k sampling (2339); top-p (nucleus) sampling (2421); scaling randomness with softmax temperature (2500) [ToL]. A sampling sketch follows these notes.
- Improving decoding: re-balancing distributions (2710); backpropagation-based distribution re-balancing (3027); re-ranking (3300) [ToL]; decoding takeaways (3540).
- Training NLG models (4114): maximum likelihood training (4200); are greedy decoders bad because of how they're trained?; unlikelihood training (4427) [ToL]; exposure bias (4513) [ToL]; exposure bias solutions (4645); REINFORCE basics (4900); reward estimation (5020); REINFORCE's dark side (5300); training takeaways (5423).
- Evaluating NLG systems (5613): types of evaluation methods for text generation (5734); content overlap metrics (5800); a simple failure case (5900); semantic overlap metrics (h0100); model-based metrics (h0120); word distance functions (h0234); beyond word matching (h0350); human evaluations (h0433); issues (h0700); takeaways (h0912).
- Ethical considerations (h1025).
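To illustrate the decoding options in the Lecture 12 notes above, here is a small NumPy sketch of top-k and top-p (nucleus) sampling with a softmax temperature, applied to a made-up next-token distribution. The vocabulary and logits are invented for the example; this shows the general recipe, not any particular library's decoder.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab  = ["the", "cat", "dog", "sat", "ran", "slept"]     # toy vocabulary (made up)
logits = np.array([2.0, 1.5, 1.4, 0.2, 0.1, -1.0])        # made-up next-token scores

def softmax(z, temperature=1.0):
    z = z / temperature            # temperature < 1 sharpens, > 1 flattens the distribution
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sample_top_k(logits, k, temperature=1.0):
    """Keep only the k highest-probability tokens, renormalize, then sample."""
    probs = softmax(logits, temperature)
    top = np.argsort(probs)[-k:]
    p = np.zeros_like(probs)
    p[top] = probs[top]
    return rng.choice(len(probs), p=p / p.sum())

def sample_top_p(logits, p_mass, temperature=1.0):
    """Nucleus sampling: keep the smallest set of tokens whose total mass >= p_mass."""
    probs = softmax(logits, temperature)
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cum, p_mass)) + 1]
    p = np.zeros_like(probs)
    p[keep] = probs[keep]
    return rng.choice(len(probs), p=p / p.sum())

print("top-k sample:", vocab[sample_top_k(logits, k=3)])
print("top-p sample:", vocab[sample_top_p(logits, p_mass=0.9)])
```

Greedy decoding would always return "the" here; the truncated sampling methods trade some of that determinism for diversity, which is why they reduce the repetition problem noted above.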
Lecture 13 - Coreference Resolution
- What is coreference resolution? (0604) Identify all mentions that refer to the same entity in the world.
- Applications (1712); coreference resolution in two steps (1947).
- Mention detection (2049): not quite so simple (2255), e.g. "It is the best donut." vs. "I want to find the best donut."
- Avoiding a traditional pipeline system (2811): end-to-end. [ToL]
- Onto coreference! First, some linguistics (3035): coreference and anaphora; not all anaphoric relations are coreferential (3349).
- Anaphora vs. cataphora (3610): one looks to a reference before it, the other to one after it.
- Taking stock (3801); four kinds of coreference models (4018).
- Traditional pronominal anaphora resolution: Hobbs's naive algorithm (4130); knowledge-based pronominal coreference (4820). Hobbs's method cannot really solve these questions; the model needs to actually understand the sentence.
- Coreference models: mention pair (5624); mention pair at test time (5800); disadvantages (5953).
- Coreference models: mention ranking (h0050).
- Convolutional neural nets (h0341); what is convolution anyway? (h0452) To summarize, pooling is usually used, and max pooling is usually better.
- End-to-end neural coref model (h1206).
- Conclusion (h2017).

Lecture 14 - T5 and Large Language Models (0243)
- T5 with a task prefix (0800): other tasks such as STSB and summarization. T5 changes little from the original transformer (1300).
- What should my pre-training dataset be? (1325) Collected from open web sources and then cleaned, giving C4 (1500).
- Then how to train from the start (1659): pretrain (1805), choose the model (2412). They use the encoder-decoder model, which turns out to work well; they don't change hyperparameters because of the cost.
- Pre-training objective (2629): choosing different training methods and different structures of the data source (2822). A sketch of a T5-style span-corruption example follows these notes.
- Multi-task learning (3443): closing the gap between multi-task training and pre-training followed by separate fine-tuning (3621).
- What if we had four times as much compute as before? (3737) Overview (3840).
- What about all of the other languages? (mT5) (4735): same model, different corpus. XTREME (5000).
- How much knowledge does a language model pick up during pre-training? (5225)
- Salient span masking (5631): instead of masking randomly, mask names, places, dates, etc.
- Do large language models memorize their training data? (h0100) It seems they do; larger models need to see particular examples fewer times in order to memorize them.
- Can we close the gap between large and small models by improving the transformer architecture? (h1010) In these tests they changed parts of the architecture, such as the ReLU; there actually were very few, if any, modifications that improved performance meaningfully (h1700).
- QA (h1915).
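To make the T5 pre-training objective above concrete, here is a small sketch of span corruption: contiguous spans of the input are replaced with sentinel tokens, and the target is the sequence of dropped spans, each introduced by its sentinel. The example sentence and the hand-picked spans are made up, and the `<extra_id_N>` sentinel naming follows the common T5 convention; this mimics the general recipe, not T5's exact preprocessing code.

```python
def span_corrupt(tokens, spans):
    """Replace the given (start, length) spans with sentinels; build the target string.

    Mimics a T5-style objective: the model sees the corrupted input and must
    produce the masked-out spans, each preceded by its sentinel token.
    """
    corrupted, target = [], []
    cursor = 0
    for i, (start, length) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"
        corrupted += tokens[cursor:start] + [sentinel]
        target += [sentinel] + tokens[start:start + length]
        cursor = start + length
    corrupted += tokens[cursor:]
    target += [f"<extra_id_{len(spans)}>"]       # closing sentinel
    return " ".join(corrupted), " ".join(target)

tokens = "Thank you for inviting me to your party last week".split()
spans = [(2, 2), (7, 1)]      # two short spans chosen by hand for the example
inp, tgt = span_corrupt(tokens, spans)
print("input :", inp)   # Thank you <extra_id_0> me to your <extra_id_1> last week
print("target:", tgt)   # <extra_id_0> for inviting <extra_id_1> party <extra_id_2>
```

Salient span masking, noted above, uses the same input/target format but chooses the spans to be named entities and dates rather than random positions.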
Lecture 15 - Add Knowledge to Language Models
- Recap: language models (0232). What does a language model know? (0423) Its output can be right in form but wrong in fact.
- The importance of knowledge-aware language models (0700).
- Querying traditional knowledge bases (0750) vs. querying language models as knowledge bases (0955); comparison and disadvantages (1010).
- Techniques to add knowledge to LMs (130).
- Add pretrained embeddings (1403). Aside: what is entity linking? (1516)
- Method 1: add pretrained entity embeddings (1815). How do we incorporate pretrained entity embeddings from a different embedding space? (2000)
- ERNIE: enhanced language representation with informative entities (2143); strengths and remaining challenges (2610).
- Jointly learning to link entities with KnowBERT (2958).
- Use an external memory (3140): KGLM (3355), local knowledge vs. full knowledge, and when the model should use the external knowledge (3600); comparison to the others (4334).
- More recent takes: nearest-neighbor language models, kNN-LM (4730).
- Modify the training data (5230): WKLM (5458); learning inductive biases through masking (5811); salient span masking (5927).
- Recap (h0053).
- Evaluating knowledge in LMs (h0211): LAMA (h0250); its limitations (h0650); LAMA UnHelpful Names (LAMA-UHN), which removes items that may be answerable from co-occurrence alone; developing better prompts to query knowledge in LMs.
- Knowledge-driven downstream tasks (h1253): relation extraction performance on TACRED (h1400); entity typing performance on Open Entity.
- Recap: evaluating knowledge in LMs (h1600); other exciting progress and what's next (h1652).

Lecture 17 - Model Analysis and Explanation
- Motivation: what are our models doing? (0415) How do we make tomorrow's model? (0515) What biases are built into the model? (0700) How do we make progress over the next 25 years? (0800)
- Model analysis at varying levels of abstraction (0904).
- Model evaluation as model analysis (1117), e.g. in natural language inference (1344). What if the model is simply using heuristics to get good accuracy? (1558)
- Language models as linguistic test subjects (2023).
- Careful test sets as unit test suites: CheckListing (3230).
- Fitting the dataset vs. learning the task (3500).
- Knowledge evaluation as model analysis (3642).
- Input influence: does my model really use long-distance context? (3822)
- Prediction explanations: what in the input led to this output? (4054) Simple saliency maps (4230); explanation by input reduction (4607).
- Analyzing models by breaking them (5106): adding a nonsense sentence at the end changes the prediction, and changing the question also changes the prediction. Are models robust to noise in their input? (5518) It seems not.
- Analysis of "interpretable" architecture components (5719).
- Probing: supervised analysis of neural networks (h0408). The most useful layers tend to be in the middle; the deeper, the more abstract. A probing sketch follows these notes.
- Emergent simple structure in neural networks (h1019); probing: trees are simply recoverable from BERT representations (h1136).
- Final thoughts on probing and correlation studies (h1341): they are not causal studies.
- Recasting model tweaks and ablations as analysis (h1406). Ablation analysis: do we need all these attention heads? (h1445) What's the right layer order for a transformer? (h1537)
- Parting thoughts (h1612).

Lecture 18 - Future of NLP + Deep Learning
- General representation learning recipe (0312): certain properties emerge only when we scale up the model size!
- Large language models and GPT-3 (0358)(0514); what's new about GPT-3.

There are three lectures left; they will be finished in the review when I come back from Lee.
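As a closing illustration of the probing idea from the Lecture 17 notes above: freeze a model's hidden representations and train a simple linear classifier on top of them; if the probe does well, the property is (correlationally, not causally) encoded in those representations. The "hidden states" below are synthetic stand-ins; a real probe would take them from a layer of a trained model such as BERT.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for frozen hidden states: 500 "tokens" with 64-dim vectors,
# where one direction weakly encodes a binary property (e.g. a POS distinction).
n, d = 500, 64
labels = rng.integers(0, 2, size=n)
hidden_states = rng.normal(size=(n, d))
hidden_states[:, 0] += 1.5 * (labels - 0.5)     # the "encoded" property

# The probe: a simple linear classifier trained on the frozen representations.
split = int(0.8 * n)
probe = LogisticRegression(max_iter=1000)
probe.fit(hidden_states[:split], labels[:split])

acc = probe.score(hidden_states[split:], labels[split:])
print(f"probe accuracy: {acc:.2f}")   # well above 50% => property is linearly decodable
```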