(Work in progress...)
The code for this article is public on Kaggle: https://www.kaggle.com/laugoon/hw4-emotion-classification
Task overview
Implement the classifier with an RNN; no additional data may be used (other corpora and pretrained models are forbidden).
Text Sentiment Classification / Emotion Classification
The data consists of tweets collected from Twitter.
In the labeled data, each tweet is annotated as positive or negative:
1: positive
0: negative
| Split | Size |
| --- | --- |
| labeled training data | 200,000 |
| unlabeled training data | 1,200,000 |
| testing data | 200,000 (100,000 public, 100,000 private) |
RNN model:
How sentences are fed into the RNN
1. Build a dictionary that maps every word to an index (dimension).
2. Represent each word in a sentence with a vector (word embedding).
   Common methods for obtaining word embeddings are skip-gram and CBOW (existing packages can be used; there is no need to implement them by hand).
3. Feed the sentence into an RNN, or use bag of words (BOW), to obtain a vector h that represents the sentence.
   The embedding can also be trained together with the rest of the model (controlled by the fix_embedding parameter).
Representing sentences with Bag of Words (BOW):
Grammar and word order are ignored. No RNN is needed; the resulting vector can be fed directly into a DNN.
For example:
| Sentence | BOW |
| --- | --- |
| John likes to watch movies. Mary likes movies too. | [1, 2, 1, 1, 2, 0, 0, 0, 1, 1] |
Here "likes" and "movies" each appear twice, so the dimensions corresponding to these two words have the value 2.
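As a minimal illustration (not part of the homework code), the BOW vector above can be reproduced by counting how often each vocabulary word appears in the sentence; the vocabulary order below is an assumption chosen to match the example:

```python
# Minimal BOW sketch; the vocabulary order is assumed so the counts match the example above.
tokens = ["John", "likes", "to", "watch", "movies", "Mary", "likes", "movies", "too"]
vocab = ["John", "likes", "to", "watch", "movies", "also", "football", "games", "Mary", "too"]

bow = [tokens.count(word) for word in vocab]
print(bow)  # [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
```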
Semi-supervised learning
Make use of the unlabeled data. A common approach is self-training:
label the unlabeled data with the trained model, then add the pseudo-labeled samples to the training set.
A confidence threshold can be applied so that only the more confident data are kept, e.g.:
with pos_threshold = 0.8, a sample is labeled 1 when its prediction > 0.8.
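A minimal self-training sketch under these assumptions (`model`, `unlabeled_loader`, and the thresholds are illustrative names, not the notebook's actual code):

```python
import torch

# Hypothetical self-training step: keep only confident pseudo-labels.
# `model` (a trained sentiment classifier) and `unlabeled_loader` are assumed to exist.
pos_threshold = 0.8
neg_threshold = 0.2

pseudo_x, pseudo_y = [], []
model.eval()
with torch.no_grad():
    for inputs in unlabeled_loader:
        probs = model(inputs).squeeze(1)     # sigmoid outputs in [0, 1]
        for sent, p in zip(inputs, probs):
            if p > pos_threshold:            # confident positive
                pseudo_x.append(sent)
                pseudo_y.append(1)
            elif p < neg_threshold:          # confident negative
                pseudo_x.append(sent)
                pseudo_y.append(0)

# pseudo_x / pseudo_y can then be appended to the labeled training set
# and the model retrained on the enlarged set.
```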
Data format
| File | Format |
| --- | --- |
| labeled data | label +++$+++ text |
| unlabeled data | text (one sentence per line) |
| prediction output | id, label |
The evaluation metric on Kaggle is accuracy.
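Since the prediction file must follow the `id, label` format above, here is a minimal sketch of writing it (the pandas approach, the `predictions` variable, and the file name are assumptions for illustration):

```python
import pandas as pd

# `predictions` is assumed to be a list of 0/1 labels, one per test sentence.
submission = pd.DataFrame({"id": range(len(predictions)), "label": predictions})
submission.to_csv("predict.csv", index=False)
```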
Code walkthrough
NLP task: sentence classification (text classification).
Given a sentence, determine its sentiment (1 for positive, 0 for negative).
Loading the datasets
File locations:
```
/kaggle/input/ml2020spring-hw4/testing_data.txt
/kaggle/input/ml2020spring-hw4/training_nolabel.txt
/kaggle/input/ml2020spring-hw4/training_label.txt
```
1. training_label.txt: training set with labels (sentence plus a 0 or 1 label; +++$+++ is the separator).
e.g., 1 +++$+++ are wtf ... awww thanks !
2. training_nolabel.txt: training set without labels (sentences only), used for semi-supervised learning.
e.g.: hates being this burnt !! ouch
3. testing_data.txt: predict whether each sentence in the testing data is 0 or 1.
The test file has a header row, so the data starts from the second line:
```
id,text
0,my dog ate our dinner . no , seriously ... he ate it .
```
Labeled and unlabeled training sets. The same helper reads both files, dispatching on the file name: the labeled file carries a 0/1 label and the +++$+++ separator, while the unlabeled file is one raw sentence per line.

```python
def load_training_data(path='training_label.txt'):
    if 'training_label' in path:
        # Labeled file: each line is "label +++$+++ text"
        with open(path, 'r') as f:
            lines = f.readlines()
            lines = [line.strip('\n').split(' ') for line in lines]
        x = [line[2:] for line in lines]   # tokens after "label +++$+++"
        y = [line[0] for line in lines]    # the 0/1 label
        return x, y
    else:
        # Unlabeled file: one sentence per line
        with open(path, 'r') as f:
            lines = f.readlines()
            x = [line.strip('\n').split(' ') for line in lines]
        return x

train_x, y = load_training_data('/kaggle/input/ml2020spring-hw4/training_label.txt')
train_x_no_label = load_training_data('/kaggle/input/ml2020spring-hw4/training_nolabel.txt')
```
Test set:
```python
def load_testing_data(path='testing_data.txt'):
    with open(path, 'r') as f:
        lines = f.readlines()
        # Skip the "id,text" header, drop the id column, and re-join text that contained commas
        X = ["".join(line.strip('\n').split(",")[1:]).strip() for line in lines[1:]]
        X = [sen.split(' ') for sen in X]
    return X

test_x = load_testing_data('/kaggle/input/ml2020spring-hw4/testing_data.txt')
```
Word embedding
Train word vectors with word2vec.
(The `if __name__ == '__main__':` guard here is only so the code can later be moved into a module outside the notebook; it does not affect execution.)
In word2vec.Word2Vec, the size parameter (the vector dimensionality) has been renamed to vector_size, and iter (the number of training iterations) has been renamed to epochs. The newer Gensim API renamed these arguments, so the code has to be updated accordingly.
```python
import os
from gensim.models import word2vec

def train_word2vec(x):
    # Skip-gram (sg=1) word vectors, 250 dimensions
    model = word2vec.Word2Vec(x, vector_size=250, window=5, min_count=5,
                              workers=12, epochs=10, sg=1)
    return model

if __name__ == "__main__":
    model = train_word2vec(train_x + test_x)
    # path_prefix is defined elsewhere in the notebook
    model.save(os.path.join(path_prefix, 'w2v_all.model'))
```
Data preprocessing
Load the trained Word2Vec model:
```python
self.embedding = Word2Vec.load(self.w2v_path)
self.embedding_dim = self.embedding.vector_size
```
make_embedding
Note that the `vocab` attribute can no longer be used. Accessing `self.embedding.wv.vocab` now raises:
```
The vocab attribute was removed from KeyedVector in Gensim 4.0.0.
Use KeyedVector's .key_to_index dict, .index_to_key list, and methods
.get_vecattr(key, attr) and .set_vecattr(key, attr, new_val) instead.
```
Old vs. new usage:
```python
rock_idx = model.wv.vocab["rock"].index  # 🚫
rock_cnt = model.wv.vocab["rock"].count  # 🚫
vocab_len = len(model.wv.vocab)          # 🚫

rock_idx = model.wv.key_to_index["rock"]           # 👍
words = list(model.wv.index_to_key)                # 👍
rock_cnt = model.wv.get_vecattr("rock", "count")   # 👍
vocab_len = len(model.wv)                          # 👍
```
Indexing the embedding has also changed. The old code
```python
self.embedding_matrix.append(self.embedding[word])
```
fails with `'Word2Vec' object is not subscriptable`.
Change it to:
```python
self.embedding_matrix.append(self.embedding.wv[word])
```
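Putting these Gensim 4 fixes together, here is a minimal sketch of what the `make_embedding` step can look like (the class skeleton and the `<PAD>`/`<UNK>` handling are assumptions about the rest of the preprocessing code, not the notebook verbatim):

```python
import numpy as np
import torch
from gensim.models import Word2Vec

class Preprocess:
    def __init__(self, w2v_path, sen_len):
        self.w2v_path = w2v_path
        self.sen_len = sen_len          # fixed sentence length after padding / truncating
        self.word2idx = {}
        self.idx2word = []
        self.embedding_matrix = []

    def make_embedding(self):
        # Load the trained Word2Vec model (as in the snippet above)
        self.embedding = Word2Vec.load(self.w2v_path)
        self.embedding_dim = self.embedding.vector_size
        # Gensim 4: iterate over wv.key_to_index instead of wv.vocab
        for word in self.embedding.wv.key_to_index:
            self.word2idx[word] = len(self.word2idx)
            self.idx2word.append(word)
            self.embedding_matrix.append(self.embedding.wv[word])
        self.embedding_matrix = torch.tensor(np.array(self.embedding_matrix))
        # Assumed extra tokens for padding and unknown words, appended as random vectors
        for token in ["<PAD>", "<UNK>"]:
            vector = torch.empty(1, self.embedding_dim).uniform_()
            self.word2idx[token] = len(self.word2idx)
            self.idx2word.append(token)
            self.embedding_matrix = torch.cat([self.embedding_matrix, vector], 0)
        return self.embedding_matrix
```

The returned matrix is what gets passed to the model's embedding layer below.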
Building the RNN model
```python
import torch
from torch import nn

class LSTM_Net(nn.Module):
    def __init__(self, embedding, embedding_dim, hidden_dim, num_layers, dropout=0.5, fix_embedding=True):
        super(LSTM_Net, self).__init__()
        # Build the embedding layer from the pretrained embedding matrix
        self.embedding = torch.nn.Embedding(embedding.size(0), embedding.size(1))
        self.embedding.weight = torch.nn.Parameter(embedding)
        # If fix_embedding is False, the embedding is also trained (fine-tuned) during training
        self.embedding.weight.requires_grad = False if fix_embedding else True
        self.embedding_dim = embedding.size(1)
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.dropout = dropout
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid()
        )

    def forward(self, inputs):
        inputs = self.embedding(inputs)
        x, _ = self.lstm(inputs, None)
        # x has dimension (batch, seq_len, hidden_size)
        # take the LSTM hidden state at the last time step
        x = x[:, -1, :]
        x = self.classifier(x)
        return x
```
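The notebook's full training loop is not reproduced here; the sketch below shows a typical BCELoss + Adam loop for this model that would produce a log like the one that follows (the loader names, hyperparameters, checkpoint file name, and print format are assumptions):

```python
import torch
from torch import nn, optim

def training(model, train_loader, valid_loader, epochs=5, lr=0.001, device="cuda"):
    criterion = nn.BCELoss()                      # binary cross-entropy on the sigmoid outputs
    optimizer = optim.Adam(model.parameters(), lr=lr)
    model = model.to(device)
    best_acc = 0
    for epoch in range(epochs):
        model.train()
        total_loss, total_correct = 0, 0
        for inputs, labels in train_loader:
            inputs = inputs.to(device, dtype=torch.long)
            labels = labels.to(device, dtype=torch.float)
            optimizer.zero_grad()
            outputs = model(inputs).squeeze(1)    # (batch,) probabilities
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            total_correct += ((outputs >= 0.5).float() == labels).sum().item()
        print(f"Epoch {epoch+1} | Train loss {total_loss/len(train_loader):.5f} "
              f"acc {total_correct/len(train_loader.dataset)*100:.3f}")

        model.eval()
        with torch.no_grad():
            valid_correct = 0
            for inputs, labels in valid_loader:
                outputs = model(inputs.to(device, dtype=torch.long)).squeeze(1)
                valid_correct += ((outputs >= 0.5).float().cpu() == labels.float()).sum().item()
        valid_acc = valid_correct / len(valid_loader.dataset) * 100
        print(f"        | Valid acc {valid_acc:.3f}")
        if valid_acc > best_acc:                  # keep the best checkpoint
            best_acc = valid_acc
            torch.save(model, "ckpt.model")
```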
After 5 epochs the accuracy barely improves; it stays around 50%:
```
[ Epoch1: 1407/1407 ] loss:0.694 acc:11.719
Train | Loss:0.69361 Acc: 50.032
Valid | Loss:0.69313 Acc: 50.174
saving model with acc 50.174
-----------------------------------------------
[ Epoch2: 1407/1407 ] loss:0.693 acc:12.500
Train | Loss:0.69320 Acc: 49.872
Valid | Loss:0.69316 Acc: 49.308
-----------------------------------------------
[ Epoch3: 1407/1407 ] loss:0.693 acc:14.062
Train | Loss:0.69319 Acc: 49.780
Valid | Loss:0.69313 Acc: 50.174
-----------------------------------------------
[ Epoch4: 1407/1407 ] loss:0.694 acc:10.156
Train | Loss:0.69317 Acc: 49.765
Valid | Loss:0.69318 Acc: 49.348
-----------------------------------------------
[ Epoch5: 1407/1407 ] loss:0.693 acc:15.625
Train | Loss:0.69317 Acc: 49.942
Valid | Loss:0.69314 Acc: 50.174
-----------------------------------------------
```