I have recently been learning PyTorch and formally migrating over from TensorFlow, since most papers release their source code in PyTorch, and it makes it easier for me to keep up with newer models and algorithms. Without further ado, let's get into today's tutorial; when I have time I will also fill in the TensorFlow posts I still owe.
The dataset used this time is fairly large, so I put it on Baidu cloud for anyone who needs it; the source code is on GitHub: https://github.com/JohnLeek/DeepLearning-study. If you find this post helpful, please give it a like and star the repo.
1. Dataset Overview
The dataset used here is a sentiment-classification dataset of movie reviews (in English). It is moderately sized, which makes it well suited for getting started with NLP tasks. It has two parts: review (the review text) and label (two classes only: 0 for negative, 1 for positive).
Here are two reviews from the dataset; their labels are positive and negative.
2. Dataset Processing
The first step is to look at what the dataset actually contains. Some reviews are very long, some very short, and some contain empty strings or hard-to-parse characters; all of this needs cleaning up. The cleaned text is then converted into integer sequences for an Embedding layer, so the LSTM can learn from it.
2.1 Imports and File I/O
These are the packages we need:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from string import punctuation
from collections import Counter
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader,TensorDataset
Reading the files:
with open('./emotiondata/reviews.txt','r',encoding='utf-8') as file:
text = file.read()
# print(len(text))
with open('./emotiondata/labels.txt','r',encoding='utf-8') as file:
label = file.read()
# print(len(label))
When calling open, it is best to specify utf-8 encoding, to avoid errors about characters that cannot be decoded.
Next, collect all the words in the dataset and clean them up; the labels are converted at the same time, positive to 1 and negative to 0.
# punctuation lets us strip unwanted characters such as */-+%# from the text
clean_text = ''.join([char for char in text if char not in punctuation])
# print(len(clean_text))
clean_text = clean_text.split('\n')
# print(clean_text[0])
label = label.split('\n')
# print(label[:5])
word = [word.lower() for sentence in clean_text for word in sentence.split(' ')]
# print(word[:10])
'''
Collect the unique words, then build two dictionaries:
dict -> word:integer and integer:word
'''
various_words = list(set(word))
various_words.remove('')
print(len(various_words))
int_word = dict(enumerate(various_words,1))
# invert int_word to get word -> integer
word_int = {w:int(i) for i,w in int_word.items()}
# convert labels: positive -> 1, negative -> 0
label_list = np.array([1 if x == 'positive' else 0 for x in label])
This builds two dictionaries: word_int maps each cleaned word to its integer index, and int_word maps each index back to its word. word_int is what we use to encode the reviews for the Embedding layer.
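As a quick sanity check, the two-dictionary construction can be reproduced on a toy sentence (the words below are made up for illustration; sorting is only there to make the output deterministic):

```python
# toy corpus standing in for the cleaned review text
words = ['the', 'movie', 'was', 'great', 'the', 'acting', 'was', 'great']

# same idea as in the post: unique words, then index them starting at 1
# (0 stays reserved for padding later on)
various = sorted(set(words))
int_word = dict(enumerate(various, 1))          # {1: 'acting', 2: 'great', ...}
word_int = {w: i for i, w in int_word.items()}  # {'acting': 1, 'great': 2, ...}

encoded = [word_int[w] for w in words]
print(encoded)  # [4, 3, 5, 2, 4, 1, 5, 2]
decoded = [int_word[i] for i in encoded]
assert decoded == words  # the two dictionaries are exact inverses
```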
2.2 Review Processing
Reviews that are extremely long or extremely short are not helpful for learning, so we simply remove them and keep the ones of moderate length. Each remaining review is then fixed at 200 words: shorter ones are zero-padded, longer ones truncated.
'''
Remove texts that are too long or too short
'''
sentence_len = [len(sentence.split()) for sentence in clean_text]
counts = Counter(sentence_len)
min_sen = min(counts.items())  # (shortest length, its count); tuples compare by length first
max_sen = max(counts.items())  # (longest length, its count)
# print(min_sen, ' ', max_sen)
min_idx = [i for i, length in enumerate(sentence_len) if length == min_sen[0]]
max_idx = [i for i, length in enumerate(sentence_len) if length == max_sen[0]]
# delete both index sets in one call: they index the ORIGINAL array,
# and deleting sequentially would shift positions and drop the wrong rows
new_text2 = np.delete(clean_text, min_idx + max_idx)
# print(len(new_text2))
# print(new_text2[0])
new_label2 = np.delete(label_list, min_idx + max_idx)
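One pitfall with np.delete here: min_idx and max_idx are both computed against the original clean_text, so deleting them in two sequential calls shifts positions after the first delete and can remove the wrong rows (or go out of bounds). It is safer to delete both index sets against the original array in a single call. A minimal illustration:

```python
import numpy as np

arr = np.array(['a', 'b', 'c', 'd', 'e'])
idx_short = [0]   # pretend index 0 is a too-short review
idx_long = [4]    # and index 4 a too-long one

# Wrong: np.delete(np.delete(arr, idx_short), idx_long) would index the
# already-shortened array, where index 4 no longer exists.
# Right: delete both sets of indices against the original array at once.
cleaned = np.delete(arr, idx_short + idx_long)
print(cleaned)  # ['b' 'c' 'd']
```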
# map each word to its integer index
text_ints = []
for sentence in new_text2:
    sample = list()
    for word in sentence.split():
        int_value = word_int[word]
        sample.append(int_value)
    text_ints.append(sample)
# print(text_ints[0])
# print(len(text_ints))
'''
Fix each review at 200 tokens: zero-pad the short ones, truncate the long ones
'''
def rest_text(text, sql_len):
    dataset = np.zeros((len(text), sql_len))
    for index, sentence in enumerate(text):
        if len(sentence) < sql_len:
            dataset[index, :len(sentence)] = sentence
        else:
            dataset[index, :] = sentence[:sql_len]
    return dataset
dataset = rest_text(text_ints,sql_len=200)
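The padding/truncation behaviour is easy to check on toy data (the function is re-defined here so the snippet runs on its own; the toy sequences are made up):

```python
import numpy as np

def rest_text(text, sql_len):
    # zero-pad each integer-encoded review up to sql_len, truncate longer ones
    dataset = np.zeros((len(text), sql_len))
    for index, sentence in enumerate(text):
        if len(sentence) < sql_len:
            dataset[index, :len(sentence)] = sentence
        else:
            dataset[index, :] = sentence[:sql_len]
    return dataset

toy = [[5, 8, 2], [1, 2, 3, 4, 5, 6, 7]]
out = rest_text(toy, sql_len=5)
print(out)
# row 0 is right-padded with zeros: [5, 8, 2, 0, 0]
# row 1 is cut to the first 5 tokens: [1, 2, 3, 4, 5]
```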
2.3 NumPy to Tensor
This completes the data cleaning; now the arrays need to be converted to Tensor format. The step is very simple, and I spell it out separately so that every stage of building the network is easy to follow.
dataset_tensor = torch.from_numpy(dataset)
label_tensor = torch.from_numpy(new_label2)
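One detail worth knowing: torch.from_numpy shares memory with the NumPy array and keeps its dtype, and np.zeros produced float64 data, while nn.Embedding later needs integer (long) indices. That is exactly why the model's forward will call x.long(). A small demonstration:

```python
import numpy as np
import torch

arr = np.zeros((2, 3))       # np.zeros defaults to float64
t = torch.from_numpy(arr)    # shares memory with arr, keeps the dtype
print(t.dtype)               # torch.float64

# nn.Embedding expects integer indices, hence the x.long() cast in forward()
print(t.long().dtype)        # torch.int64
```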
2.4 Splitting the Dataset
I split the data into three parts, train, validation and test, with a ratio of 0.8 / 0.1 / 0.1.
all_sample = len(dataset)
ratio = 0.8
train_size = int(all_sample*ratio)
rest_size = all_sample-train_size
val_size = int(rest_size*0.5)
test_size = int(rest_size*0.5)
train = dataset_tensor[:train_size]
train_label = label_tensor[:train_size]
rest_sample = dataset_tensor[train_size:]
rest_label = label_tensor[train_size:]
val = rest_sample[:val_size]
val_label = rest_label[:val_size]
test = rest_sample[val_size:]
test_label = rest_label[val_size:]
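The slicing above assumes the reviews are not ordered by label (if they were, the tail slices would be skewed toward one class). As an alternative, torch.utils.data.random_split shuffles before splitting; a sketch on toy stand-ins for dataset_tensor and label_tensor:

```python
import torch
from torch.utils.data import TensorDataset, random_split

# toy stand-ins for dataset_tensor / label_tensor
data = torch.arange(100).reshape(100, 1)
labels = torch.zeros(100)
full = TensorDataset(data, labels)

# 80 / 10 / 10 split, drawn at random rather than by position
train_set, val_set, test_set = random_split(full, [80, 10, 10])
print(len(train_set), len(val_set), len(test_set))  # 80 10 10
```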
2.5 **Converting the dataset into the format torch needs** (important)
batch_size = 64
train_dataset = TensorDataset(train,train_label)
val_dataset = TensorDataset(val,val_label)
test_dataset = TensorDataset(test,test_label)
train_loader = DataLoader(train_dataset,batch_size=batch_size,shuffle=True,drop_last=True)
val_loader = DataLoader(val_dataset,batch_size=batch_size,shuffle=True,drop_last=True)
test_loader = DataLoader(test_dataset,batch_size=batch_size,shuffle=True,drop_last=True)
When building the loaders we specify batch_size; shuffle=True makes torch automatically reshuffle the data each epoch, which helps avoid getting stuck in poor local optima, and TensorDataset pairs each data sample with its label.
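What the loader actually yields can be seen on a small toy TensorDataset (the sizes here are made up): each iteration produces a (data, label) pair already grouped into batches.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

data = torch.randn(10, 4)            # 10 samples, 4 features each
labels = torch.randint(0, 2, (10,))  # 10 binary labels
loader = DataLoader(TensorDataset(data, labels), batch_size=4,
                    shuffle=True, drop_last=True)

for batch_data, batch_labels in loader:
    print(batch_data.shape, batch_labels.shape)  # torch.Size([4, 4]) torch.Size([4])
# drop_last=True discards the final incomplete batch of 2 samples,
# so only 2 full batches are produced
```

drop_last=True matters here because the model's hidden state is initialised with a fixed batch_size; a smaller final batch would cause a shape mismatch.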
3. Building the Model
class MyModel(nn.Module):
    def __init__(self, input_size, embedding_dim, hidden_size, output_size, num_layers, dropout=0.5):
        super(MyModel, self).__init__()
        self.input_size = input_size
        self.embedding_dim = embedding_dim
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.num_layers = num_layers
        self.embedding = nn.Embedding(input_size, embedding_dim)
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size,
                            num_layers=num_layers, batch_first=True, dropout=dropout)
        self.linear = nn.Linear(hidden_size, output_size)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x, hs):
        b_size = x.size(0)
        x = self.embedding(x.long())          # embedding needs long indices
        x, hs = self.lstm(x, hs)              # [batch_size, seq_len, hidden_dim]
        x = x.reshape(-1, self.hidden_size)   # [batch_size*seq_len, hidden_dim]
        x = self.linear(x)                    # [batch_size*seq_len, output_size]
        x = self.sigmoid(x)
        x = x.reshape(b_size, -1)             # [batch_size, seq_len]
        x = x[:, -1]                          # probability at the last time step
        return x, hs

    def init_hidden(self, batch_size):
        # the training loop expects (h0, c0), each [num_layers, batch_size, hidden_size],
        # created on the same device as the model's parameters
        weight = next(self.parameters()).data
        h0 = weight.new_zeros(self.num_layers, batch_size, self.hidden_size)
        c0 = weight.new_zeros(self.num_layers, batch_size, self.hidden_size)
        return (h0, c0)
A few things to note. For the Embedding layer, input_size is the vocabulary size (it is safe to make it a bit larger than the actual vocabulary) and embedding_dim is the dimensionality of the learned word vectors. In the LSTM, input_size must equal embedding_dim, hidden_size is the number of hidden units, num_layers is the number of stacked LSTM layers, and batch_first=True makes the input and output tensors use the [batch, seq, feature] layout; for details see my other post: https://blog.csdn.net/JohnLeeK/article/details/118324677.
Now look at forward in detail. It first records b_size (the batch size), then casts x to long, because nn.Embedding requires integer (LongTensor) indices while our tensor arrived as float64. The LSTM returns the full output sequence plus the hidden and cell states; here we use the output, of shape [batch_size, seq_len, hidden_dim]. Before the linear layer we reshape it to [batch_size*seq_len, hidden_dim]; since the linear layer has shape [hidden_dim, 1], the matrix product gives [batch_size*seq_len, 1]. After the sigmoid we reshape back to [batch_size, seq_len] and take the last time step, which is the per-review probability we want.
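The shape bookkeeping in forward can be verified in isolation with random indices (all the sizes below are made up for illustration):

```python
import torch
import torch.nn as nn

batch_size, seq_len = 2, 5
vocab, emb_dim, hidden = 50, 8, 16

emb = nn.Embedding(vocab, emb_dim)
lstm = nn.LSTM(input_size=emb_dim, hidden_size=hidden, batch_first=True)
linear = nn.Linear(hidden, 1)

x = torch.randint(0, vocab, (batch_size, seq_len))
x = emb(x)                            # [2, 5, 8]
x, _ = lstm(x)                        # [2, 5, 16]
x = x.reshape(-1, hidden)             # [10, 16]
x = torch.sigmoid(linear(x))          # [10, 1]
x = x.reshape(batch_size, -1)[:, -1]  # [2]: one probability per review
print(x.shape)  # torch.Size([2])
```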
One more tip, an observation I have gathered from reading many other posts: when building a network, define as few intermediate variables as possible, since each one holds on to GPU memory. In forward I basically reuse a single x to carry the result. Favouring in-place style updates such as +=, -=, *= can also noticeably reduce GPU memory consumption.
4. Training the Network
4.1 Hyperparameters
input_size = len(word_int)+1
output_size = 1
embedding_dim = 400
hidden_size = 128
num_layers = 2
num_epochs = 5
4.2 Loss Function and Optimizer
model = MyModel(input_size=input_size, embedding_dim=embedding_dim, hidden_size=hidden_size,
                num_layers=num_layers, output_size=output_size)
print(model)
device = 'cuda' if torch.cuda.is_available() else 'cpu'  # use the GPU if you have one, otherwise... CPU
model = model.to(device)
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
We use binary cross-entropy as the loss and Adam as the optimizer. Note that the model must be instantiated and moved to the device before the optimizer is created, since Adam needs model.parameters(); readers without a GPU will simply fall back to the CPU.
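nn.BCELoss expects probabilities in [0, 1] (hence the sigmoid at the end of the model) and float targets, which is why the training loop below calls target.float(). A quick check against the formula, on made-up numbers:

```python
import torch
import torch.nn as nn

criterion = nn.BCELoss()
probs = torch.tensor([0.9, 0.2, 0.7])    # model outputs after the sigmoid
targets = torch.tensor([1.0, 0.0, 1.0])  # labels must be float for BCELoss

loss = criterion(probs, targets)

# BCE = -mean(y*log(p) + (1-y)*log(1-p)); terms here: log(0.9), log(0.8), log(0.7)
manual = -(torch.log(torch.tensor(0.9)) + torch.log(torch.tensor(0.8))
           + torch.log(torch.tensor(0.7))) / 3
print(loss.item())
```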
4.3 Training Loop and Code
The training loop follows a very standard template; write it a few times and it becomes second nature. The comments should make it easy to follow.
def train(model, device, data_loader, criterion, optimizer, num_epochs, val_loader):
    for epoch in range(num_epochs):
        hs = model.init_hidden(batch_size)
        train_loss = []
        train_correct = 0.0
        model.train()
        for data, target in data_loader:
            data = data.to(device)                    # move the batch to the device
            target = target.to(device)
            optimizer.zero_grad()                     # reset the gradients
            output, hs = model(data, hs)              # forward pass
            hs = tuple([h.data for h in hs])          # detach the hidden state from the graph
            loss = criterion(output, target.float())  # compute the loss
            train_loss.append(loss.item())            # accumulate the loss
            loss.backward()                           # backpropagation
            optimizer.step()                          # update the parameters
            train_correct += torch.sum(torch.round(output) == target)  # rounded predictions vs labels
        # validation
        model.eval()
        hs = model.init_hidden(batch_size)
        val_loss = []
        val_correct = 0.0
        with torch.no_grad():
            for data, target in val_loader:
                data = data.to(device)
                target = target.to(device)
                preds, hs = model(data, hs)
                hs = tuple([h.data for h in hs])
                loss = criterion(preds, target.float())
                val_loss.append(loss.item())
                val_correct += torch.sum(torch.round(preds) == target)
        print(
            f'Epoch {epoch}/{num_epochs} --- train loss {np.round(np.mean(train_loss), 5)} --- val loss {np.round(np.mean(val_loss), 5)}')
The test template:
def test(model, data_loader, device, criterion):
    test_losses = []
    num_correct = 0
    hs = model.init_hidden(batch_size)  # initialise the hidden state
    model.eval()
    with torch.no_grad():
        for i, dataset in enumerate(data_loader):
            data = dataset[0].to(device)              # move the batch to the device
            target = dataset[1].to(device)
            output, hs = model(data, hs)              # forward pass
            hs = tuple([h.data for h in hs])          # detach the hidden state
            loss = criterion(output, target.float())  # compute the loss
            pred = torch.round(output)                # round predictions to 0 or 1
            test_losses.append(loss.item())           # save the loss
            correct_tensor = pred.eq(target.float().view_as(pred))  # tensor of True / False
            correct = correct_tensor.cpu().numpy()
            result = np.sum(correct)
            num_correct += result
            print(f'Batch {i}')
            print(f'loss : {np.round(loss.item(), 3)}')
            print(f'accuracy : {np.round(result / len(data), 3) * 100} %')
            print()
    print("overall test loss : {:.2f}".format(np.mean(test_losses)))
    print("overall test accuracy : {:.2f}".format(num_correct / len(data_loader.dataset)))
Calling the training and test functions:
def main():
    train(model, device, train_loader, criterion, optimizer, num_epochs, val_loader)
    test(model, test_loader, device, criterion)

if __name__ == '__main__':
    main()