基于Pytorch的卷积神经网络CNN实例应用及详解
一、卷积神经网络CNN定义
卷积神经网络(CNN,有时被称为 ConvNet)是很吸引人的。在短时间内,它们变成了一种颠覆性的技术,打破了从文本、视频到语音等多个领域所有最先进的算法,远远超出了其最初在图像处理的应用范围。CNN 由许多神经网络层组成。卷积和池化这两种不同类型的层通常是交替的。网络中每个滤波器的深度从左到右增加。最后通常由一个或多个全连接的层组成。
二、卷积神经网络CNN的原理
三、卷积神经网络CNN实现的前期准备
四、卷积神经网络CNN实现案例分析
-
案例目的:是构造卷积神经网络模型训练后进行药物靶体交互的二分类(0和1)预测。 -
数据集及格式说明:首先案例用到的所有数据集是来自KIBA 数据集,数据以txt文本数据存储。 -
亲和矩阵数据集Y说明:亲和矩阵的维度是[229,2110],分别表示229种蛋白质和2110种药物分子,矩阵对应的数值表示对应蛋白质和药物分子交互的亲和性,若为“nan”表示无亲和性也就是后续分类为0,其余为1。 -
药物及其smiles序列数据集ligands_iso说明:药物名称序号和对应的结构smiles序列,以字典形式保存,长度是2111(表示有2111种药物)。 -
蛋白质及其sequences序列数据集proteins说明:蛋白质名称序号和对应的结构sequences序列,以字典形式保存,长度为229(表示有229种蛋白质)。 -
训练集train_fold_setting说明:数据维度是[5,19709],也就是分为五组,每组包含19709个数字,其中数字N表示为亲和矩阵数据集Y所有行列组合数目是483190(229*2110)种组合中按顺序排的第N组。 -
测试集test_fold_setting说明:数据维度是[19709],也就是只有一组,该组包含19709个数字,其中数字N表示为亲和矩阵数据集Y所有行列组合数目是483190(229*2110)种组合中按顺序排的第N组。 -
DTI案例CNN模型搭建大致如下(具体实现细节可见后面的代码): -
Label encoding 说明:将药物分子和蛋白质的序列的各种字符用数字字符替换,便于后面的词嵌入操作。 -
Embedding layer 说明:将药物分子和蛋白质的数字序列进行词嵌入操作,也就是增加一维度表示,比如原来的维度是一维只有长度(药物分子初始设置为100,蛋白质初始设置为1000),增加宽度(宽度初始设置都为128)这一维度。注意:卷积后长度宽度初始设置会变化。 -
concatenation(串联)说明:将原本分开的每组药物分子特征和蛋白质序列特征拼接起来,也就是根据宽度相同将两者的长拼接起来,所以拼接的宽度不变,长度两者相加。
五、卷积神经网络CNN实现完整代码和结果
import torch as t
import pickle
import numpy as np
import json
import math
from torch import nn
import torch.nn.functional as F
from sklearn.metrics import accuracy_score
def datatransform(data):
drug = []
protein = []
for i in range(len(data)):
drug.append(math.floor(data[i] / 229))
protein.append((data[i] % 229))
Y = pickle.load(open("E:/data/kiba/Y", "rb"), encoding='latin1')
effective = []
for i in range(len(data)):
d = drug[i]
p = protein[i]
if np.isnan(Y[d][p]):
effective.append(0)
else:
effective.append(1)
effectives = t.LongTensor(effective)
drugs = json.load(open("E:\data\kiba\ligands_iso.txt"))
drug_smiles = []
train_drug_smiles = []
for d in drugs.values():
drug_smiles.append(d)
for td in range(len(drug)):
train_drug_smiles.append(drug_smiles[drug[td]])
proteins = json.load(open("E:\data\kiba\proteins.txt"))
proteins_sequence = []
train_protein_sequences = []
for p in proteins.values():
proteins_sequence.append(p)
for tp in range(len(protein)):
train_protein_sequences.append(proteins_sequence[protein[td]])
CHARISOSMISET = {"#": 29, "%": 30, ")": 31, "(": 1, "+": 32, "-": 33, "/": 34, ".": 2,
"1": 35, "0": 3, "3": 36, "2": 4, "5": 37, "4": 5, "7": 38, "6": 6,
"9": 39, "8": 7, "=": 40, "A": 41, "@": 8, "C": 42, "B": 9, "E": 43,
"D": 10, "G": 44, "F": 11, "I": 45, "H": 12, "K": 46, "M": 47, "L": 13,
"O": 48, "N": 14, "P": 15, "S": 49, "R": 16, "U": 50, "T": 17, "W": 51,
"V": 18, "Y": 52, "[": 53, "Z": 19, "]": 54, "\\": 20, "a": 55, "c": 56,
"b": 21, "e": 57, "d": 22, "g": 58, "f": 23, "i": 59, "h": 24, "m": 60,
"l": 25, "o": 61, "n": 26, "s": 62, "r": 27, "u": 63, "t": 28, "y": 64}
CHARISOSMILEN = 64
drug_label_encoding = []
def label_smiles(smiles):
D = np.zeros(100)
for i, ch in enumerate(smiles):
if i < 100:
D[i] = CHARISOSMISET[ch]
return D.tolist()
for l in range(len(train_drug_smiles)):
label_smile = label_smiles(train_drug_smiles[l])
drug_label_encoding.append(label_smile)
CHARPROTSET = {"A": 1, "C": 2, "B": 3, "E": 4, "D": 5, "G": 6,
"F": 7, "I": 8, "H": 9, "K": 10, "M": 11, "L": 12,
"O": 13, "N": 14, "Q": 15, "P": 16, "S": 17, "R": 18,
"U": 19, "T": 20, "W": 21,
"V": 22, "Y": 23, "X": 24,
"Z": 25}
CHARPROTLEN = 25
protein_label_encoding = []
def label_sequece(sequence):
P = np.zeros(1000)
for i, ch in enumerate(sequence):
if i < 1000:
P[i] = CHARPROTSET[ch]
return P.tolist()
for l in range(len(train_protein_sequences)):
label_protein = label_sequece(train_protein_sequences[l])
protein_label_encoding.append(label_protein)
t.manual_seed(100)
drug_label_encodings = t.LongTensor(drug_label_encoding)
drug_embedding = nn.Embedding(100, 128, padding_idx=0)
drug_embeddings = drug_embedding(drug_label_encodings)
protein_label_encodings = t.LongTensor(protein_label_encoding)
protein_embedding = nn.Embedding(1000, 128, padding_idx=0)
protein_embeddings = protein_embedding(protein_label_encodings)
return drug_embeddings, protein_embeddings, effectives
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.cnn1 = nn.Sequential(
nn.Conv1d(128, 32, 4, bias=False),
nn.ReLU(inplace=True),
nn.Conv1d(32, 64, 6, bias=False),
nn.ReLU(inplace=True),
nn.Conv1d(64, 96, 8, bias=False),
nn.ReLU(inplace=True),
nn.MaxPool1d(2, 1)
)
self.cnn2 = nn.Sequential(
nn.Conv1d(128, 32, 4, bias=False),
nn.ReLU(inplace=True),
nn.Conv1d(32, 64, 8, bias=False),
nn.ReLU(inplace=True),
nn.Conv1d(64, 96, 12, bias=False),
nn.ReLU(inplace=True),
nn.MaxPool1d(2, 1)
)
self.mtp = nn.Sequential(
nn.Linear(101952,1024),
nn.Dropout(0.1),
nn.Linear(1024, 512),
nn.Dropout(0.1),
nn.Linear(512, 2)
)
def forward(self, drug ,protein):
drug = drug.permute(0, 2, 1)
protein = protein.permute(0, 2, 1)
drug_cnn = self.cnn1(drug)
protein_cnn = self.cnn2(protein)
drug_pool = drug_cnn.permute(0, 2, 1)
protein_pool = protein_cnn.permute(0, 2, 1)
drug_target = t.cat((drug_pool, protein_pool), 1)
drug_target = drug_target.view(256,1062*96)
drug_target_mtp = self.mtp(drug_target)
drug_target_mtp = t.FloatTensor(drug_target_mtp)
return drug_target_mtp
train_folds = json.load(open("E:/data/kiba/folds/train_fold_setting.txt"))
loaders = t.utils.data.DataLoader(train_folds[0][:1024], batch_size=256, shuffle=False, drop_last=True)
epoch = 3
model = Net()
optimizer = t.optim.SGD(model.parameters(),lr=0.01)
for i in range(epoch):
loss_sum = 0
for train_datas in loaders:
drug_embeddings, protein_embeddings, effectives = datatransform(train_datas)
output = model(drug_embeddings, protein_embeddings)
loss = F.nll_loss(output,effectives)
optimizer.zero_grad()
loss.backward(retain_graph=True)
optimizer.step()
print("第" + str(i+1) + "轮训练")
test_folds = json.load(open("E:/data/kiba/folds/test_fold_setting.txt"))
loaderstest = t.utils.data.DataLoader(test_folds, batch_size=256, shuffle=False, drop_last=True)
pre_output = []
real_effectives = []
for test_datas in loaderstest:
drug_embeddings, protein_embeddings, effectives = datatransform(test_datas)
effectives = effectives.tolist()
real_effectives = real_effectives + effectives
output = model(drug_embeddings, protein_embeddings)
output = output.tolist()
for i in range(len(test_datas)):
pre_output.append(output[i].index(max(output[i])))
print(pre_output)
print(real_effectives)
print("DTI预测正确率:")
print(accuracy_score(real_effectives, pre_output))
|