Finally, an update! Here we take a first look at natural language processing. As everyone knows, RNNs run riot in the NLP world, but I am not using an RNN here, hahaha, so the next post will be an RNN application. Updates will be a bit slower for a while: I am working hard on joining the Party, plus all sorts of other things, and it all feels a bit chaotic. I actually wrote this post in markdown and imported it, but it was far too large and CSDN could not handle it, so I trimmed a lot of the output; before trimming, it only got as far as building the vocabulary in the data-processing section. The NLP basics up front are included too, and all of this is important material. In the field of artificial intelligence there are 🌹 as well as thorns; I hope I can keep walking this road! See you in the next post, on RNNs!!
I hope everyone will support my tutorials! I also hope my follower count reaches 2k soon so I can apply for creator status. Thank you all!! TensorFlow Series Tutorial. The hard part of NLP applications is turning the input characters into numbers: a neural network, frankly, is just a pile of numbers multiplied by weights plus a bias term.
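To make that last point concrete, here is a minimal sketch (my own toy illustration, not from the tutorial itself) of turning words into numbers that a network can multiply by weights; the vocabulary and sentence are made up:
vocab = {'good': 0, 'bad': 1, 'movie': 2}   # hypothetical toy vocabulary
sentence = ['good', 'movie']
ids = [vocab[w] for w in sentence]          # word -> integer id: [0, 2]
# one-hot vectors the network could feed through a weight matrix
one_hot = [[1 if i == idx else 0 for i in range(len(vocab))] for idx in ids]
print(ids, one_hot)                         # [0, 2] [[1, 0, 0], [0, 0, 1]]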
import tensorflow as tf
tf.__version__
'2.5.0'
tf.test.is_gpu_available()
WARNING:tensorflow:From C:\Users\LENOVO\AppData\Local\Temp/ipykernel_14068/337460670.py:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
True
Project Introduction
Sentiment analysis, also known as opinion mining or polarity analysis, is the process of analyzing, processing, summarizing, and drawing inferences from subjective text that carries emotional coloring.
NLP prerequisite: word segmentation
- Install the jieba library: pip install jieba
- Chinese word segmentation is used as the example here
pip install jieba
Note: you may need to restart the kernel to use updated packages.
The filename, directory name, or volume label syntax is incorrect.
- jieba.cut takes three arguments: the first is the string to be segmented; the cut_all parameter controls whether full mode is used; the HMM parameter controls whether the HMM model is used
import jieba
text = 'jupyter是一名非常优秀的AI作者,人帅又好,爱了爱了!'
word_generator = jieba.cut(text)
print(list(word_generator))
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\LENOVO\AppData\Local\Temp\jieba.cache
Loading model cost 0.618 seconds.
Prefix dict has been built successfully.
['jupyter', '是', '一名', '非常', '优秀', '的', 'AI', '作者', ',', '人帅', '又', '好', ',', '爱', '了', '爱', '了', '!']
print(list(jieba.cut(text,cut_all=True,HMM=False)))
['jupyter', '是', '一名', '非常', '优秀', '的', 'AI', '作者', ',', '人', '帅', '又', '好', ',', '爱', '了', '爱', '了', '!']
print(list(jieba.cut_for_search(text)))
['jupyter', '是', '一名', '非常', '优秀', '的', 'AI', '作者', ',', '人帅', '又', '好', ',', '爱', '了', '爱', '了', '!']
print(jieba.lcut(text))
['jupyter', '是', '一名', '非常', '优秀', '的', 'AI', '作者', ',', '人帅', '又', '好', ',', '爱', '了', '爱', '了', '!']
print(jieba.lcut_for_search(text))
['jupyter', '是', '一名', '非常', '优秀', '的', 'AI', '作者', ',', '人帅', '又', '好', ',', '爱', '了', '爱', '了', '!']
jieba.load_userdict('字典.txt')
'''
Contents of 字典.txt:
AI作者
爱了爱了
'''
print(jieba.lcut(text))
['jupyter', '是', '一名', '非常', '优秀', '的', 'AI作者', ',', '人帅', '又', '好', ',', '爱了爱了', '!']
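As an aside, jieba can also register custom words at runtime with jieba.add_word, which saves maintaining a dictionary file for one-off entries; a quick sketch using the same two words:
jieba.add_word('AI作者')    # register a custom word for this session only
jieba.add_word('爱了爱了')
print(jieba.lcut(text))     # now segments the custom words as single tokens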
I. Datasets
1. The IMDB dataset built into TF
imdb = tf.keras.datasets.imdb
(train_data,train_labels),(test_data,test_labels) = imdb.load_data()
<__array_function__ internals>:5: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
C:\Users\LENOVO\anaconda3\envs\tf\lib\site-packages\tensorflow\python\keras\datasets\imdb.py:155: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
x_train, y_train = np.array(xs[:idx]), np.array(labels[:idx])
C:\Users\LENOVO\anaconda3\envs\tf\lib\site-packages\tensorflow\python\keras\datasets\imdb.py:156: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
x_test, y_test = np.array(xs[idx:]), np.array(labels[idx:])
train_data.shape,train_labels.shape,test_data.shape,test_labels.shape
((25000,), (25000,), (25000,), (25000,))
train_labels[0]
1
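Each review is stored as a list of word indices rather than text. To inspect what a review actually says, you can invert the dataset's word index; note that indices are offset by 3 because 0, 1, and 2 are reserved for padding, start-of-sequence, and unknown tokens (this decoding trick follows the standard Keras IMDB example):
word_index = imdb.get_word_index()
reverse_word_index = {idx + 3: word for word, idx in word_index.items()}
decoded = ' '.join(reverse_word_index.get(i, '?') for i in train_data[0])
print(decoded[:100])   # first 100 characters of the decoded review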
2. Building our own dataset
- Obtain the data and settle on a data format
- Segment the text: English can be split on spaces, and for Chinese see jieba above
- Build a word index table that assigns each word a numeric index (a toy sketch of this step follows the list)
- Convert each passage of text into a vector of word indices
- Convert each passage of text into a word-embedding matrix
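Here is a minimal hand-rolled illustration (my own toy example; the Keras Tokenizer below does this properly) of building a word index table and turning a sentence into an index vector:
sentence = 'this movie is great great fun'.split()
word2idx = {}
for w in sentence:
    word2idx.setdefault(w, len(word2idx) + 1)   # index 0 is reserved for padding
print(word2idx)                          # {'this': 1, 'movie': 2, 'is': 3, 'great': 4, 'fun': 5}
print([word2idx[w] for w in sentence])   # [1, 2, 3, 4, 4, 5]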
import os
import tarfile
import urllib.request
import numpy as np
import re
from random import randint
url = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
file_path = 'data/aclImdb_v1.tar.gz'
if not os.path.exists('data'):
    os.mkdir('data')
if not os.path.isfile(file_path):
    print('downloading')
    result = urllib.request.urlretrieve(url, filename=file_path)
    print('ok', result)
else:
    print(file_path, 'already exists!')
data/aclImdb_v1.tar.gz already exists!
if not os.path.exists('data/aclImdb'):
    tfile = tarfile.open(file_path, 'r:gz')
    print('extracting…')
    result = tfile.extractall('data/')
    print('ok', result)
else:
    print('data/aclImdb already exists')
data/aclImdb already exists
def remove_tags(text):
    # strip HTML tags such as <br /> from the raw reviews
    # (note: the character class must be negated, '<[^>]+>')
    re_tag = re.compile(r'<[^>]+>')
    return re_tag.sub('', text)

def read_files(file_type):
    path = 'data/aclImdb/'
    file_list = []
    positive_file_path = path + file_type + '/pos/'
    for f in os.listdir(positive_file_path):
        file_list.append(positive_file_path + f)
    positive_num = len(file_list)
    negative_file_path = path + file_type + '/neg/'
    for f in os.listdir(negative_file_path):
        file_list.append(negative_file_path + f)
    negative_num = len(file_list) - positive_num
    print('read', file_type, ':', len(file_list))
    print('positive_num', positive_num)
    print('negative_num', negative_num)
    # one-hot labels: [1, 0] = positive, [0, 1] = negative
    labels = [[1, 0]] * positive_num + [[0, 1]] * negative_num
    features = []
    for fi in file_list:
        with open(fi, 'rt', encoding='utf8') as f:
            features += [remove_tags(''.join(f.readlines()))]
    return features, labels
train_x,train_y = read_files('train')
test_x,test_y = read_files('test')
test_y = np.array(test_y)
train_y = np.array(train_y)
read train : 21247
positive_num 8747
negative_num 12500
read test : 25000
positive_num 12500
negative_num 12500
train_x[0]
'Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!'
train_y[0]
array([1, 0])
II. Data Processing
1. Building the vocabulary
token = tf.keras.preprocessing.text.Tokenizer(num_words=4000)
token.fit_on_texts(train_x)
token.document_count
21247
print(token.word_index)
{'the': 1, 'a': 2, 'and': 3, 'of': 4, 'to': 5, 'is': 6, 'br': 7, 'in': 8, 'it': 9, 'i': 10, 'this': 11, 'that': 12, 'was': 13, 'as': 14, 'for': 15, 'movie': 16, 'with': 17, 'but': 18, 'film': 19, 'on': 20, 'not': 21, 'you': 22, 'are': 23, 'his': 24, 'have': 25, 'be': 26, 'he': 27, 'one': 28, 'all': 29, 'at': 30, 'by': 31, 'they': 32, 'an': 33, 'so': 34, 'like': 35, 'who': 36, 'from': 37, 'or': 38, 'just': 39, 'her': 40, 'about': 41, 'if': 42, 'out': 43, … 'kennedy': 5829, 'net': 5830, 'creek': 5831, 'sniper': 5832, 'beowulf': 5833, 'headache': 5834, 'ariel': 5835, 'programs': 5836, 'insightful': 5837, 'gods': 5838, 'leaders': 5839, 'prominent': 5840, 'files': 5841, 'eleven': 5842, 'choosing': 5843, 'refers': 5844, 'evolution': 5845, 'hepburn': 5846, 'uplifting': 5847, 'triangle': 5848, 'lex': 5849, 'garner': 5850, 'accepts': 5851, 'outright': 5852, 'lasts': 5853, 'representation': 5854, 'teaches': 5855, 'spit': 5856, "anyone's": 5857, 'occasions': 5858, 'hats': 5859, 'popping': 5860, 'survives': 5861, 'studies': 5862, 'tossed': 5863, 'landed': 5864, 'terminator': 5865, 'femme': 5866, 'ish': 5867, 'continually': 5868, 'centre': 5869, 'incidentally': 5870, 'dismal': 5871, 'communicate': 5872, 'caricature': 5873, 'coat': 5874, 'chills': 5875, 'trivia': 5876, 'myth': 5877, '200': 5878, 'respective': 5879, 'damaged': 5880, 'marvel': 5881, 'affairs': 5882, "hitler's": 5883, 'motive': 5884, 'transformed': 5885, 'refuse': 5886, 'breakfast': 5887, 'unattractive': 5888, 'claude': 5889, 'underwear': 5890, 'pacific': 5891, 'misfortune': 5892, 'derivative': 5893, …}
print(token.word_docs)
defaultdict(<class 'int'>, {'immediately': 363, 'some': 8229, '35': 79, 'right': 2349, 'recalled': 14, "teachers'": 1, 'believe': 1913, 'me': 6291, 'many': 4228, 'student': 271, 'pomp': 7, 'which': 6380, 'welcome': 171, 'school': 1065, 'who': 9377, 'remind': 128, 'inspector': 97, 'than': 6115, 'is': 19060, 'your': 3729, "isn't": 2242, 'situation': 478, 'through': 3449, 'years': 2977, 'of': 20179, 'financially': 20, 'students': 246, 'tried': 639, 'think': 4629, 'time': 7438, 'pettiness': 2, 'closer': 161, 'knew': 699, 'sack': 40, 'programs': 52, 'profession': 53, 'teaching': 68, 'to': 19978, 'high': 1587, 'the': 21072, 'burn': 106, 'their': 5874, 'episode': 836, 'see': 6790, 'insightful': 54, 'one': 11980, 'ran': 184, 'that': 17054, 'far': 2208, 'here': 3620, "high's": 1, 'expect': 924, 'i': 16439, 'my': 6881, 'repeatedly': 95, 'it': 18166, 'adults': 274, 'as': 13603, 'can': 6656, 'cartoon': 318, 'saw': 2319, 'line': 1398, 'pity': 194, 'satire': 183, 'in': 18691, "i'm": 3223, 'same': 2848, 'much': 6082, 'pathetic': 410, 'bromwell': 4, 'all': 11137, 'when': 7652, 'other': 5584, 'down': 2618, 'a': 20532, 'what': 8249, 'schools': 46, 'at': 11099, 'classic': 1247, 'about': 8957, 'such': 3461, 'comedy': 1960, 'lead': 991, 'whole': 2300, 'scramble': 6, 'teachers': 54, 'reality': 666, 'life': 3762, 'and': 20504, 'survive': 181, 'fetched': 85, 'age': 795, 'photography': 320, "i'd": 1016, …})
print(token.word_counts)
OrderedDict([('bromwell', 8), ('high', 1844), ('is', 90075), ('a', 137721), ('cartoon', 473), ('comedy', 2681), ('it', 67260), ('ran', 191), ('at', 20123), ('the', 283652), ('same', 3488), ('time', 10745), ('as', 39107), ('some', 13483), ('other', 7556), ('programs', 56), ('about', 14798), ('school', 1371), ('life', 5313), ('such', 4403), ('teachers', 59), ('my', 10528), ('35', 80), ('years', 3684), ('in', 78849), ('teaching', 72), ('profession', 57), ('lead', 1105), ('me', 9171), ('to', 115333), ('believe', 2168), ('that', 59452), ("high's", 1), ('satire', 226), ('much', 8375), ('closer', 174), ('reality', 814), ('than', 8513), ('scramble', 6), ('survive', 198), ('financially', 21), ('insightful', 56), ('students', 316), ('who', 17329), ('can', 9386), ('see', 9621), ('right', 2796), ('through', 4307), ('their', 9431), ('pathetic', 441), ("teachers'", 1), ('pomp', 8), ('pettiness', 2), ('of', 122635), ('whole', 2702), ('situation', 530), ('all', 20397), ('remind', 132), ('schools', 53), ('i', 66219), ('knew', 762), ('and', 136984), ('when', 11932), ('saw', 2643), ('episode', 1363), ('which', 10116), ('student', 319), ('repeatedly', 97), ('tried', 703), ('burn', 107), ('down', 3146), ('immediately', 386), ('recalled', 14), ('classic', 1458), ('line', 1613), ('inspector', 146), ("i'm", 4167), ('here', 4749), ('sack', 42), ('one', 22498), ('your', 4963), ('welcome', 179), ('expect', 1018), ('many', 5583), ('adults', 313), ('age', 912), ('think', 6182), ('far', 2591), ('fetched', 90), ('what', 13107), ('pity', 198), ("isn't", 2761), ('liked', 1214), ('film', 32981), ('action', 2839), ('scenes', 4471), ('were', 9332), ('very', 11536), ('interesting', 2666), ('tense', 123), ('well', 8701), …])
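Note that num_words=4000 does not shrink word_index itself (the indices above clearly run past 5000); it only takes effect during conversion, where any word outside the top 4,000 most frequent is silently dropped from the output sequence. A quick way to check this, using a rare word from the counts above:
# 'pettiness' appears only twice in the corpus, so its rank is far beyond
# 4000 and it simply vanishes from the converted sequence
print(token.texts_to_sequences(['this movie is pettiness']))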
2. Converting text to lists of numbers (word-index sequences)
train_sequences = token.texts_to_sequences(train_x)
test_sequences = token.texts_to_sequences(test_x)
3. Padding the number lists to the same length
'''
tf.keras.preprocessing.sequence.pad_sequences(
    train_sequences,    # a two-level nested list of ints or floats
    padding='post',     # 'pre' or 'post': pad zeros at the start or at the end
    truncating='post',  # 'pre' or 'post': truncate from the start or from the end
    maxlen=400)         # None or int: maximum sequence length; longer sequences
                        # are truncated, shorter ones are zero-padded
'''
train_x = tf.keras.preprocessing.sequence.pad_sequences(train_sequences,
                                                        padding='post',
                                                        truncating='post',
                                                        maxlen=400)
test_x = tf.keras.preprocessing.sequence.pad_sequences(test_sequences,
                                                       padding='post',
                                                       truncating='post',
                                                       maxlen=400)
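A quick sanity check of what pad_sequences does, on a toy batch:
demo = tf.keras.preprocessing.sequence.pad_sequences([[1, 2, 3], [4, 5]],
                                                     padding='post',
                                                     truncating='post',
                                                     maxlen=4)
print(demo)   # [[1 2 3 0]
              #  [4 5 0 0]]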
III. Building the Model
model = tf.keras.models.Sequential()
'''
model.add(tf.keras.layers.Embedding(
    output_dim=32,      # dimensionality of the output word vectors
    input_dim=4000,     # vocabulary size (largest word index + 1)
    input_length=400))  # length of each input sequence
'''
model.add(tf.keras.layers.Embedding(output_dim=32,
input_dim=4000,
input_length=400))
model.add(tf.keras.layers.GlobalAveragePooling1D())
model.add(tf.keras.layers.Dense(units=256,activation='relu'))
model.add(tf.keras.layers.Dropout(0.3))
model.add(tf.keras.layers.Dense(units=2,activation='softmax'))
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 400, 32) 128000
_________________________________________________________________
global_average_pooling1d (Gl (None, 32) 0
_________________________________________________________________
dense (Dense) (None, 256) 8448
_________________________________________________________________
dropout (Dropout) (None, 256) 0
_________________________________________________________________
dense_1 (Dense) (None, 2) 514
=================================================================
Total params: 136,962
Trainable params: 136,962
Non-trainable params: 0
_________________________________________________________________
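As a sanity check on these numbers: the embedding layer has 4000 × 32 = 128,000 weights, the first dense layer has 32 × 256 weights plus 256 biases = 8,448, and the output layer has 256 × 2 + 2 = 514, which sums to the 136,962 total shown in the summary.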
IV. Training
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])
history=model.fit(train_x,train_y,validation_split=0.2,epochs=10,batch_size=128,verbose=1)
Epoch 1/10
133/133 [==============================] - 3s 14ms/step - loss: 0.6625 - accuracy: 0.6158 - val_loss: 0.6072 - val_accuracy: 0.6784
Epoch 2/10
133/133 [==============================] - 2s 12ms/step - loss: 0.3943 - accuracy: 0.8412 - val_loss: 0.3679 - val_accuracy: 0.8511
Epoch 3/10
133/133 [==============================] - 2s 12ms/step - loss: 0.2833 - accuracy: 0.8893 - val_loss: 0.3094 - val_accuracy: 0.8779
Epoch 4/10
133/133 [==============================] - 1s 10ms/step - loss: 0.2439 - accuracy: 0.9038 - val_loss: 0.3789 - val_accuracy: 0.8386
Epoch 5/10
133/133 [==============================] - 1s 11ms/step - loss: 0.2217 - accuracy: 0.9148 - val_loss: 0.2759 - val_accuracy: 0.8934
Epoch 6/10
133/133 [==============================] - 1s 11ms/step - loss: 0.2000 - accuracy: 0.9255 - val_loss: 0.3568 - val_accuracy: 0.8640
Epoch 7/10
133/133 [==============================] - 1s 11ms/step - loss: 0.1890 - accuracy: 0.9283 - val_loss: 0.3279 - val_accuracy: 0.8798
Epoch 8/10
133/133 [==============================] - 1s 11ms/step - loss: 0.1769 - accuracy: 0.9347 - val_loss: 0.3767 - val_accuracy: 0.8619
Epoch 9/10
133/133 [==============================] - 1s 10ms/step - loss: 0.1687 - accuracy: 0.9384 - val_loss: 0.3250 - val_accuracy: 0.8882
Epoch 10/10
133/133 [==============================] - 2s 12ms/step - loss: 0.1610 - accuracy: 0.9430 - val_loss: 0.4318 - val_accuracy: 0.8522
import matplotlib.pyplot as plt
def show_train_history(train_history, train_metrics, val_metrics):
    # plot a training metric against its validation counterpart
    plt.plot(train_history[train_metrics])
    plt.plot(train_history[val_metrics])
    plt.title('Train History')
    plt.ylabel(train_metrics)
    plt.xlabel('epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()
show_train_history(history.history,'loss','val_loss')
show_train_history(history.history,'accuracy','val_accuracy')
The validation loss and accuracy keep fluctuating while the training metrics improve steadily, so you can roughly tell the model is starting to overfit.
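A common remedy (not used in this post) is early stopping: halt training once the validation loss stops improving and keep the best weights. A minimal sketch with Keras's built-in callback; the patience value here is just illustrative:
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                              patience=3,
                                              restore_best_weights=True)
# then pass it to fit: model.fit(..., callbacks=[early_stop])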
V. Evaluation and Prediction
model.evaluate(test_x,test_y,verbose=1)
782/782 [==============================] - 2s 3ms/step - loss: 0.3558 - accuracy: 0.8661
[0.35584765672683716, 0.8661199808120728]
pre = model.predict(test_x)
pre[0]
array([0.97998744, 0.0200126 ], dtype=float32)
x = ["This is really a junk movie. Jupyter doesn't like it. Thank you! It's really bad"]
x = token.texts_to_sequences(x)
x = tf.keras.preprocessing.sequence.pad_sequences(x,
                                                  padding='post',
                                                  truncating='post',
                                                  maxlen=400)
x
array([[  11,    6,   62,    2, 2356,   16,  147,   35,    9, 1298,   22,
          44,   62,   71,    0,    0,    0,  …,    0]])
(the remaining entries are all zero padding, out to length 400)
y = model.predict(x)
y
array([[0.44359663, 0.55640334]], dtype=float32)
state = {0:'pos',1:'neg'}
state[np.argmax(y)]
'neg'
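To wrap up, the tokenize → pad → predict steps can be bundled into a small helper (a hypothetical convenience function of my own, not part of the tutorial):
def predict_sentiment(text):
    # reuse the fitted Tokenizer, the trained model, and the state mapping above
    seq = token.texts_to_sequences([text])
    pad = tf.keras.preprocessing.sequence.pad_sequences(seq,
                                                        padding='post',
                                                        truncating='post',
                                                        maxlen=400)
    return state[np.argmax(model.predict(pad))]

print(predict_sentiment("This is really a junk movie. It's really bad"))   # should come out 'neg' for text like this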