BERT Features

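BERT is a bidirectional pretrained language representation model with an encoder only and no decoder, so it can be connected to many kinds of downstream tasks. Its output is just a representation of the text, which is why it does not come with a fixed decoder; each downstream task adds its own output layer.
BERT is a deep neural network, and that depth is what lets it extract many kinds of linguistic features: BERT-base stacks 12 Transformer encoder layers and BERT-large stacks 24. Before BERT, neural networks in NLP typically had only a few layers; it was the Transformer architecture that made it feasible to push NLP networks to dozens or even a hundred-plus layers. The shallow layers mostly capture syntax-level features, while the deeper layers move into semantics.
Dynamic word vectors. In the era of Word2Vec and GloVe, word vectors were static: once training finished, each word's vector was fixed. This limits the model's ability to handle polysemous words. For example, "apple" can refer to the fruit or to Apple the company, so the word vector needs to change with context.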
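
To make the static versus dynamic distinction concrete, here is a toy sketch. The 2-d vectors and the mixing rule are made up for illustration; BERT uses self-attention over learned representations, not this simple averaging. The helper names static_vector and contextual_vector are hypothetical.

```python
# Toy static embedding table (made-up 2-d vectors, for illustration only).
STATIC = {
    "apple": [1.0, 1.0],
    "eat": [0.0, 2.0],
    "stock": [2.0, 0.0],
}


def static_vector(word, sentence):
    """Static lookup: the sentence is ignored, so the vector never changes."""
    return STATIC[word]


def contextual_vector(word, sentence):
    """Toy contextual vector: half the word's own vector, half the mean of its neighbors.

    This is only a stand-in for self-attention, not how BERT computes it.
    """
    own = STATIC[word]
    others = [STATIC[w] for w in sentence if w != word]
    if not others:
        return list(own)
    dim = len(own)
    context_mean = [sum(v[i] for v in others) / len(others) for i in range(dim)]
    return [0.5 * own[i] + 0.5 * context_mean[i] for i in range(dim)]


fruit_sentence = ["eat", "apple"]
company_sentence = ["apple", "stock"]

# Same vector in both sentences.
print(static_vector("apple", fruit_sentence), static_vector("apple", company_sentence))
# Different vectors depending on the neighbors.
print(contextual_vector("apple", fruit_sentence), contextual_vector("apple", company_sentence))
```

The static lookup returns the same vector for "apple" in both sentences, while the toy contextual version shifts toward whichever words surround it.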

BERT in Practice

Environment Setup

Install Anaconda, a Python distribution for data science and machine learning.

Anaconda | The World's Most Popular Data Science Platform

Install PyCharm, which makes it easier to debug code.

PyCharm: the Python IDE for Professional Developers by JetBrains

Set the Python interpreter in PyCharm to the python.exe under the Anaconda installation, so that packages installed later from the Anaconda command line can be found in PyCharm.
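
A quick way to confirm which interpreter a PyCharm run configuration is actually using is to print it from a script. This is a minimal sketch using only the standard library; the helper name interpreter_info is hypothetical.

```python
import sys


def interpreter_info():
    """Return the path and version of the interpreter running this script."""
    return {
        "executable": sys.executable,        # full path to python.exe
        "prefix": sys.prefix,                # root of the environment in use
        "version": sys.version.split()[0],   # e.g. "3.9.7"
    }


if __name__ == "__main__":
    for key, value in interpreter_info().items():
        print(f"{key}: {value}")
```

If the printed executable path does not point into the Anaconda directory, the project is still using a different interpreter.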

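Deep Learning Frameworks

In practice, you will find that code from some tutorials does not run, mainly because the TensorFlow version and the BERT tooling do not match.

Some older releases before TensorFlow 2.0 work with third-party open-source tools such as bert-as-service. Since old libraries are rarely used any more, the focus here is on BERT tooling for TensorFlow 2.0 and later.

There are currently two frameworks to choose from for TensorFlow 2.0: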

Keras（Google）
transformers（Hugging Face）

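Keras development is backed mainly by Google, and it is used together with tensorflow_hub.

transformers is developed by Hugging Face, a company focused on NLP.

The sections below walk through each of the two in practice.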

transformers

Open the Anaconda command line and install the dependency:

pip install transformers

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')

max_length_test = 20
test_sentence = '曝梅西已通知巴萨他想离开'

# add special tokens
test_sentence_with_special_tokens = '[CLS]' + test_sentence + '[SEP]'
tokenized = tokenizer.tokenize(test_sentence_with_special_tokens)
print('tokenized', tokenized)

# convert tokens to their ids in the WordPiece vocabulary
input_ids = tokenizer.convert_tokens_to_ids(tokenized)

# precalculate the pad length so that we can reuse it below
padding_length = max_length_test - len(input_ids)

# attention should focus only on the real (non-padded) tokens;
# build this mask before padding input_ids, while len(input_ids) is still the real length
attention_mask = [1] * len(input_ids)

# do not focus attention on padded tokens
attention_mask = attention_mask + ([0] * padding_length)

# add the pad token id (0) for text shorter than our max length
input_ids = input_ids + ([0] * padding_length)

# token types, needed for example for question answering; for our purpose we just set 0 as we have only one sequence
token_type_ids = [0] * max_length_test
bert_input = {
    "token_ids": input_ids,
    "token_type_ids": token_type_ids,
    "attention_mask": attention_mask
}
print(bert_input)

Output:

tokenized ['[CLS]', '曝', '梅', '西', '已', '通', '知', '巴', '萨', '他', '想', '离', '开', '[SEP]']

{'token_ids': [101, 3284, 3449, 6205, 2347, 6858, 4761, 2349, 5855, 800, 2682, 4895, 2458, 102, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]}
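
The padding logic can be isolated into a small helper, which makes the invariant easy to check: all three lists must have exactly max_length entries, and the attention mask must have one 1 per real token. This is a sketch in plain Python with no tokenizer involved; the function name build_bert_inputs is hypothetical, the ids are the ones printed above, and the key "input_ids" is the name that transformers models expect for the token ids.

```python
def build_bert_inputs(token_ids, max_length, pad_id=0):
    """Pad a list of token ids to max_length and build the matching masks."""
    if len(token_ids) > max_length:
        raise ValueError("sequence is longer than max_length; truncate it first")
    padding_length = max_length - len(token_ids)
    return {
        # real ids followed by pad ids
        "input_ids": list(token_ids) + [pad_id] * padding_length,
        # single-sequence input, so every position is segment 0
        "token_type_ids": [0] * max_length,
        # 1 for real tokens, 0 for padding
        "attention_mask": [1] * len(token_ids) + [0] * padding_length,
    }


ids = [101, 3284, 3449, 6205, 2347, 6858, 4761, 2349, 5855, 800, 2682, 4895, 2458, 102]
features = build_bert_inputs(ids, max_length=20)
print(features["attention_mask"])
```

In everyday use, calling the tokenizer directly, for example tokenizer(test_sentence, padding="max_length", max_length=20, truncation=True), should produce the same three lists, including the [CLS] and [SEP] ids, without any manual padding.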

keras

pip install tensorflow==2.6.2

pip install tensorflow_hub

pip install bert-for-tf2

pip install tensorflow-probability

pip install tf-models-official

pip install tfds-nightly

pip install tensorflow_text==2.6.0

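Notes:

With tensorflow_text 2.7.0, Windows reports ModuleNotFoundError: No module named 'tensorflow_text.core'. Switching to 2.6.0 resolved it, and as a side effect tensorflow was automatically rolled back from the latest 2.7.0 to 2.6.2.

When installing tf-models-official on Windows, the error "Microsoft Visual C++ 14.0 or greater is required" appears. Installing Microsoft Visual C++ resolved it.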
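
Because these problems come down to which versions pip actually installed, it can help to print them before running any model code. This is a minimal sketch using the standard library's importlib.metadata (Python 3.8 or later); the helper name installed_versions is hypothetical, and the package list is only an example.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    wanted = ["tensorflow", "tensorflow-hub", "tensorflow-text", "transformers"]
    for name, version in installed_versions(wanted).items():
        print(f"{name}: {version or 'not installed'}")
```

If tensorflow and tensorflow-text report different minor versions, that mismatch is a likely cause of import errors like the one above.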

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text # A dependency of the preprocessing model

text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
preprocessor = hub.KerasLayer(
    "D:\\wyf\\workspace\\20211029_智能客服\\bert_zh_preprocess_3")
    #"https://tfhub.dev/tensorflow/bert_zh_preprocess/3")
encoder_inputs = preprocessor(text_input)
encoder = hub.KerasLayer(
    "D:\\wyf\\workspace\\20211029_智能客服\\bert_zh_L-12_H-768_A-12_4",
    #"https://tfhub.dev/tensorflow/bert_zh_L-12_H-768_A-12/4"
    trainable=True)
outputs = encoder(encoder_inputs)
pooled_output = outputs["pooled_output"]      # [batch_size, 768].
sequence_output = outputs["sequence_output"]  # [batch_size, seq_length, 768].

#embedding_model = tf.keras.Model(text_input, pooled_output)
embedding_model = tf.keras.Model(text_input, sequence_output)
sentences = tf.constant(["你好漂亮"])
#print(preprocessor(sentences))
print(embedding_model(sentences))
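
sequence_output has one vector per token position, including padded positions. One common way to reduce it to a single sentence vector is to average only the real tokens. The sketch below shows that idea in plain Python on a tiny made-up example (hidden size 2 instead of 768); the function name masked_mean_pool is hypothetical, and with real tensors the same computation would normally be done with TensorFlow ops, using the input_mask that the preprocessing model returns.

```python
def masked_mean_pool(sequence_output, input_mask):
    """Average the token vectors whose mask entry is 1.

    sequence_output: list of token vectors, shape [seq_length][hidden_size]
    input_mask: list of 0/1 flags, shape [seq_length]
    """
    if len(sequence_output) != len(input_mask):
        raise ValueError("sequence_output and input_mask must have the same length")
    kept = [vec for vec, flag in zip(sequence_output, input_mask) if flag == 1]
    if not kept:
        raise ValueError("input_mask selects no tokens")
    hidden_size = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(hidden_size)]


# Two real tokens followed by one padded position (made-up numbers).
toy_sequence_output = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]
print(masked_mean_pool(toy_sequence_output, toy_mask))  # [2.0, 3.0]
```

The padded position is ignored, so its large values do not distort the sentence vector.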