概述

从今天开始我们将开启一段自然语言处理 (NLP) 的旅程. 自然语言处理可以让来处理, 理解, 以及运用人类的语言, 实现机器语言和人类语言之间的沟通桥梁.

在这里插入图片描述

命名实例

命名实例 (Named Entity) 指的是 NLP 任务中具有特定意义的实体, 包括人名, 地名, 机构名, 专有名词等. 举个例子:

在这里插入图片描述

Luke Rawlence 代表人物
Aiimi 和 University of Lincoln 代表组织
Milton Keynes 代表地方

HMM

隐马可夫模型 (Hidden Markov Model) 可以描述一个含有隐含未知参数的马尔可夫过程. 如图:

在这里插入图片描述

随机场

随机场 (Random Field) 包含两个要素: 位置 (Site) 和相空间 (Phase Space). 当给每一个位置中按照某种分布随机赋予空间的一个值后, 其全体就叫做随机场. 举个例子, 位置好比是一亩亩农田, 相空间好比是各种庄稼. 我们可以给不同的地种上不同的庄稼. 这就好比给随机场的每个 “位置”, 赋予空间里不同的值. 随机场就是在哪块地里中什么庄稼.

在这里插入图片描述

马尔科夫随机场

马尔科夫随机场 (Markov Random Field) 是一种特殊的随机场. 任何一块地里的庄稼的种类仅与它邻近的地里中的庄稼的种类有关. 那么这种集合就是一个马尔科夫随机场.

在这里插入图片描述

CRF

条件随机场 (Conditional Random Field) 是给定随机变量 X 条件下, 随机变量 Y 的马尔科夫随机场. CRF 是在给定一组变量的情况下, 求解另一组变量的条件概率的模型, 常用于序列标注问题.

在这里插入图片描述
公式如下:

命名实例实战

数据集

我们将会用到的是一个医疗命名的数据集, 内容如下:

在这里插入图片描述

crf


import tensorflow as tf
import tensorflow.keras.backend as K
import tensorflow.keras.layers as L
from tensorflow_addons.text import crf_log_likelihood, crf_decode


class CRF(L.Layer):
    def __init__(self,
                 output_dim,
                 sparse_target=True,
                 **kwargs):
        """
        Args:
            output_dim (int): the number of labels to tag each temporal input.
            sparse_target (bool): whether the the ground-truth label represented in one-hot.
        Input shape:
            (batch_size, sentence length, output_dim)
        Output shape:
            (batch_size, sentence length, output_dim)
        """
        super(CRF, self).__init__(**kwargs)
        self.output_dim = int(output_dim)
        self.sparse_target = sparse_target
        self.input_spec = L.InputSpec(min_ndim=3)
        self.supports_masking = False
        self.sequence_lengths = None
        self.transitions = None

    def build(self, input_shape):
        assert len(input_shape) == 3
        f_shape = tf.TensorShape(input_shape)
        input_spec = L.InputSpec(min_ndim=3, axes={-1: f_shape[-1]})

        if f_shape[-1] is None:
            raise ValueError('The last dimension of the inputs to `CRF` '
                             'should be defined. Found `None`.')
        if f_shape[-1] != self.output_dim:
            raise ValueError('The last dimension of the input shape must be equal to output'
                             ' shape. Use a linear layer if needed.')
        self.input_spec = input_spec
        self.transitions = self.add_weight(name='transitions',
                                           shape=[self.output_dim, self.output_dim],
                                           initializer='glorot_uniform',
                                           trainable=True)
        self.built = True

    def compute_mask(self, inputs, mask=None):
        # Just pass the received mask from previous layer, to the next layer or
        # manipulate it if this layer changes the shape of the input
        return mask

    def call(self, inputs, sequence_lengths=None, training=None, **kwargs):
        sequences = tf.convert_to_tensor(inputs, dtype=self.dtype)
        if sequence_lengths is not None:
            assert len(sequence_lengths.shape) == 2
            assert tf.convert_to_tensor(sequence_lengths).dtype == 'int32'
            seq_len_shape = tf.convert_to_tensor(sequence_lengths).get_shape().as_list()
            assert seq_len_shape[1] == 1
            self.sequence_lengths = K.flatten(sequence_lengths)
        else:
            self.sequence_lengths = tf.ones(tf.shape(inputs)[0], dtype=tf.int32) * (
                tf.shape(inputs)[1]
            )

        viterbi_sequence, _ = crf_decode(sequences,
                                         self.transitions,
                                         self.sequence_lengths)
        output = K.one_hot(viterbi_sequence, self.output_dim)
        return K.in_train_phase(sequences, output)

    @property
    def loss(self):
        def crf_loss(y_true, y_pred):
            y_pred = tf.convert_to_tensor(y_pred, dtype=self.dtype)
            log_likelihood, self.transitions = crf_log_likelihood(
                y_pred,
                tf.cast(K.argmax(y_true), dtype=tf.int32) if self.sparse_target else y_true,
                self.sequence_lengths,
                transition_params=self.transitions,
            )
            return tf.reduce_mean(-log_likelihood)
        return crf_loss

    @property
    def accuracy(self):
        def viterbi_accuracy(y_true, y_pred):
            # -1e10 to avoid zero at sum(mask)
            mask = K.cast(
                K.all(K.greater(y_pred, -1e10), axis=2), K.floatx())
            shape = tf.shape(y_pred)
            sequence_lengths = tf.ones(shape[0], dtype=tf.int32) * (shape[1])
            y_pred, _ = crf_decode(y_pred, self.transitions, sequence_lengths)
            if self.sparse_target:
                y_true = K.argmax(y_true, 2)
            y_pred = K.cast(y_pred, 'int32')
            y_true = K.cast(y_true, 'int32')
            corrects = K.cast(K.equal(y_true, y_pred), K.floatx())
            return K.sum(corrects * mask) / K.sum(mask)
        return viterbi_accuracy

    def compute_output_shape(self, input_shape):
        tf.TensorShape(input_shape).assert_has_rank(3)
        return input_shape[:2] + (self.output_dim,)

    def get_config(self):
        config = {
            'output_dim': self.output_dim,
            'sparse_target': self.sparse_target,
            'supports_masking': self.supports_masking,
            'transitions': K.eval(self.transitions)
        }
        base_config = super(CRF, self).get_config()
        return dict(base_config, **config)

预处理

import numpy as np
import tensorflow as tf

def build_data():
    """
    获取数据
    :return: 返回数据(词, 标签) / 所有词汇总的字典
    """

    # 存放数据
    datas = []

    # 存放x
    sample_x = []

    # 存放y
    sample_y = []

    # 存放词
    vocabs = {'UNK'}

    # 遍历
    for line in open("data/train.txt", encoding="utf-8"):

        # 拆分
        line = line.rstrip().split('\t')

        # 取出字符
        char = line[0]

        # 如果字符为空, 跳过
        if not char:
            continue

        # 取出字符对应标签
        cate = line[-1]

        # append
        sample_x.append(char)
        sample_y.append(cate)
        vocabs.add(char)

        # 遇到标点代表句子结束
        if char in ['。', '?', '!', '！', '？']:
            datas.append([sample_x, sample_y])

            # 清空
            sample_x = []
            sample_y = []

    # set转换为字典存储出现过的字
    word_dict = {wd: index for index, wd in enumerate(list(vocabs))}

    print("vocab_size:", len(word_dict))


    return datas, word_dict


def modify_data():

    # 获取数据
    datas, word_dict = build_data()
    X, y = zip(*datas)
    print(X[:5])
    print(y[:5])

    # tokenizer
    tokenizer = tf.keras.preprocessing.text.Tokenizer()
    tokenizer.fit_on_texts(word_dict)
    X_train = tokenizer.texts_to_sequences(X)

    # 填充
    X_train = tf.keras.preprocessing.sequence.pad_sequences(X_train, 150)
    print(X_train[:5])

    class_dict = {
        'O': 0,
        'TREATMENT-I': 1,
        'TREATMENT-B': 2,
        'BODY-B': 3,
        'BODY-I': 4,
        'SIGNS-I': 5,
        'SIGNS-B': 6,
        'CHECK-B': 7,
        'CHECK-I': 8,
        'DISEASE-I': 9,
        'DISEASE-B': 10
    }

    # tokenize
    X_train = [[word_dict[char] for char in data[0]] for data in datas]
    y_train = [[class_dict[label] for label in data[1]] for data in datas]
    print(X_train[:5])
    print(y_train[:5])

    # padding
    X_train = tf.keras.preprocessing.sequence.pad_sequences(X_train, 150)
    y_train = tf.keras.preprocessing.sequence.pad_sequences(y_train, 150)
    y_train = np.expand_dims(y_train, 2)


    # ndarray
    X_train = np.asarray(X_train)
    y_train = np.asarray(y_train)
    print(X_train.shape)
    print(y_train.shape)

    return X_train, y_train

if __name__ == '__main__':
    modify_data()

主程序

import tensorflow as tf
from pre_processing import modify_data
from crf import CRF

# 超参数
EPOCHS = 10  # 迭代次数
BATCH_SIZE = 64  # 单词训练样本数目
learning_rate = 0.00003  # 学习率
VOCAB_SIZE = 1759 + 1
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)  # 优化器
loss = tf.keras.losses.CategoricalCrossentropy()  # 损失


def main():

    # 获取数据
    X_train, y_train = modify_data()

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(VOCAB_SIZE, 300),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, dropout=0.5, recurrent_dropout=0.5, return_sequences=True)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, dropout=0.5, recurrent_dropout=0.5, return_sequences=True)),
        tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1)),
        CRF(1, sparse_target=True)
    ])


    # 组合
    model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])

    # summery
    model.build([None, 150])
    print(model.summary())

    # 保存
    checkpoint = tf.keras.callbacks.ModelCheckpoint(
        "../model/model.h5", monitor='val_loss',
        verbose=1, save_best_only=True, mode='min',
        save_weights_only=True
    )

    # 训练
    model.fit(X_train, y_train, validation_split=0.2, epochs=EPOCHS, batch_size=BATCH_SIZE, callbacks=[checkpoint])

if __name__ == '__main__':
    main()

输出结果:

vocab_size: 1759
(['≠≠，', '男', '，', '双', '塔', '山', '人', '，', '主', '因', '咳', '嗽', '、', '少', '痰', '1', '个', '月', '，', '加', '重', '3', '天', '，', '抽', '搐', '1', '次', '于', '2', '0', '1', '6', '年', '1', '2', '月', '0', '8', '日', '0', '7', ':', '0', '0', '以', '1', '、', '肺', '炎', '2', '、', '抽', '搐', '待', '查', '收', '入', '院', '。'], ['性', '疼', '痛', '1', '年', '收', '入', '院', '。'], ['，', '男', '，', '4', '岁', '，', '河', '北', '省', '承', '德', '市', '双', '滦', '区', '陈', '栅', '子', '乡', '陈', '栅', '子', '村', '人', '，', '主', '因', '"', '咳', '嗽', '、', '咳', '痰', '，', '伴', '发', '热', '6', '天', '"', '于', '2', '0', '1', '6', '年', '1', '2', '月', '1', '3', '日', '1', '1', ':', '4', '7', '以', '支', '气', '管', '肺', '炎', '收', '入', '院', '。'], ['2', '年', '膀', '胱', '造', '瘘', '口', '出', '尿', '1', '年', '于', '2', '0', '1', '7', '-', '-', '0', '2', '-', '-', '0', '6', '收', '入', '院', '。'], [';', 'n', 'b', 's', 'p', ';', '郎', '鸿', '雁', '女', '5', '9', '岁', '已', '婚', ' ', '汉', '族', ' ', '河', '北', '承', '德', '双', '滦', '区', '人', '，', '现', '住', '电', '厂', '家', '属', '院', '，', '主', '因', '肩', '颈', '部', '疼', '痛', '1', '0', '余', '年', '，', '加', '重', '2', '个', '月', '于', '2', '0', '1', '6', '-', '0', '1', '-', '1', '8', ' ', '9', ':', '1', '9', '收', '入', '院', '。'])
(['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'SIGNS-B', 'SIGNS-I', 'O', 'SIGNS-B', 'SIGNS-I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'SIGNS-B', 'SIGNS-I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'DISEASE-B', 'DISEASE-I', 'O', 'O', 'SIGNS-B', 'SIGNS-I', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'SIGNS-B', 'SIGNS-I', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'SIGNS-B', 'SIGNS-I', 'O', 'SIGNS-B', 'SIGNS-I', 'O', 'O', 'SIGNS-B', 'SIGNS-I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'DISEASE-B', 'DISEASE-I', 'DISEASE-I', 'DISEASE-I', 'DISEASE-I', 'O', 'O', 'O', 'O'], ['O', 'O', 'BODY-B', 'BODY-I', 'BODY-I', 'BODY-I', 'BODY-I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'BODY-B', 'BODY-I', 'BODY-I', 'SIGNS-B', 'SIGNS-I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'])
[[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0  880 1182  602  698 1530 1630 1457
   602   31  878 1388  124 1211  225  346  456  267 1430  602  542  677
   796  272  602  238 1251  456 1170 1268  577   46  456 1056 1641  456
   577 1430   46  699  853   46 1231   46   46 1152  456 1211  797 1323
   577 1211  238 1251  591 1364 1133  513  282 1232]
 [   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0 1514 1259  709  456 1641 1133  513  282 1232]
 [   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0  602 1182  602 1090  959  602 1155 1708  882  426 1426 1561
   698 1242  908  174 1445 1334  229  174 1445 1334 1199 1457  602   31
   878 1388  124 1211 1388  346  602  216  767  371 1056  272 1268  577
    46  456 1056 1641  456  577 1430  456  796  853  456  456 1090 1231
  1152 1455  669 1322  797 1323 1133  513  282 1232]
 [   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
   577 1641 1584  734 1643 1126  186  896  967  456 1641 1268  577   46
   456 1231   46  577   46 1056 1133  513  282 1232]
 [   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0 1398    7   14   16  103  290 1491 1483 1024 1531  959 1081  559
   845  114 1155 1708  426 1426  698 1242  908 1457  602  583  188 1575
  1379 1337  326  282  602   31  878 1439  885 1520 1259  709  456   46
  1625 1641  602  542  677  577  267 1430 1268  577   46  456 1056   46
   456  456  699 1531  456 1531 1133  513  282 1232]]
[[891, 1203, 604, 702, 1562, 1665, 1486, 604, 11, 889, 1413, 110, 1233, 213, 337, 453, 255, 1457, 604, 542, 681, 803, 260, 604, 226, 1275, 453, 1190, 1292, 579, 26, 453, 1072, 1676, 453, 579, 1457, 26, 703, 864, 26, 1255, 1465, 26, 26, 1172, 453, 1233, 804, 1347, 579, 1233, 226, 1275, 593, 1388, 1153, 512, 270, 1256], [1546, 1283, 713, 453, 1676, 1153, 512, 270, 1256], [604, 1203, 604, 1108, 971, 604, 1175, 1745, 893, 421, 1451, 1594, 702, 1266, 919, 160, 1473, 1358, 217, 160, 1473, 1358, 1221, 1486, 604, 11, 889, 1127, 1413, 110, 1233, 1413, 337, 604, 204, 772, 362, 1072, 260, 1127, 1292, 579, 26, 453, 1072, 1676, 453, 579, 1457, 453, 803, 864, 453, 453, 1465, 1108, 1255, 1172, 1484, 673, 1346, 804, 1347, 1153, 512, 270, 1256], [579, 1676, 1618, 738, 1678, 1145, 173, 907, 979, 453, 1676, 1292, 579, 26, 453, 1255, 1495, 1495, 26, 579, 1495, 1495, 26, 1072, 1153, 512, 270, 1256], [369, 1423, 811, 1730, 986, 369, 88, 278, 1522, 1514, 1039, 1563, 971, 1099, 560, 1234, 855, 100, 1234, 1175, 1745, 421, 1451, 702, 1266, 919, 1486, 604, 585, 175, 1609, 1403, 1361, 317, 270, 604, 11, 889, 1467, 896, 1552, 1283, 713, 453, 26, 1660, 1676, 604, 542, 681, 579, 255, 1457, 1292, 579, 26, 453, 1072, 1495, 26, 453, 1495, 453, 703, 1234, 1563, 1465, 453, 1563, 1153, 512, 270, 1256]]
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 5, 0, 6, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10, 9, 0, 0, 6, 5, 0, 0, 0, 0, 0, 0], [0, 6, 5, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 5, 0, 6, 5, 0, 0, 6, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10, 9, 9, 9, 9, 0, 0, 0, 0], [0, 0, 3, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 4, 4, 6, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
(7836, 150)
(7836, 150, 1)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, None, 300)         528000    
_________________________________________________________________
bidirectional (Bidirectional (None, None, 256)         439296    
_________________________________________________________________
bidirectional_1 (Bidirection (None, None, 128)         164352    
_________________________________________________________________
time_distributed (TimeDistri (None, None, 1)           129       
_________________________________________________________________
crf (CRF)                    (None, None, 1)           1         
=================================================================
Total params: 1,131,778
Trainable params: 1,131,778
Non-trainable params: 0
_________________________________________________________________
None
2021-11-23 00:31:29.846318: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
Epoch 1/10
10/98 [==>...........................] - ETA: 7:52 - loss: 5.2686e-08 - accuracy: 0.9232

在这里插入图片描述