[Artificial Intelligence] NLP - sentencepiece

About sentencepiece

https://github.com/google/sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.

SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training.

SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [Sennrich et al.]) and unigram language model [Kudo.]) with the extension of direct training from raw sentences.

SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.

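A character sequence that recurs many times in the corpus is treated as a single token.
The resulting granularity is coarser than that of conventional word segmentation.
Training relies mainly on statistical indicators, such as frequency of occurrence and left/right connectivity, together with perplexity, to arrive at the final result.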
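
The frequency idea above can be illustrated with a toy byte-pair-encoding loop. This is only a minimal sketch of the general BPE technique under simplifying assumptions (plain character sequences, raw pair counts, no scores); it is not SentencePiece's actual implementation, and the function names are made up for illustration.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Return the adjacent symbol pair that occurs most often, or None."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def learn_merges(sentences, num_merges):
    """Greedily merge the most frequent pair, `num_merges` times."""
    sequences = [list(s) for s in sentences]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        merges.append(pair)
        sequences = [merge_pair(seq, pair) for seq in sequences]
    return merges, sequences
```

For example, `learn_merges(["abab", "abc"], 1)` merges the pair `("a", "b")`, because it occurs three times while every other pair occurs once.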

Installation

Option 1: install with pip

$ pip install sentencepiece

Option 2: build from source

$ git clone https://github.com/google/sentencepiece
$ cd sentencepiece
$ ./autogen.sh
$ ./configure; make; sudo make install # Note: build tools such as autogen and automake must be installed first

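SentencePiece has two parts: training a model and using a model.
The training part is implemented in C++ and can be compiled into a binary executable. Training produces a model file and a vocabulary file.

The usage part supports both the binary programs and calls from Python. The vocabulary file produced by training is plain text and editable, so it can also be read and used from any language.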
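
Because the vocabulary is plain text, it can be parsed without the library. The sketch below assumes each line of the `.vocab` file holds a piece and its score separated by a tab; `load_vocab` is a hypothetical helper name, not part of SentencePiece.

```python
def load_vocab(path):
    """Read a .vocab file into a list of (piece, score) tuples.

    Assumes one "piece<TAB>score" entry per line, encoded as UTF-8.
    """
    entries = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip empty lines
            # Split on the last tab so the piece text is kept intact.
            piece, _, score = line.rpartition("\t")
            entries.append((piece, float(score)))
    return entries
```

With the training command used in the next section, the file to pass in would be `/tmp/test.vocab`.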

Training a model

$ spm_train --input=/tmp/a.txt --model_prefix=/tmp/test

$ spm_train --input='../corpus.txt' --model_prefix='../mypiece' --vocab_size=320000 --character_coverage=1 --model_type='bpe'

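Parameters

--input  The text file to train on.
--model_prefix  Prefix for the name of the trained model. Training generates <model_name>.model and <model_name>.vocab (the vocabulary).
--vocab_size  Size of the vocabulary after training, e.g. 8000, 16000, or 32000. The larger the value, the slower the training; if it is too small (<4000), training may fail.
--character_coverage  Fraction of characters covered by the model. The default is 0.9995; set it to 1 for Chinese corpora.
--model_type  Type of model to train: unigram (default), bpe, char, or word.
--max_sentence_length  Maximum sentence length, default 4192. The length appears to be counted in bytes, so a Chinese character, which usually takes 3 bytes in UTF-8, counts as 3.
--max_sentencepiece_length  Maximum length of a single sentence piece, default 16.
--seed_sentencepiece_size  Size of the seed set of sentence pieces, default 1,000,000.
--num_threads  Number of threads, default 16.
--use_all_vocab  Use all tokens as the vocabulary; only applies to word/char models.
--input_sentence_size  Maximum number of sentences the trainer loads, default 0.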
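
The same flags can also be passed from Python as keyword arguments to `spm.SentencePieceTrainer.train`. The sketch below is one possible way to organise that; `training_options`, `utf8_length`, and `train` are hypothetical helper names, and the defaults simply mirror the values listed above.

```python
def training_options(corpus_path, model_prefix, vocab_size=8000,
                     model_type="unigram", character_coverage=0.9995):
    """Collect training flags as keyword arguments (names without '--')."""
    allowed = {"unigram", "bpe", "char", "word"}
    if model_type not in allowed:
        raise ValueError(f"model_type must be one of {sorted(allowed)}")
    return {
        "input": corpus_path,
        "model_prefix": model_prefix,
        "vocab_size": vocab_size,
        "model_type": model_type,
        "character_coverage": character_coverage,
    }


def utf8_length(sentence):
    """Length in bytes, the unit max_sentence_length appears to use."""
    return len(sentence.encode("utf-8"))


def train(options):
    """Run training from Python; requires the sentencepiece package."""
    import sentencepiece as spm  # lazy import so the helpers work without it
    spm.SentencePieceTrainer.train(**options)
```

A call such as `train(training_options("../corpus.txt", "../mypiece", model_type="bpe", character_coverage=1.0))` would then correspond to the second command line above, apart from the vocabulary size. `utf8_length` can be used to check in advance whether a sentence exceeds the 4192 limit.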

Using the model

Command-line usage

$ echo "食材上不会有这样的纠结" | spm_encode --model=/tmp/test.model

Python usage

import sentencepiece as spm

# Load the model produced by spm_train
sp = spm.SentencePieceProcessor()
sp.Load("/tmp/test.model")

text = "食材上不会有这样的纠结"

# Split the text into subword pieces and print them
print(sp.EncodeAsPieces(text))
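
The pieces returned by `EncodeAsPieces` mark whitespace with the meta symbol "▁" (U+2581), so the original text can be recovered by joining the pieces and turning that symbol back into spaces. The helper below is a minimal sketch of that idea in plain Python; `detokenize` is a hypothetical name, and it assumes the default behaviour of prepending one "▁" to the input.

```python
WHITESPACE_MARK = "\u2581"  # the "▁" meta symbol used for spaces


def detokenize(pieces):
    """Join subword pieces back into text.

    Assumes a single leading whitespace mark was added during encoding,
    so one leading space is dropped from the result.
    """
    text = "".join(pieces).replace(WHITESPACE_MARK, " ")
    return text[1:] if text.startswith(" ") else text
```

For instance, the pieces `["▁Hello", "▁Wor", "ld", "."]` join back into `Hello World.`.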

References

烛之文: sentencepiece原理与实践 (sentencepiece: principles and practice)
https://www.jianshu.com/p/d36c3e06fb98

Added: 2022-03-22 20:35:19 Updated: 2022-03-22 20:36:25