1. Load the BERT model and tokenizer
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "bert-base-cased"  # keep the checkpoint name in its own variable instead of reusing `model`
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForMaskedLM.from_pretrained(model_name)
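Since the checkpoint is loaded with a masked-LM head, a quick fill-mask forward pass sanity-checks the pair. A minimal sketch; the example sentence is arbitrary and torch is assumed to be installed:

import torch

inputs = tokenizer("The patient was admitted to the [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
# locate the [MASK] position and take the highest-scoring token there
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_pos].argmax(dim=-1)))  # e.g. "hospital"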
2. Tokenization demo
- Here we tokenize the words COVID and hospitalization:
print(tokenizer.tokenize('COVID'))
print(tokenizer.tokenize('hospitalization'))
['CO', '##VI', '##D']
['hospital', '##ization']
- To keep these two words intact instead of having them split into subwords, do the following:
new_tokens = ['COVID', 'hospitalization']
num_added_toks = tokenizer.add_tokens(new_tokens)  # returns how many tokens were actually added
model.resize_token_embeddings(len(tokenizer))      # grow the embedding matrix to the new vocab size
print(tokenizer.tokenize('COVID'))
print(tokenizer.tokenize('hospitalization'))
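With the tokens registered, both prints should now show a single whole token:
['COVID']
['hospitalization']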
tokenizer.save_pretrained("model_dir")  # persist the extended tokenizer
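The resized model should be saved alongside the tokenizer so the two stay consistent when reloaded; a minimal sketch reusing the "model_dir" path from above:

model.save_pretrained("model_dir")  # weights now include the enlarged embedding matrix

# reload the pair; the added tokens survive the round trip
tokenizer = AutoTokenizer.from_pretrained("model_dir")
model = AutoModelForMaskedLM.from_pretrained("model_dir")
print(tokenizer.tokenize('COVID'))  # ['COVID']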
3. Customizing the BERT vocabulary
Reference link:
https://zhuanlan.zhihu.com/p/391814780
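A common approach, likely the one in the linked article, is to repurpose the [unusedX] placeholder rows that ship in BERT's vocab.txt, so the embedding matrix never has to be resized. A minimal sketch under that assumption; the directory name is illustrative, and bert-base-cased is assumed to contain an [unused1] slot:

from transformers import BertTokenizer

# save a local, editable copy of the stock vocabulary
BertTokenizer.from_pretrained("bert-base-cased").save_pretrained("bert_dir")

vocab_path = "bert_dir/vocab.txt"
with open(vocab_path, encoding="utf-8") as f:
    vocab = f.read().splitlines()

# overwrite a placeholder slot in place; the row index (and hence the
# embedding row it maps to) is unchanged, so no resize_token_embeddings call
vocab[vocab.index("[unused1]")] = "COVID"

with open(vocab_path, "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")

tokenizer = BertTokenizer.from_pretrained("bert_dir")
print(tokenizer.tokenize("COVID"))  # ['COVID']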