
[Artificial Intelligence] huggingface.transformers task summary

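诸神缄默不语 - index of my CSDN blog posts

This post is part of my study notes on the complete huggingface.transformers documentation.
Full index: huggingface transformers package: documentation study notes (continuously updated…)

Documentation page for this part: https://huggingface.co/docs/transformers/master/en/task_summary
This part shows how to solve some common NLP tasks with the transformers package. For details on the AutoModel classes used here, see their documentation, or my earlier notes on the transformers documentation, where I covered some related usage and sample code.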

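A model should be loaded from a checkpoint that was trained for the task at hand in order to perform well on it. (If you load a checkpoint that has not been fine-tuned on the specific task, only the base transformer layers are loaded. The additional head that the task needs is missing, so its weights are randomly initialized and the output is random.)
These checkpoints are usually pre-trained on a large corpus and then fine-tuned on a specific task. This means: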

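  1. Not every model has been fine-tuned on every task. To fine-tune a model on a particular task, see the code in the examples folder of the official transformers GitHub project: transformers/examples at master · huggingface/transformers. Notes on that code are on my to-do list (see my post on upcoming blog plans) and I will write them later, so this post does not cover it.
  2. A checkpoint is fine-tuned on a specific dataset, which may not cover your own use case or domain. As noted above, you can fine-tune it further yourself.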

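To run inference on a given task directly, you can use these mechanisms:

  1. Pipelines: an easy-to-use abstraction that needs as little as two lines of code. For more detailed notes on pipelines, see section 1 of my earlier post: huggingface.transformers crash-course notes: Pipeline inference and AutoClass.
  2. Using the model directly: less abstraction, but more flexible and more powerful, with direct access to the tokenizer and the full inference capability. (The original wording is "a direct access to a tokenizer"; I have not figured out exactly what that means.)

Both approaches are shown below: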

1. Sequence Classification

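Sequence classification assigns a sequence to one of a given number of classes; the GLUE dataset is an example. To fine-tune on GLUE, see run_glue.py and run_xnli.py.

An example of sentiment classification with a pipeline. It uses a model fine-tuned on sst2 (a GLUE task) and returns a label ("POSITIVE" or "NEGATIVE") and a score: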

from transformers import pipeline

classifier = pipeline("sentiment-analysis")

result = classifier("I hate you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

result = classifier("I love you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

Output:
label: NEGATIVE, with score: 0.9991
label: POSITIVE, with score: 0.9999

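An example of using AutoClass to decide whether two sentences are paraphrases of each other:

  1. Instantiate the tokenizer and the model from the checkpoint name. The model architecture is BERT, and the weights stored in the checkpoint are loaded.
  2. Build a sequence from the two sentences, with the correct model-specific separators, token type ids and attention masks (all generated automatically by the tokenizer).
  3. Pass this sequence through the model so that it is classified as a paraphrase or not.
  4. Apply softmax to the output to get the probability of each class.
  5. Print the results.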
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

classes = ["not paraphrase", "is paraphrase"]

sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"

# The tokenizer automatically adds the model-specific special tokens (for BERT, [CLS] and [SEP]) to
# the sequence and computes the attention masks.
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")

paraphrase_classification_logits = model(**paraphrase).logits
not_paraphrase_classification_logits = model(**not_paraphrase).logits

paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]
not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]

# Should be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")

# Should not be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")

Output:
not paraphrase: 10%
is paraphrase: 90%

not paraphrase: 94%
is paraphrase: 6%
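
Step 4 above (softmax over the logits) is easy to check by hand. Here is a minimal sketch in plain Python, with no model involved. The logits are hypothetical values I picked for illustration, not real output of bert-base-cased-finetuned-mrpc.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing and does not change the result.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["not paraphrase", "is paraphrase"]
# Hypothetical logits for one sentence pair (illustration only).
example_logits = [-1.0, 1.2]

probs = softmax(example_logits)
for name, p in zip(classes, probs):
    print(f"{name}: {round(p * 100)}%")
```

The class with the larger logit always gets the larger probability, so softmax only changes the scale of the scores, not their order.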

2. Extractive Question Answering

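Extractive question answering extracts the answer to a given question from a context (a passage of text); the SQuAD dataset [1] is an example. To fine-tune on SQuAD, see run_qa.py.

An example with a pipeline. It uses a model fine-tuned on SQuAD and returns the answer extracted from the context, a confidence score, and the start and end values that locate the answer in the context: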

from transformers import pipeline

question_answerer = pipeline("question-answering")

context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""

result = question_answerer(question="What is extractive question answering?", context=context)
print(
    f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}"
)

result = question_answerer(question="What is a good example of a question answering dataset?", context=context)
print(
    f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}"
)

Output:
Answer: 'the task of extracting an answer from a text given a question', score: 0.6177, start: 34, end: 95
Answer: 'SQuAD dataset', score: 0.5152, start: 147, end: 160
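
The start and end values are character offsets into the context string, so slicing the context with them gives back the answer. A small check in plain Python, using the offsets printed above. The context here is my shortened copy of the one in the example (only the first two sentences), which is enough because both answers fall inside it.

```python
# Shortened copy of the context above; the leading newline matters for the offsets.
context = (
    "\n"
    "Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a\n"
    "question answering dataset is the SQuAD dataset, which is entirely based on that task.\n"
)

# Answers and offsets copied from the pipeline output above.
results = [
    {"answer": "the task of extracting an answer from a text given a question", "start": 34, "end": 95},
    {"answer": "SQuAD dataset", "start": 147, "end": 160},
]

for result in results:
    span = context[result["start"]:result["end"]]
    print(repr(span), span == result["answer"])
```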

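An example with AutoClass:

  1. Instantiate the tokenizer and the model from the checkpoint name. The model architecture is BERT, and the weights stored in the checkpoint are loaded.
  2. Define a context and a few questions.
  3. Iterate over the questions and build a sequence from the current question and the context, with the correct model-specific separators, token type ids and attention masks.
  4. Pass this sequence through the model. It outputs a score for every token in the sequence, for both the start position and the end position of the answer.
  5. Apply softmax to the output to get a probability for each token.
  6. Take the tokens between the identified start and end positions and convert them to a string.
  7. Print the results.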
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

text = r"""
🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""

questions = [
    "How many pretrained models are available in 🤗 Transformers?",
    "What does 🤗 Transformers provide?",
    "🤗 Transformers provides interoperability between which frameworks?",
]

for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]

    outputs = model(**inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits

    # Get the most likely beginning of answer with the argmax of the score
    answer_start = torch.argmax(answer_start_scores)
    # Get the most likely end of answer with the argmax of the score
    answer_end = torch.argmax(answer_end_scores) + 1

    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
    )

    print(f"Question: {question}")
    print(f"Answer: {answer}")

Output:
Question: How many pretrained models are available in 🤗 Transformers?
Answer: over 32 +
Question: What does 🤗 Transformers provide?
Answer: general - purpose architectures
Question: 🤗 Transformers provides interoperability between which frameworks?
Answer: tensorflow 2. 0 and pytorch
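
The span selection in the loop above (argmax of the start scores, argmax of the end scores plus one) can be sketched without a model. The tokens and scores below are hypothetical values made up for illustration; a real model produces one start score and one end score per token in the same way.

```python
def argmax(scores):
    """Index of the largest value in a list."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical tokenized input: question, separator, context.
tokens = ["[CLS]", "who", "makes", "transformers", "?", "[SEP]",
          "hugging", "face", "makes", "transformers", ".", "[SEP]"]
# Hypothetical per-token scores (illustration only).
start_scores = [0.1, 0.0, 0.2, 0.1, 0.0, 0.1, 4.2, 1.3, 0.3, 0.5, 0.0, 0.1]
end_scores = [0.1, 0.0, 0.1, 0.2, 0.0, 0.1, 0.9, 3.8, 0.2, 1.1, 0.1, 0.1]

answer_start = argmax(start_scores)
# +1 because Python slices exclude the end index.
answer_end = argmax(end_scores) + 1

answer = " ".join(tokens[answer_start:answer_end])
print(answer)
```

Note that the two argmax calls are independent, so nothing in this simple scheme prevents the end from landing before the start; in that case the slice is simply empty.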

3. Language Modeling

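Language modeling is the task of fitting a model to a corpus, usually a domain-specific one. That may sound abstract, so I suggest looking at the examples later in this section to get an intuitive feel for it.
All popular transformer-based models are trained with some variant of language modeling: BERT uses masked language modeling and GPT-2 uses causal language modeling.
Language modeling is also useful outside pre-training, for example to shift the model distribution toward a specific domain: take a model pre-trained on a large corpus and fine-tune it on a new dataset, such as research papers (lysandre/arxiv-nlp · Hugging Face).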

3.1 Masked Language Modeling

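MLM masks some tokens in a sequence with a masking token and trains the model to fill each mask with an appropriate token. This lets the model attend to both the right context (the tokens to the right of the mask) and the left context (the tokens to the left of the mask). Such a training setup provides a strong basis for downstream tasks that need bi-directional context, such as SQuAD [1].
To fine-tune on the MLM task, see run_mlm.py.

An example with a pipeline. It outputs the sequence with the mask filled in, the confidence score, and the token used to fill the mask together with its token ID in the tokenizer vocabulary: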

from transformers import pipeline

unmasker = pipeline("fill-mask")

from pprint import pprint

pprint(
    unmasker(
        f"HuggingFace is creating a {unmasker.tokenizer.mask_token} that the community uses to solve NLP tasks."
    )
)

Output:

[{'score': 0.1793,
  'sequence': 'HuggingFace is creating a tool that the community uses to solve '
              'NLP tasks.',
  'token': 3944,
  'token_str': ' tool'},
 {'score': 0.1135,
  'sequence': 'HuggingFace is creating a framework that the community uses to '
              'solve NLP tasks.',
  'token': 7208,
  'token_str': ' framework'},
 {'score': 0.0524,
  'sequence': 'HuggingFace is creating a library that the community uses to '
              'solve NLP tasks.',
  'token': 5560,
  'token_str': ' library'},
 {'score': 0.0349,
  'sequence': 'HuggingFace is creating a database that the community uses to '
              'solve NLP tasks.',
  'token': 8503,
  'token_str': ' database'},
 {'score': 0.0286,
  'sequence': 'HuggingFace is creating a prototype that the community uses to '
              'solve NLP tasks.',
  'token': 17715,
  'token_str': ' prototype'}]

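An example with AutoClass:

  1. Instantiate the tokenizer and the model from the checkpoint name. The model architecture is DistilBERT, and the weights stored in the checkpoint are loaded.
  2. Define a sequence containing one masked token: tokenizer.mask_token takes the place of a word (I think "word" here really means a token). tokenizer.mask_token is a string variable, and wrapping it in curly braces inside an f-string substitutes its value [2].
  3. Encode the sequence into a list of token IDs and find the position of the masked token in that list.
  4. Retrieve the predictions at the index of the mask token: this tensor has the same size as the vocabulary, and its values are the scores assigned to each token. Tokens that the model considers more likely in this context get higher scores.
  5. Retrieve the 5 highest-scoring tokens with PyTorch's topk method.
  6. Replace the mask token with each of these tokens and print the results.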
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-cased")

sequence = (
    "Distilled models are smaller than the models they mimic. Using them instead of the large "
    f"versions would help {tokenizer.mask_token} our carbon footprint."
)

inputs = tokenizer(sequence, return_tensors="pt")
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]

token_logits = model(**inputs).logits
mask_token_logits = token_logits[0, mask_token_index, :]

top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
# indices of the 5 highest-scoring tokens
# Note that topk returns its results sorted by value by default; see https://pytorch.org/docs/stable/generated/torch.topk.html

for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))
    # decode the token to text and substitute it for tokenizer.mask_token in the original sequence

Output:

Distilled models are smaller than the models they mimic. Using them instead of the large versions would help reduce our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help increase our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help decrease our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help offset our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help improve our carbon footprint.
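
Steps 5 and 6 (take the k best-scoring tokens, then substitute each one for the mask) can be sketched in plain Python. The vocabulary and scores below are hypothetical: I reused the five words from the output above and invented the numbers, and `heapq.nlargest` stands in for `torch.topk`.

```python
import heapq

MASK = "[MASK]"
template = f"Using them instead of the large versions would help {MASK} our carbon footprint."

# Hypothetical scores at the mask position for a tiny vocabulary (illustration only).
vocab_scores = {
    "reduce": 9.1,
    "increase": 8.7,
    "decrease": 8.5,
    "offset": 7.9,
    "improve": 7.6,
    "the": 1.2,
    "banana": 0.3,
}


def top_k(scores, k):
    """Return the k highest-scoring tokens, best first."""
    return heapq.nlargest(k, scores, key=scores.get)


top_5_tokens = top_k(vocab_scores, 5)
filled = [template.replace(MASK, token) for token in top_5_tokens]
for sentence in filled:
    print(sentence)
```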

3.2 Causal Language Modeling

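CLM is the task of predicting the token that follows a sequence of tokens. In this setting the model attends only to the left context (the tokens to the left of the position being predicted). Such a training setup is particularly suited to generation tasks.
To fine-tune on the CLM task, see run_clm.py.
Usually, the next token is predicted by sampling from the logits of the last hidden state that the model produces from the input sequence.

An example with AutoClass: it uses AutoModelForCausalLM, AutoTokenizer and the top_k_top_p_filtering() method to sample the next token after an input sequence: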

from transformers import AutoModelForCausalLM, AutoTokenizer, top_k_top_p_filtering
import torch
from torch import nn

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

sequence = "Hugging Face is based in DUMBO, New York City, and"

inputs = tokenizer(sequence, return_tensors="pt")
input_ids = inputs["input_ids"]

# get logits of last hidden state
next_token_logits = model(**inputs).logits[:, -1, :]

# filter
filtered_next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)

# sample
probs = nn.functional.softmax(filtered_next_token_logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)

generated = torch.cat([input_ids, next_token], dim=-1)

resulting_string = tokenizer.decode(generated.tolist()[0])
print(resulting_string)

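Output:
Hugging Face is based in DUMBO, New York City, and is
I did not run this myself. The documentation says the next token will be is or features, and the first token I got from the hosted inference API on the gpt2 model page was also is, so there we go :)

To be honest, I did not fully understand this example. That is roughly how it works; I will come back and write a more detailed explanation once I have read more.

The generation_utils.GenerationMixin.generate() method used in the next section can generate multiple tokens up to a specified length, rather than one token at a time.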
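
Since the filter-then-sample step was the part I found hardest to follow, here is a minimal sketch of it in plain Python. It reimplements only the top-k part for illustration and is not the library's implementation of top_k_top_p_filtering. The vocabulary, logits and random seed are hypothetical.

```python
import math
import random


def top_k_filter(logits, k):
    """Keep the k largest logits and set the rest to -inf.

    Ties at the threshold are all kept, so slightly more than k may survive.
    """
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else float("-inf") for x in logits]


def softmax(logits):
    peak = max(logits)
    # exp(-inf) is 0.0, so filtered tokens end up with zero probability.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-token candidates and logits (illustration only).
vocab = [" is", " features", " has", " the", " banana"]
logits = [3.0, 2.5, 1.0, 0.5, -2.0]

filtered = top_k_filter(logits, k=2)
probs = softmax(filtered)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
next_token = rng.choices(vocab, weights=probs, k=1)[0]
print("Hugging Face is based in DUMBO, New York City, and" + next_token)
```

With k=2 only " is" and " features" can ever be sampled, which is the whole point of the filter: low-scoring tokens are removed before the random draw.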

3.3 Text Generation

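The goal of text generation (also known as open-ended text generation) is to produce a coherent piece of text that continues a given context.

An example with a pipeline, using GPT-2. By default, models in pipelines apply Top-K sampling as set in their configuration (see the GPT-2 configuration file: config.json · gpt2 at main). Note that the call below passes do_sample=False, which turns sampling off: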

from transformers import pipeline

text_generator = pipeline("text-generation")
print(text_generator("As far as I am concerned, I will", max_length=50, do_sample=False))

Output:

[{'generated_text': 'As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a
"free market." I think that the idea of a free market is a bit of a stretch. I think that the idea'}]

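Under the hood, the pipeline object calls the PreTrainedModel.generate() method. For an introduction, see item ③ in section 1 of my earlier post, huggingface.transformers crash-course notes: Pipeline inference and AutoClass (诸神缄默不语's blog, CSDN).

An example with AutoClass, using XLNet and its tokenizer. This model can call generate() directly: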

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("xlnet-base-cased")
tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")

# Padding text helps XLNet with short prompts - proposed by Aman Rusia in https://github.com/rusiaaman/XLNet-gen#methodology
PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family
(except for Alexei and Maria) are discovered.
The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
remainder of the story. 1883 Western Siberia,
a young Grigori Rasputin is asked by his father and a group of men to perform magic.
Rasputin has a vision and denounces one of the men as a horse thief. Although his
father initially slaps him for making such an accusation, Rasputin watches as the
man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
with people, even a bishop, begging for his blessing. <eod> </s> <eos>"""

prompt = "Today the weather is really nice and I am planning on "
inputs = tokenizer(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="pt")["input_ids"]

prompt_length = len(tokenizer.decode(inputs[0]))
outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
generated = prompt + tokenizer.decode(outputs[0])[prompt_length + 1 :]

print(generated)

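For the padding text in the code, see the explanation on the site linked in the comment. I did not follow the details, but the gist is that the way XLNet computes causes some problems: a context that is too short leads to poor generations, so a hard-coded piece of arbitrary text has to be prepended (followed by <eod>, and then by the real context).

This is the result I got when I ran the model myself: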

Today the weather is really nice and I am planning on going for a walk in the park with my mom (she can walk and play golf) to explore and see something I didn't know. The park is actually a giant green, with lots of shade in the first two thirds of the way (it is en route to a major golf course, which is really neat). I decided to walk

The result looks decent?

Without the padding text, the result is:

Today the weather is really nice and I am planning on going on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on going on on on on on on on on on on on on going on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on and on on on on on on on and on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on and on on on on on on on on on on on on on on and on on on on on on and on on on on on on on on on on on on on on on on and on on on and on on on on on on on on on on on on on on on on on on and on and on on on

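Yeah, that is pure artificial stupidity, so the padding text is clearly necessary…

I also tried the hosted inference widget on xlnet-base-cased · Hugging Face (the result was a screenshot in the original post), and it looked just as nonsensical.

In PyTorch, text generation is currently supported for GPT-2, OpenAi-GPT, CTRL, XLNet, Transfo-XL and Reformer.
As in the example above, XLNet and Transfo-XL need padded input to work well.
GPT-2 is a good choice for open-ended text generation because it was trained on millions of web pages with a causal language modeling objective.
For how to apply different decoding strategies to text generation, the documentation points to an official blog post: How to generate text: using different decoding methods for language generation with Transformers. I plan to write study notes on that post as well.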
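
To make "generate multiple tokens up to a specified length" concrete, here is a toy sketch of the simplest decoding strategy, greedy decoding. Everything in it is hypothetical: the lookup table stands in for a language model, and in this sketch max_length counts the prompt tokens as well.

```python
# Hypothetical "model": for each last token, scores for possible next tokens.
NEXT_TOKEN_SCORES = {
    "I": {"will": 2.0, "am": 1.5},
    "will": {"be": 2.2, "go": 1.0},
    "be": {"there": 1.8, "<eos>": 0.4},
    "there": {"<eos>": 3.0},
}


def greedy_generate(prompt_tokens, max_length, eos="<eos>"):
    """Append the best-scoring next token until eos or max_length is reached."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        candidates = NEXT_TOKEN_SCORES.get(tokens[-1])
        if not candidates:
            break  # the toy table has no continuation for this token
        best = max(candidates, key=candidates.get)
        if best == eos:
            break
        tokens.append(best)
    return tokens


full = greedy_generate(["I"], max_length=10)
capped = greedy_generate(["I"], max_length=3)
print(" ".join(full))
print(" ".join(capped))
```

Sampling-based strategies differ only in how `best` is chosen: instead of taking the maximum, they draw from the (possibly filtered) distribution at each step.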

4. Named Entity Recognition

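I do not work on NER, so everything below is based on the documentation, half understood and half guessed, and I have not checked it carefully. If you spot a mistake, please tell me.
Named Entity Recognition (NER) is a token classification task that identifies the named entities in a text: each token is classified as part of a person, organization or location entity, or as not belonging to any entity. The CoNLL-2003 dataset is an example. To fine-tune on an NER task, see run_ner.py.

An example of named entity recognition with a pipeline, which classifies each token into one of the following 9 classes: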

  • O, Outside of a named entity
  • B-MISC, Beginning of a miscellaneous [3] entity right after another miscellaneous entity
  • I-MISC, Miscellaneous entity
  • B-PER, Beginning of a person’s name right after another person’s name
  • I-PER, Person’s name
  • B-ORG, Beginning of an organisation right after another organisation
  • I-ORG, Organisation
  • B-LOC, Beginning of a location right after another location
  • I-LOC, Location

It uses a model fine-tuned on CoNLL-2003 (fine-tuned by @stefan-it from dbmdz):

from transformers import pipeline

ner_pipe = pipeline("ner")

sequence = """Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO,
therefore very close to the Manhattan Bridge which is visible from the window."""

Print the returned entities:

for entity in ner_pipe(sequence):
    print(entity)

Output:

{'entity': 'I-ORG', 'score': 0.9996, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}
{'entity': 'I-ORG', 'score': 0.9910, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}
{'entity': 'I-ORG', 'score': 0.9982, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}
{'entity': 'I-ORG', 'score': 0.9995, 'index': 4, 'word': 'Inc', 'start': 13, 'end': 16}
{'entity': 'I-LOC', 'score': 0.9994, 'index': 11, 'word': 'New', 'start': 40, 'end': 43}
{'entity': 'I-LOC', 'score': 0.9993, 'index': 12, 'word': 'York', 'start': 44, 'end': 48}
{'entity': 'I-LOC', 'score': 0.9994, 'index': 13, 'word': 'City', 'start': 49, 'end': 53}
{'entity': 'I-LOC', 'score': 0.9863, 'index': 19, 'word': 'D', 'start': 79, 'end': 80}
{'entity': 'I-LOC', 'score': 0.9514, 'index': 20, 'word': '##UM', 'start': 80, 'end': 82}
{'entity': 'I-LOC', 'score': 0.9337, 'index': 21, 'word': '##BO', 'start': 82, 'end': 84}
{'entity': 'I-LOC', 'score': 0.9762, 'index': 28, 'word': 'Manhattan', 'start': 114, 'end': 123}
{'entity': 'I-LOC', 'score': 0.9915, 'index': 29, 'word': 'Bridge', 'start': 124, 'end': 130}

In the sequence, "Hugging Face" is recognized as an organization, and "New York City", "DUMBO" and "Manhattan Bridge" are recognized as locations.
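
The pipeline returns one entry per token, so getting whole entities means grouping neighbouring tokens. A sketch of one way to do that in plain Python, using the index, start and end fields copied from the output above. The grouping rule (same entity type and consecutive index) is my own simplification, not what the pipeline does internally; the pipeline also has an aggregation_strategy option for this, which I have not tried here.

```python
sequence = (
    "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO,\n"
    "therefore very close to the Manhattan Bridge which is visible from the window."
)

# Fields copied from the pipeline output above (scores and words left out).
entities = [
    {"entity": "I-ORG", "index": 1, "start": 0, "end": 2},
    {"entity": "I-ORG", "index": 2, "start": 2, "end": 7},
    {"entity": "I-ORG", "index": 3, "start": 8, "end": 12},
    {"entity": "I-ORG", "index": 4, "start": 13, "end": 16},
    {"entity": "I-LOC", "index": 11, "start": 40, "end": 43},
    {"entity": "I-LOC", "index": 12, "start": 44, "end": 48},
    {"entity": "I-LOC", "index": 13, "start": 49, "end": 53},
    {"entity": "I-LOC", "index": 19, "start": 79, "end": 80},
    {"entity": "I-LOC", "index": 20, "start": 80, "end": 82},
    {"entity": "I-LOC", "index": 21, "start": 82, "end": 84},
    {"entity": "I-LOC", "index": 28, "start": 114, "end": 123},
    {"entity": "I-LOC", "index": 29, "start": 124, "end": 130},
]


def group_entities(entities, text):
    """Merge tokens with the same entity type and consecutive token index."""
    groups = []
    for ent in entities:
        label = ent["entity"].split("-", 1)[1]  # "I-ORG" -> "ORG"
        last = groups[-1] if groups else None
        if last and last["label"] == label and ent["index"] == last["last_index"] + 1:
            last["end"] = ent["end"]
            last["last_index"] = ent["index"]
        else:
            groups.append({"label": label, "start": ent["start"],
                           "end": ent["end"], "last_index": ent["index"]})
    # Character offsets let us cut the entity text straight out of the input.
    return [(g["label"], text[g["start"]:g["end"]]) for g in groups]


grouped = group_entities(entities, sequence)
for label, text in grouped:
    print(label, text)
```

With this rule "Inc" is merged into the organization as well, because the model tagged it I-ORG.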

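An example of named entity recognition with AutoClass:

  1. Instantiate the tokenizer and the model from the checkpoint name. The model architecture is BERT, and the weights stored in the checkpoint are loaded.
  2. Define a sequence with known entities (for example, the organization "Hugging Face" and the location "New York City").
  3. Tokenize the sequence.
  4. Pass the input to the model and take the first output. It holds, for each token, a distribution over the 9 classes; argmax gives the most likely class for each token.
  5. Zip each token with its prediction and print them.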
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = (
    "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, "
    "therefore very close to the Manhattan Bridge."
)

inputs = tokenizer(sequence, return_tensors="pt")
tokens = inputs.tokens()

outputs = model(**inputs).logits
predictions = torch.argmax(outputs, dim=2)

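Unlike the pipeline, this does not drop the O class (class 0), that is, tokens that are not part of any entity.
Each class in predictions is an integer; model.config.id2label maps that integer back to the class name: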

for token, prediction in zip(tokens, predictions[0].numpy()):
    print((token, model.config.id2label[prediction]))

Output:

('[CLS]', 'O')
('Hu', 'I-ORG')
('##gging', 'I-ORG')
('Face', 'I-ORG')
('Inc', 'I-ORG')
('.', 'O')
('is', 'O')
('a', 'O')
('company', 'O')
('based', 'O')
('in', 'O')
('New', 'I-LOC')
('York', 'I-LOC')
('City', 'I-LOC')
('.', 'O')
('Its', 'O')
('headquarters', 'O')
('are', 'O')
('in', 'O')
('D', 'I-LOC')
('##UM', 'I-LOC')
('##BO', 'I-LOC')
(',', 'O')
('therefore', 'O')
('very', 'O')
('close', 'O')
('to', 'O')
('the', 'O')
('Manhattan', 'I-LOC')
('Bridge', 'I-LOC')
('.', 'O')
('[SEP]', 'O')
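
The output above is per WordPiece token, so "Hugging" shows up as Hu plus ##gging. A small sketch of turning it back into labelled words in plain Python. The pairs are a shortened copy of the output above, and the merging rule (glue a ## piece onto the previous token and keep the first piece's label) is my own assumption for illustration.

```python
# Shortened copy of the (token, label) pairs printed above.
pairs = [
    ("[CLS]", "O"), ("Hu", "I-ORG"), ("##gging", "I-ORG"), ("Face", "I-ORG"),
    ("Inc", "I-ORG"), (".", "O"), ("is", "O"), ("based", "O"), ("in", "O"),
    ("New", "I-LOC"), ("York", "I-LOC"), ("City", "I-LOC"), (".", "O"),
    ("D", "I-LOC"), ("##UM", "I-LOC"), ("##BO", "I-LOC"), (",", "O"), ("[SEP]", "O"),
]


def merge_wordpieces(pairs):
    """Glue '##' continuation pieces onto the previous token; skip special tokens."""
    words = []
    for token, label in pairs:
        if token in ("[CLS]", "[SEP]"):
            continue
        if token.startswith("##") and words:
            prev_text, prev_label = words[-1]
            words[-1] = (prev_text + token[2:], prev_label)
        else:
            words.append((token, label))
    return words


words = merge_wordpieces(pairs)
# Keep only words that belong to some entity, as the pipeline output does.
entity_words = [(text, label) for text, label in words if label != "O"]
print(entity_words)
```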

5. Summarization

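The goal of summarization is to condense a long text into a short summary; the CNN / Daily Mail news dataset is an example. To fine-tune on summarization, see transformers/examples/pytorch/summarization at main · huggingface/transformers.

An example with a pipeline, using a BART model fine-tuned on the CNN / Daily Mail dataset: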

from transformers import pipeline

summarizer = pipeline("summarization")

ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""

print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))

Output:

[{'summary_text': ' Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in
the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and
2002 . At one time, she was married to eight men at once, prosecutors say .'}]

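The summarization pipeline is also built on PreTrainedModel.generate(); see the text generation section above.

An example with AutoClass:

  1. Instantiate the tokenizer and the model from the checkpoint name. Summarization is usually done with an encoder-decoder model such as BART or T5.
  2. Define the text to be summarized.
  3. Add the T5-specific prefix "summarize: ".
  4. Generate the summary with the PreTrainedModel.generate() method.

The following example uses Google's T5 model, which was pre-trained on a multi-task mixture of datasets (including CNN / Daily Mail):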

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

# T5 uses a max_length of 512 so we cut the article to 512 tokens.
inputs = tokenizer("summarize: " + ARTICLE, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(
    inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True
)

print(tokenizer.decode(outputs[0]))

Output:

<pad> prosecutors say the marriages were part of an immigration scam. if convicted, barrientos faces two criminal
counts of "offering a false instrument for filing in the first degree" she has been married 10 times, nine of them
between 1999 and 2002.</s>
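
The generate() call above passes num_beams=4 and length_penalty=2.0. To get a feel for what a length penalty does, here is a toy sketch that ranks two finished beams. It assumes the score is the summed token log-probability divided by length ** length_penalty; that formula is my assumption about how beams are commonly ranked, not something taken from the documentation page, and the two beams are hypothetical.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized beam score (assumed form, see the note above)."""
    return sum_logprobs / (length ** length_penalty)


# Two hypothetical finished beams: (summed token log-probability, length in tokens).
short_beam = (-4.0, 5)
long_beam = (-6.0, 12)

winners = {}
for penalty in (0.0, 1.0, 2.0):
    short_score = beam_score(*short_beam, penalty)
    long_score = beam_score(*long_beam, penalty)
    winners[penalty] = "short" if short_score > long_score else "long"
    print(f"length_penalty={penalty}: short={short_score:.4f} long={long_score:.4f} -> {winners[penalty]}")
```

Log-probabilities are negative, so dividing by a larger power of the length pulls long beams closer to zero. Under this assumption a higher length_penalty favours longer summaries.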

6. Translation

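The goal of translation is to translate text from one language into another; the WMT dataset, with English input and German output, is an example. To fine-tune on translation, see transformers/examples/pytorch/translation at main · huggingface/transformers.

An example with a pipeline, using the same T5 model as the AutoClass example in the summarization section above (its training data includes WMT):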

from transformers import pipeline

translator = pipeline("translation_en_to_de")
print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))

Output:
[{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]

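The translation pipeline is also built on PreTrainedModel.generate(); see the text generation section above.

An example with AutoClass:

  1. Instantiate the tokenizer and the model from the checkpoint name. Translation is usually done with an encoder-decoder model such as BART or T5.
  2. Define the text to be translated.
  3. Add the T5-specific prefix "translate English to German: ".
  4. Generate the translation with the PreTrainedModel.generate() method.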
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

inputs = tokenizer(
    "translate English to German: Hugging Face is a technology company based in New York and Paris",
    return_tensors="pt",
)
outputs = model.generate(inputs["input_ids"], max_length=40, num_beams=4, early_stopping=True)

print(tokenizer.decode(outputs[0]))

Output:
<pad> Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.</s>

Apart from the <pad> and </s> special tokens, this matches the result of the pipeline example.
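
The only difference between the two outputs is the special tokens around the decoded text. As far as I know, tokenizer.decode accepts skip_special_tokens=True to drop them. The plain-Python sketch below shows the same idea on the strings from above; the token list is just the two special tokens visible in that output, and the helper name is my own.

```python
# The special tokens visible in the decoded T5 output above.
SPECIAL_TOKENS = ("<pad>", "</s>")


def strip_special_tokens(text, special_tokens=SPECIAL_TOKENS):
    """Remove special-token strings and surrounding whitespace from decoded text."""
    for token in special_tokens:
        text = text.replace(token, "")
    return text.strip()


decoded = "<pad> Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.</s>"
pipeline_text = "Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris."

cleaned = strip_special_tokens(decoded)
print(cleaned == pipeline_text)
```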


  1. Reference the documentation gives for SQuAD: BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, which introduces it in section 4.2.

  2. For this usage, see: Using curly braces {} in Python f-strings (彭世瑜's blog, CSDN).

  3. miscellaneous: mixed; of various kinds.
    Here it should mean entity types other than person, organization and location.

Added: 2022-04-04 12:11:32  Updated: 2022-04-04 12:13:55
 