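诸神缄默不语: index of my CSDN blog posts
This post is part of my study notes on the full huggingface.transformers documentation. Full series: huggingface transformers package documentation study notes (continuously updated…)
URL of this section: https://huggingface.co/docs/transformers/master/en/task_summary This section presents solutions to some common NLP tasks using the transformers package. For details on the AutoModel classes used here, see their documentation, or my earlier notes on the transformers documentation, where I covered related usage and example code.
A model has to be loaded from a checkpoint that corresponds to the task in question to perform well on it. (If you load a checkpoint that was not fine-tuned on that task, only the base transformer layers are loaded. The additional head the task needs is missing, so its weights are randomly initialized and the outputs are random.) These checkpoints are usually pre-trained on a large corpus and then fine-tuned on a specific task. This means:
- Not every model has been fine-tuned on every task. To fine-tune a model on a particular task, see the code in the examples folder of the official transformers GitHub repository: transformers/examples at master · huggingface/transformers. Notes on those scripts are already on my list of planned posts (接下来博文写作计划的卫星) and will be written later, so this post does not cover that code.
- Checkpoints are fine-tuned on a specific dataset, and that dataset may not cover your own use case and domain. As mentioned above, you can continue fine-tuning yourself.
To run inference on a given task directly, you can use these mechanisms:
- Pipelines: a very easy-to-use abstraction that needs as little as two lines of code. For more detailed notes on pipelines, see Section 1 of my earlier post: huggingface.transformers速成笔记:Pipeline推理和AutoClass.
- Using a model directly: a lower level of abstraction, but easier to adapt and more powerful. It lets you swap in and use a tokenizer directly and gives you full inference capability. (The original wording is "a direct access to a tokenizer"; I did not quite get what exactly that means.)
Both approaches are shown below: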
1. Sequence Classification
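Sequence classification is the task of classifying a sequence into one of a given number of classes, as in the GLUE dataset. For fine-tuning on GLUE, see run_glue.py or run_xnli.py.
Example of sentiment classification with a pipeline, using a model fine-tuned on sst2 (a GLUE task). It returns a label ("POSITIVE" or "NEGATIVE") and a score: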
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("I hate you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
result = classifier("I love you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
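Output:
label: NEGATIVE, with score: 0.9991
label: POSITIVE, with score: 0.9999
Example of using AutoClass to decide whether two sentences are paraphrases of each other:
- Initialize the tokenizer and model from the checkpoint name. The architecture is BERT, and the weights stored in the checkpoint are loaded.
- Build a sequence from the two sentences, with the correct model-specific separators, token type ids and attention masks (generated automatically by the tokenizer).
- Pass the sequence to the model, which classifies it as paraphrase or not.
- Apply softmax to the output to get probabilities over the classes.
- Print the results.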
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")
classes = ["not paraphrase", "is paraphrase"]
sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")
paraphrase_classification_logits = model(**paraphrase).logits
not_paraphrase_classification_logits = model(**not_paraphrase).logits
paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]
not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]
# Should be classified as a paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")
# Should be classified as not a paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")
Output:
not paraphrase: 10%
is paraphrase: 90%
not paraphrase: 94%
is paraphrase: 6%
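To make the softmax step concrete, here is a minimal pure-Python sketch of what torch.softmax does to a pair of logits. The logit values are made up for illustration and are not taken from the real model; only the formula is the point.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable
    # and does not change the result.
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["not paraphrase", "is paraphrase"]
# Hypothetical logits for one sentence pair (assumption, not model output).
logits = [-1.0, 1.2]
probs = softmax(logits)

for name, p in zip(classes, probs):
    print(f"{name}: {int(round(p * 100))}%")
```

The class with the larger logit always gets the larger probability, so softmax changes the scale of the scores but not their ranking.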
2. Extractive Question Answering
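Extractive question answering is the task of extracting a span of text from a context (a passage) to serve as the answer to a given question, as in the SQuAD dataset. For fine-tuning on SQuAD, see run_qa.py.
Example with a pipeline, using a model fine-tuned on SQuAD. It returns the answer extracted from the context, a confidence score, and the start and end values that locate the answer in the context: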
from transformers import pipeline
question_answerer = pipeline("question-answering")
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""
result = question_answerer(question="What is extractive question answering?", context=context)
print(
f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}"
)
result = question_answerer(question="What is a good example of a question answering dataset?", context=context)
print(
f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}"
)
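Output:
Answer: 'the task of extracting an answer from a text given a question', score: 0.6177, start: 34, end: 95
Answer: 'SQuAD dataset', score: 0.5152, start: 147, end: 160
Example with AutoClass:
- Initialize the tokenizer and model from the checkpoint name. The architecture is BERT, and the weights stored in the checkpoint are loaded.
- Define a context and a few questions.
- Iterate over the questions and build a sequence from the context and the current question (with the correct model-specific separators, token type ids and attention masks).
- Pass the sequence to the model, which outputs a score for every token in the sequence (how likely that token is to be the start index or the end index).
- Apply softmax to the output to get probabilities over the tokens.
- Take the tokens between the identified start and end positions and convert them to a string.
- Print the results.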
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
text = r"""
🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""
questions = [
"How many pretrained models are available in 🤗 Transformers?",
"What does 🤗 Transformers provide?",
"🤗 Transformers provides interoperability between which frameworks?",
]
for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]
    outputs = model(**inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits
    # Most likely start of the answer: argmax over the start scores
    answer_start = torch.argmax(answer_start_scores)
    # Most likely end of the answer: argmax over the end scores (+1 so the slice includes that token)
    answer_end = torch.argmax(answer_end_scores) + 1
    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
    )
    print(f"Question: {question}")
    print(f"Answer: {answer}")
Output:
Question: How many pretrained models are available in 🤗 Transformers?
Answer: over 32 +
Question: What does 🤗 Transformers provide?
Answer: general - purpose architectures
Question: 🤗 Transformers provides interoperability between which frameworks?
Answer: tensorflow 2. 0 and pytorch
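The span-selection step can be sketched without a model. The snippet below assumes a tiny hand-written token list and made-up start/end scores (both hypothetical), and shows how two independent argmax calls turn into an answer string.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)

def best_span(start_scores, end_scores):
    """Pick the most likely start and end token positions independently."""
    start = argmax(start_scores)
    end = argmax(end_scores) + 1  # +1 so that slicing includes the end token
    return start, end

# Hypothetical WordPiece tokens for "question [SEP] context" (assumption).
tokens = ["[CLS]", "who", "?", "[SEP]", "ada", "love", "##lace", "wrote", "it", "[SEP]"]
# Hypothetical per-token scores, one list for starts and one for ends.
start_scores = [0.1, 0.0, 0.0, 0.0, 4.2, 0.3, 0.1, 0.2, 0.0, 0.0]
end_scores = [0.1, 0.0, 0.0, 0.0, 0.5, 0.4, 3.9, 0.2, 0.0, 0.0]

start, end = best_span(start_scores, end_scores)
# Join the pieces and glue "##" continuations back onto the previous piece.
answer = " ".join(tokens[start:end]).replace(" ##", "")
print(answer)
```

Because start and end are chosen independently, nothing in this sketch prevents the end from landing before the start; a more careful implementation would score start/end pairs jointly.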
3. Language Modeling
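Language modeling is the task of fitting a model to a corpus, which is often domain-specific. That may sound abstract, so I suggest looking at the examples later in this post to get an intuitive feel for it. All popular transformer-based models are trained with some variant of language modeling: BERT uses masked language modeling, for example, and GPT-2 uses causal language modeling. Language modeling is also useful outside of pre-training, for instance to shift the model distribution toward a specific domain: take a model pre-trained on a large corpus and fine-tune it on a new dataset, such as research papers: lysandre/arxiv-nlp · Hugging Face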
3.1 Masked Language Modeling
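MLM masks some tokens in a sequence with a masking token and trains the model to fill each mask with an appropriate token. This lets the model attend to both the right context (tokens to the right of the mask) and the left context (tokens to the left of the mask). Such a training setup provides a strong basis for downstream tasks that need bi-directional context, such as SQuAD. For fine-tuning on an MLM task, see run_mlm.py.
Example with a pipeline. It outputs the sequence with the mask filled in, a confidence score, the token used to fill the mask, and that token's ID in the tokenizer vocabulary: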
from transformers import pipeline
unmasker = pipeline("fill-mask")
from pprint import pprint
pprint(
unmasker(
f"HuggingFace is creating a {unmasker.tokenizer.mask_token} that the community uses to solve NLP tasks."
)
)
Output:
[{'score': 0.1793,
'sequence': 'HuggingFace is creating a tool that the community uses to solve '
'NLP tasks.',
'token': 3944,
'token_str': ' tool'},
{'score': 0.1135,
'sequence': 'HuggingFace is creating a framework that the community uses to '
'solve NLP tasks.',
'token': 7208,
'token_str': ' framework'},
{'score': 0.0524,
'sequence': 'HuggingFace is creating a library that the community uses to '
'solve NLP tasks.',
'token': 5560,
'token_str': ' library'},
{'score': 0.0349,
'sequence': 'HuggingFace is creating a database that the community uses to '
'solve NLP tasks.',
'token': 8503,
'token_str': ' database'},
{'score': 0.0286,
'sequence': 'HuggingFace is creating a prototype that the community uses to '
'solve NLP tasks.',
'token': 17715,
'token_str': ' prototype'}]
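Example with AutoClass:
- Initialize the tokenizer and model from the checkpoint name. The architecture is DistilBERT, and the weights stored in the checkpoint are loaded.
- Define a sequence with a masked token: replace one word with tokenizer.mask_token (a string variable, wrapped in curly braces inside the f-string so that it is substituted; I think "word" here really means a single token).
- Encode the sequence into a list of token IDs and find the position of the masked token in that list.
- Retrieve the predictions at the index of the mask token: this tensor has the same size as the vocabulary, and its values are the scores assigned to each token. Tokens the model considers more likely in this context get higher scores.
- Use PyTorch's topk method to get the 5 highest-scoring tokens.
- Replace the mask token with each of those tokens and print the results.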
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-cased")
sequence = (
"Distilled models are smaller than the models they mimic. Using them instead of the large "
f"versions would help {tokenizer.mask_token} our carbon footprint."
)
inputs = tokenizer(sequence, return_tensors="pt")
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
token_logits = model(**inputs).logits
mask_token_logits = token_logits[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))
Output:
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help reduce our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help increase our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help decrease our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help offset our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help improve our carbon footprint.
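The "take the top k scores at the mask position" step can be illustrated in plain Python. This is only a sketch: the five-word vocabulary and its scores are invented, and the real model scores every token in its vocabulary.

```python
MASK = "[MASK]"

def top_k(scores, k):
    """Return the k tokens with the highest scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical scores at the masked position over a toy vocabulary (assumption).
mask_scores = {"reduce": 7.1, "increase": 6.4, "banana": -2.0, "offset": 5.2, "the": 0.3}
sequence = f"Using them would help {MASK} our carbon footprint."

top_tokens = top_k(mask_scores, 3)
for token in top_tokens:
    # Fill the mask with each candidate, as the example above does.
    print(sequence.replace(MASK, token))
```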
3.2 Causal Language Modeling
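CLM is the task of predicting the token that follows a sequence of tokens. In this setting the model only attends to the left context (the tokens to the left of the position being predicted). Such a training setup is particularly suited to generation tasks. For fine-tuning on a CLM task, see run_clm.py. Usually, the next token is predicted by sampling from the logits of the last hidden state that the model produces from the input sequence.
Example with AutoClass: use AutoModelForCausalLM, AutoTokenizer and the top_k_top_p_filtering() function to sample the next token after an input sequence: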
from transformers import AutoModelForCausalLM, AutoTokenizer, top_k_top_p_filtering
import torch
from torch import nn
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
sequence = "Hugging Face is based in DUMBO, New York City, and"
inputs = tokenizer(sequence, return_tensors="pt")
input_ids = inputs["input_ids"]
next_token_logits = model(**inputs).logits[:, -1, :]
filtered_next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)
probs = nn.functional.softmax(filtered_next_token_logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
generated = torch.cat([input_ids, next_token], dim=-1)
resulting_string = tokenizer.decode(generated.tolist()[0])
print(resulting_string)
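Output: Hugging Face is based in DUMBO, New York City, and is
I did not run this myself. The documentation says the next token will be "is" or "features", and the first token I got from the inference API on the gpt2 model page was also "is", so there we go :)
To be honest, I did not fully understand this example. That is roughly how it works, anyway. I will come back and write a detailed explanation after reading more material.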
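The sampling step in the example above boils down to three operations: keep only the k best logits, turn them into probabilities, and draw one token at random. Below is a pure-Python sketch of that idea under toy assumptions (a five-token vocabulary and invented logits). It is not the library's implementation, and as far as I know newer transformers releases have deprecated the top_k_top_p_filtering helper in favor of logits-warper classes.

```python
import math
import random

def top_k_filter(logits, k):
    """Keep the k largest logits; set the others to -inf so softmax gives them 0."""
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else float("-inf") for x in logits]

def softmax(logits):
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and next-token logits (assumption, not GPT-2 output).
vocab = [" is", " features", " the", " banana", " has"]
next_token_logits = [3.0, 2.5, 1.0, -4.0, 2.0]

filtered = top_k_filter(next_token_logits, k=3)
probs = softmax(filtered)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
next_token = rng.choices(vocab, weights=probs, k=1)[0]
print("Hugging Face is based in DUMBO, New York City, and" + next_token)
```

Filtered-out tokens end up with probability exactly 0, so they can never be sampled. If several logits tie at the threshold, this sketch keeps all of them, which may be more than k.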
The generation_utils.GenerationMixin.generate() method used in the next section can generate multiple tokens up to a specified length, instead of one token at a time.
3.3 Text Generation
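The goal of text generation (also known as open-ended text generation) is to produce a coherent continuation of a given context (a piece of text).
Example with a pipeline, using the GPT-2 model with Top-K sampling; see the configuration file of the GPT-2 model: config.json · gpt2 at main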
from transformers import pipeline
text_generator = pipeline("text-generation")
print(text_generator("As far as I am concerned, I will", max_length=50, do_sample=False))
Output:
[{'generated_text': 'As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a
"free market." I think that the idea of a free market is a bit of a stretch. I think that the idea'}]
The pipeline object actually calls PreTrainedModel.generate(). For an introduction, see part ③ of Section 1 of my earlier post: huggingface.transformers速成笔记:Pipeline推理和AutoClass_诸神缄默不语的博客-CSDN博客.
Example with AutoClass, using XLNet and its tokenizer. This model can call generate() directly:
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("xlnet-base-cased")
tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")
PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family
(except for Alexei and Maria) are discovered.
The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
remainder of the story. 1883 Western Siberia,
a young Grigori Rasputin is asked by his father and a group of men to perform magic.
Rasputin has a vision and denounces one of the men as a horse thief. Although his
father initially slaps him for making such an accusation, Rasputin watches as the
man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
with people, even a bishop, begging for his blessing. <eod> </s> <eos>"""
prompt = "Today the weather is really nice and I am planning on "
inputs = tokenizer(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="pt")["input_ids"]
prompt_length = len(tokenizer.decode(inputs[0]))
outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
generated = prompt + tokenizer.decode(outputs[0])[prompt_length + 1 :]
print(generated)
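For the padding text in the code, see the explanation on the site linked in the comments. I did not follow the details, but in short, the way XLNet works causes some problems: a context that is too short leads to poor generations, so a hard-coded piece of arbitrary text has to be prepended (followed by <eod>, and then the real context).
The result I got when running this model myself: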
Today the weather is really nice and I am planning on going for a walk in the park with my mom (she can walk and play golf) to explore and see something I didn't know. The park is actually a giant green, with lots of shade in the first two thirds of the way (it is en route to a major golf course, which is really neat). I decided to walk
Looks decent?
Without the padding text, the result is:
Today the weather is really nice and I am planning on going on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on going on on on on on on on on on on on on going on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on and on on on on on on on and on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on on and on on on on on on on on on on on on on on and on on on on on on and on on on on on on on on on on on on on on on on and on on on and on on on on on on on on on on on on on on on on on on and on and on on on
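Yep, pure artificial stupidity, so the padding text clearly is necessary…
I also tried the inference widget on xlnet-base-cased · Hugging Face, and the result was likewise artificial stupidity.
In PyTorch, text generation is currently supported with the GPT-2, OpenAi-GPT, CTRL, XLNet, Transfo-XL and Reformer models. As in the example above, XLNet and Transfo-XL need padded input to work properly. GPT-2 is a good choice for open-ended text generation because it was trained on millions of web pages with a causal language modeling objective. For how to apply different decoding strategies to text generation, the documentation points to an official blog post: How to generate text: using different decoding methods for language generation with Transformers. I plan to write study notes on that post as well.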
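As a rough picture of what a decoding strategy is, the toy sketch below runs a greedy loop: at every step it appends the single highest-scoring next token. The bigram table stands in for a real model and is entirely hypothetical; generate() is far more elaborate, so read this only as an illustration of the token-by-token loop.

```python
# Hypothetical bigram "model": for each token, scores of possible next tokens.
BIGRAM_SCORES = {
    "I": {"am": 2.0, "will": 1.5},
    "am": {"planning": 1.8, "going": 1.2},
    "planning": {"on": 3.0},
    "on": {"going": 1.0, "on": 2.0},  # "on" -> "on" has the highest score
    "going": {"on": 1.5},
}

def greedy_generate(prompt_tokens, max_length):
    """Append the best-scoring next token until max_length tokens are reached."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        candidates = BIGRAM_SCORES.get(tokens[-1])
        if not candidates:
            break  # the toy model knows no continuation for this token
        tokens.append(max(candidates, key=candidates.get))
    return tokens

generated = greedy_generate(["I"], max_length=8)
print(" ".join(generated))
```

Once the loop reaches a token whose best continuation is itself, greedy decoding repeats it until max_length. Sampling-based strategies such as the top-k and top-p settings used above are one way to avoid always following the single best path.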
4. Named Entity Recognition
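(I do not work on NER, so what follows is half understood and half guessed from the documentation, without careful verification. If you spot a mistake, please just tell me.) Named Entity Recognition (NER) is a token classification task that identifies the named entities in a text, for example labeling a token as part of a person, organization or location entity, or as not belonging to any entity. An example dataset is CoNLL-2003. For fine-tuning on NER, see run_ner.py.
Example of named entity recognition with a pipeline, which classifies tokens into the following 9 classes: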
- O, Outside of a named entity
- B-MIS, Beginning of a miscellaneous entity right after another miscellaneous entity
- I-MIS, Miscellaneous entity
- B-PER, Beginning of a person’s name right after another person’s name
- I-PER, Person’s name
- B-ORG, Beginning of an organisation right after another organisation
- I-ORG, Organisation
- B-LOC, Beginning of a location right after another location
- I-LOC, Location
It uses a model fine-tuned on CoNLL-2003 (fine-tuned by @stefan-it from dbmdz):
from transformers import pipeline
ner_pipe = pipeline("ner")
sequence = """Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO,
therefore very close to the Manhattan Bridge which is visible from the window."""
Print the returned entities:
for entity in ner_pipe(sequence):
    print(entity)
Output:
{'entity': 'I-ORG', 'score': 0.9996, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}
{'entity': 'I-ORG', 'score': 0.9910, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}
{'entity': 'I-ORG', 'score': 0.9982, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}
{'entity': 'I-ORG', 'score': 0.9995, 'index': 4, 'word': 'Inc', 'start': 13, 'end': 16}
{'entity': 'I-LOC', 'score': 0.9994, 'index': 11, 'word': 'New', 'start': 40, 'end': 43}
{'entity': 'I-LOC', 'score': 0.9993, 'index': 12, 'word': 'York', 'start': 44, 'end': 48}
{'entity': 'I-LOC', 'score': 0.9994, 'index': 13, 'word': 'City', 'start': 49, 'end': 53}
{'entity': 'I-LOC', 'score': 0.9863, 'index': 19, 'word': 'D', 'start': 79, 'end': 80}
{'entity': 'I-LOC', 'score': 0.9514, 'index': 20, 'word': '##UM', 'start': 80, 'end': 82}
{'entity': 'I-LOC', 'score': 0.9337, 'index': 21, 'word': '##BO', 'start': 82, 'end': 84}
{'entity': 'I-LOC', 'score': 0.9762, 'index': 28, 'word': 'Manhattan', 'start': 114, 'end': 123}
{'entity': 'I-LOC', 'score': 0.9915, 'index': 29, 'word': 'Bridge', 'start': 124, 'end': 130}
Here "Hugging Face" is identified as an organization, and "New York City", "DUMBO" and "Manhattan Bridge" are identified as locations.
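The pipeline output above is per sub-word token, so "Hugging Face" shows up as several pieces. A minimal sketch of how those pieces could be merged back into whole entities is shown below. It reuses the entity, index and word fields from the output above; the grouping rule (same entity type and consecutive token index) is my own simplification. The pipeline also has an aggregation_strategy option for this purpose, which this sketch does not try to reproduce exactly.

```python
def group_entities(token_predictions):
    """Merge consecutive tokens of the same entity type into one entity."""
    groups = []
    prev_index = None
    for tok in token_predictions:
        entity_type = tok["entity"].split("-")[-1]  # "I-ORG" -> "ORG"
        word = tok["word"]
        contiguous = prev_index is not None and tok["index"] == prev_index + 1
        if groups and contiguous and groups[-1]["type"] == entity_type:
            if word.startswith("##"):
                groups[-1]["text"] += word[2:]  # glue a WordPiece continuation
            else:
                groups[-1]["text"] += " " + word
        else:
            groups.append({"type": entity_type, "text": word})
        prev_index = tok["index"]
    return groups

# Fields copied from the pipeline output above (scores and offsets omitted).
predictions = [
    {"entity": "I-ORG", "index": 1, "word": "Hu"},
    {"entity": "I-ORG", "index": 2, "word": "##gging"},
    {"entity": "I-ORG", "index": 3, "word": "Face"},
    {"entity": "I-ORG", "index": 4, "word": "Inc"},
    {"entity": "I-LOC", "index": 11, "word": "New"},
    {"entity": "I-LOC", "index": 12, "word": "York"},
    {"entity": "I-LOC", "index": 13, "word": "City"},
    {"entity": "I-LOC", "index": 19, "word": "D"},
    {"entity": "I-LOC", "index": 20, "word": "##UM"},
    {"entity": "I-LOC", "index": 21, "word": "##BO"},
    {"entity": "I-LOC", "index": 28, "word": "Manhattan"},
    {"entity": "I-LOC", "index": 29, "word": "Bridge"},
]

entities = group_entities(predictions)
for entity in entities:
    print(entity)
```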
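Example of named entity recognition with AutoClass:
- Initialize the tokenizer and model from the checkpoint name. The architecture is BERT, and the weights stored in the checkpoint are loaded.
- Define a sequence with known entities (such as the organization "Hugging Face" and the location "New York City").
- Tokenize the sequence.
- Pass the input to the model and take the first output. For every token it holds a score over the 9 classes, and argmax gives the most likely class for each token.
- Zip each token with its prediction and print them.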
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sequence = (
"Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, "
"therefore very close to the Manhattan Bridge."
)
inputs = tokenizer(sequence, return_tensors="pt")
tokens = inputs.tokens()
outputs = model(**inputs).logits
predictions = torch.argmax(outputs, dim=2)
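Unlike the pipeline, the O class (tokens that are not part of any entity) is not removed here. Every class in predictions is represented by an integer, which can be mapped back to the class name with model.config.id2label:
for token, prediction in zip(tokens, predictions[0].numpy()):
    print((token, model.config.id2label[prediction]))
Output: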
('[CLS]', 'O')
('Hu', 'I-ORG')
('##gging', 'I-ORG')
('Face', 'I-ORG')
('Inc', 'I-ORG')
('.', 'O')
('is', 'O')
('a', 'O')
('company', 'O')
('based', 'O')
('in', 'O')
('New', 'I-LOC')
('York', 'I-LOC')
('City', 'I-LOC')
('.', 'O')
('Its', 'O')
('headquarters', 'O')
('are', 'O')
('in', 'O')
('D', 'I-LOC')
('##UM', 'I-LOC')
('##BO', 'I-LOC')
(',', 'O')
('therefore', 'O')
('very', 'O')
('close', 'O')
('to', 'O')
('the', 'O')
('Manhattan', 'I-LOC')
('Bridge', 'I-LOC')
('.', 'O')
('[SEP]', 'O')
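To connect the two NER examples: the AutoClass version takes an argmax per token, maps the class id to a label, and keeps the O tokens, whereas the pipeline drops them. The sketch below reproduces that on toy data. The three-label id2label mapping and the logits are hypothetical and much smaller than the real model's 9 classes.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical label mapping (the real model has 9 labels).
id2label = {0: "O", 1: "I-ORG", 2: "I-LOC"}
tokens = ["[CLS]", "Hu", "##gging", "Face", "is", "in", "New", "York", "[SEP]"]
# Hypothetical per-token logits, one score per label.
logits = [
    [5.0, 0.1, 0.2],
    [0.2, 4.1, 0.3],
    [0.5, 3.2, 0.1],
    [0.1, 3.9, 0.4],
    [4.8, 0.2, 0.1],
    [4.5, 0.1, 0.6],
    [0.3, 0.2, 4.4],
    [0.2, 0.1, 4.6],
    [5.1, 0.0, 0.1],
]

labels = [id2label[argmax(row)] for row in logits]
# Dropping "O" leaves only tokens that belong to some entity.
entity_tokens = [(tok, lab) for tok, lab in zip(tokens, labels) if lab != "O"]
print(entity_tokens)
```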
5. Summarization
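The goal of summarization is to condense a long text into a short summary, as in the CNN / Daily Mail news dataset. For fine-tuning on summarization, see transformers/examples/pytorch/summarization at main · huggingface/transformers.
Example with a pipeline, using a BART model fine-tuned on the CNN / Daily Mail dataset: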
from transformers import pipeline
summarizer = pipeline("summarization")
ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison. Her next court appearance is scheduled for May 18.
"""
print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))
Output:
[{'summary_text': ' Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in
the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and
2002 . At one time, she was married to eight men at once, prosecutors say .'}]
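The summarization pipeline is also built on PreTrainedModel.generate(); see the text generation section above.
Example with AutoClass:
- Initialize the tokenizer and model from the checkpoint name. Summarization is usually done with an encoder-decoder model such as BART or T5.
- Define the text to summarize.
- Add the T5-specific prefix "summarize: ".
- Generate the summary with PreTrainedModel.generate().
The example below uses Google's T5 model, which was pre-trained on a multi-task mixture of datasets (including CNN / Daily Mail). ARTICLE is the text defined in the pipeline example above: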
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")
inputs = tokenizer("summarize: " + ARTICLE, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(
inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True
)
print(tokenizer.decode(outputs[0]))
Output:
<pad> prosecutors say the marriages were part of an immigration scam. if convicted, barrientos faces two criminal
counts of "offering a false instrument for filing in the first degree" she has been married 10 times, nine of them
between 1999 and 2002.</s>
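The generate() call above passes num_beams=4 and length_penalty=2.0. As I understand it, beam search ranks finished candidates by their summed log-probability divided by length raised to the length penalty, so a larger penalty favors longer outputs. The sketch below illustrates that formula on two invented hypotheses. Both the numbers and the exact scoring rule are assumptions for illustration, not a copy of the library code.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized score: higher is better (log-probs are negative)."""
    return sum_logprobs / (length ** length_penalty)

# Hypothetical finished beams: name -> (summed log-probability, length in tokens).
hypotheses = {"short": (-4.0, 5), "long": (-6.0, 10)}

def best(length_penalty):
    """Name of the hypothesis with the highest length-normalized score."""
    return max(
        hypotheses,
        key=lambda name: beam_score(*hypotheses[name], length_penalty),
    )

print("length_penalty=0.0 ->", best(0.0))
print("length_penalty=2.0 ->", best(2.0))
```

With no penalty the raw log-probability decides and the shorter hypothesis wins; with a penalty of 2.0 the division by length squared lets the longer one overtake it.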
6. Translation
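The goal of translation is to translate text from one language into another, as in the WMT dataset with English input and German output. For fine-tuning on translation, see transformers/examples/pytorch/translation at main · huggingface/transformers.
Example with a pipeline, using the same T5 model as the AutoClass example in the summarization section above (its training data includes WMT):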
from transformers import pipeline
translator = pipeline("translation_en_to_de")
print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))
Output: [{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]
The translation pipeline is also built on PreTrainedModel.generate(); see the text generation section above.
Example with AutoClass:
- Initialize the tokenizer and model from the checkpoint name. Translation is usually done with an encoder-decoder model such as BART or T5.
- Define the text to translate.
- Add the T5-specific prefix "translate English to German: ".
- Generate the translation with PreTrainedModel.generate().
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")
inputs = tokenizer(
"translate English to German: Hugging Face is a technology company based in New York and Paris",
return_tensors="pt",
)
outputs = model.generate(inputs["input_ids"], max_length=40, num_beams=4, early_stopping=True)
print(tokenizer.decode(outputs[0]))
Output: <pad> Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.</s>
The translated text is the same as in the pipeline example.
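Two small details of the T5 examples are easy to capture in helper functions: the task prefix added before the input, and the <pad> and </s> markers left in the decoded output. The helpers below are hypothetical names of my own, written in plain Python. In the library itself, passing skip_special_tokens=True to tokenizer.decode removes such markers.

```python
# Task prefixes used in the T5 examples above.
T5_PREFIXES = {
    "summarize": "summarize: ",
    "en_to_de": "translate English to German: ",
}
SPECIAL_TOKENS = ("<pad>", "</s>")

def add_prefix(task, text):
    """Prepend the T5 task prefix (hypothetical helper, not a library function)."""
    return T5_PREFIXES[task] + text

def strip_special_tokens(decoded):
    """Remove the special-token markers from a decoded string."""
    for token in SPECIAL_TOKENS:
        decoded = decoded.replace(token, "")
    return decoded.strip()

model_input = add_prefix("en_to_de", "Hugging Face is a technology company based in New York and Paris")
raw_output = "<pad> Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.</s>"
clean_output = strip_special_tokens(raw_output)

print(model_input)
print(clean_output)
```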