Sentence Classification Tasks
The GLUE (General Language Understanding Evaluation) leaderboard covers nine sentence-level tasks, summarized in the table below.
| # | Name | Full name | Task | Metric |
|---|------|-----------|------|--------|
| 1 | CoLA | Corpus of Linguistic Acceptability | Decide whether a sentence is grammatically acceptable | Matthews correlation coefficient |
| 2 | MNLI | Multi-Genre Natural Language Inference | Given a premise, decide whether a hypothesis is entailed by it, contradicts it, or is neutral | Accuracy |
| 3 | MRPC | Microsoft Research Paraphrase Corpus | Decide whether two sentences are paraphrases of each other | Accuracy & F1 |
| 4 | QNLI | Question-answering Natural Language Inference | Decide whether the second sentence contains the answer to the question in the first | Accuracy |
| 5 | QQP | Quora Question Pairs | Decide whether two questions are semantically equivalent | Accuracy & F1 |
| 6 | RTE | Recognizing Textual Entailment | Decide whether a sentence entails a given hypothesis | Accuracy |
| 7 | SST-2 | Stanford Sentiment Treebank | Decide whether a sentence's sentiment is positive or negative | Accuracy |
| 8 | STS-B | Semantic Textual Similarity Benchmark | Rate the similarity of two sentences on a 0-5 scale | Pearson & Spearman correlation |
| 9 | WNLI | Winograd Natural Language Inference | Resolve which word an ambiguous pronoun refers to (cast as sentence-pair entailment) | Accuracy |
An MNLI Example
Required libraries
- pytorch
- transformers
- datasets
- optuna
- ray[tune]

Install them with `pip install torch transformers datasets optuna "ray[tune]"`; optuna and ray[tune] are only needed for the hyperparameter search at the end.
Data loading
```python
from datasets import load_dataset, load_metric

actual_task = "mnli"
# Download the MNLI data and its matching evaluation metric from the GLUE benchmark
# (in recent datasets versions, load_metric has moved to the separate evaluate library)
dataset = load_dataset("glue", actual_task)
metric = load_metric("glue", actual_task)
```
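The returned object is a DatasetDict keyed by split; a quick inspection (a sketch, with the output summarized in comments for orientation) confirms the fields the preprocessing step below relies on:

```python
print(dataset)              # splits: train, validation_matched, validation_mismatched, ...
print(dataset["train"][0])  # {'premise': ..., 'hypothesis': ..., 'label': 0/1/2, 'idx': ...}
```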
Data preprocessing
```python
from transformers import AutoTokenizer

# Any checkpoint compatible with sequence classification works here;
# "distilbert-base-uncased" is used as an example.
model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

# Column names holding the input sentence(s) for each GLUE task
task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}
sentence1_key, sentence2_key = task_to_keys[actual_task]

# Tokenize a single sentence or a sentence pair, depending on the task
def preprocess_function(examples):
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)

# Apply the tokenizer to every split, in batches
encoded_dataset = dataset.map(preprocess_function, batched=True)
```
Note: if you change the preprocessing, it is best to invalidate the cached results by passing load_from_cache_file=False to map().
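As a quick sanity check you can tokenize one example by hand (a sketch; the exact keys depend on the tokenizer, e.g. BERT-style models also emit token_type_ids):

```python
enc = tokenizer(dataset["train"][0]["premise"], dataset["train"][0]["hypothesis"], truncation=True)
print(list(enc.keys()))                                       # e.g. ['input_ids', 'attention_mask']
print(tokenizer.convert_ids_to_tokens(enc["input_ids"])[:8])  # first few subword tokens
```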
Fine-tuning the pretrained model
```python
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# Define the model; MNLI is a three-way task (entailment / neutral / contradiction)
num_labels = 3
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
```
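MNLI is the only three-way task; if the same script should cover the other GLUE tasks, the label count can instead be derived from the task name (a sketch following the task table above; STS-B is a regression task with a single output):

```python
# 3 classes for MNLI/MNLI-MM, 1 output for the STS-B regression task, 2 classes otherwise
num_labels = 3 if actual_task.startswith("mnli") else 1 if actual_task == "stsb" else 2
```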
```python
# Training hyperparameters
batch_size = 16
args = TrainingArguments(
    "test-glue",                      # output directory
    evaluation_strategy="epoch",      # evaluate at the end of every epoch
    save_strategy="epoch",            # checkpoint at the same cadence, required by load_best_model_at_end
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",  # accuracy is the metric reported for MNLI
)
```
```python
import numpy as np

# Convert raw model outputs into the format the GLUE metric expects
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if actual_task != "stsb":
        predictions = np.argmax(predictions, axis=1)  # class index for classification tasks
    else:
        predictions = predictions[:, 0]               # raw score for the STS-B regression task
    return metric.compute(predictions=predictions, references=labels)
```
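To see what the metric returns before launching a full training run, you can call it on random predictions (a sketch; the printed value is illustrative):

```python
fake_preds = np.random.randint(0, 3, size=64)
fake_labels = np.random.randint(0, 3, size=64)
print(metric.compute(predictions=fake_preds, references=fake_labels))  # e.g. {'accuracy': 0.34}
```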
```python
# Define the trainer; MNLI uses "validation_matched" here
# ("validation_mismatched" for mnli-mm, plain "validation" for the other tasks)
validation_key = "validation_matched"
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
```
```python
# Train, then evaluate the best checkpoint on the matched validation set
trainer.train()
trainer.evaluate()
```
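For MNLI it is also worth checking the mismatched validation split, which the same trainer can score directly:

```python
trainer.evaluate(eval_dataset=encoded_dataset["validation_mismatched"])
```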
Hyperparameter search
```python
# Re-instantiate a fresh model for every trial
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

# Define the trainer; only 1/10 of the training data is used to keep the search fast
trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=encoded_dataset["train"].shard(index=1, num_shards=10),
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
```
```python
# Run the search (10 trials, maximizing the evaluation metric)
best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")

# Retrain with the best hyperparameters, then evaluate
for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)
trainer.train_dataset = encoded_dataset["train"]  # switch back from the 1/10 shard for the final run
trainer.train()
trainer.evaluate()
```
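By default the search backend (Optuna if installed, otherwise Ray Tune) samples a generic space covering learning rate, number of epochs, seed, and batch size. To restrict the search, you can pass your own space via hp_space; below is a sketch for the Optuna backend, with illustrative ranges:

```python
# Illustrative search space for the Optuna backend
def optuna_hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [16, 32]
        ),
    }

best_run = trainer.hyperparameter_search(
    hp_space=optuna_hp_space, n_trials=10, direction="maximize"
)
```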