{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# E2. Using BERT + prompt for SST2 classification\n",
    "\n",
    "  1   Background: a brief introduction to the `prompt-based model` and its combination with `fastNLP`\n",
    "\n",
    "  2   Preparation: an overview of `P-Tuning v2` and building the `P-Tuning v2` model\n",
    "\n",
    "  3   Training: loading the `tokenizer`, preprocessing the `dataset`, training and analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Background: a brief introduction to the prompt-based model and its combination with fastNLP\n",
    "\n",
    "  This example uses the `SST2` dataset from the `GLUE` benchmark and fine-tunes the `bert-base-uncased`\n",
    "\n",
    "    model through `prompt-based tuning` to perform binary sentiment classification; before doing so,\n",
    "\n",
    "    it briefly reviews research on prompt learning and the advantages of combining it with `fastNLP v0.8`\n",
    "\n",
    "**`prompt`**, the **prompt**, first appeared in **`PET`** (Pattern-Exploiting Training),\n",
    "\n",
    "  where the input is wrapped in a textual template so that the downstream task mimics the pretraining objective\n",
    "\n",
    "**`prompt-based tuning`**, **prompt-based fine-tuning**, describes adapting a pretrained model to a\n",
    "\n",
    "  downstream task by training prompt parameters, which yields a **`prompt-based model`**\n",
    "\n",
    "**`prompt-based model`**, a **model with prompts**, for example\n",
    "\n",
    "  Example 1: **`P-Tuning v1`**, which inserts trainable continuous prompt embeddings into the input sequence\n",
    "\n",
    "  Example 2: **`PromptTuning`**, which prepends trainable soft-prompt embeddings to the input while freezing the backbone\n",
    "\n",
    "  Example 3: **`PrefixTuning`**, which prepends trainable prefix vectors to the activations of every layer\n",
    "\n",
    "  Example 4: **`SoftPrompt`**, a general term for continuous learnable prompts, as opposed to discrete textual prompts\n",
    "\n",
    "  (a short sketch contrasting discrete and continuous prompts follows this cell)\n",
    "\n",
    "The advantage of implementing a `prompt-based model` with `fastNLP v0.8` is that the `Trainer` only\n",
    "\n",
    "  requires the model to expose `train_step` and `evaluate_step`, so a custom prompt module can wrap any `transformers` backbone with little glue code\n",
    "\n",
    "  This example again uses the `SST2` dataset of `tutorial-E1`, with `bert-base-uncased` as the base model;\n",
    "\n",
    "    in what follows, a continuous `prompt` is concatenated with the `model` input to solve the `SST2` binary classification task"
   ]
  },
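  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For intuition, a minimal sketch of the two flavors of prompt; the template and label words here are illustrative, not taken from any specific paper:\n",
    "\n",
    "```python\n",
    "# discrete prompt: rewrite the input as text containing a slot to fill\n",
    "text = 'a gripping, well-acted thriller.'\n",
    "discrete_input = text + ' It was [MASK].'  # a verbalizer maps e.g. great -> positive, terrible -> negative\n",
    "\n",
    "# continuous prompt: prepend trainable vectors instead of text\n",
    "# (this is what SeqClsModel below does with an nn.Embedding prefix encoder)\n",
    "```"
   ]
  },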
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
       "</pre>\n"
      ],
      "text/plain": [
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4.18.0\n"
     ]
    }
   ],
   "source": [
    "import torch\n",
    "import torch.nn as nn\n",
    "from torch.optim import AdamW\n",
    "from torch.utils.data import DataLoader, Dataset\n",
    "\n",
    "import transformers\n",
    "from transformers import AutoTokenizer\n",
    "from transformers import AutoModelForSequenceClassification\n",
    "\n",
    "import sys\n",
    "sys.path.append('..')  # make the local fastNLP package importable\n",
    "\n",
    "import fastNLP\n",
    "from fastNLP import Trainer\n",
    "from fastNLP.core.metrics import Accuracy\n",
    "\n",
    "print(transformers.__version__)\n",
    "\n",
    "task = 'sst2'\n",
    "model_checkpoint = 'bert-base-uncased'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Preparation: an overview of P-Tuning v2 and building the P-Tuning v2 model\n",
    "\n",
    "  This example uses `P-Tuning v2` as the case study for combining `prompt-based tuning` with `fastNLP v0.8`\n",
    "\n",
    "    The following first summarizes the idea of the `P-Tuning v2` paper and then turns to the `fastNLP v0.8` implementation\n",
    "\n",
    "`P-Tuning v2` comes from the paper [P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks](https://arxiv.org/pdf/2110.07602.pdf)\n",
    "\n",
    "  Its main contribution is to build on deep prompt tuning methods such as `PrefixTuning` and improve their performance on `NLU` tasks such as classification and sequence labeling,\n",
    "\n",
    "    matching full fine-tuning on medium-sized models, chiefly those with `100M`-`1B` parameters\n",
    "\n",
    "  Its architecture is shown in the figure below; note that, for simplicity, the implementation in this tutorial prepends prompts only at the input embedding layer rather than at every layer as in the full method\n",
    "\n",
    "<img src=\"./figures/E2-fig-p-tuning-v2.png\" width=\"60%\" height=\"60%\" align=\"center\"></img>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "class SeqClsModel(nn.Module):\n",
    "    def __init__(self, model_checkpoint, num_labels, pre_seq_len):\n",
    "        nn.Module.__init__(self)\n",
    "        self.num_labels = num_labels\n",
    "        self.back_bone = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, \n",
    "                                                                            num_labels=num_labels)\n",
    "        self.embeddings = self.back_bone.get_input_embeddings()\n",
    "\n",
    "        # freeze the whole backbone (classification head included);\n",
    "        # only the prefix embeddings below receive gradients\n",
    "        for param in self.back_bone.parameters():\n",
    "            param.requires_grad = False\n",
    "\n",
    "        self.pre_seq_len = pre_seq_len\n",
    "        self.prefix_tokens = torch.arange(self.pre_seq_len).long()\n",
    "        self.prefix_encoder = nn.Embedding(self.pre_seq_len, self.embeddings.embedding_dim)\n",
    "\n",
    "    def get_prompt(self, batch_size):\n",
    "        prefix_tokens = self.prefix_tokens.unsqueeze(0).expand(batch_size, -1).to(self.back_bone.device)\n",
    "        prompts = self.prefix_encoder(prefix_tokens)\n",
    "        return prompts\n",
    "\n",
    "    def forward(self, input_ids, attention_mask, labels=None):\n",
    "        batch_size = input_ids.shape[0]\n",
    "        raw_embedding = self.embeddings(input_ids)\n",
    "\n",
    "        # prepend the trainable prompt embeddings along the sequence dimension\n",
    "        # and extend the attention mask to cover the prompt positions\n",
    "        prompts = self.get_prompt(batch_size=batch_size)\n",
    "        inputs_embeds = torch.cat((prompts, raw_embedding), dim=1)\n",
    "        prefix_attention_mask = torch.ones(batch_size, self.pre_seq_len).to(self.back_bone.device)\n",
    "        attention_mask = torch.cat((prefix_attention_mask, attention_mask), dim=1)\n",
    "\n",
    "        outputs = self.back_bone(inputs_embeds=inputs_embeds, \n",
    "                                 attention_mask=attention_mask, labels=labels)\n",
    "        return outputs\n",
    "\n",
    "    def train_step(self, input_ids, attention_mask, labels):\n",
    "        loss = self(input_ids, attention_mask, labels).loss\n",
    "        return {'loss': loss}\n",
    "\n",
    "    def evaluate_step(self, input_ids, attention_mask, labels):\n",
    "        pred = self(input_ids, attention_mask, labels).logits\n",
    "        pred = torch.max(pred, dim=-1)[1]\n",
    "        return {'pred': pred, 'target': labels}"
   ]
  },
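  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal shape check of the prompt-concatenation step in `forward`, using toy stand-ins (the sizes `B, L, H, P` are arbitrary, not values from the model):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# toy stand-ins for get_prompt(batch_size) and self.embeddings(input_ids)\n",
    "B, L, H, P = 2, 5, 8, 3\n",
    "toy_prompts = torch.randn(B, P, H)\n",
    "toy_embeds = torch.randn(B, L, H)\n",
    "toy_inputs = torch.cat((toy_prompts, toy_embeds), dim=1)\n",
    "toy_mask = torch.cat((torch.ones(B, P), torch.ones(B, L)), dim=1)\n",
    "print(toy_inputs.shape, toy_mask.shape)  # torch.Size([2, 8, 8]) torch.Size([2, 8])"
   ]
  },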
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, instantiate the model with the number of classes fixed, and initialize the optimizer with the `torch.optim.AdamW` module\n",
    "\n",
    "  Following the `P-Tuning v2` paper: *Generally, simple classification tasks prefer shorter prompts (less than 20)*\n",
    "\n",
    "  the `pre_seq_len` parameter is set to `20` here, the learning rate is adjusted accordingly, and everything else matches `tutorial-E1`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias']\n",
      "- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
      "- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
      "Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']\n",
      "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
     ]
    }
   ],
   "source": [
    "model = SeqClsModel(model_checkpoint=model_checkpoint, num_labels=2, pre_seq_len=20)\n",
    "\n",
    "optimizers = AdamW(params=model.parameters(), lr=1e-2)"
   ]
  },
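  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check (not part of the original pipeline), one can confirm that only the prefix embeddings are trainable; with `pre_seq_len=20` and BERT-base's hidden size of `768`, that is `20 * 768 = 15360` parameters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# everything in the frozen backbone is excluded from the trainable count\n",
    "n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
    "n_total = sum(p.numel() for p in model.parameters())\n",
    "print(f'trainable: {n_trainable} / total: {n_total}')  # expect 15360 trainable"
   ]
  },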
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Training: loading the tokenizer, preprocessing the dataset, training and analysis\n",
    "\n",
    "  This example reuses the dataset of `tutorial-E1`, i.e. the `SST2` dataset from the `GLUE` benchmark\n",
    "\n",
    "    with the `bert-base-uncased` model as the baseline, fine-tuned via `P-Tuning v2`\n",
    "\n",
    "    The data-loading code is shown below and is essentially the same as in `tutorial-E1`\n",
    "\n",
    "First, load the dataset with `datasets.load_dataset` and build a `tokenizer` instance with\n",
    "\n",
    "  `transformers.AutoTokenizer`; then use `dataset.map` to replace the text with sequences of token ids"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Reusing dataset glue (/remote-home/xrliu/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "b72eeebd34354a88a99b2e07ec9a86df",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/3 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from datasets import load_dataset, load_metric\n",
    "\n",
    "dataset = load_dataset('glue', task)\n",
    "\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)"
   ]
  },
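  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick look at the loaded `DatasetDict` (for GLUE `SST2` this should show `train`, `validation` and `test` splits with `sentence` and `label` columns):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# print the split names, column names and number of rows of each split\n",
    "print(dataset)"
   ]
  },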
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Loading cached processed dataset at /remote-home/xrliu/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-18ec0e709f05e61e.arrow\n",
      "Loading cached processed dataset at /remote-home/xrliu/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-e2f02ee7442ad73e.arrow\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "d15505d825b34f649b719f1ff0d56114",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/2 [00:00<?, ?ba/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "def preprocess_function(examples):\n",
    "    return tokenizer(examples['sentence'], truncation=True)\n",
    "\n",
    "encoded_dataset = dataset.map(preprocess_function, batched=True)"
   ]
  },
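  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what the mapping added, inspect a single training example; besides the original `sentence`, `label` and `idx` fields, the `tokenizer` should have added `input_ids`, `token_type_ids` and `attention_mask`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sample = encoded_dataset['train'][0]\n",
    "print(sample.keys())\n",
    "print(sample['input_ids'][:10])  # the first few token ids of the first sentence"
   ]
  },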
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, define the `SeqClsDataset` class and the collate function `collate_fn`, both reused from `tutorial-E1`\n",
    "\n",
    "  As before, note and stress that **the return value of `__getitem__` must correspond to the fields of the original dataset**\n",
    "\n",
    "  **and that the return value of `collate_fn` must match the parameters of the `train_step` and `evaluate_step` functions**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "class SeqClsDataset(Dataset):\n",
    "    def __init__(self, dataset):\n",
    "        Dataset.__init__(self)\n",
    "        self.dataset = dataset\n",
    "\n",
    "    def __len__(self):\n",
    "        return len(self.dataset)\n",
    "\n",
    "    def __getitem__(self, item):\n",
    "        item = self.dataset[item]\n",
    "        return item['input_ids'], item['attention_mask'], [item['label']]\n",
    "\n",
    "def collate_fn(batch):\n",
    "    # collect the three fields and track the longest sequence of each\n",
    "    input_ids, atten_mask, labels = [], [], []\n",
    "    max_length = [0] * 3\n",
    "    for each_item in batch:\n",
    "        input_ids.append(each_item[0])\n",
    "        max_length[0] = max(max_length[0], len(each_item[0]))\n",
    "        atten_mask.append(each_item[1])\n",
    "        max_length[1] = max(max_length[1], len(each_item[1]))\n",
    "        labels.append(each_item[2])\n",
    "        max_length[2] = max(max_length[2], len(each_item[2]))\n",
    "\n",
    "    # right-pad every sequence with zeros to the batch maximum\n",
    "    for i in range(3):\n",
    "        each = (input_ids, atten_mask, labels)[i]\n",
    "        for item in each:\n",
    "            item.extend([0] * (max_length[i] - len(item)))\n",
    "    return {'input_ids': torch.cat([torch.tensor([item]) for item in input_ids], dim=0),\n",
    "            'attention_mask': torch.cat([torch.tensor([item]) for item in atten_mask], dim=0),\n",
    "            'labels': torch.cat([torch.tensor(item) for item in labels], dim=0)}"
   ]
  },
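  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick check of the collate function on a two-example batch (a sketch; the indices `0` and `1` are arbitrary): the two `[batch, seq_len]` tensors should share the padded length, and `labels` should be a flat `[batch]` tensor:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tmp_dataset = SeqClsDataset(encoded_dataset['train'])\n",
    "tmp_batch = collate_fn([tmp_dataset[0], tmp_dataset[1]])\n",
    "print({name: tensor.shape for name, tensor in tmp_batch.items()})"
   ]
  },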
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After that, wrap the tokenized training and validation splits and divide them into batches"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_train = SeqClsDataset(encoded_dataset['train'])\n",
    "dataloader_train = DataLoader(dataset=dataset_train, \n",
    "                              batch_size=32, shuffle=True, collate_fn=collate_fn)\n",
    "dataset_valid = SeqClsDataset(encoded_dataset['validation'])\n",
    "dataloader_valid = DataLoader(dataset=dataset_valid, \n",
    "                              batch_size=32, shuffle=False, collate_fn=collate_fn)"
   ]
  },
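  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One more optional check: the number of batches per epoch; for GLUE `SST2` the train split has `67349` examples, so with `batch_size=32` there should be about `2105` train batches:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(len(dataset_train), len(dataloader_train))\n",
    "print(len(dataset_valid), len(dataloader_valid))"
   ]
  },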
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, initialize the `fastNLP` `Trainer` with the model, optimizer, dataloaders and an `Accuracy` metric, in the same way as in `tutorial-E1`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
      "To disable this warning, you can either:\n",
      "\t- Avoid using `tokenizers` before the fork if possible\n",
      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
     ]
    }
   ],
   "source": [
    "trainer = Trainer(\n",
    "    model=model,\n",
    "    driver='torch',\n",
    "    device=[0, 1],\n",
    "    n_epochs=20,\n",
    "    optimizers=optimizers,\n",
    "    train_dataloader=dataloader_train,\n",
    "    evaluate_dataloaders=dataloader_valid,\n",
    "    metrics={'acc': Accuracy()}\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then start training with `trainer.run`; `num_eval_batch_per_dl=10` limits each in-training evaluation to `10` batches per dataloader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "trainer.run(num_eval_batch_per_dl=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After training, run the evaluator once more to measure accuracy on the full validation set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "trainer.evaluator.run()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.13"
  },
  "pycharm": {
   "stem_cell": {
    "cell_type": "raw",
    "metadata": {
     "collapsed": false
    },
    "source": []
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}