{ "cells": [ { "cell_type": "markdown", "id": "213d538c", "metadata": {}, "source": [ "# T3. dataloader 的内部结构和基本使用\n", "\n", " 1 fastNLP 中的 dataloader\n", " \n", " 1.1 dataloader 的基本介绍\n", "\n", " 1.2 dataloader 的函数创建\n", "\n", " 2 fastNLP 中 dataloader 的延伸\n", "\n", " 2.1 collator 的概念与使用\n", "\n", " 2.2 结合 datasets 框架" ] }, { "cell_type": "markdown", "id": "85857115", "metadata": {}, "source": [ "## 1. fastNLP 中的 dataloader\n", "\n", "### 1.1 dataloader 的基本介绍\n", "\n", "在`fastNLP 1.0`的开发中,最关键的开发目标就是**实现 fastNLP 对当前主流机器学习框架**,例如\n", "\n", " **当下流行的 pytorch**,以及**国产的 paddle 、jittor 和 oneflow 的兼容**,扩大受众的同时,也是助力国产\n", "\n", "本着分而治之的思想,我们可以将`fastNLP 1.0`对`pytorch`、`paddle`、`jittor`、`oneflow`框架的兼容,划分为\n", "\n", " **对数据预处理**、**批量 batch 的划分与补齐**、**模型训练**、**模型评测**,**四个部分的兼容**\n", "\n", " 针对数据预处理,我们已经在`tutorial-1`中介绍了`dataset`和`vocabulary`的使用\n", "\n", " 而结合`tutorial-0`,我们可以发现**数据预处理环节本质上是框架无关的**\n", "\n", " 因为在不同框架下,读取的原始数据格式都差异不大,彼此也很容易转换\n", "\n", "只有涉及到张量、模型,不同框架才展现出其各自的特色:**pytorch 和 oneflow 中的 tensor 和 nn.Module**\n", "\n", " **在 paddle 中称为 tensor 和 nn.Layer**,**在 jittor 中则称为 Var 和 Module**\n", "\n", " 因此,**模型训练、模型评测**,**是兼容的重难点**,我们将会在`tutorial-5`中详细介绍\n", "\n", " 针对批量`batch`的处理,作为`fastNLP 1.0`中框架无关部分想框架相关部分的过渡\n", "\n", " 就是`dataloader`模块的职责,这也是本篇教程`tutorial-3`讲解的重点\n", "\n", "**dataloader 模块的职责**,详细划分可以包含以下三部分,**采样划分、补零对齐、框架匹配**\n", "\n", " 第一,确定`batch`大小,确定采样方式,划分后通过迭代器即可得到`batch`序列\n", "\n", " 第二,对于序列处理,这也是`fastNLP`主要针对的,将同个`batch`内的数据对齐\n", "\n", " 第三,**batch 内数据格式要匹配框架**,**但 batch 结构需保持一致**,**参数匹配机制**\n", "\n", " 对此,`fastNLP 1.0`给出了 **TorchDataLoader 、 PaddleDataLoader 、 JittorDataLoader 和 OneflowDataLoader**\n", "\n", " 分别针对并匹配不同框架,但彼此之间参数名、属性、方法仍然类似,前两者大致如下表所示\n", "\n", "名称|参数|属性|功能|内容\n", "----|----|----|----|----|\n", " `dataset` | √ | √ | 指定`dataloader`的数据内容 | |\n", " `batch_size` | √ | √ | 指定`dataloader`的`batch`大小 | 默认`16` |\n", " `shuffle` | √ | √ | 指定`dataloader`的数据是否打乱 | 默认`False` |\n", " `collate_fn` | √ | √ | 指定`dataloader`的`batch`打包方法 | 视框架而定 |\n", " `sampler` | √ | √ | 指定`dataloader`的`__len__`和`__iter__`函数的实现 | 默认`None` |\n", " `batch_sampler` | √ | √ | 指定`dataloader`的`__len__`和`__iter__`函数的实现 | 默认`None` |\n", " `drop_last` | √ | √ | 指定`dataloader`划分`batch`时是否丢弃剩余的 | 默认`False` |\n", " `cur_batch_indices` | | √ | 记录`dataloader`当前遍历批量序号 | |\n", " `num_workers` | √ | √ | 指定`dataloader`开启子进程数量 | 默认`0` |\n", " `worker_init_fn` | √ | √ | 指定`dataloader`子进程初始方法 | 默认`None` |\n", " `generator` | √ | √ | 指定`dataloader`子进程随机种子 | 默认`None` |\n", " `prefetch_factor` | | √ | 指定为每个`worker`装载的`sampler`数量 | 默认`2` |" ] }, { "cell_type": "markdown", "id": "60a8a224", "metadata": {}, "source": [ " 论及`dataloader`的函数,其中,`get_batch_indices`用来获取当前遍历到的`batch`序号,其他函数\n", "\n", " 包括`set_ignore`、`set_pad`和`databundle`类似,请参考`tutorial-2`,此处不做更多介绍\n", "\n", " 以下是`tutorial-2`中已经介绍过的数据预处理流程,接下来是对相关数据进行`dataloader`处理" ] }, { "cell_type": "code", "execution_count": 1, "id": "aca72b49", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[38;5;2m[i 0604 15:44:29.773860 92 log.cc:351] Load log_sync: 1\u001b[m\n" ] }, { "data": { "text/html": [ "
\n", "\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Processing: 0%| | 0/4 [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Processing: 0%| | 0/2 [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Processing: 0%| | 0/2 [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "+------------+----------------+-----------+----------------+--------------------+--------------------+--------+\n", "| SentenceId | Sentence | Sentiment | input_ids | token_type_ids | attention_mask | target |\n", "+------------+----------------+-----------+----------------+--------------------+--------------------+--------+\n", "| 1 | A series of... | negative | [101, 1037,... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... | 1 |\n", "| 4 | A positivel... | neutral | [101, 1037,... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... | 2 |\n", "| 3 | Even fans o... | negative | [101, 2130,... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... | 1 |\n", "| 5 | A comedy-dr... | positive | [101, 1037,... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... | 0 |\n", "+------------+----------------+-----------+----------------+--------------------+--------------------+--------+\n" ] } ], "source": [ "import sys\n", "sys.path.append('..')\n", "\n", "import pandas as pd\n", "from functools import partial\n", "from fastNLP.transformers.torch import BertTokenizer\n", "\n", "from fastNLP import DataSet\n", "from fastNLP import Vocabulary\n", "from fastNLP.io import DataBundle\n", "\n", "\n", "class PipeDemo:\n", " def __init__(self, tokenizer='bert-base-uncased'):\n", " self.tokenizer = BertTokenizer.from_pretrained(tokenizer)\n", "\n", " def process_from_file(self, path='./data/test4dataset.tsv'):\n", " datasets = DataSet.from_pandas(pd.read_csv(path, sep='\\t'))\n", " train_ds, test_ds = datasets.split(ratio=0.7)\n", " train_ds, dev_ds = datasets.split(ratio=0.8)\n", " data_bundle = DataBundle(datasets={'train': train_ds, 'dev': dev_ds, 'test': test_ds})\n", "\n", " encode = partial(self.tokenizer.encode_plus, max_length=100, truncation=True,\n", " return_attention_mask=True)\n", " data_bundle.apply_field_more(encode, field_name='Sentence', progress_bar='tqdm')\n", " \n", " target_vocab = Vocabulary(padding=None, unknown=None)\n", "\n", " target_vocab.from_dataset(*[ds for _, ds in data_bundle.iter_datasets()], field_name='Sentiment')\n", " target_vocab.index_dataset(*[ds for _, ds in data_bundle.iter_datasets()], field_name='Sentiment',\n", " new_field_name='target')\n", "\n", " data_bundle.set_pad('input_ids', pad_val=self.tokenizer.pad_token_id)\n", " data_bundle.set_ignore('SentenceId', 'Sentence', 'Sentiment') \n", " return data_bundle\n", "\n", " \n", "pipe = PipeDemo(tokenizer='bert-base-uncased')\n", "\n", "data_bundle = pipe.process_from_file('./data/test4dataset.tsv')\n", "\n", "print(data_bundle.get_dataset('train'))" ] }, { "cell_type": "markdown", "id": "76e6b8ab", "metadata": {}, "source": [ "### 1.2 dataloader 的函数创建\n", "\n", "在`fastNLP 1.0`中,**更方便、可能更常用的 dataloader 创建方法是通过 prepare_xx_dataloader 函数**\n", "\n", " 
{ "cell_type": "markdown", "id": "76e6b8ab", "metadata": {}, "source": [ "### 1.2 creating a dataloader through functions\n", "\n", "In `fastNLP 1.0`, **a more convenient, and probably more common, way to create a dataloader is through the prepare_xx_dataloader functions**\n", "\n", "  for example, the `prepare_torch_dataloader` function below: given the necessary arguments, it reads the dataset and generates the corresponding `dataloader`\n", "\n", "  of type `TorchDataLoader`, which only works with the `pytorch` framework and thus corresponds to `driver='torch'` when the `trainer` is initialized\n", "\n", "We can also see that in `fastNLP 1.0` **a batch is represented as a dict**, **whose keys are the fields of the original dataset**\n", "\n", "  **except those hidden via the DataBundle.set_ignore function**, while the `value`s are of the `torch.Tensor` type matching the `pytorch` framework" ] }, { "cell_type": "code", "execution_count": 2, "id": "5fd60e42", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "