{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# T2. Basic usage of databundle and tokenizer\n", "\n", " 1 Extensions of dataset in fastNLP\n", "\n", " 1.1 The concept and usage of databundle\n", "\n", " 2 Tokenizers in fastNLP\n", " \n", " 2.1 The concept of PreTrainedTokenizer\n", "\n", " 2.2 Basic usage of BertTokenizer\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Extensions of dataset in fastNLP\n", "\n", "### 1.1 The concept and usage of databundle\n", "\n", "In `fastNLP 0.8`, an intermediate module sits between the commonly used data loading module `DataLoader` and the dataset module `DataSet`: the **`DataBundle` module**, which can be imported from `fastNLP.io`.\n", "\n", "In `fastNLP 0.8`, **a `databundle` contains several `dataset` objects and `vocabulary` objects**, stored in its `datasets` and `vocabs` variables respectively.\n", "\n", "Before working with `databundle`, it therefore helps to **review `dataset` and `vocabulary`**. **Do you know roughly what the code below does?**\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Processing: 0%| | 0/6 [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "+------------------------------------------+----------+\n", "| text | label |\n", "+------------------------------------------+----------+\n", "| ['a', 'series', 'of', 'escapades', 'd... | negative |\n", "| ['this', 'quiet', ',', 'introspective... | positive |\n", "| ['even', 'fans', 'of', 'ismail', 'mer... | negative |\n", "| ['the', 'importance', 'of', 'being', ... | neutral |\n", "+------------------------------------------+----------+\n", "+------------------------------------------+----------+\n", "| text | label |\n", "+------------------------------------------+----------+\n", "| ['a', 'comedy-drama', 'of', 'nearly',... | positive |\n", "| ['a', 'positively', 'thrilling', 'com... | neutral |\n", "+------------------------------------------+----------+\n", "{'