{ "cells": [ { "cell_type": "markdown", "id": "cdc25fcd", "metadata": {}, "source": [ "# T1. Basic usage of dataset and vocabulary\n", "\n", " 1 Structure and usage of dataset\n", " \n", " 1.1 Structure and creation of a dataset\n", "\n", " 1.2 Data preprocessing with a dataset\n", "\n", " 1.3 Digression: instance and field\n", "\n", " 2 Structure and usage of vocabulary\n", "\n", " 2.1 Creating and modifying a vocabulary\n", "\n", " 2.2 vocabulary and the OOV problem\n", "\n", " 3 Using dataset and vocabulary together\n", " \n", " 3.1 Loading a dataset from a dataframe\n", "\n", " 3.2 Building a vocabulary from a dataset" ] },
{ "cell_type": "markdown", "id": "0eb18a22", "metadata": {}, "source": [ "## 1. Structure and usage of dataset\n", "\n", "### 1.1 Structure and creation of a dataset\n", "\n", "In `fastNLP 0.8`, a dataset is represented by the `DataSet` module; **a `dataset` is analogous to a table in a relational database** (written in lowercase as `dataset` below)\n", "\n", " **It consists of two elements, `field` columns and `instance` rows**, corresponding to a `table`'s `field`s and `record`s\n", "\n", "In `fastNLP 0.8`, the `DataSet` module is defined under the `fastNLP.core.dataset` path; after importing it, the simplest\n", "\n", " way to initialize a `dataset` is to pass a table in dict form, **`{'field1': column1, 'field2': column2, ...}`**, to the constructor" ] },
{ "cell_type": "code", "execution_count": 1, "id": "a1d69ad2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" },
{ "name": "stdout", "output_type": "stream", "text": [ "+-----+------------------------+------------------------+-----+\n", "| idx | sentence               | words                  | num |\n", "+-----+------------------------+------------------------+-----+\n", "| 0   | This is an apple .     | ['This', 'is', 'an'... | 5   |\n", "| 1   | I like apples .        | ['I', 'like', 'appl... | 4   |\n", "| 2   | Apples are good for... | ['Apples', 'are', '... | 7   |\n", "+-----+------------------------+------------------------+-----+\n" ] } ], "source": [ "from fastNLP.core.dataset import DataSet\n", "\n", "data = {'idx': [0, 1, 2], \n", "        'sentence':[\"This is an apple .\", \"I like apples .\", \"Apples are good for our health .\"],\n", "        'words': [['This', 'is', 'an', 'apple', '.'], \n", "                  ['I', 'like', 'apples', '.'], \n", "                  ['Apples', 'are', 'good', 'for', 'our', 'health', '.']],\n", "        'num': [5, 4, 7]}\n", "\n", "dataset = DataSet(data)\n", "print(dataset)" ] },
{ "cell_type": "markdown", "id": "9260fdc6", "metadata": {}, "source": [ " In a `dataset`, both the names of `field`s and the strings inside each `instance` can also be in Chinese" ] },
{ "cell_type": "code", "execution_count": 2, "id": "3d72ef00", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+------+--------------------+------------------------+------+\n", "| 序号 | 句子               | 字符                   | 长度 |\n", "+------+--------------------+------------------------+------+\n", "| 0    | 生活就像海洋,     | ['生', '活', '就', ... | 7    |\n", "| 1    | 只有意志坚强的人, | ['只', '有', '意', ... | 9    |\n", "| 2    | 才能到达彼岸。     | ['才', '能', '到', ... | 7    |\n", "+------+--------------------+------------------------+------+\n" ] } ], "source": [ "temp = {'序号': [0, 1, 2], \n", "        '句子':[\"生活就像海洋,\", \"只有意志坚强的人,\", \"才能到达彼岸。\"],\n", "        '字符': [['生', '活', '就', '像', '海', '洋', ','], \n", "                 ['只', '有', '意', '志', '坚', '强', '的', '人', ','], \n", "                 ['才', '能', '到', '达', '彼', '岸', '。']],\n", "        '长度': [7, 9, 7]}\n", "\n", "chinese = DataSet(temp)\n", "print(chinese)" ] },
{ "cell_type": "markdown", "id": "202e5490", "metadata": {}, "source": [ "In a `dataset`, the `drop` method removes every instance that satisfies a given condition, written here as a Python `lambda` expression\n", "\n", " Note 1: with `inplace=False`, `drop` leaves the original `dataset` unchanged and returns a new `dataset` from which the matching instances have been removed" ] },
{ "cell_type": "code", "execution_count": 3, "id": "09b478f8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2438703969992 2438374526920\n", "+-----+------------------------+------------------------+-----+\n", "| idx | sentence               | words                  | num |\n", "+-----+------------------------+------------------------+-----+\n", "| 0   | This is an apple .     | ['This', 'is', 'an'... | 5   |\n", "| 2   | Apples are good for... | ['Apples', 'are', '... | 7   |\n", "+-----+------------------------+------------------------+-----+\n", "+-----+------------------------+------------------------+-----+\n", "| idx | sentence               | words                  | num |\n", "+-----+------------------------+------------------------+-----+\n", "| 0   | This is an apple .     | ['This', 'is', 'an'... | 5   |\n", "| 1   | I like apples .        | ['I', 'like', 'appl... | 4   |\n", "| 2   | Apples are good for... | ['Apples', 'are', '... | 7   |\n", "+-----+------------------------+------------------------+-----+\n" ] } ], "source": [ "dropped = dataset\n", "dropped = dropped.drop(lambda ins:ins['num'] < 5, inplace=False)\n", "print(id(dropped), id(dataset))\n", "print(dropped)\n", "print(dataset)" ] },
{ "cell_type": "markdown", "id": "aa277674", "metadata": {}, "source": [ " Note 2: in `fastNLP 0.8`, **assigning a `dataset` with `=` copies the reference**, **not the data**\n", "\n", " As shown below, **`dropped` and `dataset` share the same `id`**, so **dropping instances from `dropped` modifies `dataset` as well**" ] },
{ "cell_type": "code", "execution_count": 4, "id": "77c8583a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2438374526920 2438374526920\n", "+-----+------------------------+------------------------+-----+\n", "| idx | sentence               | words                  | num |\n", "+-----+------------------------+------------------------+-----+\n", "| 0   | This is an apple .     | ['This', 'is', 'an'... | 5   |\n", "| 2   | Apples are good for... | ['Apples', 'are', '... | 7   |\n", "+-----+------------------------+------------------------+-----+\n", "+-----+------------------------+------------------------+-----+\n", "| idx | sentence               | words                  | num |\n", "+-----+------------------------+------------------------+-----+\n", "| 0   | This is an apple .     | ['This', 'is', 'an'... | 5   |\n", "| 2   | Apples are good for... | ['Apples', 'are', '... | 7   |\n", "+-----+------------------------+------------------------+-----+\n" ] } ], "source": [ "dropped = dataset\n", "dropped.drop(lambda ins:ins['num'] < 5)\n", "print(id(dropped), id(dataset))\n", "print(dropped)\n", "print(dataset)" ] },
{ "cell_type": "markdown", "id": "a76199dc", "metadata": {}, "source": [ "In a `dataset`, the `delete_instance` method removes the `instance` at the given index; indices start from 0" ] },
{ "cell_type": "code", "execution_count": 5, "id": "d8824b40", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-----+--------------------+------------------------+-----+\n", "| idx | sentence           | words                  | num |\n", "+-----+--------------------+------------------------+-----+\n", "| 0   | This is an apple . | ['This', 'is', 'an'... | 5   |\n", "| 1   | I like apples .    | ['I', 'like', 'appl... | 4   |\n", "+-----+--------------------+------------------------+-----+\n" ] } ], "source": [ "dataset = DataSet(data)\n", "dataset.delete_instance(2)\n", "print(dataset)" ] },
{ "cell_type": "markdown", "id": "f4fa9f33", "metadata": {}, "source": [ "In a `dataset`, the `delete_field` method removes the `field` with the given name" ] },
{ "cell_type": "code", "execution_count": 6, "id": "f68ddb40", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-----+--------------------+------------------------------+\n", "| idx | sentence           | words                        |\n", "+-----+--------------------+------------------------------+\n", "| 0   | This is an apple . | ['This', 'is', 'an', 'app... |\n", "| 1   | I like apples .    | ['I', 'like', 'apples', '... 
|\n", "+-----+--------------------+------------------------------+\n" ] } ], "source": [ "dataset.delete_field('num')\n", "print(dataset)" ] },
{ "cell_type": "markdown", "id": "b1e9d42c", "metadata": {}, "source": [ "### 1.2 Data preprocessing with a dataset\n", "\n", "In the `dataset` module, the functions `apply`, `apply_field`, `apply_more` and `apply_field_more` support simple data preprocessing\n", "\n", " **`apply` and `apply_more` operate on whole instances**, **while `apply_field` and `apply_field_more` operate only on selected fields of each instance**\n", "\n", " **`apply` and `apply_field` produce a single new field**, **while `apply_more` and `apply_field_more` can produce several fields**\n", "\n", " **`apply` and `apply_field` return a list**, **while `apply_more` and `apply_field_more` return a dict**\n", "\n", "***\n", "\n", "`apply` takes a function `func` and a new field name `new_field_name`; `func` is applied to each `instance`\n", "\n", " of the `dataset`, and its results are stored in the newly created field named `new_field_name`" ] },
{ "cell_type": "code", "execution_count": 7, "id": "72a0b5f9", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Output()" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n" ], "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "+-----+------------------------------+------------------------------+\n", "| idx | sentence                     | words                        |\n", "+-----+------------------------------+------------------------------+\n", "| 0   | This is an apple .           | ['This', 'is', 'an', 'app... |\n", "| 1   | I like apples .              | ['I', 'like', 'apples', '... |\n", "| 2   | Apples are good for our h... | ['Apples', 'are', 'good',... |\n", "+-----+------------------------------+------------------------------+\n" ] } ], "source": [ "data = {'idx': [0, 1, 2], \n", "        'sentence':[\"This is an apple .\", \"I like apples .\", \"Apples are good for our health .\"], }\n", "dataset = DataSet(data)\n", "dataset.apply(lambda ins: ins['sentence'].split(), new_field_name='words')\n", "print(dataset)" ] },
{ "cell_type": "markdown", "id": "c10275ee", "metadata": {}, "source": [ " **The function passed to `apply` can be an anonymous function based on a `lambda` expression**, **or a user-defined function**" ] },
{ "cell_type": "code", "execution_count": 8, "id": "b1a8631f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n" ], "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "+-----+------------------------------+------------------------------+\n", "| idx | sentence                     | words                        |\n", "+-----+------------------------------+------------------------------+\n", "| 0   | This is an apple .           | ['This', 'is', 'an', 'app... |\n", "| 1   | I like apples .              | ['I', 'like', 'apples', '... |\n", "| 2   | Apples are good for our h... | ['Apples', 'are', 'good',... |\n", "+-----+------------------------------+------------------------------+\n" ] } ], "source": [ "dataset = DataSet(data)\n", "\n", "def get_words(instance):\n", "    sentence = instance['sentence']\n", "    words = sentence.split()\n", "    return words\n", "\n", "dataset.apply(get_words, new_field_name='words')\n", "print(dataset)" ] },
{ "cell_type": "markdown", "id": "64abf745", "metadata": {}, "source": [ "Besides the function `func`, `apply_field` also takes the parameters `field_name` and `new_field_name`; here `func` is applied\n", "\n", " only to the content of the field `field_name` in each instance, and the results are stored in the new field `new_field_name`" ] },
{ "cell_type": "code", "execution_count": 9, "id": "057c1d2c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n" ], "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "+-----+------------------------------+------------------------------+\n", "| idx | sentence                     | words                        |\n", "+-----+------------------------------+------------------------------+\n", "| 0   | This is an apple .           | ['This', 'is', 'an', 'app... |\n", "| 1   | I like apples .              | ['I', 'like', 'apples', '... |\n", "| 2   | Apples are good for our h... | ['Apples', 'are', 'good',... |\n", "+-----+------------------------------+------------------------------+\n" ] } ], "source": [ "dataset = DataSet(data)\n", "dataset.apply_field(lambda sent:sent.split(), field_name='sentence', new_field_name='words')\n", "print(dataset)" ] },
{ "cell_type": "markdown", "id": "5a9cc8b2", "metadata": {}, "source": [ "`apply_more` takes only the function `func`, which is applied to each `instance` of the `dataset`\n", "\n", " `func` must return a dict; its `key-value` pairs determine the names and contents of the fields stored in the `dataset`" ] },
{ "cell_type": "code", "execution_count": 10, "id": "51e2f02c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n" ], "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "+-----+------------------------+------------------------+-----+\n", "| idx | sentence               | words                  | num |\n", "+-----+------------------------+------------------------+-----+\n", "| 0   | This is an apple .     | ['This', 'is', 'an'... | 5   |\n", "| 1   | I like apples .        | ['I', 'like', 'appl... | 4   |\n", "| 2   | Apples are good for... | ['Apples', 'are', '... | 7   |\n", "+-----+------------------------+------------------------+-----+\n" ] } ], "source": [ "dataset = DataSet(data)\n", "dataset.apply_more(lambda ins:{'words': ins['sentence'].split(), 'num': len(ins['sentence'].split())})\n", "print(dataset)" ] },
{ "cell_type": "markdown", "id": "02d2b7ef", "metadata": {}, "source": [ "`apply_field_more` combines the two: its function `func` is applied only to the content of the field given by `field_name`\n", "\n", " and must likewise return a dict, whose `key-value` pairs determine the names and contents of the new fields" ] },
{ "cell_type": "code", "execution_count": 11, "id": "db4295d5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n" ], "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "+-----+------------------------+------------------------+-----+\n", "| idx | sentence               | words                  | num |\n", "+-----+------------------------+------------------------+-----+\n", "| 0   | This is an apple .     | ['This', 'is', 'an'... | 5   |\n", "| 1   | I like apples .        | ['I', 'like', 'appl... | 4   |\n", "| 2   | Apples are good for... | ['Apples', 'are', '... | 7   |\n", "+-----+------------------------+------------------------+-----+\n" ] } ], "source": [ "dataset = DataSet(data)\n", "dataset.apply_field_more(lambda sent:{'words': sent.split(), 'num': len(sent.split())}, \n", "                         field_name='sentence')\n", "print(dataset)" ] },
{ "cell_type": "markdown", "id": "9c09e592", "metadata": {}, "source": [ "### 1.3 Digression: instance and field\n", "\n", "In `fastNLP 0.8`, each record of a `dataset` is represented by the `Instance` module and is called an instance\n", "\n", " It is constructed much like a dict; a `dataset` can also be initialized from a list of `Instance`s with identical keys, as follows" ] },
{ "cell_type": "code", "execution_count": 12, "id": "012f537c", "metadata": {}, "outputs": [], "source": [ "from fastNLP.core.dataset import DataSet\n", "from fastNLP.core.dataset import Instance\n", "\n", "dataset = DataSet([\n", "  Instance(sentence=\"This is an apple .\",\n", "           words=['This', 'is', 'an', 'apple', '.'],\n", "           num=5),\n", "  Instance(sentence=\"I like apples .\",\n", "           words=['I', 'like', 'apples', '.'],\n", "           num=4),\n", "  Instance(sentence=\"Apples are good for our health .\",\n", "           words=['Apples', 'are', 'good', 'for', 'our', 'health', '.'],\n", "           num=7),\n", "  ])" ] },
{ "cell_type": "markdown", "id": "2fafb1ef", "metadata": {}, "source": [ " The `items`, `keys` and `values` methods return an instance's `item` list, `key` list and `value` list, respectively" ] },
{ "cell_type": "code", "execution_count": 13, "id": "a4c1c10d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dict_items([('sentence', 'This is an apple .'), ('words', ['This', 'is', 'an', 'apple', 
'.']), ('num', 5)])\n", "dict_keys(['sentence', 'words', 'num'])\n", "dict_values(['This is an apple .', ['This', 'is', 'an', 'apple', '.'], 5])\n" ] } ], "source": [ "ins = Instance(sentence=\"This is an apple .\", words=['This', 'is', 'an', 'apple', '.'], num=5)\n", "\n", "print(ins.items())\n", "print(ins.keys())\n", "print(ins.values())" ] },
{ "cell_type": "markdown", "id": "b5459a2d", "metadata": {}, "source": [ " The `add_field` method adds a field to an `Instance`: the parameter `field_name` gives the field's name and `field` its value" ] },
{ "cell_type": "code", "execution_count": 14, "id": "55376402", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------------------+------------------------+-----+-----+\n", "| sentence           | words                  | num | idx |\n", "+--------------------+------------------------+-----+-----+\n", "| This is an apple . | ['This', 'is', 'an'... | 5   | 0   |\n", "+--------------------+------------------------+-----+-----+\n" ] } ], "source": [ "ins.add_field(field_name='idx', field=0)\n", "print(ins)" ] },
{ "cell_type": "markdown", "id": "49caaa9c", "metadata": {}, "source": [ "In `fastNLP 0.8`, each field of a `dataset` is represented by the `FieldArray` module (note: there is no separate `field` class)\n", "\n", " The `get_all_fields` method returns the `dataset`'s fields\n", "\n", " The `get_field_names` method returns the list of field names, as follows" ] },
{ "cell_type": "code", "execution_count": 15, "id": "fe15f4c1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'sentence':
\n", " | SentenceId | \n", "Sentence | \n", "Sentiment | \n", "
---|---|---|---|
0 | \n", "1 | \n", "A series of escapades demonstrating the adage ... | \n", "negative | \n", "
1 | \n", "2 | \n", "This quiet , introspective and entertaining in... | \n", "positive | \n", "
2 | \n", "3 | \n", "Even fans of Ismail Merchant 's work , I suspe... | \n", "negative | \n", "
3 | \n", "4 | \n", "A positively thrilling combination of ethnogra... | \n", "neutral | \n", "
4 | \n", "5 | \n", "A comedy-drama of nearly epic proportions root... | \n", "positive | \n", "
5 | \n", "6 | \n", "The Importance of Being Earnest , so thick wit... | \n", "neutral | \n", "
\n", "\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" },
{ "name": "stdout", "output_type": "stream", "text": [ "+------------+------------------------------+-----------+\n", "| SentenceId | Sentence                     | Sentiment |\n", "+------------+------------------------------+-----------+\n", "| 1          | ['a', 'series', 'of', 'es... | negative  |\n", "| 2          | ['this', 'quiet', ',', 'i... | positive  |\n", "| 3          | ['even', 'fans', 'of', 'i... | negative  |\n", "| 4          | ['a', 'positively', 'thri... | neutral   |\n", "| 5          | ['a', 'comedy-drama', 'of... | positive  |\n", "| 6          | ['the', 'importance', 'of... | neutral   |\n", "+------------+------------------------------+-----------+\n" ] } ], "source": [ "from fastNLP.core.dataset import DataSet\n", "\n", "dataset = DataSet()\n", "dataset = dataset.from_pandas(df)\n", "dataset.apply_more(lambda ins:{'SentenceId': ins['SentenceId'], \n", "                               'Sentence': ins['Sentence'].lower().split(), 'Sentiment': ins['Sentiment']})\n", "print(dataset)" ] },
{ "cell_type": "markdown", "id": "5c1ae192", "metadata": {}, "source": [ " If intermediate results need to be saved, the `dataset`'s `to_csv` method can also write them to a `.csv` or `.tsv` file" ] },
{ "cell_type": "code", "execution_count": 26, "id": "46722efc", "metadata": {}, "outputs": [], "source": [ "dataset.to_csv('./data/test4dataset.csv')" ] },
{ "cell_type": "markdown", "id": "5ba13989", "metadata": {}, "source": [ "### 3.2 Building a vocabulary from a dataset\n", "\n", "Next, initialize a `vocabulary` and use its `from_dataset` method to collect every element found in the specified\n", "\n", " field of the `dataset` and assign each an index; if the field holds lists, all elements contained in those lists are indexed\n", "\n", " Note: **adding words from a `dataset` differs from `add_word_list`**: **the more often a word is added**, **the smaller its index**, e.g. `a` in this example" ] },
{ "cell_type": "code", "execution_count": 27, "id": "a2de615b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n" ], "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Counter({'a': 9, 'of': 9, ',': 7, 'the': 6, '.': 5, 'is': 3, 'and': 3, 'good': 2, 'for': 2, 'which': 2, 'this': 2, \"'s\": 2, 'series': 1, 'escapades': 1, 'demonstrating': 1, 'adage': 1, 'that': 1, 'what': 1, 'goose': 1, 'also': 1, 'gander': 1, 'some': 1, 'occasionally': 1, 'amuses': 1, 'but': 1, 'none': 1, 'amounts': 1, 'to': 1, 'much': 1, 'story': 1, 'quiet': 1, 'introspective': 1, 'entertaining': 1, 'independent': 1, 'worth': 1, 'seeking': 1, 'even': 1, 'fans': 1, 'ismail': 1, 'merchant': 1, 'work': 1, 'i': 1, 'suspect': 1, 'would': 1, 'have': 1, 'hard': 1, 'time': 1, 'sitting': 1, 'through': 1, 'one': 1, 'positively': 1, 'thrilling': 1, 'combination': 1, 'ethnography': 1, 'all': 1, 'intrigue': 1, 'betrayal': 1, 'deceit': 1, 'murder': 1, 'shakespearean': 1, 'tragedy': 1, 'or': 1, 'juicy': 1, 'soap': 1, 'opera': 1, 'comedy-drama': 1, 'nearly': 1, 'epic': 1, 'proportions': 1, 'rooted': 1, 'in': 1, 'sincere': 1, 'performance': 1, 'by': 1, 'title': 1, 'character': 1, 'undergoing': 1, 'midlife': 1, 'crisis': 1, 'importance': 1, 'being': 1, 'earnest': 1, 'so': 1, 'thick': 1, 'with': 1, 'wit': 1, 'it': 1, 'plays': 1, 'like': 1, 'reading': 1, 'from': 1, 'bartlett': 1, 'familiar': 1, 'quotations': 1}) \n", "\n", "{'
\n", "\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" },
{ "name": "stdout", "output_type": "stream", "text": [ "+------------+------------------------------+-----------+\n", "| SentenceId | Sentence                     | Sentiment |\n", "+------------+------------------------------+-----------+\n", "| 1          | [2, 14, 3, 15, 16, 5, 17,... | negative  |\n", "| 2          | [12, 32, 4, 33, 8, 34, 35... | positive  |\n", "| 3          | [38, 39, 3, 40, 41, 13, 4... | negative  |\n", "| 4          | [2, 52, 53, 54, 3, 55, 8,... | neutral   |\n", "| 5          | [2, 67, 3, 68, 69, 70, 71... | positive  |\n", "| 6          | [5, 81, 3, 82, 83, 4, 84,... | neutral   |\n", "+------------+------------------------------+-----------+\n" ] } ], "source": [ "vocab.index_dataset(dataset, field_name='Sentence')\n", "print(dataset)" ] },
{ "cell_type": "markdown", "id": "6b26b707", "metadata": {}, "source": [ "Finally, use the same approach to map `negative`, `neutral` and `positive` in the `Sentiment` field of the `dataset` to numeric indices" ] },
{ "cell_type": "code", "execution_count": 29, "id": "5f5eed18", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "{'negative': 0, 'positive': 1, 'neutral': 2}\n" ] }, { "data": { "text/html": [ "\n" ], "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" },
{ "name": "stdout", "output_type": "stream", "text": [ "+------------+------------------------------+-----------+\n", "| SentenceId | Sentence                     | Sentiment |\n", "+------------+------------------------------+-----------+\n", "| 1          | [2, 14, 3, 15, 16, 5, 17,... | 0         |\n", "| 2          | [12, 32, 4, 33, 8, 34, 35... | 1         |\n", "| 3          | [38, 39, 3, 40, 41, 13, 4... | 0         |\n", "| 4          | [2, 52, 53, 54, 3, 55, 8,... | 2         |\n", "| 5          | [2, 67, 3, 68, 69, 70, 71... | 1         |\n", "| 6          | [5, 81, 3, 82, 83, 4, 84,... | 2         |\n", "+------------+------------------------------+-----------+\n" ] } ], "source": [ "target_vocab = Vocabulary(padding=None, unknown=None)\n", "\n", "target_vocab.from_dataset(dataset, field_name='Sentiment')\n", "print(target_vocab.word2idx)\n", "target_vocab.index_dataset(dataset, field_name='Sentiment')\n", "print(dataset)" ] },
{ "cell_type": "markdown", "id": "eed7ea64", "metadata": {}, "source": [ "To wrap up, the figure below summarizes the main points of this chapter on `dataset` and `vocabulary`, and how the two fit together\n", "\n", "