|
|
@@ -47,7 +47,9 @@ |
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": 1, |
|
|
|
"metadata": {}, |
|
|
|
"metadata": { |
|
|
|
"collapsed": true |
|
|
|
}, |
|
|
|
"outputs": [], |
|
|
|
"source": [ |
|
|
|
"from fastNLP.io import ChnSentiCorpLoader\n", |
|
|
@@ -126,7 +128,9 @@ |
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": 3, |
|
|
|
"metadata": {}, |
|
|
|
"metadata": { |
|
|
|
"collapsed": true |
|
|
|
}, |
|
|
|
"outputs": [], |
|
|
|
"source": [ |
|
|
|
"from fastNLP.io import ChnSentiCorpPipe\n", |
|
|
@@ -280,7 +284,9 @@ |
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": 9, |
|
|
|
"metadata": {}, |
|
|
|
"metadata": { |
|
|
|
"collapsed": true |
|
|
|
}, |
|
|
|
"outputs": [], |
|
|
|
"source": [ |
|
|
|
"from torch import nn\n", |
|
|
@@ -803,11 +809,221 @@ |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "markdown", |
|
|
|
"metadata": {}, |
|
|
|
"source": [ |
|
|
|
"### 基于词进行文本分类" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "markdown", |
|
|
|
"metadata": {}, |
|
|
|
"source": [ |
|
|
|
"由于汉字中没有显示的字与字的边界,一般需要通过分词器先将句子进行分词操作。\n", |
|
|
|
"下面的例子演示了如何不基于fastNLP已有的数据读取、预处理代码进行文本分类。" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "markdown", |
|
|
|
"metadata": {}, |
|
|
|
"source": [ |
|
|
|
"### (1) 读取数据" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "markdown", |
|
|
|
"metadata": {}, |
|
|
|
"source": [ |
|
|
|
"这里我们继续以之前的数据为例,但这次我们不使用fastNLP自带的数据读取代码 " |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": null, |
|
|
|
"metadata": { |
|
|
|
"collapsed": true |
|
|
|
}, |
|
|
|
"outputs": [], |
|
|
|
"source": [ |
|
|
|
"from fastNLP.io import ChnSentiCorpLoader\n", |
|
|
|
"\n", |
|
|
|
"loader = ChnSentiCorpLoader() # 初始化一个中文情感分类的loader\n", |
|
|
|
"data_dir = loader.download() # 这一行代码将自动下载数据到默认的缓存地址, 并将该地址返回" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "markdown", |
|
|
|
"metadata": {}, |
|
|
|
"source": [ |
|
|
|
"下面我们先定义一个read_file_to_dataset的函数, 即给定一个文件路径,读取其中的内容,并返回一个DataSet。然后我们将所有的DataSet放入到DataBundle对象中来方便接下来的预处理" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": null, |
|
|
|
"metadata": { |
|
|
|
"collapsed": true |
|
|
|
}, |
|
|
|
"outputs": [], |
|
|
|
"source": [] |
|
|
|
"source": [ |
|
|
|
"import os\n", |
|
|
|
"from fastNLP import DataSet, Instance\n", |
|
|
|
"from fastNLP.io import DataBundle\n", |
|
|
|
"\n", |
|
|
|
"\n", |
|
|
|
"def read_file_to_dataset(fp):\n", |
|
|
|
" ds = DataSet()\n", |
|
|
|
" with open(fp, 'r') as f:\n", |
|
|
|
" f.readline() # 第一行是title名称,忽略掉\n", |
|
|
|
" for line in f:\n", |
|
|
|
" line = line.strip()\n", |
|
|
|
" target, chars = line.split('\\t')\n", |
|
|
|
" ins = Instance(target=target, raw_chars=chars)\n", |
|
|
|
" ds.append(ins)\n", |
|
|
|
" return ds\n", |
|
|
|
"\n", |
|
|
|
"data_bundle = DataBundle()\n", |
|
|
|
"for name in ['train.tsv', 'dev.tsv', 'test.tsv']:\n", |
|
|
|
" fp = os.path.join(data_dir, name)\n", |
|
|
|
" ds = read_file_to_dataset(fp)\n", |
|
|
|
" data_bundle.set_dataset(name=name.split('.')[0], dataset=ds)\n", |
|
|
|
"\n", |
|
|
|
"print(data_bundle) # 查看以下数据集的情况\n", |
|
|
|
"# In total 3 datasets:\n", |
|
|
|
"# train has 9600 instances.\n", |
|
|
|
"# dev has 1200 instances.\n", |
|
|
|
"# test has 1200 instances." |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "markdown", |
|
|
|
"metadata": {}, |
|
|
|
"source": [ |
|
|
|
"### (2) 数据预处理" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "markdown", |
|
|
|
"metadata": {}, |
|
|
|
"source": [ |
|
|
|
"在这里,我们首先把句子通过 [fastHan](http://gitee.com/fastnlp/fastHan) 进行分词操作,然后创建词表,并将词语转换为序号。" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": null, |
|
|
|
"metadata": { |
|
|
|
"collapsed": true |
|
|
|
}, |
|
|
|
"outputs": [], |
|
|
|
"source": [ |
|
|
|
"from fastHan import FastHan\n", |
|
|
|
"from fastNLP import Vocabulary\n", |
|
|
|
"\n", |
|
|
|
"model=FastHan()\n", |
|
|
|
"\n", |
|
|
|
"# 定义分词处理操作\n", |
|
|
|
"def word_seg(ins):\n", |
|
|
|
" raw_chars = ins['raw_chars']\n", |
|
|
|
" # 由于有些句子比较长,我们只截取前128个汉字\n", |
|
|
|
" raw_words = model(raw_chars[:128], target='CWS')[0]\n", |
|
|
|
" return raw_words\n", |
|
|
|
"\n", |
|
|
|
"for name, ds in data_bundle.iter_datasets():\n", |
|
|
|
" # apply函数将对内部的instance依次执行word_seg操作,并把其返回值放入到raw_words这个field\n", |
|
|
|
" ds.apply(word_seg, new_field_name='raw_words')\n", |
|
|
|
" # 除了apply函数,fastNLP还支持apply_field, apply_more(可同时创建多个field)等操作\n", |
|
|
|
"\n", |
|
|
|
"vocab = Vocabulary()\n", |
|
|
|
"\n", |
|
|
|
"# 对raw_words列创建词表, 建议把非训练集的dataset放在no_create_entry_dataset参数中\n", |
|
|
|
"# 也可以通过add_word(), add_word_lst()等建立词表,请参考http://www.fastnlp.top/docs/fastNLP/tutorials/tutorial_2_vocabulary.html\n", |
|
|
|
"vocab.from_dataset(data_bundle.get_dataset('train'), field_name='raw_words', \n", |
|
|
|
" no_create_entry_dataset=[data_bundle.get_dataset('dev'), \n", |
|
|
|
" data_bundle.get_dataset('test')]) \n", |
|
|
|
"\n", |
|
|
|
"# 将建立好词表的Vocabulary用于对raw_words列建立词表,并把转为序号的列存入到words列\n", |
|
|
|
"vocab.index_dataset(data_bundle.get_dataset('train'), data_bundle.get_dataset('dev'), \n", |
|
|
|
" data_bundle.get_dataset('test'), field_name='raw_words', new_field_name='words')\n", |
|
|
|
"\n", |
|
|
|
"# 建立target的词表,target的词表一般不需要padding和unknown\n", |
|
|
|
"target_vocab = Vocabulary(padding=None, unknown=None) \n", |
|
|
|
"# 一般情况下我们可以只用训练集建立target的词表\n", |
|
|
|
"target_vocab.from_dataset(data_bundle.get_dataset('train'), field_name='target') \n", |
|
|
|
"# 如果没有传递new_field_name, 则默认覆盖原词表\n", |
|
|
|
"target_vocab.index_dataset(data_bundle.get_dataset('train'), data_bundle.get_dataset('dev'), \n", |
|
|
|
" data_bundle.get_dataset('test'), field_name='target')\n", |
|
|
|
"\n", |
|
|
|
"# 我们可以把词表保存到data_bundle中,方便之后使用\n", |
|
|
|
"data_bundle.set_vocab(field_name='words', vocab=vocab)\n", |
|
|
|
"data_bundle.set_vocab(field_name='target', vocab=target_vocab)\n", |
|
|
|
"\n", |
|
|
|
"# 我们把words和target分别设置为input和target,这样它们才会在训练循环中被取出并自动padding, 有关这部分更多的内容参考\n", |
|
|
|
"# http://www.fastnlp.top/docs/fastNLP/tutorials/tutorial_6_datasetiter.html\n", |
|
|
|
"data_bundle.set_target('target')\n", |
|
|
|
"data_bundle.set_input('words') # DataSet也有这两个接口\n", |
|
|
|
"# 如果某些field,您希望它被设置为target或者input,但是不希望fastNLP自动padding或需要使用特定的padding方式,请参考\n", |
|
|
|
"# http://www.fastnlp.top/docs/fastNLP/fastNLP.core.dataset.html\n", |
|
|
|
"\n", |
|
|
|
"print(data_bundle.get_dataset('train')[:2]) # 我们可以看一下当前dataset的内容" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "markdown", |
|
|
|
"metadata": {}, |
|
|
|
"source": [ |
|
|
|
"### (3) 选择预训练词向量" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "markdown", |
|
|
|
"metadata": {}, |
|
|
|
"source": [ |
|
|
|
"这里我们选择腾讯的预训练中文词向量,可以在 [腾讯词向量](https://ai.tencent.com/ailab/nlp/en/embedding.html) 处下载并解压。这里我们不能直接使用BERT,因为BERT是基于中文字进行预训练的。" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": null, |
|
|
|
"metadata": { |
|
|
|
"collapsed": true |
|
|
|
}, |
|
|
|
"outputs": [], |
|
|
|
"source": [ |
|
|
|
"from fastNLP.embeddings import StaticEmbedding\n", |
|
|
|
"\n", |
|
|
|
"word2vec_embed = StaticEmbedding(data_bundle.get_vocab('words'), \n", |
|
|
|
" model_dir_or_name='/path/to/Tencent_AILab_ChineseEmbedding.txt')" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": null, |
|
|
|
"metadata": { |
|
|
|
"collapsed": true |
|
|
|
}, |
|
|
|
"outputs": [], |
|
|
|
"source": [ |
|
|
|
"# 初始化模型\n", |
|
|
|
"model = BiLSTMMaxPoolCls(word2vec_embed, len(data_bundle.get_vocab('target')))\n", |
|
|
|
"\n", |
|
|
|
"# 开始训练\n", |
|
|
|
"loss = CrossEntropyLoss()\n", |
|
|
|
"optimizer = Adam(model.parameters(), lr=0.001)\n", |
|
|
|
"metric = AccuracyMetric()\n", |
|
|
|
"device = 0 if torch.cuda.is_available() else 'cpu' # 如果有gpu的话在gpu上运行,训练速度会更快\n", |
|
|
|
"\n", |
|
|
|
"trainer = Trainer(train_data=data_bundle.get_dataset('train'), model=model, loss=loss, \n", |
|
|
|
" optimizer=optimizer, batch_size=32, dev_data=data_bundle.get_dataset('dev'),\n", |
|
|
|
" metrics=metric, device=device)\n", |
|
|
|
"trainer.train() # 开始训练,训练完成之后默认会加载在dev上表现最好的模型\n", |
|
|
|
"\n", |
|
|
|
"# 在测试集上测试一下模型的性能\n", |
|
|
|
"from fastNLP import Tester\n", |
|
|
|
"print(\"Performance on test is:\")\n", |
|
|
|
"tester = Tester(data=data_bundle.get_dataset('test'), model=model, metrics=metric, batch_size=64, device=device)\n", |
|
|
|
"tester.test()" |
|
|
|
] |
|
|
|
} |
|
|
|
], |
|
|
|
"metadata": { |
|
|
@@ -826,7 +1042,7 @@ |
|
|
|
"name": "python", |
|
|
|
"nbconvert_exporter": "python", |
|
|
|
"pygments_lexer": "ipython3", |
|
|
|
"version": "3.6.7" |
|
|
|
"version": "3.6.10" |
|
|
|
} |
|
|
|
}, |
|
|
|
"nbformat": 4, |
|
|
|