==========================================
Quickly Building a Sequence Labeling Model
==========================================
This tutorial shows how to implement a sequence labeling task with fastNLP. Using fastNLP's components, you can complete such a task quickly and conveniently, and achieve strong results.
Before reading this tutorial, we assume you are already familiar with the basics of fastNLP, especially data loading and model construction; this small task will help you get further acquainted with fastNLP.
Named Entity Recognition (NER)
------------------------------
Named entity recognition extracts entities with special meaning or strong referential value from text, typically including person names, place names, organization names, and times. Consider the following example::

    我来自复旦大学。

Here "复旦大学" (Fudan University) is an organization name; NER should recognize that these four characters form a single unit belonging to the organization category. In practice, this problem is cast as a sequence labeling problem:
for the sentence "我来自复旦大学", the prediction target is [O, O, O, B-ORG, I-ORG, I-ORG, I-ORG], where O stands for "outside" (not part of any entity), B-ORG marks the beginning (Begin) of an ORG (short for
organization) entity, and I-ORG marks a character inside (Inside) an ORG entity.
In this tutorial we will use fastNLP to write a model that performs this task.
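To make the BIO scheme more concrete, here is a minimal sketch (plain Python, independent of fastNLP; the entity span offsets are hand-picked for this example) that converts a sentence and its entity spans into BIO tags:

.. code-block:: python

    def spans_to_bio(sentence, spans):
        """Convert (start, end, label) character spans into BIO tags."""
        tags = ['O'] * len(sentence)
        for start, end, label in spans:  # end is exclusive
            tags[start] = 'B-' + label
            for i in range(start + 1, end):
                tags[i] = 'I-' + label
        return tags

    # "复旦大学" covers characters 3..6 of the sentence and is an ORG entity
    print(spans_to_bio('我来自复旦大学', [(3, 7, 'ORG')]))
    # ['O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG']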
Loading the Data
------------------------------------------
Data loading in fastNLP is handled by the two base classes Loader and Pipe. See :doc:`Processing Data with Loader and Pipe </tutorials/tutorial_4_load_dataset>`
to learn how to use the data loading functions fastNLP provides. Below we use the Weibo NER task to demonstrate sequence labeling in fastNLP.
.. code-block:: python

    from fastNLP.io import WeiboNERPipe

    data_bundle = WeiboNERPipe().process_from_file()
    print(data_bundle.get_dataset('train')[:2])
The printed data looks like::
    +-------------------------------------------------+------------------------------------------+------------------------------------------+---------+
    | raw_chars                                       | target                                   | chars                                    | seq_len |
    +-------------------------------------------------+------------------------------------------+------------------------------------------+---------+
    | ['一', '节', '课', '的', '时', '间', '真', '... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, ... | [8, 211, 775, 3, 49, 245, 89, 26, 101... | 16      |
    | ['回', '复', '支', '持', ',', '赞', '成', '... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [116, 480, 127, 109, 2, 446, 134, 2, ... | 59      |
    +-------------------------------------------------+------------------------------------------+------------------------------------------+---------+
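The ``target`` column stores tag indices rather than tag strings. If you want to see which BIO tag each index stands for, you can inspect the label vocabulary with a sketch like the one below (it assumes fastNLP's ``Vocabulary`` exposes ``to_word``; the exact tag inventory depends on the dataset):

.. code-block:: python

    # Sketch: list every (index, tag) pair in the label vocabulary.
    target_vocab = data_bundle.get_vocab('target')
    for idx in range(len(target_vocab)):
        print(idx, target_vocab.to_word(idx))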
Building the Model
--------------------------------
First, choose the Embedding type to use. For details on Embedding, see :doc:`Converting Text to Vectors with the Embedding Module </tutorials/tutorial_3_embedding>`.
Here we use Chinese character embeddings pretrained with word2vec.
.. code-block:: python

    from fastNLP.embeddings import StaticEmbedding

    embed = StaticEmbedding(vocab=data_bundle.get_vocab('chars'), model_dir_or_name='cn-char-fastnlp-100d')
With the Embedding chosen, we can use fastNLP's built-in :class:`fastNLP.models.BiLSTMCRF` as the model.
.. code-block:: python

    from fastNLP.models import BiLSTMCRF

    # The forward function of BiLSTMCRF takes a field named 'words' rather
    # than 'chars', so this column needs to be renamed.
    data_bundle.rename_field('chars', 'words')
    model = BiLSTMCRF(embed=embed, num_classes=len(data_bundle.get_vocab('target')), num_layers=1,
                      hidden_size=200, dropout=0.5, target_vocab=data_bundle.get_vocab('target'))
Next, we choose the metric used to evaluate the model and the optimizer used for training.
.. code-block:: python

    from fastNLP import SpanFPreRecMetric
    from torch.optim import Adam
    from fastNLP import LossInForward

    metric = SpanFPreRecMetric(tag_vocab=data_bundle.get_vocab('target'))
    optimizer = Adam(model.parameters(), lr=1e-2)
    loss = LossInForward()
Use the Trainer to train the model.
.. code-block:: python

    from fastNLP import Trainer
    import torch

    device = 0 if torch.cuda.is_available() else 'cpu'
    trainer = Trainer(data_bundle.get_dataset('train'), model, loss=loss, optimizer=optimizer,
                      dev_data=data_bundle.get_dataset('dev'), metrics=metric, device=device)
    trainer.train()
The training output is::
    input fields after batch(if batch size is 2):
        target: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 26])
        seq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])
        words: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 26])
    target fields after batch(if batch size is 2):
        target: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 26])
        seq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])

    training epochs started 2019-09-25-10-43-09

    Evaluate data in 0.62 seconds!
    Evaluation on dev at Epoch 1/10. Step:43/430:
    SpanFPreRecMetric: f=0.070352, pre=0.100962, rec=0.053985

    ...

    Evaluate data in 0.61 seconds!
    Evaluation on dev at Epoch 10/10. Step:430/430:
    SpanFPreRecMetric: f=0.51223, pre=0.581699, rec=0.457584

    In Epoch:7/Step:301, got best dev performance:
    SpanFPreRecMetric: f=0.515528, pre=0.65098, rec=0.426735
    Reloaded the best model.
After training finishes, you can use :class:`~fastNLP.Tester` to test the model's performance on the test set.
.. code-block:: python

    from fastNLP import Tester

    tester = Tester(data_bundle.get_dataset('test'), model, metrics=metric)
    tester.test()
The output is::
    [tester]
    SpanFPreRecMetric: f=0.482399, pre=0.530086, rec=0.442584
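If you also want to tag a new sentence with the trained model, the sketch below shows one possible way. It is an illustration rather than part of the original tutorial: it assumes ``BiLSTMCRF`` exposes a ``predict(words, seq_len)`` method returning a ``'pred'`` tensor of tag indices, and that the character vocabulary is reachable as ``'words'`` after the ``rename_field`` call (fall back to ``'chars'`` if your fastNLP version keeps the old vocabulary name):

.. code-block:: python

    import torch

    sentence = list('我来自复旦大学。')
    char_vocab = data_bundle.get_vocab('words')   # assumption: renamed along with the field
    target_vocab = data_bundle.get_vocab('target')

    words = torch.LongTensor([[char_vocab.to_index(c) for c in sentence]])
    seq_len = torch.LongTensor([len(sentence)])

    model = model.cpu()  # keep this small example on the CPU
    model.eval()
    with torch.no_grad():
        pred = model.predict(words, seq_len)['pred'][0]  # assumed predict API
    print([(c, target_vocab.to_word(int(t))) for c, t in zip(sentence, pred)])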
Sequence Labeling with the Stronger BERT
----------------------------------------
To use BERT for a task in fastNLP, you only need to switch to :class:`fastNLP.embeddings.BertEmbedding`.
.. code-block:: python

    from fastNLP.io import WeiboNERPipe

    data_bundle = WeiboNERPipe().process_from_file()
    data_bundle.rename_field('chars', 'words')

    from fastNLP.embeddings import BertEmbedding
    from fastNLP.models import BiLSTMCRF

    embed = BertEmbedding(vocab=data_bundle.get_vocab('words'), model_dir_or_name='cn')
    model = BiLSTMCRF(embed=embed, num_classes=len(data_bundle.get_vocab('target')), num_layers=1,
                      hidden_size=200, dropout=0.5, target_vocab=data_bundle.get_vocab('target'))

    from fastNLP import SpanFPreRecMetric
    from torch.optim import Adam
    from fastNLP import LossInForward

    metric = SpanFPreRecMetric(tag_vocab=data_bundle.get_vocab('target'))
    optimizer = Adam(model.parameters(), lr=2e-5)
    loss = LossInForward()

    from fastNLP import Trainer
    import torch

    device = 0 if torch.cuda.is_available() else 'cpu'
    trainer = Trainer(data_bundle.get_dataset('train'), model, loss=loss, optimizer=optimizer, batch_size=12,
                      dev_data=data_bundle.get_dataset('dev'), metrics=metric, device=device)
    trainer.train()

    from fastNLP import Tester

    tester = Tester(data_bundle.get_dataset('test'), model, metrics=metric)
    tester.test()
The output is::
    training epochs started 2019-09-25-07-15-43

    Evaluate data in 2.02 seconds!
    Evaluation on dev at Epoch 1/10. Step:113/1130:
    SpanFPreRecMetric: f=0.0, pre=0.0, rec=0.0

    ...

    Evaluate data in 2.17 seconds!
    Evaluation on dev at Epoch 10/10. Step:1130/1130:
    SpanFPreRecMetric: f=0.647332, pre=0.589852, rec=0.717224

    In Epoch:6/Step:678, got best dev performance:
    SpanFPreRecMetric: f=0.669963, pre=0.645238, rec=0.696658
    Reloaded the best model.

    Evaluate data in 1.82 seconds!
    [tester]
    SpanFPreRecMetric: f=0.641774, pre=0.626424, rec=0.657895
As you can see, switching to BERT brings a clear improvement: the test F-score rises from 48.2 to 64.1.