- {
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
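- "# Part-of-Speech Tagging with an LSTM\n",
- "Earlier we covered word embeddings and used an n-gram model to predict words, but we have not yet used an RNN. In this final lesson we combine all of that background and show how to use an LSTM for part-of-speech tagging.\n",
- "\n",
- "## Model overview\n",
- "A word can have different parts of speech. Its suffix gives a first hint: a word ending in -ly, for example, is very likely an adverb. The same word can also take more than one part of speech: book can be a noun or a verb. Deciding which one applies therefore requires looking at the surrounding context.\n",
- "\n",
- "We can use an LSTM for this problem. First, a single word can be treated as a sequence: apple consists of the 5 characters a p p l e, which gives a sequence of length 5. We build an embedding for each character and feed the sequence into an LSTM. Just as when an LSTM is used for image classification, we keep only the last output. Reading the whole character string gives the model a kind of memory of the word's spelling, which helps predict its part of speech.\n",
- "\n",
- "Next, we put the word and the words before it into a sequence and build a word embedding for each of them. The final output is the part of speech of each word; that is, each word is classified using information from the words that precede it.\n",
- "\n",
- "Here is a simple example."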
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {
- "collapsed": true
- },
- "outputs": [],
- "source": [
- "import torch\n",
- "from torch import nn\n",
- "from torch.autograd import Variable"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We use the following small training set."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {
- "collapsed": true
- },
- "outputs": [],
- "source": [
- "training_data = [(\"The dog ate the apple\".split(),\n",
- "                  [\"DET\", \"NN\", \"V\", \"DET\", \"NN\"]),\n",
- "                 (\"Everybody read that book\".split(),\n",
- "                  [\"NN\", \"V\", \"DET\", \"NN\"])]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Next, we encode the words and tags as integer indices."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {
- "collapsed": true
- },
- "outputs": [],
- "source": [
- "word_to_idx = {}\n",
- "tag_to_idx = {}\n",
- "for context, tag in training_data:\n",
- "    for word in context:\n",
- "        if word.lower() not in word_to_idx:\n",
- "            word_to_idx[word.lower()] = len(word_to_idx)\n",
- "    for label in tag:\n",
- "        if label.lower() not in tag_to_idx:\n",
- "            tag_to_idx[label.lower()] = len(tag_to_idx)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "{'apple': 3,\n",
- " 'ate': 2,\n",
- " 'book': 7,\n",
- " 'dog': 1,\n",
- " 'everybody': 4,\n",
- " 'read': 5,\n",
- " 'that': 6,\n",
- " 'the': 0}"
- ]
- },
- "execution_count": 4,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "word_to_idx"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "{'det': 0, 'nn': 1, 'v': 2}"
- ]
- },
- "execution_count": 5,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "tag_to_idx"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Then we encode the letters."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {
- "collapsed": true
- },
- "outputs": [],
- "source": [
- "alphabet = 'abcdefghijklmnopqrstuvwxyz'\n",
- "char_to_idx = {}\n",
- "for i in range(len(alphabet)):\n",
- "    char_to_idx[alphabet[i]] = i"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "{'a': 0,\n",
- " 'b': 1,\n",
- " 'c': 2,\n",
- " 'd': 3,\n",
- " 'e': 4,\n",
- " 'f': 5,\n",
- " 'g': 6,\n",
- " 'h': 7,\n",
- " 'i': 8,\n",
- " 'j': 9,\n",
- " 'k': 10,\n",
- " 'l': 11,\n",
- " 'm': 12,\n",
- " 'n': 13,\n",
- " 'o': 14,\n",
- " 'p': 15,\n",
- " 'q': 16,\n",
- " 'r': 17,\n",
- " 's': 18,\n",
- " 't': 19,\n",
- " 'u': 20,\n",
- " 'v': 21,\n",
- " 'w': 22,\n",
- " 'x': 23,\n",
- " 'y': 24,\n",
- " 'z': 25}"
- ]
- },
- "execution_count": 7,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "char_to_idx"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Now we can build the training data."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {
- "collapsed": true
- },
- "outputs": [],
- "source": [
- "def make_sequence(x, dic):  # encode each item of x (word, tag, or character) as its index in dic\n",
- "    idx = [dic[i.lower()] for i in x]\n",
- "    idx = torch.LongTensor(idx)\n",
- "    return idx"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "\n",
- " 0\n",
- " 15\n",
- " 15\n",
- " 11\n",
- " 4\n",
- "[torch.LongTensor of size 5]"
- ]
- },
- "execution_count": 9,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "make_sequence('apple', char_to_idx)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "['Everybody', 'read', 'that', 'book']"
- ]
- },
- "execution_count": 10,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "training_data[1][0]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "\n",
- " 4\n",
- " 5\n",
- " 6\n",
- " 7\n",
- "[torch.LongTensor of size 4]"
- ]
- },
- "execution_count": 11,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "make_sequence(training_data[1][0], word_to_idx)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Build the character-level LSTM model."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 12,
- "metadata": {
- "collapsed": true
- },
- "outputs": [],
- "source": [
- "class char_lstm(nn.Module):\n",
- "    def __init__(self, n_char, char_dim, char_hidden):\n",
- "        super(char_lstm, self).__init__()\n",
- "\n",
- "        self.char_embed = nn.Embedding(n_char, char_dim)\n",
- "        self.lstm = nn.LSTM(char_dim, char_hidden)\n",
- "\n",
- "    def forward(self, x):\n",
- "        x = self.char_embed(x)\n",
- "        out, _ = self.lstm(x)\n",
- "        return out[-1]  # (batch, hidden)"
- ]
- },
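- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The cell below is a minimal, self-contained sketch (not part of the original lesson) of what char_lstm does to the word apple. It assumes a recent PyTorch release where torch.tensor is available; the names demo_embed and demo_lstm are illustrative only, and the sizes 26, 10, and 50 mirror the alphabet size, character embedding size, and character hidden size used when the network is created further down. For a single-layer, unidirectional LSTM the last time step of the output equals the final hidden state, which is why out[-1] can serve as a summary of the whole word."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import torch\n",
- "from torch import nn\n",
- "\n",
- "# indices of a, p, p, l, e in the alphabet, shaped (seq=5, batch=1)\n",
- "char_ids = torch.tensor([0, 15, 15, 11, 4]).unsqueeze(1)\n",
- "\n",
- "demo_embed = nn.Embedding(26, 10)  # illustrative: 26 letters, 10-dim embeddings\n",
- "demo_lstm = nn.LSTM(10, 50)        # illustrative: 50 hidden units\n",
- "\n",
- "out, (h_n, c_n) = demo_lstm(demo_embed(char_ids))\n",
- "print(out.shape)      # one output per character: (5, 1, 50)\n",
- "print(out[-1].shape)  # last time step only: (1, 50)\n",
- "print(torch.allclose(out[-1], h_n[-1]))  # last output equals final hidden state"
- ]
- },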
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Build the LSTM model for part-of-speech tagging."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 13,
- "metadata": {
- "collapsed": true
- },
- "outputs": [],
- "source": [
- "class lstm_tagger(nn.Module):\n",
- "    def __init__(self, n_word, n_char, char_dim, word_dim,\n",
- "                 char_hidden, word_hidden, n_tag):\n",
- "        super(lstm_tagger, self).__init__()\n",
- "        self.word_embed = nn.Embedding(n_word, word_dim)\n",
- "        self.char_lstm = char_lstm(n_char, char_dim, char_hidden)\n",
- "        self.word_lstm = nn.LSTM(word_dim + char_hidden, word_hidden)\n",
- "        self.classify = nn.Linear(word_hidden, n_tag)\n",
- "\n",
- "    def forward(self, x, word):\n",
- "        char = []\n",
- "        for w in word:  # run the character-level LSTM on every word\n",
- "            char_list = make_sequence(w, char_to_idx)\n",
- "            char_list = char_list.unsqueeze(1)  # (seq, batch); the embedding adds the feature dim the LSTM expects\n",
- "            char_infor = self.char_lstm(Variable(char_list))  # (batch, char_hidden)\n",
- "            char.append(char_infor)\n",
- "        char = torch.stack(char, dim=0)  # (seq, batch, feature)\n",
- "\n",
- "        x = self.word_embed(x)  # (batch, seq, word_dim)\n",
- "        x = x.permute(1, 0, 2)  # reorder to (seq, batch, word_dim)\n",
- "        x = torch.cat((x, char), dim=2)  # concatenate each word embedding with its char-LSTM output along the feature dim\n",
- "        x, _ = self.word_lstm(x)\n",
- "\n",
- "        s, b, h = x.shape\n",
- "        x = x.view(-1, h)  # flatten to (seq * batch, hidden) for the linear classifier\n",
- "        out = self.classify(x)\n",
- "        return out"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "metadata": {
- "collapsed": true
- },
- "outputs": [],
- "source": [
- "net = lstm_tagger(len(word_to_idx), len(char_to_idx), 10, 100, 50, 128, len(tag_to_idx))\n",
- "criterion = nn.CrossEntropyLoss()\n",
- "optimizer = torch.optim.SGD(net.parameters(), lr=1e-2)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 15,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Epoch: 50, Loss: 0.86690\n",
- "Epoch: 100, Loss: 0.65471\n",
- "Epoch: 150, Loss: 0.45582\n",
- "Epoch: 200, Loss: 0.30351\n",
- "Epoch: 250, Loss: 0.20446\n",
- "Epoch: 300, Loss: 0.14376\n"
- ]
- }
- ],
- "source": [
- "# start training\n",
- "for e in range(300):\n",
- "    train_loss = 0\n",
- "    for word, tag in training_data:\n",
- "        word_list = make_sequence(word, word_to_idx).unsqueeze(0)  # add a leading batch dimension\n",
- "        tag = make_sequence(tag, tag_to_idx)\n",
- "        word_list = Variable(word_list)\n",
- "        tag = Variable(tag)\n",
- "        # forward pass\n",
- "        out = net(word_list, word)\n",
- "        loss = criterion(out, tag)\n",
- "        train_loss += loss.item()  # loss.data[0] no longer works on 0-dim tensors in recent PyTorch\n",
- "        # backward pass\n",
- "        optimizer.zero_grad()\n",
- "        loss.backward()\n",
- "        optimizer.step()\n",
- "    if (e + 1) % 50 == 0:\n",
- "        print('Epoch: {}, Loss: {:.5f}'.format(e + 1, train_loss / len(training_data)))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Finally, let's look at the predictions."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 19,
- "metadata": {
- "collapsed": true
- },
- "outputs": [],
- "source": [
- "net = net.eval()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 25,
- "metadata": {},
- "outputs": [],
- "source": [
- "test_sent = 'Everybody ate the apple'\n",
- "test = make_sequence(test_sent.split(), word_to_idx).unsqueeze(0)\n",
- "out = net(Variable(test), test_sent.split())"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 27,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Variable containing:\n",
- "-1.2148 1.9048 -0.6570\n",
- "-0.9272 -0.4441 1.4009\n",
- " 1.6425 -0.7751 -1.1553\n",
- "-0.6121 1.6036 -1.1280\n",
- "[torch.FloatTensor of size 4x3]\n",
- "\n"
- ]
- }
- ],
- "source": [
- "print(out)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 28,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "{'det': 0, 'nn': 1, 'v': 2}\n"
- ]
- }
- ],
- "source": [
- "print(tag_to_idx)"
- ]
- },
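- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The cell below is a minimal sketch, not part of the original lesson, of turning the raw scores into tag names. The scores are copied from the printed output above; the cell assumes a recent PyTorch release with torch.tensor, and scores, words, and idx_to_tag are names introduced only for illustration. Softmax does not change which entry in a row is largest, so taking argmax of the raw scores yields the same tags as taking it after softmax."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import torch\n",
- "import torch.nn.functional as F\n",
- "\n",
- "# scores copied from the output printed above, one row per word\n",
- "scores = torch.tensor([[-1.2148, 1.9048, -0.6570],\n",
- "                       [-0.9272, -0.4441, 1.4009],\n",
- "                       [1.6425, -0.7751, -1.1553],\n",
- "                       [-0.6121, 1.6036, -1.1280]])\n",
- "tag_to_idx = {'det': 0, 'nn': 1, 'v': 2}\n",
- "idx_to_tag = {i: t for t, i in tag_to_idx.items()}  # illustrative inverse mapping\n",
- "\n",
- "probs = F.softmax(scores, dim=1)  # optional: each row now sums to 1\n",
- "pred = scores.argmax(dim=1)       # index of the largest score in each row\n",
- "\n",
- "words = 'Everybody ate the apple'.split()\n",
- "for w, i in zip(words, pred.tolist()):\n",
- "    print(w, idx_to_tag[i])"
- ]
- },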
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
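- "This gives the result above. Because the final linear layer does not apply a softmax, the values do not look like probabilities, but the largest value in each row indicates the predicted class. The first word 'Everybody' is tagged nn, the second word 'ate' is tagged v, the third word 'the' is tagged det, and the fourth word 'apple' is tagged nn, so the prediction is correct."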
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.3"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
- }