OpenI
/
machinelearning_notebook

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# N-Gram 模型\n",
    "上一节课，我们讲了词嵌入以及词嵌入是如何得到的，现在我们来讲讲词嵌入如何来训练语言模型，首先我们介绍一下 N-Gram 模型的原理和其要解决的问题。\n",
    "\n",
    "对于一句话，单词的排列顺序是非常重要的，所以我们能否由前面的几个词来预测后面的几个单词呢，比如 'I lived in France for 10 years, I can speak _' 这句话中，我们能够预测出最后一个词是 French。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对于一句话 T，其由 $w_1, w_2, \\cdots, w_n$ 这 n 个词构成，\n",
    "\n",
    "$$\n",
    "P(T) = P(w_1)P(w_2 | w_1)P(w_3 |w_2 w_1) \\cdots P(w_n |w_{n-1} w_{n-2}\\cdots w_2w_1)\n",
    "$$\n",
    "\n",
    "我们可以再简化一下这个模型，比如对于一个词，它并不需要前面所有的词作为条件概率，也就是说一个词可以只与其前面的几个词有关，这就是马尔科夫假设。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对于这里的条件概率，传统的方法是统计语料中每个词出现的频率，根据贝叶斯定理来估计这个条件概率，这里我们就可以用词嵌入对其进行代替，然后使用 RNN 进行条件概率的计算，然后最大化这个条件概率不仅修改词嵌入，同时能够使得模型可以依据计算的条件概率对其中的一个单词进行预测。\n",
    "\n",
    "下面我们直接用代码进行说明"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "CONTEXT_SIZE = 2 # 依据的单词数\n",
    "EMBEDDING_DIM = 10 # 词向量的维度\n",
    "# 我们使用莎士比亚的诗\n",
    "test_sentence = \"\"\"When forty winters shall besiege thy brow,\n",
    "And dig deep trenches in thy beauty's field,\n",
    "Thy youth's proud livery so gazed on now,\n",
    "Will be a totter'd weed of small worth held:\n",
    "Then being asked, where all thy beauty lies,\n",
    "Where all the treasure of thy lusty days;\n",
    "To say, within thine own deep sunken eyes,\n",
    "Were an all-eating shame, and thriftless praise.\n",
    "How much more praise deserv'd thy beauty's use,\n",
    "If thou couldst answer 'This fair child of mine\n",
    "Shall sum my count, and make my old excuse,'\n",
    "Proving his beauty by succession thine!\n",
    "This were to be new made when thou art old,\n",
    "And see thy blood warm when thou feel'st it cold.\"\"\".split()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里的 `CONTEXT_SIZE` 表示我们希望由前面几个单词来预测这个单词，这里使用两个单词，`EMBEDDING_DIM` 表示词嵌入的维度。\n",
    "\n",
    "接着我们建立训练集，便利整个语料库，将单词三个分组，前面两个作为输入，最后一个作为预测的结果。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    " trigram = [((test_sentence[i], test_sentence[i+1]), test_sentence[i+2]) \n",
    "            for i in range(len(test_sentence)-2)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "113"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 总的数据量\n",
    "len(trigram)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(('When', 'forty'), 'winters')"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 取出第一个数据看看\n",
    "trigram[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 建立每个词与数字的编码，据此构建词嵌入\n",
    "vocb = set(test_sentence) # 使用 set 将重复的元素去掉\n",
    "word_to_idx = {word: i for i, word in enumerate(vocb)}\n",
    "idx_to_word = {word_to_idx[word]: word for word in word_to_idx}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{\"'This\": 94,\n",
       " 'And': 71,\n",
       " 'How': 18,\n",
       " 'If': 49,\n",
       " 'Proving': 78,\n",
       " 'Shall': 48,\n",
       " 'Then': 33,\n",
       " 'This': 68,\n",
       " 'Thy': 75,\n",
       " 'To': 81,\n",
       " 'Were': 61,\n",
       " 'When': 14,\n",
       " 'Where': 95,\n",
       " 'Will': 27,\n",
       " 'a': 21,\n",
       " 'all': 53,\n",
       " 'all-eating': 3,\n",
       " 'an': 15,\n",
       " 'and': 23,\n",
       " 'answer': 80,\n",
       " 'art': 70,\n",
       " 'asked,': 69,\n",
       " 'be': 29,\n",
       " 'beauty': 16,\n",
       " \"beauty's\": 40,\n",
       " 'being': 79,\n",
       " 'besiege': 55,\n",
       " 'blood': 11,\n",
       " 'brow,': 1,\n",
       " 'by': 59,\n",
       " 'child': 8,\n",
       " 'cold.': 32,\n",
       " 'couldst': 26,\n",
       " 'count,': 77,\n",
       " 'days;': 43,\n",
       " 'deep': 62,\n",
       " \"deserv'd\": 41,\n",
       " 'dig': 64,\n",
       " \"excuse,'\": 86,\n",
       " 'eyes,': 84,\n",
       " 'fair': 56,\n",
       " \"feel'st\": 44,\n",
       " 'field,': 9,\n",
       " 'forty': 46,\n",
       " 'gazed': 93,\n",
       " 'held:': 12,\n",
       " 'his': 89,\n",
       " 'in': 45,\n",
       " 'it': 34,\n",
       " 'lies,': 57,\n",
       " 'livery': 28,\n",
       " 'lusty': 65,\n",
       " 'made': 54,\n",
       " 'make': 42,\n",
       " 'mine': 13,\n",
       " 'more': 83,\n",
       " 'much': 30,\n",
       " 'my': 50,\n",
       " 'new': 92,\n",
       " 'now,': 25,\n",
       " 'of': 47,\n",
       " 'old': 22,\n",
       " 'old,': 19,\n",
       " 'on': 74,\n",
       " 'own': 20,\n",
       " 'praise': 38,\n",
       " 'praise.': 96,\n",
       " 'proud': 5,\n",
       " 'say,': 63,\n",
       " 'see': 58,\n",
       " 'shall': 87,\n",
       " 'shame,': 90,\n",
       " 'small': 31,\n",
       " 'so': 67,\n",
       " 'succession': 36,\n",
       " 'sum': 10,\n",
       " 'sunken': 60,\n",
       " 'the': 73,\n",
       " 'thine': 24,\n",
       " 'thine!': 0,\n",
       " 'thou': 51,\n",
       " 'thriftless': 72,\n",
       " 'thy': 76,\n",
       " 'to': 85,\n",
       " \"totter'd\": 2,\n",
       " 'treasure': 17,\n",
       " 'trenches': 39,\n",
       " 'use,': 35,\n",
       " 'warm': 66,\n",
       " 'weed': 91,\n",
       " 'were': 82,\n",
       " 'when': 7,\n",
       " 'where': 37,\n",
       " 'winters': 88,\n",
       " 'within': 4,\n",
       " 'worth': 52,\n",
       " \"youth's\": 6}"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "word_to_idx"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "从上面可以看到每个词都对应一个数字，且这里的单词都各不相同\n",
    "\n",
    "接着我们定义模型，模型的输入就是前面的两个词，输出就是预测单词的概率"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import torch\n",
    "from torch import nn\n",
    "import torch.nn.functional as F\n",
    "from torch.autograd import Variable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 定义模型\n",
    "class n_gram(nn.Module):\n",
    "    def __init__(self, vocab_size, context_size=CONTEXT_SIZE, n_dim=EMBEDDING_DIM):\n",
    "        super(n_gram, self).__init__()\n",
    "        \n",
    "        self.embed = nn.Embedding(vocab_size, n_dim)\n",
    "        self.classify = nn.Sequential(\n",
    "            nn.Linear(context_size * n_dim, 128),\n",
    "            nn.ReLU(True),\n",
    "            nn.Linear(128, vocab_size)\n",
    "        )\n",
    "        \n",
    "    def forward(self, x):\n",
    "        voc_embed = self.embed(x) # 得到词嵌入\n",
    "        voc_embed = voc_embed.view(1, -1) # 将两个词向量拼在一起\n",
    "        out = self.classify(voc_embed)\n",
    "        return out"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "最后我们输出就是条件概率，相当于是一个分类问题，我们可以使用交叉熵来方便地衡量误差"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "net = n_gram(len(word_to_idx))\n",
    "\n",
    "criterion = nn.CrossEntropyLoss()\n",
    "optimizer = torch.optim.SGD(net.parameters(), lr=1e-2, weight_decay=1e-5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epoch: 20, Loss: 0.088273\n",
      "epoch: 40, Loss: 0.065301\n",
      "epoch: 60, Loss: 0.057113\n",
      "epoch: 80, Loss: 0.052442\n",
      "epoch: 100, Loss: 0.049236\n"
     ]
    }
   ],
   "source": [
    "for e in range(100):\n",
    "    train_loss = 0\n",
    "    for word, label in trigram: # 使用前 100 个作为训练集\n",
    "        word = Variable(torch.LongTensor([word_to_idx[i] for i in word])) # 将两个词作为输入\n",
    "        label = Variable(torch.LongTensor([word_to_idx[label]]))\n",
    "        # 前向传播\n",
    "        out = net(word)\n",
    "        loss = criterion(out, label)\n",
    "        train_loss += loss.data[0]\n",
    "        # 反向传播\n",
    "        optimizer.zero_grad()\n",
    "        loss.backward()\n",
    "        optimizer.step()\n",
    "    if (e + 1) % 20 == 0:\n",
    "        print('epoch: {}, Loss: {:.6f}'.format(e + 1, train_loss / len(trigram)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "最后我们可以测试一下结果"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "net = net.eval()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "input: ('so', 'gazed')\n",
      "label: on\n",
      "\n",
      "real word is on, predicted word is on\n"
     ]
    }
   ],
   "source": [
    "# 测试一下结果\n",
    "word, label = trigram[19]\n",
    "print('input: {}'.format(word))\n",
    "print('label: {}'.format(label))\n",
    "print()\n",
    "word = Variable(torch.LongTensor([word_to_idx[i] for i in word]))\n",
    "out = net(word)\n",
    "pred_label_idx = out.max(1)[1].data[0]\n",
    "predict_word = idx_to_word[pred_label_idx]\n",
    "print('real word is {}, predicted word is {}'.format(label, predict_word))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "input: (\"'This\", 'fair')\n",
      "label: child\n",
      "\n",
      "real word is child, predicted word is child\n"
     ]
    }
   ],
   "source": [
    "word, label = trigram[75]\n",
    "print('input: {}'.format(word))\n",
    "print('label: {}'.format(label))\n",
    "print()\n",
    "word = Variable(torch.LongTensor([word_to_idx[i] for i in word]))\n",
    "out = net(word)\n",
    "pred_label_idx = out.max(1)[1].data[0]\n",
    "predict_word = idx_to_word[pred_label_idx]\n",
    "print('real word is {}, predicted word is {}'.format(label, predict_word))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以看到网络在训练集上基本能够预测准确，不过这里样本太少，特别容易过拟合。\n",
    "\n",
    "下一次课我们会讲一讲 RNN 如何应用在自然语言处理中"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}