{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Perceptron\n", "\n", "Perceptron is a a linear classification model of dichotomy, the input is the egienvector of instance and the output is other category of instance(take +1 and -1). The perceptron corresponds to a separate hyperplane in which the instance is divided into two classes in the input space. The perceptron aims to find the hyperplane. In order to find the hyperplane, the loss function based on misclassification is introduced. The gradient descent method is used to optimize the loss function (optimization).Perceptron learning algorithm is simple and easy to implement. It can be divided into primitive form and dual form. Perceptron prediction is a discriminant model, beacause it uses the peceptron model through learning to predict the new instance. Perceptronsis proposed by Rosenblatt in 1957, are the basis of neural networks and support vector machines.\n", "\n", "It imitate neurons in the biological nervous system, which can receive signals from multiple sources and then convert them into signals that are easy to transmit for output (which in the biological body is represented as electrical signals).\n", "\n", "![neuron](images/neuron.png)\n", "\n", "* dendrites - 树突\n", "* nucleus - 细胞核\n", "* axon - 轴突\n", "\n", "Psychologist Rosenblatt conceive perception machine, as a simplified mathematical model to explain how neurons in the brain work: it took a set of binary input values (nearby neurons), multiply each input value by a continuous value weight (near each neuron synaptic strength), and setting up a threshold, if the weighted input value is more than the threshold, the output is 1, otherwise 0 (similarly in the neurons discharge process). For perceptrons, most of the input values are either data or outputs from other perceptrons.\n", "\n", "Donald Hebb proposed an unexpected and far-reaching idea that knowledge and learning occur in the brain mainly through the formation and change of synapses between neurons, which is briefly described as Hebb's law:\n", "\n", "\n", "> When cell A's axons are close enough to excite cell B, and repeatedly and continuously discharge cell B, some growth process or metabolic changes will occur in one or both of these cells, so that A becomes more efficient as one of the cells that discharge B.\n", "\n", "Perception machine has not completely follow the idea. **However, by the weight of the input value, we can have a very simple and intuitive learning plan: the training set of a given an input/output instance, perception machine should \"learn\" a function: for each example, if the output value is much lower than the instance, then increase the weight of it, otherwise if the value is much higher than the instance, reduce the weight of it.**\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Perceptron model\n", "\n", "Assume that the input space(eigenvector) is $X \\subseteq R^n$, then the output space is $Y=\\{-1, +1\\}$. Input $x \\in X$ stands for the eigenvector of instance which is correspond to the point in input sapce; Input $y \\in Y$ stancds for the category of instance. The function from input space to output space is:\n", "\n", "$$\n", "f(x) = sign(w x + b)\n", "$$\n", "\n", "This is called perceptron. Among this, parameter $w$ is called weigth vector, $b$ is called bias. $w·x$ represent the dot product of $w$ and $x$. 
 "### 1.1 Geometric interpretation\n",
 "\n",
 "The perceptron model is a linear classification model; its hypothesis space is the set of all linear classifiers defined on the feature space, i.e. the function set $\{f \mid f(x)=w \cdot x+b\}$. The linear equation $w \cdot x+b=0$ corresponds to a hyperplane $S$ in the feature space $R^n$, where $w$ is a normal vector of the hyperplane and $b$ is its intercept. This hyperplane divides the feature space into two parts: points on one side are classified as positive and points on the other side as negative. The hyperplane $S$ is called the separating hyperplane, as shown in the figure below:\n",
 "\n",
 "![perceptron_geometry_def](images/perceptron_geometry_def.png)\n",
 "\n",
 "### 1.2 Biological analogy\n",
 "\n",
 "![perceptron_2](images/perceptron_2.PNG)\n"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "## 2. Learning strategy of the perceptron\n",
 "\n",
 "Assume the training data are linearly separable. The goal of perceptron learning is to find a hyperplane that completely separates the positive points from the negative points in the training data, that is, to find the parameters $w$ and $b$. This requires a learning strategy: define a loss function and minimize it.\n",
 "\n",
 "A natural choice of loss function is the total number of misclassified points. However, such a loss function is not a continuous, differentiable function of the parameters $w$ and $b$, so it is not suitable for optimization. Another choice is the sum of the distances from the misclassified points to the separating hyperplane.\n",
 "\n",
 "First, for any point $x_0$, its distance to the hyperplane is:\n",
 "\n",
 "$$\n",
 "\frac{1}{||w||} | w \cdot x_0 + b |\n",
 "$$\n",
 "\n",
 "Next, for a misclassified point $(x_i, y_i)$:\n",
 "\n",
 "$-y_i(w \cdot x_i + b) > 0$\n",
 "\n",
 "so the distance from a misclassified point to the hyperplane is $-\frac{1}{||w||} y_i (w \cdot x_i + b)$. Letting $M$ be the set of points misclassified by the hyperplane $S$, the total distance from the misclassified points to $S$ is:\n",
 "\n",
 "$$\n",
 "-\frac{1}{||w||} \sum_{x_i \in M} y_i (w \cdot x_i + b)\n",
 "$$\n",
 "\n",
 "Dropping the factor $\frac{1}{||w||}$, we obtain the loss function of perceptron learning.\n",
 "\n",
 "### Empirical risk function\n",
 "\n",
 "Given a dataset $T = \{(x_1,y_1), (x_2, y_2), ... (x_N, y_N)\}$ (where $x_i \in R^n$, $y_i \in \{-1, +1\}$, $i=1,2...N$), the loss function of the perceptron $sign(w \cdot x+b)$ is defined as:\n",
 "\n",
 "$$\n",
 "L(w, b) = - \sum_{x_i \in M} y_i (w \cdot x_i + b)\n",
 "$$\n",
 "\n",
 "where $M$ is the set of misclassified points. This loss function is the [empirical risk function](https://blog.csdn.net/zhzhx1204/article/details/70163099) of perceptron learning.\n",
 "\n",
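 "To make this definition concrete, the snippet below evaluates $L(w, b)$ on a toy dataset. It is a minimal sketch: the helper name `perceptron_loss` and the toy data are illustrative choices, not part of the formal definition.\n",
 "\n",
 "```python\n",
 "import numpy as np\n",
 "\n",
 "def perceptron_loss(X, y, w, b):\n",
 "    # functional margins y_i * (w·x_i + b); points with margin <= 0 form the misclassified set M\n",
 "    margins = y * (X @ w + b)\n",
 "    # L(w, b) = sum over M of -y_i * (w·x_i + b)\n",
 "    return float(np.sum(-margins[margins <= 0]))\n",
 "\n",
 "X = np.array([[1.0, 3.0], [3.0, 1.0]])\n",
 "y = np.array([1, -1])\n",
 "print(perceptron_loss(X, y, np.array([-1.0, 1.0]), 0.0))  # 0.0, both points correctly classified\n",
 "print(perceptron_loss(X, y, np.array([1.0, -1.0]), 0.0))  # 4.0, both points misclassified\n",
 "```\n",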
 "\n"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "## 3. Algorithm of perceptron learning\n",
 "\n",
 "Optimization problem: given a dataset $T = \{(x_1,y_1), (x_2, y_2), ... (x_N, y_N)\}$ (where $x_i \in R^n$, $y_i \in \{-1, +1\}$, $i=1,2...N$), find the parameters $w, b$ that minimize the loss function ($M$ is the set of misclassified points):\n",
 "\n",
 "$$\n",
 "min_{w,b} L(w, b) = - \sum_{x_i \in M} y_i (w \cdot x_i + b)\n",
 "$$\n",
 "\n",
 "Perceptron learning is driven by misclassification; concretely, it uses [stochastic gradient descent](https://blog.csdn.net/zbc1090549839/article/details/38149561). First, an initial $w_0$ and $b_0$ are chosen arbitrarily. Then gradient descent is used to keep minimizing the objective function; instead of performing a gradient step over all misclassified points in $M$ at once, one misclassified point is chosen at random at a time and a gradient step is taken for it.\n",
 "\n",
 "Assuming the misclassified set $M$ is fixed, the gradient of the loss function $L(w,b)$ is:\n",
 "\n",
 "$$\n",
 "\triangledown_w L(w, b) = - \sum_{x_i \in M} y_i x_i \\\n",
 "\triangledown_b L(w, b) = - \sum_{x_i \in M} y_i \\\n",
 "$$\n",
 "\n",
 "Randomly choose a misclassified point $(x_i,y_i)$ and update $w, b$:\n",
 "\n",
 "$$\n",
 "w = w + \eta y_i x_i \\\n",
 "b = b + \eta y_i\n",
 "$$\n",
 "\n",
 "In the formula, $\eta$ ($0 < \eta \leq 1$) is the step size (in statistics, the learning rate). The larger the step size, the faster the gradient descends and the faster it approaches the minimum; if the step size is too large, the iteration may overshoot the minimum and diverge, while if it is too small, reaching the minimum may take a long time.\n",
 "\n",
 "Intuitively: when an instance point is misclassified, $w$ and $b$ are adjusted to move the hyperplane toward that point, reducing the distance between the misclassified point and the hyperplane until the hyperplane passes over the point and classifies it correctly.\n",
 "\n",
 "Algorithm\n",
 "\n",
 "```\n",
 "Input: T={(x1,y1),(x2,y2)...(xN,yN)}, where xi ∈ X = R^n, yi ∈ Y = {-1, +1}, i=1,2...N; learning rate η\n",
 "Output: w, b; perceptron model f(x)=sign(w·x+b)\n",
 "(1) Initialize w0, b0\n",
 "(2) Choose a sample (xi, yi) from the training dataset\n",
 "(3) If yi(w·xi + b) ≤ 0\n",
 "        w = w + ηyixi\n",
 "        b = b + ηyi\n",
 "(4) If all samples are correctly classified, or the number of iterations reaches the limit, stop\n",
 "(5) Otherwise, go to (2)\n",
 "```\n"
] },
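{ "cell_type": "markdown", "metadata": {}, "source": [
 "To make the update rule concrete before the full implementation below, the following snippet applies one stochastic gradient step to a single misclassified point. It is a minimal sketch: the point, the initial weights, and the step size are arbitrary illustrative values.\n",
 "\n",
 "```python\n",
 "import numpy as np\n",
 "\n",
 "# one training point that the current hyperplane misclassifies\n",
 "x, y = np.array([3.0, 3.0]), 1\n",
 "w, b, eta = np.array([-1.0, 0.0]), 0.0, 0.5\n",
 "\n",
 "print(y * (np.dot(w, x) + b))   # -3.0 <= 0, so the point is misclassified\n",
 "\n",
 "# single update: w <- w + eta*y*x, b <- b + eta*y\n",
 "w, b = w + eta * y * x, b + eta * y\n",
 "print(w, b)                     # [0.5 1.5] 0.5\n",
 "print(y * (np.dot(w, x) + b))   # 6.5 > 0, now correctly classified\n",
 "```\n"
] },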
{ "cell_type": "markdown", "metadata": {}, "source": [
 "## 4. Sample algorithm\n"
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "lines_to_end_of_cell_marker": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [
 "update weight and bias: 1.0 2.5 0.5\n",
 "update weight and bias: -2.5 1.0 0.0\n",
 "update weight and bias: -1.5 3.5 0.5\n",
 "update weight and bias: -5.0 2.0 0.0\n",
 "update weight and bias: -4.0 4.5 0.5\n",
 "w = [-4.0, 4.5]\n",
 "b = 0.5\n",
 "ground_truth: [1, 1, 1, 1, -1, -1, -1, -1]\n",
 "predicted: [1, 1, 1, 1, -1, -1, -1, -1]\n"
] } ], "source": [
 "import random\n",
 "import numpy as np\n",
 "\n",
 "# sign function\n",
 "def sign(v):\n",
 "    if v > 0: return 1\n",
 "    else: return -1\n",
 "\n",
 "def perceptron_train(train_data, eta=0.5, n_iter=100):\n",
 "    weight = [0, 0]          # weights\n",
 "    bias = 0                 # bias\n",
 "    learning_rate = eta      # learning rate\n",
 "\n",
 "    train_num = n_iter       # number of iterations\n",
 "\n",
 "    for i in range(train_num):\n",
 "        # FIXME: choosing samples at random makes training slow\n",
 "        train = random.choice(train_data)\n",
 "        x1, x2, y = train\n",
 "        predict = sign(weight[0] * x1 + weight[1] * x2 + bias)   # model output\n",
 "        #print(\"train data: x: (%2d, %2d) y: %2d ==> predict: %2d\" % (x1, x2, y, predict))\n",
 "\n",
 "        if y * predict <= 0:                                     # misclassified point\n",
 "            weight[0] = weight[0] + learning_rate * y * x1       # update weights\n",
 "            weight[1] = weight[1] + learning_rate * y * x2\n",
 "            bias = bias + learning_rate * y                      # update bias\n",
 "            print(\"update weight and bias: \", weight[0], weight[1], bias)\n",
 "\n",
 "    #print(\"stop training: \", weight[0], weight[1], bias)\n",
 "\n",
 "    return weight, bias\n",
 "\n",
 "def perceptron_pred(data, w, b):\n",
 "    y_pred = []\n",
 "    for d in data:\n",
 "        x1, x2, y = d\n",
 "        yi = sign(w[0]*x1 + w[1]*x2 + b)\n",
 "        y_pred.append(yi)\n",
 "\n",
 "    return y_pred\n",
 "\n",
 "# set training data\n",
 "train_data = np.array([[1, 3, 1], [2, 5, 1], [3, 8, 1], [2, 6, 1],\n",
 "                       [3, 1, -1], [4, 1, -1], [6, 2, -1], [7, 3, -1]])\n",
 "\n",
 "# do training\n",
 "w, b = perceptron_train(train_data)\n",
 "print(\"w = \", w)\n",
 "print(\"b = \", b)\n",
 "\n",
 "# predict\n",
 "y_pred = perceptron_pred(train_data, w, b)\n",
 "\n",
 "print(\"ground_truth: \", list(train_data[:, 2]))\n",
 "print(\"predicted: \", y_pred)"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "## Reference\n",
 "* [Perceptron (Python implementation)](http://www.cnblogs.com/kaituorensheng/p/3561091.html)\n",
 "* [Programming a Perceptron in Python](https://blog.dbrgn.ch/2013/3/26/perceptrons-in-python/)\n",
 "* [Loss function, risk function, empirical risk minimization, structural risk minimization](https://blog.csdn.net/zhzhx1204/article/details/70163099)"
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 2 }