"We can see that the derivative of sigmoid function is very interesting, it can use sigmoid function itself to represent. In this way, once the value of sigmoid funtion is being calcualted, it is very convenient to calculate the value of its derivative.\n",
"According to the input, we can calculate the output of neural network. We need firstly assign each value of element in the input vector to the correspond neurons in the output layer. Then, according to the formula, 1 we forward calculate the value of each neurons in every layer until the value of neurons in the last output layer is all calculated. Finally, we can get the output vector of $\\vec{y}$ by combining every vlaue of neurons together\n",
"\n",
"\n",
"Next we will list an example to show this process. Noting every element firstly in the neural network is necessary.\n",
"As shown in the upper graph, there are three node in the input layer, we note them 1, 2, 3 in turn. The 4 nodes of the hidden layer are numbered 4, 5, 6, and 7, respectively. The last two node in the output layer is 8 and 9. Because our neural network is fully connected network, so we can see that each node is connected to all the nodes in the previous layer. For example, we can see the node 4 in the hidden layer, they have connection with the three node(1, 2, 3) in the input layer. The weight on the connection is $w_{41}$,$w_{42}$,$w_{43}$ respectively. Then, how can we calculate the output value of node 4?\n",
"In order to calculate the output value of node 4, we must firstly get the output value of all the other upstream node(which is node 1, 2, 3). Node 1, 2, 3 is the input layer node, so that their output value is the input vector $\\vec{x}$ itself. According to the corresponding relationship in the upper graph, we can see that the output value of node 1, 2, 3 is $x_1$,$x_2$,$x_3$ respectively. We want the dimension of input vector is the same with input neurons, while the element of input vector can be free to decide which corresponds to the input nodes. It's also perfectly fine if you want to assign $x_1$ to node 2, however this will have no meaning without mistaking your self.\n",
"\n",
"一旦我们有了节点1、2、3的输出值,我们就可以根据式1计算节点4的输出值$a_4$:\n",
"Once we have the output value of node 1, 2 and 3, we can calculate the output value $a_4$ of node 4 according to formula 1.\n",
"The $w_{4b}$ of above formula is the bias term of node 4, without drawing in the graph. While $w_{41}$,$w_{42}$,$w_{43}$are the weights of node 1, 2, 3 and 4 connections, respectively. When we note weight $w_{ji}$, We put the destination node number $j$ first and the source node number $i$ after.\n",
"Similarly, we can continue to calculate the output value $a_5$,$a_6$,$a_7$ of node 5, 6, 7. In this way, the output values of the four nodes in the hidden layer are calculated and we can calculate the output value of node 8 of the output layer, $y_1$:\n",
"In the same way, we can also calculate the value of $y_2$. Thus, all the output value of the output layer node is calculated. When we get the input vector $\\vec{x} = (x_1, x_2, x_3)^T$, the output vector of neural network is $\\vec{y} = (y_1, y_2)^T$. We also see that the dimension of output vector is the same with the number of neurons.\n",
"\n",
"\n"
]
},
@@ -109,44 +109,43 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. 神经网络的矩阵表示\n",
"## 4. A matrix representation of a neural network\n",
"\n",
"神经网络的计算如果用矩阵来表示会很方便(当然逼格也更高),我们先来看看隐藏层的矩阵表示。\n",
"The calculation of neural network will be very convenient if we use matrix to represent. Let's check the representation of the hidden layer.\n",
"\n",
"首先我们把隐藏层4个节点的计算依次排列出来:\n",
"First, we arrange the calculation of the four nodes of the hidden layer in order:\n",
"Now, we put the four formula that calculated $a_4$, $a_5$,$a_6$,$a_7$ into one matrix, each formula worked as one row of matrix so that we can use matrix to represent their calculation. Let:\n",
"\n",
"\n",
"\n",
"带入前面的一组式子,得到\n",
"Substitute into former group of formula we can get:\n",
"In formula 2, $f$ is the activation function, in this instance, it is $sigmod$ function. $W$ is the weight matrix of one layer. $\\vec{x}$ is the input vector of some layer. $\\vec{a}$ is the output vector of some layer. Formula 2 shows that the function of each layer of the neural network is to first multiply the input vector left by an array for linear transformation to obtain a new vector, and then apply an activation function to this vector element by element.\n",
"The algorithm in each layer is the same. For example, for a neural network that contains one input layer, one output layer and three hidden layer, we assume that their weight matrix is $W_1$,$W_2$,$W_3$,$W_4$ respectively. Every hidden layer output is $\\vec{a}_1$,$\\vec{a}_2$,$\\vec{a}_3$ respectively, the input of neural network is $\\vec{x}$, and the output of neural network is $\\vec{y}$. As shown in the figure below:\n",
"The process of neural network forward calculation is relatively simple, that is, it is ok to keep doing the calculation layer by layer. The dynamic demonstration is shown in the figure below\n",
"\n",
"神经网络正向计算的过程比较简单,就是一层一层不断做运算就可以了,动态的演示如下图所示:\n",
""
]
},
@@ -165,42 +164,41 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. 神经网络的训练 - 反向传播算法\n",
"## 5. The training of neural network - Back propagation algorithm \n",
"Now, we need to know how to get the weight in every connection of neural networks. We can say that neural network is a model, then theses weight is the parameter of the model, which is the thing that model need to learn. However, such parameters as the connection mode of a neural network, the number of layers of the network, and the number of nodes in each layer are not learned, but artificially set in advance. For these parameter that setted in advance, we call it Hyper-Parameters.\n",
"Back propagation algorithm is actually the application of chain rule. However, this simple and obvious method was invented and popularized nearly 30 years after Roseblatt proposed the perceptron algorithm. For this, Bengio answered:\n",
"\n",
"> 很多看似显而易见的想法只有在事后才变得显而易见。\n",
"> Many ideas that seem obvious become obvious only in hindsight.\n",
"According to the general routine of machine learning, we first determine the objective function of the neural network, and then use the stochastic gradient descent optimization algorithm to calculate the parameter value of the minimum objective function\n",
"\n",
"我们取网络所有输出层节点的误差平方和作为目标函数:\n",
"We take the sum of error square of all the output layer in the network as the target function:\n",
"\n",
"\n",
"\n",
"其中,$E_d$表示是样本$d$的误差。\n",
"Among them $E_d$ is the error of sample.\n",
"\n",
"然后,使用随机梯度下降算法对目标函数进行优化:\n",
"After that, we can use random gradient descent method to optimize the target function.\n",
"Observing the upper graph, we find that weight $w_{ji}$ can only affect the other parts of the network through the input of node $j$, set $net_j$ as the weighted input of node $j$, therefore:\n",
"About the derivation of formula $\\frac{\\partial E_d}{\\partial net_j}$, we need to distinguish the two case between input layer and hidden layer.\n",
"For output layer, $net_j$ can only affect the other parts of the net through affecting the output value $net_j$ of node $j$, which means that $E_d$ is the function of $y_j$ while $y_j$ is the function of $net_j$, among them $y_j = sigmod(net_j)$. Therefore we can use chain rule again:\n",
"\n",
"\n",
"\n",
"考虑上式第一项:\n",
"Consdering about the first term of the upper formula\n",
"\n",
"\n",
"\n",
"\n",
"考虑上式第二项:\n",
"Considering about the second term of the formula:\n",
"\n",
"\n",
"\n",
"将第一项和第二项带入,得到:\n",
"Combine the first and second term together, we get:\n",
"If we let $\\delta_j = - \\frac{\\partial E_d}{\\partial net_j}$, so that a node error term $\\delta$ is the negative of the partial derivative of the network error with respect to the input to this node. Substitute into the upper formula, we get:\n",
"\n",
"\n",
"\n",
"将上述推导带入随机梯度下降公式,得到:\n",
"Put the derive result into random gradient descent formula, we get:\n",
"\n",
"\n"
]
@@ -242,74 +239,71 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5.2 隐藏层权值训练\n",
"### 5.2 The training of the weight in hidden layer\n",
"Firstly, we need to define the set as all the direct downstream node $Downstream(j)$ of node $j$. For exmaple, to node 4, the downstream of it is node 8 and 9. We can see that $net_j$ can noly affect $E__c$ by affecting $Downstream(j)$. Set $net_k$ as the downstream node input of node $j$, then $E_d$ is the function of $net_k$, while $net_k$ is the function of $net_j$. Beacuse there are many $net_k$, we should apply full derivative formula to do the derivative as following:\n",
"At here, we have derive the backward propagation algoritm. One thing to note, the trian rule that we derived just now is according to the activation function of sigmoid function, square sum error, fully connected network, ramdom gradient descent optimization algorithm. If we have diferent activation function, error calculation mode, net connection strucutre and optimization, we will have different training rules. Whatever, it is all the same in the derivation of training rules, we noly need to use the chain rule do the derivaiton.\n"
"We assume that evey training sample is $(\\vec{x}, \\vec{t})$, among them vector $\\vec{x}$ is the characteristic of training sample, while $\\vec{t}$ is the target value of sample.\n",
"Firstly, we use the characteristic of the sample $\\vec{x}$ to calculate the output $a_i$ of every hidden node in neural network and every output value of output layer, according to the algorithm that we introduced in last section. \n",
"\n",
"然后,我们按照下面的方法计算出每个节点的误差项$\\delta_i$:\n",
"Then, we calculate every node error term $\\delta_i$ according to the following method:\n",
"Among them, $\\delta_i$is the error term of node $i$, $y_i$ is the output value of node $i$. For example, according to the upper graph, the output layer node 8 have output value $y_1$, while the target value of smaple is $t_1$, substitute into upper formula we get the error term of node 8:\n",
"Among them, $a_i$ is the output value of node $i$, while $w_{ki}$ is the weight that node $i$ connect to it's next layer $k$. $\\delta_k$ is the next layer error term of node $i$. For example, for hidden layer node 4, the calcultaion method is as following:\n",
"Among them, $w_{ji}$ is the weight from node $i$ to node $j$, $\\eta$ is a constant that represent the learning rate, $\\delta_j$ is the error term of node $j$, while $x_{ji}$ is the output that node $i$ pass to node $j$.\n",
"For example, the update way of weight $w_{84}$ is as following:\n",
"We have introduced the calculation method and weight updating method for each node error term of neural network. Apparently, to calculate the error term of a node, you need to first calculate the error term of each node connected to the next layer. This requires that the error terms be calculated in order from the output layer and then in reverse order for each hidden layer until the hidden layer is connected to the input layer. This is the meaning of the name backward propagation algorithm. When all the node error term are calculated, we can update all the weight according to formula 5.\n",
"Activation function is very important in neural network and it is also important to use activation function. We get to konw the activation funciton in the former section from the perspective of human neurons. The neuron need to propagate backward thorough activation, therefore activation is needed in neural networks, we will understand the necessity of the activation function from math perspective.\n",
"\n",
"比如一个两层的神经网络,使用 A 表示激活函数,那么\n",
"For a to layer neural network network, if we use A represent activation, then:\n",
"\n",
"$$\n",
"y = w_2 A(w_1 x)\n",
"$$\n",
"\n",
"如果我们不使用激活函数,那么神经网络的结果就是\n",
"If we do not use activation function, then the result of neural work is:\n",
"We can see that we combined the two layer neural network parameter together, represented in $\\bar{w}$, so that the two layer network is actually one layer neural network while the parameter changes to new $\\bar{w}$. Therefore, if we do not use activation function, whatever how many layer neural network we have, $y = w_n \\cdots w_2 w_1 x = \\bar{w} x$ is changing into a one layer network, so that we must use activation function in every layer.\n",
"\n",
"最后我们看看激活函数对神经网络的影响\n",
"Finally, let's look at the effects of activation functions on neural networks:\n",
"We can see that when we use the activation, the neurak network can change into any shapes by changing weght, the more complicated neural network can fit more complicate shapes, which is known as the universal approximation theorem for neural networks.The activation function that used in neural network are all nolinear, every time the acitvaiton funciton input a value, we will get a result through a special math calculation. \n",
"When the input $x<0$, the output is $0$, while for $x> 0$, the output is $x$. This activation function make network converge rapaidly. It does not saturate, that is, it can resist gradient disappearance, at least in the positive region ($x> 0$). Therefore, neuron will not back propagate all the zero at at least half of the region. Because we use the simple thresholding, the RelU will have a high calculation efficiency.\n",
"In the network, the different input may contains key characteristic of differrent size, and it will be more flexible if we use the changeable data structure as the container. Assume that the neurons have sparse characteristic, then for different activation way: different numbers(Selective inactivation), different function(Distributed activation). The activation paths generated by the two optimizable structures can better learn the relatively sparse features from the dimension of the effective data and play an automatic de-separation effect. \n",
"In deep neural network, the dependence to nolinear is much less. What's more, sparse characteristic do not require the network have strong processing linear inseparability mechanism. Therefore, in the deep learning model, it is more suitable to use simple, quickly linear activation function. As shown in the figure, once the activation changes linearly from neuron to neuron, the nonlinear part of the network only comes from the partial selective activation of the neuron. \n",
"\n",
"Another reason that we are more prone to use linear activation function is to reduce the Vanishing Gradient Problem when trianing deep netwrok with gradient method.\n",
"Those of you who have seen the BP derivation know that when you calculate the gradient from the back propagation of the error from the output layer, you multiply the input neuron value of the current layer at each layer to get the first derivative of the activation function.\n",
"In this way, when passing through each layer, the Error will decay exponentially. Once the recursive multi-layer back propagation is carried out, the gradient will constantly decay and disappear, making the network learning slow down. The gradient of the corrected activation function is 1, and only one end is saturated. The gradient flows well in the back propagation, and the training speed is greatly improved."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. 示例程序"
"## 7. The demo program"
]
},
{
@@ -2608,7 +2604,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. 如何使用类的方法封装多层神经网络?"
"## 8. How to encapsulate a multi-layer neural network using class methods?"
]
},
{
@@ -4857,7 +4853,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9. 深入分析与问题"
"## 9. In-depth analysis and problems"
]
},
{
@@ -4892,10 +4888,10 @@
"metadata": {},
"source": [
"**问题**\n",
"1. 我们希望得到的每个类别的概率\n",
"2. 如何做多分类问题?\n",
"3. 如何能让神经网络更快的训练好?\n",
"4. 如何更好的构建网络的类定义,从而让神经网络的类支持更多的类型的处理层?"
"1. We want to get the probability of each of these categories\n",
"2. How to do multiple classification problem?\n",
"3. How can you make neural network faster training good?\n",
"4. How to better construct the class definition of the network so that the class of the neural network supports more types of processing layer?"