"We can see that the derivative of sigmoid function is very interesting, it can use sigmoid function itself to represent. In this way, once the value of sigmoid funtion is being calcualted, it is very convenient to calculate the value of its derivative.\n",
"According to the input, we can calculate the output of neural network. We need firstly assign each value of element in the input vector to the correspond neurons in the output layer. Then, according to the formula, 1 we forward calculate the value of each neurons in every layer until the value of neurons in the last output layer is all calculated. Finally, we can get the output vector of $\\vec{y}$ by combining every vlaue of neurons together\n",
"\n",
"\n",
"Next we will list an example to show this process. Noting every element firstly in the neural network is necessary.\n",
"As shown in the upper graph, there are three node in the input layer, we note them 1, 2, 3 in turn. The 4 nodes of the hidden layer are numbered 4, 5, 6, and 7, respectively. The last two node in the output layer is 8 and 9. Because our neural network is fully connected network, so we can see that each node is connected to all the nodes in the previous layer. For example, we can see the node 4 in the hidden layer, they have connection with the three node(1, 2, 3) in the input layer. The weight on the connection is $w_{41}$,$w_{42}$,$w_{43}$ respectively. Then, how can we calculate the output value of node 4?\n",
"In order to calculate the output value of node 4, we must firstly get the output value of all the other upstream node(which is node 1, 2, 3). Node 1, 2, 3 is the input layer node, so that their output value is the input vector $\\vec{x}$ itself. According to the corresponding relationship in the upper graph, we can see that the output value of node 1, 2, 3 is $x_1$,$x_2$,$x_3$ respectively. We want the dimension of input vector is the same with input neurons, while the element of input vector can be free to decide which corresponds to the input nodes. It's also perfectly fine if you want to assign $x_1$ to node 2, however this will have no meaning without mistaking your self.\n",
"\n",
"一旦我们有了节点1、2、3的输出值,我们就可以根据式1计算节点4的输出值$a_4$:\n",
"Once we have the output value of node 1, 2 and 3, we can calculate the output value $a_4$ of node 4 according to formula 1.\n",
"The $w_{4b}$ of above formula is the bias term of node 4, without drawing in the graph. While $w_{41}$,$w_{42}$,$w_{43}$are the weights of node 1, 2, 3 and 4 connections, respectively. When we note weight $w_{ji}$, We put the destination node number $j$ first and the source node number $i$ after.\n",
"Similarly, we can continue to calculate the output value $a_5$,$a_6$,$a_7$ of node 5, 6, 7. In this way, the output values of the four nodes in the hidden layer are calculated and we can calculate the output value of node 8 of the output layer, $y_1$:\n",
"In the same way, we can also calculate the value of $y_2$. Thus, all the output value of the output layer node is calculated. When we get the input vector $\\vec{x} = (x_1, x_2, x_3)^T$, the output vector of neural network is $\\vec{y} = (y_1, y_2)^T$. We also see that the dimension of output vector is the same with the number of neurons.\n",
"\n",
"\n"
]
},
@@ -109,44 +109,43 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. 神经网络的矩阵表示\n",
"## 4. A matrix representation of a neural network\n",
"\n",
"神经网络的计算如果用矩阵来表示会很方便(当然逼格也更高),我们先来看看隐藏层的矩阵表示。\n",
"The calculation of neural network will be very convenient if we use matrix to represent. Let's check the representation of the hidden layer.\n",
"\n",
"首先我们把隐藏层4个节点的计算依次排列出来:\n",
"First, we arrange the calculation of the four nodes of the hidden layer in order:\n",
"Now, we put the four formula that calculated $a_4$, $a_5$,$a_6$,$a_7$ into one matrix, each formula worked as one row of matrix so that we can use matrix to represent their calculation. Let:\n",
"\n",
"\n",
"\n",
"带入前面的一组式子,得到\n",
"Substitute into former group of formula we can get:\n",
"In formula 2, $f$ is the activation function, in this instance, it is $sigmod$ function. $W$ is the weight matrix of one layer. $\\vec{x}$ is the input vector of some layer. $\\vec{a}$ is the output vector of some layer. Formula 2 shows that the function of each layer of the neural network is to first multiply the input vector left by an array for linear transformation to obtain a new vector, and then apply an activation function to this vector element by element.\n",
"The algorithm in each layer is the same. For example, for a neural network that contains one input layer, one output layer and three hidden layer, we assume that their weight matrix is $W_1$,$W_2$,$W_3$,$W_4$ respectively. Every hidden layer output is $\\vec{a}_1$,$\\vec{a}_2$,$\\vec{a}_3$ respectively, the input of neural network is $\\vec{x}$, and the output of neural network is $\\vec{y}$. As shown in the figure below:\n",
"The process of neural network forward calculation is relatively simple, that is, it is ok to keep doing the calculation layer by layer. The dynamic demonstration is shown in the figure below\n",
"\n",
"神经网络正向计算的过程比较简单,就是一层一层不断做运算就可以了,动态的演示如下图所示:\n",
""
]
},
@@ -165,42 +164,41 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. 神经网络的训练 - 反向传播算法\n",
"## 5. The training of neural network - Back propagation algorithm \n",
"Now, we need to know how to get the weight in every connection of neural networks. We can say that neural network is a model, then theses weight is the parameter of the model, which is the thing that model need to learn. However, such parameters as the connection mode of a neural network, the number of layers of the network, and the number of nodes in each layer are not learned, but artificially set in advance. For these parameter that setted in advance, we call it Hyper-Parameters.\n",
"Back propagation algorithm is actually the application of chain rule. However, this simple and obvious method was invented and popularized nearly 30 years after Roseblatt proposed the perceptron algorithm. For this, Bengio answered:\n",
"\n",
"> 很多看似显而易见的想法只有在事后才变得显而易见。\n",
"> Many ideas that seem obvious become obvious only in hindsight.\n",
"According to the general routine of machine learning, we first determine the objective function of the neural network, and then use the stochastic gradient descent optimization algorithm to calculate the parameter value of the minimum objective function\n",
"\n",
"我们取网络所有输出层节点的误差平方和作为目标函数:\n",
"We take the sum of error square of all the output layer in the network as the target function:\n",
"\n",
"\n",
"\n",
"其中,$E_d$表示是样本$d$的误差。\n",
"Among them $E_d$ is the error of sample.\n",
"\n",
"然后,使用随机梯度下降算法对目标函数进行优化:\n",
"After that, we can use random gradient descent method to optimize the target function.\n",
"Observing the upper graph, we find that weight $w_{ji}$ can only affect the other parts of the network through the input of node $j$, set $net_j$ as the weighted input of node $j$, therefore:\n",
"About the derivation of formula $\\frac{\\partial E_d}{\\partial net_j}$, we need to distinguish the two case between input layer and hidden layer.\n",
"For output layer, $net_j$ can only affect the other parts of the net through affecting the output value $net_j$ of node $j$, which means that $E_d$ is the function of $y_j$ while $y_j$ is the function of $net_j$, among them $y_j = sigmod(net_j)$. Therefore we can use chain rule again:\n",
"\n",
"\n",
"\n",
"考虑上式第一项:\n",
"Consdering about the first term of the upper formula\n",
"\n",
"\n",
"\n",
"\n",
"考虑上式第二项:\n",
"Considering about the second term of the formula:\n",
"\n",
"\n",
"\n",
"将第一项和第二项带入,得到:\n",
"Combine the first and second term together, we get:\n",
"If we let $\\delta_j = - \\frac{\\partial E_d}{\\partial net_j}$, so that a node error term $\\delta$ is the negative of the partial derivative of the network error with respect to the input to this node. Substitute into the upper formula, we get:\n",
"\n",
"\n",
"\n",
"将上述推导带入随机梯度下降公式,得到:\n",
"Put the derive result into random gradient descent formula, we get:\n",
"\n",
"\n"
]
@@ -242,74 +239,71 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5.2 隐藏层权值训练\n",
"### 5.2 The training of the weight in hidden layer\n",
"Firstly, we need to define the set as all the direct downstream node $Downstream(j)$ of node $j$. For exmaple, to node 4, the downstream of it is node 8 and 9. We can see that $net_j$ can noly affect $E__c$ by affecting $Downstream(j)$. Set $net_k$ as the downstream node input of node $j$, then $E_d$ is the function of $net_k$, while $net_k$ is the function of $net_j$. Beacuse there are many $net_k$, we should apply full derivative formula to do the derivative as following:\n",
"At here, we have derive the backward propagation algoritm. One thing to note, the trian rule that we derived just now is according to the activation function of sigmoid function, square sum error, fully connected network, ramdom gradient descent optimization algorithm. If we have diferent activation function, error calculation mode, net connection strucutre and optimization, we will have different training rules. Whatever, it is all the same in the derivation of training rules, we noly need to use the chain rule do the derivaiton.\n"
"We assume that evey training sample is $(\\vec{x}, \\vec{t})$, among them vector $\\vec{x}$ is the characteristic of training sample, while $\\vec{t}$ is the target value of sample.\n",
"Firstly, we use the characteristic of the sample $\\vec{x}$ to calculate the output $a_i$ of every hidden node in neural network and every output value of output layer, according to the algorithm that we introduced in last section. \n",
"\n",
"然后,我们按照下面的方法计算出每个节点的误差项$\\delta_i$:\n",
"Then, we calculate every node error term $\\delta_i$ according to the following method:\n",
"Among them, $\\delta_i$is the error term of node $i$, $y_i$ is the output value of node $i$. For example, according to the upper graph, the output layer node 8 have output value $y_1$, while the target value of smaple is $t_1$, substitute into upper formula we get the error term of node 8:\n",
"Among them, $a_i$ is the output value of node $i$, while $w_{ki}$ is the weight that node $i$ connect to it's next layer $k$. $\\delta_k$ is the next layer error term of node $i$. For example, for hidden layer node 4, the calcultaion method is as following:\n",
"Among them, $w_{ji}$ is the weight from node $i$ to node $j$, $\\eta$ is a constant that represent the learning rate, $\\delta_j$ is the error term of node $j$, while $x_{ji}$ is the output that node $i$ pass to node $j$.\n",
"For example, the update way of weight $w_{84}$ is as following:\n",
"We have introduced the calculation method and weight updating method for each node error term of neural network. Apparently, to calculate the error term of a node, you need to first calculate the error term of each node connected to the next layer. This requires that the error terms be calculated in order from the output layer and then in reverse order for each hidden layer until the hidden layer is connected to the input layer. This is the meaning of the name backward propagation algorithm. When all the node error term are calculated, we can update all the weight according to formula 5.\n",
"Activation function is very important in neural network and it is also important to use activation function. We get to konw the activation funciton in the former section from the perspective of human neurons. The neuron need to propagate backward thorough activation, therefore activation is needed in neural networks, we will understand the necessity of the activation function from math perspective.\n",
"\n",
"比如一个两层的神经网络,使用 A 表示激活函数,那么\n",
"For a to layer neural network network, if we use A represent activation, then:\n",
"\n",
"$$\n",
"y = w_2 A(w_1 x)\n",
"$$\n",
"\n",
"如果我们不使用激活函数,那么神经网络的结果就是\n",
"If we do not use activation function, then the result of neural work is:\n",
"We can see that we combined the two layer neural network parameter together, represented in $\\bar{w}$, so that the two layer network is actually one layer neural network while the parameter changes to new $\\bar{w}$. Therefore, if we do not use activation function, whatever how many layer neural network we have, $y = w_n \\cdots w_2 w_1 x = \\bar{w} x$ is changing into a one layer network, so that we must use activation function in every layer.\n",
"\n",
"最后我们看看激活函数对神经网络的影响\n",
"Finally, let's look at the effects of activation functions on neural networks:\n",
"We can see that when we use the activation, the neurak network can change into any shapes by changing weght, the more complicated neural network can fit more complicate shapes, which is known as the universal approximation theorem for neural networks.The activation function that used in neural network are all nolinear, every time the acitvaiton funciton input a value, we will get a result through a special math calculation. \n",
"When the input $x<0$, the output is $0$, while for $x> 0$, the output is $x$. This activation function make network converge rapaidly. It does not saturate, that is, it can resist gradient disappearance, at least in the positive region ($x> 0$). Therefore, neuron will not back propagate all the zero at at least half of the region. Because we use the simple thresholding, the RelU will have a high calculation efficiency.\n",
"In the network, the different input may contains key characteristic of differrent size, and it will be more flexible if we use the changeable data structure as the container. Assume that the neurons have sparse characteristic, then for different activation way: different numbers(Selective inactivation), different function(Distributed activation). The activation paths generated by the two optimizable structures can better learn the relatively sparse features from the dimension of the effective data and play an automatic de-separation effect. \n",
"In deep neural network, the dependence to nolinear is much less. What's more, sparse characteristic do not require the network have strong processing linear inseparability mechanism. Therefore, in the deep learning model, it is more suitable to use simple, quickly linear activation function. As shown in the figure, once the activation changes linearly from neuron to neuron, the nonlinear part of the network only comes from the partial selective activation of the neuron. \n",
"\n",
"Another reason that we are more prone to use linear activation function is to reduce the Vanishing Gradient Problem when trianing deep netwrok with gradient method.\n",
"Those of you who have seen the BP derivation know that when you calculate the gradient from the back propagation of the error from the output layer, you multiply the input neuron value of the current layer at each layer to get the first derivative of the activation function.\n",
"In this way, when passing through each layer, the Error will decay exponentially. Once the recursive multi-layer back propagation is carried out, the gradient will constantly decay and disappear, making the network learning slow down. The gradient of the corrected activation function is 1, and only one end is saturated. The gradient flows well in the back propagation, and the training speed is greatly improved."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. 示例程序"
"## 7. The demo program"
]
},
{
@@ -2608,7 +2604,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. 如何使用类的方法封装多层神经网络?"
"## 8. How to encapsulate a multi-layer neural network using class methods?"
]
},
{
@@ -4857,7 +4853,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9. 深入分析与问题"
"## 9. In-depth analysis and problems"
]
},
{
@@ -4892,10 +4888,10 @@
"metadata": {},
"source": [
"**问题**\n",
"1. 我们希望得到的每个类别的概率\n",
"2. 如何做多分类问题?\n",
"3. 如何能让神经网络更快的训练好?\n",
"4. 如何更好的构建网络的类定义,从而让神经网络的类支持更多的类型的处理层?"
"1. We want to get the probability of each of these categories\n",
"2. How to do multiple classification problem?\n",
"3. How can you make neural network faster training good?\n",
"4. How to better construct the class definition of the network so that the class of the neural network supports more types of processing layer?"