Improve README.md, add some description

6 years ago · 45706b71e2
--- a/0_python/README.md
+++ b/0_python/README.md
@@ -1,92 +1,37 @@
 # Python-Lectures  
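 # A Concise Python Tutorial (Learn Python in 90 Minutes)
 Python is a general-purpose scripting language that is easy to pick up yet powerful. Its very rich ecosystem of libraries covers web development, software development, big-data analysis, web crawling, machine learning, and more. Python's main strength is that it lets you express most work the way people think about it, and most tasks can be completed quickly with ready-made libraries. The resulting programs may not run especially fast, but the time spent designing, writing, and debugging them drops sharply, which makes Python well suited to rapid trial and error.
 This tutorial is based on [IPython Notebooks to learn Python](https://github.com/rajathkmp/Python-Lectures), with some of the example code converted to Python 3. For installing Python, look up material online or see [Installing the Python environment](../tips/InstallPython.md).
 ## Contents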
 0. [Introduction](0_Introduction.ipynb)
    - Install ipython
 1. [Basics](1_Basics.ipynb)
    - Why Python, Zen of Python
    - Variables, Operators, Built-in functions
 2. [Print statement](2_Print_Statement.ipynb)
    - Tips of print
 3. [Data structure - 1](3_Data_Structure_1.ipynb)
    - Lists, Tuples, Sets
 4. [Data structure - 2](4_Data_Structure_2.ipynb)
    - Strings, Dictionaries
 5. [Control flow](5_Control_Flow.ipynb)
    - if, else, elif, for, while, break, continue
 6. [Functions](6_Function.ipynb)
    - Function define, return, arguments
    - Global and local variables
    - Lambda functions
 7. [Class](7_Class.ipynb)
    - Class define
    - Inheritance
 ## References
 * [Installing the Python environment](../tips/InstallPython.md)
 * [IPython Notebooks to learn Python](https://github.com/rajathkmp/Python-Lectures)
 * [廖雪峰的Python教程](https://www.liaoxuefeng.com/wiki/1016959663602400)
 * [智能系统实验室入门教程-Python](https://gitee.com/pi-lab/SummerCamp/tree/master/python)
 * [Python tips](../tips/python)
 Note: [Andreas Ernst](http://users.monash.edu/~andrease/) has improvised and updated the repo to python 3, [Link](https://gitlab.erc.monash.edu.au/andrease/Python4Maths/tree/master)
 ## Introduction
 Python is a modern, robust, high level programming language. It is very easy to pick up even if you are completely new to programming.
 ## Installation
 Mac OS X and Linux come with Python preinstalled. Windows users can download Python from https://www.python.org/downloads/ .
 To install IPython run,
    $ pip install ipython[all]
 This will install all the necessary dependencies for the notebook, qtconsole, tests etc.
 ### Installation from unofficial distributions
 Installing all the necessary libraries might prove troublesome. Anaconda and Canopy come prepackaged with all the necessary Python libraries and also IPython.
 #### Anaconda
 Download Anaconda from https://www.continuum.io/downloads
 Anaconda is completely free and includes more than 300 python packages. Both python 2.7 and 3.4 options are available.
 #### Canopy
 Download Canopy from https://store.enthought.com/downloads/#default
 Canopy has a premium version which offers 300+ python packages. But the free version works just fine. Canopy as of now supports only 2.7 but it comes with its own text editor and IPython environment.
 ## Launching IPython Notebook
 From the terminal
    ipython notebook
 In Canopy and Anaconda, open the respective terminals and execute the above.
 ## How to learn from this resource?
 You can download the pdf copy from here : [Get Started with Python](https://github.com/rajathkumarmp/Python-Lectures/blob/master/Python.pdf)
 It is better to download all the ipython notebooks from this repository https://github.com/rajathkumarmp/Python-Lectures and learn it on the notebook itself rather than having to refer to a pdf.
 Launch ipython notebook from the folder which contains the notebooks. Open each one of them
    Cell > All Output > Clear
 This will clear all the outputs and now you can understand each statement and learn interactively.
 ## Table of contents
 [00 - Introduction and Installation](http://nbviewer.ipython.org/github/rajathkumarmp/Python-Lectures/blob/master/00.ipynb)
 [01 - Variable, Operators and Built-in Functions](http://nbviewer.ipython.org/github/rajathkumarmp/Python-Lectures/blob/master/01.ipynb)
 [02 - Print Statement, Precision and FieldWidth](http://nbviewer.ipython.org/github/rajathkumarmp/Python-Lectures/blob/master/02.ipynb)
 [03 - Lists, Tuples and Sets](http://nbviewer.ipython.org/github/rajathkumarmp/Python-Lectures/blob/master/03.ipynb)
 [04 - Strings and Dictionaries](http://nbviewer.ipython.org/github/rajathkumarmp/Python-Lectures/blob/master/04.ipynb)
 [05 - Control Flow Statements](http://nbviewer.ipython.org/github/rajathkumarmp/Python-Lectures/blob/master/05.ipynb)
 [06 - Functions](http://nbviewer.ipython.org/github/rajathkumarmp/Python-Lectures/blob/master/06.ipynb)
 [07 - Classes](http://nbviewer.ipython.org/github/rajathkumarmp/Python-Lectures/blob/master/07.ipynb)
 These are online read-only versions.
 ## License
 This work is licensed under the Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/
--- a/1_numpy_matplotlib_scipy_sympy/README.md
+++ b/1_numpy_matplotlib_scipy_sympy/README.md
@@ -3,11 +3,13 @@
 ## Contents
 * [numpy tutorial](numpy_tutorial.ipynb)
 * matplotlib
  * [matplotlib in depth](matplotlib_full.ipynb)
  * [matplotlib simple tutorial](matplotlib_simple_tutorial.ipynb)
    - [matplotlib in depth](matplotlib_full.ipynb)
    - [matplotlib simple tutorial](matplotlib_simple_tutorial.ipynb)
 * [scipy](scipy_tutorial.ipynb)
 * [sympy](sympy_tutorial.ipynb)
 * [git introduction](utils_git.ipynb)
 * [git workflow](utils_git_advanced.ipynb)
 * [shell](utils_shell.ipynb)
 ## References
 * [手把手教你用Python做数据可视化](https://mp.weixin.qq.com/s/3Gwdjw8trwTR5uyr4G7EOg)
--- a/1_numpy_matplotlib_scipy_sympy/matplotlib_ani.py
+++ b/1_numpy_matplotlib_scipy_sympy/matplotlib_ani.py
@@ -1,87 +0,0 @@
 # ---
 # jupyter:
 #   jupytext_format_version: '1.2'
 #   kernelspec:
 #     display_name: Python 3
 #     language: python
 #     name: python3
 #   language_info:
 #     codemirror_mode:
 #       name: ipython
 #       version: 3
 #     file_extension: .py
 #     mimetype: text/x-python
 #     name: python
 #     nbconvert_exporter: python
 #     pygments_lexer: ipython3
 #     version: 3.5.2
 # ---
 # ## Matplotlib Animation
 #
 # ## Method 1
 # +
 # %matplotlib inline
 import numpy as np
 import matplotlib.pyplot as plt
 from matplotlib import animation, rc
 from IPython.display import HTML
 # First set up the figure, the axis, and the plot element we want to animate
 fig, ax = plt.subplots()
 ax.set_xlim(( 0, 2))
 ax.set_ylim((-2, 2))
 line, = ax.plot([], [], lw=2)
 # +
 # initialization function: plot the background of each frame
 def init():
    line.set_data([], [])
    return (line,)
 # animation function. This is called sequentially
 def animate(i):
    x = np.linspace(0, 2, 1000)
    y = np.sin(2 * np.pi * (x - 0.01 * i))
    line.set_data(x, y)
    return (line,)
 # call the animator. blit=True means only re-draw the parts that have changed.
 anim = animation.FuncAnimation(fig, animate, init_func=init,
                               frames=100, interval=20, blit=True)
 HTML(anim.to_html5_video())
 # -
 # ## Method 2
 # +
 # %matplotlib nbagg
 import numpy as np
 import matplotlib.pyplot as plt
 import matplotlib.animation as animation
 fig = plt.figure()
 x = np.arange(0, 10, 0.1)
 ims = []
 for a in range(50):
    y = np.sin(x - a)
    im = plt.plot(x, y, "r")
    ims.append(im)
 ani = animation.ArtistAnimation(fig, ims)
 plt.show()
--- a/1_numpy_matplotlib_scipy_sympy/matplotlib_simple_tutorial.py
+++ b/1_numpy_matplotlib_scipy_sympy/matplotlib_simple_tutorial.py
@@ -1,125 +0,0 @@
 # -*- coding: utf-8 -*-
 # ---
 # jupyter:
 #   jupytext_format_version: '1.2'
 #   kernelspec:
 #     display_name: Python 3
 #     language: python
 #     name: python3
 #   language_info:
 #     codemirror_mode:
 #       name: ipython
 #       version: 3
 #     file_extension: .py
 #     mimetype: text/x-python
 #     name: python
 #     nbconvert_exporter: python
 #     pygments_lexer: ipython3
 #     version: 3.5.2
 # ---
 # # matplotlib
 #
 #
 # ## 1. pyplot
 # matplotlib.pyplot is a collection of command style functions that make matplotlib work like MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc. In matplotlib.pyplot various states are preserved across function calls, so that it keeps track of things like the current figure and plotting area, and the plotting functions are directed to the current axes (please note that “axes” here and in most places in the documentation refers to the axes part of a figure and not the strict mathematical term for more than one axis).
 # +
 # This line configures matplotlib to show figures embedded in the notebook, 
 # instead of opening a new window for each figure. More about that later. 
 # If you are using an old version of IPython, try using '%pylab inline' instead.
 # %matplotlib inline
 import matplotlib.pyplot as plt
 plt.plot([1,2,3,4])
 plt.ylabel('some numbers')
 plt.show()
 # -
 plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
 # For every x, y pair of arguments, there is an optional third argument which is the format string that indicates the color and line type of the plot. The letters and symbols of the format string are from MATLAB, and you concatenate a color string with a line style string. The default format string is 'b-', which is a solid blue line. For example, to plot the above with red circles, you would issue
 import matplotlib.pyplot as plt
 plt.plot([1,2,3,4], [1,4,9,16], 'ro')
 plt.axis([0, 6, 0, 20])
 plt.show()
 # +
 import numpy as np
 import matplotlib.pyplot as plt
 # evenly sampled time at 200ms intervals
 t = np.arange(0., 5., 0.2)
 # red dashes, blue squares and green triangles
 plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
 plt.show()
 # -
 # ### [Controlling line properties](https://matplotlib.org/users/pyplot_tutorial.html#controlling-line-properties)
 #
 # Lines have many attributes that you can set: linewidth, dash style, antialiased, etc; see matplotlib.lines.Line2D. There are several ways to set line properties
 #
 # ### Working with multiple figures and axes
 #
 # MATLAB, and pyplot, have the concept of the current figure and the current axes. All plotting commands apply to the current axes. The function gca() returns the current axes (a matplotlib.axes.Axes instance), and gcf() returns the current figure (matplotlib.figure.Figure instance). Normally, you don’t have to worry about this, because it is all taken care of behind the scenes. Below is a script to create two subplots.
 #
 #
 # +
 import numpy as np
 import matplotlib.pyplot as plt
 def f(t):
    return np.exp(-t) * np.cos(2*np.pi*t)
 t1 = np.arange(0.0, 5.0, 0.1)
 t2 = np.arange(0.0, 5.0, 0.02)
 plt.figure(1)
 plt.subplot(211)
 plt.plot(t1, f(t1), 'bo', t2, f(t2), 'k')
 plt.subplot(212)
 plt.plot(t2, np.cos(2*np.pi*t2), 'r--')
 plt.show()
 # -
 # ## 2. Image 
 # +
 import matplotlib.pyplot as plt
 import matplotlib.image as mpimg
 import numpy as np
 # load image
 img=mpimg.imread('example.png')
 imgplot = plt.imshow(img)
 # -
 # ### Applying pseudocolor schemes to image plots
 lum_img = img[:,:,0]
 plt.imshow(lum_img)
 # use 'hot' color map
 plt.imshow(lum_img, cmap="hot")
 plt.colorbar()
 # ### Examining a specific data range
 #
 plt.hist(lum_img.ravel(), bins=256, range=(0.0, 1.0), fc='k', ec='k')
 # ## References
 #
 #
 # * [Pyplot tutorial](https://matplotlib.org/users/pyplot_tutorial.html)
 # * [Image tutorial](https://matplotlib.org/users/image_tutorial.html)
 # * [手把手教你用Python做数据可视化](https://mp.weixin.qq.com/s/3Gwdjw8trwTR5uyr4G7EOg)
--- a/1_numpy_matplotlib_scipy_sympy/utils_git.ipynb
+++ b/1_numpy_matplotlib_scipy_sympy/utils_git.ipynb
@@ -275,7 +275,25 @@
   ]
  }
 ],
 "metadata": {},
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
--- a/1_numpy_matplotlib_scipy_sympy/utils_git_advanced.ipynb
+++ b/1_numpy_matplotlib_scipy_sympy/utils_git_advanced.ipynb
@@ -405,7 +405,25 @@
   ]
  }
 ],
 "metadata": {},
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
--- a/1_numpy_matplotlib_scipy_sympy/utils_shell.ipynb
+++ b/1_numpy_matplotlib_scipy_sympy/utils_shell.ipynb
@@ -353,7 +353,25 @@
   ]
  }
 ],
 "metadata": {},
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
--- a/2_knn/knn_classification.py
+++ b/2_knn/knn_classification.py
@@ -1,249 +0,0 @@
 # -*- coding: utf-8 -*-
 # ---
 # jupyter:
 #   jupytext_format_version: '1.2'
 #   kernelspec:
 #     display_name: Python 3
 #     language: python
 #     name: python3
 #   language_info:
 #     codemirror_mode:
 #       name: ipython
 #       version: 3
 #     file_extension: .py
 #     mimetype: text/x-python
 #     name: python
 #     nbconvert_exporter: python
 #     pygments_lexer: ipython3
 #     version: 3.5.2
 # ---
 # # kNN Classification
 #
 #
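 # The kNN (nearest-neighbor) rule is mainly used to recognize unknown objects, that is, to decide which class an unknown object belongs to. The idea is to use a distance measure, typically the Euclidean distance, to find which class of known objects has features closest to those of the unknown object.
 #
 # The k-Nearest Neighbor (kNN) classification algorithm is a theoretically mature method and one of the simplest machine learning algorithms. The idea is: if most of the k samples most similar to a given sample (its nearest neighbors in feature space) belong to one class, the sample belongs to that class too. The neighbors are all objects that have already been classified correctly. The classification decision depends only on the classes of the one or few nearest samples. Although kNN relies on limit theorems in principle, the decision itself involves only a very small number of neighboring samples. Because kNN decides class membership from a limited set of nearby samples rather than from discriminating class regions, it suits sample sets whose class regions cross or overlap heavily better than other methods do.
 #
 # kNN can be used for regression as well as classification. Find the k nearest neighbors of a sample and assign the average of their attribute values to the sample. A more useful variant gives neighbors at different distances different weights, for example a weight inversely proportional to the distance.
 #
 # A major weakness in classification arises with imbalanced data: when one class is very large and the others are small, the K neighbors of a new sample may be dominated by the large class. The algorithm is meant to consider only the nearest neighbors: a large class is either close to the target sample or it is not, and the number of samples alone should not decide the result. Weighting the neighbors, with larger weights for smaller distances, mitigates this. Another weakness is the computational cost: for every sample to be classified, the distance to all known samples must be computed to find its K nearest neighbors. A common remedy is to edit the known samples in advance and remove those that contribute little to classification. The algorithm suits automatic classification of classes with many samples; classes with few samples are more easily misclassified.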
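The distance-weighted vote mentioned above can be sketched in a few lines. This is a minimal illustration, not code from the notebook: the function name `weighted_knn_predict`, the inverse-distance weight `1 / (dist + eps)`, and the toy data are all assumptions chosen for the example (`math.dist` needs Python 3.8 or later).

```python
import math
from collections import defaultdict

def weighted_knn_predict(train_points, train_labels, query, k=3, eps=1e-9):
    """Classify `query` by a distance-weighted vote among its k nearest neighbours."""
    # Pair each training point's Euclidean distance to the query with its label,
    # then sort so the nearest neighbours come first.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Closer neighbours get larger weights (inverse distance);
    # eps avoids division by zero when the query coincides with a training point.
    votes = defaultdict(float)
    for dist, label in distances[:k]:
        votes[label] += 1.0 / (dist + eps)
    return max(votes, key=votes.get)

# Imbalanced toy data: class 1 has more samples, but the query sits next to class 0.
points = [(0.0, 0.0), (0.2, 0.1), (2.0, 2.0), (2.1, 2.0), (2.0, 2.2), (2.2, 2.1)]
labels = [0, 0, 1, 1, 1, 1]
prediction = weighted_knn_predict(points, labels, query=(0.5, 0.5), k=5)
print(prediction)
```

With `k=5` an unweighted majority vote would pick class 1 (three of the five neighbours), while the weighted vote favours the two much closer class 0 samples.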
 #
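 # k-NN is arguably the most direct way to classify unknown data. The figure and explanation below show what k-NN does.
 # ![knn](images/knn.png)
 #
 # Put simply, k-NN works like this: you have a set of data whose classes are already known. When a new data point arrives, compute its distance to every point in the training data, pick the K training points closest to the new point, look at their classes, and assign the new point to the majority class.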
 #
 #
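 # Algorithm steps:
 #
 # * step.1---initialize the distance to the maximum value
 # * step.2---compute the distance dist between the unknown sample and each training sample
 # * step.3---get the largest distance maxdist among the current K nearest samples
 # * step.4---if dist is less than maxdist, take this training sample as one of the K nearest neighbors
 # * step.5---repeat steps 2, 3, and 4 until the distances from the unknown sample to all training samples have been computed
 # * step.6---count how many times each class label occurs among the K nearest neighbors
 # * step.7---choose the most frequent class label as the class label of the unknown sample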
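The steps above can be sketched with a bounded heap that always holds the K nearest samples seen so far. This is one possible reading of the steps, not the notebook's implementation: the name `knn_classify`, the use of `heapq`, and the toy data are assumptions made for illustration.

```python
import heapq
from collections import Counter

def knn_classify(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to `query`."""
    # Max-heap emulated with negated distances: nearest[0] is the farthest
    # of the k nearest samples found so far (its distance is "maxdist").
    nearest = []
    for point, label in zip(train_points, train_labels):
        # Step 2: squared Euclidean distance (same ordering as the true distance).
        dist = sum((a - b) ** 2 for a, b in zip(point, query))
        if len(nearest) < k:
            # Step 1 equivalent: while fewer than k samples are stored, accept any.
            heapq.heappush(nearest, (-dist, label))
        elif dist < -nearest[0][0]:
            # Steps 3-4: the sample is closer than maxdist, so it replaces the farthest.
            heapq.heapreplace(nearest, (-dist, label))
    # Steps 6-7: count the labels and return the most frequent one.
    counts = Counter(label for _, label in nearest)
    return counts.most_common(1)[0][0]

# Two small, well-separated classes (made-up values).
points = [(10, 15), (11, 14), (9, 16), (20, 5), (21, 4), (19, 6)]
labels = [0, 0, 0, 1, 1, 1]
print(knn_classify(points, labels, query=(12, 13)))
print(knn_classify(points, labels, query=(18, 7)))
```

Keeping only K candidates avoids sorting all distances, which is the point of steps 3 and 4; for small data sets a full `argsort` is simpler.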
 # +
 # %matplotlib inline
 import numpy as np
 import matplotlib.pyplot as plt
 # generate sample data
 n = 100
 x_1_1 = 10 + (np.random.rand(n, 1)*2 -1)*4
 x_1_2 = 15 + (np.random.rand(n, 1)*2 -1)*4
 x1 = np.concatenate((x_1_1, x_1_2), axis=1)
 y1 = np.zeros([n, 1])
 x_2_1 = 20 + (np.random.rand(n, 1)*2 -1)*4
 x_2_2 = 5 + (np.random.rand(n, 1)*2 -1)*4
 x2 = np.concatenate((x_2_1, x_2_2), axis=1)
 y2 = np.ones([n, 1])
 x = np.concatenate((x1, x2), axis=0)
 y = np.concatenate((y1, y2), axis=0)
 y = y.flatten()
 print(y.shape)
 # draw sample data
 plt.scatter(x[:,0], x[:,1], c=y)
 plt.show()
 # +
 # generate test data
 x_test = np.array([[12.5, 10.0], [15.4, 8.0]])
 k = 5
 # do knn
 for s in x_test:
    d = np.sum((s - x)**2, axis=1)
    idx = np.argsort(d)
    ys_k = list(y[idx[:k]])    # labels of the k nearest samples
    print(ys_k)
    # TODO: you need to implement the vote algorithm
 # -
 # ## Program
 # +
 import numpy as np
 import operator
 class KNN(object):
    def __init__(self, k=3):
        self.k = k
    def fit(self, x, y):
        self.x = x
        self.y = y
    def _square_distance(self, v1, v2):
        return np.sum(np.square(v1-v2))
    def _vote(self, ys):
        ys_unique = np.unique(ys)
        vote_dict = {}
        for y in ys:
            if y not in vote_dict.keys():
                vote_dict[y] = 1
            else:
                vote_dict[y] += 1
        sorted_vote_dict = sorted(vote_dict.items(), key=operator.itemgetter(1), reverse=True)
        return sorted_vote_dict[0][0]
    def predict(self, x):
        y_pred = []
        for i in range(len(x)):
            dist_arr = [self._square_distance(x[i], self.x[j]) for j in range(len(self.x))]
            sorted_index = np.argsort(dist_arr)
            top_k_index = sorted_index[:self.k]
            y_pred.append(self._vote(ys=self.y[top_k_index]))
        return np.array(y_pred)
    def score(self, y_true=None, y_pred=None):
        if y_true is None and y_pred is None:
            y_pred = self.predict(self.x)
            y_true = self.y
        score = 0.0
        for i in range(len(y_true)):
            if y_true[i] == y_pred[i]:
                score += 1
        score /= len(y_true)
        return score
 # +
 # %matplotlib inline
 import numpy as np
 import matplotlib.pyplot as plt
 # data generation
 np.random.seed(314)
 data_size_1 = 300
 x1_1 = np.random.normal(loc=5.0, scale=1.0, size=data_size_1)
 x2_1 = np.random.normal(loc=4.0, scale=1.0, size=data_size_1)
 y_1 = [0 for _ in range(data_size_1)]
 data_size_2 = 400
 x1_2 = np.random.normal(loc=10.0, scale=2.0, size=data_size_2)
 x2_2 = np.random.normal(loc=8.0, scale=2.0, size=data_size_2)
 y_2 = [1 for _ in range(data_size_2)]
 x1 = np.concatenate((x1_1, x1_2), axis=0)
 x2 = np.concatenate((x2_1, x2_2), axis=0)
 x = np.hstack((x1.reshape(-1,1), x2.reshape(-1,1)))
 y = np.concatenate((y_1, y_2), axis=0)
 data_size_all = data_size_1+data_size_2
 shuffled_index = np.random.permutation(data_size_all)
 x = x[shuffled_index]
 y = y[shuffled_index]
 split_index = int(data_size_all*0.7)
 x_train = x[:split_index]
 y_train = y[:split_index]
 x_test = x[split_index:]
 y_test = y[split_index:]
 # visualize data
 plt.scatter(x_train[:,0], x_train[:,1], c=y_train, marker='.')
 plt.title("train data")
 plt.show()
 plt.scatter(x_test[:,0], x_test[:,1], c=y_test, marker='.')
 plt.title("test data")
 plt.show()
 # +
 # data preprocessing: min-max scale both sets with statistics from the training set
 x_min = np.min(x_train, axis=0)
 x_max = np.max(x_train, axis=0)
 x_train = (x_train - x_min) / (x_max - x_min)
 x_test = (x_test - x_min) / (x_max - x_min)
 # knn classifier
 clf = KNN(k=3)
 clf.fit(x_train, y_train)
 print('train accuracy: {:.3}'.format(clf.score()))
 y_test_pred = clf.predict(x_test)
 print('test accuracy: {:.3}'.format(clf.score(y_test, y_test_pred)))
 # -
 # ## sklearn program
 # +
 # %matplotlib inline
 import matplotlib.pyplot as plt
 from sklearn import datasets, neighbors, linear_model
 # load data
 digits = datasets.load_digits()
 X_digits = digits.data
 y_digits = digits.target
 print("Feature dimensions: ", X_digits.shape)
 print("Label dimensions:   ", y_digits.shape)
 # +
 # plot sample images
 nplot = 10
 fig, axes = plt.subplots(nrows=1, ncols=nplot)
 for i in range(nplot):
    img = X_digits[i].reshape(8, 8)
    axes[i].imshow(img)
    axes[i].set_title(y_digits[i])
 # +
 # split train / test data
 n_samples = len(X_digits)
 n_train = int(0.4 * n_samples)
 X_train = X_digits[:n_train]
 y_train = y_digits[:n_train]
 X_test = X_digits[n_train:]
 y_test = y_digits[n_train:]
 # +
 # do KNN classification
 knn = neighbors.KNeighborsClassifier()
 logistic = linear_model.LogisticRegression()
 print('KNN score: %f' % knn.fit(X_train, y_train).score(X_test, y_test))
 print('LogisticRegression score: %f' % logistic.fit(X_train, y_train).score(X_test, y_test))
 # -
 # ## References
 # * [Digits Classification Exercise](http://scikit-learn.org/stable/auto_examples/exercises/plot_digits_classification_exercise.html)
 # * [knn算法的原理与实现](https://zhuanlan.zhihu.com/p/36549000)
--- a/3_kmeans/ClusteringAlgorithms.py
+++ b/3_kmeans/ClusteringAlgorithms.py
@@ -1,191 +0,0 @@
 # -*- coding: utf-8 -*-
 # ---
 # jupyter:
 #   jupytext_format_version: '1.2'
 #   kernelspec:
 #     display_name: Python 3
 #     language: python
 #     name: python3
 #   language_info:
 #     codemirror_mode:
 #       name: ipython
 #       version: 3
 #     file_extension: .py
 #     mimetype: text/x-python
 #     name: python
 #     nbconvert_exporter: python
 #     pygments_lexer: ipython3
 #     version: 3.5.2
 # ---
 # # Comparing different clustering algorithms on toy datasets
 #
 # This example shows characteristics of different clustering algorithms on datasets that are “interesting” but still in 2D. With the exception of the last dataset, the parameters of each of these dataset-algorithm pairs have been tuned to produce good clustering results. Some algorithms are more sensitive to parameter values than others.
 # The last dataset is an example of a ‘null’ situation for clustering: the data is homogeneous, and there is no good clustering. For this example, the null dataset uses the same parameters as the dataset in the row above it, which represents a mismatch in the parameter values and the data structure.
 # While these examples give some intuition about the algorithms, this intuition might not apply to very high dimensional data.
 # +
 # %matplotlib inline
 import time
 import warnings
 import numpy as np
 import matplotlib.pyplot as plt
 from sklearn import cluster, datasets, mixture
 from sklearn.neighbors import kneighbors_graph
 from sklearn.preprocessing import StandardScaler
 from itertools import cycle, islice
 np.random.seed(0)
 # ============
 # Generate datasets. We choose the size big enough to see the scalability
 # of the algorithms, but not too big to avoid too long running times
 # ============
 n_samples = 1500
 noisy_circles = datasets.make_circles(n_samples=n_samples, factor=.5,
                                      noise=.05)
 noisy_moons = datasets.make_moons(n_samples=n_samples, noise=.05)
 blobs = datasets.make_blobs(n_samples=n_samples, random_state=8)
 no_structure = np.random.rand(n_samples, 2), None
 # Anisotropically distributed data
 random_state = 170
 X, y = datasets.make_blobs(n_samples=n_samples, random_state=random_state)
 transformation = [[0.6, -0.6], [-0.4, 0.8]]
 X_aniso = np.dot(X, transformation)
 aniso = (X_aniso, y)
 # blobs with varied variances
 varied = datasets.make_blobs(n_samples=n_samples,
                             cluster_std=[1.0, 2.5, 0.5],
                             random_state=random_state)
 # ============
 # Set up cluster parameters
 # ============
 plt.figure(figsize=(9 * 2 + 3, 12.5))
 plt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.96, wspace=.05,
                    hspace=.01)
 plot_num = 1
 default_base = {'quantile': .3,
                'eps': .3,
                'damping': .9,
                'preference': -200,
                'n_neighbors': 10,
                'n_clusters': 3}
 datasets = [
    (noisy_circles, {'damping': .77, 'preference': -240,
                     'quantile': .2, 'n_clusters': 2}),
    (noisy_moons, {'damping': .75, 'preference': -220, 'n_clusters': 2}),
    (varied, {'eps': .18, 'n_neighbors': 2}),
    (aniso, {'eps': .15, 'n_neighbors': 2}),
    (blobs, {}),
    (no_structure, {})]
 for i_dataset, (dataset, algo_params) in enumerate(datasets):
    # update parameters with dataset-specific values
    params = default_base.copy()
    params.update(algo_params)
    X, y = dataset
    # normalize dataset for easier parameter selection
    X = StandardScaler().fit_transform(X)
    # estimate bandwidth for mean shift
    bandwidth = cluster.estimate_bandwidth(X, quantile=params['quantile'])
    # connectivity matrix for structured Ward
    connectivity = kneighbors_graph(
        X, n_neighbors=params['n_neighbors'], include_self=False)
    # make connectivity symmetric
    connectivity = 0.5 * (connectivity + connectivity.T)
    # ============
    # Create cluster objects
    # ============
    ms = cluster.MeanShift(bandwidth=bandwidth, bin_seeding=True)
    two_means = cluster.MiniBatchKMeans(n_clusters=params['n_clusters'])
    ward = cluster.AgglomerativeClustering(
        n_clusters=params['n_clusters'], linkage='ward',
        connectivity=connectivity)
    spectral = cluster.SpectralClustering(
        n_clusters=params['n_clusters'], eigen_solver='arpack',
        affinity="nearest_neighbors")
    dbscan = cluster.DBSCAN(eps=params['eps'])
    affinity_propagation = cluster.AffinityPropagation(
        damping=params['damping'], preference=params['preference'])
    average_linkage = cluster.AgglomerativeClustering(
        linkage="average", affinity="cityblock",
        n_clusters=params['n_clusters'], connectivity=connectivity)
    birch = cluster.Birch(n_clusters=params['n_clusters'])
    gmm = mixture.GaussianMixture(
        n_components=params['n_clusters'], covariance_type='full')
    clustering_algorithms = (
        ('MiniBatchKMeans', two_means),
        ('AffinityPropagation', affinity_propagation),
        ('MeanShift', ms),
        ('SpectralClustering', spectral),
        ('Ward', ward),
        ('AgglomerativeClustering', average_linkage),
        ('DBSCAN', dbscan),
        ('Birch', birch),
        ('GaussianMixture', gmm)
    )
    for name, algorithm in clustering_algorithms:
        t0 = time.time()
        # catch warnings related to kneighbors_graph
        with warnings.catch_warnings():
            warnings.filterwarnings(
                "ignore",
                message="the number of connected components of the " +
                "connectivity matrix is [0-9]{1,2}" +
                " > 1. Completing it to avoid stopping the tree early.",
                category=UserWarning)
            warnings.filterwarnings(
                "ignore",
                message="Graph is not fully connected, spectral embedding" +
                " may not work as expected.",
                category=UserWarning)
            algorithm.fit(X)
        t1 = time.time()
        if hasattr(algorithm, 'labels_'):
            y_pred = algorithm.labels_.astype(int)
        else:
            y_pred = algorithm.predict(X)
        plt.subplot(len(datasets), len(clustering_algorithms), plot_num)
        if i_dataset == 0:
            plt.title(name, size=18)
        colors = np.array(list(islice(cycle(['#377eb8', '#ff7f00', '#4daf4a',
                                             '#f781bf', '#a65628', '#984ea3',
                                             '#999999', '#e41a1c', '#dede00']),
                                      int(max(y_pred) + 1))))
        plt.scatter(X[:, 0], X[:, 1], s=10, color=colors[y_pred])
        plt.xlim(-2.5, 2.5)
        plt.ylim(-2.5, 2.5)
        plt.xticks(())
        plt.yticks(())
        plt.text(.99, .01, ('%.2fs' % (t1 - t0)).lstrip('0'),
                 transform=plt.gca().transAxes, size=15,
                 horizontalalignment='right')
        plot_num += 1
 plt.show()
 # -
 # ## Reference
 # * [Comparing different clustering algorithms on toy datasets](http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html)
--- a/3_kmeans/k-means.py
+++ b/3_kmeans/k-means.py
@@ -1,488 +0,0 @@
 # -*- coding: utf-8 -*-
 # ---
 # jupyter:
 #   jupytext_format_version: '1.2'
 #   jupytext_formats: ipynb,py
 #   kernelspec:
 #     display_name: Python 3
 #     language: python
 #     name: python3
 #   language_info:
 #     codemirror_mode:
 #       name: ipython
 #       version: 3
 #     file_extension: .py
 #     mimetype: text/x-python
 #     name: python
 #     nbconvert_exporter: python
 #     pygments_lexer: ipython3
 #     version: 3.5.2
 # ---
 # # k-means
 # ## Theory
 #
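 # Thanks to its speed and good scalability, K-Means is one of the best-known clustering methods. The algorithm repeatedly moves the cluster centers, also called centroids, to the mean position of their members and then reassigns the members.
 #
 # K is a hyperparameter that must be chosen before running the algorithm; it is the number of clusters. K-Means assigns samples to clusters automatically, but it cannot decide how many clusters there should be.
 #
 # K must be a positive integer smaller than the number of training samples. Sometimes the number of clusters is dictated by the problem. For example, a shoe factory with three new styles wants to know the potential customers for each style, so it surveys customers and looks for three clusters in the data. Other problems do not specify the number of clusters, and the best number is unknown.
 #
 # The parameters of K-Means are the centroid positions and which observations belong to each cluster. As with generalized linear models and decision trees, the optimal parameters minimize a cost function. The K-Means cost function is:
 # $$
 # J = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - u_k \rVert^2
 # $$
 #
 # $u_k$ is the centroid of the $k$-th cluster, defined as:
 # $$
 # u_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i
 # $$
 #
 #
 # The cost function is the sum of the distortions of all clusters. The distortion of a cluster is the sum of squared distances between its centroid and its members. The more compact the members of a cluster, the smaller its distortion; the more spread out they are, the larger its distortion.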
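The cost function can be evaluated directly from a set of points and their cluster labels. The sketch below is illustrative only: the function name `kmeans_cost` and the four toy points are assumptions, and each centroid is taken to be the mean of its members, as in the definition of $u_k$.

```python
import numpy as np

def kmeans_cost(points, labels):
    """Sum over clusters of squared distances between members and their centroid."""
    cost = 0.0
    for k in np.unique(labels):
        members = points[labels == k]
        centroid = members.mean(axis=0)            # u_k: mean of the cluster's members
        cost += np.sum((members - centroid) ** 2)  # distortion of cluster k
    return cost

points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
tight = np.array([0, 0, 1, 1])   # each cluster groups nearby points
loose = np.array([0, 1, 0, 1])   # each cluster mixes far-apart points

print(kmeans_cost(points, tight))
print(kmeans_cost(points, loose))
```

The compact assignment gives a much smaller cost than the spread-out one, which is the behaviour the distortion is meant to capture.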
 #
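 # Minimizing the cost function is an iterative process of reassigning observations to clusters and moving the centroids.
 # 1. First, the centroids are placed at random. In practice each centroid is set to the position of a randomly chosen observation.
 # 2. In each iteration, K-Means assigns every observation to its nearest cluster and then moves each centroid to the mean position of that cluster's members.
 # 3. If the maximum number of iterations is reached, or the change between two iterations falls below a set threshold, the algorithm stops; otherwise step 2 is repeated.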
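The three steps can be sketched compactly with NumPy. This is a minimal illustration, not the notebook's `kMeans` function: the name `kmeans`, the `tol` parameter, and passing the initial centroids explicitly (instead of drawing them at random) are assumptions, and the sketch assumes no cluster ever becomes empty. The twelve points are the `X0`, `X1` values of this notebook's worked example, started from the 5th and 11th samples.

```python
import numpy as np

def kmeans(points, init_centroids, max_iter=100, tol=1e-9):
    """Alternate between assigning points and moving centroids until they stop moving."""
    centroids = np.asarray(init_centroids, dtype=float)
    for _ in range(max_iter):
        # Step 2a: squared distance from every point to every centroid, shape (n, K).
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 2b: move each centroid to the mean of its members
        # (assumes every cluster keeps at least one member).
        new_centroids = np.array([points[labels == k].mean(axis=0)
                                  for k in range(len(centroids))])
        # Step 3: stop when the centroids move less than the threshold.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids

X0 = np.array([7, 5, 7, 3, 4, 1, 0, 2, 8, 6, 5, 3])
X1 = np.array([5, 7, 7, 3, 6, 4, 0, 2, 7, 8, 5, 7])
points = np.column_stack((X0, X1)).astype(float)

labels, centroids = kmeans(points, init_centroids=points[[4, 10]])
print(np.flatnonzero(labels == 0))
print(centroids)
```

Fixing the starting centroids makes the run reproducible; with random initialization the result can differ from run to run because K-Means only finds a local optimum.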
 #
 #
 # +
 # %matplotlib inline
 import matplotlib.pyplot as plt
 import numpy as np
 X0 = np.array([7, 5, 7, 3, 4, 1, 0, 2, 8, 6, 5, 3])
 X1 = np.array([5, 7, 7, 3, 6, 4, 0, 2, 7, 8, 5, 7])
 plt.figure()
 plt.axis([-1, 9, -1, 9])
 plt.grid(True)
 plt.plot(X0, X1, 'k.');
 # -
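 # Suppose K-Means initializes the first centroid at the 5th sample and the second centroid at the 11th sample. We can then compute the distance from every instance to both centroids and assign each instance to the nearest cluster. The results are shown in the table below:
 # ![data_0](images/data_0.png)
 #
 # The centroid positions and the initial clustering are shown in the figure below. The first cluster is marked with X and the second with dots. The centroids are drawn with larger markers.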
 #
 #
 #
 C1 = [1, 4, 5, 9, 11]
 C2 = list(set(range(12)) - set(C1))
 X0C1, X1C1 = X0[C1], X1[C1]
 X0C2, X1C2 = X0[C2], X1[C2]
 plt.figure()
 plt.title('1st iteration results')
 plt.axis([-1, 9, -1, 9])
 plt.grid(True)
 plt.plot(X0C1, X1C1, 'rx')
 plt.plot(X0C2, X1C2, 'g.')
 plt.plot(4,6,'rx',ms=12.0)
 plt.plot(5,5,'g.',ms=12.0);
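 # Now we recompute the centroids of the two clusters, move them to their new positions, recompute the distance from every sample to the new centroids, and reassign the samples by distance. The results are shown in the table below:
 #
 # ![data_1](images/data_1.png)
 #
 # The plot is as follows: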
 C1 = [1, 2, 4, 8, 9, 11]
 C2 = list(set(range(12)) - set(C1))
 X0C1, X1C1 = X0[C1], X1[C1]
 X0C2, X1C2 = X0[C2], X1[C2]
 plt.figure()
 plt.title('2nd iteration results')
 plt.axis([-1, 9, -1, 9])
 plt.grid(True)
 plt.plot(X0C1, X1C1, 'rx')
 plt.plot(X0C2, X1C2, 'g.')
 plt.plot(3.8,6.4,'rx',ms=12.0)
 plt.plot(4.57,4.14,'g.',ms=12.0);
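 # We repeat the procedure once more: move the centroids to their new positions, recompute the distance from every sample to the new centroids, and reassign the samples by distance. The results are shown in the table below:
 # ![data_2](images/data_2.png)
 #
 # The plot is as follows: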
 #
 C1 = [0, 1, 2, 4, 8, 9, 10, 11]
 C2 = list(set(range(12)) - set(C1))
 X0C1, X1C1 = X0[C1], X1[C1]
 X0C2, X1C2 = X0[C2], X1[C2]
 plt.figure()
 plt.title('3rd iteration results')
 plt.axis([-1, 9, -1, 9])
 plt.grid(True)
 plt.plot(X0C1, X1C1, 'rx')
 plt.plot(X0C2, X1C2, 'g.')
 plt.plot(5.5,7.0,'rx',ms=12.0)
 plt.plot(3.0,3.17,'g.',ms=12.0);  # mean of the second cluster from the 2nd iteration: (18/6, 19/6)
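 # Repeating the procedure again leaves the centroids unchanged. K-Means stops when a stopping condition is met. Usually the condition is that the difference in cost between two consecutive iterations, or the change in centroid positions between two consecutive iterations, falls below a threshold. If the threshold is small enough, K-Means converges to an optimum, although not necessarily the global optimum.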
 #
 #
 # ## Program
 # +
 # This line configures matplotlib to show figures embedded in the notebook, 
 # instead of opening a new window for each figure. More about that later. 
 # If you are using an old version of IPython, try using '%pylab inline' instead.
 # %matplotlib inline
 # import libraries
 from numpy import *
 import matplotlib.pyplot as plt
 import pandas as pd
 # Load dataset
 names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
 dataset = pd.read_csv("iris.csv", header=0, index_col=0)
 dataset.head()
 # -
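 # encode the classes: the 3 classes are assigned 0, 1, 2
 # (.loc avoids chained assignment, which may not modify the original DataFrame)
 dataset.loc[dataset['class']=='Iris-setosa', 'class']=0
 dataset.loc[dataset['class']=='Iris-versicolor', 'class']=1
 dataset.loc[dataset['class']=='Iris-virginica', 'class']=2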
 def originalDatashow(dataSet):
    # plot the original sample points
    num,dim=shape(dataSet)
    marksamples=['ob'] # marker style for the samples
    for i in range(num):
        # use the function argument rather than the global datamat
        plt.plot(dataSet.iat[i,0],dataSet.iat[i,1],marksamples[0],markersize=5)
    plt.title('original dataset')
    plt.xlabel('sepal length')
    plt.ylabel('sepal width')
    plt.show()
 # + {"scrolled": true}
 # get the sample data
 datamat = dataset.loc[:, ['sepal-length', 'sepal-width']]
 # ground-truth labels
 labels = dataset.loc[:, ['class']]
 # show the original data
 originalDatashow(datamat)
 # -
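 def randChosenCent(dataSet,k):
    """Initialise the cluster centres by drawing k distinct samples at random"""
    # number of samples
    m=shape(dataSet)[0]
    # indices of the chosen samples
    centroidsIndex=[]
    # list of candidate sample indices
    dataIndex=list(range(m))
    for i in range(k):
        # random position in the candidate list
        randIndex=random.randint(0,len(dataIndex))
        # store the index of the chosen sample in centroidsIndex
        centroidsIndex.append(dataIndex[randIndex])
        # remove the chosen sample from the candidates
        del dataIndex[randIndex]
    # fetch the samples by index
    centroids = dataSet.iloc[centroidsIndex]
    return mat(centroids)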
 # +
 def distEclud(vecA, vecB):
    """Euclidean distance between two vectors"""
    return sqrt(sum(power(vecA - vecB, 2))) #la.norm(vecA-vecB)
 def kMeans(dataSet, k):
    # total number of samples
    m = shape(dataSet)[0]
    # assignment of each sample: [cluster index, squared distance] (m rows x 2 columns)
    clusterAssment = mat(zeros((m, 2)))
    # step1: initialise the cluster centres with randomly chosen samples
    centroids = randChosenCent(dataSet, k)
    print('initial centroids =', centroids)
    # flag: True if any assignment changed during the iteration, otherwise False
    clusterChanged = True
    # iteration counter
    iterTime = 0
    # stop when no assignment changes any more
    while clusterChanged:
        clusterChanged = False
        # step2: assign every sample to the cluster of its nearest centre
        for i in range(m):
            # start with an infinite distance
            minDist = inf
            # initialise the index
            minIndex = -1
            # distance from the sample to each of the k centres
            for j in range(k):
                # distance from sample i to centre j
                distJI = distEclud(centroids[j, :], dataSet.values[i, :])
                # keep it if it is the smallest so far
                if distJI < minDist:
                    # update the smallest distance
                    minDist = distJI
                    # remember the corresponding cluster index
                    minIndex = j
            # the assignment differs from the previous iteration: set clusterChanged to True
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist ** 2  # assign the sample to the nearest cluster
        iterTime += 1
        sse = sum(clusterAssment[:, 1])
        print('the SSE of %d' % iterTime + 'th iteration is %f' % sse)
        # step3: update the cluster centres
        for cent in range(k):  # recompute the centres once all samples are assigned
            # all samples belonging to this cluster
            ptsInClust = dataSet.iloc[nonzero(clusterAssment[:, 0].A == cent)[0]]
            # new centre: mean along axis=0 (column-wise)
            centroids[cent, :] = mean(ptsInClust, axis=0)
    return centroids, clusterAssment
 # -
 # Run k-means clustering
 k = 3  # number of clusters, chosen by the user
 mycentroids, clusterAssment = kMeans(datamat, k)
 # +
 def datashow(dataSet, k, centroids, clusterAssment):  # show the clustering result in 2-D
    from matplotlib import pyplot as plt
    num, dim = shape(dataSet)  # number of samples num, dimension dim
    if dim != 2:
        print('sorry,the dimension of your dataset is not 2!')
        return 1
    marksamples = ['or', 'ob', 'og', 'ok', '^r', '^b', '<g']  # sample markers
    if k > len(marksamples):
        print('sorry,your k is too large,please add length of the marksample!')
        return 1
    # plot all samples
    for i in range(num):
        markindex = int(clusterAssment[i, 0])  # cluster index, converted from a matrix entry to int
        # the two features map to the x and y axes; marker style and size
        plt.plot(dataSet.iat[i, 0], dataSet.iat[i, 1], marksamples[markindex], markersize=6)
    # plot the centroids
    markcentroids = ['o', '*', '^']  # centroid markers
    label = ['0', '1', '2']
    c = ['yellow', 'pink', 'red']
    for i in range(k):
        plt.plot(centroids[i, 0], centroids[i, 1], markcentroids[i], markersize=15, label=label[i], c=c[i])
    plt.legend(loc='upper left')
    plt.xlabel('sepal length')
    plt.ylabel('sepal width')
    plt.title('k-means cluster result')  # title
    plt.show()
 # Plot the ground-truth classes
 def trgartshow(dataSet, k, labels):
    from matplotlib import pyplot as plt
    num, dim = shape(dataSet)
    label = ['0', '1', '2']
    marksamples = ['ob', 'or', 'og', 'ok', '^r', '^b', '<g']
    # draw the scatter plot sample by sample, with the marker chosen by the true class
    for i in range(num):
        plt.plot(dataSet.iat[i, 0], dataSet.iat[i, 1], marksamples[int(labels.iat[i, 0])], markersize=6)
    # re-plot every 50th sample with a label so that each class gets a legend entry
    for i in range(0, num, 50):
        plt.plot(dataSet.iat[i, 0], dataSet.iat[i, 1], marksamples[int(labels.iat[i, 0])], markersize=6,
                 label=label[int(labels.iat[i, 0])])
    plt.legend(loc='upper left')
    # axis labels and title
    plt.xlabel('sepal length')
    plt.ylabel('sepal width')
    plt.title('iris true result')  # title
    # show the figure
    plt.show()
    # label=labels.iat[i,0]
 # -
 # Plot the results
 datashow(datamat, k, mycentroids, clusterAssment)
 trgartshow(datamat, 3, labels)
 # ## How to use sklearn to do the clustering
 #
 # +
 from sklearn.datasets import load_digits
 import matplotlib.pyplot as plt 
 from sklearn.cluster import KMeans
 # load digits data
 digits, dig_label = load_digits(return_X_y=True)
 # draw one digit
 plt.gray() 
 plt.matshow(digits[0].reshape([8, 8])) 
 plt.show() 
 # calculate train/test data number
 N = len(digits)
 N_train = int(N*0.8)
 N_test = N - N_train
 # split train/test data
 x_train = digits[:N_train, :]
 y_train = dig_label[:N_train]
 x_test  = digits[N_train:, :]
 y_test  = dig_label[N_train:]
 # +
 # do kmeans
 kmeans = KMeans(n_clusters=10, random_state=0).fit(x_train)
 # kmeans.labels_ - output label
 # kmeans.cluster_centers_ - cluster centers
 # draw cluster centers
 fig, axes = plt.subplots(nrows=1, ncols=10)
 for i in range(10):
    img = kmeans.cluster_centers_[i].reshape(8, 8)
    axes[i].imshow(img)
 # -
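 # ## Exercise - How to calculate the accuracy?
 #
 # 1. How to match cluster labels to ground-truth labels
 # 2. How to handle the ambiguity of some digits
 # ## Evaluating clustering performance
 #
 # Method 1: If the data being evaluated come with ground-truth class labels, use the Adjusted Rand Index (ARI). ARI is similar to accuracy in a classification problem, but it also handles the fact that cluster indices cannot be matched one-to-one with class labels.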
 #
 #
 # +
 from sklearn.metrics import adjusted_rand_score
 ari_train = adjusted_rand_score(y_train, kmeans.labels_)
 print("ari_train = %f" % ari_train)
 # -
 # Given the contingency table:
 # ![ARI_ct](images/ARI_ct.png)
 #
 # the adjusted index is:
 # ![ARI_define](images/ARI_define.png)
 #
 # * [ARI reference](https://davetang.org/muse/2017/09/21/adjusted-rand-index/)
 #
 #
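 # Method 2: If the data being evaluated have no class labels, use the silhouette coefficient (Silhouette Coefficient) to measure the quality of the clustering. The silhouette coefficient accounts for both the cohesion and the separation of the clusters and takes values in [-1,1]; the larger it is, the better the clustering.
 #
 # The silhouette coefficient is computed as follows:
 # 1. For the i-th sample $x_i$ of the clustered data, compute the mean distance between $x_i$ and all other samples in the same cluster, denoted $a_i$. It quantifies the cohesion within the cluster.
 # 2. Pick a cluster $b$ that does not contain $x_i$ and compute the mean distance between $x_i$ and all samples in $b$. Do this for every other cluster and take the smallest mean distance, denoted $b_i$. It quantifies the separation between clusters.
 # 3. The silhouette coefficient of sample $x_i$ is $sc_i = \frac{b_i - a_i}{\max(b_i, a_i)}$
 # 4. Finally, average over all samples in $\mathbf{X}$ to obtain the overall silhouette coefficient of the clustering.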
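 #
 # The four steps above can be written out directly. The sketch below is a plain-Python illustration and not the scikit-learn implementation; the function name `silhouette` is hypothetical, and it assumes every cluster contains at least two samples.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient, following the four steps above (illustrative helper)."""
    scores = []
    for i, p in enumerate(points):
        # Mean distance from sample i to the members of each cluster (excluding itself).
        mean_dist = {}
        for c in set(labels):
            members = [q for j, q in enumerate(points) if labels[j] == c and j != i]
            if members:
                mean_dist[c] = sum(math.dist(p, q) for q in members) / len(members)
        a = mean_dist[labels[i]]                                     # step 1: cohesion
        b = min(d for c, d in mean_dist.items() if c != labels[i])   # step 2: separation
        scores.append((b - a) / max(a, b))                           # step 3
    return sum(scores) / len(scores)                                 # step 4

points = [(1, 1), (1, 2), (8, 8), (8, 9)]
good = silhouette(points, [0, 0, 1, 1])   # clusters follow the two visible groups
bad = silhouette(points, [0, 1, 0, 1])    # clusters cut across the groups
print(good, bad)
```

 # A labelling that follows the two groups scores close to 1, while one that cuts across them scores below 0.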
 # +
 import numpy as np
 from sklearn.cluster import KMeans
 from sklearn.metrics import silhouette_score
 import matplotlib.pyplot as plt
 plt.rcParams['figure.figsize']=(10,10)
 plt.subplot(3,2,1)
 x1=np.array([1,2,3,1,5,6,5,5,6,7,8,9,7,9])   # initialise the raw data
 x2=np.array([1,3,2,2,8,6,7,6,7,1,2,1,1,3])
 X=np.array(list(zip(x1,x2))).reshape(len(x1),2)
 plt.xlim([0,10])
 plt.ylim([0,10])
 plt.title('Instances')
 plt.scatter(x1,x2)
 colors=['b','g','r','c','m','y','k','b']
 markers=['o','s','D','v','^','p','*','+']
 clusters=[2,3,4,5,8]
 subplot_counter=1
 sc_scores=[]
 for t in clusters:
    subplot_counter +=1
    plt.subplot(3,2,subplot_counter)
    kmeans_model=KMeans(n_clusters=t).fit(X)   # fit the KMeans model
    for i,l in enumerate(kmeans_model.labels_):
        plt.plot(x1[i],x2[i],color=colors[l],marker=markers[l],ls='None')
    plt.xlim([0,10])
    plt.ylim([0,10])
    sc_score=silhouette_score(X,kmeans_model.labels_,metric='euclidean')   # compute the silhouette coefficient
    sc_scores.append(sc_score)
    plt.title('k=%s,silhouette coefficient=%0.03f'%(t,sc_score))
 plt.figure()
 plt.plot(clusters,sc_scores,'*-')   # plot the silhouette coefficient against the number of clusters
 plt.xlabel('Number of Clusters')
 plt.ylabel('Silhouette Coefficient Score')
 plt.show()   
 # -
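 # ## How to determine the 'k'?
 #
 # The elbow method gives a rough estimate of a reasonable number of clusters. K-means ultimately expects *the sum of squared distances from every data point to the centre of its cluster to level off, so the best number of clusters can be found by watching how this value changes with K. Ideally, as the curve keeps falling and flattens out, there is a point where the slope changes sharply; from the K at this point onward, adding more cluster centres no longer changes the clustering structure much*.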
 #
 #
 # +
 import numpy as np
 from sklearn.cluster import KMeans
 from scipy.spatial.distance import cdist
 import matplotlib.pyplot as plt
 cluster1=np.random.uniform(0.5,1.5,(2,10))
 cluster2=np.random.uniform(5.5,6.5,(2,10))
 cluster3=np.random.uniform(3,4,(2,10))
 X=np.hstack((cluster1,cluster2,cluster3)).T
 plt.scatter(X[:,0],X[:,1])
 plt.xlabel('x1')
 plt.ylabel('x2')
 plt.show()
 # +
 K=range(1,10)
 meandistortions=[]
 for k in K:
    kmeans=KMeans(n_clusters=k)
    kmeans.fit(X)
    meandistortions.append(sum(np.min(cdist(X,kmeans.cluster_centers_,'euclidean'),axis=1))/X.shape[0])
 plt.plot(K,meandistortions,'bx-')
 plt.xlabel('k')
 plt.ylabel('Average Dispersion')
 plt.title('Selecting k with the Elbow Method')
 plt.show()
 # -
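 # As the figure shows, when the number of clusters increases from 1 to 2 and then to 3, each change of K greatly alters the overall clustering structure and the average distance drops sharply. This means that these smaller values of K do not yet reflect the true number of clusters. Once K grows beyond 3, the average distance decreases much more slowly: increasing K further no longer helps much, which suggests that K=3 is the best number of clusters here.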
--- a/4_logistic_regression/Least_squares.py
+++ b/4_logistic_regression/Least_squares.py
@@ -1,375 +0,0 @@
 # -*- coding: utf-8 -*-
 # ---
 # jupyter:
 #   jupytext_format_version: '1.2'
 #   kernelspec:
 #     display_name: Python 3
 #     language: python
 #     name: python3
 #   language_info:
 #     codemirror_mode:
 #       name: ipython
 #       version: 3
 #     file_extension: .py
 #     mimetype: text/x-python
 #     name: python
 #     nbconvert_exporter: python
 #     pygments_lexer: ipython3
 #     version: 3.5.2
 # ---
 # # Least squares
 #
 # A mathematical procedure for finding the best-fitting curve to a given set of points by minimizing the sum of the squares of the offsets ("the residuals") of the points from the curve. The sum of the squares of the offsets is used instead of the offset absolute values because this allows the residuals to be treated as a continuous differentiable quantity. However, because squares of the offsets are used, outlying points can have a disproportionate effect on the fit, a property which may or may not be desirable depending on the problem at hand. 
 #
 # ### Show the data
 #
 # +
 # %matplotlib inline
 import matplotlib.pyplot as plt
 import numpy as np
 import sklearn
 from sklearn import datasets
 # load data
 d = datasets.load_diabetes()
 X = d.data[:, 2]
 Y = d.target
 # draw original data
 plt.scatter(X, Y)
 plt.xlabel("X")
 plt.ylabel("Y")
 plt.show()
 # -
 # ### Theory
 # For $N$ observation data:
 # $$
 # \mathbf{X} = \{x_1, x_2, ..., x_N \} \\
 # \mathbf{Y} = \{y_1, y_2, ..., y_N \}
 # $$
 #
 # We want to find a model that can predict the data. The simplest one is the linear model, which has the form
 # $$
 # y = ax + b
 # $$
 #
 # The purpose is to find parameters $a, b$ which best fit the model to the observation data. 
 #
 # We use the sum of squares to measure the differences (loss function) between the model's prediction and observation data:
 # $$
 # L = \sum_{i=1}^{N} (y_i - a x_i - b)^2
 # $$
 #
 # To minimize the loss function, we take its partial derivatives with respect to the parameters:
 # $$
 # \frac{\partial L}{\partial a} = -2 \sum_{i=1}^{N} (y_i - a x_i - b) x_i \\
 # \frac{\partial L}{\partial b} = -2 \sum_{i=1}^{N} (y_i - a x_i - b)
 # $$
 # At the minimum of the loss the partial derivatives are zero, which gives:
 # $$
 # -2 \sum_{i=1}^{N} (y_i - a x_i - b) x_i = 0 \\
 # -2 \sum_{i=1}^{N} (y_i - a x_i - b) = 0 \\
 # $$
 #
 # We reorder the terms as:
 # $$
 # a \sum x_i^2 + b \sum x_i = \sum y_i x_i \\
 # a \sum x_i + b N = \sum y_i
 # $$
 # By solving the linear equation we can obtain the model parameters.
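 #
 # As a small worked example, the 2x2 system above can be solved by hand with Cramer's rule. The helper `fit_line` and the toy data are assumptions made for illustration; the program below solves the same system with NumPy on the real data.

```python
def fit_line(xs, ys):
    """Solve the two normal equations for y = a*x + b by Cramer's rule."""
    n = len(xs)
    s_x2 = sum(x * x for x in xs)
    s_x = sum(xs)
    s_xy = sum(x * y for x, y in zip(xs, ys))
    s_y = sum(ys)
    det = s_x2 * n - s_x * s_x            # determinant of [[S_X2, S_X], [S_X, N]]
    a = (s_xy * n - s_x * s_y) / det
    b = (s_x2 * s_y - s_x * s_xy) / det
    return a, b

# Four points lying exactly on y = 2x + 1.
a_fit, b_fit = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a_fit, b_fit)
```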
 # ### Program
 # +
 N = X.shape[0]
 S_X2 = np.sum(X*X)
 S_X  = np.sum(X)
 S_XY = np.sum(X*Y)
 S_Y  = np.sum(Y)
 A1 = np.array([[S_X2, S_X], 
               [S_X, N]])
 B1 = np.array([S_XY, S_Y])
 coeff = np.linalg.inv(A1).dot(B1)
 print('a = %f, b = %f' % (coeff[0], coeff[1]))
 x_min = np.min(X)
 x_max = np.max(X)
 y_min = coeff[0] * x_min + coeff[1]
 y_max = coeff[0] * x_max + coeff[1]
 plt.scatter(X, Y, label='original data')
 plt.plot([x_min, x_max], [y_min, y_max], 'r', label='model')
 plt.legend()
 plt.show()
 # -
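 # ## How to find the model parameters iteratively
 #
 # When there is a lot of data, or the model is complex, the model parameters often cannot be obtained analytically. A more common approach is therefore to approach the parameters step by step through iteration.
 #
 # ### Gradient descent
 # In machine learning, many supervised models require building a loss function for the model and then optimizing it with an optimization algorithm in order to find the best parameters. Among these algorithms, the ones based on gradient descent (Gradient Descent, GD) are used most often.
 #
 # Gradient descent has many advantages. In particular, it only needs the first derivative of the loss function, so each step is cheap, which makes it applicable to many large-scale datasets. Its idea is to use the gradient direction at the current point to find the next iterate.
 #
 # The basic idea can be compared to walking down a mountain. Imagine someone stuck on a mountain who needs to get down (i.e. find the lowest point, the valley). Thick fog makes visibility very low, so the path down cannot be planned in advance and they must rely on the information around them. This is where gradient descent helps: starting from the current position, find the steepest direction at that spot and walk the way the height decreases. (Likewise, if the goal were to climb to the summit, one would walk upward along the steepest direction.) After each stretch, repeat the same procedure, and eventually the valley is reached.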
 #
 #
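 # Suppose further that the steepest direction cannot be spotted at a glance and has to be measured with a sophisticated instrument, which this person happens to have. Each measurement takes time, so to reach the bottom before sunset the number of measurements should be kept as small as possible. This is a trade-off: measuring frequently keeps the direction accurate but is very time-consuming, while measuring too rarely risks going off course. A suitable measuring frequency is needed, one that keeps the direction correct without costing too much time.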
 #
 #
 # ![gradient_descent](images/gradient_descent.png)
 #
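 # The figure above shows gradient descent reaching a local optimum. The x and y axes represent $\theta_0$ and $\theta_1$, and the z axis is the cost function. Clearly, different starting points may lead to different convergence points. If the surface is bowl-shaped, however, every starting point converges to the same point.
 #
 # For a loss function such as
 # $$
 # L = \sum_{i=1}^{N} (y_i - a x_i - b)^2
 # $$
 #
 # the update rule is:
 # $$
 # \theta^1 = \theta^0 - \alpha \nabla L(\theta^0)
 # $$
 # where $\theta$ stands for the model parameters, for example $a$ and $b$.
 #
 # The meaning of this formula is: $L$ is a function of $\theta$, and we are currently at the point $\theta^0$. We want to walk from this point to the minimum of $L$, the bottom of the mountain. We first determine the direction to move in, which is the negative gradient, and then take a step of size $\alpha$. After this step we arrive at the point $\theta^1$.
 #
 # Some common questions about this formula:
 #
 # * **What does $\alpha$ mean?**
 # In gradient descent $\alpha$ is called the learning rate or step size. It controls how far each step goes: the steps must not be so large that we overshoot the lowest point, nor so small that the sun sets before we reach the bottom. The choice of $\alpha$ is therefore very important. If it is too small, it may take very long to reach the minimum; if it is too large, the minimum may be missed.
 # ![gd_stepsize](images/gd_stepsize.png)
 #
 # * **Why is the gradient multiplied by a minus sign?**
 # The minus sign means moving in the direction opposite to the gradient. As mentioned above, the gradient points in the direction in which the function increases fastest at the current point. We want the direction of fastest decrease, which is the negative gradient, hence the minus sign.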
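 #
 # The effect of the step size can be seen on a toy problem. The sketch below minimises $L(\theta) = \theta^2$ (gradient $2\theta$); the function name `gradient_descent`, the starting point and the three values of $\alpha$ are assumptions chosen only for illustration.

```python
def gradient_descent(alpha, theta=4.0, steps=20):
    """Minimise L(theta) = theta**2, whose gradient is 2*theta (toy example)."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta   # theta <- theta - alpha * grad L(theta)
    return theta

# Too small: slow progress. Moderate: converges. Too large: overshoots and diverges.
for alpha in (0.01, 0.4, 1.1):
    print(alpha, gradient_descent(alpha))
```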
 #
 #
 # ### Program
 # +
 n_epoch = 3000          # epoch size
 a, b = 1, 1             # initial parameters
 epsilon = 0.001         # learning rate
 for i in range(n_epoch):
    for j in range(N):
        a = a + epsilon*2*(Y[j] - a*X[j] - b)*X[j]
        b = b + epsilon*2*(Y[j] - a*X[j] - b)
    L = 0
    for j in range(N):
        L = L + (Y[j]-a*X[j]-b)**2
    print("epoch %4d: loss = %f, a = %f, b = %f" % (i, L, a, b))
 x_min = np.min(X)
 x_max = np.max(X)
 y_min = a * x_min + b
 y_max = a * x_max + b
 plt.scatter(X, Y, label='original data')
 plt.plot([x_min, x_max], [y_min, y_max], 'r', label='model')
 plt.legend()
 plt.show()
 # -
 # ## How to show the iterative process
 # +
 # %matplotlib nbagg
 import matplotlib.pyplot as plt
 import matplotlib.animation as animation
 n_epoch = 3000          # epoch size
 a, b = 1, 1             # initial parameters
 epsilon = 0.001         # learning rate
 fig = plt.figure()
 imgs = []
 for i in range(n_epoch):
    for j in range(N):
        a = a + epsilon*2*(Y[j] - a*X[j] - b)*X[j]
        b = b + epsilon*2*(Y[j] - a*X[j] - b)
    L = 0
    for j in range(N):
        L = L + (Y[j]-a*X[j]-b)**2
    #print("epoch %4d: loss = %f, a = %f, b = %f" % (i, L, a, b))
    if i % 50 == 0:
        x_min = np.min(X)
        x_max = np.max(X)
        y_min = a * x_min + b
        y_max = a * x_max + b
        img = plt.scatter(X, Y, label='original data')
        img = plt.plot([x_min, x_max], [y_min, y_max], 'r', label='model')
        imgs.append(img)
 ani = animation.ArtistAnimation(fig, imgs)
 plt.show()
 # -
 # ## How to use batch update method?
 #
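 # If some samples are outliers, updating the parameters with a single sample at a time can make the learning inaccurate and slow. Averaging the gradient over a batch of samples reduces the influence of any single sample.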
 #
 #
 # * [梯度下降方法的几种形式](https://blog.csdn.net/u010402786/article/details/51188876)
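 #
 # A minimal sketch of a mini-batch update for the linear model $y = ax + b$ is given below. It uses plain Python and synthetic noise-free data; the function name `minibatch_gd`, the batch size, the learning rate and the number of epochs are all assumptions made for the example rather than tuned values.

```python
import random

def minibatch_gd(xs, ys, batch_size=4, lr=0.05, epochs=500, seed=0):
    """Fit y = a*x + b by mini-batch gradient descent (illustrative sketch)."""
    rng = random.Random(seed)
    a, b = 0.0, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)                      # visit the samples in a new order each epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # Average the gradient of the squared error over the batch.
            ga = sum(-2 * (ys[i] - a * xs[i] - b) * xs[i] for i in batch) / len(batch)
            gb = sum(-2 * (ys[i] - a * xs[i] - b) for i in batch) / len(batch)
            a, b = a - lr * ga, b - lr * gb
    return a, b

xs = [i / 10 for i in range(20)]   # inputs in [0, 1.9]
ys = [3 * x + 2 for x in xs]       # noise-free line y = 3x + 2
a_hat, b_hat = minibatch_gd(xs, ys)
print(a_hat, b_hat)
```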
 # ## How to fit polynomial function?
 #
 # If we observe a missile at several points in time, how can we estimate its trajectory? According to physics, the trajectory can be formulated as:
 # $$
 # y = at^2 + bt + c
 # $$
 # Then we need at least three data points to compute the parameters $a, b, c$.
 #
 # $$
 # L = \sum_{i=1}^N (y_i - at_i^2 - bt_i - c)^2
 # $$
 #
 # +
 t = np.array([2, 4, 6, 8])
 #t = np.linspace(0, 10)
 pa = -20
 pb = 90
 pc = 800
 y = pa*t**2 + pb*t + pc
 plt.scatter(t, y)
 plt.show()
 # -
 # ### How to get the update items?
 #
 # $$
 # L = \sum_{i=1}^N (y_i - at_i^2 - bt_i - c)^2
 # $$
 #
 # \begin{eqnarray}
 # \frac{\partial L}{\partial a} & = & - 2\sum_{i=1}^N (y_i - at_i^2 - bt_i - c) t_i^2 \\
 # \frac{\partial L}{\partial b} & = & - 2\sum_{i=1}^N (y_i - at_i^2 - bt_i - c) t_i \\
 # \frac{\partial L}{\partial c} & = & - 2\sum_{i=1}^N (y_i - at_i^2 - bt_i - c)
 # \end{eqnarray}
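 #
 # One way to gain confidence in these update terms is to compare them with a finite-difference approximation of the loss. The sketch below does this in plain Python; the helper names `loss` and `grad`, the evaluation point and the step `h` are assumptions made for the example.

```python
def loss(a, b, c, ts, ys):
    """Sum of squared residuals of the quadratic model."""
    return sum((y - a * t**2 - b * t - c) ** 2 for t, y in zip(ts, ys))

def grad(a, b, c, ts, ys):
    """Analytic partial derivatives of the loss with respect to a, b and c."""
    r = [y - a * t**2 - b * t - c for t, y in zip(ts, ys)]   # residuals
    return (-2 * sum(ri * t**2 for ri, t in zip(r, ts)),
            -2 * sum(ri * t for ri, t in zip(r, ts)),
            -2 * sum(r))

ts = [2, 4, 6, 8]
ys = [-20 * t**2 + 90 * t + 800 for t in ts]   # synthetic trajectory data
params = (1.0, 1.0, 1.0)                       # arbitrary evaluation point
h = 1e-6
numeric = []
for k in range(3):
    up = list(params); up[k] += h
    down = list(params); down[k] -= h
    # central difference in the k-th parameter
    numeric.append((loss(*up, ts, ys) - loss(*down, ts, ys)) / (2 * h))
analytic = grad(*params, ts, ys)
print(analytic)
print(numeric)
```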
 # ## How to use sklearn to solve linear problem?
 #
 #
 # +
 from sklearn import linear_model
 # load data
 d = datasets.load_diabetes()
 X = d.data[:, np.newaxis, 2]
 Y = d.target
 # create regression model
 regr = linear_model.LinearRegression()
 regr.fit(X, Y)
 a, b = regr.coef_[0], regr.intercept_
 print("a = %f, b = %f" % (a, b))
 x_min = np.min(X)
 x_max = np.max(X)
 y_min = a * x_min + b
 y_max = a * x_max + b
 plt.scatter(X, Y)
 plt.plot([x_min, x_max], [y_min, y_max], 'r')
 plt.show()
 # -
 # ## How to use sklearn to fit polynomial function?
 # +
 # Fitting polynomial functions
 from sklearn.preprocessing import PolynomialFeatures
 from sklearn.linear_model import LinearRegression
 from sklearn.pipeline import Pipeline
 t = np.array([2, 4, 6, 8])
 pa = -20
 pb = 90
 pc = 800
 y = pa*t**2 + pb*t + pc
 model = Pipeline([('poly', PolynomialFeatures(degree=2)),
                  ('linear', LinearRegression(fit_intercept=False))])
 model = model.fit(t[:, np.newaxis], y)
 model.named_steps['linear'].coef_
 # -
 # ## How to estimate missing values with the model?
 #
 # +
 # load data
 d = datasets.load_diabetes()
 N = d.target.shape[0]
 N_train = int(N*0.9)
 N_test = N - N_train
 X = d.data[:N_train, np.newaxis, 2]
 Y = d.target[:N_train]
 X_test = d.data[N_train:, np.newaxis, 2]
 Y_test = d.target[N_train:]
 # create regression model
 regr = linear_model.LinearRegression()
 regr.fit(X, Y)
 Y_est = regr.predict(X_test)
 print("Y_est  = ", Y_est)
 print("Y_test = ", Y_test)
 err = (Y_est - Y_test)**2
 from sklearn.metrics import mean_squared_error
 err2 = mean_squared_error(Y_test, Y_est)
 score = regr.score(X_test, Y_test)
 # both printed error values are the root-mean-square error (RMSE) and should agree
 print("err = %f (%f), score = %f" % (np.sqrt(np.sum(err)/N_test), np.sqrt(err2), score))
 # plot data
 a, b = regr.coef_[0], regr.intercept_
 print("a = %f, b = %f" % (a, b))
 x_min = np.min(X)
 x_max = np.max(X)
 y_min = a * x_min + b
 y_max = a * x_max + b
 plt.scatter(X, Y, label='train data')
 plt.scatter(X_test, Y_test, label='test data')
 plt.plot([x_min, x_max], [y_min, y_max], 'r', label='model')
 plt.legend()
 plt.show()
 # -
--- a/4_logistic_regression/Logistic_regression.py
+++ b/4_logistic_regression/Logistic_regression.py
@@ -1,388 +0,0 @@
 # -*- coding: utf-8 -*-
 # ---
 # jupyter:
 #   jupytext_format_version: '1.2'
 #   kernelspec:
 #     display_name: Python 3
 #     language: python
 #     name: python3
 #   language_info:
 #     codemirror_mode:
 #       name: ipython
 #       version: 3
 #     file_extension: .py
 #     mimetype: text/x-python
 #     name: python
 #     nbconvert_exporter: python
 #     pygments_lexer: ipython3
 #     version: 3.5.2
 # ---
 # # Logistic Regression
 #
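 # The logistic regression (Logistic Regression, LR) model merely applies a logistic function on top of linear regression, yet this function is exactly what has made logistic regression a shining star of machine learning and a core tool of computational advertising. This section covers the basics of the model.
 #
 #
 # ## 1 The logistic regression model
 # Regression is a fairly easy model to understand: it amounts to $y=f(x)$, describing the relationship between the independent variable $x$ and the dependent variable $y$. A common example is a doctor examining a patient by looking, listening, asking and feeling the pulse, and then judging whether the patient is ill and with what. The examination collects the independent variable $x$, i.e. the feature data, and the diagnosis yields the dependent variable $y$, i.e. the predicted class.
 #
 # The simplest regression is linear regression. Borrowing from Andrew Ng's lecture notes, in the figure below $X$ is the data point (the tumor size) and $Y$ is the observation (whether the tumor is malignant). Once a linear regression model $h_\theta(x)$ has been built, the tumor size can be used to predict malignancy: $h_\theta(x) \ge 0.5$ means malignant and $h_\theta(x) \lt 0.5$ means benign.
 #
 # ![LinearRegression](images/fig1.gif)
 #
 # However, linear regression is not robust. For example, when a regression is fitted to the dataset in the figure above, the noisy point at the far right makes the model perform poorly even on the training set. The main reason is that linear regression is equally sensitive over the whole real line, whereas a classification output needs to lie in $[0,1]$.
 #
 # Logistic regression is a regression model that narrows the prediction range and restricts the predicted value to $[0,1]$. Its regression curve is plotted below. The logistic curve is very sensitive near $z=0$ and insensitive for $z \gg 0$ or $z \ll 0$, which confines the predicted value to $(0,1)$.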
 #
 #
 # +
 # %matplotlib inline
 import matplotlib.pyplot as plt
 import numpy as np
 plt.figure()
 plt.axis([-10,10,0,1])
 plt.grid(True)
 X=np.arange(-10,10,0.1)
 y=1/(1+np.e**(-X))
 plt.plot(X,y,'b-')
 plt.title("Logistic function")
 plt.show()
 # -
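 # ### The logistic regression expression
 #
 # This function is called the logistic function, also known as the sigmoid function. Its formula is:
 #
 # $$
 # g(z) = \frac{1}{1+e^{-z}}
 # $$
 #
 # As $z$ tends to $+\infty$, $g(z)$ tends to 1; as $z$ tends to $-\infty$, $g(z)$ tends to 0. The graph of the logistic function is shown above. Its derivative has a property that will be used in the derivation below: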
 # $$
 # g'(z) =  \frac{d}{dz} \frac{1}{1+e^{-z}} \\
 #       =  \frac{1}{(1+e^{-z})^2}(e^{-z}) \\
 #       =  \frac{1}{(1+e^{-z})} (1 - \frac{1}{(1+e^{-z})}) \\
 #       =  g(z)(1-g(z))
 # $$
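 #
 # The identity $g'(z) = g(z)(1-g(z))$ can be checked numerically. The sketch below compares it with a central finite difference at a few points; the sample points and the step `h` are arbitrary choices made for the example.

```python
import math

def g(z):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))

h = 1e-6
for z in (-3.0, 0.0, 2.5):
    numeric = (g(z + h) - g(z - h)) / (2 * h)   # central difference
    analytic = g(z) * (1 - g(z))                # g'(z) = g(z)(1 - g(z))
    print(z, numeric, analytic)
```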
 #
 #
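 # Logistic regression is essentially linear regression with one extra function mapping between the features and the result: the features are first combined linearly, and the function $g(z)$ is then applied to the sum to form the hypothesis used for prediction. $g(z)$ maps a continuous value into the range between 0 and 1. Substituting the linear regression model into $g(z)$ gives the logistic regression expression: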
 #
 # $$
 # h_\theta(x) = g(\theta^T x) = \frac{1}{1+e^{-\theta^T x}}
 # $$
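 # ### Soft classification with logistic regression
 #
 # The logistic function normalizes the output $h_\theta(x)$ into (0,1), so its value has a special meaning: it is the probability that the result is 1. For an input $x$, the probabilities of class 1 and class 0 are therefore: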
 #
 # $$
 # P(y=1|x,\theta) = h_\theta(x) \\
 # P(y=0|x,\theta) = 1 - h_\theta(x)
 # $$
 #
 # The two expressions can be combined into:
 #
 # $$
 # p(y|x,\theta) = (h_\theta(x))^y (1 - h_\theta(x))^{1-y}
 # $$
 #
 #
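 # ### Gradient ascent
 #
 # With the logistic regression expression in hand, the next step is similar to linear regression: build the likelihood function, apply maximum likelihood estimation, and derive the iterative update for $\theta$. The difference is that gradient ascent is used here instead of gradient descent, because the likelihood function is being maximized.
 #
 # Assuming the training samples are independent of each other, the likelihood function is:
 # ![Loss](images/eq_loss.png)
 #
 # Taking the log of the likelihood function gives:
 # ![LogLoss](images/eq_logloss.png)
 #
 # We take the partial derivative of the log-likelihood with respect to $\theta$, considering the case of a single training sample:
 # ![LogLossDiff](images/eq_logloss_diff.png)
 #
 # In this derivation:
 # * The first step transforms the partial derivative with respect to $\theta$ using the rule $y=\ln x$, $y'=1/x$.
 # * The second step uses the derivative property $g'(z) = g(z)(1 - g(z))$.
 # * The third step is ordinary algebraic simplification.
 #
 # This gives the update direction for each gradient ascent iteration, so the iterative update for $\theta$ is: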
 # $$
 # \theta_j := \theta_j + \alpha(y^i - h_\theta(x^i)) x_j^i
 # $$
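 #
 # A single application of this update can be sketched in plain Python. The helper name `ascent_step`, the learning rate and the sample values are assumptions made for illustration; the class in the program below applies the same rule with NumPy.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ascent_step(theta, x, y, alpha=0.1):
    """One gradient-ascent update of theta on a single sample (x, y)."""
    h = sigmoid(sum(t * xj for t, xj in zip(theta, x)))
    return [t + alpha * (y - h) * xj for t, xj in zip(theta, x)]

theta = [0.0, 0.0]
x, y = [1.0, 2.0], 1     # x[0] = 1 plays the role of the bias input
before = sigmoid(sum(t * xj for t, xj in zip(theta, x)))
theta = ascent_step(theta, x, y)
after = sigmoid(sum(t * xj for t, xj in zip(theta, x)))
print(theta, before, after)
```

 # After the step the predicted probability for this sample moves towards its label.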
 #
 #
 # ## Program
 # +
 # %matplotlib inline
 import numpy as np
 import sklearn.datasets
 import matplotlib.pyplot as plt
 np.random.seed(0)
 # +
 # load sample data
 data, label = sklearn.datasets.make_moons(200, noise=0.30)
 print("data  = ", data[:10, :])
 print("label = ", label[:10])
 plt.scatter(data[:,0], data[:,1], c=label)
 plt.title("Original Data")
 # +
 def plot_decision_boundary(predict_func, data, label):
    """Plot the decision boundary together with the data
    Args:
        predict_func (callable): prediction function
        data (numpy.ndarray): training data
        label (numpy.ndarray): training labels
    """
    x_min, x_max = data[:, 0].min() - .5, data[:, 0].max() + .5
    y_min, y_max = data[:, 1].min() - .5, data[:, 1].max() + .5
    h = 0.01
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = predict_func(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
    plt.scatter(data[:, 0], data[:, 1], c=label, cmap=plt.cm.Spectral)
    plt.show()
 # +
 def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))
 class Logistic(object):
    """Logistic regression model"""
    def __init__(self, data, label):
        self.data = data
        self.label = label
        self.data_num, n = np.shape(data)
        self.weights = np.ones(n)
        self.b = 1
    def train(self, num_iteration=150):
        """Stochastic gradient ascent
        Args:
            num_iteration (int): number of passes over the training data
        """
        for j in range(num_iteration):
            data_index = list(range(self.data_num))
            for i in range(self.data_num):
                # learning rate
                alpha = 0.01
                # draw a sample that has not been used yet in this pass
                rand_pos = int(np.random.uniform(0, len(data_index)))
                rand_index = data_index.pop(rand_pos)
                # the bias is added once, after the weighted sum
                z = np.dot(self.data[rand_index], self.weights) + self.b
                error = self.label[rand_index] - sigmoid(z)
                self.weights += alpha * error * self.data[rand_index]
                self.b += alpha * error
    def predict(self, predict_data):
        """Predict class 0 or 1 for each row of predict_data"""
        z = np.dot(predict_data, self.weights) + self.b
        return (z > 0).astype(int)
 # -
 logistic = Logistic(data, label)
 logistic.train(200)
 plot_decision_boundary(lambda x: logistic.predict(x), data, label)
 # ## How to use sklearn to solve the problem
 #
 # +
 from sklearn.linear_model import LogisticRegression
 from sklearn.metrics import confusion_matrix
 from sklearn.metrics import accuracy_score
 import matplotlib.pyplot as plt
 # calculate train/test data number
 N = len(data)
 N_train = int(N*0.8)
 N_test = N - N_train
 # split train/test data
 x_train = data[:N_train, :]
 y_train = label[:N_train]
 x_test  = data[N_train:, :]
 y_test  = label[N_train:]
 # do logistic regression
 lr=LogisticRegression()
 lr.fit(x_train,y_train)
 pred_train = lr.predict(x_train)
 pred_test  = lr.predict(x_test)
 # calculate train/test accuracy
 acc_train = accuracy_score(y_train, pred_train)
 acc_test = accuracy_score(y_test, pred_test)
 print("accuracy train = %f" % acc_train)
 print("accuracy test = %f" % acc_test)
 # plot confusion matrix
 cm = confusion_matrix(y_test,pred_test)
 plt.matshow(cm)
 plt.title(u'Confusion Matrix')
 plt.colorbar()
 plt.ylabel(u'Groundtruth')
 plt.xlabel(u'Predict')
 plt.show()
 # -
 # ## Multi-class recognition
 # ### Load & show the data
 # +
 import matplotlib.pyplot as plt 
 from sklearn.datasets import load_digits
 # load data
 digits = load_digits()
 # copied from notebook 02_sklearn_data.ipynb
 fig = plt.figure(figsize=(6, 6))  # figure size in inches
 fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
 # plot the digits: each image is 8x8 pixels
 for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.binary)
    # label the image with the target value
    ax.text(0, 7, str(digits.target[i]))
 # -
 # ### Visualizing the Data
 #
 # A good first-step for many problems is to visualize the data using one of the Dimensionality Reduction techniques we saw earlier. We'll start with the most straightforward one, Principal Component Analysis (PCA).
 #
 # PCA seeks orthogonal linear combinations of the features which show the greatest variance, and as such, can help give you a good idea of the structure of the data set. Here we'll use `PCA` with `svd_solver="randomized"`, because it's faster for large N.
 # +
 from sklearn.decomposition import PCA
 pca = PCA(n_components=2, svd_solver="randomized")
 proj = pca.fit_transform(digits.data)
 plt.scatter(proj[:, 0], proj[:, 1], c=digits.target)
 plt.colorbar()
 # -
 # A weakness of PCA is that it produces a linear dimensionality reduction:
 # this may miss some interesting relationships in the data.  If we want to
 # see a nonlinear mapping  of the data, we can use one of the several
 # methods in the `manifold` module.  Here we'll use [Isomap](https://blog.csdn.net/VictoriaW/article/details/78497316) (a contraction
 # of Isometric Mapping) which is a manifold learning method based on
 # graph theory:
 # +
 from sklearn.manifold import Isomap
 iso = Isomap(n_neighbors=5, n_components=2)
 proj = iso.fit_transform(digits.data)
 plt.scatter(proj[:, 0], proj[:, 1], c=digits.target)
 plt.colorbar()
 # -
 # ## Program
 # +
 from sklearn.datasets import load_digits
 from sklearn.linear_model import LogisticRegression
 from sklearn.metrics import accuracy_score
 import matplotlib.pyplot as plt 
 # load digits data
 digits, dig_label = load_digits(return_X_y=True)
 print(digits.shape)
 # calculate train/test data number
 N = len(digits)
 N_train = int(N*0.8)
 N_test = N - N_train
 # split train/test data
 x_train = digits[:N_train, :]
 y_train = dig_label[:N_train]
 x_test  = digits[N_train:, :]
 y_test  = dig_label[N_train:]
 # do logistic regression
 lr=LogisticRegression()
 lr.fit(x_train,y_train)
 pred_train = lr.predict(x_train)
 pred_test  = lr.predict(x_test)
 # calculate train/test accuracy
 acc_train = accuracy_score(y_train, pred_train)
 acc_test = accuracy_score(y_test, pred_test)
 print("accuracy train = %f, accuracy_test = %f" % (acc_train, acc_test))
 score_train = lr.score(x_train, y_train)
 score_test  = lr.score(x_test, y_test)
 print("score_train = %f, score_test = %f" % (score_train, score_test))
 # +
 from sklearn.metrics import confusion_matrix
 # plot confusion matrix
 cm = confusion_matrix(y_test,pred_test)
 plt.matshow(cm)
 plt.title(u'Confusion Matrix')
 plt.colorbar()
 plt.ylabel(u'Groundtruth')
 plt.xlabel(u'Predict')
 plt.show()
 # -
 # ## Exercise - How to draw misclassified data?
 #
 # 1. How to obtain the indices of the misclassified samples?
 # 2. How to draw them?
 # ## References
 #
 # * [逻辑回归模型(Logistic Regression, LR)基础](https://www.cnblogs.com/sparkwen/p/3441197.html)
 # * [逻辑回归（Logistic Regression）](http://www.cnblogs.com/BYRans/p/4713624.html)
--- a/4_logistic_regression/PCA_and_Logistic_Regression.py
+++ b/4_logistic_regression/PCA_and_Logistic_Regression.py
@@ -1,167 +0,0 @@
 # -*- coding: utf-8 -*-
 # ---
 # jupyter:
 #   jupytext_format_version: '1.2'
 #   kernelspec:
 #     display_name: Python 3
 #     language: python
 #     name: python3
 #   language_info:
 #     codemirror_mode:
 #       name: ipython
 #       version: 3
 #     file_extension: .py
 #     mimetype: text/x-python
 #     name: python
 #     nbconvert_exporter: python
 #     pygments_lexer: ipython3
 #     version: 3.5.2
 # ---
 # # Chaining a PCA and a logistic regression
 # The PCA does an unsupervised dimensionality reduction, while the logistic regression does the prediction.
 #
 # We use a GridSearchCV to set the dimensionality of the PCA
 # +
 # %matplotlib inline
 import numpy as np
 import matplotlib.pyplot as plt
 from sklearn import linear_model, decomposition, datasets
 from sklearn.pipeline import Pipeline
 from sklearn.model_selection import GridSearchCV
 logistic = linear_model.LogisticRegression()
 pca = decomposition.PCA()
 pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])
 digits = datasets.load_digits()
 X_digits = digits.data
 y_digits = digits.target
 # Plot the PCA spectrum
 pca.fit(X_digits)
 plt.figure(1, figsize=(4, 3))
 plt.clf()
 plt.axes([.2, .2, .7, .7])
 plt.plot(pca.explained_variance_, linewidth=2)
 plt.axis('tight')
 plt.xlabel('n_components')
 plt.ylabel('explained_variance_')
 # Prediction
 n_components = [20, 40, 64]
 Cs = np.logspace(-4, 4, 3)
 # Parameters of pipelines can be set using '__' separated parameter names:
 estimator = GridSearchCV(pipe,
                         dict(pca__n_components=n_components,
                              logistic__C=Cs))
 estimator.fit(X_digits, y_digits)
 plt.axvline(estimator.best_estimator_.named_steps['pca'].n_components,
            linestyle=':', label='n_components chosen')
 plt.legend(prop=dict(size=12))
 plt.show()
 # +
 # Compare the performance
 from sklearn.datasets import load_digits
 from sklearn.linear_model import LogisticRegression
 from sklearn import decomposition
 from sklearn.metrics import confusion_matrix
 from sklearn.metrics import accuracy_score
 import matplotlib.pyplot as plt
 # load digits data
 digits, dig_label = load_digits(return_X_y=True)
 print(digits.shape)
 # draw one digit
 plt.gray() 
 plt.matshow(digits[0].reshape([8, 8])) 
 plt.show() 
 # +
 # calculate train/test data number
 N = len(digits)
 N_train = int(N*0.8)
 N_test = N - N_train
 # split train/test data
 x_train = digits[:N_train, :]
 y_train = dig_label[:N_train]
 x_test  = digits[N_train:, :]
 y_test  = dig_label[N_train:]
 # do logistic regression
 lr=LogisticRegression()
 lr.fit(x_train,y_train)
 pred_train = lr.predict(x_train)
 pred_test  = lr.predict(x_test)
 # calculate train/test accuracy
 acc_train = accuracy_score(y_train, pred_train)
 acc_test = accuracy_score(y_test, pred_test)
 print("accuracy train = %f, accuracy_test = %f" % (acc_train, acc_test))
 # +
 # do PCA with 'n_components=40'
 pca = decomposition.PCA(n_components=40)
 pca.fit(x_train)
 x_train_pca = pca.transform(x_train)
 x_test_pca = pca.transform(x_test)
 # do logistic regression
 lr=LogisticRegression()
 lr.fit(x_train_pca,y_train)
 pred_train = lr.predict(x_train_pca)
 pred_test  = lr.predict(x_test_pca)
 # calculate train/test accuracy
 acc_train = accuracy_score(y_train, pred_train)
 acc_test = accuracy_score(y_test, pred_test)
 print("accuracy train = %f, accuracy_test = %f" % (acc_train, acc_test))
 # +
 # do kernel PCA
 #   Ref: http://scikit-learn.org/stable/auto_examples/decomposition/plot_kernel_pca.html
 from sklearn.decomposition import PCA, KernelPCA
 kpca = KernelPCA(n_components=45, kernel="rbf", fit_inverse_transform=True, gamma=10)
 kpca.fit(x_train)
 x_train_pca = kpca.transform(x_train)
 x_test_pca = kpca.transform(x_test)
 # do logistic regression
 lr=LogisticRegression()
 lr.fit(x_train_pca,y_train)
 pred_train = lr.predict(x_train_pca)
 pred_test  = lr.predict(x_test_pca)
 # calculate train/test accuracy
 acc_train = accuracy_score(y_train, pred_train)
 acc_test = accuracy_score(y_test, pred_test)
 print("accuracy train = %f, accuracy_test = %f" % (acc_train, acc_test))
 # -
 # ## References
 # * [Pipelining: chaining a PCA and a logistic regression](http://scikit-learn.org/stable/auto_examples/plot_digits_pipe.html)
 # * [Unsupervised dimensionality reduction with PCA](https://ljalphabeta.gitbooks.io/python-/content/pca.html)
--- a/5_nn/Perceptron.py
+++ b/5_nn/Perceptron.py
@@ -1,203 +0,0 @@
 # -*- coding: utf-8 -*-
 # ---
 # jupyter:
 #   jupytext_format_version: '1.2'
 #   kernelspec:
 #     display_name: Python 3
 #     language: python
 #     name: python3
 #   language_info:
 #     codemirror_mode:
 #       name: ipython
 #       version: 3
 #     file_extension: .py
 #     mimetype: text/x-python
 #     name: python
 #     nbconvert_exporter: python
 #     pygments_lexer: ipython3
 #     version: 3.5.2
 # ---
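 # ## Perceptron
 #
 # The perceptron is a linear model for binary classification. Its input is the feature vector of an instance and its output is the class of that instance (+1 or -1). A perceptron corresponds to a separating hyperplane that divides the input space into two classes. Learning aims to find this hyperplane: a loss function based on misclassification is introduced and minimized with gradient descent. The learning algorithm is simple and easy to implement, and comes in a primal form and a dual form. Prediction applies the learned model to new instances, so the perceptron is a discriminative model. It was proposed by Rosenblatt in 1957 and is the foundation of neural networks and support vector machines.
 #
 # The perceptron imitates a neuron in a biological nervous system: it receives signals from multiple sources and converts them into a signal that is convenient to propagate onward (an electrical signal in a living organism).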
 #
 # ![neuron](images/neuron.png)
 #
 # * dendrites - 树突
 # * nucleus - 细胞核
 # * axon - 轴突
 #
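 # The psychologist Rosenblatt conceived the perceptron as a simplified mathematical model of how neurons in the brain work: it takes a set of binary inputs (nearby neurons), multiplies each input by a continuous-valued weight (the synaptic strength of each nearby neuron), and applies a threshold. If the weighted sum of the inputs exceeds the threshold it outputs 1, otherwise 0 (analogous to whether a neuron fires). Most inputs of a perceptron are either data or the outputs of other perceptrons.
 #
 # The McCulloch-Pitts model lacked a learning mechanism, which is crucial for AI. This is where the perceptron did better: inspired by the foundational work of Donald Hebb, Rosenblatt came up with a way for such artificial neurons to learn. Hebb proposed the unexpected and far-reaching idea that knowledge and learning occur in the brain mainly through the formation and change of synapses between neurons, summarized as Hebb's rule:
 #
 # >When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased.
 #
 #
 # The perceptron does not follow this idea exactly, **but adjusting the input weights gives a very simple and intuitive learning scheme: given a training set of input-output examples, the perceptron should "learn" a function. For each example, if the perceptron's output is much lower than the target, increase its weights; if it is much higher than the target, decrease its weights.**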
 #
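 # ## 1. Perceptron model
 #
 # Let the input space (feature vectors) be $X \subseteq R^n$ and the output space be $Y = \{-1, +1\}$. An input $x \in X$ is the feature vector of an instance and corresponds to a point in the input space; the output $y \in Y$ is the class of the instance. The function from the input space to the output space
 #
 # $$
 # f(x) = sign(w \cdot x + b)
 # $$
 #
 # is called the perceptron. The parameter $w$ is the weight vector, $b$ is the bias, and $w \cdot x$ is the inner product of $w$ and $x$. $sign$ is the sign function:
 # ![sign_function](images/sign.png)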
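 #
 # The model definition can be sketched in a few lines of plain Python. This is only an illustration: the names `perceptron_predict`, `w_demo` and `b_demo` and their values are made up for the example, and the convention that `sign` returns +1 at exactly 0 is an assumption, since the definition itself is only given in the figure.

```python
def sign(v):
    """Sign function; returning +1 at v == 0 is a convention assumed here."""
    return 1 if v >= 0 else -1


def perceptron_predict(w, b, x):
    """Evaluate f(x) = sign(w . x + b) for one feature vector x."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sign(activation)


# Hypothetical parameters, chosen only for illustration.
w_demo = [1.0, -1.0]
b_demo = 0.5

print(perceptron_predict(w_demo, b_demo, [2.0, 1.0]))  # activation is  1.5
print(perceptron_predict(w_demo, b_demo, [0.0, 3.0]))  # activation is -2.5
```

 # The two points fall on opposite sides of the hyperplane $w \cdot x + b = 0$, so they receive opposite labels.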
 #
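 # ### Geometric interpretation
 # The perceptron is a linear classifier. Its hypothesis space is the set of all linear classifiers defined on the feature space, i.e. the set of functions $\{f \mid f(x) = w \cdot x + b\}$. The linear equation $w \cdot x + b = 0$ corresponds to a hyperplane $S$ in the feature space $R^n$, where $w$ is the normal vector of the hyperplane and $b$ is its intercept. The hyperplane divides the feature space into two parts, and the points on its two sides are the positive and negative classes. $S$ is called the separating hyperplane, as shown below: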
 # ![perceptron_geometry_def](images/perceptron_geometry_def.png)
 #
 # ### Biological analogy
 # ![perceptron_2](images/perceptron_2.PNG)
 #
 #
 #
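 # ## 2. Perceptron learning strategy
 #
 # Assume the training set is linearly separable. The goal of perceptron learning is to find a separating hyperplane that completely separates the positive and negative training instances, i.e. to determine the parameters $w$ and $b$. This requires a learning strategy: define an (empirical) loss function and minimize it.
 #
 # A natural choice of loss is the total number of misclassified points. However, that loss is not a continuously differentiable function of $w$ and $b$, so it is hard to optimize. Another choice is the sum of the distances from the misclassified points to the separating hyperplane.
 #
 # First, the distance from any point $x_0$ to the hyperplane is
 # $$
 # \frac{1}{||w||} | w \cdot x_0 + b |
 # $$
 #
 # Second, every misclassified point $(x_i, y_i)$ satisfies $-y_i (w \cdot x_i + b) > 0$.
 #
 # So, if $M$ is the set of points misclassified by the hyperplane $S$, the sum of the distances from all misclassified points to $S$ is
 # $$
 # -\frac{1}{||w||} \sum_{x_i \in M} y_i (w \cdot x_i + b)
 # $$
 # Dropping the factor $1/||w||$ gives the loss function of perceptron learning.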
 #
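 # ### Empirical risk function
 #
 # Given a dataset $T = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$ (with $x_i \in X = R^n$, $y_i \in Y = \{-1, +1\}$, $i = 1, 2, \dots, N$), the loss function for learning the perceptron $sign(w \cdot x + b)$ is defined as
 # $$
 # L(w, b) = - \sum_{x_i \in M} y_i (w \cdot x_i + b)
 # $$
 # where $M$ is the set of misclassified points. This loss is the [empirical risk function](https://blog.csdn.net/zhzhx1204/article/details/70163099) of perceptron learning.
 #
 # Clearly, $L(w, b)$ is non-negative. If there are no misclassified points, $L(w, b) = 0$, and the fewer the misclassified points, the smaller $L(w, b)$. For a particular sample point, the loss is a linear function of $w$ and $b$ when the point is misclassified and 0 when it is classified correctly. Therefore, for a given training set $T$, the loss $L(w, b)$ is a continuous, differentiable function of $w$ and $b$.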
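 #
 # A minimal sketch of this loss in plain Python, assuming each sample is stored as a `(features, label)` pair. The function name `perceptron_loss` and the toy data are hypothetical; points lying exactly on the hyperplane are counted as misclassified, which is a convention assumed here.

```python
def perceptron_loss(data, w, b):
    """L(w, b) = -sum over misclassified points of y * (w . x + b)."""
    loss = 0.0
    for x, y in data:
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        if margin <= 0:  # misclassified, or exactly on the hyperplane
            loss += -margin
    return loss


# Toy data (hypothetical): two positive and two negative points.
toy_data = [([1.0, 3.0], 1), ([2.0, 5.0], 1), ([3.0, 1.0], -1), ([4.0, 1.0], -1)]

loss_bad = perceptron_loss(toy_data, [1.0, 0.0], 0.0)    # misclassifies both negatives
loss_good = perceptron_loss(toy_data, [-1.0, 1.0], 0.0)  # separates all four points
print(loss_bad, loss_good)
```

 # Only the misclassified points contribute, so a hyperplane that separates the data has loss 0.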
 #
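 # ## 3. Perceptron learning algorithm
 #
 #
 # Optimization problem: given a dataset $T = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$ (with $x_i \in X = R^n$, $y_i \in Y = \{-1, +1\}$, $i = 1, 2, \dots, N$), find the parameters $w$, $b$ that minimize the loss function ($M$ is the set of misclassified points):
 #
 # $$
 # \min_{w,b} L(w, b) =  - \sum_{x_i \in M} y_i (w \cdot x_i + b)
 # $$
 #
 # Perceptron learning is driven by misclassification and uses [stochastic gradient descent](https://blog.csdn.net/zbc1090549839/article/details/38149561). First choose arbitrary $w_0$, $b_0$, then minimize the objective by gradient descent. The minimization does not descend along the gradient of all misclassified points in $M$ at once; instead it picks one misclassified point at random each time and descends along its gradient.
 #
 # Assuming the misclassified set $M$ is fixed, the gradient of the loss $L(w,b)$ is
 # $$
 # \nabla_w L(w, b) = - \sum_{x_i \in M} y_i x_i \\
 # \nabla_b L(w, b) = - \sum_{x_i \in M} y_i
 # $$
 #
 # Pick a misclassified point $(x_i,y_i)$ at random and update $w, b$:
 # $$
 # w = w + \eta y_i x_i \\
 # b = b + \eta y_i
 # $$
 #
 # Here $\eta$ ($0 < \eta \le 1$) is the step size, called the learning rate in statistical learning. A larger step makes gradient descent move faster toward the minimum, but a step that is too large may jump over the minimum and make the iteration diverge, while a step that is too small may take a long time to reach the minimum.
 #
 # Intuition: when an instance is misclassified, adjust $w$, $b$ so that the separating hyperplane moves toward the misclassified point, reducing the distance between them until the hyperplane passes the point and it is classified correctly.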
 #
 #
 #
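 # Algorithm
 # ```
 # Input: T={(x1,y1),(x2,y2)...(xN,yN)} (xi∈X=Rn, yi∈Y={-1, +1}, i=1,2...N) and the learning rate η
 # Output: w, b; the perceptron model f(x)=sign(w·x+b)
 # (1) Initialize w0, b0
 # (2) Select a sample (xi, yi) from the training set
 # (3) If yi(w·xi+b)≤0
 #            w = w + ηyixi
 #            b = b + ηyi
 # (4) Go to (2) until the training set contains no misclassified points
 # ```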
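 #
 # The steps above can be sketched as follows. This is one possible reading, not a reference implementation: it sweeps over the samples in order instead of picking them at random, and it stops after a full pass without updates or after `max_epochs` passes. The name `train_perceptron` and the parameter `max_epochs` are assumptions made for the sketch.

```python
def train_perceptron(data, eta=1.0, max_epochs=1000):
    """Primal perceptron algorithm; stops after a pass with no update."""
    n_features = len(data[0][0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(max_epochs):
        updated = False
        for x, y in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin <= 0:  # step (3): the sample is misclassified
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b = b + eta * y
                updated = True
        if not updated:  # every sample now has a positive margin
            break
    return w, b


# Linearly separable toy samples stored as (features, label) pairs.
samples = [([1, 3], 1), ([2, 5], 1), ([3, 8], 1), ([2, 6], 1),
           ([3, 1], -1), ([4, 1], -1), ([6, 2], -1), ([7, 3], -1)]

w_fit, b_fit = train_perceptron(samples)
margins = [y * (sum(wi * xi for wi, xi in zip(w_fit, x)) + b_fit)
           for x, y in samples]
print(w_fit, b_fit)
```

 # Because the toy samples are linearly separable, the loop ends with every margin strictly positive. On data that is not separable, it would run until `max_epochs` is exhausted.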
 #
 #
 # ## 4. Program
 #
 # +
 import random
 import numpy as np
 # sign function
 def sign(v):
    if v > 0:  return 1
    else:      return -1
 def perceptron_train(train_data, eta=0.5, n_iter=100):
    weight = [0, 0]  # weights
    bias = 0  # bias
    learning_rate = eta  # learning rate
    train_num = n_iter  # number of iterations
    for i in range(train_num):
        # FIXME: choosing samples at random is too slow
        train = random.choice(train_data)
        x1, x2, y = train
        predict = sign(weight[0] * x1 + weight[1] * x2 + bias)  # model output
        #print("train data: x: (%2d, %2d) y: %2d  ==> predict: %2d" % (x1, x2, y, predict))
        if y * predict <= 0:  # the sample is misclassified
            weight[0] = weight[0] + learning_rate * y * x1  # update weights
            weight[1] = weight[1] + learning_rate * y * x2
            bias      = bias      + learning_rate * y       # update bias
            print("update weight and bias: ", weight[0], weight[1], bias)
    #print("stop training: ", weight[0], weight[1], bias)
    return weight, bias
 def perceptron_pred(data, w, b):
    y_pred = []
    for d in data:
        x1, x2, y = d
        yi = sign(w[0]*x1 + w[1]*x2 + b)
        y_pred.append(yi)
    return y_pred
 # set training data
 train_data = np.array([[1, 3,  1], [2, 5,  1], [3, 8,  1], [2, 6,  1], 
                       [3, 1, -1], [4, 1, -1], [6, 2, -1], [7, 3, -1]])
 # do training
 w, b = perceptron_train(train_data)
 print("w = ", w)
 print("b = ", b)
 # predict 
 y_pred = perceptron_pred(train_data, w, b)
 print(train_data[:, 2])
 print(y_pred)
 # -
 # ## Reference
 # * [Perceptron (Python implementation)](http://www.cnblogs.com/kaituorensheng/p/3561091.html)
 # * [Programming a Perceptron in Python](https://blog.dbrgn.ch/2013/3/26/perceptrons-in-python/)
 # * [Loss function, risk function, empirical risk minimization, structural risk minimization](https://blog.csdn.net/zhzhx1204/article/details/70163099)
--- a/5_nn/README.md
+++ b/5_nn/README.md
@@ -0,0 +1,4 @@
 ## References
 * https://iamtrask.github.io/2015/07/12/basic-python-network/
 * http://www.wildml.com/2015/09/implementing-a-neural-network-from-scratch/
--- a/5_nn/mlp_bp.py
+++ b/5_nn/mlp_bp.py
@@ -1,557 +0,0 @@
 # -*- coding: utf-8 -*-
 # ---
 # jupyter:
 #   jupytext_format_version: '1.2'
 #   kernelspec:
 #     display_name: Python 3
 #     language: python
 #     name: python3
 #   language_info:
 #     codemirror_mode:
 #       name: ipython
 #       version: 3
 #     file_extension: .py
 #     mimetype: text/x-python
 #     name: python
 #     nbconvert_exporter: python
 #     pygments_lexer: ipython3
 #     version: 3.5.2
 # ---
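 # # Multilayer neural networks and backpropagation
 #
 # ## Neuron
 #
 # A neuron is essentially the same as a perceptron. When we say perceptron, the activation function is a step function; when we say neuron, the activation function is usually the sigmoid or tanh function, as shown below:
 #
 # ![neuron](images/neuron.gif)
 #
 # The output of a neuron is computed the same way as the output of a perceptron. Suppose the input of the neuron is the vector $\vec{x}$, the weight vector is $\vec{w}$ (with bias term $w_0$), and the activation function is the sigmoid function. Its output $y$ is then:
 # $$
 # y = sigmoid(\vec{w}^T \cdot \vec{x})
 # $$
 #
 # The sigmoid function is defined as:
 # $$
 # sigmoid(x) = \frac{1}{1+e^{-x}}
 # $$
 # Substituting it into the previous expression gives
 # $$
 # y = \frac{1}{1+e^{-\vec{w}^T \cdot \vec{x}}}
 # $$
 #
 # The sigmoid function is nonlinear and its range is (0,1). Its graph is shown below:
 #
 # ![sigmod_function](images/sigmod.jpg)
 #
 # The derivative of the sigmoid function is:
 # \begin{eqnarray}
 # y & = & sigmoid(x) \tag{1} \\
 # y' & = & y(1-y)
 # \end{eqnarray}
 #
 # The derivative of the sigmoid function has a convenient property: it can be expressed through the sigmoid function itself. Once the value of the sigmoid has been computed, its derivative follows at almost no extra cost.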
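 #
 # The identity $y' = y(1-y)$ can be checked numerically. The sketch below compares it with a central finite difference; the helper names `sigmoid_grad` and `numeric_grad` and the step `h` are choices made for this illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative expressed through the function value: y * (1 - y)."""
    y = sigmoid(x)
    return y * (1.0 - y)


def numeric_grad(f, x, h=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


for x in (-2.0, 0.0, 3.0):
    print(x, sigmoid_grad(x), numeric_grad(sigmoid, x))
```

 # The two columns agree to within the finite-difference error, and at $x = 0$ the derivative takes its maximum value 0.25.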
 #
 #
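 # ## What is a neural network?
 #
 # ![nn1](images/nn1.jpeg)
 #
 # A neural network is simply a number of neurons connected according to certain rules. The figure above shows a fully connected (FC) neural network, and we can see that its rules include:
 #
 # * Neurons are arranged in layers. The leftmost layer is the input layer, which receives the input data; the rightmost layer is the output layer, from which we read the network's output. The layers between them are called hidden layers because they are not visible from outside.
 # * There are no connections between neurons of the same layer.
 # * Every neuron of layer N is connected to all neurons of layer N-1 (this is what fully connected means); the outputs of layer N-1 are the inputs of layer N.
 # * Every connection has a weight.
 #
 # These rules define the structure of a fully connected network. Many other architectures exist, such as convolutional neural networks (CNN) and recurrent neural networks (RNN), each with different connection rules.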
 #
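 # ## Computing the output of a neural network
 #
 # A neural network is in fact a function from an input vector $\vec{x}$ to an output vector $\vec{y}$:
 #
 # $$
 # \vec{y} = f_{network}(\vec{x})
 # $$
 # To compute the output from the input, first assign each element $x_i$ of the input vector $\vec{x}$ to the corresponding neuron of the input layer. Then use Eq. 1 to compute the value of every neuron, layer by layer in the forward direction, until all neurons of the final output layer have been computed. Finally, concatenating the values of the output-layer neurons gives the output vector $\vec{y}$.
 #
 # Let us walk through an example. We first number every unit of the network.
 #
 # ![nn2](images/nn2.png)
 #
 # In the figure above, the input layer has three nodes, numbered 1, 2, 3; the hidden layer has four nodes, numbered 4, 5, 6, 7; and the output layer has two nodes, numbered 8 and 9. Because this is a fully connected network, every node is connected to all nodes of the previous layer. For example, hidden node 4 is connected to the three input nodes 1, 2, 3, with weights $w_{41}$, $w_{42}$, $w_{43}$ on those connections. How do we compute the output $a_4$ of node 4?
 #
 #
 # To compute the output of node 4 we first need the outputs of all its upstream nodes (nodes 1, 2, 3). These are input-layer nodes, so their outputs are simply the components of the input vector $\vec{x}$: following the figure, the outputs of nodes 1, 2, 3 are $x_1$, $x_2$, $x_3$. The dimension of the input vector must equal the number of input-layer neurons. Which element goes to which input node is a free choice; assigning $x_1$ to node 2 would work, but it would only cause confusion.
 #
 # Once we have the outputs of nodes 1, 2, 3, we can compute the output $a_4$ of node 4 using Eq. 1:
 #
 # ![eqn_3_4](images/eqn_3_4.png)
 #
 # Here $w_{4b}$ is the bias term of node 4, which is not drawn in the figure, and $w_{41}$, $w_{42}$, $w_{43}$ are the weights of the connections from nodes 1, 2, 3 to node 4. When indexing a weight $w_{ji}$, the target node $j$ comes first and the source node $i$ second.
 #
 # In the same way we can compute the outputs $a_5$, $a_6$, $a_7$ of nodes 5, 6, 7. With the outputs of all four hidden nodes available, we can then compute the output $y_1$ of output node 8:
 #
 # ![eqn_5_6](images/eqn_5_6.png)
 #
 # Likewise we can compute $y_2$. With all output-layer nodes computed, we have the output vector $\vec{y} = (y_1, y_2)^T$ of the network for the input vector $\vec{x} = (x_1, x_2, x_3)^T$. Note that the dimension of the output vector equals the number of output-layer neurons.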
 #
 #
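 # ## Matrix representation of a neural network
 #
 # Neural network computations are convenient to express with matrices. Let us first look at the matrix form of the hidden layer.
 #
 # First, list the computations of the four hidden nodes in order:
 #
 # ![eqn_hidden_units](images/eqn_hidden_units.png)
 #
 # Next, define the input vector $\vec{x}$ of the network and the weight vector $\vec{w}$ of each hidden node. Let
 #
 # ![eqn_7_12](images/eqn_7_12.png)
 #
 # Substituting into the equations above gives:
 #
 # ![eqn_13_16](images/eqn_13_16.png)
 #
 # Now write the four equations for $a_4$, $a_5$, $a_6$, $a_7$ into one matrix, one equation per row, so that the computation can be expressed with matrices. Let
 #
 # ![eqn_matrix1](images/eqn_matrix1.png)
 #
 # Substituting into the equations above gives
 #
 # ![formular_2](images/formular_2.png)
 #
 # In Eq. 2, $f$ is the activation function (the $sigmoid$ function in this example), $W$ is the weight matrix of a layer, $\vec{x}$ is the input vector of the layer, and $\vec{a}$ is its output vector. Eq. 2 says that each layer first applies a linear transformation by left-multiplying the input vector by a matrix, and then applies the activation function element-wise to the resulting vector.
 #
 # Every layer works the same way. For example, for a network with one input layer, one output layer, and three hidden layers, let the weight matrices be $W_1$, $W_2$, $W_3$, $W_4$, the outputs of the hidden layers be $\vec{a}_1$, $\vec{a}_2$, $\vec{a}_3$, the network input be $\vec{x}$, and the network output be $\vec{y}$, as shown below:
 #
 # ![nn_parameters_demo](images/nn_parameters_demo.png)
 #
 # The output vector of each layer is then computed as:
 #
 # ![eqn_17_20](images/eqn_17_20.png)
 #
 #
 # This is the matrix method for computing the output of a neural network.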
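 #
 # As an illustration of the layer-by-layer computation, here is a small sketch for the 3-4-2 network of the example, written with plain Python lists so that each row of a weight matrix visibly belongs to one neuron. All weight and bias values are hypothetical, and keeping the bias in a separate vector `b` (instead of folding it into the weight matrix) is an assumption of this sketch.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def layer_forward(W, b, x):
    """a = f(W x + b): one row of W (and one entry of b) per neuron."""
    return [sigmoid(sum(w_ji * x_i for w_ji, x_i in zip(row, x)) + b_j)
            for row, b_j in zip(W, b)]


# Hypothetical weights: 4 hidden neurons with 3 inputs each,
# 2 output neurons with 4 inputs each.
W_hidden = [[0.1, 0.2, 0.3],
            [0.0, -0.1, 0.2],
            [0.5, 0.5, -0.5],
            [-0.3, 0.1, 0.0]]
b_hidden = [0.0, 0.1, -0.1, 0.2]
W_out = [[0.2, -0.4, 0.1, 0.3],
         [-0.1, 0.2, 0.4, -0.2]]
b_out = [0.05, -0.05]

x = [1.0, 0.5, -1.0]
a = layer_forward(W_hidden, b_hidden, x)  # outputs of nodes 4, 5, 6, 7
y = layer_forward(W_out, b_out, a)        # outputs of nodes 8, 9
print(a)
print(y)
```

 # The same function is applied to every layer; only the weight matrix and the input vector change, which is the point of the matrix form.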
 #
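 # ## Training a neural network: the backpropagation algorithm
 #
 # We now need to know how the weight on each connection of a neural network is obtained. If a neural network is a model, the weights are the model's parameters, i.e. what the model has to learn. However, the connection pattern, the number of layers, and the number of nodes per layer are not learned; they are set by hand in advance. Such manually set parameters are called hyperparameters.
 #
 # The backpropagation algorithm is really just an application of the chain rule. Yet this simple and seemingly obvious method was only invented and popularized nearly 30 years after Rosenblatt proposed the perceptron algorithm. Bengio commented on this:
 #
 # > Many ideas that seem obvious only become obvious in hindsight.
 #
 # Following the usual machine learning recipe, we first define the objective function of the network and then use stochastic gradient descent to find the parameters that minimize it.
 #
 # We take the sum of squared errors over all output-layer nodes as the objective function:
 #
 # ![bp_loss](images/bp_loss.png)
 #
 # where $E_d$ is the error of sample $d$.
 #
 # Then we optimize the objective with stochastic gradient descent:
 #
 # ![bp_weight_update](images/bp_weight_update.png)
 #
 # Stochastic gradient descent requires the partial derivative (the gradient) of the error $E_d$ with respect to every weight $w_{ji}$. How do we compute it?
 #
 # ![nn3](images/nn3.png)
 #
 # From the figure, the weight $w_{ji}$ can affect the rest of the network only through the input of node $j$. Let $net_j$ be the weighted input of node $j$, i.e.
 #
 # ![eqn_21_22](images/eqn_21_22.png)
 #
 # $E_d$ is a function of $net_j$, and $net_j$ is a function of $w_{ji}$. By the chain rule:
 #
 # ![eqn_23_25](images/eqn_23_25.png)
 #
 #
 # Here $x_{ji}$ is the input that node $i$ passes to node $j$, i.e. the output of node $i$.
 #
 # Deriving $\frac{\partial E_d}{\partial net_j}$ requires treating the output layer and the hidden layers separately.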
 #
 #
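 # ### Training the output-layer weights
 #
 # ![nn3](images/nn3.png)
 #
 # For the output layer, $net_j$ can affect the rest of the network only through the output $y_j$ of node $j$. That is, $E_d$ is a function of $y_j$, and $y_j$ is a function of $net_j$, with $y_j = sigmoid(net_j)$. So we can apply the chain rule again:
 #
 # ![eqn_26](images/eqn_26.png)
 #
 # Consider the first factor:
 #
 # ![eqn_27_29](images/eqn_27_29.png)
 #
 #
 # Consider the second factor:
 #
 # ![eqn_30_31](images/eqn_30_31.png)
 #
 # Substituting both factors gives:
 #
 # ![eqn_ed_net_j.png](images/eqn_ed_net_j.png)
 #
 # Define $\delta_j = - \frac{\partial E_d}{\partial net_j}$, i.e. the error term $\delta$ of a node is the negative of the partial derivative of the network error with respect to that node's input. Substituting into the expression above gives:
 #
 # ![eqn_delta_j.png](images/eqn_delta_j.png)
 #
 # Substituting this derivation into the stochastic gradient descent formula gives:
 #
 # ![eqn_32_34.png](images/eqn_32_34.png)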
 #
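 # ### Training the hidden-layer weights
 #
 # Now we derive $\frac{\partial E_d}{\partial net_j}$ for the hidden layers.
 #
 # ![nn3](images/nn3.png)
 #
 # First, define $Downstream(j)$ as the set of all direct downstream nodes of node $j$. For example, the direct downstream nodes of node 4 are nodes 8 and 9. $net_j$ can affect $E_d$ only through $Downstream(j)$. Let $net_k$ be the input of a downstream node of node $j$; then $E_d$ is a function of $net_k$, and $net_k$ is a function of $net_j$. Because there are several $net_k$, we apply the total derivative formula and derive:
 #
 # ![eqn_35_40](images/eqn_35_40.png)
 #
 # Since $\delta_j = - \frac{\partial E_d}{\partial net_j}$, substituting into the expression above gives:
 #
 # ![eqn_delta_hidden.png](images/eqn_delta_hidden.png)
 #
 #
 # This completes the derivation of the backpropagation algorithm. Note that the training rule just derived assumes a sigmoid activation function, a sum-of-squares error, a fully connected network, and stochastic gradient descent. With a different activation function, error measure, network structure, or optimizer, the specific training rule changes. The method of derivation is always the same, though: apply the chain rule.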
 #
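 # ### Detailed explanation
 #
 # Assume each training sample is $(\vec{x}, \vec{t})$, where the vector $\vec{x}$ holds the features of the sample and $\vec{t}$ is its target value.
 #
 # ![nn3](images/nn3.png)
 #
 # First, using the forward computation described earlier, compute from the features $\vec{x}$ the output $a_i$ of every hidden node and the output $y_i$ of every output node.
 #
 # Then compute the error term $\delta_i$ of every node as follows:
 #
 # * **For an output-layer node $i$**
 #
 # ![formular_3.png](images/formular_3.png)
 #
 # where $\delta_i$ is the error term of node $i$, $y_i$ is its output, and $t_i$ is the target value of the sample for node $i$. For example, in the figure above the output of output node 8 is $y_1$ and the target is $t_1$, so by the formula the error term of node 8 is:
 #
 # ![forumlar_delta8.png](images/forumlar_delta8.png)
 #
 # * **For a hidden-layer node**
 #
 # ![formular_4.png](images/formular_4.png)
 #
 # where $a_i$ is the output of node $i$, $w_{ki}$ is the weight of the connection from node $i$ to node $k$ in the next layer, and $\delta_k$ is the error term of that node $k$. For example, for hidden node 4:
 #
 # ![forumlar_delta4.png](images/forumlar_delta4.png)
 #
 #
 #
 # Finally, update the weight on every connection:
 #
 # ![formular_5.png](images/formular_5.png)
 #
 # where $w_{ji}$ is the weight from node $i$ to node $j$, $\eta$ is a constant called the learning rate, $\delta_j$ is the error term of node $j$, and $x_{ji}$ is the input passed from node $i$ to node $j$. For example, the weight $w_{84}$ is updated as:
 #
 # ![eqn_w84_update.png](images/eqn_w84_update.png)
 #
 # Similarly, the weight $w_{41}$ is updated as:
 #
 # ![eqn_w41_update.png](images/eqn_w41_update.png)
 #
 #
 # The input of a bias term is always 1. For example, the bias term $w_{4b}$ of node 4 is updated as:
 #
 # ![eqn_w4b_update.png](images/eqn_w4b_update.png)
 #
 # We have now covered how to compute the error term of every node and how to update the weights. Clearly, computing the error term of a node requires the error terms of all nodes in the next layer that it connects to. The error terms must therefore be computed starting from the output layer and then backward through each hidden layer, up to the hidden layer connected to the input layer. This is where the name backpropagation comes from. Once all error terms are available, all weights can be updated using Eq. 5.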
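 #
 # The three rules above (output error term, hidden error term, weight update) can be sketched for a single training sample on the 3-4-2 network. This is a plain-Python illustration under stated assumptions: the weights, the sample and the learning rate are made up, the function names are hypothetical, and the error is taken as $E_d = \frac{1}{2} \sum_i (t_i - y_i)^2$, which is an assumption because the loss is only given in a figure.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def forward(W1, b1, W2, b2, x):
    a = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj) for row, bj in zip(W1, b1)]
    y = [sigmoid(sum(w * ai for w, ai in zip(row, a)) + bj) for row, bj in zip(W2, b2)]
    return a, y


def sgd_step(W1, b1, W2, b2, x, t, eta):
    """One backpropagation update for a single sample (x, t), in place."""
    a, y = forward(W1, b1, W2, b2, x)
    # output layer: delta_i = y_i (1 - y_i) (t_i - y_i)
    d_out = [yi * (1 - yi) * (ti - yi) for yi, ti in zip(y, t)]
    # hidden layer: delta_i = a_i (1 - a_i) * sum_k w_ki delta_k
    d_hid = [ai * (1 - ai) * sum(W2[k][i] * d_out[k] for k in range(len(W2)))
             for i, ai in enumerate(a)]
    # w_ji <- w_ji + eta * delta_j * x_ji ; the bias input is always 1
    for j, row in enumerate(W2):
        for i in range(len(row)):
            row[i] += eta * d_out[j] * a[i]
        b2[j] += eta * d_out[j]
    for j, row in enumerate(W1):
        for i in range(len(row)):
            row[i] += eta * d_hid[j] * x[i]
        b1[j] += eta * d_hid[j]


def loss(y, t):
    return 0.5 * sum((ti - yi) ** 2 for yi, ti in zip(y, t))


# Hypothetical initial weights and one hypothetical training sample.
W1 = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.2], [0.5, 0.5, -0.5], [-0.3, 0.1, 0.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[0.2, -0.4, 0.1, 0.3], [-0.1, 0.2, 0.4, -0.2]]
b2 = [0.0, 0.0]
x, t = [1.0, 0.5, -1.0], [1.0, 0.0]

loss_before = loss(forward(W1, b1, W2, b2, x)[1], t)
for _ in range(100):
    sgd_step(W1, b1, W2, b2, x, t, eta=0.5)
loss_after = loss(forward(W1, b1, W2, b2, x)[1], t)
print(loss_before, loss_after)
```

 # The hidden error terms are computed before any weight is changed, so they use the same `W2` that produced the forward pass. Repeating the step on this one sample drives its error down.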
 #
 #
 # ## Program
 # +
 %matplotlib inline
 import numpy as np
 from sklearn import datasets, linear_model
 import matplotlib.pyplot as plt
 # generate sample data
 np.random.seed(0)
 X, y = datasets.make_moons(200, noise=0.20)
 # generate nn output target
 t = np.zeros((X.shape[0], 2))
 t[np.where(y==0), 0] = 1
 t[np.where(y==1), 1] = 1
 # plot data
 plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral)
 plt.show()
 # +
 # generate the NN model
 class NN_Model:
    epsilon = 0.01               # learning rate
    n_epoch = 1000               # iterative number
 nn = NN_Model()
 nn.n_input_dim = X.shape[1]      # input size
 nn.n_output_dim = 2              # output node size
 nn.n_hide_dim = 4                # hidden node size
 nn.X = X
 nn.y = y
 # initial weight array
 nn.W1 = np.random.randn(nn.n_input_dim, nn.n_hide_dim) / np.sqrt(nn.n_input_dim)
 nn.b1 = np.zeros((1, nn.n_hide_dim))
 nn.W2 = np.random.randn(nn.n_hide_dim, nn.n_output_dim) / np.sqrt(nn.n_hide_dim)
 nn.b2 = np.zeros((1, nn.n_output_dim))
 # define sigmod & its derivative function
 def sigmod(X):
    return 1.0/(1+np.exp(-X))
 def sigmod_derivative(X):
    f = sigmod(X)
    return f*(1-f)
 # network forward calculation
 def forward(n, X):
    n.z1 = sigmod(X.dot(n.W1) + n.b1)
    n.z2 = sigmod(n.z1.dot(n.W2) + n.b2)
    return n
 # use the random initial weights to predict
 forward(nn, X)
 y_pred = np.argmax(nn.z2, axis=1)
 # plot data
 plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap=plt.cm.Spectral)
 plt.show()
 # +
 from sklearn.metrics import accuracy_score
 y_true = np.array(nn.y).astype(float)
 # back-propagation
 def backpropagation(n, X, y):
    for i in range(n.n_epoch):
        # forward to calculate each node's output
        forward(n, X)
        # print loss, accuracy
        L = np.sum((n.z2 - y)**2)
        y_pred = np.argmax(n.z2, axis=1)
        acc = accuracy_score(y_true, y_pred)
        print("epoch [%4d] L = %f, acc = %f" % (i, L, acc))
        # calc weights update
        d2 = n.z2*(1-n.z2)*(y - n.z2)
        d1 = n.z1*(1-n.z1)*(np.dot(d2, n.W2.T))
        # update weights
        n.W2 += n.epsilon * np.dot(n.z1.T, d2)
        n.b2 += n.epsilon * np.sum(d2, axis=0)
        n.W1 += n.epsilon * np.dot(X.T, d1)
        n.b1 += n.epsilon * np.sum(d1, axis=0)
 nn.n_epoch = 2000
 backpropagation(nn, X, t)
 # +
 # plot data
 y_pred = np.argmax(nn.z2, axis=1)
 plt.scatter(X[:, 0], X[:, 1], c=nn.y, cmap=plt.cm.Spectral)
 plt.title("ground truth")
 plt.show()
 plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap=plt.cm.Spectral)
 plt.title("predicted")
 plt.show()
 # -
 # ## How can a multilayer neural network be wrapped in a class?
 # +
 import numpy as np
 from sklearn import datasets, linear_model
 from sklearn.metrics import accuracy_score
 import matplotlib.pyplot as plt
 # define sigmod
 def sigmod(X):
    return 1.0/(1+np.exp(-X))
 # generate the NN model
 class NN_Model:
    def __init__(self, nodes=None):
        self.epsilon = 0.01                 # learning rate
        self.n_epoch = 1000                 # iterative number
        if not nodes:
            self.nodes = [2, 4, 2]          # default nodes size (from input -> output)
        else:
            self.nodes = nodes
    def init_weight(self):
        W = []
        B = []
        n_layer = len(self.nodes)
        for i in range(n_layer-1):
            w = np.random.randn(self.nodes[i], self.nodes[i+1]) / np.sqrt(self.nodes[i])
            b = np.random.randn(1, self.nodes[i+1])
            W.append(w)
            B.append(b)
        self.W = W
        self.B = B
    def forward(self, X):
        Z = []
        x0 = X
        for i in range(len(self.nodes)-1):
            z = sigmod(np.dot(x0, self.W[i]) + self.B[i])
            x0 = z
            Z.append(z)
        self.Z = Z
        return Z[-1]
    # back-propagation
    def backpropagation(self, X, y, n_epoch=None, epsilon=None):
        if not n_epoch: n_epoch = self.n_epoch
        if not epsilon: epsilon = self.epsilon
        self.X = X
        self.Y = y
        for i in range(n_epoch):
            # forward to calculate each node's output
            self.forward(X)
            self.evaluate()
            # calc weights update
            W = self.W
            B = self.B
            Z = self.Z
            D = []
            d0 = y
            n_layer = len(self.nodes)
            for j in range(n_layer-1, 0, -1):
                jj = j - 1
                z = self.Z[jj]
                if j == n_layer - 1:
                    d = z*(1-z)*(d0 - z)
                else:
                    d = z*(1-z)*np.dot(d0, W[j].T)
                d0 = d
                D.insert(0, d)
            # update weights
            for j in range(n_layer-1, 0, -1):
                jj = j - 1
                if jj != 0:
                    W[jj] += epsilon * np.dot(Z[jj-1].T, D[jj])
                else:
                    W[jj] += epsilon * np.dot(X.T, D[jj])
                B[jj] += epsilon * np.sum(D[jj], axis=0)
    def evaluate(self):
        z = self.Z[-1]
        # print loss, accuracy
        L = np.sum((z - self.Y)**2)
        y_pred = np.argmax(z, axis=1)
        y_true = np.argmax(self.Y, axis=1)
        acc = accuracy_score(y_true, y_pred)
        print("L = %f, acc = %f" % (L, acc))
 # +
 # generate sample data
 np.random.seed(0)
 X, y = datasets.make_moons(200, noise=0.20)
 # generate nn output target
 t = np.zeros((X.shape[0], 2))
 t[np.where(y==0), 0] = 1
 t[np.where(y==1), 1] = 1
 # plot data
 plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral)
 plt.show()
 # +
 # use the NN model and training
 nn = NN_Model([2, 6, 2])
 nn.init_weight()
 nn.backpropagation(X, t, 2000)
 # +
 # predict results & plot results
 y_res  = nn.forward(X)
 y_pred = np.argmax(y_res, axis=1)
 # plot data
 plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral)
 plt.title("ground truth")
 plt.show()
 plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap=plt.cm.Spectral)
 plt.title("predicted")
 plt.show()
 # -
 # ## Further analysis
 # +
 # print some results
 print(y_res[1:10, :])
 # -
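 # **Questions**
 # 1. We would like the output to be the probability of each class.
 # 2. How can multi-class classification be handled?
 # 3. How can the network be trained faster?
 # 4. How can the class definition be structured better, so that the network class supports more types of processing layers?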
 # ## References
 # * Backpropagation algorithm
 #   * [Deep learning from scratch (3): neural networks and the backpropagation algorithm](https://www.zybuluo.com/hanbingtao/note/476663)
 #   * [Neural Network Using Python and Numpy](https://www.python-course.eu/neural_networks_with_python_numpy.php)
 #   * http://www.cedar.buffalo.edu/%7Esrihari/CSE574/Chap5/Chap5.3-BackProp.pdf
 #   * https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/
 #
--- a/5_nn/note.txt
+++ b/5_nn/note.txt
@@ -1,2 +0,0 @@
 https://iamtrask.github.io/2015/07/12/basic-python-network/
 http://www.wildml.com/2015/09/implementing-a-neural-network-from-scratch/
--- a/5_nn/softmax_ce.py
+++ b/5_nn/softmax_ce.py
@@ -1,146 +0,0 @@
 # -*- coding: utf-8 -*-
 # ---
 # jupyter:
 #   jupytext_format_version: '1.2'
 #   kernelspec:
 #     display_name: Python 3
 #     language: python
 #     name: python3
 #   language_info:
 #     codemirror_mode:
 #       name: ipython
 #       version: 3
 #     file_extension: .py
 #     mimetype: text/x-python
 #     name: python
 #     nbconvert_exporter: python
 #     pygments_lexer: ipython3
 #     version: 3.5.2
 # ---
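 # # Softmax & cross-entropy cost function
 #
 # Softmax is often added as the output layer of neural networks for classification. The key step of backpropagation is computing derivatives; working through the softmax derivation gives a deeper understanding of backpropagation and of how gradients propagate.
 #
 # ## The softmax function
 #
 # In a neural network, softmax usually serves as the output layer for classification. Its outputs can be read as the probabilities of choosing each class: for a task with three classes, softmax outputs three probabilities based on the relative size of its inputs, and they sum to 1.
 #
 # The softmax function has the form:
 #
 # $$
 # S_i = \frac{e^{z_i}}{\sum_k e^{z_k}}
 # $$
 #
 # * $S_i$ is the class probability output by softmax
 # * $z_k$ is the output of a neuron
 #
 #
 # The figure below illustrates this:
 #
 # ![softmax_demo](images/softmax_demo.png)
 #
 # Put plainly, softmax maps raw outputs such as $[3,1,-3]$ to values in (0,1) that sum to 1 (the properties of a probability distribution), so we can interpret them as probabilities. When choosing the output node, we pick the one with the largest probability (i.e. the largest value) as the prediction.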
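 #
 # A small sketch of the function applied to the example scores $[3, 1, -3]$. Subtracting the maximum before exponentiating is a common stabilisation trick added here as an assumption of the sketch; it does not change the result, because the common factor cancels in the ratio.

```python
import math


def softmax(z):
    """Softmax of a list of scores, with the maximum subtracted for stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([3.0, 1.0, -3.0])
print([round(p, 3) for p in probs])
print(sum(probs))
```

 # The largest score receives most of the probability mass, and the outputs keep the same ordering as the inputs.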
 #
 #
 #
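 # First consider the output of a neuron, shown below:
 #
 # ![softmax_neuron](images/softmax_neuron.png)
 #
 # Let the output of the neuron be:
 #
 # $$
 # z_i = \sum_{j} w_{ij} x_{j} + b
 # $$
 #
 # where $w_{ij}$ is the $j$-th weight of the $i$-th neuron and $b$ is the bias. $z_i$ is the $i$-th output of the network.
 #
 # Applying the softmax function to this output gives:
 #
 # $$
 # a_i = \frac{e^{z_i}}{\sum_k e^{z_k}}
 # $$
 #
 # $a_i$ is the $i$-th output of softmax; the right-hand side is the softmax function.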
 #
 #
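 # ### Loss function
 #
 # Backpropagation needs a loss function, which measures the error between the true value and the network's estimate. Only when the error is known can we tell how to modify the weights of the network.
 #
 # The loss function can take many forms. Here we use cross entropy, mainly because its derivative is simple and easy to compute, and because cross entropy avoids the slow learning that some loss functions suffer from. The **[cross-entropy function](https://blog.csdn.net/u014313009/article/details/51043064)** is:
 #
 # $$
 # C = - \sum_i y_i \ln a_i
 # $$
 #
 # where $y_i$ is the true classification result.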
 #
 #
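 # ## Derivation
 #
 # First, be clear about what we want: the gradient of the loss with respect to the neuron output $z_i$, i.e.
 #
 # $$
 # \frac{\partial C}{\partial z_i}
 # $$
 #
 # By the chain rule for composite functions:
 #
 # $$
 # \frac{\partial C}{\partial z_i} = \sum_j \frac{\partial C}{\partial a_j} \frac{\partial a_j}{\partial z_i}
 # $$
 #
 # Why $a_j$ rather than $a_i$? Look at the softmax formula: its denominator contains the outputs of all neurons, so the outputs $a_j$ with $j \ne i$ also depend on $z_i$. All $a_j$ must therefore be included, which is why the chain rule sums over $j$. As shown below, the derivative has to be computed separately for $i = j$ and $i \ne j$.
 #
 # ### Partial derivative with respect to $a_j$
 #
 # $$
 # \frac{\partial C}{\partial a_j} = \frac{\partial (-\sum_k y_k \ln a_k)}{\partial a_j} = - \frac{y_j}{a_j}
 # $$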
 #
 # ### Partial derivative with respect to $z_i$
 #
 # If $i=j$ :
 #
 # \begin{eqnarray}
 # \frac{\partial a_i}{\partial z_i} & = & \frac{\partial (\frac{e^{z_i}}{\sum_k e^{z_k}})}{\partial z_i} \\
 #   & = & \frac{\sum_k e^{z_k} e^{z_i} - (e^{z_i})^2}{(\sum_k e^{z_k})^2} \\
 #   & = & (\frac{e^{z_i}}{\sum_k e^{z_k}} ) (1 - \frac{e^{z_i}}{\sum_k e^{z_k}} ) \\
 #   & = & a_i (1 - a_i)
 # \end{eqnarray}
 #
 # If $i \ne j$:
 # \begin{eqnarray}
 # \frac{\partial a_j}{\partial z_i} & = & \frac{\partial (\frac{e^{z_j}}{\sum_k e^{z_k}})}{\partial z_i} \\
 #   & = &  \frac{0 \cdot \sum_k e^{z_k} - e^{z_j} \cdot e^{z_i} }{(\sum_k e^{z_k})^2} \\
 #   & = & - \frac{e^{z_j}}{\sum_k e^{z_k}} \cdot \frac{e^{z_i}}{\sum_k e^{z_k}} \\
 #   & = & -a_j a_i
 # \end{eqnarray}
 #
 # Both cases use the quotient rule, which holds when $u$ and $v$ are both functions of the variable:
 # $$
 # (\frac{u}{v})' = \frac{u'v - uv'}{v^2}
 # $$
 #
 # ### Putting it together
 #
 # \begin{eqnarray}
 # \frac{\partial C}{\partial z_i} & = & \sum_j (- \frac{y_j}{a_j} ) \frac{\partial a_j}{\partial z_i} \\
 #   & = & - \frac{y_i}{a_i} a_i ( 1 - a_i) + \sum_{j \ne i} \frac{y_j}{a_j} a_i a_j \\
 #   & = & -y_i + y_i a_i + \sum_{j \ne i} y_j a_i \\
 #   & = & -y_i + a_i \sum_{j} y_j \\
 #   & = & a_i - y_i
 # \end{eqnarray}
 # The last step uses $\sum_j y_j = 1$, which holds when $y$ is a one-hot label or any other probability distribution.
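 #
 # When the targets sum to 1, the result of the derivation reduces to $\frac{\partial C}{\partial z_i} = a_i - y_i$. The sketch below checks this against a central finite difference of the cross-entropy loss. The scores, the one-hot target and the helper names are assumptions made for the illustration.

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]


def cross_entropy(z, y):
    """C = -sum_i y_i ln a_i with a = softmax(z)."""
    a = softmax(z)
    return -sum(yi * math.log(ai) for yi, ai in zip(y, a))


def analytic_grad(z, y):
    """dC/dz_i = a_i - y_i, valid when the targets y sum to 1."""
    a = softmax(z)
    return [ai - yi for ai, yi in zip(a, y)]


def numeric_grad(z, y, h=1e-6):
    """Central finite difference of C with respect to each z_i."""
    grad = []
    for i in range(len(z)):
        zp = list(z)
        zm = list(z)
        zp[i] += h
        zm[i] -= h
        grad.append((cross_entropy(zp, y) - cross_entropy(zm, y)) / (2 * h))
    return grad


z = [3.0, 1.0, -3.0]   # hypothetical scores
y = [0.0, 1.0, 0.0]    # hypothetical one-hot target
g_analytic = analytic_grad(z, y)
g_numeric = numeric_grad(z, y)
print(g_analytic)
print(g_numeric)
```

 # In a backpropagation loop, this gradient plays the role of the error term of the output layer (up to the sign convention used for $\delta$), which is one way to approach the question posed at the end of this section.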
 # ## Question
 # How can the softmax and cross-entropy cost function from this section be applied to the BP method from the previous section?
 # ## References
 #
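 # * Softmax & cross entropy
 #   * [Cross-entropy cost function (purpose and derivation)](https://blog.csdn.net/u014313009/article/details/51043064)
 #   * [A hand-worked example for understanding the softmax function and its derivative step by step](https://www.jianshu.com/p/ffa51250ba2e)
 #   * [An easy-to-follow derivation of the softmax cross-entropy loss gradient](https://www.jianshu.com/p/c02a1fbffad6)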
--- a/6_pytorch/0_basic/autograd.py
+++ b/6_pytorch/0_basic/autograd.py
@@ -1,220 +0,0 @@
 # -*- coding: utf-8 -*-
 # ---
 # jupyter:
 #   jupytext_format_version: '1.2'
 #   kernelspec:
 #     display_name: Python 3
 #     language: python
 #     name: python3
 #   language_info:
 #     codemirror_mode:
 #       name: ipython
 #       version: 3
 #     file_extension: .py
 #     mimetype: text/x-python
 #     name: python
 #     nbconvert_exporter: python
 #     pygments_lexer: ipython3
 #     version: 3.5.2
 # ---
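 # # Automatic differentiation
 # In this lesson we look at the automatic differentiation (autograd) mechanism in PyTorch. Autograd is a very important PyTorch feature: it spares us from deriving complicated derivatives by hand, which greatly reduces the time needed to build a model. Its predecessor, the Torch framework, did not have this feature. The examples below show how autograd works and explore more of its uses.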
 import torch
 from torch.autograd import Variable
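 # ## Autograd in the simple case
 # We start with some simple cases of autograd. "Simple" means that the result of the computation is a scalar, i.e. a single number, and we differentiate that scalar automatically.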
 x = Variable(torch.Tensor([2]), requires_grad=True)
 y = x + 2
 z = y ** 2 + 3
 print(z)
 # Through the operations above we obtained the final result z from x, which can be written as
 #
 # $$
 # z = (x + 2)^2 + 3
 # $$
 #
 # The derivative of z with respect to x is then
 #
 # $$
 # \frac{\partial z}{\partial x} = 2 (x + 2) = 2 (2 + 2) = 8
 # $$
 # If you are not familiar with derivatives, you can [review them here](https://baike.baidu.com/item/%E5%AF%BC%E6%95%B0#1)
 # Use autograd
 z.backward()
 print(x.grad)
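 # This simple example verifies autograd and shows how convenient it is. For a more complex expression, differentiating by hand becomes tedious, so autograd saves us the laborious mathematics. Let us look at a more complex example.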
 # +
 x = Variable(torch.randn(10, 20), requires_grad=True)
 y = Variable(torch.randn(10, 5), requires_grad=True)
 w = Variable(torch.randn(20, 5), requires_grad=True)
 out = torch.mean(y - torch.matmul(x, w)) # torch.matmul performs matrix multiplication
 out.backward()
 # -
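 # If you are not familiar with matrix multiplication, you can [review it here](https://baike.baidu.com/item/%E7%9F%A9%E9%98%B5%E4%B9%98%E6%B3%95/5446029?fr=aladdin)
 # gradient of x
 print(x.grad)
 # gradient of y
 print(y.grad)
 # gradient of w
 print(w.grad)
 # The formula here is more complicated: the matrix product of x and w is subtracted element-wise from y, and all elements of the result are averaged. Interested readers can work out the gradients by hand. With PyTorch's autograd we obtain the derivatives with respect to x, y and w very easily. Deep learning is full of matrix operations, so deriving these derivatives by hand is impractical; autograd makes updating the network straightforward.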
 #
 #
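 # ## Autograd in the more complex case
 # The examples above differentiate scalars. You may wonder how to differentiate a vector or a matrix automatically. Interested readers can try it themselves first; below we introduce autograd for multi-dimensional arrays.
 m = Variable(torch.FloatTensor([[2, 3]]), requires_grad=True) # build a 1 x 2 matrix
 n = Variable(torch.zeros(1, 2)) # build a zero matrix of the same size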
 print(m)
 print(n)
 # compute the new values of n from the values in m
 n[0, 0] = m[0, 0] ** 2
 n[0, 1] = m[0, 1] ** 3
 print(n)
 # Written as a formula, the above is
 # $$
 # n = (n_0,\ n_1) = (m_0^2,\ m_1^3) = (2^2,\ 3^3)
 # $$
 # Next we backpropagate through n directly, i.e. compute the derivative of n with respect to m.
 #
 # We first need to be clear about how this derivative is defined, i.e. how to define
 #
 # $$
 # \frac{\partial n}{\partial m} = \frac{\partial (n_0,\ n_1)}{\partial (m_0,\ m_1)}
 # $$
 #
 # In PyTorch, calling autograd on a non-scalar requires passing an argument to `backward()` that has the same shape as n, say $(w_0,\ w_1)$. The result of autograd is then:
 # $$
 # \frac{\partial n}{\partial m_0} = w_0 \frac{\partial n_0}{\partial m_0} + w_1 \frac{\partial n_1}{\partial m_0}
 # $$
 # $$
 # \frac{\partial n}{\partial m_1} = w_0 \frac{\partial n_0}{\partial m_1} + w_1 \frac{\partial n_1}{\partial m_1}
 # $$
 n.backward(torch.ones_like(n)) # take (w0, w1) = (1, 1)
 print(m.grad)
 # Autograd gives the gradients 4 and 27. We can check this by hand:
 # $$
 # \frac{\partial n}{\partial m_0} = w_0 \frac{\partial n_0}{\partial m_0} + w_1 \frac{\partial n_1}{\partial m_0} = 2 m_0 + 0 = 2 \times 2 = 4
 # $$
 # $$
 # \frac{\partial n}{\partial m_1} = w_0 \frac{\partial n_0}{\partial m_1} + w_1 \frac{\partial n_1}{\partial m_1} = 0 + 3 m_1^2 = 3 \times 3^2 = 27
 # $$
 # The manual check gives the same result.
 #
 #
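 # ## Calling backward multiple times
 # Calling backward performs one pass of automatic differentiation. If we call backward a second time, the program raises an error. This is because by default PyTorch discards the computation graph after one backward pass, so differentiating twice requires setting an option manually, as the small example below shows.
 x = Variable(torch.FloatTensor([3]), requires_grad=True)
 y = x * 2 + x ** 2 + 3
 print(y)
 y.backward(retain_graph=True) # set retain_graph to True to keep the computation graph
 print(x.grad)
 y.backward() # differentiate once more; this time the graph is not kept
 print(x.grad)
 # The gradient of x has become 16: backward was called twice, so the first gradient 8 and the second gradient 8 were added together, giving 16.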
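 #
 # As a worked check of these numbers, based only on the code shown, where $y = 2x + x^2 + 3$ and $x = 3$: a single backward pass gives
 # $$
 # \frac{\partial y}{\partial x} = 2 + 2x = 2 + 2 \times 3 = 8
 # $$
 # and because gradients accumulate in `x.grad` across calls instead of being overwritten, the value stored after the second pass is
 # $$
 # 8 + 8 = 16
 # $$
 # To obtain 8 again on the second call, the stored gradient would have to be reset to zero between the two calls.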
 #
 #
 # **Exercise**
 #
 # Define
 #
 # $$
 # x = 
 # \left[
 # \begin{matrix}
 # x_0 \\
 # x_1
 # \end{matrix}
 # \right] = 
 # \left[
 # \begin{matrix}
 # 2 \\
 # 3
 # \end{matrix}
 # \right]
 # $$
 #
 # $$
 # k = (k_0,\ k_1) = (x_0^2 + 3 x_1,\ 2 x_0 + x_1^2)
 # $$
 #
 # We want to compute
 #
 # $$
 # j = \left[
 # \begin{matrix}
 # \frac{\partial k_0}{\partial x_0} & \frac{\partial k_0}{\partial x_1} \\
 # \frac{\partial k_1}{\partial x_0} & \frac{\partial k_1}{\partial x_1}
 # \end{matrix}
 # \right]
 # $$
 #
 # Reference answer:
 #
 # $$
 # \left[
 # \begin{matrix}
 # 4 & 3 \\
 # 2 & 6 \\
 # \end{matrix}
 # \right]
 # $$
 # +
 x = Variable(torch.FloatTensor([2, 3]), requires_grad=True)
 k = Variable(torch.zeros(2))
 k[0] = x[0] ** 2 + 3 * x[1]
 k[1] = x[1] ** 2 + 2 * x[0]
 # -
 print(k)
 # +
 j = torch.zeros(2, 2)
 k.backward(torch.FloatTensor([1, 0]), retain_graph=True)
 j[0] = x.grad.data
 x.grad.data.zero_() # reset the gradient computed so far to zero
 k.backward(torch.FloatTensor([0, 1]))
 j[1] = x.grad.data
 # -
 print(j)
 # In the next lesson we introduce two ways of programming neural networks: dynamic-graph and static-graph programming
--- a/6_pytorch/1_NN/deep-nn.py
+++ b/6_pytorch/1_NN/deep-nn.py
@@ -1,233 +0,0 @@
 # -*- coding: utf-8 -*-
 # ---
 # jupyter:
 #   jupytext_format_version: '1.2'
 #   kernelspec:
 #     display_name: Python 3
 #     language: python
 #     name: python3
 #   language_info:
 #     codemirror_mode:
 #       name: ipython
 #       version: 3
 #     file_extension: .py
 #     mimetype: text/x-python
 #     name: python
 #     nbconvert_exporter: python
 #     pygments_lexer: ipython3
 #     version: 3.5.2
 # ---
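 # # Deep neural networks
 # The previous chapter briefly introduced the basics of neural networks and showed how to build a complex nonlinear binary classifier with one. Neural networks are more often used in more complex settings such as image classification. Below we use MNIST handwritten digit classification, the introductory dataset of deep learning, to illustrate how well deeper networks perform.
 #
 # ## The MNIST dataset
 # MNIST is a very well-known dataset that many networks use as a standard benchmark. It comes from the National Institute of Standards and Technology (NIST). The training set consists of digits handwritten by 250 different people, 50% of them high school students and 50% employees of the Census Bureau, 60000 images in total. The test set contains handwritten digits in the same proportion, 10000 images in total.
 #
 # Each image is a 28 x 28 grayscale image, as shown below
 #
 # ![](https://ws3.sinaimg.cn/large/006tKfTcly1fmlx2wl5tqj30ge0au745.jpg)
 #
 # Our task is: given an image, decide which of the 10 digits 0 to 9 it shows.
 #
 # ## Multi-class classification
 # Earlier we covered binary classification. The problem here is more complex: it is a 10-class problem, and such problems are collectively called multi-class classification. For multi-class problems we use a more general loss function, called cross entropy.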
 #
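 # ### softmax
 # Before cross entropy, let us look at the softmax function. We have already seen the sigmoid function:
 #
 # $$s(x) = \frac{1}{1 + e^{-x}}$$
 #
 # It maps any value into the range 0 to 1. For a binary classification problem this is enough: if a sample does not belong to the first class it must belong to the second, so a single value giving the probability of one class suffices. For multi-class problems this does not work, because we need the probability of every class, and this is where softmax comes in.
 #
 # An example of the softmax function:
 #
 # ![](https://ws4.sinaimg.cn/large/006tKfTcly1fmlxtnfm4fj30ll0bnq3c.jpg)
 #
 # For the network outputs $z_1, z_2, \cdots, z_k$, we first exponentiate each of them to get $e^{z_1}, e^{z_2}, \cdots, e^{z_k}$, and then divide each term by their sum, i.e.
 #
 # $$
 # z_i \rightarrow \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}
 # $$
 #
 # The outputs of softmax sum to 1, so each term can be read as the probability of belonging to one particular class.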
 #
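 # ## Cross Entropy
 # Cross entropy measures how similar two distributions are. The loss function used earlier for binary classification is a special case of it. The general formula for cross entropy is
 #
 # $$
 # cross\_entropy(p, q) = E_{p}[-\log q] = - \sum_{x} p(x) \log q(x)
 # $$
 #
 # For a binary classification problem, averaged over $m$ samples, it can be written as
 #
 # $$
 # -\frac{1}{m} \sum_{i=1}^m \left( y^{i} \log sigmoid(x^{i}) + (1 - y^{i}) \log (1 - sigmoid(x^{i})) \right)
 # $$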
 #
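 # This is the binary classification loss from before. At the time we only stated the formula and argued that it was reasonable; now we can see that it follows from the general definition of cross entropy.
 #
 # Cross entropy comes from information theory and we will not go into more detail here. For more, see this [link](http://blog.csdn.net/rtygbwwwerr/article/details/50778098)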
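 #
 # As a small worked example of the general formula, the sketch below evaluates the cross entropy between a one-hot label and two predicted distributions. The function name `cross_entropy` and all numbers are assumptions made for illustration.
 #
```python
import math

def cross_entropy(p, q):
    """Cross entropy -sum_x p(x) * log q(x) between two discrete distributions."""
    # Terms with p(x) == 0 contribute nothing, so they are skipped.
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

label = [0.0, 0.0, 1.0]            # one-hot: the true class is index 2
good = [0.05, 0.05, 0.90]          # confident and correct prediction
bad = [0.80, 0.15, 0.05]           # confident and wrong prediction
print(cross_entropy(label, good))  # small loss
print(cross_entropy(label, bad))   # large loss
```
 #
 # With a one-hot label the sum collapses to a single term, the negative log of the probability assigned to the true class.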
 #
 # Next we use mnist directly as an example to introduce deep neural networks
 # +
 import numpy as np
 import torch
 from torchvision.datasets import mnist # import the mnist dataset bundled with torchvision
 from torch import nn
 from torch.autograd import Variable
 # -
 # Download the mnist dataset with the built-in helper
 train_set = mnist.MNIST('../../data/mnist', train=True, download=True)
 test_set  = mnist.MNIST('../../data/mnist', train=False, download=True)
 # Let us look at what one sample looks like
 a_data, a_label = train_set[0]
 a_data
 a_label
 # The data is loaded in the PIL image format, which is easy to convert to a numpy array
 a_data = np.array(a_data, dtype='float32')
 print(a_data.shape)
 # Here we can see that the image size is 28 x 28
 print(a_data)
 # We can print the array; in it, 0 represents black and 255 represents white
 #
 # The first layer of our network takes 28 x 28 = 784 inputs, so we need to transform the data, using reshape to flatten each image into a one-dimensional vector
 # +
 def data_tf(x):
    x = np.array(x, dtype='float32') / 255
    x = (x - 0.5) / 0.5 # normalize to [-1, 1]; this trick is covered later
    x = x.reshape((-1,)) # flatten
    x = torch.from_numpy(x)
    return x
 train_set = mnist.MNIST('../../data/mnist', train=True, transform=data_tf, download=True) # reload the dataset with the transform defined above
 test_set = mnist.MNIST('../../data/mnist', train=False, transform=data_tf, download=True)
 # -
 a, a_label = train_set[0]
 print(a.shape)
 print(a_label)
 from torch.utils.data import DataLoader
 # Define a data iterator with PyTorch's built-in DataLoader
 train_data = DataLoader(train_set, batch_size=64, shuffle=True)
 test_data = DataLoader(test_set, batch_size=128, shuffle=False)
 # Such an iterator is very useful: when the dataset is too large to load into memory all at once, a Python iterator lets us produce one batch at a time
 a, a_label = next(iter(train_data))
 # Print the size of one batch
 print(a.shape)
 print(a_label.shape)
 # Define a 4-layer neural network with Sequential
 net = nn.Sequential(
    nn.Linear(784, 400),
    nn.ReLU(),
    nn.Linear(400, 200),
    nn.ReLU(),
    nn.Linear(200, 100),
    nn.ReLU(),
    nn.Linear(100, 10)
 )
 net
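 # Cross entropy is built into PyTorch. A naive implementation of cross entropy is numerically unstable, and the built-in function takes care of this for us. It also applies log-softmax internally, which is why the network above ends with a plain linear layer
 # Define the loss function
 criterion = nn.CrossEntropyLoss()
 optimizer = torch.optim.SGD(net.parameters(), 1e-1) # stochastic gradient descent with learning rate 0.1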
 # + {"scrolled": true}
 # Start training
 losses = []
 acces = []
 eval_losses = []
 eval_acces = []
 for e in range(20):
    train_loss = 0
    train_acc = 0
    net.train()
    for im, label in train_data:
        im = Variable(im)
        label = Variable(label)
        # forward pass
        out = net(im)
        loss = criterion(out, label)
        # backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # record the loss
        train_loss += loss.item()
        # compute the classification accuracy
        _, pred = out.max(1)
        num_correct = (pred == label).sum().item()
        acc = num_correct / im.shape[0]
        train_acc += acc
    losses.append(train_loss / len(train_data))
    acces.append(train_acc / len(train_data))
    # evaluate on the test set
    eval_loss = 0
    eval_acc = 0
    net.eval() # switch the model to evaluation mode
    for im, label in test_data:
        im = Variable(im)
        label = Variable(label)
        out = net(im)
        loss = criterion(out, label)
        # record the loss
        eval_loss += loss.item()
        # record the accuracy
        _, pred = out.max(1)
        num_correct = (pred == label).sum().item()
        acc = num_correct / im.shape[0]
        eval_acc += acc
    eval_losses.append(eval_loss / len(test_data))
    eval_acces.append(eval_acc / len(test_data))
    print('epoch: {}, Train Loss: {:.6f}, Train Acc: {:.6f}, Eval Loss: {:.6f}, Eval Acc: {:.6f}'
          .format(e, train_loss / len(train_data), train_acc / len(train_data),
                     eval_loss / len(test_data), eval_acc / len(test_data)))
 # -
 # Plot the loss curves and the accuracy curves, one figure per curve
 import matplotlib.pyplot as plt
 # %matplotlib inline
 plt.figure()
 plt.title('train loss')
 plt.plot(np.arange(len(losses)), losses)
 plt.figure()
 plt.title('train acc')
 plt.plot(np.arange(len(acces)), acces)
 plt.figure()
 plt.title('test loss')
 plt.plot(np.arange(len(eval_losses)), eval_losses)
 plt.figure()
 plt.title('test acc')
 plt.plot(np.arange(len(eval_acces)), eval_acces)
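 # Our four-layer network reaches 99.9% accuracy on the training set and 98.20% accuracy on the test set
 # **Exercise: read through the training loop above and work out how the accuracy is computed, paying particular attention to the max function**
 #
 # **Implement a new network yourself, try changing the number of hidden layers and the activation function, and see what happens**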
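 #
 # As a starting point for the first exercise: `out.max(1)` returns, for every row of scores, the largest value together with its index, and the index is taken as the predicted class. The plain-Python sketch below mimics that step on a tiny batch; `batch_accuracy` and the numbers are hypothetical and only illustrate the idea.
 #
```python
def batch_accuracy(scores, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the largest score, i.e. the predicted class.
        pred = max(range(len(row)), key=lambda k: row[k])
        correct += int(pred == label)
    return correct / len(labels)

scores = [[0.1, 2.0, -1.0],   # predicts class 1
          [3.0, 0.2, 0.5],    # predicts class 0
          [0.0, 0.1, 0.9]]    # predicts class 2
labels = [1, 2, 2]
print(batch_accuracy(scores, labels))   # two of the three rows are correct
```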
--- a/6_pytorch/1_NN/linear-regression-gradient-descend.py
+++ b/6_pytorch/1_NN/linear-regression-gradient-descend.py
@@ -1,355 +0,0 @@
 # -*- coding: utf-8 -*-
 # ---
 # jupyter:
 #   jupytext_format_version: '1.2'
 #   kernelspec:
 #     display_name: Python 3
 #     language: python
 #     name: python3
 #   language_info:
 #     codemirror_mode:
 #       name: ipython
 #       version: 3
 #     file_extension: .py
 #     mimetype: text/x-python
 #     name: python
 #     nbconvert_exporter: python
 #     pygments_lexer: ipython3
 #     version: 3.5.2
 # ---
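 # # Linear Models and Gradient Descent
 # This is the first lesson on neural networks. We study a very simple model, linear regression, together with an optimization algorithm, gradient descent, which we use to fit the model. Linear regression is one of the simplest models in supervised learning, and gradient descent is the most widely used optimization algorithm in deep learning, so this is where our deep learning journey begins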
 #
 #
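 # ## Univariate Linear Regression
 # The univariate linear model is very simple. Suppose we have variables $x_i$ and targets $y_i$, where each i corresponds to one data point, and we want to build a model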
 #
 # $$
 # \hat{y}_i = w x_i + b
 # $$
 #
 # $\hat{y}_i$ is our prediction, and we want $\hat{y}_i$ to fit the target $y_i$. Put simply, we look for the function that fits $y_i$ with the smallest error, that is, we minimize
 #
 # $$
 # \frac{1}{n} \sum_{i=1}^n(\hat{y}_i - y_i)^2
 # $$
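 # So how do we minimize this error?
 #
 # This is where **gradient descent** comes in. It is the first optimization algorithm we meet; it is very simple yet very powerful and is used heavily in deep learning, so let us start from a simple example to understand how it works
 # ## Gradient Descent
 # To understand gradient descent we first need to be clear about what a gradient is, and then see how to use it to descend.
 # ### Gradient
 # For a function of one variable, the gradient is simply the derivative. For a function of several variables, the gradient is the vector of partial derivatives. For example, for a function f(x, y), the gradient of f is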
 #
 # $$
 # (\frac{\partial f}{\partial x},\ \frac{\partial f}{\partial y})
 # $$
 #
 # It is written grad f(x, y) or $\nabla f(x, y)$. The gradient at a specific point $(x_0,\ y_0)$ is $\nabla f(x_0,\ y_0)$.
 #
 # The figure below shows the gradient of the function $f(x) = x^2$ at x=1
 #
 # ![](https://ws3.sinaimg.cn/large/006tNc79ly1fmarbuh2j3j30ba0b80sy.jpg)
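 # What does the gradient mean? Geometrically, the gradient at a point is the direction in which the function changes fastest. Specifically, for a function f(x, y) at the point $(x_0, y_0)$, the function increases fastest along the direction of the gradient $\nabla f(x_0,\ y_0)$. In other words, following the gradient moves us toward a maximum of the function most quickly, and following the opposite direction moves us toward a minimum most quickly.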
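 #
 # To make this concrete, the sketch below approximates a gradient with central differences and checks that a small step along it increases the function while a step against it decreases the function. The function `f`, the helper `numerical_gradient`, and the step size are assumptions chosen for illustration.
 #
```python
def numerical_gradient(f, x, y, h=1e-5):
    """Approximate (df/dx, df/dy) at (x, y) with central differences."""
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dfdx, dfdy

def f(x, y):
    return x ** 2 + 3 * y   # exact gradient: (2x, 3)

gx, gy = numerical_gradient(f, 1.0, 2.0)
print(gx, gy)   # close to 2 and 3

# A small step along the gradient increases f; a step against it decreases f.
step = 0.1
print(f(1.0 + step * gx, 2.0 + step * gy) > f(1.0, 2.0))
print(f(1.0 - step * gx, 2.0 - step * gy) < f(1.0, 2.0))
```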
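 # ### The Gradient Descent Method
 # With the gradient understood, we can see how gradient descent works. We want to minimize the error above, which means finding the point where the error is smallest, and we can get there by moving against the gradient.
 #
 # Here is an intuitive picture. Suppose we are somewhere on a big mountain and do not know the way down, so we decide to take it one step at a time: at each position we compute the gradient at that position and take one step along the negative gradient, the steepest way down; then we compute the gradient at the new position and again take a step in the steepest downhill direction. We keep going like this until we believe we have reached the foot of the mountain. Of course, this procedure may not reach the foot of the mountain at all and may instead end up in some local valley.
 #
 # In terms of our problem, we keep changing w and b along the direction opposite to the gradient until we find a pair of w and b that makes the error smallest.
 #
 # When updating, we need to decide how large each update should be. In the mountain example this is the length of each downhill step. This length is called the learning rate, written $\eta$. The learning rate is very important and different values lead to different results: a learning rate that is too small makes the descent very slow, while one that is too large causes large oscillations, as the example below shows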
 #
 # ![](https://ws2.sinaimg.cn/large/006tNc79ly1fmgn23lnzjg30980gogso.gif)
 #
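 # In the top panel the learning rate is reasonable, while in the bottom panel it is too large, which causes the iterates to keep jumping around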
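 #
 # The same effect can be reproduced numerically. The sketch below runs gradient descent on the toy function $f(x) = x^2$ (an assumption chosen because its derivative $2x$ is known exactly) with three hypothetical learning rates.
 #
```python
def descend(lr, steps=20, x=1.0):
    """Run gradient descent on f(x) = x^2, whose derivative is 2x."""
    for _ in range(steps):
        x = x - lr * 2 * x
    return x

print(abs(descend(0.01)))  # too small: still far from the minimum at 0
print(abs(descend(0.3)))   # reasonable: very close to 0
print(abs(descend(1.1)))   # too large: each step overshoots and |x| grows
```
 #
 # Each step multiplies x by (1 - 2 * lr), so the iterates shrink toward 0 only when that factor has magnitude below 1.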
 #
 # Finally, our update rule is
 #
 # $$
 # w := w - \eta \frac{\partial f(w,\ b)}{\partial w} \\
 # b := b - \eta \frac{\partial f(w,\ b)}{\partial b}
 # $$
 #
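 # By repeating these updates, we eventually arrive at a good pair of w and b. This is the principle of gradient descent.
 #
 # Finally, the following figure illustrates the method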
 #
 # ![](https://ws3.sinaimg.cn/large/006tNc79ly1fmarxsltfqj30gx091gn4.jpg)
 #
 #
 # That covers the theory. Next we work through an example to learn more about the linear model
 # +
 import torch
 import numpy as np
 from torch.autograd import Variable
 torch.manual_seed(2017)
 # +
 # Load the data x and y
 x_train = np.array([[3.3], [4.4], [5.5], [6.71], [6.93], [4.168],
                    [9.779], [6.182], [7.59], [2.167], [7.042],
                    [10.791], [5.313], [7.997], [3.1]], dtype=np.float32)
 y_train = np.array([[1.7], [2.76], [2.09], [3.19], [1.694], [1.573],
                    [3.366], [2.596], [2.53], [1.221], [2.827],
                    [3.465], [1.65], [2.904], [1.3]], dtype=np.float32)
 # +
 # Plot the data
 import matplotlib.pyplot as plt
 # %matplotlib inline
 plt.plot(x_train, y_train, 'bo')
 # +
 # Convert to Tensor
 x_train = torch.from_numpy(x_train)
 y_train = torch.from_numpy(y_train)
 # Define the parameters w and b
 w = Variable(torch.randn(1), requires_grad=True) # random initialization
 b = Variable(torch.zeros(1), requires_grad=True) # initialize with 0
 # +
 # Build the linear regression model
 x_train = Variable(x_train)
 y_train = Variable(y_train)
 def linear_model(x):
    return x * w + b
 # -
 y_ = linear_model(x_train)
 # With the steps above the model is defined. Before updating any parameters, let us look at what the model outputs
 plt.plot(x_train.data.numpy(), y_train.data.numpy(), 'bo', label='real')
 plt.plot(x_train.data.numpy(), y_.data.numpy(), 'ro', label='estimated')
 plt.legend()
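 # **Question: the red points are the predictions and appear to form a straight line. Do these points really lie on one line?**
 # Now we need to compute the error function, that is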
 #
 # $$
 # \frac{1}{n} \sum_{i=1}^n(\hat{y}_i - y_i)^2
 # $$
 # +
 # Compute the error
 def get_loss(y_, y):
    return torch.mean((y_ - y) ** 2)
 loss = get_loss(y_, y_train)
 # -
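 # Print the loss to see how large it is
 print(loss)
 # With the error function defined, we next need the gradients of w and b. Thanks to PyTorch's automatic differentiation we do not have to derive them by hand. Interested readers can work them out manually; writing the error as $f(w,\ b)$, the gradients with respect to w and b are
 #
 # $$
 # \frac{\partial f}{\partial w} = \frac{2}{n} \sum_{i=1}^n x_i(w x_i + b - y_i) \\
 # \frac{\partial f}{\partial b} = \frac{2}{n} \sum_{i=1}^n (w x_i + b - y_i)
 # $$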
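 #
 # These two formulas can be checked without PyTorch. The sketch below evaluates them on a tiny made-up dataset and compares them with finite-difference estimates; the data, the starting values of w and b, and the helper names are all assumptions for illustration.
 #
```python
def mse(w, b, xs, ys):
    """Mean squared error of the line w * x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_gradients(w, b, xs, ys):
    """Analytic gradients of the mean squared error with respect to w and b."""
    n = len(xs)
    dw = 2.0 / n * sum(x * (w * x + b - y) for x, y in zip(xs, ys))
    db = 2.0 / n * sum(w * x + b - y for x, y in zip(xs, ys))
    return dw, db

xs = [1.0, 2.0, 3.0]          # hypothetical toy data, not the notebook's data
ys = [2.0, 4.1, 5.9]
w, b, h = 0.5, 0.0, 1e-6
dw, db = mse_gradients(w, b, xs, ys)

# Finite-difference estimates for comparison.
dw_num = (mse(w + h, b, xs, ys) - mse(w - h, b, xs, ys)) / (2 * h)
db_num = (mse(w, b + h, xs, ys) - mse(w, b - h, xs, ys)) / (2 * h)
print(dw, dw_num)
print(db, db_num)
```
 #
 # Both gradients are negative here, so a gradient descent step increases w and b, which moves the line toward the data.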
 # Automatic differentiation
 loss.backward()
 # Inspect the gradients of w and b
 print(w.grad)
 print(b.grad)
 # Update the parameters once
 w.data = w.data - 1e-2 * w.grad.data
 b.data = b.data - 1e-2 * b.grad.data
 # After the update, look at the model output again
 y_ = linear_model(x_train)
 plt.plot(x_train.data.numpy(), y_train.data.numpy(), 'bo', label='real')
 plt.plot(x_train.data.numpy(), y_.data.numpy(), 'ro', label='estimated')
 plt.legend()
 # As the plot shows, after one update the red points have moved below the blue points and still do not fit the blue ground-truth values well, so we need a few more updates
 for e in range(10): # perform 10 updates
    y_ = linear_model(x_train)
    loss = get_loss(y_, y_train)
    w.grad.zero_() # remember to zero the gradient
    b.grad.zero_() # remember to zero the gradient
    loss.backward()
    w.data = w.data - 1e-2 * w.grad.data # update w
    b.data = b.data - 1e-2 * b.grad.data # update b
    print('epoch: {}, loss: {}'.format(e, loss.item()))
 y_ = linear_model(x_train)
 plt.plot(x_train.data.numpy(), y_train.data.numpy(), 'bo', label='real')
 plt.plot(x_train.data.numpy(), y_.data.numpy(), 'ro', label='estimated')
 plt.legend()
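 # After 10 updates, the red predictions fit the blue ground-truth values reasonably well.
 #
 # You have now learned your first machine learning model. Keep it up and complete the exercise below.
 # **Exercise:**
 #
 # Restart the notebook and run the linear regression model above again, but try different numbers of training steps and different learning rates and compare the results
 # ## Polynomial Regression
 # Now we go one step further and look at polynomial regression. What is polynomial regression? It is very simple. Recall the linear regression model above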
 #
 # $$
 # \hat{y} = w x + b
 # $$
 #
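 # This is a first-degree polynomial in x. The model is fairly simple and cannot fit more complicated relationships, so we can use a higher-degree model, for example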
 #
 # $$
 # \hat{y} = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + \cdots
 # $$
 #
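 # This can fit more complicated relationships and is called a polynomial model, since it uses higher powers of x. Multivariate regression has the same form, except that besides x it uses more variables, such as y and z. The loss function for these models is the same as for simple linear regression.
 #
 #
 # First we define a target function to fit, a cubic polynomial
 # +
 # Define the target polynomial
 w_target = np.array([0.5, 3, 2.4]) # define the parameters
 b_target = np.array([0.9]) # define the parameters
 f_des = 'y = {:.2f} + {:.2f} * x + {:.2f} * x^2 + {:.2f} * x^3'.format(
    b_target[0], w_target[0], w_target[1], w_target[2]) # build a description of the function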
 print(f_des)
 # -
 # We can first plot this polynomial
 # +
 # Plot the curve of this function
 x_sample = np.arange(-3, 3.1, 0.1)
 y_sample = b_target[0] + w_target[0] * x_sample + w_target[1] * x_sample ** 2 + w_target[2] * x_sample ** 3
 plt.plot(x_sample, y_sample, label='real curve')
 plt.legend()
 # -
 # Next we build the dataset, which needs x and y. Since the target is a cubic polynomial, we take $x,\ x^2, x^3$ as features
 # +
 # Build the data x and y
 # x is a matrix with columns [x, x^2, x^3]
 # y is the function value [y]
 x_train = np.stack([x_sample ** i for i in range(1, 4)], axis=1)
 x_train = torch.from_numpy(x_train).float() # convert to a float tensor
 y_train = torch.from_numpy(y_sample).float().unsqueeze(1) # convert to a float tensor
 # -
 # Next we define the parameters to optimize, namely the $w_i$ in the function above
 # +
 # Define the parameters and the model
 w = Variable(torch.randn(3, 1), requires_grad=True)
 b = Variable(torch.zeros(1), requires_grad=True)
 # Convert x and y to Variable
 x_train = Variable(x_train)
 y_train = Variable(y_train)
 def multi_linear(x):
    return torch.mm(x, w) + b
 # -
 # We can plot the model before any update against the true model
 # +
 # Plot the model before updating
 y_pred = multi_linear(x_train)
 plt.plot(x_train.data.numpy()[:, 0], y_pred.data.numpy(), label='fitting curve', color='r')
 plt.plot(x_train.data.numpy()[:, 0], y_sample, label='real curve', color='b')
 plt.legend()
 # -
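 # The two curves clearly differ, so let us compute the error between them
 # Compute the error; it is the same as for the univariate linear model, so we reuse get_loss defined earlier
 loss = get_loss(y_pred, y_train)
 print(loss)
 # Automatic differentiation
 loss.backward()
 # Inspect the gradients of w and b
 print(w.grad)
 print(b.grad)
 # Update the parameters
 w.data = w.data - 0.001 * w.grad.data
 b.data = b.data - 0.001 * b.grad.data
 # +
 # Plot the model after one update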
 y_pred = multi_linear(x_train)
 plt.plot(x_train.data.numpy()[:, 0], y_pred.data.numpy(), label='fitting curve', color='r')
 plt.plot(x_train.data.numpy()[:, 0], y_sample, label='real curve', color='b')
 plt.legend()
 # -
 # Since we updated only once, the two curves still differ, so we run 100 iterations
 # Perform 100 parameter updates
 for e in range(100):
    y_pred = multi_linear(x_train)
    loss = get_loss(y_pred, y_train)
    w.grad.data.zero_()
    b.grad.data.zero_()
    loss.backward()
    # update the parameters
    w.data = w.data - 0.001 * w.grad.data
    b.data = b.data - 0.001 * b.grad.data
    if (e + 1) % 20 == 0:
        print('epoch {}, Loss: {:.5f}'.format(e+1, loss.item()))
 # The loss is now very small after the updates. We plot the updated curve for comparison
 # +
 # Plot the result after updating
 y_pred = multi_linear(x_train)
 plt.plot(x_train.data.numpy()[:, 0], y_pred.data.numpy(), label='fitting curve', color='r')
 plt.plot(x_train.data.numpy()[:, 0], y_sample, label='real curve', color='b')
 plt.legend()
 # -
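 # After 100 updates, the fitted curve and the true curve essentially coincide
 # **Exercise: the example above is a cubic polynomial. Try fitting it with a quadratic polynomial and see how well you can do**
 #
 # **Hint: use the parameter `w = torch.randn(2, 1)` and rebuild the x dataset accordingly**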
--- a/6_pytorch/1_NN/logistic-regression.py
+++ b/6_pytorch/1_NN/logistic-regression.py
@@ -1,332 +0,0 @@
 # -*- coding: utf-8 -*-
 # ---
 # jupyter:
 #   jupytext_format_version: '1.2'
 #   kernelspec:
 #     display_name: Python 3
 #     language: python
 #     name: python3
 #   language_info:
 #     codemirror_mode:
 #       name: ipython
 #       version: 3
 #     file_extension: .py
 #     mimetype: text/x-python
 #     name: python
 #     nbconvert_exporter: python
 #     pygments_lexer: ipython3
 #     version: 3.5.2
 # ---
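 # # Logistic Regression
 # In the previous lesson we studied the simple linear regression model. In this lesson we study a second model, logistic regression.
 #
 # Logistic regression is a generalized linear regression model. It has much in common with multiple linear regression and the model has essentially the same form. Although it is called regression, it is mostly used for classification, most commonly binary classification.
 # ## Model Form
 # The logistic regression model has the same form as linear regression, y = wx + b, where x can be a multi-dimensional feature. The only difference is that logistic regression applies a logistic function to y, turning it into a probability. The logistic function, also known as the sigmoid function, is the core of logistic regression, so we discuss it next.
 # ### The Sigmoid Function
 # The sigmoid function is very simple. Its formula is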
 #
 # $$
 # f(x) = \frac{1}{1 + e^{-x}}
 # $$
 #
 # The graph of the sigmoid function is shown below
 #
 # ![](https://ws2.sinaimg.cn/large/006tKfTcly1fmd3dde091g30du060mx0.gif)
 #
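 # The sigmoid function takes values between 0 and 1, so any value passed through it becomes a value between 0 and 1. This value can be interpreted as a probability: in binary classification, the smaller the value, the more likely the sample belongs to the first class, and the larger the value, the more likely it belongs to the second class.
 # Logistic regression also assumes that your data is close to linearly separable, meaning that the dataset can be split into two parts by a linear boundary, for example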
 #
 # ![](https://ws1.sinaimg.cn/large/006tKfTcly1fmd3gwdueoj30aw0aewex.jpg)
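 # Here the red points and the blue points can be almost completely separated by the green decision boundary
 # ## Regression vs Classification
 # Logistic regression handles a classification problem, whereas the previous model was a regression model. So what is the difference between regression and classification?
 #
 # As the figure above suggests, classification assigns each data point to a class. In a 3-class problem, for example, we want to determine which class each data point belongs to, and there are only three possible outcomes, {0, 1, 2}, so this is a discrete problem.
 #
 # Regression is a continuous problem. In curve fitting, for example, we can fit arbitrary function values, and the result is a continuous value.
 #
 # Telling classification and regression apart is the first step in machine learning and deep learning: for any problem, we first determine whether it is classification or regression and only then design the algorithm
 # ## Loss Function
 # In the previous lesson we had a loss to measure the error of a regression problem. How do we measure the error of a classification problem and design a loss function for it?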
 #
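 # Logistic regression uses the sigmoid function to map its result into the range 0 ~ 1. For any input, we write the result after the sigmoid as $\hat{y}$, the probability that the data point belongs to the second class; the probability that it belongs to the first class is then $1-\hat{y}$. If the data point belongs to the second class, we want $\hat{y}$ to be as large as possible, that is, as close to 1 as possible. If it belongs to the first class, we want $1-\hat{y}$ to be as large as possible, that is, $\hat{y}$ as small as possible and as close to 0 as possible. So we can design the loss function as follows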
 #
 # $$
 # loss = -(y * log(\hat{y}) + (1 - y) * log(1 - \hat{y}))
 # $$
 #
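 # Here y is the true label, which can only take the two values {0, 1}, while $\hat{y}$ is the prediction of the logistic regression, a number between 0 and 1. If y is 0, the data point belongs to the first class and we want $\hat{y}$ to be as small as possible. The loss function above becomes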
 #
 # $$
 # loss = - (log(1 - \hat{y}))
 # $$
 #
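 # During training we want to minimize the loss function. Since log is monotonically increasing, this amounts to minimizing $\hat{y}$, which is exactly what we want.
 #
 # If y is 1, the data point belongs to the second class and we want $\hat{y}$ to be as large as possible. The loss function above becomes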
 #
 # $$
 # loss = -(log(\hat{y}))
 # $$
 #
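 # Minimizing the loss function here means maximizing $\hat{y}$, which again matches what we want.
 #
 # The argument above shows that building the loss function this way is reasonable.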
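 #
 # A few numbers make the argument tangible. The sketch below evaluates the loss formula for a single sample with hypothetical predictions; the function name `sample_loss` is an assumption and is unrelated to any PyTorch API.
 #
```python
import math

def sample_loss(y, y_hat):
    """Loss -(y * log(y_hat) + (1 - y) * log(1 - y_hat)) for one sample."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# True label 1: the loss shrinks as the predicted probability approaches 1.
print(sample_loss(1, 0.9), sample_loss(1, 0.1))
# True label 0: the loss shrinks as the predicted probability approaches 0.
print(sample_loss(0, 0.1), sample_loss(0, 0.9))
```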
 # Below we work through an example to study logistic regression in detail
 import torch
 from torch.autograd import Variable
 import numpy as np
 import matplotlib.pyplot as plt
 # %matplotlib inline
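 # Set the random seed
 torch.manual_seed(2017)
 # We read the data from data.txt; interested readers can open data.txt to take a look
 #
 # After reading the data points, we split them into red and blue according to their labels and plot them
 # +
 # Read the points from data.txt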
 with open('./data.txt', 'r') as f:
    data_list = [i.split('\n')[0].split(',') for i in f.readlines()]
    data = [(float(i[0]), float(i[1]), float(i[2])) for i in data_list]
 # Scale each feature by its maximum value
 x0_max = max([i[0] for i in data])
 x1_max = max([i[1] for i in data])
 data = [(i[0]/x0_max, i[1]/x1_max, i[2]) for i in data]
 x0 = list(filter(lambda x: x[-1] == 0.0, data)) # select the points of the first class
 x1 = list(filter(lambda x: x[-1] == 1.0, data)) # select the points of the second class
 plot_x0 = [i[0] for i in x0]
 plot_y0 = [i[1] for i in x0]
 plot_x1 = [i[0] for i in x1]
 plot_y1 = [i[1] for i in x1]
 plt.plot(plot_x0, plot_y0, 'ro', label='x_0')
 plt.plot(plot_x1, plot_y1, 'bo', label='x_1')
 plt.legend(loc='best')
 # -
 # Next we convert the data to a NumPy array and then to a Tensor in preparation for training
 np_data = np.array(data, dtype='float32') # convert to a numpy array
 x_data = torch.from_numpy(np_data[:, 0:2]) # convert to a Tensor of size [100, 2]
 y_data = torch.from_numpy(np_data[:, -1]).unsqueeze(1) # convert to a Tensor of size [100, 1]
 # Now let us implement the sigmoid function, whose formula is
 #
 # $$
 # f(x) = \frac{1}{1 + e^{-x}}
 # $$
 # Define the sigmoid function
 def sigmoid(x):
    return 1 / (1 + np.exp(-x))
 # Plot the sigmoid function: the larger the input, the closer the output is to 1, and the smaller the input, the closer it is to 0
 # +
 # Plot the sigmoid curve
 plot_x = np.arange(-10, 10.01, 0.01)
 plot_y = sigmoid(plot_x)
 plt.plot(plot_x, plot_y, 'r')
 # -
 x_data = Variable(x_data)
 y_data = Variable(y_data)
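 # In PyTorch we do not need to write the sigmoid function ourselves. PyTorch already implements many common functions in low-level C++, which is more convenient and is also faster and more stable than our own implementation
 #
 # They are available by importing `torch.nn.functional`, as shown below
 import torch.nn.functional as F
 # +
 # Define the logistic regression model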
 w = Variable(torch.randn(2, 1), requires_grad=True) 
 b = Variable(torch.zeros(1), requires_grad=True)
 def logistic_regression(x):
    return F.sigmoid(torch.mm(x, w) + b)
 # -
 # Before updating, we can plot the classification result
 # +
 # Plot the result before any parameter update
 w0 = w[0].item()
 w1 = w[1].item()
 b0 = b.item()
 plot_x = np.arange(0.2, 1, 0.01)
 plot_y = (-w0 * plot_x - b0) / w1
 plt.plot(plot_x, plot_y, 'g', label='cutting line')
 plt.plot(plot_x0, plot_y0, 'ro', label='x_0')
 plt.plot(plot_x1, plot_y1, 'bo', label='x_1')
 plt.legend(loc='best')
 # -
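 # The green line in the plot is the set of points where the model outputs exactly 0.5, that is, where $w_0 x + w_1 y + b = 0$; solving for $y$ gives the expression used for `plot_y`. The sketch below checks this with hypothetical parameter values; `boundary_y` and the numbers are assumptions for illustration.
 #
```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def boundary_y(x, w0, w1, b):
    """Second coordinate of the boundary point with first coordinate x."""
    # Solve w0 * x + w1 * y + b = 0 for y.
    return (-w0 * x - b) / w1

w0, w1, b = 1.5, -2.0, 0.3     # hypothetical parameter values
for x in (0.2, 0.5, 0.8):
    y = boundary_y(x, w0, w1, b)
    print(x, y, sigmoid(w0 * x + w1 * y + b))   # probability is 0.5 on the line
```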
 # The classification is essentially random at this point. Let us compute the loss, whose formula is
 #
 # $$
 # loss = -(y * log(\hat{y}) + (1 - y) * log(1 - \hat{y}))
 # $$
 # Compute the loss
 def binary_loss(y_pred, y):
    logits = (y * y_pred.clamp(1e-12).log() + (1 - y) * (1 - y_pred).clamp(1e-12).log()).mean()
    return -logits
 # Note the use of `.clamp` here. Look it up in the [documentation](http://pytorch.org/docs/0.3.0/torch.html?highlight=clamp#torch.clamp) and think about whether it is really needed and what would happen without it
 #
 # **Hint: look at the graph of the log function**
 y_pred = logistic_regression(x_data)
 loss = binary_loss(y_pred, y_data)
 print(loss)
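 # With the loss in hand, we again use gradient descent to update the parameters. Automatic differentiation gives us the derivatives directly; interested readers can derive the formulas by hand
 # +
 # Differentiate automatically and update the parameters
 loss.backward()
 w.data = w.data - 0.1 * w.grad.data
 b.data = b.data - 0.1 * b.grad.data
 # Compute the loss after one update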
 y_pred = logistic_regression(x_data)
 loss = binary_loss(y_pred, y_data)
 print(loss)
 # -
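 # Updating parameters this way is tedious and repetitive: with many parameters, say 100, we would need 100 lines to update them. We could wrap the update in a function, but PyTorch already provides this in the form of its optimizers, `torch.optim`
 #
 # Using `torch.optim` calls for another data type, `nn.Parameter`. It is essentially the same as Variable, except that `nn.Parameter` requires gradients by default while Variable does not
 #
 # `torch.optim.SGD` updates parameters by gradient descent. PyTorch's optimizers include many more algorithms, which we cover in more detail later in this chapter
 #
 # After passing the parameters w and b to `torch.optim.SGD` and specifying the learning rate, we can call `optimizer.step()` to update them. Below we pass the parameters to the optimizer and set the learning rate to 1.0
 # +
 # Update the parameters with torch.optim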
 from torch import nn
 w = nn.Parameter(torch.randn(2, 1))
 b = nn.Parameter(torch.zeros(1))
 def logistic_regression(x):
    return F.sigmoid(torch.mm(x, w) + b)
 optimizer = torch.optim.SGD([w, b], lr=1.)
 # +
 # Perform 1000 updates
 import time
 start = time.time()
 for e in range(1000):
    # forward pass
    y_pred = logistic_regression(x_data)
    loss = binary_loss(y_pred, y_data) # compute the loss
    # backward pass
    optimizer.zero_grad() # zero the gradients through the optimizer
    loss.backward()
    optimizer.step() # update the parameters through the optimizer
    # compute the accuracy
    mask = y_pred.ge(0.5).float()
    acc = (mask == y_data).sum().item() / y_data.shape[0]
    if (e + 1) % 200 == 0:
        print('epoch: {}, Loss: {:.5f}, Acc: {:.5f}'.format(e+1, loss.item(), acc))
 during = time.time() - start
 print()
 print('During Time: {:.3f} s'.format(during))
 # -
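 # With an optimizer, updating the parameters is very simple: call **`optimizer.zero_grad()`** before backpropagation to zero the gradients, then call **`optimizer.step()`** to update the parameters
 #
 # After 1000 updates the loss has also dropped fairly low
 # Below we plot the result after the updates
 # +
 # Plot the result after the updates
 w0 = w[0].item()
 w1 = w[1].item()
 b0 = b.item()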
 plot_x = np.arange(0.2, 1, 0.01)
 plot_y = (-w0 * plot_x - b0) / w1
 plt.plot(plot_x, plot_y, 'g', label='cutting line')
 plt.plot(plot_x0, plot_y0, 'ro', label='x_0')
 plt.plot(plot_x1, plot_y1, 'bo', label='x_1')
 plt.legend(loc='best')
 # -
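 # After the updates the model can separate the two classes of points reasonably well
 # So far we have used our own loss, but PyTorch already provides many common losses. The loss for linear regression is `nn.MSELoss()`, and the binary classification loss for logistic regression is `nn.BCEWithLogitsLoss()`. For more losses, see the [documentation](http://pytorch.org/docs/0.3.0/nn.html#loss-functions)
 #
 # The losses implemented by PyTorch have two advantages. First, they are convenient and save us from reinventing the wheel. Second, they are implemented in low-level C++, so they are faster and more stable than our own implementation
 #
 # In addition, for numerical stability PyTorch combines the model's sigmoid operation and the final loss in `nn.BCEWithLogitsLoss()`, so when we use this built-in loss we must not apply sigmoid ourselves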
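 #
 # To see why fusing the two steps helps, the sketch below compares a naive loss with one common numerically stable formulation that works directly on the raw score z. It is an illustration under stated assumptions, not necessarily the exact code PyTorch runs; the names `bce_with_logits` and `naive_bce` are hypothetical.
 #
```python
import math

def bce_with_logits(z, y):
    """Binary cross entropy computed directly from the raw score z."""
    # Algebraically equal to -(y*log(s) + (1-y)*log(1-s)) with s = sigmoid(z),
    # but exp() only ever sees a non-positive argument, so it cannot overflow.
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

def naive_bce(z, y):
    """Apply sigmoid first, then take logs; breaks down for large |z|."""
    s = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(s) + (1 - y) * math.log(1 - s))

print(bce_with_logits(2.0, 1), naive_bce(2.0, 1))   # the two agree
print(bce_with_logits(800.0, 0))                    # stays finite
# naive_bce(800.0, 0) would fail: sigmoid rounds to exactly 1.0 and log(0) is undefined.
```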
 # +
 # Use the built-in loss
 criterion = nn.BCEWithLogitsLoss() # fuses sigmoid and the loss into one layer, for better speed and stability
 w = nn.Parameter(torch.randn(2, 1))
 b = nn.Parameter(torch.zeros(1))
 def logistic_reg(x):
    return torch.mm(x, w) + b
 optimizer = torch.optim.SGD([w, b], 1.)
 # -
 y_pred = logistic_reg(x_data)
 loss = criterion(y_pred, y_data)
 print(loss.data)
 # +
 # Again perform 1000 updates
 start = time.time()
 for e in range(1000):
    # forward pass
    y_pred = logistic_reg(x_data)
    loss = criterion(y_pred, y_data)
    # backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # compute the accuracy; y_pred is now a raw score and sigmoid(0) = 0.5, so threshold at 0
    mask = y_pred.ge(0).float()
    acc = (mask == y_data).sum().item() / y_data.shape[0]
    if (e + 1) % 200 == 0:
        print('epoch: {}, Loss: {:.5f}, Acc: {:.5f}'.format(e+1, loss.item(), acc))
 during = time.time() - start
 print()
 print('During Time: {:.3f} s'.format(during))
 # -
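 # With PyTorch's built-in loss the run is somewhat faster. The gain looks small here, but this is only a tiny network; for large networks the built-in loss brings substantial improvements in both stability and speed, and it also saves us from reinventing the wheel
 # In the next lesson we introduce `Sequential` and `Module`, PyTorch's building blocks for models, which make building models more convenient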
--- a/6_pytorch/2_CNN/basic_conv.py
+++ b/6_pytorch/2_CNN/basic_conv.py
@@ -1,109 +0,0 @@
 # -*- coding: utf-8 -*-
 # ---
 # jupyter:
 #   jupytext_format_version: '1.2'
 #   kernelspec:
 #     display_name: Python 3
 #     language: python
 #     name: python3
 #   language_info:
 #     codemirror_mode:
 #       name: ipython
 #       version: 3
 #     file_extension: .py
 #     mimetype: text/x-python
 #     name: python
 #     nbconvert_exporter: python
 #     pygments_lexer: ipython3
 #     version: 3.5.2
 # ---
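 # # Convolution Modules
 # Earlier we introduced the basics of convolutional networks, which are very widely used in computer vision. The modules commonly used in convolutional networks are easy to implement in PyTorch, so here we look at PyTorch's convolution modules
 # ## Convolution
 # PyTorch offers convolution in two forms, `torch.nn.Conv2d()` and `torch.nn.functional.conv2d()`. Both perform the same underlying convolution operation
 #
 # Both forms place the same requirements on the input. It must be a `torch.autograd.Variable()` of size (batch, channel, H, W), where batch is the number of samples in the input batch and channel is the number of input channels, usually 3 for a color image and 1 for a grayscale image; inside a convolutional network the number of channels is larger, ranging from tens to hundreds. H and W are the height and width of the input image. For example, for a batch of 32 images with 3 channels each and height and width 50 and 100, the input size is (32, 3, 50, 100)
 #
 # The examples below illustrate the two forms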
 import numpy as np
 import torch
 from torch import nn
 from torch.autograd import Variable
 import torch.nn.functional as F
 from PIL import Image
 import matplotlib.pyplot as plt
 # %matplotlib inline
 im = Image.open('./cat.png').convert('L') # read an image as grayscale
 im = np.array(im, dtype='float32') # convert it to a matrix
 # Visualize the image
 plt.imshow(im.astype('uint8'), cmap='gray')
 # Convert the image matrix to a PyTorch tensor with the shape the convolution expects
 im = torch.from_numpy(im.reshape((1, 1, im.shape[0], im.shape[1])))
 # Next we define an operator to detect edges in the image
 # +
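 # Use nn.Conv2d
 conv1 = nn.Conv2d(1, 1, 3, bias=False) # define the convolution
 sobel_kernel = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype='float32') # define the edge-detection kernel (a Laplacian-style kernel, despite the variable name)
 sobel_kernel = sobel_kernel.reshape((1, 1, 3, 3)) # reshape to match the convolution weight
 conv1.weight.data = torch.from_numpy(sobel_kernel) # assign the kernel to the convolution
 edge1 = conv1(Variable(im)) # apply it to the image
 edge1 = edge1.data.squeeze().numpy() # convert the output to image format
 # -
 # Below we visualize the result of edge detection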
 plt.imshow(edge1, cmap='gray')
 # +
 # Use F.conv2d
 sobel_kernel = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype='float32') # define the edge-detection kernel
 sobel_kernel = sobel_kernel.reshape((1, 1, 3, 3)) # reshape to match the convolution weight
 weight = Variable(torch.from_numpy(sobel_kernel))
 edge2 = F.conv2d(Variable(im), weight) # apply it to the image
 edge2 = edge2.data.squeeze().numpy() # convert the output to image format
 plt.imshow(edge2, cmap='gray')
 # -
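 # The two forms give the same result. The difference, as you have probably noticed, is that `nn.Conv2d()` defines a convolutional layer, while `torch.nn.functional.conv2d()` defines a convolution operation. The latter therefore requires us to define a weight ourselves, and this weight must also be a Variable, whereas `nn.Conv2d()` creates a randomly initialized weight for us. If we need to change it we can take the value out and modify it; otherwise we can use the default initialization directly, which is very convenient
 #
 # **In practice we almost always use the `nn.Conv2d()` form**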
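 #
 # To show what the operation computes, the sketch below slides the same 3 x 3 kernel over two tiny images in plain Python, with no padding and stride 1. Like PyTorch's conv2d it does not flip the kernel. The helper `conv2d_valid` and the toy images are assumptions made for illustration.
 #
```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            out[i][j] = sum(image[i + a][j + b] * kernel[a][b]
                            for a in range(kh) for b in range(kw))
    return out

kernel = [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]   # same kernel as above
flat = [[5] * 4 for _ in range(4)]                   # constant 4 x 4 image
spot = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]             # single bright pixel
print(conv2d_valid(flat, kernel))   # all zeros: a flat region has no edges
print(conv2d_valid(spot, kernel))   # a single strong response of 8
```
 #
 # The kernel entries sum to zero, so constant regions produce no response and only changes in intensity stand out. A 4 x 4 input yields a 2 x 2 output, since each dimension shrinks by the kernel size minus one.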
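 # ## Pooling Layers
 # Another very important structure in convolutional networks is pooling. It exploits the fact that images are invariant under downsampling: a smaller version of an image still shows its content. A pooling layer reduces the image size, which improves computational efficiency considerably, and it has no parameters. There are many kinds of pooling, such as max pooling and average pooling; convolutional networks generally use max pooling.
 #
 # PyTorch also offers max pooling in two forms, `nn.MaxPool2d()` and `torch.nn.functional.max_pool2d()`. Their input requirements are the same as for convolution, so we do not repeat them; examples follow
 # Use nn.MaxPool2d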
 pool1 = nn.MaxPool2d(2, 2)
 print('before max pool, image shape: {} x {}'.format(im.shape[2], im.shape[3]))
 small_im1 = pool1(Variable(im))
 small_im1 = small_im1.data.squeeze().numpy()
 print('after max pool, image shape: {} x {} '.format(small_im1.shape[0], small_im1.shape[1]))
 # The image is half as large in each dimension. Has its content changed? Let us visualize it
 plt.imshow(small_im1, cmap='gray')
 # The image looks almost unchanged, which shows that the pooling layer only reduces the image size without noticeably affecting its content
 # F.max_pool2d
 print('before max pool, image shape: {} x {}'.format(im.shape[2], im.shape[3]))
 small_im2 = F.max_pool2d(Variable(im), 2, 2)
 small_im2 = small_im2.data.squeeze().numpy()
 print('after max pool, image shape: {} x {} '.format(small_im2.shape[0], small_im2.shape[1]))
 plt.imshow(small_im2, cmap='gray')
 # **As with convolutional layers, in practice we generally use `nn.MaxPool2d()`**
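 #
 # For intuition, the sketch below performs 2 x 2 max pooling with stride 2 on a tiny matrix in plain Python. The helper `max_pool2d` here is a hypothetical stand-in written for illustration, not the PyTorch function of the same name.
 #
```python
def max_pool2d(image, size=2, stride=2):
    """Keep the largest value in each size x size window."""
    out_h = (len(image) - size) // stride + 1
    out_w = (len(image[0]) - size) // stride + 1
    return [[max(image[i * stride + a][j * stride + b]
                 for a in range(size) for b in range(size))
             for j in range(out_w)]
            for i in range(out_h)]

image = [[1, 3, 2, 0],
         [4, 2, 1, 5],
         [0, 1, 7, 2],
         [3, 2, 4, 6]]
print(max_pool2d(image))   # [[4, 5], [3, 7]]
```
 #
 # Each output value summarizes one non-overlapping 2 x 2 window, so height and width are both halved and no parameters are involved.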
 # This concludes our introduction to the convolution and pooling modules in PyTorch. Next we turn to several very well-known convolutional network architectures
--- a/6_pytorch/2_CNN/batch-normalization.py
+++ b/6_pytorch/2_CNN/batch-normalization.py
@@ -1,257 +0,0 @@
 # -*- coding: utf-8 -*-
 # ---
 # jupyter:
 #   jupytext_format_version: '1.2'
 #   kernelspec:
 #     display_name: Python 3
 #     language: python
 #     name: python3
 #   language_info:
 #     codemirror_mode:
 #       name: ipython
 #       version: 3
 #     file_extension: .py
 #     mimetype: text/x-python
 #     name: python
 #     nbconvert_exporter: python
 #     pygments_lexer: ipython3
 #     version: 3.5.2
 # ---
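 # # Batch Normalization
 # Before we start building and training models, we discuss data preprocessing and batch normalization. Training a model is not easy; very complex models in particular often fail to converge well. Preprocessing the data and using batch normalization help training converge much better, and this is one important reason why convolutional networks can be trained at great depth.
 # ## Data Preprocessing
 # The most common preprocessing methods today are centering and standardization. Centering corrects the location of the data's center and is very simple to implement: subtract the mean of each feature dimension, which yields zero-mean features. Standardization is just as simple: once the data has zero mean, we give the different feature dimensions the same scale, either by dividing by the standard deviation, so the data approximates a standard normal distribution, or by rescaling to the range -1 ~ 1 using the maximum and minimum values. A simple illustration follows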
 #
 # ![](https://ws1.sinaimg.cn/large/006tKfTcly1fmqouzer3xj30ij06n0t8.jpg)
 #
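 # These two methods are very common. As you may recall, we already used them to standardize the data in the neural network section. Other methods, such as PCA and whitening, are rarely used nowadays.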
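 #
 # As a minimal sketch of centering followed by standardization, the plain-Python example below processes a single feature column. The helper `standardize` and the sample values are assumptions made for illustration.
 #
```python
import math

def standardize(column):
    """Center a feature to zero mean, then scale it to unit standard deviation."""
    n = len(column)
    mean = sum(column) / n
    centered = [v - mean for v in column]                 # centering
    std = math.sqrt(sum(c * c for c in centered) / n)
    return [c / std for c in centered]                    # standardization

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]   # hypothetical feature values
z = standardize(heights_cm)
print(z)
print(sum(z) / len(z))                   # approximately 0
print(sum(v * v for v in z) / len(z))    # approximately 1
```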
 # ## Batch Normalization
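 # In data preprocessing we try to make the input features uncorrelated and standard normally distributed, which usually helps the model perform well. In a very deep network, however, the nonlinear layers make the outputs correlated and no longer distributed as a standard N(0, 1); the center of the outputs may even shift. This makes training difficult, especially for deep models.
 #
 # A 2015 paper therefore proposed batch normalization. In short, the output of each layer is normalized toward a standard normal distribution, so that the input to the next layer is also approximately standard normal; this makes training easier and speeds up convergence.
 # Batch normalization is very simple to implement. For a given batch of data $B = \{x_1, x_2, \cdots, x_m\}$, the algorithm is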
 #
 # $$
 # \mu_B = \frac{1}{m} \sum_{i=1}^m x_i
 # $$
 # $$
 # \sigma^2_B = \frac{1}{m} \sum_{i=1}^m (x_i - \mu_B)^2
 # $$
 # $$
 # \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma^2_B + \epsilon}}
 # $$
 # $$
 # y_i = \gamma \hat{x}_i + \beta
 # $$
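 # The first and second formulas compute the mean and variance of the data in a batch, and the third standardizes each data point in the batch. $\epsilon$ is a small constant introduced for numerical stability, usually $10^{-5}$. Finally, the learnable scale $\gamma$ and shift $\beta$ produce the final output. Below we implement the simple one-dimensional case, which is the case for fully connected networks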
 # + {"ExecuteTime": {"start_time": "2017-12-23T06:50:51.575693Z", "end_time": "2017-12-23T06:50:51.579067Z"}}
 import sys
 sys.path.append('..')
 import torch
 # + {"ExecuteTime": {"start_time": "2017-12-23T07:14:11.060849Z", "end_time": "2017-12-23T07:14:11.077807Z"}}
 def simple_batch_norm_1d(x, gamma, beta):
    eps = 1e-5
    x_mean = torch.mean(x, dim=0, keepdim=True) # keep the dimension for broadcasting
    x_var = torch.mean((x - x_mean) ** 2, dim=0, keepdim=True)
    x_hat = (x - x_mean) / torch.sqrt(x_var + eps)
    return gamma.view_as(x_mean) * x_hat + beta.view_as(x_mean)
 # -
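 # Let us check that the output is standardized for an arbitrary input
 # + {"ExecuteTime": {"start_time": "2017-12-23T07:14:20.597682Z", "end_time": "2017-12-23T07:14:20.610603Z"}}
 x = torch.arange(15, dtype=torch.float32).view(5, 3) # float dtype so that torch.mean works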
 gamma = torch.ones(x.shape[1])
 beta = torch.zeros(x.shape[1])
 print('before bn: ')
 print(x)
 y = simple_batch_norm_1d(x, gamma, beta)
 print('after bn: ')
 print(y)
 # -
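 # There are 5 data points with three features; each column holds one feature across the data points. After batch normalization, every column has zero mean and unit variance
 #
 # This raises a question: should batch normalization be used at test time?
 #
 # The answer is yes: since it was used during training, leaving it out at test time would bias the results. But if a test batch contains only a single data point, its mean is the point itself and its variance is 0, which is clearly unusable. So at test time we do not compute the mean and variance from the test data; instead we use the moving averages of the mean and variance accumulated during training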
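 #
 # The idea of a moving average can be sketched in plain Python. The example below uses the convention of `torch.nn.BatchNorm1d`, where `momentum` is the weight given to the newest batch statistic; the function defined next in this notebook weights the two terms the other way around. The helper name and all numbers are assumptions for illustration.
 #
```python
def update_running(running, batch_value, momentum=0.1):
    """Exponential moving average: move `running` a fraction `momentum` toward `batch_value`."""
    return (1.0 - momentum) * running + momentum * batch_value

running_mean = 0.0
for batch_mean in [2.0] * 100:      # hypothetical: every batch has mean 2.0
    running_mean = update_running(running_mean, batch_mean)
print(running_mean)                 # close to 2.0

# At test time a single sample is normalized with the stored statistics
# instead of its own (degenerate) mean and variance.
running_var, eps = 4.0, 1e-5        # hypothetical stored variance
x = 5.0
print((x - running_mean) / (running_var + eps) ** 0.5)
```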
 #
 # Below we implement a batch normalization function that distinguishes between training and testing
 # + {"ExecuteTime": {"start_time": "2017-12-23T07:32:48.005892Z", "end_time": "2017-12-23T07:32:48.025709Z"}}
 def batch_norm_1d(x, gamma, beta, is_training, moving_mean, moving_var, moving_momentum=0.1):
    eps = 1e-5
    x_mean = torch.mean(x, dim=0, keepdim=True) # keep the dimension for broadcasting
    x_var = torch.mean((x - x_mean) ** 2, dim=0, keepdim=True)
    if is_training:
        x_hat = (x - x_mean) / torch.sqrt(x_var + eps)
        # update the moving statistics; here moving_momentum is the weight of the old value,
        # and detach() keeps these updates out of the autograd graph
        moving_mean[:] = moving_momentum * moving_mean + (1. - moving_momentum) * x_mean.detach()
        moving_var[:] = moving_momentum * moving_var + (1. - moving_momentum) * x_var.detach()
    else:
        x_hat = (x - moving_mean) / torch.sqrt(moving_var + eps)
    return gamma.view_as(x_mean) * x_hat + beta.view_as(x_mean)
 # -
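 # Below we reuse the deep neural network example from the previous lesson, classifying the mnist dataset, to test whether batch normalization helps
 import numpy as np
 from torchvision.datasets import mnist # import the mnist dataset bundled with torchvision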
 from torch.utils.data import DataLoader
 from torch import nn
 from torch.autograd import Variable
 # +
 # Download the mnist dataset with the built-in helper
 train_set = mnist.MNIST('../../data/mnist', train=True, download=True)
 test_set = mnist.MNIST('../../data/mnist', train=False, download=True)
 def data_tf(x):
    x = np.array(x, dtype='float32') / 255
    x = (x - 0.5) / 0.5 # preprocessing: normalize to [-1, 1]
    x = x.reshape((-1,)) # flatten
    x = torch.from_numpy(x)
    return x
 train_set = mnist.MNIST('../../data/mnist', train=True, transform=data_tf, download=True) # reload the dataset with the transform defined above
 test_set = mnist.MNIST('../../data/mnist', train=False, transform=data_tf, download=True)
 train_data = DataLoader(train_set, batch_size=64, shuffle=True)
 test_data = DataLoader(test_set, batch_size=128, shuffle=False)
 # -
 class multi_network(nn.Module):
    def __init__(self):
        super(multi_network, self).__init__()
        self.layer1 = nn.Linear(784, 100)
        self.relu = nn.ReLU(True)
        self.layer2 = nn.Linear(100, 10)
        self.gamma = nn.Parameter(torch.randn(100))
        self.beta = nn.Parameter(torch.randn(100))
        self.moving_mean = Variable(torch.zeros(100))
        self.moving_var = Variable(torch.zeros(100))
    def forward(self, x, is_train=True):
        x = self.layer1(x)
        x = batch_norm_1d(x, self.gamma, self.beta, is_train, self.moving_mean, self.moving_var)
        x = self.relu(x)
        x = self.layer2(x)
        return x
 net = multi_network()
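 # Define the loss function
 criterion = nn.CrossEntropyLoss()
 optimizer = torch.optim.SGD(net.parameters(), 1e-1) # stochastic gradient descent with learning rate 0.1
 # For convenience, the training function is defined in the external utils.py. It does the same thing as the training loops earlier; interested readers can take a look
 from utils import train
 train(net, train_data, test_data, 10, optimizer, criterion)
 # Here $\gamma$ and $\beta$ are trained as parameters and initialized from a random Gaussian distribution, while `moving_mean` and `moving_var` are initialized to 0 and are not trained by the optimizer. After 10 epochs of training, let us see what the moving mean has become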
 # + {"scrolled": true}
 # Print the first 10 entries of moving_mean
 print(net.moving_mean[:10])
 # -
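 # These values were modified during training. At test time we no longer need to compute the mean and variance and can use the moving mean and moving variance directly
 # For comparison, let us look at the result without batch normalization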
 # +
 no_bn_net = nn.Sequential(
    nn.Linear(784, 100),
    nn.ReLU(True),
    nn.Linear(100, 10)
 )
 optimizer = torch.optim.SGD(no_bn_net.parameters(), 1e-1) # stochastic gradient descent with learning rate 0.1
 train(no_bn_net, train_data, test_data, 10, optimizer, criterion)
 # -
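 # Although the final results of the two cases are similar, the first few epochs show that the version with batch normalization converges faster. This is only a small network, so it converges with or without batch normalization, but for deeper networks batch normalization lets training converge much more quickly
 # Above we implemented batch normalization ourselves for 2-dimensional inputs of shape (batch, features). The 4-dimensional case used with convolutions is similar: the mean and variance are computed separately for each channel. Implementing batch normalization by hand is tedious, and PyTorch of course has it built in: `torch.nn.BatchNorm1d()` and `torch.nn.BatchNorm2d()` for the one-dimensional and two-dimensional cases. As in our implementation, PyTorch trains $\gamma$ and $\beta$ as parameters; its moving mean and moving variance (`running_mean` and `running_var`) are stored as buffers and updated by a moving average during training rather than by gradient descent
 # Below we try batch normalization in a convolutional network to see its effect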
 # +
 def data_tf(x):
    x = np.array(x, dtype='float32') / 255
    x = (x - 0.5) / 0.5 # preprocessing: normalize to [-1, 1]
    x = torch.from_numpy(x)
    x = x.unsqueeze(0)
    return x
 train_set = mnist.MNIST('../../data/mnist', train=True, transform=data_tf, download=True) # reload the dataset with the transform defined above
 test_set = mnist.MNIST('../../data/mnist', train=False, transform=data_tf, download=True)
 train_data = DataLoader(train_set, batch_size=64, shuffle=True)
 test_data = DataLoader(test_set, batch_size=128, shuffle=False)
 # +
 # With batch normalization
 class conv_bn_net(nn.Module):
    def __init__(self):
        super(conv_bn_net, self).__init__()
        self.stage1 = nn.Sequential(
            nn.Conv2d(1, 6, 3, padding=1),
            nn.BatchNorm2d(6),
            nn.ReLU(True),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(6, 16, 5),
            nn.BatchNorm2d(16),
            nn.ReLU(True),
            nn.MaxPool2d(2, 2)
        )
        self.classfy = nn.Linear(400, 10)
    def forward(self, x):
        x = self.stage1(x)
        x = x.view(x.shape[0], -1)
        x = self.classfy(x)
        return x
 net = conv_bn_net()
 optimizer = torch.optim.SGD(net.parameters(), 1e-1) # stochastic gradient descent, learning rate 0.1
 # -
 train(net, train_data, test_data, 5, optimizer, criterion)
 # +
 # Without batch normalization
 class conv_no_bn_net(nn.Module):
    def __init__(self):
        super(conv_no_bn_net, self).__init__()
        self.stage1 = nn.Sequential(
            nn.Conv2d(1, 6, 3, padding=1),
            nn.ReLU(True),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(6, 16, 5),
            nn.ReLU(True),
            nn.MaxPool2d(2, 2)
        )
        self.classfy = nn.Linear(400, 10)
    def forward(self, x):
        x = self.stage1(x)
        x = x.view(x.shape[0], -1)
        x = self.classfy(x)
        return x
 net = conv_no_bn_net()
 optimizer = torch.optim.SGD(net.parameters(), 1e-1) # stochastic gradient descent, learning rate 0.1
 # -
 train(net, train_data, test_data, 5, optimizer, criterion)
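 # When we introduce some well-known network architectures later, we will gradually see how important batch normalization is; PyTorch makes it very easy to add batch normalization layers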
--- a/6_pytorch/2_CNN/data-augumentation.py
+++ b/6_pytorch/2_CNN/data-augumentation.py
@@ -1,204 +0,0 @@
 # -*- coding: utf-8 -*-
 # ---
 # jupyter:
 #   jupytext_format_version: '1.2'
 #   kernelspec:
 #     display_name: Python 3
 #     language: python
 #     name: python3
 #   language_info:
 #     codemirror_mode:
 #       name: ipython
 #       version: 3
 #     file_extension: .py
 #     mimetype: text/x-python
 #     name: python
 #     nbconvert_exporter: python
 #     pygments_lexer: ipython3
 #     version: 3.5.2
 # ---
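 # # Data Augmentation
 # We have already covered several well-known convolutional network architectures, but these networks alone are not enough to achieve state-of-the-art results. Real-world problems are usually more complex and overfitting occurs easily; data augmentation is an important way to combat overfitting.
 #
 # When AlexNet won ImageNet by a wide margin in 2012, image augmentation played a major role: with augmentation, the training set contains many more 'new' samples than the actual dataset, which reduces overfitting. We explain this in detail below.
 # ## Common data augmentation methods
 # Common data augmentation methods are:  
 # 1. Scaling the image by some ratio  
 # 2. Cropping the image at a random position  
 # 3. Randomly flipping the image horizontally and vertically  
 # 4. Rotating the image by a random angle  
 # 5. Randomly changing the brightness, contrast, and color of the image
 #
 # PyTorch provides all of these in torchvision, which we installed together with PyTorch. We demonstrate each method in turn below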
 # +
 import sys
 sys.path.append('..')
 from PIL import Image
 from torchvision import transforms as tfs
 # -
 # Read in an image
 im = Image.open('./cat.png')
 im
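 # ### Random-ratio scaling
 # Scaling mainly uses `torchvision.transforms.Resize()`. The first argument can be an integer, in which case the image keeps its current aspect ratio and its shorter side is scaled to that size; it can also be a tuple `(height, width)`, in which case the image is resized directly to that height and width. `Resize()` itself is deterministic, so a random ratio has to come from a randomly chosen target size. The second argument selects the interpolation method, such as nearest neighbor or bilinear interpolation. Bilinear interpolation generally preserves more of the image's information, so PyTorch uses it by default; you can change this argument manually. See the [documentation](http://pytorch.org/docs/0.3.0/torchvision/transforms.html) for more information
 # Scaling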
 print('before scale, shape: {}'.format(im.size))
 new_im = tfs.Resize((100, 200))(im)
 print('after scale, shape: {}'.format(new_im.size))
 new_im
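 # The sizing rule described above can be sketched in plain Python. This is an illustrative helper, not torchvision's code: the name `resize_output_size` and the 400 x 300 image are made up, and the exact rounding inside torchvision may differ between versions. Note that PIL's `im.size` reports `(width, height)`, while the tuple passed to `Resize` is `(height, width)`.

```python
def resize_output_size(width, height, size):
    """Return (width, height) after a Resize-style scaling.

    `size` is either an int (shorter side becomes `size`, aspect ratio kept)
    or a (height, width) tuple (both sides set directly).
    """
    if isinstance(size, int):
        if width <= height:
            return size, int(size * height / width)
        return int(size * width / height), size
    new_height, new_width = size
    return new_width, new_height

# A hypothetical 400 x 300 (width x height) image.
print(resize_output_size(400, 300, 120))         # (160, 120)
print(resize_output_size(400, 300, (100, 200)))  # (200, 100)
```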
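 # ### Random cropping
 # Random cropping extracts local regions of the image, so the network sees inputs with multi-scale features, which tends to improve results. torchvision offers two main options. The first is `torchvision.transforms.RandomCrop()`, whose argument is the height and width of the crop; it crops the image at a random position. The second is `torchvision.transforms.CenterCrop()`, which likewise takes the size of the crop as its argument and crops at the center of the image
 # Randomly crop a 100 x 100 region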
 random_im1 = tfs.RandomCrop(100)(im)
 random_im1
 # Randomly crop a region 150 high and 100 wide
 random_im2 = tfs.RandomCrop((150, 100))(im)
 random_im2
 # Center-crop a 100 x 100 region
 center_im = tfs.CenterCrop(100)(im)
 center_im
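 # A rough sketch of how the crop position can be chosen in the two cases above. The helper names and the 400 x 300 image size are hypothetical, and torchvision's own rounding for the centered case may differ slightly.

```python
import random

def random_crop_box(width, height, crop_h, crop_w, rng=random):
    """Pick a random (left, top, right, bottom) box of size crop_w x crop_h inside the image."""
    if crop_w > width or crop_h > height:
        raise ValueError("crop is larger than the image")
    left = rng.randint(0, width - crop_w)   # randint is inclusive on both ends
    top = rng.randint(0, height - crop_h)
    return left, top, left + crop_w, top + crop_h

def center_crop_box(width, height, crop_h, crop_w):
    """Same box size, but centered in the image."""
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h

box = random_crop_box(400, 300, crop_h=150, crop_w=100, rng=random.Random(0))
print(box)
print(center_crop_box(400, 300, 100, 100))  # (150, 100, 250, 200)
```

 # A box in this (left, upper, right, lower) form is what PIL's `Image.crop()` accepts.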
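 # ### Random horizontal and vertical flips
 # If we flip the cat picture above, it is still a picture of a cat, but the data gains more diversity, so random flipping is a very effective technique. In torchvision, random flipping uses `torchvision.transforms.RandomHorizontalFlip()` and `torchvision.transforms.RandomVerticalFlip()`, each of which flips with probability 0.5 by default
 # Random horizontal flip
 h_flip = tfs.RandomHorizontalFlip()(im)
 h_flip
 # Random vertical flip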
 v_flip = tfs.RandomVerticalFlip()(im)
 v_flip
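 # ### Random rotation
 # Rotation by a moderate angle is also a very useful augmentation. torchvision implements it with `torchvision.transforms.RandomRotation()`, whose first argument is the rotation range in degrees; for example, passing 10 rotates each image by a random angle between -10 and 10 degrees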
 rot_im = tfs.RandomRotation(45)(im)
 rot_im
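 # ### Brightness, contrast, and color changes
 # Besides geometric changes, color changes are another kind of augmentation; brightness, contrast, and color can all be varied. torchvision mainly uses `torchvision.transforms.ColorJitter()` for this: the first argument is the brightness factor, the second is contrast, the third is saturation, and the fourth is hue
 # Brightness
 bright_im = tfs.ColorJitter(brightness=1)(im) # brightness factor drawn at random from 0 ~ 2; 1 is the original image
 bright_im
 # Contrast
 contrast_im = tfs.ColorJitter(contrast=1)(im) # contrast factor drawn at random from 0 ~ 2; 1 is the original image
 contrast_im
 # Color (hue)
 color_im = tfs.ColorJitter(hue=0.5)(im) # hue shift drawn at random from -0.5 ~ 0.5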
 color_im
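 # The ranges quoted in the comments above follow a simple rule, sketched here in plain Python. The helper names are hypothetical; the rule itself (a factor in `[max(0, 1 - v), 1 + v]` for brightness, contrast and saturation, and a shift in `[-v, v]` for hue) is the documented behaviour of `ColorJitter` when a single number is passed.

```python
import random

def jitter_factor_range(value):
    """Range for brightness/contrast/saturation: [max(0, 1 - value), 1 + value]."""
    return max(0.0, 1.0 - value), 1.0 + value

def hue_shift_range(value):
    """Range for hue: [-value, value], with value at most 0.5."""
    if not 0.0 <= value <= 0.5:
        raise ValueError("hue must be in [0, 0.5]")
    return -value, value

print(jitter_factor_range(1))    # (0.0, 2.0)
print(jitter_factor_range(0.5))  # (0.5, 1.5)
print(hue_shift_range(0.5))      # (-0.5, 0.5)

# One random draw for a brightness setting of 0.5.
low, high = jitter_factor_range(0.5)
factor = random.Random(0).uniform(low, high)
```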
 #
 #
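 # We have covered many image augmentation methods. They are not used in isolation; they can be combined, for example a random flip, then a random crop, then a contrast change. torchvision has a very convenient function for chaining these transforms, `torchvision.transforms.Compose()`. Here is an example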
 im_aug = tfs.Compose([
    tfs.Resize(120),
    tfs.RandomHorizontalFlip(),
    tfs.RandomCrop(96),
    tfs.ColorJitter(brightness=0.5, contrast=0.5, hue=0.5)
 ])
 import matplotlib.pyplot as plt
 # %matplotlib inline
 nrows = 3
 ncols = 3
 figsize = (8, 8)
 _, figs = plt.subplots(nrows, ncols, figsize=figsize)
 for i in range(nrows):
    for j in range(ncols):
        figs[i][j].imshow(im_aug(im))
        figs[i][j].axes.get_xaxis().set_visible(False)
        figs[i][j].axes.get_yaxis().set_visible(False)
 plt.show()
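 # Each augmented image is slightly different; this is the 'new' data we mentioned earlier
 #
 # Next we train a network with image augmentation to see where the improvement actually comes from, using the ResNet introduced earlier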
 # + {"ExecuteTime": {"start_time": "2017-12-23T05:04:02.920639Z", "end_time": "2017-12-23T05:04:03.407434Z"}}
 import numpy as np
 import torch
 from torch import nn
 import torch.nn.functional as F
 from torch.autograd import Variable
 from torchvision.datasets import CIFAR10
 from utils import train, resnet
 from torchvision import transforms as tfs
 # + {"ExecuteTime": {"start_time": "2017-12-23T05:04:03.459562Z", "end_time": "2017-12-23T05:04:04.743167Z"}}
 # With data augmentation
 def train_tf(x):
    im_aug = tfs.Compose([
        tfs.Resize(120),
        tfs.RandomHorizontalFlip(),
        tfs.RandomCrop(96),
        tfs.ColorJitter(brightness=0.5, contrast=0.5, hue=0.5),
        tfs.ToTensor(),
        tfs.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
    ])
    x = im_aug(x)
    return x
 def test_tf(x):
    im_aug = tfs.Compose([
        tfs.Resize(96),
        tfs.ToTensor(),
        tfs.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
    ])
    x = im_aug(x)
    return x
 train_set = CIFAR10('./data', train=True, transform=train_tf)
 train_data = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
 test_set = CIFAR10('./data', train=False, transform=test_tf)
 test_data = torch.utils.data.DataLoader(test_set, batch_size=128, shuffle=False)
 net = resnet(3, 10)
 optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
 criterion = nn.CrossEntropyLoss()
 # + {"ExecuteTime": {"start_time": "2017-12-23T05:04:04.745540Z", "end_time": "2017-12-23T05:08:51.433955Z"}}
 train(net, train_data, test_data, 10, optimizer, criterion)
 # + {"ExecuteTime": {"start_time": "2017-12-23T05:09:21.756986Z", "end_time": "2017-12-23T05:09:22.997927Z"}}
 # Without data augmentation
 def data_tf(x):
    im_aug = tfs.Compose([
        tfs.Resize(96),
        tfs.ToTensor(),
        tfs.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
    ])
    x = im_aug(x)
    return x
 train_set = CIFAR10('./data', train=True, transform=data_tf)
 train_data = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
 test_set = CIFAR10('./data', train=False, transform=data_tf)
 test_data = torch.utils.data.DataLoader(test_set, batch_size=128, shuffle=False)
 net = resnet(3, 10)
 optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
 criterion = nn.CrossEntropyLoss()
 # + {"ExecuteTime": {"start_time": "2017-12-23T05:09:23.000573Z", "end_time": "2017-12-23T05:13:57.898751Z"}}
 train(net, train_data, test_data, 10, optimizer, criterion)
 # -
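 # From the results above, on the training set, 10 epochs without data augmentation already reach 95% accuracy, while 10 epochs with data augmentation reach only 75%. This shows that augmentation makes the training task harder.
 #
 # On the test set, however, training with data augmentation gives higher accuracy than training without it, because augmentation improves the model's ability to generalize to more varied data.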
--- a/6_pytorch/2_CNN/densenet.py
+++ b/6_pytorch/2_CNN/densenet.py
@@ -1,178 +0,0 @@
 # -*- coding: utf-8 -*-
 # ---
 # jupyter:
 #   jupytext_format_version: '1.2'
 #   kernelspec:
 #     display_name: Python 3
 #     language: python
 #     name: python3
 #   language_info:
 #     codemirror_mode:
 #       name: ipython
 #       version: 3
 #     file_extension: .py
 #     mimetype: text/x-python
 #     name: python
 #     nbconvert_exporter: python
 #     pygments_lexer: ipython3
 #     version: 3.5.2
 # ---
 # # DenseNet
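 # ResNet introduced the idea of cross-layer (skip) connections, which directly influenced the convolutional architectures that followed. The best known of these is DenseNet, the CVPR 2017 best paper.
 #
 # DenseNet differs from ResNet in that ResNet sums features across layers, while DenseNet concatenates them along the channel dimension. The two figures below illustrate both
 #
 # ![](https://ws4.sinaimg.cn/large/006tNc79ly1fmpvj5vkfhj30uw0anq73.jpg)
 #
 # ![](https://ws1.sinaimg.cn/large/006tNc79ly1fmpvj7fxd1j30vb0eyzqf.jpg)
 # The first figure shows ResNet and the second shows DenseNet. Because features are concatenated along the channel dimension, the outputs of early layers are passed on to every later layer. This helps gradients propagate, and it lets low-level and high-level features be trained jointly, which gives better results.
 # DenseNet is built mainly from dense blocks. Let's implement a dense block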
 # + {"ExecuteTime": {"start_time": "2017-12-22T15:38:30.612922Z", "end_time": "2017-12-22T15:38:31.113030Z"}}
 import sys
 sys.path.append('..')
 import numpy as np
 import torch
 from torch import nn
 from torch.autograd import Variable
 from torchvision.datasets import CIFAR10
 # -
 # First define a convolution block whose order is bn -> relu -> conv
 # + {"ExecuteTime": {"start_time": "2017-12-22T15:38:31.115369Z", "end_time": "2017-12-22T15:38:31.121249Z"}}
 def conv_block(in_channel, out_channel):
    layer = nn.Sequential(
        nn.BatchNorm2d(in_channel),
        nn.ReLU(True),
        nn.Conv2d(in_channel, out_channel, 3, padding=1, bias=False)
    )
    return layer
 # -
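 # In a dense block, the number of output channels of each convolution is called the `growth_rate`: if the input has `in_channel` channels and the block has n layers, the output has `in_channel + n * growth_rate` channels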
 # + {"ExecuteTime": {"start_time": "2017-12-22T15:38:31.123363Z", "end_time": "2017-12-22T15:38:31.145274Z"}}
 class dense_block(nn.Module):
    def __init__(self, in_channel, growth_rate, num_layers):
        super(dense_block, self).__init__()
        block = []
        channel = in_channel
        for i in range(num_layers):
            block.append(conv_block(channel, growth_rate))
            channel += growth_rate
        self.net = nn.Sequential(*block)
    def forward(self, x):
        for layer in self.net:
            out = layer(x)
            x = torch.cat((out, x), dim=1)
        return x
 # -
 # Let's verify that the number of output channels is correct
 # + {"ExecuteTime": {"start_time": "2017-12-22T15:38:31.147196Z", "end_time": "2017-12-22T15:38:31.213632Z"}}
 test_net = dense_block(3, 12, 3)
 test_x = Variable(torch.zeros(1, 3, 96, 96))
 print('input shape: {} x {} x {}'.format(test_x.shape[1], test_x.shape[2], test_x.shape[3]))
 test_y = test_net(test_x)
 print('output shape: {} x {} x {}'.format(test_y.shape[1], test_y.shape[2], test_y.shape[3]))
 # -
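 # The expected channel count can also be worked out by hand. A small sketch of the `in_channel + n * growth_rate` arithmetic for the block tested above (the helper name is hypothetical):

```python
def dense_block_channels(in_channel, growth_rate, num_layers):
    """Channel count seen by each layer of a dense block, plus the final output."""
    channels = [in_channel]
    for _ in range(num_layers):
        channels.append(channels[-1] + growth_rate)  # each layer's output is concatenated on
    return channels

print(dense_block_channels(3, 12, 3))  # [3, 15, 27, 39]
```

 # So `dense_block(3, 12, 3)` should report 39 output channels, with the 96 x 96 spatial size unchanged.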
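 # Besides dense blocks, DenseNet has another module called the transition layer (transition block). Because DenseNet keeps concatenating along the channel dimension, the number of output channels keeps growing as the network gets deeper, and so do the parameter count and the computation. To avoid this, a transition layer is introduced to bring the number of channels back down and to halve the height and width of its input. The transition layer can use a 1 x 1 convolution followed by 2 x 2 average pooling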
 # + {"ExecuteTime": {"start_time": "2017-12-22T15:38:31.215770Z", "end_time": "2017-12-22T15:38:31.222120Z"}}
 def transition(in_channel, out_channel):
    trans_layer = nn.Sequential(
        nn.BatchNorm2d(in_channel),
        nn.ReLU(True),
        nn.Conv2d(in_channel, out_channel, 1),
        nn.AvgPool2d(2, 2)
    )
    return trans_layer
 # -
 # Verify that the transition layer is correct
 # + {"ExecuteTime": {"start_time": "2017-12-22T15:38:31.224078Z", "end_time": "2017-12-22T15:38:31.234846Z"}}
 test_net = transition(3, 12)
 test_x = Variable(torch.zeros(1, 3, 96, 96))
 print('input shape: {} x {} x {}'.format(test_x.shape[1], test_x.shape[2], test_x.shape[3]))
 test_y = test_net(test_x)
 print('output shape: {} x {} x {}'.format(test_y.shape[1], test_y.shape[2], test_y.shape[3]))
 # -
 # Finally, we define DenseNet
 # + {"ExecuteTime": {"start_time": "2017-12-22T15:38:31.236857Z", "end_time": "2017-12-22T15:38:31.318822Z"}}
 class densenet(nn.Module):
    def __init__(self, in_channel, num_classes, growth_rate=32, block_layers=[6, 12, 24, 16]):
        super(densenet, self).__init__()
        self.block1 = nn.Sequential(
            nn.Conv2d(in_channel, 64, 7, 2, 3),
            nn.BatchNorm2d(64),
            nn.ReLU(True),
            nn.MaxPool2d(3, 2, padding=1)
        )
        channels = 64
        block = []
        for i, layers in enumerate(block_layers):
            block.append(dense_block(channels, growth_rate, layers))
            channels += layers * growth_rate
            if i != len(block_layers) - 1:
                block.append(transition(channels, channels // 2)) # the transition layer halves both the spatial size and the channel count
                channels = channels // 2
        self.block2 = nn.Sequential(*block)
        self.block2.add_module('bn', nn.BatchNorm2d(channels))
        self.block2.add_module('relu', nn.ReLU(True))
        self.block2.add_module('avg_pool', nn.AvgPool2d(3))
        self.classifier = nn.Linear(channels, num_classes)
    def forward(self, x):
        x = self.block1(x)
        x = self.block2(x)
        x = x.view(x.shape[0], -1)
        x = self.classifier(x)
        return x
 # + {"ExecuteTime": {"start_time": "2017-12-22T15:38:31.320788Z", "end_time": "2017-12-22T15:38:31.654182Z"}}
 test_net = densenet(3, 10)
 test_x = Variable(torch.zeros(1, 3, 96, 96))
 test_y = test_net(test_x)
 print('output: {}'.format(test_y.shape))
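 # To see why the classifier above takes `channels` inputs, the channel count and spatial size can be traced with plain arithmetic. This is a sketch that mirrors the `densenet` class with its default arguments and a 96 x 96 input; the helper name is hypothetical, and the integer halving is only exact for the even sizes that occur here.

```python
def densenet_trace(input_size=96, growth_rate=32, block_layers=(6, 12, 24, 16)):
    """Follow (channels, spatial size) through the densenet class defined above."""
    size = input_size // 2   # 7 x 7 conv, stride 2, padding 3
    size = size // 2         # 3 x 3 max pool, stride 2, padding 1
    channels = 64
    trace = [(channels, size)]
    for i, layers in enumerate(block_layers):
        channels += layers * growth_rate   # dense block: concatenation grows channels
        if i != len(block_layers) - 1:
            channels //= 2                 # transition: 1 x 1 conv halves channels
            size //= 2                     # transition: 2 x 2 average pool halves size
        trace.append((channels, size))
    return trace

print(densenet_trace())
# [(64, 24), (128, 12), (256, 6), (512, 3), (1024, 3)]
```

 # The final 3 x 3 feature map with 1024 channels is reduced to 1 x 1 by `nn.AvgPool2d(3)`, so the linear classifier receives 1024 features.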
 # + {"ExecuteTime": {"start_time": "2017-12-22T15:38:31.656356Z", "end_time": "2017-12-22T15:38:32.894729Z"}}
 from utils import train
 def data_tf(x):
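    x = x.resize((96, 96), 2) # upscale the image to 96 x 96 (2 selects PIL bilinear resampling)
    x = np.array(x, dtype='float32') / 255
    x = (x - 0.5) / 0.5 # normalization; this trick is covered later
    x = x.transpose((2, 0, 1)) # move the channel to the first dimension, as PyTorch expects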
    x = torch.from_numpy(x)
    return x
 train_set = CIFAR10('../../data', train=True, transform=data_tf)
 train_data = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
 test_set = CIFAR10('../../data', train=False, transform=data_tf)
 test_data = torch.utils.data.DataLoader(test_set, batch_size=128, shuffle=False)
 net = densenet(3, 10)
 optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
 criterion = nn.CrossEntropyLoss()
 # + {"ExecuteTime": {"start_time": "2017-12-22T15:38:32.896735Z", "end_time": "2017-12-22T16:15:38.168095Z"}}
 train(net, train_data, test_data, 20, optimizer, criterion)
 # -
 # DenseNet replaces residual connections with feature concatenation, giving the network denser connectivity
--- a/6_pytorch/2_CNN/googlenet.py
+++ b/6_pytorch/2_CNN/googlenet.py
@@ -1,206 +0,0 @@
 # -*- coding: utf-8 -*-
 # ---
 # jupyter:
 #   jupytext_format_version: '1.2'
 #   kernelspec:
 #     display_name: Python 3
 #     language: python
 #     name: python3
 #   language_info:
 #     codemirror_mode:
 #       name: ipython
 #       version: 3
 #     file_extension: .py
 #     mimetype: text/x-python
 #     name: python
 #     nbconvert_exporter: python
 #     pygments_lexer: ipython3
 #     version: 3.5.2
 # ---
 # # GoogLeNet
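 # VGG, covered earlier, was the runner-up of the 2014 ImageNet competition. So who was the winner? It was GoogLeNet, which we cover next. This architecture, proposed by researchers at Google, had a huge impact at the time because its structure was unlike anything before it: it overturned the habit of building convolutional networks as a plain sequential stack and used a very effective inception module. The result was a network deeper than VGG but with far fewer parameters, because it dropped the final fully connected layers, and it was also very computationally efficient.
 #
 # ![](https://ws2.sinaimg.cn/large/006tNc79ly1fmprhdocouj30qb08vac3.jpg)
 #
 # This is a diagram of the GoogLeNet network. Next we introduce its key innovation, the inception module.
 # ## The Inception module
 # In the network above we can see many layers made of four parallel convolution paths; these four-path layers are inception modules, visualized below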
 #
 # ![](https://ws4.sinaimg.cn/large/006tNc79gy1fmprivb2hxj30dn09dwef.jpg)
 #
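 # The four parallel paths of an inception module are:
 # 1. A 1 x 1 convolution, which extracts features with a small receptive field
 # 2. A 1 x 1 convolution followed by a 3 x 3 convolution: the 1 x 1 convolution reduces the number of input channels, cutting parameters and computation, and the 3 x 3 convolution then covers a larger receptive field
 # 3. A 1 x 1 convolution followed by a 5 x 5 convolution, which works the same way as the second path
 # 4. A 3 x 3 max pooling followed by a 1 x 1 convolution: the max pooling rearranges the input features, and the 1 x 1 convolution extracts features
 #
 # Finally, the features from the four parallel paths are concatenated along the channel dimension. Let's implement it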
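 # To make the saving from the 1 x 1 reduction concrete, here is a small weight count for the 5 x 5 path, comparing a direct 5 x 5 convolution with a 1 x 1 reduction followed by a 5 x 5 convolution. The channel counts (192 in, 16 after reduction, 32 out) are illustrative, and biases and batch-norm parameters are ignored.

```python
def conv_weights(in_channel, out_channel, kernel):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return in_channel * out_channel * kernel * kernel

in_channel, reduce_channel, out_channel = 192, 16, 32  # illustrative channel counts

direct = conv_weights(in_channel, out_channel, 5)
reduced = conv_weights(in_channel, reduce_channel, 1) + conv_weights(reduce_channel, out_channel, 5)

print(direct)   # 153600
print(reduced)  # 15872
```

 # With these numbers the reduced path needs roughly a tenth of the weights of the direct convolution.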
 # + {"ExecuteTime": {"end_time": "2017-12-22T12:51:05.427292Z", "start_time": "2017-12-22T12:51:04.924747Z"}}
 import sys
 sys.path.append('..')
 import numpy as np
 import torch
 from torch import nn
 from torch.autograd import Variable
 from torchvision.datasets import CIFAR10
 # + {"ExecuteTime": {"end_time": "2017-12-22T12:51:08.890890Z", "start_time": "2017-12-22T12:51:08.876313Z"}}
 # Define a basic layer: a convolution followed by batchnorm and a relu activation
 def conv_relu(in_channel, out_channel, kernel, stride=1, padding=0):
    layer = nn.Sequential(
        nn.Conv2d(in_channel, out_channel, kernel, stride, padding),
        nn.BatchNorm2d(out_channel, eps=1e-3),
        nn.ReLU(True)
    )
    return layer
 # + {"ExecuteTime": {"end_time": "2017-12-22T12:51:09.671474Z", "start_time": "2017-12-22T12:51:09.587337Z"}}
 class inception(nn.Module):
    def __init__(self, in_channel, out1_1, out2_1, out2_3, out3_1, out3_5, out4_1):
        super(inception, self).__init__()
        # Path 1
        self.branch1x1 = conv_relu(in_channel, out1_1, 1)
        # Path 2
        self.branch3x3 = nn.Sequential(
            conv_relu(in_channel, out2_1, 1),
            conv_relu(out2_1, out2_3, 3, padding=1)
        )
        # Path 3
        self.branch5x5 = nn.Sequential(
            conv_relu(in_channel, out3_1, 1),
            conv_relu(out3_1, out3_5, 5, padding=2)
        )
        # Path 4
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            conv_relu(in_channel, out4_1, 1)
        )
    def forward(self, x):
        f1 = self.branch1x1(x)
        f2 = self.branch3x3(x)
        f3 = self.branch5x5(x)
        f4 = self.branch_pool(x)
        output = torch.cat((f1, f2, f3, f4), dim=1)
        return output
 # + {"ExecuteTime": {"end_time": "2017-12-22T12:51:10.948630Z", "start_time": "2017-12-22T12:51:10.757903Z"}}
 test_net = inception(3, 64, 48, 64, 64, 96, 32)
 test_x = Variable(torch.zeros(1, 3, 96, 96))
 print('input shape: {} x {} x {}'.format(test_x.shape[1], test_x.shape[2], test_x.shape[3]))
 test_y = test_net(test_x)
 print('output shape: {} x {} x {}'.format(test_y.shape[1], test_y.shape[2], test_y.shape[3]))
 # -
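 # After the inception module, the spatial size of the input is unchanged and the number of channels has increased
 # Next we define GoogLeNet, which can be seen as many inception modules chained together. Note that the original paper uses multiple outputs (auxiliary classifiers) to address vanishing gradients; here we define only a simplified GoogLeNet with a single output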
 # + {"ExecuteTime": {"end_time": "2017-12-22T12:51:13.149380Z", "start_time": "2017-12-22T12:51:12.934110Z"}}
 class googlenet(nn.Module):
    def __init__(self, in_channel, num_classes, verbose=False):
        super(googlenet, self).__init__()
        self.verbose = verbose
        self.block1 = nn.Sequential(
            conv_relu(in_channel, out_channel=64, kernel=7, stride=2, padding=3),
            nn.MaxPool2d(3, 2)
        )
        self.block2 = nn.Sequential(
            conv_relu(64, 64, kernel=1),
            conv_relu(64, 192, kernel=3, padding=1),
            nn.MaxPool2d(3, 2)
        )
        self.block3 = nn.Sequential(
            inception(192, 64, 96, 128, 16, 32, 32),
            inception(256, 128, 128, 192, 32, 96, 64),
            nn.MaxPool2d(3, 2)
        )
        self.block4 = nn.Sequential(
            inception(480, 192, 96, 208, 16, 48, 64),
            inception(512, 160, 112, 224, 24, 64, 64),
            inception(512, 128, 128, 256, 24, 64, 64),
            inception(512, 112, 144, 288, 32, 64, 64),
            inception(528, 256, 160, 320, 32, 128, 128),
            nn.MaxPool2d(3, 2)
        )
        self.block5 = nn.Sequential(
            inception(832, 256, 160, 320, 32, 128, 128),
            inception(832, 384, 192, 384, 48, 128, 128), # 3 x 3 reduce is 192 in the paper's inception (5b)
            nn.AvgPool2d(2)
        )
        self.classifier = nn.Linear(1024, num_classes)
    def forward(self, x):
        x = self.block1(x)
        if self.verbose:
            print('block 1 output: {}'.format(x.shape))
        x = self.block2(x)
        if self.verbose:
            print('block 2 output: {}'.format(x.shape))
        x = self.block3(x)
        if self.verbose:
            print('block 3 output: {}'.format(x.shape))
        x = self.block4(x)
        if self.verbose:
            print('block 4 output: {}'.format(x.shape))
        x = self.block5(x)
        if self.verbose:
            print('block 5 output: {}'.format(x.shape))
        x = x.view(x.shape[0], -1)
        x = self.classifier(x)
        return x
 # + {"ExecuteTime": {"end_time": "2017-12-22T12:51:13.614936Z", "start_time": "2017-12-22T12:51:13.428383Z"}}
 test_net = googlenet(3, 10, True)
 test_x = Variable(torch.zeros(1, 3, 96, 96))
 test_y = test_net(test_x)
 print('output: {}'.format(test_y.shape))
 # -
 # We can see that the spatial size keeps shrinking while the number of channels keeps growing
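 # The spatial sizes printed in verbose mode can be checked with the standard output-size formula `(n + 2 * padding - kernel) // stride + 1`. A sketch for the 96 x 96 input used above; only the layers that change the size are listed, since the 1 x 1 convolutions and the padded 3 x 3 and 5 x 5 convolutions keep it fixed. The helper name is hypothetical.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Output length of a convolution or pooling window along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1

size = 96
sizes = []
size = out_size(size, 7, 2, 3)   # block1: 7 x 7 conv, stride 2, padding 3
size = out_size(size, 3, 2)      # block1: 3 x 3 max pool, stride 2
sizes.append(size)
for _ in range(3):               # blocks 2-4 each end with a 3 x 3 max pool, stride 2
    size = out_size(size, 3, 2)
    sizes.append(size)
size = out_size(size, 2, 2)      # block5: 2 x 2 average pool
sizes.append(size)

print(sizes)  # [23, 11, 5, 2, 1]
```

 # The last block therefore outputs a 1 x 1 map with 1024 channels, which matches `nn.Linear(1024, num_classes)`.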
 # + {"ExecuteTime": {"end_time": "2017-12-22T12:51:16.387778Z", "start_time": "2017-12-22T12:51:15.121350Z"}}
 from utils import train
 def data_tf(x):
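    x = x.resize((96, 96), 2) # upscale the image to 96 x 96 (2 selects PIL bilinear resampling)
    x = np.array(x, dtype='float32') / 255
    x = (x - 0.5) / 0.5 # normalization; this trick is covered later
    x = x.transpose((2, 0, 1)) # move the channel to the first dimension, as PyTorch expects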
    x = torch.from_numpy(x)
    return x
 train_set = CIFAR10('./data', train=True, transform=data_tf)
 train_data = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
 test_set = CIFAR10('./data', train=False, transform=data_tf)
 test_data = torch.utils.data.DataLoader(test_set, batch_size=128, shuffle=False)
 net = googlenet(3, 10)
 optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
 criterion = nn.CrossEntropyLoss()
 # + {"ExecuteTime": {"end_time": "2017-12-22T13:17:25.310685Z", "start_time": "2017-12-22T12:51:16.389607Z"}}
 train(net, train_data, test_data, 20, optimizer, criterion)
 # -
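 # GoogLeNet adds the more structured Inception block, which lets us use more channels and more layers while keeping the amount of computation under control.
 #
 # **Exercise: GoogLeNet has many later versions. Read the papers, see what differs, and try implementing them:  
 # v1: the earliest version  
 # v2: adds batch normalization to speed up training  
 # v3: adjusts the inception module  
 # v4: adds residual connections, based on ResNet  **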
--- a/6_pytorch/2_CNN/lr-decay.py
+++ b/6_pytorch/2_CNN/lr-decay.py
@@ -1,184 +0,0 @@
 # -*- coding: utf-8 -*-
 # ---
 # jupyter:
 #   jupytext_format_version: '1.2'
 #   kernelspec:
 #     display_name: Python 3
 #     language: python
 #     name: python3
 #   language_info:
 #     codemirror_mode:
 #       name: ipython
 #       version: 3
 #     file_extension: .py
 #     mimetype: text/x-python
 #     name: python
 #     nbconvert_exporter: python
 #     pygments_lexer: ipython3
 #     version: 3.5.2
 # ---
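 # # Learning rate decay
 # For optimization methods based on first-order gradients, the updates are relatively large at the beginning, so the initial learning rate can be set fairly high. But once the training loss has dropped to a certain level, keeping such a large learning rate makes the loss oscillate back and forth, for example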
 #
 # ![](https://ws4.sinaimg.cn/large/006tNc79ly1fmrvdlncomj30bf0aywet.jpg)
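 # At this point the learning rate must be decayed so that the loss can decrease fully. Learning rate decay resolves this tension: it means steadily reducing the learning rate as training proceeds.
 #
 # Learning rate decay is very convenient in PyTorch with `torch.optim.lr_scheduler`; see the [documentation](http://pytorch.org/docs/0.3.0/optim.html#how-to-adjust-learning-rate) for more information
 #
 # However, I recommend the approach below for learning rate decay because it is more direct. Let's illustrate it with an example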
 # + {"ExecuteTime": {"start_time": "2017-12-24T08:45:33.834665Z", "end_time": "2017-12-24T08:45:34.293625Z"}}
 import sys
 sys.path.append('..')
 import numpy as np
 import torch
 from torch import nn
 import torch.nn.functional as F
 from torch.autograd import Variable
 from torchvision.datasets import CIFAR10
 from utils import resnet
 from torchvision import transforms as tfs
 from datetime import datetime
 # + {"ExecuteTime": {"start_time": "2017-12-24T08:45:35.063610Z", "end_time": "2017-12-24T08:45:35.195093Z"}}
 net = resnet(3, 10)
 optimizer = torch.optim.SGD(net.parameters(), lr=0.01, weight_decay=1e-4)
 # -
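 # Here we have defined the model and the optimizer. `optimizer.param_groups` gives all parameter groups and their attributes. What is a parameter group? We can split a model's parameters into several groups and give each group its own learning rate. This is a more advanced use; unless you configure it otherwise, there is only one parameter group
 #
 # Each parameter group is a dictionary with many entries, such as the learning rate and the weight decay, which we can access as follows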
 # + {"ExecuteTime": {"start_time": "2017-12-24T08:22:59.187178Z", "end_time": "2017-12-24T08:22:59.192905Z"}}
 print('learning rate: {}'.format(optimizer.param_groups[0]['lr']))
 print('weight decay: {}'.format(optimizer.param_groups[0]['weight_decay']))
 # -
 # So we can change the learning rate during training simply by modifying this entry
 # + {"ExecuteTime": {"start_time": "2017-12-24T08:25:04.762612Z", "end_time": "2017-12-24T08:25:04.767090Z"}}
 optimizer.param_groups[0]['lr'] = 1e-5
 # -
 # To handle the case of multiple parameter groups, we can use a loop
 # + {"ExecuteTime": {"start_time": "2017-12-24T08:26:05.136955Z", "end_time": "2017-12-24T08:26:05.142183Z"}}
 for param_group in optimizer.param_groups:
    param_group['lr'] = 1e-1
 # -
 # That is the whole method: we can change the learning rate at any point in training
 #
 # Now let's look concretely at the benefit of learning rate decay
 # + {"ExecuteTime": {"start_time": "2017-12-24T08:45:40.803993Z", "end_time": "2017-12-24T08:45:40.809459Z"}}
 def set_learning_rate(optimizer, lr):
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
 # + {"ExecuteTime": {"start_time": "2017-12-24T08:45:46.738002Z", "end_time": "2017-12-24T08:45:48.006789Z"}}
 # With data augmentation
 def train_tf(x):
    im_aug = tfs.Compose([
        tfs.Resize(120),
        tfs.RandomHorizontalFlip(),
        tfs.RandomCrop(96),
        tfs.ColorJitter(brightness=0.5, contrast=0.5, hue=0.5),
        tfs.ToTensor(),
        tfs.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
    ])
    x = im_aug(x)
    return x
 def test_tf(x):
    im_aug = tfs.Compose([
        tfs.Resize(96),
        tfs.ToTensor(),
        tfs.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
    ])
    x = im_aug(x)
    return x
 train_set = CIFAR10('./data', train=True, transform=train_tf)
 train_data = torch.utils.data.DataLoader(train_set, batch_size=256, shuffle=True, num_workers=4)
 valid_set = CIFAR10('./data', train=False, transform=test_tf)
 valid_data = torch.utils.data.DataLoader(valid_set, batch_size=256, shuffle=False, num_workers=4)
 net = resnet(3, 10)
 optimizer = torch.optim.SGD(net.parameters(), lr=0.1, weight_decay=1e-4)
 criterion = nn.CrossEntropyLoss()
 # + {"ExecuteTime": {"start_time": "2017-12-24T08:45:48.556187Z", "end_time": "2017-12-24T08:59:49.656832Z"}}
 train_losses = []
 valid_losses = []
 if torch.cuda.is_available():
    net = net.cuda()
 prev_time = datetime.now()
 for epoch in range(30):
    if epoch == 20:
        set_learning_rate(optimizer, 0.01) # at epoch 20, change the learning rate to 0.01
    train_loss = 0
    net = net.train()
    for im, label in train_data:
        if torch.cuda.is_available():
            im = Variable(im.cuda())  # (bs, 3, h, w)
            label = Variable(label.cuda())  # (bs,)
        else:
            im = Variable(im)
            label = Variable(label)
        # forward
        output = net(im)
        loss = criterion(output, label)
        # backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_loss += loss.data[0]
    cur_time = datetime.now()
    h, remainder = divmod((cur_time - prev_time).seconds, 3600)
    m, s = divmod(remainder, 60)
    time_str = "Time %02d:%02d:%02d" % (h, m, s)
    valid_loss = 0
    valid_acc = 0
    net = net.eval()
    for im, label in valid_data:
        if torch.cuda.is_available():
            im = Variable(im.cuda(), volatile=True)
            label = Variable(label.cuda(), volatile=True)
        else:
            im = Variable(im, volatile=True)
            label = Variable(label, volatile=True)
        output = net(im)
        loss = criterion(output, label)
        valid_loss += loss.data[0]
    epoch_str = (
        "Epoch %d. Train Loss: %f, Valid Loss: %f, "
        % (epoch, train_loss / len(train_data), valid_loss / len(valid_data)))
    prev_time = cur_time
    train_losses.append(train_loss / len(train_data))
    valid_losses.append(valid_loss / len(valid_data))
    print(epoch_str + time_str)
 # -
 # Next we plot the loss curves
 # + {"ExecuteTime": {"start_time": "2017-12-24T09:01:37.439613Z", "end_time": "2017-12-24T09:01:37.676274Z"}}
 import matplotlib.pyplot as plt
 # %matplotlib inline
 # + {"ExecuteTime": {"start_time": "2017-12-24T09:02:37.244995Z", "end_time": "2017-12-24T09:02:37.432883Z"}}
 plt.plot(train_losses, label='train')
 plt.plot(valid_losses, label='valid')
 plt.xlabel('epoch')
 plt.legend(loc='best')
 # -
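 # Here we trained for only 30 epochs and decayed the learning rate at epoch 20. In the loss curves, both the train loss and the valid loss drop sharply at epoch 20.
 #
 # This is only an illustration. In practice, the network should be trained sufficiently before the learning rate is decayed, for example for 80 or 100 epochs, so that the decay then gives a better result; sometimes the learning rate even needs to be decayed several times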
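 # When several decays are needed, the schedule can be written as a small pure function and fed to `set_learning_rate` at the start of each epoch. This is a sketch under assumptions: the name `step_decay`, the factor `gamma=0.1`, and the milestones 80 and 120 are illustrative choices, not values from the experiment above.

```python
def step_decay(base_lr, epoch, milestones, gamma=0.1):
    """Learning rate after multiplying by `gamma` at every milestone epoch already reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed

# Same schedule as the loop above: 0.1 until epoch 20, then 0.01.
schedule = [step_decay(0.1, epoch, milestones=[20]) for epoch in range(30)]
print(schedule[19], schedule[20])

# A hypothetical longer run with two decays.
print([round(step_decay(0.1, e, [80, 120]), 6) for e in (0, 80, 120)])  # [0.1, 0.01, 0.001]
```

 # Inside the training loop this would be used as `set_learning_rate(optimizer, step_decay(0.1, epoch, [80, 120]))`.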
--- a/6_pytorch/2_CNN/regularization.py
+++ b/6_pytorch/2_CNN/regularization.py
@@ -1,85 +0,0 @@
 # -*- coding: utf-8 -*-
 # ---
 # jupyter:
 #   jupytext_format_version: '1.2'
 #   kernelspec:
 #     display_name: Python 3
 #     language: python
 #     name: python3
 #   language_info:
 #     codemirror_mode:
 #       name: ipython
 #       version: 3
 #     file_extension: .py
 #     mimetype: text/x-python
 #     name: python
 #     nbconvert_exporter: python
 #     pygments_lexer: ipython3
 #     version: 3.5.2
 # ---
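 # # Regularization
 # We previously covered data augmentation and dropout. In practice, modern networks often do not use dropout and instead rely on another technique, weight regularization.
 #
 # Regularization is a method from machine learning. It comes in L1 and L2 forms, with L2 currently used more often. Adding regularization amounts to adding a term to the loss function, for example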
 #
 # $$
 # f = loss + \lambda \sum_{p \in params} ||p||_2^2
 # $$
 #
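 # That is, the squared L2 norm of the parameters is added to the loss as a regularizer. When training the network we then minimize not only the loss but also the L2 norm of the parameters; in other words, we constrain the parameters so that they do not grow too large.
 # If we differentiate the new objective f for gradient descent, we get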
 #
 # $$
 # \frac{\partial f}{\partial p_j} = \frac{\partial loss}{\partial p_j} + 2 \lambda p_j
 # $$
 #
 # and the parameter update becomes
 #
 # $$
 # p_j \rightarrow p_j - \eta (\frac{\partial loss}{\partial p_j} + 2 \lambda p_j) = p_j - \eta \frac{\partial loss}{\partial p_j} - 2 \eta \lambda p_j 
 # $$
 #
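 # Here $p_j - \eta \frac{\partial loss}{\partial p_j}$ is the same update as without the regularizer, and the trailing $2\eta \lambda p_j$ is the effect of the regularizer: each step additionally shrinks the parameter toward zero in proportion to its value, which is why this is also called weight decay. PyTorch adds regularization in exactly this way. For example, to use a regularizer, that is weight decay, with stochastic gradient descent, `torch.optim.SGD(net.parameters(), lr=0.1, weight_decay=1e-4)` is all that is needed. PyTorch adds `weight_decay * p_j` to the gradient, so the `weight_decay` coefficient corresponds to $2\lambda$ in the formula above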
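 # A quick numeric check of the update rule above for a single scalar parameter. The numbers are made up, and the helper names are hypothetical. With the objective $loss + \lambda p^2$ the penalty contributes $2 \lambda p$ to the gradient, so a decay coefficient of $2\lambda$ reproduces the same step.

```python
import math

def step_with_l2_penalty(p, grad_loss, lr, lam):
    """Gradient step on f = loss + lam * p**2 for a single scalar parameter."""
    return p - lr * (grad_loss + 2.0 * lam * p)

def step_with_weight_decay(p, grad_loss, lr, weight_decay):
    """Shrink the parameter first, then take the plain gradient step."""
    return (1.0 - lr * weight_decay) * p - lr * grad_loss

p, grad_loss, lr, lam = 2.0, 0.5, 0.1, 1e-2   # made-up numbers
a = step_with_l2_penalty(p, grad_loss, lr, lam)
b = step_with_weight_decay(p, grad_loss, lr, weight_decay=2.0 * lam)
print(a, b)
assert math.isclose(a, b)
```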
 #
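 # Note that the size of the regularization coefficient matters a great deal. If it is too large, it strongly suppresses the parameters and leads to underfitting; if it is too small, the regularizer contributes almost nothing. Choosing a suitable weight decay coefficient therefore takes some experimentation on the problem at hand; `1e-4` or `1e-3` is a reasonable first try
 #
 # Next we add a regularizer when training on CIFAR-10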
 # + {"ExecuteTime": {"start_time": "2017-12-24T08:02:11.383170Z", "end_time": "2017-12-24T08:02:11.903459Z"}}
 import sys
 sys.path.append('..')
 import numpy as np
 import torch
 from torch import nn
 import torch.nn.functional as F
 from torch.autograd import Variable
 from torchvision.datasets import CIFAR10
 from utils import train, resnet
 from torchvision import transforms as tfs
 # + {"ExecuteTime": {"start_time": "2017-12-24T08:02:11.905617Z", "end_time": "2017-12-24T08:02:13.120502Z"}}
 def data_tf(x):
    im_aug = tfs.Compose([
        tfs.Resize(96),
        tfs.ToTensor(),
        tfs.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
    ])
    x = im_aug(x)
    return x
 train_set = CIFAR10('./data', train=True, transform=data_tf)
 train_data = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)
 test_set = CIFAR10('./data', train=False, transform=data_tf)
 test_data = torch.utils.data.DataLoader(test_set, batch_size=128, shuffle=False, num_workers=4)
 net = resnet(3, 10)
 optimizer = torch.optim.SGD(net.parameters(), lr=0.01, weight_decay=1e-4) # add the regularization term (weight decay)
 criterion = nn.CrossEntropyLoss()
 # + {"ExecuteTime": {"start_time": "2017-12-24T08:02:13.122785Z", "end_time": "2017-12-24T08:11:36.106177Z"}}
 from utils import train
 train(net, train_data, test_data, 20, optimizer, criterion)
--- a/6_pytorch/2_CNN/resnet.py
+++ b/6_pytorch/2_CNN/resnet.py
@@ -1,191 +0,0 @@
 # -*- coding: utf-8 -*-
 # ---
 # jupyter:
 #   jupytext_format_version: '1.2'
 #   kernelspec:
 #     display_name: Python 3
 #     language: python
 #     name: python3
 #   language_info:
 #     codemirror_mode:
 #       name: ipython
 #       version: 3
 #     file_extension: .py
 #     mimetype: text/x-python
 #     name: python
 #     nbconvert_exporter: python
 #     pygments_lexer: ipython3
 #     version: 3.5.2
 # ---
 # # ResNet
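 # While everyone was still marveling at GoogLeNet's inception structure, researchers at Microsoft Research Asia were already designing ResNet, a deeper network with a simpler structure, and with it they won the 2015 ImageNet competition decisively.
 #
 # ResNet effectively addresses the difficulty of training deep neural networks and can train convolutional networks with up to 1000 layers. Deep networks are hard to train because of the vanishing gradient problem: the farther a layer is from the loss function, the smaller its gradient during backpropagation and the harder it is to update, and this gets worse as the number of layers grows. Two approaches were commonly used to address this before:
 #
 # 1. Layer-wise training: train the shallower layers first and then keep adding layers. This does not work particularly well and is cumbersome
 #
 # 2. Using wider layers, or more output channels, instead of making the network deeper. This kind of structure often does not give good results either
 #
 # ResNet solves the vanishing gradient problem in backpropagation by introducing cross-layer (skip) connections.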
 #
 # ![](https://ws1.sinaimg.cn/large/006tNc79ly1fmptq2snv9j30j808t74a.jpg)
 # 这就普通的网络连接跟跨层残差连接的对比图，使用普通的连接，上层的梯度必须要一层一层传回来，而是用残差连接，相当于中间有了一条更短的路，梯度能够从这条更短的路传回来，避免了梯度过小的情况。
 #
 # 假设某层的输入是 x，期望输出是 H(x)， 如果我们直接把输入 x 传到输出作为初始结果，这就是一个更浅层的网络，更容易训练，而这个网络没有学会的部分，我们可以使用更深的网络 F(x) 去训练它，使得训练更加容易，最后希望拟合的结果就是 F(x) = H(x) - x，这就是一个残差的结构
 #
 # 残差网络的结构就是上面这种残差块的堆叠，下面让我们来实现一个 residual block
 # + {"ExecuteTime": {"end_time": "2017-12-22T12:56:06.772059Z", "start_time": "2017-12-22T12:56:06.766027Z"}}
 import sys
 sys.path.append('..')
 import numpy as np
 import torch
 from torch import nn
 import torch.nn.functional as F
 from torch.autograd import Variable
 from torchvision.datasets import CIFAR10
 # + {"ExecuteTime": {"end_time": "2017-12-22T12:47:49.222432Z", "start_time": "2017-12-22T12:47:49.217940Z"}}
 def conv3x3(in_channel, out_channel, stride=1):
    return nn.Conv2d(in_channel, out_channel, 3, stride=stride, padding=1, bias=False)
 # + {"ExecuteTime": {"end_time": "2017-12-22T13:14:02.429145Z", "start_time": "2017-12-22T13:14:02.383322Z"}}
 class residual_block(nn.Module):
    def __init__(self, in_channel, out_channel, same_shape=True):
        super(residual_block, self).__init__()
        self.same_shape = same_shape
        stride=1 if self.same_shape else 2
        self.conv1 = conv3x3(in_channel, out_channel, stride=stride)
        self.bn1 = nn.BatchNorm2d(out_channel)
        self.conv2 = conv3x3(out_channel, out_channel)
        self.bn2 = nn.BatchNorm2d(out_channel)
        if not self.same_shape:
            self.conv3 = nn.Conv2d(in_channel, out_channel, 1, stride=stride)
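    def forward(self, x):
        out = self.conv1(x)
        out = F.relu(self.bn1(out), True)
        out = self.conv2(out)
        # note: the original ResNet paper applies this ReLU only after the addition
        out = F.relu(self.bn2(out), True)
        if not self.same_shape:
            x = self.conv3(x) # 1x1 conv so the shortcut matches the new shape
        return F.relu(x+out, True) # add the shortcut, then activate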
 # -
 # Let's test the input and output of a residual block
 # + {"ExecuteTime": {"end_time": "2017-12-22T13:14:05.793185Z", "start_time": "2017-12-22T13:14:05.763382Z"}}
 # input and output have the same shape
 test_net = residual_block(32, 32)
 test_x = Variable(torch.zeros(1, 32, 96, 96))
 print('input: {}'.format(test_x.shape))
 test_y = test_net(test_x)
 print('output: {}'.format(test_y.shape))
 # + {"ExecuteTime": {"end_time": "2017-12-22T13:14:11.929120Z", "start_time": "2017-12-22T13:14:11.914604Z"}}
 # input and output have different shapes
 test_net = residual_block(3, 32, False)
 test_x = Variable(torch.zeros(1, 3, 96, 96))
 print('input: {}'.format(test_x.shape))
 test_y = test_net(test_x)
 print('output: {}'.format(test_y.shape))
 # -
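 # The shapes printed by the two tests above can be checked by hand. The sketch below uses the standard output-size formula for a convolution (floor mode); `conv_out` is a hypothetical helper written for this illustration, not part of the repository.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


size = 96
# same_shape=True: both 3x3 convolutions use stride 1 and padding 1
same = conv_out(conv_out(size, 3, 1, 1), 3, 1, 1)
# same_shape=False: conv1 uses stride 2, conv2 keeps the size
main_path = conv_out(conv_out(size, 3, 2, 1), 3, 1, 1)
# the 1x1 shortcut convolution (conv3) also uses stride 2, without padding
shortcut = conv_out(size, 1, 2, 0)
print(same, main_path, shortcut)
```

 # The main path and the shortcut end up with the same spatial size, which is what allows `x + out` in `forward`.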
 # Next we implement a ResNet, which is simply a stack of residual blocks
 # + {"ExecuteTime": {"end_time": "2017-12-22T13:27:46.099404Z", "start_time": "2017-12-22T13:27:45.986235Z"}}
 class resnet(nn.Module):
    def __init__(self, in_channel, num_classes, verbose=False):
        super(resnet, self).__init__()
        self.verbose = verbose
        self.block1 = nn.Conv2d(in_channel, 64, 7, 2)
        self.block2 = nn.Sequential(
            nn.MaxPool2d(3, 2),
            residual_block(64, 64),
            residual_block(64, 64)
        )
        self.block3 = nn.Sequential(
            residual_block(64, 128, False),
            residual_block(128, 128)
        )
        self.block4 = nn.Sequential(
            residual_block(128, 256, False),
            residual_block(256, 256)
        )
        self.block5 = nn.Sequential(
            residual_block(256, 512, False),
            residual_block(512, 512),
            nn.AvgPool2d(3)
        )
        self.classifier = nn.Linear(512, num_classes)
    def forward(self, x):
        x = self.block1(x)
        if self.verbose:
            print('block 1 output: {}'.format(x.shape))
        x = self.block2(x)
        if self.verbose:
            print('block 2 output: {}'.format(x.shape))
        x = self.block3(x)
        if self.verbose:
            print('block 3 output: {}'.format(x.shape))
        x = self.block4(x)
        if self.verbose:
            print('block 4 output: {}'.format(x.shape))
        x = self.block5(x)
        if self.verbose:
            print('block 5 output: {}'.format(x.shape))
        x = x.view(x.shape[0], -1)
        x = self.classifier(x)
        return x
 # -
 # Print the output size after each block
 # + {"ExecuteTime": {"end_time": "2017-12-22T13:28:00.597030Z", "start_time": "2017-12-22T13:28:00.417746Z"}}
 test_net = resnet(3, 10, True)
 test_x = Variable(torch.zeros(1, 3, 96, 96))
 test_y = test_net(test_x)
 print('output: {}'.format(test_y.shape))
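 # As a cross-check of the verbose output, here is a plain-Python trace of the spatial size through the five blocks, using the kernel sizes, strides and paddings from the class above. `conv_out` and `resnet_sizes` are hypothetical helpers for this sketch only.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


def resnet_sizes(size):
    """Spatial size after each block of the resnet class defined above."""
    sizes = {}
    size = conv_out(size, 7, 2)          # block1: 7x7 conv, stride 2, no padding
    sizes['block1'] = size
    size = conv_out(size, 3, 2)          # block2: 3x3 max pool, stride 2
    sizes['block2'] = size
    for name in ('block3', 'block4'):
        size = conv_out(size, 3, 2, 1)   # first residual block halves the size
        sizes[name] = size
    size = conv_out(size, 3, 2, 1)       # block5: downsampling residual block
    size = conv_out(size, 3, 3)          # 3x3 average pool, stride equal to kernel
    sizes['block5'] = size
    return sizes


print(resnet_sizes(96))
print(resnet_sizes(32))
```

 # For a 96 x 96 input the last block yields 1 x 1, so the flattened feature has exactly the 512 values that `nn.Linear(512, num_classes)` expects. For a raw 32 x 32 CIFAR image the computed size drops below 1 because the pooling window no longer fits, which is presumably why the images are upscaled to 96 x 96 first.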
 # + {"ExecuteTime": {"end_time": "2017-12-22T13:29:01.484172Z", "start_time": "2017-12-22T13:29:00.095952Z"}}
 from utils import train
 def data_tf(x):
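    x = x.resize((96, 96), 2) # upscale the image to 96 x 96 (2 selects bilinear resampling in PIL)
    x = np.array(x, dtype='float32') / 255
    x = (x - 0.5) / 0.5 # normalize to [-1, 1]; this trick is covered later
    x = x.transpose((2, 0, 1)) # move channels to the first dimension, the layout PyTorch expects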
    x = torch.from_numpy(x)
    return x
 train_set = CIFAR10('./data', train=True, transform=data_tf)
 train_data = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
 test_set = CIFAR10('./data', train=False, transform=data_tf)
 test_data = torch.utils.data.DataLoader(test_set, batch_size=128, shuffle=False)
 net = resnet(3, 10)
 optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
 criterion = nn.CrossEntropyLoss()
 # + {"ExecuteTime": {"end_time": "2017-12-22T13:45:00.783186Z", "start_time": "2017-12-22T13:29:09.214453Z"}}
 train(net, train_data, test_data, 20, optimizer, criterion)
 # -
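 # ResNet's skip connections make it possible to train very deep convolutional networks. It also uses a very simple convolutional layer configuration, which makes it easy to extend.
 #
 # **Exercises:  
 # 1. Try the bottleneck structure proposed in the paper.   
 # 2. Try changing the order conv -> bn -> relu to bn -> relu -> conv and see whether accuracy improves.**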
--- a/6_pytorch/2_CNN/vgg.py
+++ b/6_pytorch/2_CNN/vgg.py
@@ -1,155 +0,0 @@
 # -*- coding: utf-8 -*-
 # ---
 # jupyter:
 #   jupytext_format_version: '1.2'
 #   kernelspec:
 #     display_name: Python 3
 #     language: python
 #     name: python3
 #   language_info:
 #     codemirror_mode:
 #       name: ipython
 #       version: 3
 #     file_extension: .py
 #     mimetype: text/x-python
 #     name: python
 #     nbconvert_exporter: python
 #     pygments_lexer: ipython3
 #     version: 3.5.2
 # ---
 # # VGG
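 # Computer vision has long been the main arena for deep learning. Starting here we look at the convolutional architectures that became popular in recent years: networks grow deeper, have more parameters, and use more skip connections. We first introduce the CIFAR-10 dataset, which we use as the running example for each architecture.
 #
 # ## CIFAR 10
 # CIFAR-10 has 50000 training images and 10000 test images. All are color images of size 32 x 32 x 3, in 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. It is an important benchmark for network performance: a network that beats another on this dataset is usually regarded as the stronger of the two. At the time of writing, the best reported test accuracy was around 95%.
 #
 # ![](https://ws1.sinaimg.cn/large/006tNc79ly1fmpjxxq7wcj30db0ae7ag.jpg)
 #
 # Can you classify these images by eye?
 #
 # CIFAR-10 ships with the PyTorch ecosystem and is easy to use: just call `torchvision.datasets.CIFAR10`.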
 # ## VGGNet
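 # VGGNet was one of the first truly deep network architectures. In the 2014 ImageNet challenge (ILSVRC) it won the localization task and was runner-up in classification. Thanks to Python functions and loops, deep networks with a repeated structure are easy to build.
 #
 # The VGG structure is very simple: it just keeps stacking convolutional and pooling layers. Here is a simple illustration.
 #
 # ![](https://ws4.sinaimg.cn/large/006tNc79ly1fmpk5smtidj307n0dx3yv.jpg)
 #
 # VGG uses almost exclusively 3 x 3 kernels and 2 x 2 pooling. A stack of small kernels has the same receptive field as a single large kernel (two stacked 3 x 3 layers cover 5 x 5, three cover 7 x 7), while using fewer parameters and allowing a deeper structure.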
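 # The receptive-field and parameter claims can be checked with a little arithmetic. This is a minimal sketch that assumes stride-1 convolutions, the same channel count in and out, and no biases; both helper names are hypothetical.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of num_layers stacked stride-1 convolutions."""
    return num_layers * (kernel - 1) + 1


def conv_weights(kernel, channels, num_layers=1):
    """Weight count (biases ignored) with `channels` in and out per layer."""
    return num_layers * kernel * kernel * channels * channels


channels = 64
print(stacked_receptive_field(3))     # same field as a single 7x7 kernel
print(conv_weights(3, channels, 3))   # three stacked 3x3 layers
print(conv_weights(7, channels))      # one 7x7 layer
```

 # Under these assumptions three 3 x 3 layers need 27 C^2 weights against 49 C^2 for one 7 x 7 layer, and they add two extra nonlinearities along the way.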
 #
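 # The key building block of VGG is several 3 x 3 convolutional layers followed by one max pooling layer, and this block is repeated many times. Let's write it following that structure.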
 # + {"ExecuteTime": {"start_time": "2017-12-22T09:01:50.883050Z", "end_time": "2017-12-22T09:01:51.296457Z"}}
 import sys
 sys.path.append('..')
 import numpy as np
 import torch
 from torch import nn
 from torch.autograd import Variable
 from torchvision.datasets import CIFAR10
 # -
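 # We define a VGG block that takes three arguments: the number of convolutional layers, the number of input channels, and the number of output channels. The first convolution takes the block's input channels and produces the output channels; every following convolution takes and produces the output channel count.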
 # + {"ExecuteTime": {"start_time": "2017-12-22T09:01:51.298777Z", "end_time": "2017-12-22T09:01:51.312500Z"}}
 def vgg_block(num_convs, in_channels, out_channels):
    net = [nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1), nn.ReLU(True)] # first layer
    for i in range(num_convs-1): # remaining layers
        net.append(nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1))
        net.append(nn.ReLU(True))
    net.append(nn.MaxPool2d(2, 2)) # pooling layer
    return nn.Sequential(*net)
 # -
 # We can print the model to inspect its structure
 # + {"ExecuteTime": {"start_time": "2017-12-22T08:20:40.808853Z", "end_time": "2017-12-22T08:20:40.819497Z"}}
 block_demo = vgg_block(3, 64, 128)
 print(block_demo)
 # + {"ExecuteTime": {"start_time": "2017-12-22T07:52:02.381987Z", "end_time": "2017-12-22T07:52:04.632406Z"}}
 # define an input of shape (1, 64, 300, 300)
 input_demo = Variable(torch.zeros(1, 64, 300, 300))
 output_demo = block_demo(input_demo)
 print(output_demo.shape)
 # -
 # The output is (1, 128, 150, 150): after one VGG block the spatial size is halved and the number of channels becomes 128.
 #
 # Next we define a function that stacks these VGG blocks
 # + {"ExecuteTime": {"start_time": "2017-12-22T09:01:54.489255Z", "end_time": "2017-12-22T09:01:54.497712Z"}}
 def vgg_stack(num_convs, channels):
    net = []
    for n, c in zip(num_convs, channels):
        in_c = c[0]
        out_c = c[1]
        net.append(vgg_block(n, in_c, out_c))
    return nn.Sequential(*net)
 # -
 # As an example, we define a somewhat simpler VGG structure with 8 convolutional layers
 # + {"ExecuteTime": {"start_time": "2017-12-22T09:01:55.041923Z", "end_time": "2017-12-22T09:01:55.149378Z"}}
 vgg_net = vgg_stack((1, 1, 2, 2, 2), ((3, 64), (64, 128), (128, 256), (256, 512), (512, 512)))
 print(vgg_net)
 # -
 # The network contains 5 max pooling layers, so the image size is halved 5 times. We can verify this by feeding in a 256 x 256 image and looking at the result
 # + {"ExecuteTime": {"start_time": "2017-12-22T08:52:43.431478Z", "end_time": "2017-12-22T08:52:44.049650Z"}}
 test_x = Variable(torch.zeros(1, 3, 256, 256))
 test_y = vgg_net(test_x)
 print(test_y.shape)
 # -
 # The image side shrinks by a factor of $2^5$. Adding a few fully connected layers at the end then gives the classification output we want
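 # The same halving explains the input size of the first fully connected layer. A minimal sketch, assuming the five-block stack defined above (3 x 3 convolutions with padding 1 keep the size, each 2 x 2 max pool halves it); `vgg_output_size` is a hypothetical helper.

```python
def vgg_output_size(size, num_pools=5):
    """Spatial size after num_pools 2x2 max pools with stride 2."""
    for _ in range(num_pools):
        size = size // 2
    return size


for size in (256, 32):
    out = vgg_output_size(size)
    print(size, '->', out, 'flattened features:', 512 * out * out)
```

 # A 32 x 32 CIFAR-10 image therefore ends up as 512 x 1 x 1, which matches a first linear layer with 512 inputs; a 256 x 256 input would need 512 * 8 * 8 inputs instead.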
 # + {"ExecuteTime": {"start_time": "2017-12-22T09:01:57.306864Z", "end_time": "2017-12-22T09:01:57.323034Z"}}
 class vgg(nn.Module):
    def __init__(self):
        super(vgg, self).__init__()
        self.feature = vgg_net
        self.fc = nn.Sequential(
            nn.Linear(512, 100),
            nn.ReLU(True),
            nn.Linear(100, 10)
        )
    def forward(self, x):
        x = self.feature(x)
        x = x.view(x.shape[0], -1)
        x = self.fc(x)
        return x
 # -
 # Now we can train the model and see how it does on CIFAR-10
 # + {"ExecuteTime": {"start_time": "2017-12-22T09:01:58.709531Z", "end_time": "2017-12-22T09:01:59.921373Z"}}
 from utils import train
 def data_tf(x):
    x = np.array(x, dtype='float32') / 255
    x = (x - 0.5) / 0.5 # normalize to [-1, 1]; this trick is covered later
    x = x.transpose((2, 0, 1)) # move channels to the first dimension, the layout PyTorch expects
    x = torch.from_numpy(x)
    return x
 train_set = CIFAR10('./data', train=True, transform=data_tf)
 train_data = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
 test_set = CIFAR10('./data', train=False, transform=data_tf)
 test_data = torch.utils.data.DataLoader(test_set, batch_size=128, shuffle=False)
 net = vgg()
 optimizer = torch.optim.SGD(net.parameters(), lr=1e-1)
 criterion = nn.CrossEntropyLoss()
 # + {"ExecuteTime": {"start_time": "2017-12-22T09:01:59.924086Z", "end_time": "2017-12-22T09:12:46.868967Z"}}
 train(net, train_data, test_data, 20, optimizer, criterion)
 # -
 # After 20 epochs, VGG reaches about 76% test accuracy on CIFAR-10
--- a/6_pytorch/3_RNN/time-series/lstm-time-series.py
+++ b/6_pytorch/3_RNN/time-series/lstm-time-series.py
@@ -1,146 +0,0 @@
 # -*- coding: utf-8 -*-
 # ---
 # jupyter:
 #   jupytext_format_version: '1.2'
 #   kernelspec:
 #     display_name: Python 3
 #     language: python
 #     name: python3
 #   language_info:
 #     codemirror_mode:
 #       name: ipython
 #       version: 3
 #     file_extension: .py
 #     mimetype: text/x-python
 #     name: python
 #     nbconvert_exporter: python
 #     pygments_lexer: ipython3
 #     version: 3.5.2
 # ---
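 # # RNNs for time series analysis
 # Earlier we used an RNN for a simple image classification problem, but RNNs are not well suited to that kind of task. Here we apply an RNN to a time series problem: in sequential data later values depend on earlier ones, and the memory of an LSTM fits this setting well.
 # First we read in the data, monthly airline passenger traffic over 10 years, and plot it.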
 import numpy as np
 import pandas as pd
 import matplotlib.pyplot as plt
 # %matplotlib inline
 data_csv = pd.read_csv('./data.csv', usecols=[1])
 plt.plot(data_csv)
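 # First we preprocess the data: drop the `na` entries, then scale the data by its range (max - min).
 # data preprocessing
 data_csv = data_csv.dropna()
 dataset = data_csv.values
 dataset = dataset.astype('float32')
 max_value = np.max(dataset)
 min_value = np.min(dataset)
 scalar = max_value - min_value # range of the data
 # note: dividing by the range without subtracting min_value does not map the data exactly onto 0 ~ 1
 dataset = list(map(lambda x: x / scalar, dataset))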
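 # For comparison, a minimal sketch of min-max scaling that does land exactly in 0 ~ 1, next to the divide-by-range variant used above. The toy values and the helper name `min_max_scale` are made up for this illustration.

```python
def min_max_scale(values):
    """Map values linearly so that min -> 0.0 and max -> 1.0."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


toy = [104.0, 200.0, 622.0]                 # hypothetical toy values
scaled = min_max_scale(toy)
span = max(toy) - min(toy)
range_scaled = [v / span for v in toy]      # what the preprocessing above does
print(scaled)
print(range_scaled)
```

 # Both variants keep the shape of the series; only the offset differs, so either can work for training as long as the same scaling is used when reading the predictions.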
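 # Next we build the dataset. We want to predict the current month's traffic from the preceding months; for example, to predict from the previous two months we use those two months as the input and the current month as the output. We also split the data into a training set and a test set so that test performance can be used to evaluate the model. Here we simply use the earlier part of the series for training and the later part for testing (the code below uses a 70/30 split).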
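 # The windowing described above can be sketched on a toy list before applying it to the real data. `make_windows` is a hypothetical pure-Python counterpart of the `create_dataset` function that follows.

```python
def make_windows(series, look_back=2):
    """Pair each run of look_back values with the value that follows it."""
    inputs, targets = [], []
    for i in range(len(series) - look_back):
        inputs.append(series[i:i + look_back])
        targets.append(series[i + look_back])
    return inputs, targets


inputs, targets = make_windows([10, 20, 30, 40, 50])
print(inputs)
print(targets)
```

 # A series of length n yields n - look_back samples, so the first look_back months never appear as targets.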
 def create_dataset(dataset, look_back=2):
    dataX, dataY = [], []
    for i in range(len(dataset) - look_back):
        a = dataset[i:(i + look_back)]
        dataX.append(a)
        dataY.append(dataset[i + look_back])
    return np.array(dataX), np.array(dataY)
 # build the inputs and outputs
 data_X, data_Y = create_dataset(dataset)
 # split into training and test sets, with 70% used for training
 train_size = int(len(data_X) * 0.7)
 test_size = len(data_X) - train_size
 train_X = data_X[:train_size]
 train_Y = data_Y[:train_size]
 test_X = data_X[train_size:]
 test_Y = data_Y[train_size:]
 train_Y.shape
 # Finally we reshape the data, because the RNN expects input of shape (seq, batch, feature). There is only one sequence here, so batch is 1, and the input feature is the number of months we condition on; we chose two months, so feature is 2.
 # +
 import torch
 train_X = train_X.reshape(-1, 1, 2)
 train_Y = train_Y.reshape(-1, 1, 1)
 test_X = test_X.reshape(-1, 1, 2)
 train_x = torch.from_numpy(train_X)
 train_y = torch.from_numpy(train_Y)
 test_x = torch.from_numpy(test_X)
 # -
 from torch import nn
 from torch.autograd import Variable
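 # Here we define the model. The first part is a two-layer LSTM that at each step takes two months of input as features and produces an output feature. A linear layer then regresses the LSTM output to the traffic value. We use `view` to rearrange the tensor: the first two dimensions are merged before the linear layer and split again afterwards to give the final output. (Current PyTorch versions of `nn.Linear` also accept 3-D input, so the reshape mainly keeps the shapes explicit.)
 # define the model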
 class lstm_reg(nn.Module):
    def __init__(self, input_size, hidden_size, output_size=1, num_layers=2):
        super(lstm_reg, self).__init__()
        self.rnn = nn.LSTM(input_size, hidden_size, num_layers) # rnn
        self.reg = nn.Linear(hidden_size, output_size) # regression
    def forward(self, x):
        x, _ = self.rnn(x) # (seq, batch, hidden)
        s, b, h = x.shape
        x = x.view(s*b, h) # flatten to the 2-D input of the linear layer
        x = self.reg(x)
        x = x.view(s, b, -1)
        return x
 # +
 net = lstm_reg(2, 4)
 criterion = nn.MSELoss()
 optimizer = torch.optim.Adam(net.parameters(), lr=1e-2)
 # -
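 # With the network defined, the input dimension is 2 because we use two months of traffic as input; the hidden dimension can be chosen freely, and here we use 4
 # start training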
 for e in range(1000):
    var_x = Variable(train_x)
    var_y = Variable(train_y)
    # forward pass
    out = net(var_x)
    loss = criterion(out, var_y)
    # backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if (e + 1) % 100 == 0: # report every 100 epochs
        # loss.item() needs PyTorch >= 0.4 and replaces the old loss.data[0]
        print('Epoch: {}, Loss: {:.5f}'.format(e + 1, loss.item()))
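 # After training, we use the trained model to make predictions
 net = net.eval() # switch to evaluation mode
 data_X = data_X.reshape(-1, 1, 2)
 data_X = torch.from_numpy(data_X)
 var_data = Variable(data_X)
 pred_test = net(var_data) # predictions over the whole series (training and test parts)
 # flatten the output
 pred_test = pred_test.view(-1).data.numpy()
 # plot the actual values and the predictions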
 plt.plot(pred_test, 'r', label='prediction')
 plt.plot(dataset, 'b', label='real')
 plt.legend(loc='best')
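 # Blue is the real data and red is the prediction. The LSTM gives fairly close results and the predicted trend follows the real data, because the model can remember earlier information, whereas plain linear regression generally does not do as well. The example suggests that RNNs can work well on sequence data.
 # **Exercise: change the number of hidden features and see what changes; also try a simple linear regression model and see what result you get**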
--- a/6_pytorch/4_GAN/autoencoder.py
+++ b/6_pytorch/4_GAN/autoencoder.py
@@ -1,250 +0,0 @@
 # -*- coding: utf-8 -*-
 # ---
 # jupyter:
 #   jupytext_format_version: '1.2'
 #   kernelspec:
 #     display_name: Python 3
 #     language: python
 #     name: python3
 #   language_info:
 #     codemirror_mode:
 #       name: ipython
 #       version: 3
 #     file_extension: .py
 #     mimetype: text/x-python
 #     name: python
 #     nbconvert_exporter: python
 #     pygments_lexer: ipython3
 #     version: 3.5.2
 # ---
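 # # Autoencoders
 # Autoencoders started out as a data compression method and were also used for layer-wise pretraining of convolutional networks. Once architectures such as ResNet made it possible to train very deep networks directly, autoencoders were no longer used for that. Below we look at a newer application that emerged alongside generative adversarial models: generating data with an autoencoder.
 #
 # The general structure of an autoencoder is as follows
 #
 # ![](https://ws1.sinaimg.cn/large/006tNc79ly1fmzr05igw3j30ni06j3z4.jpg)
 #
 # As the figure shows, the first part is the encoder and the second part is the decoder. Both can be arbitrary models; usually neural networks are used. One network reduces the input to a lower-dimensional code, and another network decodes the code into generated data that should be as close as possible to the original. The encoder and decoder parameters are trained by minimizing the difference between the original and the generated data.
 #
 # How do we generate data once training is done? Simple: take just the decoder and feed it random codes, and it generates all kinds of data
 #
 # ![](https://ws3.sinaimg.cn/large/006tNc79ly1fmzrx3d3ygj30nu06ijs2.jpg)
 #
 # Below we use the MNIST dataset to show how to build a simple autoencoder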
 # + {"ExecuteTime": {"start_time": "2018-01-01T10:09:20.758909Z", "end_time": "2018-01-01T10:09:21.223959Z"}}
 import os
 import torch
 from torch.autograd import Variable
 from torch import nn
 from torch.utils.data import DataLoader
 from torchvision.datasets import MNIST
 from torchvision import transforms as tfs
 from torchvision.utils import save_image
 # -
 # Preprocess the data and build the data iterator
 # + {"ExecuteTime": {"start_time": "2018-01-01T10:09:21.341312Z", "end_time": "2018-01-01T10:09:21.368959Z"}}
 im_tfs = tfs.Compose([
    tfs.ToTensor(),
    tfs.Normalize([0.5], [0.5]) # normalize to [-1, 1]; MNIST images have a single channel
 ])
 train_set = MNIST('./mnist', transform=im_tfs)
 train_data = DataLoader(train_set, batch_size=128, shuffle=True)
 # + {"ExecuteTime": {"start_time": "2018-01-01T10:09:23.489417Z", "end_time": "2018-01-01T10:09:23.526707Z"}}
 # define the network
 class autoencoder(nn.Module):
    def __init__(self):
        super(autoencoder, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(28*28, 128),
            nn.ReLU(True),
            nn.Linear(128, 64),
            nn.ReLU(True),
            nn.Linear(64, 12),
            nn.ReLU(True),
            nn.Linear(12, 3) # the output code is 3-dimensional, which makes it easy to visualize
        )
        self.decoder = nn.Sequential(
            nn.Linear(3, 12),
            nn.ReLU(True),
            nn.Linear(12, 64),
            nn.ReLU(True),
            nn.Linear(64, 128),
            nn.ReLU(True),
            nn.Linear(128, 28*28),
            nn.Tanh()
        )
    def forward(self, x):
        encode = self.encoder(x)
        decode = self.decoder(encode)
        return encode, decode
 # -
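 # Both the encoder and the decoder defined here are 4-layer networks with ReLU activations in between, and the final code is 3-dimensional. Note that the decoder ends with tanh: the input images are normalized to -1 ~ 1, so the output must lie in the same range. We can check the size of the code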
 # + {"ExecuteTime": {"start_time": "2018-01-01T10:09:26.657447Z", "end_time": "2018-01-01T10:09:26.677033Z"}}
 net = autoencoder()
 x = Variable(torch.randn(1, 28*28)) # batch size is 1
 code, _ = net(x)
 print(code.shape)
 # -
 # The resulting code is indeed 3-dimensional
 # + {"ExecuteTime": {"start_time": "2018-01-01T10:09:27.726089Z", "end_time": "2018-01-01T10:09:27.739067Z"}}
 criterion = nn.MSELoss(reduction='sum') # sum over all elements; replaces the deprecated size_average=False
 optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
 def to_img(x):
    '''
    Convert the network output back into images
    '''
    x = 0.5 * (x + 1.)
    x = x.clamp(0, 1)
    x = x.view(x.shape[0], 1, 28, 28)
    return x
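 # `to_img` undoes the normalization applied when loading the data. The per-pixel round trip can be sketched in plain Python; `normalize` and `denormalize` are hypothetical scalar stand-ins for `tfs.Normalize` with mean 0.5 and std 0.5 and for the arithmetic in `to_img`.

```python
def normalize(p):
    """Map a pixel from [0, 1] to [-1, 1], as Normalize(mean=0.5, std=0.5) does."""
    return (p - 0.5) / 0.5


def denormalize(v):
    """Inverse mapping used by to_img, clamped to [0, 1]."""
    return min(max(0.5 * (v + 1.0), 0.0), 1.0)


for p in (0.0, 0.25, 1.0):
    v = normalize(p)
    print(p, '->', v, '->', denormalize(v))
```

 # The clamp only matters for values outside -1 ~ 1; since the decoder ends in tanh, its outputs already fall inside that range.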
 # + {"ExecuteTime": {"start_time": "2018-01-01T10:09:28.323220Z", "end_time": "2018-01-01T11:03:15.048160Z"}}
 # train the autoencoder
 for e in range(100):
    for im, _ in train_data:
        im = im.view(im.shape[0], -1)
        im = Variable(im)
        # forward pass
        _, output = net(im)
        loss = criterion(output, im) / im.shape[0] # average over the batch
        # backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    if (e+1) % 20 == 0: # every 20 epochs, save the generated images
        print('epoch: {}, Loss: {:.4f}'.format(e + 1, loss.item()))
        pic = to_img(output.cpu().data)
        if not os.path.exists('./simple_autoencoder'):
            os.mkdir('./simple_autoencoder')
        save_image(pic, './simple_autoencoder/image_{}.png'.format(e + 1))
 # -
 # After training we can look at the generated images
 #
 # ![](https://ws2.sinaimg.cn/large/006tNc79ly1fmzw2c26qtj306q0a2abh.jpg)
 #
 # The images are reasonably sharp
 # + {"ExecuteTime": {"start_time": "2018-01-01T11:03:19.489154Z", "end_time": "2018-01-01T11:03:21.396147Z"}}
 import matplotlib.pyplot as plt
 from matplotlib import cm
 from mpl_toolkits.mplot3d import Axes3D
 # %matplotlib inline
 # visualize the result
 # recent torchvision versions expose .data / .targets in place of .train_data / .train_labels
 view_data = Variable((train_set.data[:200].type(torch.FloatTensor).view(-1, 28*28) / 255. - 0.5) / 0.5)
 encode, _ = net(view_data)    # extract the compressed features
 fig = plt.figure(2)
 ax = fig.add_subplot(111, projection='3d')    # 3D axes
 # x, y, z values
 X = encode.data[:, 0].numpy()
 Y = encode.data[:, 1].numpy()
 Z = encode.data[:, 2].numpy()
 values = train_set.targets[:200].numpy()  # labels
 for x, y, z, s in zip(X, Y, Z, values):
    c = cm.rainbow(int(255*s/9))    # color by label
    ax.text(x, y, z, s, backgroundcolor=c)  # place the label
 ax.set_xlim(X.min(), X.max())
 ax.set_ylim(Y.min(), Y.max())
 ax.set_zlim(Z.min(), Z.max())
 plt.show()
 # -
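 # Images of different classes are encoded differently, while images of the same class get codes that lie close together in the plot. Once the autoencoder is trained, we can pass a random code to the decoder to generate an image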
 # + {"ExecuteTime": {"start_time": "2018-01-01T11:06:01.958234Z", "end_time": "2018-01-01T11:06:02.107432Z"}}
 code = Variable(torch.FloatTensor([[1.19, -3.36, 2.06]])) # use the code (1.19, -3.36, 2.06)
 decode = net.decoder(code)
 decode_img = to_img(decode).squeeze()
 decode_img = decode_img.data.numpy() * 255
 plt.imshow(decode_img.astype('uint8'), cmap='gray') # the generated image is a 3
 # -
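 # So far the autoencoder is just a multilayer perceptron. You may wonder why we do not use a convolutional network, which usually works better on images. We can, so below we define a convolutional autoencoder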
 # + {"ExecuteTime": {"start_time": "2018-01-01T11:06:06.284342Z", "end_time": "2018-01-01T11:06:06.346907Z"}}
 class conv_autoencoder(nn.Module):
    def __init__(self):
        super(conv_autoencoder, self).__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=3, padding=1),  # (b, 16, 10, 10)
            nn.ReLU(True),
            nn.MaxPool2d(2, stride=2),  # (b, 16, 5, 5)
            nn.Conv2d(16, 8, 3, stride=2, padding=1),  # (b, 8, 3, 3)
            nn.ReLU(True),
            nn.MaxPool2d(2, stride=1)  # (b, 8, 2, 2)
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(8, 16, 3, stride=2),  # (b, 16, 5, 5)
            nn.ReLU(True),
            nn.ConvTranspose2d(16, 8, 5, stride=3, padding=1),  # (b, 8, 15, 15)
            nn.ReLU(True),
            nn.ConvTranspose2d(8, 1, 2, stride=2, padding=1),  # (b, 1, 28, 28)
            nn.Tanh()
        )
    def forward(self, x):
        encode = self.encoder(x)
        decode = self.decoder(encode)
        return encode, decode
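 # The shape comments in the class above can be verified with the usual output-size formulas for convolution and transposed convolution (dilation 1, no output padding). This is a plain-Python sketch; `conv_out` and `conv_transpose_out` are hypothetical helpers, and the tuples repeat the layer settings from the class.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, kernel, stride=1, padding=0):
    """Output size of a transposed convolution (dilation 1, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


# (kernel, stride, padding) for each layer that changes the spatial size
encoder = [(3, 3, 1), (2, 2, 0), (3, 2, 1), (2, 1, 0)]
decoder = [(3, 2, 0), (5, 3, 1), (2, 2, 1)]

size = 28
enc_sizes = []
for k, s, p in encoder:
    size = conv_out(size, k, s, p)
    enc_sizes.append(size)
dec_sizes = []
for k, s, p in decoder:
    size = conv_transpose_out(size, k, s, p)
    dec_sizes.append(size)
print(enc_sizes, dec_sizes)
```

 # The decoder brings the 2 x 2 code back to 28 x 28, so the reconstruction can be compared with the input pixel by pixel.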
 # + {"ExecuteTime": {"start_time": "2018-01-01T11:06:06.944171Z", "end_time": "2018-01-01T11:06:10.043014Z"}}
 conv_net = conv_autoencoder()
 if torch.cuda.is_available():
    conv_net = conv_net.cuda()
 optimizer = torch.optim.Adam(conv_net.parameters(), lr=1e-3, weight_decay=1e-5)
 # -
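 # The decoder has to upsample its input, and in a convolutional network this can be done with transposed convolutions. We do not go into transposed convolutions here; if you want to read about them first, see the [semantic segmentation](https://github.com/SherlockLiao/code-of-learn-deep-learning-with-pytorch/blob/master/chapter9_Computer-Vision/segmentation/fcn.ipynb) part, which introduces them
 #
 # In PyTorch a transposed convolution is the operation used above, `torch.nn.ConvTranspose2d()`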
 # + {"ExecuteTime": {"start_time": "2018-01-01T11:06:24.760698Z", "end_time": "2018-01-01T11:15:44.595927Z"}}
 # train the convolutional autoencoder
 for e in range(40):
    for im, _ in train_data:
        if torch.cuda.is_available():
            im = im.cuda()
        im = Variable(im)
        # forward pass
        _, output = conv_net(im)
        loss = criterion(output, im) / im.shape[0] # average over the batch
        # backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    if (e+1) % 20 == 0: # every 20 epochs, save the generated images
        print('epoch: {}, Loss: {:.4f}'.format(e+1, loss.item()))
        pic = to_img(output.cpu().data)
        if not os.path.exists('./conv_autoencoder'):
            os.mkdir('./conv_autoencoder')
        save_image(pic, './conv_autoencoder/image_{}.png'.format(e+1))
 # -
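 # To save time we only run 40 epochs; run it on a GPU if you have one
 #
 # Finally, let's look at the result
 #
 # ![](https://ws1.sinaimg.cn/large/006tNc79ly1fmzww48to3j306q0a20ud.jpg)
 # We have shown simple autoencoders, using a multilayer perceptron and a convolutional network as examples. Autoencoders have one problem, though: we cannot generate arbitrary data on demand, because we do not know what probability distribution the codes follow. An improved version, the variational autoencoder, addresses this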
--- a/6_pytorch/4_GAN/gan.py
+++ b/6_pytorch/4_GAN/gan.py
@@ -1,429 +0,0 @@
 # -*- coding: utf-8 -*-
 # ---
 # jupyter:
 #   jupytext_format_version: '1.2'
 #   kernelspec:
 #     display_name: Python 3
 #     language: python
 #     name: python3
 #   language_info:
 #     codemirror_mode:
 #       name: ipython
 #       version: 3
 #     file_extension: .py
 #     mimetype: text/x-python
 #     name: python
 #     nbconvert_exporter: python
 #     pygments_lexer: ipython3
 #     version: 3.5.2
 # ---
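 # # Generative adversarial networks
 # Earlier we covered autoencoders and variational autoencoders. Both compute the loss from the per-pixel error between the generated image and the input image. This is a real weakness: different pixel errors can produce different visual results yet give the same loss, so a loss built from individual pixels is not accurate. We need a new way to define the loss: learning through an adversary.
 #
 # ## GANs
 # This training scheme defines a new kind of network structure, the generative adversarial network, or GAN. In this part we introduce GANs intuitively and implement them in code; the accompanying book covers the mathematical derivation in more detail.
 #
 # As the name suggests, the network has two parts: generation and adversary. Put simply, there is a generator network and a discriminator network that are trained to compete with each other: the generator produces fake data, the discriminator tries to tell real from fake, and the goal is for the generated data to pass for real.
 #
 # The following figure gives a simple view of these two processes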
 #
 # ![](https://ws3.sinaimg.cn/large/006tNc79gy1fn22oma081j30k007cgll.jpg)
 #
 # ### Discriminator Network
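 # We start with the adversarial side, because it is the simpler of the two.
 #
 # The adversary is a discriminator that decides whether its input is real or fake, which is a binary classification problem: given a real image we want it to output 1, and given a fake image we want it to output 0. The original image labels no longer matter. Whatever class an original image belongs to, it counts as a real image with label 1, while every generated image gets label 0.
 #
 # Training aims to make the discriminator tell real images from fake ones correctly. This is an ordinary binary classification problem, so any of the methods covered earlier can be used: logistic regression, deep networks, convolutional networks or recurrent networks.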
 #
 # ### Generator Network
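 # Next we look at how the generator produces a fake image. Start from a random noise vector, for example drawn from a normal distribution (the D-dimensional noise vector in the figure above). An affine transformation, xw+b, maps it to a higher dimension, and the result is rearranged into a rectangle so that it looks more like an image. Convolutions, transposed convolutions, pooling and activation functions are then applied, and the final result is a noise matrix with exactly the same size as the input images. This is what we call a fake image.
 #
 # How do we train the generator? Through adversarial learning: we increase the probability that the discriminator classifies the generated result as real, and adjust the generator's parameters step by step so that the generated images look more and more real. The discriminator's parameters are not updated in this step; if the discriminator kept improving here, the generator might never be able to fool it, whatever it generated.
 #
 # The figure below illustrates the generator
 #
 # ![](https://ws3.sinaimg.cn/large/006tNc79gy1fn22s47jnfj30k005c74b.jpg)
 #
 # Many variants of GANs have appeared, such as WGAN and LS-GAN. In this section we only use MNIST for a few simple examples; implementations of more complex architectures can be found on GitHub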
 # + {"ExecuteTime": {"start_time": "2018-01-04T09:35:18.858664Z", "end_time": "2018-01-04T09:35:19.703119Z"}}
 import torch
 from torch import nn
 from torch.autograd import Variable
 import torchvision.transforms as tfs
 from torch.utils.data import DataLoader, sampler
 from torchvision.datasets import MNIST
 import numpy as np
 import matplotlib.pyplot as plt
 import matplotlib.gridspec as gridspec
 # %matplotlib inline
 plt.rcParams['figure.figsize'] = (10.0, 8.0) # set the figure size
 plt.rcParams['image.interpolation'] = 'nearest'
 plt.rcParams['image.cmap'] = 'gray'
 def show_images(images): # plotting helper
    images = np.reshape(images, [images.shape[0], -1])
    sqrtn = int(np.ceil(np.sqrt(images.shape[0])))
    sqrtimg = int(np.ceil(np.sqrt(images.shape[1])))
    fig = plt.figure(figsize=(sqrtn, sqrtn))
    gs = gridspec.GridSpec(sqrtn, sqrtn)
    gs.update(wspace=0.05, hspace=0.05)
    for i, img in enumerate(images):
        ax = plt.subplot(gs[i])
        plt.axis('off')
        ax.set_xticklabels([])
        ax.set_yticklabels([])
        ax.set_aspect('equal')
        plt.imshow(img.reshape([sqrtimg,sqrtimg]))
    return 
 def preprocess_img(x):
    x = tfs.ToTensor()(x)
    return (x - 0.5) / 0.5
 def deprocess_img(x):
    return (x + 1.0) / 2.0
 # + {"ExecuteTime": {"start_time": "2018-01-04T09:35:20.674313Z", "end_time": "2018-01-04T09:35:28.869280Z"}}
 class ChunkSampler(sampler.Sampler): # sampler that yields a contiguous range of indices
    """Samples elements sequentially from some offset. 
    Arguments:
        num_samples: # of desired datapoints
        start: offset where we should start selecting from
    """
    def __init__(self, num_samples, start=0):
        self.num_samples = num_samples
        self.start = start
    def __iter__(self):
        return iter(range(self.start, self.start + self.num_samples))
    def __len__(self):
        return self.num_samples
 NUM_TRAIN = 50000
 NUM_VAL = 5000
 NOISE_DIM = 96
 batch_size = 128
 train_set = MNIST('./mnist', train=True, download=True, transform=preprocess_img)
 train_data = DataLoader(train_set, batch_size=batch_size, sampler=ChunkSampler(NUM_TRAIN, 0))
 val_set = MNIST('./mnist', train=True, download=True, transform=preprocess_img)
 val_data = DataLoader(val_set, batch_size=batch_size, sampler=ChunkSampler(NUM_VAL, NUM_TRAIN))
 imgs = deprocess_img(next(iter(train_data))[0].view(batch_size, 784)).numpy().squeeze() # visualize a batch of images
 show_images(imgs)
 # -
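 # ## A simple generative adversarial network
 # As we saw above, a GAN has two parts, a generator network and a discriminator network. We first write a simple version in which both are plain multilayer perceptrons
 #
 # ### Discriminator network
 # The discriminator is very simple: a binary classifier with the following structure:
 # * Fully connected (784 -> 256)
 # * leakyrelu, with $\alpha$ = 0.2
 # * Fully connected (256 -> 256)
 # * leakyrelu, with $\alpha$ = 0.2
 # * Fully connected (256 -> 1)
 #
 # Here leakyrelu means f(x) = max($\alpha$ x, x)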
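 # A scalar sketch of the leaky ReLU formula above, assuming 0 < alpha < 1 so that max(alpha * x, x) picks alpha * x for negative inputs. The function name is illustrative; the networks use `nn.LeakyReLU`.

```python
def leaky_relu(x, alpha=0.2):
    """f(x) = max(alpha * x, x); negative inputs are scaled by alpha."""
    return max(alpha * x, x)


outputs = [leaky_relu(v) for v in (-2.0, 0.0, 3.0)]
print(outputs)
```

 # Unlike plain ReLU, negative inputs keep a small nonzero slope, so some gradient still flows through units that would otherwise be switched off.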
 # + {"ExecuteTime": {"start_time": "2018-01-04T09:35:28.871207Z", "end_time": "2018-01-04T09:35:28.877089Z"}}
 def discriminator():
    net = nn.Sequential(        
            nn.Linear(784, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1)
        )
    return net
 # -
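 # ### Generator network
 # The generator is also simple: it maps random noise to a tensor with the same dimension as the data. Its structure is:
 # * Fully connected (noise dimension -> 1024)
 # * relu
 # * Fully connected (1024 -> 1024)
 # * relu
 # * Fully connected (1024 -> 784)
 # * tanh, which squashes the output into -1 ~ 1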
 # + {"ExecuteTime": {"start_time": "2018-01-04T09:35:28.878933Z", "end_time": "2018-01-04T09:35:28.893308Z"}}
 def generator(noise_dim=NOISE_DIM):   
    net = nn.Sequential(
        nn.Linear(noise_dim, 1024),
        nn.ReLU(True),
        nn.Linear(1024, 1024),
        nn.ReLU(True),
        nn.Linear(1024, 784),
        nn.Tanh()
    )
    return net
 # -
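 # Next we define the GAN losses. As explained above, the discriminator solves a binary classification problem: classify real data as real and fake data as fake. For reference, the objective from the paper, written as a loss to minimize, is
 #
 # $$ \ell_D = -\mathbb{E}_{x \sim p_\text{data}}\left[\log D(x)\right] - \mathbb{E}_{z \sim p(z)}\left[\log \left(1-D(G(z))\right)\right]$$
 #
 # The generator has to fool the discriminator, that is, get fake data classified as real. Written as a loss to minimize (the non-saturating form), this is
 #
 # $$\ell_G  =  -\mathbb{E}_{z \sim p(z)}\left[\log D(G(z))\right]$$
 #
 # If you remember the binary classification loss from earlier, you will see that both formulas are instances of it
 #
 # $$ bce(s, y) = -y * \log(s) - (1 - y) * \log(1 - s) $$
 # If D(x) is the classification score for real data, then D(G(z)) is the score for fake data. The discriminator loss pushes the score of real data toward 1 and the score of fake data toward 0, while the generator loss pushes the score of fake data toward 1
 #
 # Let's implement this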
 # + {"ExecuteTime": {"start_time": "2018-01-04T09:37:01.458787Z", "end_time": "2018-01-04T09:37:01.475822Z"}}
 bce_loss = nn.BCEWithLogitsLoss()
 def discriminator_loss(logits_real, logits_fake): # discriminator loss
    size = logits_real.shape[0]
    true_labels = Variable(torch.ones(size, 1)).float().cuda()
    false_labels = Variable(torch.zeros(size, 1)).float().cuda()
    loss = bce_loss(logits_real, true_labels) + bce_loss(logits_fake, false_labels)
    return loss
 # + {"ExecuteTime": {"start_time": "2018-01-04T09:37:01.750127Z", "end_time": "2018-01-04T09:37:01.756901Z"}}
 def generator_loss(logits_fake): # generator loss
    size = logits_fake.shape[0]
    true_labels = Variable(torch.ones(size, 1)).float().cuda()
    loss = bce_loss(logits_fake, true_labels)
    return loss
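 # To sanity-check the two losses without a GPU, here is a plain-Python sketch of binary cross-entropy on logits, using a common numerically stable rewriting. The helper names are hypothetical, and averaging over the batch is assumed to mirror the default behaviour of `nn.BCEWithLogitsLoss`.

```python
import math


def bce_with_logits(logit, label):
    """-[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(logit), computed stably."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def mean(values):
    return sum(values) / len(values)


def d_loss(logits_real, logits_fake):
    """Real samples are labeled 1, fake samples 0."""
    return (mean([bce_with_logits(l, 1.0) for l in logits_real])
            + mean([bce_with_logits(l, 0.0) for l in logits_fake]))


def g_loss(logits_fake):
    """The generator wants fake samples to be labeled 1."""
    return mean([bce_with_logits(l, 1.0) for l in logits_fake])


print(d_loss([0.0], [0.0]))   # undecided discriminator: 2 * log(2)
print(g_loss([0.0]))          # log(2)
print(g_loss([20.0]))         # discriminator fully fooled: close to 0
```

 # A discriminator that outputs logit 0 everywhere (probability 0.5) gives 2 log 2 for the discriminator loss, a handy reference value when watching the training logs.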
 # + {"ExecuteTime": {"start_time": "2018-01-04T09:37:02.179658Z", "end_time": "2018-01-04T09:37:02.188467Z"}}
 # train with Adam: learning rate 3e-4, beta1 0.5, beta2 0.999
 def get_optimizer(net):
    optimizer = torch.optim.Adam(net.parameters(), lr=3e-4, betas=(0.5, 0.999))
    return optimizer
 # -
 # Now we train this simple GAN
 # + {"ExecuteTime": {"start_time": "2018-01-04T09:37:03.287554Z", "end_time": "2018-01-04T09:37:03.426140Z"}}
 def train_a_gan(D_net, G_net, D_optimizer, G_optimizer, discriminator_loss, generator_loss, show_every=250, 
                noise_size=96, num_epochs=10):
    iter_count = 0
    for epoch in range(num_epochs):
        for x, _ in train_data:
            bs = x.shape[0]
            # discriminator
            real_data = Variable(x).view(bs, -1).cuda() # real data
            logits_real = D_net(real_data) # discriminator scores
            sample_noise = (torch.rand(bs, noise_size) - 0.5) / 0.5 # uniform noise in -1 ~ 1
            g_fake_seed = Variable(sample_noise).cuda()
            fake_images = G_net(g_fake_seed) # generated fake data
            logits_fake = D_net(fake_images) # discriminator scores
            d_total_error = discriminator_loss(logits_real, logits_fake) # discriminator loss
            D_optimizer.zero_grad()
            d_total_error.backward()
            D_optimizer.step() # update the discriminator
            # generator
            g_fake_seed = Variable(sample_noise).cuda()
            fake_images = G_net(g_fake_seed) # generated fake data
            gen_logits_fake = D_net(fake_images)
            g_error = generator_loss(gen_logits_fake) # generator loss
            G_optimizer.zero_grad()
            g_error.backward()
            G_optimizer.step() # update the generator
            if (iter_count % show_every == 0):
                print('Iter: {}, D: {:.4}, G:{:.4}'.format(iter_count, d_total_error.item(), g_error.item()))
                imgs_numpy = deprocess_img(fake_images.data.cpu().numpy())
                show_images(imgs_numpy[0:16])
                plt.show()
                print()
            iter_count += 1
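 # The noise line in the loop, `(torch.rand(bs, noise_size) - 0.5) / 0.5`, rescales uniform samples from [0, 1) to [-1, 1). A plain-Python sketch of the same mapping follows; the function name `uniform_noise` is hypothetical and used only for illustration.

```python
import random

def uniform_noise(batch_size, noise_size, rng=random):
    """Return batch_size rows of noise_size values drawn uniformly from [-1, 1)."""
    # rng.random() is uniform on [0, 1); subtracting 0.5 and dividing by 0.5
    # shifts and stretches it to [-1, 1).
    return [[(rng.random() - 0.5) / 0.5 for _ in range(noise_size)]
            for _ in range(batch_size)]

noise = uniform_noise(4, 96, random.Random(0))
print(len(noise), len(noise[0]))
```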
 # + {"scrolled": true, "ExecuteTime": {"start_time": "2018-01-04T09:37:03.776837Z", "end_time": "2018-01-04T09:38:56.363519Z"}}
 D = discriminator().cuda()
 G = generator().cuda()
 D_optim = get_optimizer(D)
 G_optim = get_optimizer(G)
 train_a_gan(D, G, D_optim, G_optim, discriminator_loss, generator_loss)
 # -
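 # We have now built a simple generative adversarial network. The results are not especially good, though: the generated digits are often incomplete, because we only used a plain multi-layer fully connected network.
 #
 # Beyond this most basic GAN there are many variants, some changing the architecture and some changing the loss. We start with a variant of the loss, the Least Squares GAN.
 # ## Least Squares GAN
 # [Least Squares GAN](https://arxiv.org/abs/1611.04076) trains more stably than the original GAN loss. As the name suggests, it fits the discriminator scores with a least-squares objective instead of a binary classification loss. The loss formulas are: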
 #
 # $$\ell_G  =  \frac{1}{2}\mathbb{E}_{z \sim p(z)}\left[\left(D(G(z))-1\right)^2\right]$$
 #
 # $$ \ell_D = \frac{1}{2}\mathbb{E}_{x \sim p_\text{data}}\left[\left(D(x)-1\right)^2\right] + \frac{1}{2}\mathbb{E}_{z \sim p(z)}\left[ \left(D(G(z))\right)^2\right]$$
 # Least Squares GAN thus replaces the binary classification loss with a least-squares loss. We define the loss functions below.
 # + {"ExecuteTime": {"start_time": "2018-01-04T09:38:56.366230Z", "end_time": "2018-01-04T09:38:56.375632Z"}}
 def ls_discriminator_loss(scores_real, scores_fake):
    loss = 0.5 * ((scores_real - 1) ** 2).mean() + 0.5 * (scores_fake ** 2).mean()
    return loss
 def ls_generator_loss(scores_fake):
    loss = 0.5 * ((scores_fake - 1) ** 2).mean()
    return loss
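 # To make the two formulas concrete, here is a minimal worked example on plain Python lists. The `_py` function names are hypothetical stand-ins for the tensor versions above, and the scores are made-up values chosen so the results are easy to check by hand.

```python
def mean(values):
    return sum(values) / len(values)

def ls_discriminator_loss_py(scores_real, scores_fake):
    """0.5 * E[(D(x) - 1)^2] + 0.5 * E[D(G(z))^2] over lists of scores."""
    real_term = 0.5 * mean([(s - 1.0) ** 2 for s in scores_real])
    fake_term = 0.5 * mean([s ** 2 for s in scores_fake])
    return real_term + fake_term

def ls_generator_loss_py(scores_fake):
    """0.5 * E[(D(G(z)) - 1)^2] over a list of scores."""
    return 0.5 * mean([(s - 1.0) ** 2 for s in scores_fake])

# A discriminator that scores real data 1 and fake data 0 has zero loss...
print(ls_discriminator_loss_py([1.0, 1.0], [0.0, 0.0]))  # 0.0
# ...while the generator is penalised for those same fake scores.
print(ls_generator_loss_py([0.0, 0.0]))  # 0.5
```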
 # + {"scrolled": true, "ExecuteTime": {"start_time": "2018-01-04T09:38:56.377796Z", "end_time": "2018-01-04T09:40:32.256222Z"}}
 D = discriminator().cuda()
 G = generator().cuda()
 D_optim = get_optimizer(D)
 G_optim = get_optimizer(G)
 train_a_gan(D, G, D_optim, G_optim, ls_discriminator_loss, ls_generator_loss)
 # -
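 # We have covered the basic GAN and the least squares GAN. Finally we look at a GAN built from convolutional networks, the deep convolutional generative adversarial network (DCGAN).
 # ## Deep Convolutional GANs
 # A DCGAN is conceptually simple: both the generator and the discriminator are replaced by convolutional networks. We implement it below.
 # ### Convolutional discriminator
 # The convolutional discriminator is an ordinary convolutional network with the following structure: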
 #
 # * 32 Filters, 5x5, Stride 1, Leaky ReLU(alpha=0.01)
 # * Max Pool 2x2, Stride 2
 # * 64 Filters, 5x5, Stride 1, Leaky ReLU(alpha=0.01)
 # * Max Pool 2x2, Stride 2
 # * Fully Connected size 4 x 4 x 64, Leaky ReLU(alpha=0.01)
 # * Fully Connected size 1
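 #
 # The fully connected input size of 4 x 4 x 64 follows from the layer list above. As a quick check, assuming 28x28 MNIST inputs and no padding, the standard output-size formula can be traced in plain Python (the helper `conv_out` is a hypothetical name for this sketch):

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a convolution or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 28                      # MNIST images are 28x28
size = conv_out(size, 5, 1)    # 5x5 conv, stride 1  -> 24
size = conv_out(size, 2, 2)    # 2x2 max pool        -> 12
size = conv_out(size, 5, 1)    # 5x5 conv, stride 1  -> 8
size = conv_out(size, 2, 2)    # 2x2 max pool        -> 4
flat_features = size * size * 64
print(size, flat_features)     # 4 1024
```

 # The result, 1024, is the input width of the first linear layer in the classifier.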
 # + {"ExecuteTime": {"start_time": "2018-01-04T09:47:10.521930Z", "end_time": "2018-01-04T09:47:10.573931Z"}}
 class build_dc_classifier(nn.Module):
    def __init__(self):
        super(build_dc_classifier, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 5, 1),
            nn.LeakyReLU(0.01),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(32, 64, 5, 1),
            nn.LeakyReLU(0.01),
            nn.MaxPool2d(2, 2)
        )
        self.fc = nn.Sequential(
            nn.Linear(1024, 1024),
            nn.LeakyReLU(0.01),
            nn.Linear(1024, 1)
        )
    def forward(self, x):
        x = self.conv(x)
        x = x.view(x.shape[0], -1)
        x = self.fc(x)
        return x
 # -
 # ### Convolutional generator
 # The convolutional generator has to turn a low-dimensional noise vector into image data. Its structure is:
 #
 # * Fully connected of size 1024, ReLU
 # * BatchNorm
 # * Fully connected of size 7 x 7 x 128, ReLU
 # * BatchNorm
 # * Reshape into Image Tensor
 # * 64 conv2d^T filters of 4x4, stride 2, padding 1, ReLU
 # * BatchNorm
 # * 1 conv2d^T filter of 4x4, stride 2, padding 1, TanH
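 #
 # The two transposed convolutions are what grow the 7x7 feature map back to a 28x28 image. A small sketch of the output-size arithmetic, assuming dilation 1 and no output padding (the helper `conv_transpose_out` is a hypothetical name):

```python
def conv_transpose_out(size, kernel, stride=1, padding=0, output_padding=0):
    """Output side length of a transposed convolution with dilation 1."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

size = 7                                              # after reshaping to 128 x 7 x 7
size = conv_transpose_out(size, 4, stride=2, padding=1)   # -> 14
size = conv_transpose_out(size, 4, stride=2, padding=1)   # -> 28
print(size)  # 28
```

 # With a 4x4 kernel, stride 2 and padding 1, each layer exactly doubles the spatial size.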
 # + {"ExecuteTime": {"start_time": "2018-01-04T10:05:32.785512Z", "end_time": "2018-01-04T10:05:32.848318Z"}}
 class build_dc_generator(nn.Module): 
    def __init__(self, noise_dim=NOISE_DIM):
        super(build_dc_generator, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(noise_dim, 1024),
            nn.ReLU(True),
            nn.BatchNorm1d(1024),
            nn.Linear(1024, 7 * 7 * 128),
            nn.ReLU(True),
            nn.BatchNorm1d(7 * 7 * 128)
        )
        self.conv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, 2, padding=1),
            nn.ReLU(True),
            nn.BatchNorm2d(64),
            nn.ConvTranspose2d(64, 1, 4, 2, padding=1),
            nn.Tanh()
        )
    def forward(self, x):
        x = self.fc(x)
        x = x.view(x.shape[0], 128, 7, 7) # reshape to 128 channels of size 7x7
        x = self.conv(x)
        return x
 # + {"ExecuteTime": {"start_time": "2018-01-04T10:12:43.110774Z", "end_time": "2018-01-04T10:12:43.237237Z"}}
 def train_dc_gan(D_net, G_net, D_optimizer, G_optimizer, discriminator_loss, generator_loss, show_every=250, 
                noise_size=96, num_epochs=10):
    iter_count = 0
    for epoch in range(num_epochs):
        for x, _ in train_data:
            bs = x.shape[0]
            # discriminator step
            real_data = Variable(x).cuda() # real data
            logits_real = D_net(real_data) # discriminator scores for real data
            sample_noise = (torch.rand(bs, noise_size) - 0.5) / 0.5 # uniform noise in [-1, 1)
            g_fake_seed = Variable(sample_noise).cuda()
            fake_images = G_net(g_fake_seed) # generated (fake) data
            logits_fake = D_net(fake_images) # discriminator scores for fake data
            d_total_error = discriminator_loss(logits_real, logits_fake) # discriminator loss
            D_optimizer.zero_grad()
            d_total_error.backward()
            D_optimizer.step() # update the discriminator
            # generator step
            g_fake_seed = Variable(sample_noise).cuda()
            fake_images = G_net(g_fake_seed) # generated (fake) data
            gen_logits_fake = D_net(fake_images)
            g_error = generator_loss(gen_logits_fake) # generator loss
            G_optimizer.zero_grad()
            g_error.backward()
            G_optimizer.step() # update the generator
            if (iter_count % show_every == 0):
                # .item() extracts the Python scalar (indexing a 0-dim tensor with [0] raises an error in recent PyTorch)
                print('Iter: {}, D: {:.4}, G:{:.4}'.format(iter_count, d_total_error.item(), g_error.item()))
                imgs_numpy = deprocess_img(fake_images.data.cpu().numpy())
                show_images(imgs_numpy[0:16])
                plt.show()
                print()
            iter_count += 1
 # + {"ExecuteTime": {"start_time": "2018-01-04T10:12:43.472792Z", "end_time": "2018-01-04T10:13:58.243586Z"}}
 D_DC = build_dc_classifier().cuda()
 G_DC = build_dc_generator().cuda()
 D_DC_optim = get_optimizer(D_DC)
 G_DC_optim = get_optimizer(G_DC)
 train_dc_gan(D_DC, G_DC, D_DC_optim, G_DC_optim, discriminator_loss, generator_loss, num_epochs=5)
 # -
 # As we can see, the DCGAN produces noticeably sharper results.
--- a/6_pytorch/4_GAN/vae.py
+++ b/6_pytorch/4_GAN/vae.py
@@ -1,208 +0,0 @@
 # -*- coding: utf-8 -*-
 # ---
 # jupyter:
 #   jupytext_format_version: '1.2'
 #   kernelspec:
 #     display_name: Python 3
 #     language: python
 #     name: python3
 #   language_info:
 #     codemirror_mode:
 #       name: ipython
 #       version: 3
 #     file_extension: .py
 #     mimetype: text/x-python
 #     name: python
 #     nbconvert_exporter: python
 #     pygments_lexer: ipython3
 #     version: 3.5.2
 # ---
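 # # Variational autoencoder
 # A variational autoencoder (VAE) is an upgraded autoencoder. Its structure is similar: it also consists of an encoder and a decoder.
 #
 # Recall that a plain autoencoder cannot generate arbitrary images, because we have no way to construct a latent vector ourselves; we only learn what the latent vector is by encoding an input image. The variational autoencoder addresses this problem.
 #
 # The idea is simple: add a constraint during encoding that forces the latent vectors to roughly follow a standard normal distribution. This is the main difference from an ordinary autoencoder.
 #
 # Generating a new image then becomes easy: we draw a random latent vector from a standard normal distribution and pass it through the decoder, with no need to encode an original image first.
 #
 # In practice the latent vectors produced by the encoder do not follow a standard normal distribution exactly. To measure how similar two distributions are we use the KL divergence, which gives a loss for the gap between the latent distribution and the standard normal. The other loss term is still the mean squared error between the generated image and the original.
 #
 # The KL divergence is defined as
 #
 # $$
 # D_{KL} (P || Q) =  \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} dx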
 # $$
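 # ## Reparameterization
 # Instead of producing a single latent vector, the encoder outputs two vectors, one for the mean and one for the standard deviation (the code below predicts the log variance). Assuming the latent vector is normally distributed, we can sample it by drawing from a standard normal, multiplying by the standard deviation and adding the mean; this reparameterization keeps the sampling step differentiable. Under the same Gaussian assumption the KL divergence to a standard normal has a closed form, so the integral never has to be evaluated. The KL term of the loss pushes the latent distribution toward a standard normal, that is, mean 0 and variance 1.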
 #
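 # As a minimal sketch of the sampling step in plain Python (the function name `reparameterize` and the test values are assumptions made for illustration, not the notebook's code), drawing many samples shows that scaling and shifting standard normal noise reproduces the requested mean and variance:

```python
import math
import random

def reparameterize(mu, logvar, rng=random):
    """Sample z = mu + sigma * eps element-wise, with eps ~ N(0, 1)."""
    z = []
    for m, lv in zip(mu, logvar):
        std = math.exp(0.5 * lv)       # sigma = exp(log(sigma^2) / 2)
        eps = rng.gauss(0.0, 1.0)      # standard normal noise
        z.append(m + std * eps)
    return z

rng = random.Random(0)
# One latent dimension with mean 2.0 and variance 0.25.
samples = [reparameterize([2.0], [math.log(0.25)], rng)[0] for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
print(round(sample_mean, 2), round(sample_var, 2))
```

 # The printed estimates should land close to 2.0 and 0.25. Because the randomness lives entirely in `eps`, the output is a deterministic function of `mu` and `logvar`, which is what lets gradients flow through the sampling step in the tensor version.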
 # A standard variational autoencoder therefore looks like this:
 #
 # ![](https://ws4.sinaimg.cn/large/006tKfTcgy1fn15cq6n7pj30k007t0sv.jpg)
 # We can therefore define the loss as the function below: the sum of the mean squared error and the KL divergence.
 #
 # ```
 # def loss_function(recon_x, x, mu, logvar):
 #     """
 #     recon_x: generating images
 #     x: origin images
 #     mu: latent mean
 #     logvar: latent log variance
 #     """
 #     MSE = reconstruction_function(recon_x, x)
 #     # KLD = -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
 #     KLD_element = mu.pow(2).add_(logvar.exp()).mul_(-1).add_(1).add_(logvar)
 #     KLD = torch.sum(KLD_element).mul_(-0.5)
 #     # KL divergence
 #     return MSE + KLD
 # ```
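 # The KLD term above is the closed-form KL divergence between a diagonal Gaussian and a standard normal. A plain-Python sketch with hand-checkable values follows; the function name `kl_to_standard_normal` is hypothetical, and the expression is the same quantity as `-0.5 * sum(1 + logvar - mu^2 - exp(logvar))` with the sign folded in.

```python
import math

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, 1) ) summed over dimensions, with logvar = log(sigma^2)."""
    return 0.5 * sum(m ** 2 + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))

# Already a standard normal: no penalty.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0
# Mean shifted to 1 with unit variance: penalty of mu^2 / 2.
print(kl_to_standard_normal([1.0], [0.0]))            # 0.5
```

 # The term is zero only when every mean is 0 and every variance is 1, and grows as the latent distribution drifts away from that.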
 # Below we use the MNIST dataset to illustrate the variational autoencoder.
 # + {"ExecuteTime": {"start_time": "2018-01-01T10:41:05.215490Z", "end_time": "2018-01-01T10:41:05.738797Z"}}
 import os
 import torch
 from torch.autograd import Variable
 import torch.nn.functional as F
 from torch import nn
 from torch.utils.data import DataLoader
 from torchvision.datasets import MNIST
 from torchvision import transforms as tfs
 from torchvision.utils import save_image
 # + {"ExecuteTime": {"start_time": "2018-01-01T10:41:05.741302Z", "end_time": "2018-01-01T10:41:05.769643Z"}}
 im_tfs = tfs.Compose([
    tfs.ToTensor(),
    tfs.Normalize([0.5], [0.5]) # normalize to [-1, 1]; MNIST images have a single channel
 ])
 train_set = MNIST('./mnist', transform=im_tfs)
 train_data = DataLoader(train_set, batch_size=128, shuffle=True)
 # + {"ExecuteTime": {"start_time": "2018-01-01T10:41:06.306479Z", "end_time": "2018-01-01T10:41:06.397118Z"}}
 class VAE(nn.Module):
    def __init__(self):
        super(VAE, self).__init__()
        self.fc1 = nn.Linear(784, 400)
        self.fc21 = nn.Linear(400, 20) # mean
        self.fc22 = nn.Linear(400, 20) # log variance
        self.fc3 = nn.Linear(20, 400)
        self.fc4 = nn.Linear(400, 784)
    def encode(self, x):
        h1 = F.relu(self.fc1(x))
        return self.fc21(h1), self.fc22(h1)
    def reparametrize(self, mu, logvar):
        std = logvar.mul(0.5).exp_() # standard deviation from the log variance
        eps = torch.randn_like(std) # standard normal noise on the same device as std
        return eps.mul(std).add_(mu)
    def decode(self, z):
        h3 = F.relu(self.fc3(z))
        return torch.tanh(self.fc4(h3))
    def forward(self, x):
        mu, logvar = self.encode(x) # encode
        z = self.reparametrize(mu, logvar) # sample z with the reparameterization trick
        return self.decode(z), mu, logvar # decode; also return the mean and log variance
 # + {"ExecuteTime": {"start_time": "2018-01-01T10:41:06.430817Z", "end_time": "2018-01-01T10:41:10.056600Z"}}
 net = VAE() # instantiate the network
 if torch.cuda.is_available():
    net = net.cuda()
 # + {"ExecuteTime": {"start_time": "2018-01-01T10:41:10.059597Z", "end_time": "2018-01-01T10:41:10.409900Z"}}
 x, _ = train_set[0]
 x = x.view(x.shape[0], -1)
 if torch.cuda.is_available():
    x = x.cuda()
 x = Variable(x)
 _, mu, var = net(x)
 # + {"ExecuteTime": {"start_time": "2018-01-01T10:41:29.749178Z", "end_time": "2018-01-01T10:41:29.753678Z"}}
 print(mu)
 # -
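 # As shown, for a given input the network outputs the mean and log variance of the latent variables; at this point they come from an untrained network.
 #
 # Now we start training.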
 # + {"ExecuteTime": {"start_time": "2018-01-01T10:13:54.530108Z", "end_time": "2018-01-01T10:13:54.560436Z"}}
 reconstruction_function = nn.MSELoss(reduction='sum') # summed squared error; size_average=False is deprecated
 def loss_function(recon_x, x, mu, logvar):
    """
    recon_x: generating images
    x: origin images
    mu: latent mean
    logvar: latent log variance
    """
    MSE = reconstruction_function(recon_x, x)
    # KLD = -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    KLD_element = mu.pow(2).add_(logvar.exp()).mul_(-1).add_(1).add_(logvar)
    KLD = torch.sum(KLD_element).mul_(-0.5)
    # KL divergence
    return MSE + KLD
 optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
 def to_img(x):
    '''
    Convert the network output back into a batch of images.
    '''
    x = 0.5 * (x + 1.)
    x = x.clamp(0, 1)
    x = x.view(x.shape[0], 1, 28, 28)
    return x
 # + {"ExecuteTime": {"start_time": "2018-01-01T10:13:54.562533Z", "end_time": "2018-01-01T10:35:01.115877Z"}}
 for e in range(100):
    for im, _ in train_data:
        im = im.view(im.shape[0], -1)
        im = Variable(im)
        if torch.cuda.is_available():
            im = im.cuda()
        recon_im, mu, logvar = net(im)
        loss = loss_function(recon_im, im, mu, logvar) / im.shape[0] # average the loss over the batch
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    if (e + 1) % 20 == 0:
        print('epoch: {}, Loss: {:.4f}'.format(e + 1, loss.item()))
        save = to_img(recon_im.cpu().data)
        if not os.path.exists('./vae_img'):
            os.mkdir('./vae_img')
        save_image(save, './vae_img/image_{}.png'.format(e + 1))
 # -
 # Here are the results from the variational autoencoder; they are much better than those of a plain autoencoder.
 #
 # ![](https://ws1.sinaimg.cn/large/006tKfTcgy1fn1ag8832zj306q0a2gmz.jpg)
 #
 # We can print the mean to take a look.
 # + {"ExecuteTime": {"start_time": "2018-01-01T10:40:36.463332Z", "end_time": "2018-01-01T10:40:36.481622Z"}}
 x, _ = train_set[0]
 x = x.view(x.shape[0], -1)
 if torch.cuda.is_available():
    x = x.cuda()
 x = Variable(x)
 _, mu, _ = net(x)
 # + {"ExecuteTime": {"start_time": "2018-01-01T10:40:37.485127Z", "end_time": "2018-01-01T10:40:37.490484Z"}}
 print(mu)
 # -
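 # Although the variational autoencoder works better than a plain autoencoder and also constrains the probability distribution of the code it outputs, it still builds its loss by directly computing the mean squared error between the generated image and the original, which is not ideal. In the next chapter on generative adversarial networks we discuss the limitations of computing the loss this way and introduce a different training method: training the network adversarially instead of directly comparing the per-pixel mean squared error of two images.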
--- a/6_pytorch/imgs/del/img1.png
+++ b/6_pytorch/imgs/del/img1.png
--- a/6_pytorch/imgs/del/img2.png
+++ b/6_pytorch/imgs/del/img2.png
--- a/README.md
+++ b/README.md
@@ -1,13 +1,13 @@
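 # Python and Machine Learning
 # Machine Learning
 This tutorial contains notebooks for learning machine learning with Python. It guides you quickly through Python, the common Python libraries, machine learning theory and hands-on programming, and shows how to solve practical problems.
 This tutorial covers the basic principles and implementation of machine learning. It guides you quickly through Python, the common Python libraries, machine learning theory and hands-on programming, and shows how to solve practical problems.
 Because **this course needs a lot of programming practice to be effective**, complete the homework and reports carefully. You may consult online material while doing the homework, but do not copy it directly; think independently and write the code yourself.
 Because **this course needs a lot of programming practice to be effective**, work carefully through the [homework and reports](https://gitee.com/pi-lab/machinelearning_homework). You may consult online material while doing the homework, but do not copy it directly; think independently and write the code yourself.
 ## Contents
 ## 1. Contents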
 1. [Python](0_python/)
   - [Install Python](tips/InstallPython.md)
   - [Introduction](0_python/0_Introduction.ipynb)
@@ -47,7 +47,7 @@
      - [optim/sgd](6_pytorch/1_NN/optimizer/sgd.ipynb)
      - [optim/adam](6_pytorch/1_NN/optimizer/adam.ipynb)
   - CNN
      - Add a basic usage introduction
      - [CNN simple demo](demo_code/3_CNN_MNIST.py)
      - [cnn/basic_conv](6_pytorch/2_CNN/basic_conv.ipynb)
      - [cnn/batch-normalization](6_pytorch/2_CNN/batch-normalization.ipynb)
      - [cnn/regularization](6_pytorch/2_CNN/regularization.ipynb)
@@ -67,7 +67,7 @@
 ## Other references
 ## 2. Other references
 * Quick reference
  * [Related learning references](References.md)
  * [Cheat sheets](tips/cheatsheet)
@@ -80,5 +80,5 @@
  * [Confusion Matrix](tips/confusion_matrix.ipynb)
  * [Datasets](tips/datasets.ipynb)
  * [Practical advice for building deep neural networks](tips/构建深度神经网络的一些实战建议.md)
  * [Intro to Deep Learning](./tips/Intro_to_Deep_Learning.pdf)
  * [Intro to Deep Learning](tips/Intro_to_Deep_Learning.pdf)
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,3 +1,43 @@
 #
 # pre-requirements (use python3.5 for better compatibility)
 #   sudo apt-get install python3.5 python3.5-dev 
 #   sudo apt-get install python3-tk
 #
 #
 # pip
 #   sudo apt-get install python-pip python3-pip
 #   pip install pip -U
 #   pip config set global.index-url 'https://mirrors.ustc.edu.cn/pypi/web/simple'
 #
 #
 # Install virtualenv
 #   pip install setuptools
 #   pip install virtualenv
 #   pip install virtualenvwrapper
 #   pip install virtualenvwrapper-win  # use this command on Windows
 #
 # Add following lines to `~/.bashrc`
 #   # virtualenv
 #   export VIRTUALENVWRAPPER_PYTHON=/usr/bin/python
 #   export WORKON_HOME=/home/bushuhui/virtualenv
 #   source /usr/local/bin/virtualenvwrapper.sh
 #
 # Usage:
 #   # create virtual env
 #   mkvirtualenv --python=/usr/local/python3.5.3/bin/python venv
 #
 #   # activate virtual env
 #   workon venv
 #
 #
 # Install the packages in this list:
 #   pip install -r requirements.txt
 #
 numpy
 matplotlib
 scikit-learn
--- a/tips/InstallPython.md
+++ b/tips/InstallPython.md
@@ -77,7 +77,7 @@ pip3 install torch torchvision
 ## 3. [Python tips](python/)
 - [Installing and using pip](pip.md)
 - [Installing and using virtualenv](virtualenv.md)
 - [A convenient virtualenv manager: virtualenv_wrapper](virtualenv_wrapper.md)
 - [Installing and using pip](python/pip.md)
 - [Installing and using virtualenv](python/virtualenv.md)
 - [A convenient virtualenv manager: virtualenv_wrapper](python/virtualenv_wrapper.md)
--- a/tips/datasets.py
+++ b/tips/datasets.py
@@ -1,228 +0,0 @@
 # -*- coding: utf-8 -*-
 # ---
 # jupyter:
 #   jupytext_format_version: '1.2'
 #   kernelspec:
 #     display_name: Python 3
 #     language: python
 #     name: python3
 #   language_info:
 #     codemirror_mode:
 #       name: ipython
 #       version: 3
 #     file_extension: .py
 #     mimetype: text/x-python
 #     name: python
 #     nbconvert_exporter: python
 #     pygments_lexer: ipython3
 #     version: 3.5.2
 # ---
 # ## Datasets
 # ## Moons
 #
 # +
 # %matplotlib inline
 import numpy as np
 from sklearn import datasets
 import matplotlib.pyplot as plt
 # generate sample data
 np.random.seed(0)
 X, y = datasets.make_moons(200, noise=0.20)
 # plot data
 plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral)
 plt.show()
 # -
 # ## XOR
 # +
 import numpy as np
 import matplotlib.pyplot as plt
 from sklearn.gaussian_process import GaussianProcessClassifier
 rng = np.random.RandomState(0)
 X = rng.randn(200, 2)
 Y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0)
 # plot data
 plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Spectral)
 plt.show()
 # -
 # ## Digits
 # +
 import matplotlib.pyplot as plt 
 from sklearn.datasets import load_digits
 # load data
 digits = load_digits()
 # copied from notebook 02_sklearn_data.ipynb
 fig = plt.figure(figsize=(6, 6))  # figure size in inches
 fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
 # plot the digits: each image is 8x8 pixels
 for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.binary)
    # label the image with the target value
    ax.text(0, 7, str(digits.target[i]))
 # -
 # ## Iris
 #
 # This data set consists of the petal and sepal lengths and widths of 3 different types of irises (Setosa, Versicolour, and Virginica), stored in a 150x4 numpy.ndarray
 #
 # The rows being the samples and the columns being: Sepal Length, Sepal Width, Petal Length and Petal Width.
 #
 # +
 import matplotlib.pyplot as plt
 from mpl_toolkits.mplot3d import Axes3D
 from sklearn import datasets
 from sklearn.decomposition import PCA
 # import some data to play with
 iris = datasets.load_iris()
 X = iris.data[:, :]  
 y = iris.target
 # Plot the samples
 plt.figure(figsize=(15, 5))
 plt.subplots_adjust(bottom=.05, top=.9, left=.05, right=.95)
 plt.subplot(121)
 plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1,
            edgecolor='k')
 plt.xlabel('Sepal length')
 plt.ylabel('Sepal width')
 plt.subplot(122)
 plt.scatter(X[:, 2], X[:, 3], c=y, cmap=plt.cm.Set1,
            edgecolor='k')
 plt.xlabel('Petal Length')
 plt.ylabel('Petal Width')
 plt.show()
 # +
 from sklearn.manifold import Isomap
 iso = Isomap(n_neighbors=5, n_components=2)
 proj = iso.fit_transform(X)
 plt.figure(figsize=(15, 9))
 plt.scatter(proj[:, 0], proj[:, 1], c=y)
 plt.colorbar()
 plt.show()
 # -
 # ## blobs
 #
 # +
 import matplotlib.pyplot as plt
 from sklearn.datasets import make_blobs
 # Generate 3 blobs with 2 classes where the second blob contains
 # half positive samples and half negative samples. Probability in this
 # blob is therefore 0.5.
 centers = [(-5, -5), (0, 0), (5, 5)]
 n_samples = 500
 X, y = make_blobs(n_samples=n_samples, n_features=2, cluster_std=1.0,
                  centers=centers, shuffle=False, random_state=42)
 plt.figure(figsize=(15, 9))
 plt.scatter(X[:, 0], X[:, 1], c=y)
 plt.colorbar()
 plt.show()
 # -
 # ## Circles
 # +
 # %matplotlib inline
 import numpy as np
 import matplotlib.pyplot as plt
 n = 200
 t1 = (np.random.rand(n, 1)*2-1)*np.pi
 r1 = 10 + (np.random.rand(n, 1)*2-1)*4
 x_1 = np.concatenate((r1 * np.cos(t1), r1 * np.sin(t1)), axis=1)
 y_1 = [0 for _ in range(n)]
 t2 = (np.random.rand(n, 1)*2-1)*np.pi
 r2 = 20 + (np.random.rand(n, 1)*2-1)*4
 x_2 = np.concatenate((r2 * np.cos(t2), r2 * np.sin(t2)), axis=1)
 y_2 = [1 for _ in range(n)]
 x = np.concatenate((x_1, x_2), axis=0)
 y = np.concatenate((y_1, y_2), axis=0)
 plt.scatter(x[:, 0], x[:,1], c=y)
 plt.show()
 yy = y.reshape(-1, 1)
 data = np.concatenate((x, yy), axis=1)
 np.savetxt("dataset_circles.csv", data, delimiter=",")
 # -
 # ## CIFAR-10 data
 #
 # CIFAR-10[^3] is a widely used color image dataset with 10 classes: 'airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck'. Each image is $3\times32\times32$, that is, a 3-channel color image at $32\times32$ resolution.
 #
 # [^3]: http://www.cs.toronto.edu/~kriz/cifar.html
 import torch as t  # needed for t.utils.data.DataLoader below
 import torchvision as tv
 import torchvision.transforms as transforms
 from torchvision.transforms import ToPILImage
 show = ToPILImage() # converts a Tensor to a PIL Image for easy visualization
 # +
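 # On the first run torchvision downloads the CIFAR-10 dataset automatically;
 # it is more than 100 MB, so this takes some time.
 # If CIFAR-10 is already downloaded, point to it with the root argument.
 # Define the data preprocessing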
 transform = transforms.Compose([
        transforms.ToTensor(), # convert to Tensor
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)), # normalize each channel to [-1, 1]
                             ])
 # training set
 trainset = tv.datasets.CIFAR10(
                    root='../data/', 
                    train=True, 
                    download=True,
                    transform=transform)
 trainloader = t.utils.data.DataLoader(
                    trainset, 
                    batch_size=4,
                    shuffle=True, 
                    num_workers=2)
 # test set
 testset = tv.datasets.CIFAR10(
                    '../data/',
                    train=False, 
                    download=True, 
                    transform=transform)
 testloader = t.utils.data.DataLoader(
                    testset,
                    batch_size=4, 
                    shuffle=False,
                    num_workers=2)
 classes = ('plane', 'car', 'bird', 'cat', 'deer', 
           'dog', 'frog', 'horse', 'ship', 'truck')
--- a/tips/notebook_tips.py
+++ b/tips/notebook_tips.py
@@ -1,149 +0,0 @@
 # ---
 # jupyter:
 #   jupytext_format_version: '1.2'
 #   kernelspec:
 #     display_name: Python 3
 #     language: python
 #     name: python3
 #   language_info:
 #     codemirror_mode:
 #       name: ipython
 #       version: 3
 #     file_extension: .py
 #     mimetype: text/x-python
 #     name: python
 #     nbconvert_exporter: python
 #     pygments_lexer: ipython3
 #     version: 3.5.2
 # ---
 # ## Show LaTeX equation
 #
 #
 from IPython.core.display import HTML
 HTML("""
 <style>
 div.cell { /* Tunes the space between cells */
 margin-top:1em;
 margin-bottom:1em;
 }
 div.text_cell_render h1 { /* Main titles bigger, centered */
 font-size: 2.2em;
 line-height:1.4em;
 text-align:center;
 }
 div.text_cell_render h2 { /*  Parts names nearer from text */
 margin-bottom: -0.4em;
 }
 div.text_cell_render { /* Customize text cells */
 font-family: 'Times New Roman';
 font-size:1.5em;
 line-height:1.4em;
 padding-left:3em;
 padding-right:3em;
 }
 </style>
 """)
 from IPython.display import Latex
 Latex(r"""\begin{eqnarray}
 \nabla \times \vec{\mathbf{B}} -\, \frac1c\, \frac{\partial\vec{\mathbf{E}}}{\partial t} & = \frac{4\pi}{c}\vec{\mathbf{j}} \\
 \nabla \cdot \vec{\mathbf{E}} & = 4 \pi \rho \\
 \nabla \times \vec{\mathbf{E}}\, +\, \frac1c\, \frac{\partial\vec{\mathbf{B}}}{\partial t} & = \vec{\mathbf{0}} \\
 \nabla \cdot \vec{\mathbf{B}} & = 0 
 \end{eqnarray}""")
 # %%latex
 \begin{align}
 \nabla \times \vec{\mathbf{B}} -\, \frac1c\, \frac{\partial\vec{\mathbf{E}}}{\partial t} & = \frac{4\pi}{c}\vec{\mathbf{j}} \\
 \nabla \cdot \vec{\mathbf{E}} & = 4 \pi \rho \\
 \nabla \times \vec{\mathbf{E}}\, +\, \frac1c\, \frac{\partial\vec{\mathbf{B}}}{\partial t} & = \vec{\mathbf{0}} \\
 \nabla \cdot \vec{\mathbf{B}} & = 0
 \end{align}
 # \begin{align}
 # \nabla \times \vec{\mathbf{B}} -\, \frac1c\, \frac{\partial\vec{\mathbf{E}}}{\partial t} & = \frac{4\pi}{c}\vec{\mathbf{j}} \\
 # \nabla \cdot \vec{\mathbf{E}} & = 4 \pi \rho \\
 # \nabla \times \vec{\mathbf{E}}\, +\, \frac1c\, \frac{\partial\vec{\mathbf{B}}}{\partial t} & = \vec{\mathbf{0}} \\
 # \nabla \cdot \vec{\mathbf{B}} & = 0
 # \end{align}
 #
 # \begin{equation}
 # E = F \cdot s 
 # \end{equation}
 #
 # \begin{eqnarray}
 # F & = & sin(x) \\
 # G & = & cos(x)
 # \end{eqnarray}
 #
 # \begin{align}
 #     g &= \int_a^b f(x)dx \label{eq1} \\
 #     a &= b + c \label{eq2}
 # \end{align}
 #
 # See (\ref{eq1})
 # ## Audio
 #
 from IPython.display import Audio
 Audio(url="http://www.nch.com.au/acm/8k16bitpcm.wav")
 # +
 import numpy as np
 max_time = 3
 f1 = 220.0
 f2 = 224.0
 rate = 8000.0
 L = 3
 times = np.linspace(0, L, int(rate * L))  # the sample count must be an integer
 signal = np.sin(2*np.pi*f1*times) + np.sin(2*np.pi*f2*times)
 Audio(data=signal, rate=rate)
 # -
 # ## External sites
 # + {"scrolled": true}
 from IPython.display import IFrame
 IFrame('https://jupyter.org', width='100%', height=350)
 # -
 # ## JupyterLab
 # +
 import numpy as np
 from pprint import pprint
 pp = pprint
 a = np.array([1, 2, 3])
 pp(a)
 # -
 # ### [jupyter-matplotlib](https://github.com/matplotlib/jupyter-matplotlib)
 #
 #
 # ```
 # # Installing Node.js 5.x on Ubuntu / Debian
 # curl -sL https://deb.nodesource.com/setup_5.x | sudo -E bash -
 # sudo apt-get install -y nodejs
 #
 # pip install ipympl
 #
 # # If using JupyterLab
 # # Install nodejs: https://nodejs.org/en/download/
 # jupyter labextension install @jupyter-widgets/jupyterlab-manager
 # jupyter labextension install jupyter-matplotlib
 # ```
 # ## References
 #
 # * https://nbviewer.jupyter.org/github/ipython/ipython/blob/master/examples/IPython%20Kernel/Index.ipynb