
change some contents and add english version to knn

pull/1/MERGE
Xiaofeng Mao, 5 years ago
commit 79eeefd9ae
2 changed files with 27 additions and 23 deletions

  1. 2_knn/knn_classification.ipynb  (+10 -9)
  2. 2_knn/knn_classification_EN.ipynb  (+17 -14)

2_knn/knn_classification.ipynb  (+10 -9)

@@ -11,7 +11,7 @@
"\n",
"kNN算法不仅可以用于分类,还可以用于回归。通过找出一个样本的k个最近邻居,将这些邻居的属性的平均值赋给该样本,就可以得到该样本的属性。更有用的方法是将不同距离的邻居对该样本产生的影响给予不同的权值(weight),如权值与距离成正比(组合函数)。\n",
"\n",
"该算法在分类时有个主要的不足是,当样本不平衡时,如一个类的样本容量很大,而其他类样本容量很小时,有可能导致当输入一个新样本时,该样本的K个邻居中大容量类的样本占多数。 该算法只计算“最近的”邻居样本,某一类的样本数量很大,那么或者这类样本并不接近目标样本,或者这类样本很靠近目标样本。无论怎样,数量并不能影响运行结果。可以采用权值的方法(和该样本距离小的邻居权值大)来改进。该方法的另一个不足之处是计算量较大,因为对每一个待分类的文本都要计算它到全体已知样本的距离,才能求得它的K个最近邻点。目前常用的解决方法是事先对已知样本点进行剪辑,事先去除对分类作用不大的样本。该算法比较适用于样本容量比较大的类域的自动分类,而那些样本容量较小的类域采用这种算法比较容易产生误分。\n",
"该算法在分类时有个主要的不足是,当样本不平衡时,如一个类的样本容量很大,而其他类样本容量很小时,有可能导致当输入一个新样本时,该样本的K个邻居中大容量类的样本占多数。在这种情况下可能会产生误判的结果。因此我们需要减少数量对运行结果的影响。可以采用权值的方法(和该样本距离小的邻居权值大)来改进。该方法的另一个不足之处是计算量较大,因为对每一个待分类的文本都要计算它到全体已知样本的距离,才能求得它的K个最近邻点。目前常用的解决方法是事先对已知样本点进行剪辑,事先去除对分类作用不大的样本。该算法比较适用于样本容量比较大的类域的自动分类,而那些样本容量较小的类域采用这种算法比较容易产生误分。\n",
"\n",
"k-NN可以说是一种最直接的用来分类未知数据的方法。基本通过下面这张图跟文字说明就可以明白K-NN是干什么的\n",
"![knn](images/knn.png)\n",
@@ -25,13 +25,14 @@
"source": [
"## 算法步骤:(FIXME: 把流程再细化一下,循环需要体现的更好)\n",
"\n",
"* step.1---初始化距离为最大值\n",
"* step.2---计算未知样本和每个训练样本的距离dist\n",
"* step.3---得到目前K个最临近样本中的最大距离maxdist\n",
"* step.4---如果dist小于maxdist,则将该训练样本作为K-最近邻样本\n",
"* step.5---重复步骤2、3、4,直到未知样本和所有训练样本的距离都算完\n",
"* step.6---统计K-最近邻样本中每个类标号出现的次数\n",
"* step.7---选择出现频率最大的类标号作为未知样本的类标号"
"* step.1---导入训练样本\n",
"* step.2---将样本的特征转化为数据\n",
"* step.3---计算未知样本和训练样本的距离dist\n",
"* step.4---记录位置样本和训练样本得距离及其所属于得分类\n",
"* step.5---重复步骤2、3,直到未知样本和所有训练样本的距离都算完\n",
"* step.6---将训练样本按照与未知样本的距离进行排序,找出其中K个最近的样本\n",
"* step.7---统计K-最近邻样本中每个类标号出现的次数\n",
"* step.8---选择出现频率最大的类标号作为未知样本的类标号"
]
},
{
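The eight steps just added map almost line for line onto a short from-scratch implementation. This is only a sketch under the usual assumptions (numeric features, Euclidean distance); the function name and toy data are made up for illustration.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_unknown, k=3):
    # steps 3-5: compute the distance from the unknown sample to every
    # training sample and record it together with that sample's class label
    dist_label = [(np.linalg.norm(xi - x_unknown), yi)
                  for xi, yi in zip(X_train, y_train)]
    # step 6: sort by distance and keep the K nearest samples
    k_nearest = sorted(dist_label, key=lambda t: t[0])[:k]
    # step 7: count how often each class label occurs among the K neighbors
    counts = Counter(label for _, label in k_nearest)
    # step 8: the most frequent label becomes the prediction
    return counts.most_common(1)[0][0]

# steps 1-2: training samples whose features are already numeric
X_train = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
y_train = ['A', 'A', 'B', 'B']
print(knn_classify(X_train, y_train, np.array([0.2, 0.1]), k=3))  # -> 'B'
```

Sorting all distances is O(n log n); the older step list avoided the full sort by tracking a running maximum among the current K candidates, and a heap would achieve the same here.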
@@ -333,7 +334,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
"version": "3.6.5"
}
},
"nbformat": 4,


2_knn/knn_classification_EN.ipynb  (+17 -14)

@@ -7,31 +7,34 @@
"# kNN Classification\n",
"\n",
"\n",
"K最近邻(k-Nearest Neighbor,kNN)分类算法,是一个理论上比较成熟的方法,也是最简单的机器学习算法之一。该方法的思路是:***如果一个样本在特征空间中的k个最相似(即特征空间中最邻近)的样本中的大多数属于某一个类别,则该样本也属于这个类别***。KNN方法虽然从原理上也依赖于极限定理,但在类别决策时,只与极少量的相邻样本有关。由于KNN方法主要靠周围有限的邻近的样本,而不是靠判别类域的方法来确定所属类别的,因此对于类域的交叉或重叠较多的待分样本集来说,KNN方法较其他方法更为适合。\n",
"K-Nearest Neighbor (kNN) classification algorithm is a mature method in theory and one of the simplest machine learning algorithms. The idea of this method is:***If a sample has k most similar smaples(have a shortest distance in characteristic space) which are mostly belong to a category, then the sample also belongs to this category.*** Although KNN method also depends on limit theorem in principle, it is only related to a very small number of adjacent samples when making category decisions. Because the KNN method mainly depends on the limited neighboring samples, rather than the method of judging the class domain, the KNN method is more suitable than other methods for the sample set which has more overlapping or overlapping class domains. \n",
"\n",
"kNN算法不仅可以用于分类,还可以用于回归。通过找出一个样本的k个最近邻居,将这些邻居的属性的平均值赋给该样本,就可以得到该样本的属性。更有用的方法是将不同距离的邻居对该样本产生的影响给予不同的权值(weight),如权值与距离成正比(组合函数)。\n",
"KNN algorithm can be used not only for classification, but also for regression. KNN algorithm can be used not only for classification, but also for regression. The attributes of a sample can be obtained by finding out k nearest neighbors of the sample and assigning the average value of the attributes of these neighbors to the sample. A more useful method is to give different weights to the influence of neighbors with different distances on the sample, for example, the weights are proportional to the distance (combinatorial function).\n",
"\n",
"该算法在分类时有个主要的不足是,当样本不平衡时,如一个类的样本容量很大,而其他类样本容量很小时,有可能导致当输入一个新样本时,该样本的K个邻居中大容量类的样本占多数。 该算法只计算“最近的”邻居样本,某一类的样本数量很大,那么或者这类样本并不接近目标样本,或者这类样本很靠近目标样本。无论怎样,数量并不能影响运行结果。可以采用权值的方法(和该样本距离小的邻居权值大)来改进。该方法的另一个不足之处是计算量较大,因为对每一个待分类的文本都要计算它到全体已知样本的距离,才能求得它的K个最近邻点。目前常用的解决方法是事先对已知样本点进行剪辑,事先去除对分类作用不大的样本。该算法比较适用于样本容量比较大的类域的自动分类,而那些样本容量较小的类域采用这种算法比较容易产生误分。\n",
"The main disadvantage of this algorithm in classification is that when the samples are unbalanced, for example, the sample size of one class is very large, while the sample size of other classes is very small, which may lead to a large number samples contains most in the k neighbors of the sample when a new sample is input.In this case, the result of misjudgment may be produced. Therefore, we need to reduce the influence of quantity on operation results. \n",
"another disadvantage of this method is that it is computationally intensive, because the distance between each text to be classified and all known samples must be calculated in order to obtain its K nearest neighbors. At present, the commonly used solution is to clip the known sample points in advance and remove the samples which have little effect on classification in advance. This algorithm is more suitable for automatic classification of class domains with large sample size, while those with small sample size are more prone to mismatching.\n",
"\n",
"k-NN可以说是一种最直接的用来分类未知数据的方法。基本通过下面这张图跟文字说明就可以明白K-NN是干什么的\n",
"K-NN is the most direct method to classify unknown data. Basically, you can understand what K-NN does through the following picture and text description\n",
"![knn](images/knn.png)\n",
"\n",
"简单来说,k-NN可以看成:**有那么一堆你已经知道分类的数据,然后当一个新数据进入的时候,就开始跟训练数据里的每个点求距离,然后挑离这个训练数据最近的K个点看看这几个点属于什么类型,然后用少数服从多数的原则,给新数据归类**。\n"
"In short,k-NN can be seen as:**While you have a set of data which have already being sorted, \n",
"There is a pile of data that you already know the classification, and then when a new data enters, you start to find the distance from each point in the training data, and then pick the K points closest to this training data to see what these points belong to, and then classify the new data with the principle that the minority obeys the majority.**\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 算法步骤:(FIXME: 把流程再细化一下,循环需要体现的更好)\n",
"## Algorithm steps:\n",
"\n",
"* step.1---初始化距离为最大值\n",
"* step.2---计算未知样本和每个训练样本的距离dist\n",
"* step.3---得到目前K个最临近样本中的最大距离maxdist\n",
"* step.4---如果dist小于maxdist,则将该训练样本作为K-最近邻样本\n",
"* step.5---重复步骤2、3、4,直到未知样本和所有训练样本的距离都算完\n",
"* step.6---统计K-最近邻样本中每个类标号出现的次数\n",
"* step.7---选择出现频率最大的类标号作为未知样本的类标号"
"* step.1---Import training sample\n",
"* step.2---Transfer the featuresof the sample into numbers\n",
"* step.3---Calculate the distance between unkonwn sample and training sample.\n",
"* step.4---Record the distace calculated in step 3 and save the category which training sample belong.\n",
"* step.5---Repeat step2,3, until we calculate all the distance.\n",
"* step.6---Sort the training data according to the distance with unkonwn sample and find the K nearest sample.\n",
"* step.7---Count the number of occurrences of each class label in the K-nearest neighbor sample\n",
"* step.8---Choose the label with the highest occurrence frequency as the class label of the unknown sample"
]
},
{
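For comparison, the same procedure is available off the shelf. A brief usage example with scikit-learn's KNeighborsClassifier (the toy data is made up for illustration); `weights='distance'` turns on the inverse-distance voting discussed earlier.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
y_train = np.array(['A', 'A', 'B', 'B'])

# weights='distance' lets closer neighbors cast larger votes,
# which mitigates the class-imbalance issue described above
clf = KNeighborsClassifier(n_neighbors=3, weights='distance')
clf.fit(X_train, y_train)
print(clf.predict([[0.2, 0.1]]))  # -> ['B']
```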
@@ -333,7 +336,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
"version": "3.6.5"
}
},
"nbformat": 4,

