
Improve algorithm description

pull/5/head
bushuhui 3 years ago
parent commit bbf54263e2
3 changed files with 45 additions and 107 deletions
  1. +19 −83 2_knn/knn_classification.ipynb
  2. +24 −22 3_kmeans/1-k-means.ipynb
  3. +2 −2 5_nn/1-Perceptron.ipynb

+19 −83 2_knn/knn_classification.ipynb
File diff suppressed because it is too large


+24 −22 3_kmeans/1-k-means.ipynb

@@ -17,15 +17,13 @@
"\n",
"## 1. Method\n",
"\n",
"Owing to its excellent speed and good scalability, the K-Means clustering algorithm counts among the best-known clustering methods. ***The K-Means algorithm is a process of repeatedly moving the cluster centers, also called centroids***:\n",
"Owing to its excellent speed and good scalability, the k-Means clustering algorithm is among the most classic clustering methods. ***The k-Means algorithm is a process of repeatedly moving the cluster centers, also called centroids***:\n",
"* moving each centroid to the mean position of its member samples;\n",
"* then reassigning its member samples.\n",
"\n",
"`K` is a hyperparameter computed by the algorithm, denoting the number of clusters; K-Means can assign samples to clusters automatically, but cannot decide how many clusters there should be.\n",
"`k` is a hyperparameter of the algorithm, denoting the number of clusters; k-Means can assign samples to clusters automatically, but cannot decide how many clusters to use. `k` must be a positive integer smaller than the number of training samples. Sometimes the number of clusters is dictated by the problem itself: for example, a shoe factory with three new styles wants to know the potential customers of each style, so it surveys its customers and finds three groups in the data. Other problems do not specify the number of clusters, and the optimal number is then uncertain.\n",
"\n",
"`K` must be a positive integer smaller than the number of training samples. Sometimes the number of clusters is dictated by the problem itself: for example, a shoe factory with three new styles wants to know the potential customers of each style, so it surveys its customers and finds three groups in the data. Other problems do not specify the number of clusters, and the optimal number is then uncertain.\n",
"\n",
"The parameters of K-Means are the centroid positions and the positions of the samples inside each cluster. As with generalized linear models and decision trees, the optimal K-Means parameters are those that minimize a cost function. The K-Means cost function is:\n",
"The parameters of k-Means are the centroid positions and the positions of the samples inside each cluster. As with generalized linear models and decision trees, the optimal k-Means parameters are those that minimize a cost function. The k-Means cost function is:\n",
"$$\n",
"J = \\sum_{k=1}^{K} \\sum_{i \\in C_k} | x_i - u_k|^2\n",
"$$\n",
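The cost $J$ sums, over all clusters, the squared distances from each member to its centroid. A minimal NumPy sketch of computing this distortion (the names `X`, `labels`, and `centroids` are mine, not the notebook's):

```python
import numpy as np

def distortion(X, labels, centroids):
    # sum over clusters of squared distances between members and their centroid
    J = 0.0
    for j in range(len(centroids)):
        members = X[labels == j]
        J += np.sum((members - centroids[j]) ** 2)
    return J

# two clusters, each point 1 unit from its centroid -> J = 4.0
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])
print(distortion(X, labels, centroids))  # 4.0
```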
@@ -36,13 +34,24 @@
"$$\n",
"\n",
"\n",
"The cost function is the sum of each cluster's distortion. A cluster's distortion equals the sum of squared distances between its centroid and its member samples. The more tightly packed a cluster's members are, the smaller its distortion; the more scattered they are, the larger its distortion.\n",
"\n",
"The cost function is the sum of each cluster's distortion. A cluster's distortion equals the sum of squared distances between its centroid and its member samples. The more tightly packed a cluster's members are, the smaller its distortion; the more scattered they are, the larger its distortion.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Algorithm\n",
"Finding the parameters that minimize the cost function is a process of repeatedly reassigning the samples in each cluster and moving the cluster centroids.\n",
"1. First, the centroids are placed at random positions; in practice, each centroid position equals the position of a randomly chosen sample.\n",
"2. In each iteration, K-Means assigns each sample to its nearest cluster, then moves each centroid to the mean position of that cluster's members.\n",
"3. The algorithm stops when the maximum number of iterations is reached or the change between two iterations falls below a set threshold; otherwise step 2 is repeated.\n",
"\n",
"输入:$T=\\{ x_1, x_2, ..., x_N\\}$,其中$x_i \\in R_n$,i=1,2...N,学习速率为η\n",
"\n",
"输出:聚类集合$C_k$, 聚类中心$u_k$, 其中k=1,2,...K\n",
"\n",
"1. 初始化类的重心,可以随机选择样本作为聚类中心\n",
"2. 每次迭代的时候,把所有样本分配到离它们最近的类\n",
"3. 然后把重心移动到该类全部成员位置的平均值那里\n",
"4. 若达到最大迭代步数或两次迭代差小于设定的阈值则算法结束,否则重复步骤2\n",
"\n"
]
},
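The four steps above can be sketched as a compact, self-contained NumPy version (a hypothetical simplification of the notebook's own `kmeans` shown further down; it assumes no cluster goes empty):

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: use k randomly chosen samples as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # step 2: assign every sample to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: move each centroid to the mean of its members
        # (assumes every cluster keeps at least one member)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # step 4: stop once the centroids barely move, else repeat step 2
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[0.0, 0.0], [0.0, 0.2], [10.0, 0.0], [10.0, 0.2]])
labels, centroids = kmeans_sketch(X, 2)
```

As the text notes, the result depends on the random initialization; restarting with several seeds and keeping the lowest-cost solution is the usual remedy.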
@@ -304,6 +313,7 @@
"def kmeans(X, k):\n",
" # total number of samples\n",
" m = np.shape(X)[0]\n",
" \n",
" # assignment of each sample to its nearest cluster: stores [cluster index, squared distance] (m rows x 2 cols)\n",
" clusterAssment = np.zeros((m, 2))\n",
"\n",
@@ -311,30 +321,22 @@
" centroids = rand_cluster_cents(X, k)\n",
" print('initial centroids =', centroids)\n",
"\n",
" # initialize the iteration counter\n",
" iterN = 0\n",
" \n",
" # iterate until no sample's cluster assignment changes\n",
" while True: \n",
" # flag: True if any sample's assignment changed between iterations\n",
" clusterChanged = False\n",
" \n",
" # step 2: assign each sample to the cluster of its nearest centroid\n",
" for i in range(m):\n",
" # start with an infinite minimum distance\n",
" minDist = np.inf\n",
" # and an initial invalid cluster index\n",
" minIndex = -1\n",
" # compute the distance from sample i to each of the k centroids\n",
" for j in range(k):\n",
" # distance from sample i to centroid j\n",
" distJI = calc_distance(centroids[j, :], X[i, :])\n",
" # keep the smallest distance seen so far\n",
" if distJI < minDist:\n",
" # update the minimum distance\n",
" minDist = distJI\n",
" # and remember its cluster index\n",
" minIndex = j\n",
" \n",
" # if sample i's assignment differs from the last iteration, set clusterChanged\n",
" if clusterAssment[i, 0] != minIndex:\n",
" clusterChanged = True\n",
@@ -346,9 +348,7 @@
" \n",
" # step 3: update the cluster centroids\n",
" for cent in range(k): # after assignment, recompute each cluster's centroid\n",
" # get all samples of this cluster; nonzero()[0] gives the rows where A == cent (without [0], columns are returned too)\n",
" ptsInClust = X[clusterAssment[:, 0] == cent, :]\n",
" # update the centroid: axis=0 takes the mean along the columns\n",
" centroids[cent, :] = np.mean(ptsInClust, axis=0)\n",
" \n",
" # if the cluster centroids no longer change, exit the loop\n",
@@ -626,7 +626,9 @@
"The adjusted index is:\n",
"![ARI_define](images/ARI_define.png)\n",
"\n",
"* [ARI reference](https://davetang.org/muse/2017/09/21/adjusted-rand-index/)"
"* [ARI reference](https://davetang.org/muse/2017/09/21/adjusted-rand-index/)\n",
"* [Clustering evaluation with ARI (Adjusted Rand Index)](https://zhuanlan.zhihu.com/p/145856959)\n",
"* [ARI as a clustering quality metric](https://blog.csdn.net/qtlyx/article/details/52678895)"
]
},
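The ARI defined in the image above can be computed directly from the pair-counting contingency table; a minimal pure-Python sketch (the function name is mine):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    n = len(labels_true)
    # contingency counts n_ij plus the row sums a_i and column sums b_j
    nij = Counter(zip(labels_true, labels_pred))
    a = Counter(labels_true)
    b = Counter(labels_pred)
    index = sum(comb(c, 2) for c in nij.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (index - expected) / (max_index - expected)

# identical partitions (up to relabeling) score 1.0
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

`sklearn.metrics.adjusted_rand_score` computes the same quantity and is the usual choice in practice.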
{


+2 −2 5_nn/1-Perceptron.ipynb

@@ -129,9 +129,9 @@
"### 4.1 Algorithm\n",
"\n",
"\n",
"Input: T={(x1,y1),(x2,y2)...(xN,yN)} (where xi∈X=Rn, yi∈Y={-1, +1}, i=1,2...N; learning rate η)\n",
"Input: $T=\{(x_1,y_1),(x_2,y_2), ..., (x_N,y_N)\}$, where $x_i \in X=R^n$, $y_i \in Y = \{-1, +1\}$, $i=1,2,...,N$; learning rate $\eta$\n",
"\n",
"Output: w, b; the perceptron model f(x)=sign(w·x+b)\n",
"Output: $w$, $b$; the perceptron model $f(x)=\mathrm{sign}(w \cdot x+b)$\n",
"\n",
"1. Initialize $w_0$, $b_0$\n",
"2. Pick a sample $(x_i, y_i)$ from the training set\n",
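The diff cuts the list off after step 2; the remaining steps are the standard perceptron update, $w \leftarrow w + \eta y_i x_i$ and $b \leftarrow b + \eta y_i$ whenever $y_i(w \cdot x_i + b) \le 0$, repeated until no sample is misclassified. A minimal sketch on a small linearly separable set (data chosen by me for illustration):

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=100):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            # a sample is misclassified when y_i (w·x_i + b) <= 0
            if yi * (np.dot(w, xi) + b) <= 0:
                w += eta * yi * xi   # w <- w + eta * y_i * x_i
                b += eta * yi        # b <- b + eta * y_i
                mistakes += 1
        if mistakes == 0:            # converged: every sample classified correctly
            break
    return w, b

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
w, b = perceptron_train(X, y)
print(w, b)  # w = [1. 1.], b = -3.0
```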

