|
|
@@ -17,15 +17,13 @@ |
|
|
|
"\n", |
|
|
|
"## 1. 方法\n", |
|
|
|
"\n", |
|
|
|
"由于具有出色的速度和良好的可扩展性,K-Means聚类算法算得上是最著名的聚类方法。***K-Means算法是一个重复移动类中心点的过程,把类的中心点,也称重心(centroids)***:\n", |
|
|
|
"由于具有出色的速度和良好的可扩展性,K-Means聚类算法最经典的聚类方法。***k-Means算法是一个重复移动类中心点的过程,把类的中心点,也称重心(centroids)***:\n", |
|
|
|
"* 移动到其包含成员的平均位置;\n", |
|
|
|
"* 然后重新划分其内部成员。\n", |
|
|
|
"\n", |
|
|
|
"K是算法计算出的超参数,表示类的数量;K-Means可以自动分配样本到不同的类,但是不能决定究竟要分几个类。\n", |
|
|
|
"`k`是算法中的超参数,表示类的数量;k-Means可以自动分配样本到不同的类,但是不能决定究竟要分几个类。`k`必须是一个比训练集样本数小的正整数。有时,类的数量是由问题内容指定的。例如,一个鞋厂有三种新款式,它想知道每种新款式都有哪些潜在客户,于是它调研客户,然后从数据里找出三类。也有一些问题没有指定聚类的数量,最优的聚类数量是不确定的。\n", |
|
|
|
"\n", |
|
|
|
"K必须是一个比训练集样本数小的正整数。有时,类的数量是由问题内容指定的。例如,一个鞋厂有三种新款式,它想知道每种新款式都有哪些潜在客户,于是它调研客户,然后从数据里找出三类。也有一些问题没有指定聚类的数量,最优的聚类数量是不确定的。\n", |
|
|
|
"\n", |
|
|
|
"K-Means的参数是类的重心位置和其内部观测值的位置。与广义线性模型和决策树类似,K-Means参数的最优解也是以成本函数最小化为目标。K-Means成本函数公式如下:\n", |
|
|
|
"k-Means的参数是类的重心位置和其内部观测值的位置。与广义线性模型和决策树类似,k-Means参数的最优解也是以成本函数最小化为目标。k-Means成本函数公式如下:\n", |
|
|
|
"$$\n", |
|
|
|
"J = \\sum_{k=1}^{K} \\sum_{i \\in C_k} | x_i - u_k|^2\n", |
|
|
|
"$$\n", |
|
|
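A minimal NumPy sketch of the cost function above may help make the double sum concrete. The function name `kmeans_cost` and the arrays `labels` and `centroids` are hypothetical names chosen for this illustration; they are not part of the notebook.

```python
import numpy as np


def kmeans_cost(X, labels, centroids):
    """Sum over all samples of the squared distance to the assigned centroid.

    X         : (N, n) array of samples x_i
    labels    : (N,) integer array, labels[i] = k means x_i belongs to C_k
    centroids : (K, n) array of cluster centers u_k
    (All three names are assumptions made for this sketch.)
    """
    # centroids[labels] picks, for every sample, the centroid of its cluster
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))


# Toy data: two clusters of two points each, every point at distance 1
# from its cluster centroid.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

J = kmeans_cost(X, labels, centroids)
print(J)  # 4.0
```

Moving the centroids away from the cluster means can only increase this value, which is why the update step places each centroid at the mean of its members.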
@@ -36,13 +34,24 @@ |
|
|
|
"$$\n", |
|
|
|
"\n", |
|
|
|
"\n", |
|
|
|
"成本函数是各个类畸变程度(distortions)之和。每个类的畸变程度等于该类重心与其内部成员位置距离的平方和。若类内部的成员彼此间越紧凑则类的畸变程度越小,反之,若类内部的成员彼此间越分散则类的畸变程度越大。\n", |
|
|
|
"\n", |
|
|
|
"成本函数是各个类畸变程度(distortions)之和。每个类的畸变程度等于该类重心与其内部成员位置距离的平方和。若类内部的成员彼此间越紧凑则类的畸变程度越小,反之,若类内部的成员彼此间越分散则类的畸变程度越大。\n" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "markdown", |
|
|
|
"metadata": {}, |
|
|
|
"source": [ |
|
|
|
"## 2. 算法\n", |
|
|
|
"求解成本函数最小化的参数就是一个重复配置每个类包含的观测值,并不断移动类重心的过程。\n", |
|
|
|
"1. 首先,类的重心是随机确定的位置。实际上,重心位置等于随机选择的观测值的位置。\n", |
|
|
|
"2. 每次迭代的时候,K-Means会把观测值分配到离它们最近的类,然后把重心移动到该类全部成员位置的平均值那里。\n", |
|
|
|
"3. 若达到最大迭代步数或两次迭代差小于设定的阈值则算法结束,否则重复步骤2。\n", |
|
|
|
"\n", |
|
|
|
"输入:$T=\\{ x_1, x_2, ..., x_N\\}$,其中$x_i \\in R_n$,i=1,2...N,学习速率为η\n", |
|
|
|
"\n", |
|
|
|
"输出:聚类集合$C_k$, 聚类中心$u_k$, 其中k=1,2,...K\n", |
|
|
|
"\n", |
|
|
|
"1. 初始化类的重心,可以随机选择样本作为聚类中心\n", |
|
|
|
"2. 每次迭代的时候,把所有样本分配到离它们最近的类\n", |
|
|
|
"3. 然后把重心移动到该类全部成员位置的平均值那里\n", |
|
|
|
"4. 若达到最大迭代步数或两次迭代差小于设定的阈值则算法结束,否则重复步骤2\n", |
|
|
|
"\n" |
|
|
|
] |
|
|
|
}, |
|
|
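The four steps listed above can be sketched in vectorized NumPy as follows. This is an illustrative sketch under stated assumptions, not the notebook's `kmeans` function: the name `kmeans_sketch`, the parameters `max_iter`, `tol` and `seed`, and the choice to keep the old centroid when a cluster becomes empty are all assumptions made for the example.

```python
import numpy as np


def kmeans_sketch(X, k, max_iter=100, tol=1e-6, seed=0):
    """Hypothetical minimal k-means; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 1: use k distinct, randomly chosen samples as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)

    for _ in range(max_iter):
        # Step 2: squared distance from every sample to every centroid, (N, k)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)

        # Step 3: move each centroid to the mean of its members
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # assumption: an empty cluster keeps its centroid
                new_centroids[j] = members.mean(axis=0)

        # Step 4: stop once the largest centroid movement is below the threshold
        shift = np.abs(new_centroids - centroids).max()
        centroids = new_centroids
        if shift < tol:
            break

    # Recompute the assignment so labels match the returned centroids
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1), centroids


# Two well-separated groups of three points each
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```

The stopping rule here measures centroid movement; stopping when no sample changes its cluster is an equally valid criterion.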
@@ -304,6 +313,7 @@ |
|
|
|
"def kmeans(X, k):\n", |
|
|
|
" # 样本总数\n", |
|
|
|
" m = np.shape(X)[0]\n", |
|
|
|
" \n", |
|
|
|
" # 分配样本到最近的簇:存[簇序号,距离的平方] (m行 x 2 列)\n", |
|
|
|
" clusterAssment = np.zeros((m, 2))\n", |
|
|
|
"\n", |
|
|
@@ -311,30 +321,22 @@ |
|
|
|
" centroids = rand_cluster_cents(X, k)\n", |
|
|
|
" print('最初的中心=', centroids)\n", |
|
|
|
"\n", |
|
|
|
" # 初始化迭代次数计数器\n", |
|
|
|
" iterN = 0\n", |
|
|
|
" \n", |
|
|
|
" # 所有样本分配结果不再改变,迭代终止\n", |
|
|
|
" while True: \n", |
|
|
|
" # 标志位,如果迭代前后样本分类发生变化值为True,否则为False\n", |
|
|
|
" clusterChanged = False\n", |
|
|
|
" \n", |
|
|
|
" # step2:分配到最近的聚类中心对应的簇中\n", |
|
|
|
" for i in range(m):\n", |
|
|
|
" # 初始定义距离为无穷大\n", |
|
|
|
" minDist = np.inf;\n", |
|
|
|
" # 初始化索引值\n", |
|
|
|
" minIndex = -1\n", |
|
|
|
" # 计算每个样本与k个中心点距离\n", |
|
|
|
" for j in range(k):\n", |
|
|
|
" # 计算第i个样本到第j个中心点的距离\n", |
|
|
|
" distJI = calc_distance(centroids[j, :], X[i, :])\n", |
|
|
|
" # 判断距离是否为最小\n", |
|
|
|
" if distJI < minDist:\n", |
|
|
|
" # 更新获取到最小距离\n", |
|
|
|
" minDist = distJI\n", |
|
|
|
" # 获取对应的簇序号\n", |
|
|
|
" minIndex = j\n", |
|
|
|
" \n", |
|
|
|
" # 样本上次分配结果跟本次不一样,标志位clusterChanged置True\n", |
|
|
|
" if clusterAssment[i, 0] != minIndex:\n", |
|
|
|
" clusterChanged = True\n", |
|
|
@@ -346,9 +348,7 @@ |
|
|
|
" \n", |
|
|
|
" # step3:更新聚类中心\n", |
|
|
|
" for cent in range(k): # 样本分配结束后,重新计算聚类中心\n", |
|
|
|
" # 获取该簇所有的样本点,nonzero[0]表示A == cent的元素所在的行,如果没有[0],列也会表示\n", |
|
|
|
" ptsInClust = X[clusterAssment[:, 0] == cent, :]\n", |
|
|
|
" # 更新聚类中心:axis=0沿列方向求均值。\n", |
|
|
|
" centroids[cent, :] = np.mean(ptsInClust, axis=0)\n", |
|
|
|
" \n", |
|
|
|
" # 如果聚类重心没有发生改变,则退出迭代\n", |
|
|
@@ -626,7 +626,9 @@ |
|
|
|
"the adjusted index is:\n", |
|
|
|
"\n", |
|
|
|
"\n", |
|
|
|
"* [ARI reference](https://davetang.org/muse/2017/09/21/adjusted-rand-index/)" |
|
|
|
"* [ARI reference](https://davetang.org/muse/2017/09/21/adjusted-rand-index/)\n", |
|
|
|
"* [聚类性能评估-ARI(调兰德指数)](https://zhuanlan.zhihu.com/p/145856959)\n", |
|
|
|
"* [ARI聚类效果评价指标](https://blog.csdn.net/qtlyx/article/details/52678895)" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|