The K-Means Algorithm

  1. Decide the number of clusters k expected in the sample

  2. Choose a set of k random centroids

  3. Assign each observation to its nearest centroid, where "nearest" is defined by the chosen distance metric

  4. Compute the center (mean) of each cluster

  5. Re-check the assignment of every observation; if an observation is closer to a different center, reassign it to that cluster

  6. Repeat steps 3-5 until no reassignments occur (a minimal sketch of these steps follows below)
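To make these steps concrete, the sketch below implements them directly in R; the function name my_kmeans and its arguments are illustrative assumptions, and in practice the built-in kmeans() (or fpc::kmeansruns(), used in the next section) is the better choice.

my_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # step 2: k random centroids
  cluster <- integer(nrow(x))
  for (i in seq_len(max_iter)) {
    # steps 3 and 5: squared Euclidean distance from every observation to every centroid
    d <- sapply(seq_len(k), function(j) colSums((t(x) - centers[j, ])^2))
    new_cluster <- max.col(-d)              # nearest centroid wins
    if (all(new_cluster == cluster)) break  # step 6: stop when no observation moves
    cluster <- new_cluster
    for (j in seq_len(k)) {                 # step 4: recompute each cluster mean
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centers)
}

For example, my_kmeans(scale(iris[, 1:4]), k = 3) groups the iris measurements into three clusters.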

A K-Means Example in R

rm(list = ls())
library(fpc)  # use the kmeansruns function from the fpc package
## Warning: package 'fpc' was built under R version 4.2.3
data('Vehicle', package = 'mlbench')  # the Vehicle data set from mlbench: classify vehicles from their silhouette measurements
data <- Vehicle[,-19]  # drop column 19, which holds the vehicle class label
set.seed(2023)
class <- Vehicle[,19]
levels(class)[levels(class)=='saab'] <- 'car'
levels(class)[levels(class)=='opel'] <- 'car'
fit <- kmeansruns(data, krange = 2:8, critout = T, runs = 100, scaledata = T, criterion = 'ch')  # krange sets the candidate numbers of clusters to 2-8, runs requests 100 different random initializations, and scaledata standardizes the observations before clustering
## 2  clusters  588.9425 
## 3  clusters  456.7939 
## 4  clusters  433.9752 
## 5  clusters  381.9294 
## 6  clusters  352.7663 
## 7  clusters  331.3718 
## 8  clusters  309.9998
# The output above is the Calinski-Harabasz index for each candidate k; the optimal number of clusters is the one with the largest value
# Compare the cluster assignments with the class labels
tab <- table(fit$cluster, class)
tab
##    class
##     bus car van
##   1 161 195 195
##   2  57 234   4
# Cluster 1 overlaps heavily with all three classes; cluster 2 is most plausibly car
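As a brief follow-up (a hedged sketch, not part of the original output): the object returned by kmeansruns() also stores the chosen k and the criterion values, assuming the bestk and crit components documented for fpc::kmeansruns (crit is indexed by k, with 0 for values of k that were not tried), so the CH curve printed above can be inspected or plotted directly.

fit$bestk                           # the k with the largest Calinski-Harabasz index
plot(seq_along(fit$crit), fit$crit, type = 'b',
     xlab = 'number of clusters k', ylab = 'Calinski-Harabasz index')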

Hierarchical Cluster Analysis

data <- scale(data)
library(pvclust)
## Warning: package 'pvclust' was built under R version 4.2.3
set.seed(2023)
fit <- pvclust(data, method.hclust = 'ward.D', nboot = 5000, method.dist = 'euclidean')  # pvclust performs hierarchical clustering through the standard R function hclust, but adds bootstrap resampling to better assess the clusters (note that pvclust clusters the columns of the data matrix); Ward's method is used for agglomeration, 5000 bootstrap replicates are drawn, and Euclidean distance builds the distance matrix
## Bootstrap (r = 0.5)... Done.
## Bootstrap (r = 0.6)... Done.
## Bootstrap (r = 0.7)... Done.
## Bootstrap (r = 0.8)... Done.
## Bootstrap (r = 0.9)... Done.
## Bootstrap (r = 1.0)... Done.
## Bootstrap (r = 1.1)... Done.
## Bootstrap (r = 1.2)... Done.
## Bootstrap (r = 1.3)... Done.
## Bootstrap (r = 1.4)... Done.
print(fit)
## 
## Cluster method: ward.D
## Distance      : euclidean
## 
## Estimates on edges:
## 
##       si    au    bp se.si se.au se.bp      v      c  pchi
## 1  1.000 1.000 1.000 0.000 0.000 0.000  0.000  0.000 0.000
## 2  1.000 1.000 1.000 0.000 0.000 0.000  0.000  0.000 0.000
## 3  1.000 1.000 1.000 0.000 0.000 0.000  0.000  0.000 0.000
## 4  0.893 0.942 0.963 0.011 0.007 0.001 -1.677 -0.107 0.704
## 5  1.000 1.000 1.000 0.000 0.000 0.000  0.000  0.000 0.000
## 6  1.000 1.000 1.000 0.000 0.000 0.000  0.000  0.000 0.000
## 7  0.747 0.897 0.788 0.012 0.007 0.002 -1.032  0.231 0.203
## 8  0.789 0.941 0.656 0.011 0.004 0.002 -0.980  0.580 0.055
## 9  0.742 0.851 0.924 0.016 0.011 0.001 -1.236 -0.195 0.017
## 10 0.767 0.877 0.902 0.014 0.009 0.001 -1.228 -0.067 0.073
## 11 0.482 0.723 0.779 0.017 0.013 0.002 -0.679 -0.089 0.819
## 12 0.777 0.916 0.776 0.011 0.006 0.002 -1.068  0.308 0.128
## 13 0.767 0.877 0.902 0.014 0.009 0.001 -1.228 -0.067 0.073
## 14 0.539 0.862 0.515 0.017 0.008 0.002 -0.563  0.526 0.499
## 15 0.778 0.920 0.754 0.011 0.005 0.002 -1.047  0.358 0.576
## 16 0.767 0.877 0.902 0.014 0.009 0.001 -1.228 -0.067 0.073
## 17 1.000 1.000 1.000 0.000 0.000 0.000  0.000  0.000 0.000
plot(fit)  # dendrogram annotated with AU/BP p-values
pvrect(fit, alpha = 0.95)  # draw rectangles around clusters with AU p-value >= 0.95
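As an optional follow-up (a hedged sketch): pvclust also provides pvpick(), which returns the groups highlighted by pvrect() as a list, using the same AU p-value threshold given by alpha.

sig <- pvpick(fit, alpha = 0.95)
sig$clusters  # variables grouped into clusters whose AU p-value is at least 0.95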

Model-Based Clustering

# Use the mclust package to compute the Bayesian information criterion (BIC) for 14 Gaussian mixture model parameterizations and pick the best one
data('Vehicle', package = 'mlbench')
data <- Vehicle[,-19]
library(mclust)
## Warning: package 'mclust' was built under R version 4.2.3
## Package 'mclust' version 6.0.0
## Type 'citation("mclust")' for citing this R package in publications.
set.seed(2023)
fit <- Mclust(as.matrix(data), G = 1:8)
fit
## 'Mclust' model object: (VEV,6) 
## 
## Available components: 
##  [1] "call"           "data"           "modelName"      "n"             
##  [5] "d"              "G"              "BIC"            "loglik"        
##  [9] "df"             "bic"            "icl"            "hypvol"        
## [13] "parameters"     "z"              "classification" "uncertainty"
fit3 <- Mclust(as.matrix(data), G = 3, modelNames = 'VEV')  # refit with 3 components, matching the three merged classes, and the VEV covariance model
class <- Vehicle[,19]
levels(class)[levels(class)=='saab'] <- 'car'
levels(class)[levels(class)=='opel'] <- 'car'
table(fit3$classification, class)  # cross-tabulate the model-based clusters against the class labels
##    class
##     bus car van
##   1  78 156 138
##   2  49 207   6
##   3  91  66  55
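To summarize the agreement in the table with a single number, mclust's adjustedRandIndex() can be applied to the clustering and the class labels (a hedged follow-up, not part of the original output); values near 1 indicate strong agreement, while values near 0 indicate essentially random assignment.

adjustedRandIndex(fit3$classification, class)  # adjusted Rand index between clusters and labels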