1. Decide on the expected number of clusters, k, for the sample.
2. Select a set of k random centroids.
3. Assign each observation to its nearest centroid, where "nearest" is determined by the chosen distance metric.
4. Compute the center (mean) of each cluster.
5. Re-check the assignment of every observation; if an observation is closer to another center, reassign it to that cluster.
6. Repeat steps 3-5 until no assignments change (a minimal base-R sketch of these steps follows this list).
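The sketch below only illustrates steps 1-6; the helper my_kmeans is hypothetical and for illustration, while the actual analysis below relies on kmeansruns() from the fpc package.
#Hypothetical helper illustrating the k-means steps; empty clusters and other edge cases are not handled
my_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]#step 2: k random observations as initial centroids
  assignment <- rep(0L, nrow(x))
  for (i in seq_len(max_iter)) {
    #steps 3 and 5: distance from every observation to every centroid, assign to the nearest one
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k, drop = FALSE]
    new_assignment <- max.col(-d)
    if (all(new_assignment == assignment)) break#step 6: stop once no assignment changes
    assignment <- new_assignment
    #step 4: recompute each centroid as the mean of its current members
    centers <- do.call(rbind, lapply(seq_len(k), function(j)
      colMeans(x[assignment == j, , drop = FALSE])))
  }
  list(cluster = assignment, centers = centers)
}
#Example call (step 1 is choosing k): my_kmeans(scale(data), k = 3), after the data preparation below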
rm(list = ls())
library(fpc)#the kmeansruns() function from the fpc package is used below
data('Vehicle', package = 'mlbench')#the Vehicle data set from mlbench: shape features of vehicle silhouettes to be classified by vehicle type
data <- Vehicle[,-19]#drop the 19th column, which holds the class labels
set.seed(2023)
class <- Vehicle[,19]#the true class labels
levels(class)[levels(class)=='saab'] <- 'car'#merge the two car makes, saab and opel,
levels(class)[levels(class)=='opel'] <- 'car'#into a single 'car' class
fit <- kmeansruns(data, krange = 2:8, critout = T, runs = 100, scaledata = T, criterion = 'ch')#krange tries 2 to 8 clusters, runs starts k-means from 100 different random initialisations, scaledata standardises the observations, criterion = 'ch' selects k by the Calinski-Harabasz index, and critout prints the criterion value for each k
## 2 clusters 588.9425
## 3 clusters 456.7939
## 4 clusters 433.9752
## 5 clusters 381.9294
## 6 clusters 352.7663
## 7 clusters 331.3718
## 8 clusters 309.9998
#The values above are the Calinski-Harabasz index for each k; the optimal number of clusters is the one with the largest value, here k = 2
#Compare the clustering result with the known class labels
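#As a sketch, the selected k and the criterion values can also be read directly from the
#kmeansruns object, via its bestk and crit components (see the fpc documentation):
fit$bestk#number of clusters that maximises the chosen criterion
fit$crit#criterion value for each number of clusters tried (untried values are 0)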
tab <- table(fit$cluster, class)
tab
## class
## bus car van
## 1 161 195 195
## 2 57 234 4
#Cluster 1 overlaps substantially with all three classes, while cluster 2 is dominated by cars
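#A rough numeric summary of this overlap (a sketch, not part of the original analysis):
#label each cluster with its majority class and compute the proportion of matching observations
sum(apply(tab, 1, max)) / sum(tab)#overall agreement under a majority-class labelling, about 0.51 here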
data <- scale(data)#standardise the variables before computing the distance matrix
library(pvclust)
set.seed(2023)
fit <- pvclust(data, method.hclust = 'ward.D', nboot = 5000, method.dist = 'euclidean')#pvclust performs hierarchical clustering through the standard R function hclust, but adds bootstrap resampling to give a better assessment of cluster support; Ward's method ('ward.D') is the agglomeration method, 5000 bootstrap replicates are drawn, and Euclidean distance is used for the distance matrix
## Bootstrap (r = 0.5)... Done.
## Bootstrap (r = 0.6)... Done.
## Bootstrap (r = 0.7)... Done.
## Bootstrap (r = 0.8)... Done.
## Bootstrap (r = 0.9)... Done.
## Bootstrap (r = 1.0)... Done.
## Bootstrap (r = 1.1)... Done.
## Bootstrap (r = 1.2)... Done.
## Bootstrap (r = 1.3)... Done.
## Bootstrap (r = 1.4)... Done.
print(fit)
##
## Cluster method: ward.D
## Distance : euclidean
##
## Estimates on edges:
##
## si au bp se.si se.au se.bp v c pchi
## 1 1.000 1.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000
## 2 1.000 1.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000
## 3 1.000 1.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000
## 4 0.893 0.942 0.963 0.011 0.007 0.001 -1.677 -0.107 0.704
## 5 1.000 1.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000
## 6 1.000 1.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000
## 7 0.747 0.897 0.788 0.012 0.007 0.002 -1.032 0.231 0.203
## 8 0.789 0.941 0.656 0.011 0.004 0.002 -0.980 0.580 0.055
## 9 0.742 0.851 0.924 0.016 0.011 0.001 -1.236 -0.195 0.017
## 10 0.767 0.877 0.902 0.014 0.009 0.001 -1.228 -0.067 0.073
## 11 0.482 0.723 0.779 0.017 0.013 0.002 -0.679 -0.089 0.819
## 12 0.777 0.916 0.776 0.011 0.006 0.002 -1.068 0.308 0.128
## 13 0.767 0.877 0.902 0.014 0.009 0.001 -1.228 -0.067 0.073
## 14 0.539 0.862 0.515 0.017 0.008 0.002 -0.563 0.526 0.499
## 15 0.778 0.920 0.754 0.011 0.005 0.002 -1.047 0.358 0.576
## 16 0.767 0.877 0.902 0.014 0.009 0.001 -1.228 -0.067 0.073
## 17 1.000 1.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000
plot(fit)#dendrogram annotated with AU (approximately unbiased) and BP (bootstrap probability) p-values
pvrect(fit, alpha = 0.95)#draw rectangles around clusters with AU >= 0.95
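#As a sketch, the clusters highlighted by pvrect can also be extracted programmatically
#with pvclust's pvpick function, using the same AU threshold:
pvpick(fit, alpha = 0.95)#list of the strongly supported clusters and their edge numbers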
#Use the mclust package to fit the 14 multivariate Gaussian mixture models and select the one with the best Bayesian information criterion (BIC)
data('Vehicle', package = 'mlbench')
data <- Vehicle[,-19]#reload the unscaled feature columns
library(mclust)
## Package 'mclust' version 6.0.0
## Type 'citation("mclust")' for citing this R package in publications.
set.seed(2023)
fit <- Mclust(as.matrix(data), G = 1:8)#fit mixtures with 1 to 8 components across all 14 covariance structures and keep the BIC-optimal model
fit
## 'Mclust' model object: (VEV,6)
##
## Available components:
## [1] "call" "data" "modelName" "n"
## [5] "d" "G" "BIC" "loglik"
## [9] "df" "bic" "icl" "hypvol"
## [13] "parameters" "z" "classification" "uncertainty"
fit3 <- Mclust(as.matrix(data), G = 3, modelNames = 'VEV')#refit with exactly 3 components and the VEV covariance structure so the solution can be compared with the 3 known classes
class <- Vehicle[,19]#rebuild the merged class labels (bus / car / van) as before
levels(class)[levels(class)=='saab'] <- 'car'
levels(class)[levels(class)=='opel'] <- 'car'
table(fit3$classification,class)
## class
## bus car van
## 1 78 156 138
## 2 49 207 6
## 3 91 66 55
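#As a sketch, agreement between the 3-component solution and the known labels can also be
#summarised with the adjusted Rand index from the already loaded mclust package:
adjustedRandIndex(fit3$classification, class)#1 means perfect agreement, values near 0 mean chance-level agreement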