Journal of University of Science and Technology of China, 2017, Vol. 47, Issue (4): 311-319. DOI: 10.3969/j.issn.0253-2778.2017.04.005

• Original Article •

Application of sparse spectral clustering algorithm in high-dimensional data

XU Xueli, ZHAO Xuejing

  1. School of Mathematics and Statistics, Lanzhou University, Lanzhou 730000, China
  • Received: 2016-08-28 Revised: 2016-12-08 Online: 2017-04-30 Published: 2017-04-30
  • Corresponding author: ZHAO Xuejing
  • About the author: XU Xueli, female, born in 1990, master's student; research interest: data mining. E-mail: xuxl2014@lzu.edu.cn

Abstract: A new sparse spectral clustering algorithm, high-dimensional sparse spectral clustering based on partitioning around medoids (HSSPAM), is proposed. It exploits the computational advantages of a sparse similarity matrix and the superiority of the PAM algorithm over K-means. The algorithm first applies a high correlation filter (HCF) and principal component analysis (PCA) to reduce, or even eliminate, the effect of the curse of dimensionality on high-dimensional data processing. It then uses an exponential transformation of the Minkowski distance, together with a sparsification algorithm, to construct a block diagonal matrix that re-expresses the similarity between samples. Next, a novel Laplacian matrix is constructed to further compress the data matrix, and the PAM algorithm replaces the K-means algorithm of traditional spectral clustering for clustering the eigenvectors, which improves clustering stability. Finally, experiments on real high-dimensional gene data, assessed with several clustering evaluation indices, show that the proposed algorithm clusters gene data more accurately and more stably than the algorithms used for comparison in the paper.

Key words: clustering of high-dimensional data, sparse spectral clustering algorithm, dimension-reduction technique, block diagonal matrix, clustering evaluation index
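The pipeline summarized in the abstract can be illustrated with a rough NumPy sketch. This is not the authors' implementation, because the abstract does not give the exact formulas. The following are all assumptions made for illustration: the function names, the correlation threshold, the kernel form exp(-d/sigma) applied to the Minkowski distance, the k-nearest-neighbour sparsification, and the standard symmetric normalized Laplacian (used here in place of the paper's novel Laplacian).

```python
import numpy as np

def high_correlation_filter(X, threshold=0.9):
    """Drop each feature whose |Pearson r| with an already kept feature exceeds threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, i] <= threshold for i in keep):
            keep.append(j)
    return X[:, keep]

def pca(X, n_components):
    """Project centered data onto its leading principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def sparse_similarity(X, p=2, sigma=1.0, n_neighbors=5):
    """Assumed kernel exp(-d_p / sigma), sparsified by keeping symmetric k-nearest neighbours."""
    diff = np.abs(X[:, None, :] - X[None, :, :])
    dist = (diff ** p).sum(axis=2) ** (1.0 / p)      # Minkowski distance of order p
    S = np.exp(-dist / sigma)
    np.fill_diagonal(S, 0.0)
    idx = np.argsort(-S, axis=1)[:, :n_neighbors]    # most similar points per row
    mask = np.zeros(S.shape, dtype=bool)
    mask[np.arange(S.shape[0])[:, None], idx] = True
    mask |= mask.T                                   # keep the matrix symmetric
    return np.where(mask, S, 0.0)

def spectral_embedding(S, k):
    """Row-normalized eigenvectors for the k smallest eigenvalues of I - D^-1/2 S D^-1/2."""
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(S.sum(axis=1), 1e-12))
    L = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)                      # eigenvalues in ascending order
    V = vecs[:, :k]
    return V / np.maximum(np.linalg.norm(V, axis=1, keepdims=True), 1e-12)

def pam(D, k):
    """Partitioning around medoids on a distance matrix D: greedy BUILD, then SWAP."""
    n = D.shape[0]
    medoids = [int(np.argmin(D.sum(axis=1)))]
    while len(medoids) < k:                          # BUILD phase
        nearest = D[:, medoids].min(axis=1)
        gains = np.maximum(nearest[:, None] - D, 0.0).sum(axis=0)
        gains[medoids] = -np.inf
        medoids.append(int(np.argmax(gains)))
    cost = D[:, medoids].min(axis=1).sum()
    while True:                                      # SWAP phase: apply best improving swap
        best = (cost, None, None)
        for i in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = h
                c = D[:, trial].min(axis=1).sum()
                if c < best[0] - 1e-12:
                    best = (c, i, h)
        if best[1] is None:
            break
        cost, medoids[best[1]] = best[0], best[2]
    return np.argmin(D[:, medoids], axis=1), medoids

def hsspam_sketch(X, k, n_components=5, corr_threshold=0.9, p=2, sigma=1.0, n_neighbors=5):
    """Hypothetical end-to-end pipeline: HCF, PCA, sparse similarity, embedding, PAM."""
    Xf = high_correlation_filter(X, corr_threshold)
    Z = pca(Xf, min(n_components, Xf.shape[1]))
    E = spectral_embedding(sparse_similarity(Z, p, sigma, n_neighbors), k)
    D = np.linalg.norm(E[:, None, :] - E[None, :, :], axis=2)
    return pam(D, k)
```

Calling `hsspam_sketch(X, k)` on an `(n_samples, n_features)` array returns a label vector and the indices of the chosen medoids. The dense pairwise-distance computation keeps the sketch short but uses memory quadratic in the number of samples and linear in the number of features, so it only suits small data sets.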

CLC number: