Journal of University of Science and Technology of China ›› 2014, Vol. 44 ›› Issue (7): 537-543. DOI: 10.3969/j.issn.0253-2778.2014.07.001

• Original Paper •

A density-based hierarchical clustering algorithm of gene data based on MapReduce

TU Jinjin, YANG Ming, GUO Lina

  1. School of Computer Science and Technology, Nanjing Normal University, Nanjing 210046, China
  • Received: 2014-03-21 Revised: 2014-06-15 Accepted: 2014-06-15 Online: 2023-05-11 Published: 2014-06-15
  • Corresponding author: YANG Ming
  • About the author: TU Jinjin, male, born in 1988, master's student. Research interests: machine learning, pattern recognition. E-mail: tujinjin1988@sina.com
  • Supported by:
    the National Natural Science Foundation of China (61272222, 61003116), the Key and Major Special Project of the Natural Science Foundation of Jiangsu Province (BK2011005), the Natural Science Foundation of Jiangsu Province (BK2011782), and the Graduate Research and Innovation Program of Jiangsu Higher Education Institutions (CXLX12_0415).

Abstract: With the rapid development of bioinformatics technology, the volume of gene expression data is growing sharply, which poses a serious challenge for traditional gene expression clustering algorithms. Density-based hierarchical clustering (DHC) handles the nested clusters found in gene expression data well and is robust, but it is inefficient on massive data sets. To address this, a MapReduce-based density hierarchical clustering algorithm, DisDHC, is proposed. The algorithm first partitions the data and runs DHC on each subset in parallel to obtain a sparsified data set; it then applies DHC again to the gathered results and finally produces the density centers of the whole data set. Experiments on the yeast (GAL) data set, the yeast cell cycle data set, and the human serum data set show that DisDHC greatly shortens clustering time while preserving the clustering quality of DHC.

Key words: MapReduce, density-based hierarchical clustering, gene expression data

CLC Number:
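
The partition, cluster, and re-cluster flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function `density_centers` is a hypothetical, much simpler stand-in for DHC (a greedy pick of the densest points), and `split`, `dis_dhc_sketch`, `n_blocks`, and `radius` are names assumed here for illustration. The map step runs sequentially in this sketch, whereas a MapReduce job would run it in parallel across nodes.

```python
import math

def density_centers(points, radius):
    """Stand-in for DHC (NOT the real algorithm): greedily keep the densest
    points, skipping any point within `radius` of an already chosen center."""
    def density(p):
        # Density of p = number of points (including p) within `radius` of it.
        return sum(1 for q in points if math.dist(p, q) <= radius)
    ranked = sorted(points, key=density, reverse=True)
    centers = []
    for p in ranked:
        if all(math.dist(p, c) > radius for c in centers):
            centers.append(p)
    return centers

def split(points, n_blocks):
    """Partition step: deal points round-robin into n_blocks subsets."""
    return [points[i::n_blocks] for i in range(n_blocks)]

def dis_dhc_sketch(points, n_blocks, radius):
    # "Map": cluster each block on its own; its centers are the sparsified data.
    local = [density_centers(block, radius) for block in split(points, n_blocks)]
    # "Reduce": pool the per-block centers and cluster them once more.
    pooled = [c for centers in local for c in centers]
    return density_centers(pooled, radius)

# Toy data: two tight groups of 2-D points (assumed for illustration;
# real expression profiles have many more dimensions).
offsets = [(-0.2, 0.0), (0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.0, -0.2), (0.1, 0.1)]
data = [(x, y) for x, y in offsets] + [(10 + x, 10 + y) for x, y in offsets]
centers = dis_dhc_sketch(data, n_blocks=3, radius=2.0)
print(len(centers))
```

Because the second pass sees only the per-block centers rather than every point, its input is far smaller than the original data, which is the source of the time savings the abstract reports.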