中国科学技术大学学报 ›› 2016, Vol. 46 ›› Issue (9): 711-718.DOI: 10.3969/j.issn.0253-2778.2016.09.001

• 论著 •    

基于Spark的ISOMAP算法并行化

石陆魁,袁彬,刘文浩   

  1. 1.河北工业大学计算机科学与软件学院,天津 300401;2.河北省大数据计算重点实验室,天津 300401
  • 收稿日期:2016-03-01 修回日期:2016-09-17 接受日期:2016-09-17 出版日期:2016-09-17 发布日期:2016-09-17
  • 通讯作者: 石陆魁
  • 作者简介:石陆魁(通讯作者),男,1974年生,博士/教授. 研究方向:机器学习. E-mail: shilukui@scse.hebut.edu.cn
  • 基金资助:
    天津市应用基础与前沿技术研究计划重点项目(14JCZDJC31600),河北省自然科学基金(F2013202104)资助.

Parallel ISOMAP algorithm based on Spark

SHI Lukui, YUAN Bin, LIU Wenhao   

  1. 1. School of Computer Science and Engineering, Hebei University of Technology, Tianjin 300401, China; 2. Hebei Province Bigdata Computation Key Library, Tianjin 300401, China
  • Received:2016-03-01 Revised:2016-09-17 Accepted:2016-09-17 Online:2016-09-17 Published:2016-09-17

摘要: 为了实现大数据环境下非线性高维数据的快速降维,提出了一种基于Spark的并行ISOMAP算法.在该算法中,为了快速构建邻域矩阵,设计并实现了基于精确欧式位置敏感哈希的近邻搜索并行算法;为了实现特征值的快速求解,设计并实现了基于幂法和降阶法交替执行的特征值求解并行算法.为了进一步提高算法的性能,基于Spark的特性,利用Spark的稀疏向量、广播机制和缓存机制对并行ISOMAP算法进行了优化,减少了计算过程中的内存消耗和数据传输.在Swissroll数据集和S-curve数据集上的实验结果表明,基于Spark的并行ISOMAP算法通过并行执行和计算过程的优化,极大地提高了算法的执行效率,能够适用于大规模数据集的降维处理.

关键词: ISOMAP, Spark, 精确欧式位置敏感哈希, 流形学习, 大数据

Abstract: To achieve quick dimensionality reduction of the nonlinear high dimensional in the big data environment, a parallel ISOMAP algorithm based on Spark was proposed. In the method, a parallel algorithm to search for near neighbors was designed and realized to fast build the neighborhood matrix, which was based on exact Euclidean locality sensitive hashing. A parallel eigenvalue solving method was designed to quickly solve the eigenvalues, which executes the power method and the order reduction method in turn. To further improve the performance of the algorithm, the parallel method was optimized to reduce the memory consumption and data transmissions through using Spark’s sparse vector, broadcast mechanism and caching mechanism according to the characteristics of Spark. Experimental results on Swiss roll data and S-curve data demonstrated that the parallel ISOMAP algorithm based on Spark greatly improved the executing efficiency of the method through parallel executing and optimizing the calculating procedure. It could be suitable for reducing the dimension of large scale data sets.

Key words: ISOMAP, Spark, E2LSH, manifold learning, big data

中图分类号: