Parallel ISOMAP algorithm based on Spark

doi:10.3969/j.issn.0253-2778.2016.09.001

Journal of University of Science and Technology of China ›› 2016, Vol. 46 ›› Issue (9): 711-718. DOI: 10.3969/j.issn.0253-2778.2016.09.001

• Article •

SHI Lukui, YUAN Bin, LIU Wenhao

1. School of Computer Science and Engineering, Hebei University of Technology, Tianjin 300401, China; 2. Hebei Province Key Laboratory of Big Data Computing, Tianjin 300401, China

Received: 2016-03-01 Revised: 2016-09-17 Accepted: 2016-09-17 Online: 2016-09-17 Published: 2016-09-17
Corresponding author: SHI Lukui
About the author: SHI Lukui (corresponding author), male, born in 1974, PhD, professor. Research field: machine learning. E-mail: shilukui@scse.hebut.edu.cn
Supported by:
Key Project of the Tianjin Research Program of Application Foundation and Advanced Technology (14JCZDJC31600) and the Natural Science Foundation of Hebei Province (F2013202104).

Abstract

To enable fast dimensionality reduction of nonlinear high-dimensional data in big data environments, a parallel ISOMAP algorithm based on Spark is proposed. To build the neighborhood matrix quickly, a parallel near-neighbor search algorithm based on exact Euclidean locality-sensitive hashing (E2LSH) is designed and implemented. To solve for the eigenvalues quickly, a parallel eigenvalue solver is designed and implemented that alternates between the power method and the order reduction method. To further improve performance, the parallel algorithm is optimized with Spark's sparse vectors, broadcast mechanism, and caching mechanism, which reduces memory consumption and data transfer during computation. Experimental results on the Swiss roll and S-curve data sets show that, through parallel execution and optimization of the computation, the Spark-based parallel ISOMAP algorithm greatly improves execution efficiency and is suitable for dimensionality reduction of large-scale data sets.

Keywords: ISOMAP, Spark, exact Euclidean locality-sensitive hashing (E2LSH), manifold learning, big data
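
The abstract describes the eigenvalue step only as alternating the power method with an order reduction step. A minimal single-machine sketch of that idea follows, written in plain Python rather than Spark. It assumes a symmetric matrix whose leading eigenvalues are positive and distinct in magnitude, and it uses Hotelling deflation as a stand-in for the order reduction step, since the paper's exact variant is not given on this page. The function names `power_iteration`, `deflate`, and `top_k_eigen` are illustrative and not taken from the paper. In a Spark version, the matrix-vector product in `matvec` would presumably be the distributed part, with the current vector broadcast to the workers.

```python
import math
import random


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def matvec(a, v):
    # Dense matrix-vector product; entry i of the result is a[i] . v.
    return [dot(row, v) for row in a]


def power_iteration(a, iters=1000, tol=1e-12, seed=0):
    """Approximate the largest-magnitude eigenpair of a symmetric matrix."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in a]
    norm = math.sqrt(dot(v, v))
    v = [x / norm for x in v]
    lam = float("inf")
    for _ in range(iters):
        w = matvec(a, v)
        new_lam = dot(v, w)  # Rayleigh quotient, since v has unit length
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:  # v lies in the null space of a
            return 0.0, v
        v = [x / norm for x in w]
        converged = abs(new_lam - lam) < tol
        lam = new_lam
        if converged:
            break
    return lam, v


def deflate(a, lam, v):
    # Hotelling deflation (an assumption, see lead-in): subtract lam * v v^T
    # so that the eigenvalue just found is replaced by 0.
    n = len(a)
    return [[a[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]


def top_k_eigen(a, k):
    """Alternate power iteration and deflation to get k leading eigenpairs."""
    pairs = []
    current = [row[:] for row in a]
    for _ in range(k):
        lam, v = power_iteration(current)
        pairs.append((lam, v))
        current = deflate(current, lam, v)
    return pairs
```

As a check, the matrix `[[4, 1, 0], [1, 3, 1], [0, 1, 2]]` has eigenvalues 3 + √3, 3, and 3 − √3, so `top_k_eigen(a, 2)` should return approximations of the first two. ISOMAP needs only the few largest eigenpairs of the double-centered geodesic distance matrix, which is why an approach that extracts them one at a time can be practical.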

CLC number: TP18

Cite this article: SHI Lukui, YUAN Bin, LIU Wenhao. Parallel ISOMAP algorithm based on Spark[J]. Journal of University of Science and Technology of China, 2016, 46(9): 711-718.

()
()

[1]	邱镇，王琪媛，刘迪，孟洪民. 一种基于可伸缩模式的潜在语义挖掘方法[J]. 中国科学技术大学学报, 2019, 49(7): 524-532.
[2]	石陆魁，郭林林，房子哲，张军. 基于Spark的并行ISOMAP算法[J]. 中国科学技术大学学报, 2019, 49(10): 842-850.
[3]	顾军华，王守彬，武君艳，张素琪. 基于Spark的多策略蚁群算法求解最大团问题[J]. 中国科学技术大学学报, 2019, 49(10): 851-860.
[4]	陈锋，张智，李琴剑，陈宇强，陈国良. FCD大数据并行处理的动态任务调度算法[J]. 中国科学技术大学学报, 2018, 48(9): 718-722.
[5]	陈志注，王宏志，熊风，张义策，高宏，李建中. 大数据拍卖的定价策略与方法[J]. 中国科学技术大学学报, 2018, 48(6): 486-494.
[6]	王佳玉，张振宇，褚征，吴晓红. 一种基于轨迹数据密度分区的分布式并行聚类方法[J]. 中国科学技术大学学报, 2018, 48(1): 47-56.
[7]	徐子伟，王鹏，陈宗海. 一种基于Fisher比率和预测风险准则的电信客户流失预测分步特征选择方法[J]. 中国科学技术大学学报, 2017, 47(8): 686-694.
[8]	王进，王鸿，夏翠萍，欧阳卫华，陈乔松，邓欣. 基于Spark的组合分类器链多标签分类方法[J]. 中国科学技术大学学报, 2017, 47(4): 350-357.
[9]	陈振国，田立勤，林闯. 基于感知源信任评价的物联网数据可靠保障模型[J]. 中国科学技术大学学报, 2017, 47(4): 297-303.
[10]	卜尧，吴斌，陈玉峰，白德盟. BDAP——一个基于Spark的数据挖掘工具平台[J]. 中国科学技术大学学报, 2017, 47(4): 358-368.
[11]	张静静，杨燕*，王红军，韩晓涛，邓强. 一种新的软聚类投票法及其并行化实现[J]. 中国科学技术大学学报, 2016, 46(3): 173-179.
[12]	殷超，王健宗，吕海涛，崔宗敏，程良伦，李同芳，刘妍. BDCode: 一种面向大数据存储系统的纠删码算法[J]. 中国科学技术大学学报, 2016, 46(3): 188-199.
[13]	王亚玲，刘越，洪建光，崔蔚，李彦虎，苏伊鹏，黄高攀，张明明，刘万涛. 基于Spark/Shark的电力用采大数据OLAP分析系统[J]. 中国科学技术大学学报, 2016, 46(1): 66-75.
[14]	黄冬梅，孙乐，赵丹枫. 基于ADMD融合策略的海洋大数据索引技术研究[J]. 中国科学技术大学学报, 2015, 45(10): 813-821.
