Parallel ISOMAP algorithm based on Spark

doi:10.3969/j.issn.0253-2778.2016.09.001

Abstract

Abstract: To achieve quick dimensionality reduction of the nonlinear high dimensional in the big data environment, a parallel ISOMAP algorithm based on Spark was proposed. In the method, a parallel algorithm to search for near neighbors was designed and realized to fast build the neighborhood matrix, which was based on exact Euclidean locality sensitive hashing. A parallel eigenvalue solving method was designed to quickly solve the eigenvalues, which executes the power method and the order reduction method in turn. To further improve the performance of the algorithm, the parallel method was optimized to reduce the memory consumption and data transmissions through using Spark’s sparse vector, broadcast mechanism and caching mechanism according to the characteristics of Spark. Experimental results on Swiss roll data and S-curve data demonstrated that the parallel ISOMAP algorithm based on Spark greatly improved the executing efficiency of the method through parallel executing and optimizing the calculating procedure. It could be suitable for reducing the dimension of large scale data sets.

Key words: ISOMAP, Spark, E2LSH, manifold learning, big data

CLC Number:

TP18

SHI Lukui, YUAN Bin, LIU Wenhao. Parallel ISOMAP algorithm based on Spark[J]. Journal of University of Science and Technology of China, 2016, 46(9): 711-718.

References

［1］
TENENBAUM J B, DE SILVA V, LANGFORD J C. A global geometric framework for nonlinear dimensionality reduction[J]. Science, 2000, 290(5): 2319-2323.
[2] 王自强，钱旭，孔敏. 流形学习算法综述[J]. 计算机工程与应用，2008，44(35): 9-12.
WANG Z Q, QIAN X, KONG M. Survey on manifold learning algorithms[J]. Computer Engineering and Applications, 2008, 44(35): 9-12.
[3] PLESS R, SOUVENIR R. A survey of manifold learning for images[J]. IPSJ Transactions on Computer Vision & Applications, 2009, 1(1): 83-94.
[4] IZENMAN A J. Introduction to manifold learning[J]. Wiley Interdisciplinary Reviews Computational Statistics, 2012, 4(5):439-446.
[5] 曾宪华, 罗四维. 全局保持的流形学习算法对比研究[J]. 计算机工程与应用, 2010, 46(15): 1-6.
ZENG X H, LUO S W. Contrasting research of global preserving manifold learning algorithms[J]. Computer Engineering and Applications, 2010, 46(15): 1-6.
[6] 尹宏伟，李凡长. 谱机器学习研究综述[J]. 计算机科学与探索，2015，9（12)：1409-1419.
[7] 任磊，杜一，马帅，等.大数据可视分析综述[J]. 软件学报, 2014, 25(9): 1909-1936.
[8] 李毅. 基于Hadoop平台的局部线性嵌入算法研究[D]. 华南理工大学，2011.
[9] 卞云龙. 基于云计算平台的大规模流形学习算法研究[D]. 南京理工大学，2012.
[10] 薛永坚, 倪志伟. 基于MapReduce的大规模数据集流形学习降维研究[J]. 系统工程理论与实践，2014，34(S): 151-157.
[11] 刘勇. 头部姿态估计的监督流形学习研究及其并行化扩展[D]. 厦门大学，2013
[12] GU L, LI H. Memory or time: Performance evaluation for iterative operation on hadoop and spark [C]// International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing. Zhangjiajie, China: IEEE Press, 2013: 721-727.
[13] DATAR M, IMMORLICA N, INDYK P, et al. Locality-sensitive hashing scheme based on p-stable distributions [C]// Proceedings the of 20th Annual symposium on Computational Geometry. New York: ACM Press, 2004: 253-262.
[14] DE SILVA V, TENENBAUM J B. Global versus local methods in nonlinear dimensionality reduction [C] // Proceedings of the Advances in Neural Information Processing Systems. Cambridge: IEEE Press, 2003, 15: 1959-1966.
[15] 张瑞杰，郭志刚, 李弼程，等. 基于E2LSH-MKL的视觉语义概念检测[J]. 自动化学报，2012，38(10): 1671-1678.
ZHANG R J, GUO Z G, LI B C, et al. A visual semantic concept detection algorithm based on E2LSH-MKL[J]. Acta Automatica Sinica, 2012, 38(10): 1671-1678.
[16] 王洪峰，刘辛. 基于位置敏感哈希的网络视频重复检测[J]. 计算机应用研究，2012，29(5): 1954-1958.
WANG H F, LIU X. Near-duplicate Web video detection based on locality sensitive hashing[J]. Application Research of Computers, 2012，29(5): 1954-1958.
[17] INDYK P. Stable distributions, pseudorandom generators, embeddings, and data stream computation [J]. Journal of the ACM, 2006, 53(3): 307-323.
[18] SPARKS E R, TALWALKAR A, SMITH V, et al. MLI: An API for distributed machine learning [C]// Proceedings of the 13th International Conference on Data Mining. Dallas: IEEE Computer Society, 2013:1187-1192.
[19] ZAHARIA M, CHOWDHURY M, DAS T, et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing [C]// Proceedings of the USENIX Conference on Networked Systems Design and Implementation. Berkeley: ACM Press, 2012, 70(2):141-146.
[20] ZAHARIA M, CHOWDHURY M, FRANKLIN M J, et al. Spark: Cluster computing with working sets [J]. USENIX Conference on Hot Topics in Cloud Computing, 2010, 15(1):1765-1773.

()
()

[1]	QIU Zhen, WANG Qiyuan, LIU Di, MENG Hongmin. A novel method for mining latent events based on scalable patterns [J]. Journal of University of Science and Technology of China, 2019, 49(7): 524-532.
[2]	SHI Lukui, GUO Linlin, FANG Zizhe, ZHANG Jun. Parallel ISOMAP algorithm based on Spark [J]. Journal of University of Science and Technology of China, 2019, 49(10): 842-850.
[3]	GU Junhua, WANG Shoubin, WU Junyan , ZHANG suqi. Multi-strategy ant colony algorithm for solving the maximum clique problem based on Spark [J]. Journal of University of Science and Technology of China, 2019, 49(10): 851-860.
[4]	CHEN Feng, ZHANG Zhi, LI Qinjian, CHEN Yuqiang, CHEN Guoliang. Dynamic task scheduling algorithm of parallel computing for FCD big data [J]. Journal of University of Science and Technology of China, 2018, 48(9): 718-722.
[5]	CHEN Zhizhu, WANG Hongzhi, XIONG Feng, ZHANG Yice, GAO Hong, LI Jianzhong. Research on the auction strategies and pricing of big data [J]. Journal of University of Science and Technology of China, 2018, 48(6): 486-494.
[6]	WANG Jiayu, ZHANG Zhenyu, CHU Zheng, WU Xiaohong. A trajectory data density partition based distributed parallel clustering method [J]. Journal of University of Science and Technology of China, 2018, 48(1): 47-56.
[7]	XU Ziwei, WANG Peng, CHEN Zonghai. A two-stage feature selection method based on Fisher’s ratio and prediction risk for telecom customer churn prediction [J]. Journal of University of Science and Technology of China, 2017, 47(8): 686-694.
[8]	WANG Jin, WANG Hong, XIA Cuiping, OUYANG Weihua, CHEN Qiaosong, DENG Xin. Ensembles of classifier chains for multi-label classification based on Spark [J]. Journal of University of Science and Technology of China, 2017, 47(4): 350-357.
[9]	CHEN Zhenguo, TIAN Liqin, LIN Chuang. Data reliable assurance model of Internet of Things based on the trust evaluation of perceived source [J]. Journal of University of Science and Technology of China, 2017, 47(4): 297-303.
[10]	BU Yao, WU Bin, CHEN Yufeng, BAI Demeng. BDAP: A data mining platform based on Spark [J]. Journal of University of Science and Technology of China, 2017, 47(4): 358-368.
[11]	ZHANG Jingjing,YANG Yan*,WANG Hongjun, HAN Xiaotao,DENG Qiang. A novel voting method and parallel implementation for soft clustering [J]. Journal of University of Science and Technology of China, 2016, 46(3): 173-179.
[12]	YIN Chao, WANG Jianzong, LV Haitao, CUI Zongmin, CHENG Lianglun, LI Tongfang, LIU Yan. BDCode: An erasure code algorithm for big data storage systems [J]. Journal of University of Science and Technology of China, 2016, 46(3): 188-199.
[13]	WANG Yaling, LIU Yue, HONG Jianguang, CUI Wei LI Yanhu, SU Yipeng, HUANG Gaopan, ZHANG Mingming, LIU Wantao. Spark/Shark-based OLAP system for smart grid applications [J]. Journal of University of Science and Technology of China, 2016, 46(1): 66-75.