Parallel ISOMAP algorithm based on Spark

doi:10.3969/j.issn.0253-2778.2019.10.010

Abstract

Abstract: To reduce the dimension of the nonlinear high-dimensional data in the big data environment, a parallel ISOMAP algorithm based on Spark is proposed, where a Spark-based parallel block Davidson method is designed and implemented to quickly solve eigenvalues and eigenvectors of the large scale matrices. Simultaneously, a row-block matrix multiplication strategy based on RDD partition is proposed for the difficulty of computation and transmission of the large scale matrices, which converts the matrix rows in each partition into block matrices. The row-block matrices are not restricted by the map operator to RDD calculation one by one, and can treat operations at the matrix level by using linear algebraic Library in Spark. The experimental results show that the row-block matrix multiplication strategy effectively improves the efficiency of matrix operations; the parallel block Davidson method can quickly solve the eigenvalues and eigenvectors of the large scale matrices and effectively improve the performance of parallel ISOMAP algorithm; and the parallel ISOMAP algorithm can adapt to dimensionality reduction in the big data environment.

Key words: ISOMAP, row-block matrix, block Davidson method, Spark

CLC Number:

TP181

SHI Lukui, GUO Linlin, FANG Zizhe, ZHANG Jun. Parallel ISOMAP algorithm based on Spark[J]. Journal of University of Science and Technology of China, 2019, 49(10): 842-850.

References

［1］
TENENBAUM J B, DE SILVA V, LANGFORD J C.A global geometric framework for nonlinear dimensionality reduction[J].Science,2000,290(5): 2319–2323.
[2] PLESS R, SOUVENIR R. A survey of manifold learning for images [J]. IPSJ Transactions on Computer Vision & Applications, 2009, 1(1):83-94.
[3] SARVENIAZI A. An actual survey of dimensionality reduction [J]. American Journal of Computational Mathematics, 2014, 4: 55-72.
[4] SINGH K P, BHAI R, MISHRA V, etal.Localization in wireless sensor network using LLE-ISOMAP algorithm[C]// Proceedings of IEEE Region 10 Conference. Penang, Malaysia: IEEE, 2017:393-397.
[5] RAJAGURU H, KUMAR PRABHAKAR S. Performance analysis of local linear embedding and Hessian LLE with hybrid ABC-PSO for epilepsy classification from EEG signals[C]//Proceedings of International Conference on Inventive Research in Computing Applications. Coimbatore, India: IEEE, 2018:1084-1088.
[6] YEH T T, CHEN T Y , CHEN Y C , et al. Efficient parallel algorithm for nonlinear dimensionality reduction on GPU[C]// Proceedings of International Conference on Granular Computing. San Jose, USA:IEEE, 2010:592-597.
[7] LI W, ZHANG L, DU B. GPU parallel implementation of isometric mapping for hyperspectral classification [J]. IEEE Geoscience and Remote Sensing Letters, 2017, 14(9): 1532-1536.
[8] 李毅. 基于Hadoop平台的局部线性嵌入算法研究[D]. 广州，华南理工大学，2011.
[9] 卞云龙. 基于云计算平台的大规模流形学习算法研究[D]. 南京，南京理工大学，2012.
[10] 刘勇. 头部姿态估计的监督流形学习研究及其并行化扩展[D]. 厦门，厦门大学，2013.
[11] 薛永坚, 倪志伟. 基于MapReduce的大规模数据集流形学习降维研究[J]. 系统工程理论与实践, 2014, 34:151-157.
XUE Yongjian, NI Zhiwei.Research of large scale manifold learning based on MapReduce[J]. Systems Engineering - Theory & Practice, 2014, 34: 151-157.
[12] CAMPANA-OLIVO R, MANIAN V B. Parallel implementation of nonlinear dimensionality reduction methods applied in object segmentation using CUDA in GPU[C]// Proceedings of the International Society for Optical Engineering. SPIE, 2011:289-293.
[13] MOUSTAFA M, EBEID H M, HELMY A, et al. Parallel implementation of super-resolution based neighbor embedding using GPU [C]// Proceedings of the International Conference on Advanced Intelligent Systems and Informatics. Springer, 2016: 628-638.
[14] SAMUDRALA S K, ZOLA J, ALURU S, et al. Parallel framework for dimensionality reduction of large-scale datasets [J]. Scientific Programming, 2015, No.1: 1-12.
[15] SCHOENEMAN F, ZOLA J. Scalable manifold learning for big data with Apache Spark [C]// Proceedings of IEEE International Conference on Big Data. Seattle, USA: IEEE, 2018:272-281.
[16] WANG YH, GAO Y, XU C. Manifold learning method for large scale dataset based on gradient descent [C]//Proceedings of 3rd International Conference on Multimedia Technology.Springer, 2013, 84:1187-1194.
[17] ZHOU Y. A block Chebyshev-Davidson method with inner-outer restart for large eigenvalue problems[J]. Journal of Computational Physics, 2010, 229(24):9188-9200.
[18] SADKANE M, SIDJE R B. Implementation of a variable block Davidson method with deflation for solving large sparse eigenproblems[J]. Numerical Algorithms, 1999, 20:217-240
[19] 王顺绪,戴华.求解大型矩阵特征值问题的并行块Davidson方法[J].南京航空航天大学学报，2007，39(6): 814-818.
WANG Shunxu, DAI Hua. Parallel Block Davidson Method for Solving Large Scale Eigenvalue Problem[J]. Journal of Nanjing University of Aeronautics & Astronautics,2007，39(6): 814-818.
[20] DATAR M, IMMORLICA N, INDYK P, et al. Locality-sensitive hashing scheme based on p-stable distributions [C] // Proceedings 20th of Annual symposium on Computational Geometry. New York, USA: ACM, 2004: 253-262.
[21] SAUL L K, ROWEIS S.An introduction to locally linear embedding [EB/OL].Journal of Machine Learning Research, [2019-8-30] http://www.cs.toronto.edu/~roweis/lle,2001.

()
()

[1]	GU Junhua, WANG Shoubin, WU Junyan , ZHANG suqi. Multi-strategy ant colony algorithm for solving the maximum clique problem based on Spark [J]. Journal of University of Science and Technology of China, 2019, 49(10): 851-860.
[2]	XU Ziwei, WANG Peng, CHEN Zonghai. A two-stage feature selection method based on Fisher’s ratio and prediction risk for telecom customer churn prediction [J]. Journal of University of Science and Technology of China, 2017, 47(8): 686-694.
[3]	WANG Jin, WANG Hong, XIA Cuiping, OUYANG Weihua, CHEN Qiaosong, DENG Xin. Ensembles of classifier chains for multi-label classification based on Spark [J]. Journal of University of Science and Technology of China, 2017, 47(4): 350-357.
[4]	BU Yao, WU Bin, CHEN Yufeng, BAI Demeng. BDAP: A data mining platform based on Spark [J]. Journal of University of Science and Technology of China, 2017, 47(4): 358-368.