Journal of University of Science and Technology of China ›› 2016, Vol. 46 ›› Issue (9): 711-718.DOI: 10.3969/j.issn.0253-2778.2016.09.001

• Original Paper •    

Parallel ISOMAP algorithm based on Spark

SHI Lukui, YUAN Bin, LIU Wenhao   

  1. 1. School of Computer Science and Engineering, Hebei University of Technology, Tianjin 300401, China; 2. Hebei Province Bigdata Computation Key Library, Tianjin 300401, China
  • Received:2016-03-01 Revised:2016-09-17 Accepted:2016-09-17 Online:2016-09-17 Published:2016-09-17

Abstract: To achieve quick dimensionality reduction of the nonlinear high dimensional in the big data environment, a parallel ISOMAP algorithm based on Spark was proposed. In the method, a parallel algorithm to search for near neighbors was designed and realized to fast build the neighborhood matrix, which was based on exact Euclidean locality sensitive hashing. A parallel eigenvalue solving method was designed to quickly solve the eigenvalues, which executes the power method and the order reduction method in turn. To further improve the performance of the algorithm, the parallel method was optimized to reduce the memory consumption and data transmissions through using Spark’s sparse vector, broadcast mechanism and caching mechanism according to the characteristics of Spark. Experimental results on Swiss roll data and S-curve data demonstrated that the parallel ISOMAP algorithm based on Spark greatly improved the executing efficiency of the method through parallel executing and optimizing the calculating procedure. It could be suitable for reducing the dimension of large scale data sets.

Key words: ISOMAP, Spark, E2LSH, manifold learning, big data

CLC Number: