Journal of University of Science and Technology of China ›› 2015, Vol. 45 ›› Issue (10): 836-845. DOI: 10.3969/j.issn.0253-2778.2015.10.006

• Original Paper •

PipelineJoin: A new MapReduce-based multi-table join algorithm

LIN Ziyu, LI Yuqian, LI Can, LAI Yongxuan   

  1. College of Information Science and Technology, Xiamen University, Xiamen 361005, China; 2. School of Software, Xiamen University, Xiamen 361005, China
  • Received: 2015-08-27 Revised: 2015-09-29 Accepted: 2015-09-29 Online: 2015-09-29 Published: 2015-09-29

Abstract: MapReduce, a parallel and distributed computing model, has been widely used to process join operations over two or more large tables. Existing MapReduce-based multi-table join algorithms all have limitations when handling chain joins: some cannot join multiple large tables, while others run too many MapReduce tasks in sequence, which leads to low efficiency. This paper proposes PipelineJoin, a new MapReduce-based multi-table join algorithm for chain joins over a number of tables. PipelineJoin adopts a pipeline model and a scheduler so that a series of Map tasks and Reduce tasks can overlap in execution throughout the join process, improving the efficiency of multi-table joins while overcoming the deficiencies of existing methods. Extensive experiments on various synthetic datasets show that the proposed algorithm greatly reduces join time compared with existing chain join algorithms.
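
For readers unfamiliar with the setting, the following is a minimal in-memory sketch of the sequential baseline the abstract contrasts against: a chain join R1 ⋈ R2 ⋈ R3 computed as a cascade of reduce-side join jobs, where each job must finish before the next one starts. It is not the PipelineJoin algorithm, whose pipeline model and scheduler are not specified in this abstract. All names (`map_phase`, `shuffle`, `reduce_phase`, `join_job`, `chain_join`) and the sample tables are hypothetical, and the sketch assumes each table joins its last column to the first column of the next table.

```python
from collections import defaultdict
from functools import reduce


def map_phase(left, right):
    """Tag each record with its source table and emit it under its join key."""
    for row in left:
        yield row[-1], ("L", row)  # assumption: left joins on its last column
    for row in right:
        yield row[0], ("R", row)   # assumption: right joins on its first column


def shuffle(pairs):
    """Group mapper output by key, standing in for the MapReduce shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """For each key, pair every left record with every right record."""
    out = []
    for key in sorted(groups):  # sorted only to make the output deterministic
        lefts = [row for tag, row in groups[key] if tag == "L"]
        rights = [row for tag, row in groups[key] if tag == "R"]
        for l_row in lefts:
            for r_row in rights:
                out.append(l_row + r_row[1:])  # drop the repeated join column
    return out


def join_job(left, right):
    """One simulated MapReduce job: map, shuffle, then reduce."""
    return reduce_phase(shuffle(map_phase(left, right)))


def chain_join(tables):
    """Sequential cascade: each job consumes the full output of the previous one."""
    return reduce(join_job, tables)


# Hypothetical sample tables R1(a, b), R2(b, c), R3(c, d).
r1 = [(1, "x"), (2, "y")]
r2 = [("x", 10), ("y", 20), ("x", 30)]
r3 = [(10, "p"), (30, "q")]
result = chain_join([r1, r2, r3])
```

In this sketch a chain of n tables needs n - 1 jobs run strictly one after another, which is the inefficiency that overlapping Map and Reduce tasks is meant to reduce.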

Key words: join, multi-table, MapReduce, PipelineJoin
