Journal of University of Science and Technology of China ›› 2015, Vol. 45 ›› Issue (10): 836-845. DOI: 10.3969/j.issn.0253-2778.2015.10.006

• Original Paper •

PipelineJoin: A new MapReduce-based multi-table join algorithm

LIN Ziyu, LI Yuqian, LI Can, LAI Yongxuan

  1. College of Information Science and Technology, Xiamen University, Xiamen 361005, China; 2. School of Software, Xiamen University, Xiamen 361005, China
  • Received: 2015-08-27 Revised: 2015-09-29 Accepted: 2015-09-29 Online: 2015-09-29 Published: 2015-09-29
  • Corresponding author: LIN Ziyu
  • About the author: LIN Ziyu (corresponding author), male, born in 1987, PhD/lecturer. Research field: big data mining. E-mail: ziylin@xmu.edu.cn
  • Supported by:
    National Natural Science Foundation of China (61303004, 1202012); National Key Technology R&D Program (863) (2015BAH16F00/F01/F02).

Abstract: MapReduce, a parallel and distributed computing model, has been widely used to process join operations over two or more large tables. Existing MapReduce-based multi-table join algorithms have limitations when handling chain joins: some cannot join multiple large tables, while others must run many MapReduce jobs in sequence, which leads to low efficiency. To address this, a new MapReduce-based multi-table join algorithm, PipelineJoin, is proposed to efficiently process chain joins over an arbitrary number of large tables. PipelineJoin adopts a pipeline model and a scheduler so that the Map tasks and Reduce tasks of the whole join process can execute in an overlapping, pipelined fashion. This improves the efficiency of multi-table joins and overcomes the deficiencies of the existing chain join algorithms. Extensive experiments on synthetic datasets of different scales show that PipelineJoin can greatly reduce join time compared with the existing chain multi-table join algorithms.

Key words: join, multi-table, MapReduce, PipelineJoin

CLC number:
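
To make the baseline concrete, the sketch below simulates in plain Python the kind of sequential chain join that the abstract describes as inefficient: each two-table join is one simulated MapReduce job, and job i+1 starts only after job i has finished. It is not the PipelineJoin algorithm, whose pipeline model and scheduler are not specified on this page. The function names (`map_phase`, `reduce_side_join`, `chain_join`) and the toy tables are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import product


def map_phase(rows, key_index, tag):
    """Map: emit (join_key, (tag, row)) for every input row."""
    for row in rows:
        yield row[key_index], (tag, row)


def reduce_side_join(left, left_key, right, right_key):
    """One simulated MapReduce job: a reduce-side equi-join of two tables."""
    # Shuffle: group the tagged rows from both inputs by join key.
    groups = defaultdict(lambda: {"L": [], "R": []})
    for key, (tag, row) in map_phase(left, left_key, "L"):
        groups[key][tag].append(row)
    for key, (tag, row) in map_phase(right, right_key, "R"):
        groups[key][tag].append(row)
    # Reduce: pair every left row with every right row that shares the key.
    joined = []
    for key in sorted(groups):
        for l_row, r_row in product(groups[key]["L"], groups[key]["R"]):
            joined.append(l_row + r_row)
    return joined


def chain_join(tables, key_pairs):
    """Chain join T1 ⋈ T2 ⋈ ... ⋈ Tn as n-1 jobs run strictly in sequence.

    key_pairs[i] = (column index in the accumulated result,
                    column index in tables[i + 1]).
    """
    result = tables[0]
    for table, (left_key, right_key) in zip(tables[1:], key_pairs):
        # Assumption of this sketch: no overlap between consecutive jobs.
        result = reduce_side_join(result, left_key, table, right_key)
    return result


# Hypothetical toy tables: R(a, b), S(b, c), T(c, d)
R = [(1, "x"), (2, "y")]
S = [("x", 10), ("y", 20), ("x", 30)]
T = [(10, "p"), (30, "q")]

# R.b = S.b, then (R ⋈ S).c = T.c; column 3 of the intermediate result is S.c.
result = chain_join([R, S, T], [(1, 0), (3, 0)])
print(result)
```

In this sketch the intermediate result of each job is fully materialized before the next job begins. That strict ordering is the cost that, according to the abstract, PipelineJoin aims to reduce by overlapping Map and Reduce tasks across the whole join process.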