PipelineJoin: A new MapReduce-based multi-table join algorithm

doi:10.3969/j.issn.0253-2778.2015.10.006

Journal of University of Science and Technology of China, 2015, Vol. 45, Issue (10): 836-845.

• Articles •

LIN Ziyu, LI Yuqian, LI Can, LAI Yongxuan

1. College of Information Science and Technology, Xiamen University, Xiamen 361005, China; 2. School of Software, Xiamen University, Xiamen 361005, China

Received: 2015-08-27 Revised: 2015-09-29 Accepted: 2015-09-29 Online: 2015-09-29 Published: 2015-09-29
Corresponding author: LIN Ziyu
About the author: LIN Ziyu (corresponding author), male, born in 1987, PhD, lecturer. Research field: big data mining. E-mail: ziylin@xmu.edu.cn
Funding:
Supported by the National Natural Science Foundation of China (61303004, 1202012) and the National Key Technology R&D Program (863) (2015BAH16F00/F01/F02).

Abstract

Abstract: MapReduce, a parallel and distributed computing model, has been widely used to process join operations over two or more large tables. Existing MapReduce-based multi-table join algorithms all have limitations when handling chain joins: some cannot join multiple large tables, while others must run many MapReduce tasks sequentially, which leads to low efficiency. This paper proposes PipelineJoin, a new MapReduce-based multi-table join algorithm that efficiently performs chain joins over an arbitrary number of large tables. PipelineJoin adopts a pipeline model and a scheduler so that the Map tasks and Reduce tasks of the whole join process execute in an overlapping, pipelined fashion, which improves the efficiency of multi-table joins and overcomes the deficiencies of existing chain join algorithms. Extensive experiments on synthetic datasets of various sizes show that PipelineJoin can greatly reduce join time compared with existing chain join algorithms.

Key words: join, multi-table, MapReduce, PipelineJoin
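For readers unfamiliar with the term, a chain join links tables R1(a, b), R2(b, c), R3(c, d), ... on the shared column of each adjacent pair. The sketch below is not PipelineJoin; it simulates, in plain Python, the sequential baseline the abstract criticizes, in which each pairwise join is a separate reduce-side MapReduce job that must finish before the next one starts. The function names and the tuple-based table layout are assumptions made for illustration only.

```python
from collections import defaultdict


def reduce_side_join(left, right):
    """Simulate one MapReduce job joining left's last column to right's first."""
    # Map + shuffle: group rows from both inputs by join-key value.
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[-1]][0].append(row)
    for row in right:
        groups[row[0]][1].append(row)
    # Reduce: for each key, pair every left row with every right row,
    # dropping the duplicated join column from the right row.
    joined = []
    for left_rows, right_rows in groups.values():
        for l in left_rows:
            for r in right_rows:
                joined.append(l + r[1:])
    return joined


def chain_join(tables):
    """Join the tables left to right, one simulated job after another."""
    result = tables[0]
    for next_table in tables[1:]:
        # Each job consumes the complete output of the previous job.
        result = reduce_side_join(result, next_table)
    return result


r1 = [(1, "x"), (2, "y")]
r2 = [("x", 10), ("y", 20), ("x", 30)]
r3 = [(10, "p"), (30, "q")]
print(sorted(chain_join([r1, r2, r3])))
```

With n tables this cascade runs n - 1 jobs strictly one after another; according to the abstract, PipelineJoin instead overlaps the Map and Reduce tasks of these stages under a scheduler.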

CLC number: TP18

LIN Ziyu, LI Yuqian, LI Can, LAI Yongxuan. PipelineJoin: A new MapReduce-based multi-table join algorithm[J]. Journal of University of Science and Technology of China, 2015, 45(10): 836-845.


()
()

[1]	孙磊，张义宁，薛艳芳，乔立山，张丽梅. 自适应功能连接网络学习及其在脑疾病识别中的应用[J]. 中国科学技术大学学报, 2020, 50(8): 1102-1109.
[2]	苏浩，刘其成，牟春晓. 基于MapReduce的商品评论热点发现算法研究[J]. 中国科学技术大学学报, 2019, 49(2): 112-118.
[3]	蔡勇，陈红梅，. MapReduce环境下基于概念分层的概念格并行构造算法[J]. 中国科学技术大学学报, 2018, 48(4): 275-283.
[4]	叶敏，陈海波，张会武，赵秀珍. 螺栓连接对外力荷载下输电塔的影响[J]. 中国科学技术大学学报, 2017, 47(6): 498-507.
[5]	王进，王鸿，夏翠萍，欧阳卫华，陈乔松，邓欣. 基于Spark的组合分类器链多标签分类方法[J]. 中国科学技术大学学报, 2017, 47(4): 350-357.
[6]	王丽, 郑刚. 逆作法开挖坑底工程桩差异回弹有限元分析[J]. 中国科学技术大学学报, 2017, 47(3): 274-282.
[7]	陈远，汪璟玢. 分布式RDF关键词近似搜索方法[J]. 中国科学技术大学学报, 2017, 47(10): 823-836.
[8]	靳春艳，盛自章，黄京飞. 真核生物DNA连接酶Ⅲ的功能演化[J]. 中国科学技术大学学报, 2012, 42(4): 302-310.

PipelineJoin：一种新的基于MapReduce的多表连接算法

PipelineJoin：A new MapReduce-based multi-table join algorithm

PDF (PC)

可视化

摘要/Abstract

引用本文

使用本文

参考文献

相关文章 8

编辑推荐

Metrics

本文评价