PipelineJoin：A new MapReduce-based multi-table join algorithm

doi:10.3969/j.issn.0253-2778.2015.10.006

Abstract

Abstract: MapReduce, a parallel and distributed computing model, has been widely used to process join operations for two or more large tables. The existing MapReduce-based multi-table join algorithms all have some limitations when dealing with chain join. Some methods can not process join operations for multi large tables, and others involve sequentially running too many MapReduce tasks, which leads to low efficiency. Here a new MapReduce-based multi-table join algorithm, PipelineJoin, is proposed to process chain join of a number of tables. PipelineJoin adopts a pipeline model and a scheduler to allow the overlapping execution of a series of Map tasks and Reduce tasks in the whole join process so as to enhance the efficiency of multi-table join, while effectively overcoming the deficiency of the existing methods. Extensive experimental results based on various synthetic datasets show that the proposed algorithm can greatly reduce join operation time compared with the existing chain join algorithms.

Key words: join, multi-table, MapReduce, PipelineJoin

CLC Number:

TP18

LIN Ziyu, LI Yuqian, LI Can, LAI Yongxuan. PipelineJoin：A new MapReduce-based multi-table join algorithm[J]. Journal of University of Science and Technology of China, 2015, 45(10): 836-845.

References

［1］
Slagter K, Hsu C H, Chung Y C, et al. SmartJoin: A network-aware multiway join for MapReduce［J］. Cluster Computing, 2014, 17(3): 629-641.
［2］ Jiang D W, Tung A K H, Chen G. MAP-JOIN-REDUCE: toward scalable and efficient data analysis on large clusters［J］. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(9): 1299-1311.
［3］ Afrati F N, Ullman J D. Optimizing multiway joins in a map-reduce environment［J］. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(9): 1282-1298.
［4］ Kimmett, Thomo A, Venkatesh S. Three-way joins on MapReduce: An experimental study［C］// Proceedings of the 5th International Conference on Information, Intelligence, Systems and Applications. Crete, Greece: IEEE Press, 2014: 227-232.
［5］ Yan K, Zhu H. Two MRJs for multi-way theta-join in MapReduce［C］// Proceedings of the 6th International Conference on Internet and Distributed Computing Systems. Hangzhou, China: Springer, 2013:321-332.
［6］ Dittrich J, Quiané-Ruiz J A, Jindal A, et al. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing)［C］// Proceedings of the 36th International Conference on Very Large Data Bases. Singapore: ACM Press, 2010: 518-529.
［7］ Eltabakh M Y, Tian Y Y, ?zcan F, et al. Cohadoop: Flexible data placement and its exploitation in hadoop［C］// Proceedings of the 37th International Conference on Very Large Data Bases. Seattle, USA: ACM Press, 2011: 575-585.
［8］ Yang H C, Dasdan A, Hsiao R L, et al. Map-reduce-merge: Simplified relational data processing on large clusters［C］// Proceedings of the ACM SIGMOD International Conference on Management of Data. Beijing, China: ACM Press, 2007: 1029-1040.
［9］ Blanas S, Patel J M, Ercegovac V, et al. A comparison of join algorithms for log processing in MapReduce［C］// Proceedings of the ACM SIGMOD International Conference on Management of Data. Indianapolis, USA: ACM Press, 2010: 975-986.
［10］ Zhang X X, Guo Z L, Guo H L, et al. CasJoin: A Cascade Chain for Text Similarity Joins［C］// Proceedings of the 19th ACM international conference on Information and knowledge management. New York, USA: ACM Press, 2010: 1725-1728.
［11］ Blanas S, Li Y N, Patel J M. Design and evaluation of main memory hash join algorithms for multi-core CPUs［C］// Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. Athens, Greece: ACM Press, 2011:37-48.
［12］ Yuan Y, Wang D, Liu J C. Joint scheduling of MapReduce jobs with servers: Performance bounds and experiments［C］// Proceedings of the IEEE Conference on Computer Communications. Toronto, Canada: ACM Press, 2014: 2175-2183.
［13］ Hunt P, Konar M, Junqueira F P, et al. ZooKeeper: Wait-free coordination for Internet-scale systems［C］// Proceedings of the Annual Technical Conference. Boston, USA: ACM Press, 2010:11-24.

()
()

[1]	TAN Jiaxin, ZHANG Weiping. A robust joint modeling approach for longitudinal data [J]. Journal of University of Science and Technology of China, 2020, 50(3): 317-327.
[2]	. Combined toxicity of Fe3O4 nanoparticles and cadmium chloride in the liver of mice by oral route [J]. Journal of University of Science and Technology of China, 2019, 49(6): 431-438.
[3]	SU Hao, LIU Qicheng, MU Chunxiao. Research on product reviews hot spot discovery algorithm based on MapReduce [J]. Journal of University of Science and Technology of China, 2019, 49(2): 112-118.
[4]	CAI Yong, CHEN Hongmei,. A parallel algorithm for constructing concept lattice based on hierarchical concept under MapReduce [J]. Journal of University of Science and Technology of China, 2018, 48(4): 275-283.
[5]	ZHANG Huimin, YANG Ming, LV Jing. Multifeature hyperspectral image classification based on adaptive kernel joint sparse representation [J]. Journal of University of Science and Technology of China, 2018, 48(4): 298-306.
[6]	YE Min, CHEN Haibo, ZHANG Huiwu, ZHAO Xiuzhen. Effect of bolt joints on behaviour of transmission tower under external loads [J]. Journal of University of Science and Technology of China, 2017, 47(6): 498-507.
[7]	WANG Jin, WANG Hong, XIA Cuiping, OUYANG Weihua, CHEN Qiaosong, DENG Xin. Ensembles of classifier chains for multi-label classification based on Spark [J]. Journal of University of Science and Technology of China, 2017, 47(4): 350-357.
[8]	GU Min, GUO Qing, CAO Ye, ZHU Feng, GU Yanhui, ZHOU Junsheng, QU Weiguang,. Research on web page automatic categorization based on structural and text information [J]. Journal of University of Science and Technology of China, 2017, 47(4): 290-296.
[9]	WANG Li, ZHENG Gang. Finite element analysis of the differential rebound of piles in pit during top-down excavation [J]. Journal of University of Science and Technology of China, 2017, 47(3): 274-282.
[10]	CHEN Yuan, WANG Jingbin. Distributed keyword approximate search method for RDF [J]. Journal of University of Science and Technology of China, 2017, 47(10): 823-836.
[11]	LI Fulin, ZHU Shixin. A fast algorithm for the 2-adic joint complexity of pn -periodic a binary multisequence [J]. Journal of University of Science and Technology of China, 2013, 43(2): 169-172.
[12]	ZHNAG Jian-min, ZHANG Zhao-yang, HUANG Ai-ping. Optimal subchannel, rate and power allocation for OFDM-based two-tier Femtocell networks [J]. Journal of University of Science and Technology of China, 2009, 39(10): 1034-1038.