Spark/Shark-based OLAP system for smart grid applications

doi:10.3969/j.issn.0253-2778.2016.01.009

Abstract

OLAP queries on electricity consumption information in the smart grid have several prominent features: huge data volumes, joins across multiple tables, and complex SQL structure. For this kind of application, a traditional RDBMS suffers from poor scalability, low write throughput, and unacceptable query performance. A Spark/Shark-based OLAP system for electricity consumption information in the smart grid was designed. The system uses the distributed file system HDFS for data storage, Shark to parse SQL queries, and Spark to execute them. However, Shark does not support fine-grained indexes, which hinders further improvement of query performance. To overcome this limitation, a Trie-tree-based fine-grained index technique, TrieIndex, and a data reorganization scheme were proposed. Experimental results with real electricity consumption data and queries show that the write throughput of the system is 12 times higher than that of the RDBMS, and its query efficiency is 10 times higher than that of the original Shark.
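The abstract does not describe how TrieIndex is built, so the following is only a minimal sketch of the general idea of a Trie-based fine-grained index, not the authors' implementation. It assumes that index keys are strings (for example, a date such as "20140101") and that each key maps to hypothetical data locations given as (file name, byte offset) pairs; the class name `TrieIndexSketch` and all method names are illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical data location: (file name, byte offset). An assumption for
# illustration; the abstract does not say what the index entries point to.
Location = Tuple[str, int]


@dataclass
class TrieNode:
    # One child per next character of the key.
    children: Dict[str, "TrieNode"] = field(default_factory=dict)
    # Locations of records whose key ends exactly at this node.
    locations: List[Location] = field(default_factory=list)


class TrieIndexSketch:
    """Toy prefix index over string keys (not the paper's TrieIndex)."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, key: str, location: Location) -> None:
        """Walk or create one node per character, then record the location."""
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.locations.append(location)

    def lookup_prefix(self, prefix: str) -> List[Location]:
        """Return the sorted locations of all keys starting with `prefix`."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []  # no indexed key has this prefix
        # Collect every location stored in the subtree below the prefix.
        found: List[Location] = []
        stack = [node]
        while stack:
            current = stack.pop()
            found.extend(current.locations)
            stack.extend(current.children.values())
        return sorted(found)


if __name__ == "__main__":
    index = TrieIndexSketch()
    index.insert("20140101", ("part-0", 0))
    index.insert("20140102", ("part-0", 4096))
    index.insert("20140201", ("part-1", 0))
    # Only the blocks for January 2014 need to be read.
    print(index.lookup_prefix("201401"))
```

In a sketch like this, a query that filters on a key prefix reads only the locations returned by `lookup_prefix` instead of scanning every file, which is the kind of saving a fine-grained index is meant to provide.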

Key words: Spark, OLAP, power big data, index, Trie tree

CLC Number: TP18

WANG Yaling, LIU Yue, HONG Jianguang, CUI Wei, LI Yanhu, SU Yipeng, HUANG Gaopan, ZHANG Mingming, LIU Wantao. Spark/Shark-based OLAP system for smart grid applications[J]. Journal of University of Science and Technology of China, 2016, 46(1): 66-75.


()
()

[1]	Li Ming, Wen Canhong. Impact of COVID-19 pandemic on stock market via sparse principal component analysis [J]. Journal of University of Science and Technology of China, 2021, 51(5): 404-418.
[2]	HU Jiheng, LI Rui, WANG Yu, WANG Yipu, FU Yuyun. Analysis of the characteristics of satellite-derived multiple channel microwave emissivity difference vegetation index (EDVI) over different vegetation types [J]. Journal of University of Science and Technology of China, 2020, 50(4): 528-541.
[3]	. On the multiplicatively weighted Harary index of composite graphs [J]. Journal of University of Science and Technology of China, 2020, 50(3): 261-270.
[4]	YE Wuyi, LI Yiwei, WU Zun. Dynamic systematic tail risk measurement based on tail index regression [J]. Journal of University of Science and Technology of China, 2020, 50(2): 176-184.
[5]	YE Wuyi, MA Ronggui, WU Zun. Dynamic correlation of quantile regression model based on smooth transition mechanism [J]. Journal of University of Science and Technology of China, 2019, 49(8): 668-679.
[6]	. Maximum Balaban index and sum-Balaban index of cacti [J]. Journal of University of Science and Technology of China, 2019, 49(5): 368-376.
[7]	. Maximum augmented Zagreb index of graphs with given vertex bipartiteness [J]. Journal of University of Science and Technology of China, 2019, 49(3): 195-198.
[8]	SHI Lukui, GUO Linlin, FANG Zizhe, ZHANG Jun. Parallel ISOMAP algorithm based on Spark [J]. Journal of University of Science and Technology of China, 2019, 49(10): 842-850.
[9]	GU Junhua, WANG Shoubin, WU Junyan , ZHANG suqi. Multi-strategy ant colony algorithm for solving the maximum clique problem based on Spark [J]. Journal of University of Science and Technology of China, 2019, 49(10): 851-860.
[10]	XU Ziwei, WANG Peng, CHEN Zonghai. A two-stage feature selection method based on Fisher’s ratio and prediction risk for telecom customer churn prediction [J]. Journal of University of Science and Technology of China, 2017, 47(8): 686-694.
[11]	WANG Jin, WANG Hong, XIA Cuiping, OUYANG Weihua, CHEN Qiaosong, DENG Xin. Ensembles of classifier chains for multi-label classification based on Spark [J]. Journal of University of Science and Technology of China, 2017, 47(4): 350-357.
[12]	XU Xueli, ZHAO Xuejing. Application of sparse spectral clustering algorithm in high-dimensional data [J]. Journal of University of Science and Technology of China, 2017, 47(4): 311-319.
[13]	BU Yao, WU Bin, CHEN Yufeng, BAI Demeng. BDAP: A data mining platform based on Spark [J]. Journal of University of Science and Technology of China, 2017, 47(4): 358-368.
[14]	CHEN Pengfei, LIU Haifang, LIU Qiaoling. An analysis method for structural reliability sensitivity based on the bisection method of sampling [J]. Journal of University of Science and Technology of China, 2015, 45(9): 763-769.
[15]	DA Tingting, ZHANG Shuguang, DA Cheng. The MFCCA algorithm and its application in financial market: A new view of multifractal extension of DCCA [J]. Journal of University of Science and Technology of China, 2015, 45(8): 683-691.