Spark/Shark-based OLAP system for smart grid applications

Journal of University of Science and Technology of China, 2016, Vol. 46, Issue (1): 66-75. DOI: 10.3969/j.issn.0253-2778.2016.01.009

• Articles •

WANG Yaling, LIU Yue, HONG Jianguang, CUI Wei, LI Yanhu, SU Yipeng, HUANG Gaopan, ZHANG Mingming, LIU Wantao

1. State Grid Information & Telecommunication Group Co. Ltd., Beijing 100761, China; 2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; 3. State Grid Zhejiang Electric Power Company, Hangzhou 310007, China; 4. State Grid Jiangsu Electric Power Company Information & Telecommunication Branch, Nanjing 210029, China

Received: 2015-08-27 Revised: 2015-09-29 Accepted: 2015-09-29 Online: 2015-09-29 Published: 2015-09-29
Corresponding author: LIU Wantao
About the author: WANG Yaling, female, born in 1972, master's degree, senior engineer. Research interest: data mining. E-mail: wangyaling@sgitg.sgcc.com.cn
Funding: Supported by the Science and Technology Project of State Grid Corporation of China (SGJSXT00YWJS1400072).

Abstract

OLAP queries over big data from electricity consumption information collection involve large data volumes, frequent multi-table joins, and complex SQL structures. For this kind of workload, traditional relational databases show poor scalability, low write throughput, and low query efficiency. To address these problems, a Spark/Shark-based OLAP analysis system for power big data was designed. The system stores the data of the electricity consumption information collection system in the distributed file system HDFS, uses Shark to parse SQL queries at the front end, and uses Spark to execute them. However, native Shark supports only coarse-grained partitioning and no fine-grained index, so it cannot filter out irrelevant data efficiently, which limits query performance. To overcome this limitation, the system introduces TrieIndex, a fine-grained index structure based on a prefix tree (trie), and applies a data reorganization technique to optimize the data layout on HDFS. Together these improve Shark's data filtering capability and the performance of OLAP analysis on electricity consumption big data. Experiments with data and queries from a real electricity consumption information collection system show that the system improves write speed by a factor of 12 over a relational database and query efficiency by a factor of more than 10 over native Shark.

Key words: Spark, OLAP, power big data, index, Trie tree
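
The abstract does not describe the internal layout of TrieIndex, so the following is only a minimal sketch of the general idea of a prefix-tree index used for data filtering, not the authors' implementation. It assumes string keys (hypothetical meter IDs) and integer split IDs standing in for blocks of data on HDFS; the names `TrieNode`, `PrefixSplitIndex`, `add`, and `candidate_splits` are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    # Next character -> child node.
    children: dict = field(default_factory=dict)
    # IDs of data splits that contain at least one key under this prefix.
    splits: set = field(default_factory=set)


class PrefixSplitIndex:
    """Hypothetical prefix-tree index mapping key prefixes to data splits."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, key: str, split_id: int) -> None:
        """Record that a row with this key is stored in the given split."""
        node = self.root
        node.splits.add(split_id)
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
            node.splits.add(split_id)

    def candidate_splits(self, prefix: str) -> set:
        """Return the splits that may hold keys starting with the prefix."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return set()  # no stored key has this prefix
        return set(node.splits)


# Illustrative (made-up) data: (meter id, split id). If the data is reorganized
# so that keys sharing a prefix are stored together, each prefix maps to few
# splits and a prefix predicate can skip most of the data.
index = PrefixSplitIndex()
for meter_id, split_id in [
    ("3301A001", 0),
    ("3301A002", 0),
    ("3302B001", 1),
    ("3201C001", 2),
]:
    index.add(meter_id, split_id)

to_scan = index.candidate_splits("3301")  # only split 0 needs to be read
```

In this sketch a query restricted to the prefix `"3301"` reads one split instead of three, which illustrates why a fine-grained index combined with a prefix-friendly data layout can reduce the amount of irrelevant data a query engine has to scan.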

CLC number: TP18

Cite this article: WANG Yaling, LIU Yue, HONG Jianguang, CUI Wei, LI Yanhu, SU Yipeng, HUANG Gaopan, ZHANG Mingming, LIU Wantao. Spark/Shark-based OLAP system for smart grid applications[J]. Journal of University of Science and Technology of China, 2016, 46(1): 66-75.


()
()

[1]	石陆魁，郭林林，房子哲，张军. 基于Spark的并行ISOMAP算法[J]. 中国科学技术大学学报, 2019, 49(10): 842-850.
[2]	顾军华，王守彬，武君艳，张素琪. 基于Spark的多策略蚁群算法求解最大团问题[J]. 中国科学技术大学学报, 2019, 49(10): 851-860.
[3]	朱一波，鲍培明，吉根林. 一种用户频繁移动模式并行挖掘算法[J]. 中国科学技术大学学报, 2018, 48(1): 57-64.
[4]	徐子伟，王鹏，陈宗海. 一种基于Fisher比率和预测风险准则的电信客户流失预测分步特征选择方法[J]. 中国科学技术大学学报, 2017, 47(8): 686-694.
[5]	王进，王鸿，夏翠萍，欧阳卫华，陈乔松，邓欣. 基于Spark的组合分类器链多标签分类方法[J]. 中国科学技术大学学报, 2017, 47(4): 350-357.
[6]	卜尧，吴斌，陈玉峰，白德盟. BDAP——一个基于Spark的数据挖掘工具平台[J]. 中国科学技术大学学报, 2017, 47(4): 358-368.
[7]	黄冬梅，孙乐，赵丹枫. 基于ADMD融合策略的海洋大数据索引技术研究[J]. 中国科学技术大学学报, 2015, 45(10): 813-821.
[8]	班雷雨，霍欢，徐彪. 基于移动数据的人群活动热点区域的发现[J]. 中国科学技术大学学报, 2015, 45(10): 829-835.

基于Spark/Shark的电力用采大数据OLAP分析系统

Spark/Shark-based OLAP system for smart grid applications

PDF (PC)

可视化

摘要/Abstract

引用本文

使用本文

参考文献

相关文章 8

编辑推荐

Metrics

本文评价