Journal of University of Science and Technology of China ›› 2016, Vol. 46 ›› Issue (1): 66-75. DOI: 10.3969/j.issn.0253-2778.2016.01.009

• Original Article •

基于Spark/Shark的电力用采大数据OLAP分析系统

王亚玲,刘越,洪建光,崔蔚,李彦虎,苏伊鹏,黄高攀,张明明,刘万涛   

  1. 国网信息通信产业集团有限公司,北京 100761;2. 中国科学院计算技术研究所,北京 100190;3. 国网浙江省电力公司,浙江杭州 310007;4. 国网江苏省电力公司信息通信分公司,江苏南京 210029
  • 收稿日期:2015-08-27 修回日期:2015-09-29 接受日期:2015-09-29 出版日期:2015-09-29 发布日期:2015-09-29
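  • Corresponding author: LIU Wantao
  • About the author: WANG Yaling, female, born in 1972, master's degree, senior engineer. Research field: data mining. E-mail: wangyaling@sgitg.sgcc.com.cn
  • Supported by:
    Science and Technology Project of State Grid Corporation of China (SGJSXT00YWJS1400072).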

Spark/Shark-based OLAP system for smart grid applications

WANG Yaling, LIU Yue, HONG Jianguang, CUI Wei, LI Yanhu, SU Yipeng, HUANG Gaopan, ZHANG Mingming, LIU Wantao

  1. State Grid Information & Telecommunication Group Co. Ltd., Beijing 100761, China; 2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; 3. State Grid Zhejiang Electric Power Company, Hangzhou 310007, China; 4. State Grid Jiangsu Electric Power Company Information & Telecommunication Branch, Nanjing 210029, China
  • Received:2015-08-27 Revised:2015-09-29 Accepted:2015-09-29 Online:2015-09-29 Published:2015-09-29

摘要: 用电信息大数据上的OLAP查询涉及数据量大,具有多表连接操作频繁、SQL结构复杂等特点,传统关系型数据库面对该类应用,表现出可扩展性弱、数据写入吞吐量低与查询效率低等问题.为此设计了一套基于Spark/Shark的电力大数据OLAP分析系统,该系统采用分布式文件系统HDFS保存电力用电信息采集系统的大数据,通过Shark进行前端SQL解析,Spark进行查询计算;然而,原生Shark只支持粗粒度分区,不支持细粒度的索引技术,难以高效地过滤无关数据,影响了查询性能.为克服这一不足,该系统设计了一种基于前缀树的细粒度索引结构TrieIndex,并通过数据重组技术优化了数据在HDFS的分布,提升了Shark的数据过滤能力以及用电信息大数据OLAP分析的性能.真实用电信息采集系统数据与查询的实验结果表明,该系统比关系型数据库的写入速度提升了12倍,比原生Shark的查询效率提升了10倍以上.

关键词: Spark, OLAP, 电力大数据, 索引, 前缀树

Abstract: OLAP queries over electricity consumption information in the smart grid involve large data volumes, frequent multi-table joins, and complex SQL structures. For this kind of application, traditional relational databases show poor scalability, low write throughput, and low query efficiency. To address this, a Spark/Shark-based OLAP system for electricity consumption big data was designed. The system stores the data of the electricity consumption information collection system in the distributed file system HDFS, uses Shark to parse SQL queries, and uses Spark to execute them. However, native Shark supports only coarse-grained partitioning and has no fine-grained index, so it cannot filter out irrelevant data efficiently, which limits query performance. To overcome this limitation, a fine-grained index structure based on a Trie tree, TrieIndex, was designed, and a data re-organization scheme was used to optimize the data layout on HDFS, improving both the data filtering capability of Shark and the performance of OLAP analysis. Experiments with real data and queries from an electricity consumption information collection system show that the system improves write speed by a factor of 12 over a relational database and query efficiency by a factor of more than 10 over native Shark.

Key words: Spark, OLAP, power big data, index, Trie tree
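
The abstract does not describe the internals of TrieIndex, so the following is only a minimal sketch of the general idea of a prefix-tree index used for data filtering, not the authors' implementation. It assumes that records are keyed by a string identifier and that the index maps each key prefix to the set of data blocks that must be scanned. The class names `TrieNode` and `PrefixBlockIndex`, the helper `build_demo_index`, and the sample keys and block names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    # Next character -> child node.
    children: dict = field(default_factory=dict)
    # Names of the data blocks that hold at least one key passing through this node.
    blocks: set = field(default_factory=set)


class PrefixBlockIndex:
    """Hypothetical prefix-tree index: key prefix -> data blocks to scan."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, key: str, block: str) -> None:
        """Register that a record with this key is stored in this block."""
        node = self.root
        node.blocks.add(block)
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
            node.blocks.add(block)

    def blocks_for_prefix(self, prefix: str) -> set:
        """Return only the blocks that can contain keys starting with prefix."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return set()  # no key has this prefix, so nothing needs scanning
        return set(node.blocks)


def build_demo_index() -> PrefixBlockIndex:
    # Sample keys and block names are invented for illustration only.
    index = PrefixBlockIndex()
    index.add("3301-0001", "block-a")
    index.add("3301-0002", "block-a")
    index.add("3302-0001", "block-b")
    index.add("3201-0001", "block-c")
    return index
```

In this sketch, a query restricted to the prefix "3301" only needs to read "block-a" instead of all three blocks. Such filtering is most effective when records that share a prefix are stored together, which is one plausible reading of why the abstract pairs the index with a data re-organization step.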

CLC Number: