Journal of University of Science and Technology of China ›› 2017, Vol. 47 ›› Issue (4): 358-368. DOI: 10.3969/j.issn.0253-2778.2017.04.011

• Original Article •

BDAP——一个基于Spark的数据挖掘工具平台

卜尧,吴斌,陈玉峰,白德盟   

  1. 北京邮电大学智能通信软件与多媒体北京重点实验室, 北京 100876;
    2. 北京邮电大学计算机学院, 北京 100876;
    3. 国网山东省电力公司电力科学研究院, 济南 250000
  • 收稿日期:2016-08-28 修回日期:2017-12-08 出版日期:2017-04-30 发布日期:2017-04-30
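  • Corresponding author: WU Bin
  • About the author: BU Yao, female, born in 1993, master's student. Research interest: machine learning. E-mail: buyao1993@bupt.edu.cn
  • Funding:
    Supported by the National High Technology Research and Development Program of China (863 Program) (2015AA050204), a State Grid science and technology project (520626150032), and the Beijing Municipal Education Commission joint construction project plan.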

BDAP: A data mining platform based on Spark

BU Yao, WU Bin, CHEN Yufeng, BAI Demeng   

  1. Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing 100876, China;
    2. School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China;
    3. State Grid Shandong Electric Power Research Institute, Jinan 250000, China
  • Received: 2016-08-28 Revised: 2017-12-08 Online: 2017-04-30 Published: 2017-04-30

摘要: 大数据处理系统是大数据领域的一个热点,为此首先研究大数据分析平台的架构与功能,将大数据分析平台分为数据源、数据吸收层、数据存储层、平台层、安全与监控层、设备层和应用层.平台包含多个数据预处理和算法模块,平台架构为大数据分析奠定了基础.在功能上,该平台功能全面,可以自由组合各种操作,模块之间耦合度低,便于维护和拓展.在用户体验上,调参、建立流程、监控、数据挖掘过程都是可视的,融合工作流和调度流技术.在性能上,该平台相应算法的性能优于Hive和MLlib.最后,举例说明大数据挖掘平台的应用场景.可以对电网线路故障和气象数据进行预处理,从而对故障进行预测和分类,可以通过视频挖掘组件,对数据分类.

关键词: 大数据分析平台, Hadoop, Storm, Spark, 批处理, 数据挖掘

Abstract: Big data processing systems are a hot research topic in the big data field. This paper first studies the architecture and functions of a big data analysis platform, dividing it into the data source, data absorption layer, data storage layer, platform layer, security and monitoring layer, equipment layer, and application layer. The platform includes multiple data preprocessing and algorithm modules, and its architecture lays a foundation for big data analysis. In terms of functionality, the platform is comprehensive: operations can be freely combined, and the coupling between modules is low, which makes the platform easy to maintain and extend. In terms of user experience, parameter tuning, process construction, monitoring, and the data mining process are all visualized, and workflow and scheduling-flow technologies are integrated. In terms of performance, the platform's corresponding algorithms outperform those of Hive and MLlib. Finally, examples illustrate the platform's application scenarios: power grid line fault data and meteorological data can be preprocessed so that faults can be predicted and classified, and a video mining component can be used to classify data.

Key words: big data analysis platform, Hadoop, Storm, Spark, batch processing, data mining
