Journal of University of Science and Technology of China ›› 2017, Vol. 47 ›› Issue (4): 350-357. DOI: 10.3969/j.issn.0253-2778.2017.04.010

• Original Paper •

Ensembles of classifier chains for multi-label classification based on Spark

WANG Jin, WANG Hong, XIA Cuiping, OUYANG Weihua, CHEN Qiaosong, DENG Xin

  1. Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  • Received: 2016-08-28 Revised: 2016-12-08 Online: 2017-04-30 Published: 2017-04-30
  • Corresponding author: WANG Jin
  • About the author: WANG Jin (corresponding author), male, born in 1979, PhD, professor. Research interests: data mining, artificial intelligence. E-mail: wangjin@cqupt.edu.cn
  • Supported by:
    the Chongqing Basic and Frontier Research Program (cstc2014jcyjA40001, cstc2014jcyjA40022) and the Science and Technology Research Project of the Chongqing Municipal Education Commission (KJ1400436).

Abstract: With the wide application of data mining to real-world problems, multi-label learning has become an active research topic in the field. The ensembles of classifier chains (ECC) algorithm is an effective and accurate multi-label classification method, but its time and space complexity is too high for large-scale multi-label classification tasks. To address this, a Spark-based algorithm named Spark ensembles of classifier chains (S-ECC) is proposed, in which each step of the sequential ECC algorithm is implemented in parallel. Experiments on a single machine and on a cluster show that S-ECC adapts well to large-scale multi-label datasets and achieves a good speedup, and that its classification performance is no worse than that of the traditional sequential multi-label classification method.

Key words: multi-label learning, ensembles of classifier chains (ECC), Apache Spark, MapReduce

CLC Number:
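
The abstract gives no implementation details, so the following is only a minimal single-machine sketch of the general ECC idea it refers to: each ensemble member is a classifier chain trained on a resampled copy of the data with a random label order, and the members' predictions are combined by voting. The toy `CentroidClassifier` base learner, the bootstrap resampling, the 0.5 vote threshold, and all function names are assumptions made for illustration and are not taken from S-ECC. The `map` over independent training tasks marks the step that a distributed job could, in principle, spread across workers.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]
Labels = List[int]


class CentroidClassifier:
    """Toy binary base learner (assumption): predict the class with the nearer mean."""

    def fit(self, X: Sequence[Vector], y: Sequence[int]) -> "CentroidClassifier":
        self.centroids = {}
        for label in (0, 1):
            rows = [x for x, t in zip(X, y) if t == label]
            if rows:  # a class absent from the sample simply gets no centroid
                self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_one(self, x: Vector) -> int:
        def sq_dist(c: Vector) -> float:
            return sum((a - b) ** 2 for a, b in zip(x, c))
        return min(self.centroids, key=lambda label: sq_dist(self.centroids[label]))


class ClassifierChain:
    """One binary model per label; each model also sees the labels earlier in the chain."""

    def __init__(self, order: Sequence[int]):
        self.order = list(order)

    def fit(self, X: Sequence[Vector], Y: Sequence[Labels]) -> "ClassifierChain":
        self.models = []
        features = [list(x) for x in X]
        for j in self.order:
            target = [row[j] for row in Y]
            self.models.append(CentroidClassifier().fit(features, target))
            # Append the true value of label j as an extra feature for later links.
            features = [f + [float(row[j])] for f, row in zip(features, Y)]
        return self

    def predict_one(self, x: Vector) -> Labels:
        features = list(x)
        labels = [0] * len(self.order)
        for j, model in zip(self.order, self.models):
            labels[j] = model.predict_one(features)
            features.append(float(labels[j]))  # feed the prediction down the chain
        return labels


def train_chain(task: Tuple[Sequence[Vector], Sequence[Labels], int]) -> ClassifierChain:
    """Train one ensemble member on a bootstrap sample with a random label order."""
    X, Y, seed = task
    rng = random.Random(seed)
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    order = list(range(len(Y[0])))
    rng.shuffle(order)
    return ClassifierChain(order).fit([X[i] for i in idx], [Y[i] for i in idx])


def train_ecc(X, Y, n_chains: int = 5, seed: int = 0) -> List[ClassifierChain]:
    tasks = [(X, Y, seed + k) for k in range(n_chains)]
    # The tasks do not depend on each other, so this map is the parallelizable step.
    return list(map(train_chain, tasks))


def predict_ecc(chains: Sequence[ClassifierChain], x: Vector, threshold: float = 0.5) -> Labels:
    """Average the chains' votes per label and threshold the result."""
    votes = [chain.predict_one(x) for chain in chains]
    return [int(sum(v[j] for v in votes) / len(votes) >= threshold)
            for j in range(len(votes[0]))]
```

Calling `train_ecc(X, Y)` on a feature matrix `X` and a 0/1 label matrix `Y`, then `predict_ecc(chains, x)` on a new instance, returns one 0/1 decision per label. Because each chain is built only from its own `(X, Y, seed)` task, the training step has no shared state to coordinate, which is the property a parallel implementation would rely on.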