Journal of University of Science and Technology of China ›› 2019, Vol. 49 ›› Issue (12): 974-984. DOI: 10.3969/j.issn.0253-2778.2019.12.004

• Original Paper •

A new random projection-based ensemble classifier for high-dimensional data

CUI Wenquan

  1. Department of Statistics and Finance, School of Management, University of Science and Technology of China, Hefei 230026, Anhui, China
  • Received: 2019-04-14 Revised: 2019-05-23 Online: 2019-12-31 Published: 2019-12-31
  • Corresponding author: CUI Wenquan
  • About the author: CUI Wenquan (corresponding author), male, born in 1964, PhD, associate professor. Research field: mathematical statistics. E-mail: wqcui@ustc.edu.cn
  • Supported by:
    the National Natural Science Foundation of China (71873128) and the Natural Science Foundation of Anhui Province (1308085MA02).

Abstract: A decision tree ensemble method based on random projection (Projection Forest, PJForest) was proposed for the classification of high-dimensional data. The method used decision trees as base classifiers: it reduced the dimensionality of the data with a series of random projections, built one decision tree on each dimensionally reduced data set, and then combined the trees into an ensemble classifier. Reducing the dimensionality with appropriate random projections can preserve the information contained in the geometric structure of the data. Moreover, perturbing the original data through random projection can enrich the diversity of the decision trees, and a suitable ensemble can effectively overcome the influence of noise, which in turn improves the generalization ability of PJForest. A limiting property of the generalization error of PJForest was proved, and the convergence rate of the generalization error in a certain sense was obtained. Extensive simulation studies were conducted, and real data were analyzed empirically. The simulation results showed that PJForest can effectively classify high-dimensional data containing a large amount of noise, and that it has better classification performance than existing methods such as random forest and XGBoost.

Key words: decision tree, diversity, high-dimensional classification, ensemble learning, random projection
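
The pipeline described in the abstract (random projection, one decision tree per projected data set, then an ensemble) can be illustrated with a minimal sketch. This is not the paper's PJForest algorithm: the abstract does not specify how the projections are drawn or how the trees are combined, so the Gaussian projection matrices, the Gini-based tree, the plain majority vote, and all function names and parameters (`fit_projection_ensemble`, `n_trees`, `d`, `make_toy_data`, and so on) are assumptions made for illustration only.

```python
import random
from collections import Counter


def gaussian_projection(p, d, rng):
    """Draw a d x p matrix with i.i.d. N(0, 1/d) entries (an assumed choice)."""
    scale = 1.0 / d ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(p)] for _ in range(d)]


def project(A, x):
    """Map a p-dimensional point x to d dimensions via the matrix A."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(X, y, depth=0, max_depth=5, min_leaf=2):
    """Grow a small axis-aligned tree; a leaf is a label, a node is a tuple."""
    majority = Counter(y).most_common(1)[0][0]
    if depth >= max_depth or len(set(y)) == 1 or len(y) < 2 * min_leaf:
        return majority
    best = None  # (weighted impurity, feature index, threshold)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t)
    if best is None:
        return majority
    _, j, t = best
    lpart = [(row, yi) for row, yi in zip(X, y) if row[j] <= t]
    rpart = [(row, yi) for row, yi in zip(X, y) if row[j] > t]
    return (j, t,
            build_tree([r for r, _ in lpart], [c for _, c in lpart],
                       depth + 1, max_depth, min_leaf),
            build_tree([r for r, _ in rpart], [c for _, c in rpart],
                       depth + 1, max_depth, min_leaf))


def predict_tree(node, x):
    """Walk from the root to a leaf and return the leaf label."""
    while isinstance(node, tuple):
        j, t, left, right = node
        node = left if x[j] <= t else right
    return node


def fit_projection_ensemble(X, y, n_trees=25, d=3, seed=0):
    """Fit one tree per random projection; returns a list of (matrix, tree)."""
    rng = random.Random(seed)
    p = len(X[0])
    ensemble = []
    for _ in range(n_trees):
        A = gaussian_projection(p, d, rng)
        Z = [project(A, x) for x in X]
        ensemble.append((A, build_tree(Z, y)))
    return ensemble


def predict_ensemble(ensemble, x):
    """Combine the trees by an unweighted majority vote (an assumed rule)."""
    votes = Counter(predict_tree(tree, project(A, x)) for A, tree in ensemble)
    return votes.most_common(1)[0][0]


def make_toy_data(n=60, p=20, seed=1):
    """Hypothetical data: classes differ in coordinate 0, the rest is noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        x[0] += 4.0 * label
        X.append(x)
        y.append(label)
    return X, y
```

A call such as `model = fit_projection_ensemble(*make_toy_data())` followed by `predict_ensemble(model, x)` classifies a new point `x`. Each tree sees a different low-dimensional view of the same data, which is the source of the diversity the abstract refers to; a practical implementation would more likely rely on an established decision tree library than on this hand-written tree.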