A two-stage feature selection method based on Fisher’s ratio and prediction risk for telecom customer churn prediction

Journal of University of Science and Technology of China, 2017, Vol. 47, Issue 8: 686-694. DOI: 10.3969/j.issn.0253-2778.2017.08.008

A two-stage feature selection method based on Fisher’s ratio and prediction risk for telecom customer churn prediction

XU Ziwei, WANG Peng, CHEN Zonghai

Department of Automation, University of Science and Technology of China, Hefei, 230027, China

Received: 2016-03-18 Revised: 2016-11-07 Online: 2017-08-31 Published: 2017-08-31
Contact: CHEN Zonghai
About author: XU Ziwei, male, born in 1986, PhD candidate. Research field: predictive control. E-mail: xziwei@mail.ustc.edu.cn
Supported by the National Natural Science Foundation of China (61375079).

Abstract

Abstract: Telecom customer churn prediction is crucial to the customer relationship management systems of telecom operators. It aims to identify customers who are at high risk of churning. Building a churn prediction model involves data pre-processing, imbalance processing, feature selection, and classifier training and evaluation. To address the high dimensionality of telecom datasets, a two-stage feature selection method based on Fisher’s ratio and prediction risk is proposed, which combines the advantages of filter and wrapper feature selection methods. The method is evaluated on a real-world dataset, and the experimental results show that it reduces feature dimensionality and improves the predictive performance of classifiers.

Key words: big data, churn prediction, two-stage feature selection, Spark
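
The two stages named in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything here is an assumption: Fisher’s ratio is taken in its common binary form, (mean difference)² / (sum of class variances); prediction risk is taken as the increase in error when a feature is replaced by its mean; the nearest-centroid classifier, the function names, and the toy data are hypothetical stand-ins rather than the authors’ setup, and the risk is measured on the training rows only to keep the example short.

```python
from statistics import mean, pvariance

def fisher_ratio(column, labels):
    """Fisher's ratio of one feature for a binary target (labels 0/1); assumed form."""
    pos = [x for x, y in zip(column, labels) if y == 1]
    neg = [x for x, y in zip(column, labels) if y == 0]
    spread = pvariance(pos) + pvariance(neg)
    return (mean(pos) - mean(neg)) ** 2 / spread if spread > 0 else 0.0

def fit_centroids(rows, labels):
    """Hypothetical stand-in classifier: one mean vector per class."""
    return {c: [mean(col) for col in zip(*[r for r, y in zip(rows, labels) if y == c])]
            for c in (0, 1)}

def error_rate(centroids, rows, labels):
    """Fraction of rows whose nearest centroid disagrees with the label."""
    def predict(row):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])))
    return sum(predict(r) != y for r, y in zip(rows, labels)) / len(rows)

def two_stage_select(rows, labels, top_k):
    columns = list(zip(*rows))
    # Stage 1 (filter): keep the top_k features ranked by Fisher's ratio.
    ranked = sorted(range(len(columns)),
                    key=lambda j: fisher_ratio(columns[j], labels), reverse=True)
    kept = sorted(ranked[:top_k])
    reduced = [[r[j] for j in kept] for r in rows]
    # Stage 2: prediction risk of a feature = error with that feature
    # replaced by its mean, minus the baseline error (assumed definition).
    model = fit_centroids(reduced, labels)
    base = error_rate(model, reduced, labels)
    selected = []
    for pos, j in enumerate(kept):
        m = mean(columns[j])
        perturbed = [r[:pos] + [m] + r[pos + 1:] for r in reduced]
        if error_rate(model, perturbed, labels) - base > 0:
            selected.append(j)
    return kept, selected

# Toy data (made up): feature 0 separates the classes, feature 1 is noise,
# feature 2 is weakly informative.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
rows = [[0.0, 5.0, 1.0], [0.2, 5.1, 2.0], [0.1, 4.9, 1.0], [0.3, 5.0, 2.0],
        [1.0, 5.0, 1.5], [1.2, 4.9, 2.5], [1.1, 5.1, 1.5], [1.3, 5.0, 2.5]]
kept, selected = two_stage_select(rows, labels, top_k=2)
print(kept, selected)
```

On this toy data the filter stage discards the noise feature, and the second stage drops the weakly informative one because masking it does not raise the error. A realistic pipeline would estimate the risk on held-out data and use the classifier under study.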

CLC number: O157.4

XU Ziwei, WANG Peng, CHEN Zonghai. A two-stage feature selection method based on Fisher’s ratio and prediction risk for telecom customer churn prediction[J]. Journal of University of Science and Technology of China, 2017, 47(8): 686-694.
