Journal of University of Science and Technology of China ›› 2011, Vol. 41 ›› Issue (7): 607-614. DOI: 10.3969/j.issn.0253-2778.2011.07.007

• Original Paper •

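Adaptive adjustment weighted text classification

LAI Yingxu (赖英旭)

  1. College of Computer Science and Technology, Beijing University of Technology, Beijing 100124, China
  • Received: 2011-04-28 Revised: 2011-06-21 Online: 2011-07-31 Published: 2011-07-31
  • Corresponding author: LAI Yingxu
  • About the author: LAI Yingxu (corresponding author), female, born in 1973, PhD, associate professor. Research interest: information security. E-mail: laiyingxu@bjut.edu.cn
  • Funding:
    National Natural Science Foundation of China (61001178), Beijing Natural Science Foundation (4102012), General Program of the Science and Technology Development Plan of the Beijing Municipal Education Commission (KM200810005030), the "Training Program for Young and Middle-aged Backbone Talents" under the Beijing higher-education plan for strengthening teaching through talent (PHR201108016), and the Youth Science Foundation of Beijing University of Technology.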

Abstract: To improve the performance of the naive Bayes classifier, a method is proposed that adds an adjustment value (a tail term) to the classifier's output in order to correct classification results that are severely distorted. The classification pattern is learned incrementally and adaptively: the quality of the classifier is judged from its historical outputs, and this determines the interval of output values in which an adjustment is needed. For outputs that fall in this interval, where classification is clearly distorted, the adjustment value is adaptively added as compensation to revise the predicted category. Comparative experiments on the Trec05, Trec06, Trec07, and CEAS08 datasets show that the proposed method significantly outperforms both the naive Bayes classifier and the bagging naive Bayes classifier in accuracy and Macro F1, while remaining simple and practical.
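The adjustment step summarized in the abstract can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything here is an assumption made for illustration and not the paper's actual algorithm: the class name `TailTermAdjuster`, the use of a signed naive Bayes score (positive meaning spam), the rule that the interval spans the misclassified historical scores, and the grid search that picks the tail term.

```python
from collections import deque


class TailTermAdjuster:
    """Hypothetical post-processor for a signed classifier score (> 0 means spam)."""

    def __init__(self, window=200, min_error_rate=0.05, steps=20):
        self.history = deque(maxlen=window)   # recent (raw_score, is_spam) pairs
        self.min_error_rate = min_error_rate  # below this, no adjustment is applied
        self.steps = steps                    # grid resolution for the tail search
        self.interval = None                  # (lo, hi) score range to adjust, or None
        self.tail = 0.0                       # value added to scores inside the interval

    def adjust(self, score):
        """Add the tail term only when the raw score lies in the learned interval."""
        if self.interval is not None:
            lo, hi = self.interval
            if lo <= score <= hi:
                return score + self.tail
        return score

    def predict(self, score):
        """Return True (spam) if the adjusted score is positive."""
        return self.adjust(score) > 0

    def observe(self, score, is_spam):
        """Incremental step: record a labelled raw score, then refit."""
        self.history.append((score, is_spam))
        self._refit()

    def _refit(self):
        # Raw scores that the unadjusted classifier got wrong.
        errors = [s for s, y in self.history if (s > 0) != y]
        if not errors or len(errors) / len(self.history) < self.min_error_rate:
            self.interval, self.tail = None, 0.0  # classifier is judged good enough
            return
        lo, hi = min(errors), max(errors)         # assumed rule: span of the errors
        inside = [(s, y) for s, y in self.history if lo <= s <= hi]
        span = 2 * max(abs(lo), abs(hi))          # large enough to flip any score inside

        def correct(tail):
            # Number of historical scores in the interval classified correctly.
            return sum(((s + tail) > 0) == y for s, y in inside)

        candidates = [span * k / self.steps for k in range(-self.steps, self.steps + 1)]
        # Most correct decisions first; ties go to the smallest correction.
        self.tail = max(candidates, key=lambda t: (correct(t), -abs(t)))
        self.interval = (lo, hi)
```

In use, each raw score from the base classifier would be passed to `predict`, and `observe` would be called once the true label becomes known. Restricting the correction to an interval leaves confident scores untouched, which matches the abstract's description of adjusting only the clearly distorted region.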

Key words: text classification, naive Bayes, spam filtering, adaptive adjustment