Journal of University of Science and Technology of China ›› 2015, Vol. 45 ›› Issue (4): 314-320. DOI: 10.3969/j.issn.0253-2778.2015.04.009

• Original Article •

A latent semantic analysis classification technique based on optimized categorization information

JI Duo (季铎), BI Chen (毕臣), CAI Dongfeng (蔡东风)

  1. Cyber Crime Investigation Department, National Police University of China, Shenyang 110854, China; 2. Knowledge Engineering Research Center, Shenyang Aerospace University, Shenyang 110136, China
  • Received: 2014-03-21 Revised: 2014-11-04 Accepted: 2014-11-04 Online: 2014-11-04 Published: 2014-11-04
  • Corresponding author: JI Duo
  • About the author: JI Duo (corresponding author), male, born in 1981, PhD, associate professor. Research interest: data mining. E-mail: jiduo_1@163.com
  • Funding:
    Supported by the Natural Science Foundation of the Education Department of Liaoning Province (L201120302).

Abstract: Latent semantic analysis (LSA), a well-established dimensionality reduction technique, is widely used in statistical text learning tasks such as keyword retrieval and text categorization. Focusing on the classification of professional literature, and taking patent documents as an example, this paper analyzes the characteristics of texts from the same and from different categories under a strict classification system and proposes an LSA classification technique optimized with category information. Using the feature information of each category, the technique splits every original document into several pseudo-documents, which raises the occurrence frequency of the features exclusive to individual categories. The latent semantic space is then built from this optimized input, improving the performance of the classification model. Experimental results show that the method effectively improves classification accuracy when applied to patent text classification.

Key words: Latent Semantic Indexing, Term Co-occurrence, Text Categorization
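
The pseudo-document idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how documents are split or how features are scored, so the code below assumes that a feature is "exclusive" when it occurs in exactly one category, and that each document contributes one extra pseudo-document holding only its exclusive features. The function names, the toy corpus, and the labels `F02` and `H04` are hypothetical, and nearest-centroid cosine matching stands in for whatever classifier the paper actually uses.

```python
from collections import Counter, defaultdict

import numpy as np


def exclusive_terms(docs, labels):
    """Map each category to the terms that occur in that category only (assumed rule)."""
    cats_of_term = defaultdict(set)
    for tokens, label in zip(docs, labels):
        for term in tokens:
            cats_of_term[term].add(label)
    exclusive = defaultdict(set)
    for term, cats in cats_of_term.items():
        if len(cats) == 1:
            exclusive[next(iter(cats))].add(term)
    return exclusive


def add_pseudo_docs(docs, labels):
    """Append one pseudo-document per document, holding only its category-exclusive terms."""
    exclusive = exclusive_terms(docs, labels)
    augmented = list(docs)
    for tokens, label in zip(docs, labels):
        pseudo = [t for t in tokens if t in exclusive[label]]
        if pseudo:
            augmented.append(pseudo)
    return augmented


def term_doc_matrix(docs, vocab):
    """Raw term-frequency matrix (terms x documents); out-of-vocabulary terms are ignored."""
    index = {t: i for i, t in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, tokens in enumerate(docs):
        for term, count in Counter(tokens).items():
            if term in index:
                matrix[index[term], j] = count
    return matrix


def fit_lsa(docs, k):
    """Return the vocabulary and the top-k left singular vectors (term basis)."""
    vocab = sorted({t for d in docs for t in d})
    u, _, _ = np.linalg.svd(term_doc_matrix(docs, vocab), full_matrices=False)
    return vocab, u[:, :k]


def project(docs, vocab, basis):
    """Project documents into the k-dimensional latent space (one row per document)."""
    return term_doc_matrix(docs, vocab).T @ basis


def nearest_centroid(train_vecs, labels, query_vec):
    """Pick the category whose mean latent vector has the highest cosine similarity."""
    best_label, best_sim = None, -np.inf
    for label in sorted(set(labels)):
        centroid = np.mean([v for v, l in zip(train_vecs, labels) if l == label], axis=0)
        denom = np.linalg.norm(centroid) * np.linalg.norm(query_vec) + 1e-12
        sim = float(centroid @ query_vec) / denom
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label


# Toy corpus: "device" is shared by both categories, all other terms are exclusive.
docs = [
    ["device", "engine", "piston", "fuel"],
    ["device", "engine", "valve", "fuel"],
    ["device", "circuit", "diode", "signal"],
    ["device", "circuit", "antenna", "signal"],
]
labels = ["F02", "F02", "H04", "H04"]

augmented = add_pseudo_docs(docs, labels)   # originals plus pseudo-documents
vocab, basis = fit_lsa(augmented, k=2)      # latent space built from the augmented corpus
train_vecs = project(docs, vocab, basis)    # only the original documents carry labels
query = project([["device", "engine", "fuel"]], vocab, basis)[0]
predicted = nearest_centroid(train_vecs, labels, query)
print(predicted)
```

Because the pseudo-documents repeat only category-exclusive terms, those terms carry more weight in the matrix passed to the SVD than the shared term `device`, which is the effect the abstract attributes to the method. On a real patent corpus one would likely replace raw counts with a weighted scheme and a softer notion of exclusivity, both of which are left open here.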

CLC number: