Journal of University of Science and Technology of China ›› 2016, Vol. 46 ›› Issue (10): 874-882. DOI: 10.3969/j.issn.0253-2778.2016.10.012

• Original Paper •

Sentiment analysis based on grid clustering

MIAO Yuqing

  1. School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China; 2. Guangxi Key Laboratory of Trusted Software, Guilin 541004, China
  • Received: 2016-03-09 Revised: 2016-09-16 Online: 2016-10-31 Published: 2016-10-31
  • Corresponding author: MIAO Yuqing
  • About the author: MIAO Yuqing (corresponding author), female, born in 1966, PhD, associate professor. Research interests: data mining, cloud computing, parallel and distributed computing.
  • Supported by:
    the Guangxi Natural Science Foundation (2014GXNSFAA118395), the National Natural Science Foundation of China (61363029), and the Graduate Education Innovation Program of Guilin University of Electronic Technology (GDYCSZ201466).

Abstract: Traditional Chinese sentiment analysis methods, whether based on semantic lexicons or on machine learning, are strongly affected by subjective human judgment and depend to some extent on manually built lexicons that are hard to extend. To expand the lexicon, pointwise mutual information (PMI) and parameter thresholds were used to automatically identify, extract and classify words that are not included in the HowNet sentiment lexicon but carry a certain sentiment orientation. On that basis, a feature vector model of commodity reviews was established, and the SCG (sentiment classification based on grid clustering) algorithm was proposed, which builds its classification model with a grid-based clustering algorithm. A dynamic attenuation factor was introduced into the grid clustering process and sparse grids were removed periodically, which reduces the amount of computation. Experimental results indicate that SCG achieves higher accuracy and better domain adaptability than classification algorithms such as Naive Bayes and SMO (sequential minimal optimization).
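The lexicon expansion step can be illustrated with a minimal sketch. The abstract does not give the exact scoring formula, so the code below assumes the widely used SO-PMI formulation (PMI with positive seed words minus PMI with negative seed words) computed from document-level co-occurrence. The function names `so_pmi` and `classify`, the seed lists, and the threshold value are all hypothetical and chosen only for illustration.

```python
import math


def so_pmi(docs, candidate, pos_seeds, neg_seeds):
    """Semantic orientation of `candidate` from document-level co-occurrence.

    Assumption: SO-PMI = sum of PMI with positive seeds minus sum of PMI
    with negative seeds. This is not necessarily the paper's exact formula.
    """
    n = len(docs)
    doc_sets = [set(d) for d in docs]

    def df(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    def pmi(a, b):
        joint = df(a, b)
        if joint == 0:
            # Assumption: pairs that never co-occur contribute no evidence.
            return 0.0
        return math.log2(joint * n / (df(a) * df(b)))

    positive = sum(pmi(candidate, s) for s in pos_seeds)
    negative = sum(pmi(candidate, s) for s in neg_seeds)
    return positive - negative


def classify(score, threshold=0.5):
    """Map a score to a label; the threshold is an illustrative value."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


# Toy corpus of tokenized reviews (made-up data for illustration only).
docs = [
    ["cheap", "good"],
    ["cheap", "good"],
    ["flimsy", "bad"],
    ["flimsy", "bad"],
]
label = classify(so_pmi(docs, "cheap", ["good"], ["bad"]))
```

Words whose score falls between the two thresholds stay out of the lexicon, which is one plausible reading of the threshold setting mentioned in the abstract.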

Key words: sentiment analysis, grid, clustering, pointwise mutual information (PMI), classification
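The attenuation and pruning idea from the abstract can be sketched as follows. This is not the SCG algorithm itself, whose details the abstract does not give: it only shows how a grid cell's density might decay over time and how sparse cells might be dropped periodically. The class `DecayingGrid`, the constant decay factor (the paper describes a dynamic one), and all parameter values are assumptions made for illustration.

```python
class DecayingGrid:
    """Density grid with exponential decay and periodic sparse-cell removal.

    Assumption: each new point adds 1 to its cell's density, and a cell's
    density is multiplied by `decay` for every time step since its update.
    """

    def __init__(self, cell_size, decay=0.9, min_density=0.5, prune_every=100):
        self.cell_size = cell_size
        self.decay = decay
        self.min_density = min_density
        self.prune_every = prune_every
        self.cells = {}  # cell index -> (density, time of last update)
        self.t = 0

    def _cell(self, point):
        # Map a feature vector to the index of the grid cell containing it.
        return tuple(int(x // self.cell_size) for x in point)

    def _decayed(self, key):
        # Density of a cell as seen at the current time step.
        density, last = self.cells[key]
        return density * self.decay ** (self.t - last)

    def add(self, point):
        self.t += 1
        key = self._cell(point)
        current = self._decayed(key) if key in self.cells else 0.0
        self.cells[key] = (current + 1.0, self.t)
        if self.t % self.prune_every == 0:
            self.prune()

    def prune(self):
        # Drop cells whose decayed density fell below the sparse threshold.
        self.cells = {
            k: v for k, v in self.cells.items()
            if self._decayed(k) >= self.min_density
        }


# Toy usage: one stale point and three recent points in another cell.
grid = DecayingGrid(cell_size=1.0, decay=0.5, min_density=0.5, prune_every=4)
for p in [(0.1, 0.1), (5.2, 5.2), (5.3, 5.1), (5.4, 5.4)]:
    grid.add(p)
```

Storing the last update time with each cell means decay is applied lazily, only when a cell is touched or pruned, so no pass over all cells is needed on every insertion.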