Journal of University of Science and Technology of China ›› 2019, Vol. 49 ›› Issue (1): 8-14. DOI: 10.3969/j.issn.0253-2778.2019.01.002

• Original Paper •

A multi-domain sentiment classification model based on sample filtering and transfer learning

QU Zhaowei

  1. School of Computer Science and Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Received: 2018-05-29 Revised: 2018-09-18 Online: 2019-01-31 Published: 2019-01-31
  • Corresponding author: ZHAO Yanjiao (赵燕娇)
  • About the author: QU Zhaowei, male, born in 1970, PhD, professor. Research interests: artificial intelligence, data mining, and computer network technology. E-mail: zwqu@bupt.edu.cn
  • Funding:
    Supported by the National Natural Science Foundation of China (61672108).

Abstract: Most sentiment classification models are trained and tested on a single dataset, so the parameters learned on one dataset do not transfer well to another and the models lack generality. To address this, a multi-domain sentiment classification model (MDSC) is proposed. By combining sample filtering with transfer learning, MDSC learns parameters that apply to different datasets across multiple domains, which makes the model more applicable and extensible. Specifically, a document is first mapped to a distributed representation of its domain, which serves as a bridge between domain classification and sentiment classification, and sentiment classification is then performed. To make the model more general, representative training samples must be selected: MDSC builds a domain-independent sentiment lexicon and uses it to filter the sentences within each document, yielding a high-quality training set. In addition, to improve classification accuracy and reduce training time, a parameter-based transfer learning method is used, in which a neural network produces document embeddings that are then classified. Experiments on datasets covering 15 different domains show that the proposed model performs better than the other sentiment classification models it was compared with when applied to multi-domain data.

Key words: sentiment classification, sample filtering, transfer learning, sentiment lexicon, neural network
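
The sample-filtering step mentioned in the abstract (using a sentiment lexicon to filter the sentences of a document) might look roughly like the sketch below. This is only an illustration under stated assumptions, not the authors' implementation: the toy lexicon, the naive sentence splitter, the "keep a sentence if it contains at least one lexicon word" rule, and the names `SENTIMENT_LEXICON`, `split_sentences`, and `filter_sentences` are all hypothetical, since the abstract does not specify how the lexicon is built or how it is applied.

```python
import re

# Hypothetical toy lexicon. The paper's domain-independent lexicon
# is not given in the abstract, so these entries are placeholders.
SENTIMENT_LEXICON = {"good", "great", "excellent", "bad", "poor", "terrible"}


def split_sentences(document):
    """Split text after '.', '!' or '?' followed by whitespace (naive splitter)."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def filter_sentences(document, lexicon=SENTIMENT_LEXICON):
    """Keep only the sentences that contain at least one lexicon word.

    Assumed filtering rule for illustration; the actual criterion
    used by MDSC may differ.
    """
    kept = []
    for sentence in split_sentences(document):
        tokens = set(re.findall(r"[a-z']+", sentence.lower()))
        if tokens & lexicon:
            kept.append(sentence)
    return kept


if __name__ == "__main__":
    review = "The battery is great. It arrived on Tuesday. The screen is terrible!"
    print(filter_sentences(review))
```

In this sketch, the sentence about the delivery date carries no lexicon word and is dropped, while the two opinion-bearing sentences are kept as training material.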