Journal of University of Science and Technology of China ›› 2019, Vol. 49 ›› Issue (2): 112-118. DOI: 10.3969/j.issn.0253-2778.2019.02.005

• Original Paper •

Research on product reviews hot spot discovery algorithm based on MapReduce

SU Hao

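  1. School of Computer and Control Engineering, Yantai University, Yantai, Shandong 264005, China
  • Received: 2018-06-15 Revised: 2018-09-18 Online: 2019-02-28 Published: 2019-02-28
  • Corresponding author: LIU Qicheng
  • About the author: SU Hao, male, born in 1995, master's student. Research interests: parallel computing and data mining. E-mail: 1406205897@qq.com
  • Funding:
    Supported by the Natural Science Foundation of Shandong Province (ZR2016FM42); the Key Research and Development Program of Shandong Province (2016GGX109004); the State Oceanic Administration "13th Five-Year Plan" Marine Economy Innovation and Development Demonstration Key Project (YHC-ZB-P201701); and the National Natural Science Foundation of China (61702439).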

Research on product reviews hot spot discovery algorithm based on MapReduce

SU Hao   

  1. School of Computer and Control Engineering, Yantai University, Yantai 264005, China
  • Received: 2018-06-15 Revised: 2018-09-18 Online: 2019-02-28 Published: 2019-02-28

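Abstract: A parallel algorithm for discovering hot spots in product reviews based on the MapReduce framework, the PR-HD algorithm, is proposed. PR-HD uses crawler technology to extract the reviews of a popular mobile phone from an e-commerce platform and build a review data set. It calculates the weights of feature words with the TF-IDF algorithm, adds a position weight to each feature word to obtain its final weight, builds a vector space model (VSM) to compute the similarity between review sentences, and combines the Canopy algorithm with the K-means algorithm to discover hot spots in product reviews. This allows product developers to obtain more direct and effective suggestions and feedback.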

Key words: review hot spot discovery, MapReduce, Canopy algorithm, K-means algorithm
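
The weighting and similarity steps named in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the PR-HD implementation: the abstract does not give the position-weight formula, so the sketch assumes a hypothetical rule that multiplies the weight of any term appearing in the first `lead_len` tokens of a review by `position_boost`. The smoothed IDF variant and the names `weighted_tfidf` and `cosine_similarity` are also assumptions, and reviews are assumed to be tokenized already.

```python
import math
from collections import Counter


def weighted_tfidf(docs, lead_len=3, position_boost=1.5):
    """Return one sparse {term: weight} vector per tokenized review.

    ASSUMPTION: the position weight is a flat multiplier for terms that
    occur among the first `lead_len` tokens; the paper's rule may differ.
    """
    n_docs = len(docs)
    # Document frequency: in how many reviews each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        lead_terms = set(doc[:lead_len])
        vec = {}
        for term, count in counts.items():
            tf = count / len(doc)
            # Smoothed IDF (an assumed variant) keeps the weight positive.
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            boost = position_boost if term in lead_terms else 1.0
            vec[term] = tf * idf * boost
        vectors.append(vec)
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors in the VSM."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    reviews = [
        ["battery", "life", "is", "good"],
        ["battery", "drains", "too", "fast"],
        ["screen", "looks", "bright"],
    ]
    vecs = weighted_tfidf(reviews)
    print(cosine_similarity(vecs[0], vecs[1]))
    print(cosine_similarity(vecs[0], vecs[2]))
```

Reviews that share feature words get a nonzero similarity, while reviews with no terms in common score zero, which is the property the clustering stage relies on.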

Abstract: A parallel algorithm based on the MapReduce framework for finding hot spots in commodity reviews (the PR-HD algorithm) is proposed. The PR-HD algorithm uses crawler technology to extract the review data of a popular mobile phone from an e-commerce platform and generate a review data set. The weights of the feature words are calculated by the TF-IDF algorithm, and the final weights are obtained by adding position weights to the feature words. A vector space model (VSM) is established to calculate the similarity of different review sentences, and the Canopy algorithm and the K-means algorithm are combined to realize hot spot discovery from commodity reviews. This allows product developers to obtain more direct and effective suggestions and feedback.

Key words: review hot spot discovery, MapReduce, Canopy algorithm, K-means algorithm
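
One common way to combine Canopy with K-means is to let a cheap Canopy pass pick the initial centers (and therefore the number of clusters), then refine them with K-means iterations that split naturally into a map step and a reduce step. The following single-process Python sketch shows that structure under stated assumptions; it is not the PR-HD code. The function names, the Euclidean distance on dense vectors, and the threshold `t2` are illustrative, and the looser Canopy threshold T1 (canopy membership) is omitted because only the centers are needed as seeds here.

```python
import math
from collections import defaultdict


def dist(a, b):
    """Euclidean distance between two dense vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_centers(points, t2):
    """Pick seed centers: no two seeds lie within distance t2 of each other."""
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining.pop(0)
        centers.append(center)
        # Points tightly bound to this canopy cannot seed another one.
        remaining = [p for p in remaining if dist(p, center) > t2]
    return centers


def kmeans_map(point, centers):
    """Map step: emit (index of the nearest center, point)."""
    idx = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return idx, point


def kmeans_reduce(assigned):
    """Reduce step: the new center is the mean of the assigned points."""
    n = len(assigned)
    return [sum(coords) / n for coords in zip(*assigned)]


def kmeans(points, centers, max_iter=20, tol=1e-6):
    """Iterate map and reduce until the centers stop moving."""
    for _ in range(max_iter):
        groups = defaultdict(list)
        for p in points:  # map, then group by key (the shuffle phase)
            idx, value = kmeans_map(p, centers)
            groups[idx].append(value)
        new_centers = [
            kmeans_reduce(groups[i]) if i in groups else centers[i]
            for i in range(len(centers))
        ]
        shift = max(dist(old, new) for old, new in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers


if __name__ == "__main__":
    data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
            [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
    seeds = canopy_centers(data, t2=3.0)
    print(kmeans(data, seeds))
```

In a real MapReduce job the map and reduce functions would run on separate workers and each K-means iteration would be one job. Under this reading, the clusters that attract the most reviews would be the candidate hot spots.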