Journal of University of Science and Technology of China ›› 2020, Vol. 50 ›› Issue (1): 29-38. DOI: 10.3969/j.issn.0253-2778.2020.01.004

• Research Article •

Service identification of WeChat traffic based on fuzziness and semi-supervised self-paced co-training

LIU Weikang

  1. CAS Key Laboratory of Wireless-Optical Communications, University of Science and Technology of China, Hefei 230026, China
  • Received: 2019-03-28 Revised: 2019-07-17 Online: 2020-01-31 Published: 2020-01-31
  • Corresponding author: QIN Xiaowei
  • About the author: LIU Weikang, male, born in 1994, master's student. Research interests: user behavior perception, machine learning. E-mail: 1210301638@qq.com
  • Supported by:
    the National Key Research and Development Program of China (2018YFA0701603).

Abstract: Accurate service identification of network data streams is a prerequisite for differentiated services. Supervised learning, the usual approach, is hard to deploy because building the training set requires extensive manual labeling, so semi-supervised learning from a small amount of labeled data has become an active research topic. The self-paced co-training semi-supervised framework processes unlabeled data in an easy-to-hard order using multiple cooperating views. However, it assigns pseudo labels to samples with confidence as the only selection criterion, which tends to erode the difference between views as training proceeds, reducing the co-training gain and limiting model performance. To address this for WeChat data stream identification, a fuzziness based self-paced co-training model (FBSpaCo) is proposed, which adds a fuzziness evaluation mechanism to pseudo-label assignment. Experiments show that, while preserving confidence, the model effectively prevents the difference between the two views from declining during training and improves identification accuracy considerably over existing methods.

Key words: network data identification, semi-supervised learning, self-paced co-training, fuzziness
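
The abstract does not give the fuzziness formula or the exact selection rule used by FBSpaCo, so the following is only a minimal sketch of the general idea of screening pseudo-label candidates by both confidence and fuzziness. It assumes the De Luca-Termini fuzziness measure, which is common in fuzziness-based learning, and a simple "confident and not too fuzzy" rule. The function names, the thresholds, and the example probabilities are all hypothetical and not taken from the paper.

```python
import math


def fuzziness(memberships, eps=1e-12):
    """De Luca-Termini fuzziness of one class-membership vector.

    Assumed definition (not confirmed by the abstract):
    F = -(1/n) * sum(mu*ln(mu) + (1-mu)*ln(1-mu)).
    """
    total = 0.0
    for mu in memberships:
        mu = min(max(mu, eps), 1.0 - eps)  # keep log() away from 0
        total += mu * math.log(mu) + (1.0 - mu) * math.log(1.0 - mu)
    return -total / len(memberships)


def select_pseudo_labels(prob_rows, min_confidence=0.8, max_fuzziness=0.3):
    """Return (sample_index, predicted_class) pairs that pass both tests.

    Hypothetical rule: keep a sample only if its top class probability is
    high enough AND its fuzziness is low enough. The paper's real rule
    may differ.
    """
    selected = []
    for index, probs in enumerate(prob_rows):
        probs = list(probs)
        confidence = max(probs)
        if confidence >= min_confidence and fuzziness(probs) <= max_fuzziness:
            selected.append((index, probs.index(confidence)))
    return selected


# Made-up class probabilities from one view's classifier (illustrative only).
view_probs = [
    [0.95, 0.03, 0.02],  # confident, low fuzziness
    [0.82, 0.17, 0.01],  # confident, but fuzzier
    [0.40, 0.35, 0.25],  # not confident
]
picked = select_pseudo_labels(view_probs)
print(picked)
```

With these made-up numbers, a confidence-only filter at 0.8 would accept the first two samples, while the combined filter accepts only the first. This shows how a second criterion can change which samples one view hands to the other.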