Journal of University of Science and Technology of China ›› 2021, Vol. 51 ›› Issue (10): 753-765. DOI: 10.52396/JUST-2021-0061

• Information Science •


Relation aware network for weakly-supervised temporal action localization

ZHAN Yongkun, YANG Wenfei, ZHANG Tianzhu*   

  1. Laboratory for Future Networks, University of Science and Technology of China, Hefei 230027, China
  • Received:2021-03-02 Revised:2021-04-28 Online:2021-10-31 Published:2022-01-11
  • Contact: *E-mail: tzzhang@ustc.edu.cn


Abstract: Temporal action localization has become an important and challenging research direction due to its wide range of practical applications. Since fully supervised localization requires extensive manual effort to obtain frame-level or segment-level annotations on untrimmed long videos, weakly supervised methods have received increasing attention in recent years. Weakly-supervised Temporal Action Localization (WS-TAL) aims to predict the temporal boundaries of actions with only video-level labels provided in the training phase. However, existing methods often impose classification loss constraints only on independent video segments, ignoring the relations within and between these segments. In this paper, we propose a novel framework called the Relation Aware Network (RANet), which models segment relations both within a video and across videos. Specifically, the Intra-video Relation Module is designed to generate more complete action predictions, while the Inter-video Relation Module is designed to separate actions from their highly correlated backgrounds. Through this design, our model learns more robust visual feature representations for action localization. Extensive experiments on three public benchmarks, THUMOS14 and ActivityNet 1.2/1.3, demonstrate that our proposed method outperforms state-of-the-art approaches.

Key words: temporal action localization, weakly-supervised learning, relation modeling
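As context for the abstract, weakly-supervised TAL methods typically build on an attention-based multiple-instance pipeline: per-segment class scores are pooled into a single video-level prediction (trainable with only video-level labels), and actions are localized by thresholding segment scores at test time. The relation modules proposed in this paper are not detailed here, so the sketch below shows only this common backbone; all function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def video_level_scores(features, W_cls, w_att):
    """Attention-weighted multiple-instance pooling: per-segment class
    logits (the class activation sequence, CAS) are aggregated into one
    video-level prediction, so only video-level labels are needed for
    training.  features: (T, D), W_cls: (D, C), w_att: (D,)."""
    cas = features @ W_cls                            # (T, C) segment logits
    att = 1.0 / (1.0 + np.exp(-(features @ w_att)))   # (T,) foreground attention
    video_logits = (att[:, None] * cas).sum(axis=0) / (att.sum() + 1e-8)
    return softmax(video_logits), cas, att

def localize(att, thresh=0.5):
    """Group consecutive segments whose foreground attention exceeds a
    threshold into action proposals, as half-open [start, end) indices."""
    proposals, start = [], None
    for t, fg in enumerate(att > thresh):
        if fg and start is None:
            start = t
        elif not fg and start is not None:
            proposals.append((start, t))
            start = None
    if start is not None:
        proposals.append((start, len(att)))
    return proposals

# Toy example: 6 segments, 4-d features, 3 action classes.
rng = np.random.default_rng(0)
scores, cas, att = video_level_scores(
    rng.normal(size=(6, 4)), rng.normal(size=(4, 3)), rng.normal(size=4))
print(scores.shape, cas.shape)                              # (3,) (6, 3)
print(localize(np.array([0.1, 0.9, 0.8, 0.2, 0.7, 0.6])))   # [(1, 3), (4, 6)]
```

In this baseline the segments are scored independently, which is exactly the limitation the abstract points out; the paper's intra- and inter-video relation modules would refine `features` before this scoring step.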
