Journal of University of Science and Technology of China ›› 2021, Vol. 51 ›› Issue (10): 753-765. DOI: 10.52396/JUST-2021-0061

• Information Science •


Relation aware network for weakly-supervised temporal action localization

ZHAN Yongkun, YANG Wenfei, ZHANG Tianzhu*   

  1. Laboratory for Future Networks, University of Science and Technology of China, Hefei 230027, China
  • Received:2021-03-02 Revised:2021-04-28 Online:2021-10-31 Published:2022-01-11
  • Contact: *E-mail: tzzhang@ustc.edu.cn


Abstract: Temporal action localization has become an important and challenging research direction due to its wide range of practical applications. Since fully supervised localization requires extensive manual effort to obtain frame-level or segment-level annotations on untrimmed long videos, weakly supervised methods have received increasing attention in recent years. Weakly-supervised Temporal Action Localization (WS-TAL) aims to predict the temporal boundaries of actions with only video-level labels provided in the training phase. However, existing methods often impose classification loss constraints only on independent video segments, ignoring the relations within and between these segments. In this paper, we propose a novel framework called the Relation Aware Network (RANet), which models segment relations both within a video and across videos. Specifically, the Intra-video Relation Module is designed to generate more complete action predictions, while the Inter-video Relation Module is designed to separate actions from their highly correlated backgrounds. Through this design, our model learns more robust visual feature representations for action localization. Extensive experiments on three public benchmarks, THUMOS14 and ActivityNet 1.2/1.3, demonstrate that our proposed method outperforms state-of-the-art approaches.

Key words: temporal action localization, weakly-supervised learning, relation modeling
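As context for the abstract, weakly-supervised TAL methods typically build on an attention-based multiple-instance pipeline: per-segment class scores are pooled into a single video-level prediction (trainable with only video-level labels), and actions are localized by thresholding segment scores at test time. The relation modules proposed in this paper are not detailed here, so the sketch below shows only this common backbone; all function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def video_level_scores(features, W_cls, w_att):
    """Attention-weighted multiple-instance pooling: per-segment class
    logits (the class activation sequence, CAS) are aggregated into one
    video-level prediction, so only video-level labels are needed for
    training.  features: (T, D), W_cls: (D, C), w_att: (D,)."""
    cas = features @ W_cls                            # (T, C) segment logits
    att = 1.0 / (1.0 + np.exp(-(features @ w_att)))   # (T,) foreground attention
    video_logits = (att[:, None] * cas).sum(axis=0) / (att.sum() + 1e-8)
    return softmax(video_logits), cas, att

def localize(att, thresh=0.5):
    """Group consecutive segments whose foreground attention exceeds a
    threshold into action proposals, as half-open [start, end) indices."""
    proposals, start = [], None
    for t, fg in enumerate(att > thresh):
        if fg and start is None:
            start = t
        elif not fg and start is not None:
            proposals.append((start, t))
            start = None
    if start is not None:
        proposals.append((start, len(att)))
    return proposals

# Toy example: 6 segments, 4-d features, 3 action classes.
rng = np.random.default_rng(0)
scores, cas, att = video_level_scores(
    rng.normal(size=(6, 4)), rng.normal(size=(4, 3)), rng.normal(size=4))
print(scores.shape, cas.shape)                              # (3,) (6, 3)
print(localize(np.array([0.1, 0.9, 0.8, 0.2, 0.7, 0.6])))   # [(1, 3), (4, 6)]
```

In this baseline the segments are scored independently, which is exactly the limitation the abstract points out; the paper's intra- and inter-video relation modules would refine `features` before this scoring step.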
