Relation aware network for weakly-supervised temporal action localization

doi:10.52396/JUST-2021-0061

Abstract

Temporal action localization has become an important and challenging research direction because of its wide range of applications. Fully supervised localization requires costly manual annotation at the frame or segment level on long untrimmed videos, so weakly supervised methods have received increasing attention in recent years. Weakly-supervised Temporal Action Localization (WS-TAL) aims to predict the temporal boundaries of actions with only video-level labels provided during training. However, existing methods often apply classification loss constraints only to independent video segments and ignore the relations within and between these segments. In this paper, we propose a novel framework called Relation Aware Network (RANet), which models both intra-video and inter-video segment relations. Specifically, the Intra-video Relation Module is designed to generate more complete action predictions, while the Inter-video Relation Module is designed to separate actions from the background. Through this design, our model can learn more robust visual feature representations for action localization. Extensive experiments on three public benchmarks, THUMOS14 and ActivityNet 1.2/1.3, demonstrate the impressive performance of the proposed method compared with state-of-the-art methods.
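To make the WS-TAL setting concrete, the following is a minimal sketch of the generic pipeline that weakly supervised methods commonly build on. It is not the RANet architecture. Per-segment scores for one action class are pooled into a single video-level score that can be trained against the video-level label, and at test time the same segment scores are thresholded to obtain temporal boundaries. The function names, the top-k pooling choice, the threshold value, and the example scores are all assumptions made for illustration.

```python
from typing import List, Tuple


def video_level_score(segment_scores: List[float], k: int) -> float:
    """Pool segment scores into one video-level score (mean of the top-k).

    Assumption: top-k mean pooling stands in for whatever aggregation a
    particular method uses; only this pooled score sees the video label.
    """
    if not segment_scores:
        raise ValueError("segment_scores must not be empty")
    k = max(1, min(k, len(segment_scores)))
    top_k = sorted(segment_scores, reverse=True)[:k]
    return sum(top_k) / k


def localize(segment_scores: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Group consecutive segments scoring >= threshold into [start, end) intervals."""
    intervals: List[Tuple[int, int]] = []
    start = None
    for i, score in enumerate(segment_scores):
        if score >= threshold and start is None:
            start = i  # an action interval opens here
        elif score < threshold and start is not None:
            intervals.append((start, i))  # interval closes before segment i
            start = None
    if start is not None:
        intervals.append((start, len(segment_scores)))
    return intervals


# Hypothetical per-segment scores for a single action class.
scores = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.6, 0.9, 0.2]
print(video_level_score(scores, k=3))
print(localize(scores, threshold=0.5))  # [(2, 5), (6, 8)]
```

Because supervision reaches the segments only through the pooled score, each segment is effectively scored independently, which is the limitation that relation modeling within and between videos is meant to address.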

Key words: temporal action localization, weakly-supervised learning, relation modeling

CLC Number: TP391.8

ZHAN Yongkun, YANG Wenfei, ZHANG Tianzhu. Relation aware network for weakly-supervised temporal action localization[J]. Journal of University of Science and Technology of China, 2021, 51(10): 753-765.
