Diffusion-based framework for weakly-supervised temporal action localization
Research article
Authors: Yuanbing Zou, Qingjie Zhao, Prodip Kumar Sarker, Shanshan Li, Lei Wang, Wangwang Liu
Pattern Recognition, Volume 160, Issue C
Published: 11 February 2025
Abstract
Weakly supervised temporal action localization aims to localize action instances with only video-level supervision. In the absence of frame-level annotations, effectively separating action snippets from the background within semantically ambiguous features becomes an arduous challenge for this task. To address this issue from a generative modeling perspective, we propose a novel two-stage diffusion-based network. In the first stage, we design a local masking module that learns local semantic information and generates binary masks, which (1) are used to perform action-background separation and (2) serve as the pseudo ground truth required by the diffusion module. In the second stage, we propose a diffusion module that generates high-quality action predictions under pseudo-ground-truth supervision. In addition, we further optimize the new-refining operation in the local masking module to improve operational efficiency. Experimental results demonstrate that the proposed method achieves promising performance on the publicly available mainstream datasets THUMOS14 and ActivityNet. The code is available at https://github.com/Rlab123/action_diff.
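The abstract outlines a two-stage design: a local masking module first produces snippet-level binary masks that both separate actions from background and act as pseudo ground truth, and a diffusion module then generates action predictions under that supervision. The sketch below is a minimal, hypothetical PyTorch rendering of how such a pipeline could be wired together; the module names, layer shapes, and the 0.5 binarization threshold are illustrative assumptions rather than the authors' implementation (see the linked repository for the actual code).

```python
# Hypothetical sketch of the two-stage pipeline described in the abstract.
# Module names, dimensions, and the 0.5 masking threshold are assumptions,
# not the authors' implementation (see https://github.com/Rlab123/action_diff).
import torch
import torch.nn as nn


class LocalMaskingModule(nn.Module):
    """Stage 1: predict snippet-level foreground scores from video features."""

    def __init__(self, feat_dim=2048, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1),  # local temporal context
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=1),
        )

    def forward(self, feats):                                      # feats: (B, T, D)
        scores = self.net(feats.transpose(1, 2)).transpose(1, 2)   # (B, T, 1)
        scores = torch.sigmoid(scores.squeeze(-1))                 # (B, T) foreground scores
        pseudo_gt = (scores > 0.5).float()                         # binary mask = pseudo ground truth
        return scores, pseudo_gt


class DiffusionModule(nn.Module):
    """Stage 2: denoise a noisy mask sequence conditioned on video features."""

    def __init__(self, feat_dim=2048, hidden=512):
        super().__init__()
        self.denoiser = nn.Sequential(
            nn.Conv1d(feat_dim + 1, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=1),
        )

    def forward(self, noisy_mask, feats):                          # noisy_mask: (B, T), feats: (B, T, D)
        x = torch.cat([feats, noisy_mask.unsqueeze(-1)], dim=-1)   # condition on video features
        out = self.denoiser(x.transpose(1, 2)).transpose(1, 2)
        return out.squeeze(-1)                                     # predicted clean mask (B, T)


if __name__ == "__main__":
    feats = torch.randn(2, 100, 2048)          # 2 videos, 100 snippets, I3D-like features
    scores, pseudo_gt = LocalMaskingModule()(feats)
    noisy = pseudo_gt + 0.3 * torch.randn_like(pseudo_gt)
    pred_mask = DiffusionModule()(noisy, feats)
    print(scores.shape, pseudo_gt.shape, pred_mask.shape)
```

In this reading, the stage-one threshold turns continuous foreground scores into hard pseudo labels, which is what allows the diffusion module to be trained with a reconstruction-style objective despite the absence of frame-level annotations.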
Highlights
• We propose a diffusion-based framework for weakly-supervised temporal action localization.
• We leverage a local masking module to separate local action instances from backgrounds.
• We propose a new-refining strategy to improve the efficiency of model operation.
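The highlights emphasize that the diffusion module is supervised by the stage-one pseudo masks. Assuming a standard DDPM-style formulation (Ho et al., 2020) in which the pseudo-ground-truth mask is the signal being diffused and the video features are the conditioning input, a single training step might look like the sketch below; the linear noise schedule and the x0-prediction objective are assumptions, not details taken from the paper.

```python
# Hypothetical DDPM-style training step for the diffusion module, assuming the
# snippet-level pseudo-ground-truth mask is the signal being diffused and the
# video features are the conditioning input; schedule values are illustrative.
import torch
import torch.nn.functional as F

T_STEPS = 1000
betas = torch.linspace(1e-4, 0.02, T_STEPS)            # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)     # cumulative product \bar{alpha}_t


def diffusion_training_step(denoiser, feats, pseudo_gt):
    """One training step: noise the pseudo-GT mask, predict it back, regress."""
    b = pseudo_gt.shape[0]
    t = torch.randint(0, T_STEPS, (b,))                 # random timestep per video
    a_bar = alphas_cumprod[t].view(b, 1)                # (B, 1), broadcasts over snippets

    noise = torch.randn_like(pseudo_gt)
    noisy_mask = a_bar.sqrt() * pseudo_gt + (1.0 - a_bar).sqrt() * noise  # forward (noising) process

    pred_mask = denoiser(noisy_mask, feats)             # reverse step: predict the clean mask
    return F.mse_loss(pred_mask, pseudo_gt)             # x0-prediction regression loss
```

Paired with the DiffusionModule sketch above, this loss teaches the denoiser to recover clean snippet masks from noisy ones, so that at inference time action predictions can be generated by iteratively denoising from random noise conditioned on the video features.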
Published In
Pattern Recognition, Volume 160, Issue C (April 2025), 437 pages
Copyright © 2024.
Publisher
Elsevier Science Inc.
United States
Publication History
Published: 11 February 2025
Author Tags
- Temporal action localization
- Weakly-supervised learning
- Diffusion
- Mask learning
Qualifiers
- Research-article