Diffusion-based framework for weakly-supervised temporal action localization (2025)

research-article

Authors: Yuanbing Zou, Qingjie Zhao, Prodip Kumar Sarker, Shanshan Li, Lei Wang, Wangwang Liu

Published: 11 February 2025


Abstract

Weakly supervised temporal action localization aims to localize action instances with only video-level supervision. In the absence of frame-level annotations, effectively separating action snippets from background in semantically ambiguous features becomes an arduous challenge. To address this issue from a generative-modeling perspective, we propose a novel two-stage diffusion-based network. In the first stage, we design a local masking module that learns local semantic information and generates binary masks, which (1) perform action-background separation and (2) serve as the pseudo-ground truth required by the diffusion module. In the second stage, we propose a diffusion module that generates high-quality action predictions under pseudo-ground-truth supervision. In addition, we optimize the new-refining operation in the local masking module to improve operational efficiency. Experimental results demonstrate that the proposed method achieves promising performance on the mainstream public datasets THUMOS14 and ActivityNet. The code is available at https://github.com/Rlab123/action_diff.
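As a rough illustration of the two stages described in the abstract, the following NumPy sketch shows how a thresholded actionness score could yield a binary pseudo-ground-truth mask (stage one), which is then corrupted by a DDPM-style forward process (Ho et al., 2020) for the diffusion stage to denoise (stage two). The function names, threshold, and noise schedule are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def local_masking(snippet_scores, threshold=0.5):
    """Stage 1 (sketch): turn per-snippet actionness scores into a binary
    mask separating action snippets from background. The mask doubles as
    pseudo-ground truth for the diffusion stage."""
    return (snippet_scores >= threshold).astype(np.float32)

def forward_diffuse(mask, t, betas):
    """DDPM-style forward process: corrupt the pseudo-ground-truth mask
    with Gaussian noise at timestep t using the cumulative alpha product."""
    alpha_bar = np.prod(1.0 - betas[: t + 1])
    noise = np.random.randn(*mask.shape)
    noisy = np.sqrt(alpha_bar) * mask + np.sqrt(1.0 - alpha_bar) * noise
    return noisy, noise

# Toy example: 8 snippets, scores from a video-level-supervised head.
scores = np.array([0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.1, 0.6])
pseudo_gt = local_masking(scores)  # -> [0., 0., 1., 1., 1., 0., 0., 1.]
betas = np.linspace(1e-4, 0.02, 100)
noisy_mask, eps = forward_diffuse(pseudo_gt, t=50, betas=betas)
# A denoiser would be trained to recover the clean mask (or predict `eps`)
# from `noisy_mask`, conditioned on the snippet features.
```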

Highlights

We propose a diffusion-based framework for weakly-supervised temporal action localization.

We leverage a local masking module to separate local action instances from backgrounds.

We propose a new-refining strategy that improves the model's operational efficiency.
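To localize action instances from a binary snippet mask, a standard post-processing step groups consecutive positive snippets into (start, end) segments. A minimal sketch of that step, assuming a 1-D binary mask and a hypothetical snippets-per-second factor `fps`:

```python
def mask_to_segments(mask, fps=1.0):
    """Group consecutive positive snippets of a binary mask into
    (start, end) time intervals (hypothetical post-processing)."""
    segments = []
    start = None
    for i, m in enumerate(mask):
        if m and start is None:
            start = i                       # segment opens
        elif not m and start is not None:
            segments.append((start / fps, i / fps))  # segment closes
            start = None
    if start is not None:                   # mask ends inside a segment
        segments.append((start / fps, len(mask) / fps))
    return segments

print(mask_to_segments([0, 1, 1, 0, 0, 1, 1, 1]))  # [(1.0, 3.0), (5.0, 8.0)]
```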


Index Terms

  1. Computing methodologies

    1. Artificial intelligence

      1. Computer vision

        1. Computer vision tasks

          1. Activity recognition and understanding

    2. Machine learning

      1. Learning paradigms

        1. Multi-task learning

          1. Transfer learning

      2. Machine learning approaches

        1. Learning latent representations

          1. Neural networks

  2. Information systems

    1. Information retrieval

      1. Specialized information retrieval

        1. Multimedia and multimodal retrieval

          1. Video search

Index terms have been assigned to the content through auto-classification.


Published In

Pattern Recognition, Volume 160, Issue C, April 2025, 437 pages

Copyright © 2024. Published by Elsevier Science Inc., United States.


            Author Tags

            1. Temporal action localization
            2. Weakly-supervised learning
            3. Diffusion
            4. Mask learning

