STEPs: Self-Supervised Key Step Extraction and Localization from Unlabeled Procedural Videos

Anshul Shah, Benjamin Lundell, Harpreet Sawhney, Rama Chellappa

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We address the problem of extracting key steps from unlabeled procedural videos, motivated by the potential of Augmented Reality (AR) headsets to revolutionize job training and performance. We decompose the problem into two steps: representation learning and key step extraction. We propose a training objective, the Bootstrapped Multi-Cue Contrastive (BMC2) loss, to learn discriminative representations for various steps without any labels. Unlike prior works, we develop techniques to train a lightweight temporal module that uses off-the-shelf features for self-supervision. Our approach can seamlessly leverage information from multiple cues such as optical flow, depth, or gaze to learn discriminative features for key steps, making it amenable to AR applications. Finally, we extract key steps via a tunable algorithm that clusters the representations and then samples from them. We show significant improvements over prior works on the tasks of key step localization and phase classification. Qualitative results demonstrate that the extracted key steps are meaningful and succinctly represent the various steps of the procedural tasks. Our code can be found at https://github.com/anshulbshah/STEPs.
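
To make the final "cluster, then sample" stage concrete, here is a minimal sketch. It is not the authors' algorithm: it assumes plain k-means with a deterministic farthest-point initialization, and it samples one key frame per cluster by taking the frame closest to the centroid. The function names (`kmeans`, `extract_key_frames`) and the parameter `k`, which stands in for the "tunable" number of key steps, are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(features, centroids):
    """Index of the nearest centroid for every frame feature."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(f, centroids[j]))
        for f in features
    ]


def kmeans(features, k, iters=20):
    """Plain Lloyd's k-means (an assumption, not the paper's clustering)."""
    # Deterministic farthest-point initialization: start from frame 0, then
    # repeatedly add the frame farthest from all centroids chosen so far.
    centroids = [list(features[0])]
    while len(centroids) < k:
        farthest = max(
            features,
            key=lambda f: min(squared_distance(f, c) for c in centroids),
        )
        centroids.append(list(farthest))

    for _ in range(iters):
        labels = assign(features, centroids)
        for j in range(k):
            members = [f for f, lab in zip(features, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]

    # Recompute labels so they match the final centroids.
    return centroids, assign(features, centroids)


def extract_key_frames(features, k):
    """Return one representative frame index per cluster, in temporal order."""
    centroids, labels = kmeans(features, k)
    key_frames = []
    for j, centroid in enumerate(centroids):
        members = [i for i, lab in enumerate(labels) if lab == j]
        if members:
            key_frames.append(
                min(members, key=lambda i: squared_distance(features[i], centroid))
            )
    return sorted(key_frames)
```

In this sketch `features` would be the per-frame representations produced by the learned temporal module, and raising or lowering `k` trades off how coarse or fine the extracted key steps are.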

Original language: English (US)
Title of host publication: Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 10341-10353
Number of pages: 13
ISBN (Electronic): 9798350307184
DOIs
State: Published - 2023
Event: 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023 - Paris, France
Duration: Oct 2 2023 to Oct 6 2023

Publication series

Name: Proceedings of the IEEE International Conference on Computer Vision
ISSN (Print): 1550-5499

Conference

Conference: 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
Country/Territory: France
City: Paris
Period: 10/2/23 to 10/6/23

ASJC Scopus subject areas

  • Software
  • Computer Vision and Pattern Recognition
