TY - GEN
T1 - STEPs: Self-Supervised Key Step Extraction and Localization from Unlabeled Procedural Videos
T2 - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
AU - Shah, Anshul
AU - Lundell, Benjamin
AU - Sawhney, Harpreet
AU - Chellappa, Rama
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
AB - We address the problem of extracting key steps from unlabeled procedural videos, motivated by the potential of Augmented Reality (AR) headsets to revolutionize job training and performance. We decompose the problem into two steps: representation learning and key step extraction. We propose a training objective, the Bootstrapped Multi-Cue Contrastive (BMC2) loss, to learn discriminative representations for various steps without any labels. Unlike prior works, we develop techniques to train a lightweight temporal module that uses off-the-shelf features for self-supervision. Our approach can seamlessly leverage information from multiple cues such as optical flow, depth, or gaze to learn discriminative features for key steps, making it amenable to AR applications. We finally extract key steps via a tunable algorithm that clusters the representations and samples from them. We show significant improvements over prior works on the tasks of key step localization and phase classification. Qualitative results demonstrate that the extracted key steps are meaningful and succinctly represent the various steps of the procedural tasks. Our code can be found at https://github.com/anshulbshah/STEPs.
UR - http://www.scopus.com/inward/record.url?scp=85181844526&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85181844526&partnerID=8YFLogxK
DO - 10.1109/ICCV51070.2023.00952
M3 - Conference contribution
AN - SCOPUS:85181844526
T3 - Proceedings of the IEEE International Conference on Computer Vision
SP - 10341
EP - 10353
BT - Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 2 October 2023 through 6 October 2023
ER -