TY - GEN
T1 - MOST
T2 - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
AU - Rambhatla, Sai Saketh
AU - Misra, Ishan
AU - Chellappa, Rama
AU - Shrivastava, Abhinav
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - We tackle the challenging task of unsupervised object localization in this work. Recently, transformers trained with self-supervised learning have been shown to exhibit object localization properties without being trained for this task. In this work, we present Multiple Object localization with Self-supervised Transformers (MOST) that uses features of transformers trained using self-supervised learning to localize multiple objects in real world images. MOST analyzes the similarity maps of the features using box counting; a fractal analysis tool to identify tokens lying on foreground patches. The identified tokens are then clustered together, and tokens of each cluster are used to generate bounding boxes on foreground regions. Unlike recent state-of-the-art object localization methods, MOST can localize multiple objects per image and outperforms SOTA algorithms on several object localization and discovery benchmarks on PASCAL-VOC 07, 12 and COCO20k datasets. Additionally, we show that MOST can be used for self-supervised pretraining of object detectors, and yields consistent improvements on fully, semi-supervised object detection and unsupervised region proposal generation.Our project is publicly available at rssaketh.github.io/most.
AB - We tackle the challenging task of unsupervised object localization in this work. Recently, transformers trained with self-supervised learning have been shown to exhibit object localization properties without being trained for this task. In this work, we present Multiple Object localization with Self-supervised Transformers (MOST) that uses features of transformers trained using self-supervised learning to localize multiple objects in real world images. MOST analyzes the similarity maps of the features using box counting; a fractal analysis tool to identify tokens lying on foreground patches. The identified tokens are then clustered together, and tokens of each cluster are used to generate bounding boxes on foreground regions. Unlike recent state-of-the-art object localization methods, MOST can localize multiple objects per image and outperforms SOTA algorithms on several object localization and discovery benchmarks on PASCAL-VOC 07, 12 and COCO20k datasets. Additionally, we show that MOST can be used for self-supervised pretraining of object detectors, and yields consistent improvements on fully, semi-supervised object detection and unsupervised region proposal generation.Our project is publicly available at rssaketh.github.io/most.
UR - http://www.scopus.com/inward/record.url?scp=85175230544&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85175230544&partnerID=8YFLogxK
U2 - 10.1109/ICCV51070.2023.01450
DO - 10.1109/ICCV51070.2023.01450
M3 - Conference contribution
AN - SCOPUS:85175230544
T3 - Proceedings of the IEEE International Conference on Computer Vision
SP - 15777
EP - 15788
BT - Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 2 October 2023 through 6 October 2023
ER -