ICCV Retail Workshop Competition
https://grocery-vision.github.io/
GitHub
Track 1
- 3 classes of events: Take, Return, Rummage
- Temporally and spatially locate the events. Temporal boundaries are frame-precise; spatial boundaries are bounding boxes around the objects involved.
- The camera is always placed at the same location, inside the cart.
- A single frame can be part of multiple events
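A minimal sketch of what one event annotation could look like in code; the field names and container types are my own guess, not the official GroceryVision format:

```python
from dataclasses import dataclass

EVENT_CLASSES = ("Take", "Return", "Rummage")

@dataclass
class Event:
    cls: str          # one of EVENT_CLASSES
    start_frame: int  # inclusive, frame-precise
    end_frame: int    # inclusive
    # frame index -> (x, y, w, h) box around the involved object
    boxes: dict[int, tuple[float, float, float, float]]

def events_at(events: list[Event], frame: int) -> list[Event]:
    # a single frame can belong to several overlapping events
    return [e for e in events if e.start_frame <= frame <= e.end_frame]
```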
Dataset
- 271 Videos
- 300k Frames
- 52k tagged frames with bboxes and event classes
- ~4k events total, ~1,300 per class

One VLM to rule them all
- VLMs can emit rough bounding boxes
- VLMs can temporally locate event boundaries
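For example, a joint prompt along these lines could elicit both; the wording and JSON schema are purely illustrative, and any real run would need parsing and validation of the model output:

```python
# Hypothetical grounding prompt; braces are doubled so str.format works.
PROMPT = (
    "You are given {n} frames sampled at {fps} fps from an in-cart camera. "
    "List every Take, Return, or Rummage event as JSON: "
    '[{{"class": ..., "start_s": ..., "end_s": ..., "box_xywh": [x, y, w, h]}}, ...] '
    "with box coordinates normalized to [0, 1]."
)
```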
Context length problem: the largest video in the training set is 90 seconds @ 25 fps = 2250 frames
- Pixel shuffle brings this down to 81 tokens per image
- 2250 * 81 = 182,250 tokens, which is far too long for a single context
- So detect boundaries iteratively: start coarse at 1 fps (90 * 81 = 7290 ≈ 8k tokens), then refine around the coarse spans at full frame rate (sketch below)
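A rough sketch of that two-pass search; `locate(frames, fps)` is an assumed helper that sends frames plus a grounding prompt (like the one above) to the VLM and returns (start_s, end_s, class) spans relative to the frames it was given:

```python
def refine_boundaries(frames, locate, native_fps=25, pad_s=1.0):
    # pass 1: whole video at 1 fps (~90 frames -> ~7.3k image tokens)
    coarse = locate(frames[::native_fps], fps=1)
    refined = []
    for start_s, end_s, cls in coarse:
        # pass 2: full frame rate on a padded window around each coarse span
        lo = max(0, int((start_s - pad_s) * native_fps))
        hi = min(len(frames), int((end_s + pad_s) * native_fps))
        for s, e, c in locate(frames[lo:hi], fps=native_fps):
            # shift spans back into whole-video time
            refined.append((s + lo / native_fps, e + lo / native_fps, c))
    return refined
```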
Classification accuracy: the training set has about 4k events.
- Training SmolVLM to classify events on videos at 3 fps gives about 88% accuracy after one epoch. So SmolVLM can detect these events, and they are detectable at 3 fps.
- SmolVLM is a small 2.2B VLM; using Gemma might yield better results.
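For reference, a minimal inference-side sketch of the 3 fps probe with SmolVLM2 through Hugging Face transformers; the prompt wording, frame stride, and per-frame message layout are my own choices, not the setup behind the 88% number:

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
processor = AutoProcessor.from_pretrained(MODEL)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

def classify_clip(frames_25fps):  # frames_25fps: list of PIL images
    frames = frames_25fps[::8]    # 25 fps -> ~3 fps
    content = [{"type": "image", "image": f} for f in frames]
    content.append({"type": "text", "text":
        "Which event happens in this cart video: Take, Return, or Rummage? "
        "Answer with one word."})
    inputs = processor.apply_chat_template(
        [{"role": "user", "content": content}],
        add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device, dtype=torch.bfloat16)
    out = model.generate(**inputs, do_sample=False, max_new_tokens=5)
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return processor.decode(new_tokens, skip_special_tokens=True)
```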
Ideas to try
- Rolling window (VLM): keep a fixed number of frames in context at a time. Train to mark all of them. Feed previous frames' results back into the context to ensure continuity (sketch after the sequences below).
- 1st sequence:
<f1><f2><f3><f4><f5>| -> <r1><r2><r3><r4><r5>
- 2nd sequence:
<f2><f3><f4><f5><f6>|<r2><r3><r4><r5> -> <r6>
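A minimal sketch of how those windowed samples could be generated (window size and masking policy are open choices; the token estimate below counts loss on all result tokens in a window):

```python
def rolling_window_samples(frames, results, win=5):
    """Yield (input_frames, context_results, target_results) triples.
    The first window predicts all `win` results; each later window
    slides forward by one frame, gets the previous win-1 results as
    context, and the newest result is the only genuinely new target."""
    assert len(frames) == len(results)
    # 1st sequence: <f1..f5> -> <r1..r5>
    yield frames[:win], [], results[:win]
    # later sequences: <f2..f6> | <r2..r5> -> <r6>, and so on
    for i in range(1, len(frames) - win + 1):
        yield frames[i:i + win], results[i:i + win - 1], [results[i + win - 1]]
```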
Estimated Training Tokens (More quality tokens, more learning):
- 1 image = 81 tokens
- total images in dataset: ~300k
- images with annotation: ~72k
- one result = "x,y,w,h,class" ≈ 10 tokens
- 300k * 10 = ~3M tokens with loss (image tokens carry no loss; the visual encoder is frozen)
- 3M tokens are ok for finetuning
- sequence of 25 images per sample
- Number of samples => ~300k (more can be made via augmentations like frame skipping and rotating/flipping the video)
- Context length -> 25 * 81 + 25 * 10 = 2275 ≈ 2.3k tokens
- Total tokens passing through = 300k * ~2.3k ≈ 680M tokens (~1 day on an H100)
- Total loss tokens = 300k * 250 = 75M (3M unique, 25x repeats due to windowing); sanity check after the config list below
- configurable ->
- seq len = 25
- skip frames = 2
- video rotation
- video horizontal flip
- VLM backbones -> SmolVLM2, Gemma 3/3n, Qwen2.5-3B
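A quick sanity check of the arithmetic above, with the numbers from these notes:

```python
TOKENS_PER_IMAGE = 81    # after pixel shuffle
TOKENS_PER_RESULT = 10   # ~ "x,y,w,h,class"
WIN = 25                 # frames per sample
N_SAMPLES = 300_000      # ~one window per frame in the dataset

ctx = WIN * (TOKENS_PER_IMAGE + TOKENS_PER_RESULT)  # 2275 tokens per sample
total = N_SAMPLES * ctx                             # 682,500,000 ~ 680M forward tokens
loss = N_SAMPLES * WIN * TOKENS_PER_RESULT          # 75,000,000 loss tokens
unique = loss // WIN                                # ~3M unique; 25x repeats from windowing
print(ctx, total, loss, unique)
```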
- Video embedder (+ regressor): V-JEPA 2 and VideoPrism are both very high-quality video embedders that emit sequence embeddings for videos.
- Known to work very well on event-centric datasets like SSv2 (Something-Something v2)
- Take the video embeddings and train spatio-temporal regressors on top (sketch below)
- https://huggingface.co/collections/facebook/v-jepa-2-6841bad8413014e185b497a6
- V-JEPA 2 ViT-L (~300M) -> input 64 frames -> (8192, 1024) embedding
- https://huggingface.co/facebook/vjepa2-vitl-fpc64-256/blob/main/notebook_finetuning.ipynb
- Much smaller and faster to train than a full VLM.
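A sketch of what such a head could look like: the (8192, 1024) output factors into 32 temporal tokens x 256 spatial tokens (64 frames with tubelet size 2, 256px input, patch 16), so the head below pools space per time step and regresses a class and box per 2-frame step. All hyperparameters here are illustrative, not a tested design:

```python
import torch
import torch.nn as nn

class STRegressor(nn.Module):
    """Spatio-temporal head on frozen V-JEPA 2 features."""
    def __init__(self, dim=1024, t_tokens=32, s_tokens=256, n_classes=4):
        super().__init__()
        self.t_tokens, self.s_tokens = t_tokens, s_tokens
        self.temporal = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.cls_head = nn.Linear(2 * dim, n_classes)  # 3 event classes + background
        self.box_head = nn.Linear(2 * dim, 4)          # (x, y, w, h), normalized

    def forward(self, feats):  # feats: (B, 8192, 1024) from the embedder
        B = feats.shape[0]
        # split tokens into (time, space), then pool over space
        x = feats.view(B, self.t_tokens, self.s_tokens, -1).mean(dim=2)
        x, _ = self.temporal(x)  # propagate context across time steps
        return self.cls_head(x), self.box_head(x).sigmoid()

# e.g. feats = V-JEPA 2 last_hidden_state for a 64-frame clip
head = STRegressor()
logits, boxes = head(torch.randn(2, 8192, 1024))  # (2, 32, 4) each
```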