
MM-Duplex

Github

This is an incomplete experiment. It trains a VLM with a single token per image, and the resulting VLM turns out to be very good at describing simple videos.

Humans can handle input and output streams simultaneously: a commentator narrating a match can watch and talk at the same time. Today's LLMs can only consume or produce tokens at any given moment, which makes such tasks impossible in real time. In this paper we present MM-Duplex, a multimodal duplex architecture that takes a video stream as input and produces a text/audio stream as output.

Video -> text/audio “duplex”
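A minimal sketch of the duplex loop, assuming one embedding per frame (the single-token-per-image setup above) interleaved with text tokens in a single causal stream. All names and sizes here are placeholders, not an existing implementation.

```python
# Toy duplex step loop: consume one frame embedding, emit one text token per step.
import torch
import torch.nn as nn

d_model, vocab = 512, 32_000

frame_encoder = nn.Sequential(              # stands in for a CLIP-style image encoder
    nn.Flatten(), nn.Linear(3 * 224 * 224, d_model)
)
text_lm = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
text_head = nn.Linear(d_model, vocab)
tok_emb = nn.Embedding(vocab, d_model)

context = torch.zeros(1, 0, d_model)        # running mixed stream of frame + text embeddings
for t in range(16):                         # 16 incoming frames
    frame = torch.randn(1, 3, 224, 224)     # next frame from the video stream
    f = frame_encoder(frame).unsqueeze(1)   # (1, 1, d_model): one token per image
    context = torch.cat([context, f], dim=1)
    h = text_lm(context)                    # a real LM step would also apply a causal mask
    next_tok = text_head(h[:, -1]).argmax(-1)          # emit one text token per frame step
    context = torch.cat([context, tok_emb(next_tok).unsqueeze(1)], dim=1)
```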


Proposed architecture

Resource estimates

Video tokenizer

  1. CLIP takes ~800 H100-hours to train on 400M images: ~$1,600.
  2. Retrain a new type of tokenizer that encodes flows and events, not images. A small model (~75M) should be enough to understand basic things in videos (see the sketch after this list).
  3. An H100 for 10 days a month ≈ Rs. 55k.
  4. All small experiments on the 4090: free. Use Hyperbolic for 5090s.
  5. Timed transcriptions are the video equivalent of image captions. Can we train a model that aligns video with event transcriptions?
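A rough shape check for the ~75M event tokenizer in item 2: a spatio-temporal patch embedding followed by a small transformer that pools each clip into a single event embedding. The layer sizes are assumptions chosen only to land near that parameter budget.

```python
# Hypothetical ~75M-param "event" tokenizer: one embedding per short clip, not per image.
import torch
import torch.nn as nn

class EventTokenizer(nn.Module):
    def __init__(self, d_model: int = 768, n_layers: int = 10):
        super().__init__()
        # Spatio-temporal patchify: 4 frames x 16x16 pixels -> one patch embedding
        self.patch = nn.Conv3d(3, d_model, kernel_size=(4, 16, 16), stride=(4, 16, 16))
        layer = nn.TransformerEncoderLayer(d_model, nhead=12,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.event_query = nn.Parameter(torch.randn(1, 1, d_model))  # pools a clip into one token

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, 3, T, H, W) -> patch tokens: (B, N, d_model)
        x = self.patch(clip).flatten(2).transpose(1, 2)
        q = self.event_query.expand(x.size(0), -1, -1)
        x = self.encoder(torch.cat([q, x], dim=1))
        return x[:, 0]                       # one "event" embedding per clip

tok = EventTokenizer()
print(sum(p.numel() for p in tok.parameters()) / 1e6, "M params")  # ~73M with these settings
emb = tok(torch.randn(1, 3, 8, 224, 224))    # 8-frame clip -> (1, 768)
```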

MM-Duplex

  1. Moshi-like structure with a small text model (~300M). Videos are pre-encoded here, so we only need to train a projector. A projection layer of ~100M? (See the sketch after this list.)
  2. Video encoding and text decoding run in parallel. What is the architecture for keeping this much video information in context? Do we compress the video using a tokenizer that encodes events, or do we build something that compresses the tokenizer's output more efficiently?
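To put a number on the projector question in item 1: a hedged sketch of a LLaVA-style MLP projector over pre-encoded frames. The 1024-dimensional vision and text widths are assumptions; at these sizes the projector is far below 100M, so a 100M budget would mean a much wider hidden layer or a deeper stack.

```python
# Hypothetical projector from frozen frame embeddings into the text model's space.
import torch
import torch.nn as nn

def build_projector(d_vision: int = 1024, d_text: int = 1024, d_hidden: int = 4096) -> nn.Module:
    return nn.Sequential(
        nn.Linear(d_vision, d_hidden),
        nn.GELU(),
        nn.Linear(d_hidden, d_text),
    )

proj = build_projector()
print(sum(p.numel() for p in proj.parameters()) / 1e6, "M params")  # ~8.4M at these sizes

frame_emb = torch.randn(1, 16, 1024)   # 16 pre-encoded frames, one token each
text_space = proj(frame_emb)           # (1, 16, d_text), ready to interleave with text tokens
```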

Past

Several attempts have been made at audio-to-audio duplex streaming:

Moshi introduces the Mimi tokenizer, which compresses audio to 12.5 tokens per second per codebook with 8 codebooks, reconstructing audio that is close to the original.

They also build an architecture that weaves the text and audio streams together, which they call delayed streams modeling. Using the same architecture, they build other models such as an STT, a TTS, and a streaming speech-to-speech translator.

Delayed streaming
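A toy illustration of the delayed-streams idea as I read it: the parallel token streams are stacked per timestep, with one stream shifted by a fixed delay so the model predicts both columns jointly at every step. The token values and the 2-frame delay below are made up for illustration.

```python
# Align a text stream and an audio stream with a fixed delay (toy values).
PAD = 0
text  = [11, 12, 13, 14, 15, 16]   # one text token per 80 ms frame (12.5 Hz)
audio = [21, 22, 23, 24, 25, 26]   # one audio-codebook stream at the same rate
delay = 2                          # audio lags text by 2 frames

for t in range(len(text) + delay):
    text_tok  = text[t] if t < len(text) else PAD
    audio_tok = audio[t - delay] if t >= delay else PAD
    # The model predicts both columns jointly at step t.
    print(f"step {t}: text={text_tok:>2}  audio={audio_tok:>2}")
```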

TODOS

VLM

  1. Train a CLIP-like model from scratch.
  2. Train a VLM with an existing CLIP head.

Streaming video model

  1. A Moshi-like model, but with an existing CLIP and LLM.
  2. See where the best returns are: in the tokenizer or outside it.

I think training a video tokenizer from scratch is going to be very hard. Instead, figure out an architecture that can

  1. stream
  2. express videos as events without training a CLIP-like model from scratch (see the sketch below)
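A sketch of item 2 under stated assumptions: reuse a frozen, public CLIP image encoder per frame and emit an "event" embedding only when the frame embedding drifts past a threshold, so nothing CLIP-like is trained from scratch. The thresholding heuristic and the 0.15 value are assumptions, not a tested method.

```python
# Frozen CLIP per frame + change detection -> sparse "event" embeddings.
import torch
import torch.nn.functional as F
from transformers import CLIPVisionModel

encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32").eval()

@torch.no_grad()
def frames_to_events(frames: torch.Tensor, threshold: float = 0.15) -> list[torch.Tensor]:
    """frames: (T, 3, 224, 224) preprocessed pixels -> list of event embeddings."""
    events, prev = [], None
    for frame in frames:
        emb = encoder(pixel_values=frame.unsqueeze(0)).pooler_output.squeeze(0)
        if prev is None or 1 - F.cosine_similarity(emb, prev, dim=0) > threshold:
            events.append(emb)     # the scene changed "enough": emit an event token
            prev = emb
    return events

events = frames_to_events(torch.randn(8, 3, 224, 224))
print(len(events), "event tokens for 8 frames")
```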

Measurement

  1. LongVideoBench https://longvideobench.github.io/
  2. ReasoningVideo lyx97/reasoning_videos
  3. Natural video commentary arena [none yet; look for an audio equivalent]

Training dataset

  1. PE-Video: https://huggingface.co/datasets/facebook/PE-Video : 1M videos, ~500M frames at 30 fps. PLM was trained on 90M images. (Quick loader sketch below.)
  2. Live CC dataset: 5M videos
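For a first look at PE-Video, a minimal streaming pass with the Hugging Face datasets API. The split name is assumed to be "train", the record fields are inspected rather than assumed, and actual video decoding will likely need extra handling.

```python
# Peek at the first PE-Video record without downloading the full dataset.
from datasets import load_dataset

ds = load_dataset("facebook/PE-Video", split="train", streaming=True)  # split name assumed
first = next(iter(ds))
print(list(first.keys()))   # inspect available fields before writing a real loader
```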