Small World Setup
Imagine a 2D grid world populated only by boxes of various colors. Physics in this world works exactly as you'd expect. Can an LLM learn to operate in this world? Can it reason through images like these?

We can edit these worlds mentally with ease. We can manipulate boxes and solve problems in our mind. Our mind searches for solutions to problems in this world, even though we have never seen this world before.
- Search over solutions: very fast simulation or intrinsic search?
- Verification over regions by zooming in and selecting
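As a minimal sketch of what such a world could look like in code (the grid encoding, function names, and color indices here are all illustrative assumptions, not a fixed spec):

```python
import numpy as np

# Hypothetical encoding: a 2D integer grid where 0 = empty and
# positive values are color indices.
def make_world(height, width):
    return np.zeros((height, width), dtype=int)

def place_box(world, row, col, h, w, color):
    """Paint a solid h x w box of the given color onto the grid."""
    world[row:row + h, col:col + w] = color
    return world

world = make_world(8, 8)
place_box(world, 1, 1, 2, 2, color=3)  # a 2x2 box of color 3
place_box(world, 5, 4, 2, 3, color=5)  # a 2x3 box of color 5
```

Mental editing of such a world then amounts to cheap array operations, which is part of what makes the setup attractive for fast simulation.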
Solving This in the Current Learning Paradigm
- The model needs to intrinsically understand this world and its basic operations.
- It should grasp what “add” and “subtract” mean in this tiny universe.
- It must be able to:
  - Solve tasks in this environment
  - Learn new scenarios, as long as they stay within this world
Three Experiments Enabled by This Setup
- Reasoning about a small world
- Learning a world model and using it in a reasoning chain
- Teaching the world to a model without destroying prior learning, i.e., avoiding catastrophic forgetting (e.g., by building world models in context)
Creating These Worlds
- Each world has organisms: just connected shapes.
- We take real-life "behaviors" such as:
  - Copy
  - Reverse
  - Hole filling
  - Symmetry
  - Repetitions
  - "Big eats small" (size)
  - Shape-dependent properties

…and generate millions of small worlds.
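A sketch of how such generation might work: sample a behavior, apply it to a random input grid, and keep the (input, output) pair as a task. Function names and the exact transforms are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def copy_behavior(grid):
    # Output is the input repeated once along the horizontal axis.
    return np.concatenate([grid, grid], axis=1)

def reverse_behavior(grid):
    # Output is the input mirrored left-to-right (a symmetry rule).
    return grid[:, ::-1]

# A registry of named behaviors; the real set would be much larger.
BEHAVIORS = {"copy": copy_behavior, "reverse": reverse_behavior}

def sample_task():
    grid = rng.integers(0, 4, size=(4, 4))   # random colored input
    name = rng.choice(list(BEHAVIORS))
    return name, grid, BEHAVIORS[name](grid)  # (behavior, input, output)

name, x, y = sample_task()
```

Scaling this to millions of worlds is then a matter of enlarging the behavior registry and sampling many (input, output) pairs per behavior.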
Notes on Representation
- Tokens are edits in small worlds.
- E.g., rather than using raw image frames (where a simple leg movement is costly to encode), use a starting state + an “edit” to capture that movement.
- This is a promising way to represent videos.
- Questions:
  - Do existing video tokenizers already capture this?
  - Does DINO-WM do it?
  - Are edits embeddings, or can they be discrete tokens?
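The "state + edit" idea can be made concrete with a toy diff: instead of storing the next frame, store only the cells that changed. The `(row, col, new_color)` token format is an assumption, not a settled design.

```python
import numpy as np

def diff_edits(before, after):
    """Return the cells that differ as (row, col, new_color) tuples."""
    rows, cols = np.nonzero(before != after)
    return [(int(r), int(c), int(after[r, c])) for r, c in zip(rows, cols)]

def apply_edits(before, edits):
    """Reconstruct the next state from the starting state plus edits."""
    after = before.copy()
    for r, c, color in edits:
        after[r, c] = color
    return after

a = np.zeros((3, 3), dtype=int)
b = a.copy()
b[1, 2] = 7                    # a single cell changes between frames
edits = diff_edits(a, b)       # one edit token instead of a full frame
```

For a small movement, this yields a handful of tuples rather than a full frame, which is the efficiency argument above in miniature.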
Next Data Frontier
Models left “in the wild” learning from their own actions and feedback.
- Can we teach a model to learn autonomously in such an environment?
Proposed Architecture
- Build a small world
- Train an architecture that interacts with it
- If the world is similar to ARC-2, the model should solve ARC tasks.
- Mixture of Experts:
- One expert triggers only on visual input
- Language as the most efficient communication channel between agents in an environment
- Evolutionary mechanism: many instances interact, poorly performing ones get “eliminated”
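The evolutionary mechanism could be sketched as a selection loop over agent instances: rank by fitness, eliminate the worst fraction, and refill from the best with mutation noise. Fitness here is a stand-in scalar; everything in this sketch is a hypothetical simplification.

```python
import random

random.seed(0)
population = [random.uniform(0, 1) for _ in range(8)]  # agent "skill" scores

def step(pop, kill_frac=0.25):
    pop = sorted(pop, reverse=True)                 # rank by fitness
    survivors = pop[: int(len(pop) * (1 - kill_frac))]
    # Refill from the top performers, with small mutation noise.
    while len(survivors) < len(pop):
        parent = random.choice(survivors[: len(survivors) // 2 or 1])
        survivors.append(min(1.0, parent + random.gauss(0, 0.05)))
    return survivors

for _ in range(10):
    population = step(population)
```

In the actual setup, "fitness" would come from task performance in the small world rather than a scalar, but the eliminate-and-refill structure is the same.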
Development Steps
- Get the stack ready for small-world setup
- Train a small language model
- Train a small vision-language model
- Perform reward-based training
- Write a simulator for ARC-like problems (copying, falling, etc.)
- Attach a pre-trained small language model to the simulator and collect data
- Devise training so the model retains and prioritizes important information
- Measure accuracy before and after training
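One piece of the proposed simulator, an ARC-like "falling" rule, could look like the sketch below: nonzero cells drop within each column until they rest on the floor. The function name and column-packing approach are illustrative assumptions.

```python
import numpy as np

def apply_gravity(grid):
    """Let every colored cell fall to the bottom of its column."""
    out = np.zeros_like(grid)
    for c in range(grid.shape[1]):
        col = grid[:, c][grid[:, c] != 0]  # colors in this column, top to bottom
        if len(col):
            out[-len(col):, c] = col       # pack them at the bottom
    return out

g = np.array([[2, 0, 0],
              [0, 0, 3],
              [0, 0, 0]])
fallen = apply_gravity(g)  # both boxes end up on the bottom row
```

Copying, reversing, and the other behaviors listed earlier would be analogous small rules composed into tasks.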
Possible Environment Builders
Readables on World Models