Ctrl-World: A Controllable Generative World Model for Robot Manipulation

1Stanford University, 2Tsinghua University

[Figure: Ctrl-World diagram]

Ctrl-World is designed for policy-in-the-loop rollouts with generalist robot policies. It generates joint multi-view predictions (including wrist views), enforces fine-grained action control via frame-level conditioning, and sustains coherent long-horizon dynamics through pose-conditioned memory retrieval. Together, these components enable (1) accurate evaluation of a policy's instruction-following ability in imagination, and (2) targeted policy improvement on previously unseen instructions.
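As a rough illustration of what policy-in-the-loop rollout means, the Python sketch below rolls a generalist policy out entirely inside the world model. Every name here (imagined_rollout, policy.predict, world_model.generate) is a hypothetical stand-in, not the released Ctrl-World interface.

```python
# Hypothetical sketch of a policy-in-the-loop rollout; all names are
# illustrative stand-ins, not the released Ctrl-World API.
def imagined_rollout(world_model, policy, obs, instruction, num_steps=20):
    """Roll a generalist policy out entirely in imagination."""
    frames = [obs]  # obs: current multi-view frames (third-person + wrist)
    for _ in range(num_steps):
        # The policy proposes a short action chunk from the imagined observation.
        action_chunk = policy.predict(obs, instruction)
        # Frame-level conditioning pairs each generated frame with the action
        # applied at that timestep; sparse history frames act as retrievable
        # memory for long-horizon consistency.
        obs = world_model.generate(context_frames=frames, actions=action_chunk)
        frames.append(obs)
    return frames  # imagined trajectory, scorable for instruction following
```

The imagined trajectory can then be scored against the instruction, which is how evaluation-by-imagination works at a high level.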



Interactive Demos

Starting from the same initial frame, Ctrl-World autoregressively generates diverse future trajectories conditioned on the given action chunks, achieving centimeter-level precision. You can select any combination of actions and generate the corresponding video. All videos are generated by passing in the initial frame and a different sequence of actions as input. For interpretability, we translate each action chunk into a text description of the action.
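A minimal sketch of the open-loop pattern these demos follow, with the action chunks chosen by the user rather than by a policy; load_initial_frame, make_chunk, and describe_chunk are assumed helpers, not part of any released code.

```python
# Hypothetical open-loop control: one initial frame, user-chosen action
# chunks, a distinct generated video per combination. Helpers are assumed.
def demo_rollout(world_model):
    initial_frame = load_initial_frame("demo1.png")  # assumed helper
    plan = [
        make_chunk(dx=-0.05),   # assumed helper: move left ~5 cm
        make_chunk(dz=-0.03),   # move down ~3 cm
        make_chunk(grip=0.0),   # close the gripper
    ]
    video = world_model.generate(context_frames=[initial_frame], actions=plan)
    captions = [describe_chunk(c) for c in plan]     # text label per chunk
    return video, captions
```

Swapping any chunk in plan yields a different future from the same initial frame, which is what the demos below let you do interactively.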

Interactive Control Demo 1: Keyboard Control

[Interactive demo: from the initial state, select action chunks 1–3; the generated video updates to match the chosen combination.]

Interactive Control Demo 2: Interact with Different Objects

[Interactive demo: from the initial state, select action chunks 1–4; the generated video updates to match the chosen combination.]

Interactive Control Demo 3: Centimeter-Level Precision

[Interactive demo: from the initial state, select action chunks 1–3; the generated video updates to match the chosen combination.]

Interactive Control Demo 4: Interact with Different Objects

[Interactive demo: from the initial state, select action chunks 1–7; the generated video updates to match the chosen combination.]



Model Architecture

Ctrl-World is initialized from a pretrained video diffusion model and adapted into a controllable, temporally consistent world model through three components: (1) multi-view input and joint prediction for a unified understanding of the scene; (2) a memory retrieval mechanism that adds sparse history frames to the context and projects pose information into each frame via frame-level cross-attention, re-anchoring predictions to similar past states; and (3) frame-level action conditioning, which better aligns high-frequency actions with visual dynamics.
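To make component (3) concrete, below is a minimal PyTorch sketch of frame-level action conditioning via cross-attention. The dimensions, head count, and class name are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class FrameLevelActionConditioning(nn.Module):
    """Sketch: each frame's visual tokens cross-attend to an embedding of
    the action applied at that frame. All dimensions are illustrative."""

    def __init__(self, token_dim=512, action_dim=7, num_heads=8):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, token_dim)
        self.cross_attn = nn.MultiheadAttention(token_dim, num_heads,
                                                batch_first=True)

    def forward(self, frame_tokens, actions):
        # frame_tokens: (B, T, N, D) - N spatial tokens per frame
        # actions:      (B, T, action_dim) - one low-level action per frame
        B, T, N, D = frame_tokens.shape
        act = self.action_proj(actions).reshape(B * T, 1, D)  # per-frame key/value
        q = frame_tokens.reshape(B * T, N, D)
        out, _ = self.cross_attn(q, act, act)  # attention stays within each frame
        return (q + out).reshape(B, T, N, D)   # residual update

tokens = torch.randn(2, 8, 64, 512)  # 2 clips, 8 frames, 64 tokens per frame
acts = torch.randn(2, 8, 7)          # 7-DoF action per frame
out = FrameLevelActionConditioning()(tokens, acts)  # -> (2, 8, 64, 512)
```

Because the key/value sequence holds only that frame's action embedding, the conditioning is strictly per-frame; component (2)'s pose injection can be realized with an analogous per-frame cross-attention over retrieved history frames.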


Comparisons of Rollouts in the Real World and the World Model (Figure 6 of the paper)

Each example pairs the real-world execution with the corresponding world-model rollout:

- Pick blue block and place on white plate
- Fold the towel into half
- Place sponge in drawer
- Close the laptop
- Move towel from left to right
- Pull one tissue out of the box

Synthetic Data Used for Finetuning the Policy (Figure 8 of the paper)

The synthetic rollouts target several generalization axes (a sketch of the data-generation pipeline follows this list):

- Novel objects
- Folding the towel from a desired direction
- Spatial understanding (e.g., left, right, top right, bottom side)
- Shape understanding (e.g., smaller or larger block)
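As a loose sketch of how such data might be collected, the snippet below reuses the hypothetical imagined_rollout from the earlier sketch: roll the policy out inside the world model on unseen instructions, keep the successful imagined trajectories, and post-train on them. is_success and the record format are assumptions.

```python
# Hypothetical synthetic-data pipeline; builds on the imagined_rollout
# sketch above. is_success and the record format are assumed for illustration.
def build_synthetic_dataset(world_model, policy, initial_obs, novel_instructions):
    """Keep only imagined trajectories that complete the instruction."""
    dataset = []
    for instruction in novel_instructions:   # e.g., "fold the towel from the left"
        frames = imagined_rollout(world_model, policy, initial_obs, instruction)
        if is_success(frames, instruction):  # assumed filter, e.g. a VLM judge
            dataset.append({"instruction": instruction, "frames": frames})
    return dataset  # used to post-train the policy on new instructions
```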

Qualitative Results of Policy Improvement (Figure 9 of the paper)

For each language instruction below, the base policy fails to follow the instruction :( while the policy post-trained on synthetic data succeeds :)

1. Pick the object in top left side and place in box.
2. Pick glove and place in box.
3. Pick the larger red block and place in box.
4. Fold the towel from left to right.