Survey and tutorial map

From World Models to World Action Models: A Concise Tutorial for Robotics

A concise tutorial that builds from task-specific worlds and embodied policies to predictive world models—and finally to models that connect imagined futures with executable robot actions.

Xiaoxiong Zhang, Xiong Zeng, and Wei Zhang

Southern University of Science and Technology; LimX Dynamics


Architectural view

📘 Introduction

Predicting how the world evolves under actions is central to embodied AI and generative simulation. A unified architecture makes clear what a model represents, what it predicts, and how its predictions support control, planning, decision making, and policy learning.

World

The world is the set of all objects relevant to an embodied AI task. It consists of a robot and its environment; the environment contains both the objects of interest and the ambient environment.

Embodied AI Task and Policy

A world configuration describes the robot and every object in its environment. An embodied AI task asks a policy to transform an initial configuration into a target one by controlling the robot. The exact world is therefore task-dependent.

For example, the world for humanoid locomotion needs only a robot and a ground surface, whereas the world for robotic table cleaning additionally includes dishes, furniture, and the surrounding household environment.

Figure 2. Humanoid locomotion and robotic table cleaning: two embodied AI tasks that instantiate different worlds and target configurations.

A closed-loop policy

At each step, a policy receives a language instruction l and the current observation o_t, then outputs an action a_t to the robot. The policy may be a PID controller, a model predictive controller (MPC), a vision-language-action model, or a world action model.
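The loop described above can be sketched in a few lines. This is a minimal illustration rather than the survey's implementation: the 1-D world, the `proportional_policy` controller, the gain of 0.5, and the convention that the instruction ends with a target position are all assumptions made for the example.

```python
from typing import Callable, List

# Hypothetical 1-D world: the observation o_t is the robot's position on a line.
# A policy maps (instruction l, observation o_t) to an action a_t.
Policy = Callable[[str, float], float]


def proportional_policy(instruction: str, observation: float) -> float:
    """Toy P controller; assumes the instruction ends with a target, e.g. 'move to 2.0'."""
    target = float(instruction.split()[-1])
    gain = 0.5  # assumed gain, chosen only for illustration
    return gain * (target - observation)


def run_closed_loop(policy: Policy, instruction: str, o0: float, steps: int) -> List[float]:
    """Observe, act, let the world respond, then observe again."""
    observations = [o0]
    o_t = o0
    for _ in range(steps):
        a_t = policy(instruction, o_t)  # the policy sees l and o_t, emits a_t
        o_t = o_t + a_t                 # toy world: the action is a displacement
        observations.append(o_t)
    return observations


trajectory = run_closed_loop(proportional_policy, "move to 2.0", o0=0.0, steps=20)
```

Any of the policy types listed above could replace `proportional_policy`, as long as it keeps the same (instruction, observation) to action interface.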

A language instruction and current world observation feed a policy that sends an action to the robot in a closed loop. — Figure 3. A language-conditioned closed-loop policy framework.

World Models and World Action Models

For a specified world, a world model predicts the future observation o_{t+1} or state x_{t+1} that results from applying action a_t, typically conditioned on the observation history o_{0:t}.

The model can be a symbolic dynamics equation, a neural dynamics model, or a diffusion-based video predictor. Observations may be RGB or RGB-D images, point clouds, or proprioceptive states; predicted states may be object poses, keypoints, latent states, or other task-relevant variables.
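As a concrete instance of the symbolic case, the sketch below writes a world model as an explicit dynamics equation x_{t+1} = f(x_t, a_t) for a point mass. The point-mass system, the step size `DT`, and the function names are assumptions for illustration. The example also assumes a Markov state, so it does not condition on the full observation history.

```python
from typing import List, Sequence, Tuple

State = Tuple[float, float]  # (position, velocity) of a hypothetical point mass

DT = 0.1  # assumed integration step in seconds


def symbolic_world_model(x_t: State, a_t: float) -> State:
    """Predict x_{t+1} from x_t and a_t, where a_t is an acceleration (explicit Euler step)."""
    position, velocity = x_t
    return (position + velocity * DT, velocity + a_t * DT)


def rollout(x_0: State, actions: Sequence[float]) -> List[State]:
    """Imagine a future trajectory by applying the model once per action."""
    states = [x_0]
    for a_t in actions:
        states.append(symbolic_world_model(states[-1], a_t))
    return states


# Imagined trajectory under ten steps of constant unit acceleration.
imagined = rollout((0.0, 0.0), [1.0] * 10)
```

A neural dynamics model or a video predictor would fill the same role as `symbolic_world_model`, with the hand-written equation replaced by a learned function.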

Figure 4. A world model predicts future observations or states from observation history and an action.

Figure 5. World models range from symbolic equations to neural dynamics and diffusion video predictors.

From prediction to action

World Action Model

A world action model is a policy in the embodied AI framework above. It extends world modeling by explicitly associating predicted future observations or states with the actions that realize them.


Predictive modeling choices

🧭 World Model Design Space

We first divide world models into two formulations according to the space in which prediction is performed: observation-space world models and state-space world models.

🔮 Observation-Space World Models

These models predict future observations directly, such as RGB images, RGB-D frames, or point clouds.

🧩 State-Space World Models

These models first abstract observations into a state representation, then predict how that state evolves.

🔮 Observation-Space World Models

Classification criteria: observation explicitness and action abstraction.

For observation-space world models, we classify methods by what future observation they generate and how the conditioning action is represented. This yields a two-axis design space.

Design space of observation-space world models: future prediction is organized by observation type and action abstraction.

🧩 State-Space World Models

Classification criteria: the state representation used for prediction.

For state-space world models, we classify methods by the representation that mediates prediction. The key question is what information is retained from observations before dynamics are modeled.
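The encode-then-predict structure can be illustrated with a toy example. Everything here is a hypothetical stand-in: the observation is a 1-D intensity image, the state is a single keypoint-like coordinate (the intensity-weighted centroid), and the action shifts the object by a number of pixels.

```python
from typing import List


def encode(observation: List[float]) -> float:
    """Abstract a 1-D intensity image into one state variable: its weighted centroid.

    Assumes the image contains at least one non-zero pixel.
    """
    total = sum(observation)
    return sum(index * value for index, value in enumerate(observation)) / total


def predict_state(state: float, action: float) -> float:
    """Toy state dynamics: the action shifts the object by that many pixels."""
    return state + action


o_t = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]  # a single bright object at pixel 2
x_t = encode(o_t)                      # the state keeps only the object's location
x_next = predict_state(x_t, 2.0)       # prediction happens in state space, not pixels
```

The encoder discards everything except the object's location. That is the trade-off in question: the state is compact and physically meaningful, but information not captured by the representation cannot be predicted.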

Design space of state-space world models: these models trade raw visual fidelity for compactness, structure, and physical meaning.

From future prediction to action

🤖 World Action Models

World action models bridge language-conditioned video prediction and embodied control. Existing work differs in how explicitly it uses imagined futures during policy inference.

Taxonomy of world action model paradigms: four ways to couple visual future prediction with action generation.

Explicit plan

Imagine-then-execute

Generate visual subgoals or rollouts first, then use inverse dynamics, pose estimation, flow, or goal-conditioned policies to produce executable actions.
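A minimal sketch of the imagine-then-execute pattern, using inverse dynamics as the action extractor. The 1-D world, the `imagine_subgoal` stand-in for a video predictor, the step limit of 0.25, and the instruction format are all assumptions for illustration and are not taken from any surveyed method.

```python
def imagine_subgoal(instruction: str, o_t: float) -> float:
    """Stand-in for a visual predictor: imagine the next observation,
    one bounded step toward the target named at the end of the instruction."""
    target = float(instruction.split()[-1])
    max_step = 0.25  # assumed limit on how far one imagined step may move
    delta = max(-max_step, min(max_step, target - o_t))
    return o_t + delta


def inverse_dynamics(o_t: float, o_next: float) -> float:
    """Recover the action that carries the toy world from o_t to o_next.
    In this world the action is simply the displacement."""
    return o_next - o_t


def wam_policy(instruction: str, o_t: float) -> float:
    """Imagine first, then convert the imagined future into an executable action."""
    o_imagined = imagine_subgoal(instruction, o_t)
    return inverse_dynamics(o_t, o_imagined)


o_t = 0.0
for _ in range(10):
    o_t = o_t + wam_policy("move to 1.0", o_t)  # toy world applies the action
```

The two stages are decoupled, so the subgoal generator and the action extractor can be swapped independently, for example by replacing inverse dynamics with a goal-conditioned policy.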

Feature transfer

Video-feature-conditioned action prediction

Reuse internal representations from video prediction backbones without decoding full future frames at inference time.

Unified model

Joint video-action modeling

Learn a shared generative distribution over future observations and corresponding robot action sequences.
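One way to write such a shared distribution, reusing the symbols l, o_t, and a_t defined earlier, is given below. The prediction horizon H and the parameters θ are notation assumed here for illustration; individual methods may factorize or condition the distribution differently.

```latex
p_\theta\left( o_{t+1:t+H},\; a_{t:t+H-1} \,\middle|\, o_{0:t},\; l \right)
```

Sampling from this distribution yields an imagined observation sequence together with the action sequence that is meant to realize it.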

Training signal

Auxiliary video prediction for policy learning

Use future prediction as an auxiliary objective to shape policy representations, then remove the video branch during deployment.
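The idea can be sketched as a weighted sum of two losses. The mean-squared-error form, the `video_weight` value, and the function names are assumptions for illustration, not the objective of any particular method.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def training_loss(
    action_pred: Sequence[float],
    action_target: Sequence[float],
    frame_pred: Sequence[float],
    frame_target: Sequence[float],
    video_weight: float = 0.1,  # assumed weighting of the auxiliary term
) -> float:
    """Action loss plus a weighted auxiliary future-prediction loss (training only)."""
    return mse(action_pred, action_target) + video_weight * mse(frame_pred, frame_target)


def deployment_loss(action_pred: Sequence[float], action_target: Sequence[float]) -> float:
    """At deployment the video branch is gone, so only the action term is evaluated."""
    return mse(action_pred, action_target)


loss = training_loss([0.5], [1.0], [0.0, 0.0], [1.0, 1.0], video_weight=0.1)
```

The auxiliary term influences only the gradients seen during training, so dropping the video branch at deployment leaves the action pathway unchanged.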

Survey bibliography

📚 Resource Browser

The browser covers the deduplicated references in the survey and follows the taxonomy used in the paper. Some works are cross-listed when they appear in both world-model and world-action-model sections.

Feedback and updates

💬 Contribute to this survey

If you have suggestions for improving the taxonomy, know of works that should be included, or would like to discuss world models and world action models, please contact us.

12433017@mail.sustech.edu.cn

Cite this survey

✍️ Citation

The paper is currently represented by the local LaTeX source and compiled PDF. Update venue and publication metadata here as the manuscript changes.

@article{zhang2026worldactionmodels,
  title   = {From World Models to World Action Models: A Concise Tutorial for Robotics},
  author  = {Zhang, Xiaoxiong and Zeng, Xiong and Zhang, Wei},
  year    = {2026},
  note    = {Survey manuscript}
}