World
The world is the set of all objects relevant to an embodied AI task. It consists of a robot and its environment; the environment contains both the objects of interest and the ambient environment.
Survey and tutorial map
A concise tutorial that builds from task-specific worlds and embodied policies to predictive world models, and finally to models that connect imagined futures with executable robot actions.
Architectural view
Predicting how the world evolves under actions is central to embodied AI and generative simulation. A unified architecture makes clear what a model represents, what it predicts, and how its predictions support control, planning, decision making, and policy learning.
A world configuration describes the robot and every object in its environment. An embodied AI task asks a policy to transform an initial configuration into a target one by controlling the robot. The exact world is therefore task-dependent.
For example, humanoid locomotion needs only a robot and a ground surface, whereas robotic table cleaning additionally includes dishes, furniture, and the surrounding household environment.
At each step $t$, a policy receives a language instruction $l$ and the current observation $o_t$, then outputs an action $a_t$ for the robot. The policy may be a PID controller, a model predictive controller (MPC), a vision-language-action model, or a world action model.
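The observe-then-act loop described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not an implementation from the survey: the world is a hypothetical one-dimensional robot position, the policy is a plain proportional controller (the simplest relative of the PID controller mentioned above), and the names `ProportionalPolicy`, `toy_step`, and `run_episode` are invented for this example. The instruction string is passed through only to mirror the policy interface; this toy controller ignores it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProportionalPolicy:
    """Hypothetical toy policy: drives a 1-D position toward a fixed target."""
    target: float
    gain: float = 0.5

    def act(self, instruction: str, observation: float) -> float:
        # The instruction is accepted to match the (l, o_t) -> a_t interface,
        # but this toy controller does not interpret it.
        return self.gain * (self.target - observation)


def toy_step(position: float, action: float) -> float:
    """Assumed toy dynamics: the action is a displacement added to the position."""
    return position + action


def run_episode(policy: ProportionalPolicy, instruction: str,
                initial_position: float, steps: int = 20) -> List[float]:
    """Run the policy in closed loop and return the visited positions."""
    observation = initial_position
    trajectory = [observation]
    for _ in range(steps):
        action = policy.act(instruction, observation)   # a_t from (l, o_t)
        observation = toy_step(observation, action)     # environment yields o_{t+1}
        trajectory.append(observation)
    return trajectory


if __name__ == "__main__":
    policy = ProportionalPolicy(target=1.0)
    print(run_episode(policy, "move to the target", initial_position=0.0)[-1])
```

With a gain of 0.5 the remaining error halves at every step, so the position converges toward the target. A learned policy would replace only the body of `act`; the loop itself stays the same.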
For a specified world, a world model predicts the next observation $o_{t+1}$ or state $x_{t+1}$ that results from action $a_t$, typically conditioned on the observation history $o_{0:t}$.
The model can be a symbolic dynamics equation, a neural dynamics model, or a diffusion-based video predictor. Observations may be RGB or RGB-D images, point clouds, or proprioceptive states; predicted states may be object poses, keypoints, latent states, or other task-relevant variables.
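Written out, the two prediction targets might look as follows. This is one possible notation, not the survey's own: the symbols $p_\theta$ (predictive distribution), $e_\phi$ (encoder from observations to states), and $f_\theta$ (state transition function) are assumptions introduced here for illustration.

```latex
% Predicting an observation: sample or regress the next observation directly.
\hat{o}_{t+1} \sim p_\theta\left(o_{t+1} \mid o_{0:t},\, a_t\right)

% Predicting a state: first summarize the history, then predict in state space.
x_t = e_\phi\left(o_{0:t}\right), \qquad \hat{x}_{t+1} = f_\theta\left(x_t,\, a_t\right)
```

A symbolic dynamics equation corresponds to a hand-specified, deterministic $f_\theta$, while a diffusion-based video predictor corresponds to a learned, stochastic $p_\theta$ over image observations.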
From prediction to action
A world action model is a policy in the embodied AI framework above. It extends world modeling by explicitly associating predicted future observations or states with the actions that realize them.
Predictive modeling choices
We first divide world models into two formulations according to the space in which prediction is performed: observation-space world models and state-space world models.
Observation-space world models predict future observations directly, such as RGB images, RGB-D frames, or point clouds.
State-space world models first abstract observations into a state representation, then predict how that state evolves.
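The difference between the two formulations can be illustrated with a toy example. The sketch below assumes a hypothetical world whose hidden state is a single position and whose observation is a two-value reading derived from it. All function names (`render`, `predict_observation`, `encode`, `predict_state`) are invented for this illustration, and the hand-written dynamics stand in for what would normally be learned models.

```python
from typing import List

Observation = List[float]


def render(position: float) -> Observation:
    """Assumed toy sensor: a two-value observation computed from the position."""
    return [position, 2.0 * position]


def predict_observation(observation: Observation, action: float) -> Observation:
    """Observation-space prediction: map (o_t, a_t) directly to o_{t+1}."""
    return [observation[0] + action, observation[1] + 2.0 * action]


def encode(observation: Observation) -> float:
    """State abstraction: recover the position from the observation."""
    return observation[0]


def predict_state(state: float, action: float) -> float:
    """State-space prediction: map (x_t, a_t) to x_{t+1}."""
    return state + action


if __name__ == "__main__":
    o_t = render(1.0)
    a_t = 0.5
    print(predict_observation(o_t, a_t))     # predicted next observation
    print(predict_state(encode(o_t), a_t))   # predicted next state
```

In this toy case both routes describe the same future, since rendering the predicted state reproduces the predicted observation. The state-space route predicts one number instead of two, which hints at why abstraction can help when observations are high-dimensional images or point clouds.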
Classification criteria: observation explicitness and action abstraction.
For observation-space world models, we classify methods by what future observation they generate and how the conditioning action is represented. This yields a two-axis design space.
Classification criteria: the state representation used for prediction.
For state-space world models, we classify methods by the representation that mediates prediction. The key question is what information is retained from observations before dynamics are modeled.
From future prediction to action
World action models bridge language-conditioned video prediction and embodied control. Existing work differs in how explicitly it uses imagined futures during policy inference.
Generate visual subgoals or rollouts first, then use inverse dynamics, pose estimation, flow, or goal-conditioned policies to produce executable actions.
Reuse internal representations from video prediction backbones without decoding full future frames at inference time.
Learn a shared generative distribution over future observations and corresponding robot action sequences.
Use future prediction as an auxiliary objective to shape policy representations, then remove the video branch during deployment.
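The four approaches above can be contrasted with a rough formalization. This is a hedged sketch, not notation taken from the survey: the horizon $H$, the subgoal offset $k$, the feature vector $h_t$, the loss weight $\lambda$, and the function symbols $p_\theta$, $g_\phi$, $f_\theta$, and $\pi_\phi$ are all assumptions introduced for illustration.

```latex
% 1. Imagine first, then act: generate future observations, then recover actions
%    with an inverse-dynamics or goal-conditioned module g.
\hat{o}_{t+1:t+H} \sim p_\theta\left(\cdot \mid o_{0:t},\, l\right), \qquad
a_t = g_\phi\left(o_t,\, \hat{o}_{t+k}\right), \quad 1 \le k \le H

% 2. Reuse internal representations: act from backbone features h_t
%    without decoding future frames.
h_t = f_\theta\left(o_{0:t},\, l\right), \qquad a_t = \pi_\phi\left(h_t\right)

% 3. Joint generation: model future observations and actions together.
\left(\hat{o}_{t+1:t+H},\, \hat{a}_{t:t+H-1}\right) \sim p_\theta\left(\cdot,\, \cdot \mid o_{0:t},\, l\right)

% 4. Auxiliary prediction: train with both losses, deploy only the policy.
\mathcal{L} = \mathcal{L}_{\text{action}} + \lambda\, \mathcal{L}_{\text{prediction}}
```

Read top to bottom, the imagined future plays a progressively less explicit role at inference time: it is decoded in full in the first case and absent from the deployed model in the last.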
Survey bibliography
The browser covers the deduplicated references in the survey and follows the taxonomy used in the paper. Some works are cross-listed when they appear in both world-model and world-action-model sections.
Feedback and updates
If you have suggestions, ideas for improving the taxonomy, or works that should be included, or if you would like to discuss world models and world action models, please contact us.
Cite this survey
The paper is currently represented by the local LaTeX source and compiled PDF. Update venue and publication metadata here as the manuscript changes.
@article{zhang2026worldactionmodels,
  title  = {From World Models to World Action Models: A Concise Tutorial for Robotics},
  author = {Zhang, Xiaoxiong and Zeng, Xiong and Zhang, Wei},
  year   = {2026},
  note   = {Survey manuscript}
}