GVF-TAPE Overview. Our approach combines generative visual foresight with task-agnostic pose estimation to enable scalable robotic manipulation across diverse table-top tasks. Given an RGB observation and a task description, GVF-TAPE predicts future RGB-D frames via a generative foresight model. A decoupled pose estimator then extracts end-effector poses from the predicted frames, translating them into executable commands via low-level controllers.
Robotic manipulation in unstructured environments requires systems that can generalize across diverse tasks while maintaining robust and reliable performance. We introduce GVF-TAPE, a closed-loop framework that combines generative visual foresight with task-agnostic pose estimation to enable scalable robotic manipulation. GVF-TAPE employs a generative video model to predict future RGB-D frames from a single side-view RGB image and a task description, offering visual plans that guide robot actions. A decoupled pose estimation model then extracts end-effector poses from the predicted frames, translating them into executable commands via low-level controllers. By iteratively integrating video foresight and pose estimation in a closed loop, GVF-TAPE achieves real-time, adaptive manipulation across a broad range of tasks. Extensive experiments in both simulation and real-world settings demonstrate that our approach reduces reliance on task-specific action data and generalizes effectively, providing a practical and scalable solution for intelligent robotic systems.
Framework Overview. GVF-TAPE first generates a future RGB-D video conditioned on the current RGB observation and task description. A transformer-based pose estimation model then extracts the end-effector pose from each predicted frame and sends it to a low-level controller for execution. After completing the predicted trajectory, the system receives a new observation and repeats the process in a closed-loop manner.
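To make the closed loop concrete, the sketch below outlines the generate-estimate-execute cycle described above. The interfaces `ForesightModel`, `PoseEstimator`, and `Robot`, along with their method names, are hypothetical placeholders used only for illustration, not the released implementation.

```python
# Minimal sketch of the GVF-TAPE closed loop, assuming hypothetical
# foresight_model, pose_estimator, and robot interfaces.

class GVFTapeController:
    def __init__(self, foresight_model, pose_estimator, robot):
        self.foresight_model = foresight_model  # RGB image + text -> future RGB-D frames
        self.pose_estimator = pose_estimator    # RGB-D frame -> end-effector pose
        self.robot = robot                      # camera access + low-level controller

    def run_episode(self, task_description, max_cycles=20):
        for _ in range(max_cycles):
            rgb = self.robot.get_side_view_rgb()               # current observation
            # 1. Generative visual foresight: predict future RGB-D frames.
            rgbd_frames = self.foresight_model.predict(rgb, task_description)
            # 2. Task-agnostic pose estimation, one pose per predicted frame.
            poses = [self.pose_estimator.estimate(f) for f in rgbd_frames]
            # 3. Execute the predicted trajectory with the low-level controller.
            for pose in poses:
                self.robot.move_to(pose)
            # 4. Re-observe and replan in a closed loop.
            if self.robot.task_done(task_description):
                break
```

Because visual planning is decoupled from action execution, the loop can replan from a fresh observation after each executed trajectory, which is what enables real-time, adaptive manipulation.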
This section presents real-world experiments demonstrating the effectiveness of GVF-TAPE. In all videos, the left column shows the real-world rollout, the middle column depicts the generated RGB visual foresight, and the right column displays the generated depth map.
Our method leverages random exploration to learn a task-agnostic pose estimation model that maps generated frames to end-effector poses, without relying on action-labeled data. Because this process requires no human involvement, it is highly efficient. Below, we present a real-world random exploration video.
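Conceptually, the data collection behind this pose estimator can be sketched as follows; `robot`, `workspace_bounds`, and the uniform sampling scheme are illustrative assumptions rather than the exact procedure used in the paper.

```python
# Illustrative sketch of action-free data collection via random exploration.
# The robot interface and sampling bounds are hypothetical stand-ins.
import random

def collect_exploration_data(robot, workspace_bounds, num_samples=10000):
    """Move to randomly sampled reachable poses and record (frame, pose) pairs."""
    dataset = []
    for _ in range(num_samples):
        # Sample one value per pose dimension within the given (low, high) bounds.
        pose = [random.uniform(lo, hi) for lo, hi in workspace_bounds]
        robot.move_to(pose)                  # no human supervision required
        frame = robot.get_side_view_rgbd()   # observe the resulting state
        dataset.append((frame, pose))
    return dataset

# The collected pairs supervise a pose estimator f_theta that maps a frame to
# the end-effector pose, e.g. by minimizing a regression loss
#   loss = || f_theta(frame) - pose ||^2
```

Because the pairs are gathered by random motion alone, the estimator never sees task labels or action-annotated demonstrations, which is what makes it task-agnostic.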
We present GVF-TAPE, a real-time manipulation framework that decouples visual planning from action execution by combining generative video prediction with task-agnostic pose estimation. Unlike prior methods, GVF-TAPE learns from unlabeled videos and random exploration, removing the need for action-labeled data. This design allows robots to predict future visual outcomes and infer executable poses, enabling robust closed-loop control across diverse tasks. Experiments in both simulation and the real world show that GVF-TAPE outperforms action-supervised and video-based baselines, demonstrating the potential of label-free, foresight-driven frameworks for scalable manipulation. We hope this work encourages further research in video-guided, action-free robot learning.