Video Generators are Robot Policies  

Columbia University, Toyota Research Institute

We analyze visuomotor policy learning as action decoding from video generation, paving the way for more scalable and data-efficient robot learning.

Abstract

Despite tremendous progress in dexterous manipulation, current visuomotor policies remain fundamentally limited by two challenges: they struggle to generalize under perceptual or behavioral distribution shifts, and their performance is constrained by the size of human demonstration data. In this paper, we use video generation as a proxy for robot policy learning to address both limitations simultaneously. We propose Video Policy, a modular framework that combines video and action generation and can be trained end-to-end. Our results demonstrate that learning to generate videos of robot behavior allows policies to be extracted with minimal demonstration data, significantly improving robustness and sample efficiency. Our method shows strong generalization to unseen objects, backgrounds, and tasks, both in simulation and the real world. We further highlight that task success is closely tied to the generated video, with action-free video data providing critical benefits for generalizing to novel tasks. By leveraging large-scale video generative models, we achieve superior performance compared to traditional behavior cloning, paving the way for more scalable and data-efficient robot policy learning.

Action-Free Video Training

Learning Curves
Generalization to tasks with no policy supervision by capitalizing on action-free video data. Both our behavior-cloning head and the baseline DP are trained with action labels for only the 12 tasks on the left, while our video generation model additionally has access to action-free videos for all 24 tasks. The upper bounds on the right correspond to models trained with full action supervision for comparison.
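As a rough illustration of this data mixture, the sketch below uses toy PyTorch stand-ins (the encoder, frame decoder, action head, shapes, and regression losses are assumptions for illustration, not the paper's architecture): the video-generation objective is computed for every clip, while the action objective is added only when a clip comes with action labels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for illustration only (not the paper's architecture):
# an encoder playing the role of the video generator's backbone, a frame
# decoder for the video-generation objective, and an action head.
C, T, H, W = 3, 8, 32, 32
encoder = nn.Sequential(nn.Flatten(), nn.Linear(C * T * H * W, 256), nn.ReLU())
frame_decoder = nn.Linear(256, C * T * H * W)  # predicts future frames
action_head = nn.Linear(256, 7)                # e.g. a 7-DoF action target

params = [*encoder.parameters(), *frame_decoder.parameters(), *action_head.parameters()]
opt = torch.optim.Adam(params, lr=1e-4)

def training_step(frames, future_frames, actions=None):
    """One step over a clip; actions=None marks an action-free clip."""
    feats = encoder(frames)
    # The video objective is applied to every clip, labeled or not.
    loss = F.mse_loss(frame_decoder(feats), future_frames.flatten(1))
    # The action objective is added only for clips from action-labeled tasks;
    # the remaining tasks contribute video supervision alone.
    if actions is not None:
        loss = loss + F.mse_loss(action_head(feats), actions)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Usage: one action-labeled clip and one action-free clip.
frames, future = torch.randn(4, C, T, H, W), torch.randn(4, C, T, H, W)
training_step(frames, future, actions=torch.randn(4, 7))  # supervised task
training_step(frames, future, actions=None)               # action-free video
```

The point of the sketch is only the control flow: action-free clips still shape the shared video representation that the action head later decodes from.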

Example Tasks with Action-Free Video Training Only

Robot Execution (Top) + Generated Video (Bottom)

Walkthrough Video

Video Generation as a Policy Learning Objective

2-stage training first optimizes for video generation, then freezes the video U-Net and trains an action head, whereas joint training optimizes both objectives simultaneously. Notably, 2-stage training achieves significantly higher success rates, indicating that video generation serves as a more general learning objective than action prediction.

Learning Curves
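To make the two recipes concrete, here is a minimal sketch using the same kind of toy PyTorch stand-ins as above (again, the modules, shapes, and losses are illustrative assumptions rather than the actual video U-Net and action head): 2-stage training runs the video objective first, then freezes the backbone and optimizes only the action head, whereas joint training would sum both losses under a single optimizer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins, as before: a backbone in place of the video U-Net, a frame
# decoder for the video objective, and an action head for action decoding.
C, T, H, W = 3, 8, 32, 32
backbone = nn.Sequential(nn.Flatten(), nn.Linear(C * T * H * W, 256), nn.ReLU())
frame_decoder = nn.Linear(256, C * T * H * W)
action_head = nn.Linear(256, 7)

def video_step(opt, frames, future_frames):
    """Stage 1: optimize the video-generation objective only."""
    loss = F.mse_loss(frame_decoder(backbone(frames)), future_frames.flatten(1))
    opt.zero_grad()
    loss.backward()
    opt.step()

def action_step(opt, frames, actions):
    """Stage 2: optimize the action objective with the video model frozen."""
    loss = F.mse_loss(action_head(backbone(frames)), actions)
    opt.zero_grad()
    loss.backward()
    opt.step()

frames, future = torch.randn(4, C, T, H, W), torch.randn(4, C, T, H, W)

# Stage 1: train the video model.
opt1 = torch.optim.Adam([*backbone.parameters(), *frame_decoder.parameters()], lr=1e-4)
video_step(opt1, frames, future)

# Stage 2: freeze the video model, then train only the action head.
for p in [*backbone.parameters(), *frame_decoder.parameters()]:
    p.requires_grad_(False)
opt2 = torch.optim.Adam(action_head.parameters(), lr=1e-4)
action_step(opt2, frames, torch.randn(4, 7))

# Joint training, by contrast, would sum both losses at every step and update
# all three modules with a single optimizer.
```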

Success Rate vs Video Prediction Horizon

We plot average task success rates, separating tasks with and without distribution shifts between training and evaluation environments, as a function of the model's video prediction horizon. While longer-term video prediction consistently improves performance, its impact is more pronounced on tasks that demand stronger generalization. These results highlight that learning accurate environment dynamics is critical for achieving generalization in policy learning.

Learning Curves
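For readers who want to reproduce this kind of aggregation, the snippet below shows one plausible way to compute the plotted quantity from per-rollout outcomes; the task names, horizons, and outcomes are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rollout records: (task, video prediction horizon,
# distribution shift at evaluation?, success?). Values are made up.
rollouts = [
    ("pick_block", 4,  False, True),
    ("pick_block", 16, False, True),
    ("pour_cup",   4,  True,  False),
    ("pour_cup",   16, True,  True),
]

# Average success rate per (shift condition, horizon), i.e. one point
# per curve in a plot like the one above.
totals = defaultdict(lambda: [0, 0])  # (shift, horizon) -> [successes, trials]
for _task, horizon, shift, success in rollouts:
    totals[(shift, horizon)][0] += int(success)
    totals[(shift, horizon)][1] += 1

for (shift, horizon), (succ, n) in sorted(totals.items()):
    label = "dist. shift" if shift else "in-distribution"
    print(f"{label:16s} horizon={horizon:3d}  success rate={succ / n:.2f}")
```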

Real-World Evaluation

Evaluation
We evaluate our model's generalization across object location, object appearance, and background appearance. Success rates are computed over 10 rollouts per task, and our method demonstrates strong robustness across all of these real-world settings.

Real-World Experiments with Varied Object Location

Robot Execution (Top) + Generated Video (Bottom)

Real-World Experiments with Unseen Objects

Robot Execution (Top) + Generated Video (Bottom)

Real-World Experiments with Unseen Background

Robot Execution (Top) + Generated Video (Bottom)

BibTeX

@misc{liang2024dreamitate,
  title={Dreamitate: Real-World Visuomotor Policy Learning via Video Generation}, 
  author={Junbang Liang and Ruoshi Liu and Ege Ozguroglu and Sruthi Sudhakar and Achal Dave and Pavel Tokmakov and Shuran Song and Carl Vondrick},
  year={2024},
  eprint={2406.16862},
  archivePrefix={arXiv},
  primaryClass={cs.RO}
}
Code for the website is inherited from Nerfies.