Predicting the World via Video Representation: A Comprehensive Survey on Video World Models

by Jiaxin Yan, Chaoning Zhang, Xudong Wang, Pengcheng Zheng, Ya Wen, Qigan Sun, Jiaxin Huang, Shuxu Chen, Yang Yang, and Hyundong Shin

Posted: May 7, 2026
Server: Preprints.org
DOI: 10.20944/preprints202605.0435.v1

Video world models have emerged as a critical framework, offering a powerful approach to modeling dynamic environments through the lens of video data and serving as a key tool for understanding and predicting complex systems. While prior papers have focused on specific domains such as 3D modeling, autonomous driving, and robotics, they have largely overlooked the growing importance of the video modality in the development of future world models. These papers often concentrate on particular data representations, failing to account for how video-based representations can bridge the gap between perception, prediction, and decision-making in intelligent systems. This paper aims to fill this gap by providing a standardized and systematic classification of video world models. We introduce a comprehensive taxonomy that distinguishes between implicit state deduction, which focuses on learning compact latent representations, and explicit visual modeling, which emphasizes frame-level video processing. Additionally, we provide an in-depth review of experimental setups, specific applications, and open problems. By focusing on video world models, this paper offers a unified reference that highlights their critical role in the future of world modeling research.
