Write a PREreview

Predicting the World via Video Representation: A Comprehensive Survey on Video World Models

by Jiaxin Yan, Chaoning Zhang, Xudong Wang, Pengcheng Zheng, Ya Wen, Qigan Sun, Jiaxin Huang, Shuxu Chen, Yang Yang, and Hyundong Shin

Published: May 7, 2026
Server: Preprints.org
DOI: 10.20944/preprints202605.0435.v1

Video world models have emerged as a critical framework, offering a powerful approach to modeling dynamic environments through the lens of video data and serving as a key tool for understanding and predicting complex systems. While prior papers have focused on specific domains such as 3D modeling, autonomous driving, and robotics, they have largely overlooked the growing importance of the video modality in the development of future world models. These papers often concentrate on particular data representations, failing to account for how video-based representations can bridge the gap between perception, prediction, and decision-making in intelligent systems. This paper aims to fill this gap by providing a standardized and systematic classification of video world models. We introduce a comprehensive taxonomy that distinguishes between implicit state deduction, which focuses on learning compact latent representations, and explicit visual modeling, which emphasizes frame-level video processing. Additionally, we provide an in-depth review of experimental setups, specific applications, and open problems. By focusing on video world models, this paper offers a unified reference that highlights their critical role in the future of world modeling research.

You can write a PREreview of Predicting the World via Video Representation: A Comprehensive Survey on Video World Models. A PREreview is a review of a preprint and can range from a few sentences to a lengthy report, similar to a peer-review report organized by a journal.

Before you start

We will ask you to log in with your ORCID iD. If you don't have an iD, you can create one.