MBench
A Comprehensive Benchmark on Memory Capability for Video World Models

Shengjun Zhang (1,*), Zhang Zhang (1,2,*), Simin Huang (1), Zhenyu Tang (3), Hanyang Wang (1), Chensheng Dai (1), Min Chen (1), Yifan Li (1), Yuxin Li (1), Yingjie Chen (2), Hao Liu (2), Chen Li (2), Yueqi Duan (1,†)

(1) Tsinghua University; (2) WeChat Vision, Tencent Inc.; (3) Peking University

* Equal contribution. † Corresponding author.

Figure: Overview of MBench, with panels (a) MBench Dimension, (b) Text-Conditioned Models, and (c) Action-Conditioned Models. MBench evaluates memory capability through a three-level taxonomy and reports text-conditioned and action-conditioned model behavior across long-horizon consistency dimensions.

Abstract

Recent video-based world models can synthesize high-fidelity visual sequences, but a fundamental gap remains between visually plausible generation and the functional requirements of a world model. A reliable world model must maintain a stable and reasonable internal state across extended temporal horizons, camera motion, occlusion, and interaction.

MBench is a comprehensive benchmark for quantifying memory capability in video world models. It decomposes memory into three complementary dimensions: entity consistency, environment consistency, and causal consistency. These dimensions are further refined into twelve quantifiable sub-dimensions covering object geometry and texture, human identity and appearance, spatial and rendering stability, self-evolution, and text/action-conditioned interaction.

The benchmark is built from rigorously curated real-captured long videos and evaluated with a hybrid protocol that combines rule-based quantitative metrics with VLM question-answering. Extensive evaluation of recent video continuation models and action-conditioned world models reveals systematic limitations in long-term state retention, exposing memory as a central bottleneck for building persistent, controllable, and causally coherent video world models.

MBench Quantitative Results

We evaluate eight text-conditioned continuation models and six action-conditioned world models. The table reports the memory sub-dimensions for which valid automatic measurements are available in the paper.

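| Model              | Setting | Geometry | Texture | Identity | Appearance | Epipolar | Reprojection | Lighting | Style | State | Correctness | Text  | Action |
|--------------------|---------|----------|---------|----------|------------|----------|--------------|----------|-------|-------|-------------|-------|--------|
| Memflow            | Text    | 61.72    | 56.06   | 39.48    | 51.60      | 57.95    | 20.89        | 55.06    | 30.08 | 62.75 | 72.04       | 46.31 | -      |
| Self Forcing       | Text    | 34.97    | 33.02   | 43.92    | 54.58      | 67.44    | 55.19        | 66.83    | 30.15 | 50.19 | 66.84       | 43.91 | -      |
| Skyreels V2        | Text    | 70.03    | 53.70   | 24.57    | 53.76      | 49.01    | 56.39        | 46.44    | 20.33 | 68.35 | 79.52       | 44.68 | -      |
| LongLive           | Text    | 63.57    | 55.41   | 42.51    | 55.89      | 46.68    | 27.51        | 59.44    | 25.26 | 70.32 | 74.69       | 46.97 | -      |
| LongCat-Video      | Text    | 46.96    | 43.13   | 26.56    | 52.98      | 28.28    | 9.26         | 56.10    | 27.46 | 84.17 | 87.83       | 46.25 | -      |
| Cosmos-Predict 2.5 | Text    | 51.90    | 47.31   | 16.95    | 45.42      | 9.73     | 14.68        | 55.95    | 22.66 | 83.67 | 80.81       | 45.08 | -      |
| Causal Forcing     | Text    | 62.23    | 53.36   | 42.53    | 64.37      | 18.10    | 2.88         | 57.44    | 27.48 | 64.79 | 73.10       | 44.90 | -      |
| Helios             | Text    | 79.43    | 63.70   | 31.33    | 41.64      | 24.79    | 32.46        | 41.79    | 25.26 | 58.27 | 75.08       | 43.17 | -      |
| Matrix-Game 2.0    | Action  | 14.62    | 28.99   | 1.22     | 0.94       | 14.78    | 3.08         | 38.79    | 73.78 | 10.00 | 26.40       | -     | 47.86  |
| Matrix-Game 3.0    | Action  | 44.15    | 58.22   | 42.38    | 47.91      | 61.99    | 32.86        | 62.06    | 95.17 | 37.50 | 48.80       | -     | 81.93  |
| HY-WorldPlay       | Action  | 47.12    | 68.54   | 52.46    | 66.58      | 83.86    | 68.17        | 82.67    | 98.23 | 49.50 | 62.40       | -     | 85.69  |
| Yume-1.5           | Action  | 60.96    | 49.99   | 17.41    | 40.57      | 51.86    | 24.55        | 51.21    | 92.05 | 97.90 | 95.00       | -     | 62.20  |
| Lingbot-World      | Action  | 33.20    | 44.54   | 11.57    | 33.53      | 22.12    | 7.57         | 40.06    | 85.87 | 96.00 | 89.40       | -     | 63.32  |
| Infinite-World     | Action  | 35.70    | 61.88   | 23.08    | 46.85      | 74.04    | 61.51        | 62.63    | 96.87 | 48.00 | 78.40       | -     | 86.37  |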

Dashes indicate metrics that are not evaluated for that model setting. Higher values indicate stronger memory capability under the corresponding sub-dimension.

Memory Capability Dimensions

MBench uses a hierarchical taxonomy that moves from persistent entities, to stable environments, to the causal rules that govern state evolution and interaction.

Entity Consistency

Tests whether individual objects and human subjects keep their persistent identity and attributes across long rollouts.

  • Object geometry consistency
  • Object texture consistency
  • Human identity consistency
  • Human appearance consistency

Environment Consistency

Measures whether the spatial stage and rendering properties of the world remain stable as viewpoints change.

  • Epipolar geometry consistency
  • Reprojection consistency
  • Lighting consistency
  • Rendering style consistency

Causal Consistency

Evaluates whether generated worlds follow established physical and semantic rules through hidden intervals.

  • State evolution
  • Evolution correctness
  • Text-conditioned interaction
  • Action-conditioned interaction
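
For readers who want to iterate over the taxonomy programmatically, the three dimensions and twelve sub-dimensions listed above can be written down as a small mapping. This is only an illustrative sketch: the names `MBENCH_TAXONOMY` and `sub_dimensions` are hypothetical and are not taken from the MBench code base.

```python
from typing import Dict, List, Tuple

# Taxonomy exactly as listed in this section: 3 dimensions x 4 sub-dimensions.
# The variable name and dict layout are illustrative assumptions.
MBENCH_TAXONOMY: Dict[str, List[str]] = {
    "Entity Consistency": [
        "Object geometry consistency",
        "Object texture consistency",
        "Human identity consistency",
        "Human appearance consistency",
    ],
    "Environment Consistency": [
        "Epipolar geometry consistency",
        "Reprojection consistency",
        "Lighting consistency",
        "Rendering style consistency",
    ],
    "Causal Consistency": [
        "State evolution",
        "Evolution correctness",
        "Text-conditioned interaction",
        "Action-conditioned interaction",
    ],
}


def sub_dimensions(taxonomy: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Flatten the taxonomy into (dimension, sub_dimension) pairs."""
    return [(dim, sub) for dim, subs in taxonomy.items() for sub in subs]


if __name__ == "__main__":
    for dim, sub in sub_dimensions(MBENCH_TAXONOMY):
        print(f"{dim}: {sub}")
```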

Qualitative Video Comparisons by Memory Dimension

Object Consistency

Object Geometry Consistency

Check whether the geometric structure of a target object is preserved after departure-return camera motion.

Evaluation Protocol

MBench uses Trigger-Conditioned Scoring to avoid rewarding models that preserve consistency only by avoiding the memory challenge. A generated video must first enter the intended state before post-event consistency is scored.
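
The gating idea can be sketched in a few lines: consistency is averaged only over samples whose trigger was verified, and the fraction of verified samples is tracked separately. The `Sample` fields and the function name below are hypothetical, and the sketch assumes consistency scores in [0, 1]; the actual MBench data structures may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Sample:
    triggered: bool      # hypothetical: did verification confirm the memory-triggering event?
    consistency: float   # hypothetical: post-event consistency score in [0, 1]


def trigger_conditioned(samples: List[Sample]) -> Tuple[float, Optional[float]]:
    """Return (trigger_rate, mean consistency over triggered samples).

    The mean is None when no sample triggered, so a model cannot earn
    a consistency score by never entering the intended state.
    """
    if not samples:
        return 0.0, None
    valid = [s.consistency for s in samples if s.triggered]
    trigger_rate = len(valid) / len(samples)
    mean_consistency = sum(valid) / len(valid) if valid else None
    return trigger_rate, mean_consistency


if __name__ == "__main__":
    batch = [Sample(True, 0.9), Sample(False, 1.0), Sample(True, 0.7)]
    # The untriggered sample's perfect consistency is ignored.
    print(trigger_conditioned(batch))
```

Note that the second sample has perfect consistency but never triggered the event, so it contributes nothing to the mean and only lowers the trigger rate.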

01. Curate Long Videos: Collect real-captured long videos with occlusion, departure-return camera motion, human-object interaction, and physical state transitions.

02. Generate Conditions: Create multi-segment text continuations and exit-wait-reenter action sequences that explicitly require state retention.

03. Verify Triggers: Use VLM verification to confirm that the generated video actually executes the memory-triggering event.

04. Score Reliability: Evaluate consistency only on valid samples, then aggregate the post-event consistency scores with the M-Score formulation.

\[
\text{M-Score}_k = \frac{2 \, S_k^{\mathrm{rel}} \, C_k^{\mathrm{trig}}}{S_k^{\mathrm{rel}} + C_k^{\mathrm{trig}}}
\]
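
The M-Score has the form of a harmonic mean of its two terms, so it stays low whenever either term is low. The sketch below assumes, as an interpretation not spelled out on this page, that the "rel" term is the consistency score on valid samples and the "trig" term is the trigger success rate, both on the same scale. The function name `m_score` and the zero-denominator convention are illustrative choices.

```python
def m_score(s_rel: float, c_trig: float) -> float:
    """Harmonic mean of s_rel and c_trig, matching the M-Score form.

    Returns 0.0 when both inputs are zero (an assumed convention that
    avoids division by zero).
    """
    if s_rel < 0 or c_trig < 0:
        raise ValueError("scores must be non-negative")
    total = s_rel + c_trig
    if total == 0:
        return 0.0
    return 2.0 * s_rel * c_trig / total


if __name__ == "__main__":
    print(round(m_score(0.8, 0.5), 3))  # 0.615
    print(m_score(0.9, 0.0))            # 0.0: high consistency without triggering earns nothing
```

Compared with an arithmetic mean, which would give 0.45 for the second case, the harmonic form removes any reward for preserving consistency while skipping the memory challenge.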

Dataset & Evaluation Kit

MBench aggregates real-world long videos from DL3DV, Tanks and Temples, OpenHumanVID, SpatialVID, and Physics-aware-video. These sources cover indoor and outdoor environments, human-object interactions, dynamic camera motion, and physical state transitions, with video durations ranging from seconds to minutes.

A VLM is used to select clips that pose meaningful memory challenges for entity consistency, environment consistency, and causal consistency. For text-conditioned continuation, each video is converted into a structured scene description and split into five semantically coherent segments with camera-control instructions. For action-conditioned models, MBench adopts an exit-and-reenter paradigm: the camera leaves the target entity, waits while the target is invisible, and then follows the reverse trajectory back to the initial view.
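
The exit-and-reenter paradigm can be illustrated with a discrete action vocabulary: the return path is the exit path reversed, with each action replaced by its inverse. The action names, the `noop` wait token, and the function name below are hypothetical; real action-conditioned models define their own control spaces (for example continuous camera poses).

```python
from typing import Dict, List

# Hypothetical discrete controls and their inverses.
INVERSE: Dict[str, str] = {
    "forward": "backward",
    "backward": "forward",
    "left": "right",
    "right": "left",
    "turn_left": "turn_right",
    "turn_right": "turn_left",
}
WAIT = "noop"  # hypothetical token for holding the camera still


def exit_wait_reenter(exit_actions: List[str], wait_steps: int) -> List[str]:
    """Build an exit -> wait -> reverse-trajectory action sequence."""
    if wait_steps < 0:
        raise ValueError("wait_steps must be non-negative")
    # Undo the exit path: last action first, each replaced by its inverse.
    reenter = [INVERSE[action] for action in reversed(exit_actions)]
    return list(exit_actions) + [WAIT] * wait_steps + reenter


if __name__ == "__main__":
    print(exit_wait_reenter(["forward", "turn_left"], wait_steps=2))
    # ['forward', 'turn_left', 'noop', 'noop', 'turn_right', 'backward']
```

Because the final pose should match the initial view, any difference between the first and last frames can be attributed to the model's memory rather than to a change in viewpoint.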

The evaluation kit implements specialized metrics for the twelve sub-dimensions, including SAM 2 masks, DINOv2 features, ArcFace identity tracks, DA3 camera geometry, CIELAB lighting statistics, Gram matrix style distance, OpenCLIP text-video alignment, 6-DoF action alignment, and VLM-based causal scoring.
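
As one concrete example from this list, a Gram matrix style distance compares channel-to-channel feature correlations between two frames. The sketch below is a generic, dependency-free version under stated assumptions: features are given as C channels of N spatial activations each, the Gram matrix is normalized by N, and the distance is the mean squared difference. Which feature extractor and normalization MBench uses is not specified here, and the function names are illustrative.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def gram_matrix(features: Sequence[Sequence[float]]) -> Matrix:
    """Compute the C x C Gram matrix of a C x N feature map, normalized by N."""
    if not features or not features[0]:
        raise ValueError("features must be a non-empty C x N array")
    n = len(features[0])
    if any(len(row) != n for row in features):
        raise ValueError("all channels must have the same length")
    return [
        [sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
        for fi in features
    ]


def style_distance(feat_a: Sequence[Sequence[float]],
                   feat_b: Sequence[Sequence[float]]) -> float:
    """Mean squared difference between the two Gram matrices."""
    ga, gb = gram_matrix(feat_a), gram_matrix(feat_b)
    if len(ga) != len(gb):
        raise ValueError("feature maps must have the same number of channels")
    c = len(ga)
    return sum((ga[i][j] - gb[i][j]) ** 2 for i in range(c) for j in range(c)) / (c * c)


if __name__ == "__main__":
    a = [[1.0, 0.0], [0.0, 1.0]]
    shuffled = [[0.0, 1.0], [1.0, 0.0]]  # same activations, spatial positions swapped
    b = [[1.0, 1.0], [0.0, 0.0]]
    print(style_distance(a, shuffled))  # 0.0
    print(style_distance(a, b))         # 0.125
```

The shuffled example shows why Gram statistics suit rendering-style checks: they ignore where activations occur spatially and respond only to how channels co-activate.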

Figure: Prompt suite statistics from the paper, showing the prompt distribution, text prompt cloud, and action distribution.

Quick Start

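# Clone the repository and enter it
git clone https://github.com/study-overflow/MBench.git
cd MBench

# Install the Python dependencies
pip install -r requirements.txt

# Score the predictions in outputs/model_name with every metric
python evaluate.py \
  --pred_dir outputs/model_name \
  --split benchmark \
  --metrics all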
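
When several models need to be evaluated, the command above can be wrapped in a small driver. The sketch uses only the flags shown in the Quick Start. The helper names and the example directory names are hypothetical, and the script is assumed to be run from the MBench repository root.

```python
import subprocess
import sys
from typing import List, Sequence


def build_eval_command(pred_dir: str, split: str = "benchmark",
                       metrics: str = "all") -> List[str]:
    """Assemble the evaluate.py invocation shown in the Quick Start."""
    return [
        sys.executable, "evaluate.py",
        "--pred_dir", pred_dir,
        "--split", split,
        "--metrics", metrics,
    ]


def evaluate_models(pred_dirs: Sequence[str], dry_run: bool = True) -> List[List[str]]:
    """Print (dry run) or execute one evaluation command per prediction directory."""
    commands = [build_eval_command(d) for d in pred_dirs]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # Raises CalledProcessError if evaluate.py exits with a non-zero status.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical output directories; replace with your own.
    evaluate_models(["outputs/model_a", "outputs/model_b"], dry_run=True)
```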

Experimental Findings

Memory Remains a Bottleneck

Many models produce visually plausible long videos but fail to maintain persistent world state after occlusion, camera departure, or long-horizon continuation.

Spatial Reasoning Is Fragile

Models can preserve local appearance while losing the underlying 3D layout, producing high visual coherence but weak epipolar and reprojection consistency.

Action Control Is Uneven

Action-conditioned world models differ substantially in their ability to execute controlled motion, return to prior viewpoints, and preserve hidden state.

Causal Memory Is Hardest

Out-of-view state evolution exposes whether a model truly simulates causal progress or simply resets to a plausible but unrelated later frame.

In the human alignment study, the current annotation set contains 4,459 records from 22 annotators, covering binary trigger judgments and pairwise memory-consistency comparisons across all 14 evaluated models and 12 dimensions. Entity metrics align strongly with human preferences on text-conditioned continuation, while spatial epipolar and reprojection metrics are especially predictive for action-conditioned rollouts.
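
One simple way to quantify alignment between an automatic metric and pairwise human judgments is the fraction of comparisons in which the metric prefers the same video as the annotator. The sketch below shows that statistic only as an illustration. The statistic actually used in the paper is not specified on this page, and the record layout and function name are hypothetical.

```python
from typing import List, Tuple

# Each record: (metric score of video A, metric score of video B, human choice "a" or "b").
Comparison = Tuple[float, float, str]


def pairwise_agreement(comparisons: List[Comparison]) -> float:
    """Fraction of untied comparisons where the metric agrees with the human choice."""
    agree = 0
    decided = 0
    for score_a, score_b, human in comparisons:
        if human not in ("a", "b"):
            raise ValueError(f"unexpected human choice: {human!r}")
        if score_a == score_b:
            continue  # the metric expresses no preference; skip the pair
        decided += 1
        metric_choice = "a" if score_a > score_b else "b"
        agree += metric_choice == human
    if decided == 0:
        raise ValueError("no comparisons with a metric preference")
    return agree / decided


if __name__ == "__main__":
    records = [
        (0.9, 0.4, "a"),  # metric and human both prefer A
        (0.2, 0.7, "a"),  # metric prefers B, human prefers A
        (0.5, 0.8, "b"),  # metric and human both prefer B
        (0.6, 0.6, "b"),  # metric tie, skipped
    ]
    print(pairwise_agreement(records))  # 2 of 3 decided pairs agree
```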

BibTeX

If you find MBench useful for your research, please consider citing our paper.

@article{zhang2026mbench,
  title   = {MBench: A Comprehensive Benchmark on Memory Capability for Video World Models},
  author  = {Zhang, Shengjun and Zhang, Zhang and Huang, Simin and Tang, Zhenyu and Wang, Hanyang and Dai, Chensheng and Chen, Min and Li, Yifan and Li, Yuxin and Chen, Yingjie and Liu, Hao and Li, Chen and Duan, Yueqi},
  journal = {arXiv preprint},
  year    = {2026}
}