MBench

Figure: **Overview of MBench.** MBench evaluates memory capability through a three-level taxonomy and reports text-conditioned and action-conditioned model behavior across long-horizon consistency dimensions.

Abstract

Recent video-based world models can synthesize high-fidelity visual sequences, but a fundamental gap remains between visually plausible generation and the functional requirements of a world model. A reliable world model must maintain a stable and reasonable internal state across extended temporal horizons, camera motion, occlusion, and interaction.

MBench is a comprehensive benchmark for quantifying memory capability in video world models. It decomposes memory into three complementary dimensions: entity consistency, environment consistency, and causal consistency. These dimensions are further refined into twelve quantifiable sub-dimensions covering object geometry and texture, human identity and appearance, spatial and rendering stability, self-evolution, and text/action-conditioned interaction.

The benchmark is built from rigorously curated real-captured long videos and evaluated with a hybrid protocol that combines rule-based quantitative metrics with VLM question-answering. Extensive evaluation of recent video continuation models and action-conditioned world models reveals systematic limitations in long-term state retention, exposing memory as a central bottleneck for building persistent, controllable, and causally coherent video world models.

MBench Quantitative Results

We evaluate eight text-conditioned continuation models and six action-conditioned world models. The table reports the memory sub-dimensions for which valid automatic measurements are available in the paper.

Model	Setting	Geometry	Texture	Identity	Appearance	Epipolar	Reprojection	Lighting	Style	State	Correctness	Text	Action
Memflow	Text	61.72	56.06	39.48	51.60	57.95	20.89	55.06	30.08	62.75	72.04	46.31	-
Self Forcing	Text	34.97	33.02	43.92	54.58	67.44	55.19	66.83	30.15	50.19	66.84	43.91	-
Skyreels V2	Text	70.03	53.70	24.57	53.76	49.01	56.39	46.44	20.33	68.35	79.52	44.68	-
LongLive	Text	63.57	55.41	42.51	55.89	46.68	27.51	59.44	25.26	70.32	74.69	46.97	-
LongCat-Video	Text	46.96	43.13	26.56	52.98	28.28	9.26	56.10	27.46	84.17	87.83	46.25	-
Cosmos-Predict 2.5	Text	51.90	47.31	16.95	45.42	9.73	14.68	55.95	22.66	83.67	80.81	45.08	-
Causal Forcing	Text	62.23	53.36	42.53	64.37	18.10	2.88	57.44	27.48	64.79	73.10	44.90	-
Helios	Text	79.43	63.70	31.33	41.64	24.79	32.46	41.79	25.26	58.27	75.08	43.17	-
Matrix-Game 2.0	Action	14.62	28.99	1.22	0.94	14.78	3.08	38.79	73.78	10.00	26.40	-	47.86
Matrix-Game 3.0	Action	44.15	58.22	42.38	47.91	61.99	32.86	62.06	95.17	37.50	48.80	-	81.93
HY-WorldPlay	Action	47.12	68.54	52.46	66.58	83.86	68.17	82.67	98.23	49.50	62.40	-	85.69
Yume-1.5	Action	60.96	49.99	17.41	40.57	51.86	24.55	51.21	92.05	97.90	95.00	-	62.20
Lingbot-World	Action	33.20	44.54	11.57	33.53	22.12	7.57	40.06	85.87	96.00	89.40	-	63.32
Infinite-World	Action	35.70	61.88	23.08	46.85	74.04	61.51	62.63	96.87	48.00	78.40	-	86.37

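Dashes indicate metrics that are not evaluated for that model setting. Higher values indicate stronger memory capability under the corresponding sub-dimension. The twelve metric columns abbreviate the sub-dimensions listed under Memory Capability Dimensions, in the same order (Geometry is object geometry consistency, State is state evolution, and so on).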
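
As a small illustration of how the table can be read programmatically, the sketch below transcribes the rows above into CSV text and looks up the top-scoring model per column within one setting. The helper names `load_results` and `best_per_column` and the CSV transcription are assumptions made for this example; they are not part of the MBench evaluation kit. Dashes are treated as missing values.

```python
import csv
import io

# Rows transcribed from the results table above; "-" marks a metric
# that is not evaluated for that setting.
RESULTS_CSV = """Model,Setting,Geometry,Texture,Identity,Appearance,Epipolar,Reprojection,Lighting,Style,State,Correctness,Text,Action
Memflow,Text,61.72,56.06,39.48,51.60,57.95,20.89,55.06,30.08,62.75,72.04,46.31,-
Self Forcing,Text,34.97,33.02,43.92,54.58,67.44,55.19,66.83,30.15,50.19,66.84,43.91,-
Skyreels V2,Text,70.03,53.70,24.57,53.76,49.01,56.39,46.44,20.33,68.35,79.52,44.68,-
LongLive,Text,63.57,55.41,42.51,55.89,46.68,27.51,59.44,25.26,70.32,74.69,46.97,-
LongCat-Video,Text,46.96,43.13,26.56,52.98,28.28,9.26,56.10,27.46,84.17,87.83,46.25,-
Cosmos-Predict 2.5,Text,51.90,47.31,16.95,45.42,9.73,14.68,55.95,22.66,83.67,80.81,45.08,-
Causal Forcing,Text,62.23,53.36,42.53,64.37,18.10,2.88,57.44,27.48,64.79,73.10,44.90,-
Helios,Text,79.43,63.70,31.33,41.64,24.79,32.46,41.79,25.26,58.27,75.08,43.17,-
Matrix-Game 2.0,Action,14.62,28.99,1.22,0.94,14.78,3.08,38.79,73.78,10.00,26.40,-,47.86
Matrix-Game 3.0,Action,44.15,58.22,42.38,47.91,61.99,32.86,62.06,95.17,37.50,48.80,-,81.93
HY-WorldPlay,Action,47.12,68.54,52.46,66.58,83.86,68.17,82.67,98.23,49.50,62.40,-,85.69
Yume-1.5,Action,60.96,49.99,17.41,40.57,51.86,24.55,51.21,92.05,97.90,95.00,-,62.20
Lingbot-World,Action,33.20,44.54,11.57,33.53,22.12,7.57,40.06,85.87,96.00,89.40,-,63.32
Infinite-World,Action,35.70,61.88,23.08,46.85,74.04,61.51,62.63,96.87,48.00,78.40,-,86.37
"""

ID_COLUMNS = ("Model", "Setting")


def load_results(text=RESULTS_CSV):
    """Parse the table into a list of dicts, mapping '-' to None."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {key: raw[key] for key in ID_COLUMNS}
        for key, value in raw.items():
            if key not in ID_COLUMNS:
                row[key] = None if value == "-" else float(value)
        rows.append(row)
    return rows


def best_per_column(rows, setting):
    """Return {column: (model, score)} with the highest score in one setting."""
    best = {}
    for row in rows:
        if row["Setting"] != setting:
            continue
        for key, value in row.items():
            if key in ID_COLUMNS or value is None:
                continue  # skip identifiers and unevaluated metrics
            if key not in best or value > best[key][1]:
                best[key] = (row["Model"], value)
    return best


rows = load_results()
print(best_per_column(rows, "Text")["Geometry"])   # ('Helios', 79.43)
print(best_per_column(rows, "Action")["Action"])   # ('Infinite-World', 86.37)
```

Comparing within a setting rather than across the whole table keeps the text-conditioned and action-conditioned groups separate, since each group leaves one interaction column unevaluated.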

Memory Capability Dimensions

MBench uses a hierarchical taxonomy that moves from persistent entities, to stable environments, to the causal rules that govern state evolution and interaction.

Entity Consistency

Tests whether individual objects and human subjects keep their persistent identity and attributes across long rollouts.

Object geometry consistency
Object texture consistency
Human identity consistency
Human appearance consistency

Environment Consistency

Measures whether the spatial stage and rendering properties of the world remain stable as viewpoints change.

Epipolar geometry consistency
Reprojection consistency
Lighting consistency
Rendering style consistency

Causal Consistency

Evaluates whether generated worlds follow established physical and semantic rules through hidden intervals.

State evolution
Evolution correctness
Text-conditioned interaction
Action-conditioned interaction
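
The hierarchy above can be written down as a plain mapping, which is convenient when grouping per-sub-dimension scores by dimension. This is only a sketch of one possible representation: the names `MBENCH_TAXONOMY` and `dimension_of` are assumptions for illustration and do not come from the MBench code.

```python
# Taxonomy as listed above: dimension -> sub-dimensions.
MBENCH_TAXONOMY = {
    "Entity Consistency": [
        "Object geometry consistency",
        "Object texture consistency",
        "Human identity consistency",
        "Human appearance consistency",
    ],
    "Environment Consistency": [
        "Epipolar geometry consistency",
        "Reprojection consistency",
        "Lighting consistency",
        "Rendering style consistency",
    ],
    "Causal Consistency": [
        "State evolution",
        "Evolution correctness",
        "Text-conditioned interaction",
        "Action-conditioned interaction",
    ],
}


def dimension_of(sub_dimension):
    """Return the top-level dimension that contains a sub-dimension."""
    for dimension, subs in MBENCH_TAXONOMY.items():
        if sub_dimension in subs:
            return dimension
    raise KeyError(sub_dimension)


total = sum(len(subs) for subs in MBENCH_TAXONOMY.values())
print(total)                                   # 12
print(dimension_of("Lighting consistency"))    # Environment Consistency
```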

Qualitative Video Comparisons by Memory Dimension

Paired video comparisons with per-dimension scoring are provided for each memory dimension and sub-dimension. For example, the object geometry consistency comparison checks whether the geometric structure of a target object is preserved after departure-return camera motion.

Evaluation Protocol

MBench uses Trigger-Conditioned Scoring to avoid rewarding models that preserve consistency only by avoiding the memory challenge. A generated video must first enter the intended state before post-event consistency is scored.

01 Curate Long Videos

Collect real-captured long videos with occlusion, departure-return camera motion, human-object interaction, and physical state transitions.

02 Generate Conditions

Create multi-segment text continuations and exit-wait-reenter action sequences that explicitly require state retention.

03 Verify Triggers

Use VLM verification to confirm that the generated video actually executes the memory-triggering event.

04 Score Reliability

Evaluate consistency only on valid samples, then aggregate the post-event consistency scores with the M-Score formulation.

M-Score_k = 2 × S^rel_k × C^trig_k / (S^rel_k + C^trig_k)
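
The formula is a harmonic mean of its two terms, so a low value in either one pulls the score down. The sketch below shows one way the trigger-conditioned aggregation could be computed for a single sub-dimension k. It rests on assumptions that this page does not spell out: that C^trig_k is the fraction of samples whose trigger was verified, that S^rel_k is the mean consistency over those valid samples, and that both lie in [0, 1]. The function names are hypothetical.

```python
def m_score(s_rel, c_trig):
    """Harmonic mean of S^rel_k and C^trig_k, following the M-Score formula."""
    if s_rel + c_trig == 0:
        return 0.0  # assumption: define the score as 0 when both terms are 0
    return 2 * s_rel * c_trig / (s_rel + c_trig)


def trigger_conditioned_score(samples):
    """Aggregate one sub-dimension from (triggered, consistency) pairs.

    samples: list of (bool, float) where the bool says whether the
    memory-triggering event was verified and the float is a consistency
    score in [0, 1]. Both interpretations are assumptions of this sketch.
    """
    if not samples:
        return 0.0
    valid = [score for triggered, score in samples if triggered]
    c_trig = len(valid) / len(samples)                   # assumed trigger rate
    s_rel = sum(valid) / len(valid) if valid else 0.0    # assumed mean over valid samples
    return m_score(s_rel, c_trig)


# Two of four samples trigger the event; the untriggered ones look
# very consistent but are excluded from the consistency average.
samples = [(True, 0.9), (True, 0.7), (False, 0.95), (False, 0.99)]
print(round(trigger_conditioned_score(samples), 4))  # 0.6154
```

Under these assumptions, a model that stays consistent only by never executing the triggering event receives a low trigger term, and the harmonic mean keeps its overall score low regardless of how consistent the untriggered videos appear.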

Dataset & Evaluation Kit

MBench aggregates real-world long videos from DL3DV, Tanks and Temples, OpenHumanVID, SpatialVID, and Physics-aware-video. These sources cover indoor and outdoor environments, human-object interactions, dynamic camera motion, and physical state transitions, with video durations ranging from seconds to minutes.

A VLM is used to select clips that pose meaningful memory challenges for entity consistency, environment consistency, and causal consistency. For text-conditioned continuation, each video is converted into a structured scene description and split into five semantically coherent segments with camera-control instructions. For action-conditioned models, MBench adopts an exit-and-reenter paradigm: the camera leaves the target entity, waits while the target is invisible, and then follows the reverse trajectory back to the initial view.
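
The exit-and-reenter paradigm can be sketched as a sequence builder: play the exit actions, hold still while the target is out of view, then undo the exit actions in reverse order. The discrete action names, the `INVERSE` table, and the `build_exit_reenter` helper below are hypothetical and chosen only for illustration; the actual MBench action space may differ.

```python
# Hypothetical discrete camera actions and their inverses.
INVERSE = {
    "forward": "backward",
    "backward": "forward",
    "left": "right",
    "right": "left",
    "turn_left": "turn_right",
    "turn_right": "turn_left",
}


def build_exit_reenter(exit_actions, wait_steps):
    """Return exit actions, a wait phase, then the reverse trajectory."""
    wait = ["wait"] * wait_steps
    # Undo the exit path: last action first, each replaced by its inverse.
    reenter = [INVERSE[action] for action in reversed(exit_actions)]
    return list(exit_actions) + wait + reenter


print(build_exit_reenter(["turn_left", "forward"], wait_steps=2))
# ['turn_left', 'forward', 'wait', 'wait', 'backward', 'turn_right']
```

Reversing the order as well as inverting each action is what brings the camera back to the initial view, so the frames before exit and after reentry can be compared directly.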

The evaluation kit implements specialized metrics for the twelve sub-dimensions, including SAM 2 masks, DINOv2 features, ArcFace identity tracks, DA3 camera geometry, CIELAB lighting statistics, Gram matrix style distance, OpenCLIP text-video alignment, 6-DoF action alignment, and VLM-based causal scoring.
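
To make one of these metrics concrete, the following is a minimal sketch of a Gram matrix style distance between two frames, written with plain Python lists so it runs without dependencies. It is not the kit's implementation: the feature extractor, the normalization by the number of positions, the mean-squared-difference form, and the function names are all assumptions, and the way a distance is mapped to the reported scores is not specified here.

```python
def gram_matrix(features):
    """Gram matrix of a feature map.

    features: list of C channels, each a list of N activations
    (a flattened H x W map). Returns a C x C matrix normalized by N.
    """
    n = len(features[0])
    return [
        [sum(a * b for a, b in zip(ch_i, ch_j)) / n for ch_j in features]
        for ch_i in features
    ]


def style_distance(features_a, features_b):
    """Mean squared difference between the Gram matrices of two frames."""
    gram_a = gram_matrix(features_a)
    gram_b = gram_matrix(features_b)
    c = len(gram_a)
    total = sum(
        (gram_a[i][j] - gram_b[i][j]) ** 2
        for i in range(c)
        for j in range(c)
    )
    return total / (c * c)


# Toy 2-channel feature maps with 2 spatial positions each.
frame_a = [[1.0, 0.0], [0.0, 1.0]]
frame_a_shuffled = [[0.0, 1.0], [1.0, 0.0]]  # same activations, positions swapped
frame_b = [[1.0, 1.0], [0.0, 0.0]]

print(style_distance(frame_a, frame_a_shuffled))  # 0.0
print(style_distance(frame_a, frame_b))           # 0.125
```

The shuffled example gives zero distance because a Gram matrix records channel co-activation statistics and discards spatial arrangement, which is why this family of measures is associated with rendering style rather than layout.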

Figure: Prompt suite statistics from the paper, showing the prompt distribution, text prompt cloud, and action distribution.

Quick Start

git clone https://github.com/study-overflow/MBench.git
cd MBench

pip install -r requirements.txt

python evaluate.py \
  --pred_dir outputs/model_name \
  --split benchmark \
  --metrics all

Experimental Findings

Memory Remains a Bottleneck

Many models produce visually plausible long videos but fail to maintain persistent world state after occlusion, camera departure, or long-horizon continuation.

Spatial Reasoning Is Fragile

Models can preserve local appearance while losing the underlying 3D layout, producing high visual coherence but weak epipolar and reprojection consistency.

Action Control Is Uneven

Action-conditioned world models differ substantially in their ability to execute controlled motion, return to prior viewpoints, and preserve hidden state.

Causal Memory Is Hardest

Out-of-view state evolution exposes whether a model truly simulates causal progress or simply resets to a plausible but unrelated later frame.

In the human alignment study, the current annotation set contains 4,459 records from 22 annotators, covering binary trigger judgments and pairwise memory-consistency comparisons across all 14 evaluated models and 12 dimensions. Entity metrics align strongly with human preferences on text-conditioned continuation, while spatial epipolar and reprojection metrics are especially predictive for action-conditioned rollouts.

BibTeX

If you find MBench useful for your research, please consider citing our paper.

@article{zhang2026mbench,
  title   = {MBench: A Comprehensive Benchmark on Memory Capability for Video World Models},
  author  = {Zhang, Shengjun and Zhang, Zhang and Huang, Simin and Tang, Zhenyu and Wang, Hanyang and Dai, Chensheng and Chen, Min and Li, Yifan and Li, Yuxin and Chen, Yingjie and Liu, Hao and Li, Chen and Duan, Yueqi},
  journal = {arXiv preprint},
  year    = {2026}
}
