Beyond FID: Measuring Intelligence, Not Just Motion. The first benchmark for evaluating cognitive abilities of World Models in Embodied Intelligence.
3 Pillars · 10 Categories · 100 Scenarios · Automatic Scoring · Part of FINAL Bench Family by VIDRAFT
| Model | WM Score | Grade | Perception | Cognition | Embodiment | FPS | Lat (ms) | Track | Brain | Motion | GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|
✓ = officially verified · est. = estimated from published data
Normalized % per pillar (100 = full marks for that pillar)
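The per-pillar normalization can be read as a simple ratio of points earned to points available. The snippet below is a minimal sketch of that arithmetic, not the official scoring code: the function name `normalize_pillar` and the 200-point pillar maximum in the example are hypothetical, since the per-pillar point totals are not stated here.

```python
def normalize_pillar(raw_points: float, max_points: float) -> float:
    """Return a pillar score as a percentage of that pillar's maximum."""
    if max_points <= 0:
        raise ValueError("max_points must be positive")
    if not 0 <= raw_points <= max_points:
        raise ValueError("raw_points must lie between 0 and max_points")
    return 100.0 * raw_points / max_points


# Hypothetical example: 150 points earned on an assumed 200-point pillar.
percent = normalize_pillar(150, 200)
print(f"{percent:.1f}%")  # prints "75.0%"
```

Normalizing this way lets pillars with different point totals share one chart axis.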
PROMETHEUS leads C04 (Threat Diff) and C05 (Emotion Escalation) by a wide margin. V-JEPA 2 is strong on C03. GAIA-3 leads C01 on the strength of its driving data.
The core differentiator of WM Bench: PROMETHEUS leads by a wide margin
Scatter plot: upper-right (high perception + high cognition) is ideal
All models are evaluated via the same text interface. No 3D environment required.
All scoring is quantitative and deterministic. Zero subjective judgment.
| Cat | Category / Description | Pillar | Type | Analogous Metric | Definition Status | Max |
|---|---|---|---|---|---|---|
Simplest entry. LLMs, rule-based systems, any API-compatible model. Max 750 pts.
wm_bench_dataset.json
Track A + performance metrics or live demo. Max 1000 pts.
{
  "benchmark": "WM Bench v1.0",
  "model_name": "YourModel v1.0",
  "organization": "YourOrg",
  "track": "A",
  "wm_score": 0,
  "grade": "?",
  "fps": 0,
  "cognitive_latency_ms": 0,
  "gpu": "NVIDIA A100",
  "pillar_scores": {
    "P1_perception": 0,
    "P2_cognition": 0,
    "P3_embodiment": 0
  },
  "category_scores": {
    "C01": 0, "C02": 0, "C03": 0, "C04": 0, "C05": 0,
    "C06": 0, "C07": 0, "C08": 0, "C09": 0, "C10": 0
  },
  "paper_url": "",
  "demo_url": ""
}
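Before submitting, it may help to check a filled-in template locally. The following is a minimal sketch of such a check, not an official validator. The function name `check_submission` is hypothetical, and the rules it applies are assumptions inferred from the template and the 1000-point maximum mentioned on this page: every template key is present, pillar and category keys match the template exactly, scores are non-negative numbers, and `wm_score` lies between 0 and 1000.

```python
import json

# Keys taken from the submission template above.
REQUIRED_KEYS = {
    "benchmark", "model_name", "organization", "track", "wm_score", "grade",
    "fps", "cognitive_latency_ms", "gpu", "pillar_scores", "category_scores",
    "paper_url", "demo_url",
}
PILLAR_KEYS = {"P1_perception", "P2_cognition", "P3_embodiment"}
CATEGORY_KEYS = {f"C{i:02d}" for i in range(1, 11)}  # C01 .. C10


def is_number(value) -> bool:
    """True for int/float values; bool is excluded on purpose."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_submission(text: str) -> list[str]:
    """Return a list of problems found in a submission JSON string."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    missing = REQUIRED_KEYS - set(data)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")

    # Assumed rule: score sections carry exactly the template's keys.
    for name, expected in (("pillar_scores", PILLAR_KEYS),
                           ("category_scores", CATEGORY_KEYS)):
        section = data.get(name, {})
        if not isinstance(section, dict):
            problems.append(f"{name} must be an object")
            continue
        if name in data and set(section) != expected:
            problems.append(f"{name} must have exactly the keys {sorted(expected)}")
        for key, value in section.items():
            if not is_number(value) or value < 0:
                problems.append(f"{name}.{key} must be a non-negative number")

    # Assumed rule: the overall score is out of 1000 points.
    if "wm_score" in data:
        score = data["wm_score"]
        if not (is_number(score) and 0 <= score <= 1000):
            problems.append("wm_score must be a number between 0 and 1000")
    return problems


# Usage: an incomplete submission yields a non-empty list of problems.
print(check_submission('{"model_name": "YourModel v1.0"}'))
```

An empty list means the file matches the template's shape. It says nothing about whether the scores themselves are correct.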
Existing benchmarks (HumanML3D, BABEL) measure only motion quality (FID). WM Bench is the world's first benchmark to evaluate cognitive capabilities of world models.
C05 Autonomous Emotion Escalation and C10 Body-Swap Extensibility have no prior research. C03, C04, C06, C07, and C08 are also first defined by WM Bench.
Current baseline. Open LLM brain (any LLM pluggable) + FloodDiffusion-VIDRAFT motion engine. RTX5070 (local/16GB). 47 FPS. WM Score 726/1000 (Grade B).
v1.0 (2026.03): Initial release
100 scenarios · Auto-scored
3 Tracks · 10 Categories
PROMETHEUS baseline registered
Dataset: CC-BY-SA-4.0
Scoring code: Apache 2.0
Free to use and cite. Attribution required.