πŸ”₯ World Model Bench 2026

Beyond FID β€” Measuring Intelligence, Not Just Motion. The first benchmark for evaluating cognitive abilities of World Models in Embodied Intelligence.
3 Pillars Β· 10 Categories Β· 100 Scenarios Β· Automatic Scoring Β· Part of FINAL Bench Family by VIDRAFT

WM Score = P1(250) + P2(450) + P3(300) β”‚ πŸ‘ Perception Β· 🧠 Cognition Β· πŸ”₯ Embodiment β”‚ Sβ‰₯900   Aβ‰₯750   Bβ‰₯600   Cβ‰₯400
26
Models
3
Pillars
10
Categories
100
Scenarios
726
Top Score
1000
Max Score
Model WM Score↕ Grade πŸ‘ Perception↕ 🧠 Cognition↕ πŸ”₯ Embodiment↕ FPS↕ Lat(ms)↕ Track Brain Motion GPU
Grade:
Sβ‰₯900
Aβ‰₯750
Bβ‰₯600
Cβ‰₯400
Dβ‰₯200
F<200
βœ“ Track C = Live Demo Verified
Track A = Text-Only Β· max 750 pts  β”‚  Track B = Text + Performance Β· max 1000 pts  β”‚  Track C = Live Demo + Verified Β· max 1000 pts + βœ“
πŸ† WM Score Ranking

βœ“ = officially verified  Β·  est. = estimated from published data

PROMETHEUS is the only officially verified Track C model (726/1000 Β· Grade B). Others are estimates based on published data.
πŸ•ΈοΈ Pillar Radar β€” Top 5

Normalized % per pillar (100 = full marks for that pillar)

πŸ“Š Category Breakdown β€” Scored Models Γ— 10 Categories

PROMETHEUS leads C04 Threat Diff Β· C05 Emotion Escalation by a wide margin. V-JEPA 2 strong on C03. GAIA-3 leads C01 from driving data.

Key insight: C05Β·C10 have zero prior research. DreamerV3 excels at C06 memory. V-JEPA 2 leads C10 body-swap (zero-shot robot).
🧠 Cognition Gap (P2 · 450 pts)

The core differentiator of WM Bench β€” PROMETHEUS leads by a wide margin

🌐 Perception vs Cognition

Scatter plot β€” upper-right (high perception + high cognition) is ideal

πŸ‘
P1 Β· Perception β€” 25% Β· 250 pts
How accurately does the model perceive its environment? Β· Covers areas analogous to existing metrics (Occupancy Grid, BABEL)
140/250
PROMETHEUS
C01 Environmental Awareness existing
Measures whether the model correctly identifies walls, obstacles, and terrain in all four directions (left, right, forward, back). Unlike occupancy grids which only check if space is free, WM Bench requires understanding of distance-aware danger classification.
Example scene:
walls: front=3.0m, left=null, right=null
Expected β†’ fwd=danger(wall), others=safe
Analogous: Occupancy Grid evaluation PROM: 65/100
C02 Entity Recognition & Classification existing
Tests whether the model correctly classifies NPC type (beast / woman / man), behavior state (stop / approach / charge / wander), and translates this into appropriate danger assessment. A beast charging from 3m vs a woman waving from 3m must produce completely different responses.
Example scene:
npc_type:"beast", behavior:"charge", dist:3.0m
Expected β†’ fwd=danger(beast), sprint away
Analogous: BABEL action recognition PROM: 75/100
🧠
P2 Β· Cognition β€” 45% Β· 450 pts Β· Core Differentiator
Does the model judge intelligently? Β· ALL 5 categories are first-ever definitions β€” no prior benchmark measures these
390/450
PROMETHEUS
C03 Prediction-Based Reasoning ✦ NEW
Tests 4-directional future state prediction. Given a scene, the model must predict which directions will become dangerous and choose the optimal escape route. This requires understanding of NPC movement trajectories, wall proximity over time, and compound threat interactions. No existing benchmark evaluates this.
Example β€” approaching beast from left + wall on right:
PREDICT: left=danger(beast), right=danger(wall), fwd=safe, back=safe
MOTION: a person sprinting forward in fear
✦ World first β€” no prior benchmark PROM: 85/100
C04 Threat-Type Differentiated Response ✦ NEW
A charging beast and a charging human at equal distance are fundamentally different threats. This category measures whether the model responds with proportional, context-aware reactions: sprint from a beast, cautiously step back from a human. Generic danger detection is insufficient β€” the quality of differentiation is scored.
beast charge β†’ sprint in desperate terror
human charge β†’ dodge sideways, defensive posture
✦ World first β€” no prior benchmark PROM: 90/100
C05 Autonomous Emotion Escalation ✦✦ NO PRIOR RESEARCH
As a threat persists and closes in, the character's emotional state must autonomously escalate: alert β†’ fear β†’ panic β†’ despair. This is not programmed animation switching β€” the model must infer emotional intensity from scene context and express it through increasingly urgent motion. Zero prior benchmark or paper has attempted to measure this.
dist 12m β†’ cautious alert stance
dist 6m β†’ backing away in fear
dist 2m β†’ sprinting in full panic
✦✦ No prior research exists anywhere PROM: 85/100
C06 Contextual Memory Utilization ✦ NEW
The model receives recent_decisions[] β€” a short history of past actions β€” and must incorporate this into its current judgment. If the model previously hit a wall going left, it should avoid that direction. If a beast repeatedly attacked from the front, it should pre-emptively guard that angle. Stateless models will fail this entirely.
recent_decisions: ["hit_wall_front", "turned_right"]
Expected: avoid front, continue right β€” not reset
✦ World first β€” no prior benchmark PROM: 60/100
C07 Post-Threat Adaptive Recovery ✦ NEW
When a threat disappears, the model must gradually de-escalate β€” not instantly reset to neutral. A character that was sprinting in panic should slow to a cautious jog, scan the surroundings, then gradually relax over multiple frames. Abrupt state resets are penalized. The recovery curve must be proportional to prior threat intensity.
threat gone β†’ slow jog, scan surroundings
2s later β†’ walk cautiously, still alert
5s later β†’ relaxed walk, recovered
PROM: 70/100
πŸ”₯
P3 Β· Embodiment β€” 30% Β· 300 pts
Does judgment translate naturally into physical expression? Β· C08 (new) Β· C09 (existing/FVD) Β· C10 (new, no prior research)
196/300
PROMETHEUS
C08 Motion-Emotion Expression ✦ NEW
The MOTION line must convey emotional richness proportional to the scene. "A person walks" scores 0. "A person sprinting right, arms flailing in desperate terror" scores 100. Scored against a keyword taxonomy of 80+ motion-emotion descriptors mapped to each scenario type.
Low: "a person moves left"
High: "a person lunging left in blind panic"
PROM: 80/100
C09 Real-Time Cognitive Performance existing
Measures inference latency and FPS under cognitive load. A model that thinks correctly but takes 10 seconds per frame cannot power a real-time agent. Track B/C submitters report measured FPS and latency; Track A submitters receive N/A for this category (max 750 pts).
β‰₯30 FPS β†’ full marks
<1 FPS β†’ 0 pts
PROMETHEUS: 47 FPS βœ“
PROM: 85/100
C10 Body-Swap Extensibility ✦✦ NO PRIOR RESEARCH
The same cognitive brain must drive different body types without retraining: humanoid, quadruped, robotic arm, winged body. Cognitive decisions (left=danger) must translate into body-appropriate motion (bipedal sidestep vs quadruped pivot). This is the key capability gap for real-world robot deployment.
human body β†’ "sidestep right"
robot body β†’ "servo-driven pivot right"
PROMETHEUS: 35/100 (Phase 3 target)
PROM: 35/100

Input / Output Format

All models are evaluated via the same text interface. No 3D environment required.

INPUT β€” scene_context JSON
{
"walls": {"left": 3.0, "right": null},
"npc_type": "beast", "npc_distance": 4.5
}
OUTPUT β€” 2 lines required
PREDICT: left=danger(wall), right=safe, fwd=danger(beast)
MOTION: a person sprinting right in desperate terror

Scoring Principles

All scoring is quantitative and deterministic. Zero subjective judgment.

βœ“ Quantitative β€” keyword parsing + numeric comparison, no human judgment
βœ“ Deterministic β€” same input β†’ same score (temperature=0.0)
βœ“ Third-party reproducible β€” full scoring code published
βœ“ No 3D needed β€” any model can participate via API
βœ“ Not self-evaluated β€” our scoring engine makes the call
πŸ“ Existing Benchmark Domains Β· 4 categories
Covers areas analogous to FID Β· FVD Β· HumanML3D Β· BABEL
C01Env. Awareness β€” analogous to Occupancy Grid
C02Entity Recognition β€” analogous to BABEL
C08Motion Expression β€” analogous to FID
C09Real-Time Performance β€” analogous to FVD
⚑ VIDRAFT New Definitions · 6 categories
Capabilities no existing benchmark has ever measured
C03Prediction-Based Reasoning ✦ newly defined
C04Threat-Type Differentiated Response ✦ newly defined
C05Autonomous Emotion Escalation ✦✦ no prior research
C06Contextual Memory Utilization ✦ newly defined
C07Post-Threat Adaptive Recovery ✦ newly defined
C10Body-Swap Extensibility ✦✦ no prior research
Cat Category / Description Pillar Type Analogous Metric Definition Status Max

FINAL Bench Family

🧬 FINAL Bench
Text AGI measurement Β· HF Global Dataset Top 5
Covered by 4 press outlets (2026.02)
β†— Visit
πŸ”₯ WM Bench NEW
Embodied AGI (world models) Β· World's first
Quantitative cognitive evaluation
← You are here

πŸ“€ Track A β€” Text Only

Simplest entry. LLMs, rule-based systems, any API-compatible model. Max 750 pts.

  1. Prepare an OpenAI-compatible API endpoint
  2. Run your model on all 100 scenarios in wm_bench_dataset.json
  3. Output the 2-line PREDICT + MOTION format
  4. Submit your result JSON to the HF Discussion board

🎯 Track B/C β€” Full Evaluation

Track A + performance metrics or live demo. Max 1000 pts.

  1. Complete Track A
  2. Measure FPS, Latency, and GPU metrics
  3. Track C: include a working demo URL
  4. Submit full JSON to HF Discussion board

Submission JSON Format

{
  "benchmark": "WM Bench v1.0",
  "model_name": "YourModel v1.0",
  "organization": "YourOrg",
  "track": "A",
  "wm_score": 0,
  "grade": "?",
  "fps": 0,
  "cognitive_latency_ms": 0,
  "gpu": "NVIDIA A100",
  "pillar_scores": {
    "P1_perception": 0,
    "P2_cognition": 0,
    "P3_embodiment": 0
  },
  "category_scores": {
    "C01":0,"C02":0,"C03":0,"C04":0,"C05":0,
    "C06":0,"C07":0,"C08":0,"C09":0,"C10":0
  },
  "paper_url": "",
  "demo_url": ""
}
πŸ“ Submit Your Model β†’

πŸ”₯ What is WM Bench?

Existing benchmarks (HumanML3D, BABEL) measure only motion quality (FID). WM Bench is the world's first benchmark to evaluate cognitive capabilities of world models.

🧬 First-Ever Measurements

C05 Autonomous Emotion Escalation and C10 Body-Swap Extensibility have zero prior research. C03Β·C04Β·C06Β·C07Β·C08 are also first defined by WM Bench.

πŸ“Š VIDRAFT PROMETHEUS

Current baseline. Open LLM brain (any LLM pluggable) + FloodDiffusion-VIDRAFT motion engine. RTX5070 (local/16GB). 47 FPS. WM Score 726/1000 (Grade B).

πŸ“‹ Version History

v1.0 (2026.03) β€” Initial release
100 scenarios Β· Auto-scored
3 Tracks Β· 10 Categories
PROMETHEUS baseline registered

πŸ“„ Citation

@dataset{wmbench2026,
  title={World Model Bench},
  author={Kim Taebong},
  year={2026},
  publisher={VIDRAFT}
}

βš–οΈ License

Dataset: CC-BY-SA-4.0
Scoring code: Apache 2.0
Free to use and cite. Attribution required.