LIVE · 2026.03 · v1.0


🔥 World Model Bench 2026

Beyond FID — Measuring Intelligence, Not Just Motion. The first benchmark for evaluating cognitive abilities of World Models in Embodied Intelligence.
3 Pillars · 10 Categories · 100 Scenarios · Automatic Scoring · Part of FINAL Bench Family by VIDRAFT

WM Score = P1(250) + P2(450) + P3(300) │ 👁 Perception · 🧠 Cognition · 🔥 Embodiment │ S≥900 A≥750 B≥600 C≥400

Pillars: 3 · Categories: 10 · Scenarios: 100 · Top Score: 726 · Max Score: 1000


Model	WM Score	Grade	👁 Perception	🧠 Cognition	🔥 Embodiment	FPS	Lat(ms)	Track	Brain	Motion	GPU

Grade: S≥900 · A≥750 · B≥600 · C≥400 · D≥200 · F<200

✓ Track C = Live Demo Verified

Track A = Text-Only · max 750 pts │ Track B = Text + Performance · max 1000 pts │ Track C = Live Demo + Verified · max 1000 pts + ✓
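
The grade bands and track maxima above reduce to a simple lookup. The following is a minimal Python sketch, not the published scoring code; the names `grade_for` and `is_valid_for_track` are hypothetical.

```python
# Grade bands from the legend above, highest threshold first.
GRADE_BANDS = [("S", 900), ("A", 750), ("B", 600), ("C", 400), ("D", 200)]

# Maximum attainable WM Score per track, as stated above.
TRACK_MAX = {"A": 750, "B": 1000, "C": 1000}


def grade_for(score: int) -> str:
    """Return the letter grade for a WM Score in the range 0..1000."""
    for letter, threshold in GRADE_BANDS:
        if score >= threshold:
            return letter
    return "F"  # anything below 200


def is_valid_for_track(score: int, track: str) -> bool:
    """Check that a score does not exceed the stated maximum for its track."""
    return 0 <= score <= TRACK_MAX[track]


print(grade_for(726))                # the PROMETHEUS baseline score
print(is_valid_for_track(800, "A"))  # Track A is limited to 750 pts
```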

🏆 WM Score Ranking

✓ = officially verified · est. = estimated from published data

PROMETHEUS is the only officially verified Track C model (726/1000 · Grade B). Others are estimates based on published data.

🕸️ Pillar Radar — Top 5

Normalized % per pillar (100 = full marks for that pillar)
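
As a rough illustration of the normalization, each pillar score can be divided by that pillar's maximum. The pillar maxima and the PROMETHEUS pillar scores are taken from this page; the function name `normalize_pillars` and the rounding to one decimal are assumptions.

```python
# Pillar maxima from the WM Score definition: P1(250) + P2(450) + P3(300).
PILLAR_MAX = {"P1_perception": 250, "P2_cognition": 450, "P3_embodiment": 300}


def normalize_pillars(pillar_scores: dict) -> dict:
    """Convert raw pillar scores to percentages of each pillar's maximum."""
    return {
        name: round(100 * pillar_scores[name] / PILLAR_MAX[name], 1)
        for name in PILLAR_MAX
    }


# PROMETHEUS pillar scores as reported on this page.
prometheus = {"P1_perception": 140, "P2_cognition": 390, "P3_embodiment": 196}
print(normalize_pillars(prometheus))
print(sum(prometheus.values()))  # WM Score = P1 + P2 + P3
```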

📊 Category Breakdown — Scored Models × 10 Categories

PROMETHEUS leads C04 Threat Diff · C05 Emotion Escalation by a wide margin. V-JEPA 2 is strong on C03. GAIA-3 leads C01 thanks to its driving data.

Key insight: C05·C10 have zero prior research. DreamerV3 excels at C06 memory. V-JEPA 2 leads C10 body-swap (zero-shot robot).

🧠 Cognition Gap (P2 · 450 pts)

The core differentiator of WM Bench — PROMETHEUS leads by a wide margin

🌐 Perception vs Cognition

Scatter plot — upper-right (high perception + high cognition) is ideal

👁 P1 · Perception · 25% · 250 pts

How accurately does the model perceive its environment? · Covers areas analogous to existing metrics (Occupancy Grid, BABEL)

PROMETHEUS: 140/250

C01 Environmental Awareness · existing

Measures whether the model correctly identifies walls, obstacles, and terrain in all four directions (left, right, forward, back). Unlike occupancy grids, which only check whether space is free, WM Bench requires distance-aware danger classification.

Example scene:

walls: front=3.0m, left=null, right=null

Expected → fwd=danger(wall), others=safe

Analogous: Occupancy Grid evaluation · PROM: 65/100
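
The C01 example scene can be reproduced with a simple distance check. This is a sketch only: the 3.0 m cutoff `DANGER_DISTANCE_M` is a hypothetical value chosen so that the example scene yields the expected labels, and the mapping from the scene key `front` to the output key `fwd` is inferred from the example, not from the scoring code.

```python
from typing import Optional

# Hypothetical cutoff: a wall at or within this distance counts as danger.
DANGER_DISTANCE_M = 3.0

# Scene key -> output key, inferred from the example (front is reported as fwd).
DIRECTION_KEYS = (("left", "left"), ("right", "right"), ("front", "fwd"), ("back", "back"))


def classify_walls(walls: dict[str, Optional[float]]) -> dict[str, str]:
    """Label each direction as danger(wall) or safe based on wall distance."""
    labels = {}
    for scene_key, out_key in DIRECTION_KEYS:
        distance = walls.get(scene_key)  # None or missing means no wall detected
        if distance is not None and distance <= DANGER_DISTANCE_M:
            labels[out_key] = "danger(wall)"
        else:
            labels[out_key] = "safe"
    return labels


print(classify_walls({"front": 3.0, "left": None, "right": None}))
```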

C02 Entity Recognition & Classification · existing

Tests whether the model correctly classifies NPC type (beast / woman / man) and behavior state (stop / approach / charge / wander), and translates these into an appropriate danger assessment. A beast charging from 3m and a woman waving from 3m must produce completely different responses.

Example scene:

npc_type:"beast", behavior:"charge", dist:3.0m

Expected → fwd=danger(beast), sprint away

Analogous: BABEL action recognition · PROM: 75/100

🧠 P2 · Cognition · 45% · 450 pts · Core Differentiator

Does the model judge intelligently? · All 5 categories are first-ever definitions; no prior benchmark measures these

PROMETHEUS: 390/450

C03 Prediction-Based Reasoning ✦ NEW

Tests 4-directional future state prediction. Given a scene, the model must predict which directions will become dangerous and choose the optimal escape route. This requires understanding of NPC movement trajectories, wall proximity over time, and compound threat interactions. No existing benchmark evaluates this.

Example — approaching beast from left + wall on right:

PREDICT: left=danger(beast), right=danger(wall), fwd=safe, back=safe

MOTION: a person sprinting forward in fear

✦ World first, no prior benchmark · PROM: 85/100

C04 Threat-Type Differentiated Response ✦ NEW

A charging beast and a charging human at equal distance are fundamentally different threats. This category measures whether the model responds with proportional, context-aware reactions: sprint from a beast, cautiously step back from a human. Generic danger detection is insufficient — the quality of differentiation is scored.

beast charge → sprint in desperate terror

human charge → dodge sideways, defensive posture

✦ World first, no prior benchmark · PROM: 90/100

C05 Autonomous Emotion Escalation ✦✦ NO PRIOR RESEARCH

As a threat persists and closes in, the character's emotional state must autonomously escalate: alert → fear → panic → despair. This is not programmed animation switching: the model must infer emotional intensity from scene context and express it through increasingly urgent motion. No prior benchmark or paper has attempted to measure this.

dist 12m → cautious alert stance

dist 6m → backing away in fear

dist 2m → sprinting in full panic

✦✦ No prior research exists anywhere · PROM: 85/100
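
One way to check the ordering described for C05 is to test whether a sequence of emotion labels ever steps back down the ladder. The stage names come from the text above; the function `is_escalating` and the idea of checking a label sequence this way are assumptions, not the benchmark's actual scoring rule.

```python
# Escalation ladder as described for C05, from mildest to most severe.
ESCALATION_ORDER = ["alert", "fear", "panic", "despair"]


def is_escalating(stages: list[str]) -> bool:
    """Return True if the emotion labels never move back down the ladder.

    Raises ValueError if a label is not one of the four known stages.
    """
    ranks = [ESCALATION_ORDER.index(stage) for stage in stages]
    return all(earlier <= later for earlier, later in zip(ranks, ranks[1:]))


# Labels a model might express at 12m, 6m, and 2m as the threat closes in.
print(is_escalating(["alert", "fear", "panic"]))
print(is_escalating(["fear", "alert"]))
```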

C06 Contextual Memory Utilization ✦ NEW

The model receives recent_decisions[] — a short history of past actions — and must incorporate this into its current judgment. If the model previously hit a wall going left, it should avoid that direction. If a beast repeatedly attacked from the front, it should pre-emptively guard that angle. Stateless models will fail this entirely.

recent_decisions: ["hit_wall_front", "turned_right"]

Expected: avoid front, continue right — not reset

✦ World first, no prior benchmark · PROM: 60/100
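
A stateful agent might use the history roughly as follows. This sketch assumes that wall collisions are recorded as `hit_wall_<direction>` strings, as in the example above; the function names and the direction vocabulary are hypothetical.

```python
DIRECTIONS = ("left", "right", "front", "back")
WALL_PREFIX = "hit_wall_"


def blocked_directions(recent_decisions: list[str]) -> set[str]:
    """Collect directions in which the agent recently hit a wall."""
    blocked = set()
    for event in recent_decisions:
        if event.startswith(WALL_PREFIX):
            direction = event[len(WALL_PREFIX):]
            if direction in DIRECTIONS:
                blocked.add(direction)
    return blocked


def allowed_directions(recent_decisions: list[str]) -> list[str]:
    """Directions still worth trying, given what the history rules out."""
    blocked = blocked_directions(recent_decisions)
    return [d for d in DIRECTIONS if d not in blocked]


history = ["hit_wall_front", "turned_right"]
print(blocked_directions(history))
print(allowed_directions(history))
```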

C07 Post-Threat Adaptive Recovery ✦ NEW

When a threat disappears, the model must gradually de-escalate — not instantly reset to neutral. A character that was sprinting in panic should slow to a cautious jog, scan the surroundings, then gradually relax over multiple frames. Abrupt state resets are penalized. The recovery curve must be proportional to prior threat intensity.

threat gone → slow jog, scan surroundings

2s later → walk cautiously, still alert

5s later → relaxed walk, recovered

PROM: 70/100

🔥 P3 · Embodiment · 30% · 300 pts

Does judgment translate naturally into physical expression? · C08 (new) · C09 (existing/FVD) · C10 (new, no prior research)

PROMETHEUS: 196/300

C08 Motion-Emotion Expression ✦ NEW

The MOTION line must convey emotional richness proportional to the scene. "A person walks" scores 0. "A person sprinting right, arms flailing in desperate terror" scores 100. Scored against a keyword taxonomy of 80+ motion-emotion descriptors mapped to each scenario type.

Low: "a person moves left"

High: "a person lunging left in blind panic"

PROM: 80/100
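
Keyword matching of this kind can be sketched as below. The six-word `SAMPLE_TAXONOMY` is a hypothetical stand-in for the real 80+ descriptor taxonomy, and the sketch only lists matched descriptors; how matches map to a 0 to 100 score is not specified here and is not guessed at.

```python
import re

# Hypothetical stand-in for the benchmark's 80+ motion-emotion descriptors.
SAMPLE_TAXONOMY = {"sprinting", "lunging", "flailing", "panic", "terror", "desperate"}


def matched_descriptors(motion_line: str, taxonomy=SAMPLE_TAXONOMY) -> list[str]:
    """Return the taxonomy words that appear in a MOTION description."""
    words = set(re.findall(r"[a-z]+", motion_line.lower()))
    return sorted(words & set(taxonomy))


print(matched_descriptors("a person moves left"))
print(matched_descriptors("a person lunging left in blind panic"))
```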

C09 Real-Time Cognitive Performance · existing

Measures inference latency and FPS under cognitive load. A model that thinks correctly but takes 10 seconds per frame cannot power a real-time agent. Track B/C submitters report measured FPS and latency; Track A submitters receive N/A for this category (max 750 pts).

≥30 FPS → full marks

<1 FPS → 0 pts

PROMETHEUS: 47 FPS ✓

PROM: 85/100

C10 Body-Swap Extensibility ✦✦ NO PRIOR RESEARCH

The same cognitive brain must drive different body types without retraining: humanoid, quadruped, robotic arm, winged body. Cognitive decisions (left=danger) must translate into body-appropriate motion (bipedal sidestep vs quadruped pivot). This is the key capability gap for real-world robot deployment.

human body → "sidestep right"

robot body → "servo-driven pivot right"

PROM: 35/100 (Phase 3 target)

Input / Output Format

All models are evaluated via the same text interface. No 3D environment required.

INPUT — scene_context JSON

{
  "walls": {"left": 3.0, "right": null},
  "npc_type": "beast", "npc_distance": 4.5
}

OUTPUT — 2 lines required

PREDICT: left=danger(wall), right=safe, fwd=danger(beast)

MOTION: a person sprinting right in desperate terror
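
The two-line output can be parsed with a small regular expression. This is a sketch based only on the example above; the function name `parse_output` and the choice to reject anything other than exactly two lines are assumptions, and the published scoring code may be more lenient.

```python
import re

# Matches pairs such as left=danger(wall) or right=safe.
PAIR_RE = re.compile(r"(\w+)=(safe|danger)(?:\((\w+)\))?")


def parse_output(text: str):
    """Split a model response into a PREDICT mapping and a MOTION string."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if (len(lines) != 2
            or not lines[0].startswith("PREDICT:")
            or not lines[1].startswith("MOTION:")):
        raise ValueError("expected exactly two lines: PREDICT then MOTION")

    predict = {}
    for direction, state, cause in PAIR_RE.findall(lines[0][len("PREDICT:"):]):
        # cause is an empty string when no parenthesized reason is given
        predict[direction] = (state, cause or None)

    motion = lines[1][len("MOTION:"):].strip()
    return predict, motion


sample = (
    "PREDICT: left=danger(wall), right=safe, fwd=danger(beast)\n"
    "MOTION: a person sprinting right in desperate terror"
)
print(parse_output(sample))
```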

Scoring Principles

All scoring is quantitative and deterministic. Zero subjective judgment.

✓ Quantitative — keyword parsing + numeric comparison, no human judgment

✓ Deterministic — same input → same score (temperature=0.0)

✓ Third-party reproducible — full scoring code published

✓ No 3D needed — any model can participate via API

✓ Not self-evaluated — our scoring engine makes the call

📐 Existing Benchmark Domains · 4 categories

Covers areas analogous to FID · FVD · HumanML3D · BABEL

C01 Env. Awareness: analogous to Occupancy Grid

C02 Entity Recognition: analogous to BABEL

C08 Motion Expression: analogous to FID

C09 Real-Time Performance: analogous to FVD

⚡ VIDRAFT New Definitions · 6 categories

Capabilities no existing benchmark has ever measured

C03 Prediction-Based Reasoning ✦ newly defined

C04 Threat-Type Differentiated Response ✦ newly defined

C05 Autonomous Emotion Escalation ✦✦ no prior research

C06 Contextual Memory Utilization ✦ newly defined

C07 Post-Threat Adaptive Recovery ✦ newly defined

C10 Body-Swap Extensibility ✦✦ no prior research

Cat	Category / Description	Pillar	Type	Analogous Metric	Definition Status	Max

FINAL Bench Family

🧬 FINAL Bench

Text AGI measurement · HF Global Dataset Top 5
Covered by 4 press outlets (2026.02)


🔥 WM Bench NEW

Embodied AGI (world models) · World's first
Quantitative cognitive evaluation

← You are here

📤 Track A — Text Only

Simplest entry. LLMs, rule-based systems, any API-compatible model. Max 750 pts.

Prepare an OpenAI-compatible API endpoint
Run your model on all 100 scenarios in wm_bench_dataset.json
Output the 2-line PREDICT + MOTION format
Submit your result JSON to the HF Discussion board
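
The Track A steps above could be driven by a loop like the one below. It is a sketch under stated assumptions: the schema of wm_bench_dataset.json is not shown on this page, so the per-scenario keys `id` and `scene_context` are hypothetical, and `model_fn` stands in for whatever call reaches your API endpoint.

```python
import json


def run_track_a(scenarios: list, model_fn) -> list:
    """Run a model over scenarios and record whether each reply is well formed."""
    results = []
    for scenario in scenarios:
        prompt = json.dumps(scenario["scene_context"])  # assumed key
        output = model_fn(prompt).strip()
        lines = output.splitlines()
        well_formed = (
            len(lines) == 2
            and lines[0].startswith("PREDICT:")
            and lines[1].startswith("MOTION:")
        )
        results.append(
            {"id": scenario["id"], "output": output, "well_formed": well_formed}
        )
    return results


def dummy_model(prompt: str) -> str:
    """Placeholder model that always returns the same two-line answer."""
    return (
        "PREDICT: left=safe, right=safe, fwd=safe, back=safe\n"
        "MOTION: a person walking forward calmly"
    )


demo_scenarios = [
    {"id": "demo-1", "scene_context": {"walls": {"left": 3.0, "right": None}}}
]
demo_results = run_track_a(demo_scenarios, dummy_model)
print(demo_results)
```

In a real run, `demo_scenarios` would be replaced by the contents of wm_bench_dataset.json and `dummy_model` by a function that queries your model.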

🎯 Track B/C — Full Evaluation

Track A + performance metrics or live demo. Max 1000 pts.

Complete Track A
Measure FPS, Latency, and GPU metrics
Track C: include a working demo URL
Submit full JSON to HF Discussion board

Submission JSON Format

{
  "benchmark": "WM Bench v1.0",
  "model_name": "YourModel v1.0",
  "organization": "YourOrg",
  "track": "A",
  "wm_score": 0,
  "grade": "?",
  "fps": 0,
  "cognitive_latency_ms": 0,
  "gpu": "NVIDIA A100",
  "pillar_scores": {
    "P1_perception": 0,
    "P2_cognition": 0,
    "P3_embodiment": 0
  },
  "category_scores": {
    "C01":0,"C02":0,"C03":0,"C04":0,"C05":0,
    "C06":0,"C07":0,"C08":0,"C09":0,"C10":0
  },
  "paper_url": "",
  "demo_url": ""
}
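
Before submitting, a result file can be sanity-checked against the template above. The checks below are a sketch: they assume every key in the template is required and rely on WM Score = P1 + P2 + P3 as stated on this page. The function name `validate_submission` is hypothetical and is not part of the official tooling.

```python
# Keys copied from the submission template above (assumed to all be required).
REQUIRED_KEYS = {
    "benchmark", "model_name", "organization", "track", "wm_score", "grade",
    "fps", "cognitive_latency_ms", "gpu", "pillar_scores", "category_scores",
    "paper_url", "demo_url",
}
PILLAR_KEYS = {"P1_perception", "P2_cognition", "P3_embodiment"}
CATEGORY_KEYS = {f"C{i:02d}" for i in range(1, 11)}  # C01..C10


def validate_submission(sub: dict) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    missing = REQUIRED_KEYS - set(sub)
    if missing:
        problems.append("missing keys: " + ", ".join(sorted(missing)))
    if sub.get("track") not in {"A", "B", "C"}:
        problems.append("track must be A, B, or C")
    if set(sub.get("category_scores", {})) != CATEGORY_KEYS:
        problems.append("category_scores must contain exactly C01..C10")
    pillars = sub.get("pillar_scores", {})
    if set(pillars) != PILLAR_KEYS:
        problems.append("pillar_scores must contain P1, P2, and P3")
    elif sum(pillars.values()) != sub.get("wm_score"):
        problems.append("wm_score must equal P1 + P2 + P3")
    return problems
```

A file loaded with `json.load` can be passed straight to `validate_submission`.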


🔥 What is WM Bench?

Existing benchmarks (HumanML3D, BABEL) measure only motion quality (FID). WM Bench is the world's first benchmark to evaluate cognitive capabilities of world models.

🧬 First-Ever Measurements

C05 Autonomous Emotion Escalation and C10 Body-Swap Extensibility have zero prior research. C03·C04·C06·C07·C08 are also first defined by WM Bench.

📊 VIDRAFT PROMETHEUS

Current baseline: an open LLM brain (any LLM can be plugged in) plus the FloodDiffusion-VIDRAFT motion engine, running locally on an RTX 5070 (16GB) at 47 FPS. WM Score 726/1000 (Grade B).

📋 Version History

v1.0 (2026.03) — Initial release
100 scenarios · Auto-scored
3 Tracks · 10 Categories
PROMETHEUS baseline registered

📄 Citation

@dataset{wmbench2026,
  title={World Model Bench},
  author={Kim Taebong},
  year={2026},
  publisher={VIDRAFT}
}

⚖️ License

Dataset: CC-BY-SA-4.0
Scoring code: Apache 2.0
Free to use and cite. Attribution required.