WorldTest & AutumnBench

WorldTest, a protocol for evaluating world-model learning via environment-level queries, instantiated as AutumnBench: 43 interactive environments and 129 tasks.

We introduce WorldTest, a representation-agnostic protocol for evaluating world-model learning in AI agents. WorldTest moves beyond next-frame prediction by posing environment-level queries — asking whether an agent can predict unobserved states, plan action sequences toward goals, and detect changes in causal dynamics.

We instantiate WorldTest with AutumnBench, a suite of 43 interactive grid-world environments and 129 tasks across three families: masked-frame prediction, planning, and predicting changes to causal dynamics. We evaluated 517 human participants and three frontier reasoning models on AutumnBench. Humans outperform the models, and scaling compute improves performance only in some environments — exposing substantial headroom in world-model learning (Warrier et al., 2026).
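The counts above are consistent with each environment being paired once with each of the three task families (43 × 3 = 129). The sketch below illustrates that layout under that assumption. The names `TaskFamily`, `Task`, `build_tasks`, and the `env_XX` identifiers are hypothetical placeholders, not AutumnBench's actual API or environment names.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class TaskFamily(Enum):
    """The three task families named in the text."""
    MASKED_FRAME_PREDICTION = "masked-frame prediction"
    PLANNING = "planning"
    CHANGE_DETECTION = "predicting changes to causal dynamics"


@dataclass(frozen=True)
class Task:
    """One (environment, family) pair; a hypothetical container."""
    environment: str
    family: TaskFamily


def build_tasks(environments):
    """Pair every environment with every task family exactly once.

    Assumption: one task per family per environment, inferred only
    from the counts 43 environments and 129 tasks.
    """
    return [Task(env, fam) for env, fam in product(environments, TaskFamily)]


# Placeholder environment names; the real names are not given here.
environments = [f"env_{i:02d}" for i in range(1, 44)]
tasks = build_tasks(environments)

print(len(environments), len(tasks))  # 43 129
```

Keeping the environment and the task family as separate fields makes it easy to group results either per environment or per family when comparing humans and models.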

PS: The games are fun to play — try them at autumn.basis.ai!

Example human vs AI interactions

Panels: Human, Claude 4 Sonnet, Gemini 2.5 Pro, o3.

References

2026

  1. ICLR
    Benchmarking World-Model Learning with Environment-Level Queries
    Archana Warrier, D. Nguyen, M. Naim, and 8 more authors
    International Conference on Learning Representations, 2026