Publications | Archana Warrier

2026

ICML
Benchmarking World-Model Learning with Environment-Level Queries

Archana Warrier, Dat Nguyen, Michelangelo Naim, and 8 more authors

In Forty-third International Conference on Machine Learning, 2026


World models are central to building AI agents capable of flexible reasoning and planning. Yet current evaluations (i) test only properties measurable from observed interactions, such as next-frame prediction or task return, and (ii) do not test whether a learned model supports diverse queries about the environment. In contrast, humans build general-purpose models that can answer many different questions about an environment—including questions that require understanding global structure and counterfactual consequences. We propose WorldTest: a protocol for evaluating whether agents learn models that support multiple environment-level queries—questions whose answers depend on properties of the full environment, not just observed trajectories. Individually, these queries can target properties (e.g., reachability or the effects of interventions) that no single rollout distribution determines. Collectively, they assess model generality across query types. We instantiate WorldTest as AutumnBench, a benchmark of 43 interactive grid-world environments and 129 tasks across three query families for both humans and learning agents. Experiments with 517 human participants and five frontier models on AutumnBench show that humans substantially outperform these models, a gap we attribute to differences in exploration and belief updating.
@inproceedings{warrier2026benchmarking,
  title     = {Benchmarking World-Model Learning with Environment-Level Queries},
  author    = {Warrier, Archana and Nguyen, Dat and Naim, Michelangelo and Jain, Moksh and Liang, Yichao and Schroeder, Karen and Yang, Cambridge and Tenenbaum, Joshua and Vollmer, Sebastian and Ellis, Kevin and Tavares, Zenna},
  booktitle = {Forty-third International Conference on Machine Learning},
  year      = {2026},
  url       = {https://openreview.net/forum?id=Ny6UYZ8ysN}
}

2024

NeurIPS Workshop
Had Enough of Experts? Elicitation and Evaluation of Bayesian Priors from LLMs

D. Selby, K. Spriestersbach, Y. Iwashita, and 5 more authors

NeurIPS 2024 Workshop, 2024


Large language models (LLMs) have been extensively studied for their ability to generate convincing natural language sequences; however, their utility for quantitative information retrieval is less well understood. Here we explore the feasibility of LLMs as a mechanism for quantitative knowledge retrieval to aid elicitation of expert-informed prior distributions for Bayesian statistical models. We present a prompt engineering framework, treating an LLM as an interface to scholarly literature, evaluating responses in different contexts and domains. We discuss the implications and challenges of treating LLMs as “experts”.
@article{selby2024experts,
  title   = {Had Enough of Experts? Elicitation and Evaluation of Bayesian Priors from LLMs},
  author  = {Selby, D. and Spriestersbach, K. and Iwashita, Y. and Bappert, D. and Warrier, Archana and Mukherjee, S. and Kise, K. and Vollmer, S.},
  journal = {NeurIPS 2024 Workshop},
  year    = {2024}
}