Publications
Publications and research papers in reverse chronological order. Generated by jekyll-scholar.
2026
- ICML. Benchmarking World-Model Learning with Environment-Level Queries. Archana Warrier, D. Nguyen, M. Naim, and 8 more authors. International Conference on Learning Representations, 2026
Model-learning agents should gather information to learn world models that support many downstream tasks and inferences, such as predicting unobserved states, estimating near- and far-term consequences of actions, planning action sequences, and detecting changes in dynamics. Current methods for learning and evaluating world models diverge from this goal: training and evaluation are anchored to next-frame prediction, and success is scored by reward maximization in the same environment. We propose WorldTest, a protocol to evaluate model-learning agents that separates reward-free interaction from a scored test phase in a different but related environment. WorldTest is open-ended—models should support many different tasks unknown ahead of time—and agnostic to model representation, allowing comparison across approaches. We instantiated WorldTest with AutumnBench, a suite of 43 interactive grid-world environments and 129 tasks across three families: masked-frame prediction, planning, and predicting changes to the causal dynamics. We compared 517 human participants and three frontier models on AutumnBench. We found that humans outperform the models, and scaling compute improves performance only in some environments but not others. WorldTest provides a novel template—reward-free exploration, derived tests, and behavior-based scoring—to evaluate what agents learn about environment dynamics, and AutumnBench exposes significant headroom in world-model learning.
@article{warrier2025benchmarking,
  title = {Benchmarking World-Model Learning with Environment-Level Queries},
  author = {Warrier, Archana and Nguyen, D. and Naim, M. and Jain, M. and Liang, Y. and Schroeder, K. and Yang, C. and Tenenbaum, J. and Vollmer, S. and Ellis, K. and Tavares, Z.},
  journal = {International Conference on Learning Representations},
  year = {2026},
}
2024
- NeurIPS Workshop. Had Enough of Experts? Elicitation and Evaluation of Bayesian Priors from LLMs. D. Selby, K. Spriestersbach, Y. Iwashita, and 5 more authors. NeurIPS 2024 Workshop, 2024
Large language models (LLMs) have been extensively studied for their ability to generate convincing natural language sequences; however, their utility for quantitative information retrieval is less well understood. Here we explore the feasibility of LLMs as a mechanism for quantitative knowledge retrieval to aid elicitation of expert-informed prior distributions for Bayesian statistical models. We present a prompt engineering framework, treating an LLM as an interface to scholarly literature, evaluating responses in different contexts and domains. We discuss the implications and challenges of treating LLMs as “experts”.
@article{selby2024experts,
  title = {Had Enough of Experts? Elicitation and Evaluation of Bayesian Priors from LLMs},
  author = {Selby, D. and Spriestersbach, K. and Iwashita, Y. and Bappert, D. and Warrier, Archana and Mukherjee, S. and Kise, K. and Vollmer, S.},
  journal = {NeurIPS 2024 Workshop},
  year = {2024}
}