About
I am a first year PhD student at the Language Technologies Institute at Carnegie Mellon University, advised by Prof. Graham Neubig.
My research explores building reliable and capable AI agents. I primarily focus on: 1) agents that can reliably adapt and evolve across long-horizon interactions, 2) multi-agent systems where agents with diverse skills and experiences collaborate effectively, and 3) evaluating agents for reliable autonomous scientific discovery.
Before CMU, I received my Master's degree at Princeton University, advised by Prof. Danqi Chen and Prof. Thomas L. Griffiths, and my Bachelor's degree at McGill University, advised by Prof. Xue (Steve) Liu and Prof. Eric D. Kolaczyk.
Selected Publications
View All → († indicates equal contribution)
Accumulating Context Changes the Beliefs of Language Models
Jiayi Geng, Howard Chen, Ryan Liu, Manoel Horta Ribeiro, Robb Willer, Graham Neubig, Thomas L. Griffiths
Preprint
arXiv
Investigating how accumulating context shifts the beliefs of language models.
Continual Memorization of Factoids in Large Language Models
Howard Chen†, Jiayi Geng†, Adithya Bhaskar, Dan Friedman, Danqi Chen
TMLR 2026
arXiv
Investigating how large language models continually memorize factual knowledge over training.
A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence
Huan-ang Gao†, Jiayi Geng†, Wenyue Hua†, Mengkang Hu†, Xinzhe Juan†, Hongzhang Liu†, Shilong Liu†, Jiahao Qiu†, Xuan Qi†, Qihan Ren†, Yiran Wu†, Hongru Wang†, Han Xiao†, Yuhang Zhou†, Shaokun Zhang†, Jiayi Zhang, Jinyu Xiang, Yixiong Fang, Qiwen Zhao, Dongrui Liu, Cheng Qian, Zhenhailong Wang, Minda Hu, Huazheng Wang, Qingyun Wu, Heng Ji, Mengdi Wang
TMLR 2026
arXiv
A comprehensive survey of self-evolving agents and their path toward artificial super intelligence.
Are Large Language Models Reliable AI Scientists? Assessing Reverse-Engineering of Black-Box Systems
Jiayi Geng†, Howard Chen†, Dilip Arumugam, Thomas L. Griffiths
LM4Sci COLM Workshop 2025
arXiv
Evaluating the reliability of LLMs as AI scientists through reverse-engineering assessments of black-box systems.
Mind Your Step (by Step): Chain-of-Thought Can Reduce Performance on Tasks Where Thinking Makes Humans Worse
Ryan Liu†, Jiayi Geng†, Addison J. Wu, Ilia Sucholutsky, Tania Lombrozo, Thomas L. Griffiths
ICML 2025
arXiv
Demonstrating that chain-of-thought prompting can reduce LLM performance on tasks where deliberate thinking hurts human performance.
Using the Tools of Cognitive Science to Understand Large Language Models at Different Levels of Analysis
Alexander Ku, Declan Campbell, Xuechunzi Bai, Jiayi Geng, Ryan Liu, Raja Marjieh, R. Thomas McCoy, Andrew Nam, Ilia Sucholutsky, Veniamin Veselovsky, Liyi Zhang, Jian-Qiao Zhu, Thomas L. Griffiths
Preprint
arXiv
Applying cognitive science tools to understand LLMs at multiple levels of analysis.
Large Language Models Assume People are More Rational than We Really Are
Ryan Liu†, Jiayi Geng†, Joshua C. Peterson, Ilia Sucholutsky, Thomas L. Griffiths
ICLR 2025
arXiv
Showing that LLMs overestimate human rationality in their predictions of human behavior.
TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling
Jiahao Qiu, Yifu Lu, Yifan Zeng, Jiacheng Guo, Jiayi Geng, Huazheng Wang, Kaixuan Huang, Yue Wu, Mengdi Wang
EMNLP Findings 2025
arXiv
A method combining speculative tree-search with best-of-N sampling for better inference-time alignment.
Language Models as Science Tutors
Alexis Chevalier, Jiayi Geng, Alexander Wettig, Howard Chen, Sebastian Mizera, Toni Annala, Max Jameson Aragon, Arturo Rodriguez Fanlo, Simon Frieder, Simon Machado, Akshara Prabhakar, Ellie Thieu, Jiachen T. Wang, Zirui Wang, Xindi Wu, Mengzhou Xia, Wenhan Xia, Jiatong Yu, Jun-Jie Zhu, Zhiyong Jason Ren, Sanjeev Arora, Danqi Chen
ICML 2024
arXiv
Exploring the effectiveness of language models as science tutors for educational applications.
News
Our paper Accumulating Context Changes the Beliefs of Language Models was featured by Science!
Graduated from Princeton University and started my PhD at LTI, CMU!
Our paper Mind Your Step (by Step): Chain-of-Thought Can Reduce Performance on Tasks Where Thinking Makes Humans Worse has been accepted by ICML 2025!
Our paper Large Language Models Assume People are More Rational than We Really Are has been accepted by ICLR 2025!
Started my Master's study at Princeton University!
