About
I am a first year PhD student at the Language Technologies Institute at Carnegie Mellon University, advised by Prof. Graham Neubig.
My research explores building reliable and capable AI agents. I primarily focus on: 1) agents that can reliably adapt and evolve across long-horizon interactions, 2) multi-agent systems where agents with diverse skills and experiences collaborate effectively, and 3) evaluating agents for reliable autonomous scientific discovery.
Before CMU, I received my Master's degree at Princeton University, advised by Prof. Danqi Chen and Prof. Thomas L. Griffiths, and my Bachelor's degree at McGill University, advised by Prof. Xue (Steve) Liu and Prof. Eric D. Kolaczyk.
Selected Publications
View All → († indicates equal contribution)
Accumulating Context Changes the Beliefs of Language Models
Jiayi Geng, Howard Chen, Ryan Liu, Manoel Horta Ribeiro, Robb Willer, Graham Neubig, Thomas L. Griffiths
Preprint
arXiv
Investigating how accumulating context shifts the beliefs of language models.
Continual Memorization of Factoids in Large Language Models
Howard Chen†, Jiayi Geng†, Adithya Bhaskar, Dan Friedman, Danqi Chen
TMLR 2026
arXiv
Investigating how large language models continually memorize factual knowledge over training.
A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence
Huan-ang Gao†, Jiayi Geng†, Wenyue Hua†, Mengkang Hu†, Xinzhe Juan†, Hongzhang Liu†, Shilong Liu†, Jiahao Qiu†, Xuan Qi†, Qihan Ren†, Yiran Wu†, Hongru Wang†, Han Xiao†, Yuhang Zhou†, Shaokun Zhang†, Jiayi Zhang, Jinyu Xiang, Yixiong Fang, Qiwen Zhao, Dongrui Liu, Cheng Qian, Zhenhailong Wang, Minda Hu, Huazheng Wang, Qingyun Wu, Heng Ji, Mengdi Wang
TMLR 2026
arXiv
A comprehensive survey of self-evolving agents and their path toward artificial super intelligence.
Are Large Language Models Reliable AI Scientists? Assessing Reverse-Engineering of Black-Box Systems
Jiayi Geng†, Howard Chen†, Dilip Arumugam, Thomas L. Griffiths
LM4Sci COLM Workshop 2025
arXiv
Evaluating the reliability of LLMs as AI scientists through reverse-engineering assessments of black-box systems.
Mind Your Step (by Step): Chain-of-Thought Can Reduce Performance on Tasks Where Thinking Makes Humans Worse
Ryan Liu†, Jiayi Geng†, Addison J. Wu, Ilia Sucholutsky, Tania Lombrozo, Thomas L. Griffiths
ICML 2025
arXiv
Demonstrating that chain-of-thought prompting can reduce LLM performance on tasks where deliberate thinking hurts human performance.
Using the Tools of Cognitive Science to Understand Large Language Models at Different Levels of Analysis
Alexander Ku, Declan Campbell, Xuechunzi Bai, Jiayi Geng, Ryan Liu, Raja Marjieh, R. Thomas McCoy, Andrew Nam, Ilia Sucholutsky, Veniamin Veselovsky, Liyi Zhang, Jian-Qiao Zhu, Thomas L. Griffiths
Preprint
arXiv
Applying cognitive science tools to understand LLMs at multiple levels of analysis.
Large Language Models Assume People are More Rational than We Really Are
Ryan Liu†, Jiayi Geng†, Joshua C. Peterson, Ilia Sucholutsky, Thomas L. Griffiths
ICLR 2025
arXiv
Showing that LLMs overestimate human rationality in their predictions of human behavior.
TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling
Jiahao Qiu, Yifu Lu, Yifan Zeng, Jiacheng Guo, Jiayi Geng, Huazheng Wang, Kaixuan Huang, Yue Wu, Mengdi Wang
EMNLP Findings 2025
arXiv
A method combining speculative tree-search with best-of-N sampling for better inference-time alignment.
Language Models as Science Tutors
Alexis Chevalier, Jiayi Geng, Alexander Wettig, Howard Chen, Sebastian Mizera, Toni Annala, Max Jameson Aragon, Arturo Rodriguez Fanlo, Simon Frieder, Simon Machado, Akshara Prabhakar, Ellie Thieu, Jiachen T. Wang, Zirui Wang, Xindi Wu, Mengzhou Xia, Wenhan Xia, Jiatong Yu, Jun-Jie Zhu, Zhiyong Jason Ren, Sanjeev Arora, Danqi Chen
ICML 2024
arXiv
Exploring the effectiveness of language models as science tutors for educational applications.
News
Our paper Accumulating Context Changes the Beliefs of Language Models was featured by Science!
Graduated from Princeton University and started my PhD at LTI, CMU!
Our paper Mind Your Step (by Step): Chain-of-Thought Can Reduce Performance on Tasks Where Thinking Makes Humans Worse has been accepted by ICML 2025!
Our paper Large Language Models Assume People are More Rational than We Really Are has been accepted by ICLR 2025!
Started my Master's study at Princeton University!
