Publications

(* indicates equal contribution)

Accumulating Context Changes the Beliefs of Language Models

Jiayi Geng, Howard Chen, Ryan Liu, Manoel Horta Ribeiro, Robb Willer, Graham Neubig, Thomas L Griffiths

Preprint

Investigating how accumulating context shifts the beliefs of language models.

arXiv | Code

Alita-G: Self-Evolving Generative Agent for Agent Generation

Jiahao Qiu, Xuan Qi, Hongru Wang, Xinzhe Juan, Yimin Wang, Zelin Zhao, Jiayi Geng, Jiacheng Guo, Peihang Li, Jingzhe Shi, Shilong Liu, Mengdi Wang

Preprint

A self-evolution framework that transforms a general-purpose agent into a domain expert by generating, abstracting, and curating MCP tools.

arXiv

Continual Memorization of Factoids in Large Language Models

Howard Chen, Jiayi Geng, Adithya Bhaskar, Dan Friedman, Danqi Chen

TMLR 2026

Investigating how large language models continually memorize factual knowledge over training.

arXiv | Code

Physics Supernova: AI Agent Matches Elite Gold Medalists at IPhO 2025

Jiahao Qiu, Jingzhe Shi, Xinzhe Juan, Zelin Zhao, Jiayi Geng, Shilong Liu, Hongru Wang, Sanfeng Wu, Mengdi Wang

Preprint

An AI agent that matches elite gold medalists at the International Physics Olympiad 2025.

arXiv | Code

A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence

Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Qihan Ren, Yiran Wu, Hongru Wang, Han Xiao, Yuhang Zhou, Shaokun Zhang, Jiayi Zhang, Jinyu Xiang, Yixiong Fang, Qiwen Zhao, Dongrui Liu, Cheng Qian, Zhenhailong Wang, Minda Hu, Huazheng Wang, Qingyun Wu, Heng Ji, Mengdi Wang

TMLR 2026

A comprehensive survey of self-evolving agents and their path toward artificial super intelligence.

arXiv | Code

Are Large Language Models Reliable AI Scientists? Assessing Reverse-Engineering of Black-Box Systems

Jiayi Geng, Howard Chen, Dilip Arumugam, Thomas L Griffiths

LM4Sci COLM Workshop 2025

Evaluating the reliability of LLMs as AI scientists through reverse-engineering assessments of black-box systems.

arXiv | Code

Mind Your Step (by Step): Chain-of-Thought Can Reduce Performance on Tasks Where Thinking Makes Humans Worse

Ryan Liu, Jiayi Geng, Addison J Wu, Ilia Sucholutsky, Tania Lombrozo, Thomas L Griffiths

ICML 2025

Demonstrating that chain-of-thought prompting can reduce LLM performance on tasks where deliberate thinking hurts human performance.

arXiv | Code

Using the Tools of Cognitive Science to Understand Large Language Models at Different Levels of Analysis

Alexander Ku, Declan Campbell, Xuechunzi Bai, Jiayi Geng, Ryan Liu, Raja Marjieh, R. Thomas McCoy, Andrew Nam, Ilia Sucholutsky, Veniamin Veselovsky, Liyi Zhang, Jian-Qiao Zhu, Thomas L Griffiths

Preprint

Applying cognitive science tools to understand LLMs at multiple levels of analysis.

arXiv

Large Language Models Assume People are More Rational than We Really Are

Ryan Liu, Jiayi Geng, Joshua C Peterson, Ilia Sucholutsky, Thomas L Griffiths

ICLR 2025

Showing that LLMs overestimate human rationality in their predictions of human behavior.

arXiv | Code

TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling

Jiahao Qiu, Yifu Lu, Yifan Zeng, Jiacheng Guo, Jiayi Geng, Huazheng Wang, Kaixuan Huang, Yue Wu, Mengdi Wang

EMNLP Findings 2025

A method combining speculative tree-search with best-of-N sampling for better inference-time alignment.

arXiv

Dr. GPT in Campus Counseling: Understanding Higher Education Students' Opinions on LLM-assisted Mental Health Services

Owen Xingjian Zhang, Shuyao Zhou, Jiayi Geng, Yuhan Liu, Sunny Xun Liu

Preprint

Understanding student opinions on LLM-assisted mental health services in campus counseling.

arXiv

Language Models as Science Tutors

Alexis Chevalier, Jiayi Geng, Alexander Wettig, Howard Chen, Sebastian Mizera, Toni Annala, Max Jameson Aragon, Arturo Rodriguez Fanlo, Simon Frieder, Simon Machado, Akshara Prabhakar, Ellie Thieu, Jiachen T Wang, Zirui Wang, Xindi Wu, Mengzhou Xia, Wenhan Xia, Jiatong Yu, Jun-Jie Zhu, Zhiyong Jason Ren, Sanjeev Arora, Danqi Chen

ICML 2024

Exploring the effectiveness of language models as science tutors for educational applications.

arXiv | Code

Corgi-PM: A Chinese Corpus for Gender Bias Probing and Mitigation

Ge Zhang, Yizhi Li, Yaoyao Wu, Linyuan Zhang, Chenghua Lin, Jiayi Geng, Shi Wang, Jie Fu

Preprint

A Chinese corpus designed for probing and mitigating gender bias in language models.

arXiv | Code