The Agents4Science 2025 conference revealed significant limitations in AI’s ability to conduct independent scientific research. This unique event, where large language models served as primary authors and reviewers for all 47 accepted papers, exposed fundamental weaknesses suggesting AI still requires substantial human oversight in research contexts.
What you should know: AI systems struggled with basic research tasks that human scientists take for granted, from maintaining focus to citing accurate sources.
- OpenAI’s ChatGPT and Anthropic’s Claude had difficulty maintaining context and focus while simulating two-sided job marketplaces, requiring constant reminders to keep supporting documents updated.
 
- Google’s Gemini repeatedly fabricated sources while analyzing San Francisco’s 2020 policy that reduced towing fees for low-income drivers, according to University of California, Berkeley researchers.
 
- The AI agents generated redundant code and text until human collaborators intervened to provide guidance and corrections.
 
The big picture: This conference represents the first systematic evaluation of AI systems as independent researchers rather than assistants.
- James Zou, the event’s co-organizer and a Stanford University AI researcher, accepted 47 papers from more than 300 submissions, each listing an AI system as sole first author.
 
- Each paper featured large language models taking the lead in both research and writing, with human experts observing their performance.
 
Why this matters: The findings suggest that despite rapid AI advancement fueled by US-China competition, the technology still faces fundamental barriers to replacing human expertise in scientific research.
- AI’s tendency to hallucinate references and struggle with contextual consistency indicates that current models lack the reliability required for independent scientific work.
 
- The conference highlights the gap between AI’s impressive capabilities in narrow tasks and the complex, sustained reasoning required for rigorous research.
 