SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?
Samuel Miserendino, Michele Wang, Tejal Patwardhan, Johannes Heidecke
International Conference on Machine Learning 2025 · Oral
As large language models (LLMs) demonstrate increasingly advanced reasoning and code-generation capabilities, a critical question emerges: how well do these models perform on complex, real-world software engineering tasks that carry tangible economic value? The talk by Samuel Miserendino and co-authors introduces **SWE-Lancer**, a benchmark designed to answer precisely this question. SWE-Lancer stands apart by challenging frontier LLMs with over 1,400 genuine software engineering tasks sourced from **Upwork**, collectively valued at more than one million dollars in potential payouts. The benchmark moves beyond conventional code-generation evaluations by focusing on end-to-end problem-solving within a simulated real-world environment.
AI review
SWE-Lancer is a competently assembled benchmark for evaluating LLMs on real-world freelance software engineering tasks, with some genuinely useful design choices — most notably end-to-end Playwright testing and the inclusion of economically grounded task values. However, the contribution is primarily engineering infrastructure, not research insight. The empirical findings are shallow, the theoretical framing is nonexistent, and the headline result ('Claude 3.5 Sonnet earned $58,000') is a marketing number, not a scientific claim. The work sits at the intersection of benchmark construction and…
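To make the "economically grounded task values" design concrete: each task carries its real Upwork price, and a model earns that amount only if the task's end-to-end test suite passes. The sketch below is a hypothetical illustration of that all-or-nothing scoring rule, not the authors' actual harness; the `Task` type and `total_earnings` function are invented names.

```python
# Hypothetical sketch of SWE-Lancer-style payout grading (not the paper's code).
# A task earns its full Upwork value only when its end-to-end tests all pass.
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    payout_usd: float    # real price attached to the Upwork task
    tests_passed: bool   # outcome of the end-to-end (e.g. Playwright) suite


def total_earnings(tasks: list[Task]) -> float:
    """All-or-nothing: sum the payouts of tasks whose full test suite passed."""
    return sum(t.payout_usd for t in tasks if t.tests_passed)


tasks = [
    Task("upwork-001", 250.0, True),
    Task("upwork-002", 1000.0, False),  # partial credit is never awarded
    Task("upwork-003", 500.0, True),
]
print(total_earnings(tasks))  # 750.0
```

This framing is what lets the benchmark report results in dollars rather than pass rates, which is also why the review calls the headline dollar figure a marketing number: it is a weighted pass rate, where the weights are market prices.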