A new study has shown that even the most advanced AI models can barely scrape by on real-world programming tasks.
In its latest experiment, OpenAI tested AI models against real engineering challenges. The best performer, Anthropic's Claude 3.5 Sonnet, solved a dismal 26.2 per cent of the hands-on coding tasks and got just 44.9 per cent of the technical management decisions right.
The study used a benchmark called SWE-Lancer, built from 1,488 real freelance engineering tasks tied to Expensify's codebase and worth a combined $1 million in payouts. Even with this well-defined dataset, AI struggled to match human expertise.
While the AI models excelled at finding relevant code snippets, they floundered when asked to reason about how different parts of a program work together. The best they could manage were shallow, surface-level fixes that failed to account for deeper software interactions.
Unlike previous AI coding tests that rely on simplistic algorithm puzzles, OpenAI's benchmark replicated real-world software development. Tasks ranged from quick $50 bug fixes to intricate $32,000 feature implementations, with every solution verified by end-to-end tests that simulate real user workflows.
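To picture what that kind of verification looks like, the sketch below shows the general shape of a Playwright-style end-to-end check that exercises a bug fix through the user interface rather than through isolated unit tests. It is purely illustrative: the URL, selectors, and test scenario are hypothetical and are not taken from the actual SWE-Lancer grading harness.

```ts
// Illustrative only: a hypothetical end-to-end check in the style used to
// grade freelance bug-fix tasks. The URL, selectors, and workflow are invented.
import { test, expect } from '@playwright/test';

test('expense report total updates after editing an entry', async ({ page }) => {
  // Open the (hypothetical) app and navigate to an existing report.
  await page.goto('https://staging.example.test/reports/42');

  // Drive the fix through the UI, the way a real user would.
  await page.getByRole('button', { name: 'Edit expense' }).click();
  await page.getByLabel('Amount').fill('125.00');
  await page.getByRole('button', { name: 'Save' }).click();

  // The task only counts as solved if the user-visible outcome is correct.
  await expect(page.getByTestId('report-total')).toHaveText('$125.00');
});
```

The point of grading this way is that the shallow, surface-level patches the study describes can pass narrow unit tests yet still fail once the whole user flow is exercised.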