Benchmarking Open Source LLMs: SWE-bench and What It Misses

DeepSeek V4 Pro scoring 80.6% on SWE-bench Verified makes headlines, and it should. A number like that signals real progress in open-source coding models. The trouble starts when a figure from a public leaderboard gets treated as a buying decision. Benchmarks measure a narrow slice of what a business actually needs from a model, and reading them with care is what separates a sound choice from an expensive one.

For an Australian team weighing an open-source model against Claude, the question is never which model tops the chart this month. It is which model does your work reliably, at a cost you can predict, on data you control.

What SWE-bench actually measures

The benchmark has genuine value, and dismissing it would be a mistake. SWE-bench Verified, a human-validated subset of SWE-bench, gives a model real issues drawn from open-source repositories and checks whether the patch it writes passes the project's own tests.

It measures real coding task completion on actual repositories, not toy problems
It allows a fair, repeatable comparison across very different models
It tracks the field's progress over months and years
It is harder to game than simpler multiple-choice tests

For coding workloads specifically, a strong SWE-bench score is a meaningful signal that the model can read a codebase, reason about a fix, and produce a patch that holds up.
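To make the pass/fail rule concrete, here is a minimal sketch of how a resolved rate can be computed. SWE-bench distinguishes tests the fix should turn green (fail-to-pass) from tests that must stay green (pass-to-pass). The class, function, and field names below are illustrative assumptions, not the real harness API.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Test outcomes for one task after the model's patch is applied (hypothetical shape)."""
    instance_id: str
    fail_to_pass: dict[str, bool]  # tests the fix should turn green -> did they pass?
    pass_to_pass: dict[str, bool]  # tests that were already green -> still passing?


def is_resolved(result: InstanceResult) -> bool:
    # Resolved only if the target tests now pass and nothing else regressed.
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolved_rate(results: list[InstanceResult]) -> float:
    """Share of instances resolved; this is the headline percentage."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


# Illustrative data only: one clean fix, one fix that broke an existing test.
example = [
    InstanceResult("repo-1", {"test_bug": True}, {"test_other": True}),
    InstanceResult("repo-2", {"test_bug": True}, {"test_other": False}),
]
print(resolved_rate(example))  # 0.5
```

The second instance shows why the rule is strict: the patch fixes the reported bug, but it counts as unresolved because it breaks a test that used to pass.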

Where the benchmark goes quiet

A score is not a deployment, and the gaps are exactly where teams get caught. SWE-bench tells you almost nothing about how a model behaves once it leaves the test harness and meets your actual systems.

Reliability across long, multi-step agent runs where small errors compound
Behaviour on your specific data, file layouts, and internal conventions
Operational cost, rate limits, and the ongoing support burden of self-hosting
Performance on the non-coding work that fills most real projects

A model that fixes a clean open-source bug in isolation can still stall on a messy internal repository carrying years of accumulated quirks. The benchmark cannot see any of that, because it was never built to.
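The compounding effect in long agent runs is easy to underestimate, so a rough calculation helps. The sketch below assumes every step succeeds independently with the same probability, which is a simplification: real agent steps are correlated, and agents sometimes recover from their own mistakes. The rates and step counts are illustrative, not measurements of any model.

```python
def run_success_probability(per_step_success: float, steps: int) -> float:
    """Chance that every step in a run succeeds, assuming independent steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


for steps in (1, 10, 50):
    print(steps, round(run_success_probability(0.99, steps), 3))
# prints:
# 1 0.99
# 10 0.904
# 50 0.605
```

Under these assumptions, a model that gets 99% of individual steps right still finishes only about six in ten fifty-step runs cleanly. A benchmark built from short, self-contained fixes does not expose that.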

The cost of trusting a leaderboard

A model can top a public benchmark and still cost an Australian team $50,000 in wasted effort if it does not fit the real workload. That figure is not abstract. It is engineer time spent fighting flaky agent runs, re-reviewing output that looked right and was not, and rebuilding infrastructure when a self-hosted model cannot keep up. For a mid-sized Sydney team, a stalled six-week pilot can quietly burn $120,000 before anyone calls it.

Regulated industries carry an extra layer. A bank or insurer answering to APRA cannot ship a model into production on the strength of a leaderboard ranking. They need evidence the model performs on their tasks, with their controls in place, and a record of how that was tested.

The test that actually counts

The only evaluation that settles the question is one run on your own tasks.

Treat public benchmarks as a shortlist tool, not a final answer
Build a small evaluation set from your real, day-to-day work
Measure cost and reliability alongside raw accuracy
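As a hedged illustration of measuring all three together, the sketch below summarises a set of run records. It assumes each task is attempted several times per model and that you log a pass/fail flag and a cost for every attempt. The `Run` record and the three metric definitions are hypothetical choices for this example, not a standard.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Run:
    task_id: str
    passed: bool
    cost: float  # spend for this attempt, in whatever currency you track


def summarise(runs: list[Run]) -> dict[str, float]:
    by_task: dict[str, list[Run]] = defaultdict(list)
    for run in runs:
        by_task[run.task_id].append(run)

    passes = sum(r.passed for r in runs)
    total_cost = sum(r.cost for r in runs)

    # Accuracy: share of all attempts that passed.
    accuracy = passes / len(runs) if runs else 0.0
    # Reliability: share of tasks that passed on every attempt.
    always_passed = sum(all(r.passed for r in rs) for rs in by_task.values())
    reliability = always_passed / len(by_task) if by_task else 0.0
    # Cost per success: total spend divided by usable results.
    cost_per_success = total_cost / passes if passes else float("inf")

    return {
        "accuracy": accuracy,
        "reliability": reliability,
        "cost_per_success": cost_per_success,
    }
```

Separating accuracy from reliability matters because a model that passes three attempts in four on every task looks the same on raw accuracy as one that always passes three tasks and always fails the fourth, yet the two behave very differently in production.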

We benchmark models on your real tasks rather than public leaderboards, and we keep Claude as the baseline for reliability so every comparison has a known floor. If you want help setting that up, book a short technical session.

Building an evaluation from your own work

A public score is a starting point. Your own test is the decision. Building one is less work than most teams expect, and it pays for itself the first time it stops a bad choice.

Collect twenty to fifty real tasks that reflect your daily work
Score each model on quality, cost, and reliability, not accuracy alone
Re-run the test whenever you consider switching models or versions
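One way to structure such a test is sketched below. Everything here is an assumption for illustration: `Task`, `evaluate`, and the `model` callable are hypothetical names, and `check` stands in for whatever acceptance test fits your work, such as a unit test run, a reviewer's rubric, or an exact-match comparison.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    task_id: str
    prompt: str
    check: Callable[[str], bool]  # returns True if the output is acceptable


def evaluate(
    model: Callable[[str], str], tasks: list[Task], repeats: int = 3
) -> dict[str, list[bool]]:
    """Run every task `repeats` times and record pass/fail per attempt."""
    results: dict[str, list[bool]] = {}
    for task in tasks:
        attempts: list[bool] = []
        for _ in range(repeats):
            try:
                attempts.append(bool(task.check(model(task.prompt))))
            except Exception:
                # A crash or timeout counts as a failure, not a skipped task.
                attempts.append(False)
        results[task.task_id] = attempts
    return results


# Stand-in model for demonstration; swap in a real client call.
def echo_model(prompt: str) -> str:
    return prompt.upper()


tasks = [Task("t1", "hello", lambda out: out == "HELLO")]
print(evaluate(echo_model, tasks))  # {'t1': [True, True, True]}
```

Putting each model behind the same callable means that re-running the test for a new model or version is a single extra call against an unchanged task set, which keeps comparisons like for like.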

For an Australian team, this small investment prevents the costly mistake of choosing a model because it won a leaderboard on work that looks nothing like yours. The model that wins your test is the one worth deploying, whatever the public charts say.

Where Claude fits in the comparison

None of this is an argument against open-source models. DeepSeek, Qwen, and others have closed much of the gap, and for the right workload a self-hosted open model can be the correct call on both cost and control. The argument is against deciding from a distance. We start most engagements with Claude as the reliability baseline, then test open-source contenders against it on your tasks.

Sometimes the open model wins on price for a narrow job. Sometimes Claude's consistency across long agent runs is worth far more than the cost saving. The point is that you find out before you commit, not after. Run the benchmark that matters, the one built from your own work, and the choice between an open-source model and Claude stops being a guess.
