MiniMax M3 vs Claude on SWE-Bench Pro: How to Read Open-Weight Coding Scores

June 2026 · 6 min read · Technical

When MiniMax released M3 in June 2026, the headline was a 59.0% score on SWE-Bench Pro, a hard coding benchmark. Claude Opus 4.8, from Anthropic, sits at a reported 69.2% on the same test. For an Australian team deciding what to build on, the gap and the context behind it matter more than either number on its own.

What SWE-Bench Pro measures

SWE-Bench Pro tests whether a model can resolve real software issues drawn from actual pull requests across actively maintained open-source repositories. It is harder than the older SWE-Bench Verified, which many models have nearly saturated. A score reflects how often the model produced a working fix that passed the project's own tests. That makes it more useful than a trivia-style benchmark, because it rewards getting real code to work rather than reciting facts.

Two things are worth keeping in mind when you read these scores:

  • The launch figures came from MiniMax's own evaluation. Independent results from groups like Artificial Analysis were not yet published, so treat the number as a starting point, not a settled fact.

  • A ten-point gap on this benchmark can widen on messy, real-world tasks, where small errors compound across the many steps a single ticket can take.
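
The compounding effect in the second point can be sketched with a very simple model. It assumes, purely for illustration, that a ticket takes a fixed number of independent steps and that every step succeeds with the same probability. Real tickets are not this tidy, and the per-step figures below are hypothetical, not measurements of either model.

```python
def ticket_success_rate(per_step_success: float, steps: int) -> float:
    """Chance that every step of a multi-step ticket succeeds.

    Assumes steps are independent and equally reliable, which is a
    simplification made only for this illustration.
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return per_step_success ** steps


# Hypothetical per-step reliabilities for two models (not measured values).
MODEL_A_STEP = 0.98
MODEL_B_STEP = 0.96

for steps in (1, 10, 20):
    a = ticket_success_rate(MODEL_A_STEP, steps)
    b = ticket_success_rate(MODEL_B_STEP, steps)
    print(f"{steps:>2} steps: A={a:.3f}  B={b:.3f}  gap={a - b:.3f}")
```

Under these assumptions a two-point difference per step grows into a much larger difference per ticket as the step count rises, which is the pattern the bullet describes.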

Why the gap matters in practice

A coding model that fails roughly two fixes in five behaves very differently from one that fails fewer than one in three once it is wired into an automated workflow. Each failed attempt costs retries, review time, and a little trust. For a system that runs unattended overnight, the more reliable model usually wins on total cost, even when its per-token price is higher.

That is why we build client coding and automation systems on Claude. The reliability across varied tasks holds up when the work leaves the benchmark and meets a real codebase, a real ticket queue, and a real set of business rules. A lead on clean test cases does not always survive that transition.

What a leaderboard number hides

A single benchmark score summarises thousands of attempts into one figure. That is convenient for a headline and thin for a decision. Before you trust a score, ask what it is averaging over and what it leaves out.

Four things sit underneath the number and rarely make the chart:

  • Pass rate by repository, not just the blended average. A model can be strong on Python web apps and weak on the codebase you actually run.

  • Behaviour on long tasks that need many steps, where a model that drifts halfway through is worse than its score suggests.

  • Whether a failed fix fails loudly with a clear error, or quietly ships code that looks right and breaks later.

  • Cost per solved issue rather than cost per token, which is the number your budget actually feels.
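
Two of the items above, pass rate by repository and cost per solved issue, are easy to compute once you log each attempt. The following is a minimal sketch, assuming you record a repository name and a pass/fail flag per attempt. The function names, repository names, and spend figures are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pass_rate_by_repo(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Split a blended pass rate into one rate per repository.

    `results` is a sequence of (repo_name, passed) pairs, one per attempt.
    """
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for repo, passed in results:
        totals[repo] += 1
        if passed:
            passes[repo] += 1
    return {repo: passes[repo] / totals[repo] for repo in totals}


def cost_per_solved_issue(total_spend: float, attempts: int, solve_rate: float) -> float:
    """Total spend divided by the number of issues actually solved."""
    solved = attempts * solve_rate
    if solved <= 0:
        raise ValueError("no solved issues, so cost per solved issue is undefined")
    return total_spend / solved


# Hypothetical attempt log: the blended rate is 50%, but the repos differ a lot.
attempt_log = [
    ("web-app", True), ("web-app", True), ("web-app", True), ("web-app", False),
    ("billing-core", False), ("billing-core", False), ("billing-core", True),
    ("billing-core", False),
]
print(pass_rate_by_repo(attempt_log))

# Hypothetical spend: the same $1,000 bill looks different per solved issue.
print(cost_per_solved_issue(total_spend=1000.0, attempts=100, solve_rate=0.5))
```

The design choice here is to keep the raw per-attempt log rather than only the average, so the same data can answer both questions later.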

A worked example on cost

Put rough numbers on it. Say a Sydney team runs an agent that opens 1,000 pull requests a month. At a 69% solve rate the model clears about 690 and leaves 310 for a human. At 59% it clears 590 and leaves 410. That is roughly 100 extra failures a month landing on a senior engineer's desk.

If each failed attempt costs about half an hour of senior review at $120 an hour, that is $60 per failure, so the 100 extra failures add about $6,000 a month, or around $72,000 a year. A per-token saving on the cheaper model rarely closes a gap that size. The headline rate that looked like a rounding difference turns into a real line item, which is why the solve rate matters more than the sticker price.
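
The arithmetic in this example can be reproduced in a few lines, which makes it easy to swap in your own volumes and rates. This is a sketch using only the figures from the example above; the function name is ours, and it ignores token costs and retries.

```python
def monthly_review_cost(
    pull_requests: int,
    solve_rate: float,
    review_hours_per_failure: float,
    hourly_rate: float,
) -> float:
    """Human review cost of the pull requests the model fails to solve."""
    failures = round(pull_requests * (1 - solve_rate))
    return failures * review_hours_per_failure * hourly_rate


# Figures taken from the worked example in the text.
higher = monthly_review_cost(1000, 0.69, 0.5, 120.0)  # 310 failures
lower = monthly_review_cost(1000, 0.59, 0.5, 120.0)   # 410 failures

extra_per_month = lower - higher
extra_per_year = extra_per_month * 12

print(f"Extra review cost per month: ${extra_per_month:,.0f}")
print(f"Extra review cost per year:  ${extra_per_year:,.0f}")
```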

When an open-weight coding model still fits

None of this rules out open-weight models. M3, Kimi K2.6, and GLM-5 are capable, and there are sound reasons to run one:

  • Code or data that cannot leave Australian infrastructure for privacy or contractual reasons.

  • Very high request volume, where self-hosting changes the cost maths in the open model's favour.

  • A need to fine-tune the model on a private codebase that a hosted API will not accommodate.

In those cases we test the open-weight model on the team's real repositories before committing, and score it the same way we would score any tool. The decision is about fit, not loyalty to a leaderboard.

A practical evaluation

The reliable way to choose is to run both models against a sample of your own issues and measure the pass rate yourself. A focused evaluation of this kind can be scoped for around $5,000 and gives a far better answer than any public ranking. For an Australian team weighing a coding model, that spend is small against the cost of building on the wrong one. If you want help running the comparison, book a brainstorm.
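
A comparison of this kind can be sketched as below. The code assumes each model is wrapped in a function that takes an issue identifier and returns whether the project's tests passed afterwards; that wrapper, the issue identifiers, and the recorded outcomes are hypothetical placeholders. The Wilson score interval is our addition, included because a small sample gives a noisy pass rate and it helps to see how wide the uncertainty is.

```python
import math
from typing import Callable, Sequence, Tuple


def count_passes(issues: Sequence[str], attempt_fix: Callable[[str], bool]) -> int:
    """Run one model over every issue and count the fixes that passed."""
    return sum(1 for issue in issues if attempt_fix(issue))


def pass_rate(issues: Sequence[str], attempt_fix: Callable[[str], bool]) -> float:
    """Fraction of issues for which the model's fix passed the tests."""
    if not issues:
        raise ValueError("need at least one issue to evaluate")
    return count_passes(issues, attempt_fix) / len(issues)


def wilson_interval(passed: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% confidence interval for a pass rate (Wilson score)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical recorded outcomes standing in for real model runs.
sample_issues = [f"ISSUE-{n}" for n in range(1, 21)]
model_a_passed = set(sample_issues[:14])  # placeholder: 14 of 20 pass
model_b_passed = set(sample_issues[:11])  # placeholder: 11 of 20 pass

for name, passed_set in (("model_a", model_a_passed), ("model_b", model_b_passed)):
    passed = count_passes(sample_issues, lambda issue: issue in passed_set)
    low, high = wilson_interval(passed, len(sample_issues))
    print(f"{name}: {passed}/{len(sample_issues)} passed, "
          f"interval {low:.2f} to {high:.2f}")
```

With only 20 issues the two intervals overlap heavily, so a sample that small would not separate the models. Running both models over the same, larger set of issues is what makes the comparison fair.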