Mixture-of-Experts, Explained: Why 2026's Best Open Models Activate a Fraction of Their Parameters

The biggest open-weight models of 2026 carry enormous parameter counts but only switch on a small slice of them for any given token. That design is called Mixture-of-Experts, and it is the reason a 753-billion-parameter model can run at a sane cost. Here is a plain explanation for Australian technical teams sizing up a deployment, and what the numbers mean for your hardware bill.

What Mixture-of-Experts means

A dense model runs every parameter for every token. A Mixture-of-Experts (MoE) model instead splits most of its parameters into many sub-networks called experts, and a router picks a few of them for each token. The result is a model that is huge in total but cheap to run per request, because most of the experts sit idle on any given pass.
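To make the router's job concrete, here is a minimal sketch of top-k routing for a single token. It assumes a softmax over router scores followed by a top-k pick, which is one common scheme; the models discussed in this article may route differently. The scores, the expert count of 8 and the `route_token` helper are all hypothetical.

```python
import math

def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_scores, top_k=2):
    """Pick the top_k experts for one token and renormalise their weights."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    weight_sum = sum(probs[i] for i in chosen)
    # Only the chosen experts run; every other expert is skipped for this token.
    return {i: probs[i] / weight_sum for i in chosen}

# Hypothetical router scores for one token across 8 experts.
scores = [0.1, 2.3, -1.0, 0.4, 1.9, -0.5, 0.0, 0.7]
routing = route_token(scores, top_k=2)
print(routing)  # maps expert index to mixing weight
```

With these made-up scores, experts 1 and 4 are selected and the other six do no work for this token, which is where the per-request savings come from.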

Recent open models make the gap clear:

GLM-5.2: 753 billion total parameters, about 40 billion active per token.
NVIDIA Nemotron 3 Ultra: 550 billion total parameters, 55 billion active per token.
Zyphra ZAYA1: 8 billion total parameters, just 760 million active per token.
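The gap is easier to compare as a ratio. The short sketch below takes the total and active counts exactly as quoted in the list above and computes the share of parameters active per token; the `active_fraction` helper is just an illustrative name.

```python
# (total parameters, active parameters per token), as quoted above.
models = {
    "GLM-5.2": (753e9, 40e9),
    "NVIDIA Nemotron 3 Ultra": (550e9, 55e9),
    "Zyphra ZAYA1": (8e9, 760e6),
}

def active_fraction(total, active):
    """Share of the model's parameters that run for each token."""
    return active / total

fractions = {name: active_fraction(t, a) for name, (t, a) in models.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of parameters active per token")
```

On these figures the active share works out to roughly 5 percent for GLM-5.2, 10 percent for Nemotron 3 Ultra and 9.5 percent for ZAYA1.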

Why it matters for your hardware bill

Active parameters, not total parameters, drive inference cost. That has direct consequences for an Australian team pricing a deployment:

A model can be enormous on paper yet fit a modest compute budget because only the active slice runs each pass.
Memory still has to hold all the experts, so VRAM needs stay high even when compute is low.
Throughput depends on the active count, which is why a 550B model with 55B active can serve faster than its headline size suggests.
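As a rough illustration of the compute side, the sketch below uses the common rule of thumb that a forward pass costs about 2 floating-point operations per active parameter per token. That is an approximation, not a benchmark: it ignores attention over long contexts, routing overhead and memory bandwidth, all of which affect real throughput.

```python
def forward_flops_per_token(active_params):
    """Approximate forward-pass cost: ~2 FLOPs (multiply + add) per active parameter."""
    return 2 * active_params

# A 550B-total MoE with 55B active, versus a hypothetical dense model of the same total size.
moe = forward_flops_per_token(55e9)
dense_same_size = forward_flops_per_token(550e9)
ratio = dense_same_size / moe

print(f"MoE:   {moe:.2e} FLOPs per token")
print(f"Dense: {dense_same_size:.2e} FLOPs per token")
print(f"The dense model needs about {ratio:.0f}x the compute per token")
```

Under this approximation, the compute per token tracks the 55 billion active parameters rather than the 550 billion total.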

This is why MoE models dominate the open leaderboards: they buy frontier-level quality without frontier-level running costs, at least on the compute side of the ledger.

The trade-offs to plan for

MoE is not free of complications. Before you commit, account for:

Larger memory footprint. You provision for the full parameter set, which can mean $15,000 to $40,000 a month for high-availability hardware in Australia.
Routing variability. Different tokens hit different experts, so latency can vary from request to request.
Fine-tuning complexity. Adjusting an MoE model is harder than a dense one and needs more careful tooling and testing.

How to read a model card

When you compare open models for self-hosting, two numbers matter more than the headline size. Read the active-parameter count to predict compute cost, and read the total count to predict memory cost. A 753-billion-parameter MoE model and a much smaller dense model can land at similar inference costs once routing is accounted for, but they will need very different amounts of VRAM. Size the memory for the full set and the throughput for the active slice, and you will avoid the two most common budgeting mistakes.

What to do with this

If you are comparing open models for a self-hosted deployment, match the active count against your token volume and your GPU budget before you commit, and price the full memory footprint, not just the active slice. The model that looks cheapest by headline size is often not the one that is cheapest to actually run in production.

A quick sizing example

Suppose you are pricing a self-hosted deployment of a 753-billion-parameter MoE model with about 40 billion active per token. The compute you need is driven by that 40-billion active slice, so for throughput the model behaves roughly like a 40-billion-parameter dense one. That is the good news, and it is why the headline size does not wreck your latency budget the way the raw number might suggest.

The memory tells a harder story. You still have to hold all 753 billion parameters in VRAM, so the card or cluster has to be sized for the full set even though only a fraction fires per request. This is the mismatch that catches teams out: they budget for the active count and then find the model will not fit the GPU they bought. Price the memory for the total, and the throughput for the active slice.
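A back-of-the-envelope version of that memory check is sketched below. The bytes-per-parameter values are the standard sizes for each precision, but the 80 GB per-GPU figure is an assumption chosen for illustration, and the estimate covers weights only: KV cache, activations and framework overhead all add to it.

```python
import math

# Storage cost per parameter at common precisions.
BYTES_PER_PARAM = {"fp16/bf16": 2, "int8/fp8": 1, "4-bit": 0.5}

def weight_memory_gb(params, bytes_per_param):
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9

def min_gpus_for_weights(params, bytes_per_param, gpu_memory_gb):
    """Lower bound on GPU count; real deployments need headroom beyond this."""
    return math.ceil(weight_memory_gb(params, bytes_per_param) / gpu_memory_gb)

TOTAL_PARAMS = 753e9   # what you must hold in memory
ACTIVE_PARAMS = 40e9   # what runs per token
GPU_MEMORY_GB = 80     # assumption: an 80 GB accelerator

for precision, nbytes in BYTES_PER_PARAM.items():
    total_gb = weight_memory_gb(TOTAL_PARAMS, nbytes)
    gpus = min_gpus_for_weights(TOTAL_PARAMS, nbytes, GPU_MEMORY_GB)
    print(f"{precision}: {total_gb:,.0f} GB of weights, at least {gpus} GPUs")

# The budgeting mistake: sizing memory from the active count instead of the total.
wrong_gb = weight_memory_gb(ACTIVE_PARAMS, 2)
print(f"Sizing from the active slice at 16-bit suggests only {wrong_gb:,.0f} GB")
```

At 16-bit precision the full weight set comes to about 1,506 GB, while sizing from the active slice would suggest only 80 GB. That difference is the mismatch described above.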

In practice that often puts a high-availability MoE deployment in Australia in the $15,000 to $40,000 a month band, with GPU memory as the dominant line. Run the same workload through a managed model and you pay per token with none of that fixed cost. Which one wins depends entirely on your daily volume, which is why the sizing exercise comes before the architecture, not after it.
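The volume question can be framed as a break-even calculation. In the sketch below the monthly band comes from the paragraph above, but the managed price per million tokens is a hypothetical placeholder, not a quote from any provider. Substitute a real quote in the same currency as your hardware estimate before drawing conclusions.

```python
def breakeven_tokens_per_month(fixed_monthly_cost, managed_price_per_million_tokens):
    """Monthly token volume at which self-hosting's fixed cost equals the managed bill."""
    return fixed_monthly_cost / managed_price_per_million_tokens * 1_000_000

# Hypothetical placeholder price per million tokens; replace with a real managed quote.
MANAGED_PRICE_PER_MILLION = 2.00

for fixed in (15_000, 40_000):
    tokens = breakeven_tokens_per_month(fixed, MANAGED_PRICE_PER_MILLION)
    per_day = tokens / 30
    print(f"${fixed:,}/month breaks even at {tokens / 1e9:.1f}B tokens a month "
          f"(about {per_day / 1e6:.0f}M a day)")
```

Below the break-even volume the managed option is cheaper on these assumptions, and above it the fixed-cost deployment starts to pay for itself. The sketch deliberately leaves out staffing, power and utilisation, which belong in a fuller comparison.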

If you remember one thing, make it this: the headline parameter count tells you almost nothing about what a model costs to run. Read the active count to predict compute, read the total count to predict memory, size each separately, and compare the fully loaded figure against a managed quote before you commit a cent to hardware. Get those two numbers straight and the rest of the deployment decision becomes a matter of arithmetic rather than guesswork.

Want a hand sizing an open-model deployment against your workload? Book a free brainstorm with us.
