MoE Architecture: Active vs Total Parameters Explained

Modern open models now advertise eye-watering parameter counts. DeepSeek V4 Pro, for instance, lists roughly 1.6 trillion total parameters but activates only about 49 billion of them for any single token. For an Australian team planning a self-hosted deployment, the gap between those two numbers matters more than either figure alone, because it decides how much hardware you have to buy before you serve a single request and what you pay every month to run the model.

Mixture-of-experts, usually shortened to MoE, is the architecture behind that gap. Get the two numbers straight and you can size hardware to the work the model actually does. Confuse them and you will either buy compute you never use to capacity or starve the model of the memory it needs. This guide explains the difference in plain language, then turns it into a sizing decision you can defend to a finance team.

What a mixture-of-experts model is

A dense model runs every one of its parameters for every token it processes. A mixture-of-experts model takes a different route. It splits its parameters into many smaller groups, called experts, and adds a routing layer that selects only a handful of them for each token. The remaining experts sit idle for that particular request. The effect is a model that is enormous on paper yet behaves like a much smaller one on any single call.
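
The routing step described above can be sketched in a few lines. This is a minimal illustration, not any particular model's router: it assumes softmax scoring with top-k selection, and the expert count (16) and active count (2) are hypothetical values chosen for readability.

```python
import math
import random

def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_scores, top_k):
    """Pick the top_k highest-scoring experts and renormalise their weights."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    weight_sum = sum(probs[i] for i in chosen)
    return {i: probs[i] / weight_sum for i in chosen}

# Hypothetical sizes for illustration only, not a real model's configuration.
NUM_EXPERTS, TOP_K = 16, 2
rng = random.Random(0)
scores = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in for learned router output
selected = route_token(scores, TOP_K)

print(f"experts used for this token: {sorted(selected)} of {NUM_EXPERTS}")
print(f"experts left idle for this token: {NUM_EXPERTS - len(selected)}")
```

Each token gets its own selection, so an expert that is idle for one token may fire for the next.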

This is why open labs can publish trillion-parameter headline figures while still claiming the model is cheap to run. Both statements can be true at the same time, because the headline counts the whole model and the running cost tracks only the slice that actually fires. The marketing number and the invoice number are measuring different things.

Total parameters versus active parameters

Two numbers describe any MoE model, and each one controls a different part of your cost.

Total parameters set the download size, the storage footprint, and the memory needed to hold the loaded weights
Active parameters set the compute and memory bandwidth spent on each token of every request
A routing layer decides which experts fire for any given token
Only that active set does real work on a single call, no matter how large the total

DeepSeek V4 Pro is a clean example to hold in mind. The 1.6 trillion total figure tells you how much disk space and loaded memory the complete weights demand. The roughly 49 billion active figure tells you how hard each query drives the GPUs. The first is a fixed cost you pay whether or not anyone uses the system, and the second scales with how much traffic you actually serve.
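
As a rough worked example, the two figures quoted above can be turned into a weight footprint. The parameter counts come from this article; the bytes-per-parameter values are assumptions about common storage precisions, so check the model card for what is actually shipped.

```python
TOTAL_PARAMS = 1.6e12   # total parameters, as quoted in the article
ACTIVE_PARAMS = 49e9    # active parameters per token, as quoted in the article

# Assumed storage precisions (bytes per parameter), for illustration only.
BYTES_PER_PARAM = {"16-bit": 2.0, "8-bit": 1.0, "4-bit": 0.5}

def weight_terabytes(params, bytes_per_param):
    """Approximate size of the weights alone, in decimal terabytes."""
    return params * bytes_per_param / 1e12

active_share = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"active share of total: {active_share:.1%}")

for name, width in BYTES_PER_PARAM.items():
    full = weight_terabytes(TOTAL_PARAMS, width)
    active = weight_terabytes(ACTIVE_PARAMS, width)
    print(f"{name}: full weights ~{full:.2f} TB, active slice ~{active:.3f} TB")
```

On these figures the active slice is about 3% of the total, which is why the fixed cost and the per-query cost end up so far apart.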

Why the gap decides your hardware bill

When the active count is a small fraction of the total, the two figures pull your budget in opposite directions. Sizing to the wrong one is the most common and most costly mistake teams make with these models.

Size GPU compute to the active parameters and your real peak concurrency, not the headline total
Plan storage and loaded memory for the full weights, because the router can call on any expert for the next token
Test a realistic workload before you commit to any capacity at all
Recheck the sizing whenever you change models, context length, or quantisation

A team in Sydney that quotes GPU compute against the trillion-parameter total can spend $45,000 a year on hardware the workload never touches. Size compute to the active count instead and the same model often runs comfortably on a far smaller and cheaper footprint, sometimes a single node where the naive plan called for a whole rack, provided that node still has enough memory to hold the full weights.
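
A back-of-envelope sketch makes the two constraints concrete: memory is bounded below by the full weights, while compute tracks the active count. It rests on stated assumptions: the common rule of thumb of roughly 2 FLOPs per active parameter per generated token, 8-bit weights, a hypothetical 128 GB accelerator, and a hypothetical load of 2,000 tokens per second across all users. Treat the output as a starting point for testing, not a quote.

```python
import math

def gpus_for_memory(total_params, bytes_per_param, gpu_memory_gb, usable_fraction=0.8):
    """GPUs needed just to hold the full weights.

    usable_fraction is an assumed allowance that leaves room for cache and activations.
    """
    weight_gb = total_params * bytes_per_param / 1e9
    return math.ceil(weight_gb / (gpu_memory_gb * usable_fraction))

def compute_tflops_needed(params_per_token, tokens_per_second):
    """Rule of thumb: about 2 FLOPs per parameter touched per generated token."""
    return 2 * params_per_token * tokens_per_second / 1e12

TOTAL_PARAMS = 1.6e12    # from the article
ACTIVE_PARAMS = 49e9     # from the article
TOKENS_PER_SECOND = 2000 # hypothetical aggregate load

memory_floor = gpus_for_memory(TOTAL_PARAMS, bytes_per_param=1.0, gpu_memory_gb=128)
moe_compute = compute_tflops_needed(ACTIVE_PARAMS, TOKENS_PER_SECOND)
naive_compute = compute_tflops_needed(TOTAL_PARAMS, TOKENS_PER_SECOND)

print(f"GPUs needed to hold the weights: {memory_floor}")
print(f"compute sized to active params: {moe_compute:.0f} TFLOP/s")
print(f"compute sized to total params:  {naive_compute:.0f} TFLOP/s")
```

The gap between the last two lines is the overspend the naive plan bakes in. The first line is the floor that no amount of routing removes.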

Sizing a deployment without overspending

The practical method is short, and it is worth following in order rather than skipping ahead to the purchase.

Read both numbers on every model you shortlist, and record them side by side
Quote compute against active parameters and your measured peak concurrent requests
Budget storage separately for the complete set of weights and any backups
Validate the figures against your own traffic, not a vendor benchmark
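
Measured peak concurrency is the input teams most often guess at, so here is one way to compute it from your own request log. This is a sketch: the log format, a list of (start, end) timestamps in seconds, is an assumption, and `sample_log` is made-up data.

```python
def peak_concurrency(requests):
    """Return the largest number of requests in flight at the same moment.

    requests is a list of (start_seconds, end_seconds) pairs.
    """
    events = []
    for start, end in requests:
        events.append((start, 1))   # a request begins
        events.append((end, -1))    # a request finishes
    # Sort by time; at equal timestamps, finishes (-1) are processed before starts (+1),
    # so a request ending exactly when another begins does not count as overlap.
    events.sort(key=lambda e: (e[0], e[1]))
    in_flight = peak = 0
    for _, delta in events:
        in_flight += delta
        peak = max(peak, in_flight)
    return peak

# Made-up sample data for illustration.
sample_log = [(0.0, 4.0), (1.0, 3.0), (2.0, 6.0), (4.0, 5.0), (7.0, 8.0)]
print(f"peak concurrent requests: {peak_concurrency(sample_log)}")
```

Run it over a representative busy period rather than an average day, since the quote has to cover the peak.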

Getting this wrong is not a rounding error. An Australian SMB that sizes to the total count can waste $30,000 a year on idle capacity, while a correctly sized cluster might deliver the same output for closer to $12,000. The active figure is the one that pays the bills, and it is usually a small fraction of the impressive number printed on the model card.

Reading a model card without being misled

Model cards lead with the number that sounds most impressive, which is almost always the total. A little discipline keeps you from buying the headline rather than the model.

Find the active parameter count, even when it is buried below the total
Check the expert count and how many of them fire per token
Note the context length, since it changes memory needs independently of parameters
Confirm the licence permits your commercial use before you plan any hardware

These few checks take minutes and routinely save an Australian business from sizing a cluster around a figure that has nothing to do with its real running cost. The headline sells the model, but the active count is what runs it day to day.
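
To see why context length deserves its own line in the budget, the sketch below estimates key-value cache memory for a standard attention layout. Every configuration value here is hypothetical, and the formula assumes an uncompressed cache with separate keys and values per layer. Architectures that compress the cache will need less, so read the result as an upper-bound style estimate.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens, sequences, bytes_per_value):
    """Uncompressed KV cache size: keys plus values, for every layer and cached token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = keys and values
    return per_token * context_tokens * sequences

# Hypothetical configuration, not taken from any real model card.
CONFIG = dict(layers=60, kv_heads=8, head_dim=128, bytes_per_value=2)

for context in (8_192, 32_768, 131_072):
    size = kv_cache_bytes(context_tokens=context, sequences=1, **CONFIG)
    print(f"context {context:>7} tokens: ~{size / 1e9:.1f} GB per concurrent sequence")
```

The cache grows linearly with both context length and the number of concurrent sequences, and neither appears anywhere in the parameter counts.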

Where a managed model changes the maths

Every line of this sizing work exists because you have decided to run the model yourself. A managed model removes the question, because the provider owns the GPUs, the routing layer, and the spare capacity that sits idle overnight. For most Australian businesses, that trade is the cheaper one once the cost of the engineering time spent sizing, patching, and monitoring a cluster is counted honestly rather than ignored.

We size open-model deployments properly when self-hosting genuinely fits the workload, and we keep Claude as the default where managed simplicity wins on total cost. If you are weighing a self-hosted MoE model against a managed build, book a technical review at our contact page.
