Quantisation and Open Source Models: Fitting Big Models on Small GPUs

Quantisation shrinks a large model so it runs on cheaper hardware. For Australian teams watching GPU spend, that can be the difference between a project that gets funded and one that stays on the whiteboard. It is not a free win, though. Quantisation trades a small amount of model quality for a large drop in cost, and the size of that trade depends on the model and on the work you ask it to do. This post explains how the technique works, where the savings come from, and how to test whether the quality holds before you rely on it in production.

What quantisation actually does

Most open source models ship in what this post calls full precision, where every weight inside the model is stored as a 16-bit or 32-bit number. Quantisation rewrites those weights using fewer bits, often 8 or even 4. The model gets smaller on disk and uses far less memory while running, because each weight now takes a fraction of the space. The maths still works; each number just carries less detail.
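To make the idea concrete, here is a minimal sketch of one simple scheme, symmetric per-tensor quantisation, in plain Python. It is an illustration only: the function names and the sample weights are made up for this example, and production quantisation methods are considerably more sophisticated (they typically work on small groups of weights and calibrate against real data).

```python
def quantise(weights, bits=8):
    """Map floats to signed integers using one shared scale (symmetric scheme)."""
    qmax = 2 ** (bits - 1) - 1               # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest integer step and clamp to the valid range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantise(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]


# Hypothetical weights, chosen only for illustration.
weights = [0.12, -0.53, 0.91, -0.07, 0.33]

for bits in (8, 4):
    q, scale = quantise(weights, bits)
    restored = dequantise(q, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit integers: {q}, worst rounding error: {worst:.4f}")
```

Running it shows the trade in miniature: the 4-bit version stores each weight in half the space of the 8-bit version, and its worst rounding error is noticeably larger.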

The practical effect is that a model that once needed an expensive multi-GPU server can sometimes fit on a single consumer card. For a small team, that changes the question from whether self-hosting is affordable at all to simply which card to buy.
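A rough back-of-the-envelope calculation shows why. The sketch below estimates the memory needed for the weights alone at different bit widths. The 70-billion-parameter model size is a hypothetical figure picked for illustration, not a reference to any particular model.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


PARAMS_BILLIONS = 70  # hypothetical model size, for illustration only

for bits in (16, 8, 4):
    gb = weight_memory_gb(PARAMS_BILLIONS, bits)
    print(f"{bits}-bit weights: about {gb:.0f} GB")
```

Treat these figures as a lower bound. A running model also needs memory for activations and the attention cache, and quantised formats store a small amount of extra data such as scale factors, so leave headroom when you size a card.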

Where the savings show up

The benefit is mostly about cost, and at the scale most Australian businesses run, it adds up quickly.

Lower memory use, which means smaller and cheaper GPUs
Often faster inference on the same hardware, because less data moves through memory per request, so you serve more requests per dollar
A path to running larger models on a modest budget
More room to run several workloads on one machine instead of many

A Sydney team self-hosting an open model might be staring at a $40,000 GPU bill for full precision. A well-chosen quantised build can cut that to a fraction, sometimes bringing the same workload onto hardware that costs closer to $8,000. The saving is real, but only if the quality holds for your tasks.

The trade-off you cannot ignore

Nothing about quantisation is free. Squeezing the numbers down costs a little accuracy, and how much depends on the model and the job.

Some loss of accuracy, often small but always real
Extra care needed on tasks that demand precision, like figures, code, or legal wording
Testing required to confirm the quality holds for your specific work
Not every model quantises equally well, so results vary by architecture

On a casual summarising task you may never notice the difference. On a task where a single wrong number reaches a customer or a regulator, a small quality drop can cost far more than the hardware ever saved.

How to decide with evidence

The honest answer to whether quantisation is right for you is that it depends, and the only way to settle it is to measure. General benchmark scores tell you very little about how a quantised model behaves on your data.

Measure quality on your real tasks before and after quantising
Be cautious on precision-sensitive work and test it harder
Keep a full-precision option available for the jobs that matter most
Re-test when you change models or upgrade the quantisation method
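The checklist above can be sketched as a small before-and-after comparison. Everything here is an assumption made for illustration: the two "models" are stand-in functions backed by lookup tables, the test cases are invented, exact-match scoring is the simplest possible metric, and the 2 per cent tolerance is a placeholder you would set from your own risk appetite. In practice you would swap in calls to your full-precision and quantised deployments and a scoring rule that suits the task.

```python
from typing import Callable, Dict, List, Tuple

Case = Tuple[str, str]  # (prompt, expected answer)


def accuracy(model: Callable[[str], str], cases: List[Case]) -> float:
    """Share of cases where the model's answer exactly matches the expected one."""
    hits = sum(1 for prompt, expected in cases if model(prompt).strip() == expected)
    return hits / len(cases)


def compare(baseline: Callable[[str], str],
            quantised: Callable[[str], str],
            cases: List[Case],
            max_drop: float = 0.02) -> Dict[str, object]:
    """Score both models on the same cases and flag whether the drop is tolerable."""
    base = accuracy(baseline, cases)
    quant = accuracy(quantised, cases)
    drop = base - quant
    return {"baseline": base, "quantised": quant, "drop": drop, "accept": drop <= max_drop}


# Invented cases standing in for your real, precision-sensitive tasks.
cases: List[Case] = [
    ("GST on $200?", "$20"),
    ("Days in a leap year?", "366"),
    ("12 x 12?", "144"),
    ("Half of 90?", "45"),
]

# Stand-in models: simple lookups in place of real inference calls.
full_answers = {prompt: expected for prompt, expected in cases}
quant_answers = dict(full_answers, **{"12 x 12?": "124"})  # one simulated slip


def full_precision_model(prompt: str) -> str:
    return full_answers[prompt]


def quantised_model(prompt: str) -> str:
    return quant_answers[prompt]


report = compare(full_precision_model, quantised_model, cases)
print(report)
```

Keeping the cases in a fixed file and re-running the same comparison whenever the model or the quantisation method changes turns the last item on the list into a routine check rather than a one-off exercise.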

We test quantisation carefully before recommending it, and we keep Claude as the default where precision matters most. If you are weighing self-hosting against a managed model for an Australian workload, book a brainstorm and we will map the trade-off against your numbers.

Quantisation and the compliance question

Cost is not the only reason teams look at open models. Some want data to stay on hardware they control, which matters under the Privacy Act and for businesses with strict data handling rules. Quantisation makes that self-hosted path cheaper, which is why the two often come up together. The caution is the same: a quantised model that drifts on accuracy can create a compliance risk of its own if it mishandles sensitive information, so the testing discipline matters even more here.

Where Claude fits

Open source and quantisation have a clear place. They lower the cost of self-hosting for tasks that tolerate a small quality margin, and they let a budget stretch further. For the work where accuracy is the whole point, a managed model like Claude removes the tuning and testing burden and gives you a known quality bar. Most Australian businesses we work with end up with both: quantised open models behind the scenes where review catches errors, and Claude where the output goes straight to a customer or a decision. The skill is knowing which task belongs in which lane, and proving it with evidence rather than guesswork.
