Why Quantized Small Models Will Dominate AI's Future — An Honest Take

The shift from giant generalists to tiny specialists

For the last few years, the AI world has been obsessed with one thing: scale. Bigger models, more parameters, more GPUs, more data. Every new breakthrough, every new LLM, seemed to come from adding billions more parameters to an already massive architecture.

But something interesting is happening beneath the surface — a quiet counter-revolution. A growing number of researchers, developers, and businesses are realising that small, quantized models might actually be the future of AI.

And honestly, they might be right. Here is my take on why, for specific business use cases, it often makes more sense to focus on small quantized models than to reach for a large model every time.

Quantization

What actually shrinks an LLM's footprint without changing its parameter count? Quantization. It is a technique that reduces the precision of the model's numerical weights. Instead of storing values in 32-bit or 16-bit floating point, a quantized model uses lower-precision formats such as INT8, INT4, or even INT2. This cuts memory usage substantially, usually speeds up inference (which is often limited by memory bandwidth), and lets models run on mobile devices, edge hardware, and low-power environments. At INT8 and INT4 the accuracy drop is often small; at INT2 it tends to be more noticeable.
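To make the idea concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization using only the Python standard library. It is an illustration of the principle, not how llama.cpp or any particular library implements it; real formats typically quantize in small blocks with one scale per block. The function names and the random "weights" are assumptions made up for this example.

```python
from array import array
import random

def quantize_int8(weights):
    """Symmetric per-tensor quantization onto the int8 range [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale maps the largest weight to +/-127.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [v * scale for v in q]

# Stand-in for a layer's weights: 4096 small random floats stored as 32-bit.
random.seed(0)
weights = array("f", (random.gauss(0.0, 0.02) for _ in range(4096)))

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

fp32_bytes = len(weights) * weights.itemsize
int8_bytes = len(q) * q.itemsize
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("fp32 bytes:", fp32_bytes)
print("int8 bytes:", int8_bytes)
print("max abs error:", max_error, "(bounded by scale / 2 =", scale / 2, ")")
```

The INT8 copy takes a quarter of the space, and the worst-case rounding error per weight is half of one quantization step. That error bound is why a single outlier weight hurts: it stretches the scale for everything else, which is one reason practical schemes use per-block scales.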

Quantization is one of the key enablers of practical, on-device AI, making advanced models lightweight, fast, and accessible without relying on heavy cloud infrastructure. Projects like llama.cpp and the GGUF format have turned this from a research curiosity into something anyone can run on a laptop in an afternoon — as I covered hands-on in Fine-Tune an LLM on Your MacBook with LoRA.

The rise of small, quantized models

Small models are not just "mini versions" of big LLMs. They represent a shift in philosophy: models that are easy to deploy on edge devices, efficient to run, and specialised, with knowledge concentrated on one domain.

Reducing precision from FP32 to INT8, INT4, or even INT2 shrinks a model's weights by roughly 4x, 8x, or 16x. Suddenly, models no longer need massive data centres, expensive GPUs, and piles of resources. Instead, they can run on:

smartphones and laptops
edge devices
IoT hardware
offline environments

This shift is huge. It means AI is no longer locked behind cloud APIs or expensive GPU clusters. It becomes personal, local, and accessible.

Why specialisation beats general knowledge

Large general-purpose LLMs are impressive, but they're also unfocused. They know a little bit about everything, and sometimes that's the problem: the classic jack of all trades, master of none. A small model trained specifically for:

finance
cybersecurity
medical diagnosis
legal reasoning
customer support

…will often outperform a giant general model in that domain.

Why? Because specialisation reduces noise. It narrows the model's knowledge to what actually matters, which tends to cut down hallucinations on in-domain questions and improve reliability.

In the real world, accuracy on your task matters more than parameter count. Microsoft's Phi-3 technical report is a good public data point: a ~3.8B-parameter model that rivals much larger models on standard benchmarks while being small enough to run locally on a phone.

Why businesses should pay attention

Companies are starting to realise they don't need a 70B or 120B parameter model to answer internal HR questions or summarise policy documents. They need:

a 5B model fine-tuned on their domain
quantized to INT4
running locally or on a small server
connected to their internal knowledge base — typically via a RAG pipeline

This approach is cheaper, faster, more secure, and more accurate for their use case. It's not just a technical shift — it's a business advantage. The data-sovereignty side alone (no prompts leaving the VPC, no training on your tickets by a third party) is enough for many regulated industries to prefer this pattern over calling a hosted frontier model.
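A quick back-of-the-envelope calculation shows why the hardware bill changes so much. The sketch below estimates memory for the weights alone as parameters times bits per weight, divided by 8. The helper name is made up for this example, and the figures are lower bounds: they ignore the KV cache, activations, and the per-block scale metadata that real quantized formats store.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# Parameter counts taken from the examples in this article.
configs = [
    ("70B at FP16", 70e9, 16),
    ("70B at INT4", 70e9, 4),
    ("5B at FP16", 5e9, 16),
    ("5B at INT4", 5e9, 4),
]

for label, params, bits in configs:
    print(f"{label}: ~{weight_memory_gb(params, bits):.1f} GB")
```

By this estimate a 70B model at FP16 needs about 140 GB for weights, which means multiple data-centre GPUs, while a 5B model at INT4 needs about 2.5 GB, which fits on an ordinary laptop or a small server.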

On-device AI is the next big leap

Your phone becomes your AI assistant. Your laptop becomes your private analyst. Your smartwatch becomes your personal coach.

This is the future people actually want — AI that lives with them, not somewhere in a distant server farm. Imagine AI that:

works offline
respects your privacy
responds instantly
doesn't send your data to the cloud
doesn't cost a fortune to run

That's exactly what small quantized models enable. Apple is already shipping a ~3B-parameter on-device model as part of Apple Intelligence; Google's Gemini Nano runs inside Android. The tooling (MLX, ONNX Runtime, llama.cpp) has caught up to make this routine rather than exotic.

The future

Instead of one giant model doing everything, imagine a system made of a small finance model, a small legal model, and a small HR model. Each one is an expert. Each one is efficient. Each one is easy to update. Together, they form a powerful, modular AI ecosystem, often orchestrated by a router or a small "dispatch" model that decides which specialist to call.
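The routing idea can be sketched in a few lines. Everything here is hypothetical: the three specialist functions are stubs standing in for small quantized models, and the keyword lists are invented for illustration. A production router would more likely be a small classifier or embedding model than a keyword match, but the control flow is the same.

```python
import re
from typing import Callable

# Hypothetical specialists: in a real system each would wrap a small quantized model.
def finance_model(prompt: str) -> str:
    return f"[finance] {prompt}"

def legal_model(prompt: str) -> str:
    return f"[legal] {prompt}"

def hr_model(prompt: str) -> str:
    return f"[hr] {prompt}"

SPECIALISTS: dict[str, Callable[[str], str]] = {
    "finance": finance_model,
    "legal": legal_model,
    "hr": hr_model,
}

# Illustrative keyword sets, not a recommended taxonomy.
KEYWORDS: dict[str, set[str]] = {
    "finance": {"invoice", "budget", "revenue", "expense"},
    "legal": {"contract", "clause", "liability", "nda"},
    "hr": {"leave", "payroll", "onboarding", "holiday"},
}

def route(prompt: str, default: str = "hr") -> str:
    """Pick the specialist whose keyword set overlaps the prompt the most."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    scores = {name: len(words & kws) for name, kws in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # Fall back to a default specialist when nothing matches.
    return best if scores[best] > 0 else default

def dispatch(prompt: str) -> str:
    """Send the prompt to the chosen specialist and return its answer."""
    return SPECIALISTS[route(prompt)](prompt)

print(dispatch("Review the liability clause in this contract."))
print(dispatch("What is the budget for next quarter?"))
```

Because each specialist sits behind the same call signature, one of them can be retrained or swapped out without touching the others, which is the "easy to update" property in practice.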

This is where I think the industry is heading: a network of small, specialised AI experts rather than one giant model. Early adopters will have a head start in this space.
