At the end of my last article, I teased that I’d be writing about LLM inference — the part most people skip because it sounds like infrastructure plumbing. It is plumbing. But once you understand where your VRAM actually goes, the knobs that move latency and cost stop feeling like dark magic. Here’s the mental model I’ve built up running models locally.
AI-generated image via Google Nano Banana 2
Table of contents
Open Table of contents
- A quick refresher on LLMs and inference
- Context: what it is, and what it actually costs
- The KV cache, and the Q8 trick that’s almost free
- Weight quantization: Q4, Q5, Q8, and the sweet spot
- The format zoo: GGUF, MLX, and friends
- vLLM: when you’re done playing and need to serve
- Turning on tool use
- Wrapping up
A quick refresher on LLMs and inference
A large language model is, at the end of the day, a giant function that takes a sequence of tokens and predicts the next one. Training is the part where that function gets fit to a corpus. Inference is everything that happens after — every time you send a prompt and the model produces a reply.
Inference splits into two phases that behave very differently:
- Prefill — the model reads your entire prompt in one pass. Compute-heavy, fast per token, scales with prompt length.
- Decode — the model produces output one token at a time, each generation conditioned on everything before it. Memory-bandwidth bound, slow per token, scales with output length.
That distinction matters because the bottlenecks are different. Prefill is usually compute-limited. Decode is usually memory-limited — and that’s where the KV cache enters the picture.
Context: what it is, and what it actually costs
The context is everything the model can “see” in a given turn: the system prompt, the chat history, any retrieved documents, and the tokens it has already generated. The bigger the context, the more the model knows — and the more VRAM it eats.
People often think context cost is about the input string. It’s not. It’s about what the model has to keep in memory for each token, at each layer, to compute attention. The rough formula is:
KV cache size ≈ 2 × num_layers × hidden_dim × seq_len × bytes_per_value
The “2” is for K and V (more on that below). Run the numbers on a Llama-class 70B model in FP16 — 80 layers, 8192 hidden dimensions, 2 bytes per value — and you get roughly 320 KB per token. A 32K-token context is over 10 GB of VRAM. That’s on top of the ~140 GB of weights. Suddenly the “context window” headline number isn’t free real estate; it’s the most expensive thing in your GPU.
This is why “just use a bigger context” is the wrong default. Every token you keep is VRAM you can’t use for batching, for other requests, or for a larger model.
The KV cache, and the Q8 trick that’s almost free
During decode, every new token has to attend to every past token. If we recomputed each past token’s key (K) and value (V) projections at every step, decode would crawl. So we cache them. That’s the KV cache — and on long contexts, it’s often the single biggest chunk of VRAM after the weights themselves.
Why quantizing the KV cache works so well
Model weights are sensitive to quantization — drop them too low and quality cliffs. The KV cache is different. Attention is a soft, averaged operation, so small per-value errors get washed out. Empirically, going from FP16 to Q8 (8-bit) on K and V is essentially free: the quality difference is within noise on most benchmarks, and you immediately halve KV cache memory.
That’s not a marginal optimization. On a 32K context, dropping the KV cache from FP16 to Q8 buys back ~5 GB on a 70B model. That’s the difference between OOMing and serving two concurrent requests.
How to enable it
- vLLM: pass
--kv-cache-dtype fp8(on Hopper/Ada) or--kv-cache-dtype int8. - llama.cpp / Ollama: set
--cache-type-k q8_0and--cache-type-v q8_0(or the equivalent env vars). - MLX: use
mlx-lmwith--kv-bits 8.
Some setups support Q4 KV cache too, but that’s where you start seeing measurable quality regressions, especially on long contexts. Q8 is the free lunch. Q4 KV is a real trade-off.
Weight quantization: Q4, Q5, Q8, and the sweet spot
Weights are the other half of the VRAM equation. A 70B model in FP16 is ~140 GB. The same model in Q4 is ~40 GB — suddenly it fits on a single 48 GB card, with room left for context. That’s not a small unlock; that’s the difference between “I need an H100” and “my Mac Studio can run this.”
A rough mental model:
| Quantization | Bytes/param | 70B model size | Quality impact |
|---|---|---|---|
| FP16 | 2.0 | ~140 GB | Reference |
| Q8 | ~1.0 | ~70 GB | Indistinguishable |
| Q6 | ~0.75 | ~52 GB | Very small |
| Q5 | ~0.65 | ~45 GB | Small, often acceptable |
| Q4_K_M | ~0.55 | ~40 GB | Noticeable but usable |
| Q3 | ~0.4 | ~30 GB | Real degradation |
| Q2 | ~0.3 | ~22 GB | Often broken on small models |
For most practical work I land on Q4_K_M or Q5_K_M for big models (where the quality floor is high enough to absorb the loss) and Q8 for small models (where every bit of quality matters and the file is small anyway). Q3 and below are emergency-only.
The format zoo: GGUF, MLX, and friends
Quantization isn’t a single thing — it’s a family of formats and algorithms, and the file format you download dictates which runtime can load it.
GGUF
The de facto standard for CPU and consumer-GPU inference. Created by the llama.cpp project, used by Ollama, LM Studio, KoboldCpp, and llama.cpp itself. It’s a single-file format that bundles weights, tokenizer, and metadata. The _K_M / _K_S suffixes (K-quants) are the most common variants — they use mixed precision per tensor, keeping sensitive layers at higher bits.
GGUF is portable, well-supported across hardware, and the easiest place to start for local inference.
MLX
Apple’s machine learning framework, optimized for Apple Silicon’s unified memory. Models are typically shipped as a directory of safetensors with MLX-specific quantization (4-bit and 8-bit are most common). On an M-series Mac, MLX is usually faster than GGUF for the same model — sometimes 2× — because it actually uses the Neural Engine and tuned Metal kernels.
If you’re running on a Mac and not using MLX where you can, you’re leaving performance on the table.
Others worth knowing
- AWQ / GPTQ — activation-aware and gradient-based quantization, common on data-center GPUs. Higher quality at the same bitrate than naive RTN.
- EXL2 — flexible mixed-precision format used by ExLlama, popular for squeezing big models onto consumer NVIDIA cards.
- bitsandbytes (BNB) — on-the-fly quantization at load time. Simplest path inside HuggingFace Transformers, but slower at runtime than purpose-built formats.
The right choice depends on your hardware. NVIDIA data-center: AWQ or GPTQ. NVIDIA consumer: GGUF or EXL2. Apple Silicon: MLX. Anything else (CPU, mixed): GGUF.
vLLM: when you’re done playing and need to serve
Ollama and LM Studio are great. They give you a one-click experience, a nice UI, and a model running in minutes. But they’re optimized for one user, one conversation at a time. The moment you want to serve multiple requests, batch efficiently, or hit real throughput, you hit their ceiling.
vLLM is what you reach for next. A few reasons it’s different:
PagedAttention
Instead of allocating a contiguous KV cache block per request (and wasting memory on padding), vLLM treats the KV cache like virtual memory: small fixed-size pages, allocated on demand. The practical effect is that you can fit dramatically more concurrent sequences in the same VRAM. This is the headline trick that made vLLM famous.
Continuous batching
Most serving stacks batch requests at the prompt level — wait for N requests, run them together, return them together. The slowest one in the batch drags everyone down. vLLM batches at the token level: as soon as one request finishes a token, another request can take its slot. GPU utilization stays high; tail latency stays low.
Preloaded weights, persistent process
vLLM loads the model once into VRAM and keeps it there. Ollama loads on first request, sometimes unloads after a timeout, and reloads again — fine for a personal assistant, terrible for a production service. With vLLM, the first request after startup is the only cold one.
OpenAI-compatible API out of the box
You point an OpenAI SDK at a vLLM endpoint and it just works. Same /v1/chat/completions, same streaming, same parameters. This is huge — it means anything written for the OpenAI API runs against your local model with a URL change.
A minimal launch looks like:
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
--quantization awq \
--kv-cache-dtype fp8 \
--max-model-len 32768 \
--tensor-parallel-size 2
Two GPUs, AWQ weights, FP8 KV cache, 32K context. That single command gets you a production-grade endpoint that will outperform a stack of Ollama instances on the same hardware.
When not to use vLLM
If you’re running a single model for yourself on a laptop, Ollama or LM Studio is still the right call. The setup overhead, the lack of a built-in UI, the GPU-only focus — vLLM is built for serving, not for tinkering. The line I draw: if you have more than one user, or you’re putting an LLM behind an API for an app, switch to vLLM. Below that, the consumer tools are perfectly fine.
Turning on tool use
Tool use (function calling) isn’t automatic — even if the model was trained for it. The inference server has to know how to parse the model’s structured output back into tool calls before forwarding them to your client.
In vLLM, this means two flags:
vllm serve <model> \
--enable-auto-tool-choice \
--tool-call-parser hermes
The parser has to match how the model emits tool calls. hermes, llama3_json, mistral, granite — each is tuned for a specific family. Pick the wrong one and the model’s tool calls come back as plain text. The vLLM docs list which parser fits which model; check before you launch.
Once enabled, the OpenAI SDK’s tools=[...] and tool_choice="auto" work exactly as they do against OpenAI. Same for Anthropic-compatible clients via a proxy. This is the moment a local model goes from “chat toy” to “actually wired into your stack.”
Wrapping up
LLM inference is mostly an exercise in fighting for VRAM. Weights, context, KV cache — every byte you don’t spend on one is a byte you can spend on another. Q8 KV cache is free quality. Q4 weights buy you a model class you couldn’t otherwise run. vLLM gives you back the throughput that consumer tools quietly leave on the floor. None of this is exotic — it’s just knobs, and once you know what they do, the cost and latency curves stop feeling mysterious.
Thanks for reading! If you have questions or want to swap notes on what’s worked on your hardware, feel free to reach out via email or LinkedIn.
Next up, I’ll be writing about Claude Design — a new way to design beautiful UIs that integrates with the tools you already use and exports straight to Claude Code for implementation. If you’ve ever bounced between Figma, a screenshot, and a prompt trying to describe an interface, you’ll enjoy that one. See you in two weeks!