Zack Design has published local-llm-coding-guide — a no-fluff, benchmark-driven guide to running a genuinely useful local LLM as a coding assistant on consumer hardware. It covers Qwen3.5 and Gemma 4 across llama.cpp, Ollama (with MLX), and vllm-mlx, with real tokens-per-second numbers from three real machines.
## Why local
Cloud LLMs are wonderful until you are on a flight, behind a client VPN, editing code with sensitive data, or burning through a monthly token budget faster than is reasonable. The quality gap between the best frontier models and the best local-runnable models has narrowed dramatically — a quantised 9B Qwen model on a modest NVIDIA card is now perfectly capable of the “reformat this function, add a docstring, write a test” type of work that makes up most of a coding assistant’s day.
## The benchmarks
Measured on release builds, with real completions at realistic context lengths:
| GPU | Model | Tok/s | Context | Memory |
|---|---|---|---|---|
| RTX 4070 Ti 12GB | Nemotron 3 Nano 4B Q4_K_M | TBD | 262K | ~5GB |
| RTX 4070 Ti 12GB | Qwen3.5-9B Q4_K_M | ~65 | 131K | 7.8GB |
| RTX 3060 12GB | Qwen3.5-9B Q4_K_M | ~43 | 131K | ~7.8GB |
| RTX 3090 24GB | Qwen3.5-27B Q4_K_M | ~30 | 262K | ~18GB |
| M3 Pro 36GB | Qwen3.5-35B-A3B Q4_K_M | ~29 | 131K | ~22GB |
| M3 Pro 36GB | Qwen3.5-9B Q4_K_M | ~20 | 131K | ~7GB |
| M3 Pro 36GB | Qwen3.5-27B Q4_K_M | ~9* | 131K | ~18GB |
| M3 Pro 36GB | Gemma 4 26B-A4B Q4_K_M (Ollama MLX) | ~31 | 256K | ~17GB |
*The dense 27B is slower than the 35B-A3B MoE on 36 GB machines — see “Why MoE?” in the repo for the full story.
## Why MoE wins on Apple Silicon
Apple’s unified memory is generous, but its memory bandwidth is lower than a discrete NVIDIA card’s. A dense 27B model must read all of its weights for every generated token, saturating that bandwidth. A mixture-of-experts model like Qwen3.5-35B-A3B activates only about 3B parameters per token, so each token reads a fraction of the weights; the result is faster generation at comparable or better quality than the dense model it replaces. The guide walks through the tradeoff properly.
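The bandwidth argument is easy to sanity-check on the back of an envelope: decode speed is roughly capped at memory bandwidth divided by the bytes read per token. A minimal sketch, using assumed round numbers that are not from the guide (M3 Pro bandwidth of about 150 GB/s, and roughly 4.5 bits per weight for Q4_K_M):

```python
# Rough decode-speed ceiling: bandwidth / bytes of weights read per token.
# ASSUMED numbers, not from the guide: M3 Pro memory bandwidth ~150 GB/s,
# Q4_K_M at roughly 4.5 bits per parameter.

BANDWIDTH_BYTES_S = 150e9        # assumed M3 Pro memory bandwidth
BYTES_PER_PARAM = 4.5 / 8        # ~4.5-bit quantisation

def ceiling_tok_s(active_params_billions: float) -> float:
    """Upper bound on tokens/s if each token reads every active weight once."""
    bytes_per_token = active_params_billions * 1e9 * BYTES_PER_PARAM
    return BANDWIDTH_BYTES_S / bytes_per_token

dense_27b = ceiling_tok_s(27)  # dense model: all 27B weights touched per token
moe_3b = ceiling_tok_s(3)      # MoE: only ~3B active parameters per token

print(f"dense 27B ceiling: ~{dense_27b:.0f} tok/s")
print(f"MoE 3B-active ceiling: ~{moe_3b:.0f} tok/s")
```

The dense ceiling lands around 10 tok/s, in the same ballpark as the ~9 tok/s measured for the dense 27B on the M3 Pro; the MoE ceiling is far higher, which is why the 35B-A3B can beat it comfortably even after attention and routing overhead.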
## Test machines
- Windows/WSL2: RTX 4070 Ti (12 GB), Intel Core Ultra 9 285K, 48 GB DDR5
- macOS: M3 MacBook Pro, 36 GB unified memory
## Quick start
The guide walks through building llama.cpp from source (with `-DGGML_CUDA=ON` or `-DGGML_METAL=ON`), running the `llama-server` binary, wiring it into VS Code via the Continue extension, and wiring it into Claude Code as a local endpoint. Ollama + MLX is covered as the one-command alternative for Apple Silicon.
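The build-and-serve flow looks roughly like the following. This is a sketch, not the guide's exact commands: the model filename is a placeholder, and the context size is a conservative starting value to adjust for your VRAM.

```shell
# Build llama.cpp with CUDA offload (use -DGGML_METAL=ON instead on Apple Silicon).
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Serve a quantised model on an OpenAI-compatible endpoint at localhost:8080.
# The .gguf path is a placeholder; -ngl 99 offloads all layers to the GPU,
# and -c sets the context window in tokens.
./build/bin/llama-server -m models/qwen3.5-9b-q4_k_m.gguf -c 32768 -ngl 99 --port 8080
```

Once the server is up, anything that speaks the OpenAI API, including the Continue extension, can be pointed at `http://localhost:8080/v1`.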
## Who it is for
Developers who want a serious coding assistant that runs on their own hardware, without a subscription, without a round-trip to a cloud inference endpoint, and without hand-tuning flags for six hours. Read it on GitHub.