Local Multi-Model LLM Infrastructure
Advanced architecture and configuration for running multiple large language models locally. Summary and key takeaways from the full document.
The shift to local AI
LLM deployment is shifting from cloud-only APIs to locally hosted inference, driven by data sovereignty, predictable latency, and the elimination of per-token costs. Modern setups use multi-model orchestration: for example, a large reasoning model, a fast code-completion model, and a vision-language model running side by side. Doing this on local hardware quickly hits VRAM and memory-bandwidth limits, so the choice of inference engine, routing layer, and tuning matters.
LLMOps for local stacks
Observability and evaluation for local models:
| Platform | Free tier / focus |
|---|---|
| Braintrust | Free (1M spans); evaluation & model testing |
| PostHog | Free (100K LLM events); AI + product analytics |
| LangSmith | Free (5K traces); LangChain / LangGraph |
| Weights & Biases | Paid; ML teams extending to LLMs |
| TrueFoundry | Free tier; DevOps for local deployments |
Inference engine: Ollama vs native llama.cpp
Ollama is easy (CLI, model management, OpenAI-compatible API) but abstracts away control. For multi-model and VRAM-constrained setups, native llama-server (llama.cpp) is the production-oriented choice.
- Ollama: `OLLAMA_MAX_LOADED_MODELS`, `OLLAMA_NUM_PARALLEL`, and `OLLAMA_MAX_QUEUE` control concurrency; parallel slots share a single copy of the weights, but each slot adds its own context buffer to RAM use.
- Under VRAM pressure, Ollama evicts entire models to load another (no partial offload), which causes large latency spikes.
- Benchmarks (e.g. Qwen-3 Coder 32B FP16): native ~52 tok/s vs Ollama ~30 tok/s, i.e. roughly 70% higher throughput with native.
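The Ollama concurrency variables above are typically set in the server's environment before launch. A minimal sketch (the specific values are illustrative, not recommendations):

```shell
# Allow two models resident at once, four parallel slots per model,
# and queue up to 128 pending requests before rejecting.
export OLLAMA_MAX_LOADED_MODELS=2
export OLLAMA_NUM_PARALLEL=4
export OLLAMA_MAX_QUEUE=128
ollama serve
```

Note that raising `OLLAMA_NUM_PARALLEL` multiplies context-buffer memory even though the weights are shared, so the effective VRAM cost grows with slot count.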
| Criteria | Native llama.cpp | Ollama |
|---|---|---|
| Throughput | High (e.g. 52–161 tok/s) | Often 13–80% slower |
| Eviction | LRU + partial layer offload | All-or-nothing |
| Multi-model | Router mode, isolated processes | Context buffers via env vars |
| Config | Layer regex, KV cache, presets | Limited env vars |
Llama.cpp router mode
Run the server without a model path to enter router mode. It scans --models-dir, spawns a separate process per model, and uses LRU eviction (--models-max, default 4). Use --models-preset for per-model context, temperature, and GPU layer offload. --sleep-idle-seconds unloads weights after idle time, but be aware: a known bug leaves subprocesses alive (~600 MiB VRAM per idle process), and polling /models resets the idle timer—so sleep may never trigger if your UI polls.
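A minimal router-mode invocation might look like the following (the models directory path and values are assumptions; the flags are those described above):

```shell
# Router mode: start llama-server with no model path.
# --models-dir is scanned for models; each gets its own subprocess.
# --models-max caps resident models (LRU eviction; default 4).
# --sleep-idle-seconds unloads weights after idle time, but remember the
# caveat above: polling /models resets the idle timer.
llama-server \
  --models-dir /srv/models \
  --models-max 2 \
  --sleep-idle-seconds 300
```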
llama-swap and Nginx
llama-swap (Go daemon) is a reverse proxy that starts the right backend per request (e.g. llama-server, vLLM) and can fully terminate it after a TTL, freeing VRAM. Good when you want zero footprint when idle. If you put it behind Nginx, set proxy_buffering off; and proxy_cache off; so SSE streaming isn’t buffered and tokens stream in real time.
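An Nginx location block for this setup might look like the sketch below (the upstream address and timeout are assumptions for a llama-swap instance on port 8080):

```nginx
# Proxy SSE token streams without buffering so tokens arrive in real time.
location / {
    proxy_pass http://127.0.0.1:8080;   # llama-swap listen address (assumed)
    proxy_http_version 1.1;
    proxy_buffering off;                # do not buffer the SSE stream
    proxy_cache off;                    # never cache streaming responses
    proxy_read_timeout 600s;            # long generations should not time out
}
```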
LiteLLM and Open WebUI
LiteLLM is a central API gateway: one OpenAI-compatible endpoint that routes to local and cloud backends, with fallback and routing strategies. Use Docker Compose + PostgreSQL for keys, rate limits, and cost tracking. Point clients (e.g. Open WebUI) at the LiteLLM port; the UI then sees all configured models in one place.
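A minimal LiteLLM `config.yaml` sketch for mixing a local backend with a cloud fallback could look like this (model names, ports, and aliases are assumptions):

```yaml
# LiteLLM proxy config sketch: one local llama-server backend plus a
# cloud model, both exposed through a single OpenAI-compatible endpoint.
model_list:
  - model_name: local-coder
    litellm_params:
      model: openai/qwen3-coder          # OpenAI-compatible local backend
      api_base: http://localhost:8080/v1
      api_key: "none"                    # local servers usually ignore the key
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o               # cloud backend for fallback
router_settings:
  routing_strategy: simple-shuffle
```

Pointing Open WebUI's OpenAI connection at the LiteLLM port then surfaces both entries in its model picker.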
Hardware tips: VRAM and MoE
For big MoE models (e.g. Qwen3-Coder-30B-A3B) on 16 GB VRAM:
- KV cache: asymmetric quantization helps: a `q8_0` Key cache with a `q4_0` Value cache keeps perplexity acceptable while cutting VRAM (e.g. `--cache-type-k q8_0 --cache-type-v q4_0`).
- Layer offload: use a tensor-override regex to push only the FFN expert layers to CPU (e.g. `blk\.(1[6-9]|[2-4][0-9])\.ffn_.*=CPU`) so attention stays on the GPU.
- RAM/FCLK: on AMD (AM4/AM5), run the Infinity Fabric Clock 1:1 with the memory clock for better CPU↔GPU transfer bandwidth.
- Micro-batching: `--flash-attn`, `--ubatch-size 512`, and `--batch-size 512` can balance prompt-ingestion speed against VRAM on a 16 GB card.
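Combining the tips above into one launch command might look like the following sketch (the model filename and `-ngl` value are assumptions; `--override-tensor` is the llama.cpp flag that accepts such regexes, also available as `-ot`):

```shell
# Illustrative llama-server invocation for a large MoE model on 16 GB VRAM:
# quantized KV cache, expert FFN layers offloaded to CPU, attention on GPU.
llama-server \
  -m Qwen3-Coder-30B-A3B-Q4_K_M.gguf \
  --cache-type-k q8_0 \
  --cache-type-v q4_0 \
  --override-tensor "blk\.(1[6-9]|[2-4][0-9])\.ffn_.*=CPU" \
  --flash-attn \
  --ubatch-size 512 \
  --batch-size 512 \
  -ngl 99
```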
Linux + NVIDIA edge cases
- Idle power: After unloading models, the GPU can stay in a high power state (50–80 W). Enabling the NVIDIA Persistence Daemon often fixes it (driver then drops to ~9 W idle). Disable Secure Boot if the daemon fails to init (RmInitAdapter / Xid 62).
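On systemd distributions, enabling the daemon and checking the result can be done roughly as follows (service name as shipped by the NVIDIA driver packages):

```shell
# Enable the persistence daemon so the driver can drop to low idle power.
sudo systemctl enable --now nvidia-persistenced
# Verify: Persistence-M should read "On" and idle draw should fall.
nvidia-smi --query-gpu=persistence_mode,power.draw --format=csv
```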
- HW Power Brake: on some mobile/workstation GPUs, heavy load trips a hardware throttle (SM clocks stuck at 300–500 MHz). Set a software power limit slightly below TDP (e.g. via `nvidia-smi`) and optionally lock clocks (`-lgc`) to avoid the spikes that trigger the brake.
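A sketch of the relevant `nvidia-smi` commands (the wattage and clock range are examples for a hypothetical 175 W card; check your card's limits with `nvidia-smi -q -d POWER`):

```shell
sudo nvidia-smi -pm 1          # persistence mode so settings stick
sudo nvidia-smi -pl 160        # software power limit in watts, below TDP
sudo nvidia-smi -lgc 300,1700  # lock GPU clocks to a min,max range (MHz)
# Revert the clock lock later:
sudo nvidia-smi -rgc
```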