Local Multi-Model LLM Infrastructure
Advanced architecture and configuration for running multiple large language models locally. Summary and key takeaways from the full document.
The shift to local AI
LLM deployment is shifting from cloud-only APIs to locally hosted inference, driven by data sovereignty, predictable latency, and the elimination of per-token costs. Modern setups use multi-model orchestration: for example, a large reasoning model, a fast code-completion model, and a vision-language model running side by side. Doing this on local hardware quickly hits VRAM and memory-bandwidth limits, so the choice of inference engine, routing layer, and tuning matters.
LLMOps for local stacks
Observability and evaluation for local models:
| Platform | Free tier / focus |
|---|---|
| Braintrust | Free (1M spans); evaluation & model testing |
| PostHog | Free (100K LLM events); AI + product analytics |
| LangSmith | Free (5K traces); LangChain / LangGraph |
| Weights & Biases | Paid; ML teams extending to LLMs |
| TrueFoundry | Free tier; DevOps for local deployments |
Inference engine: Ollama vs native llama.cpp
Ollama is easy (CLI, model management, OpenAI-compatible API) but abstracts away control. For multi-model and VRAM-constrained setups, native llama-server (llama.cpp) is the production-oriented choice.
- Ollama: `OLLAMA_MAX_LOADED_MODELS`, `OLLAMA_NUM_PARALLEL`, and `OLLAMA_MAX_QUEUE` control concurrency; parallel slots share a single copy of the weights, but each slot adds its own context buffer to RAM use.
- Under VRAM pressure, Ollama evicts entire models to load another (no partial offload), which causes large latency spikes.
- Benchmarks (e.g. Qwen-3 Coder 32B FP16): native ~52 tok/s vs Ollama ~30 tok/s, i.e. roughly 70% higher throughput with native.
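The Ollama concurrency variables above are typically set in the server's environment before launch. A minimal sketch (the specific values are illustrative, not recommendations):

```shell
# Allow two models resident at once, four parallel slots per model,
# and queue up to 128 pending requests before rejecting.
export OLLAMA_MAX_LOADED_MODELS=2
export OLLAMA_NUM_PARALLEL=4
export OLLAMA_MAX_QUEUE=128
ollama serve
```

Note that raising `OLLAMA_NUM_PARALLEL` multiplies context-buffer memory even though the weights are shared, so the effective VRAM cost grows with slot count.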
| Criteria | Native llama.cpp | Ollama |
|---|---|---|
| Throughput | High (e.g. 52–161 tok/s) | Often 13–80% slower |
| Eviction | LRU + partial layer offload | All-or-nothing |
| Multi-model | Router mode, isolated processes | Context buffers via env vars |
| Config | Layer regex, KV cache, presets | Limited env vars |
Llama.cpp router mode
Run the server without a model path to enter router mode. It scans --models-dir, spawns a separate process per model, and uses LRU eviction (--models-max, default 4). Use --models-preset for per-model context, temperature, and GPU layer offload. --sleep-idle-seconds unloads weights after idle time, but be aware: a known bug leaves subprocesses alive (~600 MiB VRAM per idle process), and polling /models resets the idle timer—so sleep may never trigger if your UI polls.
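A minimal router-mode invocation might look like the following (the models directory path and values are assumptions; the flags are those described above):

```shell
# Router mode: start llama-server with no model path.
# --models-dir is scanned for models; each gets its own subprocess.
# --models-max caps resident models (LRU eviction; default 4).
# --sleep-idle-seconds unloads weights after idle time, but remember the
# caveat above: polling /models resets the idle timer.
llama-server \
  --models-dir /srv/models \
  --models-max 2 \
  --sleep-idle-seconds 300
```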
llama-swap and Nginx
llama-swap (Go daemon) is a reverse proxy that starts the right backend per request (e.g. llama-server, vLLM) and can fully terminate it after a TTL, freeing VRAM. Good when you want zero footprint when idle. If you put it behind Nginx, set proxy_buffering off; and proxy_cache off; so SSE streaming isn’t buffered and tokens stream in real time.
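An Nginx location block for this setup might look like the sketch below (the upstream address and timeout are assumptions for a llama-swap instance on port 8080):

```nginx
# Proxy SSE token streams without buffering so tokens arrive in real time.
location / {
    proxy_pass http://127.0.0.1:8080;   # llama-swap listen address (assumed)
    proxy_http_version 1.1;
    proxy_buffering off;                # do not buffer the SSE stream
    proxy_cache off;                    # never cache streaming responses
    proxy_read_timeout 600s;            # long generations should not time out
}
```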
LiteLLM and Open WebUI
LiteLLM is a central API gateway: one OpenAI-compatible endpoint that routes to local and cloud backends, with fallback and routing strategies. Use Docker Compose + PostgreSQL for keys, rate limits, and cost tracking. Point clients (e.g. Open WebUI) at the LiteLLM port; the UI then sees all configured models in one place.
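A minimal LiteLLM `config.yaml` sketch for mixing a local backend with a cloud fallback could look like this (model names, ports, and aliases are assumptions):

```yaml
# LiteLLM proxy config sketch: one local llama-server backend plus a
# cloud model, both exposed through a single OpenAI-compatible endpoint.
model_list:
  - model_name: local-coder
    litellm_params:
      model: openai/qwen3-coder          # OpenAI-compatible local backend
      api_base: http://localhost:8080/v1
      api_key: "none"                    # local servers usually ignore the key
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o               # cloud backend for fallback
router_settings:
  routing_strategy: simple-shuffle
```

Pointing Open WebUI's OpenAI connection at the LiteLLM port then surfaces both entries in its model picker.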
Hardware tips: VRAM and MoE
For big MoE models (e.g. Qwen3-Coder-30B-A3B) on 16 GB VRAM:
- KV cache: asymmetric quantization helps: a `q8_0` Key cache with a `q4_0` Value cache keeps perplexity acceptable while cutting VRAM (e.g. `--cache-type-k q8_0 --cache-type-v q4_0`).
- Layer offload: use a tensor-override regex to push only the FFN expert layers to CPU (e.g. `blk\.(1[6-9]|[2-4][0-9])\.ffn_.*=CPU`) so attention stays on the GPU.
- RAM/FCLK: on AMD (AM4/AM5), run the Infinity Fabric Clock 1:1 with the memory clock for better CPU↔GPU transfer bandwidth.
- Micro-batching: `--flash-attn`, `--ubatch-size 512`, and `--batch-size 512` can balance prompt-ingestion speed against VRAM on a 16 GB card.
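Combining the tips above into one launch command might look like the following sketch (the model filename and `-ngl` value are assumptions; `--override-tensor` is the llama.cpp flag that accepts such regexes, also available as `-ot`):

```shell
# Illustrative llama-server invocation for a large MoE model on 16 GB VRAM:
# quantized KV cache, expert FFN layers offloaded to CPU, attention on GPU.
llama-server \
  -m Qwen3-Coder-30B-A3B-Q4_K_M.gguf \
  --cache-type-k q8_0 \
  --cache-type-v q4_0 \
  --override-tensor "blk\.(1[6-9]|[2-4][0-9])\.ffn_.*=CPU" \
  --flash-attn \
  --ubatch-size 512 \
  --batch-size 512 \
  -ngl 99
```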
Linux + NVIDIA edge cases
- Idle power: After unloading models, the GPU can stay in a high power state (50–80 W). Enabling the NVIDIA Persistence Daemon often fixes it (driver then drops to ~9 W idle). Disable Secure Boot if the daemon fails to init (RmInitAdapter / Xid 62).
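On systemd distributions, enabling the daemon and checking the result can be done roughly as follows (service name as shipped by the NVIDIA driver packages):

```shell
# Enable the persistence daemon so the driver can drop to low idle power.
sudo systemctl enable --now nvidia-persistenced
# Verify: Persistence-M should read "On" and idle draw should fall.
nvidia-smi --query-gpu=persistence_mode,power.draw --format=csv
```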
- HW Power Brake: on some mobile/workstation GPUs, heavy load trips a hardware throttle (SM clocks stuck at 300–500 MHz). Set a software power limit slightly below TDP (e.g. via `nvidia-smi`) and optionally lock clocks (`-lgc`) to avoid the spikes that trigger the brake.
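A sketch of the relevant `nvidia-smi` commands (the wattage and clock range are examples for a hypothetical 175 W card; check your card's limits with `nvidia-smi -q -d POWER`):

```shell
sudo nvidia-smi -pm 1          # persistence mode so settings stick
sudo nvidia-smi -pl 160        # software power limit in watts, below TDP
sudo nvidia-smi -lgc 300,1700  # lock GPU clocks to a min,max range (MHz)
# Revert the clock lock later:
sudo nvidia-smi -rgc
```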