School

Tips and tricks for developers and AI workflows.

Local Multi-Model LLM Infrastructure

Advanced architecture and configuration for running multiple large language models locally. Summary and key takeaways from the full document.

Read full document (Google Docs)

The shift to local AI

Deployment is moving from cloud-only APIs to locally hosted inference, driven by data sovereignty, predictable latency, and the avoidance of per-token costs. Modern setups orchestrate several models at once, e.g. a large reasoning model, a fast code-completion model, and a vision-language model. Running this on local hardware quickly hits VRAM and memory-bandwidth limits, so the choice of inference engine, routing layer, and tuning matters.

LLMOps for local stacks

Observability and evaluation for local models:

Platform            Free tier / focus
Braintrust          Free (1M spans); evaluation & model testing
PostHog             Free (100K LLM events); AI + product analytics
LangSmith           Free (5K traces); LangChain / LangGraph
Weights & Biases    Paid; ML teams extending to LLMs
TrueFoundry         Free tier; DevOps for local deployments

Inference engine: Ollama vs native llama.cpp

Ollama is easy (CLI, model management, OpenAI-compatible API) but abstracts away control. For multi-model and VRAM-constrained setups, native llama-server (llama.cpp) is the production-oriented choice.

  • Ollama: OLLAMA_MAX_LOADED_MODELS, OLLAMA_NUM_PARALLEL, OLLAMA_MAX_QUEUE control concurrency; parallel slots share one weight copy (context buffers multiply RAM).
  • Under VRAM pressure, Ollama evicts entire models to load another—no partial offload. That causes big latency spikes.
  • Benchmarks (e.g. Qwen-3 Coder 32B FP16): native ~52 tok/s vs Ollama ~30 tok/s—roughly 70% higher throughput with native.
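If you do stay on Ollama, the concurrency knobs above are set as environment variables before the daemon starts. A minimal sketch (the values here are illustrative, not recommendations):

```shell
# Keep up to 2 models resident at once; give each model 4 parallel request
# slots (slots share one copy of the weights, but each slot adds its own
# context buffer); queue at most 128 waiting requests beyond that.
export OLLAMA_MAX_LOADED_MODELS=2
export OLLAMA_NUM_PARALLEL=4
export OLLAMA_MAX_QUEUE=128
# then start the daemon in this environment:
# ollama serve
```

Remember that under VRAM pressure Ollama still evicts whole models, so raising OLLAMA_MAX_LOADED_MODELS past what your card fits just trades eviction thrash for load latency.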
Criteria       Native llama.cpp                    Ollama
Throughput     High (e.g. 52–161 tok/s)            Often 13–80% slower
Eviction       LRU + partial layer offload         All-or-nothing
Multi-model    Router mode, isolated processes     Context buffers via env vars
Config         Layer regex, KV cache, presets      Limited env vars

Llama.cpp router mode

Run the server without a model path to enter router mode. It scans --models-dir, spawns a separate process per model, and uses LRU eviction (--models-max, default 4). Use --models-preset for per-model context, temperature, and GPU layer offload. --sleep-idle-seconds unloads weights after idle time, but be aware: a known bug leaves subprocesses alive (~600 MiB VRAM per idle process), and polling /models resets the idle timer—so sleep may never trigger if your UI polls.
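A router-mode launch might look like the following sketch. The flags are the ones described above; the directory, preset path, and port are assumptions for illustration:

```shell
# No -m / model path => llama-server enters router mode: it scans
# --models-dir, spawns one subprocess per requested model, and evicts
# by LRU once more than --models-max models are resident.
llama-server \
  --models-dir /srv/models \
  --models-max 3 \
  --models-preset /srv/models/presets.ini \
  --sleep-idle-seconds 300 \
  --host 127.0.0.1 --port 8080
```

Given the idle-sleep bug mentioned above, it is worth checking `nvidia-smi` for leftover ~600 MiB subprocesses, and keeping UIs that poll /models away from this endpoint if you rely on sleep.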

llama-swap and Nginx

llama-swap (Go daemon) is a reverse proxy that starts the right backend per request (e.g. llama-server, vLLM) and can fully terminate it after a TTL, freeing VRAM. Good when you want zero footprint while idle. If you put it behind Nginx, set proxy_buffering off; and proxy_cache off; so SSE streaming isn’t buffered and tokens stream in real time.
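A minimal Nginx location for this might look as follows (the upstream port and path are assumptions; the two `off` directives are the part that matters for SSE):

```shell
# Write a site snippet that proxies to llama-swap without buffering,
# so server-sent token events reach the client as they are generated.
cat > /tmp/llama-swap.conf <<'EOF'
location /v1/ {
    proxy_pass http://127.0.0.1:8080;  # llama-swap listen address (assumed)
    proxy_http_version 1.1;            # needed for streaming keep-alive
    proxy_set_header Connection "";
    proxy_buffering off;               # do not buffer the SSE token stream
    proxy_cache off;                   # never serve a cached completion
    proxy_read_timeout 1h;             # long generations keep the stream open
}
EOF
```

Include the snippet from your server block; without `proxy_buffering off;` Nginx collects the whole response before forwarding, which looks like a frozen chat until generation finishes.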

LiteLLM and Open WebUI

LiteLLM is a central API gateway: one OpenAI-compatible endpoint that routes to local and cloud backends, with fallback and routing strategies. Use Docker Compose + PostgreSQL for keys, rate limits, and cost tracking. Point clients (e.g. Open WebUI) at the LiteLLM port; the UI then sees all configured models in one place.
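A minimal LiteLLM proxy config might look like this sketch (model names, the local port, and the fallback pairing are assumptions, not recommendations):

```shell
# One OpenAI-compatible endpoint, two backends: a local llama-server
# (or router) and a cloud model used as fallback if the local one fails.
cat > /tmp/litellm-config.yaml <<'EOF'
model_list:
  - model_name: local-coder              # the name clients ask for
    litellm_params:
      model: openai/qwen3-coder          # any OpenAI-compatible backend
      api_base: http://127.0.0.1:8080/v1 # local llama-server / router
      api_key: none
  - model_name: cloud-fallback
    litellm_params:
      model: gpt-4o-mini                 # cloud backend for overflow
litellm_settings:
  fallbacks:
    - local-coder: [cloud-fallback]      # reroute when local errors out
EOF
```

With Docker Compose and PostgreSQL behind it, the same file grows keys, rate limits, and cost tracking; Open WebUI pointed at the LiteLLM port then lists both models.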

Hardware tips: VRAM and MoE

For big MoE models (e.g. Qwen3-Coder-30B-A3B) on 16 GB VRAM:

  • KV cache: Asymmetric quantization helps. A q8_0 key cache with a q4_0 value cache keeps perplexity close to baseline while cutting VRAM (e.g. --cache-type-k q8_0 --cache-type-v q4_0).
  • Layer offload: Use a tensor-override regex to offload only FFN expert layers to CPU (e.g. --override-tensor 'blk\.(1[6-9]|[2-4][0-9])\.ffn_.*=CPU') so attention stays on GPU.
  • RAM/FCLK: On AMD (AM4/AM5), run Fabric Clock and Memory Clock at a 1:1 ratio for better CPU↔GPU transfer bandwidth.
  • Micro-batching: --flash-attn, --ubatch-size 512, --batch-size 512 can balance ingestion speed and VRAM on 16 GB.
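Put together, the bullets above might combine into a single launch line. A sketch for a 16 GB card; the model path, layer range (blocks 16–49), and batch sizes are examples, not measured optima:

```shell
# Asymmetric KV cache quantization (K: q8_0, V: q4_0); FFN expert tensors
# of blocks 16-49 forced to CPU via tensor-override regex while attention
# stays on GPU; flash attention plus 512-token micro-batches to balance
# prompt-ingestion speed against VRAM headroom.
llama-server \
  -m /srv/models/Qwen3-Coder-30B-A3B-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --cache-type-k q8_0 --cache-type-v q4_0 \
  --override-tensor 'blk\.(1[6-9]|[2-4][0-9])\.ffn_.*=CPU' \
  --flash-attn \
  --ubatch-size 512 --batch-size 512
```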

Linux + NVIDIA edge cases

  • Idle power: After unloading models, the GPU can stay in a high power state (50–80 W). Enabling the NVIDIA Persistence Daemon often fixes it (driver then drops to ~9 W idle). Disable Secure Boot if the daemon fails to init (RmInitAdapter / Xid 62).
  • HW Power Brake: On some mobile/workstation GPUs, heavy load triggers a throttle (SM clocks stuck at 300–500 MHz). Set a software power limit slightly below TDP (e.g. via nvidia-smi) and optionally lock clocks (-lgc) to avoid spikes that trigger the brake.
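Both workarounds are a few commands. The 80 W cap and clock range below are illustrative; pick values just under your GPU's TDP and rated boost:

```shell
# Idle power: keep the driver initialized between requests so it can drop
# to its low idle state (~9 W instead of 50-80 W) after models unload.
sudo systemctl enable --now nvidia-persistenced

# HW Power Brake: cap power slightly below TDP and lock graphics clocks
# to a fixed range so transient spikes never trip the hardware throttle.
sudo nvidia-smi -pl 80
sudo nvidia-smi -lgc 300,1700
```

`nvidia-smi -rgc` removes the clock lock again if you want stock behavior back.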

Open full document (Google Docs) →

Good things to use today for AI development

Tools and frameworks we recommend for AI-assisted coding, project structure, and agent workflows. Also on the homepage with expandable cards.

  • Builder Methods

    Agent OS

    Lightweight system for defining and managing coding standards in AI-powered development. Discovers and documents standards from your codebase, injects them into agent context. Works with Claude Code, Cursor, Windsurf.

  • bmad-code-org

    BMAD Method

    Breakthrough Method for Agile AI Driven Development. 21+ specialized agents, 50+ guided workflows. Scale-adaptive—from bug fixes to enterprise systems. Install: npx bmad-method install.

  • WhyNot / NorthStar

    NorthStar Rules (NSR)

    Framework for full-scale project and company operations. Templates, workflows, rules, and standards. Generate complete project structures, ensure compliance. NorthStar team: Josef Lindbom, Craig Martin.

  • agent0ai

    Agent Zero

    Personal, organic AI agent framework. Computer-as-tool; Skills (SKILL.md); memory; multi-agent cooperation. Docker-ready, Web UI, LiteLLM. docker run -p 50001:80 agent0ai/agent-zero.

  • no-code-architects

    thepopebot

    Autonomous AI agent on GitHub Actions. Secrets filtered at process level; every action is a git commit. Free compute. Event handler + Docker agent opens PRs. Telegram. npm run setup.

  • pi.dev

    pi

    Minimal terminal coding harness. TypeScript extensions, skills, prompt templates, themes; pi packages via npm or git. 15+ providers, tree sessions, AGENTS.md/SYSTEM.md. npm install -g @mariozechner/pi-coding-agent.

View these on the homepage for expandable details and modals.

More tips and tricks coming soon.