Local LLM Setup (Ollama + Qwen)
Claude Code can run with a local LLM instead of the Anthropic API, using Ollama and one of Alibaba's Qwen3 models. Ollama v0.14.0+ includes built-in Anthropic Messages API compatibility, so Claude Code connects to it without any proxy or adapter.
When to use local vs Claude API
| Use case | Recommendation |
|---|---|
| Data sovereignty — code or data must stay in the EU / on-premise | Local Qwen |
| Security-sensitive work — credentials, private APIs, client data | Local Qwen |
| Offline / air-gapped environments | Local Qwen |
| Simple tasks — formatting, renaming, small refactors, boilerplate | Local Qwen |
| Cost reduction — high-volume, repetitive prompts | Local Qwen |
| Complex reasoning — architecture, debugging, multi-file changes | Claude API |
| Large context — analyzing entire codebases or long specs | Claude API |
| Quality-critical — production code, specs, client deliverables | Claude API |
Rule of thumb: Use Qwen locally for work that is private, simple, or high-volume. Use Claude API when quality and reasoning depth matter most. You can switch between them freely — they use the same Claude Code interface, tools, and commands.
Step 1: Install Ollama
Install Ollama natively on WSL (not in Docker — native gives better GPU passthrough and performance):
curl -fsSL https://ollama.com/install.sh | sh
The installer starts Ollama as a background service automatically. Confirm the installed version:
ollama --version # Should show 0.14.0+
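The version requirement can also be checked from a script. The following is a minimal sketch, assuming GNU `sort -V` is available (it is on Ubuntu) and that `ollama --version` prints a dotted version number somewhere in its output. `extract_version` and `version_ge` are hypothetical helper names, not part of Ollama.

```shell
# Hypothetical helpers; neither function ships with Ollama.

# Print the first dotted version number (e.g. 0.14.0) found on stdin.
extract_version() {
    grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
}

# Succeed (exit 0) when version $1 is greater than or equal to version $2.
version_ge() {
    # sort -V compares each numeric component, so 0.9.0 sorts before 0.14.0.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Run the check only when ollama is actually installed.
if command -v ollama >/dev/null 2>&1; then
    installed=$(ollama --version 2>&1 | extract_version)
    if version_ge "$installed" 0.14.0; then
        echo "Ollama $installed meets the 0.14.0 minimum"
    else
        echo "Ollama $installed is older than 0.14.0; rerun the installer" >&2
    fi
fi
```

A plain string comparison would rank 0.9.0 above 0.14.0, which is why the sketch relies on version-aware sorting.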
Step 2: Pull a Qwen model
Choose the right model for your GPU VRAM:
| Model | Pull command | Download size | Min VRAM | Time per response (RTX 3070, 8 GB) | Tool calling? |
|---|---|---|---|---|---|
| qwen3:8b | ollama pull qwen3:8b | 5.2 GB | 8 GB (fits 100%) | ~12s | No (chat only) |
| qwen3:14b | ollama pull qwen3:14b | 9.3 GB | 12 GB | ~2min (spills to CPU on 8GB) | Yes |
| qwen3-coder | ollama pull qwen3-coder | 18 GB | 24 GB | ~6min (mostly CPU on 8GB) | Yes |
Recommended: qwen3:14b — the smallest model that supports tool calling (reading files, editing code, running commands). On 8GB VRAM it's slow (~2min/response) but works as a batch/overnight agent. On 12GB+ VRAM it runs at interactive speed (~15s).
ollama pull qwen3:14b
Why not qwen3:8b? It's faster but can only chat: it cannot use tools (file access, shell commands, code editing). The model is too small to reliably produce the structured function-call format that CLI agents require. It will show its thinking but won't execute anything.
Why not qwen3-coder? It's the most capable (30B params) but requires 24GB+ VRAM. On an 8GB GPU it runs ~68% on CPU and takes ~6 minutes per response. Only use it with a workstation GPU (RTX 4090, A6000, etc).
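How much of a loaded model ended up on the CPU can be read from `ollama ps`. The sketch below classifies that output. It assumes the PROCESSOR column uses values such as `100% GPU`, `100% CPU`, or `68%/32% CPU/GPU`, and the function name `placement` is hypothetical. With several models loaded it reports a split if any of them is split.

```shell
# Hypothetical helper: classify `ollama ps` output read from stdin.
# Prints "split" when a model is shared between CPU and GPU, "gpu" or "cpu"
# when it sits entirely on one of them, and "none" when nothing is loaded.
placement() {
    ps_output=$(cat)
    case "$ps_output" in
        *"CPU/GPU"*)  echo split ;;
        *"100% GPU"*) echo gpu ;;
        *"100% CPU"*) echo cpu ;;
        *)            echo none ;;
    esac
}

# Run against the live server only when ollama is installed.
if command -v ollama >/dev/null 2>&1; then
    echo "Model placement: $(ollama ps 2>/dev/null | placement)"
fi
```

A result of `split` or `cpu` is the signal that the model does not fit in VRAM and responses will be slow.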
Check your available memory:
free -h # Look at the "available" column
nvidia-smi # Check GPU VRAM
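To compare the GPU against the Min VRAM column of the table above, a small script can do the arithmetic. This is a sketch under stated assumptions: the helper names `min_vram_gb` and `fits_in_vram` are hypothetical, the thresholds are copied from the table, and only the first GPU reported by `nvidia-smi` is considered.

```shell
# Hypothetical helpers; thresholds come from the "Min VRAM" column of the table.

# Print the minimum VRAM (in GB) for a model name, or fail for unknown names.
min_vram_gb() {
    case "$1" in
        qwen3:8b)    echo 8 ;;
        qwen3:14b)   echo 12 ;;
        qwen3-coder) echo 24 ;;
        *)           return 1 ;;
    esac
}

# Succeed when total VRAM ($2, in MiB) covers the minimum for model $1.
fits_in_vram() {
    need_gb=$(min_vram_gb "$1") || return 2
    # Compare against GB * 1000 rather than GB * 1024 as a rough tolerance,
    # because cards can report slightly less than their nominal size.
    [ "$2" -ge $((need_gb * 1000)) ]
}

# Query the first GPU only when nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    vram_mib=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits 2>/dev/null | head -n 1)
    case "$vram_mib" in
        ''|*[!0-9]*) echo "could not read VRAM size" >&2 ;;
        *)
            for model in qwen3:8b qwen3:14b qwen3-coder; do
                if fits_in_vram "$model" "$vram_mib"; then
                    echo "$model: fits in ${vram_mib} MiB of VRAM"
                else
                    echo "$model: will spill to CPU"
                fi
            done
            ;;
    esac
fi
```

A model that spills still runs, as described above for qwen3:14b on 8GB, only much more slowly.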
If you don't have enough system memory, increase the WSL allocation. On Windows, edit (or create) %USERPROFILE%\.wslconfig:
[wsl2]
memory=24GB
Then restart WSL from PowerShell:
wsl --shutdown
Reopen your Ubuntu terminal — the new memory limit is now active.
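To confirm that the new limit took effect, the total memory visible inside WSL can be read back. A minimal sketch, where `mem_total_gib` is a hypothetical helper name. Expect a value at or just below the configured limit, since the figure is truncated to whole GiB.

```shell
# Hypothetical helper: print total memory in whole GiB, read from a
# /proc/meminfo-style file (defaults to /proc/meminfo itself).
mem_total_gib() {
    # MemTotal is reported in kB; divide twice by 1024 and truncate.
    awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' "${1:-/proc/meminfo}"
}

# Report the live value when running on Linux/WSL.
if [ -r /proc/meminfo ]; then
    echo "WSL currently sees $(mem_total_gib) GiB of RAM"
fi
```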
Step 3: Run Claude Code with Qwen
Open a new terminal and run the following, replacing the model name with whichever model you pulled:
ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_API_KEY=ollama claude --model qwen3:14b
This opens the full interactive Claude Code CLI — same interface, same tools, same commands — but powered by Qwen running locally on your machine. No data leaves your workstation.
For a quick one-shot prompt (no interactive session):
ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_API_KEY=ollama claude --model qwen3:14b --print "explain this function"
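Typing both variables every time gets tedious, so the two commands above can be wrapped in a shell function, for example in `~/.bashrc`. This is only a sketch: `claude_local` is a hypothetical name, and the default model is an assumption taken from the recommendation in Step 2.

```shell
# Hypothetical wrapper; the name claude_local is not part of Claude Code.
# Usage: claude_local [model] [extra claude arguments...]
claude_local() {
    # First argument selects the model; default to the recommended one.
    model=${1:-qwen3:14b}
    [ $# -gt 0 ] && shift
    # The variables are set inline, so they reach only this claude process.
    ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_API_KEY=ollama \
        claude --model "$model" "$@"
}

# Examples (commented out so that sourcing this file starts nothing):
#   claude_local                                   # interactive, qwen3:14b
#   claude_local qwen3-coder                       # interactive, other model
#   claude_local qwen3:14b --print "explain this function"
```

Because the function passes the remaining arguments through unchanged, any flag accepted by `claude` works the same way with the local model.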
Running local and API side by side
The env vars are set inline on the command, so they apply only to the claude process started by that command. Other terminals and applications are unaffected. This means you can run both simultaneously:
- Terminal 1 — Qwen locally (free, private, slower) doing a long-running task like a bulk refactor or code review
- Terminal 2 / VS Code — Claude API (fast, powerful) for your main interactive development work
This is the recommended workflow: run the free local model as a sidecar for background tasks while you continue your normal work with Claude API at full speed. The local session and the API session are completely independent, so neither affects the other.
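The independence of the two sessions follows from how the shell handles inline assignments. A minimal demonstration, using a throwaway variable (`DEMO_BASE_URL` is a hypothetical name, chosen so the real settings stay untouched):

```shell
# Make sure the demo variable starts out unset in this shell.
unset DEMO_BASE_URL

# Inline assignment: the value is placed only in the child's environment.
DEMO_BASE_URL=http://localhost:11434 sh -c 'echo "child sees: $DEMO_BASE_URL"'

# Back in the calling shell the variable is still unset.
echo "shell sees: ${DEMO_BASE_URL:-<unset>}"
```

By contrast, running `export ANTHROPIC_BASE_URL=...` would keep the setting for the rest of that terminal session, so every later `claude` call there would go to Ollama.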
┌─────────────────────────┐ ┌─────────────────────────┐
│ Terminal 1 (Qwen) │ │ VS Code / Terminal 2 │
│ │ │ │
│ Free, local, private │ │ Claude API (Opus) │
│ Running: bulk refactor │ │ Fast interactive dev │
│ Speed: ~15 tok/s │ │ Speed: ~50-80 tok/s │
│ Cost: $0 │ │ Cost: normal API usage │