Run GLM 5.2 Locally with Ollama
Free, private, no API key needed — run GLM 5.2 on your own machine
System Requirements
| Config | RAM | VRAM | Example GPU | Speed |
|---|---|---|---|---|
| Minimum (7B) | 8GB RAM | 6GB VRAM | GTX 1660 / M1 | ~15 tok/s |
| Recommended (7B) | 16GB RAM | 8GB VRAM | RTX 3070 / M2 Pro | ~40 tok/s |
| Full (32B) | 32GB RAM | 24GB VRAM | RTX 4090 / M3 Max | ~25 tok/s |
| CPU only (7B) | 16GB RAM | None | Any | ~3 tok/s |
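The table can be read as a simple decision rule. The sketch below encodes it in Python. The function name `pick_config` is illustrative, the tags are the ones this guide uses in the pull step, and the thresholds are copied from the table rather than measured.

```python
def pick_config(ram_gb: float, vram_gb: float = 0.0) -> tuple[str, str]:
    """Return (model tag, mode) for the largest table row the machine satisfies.

    Thresholds mirror the requirements table in this guide; they are the
    guide's figures, not benchmarks.
    """
    if ram_gb >= 32 and vram_gb >= 24:
        return ("glm4:32b", "gpu")   # Full (32B)
    if ram_gb >= 8 and vram_gb >= 6:
        return ("glm4:7b", "gpu")    # Minimum or Recommended (7B)
    if ram_gb >= 16:
        return ("glm4:7b", "cpu")    # CPU only (7B), roughly 3 tok/s per the table
    raise ValueError("below the minimum configuration in the table")
```

For example, `pick_config(16, 8)` selects the 7B model on GPU, while `pick_config(16)` falls back to the CPU-only row.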
Step-by-Step Installation
1. Install Ollama
Download and install Ollama from the official site (ollama.com). Available for macOS, Linux, and Windows.
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download installer from ollama.com
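To confirm the install from a script, one option is to look for the `ollama` binary on the PATH and ask it for its version. This is a minimal sketch; the helper name `ollama_version` is hypothetical, and it assumes only that the CLI accepts `--version`.

```python
import shutil
import subprocess
from typing import Optional


def ollama_version(binary: str = "ollama") -> Optional[str]:
    """Return the CLI's version output, or None if the binary is not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some builds print warnings to stderr, so fall back to it if stdout is empty.
    return result.stdout.strip() or result.stderr.strip()
```

A return value of `None` means the installer did not put `ollama` on the PATH of the current shell.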
2. Pull GLM 5.2
Pull the GLM 5.2 model. Choose the size based on your hardware.
# 7B model (~4.5GB), runs on 8GB VRAM
ollama pull glm4:7b
# 32B model (~18GB), needs 24GB VRAM
ollama pull glm4:32b
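After a pull finishes, the local server can be asked which models it holds. The sketch below queries Ollama's `GET /api/tags` endpoint, which lists downloaded models. The helper names are illustrative, and the default tag is simply the one used in this guide.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default port used in this guide


def model_names(tags_response: dict) -> list[str]:
    """Extract model names from the JSON body returned by GET /api/tags."""
    return [m["name"] for m in tags_response.get("models", [])]


def is_pulled(tag: str = "glm4:7b", base_url: str = OLLAMA_URL) -> bool:
    """Ask the local Ollama server whether `tag` is already downloaded."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return tag in model_names(json.load(resp))
```

`is_pulled()` needs the Ollama server to be running; `model_names` is kept separate so the parsing can be checked without one.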
3. Run the model
Start an interactive session or serve the API.
# Interactive chat
ollama run glm4:7b
# Serve as local API (port 11434)
ollama serve
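When `ollama serve` is started from a script, the API may not accept connections immediately. A small polling helper can bridge that gap. This is a sketch under the assumption that the server answers a plain GET on its root URL with HTTP 200 once it is up; the function name and timings are illustrative.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url: str = "http://localhost:11434",
                    timeout_s: float = 10.0,
                    interval_s: float = 0.5) -> bool:
    """Poll base_url until it answers with HTTP 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base_url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not reachable yet; retry after a short pause
        time.sleep(interval_s)
    return False
```

Calling `wait_for_server()` before the first request avoids a connection error on slow starts.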
4. Test the API
Verify it's working with a quick curl request.
curl http://localhost:11434/api/generate -d '{
"model": "glm4:7b",
"prompt": "Write a Python hello world",
"stream": false
}'
Use with OpenCode
Once Ollama is running, point OpenCode at your local endpoint by adding the following to your OpenCode config (opencode.json):
{
"model": {
"provider": "openai",
"name": "glm4:7b",
"baseURL": "http://localhost:11434/v1",
"apiKey": "ollama"
}
}
Tips for Best Performance
- Keep context short: long contexts slow down inference significantly on consumer GPUs
- Use OLLAMA_NUM_PARALLEL=1 if you have limited VRAM
- Apple Silicon (M-series) gets impressive performance via Metal: an M3 Pro can run 7B at ~50 tok/s
- For coding tasks, 7B performs surprisingly close to the API model on single-file tasks
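The context-length tip and the speed figures in this guide can be checked on your own hardware. The sketch below sends a request to `/api/generate` with a capped context window (the `num_ctx` option) and derives tokens per second from the `eval_count` and `eval_duration` fields of the response. Function names, the default tag, and the 2048-token default are illustrative choices, not recommendations from the guide.

```python
import json
import urllib.request


def build_generate_request(prompt: str, model: str = "glm4:7b",
                           num_ctx: int = 2048,
                           base_url: str = "http://localhost:11434"):
    """Build a POST request for /api/generate with a capped context window."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # context window size in tokens
    }
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def tokens_per_second(response: dict) -> float:
    """Generation speed; eval_duration is reported in nanoseconds."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(prompt: str, **kwargs):
    """Send the request and return (generated text, tokens per second)."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=300) as resp:
        body = json.load(resp)
    return body["response"], tokens_per_second(body)
```

Running `generate("Write a Python hello world")` with two different `num_ctx` values gives a rough sense of how much context size costs on a given GPU.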