Run GLM 5.2 Locally with Ollama

Free, private, no API key needed — run GLM 5.2 on your own machine

System Requirements

Config	RAM	VRAM	Example GPU	Speed
Minimum (7B)	8GB RAM	6GB VRAM	GTX 1660 / M1	~15 tok/s
Recommended (7B)	16GB RAM	8GB VRAM	RTX 3070 / M2 Pro	~40 tok/s
Full (32B)	32GB RAM	24GB VRAM	RTX 4090 / M3 Max	~25 tok/s
CPU only (7B)	16GB RAM	None	Any	~3 tok/s

Step-by-Step Installation

Install Ollama

Download and install Ollama from the official site (ollama.com). Available for macOS, Linux, and Windows.

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download installer from ollama.com

Pull GLM 5.2

Pull the GLM 5.2 model. Choose the size based on your hardware.

# 7B model (~4.5GB) — runs on 8GB VRAM
ollama pull glm4:7b

# 32B model (~18GB) — needs 24GB VRAM
ollama pull glm4:32b

Run the model

Start an interactive session or serve the API.

# Interactive chat
ollama run glm4:7b

# Serve as local API (port 11434)
ollama serve

Test the API

Verify it's working with a quick curl request.

curl http://localhost:11434/api/generate -d '{
  "model": "glm4:7b",
  "prompt": "Write a Python hello world",
  "stream": false
}'

Use with OpenCode

Once Ollama is running, point OpenCode at your local endpoint:

# In your OpenCode config (opencode.json)
{
  "model": {
    "provider": "openai",
    "name": "glm4:7b",
    "baseURL": "http://localhost:11434/v1",
    "apiKey": "ollama"
  }
}

Tips for Best Performance

• Keep context short — long contexts slow down inference significantly on consumer GPUs
• Use OLLAMA_NUM_PARALLEL=1 if you have limited VRAM
• Apple Silicon (M-series) gets impressive performance via Metal — M3 Pro can run 7B at ~50 tok/s
• For coding tasks, 7B performs surprisingly close to the API model on single-file tasks