GLM52.pro

Run GLM 5.2 Locally with Ollama

Free, private, no API key needed — run GLM 5.2 on your own machine

System Requirements

ConfigRAMVRAMExample GPUSpeed
Minimum (7B)8GB RAM6GB VRAMGTX 1660 / M1~15 tok/s
Recommended (7B)16GB RAM8GB VRAMRTX 3070 / M2 Pro~40 tok/s
Full (32B)32GB RAM24GB VRAMRTX 4090 / M3 Max~25 tok/s
CPU only (7B)16GB RAMNoneAny~3 tok/s

Step-by-Step Installation

1

Install Ollama

Download and install Ollama from the official site (ollama.com). Available for macOS, Linux, and Windows.

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download installer from ollama.com
2

Pull GLM 5.2

Pull the GLM 5.2 model. Choose the size based on your hardware.

# 7B model (~4.5GB) — runs on 8GB VRAM
ollama pull glm4:7b

# 32B model (~18GB) — needs 24GB VRAM
ollama pull glm4:32b
3

Run the model

Start an interactive session or serve the API.

# Interactive chat
ollama run glm4:7b

# Serve as local API (port 11434)
ollama serve
4

Test the API

Verify it's working with a quick curl request.

curl http://localhost:11434/api/generate -d '{
  "model": "glm4:7b",
  "prompt": "Write a Python hello world",
  "stream": false
}'

Use with OpenCode

Once Ollama is running, point OpenCode at your local endpoint:

# In your OpenCode config (opencode.json)
{
  "model": {
    "provider": "openai",
    "name": "glm4:7b",
    "baseURL": "http://localhost:11434/v1",
    "apiKey": "ollama"
  }
}

Tips for Best Performance

  • • Keep context short — long contexts slow down inference significantly on consumer GPUs
  • • Use OLLAMA_NUM_PARALLEL=1 if you have limited VRAM
  • • Apple Silicon (M-series) gets impressive performance via Metal — M3 Pro can run 7B at ~50 tok/s
  • • For coding tasks, 7B performs surprisingly close to the API model on single-file tasks