Why I Left Ollama for llama.cpp

Background

We run AI agents (OpenClaw bots) on a home network; they rely on locally served LLMs for tool calling. The model — glm-4.7-flash:q8_0 (29.9B MoE, Q8_0, 29.7 GB) — runs on a workstation with ~8 GB VRAM and serves requests through a LiteLLM proxy.

After two weeks of increasingly painful workarounds, we replaced Ollama with llama-server (llama.cpp) for this model. This document explains why.

The problems with Ollama

1. Thinking model content drops

glm-4.7-flash is a thinking model — its responses include both a thinking field and a content field. LiteLLM’s native ollama/ provider silently dropped both content and tool_calls from these responses, returning empty completions. This forced us onto an indirect openai/ route through Ollama’s /v1 compatibility endpoint, adding an unnecessary translation layer.
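One defensive measure is for downstream code to normalize the response itself rather than trust the proxy. A minimal sketch, assuming the field layout of Ollama's native /api/chat response as described above (message.thinking, message.content, message.tool_calls); the example payload is invented:

```python
# Sketch: normalize a thinking-model response so downstream code always
# sees content and tool_calls, even if a proxy layer mishandles them.
# Field names assume Ollama's native /api/chat response shape.

def normalize_thinking_response(response: dict) -> dict:
    message = response.get("message", {})
    return {
        "thinking": message.get("thinking", ""),
        "content": message.get("content", ""),
        "tool_calls": message.get("tool_calls", []),
    }

# Invented example of a thinking-model response:
raw = {
    "message": {
        "thinking": "The user wants the weather; I should call the tool.",
        "content": "",
        "tool_calls": [
            {"function": {"name": "get_weather", "arguments": {"city": "Oslo"}}}
        ],
    }
}

result = normalize_thinking_response(raw)
assert result["tool_calls"], "tool_calls must survive normalization"
```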

2. Silent context truncation (the breaking issue)

Ollama defaults to num_ctx=4096 unless overridden per-model or per-request. Our agent framework (OpenClaw) sends ~7,600+ token prompts with 16 structured tool definitions. Ollama silently truncated these to 4,096 tokens, cutting off the tools array at the end of the request.

The result: the model received the system prompt telling it about tools, but never received the actual tool definitions. It would reply with text like “I’m on it, let me start working” but never call any tools — because it couldn’t. The agent appeared to acknowledge tasks and then do nothing.

This took hours to diagnose because there were no errors anywhere — not in Ollama, not in LiteLLM, not in OpenClaw. The model just silently produced worse output.
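In hindsight, a cheap client-side guard would have surfaced this: estimate the serialized prompt size before sending and fail loudly when it cannot fit. A sketch using a rough 4-characters-per-token heuristic (not a real tokenizer); the reply budget is an assumed number:

```python
# Sketch: fail loudly instead of letting the server silently truncate.
# The 4-chars-per-token ratio is a coarse heuristic, not a real tokenizer.
import json

CHARS_PER_TOKEN = 4  # rough assumption

def estimate_tokens(messages: list, tools: list) -> int:
    payload = json.dumps(messages) + json.dumps(tools)
    return len(payload) // CHARS_PER_TOKEN

def check_context(messages, tools, num_ctx: int, reply_budget: int = 1024) -> int:
    prompt_tokens = estimate_tokens(messages, tools)
    if prompt_tokens + reply_budget > num_ctx:
        raise ValueError(
            f"Prompt (~{prompt_tokens} tokens) + reply budget ({reply_budget}) "
            f"exceeds num_ctx={num_ctx}; tool definitions may be cut off"
        )
    return prompt_tokens

# A small prompt passes comfortably under a 4K window:
small = [{"role": "user", "content": "hi"}]
ok = check_context(small, [], num_ctx=4096)
```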

The /v1 endpoint has no standard way to pass num_ctx per-request (it’s not an OpenAI API parameter). The only fixes were:

  • OLLAMA_CONTEXT_LENGTH environment variable (server-wide, affects all models)
  • Creating a new model tag with a Modelfile (PARAMETER num_ctx 32768) — model modification for a server config

3. Model swap deadlocks

With OLLAMA_MAX_LOADED_MODELS=1 (necessary on our memory-constrained host), loading any fallback model evicted the primary. Loading glm-4.7-flash back (29.7 GB, mostly CPU-bound) took 30+ minutes. A single successful fallback to a different model caused every subsequent request to hang while models swapped back and forth. We had to remove all fallback models entirely.

4. Shared model slot instability

Ollama’s single model slot is shared across all consumers. If Alfred (another agent) loaded qwen3-coder-next for its own work, it evicted glm-4.7-flash, the model our agent Sam depends on. Sam’s next request would then hang for 30 minutes with no indication of what was happening.

5. Configuration spread

To get tool calling working through Ollama, we needed configuration across four systems:

  • Ollama on tuxedo: OLLAMA_CONTEXT_LENGTH, OLLAMA_MAX_LOADED_MODELS, OLLAMA_KEEP_ALIVE
  • LiteLLM on nia: openai/ route workaround, drop_params: True
  • OpenClaw on sam: openai-completions API mode, provider-level auth, model definitions
  • Ollama model config: thinking model flags, context window declarations

Each layer added its own abstraction and its own failure modes.

Why llama-server

llama-server (from llama.cpp) serves one model with explicit, predictable configuration:

llama-server \
  --model glm-4.7-flash-q8_0.gguf \
  --ctx-size 32768 \
  --n-gpu-layers 14 \
  --flash-attn \
  --port 8012

Everything that matters is visible in one command.
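For completeness, this is roughly what a tool-calling request against that server looks like. A sketch using only the standard library; the read_file tool is a hypothetical example, and the port matches the command above:

```python
# Sketch of a tool-calling request to llama-server's OpenAI-compatible
# /v1/chat/completions endpoint. The read_file tool is hypothetical.
import json
import urllib.request

def build_payload(user_msg: str) -> dict:
    return {
        # llama-server serves a single model; the name here is informational
        "model": "glm-4.7-flash",
        "messages": [{"role": "user", "content": user_msg}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "read_file",  # hypothetical tool definition
                "description": "Read a file from disk",
                "parameters": {
                    "type": "object",
                    "properties": {"path": {"type": "string"}},
                    "required": ["path"],
                },
            },
        }],
    }

def send(payload: dict, base_url: str = "http://localhost:8012") -> dict:
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```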

What it fixes

Each Ollama problem maps directly to a llama-server behavior:

  • Silent context truncation: --ctx-size is explicit and enforced.
  • Thinking model content drops: the OpenAI-compatible endpoint is native, with no provider translation.
  • Model swap deadlocks: one model, one process, always loaded.
  • Shared model slot: dedicated process; other models stay on Ollama.
  • Per-request num_ctx impossible via /v1: context size is set at startup and applies to all requests.

Pros

  • Explicit configuration: every parameter is a CLI flag — context size, GPU layers, batch size, thread count, flash attention. No hidden defaults.
  • Native OpenAI-compatible API: the /v1/chat/completions endpoint is first-class, not a compatibility shim. Tool calling works correctly out of the box.
  • Process isolation: the model is always loaded, always ready. No contention with other models or consumers.
  • Predictable resource usage: VRAM and RAM usage are determined at startup and don’t change. No surprise model swaps or memory spikes.
  • Direct control over GPU offloading: --n-gpu-layers lets you tune exactly how much goes to VRAM vs RAM, and you see the result immediately at startup.
  • Flash attention support: --flash-attn reduces KV cache memory usage, allowing larger contexts in the same VRAM budget.
  • Active development: llama.cpp tracks upstream model formats quickly and has frequent performance improvements.
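The KV cache point can be made concrete with back-of-envelope math. The layer and head counts below are illustrative assumptions, not glm-4.7-flash's actual architecture; and note that in llama.cpp, quantizing the KV cache (e.g. --cache-type-v q8_0) requires flash attention to be enabled, as far as we can tell:

```python
# Back-of-envelope KV cache sizing: context length and cache precision
# dominate. Layer/head numbers are illustrative, not a real architecture.

def kv_cache_bytes(n_layers: int, ctx: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int) -> int:
    # K and V each store ctx * n_kv_heads * head_dim values per layer
    return 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem

gib = 1024 ** 3
f16 = kv_cache_bytes(n_layers=40, ctx=32768, n_kv_heads=8,
                     head_dim=128, bytes_per_elem=2)
q8 = kv_cache_bytes(n_layers=40, ctx=32768, n_kv_heads=8,
                    head_dim=128, bytes_per_elem=1)

print(f"f16 KV cache: {f16 / gib:.2f} GiB")   # prints 5.00 GiB
print(f"q8_0 KV cache: {q8 / gib:.2f} GiB")   # prints 2.50 GiB
```

Halving the cache precision halves the KV memory, which is the difference between fitting a 32K context in the VRAM budget or not.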

Cons

  • No model management: you manage GGUF files yourself. No ollama pull, no automatic updates, no model library browser.
  • One model per instance: serving multiple models requires multiple llama-server processes on different ports (or a more complex setup). For our use case this is actually a feature.
  • Manual service setup: you write the systemd unit file yourself. No ollama serve convenience.
  • No built-in quantization: you need separate tools (llama-quantize) or pre-quantized GGUF files. Ollama handles this transparently.
  • Less beginner-friendly: requires understanding of GGUF formats, GPU layer counts, context sizing, and KV cache math. Ollama abstracts all of this.
  • No Modelfile abstraction: system prompts, default parameters, and templates are not bundled with the model. This is the application’s job (which it should be anyway).

When to stay on Ollama

Ollama is still good for:

  • Quick experimentation: ollama run model is unbeatable for trying models interactively.
  • Multi-model workflows with generous RAM: if you can keep multiple models loaded, the swap issue doesn’t apply.
  • Simple setups: if your prompts fit in 4K context and you don’t need tool calling, the defaults work fine.
  • Models that aren’t thinking models: the content-drop issue is specific to models that return thinking + content fields.

We still run Ollama alongside llama-server for Alfred’s models (qwen3-coder-next, qwen3.5:27b) where these issues don’t apply.

Summary

Ollama optimizes for convenience. That’s the right choice for many use cases. But when you need predictable, production-grade model serving with tool calling on constrained hardware, the abstractions get in the way. Every Ollama issue we hit was caused by a hidden default, a silent truncation, or a compatibility shim. llama-server gives you less magic and more control — which, for agentic workloads, is exactly what you want.