Replacing OpenClaw with Hermes Agent for Local LLMs

The setup

We run AI coding agents on a home network. Each agent lives in its own Proxmox LXC container, connects to Mattermost (our team chat), and uses LLMs to do real work — writing code, opening PRs, running commands, reviewing changes.

Sam is one of these agents. Unlike the others (which use GitHub Copilot via API), Sam runs on a local LLM served from a laptop with an NVIDIA RTX 3070 Ti (8 GB VRAM) and 64 GB RAM. The goal: a fully self-hosted coding agent with no cloud dependency.

Sam originally ran on OpenClaw, the same agent framework our other bots use. After two weeks of escalating problems, we replaced it with Hermes Agent by Nous Research. This is why.

The problem: tools that never get called

The symptom was deceptively simple. In Mattermost, a colleague asked Sam to fix three GitHub issues. Sam replied:

Got it. I can see the three issues assigned to me. I’m on it. I’ll clone the repo, address each issue individually, and open a PR per issue targeting dev. Let me start by getting familiar with the codebase.

Then nothing happened. CPU idle. GPU idle. No PRs, no clones, no commands. Sam acknowledged the task perfectly and then did absolutely nothing.

This wasn’t a timeout. It wasn’t a crash. The agent framework received the message, called the LLM, got a response, and posted it to Mattermost. The response just didn’t contain any tool calls.

The investigation

We spent hours tracing the request through every layer:

  1. OpenClaw (agent framework on sam.home.arpa) receives a Mattermost message
  2. LiteLLM (API proxy on nia.home.arpa) routes the request to the model backend
  3. Ollama/llama-server (on tuxedo.home.arpa) runs inference on the local model
  4. Response flows back through the same chain

Every layer looked healthy. The model was loaded, the WebSocket was connected, the API returned 200. But the model consistently generated conversational text instead of structured tool calls.

First red herring: context truncation

We discovered that Ollama defaults to num_ctx=4096 tokens. OpenClaw was sending 7,600+ token prompts. Ollama silently truncated them, cutting off the tools array that sat at the end of the request payload. The model literally never saw its tool definitions.
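The failure mode is easy to reproduce with a toy sketch (whitespace "tokens" stand in for a real tokenizer; the 4096 figure is Ollama's default num_ctx):

```python
import json

# Build a request shaped like OpenClaw's: a long system prompt first,
# with the structured tools array serialized at the end of the payload.
payload = {
    "messages": [{"role": "system", "content": "word " * 7600}],
    "tools": [{"type": "function",
               "function": {"name": "exec",
                            "description": "Run shell commands",
                            "parameters": {"type": "object"}}}],
}
serialized = json.dumps(payload)

# Naive stand-in for context truncation: keep only the first 4096
# whitespace-delimited "tokens" of the serialized request.
tokens = serialized.split(" ")
truncated = " ".join(tokens[:4096])

# The tool definitions sat past the cutoff, so the model never saw them.
print('"tools"' in truncated)  # → False
```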

We fixed this by migrating from Ollama to llama-server (llama.cpp) with an explicit --ctx-size 32768. Context truncation solved. But tool calling still didn’t work.

Second red herring: the model

We suspected GLM-4.7-Flash’s chat template was incompatible with llama.cpp’s OpenAI-format tool calling. We switched to Qwen3.5-35B-A3B — a model with verified tool calling support in llama.cpp, high scores on agentic benchmarks, and a proper Jinja template.

Direct tests confirmed tool calling worked perfectly:

$ curl llama-server:8012/v1/chat/completions -d '{"messages":[...],"tools":[...]}'
→ "tool_calls": [{"function": {"name": "exec", "arguments": "{\"command\":\"echo hello\"}"}}]
→ "finish_reason": "tool_calls"

Through LiteLLM proxy — also worked. Through OpenClaw — still broken. Same “I’ll run that command for you” narration without any actual tool invocation.
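The distinction we were testing for at each hop can be captured in a small helper (the response dicts below are hand-written stand-ins, assuming the standard OpenAI chat-completions shape):

```python
def made_tool_call(response: dict) -> bool:
    """Return True if the completion actually invoked a tool,
    rather than just narrating one in its text content."""
    choice = response["choices"][0]
    return (choice.get("finish_reason") == "tool_calls"
            and bool(choice["message"].get("tool_calls")))

# What llama-server and LiteLLM returned in direct tests:
working = {"choices": [{"finish_reason": "tool_calls",
                        "message": {"tool_calls": [{"function": {
                            "name": "exec",
                            "arguments": '{"command":"echo hello"}'}}]}}]}

# What came back through OpenClaw: pure narration, no invocation.
broken = {"choices": [{"finish_reason": "stop",
                       "message": {"content": "I'll run that command for you.",
                                   "tool_calls": None}}]}

print(made_tool_call(working), made_tool_call(broken))  # → True False
```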

The real cause: OpenClaw’s prompt design

We captured the raw HTTP request OpenClaw sends (45,354 bytes) and found the issue.

OpenClaw’s openai-completions API mode describes tools twice in every request:

  1. In the system prompt (text): “Tool names are case-sensitive. Call tools exactly as listed. — read: Read file contents — write: Create or overwrite files — edit: Make precise edits to files — exec: Run shell commands…”

  2. In the tools parameter (structured JSON): Standard OpenAI-format function definitions with names, descriptions, and parameter schemas.

This dual description is fine for frontier models like Claude or GPT-4, which understand that the textual list is informational while the structured tools parameter is the mechanism for invocation. Local models — including both GLM-4.7-Flash (29.9B) and Qwen3.5-35B-A3B (35B MoE) — consistently interpreted the textual description as the primary instruction and responded by narrating what they would do, rather than generating structured tool_calls in the API response.
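A stripped-down reconstruction of the two descriptions (wording abbreviated from the captured request; tool names match OpenClaw's list):

```python
import json

# 1. Tools described as text inside the system prompt…
system_prompt = (
    "Tool names are case-sensitive. Call tools exactly as listed.\n"
    "- read: Read file contents\n"
    "- write: Create or overwrite files\n"
    "- edit: Make precise edits to files\n"
    "- exec: Run shell commands\n"
)

# 2. …and again as structured OpenAI-format function definitions.
tools = [{"type": "function",
          "function": {"name": name, "description": desc,
                       "parameters": {"type": "object", "properties": {}}}}
         for name, desc in [("read", "Read file contents"),
                            ("write", "Create or overwrite files"),
                            ("edit", "Make precise edits to files"),
                            ("exec", "Run shell commands")]]

request = {"messages": [{"role": "system", "content": system_prompt}],
           "tools": tools}

# Every tool name appears twice in the same request: once as prose the
# model may choose to imitate, once as the actual invocation mechanism.
body = json.dumps(request)
print(body.count("exec"))  # → 2
```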

The model wasn’t failing to call tools. It was doing exactly what the prompt told it to: describe tool usage in natural language.

The switch to Hermes Agent

Hermes Agent is an open-source AI agent framework by Nous Research. We already ran it for another bot (Herma) with GitHub Copilot. The key difference from OpenClaw: Hermes Agent was designed from the ground up to work with local models.

What changed

The migration took about 20 minutes:

  1. Stopped and disabled OpenClaw (kept it installed for rollback)
  2. Installed Hermes Agent in a Python venv on the same container
  3. Configured it to point at the same LiteLLM proxy and Mattermost bot token
  4. Started the gateway service

First test message:

@sam run: echo hermes-agent-works && echo tool-calling-verified

Result:

💻 terminal: “echo hermes-agent-works && echo tool-…”

Looks good on both counts. What’ll we try first?

The model called the terminal tool, executed the command, and reported the output. Same model. Same LiteLLM proxy. Same llama-server. Same everything — except the agent framework.

Why it works

Hermes Agent sends a cleaner prompt. Instead of describing tools in natural language text AND passing them as structured API parameters, it relies on the model’s native tool calling mechanism. The system prompt is shorter, the tool definitions live only in the tools parameter where the model expects them, and there’s no ambiguity about whether to narrate or invoke.
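A sketch of the cleaner request (the system prompt text here is a placeholder for illustration, not Hermes Agent's actual prompt):

```python
import json

# Tool definitions live only in the structured parameter, where the
# chat template renders them in the format the model was trained on.
tools = [{"type": "function",
          "function": {"name": "terminal",
                       "description": "Run a shell command",
                       "parameters": {"type": "object",
                                      "properties": {"command": {"type": "string"}},
                                      "required": ["command"]}}}]

request = {
    "model": "openai/qwen3.5-35b-a3b:q4_k_m",
    # Short system prompt: behavior only, no textual tool catalogue.
    "messages": [
        {"role": "system", "content": "You are Sam, a coding agent."},
        {"role": "user", "content": "run: echo hermes-agent-works"},
    ],
    "tools": tools,
}

# No tool names leak into the prose the model reads as instructions.
print("terminal" in json.dumps(request["messages"]))  # → False
```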

For frontier models, this difference doesn’t matter. For local models running on consumer hardware, it’s the difference between a working agent and an expensive text generator.

Benefits of Hermes Agent (for local LLMs)

Simpler prompt architecture

Hermes Agent doesn’t try to teach the model about tools in the system prompt. It trusts the model’s built-in tool calling capability and the structured tools API parameter. This is exactly how local models are trained to handle tools — through the chat template’s tool format, not through natural language descriptions.
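As a concrete illustration of "through the chat template's tool format": Hermes-family models, for example, emit tool invocations as JSON wrapped in <tool_call> tags, which the serving layer parses back into structured tool_calls. A minimal parser sketch of that idea:

```python
import json
import re

# Extract the JSON between <tool_call>…</tool_call> tags that the
# model emits, and parse it back into structured call objects.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)

def parse_tool_calls(completion: str) -> list:
    return [json.loads(m) for m in TOOL_CALL_RE.findall(completion)]

raw = ('<tool_call>\n'
       '{"name": "exec", "arguments": {"command": "echo hello"}}\n'
       '</tool_call>')
calls = parse_tool_calls(raw)
print(calls[0]["name"])  # → exec
```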

Built for local models

Hermes Agent was created by Nous Research, who also build local models (Hermes 2 Pro, Hermes 3). The framework was designed and tested against models that run on consumer hardware. OpenClaw, by contrast, was designed for frontier API models and added local model support as a secondary path.

Lighter footprint

On the same container (2.4 GB RAM allocated):

  • OpenClaw: ~425 MB memory (Node.js + gateway)
  • Hermes Agent: ~90 MB memory (Python + gateway)

Native Mattermost support

Both frameworks support Mattermost, but Hermes Agent’s adapter is simpler — a direct WebSocket connection with mention-gating in channels and direct processing in DMs. No plugin system, no slash command callbacks, no interaction callback URLs.
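The gating behavior amounts to a small predicate (a sketch of the logic described above, not Hermes Agent's actual code; the "D" channel type is Mattermost's marker for direct messages):

```python
def should_process(channel_type: str, text: str, bot_handle: str = "@sam") -> bool:
    """Process DMs directly; in channels, only react when mentioned."""
    if channel_type == "D":          # Mattermost direct-message channel
        return True
    return bot_handle in text        # mention-gated in regular channels

print(should_process("D", "fix the build"))        # → True
print(should_process("O", "fix the build"))        # → False
print(should_process("O", "@sam fix the build"))   # → True
```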

Simpler configuration

OpenClaw required coordinating four config files (openclaw.json, models.json, auth-profiles.json, systemd service) with provider-specific auth types, API modes, and model definitions. Hermes Agent uses two files: config.yaml (model + settings) and .env (API keys). The model config is three lines:

model:
  default: openai/qwen3.5-35b-a3b:q4_k_m
  provider: custom
  base_url: http://nia.home.arpa:4000/v1

Broad tool support out of the box

Hermes Agent ships with terminal execution, file operations, web browsing (Playwright), web search, cron jobs, memory, delegation to sub-agents, and code execution — all enabled by default with no additional installs beyond the base package.
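The terminal tool's shape can be sketched generically (this is an illustrative shell-execution handler, not Hermes Agent's implementation):

```python
import subprocess

def terminal_tool(command: str, timeout: int = 60) -> dict:
    """Run a shell command and return output the agent can report back."""
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=timeout)
    return {"stdout": result.stdout, "stderr": result.stderr,
            "exit_code": result.returncode}

out = terminal_tool("echo hermes-agent-works && echo tool-calling-verified")
print(out["stdout"])
```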

What we kept from OpenClaw

OpenClaw isn’t bad software. For our bots running on GitHub Copilot (Claude, GPT-5), it works well. Its skill system, session management, and multi-agent coordination are more sophisticated than Hermes Agent’s equivalents. We specifically kept:

  • OpenClaw installed but disabled on Sam (in case we need to switch back)
  • OpenClaw running on Teo (which uses GitHub Copilot, not a local model)
  • The OpenClaw Mattermost plugin infrastructure (bot tokens, channel memberships)

The lesson isn’t “OpenClaw is bad” — it’s that agent frameworks designed for frontier models make assumptions about model capability that don’t hold for local models. The most important assumption: that a model can distinguish between a textual description of tools and a structured invocation mechanism. When that assumption breaks, the agent degrades silently from “autonomous worker” to “conversational narrator.”

The full stack

For reference, here’s what Sam’s stack looks like now:

Mattermost (mattermost.home.arpa)
    ↕ WebSocket
Hermes Agent (sam.home.arpa)
    ↕ OpenAI API
LiteLLM proxy (nia.home.arpa:4000)
    ↕ OpenAI API
llama-server (tuxedo.home.arpa:8012)
    ↕ CUDA / CPU inference
Qwen3.5-35B-A3B Q4_K_M (22 GB, 3B active params)

Every layer is self-hosted, open-source, and runs on consumer hardware. Total cloud dependency: zero.