Understanding llama-server Context Size, Memory, and Parallelism

I recently spent some time tuning llama-server from llama.cpp on my own laptop, and I ran into a set of interactions that are easy to get wrong if you only look at the flags one by one.

The three knobs that matter most are:

  • --ctx-size
  • --parallel
  • the actual RAM and VRAM available on the machine

On paper, each of these seems straightforward. In practice, they interact through the KV cache, server slots, and memory placement in ways that are not immediately obvious.

I was running a Q4_K_M GGUF of Qwen3.5-35B-A3B (Hugging Face) on an RTX 3070 Ti Laptop GPU with 8 GB of VRAM and 64 GB of system RAM. That is very much a “make it fit and make it useful” kind of setup, which makes these trade-offs impossible to ignore. The runtime behavior I describe around slots, Flash Attention, KV-cache types, and monitoring endpoints matches the current llama.cpp server docs. (GitHub)

What follows is the mental model I ended up with after testing this on my own machine.

My setup

The hardware I used was:

  • GPU: RTX 3070 Ti Laptop GPU
  • VRAM: 8 GB
  • System RAM: 64 GB
  • Runtime: llama-server from llama.cpp
  • Model: Qwen3.5-35B-A3B, quantized to Q4_K_M

This is a good example of a hybrid CPU/GPU deployment. The model is too large to live entirely in VRAM, so part of it runs on the GPU and part of it spills into host memory. That is exactly where understanding memory layout stops being optional and starts being operationally important.

Where the memory actually goes

On a running llama-server, the two main memory consumers are:

  1. Model weights
  2. KV cache

The model weights are the neural-network parameters themselves. They are loaded once when the server starts and then shared across all requests.

The KV cache is different. That is the structure that stores the attention keys and values for the tokens already processed, which is what allows the model to “remember” the conversation so far. In current llama.cpp, the cache types for K and V default to f16, although I can override them with --cache-type-k and --cache-type-v if I want to trade quality for memory efficiency. KV offloading is also supported and enabled by default, which matters a lot on memory-constrained machines like mine. (GitHub)

On my laptop, the model did not fit entirely in VRAM, so llama.cpp had to split the workload across the GPU and system RAM. In practice, that meant some of the weights lived in VRAM and the rest lived in host memory. I also observed the same general pattern for KV allocation: some cache lived on the GPU and some in system RAM, depending on what the runtime could place where. Startup logs in llama.cpp explicitly report per-backend KV buffer allocations, which is the best way to confirm what is really happening instead of guessing. (GitHub)

For my specific setup, my rough observed numbers for Qwen3.5-35B-A3B Q4_K_M looked like this:

  Component                          VRAM     System RAM
  Model weights                      ~6 GB    ~16 GB
  KV cache (64K total ctx budget)    ~1 GB    ~4 GB
  Total                              ~7 GB    ~20 GB

Those numbers are not universal. They are measurements from my own machine and they depend on the model, quantization, backend, cache type, and the exact way the runtime decides to split memory. But they are representative enough to show the shape of the problem.

The key point is this: model weights are a mostly fixed cost, while KV cache is the part that scales with context capacity.
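To see why the KV cache scales with context capacity, it helps to estimate its size from first principles. The sketch below is a back-of-the-envelope calculator, not a reproduction of llama.cpp's actual allocator, and the layer count, KV-head count, and head dimension are hypothetical values chosen only for illustration; real models (especially with grouped-query attention) will differ.

```python
# Rough KV-cache size estimate. For each token, every layer stores
# a key vector and a value vector of n_kv_heads * head_dim elements.
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV-cache size in bytes (f16 default: 2 bytes per element)."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return n_ctx * per_token

# Hypothetical architecture numbers, chosen only for illustration.
size = kv_cache_bytes(n_ctx=65536, n_layers=48, n_kv_heads=4, head_dim=128)
print(f"{size / 2**30:.1f} GiB")  # 6.0 GiB for this hypothetical config
```

The useful part is the proportionality: halving the context budget halves the KV reservation, and so does halving the bytes per element via cache-type quantization.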

What --ctx-size actually means in llama-server

This is the part that I think is most often misunderstood.

When I first started using llama-server, I naturally read --ctx-size as “the max context size for one conversation.” That is not completely wrong in the one-request-at-a-time case, but it is not the best way to think about it in server mode.

The more accurate model is this:

In llama-server, --ctx-size is the total KV-cache budget available to the server.

The llama.cpp maintainer explicitly describes --ctx-size this way: it is effectively the total number of tokens that can be stored across all independent sequences. In other words, it is a shared budget, not a private budget per request. (GitHub)

That leads directly to the next important concept: server slots.

Current llama-server docs define --parallel as the number of server slots. The server also exposes a /slots endpoint, enabled by default, where I can inspect per-slot state such as n_ctx. (GitHub)
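The /slots endpoint returns a JSON array with one object per slot, so checking what I actually got is a one-liner. The snippet below parses a hypothetical trimmed-down response of the shape I saw; the field names `id` and `n_ctx` match the server docs, but treat the exact payload as illustrative.

```python
import json

# Hypothetical /slots payload, trimmed to the fields discussed here.
# Against a live server you would fetch it with something like:
#   urllib.request.urlopen("http://localhost:8080/slots").read()
sample = '[{"id": 0, "n_ctx": 32768}, {"id": 1, "n_ctx": 32768}]'

slots = json.loads(sample)
for slot in slots:
    print(f"slot {slot['id']}: n_ctx = {slot['n_ctx']}")
```

Two slots reporting n_ctx of 32768 each is exactly what a 64K total budget split two ways looks like.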

So if I run:

llama-server --ctx-size 65536 --parallel 1

I can think of that as one slot with the full 65,536-token budget available.

But if I run:

llama-server --ctx-size 65536 --parallel 2

then I should think of that as two active slots sharing that total budget. In practical terms, that means I should plan for about 32K per slot in the worst case. Upstream discussion examples say the same thing more explicitly: -c 1024 -np 2 creates 2 processing slots with about 512 context tokens each. (GitHub)

That also means that if I want to support two simultaneous requests with 64K each, I should budget around:

--ctx-size 131072 --parallel 2

And that, of course, increases KV-cache memory substantially.
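The budget arithmetic is simple enough to write down directly. These two helpers are my own, not part of llama.cpp, but they make the sharing explicit:

```python
def per_slot_ctx(ctx_size: int, parallel: int) -> int:
    """Worst-case context tokens available to each slot."""
    return ctx_size // parallel

def required_ctx(parallel: int, per_slot: int) -> int:
    """Total --ctx-size needed so every slot can reach per_slot tokens."""
    return parallel * per_slot

print(per_slot_ctx(65536, 2))   # 32768: two slots sharing a 64K budget
print(required_ctx(2, 65536))   # 131072: two slots at 64K each
```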

There is one extra nuance here that I find intellectually interesting: llama.cpp uses a unified KV cache strategy. The total cache is shared across sequences, and sequences are isolated by masking, not by giving each one a physically separate KV buffer. That design has advantages, including prompt sharing across sequences, but it also explains why the server thinks in terms of a global token budget instead of a strictly private per-request window. (GitHub)

Why a bigger context window is mostly a memory decision

One of the most useful things I learned is that a larger --ctx-size is primarily a memory reservation decision.

The server allocates KV capacity up front. That is why increasing --ctx-size changes memory usage immediately, even before I actually fill the context with real tokens. The logs make this visible by printing the KV buffer allocation at startup. (GitHub)

That does not mean a model suddenly becomes slow just because I reserved a bigger maximum window and then only used a small fraction of it.

The runtime cost is driven much more by the number of tokens I actually process than by the theoretical maximum I reserved. So if I reserve 64K but only use 2K in a given conversation, the main direct penalty is memory consumption, not that the model now behaves as if every request were 64K long.

That said, I would not phrase this as “unused context is completely free.” It is more accurate to say:

Unused context is mostly free computationally, but not free in memory.

And on a constrained GPU, memory pressure can still turn into speed loss indirectly. If a larger KV budget consumes more VRAM, fewer model layers may fit comfortably on the GPU, and the runtime may be forced into a worse GPU/CPU split. That is where a “memory-only” decision becomes a performance decision.

This is why on my own machine I saw a practical threshold effect: moderate context sizes were fine, but once I pushed context too high, the extra KV reservation started competing with model placement.

Another detail worth mentioning is Flash Attention. In current server docs, the flag is --flash-attn [on|off|auto], and the default is auto, not simply “always on.” So the right way for me to describe this is that llama-server will use Flash Attention automatically when the backend and model path support it. (GitHub)

What --parallel changes in real life

--parallel controls how many server slots exist, which means how many requests the server can actively work on at the same time. Current server docs describe it as the number of slots, with a documented default of -1 for auto in the server README, which is one reason I prefer to set it explicitly instead of relying on defaults. (GitHub)

Internally, the server uses continuous batching and maintains a single batch shared across all active slots. The developer documentation explains that update_slots() gathers work from all active slots into one batch and then calls llama_decode, which is the main compute bottleneck. (GitHub)

That design explains the trade-off I saw on my own machine:

  • With --parallel 1, one request runs, and the next request effectively waits for a free slot.
  • With --parallel 2, two requests can be active at once, but they share the same compute resources: GPU throughput, host-memory bandwidth, and CPU-side work.

On a laptop GPU with a partially offloaded 35B-class model, I found that the bottleneck was not just raw flops. A lot of the pain came from memory traffic, especially once some of the model lived in system RAM. In that situation, parallelism improved throughput more than it improved latency.

My own simplified experience looked like this:

  Config        Request A   Request B                  Total wall time
  --parallel 1  ~2 min      ~2 min (waits for slot)    ~4 min
  --parallel 2  ~3.5 min    ~3.5 min                   ~3.5 min

The exact numbers are workload-specific, but the pattern is the important part.
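The pattern can be captured with a toy model: one slot serves requests back to back, while two slots overlap the requests but slow each one down by some contention factor. The numbers here are just my rough measurements plugged back in, not a prediction of anyone else's hardware.

```python
def serial_wall_time(t_each: float, n_requests: float) -> float:
    """One slot: requests queue, so wall time is the sum."""
    return t_each * n_requests

def parallel_wall_time(t_each: float, slowdown: float) -> float:
    """Two slots: both requests run at once, each stretched by `slowdown`."""
    return t_each * slowdown

print(serial_wall_time(2.0, 2))       # 4.0 min total; first request finishes in 2.0
print(parallel_wall_time(2.0, 1.75))  # 3.5 min total; both requests take 3.5
```

As long as the contention slowdown stays below the number of slots (here 1.75 < 2), concurrency wins on total wall time while losing on per-request latency.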

With one slot, the first user gets the best latency.

With two slots, the total work finishes sooner, but each individual request slows down because both requests are competing for the same limited machine.

That is why I think --parallel is best understood as a throughput knob, not a “make everything faster” knob.

How I think about the trade-offs now

After working through this on my own computer, my rule of thumb became very simple.

If I care most about single-user responsiveness

I use:

  • --parallel 1
  • the largest --ctx-size I can justify without causing bad memory placement

That gives the best experience for interactive use.

If I care about serving multiple requests

I increase --parallel, but I do it knowing two things:

  1. my total context budget is now shared across more slots
  2. each active request will probably get slower on a memory-constrained machine

That is still useful if my goal is throughput rather than best-case latency.

If I need large context on limited VRAM

I look at KV-cache quantization first.

Current docs expose:

  • --cache-type-k
  • --cache-type-v

with defaults of f16 for both, and lower-precision options such as q8_0 and q4_0. That is one of the cleanest levers available when the KV cache, rather than the model weights, is what is pushing the machine over the edge. (GitHub)
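To get a feel for what those cache types buy, I compare approximate bytes per cached element. The block layouts below reflect my understanding of the GGML quantization formats (q8_0: 34 bytes per 32-element block; q4_0: 18 bytes per 32-element block); treat the exact constants as assumptions, though the rough ratios are the point.

```python
# Approximate storage cost per cached element for each KV cache type.
bytes_per_elem = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + one f16 scale per block
    "q4_0": 18 / 32,  # 16 packed 4-bit bytes + one f16 scale per block
}

f16_kv_gib = 5.0  # my rough observed KV footprint at a 64K budget
for ctype, b in bytes_per_elem.items():
    print(f"{ctype}: ~{f16_kv_gib * b / bytes_per_elem['f16']:.2f} GiB")
```

In practice that corresponds to flags like --cache-type-k q8_0 --cache-type-v q8_0, which here would cut the KV footprint roughly in half at some quality cost.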

My practical recommendations

If I were advising someone with hardware similar to mine, this is what I would say.

First, set --parallel explicitly. I do not like leaving that to defaults or auto behavior when I am trying to reason about capacity. Current server docs describe it in terms of server slots, and the slot endpoint makes it easy to inspect what I actually got. (GitHub)

Second, treat --ctx-size as a total budget, not a per-chat entitlement. That one mental shift makes the rest of the behavior much easier to understand. (GitHub)

Third, remember that large context windows are primarily a memory problem. They only become a direct speed problem when I actually fill them with tokens or when the extra memory reservation causes worse placement of weights and KV. Startup KV logs are the fastest way to validate what changed. (GitHub)

Fourth, choose between latency and throughput on purpose. On consumer hardware, parallel inference is usually not a free lunch. The server’s shared-batch design explains why. (GitHub)

Final takeaway

The main thing I took away from this exercise is that llama-server behaves much more like a resource scheduler than like a simple single-process chat program.

--ctx-size is not just “how long my chat can be.” It is my total KV-cache budget.

--parallel is not just “how many users I can serve.” It is how many slots are competing for that budget and for the machine’s compute and memory bandwidth.

And on a laptop-class GPU, those choices matter a lot.

Once I started thinking about llama-server in those terms, the behavior stopped feeling mysterious. It became a capacity-planning problem: how much KV budget I want, how many active slots I want, and how much slowdown I am willing to accept in exchange for concurrency.

That, to me, is the real lesson from running large models on modest hardware: not that it is impossible, but that understanding the runtime matters just as much as understanding the model.