TL;DR

I published aaronbolton/llamacpp-low-vram as a minimal, copy-paste starting point for running llama-server on a GPU with limited VRAM. Adjust --n-cpu-moe and --ctx-size for your hardware, and you get an OpenAI-compatible local endpoint without buying a bigger card.


Running a large model on a modest GPU is mostly a VRAM budgeting problem. The model weights, the KV cache, and the attention computation all compete for the same pool of memory. This post explains the launch script I use, why each flag is there, and how to tune the two settings that have the most impact.

What You Get

The scripts start llama-server with a set of defaults that prioritise fitting a larger model into limited memory while keeping the server useful:

  • Web UI on http://0.0.0.0:8080
  • OpenAI-compatible API on http://0.0.0.0:8080/v1
  • No API key required by default — add one if you expose this beyond localhost
  • Model pulled directly from Hugging Face Hub via --hf-repo on first run

Any tool that speaks the OpenAI API — VS Code extensions, Open WebUI, curl — can point at this endpoint and work without modification.

Quick Start

  1. Download the llama.cpp release that matches your OS and CUDA version from the llama.cpp releases page.
  2. On Windows, copy the CUDA DLLs from the release archive into the extracted llama.cpp folder alongside the binaries.
  3. Copy run-llama-server.bat (Windows) or run-llama-server.sh (Linux) from the repo into the same folder.
  4. Run the script once. If it crashes, follow the OOM tuning steps below.

The Two Knobs That Matter

Most of the flags in the script are set-and-forget. These two directly control how much VRAM the server uses, and they are the only ones you should need to touch for a given GPU.

--n-cpu-moe

Mixture-of-Experts models (like Qwen 3.6) only activate a fraction of their parameters per token — the rest sit idle in VRAM. --n-cpu-moe offloads a portion of those expert layers to the CPU instead.

Higher value = less VRAM used, but each token that hits an offloaded expert pays a PCIe transfer cost. The inference still works; it just gets slower proportionally to how many experts you push off-GPU.

--ctx-size

Context length controls how many tokens the model can hold in its attention window. The KV cache that backs this grows linearly with context length — a 262 144-token context uses roughly twice the KV cache memory of a 131 072-token one.

Cutting context length is the fastest way to recover VRAM if the model weights themselves fit but the server still OOMs on startup.

GPU VRAM --n-cpu-moe --ctx-size
10 GB 35 262144
8 GB 41 262144
6 GB 41 131072

These are baselines, not maximums. Start here and reduce --ctx-size or increase --n-cpu-moe in small steps until the server starts cleanly.


The Rest of the Launch Flags

The remaining flags are fixed in the scripts because they are almost always the right choice for a low-VRAM setup.

Flag Value Why
--n-gpu-layers 999 Loads as many transformer layers as VRAM allows before falling back to CPU. Set high so it uses everything available.
--cache-type-k / --cache-type-v q8_0 Quantises the KV cache to 8-bit. Roughly halves KV cache VRAM vs fp16 with negligible quality impact.
--flash-attn on Flash attention computes attention without materialising the full attention matrix — faster and uses less memory bandwidth.
--no-mmap (flag) Forces the full model into RAM/VRAM on startup rather than memory-mapping the file. Avoids latency spikes mid-inference when the OS swaps pages in.
--warmup (flag) Runs a short generation pass on startup so the first real request does not pay a cold-start penalty.
--tools all (flag) Enables tool-calling support in the server, making the endpoint usable for agentic workflows.
--ui-mcp-proxy (flag) Exposes an MCP proxy through the web UI for connecting MCP-aware clients directly to the server.

If You Hit an OOM

If the server crashes on startup or mid-inference, work through this in order:

  1. Lower --ctx-size first. The KV cache is often the largest variable cost. Halving context length can recover several gigabytes immediately.
  2. Increase --n-cpu-moe next. Push more expert layers off-GPU. Inference slows, but the model stays resident.
  3. Try a smaller quantisation if neither of the above is enough. A Q3 or Q4 quant of the same model uses meaningfully less VRAM than Q6 or Q8.

This order matters. Context size has the largest immediate impact and costs no inference quality. Expert offload trades speed for memory. Re-quantising is a last resort because it does affect output quality.


Summary

The repo is intentionally minimal: one script, two knobs, runnable defaults. The goal is to give you a working local inference endpoint as quickly as possible and let you tune from a stable baseline rather than starting from scratch.

Setting Impact
--n-cpu-moe VRAM vs. speed trade-off for MoE models
--ctx-size VRAM vs. context length trade-off
--cache-type-k/v q8_0 ~50% KV cache VRAM reduction
--flash-attn on Lower memory bandwidth, faster attention

If you have a GPU with 6–10 GB of VRAM and want a local OpenAI-compatible endpoint for experimentation or tooling, this is the fastest path to a stable run.