Hardware

| Component | Spec |
| --- | --- |
| CPU | 11th Gen Intel Core i7-11700K (16 threads) @ 5.00 GHz |
| GPU 1 | NVIDIA GeForce RTX 4060 Ti 16GB (Discrete) |
| GPU 2 | NVIDIA GeForce RTX 4060 Ti 16GB (Discrete) |
| Memory | 128 GiB |

Running a large language model locally is one thing. Serving it intelligently to a variety of workloads is another. This post walks through how I serve a single Qwen 3.6 model via llama.cpp and expose it as multiple purpose-tuned model aliases through LiteLLM — giving different clients the right inference parameters without ever loading a second model.

Architecture Overview

The request chain is straightforward:

Clients (VS Code / Open WebUI / curl)
        |  OpenAI-compatible API
        v
LiteLLM :4000  (model router / proxy)
        |
        v
llama.cpp :8080  (inference server)
        Single model: Qwen 3.6-35B-A3B GGUF in GPU VRAM

Every client talks to LiteLLM over the standard OpenAI API. LiteLLM routes each request to llama.cpp and injects the sampling parameters appropriate for that named model alias. llama.cpp never sees more than one model — it just gets different generation parameters per request.
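
To make the parameter injection concrete, here is roughly the shape of a request before and after LiteLLM handles it, shown as YAML for readability. The alias and values below match the thinking-coding preset described later; the exact JSON LiteLLM forwards may differ slightly.

# What a client sends: just an alias name and messages, no sampling parameters
model: qwen3.6-35b-a3b-q4_k_xl-thinking-coding
messages:
  - role: user
    content: "Refactor this function to avoid the extra allocation."

# What llama.cpp receives: the alias resolved to the underlying model, with the
# preset sampling parameters merged into the request
model: unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_XL
messages:
  - role: user
    content: "Refactor this function to avoid the extra allocation."
temperature: 0.6
top_p: 0.95
presence_penalty: 0.0
top_k: 20
min_p: 0.0
repetition_penalty: 1.0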


llama.cpp — The Inference Engine

llama.cpp runs in a Docker container with CUDA 12 and serves a single model over an OpenAI-compatible /v1 endpoint on port 8080.

llamacpp:
  image: ghcr.io/ggml-org/llama.cpp:full-cuda12
  container_name: llamacpp
  restart: unless-stopped
  ports:
    - "8080:8080"
  environment:
    - LLAMA_ARG_HOST=0.0.0.0
    - HF_HUB_CACHE=/models
    - HF_TOKEN=${HUGGING_FACE_HUB_TOKEN}
    - LLAMA_API_KEY=${LLAMA_API_KEY}
    # Push all layers to GPU
    - LLAMA_ARG_N_GPU_LAYERS=999
    - LLAMA_ARG_NO_MMAP=1
    # One parallel slot — single-user home setup
    - LLAMA_ARG_N_PARALLEL=1
    # KV cache quantisation — saves VRAM with negligible quality loss
    - LLAMA_ARG_CACHE_TYPE_K=q8_0
    - LLAMA_ARG_CACHE_TYPE_V=q8_0
    # Flash attention
    - LLAMA_ARG_FLASH_ATTN=1
  command:
    - "--server"
    - "--log-file"
    - "/var/log/llamacpp.log"
    - "-hf"
    - "unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL"
  volumes:
    - /path/to/your/model/cache:/models
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]

Key decisions

| Setting | Value | Why |
| --- | --- | --- |
| LLAMA_ARG_N_GPU_LAYERS=999 | All layers on GPU | Avoids slow CPU offload |
| LLAMA_ARG_NO_MMAP | Enabled | Loads the full model into VRAM upfront |
| LLAMA_ARG_N_PARALLEL=1 | 1 slot | Single-user home lab; keeps VRAM headroom for a large context |
| CACHE_TYPE_K/V=q8_0 | 8-bit KV cache | Roughly halves KV cache VRAM vs fp16 |
| LLAMA_ARG_FLASH_ATTN=1 | Flash attention | Faster attention + lower memory bandwidth |
| Model | unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL | Good quality/VRAM trade-off for a 35B MoE |

The model is pulled directly from Hugging Face Hub on first start and cached to your local model cache directory.

A healthcheck watches the log file for CUDA out-of-memory errors and marks the container unhealthy if one is detected, triggering autoheal to restart it cleanly.
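
A minimal sketch of that check, added under the llamacpp service in the compose file (the grepped error string, the timings, and the autoheal label are illustrative and depend on your log output and autoheal configuration):

  healthcheck:
    # Mark the container unhealthy if a CUDA OOM appears in the recent server log
    test: ["CMD-SHELL", "! tail -n 200 /var/log/llamacpp.log | grep -q 'CUDA out of memory'"]
    interval: 30s
    timeout: 5s
    retries: 1
  labels:
    - autoheal=true  # lets the autoheal container restart it once it goes unhealthy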


LiteLLM — The Model Router

LiteLLM sits in front of llama.cpp and provides:

  • An OpenAI-compatible API that any client can target without modification.
  • Model aliasing — one physical model exposed under multiple names with different parameters.
  • Usage tracking via a Postgres database.
  • Prometheus metrics for observability.
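
On the metrics point: with the prometheus callback enabled (shown in the config below), the proxy serves metrics over HTTP, so a minimal Prometheus scrape job might look like this, assuming the metrics endpoint is exposed at /metrics on the proxy port (job name and target are illustrative):

scrape_configs:
  - job_name: litellm
    metrics_path: /metrics
    static_configs:
      - targets: ["litellm:4000"]

The LiteLLM proxy itself is another service in the same compose file:
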
litellm:
  image: docker.litellm.ai/berriai/litellm-database:main-stable
  container_name: litellm
  ports:
    - "4000:4000"
  volumes:
    - ./litellm/litellm-config.yaml:/app/config.yaml
  environment:
    - DATABASE_URL=${LITELLM_DATABASE_URL}
    - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
    - LLAMA_API_KEY=${LLAMA_API_KEY}
  command: ["--config", "/app/config.yaml", "--port", "4000"]
  depends_on:
    litellm-db:
      condition: service_healthy

A Postgres 16 container (litellm-db) backs the spend/usage log. Both containers live on a shared private Docker network so they can communicate without exposing the database port.
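
For completeness, a sketch of what that database service and network can look like; the image tag, credentials, volume path, and network name here are placeholders rather than my exact values, and the litellm service joins the same network:

litellm-db:
  image: postgres:16
  container_name: litellm-db
  restart: unless-stopped
  environment:
    - POSTGRES_DB=litellm
    - POSTGRES_USER=litellm
    - POSTGRES_PASSWORD=${LITELLM_DB_PASSWORD}
  volumes:
    - ./litellm/db-data:/var/lib/postgresql/data
  healthcheck:
    # Lets the litellm service wait on service_healthy before starting
    test: ["CMD-SHELL", "pg_isready -U litellm -d litellm"]
    interval: 10s
    retries: 5
  networks:
    - litellm-internal

networks:
  litellm-internal: {}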


The Config: One Model, Six Aliases

This is where it gets interesting. The litellm-config.yaml file defines six model entries that all point at the same llama.cpp endpoint and the same underlying model. What differs are the sampling parameters.

general_settings:
  store_model_in_db: true
  store_prompts_in_spend_logs: true  # WARNING: logs full prompt content to the database — set to false if prompts are sensitive
  watch_config: true
  request_timeout: 600
  enable_responses_api: true

litellm_settings:
  drop_params: true  # drop request params the backend doesn't support instead of erroring
  callbacks:
    - prometheus

router_settings:
  routing_strategy: "simple-shuffle"  # trivial with a single backend
  num_retries: 1
  allowed_fails: 1
  cooldown_time: 10  # seconds a failing deployment sits out before being retried

Model aliases

All six aliases use:

model: "openai/unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_XL"
api_base: "http://llamacpp:8080/v1"

What differs is the sampling configuration.

| Parameter | What it does | ELI5 |
| --- | --- | --- |
| temperature | Controls randomness. Higher values (e.g. 1.0) produce more varied output; lower values (e.g. 0.6) make the model more deterministic and stick closer to the most probable tokens. | A creativity dial. Turn it up and the model picks wilder words; turn it down and it plays it safe. |
| top_p | Nucleus sampling. The model only considers tokens whose cumulative probability reaches this threshold. 0.95 keeps a wide but sensible vocabulary; 0.8 narrows it further. | Imagine the model has a jar of word options. top_p says “only pick from words until you’ve covered 95% of the good ones — ignore the weird ones at the bottom.” |
| presence_penalty | Penalises tokens that have already appeared in the output, discouraging repetition of topics. Higher values push the model to introduce new ideas. | If the model already said “cat”, this makes it less likely to say “cat” again. Higher = more variety. |
| enable_thinking | Qwen 3.6 feature. When true the model emits a reasoning scratchpad before its final answer. Set to false to skip it for faster, lower-token responses. | Should the model “think out loud” before answering? Like showing your working in maths class. Turn it off for quick answers. |
| top_k | Limits the candidate pool to the top K most probable tokens at each step. Works alongside top_p; set to 20 here to keep outputs focused. | Only consider the top 20 most likely next words. Stops the model going totally off-script. |
| min_p | Filters out tokens whose probability falls below this fraction of the top token’s probability. 0.0 disables the filter. | Cuts out really unlikely words. “If a word is less than X% as likely as your best option, don’t even consider it.” 0.0 means no filter at all. |
| repetition_penalty | Multiplier applied to already-seen tokens. Values above 1.0 reduce repetition; 1.0 means no penalty. | Makes the model reluctant to reuse words it has already said. The higher the number, the more it avoids repeating itself. |

1. qwen3.6-35b-a3b-q4_k_xl — Default / bare

No extra parameters. Clients that want full control over sampling can use this alias and pass their own settings.
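
For reference, the full entry for this bare alias is just the shared block above wrapped in a model_list item, roughly like this (the api_key line assumes LiteLLM's environment-variable reference syntax and the LLAMA_API_KEY already passed to the container):

model_list:
  - model_name: qwen3.6-35b-a3b-q4_k_xl
    litellm_params:
      model: "openai/unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_XL"
      api_base: "http://llamacpp:8080/v1"
      api_key: os.environ/LLAMA_API_KEY

The tuned aliases below use the same structure, with their sampling parameters added alongside api_base inside litellm_params.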

2. qwen3.6-35b-a3b-q4_k_xl-thinking-general — Thinking, general tasks

Qwen 3.6 supports a “thinking” mode where it emits a scratchpad before the final answer. This alias enables it with parameters recommended for open-ended generation:

temperature: 1.0
top_p: 0.95
presence_penalty: 1.5
extra_body:
  top_k: 20
  min_p: 0.0
  repetition_penalty: 1.0

Higher temperature and presence penalty encourage varied, exploratory responses. Good for brainstorming and open-ended questions.

3. qwen3.6-35b-a3b-q4_k_xl-thinking-coding — Thinking, precise coding

Same thinking mode, but with tighter sampling for code generation:

temperature: 0.6
top_p: 0.95
presence_penalty: 0.0
extra_body:
  top_k: 20
  min_p: 0.0
  repetition_penalty: 1.0

Lower temperature means less randomness — you want the model to pick the most probable correct token when writing code, not get creative.

4. qwen3.6-35b-a3b-q4_k_xl-nothink-general — No-think, general tasks

Qwen 3.6 can suppress the thinking scratchpad entirely, giving faster responses with lower token usage. The enable_thinking: false flag is passed in both chat_template_kwargs and extra_body to ensure it is respected at both the LiteLLM and llama.cpp layers:

temperature: 0.7
top_p: 0.8
presence_penalty: 1.5
chat_template_kwargs:
  enable_thinking: false
extra_body:
  top_k: 20
  min_p: 0.0
  repetition_penalty: 1.0
  chat_template_kwargs:
    enable_thinking: false

5. qwen3.6-35b-a3b-q4_k_xl-nothink-coding — No-think, precise coding

No thinking scratchpad, tight sampling for code:

temperature: 0.6
top_p: 0.95
presence_penalty: 0.0
chat_template_kwargs:
  enable_thinking: false

Fastest option for code completion tasks where you just want the answer.

6. qwen3.6-35b-a3b-q4_k_xl-nothink-reasoning — No-think, reasoning

No scratchpad but with high-entropy sampling — useful for multi-step reasoning tasks where you want diversity across retries:

temperature: 1.0
top_p: 0.95
presence_penalty: 1.5
chat_template_kwargs:
  enable_thinking: false

Why This Approach?

Single model load

Loading a 35B model takes time and significant VRAM. By keeping one model resident in llama.cpp and routing at the proxy layer, there is no model swap overhead between requests. The model is always warm.

Separation of concerns

Clients (VS Code extensions, Open WebUI, scripts) just pick an alias by name. They don’t need to know anything about sampling parameters. Changing a tuning value is a one-line edit in litellm-config.yaml — no client changes required.

Thinking vs. no-thinking

Qwen 3.6’s thinking mode produces significantly better results on complex tasks, but it increases latency and token usage because the model emits its reasoning chain first. For tasks that don’t benefit from it (quick lookups, code formatting, summarisation), the nothink aliases skip the scratchpad entirely.

Parameter presets per use case

| Alias | Temperature | Use case |
| --- | --- | --- |
| thinking-general | 1.0 | Brainstorming, open-ended |
| thinking-coding | 0.6 | Careful code generation with reasoning |
| nothink-general | 0.7 | Fast general chat |
| nothink-coding | 0.6 | Fast code completion |
| nothink-reasoning | 1.0 | Diverse multi-step reasoning |

Summary

The full stack is:

  1. llama.cpp — loads the model once onto the GPU and exposes a single OpenAI-compatible endpoint.
  2. LiteLLM — acts as a multiplexing proxy, turning that one endpoint into many named models with different sampling presets.
  3. Clients — pick the alias that matches the task. No knowledge of the underlying hardware or parameters needed.

The result is a flexible, low-overhead local inference setup where the right parameters are applied automatically based on what you’re asking the model to do.