Hardware

| Component | Spec |
| --- | --- |
| CPU | 11th Gen Intel Core i7-11700K (16 threads) @ 5.00 GHz |
| GPU 1 | NVIDIA GeForce RTX 4060 Ti 16GB (Discrete) |
| GPU 2 | NVIDIA GeForce RTX 4060 Ti 16GB (Discrete) |
| Memory | 128 GiB |

Running a large language model locally is one thing. Serving it intelligently to a variety of workloads is another. This post walks through how I serve a single Qwen 3.6 model via llama.cpp and expose it as multiple purpose-tuned model aliases through LiteLLM — giving different clients the right inference parameters without ever loading a second model.

Architecture Overview

The request chain is straightforward:

Clients (VS Code / Open WebUI / curl)
        |  OpenAI-compatible API
        v
LiteLLM :4000  (model router / proxy)
        |
        v
llama.cpp :8080  (inference server)
        Single model: Qwen 3.6-35B-A3B GGUF in GPU VRAM

Every client talks to LiteLLM over the standard OpenAI API. LiteLLM routes each request to llama.cpp and injects the sampling parameters appropriate for that named model alias. llama.cpp never sees more than one model — it just gets different generation parameters per request.
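
To make the parameter injection concrete, here is roughly the shape of a request before and after LiteLLM handles it, shown as YAML for readability. The alias and values below match the thinking-coding preset described later; the exact JSON LiteLLM forwards may differ slightly.

# What a client sends: just an alias name and messages, no sampling parameters
model: qwen3.6-35b-a3b-q4_k_xl-thinking-coding
messages:
  - role: user
    content: "Refactor this function to avoid the extra allocation."

# What llama.cpp receives: the alias resolved to the underlying model, with the
# preset sampling parameters merged into the request
model: unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_XL
messages:
  - role: user
    content: "Refactor this function to avoid the extra allocation."
temperature: 0.6
top_p: 0.95
presence_penalty: 0.0
top_k: 20
min_p: 0.0
repetition_penalty: 1.0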


llama.cpp — The Inference Engine

llama.cpp runs in a Docker container with CUDA 12 and serves a single model over an OpenAI-compatible /v1 endpoint on port 8080.

llamacpp:
  image: ghcr.io/ggml-org/llama.cpp:full-cuda12
  container_name: llamacpp
  restart: unless-stopped
  ports:
    - "8080:8080"
  environment:
    - LLAMA_ARG_HOST=0.0.0.0
    - HF_HUB_CACHE=/models
    - HF_TOKEN=${HUGGING_FACE_HUB_TOKEN}
    - LLAMA_API_KEY=${LLAMA_API_KEY}
    # Push all layers to GPU
    - LLAMA_ARG_N_GPU_LAYERS=999
    - LLAMA_ARG_NO_MMAP=1
    # One parallel slot — single-user home setup
    - LLAMA_ARG_N_PARALLEL=1
    # KV cache quantisation — saves VRAM with negligible quality loss
    - LLAMA_ARG_CACHE_TYPE_K=q8_0
    - LLAMA_ARG_CACHE_TYPE_V=q8_0
    # Flash attention
    - LLAMA_ARG_FLASH_ATTN=1
  command:
    - "--server"
    - "--log-file"
    - "/var/log/llamacpp.log"
    - "-hf"
    - "unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL"
  volumes:
    - /path/to/your/model/cache:/models
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]

Key decisions

| Setting | Value | Why |
| --- | --- | --- |
| LLAMA_ARG_N_GPU_LAYERS=999 | All layers on GPU | Avoids slow CPU offload |
| LLAMA_ARG_NO_MMAP | Enabled | Loads the full model into VRAM upfront |
| LLAMA_ARG_N_PARALLEL=1 | 1 slot | Single-user home lab; keeps VRAM headroom for a large context |
| CACHE_TYPE_K/V=q8_0 | 8-bit KV cache | Roughly halves KV cache VRAM vs fp16 |
| LLAMA_ARG_FLASH_ATTN=1 | Flash attention | Faster attention + lower memory bandwidth |
| Model | unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL | Good quality/VRAM trade-off for a 35B MoE |

The model is pulled directly from Hugging Face Hub on first start and cached to your local model cache directory.

A healthcheck watches the log file for CUDA out-of-memory errors and marks the container unhealthy if one is detected, triggering autoheal to restart it cleanly.
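
A minimal sketch of that check, added under the llamacpp service in the compose file (the grepped error string, the timings, and the autoheal label are illustrative and depend on your log output and autoheal configuration):

  healthcheck:
    # Mark the container unhealthy if a CUDA OOM appears in the recent server log
    test: ["CMD-SHELL", "! tail -n 200 /var/log/llamacpp.log | grep -q 'CUDA out of memory'"]
    interval: 30s
    timeout: 5s
    retries: 1
  labels:
    - autoheal=true  # lets the autoheal container restart it once it goes unhealthy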


LiteLLM — The Model Router

LiteLLM sits in front of llama.cpp and provides:

  • An OpenAI-compatible API that any client can target without modification.
  • Model aliasing — one physical model exposed under multiple names with different parameters.
  • Usage tracking via a Postgres database.
  • Prometheus metrics for observability.
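
On the metrics point: with the prometheus callback enabled (shown in the config below), the proxy serves metrics over HTTP, so a minimal Prometheus scrape job might look like this, assuming the metrics endpoint is exposed at /metrics on the proxy port (job name and target are illustrative):

scrape_configs:
  - job_name: litellm
    metrics_path: /metrics
    static_configs:
      - targets: ["litellm:4000"]

The LiteLLM proxy itself is another service in the same compose file:
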
litellm:
  image: docker.litellm.ai/berriai/litellm-database:main-stable
  container_name: litellm
  ports:
    - "4000:4000"
  volumes:
    - ./litellm/litellm-config.yaml:/app/config.yaml
  environment:
    - DATABASE_URL=${LITELLM_DATABASE_URL}
    - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
    - LLAMA_API_KEY=${LLAMA_API_KEY}
  command: ["--config", "/app/config.yaml", "--port", "4000"]
  depends_on:
    litellm-db:
      condition: service_healthy

A Postgres 16 container (litellm-db) backs the spend/usage log. Both containers live on a shared private Docker network so they can communicate without exposing the database port.
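
For completeness, a sketch of what that database service and network can look like; the image tag, credentials, volume path, and network name here are placeholders rather than my exact values, and the litellm service joins the same network:

litellm-db:
  image: postgres:16
  container_name: litellm-db
  restart: unless-stopped
  environment:
    - POSTGRES_DB=litellm
    - POSTGRES_USER=litellm
    - POSTGRES_PASSWORD=${LITELLM_DB_PASSWORD}
  volumes:
    - ./litellm/db-data:/var/lib/postgresql/data
  healthcheck:
    # Lets the litellm service wait on service_healthy before starting
    test: ["CMD-SHELL", "pg_isready -U litellm -d litellm"]
    interval: 10s
    retries: 5
  networks:
    - litellm-internal

networks:
  litellm-internal: {}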


The Config: One Model, Six Aliases

This is where it gets interesting. The litellm-config.yaml file defines six model entries that all point at the same llama.cpp endpoint and the same underlying model. What differs are the sampling parameters.

general_settings:
  store_model_in_db: true
  store_prompts_in_spend_logs: true  # WARNING: logs full prompt content to the database — set to false if prompts are sensitive
  watch_config: true
  request_timeout: 600
  enable_responses_api: true

litellm_settings:
  drop_params: true  # drop request params the backend doesn't support instead of erroring
  callbacks:
    - prometheus

router_settings:
  routing_strategy: "simple-shuffle"  # trivial with a single backend
  num_retries: 1
  allowed_fails: 1
  cooldown_time: 10  # seconds a failing deployment sits out before being retried

Model aliases

All six aliases use:

model: "openai/unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_XL"
api_base: "http://llamacpp:8080/v1"

What differs is the sampling configuration.

| Parameter | What it does | ELI5 |
| --- | --- | --- |
| temperature | Controls randomness. Higher values (e.g. 1.0) produce more varied output; lower values (e.g. 0.6) make the model more deterministic and stick closer to the most probable tokens. | A creativity dial. Turn it up and the model picks wilder words; turn it down and it plays it safe. |
| top_p | Nucleus sampling. The model only considers tokens whose cumulative probability reaches this threshold. 0.95 keeps a wide but sensible vocabulary; 0.8 narrows it further. | Imagine the model has a jar of word options. top_p says “only pick from words until you’ve covered 95% of the good ones — ignore the weird ones at the bottom.” |
| presence_penalty | Penalises tokens that have already appeared in the output, discouraging repetition of topics. Higher values push the model to introduce new ideas. | If the model already said “cat”, this makes it less likely to say “cat” again. Higher = more variety. |
| enable_thinking | Qwen 3.6 feature. When true the model emits a reasoning scratchpad before its final answer. Set to false to skip it for faster, lower-token responses. | Should the model “think out loud” before answering? Like showing your working in maths class. Turn it off for quick answers. |
| top_k | Limits the candidate pool to the top K most probable tokens at each step. Works alongside top_p; set to 20 here to keep outputs focused. | Only consider the top 20 most likely next words. Stops the model going totally off-script. |
| min_p | Filters out tokens whose probability falls below this fraction of the top token’s probability. 0.0 disables the filter. | Cuts out really unlikely words. “If a word is less than X% as likely as your best option, don’t even consider it.” 0.0 means no filter at all. |
| repetition_penalty | Multiplier applied to already-seen tokens. Values above 1.0 reduce repetition; 1.0 means no penalty. | Makes the model reluctant to reuse words it has already said. The higher the number, the more it avoids repeating itself. |

1. qwen3.6-35b-a3b-q4_k_xl — Default / bare

No extra parameters. Clients that want full control over sampling can use this alias and pass their own settings.
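
For reference, the full entry for this bare alias is just the shared block above wrapped in a model_list item, roughly like this (the api_key line assumes LiteLLM's environment-variable reference syntax and the LLAMA_API_KEY already passed to the container):

model_list:
  - model_name: qwen3.6-35b-a3b-q4_k_xl
    litellm_params:
      model: "openai/unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_XL"
      api_base: "http://llamacpp:8080/v1"
      api_key: os.environ/LLAMA_API_KEY

The tuned aliases below use the same structure, with their sampling parameters added alongside api_base inside litellm_params.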

2. qwen3.6-35b-a3b-q4_k_xl-thinking-general — Thinking, general tasks

Qwen 3.6 supports a “thinking” mode where it emits a scratchpad before the final answer. This alias enables it with parameters recommended for open-ended generation:

temperature: 1.0
top_p: 0.95
presence_penalty: 1.5
extra_body:
  top_k: 20
  min_p: 0.0
  repetition_penalty: 1.0

Higher temperature and presence penalty encourage varied, exploratory responses. Good for brainstorming and open-ended questions.

3. qwen3.6-35b-a3b-q4_k_xl-thinking-coding — Thinking, precise coding

Same thinking mode, but with tighter sampling for code generation:

temperature: 0.6
top_p: 0.95
presence_penalty: 0.0
extra_body:
  top_k: 20
  min_p: 0.0
  repetition_penalty: 1.0

Lower temperature means less randomness — you want the model to pick the most probable correct token when writing code, not get creative.

4. qwen3.6-35b-a3b-q4_k_xl-nothink-general — No-think, general tasks

Qwen 3.6 can suppress the thinking scratchpad entirely, giving faster responses with lower token usage. The enable_thinking: false flag is passed in both chat_template_kwargs and extra_body to ensure it is respected at both the LiteLLM and llama.cpp layers:

temperature: 0.7
top_p: 0.8
presence_penalty: 1.5
chat_template_kwargs:
  enable_thinking: false
extra_body:
  top_k: 20
  min_p: 0.0
  repetition_penalty: 1.0
  chat_template_kwargs:
    enable_thinking: false

5. qwen3.6-35b-a3b-q4_k_xl-nothink-coding — No-think, precise coding

No thinking scratchpad, tight sampling for code:

temperature: 0.6
top_p: 0.95
presence_penalty: 0.0
chat_template_kwargs:
  enable_thinking: false

Fastest option for code completion tasks where you just want the answer.

6. qwen3.6-35b-a3b-q4_k_xl-nothink-reasoning — No-think, reasoning

No scratchpad but with high-entropy sampling — useful for multi-step reasoning tasks where you want diversity across retries:

temperature: 1.0
top_p: 0.95
presence_penalty: 1.5
chat_template_kwargs:
  enable_thinking: false

Why This Approach?

Single model load

Loading a 35B model takes time and significant VRAM. By keeping one model resident in llama.cpp and routing at the proxy layer, there is no model swap overhead between requests. The model is always warm.

Separation of concerns

Clients (VS Code extensions, Open WebUI, scripts) just pick an alias by name. They don’t need to know anything about sampling parameters. Changing a tuning value is a one-line edit in litellm-config.yaml — no client changes required.

Thinking vs. no-thinking

Qwen 3.6’s thinking mode produces significantly better results on complex tasks, but it increases latency and token usage because the model emits its reasoning chain first. For tasks that don’t benefit from it (quick lookups, code formatting, summarisation), the nothink aliases skip the scratchpad entirely.

Parameter presets per use case

| Alias | Temperature | Use case |
| --- | --- | --- |
| thinking-general | 1.0 | Brainstorming, open-ended |
| thinking-coding | 0.6 | Careful code generation with reasoning |
| nothink-general | 0.7 | Fast general chat |
| nothink-coding | 0.6 | Fast code completion |
| nothink-reasoning | 1.0 | Diverse multi-step reasoning |

Summary

The full stack is:

  1. llama.cpp — loads the model once onto the GPU and exposes a single OpenAI-compatible endpoint.
  2. LiteLLM — acts as a multiplexing proxy, turning that one endpoint into many named models with different sampling presets.
  3. Clients — pick the alias that matches the task. No knowledge of the underlying hardware or parameters needed.

The result is a flexible, low-overhead local inference setup where the right parameters are applied automatically based on what you’re asking the model to do.