I got Qwen 3.6 27B running at 85 to 90 tokens per second on my RTX 5090. The setup took a few rounds of tweaking, but the result runs fast and handles long context well. A Twitter user named @GatoAnvorgeso shared a config that worked. I built on that with some help from ChatGPT and a lot of trial and error.

My machine runs Windows 11. It's my gaming PC, which means I run development and AI experiments on top of games and general use. Windows makes the llama.cpp build a little more involved than Linux, but the process is straightforward once you have the right tools installed.

Prerequisites

You need three things before building llama.cpp with CUDA support on Windows.

Install Visual Studio Build Tools from visualstudio.microsoft.com/downloads. Make sure the C++ desktop development workload is selected.

Install CMake through winget:

winget install Kitware.CMake

Install the CUDA Toolkit. I used version 12.8 from the NVIDIA download page. Pick the version that matches your driver.

Building llama.cpp

Clone the repo and configure the build:

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -G "Visual Studio 17 2022" -A x64 -DGGML_CUDA=ON

I only needed the server binary, so I built that target:

cmake --build build --config Release --target llama-server -j

Build all binaries instead if you want the full suite:

cmake --build build --config Release -j

The compiled executables land in build/bin/Release/.

Adding llama-server to PATH

Add the Release binary folder to your system PATH so you can call llama-server from any terminal. On Windows, search for "Environment Variables" in the Start menu, find the Path variable, and add the full path to llama.cpp\build\bin\Release.

The Launch Script

I created a .cmd file called qwen.cmd that wraps llama-server with the optimized config. Save it somewhere on your PATH and you can start the server by typing qwen in any terminal.

@echo off
"YOUR PATH TO llama.cpp\build\bin\Release\llama-server.exe" ^
  -m "YOUR PATH TO GGUF FILE\unsloth_Qwen3.6-27B-UD-Q4_K_XL.gguf" ^
  -ngl 99 ^
  -c 140000 ^
  -b 2048 ^
  -ub 512 ^
  -fa on ^
  --kv-unified ^
  -ctk q8_0 ^
  -ctv q8_0 ^
  -ctkd q8_0 ^
  -ctvd q8_0 ^
  -np 1 ^
  --no-mmap ^
  --spec-type draft-mtp ^
  --spec-draft-n-max 2 ^
  --jinja ^
  --temp 0.5 ^
  --top-p 0.95 ^
  --top-k 20 ^
  --min-p 0.0 ^
  --repeat-penalty 1.0 ^
  --ctx-checkpoints 24 ^
  --cache-ram 8192 ^
  --checkpoint-min-step 256

Update the two paths at the top. Point the model path to your GGUF file. I used the unsloth/Qwen3.6-MTP-GGUF_Qwen3.6-27B-UD-Q4_K_XL.gguf quantization.

What the Flags Do

The config does a few specific things that push performance.

-ngl 99 offloads all layers to the GPU. Leave nothing on the CPU.

-c 140000 sets a 140k context window. Enough room for long coding sessions with full conversation history.

-b 2048 and -ub 512 set the batch size and unbatched size. Larger batch sizes feed the GPU more tokens at once.

-fa on enables Flash Attention, which speeds up attention computation at the cost of some VRAM.

--kv-unified with the -ctk and -ctv flags set to q8_0 gives you an 8-bit KV cache. The standard quantization is lower precision. Q8 keeps quality high over long contexts.

--no-mmap forces the model to load through standard file I/O instead of memory mapping. This avoids some Windows-specific memory issues.

--spec-type draft-mtp and --spec-draft-n-max 2 enable speculative decoding using the model's own multi-token prediction heads. The model drafts up to 2 tokens ahead, then verifies them. This is where the speed comes from.

The sampling parameters keep output focused: temperature at 0.5, top-p at 0.95, top-k at 20, and repeat penalty at 1.0. Good settings for code generation where you want consistency over creativity.

--ctx-checkpoints 24 and --checkpoint-min-step 256 enable context checkpointing. The server saves KV cache snapshots so it can resume from a checkpoint instead of reprocessing the full context when memory gets tight.

Results

The server hits about 90 tokens per second on my setup. The Q8 KV cache keeps output quality consistent across the full 140k context. Speculative decoding with MTP is the main speed multiplier. Without it, you'd see a noticeable drop.

Point your client at http://127.0.0.1:8080 after starting the server, and you're running.

Sources

unsloth/Qwen3.6-27B-MTP-GGUF — Qwen3.6-27B-UD-Q4_K_XL.gguf — Exact model and quantization used in this guide.
@GatoAnvorgeso on X — Original config that served as the starting point.

Articles

Three Things That Cut Token Usage with Local AI

How to Set Up Qwen 3.6 27B and llama.cpp for Optimal Coding

Select an article