Run Qwen 35B A3B at 80 tok/s on 12GB VRAM: The MTP + llama.cpp Cheat Sheet

What if a 12GB RTX 4070 Super could run a 35B-parameter model at 80 tok/s with 128K context? That’s not theoretical — it’s what Reddit user janvitos achieved on r/LocalLLaMA (667+ upvotes).

Core Mechanism: MTP (Multi-Token Prediction) lets llama.cpp predict multiple tokens per inference step instead of one, dramatically boosting generation speed. Combined with CPU offloading via the -fitt parameter, a 12GB GPU loads the 35B model body + MTP draft model + KV cache simultaneously, achieving 80%+ draft acceptance rate.

Prerequisites

Hardware

Component Original Config Minimum Recommended
GPU RTX 4070 Super 12GB 8GB VRAM (slower)
CPU AMD Ryzen 7 9700X Any 8+ core
RAM 48GB DDR5-6000 32GB DDR4/DDR5
OS CachyOS (highly recommended) Windows / Linux / macOS

💡 Key Trick: The original author plugged their monitor into the iGPU (integrated graphics), leaving all 12GB VRAM free for inference. If your dGPU handles display output, reserve some VRAM and increase -fitt.

Software

Overview

The configuration works through four key mechanisms:

  1. MTP Technology: Model predicts multiple tokens per inference step
  2. CPU Offloading: Model body too large for 12GB VRAM, partial layers offloaded to CPU
  3. -fitt Parameter: Precisely controls how much GPU space is reserved for the MTP draft model and KV cache
  4. iGPU Display: dGPU handles no display, all VRAM available for inference

Step 1: Download llama.cpp

MTP support merged into llama.cpp master on 2026-05-19. Just download the latest release — no manual compilation needed.

Head to llama.cpp Releases and download the latest version for your hardware:

  • NVIDIA GPUs: Download the cuda-12 or cuda-13 build
  • AMD GPUs: Download the rocm build
  • Intel GPUs / CPU-only: Download the vulkan build
  • macOS: Download the apple build

Step 2: Download the Model

1
2
3
4
5
6
7
# Install huggingface-cli if needed
pip install huggingface-hub

# Download Q4_K_XL quantization (recommended, balanced quality/speed)
huggingface-cli download havenoammo/Qwen3.6-35B-A3B-MTP-GGUF \
Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf \
--local-dir ./models

Available Quantizations:

Quant File Size Notes
Q2_K_XL 12.3 GB Smallest, quality loss
Q3_K_XL 16.5 GB Compact
Q4_K_XL 21.7 GB Recommended, balanced
Q5_K_XL 25.6 GB Higher quality
Q6_K_XL 30.5 GB High quality
Q8_K_XL 36.6 GB Maximum quality

💡 Tip: Although Q4_K_XL is 21.7GB, the -fitt parameter ensures only partial layers run on GPU — the rest run on CPU. So 12GB VRAM is sufficient.

Step 3: Launch llama-server

Original Command (RTX 4070 Super 12GB)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
llama-server \
-m Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf \
-fitt 1536 \
-c 131072 \
-n 32768 \
-fa on \
-np 1 \
-ctk q8_0 \
-ctv q8_0 \
-ctkd q8_0 \
-ctvd q8_0 \
-ctxcp 64 \
--no-mmap \
--mlock \
--no-warmup \
--spec-type draft-mtp \
--spec-draft-n-max 2 \
--chat-template-kwargs '{"preserve_thinking": true}' \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--presence-penalty 0.0 \
--repeat-penalty 1.0

Key Parameters Explained

Parameter Value Description
-fitt 1536 Most critical. Reserves MB on GPU for MTP draft model + KV cache. 1536 = 1.5GB
-c 131072 Context length 128K tokens
-n 32768 Max generation length 32K tokens
-fa on Flash Attention enabled
-ctk / -ctv q8_0 KV cache key/value quantized to Q8
-ctkd / -ctvd q8_0 MTP draft model KV cache also Q8
--spec-type draft-mtp Enable MTP draft model
--spec-draft-n-max 2 Predict up to 2 extra tokens per step
--no-mmap Disable memory mapping for stability
--mlock Lock memory to prevent swapping

💡 -fitt Tuning: If your dGPU handles display, 1536 may be too small. Start with 2048 or 2560 and decrease gradually.

Step 4: Understanding How MTP Works

What is MTP?

MTP (Multi-Token Prediction) is a special capability of the Qwen3.6 model. While normal models predict one token at a time, MTP models predict multiple tokens simultaneously per inference step.

Why Does It Speed Things Up?

llama.cpp uses the MTP draft model for speculative decoding:

  1. MTP draft model rapidly predicts 2-3 tokens
  2. Main model verifies these predictions
  3. Correct predictions are accepted; incorrect ones are regenerated from scratch
  4. With 80%+ draft acceptance rate, actual speed improves dramatically

--spec-draft-n-max Trade-offs

Value Speed Acceptance Rate Notes
2 Fast Higher Recommended, best balance
3 Slightly faster Lower Acceptance drops, overall gains minimal

Step 5: Run Benchmarks

1
2
3
4
5
# Download benchmark script
curl -O https://gist.githubusercontent.com/am17an/228edfb84ed082aa88e3865d6fa27090/raw/7a2cee40ee1e2ca5365f4cef93632193d7ad852a/mtp-bench.py

# Run benchmark
python3 mtp-bench.py

Original Benchmark Results

Task Pred Draft Acc Rate Speed (tok/s)
code_python 192 132 125 94.7% 80.8
code_cpp 58 40 37 92.5% 81.8
explain_concept 192 152 114 75.0% 70.0
summarize 53 40 32 80.0% 75.4
qa_factual 192 144 119 82.6% 77.8
translation 22 16 13 81.2% 81.9
creative_short 192 160 111 69.4% 69.2
stepwise_math 192 144 119 82.6% 76.5
long_code_review 192 148 117 79.0% 73.2

Average ~76 tok/s, code tasks peaking at 81.8 tok/s.

Windows Environment Test Results (from Reddit Comments)

The following data comes from community users who tested on Windows:

RTX 3060 12GB + R5 5600 + 32GB RAM (Windows)

Task Accept Rate Speed (tok/s)
code_python 78.4% 40.3
code_cpp 92.5% 49.3
explain_concept 78.4% 41.6
summarize 80.0% 44.4
qa_factual 82.6% 45.6
stepwise_math 88.4% 46.7
long_code_review 80.8% 43.5

Source: Reddit user ItsRektTime

RTX 3060 12GB + Ryzen 9 5950X + 40GB RAM (Windows 11 Pro)

Task Accept Rate Speed (tok/s)
code_python 88.5% 38.9
code_cpp 72.8% 35.0
explain_concept 67.7% 33.7
qa_factual 72.8% 35.2
stepwise_math 76.4% 35.8

Source: Reddit user RaspNAS (Q3_K_XL quantization)

RTX 5060 Ti 16GB (Windows 11)

  • Without MTP: ~55 tok/s
  • With MTP: ~66 tok/s
  • MTP improvement: +15%

Source: Reddit user the_masel

FAQ

Q: Can it run on 8GB VRAM?
A: Theoretically yes, but with aggressive CPU offloading. -fitt needs to increase to 3000+, and speed drops significantly. Reddit comments show users attempting this on RTX 3060 6GB laptops with further tuning needed.

Q: Windows support?
A: Yes. Use Vulkan or CUDA backend. Docker also works on Windows (requires WSL2 + Docker Desktop).

Q: Does MTP affect output quality?
A: No. MTP only affects generation speed, not model output quality. Draft model predictions are verified by the main model — wrong predictions are corrected.

Q: How’s the 128K context experience?
A: The original author set 128K context (-c 131072), but 32K is the sweet spot for cost-performance. Beyond 32K, some users report quality degradation (Qwen3 from 95% to 75%), depending on the task.

Q: Do I need CachyOS?
A: Not required, but highly recommended by the original author. CachyOS is a performance-optimized Arch Linux derivative with measurable speedups for local inference.

Q: How much faster than standard llama.cpp (no MTP)?
A: According to Reddit comments, MTP improves tok/s by approximately 15-30%. One user with RTX 5060 Ti 16GB reported: ~55 tok/s without MTP, ~66 tok/s with MTP.

Advanced Tips

1. Tuning -fitt for Different GPUs

1
2
3
4
5
6
7
8
9
10
11
# RTX 4070 Super (dGPU not displaying): 1536
-fitt 1536

# RTX 4070 Super (dGPU handles display): 2048-2560
-fitt 2048

# RTX 3060 12GB: 1736
-fitt 1736

# RTX 4060 Ti 8GB: 3000+
-fitt 3000

2. Reduce Context for Speed

If you don’t need 128K context, reducing -c frees more VRAM for inference:

1
2
3
4
5
# 32K context (recommended for daily use)
-c 32768

# 16K context (faster)
-c 16384

3. Use --models-preset to Simplify

Create a config file to avoid typing long commands:

1
2
3
4
5
6
7
# models/qwen35b-mtp.yaml
model: Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf
fit-target: 1536
ctx-size: 131072
flash-attn: on
cache-type-k: q8_0
cache-type-v: q8_0

⚠️ Note: Reddit comments report --models-preset mode causes 400 errors with mtp-bench.py. If you hit this, use CLI arguments directly.

Conclusion

You’ve learned the complete workflow for running Qwen3.6 35B A3B on a 12GB GPU:

  1. ✅ Downloaded llama.cpp with MTP support (merged into master 2026-05-19)
  2. ✅ Downloaded the MTP quantized model
  3. ✅ Understood the -fitt parameter’s critical role
  4. ✅ Launched llama-server and ran benchmarks
  5. ✅ Mastered -fitt tuning and context adjustment

MTP is a breakthrough for local inference — it enables 12GB GPUs to smoothly run 35B-parameter models at 80 tok/s with 128K context. For users without 24GB+ GPUs, this is currently the best bang-for-buck approach.

📖 Original Post: Reddit r/LocalLLaMA
📦 Model Download: HuggingFace
🔧 llama.cpp Releases: GitHub
📊 Benchmark: mtp-bench.py