Run Qwen 35B A3B at 80 tok/s on 12GB VRAM: The MTP + llama.cpp Cheat Sheet
What if a 12GB RTX 4070 Super could run a 35B-parameter model at 80 tok/s with 128K context? That's not theoretical: it's what Reddit user janvitos reported on r/LocalLLaMA (667+ upvotes).
Core Mechanism: MTP (Multi-Token Prediction) lets llama.cpp draft multiple tokens per inference step instead of one, which boosts generation speed. Combined with CPU offloading via the -fitt parameter, a 12GB GPU holds the GPU-resident part of the 35B model body, the MTP draft model, and the KV cache at the same time, with draft acceptance rates around 80% in the original benchmarks.
Prerequisites
Hardware
| Component | Original Config | Minimum Recommended |
|---|---|---|
| GPU | RTX 4070 Super 12GB | 8GB VRAM (slower) |
| CPU | AMD Ryzen 7 9700X | Any 8+ core |
| RAM | 48GB DDR5-6000 | 32GB DDR4/DDR5 |
| OS | CachyOS (highly recommended) | Windows / Linux / macOS |
💡 Key Trick: The original author plugged their monitor into the iGPU (integrated graphics), leaving all 12GB VRAM free for inference. If your dGPU handles display output, reserve some VRAM and increase -fitt.
Software
- llama.cpp: Latest release (MTP support merged into master on 2026-05-19 — just download the latest release)
- Model: Qwen3.6-35B-A3B-MTP-GGUF (Unsloth UD XL quantization + MTP layers)
Overview
The configuration works through four key mechanisms:
- MTP Technology: Model predicts multiple tokens per inference step
- CPU Offloading: Model body too large for 12GB VRAM, partial layers offloaded to CPU
- -fitt Parameter: Precisely controls how much GPU space is reserved for the MTP draft model and KV cache
- iGPU Display: dGPU handles no display, all VRAM available for inference
Step 1: Download llama.cpp
MTP support merged into llama.cpp master on 2026-05-19. Just download the latest release — no manual compilation needed.
Head to llama.cpp Releases and download the latest version for your hardware:
- NVIDIA GPUs: Download the cuda-12 or cuda-13 build
- AMD GPUs: Download the rocm build
- Intel GPUs / CPU-only: Download the vulkan build
- macOS: Download the apple build
Step 2: Download the Model
# Install huggingface-cli if needed
Available Quantizations:
| Quant | File Size | Notes |
|---|---|---|
| Q2_K_XL | 12.3 GB | Smallest, quality loss |
| Q3_K_XL | 16.5 GB | Compact |
| Q4_K_XL | 21.7 GB | Recommended, balanced |
| Q5_K_XL | 25.6 GB | Higher quality |
| Q6_K_XL | 30.5 GB | High quality |
| Q8_K_XL | 36.6 GB | Maximum quality |
💡 Tip: Although Q4_K_XL is 21.7GB, the -fitt parameter ensures only part of the model runs on the GPU while the rest runs on the CPU, so 12GB VRAM is sufficient.
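To make the tip concrete, here is a rough back-of-the-envelope sketch of how much of each quantization would have to live in system RAM. It assumes, following this article's description, that the -fitt value is simply set aside on the GPU, and it ignores compute buffers and the GB/GiB distinction, so the results are optimistic lower bounds rather than what llama.cpp will actually report.

```python
# File sizes (GB) copied from the quantization table above.
QUANT_SIZES_GB = {
    "Q2_K_XL": 12.3,
    "Q3_K_XL": 16.5,
    "Q4_K_XL": 21.7,
    "Q5_K_XL": 25.6,
    "Q6_K_XL": 30.5,
    "Q8_K_XL": 36.6,
}


def cpu_offload_gb(file_gb, vram_gb=12.0, reserved_mb=1536):
    """Estimate how many GB of weights cannot fit on the GPU.

    Assumption: `reserved_mb` (the -fitt value) is kept free for the
    MTP draft model and KV cache, and everything else holds weights.
    """
    usable_gb = vram_gb - reserved_mb / 1024
    return max(0.0, file_gb - usable_gb)


for name, size in QUANT_SIZES_GB.items():
    print(f"{name}: at least ~{cpu_offload_gb(size):.1f} GB of weights on CPU")
```

Under these assumptions roughly half of Q4_K_XL ends up in system RAM, which is why the hardware table asks for 32GB or more.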
Step 3: Launch llama-server
Original Command (RTX 4070 Super 12GB)
llama-server \
Key Parameters Explained
| Parameter | Value | Description |
|---|---|---|
| -fitt | 1536 | Most critical. Reserves MB on GPU for MTP draft model + KV cache. 1536 = 1.5GB |
| -c | 131072 | Context length 128K tokens |
| -n | 32768 | Max generation length 32K tokens |
| -fa | on | Flash Attention enabled |
| -ctk / -ctv | q8_0 | KV cache key/value quantized to Q8 |
| -ctkd / -ctvd | q8_0 | MTP draft model KV cache also Q8 |
| --spec-type | draft-mtp | Enable MTP draft model |
| --spec-draft-n-max | 2 | Predict up to 2 extra tokens per step |
| --no-mmap | (none) | Disable memory mapping for stability |
| --mlock | (none) | Lock memory to prevent swapping |
💡 -fitt Tuning: If your dGPU handles display, 1536 may be too small. Start with 2048 or 2560 and decrease gradually.
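As a sketch of how the parameters in the table fit together, the snippet below assembles them into a single llama-server invocation. The flag values come from the table; the `-m` flag, the model path, and the flag ordering are assumptions added here for illustration and are not taken from the original post.

```python
import shlex

# Hypothetical path: point this at the GGUF file you downloaded.
MODEL_PATH = "models/your-model.gguf"


def build_command(model_path, fitt_mb=1536, ctx=131072):
    """Return the llama-server argv list using the values from the table."""
    return [
        "llama-server",
        "-m", model_path,             # assumption: model passed via -m
        "-fitt", str(fitt_mb),        # MB reserved on the GPU
        "-c", str(ctx),               # context length
        "-n", "32768",                # max generation length
        "-fa", "on",                  # Flash Attention
        "-ctk", "q8_0", "-ctv", "q8_0",    # main KV cache quantization
        "-ctkd", "q8_0", "-ctvd", "q8_0",  # draft KV cache quantization
        "--spec-type", "draft-mtp",
        "--spec-draft-n-max", "2",
        "--no-mmap",
        "--mlock",
    ]


# Print a copy-pasteable shell command; nothing is executed here.
print(shlex.join(build_command(MODEL_PATH)))
```

Keeping the command in a small function makes it easy to try different values, for example `build_command(MODEL_PATH, fitt_mb=2048)` when the dGPU also drives a display.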
Step 4: Understanding How MTP Works
What is MTP?
MTP (Multi-Token Prediction) is a special capability of the Qwen3.6 model. While normal models predict one token at a time, MTP models predict multiple tokens simultaneously per inference step.
Why Does It Speed Things Up?
llama.cpp uses the MTP draft model for speculative decoding:
- MTP draft model rapidly predicts 2-3 tokens
- Main model verifies these predictions
- Correct predictions are accepted; at the first incorrect one, the main model's own token is used instead and any later draft tokens are discarded
- With 80%+ draft acceptance rate, actual speed improves dramatically
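The draft-and-verify loop described above can be illustrated with a toy example. This is not llama.cpp's implementation: both "models" are small deterministic functions invented for the demonstration, and verification is done token by token, whereas a real engine checks all drafts in one batched pass. The point is only that the output matches plain decoding exactly while several tokens can be emitted per step.

```python
def target_next(context):
    """Toy 'main model': deterministic next token for a context."""
    return (sum(context) * 31 + len(context)) % 50


def draft_next(context):
    """Toy 'draft model': agrees with the target unless the token is a multiple of 5."""
    t = target_next(context)
    return t + 1 if t % 5 == 0 else t


def plain_decode(prompt, n_tokens):
    """Ordinary decoding: one main-model token per step."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.append(target_next(out))
    return out[len(prompt):]


def speculative_decode(prompt, n_tokens, n_draft=2):
    """Greedy speculative decoding; returns (tokens, draft acceptance rate)."""
    out = list(prompt)
    drafted = accepted = 0
    while len(out) - len(prompt) < n_tokens:
        # 1. The draft model proposes n_draft tokens autoregressively.
        proposal = []
        for _ in range(n_draft):
            proposal.append(draft_next(out + proposal))
        drafted += n_draft
        # 2. The main model checks the proposals in order.
        for tok in proposal:
            expected = target_next(out)
            if tok != expected:
                out.append(expected)  # use the main model's token, drop later drafts
                break
            out.append(tok)
            accepted += 1
        else:
            # Every draft was accepted: verification also yields one extra token.
            out.append(target_next(out))
    rate = accepted / drafted if drafted else 0.0
    return out[len(prompt):len(prompt) + n_tokens], rate


tokens, rate = speculative_decode([1, 2, 3], 40)
print("identical to plain decoding:", tokens == plain_decode([1, 2, 3], 40))
print(f"draft acceptance rate: {rate:.1%}")
```

Because every emitted token is one the main model would have produced anyway, drafting changes the number of steps, not the text.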
--spec-draft-n-max Trade-offs
| Value | Speed | Acceptance Rate | Notes |
|---|---|---|---|
| 2 | Fast | Higher | Recommended, best balance |
| 3 | Slightly faster | Lower | Acceptance drops, overall gains minimal |
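One way to reason about this trade-off is a simplified probability model. The sketch below assumes each draft token is accepted independently with the same probability p, which real workloads do not guarantee, and it ignores the time spent drafting. Under that assumption the expected number of tokens per verification step is 1 + p + p² + … + pⁿ.

```python
def expected_tokens_per_step(p, n_draft):
    """Expected tokens emitted per verification step.

    Simplifying assumptions: each draft token is accepted independently
    with probability p, and drafts after the first rejection are dropped.
    """
    accepted = sum(p ** k for k in range(1, n_draft + 1))
    return 1.0 + accepted  # the main model always contributes one token


for n in (1, 2, 3):
    print(f"n_draft={n}: {expected_tokens_per_step(0.8, n):.3f} tokens/step")
```

The gain from each additional draft shrinks geometrically, while every draft still has to be generated and verified. If the per-token acceptance rate also drops at n=3, as the table suggests, the extra token can end up adding little.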
Step 5: Run Benchmarks
# Download benchmark script
Original Benchmark Results
| Task | Pred | Draft | Acc | Rate | Speed (tok/s) |
|---|---|---|---|---|---|
| code_python | 192 | 132 | 125 | 94.7% | 80.8 |
| code_cpp | 58 | 40 | 37 | 92.5% | 81.8 |
| explain_concept | 192 | 152 | 114 | 75.0% | 70.0 |
| summarize | 53 | 40 | 32 | 80.0% | 75.4 |
| qa_factual | 192 | 144 | 119 | 82.6% | 77.8 |
| translation | 22 | 16 | 13 | 81.2% | 81.9 |
| creative_short | 192 | 160 | 111 | 69.4% | 69.2 |
| stepwise_math | 192 | 144 | 119 | 82.6% | 76.5 |
| long_code_review | 192 | 148 | 117 | 79.0% | 73.2 |
Average ~76 tok/s, code tasks peaking at 81.8 tok/s.
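The summary figures can be rechecked from the table itself. The sketch below copies the rows from the table above, recomputes each acceptance rate as accepted / drafted, and derives the mean speed plus a pooled acceptance rate across all tasks. The helper names are made up for this example.

```python
# (task, predicted, drafted, accepted, tok_per_s) copied from the table above.
ROWS = [
    ("code_python", 192, 132, 125, 80.8),
    ("code_cpp", 58, 40, 37, 81.8),
    ("explain_concept", 192, 152, 114, 70.0),
    ("summarize", 53, 40, 32, 75.4),
    ("qa_factual", 192, 144, 119, 77.8),
    ("translation", 22, 16, 13, 81.9),
    ("creative_short", 192, 160, 111, 69.2),
    ("stepwise_math", 192, 144, 119, 76.5),
    ("long_code_review", 192, 148, 117, 73.2),
]


def acceptance_rate(drafted, accepted):
    """Fraction of drafted tokens that the main model accepted."""
    return accepted / drafted


def mean_speed(rows):
    """Unweighted average of the per-task tok/s figures."""
    return sum(r[4] for r in rows) / len(rows)


def pooled_acceptance(rows):
    """Acceptance rate over all drafted tokens, not averaged per task."""
    return sum(r[3] for r in rows) / sum(r[2] for r in rows)


for task, _, drafted, accepted, speed in ROWS:
    print(f"{task:18s} {acceptance_rate(drafted, accepted):6.1%} {speed:5.1f} tok/s")
print(f"mean speed: {mean_speed(ROWS):.1f} tok/s")
print(f"pooled acceptance: {pooled_acceptance(ROWS):.1%}")
```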
Windows Environment Test Results (from Reddit Comments)
The following data comes from community users who tested on Windows:
RTX 3060 12GB + R5 5600 + 32GB RAM (Windows)
| Task | Accept Rate | Speed (tok/s) |
|---|---|---|
| code_python | 78.4% | 40.3 |
| code_cpp | 92.5% | 49.3 |
| explain_concept | 78.4% | 41.6 |
| summarize | 80.0% | 44.4 |
| qa_factual | 82.6% | 45.6 |
| stepwise_math | 88.4% | 46.7 |
| long_code_review | 80.8% | 43.5 |
Source: Reddit user ItsRektTime
RTX 3060 12GB + Ryzen 9 5950X + 40GB RAM (Windows 11 Pro)
| Task | Accept Rate | Speed (tok/s) |
|---|---|---|
| code_python | 88.5% | 38.9 |
| code_cpp | 72.8% | 35.0 |
| explain_concept | 67.7% | 33.7 |
| qa_factual | 72.8% | 35.2 |
| stepwise_math | 76.4% | 35.8 |
Source: Reddit user RaspNAS (Q3_K_XL quantization)
RTX 5060 Ti 16GB (Windows 11)
- Without MTP: ~55 tok/s
- With MTP: ~66 tok/s
- MTP improvement: about +20% based on these rounded figures
Source: Reddit user the_masel
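The relative gain follows directly from the two throughput figures. A minimal sketch, using the rounded numbers quoted above (the function name is made up for this example):

```python
def speedup_percent(baseline_tps, mtp_tps):
    """Relative throughput gain of MTP over the baseline, in percent."""
    return (mtp_tps / baseline_tps - 1.0) * 100.0


# Rounded figures from the RTX 5060 Ti report above.
print(f"{speedup_percent(55, 66):.0f}% faster with MTP")
```

Because both inputs are approximate, the result is only accurate to within a few percentage points.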
FAQ
Q: Can it run on 8GB VRAM?
A: Theoretically yes, but with aggressive CPU offloading. -fitt needs to increase to 3000+, and speed drops significantly. Reddit comments show users attempting this on RTX 3060 6GB laptops with further tuning needed.
Q: Windows support?
A: Yes. Use Vulkan or CUDA backend. Docker also works on Windows (requires WSL2 + Docker Desktop).
Q: Does MTP affect output quality?
A: No. MTP only affects generation speed, not model output quality. Draft model predictions are verified by the main model — wrong predictions are corrected.
Q: How’s the 128K context experience?
A: The original author set 128K context (-c 131072), but 32K is the sweet spot for cost-performance. Beyond 32K, some users report quality degradation (Qwen3 from 95% to 75%), depending on the task.
Q: Do I need CachyOS?
A: Not required, but highly recommended by the original author. CachyOS is a performance-optimized Arch Linux derivative with measurable speedups for local inference.
Q: How much faster than standard llama.cpp (no MTP)?
A: According to Reddit comments, MTP improves tok/s by approximately 15-30%. One user with RTX 5060 Ti 16GB reported: ~55 tok/s without MTP, ~66 tok/s with MTP.
Advanced Tips
1. Tuning -fitt for Different GPUs
# RTX 4070 Super (dGPU not displaying): 1536
2. Reduce Context for Speed
If you don’t need 128K context, reducing -c frees more VRAM for inference:
# 32K context (recommended for daily use)
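To see why a smaller context frees memory, it helps to estimate the KV cache size. The sketch below uses the standard formula (keys and values, per attention layer, per KV head, per token) with q8_0 storage, which packs 32 values into 34 bytes. The architecture numbers are placeholders chosen for illustration and are not taken from the Qwen3.6 model card, so only the scaling with context length should be read from the result.

```python
# q8_0 stores blocks of 32 int8 values plus one fp16 scale: 34 bytes per 32 values.
Q8_0_BYTES_PER_VALUE = 34 / 32


def kv_cache_gib(n_ctx, n_attn_layers, n_kv_heads, head_dim,
                 bytes_per_value=Q8_0_BYTES_PER_VALUE):
    """Approximate KV cache size in GiB for a given context length."""
    values_per_token = 2 * n_attn_layers * n_kv_heads * head_dim  # K and V
    return n_ctx * values_per_token * bytes_per_value / 2**30


# Hypothetical architecture values, NOT the real Qwen3.6-35B-A3B configuration.
ARCH = dict(n_attn_layers=10, n_kv_heads=2, head_dim=256)

for ctx in (131072, 32768):
    print(f"-c {ctx}: ~{kv_cache_gib(ctx, **ARCH):.2f} GiB of KV cache")
```

Whatever the real layer and head counts are, the cache grows linearly with -c, so dropping from 128K to 32K cuts it to a quarter.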
3. Use --models-preset to Simplify
Create a config file to avoid typing long commands:
# models/qwen35b-mtp.yaml
⚠️ Note: Reddit comments report that --models-preset mode causes 400 errors with mtp-bench.py. If you hit this, use CLI arguments directly.
Conclusion
You’ve learned the complete workflow for running Qwen3.6 35B A3B on a 12GB GPU:
- ✅ Downloaded llama.cpp with MTP support (merged into master 2026-05-19)
- ✅ Downloaded the MTP quantized model
- ✅ Understood the -fitt parameter's critical role
- ✅ Launched llama-server and ran benchmarks
- ✅ Mastered -fitt tuning and context adjustment
MTP is a breakthrough for local inference: in the original benchmarks it let a 12GB GPU run a 35B-parameter model at roughly 70 to 80 tok/s with a 128K context configured. For users without 24GB+ GPUs, this is currently the best bang-for-buck approach.
📖 Original Post: Reddit r/LocalLLaMA
📦 Model Download: HuggingFace
🔧 llama.cpp Releases: GitHub
📊 Benchmark: mtp-bench.py





