AI Tools Qwen3.6 MTP llama.cpp Local Inference 12GB VRAM

Run Qwen 35B A3B at 80 tok/s on 12GB VRAM: The MTP + llama.cpp Cheat Sheet

Xiaoxin Software AlternativesCreated2026-05-22

What if a 12GB RTX 4070 Super could run a 35B-parameter model at 80 tok/s with 128K context? That’s not theoretical — it’s what Reddit user janvitos achieved on r/LocalLLaMA (667+ upvotes).

Core Mechanism: MTP (Multi-Token Prediction) lets llama.cpp predict multiple tokens per inference step instead of one, dramatically boosting generation speed. Combined with CPU offloading via the -fitt parameter, a 12GB GPU loads the 35B model body + MTP draft model + KV cache simultaneously, achieving 80%+ draft acceptance rate.

Prerequisites

Hardware

Component	Original Config	Minimum Recommended
GPU	RTX 4070 Super 12GB	8GB VRAM (slower)
CPU	AMD Ryzen 7 9700X	Any 8+ core
RAM	48GB DDR5-6000	32GB DDR4/DDR5
OS	CachyOS (highly recommended)	Windows / Linux / macOS

💡 Key Trick: The original author plugged their monitor into the iGPU (integrated graphics), leaving all 12GB VRAM free for inference. If your dGPU handles display output, reserve some VRAM and increase -fitt.

Software

llama.cpp: Latest release (MTP support merged into master on 2026-05-19 — just download the latest release)
Model: Qwen3.6-35B-A3B-MTP-GGUF (Unsloth UD XL quantization + MTP layers)

Overview

The configuration works through four key mechanisms:

MTP Technology: Model predicts multiple tokens per inference step
CPU Offloading: Model body too large for 12GB VRAM, partial layers offloaded to CPU
-fitt Parameter: Precisely controls how much GPU space is reserved for the MTP draft model and KV cache
iGPU Display: dGPU handles no display, all VRAM available for inference

Step 1: Download llama.cpp

MTP support merged into llama.cpp master on 2026-05-19. Just download the latest release — no manual compilation needed.

Head to llama.cpp Releases and download the latest version for your hardware:

NVIDIA GPUs: Download the cuda-12 or cuda-13 build
AMD GPUs: Download the rocm build
Intel GPUs / CPU-only: Download the vulkan build
macOS: Download the apple build

Step 2: Download the Model

# Install huggingface-cli if needed
pip install huggingface-hub

# Download Q4_K_XL quantization (recommended, balanced quality/speed)
huggingface-cli download havenoammo/Qwen3.6-35B-A3B-MTP-GGUF \
  Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf \
  --local-dir ./models

Available Quantizations:

Quant	File Size	Notes
Q2_K_XL	12.3 GB	Smallest, quality loss
Q3_K_XL	16.5 GB	Compact
Q4_K_XL	21.7 GB	Recommended, balanced
Q5_K_XL	25.6 GB	Higher quality
Q6_K_XL	30.5 GB	High quality
Q8_K_XL	36.6 GB	Maximum quality

💡 Tip: Although Q4_K_XL is 21.7GB, the -fitt parameter ensures only partial layers run on GPU — the rest run on CPU. So 12GB VRAM is sufficient.

Step 3: Launch llama-server

Original Command (RTX 4070 Super 12GB)

llama-server \
  -m Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf \
  -fitt 1536 \
  -c 131072 \
  -n 32768 \
  -fa on \
  -np 1 \
  -ctk q8_0 \
  -ctv q8_0 \
  -ctkd q8_0 \
  -ctvd q8_0 \
  -ctxcp 64 \
  --no-mmap \
  --mlock \
  --no-warmup \
  --spec-type draft-mtp \
  --spec-draft-n-max 2 \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0

Key Parameters Explained

Parameter	Value	Description
`-fitt`	1536	Most critical. Reserves MB on GPU for MTP draft model + KV cache. 1536 = 1.5GB
`-c`	131072	Context length 128K tokens
`-n`	32768	Max generation length 32K tokens
`-fa`	on	Flash Attention enabled
`-ctk` / `-ctv`	q8_0	KV cache key/value quantized to Q8
`-ctkd` / `-ctvd`	q8_0	MTP draft model KV cache also Q8
`--spec-type`	draft-mtp	Enable MTP draft model
`--spec-draft-n-max`	2	Predict up to 2 extra tokens per step
`--no-mmap`	—	Disable memory mapping for stability
`--mlock`	—	Lock memory to prevent swapping

💡 -fitt Tuning: If your dGPU handles display, 1536 may be too small. Start with 2048 or 2560 and decrease gradually.

Step 4: Understanding How MTP Works

What is MTP?

MTP (Multi-Token Prediction) is a special capability of the Qwen3.6 model. While normal models predict one token at a time, MTP models predict multiple tokens simultaneously per inference step.

Why Does It Speed Things Up?

llama.cpp uses the MTP draft model for speculative decoding:

MTP draft model rapidly predicts 2-3 tokens
Main model verifies these predictions
Correct predictions are accepted; incorrect ones are regenerated from scratch
With 80%+ draft acceptance rate, actual speed improves dramatically

`--spec-draft-n-max` Trade-offs

Value	Speed	Acceptance Rate	Notes
2	Fast	Higher	Recommended, best balance
3	Slightly faster	Lower	Acceptance drops, overall gains minimal

Step 5: Run Benchmarks

# Download benchmark script
curl -O https://gist.githubusercontent.com/am17an/228edfb84ed082aa88e3865d6fa27090/raw/7a2cee40ee1e2ca5365f4cef93632193d7ad852a/mtp-bench.py

# Run benchmark
python3 mtp-bench.py

Original Benchmark Results

Task	Pred	Draft	Acc	Rate	Speed (tok/s)
code_python	192	132	125	94.7%	80.8
code_cpp	58	40	37	92.5%	81.8
explain_concept	192	152	114	75.0%	70.0
summarize	53	40	32	80.0%	75.4
qa_factual	192	144	119	82.6%	77.8
translation	22	16	13	81.2%	81.9
creative_short	192	160	111	69.4%	69.2
stepwise_math	192	144	119	82.6%	76.5
long_code_review	192	148	117	79.0%	73.2

Average ~76 tok/s, code tasks peaking at 81.8 tok/s.

Windows Environment Test Results (from Reddit Comments)

The following data comes from community users who tested on Windows:

RTX 3060 12GB + R5 5600 + 32GB RAM (Windows)

Task	Accept Rate	Speed (tok/s)
code_python	78.4%	40.3
code_cpp	92.5%	49.3
explain_concept	78.4%	41.6
summarize	80.0%	44.4
qa_factual	82.6%	45.6
stepwise_math	88.4%	46.7
long_code_review	80.8%	43.5

Source: Reddit user ItsRektTime

RTX 3060 12GB + Ryzen 9 5950X + 40GB RAM (Windows 11 Pro)

Task	Accept Rate	Speed (tok/s)
code_python	88.5%	38.9
code_cpp	72.8%	35.0
explain_concept	67.7%	33.7
qa_factual	72.8%	35.2
stepwise_math	76.4%	35.8

Source: Reddit user RaspNAS (Q3_K_XL quantization)

RTX 5060 Ti 16GB (Windows 11)

Without MTP: ~55 tok/s
With MTP: ~66 tok/s
MTP improvement: +15%

Source: Reddit user the_masel

FAQ

Q: Can it run on 8GB VRAM?
A: Theoretically yes, but with aggressive CPU offloading. -fitt needs to increase to 3000+, and speed drops significantly. Reddit comments show users attempting this on RTX 3060 6GB laptops with further tuning needed.

Q: Windows support?
A: Yes. Use Vulkan or CUDA backend. Docker also works on Windows (requires WSL2 + Docker Desktop).

Q: Does MTP affect output quality?
A: No. MTP only affects generation speed, not model output quality. Draft model predictions are verified by the main model — wrong predictions are corrected.

Q: How’s the 128K context experience?
A: The original author set 128K context (-c 131072), but 32K is the sweet spot for cost-performance. Beyond 32K, some users report quality degradation (Qwen3 from 95% to 75%), depending on the task.

Q: Do I need CachyOS?
A: Not required, but highly recommended by the original author. CachyOS is a performance-optimized Arch Linux derivative with measurable speedups for local inference.

Q: How much faster than standard llama.cpp (no MTP)?
A: According to Reddit comments, MTP improves tok/s by approximately 15-30%. One user with RTX 5060 Ti 16GB reported: ~55 tok/s without MTP, ~66 tok/s with MTP.

Advanced Tips

1. Tuning `-fitt` for Different GPUs

# RTX 4070 Super (dGPU not displaying): 1536
-fitt 1536

# RTX 4070 Super (dGPU handles display): 2048-2560
-fitt 2048

# RTX 3060 12GB: 1736
-fitt 1736

# RTX 4060 Ti 8GB: 3000+
-fitt 3000

2. Reduce Context for Speed

If you don’t need 128K context, reducing -c frees more VRAM for inference:

# 32K context (recommended for daily use)
-c 32768

# 16K context (faster)
-c 16384

3. Use `--models-preset` to Simplify

Create a config file to avoid typing long commands:

# models/qwen35b-mtp.yaml
model: Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf
fit-target: 1536
ctx-size: 131072
flash-attn: on
cache-type-k: q8_0
cache-type-v: q8_0

⚠️ Note: Reddit comments report --models-preset mode causes 400 errors with mtp-bench.py. If you hit this, use CLI arguments directly.

Conclusion

You’ve learned the complete workflow for running Qwen3.6 35B A3B on a 12GB GPU:

✅ Downloaded llama.cpp with MTP support (merged into master 2026-05-19)
✅ Downloaded the MTP quantized model
✅ Understood the -fitt parameter’s critical role
✅ Launched llama-server and ran benchmarks
✅ Mastered -fitt tuning and context adjustment

MTP is a breakthrough for local inference — it enables 12GB GPUs to smoothly run 35B-parameter models at 80 tok/s with 128K context. For users without 24GB+ GPUs, this is currently the best bang-for-buck approach.

📖 Original Post: Reddit r/LocalLLaMA
📦 Model Download: HuggingFace
🔧 llama.cpp Releases: GitHub
📊 Benchmark: mtp-bench.py