Speculative decoding accelerates rollout by letting a lightweight draft model generate a few tokens ahead and then verifying them with a single batched forward pass of the target model. When the draft is correct, the target produces N tokens for the cost of one forward pass.
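The draft-then-verify loop can be illustrated with a toy sketch. The two "models" here are trivial stand-in functions, not Miles or SGLang APIs, and the per-token verification loop stands in for what is really a single batched target forward:

```python
def draft_next(ctx):
    # Toy draft model: a cheap guess at the next token (hypothetical stand-in).
    return (ctx[-1] + 1) % 10

def target_next(ctx):
    # Toy target model: agrees with the draft except right after token 3.
    return 7 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

def speculative_step(ctx, k=4):
    # Draft k tokens ahead, then verify them against the target.
    # Emit every accepted draft token plus the target's token at the first
    # mismatch, so the output always matches target-only greedy decoding.
    c = list(ctx)
    drafts = []
    for _ in range(k):
        drafts.append(draft_next(c))
        c.append(drafts[-1])
    out, c = [], list(ctx)
    for t in drafts:
        expect = target_next(c)   # in practice: one batched target forward
        out.append(expect)
        c.append(expect)
        if t != expect:           # rejection: discard the remaining drafts
            break
    return out

def decode(ctx, n, k=4):
    # Greedy decoding via repeated speculative steps.
    ctx = list(ctx)
    while len(ctx) < n:
        ctx += speculative_step(ctx, k)
    return ctx[:n]
```

When the draft agrees with the target, a whole block of k tokens is verified per target pass; every mismatch costs the remaining drafted tokens, which is what makes the acceptance rate the key metric below.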
Enabling speculative decoding
For models with built-in MTP (Multi-Token Prediction) layers (GLM-4.7, DeepSeek-V3, DeepSeek-R1): set enable_draft_weights_cpu_backup so SGLang can run training without keeping the MTP weights resident on GPU (see miles/backends/sglang_utils/sglang_engine.py).
For an externally trained draft model (for example, one trained with SpecForge): online training is not yet supported; the current path is offline retraining, described under External draft model SFT below.
Drift over a long RL run
As RL training progresses, the target model’s distribution shifts away from the draft’s. Fewer draft tokens pass verification, and over many steps speculative decoding can become a net negative: the wasted draft compute outweighs the verified speedup. Miles supports training the draft alongside the target through online MTP-SFT.
Online SFT for MTP-style draft models
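A back-of-envelope model makes the drift argument concrete. Assume each draft token is accepted independently with probability alpha and each draft forward costs draft_cost target-forwards; both the numbers and the cost model are illustrative assumptions, not Miles measurements:

```python
def expected_tokens(alpha, k):
    # Expected tokens emitted per target verification pass with draft
    # length k: the accepted prefix plus one corrected/bonus token, i.e.
    # sum_{i=0..k} alpha**i = (1 - alpha**(k+1)) / (1 - alpha).
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speedup(alpha, k, draft_cost=0.05):
    # Tokens per unit of target-forward compute, versus 1.0 for plain
    # autoregressive decoding. Below 1.0, speculation is a net negative.
    return expected_tokens(alpha, k) / (1 + k * draft_cost)
```

For example, speedup(0.8, 4) is comfortably above 1, while speedup(0.1, 4, draft_cost=0.3) falls below 1, which is the regime a drifting, unrefreshed draft slides toward.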
| Flag | Notes |
|---|---|
| --mtp-num-layers | Number of MTP layers in the checkpoint (1 matches the GLM/DeepSeek release defaults). |
| --enable-mtp-training | Backprop through the MTP loss alongside the policy loss. |
| --mtp-loss-scaling-factor | Weight of the MTP loss in the combined gradient (default 0.2). |
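Taken together, the flags describe a combined objective along these lines. This is a minimal sketch of the weighting only; the actual training loop in Miles also handles masking, normalization, and backprop plumbing:

```python
def combined_loss(policy_loss, mtp_loss, mtp_loss_scaling_factor=0.2):
    # With --enable-mtp-training set, the MTP loss is added to the policy
    # loss, weighted by --mtp-loss-scaling-factor (default 0.2).
    return policy_loss + mtp_loss_scaling_factor * mtp_loss
```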
The checkpoint must contain MTP weights: pass --mtp-num-layers 1 when running convert_hf_to_torch_dist.py. Without it, the resulting torch_dist checkpoint will not contain the MTP layer to train.
External draft model SFT
Training an external (non-MTP) draft model online is not yet supported in Miles. The current path is to retrain the external draft offline every N rollouts and reload it.
Pairs with
- Unified FP8. Draft and target both quantized the same way.
- INT4 QAT. A quantized draft is cheaper to verify.
- R3. Captures routing for the verified tokens emitted by the target.
When to skip
- Rollout-bound on dense models below ~13B. The verification overhead can outweigh the benefit.
- Already at high draft acceptance and the bottleneck is verification compute, not generation.
Reading
- SpecForge: SGLang docs.

