
Speculative decoding accelerates rollout by letting a lightweight draft model generate a few tokens ahead and then verifying them with a single batched forward pass of the target model. When the draft is correct, the target emits N tokens for the cost of one forward pass.
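The draft-then-verify loop can be sketched with toy stand-ins for the two models (all names here are illustrative, not Miles or SGLang APIs). The draft proposes k tokens autoregressively; the target checks them in one pass, keeps the longest correct prefix, and always contributes one token of its own (a bonus token if everything matched, a correction otherwise):

```python
# Toy sketch of greedy speculative decoding. target_next is the ground-truth
# next-token rule; draft_next mimics it but is deliberately wrong at some
# positions, standing in for draft/target distribution mismatch.

def target_next(seq):
    return (seq[-1] + 1) % 10

def draft_next(seq):
    # Wrong whenever the current sequence length is a multiple of 4.
    return (seq[-1] + 1) % 10 if len(seq) % 4 else 0

def draft_propose(seq, k):
    out, proposal = list(seq), []
    for _ in range(k):
        tok = draft_next(out)
        proposal.append(tok)
        out.append(tok)
    return proposal

def speculative_step(seq, k):
    """One target 'forward': verify k draft tokens, emit accepted prefix + 1."""
    cur = list(seq)
    emitted = 0
    for tok in draft_propose(cur, k):
        expected = target_next(cur)
        cur.append(expected)          # the target's token is always kept
        emitted += 1
        if tok != expected:           # first mismatch ends the step
            break
    else:
        cur.append(target_next(cur))  # all drafts accepted: free bonus token
        emitted += 1
    return cur, emitted

print(speculative_step([5], k=3))           # all 3 drafts accepted: 4 tokens
print(speculative_step([5, 6, 7, 8], k=3))  # first draft wrong: 1 token
```

When the draft tracks the target well, each target forward yields k+1 tokens; when it diverges, the step degrades to ordinary one-token decoding plus wasted draft compute.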

Enabling speculative decoding

For models with built-in MTP (Multi-Token Prediction) layers (GLM-4.7, DeepSeek-V3, DeepSeek-R1):
SGLANG_ARGS+=(
   --sglang-speculative-algorithm EAGLE
   --sglang-speculative-num-steps 3
   --sglang-speculative-eagle-topk 1
   --sglang-speculative-num-draft-tokens 4
)
These are passthrough flags forwarded to SGLang. Miles auto-enables enable_draft_weights_cpu_backup so that SGLang can run training without the MTP weights resident on GPU (miles/backends/sglang_utils/sglang_engine.py).

For an externally trained draft model (for example, one trained with SpecForge):
SGLANG_ARGS+=(
   --sglang-speculative-draft-model-path /data/draft_model/
)
Full reference: SGLang speculative decoding docs.

Drift over a long RL run

As RL training progresses, the target model’s distribution shifts away from the draft. Fewer draft tokens pass verification, and over many steps speculative decoding can become a net negative because the wasted draft compute outweighs the verified speedup. Miles supports training the draft alongside the target through online MTP-SFT.
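To see why drift flips the sign, a back-of-the-envelope model helps (this is an illustrative cost model, not Miles's actual accounting). With a per-token acceptance rate alpha, k draft tokens per step, and a draft forward costing draft_cost target-forward equivalents, the expected tokens per target forward follow the standard geometric-acceptance formula:

```python
def expected_tokens(alpha, k):
    # Mean tokens emitted per target forward: geometric acceptance over the
    # k draft tokens, plus the target's own bonus/correction token.
    if alpha >= 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speedup(alpha, k, draft_cost):
    # Cost per step: one target forward plus k draft forwards.
    return expected_tokens(alpha, k) / (1 + k * draft_cost)

print(round(speedup(0.8, 3, 0.1), 2))  # healthy draft: clear win
print(round(speedup(0.2, 3, 0.1), 2))  # drifted draft: below 1x, net negative
```

Once acceptance drops far enough that the speedup falls below 1x, speculation is pure overhead, which is the regime online draft training is meant to avoid.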

Online SFT for MTP-style draft models

PERF_ARGS+=(
   --mtp-num-layers 1
   --enable-mtp-training
   --mtp-loss-scaling-factor 0.2
)
  • --mtp-num-layers: number of MTP layers in the checkpoint (1 matches the GLM/DeepSeek release defaults).
  • --enable-mtp-training: backprop through the MTP loss alongside the policy loss.
  • --mtp-loss-scaling-factor: weight of the MTP loss in the combined gradient (default 0.2).
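In other words, the scaling factor enters the combined objective as (a sketch with generic loss symbols, not Miles's internal notation):

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{policy}}
  + \lambda_{\text{MTP}} \, \mathcal{L}_{\text{MTP}},
\qquad \lambda_{\text{MTP}} = 0.2 \text{ by default}
```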
The checkpoint must contain MTP weights: pass --mtp-num-layers 1 when running convert_hf_to_torch_dist.py. Without it, the resulting torch_dist checkpoint will not contain the MTP layer to train.
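A sketch of that conversion step; only the script name and --mtp-num-layers come from the note above, and the remaining arguments (elided here) follow whatever your usual conversion invocation looks like:

```shell
# Sketch only: keep your usual convert_hf_to_torch_dist.py arguments,
# and add --mtp-num-layers so the MTP layer survives conversion.
python convert_hf_to_torch_dist.py \
    ... \
    --mtp-num-layers 1
```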

External draft model SFT

Training an external (non-MTP) draft model online is not yet supported in Miles. The current path is to retrain the external draft offline every N rollouts and reload it.

Pairs with

  • Unified FP8. Draft and target both quantized the same way.
  • INT4 QAT. A quantized draft is cheaper to verify.
  • R3. R3 captures routing for the verified tokens emitted by the target.

When to skip

  • Rollout-bound runs on dense models below ~13B, where the verification overhead can outweigh the benefit.
  • Already at high draft acceptance and the bottleneck is verification compute, not generation.

Reading