Miles is a high-performance, enterprise-ready reinforcement learning (RL) framework optimized for large-scale model post-training. It couples SGLang for high-throughput rollout with Megatron-LM for scalable training, and ships the precision, stability, and observability features needed to run RL at trillion-parameter scale.

“A journey of a thousand miles begins with a single rollout.” — Miles focuses on the low-level system optimizations that make large-scale RL stable, efficient, and reproducible.

Documentation Index
Fetch the complete documentation index at: https://www.radixark.com/llms.txt
Use this file to discover all available pages before exploring further.
Core features
- Fast and stable support for the latest models. Day-0 enablement of frontier releases such as DeepSeek-V4, with rapid follow-on support for new architectures including GLM-5, Qwen 3.6, and Nemotron-3-Super.
- Unified low-precision training. Customizable precision across the rollout and training engines, with unified BF16, FP8, MXFP8, and INT4 QAT recipes available now and an NVFP4 training recipe in progress.
- Efficient Rollout Routing Replay (R3). For MoE models, expert routing captured during inference is replayed during the trainer’s forward pass, eliminating the mismatch that destabilizes large-scale MoE RL. Optimized with a routing-result cache and overlapped device-to-host (D2H) copy to reduce overhead in both single-turn and multi-turn RL.
- Speculative rollout with online MTP-SFT. Miles keeps the draft model’s acceptance rate high through training by fine-tuning MTP layers on-policy.
- LoRA training and serving. Both SFT and RL recipes support LoRA adapters, and the same adapters load directly into SGLang for rollout — no separate merge or conversion step.
- Native agentic rollout. Tool use, multi-turn dialogue, search, code execution, and multi-agent co-evolution are all supported through clean Python extension points.
- Minimal core, maximal extension. Twenty-plus plug-points let you replace the rollout, reward, loss, or filter without forking the trainer.
- Broad hardware support. First-class on NVIDIA Hopper (H100, H200) and Blackwell (B100, B200, GB200, GB300), with AMD MI300X / MI325 / MI350 / MI355X also supported via ROCm.
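The R3 idea above can be sketched in a few lines: during rollout the hard top-k expert assignments are captured per token, and during the trainer's forward pass those assignments are replayed instead of being re-derived from the trainer's (slightly different) router logits. The code below is a minimal toy illustration, not Miles' actual implementation — the layer, function, and argument names are invented for this sketch.

```python
import torch

def moe_forward(x, router, experts, k=2, replay_indices=None):
    """Toy MoE layer illustrating Rollout Routing Replay (R3).

    If replay_indices is given (captured during rollout), reuse those
    hard expert assignments instead of re-deriving them from the
    trainer's router logits.
    """
    logits = router(x)                                 # [tokens, n_experts]
    if replay_indices is None:
        topk_idx = logits.topk(k, dim=-1).indices      # rollout-time routing
    else:
        topk_idx = replay_indices                      # replayed routing
    # Gate weights are still computed from the trainer's logits so
    # gradients flow through the router; only the hard assignment
    # is replayed, which removes the train/inference routing mismatch.
    gates = torch.softmax(logits.gather(-1, topk_idx), dim=-1)
    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = topk_idx[:, slot] == e
            if mask.any():
                out[mask] += gates[mask, slot, None] * expert(x[mask])
    return out, topk_idx
```

A rollout pass would return `topk_idx` alongside the tokens; the trainer then passes it back as `replay_indices` so both engines use identical expert assignments.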
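To make the "minimal core, maximal extension" claim concrete, here is one common way such plug-points are built: a small registry plus a decorator, so a user swaps in a custom reward (or rollout, loss, filter) without forking the trainer. This is a generic sketch under that assumption — the registry, decorator, and reward names below are hypothetical, not Miles' actual extension API.

```python
from typing import Callable, Dict

# Hypothetical plug-point registry; Miles' real API may differ.
_PLUGINS: Dict[str, Callable] = {}

def register(name: str):
    """Decorator that installs a custom component under a plug-point name."""
    def wrap(fn: Callable) -> Callable:
        _PLUGINS[name] = fn
        return fn
    return wrap

@register("reward")
def length_penalty_reward(prompt: str, completion: str) -> float:
    # Example custom reward: linearly discourage overly long completions.
    return 1.0 - min(len(completion) / 1024, 1.0)

def get(name: str) -> Callable:
    """Resolve a plug-point; the trainer would call this at setup time."""
    return _PLUGINS[name]
```

The trainer only ever calls `get("reward")`, so replacing the behavior is a matter of registering a different function before launch.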
Supported models
Each model name links to its recipe page.

Supported hardware
- NVIDIA: GB300, GB200, B200, B100, H200, H100, A100.
- AMD: MI300X, MI325, MI350, MI355X (via ROCm).
Latest updates
- [2026/02] Complete argument reference. CLI Reference
- [2026/01] INT4 W4A16 QAT. INT4 Quantization-Aware Training
- [2026/01] Unified VLM/LLM multi-turn rollout. Multi-Agent Co-Evolution
- [2025/12] Rollout Routing Replay (R3) for MoE. Rollout Routing Replay (R3)
- [2025/11] Unified FP8 pipeline generally available. FP8 and Low Precision
- [2025/11] Speculative decoding with online MTP-SFT. Speculative Decoding
Start here
- Installation — Docker, bare metal, AMD.
- Quick Start — a working training run in under an hour.
- Core concepts — the four objects in every Miles job.
- Training backend — Megatron-LM, parallelism, checkpoints, and hooks.
- Training script walkthrough — every argument group in a launch script, annotated.
Contribute
- GitHub: github.com/radixark/miles
- Slack: slack.sglang.ai, channel #miles
- Contributing: developer guide

