A Miles launch script is plain bash — a sequence of XXX_ARGS=( ... ) arrays handed
to train.py or train_async.py. This page walks through each group and then covers
the execution modes you turn on beyond the default recipe.
scripts/run-glm4-9B.sh is the reference script; other recipes follow the same shape.
The eight argument groups
Every launch script assembles eight bash arrays, passes them as CLI flags, and hands off to train.py:
| Array | Governs |
|---|---|
| MODEL_ARGS | Architecture constants (layers, hidden size, rotary base, …) |
| CKPT_ARGS | Filesystem paths for the actor / reference / save directory |
| ROLLOUT_ARGS | Prompt dataset, batch knobs, sampling parameters, reward type |
| EVAL_ARGS | Eval dataset, cadence, sampling overrides for evaluation |
| PERF_ARGS | Parallelism (TP/PP/CP/EP/ETP), recomputation, dynamic batching |
| GRPO_ARGS | RL algorithm, KL, clipping, entropy bonus, advantage estimator |
| OPTIMIZER_ARGS | Learning rate, schedule, weight decay, Adam betas |
| SGLANG_ARGS | Engine TP, memory fraction, log level, --sglang-* passthrough |
MODEL_ARGS — architecture constants
Megatron needs the model architecture hardcoded at launch because it cannot introspect a HuggingFace checkpoint. Miles therefore sources a matching bash file from scripts/models/<family>.sh:
```bash
MODEL_ARGS=(--num-layers ... --hidden-size ... --rotary-base ...)
```
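A sketch of how a recipe typically consumes this file (the family file name here is an assumption for illustration):

```bash
# Sketch: the family file name is illustrative; the pattern is source-then-splice.
source scripts/models/glm4-9B.sh        # defines MODEL_ARGS=(...)

python train.py \
  "${MODEL_ARGS[@]}" \
  "${CKPT_ARGS[@]}"                     # ...plus the six remaining groups
```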
CKPT_ARGS — paths
The three roles — actor, frozen reference, HuggingFace directory — are defined in Core Concepts. Here they map to four flags. --load and --save usually point at the same directory so the run is
restart-idempotent: the trainer reads whatever was last written there. When --load is
empty or missing latest_checkpointed_iteration.txt, the actor is warm-started from
--ref-load instead.
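A sketch of how the four flags usually line up (the paths are placeholders, not a verified recipe):

```bash
# Sketch: placeholder paths.
CKPT_ARGS=(
   --load /ckpts/run-01                 # actor; read back on restart
   --save /ckpts/run-01                 # same directory as --load: restart-idempotent
   --ref-load /ckpts/base-megatron      # frozen reference; warm-start source when --load is empty
   --hf-checkpoint /ckpts/base-hf       # HuggingFace directory
)
```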
ROLLOUT_ARGS — where data comes from and how much flows
Every Miles iteration alternates between sampling (rollout) and consumption (training). Four knobs govern the balance.

Sampling side:
- --rollout-batch-size — prompts drawn per rollout.
- --n-samples-per-prompt — responses generated per prompt (used as the GRPO group).

Consumption side:
- --global-batch-size — samples used per optimizer step.
- --num-steps-per-rollout — optimizer steps per rollout. Leave at 1 for strict on-policy behavior; raise it for off-policy reuse of rollout batches.

On top of the balance, --num-rollout sets the total number of sample/train iterations.
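Put together, a sketch with illustrative numbers. Note the arithmetic: 32 prompts × 8 samples produce 256 samples per rollout, and 256 × 1 samples are consumed per rollout, so the two sides stay in step:

```bash
# Sketch: illustrative numbers, not tuned defaults.
ROLLOUT_ARGS=(
   --rollout-batch-size 32         # prompts per rollout
   --n-samples-per-prompt 8        # GRPO group size: 32 x 8 = 256 samples produced
   --global-batch-size 256         # samples per optimizer step
   --num-steps-per-rollout 1       # 256 x 1 = 256 samples consumed: strict on-policy
   --num-rollout 1000              # total sample/train iterations
)
```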
Optimizer step vs. weight sync.
--num-steps-per-rollout counts calls to optimizer.step(), not the weight
handshake between trainer and SGLang. The latter happens exactly once per rollout,
regardless of how many optimizer steps fired in between.
EVAL_ARGS — a strict subset of rollout
Evaluation reuses the rollout machinery but lets you override sampling behavior so that eval is deterministic and comparable across runs; every flag in this group mirrors a counterpart in ROLLOUT_ARGS.
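A sketch of the shape this group takes. The flag names below are assumptions modeled on the rollout group, not verified spellings:

```bash
# Sketch: flag names are assumptions, not verified against Miles.
EVAL_ARGS=(
   --eval-interval 20                    # hypothetical: eval cadence in rollouts
   --eval-prompt-data /data/eval.jsonl   # hypothetical: held-out eval set
   --eval-temperature 0                  # hypothetical: deterministic sampling override
)
```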
PERF_ARGS — parallelism and memory
This group controls how the model is sharded across GPUs and how much activation memory is recomputed vs. stored. Miles forwards Megatron’s parallelism flags untouched and adds two of its own.

- Always pair --tensor-model-parallel-size > 1 with --sequence-parallel. The sequence-parallel pass reclaims the activation memory TP leaves behind.
- --use-dynamic-batch-size overrides --micro-batch-size. When dynamic batching is active, Miles packs variable-length samples into the closest fit under --max-tokens-per-gpu. A sample whose length already exceeds the cap takes a whole micro-batch by itself — that batch still exceeds the cap and may OOM, so keep --rollout-max-response-len ≤ --max-tokens-per-gpu.
- Under context parallel, the budget is shared. A CP group of size N jointly processes up to N × max_tokens_per_gpu tokens per micro-batch. Size CP before tuning --max-tokens-per-gpu.
- Loss correctness is preserved. Miles packs with proper attention masks and per-sample / per-token loss reductions — dynamic batching never changes the gradient value, only the throughput.
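Assembled into a combination that respects the rules above (the sizes are illustrative):

```bash
# Sketch: an illustrative shard/memory layout.
PERF_ARGS=(
   --tensor-model-parallel-size 2
   --sequence-parallel             # always paired with TP > 1
   --use-dynamic-batch-size        # supersedes --micro-batch-size
   --max-tokens-per-gpu 16384      # keep --rollout-max-response-len at or below this
)
```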
GRPO_ARGS — the RL objective
GRPO_ARGS is the only group that carries RL semantics. The defaults encode vanilla
GRPO with a DAPO-style asymmetric clip.
- KL as a monitor vs. KL in the loss. --use-kl-loss always loads the reference model and computes the divergence — its weight in the loss is controlled separately by --kl-loss-coef. Setting the coefficient to 0.0 turns KL into a pure observability signal, which is often what you want for early experiments.
- --advantage-estimator covers more than GRPO. gspo, reinforce_plus_plus, reinforce_plus_plus_baseline, ppo, and on_policy_distillation are all drop-in replacements.
- Per-sample vs. per-token loss. The default reduction is per-sample mean: mean(sum(sample_i) / len(sample_i)). Add --calculate-per-token-loss to switch to sum(sum(sample_i)) / sum(len(sample_i)) — the correct choice for SFT-style loss or when you want length-proportional weighting.
- --use-tis is the numerical safety belt. Switch it on when rollout and trainer operate at different precisions or when you explicitly want off-policy reuse. See the R3 deep dive in Rollout Routing Replay (R3).
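A sketch of the defaults described above, with the optional switches left commented out (the values are illustrative):

```bash
# Sketch: vanilla GRPO with KL as a pure monitor; values are illustrative.
GRPO_ARGS=(
   --advantage-estimator grpo      # swap in gspo, ppo, reinforce_plus_plus, ...
   --use-kl-loss                   # loads the reference model and computes KL
   --kl-loss-coef 0.0              # 0.0: observability only, no loss contribution
   # --calculate-per-token-loss    # per-token reduction (length-proportional weighting)
   # --use-tis                     # mixed-precision / off-policy safety belt
)
```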
OPTIMIZER_ARGS — nothing surprising
Post-training is unusually sensitive to optimizer settings: the model is already in a good basin and large updates destabilize it. A sane starting point is lr = 1e-6 with a constant schedule. If the loss plateaus
early, investigate the reward signal before raising the learning rate — in most cases
the reward pipeline collapsed (same score for every sample) rather than the optimizer
stalling.
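A sketch matching that guidance. The lr and schedule follow the text above; the remaining values are common Megatron settings, not verified Miles defaults:

```bash
# Sketch: lr/schedule per the guidance above; other values are illustrative.
OPTIMIZER_ARGS=(
   --lr 1e-6
   --lr-decay-style constant
   --weight-decay 0.1
   --adam-beta1 0.9
   --adam-beta2 0.95
)
```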
SGLANG_ARGS — passthrough to the rollout engine
The only Miles-owned flag here is --rollout-num-gpus-per-engine, which corresponds
loosely to SGLang’s tp_size. Everything else prefixed with --sglang- is forwarded
verbatim.
| Flag | When to add it |
|---|---|
| --sglang-mem-fraction-static 0.7 | Colocated mode; Megatron needs headroom after init. |
| --sglang-context-length 32768 | Rollout max length exceeds the model’s config.json. |
| --sglang-enable-ep-moe | MoE models. |
| --sglang-enable-dp-attention | Long prompts on MoE. |
| --sglang-log-level INFO | Debugging. |
dp_size is derived from
rollout-num-gpus / rollout-num-gpus-per-engine.
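A sketch for an illustrative 8-GPU rollout fleet, so dp_size works out to 8 / 2 = 4:

```bash
# Sketch: illustrative values for an 8-GPU rollout fleet.
SGLANG_ARGS=(
   --rollout-num-gpus-per-engine 2    # roughly SGLang's tp_size
   --sglang-mem-fraction-static 0.7   # leave headroom for Megatron in colocated mode
   --sglang-log-level INFO            # verbose engine logs while debugging
)
```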
The eight argument groups describe what you’re training. The sections that follow describe how the training runs — the execution modes that flip Miles from its default one-rollout-then-one-train cadence into something more interesting.
Synchronous vs. asynchronous rollout
In the default cadence the trainer blocks on rollout: generate() returns, then
train_step() fires, then the next rollout kicks off. Every iteration’s wall-clock
time is the sum of the two.
Async rollout turns the cadence into two concurrent loops. A background worker keeps
--rollout-batch-size generations in flight at all times and pushes completed samples
into a queue; the trainer drains the queue, steps, and syncs weights. Per-iteration
wall time drops to roughly max(rollout_time, train_time).
Enable it in the launch script by handing the same argument arrays to train_async.py instead of train.py.
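A minimal sketch of the async invocation, assuming the standard eight groups and no extra async-only flags (this page does not enumerate any):

```bash
# Sketch: same argument groups, different entry point.
python train_async.py \
  "${MODEL_ARGS[@]}" "${CKPT_ARGS[@]}" "${ROLLOUT_ARGS[@]}" "${EVAL_ARGS[@]}" \
  "${PERF_ARGS[@]}" "${GRPO_ARGS[@]}" "${OPTIMIZER_ARGS[@]}" "${SGLANG_ARGS[@]}"
```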
| Mode | Per-iteration wall time | Throughput | When to use |
|---|---|---|---|
| Sync (default) | rollout + train (sum of both) | Lower overall | Strict on-policy, debugging |
| Async | ≈ max(rollout, train) | Up to 2× | Rollout-bound jobs, long runs |
Colocation: share GPUs or don’t
In the default disaggregated layout, training and inference claim separate GPUs through Ray. The simplest alternative is adding --colocate, which makes both phases share every GPU. --rollout-num-gpus is ignored under --colocate; the two phases always share the
entire allocation.
What --colocate flips on. Setting --colocate also enables (unless you’ve set them explicitly):
- --offload-train — train state offloads to CPU between phases
- --offload-rollout — rollout state offloads to CPU between phases
- --sglang-disable-piecewise-cuda-graph — avoids NVLS OOM in colocate mode
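Spelled out, a sketch of what the single flag implies (the offload and CUDA-graph flags are written explicitly here only for illustration):

```bash
# Sketch: --colocate alone is enough; the implied defaults are shown explicitly.
python train.py \
  --colocate \
  --offload-train \
  --offload-rollout \
  --sglang-disable-piecewise-cuda-graph \
  "${MODEL_ARGS[@]}" "${CKPT_ARGS[@]}"   # ...plus the remaining groups
```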
Dynamic sampling (DAPO-style filtering)
A common failure mode of GRPO is reward homogeneity: every trajectory in a group gets the same score, the advantage is zero, and the gradient goes flat. DAPO addresses this by oversampling and throwing away groups that lack reward variance. Miles exposes the same capability through two flags: an oversampling batch size and a pluggable filter. When --over-sampling-batch-size exceeds --rollout-batch-size, Miles draws the
larger batch, runs generation and reward scoring asynchronously, and applies the
filter function as results arrive. Groups that survive the filter enter the training
queue; groups that fail are discarded.
The shipped filter checks that reward standard deviation is strictly positive. Internally, Miles tracks remaining_batch_size — the count of usable groups still needed to
fill the rollout. Whenever the filter discards enough samples that the count drops
below --rollout-batch-size, Miles automatically kicks off another oversampling wave.
The mechanism is self-healing: a strict filter just means more oversampling rounds,
not a stuck trainer.
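A sketch of the flag wiring. The filter-path flag name below is an assumption, since this page does not spell it:

```bash
# Sketch: oversample 2x and let the filter discard flat-reward groups.
ROLLOUT_ARGS+=(
   --rollout-batch-size 32
   --over-sampling-batch-size 64          # draw 2x, filter down toward 32
   # --dynamic-sampling-filter-path ...   # hypothetical name for the pluggable filter
)
```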
Partial rollout: reclaim aborted work
Dynamic sampling implies that some in-flight generations will be abandoned. Without care, the compute invested in those half-finished trajectories is lost. --partial-rollout flips on a buffer that retains partial samples and resumes their
generation during the next rollout. The buffer dequeue policy is itself pluggable via
--buffer-filter-path; the default is a first-in-first-out pop. Each buffered sample keeps its sample.metadata (including the rollout ID that first launched it), which is usually enough to reason about staleness if you want a stricter eviction policy.
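A sketch of the flag wiring; the value format for the custom buffer filter is an assumption:

```bash
# Sketch: enable the resume buffer; the custom-filter line is hypothetical.
ROLLOUT_ARGS+=(
   --partial-rollout                              # retain and resume aborted generations
   # --buffer-filter-path my_filters.pop_oldest   # hypothetical stricter eviction policy
)
```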
BF16 training with FP8 inference
The simplest way to exploit FP8 on Hopper-class hardware is to leave the trainer in BF16 and serve FP8 weights to SGLang. Miles supports this path without any code changes — only checkpoint pointers differ. Download the FP8 weights alongside the BF16 originals, then flip the --hf-checkpoint pointer to the FP8 directory while leaving the Megatron side unchanged.
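A sketch with placeholder paths; relative to the BF16 run, only the HuggingFace pointer changes:

```bash
# Sketch: placeholder paths; the Megatron side stays BF16.
CKPT_ARGS=(
   --load /ckpts/run-01
   --save /ckpts/run-01
   --ref-load /ckpts/base-megatron      # BF16, unchanged
   --hf-checkpoint /ckpts/base-hf-fp8   # FP8 weights served to SGLang
)
```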
Next
- Configuration — the same material organized as a flag-by-flag reference.
- Server Arguments — the complete CLI surface.
- Customization — the twenty-plus Python extension points.
- Training Backends — Megatron vs FSDP and each one’s plumbing.

