
Miles is configured through command-line flags passed to `train.py` or `train_async.py`. Megatron flags (such as `--num-layers`, `--rotary-base`, `--recompute-granularity`) are inherited via Megatron's argument parser; Miles adds its own flags through an `extra_args_provider`. Run `python3 train.py --help` against your installed Megatron source for the canonical list. This page makes two passes over the flags.
  1. Essentials lists the flags most runs actually touch.
  2. Complete reference lists every Miles flag with type and default.

Essentials

Cluster topology

| Flag | Default | What |
|---|---|---|
| `--actor-num-nodes` | 1 | Total nodes for the actor. |
| `--actor-num-gpus-per-node` | 8 | GPUs per actor node. |
| `--rollout-num-gpus` | derived | GPUs for SGLang rollout (ignored when `--colocate`). |
| `--rollout-num-gpus-per-engine` | 1 | TP size of each SGLang engine. |
| `--colocate` | off | Share GPUs between actor and rollout. |
See Training Script Walkthrough: Colocation for what --colocate flips on under the hood.
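Two arithmetic checks fall out of this table: the actor's world size is nodes × GPUs per node, and the rollout GPU pool must divide evenly into engines of TP size `--rollout-num-gpus-per-engine`. A sketch with illustrative values (Miles derives `--rollout-num-gpus` itself when the flag is unset):

```python
# Illustrative topology; not a Miles API, just the arithmetic the flags imply.
actor_num_nodes = 2
actor_num_gpus_per_node = 8
actor_world_size = actor_num_nodes * actor_num_gpus_per_node  # 16 GPUs for training

rollout_num_gpus = 8
rollout_num_gpus_per_engine = 2  # TP size of each SGLang engine
assert rollout_num_gpus % rollout_num_gpus_per_engine == 0
num_engines = rollout_num_gpus // rollout_num_gpus_per_engine  # 4 engines
print(actor_world_size, num_engines)
```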

Batch sizing

The four-knob invariant:
rollout_batch_size × n_samples_per_prompt
  = global_batch_size × num_steps_per_rollout
| Flag | Typical | What |
|---|---|---|
| `--rollout-batch-size` | 16 – 256 | Prompts per rollout. |
| `--n-samples-per-prompt` | 4 – 16 | Responses per prompt (GRPO group size). |
| `--global-batch-size` | derived | Samples per optimizer step. |
| `--num-steps-per-rollout` | 1 | Optimizer steps per rollout. |
| `--num-rollout` | 1000 – 10000 | Total rollout iterations. |
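The invariant means only three of the four knobs are free. A quick pre-launch sanity check, with illustrative values:

```python
# Derive global_batch_size from the other three knobs per the
# four-knob invariant; numbers are illustrative, not recommendations.
rollout_batch_size = 32
n_samples_per_prompt = 8
num_steps_per_rollout = 4

total_samples = rollout_batch_size * n_samples_per_prompt  # 256 samples per rollout
assert total_samples % num_steps_per_rollout == 0, "knobs must divide evenly"
global_batch_size = total_samples // num_steps_per_rollout
print(global_batch_size)  # 64
```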

Memory and throughput

| Flag | Default | What |
|---|---|---|
| `--use-dynamic-batch-size` | off | Pack varlen samples into micro-batches. |
| `--max-tokens-per-gpu` | | Token budget per micro-batch per GPU. Required when dynamic batching is on. |
| `--context-parallel-size` | 1 | Spread a single sample across N CP ranks. |
| `--recompute-granularity` | Megatron default | `full` or `selective`. |
| `--recompute-method` | Megatron default | `uniform` or `block`. |
| `--recompute-num-layers` | Megatron default | Layers per recompute chunk. |
Rule of thumb: start with `max_tokens_per_gpu = rollout_max_response_len / cp_size`, then push up until you OOM.
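The rule of thumb as arithmetic, with illustrative values:

```python
# Starting point for --max-tokens-per-gpu; raise it until you hit OOM.
rollout_max_response_len = 32768
cp_size = 4  # --context-parallel-size

max_tokens_per_gpu = rollout_max_response_len // cp_size
print(max_tokens_per_gpu)  # 8192
```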

RL algorithm

| Flag | Default | What |
|---|---|---|
| `--advantage-estimator` | grpo | `grpo`, `gspo`, `ppo`, `reinforce_plus_plus`, `reinforce_plus_plus_baseline`, `on_policy_distillation`. |
| `--use-kl-loss` | off | Compute KL against the reference model. |
| `--kl-loss-coef` | 0.0 | Weight of KL in the loss (0 means monitor only). |
| `--kl-loss-type` | k1 | `k1`, `k2`, `k3`, `low_var_kl`. |
| `--entropy-coef` | 0.0 | Entropy bonus weight. |
| `--eps-clip` | 0.2 | PPO/GRPO low clip. |
| `--eps-clip-high` | | Asymmetric high clip (DAPO-style). |
| `--use-tis` | off | Truncated Importance Sampling for train/inference precision mismatch. |

Sampling

| Flag | Default | What |
|---|---|---|
| `--rollout-temperature` | 1.0 | Sampling temperature. |
| `--rollout-top-p` | 1.0 | Top-p truncation. |
| `--rollout-max-response-len` | | Max tokens per response. |
| `--rollout-stop-token-ids` | model default | Stop token IDs. Override when generations don't stop. |
| `--apply-chat-template` | off | Apply the tokenizer's chat template. |
| `--rollout-shuffle` | off | Shuffle prompts each rollout. |

Optimizer

| Flag | Default | What |
|---|---|---|
| `--optimizer` | adam | `adam`, `sgd`. |
| `--lr` | 1e-6 | Learning rate. Post-training is sensitive to large updates; recipes typically stay near 1e-6. |
| `--lr-decay-style` | constant | `constant`, `linear`, `cosine`. |
| `--weight-decay` | 0.1 | L2 weight decay. |
| `--adam-beta1`, `--adam-beta2` | 0.9, 0.98 | Adam moments. |

Logging

| Flag | Default | What |
|---|---|---|
| `--use-wandb` | off | Log to Weights and Biases. |
| `--wandb-project` | | Wandb project name. |
| `--log-interval` | 1 | Stdout log cadence (rollouts). |
| `--save-interval` | | Checkpoint cadence (rollouts). Recipes typically set 20 to 100. |

SGLang passthrough

Any flag accepted by `python -m sglang.launch_server` is accepted by Miles with the `--sglang-` prefix:
--sglang-log-level INFO
--sglang-mem-fraction-static 0.8
--sglang-enable-overlap-schedule
--sglang-enable-ep-moe
--sglang-enable-dp-attention
See SGLang docs for the full list.
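The convention is mechanical: strip the `--sglang-` prefix and forward the remainder to the server. A hedged sketch of that mapping (the real plumbing is internal to Miles):

```python
def forward_sglang_flag(flag: str) -> str:
    """Map a Miles --sglang-* flag to the flag sglang.launch_server sees."""
    prefix = "--sglang-"
    assert flag.startswith(prefix), f"not a passthrough flag: {flag}"
    return "--" + flag[len(prefix):]

print(forward_sglang_flag("--sglang-mem-fraction-static"))  # --mem-fraction-static
```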

Environment variables

Set these in Ray’s env_vars for multi-node runs:
| Variable | Effect |
|---|---|
| `TORCHINDUCTOR_FORCE_DISABLE_CACHES=1` | Workaround for torch-compile JSONDecodeError. |
| `RAY_DEDUP_LOGS=0` | Don't deduplicate worker logs. |
| `NCCL_DEBUG=INFO` | NCCL diagnostics. |
| `PYTHONPATH=/root/Megatron-LM` | Required when using the Megatron backend. |
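In a launch script these typically travel in the `runtime_env` handed to `ray.init`; a sketch (the `ray.init` call itself is left as a comment so the snippet stands alone):

```python
# Package the environment variables the way Ray's runtime_env expects.
runtime_env = {
    "env_vars": {
        "TORCHINDUCTOR_FORCE_DISABLE_CACHES": "1",  # torch-compile cache workaround
        "RAY_DEDUP_LOGS": "0",                      # keep every worker's logs
        "NCCL_DEBUG": "INFO",                       # NCCL diagnostics
        "PYTHONPATH": "/root/Megatron-LM",          # Megatron backend import path
    }
}
# import ray; ray.init(runtime_env=runtime_env)
print(sorted(runtime_env["env_vars"]))
```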

Complete reference

Sections mirror the launch-script argument groups.

Cluster

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--actor-num-nodes` | int | 1 | Total nodes for actor training. |
| `--actor-num-gpus-per-node` | int | 8 | GPUs per actor node. |
| `--rollout-num-gpus` | int | derived | Ignored under `--colocate`. |
| `--rollout-num-gpus-per-engine` | int | 1 | TP size of each SGLang engine. |
| `--colocate` | flag | off | Share GPUs between actor and rollout. Implicitly enables `--offload-train`, `--offload-rollout`, and `--sglang-disable-piecewise-cuda-graph`. |

Model and checkpoints

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--train-backend` | enum | megatron | `megatron` or `fsdp`. |
| `--hf-checkpoint` | path | | HF model dir. Provides tokenizer, config, and the weights FSDP loads. |
| `--ref-load` | path | | Reference model in torch_dist format (Megatron). |
| `--load` | path | | Actor checkpoint to resume from. |
| `--save` | path | | Actor checkpoint write directory. |
| `--save-interval` | int | | Rollouts between saves. |
| `--model-name` | str | | Set in multi-node runs to avoid a transformers file-system race. |
| `--spec` | `<module> <fn>` | | Plugin spec for custom architectures (e.g. `miles_plugins.models.qwen3_5 get_qwen3_5_spec`). |
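A `<module> <fn>` pair like the one `--spec` takes is normally resolved with `importlib`. A hypothetical resolver (not Miles's actual loader), demonstrated with a stdlib pair since `miles_plugins` is not importable here:

```python
import importlib

def resolve_spec(module_name: str, fn_name: str):
    """Resolve a '<module> <fn>' pair to a callable, importlib-style."""
    module = importlib.import_module(module_name)
    return getattr(module, fn_name)

# Stdlib stand-in for something like: resolve_spec("miles_plugins.models.qwen3_5", "get_qwen3_5_spec")
fn = resolve_spec("math", "sqrt")
print(fn(9.0))  # 3.0
```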

Rollout: data and batching

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--prompt-data` | str | | Path to a single JSONL file. |
| `--input-key` | str | prompt | JSONL key to `Sample.prompt`. |
| `--label-key` | str | label | JSONL key to `Sample.label`. |
| `--metadata-key` | str | metadata | JSONL key to `Sample.metadata`. |
| `--apply-chat-template` | flag | off | Apply tokenizer chat template. |
| `--rollout-shuffle` | flag | off | Shuffle prompts each rollout. |
| `--num-rollout` | int | | Total rollout iterations. If unset, derived from dataset size. |
| `--rollout-batch-size` | int | | Prompts per rollout. |
| `--n-samples-per-prompt` | int | 1 | Responses per prompt. |
| `--global-batch-size` | int | derived | Samples per optimizer step. |
| `--num-steps-per-rollout` | int | 1 | Optimizer steps per rollout. |
| `--over-sampling-batch-size` | int | | Oversample size for dynamic sampling (DAPO). |
| `--balance-data` | flag | off | Balance per-rank token count. |
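The key flags above map JSONL fields onto `Sample` attributes. A sketch of one dataset line under the default keys (field values are illustrative):

```python
import json

# One line of a --prompt-data JSONL file, using the default key names.
line = json.dumps({
    "prompt": "What is 2 + 2?",     # picked up via --input-key (default: prompt)
    "label": "4",                    # picked up via --label-key (default: label)
    "metadata": {"source": "toy"},   # picked up via --metadata-key (default: metadata)
})

record = json.loads(line)
print(record["prompt"], record["label"])
```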

Rollout: sampling

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--rollout-max-response-len` | int | | Max tokens per response. |
| `--rollout-temperature` | float | 1.0 | Sampling temperature. |
| `--rollout-top-p` | float | 1.0 | Top-p truncation. |
| `--rollout-top-k` | int | -1 | Top-k truncation (-1 disables). |
| `--rollout-stop` | str+ | | Stop strings. |
| `--rollout-stop-token-ids` | int+ | | Stop token IDs. |

Eval

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--eval-prompt-data` | str+ | | One or more `name path` pairs. |
| `--eval-interval` | int | | Rollouts between eval runs. |
| `--n-samples-per-eval-prompt` | int | 1 | Responses per eval prompt. |
| `--eval-max-response-len` | int | | Max eval response length. Inherits from rollout if unset. |
| `--eval-temperature` | float | | Eval temperature. Inherits from rollout if unset. |
| `--eval-top-p` | float | | Eval top-p. Inherits from rollout if unset. |

Performance

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--tensor-model-parallel-size` | int | 1 | TP. |
| `--pipeline-model-parallel-size` | int | 1 | PP. |
| `--context-parallel-size` | int | 1 | CP. |
| `--expert-model-parallel-size` | int | 1 | EP (MoE). |
| `--expert-tensor-parallel-size` | int | 1 | TP within experts. |
| `--sequence-parallel` | flag | off | Enable Megatron sequence parallelism. |
| `--use-dynamic-batch-size` | flag | off | Pack varlen samples. Recommended for varlen workloads. |
| `--max-tokens-per-gpu` | int | | Token budget per micro-batch per GPU. Required when dynamic batching is on. |
| `--micro-batch-size` | int | 1 | Ignored when dynamic batching is on. |
| `--recompute-granularity` | enum | Megatron default | `full` or `selective`. |
| `--recompute-method` | enum | Megatron default | `uniform` or `block`. |
| `--recompute-num-layers` | int | Megatron default | Recompute chunk size. |
| `--gradient-checkpointing` | flag | off | FSDP equivalent of the recompute flags. |
| `--fsdp-cpu-offload` | flag | off | FSDP: offload params, grads, optimizer state to CPU. |
| `--fsdp-cpu-backend` | str | gloo | FSDP: CPU backend for hybrid offload. |
| `--attn-implementation` | enum | flash_attention_2 | FSDP only: `flash_attention_2`, `sdpa`, `eager`. |
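The parallelism flags multiply together, and whatever is left of the world size becomes data parallelism. A back-of-envelope check under the usual Megatron factorization world = TP × PP × CP × DP (an assumption of this sketch, not a Miles API):

```python
def data_parallel_size(world_size: int, tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """Data-parallel size under the world = TP * PP * CP * DP factorization."""
    model_parallel = tp * pp * cp
    assert world_size % model_parallel == 0, "parallel sizes must divide world size"
    return world_size // model_parallel

# 64 GPUs, TP=4, PP=2, CP=2 leaves DP=4.
print(data_parallel_size(64, tp=4, pp=2, cp=2))  # 4
```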

RL algorithm

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--advantage-estimator` | enum | grpo | `grpo`, `gspo`, `ppo`, `reinforce_plus_plus`, `reinforce_plus_plus_baseline`, `on_policy_distillation`. |
| `--use-kl-loss` | flag | off | Compute KL vs. reference. |
| `--kl-loss-coef` | float | 0.0 | KL weight in loss (0 means monitor only). |
| `--kl-loss-type` | enum | k1 | `k1`, `k2`, `k3`, `low_var_kl`. |
| `--entropy-coef` | float | 0.0 | Entropy bonus weight. |
| `--eps-clip` | float | 0.2 | PPO/GRPO low clip. |
| `--eps-clip-high` | float | | Asymmetric high clip. |
| `--use-tis` | flag | off | Truncated Importance Sampling. |
| `--use-routing-replay` | flag | off | Forward/backward routing consistency. |
| `--use-rollout-routing-replay` | flag | off | R3: capture inference-side expert routing and replay it during training. |
| `--calculate-per-token-loss` | flag | off | Per-token loss reduction. |
| `--no-check-for-nan-in-loss-and-grad` | flag | off | Skip NaN/Inf guard (Megatron flag, debug only). |
| `--true-on-policy-mode` | flag | off | Strict on-policy: reject samples from a prior policy. |

Optimizer

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--optimizer` | enum | adam | `adam`, `sgd`. |
| `--lr` | float | 1e-6 | Learning rate. |
| `--lr-decay-style` | enum | constant | `constant`, `linear`, `cosine`. |
| `--lr-warmup-iters` | int | 0 | Warmup steps (Megatron flag). |
| `--min-lr` | float | 0 | Lower LR bound for decay schedules (Megatron flag). |
| `--weight-decay` | float | 0.1 | L2 weight decay. |
| `--adam-beta1` | float | 0.9 | First Adam moment. |
| `--adam-beta2` | float | 0.98 | Second Adam moment. |
| `--clip-grad` | float | 1.0 | Gradient clipping (Megatron flag). |
| `--optimizer-cpu-offload` | flag | off | Megatron CPU Adam (Megatron flag). |
| `--overlap-cpu-optimizer-d2h-h2d` | flag | off | Overlap D2H/H2D with compute (Megatron flag). |
| `--use-precision-aware-optimizer` | flag | off | Precision-aware optimizer path (Megatron flag). |

Reward and filters

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--rm-type` | enum | | Built-in reward: `math`, `dapo`, `deepscaler`, `f1`, `gpqa`, `ifbench`, `remote_rm`, `random`. |
| `--rm-url` | str | | Endpoint when `--rm-type remote_rm`. |
| `--group-rm` | flag | off | Batched reward computation. |
| `--custom-rm-path` | str | | Custom reward function (see Customization). |
| `--dynamic-sampling-filter-path` | str | | Group filter (DAPO-style). |
| `--buffer-filter-path` | str | | Buffer dequeue filter. |
| `--rollout-sample-filter-path` | str | | Per-sample filter. |

SGLang and router

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--sglang-router-ip` | str | | External router IP. Miles starts its own router if unset. |
| `--sglang-router-port` | int | | External router port. |
| `--sglang-*` | passthrough | | Any flag accepted by `python -m sglang.launch_server` works with this prefix. |
| `--router-*` | passthrough | | Any flag accepted by the active router works with this prefix. |

Common `--sglang-*` flags:
--sglang-mem-fraction-static 0.8
--sglang-context-length 32768
--sglang-log-level INFO
--sglang-enable-ep-moe
--sglang-enable-dp-attention
--sglang-enable-deepep
--sglang-enable-overlap-schedule
--sglang-enforce-piecewise-cuda-graph     # off by default in colocate mode

MTP / speculative decoding

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--mtp-num-layers` | int | 0 | Number of MTP layers in the checkpoint. |
| `--enable-mtp-training` | flag | off | Train MTP alongside the policy. |
| `--mtp-loss-scaling-factor` | float | 0.2 | Weight of MTP loss. |

Fault tolerance

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--use-fault-tolerance` | flag | off | Enable rank-level recovery and heartbeats. |
| `--rollout-health-check-first-wait` | int | 0 | Grace period before heartbeats start. |
| `--rollout-health-check-interval` | int | 30 | Seconds between heartbeats. |
| `--rollout-health-check-timeout` | int | 30 | Heartbeat timeout. |

Async / partial rollout

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--partial-rollout` | flag | off | Resume aborted rollouts in the next iteration. |

Logging

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--use-wandb` | flag | off | Enable wandb. |
| `--wandb-project` | str | | Project name. |
| `--wandb-group` | str | | Group name. |
| `--log-interval` | int | 1 | Stdout log cadence (rollouts). |
| `--custom-rollout-log-function-path` | str | | Custom train logger. |
| `--custom-eval-rollout-log-function-path` | str | | Custom eval logger. |

Profiling

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--profile-target` | enum+ | [train_overall] | Which sub-loop to profile: `train_overall`, `train_actor`, `train_log_probs`. |
| `--use-pytorch-profiler` | flag | off | FSDP only: enable PyTorch profiler. |
| `--profile-step-start` | int | 10 | FSDP only: first step to profile. |
| `--profile-step-end` | int | 12 | FSDP only: last step to profile. |
| `--memory-snapshot-path` | str | snapshot.pickle | FSDP only: memory snapshot output. |
| `--tensorboard-dir` | str | | FSDP only: TensorBoard output dir. |

Debugging

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--debug-rollout-only` | flag | off | Skip Megatron, only spin up SGLang. |
| `--debug-train-only` | flag | off | Skip SGLang, only spin up Megatron. |
| `--save-debug-rollout-data` | path | | Pickle every rollout to disk. |
| `--load-debug-rollout-data` | path | | Replay rollouts from disk (implies `--debug-train-only`). |
| `--deterministic-mode` | flag | off | Megatron deterministic mode. |

Customization

See Customization for the full catalog of --*-path flags that replace or extend Miles’s behavior.