
Miles is configured through command-line flags passed to `train.py` or `train_async.py`. Megatron flags (such as `--num-layers`, `--rotary-base`, `--recompute-granularity`) are inherited via Megatron's argument parser; Miles adds its own flags through an `extra_args_provider`. Run `python3 train.py --help` against your installed Megatron source for the canonical list. This page makes two passes over the flags.
  1. Essentials lists the flags most runs actually touch.
  2. Complete reference lists every Miles flag with type and default.

Essentials

Cluster topology

| Flag | Default | What |
|---|---|---|
| `--actor-num-nodes` | 1 | Total nodes for the actor. |
| `--actor-num-gpus-per-node` | 8 | GPUs per actor node. |
| `--rollout-num-gpus` | derived | GPUs for SGLang rollout (ignored when `--colocate`). |
| `--rollout-num-gpus-per-engine` | 1 | TP size of each SGLang engine. |
| `--colocate` | off | Share GPUs between actor and rollout. |
See Training Script Walkthrough: Colocation for what --colocate flips on under the hood.
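Two arithmetic checks fall out of this table: the actor's world size is nodes × GPUs per node, and the rollout GPU pool must divide evenly into engines of TP size `--rollout-num-gpus-per-engine`. A sketch with illustrative values (Miles derives `--rollout-num-gpus` itself when the flag is unset):

```python
# Illustrative topology; not a Miles API, just the arithmetic the flags imply.
actor_num_nodes = 2
actor_num_gpus_per_node = 8
actor_world_size = actor_num_nodes * actor_num_gpus_per_node  # 16 GPUs for training

rollout_num_gpus = 8
rollout_num_gpus_per_engine = 2  # TP size of each SGLang engine
assert rollout_num_gpus % rollout_num_gpus_per_engine == 0
num_engines = rollout_num_gpus // rollout_num_gpus_per_engine  # 4 engines
print(actor_world_size, num_engines)
```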

Batch sizing

The four-knob invariant:
rollout_batch_size × n_samples_per_prompt
  = global_batch_size × num_steps_per_rollout
| Flag | Typical | What |
|---|---|---|
| `--rollout-batch-size` | 16 – 256 | Prompts per rollout. |
| `--n-samples-per-prompt` | 4 – 16 | Responses per prompt (GRPO group size). |
| `--global-batch-size` | derived | Samples per optimizer step. |
| `--num-steps-per-rollout` | 1 | Optimizer steps per rollout. |
| `--num-rollout` | 1000 – 10000 | Total rollout iterations. |
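The invariant means only three of the four knobs are free. A quick pre-launch sanity check, with illustrative values:

```python
# Derive global_batch_size from the other three knobs per the
# four-knob invariant; numbers are illustrative, not recommendations.
rollout_batch_size = 32
n_samples_per_prompt = 8
num_steps_per_rollout = 4

total_samples = rollout_batch_size * n_samples_per_prompt  # 256 samples per rollout
assert total_samples % num_steps_per_rollout == 0, "knobs must divide evenly"
global_batch_size = total_samples // num_steps_per_rollout
print(global_batch_size)  # 64
```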

Memory and throughput

| Flag | Default | What |
|---|---|---|
| `--use-dynamic-batch-size` | off | Pack varlen samples into micro-batches. |
| `--max-tokens-per-gpu` | | Token budget per micro-batch per GPU. Required when dynamic batching is on. |
| `--context-parallel-size` | 1 | Spread a single sample across N CP ranks. |
| `--recompute-granularity` | Megatron default | `full` or `selective`. |
| `--recompute-method` | Megatron default | `uniform` or `block`. |
| `--recompute-num-layers` | Megatron default | Layers per recompute chunk. |
Rule of thumb: start with `max_tokens_per_gpu = rollout_max_response_len / cp_size`, then push up until you OOM.
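The rule of thumb as arithmetic, with illustrative values:

```python
# Starting point for --max-tokens-per-gpu; raise it until you hit OOM.
rollout_max_response_len = 32768
cp_size = 4  # --context-parallel-size

max_tokens_per_gpu = rollout_max_response_len // cp_size
print(max_tokens_per_gpu)  # 8192
```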

RL algorithm

| Flag | Default | What |
|---|---|---|
| `--advantage-estimator` | grpo | `grpo`, `gspo`, `ppo`, `reinforce_plus_plus`, `reinforce_plus_plus_baseline`, `on_policy_distillation`. |
| `--use-kl-loss` | off | Compute KL against the reference model. |
| `--kl-loss-coef` | 0.0 | Weight of KL in the loss (0 means monitor only). |
| `--kl-loss-type` | k1 | `k1`, `k2`, `k3`, `low_var_kl`. |
| `--entropy-coef` | 0.0 | Entropy bonus weight. |
| `--eps-clip` | 0.2 | PPO/GRPO low clip. |
| `--eps-clip-high` | | Asymmetric high clip (DAPO-style). |
| `--use-tis` | off | Truncated Importance Sampling for train/inference precision mismatch. |

Sampling

| Flag | Default | What |
|---|---|---|
| `--rollout-temperature` | 1.0 | Sampling temperature. |
| `--rollout-top-p` | 1.0 | Top-p truncation. |
| `--rollout-max-response-len` | | Max tokens per response. |
| `--rollout-stop-token-ids` | model default | Stop token IDs. Override when generations don't stop. |
| `--apply-chat-template` | off | Apply the tokenizer's chat template. |
| `--rollout-shuffle` | off | Shuffle prompts each rollout. |

Optimizer

| Flag | Default | What |
|---|---|---|
| `--optimizer` | adam | `adam`, `sgd`. |
| `--lr` | 1e-6 | Learning rate. Post-training is sensitive to large updates; recipes typically stay near 1e-6. |
| `--lr-decay-style` | constant | `constant`, `linear`, `cosine`. |
| `--weight-decay` | 0.1 | L2 weight decay. |
| `--adam-beta1`, `--adam-beta2` | 0.9, 0.98 | Adam moments. |

Logging

| Flag | Default | What |
|---|---|---|
| `--use-wandb` | off | Log to Weights and Biases. |
| `--wandb-project` | | Wandb project name. |
| `--log-interval` | 1 | Stdout log cadence (rollouts). |
| `--save-interval` | | Checkpoint cadence (rollouts). Recipes typically set 20 to 100. |

SGLang passthrough

Any flag accepted by `python -m sglang.launch_server` is accepted by Miles with the `--sglang-` prefix:
--sglang-log-level INFO
--sglang-mem-fraction-static 0.8
--sglang-enable-overlap-schedule
--sglang-enable-ep-moe
--sglang-enable-dp-attention
See SGLang docs for the full list.
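The convention is mechanical: strip the `--sglang-` prefix and forward the remainder to the server. A hedged sketch of that mapping (the real plumbing is internal to Miles):

```python
def forward_sglang_flag(flag: str) -> str:
    """Map a Miles --sglang-* flag to the flag sglang.launch_server sees."""
    prefix = "--sglang-"
    assert flag.startswith(prefix), f"not a passthrough flag: {flag}"
    return "--" + flag[len(prefix):]

print(forward_sglang_flag("--sglang-mem-fraction-static"))  # --mem-fraction-static
```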

Environment variables

Set these in Ray’s env_vars for multi-node runs:
| Variable | Effect |
|---|---|
| `TORCHINDUCTOR_FORCE_DISABLE_CACHES=1` | Workaround for torch-compile JSONDecodeError. |
| `RAY_DEDUP_LOGS=0` | Don't deduplicate worker logs. |
| `NCCL_DEBUG=INFO` | NCCL diagnostics. |
| `PYTHONPATH=/root/Megatron-LM` | Required when using the Megatron backend. |
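In a launch script these typically travel in the `runtime_env` handed to `ray.init`; a sketch (the `ray.init` call itself is left as a comment so the snippet stands alone):

```python
# Package the environment variables the way Ray's runtime_env expects.
runtime_env = {
    "env_vars": {
        "TORCHINDUCTOR_FORCE_DISABLE_CACHES": "1",  # torch-compile cache workaround
        "RAY_DEDUP_LOGS": "0",                      # keep every worker's logs
        "NCCL_DEBUG": "INFO",                       # NCCL diagnostics
        "PYTHONPATH": "/root/Megatron-LM",          # Megatron backend import path
    }
}
# import ray; ray.init(runtime_env=runtime_env)
print(sorted(runtime_env["env_vars"]))
```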

Complete reference

Sections mirror the launch-script argument groups.

Cluster

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--actor-num-nodes` | int | 1 | Total nodes for actor training. |
| `--actor-num-gpus-per-node` | int | 8 | GPUs per actor node. |
| `--rollout-num-gpus` | int | derived | Ignored under `--colocate`. |
| `--rollout-num-gpus-per-engine` | int | 1 | TP size of each SGLang engine. |
| `--colocate` | flag | off | Share GPUs between actor and rollout. Implicitly enables `--offload-train`, `--offload-rollout`, and `--sglang-disable-piecewise-cuda-graph`. |

Model and checkpoints

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--train-backend` | enum | megatron | `megatron` or `fsdp`. |
| `--hf-checkpoint` | path | | HF model dir. Provides tokenizer, config, and the weights FSDP loads. |
| `--ref-load` | path | | Reference model in torch_dist format (Megatron). |
| `--load` | path | | Actor checkpoint to resume from. |
| `--save` | path | | Actor checkpoint write directory. |
| `--save-interval` | int | | Rollouts between saves. |
| `--model-name` | str | | Set in multi-node runs to avoid a transformers file-system race. |
| `--spec` | `<module> <fn>` | | Plugin spec for custom architectures (e.g. `miles_plugins.models.qwen3_5 get_qwen3_5_spec`). |
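A `<module> <fn>` pair like the one `--spec` takes is normally resolved with `importlib`. A hypothetical resolver (not Miles's actual loader), demonstrated with a stdlib pair since `miles_plugins` is not importable here:

```python
import importlib

def resolve_spec(module_name: str, fn_name: str):
    """Resolve a '<module> <fn>' pair to a callable, importlib-style."""
    module = importlib.import_module(module_name)
    return getattr(module, fn_name)

# Stdlib stand-in for something like: resolve_spec("miles_plugins.models.qwen3_5", "get_qwen3_5_spec")
fn = resolve_spec("math", "sqrt")
print(fn(9.0))  # 3.0
```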

Rollout: data and batching

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--prompt-data` | str | | Path to a single JSONL file. |
| `--input-key` | str | prompt | JSONL key to `Sample.prompt`. |
| `--label-key` | str | label | JSONL key to `Sample.label`. |
| `--metadata-key` | str | metadata | JSONL key to `Sample.metadata`. |
| `--apply-chat-template` | flag | off | Apply tokenizer chat template. |
| `--rollout-shuffle` | flag | off | Shuffle prompts each rollout. |
| `--num-rollout` | int | | Total rollout iterations. If unset, derived from dataset size. |
| `--rollout-batch-size` | int | | Prompts per rollout. |
| `--n-samples-per-prompt` | int | 1 | Responses per prompt. |
| `--global-batch-size` | int | derived | Samples per optimizer step. |
| `--num-steps-per-rollout` | int | 1 | Optimizer steps per rollout. |
| `--over-sampling-batch-size` | int | | Oversample size for dynamic sampling (DAPO). |
| `--balance-data` | flag | off | Balance per-rank token count. |
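The key flags above map JSONL fields onto `Sample` attributes. A sketch of one dataset line under the default keys (field values are illustrative):

```python
import json

# One line of a --prompt-data JSONL file, using the default key names.
line = json.dumps({
    "prompt": "What is 2 + 2?",     # picked up via --input-key (default: prompt)
    "label": "4",                    # picked up via --label-key (default: label)
    "metadata": {"source": "toy"},   # picked up via --metadata-key (default: metadata)
})

record = json.loads(line)
print(record["prompt"], record["label"])
```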

Rollout: sampling

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--rollout-max-response-len` | int | | Max tokens per response. |
| `--rollout-temperature` | float | 1.0 | Sampling temperature. |
| `--rollout-top-p` | float | 1.0 | Top-p truncation. |
| `--rollout-top-k` | int | -1 | Top-k truncation (-1 disables). |
| `--rollout-stop` | str+ | | Stop strings. |
| `--rollout-stop-token-ids` | int+ | | Stop token IDs. |

Eval

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--eval-prompt-data` | str+ | | One or more `name path` pairs. |
| `--eval-interval` | int | | Rollouts between eval runs. |
| `--n-samples-per-eval-prompt` | int | 1 | Responses per eval prompt. |
| `--eval-max-response-len` | int | | Max eval response length. Inherits from rollout if unset. |
| `--eval-temperature` | float | | Eval temperature. Inherits from rollout if unset. |
| `--eval-top-p` | float | | Eval top-p. Inherits from rollout if unset. |

Performance

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--tensor-model-parallel-size` | int | 1 | TP. |
| `--pipeline-model-parallel-size` | int | 1 | PP. |
| `--context-parallel-size` | int | 1 | CP. |
| `--expert-model-parallel-size` | int | 1 | EP (MoE). |
| `--expert-tensor-parallel-size` | int | 1 | TP within experts. |
| `--sequence-parallel` | flag | off | Enable Megatron sequence parallelism. |
| `--use-dynamic-batch-size` | flag | off | Pack varlen samples. Recommended for varlen workloads. |
| `--max-tokens-per-gpu` | int | | Token budget per micro-batch per GPU. Required when dynamic batching is on. |
| `--micro-batch-size` | int | 1 | Ignored when dynamic batching is on. |
| `--recompute-granularity` | enum | Megatron default | `full` or `selective`. |
| `--recompute-method` | enum | Megatron default | `uniform` or `block`. |
| `--recompute-num-layers` | int | Megatron default | Recompute chunk size. |
| `--gradient-checkpointing` | flag | off | FSDP equivalent of the recompute flags. |
| `--fsdp-cpu-offload` | flag | off | FSDP: offload params, grads, optimizer state to CPU. |
| `--fsdp-cpu-backend` | str | gloo | FSDP: CPU backend for hybrid offload. |
| `--attn-implementation` | enum | flash_attention_2 | FSDP only: `flash_attention_2`, `sdpa`, `eager`. |
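The parallelism flags multiply together, and whatever is left of the world size becomes data parallelism. A back-of-envelope check under the usual Megatron factorization world = TP × PP × CP × DP (an assumption of this sketch, not a Miles API):

```python
def data_parallel_size(world_size: int, tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """Data-parallel size under the world = TP * PP * CP * DP factorization."""
    model_parallel = tp * pp * cp
    assert world_size % model_parallel == 0, "parallel sizes must divide world size"
    return world_size // model_parallel

# 64 GPUs, TP=4, PP=2, CP=2 leaves DP=4.
print(data_parallel_size(64, tp=4, pp=2, cp=2))  # 4
```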

RL algorithm

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--advantage-estimator` | enum | grpo | `grpo`, `gspo`, `ppo`, `reinforce_plus_plus`, `reinforce_plus_plus_baseline`, `on_policy_distillation`. |
| `--use-kl-loss` | flag | off | Compute KL vs. reference. |
| `--kl-loss-coef` | float | 0.0 | KL weight in loss (0 means monitor only). |
| `--kl-loss-type` | enum | k1 | `k1`, `k2`, `k3`, `low_var_kl`. |
| `--entropy-coef` | float | 0.0 | Entropy bonus weight. |
| `--eps-clip` | float | 0.2 | PPO/GRPO low clip. |
| `--eps-clip-high` | float | | Asymmetric high clip. |
| `--use-tis` | flag | off | Truncated Importance Sampling. |
| `--use-routing-replay` | flag | off | Forward/backward routing consistency. |
| `--use-rollout-routing-replay` | flag | off | R3: capture inference-side expert routing and replay it during training. |
| `--calculate-per-token-loss` | flag | off | Per-token loss reduction. |
| `--no-check-for-nan-in-loss-and-grad` | flag | off | Skip NaN/Inf guard (Megatron flag, debug only). |
| `--true-on-policy-mode` | flag | off | Strict on-policy: reject samples from a prior policy. |

Optimizer

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--optimizer` | enum | adam | `adam`, `sgd`. |
| `--lr` | float | 1e-6 | Learning rate. |
| `--lr-decay-style` | enum | constant | `constant`, `linear`, `cosine`. |
| `--lr-warmup-iters` | int | 0 | Warmup steps (Megatron flag). |
| `--min-lr` | float | 0 | Lower LR bound for decay schedules (Megatron flag). |
| `--weight-decay` | float | 0.1 | L2 weight decay. |
| `--adam-beta1` | float | 0.9 | First Adam moment. |
| `--adam-beta2` | float | 0.98 | Second Adam moment. |
| `--clip-grad` | float | 1.0 | Gradient clipping (Megatron flag). |
| `--optimizer-cpu-offload` | flag | off | Megatron CPU Adam (Megatron flag). |
| `--overlap-cpu-optimizer-d2h-h2d` | flag | off | Overlap D2H/H2D with compute (Megatron flag). |
| `--use-precision-aware-optimizer` | flag | off | Precision-aware optimizer path (Megatron flag). |

Reward and filters

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--rm-type` | enum | | Built-in reward: `math`, `dapo`, `deepscaler`, `f1`, `gpqa`, `ifbench`, `remote_rm`, `random`. |
| `--rm-url` | str | | Endpoint when `--rm-type remote_rm`. |
| `--group-rm` | flag | off | Batched reward computation. |
| `--custom-rm-path` | str | | Custom reward function (see Customization). |
| `--dynamic-sampling-filter-path` | str | | Group filter (DAPO-style). |
| `--buffer-filter-path` | str | | Buffer dequeue filter. |
| `--rollout-sample-filter-path` | str | | Per-sample filter. |

SGLang and router

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--sglang-router-ip` | str | | External router IP. Miles starts its own router if unset. |
| `--sglang-router-port` | int | | External router port. |
| `--sglang-*` | passthrough | | Any flag accepted by `python -m sglang.launch_server` works with this prefix. |
| `--router-*` | passthrough | | Any flag accepted by the active router works with this prefix. |

Common `--sglang-*` flags:
--sglang-mem-fraction-static 0.8
--sglang-context-length 32768
--sglang-log-level INFO
--sglang-enable-ep-moe
--sglang-enable-dp-attention
--sglang-enable-deepep
--sglang-enable-overlap-schedule
--sglang-enforce-piecewise-cuda-graph     # off by default in colocate mode

MTP / speculative decoding

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--mtp-num-layers` | int | 0 | Number of MTP layers in the checkpoint. |
| `--enable-mtp-training` | flag | off | Train MTP alongside the policy. |
| `--mtp-loss-scaling-factor` | float | 0.2 | Weight of MTP loss. |

Fault tolerance

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--use-fault-tolerance` | flag | off | Enable rank-level recovery and heartbeats. |
| `--rollout-health-check-first-wait` | int | 0 | Grace period before heartbeats start. |
| `--rollout-health-check-interval` | int | 30 | Seconds between heartbeats. |
| `--rollout-health-check-timeout` | int | 30 | Heartbeat timeout. |

Async / partial rollout

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--partial-rollout` | flag | off | Resume aborted rollouts in the next iteration. |

Logging

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--use-wandb` | flag | off | Enable wandb. |
| `--wandb-project` | str | | Project name. |
| `--wandb-group` | str | | Group name. |
| `--log-interval` | int | 1 | Stdout log cadence (rollouts). |
| `--custom-rollout-log-function-path` | str | | Custom train logger. |
| `--custom-eval-rollout-log-function-path` | str | | Custom eval logger. |

Profiling

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--profile-target` | enum+ | [train_overall] | Which sub-loop to profile: `train_overall`, `train_actor`, `train_log_probs`. |
| `--use-pytorch-profiler` | flag | off | FSDP only: enable PyTorch profiler. |
| `--profile-step-start` | int | 10 | FSDP only: first step to profile. |
| `--profile-step-end` | int | 12 | FSDP only: last step to profile. |
| `--memory-snapshot-path` | str | snapshot.pickle | FSDP only: memory snapshot output. |
| `--tensorboard-dir` | str | | FSDP only: TensorBoard output dir. |

Debugging

| Flag | Type | Default | Notes |
|---|---|---|---|
| `--debug-rollout-only` | flag | off | Skip Megatron, only spin up SGLang. |
| `--debug-train-only` | flag | off | Skip SGLang, only spin up Megatron. |
| `--save-debug-rollout-data` | path | | Pickle every rollout to disk. |
| `--load-debug-rollout-data` | path | | Replay rollouts from disk (implies `--debug-train-only`). |
| `--deterministic-mode` | flag | off | Megatron deterministic mode. |

Customization

See Customization for the full catalog of --*-path flags that replace or extend Miles’s behavior.