A Miles training job is a loop over four objects, plus a frozen reference copy of the actor used for KL anchoring. Once you understand what each one is and how data flows between them, every flag in the system has an obvious home.

The four objects

| Object | Role | Lives in |
| --- | --- | --- |
| Prompt dataset | Source of input examples | JSONL on disk (or --data-source-path) |
| Rollout (SGLang engines) | Generates responses given prompts | One or more SGLang servers behind a router |
| Reward model | Maps (prompt, response, label) → score | Built-in (--rm-type) or custom (--custom-rm-path) |
| Actor (Megatron / FSDP) | The model being trained | Megatron torch_dist checkpoint, or HF directory under FSDP |
| Reference | Frozen copy of the actor for KL anchoring | Loaded from --ref-load, never updated |
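
The reward model row describes a mapping from (prompt, response, label) to a score. To illustrate that shape only, here is a minimal exact-match reward in Python. The function name `exact_match_reward` and its signature are hypothetical; the interface Miles expects from a --custom-rm-path module is not shown on this page and may differ.

```python
def exact_match_reward(prompt: str, response: str, label: str) -> float:
    """Hypothetical reward: 1.0 if the last non-empty line of the response
    equals the label, else 0.0. The prompt is accepted only to match the
    (prompt, response, label) shape and is not used here."""
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    final_answer = lines[-1] if lines else ""
    return 1.0 if final_answer == label.strip() else 0.0
```

A binary reward like this is the simplest case; any function that returns a float for the same three inputs fits the same slot in the loop.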

The training loop

for it in range(num_rollout):
    # 1. Sample: draw prompts (with their labels), generate n responses per prompt
    prompts, labels = dataset.sample(rollout_batch_size)
    responses = sglang.generate(prompts, n=n_samples_per_prompt)

    # 2. Score
    rewards = reward_fn(prompts, responses, labels)

    # 3. Optimize: each step consumes one global batch of the samples above
    for step in range(num_steps_per_rollout):
        batch = pack(prompts, responses, rewards, size=global_batch_size)
        loss = grpo_loss(actor, ref_model, batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # 4. Sync: push the updated actor weights to the rollout engines
    p2p_weight_transfer(actor, sglang_engines)
That’s the whole thing. Every flag in Miles configures one of these four steps.
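
Step 3 relies on `grpo_loss`, which this page does not define. One reason `n_samples_per_prompt` matters is the group-relative advantage used in the common formulation of GRPO: each response's reward is normalized against the other responses to the same prompt. The sketch below assumes mean/std normalization and assumes rewards are laid out prompt by prompt. The helper name `group_relative_advantages` is hypothetical, and the estimator Miles actually uses may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, n_samples_per_prompt, eps=1e-6):
    """Normalize each reward against the other samples for the same prompt.

    Assumes `rewards` is a flat list in which consecutive runs of
    `n_samples_per_prompt` entries belong to the same prompt.
    """
    if len(rewards) % n_samples_per_prompt != 0:
        raise ValueError("reward count must be a multiple of n_samples_per_prompt")

    advantages = []
    for start in range(0, len(rewards), n_samples_per_prompt):
        group = rewards[start:start + n_samples_per_prompt]
        mu = mean(group)
        sigma = pstdev(group)  # population std of the group
        advantages.extend((r - mu) / (sigma + eps) for r in group)
    return advantages
```

Under this formulation, a prompt whose responses all receive the same reward yields zero advantage for every sample, so it contributes no learning signal.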

The four-knob invariant

Two knobs govern the sampling half of the loop, two govern the training half, and they are locked into a single equation:
rollout_batch_size × n_samples_per_prompt
  = global_batch_size × num_steps_per_rollout
Every sample produced by rollout is consumed by training, and every sample consumed by training was produced by rollout. Set any three of the four knobs and Miles fills in the fourth. Set all four inconsistently and Miles aborts with a validation error.
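
For example, with rollout_batch_size = 32 and n_samples_per_prompt = 8, each rollout yields 256 samples; with global_batch_size = 128, that forces num_steps_per_rollout = 2. The Python sketch below reproduces the arithmetic. It is not Miles's validation code: `resolve_knobs` is a hypothetical helper, and rejecting non-divisible combinations is an assumption about reasonable behaviour.

```python
SAMPLING = ("rollout_batch_size", "n_samples_per_prompt")
TRAINING = ("global_batch_size", "num_steps_per_rollout")


def resolve_knobs(**knobs):
    """Fill in at most one missing knob so that
    rollout_batch_size * n_samples_per_prompt
        == global_batch_size * num_steps_per_rollout."""
    values = {name: knobs.get(name) for name in SAMPLING + TRAINING}
    missing = [name for name, value in values.items() if value is None]
    if len(missing) > 1:
        raise ValueError(f"need at least three knobs, missing: {missing}")

    def product(pair):
        return values[pair[0]] * values[pair[1]]

    if not missing:
        # All four given: only check that both halves agree.
        if product(SAMPLING) != product(TRAINING):
            raise ValueError("samples produced != samples consumed")
        return values

    name = missing[0]
    own, other = (SAMPLING, TRAINING) if name in SAMPLING else (TRAINING, SAMPLING)
    partner = own[0] if own[1] == name else own[1]
    quotient, remainder = divmod(product(other), values[partner])
    if remainder:
        raise ValueError(f"no integer value of {name} satisfies the invariant")
    values[name] = quotient
    return values


resolved = resolve_knobs(
    rollout_batch_size=32, n_samples_per_prompt=8, global_batch_size=128
)
print(resolved["num_steps_per_rollout"])  # prints 2
```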

Where every flag goes

Use this map when reading any launch script:
| Argument group | Concerns |
| --- | --- |
| MODEL_ARGS | Architecture constants (layers, hidden size, rotary base, …) |
| CKPT_ARGS | Filesystem paths for the actor / reference / save directory |
| ROLLOUT_ARGS | Prompt dataset, batch knobs, sampling parameters, reward type |
| EVAL_ARGS | Eval dataset, cadence, sampling overrides for evaluation |
| PERF_ARGS | Parallelism (TP/PP/CP/EP/ETP), recomputation, dynamic batching |
| GRPO_ARGS | RL algorithm, KL, clipping, entropy bonus, advantage estimator |
| OPTIMIZER_ARGS | Learning rate, schedule, weight decay, Adam betas |
| SGLANG_ARGS | Engine TP, memory fraction, log level, --sglang-* passthrough |
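
To make the grouping concrete, here is a small Python sketch of how a launcher could flatten these groups into one command line. Only the group names and the flags --ref-load and --rm-type come from this page. The entry point `train.py`, the helper `build_command`, and the placeholder values are assumptions for illustration, not Miles's actual launcher.

```python
import shlex

# Group names in the order of the table above.
GROUP_ORDER = [
    "MODEL_ARGS", "CKPT_ARGS", "ROLLOUT_ARGS", "EVAL_ARGS",
    "PERF_ARGS", "GRPO_ARGS", "OPTIMIZER_ARGS", "SGLANG_ARGS",
]


def build_command(entry_point, groups):
    """Concatenate per-group argument lists into a single shell command."""
    unknown = set(groups) - set(GROUP_ORDER)
    if unknown:
        raise ValueError(f"unknown argument groups: {sorted(unknown)}")
    argv = ["python", entry_point]
    for name in GROUP_ORDER:
        argv.extend(groups.get(name, []))  # groups left out contribute nothing
    return shlex.join(argv)


groups = {
    # Placeholder values; substitute real paths and reward types.
    "CKPT_ARGS": ["--ref-load", "/path/to/reference"],
    "ROLLOUT_ARGS": ["--rm-type", "RM_TYPE"],
}
print(build_command("train.py", groups))
```

Keeping each flag in its own group means you can check a launch script one concern at a time, in the same order as the table.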
