

What you’ll learn: how to use Miles for plain supervised fine-tuning. No RL, no rollout, no reward — just data → loss → optimizer. Why use Miles for SFT? Two reasons:
  1. Same launch convention as your RL run — one config, one Ray cluster.
  2. Async data prefetching — the SFT loop reuses the rollout machinery to overlap data loading with training.

Prerequisites

  • You completed the Qwen3-4B recipe (we reuse the conversion).
  • ~50 GB free disk for OpenHermes-2.5.

Quick start

1. Convert Qwen3-4B-Base

If you don’t already have it:
hf download Qwen/Qwen3-4B-Base --local-dir /root/Qwen3-4B-Base

cd /root/miles
source scripts/models/qwen3-4B.sh
PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
   "${MODEL_ARGS[@]}" \
   --hf-checkpoint /root/Qwen3-4B-Base \
   --save /root/Qwen3-4B-Base_torch_dist

2. Prepare the dataset

OpenHermes ships in a ShareGPT-style shape (`conversations` with `from`/`value` keys). Convert it to the OpenAI messages format:
from datasets import load_dataset

ds = load_dataset("teknium/OpenHermes-2.5")["train"]

def convert(sample):
    role_map = {"human": "user", "gpt": "assistant", "system": "system"}
    return {
        "messages": [
            {"role": role_map[turn["from"]], "content": turn["value"]}
            for turn in sample["conversations"]
        ]
    }

ds = ds.map(convert)
ds.to_parquet("/root/openhermes2_5.parquet")
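Before mapping the full dataset, it is worth sanity-checking the role mapping on a hand-built sample. The snippet below reproduces `convert` so it stands alone:

```python
# Standalone sanity check for the role mapping used above.
def convert(sample):
    role_map = {"human": "user", "gpt": "assistant", "system": "system"}
    return {
        "messages": [
            {"role": role_map[turn["from"]], "content": turn["value"]}
            for turn in sample["conversations"]
        ]
    }

sample = {
    "conversations": [
        {"from": "system", "value": "You are a helpful assistant."},
        {"from": "human", "value": "Hi"},
        {"from": "gpt", "value": "Hello!"},
    ]
}
out = convert(sample)
print([m["role"] for m in out["messages"]])  # ['system', 'user', 'assistant']
```

Any `from` value outside the three mapped roles will raise a `KeyError`, which is the behavior you want: it surfaces malformed turns at conversion time rather than at training time.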

3. Run

bash scripts/run-qwen3-4B-base-sft.sh

What changes vs. the GRPO recipe

Compare to run-qwen3-4B.sh. The deltas:
- python3 train.py
+ python3 train_async.py        # async for data prefetch

- ROLLOUT_ARGS=( ... GRPO knobs, n-samples-per-prompt, ... )
+ SFT_ARGS=(
+    --rollout-function-path miles.rollout.sft_rollout.generate_rollout
+    --prompt-data /root/openhermes2_5.parquet
+    --input-key messages
+    --rollout-shuffle
+    --num-epoch 3
+    --rollout-batch-size 128
+    --global-batch-size 128
+
+    --loss-type sft_loss
+    --calculate-per-token-loss
+    --disable-compute-advantages-and-returns
+    --debug-train-only
+ )

- GRPO_ARGS=( ... )            # removed entirely
- SGLANG_ARGS=( ... )          # removed — no inference needed

Why each flag

| Flag | Why |
| --- | --- |
| `--rollout-function-path miles.rollout.sft_rollout.generate_rollout` | Read from disk instead of generating |
| `--rollout-batch-size` = `--global-batch-size` | One batch read = one optimizer step |
| No `--n-samples-per-prompt` | SFT has one target per input |
| `--loss-type sft_loss` | Cross-entropy instead of policy-gradient |
| `--calculate-per-token-loss` | Standard SFT averages over unmasked tokens |
| `--disable-compute-advantages-and-returns` | No advantage / return needed |
| `--debug-train-only` | Skip SGLang init (we don’t need rollout) |
| `train_async.py` | Async data prefetch overlaps load with train |
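For intuition about what the loss flags select, here is the per-token averaging in plain Python. This is an illustrative sketch, not Miles internals: cross-entropy is summed over unmasked (assistant) tokens and divided by the unmasked-token count, so prompt tokens contribute nothing.

```python
import math

# Sketch of per-token SFT loss: -log softmax(logits)[label], averaged
# over positions where loss_mask == 1 (the assistant tokens).
def sft_per_token_loss(logits, labels, loss_mask):
    """logits: [T][V] floats, labels: [T] ints, loss_mask: [T] of 0/1."""
    total, count = 0.0, 0
    for row, label, m in zip(logits, labels, loss_mask):
        if not m:
            continue  # prompt tokens are masked out of the loss
        log_z = math.log(sum(math.exp(x) for x in row))
        total += log_z - row[label]  # cross-entropy at this position
        count += 1
    return total / max(count, 1)

# Two vocab-3 positions; only the second is a target token.
logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
loss = sft_per_token_loss(logits, labels=[0, 1], loss_mask=[0, 1])
```

The `max(count, 1)` guard keeps a fully masked sequence from dividing by zero; real implementations typically normalize across the whole global batch instead of per sequence.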

What to watch

sft/loss                       decreasing
sft/per_token_loss             decreasing (mirrors loss when using per-token)
sft/tokens_seen                steadily increasing
sft/epoch                      0 → num_epoch
data/prefetch_queue_depth      > 0 (else loader is the bottleneck)
If data/prefetch_queue_depth stays at 0, your data loader is too slow — increase worker count or use parquet (we already do).
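The prefetch pattern behind that metric is an ordinary bounded producer-consumer queue. A minimal sketch (names are illustrative, not the Miles implementation): a background thread keeps the queue filled with ready batches, so a training step only stalls when the depth hits 0.

```python
import queue
import threading

# Background thread fills a bounded queue of ready batches; the
# consumer (the training loop) only blocks when the queue is empty.
def prefetched(load_batch, num_batches, depth=4):
    q = queue.Queue(maxsize=depth)

    def producer():
        for i in range(num_batches):
            q.put(load_batch(i))  # blocks when the queue is full
        q.put(None)               # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()
    while (batch := q.get()) is not None:
        yield batch

batches = list(prefetched(lambda i: {"batch": i}, num_batches=3))
print(batches)  # [{'batch': 0}, {'batch': 1}, {'batch': 2}]
```

The bounded `maxsize` is what makes the depth metric meaningful: an always-full queue means the loader is ahead of training, while a depth pinned at 0 means training is waiting on I/O.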

Tuning knobs

| Knob | Effect |
| --- | --- |
| `--num-epoch` | Total passes over the dataset |
| `--rollout-batch-size` | Bigger = better GPU utilization, more memory |
| `--max-tokens-per-gpu` | As always — push it up until OOM |
| `--lr` | SFT typically 1e-5 to 5e-5 (10× higher than RL) |
| `--lr-decay-style cosine --lr-warmup-iters 100` | Standard SFT schedule |
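For reference, the cosine-with-warmup schedule those flags select looks like this as a function of step. This is a generic sketch of the shape (exact min-LR and edge handling may differ from the trainer's implementation): a linear ramp to the peak LR over the warmup iters, then cosine decay toward zero.

```python
import math

# Generic cosine schedule with linear warmup, as selected by
# `--lr-decay-style cosine --lr-warmup-iters 100` (illustrative).
def lr_at(step, peak_lr=2e-5, warmup=100, total=10_000, min_lr=0.0):
    if step < warmup:
        return peak_lr * step / warmup          # linear ramp
    progress = (step - warmup) / (total - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

lr_mid_warmup = lr_at(50)    # half of peak: 1e-5
lr_peak = lr_at(100)         # 2e-5
lr_end = lr_at(10_000)       # ~0
```

Warmup matters more for SFT than the exact decay shape: the first optimizer steps on a fresh Adam state at 1e-5+ can otherwise spike the loss.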

Variations

Mix datasets

Pass multiple --prompt-data entries:
SFT_ARGS+=(
   --prompt-data \
      hermes  /root/openhermes2_5.parquet \
      slimorca /data/slimorca.parquet
)
Per-source loss is logged separately.
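Conceptually, per-source logging is just grouping batch losses by their source tag before averaging. A minimal sketch (the function name and record shape are hypothetical, not the Miles logging API):

```python
from collections import defaultdict

# Group (source, loss) records and average per source before logging.
def per_source_means(records):
    sums = defaultdict(lambda: [0.0, 0])
    for source, loss in records:
        sums[source][0] += loss
        sums[source][1] += 1
    return {s: total / n for s, (total, n) in sums.items()}

means = per_source_means([("hermes", 1.0), ("hermes", 3.0), ("slimorca", 2.0)])
print(means)  # {'hermes': 2.0, 'slimorca': 2.0}
```

Watching the per-source curves separately is the point of mixing: one dataset plateauing while another still descends tells you where to rebalance.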

Continue with RL

After SFT, point the RL run at the SFT checkpoint:
CKPT_ARGS=(
   --hf-checkpoint /root/Qwen3-4B-Base
   --ref-load      /root/Qwen3-4B-Base_torch_dist     # original (anchor)
   --load          /root/Qwen3-4B-Base_sft/           # SFT output (start point)
   --save          /root/Qwen3-4B-Base_rl/
)

LoRA SFT

Use the LoRA hooks (--lora-rank 16) to keep VRAM low when fine-tuning a larger model. See examples/lora/ in the repo.
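For intuition about why `--lora-rank 16` keeps VRAM low, here is the LoRA forward pass in plain Python (illustrative only, not the Miles hooks): the frozen weight `W` is augmented by a low-rank update `B @ A` scaled by `alpha / rank`, and only `A` and `B` are trained.

```python
# Plain-Python LoRA forward sketch: y = W x + (alpha/rank) * B (A x).
def matvec(M, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=16, rank=1):
    base = matvec(W, x)              # frozen pretrained path
    delta = matvec(B, matvec(A, x))  # trainable low-rank path
    return [b + (alpha / rank) * d for b, d in zip(base, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 weight
A = [[0.5, 0.5]]              # rank-1 down-projection (trained)
B = [[0.0], [0.0]]            # up-projection, zero-initialized (trained)
y = lora_forward(W, A, B, x=[1.0, 2.0])
print(y)  # [1.0, 2.0] -- zero-init B means no change at step 0
```

With `B` initialized to zero (the usual LoRA init), the adapted layer starts identical to the frozen base, and the trainable parameter count is only `rank * (d_in + d_out)` per adapted matrix instead of `d_in * d_out`.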