What you’ll learn: how to use Miles for plain supervised fine-tuning. No RL, no rollout, no reward — just data → loss → optimizer. Why use Miles for SFT? Two reasons:
- Same launch convention as your RL run — one config, one Ray cluster.
- Async data prefetching — the SFT loop reuses the rollout machinery to overlap data loading with training.
Prerequisites
- You completed the Qwen3-4B recipe (we reuse the conversion).
- ~50 GB free disk for OpenHermes-2.5.
Quick start
1. Convert Qwen3-4B-Base
If you don’t already have it:
2. Prepare the dataset
OpenHermes ships in a custom schema. Convert it to the OpenAI messages format:
3. Run
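As a sketch, steps 2 and 3 might look like the following. The role mapping follows OpenHermes’ "conversations" schema ("from"/"value" turns); the output path and run-script name are assumptions, not Miles APIs:

```shell
# Step 2 (sketch): map OpenHermes "conversations" entries
# ({"from": "system"/"human"/"gpt", "value": ...}) to OpenAI-style
# {"role", "content"} messages. The output path is an assumption.
python - <<'EOF'
import json

ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def to_messages(row):
    # One OpenHermes row -> one OpenAI-messages row.
    return {"messages": [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in row["conversations"]
    ]}

# Tiny inline sample standing in for the real dataset file.
sample = {"conversations": [
    {"from": "human", "value": "What is 2 + 2?"},
    {"from": "gpt", "value": "4."},
]}
with open("/tmp/openhermes_sample.jsonl", "w") as f:
    f.write(json.dumps(to_messages(sample)) + "\n")
EOF

# Step 3 (sketch): launch with the same convention as the RL recipe,
# i.e. a script mirroring run-qwen3-4B.sh (name hypothetical):
#   bash run-qwen3-4B-sft.sh
```

In the real conversion you would stream the full dataset rather than a single inline row, writing Parquet or JSONL to wherever `--prompt-data` expects it.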
What changes vs. the GRPO recipe
Compare to run-qwen3-4B.sh. The deltas:
Why each flag
| Flag | Why |
|---|---|
| `--rollout-function-path miles.rollout.sft_rollout.generate_rollout` | Read from disk instead of generating |
| `--rollout-batch-size` = `--global-batch-size` | One batch read = one optimizer step |
| No `--n-samples-per-prompt` | SFT has one target per input |
| `--loss-type sft_loss` | Cross-entropy instead of policy gradient |
| `--calculate-per-token-loss` | Standard SFT averages over unmasked tokens |
| `--disable-compute-advantages-and-returns` | No advantage / return needed |
| `--debug-train-only` | Skip SGLang init (we don’t need rollout) |
| `train_async.py` | Async data prefetch overlaps load with train |
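The same deltas, collected as a launch fragment. This is a sketch: the flag spellings follow the table, but the batch-size value of 128 is an assumed example, and the rest of the launch is presumed to mirror run-qwen3-4B.sh:

```shell
# SFT deltas vs. the GRPO launch (sketch; 128 is an assumed example value).
SFT_ARGS=(
    --rollout-function-path miles.rollout.sft_rollout.generate_rollout
    --rollout-batch-size 128      # must equal --global-batch-size
    --global-batch-size 128
    --loss-type sft_loss
    --calculate-per-token-loss
    --disable-compute-advantages-and-returns
    --debug-train-only
)
# ...and launch via train_async.py rather than the synchronous entry point,
# dropping --n-samples-per-prompt entirely.
```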
What to watch
If `data/prefetch_queue_depth` stays at 0, your data loader is too slow: increase the worker count or use Parquet (the recipe already does).
Tuning knobs
| Knob | Effect |
|---|---|
| `--num-epoch` | Total passes over the dataset |
| `--rollout-batch-size` | Bigger = better GPU utilization, more memory |
| `--max-tokens-per-gpu` | As always — push it up until OOM |
| `--lr` | SFT typically 1e-5 to 5e-5 (10× higher than RL) |
| `--lr-decay-style cosine --lr-warmup-iters 100` | Standard SFT schedule |
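A tuned invocation fragment, as a sketch: every value below is an illustrative assumption within the ranges above, not a recipe default:

```shell
# Tuning-knob sketch (values are illustrative assumptions, not defaults).
TUNE_ARGS=(
    --num-epoch 3
    --rollout-batch-size 256       # raise until memory gets tight
    --max-tokens-per-gpu 16384     # push up until OOM, then back off
    --lr 2e-5                      # inside the typical 1e-5 to 5e-5 SFT range
    --lr-decay-style cosine
    --lr-warmup-iters 100
)
```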
Variations
Mix datasets
Pass multiple `--prompt-data` entries:
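For example (the file paths are hypothetical; the text above only confirms that `--prompt-data` may be repeated):

```shell
# Mixing two SFT datasets by repeating --prompt-data (paths hypothetical).
MIX_ARGS=(
    --prompt-data /data/openhermes.parquet
    --prompt-data /data/my-domain-sft.parquet
)
```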
Continue with RL
After SFT, point the RL run at the SFT checkpoint:
LoRA SFT
Use the LoRA hooks (`--lora-rank 16`) to keep VRAM low when fine-tuning a larger model. See examples/lora/ in the repo.
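A minimal sketch of the LoRA variant; only `--lora-rank 16` is confirmed by the text above, so check examples/lora/ for any further LoRA flags:

```shell
# LoRA SFT (sketch): append the rank to the SFT launch from this recipe.
# Only --lora-rank 16 is confirmed; see examples/lora/ for full configs.
LORA_ARGS=( --lora-rank 16 )
```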
