This page takes you from `docker pull` to a running GRPO training job on Qwen3-4B. It assumes an 8-GPU node (H100 / H200 / B-series) and roughly 200 GB of disk. For other models, see Models.
1. Start the container
On the host:
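The exact image name and tag ship with each release, so the command below is a minimal sketch with an `<image>` placeholder; the GPU, shared-memory, and volume flags are standard Docker options, not project-specific ones.

```bash
# Minimal sketch -- replace <image> with the image published for your release.
# --gpus all exposes the node's 8 GPUs; --shm-size avoids NCCL shared-memory
# failures; the volume mount keeps checkpoints on the host's ~200 GB of disk.
docker run -it \
  --gpus all \
  --shm-size 16g \
  --network host \
  -v $HOME/ckpts:/root/ckpts \
  <image> /bin/bash
```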
2. Download model and data
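As a hedged sketch: `Qwen/Qwen3-4B` is the public Hugging Face repo id for this model, while the dataset repo id and local paths below are assumptions chosen to match the deepscaler reward used later on this page.

```bash
# Qwen/Qwen3-4B is the public HF model repo; the dataset repo id below is an
# assumption (use whatever dataset matches your reward model).
pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen3-4B --local-dir /root/models/Qwen3-4B
huggingface-cli download --repo-type dataset \
  agentica-org/DeepScaleR-Preview-Dataset --local-dir /root/data/deepscaler
```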
3. Convert to Megatron format
Megatron consumes a `torch_dist` checkpoint, not the raw HuggingFace directory.
The conversion runs under `torchrun --nproc-per-node 8` (optionally multi-node). See the Models section for per-family conversion commands.
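A hedged sketch of the call shape only: the script name and flags below are hypothetical placeholders (the real per-family commands live in the Models section); what this page fixes is one conversion process per GPU via `torchrun`.

```bash
# Hypothetical script name and flags -- see the Models section for the real
# per-family command. torchrun spawns one conversion worker per GPU.
torchrun --nproc-per-node 8 tools/convert_hf_to_torch_dist.py \
  --hf-checkpoint /root/models/Qwen3-4B \
  --save /root/models/Qwen3-4B_torch_dist
```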
4. Launch training
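The launch script itself is not reproduced here; below is a hedged sketch that wires together the flags this page does mention (`rollout-batch-size`, `n-samples-per-prompt`, `--rm-type deepscaler`) around a hypothetical script path and illustrative values.

```bash
# Hypothetical script path and illustrative batch values; only the three flag
# names are taken from this page.
bash scripts/run_grpo_qwen3_4b.sh \
  --rollout-batch-size 32 \
  --n-samples-per-prompt 8 \
  --rm-type deepscaler
```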
5. What’s happening
Each iteration runs the same four steps:
- Sample `rollout-batch-size` prompts and generate `n-samples-per-prompt` responses.
- Score responses with the reward model (`--rm-type deepscaler` in this recipe).
- Compute the GRPO objective (see the formula below) and step the optimizer.
- Push updated weights back to the SGLang engines via P2P.
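For reference, the group-relative estimate that gives GRPO its name (Shao et al., 2024): each of the `n-samples-per-prompt` responses to a prompt is scored, and its advantage is the reward normalized within its own group,

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_n)}{\operatorname{std}(r_1, \dots, r_n)},
$$

which then enters a PPO-style clipped surrogate with a KL penalty toward the reference policy. No value network is needed, which is why the recipe runs only rollout, actor, and reference models.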
Inspecting a run
| Question | Where to look |
|---|---|
| Is the policy learning? | `loss` and `reward` columns in stdout, or wandb |
| Rollout or train bottleneck? | `rollout=` vs. `train=` timings per iteration |
| Are GPUs saturated? | `nvidia-smi dmon -s u` |
| SGLang internals? | `tail -f /tmp/sglang/*.log` |
| Ranks crashing? | `~/.ray/session_latest/logs/worker-*.err` |
Next steps
- Core concepts — the model behind rollout / actor / reference.
- Training script walkthrough — an annotated tour through every argument group in a launch script, plus colocation, dynamic sampling, partial rollout, and BF16+FP8 inference.
- Training backends — Megatron vs FSDP.
- Customization — plug in custom rollout / reward.
- Models — recipes for Qwen3.5, GLM4.5, DeepSeek R1, Kimi K2, and more.

