1. Model Introduction
Qwen3 is the latest generation of Alibaba’s Qwen language model series, available in dense and MoE variants with both Instruct and reasoning-enhanced Thinking editions. Key highlights:
- Stronger general intelligence: significant improvements in instruction following, logical reasoning, mathematics, science, coding, and tool usage over Qwen2.5.
- Extended context length: trained for 256K-token contexts, useful for long-document reasoning and agentic workflows.
- Flexible deployment options: dense sizes from 0.6B up to 32B; this page covers the dense recipes (MoE recipes live in qwen3-moe).
- Stronger agent interaction: improved tool-use and search-based agent performance.
2. Supported Variants
| Model | HF ID |
|---|---|
| Qwen3-0.6B | Qwen/Qwen3-0.6B |
| Qwen3-1.7B | Qwen/Qwen3-1.7B |
| Qwen3-4B | Qwen/Qwen3-4B |
| Qwen3-4B-Instruct-2507 | Qwen/Qwen3-4B-Instruct-2507 |
| Qwen3-8B | Qwen/Qwen3-8B |
| Qwen3-14B | Qwen/Qwen3-14B |
| Qwen3-32B | Qwen/Qwen3-32B |
3. Environment Setup
3.1 Download model + datasets
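A minimal sketch of this step, assuming the Hugging Face CLI and the /root paths that the recipes below reference (these commands are illustrative, not the repo's exact ones):

```bash
# Hedged sketch, not the repo's exact commands. Paths under /root match the recipe
# defaults cited in section 5.2; dataset repo IDs are intentionally not guessed here.
pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen3-4B --local-dir /root/Qwen3-4B
# Fetch dapo-math-17k and openhermes2_5.parquet the same way from their dataset repos.
```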
3.2 HF → Megatron torch_dist conversion
The conversion parallelism follows WORLD_SIZE; for the larger dense sizes, drive it with torchrun --nproc-per-node 8. The FSDP launcher loads the HF checkpoint directly and skips this step.
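As a rough sketch of a multi-GPU conversion run (the converter's script path and flags below are assumptions; only torchrun itself is standard):

```bash
# Hedged sketch: the script path and flag names are assumptions; adjust to the repo's converter.
torchrun --nproc-per-node 8 tools/convert_hf_to_torch_dist.py \
  --hf-checkpoint /root/Qwen3-32B \
  --save /root/Qwen3-32B_torch_dist
```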
4. Launch
4.1 Quick start
Each recipe pairs a size-specific launch script (run-qwen3-32B.sh, run-qwen3-4B-fsdp.sh, etc.) with the matching qwen3-XB.sh model config.
The Qwen3-4B-Instruct-2507 config (scripts/models/qwen3-4B-Instruct-2507.sh) just sets MODEL_ARGS_ROTARY_BASE=5000000 and re-sources qwen3-4B.sh; source it when converting or launching the Instruct-2507 checkpoint.
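A minimal sketch of what that config amounts to, based on the description above (the real file may differ in ordering or in variable names beyond MODEL_ARGS_ROTARY_BASE):

```bash
# Hedged sketch of scripts/models/qwen3-4B-Instruct-2507.sh as described above.
MODEL_ARGS_ROTARY_BASE=5000000                         # Instruct-2507 uses a 5e6 rotary base
source "$(dirname "${BASH_SOURCE[0]}")/qwen3-4B.sh"    # reuse the dense 4B settings
```

Launching then reduces to running the size-specific script, e.g. bash run-qwen3-4B.sh (the exact location of the launch scripts in the repo is an assumption).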
5. Recipe Configuration
5.1 Parallelism
| Script | TP | PP | CP | EP | max_tokens_per_gpu | GPUs |
|---|---|---|---|---|---|---|
| run-qwen3-4B.sh | 2 | 1 | 1 | 1 | 9216 | 8 (1 × 8) |
| run-qwen3-4B_4xgpu.sh | 2 | 1 | 1 | 1 | 9216 | 4 (1 × 4) |
| run-qwen3-32B.sh | 8 | 1 | 1 | 1 | 20480 | 8 (1 × 8) |
--sequence-parallel is on whenever TP > 1.
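For orientation, the 32B row translates into Megatron-style parallelism flags roughly as follows (a sketch; the recipe's own variable names may differ, and --max-tokens-per-gpu is taken from the table header rather than verified against the script):

```bash
# Hedged sketch of the run-qwen3-32B.sh parallelism row in flag form.
PERF_ARGS=(
  --tensor-model-parallel-size 8     # TP = 8
  --pipeline-model-parallel-size 1   # PP = 1
  --context-parallel-size 1          # CP = 1
  --sequence-parallel                # enabled because TP > 1
  --max-tokens-per-gpu 20480         # per-GPU token budget from the table
)
```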
5.2 Algorithm
All dense recipes share a GRPO baseline with --rm-type deepscaler against dapo-math-17k. The SFT recipe (run-qwen3-4B-base-sft.sh) trains on /root/openhermes2_5.parquet.
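In flag form this amounts to something like the following (only --rm-type deepscaler is quoted verbatim from the recipes; the data flag names and paths are assumptions):

```bash
# Hedged sketch of the data/reward wiring; flag names other than --rm-type are assumptions.
ROLLOUT_ARGS=(
  --rm-type deepscaler                      # rule-based math reward
  --prompt-data /root/dapo-math-17k         # assumed flag/path for the RL prompts
)
SFT_ARGS=(
  --data-path /root/openhermes2_5.parquet   # SFT corpus used by run-qwen3-4B-base-sft.sh
)
```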
5.3 Rollout & SGLang
run-qwen3-32B.sh additionally pins --sglang-cuda-graph-bs 1 2 4 8 $(seq 16 8 256) (CUDA-graph capture batch sizes 1, 2, 4, 8, then 16 through 256 in steps of 8). The FSDP variant uses --attn-implementation flash_attention_3 with the fa3 SGLang attention backend, and adds --update-weight-buffer-size 536870912 --gradient-checkpointing.
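Grouped for readability, the FSDP variant's additions look like this (a sketch assembled from the flags quoted above, not copied from the script):

```bash
# Sketch: extra flags used by the FSDP variant, as listed above.
FSDP_EXTRA_ARGS=(
  --attn-implementation flash_attention_3   # HF-side attention kernel
  --update-weight-buffer-size 536870912     # 512 MiB weight-sync buffer
  --gradient-checkpointing
)
# SGLang attention backend for this variant: fa3
```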
5.4 Optimizer
run-qwen3-32B.sh enables CPU Adam, which keeps optimizer state and the Adam update in host memory; the exact flags are set in the script.
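A rough sketch of what this typically looks like in a Megatron-style launch (the flag names here are assumptions; the authoritative list is whatever run-qwen3-32B.sh passes):

```bash
# Hedged sketch only; check run-qwen3-32B.sh for the real flags.
OPTIMIZER_ARGS=(
  --optimizer adam
  --optimizer-cpu-offload    # keep optimizer state and the Adam update on the host
)
```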
5.5 Notable quirks
- BF16 train + FP8 inference: run-qwen3-4B.sh ships a commented --hf-checkpoint /root/Qwen3-4B-FP8 alternative; uncomment it (and download Qwen/Qwen3-4B-FP8) to swap rollout to FP8 while keeping BF16 training (see the sketch after this list, and Low Precision RL).
- FSDP backend: run-qwen3-4B-fsdp.sh runs the same recipe with --train-backend fsdp; no Megatron torch_dist conversion needed.
- AMD ROCm: scripts/amd/run-qwen3-4B-amd.sh mirrors the recipe with ${NUM_GPUS} resolved from the AMD environment.
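A sketch of the BF16-train / FP8-rollout swap from the first quirk (the download command and local path are assumptions; the --hf-checkpoint flag is quoted from the recipe):

```bash
# Hedged sketch: fetch the FP8 checkpoint, then flip the commented --hf-checkpoint line
# in run-qwen3-4B.sh to point at it. Training stays in BF16; only rollout changes.
huggingface-cli download Qwen/Qwen3-4B-FP8 --local-dir /root/Qwen3-4B-FP8
#   --hf-checkpoint /root/Qwen3-4B-FP8 \    # uncomment this line in run-qwen3-4B.sh
```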
6. Pairs Well With
- Low Precision RL
- Backends Beyond Megatron — for the FSDP variant.

