
1. Model Introduction

Qwen3 is the latest generation of Alibaba’s Qwen language model series, available in dense and MoE variants with both Instruct and reasoning-enhanced Thinking editions. Key highlights:
  • Stronger general intelligence: significant improvements in instruction following, logical reasoning, mathematics, science, coding, and tool usage over Qwen2.5.
  • Extended context length: context windows up to 256K tokens (native on the 2507 releases), useful for long-document reasoning and agentic workflows.
  • Flexible deployment options: dense sizes from 0.6B up to 32B; this page covers the dense recipes (MoE recipes live in qwen3-moe).
  • Stronger agent interaction: improved tool-use and search-based agent performance.

2. Supported Variants

Model                     HF ID
Qwen3-0.6B                Qwen/Qwen3-0.6B
Qwen3-1.7B                Qwen/Qwen3-1.7B
Qwen3-4B                  Qwen/Qwen3-4B
Qwen3-4B-Instruct-2507    Qwen/Qwen3-4B-Instruct-2507
Qwen3-8B                  Qwen/Qwen3-8B
Qwen3-14B                 Qwen/Qwen3-14B
Qwen3-32B                 Qwen/Qwen3-32B

3. Environment Setup

3.1 Download model + datasets

hf download Qwen/Qwen3-4B --local-dir /root/Qwen3-4B
hf download --repo-type dataset zhuzilin/dapo-math-17k --local-dir /root/dapo-math-17k
hf download --repo-type dataset zhuzilin/aime-2024     --local-dir /root/aime-2024

3.2 HF → Megatron torch_dist conversion

cd /root/miles
source scripts/models/qwen3-4B.sh
PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
   ${MODEL_ARGS[@]} \
   --hf-checkpoint /root/Qwen3-4B \
   --save          /root/Qwen3-4B_torch_dist
The converter auto-derives PP from WORLD_SIZE; for larger sizes drive it with torchrun --nproc-per-node 8. The FSDP launcher loads the HF checkpoint directly and skips this step.
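For example, converting one of the larger checkpoints with multiple ranks could look like the sketch below; the qwen3-32B.sh config name and the local paths are assumptions that mirror the 4B flow above:
cd /root/miles
source scripts/models/qwen3-32B.sh
PYTHONPATH=/root/Megatron-LM torchrun --nproc-per-node 8 tools/convert_hf_to_torch_dist.py \
   ${MODEL_ARGS[@]} \
   --hf-checkpoint /root/Qwen3-32B \
   --save          /root/Qwen3-32B_torch_dist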

4. Launch

4.1 Quick start

cd /root/miles
bash scripts/run-qwen3-4B.sh
Other variants follow the same pattern: replace the launch script name (run-qwen3-32B.sh, run-qwen3-4B-fsdp.sh, etc.) and the matching qwen3-XB.sh model config. The Qwen3-4B-Instruct-2507 config (scripts/models/qwen3-4B-Instruct-2507.sh) simply sets MODEL_ARGS_ROTARY_BASE=5000000 and re-sources qwen3-4B.sh; source it when converting or launching the Instruct-2507 checkpoint.
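A hedged sketch of that Instruct-2507 flow, assuming the same download and conversion paths as section 3 (the local directory names are assumptions, not quoted from the recipe):
hf download Qwen/Qwen3-4B-Instruct-2507 --local-dir /root/Qwen3-4B-Instruct-2507
cd /root/miles
source scripts/models/qwen3-4B-Instruct-2507.sh   # sets MODEL_ARGS_ROTARY_BASE=5000000, then re-sources qwen3-4B.sh
PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
   ${MODEL_ARGS[@]} \
   --hf-checkpoint /root/Qwen3-4B-Instruct-2507 \
   --save          /root/Qwen3-4B-Instruct-2507_torch_dist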

5. Recipe Configuration

5.1 Parallelism

Script                   TP   PP   CP   EP   max_tokens_per_gpu   GPUs
run-qwen3-4B.sh          2    1    1    1    9216                 8 (1 × 8)
run-qwen3-4B_4xgpu.sh    2    1    1    1    9216                 4 (1 × 4)
run-qwen3-32B.sh         8    1    1    1    20480                8 (1 × 8)
--sequence-parallel is on whenever TP > 1.
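As an illustration, the run-qwen3-32B.sh row maps onto Megatron-style flags roughly as sketched below; the PERF_ARGS grouping and the --max-tokens-per-gpu spelling are assumptions, not quotes from the script:
PERF_ARGS=(
   --tensor-model-parallel-size 8
   --pipeline-model-parallel-size 1
   --context-parallel-size 1
   --expert-model-parallel-size 1
   --sequence-parallel              # on because TP > 1
   --max-tokens-per-gpu 20480
)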

5.2 Algorithm

GRPO baseline across all dense recipes:
GRPO_ARGS=(
   --advantage-estimator grpo
   --use-kl-loss
   --kl-loss-coef 0.00
   --kl-loss-type low_var_kl
   --entropy-coef 0.00
   --eps-clip 0.2
   --eps-clip-high 0.28
)
Rollout uses --rm-type deepscaler against dapo-math-17k. The SFT recipe (run-qwen3-4B-base-sft.sh) trains on /root/openhermes2_5.parquet.
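For reference, these clip bounds give the usual asymmetric (clip-higher) surrogate; a sketch of the objective, not code from the repo:
L(θ) = E[ min( r_t · A_t , clip(r_t, 1 − 0.2, 1 + 0.28) · A_t ) ]
so the importance ratio r_t is bounded to [0.8, 1.28], allowing larger upward than downward policy moves, while --use-kl-loss with --kl-loss-coef 0.00 keeps the KL term in the loss but weights it to zero.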

5.3 Rollout & SGLang

SGLANG_ARGS=(
   --rollout-num-gpus-per-engine 2
   --sglang-mem-fraction-static 0.7
)
run-qwen3-32B.sh additionally pins --sglang-cuda-graph-bs 1 2 4 8 $(seq 16 8 256). The FSDP variant uses --attn-implementation flash_attention_3 with the SGLang fa3 attention backend, and adds --update-weight-buffer-size 536870912 (512 MiB) --gradient-checkpointing.
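For context, $(seq 16 8 256) expands to 16 24 32 … 256 (step 8). A sketch of how the 32B rollout block could look, assuming it keeps the 4B values for the other two flags (only --sglang-cuda-graph-bs is stated for 32B):
SGLANG_ARGS=(
   --rollout-num-gpus-per-engine 2
   --sglang-mem-fraction-static 0.7
   --sglang-cuda-graph-bs 1 2 4 8 $(seq 16 8 256)   # CUDA graphs for batch sizes 1, 2, 4, 8, 16, 24, ..., 256
)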

5.4 Optimizer

run-qwen3-32B.sh enables CPU Adam:
--optimizer-cpu-offload
--overlap-cpu-optimizer-d2h-h2d
--use-precision-aware-optimizer
The 4B / 8B / 14B recipes leave Adam on GPU.
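A sketch of how those flags could be grouped in run-qwen3-32B.sh; the OPTIMIZER_ARGS name is an assumption, only the flags themselves come from the recipe:
OPTIMIZER_ARGS=(
   --optimizer-cpu-offload
   --overlap-cpu-optimizer-d2h-h2d
   --use-precision-aware-optimizer
)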

5.5 Notable quirks

  • BF16 train + FP8 inference: run-qwen3-4B.sh ships a commented --hf-checkpoint /root/Qwen3-4B-FP8 alternative; uncomment it (and download Qwen/Qwen3-4B-FP8) to swap rollout to FP8 while keeping BF16 training, as sketched after this list. See Low Precision RL.
  • FSDP backend: run-qwen3-4B-fsdp.sh runs the same recipe with --train-backend fsdp; no Megatron torch_dist conversion needed.
  • AMD ROCm: scripts/amd/run-qwen3-4B-amd.sh mirrors the recipe with ${NUM_GPUS} resolved from the AMD environment.
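A minimal sketch of the FP8 rollout swap from the first bullet above; the exact commented lines in run-qwen3-4B.sh are assumptions, only the flag and the Qwen/Qwen3-4B-FP8 repo are given:
hf download Qwen/Qwen3-4B-FP8 --local-dir /root/Qwen3-4B-FP8   # fetch the FP8 weights first
# --hf-checkpoint /root/Qwen3-4B       # default: BF16 rollout
--hf-checkpoint /root/Qwen3-4B-FP8     # uncommented: FP8 rollout, training stays BF16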

6. Pairs Well With