Documentation Index
Fetch the complete documentation index at: https://www.radixark.com/llms.txt
Use this file to discover all available pages before exploring further.
1. Model Introduction
Qwen3 MoE is the Mixture-of-Experts branch of the Qwen3 series, available in two sizes: 30B-A3B (single-node) and 235B-A22B (multi-node). Key highlights:
- Sparse MoE architecture: 30B-total/3B-active and 235B-total/22B-active variants, scaling capacity without a proportional compute cost.
- Strong reasoning and coding: shares the Qwen3 generation’s improvements in instruction following, math, and tool usage.
- Long-context capability: 256K-token context inherited from the Qwen3 series.
- Flexible scaling: 30B fits a single 8-GPU node; 235B is the canonical multi-node target with FP8 rollout.
2. Supported Variants
| Model | Active / Total | HF ID |
|---|---|---|
| Qwen3-30B-A3B | 3B / 30B | Qwen/Qwen3-30B-A3B |
| Qwen3-235B-A22B | 22B / 235B | Qwen/Qwen3-235B-A22B |
3. Environment Setup
3.1 Required env vars
The 235B bash launcher requires:
3.2 Download model + datasets
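A hedged example of the model pull using the HF IDs from the variants table. `huggingface-cli` ships with the `huggingface_hub` package; the `$BASE_FOLDER` layout is an assumption based on the conversion step below, and the dataset downloads are not sketched here because their names are not listed in this section.

```shell
# Pull the 235B checkpoints into $BASE_FOLDER (directory layout assumed).
export BASE_FOLDER="${BASE_FOLDER:-/root/models}"

# BF16 checkpoint:
huggingface-cli download Qwen/Qwen3-235B-A22B \
  --local-dir "$BASE_FOLDER/Qwen3-235B-A22B"

# FP8 checkpoint (the 235B launcher's default, per 5.5 below):
huggingface-cli download Qwen/Qwen3-235B-A22B-FP8 \
  --local-dir "$BASE_FOLDER/Qwen3-235B-A22B-FP8"
```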
3.3 HF → Megatron torch_dist conversion
Convert the HF checkpoint to Megatron's torch_dist format, then pass $BASE_FOLDER/Qwen3-235B-A22B_torch_dist/ as --ref-load.
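A conversion sketch. The converter entry point and its flag names below are hypothetical placeholders (the real script ships with the training framework); only the output path and the `--ref-load` usage come from this recipe.

```shell
export BASE_FOLDER="${BASE_FOLDER:-/root/models}"

# Placeholder entry point and flags; substitute your framework's converter.
python tools/convert_hf_to_torch_dist.py \
  --hf-checkpoint "$BASE_FOLDER/Qwen3-235B-A22B" \
  --save "$BASE_FOLDER/Qwen3-235B-A22B_torch_dist"

# Training then points its reference policy at the converted weights:
#   --ref-load $BASE_FOLDER/Qwen3-235B-A22B_torch_dist/
```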
4. Launch
4.1 Quick start
4.2 Multi-node fan-out
run-qwen3-235B-A22B.sh itself fans out to the workers over SSH via /root/mpi_rack_hostfile; you only need the env vars set on the head node. The 30B launcher is single-node.
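A minimal sketch of the SSH fan-out pattern described above. It is illustrative only: the real launcher's worker command and hostfile contents are not reproduced, so this version prints the per-worker command instead of executing it, against a stand-in hostfile.

```shell
set -euo pipefail

# Stand-in for /root/mpi_rack_hostfile: one worker hostname per line.
HOSTFILE="$(mktemp)"
printf 'node-0\nnode-1\nnode-2\n' > "$HOSTFILE"

fan_out() {
  # Print (rather than execute) the SSH command for each worker host.
  local host
  while read -r host _; do
    [ -n "$host" ] && echo "ssh $host -- bash worker_launch.sh"
  done < "$HOSTFILE"
}

fan_out
```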
5. Recipe Configuration
5.1 Parallelism
| Script | Backend | TP | PP | CP | EP | expert-TP | max_tokens_per_gpu | GPUs |
|---|---|---|---|---|---|---|---|---|
| run_qwen3_30b_a3b.py (H100, 1 node) | Megatron | 4 | 1 | 1 | 8 | 1 | 32768 | 8 (1 × 8) |
| run-qwen3-235B-A22B.sh | Megatron | 4 | 4 | 2 | 16 | 1 | 16384 | 64 (8 × 8) |
| run-qwen3-235B-A22B-sft.sh | Megatron | 4 | 1 | 1 | 32 | 1 | 9216 | 32 (4 × 8) |
run-qwen3-235B-A22B.sh sets --decoder-last-pipeline-num-layers 22 to balance the layer count across PP=4.
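The layer-balancing arithmetic can be sanity-checked as below, assuming Qwen3-235B-A22B has 94 decoder layers (an assumption, not stated in this recipe): with the last stage pinned to 22 layers, the remainder must divide evenly across the first PP-1 stages.

```shell
set -euo pipefail

# Assumption: 94 decoder layers in Qwen3-235B-A22B.
TOTAL_LAYERS=94 PP=4 LAST_STAGE=22

REST=$(( TOTAL_LAYERS - LAST_STAGE ))   # layers left for the first PP-1 stages
PER_STAGE=$(( REST / (PP - 1) ))        # layers per early stage
[ $(( REST % (PP - 1) )) -eq 0 ]        # must divide evenly, else the split fails

echo "pipeline stages: ${PER_STAGE} ${PER_STAGE} ${PER_STAGE} ${LAST_STAGE}"
```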
5.2 Algorithm
- 30B Python launcher: GRPO with --eps-clip 0.2 --eps-clip-high 0.28.
- 235B bash launcher: GSPO (--advantage-estimator gspo, --eps-clip 4e-4); --use-kl-loss is commented out.
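For reference, the two clipping setups side by side, as a config fragment restating the flags above (the variable names are illustrative, not ones the launchers define):

```shell
# 30B launcher: GRPO with asymmetric PPO-style clipping.
GRPO_ARGS="--eps-clip 0.2 --eps-clip-high 0.28"

# 235B launcher: GSPO, whose sequence-level ratio uses a far tighter clip.
GSPO_ARGS="--advantage-estimator gspo --eps-clip 4e-4"
```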
5.3 Rollout & SGLang
run_qwen3_30b_a3b.py (H100, 1 node, BF16 rollout):
run-qwen3-235B-A22B.sh:
5.4 Optimizer
Both run_qwen3_30b_a3b.py (H100, 1 node) and run-qwen3-235B-A22B.sh enable CPU Adam:
run_qwen3_30b_a3b.py removes them when running on Blackwell (B200/B300/GB200/GB300) per the hardware match in the launcher.
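The hardware match can be sketched as below. The flag name is a placeholder, not the launcher's actual CPU Adam flags, and the real launcher does this in Python; only the Blackwell part names come from the text above.

```shell
# Return the CPU Adam flags for a GPU name, dropping them on Blackwell parts.
select_cpu_adam_args() {
  case "$1" in
    *B200*|*B300*|*GB200*|*GB300*) echo "" ;;   # Blackwell: no CPU Adam
    *) echo "--optimizer-cpu-offload" ;;        # placeholder flag name
  esac
}

echo "H100  -> $(select_cpu_adam_args 'NVIDIA H100')"
echo "GB200 -> $(select_cpu_adam_args 'NVIDIA GB200')"
```

Note that the `*B200*` pattern also matches `GB200`; both land in the same Blackwell branch.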
5.5 Notable quirks
- 30B Python launcher supports FP8 / MXFP8 / INT4 rollout, Blackwell hardware, Megatron-bridge mode, and MIS via Typer flags.
- 235B defaults to the FP8 HF checkpoint; the BF16 directory is available as a commented alternative in CKPT_ARGS.
- R3 is not on by default; opt in via run_qwen3_30b_a3b.py --enable-mis (TIS / RS) for routing-stability experiments.

