Documentation Index
Fetch the complete documentation index at: https://www.radixark.com/llms.txt
Use this file to discover all available pages before exploring further.
1. Model Introduction
Qwen3-Next is Alibaba's next-generation Qwen architecture, swapping classical attention for a hybrid Gated DeltaNet + Full Attention design. Key highlights:
- Hybrid Attention: combines Gated DeltaNet (linear attention) with Full Attention to handle context lengths up to 262K tokens efficiently.
- Highly Sparse MoE: 80 B total / 3 B active per token — drastically reduces FLOPs per token without sacrificing model capacity.
- Multi-Token Prediction (MTP): built-in MTP layer enables EAGLE-style speculative rollout out of the box.
- HuggingFace-wrapped Megatron backend: miles loads the Qwen/Qwen3-Next-80B-A3B HF module as a Megatron stage without re-implementing GDN from scratch.
2. Supported Variants
| Model | Active / Total | HF ID |
|---|---|---|
| Qwen3-Next-80B-A3B-Thinking | 3 B / 80 B | Qwen/Qwen3-Next-80B-A3B-Thinking |
3. Environment Setup
3.1 Required env vars
All three launch scripts (run-qwen3-next-80B-A3B.sh, run-qwen3-next-80B-A3B-8gpus.sh, run-qwen3-next-80B-A3B-fsdp.sh) hard-fail if the required environment variables (e.g. BASE_FOLDER, MASTER_ADDR) are not set.
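A minimal env sketch; only BASE_FOLDER and MASTER_ADDR are named on this page, and the scripts may require more than these. The paths and address below are placeholders:

```shell
# Sketch only -- BASE_FOLDER and MASTER_ADDR are the variables mentioned in
# this doc; the launch scripts may check additional ones.
export BASE_FOLDER=/path/to/workspace   # checkpoints, datasets, logs
export MASTER_ADDR=10.0.0.1             # head-node address for multi-node runs
```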
3.2 Download model + datasets
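The exact download commands are not reproduced on this page; a minimal sketch using the standard `huggingface-cli` tool, with the HF ID taken from the variants table above (the target directory under BASE_FOLDER is an assumption):

```shell
# Download the model weights from HuggingFace.
# The --local-dir layout below is assumed, not prescribed by the scripts.
huggingface-cli download Qwen/Qwen3-Next-80B-A3B-Thinking \
  --local-dir "$BASE_FOLDER/Qwen3-Next-80B-A3B-Thinking"
```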
3.3 HF → Megatron torch_dist conversion
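The conversion command is not shown here; a hypothetical sketch of what the step looks like. The converter entry point (`convert_hf_to_torch_dist.py`) and both paths are assumptions and may differ in miles:

```shell
# Hypothetical invocation -- the actual converter script in miles may have a
# different name and flags. The step turns the HF checkpoint into Megatron's
# torch_dist format so the training scripts can load it as a Megatron stage.
python tools/convert_hf_to_torch_dist.py \
  --hf-checkpoint "$BASE_FOLDER/Qwen3-Next-80B-A3B-Thinking" \
  --save "$BASE_FOLDER/Qwen3-Next-80B-A3B-Thinking_torch_dist"
```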
4. Launch
4.1 Quick start
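No launch command is shown on this page; assuming the env vars from 3.1 and the torch_dist checkpoint from 3.3 are in place, the single-node 8-GPU variant would be started as:

```shell
# Single-node quick start on 8 GPUs (script name from section 3.1,
# scripts/ prefix from the parallelism table below).
bash scripts/run-qwen3-next-80B-A3B-8gpus.sh
```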
4.2 Multi-node fan-out
run-qwen3-next-80B-A3B.sh performs its ssh fan-out internally: set BASE_FOLDER and MASTER_ADDR on the head node, and the launcher reaches out to the workers itself. The 8-GPU and FSDP variants are single-node.
5. Recipe Configuration
5.1 Parallelism
| Script | Backend | TP | PP | CP | EP | expert-TP | max_tokens_per_gpu | GPUs |
|---|---|---|---|---|---|---|---|---|
| scripts/run-qwen3-next-80B-A3B.sh | Megatron | 2 | 4 | 2 | 8 | 1 | 8192 | 32 (4 × 8) |
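A quick sanity check of the layout in the table: TP, PP, and CP are set explicitly, while the data-parallel degree is implied by the world size rather than set in the script (DP = 2 here is an inference, 32 / (2 × 4 × 2)):

```shell
# Sanity-check the Megatron parallelism layout for run-qwen3-next-80B-A3B.sh.
# TP/PP/CP come from the table; DP=2 is inferred from the 32-GPU world size.
TP=2; PP=4; CP=2; DP=2
WORLD=$((TP * PP * CP * DP))
echo "world size: $WORLD GPUs"   # 4 nodes x 8 GPUs each
```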
5.2 Algorithm
All three scripts use GSPO (`--advantage-estimator gspo --eps-clip 4e-4`); `--use-kl-loss` is commented out.
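For reference, GSPO clips a sequence-level importance ratio rather than per-token ratios; a sketch of the objective following the GSPO paper, with the clip range corresponding to `--eps-clip` (here ε = 4e-4):

```latex
s_i(\theta) = \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)} \right)^{1/|y_i|},
\qquad
\mathcal{J}(\theta) = \mathbb{E}\!\left[ \frac{1}{G}\sum_{i=1}^{G}
\min\!\big( s_i(\theta)\,\hat{A}_i,\ \operatorname{clip}(s_i(\theta),\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}_i \big) \right]
```

The length-normalized exponent $1/|y_i|$ keeps the ratio comparable across sequences of different lengths, which is why the clip range can be as tight as 4e-4.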
5.3 Rollout & SGLang
The canonical script enables EAGLE speculative rollout: `--rollout-num-gpus-per-engine 2 --rollout-num-gpus 2 --sglang-mem-fraction-static 0.8 --sglang-ep-size 1`.
5.4 Optimizer
The Megatron variants enable CPU Adam.
5.5 Notable quirks
- Gated DeltaNet (GDN) is loaded via the HuggingFace bridge; miles doesn’t re-implement GDN in Megatron native code.

