1. Model Introduction
Qwen3.5-35B-A3B is the MoE branch of the Qwen3.5 line (3 B active / 35 B total), combining the gated-attention architecture with a built-in MTP head. Key highlights:
- Sparse MoE: 3 B active out of 35 B total parameters.
- Attention-output gate: shared with the Qwen3.5 dense series, with `A_log` kept in FP32.
- Multi-Token Prediction (MTP): `--mtp-num-layers 1` is baked into the model config; the recipe trains the MTP head and uses EAGLE speculative decoding at rollout.
- Single-node footprint: the full recipe fits on 1 × 8 GPUs.
2. Supported Variants
| Model | Active / Total | HF ID |
|---|---|---|
| Qwen3.5-35B-A3B | 3 B / 35 B | Qwen/Qwen3.5-35B-A3B |
3. Environment Setup
3.1 Download model + datasets
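A minimal sketch of the download step, assuming the standard Hugging Face CLI; the local paths and the dataset placeholder are illustrative, not part of the recipe:

```bash
# Download the HF checkpoint (paths are illustrative placeholders).
huggingface-cli download Qwen/Qwen3.5-35B-A3B --local-dir /root/Qwen3.5-35B-A3B

# Datasets are fetched the same way; substitute the dataset IDs the recipe uses.
# huggingface-cli download <dataset-id> --repo-type dataset --local-dir /root/data/<dataset-id>
```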
3.2 HF → Megatron torch_dist conversion
Passing `--mtp-num-layers 1` during conversion preserves the MTP layer so it survives into the Megatron format.
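A hedged sketch of the conversion call; the converter script name, its checkpoint flags, and the paths are assumptions, and only `--mtp-num-layers 1` is taken from the recipe:

```bash
# Hypothetical HF -> Megatron torch_dist conversion; script name and paths are
# placeholders, consult the repo's conversion docs for the real entry point.
python tools/convert_hf_to_torch_dist.py \
    --hf-checkpoint /root/Qwen3.5-35B-A3B \
    --save /root/Qwen3.5-35B-A3B_torch_dist \
    --mtp-num-layers 1   # keep the MTP layer so it survives into Megatron format
```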
4. Launch
4.1 Quick start
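Assuming the recipe ships a single-node launch script alongside the model config under `scripts/` (the script name below is a guess, not taken from the repo):

```bash
# Hypothetical quick-start launch on one 8-GPU node.
bash scripts/run-qwen3.5-35B-A3B.sh
```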
5. Recipe Configuration
5.1 Parallelism
| TP | PP | CP | EP | expert-TP | max_tokens_per_gpu | GPUs |
|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 8 | 1 | 8192 | 8 (1 × 8) |
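The table maps onto Megatron-style parallelism flags roughly as below; the flag spellings and the array name are assumptions, the recipe script is authoritative:

```bash
# Parallelism layout from the table above, expressed as Megatron-style flags.
PARALLEL_ARGS=(
   --tensor-model-parallel-size 1      # TP
   --pipeline-model-parallel-size 1    # PP
   --context-parallel-size 1           # CP
   --expert-model-parallel-size 8      # EP
   --expert-tensor-parallel-size 1     # expert-TP
   --max-tokens-per-gpu 8192           # per-GPU token budget
)
```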
5.2 Algorithm
GRPO with `--eps-clip 0.2 --eps-clip-high 0.28 --use-kl-loss --kl-loss-coef 0.00`, plus MTP training (see the sketch below).
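A hedged sketch of these algorithm flags as they might be grouped in a launch script; only the flags named above and `--mtp-num-layers 1` from Section 1 come from the recipe, and the array name is illustrative:

```bash
# GRPO clipping / KL settings plus the MTP head.
ALGO_ARGS=(
   --eps-clip 0.2            # lower clip range
   --eps-clip-high 0.28      # asymmetric upper clip range
   --use-kl-loss
   --kl-loss-coef 0.00       # KL term present in the loss but zero-weighted
   --mtp-num-layers 1        # train the built-in MTP head (Section 1)
)
```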
5.3 Rollout & SGLang
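Since rollout uses EAGLE speculative decoding with the trained MTP head (Section 1), the SGLang side would typically enable speculative decoding; the flags and values below are an assumed sketch, not the recipe's exact settings:

```bash
# Assumed SGLang speculative-decoding settings for EAGLE-style rollout.
# The recipe's MTP head presumably provides the draft, so no separate
# draft-model path is shown; the numeric values are illustrative.
SGLANG_SPEC_ARGS=(
   --speculative-algorithm EAGLE
   --speculative-num-steps 3
   --speculative-eagle-topk 4
   --speculative-num-draft-tokens 8
)
```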
5.4 Optimizer
CPU Adam is enabled (`--optimizer-cpu-offload --overlap-cpu-optimizer-d2h-h2d --use-precision-aware-optimizer`).
5.5 Notable quirks
- The Megatron side uses `--moe-token-dispatcher-type flex`; DeepEP isn't enabled here, unlike Qwen3-Next.
- The model config (`scripts/models/qwen3.5-35B-A3B.sh`) reuses the Qwen3.5 spec: `--attention-output-gate`, `--rotary-base 10000000`, `--rotary-percent 0.25`, and `A_log` kept in FP32 via the bridge (see the sketch after this list). See Backends Beyond Megatron.
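A hedged sketch of what the model-config script might contain, collecting only the flags named in this recipe; the array name and ordering are assumptions, and the real script sets more than this:

```bash
# scripts/models/qwen3.5-35B-A3B.sh (sketch): flags gathered from this recipe.
MODEL_ARGS=(
   --attention-output-gate             # shared with the Qwen3.5 dense series
   --rotary-base 10000000
   --rotary-percent 0.25
   --mtp-num-layers 1                  # built-in MTP head
   --moe-token-dispatcher-type flex    # DeepEP not enabled for this recipe
)
# A_log is kept in FP32 via the bridge rather than through a launch flag.
```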

