1. Model Introduction
Qwen3.6-35B-A3B is the sparse MoE branch of Alibaba’s Qwen3.6 line — 35 B total / 3 B active parameters on a Gated Delta Networks backbone. Like the dense Qwen3.6-27B, it’s tuned for agentic-coding workflows and long-session reasoning, with native hybrid thinking mode, built-in tool calling, and multimodal text / image / video input. Context reaches 262 K tokens and extends past 1 M; weights are Apache 2.0 in BF16 and FP8. Qwen3.6 also ships native Multi-Token Prediction for speculative decoding, which this recipe trains and serves via EAGLE. In miles, Qwen3.6-35B-A3B reuses the Qwen3.5 spec (miles_plugins.models.qwen3_5.get_qwen3_5_spec) and bakes in MTP training plus a shared-expert gate.
Key highlights:
- Sparse MoE on a GDN backbone: 256 experts, top-8 routing, 3 B active / 35 B total (see the routing sketch after this list).
- Attention-output gate: shared with the Qwen3.5 / 3.6 dense series.
- Shared expert + gate: `--moe-shared-expert-intermediate-size 512` `--moe-shared-expert-gate`.
- Multi-Token Prediction (MTP): `--mtp-num-layers 1`; trained alongside the policy and served via EAGLE at rollout.
- Dispatcher: `--moe-token-dispatcher-type alltoall` for HF→Megatron conversion; runtime uses `flex` (set in the launcher).
- Long context: 262 K tokens, extensible past 1 M.
- Single-node footprint: full recipe fits on 1 × 8 GPU (H200).
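The routing recipe above can be pictured in a few lines of PyTorch. This is an illustrative sketch only, not the Megatron implementation; `experts`, `shared_expert`, and the weight shapes are placeholder assumptions:

```python
import torch
import torch.nn.functional as F

def moe_forward(x, router_w, experts, shared_expert, shared_gate_w, top_k=8):
    """x: [tokens, hidden]; router_w: [hidden, num_experts];
    experts: list of per-expert MLPs; shared_gate_w: [hidden, 1]."""
    scores = F.softmax(x @ router_w, dim=-1)    # --moe-router-score-function softmax
    weights, idx = scores.topk(top_k, dim=-1)   # --moe-router-topk 8
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):        # plain loop for clarity; the real path uses grouped GEMM
        for k in range(top_k):
            hit = idx[:, k] == e
            if hit.any():
                out[hit] += weights[hit, k].unsqueeze(-1) * expert(x[hit])
    # The shared expert runs on every token, scaled by a learned sigmoid gate
    # (--moe-shared-expert-gate; its MLP uses intermediate size 512).
    gate = torch.sigmoid(x @ shared_gate_w)     # [tokens, 1]
    return out + gate * shared_expert(x)
```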
2. Supported Variants
| Model | Active / Total | HF ID |
|---|---|---|
| Qwen3.6-35B-A3B | 3 B / 35 B | Qwen/Qwen3.6-35B-A3B |
3. Environment Setup
3.1 Download model + datasets
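The weights can be fetched with `huggingface_hub` (a minimal sketch; the local path is arbitrary, and the datasets, which are recipe-specific, are downloaded the same way with `repo_type="dataset"` and their own repo IDs):

```python
from huggingface_hub import snapshot_download

# BF16 weights; the HF ID comes from the table in section 2.
snapshot_download(
    repo_id="Qwen/Qwen3.6-35B-A3B",
    local_dir="/root/models/Qwen3.6-35B-A3B",  # arbitrary local path
)
```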
3.2 HF → Megatron torch_dist conversion
Passing `--mtp-num-layers 1` during conversion preserves the MTP layer so it survives into Megatron format; a hypothetical invocation is sketched below.
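This sketch assumes a converter entry point named `tools/convert_hf_to_torch_dist.py`; the script name and the checkpoint paths are assumptions, while the MTP and dispatcher flags come from the recipe:

```python
import subprocess

subprocess.run([
    "python", "tools/convert_hf_to_torch_dist.py",      # assumed script name
    "--hf-checkpoint", "/root/models/Qwen3.6-35B-A3B",  # path from the download step
    "--save", "/root/megatron/Qwen3.6-35B-A3B",         # arbitrary output path
    "--mtp-num-layers", "1",                            # keep the MTP layer
    "--moe-token-dispatcher-type", "alltoall",          # conversion-time dispatcher
], check=True)
```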
4. Launch
4.1 Quick start
The launcher is a parametrized Typer script (8 × H200) that exercises arbitrary (TP, EP, CP, PP, ETP) cells. `--mode debug_minimal` runs on 8 GPUs with `max_tokens_per_gpu=8192`, `rollout_batch_size=8`, `n_samples_per_prompt=2`, `global_batch_size=16`, and `rollout_max_response_len=1024`. Override via flags for longer runs, as sketched below.
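A minimal way to drive the launcher from Python (the script path appears in section 5.5; the hyphenated option spellings are assumptions based on Typer's default naming):

```python
import subprocess

subprocess.run([
    "python", "scripts/run_qwen3_6_35b_a3b_mtp.py",
    "--mode", "debug_minimal",
    # Example overrides for a longer run (names assumed from the defaults above):
    # "--rollout-max-response-len", "4096",
    # "--global-batch-size", "64",
], check=True)
```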
5. Recipe Configuration
5.1 Parallelism
The default cell is TP=1 EP=8 CP=1 PP=1 ETP=1. Sequence parallelism is on; activation
checkpointing defaults on (`--recompute-granularity full --recompute-method uniform --recompute-num-layers 1`)
and can be turned off with `--no-recompute`. The sketch after the table shows how this cell maps onto one node.
| TP | PP | CP | EP | expert-TP | max_tokens_per_gpu | GPUs |
|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 8 | 1 | 8192 | 8 (1 × 8) |
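As a worked check of the cell above (plain arithmetic, not launcher code), EP=8 shards the 256 experts into 32 per GPU while the dense layers are replicated across all 8 data-parallel ranks:

```python
TP, PP, CP, EP, ETP = 1, 1, 1, 8, 1
GPUS, NUM_EXPERTS = 8, 256

model_parallel = TP * PP * CP   # 1: each GPU holds a full copy of the dense layers
DP = GPUS // model_parallel     # 8 data-parallel ranks; EP is laid over them
assert DP % EP == 0 and NUM_EXPERTS % EP == 0
print(f"{NUM_EXPERTS // EP} experts per rank")  # -> 32 experts per rank
```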
5.2 Algorithm
GRPO with low-variance KL plus MTP training; a sketch of the KL term follows.
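A minimal sketch of a low-variance KL estimator (the non-negative "k3" form; whether the recipe uses exactly this estimator is an assumption):

```python
import torch

def low_var_kl(logp: torch.Tensor, ref_logp: torch.Tensor) -> torch.Tensor:
    """Per-token KL(pi || pi_ref) estimate from policy and reference log-probs.
    Always >= 0 and unbiased; the naive logp - ref_logp sample has the right
    mean too, but higher variance, and individual samples can go negative."""
    log_ratio = ref_logp - logp
    return log_ratio.exp() - log_ratio - 1.0
```

5.3 Rollout & SGLang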
5.4 Optimizer
CPU Adam is enabled (`--optimizer-cpu-offload --overlap-cpu-optimizer-d2h-h2d --use-precision-aware-optimizer`).
5.5 Notable quirks
From `scripts/models/qwen3.6-35B-A3B.sh` and `scripts/run_qwen3_6_35b_a3b_mtp.py`:
- `--spec miles_plugins.models.qwen3_5 get_qwen3_5_spec` — Qwen3.6 reuses the Qwen3.5 spec.
- 256 experts, `--moe-router-topk 8`, `--moe-router-score-function softmax`.
- `--moe-shared-expert-gate` and `--moe-shared-expert-intermediate-size 512`.
- Megatron-side dispatcher overridden to `--moe-token-dispatcher-type flex` at runtime; conversion uses `alltoall`.
- `--moe-grouped-gemm`, `--moe-token-drop-policy probs`, `--moe-router-dtype fp32`, `--moe-permute-fusion`, `--moe-aux-loss-coeff 0`.
- `--attention-output-gate`, `--rotary-base 10000000`, `--rotary-percent 0.25`, `--vocab-size 248320` (a gate sketch follows this list).

