## 1. Model Introduction

Qwen3.5 is the next iteration of the Qwen3 dense series, introducing a gated-attention architecture and an FP32-preserved `A_log` parameter.
Key highlights:
- Attention-output gate: a learned gate on the attention output, trained alongside attention weights for stronger long-context behavior.
- Extended rotary base: `--rotary-base 10000000` with `--rotary-percent 0.25`, giving a wider effective context than the original Qwen3.
- Larger vocabulary: 248320 tokens.
- FP32 `A_log` preservation: a parameter that must stay in FP32 through Megatron’s mixed-precision pipeline; miles handles this via the bridge.
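The attention-output gate above can be pictured with a small numpy sketch. This is illustrative only: the gate projection `w_gate` and its sigmoid placement on the layer input are assumptions, and the actual formulation lives in the miles spec plugin.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_attention_output(attn_out, hidden, w_gate):
    """Apply a learned sigmoid gate, projected from the layer input,
    to the attention output (gate placement is an illustrative assumption)."""
    gate = sigmoid(hidden @ w_gate)   # (seq, d_model), values in (0, 1)
    return gate * attn_out            # elementwise gating of the attention output

rng = np.random.default_rng(0)
seq, d = 4, 8
hidden = rng.standard_normal((seq, d))
attn_out = rng.standard_normal((seq, d))
w_gate = rng.standard_normal((d, d)) * 0.02  # small init: gate starts near 0.5
out = gated_attention_output(attn_out, hidden, w_gate)
print(out.shape)  # (4, 8)
```

Because the gate is bounded in (0, 1), the gated output can only shrink the attention output elementwise, which is one intuition for why such gates help stabilize long-context behavior.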
## 2. Supported Variants
| Model | HF ID |
|---|---|
| Qwen3.5-4B | Qwen/Qwen3.5-4B |
| Qwen3.5-9B | Qwen/Qwen3.5-9B |
| Qwen3.5-27B | Qwen/Qwen3.5-27B |
## 3. Environment Setup

### 3.1 Download model + datasets

### 3.2 HF → Megatron torch_dist conversion

## 4. Launch

### 4.1 Quick start

## 5. Recipe Configuration

### 5.1 Parallelism
| Script | TP | PP | CP | max_tokens_per_gpu | SGLang mem-fraction-static | CPU Adam | GPUs |
|---|---|---|---|---|---|---|---|
| `scripts/run-qwen3.5-4B.sh` | 2 | 1 | 1 | 9216 | 0.7 | – | 8 (1 × 8) |
| `scripts/run-qwen3.5-9B.sh` | 2 | 1 | 1 | 9216 | 0.6 | – | 8 (1 × 8) |
| `scripts/run-qwen3.5-27B.sh` | 4 | 1 | 1 | 8192 | 0.5 | ✓ | 8 (1 × 8) |
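The GPU counts in the table are consistent with data parallelism filling whatever TP × PP × CP leaves over. A quick sanity check (pure arithmetic, not miles code):

```python
def data_parallel_size(world_size, tp, pp, cp):
    """Data-parallel replicas left over after tensor/pipeline/context parallelism."""
    assert world_size % (tp * pp * cp) == 0, "parallel sizes must divide the world size"
    return world_size // (tp * pp * cp)

# Settings from the table above (8 GPUs per recipe).
print(data_parallel_size(8, tp=2, pp=1, cp=1))  # 4B / 9B recipes -> 4
print(data_parallel_size(8, tp=4, pp=1, cp=1))  # 27B recipe -> 2
```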
### 5.2 Algorithm
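GRPO scores each sampled completion against its own sampling group rather than a learned value function. A minimal sketch of the group-relative advantage (the standard GRPO formulation, not miles-specific code):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: z-score each reward within its group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# One prompt, a group of 4 sampled completions with scalar rewards.
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
print(adv)  # mean-zero; better-than-average completions get positive advantage
```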
All three sizes train with GRPO.

### 5.3 Rollout & SGLang
`--rollout-num-gpus-per-engine 1` is set because SGLang with TP > 1 produced garbage output for Qwen3.5 on version 0.5.9 (see the in-source comment referencing sglang#21039). If you bump the SGLang version past the fix, you can raise this value again.
### 5.4 Optimizer

Only the 27B script enables CPU Adam (`--optimizer-cpu-offload`, `--overlap-cpu-optimizer-d2h-h2d`, `--use-precision-aware-optimizer`). The 4B and 9B recipes leave Adam on GPU.
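Rough arithmetic for why only the largest recipe offloads Adam: assuming fp32 master weights plus fp32 Adam moments (about 12 bytes per parameter; actual precision-aware layouts vary), optimizer state alone dwarfs GPU memory at 27B parameters.

```python
def adam_state_gib(n_params, bytes_per_param=12):
    """fp32 master copy (4) + Adam m (4) + Adam v (4) = 12 bytes per parameter."""
    return n_params * bytes_per_param / 2**30

for billions in (4, 9, 27):
    gib = adam_state_gib(billions * 1e9)
    print(f"{billions}B params -> ~{gib:.0f} GiB of optimizer state")
```

Even divided across the data-parallel ranks of a single 8-GPU node, roughly 300 GiB of state for the 27B model does not fit comfortably on GPU, hence the CPU offload.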
### 5.5 Notable quirks

From `scripts/models/qwen3.5-4B.sh` (and the analogous configs for 9B / 27B):

- `--spec miles_plugins.models.qwen3_5 get_qwen3_5_spec`: attention-output gate and `A_log` parameter handling.
- `--rotary-base 10000000`, `--rotary-percent 0.25`.
- `--vocab-size 248320`.
- `--apply-layernorm-1p`, `--qk-layernorm`, `--group-query-attention`.
- `--attention-output-gate`.
- FP32 preservation of `A_log` through Megatron’s mixed-precision pipeline.
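The FP32 preservation can be pictured as a dtype cast that skips a keep-list. This numpy sketch is illustrative only: the parameter names and the `keep_fp32` list are assumptions, not the bridge's actual API.

```python
import numpy as np

def cast_params(params, target=np.float16, keep_fp32=("A_log",)):
    """Cast every parameter to the low-precision dtype, except parameters
    on the keep-list (here A_log, which Qwen3.5 requires in FP32)."""
    out = {}
    for name, tensor in params.items():
        if any(key in name for key in keep_fp32):
            out[name] = tensor.astype(np.float32)  # preserved in FP32
        else:
            out[name] = tensor.astype(target)      # normal mixed-precision path
    return out

params = {
    "layers.0.attn.qkv": np.zeros((8, 8), dtype=np.float32),
    "layers.0.A_log": np.zeros((8,), dtype=np.float32),
}
casted = cast_params(params)
print(casted["layers.0.attn.qkv"].dtype, casted["layers.0.A_log"].dtype)  # float16 float32
```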

