Documentation Index
Fetch the complete documentation index at: https://www.radixark.com/llms.txt
Use this file to discover all available pages before exploring further.
1. Model Introduction
GLM-4.7-Flash is a lightweight, high-speed MoE model in the GLM-4.7 series from Zhipu AI, designed for single-GPU-node deployment. Key highlights:
- Compact MoE architecture: 30 B total / 3 B active parameters, with sparse activation for efficient inference.
- MLA attention: Multi-head Latent Attention with q-LoRA rank 768 and kv-LoRA rank 512.
- MTP head + EAGLE speculative decoding: built-in `--mtp-num-layers 1` and EAGLE rollout enabled by default.
- R3 on by default: both miles launchers enable `--use-rollout-routing-replay` out of the box.
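The MLA design above can be sketched as a pair of low-rank projections. Only the two LoRA ranks (768 for queries, 512 for keys/values) come from this page; the hidden size, token count, and all variable names below are made-up illustration values, not the model's real dimensions:

```python
# Minimal sketch of MLA-style low-rank projection. Only q_lora_rank and
# kv_lora_rank come from the recipe; everything else is illustrative.
import numpy as np

hidden = 2048          # hypothetical hidden size, for illustration only
q_lora_rank = 768      # from the recipe: q-LoRA rank
kv_lora_rank = 512     # from the recipe: kv-LoRA rank

rng = np.random.default_rng(0)
x = rng.standard_normal((4, hidden))          # 4 tokens

# Queries: down-project into a low-rank latent, then up-project.
w_q_down = rng.standard_normal((hidden, q_lora_rank)) / np.sqrt(hidden)
w_q_up = rng.standard_normal((q_lora_rank, hidden)) / np.sqrt(q_lora_rank)
q = (x @ w_q_down) @ w_q_up

# Keys/values share one compressed latent; only this 512-dim latent
# needs to be cached, which is where the memory savings come from.
w_kv_down = rng.standard_normal((hidden, kv_lora_rank)) / np.sqrt(hidden)
kv_latent = x @ w_kv_down

print(q.shape, kv_latent.shape)               # (4, 2048) (4, 512)
```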
2. Supported Variants
| Model | Active / Total | HF ID |
|---|---|---|
| GLM-4.7-Flash | 3 B / 30 B | zai-org/GLM-4.7-Flash |
3. Environment Setup
3.1 Download model + datasets
Set `BASE_DIR=/root/shared`. The Python launcher downloads the `zhuzilin/dapo-math-17k` and `zhuzilin/aime-2024` datasets automatically.
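If you prefer to fetch everything manually, a hedged sketch using `huggingface_hub` is below. The directory layout under `BASE_DIR` and the helper names are illustrative assumptions, not the launcher's actual behavior:

```python
# Hedged sketch: manual download via huggingface_hub (the Python launcher
# normally does this for you). The layout under BASE_DIR is illustrative.
BASE_DIR = "/root/shared"

def local_dir(base_dir: str, repo_id: str, kind: str) -> str:
    # e.g. /root/shared/datasets/dapo-math-17k (assumed layout)
    return f"{base_dir}/{kind}/{repo_id.split('/')[1]}"

def fetch_all(base_dir: str = BASE_DIR) -> None:
    # Imported lazily so the module loads without the package installed.
    from huggingface_hub import snapshot_download

    # Model weights.
    snapshot_download("zai-org/GLM-4.7-Flash",
                      local_dir=local_dir(base_dir, "zai-org/GLM-4.7-Flash", "models"))
    # RL prompt set and eval set used by this recipe.
    for ds in ("zhuzilin/dapo-math-17k", "zhuzilin/aime-2024"):
        snapshot_download(ds, repo_type="dataset",
                          local_dir=local_dir(base_dir, ds, "datasets"))

# fetch_all()  # uncomment to download (requires network access)
```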
3.2 HF → Megatron torch_dist conversion
4. Launch
4.1 Quick start
Defaults (from `ScriptArgs`): `model_org=zai-org`, `model_name=GLM-4.7-Flash`, `num_gpus_per_node=8`, `hardware=H200`, `data_dir=/root/datasets`, `model_dir=/root/models`.
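The quick-start defaults can be mirrored as a small dataclass. This is a hypothetical stand-in that only encodes the values listed above; the launcher's real `ScriptArgs` may be shaped differently:

```python
# Hypothetical mirror of the quick-start defaults; the real ScriptArgs
# lives in the launcher and may differ in shape and field names.
from dataclasses import dataclass

@dataclass
class ScriptArgs:
    model_org: str = "zai-org"
    model_name: str = "GLM-4.7-Flash"
    num_gpus_per_node: int = 8
    hardware: str = "H200"
    data_dir: str = "/root/datasets"
    model_dir: str = "/root/models"

args = ScriptArgs()
print(f"{args.model_org}/{args.model_name} on {args.num_gpus_per_node}x {args.hardware}")
```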
5. Recipe Configuration
5.1 Parallelism
| TP | PP | CP | EP | expert-TP | max_tokens_per_gpu | GPUs |
|---|---|---|---|---|---|---|
| 4 | 1 | 1 | 8 | 1 | 32768 | 8 (1 × 8) |
`--rollout-num-gpus-per-engine 4` (the 20 attention heads must be divisible by TP, hence TP=4). The bash launcher's `SGLANG_ARGS` keeps `--sglang-enable-dp-attention` / `--sglang-dp-size` commented out; the in-source comment notes that DP-attention requires `tp_size % dp_size == 0`.
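The two divisibility constraints in this section can be checked in a few lines. The function names are illustrative, not part of any launcher API:

```python
# Sanity checks for the two constraints quoted above:
#   1. TP must evenly split the attention heads (20 for GLM-4.7-Flash).
#   2. DP-attention requires tp_size % dp_size == 0.
NUM_ATTENTION_HEADS = 20

def valid_tp_sizes(num_heads, max_tp=8):
    # TP sizes (up to one node's worth of GPUs) that split the heads evenly.
    return [tp for tp in range(1, max_tp + 1) if num_heads % tp == 0]

def dp_attention_ok(tp_size, dp_size):
    return tp_size % dp_size == 0

print(valid_tp_sizes(NUM_ATTENTION_HEADS))  # [1, 2, 4, 5]
print(dp_attention_ok(4, 2))                # True
print(dp_attention_ok(4, 3))                # False
```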
5.2 Algorithm
GRPO with `--eps-clip 0.2 --eps-clip-high 0.28 --use-kl-loss --kl-loss-coef 0.00`.
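These flags imply an asymmetric ("clip-higher") variant of the standard clipped surrogate, with a looser upper bound (0.28) than lower bound (0.2). A minimal sketch, assuming the usual PPO-style objective; the helper name is made up:

```python
# Sketch of the asymmetric clipped surrogate implied by
# --eps-clip 0.2 --eps-clip-high 0.28. Illustrative, not the trainer's code.
import numpy as np

EPS_LOW, EPS_HIGH = 0.2, 0.28

def clipped_surrogate(ratio, advantage):
    # Standard pessimistic min over unclipped and clipped terms,
    # but with different lower/upper clip bounds on the ratio.
    clipped = np.clip(ratio, 1.0 - EPS_LOW, 1.0 + EPS_HIGH)
    return np.minimum(ratio * advantage, clipped * advantage)

# A positive-advantage token keeps contributing until the ratio hits 1.28
# rather than 1.20; the wider upper bound preserves updates on good tokens.
print(clipped_surrogate(np.array([1.25, 1.5]), 1.0))   # [1.25 1.28]
```

Note that with `--kl-loss-coef 0.00` the KL loss term is wired up but contributes nothing to the objective.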
5.3 Rollout & SGLang
5.4 Optimizer
CPU Adam on.
5.5 Notable quirks
- Megatron-side DeepEP / `flex` dispatcher are commented out by default in this recipe.
- R3 (`--use-rollout-routing-replay`) is enabled by default, which is atypical for the rest of the model lineup.
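The R3 idea can be illustrated with a toy top-k router: expert assignments recorded during rollout are replayed in the training forward pass instead of being recomputed from the (by then slightly drifted) router. Everything below is a conceptual sketch, not the framework's implementation:

```python
# Conceptual sketch of Rollout Routing Replay (R3) with a toy top-k router.
# The real --use-rollout-routing-replay machinery lives in the trainer.
import numpy as np

def topk_experts(logits, k=2):
    # Indices of the k highest-scoring experts per token.
    return np.argsort(logits, axis=-1)[:, -k:]

rng = np.random.default_rng(0)
rollout_logits = rng.standard_normal((4, 8))        # 4 tokens, 8 experts

# 1. Rollout: record which experts each token was actually routed to.
replay_ids = topk_experts(rollout_logits)

# 2. Training: the router weights have drifted, but we dispatch to the
#    *recorded* experts so the training forward matches the rollout forward.
train_logits = rollout_logits + 0.1 * rng.standard_normal((4, 8))
train_ids = replay_ids                               # replay, don't recompute

assert train_ids.shape == (4, 2)
```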
6. Pairs Well With
- Rollout Routing Replay (R3) — already on by default.
- Low Precision RL

