Documentation Index
Fetch the complete documentation index at: https://www.radixark.com/llms.txt
Use this file to discover all available pages before exploring further.
1. Model Introduction
GLM-5 is the most powerful language model in Zhipu AI's GLM series, scaling to 744 B parameters (40 B active) and integrating DeepSeek Sparse Attention (DSA) for long-context efficiency. GLM-5.1 is the next-generation model for agentic engineering built on top of GLM-5, sharing the same model architecture. Key highlights:
- Sparse MoE at frontier scale: 744 B total / 40 B active per token, 256 routed experts (top-8) + 1 shared expert.
- MLA + DSA attention: Multi-head Latent Attention (q-LoRA 2048 / kv-LoRA 512) combined with DeepSeek Sparse Attention to keep KV-cache cost low at long context.
- Speculative decoding: EAGLE/MTP rollout supported via --enable-mtp.
- PD disaggregation: prefill/decode disaggregation enabled by default for ≥ 1 node.
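The routed-expert setup above (256 experts, top-8 gating, plus one always-on shared expert) can be sketched as follows. This is an illustrative NumPy sketch of generic top-k MoE gating, not GLM-5's actual routing code; the softmax-over-selected-experts normalization is an assumption.

```python
# Top-8 routing over 256 experts, as in the highlights above.
# Each token's MoE output would be: shared_expert(x) + sum_k w_k * expert_{idx_k}(x).
import numpy as np

NUM_EXPERTS, TOP_K = 256, 8

def route_tokens(hidden, gate_w):
    """hidden: [tokens, d_model]; gate_w: [d_model, NUM_EXPERTS]."""
    logits = hidden @ gate_w                              # [tokens, 256]
    top_idx = np.argsort(logits, axis=-1)[:, -TOP_K:]     # top-8 expert ids per token
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    # softmax over the selected experts only (assumed normalization)
    weights = np.exp(top_logits - top_logits.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return top_idx, weights

rng = np.random.default_rng(0)
idx, w = route_tokens(rng.standard_normal((4, 16)),
                      rng.standard_normal((16, NUM_EXPERTS)))
```

The shared expert is dense (every token passes through it), so only the 8 routed experts per token contribute to the sparse 40 B active-parameter count.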
2. Supported Variants
| Model | Active / Total | HF ID |
|---|---|---|
| GLM-5.1 | 40 B / 744 B | zai-org/GLM-5.1 |
| GLM-5 | 40 B / 744 B | zai-org/GLM-5 |
3. Environment Setup
3.1 Download model + datasets
The Python launcher's prepare subcommand handles model download and dataset staging:
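A sketch of the workflow, assuming a Python entrypoint script (the script name here is hypothetical; only the prepare and prepare-cp subcommand names come from this page):

```shell
# Stage 1 (once, on a node with shared storage): download weights + datasets,
# then convert the HF checkpoint to Megatron torch_dist format.
python launch_glm5.py prepare

# Stage 2 (on every node): copy the converted checkpoint from shared NFS
# to local disk for faster loading.
python launch_glm5.py prepare-cp
```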
3.2 HF → Megatron torch_dist conversion
Also handled by prepare. The launcher patches config.json to set model_type=deepseek_v32 (_process_glm_checkpoint) before conversion — GLM-5 is loaded through the DeepseekV32 architecture path. Run prepare-cp afterwards on every node to copy the converted checkpoint from shared NFS to local disk.
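The config.json patch described above amounts to rewriting one field so the checkpoint loads through the DeepseekV32 architecture path. A minimal sketch of that step (not the launcher's actual _process_glm_checkpoint implementation; the path handling is simplified):

```python
# Rewrite model_type in a checkpoint's config.json so downstream loaders
# pick the DeepseekV32 architecture path, as the launcher does before
# HF -> torch_dist conversion.
import json
from pathlib import Path

def patch_model_type(ckpt_dir, model_type="deepseek_v32"):
    cfg_path = Path(ckpt_dir) / "config.json"
    cfg = json.loads(cfg_path.read_text())
    cfg["model_type"] = model_type
    cfg_path.write_text(json.dumps(cfg, indent=2))
    return cfg
```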
4. Launch
4.1 Quick start
Set --hardware to one of {H200, B200, GB300}.
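A hypothetical quick-start invocation (the entrypoint and subcommand names are assumptions; only the --hardware choices and the --num-nodes flag appear on this page):

```shell
# Launch training on 16 H200 nodes; the ≥16-node branch selects the
# parallelism layout shown in section 5.1.
python launch_glm5.py train --hardware H200 --num-nodes 16
```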
5. Recipe Configuration
5.1 Parallelism
Verbatim from _execute_train, --num-nodes ≥ 16 branch:
| TP | PP | CP | EP | expert-TP | decoder-last-pipeline-num-layers | max_tokens_per_gpu | GPUs |
|---|---|---|---|---|---|---|---|
| 4 | 4 | 2 | 32 | 1 | 18 | 16384 | ≥ 128 (≥ 16 × 8) |
Also enabled: --use-dynamic-batch-size, --data-pad-size-multiplier 4096, --log-probs-chunk-size 1024, and full recomputation (--recompute-granularity full --recompute-method uniform --recompute-num-layers 1).
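A quick consistency check of the table above, assuming the usual Megatron accounting where world size = TP × PP × CP × DP and expert parallelism shards experts across existing ranks rather than adding GPUs (an assumption about this launcher, not stated on the page):

```python
# Sanity-check the >=16-node parallelism layout.
TP, PP, CP, EP = 4, 4, 2, 32
NODES, GPUS_PER_NODE = 16, 8

world = NODES * GPUS_PER_NODE    # 128 GPUs at the 16-node minimum
model_parallel = TP * PP * CP    # 32 GPUs per model replica
DP = world // model_parallel     # remaining data-parallel degree
print(world, model_parallel, DP)
```

At the 16-node minimum this gives 128 GPUs, 32 GPUs per replica, and data-parallel degree 4; adding nodes beyond 16 grows only the data-parallel dimension.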
5.2 Algorithm
GRPO with --eps-clip 0.2 --eps-clip-high 0.28. R3 (--use-rollout-routing-replay) is not enabled by default.
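The two flags above define an asymmetric clip range for the importance ratio: [1 − 0.2, 1 + 0.28]. A simplified per-token sketch of that objective (loss sign and batch aggregation omitted; this is the standard PPO-style clipped surrogate, not this trainer's exact loss code):

```python
# Asymmetric clipped surrogate matching --eps-clip 0.2 / --eps-clip-high 0.28.
EPS_LOW, EPS_HIGH = 0.2, 0.28

def clipped_objective(ratio, advantage):
    """ratio: pi_new(a|s) / pi_old(a|s); advantage: group-normalized in GRPO."""
    clipped = min(max(ratio, 1 - EPS_LOW), 1 + EPS_HIGH)
    # pessimistic (min) of the unclipped and clipped surrogates
    return min(ratio * advantage, clipped * advantage)
```

The wider upper bound (0.28 vs 0.2) lets positive-advantage tokens push the ratio further upward before clipping kicks in.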
5.3 Rollout & SGLang
Always-on flags:
5.4 Optimizer
--enable-optimizer-offload adds --optimizer-cpu-offload --overlap-cpu-optimizer-d2h-h2d --use-precision-aware-optimizer (opt-in).
5.5 Notable quirks
The launcher exposes these as flags:
- --fp8-rollout — runs tools/convert_hf_to_fp8.py --strategy block --block-size 128 128 and feeds the FP8 directory to SGLang (Megatron stays BF16).
- --enable-mtp — adds SGLang EAGLE speculative decoding (--sglang-speculative-{algorithm,num-steps,eagle-topk,num-draft-tokens}).
- --enable-pd (default True for ≥ 1 node) — enables prefill/decode disaggregation; with PD the launcher uses larger SGLang world sizes (16 for < 16 nodes, 64 for ≥ 16 nodes).
- --use-deepep (default True) — enables Megatron-side DeepEP (--moe-enable-deepep --moe-token-dispatcher-type flex); falls back to alltoall. Forced off on GB300.
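The block strategy with block size 128 128 means one quantization scale per 128×128 weight tile. An illustrative NumPy sketch of that idea (not the convert_hf_to_fp8.py implementation; the FP8-E4M3 max of 448 is a standard value assumed here, and actual FP8 casting is omitted):

```python
# Block-wise scaling in the spirit of "--strategy block --block-size 128 128":
# one scale per 128x128 tile, values mapped into the FP8-E4M3 range.
import numpy as np

BLOCK, FP8_MAX = 128, 448.0

def block_quantize(w):
    """w: 2-D weight matrix with dims divisible by BLOCK."""
    scales = np.zeros((w.shape[0] // BLOCK, w.shape[1] // BLOCK))
    q = np.empty_like(w)
    for i in range(scales.shape[0]):
        for j in range(scales.shape[1]):
            tile = w[i*BLOCK:(i+1)*BLOCK, j*BLOCK:(j+1)*BLOCK]
            s = np.abs(tile).max() / FP8_MAX   # per-tile scale
            scales[i, j] = s
            q[i*BLOCK:(i+1)*BLOCK, j*BLOCK:(j+1)*BLOCK] = tile / s
    return q, scales
```

Dequantization multiplies each tile back by its scale, which is why only the scales (one float per 16 K weights) need to be stored in higher precision.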
6. Pairs Well With
- PD Disaggregation — on by default for num_nodes ≥ 1.
- Low Precision RL — opt-in via --fp8-rollout.
- Speculative Decoding — opt-in via --enable-mtp.

