Skip to main content

Documentation Index

Fetch the complete documentation index at: https://www.radixark.com/llms.txt

Use this file to discover all available pages before exploring further.

1. Model Introduction

GLM-4.5 is Zhipu AI’s flagship MoE language model with advanced capabilities in reasoning, function calling, and multi-modal understanding. Key highlights:
  • Sparse MoE architecture: 355 B / 32 B-active for frontier runs and 106 B / 12 B-active for two-node experimentation.
  • Strong reasoning: built-in step-by-step reasoning, with FP8 rollout supported on Blackwell hardware.
  • Speculative decoding: EAGLE/MTP rollout supported by the bash launcher; the Python launcher exposes --enable-mtp.
  • R3 / MIS opt-in: routing-stability extensions available behind a flag (--enable-mis) on the Python launcher.

2. Supported Variants

ModelActive / TotalHF ID
GLM-4.5-355B-A32B32 B / 355 Bzai-org/GLM-4.5
GLM-4.5-Air (106B-A12B)12 B / 106 Bzai-org/GLM-4.5-Air
The 106B-A12B variant has no launcher under scripts/; the canonical recipe is examples/p2p_weight_transfer/GLM-4.5-Air.sh (8-node, P2P weight transfer).

3. Environment Setup

3.1 Required env vars

The bash launcher (run-glm4.5-355B-A32B.sh) requires:
export BASE_DIR=<shared FS path, reachable from every node>
# so it comes from the cluster orchestrator.
The Python launcher (run_glm45_355b_a32b.py) reads no env vars — pass options via the Typer CLI.

3.2 Download model + datasets

hf download zai-org/GLM-4.5 --local-dir $BASE_DIR/GLM-4.5-355B-A32B
hf download --repo-type dataset zhuzilin/dapo-math-17k --local-dir $BASE_DIR/dapo-math-17k
hf download --repo-type dataset zhuzilin/aime-2024     --local-dir $BASE_DIR/rl_data/aime-2024

3.3 HF → Megatron torch_dist conversion

The bash launcher does not convert for you — produce $BASE_DIR/GLM-4.5-355B-A32B_torch_dist/ ahead of time:
cd /root/miles
source scripts/models/glm4.5-355B-A32B.sh
PYTHONPATH=/root/Megatron-LM torchrun --nproc-per-node 8 \
   tools/convert_hf_to_torch_dist.py \
   ${MODEL_ARGS[@]} \
   --hf-checkpoint $BASE_DIR/GLM-4.5-355B-A32B \
   --save          $BASE_DIR/GLM-4.5-355B-A32B_torch_dist
The Python launcher automates the full flow (download → optional tools/convert_hf_to_fp8.pyconvert_checkpointrsync to model_local_dir → submit).

4. Launch

4.1 Quick start

# Bash launcher (8 nodes × 8 GPU)
cd /root/miles
export BASE_DIR=...
bash scripts/run-glm4.5-355B-A32B.sh

# Python launcher (Blackwell hardware only — _execute_train asserts hardware != "H100")
python scripts/run_glm45_355b_a32b.py train --hardware GB300

4.2 Multi-node fan-out

run-glm4.5-355B-A32B.sh performs Ray fan-out internally via the ssh loop over /root/mpi_rack_hostfile.

5. Recipe Configuration

5.1 Parallelism

SourceTPPPCPEPexpert-TPmax_tokens_per_gpuGPUs
run-glm4.5-355B-A32B.sh8421611638464 (8 × 8)
run_glm45_355b_a32b.py (num_nodes ≤ 4, debug)4114116384≤ 32 (≤ 4 × 8)
run_glm45_355b_a32b.py (num_nodes == 8)482811638464 (8 × 8)

5.2 Algorithm

SourceAdvantageNotable flags
run-glm4.5-355B-A32B.shGSPO--eps-clip 1e-4 --eps-clip-high 2e-4 --use-tis
run_glm45_355b_a32b.pyGRPO--eps-clip 1e-4 --eps-clip-high 2e-4 --use-tis
Neither launcher enables --use-rollout-routing-replay by default. The Python launcher exposes --enable-mis (TIS/RS config) as an opt-in.

5.3 Rollout & SGLang

SGLANG_ARGS=(
   --rollout-num-gpus-per-engine 32
   --sglang-mem-fraction-static 0.7
   --sglang-enable-dp-attention
   --sglang-dp-size 4
   --sglang-ep-size 32
   --sglang-enable-dp-lm-head
   --sglang-moe-dense-tp-size 1

   # mtp / EAGLE
   --sglang-speculative-algorithm EAGLE
   --sglang-speculative-num-steps 1
   --sglang-speculative-eagle-topk 1
   --sglang-speculative-num-draft-tokens 2
   --sglang-enable-draft-weights-cpu-backup
)
Megatron side: --moe-token-dispatcher-type flex, --moe-enable-deepep.

5.4 Optimizer

CPU Adam on:
--optimizer-cpu-offload
--overlap-cpu-optimizer-d2h-h2d
--use-precision-aware-optimizer

5.5 Notable quirks

  • The bash launcher does not set --load/--save in CKPT_ARGS--load defaults to the value of --ref-load.
  • run_glm45_355b_a32b.py is Blackwell-only: _execute_train asserts args.hardware != "H100".

6. Pairs Well With