1. Model Introduction
Moonlight is Moonshot AI's compact MoE (16 B total / 3 B active, trained with the Muon optimizer) and a useful single-node test target for MoE RL code changes before scaling to Kimi K2. Key highlights:
- Compact MoE: 16 B total / 3 B active; 27 layers (1 dense + 26 MoE); 64 routed experts with top-6 routing plus 2 shared experts.
- MLA attention: Multi-head Latent Attention with kv-LoRA rank 512.
- Single-node footprint: the full RL recipe fits on 1 × 8 H100.
- Muon-trained base: pretrained with the Muon optimizer; weight decay matters at scale.
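
The shape above can be sanity-checked directly from the HF config. A minimal sketch, assuming Moonlight exposes DeepSeek-V3-style config fields via trust_remote_code (the field names are assumptions drawn from that config family):

```bash
python - <<'EOF'
from transformers import AutoConfig

# Assumption: Moonlight ships a DeepSeek-V3-style config with these field names.
cfg = AutoConfig.from_pretrained("moonshotai/Moonlight-16B-A3B", trust_remote_code=True)
for key in ("num_hidden_layers", "first_k_dense_replace", "n_routed_experts",
            "num_experts_per_tok", "n_shared_experts", "kv_lora_rank"):
    print(key, "=", getattr(cfg, key, "<field not present>"))
EOF
```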
2. Supported Variants
| Model | Active / Total | HF ID |
|---|---|---|
| Moonlight-16B-A3B | 3 B / 16 B | moonshotai/Moonlight-16B-A3B |
3. Environment Setup
3.1 Download model + datasets
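A minimal sketch of the download step using huggingface-cli. The local paths are placeholders, and the dataset ID is left as a placeholder too; substitute whatever this recipe actually trains on:

```bash
# Model weights (HF ID from the table above).
huggingface-cli download moonshotai/Moonlight-16B-A3B \
    --local-dir /root/models/Moonlight-16B-A3B

# RL prompt dataset: placeholder ID, substitute the recipe's actual dataset.
huggingface-cli download <org>/<rl-dataset> --repo-type dataset \
    --local-dir /root/data/<rl-dataset>
```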
3.2 HF → Megatron torch_dist conversion
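A hedged sketch of the conversion step. The script path and flag names below are assumptions (converter entry points and argument names vary by framework); check the repo's tooling for the real invocation:

```bash
# Hypothetical converter invocation; script name and flags are assumptions.
python tools/convert_hf_to_torch_dist.py \
    --hf-checkpoint /root/models/Moonlight-16B-A3B \
    --save /root/models/Moonlight-16B-A3B_torch_dist
```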
4. Launch
4.1 Quick start
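A minimal launch sketch, assuming the recipe ships a single-node script (the path is a placeholder for the recipe's actual launcher):

```bash
# 1 node × 8 H100; script path is a placeholder.
bash scripts/run-moonlight-16b-a3b.sh
```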
5. Recipe Configuration
5.1 Parallelism
| TP | PP | CP | EP | expert-TP | max_tokens_per_gpu | GPUs |
|---|---|---|---|---|---|---|
| 4 | 1 | 1 | 8 | 1 | 8192 | 8 (1 × 8) |
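
Expressed as Megatron-style CLI flags, the row above corresponds roughly to the sketch below (--max-tokens-per-gpu is assumed to be a framework-level flag rather than a Megatron one):

```bash
PERF_ARGS=(
   --tensor-model-parallel-size 4
   --pipeline-model-parallel-size 1
   --context-parallel-size 1
   --expert-model-parallel-size 8
   --expert-tensor-parallel-size 1
   --max-tokens-per-gpu 8192
)
```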
5.2 Algorithm
GRPO with --eps-clip 0.2 --eps-clip-high 0.28 --use-kl-loss --kl-loss-coef 0.00 (the KL term is enabled but its coefficient is zero, so it contributes nothing to the loss). R3 is not enabled.
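
Reading --eps-clip and --eps-clip-high as the lower and upper clipping bounds (the decoupled "clip-higher" scheme), the per-token surrogate with the values above works out to:

$$
L_t = \min\!\Big(r_t A_t,\; \operatorname{clip}\big(r_t,\, 1-0.2,\, 1+0.28\big)\, A_t\Big),
\qquad
r_t = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
$$

The usual motivation for the wider upper bound is to let low-probability tokens with positive advantage grow faster, keeping rollouts exploratory.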
5.3 Rollout & SGLang
Rollout uses DeepEP MoE communication with the flex token dispatcher: --moe-enable-deepep --moe-token-dispatcher-type flex.
5.4 Optimizer
CPU Adam is enabled.
5.5 Notable quirks
--attention-backend flash is commented out in this script (script comment: “need to comment this when using model with MLA”).

