1. Model Introduction
NVIDIA Nemotron-3-Super-120B-A12B-FP8 is the Super-tier sibling of Nemotron-3-Nano: the same nemotron_h block pattern (interleaved Mamba and attention blocks, no RoPE, squared-relu FFNs) scaled to 120 B total / 12 B active parameters with a sparse MoE FFN, and shipped as an FP8-native checkpoint.
miles loads it through the megatron.bridge AutoBridge with the shared
NemotronH MoE bridge shim (miles_plugins/megatron_bridge/nemotron_h.py)
that wires routed_scaling_factor, n_group, and topk_group onto the
Megatron provider — without the shim the routed output is silently scaled 1.0×,
the same drift class that affects the Nano-MoE recipe.
Key highlights:
- Hybrid + MoE: Mamba + attention + sparse MoE in the nemotron_h family.
- FP8 native: weights ship in FP8; load + train without an offline upcast.
- Sigmoid routing with per-token group selection, aux-free expert-bias load balancing.
- Bridge-mode load: --megatron-to-hf-mode bridge, no torch_dist conversion step (see the flag sketch after this list).
- No RoPE: --position-embedding-type none.
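The load-path highlights map onto two launch flags. A minimal sketch; the flag values come from this page, while the array name and how it is spliced into the launch command are illustrative:
```bash
# Load-path flags implied by the highlights above.
LOAD_ARGS=(
  --megatron-to-hf-mode bridge    # bridge-mode load: no torch_dist conversion step
  --position-embedding-type none  # no RoPE in the hybrid Mamba+attention stack
)
```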
2. Supported Variants
| Model | Active / Total | HF ID |
|---|---|---|
| Nemotron-3-Super-120B-A12B-FP8 | 12 B / 120 B | nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 |
3. Environment Setup
3.1 Download model + datasets
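A minimal sketch of the download step, assuming huggingface-cli; the local directories are placeholders, and the dataset repo is not named on this page:
```bash
# Fetch the FP8 HF checkpoint (HF ID from the variants table above).
huggingface-cli download nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
  --local-dir /models/Nemotron-3-Super-120B-A12B-FP8

# Datasets are recipe-specific; substitute the repo you train on.
# huggingface-cli download <dataset-repo> --repo-type dataset --local-dir /data/<name>
```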
3.2 No torch_dist conversion
AutoBridge + the NemotronH MoE shim load the FP8 HF checkpoint directly. Both
--hf-checkpoint and --ref-load point at the HF directory:
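A minimal sketch of the checkpoint arguments, assuming a placeholder path for the downloaded HF directory:
```bash
HF_CKPT=/models/Nemotron-3-Super-120B-A12B-FP8   # placeholder local path
# Both flags point at the same HF directory; no torch_dist conversion is run.
CKPT_ARGS=(
  --hf-checkpoint "$HF_CKPT"
  --ref-load "$HF_CKPT"
  --megatron-to-hf-mode bridge
)
```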
4. Launch
4.1 Quick start
The quick-start cell is TP=4 PP=2 EP=8 on 16 GPUs (2 × 8).
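Expanded into the standard Megatron-LM parallelism flags, a sketch (the run script is assumed to forward these unchanged):
```bash
# Default quick-start cell: TP=4, PP=2, EP=8 across 16 GPUs (2 nodes x 8).
PARALLEL_ARGS=(
  --tensor-model-parallel-size 4
  --pipeline-model-parallel-size 2
  --expert-model-parallel-size 8
)
```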
5. Recipe Configuration
5.1 Parallelism
Default cell is TP=4 PP=2 EP=8 on 16 GPUs. Even with FP8 weights, the 120B-A12B footprint requires either a wider EP fan-out or PP=2 to fit the activation memory of the hybrid Mamba+attention stack.
| Cell | TP | PP | CP | EP | max_tokens_per_gpu | GPUs |
|---|---|---|---|---|---|---|
| default (run script) | 4 | 2 | 1 | 8 | 1024 | 16 (2 × 8) |
| TP=4×EP=8 (1 node) | 4 | 1 | 1 | 8 | 1024 | 8 (1 × 8) |
| TP=2×PP=2×EP=8 + SP | 2 | 2 | 1 | 8 | 1024 | 16 (2 × 8) |
--sequence-parallel is enabled in the run script. Activation checkpointing is on (--recompute-granularity full --recompute-method uniform --recompute-num-layers 1). --log-probs-chunk-size 128 keeps the smoke-recipe memory budget intact for FP8.
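Grouped for reference, the memory-related flags from this subsection (values exactly as listed above; the array name is illustrative):
```bash
MEM_ARGS=(
  --sequence-parallel            # enabled in the run script
  --recompute-granularity full   # full activation checkpointing
  --recompute-method uniform
  --recompute-num-layers 1
  --log-probs-chunk-size 128     # chunked log-prob computation for the FP8 smoke budget
)
```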
5.2 Algorithm
GRPO with low-variance KL; the defaults match the Nano-MoE recipe.
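For orientation, "low-variance KL" is read here as the usual k3-style per-token estimator; this is an assumption, since the page does not spell the estimator out. With the ratio $r = \pi_{\text{ref}}(a)/\pi_\theta(a)$ on sampled tokens:
$$
\mathrm{KL}\bigl(\pi_\theta \,\|\, \pi_{\text{ref}}\bigr) \;\approx\; \mathbb{E}_{a \sim \pi_\theta}\!\left[\, r - \log r - 1 \,\right],
$$
which stays non-negative per sample and is unbiased, typically with lower variance than the naive $\log\bigl(\pi_\theta/\pi_{\text{ref}}\bigr)$ estimator.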
5.3 Rollout & SGLang
--use-miles-router --use-rollout-routing-replay keeps train and rollout logprobs aligned for the sigmoid-routed MoE, the same routing-replay rule that Nano-MoE uses.
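As a sketch, the rollout-alignment flags from this subsection (array name illustrative):
```bash
ROLLOUT_ARGS=(
  --use-miles-router
  --use-rollout-routing-replay  # replay rollout routing so train/rollout logprobs stay aligned
)
```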
5.4 Optimizer
CPU Adam (--optimizer-cpu-offload) is the default for the 120B-A12B smoke
recipe; the FP8 weights save GPU memory but the Adam states still dominate at
this scale.
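The corresponding flag, as a one-line sketch (how it is spliced into the launch command is assumed):
```bash
OPT_ARGS=(
  --optimizer-cpu-offload   # CPU Adam: optimizer states live in host memory
)
```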
5.5 Notable quirks
- FP8 native load: the HF checkpoint is FP8; the bridge passes the per-block scales through to Megatron — no offline upcast step.
- No --spec: AutoBridge + the NemotronH shim synthesize the Megatron MoE spec from the HF config.
- Routing follows the family default: --moe-router-score-function sigmoid --moe-router-pre-softmax --moe-router-topk-scaling-factor 2.5.
- Aux-free balancing: --moe-router-enable-expert-bias --moe-router-load-balancing-type seq_aux_loss --moe-router-bias-update-rate 0 --moe-aux-loss-coeff 0.
- --moe-grouped-gemm, --moe-router-dtype fp32.
- --position-embedding-type none, --vocab-size 131072 --make-vocab-size-divisible-by 128.
- --attention-backend auto (Mamba layers select their own kernel). All of these flags are collected in the block after this list.
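For convenience, the quirk flags above in a single block (values exactly as listed; the array name and splicing into the launch command are assumptions):
```bash
QUIRK_ARGS=(
  --moe-router-score-function sigmoid
  --moe-router-pre-softmax
  --moe-router-topk-scaling-factor 2.5
  --moe-router-enable-expert-bias
  --moe-router-load-balancing-type seq_aux_loss
  --moe-router-bias-update-rate 0
  --moe-aux-loss-coeff 0
  --moe-grouped-gemm
  --moe-router-dtype fp32
  --position-embedding-type none
  --vocab-size 131072
  --make-vocab-size-divisible-by 128
  --attention-backend auto       # Mamba layers select their own kernel
)
```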
See the NemotronH MoE bridge shim (miles_plugins/megatron_bridge/nemotron_h.py) for how routed_scaling_factor / n_group / topk_group are wired onto the Megatron provider, and FP8 & Low Precision for the FP8 weight format.

