1. Model Introduction
NVIDIA Nemotron-3-Nano-30B-A3B-BF16 is a hybrid Mamba + attention + MoE model. It pairs the `nemotron_h` block
pattern from the dense 4B with a 128-expert sparse MoE layer (top-6 routing,
1 shared expert, DSv3-style sigmoid routing with `routed_scaling_factor=2.5`).
miles loads it through the megatron.bridge `AutoBridge` with a custom
NemotronH MoE bridge shim (`miles_plugins/megatron_bridge/nemotron_h.py`) that
wires `routed_scaling_factor`, `n_group`, and `topk_group` onto the Megatron
provider. Without the shim, the routed expert output is silently scaled by 1.0×
instead of 2.5×, which shows up as roughly 0.28 logprob drift between train and rollout.
Key highlights:
- Hybrid + MoE: Mamba + attention + sparse MoE in the `nemotron_h` family.
- 128 experts, top-6 routing, 1 shared expert (3712-dim), aux-free expert-bias load balancing.
- Sigmoid routing with `--moe-router-topk-scaling-factor 2.5`.
- Bridge-mode load: `--megatron-to-hf-mode bridge` (no torch_dist conversion step).
- No RoPE: `--position-embedding-type none`.
2. Supported Variants
| Model | Active / Total params | HF ID |
|---|---|---|
| Nemotron-3-Nano-30B-A3B | 3 B / 30 B | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 |
3. Environment Setup
3.1 Download model + datasets
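A minimal sketch of the download step, assuming `huggingface-cli` is installed; the local directory is an assumed example, and the dataset IDs are not listed in this section, so only the model download is shown:

```bash
# Download the BF16 HF checkpoint to a local directory (path is an assumed example).
huggingface-cli download nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
  --local-dir /ckpts/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
```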
3.2 No torch_dist conversion
AutoBridge + the NemotronH MoE shim load the HF checkpoint directly. Both
`--hf-checkpoint` and `--ref-load` point at the HF directory:
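A minimal sketch (the local path is an assumption and should match the download step above):

```bash
# Assumed local path from the download step.
CKPT=/ckpts/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

# Both flags point at the same HF directory; bridge mode skips any torch_dist conversion.
LOAD_ARGS=(
  --hf-checkpoint "$CKPT"
  --ref-load "$CKPT"
  --megatron-to-hf-mode bridge
)
```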
4. Launch
4.1 Quick start
The quick start uses the default TP=2 PP=2 EP=2 cell on a single 8-GPU node.
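A minimal launch sketch, assuming the repository's run script is invoked directly (any cluster-specific environment variables or data paths are omitted and depend on your setup):

```bash
# Launches the default TP=2 PP=2 EP=2 recipe on one 8-GPU node.
bash scripts/run-nemotron-3-nano-30b-a3b.sh
```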
5. Recipe Configuration
5.1 Parallelism
The default cell is TP=2 PP=2 EP=2. Other verified cells from the upstream PR
(10-step RL smoke run, max logprob diff ≈ 0.014):
| Cell | TP | PP | CP | EP | max_tokens_per_gpu | GPUs |
|---|---|---|---|---|---|---|
| default (run script) | 2 | 2 | 1 | 2 | 1024 | 8 (1 × 8) |
| EP=4 | 1 | 1 | 1 | 4 | 1024 | 8 |
| TP=2×EP=4+SP | 2 | 1 | 1 | 4 | 1024 | 8 |
| PP=2×EP=4 | 1 | 2 | 1 | 4 | 1024 | 8 |
| CP=2×EP=4 | 1 | 1 | 2 | 4 | 1024 | 8 |
| TP=2×PP=2×EP=2+SP | 2 | 2 | 1 | 2 | 1024 | 8 |
`--sequence-parallel` is enabled in the run script. Activation checkpointing is on
(`--recompute-granularity full --recompute-method uniform --recompute-num-layers 1`).
`--log-probs-chunk-size 128` is required to stay within the smoke run's memory budget.
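As a sketch, the EP=4 cell from the table maps onto the standard Megatron parallelism flags roughly as below; the grouping and variable names are illustrative, the run script remains the source of truth, and the max-tokens-per-gpu flag spelling is assumed from the table column:

```bash
# EP=4 cell: no TP/PP/CP, experts sharded 4-way across the 8 GPUs.
PARALLEL_ARGS=(
  --tensor-model-parallel-size 1
  --pipeline-model-parallel-size 1
  --context-parallel-size 1
  --expert-model-parallel-size 4
  --max-tokens-per-gpu 1024   # assumed flag name, from the table's max_tokens_per_gpu column
)

# Memory controls used for the smoke runs.
MEMORY_ARGS=(
  --recompute-granularity full
  --recompute-method uniform
  --recompute-num-layers 1
  --log-probs-chunk-size 128
)
```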
5.2 Algorithm
GRPO with low-variance KL.
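The algorithm flags themselves are defined in the run script; as a hedged sketch, a GRPO + low-variance-KL setup in this kind of Megatron-based RL launcher often looks like the following, where every flag name is an assumption to be checked against scripts/run-nemotron-3-nano-30b-a3b.sh:

```bash
# Hypothetical algorithm flags; names are assumptions, defer to the run script.
ALGO_ARGS=(
  --advantage-estimator grpo   # group-relative advantage estimation
  --use-kl-loss                # KL penalty against the --ref-load reference policy
  --kl-loss-type low_var_kl    # low-variance KL estimator
)
```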
5.3 Rollout & SGLang
`--use-miles-router --use-rollout-routing-replay` is what keeps train and rollout
logprobs aligned for the sigmoid-routed MoE; drop them and you'll see the same
~0.28 logprob drift the bridge shim was added to fix.
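A minimal grouping of the rollout-side flags shown above (any additional SGLang server arguments come from the run script and are not repeated here):

```bash
# Routing-replay flags that keep train and rollout logprobs aligned.
ROLLOUT_ARGS=(
  --use-miles-router
  --use-rollout-routing-replay
)
```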
5.4 Optimizer
The smoke recipe uses GPU Adam (no `--optimizer-cpu-offload`). Switch to CPU Adam if
memory pressure rises.
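A minimal sketch of the switch; whether other offload-related flags are needed depends on the miles/Megatron version, so treat it as illustrative:

```bash
# Enable CPU Adam only when GPU memory pressure rises; the smoke recipe omits it.
OPTIMIZER_ARGS=(
  --optimizer-cpu-offload
)
```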
5.5 Notable quirks
From scripts/models/nemotron-3-nano-30b-a3b.sh and scripts/run-nemotron-3-nano-30b-a3b.sh (collected into a single sketch after this list):
- No `--spec`: AutoBridge + the NemotronH shim synthesize the Megatron MoE spec from the HF config.
- 128 experts, `--moe-router-topk 6`, shared expert (3712-dim).
- Routing: `--moe-router-score-function sigmoid --moe-router-pre-softmax --moe-router-topk-scaling-factor 2.5`.
- Group routing: `--moe-router-num-groups 1 --moe-router-group-topk 1` (a no-op for n_group=1, kept for parity with the HF config).
- Aux-free balancing: `--moe-router-enable-expert-bias --moe-router-load-balancing-type seq_aux_loss --moe-router-bias-update-rate 0 --moe-aux-loss-coeff 0`.
- `--moe-grouped-gemm`, `--moe-router-dtype fp32`.
- `--position-embedding-type none`, `--vocab-size 131072 --make-vocab-size-divisible-by 128`.
- `--attention-backend auto` (Mamba layers select their own kernel).
- The NemotronH shim wires `routed_scaling_factor` / `n_group` / `topk_group` onto the Megatron provider.
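Taken together, these model-side flags might be grouped like this; the grouping and variable names are illustrative, and the actual scripts are authoritative:

```bash
# MoE routing flags from the notes above (illustrative grouping).
MOE_ARGS=(
  --moe-router-topk 6
  --moe-router-score-function sigmoid
  --moe-router-pre-softmax
  --moe-router-topk-scaling-factor 2.5
  --moe-router-num-groups 1
  --moe-router-group-topk 1
  --moe-router-enable-expert-bias
  --moe-router-load-balancing-type seq_aux_loss
  --moe-router-bias-update-rate 0
  --moe-aux-loss-coeff 0
  --moe-grouped-gemm
  --moe-router-dtype fp32
)

# Remaining model-side flags.
MODEL_ARGS=(
  --position-embedding-type none
  --vocab-size 131072
  --make-vocab-size-divisible-by 128
  --attention-backend auto
)
```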

