NVIDIA Blackwell (GB300 / GB200 / B200 / B100) and Hopper (H200 / H100) are Miles’s first-class targets.
## Recommended setup
| Component | Why it's required |
|---|---|
| CUDA 12.4+ | Required for FP8 GEMM via cuBLAS |
| FlashAttention-3 (default), FlashInfer | Best-in-class attention kernels |
| DeepGEMM | Grouped-GEMM kernels for MoE models |
| NCCL 2.20+ | NVLink SHARP and IB-aware collectives |
| TransformerEngine | FP8 forward/backward passes |
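Before starting a long run, it is worth confirming the installed stack actually matches these pins. A minimal sketch, assuming the usual Python package names (`torch`, `flash_attn`, `transformer_engine`):

```bash
# Verify the pinned stack before a long run.
nvcc --version | grep release                     # expect 12.4 or newer
python -c "import torch; print(torch.version.cuda, torch.cuda.nccl.version())"
python -c "import flash_attn; print(flash_attn.__version__)"
python -c "import transformer_engine; print(transformer_engine.__version__)"
```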
## Per-GPU notes
### H100 / H200
- The default target; the recipes on this site are tuned against these GPUs.
- H200 ships with 141 GB HBM (vs. 80 GB on H100), so you can often reduce TP for the same model — e.g. TP 8 → TP 4 on a single 8-GPU node.
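As a back-of-envelope check of why the extra memory buys a smaller TP (illustrative numbers for a hypothetical 70B-parameter model, BF16 weights only, ignoring KV cache and activations):

```bash
# Hypothetical 70B model, 2 bytes per BF16 parameter -> ~140 GB of weights.
PARAMS_B=70
WEIGHT_GB=$((PARAMS_B * 2))
echo "TP 8: $((WEIGHT_GB / 8)) GB/GPU of weights (fits with headroom on 80 GB H100)"
echo "TP 4: $((WEIGHT_GB / 4)) GB/GPU of weights (comfortable on 141 GB H200)"
```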
### B100 / B200
- Same launch flags as H-series; FP8 GEMM uses the same code path.
- First-run kernel compilation can take longer than on H-series. If the rollout engine is flagged unhealthy during warm-up, raise `--rollout-health-check-first-wait` (e.g. 600 seconds).
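For example (a sketch: the launcher invocation and script name are placeholders; only the `--rollout-health-check-first-wait` flag comes from this page):

```bash
# Hypothetical launch fragment: give Blackwell's first-run kernel compilation
# ten minutes before the health check can declare the rollout engine unhealthy.
python train.py \
  --rollout-health-check-first-wait 600   # seconds before the first health check
```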
### A100
- No FP8 GEMM — the BF16 path is used automatically.
- Supported, but not part of CI; expect rougher edges than on H/B-series.
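To confirm which path you will get, check the compute capability: FP8 tensor cores require SM 8.9 or newer (Hopper is SM 9.0, Blackwell SM 10.0, A100 SM 8.0). A minimal check, assuming a driver recent enough to support the `compute_cap` query field:

```bash
# 9.0 / 10.0 -> FP8 GEMM path; 8.0 (A100) -> automatic BF16 fallback.
nvidia-smi --query-gpu=name,compute_cap --format=csv
```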
## Multi-node networking
- InfiniBand HDR/NDR: ~200/400 Gbps per port. Default in most H100 deployments.
- RoCEv2: works; set `NCCL_IB_HCA` to your physical NICs.
- Slingshot 11: requires `NCCL_NET_PLUGIN=cassini`.

Sanity check: run `ib_send_bw` between two ranks before launching a multi-day run.
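A minimal sketch using the perftest tools (device and host names are illustrative):

```bash
# Node A: start the bandwidth server on the first IB device.
ib_send_bw -d mlx5_0 --report_gbits

# Node B: run the client against node A.
ib_send_bw -d mlx5_0 --report_gbits nodeA
# Expect roughly 200 Gbps per port on HDR and 400 Gbps on NDR.
```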
## Common environment variables
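A minimal starting point, assuming an InfiniBand fabric. The variable names below are standard NCCL settings, but every value is illustrative and must match your hardware:

```bash
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3   # your physical IB NICs
export NCCL_SOCKET_IFNAME=eth0                   # control-plane interface
export NCCL_DEBUG=WARN                           # raise to INFO when debugging
# RoCEv2 only: pin the GID index of your routable RoCEv2 GID.
# export NCCL_IB_GID_INDEX=3
```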
## NVLink + IB topology
For 8× GPUs per node:
- All-to-all NVLink connectivity (`nvidia-smi topo -m` should show `NV4` between every pair).
- 4–8 IB NICs per node, one per GPU or GPU pair, configured via `NCCL_IB_HCA`.
If `nvidia-smi topo -m` shows `PIX` or `PHB` instead of `NV*`, you've lost a link; fix it before training.
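A quick way to eyeball this (only the GPU-to-GPU cells matter; NIC and CPU columns can legitimately show `PIX`/`PHB`):

```bash
# Print the connectivity matrix; every GPU<->GPU cell should read NV4.
nvidia-smi topo -m

# Rough automated check (sketch): surface any PCIe-only entry in the GPU rows.
# NIC columns can trigger false positives, so review matches by hand.
nvidia-smi topo -m | awk '$1 ~ /^GPU/' | grep -E 'PIX|PHB' || echo "NVLink looks healthy"
```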

