
NVIDIA Blackwell (GB300 / GB200 / B200 / B100) and Hopper (H200 / H100) are Miles’s first-class targets.
docker pull radixark/miles:latest

docker run --rm \
  --gpus all --ipc=host --shm-size=32g \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  --network=host \
  -it radixark/miles:latest /bin/bash
The image bundles:
Component                                Why pinned
CUDA 12.4+                               Required for FP8 GEMM via cuBLAS
FlashAttention-3 (default), Flashinfer   Best-in-class attention kernels
DeepGEMM                                 Kernel for grouped GEMM (MoE)
NCCL 2.20+                               NVLink SHARP, IB-aware collectives
TransformerEngine                        FP8 forward/backward
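
To confirm the pinned stack from inside a freshly pulled container, a quick check along these lines works; the Python module names (flash_attn, transformer_engine, deep_gemm, flashinfer) are assumptions based on the upstream projects and may differ in the image:

nvcc --version                                                                   # CUDA toolkit, expect 12.4+
python -c "import torch; print(torch.version.cuda, torch.cuda.nccl.version())"   # CUDA + NCCL seen by PyTorch
python -c "import flash_attn, transformer_engine"                                # attention kernels + FP8 TE importable
python -c "import deep_gemm, flashinfer"                                         # grouped-GEMM and Flashinfer importable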

Per-GPU notes

H100 / H200

  • The default target; the recipes on this site are tuned against these GPUs.
  • H200 ships with 141 GB HBM (vs. 80 GB on H100), so you can often reduce TP for the same model — e.g. TP 8 → TP 4 on a single 8-GPU node.
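
Rough arithmetic behind that, using a hypothetical 70B-parameter model at 2 bytes per BF16 parameter:

# TP 8: 70e9 * 2 / 8 ≈ 17.5 GB of weights per GPU
# TP 4: 70e9 * 2 / 4 ≈ 35   GB of weights per GPU
# 35 GB of weights plus KV cache and activations is comfortable in 141 GB (H200),
# while the same split is tight on an 80 GB H100.
python -c "print(70e9 * 2 / 4 / 1e9, 'GB of weights per GPU at TP 4')"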

B100 / B200

  • Same launch flags as H-series; FP8 GEMM uses the same code path.
  • First-run kernel compilation can take longer than on H-series. If the rollout engine is flagged unhealthy during warm-up, raise --rollout-health-check-first-wait (e.g. 600s).
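
One way to soften that first-run cost is to persist compile caches on the host so kernels are only built once; a sketch of the same docker run with an added cache mount (the /data/.inductor path is illustrative and matches TORCHINDUCTOR_CACHE_DIR below):

docker run --rm \
  --gpus all --ipc=host --shm-size=32g \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  --network=host \
  -v /data/.inductor:/data/.inductor \
  -e TORCHINDUCTOR_CACHE_DIR=/data/.inductor \
  -it radixark/miles:latest /bin/bash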

A100

  • No FP8 GEMM; the BF16 path is used automatically (see the capability check below).
  • Supported, but not part of CI; expect rougher edges than on H/B-series.
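
FP8 tensor-core GEMM first appears on Hopper-class hardware; A100 reports compute capability 8.0, which is why the BF16 path is selected. You can confirm what a node has (the compute_cap query field needs a reasonably recent nvidia-smi):

nvidia-smi --query-gpu=name,compute_cap --format=csv   # 8.0 = A100 (no FP8 GEMM), 9.0 = H100/H200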

Multi-node networking

  • InfiniBand HDR/NDR: ~200/400 Gbps per port. Default in most H100 deployments.
  • RoCEv2: works; set NCCL_IB_HCA to your physical NICs.
  • Slingshot 11: requires NCCL_NET_PLUGIN=cassini.
Confirm bandwidth with ib_send_bw between two ranks before launching a multi-day run.
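
A minimal point-to-point check with the perftest tools looks like this; the device name and hostname are placeholders, and on RoCEv2 you typically also need to pick a GID index with -x:

ib_send_bw -d mlx5_0 --report_gbits              # on node A (server side)
ib_send_bw -d mlx5_0 --report_gbits nodeA        # on node B, pointing at node A
# Expect roughly 200 Gbps (HDR) or 400 Gbps (NDR) per port.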

Common environment variables

NCCL_DEBUG=INFO                           # verbose NCCL logging
NCCL_DEBUG_SUBSYS=COLL,P2P                # limit logs to collectives and P2P
NCCL_IB_HCA=mlx5_0,mlx5_1,...             # bind NCCL to specific IB/RoCE NICs
NCCL_TIMEOUT=900                          # collective timeout, seconds
NVTE_FUSED_ATTN=1                         # default, but verify
TORCHINDUCTOR_CACHE_DIR=/data/.inductor   # persist torch.compile cache between runs
For 8× GPUs per node:
  • All-to-all NVLink connectivity (nvidia-smi topo -m should show NV* between every pair).
  • 4–8 IB NICs per node, one per GPU pair, configured via NCCL_IB_HCA (the example below shows how to list the HCA names).
If nvidia-smi topo -m shows PIX or PHB instead of NV*, you’ve lost a link — fix before training.
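
To see which HCA names to put in NCCL_IB_HCA, list the devices on the node (ibdev2netdev ships with MLNX_OFED and may not exist on every system):

ls /sys/class/infiniband    # HCA names, e.g. mlx5_0 ... mlx5_7
ibdev2netdev                # maps each HCA to its netdev and shows link state
ibstatus                    # per-port rate and state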

Quick health probe

Before submitting a job, sanity-check the node:
nvidia-smi                          # GPUs visible, driver / CUDA versions
nvidia-smi topo -m                  # NVLink mesh (NV* between every pair)
ibstat                              # IB ports up (multi-node only)
If anything’s wrong, fix it here — chasing it inside Ray is harder.
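
If nccl-tests is available on the node (it is not listed among the bundled components, so treat this as an optional extra), a short all-reduce sweep is a good final check that NVLink and the fabric deliver the expected bus bandwidth; hostnames below are placeholders:

./build/all_reduce_perf -b 8 -e 4G -f 2 -g 8                                      # single node, 8 GPUs
mpirun -np 16 -H nodeA:8,nodeB:8 ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1     # 2 nodes x 8 GPUs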