

When the model is large enough that even FP8 will not fit on one node, the options are spreading across more nodes (and paying cross-node bandwidth) or quantizing further. Miles ships an INT4 W4A16 quantization-aware training (QAT) pipeline. On an 8 × 141 GB H200 node, this is the path used to fit very large models in a single box. The recipe is inspired by the Kimi K2-Thinking team’s report.

What W4A16 means

| Term | Bits | Notes |
|---|---|---|
| W4 | 4-bit weights | Group-quantized (typical group size 32–128) |
| A16 | 16-bit activations | BF16 activation pathway |
The combination shrinks the weights, which dominate memory traffic, while activations stay in BF16, preserving precision on the math-bound path. With QAT the model trains with the quantization in the loop, so the weights round well at inference time.
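The "quantization in the loop" idea can be sketched in a few lines: weights are rounded to 4 bits per group in the forward pass but kept in floating point, so the model learns to tolerate the rounding error. This is a minimal illustration, not Miles's actual implementation (which lives behind the QAT flag described below):

```python
import numpy as np

def fake_quant_int4(w, group_size=32):
    """Round weights to 4 bits per group, then dequantize back to float.

    Mimics the "fake quantization" used during QAT: the forward pass sees
    the INT4 rounding error, but the tensor itself stays floating point.
    """
    orig_shape = w.shape
    groups = w.reshape(-1, group_size)
    # Symmetric per-group scale mapping the group's max magnitude to 7,
    # the top of the signed 4-bit range [-8, 7].
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(groups / scale), -8, 7)
    return (q * scale).reshape(orig_shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 64)).astype(np.float32)
w_q = fake_quant_int4(w, group_size=32)
max_err = float(np.abs(w - w_q).max())  # bounded by half a quantization step
```

During backprop a straight-through estimator would pass gradients through the rounding as if it were the identity; that part is omitted here.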

Calibration

Convert a BF16 HuggingFace checkpoint to INT4 with tools/convert_hf_to_int4.py (GPTQ via llmcompressor):
python tools/convert_hf_to_int4.py \
   --input-dir  /root/MyModel \
   --output-dir /root/MyModel-INT4 \
   --data-dir   /root/calibration_dataset \
   --quant-type W4A16 \
   --num-calibration-samples 256 \
   --quant-group-size 128
| Flag | Default | Notes |
|---|---|---|
| `--quant-type` | `W4A16` | Also accepts `W8A16`. |
| `--num-calibration-samples` | `256` | Calibration set size. |
| `--quant-group-size` | `32` | GPTQ group size; 128 is also common. |
| `--max-sequence-length` | `2048` | Calibration sequence length. |
| `--dampening-frac` | `0.01` | GPTQ damping. |
| `--trust-remote-code` | off | Pass when the HF config requires custom code. |
The output is a HuggingFace directory with per-group INT4 weights and scales. Point --hf-checkpoint at it; SGLang autodetects the quantization at load time.
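The reason "per-group INT4 weights" are 4× smaller is that two 4-bit values fit in one byte. The exact serialized layout is defined by llmcompressor's compressed-tensors format, not reproduced here; this is only a sketch of the packing idea:

```python
def pack_int4(q):
    """Pack signed 4-bit values (range [-8, 7]) two per byte."""
    assert len(q) % 2 == 0
    out = bytearray()
    for lo, hi in zip(q[0::2], q[1::2]):
        # Masking with 0x0F maps negatives to their two's-complement nibble.
        out.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(out)

def unpack_int4(packed):
    """Inverse of pack_int4: sign-extend each nibble back to [-8, 7]."""
    out = []
    for b in packed:
        for nibble in (b & 0x0F, b >> 4):
            out.append(nibble - 16 if nibble > 7 else nibble)
    return out

q = list(range(-8, 8))          # all 16 representable INT4 values
packed = pack_int4(q)           # 16 values -> 8 bytes
restored = unpack_int4(packed)  # lossless round trip
```

The per-group scales are stored alongside the packed nibbles, which is the small overhead on top of the raw 4 bits per weight.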

Enabling QAT

QAT is currently driven by environment variables passed through Ray’s runtime env rather than CLI flags. The canonical recipe is examples/low_precision/run-qwen3-30B-A3B-int4.sh:
RUNTIME_ENV_JSON='{
  "env_vars": {
    "OPEN_TRAINING_INT4_FAKE_QAT_FLAG": "1",
    "OPEN_TRAINING_INT4_GROUP_SIZE": "128"
  }
}'

ray job submit --address="http://127.0.0.1:8265" \
   --runtime-env-json="${RUNTIME_ENV_JSON}" \
   -- python3 train.py ...
Pair the INT4 --hf-checkpoint with a BF16 --ref-load torch_dist directory so the KL anchor stays full-precision.
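Because the switches travel through the runtime env rather than argv, any process in the Ray job can read them from `os.environ`. The helper below is hypothetical (Miles's internal wiring is not shown in this doc); only the two variable names come from the recipe above:

```python
import os

def qat_config_from_env():
    """Read the QAT settings that Ray's runtime env injects into workers."""
    enabled = os.environ.get("OPEN_TRAINING_INT4_FAKE_QAT_FLAG", "0") == "1"
    group_size = int(os.environ.get("OPEN_TRAINING_INT4_GROUP_SIZE", "32"))
    return enabled, group_size

# Simulate the environment the RUNTIME_ENV_JSON above would produce:
os.environ["OPEN_TRAINING_INT4_FAKE_QAT_FLAG"] = "1"
os.environ["OPEN_TRAINING_INT4_GROUP_SIZE"] = "128"
enabled, group_size = qat_config_from_env()
```

Keeping the group size here consistent with the `--quant-group-size` used during calibration avoids a train/serve mismatch.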

Tuning

| Symptom | Try |
|---|---|
| Eval reward drops noticeably vs BF16 | Lower `OPEN_TRAINING_INT4_GROUP_SIZE` (e.g. 64), or recalibrate with more samples. |
| Slower than BF16 | Confirm `--sglang-cuda-graph-bs` covers your batch sizes. |

Pairs with

  • R3. Keeps MoE routing stable across the quantized forward.
  • P2P weight transfer. INT4 weights are 4× smaller, so weight sync transfers less data.
  • Speculative decoding. Compounds for end-to-end rollout speedup.
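The "4× smaller" figure for weight sync can be sanity-checked with back-of-envelope arithmetic. The group size of 128 and FP16 scales are assumptions matching the recipe above; per-group zero points, if stored, would add slightly more:

```python
# Bytes moved per parameter during weight sync (illustration, not a measurement).
# BF16: 2 bytes/param. INT4: 0.5 bytes/param plus one 2-byte FP16 scale
# per group of 128 weights.
params = 30e9  # e.g. a 30B-parameter model, as in the QAT recipe above
bf16_bytes = params * 2.0
int4_bytes = params * 0.5 + (params / 128) * 2.0
ratio = bf16_bytes / int4_bytes  # just under 4x once scales are counted
```

So the true payload reduction is slightly below 4×, with the gap shrinking as the group size grows.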

When QAT is not appropriate

  • The model fits comfortably without it.
  • The model architecture is still in development; introduce QAT after a BF16 baseline.
  • The tasks are highly precision-sensitive (some math and safety eval suites).