When the model is large enough that even FP8 weights will not fit on one node, the
options are to spread it across more nodes (and pay cross-node bandwidth) or to
quantize further. Miles ships an INT4 W4A16 quantization-aware-training (QAT)
pipeline. On an 8 × 141 GB H200 node, this is the path used to fit very large
models in a single box.
The recipe is inspired by the
Kimi K2-Thinking team’s report.
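As a back-of-envelope illustration of why the W4 step matters (a sketch, not a
measurement; the ~1 T-parameter figure is an assumption, roughly the scale of the
Kimi K2 family):

params = 1.0e12                       # assumed ~1T-parameter MoE
hbm = 8 * 141e9                       # one 8 x 141 GB H200 node
for fmt, bytes_per_param in [("BF16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    weights = params * bytes_per_param
    print(f"{fmt}: weights ~{weights / 1e12:.1f} TB, {100 * weights / hbm:.0f}% of node HBM")
# BF16 ~2.0 TB (177%), FP8 ~1.0 TB (89%), INT4 ~0.5 TB (44%) -- only INT4 leaves
# real headroom for KV cache, activations, and CUDA-graph buffers.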
What W4A16 means
| Term | Bits | Notes |
|---|---|---|
| W4 | 4-bit weights | Group-quantized (typical group size 32–128) |
| A16 | 16-bit activations | BF16 activation pathway |
The combination keeps the weights small (weight loading is memory-bandwidth-bound)
while activations stay in BF16 (the matmul math keeps 16-bit precision). With QAT,
the model trains with the quantization in the loop, so the weights adapt to the
4-bit rounding they will see at inference time.
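To make the W4 side concrete, here is a minimal sketch of per-group symmetric
4-bit fake quantization (quantize, then immediately dequantize). It is illustrative
only, not the Miles kernel; fake_quant_w4 and its shapes are made up for the
example, but this is the operation a QAT forward pass applies to weights while
activations stay in BF16:

import torch

def fake_quant_w4(weight: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Per-group symmetric INT4 quantize -> dequantize (illustrative sketch)."""
    out_f, in_f = weight.shape
    w = weight.float().reshape(out_f, in_f // group_size, group_size)
    # One scale per group of `group_size` input channels; map the max
    # magnitude onto the positive INT4 extreme (7).
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7)        # signed 4-bit range
    return (q * scale).reshape(out_f, in_f).to(weight.dtype)

# W4A16: the matmul still runs with 16-bit activations, against weights that
# have been rounded to what the INT4 checkpoint can represent.
x = torch.randn(16, 1024, dtype=torch.bfloat16)           # A16 activations
w = torch.randn(4096, 1024, dtype=torch.bfloat16)         # weight matrix
y = x @ fake_quant_w4(w).t()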
Calibration
Convert a BF16 HuggingFace checkpoint to INT4 with tools/convert_hf_to_int4.py
(GPTQ via llmcompressor):
python tools/convert_hf_to_int4.py \
--input-dir /root/MyModel \
--output-dir /root/MyModel-INT4 \
--data-dir /root/calibration_dataset \
--quant-type W4A16 \
--num-calibration-samples 256 \
--quant-group-size 128
| Flag | Default | Notes |
|---|---|---|
| --quant-type | W4A16 | Also accepts W8A16. |
| --num-calibration-samples | 256 | Calibration set size. |
| --quant-group-size | 32 | GPTQ group size; 128 is also common. |
| --max-sequence-length | 2048 | Calibration sequence length. |
| --dampening-frac | 0.01 | GPTQ damping. |
| --trust-remote-code | off | Pass when the HF config requires custom code. |
The output is a HuggingFace directory with per-group INT4 weights and scales.
Point --hf-checkpoint at it; SGLang autodetects the quantization at load time.
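Before pointing --hf-checkpoint at the converted directory, a quick sanity check
can save a failed launch. A minimal sketch, assuming llmcompressor wrote a
quantization_config block into the output config.json (its usual behaviour; the
path is the one from the example above):

import json, pathlib

ckpt = pathlib.Path("/root/MyModel-INT4")        # output of convert_hf_to_int4.py
cfg = json.loads((ckpt / "config.json").read_text())
qcfg = cfg.get("quantization_config")
assert qcfg is not None, "no quantization_config; conversion likely failed"
print(json.dumps(qcfg, indent=2))                # expect a 4-bit weight scheme and group size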
Enabling QAT
QAT is currently driven by environment variables passed through Ray’s runtime
env rather than CLI flags. The canonical recipe is
examples/low_precision/run-qwen3-30B-A3B-int4.sh:
RUNTIME_ENV_JSON='{
  "env_vars": {
    "OPEN_TRAINING_INT4_FAKE_QAT_FLAG": "1",
    "OPEN_TRAINING_INT4_GROUP_SIZE": "128"
  }
}'

ray job submit --address="http://127.0.0.1:8265" \
  --runtime-env-json="${RUNTIME_ENV_JSON}" \
  -- python3 train.py ...
Pair the INT4 --hf-checkpoint with a BF16 --ref-load torch_dist directory
so the KL anchor stays full-precision.
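Conceptually, the flag toggles fake quantization of the weights inside the
training forward pass. The sketch below is not the Miles implementation (the
FakeQuantSTE and maybe_fake_quant names are made up, and it reuses the per-group
rounding from the earlier sketch); it only illustrates what an env-driven
fake-QAT switch with a straight-through estimator looks like:

import os
import torch

# Read the same variables the Ray runtime env sets above.
FAKE_QAT = os.environ.get("OPEN_TRAINING_INT4_FAKE_QAT_FLAG", "0") == "1"
GROUP_SIZE = int(os.environ.get("OPEN_TRAINING_INT4_GROUP_SIZE", "128"))

class FakeQuantSTE(torch.autograd.Function):
    """Quantize->dequantize in forward; straight-through (identity) gradient in backward."""

    @staticmethod
    def forward(ctx, w, group_size):
        out_f, in_f = w.shape
        g = w.float().reshape(out_f, in_f // group_size, group_size)
        scale = g.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
        q = torch.clamp(torch.round(g / scale), -8, 7)
        return (q * scale).reshape(out_f, in_f).to(w.dtype)

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None          # gradient passes through the rounding unchanged

def maybe_fake_quant(weight: torch.Tensor) -> torch.Tensor:
    return FakeQuantSTE.apply(weight, GROUP_SIZE) if FAKE_QAT else weight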
Tuning
| Symptom | Try |
|---|---|
| Eval reward drops noticeably vs BF16 | Lower OPEN_TRAINING_INT4_GROUP_SIZE (e.g. 64), or recalibrate with more samples. |
| Slower than BF16 | Confirm --sglang-cuda-graph-bs covers your batch sizes. |
Pairs with
- R3. Keeps MoE routing stable across the quantized forward.
- P2P weight transfer. INT4 weights are 4× smaller than BF16, so weight sync
transfers less data.
- Speculative decoding. The gains compound for an end-to-end rollout speedup.
When QAT is not appropriate
- The model fits comfortably without it.
- The model architecture is still in development; introduce QAT after a BF16
baseline.
- The task is highly precision-sensitive (e.g. some math and safety eval suites).