Miles ships recipes for the DeepSeek family across two generations: DeepSeek-V4-Flash introduces sparse multi-head latent attention with a learned indexer and KV compressors (8-node H200), while V3 / R1 remain the canonical 16-node, 671 B-parameter recipes (BF16 training + 128×128 block-wise FP8 rollout, DeepEP, DAPO-style dynamic sampling).
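The 128×128 block-wise FP8 scheme keeps one scale per block of the weight matrix, as in DeepSeek-V3's published recipe. Below is a minimal sketch of just those numerics in plain PyTorch, assuming a 2-D tensor whose dimensions are multiples of 128; Miles' rollout path fuses this with the FP8 GEMM kernels, so the function name and layout here are illustrative only.

```python
import torch

BLOCK = 128
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

def quantize_blockwise(w: torch.Tensor):
    """Quantize a (rows, cols) weight to FP8 with one fp32 scale per 128x128 block."""
    rows, cols = w.shape
    blocks = w.reshape(rows // BLOCK, BLOCK, cols // BLOCK, BLOCK)
    # Per-block absolute max sets the scale so the largest entry maps to FP8_MAX.
    amax = blocks.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = amax.float() / FP8_MAX
    q = (blocks / scale).to(torch.float8_e4m3fn)
    return q.reshape(rows, cols), scale.reshape(rows // BLOCK, cols // BLOCK)

w = torch.randn(256, 512, dtype=torch.bfloat16)
q, scale = quantize_blockwise(w)  # q: fp8 values, scale: (2, 4) fp32 block scales
```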
Variants
| Model | Active / Total params | HF ID | Recipe |
|---|---|---|---|
| DeepSeek-V4-Pro | 49 B / 1.6 T | TBA | deepseek-v4-pro |
| DeepSeek-V4-Flash | 13 B / 284 B | sgl-project/DeepSeek-V4-Flash-FP8 | deepseek-v4-flash |
| DeepSeek-V3 | 37 B / 671 B | deepseek-ai/DeepSeek-V3 | deepseek |
| DeepSeek-R1 | 37 B / 671 B | deepseek-ai/DeepSeek-R1 | deepseek |
See radixark/miles#1046 for tracking.
Fastest path to train
DeepSeek-V4-Flash needs 8 nodes of 8× H200 and the radixark/miles:deepseek-v4 image (see scripts/run_deepseek.py).
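A hypothetical launch wrapper is sketched below. The image name and script path come from the text above, but the flags and the container entry point are assumptions, not Miles' documented CLI; adapt them to your scheduler.

```python
# Hypothetical launch sketch only: --recipe, --nnodes, and --gpus-per-node
# are illustrative flag names, not Miles' documented interface.
import subprocess

IMAGE = "radixark/miles:deepseek-v4"   # image named above
RECIPE = "deepseek-v4-flash"           # Recipe column in the variants table

# Run once per node under your scheduler (srun, mpirun, etc.); a single
# docker invocation cannot span all 8 nodes by itself.
subprocess.run(
    [
        "docker", "run", "--gpus", "all", "--rm", IMAGE,
        "python", "scripts/run_deepseek.py",  # assumes the script is on the image's PATH/workdir
        "--recipe", RECIPE,
        "--nnodes", "8",                      # 8 nodes of 8x H200
        "--gpus-per-node", "8",
    ],
    check=True,
)
```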
Pairs well with
- PD Disaggregation — 671 B is where PD really earns its keep.
- P2P Weight Transfer — amortize weight sync across ranks (see the sketch after this list).
- Fault Tolerance — node failures are inevitable at 16-node scale.
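As a rough illustration of per-rank weight sync, here is a point-to-point sketch in plain torch.distributed. The 1:1 trainer/rollout pairing, the shared process group, and the function name are assumptions; Miles' actual transfer path is described on the P2P Weight Transfer page.

```python
# Illustrative point-to-point weight sync. Assumes dist.init_process_group()
# has already been called on a group spanning trainer and rollout ranks,
# with each trainer rank paired to exactly one rollout rank.
import torch
import torch.distributed as dist

def sync_weights_p2p(model: torch.nn.Module, peer_rank: int, is_trainer: bool) -> None:
    """Send (trainer side) or receive (rollout side) every parameter tensor."""
    for tensor in model.state_dict().values():
        if is_trainer:
            dist.send(tensor, dst=peer_rank)   # push the updated weight
        else:
            dist.recv(tensor, src=peer_rank)   # overwrite the stale copy in place
```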

