These features live in the Miles tree but are not production-ready. They typically have rough edges, missing parallelism, or known bugs against current dependency versions. Use them when you want to iterate quickly or co-develop a feature, not for the long-running training jobs you’d publish results from.
# FSDP backend
A PyTorch FSDP2 training backend lives at `miles/backends/fsdp_utils/`.
It trades maximum throughput for zero conversion overhead: there is no
`torch_dist` step, Miles reads architecture information from the HuggingFace
`config.json`, and weights load directly via `AutoModelForCausalLM.from_pretrained()`.
The distributed optimizer is built into FSDP, and mixed precision comes from
standard PyTorch.
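
To make the loading story concrete, here is a minimal sketch of that same path in plain HuggingFace calls. This is standard `transformers` usage, not the Miles internals, and the checkpoint path is a placeholder.

```python
# Sketch of the zero-conversion loading path described above, using plain
# HuggingFace APIs rather than the actual Miles internals.
from transformers import AutoConfig, AutoModelForCausalLM

ckpt = "/path/to/hf-checkpoint"             # placeholder path

# Architecture information comes straight from the checkpoint's config.json.
config = AutoConfig.from_pretrained(ckpt)
print(config.model_type, config.num_hidden_layers)

# Weights load directly; no torch_dist conversion step in between.
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype="auto")
```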
## When to reach for it
- You’re iterating on a new model architecture and don’t want to write a Megatron spec yet.
- You’re running small-to-mid dense workloads where the parallelism story doesn’t matter.
- You want a HuggingFace-native checkpoint at every step, with no conversion.
## Enabling it
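
The launch command itself isn’t reproduced in this section, so the sketch below is only illustrative: the entry-point script name and the backend-selector flag (`--backend fsdp`) are assumptions, while the remaining flags come from the table that follows.

```bash
# Illustrative only: the script name and the --backend selector are
# assumptions; --hf-checkpoint, --attn-implementation, and
# --gradient-checkpointing come from the flag table below.
python train.py \
  --backend fsdp \
  --hf-checkpoint /path/to/hf-model \
  --attn-implementation flash_attention_2 \
  --gradient-checkpointing
```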
## Flag mapping vs. Megatron

Most RL-level flags carry over unchanged. Backend-specific differences:

| Concern | Megatron | FSDP |
|---|---|---|
| Model load | `--load` + architecture args | `--hf-checkpoint` (single flag, required) |
| Tensor parallel | `--tensor-model-parallel-size` | Not supported yet |
| Pipeline parallel | `--pipeline-model-parallel-size` | Not supported yet |
| Expert parallel | `--expert-model-parallel-size` | Not supported yet |
| Context parallel | `--context-parallel-size` | Not supported yet |
| Optimizer | `--use-distributed-optimizer` (forced on by Miles) | Built-in |
| Gradient checkpoint | `--recompute-granularity` / `--recompute-method` / `--recompute-num-layers` | `--gradient-checkpointing` (boolean) |
| CPU offload | Distributed optimizer | `--fsdp-cpu-offload` |
| CPU backend | (in distributed optimizer) | `--fsdp-cpu-backend` |
| Attention backend | Decided by Megatron Core | `--attn-implementation` (`flash_attention_2` / `sdpa` / `eager`) |
| Mixed precision | `--fp16` / `--bf16` | `--fp16` (bf16 inferred) |
| Extra backend config | — | `--config <yaml>` |
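
For orientation, the sketch below shows roughly how the FSDP-column flags correspond to standard PyTorch FSDP2 and `transformers` primitives. It is ordinary FSDP2 usage, not Miles’s actual wiring: it assumes a recent PyTorch (2.6+ for this import path), an already-initialized process group, and a Llama-style `model.model.layers` layout; the flag-to-call mapping is an inference from the table.

```python
# Rough mapping from the FSDP-column flags above to standard PyTorch FSDP2
# calls. Not the actual Miles wiring; assumes torch.distributed is already
# initialized and PyTorch >= 2.6 for this import path.
import torch
from torch.distributed.fsdp import CPUOffloadPolicy, MixedPrecisionPolicy, fully_shard
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/path/to/hf-checkpoint",              # --hf-checkpoint
    attn_implementation="sdpa",            # --attn-implementation
)
model.gradient_checkpointing_enable()      # --gradient-checkpointing (boolean)

mp = MixedPrecisionPolicy(param_dtype=torch.bfloat16)  # --fp16 / bf16 analogue
offload = CPUOffloadPolicy()               # --fsdp-cpu-offload

# Shard each transformer block, then the root (layer path is model-dependent).
for block in model.model.layers:
    fully_shard(block, mp_policy=mp, offload_policy=offload)
fully_shard(model, mp_policy=mp, offload_policy=offload)
```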

