The `--use-fault-tolerance` flag enables Miles's rollout-side
fault-tolerance machinery. It gates two code paths:
- A `RolloutHealthMonitor` thread per server group, started in `miles/ray/rollout.py`, which periodically heartbeats each SGLang engine.
- A recovery hook in the trainer's weight-update step (`miles/backends/megatron_utils/actor.py`), which restarts engines that the health monitor has killed.
The flag is defined with `action="store_true"` (default `False`) in
`miles/utils/arguments.py`.
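A minimal `argparse` sketch of how these definitions could look; only the flag names, defaults, and `action="store_true"` come from this document, while the parser setup and help strings are illustrative:

```python
import argparse

# Hypothetical parser; flag names and defaults follow the documentation,
# everything else is a sketch.
parser = argparse.ArgumentParser()
parser.add_argument("--use-fault-tolerance", action="store_true", default=False,
                    help="Enable rollout-side fault tolerance.")
parser.add_argument("--rollout-health-check-interval", type=float, default=30.0)
parser.add_argument("--rollout-health-check-timeout", type=float, default=30.0)
parser.add_argument("--rollout-health-check-first-wait", type=float, default=0.0)

args = parser.parse_args(["--use-fault-tolerance"])
print(args.use_fault_tolerance)            # True
print(args.rollout_health_check_interval)  # 30.0
```

Note that `argparse` converts the dashes in each flag name to underscores on the resulting namespace.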
Health monitor
`RolloutHealthMonitor` (`miles/utils/health_monitor.py`) runs in a daemon
thread. Lifecycle: `start` (called once during init), `pause` and `resume`
(called when engines offload/onload), `stop` (called during dispose).
`pause`/`resume` are wired up in `miles/ray/rollout.py` and called
around offload/onload events.
Each loop iteration does:
- After a `resume`, wait `--rollout-health-check-first-wait` seconds before the first check (intended to cover model compilation and initialization).
- For every active engine in the group, call `engine.health_generate.remote(timeout=self._check_timeout)`.
- If the call raises, run `_kill_engine`: `engine.shutdown.remote()`, `ray.kill(engine)`, and the engine slot is set to `None` (`miles/utils/health_monitor.py`).
- Sleep `--rollout-health-check-interval` seconds, then repeat.
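The loop above can be sketched without Ray as follows. `FakeEngine`, the slot list, and the event-based pause/stop wiring are illustrative stand-ins for the real actor handles in `miles/utils/health_monitor.py`:

```python
import threading
import time


class FakeEngine:
    """Stand-in for a Ray-actor SGLang engine (illustrative only)."""
    def __init__(self, healthy=True):
        self.healthy = healthy

    def health_generate(self, timeout):
        if not self.healthy:
            raise RuntimeError("engine unresponsive")


class HealthMonitorSketch:
    def __init__(self, engines, interval=0.01, first_wait=0.0):
        self._engines = engines        # slots; killed engines become None
        self._interval = interval
        self._first_wait = first_wait
        self._stop = threading.Event()
        self._paused = threading.Event()

    def _kill_engine(self, idx):
        # The real code calls engine.shutdown.remote() and ray.kill(engine)
        # before clearing the slot.
        self._engines[idx] = None

    def run_once(self):
        for idx, engine in enumerate(self._engines):
            if engine is None:
                continue
            try:
                engine.health_generate(timeout=1.0)
            except Exception:
                self._kill_engine(idx)

    def run(self):
        time.sleep(self._first_wait)   # grace period after resume
        while not self._stop.is_set():
            if not self._paused.is_set():
                self.run_once()
            time.sleep(self._interval)


engines = [FakeEngine(), FakeEngine(healthy=False)]
monitor = HealthMonitorSketch(engines)
monitor.run_once()
print(engines[1])  # None: the unhealthy engine's slot was cleared
```

Clearing the slot to `None` rather than removing it is what lets the recovery path below find which engines need restarting.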
Flags
| Flag | Default | Source help text |
|---|---|---|
| `--rollout-health-check-interval` | 30.0 | "Interval in seconds between rollout engine /health_generate checks during generate/eval." |
| `--rollout-health-check-timeout` | 30.0 | "Timeout in seconds to wait for a rollout engine /health_generate response before killing it." |
| `--rollout-health-check-first-wait` | 0 | "Initial grace period (in seconds) before starting health checks. This allows time for model compilation and initialization. Increase this value significantly when using deepgemm." |
Engine recovery
When `--use-fault-tolerance` is on, `MegatronActor.update_weights` calls
`rollout_manager.recover_updatable_engines` on rank 0 before each weight
update (`miles/backends/megatron_utils/actor.py`).
`recover_updatable_engines` (`miles/ray/rollout.py`):
- Pauses health monitoring.
- Calls `srv.recover()` on the updatable server.

`srv.recover()` (`miles/ray/rollout.py`):
- Finds engine slots set to `None` (killed by the health monitor).
- Calls `start_engines` for each affected group.
- Releases memory occupation on the new engines.
After `recover_updatable_engines` returns, the weight updater connects to
the new engines and the next weight transfer proceeds normally.
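A rough sketch of the recovery scan; the slot bookkeeping and the `start_engines` stand-in below are assumptions for illustration, not the actual `miles/ray/rollout.py` code:

```python
# Illustrative sketch of srv.recover(): scan for slots the health monitor
# cleared (set to None) and restart an engine in each one.

def start_engines(slot_indices):
    """Stand-in for the real start_engines; returns fresh engine objects."""
    return {i: object() for i in slot_indices}


def recover(engine_slots):
    dead = [i for i, engine in enumerate(engine_slots) if engine is None]
    if not dead:
        return engine_slots
    fresh = start_engines(dead)
    for i, engine in fresh.items():
        engine_slots[i] = engine
        # The real code also releases memory occupation on the new engines
        # so the next weight update can load weights into them.
    return engine_slots


slots = [object(), None, object(), None]
recover(slots)
print(all(s is not None for s in slots))  # True
```

The scan-for-`None` design means recovery is idempotent: calling it when no engine has died is a no-op.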
P2P weight transfer timeouts
When `--update-weight-transfer-mode p2p` is on, every P2P transfer is
bounded by `--p2p-transfer-timeout` (default 30.0 s, defined in
`miles/utils/arguments.py`; consumed in
`miles/backends/megatron_utils/update_weight/update_weight_from_distributed/p2p.py`).
On timeout the failed transfer is logged (`[P2P] Transfer future failed: ...`)
in `p2p_transfer_utils.py`. There is no automatic retry or automatic
broadcast-mode fallback in the source today.
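The timeout behavior can be illustrated with a plain `concurrent.futures` future; the transfer function and timeout constant here are invented stand-ins, not the actual transfer futures used by `p2p_transfer_utils.py`:

```python
import concurrent.futures
import time


def slow_transfer():
    """Stand-in for a P2P weight transfer that hangs past the timeout."""
    time.sleep(0.5)
    return "done"


P2P_TRANSFER_TIMEOUT = 0.05  # the real default is 30.0 seconds

with concurrent.futures.ThreadPoolExecutor() as pool:
    future = pool.submit(slow_transfer)
    try:
        future.result(timeout=P2P_TRANSFER_TIMEOUT)
    except concurrent.futures.TimeoutError as exc:
        # Mirrors the documented log line; per the source, nothing retries
        # or falls back to broadcast mode after this point.
        print(f"[P2P] Transfer future failed: {exc!r}")
```

Because there is no retry, a timeout here simply surfaces as a logged failure for that transfer.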
Dumper-mode interaction
In dumper mode (`miles/utils/arguments.py`), Miles forces
`use_fault_tolerance = False` and `rollout_health_check_interval = 1e18`
to keep heartbeats from firing.
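The override can be sketched as a simple post-parse adjustment; the `args` namespace, the `dumper_mode` attribute name, and the helper function are stand-ins for whatever `miles/utils/arguments.py` actually produces:

```python
from types import SimpleNamespace


def apply_dumper_mode_overrides(args):
    """Force-disable fault tolerance in dumper mode (sketch).

    The huge interval effectively disables heartbeats without
    touching the monitor's loop structure.
    """
    if getattr(args, "dumper_mode", False):
        args.use_fault_tolerance = False
        args.rollout_health_check_interval = 1e18
    return args


args = SimpleNamespace(dumper_mode=True,
                       use_fault_tolerance=True,
                       rollout_health_check_interval=30.0)
apply_dumper_mode_overrides(args)
print(args.use_fault_tolerance, args.rollout_health_check_interval)
# False 1e+18
```

Overriding the interval rather than stopping the monitor keeps the rest of the lifecycle (`start`/`pause`/`resume`/`stop`) unchanged in dumper mode.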
