### Why do I see garbled text during training?
This usually means the checkpoint was never actually loaded, so the model is generating from uninitialized or mismatched weights. The loader finds checkpoints via `latest_checkpointed_iteration.txt`. Verify:

- `--load` (and/or `--ref-load`) point to a directory with that file.
- If you want a specific iteration, use `--ckpt-step <N>`.
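A quick pre-flight check (the checkpoint path below is illustrative):

```bash
# The loader resumes from the step recorded in latest_checkpointed_iteration.txt.
CKPT_DIR=/ckpts/my-run                               # illustrative path
cat "$CKPT_DIR/latest_checkpointed_iteration.txt"    # should print a step number

# Then point the trainer at that directory:
#   --load $CKPT_DIR          (and/or --ref-load $CKPT_DIR)
#   --ckpt-step 1000          (only to pin a specific iteration)
```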
### My job is stuck on the Ray submission page.
Ray is almost certainly waiting for GPUs the cluster cannot provide. Check the resource math:

- Colocated (`--colocate`): total GPUs ≥ `actor_num_nodes × actor_num_gpus_per_node`.
- Disaggregated: total GPUs ≥ `actor_num_nodes × actor_num_gpus_per_node + rollout_num_gpus`.
And remember to pass `--colocate` when sharing GPUs.
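A back-of-the-envelope check before submitting; the counts below are made-up examples:

```bash
# Disaggregated layout: actor GPUs plus dedicated rollout GPUs.
ACTOR_NODES=2
ACTOR_GPUS_PER_NODE=8
ROLLOUT_GPUS=8
CLUSTER_GPUS=24    # GPUs reported by 'ray status'

NEEDED=$(( ACTOR_NODES * ACTOR_GPUS_PER_NODE + ROLLOUT_GPUS ))
echo "need $NEEDED GPUs, have $CLUSTER_GPUS"
(( CLUSTER_GPUS >= NEEDED )) || echo "job will hang waiting for resources"
```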
### I'm OOM during training. What is max_tokens_per_gpu?
`--max-tokens-per-gpu` caps how many tokens a single GPU sees per micro-batch (only when `--use-dynamic-batch-size` is on, which it should be). If you hit OOM, lower it. If a single sample is still too long for one GPU, enable context parallelism (`--context-parallel-size N`) to spread one sample across N ranks.
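For instance, a launch fragment combining the two knobs (the script name and values are illustrative, not recommendations):

```bash
python train.py \
  --use-dynamic-batch-size \
  --max-tokens-per-gpu 16384 \
  --context-parallel-size 2    # only if a single sample can exceed the cap
```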
### Multi-node training fails with transformers cannot find a model.
Multiple workers calling `AutoConfig.from_pretrained` on a shared filesystem race each other. Set `--model-name <hf-id>` so workers don't re-resolve the path.
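For example (the launcher and model id are placeholders):

```bash
# Every rank gets the Hugging Face id directly, so no rank has to
# re-resolve a shared-filesystem path through transformers.
python train.py --model-name Qwen/Qwen2.5-7B-Instruct
```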
### How do I resume training?
Point `--load` at whatever directory `--save` was writing to. That's it.
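Concretely (launcher and path are illustrative):

```bash
# Run 1 writes checkpoints:
python train.py --save /ckpts/my-run
# Run 2 resumes from the latest iteration recorded in that directory:
python train.py --load /ckpts/my-run --save /ckpts/my-run
```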
### How is the batch size calculated?
- `rollout_batch_size` prompts are sampled.
- Each prompt produces `n_samples_per_prompt` responses.
- One rollout therefore yields `rollout_batch_size × n_samples_per_prompt` samples.

`--num-steps-per-rollout` decides how many optimizer steps consume that data. The invariant is:

`rollout-batch-size × n-samples-per-prompt = global-batch-size × num-steps-per-rollout`
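A quick sanity check of the invariant with made-up values:

```bash
ROLLOUT_BATCH_SIZE=32      # prompts per rollout
N_SAMPLES_PER_PROMPT=8     # responses per prompt
GLOBAL_BATCH_SIZE=64       # samples per optimizer step
NUM_STEPS_PER_ROLLOUT=4    # optimizer steps per rollout

LHS=$(( ROLLOUT_BATCH_SIZE * N_SAMPLES_PER_PROMPT ))    # 256 samples produced
RHS=$(( GLOBAL_BATCH_SIZE * NUM_STEPS_PER_ROLLOUT ))    # 256 samples consumed
(( LHS == RHS )) && echo "invariant holds" || echo "mismatch: $LHS != $RHS"
```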
### Does Miles do data packing / varlen?
Yes. Samples are packed into variable-length micro-batches, and the loss is computed per token (`--calculate-per-token-loss`). You never need to pad manually.
### SGLang gives Max retries exceeded with url: /get_model_info.
The trainer cannot reach an SGLang server, typically because one failed to start. Use `--rollout-num-gpus-per-engine 8` so there's exactly one server per host.
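For example, on 8-GPU hosts (values illustrative; the total-rollout-GPUs flag name is assumed from the `rollout_num_gpus` setting above):

```bash
# 16 rollout GPUs / 8 GPUs per engine = 2 engines, i.e. exactly one
# SGLang server per 8-GPU host, so no two servers contend on one machine.
python train.py \
  --rollout-num-gpus 16 \
  --rollout-num-gpus-per-engine 8
```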
### Gradient norm is huge and training crashes.
### SGLang takes forever, GPUs at 100%, no output.
Generation is likely never hitting a stop condition, so every request decodes to the maximum length. Set stop criteria with `--rollout-stop` or `--rollout-stop-token-ids`.
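For example (the stop string and token id below are placeholders; use your chat template's end-of-turn markers):

```bash
# Stop decoding at an explicit string or at specific token ids.
python train.py \
  --rollout-stop "</answer>" \
  --rollout-stop-token-ids 151645
```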
### SGLang error: illegal memory access.
Usually memory pressure inside the SGLang server. Lower `--sglang-mem-fraction-static`.
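For example (the fraction is an illustrative reduction, not a tuned value):

```bash
# Shrink SGLang's static memory pool (weights + KV cache) to leave headroom.
python train.py --sglang-mem-fraction-static 0.7
```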
### JSONDecodeError from torch.compile / inductor.
A stale or corrupt inductor cache. Add `TORCHINDUCTOR_FORCE_DISABLE_CACHES=1` to your Ray env vars and re-run.
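If you submit through the Ray CLI, one way to set it (the entrypoint is illustrative):

```bash
# Disable inductor's on-disk caches for every worker in the job.
ray job submit \
  --runtime-env-json '{"env_vars": {"TORCHINDUCTOR_FORCE_DISABLE_CACHES": "1"}}' \
  -- python train.py
```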
### Gradient is NaN / Inf.
Pass `--no-check-for-nan-in-loss-and-grad` to skip the offending steps temporarily, then go investigate the data and model alignment that caused it.
### Where do logs live?
| What | Path |
|---|---|
| Trainer stdout | wherever you redirected `ray job submit` |
| SGLang server | stdout/stderr captured by Ray under `~/.ray/session_latest/logs/`; pass `--sglang-log-dir <path>` to write to a chosen directory instead |
| Ray workers | `~/.ray/session_latest/logs/` |
| wandb | `wandb/` in your run dir, plus the cloud UI |
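Handy one-liners for digging through the Ray-captured logs (paths as in the table above):

```bash
# Follow everything Ray captured for the current session.
tail -f ~/.ray/session_latest/logs/*.err

# Find the first traceback across worker logs.
grep -rn "Traceback" ~/.ray/session_latest/logs/ | head
```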

