What you’ll learn: how to make rollout production and trainer consumption fully parallel, with a queue in between, by using a custom rollout function.

In the default training loop, every iteration is sequential:
`generate()` blocks `train_step()`. With Async Rollout the loop is split: a background thread runs `generate()` continuously, and the trainer drains a queue. The two run in parallel and the wall-clock time per iteration drops to roughly max(generate, train) instead of the sum.
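To make the timing claim concrete, here’s a minimal, self-contained sketch of the producer-consumer split. Everything in it (`train_async_sketch`, the `generate` and `train_step` callables, the queue size) is illustrative, not the repo’s actual API:

```python
import queue
import threading

def train_async_sketch(generate, train_step, num_steps, queue_size=8):
    """Overlap rollout generation with training via a bounded queue."""
    rollout_queue = queue.Queue(maxsize=queue_size)
    stop = threading.Event()

    def producer():
        # Background thread: keep producing; put() blocks only when the queue is full.
        while not stop.is_set():
            rollout_queue.put(generate())

    threading.Thread(target=producer, daemon=True).start()
    for _ in range(num_steps):
        batch = rollout_queue.get()  # returns immediately if the producer kept up
        train_step(batch)            # per-iteration cost ~ max(generate, train)
    stop.set()
```

A bounded queue is a deliberate choice here: it also caps how stale (off-policy) buffered samples can get.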
Prerequisites
- You completed the Qwen3-4B recipe (or have an equivalent model + dataset).
- Comfortable with Customization — async rollout uses a custom rollout function.
Files
Quick start
What changes vs. the default recipe
Just two flags:

Walkthrough
The interesting code is small. Here’s the global worker manager (`fully_async_rollout.py`):
- Singleton. One worker per process; multiple `train.py` calls share it.
- Thread + asyncio loop. Cheaper than a subprocess; SGLang HTTP calls are I/O-bound, so an asyncio loop in a single thread saturates them.
- `atexit` hook. The worker is torn down when the process exits, so there are no orphaned generation tasks.
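The repository’s snippet isn’t reproduced here, but a minimal sketch with those three properties might look like this (class and method names are assumptions for illustration):

```python
import asyncio
import atexit
import threading

class RolloutWorkerManager:
    """Singleton owning one background thread that runs an asyncio loop."""

    _instance = None  # one worker per process

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        self.thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        self.thread.start()
        atexit.register(self.shutdown)  # tear down on exit: no orphaned tasks

    @classmethod
    def get(cls):
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def submit(self, coro):
        # SGLang HTTP calls are I/O-bound, so one loop in one thread saturates them.
        return asyncio.run_coroutine_threadsafe(coro, self.loop)

    def shutdown(self):
        self.loop.call_soon_threadsafe(self.loop.stop)
        self.thread.join(timeout=5)
```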
The worker keeps `--rollout-batch-size` tasks in flight using `generate_and_rm_group`:
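The actual loop isn’t shown here; as a sketch of the pattern, the worker can keep a fixed pool of tasks pending and top it up as each one finishes. `submit_one` is a hypothetical factory returning one `generate_and_rm_group` coroutine, and `args.rollout_batch_size` mirrors the CLI flag:

```python
import asyncio

async def keep_in_flight(args, submit_one, output_queue: asyncio.Queue):
    """Maintain args.rollout_batch_size concurrent generation tasks."""
    pending = {asyncio.ensure_future(submit_one())
               for _ in range(args.rollout_batch_size)}
    while True:
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            await output_queue.put(task.result())  # hand samples to the trainer side
            pending.add(asyncio.ensure_future(submit_one()))  # refill the pool
```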
What’s happening underneath
The producer loop is decoupled from the consumer loop. As long as the queue stays populated, the trainer never blocks on generation.

Tuning knobs
| Knob | Effect |
|---|---|
| `--rollout-batch-size` | Worker target in-flight count |
| `--sglang-server-concurrency` | Per-engine concurrency cap |
| `--num-steps-per-rollout` | Increase to consume more per drain (off-policy) |
If the queue keeps growing, the trainer is the bottleneck: increase `--num-steps-per-rollout` (you’ll be slightly off-policy) or scale up trainer parallelism.
If queue depth stays at 0, rollout is the bottleneck — that’s where async helps least
because there’s nothing waiting to be consumed.
What to watch
If `consumer_drain_seconds` > `producer_cycle_time`, your trainer is starving the queue; check GPU utilization.
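As a rough illustration of what the consumer-side metric measures (this timing code is a sketch, not the example’s actual instrumentation, and the queue API is assumed):

```python
import time

def timed_drain(rollout_queue, num_samples):
    # Time how long the trainer spends pulling one batch, including any
    # blocking while it waits for the producer to catch up.
    start = time.perf_counter()
    samples = [rollout_queue.get() for _ in range(num_samples)]
    consumer_drain_seconds = time.perf_counter() - start
    print(f"consumer_drain_seconds={consumer_drain_seconds:.2f} "
          f"queue_depth={rollout_queue.qsize()}")
    return samples
```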
Limitations
- No evaluation mode in this example. Eval still runs through the synchronous path in `train_async.py`. Adding async eval is straightforward: copy the worker pattern and use `evaluation=True`.
- Best-effort ordering. Samples are sorted by index at drain time, but exact-order guarantees aren’t provided.
- Minimal error handling. If a generate task throws, it’s logged and the worker keeps going. Production users should wire in their own fault tolerance; see the sketch below.
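For that last point, here is a hedged sketch of the kind of retry wrapper one might add around each generation task; `with_retries` and its parameters are illustrative, not part of the example:

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def with_retries(make_task, max_attempts=3, backoff_s=2.0):
    """Retry a generation-task factory with linear backoff before giving up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_task()
        except Exception:
            logger.exception("generate task failed (attempt %d/%d)",
                             attempt, max_attempts)
            if attempt == max_attempts:
                raise  # or return a sentinel the drain step can skip
            await asyncio.sleep(backoff_s * attempt)
```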
Variations
Async on a 30B MoE
run_qwen3_30b_a3b_fully_async.py shows the same pattern with tp=4 ep=8 and
--sglang-enable-ep-moe. The only practical difference is increasing
--rollout-batch-size to 64+ to keep the larger engine pool fed.
Async + R3
Async rollout and R3 stack cleanly. Add `return_routed_experts=true`; the async path supports it because it uses `generate_and_rm_group` under the hood.
Async + partial rollout
If you also use `--partial-rollout`, half-finished trajectories are saved to disk and resumed, which is useful when the worker is killed mid-flight.
