

In a typical SGLang deployment, every engine handles both prefill (the one-shot forward over the prompt) and decode (the per-token autoregressive loop). The two phases have different compute profiles:
Phase     Compute pattern              Bottleneck
Prefill   Long sequence × full batch   FLOPs
Decode    One token × batch            Memory bandwidth
Mixing them in one engine means each engine must be provisioned for the worst case of both phases. PD disaggregation splits them into two pools, each sized for its own workload.
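The two bottlenecks can be illustrated with a rough roofline-style estimate. The hardware and model numbers below are illustrative assumptions (roughly A100-class, a 7B dense model), not measurements:

```python
# Rough roofline-style estimate of why prefill is compute-bound and
# decode is memory-bandwidth-bound. All numbers are assumptions.

PEAK_FLOPS = 312e12     # peak BF16 throughput, FLOPs/s (assumed)
PEAK_BW = 2.0e12        # HBM bandwidth, bytes/s (assumed)
PARAMS = 7e9            # model parameters (assumed 7B model)
BYTES_PER_PARAM = 2     # BF16 weights

def forward_time(num_tokens: int) -> tuple[float, float]:
    """Return (compute_time, memory_time) in seconds for one forward
    pass over num_tokens tokens, ignoring attention and KV traffic."""
    flops = 2 * PARAMS * num_tokens              # ~2 FLOPs/param/token
    compute = flops / PEAK_FLOPS
    memory = PARAMS * BYTES_PER_PARAM / PEAK_BW  # weights read once
    return compute, memory

# Prefill: a 4096-token prompt in one pass -> compute dominates.
c, m = forward_time(4096)
print(f"prefill: compute {c*1e3:.1f} ms vs memory {m*1e3:.1f} ms")

# Decode: one token per step -> the weight reads dominate.
c, m = forward_time(1)
print(f"decode:  compute {c*1e3:.4f} ms vs memory {m*1e3:.1f} ms")
```

Under these assumptions the prefill pass is bounded by FLOPs while a single decode step spends nearly all its time streaming weights from HBM, which is why the two phases want differently shaped pools.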

Enable

--prefill-num-servers 2
--prefill-num-servers is a Miles-native flag added by add_prefill_decode_disaggregation_arguments in miles/utils/arguments.py. When set, miles/ray/rollout.py calls SglangConfig.from_prefill_num_servers(args) to dedicate that many SGLang servers to prefill, with the rest used for decode. --prefill-num-servers is mutually exclusive with the sglang_config attribute (the YAML server_groups config), and also cannot be combined with --rollout-external (arguments.py).
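The mutual-exclusion rules can be sketched as a small validation step. `check_pd_args` and the attribute names below are hypothetical illustrations of the constraints described above, not the actual code in miles/utils/arguments.py:

```python
import argparse

def check_pd_args(args: argparse.Namespace) -> None:
    """Hypothetical sketch of the exclusivity rules:
    --prefill-num-servers cannot be combined with a YAML
    server_groups config (sglang_config) or with --rollout-external."""
    if getattr(args, "prefill_num_servers", None) is None:
        return
    if getattr(args, "sglang_config", None) is not None:
        raise ValueError(
            "--prefill-num-servers is mutually exclusive with sglang_config"
        )
    if getattr(args, "rollout_external", False):
        raise ValueError(
            "--prefill-num-servers cannot be combined with --rollout-external"
        )

# Example: dedicate 2 of the rollout servers to prefill.
args = argparse.Namespace(
    prefill_num_servers=2, sglang_config=None, rollout_external=False
)
check_pd_args(args)  # passes silently
```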

When PD is worth it

  • Long prompts (≥ 4K). Prefill dominates total latency.
  • High decode batch sizes. Decode is memory-bandwidth bound.
  • MoE models. Decode benefits disproportionately from EP scaling that prefill does not need.
For typical post-training with 1–2K prompts, the routing and KV-transfer overhead can outweigh the speedup. Measure first.

How requests flow

The KV cache produced by the prefill pool is migrated to the decode pool. SGLang handles the cache transfer when the SGLang Model Gateway fronts the engines (PD support is a feature of the SGLang router).

Sizing the pools

A starting heuristic:
prefill_servers = ceil(rollout_qps × avg_prompt_tokens / single_engine_prefill_tps)
decode_servers  = N - prefill_servers
Without measurements, start at prefill_num_servers = N / 4 and adjust based on observed queueing:
Symptom                      Action
Prefill queue backing up     Increase prefill_num_servers
Decode latency creeping up   Decrease prefill_num_servers
Both queues growing          Scale up both pools, then revisit the ratio
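The starting heuristic above can be written out directly. The workload numbers in the example are illustrative assumptions:

```python
import math

def size_pools(total_servers: int,
               rollout_qps: float,
               avg_prompt_tokens: int,
               single_engine_prefill_tps: float) -> tuple[int, int]:
    """Size the prefill pool to absorb the incoming prompt-token rate;
    give decode the remaining servers. Clamped so both pools stay
    non-empty."""
    prefill = math.ceil(rollout_qps * avg_prompt_tokens
                        / single_engine_prefill_tps)
    prefill = max(1, min(prefill, total_servers - 1))
    return prefill, total_servers - prefill

# Assumed workload: 8 servers, 4 req/s, 4096-token prompts,
# one engine prefilling ~40K tokens/s.
print(size_pools(8, 4.0, 4096, 40_000))  # → (1, 7)
```

Treat the result as a starting point and adjust from the queueing symptoms in the table above.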


When PD is not useful

  • Short prompts (under ~1K tokens). Prefill is already cheap.
  • Single-node setups. Pool boundaries do not help.
  • Highly variable workloads. Fixed pool sizes are wasteful.