
miles supports P2P (point-to-point) weight transfer between the training and rollout engines. With --update-weight-transfer-mode p2p, updated weights are transferred directly from training ranks to rollout engine ranks, which uses the available bandwidth more efficiently than the default broadcast mode. More details on the design and implementation can be found in this issue.

Usage

To enable P2P weight transfer, add the following flag to your training command:
--update-weight-transfer-mode p2p

How It Works

The default weight transfer mode in miles is broadcast: after training, updated weights are broadcast via NCCL to all rollout engine ranks. This works but does not fully utilize the available bandwidth, as redundant copies of the same weights are transferred to multiple ranks. P2P mode addresses this by having each training rank transfer only the specific weight shards required by its target rollout engine rank(s), writing them directly to remote memory without redundant copies. The key steps are:
  1. Initialization: Training ranks establish point-to-point connections (via RDMA) to their target rollout engine ranks. This includes:
    • Create a transfer plan that maps each training rank to its target rollout rank(s) based on GPU counts and parallelism configuration.
    • Query remote rollout engines for their weight memory registration info (addresses and sizes for RDMA writes).
    • Query remote parallelism config and construct a local CPU model replica that mirrors the target’s sharding layout, enabling correct weight format conversion before transfer.
  2. Weight gather: Megatron TP/EP shards are all-gathered and converted to HF format, just as in the broadcast path.
  3. P2P transfer: Instead of a collective broadcast, each source rank writes bucketed weight tensors directly to the destination rollout rank’s memory, in a write-only fashion.
  4. Synchronization: Once all RDMA writes are confirmed complete, rollout engines increment their weight version and resume generation for the next training step.
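
The following sketch summarizes this flow from a single training rank's perspective. It is illustrative only: every helper name (build_transfer_plan, query_remote_registration, gather_and_convert, rdma_write) is a hypothetical stand-in for the miles internals, stubbed out so the control flow runs end to end as plain Python.

```python
# Illustrative sketch of one P2P weight-update step from a single training
# rank's point of view. Every helper below is a hypothetical stub standing in
# for the miles internals; none of these names are the actual miles API.

def build_transfer_plan(train_rank, num_train_ranks, num_rollout_ranks):
    """Stub: which rollout rank(s) this training rank serves (see Architecture)."""
    return [r for r in range(num_rollout_ranks) if r % num_train_ranks == train_rank]

def query_remote_registration(target_rank):
    """Stub: the target's registered weight memory info (addresses/sizes) for RDMA."""
    return {"target": target_rank, "base_addr": 0x1000 * (target_rank + 1)}

def gather_and_convert(bucket):
    """Stub: all-gather Megatron TP/EP shards and convert them to HF format."""
    return bucket

def rdma_write(registration, payload):
    """Stub: write the bucket directly into the target's registered memory."""
    pass

def update_weights_p2p(train_rank, num_train_ranks, num_rollout_ranks, buckets):
    # 1. Initialization: transfer plan + remote memory registration info.
    targets = build_transfer_plan(train_rank, num_train_ranks, num_rollout_ranks)
    registrations = {t: query_remote_registration(t) for t in targets}

    for bucket in buckets:
        # 2. Weight gather: same all-gather + HF conversion as the broadcast path.
        hf_bucket = gather_and_convert(bucket)
        # 3. P2P transfer: write-only RDMA into each target's registered memory.
        for t in targets:
            rdma_write(registrations[t], hf_bucket)

    # 4. Synchronization happens on the rollout side: once all writes are
    #    confirmed, engines bump their weight version and resume generation.

update_weights_p2p(train_rank=0, num_train_ranks=4, num_rollout_ranks=8,
                   buckets=[["layer.0"], ["layer.1"]])
```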

Architecture

Both broadcast and P2P modes share the same bucketed weight-update pipeline in miles/backends/megatron_utils/update_weight/. The tables below show which components are shared and which are P2P-specific.

Shared components (broadcast & P2P)

| Component | Description |
|---|---|
| TP/EP all-gather | Megatron TP shards are all-gathered within each PP stage; EP shards are gathered per bucket when the accumulated expert data exceeds buffer_size * ep_size. Both modes perform this identically via common.py. |
| Bucketed update | Weights are not transferred one parameter at a time. Instead, converted tensors are accumulated into a fixed-size buffer (--update-weight-buffer-size, default 1 GB). When the buffer is full, the entire bucket is flushed, via NCCL broadcast or RDMA write depending on the mode. This amortizes per-transfer overhead. Non-expert and expert weights use separate buckets. A sketch of this bucketing logic follows the table. |
| PP independence | Each pipeline-parallel stage updates its own weights independently. In broadcast mode, each PP rank has its own NCCL group (miles-pp_{pp_rank}); in P2P mode, each PP rank has its own transfer plan. No cross-PP synchronization is needed during weight transfer, which is key to scaling. |
| HF format conversion | After the all-gather, Megatron-format tensors (with custom naming and sharding) are converted to the HuggingFace-format names expected by the sglang rollout engine. |
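
The bucketed update shared by both modes can be pictured as follows. This is a minimal sketch of the idea, not the code in miles/backends/megatron_utils/update_weight/; the class name and the flush callback are hypothetical.

```python
import torch

# Minimal sketch of the shared bucketed-update idea: converted tensors are
# accumulated into a fixed-size buffer and flushed as one transfer when the
# buffer would overflow. Class name and flush callback are hypothetical.

class WeightBucket:
    def __init__(self, buffer_size_bytes, flush_fn):
        self.buffer_size = buffer_size_bytes
        self.flush_fn = flush_fn            # NCCL broadcast or RDMA write, per mode
        self.pending = []                   # (name, tensor) pairs in this bucket
        self.pending_bytes = 0

    def add(self, name, tensor):
        nbytes = tensor.numel() * tensor.element_size()
        if self.pending and self.pending_bytes + nbytes > self.buffer_size:
            self.flush()                    # bucket full: transfer it as one unit
        self.pending.append((name, tensor))
        self.pending_bytes += nbytes

    def flush(self):
        if self.pending:
            self.flush_fn(self.pending)     # one NCCL/RDMA operation per bucket
            self.pending, self.pending_bytes = [], 0

# Usage: non-expert weights in one bucket (expert weights would use another).
bucket = WeightBucket(1 << 30, flush_fn=lambda batch: print(f"flush {len(batch)} tensors"))
for i in range(4):
    bucket.add(f"model.layers.{i}.mlp.up_proj.weight", torch.zeros(4096, 1024))
bucket.flush()                              # flush the final partial bucket
```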

P2P-specific components

| Component | File | Description |
|---|---|---|
| Transfer plan | p2p_transfer_utils.py | Maps each training rank to its target rollout engine rank(s). Uses round-robin assignment with load balancing: the first min(sources, targets) ranks get a 1:1 mapping, and the remaining targets are distributed evenly. This minimizes the number of RDMA sessions per source. A sketch of this assignment follows the table. |
| CPU model replica | p2p.py | A full sglang model is instantiated on CPU (not GPU) to mirror the target engine's parallelism layout. This replica provides the correct weight_loader functions to re-shard all-gathered HF weights into the exact format expected by each target rank. Only the first engine's replica pins memory; subsequent engines reuse the mapping via ParameterMapper. |
| Shared pinned buffer | p2p.py | A single CPU pinned-memory buffer is registered with the mooncake TransferEngine for RDMA. This buffer is reused across all target engines (O(1) memory, not O(num_engines)) and is overwritten per engine, per bucket. |
| Pipelined transfer | p2p.py | RDMA writes to multiple target engines are pipelined: for non-last engines, the transfer manager waits for the previous write to complete before reusing the buffer; for the last engine, writes are handed off fire-and-forget to a background thread pool, overlapping with the next bucket's load phase. |
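
A minimal sketch of the transfer-plan assignment, assuming exactly the behavior summarized in the first row above (1:1 for the first min(sources, targets) ranks, remaining targets spread round-robin). It is an illustration, not the actual code in p2p_transfer_utils.py.

```python
# Sketch of the transfer-plan assignment described in the table above.
def build_transfer_plan(num_sources: int, num_targets: int) -> dict:
    """Map each source (training) rank to the rollout rank(s) it will serve."""
    plan = {src: [] for src in range(num_sources)}
    # 1:1 mapping for the first min(sources, targets) pairs.
    for rank in range(min(num_sources, num_targets)):
        plan[rank].append(rank)
    # Remaining targets (if any) are spread evenly over the sources, keeping
    # the number of RDMA sessions per source small.
    for i, target in enumerate(range(num_sources, num_targets)):
        plan[i % num_sources].append(target)
    return plan

# 4 training ranks serving 8 rollout ranks -> 2 targets per source.
print(build_transfer_plan(4, 8))   # {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}
```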

Supported Model Architectures

P2P weight transfer relies on a unified weight name mapping interface between Megatron and sglang (see sglang#17326). The following sglang model classes are supported:
| sglang Model Class | Model Family | Example Models |
|---|---|---|
| Qwen2ForCausalLM | Qwen2 (dense) | Qwen2.5-0.5B, Qwen2.5-7B |
| Qwen3ForCausalLM | Qwen3 (dense) | Qwen3-4B, Qwen3-8B |
| Qwen3MoeForCausalLM | Qwen3-MoE | Qwen3-30B-A3B, Qwen3-235B-A22B |
| Glm4ForCausalLM | GLM4 (dense) | GLM-Z1-9B-0414 |
| Glm4MoeForCausalLM | GLM4-MoE | GLM-4.5-Air |
| Glm4MoeLiteForCausalLM | GLM4-MoE | GLM-4.7-9B-Flash |
| DeepseekV2ForCausalLM | DeepSeek V2 | Moonlight-16B-A3B |
| DeepseekV3ForCausalLM | DeepSeek V3p2 | GLM-5 (744B-A40B) |
| DeepseekV3ForCausalLM | DeepSeek V3 | Kimi-K2 (1T) * |
Note: All of the above models were tested on H100-80GB clusters. For Kimi-K2, we set training nodes = inference nodes = 32 to ensure sufficient memory. The original Kimi-K2 checkpoint uses a block-quant size of [128, 128], which triggers errors when sglang-tp-size = 32. To work around this, we re-quantize to [64, 64] and update all affected scale tensors accordingly.
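
To illustrate what this block-size change involves, the sketch below dequantizes a toy [128, 128]-blocked tensor and re-quantizes it with [64, 64] blocks, which quadruples the number of scale entries. It assumes simple symmetric per-block scaling (with an fp8e4m3-style max of 448) and is not the exact pipeline used for Kimi-K2.

```python
import torch

def dequantize(q, scales, block=(128, 128)):
    """Expand per-block scales to element granularity and multiply."""
    bh, bw = block
    scale_map = scales.repeat_interleave(bh, dim=0).repeat_interleave(bw, dim=1)
    return q.float() * scale_map

def quantize(w, block=(64, 64), qmax=448.0):
    """Compute symmetric per-block scales and quantized values (dims must divide evenly)."""
    bh, bw = block
    h, width = w.shape
    scales = torch.empty(h // bh, width // bw)
    q = torch.empty_like(w)
    for i in range(0, h, bh):
        for j in range(0, width, bw):
            tile = w[i:i + bh, j:j + bw]
            s = tile.abs().max() / qmax
            scales[i // bh, j // bw] = s
            q[i:i + bh, j:j + bw] = tile / s
    return q, scales

# Toy 256x256 weight stored with [128, 128] block scales ...
q128 = torch.randn(256, 256)
s128 = torch.rand(2, 2) + 0.5
w_bf16 = dequantize(q128, s128, block=(128, 128))

# ... re-quantized with [64, 64] blocks: the scale tensor grows from 2x2 to 4x4,
# so every dependent scale tensor in the checkpoint must be updated as well.
q64, s64 = quantize(w_bf16, block=(64, 64))
print(s128.shape, "->", s64.shape)  # torch.Size([2, 2]) -> torch.Size([4, 4])
```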

Validated Models

All models below have been validated with --check-weight-update-equal in P2P mode.
| Model | sglang Model Class | Nodes |
|---|---|---|
| Qwen3-4B | Qwen3ForCausalLM | 1 |
| GLM-Z1-9B-0414 | Glm4ForCausalLM | 1 |
| Moonlight-16B-A3B | DeepseekV2ForCausalLM | 2 |
| GLM-4.7-9B-Flash | Glm4MoeLiteForCausalLM | 2 |
| GLM-5_4layer | DeepseekV3ForCausalLM | 2 |
| Qwen3-30B-A3B | Qwen3MoeForCausalLM | 4 |
| GLM-4.5-Air | Glm4MoeForCausalLM | 8 |
Enabling --check-weight-update-equal for Kimi-K2 is non-trivial due to several issues:
  • The user must first dequantize the Kimi-K2 model to BF16 for training, then re-quantize the parameters for the weight update. Meanwhile, the rollout side snapshots the original Kimi-K2 tensors as the reference when --check-weight-update-equal is enabled, so the rollout sglang engine must load the checkpoints produced by this dequant-requant pipeline.
  • We use modified checkpoints with block-quant size [64, 64], so any quant/dequant code that hard-codes the block-quant size as [128, 128] will break.
  • Certain tensors that are only initialized on the rollout side (e.g., k_scale / v_scale) must be skipped during the weight-check process (see the sketch after this list).
We verified P2P correctness for Kimi-K2 by enabling --check-weight-update-equal with hard-coded workarounds for the issues above. The transferred weights were confirmed correct. These hard-coded workarounds are verification-only and will not be merged into the main branch to keep the codebase maintainable.
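
For illustration, a simplified skip-aware equality check might look like the sketch below. The reference/received dictionaries and the suffix list are assumptions based on the description above, not the actual --check-weight-update-equal implementation.

```python
import torch

# Hypothetical sketch: compare a pre-update reference snapshot against the
# weights received via P2P, skipping tensors that only exist on the rollout
# side (e.g. k_scale / v_scale). Names and structure are assumptions.
ROLLOUT_ONLY_SUFFIXES = ("k_scale", "v_scale")

def check_weights_equal(reference: dict, received: dict) -> list:
    mismatched = []
    for name, ref in reference.items():
        if name.endswith(ROLLOUT_ONLY_SUFFIXES):
            continue  # initialized only on the rollout side; nothing to compare
        if name not in received or not torch.equal(ref, received[name]):
            mismatched.append(name)
    return mismatched

reference = {
    "layers.0.self_attn.k_scale": torch.ones(1),          # rollout-only, skipped
    "layers.0.mlp.up_proj.weight": torch.zeros(8, 8),
}
received = {"layers.0.mlp.up_proj.weight": torch.zeros(8, 8)}
print(check_weights_equal(reference, received))  # -> []
```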

Profiling Results

For M source (training) ranks and N target (rollout) ranks, with source pipeline-parallel size src_pp and target expert-parallel size sgl_ep, the benefit of P2P transfer is approximately:
  1. P2P utilizes M // src_pp times more aggregate source transmission bandwidth.
  2. Each target rank receives sgl_ep times less data.
Thus we expect P2P to scale better, especially for MoE models. All profiling is run on H100-80GB clusters with a 1 GB transfer bucket. Timing is measured from when the pause_generation call returns until the update_weight call exits, so that the time spent aborting queued requests is excluded. The table reports the average over steady-state steps 3–12. For Kimi-K2, the RDMA (ms) column includes GPU-side post-processing time (post_load_weights, ~884 ms), since this model requires GPU-side weight requantization after the RDMA transfer. Models marked with ★ are MoE architectures, where P2P benefits are most pronounced because expert-parallel sharding reduces the per-target transfer volume.
| Model Family | Model Name | Total Params | sglang Model Class | Train Config | Inference Config | NCCL (ms) | RDMA (ms) | Delta |
|---|---|---|---|---|---|---|---|---|
| GLM4 | GLM-Z1-9B-0414 | 9B | Glm4ForCausalLM | TP=2, PP=1, CP=2, EP=1, ETP=1, 1 node | TP=4, EP=1, 1 node | 694.6 | 707.1 | +1.8% |
| DeepSeek-V2 ★ | Moonlight-16B-A3B | 16B (3B) | DeepseekV2ForCausalLM | TP=2, PP=1, CP=1, EP=8, ETP=1, 1 node | TP=8, EP=8, 1 node | 1,482.0 | 1,073.3 | −27.6% |
| GLM4-MoE ★ | GLM-4.7-9B-Flash | 30B (3B) | Glm4MoeLiteForCausalLM | TP=4, PP=1, CP=1, EP=8, ETP=1, 1 node | TP=4, EP=4, 1 node | 2,508.6 | 4,229.0 | +68.6% |
| DeepSeek-V3 ★ | GLM-5_4layer | 4-layer | DeepseekV3ForCausalLM | TP=4, PP=1, CP=1, EP=8, ETP=1, 1 node | TP=8, EP=8, 1 node | 732.2 | 1,260.8 | +72.2% |
| Qwen3-MoE ★ | Qwen3-30B-A3B | 30B (3B) | Qwen3MoeForCausalLM | TP=4, PP=1, CP=1, EP=8, ETP=1, 2 nodes | TP=8, EP=8, 2 nodes | 2,670.0 | 2,160.2 | −19.1% |
| GLM4-MoE ★ | GLM-4.5-Air | 106B (12B) | Glm4MoeForCausalLM | TP=1, PP=4, CP=1, EP=8, ETP=1, 4 nodes | TP=8, EP=8, 4 nodes | 5,001.1 | 2,637.2 | −47.3% |
| Qwen3-MoE ★ | Qwen3-235B-A22B | 235B (22B) | Qwen3MoeForCausalLM | TP=4, PP=4, CP=2, EP=16, ETP=1, 8 nodes | TP=32, EP=32, 8 nodes | 10,753.6 | 3,162.0 | −70.6% |
| DeepSeek-V3p2 ★ | GLM-5 | 744B (40B) | DeepseekV3ForCausalLM | TP=4, PP=8, CP=2, EP=16, ETP=1, 16 nodes | TP=64, EP=64, 16 nodes | 58,301.5 | 8,479.7 | −85.5% |
| DeepSeek-V3 ★ | Kimi-K2 | 1T (64B) | DeepseekV3ForCausalLM | TP=8, PP=8, CP=4, EP=32, ETP=1, 32 nodes | TP=32, EP=32, 32 nodes | 53,279.1 | 7,227.3 | −86.4% |
Figure: P2P vs NCCL Broadcast Scaling.
* Kimi-K2 RDMA time includes ~884 ms of GPU-side post_load_weights requantization on the rollout engines.
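
As a worked example of the two ratios above, applied to the Qwen3-235B-A22B row and assuming 8 GPUs per node (an assumption; the table does not state GPUs per node):

```python
# Approximate P2P benefit ratios for the Qwen3-235B-A22B row, assuming
# 8 GPUs per node (not stated in the table above).
gpus_per_node = 8
train_nodes, src_pp, sgl_ep = 8, 4, 32

M = train_nodes * gpus_per_node        # 64 source (training) ranks
print(M // src_pp)                     # 16 -> ~16x more aggregate source bandwidth
print(sgl_ep)                          # 32 -> each target receives ~32x less data
```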

Examples

CI Test (single-node, Qwen3-4B)

The P2P weight transfer E2E test validates correctness on a single node using Qwen3-4B:
#
# Train: 4 GPUs (TP=2, CP=2)
# Rollout: 4 GPUs (sglang, 2 engines × 2 GPUs each)
# Flags: --update-weight-transfer-mode p2p --check-weight-update-equal