Adding a new architecture (such as Qwen3-Next’s Gated-Delta-Net) directly to Megatron-LM’s native code path is invasive. Miles takes a different approach: wrap the model’s official HuggingFace implementation as a black-box module and embed it inside Megatron’s parallel scheduling. This trades some throughput ceiling (no TP inside the wrapped module) for a much shorter time-to-train when the architecture is new. This page uses Qwen3-Next 80B-A3B as the running example.
How it works
Megatron instantiates a model in two steps:
- Generate a layer specification (ModuleSpec / decoder block spec).
- Instantiate concrete PyTorch modules from that spec.
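For orientation, here is a toy illustration of that two-step pattern using Megatron-core’s ModuleSpec and build_module. ToyAttention and its hidden_size parameter are invented for this sketch and are not part of Miles.

```python
# Toy illustration of the two-step pattern (not Miles code): a ModuleSpec names
# the class to build plus extra constructor params, and build_module turns the
# spec into a concrete module later.
import torch
from megatron.core.transformer.spec_utils import ModuleSpec, build_module


class ToyAttention(torch.nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = torch.nn.Linear(hidden_size, hidden_size)


# Step 1: generate the specification.
spec = ModuleSpec(module=ToyAttention, params={"hidden_size": 64})

# Step 2: instantiate a concrete PyTorch module from it.
attention = build_module(spec)
```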
1. Custom decoder spec
miles_plugins/models/qwen3_next.py defines get_qwen3_next_spec. It starts from get_gpt_decoder_block_spec, then, for every layer whose HF layer_types[i] == "linear_attention", overrides that layer’s submodules.self_attention with ModuleSpec(module=Attention, params={"args": args}). The spec is referenced from miles_plugins/models/ and selected at launch time via Megatron’s --spec argument.
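A minimal sketch of what get_qwen3_next_spec could look like, following the description above. The exact signature, the use_transformer_engine flag, and the way the HF config is loaded are assumptions; the real plugin code may differ.

```python
# Sketch only: Attention is the concrete HuggingfaceAttention subclass for
# Gated-Delta-Net (see section 2); it is assumed to be defined alongside this
# function in the plugin.
from transformers import AutoConfig
from megatron.core.transformer.spec_utils import ModuleSpec
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_decoder_block_spec


def get_qwen3_next_spec(config, args):
    # Start from Megatron's stock GPT decoder block spec.
    block_spec = get_gpt_decoder_block_spec(config, use_transformer_engine=True)

    # The HF config says which layers are linear attention (Gated-Delta-Net).
    hf_config = AutoConfig.from_pretrained(args.hf_checkpoint)

    for i, layer_spec in enumerate(block_spec.layer_specs):
        if hf_config.layer_types[i] == "linear_attention":
            # Replace the stock self-attention with the wrapped HF module.
            layer_spec.submodules.self_attention = ModuleSpec(
                module=Attention, params={"args": args}
            )
    return block_spec
```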
2. Abstract Megatron-side wrapper
miles_plugins/models/hf_attention.py defines an abstract
HuggingfaceAttention(MegatronModule, ABC) whose __init__ takes
(args, config, layer_number, cp_comm_type, pg_collection). It loads the
HuggingFace config from args.hf_checkpoint and prepares the layout
adapters Megatron’s parallelism contract requires (sequence parallel, CP
zigzag/packed-shard conversions). Concrete Attention classes subclass it
and embed the actual HF attention module.
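A skeleton of the abstract wrapper, reconstructed from the description above. Only the constructor arguments are taken from the docs; the abstract method names and the exact layout-adapter hooks are illustrative.

```python
from abc import ABC, abstractmethod

from transformers import AutoConfig
from megatron.core.transformer.module import MegatronModule


class HuggingfaceAttention(MegatronModule, ABC):
    """Embeds a HuggingFace attention implementation inside a Megatron layer."""

    def __init__(self, args, config, layer_number, cp_comm_type=None, pg_collection=None):
        super().__init__(config)
        self.args = args
        self.layer_number = layer_number
        self.cp_comm_type = cp_comm_type
        self.pg_collection = pg_collection
        # The HF config drives the wrapped module's shapes and layer types.
        self.hf_config = AutoConfig.from_pretrained(args.hf_checkpoint)
        # Layout adapters (sequence-parallel gather/scatter, CP zigzag <->
        # packed-shard conversion) would be prepared here so the embedded HF
        # module sees a plain, unsharded sequence dimension.

    @abstractmethod
    def forward(self, hidden_states, attention_mask=None, **kwargs):
        """Convert Megatron's layout, run the embedded HF module, convert back."""
```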
3. Align weights with mbridge
The HF parameter layout differs from Megatron’s. miles_plugins/mbridge/ ships per-architecture bridges that reconcile the two. For Qwen3-Next, _ATTENTION_MAPPING (and _MLP_MAPPING, etc.) extends the parent bridge with the layer-name substitutions specific to this architecture. Bridges that need to reshape weights at conversion time override _weight_to_mcore_format. See mbridge for the parent bridges.
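A hedged illustration of the bridge pattern. The parent class here is a local stub, the mapping keys are invented, and the _weight_to_mcore_format signature is an assumption; see the real mbridge parent bridges for the actual contract.

```python
import torch


class ParentBridge:
    """Local stub standing in for the real mbridge parent bridge."""

    _ATTENTION_MAPPING: dict = {}

    def _weight_to_mcore_format(self, name: str, weight: torch.Tensor) -> torch.Tensor:
        return weight


class Qwen3NextBridge(ParentBridge):
    # Extend the parent's mapping with layer-name substitutions for this architecture.
    _ATTENTION_MAPPING = {
        **ParentBridge._ATTENTION_MAPPING,
        # Megatron-side name -> HF-side name (both invented for illustration)
        "self_attention.hf_module.some_proj.weight": "self_attn.some_proj.weight",
    }

    def _weight_to_mcore_format(self, name: str, weight: torch.Tensor) -> torch.Tensor:
        # Reshape HF weights into Megatron's layout where the two differ.
        if name.endswith("some_proj.weight"):
            weight = weight.t().contiguous()  # hypothetical transpose
        return super()._weight_to_mcore_format(name, weight)
```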
Capabilities and limits
| | Patch Megatron core | Miles wrapper approach |
|---|---|---|
| Pipeline parallel | Supported | Supported |
| Sequence parallel | Supported | Supported |
| MoE acceleration | Supported | Supported |
| TP inside the wrapped module | Supported | Not supported |
Mixed precision: keeping fp32 parameters fp32
Some architectures need certain parameters to remain fp32 even when the rest of the model is bf16. Qwen3.5’s A_log is the canonical example. Rounding it
to bf16 makes Megatron-side activations diverge from SGLang-side rollout,
causing precision drift.
The canonical cast point is Megatron’s Float16Module, which casts
every floating-point parameter to bf16/fp16 at wrap time. The mbridge
weight-conversion path (_weight_to_mcore_format and friends) is the
other place fp32 weights can be silently downcast. Two steps are required
to keep tagged params in fp32.
Mark the parameter
enforce_marked_param_dtypes(model) (already wired into the training and
checkpoint conversion entry points) restores tagged params to fp32 after
Float16Module casts the rest of the model to bf16.
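A sketch of how the mark-and-restore step could work. Only enforce_marked_param_dtypes is named above; the tagging helper and the _keep_dtype attribute are assumptions made for this example.

```python
import torch


def mark_param_dtype(param: torch.nn.Parameter, dtype: torch.dtype = torch.float32) -> None:
    # Tag the parameter so later passes know it must keep this dtype.
    # (_keep_dtype is a hypothetical attribute name used only in this sketch.)
    param._keep_dtype = dtype


def enforce_marked_param_dtypes(model: torch.nn.Module) -> None:
    # Run after Float16Module has cast the model: restore every tagged parameter.
    for param in model.parameters():
        keep = getattr(param, "_keep_dtype", None)
        if keep is not None and param.dtype != keep:
            param.data = param.data.to(keep)
```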
Override the bridge
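The other place a tagged weight can lose precision is the bridge’s conversion path, so the architecture’s bridge should keep that parameter fp32 when converting weights rather than following the model-wide bf16 cast. A minimal illustration, continuing the hypothetical bridge sketch from section 3 (the A_log name check follows the example above; the rest is assumed):

```python
import torch


class Qwen3NextBridgeFp32(Qwen3NextBridge):
    def _weight_to_mcore_format(self, name: str, weight: torch.Tensor) -> torch.Tensor:
        converted = super()._weight_to_mcore_format(name, weight)
        if name.endswith("A_log"):
            # Never downcast this parameter; it must stay fp32 end to end.
            converted = converted.to(torch.float32)
        return converted
```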
When this path fits
- New architectures not yet integrated into Megatron core.
- Research models with non-standard layers (Mamba-style state space, Gated-Delta-Net, etc.).
- Cases where the cost of patching Megatron exceeds the value of squeezing the last few percent of throughput.
When native Megatron is preferable
- Stable, frozen architectures (Qwen3 standard, GLM4) where Megatron’s native path is mature.
- Cases where TP inside the new module is critical.

