
Most of Miles’s behavior can be replaced with user-supplied Python by passing a --*-path flag. This page lists every such hook, the function signature it expects, and the default it replaces.

At a glance

| Stage | Flag | Replaces |
|---|---|---|
| Rollout | --rollout-function-path | The whole rollout loop |
| | --custom-generate-function-path | A single sample's generation |
| | --data-source-path | How prompts are loaded |
| | --eval-function-path | The eval rollout |
| Reward | --custom-rm-path | Reward computation |
| | --custom-reward-post-process-path | Reward normalization |
| Filtering | --dynamic-sampling-filter-path | Per-group filter (DAPO) |
| | --buffer-filter-path | Buffer dequeue filter |
| | --rollout-sample-filter-path | Per-sample loss filter |
| | --rollout-all-samples-process-path | Inspect all samples post-rollout |
| | --rollout-data-postprocess-path | Mutate samples post-logprob |
| Training | --custom-loss-function-path | The loss formula |
| | --custom-tis-function-path | Importance sampling correction |
| | --custom-pg-loss-reducer-function-path | Loss reduction (Dr.GRPO) |
| | --custom-convert-samples-to-train-data-path | Sample to tensor batch |
| Megatron hooks | --custom-megatron-init-path | After Megatron init |
| | --custom-megatron-before-log-prob-hook-path | Before logprob compute |
| | --custom-megatron-before-train-step-hook-path | Before each train step |
| Logging | --custom-rollout-log-function-path | Train-rollout logging |
| | --custom-eval-rollout-log-function-path | Eval-rollout logging |
| Routing | --miles-router-middleware-paths | Router middleware |
| Model | --custom-model-provider-path | Megatron model factory |

Rollout

--rollout-function-path

Replace the entire rollout function. Use this only for fundamentally different flows such as multi-agent co-evolution.
```python
def generate_rollout(
    args, rollout_id, data_source, evaluation=False
) -> RolloutFnTrainOutput | RolloutFnEvalOutput:
    ...
```
Default: miles.rollout.sglang_rollout.generate_rollout, or miles.rollout.inference_rollout.inference_rollout_common.InferenceRolloutFn when enable_experimental_rollout_refactor() is on. Reference: examples/multi_agent/rollout_with_multi_agents.py.

--custom-generate-function-path

Replace just the generation step inside the default rollout. Most tool-use, RAG, and multi-turn workflows live here.
```python
async def custom_generate(args, sample: Sample, sampling_params: dict) -> Sample:
    ...
```
Reference: examples/search-r1/generate_with_search.py.
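As a sketch of what such a hook can do, here is a hypothetical retrieve-then-generate flow. The `Sample` dataclass and `fake_tool_search` below are illustrative stand-ins, not Miles's real types; only the hook signature comes from this page.

```python
import asyncio
from dataclasses import dataclass, field

# Hypothetical stand-in for Miles's Sample; the real class carries more fields.
@dataclass
class Sample:
    prompt: str
    response: str = ""
    metadata: dict = field(default_factory=dict)

async def fake_tool_search(query: str) -> str:
    # Placeholder for a real retrieval or tool call.
    return f"[doc about {query}]"

async def custom_generate(args, sample: Sample, sampling_params: dict) -> Sample:
    # Illustrative multi-turn flow: retrieve context, then generate with it.
    context = await fake_tool_search(sample.prompt)
    sample.metadata["retrieved"] = context
    sample.response = f"answer using {context}"
    return sample

result = asyncio.run(custom_generate(None, Sample(prompt="quicksort"), {}))
```

Because the hook is `async`, many samples' tool calls and generations can be in flight concurrently inside the default rollout loop.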

--data-source-path

```python
class CustomDataSource(DataSource):
    def get_samples(self, num_samples) -> list[list[Sample]]: ...
    def add_samples(self, samples) -> None: ...
    def save(self, rollout_id) -> None: ...
    def load(self, rollout_id=None) -> None: ...
```
Default: miles.rollout.data_source.RolloutDataSourceWithBuffer.
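A minimal in-memory sketch of the interface, assuming `get_samples` returns groups of duplicated prompts (dicts stand in for `Sample` objects here, and the group size is a made-up parameter):

```python
class InMemoryDataSource:
    """Illustrative sketch of the DataSource interface, not the real base class."""

    def __init__(self, prompts, group_size=4):
        self.prompts = prompts
        self.group_size = group_size
        self.cursor = 0

    def get_samples(self, num_samples):
        # Return num_samples groups, each with group_size copies of one prompt.
        groups = []
        for _ in range(num_samples):
            prompt = self.prompts[self.cursor % len(self.prompts)]
            self.cursor += 1
            groups.append([{"prompt": prompt} for _ in range(self.group_size)])
        return groups

    def add_samples(self, samples):
        # E.g. re-queue partially finished samples; a no-op in this sketch.
        pass

    def save(self, rollout_id):
        # Persist the cursor so a resumed run does not repeat data.
        self.saved = {"cursor": self.cursor}

    def load(self, rollout_id=None):
        self.cursor = getattr(self, "saved", {"cursor": 0})["cursor"]

ds = InMemoryDataSource(["a", "b", "c"], group_size=2)
batch = ds.get_samples(2)
```

The `save`/`load` pair is what makes checkpoint-resumed runs deterministic with respect to data order.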

--eval-function-path

Same signature as --rollout-function-path. Defaults to whatever rollout function is configured.

Reward

--custom-rm-path

```python
async def custom_rm(args, sample: Sample) -> float:
    ...

# Batched mode (set --group-rm)
async def batched_custom_rm(args, samples: list[Sample]) -> list[float]:
    ...
```
Built-in --rm-type options: math, dapo, deepscaler, f1, gpqa, ifbench, remote_rm (with --rm-url), random.
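A toy exact-match reward showing the shape of the hook. The two-field `Sample` stand-in is hypothetical; real reward models typically parse answers or call a remote scorer.

```python
import asyncio
from dataclasses import dataclass

# Hypothetical minimal Sample; the real class is defined by Miles.
@dataclass
class Sample:
    response: str
    label: str

async def custom_rm(args, sample: Sample) -> float:
    # Illustrative exact-match reward after whitespace stripping.
    return 1.0 if sample.response.strip() == sample.label.strip() else 0.0

reward = asyncio.run(custom_rm(None, Sample(response=" 42 ", label="42")))
```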

--custom-reward-post-process-path

Hook to normalize rewards differently from the default GRPO normalization.
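The exact signature is not shown on this page, but as a sketch of the idea: a post-process that mean-centers each group without the standard-deviation division of default GRPO normalization (a Dr.GRPO-style choice) could look like this, with plain nested lists standing in for whatever structure the hook actually receives.

```python
def post_process_rewards(groups: list[list[float]]) -> list[list[float]]:
    # Mean-center each group, but skip GRPO's division by the group std.
    out = []
    for g in groups:
        mean = sum(g) / len(g)
        out.append([r - mean for r in g])
    return out

adv = post_process_rewards([[1.0, 0.0, 1.0, 0.0]])
```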

Filtering

--dynamic-sampling-filter-path

Per-group filter; runs after scoring, before queueing for training.
```python
def filter_function(args, samples: list[Sample], **kwargs) -> DynamicFilterOutput:
    return DynamicFilterOutput(keep=True, reason=None)
```
Stock implementation: miles.rollout.filter_hub.dynamic_sampling_filters.check_reward_nonzero_std.

--buffer-filter-path

Pops samples from the rollout buffer at dequeue time. The default is pop_first in miles/rollout/data_source.py.
```python
def buffer_filter(
    args,
    rollout_id: int | None,
    buffer: list[list[Sample]],
    num_samples: int,
) -> list[list[Sample]]:
    ...
```
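A plausible sketch of FIFO behavior like the default `pop_first` (not the actual source): take the oldest groups and remove them from the buffer in place.

```python
def pop_first(args, rollout_id, buffer, num_samples):
    # Dequeue the oldest num_samples groups; mutates the buffer in place.
    popped = buffer[:num_samples]
    del buffer[:num_samples]
    return popped

buf = [["g1"], ["g2"], ["g3"]]
taken = pop_first(None, 0, buf, 2)
```

A custom filter could instead prioritize by reward, recency, or staleness before popping.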

--rollout-sample-filter-path

Per-sample, in-place. Set s.remove_sample = True to exclude a sample from the loss (advantage normalization still uses it). The framework passes data: list[list[Sample]], a list of groups of size n_samples_per_prompt, so iterate the groups and then the samples within each group:
```python
def filter_function(args, data: list[list[Sample]]) -> None:
    for group in data:
        for s in group:
            if not_good(s):
                s.remove_sample = True
```

--rollout-all-samples-process-path

Runs after rollout completes and can see all samples, including filtered ones. Useful for logging or analysis.

--rollout-data-postprocess-path

Runs after log probabilities have been computed but before training. Useful for updating loss masks based on per-token logprobs.

Training

--custom-loss-function-path

Replace the GRPO/PPO loss. Requires --loss-type custom_loss. Useful for novel objectives or multi-objective work.

--custom-tis-function-path

Importance sampling correction for off-policy training when train and inference diverge. Reference: examples/train_infer_mismatch_helper/mis.py.
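One common shape for such a correction is truncated importance sampling: weight each token's loss by the train/rollout probability ratio, capped to bound variance. This is a generic sketch with plain lists, not the code in `mis.py`; the cap value is arbitrary.

```python
import math

def truncated_is_weights(train_logprobs, rollout_logprobs, cap=2.0):
    # Per-token truncated IS: w = min(exp(logp_train - logp_rollout), cap).
    # The cap bounds the variance introduced by train/inference mismatch.
    return [min(math.exp(t - r), cap) for t, r in zip(train_logprobs, rollout_logprobs)]

w = truncated_is_weights([-1.0, -0.5], [-1.0, -2.0], cap=2.0)
```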

--custom-pg-loss-reducer-function-path

```python
def get_pg_loss_reducer(
    total_lengths: list[int],
    response_lengths: list[int],
    loss_masks: list[torch.Tensor],
    calculate_per_token_loss: bool = False,
) -> Callable[[torch.Tensor], torch.Tensor]:
    ...
```
Use case: Dr.GRPO divides by a constant instead of effective token count. Reference: examples/DrGRPO/custom_reducer.py.
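A simplified sketch of the Dr.GRPO idea, with plain Python lists standing in for tensors and a made-up constant; the real reducer in `examples/DrGRPO/custom_reducer.py` operates on `torch.Tensor`s.

```python
def get_pg_loss_reducer(total_lengths, response_lengths, loss_masks,
                        calculate_per_token_loss=False):
    # Dr.GRPO-style: divide the summed per-token loss by a fixed constant
    # rather than the effective (masked) token count, removing the length bias.
    max_len = 1024  # hypothetical constant, e.g. the generation budget

    def reduce(per_token_loss):  # a list stands in for a torch.Tensor here
        return sum(per_token_loss) / max_len

    return reduce

reducer = get_pg_loss_reducer([2048], [1024], [[1, 1, 1]])
loss = reducer([0.5] * 512)
```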

--custom-convert-samples-to-train-data-path

```python
def convert_samples_to_train_data(args, samples) -> dict:
    return {
        "tokens":           [...],
        "response_lengths": [...],
        "rewards":          [...],
        "raw_reward":       [...],
        "truncated":        [...],
        "sample_indices":   [...],
        "loss_masks":       [...],
        # optional
        "round_number":            [...],
        "rollout_log_probs":       [...],
        "rollout_routed_experts":  [...],
        "metadata":                [...],
        "multimodal_train_inputs": [...],
        "teacher_log_probs":       [...],
    }
```

Megatron hooks

| Flag | Signature |
|---|---|
| --custom-megatron-init-path | `def custom_init(args) -> None` |
| --custom-megatron-before-log-prob-hook-path | `def custom_hook(args, model, store_prefix) -> None` |
| --custom-megatron-before-train-step-hook-path | `def custom_hook(args, rollout_id, step_id, model, optimizer, opt_param_scheduler) -> None` |
These give per-step access to the live Megatron model and optimizer, useful for custom probes, weight clipping, or surgical interventions.

Logging

```python
# Training rollouts
def log_rollout_data(rollout_id, args, samples, rollout_extra_metrics, rollout_time) -> bool:
    ...

# Eval rollouts
def log_eval_rollout_data(rollout_id, args, data, extra_metrics) -> bool:
    ...
```
Return True to suppress Miles’s default logging, False to layer on top.
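For example, a hook that emits one custom metric and keeps the default logging (dicts stand in for `Sample` objects here):

```python
def log_rollout_data(rollout_id, args, samples, rollout_extra_metrics, rollout_time) -> bool:
    # Illustrative: print a custom metric, then let Miles's default logging run too.
    rewards = [s["reward"] for s in samples]
    print(f"rollout {rollout_id}: mean reward {sum(rewards) / len(rewards):.3f} "
          f"in {rollout_time:.1f}s")
    return False  # False = layer on top of the default logging

suppressed = log_rollout_data(0, None, [{"reward": 1.0}, {"reward": 0.0}], {}, 12.0)
```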

Router

--miles-router-middleware-paths

Inject middleware into the router for request and response transformation, caching, or custom routing.

Model

--custom-model-provider-path

Replace Megatron’s default model factory.
```python
def custom_model_provider(
    pre_process: bool,
    post_process: bool,
    vp_stage: int | None = None,
) -> GPTModel:
    ...
```

Worked example

A custom rollout plus a custom reward in one launch script:
```shell
ROLLOUT_ARGS+=(
    --custom-generate-function-path my_pkg.search_rollout.generate
    --custom-rm-path                my_pkg.rewards.f1_with_grounding
    --metadata-key metadata
    --rollout-max-response-len 4096
)
```
That is the entire delta from the stock GRPO recipe, with no source changes to Miles.