What you’ll learn: how to wire up an asynchronous multi-agent system in Miles, where
two (or more) specialized agents take alternating turns and the joint outcome drives a
single shared reward.
This example uses a dual-agent setup that interleaves a “thinker” and a “verifier”, but
the same pattern scales to:
- Doctor / patient simulations.
- Multi-step DeepResearch pipelines.
- Adversarial games (proposer / solver).
The supporting framework for the production version of this is
MrlX — Miles ships the kernel of the same idea so
you can hack on it without pulling in MrlX’s full dependency tree.
Prerequisites
Files
examples/multi_agent/
├── agent_system.py # the agent state machine
├── prompts.py # role / system prompts
├── rollout_with_multi_agents.py # custom rollout (calls agent_system)
└── run-qwen3-30B-A3B-multi-agent.sh # launch script
Quick start
cd /root/miles
bash examples/multi_agent/run-qwen3-30B-A3B-multi-agent.sh
Configuration
MULTI_AGENT_CONFIGS = {
    "custom_multi_agent_function_path":
        "examples.multi_agent.agent_system.run_agent_system",
    "num_parallel": 5,               # parallel agent runs per prompt
    "incorrect_reward_weight": 0.8,  # weight on agent A's reward when wrong
    "correct_reward_weight": 1.2,    # weight on agent A's reward when right
}
Asymmetric reward weighting (0.8 / 1.2) gives a small bias toward upweighting “correct”
trajectories, which empirically stabilizes early training when most attempts fail.
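The weighting itself is simple. A rough sketch (the weight_by_correctness helper and the 0.5 correctness threshold below are illustrative, not part of the shipped example):

def weight_by_correctness(samples, correct_weight=1.2, incorrect_weight=0.8):
    # Upweight trajectories judged correct, downweight the rest.
    # Assumes a roughly binary reward where >= 0.5 counts as "correct".
    for s in samples:
        w = correct_weight if s.reward >= 0.5 else incorrect_weight
        s.reward *= w
    return samples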
Launch script highlights
ROLLOUT_ARGS=(
   --custom-generate-function-path \
      examples.multi_agent.rollout_with_multi_agents.generate_with_multi_agents
   --prompt-data /root/dapo-math-17k/dapo-math-17k.jsonl
   --input-key prompt --label-key label
   --apply-chat-template --rollout-shuffle
   --rm-type deepscaler
   --num-rollout 3000
   --rollout-batch-size 32
   --n-samples-per-prompt 8
   --rollout-max-context-len 16384   # entire conversation budget
   --rollout-max-response-len 8192   # per-turn cap
   --global-batch-size 256
   --balance-data
)
Two flags matter most:
- --rollout-max-context-len — the total context budget across all turns. It is larger than --rollout-max-response-len because the conversation accumulates turn by turn.
- --global-batch-size 256 — equals 32 × 8 (rollout-batch-size × n-samples-per-prompt), matching the rollout invariant; see the sanity check below.
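A quick check of the flag arithmetic (a standalone sketch, not part of the launch script):

rollout_batch_size = 32
n_samples_per_prompt = 8
global_batch_size = 256
rollout_max_context_len = 16384   # whole-conversation budget
rollout_max_response_len = 8192   # per-turn cap

assert global_batch_size == rollout_batch_size * n_samples_per_prompt
assert rollout_max_response_len <= rollout_max_context_len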
Walkthrough — the agent loop
The shipped pipeline is solver → rewriter → selector: num_parallel solver
attempts run in parallel, each rewriter rewrites the previous solutions, and a
SelectorAgent picks one. Sampling params are set on args upstream by the rollout
helper, so run_agent_system only takes (args, sample).
import asyncio
from copy import deepcopy

# solver_worker, rewrite_worker, SelectorAgent and batched_async_rm are defined
# alongside this function in examples/multi_agent/agent_system.py.

async def run_agent_system(args, sample):
    args = deepcopy(args)
    args.sample = sample
    args.results_dict = {"solver": [], "rewriter": [], "selector": []}
    problem_statement = sample.prompt

    # 1. Solver: num_parallel attempts in parallel.
    tasks = [solver_worker(args, problem_statement, i)
             for i in range(args.num_parallel)]
    solver_solutions = await asyncio.gather(*tasks, return_exceptions=True)
    rewards = await batched_async_rm(args, args.results_dict["solver"])
    for s, r in zip(args.results_dict["solver"], rewards):
        s.reward = r
    previous = [r for r in solver_solutions if isinstance(r, str)]

    # 2. Rewriter: each worker rewrites the previous solutions.
    tasks = [rewrite_worker(args, previous, problem_statement, i)
             for i in range(args.num_parallel)]
    rewritten = [r for r in await asyncio.gather(*tasks, return_exceptions=True)
                 if isinstance(r, str)]

    # 3. Selector: pick one of the rewritten solutions.
    selector = SelectorAgent()
    response = await selector.select(args, problem_statement, rewritten)

    # ... apply asymmetric reward weighting using
    # args.incorrect_reward_weight / args.correct_reward_weight on the solver
    # and rewriter samples, then return them all together.
    return args.results_dict["solver"] + args.results_dict["rewriter"] + ...
All three roles share the same SGLang process — solver_worker, rewrite_worker, and
SelectorAgent.select all post to the same engine, just with different prompts, so every
agent is the same model updating in lockstep. For architecturally distinct agents
(separate models), see the MrlX repo.
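A minimal sketch of that shared-engine pattern, assuming an OpenAI-compatible SGLang endpoint at a hypothetical URL and illustrative role prompts (the real prompts live in prompts.py):

import aiohttp

ENGINE_URL = "http://localhost:30000/v1/chat/completions"  # hypothetical endpoint

ROLE_PROMPTS = {  # illustrative; see prompts.py for the real ones
    "solver": "Solve the problem step by step.",
    "rewriter": "Rewrite and improve the candidate solutions.",
    "selector": "Pick the best candidate solution.",
}

async def call_role(role, user_content, max_tokens=8192):
    # Every role posts to the same engine; only the system prompt differs.
    payload = {
        "model": "default",  # whatever name the engine was launched with
        "messages": [
            {"role": "system", "content": ROLE_PROMPTS[role]},
            {"role": "user", "content": user_content},
        ],
        "max_tokens": max_tokens,
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(ENGINE_URL, json=payload) as resp:
            data = await resp.json()
    return data["choices"][0]["message"]["content"]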
Walkthrough — rollout integration
rollout_with_multi_agents.py exposes generate_with_multi_agents(args, sample, sampling_params, evaluation=False). Internally it:
- Sets args.sampling_params = sampling_params and args.tokenizer, then loads the custom multi-agent function from args.custom_multi_agent_function_path.
- Calls await custom_multi_agent_func(args, sample) to get the list of samples produced by the solver / rewriter / selector pipeline.
- Returns the shuffled list of Samples for the trainer to pack.
The per-sample tokenization and reward already happen inside solver_worker /
rewrite_worker / SelectorAgent.select (which call batched_async_rm), so the
rollout integration itself is a thin wrapper.
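A rough sketch of that wrapper (the body below is illustrative, not a copy of rollout_with_multi_agents.py):

import importlib
import random

async def generate_with_multi_agents(args, sample, sampling_params, evaluation=False):
    # Hand the sampling params to the agent system via args
    # (args.tokenizer is attached the same way).
    args.sampling_params = sampling_params

    # Resolve the dotted path, e.g.
    # "examples.multi_agent.agent_system.run_agent_system".
    module_path, func_name = args.custom_multi_agent_function_path.rsplit(".", 1)
    custom_multi_agent_func = getattr(importlib.import_module(module_path), func_name)

    # Run the solver / rewriter / selector pipeline; it returns a list of Samples.
    samples = await custom_multi_agent_func(args, sample)

    # Shuffle so the trainer packs a mixed batch of roles.
    random.shuffle(samples)
    return samples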
Tuning knobs
| Knob | Effect |
|---|---|
| MAX_TURNS | Conversation depth — longer = more context = slower |
| incorrect_reward_weight / correct_reward_weight | Asymmetric shaping |
| num_parallel | Rollouts per prompt running concurrently |
| --rollout-max-context-len | Stops the conversation when budget is hit |
What to watch
| Metric | Healthy range |
|---|---|
| multi_agent/avg_turns | 2.5 – 4.0 |
| multi_agent/early_termination_rate | 0.4 – 0.6 (reaches <final_answer>) |
| multi_agent/conversation_token_count | 4096 – 12288 |
| loss_mask/role_split | balanced (~50/50) |
| reward/avg | trending up |
If loss_mask/role_split is heavily skewed, one role is dominating — typically the
verifier becomes verbose. Tighten its system prompt or reduce its max_tokens.
Troubleshooting
| Symptom | Fix |
|---|---|
| OOM mid-rollout | Reduce MAX_TURNS or --rollout-max-context-len |
| Both agents repeat each other | Verifier prompt is too permissive — make it adversarial |
| Reward never moves | Check that <final_answer> extraction matches the verifier output |
| Rollout much slower than baseline | Per-turn SGLang RTT × MAX_TURNS — consider async rollout |
Variations
VLM multi-turn
Replace call_role with a VLM-aware caller that includes images in messages. Miles
supports VLM multi-turn natively — same pattern, just multimodal_train_inputs in the
sample dict (see Customization #13).
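A small sketch of a VLM-aware message payload (OpenAI-style content parts; the helper name is illustrative):

def build_vlm_messages(system_prompt, question, image_url):
    # Text and image travel as separate content parts in the user turn.
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": question},
        ]},
    ]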
True asymmetric agents
Run two SGLang services — one per agent — and have your rollout function call the
appropriate URL per turn. The trainer can either train both jointly (one optimizer per
model) or train one and freeze the other (PvE).
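A minimal sketch of the per-turn routing, with hypothetical engine URLs and a call_engine callable that posts messages to one service and returns its reply:

ROLE_URLS = {  # hypothetical endpoints, one SGLang service per agent
    "thinker": "http://thinker-host:30000/v1/chat/completions",
    "verifier": "http://verifier-host:30001/v1/chat/completions",
}

async def run_turns(messages, max_turns, call_engine):
    # Alternate turns between the two engines until one emits <final_answer>.
    for turn in range(max_turns):
        role = "thinker" if turn % 2 == 0 else "verifier"
        reply = await call_engine(ROLE_URLS[role], messages)
        messages.append({"role": "assistant", "content": reply})
        if "<final_answer>" in reply:
            break
    return messages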
Adversarial pairing
Instead of a verifier, the second agent is an adversary that tries to find weaknesses
in the thinker’s answer. Reward both: thinker for surviving, adversary for breaking.
This is the seed of self-play RLHF.