Multi-Agent Post-Co-Training of Large Language Models via Reinforcement Learning
DOI:
https://doi.org/10.71465/fapm716
Keywords:
Post-training, multi-agent learning, LLM collaboration, verifier-based reward, discussion optimization
Abstract
This study introduces MAPoRL2, a post-training framework that enhances collaborative LLM performance through multi-agent reinforcement learning and structured discussion. Multiple LLM agents independently generate candidate solutions, engage in iterative discussion rounds, and are jointly optimized using verifier-based rewards that assess both correctness and corrective reasoning. Experiments across 5 reasoning and generation benchmarks with 4,500 training samples demonstrate improvements of 18.9% in answer accuracy and 22.4% in correction efficiency over single-agent post-training, highlighting the effectiveness of discussion-aware RL signals.
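The discussion-and-reward loop summarized in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`verifier_reward`, `discussion_round`, `run_episode`), the majority-vote stand-in for LLM discussion, and the correction bonus are all assumptions; the actual RL policy update is omitted.

```python
# Hypothetical sketch of a MAPoRL2-style episode: agents draft answers,
# revise them over discussion rounds, and receive verifier-based rewards
# that credit both correctness and corrective reasoning. All names and
# reward values here are illustrative assumptions.

def verifier_reward(answer: str, reference: str, corrected: bool) -> float:
    """Reward correctness, with a bonus when a correct answer was reached
    by revising an initially different draft (a 'correction' signal)."""
    correct = 1.0 if answer == reference else 0.0
    correction_bonus = 0.5 if (correct and corrected) else 0.0
    return correct + correction_bonus

def discussion_round(drafts: list[str]) -> list[str]:
    """One discussion round: each agent revises toward the majority
    answer among peers (a toy stand-in for LLM discussion)."""
    majority = max(set(drafts), key=drafts.count)
    return [majority for _ in drafts]

def run_episode(drafts: list[str], reference: str, rounds: int = 2):
    """Run discussion rounds, then score each agent's final answer."""
    initial = list(drafts)
    for _ in range(rounds):
        drafts = discussion_round(drafts)
    rewards = [
        verifier_reward(answer, reference, corrected=(answer != first))
        for answer, first in zip(drafts, initial)
    ]
    return drafts, rewards

# Example: two agents start correct, one starts wrong; discussion
# converges on the majority, and the agent that self-corrected earns
# the correction bonus.
finals, rewards = run_episode(["42", "42", "41"], reference="42")
# finals  -> ["42", "42", "42"]
# rewards -> [1.0, 1.0, 1.5]
```

In a full system the per-agent rewards would feed a joint RL objective over all agents' policies; here only the reward shaping is shown.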
Downloads
License
Copyright (c) 2026 James L. Carter, Yuxuan Liu, Thomas K. Lee (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.