Multi-Agent Post-Co-Training of Large Language Models via Reinforcement Learning

Authors

  • James L. Carter, Department of Computer Science, Princeton University, Princeton, NJ 08544, USA
  • Yuxuan Liu, Department of Computer Science, Princeton University, Princeton, NJ 08544, USA
  • Thomas K. Lee, Department of Computer Science, Princeton University, Princeton, NJ 08544, USA

DOI:

https://doi.org/10.71465/fapm716

Keywords:

Post-training, multi-agent learning, LLM collaboration, verifier-based reward, discussion optimization

Abstract

This study introduces MAPoRL2, a post-training framework that enhances collaborative LLM performance through multi-agent reinforcement learning and structured discussion. Multiple LLM agents independently generate candidate solutions, engage in iterative discussion rounds, and are jointly optimized using verifier-based rewards that assess both answer correctness and corrective reasoning. Experiments across five reasoning and generation benchmarks, using 4,500 training samples, demonstrate improvements of 18.9% in answer accuracy and 22.4% in correction efficiency over single-agent post-training, highlighting the effectiveness of discussion-aware RL signals.
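The abstract's pipeline (independent candidate generation, iterative discussion, verifier-based rewards with a correction bonus) can be illustrated with a toy sketch. The majority-vote revision rule, the exact-match verifier, and the 0.5 correction bonus below are illustrative assumptions for exposition, not the paper's actual implementation; real agents would be LLMs and the verifier a learned or programmatic checker.

```python
from collections import Counter

def discussion_round(answers):
    """One discussion round (toy stand-in): each agent revises toward
    the current majority answer among all agents."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [majority for _ in answers]

def verifier_reward(initial, revised, gold, correction_bonus=0.5):
    """Verifier-based reward (assumed form): 1.0 for a correct final
    answer, plus a bonus when discussion corrected an initially wrong one."""
    reward = 1.0 if revised == gold else 0.0
    if initial != gold and revised == gold:
        reward += correction_bonus  # rewards corrective reasoning
    return reward

def maporl_step(initial_answers, gold, rounds=2):
    """Run discussion rounds, then score every agent with the verifier.
    The per-agent rewards would feed a joint RL update in the full framework."""
    answers = list(initial_answers)
    for _ in range(rounds):
        answers = discussion_round(answers)
    rewards = [verifier_reward(a0, a, gold)
               for a0, a in zip(initial_answers, answers)]
    return answers, rewards
```

For example, with three agents initially answering `["4", "4", "5"]` and gold answer `"4"`, discussion converges to `"4"` for all agents, and the third agent earns the extra correction bonus for switching from a wrong to a correct answer.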

Published

2026-03-10