Pareto-Front Agentic RL with Dynamic Preference Conditioning for Cost–Risk–Success Trade-offs in Web Tasks
DOI:
https://doi.org/10.71465/fapm749

Keywords:
Multi-objective reinforcement learning, Pareto optimality, preference conditioning, web agents, risk quantiles, cost–latency trade-off, policy calibration

Abstract
Web agent deployment requires navigating trade-offs among success rate, monetary/API cost, latency, and failure risk, and the right balance varies by user and scenario. We propose a preference-conditioned multi-objective RL framework that learns a Pareto set of policies for web tasks. The method trains a single agent conditioned on a preference vector w over (success, cost_1, cost_2, …, risk), using (i) multi-gradient conflict resolution to stabilize updates across objectives and (ii) Pareto replay that balances samples from distinct regions of the frontier. To ensure tail-risk control, the risk objective is defined as a quantile-based failure loss (e.g., at the 90th/95th percentile). We recommend benchmarking on 1,200+ tasks across 40–70 site templates, sweeping preference vectors to trace the Pareto curve and reporting hypervolume improvement, frontier coverage, and policy-switching stability. This approach enables “one model, many operating points,” supporting practical deployment where budgets and risk tolerances change dynamically.
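To make the preference-conditioned objective concrete, here is a minimal sketch, assuming NumPy, of how a sampled preference vector w could scalarize batch estimates of success, cost, and the quantile-based tail risk (the 90th-percentile failure loss mentioned in the abstract). All function names, the Dirichlet sampling choice, and the synthetic outcome arrays are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def quantile_risk(failure_losses: np.ndarray, q: float = 0.90) -> float:
    """Tail-risk objective: the q-th quantile of per-episode failure loss."""
    return float(np.quantile(failure_losses, q))

def scalarized_utility(success: np.ndarray,
                       cost: np.ndarray,
                       failure_loss: np.ndarray,
                       w: np.ndarray) -> float:
    """Combine objectives under preference weights w = (w_success, w_cost, w_risk).

    Success is maximized; cost and tail risk are minimized, so they enter
    with a negative sign before weighting.
    """
    objectives = np.array([
        success.mean(),                # fraction of successful episodes
        -cost.mean(),                  # e.g., API dollars per episode
        -quantile_risk(failure_loss),  # 90th-percentile failure loss
    ])
    return float(w @ objectives)

rng = np.random.default_rng(0)
w = rng.dirichlet(np.ones(3))  # one preference vector on the simplex

# Synthetic batch of 64 rollout outcomes, for illustration only.
success = rng.integers(0, 2, size=64).astype(float)
cost = rng.exponential(scale=0.1, size=64)
failure_loss = rng.exponential(scale=1.0, size=64)

print(f"w = {np.round(w, 3)}  utility = "
      f"{scalarized_utility(success, cost, failure_loss, w):.4f}")
```

Note that this linear scalarization is only a stand-in for the training signal: the abstract's multi-gradient conflict resolution operates on the per-objective gradients themselves, which matters because a fixed linear combination can leave non-convex regions of the Pareto front unreachable.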
License
Copyright (c) 2026 Marco Rossi, Giulia Bianchi, Alessandro Conti

This work is licensed under a Creative Commons Attribution 4.0 International License.