Pareto-Front Agentic RL with Dynamic Preference Conditioning for Cost–Risk–Success Trade-offs in Web Tasks

Authors

  • Marco Rossi, Department of Information Engineering, University of Padua, 35131 Padua, Italy
  • Giulia Bianchi, Department of Information Engineering, University of Padua, 35131 Padua, Italy
  • Alessandro Conti, Department of Information Engineering, University of Padua, 35131 Padua, Italy

DOI:

https://doi.org/10.71465/fapm749

Keywords:

Multi-objective reinforcement learning, Pareto optimality, preference conditioning, web agents, risk quantiles, cost–latency trade-off, policy calibration

Abstract

Web agent deployment requires navigating trade-offs among success rate, monetary/API costs, latency, and failure risk, which vary by user and scenario. We propose a preference-conditioned multi-objective RL framework that learns a Pareto set of policies for web tasks. The method trains a single agent conditioned on a preference vector w over (success, cost₁, cost₂, …, risk), using (i) multi-gradient conflict resolution to stabilize updates across objectives, and (ii) Pareto replay that balances samples from distinct regions of the frontier. To ensure tail-risk control, the risk objective is defined as a quantile-based failure loss (e.g., the 90th/95th percentile). We recommend benchmarking on 1,200+ tasks across 40–70 site templates, sweeping preference vectors to obtain a Pareto curve and reporting hypervolume improvement, frontier coverage, and policy-switching stability. This approach enables “one model, many operating points,” supporting practical deployment where budgets and risk tolerance change dynamically.
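To make the abstract's mechanics concrete, the sketch below (ours, not the authors' released code) illustrates in PyTorch a single policy conditioned on a preference vector w, a quantile-based tail-risk loss, and PCGrad-style gradient projection as one concrete instance of multi-gradient conflict resolution. All names (PreferencePolicy, quantile_failure_loss, resolve_gradient_conflicts) and dimensions are hypothetical, and Pareto replay is omitted.

    import torch
    import torch.nn as nn

    class PreferencePolicy(nn.Module):
        # A single web-agent policy conditioned on a preference vector w,
        # so one model can be steered to different operating points.
        def __init__(self, obs_dim, n_objectives, n_actions, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim + n_objectives, hidden),
                nn.ReLU(),
                nn.Linear(hidden, n_actions),
            )

        def forward(self, obs, w):
            # Condition on w by concatenating it to the observation features.
            return self.net(torch.cat([obs, w], dim=-1))

    def quantile_failure_loss(failure_losses, q=0.95):
        # Tail-risk objective: the q-th quantile (e.g., 90th/95th percentile)
        # of per-episode failure loss, rather than its mean.
        return torch.quantile(failure_losses, q)

    def resolve_gradient_conflicts(grads):
        # PCGrad-style conflict resolution over flattened per-objective
        # gradient vectors: when two gradients point in opposing directions
        # (negative dot product), project one onto the normal plane of the
        # other before averaging.
        projected = [g.clone() for g in grads]
        for g_i in projected:
            for g_j in grads:
                dot = torch.dot(g_i, g_j)
                if dot < 0:
                    g_i -= (dot / (torch.dot(g_j, g_j) + 1e-12)) * g_j
        return torch.stack(projected).mean(dim=0)

    # Example: sample a preference vector per episode and score actions with it.
    policy = PreferencePolicy(obs_dim=32, n_objectives=4, n_actions=10)
    w = torch.rand(4)
    w = w / w.sum()  # weights over (success, cost_1, cost_2, risk)
    logits = policy(torch.randn(1, 32), w.unsqueeze(0))

In a full training loop, w would be sampled each episode to cover distinct frontier regions, per-objective losses would be weighted by w, and the conflict-resolved gradient would update the shared parameters; frontier quality would then be reported via hypervolume and coverage as the abstract describes.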

Published

2026-03-15