Average Reward Reinforcement Learning for Omega-Regular and Mean-Payoff Objectives (Journal Article)

Overview

abstract

  • Recent advances in reinforcement learning (RL) have renewed focus on the design of reward functions that shape agent behavior. Manually crafting such functions is often tedious and error-prone. A more principled alternative is to specify behavioral requirements in a formal, unambiguous language that can be automatically translated into a reward function. Omega-regular languages are a natural choice for this purpose, given their established role in formal verification and synthesis. However, existing approaches using omega-regular specifications typically rely on discounted-reward RL in an episodic setting, where the environment is periodically reset to an initial state during learning. This setup is misaligned with the semantics of omega-regular specifications, which describe properties of infinite behavior traces. In such cases, the average-reward criterion and the continuing setting, in which the agent interacts with the environment over a single, uninterrupted lifetime, are more appropriate.

    To address the challenges of infinite-horizon, continuing tasks, we restrict our focus to the subclass of omega-regular languages known as absolute liveness specifications. These specifications cannot be violated by any finite prefix of the agent’s behavior, aligning naturally with the continuing setting. We present the first model-free RL framework that translates absolute liveness specifications to average-reward objectives. In contrast to prior work, our approach enables learning in communicating Markov decision processes (MDPs) without episodic resetting. We further introduce a reward structure for lexicographic multi-objective optimization, where the goal is to maximize an external average-reward objective among the policies that also maximize the satisfaction probability of a given absolute liveness omega-regular specification. Our method guarantees convergence in unknown communicating MDPs and supports on-the-fly reductions that do not require full knowledge of the environment, thus enabling model-free RL. Empirical results across various benchmarks demonstrate that our average-reward approach in the continuing setting is more effective than competing methods based on discounting.
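
    For reference, the objectives named in the abstract can be stated in standard notation. The LaTeX sketch below gives the textbook definitions of absolute liveness, the average-reward criterion, and the lexicographic objective; the symbols and the liminf convention are illustrative assumptions and may differ from the paper's own notation.

      % Illustrative definitions; notation is assumed, not taken from the paper.

      % Absolute liveness: a nonempty language L of infinite words over an
      % alphabet Sigma that no finite prefix can violate, i.e., prepending
      % any finite word to a satisfying word keeps it satisfying:
      \[
        L \neq \emptyset
        \qquad\text{and}\qquad
        \Sigma^{*} \cdot L \subseteq L .
      \]

      % Average reward of a policy pi started in state s (some authors use
      % limsup instead of liminf):
      \[
        \rho^{\pi}(s) \;=\; \liminf_{T \to \infty} \frac{1}{T}\,
          \mathbb{E}^{\pi}_{s}\Bigl[\, \textstyle\sum_{t=0}^{T-1} r_{t} \Bigr].
      \]

      % Lexicographic objective: among the policies that maximize the
      % probability of satisfying the specification phi, maximize the
      % external average reward:
      \[
        \Pi^{\ast} \;=\; \operatorname*{arg\,max}_{\pi}\;
          \Pr\nolimits^{\pi}_{s}[\varphi],
        \qquad
        \text{maximize } \rho^{\pi}(s) \ \text{over } \pi \in \Pi^{\ast}.
      \]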

publication date

  • January 1, 2026

Date in CU Experts

  • January 31, 2026 9:55 AM

Full Author List

  • Kazemi M; Perez M; Somenzi F; Soudjani S; Trivedi A; Velasquez A

author count

  • 6

Other Profiles

Electronic International Standard Serial Number (EISSN)

  • 1076-9757

Additional Document Info

volume

  • 85