Provably Efficient Algorithms for Safe Reinforcement Learning

Honghao Wei

doi:10.7302/8204

Year: 2023 Journal: Deep Blue (University of Michigan) Publisher: University of Michigan–Ann Arbor

Abstract

Safe reinforcement learning (RL) is an area of research focused on developing algorithms and methods that ensure the safety of RL agents during learning and decision-making. The goal is to enable RL agents to interact with their environments and learn optimal decisions while avoiding actions that can lead to harmful or undesirable outcomes. This dissertation provides a comprehensive study of model-free, simulator-free reinforcement learning algorithms for Constrained Markov Decision Processes (CMDPs) with sublinear regret and zero constraint violation, focusing on three settings: (1) episodic CMDPs; (2) infinite-horizon average-reward CMDPs; and (3) non-stationary episodic CMDPs. The first part provides the first model-free, simulator-free safe-RL algorithm with sublinear regret and zero constraint violation. The algorithm is named Triple-Q because it includes three key components: a Q-function (also called an action-value function) for the cumulative reward, a Q-function for the cumulative utility of the constraint, and a virtual queue that (over-)estimates the cumulative constraint violation. Under Triple-Q, at each step, an action is chosen based on a pseudo-Q-value that combines the three "Q" values (the two Q-functions and the virtual queue). The algorithm updates the reward and utility Q-values with learning rates that depend on the visit counts of the corresponding (state, action) pairs and are periodically reset. In the episodic CMDP setting, Triple-Q achieves sublinear regret. Furthermore, Triple-Q guarantees zero constraint violation, both in expectation and with high probability, when the number of episodes is sufficiently large. Finally, Triple-Q is computationally efficient: its computational complexity is similar to that of SARSA for unconstrained MDPs. In Chapter III, the results are extended to infinite-horizon average-reward CMDPs, where the proposed algorithm again guarantees sublinear regret and zero constraint violation.
Then, in Chapter IV, the dissertation studies safe RL in a more challenging setting, non-stationary CMDPs, where the rewards, utilities, and dynamics are time-varying and likely unknown a priori. In the non-stationary environment, the reward functions, utility functions, and transition kernels can vary arbitrarily over time as long as the cumulative variations do not exceed certain variation budgets. We propose the first model-free, simulator-free RL algorithms with sublinear regret and zero constraint violation for non-stationary CMDPs in both tabular and linear function approximation settings, with provable performance guarantees. Our regret and constraint-violation bounds for the tabular case match the corresponding best results for stationary CMDPs when the total variation budget is known. Additionally, we present a general framework for addressing the well-known challenges associated with analyzing non-stationary CMDPs, without requiring prior knowledge of the variation budget. We apply the approach to both tabular and linear function approximation settings.
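
The three Triple-Q components named in the abstract (a reward Q-function, a utility Q-function, and a virtual queue) can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the class name `TripleQSketch`, the parameters `rho`, `eta`, `chi`, and `eps`, the exact form of the pseudo-Q-value, the learning-rate formula, and the per-episode queue update are all assumptions made for illustration. Exploration bonuses, which an algorithm of this kind would normally need, are omitted.

```python
class TripleQSketch:
    """Illustrative sketch only; names, constants, and update forms are assumptions."""

    def __init__(self, n_states, n_actions, horizon, rho,
                 eta=1.0, chi=1.0, eps=0.0):
        self.S, self.A, self.H = n_states, n_actions, horizon
        self.rho = rho    # assumed constraint: cumulative utility per episode >= rho
        self.eta = eta    # hypothetical scaling of the virtual queue
        self.chi = chi    # hypothetical learning-rate parameter
        self.eps = eps    # hypothetical slack that tightens the constraint
        self.queue = 0.0  # virtual queue tracking accumulated constraint violation
        self.reset_learning()

    def reset_learning(self):
        """Periodic reset: optimistic Q-values and zeroed visit counts."""
        H, S, A = self.H, self.S, self.A
        self.q_reward = [[[float(H)] * A for _ in range(S)] for _ in range(H)]
        self.q_utility = [[[float(H)] * A for _ in range(S)] for _ in range(H)]
        self.counts = [[[0] * A for _ in range(S)] for _ in range(H)]

    def pseudo_q(self, h, s):
        """Assumed combination: reward Q plus queue-weighted utility Q."""
        w = self.queue / self.eta
        return [qr + w * qu
                for qr, qu in zip(self.q_reward[h][s], self.q_utility[h][s])]

    def act(self, h, s):
        """Greedy action with respect to the pseudo-Q-value."""
        values = self.pseudo_q(h, s)
        return max(range(self.A), key=values.__getitem__)

    def update(self, h, s, a, r, g, s_next):
        """Update both Q-tables with a visit-count-dependent learning rate."""
        self.counts[h][s][a] += 1
        t = self.counts[h][s][a]
        alpha = (self.chi + 1.0) / (self.chi + t)  # assumed learning-rate form
        if h + 1 < self.H:
            a_next = self.act(h + 1, s_next)
            v_r = self.q_reward[h + 1][s_next][a_next]
            v_g = self.q_utility[h + 1][s_next][a_next]
        else:
            v_r = v_g = 0.0  # no value beyond the last step of the episode
        self.q_reward[h][s][a] = ((1 - alpha) * self.q_reward[h][s][a]
                                  + alpha * (r + v_r))
        self.q_utility[h][s][a] = ((1 - alpha) * self.q_utility[h][s][a]
                                   + alpha * (g + v_g))

    def end_episode(self, episode_utility):
        """Queue grows when the observed utility falls short of rho + eps."""
        self.queue = max(self.queue + self.rho + self.eps - episode_utility, 0.0)
```

In this sketch the queue weight acts like an adaptive multiplier: when past episodes have collected too little utility, the queue grows and the greedy action leans toward the utility Q-function; once the constraint is comfortably met, the queue drains to zero and action selection reverts to the reward Q-function alone.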

Keywords:

Reinforcement learning, Computer science, Algorithm, Artificial intelligence

Topics

Reinforcement Learning in Robotics

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Provably Efficient Reinforcement Learning

Implicit Safe Set Algorithm for Provably Safe Reinforcement Learning

Provably efficient information-directed sampling algorithms for multi-agent reinforcement learning

Leveraging Analytic Gradients in Provably Safe Reinforcement Learning

Reducing Safety Interventions in Provably Safe Reinforcement Learning