Safe reinforcement learning (RL) is an area of research focused on developing algorithms and methods that ensure the safety of RL agents during learning and decision-making. The goal is to enable RL agents to interact with their environments and learn optimal decisions while avoiding actions that can lead to harmful or undesirable outcomes. This dissertation provides a comprehensive study of {\em model-free}, {\em simulator-free} reinforcement learning algorithms for Constrained Markov Decision Processes (CMDPs) with sublinear regret and zero constraint violation, focusing on three settings: $(1)$ episodic CMDPs; $(2)$ infinite-horizon average-reward CMDPs; and $(3)$ non-stationary episodic CMDPs. The first part provides the first model-free, simulator-free safe-RL algorithm with sublinear regret and zero constraint violation. The algorithm is named Triple-Q because it includes three key components: a Q-function (also called an action-value function) for the cumulative reward, a Q-function for the cumulative utility of the constraint, and a virtual Queue that (over)estimates the cumulative constraint violation. Under Triple-Q, at each step, an action is chosen based on a pseudo-Q-value that combines the three Q values. The algorithm updates the reward and utility Q values with learning rates that depend on the visit counts of the corresponding (state, action) pairs and are periodically reset. In the episodic CMDP setting, Triple-Q achieves sublinear regret. Furthermore, Triple-Q guarantees zero constraint violation, both in expectation and with high probability, when the number of episodes is sufficiently large. Finally, the computational complexity of Triple-Q is similar to that of SARSA for unconstrained MDPs, so it is computationally efficient. In Chapter III, the results are extended to infinite-horizon average-reward CMDPs. The proposed algorithm guarantees sublinear regret and zero constraint violation.
Then, in Chapter IV, the dissertation studies safe RL in a more challenging setting, non-stationary CMDPs, where the rewards, utilities, and dynamics are time-varying and likely unknown a priori. In the non-stationary environment, the reward functions, utility functions, and transition kernels can vary arbitrarily over time as long as the cumulative variations do not exceed certain variation budgets. We propose the first model-free, simulator-free RL algorithms with sublinear regret and zero constraint violation for non-stationary CMDPs in both the tabular and linear function approximation settings, with provable performance guarantees. Our regret and constraint-violation bounds for the tabular case match the best corresponding results for stationary CMDPs when the total variation budget is known. Additionally, we present a general framework for addressing the well-known challenges associated with analyzing non-stationary CMDPs, without requiring prior knowledge of the variation budget. We apply the approach to both the tabular and linear function approximation settings.
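
To make the three components of Triple-Q named above concrete, the following is a minimal tabular sketch in Python of the episodic setting. It is an illustration under stated assumptions, not the dissertation's algorithm: the abstract does not give the exact combination rule, learning-rate schedule, or queue update, so the forms used here (a pseudo-Q-value of reward Q plus queue-weighted utility Q, a learning rate of `(chi + 1) / (chi + n)`, and a queue that grows when the observed utility falls short of a threshold `rho`) are assumed, as are all names and parameters (`TripleQSketch`, `eta`, `chi`, `rho`, `eps`). Exploration bonuses and the scheduling of resets are omitted.

```python
import numpy as np


class TripleQSketch:
    """Illustrative tabular agent: two Q-tables plus a virtual queue.

    All names, parameters, and update forms are assumptions made for
    illustration; they are not taken from the source text.
    """

    def __init__(self, n_states, n_actions, horizon, eta=1.0, chi=1.0):
        self.horizon = horizon
        self.eta = eta  # assumed scaling of the queue weight
        self.chi = chi  # assumed learning-rate parameter
        shape = (horizon, n_states, n_actions)
        # Optimistic initialization: no return can exceed the horizon
        # when per-step rewards and utilities lie in [0, 1].
        self.q_reward = np.full(shape, float(horizon))
        self.q_utility = np.full(shape, float(horizon))
        self.visits = np.zeros(shape, dtype=int)
        self.queue = 0.0  # virtual queue tracking constraint violation

    def pseudo_q(self, h, s):
        # Assumed combination: reward Q plus queue-weighted utility Q.
        return self.q_reward[h, s] + (self.queue / self.eta) * self.q_utility[h, s]

    def act(self, h, s):
        # Greedy action with respect to the pseudo-Q-value.
        return int(np.argmax(self.pseudo_q(h, s)))

    def update(self, h, s, a, reward, utility, s_next):
        self.visits[h, s, a] += 1
        n = self.visits[h, s, a]
        # Learning rate depends on the visit count of (h, s, a).
        alpha = (self.chi + 1.0) / (self.chi + n)
        if h + 1 < self.horizon:
            a_next = self.act(h + 1, s_next)
            v_reward = self.q_reward[h + 1, s_next, a_next]
            v_utility = self.q_utility[h + 1, s_next, a_next]
        else:
            v_reward = v_utility = 0.0  # episode ends after the last step
        self.q_reward[h, s, a] = (
            (1 - alpha) * self.q_reward[h, s, a] + alpha * (reward + v_reward)
        )
        self.q_utility[h, s, a] = (
            (1 - alpha) * self.q_utility[h, s, a] + alpha * (utility + v_utility)
        )

    def update_queue(self, avg_utility, rho, eps=0.0):
        # Queue grows when average utility falls short of rho + eps.
        self.queue = max(self.queue + rho + eps - avg_utility, 0.0)

    def reset_learning(self):
        # Periodic reset of visit counts and Q-tables (queue is kept).
        self.visits[:] = 0
        self.q_reward[:] = float(self.horizon)
        self.q_utility[:] = float(self.horizon)
```

In this sketch, a larger queue value shifts action selection toward actions with high estimated utility, which is one way a single greedy rule can trade reward against constraint satisfaction without solving a constrained optimization problem at each step.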