Offline reinforcement learning (RL) involves learning policies from previously collected datasets rather than through online interaction with the environment. This dissertation first investigates a critical component of offline RL: offline policy selection (OPS). Because most offline RL algorithms require careful hyperparameter tuning, we must select the best policy among a set of candidate policies before deployment. In the first part of the dissertation, we clarify when OPS is sample efficient by establishing a clear connection to off-policy policy evaluation (OPE) and Bellman error estimation. The dissertation then presents algorithms that leverage offline data. We begin by examining environments that contain exogenous variables, over which the agent has limited influence, and endogenous variables, which are fully under the agent's control. We show that policy evaluation and selection become straightforward under such conditions. We also present an algorithm based on Fitted-Q Iteration with data augmentation and show that it can find nearly optimal policies with polynomial sample complexity. We then study OPE in non-stationary environments and introduce the regression-assisted doubly robust estimator, which incorporates past data without introducing a large bias and improves on existing OPE estimators through the use of auxiliary information and a regression approach. We evaluate our algorithms on a variety of problems, some built from real-world datasets, including optimal order execution, inventory management, hybrid car control, and recommendation systems.
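As background for the Fitted-Q Iteration variant mentioned above, the following is a minimal sketch of classical Fitted-Q Iteration on a fixed offline dataset, not the dissertation's augmented algorithm. For simplicity it assumes integer state and action indices and uses a lookup table that averages Bellman targets per state-action pair in place of a general function approximator; the function name and transition format are illustrative choices.

```python
import numpy as np

def fitted_q_iteration(transitions, num_actions, gamma=0.99, iters=50):
    """Classical Fitted-Q Iteration on a fixed offline dataset (tabular sketch).

    transitions: list of (s, a, r, s_next, done) with integer states/actions.
    Each iteration regresses Q onto one-step Bellman targets computed from
    the dataset; here the "regressor" is a table averaging targets per (s, a).
    """
    num_states = 1 + max(max(s, sn) for s, _, _, sn, _ in transitions)
    q = np.zeros((num_states, num_actions))
    for _ in range(iters):
        targets_sum = np.zeros_like(q)
        counts = np.zeros_like(q)
        for s, a, r, sn, done in transitions:
            # One-step Bellman target built from the previous Q estimate.
            target = r + (0.0 if done else gamma * q[sn].max())
            targets_sum[s, a] += target
            counts[s, a] += 1
        # "Fit" step: average the targets observed for each (s, a) pair.
        mask = counts > 0
        q[mask] = targets_sum[mask] / counts[mask]
    return q
```

The key property is that all targets come from the fixed dataset; no new environment interaction occurs between iterations, which is what makes the procedure applicable in the offline setting.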
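The regression-assisted estimator mentioned above builds on the standard doubly robust OPE estimator; a minimal one-step (bandit-form) version of that standard estimator is sketched below for orientation. The function name and argument layout are illustrative, and the regression-assisted, non-stationary extension described in the dissertation is not reproduced here.

```python
import numpy as np

def doubly_robust_estimate(rewards, behavior_probs, target_probs, q_hat, v_hat):
    """Standard doubly robust off-policy value estimate (one-step form).

    rewards        : observed rewards r_i logged under the behavior policy
    behavior_probs : mu(a_i | s_i), probability the behavior policy chose a_i
    target_probs   : pi(a_i | s_i), probability the target policy chooses a_i
    q_hat          : model estimate of Q(s_i, a_i) for the logged action
    v_hat          : model estimate of V^pi(s_i) = E_{a ~ pi}[Q(s_i, a)]

    The model term v_hat keeps variance low, while the importance-weighted
    residual rho * (r - q_hat) corrects any bias in the model.
    """
    rho = target_probs / behavior_probs  # per-sample importance ratios
    return float(np.mean(v_hat + rho * (rewards - q_hat)))
```

The estimator is unbiased if either the importance ratios or the value model is correct, which is the "doubly robust" property that later refinements, including regression-assisted variants, aim to preserve while reducing variance.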