DISSERTATION

Towards Practical Offline Reinforcement Learning: Sample Efficient Policy Selection and Evaluation

Abstract

Offline reinforcement learning (RL) involves learning policies from datasets rather than through online interaction. This dissertation first investigates a critical component of offline RL: offline policy selection (OPS). Because most offline RL algorithms require careful hyperparameter tuning, the best policy must be selected from a set of candidate policies before deployment. In the first part of the dissertation, we clarify when OPS is sample efficient by connecting it to off-policy policy evaluation (OPE) and Bellman error estimation. The dissertation then presents algorithms that leverage offline data. We begin by examining environments that include exogenous variables, over which the agent has limited impact, and endogenous variables, which are under full agent control. We show that policy evaluation and selection become straightforward under such conditions. Additionally, we present an algorithm based on Fitted-Q Iteration with data augmentation and show that it can find nearly optimal policies with polynomial sample complexity. We then study OPE in non-stationary environments and introduce the regression-assisted doubly robust estimator, which incorporates past data without introducing a large bias and improves on existing OPE estimators by using auxiliary information and a regression approach. We evaluate our algorithms across a variety of problems, some built using real-world datasets, including optimal order execution, inventory management, hybrid car control, and recommendation systems.

Keywords:
Reinforcement learning; Leverage (statistics); Estimator; Selection (genetic algorithm); Variety (cybernetics); Set (abstract data type); Hyperparameter; Sample (material)

Topics

Reinforcement Learning in Robotics
Physical Sciences → Computer Science → Artificial Intelligence
Advanced Bandit Algorithms Research
Social Sciences → Decision Sciences → Management Science and Operations Research
Adaptive Dynamic Programming Control
Physical Sciences → Computer Science → Computational Theory and Mathematics
