Xuesong Wang, Diyuan Hou, Longyang Huang, Yuhu Cheng
Offline-online reinforcement learning combines the sample efficiency of offline learning with the exploration and trial-and-error capability of online learning. Meanwhile, partial observability of the environment remains a challenging issue in reinforcement learning. This paper proposes a Partially Observable Offline-Online Actor-Critic (PO3AC) algorithm. In PO3AC, differently weighted behavior cloning regularization terms are applied during the offline and online training phases to alleviate distribution shift. The actor and critic are trained separately, and a recurrent neural network is used to train a state prediction model that maps observations to belief states. Experimental results on the PyBullet physics engine demonstrate that the proposed algorithm outperforms existing algorithms.
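The two core ingredients named in the abstract, a recurrent belief-state encoder and an actor objective regularized by weighted behavior cloning, can be sketched as follows. The abstract does not specify PO3AC's exact losses or architecture, so this is a minimal illustrative sketch assuming a simple RNN cell for the encoder and a TD3+BC-style weighted BC term; all names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration only.
obs_dim, belief_dim, act_dim = 4, 8, 2

# --- Recurrent belief-state encoder (simple tanh RNN cell; the paper's
# actual recurrent architecture is not given in the abstract) ---
W_in = rng.normal(scale=0.1, size=(belief_dim, obs_dim))
W_rec = rng.normal(scale=0.1, size=(belief_dim, belief_dim))

def encode(observations):
    """Map a sequence of observations to a belief state."""
    b = np.zeros(belief_dim)
    for o in observations:
        b = np.tanh(W_in @ o + W_rec @ b)
    return b

# --- Actor objective with a weighted behavior-cloning regularizer.
# The weighting scheme is assumed (TD3+BC-style); PO3AC's exact form
# and how the weight differs between offline and online phases are
# not specified in the abstract. ---
def actor_loss(q_value, pi_action, data_action, bc_weight):
    # Maximize Q (minimize -Q) plus a weighted BC term that keeps the
    # policy close to the dataset action, alleviating distribution shift.
    bc_term = np.sum((pi_action - data_action) ** 2)
    return -q_value + bc_weight * bc_term

obs_seq = rng.normal(size=(5, obs_dim))  # a short observation trajectory
belief = encode(obs_seq)
pi_action = np.tanh(belief[:act_dim])    # toy deterministic policy head
loss = actor_loss(q_value=1.3, pi_action=pi_action,
                  data_action=np.zeros(act_dim), bc_weight=2.5)
print(belief.shape, float(loss))
```

In an offline-online scheme of this kind, `bc_weight` would typically be kept large during offline training (staying close to the dataset) and reduced during online fine-tuning to permit exploration.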