This work introduces a new reinforcement learning (RL) method that performs safe exploration and estimates the uncertainty of state-action pairs using Monte Carlo (MC) dropout. The proposed method outperforms biased exploration in terms of the reward obtained during training. The study also investigates the sensitivity of the algorithm to the uncertainty-threshold hyperparameter, suggesting that a lower value leads to a safer policy, while a higher value can result in faster convergence. The proposed algorithm is evaluated on guiding a 2-degree-of-freedom planar robot in its task space, showing that it converges to an optimal policy while satisfying safety constraints.
Qisong Yang, Thiago D. Simão, Nils Jansen, Simon H. Tindemans, Matthijs T. J. Spaan
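The MC-dropout uncertainty estimate mentioned in the abstract can be illustrated with a minimal sketch: run several stochastic forward passes of a Q-network with random dropout masks and take the spread of the outputs as the uncertainty of a state-action pair. All names, network sizes, and the threshold value below are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny Q-network weights (illustrative, not the paper's model).
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 1))

def q_forward(x, drop_p=0.1):
    """One stochastic forward pass with a fresh random dropout mask."""
    h = np.maximum(x @ W1, 0.0)          # ReLU hidden layer
    mask = rng.random(h.shape) > drop_p  # Bernoulli dropout mask
    h = h * mask / (1.0 - drop_p)        # inverted-dropout scaling
    return float(h @ W2)

def mc_dropout_estimate(x, n_samples=100):
    """Mean and std. dev. of Q-estimates over repeated dropout passes."""
    samples = [q_forward(x) for _ in range(n_samples)]
    return float(np.mean(samples)), float(np.std(samples))

# An illustrative state-action feature vector.
x = rng.normal(size=(8,))
q_mean, q_unc = mc_dropout_estimate(x)

# Safe-exploration gate: treat the action as safe to explore only if its
# uncertainty is below the threshold hyperparameter the abstract discusses.
THRESHOLD = 5.0  # illustrative value; the paper studies sensitivity to it
is_safe = q_unc < THRESHOLD
```

Lowering `THRESHOLD` rejects more uncertain actions (a safer policy); raising it admits more of them, which can speed up convergence, mirroring the trade-off the abstract describes.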