We extend the knowledge gradient (KG) policy to multi-objective multi-armed bandit problems in order to explore the Pareto-optimal arms efficiently. We consider two order relations on the mean vectors: Pareto dominance and scalarization functions. Pareto KG finds the optimal arms using Pareto search, while scalarized KG transforms each multi-objective arm into a single-objective arm to find the optimal arms. To measure the performance of the proposed algorithms, we propose three regret measures. We compare the KG policy with UCB1 on a multi-objective multi-armed bandit problem, where KG outperforms UCB1.
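As an illustration of the two order relations, the sketch below applies them to a set of known mean vectors. It is a minimal example, not the KG policy itself. It assumes maximization, uses a weighted sum as one common choice of scalarization function, and the function names, mean vectors, and weights are hypothetical.

```python
def dominates(u, v):
    """True if u Pareto-dominates v: no worse in every objective, better in at least one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def pareto_optimal_arms(means):
    """Indices of arms whose mean vector is not dominated by any other arm."""
    return [
        i
        for i, mu_i in enumerate(means)
        if not any(dominates(mu_j, mu_i) for j, mu_j in enumerate(means) if j != i)
    ]


def linear_scalarization(mu, weights):
    """Weighted sum of the objectives (assumed scalarization; other choices exist)."""
    return sum(w * m for w, m in zip(weights, mu))


# Hypothetical two-objective mean vectors for four arms.
means = [(0.8, 0.2), (0.5, 0.5), (0.2, 0.8), (0.4, 0.4)]

# Pareto ordering: arm 3 is dominated by arm 1, the others are incomparable.
front = pareto_optimal_arms(means)

# Scalarized ordering: each arm collapses to a single value, giving a total order.
weights = (0.7, 0.3)
scores = [linear_scalarization(mu, weights) for mu in means]
best = max(range(len(means)), key=lambda i: scores[i])
```

Under the Pareto relation several arms can be optimal at once, whereas a fixed weight vector singles out one arm, so different weight vectors are needed to reach different Pareto-optimal arms.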
Saba Q. Yahyaa, Mădălina M. Drugan, Bernard Manderick