We consider the problem of synthesizing optimal reactive controllers under an a priori unknown performance criterion, subject to a given temporal logic specification, for a system that interacts with an uncontrolled environment. We decouple the problem into two sub-problems. First, we extract a (maximally) permissive strategy for the system, which encodes multiple (possibly all) ways in which the system can react to the adversarial environment and satisfy the specification. Then, we quantify the a priori unknown performance criterion as a (still unknown) reward function and compute, using the so-called maximin-Q learning algorithm, an optimal strategy for the system within the operating envelope allowed by the permissive strategy. We establish both correctness (with respect to the temporal logic specification) and optimality (with respect to the a priori unknown performance criterion) of this two-step technique for a fragment of temporal logic specifications. For specifications beyond this fragment, correctness can still be preserved, but the learned strategy may be sub-optimal. We present an algorithm for the overall problem and demonstrate its use and computational requirements on a set of robot motion planning examples.
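To make the second step concrete, the following is a minimal sketch of maximin-Q learning restricted to a permissive strategy. It is an illustration under stated assumptions, not the authors' implementation: the toy one-state game, the reward table, and all names (`maximin_q_learning`, `ALLOWED`, `step`, `REWARD`) are hypothetical, and the maximin is taken over pure (deterministic) actions only. The permissive strategy is modeled as a map from each state to the set of system actions that keep the specification satisfiable, and the learner observes rewards only by sampling.

```python
import random
from collections import defaultdict

def maximin_q_learning(states, env_actions, allowed, step,
                       steps=20000, alpha=0.1, gamma=0.9, seed=0):
    """Learn Q(s, a, b) while only ever playing system actions in allowed[s]."""
    rng = random.Random(seed)
    Q = defaultdict(float)  # keyed by (state, system action, environment action)

    def worst_case(s, a):
        # Value of system action a if the environment responds adversarially.
        return min(Q[(s, a, b)] for b in env_actions)

    def value(s):
        # Maximin value over the actions the permissive strategy allows in s.
        return max(worst_case(s, a) for a in allowed[s])

    s = rng.choice(states)
    for _ in range(steps):
        a = rng.choice(sorted(allowed[s]))  # explore only inside the envelope
        b = rng.choice(env_actions)         # environment move (sampled here)
        s_next, r = step(s, a, b)           # reward is revealed only by sampling
        Q[(s, a, b)] += alpha * (r + gamma * value(s_next) - Q[(s, a, b)])
        s = s_next

    policy = {s: max(sorted(allowed[s]), key=lambda a: worst_case(s, a))
              for s in states}
    values = {s: value(s) for s in states}
    return Q, policy, values

# Hypothetical toy game: a single state and three system actions, one of which
# ("forbidden") is assumed to violate the specification and is therefore
# excluded by the permissive strategy despite its high reward.
STATES = [0]
ENV_ACTIONS = ["calm", "adversarial"]
ALLOWED = {0: {"safe", "risky"}}
REWARD = {
    ("safe", "calm"): 1.0, ("safe", "adversarial"): 1.0,
    ("risky", "calm"): 3.0, ("risky", "adversarial"): 0.0,
    ("forbidden", "calm"): 10.0, ("forbidden", "adversarial"): 10.0,
}

def step(s, a, b):
    # Deterministic toy dynamics: the game always returns to state 0.
    return 0, REWARD[(a, b)]

Q, policy, values = maximin_q_learning(STATES, ENV_ACTIONS, ALLOWED, step)
print(policy, values)
```

In this toy game the learner never samples the high-reward `forbidden` action, because exploration and the greedy policy are both confined to `ALLOWED`. This mirrors the separation described above: correctness comes from the permissive strategy, while learning only ranks the choices that remain.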
Derya Aksaray, Yasin Yazıcıoğlu, Ahmet Semi Asarkaya
Mingyu Cai, Zhangli Zhou, Lin Li, Shaoping Xiao, Zhen Kan
Daiying Tian, Hao Fang, Qingkai Yang, Haoyong Yu, Wenyu Liang, Yan Wu