In this article, we propose a (τ, ε)-greedy reinforcement learning algorithm for anti-jamming wireless communications, which repeats the previous action with probability τ and applies ε-greedy with probability 1-τ. The key idea is that the more valuable the previous action, the higher the probability of directly performing it again in the current time slot without learning. To this end, the average utility over several previous actions is first computed as a threshold for judging whether an action is valuable. Then, τ is formulated as a Gaussian-like function of the difference between this threshold and the utility of the previous action, which enables the wireless devices to find the optimal action faster in the early stage while still ensuring convergence. As a concrete example, the proposed algorithm is implemented in a wireless communication system against multiple jammers. Simulation results show that, compared with ε-greedy, (τ, ε)-greedy achieves a faster convergence rate and a slightly higher signal-to-interference-plus-noise ratio when applied to Q-learning, deep Q-networks (DQN), double DQN (DDQN), and prioritized experience replay based DDQN (PDDQN). The source code is available at https://github.com/GZHUDVL/tau-epsilon-greedy-RL.
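The selection rule described above can be sketched in a few lines. This is an illustrative reading of the abstract, not the authors' implementation: the function name, the `sigma` width parameter, and the exact Gaussian-like form of τ (here, a Gaussian in the clipped gap between the utility threshold and the previous action's utility, so that τ approaches 1 when the previous action beats the average) are assumptions; the released code at the repository above defines the actual formulation.

```python
import math
import random

def tau_epsilon_greedy(q_values, prev_action, prev_utility,
                       utility_history, epsilon=0.1, sigma=1.0):
    """(tau, epsilon)-greedy action selection (illustrative sketch).

    With probability tau, repeat the previous action without learning;
    otherwise fall back to standard epsilon-greedy over q_values.
    """
    # Threshold: average utility of several previous actions.
    threshold = sum(utility_history) / len(utility_history)
    # Gaussian-like tau in the (threshold - prev_utility) gap (assumed form):
    # tau -> 1 when the previous action beats the average, decays otherwise.
    gap = max(threshold - prev_utility, 0.0)
    tau = math.exp(-gap ** 2 / (2 * sigma ** 2))
    if random.random() < tau:
        return prev_action                      # repeat the previous action
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore uniformly
    return max(range(len(q_values)), key=q_values.__getitem__)  # greedy
```

Note that as learning stabilizes and the previous action's utility hovers near the running average, τ stays high and the device mostly repeats its action, which is what yields the faster early-stage convergence claimed above.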
Jie Qi, Hongming Zhang, Xiaolei Qi, Mugen Peng
Chen Wang, Yifan Chen, Zhiping Lin, Qiaoxin Chen, Liang Xiao
Zhiping Lin, Liang Xiao, Hongyi Chen, Zefang Lv