JOURNAL ARTICLE

Proximal Policy Optimization With Policy Feedback

Yang Gu, Yuhu Cheng, C. L. Philip Chen, Xuesong Wang

Year: 2021 · Journal: IEEE Transactions on Systems, Man, and Cybernetics: Systems · Vol: 52 (7) · Pages: 4600-4610 · Publisher: Institute of Electrical and Electronics Engineers

Abstract

Proximal policy optimization (PPO) is a deep reinforcement learning algorithm based on the actor–critic (AC) architecture. In the classic AC architecture, the Critic (value) network estimates the value function while the Actor (policy) network optimizes the policy according to the estimated value function. The efficiency of the classic AC architecture is limited because the policy does not directly participate in the value function update. As a result, the value function estimate can be inaccurate, which degrades the performance of the PPO algorithm. To address this, we designed a novel AC architecture with policy feedback (AC-PF) by introducing the policy into the update process of the value function, and further proposed PPO with policy feedback (PPO-PF). For the AC-PF architecture, the policy-based expected (PBE) value function and discounted reward formulas are designed by drawing inspiration from expected Sarsa. To enhance the sensitivity of the value function to changes in the policy and to improve the accuracy of PBE value estimation in the early learning stage, we proposed a policy update method based on a clipped discount factor. Moreover, we specifically defined the loss functions of the policy network and value network to ensure that the policy update of PPO-PF satisfies the unbiased estimation of the trust region. Experiments on Atari games and control tasks show that, compared to PPO, PPO-PF has faster convergence, higher reward, and smaller reward variance.
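The abstract contrasts the classic critic target with a policy-based expected (PBE) target inspired by expected Sarsa, alongside PPO's clipped policy update. A minimal sketch of these two ingredients is given below; the function names and toy numbers are illustrative only and do not reproduce the paper's implementation:

```python
def td_target(reward, next_value, gamma=0.99):
    # Classic one-step critic target: r + gamma * V(s');
    # the policy enters only indirectly, through V(s').
    return reward + gamma * next_value


def policy_expected_target(reward, next_q_values, policy_probs, gamma=0.99):
    # Expected-Sarsa-style target: r + gamma * sum_a pi(a|s') * Q(s', a).
    # The current policy's probabilities weight the next-state action
    # values, so the value update responds directly to policy changes.
    expected_v = sum(p * q for p, q in zip(policy_probs, next_q_values))
    return reward + gamma * expected_v


def ppo_clipped_loss(ratio, advantage, eps=0.2):
    # Standard PPO clipped surrogate for a single sample, where
    # ratio = pi_new(a|s) / pi_old(a|s); minimized by gradient descent.
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    return -min(unclipped, clipped)


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0]          # toy Q(s', a) values
    pi = [0.2, 0.3, 0.5]         # toy pi(a|s') probabilities
    print(policy_expected_target(0.5, q, pi, gamma=0.9))  # 0.5 + 0.9 * 2.3
    print(ppo_clipped_loss(1.5, 1.0))                     # ratio clipped to 1.2
```

A shift of probability mass from low- to high-valued actions changes `policy_expected_target` immediately, whereas `td_target` only reflects it after the critic is retrained; this is the sensitivity the AC-PF design aims for.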

Keywords:
Reinforcement learning, Value function, Bellman equation, Computer science, Architecture, Variance, Value network, Convergence, Mathematical optimization, Artificial intelligence, Machine learning, Mathematics

Metrics

Cited By: 153
FWCI (Field Weighted Citation Impact): 8.89
Refs: 44
Citation Normalized Percentile: 0.98 (is in top 1% and top 10%)

Topics

Reinforcement Learning in Robotics
Physical Sciences →  Computer Science →  Artificial Intelligence
Adaptive Dynamic Programming Control
Physical Sciences →  Computer Science →  Computational Theory and Mathematics

Related Documents

JOURNAL ARTICLE

Off-Policy Proximal Policy Optimization

Wenjia Meng, Qian Zheng, Gang Pan, Yilong Yin

Journal: Proceedings of the AAAI Conference on Artificial Intelligence · Year: 2023 · Vol: 37 (8) · Pages: 9162-9170
BOOK-CHAPTER

Proximal policy optimization

Ge Cheng

Publisher: Elsevier eBooks · Year: 2025 · Pages: 123-135
JOURNAL ARTICLE

Policy Optimization in Reinforcement Learning: Proximal Policy Optimization

Saurugger, Bernd

Journal: Zenodo (CERN European Organization for Nuclear Research) · Year: 2023