On-policy Reinforcement Learning Algorithm Concepts
Basic Concepts
The standard RL objective: maximise the expected sum of rewards (the return, $G(\tau)$) obtained by following a policy $\pi$.
The policy induces a value function,
\[v^{\pi}(s_{t}) \doteq E[G_{t} \mid s_{t}],\]
and the overall objective can be written as
\[J(\pi) \doteq \sum_{t} E_{s_{t}, a_{t} \sim \pi_{\theta}}\big[r_{t}(s_{t}, a_{t})\big] = E_{\tau \sim \pi_{\theta}}\big[G(\tau)\big].\]
The return estimate $\hat{G}_{t}$ can be calculated in either:
(1) the discounted-RL setting (--return_type discount):
\[\hat{G}_{t} = R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n}\hat{v}^{\pi}(s_{t+n}),\]
(2) the average-reward-RL setting (--return_type average):
\[\hat{G}_{t} = R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + R_{t+3} - r(\pi) + \cdots + R_{t+n} - r(\pi) + \hat{v}^{\pi}(s_{t+n}),\]
where $r(\pi)$ is the average reward under $\pi$. The advantage function is then
\[A^{\pi}(s_{t}, a_{t}) \doteq G_{t} - v^{\pi}(s_{t}).\]
The Maximum Entropy RL objective (used in the SAC algorithm, where the discounted setting is used to calculate the returns):
\[J(\pi) \doteq \sum_{t=1}^{\infty} E_{s_{0}, a_{0:t-1}\sim\pi} \Big[ r_{t} + \alpha H\big(\pi(\cdot|s_{t})\big) \Big].\]
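For concreteness, the following is a minimal NumPy sketch of the two return estimators above and the resulting advantage estimates, assuming a single rollout of rewards, a bootstrap value $\hat{v}^{\pi}(s_{t+n})$, and (for the average-reward setting) an estimate of $r(\pi)$; all function and variable names are illustrative, not the repository's API.

```python
import numpy as np

def n_step_returns(rewards, v_boot, return_type="discount",
                   gamma=0.99, avg_reward=0.0):
    """n-step bootstrapped return estimates G_hat[t] for one rollout.

    rewards[k] is R_{t+k+1}; v_boot is v_hat(s_{t+n}).
    """
    n = len(rewards)
    G = np.zeros(n)
    running = v_boot
    for k in reversed(range(n)):
        if return_type == "discount":
            # G_hat[k] = R_{k+1} + gamma * G_hat[k+1]
            running = rewards[k] + gamma * running
        else:
            # average-reward setting: subtract r(pi) instead of discounting
            running = rewards[k] - avg_reward + running
        G[k] = running
    return G

# Advantage estimate: A_hat[t] = G_hat[t] - v_hat(s_t)
rewards = np.array([1.0, 0.0, 0.5, 1.0])
values = np.array([0.9, 0.4, 0.6, 0.8])   # critic estimates v_hat(s_t)
G_hat = n_step_returns(rewards, v_boot=0.7, return_type="discount", gamma=0.99)
A_hat = G_hat - values
```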
A2C - Advantage Actor Critic
\[L^{\pi}_{A2C}(\theta) = \hat{E}_{t}\Big[ \log \pi_{\theta}(a_{t}|s_{t}) \hat{A}_{t} + \beta_{s}H\big(\pi(\cdot|s_{t})\big) \Big].\]
\[L^{v}(\phi) = \hat{E}_{t}\Big[\tfrac{1}{2}\big(v_{\phi}(s_{t}) - \hat{v}_{t}^{target}\big)^{2}\Big].\]
Reference:
@inproceedings{mnih2016asynchronous,
title={Asynchronous methods for deep reinforcement learning},
author={Mnih, Volodymyr and Badia, Adria Puigdomenech and Mirza, Mehdi and Graves, Alex and Lillicrap, Timothy and Harley, Tim and Silver, David and Kavukcuoglu, Koray},
booktitle={Proceedings of the 33rd International Conference on Machine Learning},
pages={1928--1937},
year={2016},
organization={PMLR}
}
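As a rough illustration, here is a PyTorch-style sketch of the A2C policy and value losses above; `dist` is assumed to be a `torch.distributions` object produced by the policy network, and the tensor names and the entropy coefficient `beta_s` are illustrative.

```python
def a2c_losses(dist, actions, adv, v_pred, v_target, beta_s=0.01):
    """A2C policy and value losses (signs chosen so both are minimised)."""
    log_prob = dist.log_prob(actions)     # log pi_theta(a_t|s_t)
    entropy = dist.entropy()              # H(pi(.|s_t))
    # The policy objective is maximised, so negate it for a gradient-descent optimiser.
    policy_loss = -(log_prob * adv.detach() + beta_s * entropy).mean()
    value_loss = 0.5 * (v_pred - v_target.detach()).pow(2).mean()
    return policy_loss, value_loss
```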
PPO - Proximal Policy Optimisation
\[L^{\pi}_{PPO}(\theta) = \hat{E}_{t}\Big[ \min\Big(\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{old}}(a_{t}|s_{t})}\hat{A}_{t},\; \mathrm{clip}\Big(\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{old}}(a_{t}|s_{t})}, 1-\epsilon, 1+\epsilon\Big)\hat{A}_{t}\Big) + \beta_{s}H\big(\pi(\cdot|s_{t})\big) \Big].\]
\[L^{v}(\phi) = \hat{E}_{t}\Big[\tfrac{1}{2}\big(v_{\phi}(s_{t}) - \hat{v}_{t}^{target}\big)^{2}\Big].\]
Reference:
@article{schulman2017proximal,
title={Proximal policy optimization algorithms},
author={Schulman, John and Wolski, Filip and Dhariwal, Prafulla and Radford, Alec and Klimov, Oleg},
journal={arXiv preprint arXiv:1707.06347},
year={2017},
doi={https://doi.org/10.48550/arXiv.1707.06347}
}
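A PyTorch-style sketch of the clipped PPO surrogate above, assuming `log_prob_old` was stored at rollout time; the clip range `eps` and entropy coefficient `beta_s` are illustrative hyperparameters, not the repository's settings.

```python
import torch

def ppo_policy_loss(log_prob, log_prob_old, adv, entropy, eps=0.2, beta_s=0.01):
    """Clipped PPO surrogate plus entropy bonus (negated so it can be minimised)."""
    ratio = torch.exp(log_prob - log_prob_old.detach())   # pi_theta / pi_theta_old
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    surrogate = torch.min(unclipped, clipped)
    return -(surrogate + beta_s * entropy).mean()
```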
G2P2C - Glucose Control by Glucose Prediction & Planning
G2P2C consists of three optimisation phases; the first phase is similar to PPO.
Model Learning Phase
\[L^{M^{\Pi}}(\theta) = \hat{E}_{t}\Big[ -\log\big(M^{\Pi}_{\theta}(g_{t+1}|s_{t}, a_{t})\big) + \beta_{1} d_{KL}\big[\pi_{\theta_{ppo}}(\cdot|s_{t}), \pi_{\theta}(\cdot|s_{t})\big] \Big].\]
\[L^{M^{V}}(\phi) = \hat{E}_{t}\Big[ -\log\big(M^{V}_{\phi}(g_{t+1}|s_{t}, a_{t})\big) + \beta_{2}\tfrac{1}{2}\big(v_{\phi_{ppo}}(s_{t}) - \hat{v}_{\phi}(s_{t})\big)^{2} \Big].\]
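As a hedged illustration of the model-learning losses above, the sketch below assumes the actor and critic each carry an auxiliary head that outputs a `torch.distributions` object over the next glucose value $g_{t+1}$ (e.g., a Gaussian), and that the post-PPO policy distribution and value estimates have been cached; the names and the default $\beta$ values are placeholders, not the G2P2C code's API.

```python
from torch.distributions import kl_divergence

def model_learning_losses(glucose_next, pred_actor, pred_critic,
                          pi_theta, pi_ppo, v_phi, v_ppo,
                          beta_1=0.01, beta_2=0.01):
    """Auxiliary model-learning losses: next-glucose NLL plus terms that keep
    the policy and value close to their post-PPO snapshots."""
    nll_actor = -pred_actor.log_prob(glucose_next)    # -log M^Pi_theta(g_{t+1}|s_t,a_t)
    nll_critic = -pred_critic.log_prob(glucose_next)  # -log M^V_phi(g_{t+1}|s_t,a_t)
    kl = kl_divergence(pi_ppo, pi_theta)              # d_KL[pi_ppo(.|s_t), pi_theta(.|s_t)]
    loss_actor = (nll_actor + beta_1 * kl).mean()
    loss_critic = (nll_critic
                   + beta_2 * 0.5 * (v_ppo.detach() - v_phi).pow(2)).mean()
    return loss_actor, loss_critic
```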
Planning Phase
\[\tau^{*} = \arg\max_{\tau}\Big[ \Big(\sum_{q=1}^{n_{plan}} R_{q}\Big) + v(s_{n_{plan}}) \Big].\]
\[L^{plan}(\theta) = \hat{E}_{t}\Big[ -\log\big(\pi_{\theta}(a^{*}_{t}|s_{t})\big) \Big].\]
Reference:
@article{hettiarachchi2024g2p2c,
title={G2P2C—A modular reinforcement learning algorithm for glucose control by glucose prediction and planning in Type 1 Diabetes},
author={Hettiarachchi, Chirath and Malagutti, Nicolo and Nolan, Christopher J and Suominen, Hanna and Daskalaki, Elena},
journal={Biomedical Signal Processing and Control},
volume={90},
pages={105839},
year={2024},
publisher={Elsevier}
}
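A heavily hedged sketch of the planning phase: candidate action sequences are assumed to be sampled from the current policy and rolled out through the learned glucose model, each sequence is scored by its summed reward plus the terminal value, and the policy is then fitted to the winning first action. Every helper here (`model_rollout`, `reward_fn`, `value_fn`) is hypothetical.

```python
def plan_and_distill(policy, value_fn, model_rollout, reward_fn, s_t,
                     n_candidates=16, n_plan=6):
    """Hypothetical planning step: score sampled action sequences with the
    learned model, then fit the policy to the best first action."""
    best_score, best_a0 = -float("inf"), None
    for _ in range(n_candidates):
        s, score, a0 = s_t, 0.0, None
        for q in range(n_plan):
            a = policy(s).sample()        # candidate action from the current policy
            if q == 0:
                a0 = a                    # remember the first action of this sequence
            s = model_rollout(s, a)       # predicted next state via the learned model
            score = score + reward_fn(s)  # accumulate sum_{q=1}^{n_plan} R_q
        score = score + value_fn(s)       # add the terminal value v(s_{n_plan})
        if float(score) > best_score:
            best_score, best_a0 = float(score), a0
    # L^plan(theta) = -log pi_theta(a*_t | s_t), with a*_t from the best sequence
    return -policy(s_t).log_prob(best_a0).mean()
```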
SAC - Soft Actor Critic
\[L^{Q}_{SAC}(\phi_{j}) = \hat{E}_{t}\Big[\tfrac{1}{2}\big(Q_{\phi_{j}}(s_{t}, a_{t}) - \hat{Q}_{t}^{target}\big)^{2}\Big] \quad \text{for } j \in \{1,2\}.\]
\[\bar{\phi}_{j} \leftarrow \tau \phi_{j} + (1-\tau)\bar{\phi}_{j} \quad \text{for } j \in \{1,2\}.\]
\[L^{\pi}_{SAC}(\theta) = \hat{E}_{t}\Big[ \alpha \log\big(\pi_{\theta}(a_{t}|s_{t})\big) - Q_{\phi}(s_{t}, a_{t}) \Big].\]
\[L^{\alpha}_{SAC}(\alpha) = \hat{E}_{t}\Big[ -\alpha \log \pi_{t}(a_{t}|s_{t}) - \alpha \bar{H} \Big].\]
Reference:
@article{haarnoja2018soft,
title={Soft actor-critic algorithms and applications},
author={Haarnoja, Tuomas and Zhou, Aurick and Hartikainen, Kristian and Tucker, George and Ha, Sehoon and Tan, Jie and Kumar, Vikash and Zhu, Henry and Gupta, Abhishek and Abbeel, Pieter and others},
journal={arXiv preprint arXiv:1812.05905},
year={2018}
}
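A PyTorch-style sketch of the SAC losses and the target-network update above, assuming the soft target Q_hat_t has already been computed with the target critics (no gradient) and that `alpha` is a learnable tensor; names and the default `tau` are illustrative.

```python
import torch

def sac_losses(q1_pred, q2_pred, q_target, log_prob_new, q_min_new,
               alpha, target_entropy):
    """SAC losses: twin-critic regression, actor, and temperature (alpha).

    q1_pred, q2_pred : Q_{phi_j}(s_t, a_t) for the batch actions
    q_target         : bootstrapped soft target Q_hat_t (computed without grad)
    log_prob_new     : log pi_theta(a_t|s_t) for freshly sampled actions
    q_min_new        : min_j Q_{phi_j}(s_t, a_t) for those sampled actions
    """
    critic_loss = 0.5 * ((q1_pred - q_target).pow(2)
                         + (q2_pred - q_target).pow(2)).mean()
    actor_loss = (alpha.detach() * log_prob_new - q_min_new).mean()
    alpha_loss = (-alpha * (log_prob_new.detach() + target_entropy)).mean()
    return critic_loss, actor_loss, alpha_loss

@torch.no_grad()
def polyak_update(critic, critic_target, tau=0.005):
    """Target-network update: phi_bar <- tau * phi + (1 - tau) * phi_bar."""
    for p, p_bar in zip(critic.parameters(), critic_target.parameters()):
        p_bar.mul_(1.0 - tau).add_(p, alpha=tau)
```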