[Artificial Intelligence] Mathematical Derivation of the Policy Gradient Method

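This article summarizes material from Prof. Hung-yi Lee's (李宏毅) Spring 2021 machine learning course and briefly walks through the mathematical derivation of the policy gradient method in deep reinforcement learning.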

Bilibili link to Prof. Hung-yi Lee's course:
Hung-yi Lee (李宏毅), Machine Learning, Spring 2021

Let:
Trajectory of one game episode: $\tau$
Policy of the player (actor), represented by its parameters: $\theta$

The expected reward can then be estimated by sampling N trajectories $\tau^{1}, \dots, \tau^{N}$ from $P(\tau | \theta)$:
$\bar R_{\theta} = \sum_{\tau} R(\tau) P(\tau | \theta) \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^{n})$
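To make the approximation concrete, here is a minimal sketch that compares the exact sum over trajectories with the sample mean. The one-step game, the sigmoid policy, and the reward values are all assumptions chosen for illustration; none of them come from the course material.

```python
import math
import random

def action_probs(theta):
    """Hypothetical one-parameter policy: P(a=1) = sigmoid(theta)."""
    p1 = 1.0 / (1.0 + math.exp(-theta))
    return [1.0 - p1, p1]

# Assumed rewards R(tau) for the two possible trajectories (a=0) and (a=1)
REWARDS = [1.0, 3.0]

def exact_expected_reward(theta):
    # Sum over all trajectories of R(tau) * P(tau | theta)
    return sum(r * p for r, p in zip(REWARDS, action_probs(theta)))

def sampled_expected_reward(theta, n_samples, rng):
    # (1/N) * sum_n R(tau^n), with each tau^n drawn from P(tau | theta)
    probs = action_probs(theta)
    total = 0.0
    for _ in range(n_samples):
        a = 1 if rng.random() < probs[1] else 0
        total += REWARDS[a]
    return total / n_samples

rng = random.Random(0)
exact = exact_expected_reward(0.0)
estimate = sampled_expected_reward(0.0, 20000, rng)
print(exact, estimate)
```

With enough samples the two printed values should be close; the gap shrinks roughly as $1/\sqrt{N}$.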

The optimal policy is:
$\theta^{*} = \arg \max_{\theta} \bar R_{\theta}$

The optimization method is gradient ascent.
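For reference, one gradient ascent step can be written as follows, where the learning rate $\eta$ is a symbol introduced here for illustration and is not part of the notation above:
$\theta \leftarrow \theta + \eta \triangledown \bar R_{\theta}$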
The gradient of the expected reward is:
$\triangledown \bar R_{\theta} = \sum_{\tau} R(\tau) \triangledown P(\tau | \theta) = \sum_{\tau} R(\tau) P(\tau | \theta) \frac {\triangledown P(\tau | \theta)} {P(\tau | \theta)} = \sum_{\tau} R(\tau) P(\tau | \theta) \triangledown \ln P(\tau | \theta) \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^{n}) \triangledown \ln P(\tau^{n} | \theta)$

Here, the step that introduces the logarithm relies on the identity:
$\frac {d \ln (f(x))} {dx} = \frac{1}{f(x)} \frac{df(x)}{dx}$
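The identity can be checked numerically. The sketch below uses an arbitrary positive function $f(x) = x^2 + 1$ (an assumption made only for this check) and compares a central finite difference of $\ln f(x)$ with $\frac{1}{f(x)} \frac{df(x)}{dx}$.

```python
import math

def f(x):
    # Any positive, differentiable function would do; this one is an arbitrary choice
    return x * x + 1.0

def df(x):
    # Analytic derivative of f
    return 2.0 * x

def central_diff(g, x, h=1e-6):
    # Numerical derivative of g at x
    return (g(x + h) - g(x - h)) / (2.0 * h)

x = 1.5
lhs = central_diff(lambda v: math.log(f(v)), x)  # d ln f(x) / dx
rhs = df(x) / f(x)                               # (1 / f(x)) * df(x)/dx
print(lhs, rhs)
```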

The probability of a trajectory occurring under a given policy is:
$P(\tau | \theta) = p(s_1) p(a_1 | s_1, \theta) p(r_1, s_2 | s_1, a_1) p(a_2 | s_2, \theta) p(r_2, s_3 | s_2, a_2) \cdots = p(s_1) \prod_{t=1}^{T} p(a_t | s_t, \theta) p(r_t, s_{t+1} | s_t, a_t)$

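Here, $s$ is the game state at each time step, $a$ is the player's action, and $r$ is the reward received.
Only the factors $p(a_t | s_t, \theta)$ depend on the player's policy $\theta$; the other two, $p(s_1)$ and $p(r_t, s_{t+1} | s_t, a_t)$, are determined by the environment and do not depend on the policy.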

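Taking the logarithm turns the product into a sum:
$\ln P(\tau | \theta) = \ln p(s_1) + \sum_{t=1}^{T} [\ln p(a_t | s_t, \theta) + \ln p(r_t, s_{t+1} | s_t, a_t)]$
The terms that do not depend on $\theta$ have zero gradient, so the gradient of the log term is:
$\triangledown \ln P(\tau | \theta) = \sum_{t=1}^{T} \triangledown \ln p(a_t | s_t, \theta)$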
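The claim that the environment terms drop out of the gradient can be verified on a small example. The sketch below is built entirely on assumptions: a two-state, two-action game, a tabular sigmoid policy with one parameter per state, and made-up initial-state and transition probabilities. Rewards are treated as a deterministic function of the transition, so only $p(s_{t+1} | s_t, a_t)$ appears. It compares the policy-only sum with a finite-difference gradient of the full $\ln P(\tau | \theta)$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def policy_prob(a, s, theta):
    """Hypothetical tabular policy: p(a=1 | s, theta) = sigmoid(theta[s])."""
    p1 = sigmoid(theta[s])
    return p1 if a == 1 else 1.0 - p1

P_S1 = [0.6, 0.4]  # assumed initial-state distribution p(s_1)
# Assumed dynamics: TRANS[s][a][s_next] = p(s_next | s, a)
TRANS = [[[0.9, 0.1], [0.2, 0.8]],
         [[0.5, 0.5], [0.3, 0.7]]]

def log_prob_trajectory(states, actions, theta):
    # ln P(tau | theta) = ln p(s_1) + sum_t [ln p(a_t | s_t, theta) + ln p(s_{t+1} | s_t, a_t)]
    total = math.log(P_S1[states[0]])
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        total += math.log(policy_prob(a, s, theta))
        total += math.log(TRANS[s][a][s_next])
    return total

def policy_only_gradient(states, actions, theta):
    # sum_t grad ln p(a_t | s_t, theta); for this policy,
    # d/dtheta[s] ln p(a | s, theta) = a - sigmoid(theta[s])
    grad = [0.0] * len(theta)
    for t, a in enumerate(actions):
        s = states[t]
        grad[s] += a - sigmoid(theta[s])
    return grad

def finite_difference_gradient(states, actions, theta, h=1e-6):
    # Numerical gradient of the full log-probability, environment terms included
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        up[i] += h
        down = list(theta)
        down[i] -= h
        diff = log_prob_trajectory(states, actions, up) - log_prob_trajectory(states, actions, down)
        grad.append(diff / (2.0 * h))
    return grad

theta = [0.3, -0.7]
states = [0, 1, 1, 0]  # s_1, s_2, s_3, s_4
actions = [1, 0, 1]    # a_1, a_2, a_3
analytic = policy_only_gradient(states, actions, theta)
numeric = finite_difference_gradient(states, actions, theta)
print(analytic, numeric)
```

The two gradients agree up to finite-difference error, even though `log_prob_trajectory` includes the initial-state and transition terms.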

This yields the gradient of the expected reward:
$\triangledown \bar R_{\theta} \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^{n}) \triangledown \ln P(\tau^{n} | \theta) = \frac{1}{N} \sum_{n=1}^{N} R(\tau^{n}) \sum_{t=1}^{T_n} \triangledown \ln p(a^n_t | s^n_t, \theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^{n}) \triangledown \ln p(a^n_t | s^n_t, \theta)$
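The final estimator can be sketched in a few lines of Python. Everything about the setup is an assumption made for illustration: a stateless game with a fixed episode length, a one-parameter sigmoid policy, and invented per-step rewards. The setup is chosen because the exact gradient has a closed form to compare against. The short loop at the end applies plain gradient ascent with a hypothetical learning rate `lr`.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

STEP_REWARDS = [1.0, 3.0]  # assumed per-step reward for action 0 and action 1
T = 3                      # assumed fixed episode length (T_n = T for every n)

def sample_trajectory(theta, rng):
    """Play one episode; return R(tau) and sum_t grad ln p(a_t | s_t, theta)."""
    p1 = sigmoid(theta)
    total_reward = 0.0
    score = 0.0
    for _ in range(T):
        a = 1 if rng.random() < p1 else 0
        total_reward += STEP_REWARDS[a]
        score += a - p1  # d/dtheta ln p(a | theta) for the sigmoid policy
    return total_reward, score

def policy_gradient_estimate(theta, n_samples, rng):
    # (1/N) * sum_n sum_t R(tau^n) * grad ln p(a_t^n | s_t^n, theta)
    total = 0.0
    for _ in range(n_samples):
        reward, score = sample_trajectory(theta, rng)
        total += reward * score
    return total / n_samples

def exact_gradient(theta):
    # Expected reward here is T * (r0 + (r1 - r0) * p1), so its derivative is closed-form
    p1 = sigmoid(theta)
    return T * (STEP_REWARDS[1] - STEP_REWARDS[0]) * p1 * (1.0 - p1)

rng = random.Random(0)
estimate = policy_gradient_estimate(0.0, 20000, rng)
exact = exact_gradient(0.0)
print(estimate, exact)

# A few gradient ascent steps: theta <- theta + lr * gradient estimate
theta, lr = 0.0, 0.1
for _ in range(50):
    theta += lr * policy_gradient_estimate(theta, 200, rng)
print(theta, sigmoid(theta))
```

Because action 1 earns the larger reward in this toy setup, gradient ascent should push `theta` upward, raising the probability of choosing action 1.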