Reinforcement Learning (3): Policy-Based Learning and Policy Gradient
1. Policy Learning
Policy Network
- Approximate the policy function with a policy network (see the sketch after this list):
  $$\pi(a \mid s_t) \approx \pi(a \mid s_t; \theta)$$
- The state-value function and its approximation:
  $$V_\pi(s_t) = \sum_a \pi(a \mid s_t)\, Q_\pi(s_t, a)$$
  $$V(s_t; \theta) = \sum_a \pi(a \mid s_t; \theta) \cdot Q_\pi(s_t, a)$$
- The objective function that policy learning maximizes:
  $$J(\theta) = E_S[V(S; \theta)]$$
- Training updates the parameters by gradient ascent on the policy:
  $$\theta \gets \theta + \beta \cdot \frac{\partial V(s; \theta)}{\partial \theta}$$
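To make these objects concrete, here is a minimal sketch (not from the original post) of a linear softmax policy $\pi(a \mid s; \theta)$ and the approximate state value $V(s;\theta) = \sum_a \pi(a \mid s;\theta)\,Q_\pi(s,a)$. The function names and the placeholder Q values are assumptions for illustration only; in practice $Q_\pi$ is unknown and must be estimated.

```python
# Minimal sketch, assuming a linear softmax policy; the Q values are placeholders.
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def policy(state, theta):
    """pi(a | s; theta): action probabilities from a linear softmax policy."""
    logits = theta @ state            # theta: (n_actions, state_dim), state: (state_dim,)
    return softmax(logits)

def state_value(state, theta, q_values):
    """V(s; theta) = sum_a pi(a | s; theta) * Q_pi(s, a)."""
    return policy(state, theta) @ q_values

# Example with made-up numbers (q_values stands in for the unknown Q_pi(s, .))
rng = np.random.default_rng(0)
state_dim, n_actions = 4, 3
theta = rng.normal(size=(n_actions, state_dim))
s = rng.normal(size=state_dim)
q_values = np.array([1.0, 0.5, -0.2])
print(policy(s, theta))                  # pi(. | s; theta), sums to 1
print(state_value(s, theta, q_values))   # V(s; theta)
```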
2. Policy Gradient
$$\begin{aligned}
\frac{\partial V(s;\theta)}{\partial \theta}
&= \sum_a Q_\pi(s,a)\,\frac{\partial \pi(a \mid s;\theta)}{\partial \theta}
\quad\left(= \int_a Q_\pi(s,a)\,\frac{\partial \pi(a \mid s;\theta)}{\partial \theta}\,\mathrm{d}a \ \text{for a continuous action space}\right)\\
&= \sum_a \pi(a \mid s;\theta) \cdot Q_\pi(s,a)\,\frac{\partial \ln \pi(a \mid s;\theta)}{\partial \theta}\\
&= E_{A \sim \pi(\cdot \mid s;\theta)}\!\left[Q_\pi(s,A)\,\frac{\partial \ln \pi(A \mid s;\theta)}{\partial \theta}\right]\\
&\approx Q_\pi(s_t,a_t)\,\frac{\partial \ln \pi(a_t \mid s_t;\theta)}{\partial \theta}
\end{aligned}$$

The last step is a Monte Carlo approximation: a single action $a_t$ sampled from $\pi(\cdot \mid s_t;\theta)$ gives an unbiased estimate of the expectation, and that estimate is used as the policy gradient in practice.
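As a sanity check on the log-derivative step above, the following sketch (my own, with an assumed helper `grad_log_pi` for a linear softmax policy and made-up Q values) numerically compares the exact gradient $\sum_a Q_\pi(s,a)\,\partial\pi(a \mid s;\theta)/\partial\theta$ with the sampled estimate $E_{A\sim\pi}[Q_\pi(s,A)\,\partial\ln\pi(A \mid s;\theta)/\partial\theta]$.

```python
# Numerical check of the log-derivative identity for a linear softmax policy.
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(state, theta, a):
    """d ln pi(a | s; theta) / d theta for a linear softmax policy."""
    p = softmax(theta @ state)
    one_hot = np.zeros_like(p)
    one_hot[a] = 1.0
    return np.outer(one_hot - p, state)        # shape: (n_actions, state_dim)

rng = np.random.default_rng(0)
state_dim, n_actions = 4, 3
theta = rng.normal(size=(n_actions, state_dim))
s = rng.normal(size=state_dim)
q = np.array([1.0, 0.5, -0.2])                 # placeholder values for Q_pi(s, a)

p = softmax(theta @ s)
# Exact gradient: sum_a Q_pi(s,a) * d pi(a|s)/d theta = sum_a pi(a) * Q_pi(s,a) * d ln pi(a|s)/d theta
exact = sum(p[a] * q[a] * grad_log_pi(s, theta, a) for a in range(n_actions))
# Monte Carlo estimate: sample A ~ pi(. | s; theta) and average Q_pi(s,A) * d ln pi(A|s)/d theta
samples = rng.choice(n_actions, size=100_000, p=p)
mc = np.mean([q[a] * grad_log_pi(s, theta, a) for a in samples], axis=0)
print(np.abs(exact - mc).max())                # small (on the order of 1e-2 or less)
```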
- Observe the current state $s_t$.
- Randomly sample an action from the policy: $a_t \sim \pi(\cdot \mid s_t; \theta)$.
- Evaluate the action-value function: $q_t = Q_\pi(s_t, a_t)$.
- Compute the gradient of the policy network: $d_{\theta,t} = \dfrac{\partial \ln \pi(a_t \mid s_t; \theta)}{\partial \theta}\Big|_{\theta=\theta_t}$.
- Form the approximate policy gradient: $g(a_t, \theta_t) = q_t \cdot d_{\theta,t}$.
- Update the policy network by gradient ascent (a code sketch of these steps follows the list): $\theta_{t+1} = \theta_t + \beta \cdot g(a_t, \theta_t)$.
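The steps above can be combined into a single update function. The sketch below is only illustrative, reusing the linear softmax policy from the earlier sketches; `q_estimate` is a hypothetical placeholder for whatever approximation of $Q_\pi(s_t, a_t)$ is available, since (as noted in the next section) approximating it is the remaining difficulty.

```python
# One policy-gradient update (steps 1-6 above) for a linear softmax policy.
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def policy_gradient_step(s_t, theta, q_estimate, beta, rng):
    """Perform one gradient-ascent update of theta, following the six steps above."""
    p = softmax(theta @ s_t)                   # pi(. | s_t; theta)
    a_t = rng.choice(len(p), p=p)              # sample a_t ~ pi(. | s_t; theta)
    q_t = q_estimate(s_t, a_t)                 # q_t ~ Q_pi(s_t, a_t), supplied by the caller
    one_hot = np.zeros_like(p)
    one_hot[a_t] = 1.0
    d_theta = np.outer(one_hot - p, s_t)       # d ln pi(a_t | s_t; theta) / d theta
    g = q_t * d_theta                          # approximate policy gradient g(a_t, theta)
    return theta + beta * g                    # theta_{t+1} = theta_t + beta * g

# Example usage with a dummy Q estimate (the real Q_pi is unknown here)
rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 4))                # 3 actions, 4-dimensional state
s_t = rng.normal(size=4)                       # observed state s_t
theta = policy_gradient_step(s_t, theta, q_estimate=lambda s, a: 1.0, beta=0.01, rng=rng)
```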
3. Example
Since there is currently no good way to approximate the action-value function $Q_\pi$, no worked example is given here.
by CyrusMay, 2022-03-29