1.贝尔曼等式
V
(
s
)
=
R
(
s
)
+
γ
∑
s
′
∈
S
P
(
s
′
∣
s
)
V
(
s
′
)
V(s) = R(s) + \gamma \sum\limits_{s' \in S} {P(s'|s)V(s')}
V(s)=R(s)+γs′∈S∑?P(s′∣s)V(s′)其中:
-
R
(
s
)
R(s)
R(s)是Immediate reward
-
γ
∑
s
′
∈
S
P
(
s
′
∣
s
)
V
(
s
′
)
\gamma \sum\limits_{s' \in S}{P(s'|s)V(s')}
γs′∈S∑?P(s′∣s)V(s′)是Discounted sum of future reward
-
s
′
s'
s′是未来的所有状态
-
V
(
s
′
)
V(s')
V(s′)代表未来某一个状态的价值
-
P
(
s
′
∣
s
)
P(s'|s)
P(s′∣s)代表从当前状态转移到未来状态的概率
贝尔曼等式定义了当前状态与未来状态之间的关系
2.基础巩固
- 条件概率公式:
P
(
A
,
B
)
=
P
(
A
B
)
=
P
(
A
∣
B
)
P
(
B
)
=
P
(
B
∣
A
)
P
(
A
)
P(A,B) = P(AB) = P(A|B)P(B) = P(B|A)P(A)
P(A,B)=P(AB)=P(A∣B)P(B)=P(B∣A)P(A)
- 全概率公式:
P
(
B
)
=
∑
i
P
(
A
i
)
P
(
B
∣
A
i
)
P(B) = \sum\limits_i {P({A_i})P(B|{A_i})}
P(B)=i∑?P(Ai?)P(B∣Ai?)
- 贝叶斯公式:
P
(
A
∣
B
)
=
P
(
B
∣
A
)
P
(
A
)
P
(
B
)
P(A|B) = \frac{{P(B|A)P(A)}}{{P(B)}}
P(A∣B)=P(B)P(B∣A)P(A)?
3.全期望公式(LIE)
若
A
i
A_i
Ai?是样本空间的有限或可数的划分,则全期望公式可表示为:
E
(
X
)
=
∑
i
E
(
X
∣
A
i
)
P
(
A
i
)
E\left( X \right) = \sum\limits_i {E\left( {X|{A_i}} \right)P({A_i})}
E(X)=i∑?E(X∣Ai?)P(Ai?)为了简洁,令
s
=
s
t
s=s_t
s=st?,
g
′
=
G
t
+
1
g'=G_{t+1}
g′=Gt+1?,
s
′
=
s
t
+
1
s'=s_{t+1}
s′=st+1?,则回报的期望可以表示为:
E
[
G
t
+
1
∣
s
t
+
1
]
=
E
[
g
′
∣
s
′
]
=
∑
g
′
g
p
(
g
′
∣
s
′
)
\mathbb{E}\left[ {{G_{t + 1}}|{s_{t + 1}}} \right] = \mathbb{E}\left[ {g'|s'} \right] = \sum\limits_{g'} {gp(g'|s')}
E[Gt+1?∣st+1?]=E[g′∣s′]=g′∑?gp(g′∣s′)令
s
t
=
s
s_t=s
st?=s,对上式求期望有:
E
[
E
[
G
t
+
1
∣
s
t
+
1
]
∣
s
t
]
=
E
[
E
[
g
′
∣
s
′
]
∣
s
]
=
∑
s
′
∑
g
′
E
[
E
[
g
′
∣
s
′
]
]
p
(
g
′
∣
s
′
,
s
)
p
(
s
′
∣
s
)
=
∑
s
′
∑
g
′
g
′
p
(
g
′
∣
s
′
,
s
)
p
(
s
′
∣
s
)
p
(
s
)
p
(
s
)
=
∑
s
′
∑
g
′
g
′
p
(
g
′
∣
s
′
,
s
)
p
(
s
′
,
s
)
p
(
s
)
=
∑
g
′
∑
s
′
g
′
p
(
g
′
,
s
′
,
s
)
p
(
s
)
=
∑
g
′
g
′
p
(
g
′
,
s
)
p
(
s
)
=
∑
g
′
g
′
p
(
g
′
∣
s
)
=
E
[
g
′
∣
s
]
=
E
[
G
t
+
1
∣
s
t
]
\begin{array}{l}\mathbb{E}\left[ {\mathbb{E}\left[ {{G_{t + 1}}|{s_{t + 1}}} \right]|{s_t}} \right] = \mathbb{E}\left[ {\mathbb{E}\left[ {g'|s'} \right]|s} \right]\\{\rm{ = }}\sum\limits_{s'} {\sum\limits_{g'} {\mathbb{E}\left[ {\mathbb{E}\left[ {g'|s'} \right]} \right]p(g'|s',s)p(s'|s)} } \\{\rm{ = }}\sum\limits_{s'} {\sum\limits_{g'} {\frac{{g'p(g'|s',s)p(s'|s)p(s)}}{{p(s)}}} } \\{\rm{ = }}\sum\limits_{s'} {\sum\limits_{g'} {\frac{{g'p(g'|s',s)p(s',s)}}{{p(s)}}} } \\{\rm{ = }}\sum\limits_{g'} {\sum\limits_{s'} {\frac{{g'p(g',s',s)}}{{p(s)}}} } \\ = \sum\limits_{g'} {\frac{{g'p(g',s)}}{{p(s)}}} \\ = \sum\limits_{g'} {g'p(g'|s)} \\ = \mathbb{E}\left[ {g'|s} \right] = \mathbb{E}\left[ {{G_{t + 1}}|{s_t}} \right]\end{array}
E[E[Gt+1?∣st+1?]∣st?]=E[E[g′∣s′]∣s]=s′∑?g′∑?E[E[g′∣s′]]p(g′∣s′,s)p(s′∣s)=s′∑?g′∑?p(s)g′p(g′∣s′,s)p(s′∣s)p(s)?=s′∑?g′∑?p(s)g′p(g′∣s′,s)p(s′,s)?=g′∑?s′∑?p(s)g′p(g′,s′,s)?=g′∑?p(s)g′p(g′,s)?=g′∑?g′p(g′∣s)=E[g′∣s]=E[Gt+1?∣st?]?即:
E
[
V
(
s
t
+
1
)
∣
s
t
]
=
E
[
E
[
G
t
+
1
∣
s
t
+
1
]
∣
s
t
]
=
E
[
G
t
+
1
∣
s
t
]
\mathbb{E}\left[ {V({s_{t + 1}})|{s_t}} \right] = \mathbb{E}\left[ {\mathbb{E}\left[ {{G_{t + 1}}|{s_{t + 1}}} \right]|{s_t}} \right] = \mathbb{E}\left[ {{G_{t + 1}}|{s_t}} \right]
E[V(st+1?)∣st?]=E[E[Gt+1?∣st+1?]∣st?]=E[Gt+1?∣st?]
4.贝尔曼等式推导
V
(
s
)
=
E
[
G
t
∣
s
t
=
s
]
=
E
[
R
t
+
1
+
γ
R
t
+
2
+
γ
2
R
t
+
3
+
…
∣
s
t
=
s
]
=
E
[
R
t
+
1
∣
s
t
=
s
]
+
γ
E
[
R
t
+
2
+
γ
R
t
+
3
+
γ
2
R
t
+
4
…
∣
s
t
=
s
]
=
R
(
s
)
+
γ
E
[
G
t
+
1
∣
s
t
=
s
]
=
R
(
s
)
+
γ
E
[
V
(
s
t
+
1
)
∣
s
t
=
s
]
=
R
(
s
)
+
γ
∑
s
′
∈
S
P
(
s
′
∣
s
)
V
(
s
′
)
\begin{array}{l}V(s) = \mathbb{E}\left[ {{G_t}|{s_t} = s} \right]\\ = \mathbb{E}\left[ {{R_{t + 1}} + \gamma {R_{t + 2}} + {\gamma ^2}{R_{t + 3}} + \ldots |{s_t} = s} \right]\\ = \mathbb{E}\left[ {{R_{t + 1}}|{s_t} = s} \right] + \gamma \mathbb{E}\left[ {{R_{t + 2}} + \gamma {R_{t + 3}} + {\gamma ^2}{R_{t + 4}} \ldots |{s_t} = s} \right]\\ = R(s) + \gamma \mathbb{E}\left[ {{G_{t + 1}}|{s_t} = s} \right]\\ = R(s) + \gamma \mathbb{E}\left[ {V({s_{t + 1}})|{s_t} = s} \right]\\ = R(s){\rm{ + }}\gamma \sum\limits_{s' \in S} {P(s'|s)V(s')} \end{array}
V(s)=E[Gt?∣st?=s]=E[Rt+1?+γRt+2?+γ2Rt+3?+…∣st?=s]=E[Rt+1?∣st?=s]+γE[Rt+2?+γRt+3?+γ2Rt+4?…∣st?=s]=R(s)+γE[Gt+1?∣st?=s]=R(s)+γE[V(st+1?)∣st?=s]=R(s)+γs′∈S∑?P(s′∣s)V(s′)?贝尔曼等式就是当前状态与未来状态的迭代关系,表示当前状态的值函数可以通过下个状态的值函数来计算。
小白经验记录,大神请批评指正。 如果文章对你有用,请点个赞吧~
|