CS6789-6 Linear Bellman Completion
参考资料
RL Theory Book 第三章
背景
之前的问题setting:
- Infinite horizon discounted 且是Tabular的MDP
- VI或PI算法
- Generative Model的交互假设,回避了exploration的问题
现在添加一个额外的structural assumption——Linear Bellman Completeness:
- 稍微扩充一下Tabular MDP到Large Scale MDP,即离散的状态动作扩充到连续的状态动作
- 其它维持不变
-
VI迭代式:
Q
n
+
1
(
s
,
a
)
=
r
(
s
,
a
)
+
γ
E
s
′
~
p
(
?
∣
s
,
a
)
[
max
?
a
′
Q
n
(
s
′
,
a
′
)
]
?
s
,
a
∈
S
×
A
\text{VI迭代式:}Q_{n+1}(s,a)= r(s,a) + \gamma \mathbb E_{s'\sim p(\cdot\mid s,a)}[\max_{a'}Q_n(s',a')]\quad \forall s,a\in S\times A
VI迭代式:Qn+1?(s,a)=r(s,a)+γEs′~p(?∣s,a)?[maxa′?Qn?(s′,a′)]?s,a∈S×A会出问题,因为不能用Q-table的方式来表示Q-function即
Q
(
s
,
a
)
Q(s,a)
Q(s,a),要进行function approximation
- 这里的结构假设,是linear function approximation
最终问题是:在原有问题setting扩充了状态动作空间的容量(简单说,离散到连续),能否有sample efficient的算法能找到
?
\epsilon
?-optimal的策略?
逻辑
本篇证明比较多,逻辑链条比较长。整体来看
- Least Square的理论分析是基础,因为LSVI用到了Least Square(在4.2.2节)
- 然后D-optimal的假设,是为了保证用于Least Square的数据集有充足的多样性(在3.3节)
- 接着通过Least Square + D-optimal + Linear Bellman Completion 能得到low inherent Bellman error(在4.2.3节)
- 有了Low Inherent Bellman error,就能bound住policy performance (在4.2.2节)
文章的写作逻辑,以介绍概念——理解定理——证明定理为主,不想看证明可直接跳过第四章节,第四章节总逻辑在4.2.3节。
一、从infinite horizon到finite horizon
为了便于分析,已知infinite horizon中只要把effective horizon即
1
1
?
γ
\frac{1}{1-\gamma}
1?γ1?换成Finite horizon即
H
H
H即可,另外再弱化关于transition 的stationarity即
P
(
?
∣
s
,
a
)
P(\cdot|s,a)
P(?∣s,a)变成跟时间有关
P
h
(
?
∣
s
,
a
)
,
h
≤
H
P_h(\cdot|s,a),h\leq H
Ph?(?∣s,a),h≤H
所以infinite horizon的定义
M
=
(
S
,
A
,
r
,
P
,
γ
)
\mathcal M=(S,A,r,P,\gamma)
M=(S,A,r,P,γ)改写成finite horizon的定义
M
=
(
S
,
A
,
r
,
P
h
,
H
)
\mathcal M=(S,A,r,P_h,H)
M=(S,A,r,Ph?,H)
它们之间的联系:
H
=
1
1
?
γ
H=\frac{1}{1-\gamma}
H=1?γ1? & stationarity
lim
?
h
→
∞
P
h
=
P
\lim_{h\rightarrow \infty}P_h=P
limh→∞?Ph?=P
二、Linear Bellman Completeness
为了解决Tabular MDP到Large Scale MDP中,Q函数的表示问题,首先假设要进行linear function approximation,其次它要满足Bellman Completeness,称为Linear Bellman Completeness 定义如下:
给定一个已知的特征
?
(
s
,
a
)
∈
R
d
\phi(s,a)\in \mathbb R^d
?(s,a)∈Rd,假设其满足Bellman 完备性,即
?
s
∈
S
,
a
∈
A
,
h
∈
[
H
]
,
θ
∈
R
d
,
?
w
∈
R
d
\forall s\in S,a\in A,h\in [H],\theta\in\mathbb R^d,\exist w\in\mathbb R^d
?s∈S,a∈A,h∈[H],θ∈Rd,?w∈Rd满足
w
?
?
(
s
,
a
)
=
r
(
s
,
a
)
+
E
s
′
~
P
h
(
?
∣
s
,
a
)
[
max
?
a
′
θ
?
?
(
s
′
,
a
′
)
]
w^\top\phi(s,a)=r(s,a)+\mathbb E_{s'\sim P_h(\cdot|s,a)}[\max_{a'}\theta^\top \phi(s',a')]
w??(s,a)=r(s,a)+Es′~Ph?(?∣s,a)?[a′max?θ??(s′,a′)]
关于Q函数的Bellman Optimality为:
Q
?
(
s
,
a
)
=
r
(
s
,
a
)
+
γ
E
s
′
~
P
(
?
∣
s
,
a
)
[
max
?
a
′
Q
?
(
s
′
,
a
′
)
]
Q^\star(s,a)=r(s,a)+\gamma \mathbb E_{s'\sim P(\cdot|s,a)}[\max_{a'}Q^\star(s',a')]
Q?(s,a)=r(s,a)+γEs′~P(?∣s,a)?[a′max?Q?(s′,a′)]
可以观察到Bellman Optimality 与 Linear Bellman Completeness非常相似。
- 简单理解一下:Q本来是在
(
s
,
a
)
(s,a)
(s,a)空间构成的映射,变成了在
?
(
s
,
a
)
\phi(s,a)
?(s,a)张成的空间中的点。
- 当
?
(
s
,
a
)
=
one-hot
(
s
,
a
)
\phi(s,a)=\text{one-hot}(s,a)
?(s,a)=one-hot(s,a),此时便可看成Tabular MDP,即
?
(
s
,
a
)
=
(
1
,
0
,
0
,
.
.
.
.
,
0
)
\phi(s,a)=(1,0,0,...., 0)
?(s,a)=(1,0,0,....,0)
- 假设s有两个值
s
1
,
s
2
s_1,s_2
s1?,s2?,动作有一个值a,它们的特征可设计成
?
(
s
,
a
)
=
(
s
1
s
2
,
s
1
2
,
s
2
2
,
s
1
a
,
s
2
a
,
a
,
a
2
,
1
)
\phi(s,a)=(s_1s_2,s_1^2,s_2^2,s_1a,s_2a,a,a^2,1)
?(s,a)=(s1?s2?,s12?,s22?,s1?a,s2?a,a,a2,1),可看作LQR问题
一句话总结:
?
(
s
,
a
)
\phi(s,a)
?(s,a)已知,该假设的参数
w
,
θ
w,\theta
w,θ唯一标识了不同的Q函数,在第h时刻最优的Q函数
Q
h
?
(
s
,
a
)
=
θ
h
?
?
(
s
,
a
)
Q^\star_h(s,a)=\theta_h^\star\phi(s,a)
Qh??(s,a)=θh???(s,a)
三、LSVI的理论
3.1 LSVI算法流程
LSVI是Least Squar Value Iteration,是在Linear Bellman Completion的强假设下的一种sample efficient的算法,能在多项式时间找到
?
\epsilon
?-optimal的策略。先以与时间有关的角度,改写下Linear Bellman Completion:
θ
h
?
?
(
s
,
a
)
=
r
(
s
,
a
)
+
E
s
′
~
P
h
(
?
∣
s
,
a
)
[
max
?
a
′
θ
h
+
1
?
?
(
s
′
,
a
′
)
]
\theta_h^\top \phi(s,a)=r(s,a)+\mathbb E_{s'\sim P_{h}(\cdot|s,a)}[\max_{a'}\theta_{h+1}^\top \phi(s',a')]
θh???(s,a)=r(s,a)+Es′~Ph?(?∣s,a)?[a′max?θh+1???(s′,a′)]
LSVI的算法流程:
- 在每个时间刻
h
∈
0
,
1
,
.
.
.
,
H
?
1
h\in 0,1,...,H-1
h∈0,1,...,H?1,利用generative model的方式
s
′
~
P
h
(
?
∣
s
,
a
)
s'\sim P_h(\cdot|s,a)
s′~Ph?(?∣s,a)进行数据收集,得到
D
0
,
D
1
,
.
.
.
,
D
H
?
1
D_0,D_1,...,D_{H-1}
D0?,D1?,...,DH?1?个数据集,其中一个数据样本以
(
s
,
a
,
r
,
s
′
)
(s,a,r,s')
(s,a,r,s′)的方式存储
- 处理边界情况,令
V
H
(
s
)
=
0
,
?
s
∈
S
V_H(s)=0,\forall s\in S
VH?(s)=0,?s∈S
- 对于
h
=
H
?
1
→
0
h=H-1\rightarrow0
h=H?1→0,根据数据集
D
h
D_h
Dh?,学习参数
θ
h
\theta_h
θh?
1.
θ
h
=
arg?min
?
θ
∑
D
h
(
θ
?
?
(
s
,
a
)
?
r
?
V
h
+
1
(
s
′
)
)
2
1.\quad\theta_h=\argmin_{\theta}\sum_{D_h}\left(\theta^\top \phi(s,a)-r-V_{h+1}(s')\right)^2
1.θh?=θargmin?Dh?∑?(θ??(s,a)?r?Vh+1?(s′))2
2.
V
h
(
s
)
=
max
?
a
θ
h
?
?
(
s
,
a
)
,
?
s
2.\quad V_h(s)=\max_a \theta_h^\top\phi(s,a),\forall s
2.Vh?(s)=amax?θh???(s,a),?s
最后的策略从学习到的Q值中得到:
π
^
h
(
s
)
=
arg?max
?
a
θ
h
?
?
(
s
,
a
)
?
h
\widehat \pi_h(s)=\argmax_a\theta_h^\top\phi(s,a)\quad \forall h
π
h?(s)=aargmax?θh???(s,a)?h
3.2 LSVI样本复杂度
给出定理: 在finite horizon,Large Scale MDP,generative model的设定下,假设Linear Bellman Completion,使用LSVI算法得到策略
π
^
\widehat \pi
π
,如果采集到的数据集
{
D
h
}
h
=
0
H
?
1
\{D_h\}_{h=0}^{H-1}
{Dh?}h=0H?1?满足D-optimal design,则有至少
1
?
δ
1-\delta
1?δ的概率,找到有
?
\epsilon
?-optimal策略即
V
π
^
?
V
?
≤
?
V^{\widehat \pi}-V^\star\leq \epsilon
Vπ
?V?≤?,每个数据集
D
h
D_h
Dh?的样本复杂度为
O
(
d
2
+
H
6
d
2
/
?
2
)
O(d^2+H^6d^2/\epsilon^2)
O(d2+H6d2/?2)
其中d为特征空间
?
(
s
,
a
)
\phi(s,a)
?(s,a)的维度,这里是把样本复杂度中无限空间的
∣
S
∣
∣
A
∣
|S||A|
∣S∣∣A∣换成了有限的特征空间
?
(
s
,
a
)
\phi(s,a)
?(s,a)
为什么假设了Linear Bellman Completion,还需要D-optimal Design呢?因为这里涉及到了通过数据集
D
h
D_h
Dh?进行最小二乘法回归的问题,这必定会涉及到数据分布,如果测试数据离训练数据很远,是没办法泛化到的,也就难以保证
?
\epsilon
?-optimal中的optimal,所以还得加一个使数据集有足够多样性的假设D-optimal design
3.3 D-optimal Design
样本所在空间
X
∈
R
d
\mathcal X\in \mathbb R^d
X∈Rd,对
X
\mathcal X
X以分布
p
p
p的方式进行采样得到数据集
D
=
{
x
:
x
~
p
}
D=\{x:x\sim p\}
D={x:x~p},衡量数据集
D
D
D的多样性,相当于对样本空间
X
\mathcal X
X的采样分布
p
p
p进行多样性衡量。
D-optimality:假设采样分布
p
p
p至多包含样本空间
X
\mathcal X
X中
d
(
d
+
1
)
/
2
d(d+1)/2
d(d+1)/2个点,定义
Σ
=
E
x
~
p
[
x
x
T
]
\Sigma=\mathbb E_{x\sim p}[xx^T]
Σ=Ex~p?[xxT]且有
x
?
Σ
?
1
x
≤
d
?
x
∈
X
x^\top\Sigma^{-1} x\leq d\quad \forall x\in \mathcal X
x?Σ?1x≤d?x∈X,则称分布
p
p
p为D-optimal design
首先
d
(
d
+
1
)
/
2
d(d+1)/2
d(d+1)/2保证分布
p
p
p有一定数量的点,
Σ
\Sigma
Σ是定义了分布内的点的信息量,
x
?
Σ
x
≤
d
,
?
x
∈
X
x^\top\Sigma x\leq d,\forall x\in \mathcal X
x?Σx≤d,?x∈X表示了整个样本空间中的点,离
p
p
p分布的点距离不超过
d
d
d,完成了多样性的量化
以
D
D
D-optimality为原则,来量化多样性,最大化以下目标筛选采样分布
p
=
arg?max
?
v
ln
?
det
?
E
x
~
v
[
x
x
?
]
p=\argmax_v \ln\det \mathbb E_{x\sim v}[xx^\top]
p=vargmax?lndetEx~v?[xx?]
因为采样分布
p
p
p是D-optimal design的,因此根据它从样本空间
X
\mathcal X
X采样得到的数据集
D
D
D,具备一定的多样性(coverage)
3.4 LSVI样本复杂度的最终命题
四、LSVI样本复杂度的证明
4.1 熟悉time-dependent的finite MDP
主要是在做Q-value iteration方面的迭代,因此改写下与Q相关的公式:
- Bellman Optimality
Infinite?Stationary:
Q
?
(
s
,
a
)
=
r
(
s
,
a
)
+
γ
E
s
′
~
p
(
?
∣
s
,
a
)
[
max
?
a
′
Q
?
(
s
′
,
a
′
)
]
V
?
(
s
)
=
max
?
a
Q
?
(
s
,
a
)
Q
=
B
Q
?(简写)
Finite?Non-stationary:
Q
h
?
(
s
,
a
)
=
r
(
s
,
a
)
+
E
s
′
~
P
h
(
?
∣
s
,
a
)
[
max
?
a
′
Q
h
+
1
?
(
s
′
,
a
′
)
]
V
h
?
(
s
)
=
max
?
a
Q
h
?
(
s
,
a
)
Q
h
=
B
h
Q
h
+
1
?(简写)
\begin{aligned} \text{Infinite Stationary:}&\\ Q^\star(s,a)&=r(s,a)+\gamma \mathbb E_{s'\sim p(\cdot|s,a)}\left[\max_{a'}Q^\star (s',a')\right]\\ V^\star(s)&=\max_a Q^\star(s,a)\\ Q&=\mathcal BQ\text{ (简写)}\\ \text{Finite Non-stationary:}&\\ Q_h^\star(s,a)&=r(s,a)+ \mathbb E_{s'\sim P_h(\cdot|s,a)}\left[\max_{a'}Q_{h+1}^\star (s',a')\right]\\ V_h^\star(s)&=\max_a Q_h^\star(s,a)\\ Q_h&=\mathcal B_hQ_{h+1}\text{ (简写)}\\ \end{aligned}
Infinite?Stationary:Q?(s,a)V?(s)QFinite?Non-stationary:Qh??(s,a)Vh??(s)Qh??=r(s,a)+γEs′~p(?∣s,a)?[a′max?Q?(s′,a′)]=amax?Q?(s,a)=BQ?(简写)=r(s,a)+Es′~Ph?(?∣s,a)?[a′max?Qh+1??(s′,a′)]=amax?Qh??(s,a)=Bh?Qh+1??(简写)? - Non-stationary下
Q
h
?
Q_h^\star
Qh??满足:
Q
H
?
=
0
,
Q
H
?
1
?
=
r
,
Q
h
?
=
B
h
Q
h
+
1
?
Q^\star_H=0,Q^\star_{H-1}=r,Q_h^\star=\mathcal B_h Q^\star_{h+1}
QH??=0,QH?1??=r,Qh??=Bh?Qh+1??
4.2 Inherent Bellman Error
如果通过某个算法估计的
Q
^
h
\widehat Q_h
Q
?h?有
∥
Q
^
h
?
B
h
Q
^
h
+
1
∥
∞
≤
?
\|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty\leq \epsilon
∥Q
?h??Bh?Q
?h+1?∥∞?≤?成立(这个条件称做Inherent Bellman Error),则
-
∥
Q
^
h
?
Q
h
?
∥
∞
≤
(
H
?
h
)
?
?
h
∈
[
0
,
1
,
2...
,
H
?
1
]
\|\widehat Q_h-Q^\star_h\|_\infty\leq (H-h)\epsilon \quad \forall h\in[0,1,2...,H-1]
∥Q
?h??Qh??∥∞?≤(H?h)??h∈[0,1,2...,H?1]
-
∣
V
π
^
?
V
?
∣
≤
2
H
2
?
|V^{\widehat \pi}-V^\star|\leq 2H^2\epsilon
∣Vπ
?V?∣≤2H2?,其中
π
^
h
(
s
)
=
arg?max
?
a
Q
^
h
(
s
,
a
)
\widehat \pi_h(s)=\argmax_a\widehat Q_h(s,a)
π
h?(s)=aargmax?Q
?h?(s,a)
4.2.1 证明
∥
Q
^
h
?
Q
h
?
∥
∞
≤
(
H
?
h
)
?
\|\widehat Q_h-Q^\star_h\|_\infty\leq (H-h)\epsilon
∥Q
?h??Qh??∥∞?≤(H?h)?
- 已知
Q
H
(
s
,
a
)
=
0
Q_H(s,a)=0
QH?(s,a)=0,根据Inherent Bellman Error
∥
Q
^
h
?
B
h
Q
^
h
+
1
∥
∞
≤
?
\|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty\leq \epsilon
∥Q
?h??Bh?Q
?h+1?∥∞?≤?,令h=H-1,有
∥
Q
^
H
?
1
?
r
∥
∞
≤
?
\|\widehat Q_{H-1}-r\|_\infty\leq \epsilon
∥Q
?H?1??r∥∞?≤?
- 由
Q
H
?
1
?
=
r
Q^\star_{H-1}=r
QH?1??=r,得到
∥
Q
^
H
?
1
?
Q
H
?
1
?
∥
∞
≤
?
\|\widehat Q_{H-1}-Q_{H-1}^\star\|_\infty\leq \epsilon
∥Q
?H?1??QH?1??∥∞?≤?
- 分析目标
∥
Q
^
h
?
Q
h
?
∥
∞
≤
∥
Q
^
h
?
B
h
Q
^
h
+
1
∥
∞
+
∥
B
h
Q
^
h
+
1
?
Q
h
?
∥
∞
≤
?
+
∥
B
h
Q
^
h
+
1
?
B
h
Q
h
+
1
?
∥
∞
≤
?
+
∥
E
s
′
~
P
h
(
?
∣
s
,
a
)
[
max
?
a
′
Q
^
h
+
1
(
s
′
,
a
′
)
]
?
E
s
′
′
~
P
h
(
?
∣
s
,
a
)
[
max
?
a
′
′
Q
h
+
1
?
(
s
′
′
,
a
′
′
)
]
∥
∞
≤
?
+
∥
Q
^
h
+
1
?
Q
h
+
1
?
∥
∞
\begin{aligned} \|\widehat Q_h-Q^\star_h\|_\infty&\leq \|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty+\|\mathcal B_h\widehat Q_{h+1}-Q^\star_h\|_\infty\\ &\leq \epsilon +\|\mathcal B_h\widehat Q_{h+1}-\mathcal B_hQ^\star_{h+1}\|_\infty\\ &\leq \epsilon +\|\mathbb E_{s'\sim P_h(\cdot|s,a)}[\max_{a'}\widehat Q_{h+1}(s',a')]-E_{s''\sim P_h(\cdot|s,a)}[\max_{a''}Q^\star_{h+1}(s'',a'')]\|_\infty\\ &\leq \epsilon + \|\widehat Q_{h+1}-Q^\star_{h+1}\|_\infty \end{aligned}
∥Q
?h??Qh??∥∞??≤∥Q
?h??Bh?Q
?h+1?∥∞?+∥Bh?Q
?h+1??Qh??∥∞?≤?+∥Bh?Q
?h+1??Bh?Qh+1??∥∞?≤?+∥Es′~Ph?(?∣s,a)?[a′max?Q
?h+1?(s′,a′)]?Es′′~Ph?(?∣s,a)?[a′′max?Qh+1??(s′′,a′′)]∥∞?≤?+∥Q
?h+1??Qh+1??∥∞?? - 熟悉的套娃环节,到
∥
Q
^
H
?
1
?
Q
H
?
1
?
∥
∞
≤
?
\|\widehat Q_{H-1}-Q_{H-1}^\star\|_\infty\leq \epsilon
∥Q
?H?1??QH?1??∥∞?≤?,因此有
∥
Q
^
h
?
Q
h
?
∥
∞
≤
(
H
?
h
)
?
?
h
∈
[
0
,
1
,
2...
,
H
?
1
]
\|\widehat Q_h-Q^\star_h\|_\infty\leq (H-h)\epsilon \quad \forall h\in[0,1,2...,H-1]
∥Q
?h??Qh??∥∞?≤(H?h)??h∈[0,1,2...,H?1]
4.2.2 证明
∣
V
π
^
?
V
?
∣
≤
2
H
2
?
|V^{\widehat \pi}-V^\star|\leq 2H^2\epsilon
∣Vπ
?V?∣≤2H2?
- 从后向前考虑,先看H-1时刻(下面每一步恒等变形都是为了对齐动作)
∣
V
H
?
1
?
(
s
)
?
V
H
?
1
π
^
(
s
)
∣
=
∣
Q
H
?
1
?
(
s
,
π
H
?
1
?
(
s
)
)
?
Q
H
?
1
π
^
(
s
,
π
^
H
?
1
(
s
)
)
∣
=
∣
Q
H
?
1
?
(
s
,
π
H
?
1
?
(
s
)
)
?
Q
H
?
1
?
(
s
,
π
^
H
?
1
(
s
)
)
+
Q
H
?
1
?
(
s
,
π
^
H
?
1
(
s
)
)
?
Q
H
?
1
π
^
(
s
,
π
^
H
?
1
(
s
)
)
∣
=
∣
Q
H
?
1
?
(
s
,
π
H
?
1
?
(
s
)
)
?
Q
H
?
1
?
(
s
,
π
^
H
?
1
(
s
)
)
∣
=
∣
Q
H
?
1
?
(
s
,
π
H
?
1
?
(
s
)
)
?
Q
^
H
?
1
(
s
,
π
H
?
1
?
(
s
)
)
+
Q
^
H
?
1
(
s
,
π
H
?
1
?
(
s
)
)
?
Q
H
?
1
?
(
s
,
π
^
H
?
1
(
s
)
)
∣
≤
∣
Q
H
?
1
?
(
s
,
π
H
?
1
?
(
s
)
)
?
Q
^
H
?
1
(
s
,
π
H
?
1
?
(
s
)
)
+
Q
^
H
?
1
(
s
,
π
^
H
?
1
(
s
)
)
?
Q
H
?
1
?
(
s
,
π
^
H
?
1
(
s
)
)
∣
≤
∣
Q
H
?
1
?
(
s
,
π
H
?
1
?
(
s
)
)
?
Q
^
H
?
1
(
s
,
π
H
?
1
?
(
s
)
)
∣
+
∣
Q
^
H
?
1
(
s
,
π
^
H
?
1
(
s
)
)
?
Q
H
?
1
?
(
s
,
π
^
H
?
1
(
s
)
)
∣
≤
2
?
\begin{aligned} |V_{H-1}^\star(s)-V_{H-1}^{\widehat \pi}(s)|&=|Q_{H-1}^\star(s,\pi^\star_{H-1}(s))-Q^{\widehat \pi}_{H-1}(s,\widehat \pi_{H-1}(s))|\\ &=|Q_{H-1}^\star(s,\pi^\star_{H-1}(s))-Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))+Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))-Q^{\widehat \pi}_{H-1}(s,\widehat \pi_{H-1}(s))|\\ &=|Q_{H-1}^\star(s,\pi^\star_{H-1}(s))-Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))|\\ &= |Q_{H-1}^\star(s,\pi^\star_{H-1}(s))-\widehat Q_{H-1}(s,\pi^\star_{H-1}(s)) + \widehat Q_{H-1}(s,\pi^\star_{H-1}(s))-Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))|\\ &\leq |Q_{H-1}^\star(s,\pi^\star_{H-1}(s))-\widehat Q_{H-1}(s,\pi^\star_{H-1}(s)) + \widehat Q_{H-1}(s,\widehat\pi_{H-1}(s))-Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))|\\ &\leq |Q_{H-1}^\star(s,\pi^\star_{H-1}(s))-\widehat Q_{H-1}(s,\pi^\star_{H-1}(s))| + |\widehat Q_{H-1}(s,\widehat\pi_{H-1}(s))-Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))|\\ &\leq 2\epsilon \end{aligned}
∣VH?1??(s)?VH?1π
?(s)∣?=∣QH?1??(s,πH?1??(s))?QH?1π
?(s,π
H?1?(s))∣=∣QH?1??(s,πH?1??(s))?QH?1??(s,π
H?1?(s))+QH?1??(s,π
H?1?(s))?QH?1π
?(s,π
H?1?(s))∣=∣QH?1??(s,πH?1??(s))?QH?1??(s,π
H?1?(s))∣=∣QH?1??(s,πH?1??(s))?Q
?H?1?(s,πH?1??(s))+Q
?H?1?(s,πH?1??(s))?QH?1??(s,π
H?1?(s))∣≤∣QH?1??(s,πH?1??(s))?Q
?H?1?(s,πH?1??(s))+Q
?H?1?(s,π
H?1?(s))?QH?1??(s,π
H?1?(s))∣≤∣QH?1??(s,πH?1??(s))?Q
?H?1?(s,πH?1??(s))∣+∣Q
?H?1?(s,π
H?1?(s))?QH?1??(s,π
H?1?(s))∣≤2??
- 注意到
?
s
,
a
\forall s,a
?s,a,有
Q
H
?
1
?
(
s
,
a
)
=
r
(
s
,
a
)
=
Q
H
?
1
π
^
(
s
,
a
)
Q_{H-1}^\star(s,a)=r(s,a)=Q^{\widehat \pi}_{H-1}(s,a)
QH?1??(s,a)=r(s,a)=QH?1π
?(s,a),因此
Q
H
?
1
?
(
s
,
π
^
H
?
1
(
s
)
)
?
Q
H
?
1
π
^
(
s
,
π
^
H
?
1
(
s
)
)
=
0
Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))-Q^{\widehat \pi}_{H-1}(s,\widehat \pi_{H-1}(s))=0
QH?1??(s,π
H?1?(s))?QH?1π
?(s,π
H?1?(s))=0
- 另外
Q
H
?
1
?
(
s
,
π
^
H
?
1
(
s
)
)
?
Q
^
H
?
1
(
s
,
π
^
H
?
1
(
s
)
)
≠
0
Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))-\widehat Q_{H-1}(s,\widehat \pi_{H-1}(s))\neq0
QH?1??(s,π
H?1?(s))?Q
?H?1?(s,π
H?1?(s))?=0
- 考虑任意时刻
h
h
h之间的迭代式:
∣
V
h
?
(
s
)
?
V
h
π
^
(
s
)
∣
=
∣
Q
h
?
(
s
,
π
h
?
(
s
)
)
?
Q
h
π
^
(
s
,
π
^
h
(
s
)
)
∣
=
∣
Q
h
?
(
s
,
π
h
?
(
s
)
)
?
Q
h
?
(
s
,
π
^
h
(
s
)
)
+
Q
h
?
(
s
,
π
^
h
(
s
)
)
?
Q
h
π
^
(
s
,
π
^
h
(
s
)
)
∣
=
∣
Q
h
?
(
s
,
π
h
?
(
s
)
)
?
Q
h
?
(
s
,
π
^
h
(
s
)
)
+
E
s
′
~
P
h
[
V
h
+
1
?
(
s
′
)
?
V
h
+
1
π
^
(
s
′
)
]
∣
=
∣
Q
h
?
(
s
,
π
h
?
(
s
)
)
?
Q
^
h
(
s
,
π
^
h
(
s
)
)
+
Q
^
h
(
s
,
π
^
h
(
s
)
)
?
Q
h
?
(
s
,
π
^
h
(
s
)
)
+
E
s
′
~
P
h
[
V
h
+
1
?
(
s
′
)
?
V
h
+
1
π
^
(
s
′
)
]
∣
≤
∣
Q
h
?
(
s
,
π
h
?
(
s
)
)
?
Q
^
h
(
s
,
π
h
?
(
s
)
)
+
Q
^
h
(
s
,
π
^
h
(
s
)
)
?
Q
h
?
(
s
,
π
^
h
(
s
)
)
+
E
s
′
~
P
h
[
V
h
+
1
?
(
s
′
)
?
V
h
+
1
π
^
(
s
′
)
]
∣
≤
2
(
H
?
h
)
?
+
∣
E
s
′
~
P
h
[
V
h
+
1
?
(
s
′
)
?
V
h
+
1
π
^
(
s
′
)
]
∣
\begin{aligned} |V_{h}^\star(s)-V_{h}^{\widehat \pi}(s)|&=|Q_{h}^\star(s,\pi^\star_{h}(s))-Q^{\widehat \pi}_{h}(s,\widehat \pi_{h}(s))|\\ &=|Q_{h}^\star(s,\pi^\star_{h}(s))-Q_{h}^\star(s,\widehat \pi_{h}(s))+Q_{h}^\star(s,\widehat \pi_{h}(s))-Q^{\widehat \pi}_{h}(s,\widehat \pi_{h}(s))|\\ &=|Q_{h}^\star(s,\pi^\star_{h}(s))-Q_{h}^\star(s,\widehat \pi_{h}(s))+\mathbb E_{s'\sim P_h}[V_{h+1}^\star(s')-V_{h+1}^{\widehat \pi}(s')]|\\ &=|Q_{h}^\star(s,\pi^\star_{h}(s))-\widehat Q_h(s,\widehat \pi_h(s))+\widehat Q_h(s,\widehat \pi_h(s))-Q_{h}^\star(s,\widehat \pi_{h}(s))+\mathbb E_{s'\sim P_h}[V_{h+1}^\star(s')-V_{h+1}^{\widehat \pi}(s')]|\\ &\leq|Q_{h}^\star(s,\pi^\star_{h}(s))-\widehat Q_h(s,\pi^\star_h(s))+\widehat Q_h(s,\widehat \pi_h(s))-Q_{h}^\star(s,\widehat \pi_{h}(s))+\mathbb E_{s'\sim P_h}[V_{h+1}^\star(s')-V_{h+1}^{\widehat \pi}(s')]|\\ &\leq 2(H-h)\epsilon +|\mathbb E_{s'\sim P_h}[V_{h+1}^\star(s')-V_{h+1}^{\widehat \pi}(s')]|\\ \end{aligned}
∣Vh??(s)?Vhπ
?(s)∣?=∣Qh??(s,πh??(s))?Qhπ
?(s,π
h?(s))∣=∣Qh??(s,πh??(s))?Qh??(s,π
h?(s))+Qh??(s,π
h?(s))?Qhπ
?(s,π
h?(s))∣=∣Qh??(s,πh??(s))?Qh??(s,π
h?(s))+Es′~Ph??[Vh+1??(s′)?Vh+1π
?(s′)]∣=∣Qh??(s,πh??(s))?Q
?h?(s,π
h?(s))+Q
?h?(s,π
h?(s))?Qh??(s,π
h?(s))+Es′~Ph??[Vh+1??(s′)?Vh+1π
?(s′)]∣≤∣Qh??(s,πh??(s))?Q
?h?(s,πh??(s))+Q
?h?(s,π
h?(s))?Qh??(s,π
h?(s))+Es′~Ph??[Vh+1??(s′)?Vh+1π
?(s′)]∣≤2(H?h)?+∣Es′~Ph??[Vh+1??(s′)?Vh+1π
?(s′)]∣? - 按时刻展开迭代式:
∣
V
H
?
1
?
(
s
)
?
V
H
?
1
π
^
(
s
)
∣
≤
2
?
∣
V
H
?
2
?
(
s
)
?
V
H
?
2
π
^
(
s
)
∣
≤
4
?
+
2
?
∣
V
H
?
3
?
(
s
)
?
V
H
?
3
π
^
(
s
)
∣
≤
6
?
+
4
?
+
2
?
?
∣
V
h
?
(
s
)
?
V
h
π
^
(
s
)
∣
≤
2
?
(
H
?
h
)
H
\begin{aligned} |V_{H-1}^\star(s)-V_{H-1}^{\widehat \pi}(s)|&\leq 2\epsilon\\ |V_{H-2}^\star(s)-V_{H-2}^{\widehat \pi}(s)|&\leq 4\epsilon + 2\epsilon\\ |V_{H-3}^\star(s)-V_{H-3}^{\widehat \pi}(s)|&\leq 6\epsilon + 4\epsilon + 2\epsilon\\ &\cdots\\ |V_{h}^\star(s)-V_{h}^{\widehat \pi}(s)|&\leq 2\epsilon(H-h)H\\ \end{aligned}
∣VH?1??(s)?VH?1π
?(s)∣∣VH?2??(s)?VH?2π
?(s)∣∣VH?3??(s)?VH?3π
?(s)∣∣Vh??(s)?Vhπ
?(s)∣?≤2?≤4?+2?≤6?+4?+2??≤2?(H?h)H? - 因此h=0时,有
∣
V
0
?
(
s
)
?
V
0
π
^
(
s
)
∣
≤
2
H
2
?
|V_{0}^\star(s)-V_{0}^{\widehat \pi}(s)|\leq 2 H^2 \epsilon
∣V0??(s)?V0π
?(s)∣≤2H2?
4.3 LSVI算法的假设与证明
关键:说明LSVI算法有
∥
Q
^
h
?
B
h
Q
^
h
+
1
∥
∞
≤
?
\|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty\leq \epsilon
∥Q
?h??Bh?Q
?h+1?∥∞?≤?成立,就可以利用上述结果
4.3.1 LSVI的隐含假设
- Linear Bellman Completeness
Q
^
h
(
s
,
a
)
=
θ
^
h
?
(
s
,
a
)
B
h
Q
^
h
+
1
(
s
,
a
)
=
r
(
s
,
a
)
+
E
s
′
~
P
h
(
?
∣
s
,
a
)
[
max
?
a
′
Q
^
h
+
1
(
s
′
,
a
′
)
]
B
h
Q
^
h
+
1
(
s
,
a
)
=
r
(
s
,
a
)
+
E
s
′
~
P
h
(
?
∣
s
,
a
)
[
max
?
a
′
θ
^
h
+
1
?
(
s
′
,
a
′
)
]
\begin{aligned} \widehat Q_h(s,a)&=\widehat \theta_h\phi(s,a)\\ \mathcal B_h\widehat Q_{h+1}(s,a)&=r(s,a)+\mathbb E_{s'\sim P_h(\cdot|s,a)}[\max_{a'}\widehat Q_{h+1}(s',a')]\\ \mathcal B_h\widehat Q_{h+1}(s,a)&=r(s,a)+\mathbb E_{s'\sim P_h(\cdot|s,a)}[\max_{a'}\widehat \theta_{h+1}\phi(s',a')] \end{aligned}
Q
?h?(s,a)Bh?Q
?h+1?(s,a)Bh?Q
?h+1?(s,a)?=θ
h??(s,a)=r(s,a)+Es′~Ph?(?∣s,a)?[a′max?Q
?h+1?(s′,a′)]=r(s,a)+Es′~Ph?(?∣s,a)?[a′max?θ
h+1??(s′,a′)]? - D-optimal Design
?
(
s
,
a
)
∈
R
d
Σ
=
E
?
(
s
,
a
)
?
(
s
,
a
)
?
p
=
arg?max
?
v
ln
?
det
?
E
(
s
,
a
)
~
v
?
(
s
,
a
)
?
(
s
,
a
)
?
D
h
=
{
(
s
(
i
)
,
a
(
i
)
,
r
(
i
)
,
(
s
′
)
(
i
)
)
∣
(
s
,
a
)
~
p
,
s
′
~
P
h
(
?
∣
s
,
a
)
}
i
=
1
N
Σ
h
=
∑
s
,
a
∈
D
h
?
(
s
,
a
)
?
(
s
,
a
)
?
是可逆的
?
(
s
,
a
)
,
N
Σ
≤
?
(
s
,
a
)
?
Σ
h
?
1
?
(
s
,
a
)
≤
d
\begin{aligned} &\phi(s,a)\in \mathbb R^d\\ &\Sigma=\mathbb E_{\phi(s,a)\phi(s,a)^\top}\\ &p=\argmax_v \ln\det \mathbb E_{(s,a)\sim v} \phi(s,a)\phi(s,a)^\top\\ &D_h=\{(s^{(i)},a^{(i)},r^{(i)},(s')^{(i)})|(s,a)\sim p,s'\sim P_h(\cdot|s,a)\}_{i=1}^{N}\\ &\Sigma_h=\sum_{s,a\in D_h}\phi(s,a)\phi(s,a)^\top \text{是可逆的}\\ & \forall (s,a), N\Sigma\leq\phi(s,a)^\top \Sigma_h^{-1}\phi(s,a)\leq d \end{aligned}
??(s,a)∈RdΣ=E?(s,a)?(s,a)??p=vargmax?lndetE(s,a)~v??(s,a)?(s,a)?Dh?={(s(i),a(i),r(i),(s′)(i))∣(s,a)~p,s′~Ph?(?∣s,a)}i=1N?Σh?=s,a∈Dh?∑??(s,a)?(s,a)?是可逆的?(s,a),NΣ≤?(s,a)?Σh?1??(s,a)≤d? - LSVI中的Least Square
正常的LS:
y
=
θ
x
+
?
Q
h
的LS:
y
=
r
(
s
,
a
)
+
E
s
′
~
P
h
(
?
∣
s
,
a
)
[
max
?
a
′
θ
^
h
+
1
?
(
s
′
,
a
′
)
]
,
θ
=
θ
^
h
,
x
=
?
(
s
,
a
)
\begin{aligned} \text{正常的LS:}&y=\theta x+\epsilon\\ \text{$Q_h$的LS:}&y=r(s,a)+\mathbb E_{s'\sim P_h(\cdot|s,a)}[\max_{a'}\widehat \theta_{h+1}\phi(s',a')], \theta=\widehat \theta_h,x=\phi(s,a) \end{aligned}
正常的LS:Qh?的LS:?y=θx+?y=r(s,a)+Es′~Ph?(?∣s,a)?[a′max?θ
h+1??(s′,a′)],θ=θ
h?,x=?(s,a)?
4.3.2 Ordinary Least Square
简要复习下普通最小二乘法的理论,这里蕴含的假设是fixed design analysis,具体证明见2012年的论文Random Design Analysis of Ridge Regression。
已知一个
N
N
N个样本的数据集
D
=
{
x
(
i
)
,
y
(
i
)
}
i
=
1
N
D=\{x^{(i)},y^{(i)}\}_{i=1}^N
D={x(i),y(i)}i=1N?,其中
x
(
i
)
∈
R
d
,
y
(
i
)
∈
R
1
x^{(i)}\in \mathbb R^d,y^{(i)}\in \mathbb R^1
x(i)∈Rd,y(i)∈R1,
x
j
(
i
)
x^{(i)}_j
xj(i)?表示第
i
i
i个样本的第
j
j
j维。它们的结构假设是线性的,存在一个最优向量参数
θ
?
\theta^\star
θ?,使得
y
(
i
)
=
(
θ
?
)
?
x
(
i
)
+
?
(
i
)
y^{(i)}=(\theta^\star)^\top x^{(i)}+\epsilon^{(i)}
y(i)=(θ?)?x(i)+?(i),其中噪声假设为
E
[
?
]
=
0
,
∣
?
∣
≤
σ
\mathbb E[\epsilon]=0,|\epsilon|\leq \sigma
E[?]=0,∣?∣≤σ。
令样本矩阵
X
∈
R
N
×
d
X\in \mathbb R^{N\times d}
X∈RN×d,它们的协方差矩阵
Σ
=
X
?
X
=
∑
i
=
1
N
x
(
i
)
(
x
(
i
)
)
?
\Sigma=X^\top X=\sum_{i=1}^N x^{(i)}(x^{(i)})^\top
Σ=X?X=∑i=1N?x(i)(x(i))?,假设
Σ
\Sigma
Σ可逆,最小化二乘法学习到的参数
θ
^
\hat \theta
θ^:
θ
^
=
arg?min
?
θ
(
X
θ
?
Y
)
2
=
(
X
?
X
)
?
1
X
?
Y
=
Σ
?
1
X
?
Y
\hat\theta=\argmin_\theta (X\theta-Y)^2=(X^\top X)^{-1}X^\top Y=\Sigma^{-1}X^\top Y
θ^=θargmin?(Xθ?Y)2=(X?X)?1X?Y=Σ?1X?Y
根据2012年的论文Random Design Analysis of Ridge Regression推导,选取
δ
∈
(
0
,
1
)
\delta\in (0,1)
δ∈(0,1),至少
1
?
δ
1-\delta
1?δ的概率有如下成立:
(
θ
^
?
θ
?
)
?
Σ
(
θ
^
?
θ
?
)
=
∥
θ
^
?
θ
?
∥
Σ
2
≤
σ
2
(
d
+
d
ln
?
(
1
/
δ
)
+
2
ln
?
(
1
/
δ
)
)
(\hat \theta -\theta^\star)^\top \Sigma (\hat \theta -\theta^\star) =\|\hat \theta -\theta^\star\|_{\Sigma}^2 \leq \sigma^2(d+\sqrt{d\ln (1/\delta)}+2\ln(1/\delta))
(θ^?θ?)?Σ(θ^?θ?)=∥θ^?θ?∥Σ2?≤σ2(d+dln(1/δ)
?+2ln(1/δ))
具体展开看,
(
θ
^
?
θ
?
)
?
Σ
(
θ
^
?
θ
?
)
=
(
θ
^
?
θ
?
)
?
X
?
X
(
θ
^
?
θ
?
)
=
(
X
(
θ
^
?
θ
?
)
)
?
(
X
(
θ
^
?
θ
?
)
)
=
(
Y
^
?
Y
)
2
=
∑
i
=
1
N
(
θ
^
x
(
i
)
?
θ
?
x
(
i
)
)
(\hat \theta -\theta^\star)^\top \Sigma (\hat \theta -\theta^\star)=(\hat \theta -\theta^\star)^\top X^\top X(\hat \theta -\theta^\star)=(X(\hat \theta-\theta^\star))^\top (X(\hat \theta-\theta^\star))=(\hat Y-Y)^2=\sum_{i=1}^N (\hat \theta x^{(i)}-\theta^\star x^{(i)})
(θ^?θ?)?Σ(θ^?θ?)=(θ^?θ?)?X?X(θ^?θ?)=(X(θ^?θ?))?(X(θ^?θ?))=(Y^?Y)2=i=1∑N?(θ^x(i)?θ?x(i))
4.3.3 LSVI的Inherent Bellman Error
先套Least Square的理论结果得到第h时刻估计参数
θ
^
h
\hat \theta_h
θ^h?与最优参数
θ
?
\theta^\star
θ?的界
- 套用LS的理论结果:
(
θ
^
?
θ
?
)
?
Σ
(
θ
^
?
θ
?
)
=
∥
θ
^
?
θ
?
∥
Σ
2
≤
σ
2
(
d
+
d
ln
?
(
1
/
δ
)
+
2
ln
?
(
1
/
δ
)
)
(\hat \theta -\theta^\star)^\top \Sigma (\hat \theta -\theta^\star) =\|\hat \theta -\theta^\star\|_{\Sigma}^2 \leq \sigma^2(d+\sqrt{d\ln (1/\delta)}+2\ln(1/\delta))
(θ^?θ?)?Σ(θ^?θ?)=∥θ^?θ?∥Σ2?≤σ2(d+dln(1/δ)
?+2ln(1/δ)) - LSVI中第h时刻从数据集
D
h
D_h
Dh?拟合参数
θ
^
h
\widehat \theta_h
θ
h?时的对应关系:
- 每一个
D
h
D_h
Dh?中的样本为
(
s
,
a
,
r
,
s
′
)
(s,a,r,s')
(s,a,r,s′)中的输入
(
s
,
a
)
(s,a)
(s,a),采样
s
′
~
P
h
(
?
∣
s
,
a
)
s'\sim P_h(\cdot|s,a)
s′~Ph?(?∣s,a),其真实值为
y
=
r
(
s
,
a
)
+
E
s
′
~
P
h
(
?
∣
s
,
a
)
[
max
?
a
′
θ
^
h
+
1
?
(
s
′
,
a
′
)
]
y=r(s,a)+\mathbb E_{s'\sim P_h(\cdot|s,a)}[\max_{a'}\widehat \theta_{h+1}\phi(s',a')]
y=r(s,a)+Es′~Ph?(?∣s,a)?[maxa′?θ
h+1??(s′,a′)]
- 从
s
′
~
P
h
(
?
∣
s
,
a
)
s'\sim P_h(\cdot|s,a)
s′~Ph?(?∣s,a)采样一个样本
(
s
,
a
,
r
,
s
′
)
(s,a,r,s')
(s,a,r,s′),观察到的采样为
y
(
i
)
=
r
(
s
,
a
)
+
max
?
a
′
θ
^
h
+
1
?
(
s
′
,
a
′
)
y^{(i)}=r(s,a)+\max_{a'}\widehat \theta_{h+1}\phi(s',a')
y(i)=r(s,a)+maxa′?θ
h+1??(s′,a′)
- 因此关于一个
(
s
,
a
)
(s,a)
(s,a),采样带来的噪声为
?
(
i
)
=
y
(
i
)
?
y
=
r
(
s
,
a
)
+
max
?
a
′
θ
^
h
+
1
?
(
s
′
,
a
′
)
?
r
(
s
,
a
)
?
E
s
′
′
~
P
h
(
?
∣
s
,
a
)
[
max
?
a
′
θ
^
h
+
1
?
(
s
′
′
,
a
′
)
]
\epsilon^{(i)}=y^{(i)}-y=r(s,a)+\max_{a'}\widehat \theta_{h+1}\phi(s',a')-r(s,a)-\mathbb E_{s''\sim P_h(\cdot|s,a)}[\max_{a'}\widehat \theta_{h+1}\phi(s'',a')]
?(i)=y(i)?y=r(s,a)+a′max?θ
h+1??(s′,a′)?r(s,a)?Es′′~Ph?(?∣s,a)?[a′max?θ
h+1??(s′′,a′)]
很自然我们有:
E
[
?
]
=
E
s
′
~
P
h
(
?
∣
s
,
a
)
[
r
(
s
,
a
)
+
max
?
a
′
θ
^
h
+
1
?
(
s
′
,
a
′
)
]
?
θ
^
h
?
(
s
,
a
)
=
0
\mathbb E[\epsilon]=\mathbb E_{s'\sim P_h(\cdot|s,a)}[r(s,a)+\max_{a'}\widehat \theta_{h+1}\phi(s',a')]-\widehat\theta_h\phi(s,a)=0
E[?]=Es′~Ph?(?∣s,a)?[r(s,a)+a′max?θ
h+1??(s′,a′)]?θ
h??(s,a)=0 在Infinite Horizon中Q函数的范围为
(
0
,
1
1
?
γ
)
(0,\frac{1}{1-\gamma})
(0,1?γ1?),因此在finite horizon中Q函数的范围为
(
0
,
H
)
(0,H)
(0,H),因此噪声项有
∣
?
∣
≤
2
H
|\epsilon|\leq 2H
∣?∣≤2H -
Σ
=
Σ
h
=
∑
(
s
,
a
)
∈
D
h
?
(
s
,
a
)
?
(
s
,
a
)
?
\Sigma=\Sigma_h=\sum_{(s,a)\in D_h}\phi(s,a)\phi(s,a)^\top
Σ=Σh?=∑(s,a)∈Dh???(s,a)?(s,a)?
-
θ
?
\theta^\star
θ?能令
x
=
?
(
s
,
a
)
x=\phi(s,a)
x=?(s,a)为真实值y,即
θ
?
x
=
y
\theta^\star x=y
θ?x=y,套进来为:
(
θ
?
)
?
?
(
s
,
a
)
=
r
(
s
,
a
)
+
E
s
′
~
P
h
(
?
∣
s
,
a
)
[
max
?
a
′
θ
^
h
+
1
?
(
s
′
,
a
′
)
]
(\theta^\star)^\top \phi(s,a)=r(s,a)+\mathbb E_{s'\sim P_h(\cdot|s,a)}[\max_{a'}\widehat \theta_{h+1}\phi(s',a')]
(θ?)??(s,a)=r(s,a)+Es′~Ph?(?∣s,a)?[a′max?θ
h+1??(s′,a′)]
- 所以LSVI在第h时刻的最小二乘法可以得到:
(
θ
^
h
?
θ
?
)
?
Σ
h
(
θ
^
h
?
θ
?
)
=
∥
θ
^
h
?
θ
?
∥
Σ
h
2
≤
4
H
2
(
d
+
d
ln
?
(
1
/
δ
)
+
2
ln
?
(
1
/
δ
)
)
(\hat \theta_h -\theta^\star)^\top \Sigma_h (\hat \theta_h -\theta^\star)=\|\hat \theta_h-\theta^\star\|_{\Sigma_h}^2 \leq 4H^2(d+\sqrt{d\ln (1/\delta)}+2\ln(1/\delta))
(θ^h??θ?)?Σh?(θ^h??θ?)=∥θ^h??θ?∥Σh?2?≤4H2(d+dln(1/δ)
?+2ln(1/δ))
- 分析目标
∥
Q
^
h
?
B
h
Q
^
h
+
1
∥
∞
=
max
?
s
,
a
∣
θ
^
h
?
?
(
s
,
a
)
?
r
(
s
,
a
)
?
E
s
′
~
P
h
(
?
∣
s
,
a
)
[
max
?
a
′
θ
^
h
+
1
?
?
(
s
′
,
a
′
)
]
∣
=
max
?
s
,
a
∣
θ
^
h
?
?
(
s
,
a
)
?
(
θ
?
)
?
?
(
s
,
a
)
∣
=
max
?
s
,
a
∣
(
θ
^
h
?
θ
?
)
?
?
(
s
,
a
)
∣
=
max
?
s
,
a
∣
(
θ
^
h
?
θ
?
)
?
?
(
s
,
a
)
∣
2
?(利用
∑
a
i
b
i
≤
∑
a
i
2
∑
b
i
2
进行放缩)
≤
max
?
s
,
a
∣
(
θ
^
h
?
θ
?
)
?
(
θ
^
h
?
θ
?
)
∣
∣
?
(
s
,
a
)
?
?
(
s
,
a
)
∣
=
max
?
s
,
a
∣
(
θ
^
h
?
θ
?
)
?
Σ
h
(
θ
^
h
?
θ
?
)
∣
∣
?
(
s
,
a
)
?
Σ
h
?
1
?
(
s
,
a
)
∣
≤
max
?
s
,
a
4
H
2
(
d
+
d
ln
?
(
1
/
δ
)
+
2
ln
?
(
1
/
δ
)
)
d
N
≤
max
?
s
,
a
4
H
d
ln
?
1
/
δ
N
=
4
H
d
ln
?
1
/
δ
N
\begin{aligned} \|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty&=\max_{s,a}|\hat\theta_h^\top \phi(s,a)-r(s,a)-\mathbb E_{s'\sim P_h(\cdot|s,a)}[\max_{a'}\widehat \theta_{h+1}^\top\phi(s',a')]|\\ &=\max_{s,a}|\hat\theta_h^\top \phi(s,a)-(\theta^\star)^\top \phi(s,a)|\\ &=\max_{s,a}|(\hat\theta_h-\theta^\star)^\top \phi(s,a)|\\ &=\max_{s,a}\sqrt{|(\hat\theta_h-\theta^\star)^\top \phi(s,a)|^2}\text{ (利用$\sum a_ib_i\leq \sum a_i^2\sum b_i^2$进行放缩)}\\ &\leq \max_{s,a}\sqrt{|(\hat\theta_h-\theta^\star)^\top (\hat\theta_h-\theta^\star)|| \phi(s,a)^\top\phi(s,a)|} \\ &=\max_{s,a}\sqrt{|(\hat\theta_h-\theta^\star)^\top\Sigma_h (\hat\theta_h-\theta^\star)|| \phi(s,a)^\top\Sigma_h^{-1}\phi(s,a)|} \\ &\leq\max_{s,a}\sqrt{ 4H^2(d+\sqrt{d\ln (1/\delta)}+2\ln(1/\delta)) \frac{d}{N}}\\ &\leq \max_{s,a} \frac{4Hd\sqrt{\ln 1/\delta}}{\sqrt{N}}= \frac{4Hd\sqrt{\ln 1/\delta}}{\sqrt{N}} \end{aligned}
∥Q
?h??Bh?Q
?h+1?∥∞??=s,amax?∣θ^h???(s,a)?r(s,a)?Es′~Ph?(?∣s,a)?[a′max?θ
h+1???(s′,a′)]∣=s,amax?∣θ^h???(s,a)?(θ?)??(s,a)∣=s,amax?∣(θ^h??θ?)??(s,a)∣=s,amax?∣(θ^h??θ?)??(s,a)∣2
??(利用∑ai?bi?≤∑ai2?∑bi2?进行放缩)≤s,amax?∣(θ^h??θ?)?(θ^h??θ?)∣∣?(s,a)??(s,a)∣
?=s,amax?∣(θ^h??θ?)?Σh?(θ^h??θ?)∣∣?(s,a)?Σh?1??(s,a)∣
?≤s,amax?4H2(d+dln(1/δ)
?+2ln(1/δ))Nd?
?≤s,amax?N
?4Hdln1/δ
??=N
?4Hdln1/δ
??? - 求得数据集
D
h
D_h
Dh?的大小
N
N
N,令
?
=
4
H
d
ln
?
1
/
δ
N
\epsilon= \frac{4Hd\sqrt{\ln 1/\delta}}{\sqrt{N}}
?=N
?4Hdln1/δ
??,得
N
=
16
H
2
d
2
ln
?
(
H
/
δ
)
?
2
N=\frac{16H^2d^2\ln(H/\delta)}{\epsilon^2}
N=?216H2d2ln(H/δ)?
4.3.4 LSVI样本复杂度的总逻辑
- 4.2节
如果算法估计的
Q
^
h
\widehat Q_h
Q
?h?有
∥
Q
^
h
?
B
h
Q
^
h
+
1
∥
∞
≤
?
\|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty\leq \epsilon
∥Q
?h??Bh?Q
?h+1?∥∞?≤?成立(Inherent Bellman Error),则
-
∥
Q
^
h
?
Q
h
?
∥
∞
≤
(
H
?
h
)
?
?
h
∈
[
0
,
1
,
2...
,
H
?
1
]
\|\widehat Q_h-Q^\star_h\|_\infty\leq (H-h)\epsilon \quad \forall h\in[0,1,2...,H-1]
∥Q
?h??Qh??∥∞?≤(H?h)??h∈[0,1,2...,H?1]
-
∣
V
π
^
?
V
?
∣
≤
2
H
2
?
|V^{\widehat \pi}-V^\star|\leq 2H^2\epsilon
∣Vπ
?V?∣≤2H2?,其中
π
^
h
(
s
)
=
arg?max
?
a
Q
^
h
(
s
,
a
)
\widehat \pi_h(s)=\argmax_a\widehat Q_h(s,a)
π
h?(s)=aargmax?Q
?h?(s,a)
- 4.3.3节
利用了Least Squre + D-optimal Design后,当
N
≥
16
H
2
d
2
ln
?
(
H
/
δ
)
?
2
N\geq \frac{16H^2d^2\ln(H/\delta)}{\epsilon^2}
N≥?216H2d2ln(H/δ)?有
∥
Q
^
h
?
B
h
Q
^
h
+
1
∥
∞
≤
?
\|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty\leq \epsilon
∥Q
?h??Bh?Q
?h+1?∥∞?≤?成立 - 总逻辑
- 以
p
=
arg?max
?
v
ln
?
det
?
E
(
s
,
a
)
~
v
[
?
(
s
,
a
)
?
(
s
,
a
)
?
]
p=\argmax_v \ln\det \mathbb E_{(s,a)\sim v}[\phi(s,a)\phi(s,a)^\top]
p=vargmax?lndetE(s,a)~v?[?(s,a)?(s,a)?]方式得到在
(
s
,
a
)
(s,a)
(s,a)空间上的最优采样分布
p
p
p
- 然后在
h
=
0
,
1
,
.
.
.
,
H
?
1
h=0,1,...,H-1
h=0,1,...,H?1的每个时刻,对于一个
(
s
,
a
)
(s,a)
(s,a),从其下一状态的真实分布
P
h
(
s
′
∣
s
,
a
)
P_h(s'|s,a)
Ph?(s′∣s,a)中随机采样
?
p
(
s
,
a
)
N
?
\lceil p(s,a)N\rceil
?p(s,a)N?个下一状态值
s
′
s'
s′作为样本,所以
D
h
D_h
Dh?的大小为
∑
s
,
a
?
p
(
s
,
a
)
N
?
\sum_{s,a}\lceil p(s,a)N\rceil
∑s,a??p(s,a)N?。
- 因此H个数据集的总样本数为
总
的
样
本
数
=
H
∑
s
,
a
∈
p
?
p
(
s
,
a
)
N
?
≤
H
∑
s
,
a
∈
p
(
1
+
p
(
s
,
a
)
N
)
总的样本数=H\sum_{s,a\in p}\lceil p(s,a)N\rceil \leq H\sum_{s,a\in p}(1+p(s,a)N)
总的样本数=Hs,a∈p∑??p(s,a)N?≤Hs,a∈p∑?(1+p(s,a)N) - 设定目标的误差为
∣
V
π
^
?
V
?
∣
≤
2
H
2
?
=
?
′
|V^{\widehat \pi}-V^\star|\leq 2H^2\epsilon=\epsilon'
∣Vπ
?V?∣≤2H2?=?′,所以
?
=
?
′
2
H
2
\epsilon=\frac{\epsilon'}{2H^2}
?=2H2?′?,这要求Inherent Bellman error
∥
Q
^
h
?
B
h
Q
^
h
+
1
∥
∞
≤
?
′
2
H
2
\|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty\leq \frac{\epsilon'}{2H^2}
∥Q
?h??Bh?Q
?h+1?∥∞?≤2H2?′?,可知
N
=
64
H
6
d
2
ln
?
H
/
δ
(
?
′
)
2
N= \frac{64H^6d^2\ln H/\delta}{(\epsilon')^2}
N=(?′)264H6d2lnH/δ?
- 因此为了保持
?
′
\epsilon'
?′-optimal的策略
∣
V
π
^
?
V
?
∣
≤
2
H
2
?
=
?
′
|V^{\widehat \pi}-V^\star|\leq 2H^2\epsilon=\epsilon'
∣Vπ
?V?∣≤2H2?=?′ ,需要
总
的
样
本
数
=
H
∑
s
,
a
∈
p
?
p
(
s
,
a
)
N
?
≤
H
∑
s
,
a
∈
p
(
1
+
p
(
s
,
a
)
N
)
≤
H
(
d
2
+
64
H
6
d
2
ln
?
H
/
δ
(
?
′
)
2
)
总的样本数=H\sum_{s,a\in p}\lceil p(s,a)N\rceil \leq H\sum_{s,a\in p}(1+p(s,a)N)\leq H(d^2+\frac{64H^6d^2\ln H/\delta}{(\epsilon')^2})
总的样本数=Hs,a∈p∑??p(s,a)N?≤Hs,a∈p∑?(1+p(s,a)N)≤H(d2+(?′)264H6d2lnH/δ?)
总结
假设很多,这里分一下类。
- 关于问题建模本身的假设:finite horizon、Large Scale MDP、time-dependent policy(non-stationary policy)
- 关于最优Q值的结构假设:Linear 、 结构变化是Linear Bellman Completion
- 关于数据集多样性的分布假设:D-optimality
- 关于算法中最小二乘的假设:fixed design analysis
这些假设,是相应算法在多项式时间内找到
?
\epsilon
?-optimal策略的保证。
|