Transfer Reinforcement Learning Paper Notes (1): Successor Features
1. Background and problem formulation

An MDP is defined by the tuple

$$M \equiv (\mathcal{S}, \mathcal{A}, p, R, \gamma)$$

$\mathcal{S}$: the state space
$\mathcal{A}$: the action space
$p(\cdot \mid s_t, a_t)$: the state-transition probabilities
$R(s_t, a_t, s_{t+1})$: the reward
$\gamma$: the discount factor
2. Successor features
Assume the reward function can be written as

$$r\left(s, a, s^{\prime}\right)=\boldsymbol{\phi}\left(s, a, s^{\prime}\right)^{\top} \mathbf{w},$$
where $\boldsymbol{\phi}(s, a, s')$ is a $d$-dimensional feature vector and $\mathbf{w}$ is the corresponding weight vector. With this form we obtain the following result (defining $\boldsymbol{\phi}_{t+1}=\boldsymbol{\phi}(s_t, a_t, s_{t+1})$):
$$\begin{aligned} Q^\pi(s, a) & =\mathrm{E}^\pi\left[r_{t+1}+\gamma r_{t+2}+\ldots \mid S_t=s, A_t=a\right] \\ & =\mathrm{E}^\pi\left[\boldsymbol{\phi}_{t+1}^{\top} \mathbf{w}+\gamma \boldsymbol{\phi}_{t+2}^{\top} \mathbf{w}+\ldots \mid S_t=s, A_t=a\right] \\ & =\mathrm{E}^\pi\left[\sum_{i=t}^{\infty} \gamma^{i-t} \boldsymbol{\phi}_{i+1} \mid S_t=s, A_t=a\right]^{\top} \mathbf{w}=\boldsymbol{\psi}^\pi(s, a)^{\top} \mathbf{w} . \end{aligned}$$
$\boldsymbol{\psi}^{\pi}(s,a)$ is called the Successor Features (SFs) of $(s,a)$ under the policy $\pi$.
From the definition, $\boldsymbol{\psi}^{\pi}(s,a)=\mathrm{E}^{\pi}[\boldsymbol{\phi}_{t+1}+\gamma \boldsymbol{\phi}_{t+2}+\gamma^2\boldsymbol{\phi}_{t+3}+\cdots\mid S_t=s,A_t=a]$, we obtain the following Bellman equation:
$$\begin{aligned} \boldsymbol{\psi}^{\pi}(s,a)&=\mathrm{E}^{\pi}\left[\boldsymbol{\phi}_{t+1}+\gamma \boldsymbol{\phi}_{t+2}+\gamma^2\boldsymbol{\phi}_{t+3}+\cdots\mid S_t=s,A_t=a\right]\\ &=\mathrm{E}_{S_{t+1},A_{t+1}}\left[\boldsymbol{\phi}_{t+1}+\gamma\,\boldsymbol{\psi}^{\pi}(S_{t+1},A_{t+1})\mid S_t=s,A_t=a\right]\\ &=\mathrm{E}_{S_{t+1}}\left[\boldsymbol{\phi}_{t+1}+\gamma\,\boldsymbol{\psi}^{\pi}(S_{t+1},\pi(S_{t+1}))\mid S_t=s,A_t=a\right] \quad \text{(if } \pi \text{ is deterministic)} \end{aligned}$$
Using this equation, $\boldsymbol{\psi}^{\pi}(s,a)$ can be solved for iteratively, while solving for $\mathbf{w}$ is a supervised-learning problem that many machine-learning algorithms can handle (see the sketch below). In this way, a different task only requires solving for a different $\mathbf{w}$.
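As a concrete illustration, here is a minimal tabular sketch of the two learning problems above (the array shapes, function names, and hyperparameters are my own illustrative assumptions, not code from the paper): a TD(0) update for $\boldsymbol{\psi}^{\pi}$ based on the Bellman equation, and a least-squares fit of $\mathbf{w}$ from observed $(\boldsymbol{\phi}, r)$ pairs.

```python
import numpy as np

def td_update_psi(psi, s, a, s_next, phi_sas, pi, alpha=0.1, gamma=0.9):
    """One TD(0) step on the SF Bellman equation:
    psi(s,a) += alpha * [phi(s,a,s') + gamma * psi(s', pi(s')) - psi(s,a)].
    psi: (n_states, n_actions, d) table of successor features.
    pi:  (n_states,) deterministic policy; pi[s'] is the action taken in s'."""
    target = phi_sas + gamma * psi[s_next, pi[s_next]]
    psi[s, a] += alpha * (target - psi[s, a])

def fit_w(phis, rewards):
    """Supervised fit of w: minimize ||Phi w - r||^2, since
    r(s,a,s') = phi(s,a,s')^T w. Any regression method would do."""
    Phi = np.asarray(phis)     # (N, d) observed feature vectors
    r = np.asarray(rewards)    # (N,) observed rewards
    w, *_ = np.linalg.lstsq(Phi, r, rcond=None)
    return w
```

Once $\boldsymbol{\psi}^{\pi}$ and a task's $\mathbf{w}$ are available, the action values are just a dot product: `Q = psi @ w`.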
3. Generalized policy improvement
The authors also prove a generalization error bound for transfer reinforcement learning.
Theorem 1 (Generalized Policy Improvement). Let $\pi_1, \pi_2, \ldots, \pi_n$ be $n$ decision policies and let $\tilde{Q}^{\pi_1}, \tilde{Q}^{\pi_2}, \ldots, \tilde{Q}^{\pi_n}$ be approximations of their respective action-value functions such that

$$\left|Q^{\pi_i}(s, a)-\tilde{Q}^{\pi_i}(s, a)\right| \leq \epsilon \quad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A}, \text{ and } i \in\{1,2, \ldots, n\}.$$
Define

$$\pi(s) \in \underset{a}{\operatorname{argmax}} \max_i \tilde{Q}^{\pi_i}(s, a).$$
Then,

$$Q^\pi(s, a) \geq \max_i Q^{\pi_i}(s, a)-\frac{2}{1-\gamma} \epsilon$$

for any $s \in \mathcal{S}$ and $a \in \mathcal{A}$, where $Q^\pi$ is the action-value function of $\pi$.
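Computing the GPI policy from the theorem's definition is essentially a one-liner. A small numpy sketch, where stacking the approximate value functions into a single array is an assumed representation:

```python
import numpy as np

def gpi_policy(Q_tilde):
    """GPI action selection: pi(s) in argmax_a max_i Q~^{pi_i}(s, a).
    Q_tilde: (n_policies, n_states, n_actions) approximate value functions.
    Returns the deterministic policy as an array of shape (n_states,)."""
    return np.argmax(Q_tilde.max(axis=0), axis=-1)
```

With successor features, `Q_tilde` for a new task with weights `w` can be formed as `psi @ w`, where `psi` has shape `(n_policies, n_states, n_actions, d)`.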
Proof: To simplify notation, define

$$Q_{\max}(s,a)=\max_{i}Q^{\pi_i}(s,a) \quad \text{(the largest action value among the policies } \pi_{i}\text{)}$$

$$\tilde{Q}_{\max}(s,a)=\max_{i}\tilde{Q}^{\pi_{i}}(s,a) \quad \text{(the corresponding estimate)}.$$
With this notation we have the inequality

$$\left|Q_{\max }(s, a)-\tilde{Q}_{\max }(s, a)\right|=\left|\max_i Q^{\pi_i}(s, a)-\max_i \tilde{Q}^{\pi_i}(s, a)\right| \leq \max_i\left|Q^{\pi_i}(s, a)-\tilde{Q}^{\pi_i}(s, a)\right| \leq \epsilon.$$
Hence

$$Q_{\max }(s, a)-\epsilon \leq \tilde{Q}_{\max }(s, a).$$
We will use the Bellman operator $T^{\pi}$, defined by

$$T^{\pi}f(s,a)=r(s,a)+\gamma\,\mathrm{E}_{s'\sim p(s'\mid s,a)}[V(s')],$$

$$V(s')=\mathrm{E}_{a\sim \pi(a\mid s')}[f(s',a)], \qquad r(s,a)=\mathrm{E}_{s'\sim p(s'\mid s,a)}[r(s,a,s')].$$
Since our policy $\pi$ is deterministic (in each state it picks the action with the largest estimated value across all policies), $V(s')=f(s',\pi(s'))$.
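To make $T^{\pi}$ concrete, here is a tabular sketch under illustrative assumptions (finite state and action sets, a known transition matrix, and a deterministic policy stored as an array):

```python
import numpy as np

def bellman_operator(f, r, P, pi, gamma=0.9):
    """One application of T^pi to a tabular action-value function.
    f:  (S, A) current estimate
    r:  (S, A) expected one-step reward r(s, a)
    P:  (S, A, S) transition probabilities p(s' | s, a)
    pi: (S,) deterministic policy; V(s') = f(s', pi(s'))
    Returns r(s,a) + gamma * sum_{s'} p(s'|s,a) * f(s', pi(s'))."""
    v_next = f[np.arange(f.shape[0]), pi]   # V(s'), shape (S,)
    return r + gamma * P @ v_next           # (S, A, S) @ (S,) -> (S, A)
```

Note that feeding `f + c` into this function returns `bellman_operator(f, r, P, pi) + gamma * c`, which is exactly the shift property proved later in this proof.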
For any $(s,a)\in \mathcal{S}\times \mathcal{A}$ and any policy $\pi_{i}$ we have
$$\begin{aligned} T^\pi \tilde{Q}_{\max }(s, a) & =r(s, a)+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) \tilde{Q}_{\max }\left(s^{\prime}, \pi\left(s^{\prime}\right)\right) \\ & =r(s, a)+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) \max_b \tilde{Q}_{\max }\left(s^{\prime}, b\right) \\ & \geq r(s, a)+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) \max_b Q_{\max }\left(s^{\prime}, b\right)-\gamma \epsilon \\ & \geq r(s, a)+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) Q_{\max }\left(s^{\prime}, \pi_i\left(s^{\prime}\right)\right)-\gamma \epsilon \\ & \geq r(s, a)+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) Q^{\pi_i}\left(s^{\prime}, \pi_i\left(s^{\prime}\right)\right)-\gamma \epsilon \\ & =T^{\pi_i} Q^{\pi_i}(s, a)-\gamma \epsilon \\ & =Q^{\pi_i}(s, a)-\gamma \epsilon . \end{aligned}$$
Since $T^\pi \tilde{Q}_{\max }(s, a)\geq Q^{\pi_i}(s, a)-\gamma \epsilon$ holds for every policy $\pi_i$, we have
$$\begin{aligned} T^\pi \tilde{Q}_{\max }(s, a)&\geq Q^{\pi_i}(s, a)-\gamma \epsilon \qquad \text{for all } \pi_{i}\\ &\geq \max_{i}Q^{\pi_{i}}(s,a)-\gamma \epsilon = Q_{\max}(s,a)-\gamma\epsilon\\ &\geq \tilde{Q}_{\max }(s, a)-\epsilon-\gamma\epsilon = \tilde{Q}_{\max }(s, a)-(1+\gamma)\epsilon \end{aligned}$$
To reach the final conclusion, we also need the following fact:
$$\begin{aligned} T^{\pi}(f(s,a)+c)&=r(s,a)+\gamma\,\mathrm{E}_{s'\sim p(s'\mid s,a)}[f(s',\pi(s'))+c]\\ &=r(s,a)+\gamma\,\mathrm{E}_{s'\sim p(s'\mid s,a)}[f(s',\pi(s'))]+\gamma\cdot c\\ &=T^{\pi}(f(s,a))+\gamma\cdot c \end{aligned}$$
Combining this shift property with the monotonicity of $T^{\pi}$ and iterating, we obtain
$$\begin{aligned} T^{\pi}\tilde{Q}_{\max }(s, a)&\geq \tilde{Q}_{\max }(s, a)-(1+\gamma)\epsilon\\ T^{\pi}\left(T^{\pi}\tilde{Q}_{\max }(s, a)\right)&\geq T^{\pi}\tilde{Q}_{\max }(s, a)-\gamma(1+\gamma)\epsilon\\ &\;\;\vdots\\ (T^{\pi})^{k}\tilde{Q}_{\max }(s, a)&\geq (T^{\pi})^{k-1}\tilde{Q}_{\max }(s, a)-\gamma^{k-1}(1+\gamma)\epsilon \end{aligned}$$
Chaining these inequalities and letting $k$ tend to infinity (the discounts sum to $\sum_{j\ge 0}\gamma^{j}=\frac{1}{1-\gamma}$, and $(T^{\pi})^{k}\tilde{Q}_{\max}\to Q^{\pi}$ because $T^{\pi}$ is a $\gamma$-contraction whose fixed point is $Q^{\pi}$), we get
$$\begin{aligned} Q^\pi(s, a) & =\lim_{k \rightarrow \infty}\left(T^\pi\right)^k \tilde{Q}_{\max }(s, a) \\ & \geq \tilde{Q}_{\max }(s, a)-\frac{1+\gamma}{1-\gamma} \epsilon \\ & \geq Q_{\max }(s, a)-\epsilon-\frac{1+\gamma}{1-\gamma} \epsilon\\ & = \max_i Q^{\pi_i}(s, a)-\frac{2}{1-\gamma} \epsilon \end{aligned}$$
This completes the proof.
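As a sanity check (entirely my own sketch, not part of the paper), Theorem 1 can be verified numerically on a small random MDP: evaluate a few policies exactly by solving linear systems, perturb the value functions by at most $\epsilon$, build the GPI policy, and compare $Q^{\pi}$ against the bound.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, n, gamma, eps = 5, 3, 4, 0.9, 0.05

P = rng.dirichlet(np.ones(S), size=(S, A))   # p(s'|s,a), shape (S, A, S)
r = rng.random((S, A))                       # r(s, a)

def q_of_policy(pi):
    """Exact Q^pi for deterministic pi (shape (S,)): solve the linear system
    Q(s,a) = r(s,a) + gamma * sum_{s'} P[s,a,s'] * Q(s', pi(s'))."""
    M = np.zeros((S * A, S * A))
    for s in range(S):
        for a in range(A):
            M[s * A + a, np.arange(S) * A + pi] = gamma * P[s, a]
    return np.linalg.solve(np.eye(S * A) - M, r.ravel()).reshape(S, A)

policies = [rng.integers(A, size=S) for _ in range(n)]
Q = np.stack([q_of_policy(p) for p in policies])      # exact Q^{pi_i}
Q_tilde = Q + rng.uniform(-eps, eps, size=Q.shape)    # |Q - Q~| <= eps
pi_gpi = np.argmax(Q_tilde.max(axis=0), axis=-1)      # GPI policy
bound = Q.max(axis=0) - 2 * eps / (1 - gamma)         # Theorem 1 RHS
assert np.all(q_of_policy(pi_gpi) >= bound - 1e-9)    # holds elementwise
```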
To prove the final error bound, we also need the following lemma.
Lemma 1. Let $\delta_{ij}=\max_{s, a}\left|r_i(s, a)-r_j(s, a)\right|$. Then,
$$Q_i^{\pi_i^*}(s, a)-Q_i^{\pi_j^*}(s, a) \leq \frac{2 \delta_{ij}}{1-\gamma} .$$
Proof: To simplify notation, let $Q_i^j(s, a) \equiv Q_i^{\pi_j^*}(s, a)$. Then
$$\begin{aligned} Q_i^i(s, a)-Q_i^j(s, a) & =Q_i^i(s, a)-Q_j^j(s, a)+Q_j^j(s, a)-Q_i^j(s, a) \\ & \leq\left|Q_i^i(s, a)-Q_j^j(s, a)\right|+\left|Q_j^j(s, a)-Q_i^j(s, a)\right| . \end{aligned}$$
For the first term, let $\Delta_{ij}=\max_{s, a}\left|Q_i^i(s, a)-Q_j^j(s, a)\right|$. Then
$$\begin{aligned} \left|Q_i^i(s, a)-Q_j^j(s, a)\right| & =\left|r_i(s, a)+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) \max_b Q_i^i\left(s^{\prime}, b\right)-r_j(s, a)-\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) \max_b Q_j^j\left(s^{\prime}, b\right)\right| \\ & =\left|r_i(s, a)-r_j(s, a)+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right)\left(\max_b Q_i^i\left(s^{\prime}, b\right)-\max_b Q_j^j\left(s^{\prime}, b\right)\right)\right| \\ & \leq\left|r_i(s, a)-r_j(s, a)\right|+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right)\left|\max_b Q_i^i\left(s^{\prime}, b\right)-\max_b Q_j^j\left(s^{\prime}, b\right)\right| \\ & \leq\left|r_i(s, a)-r_j(s, a)\right|+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) \max_b\left|Q_i^i\left(s^{\prime}, b\right)-Q_j^j\left(s^{\prime}, b\right)\right| \\ & \leq \delta_{ij}+\gamma \Delta_{ij} . \end{aligned}$$
Taking the maximum over $(s,a)$ on the left-hand side gives $\Delta_{ij} \leq \delta_{ij}+\gamma\Delta_{ij}$, and therefore

$$\Delta_{ij} \leq \frac{1}{1-\gamma} \delta_{ij} .$$
For the second term, define $\Delta_{ij}^{\prime}=\max_{s, a}\left|Q_j^j(s, a)-Q_i^j(s, a)\right|$. Then
$$\begin{aligned} \left|Q_j^j(s, a)-Q_i^j(s, a)\right| & =\left|r_j(s, a)+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) Q_j^j\left(s^{\prime}, \pi_j^*\left(s^{\prime}\right)\right)-r_i(s, a)-\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) Q_i^j\left(s^{\prime}, \pi_j^*\left(s^{\prime}\right)\right)\right| \\ & =\left|r_j(s, a)-r_i(s, a)+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right)\left(Q_j^j\left(s^{\prime}, \pi_j^*\left(s^{\prime}\right)\right)-Q_i^j\left(s^{\prime}, \pi_j^*\left(s^{\prime}\right)\right)\right)\right| \\ & \leq\left|r_i(s, a)-r_j(s, a)\right|+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right)\left|Q_j^j\left(s^{\prime}, \pi_j^*\left(s^{\prime}\right)\right)-Q_i^j\left(s^{\prime}, \pi_j^*\left(s^{\prime}\right)\right)\right| \\ & \leq \delta_{ij}+\gamma \Delta_{ij}^{\prime} . \end{aligned}$$
In the same way,

$$\Delta_{ij}^{\prime} \leq \frac{1}{1-\gamma} \delta_{ij} .$$
Combining the two bounds, $Q_i^i(s, a)-Q_i^j(s, a) \leq \Delta_{ij}+\Delta_{ij}^{\prime} \leq \frac{2\delta_{ij}}{1-\gamma}$, which completes the proof.
Theorem 2. Let $M_i \in \mathcal{M}^\phi$ and let $Q_i^{\pi_j^*}$ be the value function of an optimal policy of $M_j \in \mathcal{M}^\phi$ when executed in $M_i$. Given the set $\left\{\tilde{Q}_i^{\pi_1^*}, \tilde{Q}_i^{\pi_2^*}, \ldots, \tilde{Q}_i^{\pi_n^*}\right\}$ such that
$$\left|Q_i^{\pi_j^*}(s, a)-\tilde{Q}_i^{\pi_j^*}(s, a)\right| \leq \epsilon \quad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A}, \text{ and } j \in\{1,2, \ldots, n\},$$
let

$$\pi(s) \in \underset{a}{\operatorname{argmax}} \max_j \tilde{Q}_i^{\pi_j^*}(s, a) .$$
Finally, let $\phi_{\max }=\max_{s, a}\|\boldsymbol{\phi}(s, a)\|$, where $\|\cdot\|$ is the norm induced by the inner product adopted. Then,
$$Q_i^*(s, a)-Q_i^\pi(s, a) \leq \frac{2}{1-\gamma}\left(\phi_{\max } \min_j\left\|\mathbf{w}_i-\mathbf{w}_j\right\|+\epsilon\right) .$$
Proof: Using Theorem 1 (with the $\tilde{Q}_i^{\pi_j^*}$ as the approximate value functions) for the first step and Lemma 1 for the second, we have, for any $j$,
$$\begin{aligned} Q_i^*(s, a)-Q_i^\pi(s, a) & \leq Q_i^*(s, a)-Q_i^{\pi_j^*}(s, a)+\frac{2}{1-\gamma} \epsilon \\ & \leq \frac{2}{1-\gamma} \max_{s, a}\left|r_i(s, a)-r_j(s, a)\right|+\frac{2}{1-\gamma} \epsilon \\ & =\frac{2}{1-\gamma} \max_{s, a}\left|\boldsymbol{\phi}(s, a)^{\top} \mathbf{w}_i-\boldsymbol{\phi}(s, a)^{\top} \mathbf{w}_j\right|+\frac{2}{1-\gamma} \epsilon \\ & =\frac{2}{1-\gamma} \max_{s, a}\left|\boldsymbol{\phi}(s, a)^{\top}\left(\mathbf{w}_i-\mathbf{w}_j\right)\right|+\frac{2}{1-\gamma} \epsilon \\ & \leq \frac{2}{1-\gamma} \max_{s, a}\|\boldsymbol{\phi}(s, a)\|\left\|\mathbf{w}_i-\mathbf{w}_j\right\|+\frac{2}{1-\gamma} \epsilon \quad \text{(Cauchy-Schwarz)} \\ & =\frac{2 \phi_{\max }}{1-\gamma}\left\|\mathbf{w}_i-\mathbf{w}_j\right\|+\frac{2}{1-\gamma} \epsilon . \end{aligned}$$

Since $j$ is arbitrary, minimizing the right-hand side over $j$ yields the stated bound, completing the proof.
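In practice the right-hand side of Theorem 2 is directly computable once the task weight vectors are known. A small illustrative helper (the function name and inputs are my own assumptions, not from the paper):

```python
import numpy as np

def theorem2_bound(w_i, ws, phi_max, eps, gamma):
    """Evaluate the Theorem 2 bound on Q_i^*(s,a) - Q_i^pi(s,a).
    w_i:     (d,) weight vector of the current task M_i
    ws:      (n, d) weight vectors of the source tasks M_j
    phi_max: max_{s,a} ||phi(s,a)||
    eps:     uniform approximation error of the stored value functions."""
    closest = np.min(np.linalg.norm(ws - w_i, axis=1))  # min_j ||w_i - w_j||
    return 2.0 / (1.0 - gamma) * (phi_max * closest + eps)
```

The bound shrinks as some source task's $\mathbf{w}_j$ approaches the new task's $\mathbf{w}_i$, which is exactly the intuition behind transfer with successor features.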