These notes are based on the gradient-descent portion of the lecture notes for Tsinghua University's Machine Learning course; they are essentially a cheat sheet the author put together in the day or two before the exam. They cover a lot of ground without much detail, and are mainly meant as material for review and memorization.
Smoothness assumption
- Upper bound for $\nabla^2 f(x)$: $\left\|\nabla^2 f(w)\right\| \le L$, i.e.
  $f(w') \le f(w) + \langle \nabla f(w),\, w' - w\rangle + \frac{L}{2}\left\|w' - w\right\|^2$,
  which is equivalent to
  $\left\|\nabla f(w) - \nabla f(w')\right\| \le L\left\|w - w'\right\|$
- Proof:
    - $\Rightarrow$: $\left\|\nabla f(w) - \nabla f(w')\right\| \le \left\|\nabla^2 f(x)\right\| \left\|w - w'\right\| \le L\left\|w - w'\right\|$
    - $\Leftarrow$: for any unit vector $v$, $\left|v^\top \nabla^2 f(w)\, v\right| = \lim_{h \to 0} \frac{\left|\langle \nabla f(w + hv) - \nabla f(w),\, v\rangle\right|}{h} \le L$, so $\left\|\nabla^2 f(w)\right\| \le L$
- Two-sided version:
  $\left|f(w') - f(w) - \langle \nabla f(w),\, w' - w\rangle\right| \le \frac{L}{2}\left\|w' - w\right\|^2$
    - contains both the upper and the lower bound
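A minimal numerical sketch of the two characterizations above, assuming NumPy and a synthetic quadratic $f(w)=\frac{1}{2}w^\top A w$ (all names here are illustrative, not from the notes):

```python
import numpy as np

# Check both smoothness characterizations on a quadratic f(w) = 0.5 * w^T A w,
# whose Hessian is the constant matrix A, so L = ||A||_2 (largest eigenvalue,
# since A is symmetric PSD).
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B @ B.T                      # symmetric PSD Hessian
L = np.linalg.eigvalsh(A).max()  # smoothness constant

f = lambda w: 0.5 * w @ A @ w
grad = lambda w: A @ w

w, w2 = rng.standard_normal(5), rng.standard_normal(5)

# Lipschitz gradient: ||grad f(w) - grad f(w')|| <= L ||w - w'||
assert np.linalg.norm(grad(w) - grad(w2)) <= L * np.linalg.norm(w - w2) + 1e-9

# Two-sided bound: |f(w') - f(w) - <grad f(w), w' - w>| <= (L/2) ||w' - w||^2
gap = abs(f(w2) - f(w) - grad(w) @ (w2 - w))
assert gap <= 0.5 * L * np.linalg.norm(w2 - w) ** 2 + 1e-9
print("both smoothness characterizations hold; L =", round(L, 3))
```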
- When $w' = w - \eta\nabla f(w)$, take $\eta = \frac{1}{L}$ to make sure
  $f(w') - f(w) \le -\frac{1}{2L}\left\|\nabla f(w)\right\|^2 \le 0$
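The same kind of check for a single gradient step, a sketch under the same synthetic-quadratic assumption:

```python
import numpy as np

# One gradient step with eta = 1/L must decrease f by at least
# (1/(2L)) * ||grad f(w)||^2, per the bound above.
rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
A = B @ B.T
L = np.linalg.eigvalsh(A).max()

f = lambda w: 0.5 * w @ A @ w
grad = lambda w: A @ w

w = rng.standard_normal(5)
eta = 1.0 / L
w_next = w - eta * grad(w)            # the step w' = w - (1/L) grad f(w)

decrease = f(w_next) - f(w)           # actual progress
guaranteed = -np.linalg.norm(grad(w)) ** 2 / (2 * L)
assert decrease <= guaranteed + 1e-9  # at least as good as the guarantee
print(f"f(w') - f(w) = {decrease:.4f} <= {guaranteed:.4f}")
```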
Convex Function
- Lower bound for $\nabla^2 f(x)$:
  $f(w') \ge f(w) + \nabla f(w)^\top (w' - w)$
    - $\lambda_{\min}\big(\nabla^2 f(w)\big) \ge 0$
- Strongly convex function:
  $f(w') \ge f(w) + \nabla f(w)^\top (w' - w) + \frac{\mu}{2}\left\|w - w'\right\|^2$
    - $\lambda_{\min}\big(\nabla^2 f(w)\big) \ge \mu \ge 0$
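For a quadratic, both eigenvalue conditions can be read directly off the Hessian; a sketch, again assuming NumPy and illustrative names:

```python
import numpy as np

# Reading convexity / strong convexity off the Hessian eigenvalues, as in
# the lambda_min conditions above. For f(w) = 0.5 * w^T A w + b^T w the
# Hessian is the constant matrix A.
rng = np.random.default_rng(2)
B = rng.standard_normal((4, 4))
A = B @ B.T + 0.5 * np.eye(4)    # the +0.5*I shift forces lambda_min >= 0.5
eigs = np.linalg.eigvalsh(A)
mu, L = eigs.min(), eigs.max()
print(f"lambda_min = {mu:.3f}  ->  {mu:.3f}-strongly convex")
print(f"lambda_max = {L:.3f}  ->  {L:.3f}-smooth")

# Verify the strong-convexity lower bound at two random points:
b = rng.standard_normal(4)
f = lambda w: 0.5 * w @ A @ w + b @ w
grad = lambda w: A @ w + b
w, w2 = rng.standard_normal(4), rng.standard_normal(4)
lower = f(w) + grad(w) @ (w2 - w) + 0.5 * mu * np.linalg.norm(w2 - w) ** 2
assert f(w2) >= lower - 1e-9
```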
Convergence Analysis
- $f$ is $L$-smooth
- The sequence is $w_0, w_1, \dots, w_t$ and the optimal point is $w^*$,
    - $w_i = w_{i-1} - \eta\, \nabla f(w_{i-1})$
- Step size $\eta < \frac{2}{L}$
- Gradient descent makes progress:
  $f(w') - f(w) \le -\eta\left(1 - \frac{L\eta}{2}\right)\left\|\nabla f(w)\right\|^2$
- Convex function: $1/t$ rate,
  $f(w_t) - f(w^*) \le \frac{\left\|w_0 - w^*\right\|^2}{2t\eta}$
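A sketch that runs the iteration above on a synthetic convex quadratic with $\eta = 1/L$ and checks the $1/t$ bound at every step (NumPy and all names assumed for illustration):

```python
import numpy as np

# Gradient descent with eta = 1/L on a convex quadratic; verify the O(1/t)
# suboptimality bound above at every iteration.
rng = np.random.default_rng(3)
B = rng.standard_normal((6, 6))
A = B @ B.T
b = rng.standard_normal(6)
L = np.linalg.eigvalsh(A).max()

f = lambda w: 0.5 * w @ A @ w - b @ w
grad = lambda w: A @ w - b
w_star = np.linalg.solve(A, b)        # unique minimizer of this quadratic

eta = 1.0 / L
w0 = rng.standard_normal(6)
w = w0.copy()
for t in range(1, 201):
    w -= eta * grad(w)
    bound = np.linalg.norm(w0 - w_star) ** 2 / (2 * t * eta)
    assert f(w) - f(w_star) <= bound + 1e-9
print("f(w_t) - f(w*) stayed below ||w0 - w*||^2 / (2 t eta) for all t")
```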
Stochastic Gradient Descent (SGD)
- $f$ is $L$-smooth and convex
- $w_{t+1} = w_t - \eta G_t$, $E[G_t] = \nabla f(w_t)$
- $E\, f(\overline{w_t}) - f(w^*) \le \frac{\left\|w_0 - w^*\right\|^2}{2t\eta} + \eta\sigma^2$, where $\overline{w_t}$ is the average of the iterates and $\sigma^2$ bounds the variance $E\left\|G_t - \nabla f(w_t)\right\|^2$
- Convergence rate $1/\sqrt{t}$, from choosing $\eta \propto 1/\sqrt{t}$ to balance the two terms
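A sketch of the scheme: unbiased noisy gradients plus an averaged iterate, on the same kind of synthetic quadratic (NumPy assumed; the noise level and step size are illustrative):

```python
import numpy as np

# SGD on a convex quadratic with an unbiased noisy gradient
# G_t = grad f(w_t) + noise (so E[G_t] = grad f(w_t)), reporting the
# averaged iterate, matching the bound's use of \overline{w_t}.
rng = np.random.default_rng(4)
B = rng.standard_normal((6, 6))
A = B @ B.T
b = rng.standard_normal(6)
f = lambda w: 0.5 * w @ A @ w - b @ w
grad = lambda w: A @ w - b
w_star = np.linalg.solve(A, b)

T = 10_000
eta = 1.0 / np.sqrt(T)           # eta ~ 1/sqrt(T) balances the two terms
w = rng.standard_normal(6)
w_sum = np.zeros_like(w)
for _ in range(T):
    G = grad(w) + 0.1 * rng.standard_normal(6)   # unbiased stochastic gradient
    w -= eta * G
    w_sum += w
w_bar = w_sum / T                # averaged iterate
print(f"f(w_bar) - f(w*) = {f(w_bar) - f(w_star):.5f}")
```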
SVRG
Proof: To read.
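The notes leave the SVRG analysis to read later; as a reminder of the algorithm itself, here is a sketch of the standard SVRG loop (snapshot point, full gradient at the snapshot, variance-reduced inner steps) on a synthetic least-squares finite sum, with all names and constants illustrative:

```python
import numpy as np

# Standard SVRG on f(w) = (1/n) * sum_i 0.5 * (x_i^T w - y_i)^2: keep a
# snapshot w_tilde and its full gradient, and take variance-reduced steps.
rng = np.random.default_rng(5)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.01 * rng.standard_normal(n)

grad_i = lambda w, i: (X[i] @ w - y[i]) * X[i]   # per-sample gradient
full_grad = lambda w: X.T @ (X @ w - y) / n

w_tilde = np.zeros(d)
eta, epochs, m = 0.01, 20, 2 * n
for _ in range(epochs):
    mu = full_grad(w_tilde)                      # full gradient at snapshot
    w = w_tilde.copy()
    for _ in range(m):
        i = rng.integers(n)
        g = grad_i(w, i) - grad_i(w_tilde, i) + mu   # variance-reduced estimate
        w -= eta * g
    w_tilde = w                                  # advance the snapshot
print("||full gradient|| after SVRG:", np.linalg.norm(full_grad(w_tilde)))
```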
Mirror Descent
- After $k$ iterations,
  $f(\bar{x}) - f(u) \le \frac{1}{k}\sum_{t=0}^{k-1}\langle \nabla f(x_t),\, x_t - u\rangle$,
  where $\bar{x}$ is the average of $x_0, \dots, x_{k-1}$; the right-hand side (also called the regret) becomes smaller as $k$ grows.
- $f$ is $\rho$-Lipschitz, that is, $\left\|\nabla f(x)\right\| \le \rho$. After $T = O\!\left(\frac{\rho^2}{\epsilon^2}\right)$ iterations, $f(\bar{x}) - f(x^*) \le \epsilon$, i.e. a $1/\sqrt{t}$ convergence rate.
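The notes don't spell out the update; one standard instantiation is mirror descent with the entropy mirror map on the probability simplex (exponentiated gradient). A sketch on a linear objective, with NumPy and all names assumed:

```python
import numpy as np

# Mirror descent with the entropy mirror map on the probability simplex
# (exponentiated gradient / multiplicative weights), minimizing a linear
# objective f(x) = c^T x; the minimum sits on the smallest c_i.
rng = np.random.default_rng(6)
d = 10
c = rng.standard_normal(d)
grad = lambda x: c                # gradient of the linear objective

k = 500
rho = np.linalg.norm(c)           # Lipschitz constant of f
eta = np.sqrt(2.0 * np.log(d)) / (rho * np.sqrt(k))   # standard step size
x = np.full(d, 1.0 / d)           # start from the uniform distribution
x_sum = np.zeros(d)
for _ in range(k):
    x = x * np.exp(-eta * grad(x))    # multiplicative update
    x /= x.sum()                      # normalize back onto the simplex
    x_sum += x
x_bar = x_sum / k                 # averaged iterate
print(f"f(x_bar) = {x_bar @ c:.4f}   vs   min_i c_i = {c.min():.4f}")
```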
- Linear Coupling: $1/t^2$ convergence rate, so $t \ge \Omega\!\left(\frac{1}{\epsilon}\right)^{1/2}$