These notes are based on the unsupervised-learning portion of the lecture notes for Tsinghua University's Machine Learning course; they were mostly written as a cheat sheet a day or two before the exam. The content is extensive but not detailed, and is intended mainly as review and memorization material.
Principal Component Analysis
- Dimension reduction: JL lemma, a random projection to $d=\Omega\left(\frac{\log n}{\epsilon^2}\right)$ dimensions preserves pairwise distances among $n$ points up to a $(1\pm\epsilon)$ factor
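As a quick sanity check, the random projection behind the JL lemma can be sketched with a Gaussian matrix. A minimal sketch assuming NumPy; the constant 8 and all sizes are illustrative choices, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, eps = 100, 1000, 0.5
d = int(np.ceil(8 * np.log(n) / eps**2))   # target dimension; constant 8 is illustrative

X = rng.normal(size=(n, D))                # n points in high dimension D
P = rng.normal(size=(D, d)) / np.sqrt(d)   # Gaussian random projection matrix
Y = X @ P                                  # projected points

# distortion of one pairwise distance
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
ratio = proj / orig                        # should be close to 1
```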
- Goal of PCA
    - maximize variance: $\mathbb{E}[(v^\top x)^2]=v^\top XX^\top v$ subject to $\|v\|=1$
    - minimize reconstruction error: $\mathbb{E}[\|x-(v^\top x)v\|^2]$
- maximize variance: find $v_1,v_2,\ldots,v_d$
- How to find $v$
    - Eigen decomposition: $XX^\top = U\Sigma U^\top$; $v_1$ is the top eigenvector
    - Power method
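A minimal power-method sketch for the top eigenvector $v_1$ of $XX^\top$, assuming NumPy; the data sizes and iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))      # 5 features, 200 samples (illustrative sizes)
C = X @ X.T                        # the PSD matrix XX^T whose top eigenvector we want

v = rng.normal(size=5)
for _ in range(200):               # power iteration: repeatedly apply C and renormalize
    v = C @ v
    v /= np.linalg.norm(v)

top_eval = v @ C @ v               # Rayleigh quotient estimates the largest eigenvalue
```

Convergence is geometric in the ratio of the two largest eigenvalues, so the iteration count needed depends on the eigengap.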
Nearest Neighbor Classification
- KNN: K-nearest neighbor
- Nearest neighbor search: locality-sensitive hashing algorithm (LSH)*
    - Randomized $(c,R)$-near neighbor problem: if some data point lies within distance $R$ of the query, return a point within distance $cR$
    - A family $H$ is $(R,cR,P_1,P_2)$-sensitive if for all $p,q\in\mathbb{R}^d$:
        - if $\|p-q\|\le R$, then $\Pr_H[h(q)=h(p)]\ge P_1$
        - if $\|p-q\|\ge cR$, then $\Pr_H[h(q)=h(p)]\le P_2$
        - $P_1>P_2$
    - Algorithm based on an LSH family:
        - Construct $g_i(x)=(h_{i,1}(x),h_{i,2}(x),\ldots,h_{i,k}(x)),\ 1\le i\le L$, where each $h_{i,j}$ is drawn independently from $H$
        - Check the elements in the buckets $g_i(q)$; report any point within $cR$ of $q$, stopping after inspecting $2L+1$ points
        - If some point lies within $R$ of $q$, the algorithm returns a point within $cR$ with probability at least $\frac{1}{2}-\frac{1}{e}$
        - Parameters: $\rho=\frac{\log 1/P_1}{\log 1/P_2},\ k=\log_{1/P_2}(n),\ L=n^\rho$
    - Proof
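The table construction above ($L$ tables, each keyed by $k$ concatenated hashes) can be sketched in NumPy. Note this sketch uses random-hyperplane sign hashes, an LSH family for cosine similarity, rather than the Euclidean family the parameters above refer to; all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, L = 32, 8, 10                # data dimension, hashes per table, number of tables

# Each g_i concatenates k hyperplane hashes h_{i,j}(x) = sign(a_{i,j}^T x).
A = rng.normal(size=(L, k, d))

def g(x):
    """The L bucket keys of x: a tuple of k sign bits per table."""
    return [tuple((A[i] @ x > 0).astype(int)) for i in range(L)]

# index the dataset into L hash tables
X = rng.normal(size=(500, d))
tables = [dict() for _ in range(L)]
for idx, x in enumerate(X):
    for i, key in enumerate(g(x)):
        tables[i].setdefault(key, []).append(idx)

def query(q):
    """Collect candidates that collide with q in at least one table."""
    cand = set()
    for i, key in enumerate(g(q)):
        cand.update(tables[i].get(key, []))
    return cand
```

A real implementation would additionally filter the candidates by exact distance and stop after $2L+1$ inspections, as described above.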
Metric Learning
- Project $x_i$ to $f(x_i)$
- Hard version (compare the labels of its neighbors) vs. soft version
- Neighborhood Component Analysis (NCA)
    - $p_{i,j}\sim \exp(-\|f(x_i)-f(x_j)\|^2)$
    - Maximize $\sum_{i}\sum_{j\in C_i}p_{i,j}$
- LMNN: $L=\max(0,\ \|f(x)-f(x^+)\|_2-\|f(x)-f(x^-)\|_2+r)$
    - $x^+,x^-$ are a positive (same-class) and a negative (different-class) sample
    - $r$ is the margin
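The LMNN hinge above can be computed directly. A minimal sketch assuming NumPy, with $f$ taken as the identity and hypothetical toy points:

```python
import numpy as np

def triplet_loss(f_x, f_pos, f_neg, r=1.0):
    """LMNN-style hinge: pull the positive closer than the negative by margin r."""
    d_pos = np.linalg.norm(f_x - f_pos)
    d_neg = np.linalg.norm(f_x - f_neg)
    return max(0.0, d_pos - d_neg + r)

anchor = np.array([0.0, 0.0])
pos    = np.array([0.1, 0.0])   # same class, close to the anchor
neg    = np.array([3.0, 0.0])   # different class, far from the anchor

loss = triplet_loss(anchor, pos, neg, r=1.0)   # negative is beyond the margin, hinge is 0
```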
Spectral Clustering
- K-means
- Spectral graph clustering
    - Graph Laplacian: $L=D-A$, where $A$ is the adjacency matrix and $D$ the degree matrix
    - #zero eigenvalues = #connected components
    - Use the eigenvectors of the smallest $k$ eigenvalues as a $k$-dimensional embedding, then run $k$-means
    - Ratio cut can be relaxed into finding the smallest $k$ eigenvectors
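The zero-eigenvalue / connected-component fact is easy to check numerically. A sketch assuming NumPy, on a hypothetical 5-node graph with two components:

```python
import numpy as np

# adjacency matrix of a graph with two components: {0,1,2} (a triangle) and {3,4} (an edge)
A = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

D = np.diag(A.sum(axis=1))                   # degree matrix
L = D - A                                    # graph Laplacian

evals = np.linalg.eigvalsh(L)                # ascending; L is symmetric PSD
n_zero = int(np.sum(np.abs(evals) < 1e-9))   # multiplicity of eigenvalue 0
```

Here `n_zero` equals 2, matching the two connected components.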
SimCLR*
- Intelligence is positioning
- InfoNCE loss: $L(q,p_1,\{p_i\}_{i=2}^N)=-\log \frac{\exp(-\|f(q)-f(p_1)\|^2/(2\tau))}{\sum_{i=1}^{N}\exp(-\|f(q)-f(p_i)\|^2/(2\tau))}$
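A direct NumPy transcription of the InfoNCE loss above, with $f$ taken as the identity; $\tau$ and the toy points are illustrative:

```python
import numpy as np

def info_nce(q, positive, negatives, tau=0.5):
    """InfoNCE with logits -||q - p_i||^2 / (2*tau); entry 0 is the positive p_1."""
    ps = np.vstack([positive] + list(negatives))
    logits = -np.sum((ps - q) ** 2, axis=1) / (2 * tau)
    logits -= logits.max()                     # numerical stability for the softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                   # -log softmax weight of the positive

q   = np.array([0.0, 0.0])
pos = np.array([0.1, 0.0])                     # close to q: loss should be small
negs = [np.array([5.0, 0.0]), np.array([0.0, 5.0])]
loss = info_nce(q, pos, negs)
```

The loss is small when the positive is much closer to the query than every negative, and grows as negatives crowd in.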
- Learn $Z=f(x)$: map original data points into a space where semantic similarity is captured naturally
- Reproducing kernel Hilbert space: $k(f(x_1),f(x_2))=\langle\phi(f(x_1)),\phi(f(x_2))\rangle_H$
    - Usually, $K_{Z,i,j}=k(Z_i-Z_j)$ for some kernel $k$
- We are given a similarity matrix $\pi$ of the dataset in advance: $\pi_{i,j}$ is the similarity of data points $i$ and $j$. We want the similarity matrix $K_Z$ of $f(x)$ to be the same as that of $x$, which is given manually. Let $W_X\sim \pi,\ W_Z\sim K_Z$; we want these two samples to be the same.
- Minimize cross-entropy loss: $H_{\pi}^{k}(Z)=-\mathbb{E}_{W_X\sim P(\cdot;\pi)}[\log P(W_Z=W_X;K_Z)]$
    - Equivalent to InfoNCE loss: only care about row $i$, i.e. $\log P(W_{Z,i}=W_{X,i})$; the positive pair $q,p_1$ is drawn according to $\pi$, matching $W_X\sim P(\cdot;\pi)$
    - Equivalent to spectral clustering: reduces to $\arg\min_Z \mathrm{tr}(Z^\top L^* Z)$
t-SNE
- Data visualization: map data into a low-dimensional space (2D)
- SNE: same as NCA, want $q_{i,j}\sim \exp(-\|f(x_i)-f(x_j)\|^2/(2\sigma^2))$ to be similar to $p_{i,j}\sim \exp(-\|x_i-x_j\|^2/(2\sigma_i^2))$
    - Cross-entropy loss: $-p_{i,j}\cdot \log\frac{q_{i,j}}{p_{i,j}}$, summed over all pairs (this is exactly $\mathrm{KL}(p\,\|\,q)$)
- Crowding problem
- Solved by t-SNE: let $q_{i,j}\sim (1+\|y_j-y_i\|^2)^{-1}$ (Student t-distribution)
    - The power $-1$ gives a heavy tail, which relieves the crowding problem
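The heavy tail can be seen numerically by comparing the Gaussian affinity of SNE with the Student-t affinity of t-SNE. A sketch assuming NumPy; the distances and $\sigma=1$ are illustrative:

```python
import numpy as np

d2 = np.array([0.1, 1.0, 9.0, 25.0])   # squared distances in the embedding

q_gauss = np.exp(-d2 / 2.0)            # Gaussian affinity (SNE), sigma = 1
q_t     = 1.0 / (1.0 + d2)             # Student-t affinity (t-SNE), power -1

# At d^2 = 25 the Gaussian affinity is essentially zero while the
# Student-t affinity is still moderate, so far pairs keep a usable
# repulsive signal instead of being crushed together.
```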