These notes are based on the unsupervised-learning portion of the lecture notes for Tsinghua University's Machine Learning course; they were mostly written as a cheat sheet a day or two before the exam. The content is extensive but not detailed, and is intended mainly as review and memorization material.
Principal Component Analysis
- Dimension reduction: JL lemma, a random projection to $d=\Omega\left(\frac{\log n}{\epsilon^2}\right)$ dimensions preserves pairwise distances among $n$ points up to a $(1\pm\epsilon)$ factor
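As a quick sanity check, the random projection behind the JL lemma can be sketched with a Gaussian matrix. A minimal sketch assuming NumPy; the constant 8 and all sizes are illustrative choices, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, eps = 100, 1000, 0.5
d = int(np.ceil(8 * np.log(n) / eps**2))   # target dimension; constant 8 is illustrative

X = rng.normal(size=(n, D))                # n points in high dimension D
P = rng.normal(size=(D, d)) / np.sqrt(d)   # Gaussian random projection matrix
Y = X @ P                                  # projected points

# distortion of one pairwise distance
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
ratio = proj / orig                        # should be close to 1
```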
- Goal of PCA
    - maximize variance: $\mathbb{E}[(v^\top x)^2]=v^\top XX^\top v$ subject to $\|v\|=1$
    - minimize reconstruction error: $\mathbb{E}[\|x-(v^\top x)v\|^2]$
- maximize variance: find $v_1,v_2,\ldots,v_d$
- How to find $v$
    - Eigen decomposition: $XX^\top = U\Sigma U^\top$; $v_1$ is the top eigenvector
    - Power method
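A minimal power-method sketch for the top eigenvector $v_1$ of $XX^\top$, assuming NumPy; the data sizes and iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))      # 5 features, 200 samples (illustrative sizes)
C = X @ X.T                        # the PSD matrix XX^T whose top eigenvector we want

v = rng.normal(size=5)
for _ in range(200):               # power iteration: repeatedly apply C and renormalize
    v = C @ v
    v /= np.linalg.norm(v)

top_eval = v @ C @ v               # Rayleigh quotient estimates the largest eigenvalue
```

Convergence is geometric in the ratio of the two largest eigenvalues, so the iteration count needed depends on the eigengap.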
Nearest Neighbor Classification
- KNN: K-nearest neighbor
- Nearest neighbor search: locality-sensitive hashing algorithm (LSH)*
    - Randomized $(c,R)$-near neighbor problem: if some data point lies within distance $R$ of the query, return a point within distance $cR$
    - A family $H$ is $(R,cR,P_1,P_2)$-sensitive if for all $p,q\in\mathbb{R}^d$:
        - if $\|p-q\|\le R$, then $\Pr_H[h(q)=h(p)]\ge P_1$
        - if $\|p-q\|\ge cR$, then $\Pr_H[h(q)=h(p)]\le P_2$
        - $P_1>P_2$
    - Algorithm based on an LSH family:
        - Construct $g_i(x)=(h_{i,1}(x),h_{i,2}(x),\ldots,h_{i,k}(x)),\ 1\le i\le L$, where each $h_{i,j}$ is drawn independently from $H$
        - Check the elements in the buckets $g_i(q)$; report any point within $cR$ of $q$, stopping after inspecting $2L+1$ points
        - If some point lies within $R$ of $q$, the algorithm returns a point within $cR$ with probability at least $\frac{1}{2}-\frac{1}{e}$
        - Parameters: $\rho=\frac{\log 1/P_1}{\log 1/P_2},\ k=\log_{1/P_2}(n),\ L=n^\rho$
    - Proof
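The table construction above ($L$ tables, each keyed by $k$ concatenated hashes) can be sketched in NumPy. Note this sketch uses random-hyperplane sign hashes, an LSH family for cosine similarity, rather than the Euclidean family the parameters above refer to; all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, L = 32, 8, 10                # data dimension, hashes per table, number of tables

# Each g_i concatenates k hyperplane hashes h_{i,j}(x) = sign(a_{i,j}^T x).
A = rng.normal(size=(L, k, d))

def g(x):
    """The L bucket keys of x: a tuple of k sign bits per table."""
    return [tuple((A[i] @ x > 0).astype(int)) for i in range(L)]

# index the dataset into L hash tables
X = rng.normal(size=(500, d))
tables = [dict() for _ in range(L)]
for idx, x in enumerate(X):
    for i, key in enumerate(g(x)):
        tables[i].setdefault(key, []).append(idx)

def query(q):
    """Collect candidates that collide with q in at least one table."""
    cand = set()
    for i, key in enumerate(g(q)):
        cand.update(tables[i].get(key, []))
    return cand
```

A real implementation would additionally filter the candidates by exact distance and stop after $2L+1$ inspections, as described above.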
Metric Learning
- Project $x_i$ to $f(x_i)$
- Hard version (compare the labels of its neighbors) vs. soft version
- Neighborhood Component Analysis (NCA)
    - $p_{i,j}\sim \exp(-\|f(x_i)-f(x_j)\|^2)$
    - Maximize $\sum_{i}\sum_{j\in C_i}p_{i,j}$
- LMNN: $L=\max(0,\ \|f(x)-f(x^+)\|_2-\|f(x)-f(x^-)\|_2+r)$
    - $x^+,x^-$ are a positive (same-class) and a negative (different-class) sample
    - $r$ is the margin
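The LMNN hinge above can be computed directly. A minimal sketch assuming NumPy, with $f$ taken as the identity and hypothetical toy points:

```python
import numpy as np

def triplet_loss(f_x, f_pos, f_neg, r=1.0):
    """LMNN-style hinge: pull the positive closer than the negative by margin r."""
    d_pos = np.linalg.norm(f_x - f_pos)
    d_neg = np.linalg.norm(f_x - f_neg)
    return max(0.0, d_pos - d_neg + r)

anchor = np.array([0.0, 0.0])
pos    = np.array([0.1, 0.0])   # same class, close to the anchor
neg    = np.array([3.0, 0.0])   # different class, far from the anchor

loss = triplet_loss(anchor, pos, neg, r=1.0)   # negative is beyond the margin, hinge is 0
```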
Spectral Clustering
- K-means
- Spectral graph clustering
    - Graph Laplacian: $L=D-A$, where $A$ is the adjacency matrix and $D$ the degree matrix
    - #zero eigenvalues = #connected components
    - Use the eigenvectors of the smallest $k$ eigenvalues as a $k$-dimensional embedding, then run $k$-means
    - Ratio cut can be relaxed into finding the smallest $k$ eigenvectors
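The zero-eigenvalue / connected-component fact is easy to check numerically. A sketch assuming NumPy, on a hypothetical 5-node graph with two components:

```python
import numpy as np

# adjacency matrix of a graph with two components: {0,1,2} (a triangle) and {3,4} (an edge)
A = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

D = np.diag(A.sum(axis=1))                   # degree matrix
L = D - A                                    # graph Laplacian

evals = np.linalg.eigvalsh(L)                # ascending; L is symmetric PSD
n_zero = int(np.sum(np.abs(evals) < 1e-9))   # multiplicity of eigenvalue 0
```

Here `n_zero` equals 2, matching the two connected components.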
SimCLR*
- Intelligence is positioning
- InfoNCE loss: $L(q,p_1,\{p_i\}_{i=2}^N)=-\log \frac{\exp(-\|f(q)-f(p_1)\|^2/(2\tau))}{\sum_{i=1}^{N}\exp(-\|f(q)-f(p_i)\|^2/(2\tau))}$
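A direct NumPy transcription of the InfoNCE loss above, with $f$ taken as the identity; $\tau$ and the toy points are illustrative:

```python
import numpy as np

def info_nce(q, positive, negatives, tau=0.5):
    """InfoNCE with logits -||q - p_i||^2 / (2*tau); entry 0 is the positive p_1."""
    ps = np.vstack([positive] + list(negatives))
    logits = -np.sum((ps - q) ** 2, axis=1) / (2 * tau)
    logits -= logits.max()                     # numerical stability for the softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                   # -log softmax weight of the positive

q   = np.array([0.0, 0.0])
pos = np.array([0.1, 0.0])                     # close to q: loss should be small
negs = [np.array([5.0, 0.0]), np.array([0.0, 5.0])]
loss = info_nce(q, pos, negs)
```

The loss is small when the positive is much closer to the query than every negative, and grows as negatives crowd in.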
- Learn $Z=f(x)$: map original data points into a space where semantic similarity is captured naturally
- Reproducing kernel Hilbert space: $k(f(x_1),f(x_2))=\langle\phi(f(x_1)),\phi(f(x_2))\rangle_H$
    - Usually, $K_{Z,i,j}=k(Z_i-Z_j)$ for some kernel $k$
- We are given a similarity matrix $\pi$ of the dataset in advance: $\pi_{i,j}$ is the similarity of data points $i$ and $j$. We want the similarity matrix $K_Z$ of $f(x)$ to be the same as that of $x$, which is given manually. Let $W_X\sim \pi,\ W_Z\sim K_Z$; we want these two samples to be the same.
- Minimize cross-entropy loss: $H_{\pi}^{k}(Z)=-\mathbb{E}_{W_X\sim P(\cdot;\pi)}[\log P(W_Z=W_X;K_Z)]$
    - Equivalent to InfoNCE loss: only care about row $i$, i.e. $\log P(W_{Z,i}=W_{X,i})$; the positive pair $q,p_1$ is drawn according to $\pi$, matching $W_X\sim P(\cdot;\pi)$
    - Equivalent to spectral clustering: reduces to $\arg\min_Z \mathrm{tr}(Z^\top L^* Z)$
t-SNE
- Data visualization: map data into a low-dimensional space (2D)
- SNE: same as NCA, want $q_{i,j}\sim \exp(-\|f(x_i)-f(x_j)\|^2/(2\sigma^2))$ to be similar to $p_{i,j}\sim \exp(-\|x_i-x_j\|^2/(2\sigma_i^2))$
    - Cross-entropy loss: $-p_{i,j}\cdot \log\frac{q_{i,j}}{p_{i,j}}$, summed over all pairs (this is exactly $\mathrm{KL}(p\,\|\,q)$)
- Crowding problem
- Solved by t-SNE: let $q_{i,j}\sim (1+\|y_j-y_i\|^2)^{-1}$ (Student t-distribution)
    - The power $-1$ gives a heavy tail, which relieves the crowding problem
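The heavy tail can be seen numerically by comparing the Gaussian affinity of SNE with the Student-t affinity of t-SNE. A sketch assuming NumPy; the distances and $\sigma=1$ are illustrative:

```python
import numpy as np

d2 = np.array([0.1, 1.0, 9.0, 25.0])   # squared distances in the embedding

q_gauss = np.exp(-d2 / 2.0)            # Gaussian affinity (SNE), sigma = 1
q_t     = 1.0 / (1.0 + d2)             # Student-t affinity (t-SNE), power -1

# At d^2 = 25 the Gaussian affinity is essentially zero while the
# Student-t affinity is still moderate, so far pairs keep a usable
# repulsive signal instead of being crushed together.
```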