Announcements

Related courses next quarter (spring):
- EE 578, Maryam Fazel: Convex Optimization. Modeling: how to formulate real-world problems as convex optimization; constrained optimization, KKT conditions, duality.
- CSE 547, Tim Althoff: data science / ML topics.
- CSE 535, Yin Tat Lee: Algorithms. First-order methods, large-scale data; analysis and convergence proofs.
- STAT 538, Zaid Harchaoui: Statistics. VC dimension, covering numbers, what is learnable, and inference.
Kernels
Machine Learning CSE546, Kevin Jamieson
University of Washington
November 6, 2018
Machine Learning Problems

Have a bunch of iid data of the form: \{(x_i, y_i)\}_{i=1}^n, with x_i \in R^d and y_i \in R.

Learning a model's parameters: minimize \sum_{i=1}^n \ell_i(w), where each \ell_i(w) is convex.

Hinge loss: \ell_i(w) = \max\{0, 1 - y_i x_i^T w\}
Logistic loss: \ell_i(w) = \log(1 + \exp(-y_i x_i^T w))
Squared error loss: \ell_i(w) = (y_i - x_i^T w)^2

All in terms of inner products! Even nearest neighbor can use inner products!
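As a quick illustration (not from the slides), here is a minimal NumPy sketch of the three losses above, each written purely in terms of the inner product x_i^T w; the weight vector and data point are made up for the example.

```python
import numpy as np

def hinge_loss(w, x, y):
    # l_i(w) = max{0, 1 - y_i * x_i^T w}
    return max(0.0, 1.0 - y * np.dot(x, w))

def logistic_loss(w, x, y):
    # l_i(w) = log(1 + exp(-y_i * x_i^T w))
    return np.log1p(np.exp(-y * np.dot(x, w)))

def squared_loss(w, x, y):
    # l_i(w) = (y_i - x_i^T w)^2
    return (y - np.dot(x, w)) ** 2

# Tiny made-up example: d = 3, one labeled point.
w = np.array([0.5, -0.2, 0.1])
x = np.array([1.0, 2.0, -1.0])
y = 1.0
print(hinge_loss(w, x, y), logistic_loss(w, x, y), squared_loss(w, x, y))
```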
What if the data is not linearly separable?

Use features of features of features of features: \phi(x) : R^d \to R^p.

Feature space can get really large really quickly!
Dot-product of polynomials of degree exactly d

d = 1: \phi(u) = [u_1, u_2]^T, so \langle \phi(u), \phi(v) \rangle = u_1 v_1 + u_2 v_2
Dot-product of polynomials of degree exactly d

d = 1: \phi(u) = [u_1, u_2]^T, \quad \langle \phi(u), \phi(v) \rangle = u_1 v_1 + u_2 v_2

d = 2: \phi(u) = [u_1^2, u_2^2, u_1 u_2, u_2 u_1]^T, \quad \langle \phi(u), \phi(v) \rangle = u_1^2 v_1^2 + u_2^2 v_2^2 + 2 u_1 u_2 v_1 v_2 = (u^T v)^2
Dot-product of polynomials of degree exactly d

d = 1: \phi(u) = [u_1, u_2]^T, \quad \langle \phi(u), \phi(v) \rangle = u_1 v_1 + u_2 v_2

d = 2: \phi(u) = [u_1^2, u_2^2, u_1 u_2, u_2 u_1]^T, \quad \langle \phi(u), \phi(v) \rangle = u_1^2 v_1^2 + u_2^2 v_2^2 + 2 u_1 u_2 v_1 v_2 = (u^T v)^2

General d: \langle \phi(u), \phi(v) \rangle = (u^T v)^d. The dimension of \phi(u) is roughly p^d if u \in R^p.
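A quick numerical sanity check of the d = 2 identity above (the vectors are made-up example values):

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

# Explicit degree-2 feature map phi(u) = [u1^2, u2^2, u1*u2, u2*u1]
def phi2(u):
    return np.array([u[0]**2, u[1]**2, u[0]*u[1], u[1]*u[0]])

lhs = np.dot(phi2(u), phi2(v))   # <phi(u), phi(v)>
rhs = np.dot(u, v) ** 2          # (u^T v)^2
print(lhs, rhs)                  # both equal 1.0 here
```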
Kernel Trick

\hat{w} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda \|w\|_2^2

There exists an \alpha \in R^n such that \hat{w} = \sum_{i=1}^n \alpha_i x_i.

Why? Suppose not. Then we can write w = \sum_i \alpha_i x_i + w_\perp with w_\perp^T x_i = 0 for all i. The residuals y_i - x_i^T w do not depend on w_\perp, but \|\sum_i \alpha_i x_i + w_\perp\|_2^2 = \|\sum_i \alpha_i x_i\|_2^2 + \|w_\perp\|_2^2, so dropping w_\perp only decreases the objective.
Kernel Trick

\hat{w} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda \|w\|_2^2

There exists an \alpha \in R^n such that \hat{w} = \sum_{i=1}^n \alpha_i x_i. Substituting:

\hat{\alpha} = \arg\min_\alpha \sum_{i=1}^n \Big(y_i - \sum_{j=1}^n \alpha_j \langle x_j, x_i \rangle\Big)^2 + \lambda \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j \langle x_i, x_j \rangle
Kernel Trick

\hat{w} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda \|w\|_2^2, and there exists \alpha \in R^n with \hat{w} = \sum_{i=1}^n \alpha_i x_i:

\hat{\alpha} = \arg\min_\alpha \sum_{i=1}^n \Big(y_i - \sum_{j=1}^n \alpha_j \langle x_j, x_i \rangle\Big)^2 + \lambda \sum_{i,j} \alpha_i \alpha_j \langle x_i, x_j \rangle
            = \arg\min_\alpha \sum_{i=1}^n \Big(y_i - \sum_{j=1}^n \alpha_j K(x_i, x_j)\Big)^2 + \lambda \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j)
            = \arg\min_\alpha \|y - K\alpha\|_2^2 + \lambda \alpha^T K \alpha

where K_{i,j} = K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle. To predict at a new z \in R^d: \hat{f}(z) = \sum_{i=1}^n \hat{\alpha}_i K(z, x_i).
Why regularization?

Typically K \succ 0. What if \lambda = 0?

\hat{\alpha} = \arg\min_\alpha \|y - K\alpha\|_2^2 + \lambda \alpha^T K \alpha

Setting the gradient to zero: -2K(y - K\hat{\alpha}) + 2\lambda K \hat{\alpha} = 0, so K(K + \lambda I)\hat{\alpha} = K y, and for K \succ 0 this gives \hat{\alpha} = (K + \lambda I)^{-1} y. Compare with the primal solution \hat{w} = (X^T X + \lambda I)^{-1} X^T y.
Why regularization?

Typically K \succ 0. What if \lambda = 0?

\hat{\alpha} = \arg\min_\alpha \|y - K\alpha\|_2^2 + \lambda \alpha^T K \alpha

With \lambda = 0 and K invertible, \hat{\alpha} = K^{-1} y, so K\hat{\alpha} = y exactly: unregularized kernel least squares can (over)fit any data!
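Putting the last few slides together, here is a minimal NumPy sketch of kernel ridge regression with an RBF kernel, using the closed form \hat{\alpha} = (K + \lambda I)^{-1} y derived above; the toy data, bandwidth, and \lambda are made up for illustration.

```python
import numpy as np

def rbf_kernel(A, B, sigma=0.5):
    # K[i, j] = exp(-||A_i - B_j||^2 / (2 sigma^2))
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / (2 * sigma**2))

# Toy 1-d regression data (made up).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)

lam = 1e-2
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # alpha_hat = (K + lam I)^{-1} y

# Predict at new points z: f_hat(z) = sum_i alpha_i K(z, x_i)
Z = np.linspace(-3, 3, 5)[:, None]
f_hat = rbf_kernel(Z, X) @ alpha
print(f_hat)
```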
Common kernels

Polynomials of degree exactly d: K(u, v) = (u^T v)^d
Polynomials of degree up to d: K(u, v) = (1 + u^T v)^d
Gaussian (squared exponential) kernel: K(u, v) = \exp\!\big(-\tfrac{\|u - v\|_2^2}{2\sigma^2}\big)
Sigmoid: K(u, v) = \tanh(\gamma\, u^T v + r)
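For concreteness, minimal NumPy implementations of the kernels above between two vectors (the hyperparameter defaults are arbitrary choices, not from the slides):

```python
import numpy as np

def poly_exact(u, v, d=3):
    # polynomials of degree exactly d
    return np.dot(u, v) ** d

def poly_up_to(u, v, d=3):
    # polynomials of degree up to d
    return (1.0 + np.dot(u, v)) ** d

def gaussian(u, v, sigma=1.0):
    # squared exponential / RBF kernel
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma**2))

def sigmoid(u, v, gamma=0.1, r=0.0):
    # sigmoid kernel (not positive semi-definite for all gamma, r)
    return np.tanh(gamma * np.dot(u, v) + r)
```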
Mercer's Theorem

When do we have a valid kernel K(x, x')?

Sufficient: K(x, x') is a valid kernel if there exists \phi(x) such that K(x, x') = \phi(x)^T \phi(x').

Mercer's Theorem: K(x, x') is a valid kernel if and only if, for any point set (x_1, ..., x_n), the matrix K with K_{i,j} = K(x_i, x_j) is symmetric and positive semi-definite.
RBF Kernel

K(u, v) = \exp\!\big(-\tfrac{\|u - v\|_2^2}{2\sigma^2}\big)

Note that this is like weighting bumps on each point, as in kernel smoothing, but now we learn the weights.
RBF Kernel

K(u, v) = \exp\!\big(-\tfrac{\|u - v\|_2^2}{2\sigma^2}\big)

The bandwidth sigma has an enormous effect on the fit of \hat{f}(x) = \sum_{i=1}^n \hat{\alpha}_i K(x_i, x).

[Figure: fitted curves for \sigma = 10^{-2}, 10^{-1}, 10^{0}, each with \lambda = 10^{-4}.]
RBF Kernel

K(u, v) = \exp\!\big(-\tfrac{\|u - v\|_2^2}{2\sigma^2}\big)

The bandwidth sigma has an enormous effect on the fit of \hat{f}(x) = \sum_{i=1}^n \hat{\alpha}_i K(x_i, x).

[Figure: fitted curves for several combinations of the bandwidth \sigma and the regularization \lambda.]
RBF Kernel

K(u, v) = \exp\!\big(-\tfrac{\|u - v\|_2^2}{2\sigma^2}\big)

Basis representation in 1d (with \sigma = 1)?

[\phi(x)]_i = \tfrac{1}{\sqrt{i!}} e^{-x^2/2} x^i \quad \text{for } i = 0, 1, ...

\phi(x)^T \phi(x') = \sum_{i=0}^\infty \tfrac{1}{\sqrt{i!}} e^{-x^2/2} x^i \cdot \tfrac{1}{\sqrt{i!}} e^{-(x')^2/2} (x')^i = e^{-\frac{x^2 + (x')^2}{2}} \sum_{i=0}^\infty \tfrac{1}{i!} (x x')^i = e^{-|x - x'|^2/2}

If n is very large, allocating an n-by-n matrix is tough. Can we truncate the above sum to approximate the kernel?
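A small sketch of that truncation idea (assumptions: 1-d inputs, \sigma = 1, truncation levels chosen arbitrarily), checking how quickly the truncated features approximate the exact RBF kernel:

```python
import numpy as np
from math import factorial

def phi_truncated(x, m):
    # [phi(x)]_i = x^i * exp(-x^2/2) / sqrt(i!), for i = 0..m-1
    return np.array([x**i * np.exp(-x**2 / 2) / np.sqrt(factorial(i)) for i in range(m)])

x, xp = 0.7, -1.2
exact = np.exp(-(x - xp) ** 2 / 2)
for m in [2, 5, 10, 20]:
    approx = np.dot(phi_truncated(x, m), phi_truncated(xp, m))
    print(m, approx, exact)  # approximation improves as m grows
```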
RBF kernel and random features

Useful identities: 2\cos(\alpha)\cos(\beta) = \cos(\alpha + \beta) + \cos(\alpha - \beta), and e^{jz} = \cos(z) + j\sin(z).

Recall HW1 where we used the feature map

\phi(x) = [\sqrt{2}\cos(w_1^T x + b_1), ..., \sqrt{2}\cos(w_p^T x + b_p)]^T, \quad w_k \sim N(0, \bar{\sigma}^2 I), \; b_k \sim \text{uniform}(0, \pi)

E\big[\tfrac{1}{p}\phi(x)^T \phi(y)\big] = \tfrac{1}{p}\sum_{k=1}^p E[2\cos(w_k^T x + b_k)\cos(w_k^T y + b_k)] = E_{w,b}[2\cos(w^T x + b)\cos(w^T y + b)]
= E_{w,b}[\cos(w^T(x + y) + 2b)] + E_w[\cos(w^T(x - y))]
RBF kernel and random features

With \phi(x) = [\sqrt{2}\cos(w_1^T x + b_1), ..., \sqrt{2}\cos(w_p^T x + b_p)]^T, \; w_k \sim N(0, \bar{\sigma}^2 I), \; b_k \sim \text{uniform}(0, \pi):

E\big[\tfrac{1}{p}\phi(x)^T \phi(y)\big] = E_{w,b}[2\cos(w^T x + b)\cos(w^T y + b)]
= E_{w,b}[\cos(w^T(x + y) + 2b)] + E_w[\cos(w^T(x - y))]
= 0 + E_w[\cos(w^T(x - y))] = e^{-\bar{\sigma}^2 \|x - y\|_2^2 / 2}

(The first term vanishes because 2b is uniform over a full period; the second follows from the Gaussian characteristic function.)

[Rahimi, Recht NIPS 2007] — NIPS Test of Time Award, 2018
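A minimal sketch of this random-features construction (the dimension, bandwidth, and number of features p are arbitrary demo choices), comparing the averaged feature inner product to the exact Gaussian kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, sigma_bar = 5, 20000, 1.0

x = rng.standard_normal(d)
y = rng.standard_normal(d)

# Random features: phi(z)_k = sqrt(2) * cos(w_k^T z + b_k)
W = sigma_bar * rng.standard_normal((p, d))   # w_k ~ N(0, sigma_bar^2 I)
b = rng.uniform(0, np.pi, size=p)             # b_k ~ uniform(0, pi)
phi = lambda z: np.sqrt(2) * np.cos(W @ z + b)

approx = phi(x) @ phi(y) / p                                   # (1/p) phi(x)^T phi(y)
exact = np.exp(-sigma_bar**2 * np.sum((x - y) ** 2) / 2)
print(approx, exact)   # close for large p
```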
RBF Classification

\hat{w}, \hat{b} = \arg\min_{w,b} \sum_{i=1}^n \max\{0, 1 - y_i(b + x_i^T w)\} + \lambda \|w\|_2^2

Kernelized:

\hat{\alpha}, \hat{b} = \arg\min_{\alpha,b} \sum_{i=1}^n \max\Big\{0, 1 - y_i\big(b + \sum_{j=1}^n \alpha_j \langle x_i, x_j \rangle\big)\Big\} + \lambda \sum_{i,j=1}^n \alpha_i \alpha_j \langle x_i, x_j \rangle
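The slides give the kernelized objective but not an optimizer; one simple option is plain subgradient descent on \alpha and b, sketched below with an RBF kernel. The data, bandwidth, step size, and iteration count are all made up for the demo.

```python
import numpy as np

def rbf(A, B, sigma=0.5):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

# Toy binary data (made up): labels in {-1, +1}, not linearly separable in R^2.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 2))
y = np.sign(X[:, 0] * X[:, 1])

K, lam, eta = rbf(X, X), 1e-2, 1e-3
alpha, b = np.zeros(len(X)), 0.0
for _ in range(2000):                      # subgradient descent on the kernelized objective
    margins = y * (b + K @ alpha)
    active = margins < 1                   # points contributing a hinge subgradient
    g_alpha = -(K[:, active] * y[active]).sum(axis=1) + 2 * lam * (K @ alpha)
    g_b = -y[active].sum()
    alpha -= eta * g_alpha
    b -= eta * g_b

pred = np.sign(b + K @ alpha)
print("train accuracy:", np.mean(pred == y))
```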
Wait, infinite dimensions?

Isn't everything separable there? How are we not overfitting?

Regularization! The fat-shattering dimension scales like (R / margin)^2.
String Kernels

Example from Efron and Hastie, 2016. Amino acid sequences of different lengths: x_1, x_2.

Features: counts of all subsequences of length 3 (out of the 20 possible amino acids, so 20^3 = 8000 possible features).
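A minimal sketch of this idea, treating "subsequences of length 3" as contiguous trigrams and using made-up short sequences (not the sequences from Efron and Hastie):

```python
from collections import Counter

def trigram_counts(seq):
    # Count all contiguous length-3 substrings of an amino-acid string.
    return Counter(seq[i:i+3] for i in range(len(seq) - 2))

def string_kernel(s1, s2):
    # K(s1, s2) = <phi(s1), phi(s2)>, where phi is the vector of trigram counts.
    c1, c2 = trigram_counts(s1), trigram_counts(s2)
    return sum(c1[t] * c2[t] for t in c1 if t in c2)

x1 = "MKVLATGQRSKVLA"   # made-up example sequences
x2 = "KVLAAATGQRMK"
print(string_kernel(x1, x2))
```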
Principal Component Analysis
Machine Learning CSE546, Kevin Jamieson
University of Washington
November 6, 2018
Linear projections

Given x_1, ..., x_n \in R^d, for q < d find a compressed representation with \eta_1, ..., \eta_n \in R^q such that x_i \approx \mu + V_q \eta_i and V_q^T V_q = I.

\min_{\mu, V_q, \{\eta_i\}} \sum_{i=1}^n \|x_i - \mu - V_q \eta_i\|_2^2
Linear projections

Given x_1, ..., x_n \in R^d, for q < d find a compressed representation with \eta_1, ..., \eta_n \in R^q such that x_i \approx \mu + V_q \eta_i and V_q^T V_q = I.

\min_{\mu, V_q, \{\eta_i\}} \sum_{i=1}^n \|x_i - \mu - V_q \eta_i\|_2^2

Fix V_q and solve for \mu, \eta_i: \quad \mu = \bar{x} = \tfrac{1}{n}\sum_{i=1}^n x_i, \quad \eta_i = V_q^T(x_i - \bar{x})

Which gives us: \sum_{i=1}^n \|(x_i - \bar{x}) - V_q V_q^T (x_i - \bar{x})\|_2^2

V_q V_q^T is a projection matrix; we seek the basis of size q that minimizes this reconstruction error.
Linear projections

\sum_{i=1}^n \|(x_i - \bar{x}) - V_q V_q^T (x_i - \bar{x})\|_2^2, \quad \Sigma := \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T, \quad V_q^T V_q = I_q

Expanding the square and using V_q^T V_q = I_q:

\sum_{i=1}^n \|x_i - \bar{x}\|_2^2 - 2(x_i - \bar{x})^T V_q V_q^T (x_i - \bar{x}) + (x_i - \bar{x})^T V_q V_q^T V_q V_q^T (x_i - \bar{x})
= \sum_{i=1}^n \|x_i - \bar{x}\|_2^2 - (x_i - \bar{x})^T V_q V_q^T (x_i - \bar{x})
= \mathrm{Tr}(\Sigma) - \mathrm{Tr}(V_q^T \Sigma V_q)
Linear projections

\min_{V_q : V_q^T V_q = I_q} \sum_{i=1}^n \|(x_i - \bar{x}) - V_q V_q^T (x_i - \bar{x})\|_2^2 = \min_{V_q : V_q^T V_q = I_q} \mathrm{Tr}(\Sigma) - \mathrm{Tr}(V_q^T \Sigma V_q)

Eigenvalue decomposition of \Sigma = U \,\mathrm{diag}(d_1, ..., d_d)\, U^T with d_1 \geq d_2 \geq ... \geq d_d \geq 0.

Maximizing \mathrm{Tr}(V_q^T \Sigma V_q) subject to V_q^T V_q = I_q gives \sum_{j=1}^q d_j, achieved by V_q = [u_1, ..., u_q], the top q eigenvectors of \Sigma.
Linear projections

\min_{V_q : V_q^T V_q = I_q} \sum_{i=1}^n \|(x_i - \bar{x}) - V_q V_q^T (x_i - \bar{x})\|_2^2 = \min_{V_q : V_q^T V_q = I_q} \mathrm{Tr}(\Sigma) - \mathrm{Tr}(V_q^T \Sigma V_q)

with \Sigma := \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T. The solution: V_q is the first q eigenvectors of \Sigma.

We simultaneously minimize reconstruction error and capture the most variance in the data.
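Here is a minimal NumPy sketch of this recipe, computing the top-q eigenvectors of \Sigma, the compressed coordinates \eta_i, and the reconstruction; the data and q are made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))  # made-up correlated data, n x d

q = 2
x_bar = X.mean(axis=0)
Sigma = (X - x_bar).T @ (X - x_bar)              # sum_i (x_i - x_bar)(x_i - x_bar)^T

evals, evecs = np.linalg.eigh(Sigma)             # eigh returns eigenvalues in ascending order
V_q = evecs[:, ::-1][:, :q]                      # top-q eigenvectors of Sigma

eta = (X - x_bar) @ V_q                          # compressed representation eta_i = V_q^T (x_i - x_bar)
X_hat = x_bar + eta @ V_q.T                      # reconstruction mu + V_q eta_i
print("reconstruction error:", np.sum((X - X_hat) ** 2))
```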
Pictures

V_q are the first q eigenvectors of \Sigma := \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T.
V_q are the first q principal components.
Linear projections

Given x_i \in R^d and some q < d, consider V_q = [v_1, v_2, ..., v_q] orthonormal: V_q^T V_q = I_q.

V_q are the first q eigenvectors of \Sigma := \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T.
V_q are the first q principal components.

Principal Component Analysis (PCA) projects (X - \mathbf{1}\bar{x}^T) down onto V_q:

(X - \mathbf{1}\bar{x}^T) V_q = U_q \,\mathrm{diag}(d_1, ..., d_q), \quad U_q^T U_q = I_q
Singular Value Decomposition (SVD)

Theorem (SVD): Let A \in R^{m \times n} with rank r \leq \min\{m, n\}. Then A = U S V^T, where S \in R^{r \times r} is diagonal with positive entries, U^T U = I, and V^T V = I.

A^T A v_i = ? \qquad A A^T u_i = ?
Singular Value Decomposition (SVD)

Theorem (SVD): Let A \in R^{m \times n} with rank r \leq \min\{m, n\}. Then A = U S V^T, where S \in R^{r \times r} is diagonal with positive entries, U^T U = I, and V^T V = I.

A^T A v_i = S_{i,i}^2 v_i \qquad A A^T u_i = S_{i,i}^2 u_i

V are the first r eigenvectors of A^T A, with eigenvalues \mathrm{diag}(S)^2.
U are the first r eigenvectors of A A^T, with eigenvalues \mathrm{diag}(S)^2.
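A quick NumPy check of these identities on a random matrix (the sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))                   # m x n, rank 4 almost surely

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U diag(s) V^T
V = Vt.T

# A^T A v_i = s_i^2 v_i  and  A A^T u_i = s_i^2 u_i
i = 0
print(np.allclose(A.T @ A @ V[:, i], s[i] ** 2 * V[:, i]))
print(np.allclose(A @ A.T @ U[:, i], s[i] ** 2 * U[:, i]))
```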
Linear projections

Given x_i \in R^d and some q < d, consider V_q = [v_1, v_2, ..., v_q] orthonormal: V_q^T V_q = I_q.

V_q are the first q eigenvectors of \Sigma := \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T.
V_q are the first q principal components.

Principal Component Analysis (PCA) projects (X - \mathbf{1}\bar{x}^T) down onto V_q:

(X - \mathbf{1}\bar{x}^T) V_q = U_q \,\mathrm{diag}(d_1, ..., d_q), \quad U_q^T U_q = I_q

Singular Value Decomposition defined as X - \mathbf{1}\bar{x}^T = U S V^T.
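Since the right singular vectors of the centered data matrix are the eigenvectors of \Sigma, PCA can be computed with one SVD. A minimal sketch on made-up data, verifying that (X - \mathbf{1}\bar{x}^T) V_q = U_q S_q:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))  # made-up data
q = 2

Xc = X - X.mean(axis=0)                       # X - 1 x_bar^T
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

V_q = Vt[:q].T                                # first q right singular vectors = top q PCs
scores = Xc @ V_q                             # projection of the centered data onto V_q
print(np.allclose(scores, U[:, :q] * s[:q]))  # True: (X - 1 x_bar^T) V_q = U_q S_q
```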
Dimensionality reduction

V_q are the first q eigenvectors of \Sigma, and SVD gives X - \mathbf{1}\bar{x}^T = U S V^T.

[Figure: the centered data X - \mathbf{1}\bar{x}^T plotted in the coordinates of the first two components, U_1 and U_2.]
Dimensionality reduction

V_q are the first q eigenvectors of \Sigma, and SVD gives X - \mathbf{1}\bar{x}^T = U S V^T.

Handwritten 3's: 16x16 pixel images, so x_i \in R^{256}.

(X - \mathbf{1}\bar{x}^T) V_2 = U_2 S_2 \in R^{n \times 2}

[Figures: the 2-d PCA embedding of the images, and a plot of the singular values diag(S).]
Kernel PCA

V_q are the first q eigenvectors of \Sigma, and SVD gives X - \mathbf{1}\bar{x}^T = U S V^T, with (X - \mathbf{1}\bar{x}^T) V_q = U_q S_q \in R^{n \times q}.

Let J = I - \mathbf{1}\mathbf{1}^T / n, so that J X = X - \mathbf{1}\bar{x}^T = U S V^T. Then (JX)(JX)^T = ?
Kernel PCA

V_q are the first q eigenvectors of \Sigma, and SVD gives X - \mathbf{1}\bar{x}^T = U S V^T, with (X - \mathbf{1}\bar{x}^T) V_q = U_q S_q \in R^{n \times q}.

Let J = I - \mathbf{1}\mathbf{1}^T / n, so that J X = X - \mathbf{1}\bar{x}^T = U S V^T. Then

(JX)(JX)^T = U S^2 U^T

The n \times n matrix (JX)(JX)^T has entries that are inner products of centered data points, so it kernelizes: eigendecomposing the centered kernel matrix recovers U_q and S_q, and hence the embedding U_q S_q.
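A minimal kernel PCA sketch along these lines: form the centered kernel matrix J K J (which plays the role of (JX)(JX)^T = U S^2 U^T), eigendecompose it, and read off the embedding U_q S_q. The data, kernel bandwidth, and q are made up for the demo.

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))        # made-up data
n, q = len(X), 2

J = np.eye(n) - np.ones((n, n)) / n      # centering matrix J = I - 11^T / n
K = rbf(X, X)
Kc = J @ K @ J                           # centered kernel matrix = U S^2 U^T

evals, U = np.linalg.eigh(Kc)            # ascending order
evals, U = evals[::-1], U[:, ::-1]
S_q = np.sqrt(np.maximum(evals[:q], 0))  # singular values S = sqrt(eigenvalues)
embedding = U[:, :q] * S_q               # U_q S_q: the q-dimensional kernel PCA embedding
print(embedding.shape)
```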
PCA Algorithm