Announcements. Kernels. Principal Component Analysis. Machine Learning CSE546, Kevin Jamieson, University of Washington.


1 Announcements. Convex optimization courses next quarter (spring): EE 578 (Maryam Fazel), CSE 547 (Tim Althoff), CSE 535 (Yin Tat Lee), STAT 538 (Zaid Harchaoui). Topics span modeling (how to formulate real-world problems as convex optimization), data science with large-scale data, constrained optimization, KKT conditions and duality, first-order algorithms and convergence proofs, and statistics and inference (VC dimension, covering numbers, what is learnable).

2 Kernels. Machine Learning CSE546, Kevin Jamieson, University of Washington. November 6, 2018.

3 Machine Learning Problems. Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$. Learning a model's parameters means minimizing $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex. Hinge loss: $\ell_i(w) = \max\{0, 1 - y_i x_i^T w\}$. Logistic loss: $\ell_i(w) = \log(1 + \exp(-y_i x_i^T w))$. Squared error loss: $\ell_i(w) = (y_i - x_i^T w)^2$. All in terms of inner products! Even nearest neighbor can use inner products!
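For concreteness, a small NumPy sketch (ours, not from the slides) that evaluates the three losses above at a single training pair; the variable names are illustrative:

```python
import numpy as np

def hinge_loss(w, x, y):
    # max{0, 1 - y * x^T w}
    return max(0.0, 1.0 - y * (x @ w))

def logistic_loss(w, x, y):
    # log(1 + exp(-y * x^T w))
    return np.log1p(np.exp(-y * (x @ w)))

def squared_loss(w, x, y):
    # (y - x^T w)^2
    return (y - x @ w) ** 2

w = np.array([0.5, -1.0])
x = np.array([1.0, 2.0])
y = 1.0
print(hinge_loss(w, x, y), logistic_loss(w, x, y), squared_loss(w, x, y))
```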

4 What if the data is not linearly separable? Use features of features of features of features: $\phi(x): \mathbb{R}^d \to \mathbb{R}^p$. The feature space can get really large really quickly!

5 Dot-product of polynomials of degree exactly $d$. $d = 1$: $\phi(u) = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}$, $\langle \phi(u), \phi(v) \rangle = u_1 v_1 + u_2 v_2$.

6 Dot-product of polynomials of degree exactly $d$. $d = 1$: $\phi(u) = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}$, $\langle \phi(u), \phi(v) \rangle = u_1 v_1 + u_2 v_2$. $d = 2$: $\phi(u) = \begin{bmatrix} u_1^2 \\ u_2^2 \\ u_1 u_2 \\ u_2 u_1 \end{bmatrix}$, $\langle \phi(u), \phi(v) \rangle = u_1^2 v_1^2 + u_2^2 v_2^2 + 2 u_1 u_2 v_1 v_2 = (u_1 v_1 + u_2 v_2)^2$.

7 Dot-product of polynomials of degree exactly $d$. $d = 1$: $\phi(u) = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}$, $\langle \phi(u), \phi(v) \rangle = u_1 v_1 + u_2 v_2$. $d = 2$: $\phi(u) = \begin{bmatrix} u_1^2 \\ u_2^2 \\ u_1 u_2 \\ u_2 u_1 \end{bmatrix}$, $\langle \phi(u), \phi(v) \rangle = u_1^2 v_1^2 + u_2^2 v_2^2 + 2 u_1 u_2 v_1 v_2 = (u^T v)^2$. General $d$: $\langle \phi(u), \phi(v) \rangle = (u^T v)^d$, while the dimension of $\phi(u)$ is roughly $p^d$ if $u \in \mathbb{R}^p$.
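A quick numerical check (ours, not from the slides) that the explicit degree-2 feature map matches the kernel $(u^T v)^2$ for $u, v \in \mathbb{R}^2$:

```python
import numpy as np

def phi_d2(u):
    # explicit degree-2 feature map for u in R^2: [u1^2, u2^2, u1*u2, u2*u1]
    return np.array([u[0]**2, u[1]**2, u[0]*u[1], u[1]*u[0]])

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

explicit = phi_d2(u) @ phi_d2(v)   # <phi(u), phi(v)>
kernel   = (u @ v) ** 2            # (u^T v)^d with d = 2
print(explicit, kernel)            # both equal 1.0
```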

8 Kernel Trick. $\hat{w} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda \|w\|_2^2$. Claim: there exists an $\alpha \in \mathbb{R}^n$ such that $\hat{w} = \sum_{i=1}^n \alpha_i x_i$. Why? Suppose not: write $\hat{w} = \sum_i \alpha_i x_i + w_\perp$ with $w_\perp \neq 0$ and $x_i^T w_\perp = 0$ for all $i$. Then every prediction $x_i^T \hat{w}$ is unchanged by $w_\perp$, while $\|\hat{w}\|_2^2 = \|\sum_i \alpha_i x_i\|_2^2 + \|w_\perp\|_2^2$ is strictly larger, so dropping $w_\perp$ gives a strictly better objective, a contradiction.

9 Kernel Trick. $\hat{w} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda \|w\|_2^2$. There exists an $\alpha \in \mathbb{R}^n$: $\hat{w} = \sum_{i=1}^n \alpha_i x_i$. Substituting, $\hat{\alpha} = \arg\min_\alpha \sum_{i=1}^n \big(y_i - \sum_{j=1}^n \alpha_j \langle x_j, x_i \rangle\big)^2 + \lambda \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j \langle x_i, x_j \rangle$.

10 Kernel Trick. $\hat{w} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda \|w\|_2^2$. There exists an $\alpha \in \mathbb{R}^n$: $\hat{w} = \sum_{i=1}^n \alpha_i x_i$. Then $\hat{\alpha} = \arg\min_\alpha \sum_{i=1}^n \big(y_i - \sum_{j=1}^n \alpha_j \langle x_j, x_i \rangle\big)^2 + \lambda \sum_{i,j} \alpha_i \alpha_j \langle x_i, x_j \rangle = \arg\min_\alpha \sum_{i=1}^n \big(y_i - \sum_{j=1}^n \alpha_j K(x_i, x_j)\big)^2 + \lambda \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j) = \arg\min_\alpha \|y - K\alpha\|_2^2 + \lambda\, \alpha^T K \alpha$. To predict at a new $z \in \mathbb{R}^d$: $\hat{f}(z) = \sum_{i=1}^n \hat{\alpha}_i K(x_i, z)$, where $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$.
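A minimal sketch of kernel ridge regression along these lines, assuming the usual closed form $\hat{\alpha} = (K + \lambda I)^{-1} y$ (which satisfies the optimality condition of the objective above); the toy data and the linear kernel are just for illustration:

```python
import numpy as np

def kernel_ridge_fit(K, y, lam):
    # alpha_hat = argmin_alpha ||y - K alpha||^2 + lam * alpha^T K alpha
    # closed form (K symmetric positive semi-definite): (K + lam I)^{-1} y
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), y)

def kernel_ridge_predict(alpha, k_z):
    # f_hat(z) = sum_i alpha_i K(x_i, z), with k_z[i] = K(x_i, z)
    return k_z @ alpha

# toy example with the linear kernel K(x, x') = x^T x'
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)
K = X @ X.T
alpha = kernel_ridge_fit(K, y, lam=1e-2)
z = rng.normal(size=3)
print(kernel_ridge_predict(alpha, X @ z))
```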

11 Why regularization? Typically $K \succ 0$. What if $\lambda = 0$? $\hat{\alpha} = \arg\min_\alpha \|y - K\alpha\|_2^2 + \lambda\, \alpha^T K \alpha$; the minimizer satisfies $(K + \lambda I)\hat{\alpha} = y$, which for $\lambda = 0$ becomes $K\hat{\alpha} = y$.

12 Why regularization? Typically $K \succ 0$. What if $\lambda = 0$? $\hat{\alpha} = \arg\min_\alpha \|y - K\alpha\|_2^2 + \lambda\, \alpha^T K \alpha$. Unregularized kernel least squares can (over)fit any data! With $\lambda = 0$, $\hat{\alpha} = K^{-1} y$, so $K\hat{\alpha} = y$: zero training error on every point.

13 Common kernels. Polynomials of degree exactly $d$: $K(u, v) = (u^T v)^d$. Polynomials of degree up to $d$: $K(u, v) = (1 + u^T v)^d$. Gaussian (squared exponential) kernel: $K(u, v) = \exp\!\big(-\tfrac{\|u - v\|^2}{2\sigma^2}\big)$. Sigmoid: $K(u, v) = \tanh(\gamma\, u^T v + c)$.
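A hedged sketch of these kernels in NumPy; the polynomial and sigmoid parameterizations ($1 + u^T v$, and $\gamma$, $c$) follow common conventions rather than anything stated explicitly on the slide:

```python
import numpy as np

def poly_exact_d(u, v, d):
    # polynomial kernel of degree exactly d: (u^T v)^d
    return (u @ v) ** d

def poly_up_to_d(u, v, d):
    # polynomial kernel of degree up to d: (1 + u^T v)^d
    return (1.0 + u @ v) ** d

def gaussian(u, v, sigma):
    # squared-exponential kernel: exp(-||u - v||^2 / (2 sigma^2))
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def sigmoid(u, v, gamma=1.0, c=0.0):
    # sigmoid kernel: tanh(gamma * u^T v + c); not PSD for all settings
    return np.tanh(gamma * (u @ v) + c)

u, v = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(poly_exact_d(u, v, 2), poly_up_to_d(u, v, 2), gaussian(u, v, 1.0), sigmoid(u, v))
```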

14 Mercer's Theorem. When do we have a valid kernel $K(x, x')$? Sufficient: $K(x, x')$ is a valid kernel if there exists $\phi(x)$ such that $K(x, x') = \phi(x)^T \phi(x')$. Mercer's Theorem: $K(x, x')$ is a valid kernel if and only if, for any point set $(x_1, \dots, x_n)$, the matrix $K$ with $K_{i,j} = K(x_i, x_j)$ is symmetric and positive semi-definite.
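A small utility (ours) that applies this criterion empirically: build the Gram matrix on a point set and check symmetry and positive semi-definiteness up to numerical tolerance:

```python
import numpy as np

def is_psd_gram(kernel, X, tol=1e-10):
    # Build K_ij = K(x_i, x_j) for the point set X and check symmetry + PSD.
    n = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    symmetric = np.allclose(K, K.T)
    eigvals = np.linalg.eigvalsh(K)   # real eigenvalues of the symmetric matrix
    return symmetric and eigvals.min() >= -tol

X = np.random.default_rng(1).normal(size=(10, 2))
rbf = lambda u, v: np.exp(-np.sum((u - v) ** 2) / 2.0)
print(is_psd_gram(rbf, X))   # True for a valid kernel
```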

15 RBF Kernel. $K(u, v) = \exp\!\big(-\tfrac{\|u - v\|^2}{2\sigma^2}\big)$. Note that this is like placing weighted bumps on each point, as in kernel smoothing, but now we learn the weights.

16 RBF Kernel. $K(u, v) = \exp\!\big(-\tfrac{\|u - v\|^2}{2\sigma^2}\big)$. The bandwidth $\sigma$ has an enormous effect on fit. [Plots of $\hat{f}(x) = \sum_{i=1}^n \hat{\alpha}_i K(x_i, x)$ for three bandwidths $\sigma$ at a fixed regularization $\lambda$.]

17 RBF Kernel. $K(u, v) = \exp\!\big(-\tfrac{\|u - v\|^2}{2\sigma^2}\big)$. The bandwidth $\sigma$ has an enormous effect on fit. [More plots of $\hat{f}(x) = \sum_{i=1}^n \hat{\alpha}_i K(x_i, x)$ for several combinations of $\sigma$ and $\lambda$.]

18 RBF Kernel. $K(u, v) = \exp\!\big(-\tfrac{\|u - v\|^2}{2\sigma^2}\big)$. Basis representation in 1d? $[\phi(x)]_i = \tfrac{1}{\sqrt{i!}} e^{-x^2/2} x^i$ for $i = 0, 1, \dots$, so $\phi(x)^T \phi(x') = \sum_{i=0}^{\infty} \tfrac{1}{\sqrt{i!}} e^{-x^2/2} x^i \cdot \tfrac{1}{\sqrt{i!}} e^{-(x')^2/2} (x')^i = e^{-\frac{x^2 + (x')^2}{2}} \sum_{i=0}^{\infty} \tfrac{1}{i!} (x x')^i = e^{-(x - x')^2/2}$. If $n$ is very large, allocating an $n$-by-$n$ matrix is tough. Can we truncate the above sum to approximate the kernel?
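A tiny sketch (ours) of truncating that 1-d series after $m$ terms and comparing against the exact kernel $e^{-(x - x')^2/2}$:

```python
import math

def rbf_1d_truncated(x, xp, m):
    # phi(x)^T phi(x') with the sum cut off after m terms:
    # e^{-(x^2 + x'^2)/2} * sum_{i=0}^{m-1} (x x')^i / i!
    s = sum((x * xp) ** i / math.factorial(i) for i in range(m))
    return math.exp(-(x ** 2 + xp ** 2) / 2.0) * s

x, xp = 0.7, -0.3
exact = math.exp(-(x - xp) ** 2 / 2.0)
for m in (2, 5, 10):
    print(m, rbf_1d_truncated(x, xp, m), exact)   # converges quickly to exact
```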

19 RBF kernel and random features. $2\cos(\alpha)\cos(\beta) = \cos(\alpha + \beta) + \cos(\alpha - \beta)$. Recall HW1, where we used the feature map $\phi(x) = \big[\sqrt{2}\cos(w_1^T x + b_1), \dots, \sqrt{2}\cos(w_p^T x + b_p)\big]^T$ with $w_k \sim \mathcal{N}(0, \sigma^2 I)$ and $b_k \sim \mathrm{uniform}(0, 2\pi)$. Then $\mathbb{E}\big[\tfrac{1}{p}\phi(x)^T \phi(y)\big] = \tfrac{1}{p}\sum_{k=1}^{p} \mathbb{E}\big[2\cos(w_k^T x + b_k)\cos(w_k^T y + b_k)\big] = \mathbb{E}_{w,b}\big[2\cos(w^T x + b)\cos(w^T y + b)\big]$. (Euler: $e^{jz} = \cos(z) + j\sin(z)$.) By the identity above, this equals $\mathbb{E}_{w}\big[\cos(w^T(x - y))\big]$, since the $\cos(w^T(x + y) + 2b)$ term averages to zero over $b$.

20 RBF kernel and random features. $2\cos(\alpha)\cos(\beta) = \cos(\alpha + \beta) + \cos(\alpha - \beta)$. Recall HW1, where we used the feature map $\phi(x) = \big[\sqrt{2}\cos(w_1^T x + b_1), \dots, \sqrt{2}\cos(w_p^T x + b_p)\big]^T$ with $w_k \sim \mathcal{N}(0, \sigma^2 I)$ and $b_k \sim \mathrm{uniform}(0, 2\pi)$. Then $\mathbb{E}\big[\tfrac{1}{p}\phi(x)^T \phi(y)\big] = \tfrac{1}{p}\sum_{k=1}^{p} \mathbb{E}\big[2\cos(w_k^T x + b_k)\cos(w_k^T y + b_k)\big] = \mathbb{E}_{w,b}\big[2\cos(w^T x + b)\cos(w^T y + b)\big] = e^{-\sigma^2\|x - y\|^2/2}$. [Rahimi, Recht, NIPS 2007; NIPS Test of Time Award.]
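A sketch of the random-feature approximation under the parameterization above ($w_k \sim \mathcal{N}(0, \sigma^2 I)$, $b_k \sim \mathrm{Unif}(0, 2\pi)$), so the target kernel is $e^{-\sigma^2\|x - y\|^2/2}$; the function names are ours:

```python
import numpy as np

def random_fourier_features(X, p, sigma, rng):
    # phi(x)_k = sqrt(2) cos(w_k^T x + b_k), w_k ~ N(0, sigma^2 I), b_k ~ Unif(0, 2 pi)
    d = X.shape[1]
    W = rng.normal(scale=sigma, size=(d, p))
    b = rng.uniform(0.0, 2 * np.pi, size=p)
    return np.sqrt(2.0) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
sigma = 1.0
Phi = random_fourier_features(X, p=20000, sigma=sigma, rng=rng)
K_approx = Phi @ Phi.T / Phi.shape[1]
sqdist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_exact = np.exp(-sigma ** 2 * sqdist / 2.0)
print(np.max(np.abs(K_approx - K_exact)))   # small for large p
```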

21 RBF Classification. $\hat{w} = \arg\min_{w,b} \sum_{i=1}^n \max\{0, 1 - y_i(b + x_i^T w)\} + \lambda \|w\|_2^2$. Kernelized: $\arg\min_{\alpha,b} \sum_{i=1}^n \max\{0, 1 - y_i\big(b + \sum_{j=1}^n \alpha_j \langle x_i, x_j \rangle\big)\} + \lambda \sum_{i,j=1}^n \alpha_i \alpha_j \langle x_i, x_j \rangle$.
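For reference, a small helper (ours) that simply evaluates this kernelized soft-margin objective for given $\alpha$ and $b$; it does not perform the optimization:

```python
import numpy as np

def kernel_svm_objective(alpha, b, K, y, lam):
    # sum_i max{0, 1 - y_i (b + sum_j alpha_j K(x_i, x_j))} + lam * alpha^T K alpha
    margins = y * (b + K @ alpha)
    return np.sum(np.maximum(0.0, 1.0 - margins)) + lam * alpha @ K @ alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
y = np.sign(X[:, 0])
K = X @ X.T                      # linear kernel, just for illustration
alpha = rng.normal(size=10)
print(kernel_svm_objective(alpha, b=0.0, K=K, y=y, lam=0.1))
```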

22 Wait, infinite dimensions? Isn't everything separable there? How are we not overfitting? Regularization! (The fat-shattering dimension scales like $(R/\mathrm{margin})^2$.)

23 String Kernels. Example from Efron and Hastie, 2016. Amino acid sequences of different lengths, $x_1$ and $x_2$; features are all subsequences of length 3 (of the 20 possible amino acids).

24 Principal Component Analysis. Machine Learning CSE546, Kevin Jamieson, University of Washington. November 6, 2018.

25 Linear projections. Given $x_1, \dots, x_n \in \mathbb{R}^d$, for $q < d$ find a compressed representation with $\alpha_1, \dots, \alpha_n \in \mathbb{R}^q$ such that $x_i \approx \mu + V_q \alpha_i$ and $V_q^T V_q = I$: $\min_{\mu, V_q, \{\alpha_i\}} \sum_{i=1}^n \|x_i - \mu - V_q \alpha_i\|_2^2$.

26 Linear projections. Given $x_1, \dots, x_n \in \mathbb{R}^d$, for $q < d$ find a compressed representation with $\alpha_1, \dots, \alpha_n \in \mathbb{R}^q$ such that $x_i \approx \mu + V_q \alpha_i$ and $V_q^T V_q = I$: $\min_{\mu, V_q, \{\alpha_i\}} \sum_{i=1}^n \|x_i - \mu - V_q \alpha_i\|_2^2$. Fix $V_q$ and solve for $\mu$ and $\alpha_i$: $\mu = \bar{x} = \tfrac{1}{n}\sum_{i=1}^n x_i$ and $\alpha_i = V_q^T(x_i - \bar{x})$. Which gives us: $V_q V_q^T$ is the projection matrix that minimizes reconstruction error over bases of size $q$.

27 Linear projections. The objective becomes $\sum_{i=1}^n \|(x_i - \bar{x}) - V_q V_q^T(x_i - \bar{x})\|_2^2$, with $\Sigma := \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T$ and $V_q^T V_q = I_q$. Expanding the square and using the cyclic property of the trace, minimizing over $V_q$ is equivalent to minimizing $\mathrm{Tr}(\Sigma) - \mathrm{Tr}(V_q^T \Sigma V_q)$.

28 Linear projections. $\sum_{i=1}^n \|(x_i - \bar{x}) - V_q V_q^T(x_i - \bar{x})\|_2^2$, with $\Sigma := \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T$ and $V_q^T V_q = I_q$: $\min_{V_q} \sum_{i=1}^n \|(x_i - \bar{x}) - V_q V_q^T(x_i - \bar{x})\|_2^2 = \min_{V_q} \mathrm{Tr}(\Sigma) - \mathrm{Tr}(V_q^T \Sigma V_q)$. Use the eigenvalue decomposition $\Sigma = U D U^T$ with $D = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$, $\lambda_1 \ge \dots \ge \lambda_d \ge 0$: $\mathrm{Tr}(V_q^T \Sigma V_q)$ is maximized by taking $V_q$ to be the first $q$ columns of $U$.

29 Linear projections. $\sum_{i=1}^n \|(x_i - \bar{x}) - V_q V_q^T(x_i - \bar{x})\|_2^2$, with $\Sigma := \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T$ and $V_q^T V_q = I_q$: $\min_{V_q} \sum_{i=1}^n \|(x_i - \bar{x}) - V_q V_q^T(x_i - \bar{x})\|_2^2 = \min_{V_q} \mathrm{Tr}(\Sigma) - \mathrm{Tr}(V_q^T \Sigma V_q)$, so $V_q$ are the first $q$ eigenvectors of $\Sigma$. This minimizes reconstruction error and captures the most variance in your data.
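A compact sketch of this recipe (our code, NumPy only): compute $V_q$ from the eigenvectors of $\Sigma$ and check the reconstruction $x_i \approx \bar{x} + V_q \alpha_i$:

```python
import numpy as np

def pca_eig(X, q):
    # Sigma = sum_i (x_i - xbar)(x_i - xbar)^T ; V_q = top-q eigenvectors of Sigma
    xbar = X.mean(axis=0)
    Xc = X - xbar
    Sigma = Xc.T @ Xc
    evals, evecs = np.linalg.eigh(Sigma)          # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:q]
    return xbar, evecs[:, order]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
xbar, Vq = pca_eig(X, q=2)
scores = (X - xbar) @ Vq        # alpha_i = V_q^T (x_i - xbar)
recon = xbar + scores @ Vq.T    # x_i approx xbar + V_q alpha_i
print(np.mean(np.sum((X - recon) ** 2, axis=1)))   # average reconstruction error
```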

30 Pictures. $V_q$ are the first $q$ eigenvectors of $\Sigma := \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T$; $V_q$ are the first $q$ principal components.

31 Linear projections. Given $x_i \in \mathbb{R}^d$ and some $q < d$, consider $V_q = [v_1, v_2, \dots, v_q]$ orthonormal, $V_q^T V_q = I_q$, where $V_q$ are the first $q$ eigenvectors of $\Sigma := \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T$, i.e. the first $q$ principal components. Principal Component Analysis (PCA) projects $(X - \mathbf{1}\bar{x}^T)$ down onto $V_q$: $(X - \mathbf{1}\bar{x}^T)V_q = U_q\, \mathrm{diag}(d_1, \dots, d_q)$ with $U_q^T U_q = I_q$.

32 Singular Value Decomposition (SVD). Theorem (SVD): Let $A \in \mathbb{R}^{m \times n}$ with rank $r \le \min\{m, n\}$. Then $A = U S V^T$, where $S \in \mathbb{R}^{r \times r}$ is diagonal with positive entries, $U^T U = I$, and $V^T V = I$. What are $A^T A v_i$ and $A A^T u_i$?

33 Singular Value Decomposition (SVD). Theorem (SVD): Let $A \in \mathbb{R}^{m \times n}$ with rank $r \le \min\{m, n\}$. Then $A = U S V^T$, where $S \in \mathbb{R}^{r \times r}$ is diagonal with positive entries, $U^T U = I$, and $V^T V = I$. $A^T A v_i = S_{i,i}^2 v_i$ and $A A^T u_i = S_{i,i}^2 u_i$: the columns of $V$ are the first $r$ eigenvectors of $A^T A$, and the columns of $U$ are the first $r$ eigenvectors of $A A^T$, in both cases with eigenvalues $\mathrm{diag}(S)^2$.

34 Linear projections. Given $x_i \in \mathbb{R}^d$ and some $q < d$, consider $V_q = [v_1, v_2, \dots, v_q]$ orthonormal, $V_q^T V_q = I_q$, where $V_q$ are the first $q$ eigenvectors of $\Sigma := \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T$, i.e. the first $q$ principal components. Principal Component Analysis (PCA) projects $(X - \mathbf{1}\bar{x}^T)$ down onto $V_q$: $(X - \mathbf{1}\bar{x}^T)V_q = U_q\, \mathrm{diag}(d_1, \dots, d_q)$ with $U_q^T U_q = I_q$. The Singular Value Decomposition is defined as $X - \mathbf{1}\bar{x}^T = U S V^T$.
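A quick check (ours) that the SVD route and the eigenvector route give the same projections, up to a sign per component:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)                  # X - 1 xbar^T

# SVD route: X - 1 xbar^T = U S V^T
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
q = 2
proj_svd = U[:, :q] * s[:q]              # U_q diag(d_1, ..., d_q)

# Eigenvector route: V_q = top-q eigenvectors of Sigma = Xc^T Xc
evals, evecs = np.linalg.eigh(Xc.T @ Xc)
Vq = evecs[:, np.argsort(evals)[::-1][:q]]
proj_eig = Xc @ Vq                       # (X - 1 xbar^T) V_q

# The two agree up to a sign flip per component
print(np.allclose(np.abs(proj_svd), np.abs(proj_eig)))
```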

35 Dimensionality reduction. $V_q$ are the first $q$ eigenvectors of $\Sigma$, and the SVD is $X - \mathbf{1}\bar{x}^T = U S V^T$. [Figure illustrating the projection of $X - \mathbf{1}\bar{x}^T$ onto the top two directions, $U_2$.]

36 Dimensionality reduction. $V_q$ are the first $q$ eigenvectors of $\Sigma$, and the SVD is $X - \mathbf{1}\bar{x}^T = U S V^T$. Example: handwritten 3s, 16x16 pixel images, so $x_i \in \mathbb{R}^{256}$. Project to two dimensions: $(X - \mathbf{1}\bar{x}^T)V_2 = U_2 S_2 \in \mathbb{R}^{n \times 2}$. [Figures: the 2-d embedding of the digits and the spectrum $\mathrm{diag}(S)$.]

37 Kernel PCA. $V_q$ are the first $q$ eigenvectors of $\Sigma$, and the SVD is $X - \mathbf{1}\bar{x}^T = U S V^T$, so $(X - \mathbf{1}\bar{x}^T)V_q = U_q S_q \in \mathbb{R}^{n \times q}$. Write $J X = X - \mathbf{1}\bar{x}^T = U S V^T$ with $J = I - \mathbf{1}\mathbf{1}^T/n$. What is $(J X)(J X)^T$?

38 Kernel PCA. $V_q$ are the first $q$ eigenvectors of $\Sigma$, and the SVD is $X - \mathbf{1}\bar{x}^T = U S V^T$, so $(X - \mathbf{1}\bar{x}^T)V_q = U_q S_q \in \mathbb{R}^{n \times q}$. With $J X = X - \mathbf{1}\bar{x}^T = U S V^T$ and $J = I - \mathbf{1}\mathbf{1}^T/n$, we get $(J X)(J X)^T = U S^2 U^T$: the projections $U_q S_q$ can be read off from an eigendecomposition of the centered Gram matrix, so only inner products are needed.
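A hedged sketch of this idea: compute $U_q S_q$ directly from the centered Gram matrix $J K J = U S^2 U^T$. With the linear kernel $K = X X^T$ this reproduces ordinary PCA scores up to sign; the function name is ours:

```python
import numpy as np

def kernel_pca_scores(K, q):
    # Center the Gram matrix: (J Phi)(J Phi)^T = J K J with J = I - 11^T / n,
    # then read off U_q S_q from its eigendecomposition U S^2 U^T.
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J
    evals, U = np.linalg.eigh(Kc)
    order = np.argsort(evals)[::-1][:q]
    S = np.sqrt(np.clip(evals[order], 0.0, None))
    return U[:, order] * S              # analogue of (X - 1 xbar^T) V_q = U_q S_q

# sanity check with the linear kernel K = X X^T
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
scores = kernel_pca_scores(X @ X.T, q=2)
print(scores.shape)   # (30, 2)
```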

39 PCA Algorithm.
