
Announcements. ML-related courses next quarter (spring): Convex Optimization, EE 578 (Maryam Fazel): modeling, that is, how to formulate real-world problems as convex optimization, constrained optimization, KKT conditions, duality. Data science, CSE 547 (Tim Althoff). Algorithms, CSE 535 (Yin Tat Lee): first-order methods, large-scale data, analysis and convergence proofs. Statistics, STAT 538 (Zaid Harchaoui): VC dimension, covering numbers, what is learnable, inference. 2018 Kevin Jamieson 1

Kernels. Machine Learning CSE546, Kevin Jamieson, University of Washington. November 6, 2018. 2018 Kevin Jamieson 2

Machine Learning Problems. Have a bunch of iid data of the form $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$. Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex. Hinge loss: $\ell_i(w) = \max\{0, 1 - y_i x_i^T w\}$. Logistic loss: $\ell_i(w) = \log(1 + \exp(-y_i x_i^T w))$. Squared error loss: $\ell_i(w) = (y_i - x_i^T w)^2$. All in terms of inner products! Even nearest neighbor can use inner products! 2018 Kevin Jamieson 3
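To make the "inner products" point concrete, here is a minimal numpy sketch (not from the slides; names are illustrative): each loss touches the data only through the inner products $x_i^T w$.

```python
import numpy as np

def hinge_loss(w, X, y):
    # max{0, 1 - y_i x_i^T w}; the data enters only via the inner products X @ w
    return np.maximum(0.0, 1.0 - y * (X @ w))

def logistic_loss(w, X, y):
    # log(1 + exp(-y_i x_i^T w))
    return np.log1p(np.exp(-y * (X @ w)))

def squared_loss(w, X, y):
    # (y_i - x_i^T w)^2
    return (y - X @ w) ** 2

# tiny example
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.sign(rng.normal(size=5))
w = rng.normal(size=3)
print(hinge_loss(w, X, y).sum(), logistic_loss(w, X, y).sum(), squared_loss(w, X, y).sum())
```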

What if the data is not linearly separable? Use features of features of features of features: $\phi(x): \mathbb{R}^d \to \mathbb{R}^p$. Feature space can get really large really quickly! 2018 Kevin Jamieson 4

Dot-product of polynomials of degree exactly $d$. $d = 1$: $\phi(u) = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}$, $\langle \phi(u), \phi(v)\rangle = u_1 v_1 + u_2 v_2$. 2018 Kevin Jamieson 5

Dot-product of polynomials of degree exactly $d$. $d = 1$: $\phi(u) = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}$, $\langle \phi(u), \phi(v)\rangle = u_1 v_1 + u_2 v_2$. $d = 2$: $\phi(u) = \begin{bmatrix} u_1^2 \\ u_2^2 \\ u_1 u_2 \\ u_2 u_1 \end{bmatrix}$, $\langle \phi(u), \phi(v)\rangle = u_1^2 v_1^2 + u_2^2 v_2^2 + 2 u_1 u_2 v_1 v_2 = (u^T v)^2$. 2018 Kevin Jamieson 6

Dot-product of polynomials of degree exactly $d$. $d = 1$: $\phi(u) = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}$, $\langle \phi(u), \phi(v)\rangle = u_1 v_1 + u_2 v_2$. $d = 2$: $\phi(u) = \begin{bmatrix} u_1^2 \\ u_2^2 \\ u_1 u_2 \\ u_2 u_1 \end{bmatrix}$, $\langle \phi(u), \phi(v)\rangle = u_1^2 v_1^2 + u_2^2 v_2^2 + 2 u_1 u_2 v_1 v_2 = (u^T v)^2$. General $d$: $\langle \phi(u), \phi(v)\rangle = (u^T v)^d$. The dimension of $\phi(u)$ is roughly $p^d$ if $u \in \mathbb{R}^p$. 2018 Kevin Jamieson 7
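A quick numerical check of the identity above, as a sketch that is not part of the slides: the explicit degree-2 feature map in $\mathbb{R}^2$ has the same dot product as the kernel evaluation $(u^T v)^2$.

```python
import numpy as np

def phi_d2(u):
    # explicit degree-2 feature map for u in R^2: [u1^2, u2^2, u1*u2, u2*u1]
    return np.array([u[0]**2, u[1]**2, u[0]*u[1], u[1]*u[0]])

rng = np.random.default_rng(0)
u, v = rng.normal(size=2), rng.normal(size=2)

lhs = phi_d2(u) @ phi_d2(v)   # explicit feature-space inner product
rhs = (u @ v) ** 2            # kernel evaluation, no explicit features needed
print(np.isclose(lhs, rhs))   # True
```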

Kernel Trick. $\hat{w} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda \|w\|_2^2$. There exists an $\alpha \in \mathbb{R}^n$ such that $\hat{w} = \sum_{i=1}^n \alpha_i x_i$. Why? Suppose not: then $\hat{w} = \sum_i \alpha_i x_i + w_\perp$ with $w_\perp \ne 0$ and $\langle w_\perp, x_i\rangle = 0$ for all $i$. The predictions $x_i^T \hat w$ are unchanged by $w_\perp$, but $\|\sum_i \alpha_i x_i + w_\perp\|_2^2 = \|\sum_i \alpha_i x_i\|_2^2 + \|w_\perp\|_2^2$, so dropping $w_\perp$ strictly decreases the objective. 2018 Kevin Jamieson 8

Kernel Trick. $\hat{w} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda \|w\|_2^2$. There exists an $\alpha \in \mathbb{R}^n$ such that $\hat{w} = \sum_{i=1}^n \alpha_i x_i$. Substituting, $\hat{\alpha} = \arg\min_\alpha \sum_{i=1}^n \big(y_i - \sum_{j=1}^n \alpha_j \langle x_j, x_i\rangle\big)^2 + \lambda \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j \langle x_i, x_j\rangle$. 2018 Kevin Jamieson 9

Kernel Trick. $\hat{w} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 + \lambda \|w\|_2^2$. There exists an $\alpha \in \mathbb{R}^n$ such that $\hat{w} = \sum_{i=1}^n \alpha_i x_i$. Substituting, $\hat{\alpha} = \arg\min_\alpha \sum_{i=1}^n \big(y_i - \sum_{j=1}^n \alpha_j \langle x_j, x_i\rangle\big)^2 + \lambda \sum_{i,j=1}^n \alpha_i \alpha_j \langle x_i, x_j\rangle = \arg\min_\alpha \sum_{i=1}^n \big(y_i - \sum_{j=1}^n \alpha_j K(x_i, x_j)\big)^2 + \lambda \sum_{i,j=1}^n \alpha_i \alpha_j K(x_i, x_j) = \arg\min_\alpha \|y - K\alpha\|_2^2 + \lambda\, \alpha^T K \alpha$, where $K_{i,j} = K(x_i, x_j) = \langle \phi(x_i), \phi(x_j)\rangle$. To predict at a new point $z \in \mathbb{R}^d$: $\hat f(z) = \sum_{i=1}^n \hat\alpha_i K(z, x_i)$. 2018 Kevin Jamieson 10
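A minimal kernel ridge regression sketch in numpy that follows the objective above; it assumes the closed-form solution $\hat\alpha = (K + \lambda I)^{-1} y$ derived on the next slide, and every function and variable name is illustrative rather than from the course code.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

def kernel_ridge_fit(X, y, lam=1e-3, sigma=1.0):
    K = rbf_kernel(X, X, sigma)
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)   # (K + lam I)^{-1} y
    return alpha

def kernel_ridge_predict(alpha, X_train, Z, sigma=1.0):
    # f_hat(z) = sum_i alpha_i K(z, x_i)
    return rbf_kernel(Z, X_train, sigma) @ alpha

# toy 1d example
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
alpha = kernel_ridge_fit(X, y, lam=1e-2, sigma=0.5)
print(kernel_ridge_predict(alpha, X, np.array([[0.0]]), sigma=0.5))
```

Varying sigma here reproduces the bandwidth effect illustrated a few slides later.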

Why regularization? Typically $K \succ 0$ (positive definite). What if $\lambda = 0$? $\hat\alpha = \arg\min_\alpha \|y - K\alpha\|_2^2 + \lambda\, \alpha^T K \alpha$. Setting the gradient to zero: $-2K(y - K\alpha) + 2\lambda K\alpha = 0$, i.e., $K(K + \lambda I)\alpha = Ky$, so $\hat\alpha = (K + \lambda I)^{-1} y$ when $K \succ 0$. Compare with ridge regression in the original space, $\hat w = (X^T X + \lambda I)^{-1} X^T y$. 2018 Kevin Jamieson 11

Why regularization? Typically $K \succ 0$. What if $\lambda = 0$? $\hat\alpha = \arg\min_\alpha \|y - K\alpha\|_2^2 + \lambda\, \alpha^T K \alpha$. Unregularized kernel least squares can (over)fit any data! $\hat\alpha = K^{-1} y$, so $K\hat\alpha = y$ exactly. 2018 Kevin Jamieson 12
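A small numerical illustration of this point, as a sketch with made-up data (it reuses the hypothetical rbf_kernel helper from the previous block): with $\lambda = 0$ the fit interpolates the noisy training labels exactly, while $\lambda > 0$ does not.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=20)    # noisy labels
K = rbf_kernel(X, X, sigma=0.5)

alpha_interp = np.linalg.solve(K, y)                # lambda = 0: alpha = K^{-1} y
alpha_ridge = np.linalg.solve(K + 1e-1 * np.eye(20), y)

print(np.max(np.abs(K @ alpha_interp - y)))         # ~0: fits the noise exactly
print(np.max(np.abs(K @ alpha_ridge - y)))          # noticeably larger than 0
```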

Common kernels. Polynomials of degree exactly $d$: $K(u, v) = (u^T v)^d$. Polynomials of degree up to $d$: $K(u, v) = (1 + u^T v)^d$. Gaussian (squared exponential) kernel: $K(u, v) = \exp\left(-\frac{\|u - v\|^2}{2\sigma^2}\right)$. Sigmoid: $K(u, v) = \tanh(\gamma\, u^T v + r)$. 2018 Kevin Jamieson 13

Mercer's Theorem. When do we have a valid kernel $K(x, x')$? Sufficient: $K(x, x')$ is a valid kernel if there exists $\phi(x)$ such that $K(x, x') = \phi(x)^T \phi(x')$. Mercer's Theorem: $K(x, x')$ is a valid kernel if and only if $K$ is symmetric and the matrix $[K_{i,j}] = [K(x_i, x_j)]$ is positive semi-definite for any point set $(x_1, \dots, x_n)$. 2018 Kevin Jamieson 14
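A sanity check in the spirit of the theorem, as a sketch rather than a proof: build the Gram matrix for a few of the common kernels on random points and confirm it is symmetric positive semi-definite (tiny negative eigenvalues from floating point are tolerated). The sigmoid kernel is deliberately left out, since it is not positive semi-definite in general.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))

def gram(kernel, X):
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

kernels = {
    "poly exactly 3": lambda u, v: (u @ v) ** 3,
    "poly up to 3":   lambda u, v: (1 + u @ v) ** 3,
    "gaussian":       lambda u, v: np.exp(-np.sum((u - v) ** 2) / (2 * 1.0**2)),
}

for name, k in kernels.items():
    K = gram(k, X)
    symmetric = np.allclose(K, K.T)
    psd = np.min(np.linalg.eigvalsh(K)) > -1e-8   # allow tiny numerical negatives
    print(name, symmetric, psd)
```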

RBF Kernel. $K(u, v) = \exp\left(-\frac{\|u - v\|^2}{2\sigma^2}\right)$. Note that this is like weighting "bumps" on each point, as in kernel smoothing, but now we learn the weights. 2018 Kevin Jamieson 15

RBF Kernel. $K(u, v) = \exp\left(-\frac{\|u - v\|^2}{2\sigma^2}\right)$. The bandwidth $\sigma$ has an enormous effect on the fit of $\hat f(x) = \sum_{i=1}^n \hat\alpha_i K(x_i, x)$. [Figure: kernel ridge fits on the same data for $\sigma = 10^{-2}, 10^{-1}, 10^{0}$, each with $\lambda = 10^{-4}$.] 2018 Kevin Jamieson 16

RBF Kernel. $K(u, v) = \exp\left(-\frac{\|u - v\|^2}{2\sigma^2}\right)$. The bandwidth $\sigma$ has an enormous effect on the fit of $\hat f(x) = \sum_{i=1}^n \hat\alpha_i K(x_i, x)$. [Figure: kernel ridge fits for several combinations of the bandwidth $\sigma$ and the regularization $\lambda$.] 2018 Kevin Jamieson 17

RBF Kernel. $K(u, v) = \exp\left(-\frac{\|u - v\|^2}{2\sigma^2}\right)$. Basis representation in 1d (with $\sigma = 1$)? $[\phi(x)]_i = \frac{1}{\sqrt{i!}}\, e^{-x^2/2}\, x^i$ for $i = 0, 1, \dots$ Then $\phi(x)^T \phi(x') = \sum_{i=0}^\infty \frac{1}{\sqrt{i!}} e^{-x^2/2} x^i \cdot \frac{1}{\sqrt{i!}} e^{-(x')^2/2} (x')^i = e^{-\frac{x^2 + (x')^2}{2}} \sum_{i=0}^\infty \frac{1}{i!} (x x')^i = e^{-|x - x'|^2/2}$. If $n$ is very large, allocating an $n$-by-$n$ matrix is tough. Can we truncate the above sum to approximate the kernel? 2018 Kevin Jamieson 18
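A quick check on truncating this sum, as a sketch that is not part of the slides: keeping only the first m terms of the 1d expansion (with $\sigma = 1$) already approximates $e^{-(x-x')^2/2}$ well for moderate $|x|$ and $|x'|$.

```python
import numpy as np
from math import factorial

def phi_truncated(x, m):
    # first m terms of the 1d RBF (sigma = 1) feature expansion
    return np.array([np.exp(-x**2 / 2) * x**i / np.sqrt(factorial(i)) for i in range(m)])

x, xp = 0.7, -1.2
exact = np.exp(-(x - xp)**2 / 2)
for m in (2, 5, 10, 20):
    approx = phi_truncated(x, m) @ phi_truncated(xp, m)
    print(m, exact, approx)   # converges quickly to the exact kernel value
```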

RBF kernel and random features. Trig identity: $2\cos(\alpha)\cos(\beta) = \cos(\alpha+\beta) + \cos(\alpha-\beta)$; also recall $e^{jz} = \cos(z) + j\sin(z)$. Recall HW1 where we used the feature map $\phi(x) = \begin{bmatrix} \sqrt{2}\cos(w_1^T x + b_1) \\ \vdots \\ \sqrt{2}\cos(w_p^T x + b_p) \end{bmatrix}$, with $w_k \sim \mathcal{N}(0, \sigma^{-2} I)$ and $b_k \sim \mathrm{uniform}(0, 2\pi)$. Then $\mathbb{E}\big[\tfrac{1}{p}\phi(x)^T\phi(y)\big] = \frac{1}{p}\sum_{k=1}^p \mathbb{E}\big[2\cos(w_k^T x + b_k)\cos(w_k^T y + b_k)\big] = \mathbb{E}_{w,b}\big[2\cos(w^T x + b)\cos(w^T y + b)\big] = \mathbb{E}_{w,b}\big[\cos(w^T(x+y) + 2b)\big] + \mathbb{E}_{w}\big[\cos(w^T(x-y))\big]$. 2018 Kevin Jamieson 19

RBF kernel and random features. Trig identity: $2\cos(\alpha)\cos(\beta) = \cos(\alpha+\beta) + \cos(\alpha-\beta)$; also recall $e^{jz} = \cos(z) + j\sin(z)$. Recall HW1 where we used the feature map $\phi(x) = \begin{bmatrix} \sqrt{2}\cos(w_1^T x + b_1) \\ \vdots \\ \sqrt{2}\cos(w_p^T x + b_p) \end{bmatrix}$, with $w_k \sim \mathcal{N}(0, \sigma^{-2} I)$ and $b_k \sim \mathrm{uniform}(0, 2\pi)$. Then $\mathbb{E}\big[\tfrac{1}{p}\phi(x)^T\phi(y)\big] = \frac{1}{p}\sum_{k=1}^p \mathbb{E}\big[2\cos(w_k^T x + b_k)\cos(w_k^T y + b_k)\big] = \mathbb{E}_{w,b}\big[2\cos(w^T x + b)\cos(w^T y + b)\big] = \mathbb{E}_{w,b}\big[\cos(w^T(x+y) + 2b)\big] + \mathbb{E}_{w}\big[\cos(w^T(x-y))\big] = 0 + e^{-\|x-y\|_2^2/(2\sigma^2)}$. [Rahimi, Recht NIPS 2007; NIPS Test of Time Award, 2018.] 2018 Kevin Jamieson 20
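A random features sketch in numpy with illustrative names; the sampling distributions follow the reconstruction above with $\sigma = 1$. The averaged feature dot product concentrates around the exact RBF kernel value as p grows.

```python
import numpy as np

def random_fourier_features(X, p, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(p, d))   # w_k ~ N(0, sigma^{-2} I)
    b = rng.uniform(0, 2 * np.pi, size=p)             # b_k ~ uniform(0, 2*pi)
    return np.sqrt(2.0) * np.cos(X @ W.T + b)          # n x p feature matrix

rng = np.random.default_rng(1)
x, y = rng.normal(size=(1, 3)), rng.normal(size=(1, 3))
exact = np.exp(-np.sum((x - y) ** 2) / 2)              # RBF kernel value, sigma = 1

for p in (100, 1000, 10000):
    Phi = random_fourier_features(np.vstack([x, y]), p)   # same W, b for both points
    print(p, exact, float(Phi[0] @ Phi[1]) / p)            # approaches the exact value
```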

RBF Classification. $\hat w, \hat b = \arg\min_{w, b} \sum_{i=1}^n \max\{0, 1 - y_i(b + x_i^T w)\} + \lambda \|w\|_2^2$. Substituting $w = \sum_j \alpha_j x_j$: $\hat\alpha, \hat b = \arg\min_{\alpha, b} \sum_{i=1}^n \max\{0, 1 - y_i(b + \sum_{j=1}^n \alpha_j \langle x_i, x_j\rangle)\} + \lambda \sum_{i,j=1}^n \alpha_i \alpha_j \langle x_i, x_j\rangle$, and the inner products can be replaced by kernel evaluations $K(x_i, x_j)$. 2018 Kevin Jamieson 21
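One simple way to minimize the kernelized objective above, shown as a hedged sketch rather than the method used in lecture: plain full-batch subgradient descent on $\alpha$ and $b$, with a precomputed Gram matrix K and illustrative names.

```python
import numpy as np

def kernel_svm_subgradient(K, y, lam=1e-2, lr=1e-3, iters=2000):
    # minimize sum_i max{0, 1 - y_i (b + (K alpha)_i)} + lam * alpha^T K alpha
    n = K.shape[0]
    alpha, b = np.zeros(n), 0.0
    for _ in range(iters):
        margins = y * (K @ alpha + b)
        active = (margins < 1).astype(float)        # points with positive hinge loss
        # subgradient wrt alpha: -sum_{active i} y_i K[i, :] + 2 lam K alpha
        g_alpha = -K.T @ (active * y) + 2 * lam * (K @ alpha)
        g_b = -np.sum(active * y)
        alpha -= lr * g_alpha
        b -= lr * g_b
    return alpha, b

def kernel_svm_predict(K_test_train, alpha, b):
    # sign of b + sum_j alpha_j K(x_test, x_j)
    return np.sign(K_test_train @ alpha + b)

# usage: K = rbf_kernel(X, X, sigma); alpha, b = kernel_svm_subgradient(K, y)
```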

Wait, infinite dimensions? Isn't everything separable there? How are we not overfitting? Regularization! The relevant capacity measure is the fat-shattering dimension, which scales like $(R/\text{margin})^2$. 2018 Kevin Jamieson 22

String Kernels. Example from Efron and Hastie, 2016. Amino acid sequences of different lengths, x1 and x2. Features: counts of all subsequences of length 3 over the 20 possible amino acids, so $20^3 = 8000$ possible features. 2018 Kevin Jamieson 23
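A minimal string-kernel sketch. It assumes "subsequences of length 3" means contiguous trigrams, which is a simplification on my part, and it uses made-up toy sequences rather than the book's data: represent each sequence by its trigram counts and take the dot product of those (sparse) feature vectors.

```python
from collections import Counter

def trigram_counts(seq):
    # count contiguous length-3 substrings of an amino acid sequence
    return Counter(seq[i:i + 3] for i in range(len(seq) - 2))

def string_kernel(s1, s2):
    # inner product of the (sparse) trigram-count feature vectors
    c1, c2 = trigram_counts(s1), trigram_counts(s2)
    return sum(c1[t] * c2[t] for t in c1 if t in c2)

# two made-up amino acid sequences of different lengths (toy data, not from the book)
x1 = "MKVLAAGVKQLADDRTLLM"
x2 = "GVKQLADDRTLLMAVKQ"
print(string_kernel(x1, x2))
```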

Principal Component Analysis. Machine Learning CSE546, Kevin Jamieson, University of Washington. November 6, 2018. 2018 Kevin Jamieson 24

Linear projections. Given $x_1, \dots, x_n \in \mathbb{R}^d$, for $q \le d$ find a compressed representation with $\eta_1, \dots, \eta_n \in \mathbb{R}^q$ such that $x_i \approx \mu + V_q \eta_i$ and $V_q^T V_q = I$: $\min_{\mu, V_q, \{\eta_i\}} \sum_{i=1}^n \|x_i - \mu - V_q \eta_i\|_2^2$. 2018 Kevin Jamieson 25

Linear projections. Given $x_1, \dots, x_n \in \mathbb{R}^d$, for $q \le d$ find a compressed representation with $\eta_1, \dots, \eta_n \in \mathbb{R}^q$ such that $x_i \approx \mu + V_q \eta_i$ and $V_q^T V_q = I$: $\min_{\mu, V_q, \{\eta_i\}} \sum_{i=1}^n \|x_i - \mu - V_q \eta_i\|_2^2$. Fix $V_q$ and solve for $\mu, \eta_i$: $\mu = \bar x = \frac{1}{n}\sum_{i=1}^n x_i$ and $\eta_i = V_q^T(x_i - \bar x)$. Which gives us $\sum_{i=1}^n \|(x_i - \bar x) - V_q V_q^T (x_i - \bar x)\|_2^2$: $V_q V_q^T$ is a projection matrix, and we seek the basis of size $q$ that minimizes this error. 2018 Kevin Jamieson 26

Linear projections. $\sum_{i=1}^N \|(x_i - \bar x) - V_q V_q^T (x_i - \bar x)\|_2^2$, with $\Sigma := \sum_{i=1}^N (x_i - \bar x)(x_i - \bar x)^T$ and $V_q^T V_q = I_q$. Expanding the square and using $V_q^T V_q = I_q$: $\sum_i \|x_i - \bar x\|_2^2 - 2(x_i - \bar x)^T V_q V_q^T (x_i - \bar x) + (x_i - \bar x)^T V_q V_q^T V_q V_q^T (x_i - \bar x) = \sum_i \|x_i - \bar x\|_2^2 - (x_i - \bar x)^T V_q V_q^T (x_i - \bar x) = \mathrm{Tr}(\Sigma) - \mathrm{Tr}(V_q^T \Sigma V_q)$. 2018 Kevin Jamieson 27

Linear projections. $\sum_{i=1}^N \|(x_i - \bar x) - V_q V_q^T (x_i - \bar x)\|_2^2$, with $\Sigma := \sum_{i=1}^N (x_i - \bar x)(x_i - \bar x)^T$ and $V_q^T V_q = I_q$. $\min_{V_q} \sum_{i=1}^N \|(x_i - \bar x) - V_q V_q^T (x_i - \bar x)\|_2^2 = \min_{V_q} \mathrm{Tr}(\Sigma) - \mathrm{Tr}(V_q^T \Sigma V_q)$. Take the eigenvalue decomposition $\Sigma = U D U^T$ with $D = \mathrm{diag}(d_1, \dots, d_d)$, $d_1 \ge d_2 \ge \dots \ge 0$. Since $\mathrm{Tr}(V_q^T \Sigma V_q) = \sum_{j=1}^q v_j^T \Sigma v_j$, the trace is maximized by taking $V_q = [u_1, \dots, u_q]$, the top $q$ eigenvectors of $\Sigma$. 2018 Kevin Jamieson 28

Linear projections. $\sum_{i=1}^N \|(x_i - \bar x) - V_q V_q^T (x_i - \bar x)\|_2^2$, with $\Sigma := \sum_{i=1}^N (x_i - \bar x)(x_i - \bar x)^T$ and $V_q^T V_q = I_q$. $\min_{V_q} \sum_{i=1}^N \|(x_i - \bar x) - V_q V_q^T (x_i - \bar x)\|_2^2 = \min_{V_q} \mathrm{Tr}(\Sigma) - \mathrm{Tr}(V_q^T \Sigma V_q)$. Eigenvalue decomposition of $\Sigma = U D U^T$: $V_q$ are the first $q$ eigenvectors of $\Sigma$. We minimize reconstruction error and capture the most variance in the data. 2018 Kevin Jamieson 29

Pictures. $V_q$ are the first $q$ eigenvectors of $\Sigma := \sum_{i=1}^N (x_i - \bar x)(x_i - \bar x)^T$; $V_q$ are the first $q$ principal components. [Figure: illustration of the principal component directions of a point cloud.] 2018 Kevin Jamieson 30

Linear projections. Given $x_i \in \mathbb{R}^d$ and some $q < d$, consider $\Sigma := \sum_{i=1}^N (x_i - \bar x)(x_i - \bar x)^T$, where $V_q = [v_1, v_2, \dots, v_q]$ is orthonormal: $V_q^T V_q = I_q$. $V_q$ are the first $q$ eigenvectors of $\Sigma$, i.e., the first $q$ principal components. Principal Component Analysis (PCA) projects $(X - \mathbf{1}\bar x^T)$ down onto $V_q$: $(X - \mathbf{1}\bar x^T)V_q = U_q\,\mathrm{diag}(d_1, \dots, d_q)$ with $U_q^T U_q = I_q$. 2018 Kevin Jamieson 31

Singular Value Decomposition (SVD). Theorem (SVD): Let $A \in \mathbb{R}^{m \times n}$ with rank $r \le \min\{m, n\}$. Then $A = USV^T$, where $S \in \mathbb{R}^{r \times r}$ is diagonal with positive entries, $U^T U = I$, and $V^T V = I$. What are $A^T A v_i$ and $A A^T u_i$? 2018 Kevin Jamieson 32

Singular Value Decomposition (SVD). Theorem (SVD): Let $A \in \mathbb{R}^{m \times n}$ with rank $r \le \min\{m, n\}$. Then $A = USV^T$, where $S \in \mathbb{R}^{r \times r}$ is diagonal with positive entries, $U^T U = I$, and $V^T V = I$. $A^T A v_i = S_{i,i}^2 v_i$ and $A A^T u_i = S_{i,i}^2 u_i$: $V$ are the first $r$ eigenvectors of $A^T A$ and $U$ are the first $r$ eigenvectors of $A A^T$, in both cases with eigenvalues $\mathrm{diag}(S)^2$. 2018 Kevin Jamieson 33
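A quick numpy check of these relations on a random matrix, as a sketch (numpy's svd returns $V^T$ and the singular values as a vector):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

# A^T A v_i = s_i^2 v_i  and  A A^T u_i = s_i^2 u_i
print(np.allclose(A.T @ A @ V, V * s**2))   # True
print(np.allclose(A @ A.T @ U, U * s**2))   # True
```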

Linear projections. Given $x_i \in \mathbb{R}^d$ and some $q < d$, consider $\Sigma := \sum_{i=1}^N (x_i - \bar x)(x_i - \bar x)^T$, where $V_q = [v_1, v_2, \dots, v_q]$ is orthonormal: $V_q^T V_q = I_q$. $V_q$ are the first $q$ eigenvectors of $\Sigma$, i.e., the first $q$ principal components. Principal Component Analysis (PCA) projects $(X - \mathbf{1}\bar x^T)$ down onto $V_q$: $(X - \mathbf{1}\bar x^T)V_q = U_q\,\mathrm{diag}(d_1, \dots, d_q)$ with $U_q^T U_q = I_q$. The Singular Value Decomposition is defined as $X - \mathbf{1}\bar x^T = USV^T$. 2018 Kevin Jamieson 34

Dimensionality reduction. $V_q$ are the first $q$ eigenvectors of $\Sigma$, and the SVD is $X - \mathbf{1}\bar x^T = USV^T$. [Figure: the centered data $X - \mathbf{1}\bar x^T$ plotted with its first two singular directions, axes $U_1$ and $U_2$.] 2018 Kevin Jamieson 35

Dimensionality reduction. $V_q$ are the first $q$ eigenvectors of $\Sigma$, and the SVD is $X - \mathbf{1}\bar x^T = USV^T$. Handwritten 3's, $16 \times 16$ pixel images, so $x_i \in \mathbb{R}^{256}$. $(X - \mathbf{1}\bar x^T)V_2 = U_2 S_2 \in \mathbb{R}^{n \times 2}$. [Figure: the images embedded in the first two principal components, and a plot of the singular values $\mathrm{diag}(S)$.] 2018 Kevin Jamieson 36

Kernel PCA. $V_q$ are the first $q$ eigenvectors of $\Sigma$, and the SVD is $X - \mathbf{1}\bar x^T = USV^T$, so $(X - \mathbf{1}\bar x^T)V_q = U_q S_q \in \mathbb{R}^{n \times q}$. Write $JX = X - \mathbf{1}\bar x^T = USV^T$, where $J = I - \mathbf{1}\mathbf{1}^T/n$ is the centering matrix. What is $(JX)(JX)^T$? 2018 Kevin Jamieson 37

Kernel PCA. $V_q$ are the first $q$ eigenvectors of $\Sigma$, and the SVD is $X - \mathbf{1}\bar x^T = USV^T$, so $(X - \mathbf{1}\bar x^T)V_q = U_q S_q \in \mathbb{R}^{n \times q}$. Write $JX = X - \mathbf{1}\bar x^T = USV^T$, where $J = I - \mathbf{1}\mathbf{1}^T/n$. Then $(JX)(JX)^T = US^2U^T$: the PCA embedding $U_q S_q$ can be computed from the centered matrix of inner products alone, which is what lets us kernelize PCA. 2018 Kevin Jamieson 38
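A kernel PCA sketch based on this observation; it assumes the centered Gram matrix is formed as $JKJ$ with $K_{ij} = K(x_i, x_j)$, and all names are illustrative:

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

def kernel_pca(X, q, sigma=1.0):
    n = len(X)
    K = rbf_kernel(X, X, sigma)
    J = np.eye(n) - np.ones((n, n)) / n           # centering matrix J = I - 11^T/n
    Kc = J @ K @ J                                 # centered Gram matrix, feature-space analog of (JX)(JX)^T
    evals, evecs = np.linalg.eigh(Kc)              # ascending eigenvalues
    idx = np.argsort(evals)[::-1][:q]              # top q
    U_q = evecs[:, idx]
    S_q = np.sqrt(np.maximum(evals[idx], 0.0))     # singular values = sqrt(eigenvalues)
    return U_q * S_q                               # n x q embedding, analogous to U_q S_q

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
Z = kernel_pca(X, q=2, sigma=2.0)
print(Z.shape)   # (50, 2)
```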

PCA Algorithm. 2018 Kevin Jamieson 39
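The slide itself is only a title, so here is a plain PCA sketch via the SVD that is consistent with the preceding slides; variable names are illustrative:

```python
import numpy as np

def pca(X, q):
    # 1. center the data: X - 1 xbar^T
    xbar = X.mean(axis=0)
    Xc = X - xbar
    # 2. SVD of the centered data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # 3. first q principal components and the projected data U_q S_q
    V_q = Vt[:q].T
    scores = Xc @ V_q          # equals U[:, :q] * s[:q]
    return V_q, scores, xbar

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
V_q, scores, xbar = pca(X, q=2)
print(V_q.shape, scores.shape)   # (10, 2) (100, 2)
```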