The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017


Momentum for Principal Component Analysis CS6787 Lecture 3.1 Fall 2017

Principal Component Analysis Setting: find the dominant eigenvalue-eigenvector pair of a positive semidefinite symmetric matrix A.

u_1 = \arg\max_x \frac{x^T A x}{x^T x}

Many ways to write this problem, e.g.

\sqrt{\lambda_1} u_1 = \arg\min_x \| x x^T - A \|_F^2

where \|B\|_F is the Frobenius norm, \|B\|_F^2 = \sum_i \sum_j B_{i,j}^2.
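
As a quick numerical check (illustrative, not from the slides), here is a minimal NumPy sketch that finds the dominant eigenpair of a small PSD matrix and confirms it maximizes the Rayleigh quotient; the matrix A here is made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B @ B.T                               # a random symmetric PSD matrix (illustrative)

# Dominant eigenpair via a dense eigendecomposition (ascending eigenvalue order).
eigvals, eigvecs = np.linalg.eigh(A)
lam1, u1 = eigvals[-1], eigvecs[:, -1]

# The Rayleigh quotient x^T A x / x^T x is maximized at u1, with value lam1.
rayleigh = lambda x: (x @ A @ x) / (x @ x)
print(rayleigh(u1), lam1)                 # these should agree

# The rank-one formulation: sqrt(lam1) * u1 minimizes ||x x^T - A||_F^2.
x_star = np.sqrt(lam1) * u1
print(np.linalg.norm(np.outer(x_star, x_star) - A, "fro"))
```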

PCA: A Non-Convex Problem PCA is not convex in any of these formulations. Why? Think about the solutions to the problem: u_1 and -u_1 are both optimal. Two distinct, isolated solutions means the problem can't be convex. Can we still use momentum to run PCA more quickly?

Power Iteration Before we apply momentum, we need to choose what base algorithm we're using. Simplest algorithm: power iteration. Repeatedly multiply by the matrix A to get an answer:

x_{t+1} = A x_t
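
A minimal power iteration sketch in NumPy (illustrative, not the course's reference code); normalizing each iterate avoids overflow without changing the direction:

```python
import numpy as np

def power_iteration(A, num_iters=100, seed=0):
    """Approximate the dominant eigenvector of a symmetric PSD matrix A."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[0])
    for _ in range(num_iters):
        x = A @ x
        x /= np.linalg.norm(x)   # rescale; only the direction matters
    return x
```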

Why does Power Iteration Work? Let the eigendecomposition of A be

A = \sum_{i=1}^n \lambda_i u_i u_i^T

For \lambda_1 > \lambda_2 \ge \cdots \ge \lambda_n, power iteration converges in direction because the cosine-squared of the angle to u_1 is

\cos^2(\theta_t) = \frac{(u_1^T x_t)^2}{\|x_t\|^2} = \frac{(u_1^T A^t x_0)^2}{\|A^t x_0\|^2} = \frac{\lambda_1^{2t} (u_1^T x_0)^2}{\sum_{i=1}^n \lambda_i^{2t} (u_i^T x_0)^2} = 1 - \frac{\sum_{i=2}^n \lambda_i^{2t} (u_i^T x_0)^2}{\sum_{i=1}^n \lambda_i^{2t} (u_i^T x_0)^2} \ge 1 - \left(\frac{\lambda_2}{\lambda_1}\right)^{2t} \frac{\sum_{i=2}^n (u_i^T x_0)^2}{(u_1^T x_0)^2}

What about a more general algorithm? Use both the current iterate and the history of past iterates:

x_{t+1} = \alpha_t A x_t + \beta_{t,1} x_{t-1} + \beta_{t,2} x_{t-2} + \cdots + \beta_{t,t} x_0

for fixed parameters \alpha and \beta. What class of functions can we express in this form? Notice: x_t is always a degree-t polynomial in A times x_0. Can prove by induction that we can express ANY polynomial.

Power Iteration and Polynomials Can also think of power iteration as applying a degree-t polynomial of A:

x_t = A^t x_0

Is there a better degree-t polynomial to use than f_t(x) = x^t? If we use a different polynomial, then we get

x_t = f_t(A) x_0 = \sum_{i=1}^n f_t(\lambda_i) u_i u_i^T x_0

Ideal solution: choose a polynomial with zeros at all non-dominant eigenvalues. Practically: make f_t(\lambda_1) as large as possible while keeping |f_t(\lambda)| < 1 for all \lambda < \lambda_2.

Chebyshev Polynomials Again It turns out that Chebyshev polynomials solve this problem. Recall: T_0(x) = 1, T_1(x) = x, and

T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x)

Nice property: |x| \le 1 \Rightarrow |T_n(x)| \le 1
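
A tiny sketch (illustrative) of evaluating T_n via this three-term recurrence, matching the plots that follow:

```python
def chebyshev_T(n, x):
    """Evaluate the degree-n Chebyshev polynomial T_n(x) by the recurrence
    T_0(x) = 1, T_1(x) = x, T_{n+1}(x) = 2*x*T_n(x) - T_{n-1}(x)."""
    t_prev, t_curr = 1.0, x
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t_curr = t_curr, 2 * x * t_curr - t_prev
    return t_curr

# On [-1, 1] the values stay bounded by 1; outside that interval they grow quickly.
print(chebyshev_T(2, 0.5))   # 2*(0.5)**2 - 1 = -0.5
print(chebyshev_T(10, 1.1))  # much larger than 1
```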

Chebyshev Polynomials (plots): T_0(u) = 1, T_1(u) = u, T_2(u) = 2u^2 - 1, and several higher-degree Chebyshev polynomials, each plotted on roughly [-1.5, 1.5]; on [-1, 1] they stay within [-1, 1], and outside that interval they grow rapidly.

Chebyshev Polynomials Again It turns out that Chebyshev polynomials solve this problem. Recall: T_0(x) = 1, T_1(x) = x, and T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x). Nice properties: |x| \le 1 \Rightarrow |T_n(x)| \le 1, and T_n grows quickly outside [-1, 1]:

T_n(1 + \epsilon) \ge \frac{(1 + \sqrt{2\epsilon})^n}{2}

Using Chebyshev Polynomials So we can choose our polynomial f in terms of T. Want: f_t(\lambda_1) to be as large as possible, subject to |f_t(\lambda)| \le 1 for all \lambda \le \lambda_2. To make this work, set

f_t(x) = T_t\left(\frac{x}{\lambda_2}\right)

Can do this by running the update

x_{t+1} = \frac{2A}{\lambda_2} x_t - x_{t-1} \quad \Rightarrow \quad x_t = T_t\left(\frac{A}{\lambda_2}\right) x_0
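
A minimal sketch of this Chebyshev-accelerated (momentum) power method, assuming an estimate lam2 of the second eigenvalue is available; the slides don't specify how that estimate is obtained, so treat it as an input here.

```python
import numpy as np

def chebyshev_power_iteration(A, lam2, num_iters=100, seed=0):
    """Momentum power method: x_{t+1} = (2A/lam2) x_t - x_{t-1},
    so that x_t = T_t(A / lam2) x_0 up to a global rescaling."""
    rng = np.random.default_rng(seed)
    x_prev = rng.standard_normal(A.shape[0])
    x_curr = (A @ x_prev) / lam2            # x_1 = T_1(A / lam2) x_0
    for _ in range(num_iters - 1):
        x_prev, x_curr = x_curr, (2.0 / lam2) * (A @ x_curr) - x_prev
        # The recurrence is linear, so rescaling both iterates by the same
        # factor preserves the direction while preventing overflow.
        scale = np.linalg.norm(x_curr)
        x_prev, x_curr = x_prev / scale, x_curr / scale
    return x_curr / np.linalg.norm(x_curr)
```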

Convergence of Momentum PCA

\frac{x_t}{\|x_t\|} = \frac{\sum_{i=1}^n T_t(\lambda_i / \lambda_2) \, u_i u_i^T x_0}{\sqrt{\sum_{i=1}^n T_t(\lambda_i / \lambda_2)^2 \, (u_i^T x_0)^2}}

Cosine-squared of the angle to the dominant component:

\cos^2(\theta_t) = \frac{(u_1^T x_t)^2}{\|x_t\|^2} = \frac{T_t(\lambda_1/\lambda_2)^2 \, (u_1^T x_0)^2}{\sum_{i=1}^n T_t(\lambda_i/\lambda_2)^2 \, (u_i^T x_0)^2}

Convergence of Momentum PCA (continued)

\cos^2(\theta_t) = \frac{(u_1^T x_t)^2}{\|x_t\|^2} = \frac{T_t(\lambda_1/\lambda_2)^2 \, (u_1^T x_0)^2}{\sum_{i=1}^n T_t(\lambda_i/\lambda_2)^2 \, (u_i^T x_0)^2} = 1 - \frac{\sum_{i=2}^n T_t(\lambda_i/\lambda_2)^2 \, (u_i^T x_0)^2}{\sum_{i=1}^n T_t(\lambda_i/\lambda_2)^2 \, (u_i^T x_0)^2} \ge 1 - \frac{\sum_{i=2}^n (u_i^T x_0)^2}{T_t(\lambda_1/\lambda_2)^2 \, (u_1^T x_0)^2}

where the last step uses |T_t(\lambda_i/\lambda_2)| \le 1 for i \ge 2.

Convergence of Momentum PCA (continued)

\cos^2(\theta_t) \ge 1 - O\left(T_t\left(\frac{\lambda_1}{\lambda_2}\right)^{-2}\right) = 1 - O\left(\left(1 + \sqrt{\frac{2(\lambda_1 - \lambda_2)}{\lambda_2}}\right)^{-2t}\right)

Recall that standard power iteration had

\cos^2(\theta_t) \ge 1 - O\left(\left(\frac{\lambda_2}{\lambda_1}\right)^{2t}\right) = 1 - O\left(\left(1 + \frac{\lambda_1 - \lambda_2}{\lambda_2}\right)^{-2t}\right)

So the momentum rate is asymptotically faster than power iteration: the eigenvalue gap enters the rate through its square root.

Questions?

The Kernel Trick, Gram Matrices, and Feature Extraction CS6787 Lecture 4 Fall 2017

Basic Linear Models For classification using a model vector w:

\text{output} = \operatorname{sign}(w^T x)

Optimization methods vary; here's logistic regression:

\text{minimize}_w \; \frac{1}{n} \sum_{i=1}^n \log\left(1 + \exp(-w^T x_i y_i)\right), \qquad y_i \in \{-1, 1\}
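
A minimal sketch (illustrative, not the course's reference code) of this objective and one SGD step on it, with labels in {-1, +1}:

```python
import numpy as np

def logistic_loss(w, X, y):
    """(1/n) * sum_i log(1 + exp(-y_i * w^T x_i)) for y_i in {-1, +1}."""
    margins = y * (X @ w)
    return np.mean(np.log1p(np.exp(-margins)))

def sgd_step(w, x_i, y_i, lr=0.1):
    """One stochastic gradient step on a single example (x_i, y_i)."""
    margin = y_i * (x_i @ w)
    grad = -y_i * x_i / (1.0 + np.exp(margin))   # gradient of log(1 + exp(-margin))
    return w - lr * grad
```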

Benefits of Linear Models Fast classification: just one dot product. Fast training/learning: just a few basic linear algebra operations. Drawback: limited expressivity. Can only capture linear classification boundaries, which is bad for many problems. How do we let linear models represent a broader class of decision boundaries, while retaining the systems benefits?

The Kernel Method Idea: in a linear model, we can think about the similarity between two training examples x and y as being x^T y. This is related to the rate at which a random classifier will separate x and y. Kernel methods replace this dot-product similarity with an arbitrary kernel function that computes the similarity between x and y:

K(x, y): \mathcal{X} \times \mathcal{X} \to \mathbb{R}

Kernel Properties What properties do kernels need to have to be useful for learning? Key property: the kernel must be symmetric,

K(x, y) = K(y, x)

Key property: the kernel must be positive semi-definite,

\forall c_i \in \mathbb{R}, \, x_i \in \mathcal{X}: \quad \sum_{i=1}^n \sum_{j=1}^n c_i c_j K(x_i, x_j) \ge 0

Can check that the dot product has this property.

Facts about Positive Semidefinite Kernels Sum of two PSD kernels is a PSD kernel: K(x, y) = K_1(x, y) + K_2(x, y) is a PSD kernel. Product of two PSD kernels is a PSD kernel: K(x, y) = K_1(x, y) K_2(x, y) is a PSD kernel. Scaling by any function on both sides is a kernel: K(x, y) = f(x) K_1(x, y) f(y) is a PSD kernel.

Other Kernel Properties Useful property: kernels are often non-negative, K(x, y) \ge 0. Useful property: kernels are often scaled such that K(x, y) \le 1, and K(x, y) = 1 \Leftrightarrow x = y. These properties capture the idea that the kernel is expressing the similarity between x and y.

Common Kernels Gaussian kernel (a.k.a. RBF kernel): the de facto kernel in machine learning,

K(x, y) = \exp\left(-\|x - y\|^2\right)

We can validate that this is a kernel. Symmetric? Positive semi-definite? (Why?) Non-negative? Scaled so that K(x, x) = 1?

Common Kernels (continued) Linear kernel: just the inner product, K(x, y) = x^T y. Polynomial kernel: K(x, y) = (1 + x^T y)^p. Laplacian kernel: K(x, y) = \exp(-\|x - y\|_1). Last layer of a neural network: if the last layer outputs \phi(x), then the kernel is K(x, y) = \phi(x)^T \phi(y).
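
For concreteness, here are these kernels written out in NumPy (a sketch; the fixed polynomial degree p=3 is an arbitrary choice):

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y

def polynomial_kernel(x, y, p=3):
    return (1.0 + x @ y) ** p

def gaussian_kernel(x, y):
    return np.exp(-np.sum((x - y) ** 2))

def laplacian_kernel(x, y):
    return np.exp(-np.sum(np.abs(x - y)))
```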

Classifying with Kernels An equivalent way of writing a linear model on a training set is

\text{output}(x) = \operatorname{sign}\left(\left(\sum_{i=1}^n w_i x_i\right)^T x\right)

We can kernelize this by replacing the dot products with kernel evaluations:

\text{output}(x) = \operatorname{sign}\left(\sum_{i=1}^n w_i K(x_i, x)\right)

Learning with Kernels An equivalent way of writing linear-model logistic regression is

\text{minimize}_w \; \frac{1}{n} \sum_{i=1}^n \log\left(1 + \exp\left(-\left(\sum_{j=1}^n w_j x_j\right)^T x_i y_i\right)\right)

We can kernelize this by replacing the dot products with kernel evaluations:

\text{minimize}_w \; \frac{1}{n} \sum_{i=1}^n \log\left(1 + \exp\left(-\sum_{j=1}^n w_j y_i K(x_j, x_i)\right)\right)

The Computational Cost of Kernels Recall: a benefit of learning with kernels is that we can express a wider class of classification functions. Recall: another benefit is that linear classifier learning problems are easy to solve because they are convex, and gradients are easy to compute. Major cost of learning naively with kernels: we have to evaluate K(x, y). For SGD, we need to do this effectively n times per update. Computationally intractable unless K is very simple.

The Gram Matrix Address this computational problem by pre-computing the kernel function for all pairs of training examples in the dataset:

G_{i,j} = K(x_i, x_j)

This transforms the learning problem into

\text{minimize}_w \; \frac{1}{n} \sum_{i=1}^n \log\left(1 + \exp\left(-y_i e_i^T G w\right)\right)

This is much easier than recomputing the kernel at each iteration.
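
A short sketch (illustrative) of precomputing the Gram matrix and evaluating this objective with it; X is an n-by-d array of training examples and y an n-vector of ±1 labels. Storing G takes O(n^2) memory, which is the concern raised on the next slide.

```python
import numpy as np

def gram_matrix(X, kernel):
    """Precompute G[i, j] = kernel(x_i, x_j) for all pairs of training examples."""
    n = X.shape[0]
    G = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            G[i, j] = kernel(X[i], X[j])
    return G

def kernelized_logistic_loss(w, G, y):
    """(1/n) * sum_i log(1 + exp(-y_i * (G w)_i)); note (G w)_i = e_i^T G w."""
    margins = y * (G @ w)
    return np.mean(np.log1p(np.exp(-margins)))
```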

Problems with the Gram Matrix Suppose we have n examples in our training set. How much memory is required to store the Gram matrix G? What is the cost of taking the product G i w to compute a gradient? What happens if we have one hundred million training examples?

Feature Extraction Simple case: let's imagine that \mathcal{X} is a finite set \{1, 2, \ldots, k\}. We can define our kernel as a matrix M \in \mathbb{R}^{k \times k} with

M_{i,j} = K(i, j)

Since M is positive semidefinite, it has a square root U with U^T U = M, i.e.

\sum_{l=1}^k U_{l,i} U_{l,j} = M_{i,j} = K(i, j)

Feature Extraction (continued) So if we define a feature mapping \phi(i) = U e_i, then

\phi(i)^T \phi(j) = \sum_{l=1}^k U_{l,i} U_{l,j} = M_{i,j} = K(i, j)

The kernel is equivalent to a dot product in some feature space. In fact, this is true for all kernels, not just finite ones.
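
A quick sketch (illustrative) of extracting such a feature map from a kernel matrix, using an eigendecomposition as the square root:

```python
import numpy as np

def feature_map_from_kernel_matrix(M):
    """Given a PSD kernel matrix M, return U with U.T @ U = M,
    so that phi(i) = U[:, i] satisfies phi(i) . phi(j) = M[i, j]."""
    eigvals, eigvecs = np.linalg.eigh(M)
    eigvals = np.clip(eigvals, 0.0, None)          # guard against tiny negative round-off
    U = np.diag(np.sqrt(eigvals)) @ eigvecs.T      # U^T U = V diag(lam) V^T = M
    return U

# Usage: phi_i = U[:, i]; then phi_i @ phi_j reproduces K(i, j).
```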

Classifying with feature maps Suppose that we can find a finite-dimensional feature map that satisfies \phi(i)^T \phi(j) = K(i, j). Then we can simplify our classifier to

\text{output}(x) = \operatorname{sign}\left(\sum_{i=1}^n w_i K(x_i, x)\right) = \operatorname{sign}\left(\sum_{i=1}^n w_i \phi(x_i)^T \phi(x)\right) = \operatorname{sign}\left(u^T \phi(x)\right)

where u = \sum_{i=1}^n w_i \phi(x_i).

Learning with feature maps Similarly, we can simplify our learning objective to

\text{minimize}_u \; \frac{1}{n} \sum_{i=1}^n \log\left(1 + \exp\left(-u^T \phi(x_i) y_i\right)\right)

Take-away: this is just transforming the input data, then running a linear classifier in the transformed space! Computationally: super efficient, as long as we can transform and store the input data in an efficient way.

Problems with Feature Maps The dimension of the transformed data may be much larger than the dimension of the original data. Suppose that the feature map is \phi: \mathbb{R}^d \to \mathbb{R}^D and there are n examples. How much memory is needed to store the transformed features? What is the cost of taking the product u^T \phi(x_i) to compute a gradient?

Feature Maps vs. Gram Matrices Systems trade-offs exist here. When the number of examples gets very large, feature maps are better. When the transformed feature vectors have high dimensionality, Gram matrices are better.

Another Problem with Feature Maps Recall: I said there was always a feature map for any kernel such that \phi(i)^T \phi(j) = K(i, j). But this feature map is not always finite-dimensional. For example, the Gaussian/RBF kernel has an infinite-dimensional feature map, and many kernels we care about in ML have this property. What do we do if \phi has infinite dimensions? We can't just compute with it normally!

Solution: Approximate Feature Maps Find a finite-dimensional feature map so that

K(x, y) \approx \phi(x)^T \phi(y)

Typically, we want to find a family of feature maps \phi_D: \mathbb{R}^d \to \mathbb{R}^D such that

\lim_{D \to \infty} \phi_D(x)^T \phi_D(y) = K(x, y)

Types of Approximate Feature Maps Deterministic feature maps: choose a fixed, a priori method of approximating the kernel. Generally not very popular because of the way they scale with dimension. Random feature maps: choose a feature map at random (typically each feature is independent) such that

\mathbb{E}\left[\phi(x)^T \phi(y)\right] = K(x, y)

Then prove that, with high probability, over some region of interest

\left|\phi(x)^T \phi(y) - K(x, y)\right| \le \epsilon
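
One standard instance of a random feature map (not spelled out on the slides) is random Fourier features for the Gaussian/RBF kernel; a minimal sketch, assuming the specific kernel K(x, y) = exp(-||x - y||^2 / 2):

```python
import numpy as np

def random_fourier_features(X, D, seed=0):
    """Random Fourier features approximating K(x, y) = exp(-||x - y||^2 / 2):
    phi(x) = sqrt(2/D) * cos(W x + b), with rows of W drawn from N(0, I)
    and b drawn from Uniform[0, 2*pi). Then E[phi(x) . phi(y)] = K(x, y),
    and the approximation error shrinks as D grows."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((D, d))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

# Usage sketch: Phi = random_fourier_features(X, D=2000);
# then Phi[i] @ Phi[j] approximates exp(-||X[i] - X[j]||^2 / 2).
```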

Types of Approximate Features (continued) Orthogonal randomized feature maps. Intuition: if we have a feature map where, for some i and j, e_i^T \phi(x) \approx e_j^T \phi(x), then we can't actually learn much from having both features. Strategy: choose the feature map at random, but subject to the constraint that the features be orthogonal in some way. Quasi-random feature maps: generate features using a low-discrepancy sequence rather than true randomness.

Adaptive Feature Maps Everything before this didn't take the data into account. Adaptive feature maps look at the actual training set and try to minimize the kernel approximation error, using the training set as a guide. For example: we can construct a random feature map, and then fine-tune the randomness to minimize the empirical approximation error over the training set. Gaining in popularity. Also, neural networks can be thought of as adaptive feature maps.

Systems Tradeoffs Lots of tradeoffs here Do we spend more work up-front constructing a more sophisticated approximation, to save work on learning algorithms? Would we rather scale with the data, or scale to more complicated problems? Another task for metaparameter optimization

Questions Upcoming things: Paper 2 review due tonight. Paper 3 in class on Wednesday. Start thinking about the class project; it will come faster than you think!