Support Vector Machines and Kernel Methods

2018 CS420 Machine Learning, Lecture 3. Handout from Prof. Andrew Ng: http://cs229.stanford.edu/notes/cs229-notes3.pdf
Support Vector Machines and Kernel Methods
Weinan Zhang, Shanghai Jiao Tong University
http://wnzhang.net
http://wnzhang.net/teaching/cs420/index.html

Linear Classification
For linearly separable cases, there are multiple valid decision boundaries.

Linear Classification
For linearly separable cases, there are multiple valid decision boundaries. Some separators can be ruled out by considering data noise.

Linear Classification
For linearly separable cases, there are multiple valid decision boundaries. The intuitively optimal decision boundary is the one with the largest margin.

Review: Logistic Regression
Logistic regression is a binary classification model:
$$p_\theta(y=1\mid x) = \sigma(\theta^\top x) = \frac{e^{\theta^\top x}}{1+e^{\theta^\top x}}, \qquad p_\theta(y=0\mid x) = \frac{1}{1+e^{\theta^\top x}}$$
Cross-entropy loss function:
$$\mathcal{L}(y, x; p_\theta) = -y\log\sigma(\theta^\top x) - (1-y)\log\big(1-\sigma(\theta^\top x)\big)$$
Gradient, using $\frac{\partial\sigma(z)}{\partial z} = \sigma(z)(1-\sigma(z))$ with $z = \theta^\top x$:
$$\frac{\partial\mathcal{L}(y, x; p_\theta)}{\partial\theta} = -\frac{y}{\sigma(z)}\sigma(z)(1-\sigma(z))\,x + \frac{1-y}{1-\sigma(z)}\sigma(z)(1-\sigma(z))\,x = \big(\sigma(\theta^\top x) - y\big)x$$
Update rule: $\theta \leftarrow \theta + \eta\,\big(y - \sigma(\theta^\top x)\big)x$
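
As a side illustration (not from the slides), a minimal numpy sketch of the stochastic gradient update derived above; the learning rate and toy data are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_sgd_step(theta, x, y, lr=0.1):
    """One stochastic gradient step: theta <- theta + lr * (y - sigma(theta^T x)) * x."""
    return theta + lr * (y - sigmoid(theta @ x)) * x

# toy usage: one 2-D example with label 1
theta = np.zeros(2)
theta = logistic_sgd_step(theta, np.array([1.0, -2.0]), 1.0)
print(theta)
```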

Label Decision
Logistic regression provides the probability
$$p_\theta(y=1\mid x) = \sigma(\theta^\top x) = \frac{e^{\theta^\top x}}{1+e^{\theta^\top x}}, \qquad p_\theta(y=0\mid x) = \frac{1}{1+e^{\theta^\top x}}$$
The final label of an instance is decided by setting a threshold $h$:
$$\hat{y} = \begin{cases} 1, & p_\theta(y=1\mid x) > h \\ 0, & \text{otherwise} \end{cases}$$

Logistic Regression Scores
$$s(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2, \qquad p_\theta(y=1\mid x) = \frac{1}{1+e^{-s(x)}}$$
For example points A, B, C in the $(x_1, x_2)$ plane:
$$s(x) = \theta_0 + \theta_1 x_1^{(A)} + \theta_2 x_2^{(A)}, \quad s(x) = \theta_0 + \theta_1 x_1^{(B)} + \theta_2 x_2^{(B)}, \quad s(x) = \theta_0 + \theta_1 x_1^{(C)} + \theta_2 x_2^{(C)}$$
The decision boundary is $s(x) = 0$. The higher the score, the larger the distance to the decision boundary and the higher the confidence. (Example from Andrew Ng)

Linear Classification
The intuitively optimal decision boundary: the one with the highest confidence.

Notations for SVMs
Feature vector $x$; class label $y \in \{-1, +1\}$.
Parameters: intercept $b$, feature weight vector $w$.
Label prediction: $h_{w,b}(x) = g(w^\top x + b)$ with
$$g(z) = \begin{cases} +1, & z \ge 0 \\ -1, & \text{otherwise} \end{cases}$$

Logistic Regression Scores
With the SVM notation:
$$s(x) = b + w_1 x_1 + w_2 x_2, \qquad p_\theta(y=1\mid x) = \frac{1}{1+e^{-s(x)}}$$
For example points A, B, C: $s(x) = b + w_1 x_1^{(A)} + w_2 x_2^{(A)}$, and similarly for B and C. The decision boundary is $s(x) = 0$. The higher the score, the larger the distance to the separating hyperplane and the higher the confidence. (Example from Andrew Ng)

Margins
Functional margin: $\hat{\gamma}^{(i)} = y^{(i)}(w^\top x^{(i)} + b)$.
Note that the separating hyperplane does not change with the magnitude of $(w, b)$: $g(w^\top x + b) = g(2w^\top x + 2b)$.
Geometric margin: $\gamma^{(i)} = y^{(i)}(w^\top x^{(i)} + b)$ where $\|w\|_2 = 1$.

Margins
The point $x^{(i)} - \gamma^{(i)} y^{(i)} \frac{w}{\|w\|}$ lies on the decision boundary:
$$w^\top\Big(x^{(i)} - \gamma^{(i)} y^{(i)}\frac{w}{\|w\|}\Big) + b = 0 \;\Rightarrow\; \gamma^{(i)} = y^{(i)}\frac{w^\top x^{(i)} + b}{\|w\|} = y^{(i)}\Big(\Big(\frac{w}{\|w\|}\Big)^\top x^{(i)} + \frac{b}{\|w\|}\Big)$$
Given a training set $S = \{(x^{(i)}, y^{(i)})\}_{i=1,\dots,m}$, the smallest geometric margin is $\gamma = \min_{i=1,\dots,m}\gamma^{(i)}$.

Objective of an SVM
Find a separating hyperplane that maximizes the minimum geometric margin:
$$\max_{\gamma, w, b}\; \gamma \quad \text{s.t.}\; y^{(i)}(w^\top x^{(i)} + b) \ge \gamma,\; i=1,\dots,m; \quad \|w\| = 1 \;\text{(non-convex constraint)}$$
This is equivalent to maximizing the normalized functional margin:
$$\max_{\hat{\gamma}, w, b}\; \frac{\hat{\gamma}}{\|w\|} \;\text{(non-convex objective)} \quad \text{s.t.}\; y^{(i)}(w^\top x^{(i)} + b) \ge \hat{\gamma},\; i=1,\dots,m$$

Objective of an SVM
The functional margin scales with $(w, b)$ without changing the decision boundary, so we can fix the functional margin at $\hat{\gamma} = 1$. The objective becomes
$$\max_{w,b}\; \frac{1}{\|w\|} \quad \text{s.t.}\; y^{(i)}(w^\top x^{(i)} + b) \ge 1,\; i=1,\dots,m$$
which is equivalent to
$$\min_{w,b}\; \frac{1}{2}\|w\|^2 \quad \text{s.t.}\; y^{(i)}(w^\top x^{(i)} + b) \ge 1,\; i=1,\dots,m$$
This optimization problem can be efficiently solved by quadratic programming.
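
As an illustration only (the lecture itself defers to quadratic programming and later SMO), here is a hedged sketch that feeds this primal problem to scipy's general-purpose constrained optimizer on a tiny made-up separable dataset; the data and tolerances are arbitrary assumptions:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hypothetical): two points per class in R^2.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(params):
    w = params[:2]
    return 0.5 * np.dot(w, w)            # (1/2) ||w||^2

constraints = [
    # y_i (w^T x_i + b) - 1 >= 0 for every training example
    {"type": "ineq", "fun": (lambda p, i=i: y[i] * (p[:2] @ X[i] + p[2]) - 1.0)}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w_opt, b_opt = res.x[:2], res.x[2]
print("w* =", w_opt, "b* =", b_opt)
```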

A Digression on Lagrange Duality in Convex Optimization
Reference: Boyd, Stephen, and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Lagrangian for Convex Optimization
A convex optimization problem with equality constraints:
$$\min_w\; f(w) \quad \text{s.t.}\; h_i(w) = 0,\; i=1,\dots,l$$
The Lagrangian of this problem is defined as
$$L(w, \beta) = f(w) + \sum_{i=1}^{l}\beta_i h_i(w)$$
where the $\beta_i$ are Lagrange multipliers. Solving
$$\frac{\partial L(w,\beta)}{\partial w} = 0, \qquad \frac{\partial L(w,\beta)}{\partial \beta_i} = 0$$
yields the solution of the original optimization problem.

Lagrangian for Convex Optimization
(Illustration: contour lines $f(w) = 3, 4, 5$ in the $(w_1, w_2)$ plane and the constraint curve $h(w) = 0$.)
$$L(w, \beta) = f(w) + \beta h(w), \qquad \frac{\partial L(w,\beta)}{\partial w} = \frac{\partial f(w)}{\partial w} + \beta\frac{\partial h(w)}{\partial w} = 0$$
i.e., at the optimum the two gradients point along the same direction.

With Inequality Constraints
A convex optimization problem:
$$\min_w\; f(w) \quad \text{s.t.}\; g_i(w) \le 0,\; i=1,\dots,k; \quad h_i(w) = 0,\; i=1,\dots,l$$
The Lagrangian of this problem is defined as
$$L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k}\alpha_i g_i(w) + \sum_{i=1}^{l}\beta_i h_i(w)$$
where the $\alpha_i$ and $\beta_i$ are Lagrange multipliers.

Primal Problem
For the convex optimization problem
$$\min_w\; f(w) \quad \text{s.t.}\; g_i(w) \le 0,\; i=1,\dots,k; \quad h_i(w) = 0,\; i=1,\dots,l$$
with Lagrangian $L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k}\alpha_i g_i(w) + \sum_{i=1}^{l}\beta_i h_i(w)$, define
$$\theta_P(w) = \max_{\alpha, \beta:\,\alpha_i \ge 0} L(w, \alpha, \beta)$$
If a given $w$ violates any constraint, i.e., $g_i(w) > 0$ or $h_i(w) \ne 0$, then $\theta_P(w) = +\infty$.

Primal Problem
Conversely, if all constraints are satisfied for a given $w$, then $\theta_P(w) = f(w)$. Hence
$$\theta_P(w) = \begin{cases} f(w) & \text{if } w \text{ satisfies the primal constraints} \\ +\infty & \text{otherwise} \end{cases}$$

Primal Problem
$$\theta_P(w) = \begin{cases} f(w) & \text{if } w \text{ satisfies the primal constraints} \\ +\infty & \text{otherwise} \end{cases}$$
The minimization problem
$$\min_w \theta_P(w) = \min_w \max_{\alpha,\beta:\,\alpha_i \ge 0} L(w, \alpha, \beta)$$
is therefore the same as the original problem
$$\min_w f(w) \quad \text{s.t.}\; g_i(w) \le 0,\; i=1,\dots,k; \quad h_i(w) = 0,\; i=1,\dots,l$$
Define the value of the primal problem as $p^* = \min_w \theta_P(w)$.

Dual Problem
Consider a slightly different problem, with min and max exchanged compared to the primal problem $\min_w \theta_P(w) = \min_w \max_{\alpha,\beta:\,\alpha_i \ge 0} L(w, \alpha, \beta)$. Define
$$\theta_D(\alpha, \beta) = \min_w L(w, \alpha, \beta)$$
and the dual optimization problem
$$\max_{\alpha,\beta:\,\alpha_i \ge 0}\theta_D(\alpha, \beta) = \max_{\alpha,\beta:\,\alpha_i \ge 0}\min_w L(w, \alpha, \beta)$$
Define the value of the dual problem as $d^* = \max_{\alpha,\beta:\,\alpha_i \ge 0}\min_w L(w, \alpha, \beta)$.

Primal Problem vs. Dual Problem
$$d^* = \max_{\alpha,\beta:\,\alpha_i \ge 0}\min_w L(w, \alpha, \beta) \;\le\; \min_w \max_{\alpha,\beta:\,\alpha_i \ge 0} L(w, \alpha, \beta) = p^*$$
Proof:
$$\min_w L(w, \alpha, \beta) \le L(w, \alpha, \beta), \quad \forall w,\; \forall \alpha \ge 0,\, \beta$$
$$\Rightarrow\; \max_{\alpha,\beta:\,\alpha \ge 0}\min_w L(w, \alpha, \beta) \le \max_{\alpha,\beta:\,\alpha \ge 0} L(w, \alpha, \beta), \quad \forall w$$
$$\Rightarrow\; \max_{\alpha,\beta:\,\alpha \ge 0}\min_w L(w, \alpha, \beta) \le \min_w \max_{\alpha,\beta:\,\alpha \ge 0} L(w, \alpha, \beta)$$
But under certain conditions, $d^* = p^*$.

Karush-Kuhn-Tucker (KKT) Conditions
If $f$ and the $g_i$ are convex, the $h_i$ are affine, and the $g_i$ are all strictly feasible, then there must exist $w^*$, $\alpha^*$, $\beta^*$ such that $w^*$ is the solution of the primal problem, $\alpha^*, \beta^*$ are the solutions of the dual problem, and the values of the two problems are equal. Moreover, $w^*$, $\alpha^*$, $\beta^*$ satisfy the KKT conditions, including the KKT dual complementarity condition. Conversely, if some $w^*$, $\alpha^*$, $\beta^*$ satisfy the KKT conditions, then they are also a solution to the primal and dual problems. For more details please refer to Boyd and Vandenberghe, Convex Optimization, 2004.
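
The slide references but does not transcribe the conditions themselves; for reference, a standard statement in the notation above (as given, e.g., in the CS229 notes this lecture draws on) is:
$$\begin{aligned}
\frac{\partial}{\partial w_i} L(w^*, \alpha^*, \beta^*) &= 0, \quad i=1,\dots,d \\
\frac{\partial}{\partial \beta_i} L(w^*, \alpha^*, \beta^*) &= 0, \quad i=1,\dots,l \\
\alpha_i^*\, g_i(w^*) &= 0, \quad i=1,\dots,k \quad \text{(dual complementarity)} \\
g_i(w^*) &\le 0, \quad i=1,\dots,k \\
\alpha_i^* &\ge 0, \quad i=1,\dots,k
\end{aligned}$$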

Now Back to SVM Problem

Objective of an SVM
SVM objective: finding the optimal margin classifier
$$\min_{w,b}\; \frac{1}{2}\|w\|^2 \quad \text{s.t.}\; y^{(i)}(w^\top x^{(i)} + b) \ge 1,\; i=1,\dots,m$$
Re-write the constraints as
$$g_i(w) = -y^{(i)}(w^\top x^{(i)} + b) + 1 \le 0$$
so as to match the standard optimization form
$$\min_w f(w) \quad \text{s.t.}\; g_i(w) \le 0,\; i=1,\dots,k; \quad h_i(w) = 0,\; i=1,\dots,l$$

Equality Cases
$$g_i(w) = -y^{(i)}(w^\top x^{(i)} + b) + 1 = 0 \;\Rightarrow\; y^{(i)}\Big(\Big(\frac{w}{\|w\|}\Big)^\top x^{(i)} + \frac{b}{\|w\|}\Big) = \frac{1}{\|w\|}$$
The left-hand side is the geometric margin. The $g_i(w) = 0$ cases correspond to the training examples that have functional margin exactly equal to 1.

Objective of an SVM
SVM objective: finding the optimal margin classifier
$$\min_{w,b}\; \frac{1}{2}\|w\|^2 \quad \text{s.t.}\; -y^{(i)}(w^\top x^{(i)} + b) + 1 \le 0,\; i=1,\dots,m$$
Lagrangian:
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m}\alpha_i\big[y^{(i)}(w^\top x^{(i)} + b) - 1\big]$$
There are no $\beta$ terms because the SVM problem has no equality constraints.

Solving
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m}\alpha_i\big[y^{(i)}(w^\top x^{(i)} + b) - 1\big]$$
Derivatives:
$$\frac{\partial}{\partial w}L(w, b, \alpha) = w - \sum_{i=1}^{m}\alpha_i y^{(i)} x^{(i)} = 0 \;\Rightarrow\; w = \sum_{i=1}^{m}\alpha_i y^{(i)} x^{(i)}, \qquad \frac{\partial}{\partial b}L(w, b, \alpha) = -\sum_{i=1}^{m}\alpha_i y^{(i)} = 0$$
Substituting back, the Lagrangian is re-written as
$$L(w, b, \alpha) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y^{(i)} y^{(j)}\alpha_i\alpha_j\, x^{(i)\top} x^{(j)} - b\sum_{i=1}^{m}\alpha_i y^{(i)}$$
where the last term is zero.

Solving α*
Dual problem:
$$\max_{\alpha \ge 0}\;\theta_D(\alpha) = \max_{\alpha \ge 0}\min_{w,b} L(w, b, \alpha)$$
$$\max_{\alpha}\; W(\alpha) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y^{(i)} y^{(j)}\alpha_i\alpha_j\, x^{(i)\top} x^{(j)} \quad \text{s.t.}\; \alpha_i \ge 0,\; i=1,\dots,m; \quad \sum_{i=1}^{m}\alpha_i y^{(i)} = 0$$
α* is solved with some method, e.g., SMO. We will get back to this solution later.

Solving w* and b*
With α* solved, w* is obtained by
$$w^* = \sum_{i=1}^{m}\alpha_i^* y^{(i)} x^{(i)}$$
where only the support vectors have $\alpha_i^* > 0$. With w* solved, b* is obtained by
$$b^* = -\frac{\max_{i:\,y^{(i)}=-1} w^{*\top} x^{(i)} + \min_{i:\,y^{(i)}=1} w^{*\top} x^{(i)}}{2}$$

Predicting Values
With the solutions w* and b*, the predicted value (i.e., the functional margin) of each instance is
$$w^{*\top} x + b^* = \Big(\sum_{i=1}^{m}\alpha_i^* y^{(i)} x^{(i)}\Big)^\top x + b^* = \sum_{i=1}^{m}\alpha_i^* y^{(i)}\langle x^{(i)}, x\rangle + b^*$$
We only need to calculate the inner product of $x$ with the support vectors (those with $\alpha_i^* > 0$).
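
A minimal sketch (not from the slides) of how w*, b*, and the decision value could be computed once a dual vector α* is available from some external solver such as SMO; the helper names and the tolerance are illustrative assumptions:

```python
import numpy as np

def recover_w_b(X_train, y_train, alpha):
    """Recover w* and b* from a solved dual alpha (hard-margin case)."""
    w = (alpha * y_train) @ X_train
    b = -(np.max(X_train[y_train == -1] @ w) + np.min(X_train[y_train == 1] @ w)) / 2.0
    return w, b

def svm_decision_value(x, X_train, y_train, alpha, b):
    """Decision value w^T x + b written purely through inner products with
    the training points; only alpha_i > 0 (support vectors) contribute."""
    sv = alpha > 1e-8                      # indices of support vectors
    return np.sum(alpha[sv] * y_train[sv] * (X_train[sv] @ x)) + b
```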

Non-Separable Cases The derivation of the SVM as presented so far assumes that the data is linearly separable. More practical cases are linearly non-separable.

Dealing with Non-Separable Cases
Add slack variables $\xi_i$ (an L1 penalty on the margin violations):
$$\min_{w,b,\xi}\; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i \quad \text{s.t.}\; y^{(i)}(w^\top x^{(i)} + b) \ge 1 - \xi_i,\; \xi_i \ge 0,\; i=1,\dots,m$$
Lagrangian:
$$L(w, b, \xi, \alpha, r) = \frac{1}{2}w^\top w + C\sum_{i=1}^{m}\xi_i - \sum_{i=1}^{m}\alpha_i\big[y^{(i)}(x^{(i)\top} w + b) - 1 + \xi_i\big] - \sum_{i=1}^{m} r_i\xi_i$$
Dual problem:
$$\max_{\alpha}\; W(\alpha) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y^{(i)} y^{(j)}\alpha_i\alpha_j\, x^{(i)\top} x^{(j)} \quad \text{s.t.}\; 0 \le \alpha_i \le C,\; i=1,\dots,m; \quad \sum_{i=1}^{m}\alpha_i y^{(i)} = 0$$
Surprisingly, the box constraint $0 \le \alpha_i \le C$ is the only change to the dual. It is efficiently solved by the SMO algorithm.

SVM Hinge Loss vs. LR Loss
SVM hinge loss:
$$\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\max\big(0,\, 1 - y_i(w^\top x_i + b)\big)$$
LR log loss:
$$-y_i\log\sigma(w^\top x_i + b) - (1-y_i)\log\big(1-\sigma(w^\top x_i + b)\big)$$
(The figure compares the two losses as functions of the score when $y = 1$.)
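
For a concrete feel of the two losses, a small sketch (my own, not from the slides) evaluating them per example as functions of the raw score $s = w^\top x + b$; the score grid is arbitrary:

```python
import numpy as np

def hinge_loss(s, y):
    """Per-example hinge loss max(0, 1 - y*s), with y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * s)

def log_loss(s, y01):
    """Per-example logistic (cross-entropy) loss, with y01 in {0, 1}."""
    sigma = 1.0 / (1.0 + np.exp(-s))
    return -y01 * np.log(sigma) - (1.0 - y01) * np.log(1.0 - sigma)

scores = np.linspace(-2.0, 3.0, 6)
print(hinge_loss(scores, y=1))       # exactly zero once the margin y*s >= 1
print(log_loss(scores, y01=1))       # strictly positive but decaying
```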

Coordinate Ascent (Descent)
For the optimization problem
$$\max_{\alpha}\; W(\alpha_1, \alpha_2, \dots, \alpha_m)$$
the coordinate ascent algorithm is:
Loop until convergence: {
  For $i = 1, \dots, m$ {
    $\alpha_i := \arg\max_{\hat{\alpha}_i} W(\alpha_1, \dots, \alpha_{i-1}, \hat{\alpha}_i, \alpha_{i+1}, \dots, \alpha_m)$
  }
}
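
A minimal sketch of coordinate ascent on a toy concave quadratic $W(\alpha) = -\frac{1}{2}\alpha^\top A\alpha + b^\top\alpha$ (an illustrative example, not the SVM dual), where each one-dimensional maximization has a closed form:

```python
import numpy as np

def coordinate_ascent(A, b, n_iters=100):
    """Maximize W(alpha) = -0.5 * alpha^T A alpha + b^T alpha (A symmetric PD)
    by cycling through coordinates; each 1-D maximization is closed-form."""
    m = len(b)
    alpha = np.zeros(m)
    for _ in range(n_iters):
        for i in range(m):
            # solve dW/dalpha_i = 0 with all other coordinates held fixed
            alpha[i] = (b[i] - A[i] @ alpha + A[i, i] * alpha[i]) / A[i, i]
    return alpha

A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, 1.0])
print(coordinate_ascent(A, b))       # approaches np.linalg.solve(A, b)
```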

Coordinate Ascent (Descent) A two-dimensional coordinate ascent example

SMO Algorithm
SMO: sequential minimal optimization. The SVM dual optimization problem is
$$\max_{\alpha}\; W(\alpha) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y^{(i)} y^{(j)}\alpha_i\alpha_j\, x^{(i)\top} x^{(j)} \quad \text{s.t.}\; 0 \le \alpha_i \le C,\; i=1,\dots,m; \quad \sum_{i=1}^{m}\alpha_i y^{(i)} = 0$$
We cannot directly apply the coordinate ascent algorithm, because the equality constraint ties the coordinates together:
$$\sum_{i=1}^{m}\alpha_i y^{(i)} = 0 \;\Rightarrow\; \alpha_i y^{(i)} = -\sum_{j\ne i}\alpha_j y^{(j)}$$

SMO Algorithm
Update two variables at a time:
Loop until convergence {
  1. Select some pair $\alpha_i$ and $\alpha_j$ to update next.
  2. Re-optimize $W(\alpha)$ w.r.t. $\alpha_i$ and $\alpha_j$, holding all other variables fixed.
}
Convergence test: whether the change of $W(\alpha)$ is smaller than a predefined value (e.g., 0.01). The key advantage of the SMO algorithm is that the update of $\alpha_i$ and $\alpha_j$ (step 2) is efficient.

SMO Algorithm
$$\max_{\alpha}\; W(\alpha) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y^{(i)} y^{(j)}\alpha_i\alpha_j\, x^{(i)\top} x^{(j)} \quad \text{s.t.}\; 0 \le \alpha_i \le C,\; i=1,\dots,m; \quad \sum_{i=1}^{m}\alpha_i y^{(i)} = 0$$
Without loss of generality, hold $\alpha_3, \dots, \alpha_m$ fixed and optimize $W(\alpha)$ w.r.t. $\alpha_1$ and $\alpha_2$:
$$\alpha_1 y^{(1)} + \alpha_2 y^{(2)} = -\sum_{i=3}^{m}\alpha_i y^{(i)} = \zeta \;\Rightarrow\; \alpha_1 = (\zeta - \alpha_2 y^{(2)})\,y^{(1)}$$

SMO Algorithm
With $\alpha_1 = (\zeta - \alpha_2 y^{(2)})\,y^{(1)}$, the objective is written as
$$W(\alpha_1, \alpha_2, \dots, \alpha_m) = W\big((\zeta - \alpha_2 y^{(2)})\,y^{(1)}, \alpha_2, \dots, \alpha_m\big)$$
Thus the original optimization problem
$$\max_{\alpha}\; W(\alpha) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y^{(i)} y^{(j)}\alpha_i\alpha_j\, x^{(i)\top} x^{(j)} \quad \text{s.t.}\; 0 \le \alpha_i \le C,\; i=1,\dots,m; \quad \sum_{i=1}^{m}\alpha_i y^{(i)} = 0$$
is transformed into a quadratic optimization problem w.r.t. $\alpha_2$ alone:
$$\max_{\alpha_2}\; W(\alpha_2) = a\alpha_2^2 + b\alpha_2 + c \quad \text{s.t.}\; 0 \le \alpha_2 \le C$$

SMO Algorithm
Optimizing a one-dimensional quadratic function is very efficient:
$$\max_{\alpha_2}\; W(\alpha_2) = a\alpha_2^2 + b\alpha_2 + c \quad \text{s.t.}\; 0 \le \alpha_2 \le C$$
The unconstrained maximizer is $\alpha_2 = -\frac{b}{2a}$; the constrained solution is obtained by clipping it to the interval $[0, C]$ (the figure shows the three cases: maximizer inside the interval, below 0, and above $C$).
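
A hedged sketch of just this one-dimensional sub-step (not a full SMO implementation); the coefficients a, b, c are assumed to be given, with a < 0 so the quadratic is concave:

```python
import numpy as np

def maximize_clipped_quadratic(a, b, c, C):
    """Maximize W(alpha2) = a*alpha2**2 + b*alpha2 + c over [0, C],
    assuming a < 0 (concave), as in the SMO sub-problem."""
    alpha2 = -b / (2.0 * a)                 # unconstrained maximizer
    alpha2 = np.clip(alpha2, 0.0, C)        # project back onto the box [0, C]
    return alpha2, a * alpha2**2 + b * alpha2 + c

print(maximize_clipped_quadratic(a=-1.0, b=3.0, c=0.0, C=1.0))   # clipped at C = 1
print(maximize_clipped_quadratic(a=-1.0, b=1.0, c=0.0, C=1.0))   # interior optimum 0.5
```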

Kernel Methods

Non-Separable Cases
More practical cases are linearly non-separable. (The figure contrasts a linearly separable case, a case that may be solved by slack variables, and a case that cannot be solved by slack variables.)

Non-Separable Cases
More practical cases are linearly non-separable. Solution: map feature vectors to a higher-dimensional space. An example: 1-D data $x$ becomes separable under the mapping $\phi(x) = x^2$ (shown in the figure).

Non-Separable Cases
More generally, map feature vectors to a different space via $\phi(x)$. (Figures from Stuart Russell)

Feature Mapping Functions
SVM only cares about the inner products:
$$W(\alpha) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y^{(i)} y^{(j)}\alpha_i\alpha_j\, x^{(i)\top} x^{(j)}$$
With a feature mapping function $\phi(x)$:
$$W(\alpha) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y^{(i)} y^{(j)}\alpha_i\alpha_j\, K(x^{(i)}, x^{(j)})$$
where the kernel is $K(x^{(i)}, x^{(j)}) = \phi(x^{(i)})^\top\phi(x^{(j)})$.

Kernel
With the example feature mapping function
$$\phi(x) = \begin{bmatrix} x \\ x^2 \\ x^3 \end{bmatrix}$$
the corresponding kernel is
$$K(x^{(i)}, x^{(j)}) = \phi(x^{(i)})^\top\phi(x^{(j)}) = x^{(i)} x^{(j)} + x^{(i)2} x^{(j)2} + x^{(i)3} x^{(j)3}$$
[Kernel Trick] In many cases we only need $K(x^{(i)}, x^{(j)})$, so we can directly define $K(x^{(i)}, x^{(j)})$ without explicitly defining $\phi(x^{(i)})$. For example, suppose $x^{(i)}, x^{(j)} \in \mathbb{R}^n$ and
$$K(x^{(i)}, x^{(j)}) = (x^{(i)\top} x^{(j)})^2$$

Kernel Example
For example, suppose $x^{(i)}, x^{(j)} \in \mathbb{R}^n$ and $K(x^{(i)}, x^{(j)}) = (x^{(i)\top} x^{(j)})^2$. Then
$$K(x^{(i)}, x^{(j)}) = \Big(\sum_{k=1}^{n} x_k^{(i)} x_k^{(j)}\Big)\Big(\sum_{l=1}^{n} x_l^{(i)} x_l^{(j)}\Big) = \sum_{k=1}^{n}\sum_{l=1}^{n} x_k^{(i)} x_k^{(j)} x_l^{(i)} x_l^{(j)} = \sum_{k,l=1}^{n}(x_k^{(i)} x_l^{(i)})(x_k^{(j)} x_l^{(j)})$$
If $n = 3$, the mapping function is
$$\phi(x) = \begin{bmatrix} x_1 x_1 & x_1 x_2 & x_1 x_3 & x_2 x_1 & x_2 x_2 & x_2 x_3 & x_3 x_1 & x_3 x_2 & x_3 x_3 \end{bmatrix}^\top$$
Note that calculating $\phi(x)$ takes $O(n^2)$ time, while calculating $K(x^{(i)}, x^{(j)})$ only takes $O(n)$ time.
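
A quick numerical check (an illustrative sketch, not from the slides) that the explicit map $\phi(x)$, i.e. all pairwise products $x_k x_l$, reproduces the squared-dot-product kernel:

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(x, z) = (x^T z)^2: all pairwise products x_k * x_l."""
    return np.outer(x, x).ravel()        # O(n^2) features

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

print(phi(x) @ phi(z))                   # explicit inner product in feature space
print((x @ z) ** 2)                      # kernel trick: O(n) computation, same value
```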

Kernel for Measuring Similarity
Intuitively, for two instances $x$ and $z$, if $\phi(x)$ and $\phi(z)$ are close together, then we expect $K(x, z) = \phi(x)^\top\phi(z)$ to be large, and vice versa.
Gaussian kernel (a very widely used kernel), also called the radial basis function (RBF) kernel:
$$K(x, z) = \exp\Big(-\frac{\|x - z\|^2}{2\sigma^2}\Big)$$
Then what is the feature mapping function for this kernel?
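
A small sketch (assumed helper, not from the slides) that builds the Gaussian/RBF kernel matrix for a set of instances and checks the properties discussed next, namely symmetry and positive semi-definiteness:

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """Gaussian (RBF) kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = np.sum(X**2, axis=1, keepdims=True) \
             + np.sum(X**2, axis=1) - 2.0 * X @ X.T
    return np.exp(-sq_dists / (2.0 * sigma**2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_kernel_matrix(X, sigma=1.0)
print(K)                                 # symmetric, ones on the diagonal
print(np.linalg.eigvalsh(K) >= -1e-10)   # positive semi-definite, as Mercer's theorem requires
```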

Kernel Matrix
Consider a finite set of instances $\{x^{(1)}, \dots, x^{(m)}\}$. The corresponding kernel matrix $K = \{K_{ij}\}_{i,j=1,\dots,m}$ is defined by $K_{ij} = K(x^{(i)}, x^{(j)})$.
The kernel matrix $K$ must be symmetric, since
$$K_{ij} = K(x^{(i)}, x^{(j)}) = \phi(x^{(i)})^\top\phi(x^{(j)}) = \phi(x^{(j)})^\top\phi(x^{(i)}) = K(x^{(j)}, x^{(i)}) = K_{ji}$$
If we define $\phi_k(x)$ as the $k$-th coordinate of the vector $\phi(x)$, then for any vector $z \in \mathbb{R}^m$ we have
$$z^\top K z = \sum_i\sum_j z_i K_{ij} z_j = \sum_i\sum_j z_i\,\phi(x^{(i)})^\top\phi(x^{(j)})\,z_j = \sum_i\sum_j\sum_k z_i\,\phi_k(x^{(i)})\,\phi_k(x^{(j)})\,z_j = \sum_k\Big(\sum_i z_i\,\phi_k(x^{(i)})\Big)^2 \ge 0$$
Therefore, $K$ is positive semi-definite.

Valid (Mercer) Kernel
Theorem (Mercer). Let $K : \mathbb{R}^n \times \mathbb{R}^n \mapsto \mathbb{R}$ be given. Then for $K$ to be a valid (Mercer) kernel, it is necessary and sufficient that for any $\{x^{(1)}, \dots, x^{(m)}\}$, $m < \infty$, the corresponding kernel matrix is symmetric positive semi-definite.
Example valid kernels:
RBF kernel: $K(x, z) = \exp\big(-\frac{\|x - z\|^2}{2\sigma^2}\big)$
Simple polynomial kernel: $K(x, z) = (x^\top z)^d$
Cosine similarity kernel: $K(x, z) = \frac{x^\top z}{\|x\|\,\|z\|}$
(James Mercer, UK mathematician, 1883-1932)

Sigmoid Kernel
$$K(x, z) = \tanh(\kappa\, x^\top z + c), \qquad \tanh(b) = \frac{1 - e^{-2b}}{1 + e^{-2b}}$$
Neural networks use the sigmoid as an activation function. An SVM with a sigmoid kernel is equivalent to a 2-layer perceptron. (We shall return to this after the study of neural networks.)
Lin, Hsuan-Tien, and Chih-Jen Lin. "A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods." 2003.

Generalized Linear Models

Review: Linear Regression
$$X = \begin{bmatrix} x^{(1)} \\ x^{(2)} \\ \vdots \\ x^{(n)} \end{bmatrix} = \begin{bmatrix} x^{(1)}_1 & x^{(1)}_2 & x^{(1)}_3 & \cdots & x^{(1)}_d \\ x^{(2)}_1 & x^{(2)}_2 & x^{(2)}_3 & \cdots & x^{(2)}_d \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x^{(n)}_1 & x^{(n)}_2 & x^{(n)}_3 & \cdots & x^{(n)}_d \end{bmatrix}, \quad \theta = \begin{bmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_d \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$
Prediction:
$$\hat{y} = X\theta = \begin{bmatrix} x^{(1)}\theta \\ x^{(2)}\theta \\ \vdots \\ x^{(n)}\theta \end{bmatrix}$$
Objective:
$$J(\theta) = \frac{1}{2}(y - \hat{y})^\top(y - \hat{y}) = \frac{1}{2}(y - X\theta)^\top(y - X\theta)$$

Review: Matrix Form of Linear Regression
Objective:
$$J(\theta) = \frac{1}{2}(y - X\theta)^\top(y - X\theta), \qquad \min_\theta J(\theta)$$
Gradient:
$$\frac{\partial J(\theta)}{\partial\theta} = -X^\top(y - X\theta)$$
Solution:
$$\frac{\partial J(\theta)}{\partial\theta} = 0 \;\rightarrow\; X^\top(y - X\theta) = 0 \;\rightarrow\; X^\top y = X^\top X\theta \;\rightarrow\; \hat{\theta} = (X^\top X)^{-1} X^\top y$$
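
A minimal numpy sketch of this closed-form solution (using np.linalg.solve rather than forming the inverse explicitly; the toy data are made up):

```python
import numpy as np

def fit_linear_regression(X, y):
    """Closed-form least squares: theta_hat = (X^T X)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# toy data: y = 2*x1 - x2 plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 0.01 * rng.normal(size=100)
print(fit_linear_regression(X, y))       # approximately [2, -1]
```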

Generalized Linear Models
Dependence: $y = f(\theta^\top\phi(x))$, where $\phi$ is the feature mapping function.
Mapped feature matrix:
$$\Phi = \begin{bmatrix} \phi(x^{(1)}) \\ \phi(x^{(2)}) \\ \vdots \\ \phi(x^{(n)}) \end{bmatrix} = \begin{bmatrix} \phi_1(x^{(1)}) & \phi_2(x^{(1)}) & \cdots & \phi_h(x^{(1)}) \\ \phi_1(x^{(2)}) & \phi_2(x^{(2)}) & \cdots & \phi_h(x^{(2)}) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_1(x^{(n)}) & \phi_2(x^{(n)}) & \cdots & \phi_h(x^{(n)}) \end{bmatrix}$$

Matrix Form of Kernel Linear Regression
Objective:
$$J(\theta) = \frac{1}{2}(y - \Phi\theta)^\top(y - \Phi\theta), \qquad \min_\theta J(\theta)$$
Gradient:
$$\frac{\partial J(\theta)}{\partial\theta} = -\Phi^\top(y - \Phi\theta)$$
Solution:
$$\frac{\partial J(\theta)}{\partial\theta} = 0 \;\rightarrow\; \Phi^\top(y - \Phi\theta) = 0 \;\rightarrow\; \Phi^\top y = \Phi^\top\Phi\theta \;\rightarrow\; \hat{\theta} = (\Phi^\top\Phi)^{-1}\Phi^\top y$$

Matrix Form of Kernel Linear Regression
With the algebra trick
$$(P^{-1} + B^\top R^{-1} B)^{-1} B^\top R^{-1} = P B^\top (B P B^\top + R)^{-1}$$
the optimal parameters with L2 regularization are
$$\hat{\theta} = (\Phi^\top\Phi + \lambda I_h)^{-1}\Phi^\top y = \Phi^\top(\Phi\Phi^\top + \lambda I_n)^{-1} y$$
For prediction, we never actually need access to $\Phi$ itself:
$$\hat{y} = \Phi\hat{\theta} = \Phi\Phi^\top(\Phi\Phi^\top + \lambda I_n)^{-1} y = K(K + \lambda I_n)^{-1} y$$
where the kernel matrix is $K = \{k(x^{(i)}, x^{(j)})\}$.
[http://www.ics.uci.edu/~welling/classnotes/papers_class/kernel-ridge.pdf]
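
A minimal sketch of kernel ridge prediction following the identity above; extending it to new test points via $K(x_{\text{test}}, x_{\text{train}})$ is the standard usage, and the RBF kernel, σ, and λ values here are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """K_ij = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq / (2.0 * sigma**2))

def kernel_ridge_predict(X_train, y_train, X_test, lam=0.1, sigma=1.0):
    """Prediction y_hat = K_test (K_train + lam I)^{-1} y; no explicit phi needed."""
    K = rbf_kernel(X_train, X_train, sigma)
    dual_coef = np.linalg.solve(K + lam * np.eye(len(y_train)), y_train)
    return rbf_kernel(X_test, X_train, sigma) @ dual_coef

X = np.linspace(-3, 3, 50)[:, None]
y = np.sin(X).ravel()
print(kernel_ridge_predict(X, y, np.array([[0.0], [1.5]])))   # roughly sin(0), sin(1.5)
```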