Lecture 7: Kernels for Classification and Regression

Size: px

Start display at page:

Download "Lecture 7: Kernels for Classification and Regression"

William Arnold
6 years ago
Views:

1 Lecture 7: Kernels for Classification and Regression CS , Fall 2011 Laurent El Ghaoui EECS Department UC Berkeley September 15, 2011

2 Outline

3 Outline

4 A linear regression problem Linear auto-regressive model for time-series: y t linear function of y t 1, y t 2 y t = w 1 + w 2 y t 1 + w 3 y t 2, t = 1,..., T. This writes y t = w T x t, with x t the feature vectors x t := (1, y t 1, y t 2 ), t = 1,..., T. Model fitting via least-squares: min w X T w y 2 2 Prediction rule : ŷ T +1 = w 1 + w 2 y T + w 3 y T 1 = w T x T +1.

5 Nonlinear regression Nonlinear auto-regressive model for time-series: y t quadratic function of y t 1, y t 2 y t = w 1 + w 2 y t 1 + w 3 y t 2 + w 4 y 2 t 1 + w 5 y t 1 y t 2 + w 6 y 2 t 2. This writes y t = w T φ(x t), with φ(x t) the augmented feature vectors ( ) φ(x t) := 1, y t 1, y t 2, yt 1, 2 y t 1 y t 2, yt 2 2. Everything the same as before, with x replaced by φ(x).

6 Nonlinear classification Non-linear (e.g., quadratic) decision boundary w 1 x 1 + w 2 x 2 + w 3 x w 4 x 1 x 2 + w 5 x b = 0. Writes w T φ(x) + b = 0, with φ(x) := (x 1, x 2, x 2 1, x 1 x 2, x 2 2 ).

7 Challenges In principle, it seems can always augment the dimension of the feature space to make the data linearly separable. (See the video at How do we do it in a computationally efficient manner?

8 Outline

9 Linear least-squares where min w X T w y λ w 2 2 X = [x 1,..., x n] is the m n matrix of data points. y R m is the response vector, w contains regression coefficients. λ 0 is a regularization parameter. Prediction rule: y = w T x, where x R n is a new data point.

10 Support vector machine (SVM) where min w m (1 y i (w T x i + b)) + λ w 2 2 i=1 X = [x 1,..., x m] is the n m matrix of data points in R n. y { 1, 1} m is the label vector. w, b contain classifer coefficients. λ 0 is a regularization parameter. In the sequel, we ll ignore the bias term (for simplicity only). Classification rule: y = sign(w T x + b), where x R n is a new data point.

11 of problem Many classification problems can be written where min w L(X T w, y) + λ w 2 2 X = [x 1,..., x n] is a m n matrix of data points. y R m contains a response vector (or labels). w contains classifier coefficients. L is a loss function that depends on the problem considered. λ 0 is a regularization parameter. Prediction/classification rule: depends only on w T x, where x R n is a new data point.

12 Loss functions Squared loss: (for linear least-squares regression) L(z, y) = z y 2 2. Hinge loss: (for SVMs) m L(z, y) = max(0, 1 y i z i ) i=1 Logistic loss: (for logistic regression) m L(z, y) = log(1 + e y i z i ). i=1

13 Outline

14 Key result For the generic problem: min w L(X T w) + λ w 2 2 the optimal w lies in the span of the data points (x 1,..., x m): for some vector v R m. w = Xv

15 Proof R( A) = N ( A ). Therefore, the output space R m is decomposed as Any w Rn can be written as( Athe of two Rm = R ) sum = R (orthogonal A ) N ( A )vectors:, R( A) and w = Xv + r dim N ( A ) + rank( A ) = m, T where X T r = 0 (that is, r is in the nullspace N (X )). 3 see a graphical exemplification in R in Figure Nonlinear case2.26: Illus Figure of the fundamen theorem of linea Algebra in R 3 ; A [ a1 a2 ]. Figure shows the case X = A = (a1, a2 ).

16 Consequence of key result For the generic problem: min w L(X T w) + λ w 2 2 the optimal w can be written as w = Xv for some vector v R m. Hence training problem depends only on K := X T X: min v L(Kv) + λv T Kv.

17 Kernel matrix The training problem depends only on the kernel matrix K = X T X K ij = xi T x j K contains the scalar products between all data point pairs. The prediction/classification rule depends on the scalar products between new point x and the data points x 1,..., x m: w T x = v T X T x = v T k, k := X T x = (x T x 1,..., x T x m).

18 Computational advantages Once K is formed (this takes O(n)), then the training problem has only m variables. When n >> m, this leads to a dramatic reduction in problem size.

19 How about the nonlinear case? In the nonlinear case, we simply replace the feature vectors x i by some augmented feature vectors φ(x i ), with φ a non-linear mapping. Example : in classification with quadratic decision boundary, we use φ(x) := (x 1, x 2, x 2 1, x 1 x 2, x 2 2 ). This leads to the modified kernel matrix K ij = φ(x i ) T φ(x j ), 1 i, j m.

20 The kernel function The kernel function associated with mapping φ is k(x, z) = φ(x) T φ(z). It provides information about the metric in the feature space, e.g.: φ(x) φ(z) 2 2 = k(x, x) 2k(x, z) + k(z, z). The computational effort involved in solving the training problem; making a prediction, depends only on our ability to quickly evaluate such scalar products. We can t choose k arbitrarily; it has to satisfy the above for some φ.

21 Outline

22 Quadratic kernels Classification with quadratic boundaries involves feature vectors φ(x) = (1, x 1, x 2, x 2 1, x 1 x 2, x 2 2 ). Fact : given two vectors x, z R 2, we have φ(x) T φ(z) = (1 + x T z) 2.

23 More generally when φ(x) is the vector formed with all the products between the components of x R n, up to degree d, then for any two vectors x, z R n, φ(x) T φ(z) = (1 + x T z) d. Computational effort grows linearly in n. This represents a dramatic reduction in speed over the brute force approach: Form φ(x), φ(z); evaluate φ(x) T φ(z). Computational effort grows as n d.

24 Gaussian kernel function: k(x, z) = exp ( x ) z 2 2, 2σ 2 where σ > 0 is a scale parameter. Allows to ignore points that are too far apart. Corresponds to a non-linear mapping φ to infinite-dimensional feature space. There is a large variety (a zoo?) of other kernels, some adapted to structure of data (text, images, etc).

25 In practice Kernels need to be chosen by the user. Choice not always obvious; Gaussian or polynomial kernels are popular. Control over-fitting via cross validation (wrt say, scale parameter of Gaussian kernel, or degree of polynomial kernel). Kernel methods not well adapted to l 1 -norm regularization.

Short Course Robust Optimization and Machine Learning. 3. Optimization in Supervised Learning

Short Course Robust Optimization and 3. Optimization in Supervised EECS and IEOR Departments UC Berkeley Spring seminar TRANSP-OR, Zinal, Jan. 16-19, 2012 Outline Overview of Supervised models and variants