Machine Learning 2: Nonlinear Regression

Size: px

Start display at page:

Download "Machine Learning 2: Nonlinear Regression"

Ada Sanders
5 years ago
Views:

1 Machine Learning : Nonlinear Regression J. Zico Kolter September 17, 01 1

2 Non-linear regression Peak Hourly Demand (GW) High temperature / peak demand observations for all days in

3 .5 Demand (GW) Temp (F) Hour of day

4 Central idea of non-linear regression: same as linear regression, just with non-linear features x i E.g. φ(x i ) = x i 1 Two ways to construct non-linear features: explicitly (construct actual feature vector), or implicitly (using kernels) 4

5 Peak Hourly Demand (GW).5 d = Degree polynomial 5

6 Peak Hourly Demand (GW).5 d = Degree polynomial 6

7 Peak Hourly Demand (GW).5 d = Degree 4 polynomial 7

8 Constructing explicit feature vectors Polynomial features (max degree d) z d z d 1 Special case, n=1: φ(z) =. R d+1 z 1 General case: φ(z) = { n i=1 z b i i : } n b i d i=1 R (n+d n ) 8

9 1 0.5 φ i (x) x x x x Plot of polynomial bases 9

10 Radial basis function (RBF) features Defined by bandwidth σ and k RBF centers µ j R n, j = 1,..., k { z µj } φ j (z) = exp σ Feature Value Input 10

11 Difficulties with non-linear features Problem #1: Computational difficulties Polynomial features, ( ) n + d k = = O(d n ) d RBF features; suppose we want centers in uniform grid over input space (w/ d centers along each dimension) k = d n In both cases, exponential in the size of the input dimension; quickly intractable to even store in memory 11

12 Problem #: Representational difficulties With many features, our prediction function becomes very expressive Can lead to overfitting (low error on input data points, but high error nearby) 1

13 Peak Hourly Demand (GW).5 d = 1 Peak Hourly Demand (GW).5 d = Peak Hourly Demand (GW).5 d = 4 Peak Hourly Demand (GW).5 d = Least-squares fits for polynomial features of different degrees 1

14 Peak Hourly Demand (GW).5 num RBFs = Peak Hourly Demand (GW).5 num RBFs = Peak Hourly Demand (GW).5 num RBFs = 10 Peak Hourly Demand (GW).5 num RBFs = 50, λ = Least-squares fits for different numbers of RBFs 14

15 A few ways to deal with representational problem: Choose less expressive function (e.g., lower degree polynomial, fewer RBF centers, larger RBF bandwidth) Regularization: penalize large parameters θ minimize θ m l(ŷ i, y i ) + λ θ i=1 λ: regularization parameter, trades off between low loss and small values of θ (often, don t regularize constant term) We ll come back to this issue when talking about evaluating machine learning methods 15

16 J(θ) θ Pareto optimal surface for 0 RBF functions 16

17 Peak Hourly Demand (GW).5 num RBFs = 50, λ = 0 Peak Hourly Demand (GW).5 num RBFs = 50, λ = Peak Hourly Demand (GW).5 num RBFs = 50, λ = 50 Peak Hourly Demand (GW).5 num RBFs = 50, λ = RBF fits varying regularization parameter (not regularizing constant term) 17

18 Implicit feature vectors (kernels) One of the main trends in machine learning in the past 15 years Kernels let us work in high-dimensional feature spaces without explicitly constructing the feature vector This addresses the first problem, the computational difficulty 18

19 Simple example, polynomial feature, n =, d = φ(z) = 1 z1 z z 1 z1 z z Let s look at the inner product between two different feature vectors φ(z) T φ(z ) = 1 + z 1 z 1 + z z + z1z 1 + z1 z z 1z + zz = 1 + (z 1 z 1 + z z ) + (z 1 z 1 + z z ) = 1 + (z T z ) + (z T z ) = (1 + z T z ) 19

20 General case: (1 + z T z ) d is the inner product between two polynomial feature vectors of max degree d ( ( ) n+d d -dimensional) But, can be computed in only O(n) time We use the notation of a kernel function K : R n R n R that computes these inner products K(z, z ) = φ(z) T φ(z ) Some common kernels polynomial (degree d): Gaussian (bandwidth σ): K(z, z ) = (1 + z T z ) d { z z K(z, z } ) = exp σ 0

21 Using kernels in optimization We can compute inner produucts, but how does this help us solve optimization problems? Consider (regularized) optimization problem we ve been using minimize θ m l(θ T φ(x i ), y i ) + λ θ i=1 Representer theorem: The solution to above problem is given by θ = m α i φ(x i ), for some α R m, (or θ = Φ T α) i=1 1

22 Notice that (ΦΦ T ) ij = φ(x i ) T φ(x j ) = K(x i, x j ) Abusing notation a bit, we ll define the kernel matrix K R m m K = ΦΦ T, (K ij = K(x i, x j )) can be computed without constructing feature vectors or Φ

23 Let s take (reguarlized) least squares objective... J(θ) = Φθ y + λ θ and substitute θ = Φ T α J(α) = ΦΦ T α y + λα T ΦΦ T α = Kα y + λα T Kα = α T KKα y T Kα + y T y + λα T Kα Taking the gradient w.r.t. α and setting to zero α J(α) = KKα Ky + λkα α = (K + λi) 1 y

24 How do we compute prediction on a new input x? ( m T ŷ = θ T φ(x ) = α i φ(x i )) φ(x ) = i=1 m α i K(x i, x ) i=1 Need to keep around all examples x i in order to make a prediction; non-parametric method 4

25 MATLAB code for polynomial kernel % computing alphas d = 6; lambda = 1; K = (1 + X*X').^d; alpha = (K + lambda*eye(m)) \ y; % computing prediction k_test = (1 + x_test*x').^d; y_hat = k_test*alpha; Gaussian kernel sigma = 0.1; lambda = 1; K = exp(-0.5*sqdist(x', X')/sigma^) alpha = (K + lambda*eye(m)) \ y; 5

26 Peak Hourly Demand (GW).5 d = 4, λ = 0.01 Peak Hourly Demand (GW).5 d = 10, λ = Peak Hourly Demand (GW).5 σ =, λ = 0.01 Peak Hourly Demand (GW).5 σ = 40, λ = Fits from polynomial and RBF kernels 6

27 Non-linear regression in higher dimensions.5 Demand (GW) Temp (F) Hour of day 0 7

28 Kernels can also be used with other loss functions; key element is just the transformation θ = Φ T α. Absolute loss alpha = sdpvar(m,1); solvesdp([], sum(abs(k*alpha - y))); Some advanced algorithms possible: deadband loss + kernels = support vector regression 8

Each new feature uses a pair of the original features. Problem: Mapping usually leads to the number of features blow up!

Each new feature uses a pair of the original features. Problem: Mapping usually leads to the number of features blow up! Feature Mapping Consider the following mapping φ for an example x = {x 1,...,x D } φ : x {x1,x 2 2,...,x 2 D,,x 2 1 x 2,x 1 x 2,...,x 1 x D,...,x D 1 x D } It s an example of a quadratic mapping Each new