Linear Regression (continued)


Professor Ameet Talwalkar
CS260 Machine Learning Algorithms
February 6, 2017

Outline
1 Administration
2 Review of last lecture
3 Linear regression
4 Nonlinear basis functions

Announcements
HW2 will be returned in section on Friday
HW3 due in class next Monday
Midterm is next Wednesday (will review in more detail next class)

Outline
1 Administration
2 Review of last lecture
   Perceptron
   Linear regression introduction
3 Linear regression
4 Nonlinear basis functions

Perceptron

Main idea: consider a linear model for binary classification, w^T x, which we use to distinguish between two classes {-1, +1}.

One goal: minimize errors on the training dataset,

ε = Σ_n I[y_n ≠ sign(w^T x_n)]

Hard, but easy if we have only one training example

How can we change w such that y_n = sign(w^T x_n)?

Two cases:
If y_n = sign(w^T x_n), do nothing.
If y_n ≠ sign(w^T x_n), set w_new ← w_old + y_n x_n.

Guaranteed to make progress, i.e., to get us closer to y_n (w^T x_n) > 0.

What does the update do?

y_n [(w + y_n x_n)^T x_n] = y_n w^T x_n + y_n^2 x_n^T x_n

We are adding a positive number, so it is possible that y_n (w_new^T x_n) > 0.

Perceptron algorithm

Iteratively solving one case at a time:
REPEAT
  Pick a data point x_n (can be a fixed order over the training instances)
  Make a prediction y = sign(w^T x_n) using the current w
  If y = y_n, do nothing. Else, w ← w + y_n x_n
UNTIL converged

Properties:
This is an online algorithm.
If the training data is linearly separable, the algorithm stops in a finite number of steps (we proved this).
The parameter vector is always a linear combination of the training instances (requires initializing w^(0) = 0).
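
To make the update rule concrete, here is a minimal NumPy sketch of the loop described above; the data X, labels y, and the maximum number of passes max_epochs are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Perceptron sketch: X is (N, D), y contains labels in {-1, +1}."""
    N, D = X.shape
    w = np.zeros(D)                      # initialize w^(0) = 0
    for _ in range(max_epochs):
        mistakes = 0
        for n in range(N):               # fixed order over training instances
            if np.sign(w @ X[n]) != y[n]:
                w = w + y[n] * X[n]      # w <- w + y_n x_n
                mistakes += 1
        if mistakes == 0:                # converged: no mistakes in a full pass
            break
    return w
```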

Regression

Predicting a continuous outcome variable:
Predicting shoe size from height, weight, and gender
Predicting song year from audio features

Key difference from classification: we can measure the closeness of predictions and labels.
Predicting shoe size: better to be off by one size than by 5 sizes
Predicting song year: better to be off by one year than by 20 years
As opposed to the 0-1 classification error, we will focus on the squared difference, i.e., (ŷ - y)^2.

1D example: predicting the sale price of a house

Sale price ≈ price per sqft × square footage + fixed expense

Minimize squared errors

Our model: Sale price = price per sqft × square footage + fixed expense + unexplainable stuff

Training data:

sqft    sale price    prediction    error    squared error
2000    810K          720K          90K      8100
2100    907K          800K          107K     107^2
1100    312K          350K          -38K     38^2
5500    2,600K        2,600K        0        0
...
Total: 8100 + 107^2 + 38^2 + 0 + ...

Aim: adjust price per sqft and fixed expense such that the sum of the squared errors is minimized, i.e., the unexplainable stuff is minimized.

Linear regression

Setup:
Input: x ∈ R^D (covariates, predictors, features, etc.)
Output: y ∈ R (responses, targets, outcomes, outputs, etc.)
Model: f : x → y, with f(x) = w_0 + Σ_d w_d x_d = w_0 + w^T x
w = [w_1 w_2 ... w_D]^T: weights, parameters, or parameter vector
w_0 is called the bias
We also sometimes call the augmented vector w̃ = [w_0 w_1 w_2 ... w_D]^T the parameters too
Training data: D = {(x_n, y_n), n = 1, 2, ..., N}

Least Mean Squares (LMS) objective: minimize the squared difference on the training data (or residual sum of squares)

RSS(w̃) = Σ_n [y_n - f(x_n)]^2 = Σ_n [y_n - (w_0 + Σ_d w_d x_nd)]^2

1D solution: identify stationary points by taking the derivative with respect to the parameters and setting it to zero, yielding the normal equations.
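
A minimal sketch of evaluating the RSS objective for a candidate parameter setting; the helper name rss and the toy values of X, y, w_0, and w are illustrative assumptions.

```python
import numpy as np

def rss(w0, w, X, y):
    """Residual sum of squares: sum_n [y_n - (w0 + w^T x_n)]^2."""
    preds = w0 + X @ w          # f(x_n) for every training point
    return np.sum((y - preds) ** 2)

# toy data: N = 4 points with D = 2 features (illustrative values)
X = np.array([[2.0, 1.0], [2.1, 3.0], [1.1, 0.5], [5.5, 2.0]])
y = np.array([8.1, 9.0, 3.1, 26.0])
print(rss(w0=0.5, w=np.array([4.0, 0.3]), X=X, y=y))
```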

Probabilistic interpretation

Noisy observation model: Y = w_0 + w_1 X + η, where η ~ N(0, σ^2) is a Gaussian random variable.

Likelihood of one training sample (x_n, y_n):

p(y_n | x_n) = N(w_0 + w_1 x_n, σ^2) = (1 / (√(2π) σ)) exp( -[y_n - (w_0 + w_1 x_n)]^2 / (2σ^2) )

Maximum likelihood estimation

Maximize over w_0 and w_1:

max log P(D)  ⇔  min Σ_n [y_n - (w_0 + w_1 x_n)]^2

That is RSS(w̃)!

Maximize over s = σ^2:

∂ log P(D) / ∂s = (1/2)(1/s^2) Σ_n [y_n - (w_0 + w_1 x_n)]^2 - N/(2s) = 0

⇒ σ^2* = s* = (1/N) Σ_n [y_n - (w_0 + w_1 x_n)]^2

How does this probabilistic interpretation help us?

It gives a solid footing to our intuition: minimizing RSS(w̃) is a sensible thing based on reasonable modeling assumptions.
Estimating σ tells us how much noise there could be in our predictions. For example, it allows us to place confidence intervals around our predictions.

Outline
1 Administration
2 Review of last lecture
3 Linear regression
   Multivariate solution in matrix form
   Computational and numerical optimization
   Ridge regression
4 Nonlinear basis functions

LMS when x is D-dimensional

RSS(w̃) in matrix form:

RSS(w̃) = Σ_n [y_n - (w_0 + Σ_d w_d x_nd)]^2 = Σ_n [y_n - w̃^T x̃_n]^2

where we have redefined some variables (by augmenting):

x̃ ← [1 x_1 x_2 ... x_D]^T,   w̃ ← [w_0 w_1 w_2 ... w_D]^T

which leads to

RSS(w̃) = Σ_n (y_n - w̃^T x̃_n)(y_n - x̃_n^T w̃)
        = Σ_n ( w̃^T x̃_n x̃_n^T w̃ - 2 y_n x̃_n^T w̃ ) + const.
        = w̃^T ( Σ_n x̃_n x̃_n^T ) w̃ - 2 ( Σ_n y_n x̃_n^T ) w̃ + const.

Matrix Multiplication via Inner Products

Each entry of the output matrix is the inner product of a row of the first input and a column of the second input. For

A = [ 9 3 5 ; 4 1 2 ] (2×3),   B = [ 1 2 ; 3 5 ; 2 3 ] (3×2),

the product is AB = [ 28 48 ; 11 19 ], e.g., the (1,1) entry is 9·1 + 3·3 + 5·2 = 28.

Matrix Multiplication via Outer Products

The output matrix is the sum of outer products between corresponding columns of the first input and rows of the second input:

AB = [ 9 18 ; 4 8 ] + [ 9 15 ; 3 5 ] + [ 10 15 ; 4 6 ] = [ 28 48 ; 11 19 ]
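
A minimal NumPy check of the two views on the same 2×3 and 3×2 matrices used above; the variable names are mine.

```python
import numpy as np

A = np.array([[9, 3, 5],
              [4, 1, 2]])          # 2 x 3
B = np.array([[1, 2],
              [3, 5],
              [2, 3]])             # 3 x 2

# Inner-product view: entry (i, j) is row i of A dotted with column j of B.
inner = np.array([[A[i] @ B[:, j] for j in range(B.shape[1])]
                  for i in range(A.shape[0])])

# Outer-product view: sum over k of (column k of A)(row k of B).
outer = sum(np.outer(A[:, k], B[k]) for k in range(A.shape[1]))

print(inner)            # [[28 48] [11 19]]
print(outer)            # same result
print(A @ B)            # NumPy's built-in product agrees
```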

RSS(w̃) in new notation

From the previous slide:

RSS(w̃) = w̃^T ( Σ_n x̃_n x̃_n^T ) w̃ - 2 ( Σ_n y_n x̃_n^T ) w̃ + const.

Design matrix and target vector:

X̃ = [ x̃_1^T ; x̃_2^T ; ... ; x̃_N^T ] ∈ R^{N×(D+1)},   y = [ y_1 ; y_2 ; ... ; y_N ]

Compact expression:

RSS(w̃) = ||X̃ w̃ - y||_2^2 = w̃^T X̃^T X̃ w̃ - 2 (X̃^T y)^T w̃ + const.

Solution in matrix form

Compact expression:

RSS(w̃) = ||X̃ w̃ - y||_2^2 = w̃^T X̃^T X̃ w̃ - 2 (X̃^T y)^T w̃ + const.

Gradients of linear and quadratic functions:

∇_x (b^T x) = b
∇_x (x^T A x) = 2 A x   (symmetric A)

Normal equation:

∇_w̃ RSS(w̃) ∝ X̃^T X̃ w̃ - X̃^T y = 0

This leads to the least-mean-squares (LMS) solution:

w̃_LMS = (X̃^T X̃)^{-1} X̃^T y
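
A minimal sketch of solving the normal equations with NumPy; the synthetic data, noise level, and use of np.linalg.solve (rather than forming the inverse explicitly) are my own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
y = X @ np.array([1.5, -2.0, 0.7]) + 4.0 + 0.1 * rng.normal(size=N)

# Augment with a column of ones so w_tilde = [w_0, w_1, ..., w_D].
X_tilde = np.hstack([np.ones((N, 1)), X])

# Normal equations: w = (X^T X)^{-1} X^T y.
# Solving the linear system is preferred over explicit matrix inversion.
w_lms = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)
print(w_lms)   # approximately [4.0, 1.5, -2.0, 0.7]
```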

Mini-Summary

Linear regression is a linear combination of features: f : x → y, with f(x) = w_0 + Σ_d w_d x_d = w_0 + w^T x.
If we minimize the residual sum of squares as our learning objective, we get a closed-form solution for the parameters.
Probabilistic interpretation: maximum likelihood estimation under the assumption that the residual is Gaussian distributed.

Computational complexity

Bottleneck of computing the solution w̃ = (X̃^T X̃)^{-1} X̃^T y:
Matrix multiplication to form X̃^T X̃ ∈ R^{(D+1)×(D+1)}
Inverting the matrix X̃^T X̃

How many operations do we need?
O(ND^2) for the matrix multiplication
O(D^3) (e.g., using Gauss-Jordan elimination) or O(D^2.373) (recent theoretical advances) for the matrix inversion
Impractical for very large D or N

Alternative method: an example of using numerical optimization

(Batch) gradient descent:
Initialize w̃ to w̃^(0) (e.g., randomly); set t = 0; choose η > 0
Loop until convergence:
1 Compute the gradient ∇RSS(w̃) = X̃^T X̃ w̃^(t) - X̃^T y
2 Update the parameters w̃^(t+1) = w̃^(t) - η ∇RSS(w̃)
3 t ← t + 1

What is the complexity of each iteration?
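
A minimal sketch of batch gradient descent on the RSS objective; the step size eta, the iteration budget, and the synthetic data are illustrative assumptions (the lecture does not prescribe specific values).

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.01, num_iters=500):
    """Minimize RSS(w) = ||Xw - y||^2 by batch gradient descent (sketch)."""
    w = np.zeros(X.shape[1])                 # w^(0)
    for _ in range(num_iters):
        grad = X.T @ (X @ w) - X.T @ y       # gradient (constant factor folded into eta)
        w = w - eta * grad                   # w^(t+1) = w^(t) - eta * grad
    return w

# illustrative data with an augmented all-ones column for the bias term
rng = np.random.default_rng(1)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 2))])
y = X @ np.array([2.0, -1.0, 0.5]) + 0.05 * rng.normal(size=50)
print(batch_gradient_descent(X, y))          # close to [2.0, -1.0, 0.5]
```

Each iteration costs O(ND): one pass over the data to form the gradient.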

Why would this work?

If gradient descent converges, it will converge to the same solution as using matrix inversion. This is because RSS(w̃) is a convex function of its parameters w̃.

Hessian of RSS:

RSS(w̃) = w̃^T X̃^T X̃ w̃ - 2 (X̃^T y)^T w̃ + const.

∂^2 RSS(w̃) / ∂w̃ ∂w̃^T = 2 X̃^T X̃

X̃^T X̃ is positive semidefinite, because for any v,

v^T X̃^T X̃ v = ||X̃ v||_2^2 ≥ 0

Stochastic gradient descent

Widrow-Hoff rule: update parameters using one example at a time.
Initialize w̃ to some w̃^(0); set t = 0; choose η > 0
Loop until convergence:
1 Randomly choose a training sample x̃_t
2 Compute its contribution to the gradient: g_t = (x̃_t^T w̃^(t) - y_t) x̃_t
3 Update the parameters: w̃^(t+1) = w̃^(t) - η g_t
4 t ← t + 1

How does the complexity per iteration compare with gradient descent?
O(ND) for gradient descent versus O(D) for SGD
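
A minimal SGD sketch matching the Widrow-Hoff update above; the constant step size, fixed iteration budget, and synthetic data are illustrative simplifications (in practice a decaying η is common).

```python
import numpy as np

def sgd_least_squares(X, y, eta=0.01, num_iters=5000, seed=0):
    """Stochastic gradient descent for least squares (sketch)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(num_iters):
        t = rng.integers(N)                  # randomly choose one training sample
        g = (X[t] @ w - y[t]) * X[t]         # g_t = (x_t^T w - y_t) x_t
        w = w - eta * g                      # O(D) work per iteration
    return w

# illustrative data (augmented with a ones column for the bias)
rng = np.random.default_rng(2)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
y = X @ np.array([1.0, 3.0, -2.0]) + 0.05 * rng.normal(size=200)
print(sgd_least_squares(X, y))               # roughly [1.0, 3.0, -2.0]
```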

Mini-summary

Batch gradient descent computes the exact gradient.
Stochastic gradient descent approximates the gradient with a single data point; its expectation equals the true gradient.
Mini-batch variant: a trade-off between the accuracy of the gradient estimate and the computational cost.
Similar ideas extend to other ML optimization problems. For large-scale problems, stochastic gradient descent often works well.

What if X̃^T X̃ is not invertible?

Why might this happen?
Answer 1: N < D. Intuitively, there is not enough data to estimate all the parameters.
Answer 2: The columns of X̃ are not linearly independent, e.g., some features are perfectly correlated. In this case, the solution is not unique.

Ridge regression

Intuition: what does a non-invertible X̃^T X̃ mean? Consider the SVD of this matrix:

X̃^T X̃ = U diag(λ_1, λ_2, ..., λ_r, 0, ..., 0) U^T

where λ_1 ≥ λ_2 ≥ ... ≥ λ_r > 0 and r < D.

Fix the problem by ensuring all singular values are non-zero:

X̃^T X̃ + λI = U diag(λ_1 + λ, λ_2 + λ, ..., λ_r + λ, λ, ..., λ) U^T

where λ > 0 and I is the identity matrix.
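
A small numerical illustration of this intuition, under made-up data: with a perfectly correlated column the Gram matrix X^T X is rank-deficient, but adding λI makes it well conditioned.

```python
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.normal(size=20)
X = np.column_stack([np.ones(20), x1, 2.0 * x1])   # third column duplicates the second

gram = X.T @ X
print(np.linalg.matrix_rank(gram))                 # 2 < 3: not invertible

lam = 0.1
print(np.linalg.cond(gram + lam * np.eye(3)))      # finite condition number: invertible
```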

Regularized least squares (ridge regression)

Solution:

w̃ = (X̃^T X̃ + λI)^{-1} X̃^T y

This is equivalent to adding an extra term to RSS(w̃) and minimizing

(1/2) { w̃^T X̃^T X̃ w̃ - 2 (X̃^T y)^T w̃ } + (1/2) λ ||w̃||_2^2

where the first term is RSS(w̃)/2 (up to a constant) and the second term is the regularization.

Benefits:
Numerically more stable, invertible matrix
Prevents overfitting (more on this later)
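
A minimal ridge-regression sketch mirroring the closed form above; the value of λ and the toy data (with a duplicated feature that makes plain least squares ill-posed) are illustrative assumptions.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam*I)^{-1} X^T y (sketch)."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# duplicated feature makes plain least squares ill-posed; ridge still works
rng = np.random.default_rng(4)
x1 = rng.normal(size=30)
X = np.column_stack([np.ones(30), x1, 2.0 * x1])
y = 5.0 * x1 + 1.0 + 0.1 * rng.normal(size=30)
print(ridge(X, y, lam=0.1))
```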

How to choose λ?

λ is referred to as a hyperparameter (in contrast, w is the parameter vector).
Use a validation set or cross-validation to find a good choice of λ.

Outline
1 Administration
2 Review of last lecture
3 Linear regression
4 Nonlinear basis functions

Is a linear modeling assumption always a good idea?

[Figure: example of nonlinear classification (two-class data in the (x_1, x_2) plane)]
[Figure: example of nonlinear regression (x vs. t)]

Nonlinear basis for classification

Transform the input/feature: φ(x) : x ∈ R^2 → z = x_1 x_2

[Figure: original data and transformed training data; after the transformation the data is linearly separable!]

Another example

How to transform the input/feature?

φ(x) : x ∈ R^2 → z = [x_1^2, x_1 x_2, x_2^2]^T ∈ R^3

[Figure: original 2D data and transformed training data in R^3; the transformed data is linearly separable.]
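
A tiny sketch of this quadratic feature map; the two concentric circles of points are an illustrative dataset of my own, chosen so the transformed classes can be separated by a plane.

```python
import numpy as np

def phi(x):
    """Quadratic feature map: (x1, x2) -> (x1^2, x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1**2, x1 * x2, x2**2])

# points on a small circle (class A) and a large circle (class B), illustrative
angles = np.linspace(0, 2 * np.pi, 8, endpoint=False)
inner = np.column_stack([np.cos(angles), np.sin(angles)])          # radius 1
outer = np.column_stack([3 * np.cos(angles), 3 * np.sin(angles)])  # radius 3

# After the map, z1 + z3 = x1^2 + x2^2 equals the squared radius, so the two
# classes are separated by a plane in the transformed space (e.g., z1 + z3 = 4).
print(np.array([phi(x) for x in inner])[:, [0, 2]].sum(axis=1))    # all 1.0
print(np.array([phi(x) for x in outer])[:, [0, 2]].sum(axis=1))    # all 9.0
```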

General nonlinear basis functions

We can use a nonlinear mapping φ(x) : x ∈ R^D → z ∈ R^M, where M is the dimensionality of the new feature/input z (or φ(x)). M could be greater than, less than, or equal to D.

With the new features, we can apply our learning techniques to minimize errors on the transformed training data:
linear methods: prediction is based on w^T φ(x)
other methods: nearest neighbors, decision trees, etc.

Regression with nonlinear basis

Residual sum of squares:

Σ_n [w^T φ(x_n) - y_n]^2

where w ∈ R^M, the same dimensionality as the transformed features φ(x).

The LMS solution can be formulated with the new design matrix:

Φ = [ φ(x_1)^T ; φ(x_2)^T ; ... ; φ(x_N)^T ] ∈ R^{N×M},   w_LMS = (Φ^T Φ)^{-1} Φ^T y
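
A minimal sketch of LMS with a polynomial basis; the degree M, the sine-shaped target, and the noise level are illustrative, echoing the sine-fitting example that follows.

```python
import numpy as np

def poly_features(x, M):
    """Map scalar inputs x to rows [1, x, x^2, ..., x^M] of the design matrix."""
    return np.vander(x, M + 1, increasing=True)   # shape (N, M+1)

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, size=10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=10)   # noisy sine samples

M = 3
Phi = poly_features(x, M)
w_lms = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)          # (Phi^T Phi)^{-1} Phi^T t
print(w_lms)                                             # coefficients w_0 ... w_M
```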

Example with regression

Polynomial basis functions:

φ(x) = [1, x, x^2, ..., x^M]^T,   f(x) = w_0 + Σ_{m=1}^M w_m x^m

Fitting samples from a sine function: underfitting, as f(x) is too simple.

[Figure: fits for M = 0 and M = 1 (x vs. t).]

Adding high-order terms

M = 3 fits well; M = 9 overfits.

[Figure: fits for M = 3 and M = 9 (x vs. t).]

More complex features lead to better results on the training data, but potentially worse results on new data, e.g., test data!

Overfitting

Parameters for higher-order polynomials are very large:

        M = 0    M = 1    M = 3     M = 9
w_0     0.19     0.82     0.31      0.35
w_1              -1.27    7.99      232.37
w_2                       -25.43    -5321.83
w_3                       17.37     48568.31
w_4                                 -231639.30
w_5                                 640042.26
w_6                                 -1061800.52
w_7                                 1042400.18
w_8                                 -557682.99
w_9                                 125201.43

Overfitting can be quite disastrous

Fitting the housing price data with large M: predicted price goes to zero (and is ultimately negative) if you buy a big enough house! This is called poor generalization/overfitting.

Detecting overfitting

Plot model complexity versus the objective function:
X axis: model complexity, e.g., M
Y axis: error, e.g., RSS, RMS (square root of RSS), 0-1 loss

[Figure: E_RMS on the training and test sets as a function of M.]

As a model increases in complexity:
Training error keeps improving
Test error may first improve but eventually will deteriorate
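
A minimal sketch that produces this kind of train/test curve for the polynomial-basis example; the split sizes, noise level, and the RMS definition sqrt(RSS/N) are illustrative choices, not prescribed by the lecture.

```python
import numpy as np

rng = np.random.default_rng(6)

def make_data(n):
    x = rng.uniform(0, 1, size=n)
    return x, np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)

def rms_error(w, x, t):
    Phi = np.vander(x, len(w), increasing=True)
    return np.sqrt(np.mean((Phi @ w - t) ** 2))          # sqrt(RSS / N)

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)

# As M grows, training error keeps shrinking while test error eventually blows up.
for M in range(10):
    Phi = np.vander(x_train, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, t_train, rcond=None)    # least-squares fit of degree M
    print(M, rms_error(w, x_train, t_train), rms_error(w, x_test, t_test))
```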