Machine Learning (CSE 446): Learning as Minimizing Loss; Least Squares

1 Machine Learning (CSE 446): Learning as Minimizing Loss; Least Squares
Sham M Kakade
© 2018 University of Washington
cse446-staff@cs.washington.edu

2 Review

3 Alternate View of PCA: Minimizing Reconstruction Error
Assume that the data are centered.
Find a line which minimizes the squared reconstruction error.

5 Alternate View: Minimizing Reconstruction Error with K-dim subspace
Equivalent ("dual") formulation of PCA: find an orthonormal basis $u_1, u_2, \ldots, u_K$ which minimizes the total reconstruction error on the data:
$$\operatorname*{argmin}_{\text{orthonormal basis: } u_1, u_2, \ldots, u_K} \; \frac{1}{N} \sum_{i} \left\| x_i - \mathrm{Proj}_{u_1, \ldots, u_K}(x_i) \right\|^2$$
Recall the projection of $x$ onto a $K$-orthonormal basis is:
$$\mathrm{Proj}_{u_1, \ldots, u_K}(x) = \sum_{j=1}^{K} (u_j \cdot x)\, u_j$$
The SVD simultaneously finds all $u_1, u_2, \ldots, u_K$.
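The projection formula and the reconstruction-error objective can be checked numerically. The following is a minimal sketch, assuming NumPy and synthetic data; the helper names `pca_basis`, `project`, and `reconstruction_error` are illustrative and not from the slides.

```python
import numpy as np

def pca_basis(X, K):
    """Top-K right singular vectors of X (rows of X are centered data points)."""
    # Rows of Vt are orthonormal and sorted by decreasing singular value.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:K]                          # shape (K, d)

def project(X, U):
    """Proj(x) = sum_j (u_j . x) u_j, applied to every row of X."""
    return (X @ U.T) @ U

def reconstruction_error(X, U):
    """(1/N) * sum_i ||x_i - Proj(x_i)||^2."""
    residual = X - project(X, U)
    return float(np.mean(np.sum(residual ** 2, axis=1)))

# Synthetic data (an assumption for the demo): 5 features with different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])
X = X - X.mean(axis=0)                     # the slides assume centered data

# One SVD gives the basis for every K; the error shrinks as K grows.
errors = [reconstruction_error(X, pca_basis(X, K)) for K in range(6)]
```

In this sketch `errors[0]` is the error of projecting onto nothing (K = 0) and `errors[5]` is numerically zero, since five orthonormal directions span all five dimensions.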

6 Projection and Reconstruction: the one dimensional case
Take out the mean $\mu$.
Find the top eigenvector $u$ of the covariance matrix.
What are your projections?
What are your reconstructions, $\hat{X} = [\hat{x}_1 \mid \hat{x}_2 \mid \cdots \mid \hat{x}_N]$?
What is your reconstruction error of doing nothing ($K = 0$) and using $K = 1$?
$$\frac{1}{N} \sum_{i} \| x_i - \mu \|^2 =$$
$$\frac{1}{N} \sum_{i} \| x_i - \hat{x}_i \|^2 =$$
Reduction in error by using a $k$-dim PCA projection:
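One way to explore these questions is to compute both errors on a small synthetic dataset. The sketch below assumes NumPy, a covariance normalized by 1/N, and made-up data; it relies on the standard fact that the K = 0 error equals the trace of the covariance and that using the top eigenvector removes exactly the top eigenvalue from it.

```python
import numpy as np

# Synthetic, uncentered data (an assumption for the demo).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) * np.array([4.0, 1.0, 0.5]) + np.array([10.0, -2.0, 3.0])

mu = X.mean(axis=0)
Xc = X - mu                                # take out the mean
cov = Xc.T @ Xc / len(X)                   # covariance matrix with 1/N normalization
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues come back in ascending order
u = eigvecs[:, -1]                         # top eigenvector

proj = Xc @ u                              # projections: one scalar per data point
X_hat = mu + np.outer(proj, u)             # reconstructions

err_k0 = np.mean(np.sum((X - mu) ** 2, axis=1))      # doing nothing (K = 0)
err_k1 = np.mean(np.sum((X - X_hat) ** 2, axis=1))   # using K = 1
reduction = err_k0 - err_k1                          # matches the top eigenvalue
```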

7 PCA vs. Clustering
Summarize your data with fewer points or fewer dimensions?

8 Loss functions

9 Today

10 Perceptron
Perceptron Algorithm: A model and an algorithm, rolled into one.
Isn't there a more principled methodology to derive algorithms?

11 What we ("naively") want:
Minimize training-set error rate:
$$\min_{w, b} \; \frac{1}{N} \sum_{n=1}^{N} \underbrace{\mathbb{1}\{ y_n (w \cdot x_n + b) \le 0 \}}_{\text{zero-one loss on a point } n}$$
This problem is NP-hard, even for a (multiplicative) approximation.
[Plot: loss versus margin $= y \cdot (w \cdot x + b)$]
Why is this loss function so unwieldy?
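The training-set error rate is easy to evaluate even though it is hard to minimize. The sketch below assumes NumPy, labels in {-1, +1}, and a tiny hand-made dataset; `zero_one_error` is an illustrative name.

```python
import numpy as np

def zero_one_error(w, b, X, y):
    """Fraction of points whose margin y_n (w . x_n + b) is <= 0; labels in {-1, +1}."""
    margins = y * (X @ w + b)
    return float(np.mean(margins <= 0))

# Tiny hand-made dataset (an assumption for the demo).
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, 0.5]])
y = np.array([1, 1, -1, 1])
w = np.array([1.0, 0.0])
b = 0.0

print(zero_one_error(w, b, X, y))          # 0.25 (one of four points has margin <= 0)

# Rescaling w changes every margin but not a single sign, so the loss is unchanged.
same_after_scaling = zero_one_error(10.0 * w, b, X, y) == zero_one_error(w, b, X, y)
```

The loss is piecewise constant in `w`: small changes to the weights usually leave it untouched, so it gives a gradient-based method nothing to follow. That is one reading of why it is unwieldy.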

12 Relax!
The mis-classification optimization problem:
$$\min_{w} \; \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}\{ y_n (w \cdot x_n) \le 0 \}$$
Instead, let's try to choose a "reasonable" loss function $\ell(y_n, w \cdot x_n)$ and then try to solve the relaxation:
$$\min_{w} \; \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, w \cdot x_n)$$

14 What is a good relaxation?
Want that minimizing our surrogate loss helps with minimizing the mis-classification loss.
idea: try to use a (sharp) upper bound of the zero-one loss by $\ell$:
$$\mathbb{1}\{ y (w \cdot x) \le 0 \} \le \ell(y, w \cdot x)$$
want our relaxed optimization problem to be easy to solve.
What properties might we want for $\ell(\cdot)$?
differentiable? sensitive to changes in $w$? convex?

15 The square loss! (and linear regression)
The square loss: $\ell(y, w \cdot x) = (y - w \cdot x)^2$.
The relaxed optimization problem:
$$\min_{w} \; \frac{1}{N} \sum_{n=1}^{N} (y_n - w \cdot x_n)^2$$
nice properties:
for binary classification (with labels $y \in \{-1, +1\}$), it is an upper bound on the zero-one loss.
It makes sense more generally, e.g. if we want to predict real valued $y$.
We have a convex optimization problem.
For classification, what is your decision rule using a $w$?

16 The square loss as an upper bound
We have:
$$\mathbb{1}\{ y (w \cdot x) \le 0 \} \le (y - w \cdot x)^2$$
Easy to see, by plotting.
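In place of the plot, the bound can be checked on a grid of scores. This is a minimal sketch assuming labels in {-1, +1}; the function names are illustrative, and `predict` is one plausible decision rule (threshold the score at zero), not something the slides state.

```python
def zero_one_loss(y, score):
    """1 if the margin y * score is <= 0 (a mistake), else 0. Assumes y in {-1, +1}."""
    return 1.0 if y * score <= 0 else 0.0

def square_loss(y, score):
    """(y - score)^2, where score stands for w . x."""
    return (y - score) ** 2

def predict(score):
    """Hypothetical decision rule for +/-1 labels: threshold the real-valued score at 0."""
    return 1 if score > 0 else -1

# Check the bound for both labels over scores in [-3, 3].
scores = [k / 100 for k in range(-300, 301)]
bound_holds = all(
    zero_one_loss(y, s) <= square_loss(y, s)
    for y in (-1, 1)
    for s in scores
)
```

The bound is tight at a score of 0, where both losses equal 1. Note also that `square_loss(1, 3.0)` is 4 even though the prediction is correct, so the square loss penalizes confidently correct scores too.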

18 Remember this problem?
Data derived from mpg; cylinders; displacement; horsepower; weight; acceleration; year; origin
Input: a row in this table.
Goal: predict whether mpg is < 23 ("bad" = 0) or above ("good" = 1) given the input row.
Predicting a real $y$ (often) makes more sense.

19 A better (convex) upper bound
The logistic loss: $\ell_{\text{logistic}}(y, w \cdot x) = \log\left(1 + \exp(-y\, w \cdot x)\right)$.
We have:
$$\mathbb{1}\{ y (w \cdot x) \le 0 \} \le \text{constant} \cdot \ell_{\text{logistic}}(y, w \cdot x)$$
Again, easy to see, by plotting.
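The slide leaves the constant unspecified. The sketch below assumes the value 1 / log 2, which is the smallest constant that works because the logistic loss equals log 2 at margin zero. Labels are assumed to be in {-1, +1}, and the names are illustrative.

```python
import math

def logistic_loss(y, score):
    """log(1 + exp(-y * score)), where score stands for w . x."""
    return math.log1p(math.exp(-y * score))

# Assumed constant: 1 / log(2), chosen so the bound is tight at margin 0.
C = 1.0 / math.log(2.0)

# Gap between the scaled logistic loss and the zero-one loss over margins in [-5, 5].
margins = [k / 10 for k in range(-50, 51)]
gap = [C * logistic_loss(1, m) - (1.0 if m <= 0 else 0.0) for m in margins]
smallest_gap = min(gap)                    # non-negative up to rounding error
```

Unlike the square loss, the logistic loss keeps decreasing toward zero as the margin grows, so confidently correct predictions are not penalized.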

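21 Least squares: let's minimize it!
The optimization problem:
$$\min_{w} \; \frac{1}{N} \sum_{n=1}^{N} (y_n - w \cdot x_n)^2 = \min_{w} \; \frac{1}{N} \| Y - Xw \|^2$$
where $Y$ is an $N$-vector and $X$ is our $N \times d$ data matrix. The factor $\frac{1}{N}$ does not change the minimizer.
How do we interpret $Xw$?
The solution is the least squares estimator:
$$w_{\text{least squares}} = (X^\top X)^{-1} X^\top Y$$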
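The estimator can be computed by solving the linear system $(X^\top X) w = X^\top Y$ rather than forming the inverse explicitly. This is a minimal sketch assuming NumPy and synthetic data; `w_true` is a made-up ground truth used only to generate the demo targets.

```python
import numpy as np

# Synthetic regression data (an assumption for the demo).
rng = np.random.default_rng(0)
N, d = 100, 3
X = rng.normal(size=(N, d))                # N x d data matrix, one row per example
w_true = np.array([1.5, -2.0, 0.5])        # hypothetical ground truth
Y = X @ w_true + 0.1 * rng.normal(size=N)  # N-vector of targets

# Normal equations: (X^T X) w = X^T Y, solved as a linear system.
w_ls = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against NumPy's least-squares routine.
w_ref, *_ = np.linalg.lstsq(X, Y, rcond=None)

predictions = X @ w_ls                     # Xw: the n-th entry is w . x_n
gradient = X.T @ (Y - predictions)         # vanishes at the minimizer
```

Here `X @ w_ls` stacks the N predictions into one vector, which is one way to read $Xw$. Solving the system is generally preferred over computing the inverse, for both speed and numerical stability.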

22 Matrix calculus proof: scratch space
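The slide leaves the proof blank. One standard way to fill it in, assuming $X^\top X$ is invertible, is to expand the objective, set its gradient to zero, and solve the resulting linear system:

```latex
\begin{align*}
L(w) &= \| Y - Xw \|^2
      = Y^\top Y - 2\, w^\top X^\top Y + w^\top X^\top X\, w \\
\nabla_w L(w) &= -2\, X^\top Y + 2\, X^\top X\, w \\
\nabla_w L(w) = 0
  &\;\Longrightarrow\; X^\top X\, w = X^\top Y \\
  &\;\Longrightarrow\; w_{\text{least squares}} = (X^\top X)^{-1} X^\top Y
\end{align*}
```

Because $L$ is convex in $w$, a point where the gradient vanishes is a global minimizer.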

24 Remember your linear system solving!

25 Lots of questions:
What could go wrong with least squares?
Suppose we are in "high dimensions": more dimensions than data points.
Inductive bias: we need a way to control the complexity of the model.
How do we minimize (sum) logistic loss?
Optimization: how do we do this all quickly?
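The high-dimensional failure mode can be seen on a toy problem. The sketch below assumes NumPy and random data with d > N: the matrix $X^\top X$ is then singular, infinitely many weight vectors fit the training set exactly, and even pure-noise targets are fit with zero training error.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 20                               # more dimensions than data points
X = rng.normal(size=(N, d))
Y = rng.normal(size=N)                     # pure-noise targets (an assumption for the demo)

# rank(X^T X) = rank(X) <= N < d, so the d x d matrix X^T X cannot be inverted.
rank = np.linalg.matrix_rank(X)

# lstsq returns the minimum-norm solution among the infinitely many minimizers.
w_min, *_ = np.linalg.lstsq(X, Y, rcond=None)
train_error = np.mean((Y - X @ w_min) ** 2)

# Adding any direction from the null space of X gives another exact fit.
_, _, Vt = np.linalg.svd(X)                # full SVD: Vt has shape (d, d)
null_dir = Vt[-1]                          # X @ null_dir is numerically zero
w_other = w_min + 10.0 * null_dir
```

Both `w_min` and `w_other` reproduce the noise targets exactly, so training error alone cannot choose between them. That is the sense in which some control on model complexity is needed.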
