Linear Regression. Introduction to Machine Learning. Matt Gormley, Lecture 4, September 19, 2016. Readings: Bishop 3.1; Murphy 7.


1 School of Computer Science. Introduction to Machine Learning. Linear Regression. Readings: Bishop 3.1; Murphy 7. Matt Gormley, Lecture 4, September 19, 2016.

2 Reminders: Homework 1 due 9/26/16; Project Proposal due 10/3/16 (start early!)

3 Outline
Linear Regression: simple example; model; learning (Gradient Descent, SGD, Closed Form)
Advanced Topics: geometric and probabilistic interpretation of LMS; L2 regularization; L1 regularization; features; locally-weighted linear regression; robust regression

4 Outline
Linear Regression: simple example; model; learning (aka. Least Squares): Gradient Descent, SGD (aka. Least Mean Squares (LMS)), Closed Form (aka. Normal Equations)
Advanced Topics: geometric and probabilistic interpretation of LMS; L2 regularization (aka. Ridge Regression); L1 regularization (aka. LASSO); features (aka. non-linear basis functions); locally-weighted linear regression; robust regression

5 Linear regression in 1D
Our goal is to estimate w from training data of (x_i, y_i) pairs, for the model y = wx + ε.
Optimization goal: minimize squared error (least squares): argmin_w Σ_i (y_i - w x_i)^2
Why least squares? It minimizes the squared distance between measurements and the predicted line (see HW); it has a nice probabilistic interpretation; the math is pretty.
Slide courtesy of William Cohen

6 Solving Linear Regression in 1D
To optimize in closed form, we just take the derivative w.r.t. w and set it to 0:
d/dw Σ_i (y_i - w x_i)^2 = -2 Σ_i x_i (y_i - w x_i)
Setting -2 Σ_i x_i (y_i - w x_i) = 0 gives Σ_i x_i y_i - w Σ_i x_i^2 = 0, so
w = (Σ_i x_i y_i) / (Σ_i x_i^2)
Slide courtesy of William Cohen
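To make the closed-form slope concrete, here is a minimal NumPy sketch; the synthetic data, seed, and noise level are illustrative choices, not from the slides:

```python
import numpy as np

# Minimal sketch (not from the slides): closed-form slope for 1D linear
# regression through the origin, w = (sum_i x_i*y_i) / (sum_i x_i^2).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=100)
y = 2.0 * x + rng.normal(0.0, 1.0, size=100)  # true w = 2, noise std = 1

w_hat = np.sum(x * y) / np.sum(x ** 2)
print(w_hat)  # should land close to 2
```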

7 Linear regression in 1D
Given an input x we would like to compute an output y. In linear regression we assume that y (what we are trying to predict) and x (the observed values) are related by the equation y = wx + ε, where w is a parameter and ε represents measurement error or other noise.
Slide courtesy of William Cohen

8 Regression example Generated: w=2 Recovered: w=2.03 Noise: std=1 Slide courtesy of William Cohen

9 Regression example Generated: w=2 Recovered: w=2.05 Noise: std=2 Slide courtesy of William Cohen

10 Regression example Generated: w=2 Recovered: w=2.08 Noise: std=4 Slide courtesy of William Cohen

11 Bias term
So far we assumed that the line passes through the origin. What if the line does not? No problem, simply change the model to y = w_0 + w_1 x + ε.
We can use least squares to determine w_0, w_1:
w_0 = (1/n) Σ_i (y_i - w_1 x_i),   w_1 = Σ_i x_i (y_i - w_0) / Σ_i x_i^2
Slide courtesy of William Cohen

12 Linear Regression
Data: Inputs are continuous vectors of length K. Outputs are continuous scalars. D = {(x^(i), y^(i))}_{i=1}^N, where x ∈ R^K and y ∈ R.
What are some example problems of this form?

13 Linear Regression
Data: Inputs are continuous vectors of length K. Outputs are continuous scalars. D = {(x^(i), y^(i))}_{i=1}^N, where x ∈ R^K and y ∈ R.
Prediction: Output is a linear function of the inputs:
ŷ = h_θ(x) = θ_1 x_1 + θ_2 x_2 + ... + θ_K x_K, i.e. ŷ = h_θ(x) = θ^T x (we assume x_1 is 1).
Learning: find the parameters that minimize some objective function: θ* = argmin_θ J(θ).

14 Least Squares
Learning: find the parameters that minimize some objective function: θ* = argmin_θ J(θ).
We minimize the sum of the squares: J(θ) = (1/2) Σ_{i=1}^N (θ^T x^(i) - y^(i))^2
Why? 1. It reduces the distance between the true measurements and the predicted hyperplane (a line in 1D). 2. It has a nice probabilistic interpretation.

15 Least Squares
Learning: find the parameters that minimize some objective function: θ* = argmin_θ J(θ).
We minimize the sum of the squares: J(θ) = (1/2) Σ_{i=1}^N (θ^T x^(i) - y^(i))^2
Why? 1. It reduces the distance between the true measurements and the predicted hyperplane (a line in 1D). 2. It has a nice probabilistic interpretation.
(Aside: this is a very general optimization setup. We could solve it in lots of ways. Today, we'll consider three ways.)

16 Least Squares. Learning: three approaches to solving θ* = argmin_θ J(θ). Approach 1: Gradient Descent (take larger, more certain steps opposite the gradient). Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient). Approach 3: Closed Form (set derivatives equal to zero and solve for parameters).

17 Learning as optimization
The main idea today. Make two assumptions: the classifier is linear, like naïve Bayes, i.e. roughly of the form f_w(x) = sign(x · w); and we want to find the classifier w that fits the data best.
Formalize this as an optimization problem: pick a loss function J(w, D) and find argmin_w J(w, D). OR: pick a "goodness" function, often Pr(D|w), and find argmax_w of that function.
Slide courtesy of William Cohen

18 Learning as optimization: warmup
Find: the optimal (MLE) parameter θ of a binomial.
Dataset: D = {x_1, ..., x_n}, each x_i is 0 or 1, and k of them are 1.
P(D|θ) = const · Π_i θ^{x_i} (1-θ)^{1-x_i} = θ^k (1-θ)^{n-k}
d/dθ P(D|θ) = d/dθ [θ^k (1-θ)^{n-k}] = (d/dθ θ^k)(1-θ)^{n-k} + θ^k d/dθ (1-θ)^{n-k}
  = k θ^{k-1} (1-θ)^{n-k} + θ^k (n-k)(1-θ)^{n-k-1} (-1)
  = θ^{k-1} (1-θ)^{n-k-1} [k(1-θ) - θ(n-k)]
Slide courtesy of William Cohen

19 Learning as optimization: warmup
Goal: find the best parameter θ of a binomial.
Dataset: D = {x_1, ..., x_n}, each x_i is 0 or 1, and k of them are 1.
Setting the derivative to zero gives θ = 0, θ = 1, or k(1-θ) - θ(n-k) = 0, i.e. k - kθ - nθ + kθ = 0, so nθ = k and θ = k/n.
Slide courtesy of William Cohen
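As a quick numerical sanity check of this warm-up (the dataset and seed below are made up for illustration), the closed-form answer k/n matches a brute-force search over the log-likelihood:

```python
import numpy as np

# Sanity check: the MLE of a binomial/Bernoulli parameter is theta = k/n.
# Compare the closed-form answer to a brute-force search over theta.
rng = np.random.default_rng(1)
D = rng.binomial(1, 0.3, size=50)      # 50 coin flips with true theta = 0.3
k, n = int(D.sum()), len(D)

thetas = np.linspace(1e-3, 1.0 - 1e-3, 999)
log_lik = k * np.log(thetas) + (n - k) * np.log(1.0 - thetas)
print(k / n, thetas[np.argmax(log_lik)])  # the two estimates should agree
```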

20 Learning as optimization with gradient ascent
Goal: learn the parameter w of your model. Dataset: D = {(x_1, y_1), ..., (x_n, y_n)}. Use your model to define Pr(D|w) = ... Set w to maximize the likelihood.
Usually we use numeric methods to find the optimum, i.e., gradient ascent: repeatedly take a small step in the direction of the gradient. The difference between ascent and descent is only the sign; I'm going to be sloppy here about that.
Slide courtesy of William Cohen

21 Gradient ascent. To find argmax_x f(x) (or argmin, flipping the sign): start with x_0; for t = 1, ...: x_{t+1} = x_t + λ ∇f(x_t), where λ is small. Slide courtesy of William Cohen

22 Gradient descent. (For a likelihood we do ascent; for a loss we do descent.) Slide courtesy of William Cohen

23 Pros and cons of gradient descent Simple and often quite effective on ML tasks Often very scalable Only applies to smooth functions (differentiable) Might find a local minimum, rather than a global one Slide courtesy of William Cohen 23

24 Pros and cons of gradient descent. There is only one local optimum if the function is convex. Slide courtesy of William Cohen

25 Gradient Descent
Algorithm 1: Gradient Descent
  1: procedure GD(D, θ^(0))
  2:   θ ← θ^(0)
  3:   while not converged do
  4:     θ ← θ - γ ∇_θ J(θ)
  5:   return θ
In order to apply GD to Linear Regression all we need is the gradient of the objective function (i.e. the vector of partial derivatives): ∇_θ J(θ) = [dJ(θ)/dθ_1, dJ(θ)/dθ_2, ..., dJ(θ)/dθ_K]^T

26 Gradient Descent
Algorithm 1: Gradient Descent (as on the previous slide).
There are many possible ways to detect convergence. For example, we could check whether the L2 norm of the gradient is below some small tolerance: ||∇_θ J(θ)||_2 ≤ ε. Alternatively we could check that the reduction in the objective function from one iteration to the next is small.
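To make Algorithm 1 concrete for linear regression, here is a NumPy sketch; the step size, tolerance, and iteration cap are illustrative choices, and the batch gradient X^T(Xθ - y) is the sum-over-examples formula derived later in the lecture:

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, tol=1e-6, max_iter=10000):
    """Batch gradient descent for least-squares linear regression (a sketch).

    Assumes X already contains a leading column of ones for the bias term.
    `lr`, `tol`, and `max_iter` are illustrative, not values from the slides.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        grad = X.T @ (X @ theta - y)    # gradient of J(theta) = 1/2 * sum of squared errors
        if np.linalg.norm(grad) < tol:  # convergence test: L2 norm of the gradient is small
            break
        theta = theta - lr * grad       # step opposite the gradient
    return theta
```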

27 Stochastic Gradient Descent (SGD)
Algorithm 2: Stochastic Gradient Descent (SGD)
  1: procedure SGD(D, θ^(0))
  2:   θ ← θ^(0)
  3:   while not converged do
  4:     for i ∈ shuffle({1, 2, ..., N}) do
  5:       θ ← θ - γ ∇_θ J^(i)(θ)
  6:   return θ
Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm. We need a per-example objective: let J(θ) = Σ_{i=1}^N J^(i)(θ), where J^(i)(θ) = (1/2)(θ^T x^(i) - y^(i))^2.

28 Stochastic Gradient Descent (SGD)
Algorithm 2: Stochastic Gradient Descent (SGD), written per coordinate
  1: procedure SGD(D, θ^(0))
  2:   θ ← θ^(0)
  3:   while not converged do
  4:     for i ∈ shuffle({1, 2, ..., N}) do
  5:       for k ∈ {1, 2, ..., K} do
  6:         θ_k ← θ_k - γ (d/dθ_k) J^(i)(θ)
  7:   return θ
Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm. We need a per-example objective: let J(θ) = Σ_{i=1}^N J^(i)(θ), where J^(i)(θ) = (1/2)(θ^T x^(i) - y^(i))^2.

29 Stochastic Gradient Descent (SGD)
(Same Algorithm 2 as on the previous slide.)
Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm. We need a per-example objective: let J(θ) = Σ_{i=1}^N J^(i)(θ), where J^(i)(θ) = (1/2)(θ^T x^(i) - y^(i))^2. Let's start by calculating this partial derivative for the Linear Regression objective function.

30 Partial Derivatives for Linear Reg.
Let J(θ) = Σ_{i=1}^N J^(i)(θ), where J^(i)(θ) = (1/2)(θ^T x^(i) - y^(i))^2.
d/dθ_k J^(i)(θ) = d/dθ_k [(1/2)(θ^T x^(i) - y^(i))^2]
  = (1/2) d/dθ_k (θ^T x^(i) - y^(i))^2
  = (θ^T x^(i) - y^(i)) d/dθ_k (θ^T x^(i) - y^(i))
  = (θ^T x^(i) - y^(i)) d/dθ_k (Σ_{k'=1}^K θ_{k'} x_{k'}^(i) - y^(i))
  = (θ^T x^(i) - y^(i)) x_k^(i)

31 Partial Derivatives for Linear Reg. (The same derivation, repeated.) With J^(i)(θ) = (1/2)(θ^T x^(i) - y^(i))^2, we get d/dθ_k J^(i)(θ) = (θ^T x^(i) - y^(i)) x_k^(i).

32 Partial Derivatives for Linear Reg.
Let J(θ) = Σ_{i=1}^N J^(i)(θ), where J^(i)(θ) = (1/2)(θ^T x^(i) - y^(i))^2.
d/dθ_k J^(i)(θ) = (θ^T x^(i) - y^(i)) x_k^(i)   [used by SGD (aka. LMS)]
d/dθ_k J(θ) = Σ_{i=1}^N d/dθ_k J^(i)(θ) = Σ_{i=1}^N (θ^T x^(i) - y^(i)) x_k^(i)   [used by Gradient Descent]

33 Least Mean Squares (LMS)
Algorithm 3: Least Mean Squares (LMS)
  1: procedure LMS(D, θ^(0))
  2:   θ ← θ^(0)
  3:   while not converged do
  4:     for i ∈ shuffle({1, 2, ..., N}) do
  5:       for k ∈ {1, 2, ..., K} do
  6:         θ_k ← θ_k - γ (θ^T x^(i) - y^(i)) x_k^(i)
  7:   return θ
Applied to Linear Regression, SGD is called the Least Mean Squares (LMS) algorithm.
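A sketch of Algorithm 3 in NumPy is below; it updates the whole parameter vector at once rather than coordinate by coordinate, and the learning rate, epoch count, and shuffling seed are illustrative choices:

```python
import numpy as np

def lms(X, y, lr=0.01, n_epochs=50, seed=0):
    """Least Mean Squares: SGD on the per-example squared error (a sketch).

    Follows Algorithm 3, but updates all K coordinates in one vectorized
    step; learning rate, epoch count, and seed are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    N, K = X.shape
    theta = np.zeros(K)
    for _ in range(n_epochs):
        for i in rng.permutation(N):         # i in shuffle({1, ..., N})
            err = X[i] @ theta - y[i]        # (theta^T x^(i) - y^(i))
            theta = theta - lr * err * X[i]  # update every coordinate k at once
    return theta
```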

34 Least Squares. Learning: three approaches to solving θ* = argmin_θ J(θ). Approach 1: Gradient Descent (take larger, more certain steps opposite the gradient). Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient). Approach 3: Closed Form (set derivatives equal to zero and solve for parameters).

35 Background: Matrix Derivatives
For f : R^{m×n} → R, define the matrix derivative ∇_A f(A) as the m×n matrix whose (i,j) entry is ∂f/∂A_{ij}.
Trace: tr A = Σ_{i=1}^m A_{ii} for square A; tr a = a for a scalar a; tr ABC = tr CAB = tr BCA.
Some facts about matrix derivatives (without proof):
∇_A tr AB = B^T
∇_A tr ABA^T C = CAB + C^T AB^T
∇_A |A| = |A| (A^{-1})^T
Eric Xing, CMU

36 The normal equations
Write the cost function in matrix form:
J(θ) = (1/2) Σ_{i=1}^n (x_i^T θ - y_i)^2 = (1/2)(Xθ - y)^T (Xθ - y) = (1/2)(θ^T X^T Xθ - θ^T X^T y - y^T Xθ + y^T y)
where X is the matrix whose rows are the training inputs x_1^T, ..., x_n^T and y = [y_1, ..., y_n]^T.
To minimize J(θ), take the derivative and set it to zero:
∇_θ J = (1/2) ∇_θ tr(θ^T X^T Xθ - θ^T X^T y - y^T Xθ + y^T y) = (1/2)(X^T Xθ + X^T Xθ - 2 X^T y) = X^T Xθ - X^T y = 0,
which gives the normal equations X^T Xθ = X^T y, and hence θ* = (X^T X)^{-1} X^T y.
Eric Xing, CMU
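A minimal sketch of the closed-form solution follows; using a linear solve rather than an explicit matrix inverse is a standard numerical choice on my part, not something prescribed by the slide:

```python
import numpy as np

def normal_equations(X, y):
    """Closed-form least squares, theta* = (X^T X)^{-1} X^T y (a sketch).

    np.linalg.solve is used instead of an explicit inverse for numerical
    stability; np.linalg.lstsq would additionally handle a rank-deficient X.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)
```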

37 Comments on the normal equations
- In most situations of practical interest, the number of data points N is larger than the dimensionality k of the input space and the matrix X is of full column rank. If this condition holds, then it is easy to verify that X^T X is necessarily invertible.
- The assumption that X^T X is invertible implies that it is positive definite, thus the critical point we have found is a minimum.
- What if X has less than full column rank? → regularization (later).
Eric Xing, CMU

38 Direct and Iterative methods
- Direct methods: we can achieve the solution in a single step by solving the normal equations. Using Gaussian elimination or QR decomposition, we converge in a finite number of steps. This can be infeasible when data are streaming in in real time, or are of very large amount.
- Iterative methods: stochastic or steepest gradient descent. These converge only in a limiting sense, but are more attractive in large practical problems. Caution is needed when deciding the learning rate α.
Eric Xing, CMU

39 Convergence rate
- Theorem: the steepest descent algorithm converges to the minimum of the cost characterized by the normal equations, provided the step size α is sufficiently small.
- A formal analysis of LMS needs more math; in practice, one can use a small α, or gradually decrease α.
Eric Xing, CMU

40 Convergence Curves
- For the batch method, the training MSE is initially large due to uninformed initialization.
- In the online update, the N updates per epoch reduce the MSE to a much smaller value.
Eric Xing, CMU

41 Least Squares. Learning: three approaches to solving θ* = argmin_θ J(θ).
Approach 1: Gradient Descent (take larger, more certain steps opposite the gradient). Pros: conceptually simple, guaranteed convergence. Cons: batch, often slow to converge.
Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient). Pros: memory efficient, fast convergence, less prone to local optima. Cons: convergence in practice requires tuning and fancier variants.
Approach 3: Closed Form (set derivatives equal to zero and solve for parameters). Pros: one-shot algorithm! Cons: does not scale to large datasets (the matrix inverse is the bottleneck).

42 Geometric Interpretation of LMS
The predictions on the training data are: ŷ = Xθ* = X(X^T X)^{-1} X^T y.
Note that ŷ - y = (X(X^T X)^{-1} X^T - I) y, and
X^T (ŷ - y) = X^T (X(X^T X)^{-1} X^T - I) y = (X^T X(X^T X)^{-1} X^T - X^T) y = 0.
Thus ŷ is the orthogonal projection of y onto the space spanned by the columns of X (here y = [y_1, ..., y_n]^T and X has rows x_1^T, ..., x_n^T).
Eric Xing, CMU
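The orthogonality claim is easy to verify numerically; the synthetic data and seed below are purely illustrative:

```python
import numpy as np

# Numerical check of the geometric claim: the least-squares residual
# y - y_hat is orthogonal to every column of X.
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(0.0, 0.5, size=30)

theta_star = np.linalg.solve(X.T @ X, X.T @ y)
residual = y - X @ theta_star
print(X.T @ residual)  # approximately the zero vector
```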

43 Probabilistic Interpretation of LMS
Let us assume that the target variable and the inputs are related by the equation y_i = θ^T x_i + ε_i, where ε is an error term of unmodeled effects or random noise.
Now assume that ε follows a Gaussian N(0, σ). Then we have:
p(y_i | x_i; θ) = (1/(√(2π) σ)) exp(-(y_i - θ^T x_i)^2 / (2σ^2))
By the independence assumption:
L(θ) = Π_{i=1}^n p(y_i | x_i; θ) = (1/(√(2π) σ))^n exp(-Σ_{i=1}^n (y_i - θ^T x_i)^2 / (2σ^2))
Eric Xing, CMU

44 Probabilistic Interpretation of LMS
Hence the log-likelihood is:
ℓ(θ) = n log(1/(√(2π) σ)) - (1/(2σ^2)) Σ_{i=1}^n (y_i - θ^T x_i)^2
Do you recognize the last term? Yes, it is J(θ) = (1/2) Σ_{i=1}^n (x_i^T θ - y_i)^2 (up to the 1/σ^2 factor).
Thus under the independence assumption, LMS is equivalent to MLE of θ!
Eric Xing, CMU

45 Non-Linear basis functions
So far we only used the observed values x_1, x_2, ... However, linear regression can be applied in the same way to functions of these values. E.g., to add a term w·x_1 x_2, add a new variable z = x_1 x_2 so that each example becomes: x_1, x_2, ..., z.
As long as these functions can be directly computed from the observed values, the model is still linear in the parameters and the problem remains a multivariate linear regression problem:
y = w_0 + w_1 x_1 + ... + w_k x_k + ε
Slide courtesy of William Cohen

46 Non-linear basis functions
What type of functions can we use? A few common examples:
- Polynomial: φ_j(x) = x^j for j = 0, ..., n
- Gaussian: φ_j(x) = exp(-(x - µ_j)^2 / (2σ_j^2))
- Sigmoid: φ_j(x) = 1 / (1 + exp(-s_j x))
- Logs: φ_j(x) = log(x + 1)
Any function of the input values can be used. The solution for the parameters of the regression remains the same.
Slide courtesy of William Cohen

47 General linear regression problem
Using our new notation for the basis functions, linear regression can be written as
y = Σ_{j=0}^n w_j φ_j(x)
where φ_j(x) can be either x_j for multivariate regression or one of the non-linear basis functions we defined, and φ_0(x) = 1 for the intercept term.
Slide courtesy of William Cohen
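As one concrete illustration of a basis expansion (the helper name and the use of np.vander are my own choices, not from the slides), polynomial features can be built and then fed to the same least-squares solver as before:

```python
import numpy as np

def polynomial_features(x, degree):
    """Map scalar inputs x to the basis [1, x, x^2, ..., x^degree] (a sketch).

    After this transform the model y = sum_j w_j * phi_j(x) is still linear
    in the weights w, so the same least-squares machinery applies unchanged.
    """
    x = np.asarray(x, dtype=float).reshape(-1)
    return np.vander(x, N=degree + 1, increasing=True)

# Illustrative use: Phi = polynomial_features(x, 3)
#                   w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
```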

48 An example: polynomial basis vectors on a small dataset From Bishop Ch 1 Slide courtesy of William Cohen

49 0th Order Polynomial (n=10). Slide courtesy of William Cohen

50 1st Order Polynomial. Slide courtesy of William Cohen

51 3rd Order Polynomial. Slide courtesy of William Cohen

52 9th Order Polynomial. Slide courtesy of William Cohen

53 Over-fitting. Root-Mean-Square (RMS) Error: E_RMS = √(2 E(w*) / N). Slide courtesy of William Cohen

54 Polynomial Coefficients. Slide courtesy of William Cohen

55 Regularization
Penalize large coefficient values:
J_{X,y}(w) = (1/2) Σ_i (y_i - Σ_j w_j φ_j(x_i))^2 + (λ/2) ||w||^2
Slide courtesy of William Cohen

56 Regularization. (Figure: the 9th-order polynomial fit with a regularization penalty added; from Bishop Ch. 1.) Slide courtesy of William Cohen

57 Polynomial Coefficients. (Table from Bishop Ch. 1: coefficient values for no regularization, λ = exp(-18), and huge λ; the coefficients shrink as λ grows.) Slide courtesy of William Cohen

58 Over-Regularization. Slide courtesy of William Cohen

59 Example: Stock Prices
Suppose we wish to predict Google's stock price at time t+1. What features should we use? (Putting all computational concerns aside.)
- Stock prices of all other stocks at times t, t-1, t-2, ..., t-k
- Mentions of Google with positive / negative sentiment words in all newspapers and social media outlets
Do we believe that all of these features are going to be useful?

60 Ridge Regression
Adds an L2 regularizer to Linear Regression:
J_RR(θ) = J(θ) + λ ||θ||_2^2 = (1/2) Σ_{i=1}^N (θ^T x^(i) - y^(i))^2 + λ Σ_{k=1}^K θ_k^2
Bayesian interpretation: MAP estimation with a Gaussian prior on the parameters,
θ_MAP = argmax_θ Σ_{i=1}^N log p_θ(y^(i) | x^(i)) + log p(θ) = argmin_θ J_RR(θ), where p(θ) ∼ N(0, (1/λ) I).
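A minimal sketch of the closed-form ridge solution; note the convention caveat in the docstring, since the slide writes the penalty as λ||θ||² while the formula below corresponds to (λ/2)||θ||²:

```python
import numpy as np

def ridge_regression(X, y, lam):
    """Closed-form ridge regression, theta = (X^T X + lam*I)^{-1} X^T y (a sketch).

    This minimizes (1/2)||X theta - y||^2 + (lam/2)*||theta||_2^2; to match the
    slide's convention J(theta) + lam*||theta||_2^2 exactly, pass 2*lam.
    Whether to exempt the bias coordinate from the penalty is a modeling
    choice not addressed here.
    """
    K = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y)
```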

61 LASSO
Adds an L1 regularizer to Linear Regression:
J_LASSO(θ) = J(θ) + λ ||θ||_1 = (1/2) Σ_{i=1}^N (θ^T x^(i) - y^(i))^2 + λ Σ_{k=1}^K |θ_k|
Bayesian interpretation: MAP estimation with a Laplace prior on the parameters,
θ_MAP = argmax_θ Σ_{i=1}^N log p_θ(y^(i) | x^(i)) + log p(θ) = argmin_θ J_LASSO(θ), where p(θ) ∼ Laplace(0, f(λ)).

62 Ridge Regression vs Lasso
(Figure: level sets of constant J(β) intersecting the constraint regions in the (β_1, β_2) plane: βs with constant L2 norm for Ridge Regression, βs with constant L1 norm for Lasso.)
Lasso (L1 penalty) results in sparse solutions: vectors with more zero coordinates. Good for high-dimensional problems, since we don't have to store all coordinates!
Eric Xing, CMU

63 Optimization for LASSO
Can we apply SGD to the LASSO learning problem?
θ* = argmin_θ J_LASSO(θ), where J_LASSO(θ) = J(θ) + λ ||θ||_1 = (1/2) Σ_{i=1}^N (θ^T x^(i) - y^(i))^2 + λ Σ_{k=1}^K |θ_k|

64 Optimization for LASSO
Consider the absolute value function: r(θ) = Σ_{k=1}^K |θ_k|.
The L1 penalty is subdifferentiable (i.e. not differentiable at 0).
What does a small step opposite the gradient accomplish when far from zero? When close to zero?

65 Optimization for LASSO
The L1 penalty is subdifferentiable (i.e. not differentiable at 0). An array of optimization algorithms exist to handle this issue:
- Coordinate Descent
- Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) (Andrew & Gao, 2007) and provably convergent variants
- Block Coordinate Descent (Tseng & Yun, 2009)
- Sparse Reconstruction by Separable Approximation (SpaRSA) (Wright et al., 2009)
- Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) (Beck & Teboulle, 2009)
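The slide only names these algorithms; as one concrete illustration (not one of the cited methods), here is a sketch of proximal gradient descent (ISTA), which takes a gradient step on the smooth squared-error term and handles the non-differentiable L1 term with elementwise soft-thresholding:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t*||.||_1: elementwise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=1000):
    """Proximal gradient descent (ISTA) for
    J_LASSO(theta) = 1/2 ||X theta - y||^2 + lam * ||theta||_1  (a sketch).

    The step size 1/L, with L the largest eigenvalue of X^T X, is a standard
    but illustrative choice, as is the fixed iteration count.
    """
    L = np.linalg.eigvalsh(X.T @ X).max()
    step = 1.0 / L
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y)                         # gradient of the smooth part
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta
```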

66 Data Set Size: 9th Order Polynomial. Slide courtesy of William Cohen

67 Locally weighted linear regression
The algorithm: instead of minimizing J(θ) = (1/2) Σ_{i=1}^n (x_i^T θ - y_i)^2, we now fit θ to minimize J(θ) = (1/2) Σ_{i=1}^n w_i (x_i^T θ - y_i)^2.
Where do the w_i's come from? w_i = exp(-(x_i - x)^2 / (2τ^2)), where x is the query point for which we'd like to know its corresponding y.
Essentially we put higher weights on (errors on) training examples that are close to the query point than on those that are further away from the query.
Eric Xing, CMU
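A sketch of the prediction procedure at a single query point, with inputs stored row-wise in X; the Gaussian kernel and bandwidth τ follow the weights above, while the function name and default values are illustrative:

```python
import numpy as np

def locally_weighted_prediction(X, y, x_query, tau=1.0):
    """Locally weighted linear regression at a single query point (a sketch).

    Weights w_i = exp(-||x_i - x_query||^2 / (2 tau^2)) emphasize training
    points near the query; theta solves the weighted normal equations
    X^T W X theta = X^T W y and must be refit for every new query point.
    """
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2.0 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return float(x_query @ theta)
```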

68 Parametric vs. non-parametric
- Locally weighted linear regression is the second example we have run into of a non-parametric algorithm (what was the first?).
- The (unweighted) linear regression algorithm that we saw earlier is known as a parametric learning algorithm, because it has a fixed, finite number of parameters (the θ), which are fit to the data; once we've fit the θ and stored them away, we no longer need to keep the training data around to make future predictions.
- In contrast, to make predictions using locally weighted linear regression, we need to keep the entire training set around. The term "non-parametric" (roughly) refers to the fact that the amount of stuff we need to keep in order to represent the hypothesis grows linearly with the size of the training set.
Eric Xing, CMU

69 Robust Regression
(Figure:) The best fit from a quadratic regression, but this other fit is probably better. How can we do this?
Eric Xing, CMU

70 LOESS-based Robust Regression
Remember what we do in "locally weighted linear regression"? → we "score" each point for its importance.
Now we score each point according to its "fitness".
(Courtesy of Andrew Moore)
Eric Xing, CMU

71 Robust regression
For k = 1 to R:
- Let (x_k, y_k) be the kth datapoint.
- Let y_k^est be the predicted value of y_k.
- Let w_k be a weight for data point k that is large if the data point fits well and small if it fits badly: w_k = φ((y_k - y_k^est)^2).
Then redo the regression using the weighted data points. Repeat the whole thing until converged!
Eric Xing, CMU
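A sketch of this iterative reweighting loop; the Gaussian choice of the fitness function φ, the number of rounds, and the scale are assumptions for illustration, since the slide leaves φ abstract:

```python
import numpy as np

def robust_regression(X, y, n_rounds=10, scale=1.0):
    """LOESS-style robust regression by iterative reweighting (a sketch).

    Each round fits weighted least squares, then down-weights points with
    large residuals. The Gaussian "fitness" weight below is one plausible
    choice for the phi in the slide; `n_rounds` and `scale` are illustrative.
    """
    w = np.ones(len(y))
    theta = np.zeros(X.shape[1])
    for _ in range(n_rounds):
        W = np.diag(w)
        theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # weighted least squares
        residuals = y - X @ theta
        w = np.exp(-residuals ** 2 / (2.0 * scale ** 2))   # poor fit -> small weight
    return theta
```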

72 Robust regression: probabilistic interpretation
What regular regression does: assume y_k was originally generated using the following recipe:
y_k = θ^T x_k + N(0, σ^2)
The computational task is to find the Maximum Likelihood estimate of θ.
Eric Xing, CMU

73 Robust regression: probabilistic interpretation
What LOESS robust regression does: assume y_k was originally generated using the following recipe: with probability p, y_k = θ^T x_k + N(0, σ^2), but otherwise y_k ∼ N(µ, σ_huge^2).
The computational task is to find the Maximum Likelihood estimates of θ, p, µ and σ_huge.
The algorithm you saw with iterative reweighting/refitting does this computation for us. Later you will find that it is an instance of the famous E.M. algorithm.
Eric Xing, CMU

74 Summary
- Linear regression predicts its output as a linear function of its inputs.
- Learning optimizes an objective function (equivalently, a likelihood or the mean squared error) using standard techniques (gradient descent, SGD, closed form).
- Regularization enables shrinkage and model selection.
Eric Xing, CMU
