Introduction to Machine Learning

1 Introduction to Machine Learning
Linear Regression
Varun Chandola
Computer Science & Engineering, State University of New York at Buffalo, Buffalo, NY, USA
CSE 474/574

2 Outline
  Linear Regression
  Problem Formulation
  Geometric Interpretation
  Learning Parameters
  Recap
  Issues with Linear Regression
  Bayesian Linear Regression
  Bayesian Regression
  Estimating Bayesian Regression Parameters
  Prediction with Bayesian Regression
  Handling Non-linear Relationships
  Handling Overfitting via Regularization
  Elastic Net Regularization
  Handling Outliers in Regression

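3 Taking the next step
Hypothesis Space, $\mathcal{H}$: conjunctive; disjunctive; disjunctions of $k$ attributes; linear hyperplanes; non-linear network
$c \in \mathcal{H}$ / $c \notin \mathcal{H}$
Input Space, $\mathbf{x}$: $\mathbf{x} \in \{0, 1\}^d$; $\mathbf{x} \in \mathbb{R}^d$
Output Space, $y$: $y \in \{0, 1\}$; $y \in \{-1, +1\}$; $y \in \mathbb{R}$
Chandola@UB CSE 474/574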

4 Linear Regression
There is one scalar target variable $y$ (instead of hidden).
There is one vector input variable $\mathbf{x}$.
Inductive bias: $y = \mathbf{w}^\top \mathbf{x}$
Linear Regression Learning Task: learn $\mathbf{w}$ given training examples $\langle X, \mathbf{y} \rangle$.

5 Two Interpretations
1. Probabilistic Interpretation
$y$ is assumed to be normally distributed: $y \sim \mathcal{N}(\mathbf{w}^\top \mathbf{x}, \sigma^2)$
or, equivalently: $y = \mathbf{w}^\top \mathbf{x} + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$
$y$ is a linear combination of the input variables.
Given $\mathbf{w}$ and $\sigma^2$, one can find the probability distribution of $y$ for a given $\mathbf{x}$.
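The probabilistic view can be tried out numerically. The sketch below evaluates the Gaussian density of $y$ for a given $\mathbf{x}$ and draws samples from $y = \mathbf{w}^\top \mathbf{x} + \epsilon$. The values of `w`, `sigma2`, and `x`, and the helper name `gaussian_pdf`, are made up for illustration and do not come from the slides.

```python
import numpy as np

def gaussian_pdf(y, mean, var):
    """Density of N(mean, var) evaluated at y."""
    return np.exp(-(y - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# Hypothetical parameters, chosen only for illustration.
w = np.array([2.0, -1.0])
sigma2 = 0.25
x = np.array([1.0, 3.0])

mean_y = w @ x                                   # w^T x, the mean of y given x
density_at_mean = gaussian_pdf(mean_y, mean_y, sigma2)

# Draw samples y = w^T x + eps with eps ~ N(0, sigma2).
rng = np.random.default_rng(0)
samples = mean_y + rng.normal(0.0, np.sqrt(sigma2), size=10_000)
```

The sample mean and variance of `samples` should land close to `mean_y` and `sigma2`, which is one way to check that the two formulations on the slide describe the same distribution.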

6 Two Interpretations
2. Geometric Interpretation
Fitting a straight line to $d$ dimensional data
$y = \mathbf{w}^\top \mathbf{x} = w_1 x_1 + w_2 x_2 + \dots + w_d x_d$
  Will pass through origin
Add intercept: $y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_d x_d$
  Equivalent to adding another column of 1s in $X$.
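The "column of 1s" remark can be checked directly. This is a minimal sketch; the helper name `add_intercept_column` and the numbers are illustrative assumptions.

```python
import numpy as np

def add_intercept_column(X):
    """Prepend a column of 1s so that the first weight plays the role of w_0."""
    ones = np.ones((X.shape[0], 1))
    return np.hstack([ones, X])

# Made-up data and weights for illustration.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w0, w = 0.5, np.array([2.0, -1.0])

X_aug = add_intercept_column(X)
w_aug = np.concatenate([[w0], w])

y_explicit = w0 + X @ w        # intercept added by hand
y_augmented = X_aug @ w_aug    # intercept folded into the weight vector
```

Both expressions give the same predictions, so the intercept needs no special handling once the column is added.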

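7 Learning Parameters - MLE Approach
Find $\mathbf{w}$ and $\sigma^2$ that maximize the likelihood of training data
$\hat{\mathbf{w}}_{MLE} = (X^\top X)^{-1} X^\top \mathbf{y}$
$\hat{\sigma}^2_{MLE} = \frac{1}{N} (\mathbf{y} - X\hat{\mathbf{w}}_{MLE})^\top (\mathbf{y} - X\hat{\mathbf{w}}_{MLE})$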
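A minimal NumPy sketch of the two MLE formulas, assuming $X^\top X$ is invertible. The function name `fit_mle` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def fit_mle(X, y):
    """Return (w_mle, sigma2_mle) for the model y ~ N(Xw, sigma2 * I)."""
    # Solve the normal equations (X^T X) w = X^T y without forming an inverse.
    w = np.linalg.solve(X.T @ X, X.T @ y)
    residual = y - X @ w
    sigma2 = residual @ residual / X.shape[0]
    return w, sigma2

# Synthetic data from a hypothetical ground-truth weight vector.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(500, 3))
y = X @ w_true + rng.normal(0.0, 0.1, size=500)

w_mle, sigma2_mle = fit_mle(X, y)
```

Using `np.linalg.solve` rather than an explicit inverse is a numerical choice made here; it computes the same estimate.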

8 Learning Parameters - Least Squares Approach
Minimize squared loss: $J(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} (y_i - \mathbf{w}^\top \mathbf{x}_i)^2$
Make prediction ($\mathbf{w}^\top \mathbf{x}_i$) as close to the target ($y_i$) as possible
Least squares estimate: $\hat{\mathbf{w}} = (X^\top X)^{-1} X^\top \mathbf{y}$
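One way to see where the least squares estimate comes from is to write the loss in matrix form, set its gradient to zero, and solve. The steps below assume $X^\top X$ is invertible.

```latex
\begin{aligned}
J(\mathbf{w}) &= \tfrac{1}{2}\sum_{i=1}^{N}\left(y_i - \mathbf{w}^\top\mathbf{x}_i\right)^2
              = \tfrac{1}{2}(\mathbf{y} - X\mathbf{w})^\top(\mathbf{y} - X\mathbf{w}) \\
\nabla_{\mathbf{w}} J(\mathbf{w}) &= -X^\top(\mathbf{y} - X\mathbf{w})
              = X^\top X\,\mathbf{w} - X^\top\mathbf{y} \\
\nabla_{\mathbf{w}} J(\hat{\mathbf{w}}) = \mathbf{0}
  \;&\Rightarrow\; X^\top X\,\hat{\mathbf{w}} = X^\top\mathbf{y}
  \;\Rightarrow\; \hat{\mathbf{w}} = (X^\top X)^{-1}X^\top\mathbf{y}
\end{aligned}
```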

9 Gradient Descent Based Method
Minimize the squared loss using Gradient Descent
$J(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} (y_i - \mathbf{w}^\top \mathbf{x}_i)^2$
Why?
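The slide leaves "Why?" open. One common motivation, offered here as an assumption rather than the author's answer, is that solving with $X^\top X$ becomes expensive when the number of attributes is large, while a gradient step only needs matrix-vector products. A minimal sketch of batch gradient descent on the squared loss follows; the step size, iteration count, and scaling of the step by $N$ are illustrative choices.

```python
import numpy as np

def gradient_descent(X, y, eta=0.1, n_iters=2000):
    """Minimize J(w) = 0.5 * sum_i (y_i - w^T x_i)^2 by batch gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y)    # gradient of J at the current w
        w = w - (eta / n) * grad    # step scaled by N (an assumption, for stability)
    return w

# Synthetic data from hypothetical weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.1, size=200)

w_gd = gradient_descent(X, y)
w_closed_form = np.linalg.lstsq(X, y, rcond=None)[0]
```

On this well-conditioned problem the iterates approach the closed-form estimate; a step size that is too large for the data would make them diverge instead.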

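10 Recap - Linear Regression
Geometric: $y = \mathbf{w}^\top \mathbf{x}$, with loss $J(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} (y_i - \mathbf{w}^\top \mathbf{x}_i)^2$
  1. Least Squares: $\hat{\mathbf{w}} = (X^\top X)^{-1} X^\top \mathbf{y}$
  2. Gradient Descent
Probabilistic: $p(y) = \mathcal{N}(\mathbf{w}^\top \mathbf{x}, \sigma^2)$
  1. Maximum Likelihood Estimation: $\hat{\mathbf{w}}_{MLE} = (X^\top X)^{-1} X^\top \mathbf{y}$, $\hat{\sigma}^2_{MLE} = \frac{1}{N} (\mathbf{y} - X\hat{\mathbf{w}}_{MLE})^\top (\mathbf{y} - X\hat{\mathbf{w}}_{MLE})$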

11 Issues with Linear Regression
1. Not truly Bayesian
2. Susceptible to outliers
3. Too simplistic - Underfitting
4. No way to control overfitting
5. Unstable in presence of correlated input attributes
6. Gets confused by unnecessary attributes

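12 Putting a Prior on w
Penalize large values of $\mathbf{w}$
A zero-mean Gaussian prior: $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \tau^2 I)$
What is the posterior of $\mathbf{w}$?
$p(\mathbf{w} \mid D) \propto \prod_i \mathcal{N}(y_i \mid \mathbf{w}^\top \mathbf{x}_i, \sigma^2)\, p(\mathbf{w})$
Posterior is also Gaussian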

13 Posterior Estimates of the Weight Vector
Regularized least squares estimate of $\mathbf{w}$:
$\arg\max_{\mathbf{w}} \sum_{i=1}^{N} \log \mathcal{N}(y_i \mid \mathbf{w}^\top \mathbf{x}_i, \sigma^2) + \log \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \tau^2 I)$
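To see why this maximization is a regularized least squares problem, expand both log-densities, drop the terms that do not depend on $\mathbf{w}$, and multiply through by $-\sigma^2$ (which turns the maximization into a minimization):

```latex
\begin{aligned}
\log \mathcal{N}(y_i \mid \mathbf{w}^\top\mathbf{x}_i, \sigma^2)
  &= -\frac{1}{2\sigma^2}\left(y_i - \mathbf{w}^\top\mathbf{x}_i\right)^2 + \text{const} \\
\log \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \tau^2 I)
  &= -\frac{1}{2\tau^2}\,\mathbf{w}^\top\mathbf{w} + \text{const} \\
\hat{\mathbf{w}}_{MAP}
  &= \arg\min_{\mathbf{w}} \; \frac{1}{2}\sum_{i=1}^{N}\left(y_i - \mathbf{w}^\top\mathbf{x}_i\right)^2
     + \frac{1}{2}\,\frac{\sigma^2}{\tau^2}\,\lVert\mathbf{w}\rVert_2^2
\end{aligned}
```

This is the squared loss plus an $\ell_2$ penalty with weight $\lambda = \sigma^2 / \tau^2$: a tighter prior (smaller $\tau^2$) means a stronger penalty.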

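14 Parameter Estimation for Bayesian Regression
Prior for $\mathbf{w}$: $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \tau^2 I_D)$
Posterior for $\mathbf{w}$:
$p(\mathbf{w} \mid \mathbf{y}, X) = \frac{p(\mathbf{y} \mid X, \mathbf{w})\, p(\mathbf{w})}{p(\mathbf{y} \mid X)} = \mathcal{N}\left(\bar{\mathbf{w}},\; \sigma^2 \left(X^\top X + \frac{\sigma^2}{\tau^2} I_D\right)^{-1}\right)$, where $\bar{\mathbf{w}} = \left(X^\top X + \frac{\sigma^2}{\tau^2} I_D\right)^{-1} X^\top \mathbf{y}$
Posterior distribution for $\mathbf{w}$ is also Gaussian
What will be the MAP estimate for $\mathbf{w}$?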
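The posterior mean and covariance can be computed in a few lines. This is a sketch that treats $\sigma^2$ and $\tau^2$ as known; the function name `posterior` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def posterior(X, y, sigma2, tau2):
    """Mean and covariance of p(w | y, X) under the prior w ~ N(0, tau2 * I_D)."""
    d = X.shape[1]
    A = X.T @ X + (sigma2 / tau2) * np.eye(d)
    mean = np.linalg.solve(A, X.T @ y)
    cov = sigma2 * np.linalg.inv(A)
    return mean, cov

# Synthetic data from hypothetical weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, -1.0]) + rng.normal(0.0, 0.5, size=50)

w_mean, w_cov = posterior(X, y, sigma2=0.25, tau2=1.0)

# With a very wide prior the posterior mean approaches the least squares estimate.
w_wide, _ = posterior(X, y, sigma2=0.25, tau2=1e12)
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

Because the posterior is Gaussian, its mode coincides with its mean, so `w_mean` is also the MAP estimate.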

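15 Prediction with Bayesian Regression
For a new $\mathbf{x}^*$, predict $y^*$
Point estimate of $y^*$: $y^* = \hat{\mathbf{w}}_{MLE}^\top \mathbf{x}^*$
Treating $y$ as a Gaussian random variable:
  $p(y^* \mid \mathbf{x}^*) = \mathcal{N}(\hat{\mathbf{w}}_{MLE}^\top \mathbf{x}^*, \sigma^2_{MLE})$
  $p(y^* \mid \mathbf{x}^*) = \mathcal{N}(\hat{\mathbf{w}}_{MAP}^\top \mathbf{x}^*, \sigma^2_{MAP})$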

16 Full Bayesian Treatment
Treating $y$ and $\mathbf{w}$ as random variables
$p(y^* \mid \mathbf{x}^*) = \int p(y^* \mid \mathbf{x}^*, \mathbf{w})\, p(\mathbf{w} \mid X, \mathbf{y})\, d\mathbf{w}$
This is also Gaussian!
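The slide does not spell out the resulting Gaussian. A standard result for this model with known $\sigma^2$ is that the predictive mean is the posterior mean of $\mathbf{w}$ applied to $\mathbf{x}^*$, and the predictive variance is $\sigma^2$ plus $\mathbf{x}^{*\top} \Sigma\, \mathbf{x}^*$, where $\Sigma$ is the posterior covariance of $\mathbf{w}$. The sketch below assumes that form; the function name and data are illustrative.

```python
import numpy as np

def posterior_predictive(x_new, X, y, sigma2, tau2):
    """Mean and variance of p(y* | x*) after integrating out w."""
    d = X.shape[1]
    A = X.T @ X + (sigma2 / tau2) * np.eye(d)
    w_mean = np.linalg.solve(A, X.T @ y)
    w_cov = sigma2 * np.linalg.inv(A)
    mean = w_mean @ x_new
    var = sigma2 + x_new @ w_cov @ x_new   # noise variance + uncertainty about w
    return mean, var

# Synthetic data from hypothetical weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = X @ np.array([1.0, -1.0]) + rng.normal(0.0, 0.5, size=40)

x_near = np.array([0.1, 0.1])
x_far = 10.0 * x_near
mean_near, var_near = posterior_predictive(x_near, X, y, sigma2=0.25, tau2=1.0)
mean_far, var_far = posterior_predictive(x_far, X, y, sigma2=0.25, tau2=1.0)
```

Unlike the plug-in predictions, the variance here grows as the query point moves away from the origin, reflecting the remaining uncertainty about $\mathbf{w}$.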

17 Handling Non-linear Relationships
Replace $\mathbf{x}$ with non-linear functions $\phi(\mathbf{x})$
$p(y \mid \mathbf{x}, \theta) = \mathcal{N}(\mathbf{w}^\top \phi(\mathbf{x}), \sigma^2)$
Model is still linear in $\mathbf{w}$
Also known as basis function expansion
Example: $\phi(x) = [1, x, x^2, \dots, x^p]$
Increasing $p$ results in more complex fits
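A small sketch of the polynomial example: expand a scalar input into $[1, x, x^2, \dots, x^p]$ and fit the weights by ordinary least squares. The helper name `poly_features` and the noiseless quadratic target are assumptions for illustration.

```python
import numpy as np

def poly_features(x, p):
    """Map scalar inputs x to rows [1, x, x^2, ..., x^p]."""
    return np.vander(x, N=p + 1, increasing=True)

# Made-up noiseless quadratic target.
x = np.linspace(-1.0, 1.0, 30)
y = 1.0 - 2.0 * x + 3.0 * x ** 2

Phi = poly_features(x, p=2)
w_hat = np.linalg.lstsq(Phi, y, rcond=None)[0]
```

The fit is non-linear in `x` but the estimation problem is unchanged: `Phi` simply takes the place of $X$.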

18 How to Control Overfitting?
Use simpler models (linear instead of polynomial)
  Might have poor results (underfitting)
Use regularized complex models
  $\hat{\Theta} = \arg\min_{\Theta} J(\Theta) + \lambda R(\Theta)$
  $R()$ corresponds to the penalty paid for complexity of the model

19 $\ell_2$ Regularization
Ridge Regression
$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} J(\mathbf{w}) + \lambda \lVert \mathbf{w} \rVert_2^2$
Helps in reducing impact of correlated inputs

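20 Parameter Estimation for Ridge Regression
Exact Loss Function: $J(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} (y_i - \mathbf{w}^\top \mathbf{x}_i)^2 + \frac{1}{2} \lambda \lVert \mathbf{w} \rVert_2^2$
Ridge Estimate of $\mathbf{w}$: $\hat{\mathbf{w}}_{MAP} = (\lambda I_D + X^\top X)^{-1} X^\top \mathbf{y}$
Equivalent to MAP estimate for Bayesian Regression with Gaussian prior on $\mathbf{w}$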
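A minimal sketch of the ridge estimate on two nearly identical input columns, the situation in which plain least squares tends to be unstable. The function name `ridge_fit`, the data, and the value of `lam` are illustrative assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimate (lam * I_D + X^T X)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)

# Two strongly correlated inputs (made-up data).
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + 0.01 * rng.normal(size=100)    # nearly a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(0.0, 0.1, size=100)

w_ols = ridge_fit(X, y, lam=0.0)     # lam = 0 recovers ordinary least squares
w_ridge = ridge_fit(X, y, lam=1.0)
```

The ridge weights never have a larger norm than the least squares weights, and with correlated columns the penalty keeps the two coefficients from drifting apart in opposite directions.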

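21 Using Gradient Descent with Ridge Regression
Very similar to OLS
Minimize the regularized squared loss using Gradient Descent
$J(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} (y_i - \mathbf{w}^\top \mathbf{x}_i)^2 + \frac{1}{2} \lambda \lVert \mathbf{w} \rVert_2^2$
$\frac{\partial J(\mathbf{w})}{\partial w_j} = \frac{1}{2} \frac{\partial}{\partial w_j} \sum_{i=1}^{N} (y_i - \mathbf{w}^\top \mathbf{x}_i)^2 + \frac{1}{2} \lambda \frac{\partial}{\partial w_j} \lVert \mathbf{w} \rVert_2^2 = \sum_{i=1}^{N} (\mathbf{w}^\top \mathbf{x}_i - y_i)\, x_{ij} + \lambda w_j$
Using the above result, one can perform repeated updates of the weights:
$w_j := w_j - \eta \frac{\partial J(\mathbf{w})}{\partial w_j}$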
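The update rule can be sketched in vectorized form, where all $w_j$ are updated at once. The function name, the iteration count, and the way the step size `eta` is picked from the largest eigenvalue of $X^\top X$ are assumptions made here, not part of the slides.

```python
import numpy as np

def ridge_gradient_descent(X, y, lam, eta, n_iters=1000):
    """Minimize 0.5 * sum_i (y_i - w^T x_i)^2 + 0.5 * lam * ||w||^2."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) + lam * w   # all partial derivatives at once
        w = w - eta * grad
    return w

# Synthetic data from hypothetical weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.1, size=100)

lam = 1.0
# Step size chosen from the curvature of the loss (an assumption for stability).
eta = 1.0 / (np.linalg.eigvalsh(X.T @ X)[-1] + lam)

w_gd = ridge_gradient_descent(X, y, lam, eta)
w_closed_form = np.linalg.solve(lam * np.eye(3) + X.T @ X, X.T @ y)
```

With this step size the iterates settle on the same weights as the closed-form ridge estimate.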

22 $\ell_1$ Regularization
Least Absolute Shrinkage and Selection Operator - LASSO
$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} J(\mathbf{w}) + \lambda \lVert \mathbf{w} \rVert_1$
Helps in feature selection: favors sparse solutions
Optimization is not as straightforward as in Ridge regression
  Gradient not defined when $w_i = 0$ for any $i$
Equivalent to MAP estimate for Bayesian Regression with Laplace prior on $\mathbf{w}$
Laplace Distribution: $p(w) = \frac{1}{2b} \exp\left(-\frac{|w - \mu|}{b}\right)$
  Has two parameters, $\mu$ and $b$
  Has fatter tails than a Gaussian
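The slides do not say which optimizer to use for LASSO. One common option, shown here as an assumption rather than the author's method, is proximal gradient descent (ISTA): take a gradient step on the squared loss, then apply soft-thresholding, which handles the non-differentiable $\ell_1$ term and sets small weights exactly to zero. The function names and data are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t; entries within t of zero become 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iters=5000):
    """Minimize 0.5 * ||y - Xw||^2 + lam * ||w||_1 by proximal gradient (ISTA)."""
    eta = 1.0 / np.linalg.eigvalsh(X.T @ X)[-1]   # step = 1 / largest eigenvalue
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y)                  # gradient of the squared loss only
        w = soft_threshold(w - eta * grad, eta * lam)
    return w

# Synthetic data with a sparse hypothetical weight vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = X @ w_true + rng.normal(0.0, 0.1, size=100)

w_lasso = lasso_ista(X, y, lam=5.0)

# A penalty at least as large as max |X^T y| drives every weight to zero.
w_all_zero = lasso_ista(X, y, lam=np.max(np.abs(X.T @ y)))
```

For a design with orthonormal columns the LASSO solution is exactly `soft_threshold(X.T @ y, lam)`, which is a convenient sanity check for any solver.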

23 LASSO vs. Ridge
Both control overfitting
Ridge helps reduce impact of correlated inputs, LASSO helps in feature selection
Elastic Net Regularization: the best of both worlds
$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} J(\mathbf{w}) + \lambda \lVert \mathbf{w} \rVert_1 + (1 - \lambda) \lVert \mathbf{w} \rVert_2^2$
Again, optimizing for $\mathbf{w}$ is not straightforward

24 Impact of outliers on regression
Linear regression training gets impacted by the presence of outliers
The square term in the exponent of the Gaussian pdf is the culprit
  Equivalent to the square term in the loss
How to handle this (Robust Regression)?
Probabilistic: use a different distribution instead of Gaussian for $p(y \mid \mathbf{x})$
  Robust regression uses the Laplace distribution: $p(y \mid \mathbf{x}) \sim \text{Laplace}(\mathbf{w}^\top \mathbf{x}, b)$
Geometric: least absolute deviations instead of least squares
  $J(\mathbf{w}) = \sum_{i=1}^{N} |y_i - \mathbf{w}^\top \mathbf{x}_i|$
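The effect of a single outlier can be seen on a toy line fit. The slides do not name a solver for least absolute deviations; the sketch below uses iteratively reweighted least squares (IRLS) as one possible approximation, which is an assumption made here. The function name, the `eps` floor on residuals, and the data are illustrative.

```python
import numpy as np

def lad_irls(X, y, n_iters=100, eps=1e-6):
    """Approximate the least absolute deviations fit by reweighted least squares."""
    w = np.linalg.lstsq(X, y, rcond=None)[0]       # start from the least squares fit
    for _ in range(n_iters):
        r = np.abs(y - X @ w)
        weights = 1.0 / np.maximum(r, eps)         # large residual -> small weight
        Xw = X * weights[:, None]
        w = np.linalg.solve(X.T @ Xw, Xw.T @ y)    # weighted normal equations
    return w

# Points on the line y = 1 + 2x, with the last target replaced by an outlier.
x = np.arange(10.0)
X = np.column_stack([np.ones_like(x), x])          # column of 1s for the intercept
y = 1.0 + 2.0 * x
y[-1] = 100.0

w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
w_lad = lad_irls(X, y)

def lad_loss(w):
    return np.sum(np.abs(y - X @ w))
```

The least squares slope is pulled well away from 2 by the one corrupted point, while the absolute-deviation fit stays close to the line through the other nine.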
