
CS 189 Introduction to Machine Learning
Spring 2018 Note 4

1 MLE and MAP for Regression (Part I)

So far, we've explored two approaches to the regression framework, Ordinary Least Squares and Ridge Regression:

$$\hat{w}_{OLS} = \arg\min_w \|y - Xw\|_2^2$$

$$\hat{w}_{RIDGE} = \arg\min_w \|y - Xw\|_2^2 + \lambda \|w\|_2^2$$

One question that arises is why we specifically use the $\ell_2$ norm to measure the error of our predictions and to penalize the model parameters. We will justify this design choice by exploring the statistical interpretations of regression: namely, we will employ Gaussians, MLE, and MAP to validate what we've done so far through a different lens.

1.1 Probabilistic Model

In the context of supervised learning, we assume that there exists a true underlying model mapping inputs to outputs:

$$f : x \mapsto f(x)$$

The true model is unknown to us, and our goal is to find a hypothesis model that best represents it. The only information that we have about the true model is via a dataset $D = \{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ is the input and $y_i \in \mathbb{R}$ is the observation, a noisy version of the true output $f(x_i)$:

$$Y_i = f(x_i) + Z_i$$

We assume that $x_i$ is a fixed value (which implies that $f(x_i)$ is fixed as well), while $Z_i$ is a random variable (which implies that $Y_i$ is a random variable as well). We always assume that $Z_i$ has zero mean, because otherwise there would be systematic bias in our observations. The $Z_i$'s could be Gaussian, uniform, Laplacian, etc. In most contexts, we assume that they are independent, identically distributed (i.i.d.) Gaussians: $Z_i \sim \mathcal{N}(0, \sigma^2)$. We can therefore say that $Y_i$ is a random variable whose probability distribution is given by

$$Y_i \sim \mathcal{N}(f(x_i), \sigma^2)$$

Now that we have defined the model and data, we wish to find a hypothesis model $h_\theta$ (parameterized by $\theta$) that best captures the relationships in the data, while possibly taking into account prior beliefs that we have about the true model. We can frame this as a probability problem, where the goal is to find the model that maximizes an appropriate probability.
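To make the probabilistic model concrete, here is a minimal simulation sketch in Python/NumPy. The true weights `w_true`, the noise level `sigma`, and the dataset size are made-up values for illustration; the note itself does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: a linear true model f(x) = x^T w_true with made-up parameters.
n, d = 100, 2
w_true = np.array([0.5, 1.0])   # hypothetical true weights
sigma = 0.3                     # hypothetical noise standard deviation

X = rng.uniform(-1.0, 1.0, size=(n, d))   # fixed inputs x_i
f_x = X @ w_true                          # true outputs f(x_i)
Z = rng.normal(0.0, sigma, size=n)        # i.i.d. noise Z_i ~ N(0, sigma^2)
y = f_x + Z                               # observations Y_i = f(x_i) + Z_i
```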

1.2 Maximum Likelihood Estimation

In Maximum Likelihood Estimation (MLE), the goal is to find the hypothesis model that maximizes the probability of the data. If we parameterize the set of hypothesis models with $\theta$, we can express the problem as

$$\hat{\theta}_{MLE} = \arg\max_\theta L(\theta; D) = \arg\max_\theta p(\text{data} = D \mid \text{true model} = h_\theta)$$

The quantity $L(\theta)$ that we are maximizing is also known as the likelihood, hence the term MLE. Substituting our representation of $D$, we have

$$\hat{\theta}_{MLE} = \arg\max_\theta L(\theta; X, y) = \arg\max_\theta p(y_1, \dots, y_n \mid x_1, \dots, x_n, \theta)$$

Note that we implicitly condition on the $x_i$'s, because we treat them as fixed values of the data. The only randomness in our data comes from the $y_i$'s (since they are noisy versions of the true values $f(x_i)$). We can further simplify the problem by working with the log likelihood $\ell(\theta; X, y) = \log L(\theta; X, y)$:

$$\hat{\theta}_{MLE} = \arg\max_\theta L(\theta; X, y) = \arg\max_\theta \ell(\theta; X, y)$$

With logs we are still solving the same problem, because logarithms are monotonic functions. In other words, we have that

$$P(A) < P(B) \iff \log P(A) < \log P(B)$$

Let's decompose the log likelihood:

$$\ell(\theta; X, y) = \log p(y_1, \dots, y_n \mid x_1, \dots, x_n, \theta) = \log \prod_{i=1}^n p(y_i \mid x_i, \theta) = \sum_{i=1}^n \log p(y_i \mid x_i, \theta)$$

We decoupled the probabilities of the individual datapoints because their corresponding noise components are independent. Note that the logs allow us to work with sums rather than products, simplifying the problem; this is one reason why the log likelihood is such a powerful tool. Each individual term $p(y_i \mid x_i, \theta)$ comes from a Gaussian,

$$Y_i \sim \mathcal{N}(h_\theta(x_i), \sigma^2)$$

Continuing with logs:

$$\begin{aligned}
\hat{\theta}_{MLE} &= \arg\max_\theta\; \ell(\theta; X, y) && (1) \\
&= \arg\max_\theta\; \sum_{i=1}^n \log p(y_i \mid x_i, \theta) && (2) \\
&= \arg\max_\theta\; \sum_{i=1}^n \left( -\frac{(y_i - h_\theta(x_i))^2}{2\sigma^2} \right) - n \log\left(\sqrt{2\pi}\,\sigma\right) && (3) \\
&= \arg\min_\theta\; \sum_{i=1}^n \left( \frac{(y_i - h_\theta(x_i))^2}{2\sigma^2} \right) + n \log\left(\sqrt{2\pi}\,\sigma\right) && (4) \\
&= \arg\min_\theta\; \sum_{i=1}^n (y_i - h_\theta(x_i))^2 && (5)
\end{aligned}$$

Note that in step (4) we turned the problem from a maximization problem into a minimization problem by negating the objective. In step (5) we eliminated the second term and the denominator in the first term, because they do not depend on the variables we are optimizing over.

Now let's look at the case of linear regression: our hypothesis has the form $h_\theta(x_i) = x_i^\top \theta$, where $\theta \in \mathbb{R}^d$ and $d$ is the number of dimensions of our featurized datapoints. For this specific setting, the problem becomes

$$\hat{\theta}_{MLE} = \arg\min_{\theta \in \mathbb{R}^d} \sum_{i=1}^n (y_i - x_i^\top \theta)^2$$

This is just the Ordinary Least Squares (OLS) problem! We just proved that OLS and MLE for regression lead to the same answer. We conclude that MLE is a probabilistic justification for why squared error (which is the basis of OLS) is a good metric for evaluating a regression model.
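As a sanity check on the MLE-OLS equivalence, the sketch below reuses the simulated `X`, `y`, and `sigma` from the earlier snippet (all assumptions of these examples) and compares the closed-form least-squares solution with a numerical minimizer of the Gaussian negative log likelihood; up to optimizer tolerance the two agree.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(w, X, y, sigma):
    # -sum_i log N(y_i; x_i^T w, sigma^2), keeping the constant term for completeness
    resid = y - X @ w
    return np.sum(resid**2) / (2 * sigma**2) + len(y) * np.log(np.sqrt(2 * np.pi) * sigma)

# Closed-form OLS: argmin_w ||y - Xw||_2^2
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Numerically maximizing the likelihood (minimizing the NLL) lands on the same point
w_mle = minimize(neg_log_likelihood, x0=np.zeros(X.shape[1]), args=(X, y, sigma)).x

print(np.allclose(w_ols, w_mle, atol=1e-4))  # expect True
```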

1.3 Maximum a Posteriori

In Maximum a Posteriori (MAP) Estimation, the goal is to find the model that maximizes the probability of the model given the data:

$$\hat{\theta}_{MAP} = \arg\max_\theta\; p(\text{true model} = h_\theta \mid \text{data} = D)$$

The probability distribution that we are maximizing is known as the posterior. Maximizing this term directly is often infeasible, so we use Bayes' Rule to re-express the objective:

$$\begin{aligned}
\hat{\theta}_{MAP} &= \arg\max_\theta\; p(\text{true model} = h_\theta \mid \text{data} = D) \\
&= \arg\max_\theta\; \frac{p(\text{data} = D \mid \text{true model} = h_\theta)\, p(\text{true model} = h_\theta)}{p(\text{data} = D)} \\
&= \arg\max_\theta\; p(\text{data} = D \mid \text{true model} = h_\theta)\, p(\text{true model} = h_\theta) \\
&= \arg\max_\theta\; \log p(\text{data} = D \mid \text{true model} = h_\theta) + \log p(\text{true model} = h_\theta) \\
&= \arg\min_\theta\; -\log p(\text{data} = D \mid \text{true model} = h_\theta) - \log p(\text{true model} = h_\theta)
\end{aligned}$$

We treat $p(\text{data} = D)$ as a constant because it does not depend on the variables we are optimizing over. Notice that MAP is just like MLE, except we add a term $p(\text{true model} = h_\theta)$ to our objective. This term is the prior over our true model. Adding the prior has the effect of favoring certain models over others a priori, regardless of the dataset. Note that MLE is a special case of MAP, in which the prior does not treat any model more favorably than the others. Concretely, we have that

$$\hat{\theta}_{MAP} = \arg\min_\theta\; \left( -\sum_{i=1}^n \log p(y_i \mid x_i, \theta) - \log p(\theta) \right)$$

Again, just as in MLE, notice that we implicitly condition on the $x_i$'s because we treat them as constants. Also, let us assume as before that the noise terms are i.i.d. Gaussians: $Z_i \sim \mathcal{N}(0, \sigma^2)$. For the prior term $p(\theta)$, we assume that the components $\theta_j$ are i.i.d. Gaussians:

$$\theta_j \sim \mathcal{N}(\theta_{j0}, \sigma_h^2)$$

Using this specific information, we now have

$$\begin{aligned}
\hat{\theta}_{MAP} &= \arg\min_\theta\; \sum_{i=1}^n \frac{(y_i - h_\theta(x_i))^2}{2\sigma^2} + \sum_{j=1}^d \frac{(\theta_j - \theta_{j0})^2}{2\sigma_h^2} \\
&= \arg\min_\theta\; \sum_{i=1}^n (y_i - h_\theta(x_i))^2 + \frac{\sigma^2}{\sigma_h^2} \sum_{j=1}^d (\theta_j - \theta_{j0})^2
\end{aligned}$$

Let's look again at the case of linear regression to illustrate the effect of the prior term when $\theta_{j0} = 0$. In this context, we use the linear hypothesis function $h_\theta(x) = x^\top \theta$:

$$\hat{\theta}_{MAP} = \arg\min_{\theta \in \mathbb{R}^d} \sum_{i=1}^n (y_i - x_i^\top \theta)^2 + \frac{\sigma^2}{\sigma_h^2} \sum_{j=1}^d \theta_j^2$$

This is just the Ridge Regression problem! We just proved that Ridge Regression and MAP for regression lead to the same answer; we can simply set $\lambda = \frac{\sigma^2}{\sigma_h^2}$. We conclude that MAP is a probabilistic justification for adding the penalty term in Ridge Regression.
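Similarly, the following sketch (again reusing the assumed `X`, `y`, and `sigma` from above, with a made-up prior variance `sigma_h**2`) checks that minimizing the MAP objective with a zero-mean Gaussian prior reproduces the closed-form ridge solution with $\lambda = \sigma^2/\sigma_h^2$.

```python
import numpy as np
from scipy.optimize import minimize

sigma_h = 1.0                  # hypothetical prior standard deviation on each weight
lam = sigma**2 / sigma_h**2    # lambda = sigma^2 / sigma_h^2, as derived above
d = X.shape[1]

# Ridge closed form: (X^T X + lam I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def neg_log_posterior(w):
    # Gaussian likelihood term plus zero-mean Gaussian prior term (up to constants)
    return np.sum((y - X @ w)**2) / (2 * sigma**2) + np.sum(w**2) / (2 * sigma_h**2)

w_map = minimize(neg_log_posterior, x0=np.zeros(d)).x
print(np.allclose(w_ridge, w_map, atol=1e-4))  # expect True
```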

1.4 MLE vs. MAP

Based on our analysis of Ordinary Least Squares Regression and Ridge Regression, we should expect MAP to perform better than MLE. But is that always the case? Let us visit a simple 2D problem where

$$f(x) = \text{slope} \cdot x + \text{intercept}$$

Suppose we already know the true underlying model parameters, $(\text{slope}^*, \text{intercept}^*) = (0.5, 1.0)$; we would like to know what parameters MLE and MAP will select after providing them with some dataset $D$. Let's start with MLE:

The diagram above shows the contours of the likelihood distribution in model space. The gray dot represents the true underlying model. MLE chooses the point that maximizes the likelihood, which is indicated by the green dot. As we can see, MLE chooses a reasonable hypothesis, but this hypothesis lies in a region of high variance, which indicates a high level of uncertainty in the predicted model. A slightly different dataset could significantly alter the predicted model.

Now, let's take a look at the hypothesis model from MAP. One question that arises is where the prior should be centered and what its variance should be. This depends on our belief of what the true underlying model is. If we have reason to believe that the model weights should all be small, then the prior should be centered at zero with a small variance. Let's look at MAP for a prior that is centered at zero:

For reference, we have marked the MLE estimate from before as the green point and the true model as the gray point. The prior distribution is indicated by the diagram on the left, and the posterior distribution is indicated by the diagram on the right. MAP chooses the point that maximizes the posterior probability, which is approximately (0.70, 0.25). Using a prior centered at zero skews our prediction of the model weights toward the origin, leading to a less accurate hypothesis than MLE. However, the posterior has significantly less variance, meaning that the point MAP chooses is less likely to overfit to the noise in the dataset.

Let's say in our case that we have reason to believe that both model weights should be centered around the 0.5 to 1 range. Our prediction is now close to that of MLE, with the added benefit that there is significantly less variance. However, if we believed the model weights should be centered around the -0.5 to -1 range, we would make a much poorer prediction than MLE. As always, in order to compare our beliefs and see which prior works best in practice, we should use cross-validation!
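To get a feel for how the choice of prior center plays out numerically, here is a small sketch of the 2D slope/intercept example. The noise level, prior variance, and sample size are invented for illustration, and `map_estimate` is a hypothetical helper implementing the generalized ridge solution derived above, with the prior mean as an offset.

```python
import numpy as np

rng = np.random.default_rng(1)

slope_true, intercept_true = 0.5, 1.0   # true parameters from the example above
sigma, sigma_h = 0.5, 0.5               # assumed noise / prior standard deviations
n = 20

x = rng.uniform(-1.0, 1.0, size=n)
y = slope_true * x + intercept_true + rng.normal(0.0, sigma, size=n)
A = np.column_stack([x, np.ones(n)])    # design matrix: [slope feature, intercept feature]

def map_estimate(prior_mean):
    # argmin_w ||y - Aw||^2 + (sigma^2/sigma_h^2) ||w - prior_mean||^2
    lam = sigma**2 / sigma_h**2
    return np.linalg.solve(A.T @ A + lam * np.eye(2), A.T @ y + lam * prior_mean)

print("MLE / OLS:            ", np.linalg.lstsq(A, y, rcond=None)[0])
print("MAP, prior at 0:      ", map_estimate(np.zeros(2)))            # shrunk toward the origin
print("MAP, prior near truth:", map_estimate(np.array([0.5, 1.0])))   # close to MLE
```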