Machine Learning Lecture 4: Regularization and Bayesian Statistics
Feng Li
fli@sdu.edu.cn
https://funglee.github.io
School of Computer Science and Technology, Shandong University
Fall 2017
Overfitting Problem

Fitting polynomials of increasing degree to the same data:
    y = \theta_0 + \theta_1 x
    y = \theta_0 + \theta_1 x + \theta_2 x^2
    y = \theta_0 + \theta_1 x + \cdots + \theta_5 x^5
(Figure: the three fitted curves, illustrating underfitting, a reasonable fit, and overfitting.)
Overfitting Problem (Contd.)

- Underfitting, or high bias, occurs when the form of our hypothesis function h maps poorly to the trend of the data.
- Overfitting, or high variance, is caused by a hypothesis function that fits the available training data well but does not generalize to new data.
Addressing the Overfitting Problem

- Reduce the number of features
  - Manually select which features to keep
  - Use a model selection algorithm
- Regularization
  - Keep all the features, but reduce the magnitudes of the parameters \theta_j
  - Regularization works well when we have many slightly useful features
Optimizing Cost Function by Regularization

Consider the following hypothesis
    h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4
To suppress the influence of \theta_3 x^3 and \theta_4 x^4 and smooth the hypothesis function, the cost function can be modified as follows
    \min_\theta \frac{1}{2m} \left[ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2 \right]
Large values of \theta_3 and \theta_4 will increase the objective, so the minimizer drives them toward zero.
Optimizing Cost Function by Regularization (Contd.)

A more general form
    \min_\theta \frac{1}{2m} \left[ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^n \theta_j^2 \right]
where \lambda is the regularization parameter.
- As the magnitudes of the fitting parameters increase, the penalty on the cost function increases.
- The penalty depends on the squares of the parameters as well as on the magnitude of \lambda.
A small code sketch of this cost is given below.
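A minimal NumPy sketch of this regularized cost, assuming a design matrix X whose first column is all ones so that \theta_0 is left unpenalized (the names `regularized_cost` and `lam` are illustrative, not from the lecture):

```python
# Illustrative sketch: J(theta) = 1/(2m) * [ sum_i (h_theta(x_i) - y_i)^2 + lambda * sum_{j>=1} theta_j^2 ]
import numpy as np

def regularized_cost(theta, X, y, lam):
    """X: (m, n+1) design matrix with a leading column of ones,
    theta: (n+1,) parameters, y: (m,) targets, lam: regularization parameter."""
    m = len(y)
    residuals = X @ theta - y                  # h_theta(x^(i)) - y^(i)
    penalty = lam * np.sum(theta[1:] ** 2)     # theta_0 is not penalized
    return (residuals @ residuals + penalty) / (2 * m)
```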
Regularized Linear Regression

Gradient descent: Repeat {
    \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)}
    \theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]   (j = 1, 2, ..., n)
}
The regularization is performed by the term \lambda \theta_j / m.
Another form of the update rule
    \theta_j := \theta_j \left(1 - \alpha \frac{\lambda}{m}\right) - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}
where 1 - \alpha \lambda / m should always be less than 1, so each update also shrinks \theta_j.
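A minimal NumPy sketch of these updates, assuming the same design-matrix convention as above (the iteration count and the helper name `gradient_descent_ridge` are illustrative):

```python
# Illustrative sketch: batch gradient descent with L2 regularization (theta_0 unpenalized)
import numpy as np

def gradient_descent_ridge(X, y, alpha=0.01, lam=1.0, iters=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m   # (1/m) * sum_i (h_theta(x^(i)) - y^(i)) x_j^(i)
        grad[1:] += (lam / m) * theta[1:]  # add (lambda/m) * theta_j for j >= 1
        theta -= alpha * grad
    return theta
```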
Overfitting Problem (Contd.)
(Figure: the effect of regularization on the fitted curve.)
Regularized Linear Regression (Contd.)

Normal equation
    \theta = (X^T X + \lambda L)^{-1} X^T y
where
    L = \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix}
i.e., L is the (n+1) x (n+1) identity matrix with its top-left entry (corresponding to \theta_0) set to 0.
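A minimal NumPy sketch of this closed-form solution, again assuming X already includes the bias column (the function name is illustrative):

```python
# Illustrative sketch: regularized normal equation theta = (X^T X + lambda * L)^(-1) X^T y
import numpy as np

def ridge_normal_equation(X, y, lam=1.0):
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0.0                               # do not penalize the bias term theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```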
Regularized Logistic Regression

Recall the cost function for logistic regression
    J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]
Adding a term for regularization
    J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m} \sum_{j=1}^n \theta_j^2
Gradient descent: Repeat {
    \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)}
    \theta_j := \theta_j - \alpha \left( \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right)   for j = 1, 2, ..., n
}
A short code sketch follows below.
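A minimal NumPy sketch of these updates for labels y in {0, 1}, assuming the usual bias-column convention (function names are illustrative):

```python
# Illustrative sketch: gradient descent for L2-regularized logistic regression, y in {0, 1}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd(X, y, alpha=0.1, lam=1.0, iters=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m   # (1/m) * sum_i (h_theta(x^(i)) - y^(i)) x_j^(i)
        grad[1:] += (lam / m) * theta[1:]           # regularize all parameters except theta_0
        theta -= alpha * grad
    return theta
```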
Parameter Estimation in Probabilistic Models

- Assume data are generated via a probabilistic model
      d ~ p(d | \theta)
  - p(d | \theta): probability distribution underlying the data
  - \theta: fixed but unknown distribution parameter
- Given: m independent and identically distributed (i.i.d.) samples of the data D = {d_i}_{i=1,...,m}
  - Independent and identically distributed: given \theta, each sample is independent of all other samples, and all samples are drawn from the same distribution
- Goal: estimate the parameter \theta that best models/describes the data
  - There are several ways to define "best"
Maximum Likelihood Estimation (MLE)

- Maximum Likelihood Estimation (MLE): choose the parameter \theta that maximizes the probability of the data, given that parameter
- The probability of the data given the parameter is called the likelihood, a function of \theta defined as
      L(\theta) = P(D | \theta) = \prod_{i=1}^m p(d_i | \theta)
- MLE typically maximizes the log-likelihood instead of the likelihood
      l(\theta) = \log L(\theta) = \log \prod_{i=1}^m p(d_i | \theta) = \sum_{i=1}^m \log p(d_i | \theta)
- Maximum likelihood parameter estimation
      \theta_{MLE} = \arg\max_\theta l(\theta) = \arg\max_\theta \sum_{i=1}^m \log p(d_i | \theta)
A small numerical sketch follows below.
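A minimal illustrative example (not from the lecture): MLE of a Bernoulli parameter from i.i.d. coin flips. The closed form is the sample mean, which the sketch below verifies against a grid search over the log-likelihood; the data values are made up for illustration.

```python
# Illustrative sketch: MLE for a Bernoulli parameter from i.i.d. coin flips
import numpy as np

data = np.array([1, 0, 1, 1, 0, 1, 1, 1])         # observed samples d_i in {0, 1} (illustrative)

def log_likelihood(theta, d):
    # l(theta) = sum_i log p(d_i | theta) under a Bernoulli(theta) model
    return np.sum(d * np.log(theta) + (1 - d) * np.log(1 - theta))

grid = np.linspace(0.01, 0.99, 99)
theta_mle = grid[np.argmax([log_likelihood(t, data) for t in grid])]
print(theta_mle, data.mean())                      # the grid maximizer matches the sample mean
```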
Maximum-a-Posteriori Estimation (MAP)

- Maximum-a-Posteriori Estimation (MAP): choose the \theta that maximizes the posterior probability of \theta (i.e., its probability in the light of the observed data)
- The posterior probability of \theta is given by the Bayes rule
      p(\theta | D) = \frac{p(\theta) p(D | \theta)}{p(D)}
  - p(\theta): prior probability of \theta (without having seen any data)
  - p(D): probability of the data (independent of \theta),
      p(D) = \int p(\theta) p(D | \theta) \, d\theta
- The Bayes rule lets us update our belief about \theta in the light of the observed data
- While doing MAP, we usually maximize the log of the posterior probability
Maximum-a-Posteriori Estimation (Contd.)

Maximum-a-posteriori parameter estimation
    \theta_{MAP} = \arg\max_\theta p(\theta | D)
                 = \arg\max_\theta \frac{p(\theta) p(D | \theta)}{p(D)}
                 = \arg\max_\theta p(\theta) p(D | \theta)    (p(D) does not depend on \theta, so it can be dropped)
                 = \arg\max_\theta \log (p(\theta) p(D | \theta))
                 = \arg\max_\theta \left( \log p(\theta) + \log p(D | \theta) \right)
                 = \arg\max_\theta \left( \log p(\theta) + \sum_{i=1}^m \log p(d_i | \theta) \right)
- Same as MLE except for the extra log-prior term!
- MAP allows incorporating our prior knowledge about \theta into its estimation
A small numerical sketch follows below.
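Continuing the coin-flip illustration from above (not from the lecture), a MAP estimate with an assumed Beta(a, b) prior simply adds the log-prior to the log-likelihood before maximizing:

```python
# Illustrative sketch: MAP for a Bernoulli parameter with a Beta(a, b) prior
import numpy as np

data = np.array([1, 0, 1, 1, 0, 1, 1, 1])          # same illustrative samples as before
a, b = 2.0, 2.0                                     # prior hyperparameters (assumed for illustration)

def log_likelihood(theta, d):
    return np.sum(d * np.log(theta) + (1 - d) * np.log(1 - theta))

def log_prior(theta):
    # log of the Beta(a, b) density, up to a constant that does not affect the arg max
    return (a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta)

grid = np.linspace(0.01, 0.99, 99)
theta_map = grid[np.argmax([log_likelihood(t, data) + log_prior(t) for t in grid])]
print(theta_map)   # close to (sum(data) + a - 1) / (len(data) + a + b - 2), the known closed form
```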
Linear Regression: MLE Solution

For each (x^{(i)}, y^{(i)}),
    y^{(i)} = \theta^T x^{(i)} + \epsilon_i
The noise \epsilon_i is drawn from a Gaussian distribution \epsilon_i ~ N(0, \sigma^2), so each y^{(i)} is drawn from the Gaussian
    y^{(i)} | x^{(i)}; \theta ~ N(\theta^T x^{(i)}, \sigma^2)
The log-likelihood is
    l(\theta) = \log L(\theta) = m \log \frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2
Maximizing l(\theta) gives
    \theta_{MLE} = \arg\min_\theta \frac{1}{2\sigma^2} \sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2
i.e., maximizing the likelihood is equivalent to minimizing the sum of squared errors.
Linear Regression: MAP Solution

Assume \theta follows a Gaussian distribution N(0, \sigma^2 I), and thus
    p(\theta) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left( -\frac{\theta^T \theta}{2\sigma^2} \right),
    \log p(\theta) = n \log \frac{1}{\sqrt{2\pi}\sigma} - \frac{\theta^T \theta}{2\sigma^2}
The MAP solution
    \theta_{MAP} = \arg\max_\theta \left\{ \sum_{i=1}^m \log p(y^{(i)} | x^{(i)}, \theta) + \log p(\theta) \right\}
                 = \arg\max_\theta \left\{ m \log \frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2 + n \log \frac{1}{\sqrt{2\pi}\sigma} - \frac{\theta^T \theta}{2\sigma^2} \right\}
                 = \arg\min_\theta \left\{ \frac{1}{2\sigma^2} \sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2 + \frac{\theta^T \theta}{2\sigma^2} \right\}
Linear Regression: MLE vs MAP

MLE solution
    \theta_{MLE} = \arg\min_\theta \frac{1}{2\sigma^2} \sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2
MAP solution
    \theta_{MAP} = \arg\min_\theta \left\{ \frac{1}{2\sigma^2} \sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2 + \frac{\theta^T \theta}{2\sigma^2} \right\}
What do we learn?
- MLE estimation of a parameter leads to an unregularized solution
- MAP estimation of a parameter leads to a regularized solution
- The prior distribution acts as a regularizer in MAP estimation
- For MAP, different prior distributions lead to different regularizers
  - A Gaussian prior on \theta regularizes the l_2 norm of \theta
  - A Laplace prior exp(-C|\theta|) on \theta regularizes the l_1 norm of \theta
Probabilistic Classification: Logistic Regression

- Often we do not just care about predicting the label y for an example; rather, we want to predict the label probabilities p(y | x, \theta)
- Consider the following model (y \in {-1, +1})
      p(y | x, \theta) = g(y \theta^T x) = \frac{1}{1 + \exp(-y \theta^T x)}
- g is the logistic function, which maps every real number into (0, 1)
- This is the logistic regression model, which is a classification model
Logistic Regression

What does the decision boundary look like for logistic regression?
At the decision boundary, the labels +1 and -1 become equiprobable:
    p(y = +1 | x, \theta) = p(y = -1 | x, \theta)
    \frac{1}{1 + \exp(-\theta^T x)} = \frac{1}{1 + \exp(\theta^T x)}
    \exp(-\theta^T x) = \exp(\theta^T x)
    \theta^T x = 0
The decision boundary is therefore linear: logistic regression is a linear classifier (note: it is possible to kernelize it and make it nonlinear).
Logistic Regression: MLE Solution

Log-likelihood
    l(\theta) = \log \prod_{i=1}^m p(y^{(i)} | x^{(i)}, \theta)
              = \sum_{i=1}^m \log p(y^{(i)} | x^{(i)}, \theta)
              = \sum_{i=1}^m \log \frac{1}{1 + \exp(-y^{(i)} \theta^T x^{(i)})}
              = -\sum_{i=1}^m \log\left[ 1 + \exp(-y^{(i)} \theta^T x^{(i)}) \right]
MLE solution
    \theta_{MLE} = \arg\min_\theta \sum_{i=1}^m \log\left[ 1 + \exp(-y^{(i)} \theta^T x^{(i)}) \right]
No closed-form solution exists, but we can do gradient descent on l(\theta); a small sketch follows below.
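A minimal NumPy sketch of gradient descent on this loss for labels y in {-1, +1} (function names, step size, and iteration count are illustrative; the gradient is averaged over m only to keep the step size stable, which does not change the minimizer):

```python
# Illustrative sketch: gradient descent for logistic regression with labels y in {-1, +1}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_mle(X, y, alpha=0.1, iters=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        # gradient of sum_i log(1 + exp(-y_i theta^T x_i)), averaged over m
        grad = -X.T @ (y * sigmoid(-y * (X @ theta))) / m
        theta -= alpha * grad
    return theta
```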
Logistic Regression: MAP Solution

Again, assume \theta follows a Gaussian distribution N(0, \sigma^2 I)
    p(\theta) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left( -\frac{\theta^T \theta}{2\sigma^2} \right),
    \log p(\theta) = n \log \frac{1}{\sqrt{2\pi}\sigma} - \frac{\theta^T \theta}{2\sigma^2}
MAP solution
    \theta_{MAP} = \arg\min_\theta \sum_{i=1}^m \log\left[ 1 + \exp(-y^{(i)} \theta^T x^{(i)}) \right] + \frac{\theta^T \theta}{2\sigma^2}
See "A comparison of numerical optimizers for logistic regression" by Tom Minka for optimization techniques (gradient descent and others) for logistic regression (both MLE and MAP).
Logistic Regression: MLE vs MAP

MLE solution
    \theta_{MLE} = \arg\min_\theta \sum_{i=1}^m \log\left[ 1 + \exp(-y^{(i)} \theta^T x^{(i)}) \right]
MAP solution
    \theta_{MAP} = \arg\min_\theta \sum_{i=1}^m \log\left[ 1 + \exp(-y^{(i)} \theta^T x^{(i)}) \right] + \frac{\theta^T \theta}{2\sigma^2}
Take-home messages (we already saw these before :-) )
- MLE estimation of a parameter leads to unregularized solutions
- MAP estimation of a parameter leads to regularized solutions
- The prior distribution acts as a regularizer in MAP estimation
- For MAP, different prior distributions lead to different regularizers
  - A Gaussian prior on \theta regularizes the l_2 norm of \theta
  - A Laplace prior exp(-C|\theta|) on \theta regularizes the l_1 norm of \theta
Thanks! Q & A