Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io

Machine Learning
Lecture 4: Regularization and Bayesian Statistics
Feng Li
fli@sdu.edu.cn
https://funglee.github.io
School of Computer Science and Technology, Shandong University
Fall 2017

Overfitting Problem
$y = \theta_0 + \theta_1 x$
$y = \theta_0 + \theta_1 x + \theta_2 x^2$
$y = \theta_0 + \theta_1 x + \cdots + \theta_5 x^5$

Overfitting Problem (Contd.)
Underfitting, or high bias, occurs when the form of our hypothesis function $h_\theta$ maps poorly to the trend of the data
Overfitting, or high variance, is caused by a hypothesis function that fits the available data but does not generalize well to predict new data

Addressing the Overfitting Problem
- Reduce the number of features
  - Manually select which features to keep
  - Use a model selection algorithm
- Regularization
  - Keep all the features, but reduce the magnitudes of the parameters $\theta_j$
  - Regularization works well when we have a lot of slightly useful features

Optimizing Cost Function by Regularization
Consider the following hypothesis:
$\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$
To eliminate the influence of $\theta_3 x^3$ and $\theta_4 x^4$ and smooth the hypothesis function, the cost function can be modified as follows:
$\min_\theta \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2\right]$
Large values of $\theta_3$ and $\theta_4$ will increase the objective function, so minimization drives them toward small values

Optimizing Cost Function by Regularization (Contd.)
A more general form:
$\min_\theta \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j=1}^{n}\theta_j^2\right]$
where $\lambda$ is the regularization parameter
As the magnitudes of the fitting parameters increase, there will be an increasing penalty on the cost function
This penalty depends on the squares of the parameters as well as on the magnitude of $\lambda$
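
As a concrete sketch (my own, not from the slides), the regularized cost above can be computed with NumPy as follows; the function name and arguments are illustrative, X is assumed to carry a leading column of ones, and $\theta_0$ is left unpenalized, matching the update rules on the next slide.

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) = (1/2m) * [ sum_i (h_theta(x^(i)) - y^(i))^2
                             + lam * sum_{j>=1} theta_j^2 ]."""
    m = len(y)
    residuals = X @ theta - y                 # h_theta(x^(i)) - y^(i)
    penalty = lam * np.sum(theta[1:] ** 2)    # theta_0 (intercept) is not penalized
    return (residuals @ residuals + penalty) / (2 * m)
```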

Regularized Linear Regression
Gradient descent: repeat {
$\theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$
$\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right]$
}
The regularization is performed by the term $\frac{\lambda}{m}\theta_j$
Another form of the update rule:
$\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \frac{\alpha}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$
where $1 - \alpha\frac{\lambda}{m}$ should always be less than 1
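
These update rules translate directly into code. The sketch below is my own, with illustrative names and default values; it assumes X has a leading column of ones and leaves $\theta_0$ unregularized, as on the slide.

```python
import numpy as np

def gradient_descent_ridge(X, y, alpha=0.01, lam=1.0, iters=1000):
    """Gradient descent for regularized linear regression."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        errors = X @ theta - y              # h_theta(x^(i)) - y^(i)
        grad = X.T @ errors / m             # (1/m) * sum_i errors_i * x_j^(i)
        grad[1:] += (lam / m) * theta[1:]   # add (lambda/m) * theta_j for j >= 1
        theta -= alpha * grad
    return theta
```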

Overfitting Problem (Contd.)
[Figure-only slide; no text was transcribed.]

Regularized Linear Regression (Contd.)
Normal equation:
$\theta = \left(X^T X + \lambda L\right)^{-1} X^T y$
where
$L = \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix}$
i.e., $L$ is the $(n+1)\times(n+1)$ identity matrix with its top-left entry set to 0, so that $\theta_0$ is not regularized
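
A short sketch (mine) of this closed form; X is assumed to include the column of ones, and a linear solve is used rather than an explicit matrix inverse.

```python
import numpy as np

def ridge_normal_equation(X, y, lam):
    """theta = (X^T X + lambda * L)^{-1} X^T y, with L = identity whose
    top-left entry is zero so the intercept theta_0 is not regularized."""
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```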

Regularized Logistic Regression
Recall the cost function for logistic regression:
$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right]$
Adding a term for regularization:
$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$
Gradient descent: repeat {
$\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$
$\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right]$ for $j = 1, 2, \dots, n$
}
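
A compact sketch of these updates for labels $y \in \{0, 1\}$ (my own code, not the lecture's); names and defaults are illustrative, X is assumed to include a column of ones, and $\theta_0$ is not regularized.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd_regularized(X, y, alpha=0.1, lam=1.0, iters=1000):
    """Gradient descent on the regularized logistic-regression cost J(theta)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        errors = sigmoid(X @ theta) - y     # h_theta(x^(i)) - y^(i)
        grad = X.T @ errors / m
        grad[1:] += (lam / m) * theta[1:]   # regularize all parameters except theta_0
        theta -= alpha * grad
    return theta
```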

Parameter Estimation in Probabilistic Models
Assume the data are generated via a probabilistic model
$d \sim p(d \mid \theta)$
- $p(d \mid \theta)$: probability distribution underlying the data
- $\theta$: fixed but unknown distribution parameter
Given: $m$ independent and identically distributed (i.i.d.) samples of the data, $D = \{d_i\}_{i=1,\dots,m}$
Independent and identically distributed:
- Given $\theta$, each sample is independent of all other samples
- All samples are drawn from the same distribution
Goal: estimate the parameter $\theta$ that best models/describes the data
- There are several ways to define "the best"

Maximum Likelihood Estimation (MLE)
Maximum Likelihood Estimation (MLE): choose the parameter that maximizes the probability of the data, given that parameter
The probability of the data given the parameter is called the likelihood, a function of $\theta$ defined as
$L(\theta) = P(D \mid \theta) = \prod_{i=1}^{m} p(d_i \mid \theta)$
MLE typically maximizes the log-likelihood instead of the likelihood:
$\ell(\theta) = \log L(\theta) = \log \prod_{i=1}^{m} p(d_i \mid \theta) = \sum_{i=1}^{m}\log p(d_i \mid \theta)$
Maximum likelihood parameter estimation:
$\theta_{MLE} = \arg\max_\theta \ell(\theta) = \arg\max_\theta \sum_{i=1}^{m}\log p(d_i \mid \theta)$
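
A tiny worked example of MLE (my own illustration; the Bernoulli model is not in the slides): for i.i.d. coin flips, maximizing the log-likelihood over a grid recovers the sample mean, which is the known closed-form MLE.

```python
import numpy as np

# For i.i.d. flips d_i in {0, 1}:
#   l(theta) = sum_i [ d_i*log(theta) + (1 - d_i)*log(1 - theta) ],
# whose maximizer is the sample mean.
rng = np.random.default_rng(0)
data = rng.binomial(1, 0.3, size=1000)           # samples from p(d | theta = 0.3)

grid = np.linspace(0.01, 0.99, 99)               # candidate values of theta
loglik = np.array([np.sum(data * np.log(t) + (1 - data) * np.log(1 - t))
                   for t in grid])
theta_mle = grid[np.argmax(loglik)]
print(theta_mle, data.mean())                    # both are close to the true 0.3
```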

Maximum-a-Posteriori Estimation (MAP)
Maximum-a-Posteriori Estimation (MAP): choose the $\theta$ that maximizes the posterior probability of $\theta$ (i.e., its probability in the light of the observed data)
The posterior probability of $\theta$ is given by the Bayes rule:
$p(\theta \mid D) = \frac{p(\theta)\,p(D \mid \theta)}{p(D)}$
- $p(\theta)$: prior probability of $\theta$ (without having seen any data)
- $p(D)$: probability of the data (independent of $\theta$), $p(D) = \int p(\theta)\,p(D \mid \theta)\,d\theta$
The Bayes rule lets us update our belief about $\theta$ in the light of the observed data
When doing MAP, we usually maximize the log of the posterior probability

Maximum-a-Posteriori Estimation (Contd.)
Maximum-a-posteriori parameter estimation:
$\theta_{MAP} = \arg\max_\theta p(\theta \mid D) = \arg\max_\theta \frac{p(\theta)\,p(D \mid \theta)}{p(D)}$
$= \arg\max_\theta p(\theta)\,p(D \mid \theta) = \arg\max_\theta \log\left(p(\theta)\,p(D \mid \theta)\right)$
$= \arg\max_\theta \left(\log p(\theta) + \log p(D \mid \theta)\right) = \arg\max_\theta \left(\log p(\theta) + \sum_{i=1}^{m}\log p(d_i \mid \theta)\right)$
Same as MLE except for the extra log-prior-distribution term!
MAP allows incorporating our prior knowledge about $\theta$ into its estimation
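
Continuing the Bernoulli illustration above (again my own example; the Beta prior is an assumption, not from the slides), adding a log-prior term shifts the maximizer toward the prior, which matters most when the data set is small.

```python
import numpy as np

# MAP objective: log p(theta) + sum_i log p(d_i | theta), with a Beta(a, b)
# prior on theta. The known closed form for comparison is
#   theta_MAP = (heads + a - 1) / (m + a + b - 2).
rng = np.random.default_rng(0)
data = rng.binomial(1, 0.3, size=20)             # few samples: the prior matters
a, b = 5.0, 5.0                                  # prior pulls theta toward 0.5

grid = np.linspace(0.01, 0.99, 99)
log_prior = (a - 1) * np.log(grid) + (b - 1) * np.log(1 - grid)
log_lik = np.array([np.sum(data * np.log(t) + (1 - data) * np.log(1 - t))
                    for t in grid])
theta_map = grid[np.argmax(log_prior + log_lik)]
theta_closed = (data.sum() + a - 1) / (len(data) + a + b - 2)
print(theta_map, theta_closed, data.mean())      # MAP sits between the MLE and 0.5
```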

Linear Regression: MLE Solution
For each $(x^{(i)}, y^{(i)})$,
$y^{(i)} = \theta^T x^{(i)} + \epsilon_i$
The noise $\epsilon_i$ is drawn from a Gaussian distribution, $\epsilon \sim \mathcal{N}(0, \sigma^2)$
Each $y^{(i)}$ is then drawn from the following Gaussian:
$y^{(i)} \mid x^{(i)}; \theta \sim \mathcal{N}\left(\theta^T x^{(i)}, \sigma^2\right)$
The log-likelihood:
$\ell(\theta) = \log L(\theta) = m\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2}\sum_{i=1}^{m}\left(y^{(i)} - \theta^T x^{(i)}\right)^2$
Maximizing $\ell(\theta)$ gives
$\theta_{MLE} = \arg\min_\theta \frac{1}{2\sigma^2}\sum_{i=1}^{m}\left(y^{(i)} - \theta^T x^{(i)}\right)^2$
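
A quick numeric check (my own sketch, with made-up data): under Gaussian noise, maximizing the log-likelihood is equivalent to minimizing the sum of squared errors, so the MLE coincides with the ordinary least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 200, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n - 1))])
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + rng.normal(scale=0.3, size=m)    # y = theta^T x + eps

theta_mle, *_ = np.linalg.lstsq(X, y, rcond=None)     # arg min of the squared error
print(theta_mle)                                      # close to theta_true
```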

Linear Regression: MAP Solution
$\theta$ follows a Gaussian distribution $\mathcal{N}(0, \sigma^2 I)$, and thus
$p(\theta) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left(-\frac{\theta^T\theta}{2\sigma^2}\right)$, $\quad \log p(\theta) = n\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{\theta^T\theta}{2\sigma^2}$
The MAP solution:
$\theta_{MAP} = \arg\max_\theta \left\{\sum_{i=1}^{m}\log p(y^{(i)} \mid x^{(i)}, \theta) + \log p(\theta)\right\}$
$= \arg\max_\theta \left\{m\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2}\sum_{i=1}^{m}\left(y^{(i)} - \theta^T x^{(i)}\right)^2 + n\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{\theta^T\theta}{2\sigma^2}\right\}$
$= \arg\min_\theta \left\{\frac{1}{2\sigma^2}\sum_{i=1}^{m}\left(y^{(i)} - \theta^T x^{(i)}\right)^2 + \frac{\theta^T\theta}{2\sigma^2}\right\}$
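
As a sketch (mine, with synthetic data): minimizing this objective is ridge regression, whose minimizer has the closed form $(X^T X + cI)^{-1}X^T y$, where $c$ is the ratio of the noise variance to the prior variance; with both written as $\sigma^2$ as on the slide, $c = 1$.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 200, 3
X = rng.normal(size=(m, n))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=m)

c = 1.0                                           # noise variance / prior variance
theta_map = np.linalg.solve(X.T @ X + c * np.eye(n), X.T @ y)
print(theta_map)                                  # shrunk slightly toward zero vs. least squares
```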

Linear Regression: MLE vs MAP
ML solution:
$\theta_{ML} = \arg\min_\theta \frac{1}{2\sigma^2}\sum_{i=1}^{m}\left(y^{(i)} - \theta^T x^{(i)}\right)^2$
MAP solution:
$\theta_{MAP} = \arg\min_\theta \left\{\frac{1}{2\sigma^2}\sum_{i=1}^{m}\left(y^{(i)} - \theta^T x^{(i)}\right)^2 + \frac{\theta^T\theta}{2\sigma^2}\right\}$
What do we learn?
- ML estimation of a parameter leads to an unregularized solution
- MAP estimation of a parameter leads to a regularized solution
- The prior distribution acts as a regularizer in MAP estimation
- For MAP, different prior distributions lead to different regularizers
  - A Gaussian prior on $\theta$ regularizes the $\ell_2$ norm of $\theta$
  - A Laplace prior $\exp(-C\|\theta\|_1)$ on $\theta$ regularizes the $\ell_1$ norm of $\theta$

Probabilistic Classification: Logistic Regression
Often we do not just care about predicting the label $y$ for an example
Rather, we want to predict the label probabilities $p(y \mid x, \theta)$
Consider the following function (with $y \in \{-1, +1\}$):
$p(y \mid x, \theta) = g\left(y\,\theta^T x\right) = \frac{1}{1 + \exp\left(-y\,\theta^T x\right)}$
$g$ is the logistic function, which maps every real number into $(0, 1)$
This is the logistic regression model, which is a classification model

Logistic Regression
What does the decision boundary look like for logistic regression?
At the decision boundary, the labels +1/-1 become equiprobable:
$p(y = +1 \mid x, \theta) = p(y = -1 \mid x, \theta)$
$\frac{1}{1 + \exp\left(-\theta^T x\right)} = \frac{1}{1 + \exp\left(\theta^T x\right)}$
$\exp\left(-\theta^T x\right) = \exp\left(\theta^T x\right)$
$\theta^T x = 0$
The decision boundary is therefore linear
Logistic regression is a linear classifier (note: it is possible to kernelize it and make it nonlinear)

Logistic Regression: MLE Solution
Log-likelihood:
$\ell(\theta) = \log \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}, \theta) = \sum_{i=1}^{m}\log p(y^{(i)} \mid x^{(i)}, \theta)$
$= \sum_{i=1}^{m}\log\frac{1}{1 + \exp\left(-y^{(i)}\theta^T x^{(i)}\right)} = -\sum_{i=1}^{m}\log\left[1 + \exp\left(-y^{(i)}\theta^T x^{(i)}\right)\right]$
MLE solution:
$\theta_{MLE} = \arg\min_\theta \sum_{i=1}^{m}\log\left[1 + \exp\left(-y^{(i)}\theta^T x^{(i)}\right)\right]$
No closed-form solution exists, but we can do gradient descent on $\ell(\theta)$
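
A minimal sketch of that gradient descent (my own code, for labels $y \in \{-1, +1\}$; names and step size are illustrative). It uses the fact that the gradient of the loss $\sum_i \log\left[1 + \exp(-y^{(i)}\theta^T x^{(i)})\right]$ is $-\sum_i y^{(i)}\,x^{(i)}\,g(-y^{(i)}\theta^T x^{(i)})$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_mle(X, y, alpha=0.1, iters=2000):
    """Gradient descent on sum_i log(1 + exp(-y_i * theta^T x_i)), y in {-1, +1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        margins = y * (X @ theta)                         # y_i * theta^T x_i
        grad = -(X.T @ (y * sigmoid(-margins))) / len(y)  # averaged gradient of the loss
        theta -= alpha * grad
    return theta
```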

Logistic Regression: MAP Solution
Again, assume $\theta$ follows a Gaussian distribution $\mathcal{N}(0, \sigma^2 I)$:
$p(\theta) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left(-\frac{\theta^T\theta}{2\sigma^2}\right)$, $\quad \log p(\theta) = n\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{\theta^T\theta}{2\sigma^2}$
MAP solution:
$\theta_{MAP} = \arg\min_\theta \sum_{i=1}^{m}\log\left[1 + \exp\left(-y^{(i)}\theta^T x^{(i)}\right)\right] + \frac{\theta^T\theta}{2\sigma^2}$
See "A comparison of numerical optimizers for logistic regression" by Tom Minka for optimization techniques (gradient descent and others) for logistic regression (both MLE and MAP)
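
The MLE sketch above needs only one extra line for MAP (again my own illustration): the Gaussian prior contributes an additional gradient term $\theta/\sigma^2$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_map(X, y, sigma2=1.0, alpha=0.1, iters=2000):
    """Gradient descent on
       sum_i log(1 + exp(-y_i * theta^T x_i)) + theta^T theta / (2 * sigma2),
       with labels y in {-1, +1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        margins = y * (X @ theta)
        grad = -(X.T @ (y * sigmoid(-margins)))   # gradient of the loss term
        grad += theta / sigma2                    # gradient of the Gaussian-prior term
        theta -= (alpha / len(y)) * grad          # 1/m factor only rescales the step size
    return theta
```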

Logistic Regression: MLE vs MAP
MLE solution:
$\theta_{MLE} = \arg\min_\theta \sum_{i=1}^{m}\log\left[1 + \exp\left(-y^{(i)}\theta^T x^{(i)}\right)\right]$
MAP solution:
$\theta_{MAP} = \arg\min_\theta \sum_{i=1}^{m}\log\left[1 + \exp\left(-y^{(i)}\theta^T x^{(i)}\right)\right] + \frac{\theta^T\theta}{2\sigma^2}$
Take-home messages (we already saw these before :-)):
- MLE estimation of a parameter leads to unregularized solutions
- MAP estimation of a parameter leads to regularized solutions
- The prior distribution acts as a regularizer in MAP estimation
- For MAP, different prior distributions lead to different regularizers
  - A Gaussian prior on $\theta$ regularizes the $\ell_2$ norm of $\theta$
  - A Laplace prior $\exp(-C\|\theta\|_1)$ on $\theta$ regularizes the $\ell_1$ norm of $\theta$

Thanks! Q & A