Machine Learning (CSE 446): Probabilistic Machine Learning

1 Machine Learning (CSE 446): Probabilistic Machine Learning. Noah Smith, © 2017 University of Washington, nasmith@cs.washington.edu. November 1

2 Understanding MLE
[Figure: observed outputs $y_1, \ldots, y_N$ go into an MLE box, which outputs the estimate $\hat{\pi}$.]
You can think of MLE as a black box for choosing parameter values.

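3 Understanding MLE
[Figure: the probabilistic model, in which the parameter $\pi$ generates $Y$, shown alongside the pipeline: observed outputs $y_1, \ldots, y_N$ go into MLE, which outputs $\hat{\pi}$.]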

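4 Understanding MLE
[Figure: training inputs $x_1, \ldots, x_N$ and outputs $y_1, \ldots, y_N$ go into MLE, which outputs $\hat{w}$ and $\hat{b}$.]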

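5 Understanding MLE
[Figure: the probabilistic model, in which $x$, $w$, and $b$ pass through the logistic function to generate $Y$, shown alongside the pipeline: training inputs $x_1, \ldots, x_N$ and outputs $y_1, \ldots, y_N$ go into MLE, which outputs $\hat{w}$ and $\hat{b}$.]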

7 Probabilistic Stories
Bernoulli: $\pi \rightarrow Y$
Logistic regression: $x, w, b \rightarrow \text{logistic} \rightarrow Y$
Gaussian: $\mu, \sigma^2 \rightarrow Y$
Linear regression: $x, w, b, \sigma^2 \rightarrow Y$

8 Then and Now
Before today, you knew how to do MLE:
For a Bernoulli distribution: $\hat{\pi} = \frac{\text{count}(+1)}{\text{count}(+1) + \text{count}(-1)} = \frac{N_+}{N}$, where $N_+$ is the number of $+1$ labels.
For a Gaussian distribution: $\hat{\mu} = \frac{1}{N} \sum_{n=1}^{N} y_n$ (and similar for estimating variance, $\hat{\sigma}^2$).
Logistic regression and linear regression, respectively, generalize these so that the parameter is itself a function of $x$, so that we have a conditional model of $Y$ given $X$.
The practical difference is that the MLE doesn't have a closed form for these models. (So we use SGD and friends.)
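The two closed-form estimates on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the course: the function names and the toy data are made up, and the variance estimate assumes the usual MLE form that divides by $N$ rather than $N - 1$.

```python
def bernoulli_mle(labels):
    """MLE of pi = P(Y = +1) from labels in {+1, -1}: the fraction of +1s."""
    n_pos = sum(1 for y in labels if y == +1)
    return n_pos / len(labels)


def gaussian_mle(values):
    """MLE of (mu, sigma^2): the sample mean and the divide-by-N variance."""
    n = len(values)
    mu = sum(values) / n
    var = sum((y - mu) ** 2 for y in values) / n
    return mu, var


labels = [+1, -1, +1, +1]  # toy labels, made up for illustration
values = [2.0, 4.0, 6.0]   # toy real-valued outputs, made up for illustration

pi_hat = bernoulli_mle(labels)
mu_hat, var_hat = gaussian_mle(values)
print(pi_hat, mu_hat, var_hat)
```

Both estimators are just counts and averages over the training outputs, which is why no iterative optimizer is needed for them.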

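13 A Twist!
There is a closed form for the MLE of linear regression. To keep it simple, assume $b = 0$. Let $X \in \mathbb{R}^{N \times d}$ be the stack of training inputs and $y \in \mathbb{R}^{N}$ be the stack of training outputs.
$$\hat{w} = \operatorname{argmin}_{w} \frac{1}{N} \sum_{n=1}^{N} (y_n - w \cdot x_n)^2 = \operatorname{argmin}_{w} (y - Xw)^\top (y - Xw)$$
Setting the gradient with respect to $w$ to zero:
$$-2 X^\top (y - Xw) = 0$$
$$\hat{w} = (X^\top X)^{-1} X^\top y$$
$X^\top X$ is invertible if the training set contains $d$ linearly independent inputs (so $N \ge d$). But the inversion costs $O(d^3)$.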
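As a sketch of the closed form, the normal equations $(X^\top X) w = X^\top y$ can be solved directly. The version below uses only the Python standard library and a hand-written Gaussian elimination so that it stays self-contained; the function name and toy data are assumptions for illustration. In practice one would more likely call a linear algebra routine such as `numpy.linalg.solve` or `numpy.linalg.lstsq`.

```python
def normal_equations(X, y):
    """Solve (X^T X) w = X^T y, with X given as a list of N rows of length d."""
    d = len(X[0])
    # Build A = X^T X (d x d) and c = X^T y (length d).
    A = [[sum(row[i] * row[j] for row in X) for j in range(d)] for i in range(d)]
    c = [sum(row[i] * y_n for row, y_n in zip(X, y)) for i in range(d)]
    # Gaussian elimination with partial pivoting on the augmented matrix [A | c].
    M = [A[i] + [c[i]] for i in range(d)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular: the inputs do not span R^d")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, d):
            factor = M[r][col] / M[col][col]
            M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    # Back substitution on the upper-triangular system.
    w = [0.0] * d
    for i in reversed(range(d)):
        tail = sum(M[i][j] * w[j] for j in range(i + 1, d))
        w[i] = (M[i][d] - tail) / M[i][i]
    return w


# Toy data generated exactly by y = 2*x1 - 1*x2 (no noise, b = 0).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [2.0, -1.0, 1.0, 3.0]
w_hat = normal_equations(X, y)
print(w_hat)
```

The elimination step is where the $O(d^3)$ cost comes from; forming $X^\top X$ additionally touches every training row.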

15 MLE is Dangerous
$\text{Variance}(\hat{\pi}) = \frac{\pi (1 - \pi)}{N}$ (Note that $\pi$ is the true probability that $Y = 1$!)
$\text{Variance}(\hat{\mu}) = \frac{\sigma^2}{N}$ (Note that $\sigma^2$ is the true variance of the r.v.!)
Recall the bias-variance tradeoff.
Bias/approximation error: if your choice of features and probabilistic model align to reality, MLE is great.
Variance/estimation error: MLE tends to overfit unless you have a lot of data.
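To make the dependence on the amount of data concrete, the variance of $\hat{\pi}$ can be computed exactly by summing over every possible number of positive labels in a sample of size $N$. This is an illustrative sketch: the function name and the value $\pi = 0.3$ are arbitrary choices, not from the lecture.

```python
from math import comb


def variance_of_pi_hat(pi, n):
    """Exact variance of pi_hat = (number of +1s) / n when labels are Bernoulli(pi)."""
    var = 0.0
    for k in range(n + 1):
        # Binomial probability of seeing exactly k positive labels out of n.
        prob_k = comb(n, k) * pi**k * (1 - pi) ** (n - k)
        # pi_hat is unbiased, so its mean is pi.
        var += prob_k * (k / n - pi) ** 2
    return var


pi = 0.3  # assumed "true" probability, chosen only for illustration
for n in (5, 50, 500):
    # The two printed values should agree up to floating-point error.
    print(n, variance_of_pi_hat(pi, n), pi * (1 - pi) / n)
```

The variance shrinks in proportion to $1/N$, which is the sense in which MLE is unreliable with little data.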

16 MLE is Dangerous
$\text{Variance}(\hat{\pi}) = \frac{\pi (1 - \pi)}{N}$ (Note that $\pi$ is the true probability that $Y = 1$!)
$\text{Variance}(\hat{\mu}) = \frac{\sigma^2}{N}$ (Note that $\sigma^2$ is the true variance of the r.v.!)
Regularization reduces variance but increases bias.

18 Adding Regularization to the Probabilistic Story
Probabilistic story:
For $n \in \{1, \ldots, N\}$:
  Observe $x_n$.
  Transform it using parameters $w$ and $b$ to get $p_{w,b}(Y \mid x_n)$.
  Sample $y_n \sim p_{w,b}(Y \mid x_n)$.
Probabilistic story with regularization:
Use hyperparameters $\alpha$ to define a prior distribution over random variables $W$, $p_\alpha(W)$.
Sample $w \sim p_\alpha(W)$.
For $n \in \{1, \ldots, N\}$:
  Observe $x_n$.
  Transform it using parameters $w$ and $b$ to get $p_{w,b}(Y \mid x_n)$.
  Sample $y_n \sim p_{w,b}(Y \mid x_n)$.
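The story with regularization can be read as a sampling procedure. The sketch below makes two assumptions that the slide leaves open: the prior is an independent zero-mean Gaussian with standard deviation `alpha` on each weight, and the conditional model is logistic regression with labels in $\{+1, -1\}$. The function name, inputs, and seed are hypothetical.

```python
import math
import random


def sample_story(xs, alpha, b=0.0, seed=0):
    """Sample w from the prior, then one label in {+1, -1} for each input x_n."""
    rng = random.Random(seed)
    d = len(xs[0])
    # Assumed prior: independent zero-mean Gaussians with standard deviation alpha.
    w = [rng.gauss(0.0, alpha) for _ in range(d)]
    ys = []
    for x in xs:
        score = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
        p_pos = 1.0 / (1.0 + math.exp(-score))  # p_{w,b}(Y = +1 | x_n)
        ys.append(+1 if rng.random() < p_pos else -1)
    return w, ys


xs = [[1.0, 2.0], [-1.0, 0.5], [0.0, -3.0]]  # made-up inputs
w, ys = sample_story(xs, alpha=1.0)
print(w, ys)
```

The only difference from the story without regularization is the first step: there, `w` would be a fixed unknown rather than a draw from `p_alpha`.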

22 Maximum a Posteriori (MAP) Estimation
$$(\hat{w}, \hat{b}) = \operatorname{argmax}_{w,b} \underbrace{\log p_\alpha(w)}_{\text{log prior}} + \underbrace{\sum_{n=1}^{N} \log p_{w,b}(y_n \mid x_n)}_{\text{log likelihood}}$$
A typical assumption is that each weight is independent of the others: $p_\alpha(W) = \prod_j p_\alpha(W_j)$
Option 1: let $p_\alpha(W_j)$ be a zero-mean Gaussian distribution with standard deviation $\alpha$. $\log p_\alpha(w) = -\frac{1}{2\alpha^2} \|w\|_2^2 + \text{constant}$
Option 2: let $p_\alpha(W_j)$ be a zero-location Laplace distribution with scale $\alpha$. $\log p_\alpha(w) = -\frac{1}{\alpha} \|w\|_1 + \text{constant}$
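The MAP objective is the sum of a log prior and a log likelihood, and the two prior options correspond to the familiar $L_2$ and $L_1$ penalties. The following sketch writes that objective out for logistic regression with labels in $\{+1, -1\}$. The additive constants are dropped because they do not affect the argmax, and all function names and data are illustrative assumptions.

```python
import math


def log_prior_gaussian(w, alpha):
    """Option 1 without the additive constant: -||w||_2^2 / (2 alpha^2)."""
    return -sum(w_j ** 2 for w_j in w) / (2 * alpha ** 2)


def log_prior_laplace(w, alpha):
    """Option 2 without the additive constant: -||w||_1 / alpha."""
    return -sum(abs(w_j) for w_j in w) / alpha


def log_likelihood(w, b, xs, ys):
    """Logistic regression log likelihood for labels y_n in {+1, -1}."""
    total = 0.0
    for x, y in zip(xs, ys):
        score = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
        total -= math.log1p(math.exp(-y * score))  # adds log sigmoid(y * score)
    return total


def map_objective(w, b, xs, ys, alpha, log_prior=log_prior_gaussian):
    """Quantity maximized by MAP: log prior plus log likelihood."""
    return log_prior(w, alpha) + log_likelihood(w, b, xs, ys)


xs = [[1.0, 2.0], [-1.0, 0.5]]  # made-up inputs
ys = [+1, -1]                   # made-up labels
at_zero = map_objective([0.0, 0.0], 0.0, xs, ys, alpha=1.0)
print(at_zero)
```

Passing `log_prior=log_prior_laplace` switches the same objective from Option 1 to Option 2 without touching the likelihood term.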

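23 Probabilistic Story: $L_2$-Regularized Logistic Regression
[Figure: a prior with mean $0$ and variance $\sigma^2$ generates $w$; then $x$, $w$, and $b$ pass through the logistic function to generate $Y$. Training inputs $x_1, \ldots, x_N$ and outputs $y_1, \ldots, y_N$ go into MAP, which outputs $\hat{w}$ and $\hat{b}$.]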
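One way to see the effect of the prior in this story is to fit a tiny model under two prior widths. The sketch below runs plain gradient ascent on the MAP objective for one-dimensional logistic regression. It is only an illustration under stated assumptions: `alpha` plays the role of the prior standard deviation, no prior is placed on `b`, and the step size, iteration count, and data are arbitrary choices rather than anything prescribed by the lecture.

```python
import math


def fit_map(xs, ys, alpha, lr=0.1, steps=500):
    """Gradient ascent on log prior + log likelihood for 1-D logistic regression."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = -w / alpha ** 2  # gradient of the Gaussian log prior on w
        grad_b = 0.0              # assumption: no prior on b in this sketch
        for x, y in zip(xs, ys):
            # Derivative of log sigmoid(y * score) w.r.t. score is y * sigmoid(-y * score).
            g = y / (1.0 + math.exp(y * (w * x + b)))
            grad_w += g * x
            grad_b += g
        w += lr * grad_w
        b += lr * grad_b
    return w, b


# Made-up, linearly separable data: negative inputs are labeled -1, positive ones +1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [-1, -1, +1, +1]

w_strong, b_strong = fit_map(xs, ys, alpha=0.5)  # narrow prior: strong regularization
w_weak, b_weak = fit_map(xs, ys, alpha=10.0)     # wide prior: weak regularization
print(w_strong, w_weak)
```

On separable data like this, the likelihood alone keeps pushing the weight upward, so the narrow prior is what holds the estimate close to zero.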

24 Why Go Probabilistic?
Interpret the classifier's activation function as a (log) probability (density), which encodes uncertainty.
Interpret the regularizer as a (log) probability (density), which encodes uncertainty.
Leverage theory from statistics to get a better understanding of the guarantees we can hope for with our learning algorithms.
Change your assumptions, turn the optimization crank, and get a new machine learning method.
The key to success is to tell a probabilistic story that's reasonably close to reality, including the prior(s).
