Probabilistic Machine Learning

By Prof. Seungchul Lee, iSystems Design Lab, UNIST

Table of Contents

1. Probabilistic Linear Regression
    1.1. Maximum Likelihood Solution
    1.2. Maximum-a-Posteriori Solution
    1.3. Summary: MLE vs MAP
2. Probabilistic Linear Classification
    2.1. Logistic Regression
    2.2. Maximum Likelihood Solution
    2.3. Maximum-a-Posteriori Solution
    2.4. Summary: MLE vs MAP
3. Probabilistic Clustering
4. Probabilistic Dimension Reduction

1. Probabilistic Linear Regression

$$P(X \mid \theta) = \text{Probability}[\,\text{data} \mid \text{pattern}\,]$$

Inference idea: data = underlying pattern + independent noise

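Each response is generated by a linear model plus some Gaussian noise:

$$y = \omega^T x + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$

Each response $y$ then becomes a draw from the following Gaussian:

$$y \sim \mathcal{N}(\omega^T x, \sigma^2)$$

Probability of each response variable:

$$P(y \mid x, \omega) = \mathcal{N}(\omega^T x, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y - \omega^T x)^2}{2\sigma^2} \right)$$

Given the observed data $\mathcal{D} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$, we want to estimate the weight vector $\omega$.

1.1. Maximum Likelihood Solution

Log-likelihood:

$$
\begin{aligned}
\ell(\omega) = \log L(\omega) &= \log P(\mathcal{D} \mid \omega) = \log P(Y \mid X, \omega) \\
&= \log \prod_{n=1}^{m} P(y_n \mid x_n, \omega) = \sum_{n=1}^{m} \log P(y_n \mid x_n, \omega) \\
&= \sum_{n=1}^{m} \log \left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_n - \omega^T x_n)^2}{2\sigma^2} \right) \right] \\
&= \sum_{n=1}^{m} \left\{ -\frac{1}{2} \log(2\pi\sigma^2) - \frac{(y_n - \omega^T x_n)^2}{2\sigma^2} \right\}
\end{aligned}
$$

Maximum likelihood solution:

$$
\begin{aligned}
\hat{\omega}_{MLE} &= \arg\max_{\omega} \log P(\mathcal{D} \mid \omega) \\
&= \arg\max_{\omega} \; -\frac{1}{2\sigma^2} \sum_{n=1}^{m} (y_n - \omega^T x_n)^2 \\
&= \arg\min_{\omega} \; \frac{1}{2\sigma^2} \sum_{n=1}^{m} (y_n - \omega^T x_n)^2 \\
&= \arg\min_{\omega} \; \sum_{n=1}^{m} (y_n - \omega^T x_n)^2
\end{aligned}
$$

This is equivalent to the least-squares objective for linear regression.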
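The equivalence between maximum likelihood and least squares can be checked numerically. The following is a minimal sketch in plain Python, assuming a two-dimensional input whose first feature is the constant 1. The helper names (`predict`, `gaussian_log_likelihood`, `least_squares_2d`), the synthetic data, and the noise level are illustrative assumptions and do not come from the original notes.

```python
import math
import random

def predict(w, x):
    """Linear model: inner product w^T x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def gaussian_log_likelihood(w, xs, ys, sigma):
    """Sum over samples of log N(y_n | w^T x_n, sigma^2)."""
    const = -0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(const - (y - predict(w, x)) ** 2 / (2.0 * sigma ** 2)
               for x, y in zip(xs, ys))

def least_squares_2d(xs, ys):
    """Solve the 2x2 normal equations (X^T X) w = X^T y with Cramer's rule."""
    a11 = sum(x[0] * x[0] for x in xs)
    a12 = sum(x[0] * x[1] for x in xs)
    a22 = sum(x[1] * x[1] for x in xs)
    b1 = sum(x[0] * y for x, y in zip(xs, ys))
    b2 = sum(x[1] * y for x, y in zip(xs, ys))
    det = a11 * a22 - a12 * a12
    return [(b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]

# Synthetic data (assumed for illustration): y = 1 + 2*x + Gaussian noise.
random.seed(0)
true_w, sigma = [1.0, 2.0], 0.5
xs = [[1.0, random.uniform(-3.0, 3.0)] for _ in range(200)]
ys = [predict(true_w, x) + random.gauss(0.0, sigma) for x in xs]

w_mle = least_squares_2d(xs, ys)
print("least-squares / MLE weights:", w_mle)
print("log-likelihood at w_mle :", gaussian_log_likelihood(w_mle, xs, ys, sigma))
print("log-likelihood at true_w:", gaussian_log_likelihood(true_w, xs, ys, sigma))
```

The least-squares weights never score a lower Gaussian log-likelihood than any other weight vector on the same data, including the weights that generated it, and the value of `sigma` does not change which weights win.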

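1.2. Maximum-a-Posteriori Solution

Let's assume a Gaussian prior distribution over the weight vector $\omega$:

$$P(\omega) = \mathcal{N}(\omega \mid 0, \lambda^{-1} I) = \left( \frac{\lambda}{2\pi} \right)^{D/2} \exp\left( -\frac{\lambda}{2} \omega^T \omega \right)$$

Here $D$ is the dimension of $\omega$ and $\mathcal{D}$ denotes the observed data.

Log posterior probability:

$$\log P(\omega \mid \mathcal{D}) = \log \frac{P(\omega) P(\mathcal{D} \mid \omega)}{P(\mathcal{D})} = \log P(\omega) + \log P(\mathcal{D} \mid \omega) - \underbrace{\log P(\mathcal{D})}_{\text{constant}}$$

Maximum-a-posteriori solution:

$$
\begin{aligned}
\hat{\omega}_{MAP} &= \arg\max_{\omega} \log P(\omega \mid \mathcal{D}) \\
&= \arg\max_{\omega} \left\{ \log P(\omega) + \log P(\mathcal{D} \mid \omega) \right\} \\
&= \arg\max_{\omega} \left\{ \frac{D}{2} \log \frac{\lambda}{2\pi} - \frac{\lambda}{2} \omega^T \omega + \sum_{n=1}^{m} \left\{ -\frac{1}{2} \log(2\pi\sigma^2) - \frac{(y_n - \omega^T x_n)^2}{2\sigma^2} \right\} \right\} \\
&= \arg\min_{\omega} \; \frac{1}{2\sigma^2} \sum_{n=1}^{m} (y_n - \omega^T x_n)^2 + \frac{\lambda}{2} \omega^T \omega
\end{aligned}
$$

(ignoring constants and changing max to min)

For $\sigma = 1$ (or some constant) for each input, this is equivalent to the regularized least-squares objective.

BIG lesson: MAP = $\ell_2$ norm regularization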
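To make the link between the Gaussian prior and regularized least squares concrete, here is a minimal sketch in plain Python. It assumes a two-dimensional weight vector, a zero-mean Gaussian prior with precision `lam`, and noise standard deviation `sigma`. Setting the gradient of the MAP objective to zero gives the linear system (X^T X + lam * sigma^2 * I) w = X^T y, which the sketch solves directly. The function names and the synthetic data are illustrative assumptions.

```python
import math
import random

def map_weights_2d(xs, ys, sigma, lam):
    """Solve (X^T X + lam * sigma^2 * I) w = X^T y for a 2-dimensional w.

    This is the stationarity condition of
    (1 / (2 sigma^2)) * sum_n (y_n - w^T x_n)^2 + (lam / 2) * w^T w.
    """
    ridge = lam * sigma ** 2
    a11 = sum(x[0] * x[0] for x in xs) + ridge
    a12 = sum(x[0] * x[1] for x in xs)
    a22 = sum(x[1] * x[1] for x in xs) + ridge
    b1 = sum(x[0] * y for x, y in zip(xs, ys))
    b2 = sum(x[1] * y for x, y in zip(xs, ys))
    det = a11 * a22 - a12 * a12
    return [(b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]

def norm(w):
    """Euclidean (l2) norm of a weight vector."""
    return math.sqrt(sum(wi * wi for wi in w))

# Small synthetic data set (assumed for illustration).
random.seed(1)
sigma = 1.0
xs = [[random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0)] for _ in range(20)]
ys = [3.0 * x[0] - 2.0 * x[1] + random.gauss(0.0, sigma) for x in xs]

lams = [0.0, 1.0, 10.0, 100.0]
norms = []
for lam in lams:
    w = map_weights_2d(xs, ys, sigma, lam)
    norms.append(norm(w))
    print("lam =", lam, " w =", w, " ||w|| =", norms[-1])
```

With `lam = 0.0` the system reduces to the ordinary normal equations, so the MLE is recovered as a special case. Larger values of `lam` correspond to a tighter prior around zero and shrink the norm of the solution.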

1.3. Summary: MLE vs MAP

MLE solution:

$$\hat{\omega}_{MLE} = \arg\min_{\omega} \; \frac{1}{2\sigma^2} \sum_{n=1}^{m} (y_n - \omega^T x_n)^2$$

MAP solution:

$$\hat{\omega}_{MAP} = \arg\min_{\omega} \; \frac{1}{2\sigma^2} \sum_{n=1}^{m} (y_n - \omega^T x_n)^2 + \frac{\lambda}{2} \omega^T \omega$$

Take-home messages:

- MLE estimation of a parameter leads to unregularized solutions
- MAP estimation of a parameter leads to regularized solutions
- The prior distribution acts as a regularizer in MAP estimation

Note: for MAP, different prior distributions lead to different regularizers

- A Gaussian prior on $\omega$ regularizes the $\ell_2$ norm of $\omega$
- A Laplace prior $\exp(-C \lVert \omega \rVert_1)$ on $\omega$ regularizes the $\ell_1$ norm of $\omega$

2. Probabilistic Linear Classification

Often we do not just care about predicting the label $y$ for an example. Rather, we want to predict the label probabilities $P(y \mid x, \omega)$.

- E.g., $P(y = +1 \mid x, \omega)$: the probability that the label is $+1$
- In a sense, it is our confidence in the predicted label

Probabilistic classification models allow us to do that. Consider the following function, written in a compact form for labels $y \in \{-1, +1\}$:

$$P(y \mid x, \omega) = \sigma(y \, \omega^T x) = \frac{1}{1 + \exp(-y \, \omega^T x)}$$

$\sigma$ is the logistic function, which maps every real number into $(0, 1)$.
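The compact expression for the label probability can be tried out directly. The sketch below, in plain Python, evaluates the logistic function in a way that avoids overflow for large negative arguments and confirms that the probabilities of the two labels sum to one. The weights `w` and input `x` are hypothetical values chosen only for illustration.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def label_probability(y, x, w):
    """P(y | x, w) = sigmoid(y * w^T x) for a label y in {-1, +1}."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return sigmoid(y * score)

w = [0.5, -1.0]   # hypothetical weights
x = [2.0, 0.3]    # hypothetical input, w^T x = 0.7

p_pos = label_probability(+1, x, w)
p_neg = label_probability(-1, x, w)
print("P(y=+1 | x, w) =", p_pos)
print("P(y=-1 | x, w) =", p_neg)
print("sum            =", p_pos + p_neg)
```

Because sigmoid(z) + sigmoid(-z) = 1, a single function of the signed score `y * w^T x` covers both labels, which is what makes the compact form convenient.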

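2.1. Logistic Regression

What does the decision boundary look like for logistic regression?

At the decision boundary, the labels $-1$ and $+1$ become equiprobable:

$$
\begin{aligned}
P(y = +1 \mid x, \omega) &= P(y = -1 \mid x, \omega) \\
\frac{1}{1 + \exp(-\omega^T x)} &= \frac{1}{1 + \exp(\omega^T x)} \\
\exp(-\omega^T x) &= \exp(\omega^T x) \\
\omega^T x &= 0
\end{aligned}
$$

The decision boundary is therefore linear, so logistic regression is a linear classifier. (Note: it is possible to kernelize it and make it nonlinear.)

2.2. Maximum Likelihood Solution

Goal: estimate $\omega$ from the data $\mathcal{D} = \{(x_1, y_1), \ldots, (x_m, y_m)\}$.

Log-likelihood:

$$
\begin{aligned}
\ell(\omega) = \log L(\omega) &= \log P(\mathcal{D} \mid \omega) = \log P(Y \mid X, \omega) \\
&= \log \prod_{n=1}^{m} P(y_n \mid x_n, \omega) = \sum_{n=1}^{m} \log P(y_n \mid x_n, \omega) \\
&= \sum_{n=1}^{m} \log \frac{1}{1 + \exp(-y_n \omega^T x_n)} \\
&= -\sum_{n=1}^{m} \log\left[ 1 + \exp(-y_n \omega^T x_n) \right]
\end{aligned}
$$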
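The log-likelihood of the logistic model is a sum of terms of the form -log(1 + exp(-y * w^T x)). A minimal sketch of evaluating it in plain Python follows. It uses a numerically stable `softplus` helper, since exponentiating a large score directly would overflow. The four data points and the candidate weight vectors are hypothetical, with labels in {-1, +1} and a constant first feature.

```python
import math

def softplus(z):
    """log(1 + exp(z)) computed stably for large |z|."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def log_likelihood(w, xs, ys):
    """Sum over n of -log(1 + exp(-y_n * w^T x_n)), labels in {-1, +1}."""
    total = 0.0
    for x, y in zip(xs, ys):
        score = sum(wi * xi for wi, xi in zip(w, x))
        total -= softplus(-y * score)
    return total

# Hypothetical data: positive second feature goes with label +1.
xs = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5], [1.0, -2.5]]
ys = [+1, -1, +1, -1]

ll_zero = log_likelihood([0.0, 0.0], xs, ys)    # every label gets probability 1/2
ll_good = log_likelihood([0.0, 1.0], xs, ys)    # scores agree with the labels
ll_bad = log_likelihood([0.0, -1.0], xs, ys)    # scores disagree with the labels
print(ll_zero, ll_good, ll_bad)
```

At zero weights every example has probability 1/2, so the log-likelihood equals m * log(1/2). Weights whose scores agree in sign with the labels raise it, and weights that disagree lower it.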

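Maximum likelihood solution:

$$\hat{\omega}_{MLE} = \arg\max_{\omega} \log L(\omega) = \arg\min_{\omega} \sum_{n=1}^{m} \log\left[ 1 + \exp(-y_n \omega^T x_n) \right]$$

No closed-form solution exists, but we can do gradient descent on the negative log-likelihood with respect to $\omega$, using

$$
\begin{aligned}
\nabla_{\omega} \log L(\omega) &= -\sum_{n=1}^{m} \frac{1}{1 + \exp(-y_n \omega^T x_n)} \exp(-y_n \omega^T x_n) \, (-y_n x_n) \\
&= \sum_{n=1}^{m} \frac{1}{1 + \exp(y_n \omega^T x_n)} \, y_n x_n
\end{aligned}
$$

2.3. Maximum-a-Posteriori Solution

Let's assume a Gaussian prior distribution over the weight vector $\omega$:

$$P(\omega) = \mathcal{N}(\omega \mid 0, \lambda^{-1} I) = \left( \frac{\lambda}{2\pi} \right)^{D/2} \exp\left( -\frac{\lambda}{2} \omega^T \omega \right)$$

Maximum-a-posteriori solution:

$$
\begin{aligned}
\hat{\omega}_{MAP} &= \arg\max_{\omega} \log P(\omega \mid \mathcal{D}) \\
&= \arg\max_{\omega} \{ \log P(\omega) + \log P(\mathcal{D} \mid \omega) - \underbrace{\log P(\mathcal{D})}_{\text{constant}} \} \\
&= \arg\max_{\omega} \{ \log P(\omega) + \log P(\mathcal{D} \mid \omega) \} \\
&= \arg\max_{\omega} \left\{ \frac{D}{2} \log \frac{\lambda}{2\pi} - \frac{\lambda}{2} \omega^T \omega - \sum_{n=1}^{m} \log\left[ 1 + \exp(-y_n \omega^T x_n) \right] \right\} \\
&= \arg\min_{\omega} \; \sum_{n=1}^{m} \log\left[ 1 + \exp(-y_n \omega^T x_n) \right] + \frac{\lambda}{2} \omega^T \omega
\end{aligned}
$$

(ignoring constants and changing max to min)

BIG lesson: MAP = $\ell_2$ norm regularization

No closed-form solution exists, but we can again do gradient descent on $\omega$.

See "A comparison of numerical optimizers for logistic regression" by Tom Minka on optimization techniques (gradient descent and others) for logistic regression (both MLE and MAP).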
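Since neither the MLE nor the MAP problem has a closed-form solution for logistic regression, both are solved iteratively. The following is a minimal sketch of plain gradient descent in standard-library Python, assuming labels in {-1, +1}, a Gaussian prior with precision `lam` (so `lam = 0.0` gives the MLE), a fixed step size, and a fixed number of steps. The function names, the synthetic data, and the hyperparameter values are illustrative assumptions, and this is not one of the optimizers compared in Minka's note.

```python
import math
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sigmoid(z):
    """Logistic function, evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def softplus(z):
    """log(1 + exp(z)) computed stably."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def objective(w, xs, ys, lam):
    """sum_n log(1 + exp(-y_n w^T x_n)) + (lam / 2) w^T w."""
    nll = sum(softplus(-y * dot(w, x)) for x, y in zip(xs, ys))
    return nll + 0.5 * lam * dot(w, w)

def gradient(w, xs, ys, lam):
    """Gradient of `objective`: -sum_n y_n x_n / (1 + exp(y_n w^T x_n)) + lam * w."""
    g = [lam * wi for wi in w]
    for x, y in zip(xs, ys):
        coeff = -y * sigmoid(-y * dot(w, x))
        for j, xj in enumerate(x):
            g[j] += coeff * xj
    return g

def fit(xs, ys, lam=0.0, lr=0.5, steps=500):
    """Gradient descent from w = 0; each step is scaled by 1 / len(xs)."""
    w = [0.0] * len(xs[0])
    for _ in range(steps):
        g = gradient(w, xs, ys, lam)
        w = [wi - lr * gi / len(xs) for wi, gi in zip(w, g)]
    return w

# Synthetic, non-separable data drawn from a logistic model (assumed for illustration).
random.seed(0)
true_w = [0.5, 1.5]
xs = [[1.0, random.uniform(-3.0, 3.0)] for _ in range(200)]
ys = [+1 if random.random() < sigmoid(dot(true_w, x)) else -1 for x in xs]

w_mle = fit(xs, ys, lam=0.0)
w_map = fit(xs, ys, lam=10.0)
print("MLE weights:", w_mle, " objective:", objective(w_mle, xs, ys, 0.0))
print("MAP weights:", w_map, " objective:", objective(w_map, xs, ys, 10.0))
```

The only difference between the two fits is the extra `lam * w` term in the gradient, which pulls the MAP weights toward zero. On linearly separable data the unregularized weights would grow without bound, which is one practical reason to prefer the MAP estimate.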

2.4. Summary: MLE vs MAP

MLE solution:

$$\hat{\omega}_{MLE} = \arg\min_{\omega} \sum_{n=1}^{m} \log\left[ 1 + \exp(-y_n \omega^T x_n) \right]$$

MAP solution:

$$\hat{\omega}_{MAP} = \arg\min_{\omega} \sum_{n=1}^{m} \log\left[ 1 + \exp(-y_n \omega^T x_n) \right] + \frac{\lambda}{2} \omega^T \omega$$

Take-home messages (we already saw these before):

- MLE estimation of a parameter leads to unregularized solutions
- MAP estimation of a parameter leads to regularized solutions
- The prior distribution acts as a regularizer in MAP estimation

Note: for MAP, different prior distributions lead to different regularizers

- A Gaussian prior on $\omega$ regularizes the $\ell_2$ norm of $\omega$
- A Laplace prior $\exp(-C \lVert \omega \rVert_1)$ on $\omega$ regularizes the $\ell_1$ norm of $\omega$

3. Probabilistic Clustering

This topic will not be covered in this course.

4. Probabilistic Dimension Reduction

This topic will not be covered in this course.
