CS 6140: Machine Learning Spring 2016 Instructor: Lu Wang College of Computer and Information Science Northeastern University Webpage: www.ccs.neu.edu/home/luwang Email: luwang@ccs.neu.edu
Logistics Assignment 1 Due Feb 4 Electronic copy on Blackboard Hard copy in class If you have discussed a problem with someone or got the idea from other sources (e.g., academic publications, lectures, textbooks), you need to acknowledge it! Northeastern University Academic Integrity Policy http://www.northeastern.edu/osccr/academicintegrity-policy/
Survey What do you expect to learn from this course? Content of the course Difficulty of the material Difficulty of the assignments Amount of programming
What We Learned Last Week Generative Model and Discriminative Model Logistic Regression Generative Models Generative Models vs. Discriminative Models Decision Tree
Generative vs. Discriminative Model Generative model: Learn P(X, Y) from training samples P(X, Y) = P(Y)P(X|Y) Specifies how to generate the observed features x for y Discriminative model: Learn P(Y|X) from training samples Directly models the mapping from features x to y
Generative vs. Discriminative Model Easy to fit the model
Generative vs. Discriminative Model Easy to fit the model Generative model!
Generative vs. Discriminative Model Fit classes separately
Generative vs. Discriminative Model Fit classes separately Generative model!
Generative vs. Discriminative Model Handle missing features easily
Generative vs. Discriminative Model Handle missing features easily Generative model!
Generative vs. Discriminative Model Handle unlabeled training data
Generative vs. Discriminative Model Handle unlabeled training data Easier for generative model!
Generative vs. Discriminative Model Symmetric in inputs and outputs
Generative vs. Discriminative Model Symmetric in inputs and outputs Generative model! Define p(x, y)
Generative vs. Discriminative Model Handle feature preprocessing
Generative vs. Discriminative Model Handle feature preprocessing Discriminative model!
Generative vs. Discriminative Model Well-calibrated probabilities
Generative vs. Discriminative Model Well-calibrated probabilities Discriminative model!
Logistic Regression A discriminative model; sigm is the sigmoid function
Logistic Regression
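The sigmoid and the resulting prediction rule can be sketched in a few lines of Python (the function names and example weights are illustrative, not from the slides):

```python
import math

def sigmoid(a):
    # sigm(a) = 1 / (1 + exp(-a)), mapping any real score to (0, 1)
    return 1.0 / (1.0 + math.exp(-a))

def predict_prob(w, x):
    # Logistic regression models P(y = 1 | x, w) = sigm(w . x)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

print(sigmoid(0.0))  # 0.5: a score of zero gives an even probability
```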
Bayesian Inference
Bayes' Rule
Play tennis? Decision Tree
Entropy Entropy H(X) of a random variable X H(X) is the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code)
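The definition above can be sketched directly, a minimal illustration assuming a discrete distribution given as a list of probabilities:

```python
import math

def entropy(probs):
    # H(X) = -sum_x p(x) log2 p(x); expected bits under an optimal code
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0: a fair coin needs one bit per draw
print(entropy([1.0]))       # 0.0: a certain outcome needs no bits
```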
Information Gain Gain(S, A) = expected reduction in entropy due to sorting on A
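A small sketch of Gain(S, A), assuming examples are tuples of discrete attribute values (the attribute values 'sunny'/'rain' in the usage line are made up for illustration):

```python
import math
from collections import Counter

def label_entropy(labels):
    # H(S) = -sum_c p(c) log2 p(c) over class frequencies in S
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, labels, attr):
    # Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v), where S_v holds the
    # examples with attribute A equal to value v
    n = len(labels)
    by_value = {}
    for x, y in zip(examples, labels):
        by_value.setdefault(x[attr], []).append(y)
    return label_entropy(labels) - sum(
        (len(sub) / n) * label_entropy(sub) for sub in by_value.values())

X = [('sunny',), ('sunny',), ('rain',), ('rain',)]
y = ['no', 'no', 'yes', 'yes']
print(info_gain(X, y, 0))  # 1.0: this attribute perfectly predicts the label
```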
Today's Outline Bayesian Statistics Frequentist Statistics Feature Selection Some slides are borrowed from Kevin Murphy's lectures
Fundamental principle of Bayesian statistics Everything that is uncertain is modeled with a probability distribution. Parameters Hyper-parameters Everything that is known is incorporated by conditioning on it, using Bayes' rule to update our prior beliefs into posterior beliefs.
Fundamental principle of Bayesian statistics Everything that is uncertain is modeled with a probability distribution. Parameters Hyper-parameters Everything that is known is incorporated by conditioning on it, using Bayes' rule to update our prior beliefs into posterior beliefs. Posterior Prior Likelihood
Advantages of Bayes Conceptually simple Handles small sample sizes Handles complex hierarchical models without overfitting No need to choose between different estimators or hypothesis testing procedures
Disadvantages of Bayes Need to specify a prior! Computational issues!
Disadvantages of Bayes Need to specify a prior! Subjective But every model comes with its own assumptions Estimate the prior from data -> empirical Bayes
Disadvantages of Bayes Computational issues! Computing the normalization constant requires integrating over all the parameters Computing posterior expectations requires integrating over all the parameters
Approximate inference We can evaluate posterior expectations using Monte Carlo integration
Monte Carlo Approximation In general, computing the distribution of a function of a random variable using the change of variables is difficult. A powerful alternative: Generate samples from the distribution Use Monte Carlo to approximate the expected value of any function of a random variable
Monte Carlo Approximation Many useful functions that we can approximate
Monte Carlo Approximation Suppose we have x ~ p(x) and y = x^2. We can approximate p(y) by drawing samples from p(x), squaring them, and computing the empirical distribution.
Monte Carlo Approximation Suppose we have x ~ p(x) and y = x^2. (plot of the resulting p(y))
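The squaring example can be sketched as follows; the slides do not fix p(x), so this assumes x ~ Uniform(-1, 1) purely for illustration (for that choice the exact mean of y = x^2 is 1/3):

```python
import random

random.seed(0)
N = 100_000
# Assume x ~ Uniform(-1, 1); draw samples, square them, and use the
# empirical distribution of y = x^2 to approximate E[y].
ys = [random.uniform(-1.0, 1.0) ** 2 for _ in range(N)]
mc_mean = sum(ys) / N
print(mc_mean)  # close to the exact value E[y] = 1/3
```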
Disadvantages of Bayes Computational issues! Computing the normalization constant requires integrating over all the parameters Computing posterior expectations requires integrating over all the parameters
Conjugate priors For simplicity, we will mostly focus on a special kind of prior which has nice mathematical properties. A prior p(θ) is said to be conjugate to a likelihood p(D|θ) if the corresponding posterior p(θ|D) has the same functional form as the prior.
Conjugate priors This means the prior family is closed under Bayesian updating. We can recursively apply the rule to update our beliefs as data streams in -> online learning
Coin Tossing Example Consider the problem of estimating the probability of heads θ from a sequence of N coin tosses: Likelihood Prior Posterior
Likelihood: Binomial distribution Let X = number of heads in N trials. P(X = k | θ, N) = C(N, k) θ^k (1 − θ)^(N−k)
Likelihood: Bernoulli Distribution The Binomial distribution with N = 1 is called the Bernoulli distribution. Specifically, P(X = 1 | θ) = θ and P(X = 0 | θ) = 1 − θ.
Fitting a Bernoulli distribution Suppose we conduct N = 100 trials and get data D = (1, 0, 1, 1, 0, …) with N1 heads and N0 tails. What is θ?
Fitting a Bernoulli distribution Suppose we conduct N = 100 trials and get data D = (1, 0, 1, 1, 0, …) with N1 heads and N0 tails. What is θ? Maximum likelihood estimation
Fitting a Bernoulli distribution
Fitting a Bernoulli distribution Log-likelihood
Fitting a Bernoulli distribution Log-likelihood
Fitting a Bernoulli distribution
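The MLE derived on these slides has the closed form θ̂ = N1/N, which is one line of code (the function name is illustrative):

```python
def bernoulli_mle(data):
    # Maximizing N1*log(theta) + N0*log(1 - theta) gives theta_hat = N1 / N
    return sum(data) / len(data)

print(bernoulli_mle([1, 0, 1, 1, 0]))  # 0.6
```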
Conjugate priors: The beta-Bernoulli model Consider the probability of heads, given a sequence of N coin tosses, X1, …, XN. Likelihood Natural conjugate prior is the Beta distribution Posterior is also Beta, with updated counts
The beta distribution Beta distribution Beta function
The beta distribution
Updating a beta distribution Prior is Beta(2,2). Observe 1 head. Posterior is Beta(3,2), so the mean shifts from 2/4 to 3/5. Prior is Beta(3,2). Observe 1 head. Posterior is Beta(4,2), so the mean shifts from 3/5 to 4/6.
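The conjugate update on this slide is just count addition; a minimal sketch (function names are illustrative):

```python
def beta_update(a, b, n_heads, n_tails):
    # Beta(a, b) prior plus coin counts gives a Beta(a + N1, b + N0) posterior
    return a + n_heads, b + n_tails

def beta_mean(a, b):
    # The mean of Beta(a, b) is a / (a + b)
    return a / (a + b)

a, b = beta_update(2, 2, 1, 0)  # Beta(2,2) prior, observe one head
print(a, b, beta_mean(a, b))    # 3 2 0.6 -- the mean shifts from 2/4 to 3/5
```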
Setting the hyper-parameters The prior hyper-parameters can be interpreted as pseudo counts The effective sample size (strength) of the prior is The prior mean is If our prior belief is p(heads) = 0.3, and we think this belief is equivalent to about 10 data points, we just solve
Point Estimation The posterior is our belief state. To convert it into a single best guess (point estimate), we pick the value that minimizes some loss function, e.g., MSE -> posterior mean, 0/1 loss -> posterior mode
Posterior Mean Let N = N1 + N0 be the amount of data, and let the prior strength a + b be the amount of virtual data The posterior mean is a convex combination of the prior mean and the MLE N1/N
MAP Estimation It is often easier to compute the posterior mode (optimization) than the posterior mean (integration). This is called maximum a posteriori estimation. For the beta distribution
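For the beta posterior both point estimates are closed-form; a sketch assuming a Beta(a, b) prior with counts N1, N0 (the Beta(1,1) numbers in the usage lines are illustrative):

```python
def beta_posterior_mean(a, b, n1, n0):
    # E[theta | D] = (a + N1) / (a + b + N1 + N0)
    return (a + n1) / (a + b + n1 + n0)

def beta_map(a, b, n1, n0):
    # Mode of Beta(a + N1, b + N0); valid when both parameters exceed 1
    return (a + n1 - 1) / (a + b + n1 + n0 - 2)

# With a uniform Beta(1,1) prior the MAP estimate reduces to the MLE N1/N:
print(beta_map(1, 1, 3, 1))             # 0.75
print(beta_posterior_mean(1, 1, 3, 1))  # 4/6, pulled toward the prior mean
```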
Summary of the beta-Bernoulli model
Bayesian Model Selection Faced with a set of models of different complexity, how should we choose?
Bayesian Model Selection Cross-validation Divide the training set into N partitions Train on N-1 partitions, and evaluate on the rest In total, fitting the model N times
Bayesian Model Selection Compute the posterior Then compute the MAP
Bayesian Model Selection Compute the posterior Uniform prior over models Then we are picking the model which maximizes the marginal likelihood, integrated likelihood, or evidence
Bayes Factors To compare two models, use the posterior odds Bayes factor The Bayes factor is a Bayesian version of a likelihood ratio test that can be used to compare models of different complexity
Example: Coin Flipping Suppose we toss a coin N = 250 times and observe N1 = 141 heads and N0 = 109 tails
Example: Coin Flipping Suppose we toss a coin N = 250 times and observe N1 = 141 heads and N0 = 109 tails Consider two hypotheses: H0: H1:
Example: Coin Flipping
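A sketch of the comparison, assuming H0 fixes θ = 0.5 and H1 integrates θ out under a uniform Beta(1,1) prior (that prior choice is an assumption here); the marginal likelihood under a Beta prior is p(D|H1) = B(a+N1, b+N0)/B(a, b):

```python
import math

def log_beta(a, b):
    # log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marglik_h1(n1, n0, a=1.0, b=1.0):
    # Integrated likelihood under an assumed Beta(a, b) prior:
    # p(D | H1) = B(a + N1, b + N0) / B(a, b)
    return log_beta(a + n1, b + n0) - log_beta(a, b)

n1, n0 = 141, 109
log_p_h0 = (n1 + n0) * math.log(0.5)  # H0: theta fixed at 0.5
log_p_h1 = log_marglik_h1(n1, n0)     # H1: theta unknown, integrated out
# Log Bayes factor for H1 over H0; a negative value favors the simpler H0
print(log_p_h1 - log_p_h0)
```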
Bayesian Occam's Razor Occam's Razor
Bayesian Occam's Razor Occam's Razor Simplest model that adequately explains the data
Bayesian Occam's Razor Occam's Razor Simplest model that adequately explains the data Using MLE or MAP to estimate parameters and then selecting models would always favor the model with the most parameters Integrate out the parameters!
Bayesian Occam's Razor Overfitting early samples
Bayesian Occam's Razor Probability over all possible datasets Complex models must spread their probability mass thinly
Bayesian Occam's Razor Complex models must spread their probability mass thinly
Marginal likelihood When performing Bayesian model selection and empirical Bayes estimation, we will need the marginal likelihood p(D). This is given by a ratio of the posterior and prior normalizing constants
Summary of the beta-Bernoulli model
From coins to dice
Multinomial: 1 sample One-hot encoding Probability for class k
Likelihood
Conjugate Prior: Dirichlet distribution Generalization of the Beta to K dimensions Normalization constant
Conjugate Prior: Dirichlet distribution Generalization of the Beta to K dimensions Example parameter settings: (20, 20, 20), (2, 2, 2), (20, 2, 2)
Summary of the Dirichlet-multinomial model
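The Dirichlet-multinomial update mirrors the beta-Bernoulli one, adding a count vector to the prior; a minimal sketch (function names and the example counts are illustrative):

```python
def dirichlet_posterior(alpha, counts):
    # Dirichlet(alpha) prior + multinomial counts -> Dirichlet(alpha + counts)
    return [a + n for a, n in zip(alpha, counts)]

def dirichlet_mean(alpha):
    # The mean of Dirichlet(alpha) is alpha_k / sum(alpha)
    s = sum(alpha)
    return [a / s for a in alpha]

post = dirichlet_posterior([1, 1, 1], [3, 1, 0])  # uniform prior, K = 3
print(post, dirichlet_mean(post))  # [4, 2, 1] and [4/7, 2/7, 1/7]
```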
Frequentist Statistics We have seen how Bayesian inference offers a principled solution to the parameter estimation problem.
Frequentist Statistics Parameter estimation MAP estimate MLE
Why maximum likelihood? The KL divergence from the true distribution p to the approximation q is
Why maximum likelihood? The KL divergence from the true distribution p to the approximation q is Empirical distribution
Maximum Likelihood = min KL (to the empirical distribution) KL divergence to the empirical distribution
Maximum Likelihood = min KL (to the empirical distribution) KL divergence to the empirical distribution Hence minimizing KL is equivalent to minimizing the average negative log likelihood on the training set
Bernoulli MLE Remember that
However Suppose we toss a coin N = 3 times and see 3 tails. We would estimate the probability of heads as 0.
However Suppose we toss a coin N = 3 times and see 3 tails. We would estimate the probability of heads as 0. Too few samples -> sparse data!
However Suppose we toss a coin N = 3 times and see 3 tails. We would estimate the probability of heads as 0. We can add pseudo counts C0 and C1 (e.g., 0.1) to the sufficient statistics N0 and N1 to get a better behaved estimate. This is the MAP estimate using a Beta prior.
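The pseudo-count fix on this slide can be sketched as follows (function name illustrative; 0.1 is the slide's example pseudo count):

```python
def smoothed_estimate(n1, n0, c1=0.1, c0=0.1):
    # Add pseudo counts to the sufficient statistics so an all-tails
    # sample no longer yields P(heads) = 0 (MAP under a Beta prior)
    return (n1 + c1) / (n1 + n0 + c1 + c0)

print(smoothed_estimate(0, 3))  # about 0.031 instead of the MLE's 0.0
```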
MLE for the multinomial If x_n ∈ {1, …, K}, the likelihood is The log-likelihood is
Computing the multinomial MLE
Computing the multinomial MLE
Computing the Gaussian MLE
Computing the Gaussian MLE
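The Gaussian MLE derived here is the sample mean and the average squared deviation; a minimal sketch (function name illustrative):

```python
def gaussian_mle(xs):
    # MLE for a Gaussian: mu_hat is the sample mean, sigma2_hat is the
    # average squared deviation (note: divides by N, not N - 1)
    n = len(xs)
    mu = sum(xs) / n
    sigma2 = sum((x - mu) ** 2 for x in xs) / n
    return mu, sigma2

mu, s2 = gaussian_mle([1.0, 2.0, 3.0])
print(mu, s2)  # 2.0 and 2/3
```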
Bayesian vs. Frequentist MLE returns a point estimate In frequentist statistics, we treat D as random and θ as fixed, and ask how the estimate would change if D changed. In Bayesian statistics, we treat D as fixed and θ as random, and model our uncertainty with the posterior
Unbiased estimators The bias of an estimator is defined as An estimator is unbiased if bias = 0.
Unbiased estimators The MLE for the Gaussian mean is unbiased
Is being unbiased enough?
Consistent estimators An estimator is consistent if it converges (in probability) to the true value with enough data The MLE is a consistent estimator.
Bias-variance tradeoff Being unbiased is not necessarily desirable! Suppose our loss function is mean squared error where
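The bias discussed above can be checked empirically: for Gaussian samples of size n, the variance MLE (dividing by n) has expectation (n-1)/n times the true variance. A simulation sketch (sample size, trial count, and seed are arbitrary choices):

```python
import random

random.seed(1)
n, trials = 5, 20000
total = 0.0
for _ in range(trials):
    xs = [random.gauss(0, 1) for _ in range(n)]  # true variance is 1.0
    mu = sum(xs) / n
    total += sum((x - mu) ** 2 for x in xs) / n  # MLE divides by n
avg_mle_var = total / trials
print(avg_mle_var)  # close to (n - 1) / n = 0.8, below the true variance 1.0
```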
Feature Selection If predictive accuracy is the goal, it is often best to keep all predictors and use L2 regularization We often want to select a subset of the inputs that are most relevant for predicting the output, to get sparse models: interpretability, speed, possibly better predictive accuracy
Filter methods Compute the relevance of each feature to the label marginally Computationally efficient
Correlation coefficient Measures the extent to which X_j and Y are linearly related
Correlation coefficient Mutual information Can model nonlinear, non-Gaussian dependencies For discrete data
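For discrete data the mutual information is a sum over the empirical joint distribution; a minimal sketch (function name illustrative, using base-2 logs so the result is in bits):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    # I(X; Y) = sum_{x,y} p(x, y) log2[ p(x, y) / (p(x) p(y)) ],
    # with all probabilities taken from empirical counts
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0: fully dependent
print(mutual_information([0, 1, 0, 1], [0, 0, 1, 1]))  # 0.0: independent
```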
Wrapper Methods Perform a discrete search in model space Wrap the search around standard model fitting
Wrapper Methods Forward selection for linear regression At each step, add the feature that maximally reduces the residual error
Wrapper Methods Forward selection for linear regression At each step, add the feature that maximally reduces the residual error
Wrapper Methods Forward selection for linear regression Plug the estimate in
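The greedy loop above can be sketched as follows. This is a simplified stagewise variant, an assumption on my part: each candidate feature is scored by a single-coefficient least-squares fit (no intercept) against the current residual, rather than refitting the full regression:

```python
def fit_one(feature, residual):
    # Least-squares coefficient of one feature against the residual:
    # w = <f, r> / <f, f>; returns the remaining SSE and new residual
    ff = sum(f * f for f in feature)
    w = 0.0 if ff == 0 else sum(f * r for f, r in zip(feature, residual)) / ff
    new_res = [r - w * f for f, r in zip(feature, residual)]
    return sum(r * r for r in new_res), new_res

def forward_select(features, y, k):
    # Greedy forward selection: repeatedly add the feature that most
    # reduces the residual sum of squares, k times
    residual, chosen = list(y), []
    for _ in range(k):
        candidates = [j for j in range(len(features)) if j not in chosen]
        best = min(candidates, key=lambda j: fit_one(features[j], residual)[0])
        chosen.append(best)
        residual = fit_one(features[best], residual)[1]
    return chosen

print(forward_select([[1, 2, 3], [1, 1, 1]], [2, 4, 6], 1))  # [0]
```

Here feature 0 explains y exactly (y = 2 * feature), so it is picked first.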
What we learned today Bayesian Statistics Frequentist Statistics Feature Selection Some slides are borrowed from Kevin Murphy's lectures
Homework Read Murphy Ch. 5, 6 Assignment 1 due 02/04, 6pm! Both hard copy and electronic copy