CS 6140: Machine Learning Spring 2016 Instructor: Lu Wang College of Computer and Information Science Northeastern University Webpage: www.ccs.neu.edu/home/luwang Email: luwang@ccs.neu.edu
Logistics Assignment 1 Due Feb 4 Electronic copy on Blackboard Hard copy in class If you have discussed a problem with someone or got the idea from other sources (e.g., academic publications, lectures, textbooks), you need to acknowledge it! Northeastern University Academic Integrity Policy http://www.northeastern.edu/osccr/academicintegrity-policy/
Survey What do you expect to learn from this course? Content of the course Difficulty of the material Difficulty of the assignments Amount of programming
What We Learned Last Week Generative Model and Discriminative Model Logistic Regression Generative Models Generative Models vs. Discriminative Models Decision Tree
Generative vs. Discriminative Model Generative model: Learn P(X, Y) from training samples P(X, Y) = P(Y)P(X|Y) Specifies how to generate the observed features x for y Discriminative model: Learn P(Y|X) from training samples Directly models the mapping from features x to y
Generative vs. Discriminative Model Easy to fit the model
Generative vs. Discriminative Model Easy to fit the model Generative model!
Generative vs. Discriminative Model Fit classes separately
Generative vs. Discriminative Model Fit classes separately Generative model!
Generative vs. Discriminative Model Handle missing features easily
Generative vs. Discriminative Model Handle missing features easily Generative model!
Generative vs. Discriminative Model Handle unlabeled training data
Generative vs. Discriminative Model Handle unlabeled training data Easier for generative model!
Generative vs. Discriminative Model Symmetric in inputs and outputs
Generative vs. Discriminative Model Symmetric in inputs and outputs Generative model! Define p(x, y)
Generative vs. Discriminative Model Handle feature preprocessing
Generative vs. Discriminative Model Handle feature preprocessing Discriminative model!
Generative vs. Discriminative Model Well-calibrated probabilities
Generative vs. Discriminative Model Well-calibrated probabilities Discriminative model!
Logistic Regression A discriminative model; sigm is the sigmoid function
Logistic Regression
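The sigmoid and the resulting prediction rule can be sketched in a few lines of Python (the function names and example weights are illustrative, not from the slides):

```python
import math

def sigmoid(a):
    # sigm(a) = 1 / (1 + exp(-a)), mapping any real score to (0, 1)
    return 1.0 / (1.0 + math.exp(-a))

def predict_prob(w, x):
    # Logistic regression models P(y = 1 | x, w) = sigm(w . x)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

print(sigmoid(0.0))  # 0.5: a score of zero gives an even probability
```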
Bayesian Inference
Bayes' Rule
Play tennis? Decision Tree
Entropy Entropy H(X) of a random variable X H(X) is the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code)
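The definition above can be sketched directly, a minimal illustration assuming a discrete distribution given as a list of probabilities:

```python
import math

def entropy(probs):
    # H(X) = -sum_x p(x) log2 p(x); expected bits under an optimal code
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0: a fair coin needs one bit per draw
print(entropy([1.0]))       # 0.0: a certain outcome needs no bits
```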
Information Gain Gain(S, A) = expected reduction in entropy due to sorting on A
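A small sketch of Gain(S, A), assuming examples are tuples of discrete attribute values (the attribute values 'sunny'/'rain' in the usage line are made up for illustration):

```python
import math
from collections import Counter

def label_entropy(labels):
    # H(S) = -sum_c p(c) log2 p(c) over class frequencies in S
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, labels, attr):
    # Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v), where S_v holds the
    # examples with attribute A equal to value v
    n = len(labels)
    by_value = {}
    for x, y in zip(examples, labels):
        by_value.setdefault(x[attr], []).append(y)
    return label_entropy(labels) - sum(
        (len(sub) / n) * label_entropy(sub) for sub in by_value.values())

X = [('sunny',), ('sunny',), ('rain',), ('rain',)]
y = ['no', 'no', 'yes', 'yes']
print(info_gain(X, y, 0))  # 1.0: this attribute perfectly predicts the label
```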
Today's Outline Bayesian Statistics Frequentist Statistics Feature Selection Some slides are borrowed from Kevin Murphy's lectures
Fundamental principle of Bayesian statistics Everything that is uncertain is modeled with a probability distribution. Parameters Hyper-parameters Everything that is known is incorporated by conditioning on it, using Bayes' rule to update our prior beliefs into posterior beliefs.
Fundamental principle of Bayesian statistics Everything that is uncertain is modeled with a probability distribution. Parameters Hyper-parameters Everything that is known is incorporated by conditioning on it, using Bayes' rule to update our prior beliefs into posterior beliefs. Posterior Prior Likelihood
Advantages of Bayes Conceptually simple Handles small sample sizes Handles complex hierarchical models without overfitting No need to choose between different estimators or hypothesis testing procedures
Disadvantages of Bayes Need to specify a prior! Computational issues!
Disadvantages of Bayes Need to specify a prior! Subjective But every model comes with its own assumptions Estimate the prior from data -> empirical Bayes
Disadvantages of Bayes Computational issues! Computing the normalization constant requires integrating over all the parameters Computing posterior expectations requires integrating over all the parameters
Approximate inference We can evaluate posterior expectations using Monte Carlo integration
Monte Carlo Approximation In general, computing the distribution of a function of a random variable using the change of variables is difficult. A powerful alternative: Generate samples from the distribution Use Monte Carlo to approximate the expected value of any function of a random variable
Monte Carlo Approximation Many useful functions that we can approximate
Monte Carlo Approximation Suppose we have x ~ p(x) and y = x^2. We can approximate p(y) by drawing samples from p(x), squaring them, and computing the empirical distribution.
Monte Carlo Approximation Suppose we have x ~ p(x) and y = x^2. (plot of the resulting p(y))
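The squaring example can be sketched as follows; the slides do not fix p(x), so this assumes x ~ Uniform(-1, 1) purely for illustration (for that choice the exact mean of y = x^2 is 1/3):

```python
import random

random.seed(0)
N = 100_000
# Assume x ~ Uniform(-1, 1); draw samples, square them, and use the
# empirical distribution of y = x^2 to approximate E[y].
ys = [random.uniform(-1.0, 1.0) ** 2 for _ in range(N)]
mc_mean = sum(ys) / N
print(mc_mean)  # close to the exact value E[y] = 1/3
```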
Disadvantages of Bayes Computational issues! Computing the normalization constant requires integrating over all the parameters Computing posterior expectations requires integrating over all the parameters
Conjugate priors For simplicity, we will mostly focus on a special kind of prior which has nice mathematical properties. A prior p(θ) is said to be conjugate to a likelihood p(D|θ) if the corresponding posterior p(θ|D) has the same functional form as the prior.
Conjugate priors This means the prior family is closed under Bayesian updating. We can recursively apply the rule to update our beliefs as data streams in -> online learning
Coin Tossing Example Consider the problem of estimating the probability of heads θ from a sequence of N coin tosses: Likelihood Prior Posterior
Likelihood: Binomial distribution Let X = number of heads in N trials. P(X = k | θ, N) = C(N, k) θ^k (1 − θ)^(N−k)
Likelihood: Bernoulli Distribution The Binomial distribution with N = 1 is called the Bernoulli distribution. Specifically, P(X = 1 | θ) = θ and P(X = 0 | θ) = 1 − θ.
Fitting a Bernoulli distribution Suppose we conduct N = 100 trials and get data D = (1, 0, 1, 1, 0, …) with N1 heads and N0 tails. What is θ?
Fitting a Bernoulli distribution Suppose we conduct N = 100 trials and get data D = (1, 0, 1, 1, 0, …) with N1 heads and N0 tails. What is θ? Maximum likelihood estimation
Fitting a Bernoulli distribution
Fitting a Bernoulli distribution Log-likelihood
Fitting a Bernoulli distribution Log-likelihood
Fitting a Bernoulli distribution
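The MLE derived on these slides has the closed form θ̂ = N1/N, which is one line of code (the function name is illustrative):

```python
def bernoulli_mle(data):
    # Maximizing N1*log(theta) + N0*log(1 - theta) gives theta_hat = N1 / N
    return sum(data) / len(data)

print(bernoulli_mle([1, 0, 1, 1, 0]))  # 0.6
```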
Conjugate priors: The beta-Bernoulli model Consider the probability of heads, given a sequence of N coin tosses, X1, …, XN. Likelihood Natural conjugate prior is the Beta distribution Posterior is also Beta, with updated counts
The beta distribution Beta distribution Beta function
The beta distribution
Updating a beta distribution Prior is Beta(2,2). Observe 1 head. Posterior is Beta(3,2), so the mean shifts from 2/4 to 3/5. Prior is Beta(3,2). Observe 1 head. Posterior is Beta(4,2), so the mean shifts from 3/5 to 4/6.
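The conjugate update on this slide is just count addition; a minimal sketch (function names are illustrative):

```python
def beta_update(a, b, n_heads, n_tails):
    # Beta(a, b) prior plus coin counts gives a Beta(a + N1, b + N0) posterior
    return a + n_heads, b + n_tails

def beta_mean(a, b):
    # The mean of Beta(a, b) is a / (a + b)
    return a / (a + b)

a, b = beta_update(2, 2, 1, 0)  # Beta(2,2) prior, observe one head
print(a, b, beta_mean(a, b))    # 3 2 0.6 -- the mean shifts from 2/4 to 3/5
```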
Setting the hyper-parameters The prior hyper-parameters can be interpreted as pseudo counts The effective sample size (strength) of the prior is The prior mean is If our prior belief is p(heads) = 0.3, and we think this belief is equivalent to about 10 data points, we just solve
Point Estimation The posterior is our belief state. To convert it into a single best guess (point estimate), we pick the value that minimizes some loss function, e.g., MSE -> posterior mean, 0/1 loss -> posterior mode
Posterior Mean Let N = N1 + N0 be the amount of data, and let the prior strength a + b be the amount of virtual data The posterior mean is a convex combination of the prior mean and the MLE N1/N
MAP Estimation It is often easier to compute the posterior mode (optimization) than the posterior mean (integration). This is called maximum a posteriori estimation. For the beta distribution
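For the beta posterior both point estimates are closed-form; a sketch assuming a Beta(a, b) prior with counts N1, N0 (the Beta(1,1) numbers in the usage lines are illustrative):

```python
def beta_posterior_mean(a, b, n1, n0):
    # E[theta | D] = (a + N1) / (a + b + N1 + N0)
    return (a + n1) / (a + b + n1 + n0)

def beta_map(a, b, n1, n0):
    # Mode of Beta(a + N1, b + N0); valid when both parameters exceed 1
    return (a + n1 - 1) / (a + b + n1 + n0 - 2)

# With a uniform Beta(1,1) prior the MAP estimate reduces to the MLE N1/N:
print(beta_map(1, 1, 3, 1))             # 0.75
print(beta_posterior_mean(1, 1, 3, 1))  # 4/6, pulled toward the prior mean
```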
Summary of the beta-Bernoulli model
Bayesian Model Selection Faced with a set of models of different complexity, how should we choose?
Bayesian Model Selection Cross-validation Divide the training set into N partitions Train on N-1 partitions, and evaluate on the rest In total, fitting the model N times
Bayesian Model Selection Compute the posterior Then compute the MAP
Bayesian Model Selection Compute the posterior Uniform prior over models Then we are picking the model which maximizes the marginal likelihood, integrated likelihood, or evidence
Bayes Factors To compare two models, use the posterior odds Bayes factor The Bayes factor is a Bayesian version of a likelihood ratio test that can be used to compare models of different complexity
Example: Coin Flipping Suppose we toss a coin N = 250 times and observe N1 = 141 heads and N0 = 109 tails
Example: Coin Flipping Suppose we toss a coin N = 250 times and observe N1 = 141 heads and N0 = 109 tails Consider two hypotheses: H0: H1:
Example: Coin Flipping
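A sketch of the comparison, assuming H0 fixes θ = 0.5 and H1 integrates θ out under a uniform Beta(1,1) prior (that prior choice is an assumption here); the marginal likelihood under a Beta prior is p(D|H1) = B(a+N1, b+N0)/B(a, b):

```python
import math

def log_beta(a, b):
    # log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marglik_h1(n1, n0, a=1.0, b=1.0):
    # Integrated likelihood under an assumed Beta(a, b) prior:
    # p(D | H1) = B(a + N1, b + N0) / B(a, b)
    return log_beta(a + n1, b + n0) - log_beta(a, b)

n1, n0 = 141, 109
log_p_h0 = (n1 + n0) * math.log(0.5)  # H0: theta fixed at 0.5
log_p_h1 = log_marglik_h1(n1, n0)     # H1: theta unknown, integrated out
# Log Bayes factor for H1 over H0; a negative value favors the simpler H0
print(log_p_h1 - log_p_h0)
```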
Bayesian Occam's Razor Occam's Razor
Bayesian Occam's Razor Occam's Razor Simplest model that adequately explains the data
Bayesian Occam's Razor Occam's Razor Simplest model that adequately explains the data Using MLE or MAP to estimate parameters and then selecting models would always favor the model with the most parameters Integrate out the parameters!
Bayesian Occam's Razor Overfitting early samples
Bayesian Occam's Razor Probability over all possible datasets Complex models must spread their probability mass thinly
Bayesian Occam's Razor Complex models must spread their probability mass thinly
Marginal likelihood When performing Bayesian model selection and empirical Bayes estimation, we will need the marginal likelihood p(D). This is given by a ratio of the posterior and prior normalizing constants
Summary of the beta-Bernoulli model
From coins to dice
Multinomial: 1 sample One-hot encoding Probability for class k
Likelihood
Conjugate Prior: Dirichlet distribution Generalization of the Beta to K dimensions Normalization constant
Conjugate Prior: Dirichlet distribution Generalization of the Beta to K dimensions Example parameter settings: (20, 20, 20), (2, 2, 2), (20, 2, 2)
Summary of the Dirichlet-multinomial model
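The Dirichlet-multinomial update mirrors the beta-Bernoulli one, adding a count vector to the prior; a minimal sketch (function names and the example counts are illustrative):

```python
def dirichlet_posterior(alpha, counts):
    # Dirichlet(alpha) prior + multinomial counts -> Dirichlet(alpha + counts)
    return [a + n for a, n in zip(alpha, counts)]

def dirichlet_mean(alpha):
    # The mean of Dirichlet(alpha) is alpha_k / sum(alpha)
    s = sum(alpha)
    return [a / s for a in alpha]

post = dirichlet_posterior([1, 1, 1], [3, 1, 0])  # uniform prior, K = 3
print(post, dirichlet_mean(post))  # [4, 2, 1] and [4/7, 2/7, 1/7]
```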
Frequentist Statistics We have seen how Bayesian inference offers a principled solution to the parameter estimation problem.
Frequentist Statistics Parameter estimation MAP estimate MLE
Why maximum likelihood? The KL divergence from the true distribution p to the approximation q is
Why maximum likelihood? The KL divergence from the true distribution p to the approximation q is Empirical distribution
Maximum Likelihood = min KL (to the empirical distribution) KL divergence to the empirical distribution
Maximum Likelihood = min KL (to the empirical distribution) KL divergence to the empirical distribution Hence minimizing KL is equivalent to minimizing the average negative log likelihood on the training set
Bernoulli MLE Remember that
However Suppose we toss a coin N = 3 times and see 3 tails. We would estimate the probability of heads as 0.
However Suppose we toss a coin N = 3 times and see 3 tails. We would estimate the probability of heads as 0. Too few samples -> sparse data!
However Suppose we toss a coin N = 3 times and see 3 tails. We would estimate the probability of heads as 0. We can add pseudo counts C0 and C1 (e.g., 0.1) to the sufficient statistics N0 and N1 to get a better behaved estimate. This is the MAP estimate using a Beta prior.
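The pseudo-count fix on this slide can be sketched as follows (function name illustrative; 0.1 is the slide's example pseudo count):

```python
def smoothed_estimate(n1, n0, c1=0.1, c0=0.1):
    # Add pseudo counts to the sufficient statistics so an all-tails
    # sample no longer yields P(heads) = 0 (MAP under a Beta prior)
    return (n1 + c1) / (n1 + n0 + c1 + c0)

print(smoothed_estimate(0, 3))  # about 0.031 instead of the MLE's 0.0
```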
MLE for the multinomial If x_n ∈ {1, …, K}, the likelihood is The log-likelihood is
Computing the multinomial MLE
Computing the multinomial MLE
Computing the Gaussian MLE
Computing the Gaussian MLE
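The Gaussian MLE derived here is the sample mean and the average squared deviation; a minimal sketch (function name illustrative):

```python
def gaussian_mle(xs):
    # MLE for a Gaussian: mu_hat is the sample mean, sigma2_hat is the
    # average squared deviation (note: divides by N, not N - 1)
    n = len(xs)
    mu = sum(xs) / n
    sigma2 = sum((x - mu) ** 2 for x in xs) / n
    return mu, sigma2

mu, s2 = gaussian_mle([1.0, 2.0, 3.0])
print(mu, s2)  # 2.0 and 2/3
```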
Bayesian vs. Frequentist MLE returns a point estimate In frequentist statistics, we treat D as random and θ as fixed, and ask how the estimate would change if D changed. In Bayesian statistics, we treat D as fixed and θ as random, and model our uncertainty with the posterior
Unbiased estimators The bias of an estimator is defined as An estimator is unbiased if bias = 0.
Unbiased estimators The MLE for the Gaussian mean is unbiased
Is being unbiased enough?
Consistent estimators An estimator is consistent if it converges (in probability) to the true value with enough data The MLE is a consistent estimator.
Bias-variance tradeoff Being unbiased is not necessarily desirable! Suppose our loss function is mean squared error where
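The bias discussed above can be checked empirically: for Gaussian samples of size n, the variance MLE (dividing by n) has expectation (n-1)/n times the true variance. A simulation sketch (sample size, trial count, and seed are arbitrary choices):

```python
import random

random.seed(1)
n, trials = 5, 20000
total = 0.0
for _ in range(trials):
    xs = [random.gauss(0, 1) for _ in range(n)]  # true variance is 1.0
    mu = sum(xs) / n
    total += sum((x - mu) ** 2 for x in xs) / n  # MLE divides by n
avg_mle_var = total / trials
print(avg_mle_var)  # close to (n - 1) / n = 0.8, below the true variance 1.0
```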
Feature Selection If predictive accuracy is the goal, it is often best to keep all predictors and use L2 regularization We often want to select a subset of the inputs that are most relevant for predicting the output, to get sparse models: interpretability, speed, possibly better predictive accuracy
Filter methods Compute the relevance of each feature to the label marginally Computationally efficient
Correlation coefficient Measures the extent to which X_j and Y are linearly related
Correlation coefficient Mutual information Can model nonlinear, non-Gaussian dependencies For discrete data
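For discrete data the mutual information is a sum over the empirical joint distribution; a minimal sketch (function name illustrative, using base-2 logs so the result is in bits):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    # I(X; Y) = sum_{x,y} p(x, y) log2[ p(x, y) / (p(x) p(y)) ],
    # with all probabilities taken from empirical counts
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0: fully dependent
print(mutual_information([0, 1, 0, 1], [0, 0, 1, 1]))  # 0.0: independent
```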
Wrapper Methods Perform a discrete search in model space Wrap the search around standard model fitting
Wrapper Methods Forward selection for linear regression At each step, add the feature that maximally reduces the residual error
Wrapper Methods Forward selection for linear regression At each step, add the feature that maximally reduces the residual error
Wrapper Methods Forward selection for linear regression Plug the estimate in
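The greedy loop above can be sketched as follows. This is a simplified stagewise variant, an assumption on my part: each candidate feature is scored by a single-coefficient least-squares fit (no intercept) against the current residual, rather than refitting the full regression:

```python
def fit_one(feature, residual):
    # Least-squares coefficient of one feature against the residual:
    # w = <f, r> / <f, f>; returns the remaining SSE and new residual
    ff = sum(f * f for f in feature)
    w = 0.0 if ff == 0 else sum(f * r for f, r in zip(feature, residual)) / ff
    new_res = [r - w * f for f, r in zip(feature, residual)]
    return sum(r * r for r in new_res), new_res

def forward_select(features, y, k):
    # Greedy forward selection: repeatedly add the feature that most
    # reduces the residual sum of squares, k times
    residual, chosen = list(y), []
    for _ in range(k):
        candidates = [j for j in range(len(features)) if j not in chosen]
        best = min(candidates, key=lambda j: fit_one(features[j], residual)[0])
        chosen.append(best)
        residual = fit_one(features[best], residual)[1]
    return chosen

print(forward_select([[1, 2, 3], [1, 1, 1]], [2, 4, 6], 1))  # [0]
```

Here feature 0 explains y exactly (y = 2 * feature), so it is picked first.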
What we learned today Bayesian Statistics Frequentist Statistics Feature Selection Some slides are borrowed from Kevin Murphy's lectures
Homework Read Murphy Ch. 5, 6 Assignment 1 due 02/04, 6pm! Both hard copy and electronic copy