CSE 446: Point Estimation
Winter 2012
Dan Weld
Some slides from Carlos Guestrin, Luke Zettlemoyer & K. Gajos

Logistics
- PS2 out shortly

Last Time
- Random variables, distributions
- Marginal, joint & conditional probabilities
- Sum rule, product rule, Bayes rule
- Independence, conditional independence
- Envelope problem
- Binomial distribution
- Maximum Likelihood Estimation (MLE)
- [A peek at PAC learning theory]

Envelope Problem
- One envelope has twice as much money as the other. Switch?
- Suppose A has $20. Then E[B] = ½($10 + $40) = $25. Switch?

Coin Flip
- Which coin will I use? Three candidates:
  P(H | C1) = 0.1, P(H | C2) = 0.5, P(H | C3) = 0.9
- What are P(C1), P(C2), P(C3)?
- Prior: probability of a hypothesis before we make any observations
Coin Flip
- P(H | C1) = 0.1, P(H | C2) = 0.5, P(H | C3) = 0.9
- Which coin will I use?
- Uniform prior: all hypotheses are equally likely before we make any observations:
  P(C1) = P(C2) = P(C3) = 1/3

Experiment 1: Heads
- P(C1 | H) = ?  P(C2 | H) = ?  P(C3 | H) = ?
- The product rule and the sum rule give Bayes rule:
  P(Ci | H) = P(H | Ci) P(Ci) / Σj P(H | Cj) P(Cj)
- P(C1 | H) = 0.066, P(C2 | H) = 0.333, P(C3 | H) = 0.600
- Posterior: probability of a hypothesis given data

Terminology
- Prior: probability of a hypothesis before we see any data
- Uniform prior: a prior that makes all hypotheses equally likely
- Posterior: probability of a hypothesis after we see some data
- Likelihood: probability of data given hypothesis

Experiment 2: Tails (data so far: HT)
- Now, P(C1 | HT) = ?  P(C2 | HT) = ?  P(C3 | HT) = ?
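The one-flip Bayes update on this slide fits in a few lines. A minimal sketch in Python (the coin labels `C1`–`C3` and the function name are mine, not from the slides), using the three biases and the uniform prior:

```python
# Bayes-rule update for "which coin?" after observing one Heads.
# Biases P(H | Ci) come from the slide; the uniform prior is 1/3 each.
biases = [0.1, 0.5, 0.9]      # P(H | C1), P(H | C2), P(H | C3)
prior = [1/3, 1/3, 1/3]       # uniform prior over the three coins

def posterior(prior, likelihood):
    """P(Ci | data) = P(data | Ci) P(Ci) / sum_j P(data | Cj) P(Cj)."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    z = sum(unnormalized)     # P(data): the sum-rule normalizer
    return [u / z for u in unnormalized]

post = posterior(prior, biases)      # likelihood of "H" is just the bias
print([round(p, 3) for p in post])   # [0.067, 0.333, 0.6]
```

Up to rounding, this reproduces the slide's 0.066 / 0.333 / 0.600 posterior.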
- P(C1 | HT) = 0.21, P(C2 | HT) = 0.58, P(C3 | HT) = 0.21

Your Estimate?
- What is the probability of heads after two experiments?
- Most likely coin: C2. Best estimate for P(H): P(H | C2) = 0.5
- Maximum Likelihood Estimate: the hypothesis that best fits the observed data, assuming a uniform prior

Using Prior Knowledge
- Should we always use a uniform prior?
- Background knowledge: heads => we have to buy Dan chocolate; Dan likes chocolate => Dan is more likely to use a coin biased in his favor
- We can encode it in the prior:
  P(C1) = 0.05, P(C2) = 0.25, P(C3) = 0.70
Experiment 1: Heads (informative prior)
- P(C1 | H) = ?  P(C2 | H) = ?  P(C3 | H) = ?
- P(C1 | H) = 0.006, P(C2 | H) = 0.165, P(C3 | H) = 0.829
- Compare with the uniform-prior posterior after Exp 1:
  P(C1 | H) = 0.066, P(C2 | H) = 0.333, P(C3 | H) = 0.600

Experiment 2: Tails (data so far: HT)
- P(C1 | HT) = ?  P(C2 | HT) = ?  P(C3 | HT) = ?
- P(C1 | HT) = 0.035, P(C2 | HT) = 0.481, P(C3 | HT) = 0.485

Your Estimate?
- What is the probability of heads after two experiments?
- Most likely coin: C3. Best estimate for P(H): P(H | C3) = 0.9
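The same Bayes update, run over the two-flip sequence HT with the chocolate-informed prior. A minimal Python sketch (coin labels and helper names are mine):

```python
# Posterior over the three coins after the sequence H, T,
# using the informative prior from the slide.
biases = [0.1, 0.5, 0.9]     # P(H | Ci)
prior = [0.05, 0.25, 0.70]   # informative prior P(Ci)

def seq_likelihood(bias, flips):
    """P(flips | coin) for i.i.d. flips, e.g. the string 'HT'."""
    p = 1.0
    for f in flips:
        p *= bias if f == 'H' else (1 - bias)
    return p

likes = [seq_likelihood(b, 'HT') for b in biases]
unnorm = [p * l for p, l in zip(prior, likes)]
z = sum(unnorm)
post = [u / z for u in unnorm]
print([round(p, 3) for p in post])   # [0.035, 0.481, 0.485]
```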
- Maximum A Posteriori (MAP) Estimate: the hypothesis that best fits the observed data, assuming a non-uniform prior

Did We Do The Right Thing?
- P(C1 | HT) = 0.035, P(C2 | HT) = 0.481, P(C3 | HT) = 0.485
- C2 and C3 are almost equally likely, yet we committed entirely to C3

A Better Estimate
- Weight each hypothesis's prediction by its posterior probability:
  P(H | HT) = Σi P(H | Ci) P(Ci | HT)
            = 0.1·0.035 + 0.5·0.481 + 0.9·0.485 = 0.680
- Bayesian Estimate: minimizes prediction error, given data and (generally) assuming a non-uniform prior

Comparison
- ML (Maximum Likelihood): P(H) = 0.5
- MAP (Maximum A Posteriori): P(H) = 0.9
- Bayesian: P(H) = 0.68
- After more experiments (say, 8 more heads, 10 flips in total), all three estimates converge to P(H) ≈ 0.9
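The three estimators compared above can be computed side by side. A minimal Python sketch (variable and function names are mine), using the biases and priors from the running example:

```python
# ML, MAP, and Bayesian estimates of P(H) after observing H, T.
biases = [0.1, 0.5, 0.9]          # P(H | Ci)
uniform = [1/3, 1/3, 1/3]
informed = [0.05, 0.25, 0.70]

def post_after_HT(prior):
    """Posterior over coins given the sequence HT."""
    likes = [b * (1 - b) for b in biases]    # P(HT | Ci)
    unnorm = [p * l for p, l in zip(prior, likes)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

u_post = post_after_HT(uniform)
i_post = post_after_HT(informed)

ml = biases[u_post.index(max(u_post))]       # argmax under uniform prior
map_est = biases[i_post.index(max(i_post))]  # argmax under informative prior
bayes = sum(b * p for b, p in zip(biases, i_post))  # posterior-weighted average
print(ml, map_est, round(bayes, 3))          # 0.5 0.9 0.68
```

Note how MAP commits to the single most probable coin, while the Bayesian estimate hedges between the two near-tied hypotheses.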
Summary
- ML (Maximum Likelihood): easy to compute
- MAP (Maximum A Posteriori): still easy to compute; incorporates prior knowledge
- Bayesian: minimizes error => great when data is scarce; potentially much harder to compute

               | Maximum Likelihood | Maximum A Posteriori | Bayesian Estimate
    Prior      | Uniform            | Any                  | Any
    Hypothesis | The most likely    | The most likely      | Weighted combination

Envelope Problem (revisited)
- One envelope has twice as much money as the other. Switch?
- Suppose A has $20. Then E[B] = ½($10 + $40) = $25. Switch?

Those Pesky Distributions

                          | Discrete                     | Continuous
                          | Binary {0, 1} | M values     |
    Single event          | Bernoulli     |              | Gaussian ~ Normal
    Sequence (N trials)   | Binomial      | Multinomial  |
    Conjugate prior       | Beta          | Dirichlet    |

- Bernoulli sufficient statistics: P(x = 1) = θ, P(x = 0) = 1 − θ
Those Pesky Distributions (cont.)
- For a sequence of N trials, N = αH + αT (heads plus tails)

Prior Distributions
- Slightly harder questions:
  - Which coin is he using (of three known choices)?
  - What is the bias of a single new, unseen coin?

What if I have prior beliefs?
- Billionaire says: "Here's a new coin; I bet it's close to 50-50. What can you do for me now?"
- You say: "I can learn it the Bayesian way."
- Rather than estimating a single θ, we obtain a distribution over possible values of θ
- In the beginning: a prior P(θ). Observe flips, e.g. {tails, tails}. After observations: a posterior P(θ | D)

Bayesian Learning
- Use Bayes rule:
  P(θ | D) = P(D | θ) P(θ) / P(D)
  posterior = (data likelihood × prior) / normalization
- Or equivalently: P(θ | D) ∝ P(D | θ) P(θ)
- Remember: for uniform priors, this reduces to the MLE objective, argmaxθ P(D | θ)

Bayesian Learning for Dollars
- Likelihood function is Binomial:
  P(D | θ) = θ^αH (1 − θ)^αT, for αH heads and αT tails
- What about the prior?
  - Represent expert knowledge
  - Want a simple posterior form
- Conjugate priors: a closed-form representation of the posterior
- For the Binomial, the conjugate prior is the Beta distribution:
  P(θ) = Beta(βH, βT) ∝ θ^(βH − 1) (1 − θ)^(βT − 1)
- Posterior: P(θ | D) ∝ θ^(αH + βH − 1) (1 − θ)^(αT + βT − 1)
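The conjugate update can be sketched concretely. A minimal Python sketch (the pseudo-count and data values are illustrative choices of mine, not from the slides):

```python
from math import gamma

# Conjugate Beta-Binomial update: a Beta(bH, bT) prior plus data with
# (aH heads, aT tails) gives a Beta(bH + aH, bT + aT) posterior in closed form.
def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta."""
    return gamma(a + b) / (gamma(a) * gamma(b)) * theta**(a - 1) * (1 - theta)**(b - 1)

bH, bT = 5, 5                # prior pseudo-counts encoding "close to 50-50"
aH, aT = 2, 0                # observed data: two heads
pH, pT = bH + aH, bT + aT    # posterior parameters: Beta(7, 5)

post_mean = pH / (pH + pT)
print(round(post_mean, 3))   # 0.583: nudged toward the data, anchored by the prior
```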
Posterior Distribution
- Prior: Beta(βH, βT). Data: αH heads and αT tails.
- Posterior distribution: P(θ | D) = Beta(βH + αH, βT + αT)

Bayesian Posterior Inference
- The posterior is a distribution over θ, no longer a single parameter
- For any specific f, the function of interest, compute the expected value:
  E[f(θ)] = ∫ f(θ) P(θ | D) dθ
- The integral is often hard to compute

MAP: Maximum a Posteriori Approximation
- As more data is observed, the Beta posterior becomes more certain (more sharply peaked)
- MAP: use the most likely parameter to approximate the expectation:
  θ̂ = argmaxθ P(θ | D), so E[f(θ)] ≈ f(θ̂)

MAP for the Beta distribution
- MAP: use the most likely parameter:
  θ̂ = (βH + αH − 1) / (βH + αH + βT + αT − 2)
- A Beta prior is equivalent to extra thumbtack flips (pseudo-counts)
- As N → ∞, the prior is forgotten
- But for small sample sizes, the prior is important!

What about continuous variables?
- Billionaire says: "If I am measuring a continuous variable, what can you do for me?"
- You say: "Let me tell you about Gaussians…"
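The "prior is forgotten as N grows" behavior is easy to see numerically. A minimal Python sketch of the MAP formula above (the prior strengths and data counts are illustrative):

```python
# MAP for a Beta posterior, and how the prior's pseudo-counts wash out.
def map_beta(aH, aT, bH, bT):
    """Mode of Beta(bH + aH, bT + aT): the MAP formula from the slide."""
    return (bH + aH - 1) / (bH + aH + bT + aT - 2)

# With the uniform prior Beta(1, 1), MAP reduces to the MLE aH / (aH + aT):
print(map_beta(3, 1, 1, 1))                  # 0.75

# A strong prior near 0.5 (20 pseudo-flips) pulls a small sample toward it...
print(round(map_beta(3, 1, 10, 10), 3))      # 0.545
# ...but is forgotten once real data dwarfs the pseudo-counts:
print(round(map_beta(300, 100, 10, 10), 3))  # 0.739, near the MLE 0.75
```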
Some properties of Gaussians
- Affine transformations (multiplying by a scalar and adding a constant) of a Gaussian are Gaussian:
  X ~ N(μ, σ²), Y = aX + b  =>  Y ~ N(aμ + b, a²σ²)
- The sum of independent Gaussians is Gaussian:
  X ~ N(μX, σX²), Y ~ N(μY, σY²), Z = X + Y  =>  Z ~ N(μX + μY, σX² + σY²)
- Easy to differentiate, as we will see soon!

Learning a Gaussian
- Collect a bunch of data (hopefully i.i.d. samples), e.g., exam scores
- Learn the parameters: mean μ, variance σ²

    i  | Exam score
    0  | 85
    1  | 95
    2  | 100
    3  | 12
    …  | …
    99 | 89

MLE for a Gaussian
- Probability of i.i.d. samples D = {x1, …, xN}:
  P(D | μ, σ) = (1 / (σ√(2π)))^N Πi exp(−(xi − μ)² / (2σ²))
- Log-likelihood of the data:
  ln P(D | μ, σ) = −N ln(σ√(2π)) − Σi (xi − μ)² / (2σ²)

Your second learning algorithm: MLE for the mean of a Gaussian
- What's the MLE for the mean? Set the derivative with respect to μ to zero:
  μ̂_MLE = (1/N) Σi xi

MLE for the variance
- Again, set the derivative to zero:
  σ̂²_MLE = (1/N) Σi (xi − μ̂)²
- BTW: the MLE for the variance of a Gaussian is biased. The expected result of the estimation is not the true parameter!
- Unbiased variance estimator:
  σ̂²_unbiased = (1/(N−1)) Σi (xi − μ̂)²
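The two MLE formulas and the unbiased correction fit in a few lines. A minimal Python sketch (the five-score sample is illustrative, echoing the exam-score table):

```python
# MLE for a Gaussian's mean and variance, plus the unbiased correction.
scores = [85.0, 95.0, 100.0, 12.0, 89.0]
N = len(scores)

mu_mle = sum(scores) / N                               # (1/N) * sum_i x_i
var_mle = sum((x - mu_mle) ** 2 for x in scores) / N   # biased: divides by N
var_unbiased = var_mle * N / (N - 1)                   # Bessel's correction: N - 1

print(mu_mle)                  # 76.2
print(round(var_mle, 2))       # 1056.56
print(round(var_unbiased, 2))  # 1320.7
```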
Bayesian learning of Gaussian parameters
- Conjugate priors:
  - Mean: Gaussian prior
  - Variance: Wishart distribution
- Prior for the mean: μ ~ N(η, λ²)

MAP for the mean of a Gaussian
- With a Gaussian prior N(η, λ²) on μ (and known σ²), the posterior over μ is Gaussian, with mode
  μ̂_MAP = (η/λ² + Σi xi/σ²) / (1/λ² + N/σ²)
- The MAP estimate shrinks the sample mean toward the prior mean η; as N grows, it approaches the MLE
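The precision-weighted blend above mirrors the Beta pseudo-count story: a strong prior dominates for small N, a vague one recovers the MLE. A minimal Python sketch (all parameter values and names are illustrative choices of mine):

```python
# MAP estimate of a Gaussian mean under a Gaussian prior mu ~ N(eta, lam2),
# with the observation variance sigma2 treated as known.
def map_mean(xs, sigma2, eta, lam2):
    """Precision-weighted blend of the prior mean and the data."""
    n = len(xs)
    return (eta / lam2 + sum(xs) / sigma2) / (1 / lam2 + n / sigma2)

xs = [85.0, 95.0, 100.0, 12.0, 89.0]    # same illustrative scores
sample_mean = sum(xs) / len(xs)          # 76.2, the MLE

tight = map_mean(xs, sigma2=400.0, eta=50.0, lam2=1.0)    # strong prior: near 50
vague = map_mean(xs, sigma2=400.0, eta=50.0, lam2=1e9)    # vague prior: near MLE
print(round(tight, 1), round(vague, 1))
```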