Some slides from Carlos Guestrin, Luke Zettlemoyer & K Gajos



CSE 446: Point Estimation
Winter 2012
Dan Weld

Logistics
PS2 out shortly

Last Time
Random variables, distributions
Marginal, joint & conditional probabilities
Sum rule, product rule, Bayes rule
Independence, conditional independence
Binomial distribution
Maximum Likelihood Estimation (MLE)
[A peek at PAC learning theory]

Envelope Problem
One envelope has twice as much money as the other: envelope A and envelope B.
Switch?
A has $20, so B holds either $10 or $40.
E[B] = ½ ($10 + $40) = $25
Switch?

Coin Flip
Three coins: P(H | C1) = 0.1, P(H | C2) = 0.5, P(H | C3) = 0.9
Which coin will I use? P(C1), P(C2), P(C3)
Prior: Probability of a hypothesis before we make any observations

Coin Flip
Three coins: P(H | C1) = 0.1, P(H | C2) = 0.5, P(H | C3) = 0.9
Which coin will I use?
Uniform Prior: All hypotheses are equally likely before we make any observations
P(C1) = P(C2) = P(C3) = 1/3

Experiment 1: Heads
P(C1 | H) = ?  P(C2 | H) = ?  P(C3 | H) = ?
Product rule in the numerator, sum rule in the denominator:
P(Ci | H) = P(H | Ci) P(Ci) / Σj P(H | Cj) P(Cj)
P(C1 | H) = 0.067, P(C2 | H) = 0.333, P(C3 | H) = 0.6
Posterior: Probability of a hypothesis given data

Terminology
Prior: Probability of a hypothesis before we see any data
Uniform Prior: A prior that makes all hypotheses equally likely
Posterior: Probability of a hypothesis after we have seen some data
Likelihood: Probability of data given hypothesis

Now, after Heads then Tails (HT):
P(C1 | HT) = ?  P(C2 | HT) = ?  P(C3 | HT) = ?
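The posterior update on these slides can be checked with a few lines of Python. This is a minimal sketch, not code from the lecture: the function name `posterior` is made up for illustration, and the third coin's P(H) = 0.9 is an assumption inferred from the posterior of 0.6 that the slide reports under a uniform prior.

```python
def posterior(priors, heads_probs, flips):
    """Return P(coin | flips) for each coin using Bayes rule."""
    unnormalized = []
    for prior, p_heads in zip(priors, heads_probs):
        # Likelihood of the whole flip sequence under this coin (flips are independent).
        likelihood = 1.0
        for flip in flips:
            likelihood *= p_heads if flip == "H" else 1.0 - p_heads
        # Product rule: joint probability of (coin, flips).
        unnormalized.append(prior * likelihood)
    # Sum rule: the normalizer is the total probability of the flips.
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


heads_probs = [0.1, 0.5, 0.9]   # 0.9 is inferred, see the note above
uniform = [1 / 3, 1 / 3, 1 / 3]

after_h = posterior(uniform, heads_probs, "H")
after_ht = posterior(uniform, heads_probs, "HT")
print([round(p, 3) for p in after_h])
print([round(p, 3) for p in after_ht])
```

With a uniform prior the posterior is just the normalized likelihood, which is why the most probable coin here is also the maximum likelihood choice.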

Now, after Heads then Tails (HT), with P(H | C1) = 0.1, P(H | C2) = 0.5, P(H | C3) = 0.9:
P(C1 | HT) = 0.21, P(C2 | HT) = 0.58, P(C3 | HT) = 0.21

Your Estimate?
What is the probability of heads after two experiments?
Most likely coin: C2
Best estimate for P(H): P(H | C2) = 0.5
Maximum Likelihood Estimate: The best hypothesis that fits observed data assuming uniform prior

Using Prior Knowledge
Should we always use a Uniform Prior?
Background knowledge:
Heads => we have to buy Dan chocolate
Dan likes chocolate => Dan is more likely to use a coin biased in his favor
We can encode it in the prior:
P(C1) = 0.05, P(C2) = 0.25, P(C3) = 0.70

Likelihoods: P(H | C1) = 0.1, P(H | C2) = 0.5, P(H | C3) = 0.9
Prior: P(C1) = 0.05, P(C2) = 0.25, P(C3) = 0.70

Experiment 1: Heads
P(C1 | H) = ?  P(C2 | H) = ?  P(C3 | H) = ?
P(C1 | H) = 0.007, P(C2 | H) = 0.164, P(C3 | H) = 0.829
Compare with ML posterior after Exp 1 (uniform prior):
P(C1 | H) = 0.067, P(C2 | H) = 0.333, P(C3 | H) = 0.600

After Heads then Tails (HT):
P(C1 | HT) = ?  P(C2 | HT) = ?  P(C3 | HT) = ?
P(C1 | HT) = 0.035, P(C2 | HT) = 0.481, P(C3 | HT) = 0.485

Your Estimate?
What is the probability of heads after two experiments?
Most likely coin: ?
Best estimate for P(H): ?
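The effect of the non-uniform prior can be reproduced numerically. The sketch below is an illustration under stated assumptions: the names `P_HEADS`, `PRIOR` and `coin_posterior` are hypothetical, P(H | C3) = 0.9 is inferred from the slides' posteriors, and P(C3) = 0.70 is inferred from the requirement that the prior sums to 1.

```python
# Assumed values: 0.9 and 0.70 are inferred, not printed in the transcript.
P_HEADS = {"C1": 0.1, "C2": 0.5, "C3": 0.9}
PRIOR = {"C1": 0.05, "C2": 0.25, "C3": 0.70}


def coin_posterior(prior, n_heads, n_tails):
    """Posterior over coins after n_heads heads and n_tails tails."""
    scores = {
        coin: prior[coin] * P_HEADS[coin] ** n_heads * (1 - P_HEADS[coin]) ** n_tails
        for coin in prior
    }
    normalizer = sum(scores.values())
    return {coin: score / normalizer for coin, score in scores.items()}


post_h = coin_posterior(PRIOR, n_heads=1, n_tails=0)
post_ht = coin_posterior(PRIOR, n_heads=1, n_tails=1)

# MAP picks the single coin with the highest posterior probability.
map_coin = max(post_ht, key=post_ht.get)
print({coin: round(p, 3) for coin, p in post_ht.items()})
print(map_coin, P_HEADS[map_coin])
```

Only the order of the flips is discarded here; for independent flips the posterior depends on the counts of heads and tails alone.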

Your Estimate?
Likelihoods: P(H | C1) = 0.1, P(H | C2) = 0.5, P(H | C3) = 0.9
Most likely coin: C3
Best estimate for P(H): P(H | C3) = 0.9
Maximum A Posteriori (MAP) Estimate: The best hypothesis that fits observed data assuming a non-uniform prior

Did We Do The Right Thing?
P(C1 | HT) = 0.035, P(C2 | HT) = 0.481, P(C3 | HT) = 0.485
C2 and C3 are almost equally likely

A Better Estimate
Recall the sum and product rules:
P(H | HT) = Σi P(H | Ci) P(Ci | HT) = 0.1 × 0.035 + 0.5 × 0.481 + 0.9 × 0.485 = 0.68

Bayesian Estimate
Bayesian Estimate: Minimizes prediction error, given data and (generally) assuming a non-uniform prior
P(H | HT) = 0.68

Comparison
After more experiments: HTH^8 (HT followed by 8 more heads)
ML (Maximum Likelihood): P(H) = 0.5; after 10 experiments: P(H) = 0.9
MAP (Maximum A Posteriori): P(H) = 0.9; after 10 experiments: P(H) = 0.9
Bayesian: P(H) = 0.68; after 10 experiments: P(H) ≈ 0.9
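The three estimators in the comparison can be put side by side in code. This is a sketch under assumptions rather than the lecture's own implementation: all function names are hypothetical, P(H | C3) = 0.9 and P(C3) = 0.70 are inferred from the numbers on the slides, and "ML" is taken to mean choosing among the three known coins with a uniform prior, as the slides define it.

```python
P_HEADS = [0.1, 0.5, 0.9]          # assumed: 0.9 is inferred from the slides
UNIFORM = [1 / 3, 1 / 3, 1 / 3]
INFORMED = [0.05, 0.25, 0.70]      # assumed: 0.70 makes the prior sum to 1


def posterior(prior, flips):
    """Posterior over the three coins given a string of 'H'/'T' flips."""
    heads, tails = flips.count("H"), flips.count("T")
    weights = [pr * p ** heads * (1 - p) ** tails for pr, p in zip(prior, P_HEADS)]
    total = sum(weights)
    return [w / total for w in weights]


def ml_estimate(flips):
    """P(H) of the most likely coin under a uniform prior."""
    post = posterior(UNIFORM, flips)
    return P_HEADS[post.index(max(post))]


def map_estimate(flips):
    """P(H) of the most likely coin under the informed prior."""
    post = posterior(INFORMED, flips)
    return P_HEADS[post.index(max(post))]


def bayes_estimate(flips):
    """Posterior-weighted average of P(H) over all coins."""
    post = posterior(INFORMED, flips)
    return sum(p * q for p, q in zip(P_HEADS, post))


for flips in ("HT", "HT" + "H" * 8):
    print(flips, ml_estimate(flips), map_estimate(flips), round(bayes_estimate(flips), 2))
```

ML and MAP each commit to one coin, while the Bayesian estimate averages over all of them, which is why it sits between 0.5 and 0.9 when the posterior is split between C2 and C3.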

Summary
ML (Maximum Likelihood): Easy to compute
MAP (Maximum A Posteriori): Still easy to compute; incorporates prior knowledge
Bayesian: Minimizes error => great when data is scarce; potentially much harder to compute

Recap
                                Prior     Hypothesis
Maximum Likelihood Estimate     Uniform   The most likely
Maximum A Posteriori Estimate   Any       The most likely
Bayesian Estimate               Any       Weighted combination

Envelope Problem
One envelope has twice as much money as the other: envelope A and envelope B.
Switch?
A has $20, so B holds either $10 or $40.
E[B] = ½ ($10 + $40) = $25
Switch?

Those Pesky Distributions
                      Discrete, Binary {0, 1}   Discrete, M Values
Single Event          Bernoulli
Sequence (N trials)   Binomial                  Multinomial
Conjugate Prior       Beta                      Dirichlet
Continuous: Gaussian ~ Normal
Bernoulli: P(x=1) = θ, P(x=0) = 1 − θ
Sequence (N trials): N = H + T (sufficient statistics: the counts of heads and tails)

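Those Pesky Distributions
Single Event: Bernoulli (Binary {0, 1}); M Values
Sequence (N trials), N = H + T: Binomial; Multinomial
Conjugate Prior: Beta; Dirichlet

Prior Distributions
Which coin is he using (of three known choices)?
Slightly harder: What is the bias of a single new, unseen coin?

What if I have prior beliefs?
Billionaire says: Here's a new coin; I bet it's close to … What can you do for me now?
You say: I can learn it the Bayesian way
Rather than estimating a single $\theta$, we obtain a distribution over possible values of $\theta$

Bayesian Learning
Use Bayes rule:
$P(\theta \mid D) = \dfrac{P(D \mid \theta)\, P(\theta)}{P(D)}$
(Posterior = Data Likelihood × Prior / Normalization)
In the beginning: prior $P(\theta)$. Observe flips, e.g.: {tails, tails}. After observations: posterior $P(\theta \mid D)$.
Or equivalently: $P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta)$
Remember, for uniform priors, $P(\theta) \propto 1$, so $P(\theta \mid D) \propto P(D \mid \theta)$: reduces to MLE objective

Bayesian Learning for Dollars
Likelihood function is Binomial: $P(D \mid \theta) = \theta^{\alpha_H} (1-\theta)^{\alpha_T}$ for $\alpha_H$ heads and $\alpha_T$ tails
What about prior?
Represent expert knowledge
Want simple posterior form
Conjugate priors: Closed form representation of posterior
For Binomial, conjugate prior is Beta distribution

Beta prior distribution: $P(\theta)$
$P(\theta) = \dfrac{\theta^{\beta_H - 1} (1-\theta)^{\beta_T - 1}}{B(\beta_H, \beta_T)} \sim \mathrm{Beta}(\beta_H, \beta_T)$
Likelihood function: $P(D \mid \theta) = \theta^{\alpha_H} (1-\theta)^{\alpha_T}$
Posterior: $P(\theta \mid D) \propto \theta^{\alpha_H + \beta_H - 1} (1-\theta)^{\alpha_T + \beta_T - 1}$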
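The conjugacy claim can be verified in three lines of algebra. The derivation below is a sketch; the symbols are assumptions chosen for illustration, since the transcript lost the originals: $\theta$ is the coin's probability of heads, $\alpha_H$ and $\alpha_T$ are the observed counts of heads and tails, and $\beta_H$ and $\beta_T$ are the Beta prior's parameters.

```latex
\begin{align*}
P(\theta \mid D)
  &\propto P(D \mid \theta)\, P(\theta) \\
  &\propto \theta^{\alpha_H} (1-\theta)^{\alpha_T}
     \cdot \theta^{\beta_H - 1} (1-\theta)^{\beta_T - 1} \\
  &= \theta^{\alpha_H + \beta_H - 1} (1-\theta)^{\alpha_T + \beta_T - 1} \\
\Rightarrow \quad
P(\theta \mid D) &= \mathrm{Beta}(\beta_H + \alpha_H,\ \beta_T + \alpha_T)
\end{align*}
```

The product of the Binomial likelihood and the Beta prior has the same functional form as the prior, so updating on data only requires adding the observed counts to the prior's parameters.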

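Posterior Distribution
Prior: $\mathrm{Beta}(\beta_H, \beta_T)$
Data: $\alpha_H$ heads and $\alpha_T$ tails
Posterior distribution: $P(\theta \mid D) \sim \mathrm{Beta}(\beta_H + \alpha_H,\ \beta_T + \alpha_T)$

Bayesian Posterior Inference
Posterior distribution: $P(\theta \mid D) \sim \mathrm{Beta}(\beta_H + \alpha_H,\ \beta_T + \alpha_T)$
Bayesian inference: No longer a single parameter
For any specific $f$, the function of interest, compute the expected value of $f$:
$E[f(\theta)] = \int_0^1 f(\theta)\, P(\theta \mid D)\, d\theta$
Integral is often hard to compute

MAP: Maximum a Posteriori Approximation
As more data is observed, Beta is more certain
MAP: use most likely parameter to approximate the expectation:
$\hat{\theta} = \arg\max_{\theta} P(\theta \mid D)$, $\quad E[f(\theta)] \approx f(\hat{\theta})$

MAP for Beta distribution
MAP: use most likely parameter:
$\hat{\theta} = \arg\max_{\theta} P(\theta \mid D) = \dfrac{\beta_H + \alpha_H - 1}{\beta_H + \alpha_H + \beta_T + \alpha_T - 2}$
Beta prior equivalent to extra thumbtack flips
As $N \to \infty$, prior is forgotten
But, for small sample size, prior is important!

What about continuous variables?
Billionaire says: If I am measuring a continuous variable, what can you do for me?
You say: Let me tell you about Gaussians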
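A short numeric experiment shows how the prior is "forgotten" as N grows. This is a minimal sketch: the function names and the Beta(5, 5) prior are hypothetical choices for illustration, and the MAP value is the standard mode of a Beta distribution, which is only defined by this formula when both posterior parameters exceed 1.

```python
def beta_map(alpha_h, alpha_t, beta_h, beta_t):
    """Mode of the posterior Beta(beta_h + alpha_h, beta_t + alpha_t)."""
    a = beta_h + alpha_h
    b = beta_t + alpha_t
    if a <= 1 or b <= 1:
        raise ValueError("mode formula requires both posterior parameters > 1")
    return (a - 1) / (a + b - 2)


def mle(alpha_h, alpha_t):
    """Maximum likelihood estimate: the observed fraction of heads."""
    return alpha_h / (alpha_h + alpha_t)


# Hypothetical prior Beta(5, 5): acts like 4 extra heads and 4 extra tails.
BETA_H, BETA_T = 5, 5

for n in (10, 100, 10000):
    heads = n * 8 // 10          # illustrative data: 80% heads
    tails = n - heads
    print(n, mle(heads, tails), round(beta_map(heads, tails, BETA_H, BETA_T), 4))
```

For 10 flips the MAP estimate is pulled noticeably toward 0.5 by the prior, while for 10000 flips it is nearly identical to the MLE.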

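Some properties of Gaussians
Affine transformations (multiplying by a scalar and adding a constant) are Gaussian:
$X \sim N(\mu, \sigma^2)$, $\; Y = aX + b \;\Rightarrow\; Y \sim N(a\mu + b,\ a^2\sigma^2)$
Sum of (independent) Gaussians is Gaussian:
$X \sim N(\mu_X, \sigma_X^2)$, $\; Y \sim N(\mu_Y, \sigma_Y^2)$, $\; Z = X + Y \;\Rightarrow\; Z \sim N(\mu_X + \mu_Y,\ \sigma_X^2 + \sigma_Y^2)$
Easy to differentiate, as we will see soon!

Learning a Gaussian
Collect a bunch of data
Hopefully, i.i.d. samples
e.g., exam scores (table of $i$ and $x_i$ = Exam Score)
Learn parameters
Mean: $\mu$
Variance: $\sigma^2$

MLE for Gaussian
Prob. of i.i.d. samples $D = \{x_1, \ldots, x_N\}$:
$P(D \mid \mu, \sigma) = \left(\dfrac{1}{\sigma\sqrt{2\pi}}\right)^{N} \prod_{i=1}^{N} \exp\!\left(-\dfrac{(x_i - \mu)^2}{2\sigma^2}\right)$
Log-likelihood of data:
$\ln P(D \mid \mu, \sigma) = -N \ln\!\left(\sigma\sqrt{2\pi}\right) - \sum_{i=1}^{N} \dfrac{(x_i - \mu)^2}{2\sigma^2}$

Your second learning algorithm: MLE for mean of a Gaussian
What's MLE for mean? Set the derivative to zero:
$\dfrac{\partial}{\partial \mu} \ln P(D \mid \mu, \sigma) = \sum_{i=1}^{N} \dfrac{x_i - \mu}{\sigma^2} = 0 \;\Rightarrow\; \hat{\mu}_{MLE} = \dfrac{1}{N} \sum_{i=1}^{N} x_i$

MLE for variance
Again, set derivative to zero:
$\dfrac{\partial}{\partial \sigma} \ln P(D \mid \mu, \sigma) = -\dfrac{N}{\sigma} + \sum_{i=1}^{N} \dfrac{(x_i - \mu)^2}{\sigma^3} = 0 \;\Rightarrow\; \hat{\sigma}^2_{MLE} = \dfrac{1}{N} \sum_{i=1}^{N} (x_i - \hat{\mu}_{MLE})^2$

Learning Gaussian parameters
MLE: $\hat{\mu}_{MLE} = \dfrac{1}{N} \sum_{i} x_i$, $\quad \hat{\sigma}^2_{MLE} = \dfrac{1}{N} \sum_{i} (x_i - \hat{\mu}_{MLE})^2$
BTW. MLE for the variance of a Gaussian is biased
Expected result of estimation is not true parameter!
Unbiased variance estimator: $\hat{\sigma}^2_{unbiased} = \dfrac{1}{N-1} \sum_{i=1}^{N} (x_i - \hat{\mu}_{MLE})^2$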
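The Gaussian estimators above are simple enough to compute by hand. The sketch below is an illustration only: `gaussian_estimates` is a hypothetical helper, and the scores are made-up numbers, not the exam data from the lecture.

```python
def gaussian_estimates(samples):
    """Return (MLE mean, MLE variance, unbiased variance) for i.i.d. samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples for the unbiased variance")
    mean = sum(samples) / n
    squared_error = sum((x - mean) ** 2 for x in samples)
    var_mle = squared_error / n             # divides by N: biased
    var_unbiased = squared_error / (n - 1)  # divides by N - 1: unbiased
    return mean, var_mle, var_unbiased


scores = [70.0, 80.0, 90.0]  # made-up exam scores, for illustration only
mean, var_mle, var_unbiased = gaussian_estimates(scores)
print(mean, var_mle, var_unbiased)
```

The two variance estimates differ by the factor N / (N - 1), so the bias matters mostly for small samples. Python's standard library offers the same quantities as `statistics.pvariance` (divide by N) and `statistics.variance` (divide by N - 1).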

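Bayesian learning of Gaussian parameters
Conjugate priors
Mean: Gaussian prior
Variance: Wishart Distribution
Prior for mean: $P(\mu \mid \eta, \lambda) = \dfrac{1}{\lambda\sqrt{2\pi}} \exp\!\left(-\dfrac{(\mu - \eta)^2}{2\lambda^2}\right)$

MAP for mean of Gaussian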
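The MAP estimate for the mean follows from the same "set the derivative to zero" recipe used for the MLE. The derivation below is a sketch under stated assumptions: the variance $\sigma^2$ is treated as known, the Gaussian prior on the mean is written as $N(\eta, \lambda^2)$, and the symbols $\eta$ and $\lambda$ are chosen here for illustration because the transcript does not preserve the original notation.

```latex
\begin{align*}
\ln P(\mu \mid D)
  &= \text{const} - \sum_{i=1}^{N} \frac{(x_i - \mu)^2}{2\sigma^2}
     - \frac{(\mu - \eta)^2}{2\lambda^2} \\
\frac{\partial}{\partial \mu} \ln P(\mu \mid D)
  &= \sum_{i=1}^{N} \frac{x_i - \mu}{\sigma^2} - \frac{\mu - \eta}{\lambda^2} = 0 \\
\Rightarrow \quad
\hat{\mu}_{MAP}
  &= \frac{\lambda^2 \sum_{i=1}^{N} x_i + \sigma^2 \eta}{N\lambda^2 + \sigma^2}
\end{align*}
```

As $N$ grows, the data term dominates and $\hat{\mu}_{MAP}$ approaches the sample mean; with few samples it stays close to the prior mean $\eta$.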
