CSE 446: Point Estimation
Winter 2012, Dan Weld
Logistics: PS2 out shortly.
Some slides from Carlos Guestrin, Luke Zettlemoyer & K. Gajos.

Last Time
- Random variables, distributions
- Marginal, joint & conditional probabilities
- Sum rule, product rule, Bayes rule
- Independence, conditional independence
- Binomial distribution
- Maximum Likelihood Estimation (MLE)
- [A peek at PAC learning theory]

Envelope Problem
One envelope has twice as much money as the other. Suppose A has $20; then E[B] = 1/2 ($10 + $40) = $25. Switch?

Coin Flip
Which coin will I use? There are three coins with known biases:
P(H | coin 1) = 0.1, P(H | coin 2) = 0.5, P(H | coin 3) = 0.9
Prior: probability of a hypothesis before we make any observations.

Coin Flip (uniform prior)
Uniform prior: all hypotheses are equally likely before we make any observations.
P(coin 1) = P(coin 2) = P(coin 3) = 1/3
P(H | coin 1) = 0.1, P(H | coin 2) = 0.5, P(H | coin 3) = 0.9

Experiment 1: Heads
Using the product rule (Bayes rule) and the sum rule for the normalizer, the posterior after one head is:
P(coin 1 | H) = 0.066, P(coin 2 | H) = 0.333, P(coin 3 | H) = 0.600
Posterior: probability of a hypothesis given data.

Terminology
- Prior: probability of a hypothesis before we see any data.
- Uniform prior: a prior that makes all hypotheses equally likely.
- Posterior: probability of a hypothesis after we saw some data.
- Likelihood: probability of the data given the hypothesis.

Now, after observing heads then tails (HT):
P(coin 1 | HT) = ?  P(coin 2 | HT) = ?  P(coin 3 | HT) = ?
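As a concrete illustration of this update, here is a minimal Python sketch (not part of the original slides; the function and variable names are mine) that applies Bayes rule to the three coins and reproduces the posteriors quoted above and below.

    # Minimal sketch (illustrative): Bayes-rule update for the three-coin example.
    def posterior(prior, bias, flips):
        """Return P(coin | flips) for each coin, where bias[i] = P(H | coin i)."""
        likelihood = [b ** flips.count("H") * (1 - b) ** flips.count("T")
                      for b in bias]
        unnorm = [p * l for p, l in zip(prior, likelihood)]
        z = sum(unnorm)                     # normalizer P(flips), via the sum rule
        return [u / z for u in unnorm]

    bias = [0.1, 0.5, 0.9]                  # P(H | coin i)
    uniform = [1 / 3, 1 / 3, 1 / 3]
    print(posterior(uniform, bias, "H"))    # ~[0.066, 0.333, 0.600]
    print(posterior(uniform, bias, "HT"))   # ~[0.21, 0.58, 0.21]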

Now, after two flips (HT), still with the uniform prior:
P(coin 1 | HT) = 0.21, P(coin 2 | HT) = 0.58, P(coin 3 | HT) = 0.21

Your Estimate?
What is the probability of heads after two experiments? The most likely coin is coin 2, so the best estimate for P(H) is P(H | coin 2) = 0.5.
Maximum Likelihood Estimate: the hypothesis that best fits the observed data, assuming a uniform prior.

Using Prior Knowledge
Should we always use a uniform prior? Background knowledge: heads means we have to buy Dan chocolate, and Dan likes chocolate, so Dan is more likely to use a coin biased in his favor. We can encode this in the prior:
P(coin 1) = 0.05, P(coin 2) = 0.25, P(coin 3) = 0.70
(the likelihoods are unchanged: P(H | coin 1) = 0.1, P(H | coin 2) = 0.5, P(H | coin 3) = 0.9)

Experiment 1: Heads (with the informative prior)
P(coin 1 | H) = 0.006, P(coin 2 | H) = 0.165, P(coin 3 | H) = 0.829
Compare with the posterior under the uniform prior after Experiment 1:
P(coin 1 | H) = 0.066, P(coin 2 | H) = 0.333, P(coin 3 | H) = 0.600

After two flips (HT), with the informative prior:
P(coin 1 | HT) = 0.035, P(coin 2 | HT) = 0.481, P(coin 3 | HT) = 0.485

Your Estimate?
What is the probability of heads after two experiments? The most likely coin is now coin 3, so the best estimate for P(H) is P(H | coin 3) = 0.9.

Maximum A Posteriori (MAP) Estimate: the hypothesis that best fits the observed data, assuming a non-uniform prior. Here the best estimate for P(H) is P(H | coin 3) = 0.9.

Did We Do The Right Thing?
P(coin 1 | HT) = 0.035, P(coin 2 | HT) = 0.481, P(coin 3 | HT) = 0.485
Coin 2 and coin 3 are almost equally likely, yet the MAP estimate commits entirely to coin 3.

A Better Estimate
Average the hypotheses' predictions, weighted by their posterior probabilities. Recalling P(coin i | HT) above:
    P(H | HT) = sum_i P(H | coin i) P(coin i | HT) = 0.1(0.035) + 0.5(0.481) + 0.9(0.485) = 0.680

Bayesian Estimate: minimizes prediction error, given the data and (generally) assuming a non-uniform prior. Here P(H | HT) = 0.680.

Comparison
After the two flips HT:
- ML (Maximum Likelihood): P(H) = 0.5
- MAP (Maximum A Posteriori): P(H) = 0.9
- Bayesian: P(H) = 0.68
After more experiments, e.g. ten flips (HT followed by eight more heads), the data overwhelmingly favor the 0.9 coin, so all three estimates converge to approximately 0.9.
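The comparison can be reproduced with a short sketch (again illustrative, not from the slides); it computes all three point estimates from the posteriors above.

    # Sketch: ML, MAP and Bayesian estimates of P(heads) after observing HT.
    bias = [0.1, 0.5, 0.9]                        # P(H | coin i)
    uniform = [1 / 3, 1 / 3, 1 / 3]
    informative = [0.05, 0.25, 0.70]

    def posterior(prior, flips):
        unnorm = [p * b ** flips.count("H") * (1 - b) ** flips.count("T")
                  for p, b in zip(prior, bias)]
        z = sum(unnorm)
        return [u / z for u in unnorm]

    flips = "HT"
    ml = bias[max(range(3), key=lambda i: posterior(uniform, flips)[i])]        # 0.5
    map_ = bias[max(range(3), key=lambda i: posterior(informative, flips)[i])]  # 0.9
    bayes = sum(b * p for b, p in zip(bias, posterior(informative, flips)))     # ~0.68
    print(ml, map_, bayes)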

Summary
- ML (Maximum Likelihood): easy to compute.
- MAP (Maximum A Posteriori): still easy to compute; incorporates prior knowledge.
- Bayesian: minimizes error, so it is great when data is scarce, but potentially much harder to compute.

                     Maximum Likelihood   Maximum A Posteriori   Bayesian Estimate
    Prior            Uniform              Any                    Any
    Hypothesis used  The most likely      The most likely        Weighted combination

Envelope Problem (revisited)
One envelope has twice as much money as the other. A has $20, so E[B] = 1/2 ($10 + $40) = $25. Switch?

Those Pesky Distributions

                          Binary {0, 1}   M values      Continuous
    Single event          Bernoulli       --            Gaussian (Normal)
    Sequence (N trials)   Binomial        Multinomial   --
    Conjugate prior       Beta            Dirichlet     --

Bernoulli: P(x = 1) = θ, P(x = 0) = 1 − θ. For a sequence of N = n_H + n_T flips, the counts (n_H, n_T) are the sufficient statistics.

Prior Distributions
Two slightly harder questions:
- Which coin is he using (of three known choices)?
- What is the bias of a single new, unseen coin?

What if I have prior beliefs?
Billionaire says: "Here's a new coin; I bet it's close to 50-50. What can you do for me now?"
You say: "I can learn it the Bayesian way." Rather than estimating a single θ, we obtain a distribution over possible values of θ.

Bayesian Learning
Use Bayes rule:
    P(θ | D) = P(D | θ) P(θ) / P(D)
i.e. posterior = (data likelihood × prior) / normalization, or equivalently P(θ | D) ∝ P(D | θ) P(θ).
In the beginning we have only the prior over θ; then we observe flips, e.g. {tails, tails}, and update to the posterior after each observation.
Remember: for uniform priors, maximizing the posterior reduces to the MLE objective.

Bayesian Learning for Dollars
The likelihood function is Binomial: for n_H heads and n_T tails,
    P(D | θ) = θ^{n_H} (1 − θ)^{n_T}
What about the prior? We want it to represent expert knowledge and to give a simple posterior form. Conjugate prior: a prior with a closed-form representation of the posterior (the posterior stays in the same family as the prior). For the Binomial likelihood the conjugate prior is the Beta distribution (writing its hyperparameters as β_H and β_T):
    P(θ) = Beta(β_H, β_T) ∝ θ^{β_H − 1} (1 − θ)^{β_T − 1}
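A minimal sketch of the resulting conjugate update, using the hyperparameter names above; the prior strength chosen for the billionaire's "close to 50-50" belief is purely illustrative.

    # Sketch (illustrative): Beta prior + Binomial data -> Beta posterior.
    def beta_binomial_update(beta_h, beta_t, n_heads, n_tails):
        """Posterior hyperparameters of a Beta(beta_h, beta_t) prior after the data."""
        return beta_h + n_heads, beta_t + n_tails

    prior = (10, 10)   # "close to 50-50": a Beta prior centered at 0.5 (strength is a guess)
    post_h, post_t = beta_binomial_update(*prior, n_heads=0, n_tails=2)  # observe {T, T}
    print(post_h, post_t, post_h / (post_h + post_t))   # 10 12 ~0.455: barely moved off 0.5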

Posterior Distribution
Prior: Beta(β_H, β_T). Data: n_H heads and n_T tails. Posterior distribution:
    P(θ | D) = Beta(β_H + n_H, β_T + n_T)

Bayesian Posterior Inference
The result is no longer a single parameter but a posterior distribution over θ. For any specific function of interest f, compute its expected value:
    E[f(θ)] = ∫ f(θ) P(θ | D) dθ
This integral is often hard to compute.

MAP: Maximum a Posteriori Approximation
As more data is observed, the Beta posterior becomes more certain (more peaked). MAP: use the most likely parameter to approximate the expectation,
    θ_MAP = argmax_θ P(θ | D)

MAP for the Beta distribution
Use the most likely parameter (the mode of the Beta posterior):
    θ_MAP = (β_H + n_H − 1) / (β_H + n_H + β_T + n_T − 2)
The Beta prior is equivalent to extra thumbtack flips. As N → ∞, the prior is forgotten; but for small sample sizes, the prior is important!
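A small sketch of that last point (illustrative numbers, not from the slides): the MAP estimate starts near the prior and approaches the observed frequency as the sample grows.

    # Sketch: the Beta prior acts like extra flips and is forgotten as N grows.
    def map_estimate(beta_h, beta_t, n_heads, n_tails):
        """Mode of the Beta(beta_h + n_heads, beta_t + n_tails) posterior."""
        return (beta_h + n_heads - 1) / (beta_h + n_heads + beta_t + n_tails - 2)

    beta_h, beta_t = 10, 10            # prior: roughly 20 imaginary fair flips
    for n in (2, 20, 200, 2000):       # observe 75% heads at growing sample sizes
        n_heads = int(0.75 * n)
        print(n, round(map_estimate(beta_h, beta_t, n_heads, n - n_heads), 3))
    # 2 0.5, 20 0.632, 200 0.729, 2000 0.748 -- moving from the prior toward 0.75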

What about continuous variables?
Billionaire says: "If I am measuring a continuous variable, what can you do for me?"
You say: "Let me tell you about Gaussians."

Some properties of Gaussians
- Affine transformations (multiplying by a scalar and adding a constant) of a Gaussian are Gaussian:
      X ~ N(μ, σ²) and Y = aX + b  ⇒  Y ~ N(aμ + b, a²σ²)
- The sum of independent Gaussians is Gaussian:
      X ~ N(μ_X, σ²_X), Y ~ N(μ_Y, σ²_Y), Z = X + Y  ⇒  Z ~ N(μ_X + μ_Y, σ²_X + σ²_Y)
- Easy to differentiate, as we will see soon!

Learning a Gaussian
Collect a bunch of data, hopefully i.i.d. samples, e.g. exam scores, and learn the parameters: mean μ and variance σ².

    i     exam score x_i
    0     85
    1     95
    2     100
    3     12
    ...   ...
    99    89

MLE for a Gaussian
Probability of the i.i.d. samples D = {x_1, …, x_N}:
    P(D | μ, σ) = (1 / (σ√(2π)))^N ∏_i exp(−(x_i − μ)² / (2σ²))

Your second learning algorithm: MLE for the mean of a Gaussian. Take the log-likelihood of the data and set its derivative with respect to μ to zero:
    μ_MLE = (1/N) Σ_i x_i

MLE for the variance: again, set the derivative (now with respect to σ) to zero:
    σ²_MLE = (1/N) Σ_i (x_i − μ_MLE)²

BTW, the MLE for the variance of a Gaussian is biased: the expected result of the estimation is not the true parameter! Unbiased variance estimator:
    σ²_unbiased = (1/(N − 1)) Σ_i (x_i − μ_MLE)²
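These formulas are easy to check numerically; the sketch below (illustrative, using only the exam scores visible in the table above) computes the MLE mean, the biased MLE variance, and the unbiased estimator.

    # Sketch: Gaussian MLE on the visible exam scores (a tiny, illustrative sample).
    import numpy as np

    scores = np.array([85.0, 95.0, 100.0, 12.0, 89.0])
    n = len(scores)

    mu_mle = scores.sum() / n                                # (1/N) * sum_i x_i
    var_mle = ((scores - mu_mle) ** 2).sum() / n             # biased MLE variance
    var_unbiased = ((scores - mu_mle) ** 2).sum() / (n - 1)  # divide by N-1 instead

    print(mu_mle, var_mle, var_unbiased)
    # Same as np.mean(scores), np.var(scores), np.var(scores, ddof=1).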

Bayesian learning of Gaussian parameters
Conjugate priors:
- Mean: Gaussian prior
- Variance: Wishart distribution

MAP for the mean of a Gaussian
Prior for the mean: a Gaussian, written here as P(μ) = N(η, λ²).
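A minimal sketch of this last step, under the assumptions that the variance σ² is known and the prior on the mean is N(η, λ²) (that notation is introduced here, not taken from the slides): the posterior over μ is again Gaussian, and its mean is the MAP estimate.

    # Sketch (assumes known variance sigma2 and a N(eta, lam2) prior on the mean;
    # the parameter values are illustrative).
    import numpy as np

    def map_mean(data, sigma2, eta, lam2):
        """MAP (= posterior mean) of a Gaussian mean under a N(eta, lam2) prior."""
        n = len(data)
        post_precision = 1.0 / lam2 + n / sigma2
        return (eta / lam2 + data.sum() / sigma2) / post_precision

    scores = np.array([85.0, 95.0, 100.0, 12.0, 89.0])
    print(map_mean(scores, sigma2=100.0, eta=75.0, lam2=25.0))   # ~75.7
    # Pulled toward the prior mean eta with little data; approaches the MLE mean
    # (1/N) * sum_i x_i as more samples arrive.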