Machine Learning CMPT 726 Simon Fraser University. Binomial Parameter Estimation

Outline: Maximum Likelihood Estimation. Smoothed Frequencies, Laplace Correction. Bayesian Approach. Conjugate Prior. Uniform Prior.

Coin Tossing. Let's say you're given a coin, and you want to find out P(heads), the probability that if you flip it, it lands heads. Flip it a few times: H H T. P(heads) = 2/3, no need for CMPT 726. Hmm... is this rigorous? Does this make sense?

Coin Tossing - Model. Bernoulli distribution: P(heads) = µ, P(tails) = 1 − µ. Assume coin flips are independent and identically distributed (i.i.d.), i.e. all are separate samples from the Bernoulli distribution. Given data D = {x_1, ..., x_N}, with x_n = 1 for heads and x_n = 0 for tails, the likelihood of the data is: $p(D \mid \mu) = \prod_{n=1}^{N} p(x_n \mid \mu) = \prod_{n=1}^{N} \mu^{x_n} (1-\mu)^{1-x_n}$
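A minimal Python sketch (added here, not part of the original slides; the helper name bernoulli_likelihood is just for illustration) that evaluates this Bernoulli likelihood for the H H T example:

```python
import numpy as np

def bernoulli_likelihood(data, mu):
    """Likelihood p(D | mu) of i.i.d. coin flips (1 = heads, 0 = tails)."""
    data = np.asarray(data)
    return np.prod(mu ** data * (1 - mu) ** (1 - data))

D = [1, 1, 0]                          # H H T, the example from the slides
print(bernoulli_likelihood(D, 2/3))    # ~0.148, the maximum over mu
print(bernoulli_likelihood(D, 0.5))    # 0.125
```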

Maximum Likelihood Estimation. Given D with h heads and t tails, what should µ be? Maximum Likelihood Estimation (MLE): choose the µ which maximizes the likelihood of the data, $\mu_{ML} = \arg\max_{\mu} p(D \mid \mu)$. Since $\ln(\cdot)$ is monotone increasing: $\mu_{ML} = \arg\max_{\mu} \ln p(D \mid \mu)$.

Maximum Likelihood Estimation. Likelihood: $p(D \mid \mu) = \prod_{n=1}^{N} \mu^{x_n} (1-\mu)^{1-x_n}$. Log-likelihood: $\ln p(D \mid \mu) = \sum_{n=1}^{N} \left[ x_n \ln \mu + (1 - x_n) \ln(1-\mu) \right]$. Take the derivative and set it to 0: $\frac{d}{d\mu} \ln p(D \mid \mu) = \sum_{n=1}^{N} \left[ x_n \frac{1}{\mu} - (1 - x_n) \frac{1}{1-\mu} \right] = \frac{h}{\mu} - \frac{t}{1-\mu} = 0 \;\Rightarrow\; \mu = \frac{h}{t + h}$
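As a sanity check (a sketch added here, not from the slides), the closed-form MLE h/(t + h) can be compared against a simple grid search over the log-likelihood:

```python
import numpy as np

D = np.array([1, 1, 0])          # H H T
h = int(D.sum())
t = len(D) - h

mu_mle = h / (h + t)             # closed form from setting the derivative to 0

# Grid search over the log-likelihood as a numerical cross-check.
mus = np.linspace(1e-6, 1 - 1e-6, 10_000)
log_lik = h * np.log(mus) + t * np.log(1 - mus)
mu_grid = mus[np.argmax(log_lik)]

print(mu_mle, mu_grid)           # both approximately 0.6667
```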

Practical Problems with Using the MLE. The MLE estimate is $\mu_{ML} = h/n$, with h heads, t tails, n = h + t. The zero problem: if h or t is 0, the estimated probability of 0 may later be multiplied with other nonzero probabilities, zeroing out the whole product (a singularity). If n = 0, there is no estimate at all. This happens quite often in high-dimensional spaces.

Smoothing Frequency Estimates. h heads, t tails, n = h + t. Prior probability estimate p, equivalent sample size m. m-estimate: $\frac{h + mp}{n + m}$. Interpretation: we started with a virtual sample of m tosses with mp heads. Laplace correction (p = 1/2, m = 2): $\frac{h + 1}{n + 2}$.
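A small sketch (not from the slides; the helper name m_estimate is purely illustrative) of the m-estimate and its Laplace special case:

```python
def m_estimate(h, t, p=0.5, m=2):
    """Smoothed frequency (h + m*p) / (n + m), with n = h + t."""
    n = h + t
    return (h + m * p) / (n + m)

# Laplace correction: p = 1/2, m = 2 gives (h + 1) / (n + 2).
print(m_estimate(2, 1))   # (2 + 1) / (3 + 2) = 0.6
print(m_estimate(0, 0))   # no data: falls back to the prior estimate 0.5
print(m_estimate(3, 0))   # avoids the hard 0/1 extremes: 4/5 = 0.8
```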

Bayesian Approach. Key idea: don't even try to pick a specific parameter value µ; instead, use a probability distribution over parameter values. Learning = use Bayes' theorem to update the probability distribution. Prediction = model averaging.

Prior Distribution over Parameters. Could use a uniform distribution. Exercise: what does a uniform distribution over [0, 1] look like? What if we don't think the prior distribution is uniform? Use a conjugate prior. The prior has parameters a, b, called hyperparameters. The prior P(µ | a, b) = f(a, b) is some function of the hyperparameters. The posterior has the same functional form, f(a', b'), where a', b' are updated by Bayes' theorem.

Beta Distribution. We will use the Beta distribution to express our prior knowledge about coins: $\text{Beta}(\mu \mid a, b) = \underbrace{\frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}}_{\text{normalization}} \mu^{a-1} (1-\mu)^{b-1}$. Parameters a and b control the shape of this distribution.
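To see how a and b control the shape, one can evaluate the density with SciPy (an added sketch, assuming scipy is available; the slides themselves contain no code):

```python
import numpy as np
from scipy.stats import beta

mus = np.linspace(0.01, 0.99, 5)
for a, b in [(1, 1), (2, 2), (8, 4)]:   # uniform, symmetric bump, skewed toward heads
    print(f"a={a}, b={b}:", np.round(beta.pdf(mus, a, b), 3))
```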

Posterior. $P(\mu \mid D) \propto P(D \mid \mu) P(\mu) \propto \underbrace{\prod_{n=1}^{N} \mu^{x_n} (1-\mu)^{1-x_n}}_{\text{likelihood}} \; \underbrace{\mu^{a-1} (1-\mu)^{b-1}}_{\text{prior}} = \mu^{h} (1-\mu)^{t} \, \mu^{a-1} (1-\mu)^{b-1} = \mu^{h+a-1} (1-\mu)^{t+b-1}$. The simple form of the posterior is due to the use of a conjugate prior. Parameters a and b act as extra observations. Note that as N = h + t grows, the prior is ignored (swamped by the data).
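A sketch of the conjugate update (added for illustration, not from the slides; the prior hyperparameters below are assumed values): a Beta(a, b) prior plus h heads and t tails gives a Beta(a + h, b + t) posterior.

```python
from scipy.stats import beta

a, b = 2, 2            # prior hyperparameters, chosen here just for illustration
h, t = 2, 1            # observed data: H H T

posterior = beta(a + h, b + t)     # posterior is again a Beta distribution
print(posterior.mean())            # (a + h) / (a + b + h + t) = 4/7 ~ 0.571
print(posterior.pdf(0.5), posterior.pdf(0.9))
```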

Bayesian Point Estimation. What if a Bayesian had to guess a single parameter value given the distribution P over parameters? Use the expected value $E_P(\mu)$. E.g., for P = Beta(µ | a, b) we have $E_P(\mu) = \frac{a}{a + b}$. If we use a uniform prior P, what is $E_P(\mu \mid D)$? The Laplace correction!
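A final sketch (not from the slides; helper names are illustrative) confirming that the posterior mean under a uniform Beta(1, 1) prior equals the Laplace correction:

```python
def posterior_mean(h, t, a=1.0, b=1.0):
    """Posterior mean of mu for a Beta(a, b) prior after h heads and t tails."""
    return (a + h) / (a + b + h + t)

def laplace_correction(h, t):
    return (h + 1) / (h + t + 2)

print(posterior_mean(2, 1), laplace_correction(2, 1))   # both 0.6
```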