Discrete Binary Distributions

Discrete Binary Distributions
Carl Edward Rasmussen

Key concepts
- Bernoulli: probabilities over binary variables
- Binomial: probabilities over counts and binary sequences
- Inference, priors and pseudo-counts, the Beta distribution
- Model comparison: an example

Coin tossing
You are presented with a coin: what is the probability of heads? What does this question even mean?
How much are you willing to bet that p(head) > 0.5? Do you expect this coin to come up heads more often than tails?
Wait... can you toss the coin a few times? I need data!
Ok, you observe the following sequence of outcomes (T: tail, H: head): H
This is not enough data!
Now you observe the outcome of three additional tosses: HHTH
How much are you now willing to bet that p(head) > 0.5?

The Bernoulli discrete binary distribution
The Bernoulli probability distribution over binary random variables:
- Binary random variable X: outcome x of a single coin toss.
- The two values x can take are X = 0 for tail and X = 1 for heads.
- Let the probability of heads be θ = p(x = 1). θ is the parameter of the Bernoulli distribution.
- The probability of tail is p(x = 0) = 1 − θ.
We can compactly write p(X = x | θ) = p(x | θ) = θ^x (1 − θ)^(1−x).
What do we think θ is after observing a single heads outcome?
Maximum likelihood! Maximise p(H | θ) with respect to θ:
p(H | θ) = p(x = 1 | θ) = θ,   argmax_{θ ∈ [0,1]} θ = 1
Ok, so the answer is θ = 1. This coin only generates heads. Is this reasonable?
How much are you willing to bet that p(heads) > 0.5?
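
The maximum likelihood calculation above is simple enough to check numerically. Below is a minimal sketch (mine, not from the lecture, assuming NumPy) that evaluates the Bernoulli likelihood of a single observed head on a grid of θ values and picks the maximiser.

```python
# Sketch: Bernoulli likelihood and its maximiser after observing one head.
import numpy as np

def bernoulli_pmf(x, theta):
    """p(x | theta) = theta^x * (1 - theta)^(1 - x) for x in {0, 1}."""
    return theta**x * (1 - theta)**(1 - x)

thetas = np.linspace(0, 1, 101)          # grid of candidate parameters
likelihood = bernoulli_pmf(1, thetas)    # p(H | theta) = theta

print(thetas[np.argmax(likelihood)])     # 1.0: ML says the coin only produces heads
```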

The binomial distribution: counts of binary outcomes
We observe a sequence of tosses rather than a single toss: HHTH
The probability of this particular sequence is: p(HHTH | θ) = θ³(1 − θ).
But so is the probability of THHH, of HTHH and of HHHT.
We often don't care about the order of the outcomes, only about the counts. In our example the probability of 3 heads out of 4 tosses is: (4 choose 3) θ³(1 − θ).
The binomial distribution gives the probability of observing k heads out of n tosses:
p(k | θ, n) = (n choose k) θ^k (1 − θ)^(n−k)
This assumes n independent tosses from a Bernoulli distribution p(x | θ).
(n choose k) = n! / (k! (n − k)!) is the binomial coefficient, also known as "n choose k".
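
As an illustration (mine, not from the slides), the binomial pmf is a one-liner in Python using math.comb:

```python
# Sketch: binomial pmf p(k | theta, n) = C(n, k) * theta^k * (1 - theta)^(n - k).
from math import comb

def binomial_pmf(k, n, theta):
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

print(binomial_pmf(3, 4, 0.5))   # probability of 3 heads in 4 tosses of a fair coin: 0.25
```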

Naming of discrete distributions

             binary                                      multi-valued
sequence     Bernoulli: θ^k (1 − θ)^(n−k)                categorical: ∏_i θ_i^(c_i)
counts       binomial: (n choose k) θ^k (1 − θ)^(n−k)    multinomial: (n! / ∏_i c_i!) ∏_i θ_i^(c_i)

For binary outcomes we have k successes in n trials. For multi-valued distributions we have k possible outcomes, a sequence x_1, ..., x_N and counts c_i = Σ_{n=1}^N δ(x_n, i).
For all distributions the parameter of the distribution is the vector θ, which has either one element θ ∈ [0, 1] or multiple entries such that θ_i > 0 and Σ_i θ_i = 1.
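
A small sketch (mine, assuming NumPy and SciPy) of the sequence/counts distinction in the multi-valued case: the probability of a particular ordered sequence is a product of categorical terms, while the probability of its counts is the multinomial pmf.

```python
# Sketch: categorical sequence probability vs multinomial counts probability.
import numpy as np
from scipy.stats import multinomial

theta = np.array([0.2, 0.3, 0.5])        # categorical parameters, sum to 1
sequence = np.array([2, 0, 2, 1, 2, 2])  # observed outcomes x_1..x_N in {0, 1, 2}
counts = np.bincount(sequence, minlength=len(theta))   # c_i = occurrences of outcome i

p_sequence = np.prod(theta[sequence])                  # probability of this exact ordering
p_counts = multinomial.pmf(counts, n=len(sequence), p=theta)  # probability of the counts

print(counts, p_sequence, p_counts)
```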

Maximum likelihood under a binomial distribution
If we observe k heads out of n tosses, what do we think θ is?
We can maximise the likelihood of the parameter θ given the observed data:
p(k | θ, n) ∝ θ^k (1 − θ)^(n−k)
It is convenient to take the logarithm and derivatives with respect to θ:
log p(k | θ, n) = k log θ + (n − k) log(1 − θ) + constant
∂ log p(k | θ, n) / ∂θ = k/θ − (n − k)/(1 − θ) = 0   ⟹   θ = k/n
Is this reasonable? For HHTH we get θ = 3/4.
How much would you bet now that p(heads) > 0.5? What do you think p(θ > 0.5) is?
Wait! This is a probability over... a probability?
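
As a quick numerical check (mine, assuming NumPy), the binomial log likelihood for 3 heads in 4 tosses indeed peaks at θ = k/n = 0.75:

```python
# Sketch: the log likelihood k*log(theta) + (n - k)*log(1 - theta) peaks at k/n.
import numpy as np

k, n = 3, 4                               # HHTH: 3 heads out of 4 tosses
thetas = np.linspace(0.001, 0.999, 999)
log_lik = k * np.log(thetas) + (n - k) * np.log(1 - thetas)

print(thetas[np.argmax(log_lik)])         # approximately 0.75 = k/n
```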

Prior beliefs about coins, before tossing the coin
So you have observed 3 heads out of 4 tosses but are unwilling to bet that p(heads) > 0.5? (That, for example, over a very large number of tosses at least half will be heads.) Why?
You might believe that coins tend to be fair (θ ≈ 1/2). A finite set of observations updates your opinion about θ.
But how to express your opinion about θ before you see any data?
Pseudo-counts: you think the coin is fair and... you are...
- Not very sure. You act as if you had seen 2 heads and 2 tails before.
- Pretty sure. It is as if you had observed 20 heads and 20 tails before.
- Totally sure. As if you had seen 1000 heads and 1000 tails before.
Depending on the strength of your prior assumptions, it takes a different number of actual observations to change your mind.

The Beta distribution: distributions on probabilities
Continuous probability distribution defined on the interval [0, 1]:
Beta(θ | α, β) = Γ(α + β) / (Γ(α) Γ(β)) · θ^(α−1) (1 − θ)^(β−1) = (1 / B(α, β)) · θ^(α−1) (1 − θ)^(β−1)
- α > 0 and β > 0 are the shape parameters; these parameters correspond to one plus the pseudo-counts.
- Γ(α) is an extension of the factorial function, Γ(α) = ∫₀^∞ x^(α−1) e^(−x) dx, with Γ(n) = (n − 1)! for integer n.
- B(α, β) is the beta function; it normalises the Beta distribution.
- The mean is given by E(θ) = α / (α + β).
[Figure: Beta(1, 1) (left, α = β = 1) and Beta(3, 3) (right, α = β = 3) densities p(θ) on θ ∈ [0, 1].]
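
The Beta density and its mean are available directly in SciPy; the sketch below (mine, not from the slides) evaluates the two densities plotted on the slide and checks that the mean is α / (α + β).

```python
# Sketch: Beta densities Beta(1, 1) and Beta(3, 3) and their means.
import numpy as np
from scipy.stats import beta

thetas = np.linspace(0, 1, 101)

for a, b in [(1, 1), (3, 3)]:
    density = beta.pdf(thetas, a, b)             # Beta(theta | alpha, beta) on a grid
    print(a, b, beta.mean(a, b), a / (a + b))    # the two means agree
```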

Posterior for coin tossing
Imagine we observe a single coin toss and it comes out heads. Our observed data is D = {k = 1}, where n = 1.
The probability of the observed data given θ is the likelihood: p(D | θ) = θ
We use our prior p(θ | α, β) = Beta(θ | α, β) to get the posterior probability:
p(θ | D) = p(θ | α, β) p(D | θ) / p(D)
         ∝ θ · Beta(θ | α, β) ∝ θ · θ^(α−1) (1 − θ)^(β−1) = θ^α (1 − θ)^(β−1) ∝ Beta(θ | α + 1, β)
The Beta distribution is a conjugate prior to the Bernoulli/binomial distribution: the resulting posterior is also a Beta distribution. The posterior parameters are given by:
α_posterior = α_prior + k
β_posterior = β_prior + (n − k)
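
The conjugate update is just an addition of counts. A minimal sketch (mine) of the update rule:

```python
# Sketch: conjugate Beta update - k heads in n tosses turns Beta(a, b) into Beta(a + k, b + n - k).
def beta_posterior(a_prior, b_prior, k, n):
    """Posterior Beta parameters after observing k heads in n tosses."""
    return a_prior + k, b_prior + (n - k)

print(beta_posterior(1, 1, k=1, n=1))   # (2, 1): Beta(1, 1) prior, one head observed
print(beta_posterior(3, 3, k=1, n=1))   # (4, 3): Beta(3, 3) prior, one head observed
```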

Before and after observing one head
[Figure: top row shows the priors Beta(1, 1) and Beta(3, 3); bottom row shows the corresponding posteriors after observing a single head, Beta(2, 1) and Beta(4, 3). Each panel plots p(θ) over θ ∈ [0, 1].]

Making predictions
Given some data D, what is the predicted probability of the next toss being heads, x_next = 1?
Under the maximum likelihood approach we predict using the value θ_ML that maximises the likelihood of θ given the observed data D:
p(x_next = 1 | θ_ML) = θ_ML
With the Bayesian approach, we average over all possible parameter settings:
p(x_next = 1 | D) = ∫ p(x = 1 | θ) p(θ | D) dθ
The prediction for heads happens to correspond to the mean of the posterior distribution. E.g. for D = {(x = 1)}:
- Learner A with prior Beta(1, 1) predicts p(x_next = 1 | D) = 2/3
- Learner B with prior Beta(3, 3) predicts p(x_next = 1 | D) = 4/7
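
A sketch (mine) of both predictions, using exact fractions so the 2/3 and 4/7 from the slide come out exactly:

```python
# Sketch: Bayesian prediction = posterior mean under a Beta prior, vs maximum likelihood.
from fractions import Fraction

def predict_heads(a_prior, b_prior, k, n):
    """Posterior-mean probability of heads after k heads in n tosses."""
    a_post, b_post = a_prior + k, b_prior + (n - k)
    return Fraction(a_post, a_post + b_post)

print(predict_heads(1, 1, k=1, n=1))   # 2/3 for Learner A (Beta(1, 1) prior)
print(predict_heads(3, 3, k=1, n=1))   # 4/7 for Learner B (Beta(3, 3) prior)
# Maximum likelihood would predict k/n = 1, i.e. certainty of heads after one toss.
```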

Making predictions - other statistics
Given the posterior distribution, we can also answer other questions, such as: what is the probability that θ > 0.5 given the observed data?
p(θ > 0.5 | D) = ∫_{0.5}^{1} p(θ | D) dθ = ∫_{0.5}^{1} Beta(θ | α', β') dθ
- Learner A with prior Beta(1, 1) predicts p(θ > 0.5 | D) = 0.75
- Learner B with prior Beta(3, 3) predicts p(θ > 0.5 | D) ≈ 0.66
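
These tail probabilities are one call to the Beta survival function in SciPy (sketch mine, not from the slides):

```python
# Sketch: posterior probability that theta > 0.5 after one observed head.
from scipy.stats import beta

print(beta.sf(0.5, 2, 1))   # Learner A's posterior Beta(2, 1): 0.75
print(beta.sf(0.5, 4, 3))   # Learner B's posterior Beta(4, 3): about 0.66
```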

Learning about a coin, multiple models (1)
Consider two alternative models of a coin, 'fair' and 'bent'. A priori, we may think that fair is more probable, e.g.:
p(fair) = 0.8,   p(bent) = 0.2
For the bent coin, (a little unrealistically) all parameter values could be equally likely, whereas the fair coin has a fixed probability:
[Figure: p(q | bent) is uniform over the parameter q ∈ [0, 1], while p(q | fair) places all its mass at q = 0.5.]

Learning about a coin, multiple models (2)
We make 10 tosses, and get data D: T H T H T T T T T T
The evidence for the fair model is:
p(D | fair) = (1/2)^10 ≈ 0.001
and for the bent model:
p(D | bent) = ∫ p(D | q, bent) p(q | bent) dq = ∫ q^2 (1 − q)^8 dq = B(3, 9) ≈ 0.002
Using priors p(fair) = 0.8, p(bent) = 0.2, the posterior by Bayes rule:
p(fair | D) ∝ 0.0008,   p(bent | D) ∝ 0.0004
i.e. two thirds probability that the coin is fair.
How do we make predictions? By weighting the predictions from each model by their probability. Probability of heads at the next toss is:
(2/3) × (1/2) + (1/3) × (3/12) = 5/12
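
The whole calculation fits in a few lines (sketch mine, assuming SciPy for the beta function):

```python
# Sketch: model comparison for the sequence T H T H T T T T T T (2 heads, 8 tails).
from scipy.special import beta as beta_fn

k, n = 2, 10                               # heads and total tosses
prior_fair, prior_bent = 0.8, 0.2

evidence_fair = 0.5**n                     # (1/2)^10, fair coin has fixed q = 1/2
evidence_bent = beta_fn(k + 1, n - k + 1)  # integral of q^k (1-q)^(n-k) dq = B(3, 9)

joint_fair = prior_fair * evidence_fair
joint_bent = prior_bent * evidence_bent
post_fair = joint_fair / (joint_fair + joint_bent)
print(post_fair, 1 - post_fair)            # roughly 2/3 and 1/3

# Model-averaged prediction of heads at the next toss:
pred = post_fair * 0.5 + (1 - post_fair) * (k + 1) / (n + 2)   # bent posterior Beta(3, 9) has mean 3/12
print(pred)                                # close to the slide's 5/12
```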