Bayesian Models in Machine Learning

Bayesian Models in Machine Learning
Lukáš Burget
Escuela de Ciencias Informáticas 2017, Buenos Aires, July 24-29, 2017

Frequentist vs. Bayesian
Frequentist point of view: probability is the frequency of an event occurring in a large (infinite) number of trials. E.g., when flipping a coin many times, what is the proportion of heads?
Bayesian point of view: we infer probabilities for events that have never occurred, or beliefs that are not directly observed. Prior beliefs are specified in terms of prior probabilities, and we take into account the uncertainty (posterior distribution) of the estimated parameters or hidden variables in our probabilistic model.

Coin flipping example
Let's flip a coin N = 1000 times, getting H = 750 heads and T = 250 tails. What is the head probability $\theta$? The intuitive (and also ML) estimate is $750 / 1000 = 0.75$. Given some $\theta$, we can calculate the probability (likelihood) of the data X:
$$p(X|\theta) = \theta^H (1-\theta)^T$$
Now let's express our ignorant prior belief about $\theta$ as a uniform density, $p(\theta) = 1$ for $\theta \in (0,1)$. Then, using Bayes' rule, we obtain the posterior probability density function for $\theta$:
$$p(\theta|X) = \frac{p(X|\theta)\,p(\theta)}{p(X)} = \frac{p(X|\theta)\,p(\theta)}{\int p(X|\theta)\,p(\theta)\,d\theta}$$

Coin flipping example (cont.)
N = 1000, H = 750, T = 250. The posterior distribution $p(\theta|X)$ is our new belief about $\theta$. Flipping the coin once more, what is the probability of a head? Averaging over the posterior gives
$$p(x'=1|X) = \int_0^1 \theta\, p(\theta|X)\, d\theta = \frac{H+1}{N+2} = \frac{751}{1002} \approx 0.7495.$$
Note that we never computed any single value of $\theta$; the prediction averages over all values of $\theta$ weighted by the posterior. This is the rule of succession, used by Pierre-Simon Laplace to estimate the probability that the sun will rise tomorrow as $(d+1)/(d+2)$, where $d$ is the number of days the sun has been observed to rise so far.
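
To make the arithmetic concrete, here is a minimal Python sketch of this example, assuming a uniform Beta(1, 1) prior and using scipy.stats; the variable names are illustrative only.

```python
# Coin flipping with a uniform prior: a minimal sketch (assumed Beta(1, 1) prior).
from scipy.stats import beta

N, H = 1000, 750                 # number of flips and observed heads
T = N - H                        # observed tails

# A uniform prior p(theta) = 1 on (0, 1) is Beta(1, 1), so the posterior is
# Beta(1 + H, 1 + T).
posterior = beta(1 + H, 1 + T)

# The posterior predictive probability of a head on the next flip is the
# posterior mean of theta, i.e. Laplace's rule of succession (H + 1) / (N + 2).
print(posterior.mean())          # ~0.7495
print((H + 1) / (N + 2))         # 751 / 1002, the same value
```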

Distributions from our example
Likelihood of the observed data given a parametric model of the probability distribution: $p(X|\theta)$, a Bernoulli distribution with parameter $\theta$.
Prior on the parameters of the model: $p(\theta)$, here a uniform prior as a special case of the Beta distribution.
Posterior distribution of the model parameters given the observed data: $p(\theta|X)$.
Posterior predictive distribution of a new observation given the prior (training) observations: $p(x'|X)$.

Bernoulli and Binomial distributions
The coin flipping distribution is the Bernoulli distribution: flipping the coin once, what is the probability of x = 1 (head) or x = 0 (tail)?
$$\mathrm{Bern}(x|\theta) = \theta^x (1-\theta)^{1-x}$$
The related Binomial distribution is also described by a single probability $\theta$: how many heads do I get if I flip the coin N times?
$$\mathrm{Bin}(H|N,\theta) = \binom{N}{H}\, \theta^H (1-\theta)^{N-H}$$
(Figure: Binomial probabilities for N = 10, $\theta$ = 0.25.)
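
As a quick illustration (not from the original slides), both distributions can be evaluated with scipy.stats; the values N = 10 and theta = 0.25 match the figure on the slide.

```python
# Bernoulli and binomial pmfs for the slide's example values (N = 10, theta = 0.25).
from scipy.stats import bernoulli, binom

theta = 0.25
print(bernoulli.pmf(1, theta))   # P(x = 1, head) = theta
print(bernoulli.pmf(0, theta))   # P(x = 0, tail) = 1 - theta

N = 10
# Probability of observing exactly k heads in N flips, for k = 0..N.
print([round(binom.pmf(k, N, theta), 4) for k in range(N + 1)])
```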

Beta distribution
$$\mathrm{Beta}(\theta|\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta^{\alpha-1} (1-\theta)^{\beta-1}$$
The term with the Gamma functions is the normalizing constant. The Beta distribution has a similar form to Bern or Bin, but it is now a function of $\theta$. It is a continuous distribution for $\theta$ over the interval (0, 1) and can be used to express our prior beliefs about the Bernoulli distribution parameter. The uniform distribution over (0, 1), as was the prior in our coin flipping example, is obtained with $\alpha = \beta = 1$.

Beta as a conjugate prior
Using a Beta prior for the Bernoulli parameter results in a Beta posterior distribution:
$$p(\theta|X) \propto \theta^H (1-\theta)^T \cdot \theta^{\alpha-1}(1-\theta)^{\beta-1} \;\Rightarrow\; p(\theta|X) = \mathrm{Beta}(\theta|\alpha+H,\, \beta+T)$$
The counts H and T are the sufficient statistics of the observed data. Beta is the conjugate prior to Bernoulli, and its parameters $\alpha, \beta$ can be seen as prior counts of heads and tails. As before, it is a continuous distribution of $\theta$ over the interval (0, 1) expressing our prior beliefs about the Bernoulli distribution parameter.
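
A small sketch of this update in Python; the prior pseudo-counts alpha0 = beta0 = 2 are an arbitrary illustrative choice, not values from the lecture.

```python
# Beta-Bernoulli conjugate update: Beta(alpha, beta) prior + counts (H, T)
# -> Beta(alpha + H, beta + T) posterior.
from scipy.stats import beta as beta_dist

alpha0, beta0 = 2.0, 2.0             # assumed prior pseudo-counts of heads and tails
H, T = 750, 250                      # observed counts from the running example

posterior = beta_dist(alpha0 + H, beta0 + T)
print(posterior.mean())              # posterior mean of theta
print(posterior.interval(0.95))      # 95% credible interval for theta
```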

Categorical and Multinomial distributions
A discrete event (e.g. the number obtained on a dice throw) can be represented with a one-hot encoding: a vector with a single 1 at the position of the observed event. The probabilities of the events are collected in a vector $\boldsymbol{\pi} = [\pi_1, \ldots, \pi_C]$ (e.g. $\pi_c = 1/6$ for a fair die), which is a point on a simplex (figure: the simplex for C = 3).
The Categorical distribution simply returns the probability of a given event x; a sample from the distribution is the event (or its one-hot encoding):
$$\mathrm{Cat}(\mathbf{x}|\boldsymbol{\pi}) = \prod_{c=1}^{C} \pi_c^{x_c}$$
The Multinomial distribution is also described by a single probability vector: how many ones, twos, threes, etc. do I get if I throw the die N times? A sample from the distribution is a vector of counts (e.g. 11x one, 8x two, ...).
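
A short sketch of sampling from these two distributions for a fair six-sided die, using numpy (a categorical draw is simply a multinomial draw with N = 1); the numbers are purely illustrative.

```python
# Categorical (one-hot) and multinomial (counts) samples for a fair die.
import numpy as np

rng = np.random.default_rng(0)
pi = np.full(6, 1.0 / 6.0)           # event probabilities pi (fair die)

one_hot = rng.multinomial(1, pi)     # one categorical draw, one-hot encoded
counts = rng.multinomial(30, pi)     # counts of each face over N = 30 throws
print(one_hot, counts)
```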

Dirichlet distribution
The Dirichlet distribution is a continuous distribution over the points $\boldsymbol{\pi}$ on the simplex:
$$\mathrm{Dir}(\boldsymbol{\pi}|\boldsymbol{\alpha}) = \frac{\Gamma\!\left(\sum_c \alpha_c\right)}{\prod_c \Gamma(\alpha_c)} \prod_{c=1}^{C} \pi_c^{\alpha_c - 1}$$
It can be used to express our prior beliefs about the Categorical distribution parameter $\boldsymbol{\pi}$ (figure: Dirichlet densities on the simplex for C = 3).

Dirichlet as a conjugate prior
Using a Dirichlet prior for the Categorical parameter results in a Dirichlet posterior distribution:
$$p(\boldsymbol{\pi}|X) \propto \prod_c \pi_c^{N_c} \cdot \prod_c \pi_c^{\alpha_c-1} \;\Rightarrow\; p(\boldsymbol{\pi}|X) = \mathrm{Dir}(\boldsymbol{\pi}|\boldsymbol{\alpha} + \mathbf{N})$$
The per-event counts $N_c$ are the sufficient statistics. Dirichlet is the conjugate prior to the Categorical distribution, and its parameters $\alpha_c$ can be seen as prior counts for the individual events.
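
A minimal sketch of the Dirichlet-Categorical update; the symmetric Dir(1, ..., 1) prior and the observed counts are assumed example values.

```python
# Dirichlet-Categorical conjugate update: Dir(alpha) prior + event counts N_c
# -> Dir(alpha + N) posterior.
import numpy as np

alpha = np.ones(6)                        # symmetric Dirichlet prior pseudo-counts
counts = np.array([11, 8, 9, 12, 7, 13])  # observed counts of each face

posterior_alpha = alpha + counts          # Dirichlet posterior parameters
print(posterior_alpha / posterior_alpha.sum())  # posterior mean of pi
```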

Gaussian distribution (univariate)
$$\mathcal{N}(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
ML estimates of the parameters:
$$\hat\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \hat\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat\mu)^2$$
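
A quick numpy sketch of the ML estimates (sample mean and 1/N variance) on synthetic data; the data and parameter values are assumptions for illustration.

```python
# ML estimates for a univariate Gaussian: sample mean and biased (1/N) variance.
import numpy as np

x = np.random.default_rng(0).normal(loc=2.0, scale=1.5, size=1000)  # toy data

mu_ml = x.mean()
var_ml = ((x - mu_ml) ** 2).mean()   # same as np.var(x) (ddof=0)
print(mu_ml, var_ml)
```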

Gamma distribution
The Normal distribution can be expressed in terms of the precision $\lambda = 1/\sigma^2$. The Gamma distribution
$$\mathrm{Gam}(\lambda|a, b) = \frac{b^a}{\Gamma(a)}\, \lambda^{a-1} e^{-b\lambda}$$
can be used as a prior over the precision.

NormalGamma distribution
A joint distribution over $\mu$ and $\lambda$:
$$\mathrm{NormalGamma}(\mu, \lambda \,|\, \mu_0, \kappa_0, a_0, b_0) = \mathcal{N}\!\left(\mu \,\middle|\, \mu_0, (\kappa_0\lambda)^{-1}\right)\, \mathrm{Gam}(\lambda|a_0, b_0)$$
Note that $\mu$ and $\lambda$ are not independent.

NormalGamma distribution (cont.)
The NormalGamma distribution is the conjugate prior for the Gaussian distribution with unknown mean and precision. Given observations $X = \{x_1, \ldots, x_N\}$, the posterior distribution is again NormalGamma, defined in terms of the sufficient statistics $N$, $\sum_i x_i$ and $\sum_i x_i^2$.
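
The update below is a sketch following the standard conjugate analysis of the Gaussian with unknown mean and precision; the hyperparameter names (mu0, kappa0, a0, b0) are my notation and not necessarily the lecture's.

```python
# NormalGamma posterior update from the sufficient statistics of the data.
import numpy as np

def normal_gamma_posterior(x, mu0=0.0, kappa0=1.0, a0=1.0, b0=1.0):
    """Return the NormalGamma posterior hyperparameters given data x."""
    N = len(x)
    xbar = np.mean(x)
    ss = np.sum((x - xbar) ** 2)            # scatter around the sample mean
    kappa_n = kappa0 + N
    mu_n = (kappa0 * mu0 + N * xbar) / kappa_n
    a_n = a0 + N / 2.0
    b_n = b0 + 0.5 * ss + kappa0 * N * (xbar - mu0) ** 2 / (2.0 * kappa_n)
    return mu_n, kappa_n, a_n, b_n

x = np.random.default_rng(1).normal(loc=3.0, scale=2.0, size=500)
print(normal_gamma_posterior(x))
```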

Gaussian distribution (multivariate)
$$\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathsf T} \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$
ML estimates of the parameters:
$$\hat{\boldsymbol{\mu}} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i, \qquad \hat{\boldsymbol{\Sigma}} = \frac{1}{N}\sum_{i=1}^{N} (\mathbf{x}_i - \hat{\boldsymbol{\mu}})(\mathbf{x}_i - \hat{\boldsymbol{\mu}})^{\mathsf T}$$
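
Analogously to the univariate case, a short sketch of the multivariate ML estimates on synthetic data (all values assumed for illustration).

```python
# ML estimates for a multivariate Gaussian: sample mean vector and biased (1/N)
# sample covariance matrix.
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal(mean=[0.0, 1.0], cov=[[2.0, 0.5], [0.5, 1.0]], size=1000)

mu_ml = X.mean(axis=0)
Sigma_ml = (X - mu_ml).T @ (X - mu_ml) / len(X)   # same as np.cov(X.T, bias=True)
print(mu_ml, Sigma_ml)
```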

Gaussian distribution (multivariate, cont.)
The conjugate prior is the Normal-Wishart distribution
$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda}) = \mathcal{N}\!\left(\boldsymbol{\mu} \,\middle|\, \boldsymbol{\mu}_0, (\kappa_0 \boldsymbol{\Lambda})^{-1}\right)\, \mathcal{W}(\boldsymbol{\Lambda}|\mathbf{W}_0, \nu_0)$$
where $\mathcal{W}$ is the Wishart distribution and $\boldsymbol{\Lambda} = \boldsymbol{\Sigma}^{-1}$ is the precision matrix.

Exponential family
All the distributions described so far are distributions from the exponential family, which can be expressed in the following form:
$$p(\mathbf{x}|\boldsymbol{\eta}) = h(\mathbf{x})\, g(\boldsymbol{\eta}) \exp\!\left(\boldsymbol{\eta}^{\mathsf T} \mathbf{u}(\mathbf{x})\right)$$
For example, for the Gaussian distribution $\boldsymbol{\eta} = (\mu/\sigma^2,\ -1/(2\sigma^2))^{\mathsf T}$ and $\mathbf{u}(x) = (x,\ x^2)^{\mathsf T}$, with $h(x)$ and $g(\boldsymbol{\eta})$ collecting the normalization terms.
To evaluate the likelihood of a set of observations $X = \{x_1, \ldots, x_N\}$:
$$p(X|\boldsymbol{\eta}) = \left(\prod_{i} h(\mathbf{x}_i)\right) g(\boldsymbol{\eta})^N \exp\!\left(\boldsymbol{\eta}^{\mathsf T} \sum_{i} \mathbf{u}(\mathbf{x}_i)\right)$$

Exponential family (cont.)
For any distribution from the exponential family
$$p(\mathbf{x}|\boldsymbol{\eta}) = h(\mathbf{x})\, g(\boldsymbol{\eta}) \exp\!\left(\boldsymbol{\eta}^{\mathsf T} \mathbf{u}(\mathbf{x})\right),$$
the likelihood of the observed data can be evaluated using only the sufficient statistics $\sum_i \mathbf{u}(\mathbf{x}_i)$ and the number of observations $N$. A conjugate prior distribution over the parameter $\boldsymbol{\eta}$ exists in the form
$$p(\boldsymbol{\eta}|\boldsymbol{\chi}, \nu) = f(\boldsymbol{\chi}, \nu)\, g(\boldsymbol{\eta})^{\nu} \exp\!\left(\nu\, \boldsymbol{\eta}^{\mathsf T} \boldsymbol{\chi}\right).$$
The posterior distribution takes the same form as the conjugate prior, and we only need the prior parameters and the sufficient statistics to evaluate it:
$$p(\boldsymbol{\eta}|X) \propto g(\boldsymbol{\eta})^{\nu + N} \exp\!\left(\boldsymbol{\eta}^{\mathsf T}\! \left(\nu\boldsymbol{\chi} + \sum_i \mathbf{u}(\mathbf{x}_i)\right)\right).$$
Here $\boldsymbol{\chi}$ can be seen as a prior observation (of the sufficient statistic) and $\nu$ as a prior count of observations.
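
As a small sanity check (not from the slides), the Bernoulli distribution can be written in this exponential-family form with natural parameter eta = log(theta / (1 - theta)), u(x) = x, h(x) = 1 and g(eta) = 1 - theta:

```python
# Bernoulli in exponential-family form: p(x | eta) = g(eta) * exp(eta * x).
import numpy as np

theta = 0.75
eta = np.log(theta / (1 - theta))    # natural parameter (log-odds)
g = 1.0 / (1.0 + np.exp(eta))        # g(eta) = 1 - theta

for x in (0, 1):
    standard = theta ** x * (1 - theta) ** (1 - x)
    exp_family = g * np.exp(eta * x)
    print(x, standard, exp_family)   # both forms give the same probability
```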

Parameter estimation revisited
Let's estimate again the parameters $\boldsymbol{\eta}$ of a chosen distribution given some observed data $X$. Using Bayes' rule, we get the posterior distribution
$$p(\boldsymbol{\eta}|X) = \frac{p(X|\boldsymbol{\eta})\,p(\boldsymbol{\eta})}{p(X)}.$$
We can choose the most likely parameters, the maximum a-posteriori (MAP) estimate:
$$\hat{\boldsymbol{\eta}}_{\mathrm{MAP}} = \arg\max_{\boldsymbol{\eta}}\, p(X|\boldsymbol{\eta})\,p(\boldsymbol{\eta}).$$
Assuming a flat (constant) prior, we obtain the maximum likelihood (ML) estimate as a special case:
$$\hat{\boldsymbol{\eta}}_{\mathrm{ML}} = \arg\max_{\boldsymbol{\eta}}\, p(X|\boldsymbol{\eta}).$$
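
For the Beta-Bernoulli case this reduces to simple closed forms; the sketch below compares them, with the prior hyperparameters chosen arbitrarily for illustration.

```python
# ML vs. MAP estimate of the Bernoulli parameter theta under a Beta prior.
H, T = 750, 250
N = H + T
alpha, beta = 5.0, 5.0                  # assumed prior pseudo-counts

theta_ml = H / N                        # maximum likelihood estimate
theta_map = (H + alpha - 1) / (N + alpha + beta - 2)   # mode of the Beta posterior
print(theta_ml, theta_map)              # with alpha = beta = 1 (flat prior) they coincide
```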

Posterior predictive distribution
We do not need to obtain a point estimate of the parameters; it is always good to postpone making hard decisions. Instead, we can take into account the uncertainty encoded in the posterior distribution when evaluating the posterior predictive probability for a new data point (as we did in our coin flipping example):
$$p(x'|X) = \int p(x'|\boldsymbol{\eta})\, p(\boldsymbol{\eta}|X)\, d\boldsymbol{\eta}.$$
Rather than using the one most likely setting of the parameters, we average over their different settings that could possibly have generated the observed data; this approach is robust to overfitting.

Posterior predictive for Bernoulli
A Beta prior on the parameter of the Bernoulli distribution leads to a Beta posterior. The posterior predictive distribution is again Bernoulli:
$$p(x'=1|X) = \int_0^1 \theta\, \mathrm{Beta}(\theta|\alpha+H,\, \beta+T)\, d\theta = \frac{\alpha+H}{\alpha+\beta+N}.$$
In our coin flipping example, $p(x'=1|X) = \frac{1+750}{2+1000} = \frac{751}{1002} \approx 0.75$.

Posterior predictive for Categorical
A Dirichlet prior on the parameters of the Categorical distribution leads to a Dirichlet posterior. The posterior predictive distribution is again Categorical:
$$p(x'=c|X) = \frac{\alpha_c + N_c}{\sum_{k}(\alpha_k + N_k)}.$$
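
A one-line sketch of this predictive rule in numpy, reusing the assumed prior and counts from the Dirichlet example above.

```python
# Posterior predictive of the next event under a Dirichlet-Categorical model.
import numpy as np

alpha = np.ones(6)                        # Dirichlet prior pseudo-counts
counts = np.array([11, 8, 9, 12, 7, 13])  # observed counts of each face

predictive = (alpha + counts) / (alpha + counts).sum()
print(predictive)                         # probability of each face on the next throw
```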

Student's t-distribution
A NormalGamma prior on the parameters of the Gaussian distribution leads to a NormalGamma posterior. The posterior predictive distribution
$$p(x'|X) = \iint \mathcal{N}(x'|\mu, \lambda^{-1})\, p(\mu, \lambda|X)\, d\mu\, d\lambda$$
is Student's t-distribution, with location, scale and degrees of freedom given by the NormalGamma posterior parameters.

Student's t-distribution (cont.)
The Gaussian distribution is a special case of Student's t-distribution with the degrees of freedom $\nu \to \infty$. For the posterior predictive, the degrees of freedom grow with the number of observations, so with more and more data the predictive distribution approaches a Gaussian.
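
Putting the last two slides together, here is a sketch of the predictive Student's t built from NormalGamma posterior hyperparameters; the parameter mapping (df = 2 a_n, scale^2 = b_n (kappa_n + 1) / (a_n kappa_n)) follows the standard conjugate analysis, and the numeric values are illustrative.

```python
# Posterior predictive Student's t from NormalGamma posterior hyperparameters.
import numpy as np
from scipy.stats import t as student_t

mu_n, kappa_n, a_n, b_n = 3.0, 501.0, 251.0, 1000.0   # example posterior values

df = 2.0 * a_n
scale = np.sqrt(b_n * (kappa_n + 1.0) / (a_n * kappa_n))
predictive = student_t(df, loc=mu_n, scale=scale)
print(predictive.mean(), predictive.std())            # with large df this is nearly Gaussian
```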