Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

HT2015: SC4 Statistical Data Mining and Machine Learning
Dino Sejdinovic, Department of Statistics, Oxford
http://www.stats.ox.ac.uk/~sejdinov/sdmml.html

Maximum Likelihood Principle

A generative model for training data D = {(x_i, y_i)}_{i=1}^n given a parameter vector θ:

    y_i ~ Discrete(π_1, ..., π_K),    x_i | y_i ~ g_{y_i}(x) = p(x | θ_{y_i})

The k-th class conditional density is assumed to have a parametric form g_k(x) = p(x | θ_k), and all parameters are collected in θ = (π_1, ..., π_K; θ_1, ..., θ_K).

The generative process defines the likelihood function: the joint distribution p(D | θ) of all the observed data given a parameter vector θ. Generative learning consists of computing the MLE θ̂ of θ based on D:

    θ̂ = argmax_{θ ∈ Θ} p(D | θ)

We then use a plug-in approach to perform classification:

    f_θ̂(x) = argmax_{k ∈ {1,...,K}} P_θ̂(Y = k | X = x) = argmax_{k ∈ {1,...,K}} π̂_k p(x | θ̂_k) / Σ_{j=1}^K π̂_j p(x | θ̂_j)

The Bayesian Learning Framework

Being Bayesian: treat the parameter vector θ as a random variable; the process of learning is then the computation of the posterior distribution p(θ | D). In addition to the likelihood p(D | θ) we need to specify a prior distribution p(θ). The posterior distribution is then given by Bayes' theorem:

    p(θ | D) = p(D | θ) p(θ) / p(D)

Likelihood: p(D | θ). Prior: p(θ). Posterior: p(θ | D). Marginal likelihood: p(D) = ∫_Θ p(D | θ) p(θ) dθ.

Summarizing the posterior:
Posterior mode: θ_MAP = argmax_{θ ∈ Θ} p(θ | D) (maximum a posteriori).
Posterior mean: θ_mean = E[θ | D].
Posterior variance: Var[θ | D].
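To make the plug-in recipe concrete, here is a minimal sketch (not from the slides) of MLE fitting and plug-in classification for a toy model with one-dimensional Gaussian class-conditional densities; the synthetic data, function names, and parameter choices are illustrative assumptions.

```python
# Minimal sketch of generative learning by MLE plus plug-in classification,
# assuming Gaussian class-conditional densities g_k(x) = N(x; mu_k, sigma_k^2).
import numpy as np
from scipy.stats import norm

def fit_mle(x, y, K):
    """MLE of theta = (pi_1..pi_K; mu_1..mu_K; sigma_1..sigma_K) from D = {(x_i, y_i)}."""
    pi = np.array([np.mean(y == k) for k in range(K)])       # class proportions
    mu = np.array([x[y == k].mean() for k in range(K)])      # class means
    sigma = np.array([x[y == k].std() for k in range(K)])    # class standard deviations
    return pi, mu, sigma

def plug_in_classify(x_new, pi, mu, sigma):
    """f_theta_hat(x) = argmax_k pi_k p(x | theta_k); the denominator is common to all k."""
    scores = pi * norm.pdf(x_new[:, None], loc=mu, scale=sigma)
    return scores.argmax(axis=1)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)                             # hypothetical labels
x = rng.normal(loc=np.where(y == 0, -1.0, 1.5), scale=1.0)   # hypothetical 1-d features
pi, mu, sigma = fit_mle(x, y, K=2)
print(plug_in_classify(np.array([-2.0, 0.2, 2.0]), pi, mu, sigma))
```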

A Simple Example: Coin Tosses

We have a coin with probability θ of coming up heads. Model the coin tosses as i.i.d. Bernoulli, 1 = head, 0 = tail. Estimate θ given a dataset D = {x_i}_{i=1}^n of tosses. The likelihood is

    p(D | θ) = θ^{n_1} (1 - θ)^{n_0},    with n_j = Σ_{i=1}^n 1(x_i = j).

Maximum likelihood estimate: θ̂_ML = n_1 / n.

Bayesian approach: treat the unknown parameter θ as a random variable. Simple prior: θ ~ Uniform[0, 1], i.e. p(θ) = 1 for θ ∈ [0, 1]. The posterior distribution is

    p(θ | D) = p(D | θ) p(θ) / Z = θ^{n_1} (1 - θ)^{n_0} / Z,    Z = ∫_0^1 θ^{n_1} (1 - θ)^{n_0} dθ = n_1! n_0! / (n + 1)!

The posterior is a Beta(n_1 + 1, n_0 + 1) distribution, with mean θ_mean = (n_1 + 1) / (n + 2).

[Figure: posterior densities p(θ | D) for datasets of increasing size n.] The posterior behaves like the ML estimate as the dataset grows, and becomes peaked at the true value θ = 0.7.

All Bayesian reasoning is based on the posterior distribution.
Posterior mode: θ_MAP = n_1 / n.
Posterior mean: θ_mean = (n_1 + 1) / (n + 2).
Posterior variance: Var[θ | D] = θ_mean (1 - θ_mean) / (n + 3).
(1 - α)-credible regions: (l, r) ⊂ [0, 1] such that ∫_l^r p(θ | D) dθ = 1 - α.

Consistency: assuming that the true parameter value is given non-zero density under the prior, the posterior distribution concentrates around the true value as n → ∞. Rate of convergence?

The posterior predictive distribution is the conditional distribution of x_{n+1} given D = {x_i}_{i=1}^n:

    p(x_{n+1} | D) = ∫_0^1 p(x_{n+1}, θ | D) dθ = ∫_0^1 p(x_{n+1} | θ) p(θ | D) dθ = (θ_mean)^{x_{n+1}} (1 - θ_mean)^{1 - x_{n+1}}

We predict on new data by averaging the predictive distribution over the posterior. This accounts for uncertainty about θ.
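As a numerical companion to this example, the sketch below computes the ML estimate, the Beta(n_1 + 1, n_0 + 1) posterior and its summaries, and the posterior predictive; the counts n1 = 7, n0 = 3 are made up for illustration.

```python
# Coin-toss example under a Uniform[0,1] prior: posterior is Beta(n1 + 1, n0 + 1).
from scipy.stats import beta

n1, n0 = 7, 3                      # heads, tails (hypothetical data)
n = n1 + n0
posterior = beta(n1 + 1, n0 + 1)

theta_ml = n1 / n                                  # maximum likelihood estimate
theta_map = n1 / n                                 # posterior mode under the uniform prior
theta_mean = (n1 + 1) / (n + 2)                    # posterior mean
var = theta_mean * (1 - theta_mean) / (n + 3)      # posterior variance
l, r = posterior.ppf(0.025), posterior.ppf(0.975)  # central 95% credible interval

print(theta_ml, theta_map, theta_mean, var, (l, r))
# Posterior predictive for the next toss: p(x_{n+1} = 1 | D) = theta_mean.
print("P(next toss is heads | D) =", theta_mean)
```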

Beta Distributions

In this example the posterior distribution has a known analytic form and is in the same Beta family as the prior: Uniform[0, 1] = Beta(1, 1). This is an example of a conjugate prior.

A Beta distribution Beta(a, b) with parameters a, b > 0 is an exponential family distribution with density

    p(θ | a, b) = Γ(a + b) / (Γ(a) Γ(b)) θ^{a-1} (1 - θ)^{b-1}

where Γ(t) = ∫_0^∞ u^{t-1} e^{-u} du is the gamma function.

If the prior is Beta(a, b), then the posterior distribution is

    p(θ | D, a, b) ∝ θ^{a + n_1 - 1} (1 - θ)^{b + n_0 - 1},

so it is Beta(a + n_1, b + n_0). The hyperparameters a and b are pseudo-counts, an imaginary initial sample that reflects our prior beliefs about θ.

[Figure: Beta(a, b) densities for several symmetric and asymmetric choices of (a, b).]

Bayesian Inference on the Categorical Distribution: Dirichlet Distributions

Suppose we observe D = {y_i}_{i=1}^n with y_i ∈ {1, ..., K}, and model them as i.i.d. with pmf π = (π_1, ..., π_K):

    p(D | π) = Π_{i=1}^n π_{y_i} = Π_{k=1}^K π_k^{n_k}

with n_k = Σ_{i=1}^n 1(y_i = k) and π_k > 0, Σ_{k=1}^K π_k = 1.

The conjugate prior on π is the Dirichlet distribution Dir(α_1, ..., α_K) with parameters α_k > 0 and density

    p(π) = Γ(Σ_{k=1}^K α_k) / Π_{k=1}^K Γ(α_k) · Π_{k=1}^K π_k^{α_k - 1}

on the probability simplex {π : π_k > 0, Σ_{k=1}^K π_k = 1}.

The posterior is also Dirichlet: Dir(α_1 + n_1, ..., α_K + n_K). The posterior mean is

    π_k^{mean} = (α_k + n_k) / Σ_{j=1}^K (α_j + n_j).

[Figure: (A) the support of the Dirichlet density for K = 3 (the probability simplex); (B) and (C) Dirichlet densities for two different settings of α_k.]
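The conjugate updates on this page are one-liners in code. The sketch below performs the Beta-Bernoulli and Dirichlet-categorical updates; all pseudo-counts and data counts are illustrative assumptions.

```python
# Conjugate updates: Beta prior + Bernoulli counts, Dirichlet prior + categorical counts.
import numpy as np
from scipy.stats import beta, dirichlet

# Beta-Bernoulli: prior Beta(a, b), counts (n1, n0) -> posterior Beta(a + n1, b + n0).
a, b = 2.0, 2.0                      # pseudo-counts (prior beliefs about theta)
n1, n0 = 7, 3                        # hypothetical heads/tails counts
post = beta(a + n1, b + n0)
print("posterior mean of theta:", post.mean())   # (a + n1) / (a + b + n1 + n0)

# Dirichlet-categorical: prior Dir(alpha), counts n_k -> posterior Dir(alpha + n).
alpha = np.ones(3)                   # K = 3, symmetric prior on the simplex
counts = np.array([5, 2, 13])        # hypothetical n_k = number of observations in class k
post_alpha = alpha + counts
print("posterior mean of pi:", post_alpha / post_alpha.sum())
print("a posterior sample of pi:", dirichlet(post_alpha).rvs(random_state=0))
```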

Naïve Bayes

Return to the spam classification example with two-class naïve Bayes:

    p(x_i | θ_k) = Π_{j=1}^p θ_kj^{x_i^{(j)}} (1 - θ_kj)^{1 - x_i^{(j)}}.

Set n_k = Σ_{i=1}^n 1(y_i = k) and n_kj = Σ_{i=1}^n 1(y_i = k, x_i^{(j)} = 1). The MLE is

    π̂_k = n_k / n,    θ̂_kj = Σ_{i: y_i = k} x_i^{(j)} / n_k = n_kj / n_k.

One problem: if the l-th word did not appear in any document labelled as class k, then θ̂_kl = 0 and

    P(Y = k | X = x with l-th entry equal to 1) ∝ π̂_k Π_{j=1}^p θ̂_kj^{x^{(j)}} (1 - θ̂_kj)^{1 - x^{(j)}} = 0,

i.e. we will never attribute a new document containing word l to class k, regardless of the other words in it.

Bayesian Inference on the Naïve Bayes Model

Under the naïve Bayes model, the joint distribution of labels y_i ∈ {1, ..., K} and data vectors x_i ∈ {0, 1}^p is

    Π_{i=1}^n p(x_i, y_i) = Π_{i=1}^n Π_{k=1}^K ( π_k Π_{j=1}^p θ_kj^{x_i^{(j)}} (1 - θ_kj)^{1 - x_i^{(j)}} )^{1(y_i = k)}
                          = Π_{k=1}^K π_k^{n_k} Π_{j=1}^p θ_kj^{n_kj} (1 - θ_kj)^{n_k - n_kj},

where n_k = Σ_{i=1}^n 1(y_i = k) and n_kj = Σ_{i=1}^n 1(y_i = k, x_i^{(j)} = 1). For a conjugate prior we can use Dir((α_k)_{k=1}^K) for π and Beta(a, b) for each θ_kj independently. Because the likelihood factorizes, the posterior distribution over π and (θ_kj) also factorizes: the posterior for π is Dir((α_k + n_k)_{k=1}^K), and the posterior for θ_kj is Beta(a + n_kj, b + n_k - n_kj).

Given D = {(x_i, y_i)}_{i=1}^n, we want to predict a label ỹ for a new document x̃. We can calculate

    p(ỹ = k | x̃, D) = p(ỹ = k | D) p(x̃ | ỹ = k, D) / p(x̃ | D)

with

    p(ỹ = k | D) = (α_k + n_k) / (Σ_{l=1}^K α_l + n),    p(x̃^{(j)} = 1 | ỹ = k, D) = (a + n_kj) / (a + b + n_k).

The predicted class is the one maximizing p(ỹ = k | x̃, D). Compared to the ML plug-in estimator, the pseudo-counts help to regularize the probabilities away from extreme values.

Bayesian Inference and Regularization

Consider a Bayesian approach to logistic regression: introduce a multivariate normal prior for the weight vector w ∈ R^p, and a uniform (improper) prior for the offset b ∈ R. The prior density is

    p(b, w) ∝ (2πσ²)^{-p/2} exp( -‖w‖²_2 / (2σ²) ).

The posterior is

    p(b, w | D) ∝ exp( -‖w‖²_2 / (2σ²) - Σ_{i=1}^n log(1 + exp(-y_i (b + wᵀ x_i))) ).

The posterior mode is therefore equivalent to minimizing the L2-regularized empirical risk. Regularized empirical risk minimization is (often) equivalent to placing a prior on the parameters and finding the MAP estimate: L2 regularization corresponds to a multivariate normal prior, and L1 regularization to a multivariate Laplace prior. From a Bayesian perspective, the MAP parameters are just one way to summarize the posterior distribution.
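A compact sketch of the posterior-predictive naïve Bayes classifier described above, using the pseudo-count formulas for p(ỹ = k | D) and p(x̃^{(j)} = 1 | ỹ = k, D); the tiny binary bag-of-words matrix and the helper names fit_bayes_nb and predict are made up for illustration.

```python
# Posterior-predictive naive Bayes with Dirichlet/Beta pseudo-counts.
import numpy as np

def fit_bayes_nb(X, y, K, alpha=1.0, a=1.0, b=1.0):
    """Posterior parameters: Dir(alpha + n_k) for pi, Beta(a + n_kj, b + n_k - n_kj) for theta_kj."""
    n, p = X.shape
    n_k = np.array([np.sum(y == k) for k in range(K)])          # documents per class
    n_kj = np.array([X[y == k].sum(axis=0) for k in range(K)])  # word occurrences per class
    class_prob = (alpha + n_k) / (K * alpha + n)                # p(y~ = k | D)
    word_prob = (a + n_kj) / (a + b + n_k[:, None])             # p(x~^(j) = 1 | y~ = k, D)
    return class_prob, word_prob

def predict(X_new, class_prob, word_prob):
    """argmax_k p(y~ = k | D) * prod_j p(x~^(j) | y~ = k, D), computed in log-space."""
    log_lik = X_new @ np.log(word_prob.T) + (1 - X_new) @ np.log(1 - word_prob.T)
    return np.argmax(np.log(class_prob) + log_lik, axis=1)

X = np.array([[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1]])  # toy documents
y = np.array([0, 0, 1, 1])
class_prob, word_prob = fit_bayes_nb(X, y, K=2)
print(predict(np.array([[1, 1, 0, 0], [0, 0, 0, 1]]), class_prob, word_prob))
```

Because of the pseudo-counts, no predictive word probability is ever exactly 0 or 1, so a single unseen word can no longer veto a class.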

Bayesian Model Selection

A model M with a given set of parameters θ_M consists of both the likelihood p(D | θ_M, M) and the prior distribution p(θ_M | M). One example model would consist of all Gaussian mixtures with K components and equal covariances (LDA): θ_LDA = (π_1, ..., π_K; µ_1, ..., µ_K; Σ), along with a prior on θ; another would allow different covariances (QDA): θ_QDA = (π_1, ..., π_K; µ_1, ..., µ_K; Σ_1, ..., Σ_K).

The posterior distribution is

    p(θ_M | D, M) = p(D | θ_M, M) p(θ_M | M) / p(D | M)

The marginal probability of the data under M (the Bayesian model evidence) is

    p(D | M) = ∫_Θ p(D | θ_M, M) p(θ_M | M) dθ_M

Bayesian Occam's Razor

Occam's razor: of two explanations adequate to explain the same set of observations, the simpler should be preferred.

The model evidence p(D | M) is the probability that a set of randomly selected parameter values inside the model would generate the dataset D. Models that are too simple are unlikely to generate the observed dataset. Models that are too complex can generate many possible datasets, so again they are unlikely to generate that particular dataset at random. Compare models using their Bayes factor p(D | M_1) / p(D | M_2).

[Figure: Bayesian model comparison, Occam's razor at work: fits of models of increasing order M = 0, ..., 7 to the same data, together with the model evidence P(Y | M) as a function of M.]

Discussion

Use probability distributions to reason about uncertainties of parameters (latent variables and parameters are treated in the same way).
The model consists of the likelihood function and the prior distribution on parameters: this allows us to integrate prior beliefs and domain knowledge.
Bayesian computation: most posteriors are intractable and need to be approximated, e.g. by Monte Carlo methods (MCMC and SMC) or variational methods (variational Bayes, belief propagation, etc.).
The prior usually has hyperparameters, i.e. p(θ) = p(θ | ψ). How do we choose ψ?
  Be Bayesian about ψ as well: choose a hyperprior p(ψ) and compute p(ψ | D) ∝ p(D | ψ) p(ψ).
  Maximum Likelihood II: find ψ maximizing p(D | ψ) = ∫ p(D | θ) p(θ | ψ) dθ, i.e. ψ̂ = argmax_{ψ ∈ Ψ} p(D | ψ).

(figures by M. Sahani)
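As an aside on computing model evidence, here is a minimal sketch (not one of the models on this page) comparing a fixed fair-coin model against the Beta-Bernoulli model from the earlier example via their marginal likelihoods; the counts are illustrative assumptions.

```python
# Bayesian model comparison on coin data.
# M0: fair coin, theta = 0.5 fixed.  M1: theta ~ Uniform[0,1] (Beta-Bernoulli).
# Evidence: p(D | M0) = 0.5^n,  p(D | M1) = B(n1 + 1, n0 + 1) = n1! n0! / (n + 1)!.
import numpy as np
from scipy.special import betaln

n1, n0 = 7, 3                               # hypothetical heads/tails counts
n = n1 + n0

log_ev_m0 = n * np.log(0.5)                 # log p(D | M0)
log_ev_m1 = betaln(n1 + 1, n0 + 1)          # log p(D | M1), via the log Beta function
log_bayes_factor = log_ev_m1 - log_ev_m0

print("log Bayes factor (M1 vs M0):", log_bayes_factor)
# Positive values favour the more flexible model M1; for counts close to n/2 the
# simpler model M0 tends to win, illustrating the Occam's razor effect.
```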

Further Reading

Videolectures by Zoubin Ghahramani: Bayesian learning and graphical models.
Gelman et al., Bayesian Data Analysis.
Kevin Murphy, Machine Learning: A Probabilistic Perspective.
E. T. Jaynes, Probability Theory: The Logic of Science.