CS540 Machine learning L8

Announcements: Linear algebra tutorial by Mark Schmidt, 5:30 to 6:30 pm today, in the CS X-wing 8th floor lounge (X836). Move midterm from Tue Oct 14 to Thu Oct 16? Hw3sol handed out today. Change in order.

Last time: multivariate Gaussians, eigenanalysis, MLE, and their use in generative classifiers.

This time: naïve Bayes classifiers, and Bayesian parameter estimation I: the Beta-Binomial model.

Bayes rule for classifiers. The class posterior is

$p(y=c|x) = \frac{p(x|y=c)\,p(y=c)}{\sum_{c'} p(x|y=c')\,p(y=c')}$

where $p(x|y=c)$ is the class-conditional density, $p(y=c)$ is the class prior, and the denominator is the normalization constant.

Class prior. Let $(Y_1,\dots,Y_C) \sim \mathrm{Mult}(\pi, 1)$ be the class prior:

$P(y_1,\dots,y_C|\pi) = \prod_{c=1}^C \pi_c^{I(y_c=1)}, \qquad \sum_{c=1}^C \pi_c = 1$

Since $\sum_c Y_c = 1$, only one bit can be on; this is called a 1-of-C encoding, so we can write $Y=c$ instead:

$P(y|\pi) = \prod_{c=1}^C \pi_c^{I(y=c)} = \pi_y$

e.g., p(man)=0.7, p(woman)=0.1, p(child)=0.2, and $Y=2$ corresponds to $(Y_1,Y_2,Y_3)=(0,1,0)$.

Correlated features Height and weight are not independent

Fitting the model. Fit each class-conditional density separately:

$\mu_c = \frac{1}{n_c}\sum_{i=1}^n I(y_i=c)\,x_i = \frac{1}{n_c}\sum_{i:\,y_i=c} x_i$

$\Sigma_c = \frac{1}{n_c}\sum_{i=1}^n I(y_i=c)(x_i-\mu_c)(x_i-\mu_c)^T$

$n_c = \sum_{i=1}^n I(y_i=c), \qquad \pi_c = \frac{n_c}{n}$
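As a rough sketch of how these MLEs might be computed (not from the original slides; X, y, and all variable names are illustrative, with X an n-by-d data matrix and y an n-by-1 vector of labels in 1..C):

    [n, d] = size(X);
    C = max(y);
    mu = zeros(C, d); Sigma = zeros(d, d, C); pic = zeros(C, 1);
    for c = 1:C
        Xc = X(y == c, :);                     % rows of X belonging to class c
        nc = size(Xc, 1);
        mu(c, :) = mean(Xc, 1);                % class mean mu_c
        Xc0 = Xc - repmat(mu(c, :), nc, 1);    % centered data
        Sigma(:, :, c) = (Xc0' * Xc0) / nc;    % MLE covariance Sigma_c
        pic(c) = nc / n;                       % class prior pi_c
    end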

Ignoring the correlation. If $x_j \in \mathbb{R}$, we can use a product of 1d Gaussians, $X_j \mid y=c \sim \mathcal{N}(\mu_{jc}, \sigma_{jc}^2)$:

$p(x|y=c) = \prod_{j=1}^d \frac{1}{\sqrt{2\pi\sigma_{jc}^2}} \exp\!\left(-\frac{1}{2\sigma_{jc}^2}(x_j-\mu_{jc})^2\right)$

This is equivalent to using a diagonal covariance matrix $\Sigma_c = \mathrm{diag}(\sigma_{1c}^2,\dots,\sigma_{dc}^2)$.

Document classification. Let $Y \in \{1,\dots,C\}$ be the class label and $x \in \{0,1\}^d$, e.g., $Y \in$ {spam, urgent, normal} and $x_i = I(\text{word } i \text{ is present in the message})$. This is the bag of words model. Example with vocabulary Words = {1: john, 2: mary, 3: sex, 4: money, 5: send, 6: meeting, 7: unk}:

Message: "John sent money to Mary after the meeting about money"
Stop word removal: john sent money mary after meeting about money
Tokenization: 1 7 4 2 7 6 7 4
Word counting: [1, 1, 0, 2, 0, 1]
Thresholding (binarization): [1, 1, 0, 1, 0, 1]
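An illustrative Matlab sketch of this preprocessing (the token list and variable names are assumptions, not from the slides):

    vocab  = {'john','mary','sex','money','send','meeting','unk'};
    tokens = {'john','unk','money','mary','unk','meeting','unk','money'};  % after stop-word removal; unknown words mapped to 'unk'
    counts = zeros(1, numel(vocab));
    for t = 1:numel(tokens)
        j = find(strcmp(tokens{t}, vocab));    % index of this token in the vocabulary
        counts(j) = counts(j) + 1;
    end
    x = double(counts > 0);                    % binarize: is word j present at all?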

Binary features (multivariate Bernoulli). Let $X_j \mid y=c \sim \mathrm{Ber}(\mu_{jc})$, so $p(x_j=1|y=c) = \mu_{jc}$ and

$p(x|y=c,\mu) = \prod_{j=1}^d \mu_{jc}^{I(x_j=1)}(1-\mu_{jc})^{I(x_j=0)}$

Fitting the model. The MLE is

$\mu_{jc} = \frac{n_{jc}}{n_c}, \qquad n_{jc} = \sum_{i=1}^n I(y_i=c)\,I(x_{ij}=1) = \sum_{i=1}^n I(y_i=c, x_{ij}=1)$
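A minimal Matlab sketch of these counts (illustrative names; X is an n-by-d binary document-word matrix and y holds labels in 1..C):

    [n, d] = size(X);
    C = max(y);
    mu = zeros(C, d); pic = zeros(C, 1);
    for c = 1:C
        idx = (y == c);
        nc = sum(idx);
        njc = sum(X(idx, :), 1);      % n_jc: number of class-c documents containing word j
        mu(c, :) = njc / nc;          % MLE: mu_jc = n_jc / n_c
        pic(c) = nc / n;              % pi_c = n_c / n
    end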

Bayes rule. The class posterior is

$p(y=c|x) = \frac{p(y=c)\,p(x|y=c)}{p(x)} = \frac{\pi_c \prod_{i=1}^d \theta_{ic}^{I(x_i=1)}(1-\theta_{ic})^{I(x_i=0)}}{p(x)}$

Since the numerator and denominator are very small numbers, use logs to avoid underflow:

$\log p(y=c|x) = \log\pi_c + \sum_{i=1}^d \left[I(x_i=1)\log\theta_{ic} + I(x_i=0)\log(1-\theta_{ic})\right] - \log p(x)$

How do we compute the normalization constant?

$\log p(x) = \log\Big[\sum_c p(y=c,x)\Big] = \log\Big[\sum_c \pi_c f_c\Big], \qquad f_c = p(x|y=c)$

Log-sum-exp trick. Define $b_c = \log\pi_c + \log f_c$, so $\log p(x) = \log\sum_c e^{b_c}$. Then

$\log p(x) = \log\Big[\Big(\sum_c e^{b_c - B}\Big)e^{B}\Big] = \log\Big(\sum_c e^{b_c - B}\Big) + B, \qquad B = \max_c b_c$

e.g., $\log(e^{-120} + e^{-121}) = \log\big(e^{-120}(e^0 + e^{-1})\big) = \log(e^0 + e^{-1}) - 120$

In Matlab, use Minka's function S = logsumexp(b):

logjoint = log(prior) + counts * log(theta) + (1 - counts) * log(1 - theta);
logpost = logjoint - logsumexp(logjoint);
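A self-contained Matlab sketch of the trick, written inline rather than calling Minka's logsumexp; instead of the slide's -120/-121 it uses exponents extreme enough to actually underflow in double precision, and all values are illustrative:

    b = [-1200, -1201];                  % b_c = log(pi_c) + log(f_c), deliberately tiny
    B = max(b);
    logZ = log(sum(exp(b - B))) + B;     % stable: approximately -1199.69
    naive = log(sum(exp(b)));            % exp(-1200) underflows to 0, so this gives -Inf
    logpost = b - logZ;                  % normalized log class posterior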

Missing features. Suppose the value of $x_1$ is unknown. We can simply drop the term $p(x_1|y=c)$:

$p(y=c|x_{2:d}) \propto p(y=c, x_{2:d}) = \sum_{x_1} p(y=c, x_1, x_{2:d}) = \sum_{x_1} p(y=c)\prod_{j=1}^d p(x_j|y=c) = p(y=c)\Big[\sum_{x_1} p(x_1|y=c)\Big]\prod_{j=2}^d p(x_j|y=c) = p(y=c)\prod_{j=2}^d p(x_j|y=c)$

This is a big advantage of generative classifiers over discriminative classifiers.

Form of the class posterior. We can derive an analytic expression for $p(y=c|x)$ that will be useful later:

$p(y=c|x,\theta,\pi) = \frac{p(x|y=c)\,p(y=c)}{\sum_{c'} p(x|y=c')\,p(y=c')} = \frac{\exp[\log p(x|y=c) + \log p(y=c)]}{\sum_{c'} \exp[\log p(x|y=c') + \log p(y=c')]} = \frac{\exp\big[\log\pi_c + \sum_i I(x_i=1)\log\theta_{ic} + I(x_i=0)\log(1-\theta_{ic})\big]}{\sum_{c'} \exp\big[\log\pi_{c'} + \sum_i I(x_i=1)\log\theta_{ic'} + I(x_i=0)\log(1-\theta_{ic'})\big]}$

Form of the class posterior. From the previous slide,

$p(y=c|x,\theta,\pi) \propto \exp\Big[\log\pi_c + \sum_i I(x_i=1)\log\theta_{ic} + I(x_i=0)\log(1-\theta_{ic})\Big]$

Define

$x' = [1, I(x_1=1), I(x_1=0), \dots, I(x_d=1), I(x_d=0)]$
$\beta_c = [\log\pi_c, \log\theta_{1c}, \log(1-\theta_{1c}), \dots, \log\theta_{dc}, \log(1-\theta_{dc})]$

Then the posterior is given by the softmax function

$p(y=c|x,\beta) = \frac{\exp[\beta_c^T x']}{\sum_{c'} \exp[\beta_{c'}^T x']}$

Discriminative vs generative. Discriminative: model $p(y|x,\theta)$. Generative: model $p(y,x|\theta)$.

Logistic regression vs naïve Bayes.

                                        Discriminative (LR)   Generative (NB)
  Easy to fit?                          No                    Yes
  Can handle basis function expansion?  Yes                   No
  Fit classes separately?               No                    Yes
  Handles missing data?                 No                    Yes
  Best for                              Large sample size     Small sample size

Sparse data problem. Consider naïve Bayes for binary features, with word counts in the training data:

              Spam   Ham
  limited      1      2
  time        10      9
  offer        0      0
  Total       N_s    N_h

Now classify X = "you will receive our limited time offer if you send us $1M today":

$p(x|y=s) \propto \Big(\frac{1}{N_s}\Big)\Big(\frac{10}{N_s}\Big)\Big(\frac{0}{N_s}\Big)\cdots = 0$

The MLE overfits the data: a single zero count drives the whole product to zero.

Outline Bayes: what/why? Bernoulli

Fundamental principle of Bayesian statistics. In Bayesian statistics, everything that is uncertain (e.g., θ) is modeled with a probability distribution. We incorporate everything that is known (e.g., D) by conditioning on it, using Bayes rule to update our prior beliefs into posterior beliefs:

$p(h|d) = \frac{p(d|h)\,p(h)}{\sum_{h'\in H} p(d|h')\,p(h')}$

where $p(h)$ is the prior probability, $p(d|h)$ is the likelihood, and $p(h|d)$ is the posterior probability. Bayesian inference = inverse probability theory.

In praise of Bayes Bayesian methods are conceptually simple and elegant, and can handle small sample sizes (e.g., one-shot learning) and complex hierarchical models without overfitting. They provide a single mechanism for answering all questions of interest; there is no need to choose between different estimators, hypothesis testing procedures, etc. They avoid various pathologies associated with orthodox statistics. They often enjoy good frequentist properties.

Why isn t everyone a Bayesian? The need for a prior. Computational issues.

The need for a prior Bayes rule requires a prior, which is considered subjective. However, we know learning without assumptions is impossible (no free lunch theorem). Often we actually have informative prior knowledge. If not, it is possible to create relatively uninformative priors to represent prior ignorance. We can also estimate our priors from data (empirical Bayes). We can use posterior predictive checks to test goodness of fit of both prior and likelihood.

Computational issues. Computing the normalization constant requires integrating over all the parameters:

$p(\theta|D) = \frac{p(\theta)\,p(D|\theta)}{\int p(\theta')\,p(D|\theta')\,d\theta'}$

Computing posterior expectations also requires integrating over all the parameters:

$E[f(\Theta)] = \int f(\theta)\,p(\theta|D)\,d\theta$

Approximate inference. We can evaluate posterior expectations using Monte Carlo integration:

$E[f(\Theta)] = \int f(\theta)\,p(\theta|D)\,d\theta \approx \frac{1}{N}\sum_{s=1}^N f(\theta^s), \qquad \theta^s \sim p(\theta|D)$

Generating posterior samples can be tricky: importance sampling, particle filtering, Markov chain Monte Carlo (MCMC). There are also deterministic approximation methods: Laplace, variational Bayes, expectation propagation.
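A minimal Matlab sketch of one of the methods listed above, self-normalized importance sampling with a uniform proposal, applied to the kind of Beta posterior used later in the lecture (the proposal, sample size, and counts are illustrative assumptions):

    a1 = 1; a0 = 1; N1 = 0; N0 = 3;                 % Beta(1,1) prior; data: 0 heads, 3 tails
    S = 100000;
    theta = rand(S, 1);                             % proposal: theta ~ Uniform(0,1)
    w = theta.^(a1+N1-1) .* (1-theta).^(a0+N0-1);   % unnormalized posterior density at each sample
    w = w / sum(w);                                 % self-normalized importance weights
    Etheta = sum(w .* theta);                       % approx E[theta|D] = (a1+N1)/(a1+a0+N1+N0) = 0.2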

Conjugate priors. For simplicity, we will mostly focus on a special kind of prior which has nice mathematical properties. A prior p(θ) is said to be conjugate to a likelihood p(D|θ) if the corresponding posterior p(θ|D) has the same functional form as p(θ). This means the prior family is closed under Bayesian updating, so we can recursively apply the rule to update our beliefs as data streams in (online learning). A natural conjugate prior means p(θ) has the same functional form as p(D|θ).

Example: coin tossing. Consider the problem of estimating the probability of heads θ from a sequence of N coin tosses, D = (X_1, ..., X_N). First we define the likelihood function, then the prior, then compute the posterior. We will also consider different ways to predict the future. The MLE is $\hat\theta = N_1/N$, which suffers from the sparse data problem.

Black swan paradox. Suppose we have seen N=3 white swans. What is the probability that swan X_{N+1} is black? If we plug in the MLE, we predict that black swans are impossible, since N_b = N_1 = 0 and N_w = N_0 = 3:

$\hat\theta_{MLE} = \frac{N_b}{N_b + N_w} = \frac{0}{N}, \qquad p(X=b|\hat\theta_{MLE}) = \hat\theta_{MLE} = 0$

However, this may just be due to sparse data. Below, we will see how Bayesian approaches work better in the small-sample setting.

The beta-Bernoulli model. Consider the probability of heads, given a sequence of N coin tosses, X_1, ..., X_N. The likelihood is

$p(D|\theta) = \prod_{n=1}^N \theta^{X_n}(1-\theta)^{1-X_n} = \theta^{N_1}(1-\theta)^{N_0}$

The natural conjugate prior is the Beta distribution

$p(\theta) = \mathrm{Be}(\theta|\alpha_1,\alpha_0) \propto \theta^{\alpha_1-1}(1-\theta)^{\alpha_0-1}$

The posterior is also Beta, with updated counts:

$p(\theta|D) = \mathrm{Be}(\theta|\alpha_1+N_1, \alpha_0+N_0) \propto \theta^{\alpha_1-1+N_1}(1-\theta)^{\alpha_0-1+N_0}$

Just combine the exponents in θ and (1-θ) from the prior and likelihood.
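A minimal Matlab sketch of this conjugate update (the data vector and hyper-parameters are illustrative assumptions):

    X = [1 0 0 1 1];                              % coin tosses: 1 = heads, 0 = tails
    a1 = 2; a0 = 2;                               % Beta(2,2) prior pseudo-counts
    N1 = sum(X == 1); N0 = sum(X == 0);
    a1post = a1 + N1; a0post = a0 + N0;           % posterior Beta(a1+N1, a0+N0) = Beta(5,4)
    postmean = a1post / (a1post + a0post);        % 5/9, vs the MLE N1/(N1+N0) = 3/5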

Beta distribution. The beta distribution is

$p(\theta|\alpha_1,\alpha_0) = \frac{1}{B(\alpha_1,\alpha_0)}\theta^{\alpha_1-1}(1-\theta)^{\alpha_0-1}$

The normalization constant is the beta function

$B(\alpha_1,\alpha_0) = \int_0^1 \theta^{\alpha_1-1}(1-\theta)^{\alpha_0-1}\,d\theta = \frac{\Gamma(\alpha_1)\Gamma(\alpha_0)}{\Gamma(\alpha_1+\alpha_0)}$

The mean is $E[\theta] = \frac{\alpha_1}{\alpha_1+\alpha_0}$.

Updating a beta distribution Prior is Beta(2,2). Observe 1 head. Posterior is Beta(3,2), so mean shifts from 2/4 to 3/5. Prior is Beta(3,2). Observe 1 head. Posterior is Beta(4,2), so mean shifts from 3/5 to 4/6.

Setting the hyper-parameters. The prior hyper-parameters α_1, α_0 can be interpreted as pseudo counts. The effective sample size (strength) of the prior is α_1 + α_0, and the prior mean is α_1/(α_1+α_0). If our prior belief is p(heads) = 0.3, and we think this belief is equivalent to about 10 data points, we just solve

$\alpha_1 + \alpha_0 = 10, \qquad \frac{\alpha_1}{\alpha_1+\alpha_0} = 0.3$
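Solving this pair of equations (a small step the slide leaves implicit):

$\alpha_1 = 0.3 \times 10 = 3, \qquad \alpha_0 = 10 - 3 = 7$

so the prior is $\mathrm{Be}(\theta|3,7)$.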

Posterior mean. Let N = N_1 + N_0 be the amount of data, and M = α_0 + α_1 be the amount of virtual data. The posterior mean is a convex combination of the prior mean α_1/M and the MLE N_1/N:

$E[\theta|\alpha_1,\alpha_0,N_1,N_0] = \frac{\alpha_1+N_1}{\alpha_1+N_1+\alpha_0+N_0} = \frac{\alpha_1+N_1}{N+M} = \frac{M}{N+M}\,\frac{\alpha_1}{M} + \frac{N}{N+M}\,\frac{N_1}{N} = w\,\frac{\alpha_1}{M} + (1-w)\,\frac{N_1}{N}$

where w = M/(N+M) is the strength of the prior relative to the total amount of data. We shrink our estimate away from the MLE towards the prior (a form of regularization).
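As a concrete (illustrative) check, take the $\mathrm{Be}(3,7)$ prior above with data $N_1 = 1$, $N_0 = 9$, so $N = M = 10$ and $w = 1/2$:

$E[\theta|D] = \frac{3+1}{10+10} = 0.2 = \tfrac{1}{2}(0.3) + \tfrac{1}{2}(0.1)$

exactly halfway between the prior mean 0.3 and the MLE 0.1.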

MAP estimation. It is often easier to compute the posterior mode (optimization) than the posterior mean (integration). This is called maximum a posteriori estimation, and is equivalent to penalized likelihood estimation:

$\hat\theta_{MAP} = \arg\max_\theta p(\theta|D) = \arg\max_\theta \log p(D|\theta) + \log p(\theta)$

For the beta distribution, the mode is $\frac{\alpha_1-1}{\alpha_1+\alpha_0-2}$.
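Applied to the posterior $\mathrm{Be}(\theta|\alpha_1+N_1, \alpha_0+N_0)$, the same mode formula gives (spelled out here for clarity)

$\hat\theta_{MAP} = \frac{\alpha_1+N_1-1}{\alpha_1+\alpha_0+N-2}$

e.g., with the $\mathrm{Be}(3,7)$ prior and $N_1=1$, $N_0=9$ above, $\hat\theta_{MAP} = 3/18 \approx 0.17$, slightly below the posterior mean of 0.2.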

Posterior predictive distribution. We integrate out our uncertainty about θ when predicting the future (hedging our bets):

$p(x|D) = \int p(x|\theta)\,p(\theta|D)\,d\theta$

If the posterior becomes peaked, $p(\theta|D) \to \delta(\theta-\hat\theta)$, we get the plug-in principle:

$p(x|D) = \int p(x|\theta)\,\delta(\theta-\hat\theta)\,d\theta = p(x|\hat\theta)$

by the sifting property of delta functions.

Posterior predictive distribution. Let $\alpha_1' = \alpha_1+N_1$ and $\alpha_0' = \alpha_0+N_0$ be the updated hyper-parameters. In this case, the posterior predictive is equivalent to plugging in the posterior mean parameters:

$p(X=1|D) = \int_0^1 p(X=1|\theta)\,p(\theta|D)\,d\theta = \int_0^1 \theta\,\mathrm{Beta}(\theta|\alpha_1',\alpha_0')\,d\theta = E[\theta|D] = \frac{\alpha_1'}{\alpha_0'+\alpha_1'}$

If α_0 = α_1 = 1, we get Laplace's rule of succession (add-one smoothing):

$p(X=1|N_1,N_0) = \frac{N_1+1}{N_1+N_0+2}$

Solution to the black swan paradox. If we use a Beta(1,1) prior, the posterior predictive is

$p(X=1|N_1,N_0) = \frac{N_1+1}{N_1+N_0+2}$

so we will never predict that black swans are impossible. However, as we see more and more white swans, we will come to believe that black swans are pretty rare.
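For example, with the counts from the black swan slide ($N_1 = N_b = 0$ black swans, $N_0 = N_w = 3$ white swans):

$p(X=b|D) = \frac{0+1}{0+3+2} = \frac{1}{5} = 0.2$

small, but not zero.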

Summary of the beta-Bernoulli model.

Likelihood: $p(D|\theta) = \theta^{N_1}(1-\theta)^{N_0}$
Prior: $p(\theta) = \mathrm{Beta}(\theta|\alpha_1,\alpha_0) = \frac{1}{B(\alpha_1,\alpha_0)}\theta^{\alpha_1-1}(1-\theta)^{\alpha_0-1}$
Posterior: $p(\theta|D) = \mathrm{Beta}(\theta|\alpha_1+N_1, \alpha_0+N_0)$
Posterior predictive: $p(X=1|D) = \frac{\alpha_1+N_1}{\alpha_1+\alpha_0+N}$