Introduction to Machine Learning

Introduction to Machine Learning: Generative Models
Varun Chandola
Computer Science & Engineering, State University of New York at Buffalo, Buffalo, NY, USA
chandola@buffalo.edu
CSE 474/574

Outline
Generative Models for Discrete Data
Bayesian Concept Learning: Likelihood; Adding a Prior; Posterior; Posterior Predictive Distribution
Steps for Learning a Generative Model: Incorporating Prior; Beta Distribution; Conjugate Priors; Estimating Posterior; Using Predictive Distribution; Need for Prior; Need for Bayesian Averaging
Learning Gaussian Models: Estimating Parameters; Estimating Posterior

Generative Models
Let us go back to our tumor example. X represents the data, with multiple discrete attributes. Is X a discrete or a continuous random variable? Y represents the class (benign or malignant). The most probable class is found via
$$P(Y = c \mid X = x, \theta) \propto P(X = x \mid Y = c, \theta)\, P(Y = c \mid \theta)$$
where $P(X = x \mid Y = c, \theta) = p(x \mid y = c, \theta)$ is the class-conditional density: how is the data distributed for each class?
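
As a minimal sketch (not the lecture's code), the decision rule above is an arg-max over classes of likelihood times prior; the class-conditional densities and class priors below are hypothetical stand-ins, not values from the lecture.

```python
# Hypothetical class-conditional likelihoods p(x | y = c) and class priors p(y = c)
# for a single feature x; in practice these come from a fitted generative model.
def most_probable_class(x, class_conditional, class_prior):
    """Return the class c maximizing p(x | y = c) * p(y = c)."""
    scores = {c: class_conditional[c](x) * class_prior[c] for c in class_prior}
    return max(scores, key=scores.get)

# Toy example: two classes with made-up densities over a binary feature.
class_prior = {"benign": 0.7, "malignant": 0.3}
class_conditional = {
    "benign": lambda x: 0.9 if x == 0 else 0.1,
    "malignant": lambda x: 0.2 if x == 0 else 0.8,
}
print(most_probable_class(1, class_conditional, class_prior))  # -> "malignant"
```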

Bayesian Concept Learning
A concept assigns binary labels to examples. Features are modeled as a random variable X and the class as a random variable Y. We want to find P(Y = c | X = x).

Concept Learning on the Number Line
I give you a set of numbers (a training set D) belonging to a concept; choose the most likely hypothesis (concept). Assume that the numbers are between 1 and 100. Hypothesis space (H): all powers of 2, all powers of 4, all even numbers, all prime numbers, numbers close to a fixed number (say 12), and so on.
Socrative game: go to http://b.socrative.com and enter class ID UBML18.

Ready?
Hypothesis Space (H)
1. Even numbers
2. Odd numbers
3. Squares
4. Powers of 2
5. Powers of 4
6. Powers of 16
7. Multiples of 5
8. Multiples of 10
9. Numbers within 20 ± 5
10. All numbers between 1 and 100
The game is played with a growing training set; which hypotheses remain plausible at each step?
D = {}
D = {16}
D = {60}
D = {16, 19, 15, 20, 18}
D = {16, 4, 64, 32}

Computing Likelihood
Why choose the powers-of-2 concept over the even-numbers concept for D = {16, 4, 64, 32}? To avoid suspicious coincidences, choose the concept with the higher likelihood. What is the likelihood of the above D being generated by the powers-of-2 concept? What is the likelihood for the even-numbers concept?

Likelihood
Why choose one hypothesis over another? Avoid suspicious coincidences: choose the concept with the higher likelihood,
$$p(D \mid h) = \prod_{x \in D} p(x \mid h)$$
Log-likelihood:
$$\log p(D \mid h) = \sum_{x \in D} \log p(x \mid h)$$

Bayesian Concept Learning
For D = {16, 4, 64, 32}, which of the ten hypotheses listed above should we pick?

Adding a Prior
A prior encodes inside information about the hypotheses: some hypotheses are more likely a priori. It may not favor the right hypothesis (the prior can be wrong).

Posterior
The posterior gives revised estimates for h after observing the evidence D and the prior. Posterior ∝ Likelihood × Prior:
$$p(h \mid D) = \frac{p(D \mid h)\, p(h)}{\sum_{h' \in H} p(D \mid h')\, p(h')}$$
For D = {16, 4, 64, 32}:

    h                          Prior    Likelihood      Posterior
    1  Even numbers            0.3      0.16 x 10^-6    0.621 x 10^-3
    2  Odd numbers             0.075    0               0
    3  Squares                 0.075    0               0
    4  Powers of 2             0.1      0.77 x 10^-3    0.997
    5  Powers of 4             0.075    0               0
    6  Powers of 16            0.075    0               0
    7  Multiples of 5          0.075    0               0
    8  Multiples of 10         0.075    0               0
    9  Numbers within 20 ± 5   0.075    0               0
    10 All numbers             0.075    0.01 x 10^-6    0.009 x 10^-3
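
A minimal sketch of this computation, assuming the size-principle likelihood p(x | h) = 1/|h| for x in the extension of h (the slide does not state this formula, but it approximately reproduces the table; the exact extensions and priors on the slide may differ slightly, so the printed numbers are close to but not identical with those above). The hypothesis definitions and helper names are mine.

```python
# Hypothesis extensions over the integers 1..100 (assumed definitions).
N_MAX = 100
hypotheses = {
    "even numbers":        {x for x in range(1, N_MAX + 1) if x % 2 == 0},
    "odd numbers":         {x for x in range(1, N_MAX + 1) if x % 2 == 1},
    "squares":             {x * x for x in range(1, 11)},
    "powers of 2":         {2 ** k for k in range(1, 7)},
    "powers of 4":         {4 ** k for k in range(1, 4)},
    "powers of 16":        {16},
    "multiples of 5":      set(range(5, N_MAX + 1, 5)),
    "multiples of 10":     set(range(10, N_MAX + 1, 10)),
    "numbers within 20±5": set(range(15, 26)),
    "all numbers":         set(range(1, N_MAX + 1)),
}
priors = {h: 0.075 for h in hypotheses}
priors["even numbers"] = 0.3
priors["powers of 2"] = 0.1

def likelihood(D, extension):
    """Size-principle likelihood: (1/|h|)^N if every x in D lies in h, else 0."""
    if all(x in extension for x in D):
        return (1.0 / len(extension)) ** len(D)
    return 0.0

D = [16, 4, 64, 32]
unnorm = {h: likelihood(D, ext) * priors[h] for h, ext in hypotheses.items()}
Z = sum(unnorm.values())
posterior = {h: v / Z for h, v in unnorm.items()}
for h, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"{h:22s} posterior = {p:.3g}")
```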

Finding the Best Hypothesis
Maximum a priori estimate:
$$\hat{h}_{prior} = \arg\max_h p(h)$$
Maximum likelihood estimate (MLE):
$$\hat{h}_{MLE} = \arg\max_h p(D \mid h) = \arg\max_h \log p(D \mid h) = \arg\max_h \sum_{x \in D} \log p(x \mid h)$$
Maximum a posteriori (MAP) estimate:
$$\hat{h}_{MAP} = \arg\max_h p(D \mid h)\, p(h) = \arg\max_h \left(\log p(D \mid h) + \log p(h)\right)$$

MAP and MLE
$\hat{h}_{prior}$ is the most likely hypothesis based on the prior, $\hat{h}_{MLE}$ the most likely hypothesis based on the evidence, and $\hat{h}_{MAP}$ the most likely hypothesis based on the posterior:
$$\hat{h}_{prior} = \arg\max_h \log p(h), \qquad \hat{h}_{MLE} = \arg\max_h \log p(D \mid h), \qquad \hat{h}_{MAP} = \arg\max_h \left(\log p(D \mid h) + \log p(h)\right)$$

Interesting Properties
As data increases, the MAP estimate converges towards the MLE. Why? MAP and ML are consistent estimators: if the true concept is in H, the MAP/ML estimates will converge to it; if $c \notin H$, the MAP/ML estimates converge to the hypothesis $\hat{h}$ that is closest possible to the truth.

From Prior to Posterior via Likelihood
Objective: revise the prior distribution over the hypotheses after observing the data (evidence).

Posterior Predictive Distribution
For a new input $x^*$, what is the probability that $x^*$ is also generated by the same concept as D, i.e., $P(Y = c \mid X = x^*, D)$?
Option 0: treat $\hat{h}_{prior}$ as the true concept: $P(Y = c \mid X = x^*, D) = P(X = x^* \mid c = \hat{h}_{prior})$
Option 1: treat $\hat{h}_{MLE}$ as the true concept: $P(Y = c \mid X = x^*, D) = P(X = x^* \mid c = \hat{h}_{MLE})$
Option 2: treat $\hat{h}_{MAP}$ as the true concept: $P(Y = c \mid X = x^*, D) = P(X = x^* \mid c = \hat{h}_{MAP})$
Option 3: Bayesian averaging: $P(Y = c \mid X = x^*, D) = \sum_h P(X = x^* \mid c = h)\, p(h \mid D)$
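
A small sketch of Option 3 for the number game, under the same size-principle assumption as the earlier snippet; the hypothesis extensions and posterior weights here are illustrative placeholders, not values from the lecture.

```python
# Posterior over hypotheses (illustrative values, roughly matching the table above),
# paired with each hypothesis' extension.
hypotheses = {
    "powers of 2":  ({2, 4, 8, 16, 32, 64}, 0.997),
    "even numbers": (set(range(2, 101, 2)), 0.0006),
    "all numbers":  (set(range(1, 101)), 0.00001),
}

def predictive(x_new, hyps):
    """Bayesian averaging: sum over h of P(x_new | h) * p(h | D)."""
    total_post = sum(post for _, post in hyps.values())   # renormalize the listed mass
    prob = 0.0
    for extension, post in hyps.values():
        lik = (1.0 / len(extension)) if x_new in extension else 0.0
        prob += lik * (post / total_post)
    return prob

for x_new in [8, 10, 97]:
    print(x_new, predictive(x_new, hypotheses))
```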

Steps for Learning a Generative Model
Example: D is a sequence of N binary values (0s and 1s), e.g. coin tosses. What is the best distribution to describe D? What is the probability of observing a head in the future?
Step 1: choose the form of the model. The hypothesis space of all possible distributions is too complicated, so revise it to all Bernoulli distributions, $X \sim \mathrm{Ber}(\theta)$ with $0 \le \theta \le 1$; here θ is the hypothesis. The space is still infinite (θ can take infinitely many values).

Compute Likelihood
Likelihood of D:
$$p(D \mid \theta) = \theta^{N_1} (1 - \theta)^{N_0}$$
Maximum likelihood estimate:
$$\hat{\theta}_{MLE} = \arg\max_\theta p(D \mid \theta) = \arg\max_\theta \theta^{N_1} (1 - \theta)^{N_0} = \frac{N_1}{N_0 + N_1}$$
We can stop here (the MLE approach). The probability of getting a head next is $p(x = 1 \mid D) = \hat{\theta}_{MLE}$.
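
A quick sketch of the Bernoulli MLE on a made-up sequence of coin tosses (the data is illustrative, not from the lecture).

```python
# Coin-toss data: 1 = heads, 0 = tails (illustrative).
D = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

N1 = sum(D)             # number of heads
N0 = len(D) - N1        # number of tails

theta_mle = N1 / (N0 + N1)   # closed-form MLE for Ber(theta)
print(f"theta_MLE = {theta_mle:.3f}")   # predicted probability of heads on the next toss
```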

Incorporating a Prior
The prior encodes our prior belief about θ. How do we set a Bayesian prior? Either (1) a point estimate, e.g. θ_prior = 0.5, or (2) a probability distribution over θ (treating θ as a random variable). Which one? For a Bernoulli distribution, 0 ≤ θ ≤ 1, so a natural choice is the Beta distribution.
[Figure: example Beta density curves p(θ) over the interval [0, 1].]

Beta Distribution as a Prior
The Beta distribution is over a continuous random variable defined between 0 and 1:
$$\mathrm{Beta}(\theta \mid a, b):\quad p(\theta \mid a, b) = \frac{1}{B(a, b)}\, \theta^{a-1} (1 - \theta)^{b-1}$$
Here a and b are the (hyper-)parameters of the distribution; they control the shape of the pdf. B(a, b) is the beta function,
$$B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)}, \qquad \Gamma(x) = \int_0^\infty u^{x-1} e^{-u}\, du,$$
and if x is an integer, $\Gamma(x) = (x - 1)!$.
We can stop here as well (the prior approach): $p(x = 1) = \theta_{prior}$.
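
A small sketch that evaluates the Beta density directly from the definition above, to see how a and b shape the pdf (the parameter values are just examples).

```python
from math import gamma

def beta_pdf(theta, a, b):
    """Beta(theta | a, b) density, using B(a, b) = Gamma(a)Gamma(b)/Gamma(a+b)."""
    B = gamma(a) * gamma(b) / gamma(a + b)
    return theta ** (a - 1) * (1 - theta) ** (b - 1) / B

# a = b = 1 is uniform; larger a pushes mass toward 1, larger b toward 0.
for a, b in [(1, 1), (2, 2), (5, 2)]:
    print(a, b, [round(beta_pdf(t, a, b), 3) for t in (0.1, 0.5, 0.9)])
```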

Conjugate Priors
Another reason to choose the Beta distribution: Posterior ∝ Likelihood × Prior.
$$p(D \mid \theta) = \theta^{N_1} (1 - \theta)^{N_0}, \qquad p(\theta) \propto \theta^{a-1} (1 - \theta)^{b-1}$$
$$p(\theta \mid D) \propto \theta^{N_1} (1 - \theta)^{N_0}\, \theta^{a-1} (1 - \theta)^{b-1} = \theta^{N_1 + a - 1} (1 - \theta)^{N_0 + b - 1}$$
The posterior has the same form as the prior: the Beta distribution is a conjugate prior for the Bernoulli/Binomial distribution.

Estimating the Posterior
$$p(\theta \mid D) \propto \theta^{N_1 + a - 1} (1 - \theta)^{N_0 + b - 1} = \mathrm{Beta}(\theta \mid N_1 + a, N_0 + b)$$
We start with the belief that $E[\theta] = \frac{a}{a + b}$. After observing N trials with $N_1$ heads and $N_0$ tails, we update our belief to
$$E[\theta \mid D] = \frac{a + N_1}{a + b + N}$$
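
A minimal sketch of the conjugate update, with made-up data and hyper-parameters.

```python
# Beta(a, b) prior on theta, updated with coin-toss data (illustrative values).
a, b = 2, 2                      # prior hyper-parameters
D = [1, 1, 0, 1, 0, 1, 1, 1]     # 1 = heads, 0 = tails

N1, N0 = sum(D), len(D) - sum(D)
a_post, b_post = a + N1, b + N0   # posterior is Beta(a + N1, b + N0)

prior_mean = a / (a + b)
posterior_mean = a_post / (a_post + b_post)   # = (a + N1) / (a + b + N)
print(f"prior mean = {prior_mean:.3f}, posterior mean = {posterior_mean:.3f}")
```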

Using the Posterior
We know the posterior over θ is a Beta distribution. The MAP estimate is
$$\hat{\theta}_{MAP} = \arg\max_\theta p(\theta \mid a + N_1, b + N_0) = \frac{a + N_1 - 1}{a + b + N - 2}$$
What happens if a = b = 1? The MAP estimate reduces to the MLE. We can stop here as well (the MAP approach); the probability of getting a head next is $p(x = 1 \mid D) = \hat{\theta}_{MAP}$.

True Bayesian Approach
All values of θ are possible. The prediction for a new input x is given by Bayesian averaging:
$$p(x = 1 \mid D) = \int_0^1 p(x = 1 \mid \theta)\, p(\theta \mid D)\, d\theta = \int_0^1 \theta\, \mathrm{Beta}(\theta \mid a + N_1, b + N_0)\, d\theta = E[\theta \mid D] = \frac{a + N_1}{a + b + N}$$
This is the same as using $E[\theta \mid D]$ as a point estimate for θ.

The Black Swan Paradox
Why use a prior? Consider D = {tails, tails, tails}, so $N_1 = 0$ and $N = 3$. Then $\hat{\theta}_{MLE} = 0$ and $p(x = 1 \mid D) = 0$: having never observed heads, the MLE says heads is impossible. This is the black swan paradox. How does the Bayesian approach help?
$$p(x = 1 \mid D) = \frac{a}{a + b + 3}$$
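
A tiny numerical illustration of the paradox, assuming a uniform Beta(1, 1) prior.

```python
D = [0, 0, 0]                    # three tails, no heads observed
N1 = sum(D)

p_head_mle = N1 / len(D)                       # 0.0 -- heads declared impossible
a, b = 1, 1                                    # uniform Beta prior (assumed)
p_head_bayes = (a + N1) / (a + b + len(D))     # 1/5 = 0.2 -- still leaves room for heads

print(p_head_mle, p_head_bayes)
```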

Why is the MAP Estimate Insufficient?
The MAP estimate captures only one aspect of the posterior: the value of θ at which the posterior probability is maximum. Is that enough? What about the posterior variance of θ?
$$\mathrm{var}[\theta \mid D] = \frac{(a + N_1)(b + N_0)}{(a + b + N)^2 (a + b + N + 1)}$$
If the variance is high, then $\hat{\theta}_{MAP}$ is not trustworthy; Bayesian averaging helps in this case.
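
A short check of how the posterior variance shrinks as more data arrives (the prior and counts are made up).

```python
def posterior_variance(a, b, N1, N0):
    """Variance of Beta(a + N1, b + N0)."""
    N = N1 + N0
    return (a + N1) * (b + N0) / ((a + b + N) ** 2 * (a + b + N + 1))

a, b = 1, 1
for N1, N0 in [(3, 2), (30, 20), (300, 200)]:   # same heads ratio, growing N
    print(N1 + N0, posterior_variance(a, b, N1, N0))
```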

Multivariate Gaussian
The pdf for a multivariate normal (MVN) with d dimensions is
$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[-\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu)\right]$$
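
A direct sketch of this density formula in NumPy (the example values are arbitrary).

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Evaluate N(x | mu, Sigma) for a d-dimensional x."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
print(mvn_pdf(np.array([0.5, -1.0]), mu, Sigma))
```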

Estimating Parameters of an MVN
Problem statement: given a set D of N independent and identically distributed (iid) samples, learn the parameters (μ, Σ) of the Gaussian distribution that generated D.
MLE approach: maximize the log-likelihood. Result:
$$\hat{\mu}_{MLE} = \bar{x} = \frac{1}{N} \sum_{i=1}^N x_i, \qquad \hat{\Sigma}_{MLE} = \frac{1}{N} \sum_{i=1}^N (x_i - \bar{x})(x_i - \bar{x})^\top$$
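
A minimal sketch of these estimators on synthetic data (the sampled dataset is just for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.5], [0.5, 1.0]], size=500)
N = X.shape[0]

mu_mle = X.mean(axis=0)
centered = X - mu_mle
Sigma_mle = centered.T @ centered / N    # note 1/N (MLE), not 1/(N-1)

print(mu_mle)
print(Sigma_mle)
```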

Estimating the Posterior
We need posteriors for both μ and Σ, so we need priors p(μ) and p(Σ). What distribution should we use for μ? A Gaussian distribution:
$$p(\mu) = \mathcal{N}(\mu \mid m_0, V_0)$$
What distribution should we use for Σ? An inverse-Wishart distribution:
$$p(\Sigma) = IW(\Sigma \mid S, \nu) = \frac{1}{Z_{IW}} |\Sigma|^{-(\nu + d + 1)/2} \exp\left(-\frac{1}{2} \mathrm{tr}\!\left(S^{-1} \Sigma^{-1}\right)\right), \qquad Z_{IW} = |S|^{\nu/2}\, 2^{\nu d / 2}\, \Gamma_D(\nu / 2)$$

Calculating the Posterior
The posterior for μ (given Σ) is also an MVN:
$$p(\mu \mid D, \Sigma) = \mathcal{N}(m_N, V_N), \qquad V_N^{-1} = V_0^{-1} + N \Sigma^{-1}, \qquad m_N = V_N \left(\Sigma^{-1} (N \bar{x}) + V_0^{-1} m_0\right)$$
The posterior for Σ (given μ) is also an inverse-Wishart:
$$p(\Sigma \mid D, \mu) = IW(S_N, \nu_N), \qquad \nu_N = \nu_0 + N, \qquad S_N^{-1} = S_0^{-1} + S_\mu$$
where $S_\mu = \sum_{i=1}^N (x_i - \mu)(x_i - \mu)^\top$.
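
A sketch of these conjugate updates in NumPy, following the parameterization above; the prior hyper-parameters and data are made up, and the inverse-Wishart scale update uses the $S_N^{-1} = S_0^{-1} + S_\mu$ form stated on this slide.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 2
true_mu = np.array([1.0, -1.0])
Sigma = np.array([[1.5, 0.4], [0.4, 0.8]])            # treated as known for the mu-update
X = rng.multivariate_normal(true_mu, Sigma, size=200)
N, x_bar = X.shape[0], X.mean(axis=0)

# Posterior for mu given Sigma: N(m_N, V_N)
m0, V0 = np.zeros(d), 10.0 * np.eye(d)                # weak Gaussian prior (illustrative)
V_N = np.linalg.inv(np.linalg.inv(V0) + N * np.linalg.inv(Sigma))
m_N = V_N @ (np.linalg.inv(Sigma) @ (N * x_bar) + np.linalg.inv(V0) @ m0)

# Posterior for Sigma given mu: IW(S_N, nu_N)
mu = true_mu                                          # treated as known for the Sigma-update
nu0, S0 = d + 2, np.eye(d)                            # illustrative IW hyper-parameters
S_mu = (X - mu).T @ (X - mu)                          # sum_i (x_i - mu)(x_i - mu)^T
nu_N = nu0 + N
S_N = np.linalg.inv(np.linalg.inv(S0) + S_mu)

print("m_N =", m_N)
print("nu_N =", nu_N)
```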
