Introduction to Machine Learning

1 Introduction to Machine Learning: Generative Models
Varun Chandola
Computer Science & Engineering, State University of New York at Buffalo, Buffalo, NY, USA
CSE 474/574

2 Outline
- Generative Models for Discrete Data
- Bayesian Concept Learning: Likelihood, Adding a Prior, Posterior, Posterior Predictive Distribution
- Steps for Learning a Generative Model: Incorporating Prior, Beta Distribution, Conjugate Priors, Estimating Posterior, Using Predictive Distribution, Need for Prior, Need for Bayesian Averaging
- Learning Gaussian Models: Estimating Parameters, Estimating Posterior

3 Generative Models
Let us go back to our tumor example.
- X represents the data, with multiple discrete attributes. Is X a discrete or continuous random variable?
- Y represents the class (benign or malignant).
- Most probable class: $P(Y = c \mid X = x, \theta) \propto P(X = x \mid Y = c, \theta)\, P(Y = c \mid \theta)$
- $P(X = x \mid Y = c, \theta) = p(x \mid y = c, \theta)$ is the class-conditional density: how is the data distributed for each class?

4 Bayesian Concept Learning
- A concept assigns binary labels to examples.
- Features are modeled as a random variable X; the class is modeled as a random variable Y.
- We want to find: $P(Y = c \mid X = x)$

5 Concept Learning on the Number Line
- I give you a set of numbers (training set D) belonging to a concept.
- Choose the most likely hypothesis (concept).
- Assume the numbers are between 1 and 100.
- Example hypothesis space (H): all powers of 2, all powers of 4, all even numbers, all prime numbers, numbers close to a fixed number (say 12).
- Socrative game: go to http://b.socrative.com and enter class ID UBML18.

6 Ready?
Hypothesis space (H):
1. Even numbers
2. Odd numbers
3. Squares
4. Powers of 2
5. Powers of 4
6. Powers of …
7. Multiples of 5
8. Multiples of …
9. Numbers within 20 ± …
10. All numbers between 1 and 100
D = {}

7 Ready?
Same hypothesis space as above; D = {16}

8 Ready?
Same hypothesis space as above; D = {60}

9 Ready?
Same hypothesis space as above; D = {16, 19, 15, 20, 18}

10 Ready?
Same hypothesis space as above; D = {16, 4, 64, 32}

11 Computing Likelihood
- Why choose the powers-of-2 concept over the even-numbers concept for D = {16, 4, 64, 32}?
- Avoid suspicious coincidences: choose the concept with higher likelihood.
- What is the likelihood of the above D being generated by the powers-of-2 concept? By the even-numbers concept?

12 Likelihood
- Why choose one hypothesis over another? Avoid suspicious coincidences: choose the concept with higher likelihood.
- Likelihood: $p(D \mid h) = \prod_{x \in D} p(x \mid h)$
- Log-likelihood: $\log p(D \mid h) = \sum_{x \in D} \log p(x \mid h)$
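The size principle behind these likelihoods can be checked with a short sketch. This is a minimal illustration, assuming each hypothesis has a finite extension inside {1, ..., 100} and that examples are drawn uniformly from that extension, so $p(D \mid h) = (1/|h|)^N$ when D is consistent with h and 0 otherwise; the two extensions below are my own encodings of the "powers of 2" and "even numbers" concepts.

    # Size-principle likelihood: p(D | h) = (1 / |h|)^N if all of D lies in h, else 0.
    def likelihood(D, ext):
        """Likelihood of D under uniform sampling from the extension ext of a hypothesis."""
        if not all(x in ext for x in D):
            return 0.0
        return (1.0 / len(ext)) ** len(D)

    powers_of_two = {2 ** k for k in range(1, 7)}               # {2, 4, 8, 16, 32, 64}
    even_numbers = {n for n in range(1, 101) if n % 2 == 0}     # 50 numbers

    D = [16, 4, 64, 32]
    print(likelihood(D, powers_of_two))   # (1/6)^4  ~ 7.7e-4
    print(likelihood(D, even_numbers))    # (1/50)^4 = 1.6e-7

Even numbers can also explain D, but only through a coincidence that the likelihood penalizes heavily.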

13 Bayesian Concept Learning
Same hypothesis space as above; D = {16, 4, 64, 32}

14 Adding a Prior
- Inside information about the hypotheses: some hypotheses are more likely a priori.
- The prior may not favor the right hypothesis (the prior can be wrong).

15 Posterior
- Revised estimate for h after observing the evidence D, combined with the prior.
- Posterior ∝ Likelihood × Prior:
  $p(h \mid D) = \dfrac{p(D \mid h)\, p(h)}{\sum_{h' \in H} p(D \mid h')\, p(h')}$
- [Table: prior, likelihood, and posterior for each hypothesis in H.]

16 Finding the Best Hypothesis
- Maximum a priori estimate: $\hat{h}_{prior} = \arg\max_h p(h)$
- Maximum likelihood estimate (MLE): $\hat{h}_{MLE} = \arg\max_h p(D \mid h) = \arg\max_h \log p(D \mid h) = \arg\max_h \sum_{x \in D} \log p(x \mid h)$
- Maximum a posteriori (MAP) estimate: $\hat{h}_{MAP} = \arg\max_h p(D \mid h)\, p(h) = \arg\max_h \left(\log p(D \mid h) + \log p(h)\right)$

17 MAP and MLE
- $\hat{h}_{prior}$: most likely hypothesis based on the prior: $\hat{h}_{prior} = \arg\max_h \log p(h)$
- $\hat{h}_{MLE}$: most likely hypothesis based on the evidence: $\hat{h}_{MLE} = \arg\max_h \log p(D \mid h)$
- $\hat{h}_{MAP}$: most likely hypothesis based on the posterior: $\hat{h}_{MAP} = \arg\max_h \left(\log p(D \mid h) + \log p(h)\right)$
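As a toy illustration of the three estimates, the sketch below scores a handful of hypotheses on D = {16, 4, 64, 32}. The extensions and especially the prior weights are hypothetical choices made for this example, not the values used in the lecture.

    # Toy hypothesis space: name -> extension (subset of {1, ..., 100}); illustrative only.
    hypotheses = {
        "even":        {n for n in range(1, 101) if n % 2 == 0},
        "odd":         {n for n in range(1, 101) if n % 2 == 1},
        "powers_of_2": {2 ** k for k in range(1, 7)},
        "powers_of_4": {4 ** k for k in range(1, 4)},
        "mult_of_5":   {n for n in range(5, 101, 5)},
    }
    # Hypothetical prior: common, "natural" concepts get more mass.
    prior = {"even": 0.35, "odd": 0.35, "powers_of_2": 0.1, "powers_of_4": 0.1, "mult_of_5": 0.1}

    def likelihood(D, ext):
        return (1.0 / len(ext)) ** len(D) if all(x in ext for x in D) else 0.0

    D = [16, 4, 64, 32]
    h_prior = max(hypotheses, key=lambda h: prior[h])
    h_mle = max(hypotheses, key=lambda h: likelihood(D, hypotheses[h]))
    h_map = max(hypotheses, key=lambda h: likelihood(D, hypotheses[h]) * prior[h])
    print(h_prior, h_mle, h_map)   # even  powers_of_2  powers_of_2

With this prior the MAP and ML estimates agree here; with fewer examples or a much more skewed prior the two can differ.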

18 Interesting Properties
- As data increases, the MAP estimate converges towards the MLE. Why? The likelihood term grows with N while the prior term stays fixed.
- MAP/MLE are consistent estimators: if the true concept is in H, the MAP/ML estimates converge to it; if $c \notin H$, they converge to the hypothesis in H closest to the truth.

19 From Prior to Posterior via Likelihood
[Figure: prior distribution and posterior distribution over the hypotheses.]
Objective: revise the prior distribution over the hypotheses after observing data (evidence).

20 Posterior Predictive Distribution
- New input $\tilde{x}$: what is the probability that $\tilde{x}$ is also generated by the same concept as D, i.e., $P(Y = c \mid X = \tilde{x}, D)$?
- Option 0: treat $h_{prior}$ as the true concept: $P(Y = c \mid X = \tilde{x}, D) = P(X = \tilde{x} \mid c = h_{prior})$
- Option 1: treat $h_{MLE}$ as the true concept: $P(Y = c \mid X = \tilde{x}, D) = P(X = \tilde{x} \mid c = h_{MLE})$
- Option 2: treat $h_{MAP}$ as the true concept: $P(Y = c \mid X = \tilde{x}, D) = P(X = \tilde{x} \mid c = h_{MAP})$
- Option 3: Bayesian averaging: $P(Y = c \mid X = \tilde{x}, D) = \sum_h P(X = \tilde{x} \mid c = h)\, p(h \mid D)$
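Option 3 can be made concrete with a small sketch. The two-hypothesis space and its uniform prior below are hypothetical, and $P(X = \tilde{x} \mid c = h)$ is taken to be $1/|h|$ when $\tilde{x}$ lies in h's extension (the same uniform-sampling model used for the likelihood), so this illustrates the averaging formula rather than reproducing the lecture's numbers.

    # Bayesian model averaging: p(x_new | D) = sum_h p(x_new | h) p(h | D).
    hypotheses = {
        "even":        {n for n in range(1, 101) if n % 2 == 0},
        "powers_of_2": {2 ** k for k in range(1, 7)},
    }
    prior = {"even": 0.5, "powers_of_2": 0.5}    # hypothetical uniform prior

    def likelihood(D, ext):
        return (1.0 / len(ext)) ** len(D) if all(x in ext for x in D) else 0.0

    def posterior(D):
        unnorm = {h: likelihood(D, ext) * prior[h] for h, ext in hypotheses.items()}
        Z = sum(unnorm.values())
        return {h: v / Z for h, v in unnorm.items()}

    def predictive(x_new, D):
        post = posterior(D)
        return sum((1.0 / len(ext) if x_new in ext else 0.0) * post[h]
                   for h, ext in hypotheses.items())

    D = [16, 4, 64, 32]
    print(predictive(8, D))    # ~0.167: powers of 2 dominates the posterior and contains 8
    print(predictive(10, D))   # ~4e-6: only "even" contains 10, and its posterior mass is tiny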

21 Steps for Learning a Generative Model
- Example: D is a sequence of N binary values (0s and 1s), e.g., coin tosses.
- What is the best distribution to describe D? What is the probability of observing a head in the future?
- Step 1: choose the form of the model.
  - Hypothesis space: all possible distributions. Too complicated!
  - Revised hypothesis space: all Bernoulli distributions ($X \sim \mathrm{Ber}(\theta)$, $0 \le \theta \le 1$); θ is the hypothesis.
  - Still infinite (θ can take infinitely many values).

22 Compute Likelihood
- Likelihood of D: $p(D \mid \theta) = \theta^{N_1}(1-\theta)^{N_0}$
- Maximum likelihood estimate: $\hat{\theta}_{MLE} = \arg\max_\theta p(D \mid \theta) = \arg\max_\theta \theta^{N_1}(1-\theta)^{N_0} = \dfrac{N_1}{N_0 + N_1}$

23 Compute Likelihood (continued)
- We can stop here (the MLE approach).
- Probability of getting a head next: $p(x = 1 \mid D) = \hat{\theta}_{MLE}$
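A one-liner makes the MLE concrete; the toss sequence below is made up for illustration.

    # Bernoulli MLE: theta_MLE = N1 / (N0 + N1) from a sequence of coin tosses (1 = heads).
    D = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]   # hypothetical data
    N1 = sum(D)
    N0 = len(D) - N1
    theta_mle = N1 / (N0 + N1)
    print(theta_mle)   # 0.7; under the MLE plug-in, p(x = 1 | D) = 0.7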

24 Incorporating Prior
- The prior encodes our prior belief about θ. How do we set a Bayesian prior?
  1. A point estimate: $\theta_{prior} = \ldots$
  2. A probability distribution over θ (treating θ as a random variable). Which one?
- For a Bernoulli distribution, $0 \le \theta \le 1$: use a Beta distribution for $p(\theta)$.

25 Beta Distribution as Prior
- A continuous random variable defined between 0 and 1: $\theta \sim \mathrm{Beta}(\theta \mid a, b)$
  $p(\theta \mid a, b) = \dfrac{1}{B(a, b)}\, \theta^{a-1}(1-\theta)^{b-1}$
- a and b are the (hyper-)parameters of the distribution; they control the shape of the pdf.
- $B(a, b)$ is the beta function: $B(a, b) = \dfrac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)}$, where $\Gamma(x) = \int_0^\infty u^{x-1} e^{-u}\, du$; if x is an integer, $\Gamma(x) = (x - 1)!$.
- We can stop here as well (the prior approach): $p(x = 1) = \theta_{prior}$
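The sketch below just evaluates the Beta pdf for a few hyperparameter settings to see how a and b control its shape; it assumes SciPy is available, and the (a, b) pairs are arbitrary examples.

    import numpy as np
    from scipy.stats import beta   # Beta(theta | a, b) density

    theta = np.linspace(0.01, 0.99, 5)
    for a, b in [(1, 1), (2, 2), (5, 2)]:   # uniform, symmetric around 0.5, skewed toward heads
        print((a, b), np.round(beta.pdf(theta, a, b), 3))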

26 Conjugate Priors
- Another reason to choose the Beta distribution: Posterior ∝ Likelihood × Prior.
- $p(D \mid \theta) = \theta^{N_1}(1-\theta)^{N_0}$, $\quad p(\theta) \propto \theta^{a-1}(1-\theta)^{b-1}$
- $p(\theta \mid D) \propto \theta^{N_1}(1-\theta)^{N_0}\, \theta^{a-1}(1-\theta)^{b-1} = \theta^{N_1 + a - 1}(1-\theta)^{N_0 + b - 1}$
- The posterior has the same form as the prior: the Beta distribution is a conjugate prior for the Bernoulli/Binomial distribution.

27 Estimating Posterior
- Posterior: $p(\theta \mid D) \propto \theta^{N_1 + a - 1}(1-\theta)^{N_0 + b - 1} = \mathrm{Beta}(\theta \mid N_1 + a, N_0 + b)$
- We start with the belief $E[\theta] = \dfrac{a}{a + b}$.
- After observing N trials with $N_1$ heads and $N_0$ tails, we update our belief to $E[\theta \mid D] = \dfrac{a + N_1}{a + b + N}$.
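Because the update only adds counts to the hyperparameters, it is a few lines of arithmetic. The prior pseudo-counts and data counts below are hypothetical; the snippet also prints the posterior variance discussed a few slides later.

    # Conjugate Beta-Bernoulli update: Beta(a, b) prior + (N1 heads, N0 tails)
    # gives the posterior Beta(a + N1, b + N0).
    a, b = 2, 2          # hypothetical prior pseudo-counts
    N1, N0 = 7, 3        # hypothetical observed counts

    a_post, b_post = a + N1, b + N0
    post_mean = a_post / (a_post + b_post)   # (a + N1) / (a + b + N)
    post_var = (a_post * b_post) / ((a_post + b_post) ** 2 * (a_post + b_post + 1))
    print(post_mean, post_var)               # 9/14 ~ 0.643, variance ~ 0.015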

28 Using the Posterior
- We know that the posterior over θ is a Beta distribution.
- MAP estimate: $\hat{\theta}_{MAP} = \arg\max_\theta \mathrm{Beta}(\theta \mid a + N_1, b + N_0) = \dfrac{a + N_1 - 1}{a + b + N - 2}$
- What happens if a = b = 1? The MAP estimate reduces to the MLE.
- We can stop here as well (the MAP approach). Probability of getting a head next: $p(x = 1 \mid D) = \hat{\theta}_{MAP}$

29 True Bayesian Approach
- All values of θ are possible. The prediction for an unknown input $\tilde{x}$ is given by Bayesian averaging:
  $p(\tilde{x} = 1 \mid D) = \int p(\tilde{x} = 1 \mid \theta)\, p(\theta \mid D)\, d\theta = \int \theta\, \mathrm{Beta}(\theta \mid a + N_1, b + N_0)\, d\theta = E[\theta \mid D] = \dfrac{a + N_1}{a + b + N}$
- This is the same as using $E[\theta \mid D]$ as a point estimate for θ.
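To see that the averaging integral really collapses to the posterior mean, one can integrate numerically and compare with the closed form; SciPy is assumed, and the counts are the same hypothetical ones as above.

    from scipy.integrate import quad
    from scipy.stats import beta

    a, b, N1, N0 = 2, 2, 7, 3
    integral, _ = quad(lambda t: t * beta.pdf(t, a + N1, b + N0), 0.0, 1.0)
    closed_form = (a + N1) / (a + b + N1 + N0)
    print(integral, closed_form)   # both ~ 0.642857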

30 The Black Swan Paradox
- Why use a prior? Consider D = {tails, tails, tails}, so $N_1 = 0$, $N = 3$.
- $\hat{\theta}_{MLE} = 0$, hence $p(x = 1 \mid D) = 0$: we predict that heads will never be observed. This is the black swan paradox.
- How does the Bayesian approach help? $p(x = 1 \mid D) = \dfrac{a}{a + b + 3} > 0$
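A short comparison shows the paradox and the Bayesian fix; the uniform Beta(1, 1) prior is one illustrative choice.

    D = [0, 0, 0]                 # tails, tails, tails
    N1, N = sum(D), len(D)

    theta_mle = N1 / N                        # 0.0 -> MLE plug-in says heads is impossible
    a, b = 1, 1                               # hypothetical uniform prior
    p_heads_bayes = (a + N1) / (a + b + N)    # 1/5 = 0.2, still strictly positive
    print(theta_mle, p_heads_bayes)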

31 Why is the MAP Estimate Insufficient?
- MAP is only one point of the posterior: the value of θ at which the posterior probability is maximum. Is that enough?
- What about the posterior variance of θ? $\mathrm{var}[\theta \mid D] = \dfrac{(a + N_1)(b + N_0)}{(a + b + N)^2\,(a + b + N + 1)}$
- If the variance is high, then $\hat{\theta}_{MAP}$ is not trustworthy; Bayesian averaging helps in this case.

32 Multivariate Gaussian
pdf for an MVN in d dimensions:
$N(x \mid \mu, \Sigma) = \dfrac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\left[-\dfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right]$
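The density can be evaluated directly from this formula; the sketch below does so and cross-checks against scipy.stats.multivariate_normal. The mean, covariance, and query point are arbitrary examples.

    import numpy as np
    from scipy.stats import multivariate_normal

    def mvn_pdf(x, mu, Sigma):
        """N(x | mu, Sigma) evaluated from the closed-form density."""
        d = len(mu)
        diff = x - mu
        norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.linalg.det(Sigma) ** 0.5)
        return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

    mu = np.array([0.0, 1.0])
    Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
    x = np.array([0.5, 0.5])
    print(mvn_pdf(x, mu, Sigma), multivariate_normal.pdf(x, mean=mu, cov=Sigma))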

33 Estimating Parameters of an MVN
Problem statement: given a set D of N independent and identically distributed (iid) samples, learn the parameters (µ, Σ) of the Gaussian distribution that generated D.
MLE approach: maximize the log-likelihood. Result:
$\mu_{MLE} = \bar{x} = \dfrac{1}{N} \sum_{i=1}^{N} x_i$
$\Sigma_{MLE} = \dfrac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^\top$
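The MLE is just the sample mean and the 1/N-scaled scatter matrix; the sketch draws synthetic data from a known Gaussian so the estimates can be eyeballed against the truth (the generating parameters are arbitrary).

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[1.0, 0.5], [0.5, 2.0]], size=500)

    mu_mle = X.mean(axis=0)
    centered = X - mu_mle
    Sigma_mle = (centered.T @ centered) / X.shape[0]   # note 1/N, not 1/(N-1)
    print(mu_mle)
    print(Sigma_mle)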

34 Estimating Posterior
- We need a posterior for both µ and Σ: p(µ) and p(Σ).
- What distribution do we need for µ? A Gaussian distribution: $p(\mu) = N(\mu \mid m_0, V_0)$
- What distribution do we need for Σ? An inverse-Wishart distribution:
  $p(\Sigma) = IW(\Sigma \mid S, \nu) = \dfrac{1}{Z_{IW}}\, |\Sigma|^{-(\nu + d + 1)/2} \exp\left(-\dfrac{1}{2}\, \mathrm{tr}(S^{-1}\Sigma^{-1})\right)$,
  where $Z_{IW} = |S|^{\nu/2}\, 2^{\nu d/2}\, \Gamma_D(\nu/2)$.

35 Calculating Posterior
- Posterior for µ (given Σ) is also an MVN: $p(\mu \mid D, \Sigma) = N(\mu \mid m_N, V_N)$, with
  $V_N^{-1} = V_0^{-1} + N\Sigma^{-1}$, $\quad m_N = V_N\left(\Sigma^{-1}(N\bar{x}) + V_0^{-1} m_0\right)$
- Posterior for Σ (given µ) is also an inverse-Wishart: $p(\Sigma \mid D, \mu) = IW(\Sigma \mid S_N, \nu_N)$, with
  $\nu_N = \nu_0 + N$, $\quad S_N^{-1} = S_0^{-1} + S_\mu$, where $S_\mu = \sum_{i=1}^{N}(x_i - \mu)(x_i - \mu)^\top$.
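A sketch of the µ update (with Σ treated as known) is below; the prior mean m0, prior covariance V0, and the data-generating parameters are hypothetical, and the Σ posterior is omitted.

    # Conjugate Gaussian update for mu with Sigma known:
    #   V_N^{-1} = V_0^{-1} + N * Sigma^{-1},   m_N = V_N (Sigma^{-1} (N xbar) + V_0^{-1} m_0)
    import numpy as np

    rng = np.random.default_rng(1)
    Sigma = np.array([[1.0, 0.2], [0.2, 0.5]])                  # assumed known
    X = rng.multivariate_normal([2.0, -1.0], Sigma, size=50)    # observed data
    N, xbar = X.shape[0], X.mean(axis=0)

    m0 = np.zeros(2)            # hypothetical prior mean
    V0 = 10.0 * np.eye(2)       # broad hypothetical prior covariance

    VN = np.linalg.inv(np.linalg.inv(V0) + N * np.linalg.inv(Sigma))
    mN = VN @ (np.linalg.inv(Sigma) @ (N * xbar) + np.linalg.inv(V0) @ m0)
    print(mN)                   # pulled toward xbar as N grows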

36 References
