Introduction to Machine Learning: Bayesian Classification

Varun Chandola
Computer Science & Engineering, State University of New York at Buffalo, Buffalo, NY, USA
chandola@buffalo.edu
Chandola@UB, CSE 474/574
Outline

- Learning Probabilistic Classifiers
  - Treating the Output Label Y as a Random Variable
  - Computing the Posterior for Y
  - Computing Class Conditional Probabilities
- Naive Bayes Classification
  - The Naive Bayes Assumption
  - Maximizing the Likelihood
  - Maximum Likelihood Estimates
  - Adding a Prior
  - Using the Naive Bayes Model for Prediction
  - A Naive Bayes Example
- Gaussian Discriminant Analysis
  - Moving to Continuous Data
  - Quadratic and Linear Discriminant Analysis
  - Training a QDA or LDA Classifier
Learning Probabilistic Classifiers

Training data: D = {(x_i, y_i)}_{i=1}^N

1. {circular, large, light, smooth, thick}, malignant
2. {circular, large, light, irregular, thick}, malignant
3. {oval, large, dark, smooth, thin}, benign
4. {oval, large, light, irregular, thick}, malignant
5. {circular, small, light, smooth, thick}, benign

Testing: predict y for a new x.

- Option 1: functional approximation, y = f(x)
- Option 2: probabilistic classifier, P(Y = benign | X = x) and P(Y = malignant | X = x)
Applying Bayes Rule

Given the same training data D as above, consider a new example:

x = {circular, small, light, irregular, thin}

- What is P(Y = benign | X = x)?
- What is P(Y = malignant | X = x)?
The Output Label is a Discrete Random Variable

- Y takes two values. What is p(y)? Ber(θ).
- How do we estimate θ? Treat the labels in the training data as binary samples. Done that last week!
- Posterior mean for θ: (α_0 + N_1) / (α_0 + β_0 + N)
- Class 1: malignant; Class 2: benign
- Can we just use p(y | θ) for predicting future labels? No, it is just a prior for Y.
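The posterior-mean estimate above can be sketched numerically; the counts and hyperparameters below are made-up illustrative values, not numbers from the slides.

```python
# Posterior mean of theta under a Beta(alpha0, beta0) prior, after
# observing N1 class-1 (malignant) labels out of N total examples.
def theta_posterior_mean(N1, N, alpha0=1.0, beta0=1.0):
    return (alpha0 + N1) / (alpha0 + beta0 + N)

# e.g. 3 malignant labels out of 5, uniform Beta(1, 1) prior
print(theta_posterior_mean(N1=3, N=5))
```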
Computing the Posterior for Y

For a test example x:

- What is the probability of observing x given that it is malignant, P(X = x | Y = malignant)?
- What is P(Y = malignant)?
- What is P(Y = malignant | X = x)?

By Bayes rule:

P(Y = malignant | X = x) = P(X = x | Y = malignant) P(Y = malignant) / [ P(X = x | Y = malignant) P(Y = malignant) + P(X = x | Y = benign) P(Y = benign) ]
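The Bayes-rule computation above can be sketched in a few lines; the likelihoods and prior below are made-up illustrative numbers, not values from the slides.

```python
# P(Y = malignant | X = x) via Bayes rule, from the two class-conditional
# likelihoods P(X = x | Y) and the class prior P(Y = malignant).
def posterior_malignant(lik_mal, lik_ben, prior_mal):
    prior_ben = 1.0 - prior_mal
    evidence = lik_mal * prior_mal + lik_ben * prior_ben   # P(X = x)
    return lik_mal * prior_mal / evidence

print(posterior_malignant(lik_mal=0.02, lik_ben=0.005, prior_mal=0.6))
```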
What is P(X = x | Y = malignant)?

This is the class conditional probability of the random variable X.

- Step 1: assume a probability distribution for X, p(x)
- Step 2: learn its parameters from the training data

But X is a multivariate discrete random variable!

- How many parameters are needed? 2(2^D − 1)
- How much training data is needed?
The Naive Bayes Assumption

All features are conditionally independent given the class, and each feature can be modeled as a Bernoulli random variable:

P(X = x | Y = malignant) = ∏_{j=1}^D p(x_j | Y = malignant)
P(X = x | Y = benign) = ∏_{j=1}^D p(x_j | Y = benign)

(Graphical model: y is the parent of x_1, x_2, ..., x_D.)

Only 2D parameters are needed.
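The parameter-count comparison above is easy to check: the full joint class-conditional over D binary features needs 2(2^D − 1) parameters, while the naive Bayes model needs only 2D.

```python
# Parameter counts for 2 classes and D binary features.
def n_params_full(D):
    return 2 * (2 ** D - 1)   # full joint class-conditional

def n_params_naive(D):
    return 2 * D              # naive Bayes: one Bernoulli per feature per class

for D in (5, 10, 20):
    print(D, n_params_full(D), n_params_naive(D))
```

Even at D = 20 the gap is dramatic, which is the point of the assumption.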
Example: Only Binary Features

Training a naive Bayes classifier: find the parameters that maximize the likelihood of the training data.

- What is a training example? x_i? No: the pair (x_i, y_i).
- What are the parameters? θ for Y (the class prior), and θ_benign and θ_malignant (or θ_1 and θ_2) for the features.
- Joint probability distribution of (X, Y):

p(x_i, y_i) = p(y_i | θ) p(x_i | y_i) = p(y_i | θ) ∏_j p(x_ij | θ_{j y_i})
Likelihood?

Likelihood of D:

l(D | Θ) = ∏_i p(y_i | θ) ∏_j p(x_ij | θ_{j y_i})

Log-likelihood of D:

ll(D | Θ) = N_1 log θ + N_2 log(1 − θ)
          + Σ_j [ N_1j log θ_1j + (N_1 − N_1j) log(1 − θ_1j) ]
          + Σ_j [ N_2j log θ_2j + (N_2 − N_2j) log(1 − θ_2j) ]

where N_1 is the number of malignant training examples, N_2 the number of benign training examples, N_1j the number of malignant examples with x_j = 1, and N_2j the number of benign examples with x_j = 1.
MLE?

Maximizing with respect to θ, assuming Y to be Bernoulli:

θ̂_c = N_c / N

Assuming each feature is binary, x_j | (y = c) ~ Bernoulli(θ_cj), c ∈ {1, 2}:

θ̂_cj = N_cj / N_c

Algorithm 1: Naive Bayes Training for Binary Features
1: N_c ← 0, N_cj ← 0 for all j
2: for i = 1 : N do
3:   c ← y_i
4:   N_c ← N_c + 1
5:   for j = 1 : D do
6:     if x_ij = 1 then
7:       N_cj ← N_cj + 1
8:     end if
9:   end for
10: end for
11: θ̂_c = N_c / N, θ̂_cj = N_cj / N_c
12: return θ̂_c, θ̂_cj
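Algorithm 1 amounts to counting; a vectorized sketch (with classes labeled 0 and 1 rather than 1 and 2):

```python
import numpy as np

# Naive Bayes training for binary features: X is an N x D 0/1 matrix,
# y holds class labels 0/1. Returns the class priors theta_c = N_c / N
# and the per-class feature probabilities theta_cj = N_cj / N_c.
def train_naive_bayes(X, y):
    X, y = np.asarray(X), np.asarray(y)
    N, D = X.shape
    theta_c = np.zeros(2)
    theta_cj = np.zeros((2, D))
    for c in (0, 1):
        Xc = X[y == c]
        theta_c[c] = len(Xc) / N          # class prior MLE
        theta_cj[c] = Xc.mean(axis=0)     # N_cj / N_c per feature
    return theta_c, theta_cj
```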
Adding a Prior

Add a prior to θ and to each θ_cj:

- Beta prior for θ: θ ~ Beta(a_0, b_0)
- Beta prior for θ_cj: θ_cj ~ Beta(a, b)

Posterior estimates:

p(θ | D) = Beta(N_1 + a_0, N − N_1 + b_0)
p(θ_cj | D) = Beta(N_cj + a, N_c − N_cj + b)
Using the Naive Bayes Model for Prediction

p(y = c | x, D) ∝ p(y = c | D) ∏_j p(x_j | y = c, D)

MLE approach, MAP approach? Bayesian approach:

p(y = c | x, D) ∝ [ ∫ Ber(y = c | θ) p(θ | D) dθ ] ∏_j [ ∫ Ber(x_j | θ_cj) p(θ_cj | D) dθ_cj ]

which plugs in the posterior means:

θ̄ = (N_1 + a_0) / (N + a_0 + b_0),  θ̄_cj = (N_cj + a) / (N_c + a + b)
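A sketch of the prediction step, given smoothed (posterior-mean) parameters; the parameter values used in the usage example are illustrative, not from the slides.

```python
# theta is P(y = 1); theta_cj[c][j] is P(x_j = 1 | y = c); x is a 0/1 vector.
# Returns the normalized posterior p(y = 1 | x).
def posterior_class1(x, theta, theta_cj):
    scores = []
    for c, prior in ((0, 1.0 - theta), (1, theta)):
        p = prior
        for j, xj in enumerate(x):
            # Bernoulli likelihood of feature j for class c
            p *= theta_cj[c][j] if xj == 1 else 1.0 - theta_cj[c][j]
        scores.append(p)
    return scores[1] / (scores[0] + scores[1])

# one binary feature; class 1 makes x_1 = 1 much more likely
print(posterior_class1([1], theta=0.5, theta_cj=[[0.2], [0.8]]))
```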
Example

#   Shape  Size   Color  Type
1   cir    large  light  malignant
2   cir    large  light  benign
3   cir    large  light  malignant
4   ovl    large  light  benign
5   ovl    large  dark   malignant
6   ovl    small  dark   benign
7   ovl    small  dark   malignant
8   ovl    small  light  benign
9   cir    small  dark   benign
10  cir    large  dark   malignant

Test example: x = {cir, small, light}
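Working through the example above with plain MLE estimates (no smoothing):

```python
# The 10 training rows from the table above, as (shape, size, color, label).
data = [
    ("cir", "large", "light", "malignant"),
    ("cir", "large", "light", "benign"),
    ("cir", "large", "light", "malignant"),
    ("ovl", "large", "light", "benign"),
    ("ovl", "large", "dark", "malignant"),
    ("ovl", "small", "dark", "benign"),
    ("ovl", "small", "dark", "malignant"),
    ("ovl", "small", "light", "benign"),
    ("cir", "small", "dark", "benign"),
    ("cir", "large", "dark", "malignant"),
]
x = ("cir", "small", "light")

def score(label):
    rows = [r for r in data if r[3] == label]
    p = len(rows) / len(data)                    # class prior N_c / N
    for j, v in enumerate(x):                    # per-feature likelihoods
        p *= sum(r[j] == v for r in rows) / len(rows)
    return p

s_mal, s_ben = score("malignant"), score("benign")
print(round(s_mal / (s_mal + s_ben), 4))  # 0.25, so predict benign
```

The malignant score is 0.5 × 3/5 × 1/5 × 2/5 and the benign score is 0.5 × 2/5 × 3/5 × 3/5, so the model predicts benign with posterior probability 0.75.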
What if the Attributes are Continuous?

Naive Bayes is still applicable! Model each feature as a univariate Gaussian (normal) distribution:

p(y | x) ∝ p(y) ∏_j p(x_j | y)
         = p(y) ∏_j (1 / √(2πσ_j²)) exp( −(x_j − µ_j)² / (2σ_j²) )
         = p(y) (1 / ((2π)^{D/2} |Σ|^{1/2})) exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) )

where Σ is a diagonal matrix with σ_1², σ_2², ..., σ_D² as the diagonal entries and µ is the vector of means. This treats x as a multivariate Gaussian with zero covariance between features.
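The diagonal-Gaussian model above factors into univariate normal densities, one per feature per class; a minimal sketch, where the means and variances would in practice be estimated from training data:

```python
import math

# Univariate normal density N(x | mu, sigma2).
def gauss_pdf(x, mu, sigma2):
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Unnormalized posterior for one class: p(y) * prod_j p(x_j | y).
def class_score(x_vec, prior, mus, sigma2s):
    p = prior
    for xj, mu, s2 in zip(x_vec, mus, sigma2s):
        p *= gauss_pdf(xj, mu, s2)
    return p
```

Normalizing `class_score` across the classes recovers p(y | x) exactly as on the slide.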
What if Σ is not Diagonal? Gaussian Discriminant Analysis

Class conditional densities:

p(x | y = 1) = N(x | µ_1, Σ_1)
p(x | y = 2) = N(x | µ_2, Σ_2)

Posterior density for y:

p(y = 1 | x) = p(y = 1) N(x | µ_1, Σ_1) / [ p(y = 1) N(x | µ_1, Σ_1) + p(y = 2) N(x | µ_2, Σ_2) ]
Quadratic and Linear Discriminant Analysis

- Using a separate non-diagonal covariance matrix for each class gives Quadratic Discriminant Analysis (QDA): the decision boundary is quadratic.
- If Σ_1 = Σ_2 = Σ, we get Linear Discriminant Analysis (LDA): the covariance is shared across classes (parameter sharing, or "tying"), the quadratic term cancels, and the decision boundary is linear.
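The linearity claim can be checked directly. With a shared Σ, the log posterior odds are (a derivation consistent with the densities above, not shown on the slide):

```latex
\log\frac{p(y=1\mid x)}{p(y=2\mid x)}
  = \log\frac{p(y=1)}{p(y=2)}
    - \tfrac{1}{2}(x-\mu_1)^\top \Sigma^{-1}(x-\mu_1)
    + \tfrac{1}{2}(x-\mu_2)^\top \Sigma^{-1}(x-\mu_2)
  = x^\top \Sigma^{-1}(\mu_1-\mu_2)
    - \tfrac{1}{2}\mu_1^\top \Sigma^{-1}\mu_1
    + \tfrac{1}{2}\mu_2^\top \Sigma^{-1}\mu_2
    + \log\frac{p(y=1)}{p(y=2)}
```

The x^T Σ⁻¹ x terms cancel because Σ is shared, leaving an expression linear in x; with distinct Σ_1 and Σ_2 (QDA) they do not cancel and the boundary stays quadratic.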
Alternative Interpretation of LDA

LDA is equivalent to comparing the Mahalanobis distances of x to the two class means.
How to Train

MLE training:

- Estimate the Bernoulli parameters for Y using MLE.
- For each class, estimate the MLE parameters of the multivariate normal distribution, i.e., µ_1, Σ_1 and µ_2, Σ_2.
- For LDA, compute the MLE for Σ using all training data (ignoring the class label).
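The training recipe above can be sketched as follows; per-class MLE mean and covariance for QDA, and for LDA a single covariance estimated from all training data (ignoring the class labels, as stated on the slide).

```python
import numpy as np

# X is an N x D real matrix, y holds class labels.
# Returns class priors, per-class means, and per-class covariances.
def fit_gda(X, y, shared_cov=False):
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    priors, means, covs = {}, {}, {}
    for c in np.unique(y):
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)                   # Bernoulli MLE for Y
        means[c] = Xc.mean(axis=0)                     # per-class mean
        covs[c] = np.cov(Xc, rowvar=False, bias=True)  # per-class MLE covariance
    if shared_cov:
        # LDA: one shared covariance from all the data
        pooled = np.cov(X, rowvar=False, bias=True)
        covs = {c: pooled for c in covs}
    return priors, means, covs
```

`bias=True` gives the MLE (divide by N) rather than the unbiased (divide by N − 1) estimate.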