
Introduction to Machine Learning: Bayesian Classification
Varun Chandola, Computer Science & Engineering, State University of New York at Buffalo, Buffalo, NY, USA (chandola@buffalo.edu)
CSE 474/574

Outline
- Learning Probabilistic Classifiers
  - Treating Output Label Y as a Random Variable
  - Computing Posterior for Y
  - Computing Class Conditional Probabilities
- Naive Bayes Classification
  - Naive Bayes Assumption
  - Maximizing Likelihood
  - Maximum Likelihood Estimates
  - Adding Prior
  - Using Naive Bayes Model for Prediction
  - Naive Bayes Example
- Gaussian Discriminant Analysis
  - Moving to Continuous Data
  - Quadratic and Linear Discriminant Analysis
  - Training a QDA or LDA Classifier

Learning Probabilistic Classifiers

Training data, $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$:
1. {circular, large, light, smooth, thick}, malignant
2. {circular, large, light, irregular, thick}, malignant
3. {oval, large, dark, smooth, thin}, benign
4. {oval, large, light, irregular, thick}, malignant
5. {circular, small, light, smooth, thick}, benign

Testing: predict $y^*$ for a new input $\mathbf{x}^*$.
- Option 1: Functional approximation, $y^* = f(\mathbf{x}^*)$
- Option 2: Probabilistic classifier, $P(Y = \text{benign} \mid X = \mathbf{x}^*)$ and $P(Y = \text{malignant} \mid X = \mathbf{x}^*)$

Applying Bayes Rule

Training data (same five examples as above):
1. {circular, large, light, smooth, thick}, malignant
2. {circular, large, light, irregular, thick}, malignant
3. {oval, large, dark, smooth, thin}, benign
4. {oval, large, light, irregular, thick}, malignant
5. {circular, small, light, smooth, thick}, benign

Test input: $\mathbf{x}^* = \{\text{circular}, \text{small}, \text{light}, \text{irregular}, \text{thin}\}$
- What is $P(Y = \text{benign} \mid \mathbf{x}^*)$?
- What is $P(Y = \text{malignant} \mid \mathbf{x}^*)$?

Output Label: A Discrete Random Variable

- Y takes two values (Class 1: malignant, Class 2: benign).
- What is p(Y)? A Bernoulli distribution: $Y \sim \text{Ber}(\theta)$.
- How do you estimate θ? Treat the labels in the training data as binary samples (done that last week!).
- Posterior mean for θ:
  $\mathbb{E}[\theta \mid \mathcal{D}] = \frac{\alpha_0 + N_1}{\alpha_0 + \beta_0 + N}$
- Can we just use $p(Y \mid \theta)$ for predicting future labels? No; it ignores the input x and only serves as a prior for Y.
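As a quick sanity check of this formula on the five training examples above, and assuming a uniform Beta(1, 1) prior (the hyperparameter values are an assumption here, not from the slides): with $N_1 = 3$ malignant labels out of $N = 5$,
$\mathbb{E}[\theta \mid \mathcal{D}] = \frac{1 + 3}{1 + 1 + 5} = \frac{4}{7} \approx 0.57.$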

Computing Posterior for Y

What is the probability that $\mathbf{x}^*$ is malignant? We need:
- $P(X = \mathbf{x}^* \mid Y = \text{malignant})$
- $P(Y = \text{malignant})$
- and from these, $P(Y = \text{malignant} \mid X = \mathbf{x}^*)$, via Bayes rule:
$P(Y = \text{malignant} \mid X = \mathbf{x}^*) = \frac{P(X = \mathbf{x}^* \mid Y = \text{malignant})\, P(Y = \text{malignant})}{P(X = \mathbf{x}^* \mid Y = \text{malignant})\, P(Y = \text{malignant}) + P(X = \mathbf{x}^* \mid Y = \text{benign})\, P(Y = \text{benign})}$
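A minimal sketch of this posterior computation in Python, assuming the two class-conditional likelihoods and the class prior have already been estimated (the function and argument names below are illustrative, not from the slides):

```python
def posterior_malignant(lik_malignant, lik_benign, prior_malignant):
    """Bayes rule for a two-class problem.

    lik_malignant   = P(X = x* | Y = malignant)
    lik_benign      = P(X = x* | Y = benign)
    prior_malignant = P(Y = malignant); P(Y = benign) = 1 - prior_malignant
    """
    num = lik_malignant * prior_malignant
    den = num + lik_benign * (1.0 - prior_malignant)
    return num / den

# Example: equal priors, likelihood three times higher under "benign"
print(posterior_malignant(0.2, 0.6, 0.5))  # 0.25
```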

What is $P(X = \mathbf{x}^* \mid Y = \text{malignant})$?

- This is the class conditional probability of the random variable X.
- Step 1: Assume a probability distribution for X, $p(X)$.
- Step 2: Learn its parameters from the training data.
- But X is a multivariate discrete random variable! How many parameters are needed?
  $2(2^D - 1)$: each class-conditional distribution over D binary features has $2^D - 1$ free parameters (one per joint configuration, minus one for normalization), and there are two classes.
- How much training data is needed?

Naive Bayes Assumption

- All features are conditionally independent given the class.
- Each feature can be modeled as a Bernoulli random variable:
  $P(X = \mathbf{x} \mid Y = \text{malignant}) = \prod_{j=1}^{D} p(x_j \mid Y = \text{malignant})$
  $P(X = \mathbf{x} \mid Y = \text{benign}) = \prod_{j=1}^{D} p(x_j \mid Y = \text{benign})$
- (Graphical model: y is the parent of $x_1, x_2, \ldots, x_D$.)
- Only 2D parameters are needed.
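A minimal sketch of this factorization, assuming per-feature Bernoulli parameters $\theta_{cj} = p(x_j = 1 \mid Y = c)$ are already available (the names are illustrative):

```python
import numpy as np

def class_conditional(x, theta_c):
    """P(X = x | Y = c) under the naive Bayes / Bernoulli assumption.

    x       : length-D array of 0/1 feature values
    theta_c : length-D array, theta_c[j] = p(x_j = 1 | Y = c)
    """
    x = np.asarray(x)
    theta_c = np.asarray(theta_c)
    return np.prod(np.where(x == 1, theta_c, 1.0 - theta_c))
```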

Example: Only Binary Features. Training a Naive Bayes Classifier

- Find the parameters that maximize the likelihood of the training data.
- What is a training example? Not just $\mathbf{x}_i$, but the pair $(\mathbf{x}_i, y_i)$.
- What are the parameters? θ for Y (the class prior) and the feature parameters $\theta_{\text{benign}}$ and $\theta_{\text{malignant}}$ (or $\theta_1$ and $\theta_2$).
- Joint probability distribution of (X, Y):
  $p(\mathbf{x}_i, y_i) = p(y_i \mid \theta)\, p(\mathbf{x}_i \mid y_i) = p(y_i \mid \theta) \prod_j p(x_{ij} \mid \theta_{y_i j})$

Likelihood

Likelihood of $\mathcal{D}$:
$l(\mathcal{D} \mid \Theta) = \prod_i p(y_i \mid \theta) \prod_j p(x_{ij} \mid \theta_{y_i j})$

Log-likelihood of $\mathcal{D}$:
$ll(\mathcal{D} \mid \Theta) = N_1 \log\theta + N_2 \log(1 - \theta) + \sum_j \left[ N_{1j}\log\theta_{1j} + (N_1 - N_{1j})\log(1 - \theta_{1j}) \right] + \sum_j \left[ N_{2j}\log\theta_{2j} + (N_2 - N_{2j})\log(1 - \theta_{2j}) \right]$

where $N_1$ is the number of malignant training examples, $N_2$ the number of benign training examples, $N_{1j}$ the number of malignant training examples with $x_j = 1$, and $N_{2j}$ the number of benign training examples with $x_j = 1$.
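The maximum likelihood estimates on the next slide follow by setting the partial derivatives of this log-likelihood to zero; for the class prior θ, for instance,
$\frac{\partial\, ll}{\partial \theta} = \frac{N_1}{\theta} - \frac{N_2}{1 - \theta} = 0 \;\Rightarrow\; \hat\theta = \frac{N_1}{N_1 + N_2} = \frac{N_1}{N},$
and the same argument applied to each $\theta_{cj}$ gives $\hat\theta_{cj} = N_{cj}/N_c$.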

Maximum Likelihood Estimates

- Maximize with respect to θ, assuming Y to be Bernoulli:
  $\hat\theta_c = \frac{N_c}{N}$
- Assuming each feature is binary, $x_j \mid (y = c) \sim \text{Bernoulli}(\theta_{cj})$, $c \in \{1, 2\}$:
  $\hat\theta_{cj} = \frac{N_{cj}}{N_c}$

Algorithm 1: Naive Bayes Training for Binary Features
  $N_c \leftarrow 0$, $N_{cj} \leftarrow 0$ for all c, j
  for i = 1 to N:
      $c \leftarrow y_i$
      $N_c \leftarrow N_c + 1$
      for j = 1 to D:
          if $x_{ij} = 1$: $N_{cj} \leftarrow N_{cj} + 1$
  $\hat\theta_c \leftarrow N_c / N$, $\hat\theta_{cj} \leftarrow N_{cj} / N_c$
  return $\hat\theta_c$, $\hat\theta_{cj}$ for all c, j
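A runnable sketch of Algorithm 1, assuming the training data is given as a 0/1 NumPy matrix X of shape (N, D) and a label vector y with values in {0, 1} (the integer labels stand in for the benign/malignant classes):

```python
import numpy as np

def train_naive_bayes(X, y):
    """MLE for a Bernoulli naive Bayes model with binary features.

    X : (N, D) array of 0/1 feature values
    y : (N,) array of class labels in {0, 1}
    Returns (theta_c, theta_cj) with theta_c[c] = N_c / N and
    theta_cj[c, j] = N_cj / N_c.
    """
    N, D = X.shape
    theta_c = np.zeros(2)
    theta_cj = np.zeros((2, D))
    for c in (0, 1):
        Xc = X[y == c]
        Nc = Xc.shape[0]
        theta_c[c] = Nc / N                      # class prior estimate
        theta_cj[c] = Xc.sum(axis=0) / Nc        # per-feature estimate N_cj / N_c
    return theta_c, theta_cj
```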

Adding a Prior

- Add a prior to θ and to each $\theta_{cj}$:
  - Beta prior for θ: $\theta \sim \text{Beta}(a_0, b_0)$
  - Beta prior for $\theta_{cj}$: $\theta_{cj} \sim \text{Beta}(a, b)$
- Posterior estimates:
  $p(\theta \mid \mathcal{D}) = \text{Beta}(N_1 + a_0,\; N - N_1 + b_0)$
  $p(\theta_{cj} \mid \mathcal{D}) = \text{Beta}(N_{cj} + a,\; N_c - N_{cj} + b)$

Using the Naive Bayes Model for Prediction

$p(y = c \mid \mathbf{x}, \mathcal{D}) \propto p(y = c \mid \mathcal{D}) \prod_j p(x_j \mid y = c, \mathcal{D})$

MLE approach, MAP approach? Bayesian approach:
$p(y = 1 \mid \mathbf{x}, \mathcal{D}) \propto \left[\int \text{Ber}(y = 1 \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta\right] \prod_j \left[\int \text{Ber}(x_j \mid \theta_{1j})\, p(\theta_{1j} \mid \mathcal{D})\, d\theta_{1j}\right]$

which amounts to plugging in the posterior means
$\bar\theta = \frac{N_1 + a_0}{N + a_0 + b_0}, \qquad \bar\theta_{cj} = \frac{N_{cj} + a}{N_c + a + b}$
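A sketch of this prediction step in Python, using the posterior-mean (smoothed) parameters; the hyperparameter defaults a0 = b0 = a = b = 1 (a uniform Beta prior) are an assumption here, and the function name is illustrative:

```python
import numpy as np

def predict_naive_bayes(x, X, y, a0=1.0, b0=1.0, a=1.0, b=1.0):
    """Posterior p(y = 1 | x, D) for a Bernoulli naive Bayes model,
    using posterior-mean parameter estimates under Beta priors.

    x : (D,) binary test vector; X : (N, D) binary training data; y : (N,) labels in {0, 1}
    """
    x = np.asarray(x)
    N, D = X.shape
    N1 = int((y == 1).sum())
    prior1 = (N1 + a0) / (N + a0 + b0)          # posterior mean of p(y = 1)
    priors = {0: 1.0 - prior1, 1: prior1}
    log_post = {}
    for c in (0, 1):
        Xc = X[y == c]
        Nc = Xc.shape[0]
        theta_cj = (Xc.sum(axis=0) + a) / (Nc + a + b)   # posterior mean of p(x_j = 1 | y = c)
        log_post[c] = np.log(priors[c]) + np.sum(
            np.where(x == 1, np.log(theta_cj), np.log(1.0 - theta_cj)))
    m = max(log_post.values())
    p0, p1 = np.exp(log_post[0] - m), np.exp(log_post[1] - m)
    return p1 / (p0 + p1)
```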

Naive Bayes Example

 #   Shape  Size   Color  Type
 1   cir    large  light  malignant
 2   cir    large  light  benign
 3   cir    large  light  malignant
 4   ovl    large  light  benign
 5   ovl    large  dark   malignant
 6   ovl    small  dark   benign
 7   ovl    small  dark   malignant
 8   ovl    small  light  benign
 9   cir    small  dark   benign
10   cir    large  dark   malignant

Test example: $\mathbf{x}^* = \{\text{cir}, \text{small}, \text{light}\}$
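One possible worked computation on this table, using the plain MLE counts (no prior). The class prior is $P(\text{malignant}) = 5/10$, and the per-feature estimates relevant to $\mathbf{x}^*$ are
$P(\text{cir}\mid\text{mal}) = \tfrac{3}{5},\; P(\text{small}\mid\text{mal}) = \tfrac{1}{5},\; P(\text{light}\mid\text{mal}) = \tfrac{2}{5}; \qquad P(\text{cir}\mid\text{ben}) = \tfrac{2}{5},\; P(\text{small}\mid\text{ben}) = \tfrac{3}{5},\; P(\text{light}\mid\text{ben}) = \tfrac{3}{5}.$
The unnormalized scores are $0.5 \cdot \tfrac{3}{5}\cdot\tfrac{1}{5}\cdot\tfrac{2}{5} = 0.024$ for malignant and $0.5 \cdot \tfrac{2}{5}\cdot\tfrac{3}{5}\cdot\tfrac{3}{5} = 0.072$ for benign, giving $P(\text{malignant}\mid \mathbf{x}^*) = 0.024/(0.024+0.072) = 0.25$.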

What if Attributes are Continuous?

- Naive Bayes is still applicable! Model each feature as a univariate Gaussian (normal) distribution:
  $p(y \mid \mathbf{x}) \propto p(y) \prod_j p(x_j \mid y) = p(y) \prod_j \frac{1}{\sqrt{2\pi\sigma_j^2}}\, e^{-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}} = p(y)\, \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})}$
- where Σ is a diagonal matrix with $\sigma_1^2, \sigma_2^2, \ldots, \sigma_D^2$ as the diagonal entries and μ is the vector of means.
- This treats x as a multivariate Gaussian with zero covariance between features.
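A minimal sketch of the diagonal-covariance (Gaussian naive Bayes) class score, assuming per-class means mu_c and variances var_c have been estimated elsewhere (the names are illustrative):

```python
import numpy as np

def gaussian_nb_log_score(x, prior_c, mu_c, var_c):
    """log [ p(y = c) * prod_j N(x_j; mu_cj, var_cj) ] for one class c.

    x, mu_c, var_c : length-D arrays; prior_c : scalar p(y = c)
    """
    x, mu_c, var_c = map(np.asarray, (x, mu_c, var_c))
    log_lik = -0.5 * np.sum(np.log(2 * np.pi * var_c) + (x - mu_c) ** 2 / var_c)
    return np.log(prior_c) + log_lik
```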

What if Σ is Not Diagonal? Gaussian Discriminant Analysis

- Class conditional densities:
  $p(\mathbf{x} \mid y = 1) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_1, \Sigma_1), \qquad p(\mathbf{x} \mid y = 2) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_2, \Sigma_2)$
- Posterior density for y:
  $p(y = 1 \mid \mathbf{x}) = \frac{p(y = 1)\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_1, \Sigma_1)}{p(y = 1)\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_1, \Sigma_1) + p(y = 2)\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_2, \Sigma_2)}$
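A sketch of this posterior with full covariance matrices, assuming the class prior, means, and covariances have already been estimated (a fitting sketch appears after the training slide below); the helper names are illustrative:

```python
import numpy as np

def mvn_logpdf(x, mu, Sigma):
    """Log density of a multivariate normal N(x; mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(Sigma, diff))

def gda_posterior_class1(x, prior1, mu1, Sigma1, mu2, Sigma2):
    """p(y = 1 | x) for a two-class Gaussian discriminant model."""
    a = np.log(prior1) + mvn_logpdf(x, mu1, Sigma1)
    b = np.log(1.0 - prior1) + mvn_logpdf(x, mu2, Sigma2)
    m = max(a, b)                      # subtract max for numerical stability
    return np.exp(a - m) / (np.exp(a - m) + np.exp(b - m))
```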

Quadratic and Linear Discriminant Analysis

- Using a separate, non-diagonal covariance matrix for each class gives Quadratic Discriminant Analysis (QDA): the decision boundary is quadratic in x.
- If $\Sigma_1 = \Sigma_2 = \Sigma$ (parameter sharing or tying), we get Linear Discriminant Analysis (LDA): the quadratic terms cancel and the decision surface is linear, as sketched below.
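A short sketch of why tying the covariances makes the boundary linear: the log posterior ratio of the two classes is
$\log\frac{p(y=1\mid\mathbf{x})}{p(y=2\mid\mathbf{x})} = \log\frac{p(y=1)}{p(y=2)} - \tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_1)^\top\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_1) + \tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_2)^\top\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_2),$
and with the same Σ in both quadratic forms the $\mathbf{x}^\top\Sigma^{-1}\mathbf{x}$ terms cancel, leaving $(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)^\top\Sigma^{-1}\mathbf{x} + \text{const}$, which is linear in x.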

Alternative Interpretation of LDA

With equal class priors, LDA is equivalent to computing the Mahalanobis distance of x to the two class means and assigning x to the closer one.
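Concretely, the rule can be written with a log-prior correction for unequal priors: assign x to class 1 when
$(\mathbf{x}-\boldsymbol{\mu}_1)^\top\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_1) - 2\log p(y=1) \;<\; (\mathbf{x}-\boldsymbol{\mu}_2)^\top\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_2) - 2\log p(y=2),$
i.e., compare prior-corrected squared Mahalanobis distances to the two class means.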

How to Train: MLE Training

- Estimate the Bernoulli parameter for Y using MLE.
- For each class, estimate the MLE parameters of the multivariate normal distribution, i.e., $\boldsymbol{\mu}_1, \Sigma_1$ and $\boldsymbol{\mu}_2, \Sigma_2$ (QDA).
- For LDA, compute the MLE for a single shared Σ by pooling all the training data (in the standard pooled estimate, each example is centered at its own class mean).
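A minimal fitting sketch along these lines, assuming labels in {0, 1}; it returns per-class means and covariances (for QDA) plus a pooled within-class covariance (for LDA). The function name is illustrative:

```python
import numpy as np

def fit_gda(X, y):
    """MLE for a two-class Gaussian discriminant model.

    X : (N, D) real-valued features; y : (N,) labels in {0, 1}
    Returns class priors, per-class means/covariances (QDA), and a
    pooled within-class covariance (LDA).
    """
    N, D = X.shape
    priors, means, covs = np.zeros(2), np.zeros((2, D)), np.zeros((2, D, D))
    pooled = np.zeros((D, D))
    for c in (0, 1):
        Xc = X[y == c]
        Nc = Xc.shape[0]
        priors[c] = Nc / N
        means[c] = Xc.mean(axis=0)
        centered = Xc - means[c]
        covs[c] = centered.T @ centered / Nc     # per-class MLE covariance
        pooled += centered.T @ centered          # accumulate for shared estimate
    pooled /= N
    return priors, means, covs, pooled
```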
