CS789: Machine Learning and Neural Network. Bayesian Learning: Bayes Rule.


Transcription:

Bayes Rule

P(Y|X) = P(X|Y) P(Y) / P(X)

Jakramate Bootkrajang, Department of Computer Science, Chiang Mai University

P(Y): prior belief, prior probability, or simply prior. Probability of observing class Y.
P(X|Y): likelihood. (Relative) probability of seeing X in class Y.
P(X) = Σ_Y P(X|Y) P(Y): data evidence.
P(Y|X): a posteriori probability. Probability of class Y after having seen the data X.

What will we learn in this lecture?

We will learn about Bayes rule.
We will construct various classifiers using Bayes rule.

A Side Note on Probability

A likelihood function L(θ|X) is a function describing the probability of a parameter given an outcome. For a dataset S = (x_1, ..., x_m) the likelihood of parameter θ is given by

L(θ|X) = P(X = x_1|θ) · P(X = x_2|θ) · ... · P(X = x_m|θ)
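As a quick illustration of the likelihood function above, the sketch below evaluates L(θ|X) for a Bernoulli (coin-flip) model at a few candidate values of θ. The coin data and the candidate θ values are made up for illustration and are not taken from the slides.

    import numpy as np

    # Hypothetical coin-flip data: 1 = heads, 0 = tails (illustrative only).
    x = np.array([1, 0, 1, 1, 0, 1, 1, 1])

    def likelihood(theta, data):
        """L(theta | X) = product of P(X = x_i | theta) under a Bernoulli model."""
        return np.prod(theta ** data * (1 - theta) ** (1 - data))

    for theta in [0.25, 0.5, 0.75]:
        print(f"L({theta} | X) = {likelihood(theta, x):.6f}")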

A Side Note on Probability

Suppose we have two dice h_1 and h_2. Say, h_1 is fair but h_2 is biased.
The probability of getting i given the h_1 die is called the conditional probability, denoted by P(i|h_1).
Pick a die at random with P(h_j), j = 1, 2. The probability of picking the h_j die and getting an i with that die is called the joint probability, and is P(i, h_j) = P(h_j) P(i|h_j).
The so-called marginal probability is P(i) = Σ_{h_j} P(i, h_j).

Bayes decision rule

Consider a binary classification problem with Y ∈ {−1, 1}. We can construct a classifier with minimal probability of error if we define

Definition (Bayes decision rule)
h*(x) = 1 if P(Y = 1|X = x) > 1/2, and −1 otherwise.   (1)

Theorem
For any classifier h: X → {−1, 1},
P(h*(X) ≠ Y) ≤ P(h(X) ≠ Y),   (2)
that is, h* is the optimal classifier.

Building a classifier using Bayes rule

Goal: calculate the probability of seeing label Y when X is presented,
P(Y|X) = P(X|Y) P(Y) / P(X)
How?
Find P(X|Y)
Find P(Y)
Find P(X)

Building a classifier using Bayes rule

Ways to obtain P(X|Y):
P(X|Y) can be given.
P(X|Y) can be modelled using a discrete probability distribution. We can count the number of times X occurs to estimate its likelihood given Y.
P(X|Y) can be modelled using a continuous probability distribution. We estimate the parameters of the distribution.
Way to obtain P(Y): find the ratio #Y / N.
Way to obtain P(X): find the marginal probability Σ_Y P(X|Y) P(Y).
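A minimal sketch of the joint, marginal, and posterior relationships from the dice example above. The exact bias of h_2 is not given on the slide, so the numbers below are an assumption for illustration.

    # Two dice: h1 fair, h2 biased (bias chosen arbitrarily for illustration).
    P_h = {"h1": 0.5, "h2": 0.5}                      # P(h_j): prior over dice
    P_i_given_h = {
        "h1": {i: 1 / 6 for i in range(1, 7)},        # fair die
        "h2": {1: 0.05, 2: 0.05, 3: 0.1, 4: 0.1, 5: 0.2, 6: 0.5},  # assumed bias
    }

    i = 6
    # Joint: P(i, h_j) = P(h_j) * P(i | h_j)
    joint = {h: P_h[h] * P_i_given_h[h][i] for h in P_h}
    # Marginal: P(i) = sum over j of P(i, h_j)
    marginal = sum(joint.values())
    # Bayes rule: P(h_j | i) = P(i | h_j) P(h_j) / P(i)
    posterior = {h: joint[h] / marginal for h in joint}
    print(posterior)   # e.g. P(h2 | i=6) > P(h1 | i=6)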

Two philosophies of estimating P(Y|X)

Maximum Likelihood: assume equal priors.
h_ML(X) = argmax_y P(Y = y|X) = argmax_y P(X|Y = y) · 0.5 / P(X)
Often used when we have very little idea about the data.
Maximum a Posteriori: consider the priors.
h_MAP(X) = argmax_y P(Y = y|X) = argmax_y P(X|Y = y) P(Y = y) / P(X)
Generally gives better performance if we have the priors.

A word about the Bayesian Framework

Allows us to combine observed data and prior knowledge.
Provides practical learning algorithms.
It is a generative approach, which offers a useful conceptual framework.
This means that any kind of object (e.g. time series, trees, etc.) can be classified, based on a probabilistic model specification.

Case 1: P(X|Y) is given

Does a patient have cancer or not?
A patient takes a lab test and the result comes back positive. It is known that the test returns a correct positive result in only 98% of the cases and a correct negative result in only 97% of the cases. Furthermore, only 0.008 of the entire population has this disease.

Working out the variables

Y = {cancer, ¬cancer}, X = {positive, negative}
To decide whether the patient has cancer we have to calculate:
The posterior probability that the patient has cancer, P(Y = cancer|X = positive).
The posterior probability that the patient does not have cancer, P(Y = ¬cancer|X = positive).
According to the Bayes decision rule, we pick the y which gives P(Y = y|X = x) > 0.5.
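The two philosophies can be compared directly on the cancer-test example. The sketch below plugs in the numbers given on the slide (correct positive rate 0.98, correct negative rate 0.97, prevalence 0.008) and computes the posterior under both the equal-prior (ML) and true-prior (MAP) assumptions.

    p_pos_given_cancer = 0.98        # correct positive rate (from the slide)
    p_neg_given_healthy = 0.97       # correct negative rate (from the slide)
    p_pos_given_healthy = 1 - p_neg_given_healthy

    def posterior_cancer(prior_cancer):
        """P(cancer | positive) under a given prior P(cancer)."""
        prior_healthy = 1 - prior_cancer
        evidence = (p_pos_given_cancer * prior_cancer
                    + p_pos_given_healthy * prior_healthy)   # P(positive)
        return p_pos_given_cancer * prior_cancer / evidence

    print("ML  (equal priors):", posterior_cancer(0.5))      # ~0.970 -> predicts cancer
    print("MAP (prior 0.008): ", posterior_cancer(0.008))    # ~0.209 -> predicts no cancer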

ML Solution

1. The posterior probability of having cancer:
P(Y = cancer|X = positive) = P(X = positive|Y = cancer) · 0.5 / P(X = positive)
P(X = positive) = P(X = positive|Y = cancer) P(cancer) + P(X = positive|Y = ¬cancer) P(¬cancer) = .........
2. The posterior probability of being healthy:
P(Y = ¬cancer|X = positive) = P(X = positive|Y = ¬cancer) · 0.5 / P(X = positive) = .........
3. Diagnosis??

MAP Solution

1. The posterior probability of having cancer:
P(Y = cancer|X = positive) = P(X = positive|Y = cancer) · 0.008 / P(X = positive)
P(X = positive) = P(X = positive|Y = cancer) P(cancer) + P(X = positive|Y = ¬cancer) P(¬cancer) = .........
2. The posterior probability of being healthy:
P(Y = ¬cancer|X = positive) = P(X = positive|Y = ¬cancer) · 0.992 / P(X = positive) = .........
3. Diagnosis??

Case 2: Discrete P(X|Y)

Assume we have a set of data which classifies dog friendliness based on its colour.
Y = {Aggressive, Friendly}, X = {White, BRown, Black}
The dataset: (W,F), (BR,F), (W,A), (B,F), (B,F), (BR,F), (W,A)
If we see a new white dog, would it be friendly?

ML Solution

1. The posterior probability of being friendly:
P(Y = friendly|X = white) = P(X = white|Y = friendly) · 0.5 / P(X = white)
P(X = white) = P(X = white|Y = friendly) P(Y = friendly) + P(X = white|Y = aggressive) P(Y = aggressive) = .........
2. The posterior probability of being aggressive:
P(Y = aggressive|X = white) = P(X = white|Y = aggressive) · 0.5 / P(X = white) = .........
3. Diagnosis??
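For the discrete Case 2, the class-conditional probabilities are just counts. The sketch below estimates them from the seven-dog dataset above and applies both the ML (equal priors) and MAP (empirical priors) rules to a new white dog.

    from collections import Counter

    # (colour, label) pairs from the slide: W = white, BR = brown, B = black;
    # F = friendly, A = aggressive.
    data = [("W", "F"), ("BR", "F"), ("W", "A"), ("B", "F"),
            ("B", "F"), ("BR", "F"), ("W", "A")]

    labels = Counter(y for _, y in data)                 # {'F': 5, 'A': 2}
    colour_given = {y: Counter(x for x, yy in data if yy == y) for y in labels}

    def lik(colour, y):
        """P(colour | y) estimated by counting."""
        return colour_given[y][colour] / labels[y]

    x_new = "W"
    for name, prior in [("ML ", {"F": 0.5, "A": 0.5}),
                        ("MAP", {y: n / len(data) for y, n in labels.items()})]:
        scores = {y: lik(x_new, y) * prior[y] for y in labels}
        print(name, max(scores, key=scores.get), scores)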

The Naive Bayes Classifier (1/2)

What if our example has several attributes? x = {a_1, a_2, ..., a_m}
The problem is that P(X, Y) = P(Y) P(X|Y) factorises into a long sequence. By the chain rule,
P(X, Y) = P(Y) P(a_1, ..., a_m|Y)
        = P(Y) P(a_1|Y) P(a_2, ..., a_m|Y, a_1)
        = P(Y) P(a_1|Y) P(a_2|Y, a_1) P(a_3, ..., a_m|Y, a_1, a_2)
        = ...
The naive assumption is that each feature a_i is conditionally independent of every other feature a_j, for j ≠ i.

The Naive Bayes Classifier (2/2)

So we have P(a_i|Y, a_j, ...) = P(a_i|Y) and so on, which gives
P(X|Y) = P(a_1, ..., a_m|Y) = Π_i P(a_i|Y)
A Bayesian classifier that uses the naive assumption is called the Naive Bayes classifier, one of the most practical methods, widely used in medical applications and text classification.
Classify any new data point x = (a_1, ..., a_m) as
h_naive(X) = argmax_Y P(Y) P(X|Y) = argmax_Y P(Y) Π_i P(a_i|Y)
We need to estimate the parameters from the training examples:
For each class y: P̂(Y = y)
For each feature a_i: P̂(a_i|Y)

Example of Naive classifier: Playing Tennis

[Table of PlayTennis training examples on the original slide; not reproduced in this transcription.]

Naive Bayes Solution

Based on the examples in the table, classify the following x:
x = {sunny, cool, high, strong}. Play tennis or not?
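A compact sketch of a categorical Naive Bayes classifier following h_naive above. The tiny weather dataset at the bottom is a made-up stand-in for the PlayTennis table (which is not reproduced in this transcription), so the numbers are illustrative only.

    from collections import Counter, defaultdict

    class CategoricalNB:
        """Naive Bayes for categorical attributes: h(x) = argmax_y P(y) * prod_i P(a_i | y)."""

        def fit(self, X, y):
            self.class_counts = Counter(y)
            self.n = len(y)
            # feature_counts[j][c][v] = training rows of class c with value v in column j
            self.feature_counts = defaultdict(lambda: defaultdict(Counter))
            for row, c in zip(X, y):
                for j, v in enumerate(row):
                    self.feature_counts[j][c][v] += 1
            return self

        def predict(self, row):
            scores = {}
            for c, nc in self.class_counts.items():
                score = nc / self.n                                  # P(Y = c)
                for j, v in enumerate(row):
                    score *= self.feature_counts[j][c][v] / nc       # P(a_j = v | Y = c)
                scores[c] = score
            return max(scores, key=scores.get)

    # Illustrative rows (outlook, temperature, humidity, wind) -> play?; not the slide's table.
    X = [("sunny", "hot", "high", "weak"), ("overcast", "hot", "high", "weak"),
         ("rain", "mild", "high", "weak"), ("sunny", "cool", "normal", "strong")]
    y = ["no", "yes", "yes", "yes"]
    print(CategoricalNB().fit(X, y).predict(("sunny", "cool", "high", "strong")))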

The working

h_naive = argmax_{y ∈ {yes, no}} P(Y = y) P(X = x|Y = y)
        = argmax_{y ∈ {yes, no}} P(Y = y) Π_i P(a_i|Y = y)
        = argmax_{y ∈ {yes, no}} P(Y = y) P(sunny|Y = y) P(cool|Y = y) P(high|Y = y) P(strong|Y = y)
Now find:
P(Y = yes) = 9/14 = 0.64
P(sunny|Y = yes) = 2/9 = 0.22
P(cool|Y = yes) = 3/9 = 0.33
P(high|Y = yes) = ......
P(strong|Y = yes) = ......
and so on...

Exercise

Assume we have a data set described by the following three variables:
Hair = B, D, where B = blonde, D = dark.
Height = T, S, where T = tall, S = short.
Country = G, P, where G = Greenland, P = Poland.
You are given the following training data set (Hair, Height, Country):
(B,T,G), (D,T,G), (D,T,G), (D,T,G), (B,T,G), (B,S,G), (B,S,G), (D,S,G),
(B,T,G), (D,T,G), (D,T,G), (D,T,G), (B,T,G), (B,S,G), (B,S,G), (D,S,G),
(B,T,P), (B,T,P), (B,T,P), (D,T,P), (D,T,P), (D,S,P), (B,S,P), (D,S,P).
Now, suppose you observe a new individual who is tall with blond hair, and you want to use these training data to determine the most likely country of origin.
Compute the maximum a posteriori (MAP) answer to the above question, using the Naive Bayes assumption. (A code sketch of this computation appears below.)

Learning to classify text

The attributes (features) are the words.
NB classifiers are one of the most effective for this task.

Representation of text: bag of words

I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet.
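One way to check the exercise above is to run the Naive Bayes MAP computation directly over the 24 training triples. The sketch below counts the per-country frequencies of hair and height and scores the new (blonde, tall) individual.

    from collections import Counter

    # (hair, height, country) triples copied from the exercise above.
    data = [
        ("B","T","G"), ("D","T","G"), ("D","T","G"), ("D","T","G"),
        ("B","T","G"), ("B","S","G"), ("B","S","G"), ("D","S","G"),
        ("B","T","G"), ("D","T","G"), ("D","T","G"), ("D","T","G"),
        ("B","T","G"), ("B","S","G"), ("B","S","G"), ("D","S","G"),
        ("B","T","P"), ("B","T","P"), ("B","T","P"), ("D","T","P"),
        ("D","T","P"), ("D","S","P"), ("B","S","P"), ("D","S","P"),
    ]

    countries = Counter(c for _, _, c in data)
    hair_c = {c: Counter(h for h, _, cc in data if cc == c) for c in countries}
    height_c = {c: Counter(t for _, t, cc in data if cc == c) for c in countries}

    new_hair, new_height = "B", "T"    # blonde and tall
    scores = {}
    for c, n in countries.items():
        prior = n / len(data)                                            # P(Country = c)
        lik = (hair_c[c][new_hair] / n) * (height_c[c][new_height] / n)  # naive assumption
        scores[c] = prior * lik
    print(scores, "->", max(scores, key=scores.get))
    # Here the likelihood terms happen to tie, so the prior decides the MAP answer.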

Representation of text: bag of words

Predefine a vocabulary set V and highlight the words w ∈ V.
I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet.

Representation of text: bag of words

great      2
love       2
recommend  1
laugh      1
terrible   0
happy      1
sad        0
...

Parameter estimation

Simply use the frequencies in the data (Case 2).
Class prior probability: P(Y = y_j) = count(Y = y_j) / m
Likelihood: P(w_i|Y = y_j) = count(w_i, Y = y_j) / Σ_{w ∈ V} count(w, Y = y_j)
The problem with the above is that if no training document of class positive contains the word "fantastic", then P(fantastic|positive) = 0, so P(positive|X_new) will always be zero.

Laplace smoothing for Naive Bayes

Pretend that you have seen each of the words in V at least α times:
P(w_i|Y) = (count(w_i, Y) + α) / Σ_{w ∈ V} (count(w, Y) + α)
         = (count(w_i, Y) + α) / (Σ_{w ∈ V} count(w, Y) + α|V|)
Here α is called the smoothing parameter (aka a hyper-parameter), which is often tuned using cross-validation.
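A minimal sketch of a bag-of-words Naive Bayes with add-α (Laplace) smoothing as defined above, using log probabilities to avoid numerical underflow. The two tiny training documents are made up for illustration.

    import math
    from collections import Counter

    def train_nb(docs, alpha=1.0):
        """docs: list of (list_of_words, label). Returns priors, smoothed log-likelihoods, vocab."""
        vocab = {w for words, _ in docs for w in words}
        labels = Counter(lab for _, lab in docs)
        word_counts = {lab: Counter() for lab in labels}
        for words, lab in docs:
            word_counts[lab].update(words)
        log_prior = {lab: math.log(n / len(docs)) for lab, n in labels.items()}
        log_lik = {}
        for lab in labels:
            total = sum(word_counts[lab].values()) + alpha * len(vocab)
            log_lik[lab] = {w: math.log((word_counts[lab][w] + alpha) / total) for w in vocab}
        return log_prior, log_lik, vocab

    def classify(words, log_prior, log_lik, vocab):
        scores = {lab: log_prior[lab] + sum(log_lik[lab][w] for w in words if w in vocab)
                  for lab in log_prior}
        return max(scores, key=scores.get)

    # Illustrative toy corpus (not from the slides).
    docs = [("i love this great movie".split(), "positive"),
            ("terrible sad boring movie".split(), "negative")]
    model = train_nb(docs, alpha=1.0)
    print(classify("great fun movie".split(), *model))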

Example of 20-Newsgroups text classification using NB

[Figure on the original slide; not reproduced in this transcription.]

Summary

Bayes rule can be turned into a classifier.
Maximum a Posteriori (MAP) hypothesis estimation incorporates prior knowledge; Maximum Likelihood doesn't.
The Naive Bayes classifier is a simple but effective Bayesian classifier for vector data (i.e. data with several attributes) that assumes the attributes are independent given the class.
Bayesian classification is a generative approach to classification.

Case 3: Motivations

We have already seen how Bayes rule can be turned into a classifier.
Our examples so far only consider discrete attributes, e.g. {sunny, warm}, {positive, negative}.
Today we learn how to do this when the data attributes are continuous.

An example problem

Task: predict the gender of individuals based on their heights.
Given 100 height examples of women and 100 height examples of men.
Encode the class label of male as y = 1 and female as y = 0. So, y ∈ {0, 1}.

Class posterior

From Bayes rule we can obtain the class posterior of male:
p(y = 1|x) = p(x|y = 1) p(y = 1) / ( p(x|y = 0) p(y = 0) + p(x|y = 1) p(y = 1) )
The denominator is the probability of measuring the height x irrespective of the class.

Modelling p(x|y) using a continuous probability distribution

Our measurements are heights. This is our data, x.
Class-conditional likelihoods:
p(x|y = 1): probability that a male has height x metres.
p(x|y = 0): probability that a female has height x metres.
We will model each class by a Gaussian distribution. (Other distributions are possible.)

Univariate Gaussian (Normal Distribution)

p(x|y = k) = (1 / √(2πσ_k²)) exp{ −(x − µ_k)² / (2σ_k²) }
where µ_k is the mean (centre) and σ_k² is the variance (spread). These are the parameters that describe the distribution.
We will have a separate Gaussian for each class. So the female class will have µ_0 as its mean and σ_0² as its variance, and the male class µ_1 and σ_1². We will estimate these parameters from the data.

Illustration - our 1D example

[Figure on the original slide; not reproduced in this transcription.]
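The sketch below fits one univariate Gaussian per class by maximum likelihood and evaluates the class posterior p(y = 1|x) exactly as in the formula above. The heights are synthetic, drawn from assumed distributions, since the slide's 100+100 measurements are not included in the transcription.

    import numpy as np

    rng = np.random.default_rng(0)
    # Synthetic stand-in for the slide's data: 100 female and 100 male heights (metres).
    heights_f = rng.normal(1.65, 0.06, 100)   # assumed female distribution
    heights_m = rng.normal(1.78, 0.07, 100)   # assumed male distribution

    def gaussian(x, mu, var):
        return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    # Parameter estimation: per-class mean and variance.
    mu0, var0 = heights_f.mean(), heights_f.var()
    mu1, var1 = heights_m.mean(), heights_m.var()
    p1 = p0 = 0.5                               # equal class priors (100 vs 100 examples)

    def posterior_male(x):
        num = gaussian(x, mu1, var1) * p1
        den = num + gaussian(x, mu0, var0) * p0
        return num / den                        # p(y = 1 | x)

    for x in [1.60, 1.72, 1.85]:
        print(x, round(posterior_male(x), 3))   # predict male when the posterior > 0.5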

Multivariate Gaussian

Let x = {x_1, x_2, x_3, ..., x_d} and k ∈ {0, 1}.
p(x|y = k) = (1 / ((2π)^{d/2} |Σ_k|^{1/2})) exp{ −(1/2) (x − µ_k)ᵗ Σ_k⁻¹ (x − µ_k) }
where µ_k is the mean vector and Σ_k is the covariance matrix. These parameters get estimated from the data.

Example with 2 classes

[Figure on the original slide; not reproduced in this transcription.]

Illustration - 2D example

[Figure on the original slide; not reproduced in this transcription.]

Class priors

Class prior: the probability of seeing a male example (and a female example).
Since in this example we had the same number of males and females, we empirically calculate
p(y = 1) = p(y = 0) = 100 / (100 + 100) = 0.5
These are priors of class membership and they could be set before measuring any data.
The class prior can be useful in cases where class proportions are imbalanced.
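A small sketch of the multivariate Gaussian class-conditional density above, evaluated with plain NumPy. The mean vectors, covariance matrices, and priors are made-up 2D examples, not values from the slides.

    import numpy as np

    def mvn_pdf(x, mu, Sigma):
        """p(x | y = k) for a d-dimensional Gaussian with mean mu and covariance Sigma."""
        d = len(mu)
        diff = x - mu
        norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
        return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm

    # Illustrative 2D parameters (assumed, not from the slides).
    mu0, Sigma0 = np.array([0.0, 0.0]), np.array([[1.0, 0.2], [0.2, 1.0]])
    mu1, Sigma1 = np.array([2.0, 1.0]), np.array([[1.5, 0.0], [0.0, 0.5]])
    prior0 = prior1 = 0.5

    x = np.array([1.0, 0.5])
    num1 = mvn_pdf(x, mu1, Sigma1) * prior1
    num0 = mvn_pdf(x, mu0, Sigma0) * prior0
    print("p(y=1|x) =", num1 / (num0 + num1))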

Discriminant function

According to the Bayes decision rule, we will predict y = 1 if p(y = 1|x) > 1/2 and y = 0 otherwise.
We can formulate the above rule as a mathematical function:
f_1(x) = 1[ p(y = 1|x) / p(y = 0|x) > 1 ]
or equivalently
f_2(x) = 1[ log( p(y = 1|x) / p(y = 0|x) ) > 0 ]
The sign of f_2 defines the prediction: f_2(x) > 0 ⇒ male, f_2(x) ≤ 0 ⇒ female.
Such functions are called discriminant functions.

Discriminant Analysis

Recall our discriminant function f_2(x) = log[ p(y = 1|x) / p(y = 0|x) ].
We would like to know what decision boundary a particular Σ will induce.
We write (for normal densities, and with ω_k defined as p(y = k)):
f_2(x) = log[ p(x|µ_1, Σ_1) ω_1 / ( p(x|µ_0, Σ_0) ω_0 ) ]
       = log p(x|µ_1, Σ_1) + log ω_1 − log p(x|µ_0, Σ_0) − log ω_0
       = −(1/2)(x − µ_1)ᵗ Σ_1⁻¹ (x − µ_1) − (d/2) log 2π − (1/2) log|Σ_1| + log ω_1
         + (1/2)(x − µ_0)ᵗ Σ_0⁻¹ (x − µ_0) + (d/2) log 2π + (1/2) log|Σ_0| − log ω_0

Discriminant Analysis: case Σ_1 = Σ_0 = σ²I

The determinant is |Σ| = σ^{2d} and Σ⁻¹ = (1/σ²) I, so
f_2(x) = −(1/2)(x − µ_1)ᵗ (I/σ²) (x − µ_1) + log ω_1 + (1/2)(x − µ_0)ᵗ (I/σ²) (x − µ_0) − log ω_0
       = −(1/(2σ²))(xᵗx − 2µ_1ᵗx + µ_1ᵗµ_1) + log ω_1 + (1/(2σ²))(xᵗx − 2µ_0ᵗx + µ_0ᵗµ_0) − log ω_0
       = (1/σ²)(µ_1ᵗx − (1/2)µ_1ᵗµ_1) − (1/σ²)(µ_0ᵗx − (1/2)µ_0ᵗµ_0) + log(ω_1/ω_0)

Discriminant Analysis: case Σ_1 = Σ_0 = σ²I

Setting f_2(x) = 0 gives the decision boundary:
(1/σ²)(µ_1ᵗx − (1/2)µ_1ᵗµ_1) − (1/σ²)(µ_0ᵗx − (1/2)µ_0ᵗµ_0) + log(ω_1/ω_0) = 0
(µ_1ᵗx − (1/2)µ_1ᵗµ_1) − (µ_0ᵗx − (1/2)µ_0ᵗµ_0) + σ² log(ω_1/ω_0) = 0
(µ_0 − µ_1)ᵗx + (1/2)µ_1ᵗµ_1 − (1/2)µ_0ᵗµ_0 − σ² log(ω_1/ω_0) = 0
(µ_0 − µ_1)ᵗx + (1/2)(µ_1ᵗµ_1 − µ_0ᵗµ_0) − σ² log(ω_1/ω_0) = 0
(µ_0 − µ_1)ᵗx + (1/2)(µ_1 − µ_0)ᵗ(µ_1 + µ_0) − σ² log(ω_1/ω_0) = 0
(µ_0 − µ_1)ᵗx − (µ_0 − µ_1)ᵗ[ (1/2)(µ_1 + µ_0) + ( σ² / ((µ_0 − µ_1)ᵗ(µ_0 − µ_1)) ) (µ_0 − µ_1) log(ω_1/ω_0) ] = 0
wᵗ(x − x_0) = 0
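A sketch of the shared-spherical-covariance case above: it evaluates the linear form of f_2(x), and also computes w = µ_0 − µ_1 and x_0 to show the same boundary written as wᵗ(x − x_0) = 0. The means, σ², and priors are made-up numbers for illustration.

    import numpy as np

    # Assumed 2D parameters for illustration (shared covariance sigma^2 * I).
    mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 2.0])
    sigma2 = 1.0
    omega1, omega0 = 0.7, 0.3           # class priors

    def f2(x):
        """Linear discriminant for Sigma_1 = Sigma_0 = sigma^2 I (derived above)."""
        return ((mu1 @ x - 0.5 * mu1 @ mu1) / sigma2
                - (mu0 @ x - 0.5 * mu0 @ mu0) / sigma2
                + np.log(omega1 / omega0))

    # The same boundary written as w^t (x - x0) = 0:
    w = mu0 - mu1
    x0 = 0.5 * (mu1 + mu0) + sigma2 * np.log(omega1 / omega0) / (w @ w) * w

    x = np.array([0.8, 0.9])
    print("f2(x) =", f2(x), "-> predict", 1 if f2(x) > 0 else 0)
    # Note w^t (x - x0) = -sigma^2 * f2(x): both are zero exactly on the boundary.
    print("w^t (x - x0) =", w @ (x - x0))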

Discriminant Analysis: case Σ_1 = Σ_0 = σ²I

wᵗ(x − x_0) = 0
This defines a hyperplane through the point x_0 and orthogonal to w.
Since w = (µ_0 − µ_1), the hyperplane is normal to the line linking the means. The plane cuts the line at x_0.
If ω_1 = ω_0 then x_0 = (µ_0 + µ_1)/2, the midpoint between the means.
In other cases, x_0 shifts away from the more likely mean (the class with the larger ω, i.e. the larger prior).

Discriminant Analysis: case Σ_1 = Σ_0 = Σ

Along the same line of analysis we find that in this case
w = Σ⁻¹(µ_i − µ_j)
x_0 = (1/2)(µ_i + µ_j) − [ log( p(ω_i)/p(ω_j) ) / ((µ_i − µ_j)ᵗ Σ⁻¹ (µ_i − µ_j)) ] (µ_i − µ_j)
The decision boundary is still linear, and ω controls the intercept on the line linking the means.
However, the hyperplane is not orthogonal to the line between the means, due to the covariance in w = Σ⁻¹(µ_i − µ_j).

Discriminant Analysis: case Σ_1 ≠ Σ_0, arbitrary covariances

In the general case we found that
f_2(x) = −(1/2)(x − µ_1)ᵗ Σ_1⁻¹ (x − µ_1) − (d/2) log 2π − (1/2) log|Σ_1| + log ω_1
         + (1/2)(x − µ_0)ᵗ Σ_0⁻¹ (x − µ_0) + (d/2) log 2π + (1/2) log|Σ_0| − log ω_0
The decision boundary is quadratic, since things cannot be simplified further.
The non-linearity of this form leads to a more powerful classifier for tackling data which is not linearly separable.

Gaussian's parameter estimation

The covariance: Σ_k = (1/m_k) Σ_{i=1}^{m_k} (x_i − µ_k)(x_i − µ_k)ᵗ
The mean: µ_k = (1/m_k) Σ_{i=1}^{m_k} x_i
The prior: ω_k = p(y = k) = m_k / m
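The estimation formulas above translate directly into NumPy. The sketch below estimates µ_k, Σ_k, and ω_k from synthetic, assumed 2D data and forms the shared-covariance discriminant direction w = Σ⁻¹(µ_1 − µ_0); pooling the two class covariances into one Σ is an assumption for the Σ_1 = Σ_0 = Σ case, not something specified on the slide.

    import numpy as np

    rng = np.random.default_rng(1)
    # Synthetic 2D data for two classes (illustrative only).
    X0 = rng.multivariate_normal([0, 0], [[1.0, 0.4], [0.4, 1.0]], 80)
    X1 = rng.multivariate_normal([2, 1], [[1.0, 0.4], [0.4, 1.0]], 120)

    def estimate(Xk, m_total):
        mu = Xk.mean(axis=0)                                  # mu_k
        Sigma = (Xk - mu).T @ (Xk - mu) / len(Xk)             # Sigma_k (MLE, divide by m_k)
        omega = len(Xk) / m_total                             # omega_k = m_k / m
        return mu, Sigma, omega

    m = len(X0) + len(X1)
    mu0, S0, omega0 = estimate(X0, m)
    mu1, S1, omega1 = estimate(X1, m)

    # Pooled covariance as a stand-in for the shared Sigma_1 = Sigma_0 = Sigma case.
    Sigma = (len(X0) * S0 + len(X1) * S1) / m
    w = np.linalg.inv(Sigma) @ (mu1 - mu0)   # boundary normal; not parallel to mu1 - mu0 in general
    print("priors:", omega0, omega1)
    print("w =", w)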

Naive assumption

The full covariance is d × d. In many situations there is not enough data to estimate a full covariance.
The naive assumption is again useful and tends to work well in practice.
Under the naive assumption the covariance becomes diagonal.

Multi-class classification

We may have more than two classes. Say, Healthy, Disease 1, Disease 2.
Our Gaussian classifier is easy to use in a multi-class problem.
We compute the posterior probability for each of the classes.
We predict the class with the highest posterior.

Summary

This type of classifier is called generative because it makes the assumption that the points in each class are generated by some distribution, i.e. a Gaussian distribution in our example.
One can also model the discriminant function directly. That is called a discriminative classifier (next week).

References

Machine learning course by Ata Kabán. http://www.cs.bham.ac.uk/~axk/ml_new.htm
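To tie the last two points together, here is a sketch of a multi-class Gaussian classifier under the naive (diagonal covariance) assumption: fit one diagonal Gaussian per class, then predict the class with the highest posterior. The three-class data are synthetic and purely illustrative.

    import numpy as np

    rng = np.random.default_rng(2)
    # Synthetic 2D data for three classes (e.g. Healthy, Disease 1, Disease 2); illustrative only.
    X = {0: rng.normal([0, 0], 1.0, (50, 2)),
         1: rng.normal([3, 0], 1.0, (50, 2)),
         2: rng.normal([0, 3], 1.0, (50, 2))}

    m = sum(len(v) for v in X.values())
    params = {}
    for k, Xk in X.items():
        mu = Xk.mean(axis=0)
        var = Xk.var(axis=0)             # naive assumption: diagonal covariance
        params[k] = (mu, var, len(Xk) / m)

    def log_posterior_unnorm(x, mu, var, prior):
        # log [ prior * prod_j N(x_j | mu_j, var_j) ]
        return np.log(prior) - 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

    x_new = np.array([2.5, 0.5])
    scores = {k: log_posterior_unnorm(x_new, *p) for k, p in params.items()}
    print("predicted class:", max(scores, key=scores.get))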