Linear Models for Classification


Catherine Lee Anderson (figures courtesy of Christopher M. Bishop), Department of Computer Science, University of Nebraska at Lincoln. CSCE 970: Pattern Recognition and Machine Learning

Congratulations!!!! You have just inherited an old house from your great-grand-aunt on your mother's side, twice removed by marriage (and once by divorce). There is an amazing collection of books in the old library (and in almost every other room in the house), containing everything from old leather-bound tomes and crinkly old parchments to newer dust-jacket-bound best sellers and academic textbooks, along with a sizable collection of paperback pulp fiction and DC comic books. Yep, old Aunt Lacee was a collector, and you have to clean out the old house before you can sell it.

Your Mission. Being the overworked (and underpaid) student that you are, you have limited time to spend on this task... but because you spend your leisure time listening to NPR (The Book Guys), you know that there is money to be made in old books. In other words, you need to quickly determine which books to throw out (or better still, recycle), which to sell, and which to hang onto as an investment.

The Task. From listening to The Book Guys you know that there are many aspects of a book that determine its present value, which will help you decide whether to toss, sell, or keep it. These aspects include: date published, author, topic, condition, genre, presence of a dust jacket, number of volumes known to be published, etc. And to your advantage, you have just completed a course in machine learning, so you recognize that what you have is a straightforward classification problem.

Outline
1 Defining the problem; Approaches in modeling
2 Discriminant functions
3 Logistic Regression
4 Probabilistic generative models: Modeling conditional class probabilities; Bayes' Theorem; Discrete Features


Classification
Problem components:
- A group, X, of items x with common characteristics and with specific values assigned to these characteristics; values can be nominal, numeric, discrete, or continuous.
- A set of disjoint classes into which we wish to place each of the above items.
- A function that assigns each item to one and only one of these disjoint classes.
Classification: assigning each item to one discrete class using a function devised specifically for this purpose.

Structure of items. Each item can be represented as a D-dimensional vector, x = (x_1, x_2, ..., x_D), where D is the number of aspects, attributes, or value fields used to describe the item. Aunt Lacee's collection: the items to be classified are books, comics, and parchments, each of which has a set of values attached to it (type, title, publish date, genre, condition, ...). Sample items from Aunt Lacee's collection: x = {book, Origin of Species, 1872, biology, mint, ...}; x = {parchment, Magna Carta, 1210, history, brittle, ...}.

Structure of classes. A set of K classes, C = {c_1, c_2, ..., c_K}, where each x can belong to only one class. The input space is divided into K decision regions, each region corresponding to a class. The boundaries of the decision regions are called decision boundaries or decision surfaces. In linear classification models these surfaces are linear functions of x; in other words, they are defined by (D-1)-dimensional hyperplanes within the D-dimensional input space.

Example: two dimensions, two classes. [Figure: the decision boundary y(x) = 0 in the (x_1, x_2) plane, separating region R_1 (where y > 0) from region R_2 (where y < 0); the weight vector w is orthogonal to the boundary, and w_0 determines its location.]

Structure of classes. For Aunt Lacee's book collection, K = 3: c_1 = no value, books with no value, which will be recycled; c_2 = sell immediately, books with immediate cash value, such as current textbooks and best sellers, which will be sold quickly; c_3 = keep, books (or parchments or comics) with museum-quality price tags that require time to place properly (for maximum profit). Each item of the collection will be assigned one and only one class. By their very nature, the classes are mutually exclusive.

Representation of a K-class label. Let t be a vector of length K used to represent a class label. Each element t_k of t is 0, except for element i, which is 1 when x ∈ c_i. For Aunt Lacee's collection, the values of t are as follows: t_i = (1, 0, 0) indicates x_i ∈ c_1 and should be recycled; t_i = (0, 1, 0) indicates x_i ∈ c_2 and should be sold; t_i = (0, 0, 1) indicates x_i ∈ c_3 and should be kept. A binary class is a special case, needing only a single-dimensional target: t = 0 indicates x_i ∈ c_0, and t = 1 indicates x_i ∈ c_1.
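
To make this 1-of-K coding concrete, here is a minimal NumPy sketch (the helper name one_hot is ours, not from the slides):

    import numpy as np

    def one_hot(class_index, K):
        # Length-K target vector t: all zeros except a 1 at the given class index.
        t = np.zeros(K)
        t[class_index] = 1.0
        return t

    # Aunt Lacee's collection, K = 3: c_1 = recycle, c_2 = sell, c_3 = keep
    print(one_hot(0, 3))   # [1. 0. 0.]  ->  x belongs to c_1
    print(one_hot(2, 3))   # [0. 0. 1.]  ->  x belongs to c_3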


Approaches to the problem. Three approaches to finding the function for our classification problem. Discriminant functions - the simplest approach: a function that directly assigns each x to one c_i ∈ C. Probabilistic (discriminative) - separates the inference stage from the decision stage: in the inference stage the conditional probability distribution p(c_k|x) is modeled directly, and in the decision stage the class is assigned based on these distributions.

Approaches to the problem. Probabilistic generative functions - both the class-conditional probability distribution p(x|c_k) and the prior probabilities p(c_k) are modeled and used to compute posterior probabilities via Bayes' theorem: p(c_k|x) = \frac{p(x|c_k) p(c_k)}{p(x)}. This model develops the probability densities of the input space in such a way that new examples can be generated accurately.


Discriminant functions: the two-class problem. y(x) = w^T x + w_0, where w is a weight vector of the same dimension D as x, and w_0 is the bias (its negative, -w_0, is called the threshold). An input vector x is assigned to one of the two classes as follows: x ∈ c_1 if y(x) ≥ 0, and x ∈ c_0 if y(x) < 0. The decision boundary is a hyperplane of dimension D - 1.
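
A minimal sketch of this decision rule in NumPy (the weight and bias values below are made up for illustration):

    import numpy as np

    w = np.array([2.0, -1.0])   # weight vector, same dimension D as x
    w0 = -0.5                   # bias

    def classify(x):
        y = w @ x + w0          # y(x) = w^T x + w_0
        return "c_1" if y >= 0 else "c_0"

    print(classify(np.array([1.0, 0.5])))   # y = 2.0 - 0.5 - 0.5 = 1.0   -> c_1
    print(classify(np.array([0.0, 1.0])))   # y = -1.0 - 0.5 = -1.5       -> c_0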

Matrix notation. As a reminder of the convention, vectors are column matrices: w = (w_1, w_2, ..., w_D)^T, so w^T = [w_1  w_2  ...  w_D], and w^T x = w_1 x_1 + w_2 x_2 + ... + w_D x_D.

Example: two dimensions, two classes. [Figure, as before: the decision boundary y(x) = 0 in the (x_1, x_2) plane, with R_1 (y > 0) and R_2 (y < 0) on either side and w orthogonal to the boundary.] y(x) = w^T x + w_0.

Multi-class (K > 2). A K-class discriminant is comprised of K functions of the form y_k(x) = w_k^T x + w_{k0}. Assign an input vector as follows: x ∈ c_k where k = argmax_j y_j(x), j ∈ {1, 2, ..., K}. [Figure: decision regions R_i, R_j, R_k of the multi-class discriminant, with points x_A and x_B and the point x̂ on the line joining them.]
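
A sketch of this K-class assignment rule in NumPy, with an arbitrary weight matrix chosen purely for illustration:

    import numpy as np

    # K = 3 classes, D = 2 inputs; row k of W is w_k, and b holds the biases w_k0.
    W = np.array([[ 1.0,  0.0],
                  [ 0.0,  1.0],
                  [-1.0, -1.0]])
    b = np.array([0.0, 0.0, 0.5])

    def classify(x):
        y = W @ x + b                 # y_k(x) = w_k^T x + w_k0 for each k
        return int(np.argmax(y))      # assign x to the class with the largest y_k(x)

    print(classify(np.array([2.0, 1.0])))   # y = [2.0, 1.0, -2.5]  ->  class 0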

Learning the parameter w. Three techniques for learning the parameter of the discriminant function, w: least squares, Fisher's linear discriminant, and the perceptron.

Least squares. Once again, each class C_k has its own linear model: y_k(x) = w_k^T x + w_{k0}. As a reminder of the convention, vectors are column matrices: w = (w_1, w_2, ..., w_D)^T, so w^T = [w_1  w_2  ...  w_D].

Least squares, compact notation. Let \tilde{W} be a (D+1) × K matrix whose k-th column is the augmented weight vector \tilde{w}_k = (w_{k0}, w_{k1}, ..., w_{kD})^T. Let \tilde{x} be a (D+1) × 1 column matrix \tilde{x} = (1, x^T)^T = (1, x_1, ..., x_D)^T.

Least squares, compact notation. The individual class discriminant functions y_k(x) = w_k^T x + w_{k0} can be written together as y(x) = \tilde{W}^T \tilde{x}.

Least squares, determining \tilde{W}. \tilde{W} is determined by minimizing a sum-of-squares error function of the form E(w) = \frac{1}{2} \sum_{n=1}^{N} \{y(x_n, w) - t_n\}^2. Let \tilde{X} be an N × (D+1) matrix representing a training set of N examples (one augmented input \tilde{x}_n^T per row), and let T be an N × K matrix representing the targets for the N training examples (one t_n^T per row).

Least squares, determining \tilde{W}. This yields the expression E_D(\tilde{W}) = \frac{1}{2} \mathrm{Tr}\{(\tilde{X}\tilde{W} - T)^T (\tilde{X}\tilde{W} - T)\}. To minimize, take the derivative with respect to \tilde{W} and set it to zero, obtaining \tilde{W} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T T = \tilde{X}^\dagger T, where \tilde{X}^\dagger is the pseudo-inverse of \tilde{X}. Finally, the discriminant function is y(x) = \tilde{W}^T \tilde{x} = T^T (\tilde{X}^\dagger)^T \tilde{x}.
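
A minimal NumPy sketch of the least-squares solution, using np.linalg.pinv for the pseudo-inverse (the data here is randomly generated purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    N, D, K = 100, 2, 3
    X = rng.normal(size=(N, D))                      # N training inputs
    labels = rng.integers(0, K, size=N)              # N class labels
    T = np.eye(K)[labels]                            # N x K matrix of 1-of-K targets

    X_tilde = np.hstack([np.ones((N, 1)), X])        # prepend the bias column of 1s
    W_tilde = np.linalg.pinv(X_tilde) @ T            # W = (X^T X)^{-1} X^T T

    def classify(x):
        y = W_tilde.T @ np.concatenate([[1.0], x])   # y(x) = W^T x_tilde
        return int(np.argmax(y))                     # pick the largest output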

Least squares, considerations. Under certain conditions this model has the property that the elements of y(x) sum to 1 for any value of x. However, since they are not constrained to lie in the interval (0, 1), negative values and values larger than 1 can occur, so the elements cannot be treated as probabilities. Among other disadvantages, this approach responds inappropriately to outliers.

Least squares - response to outliers. [Figure, two panels: a) the classes are well separated by the least-squares boundary; b) in the presence of outliers, several examples are misclassified.]


Fisher's linear discriminant, in concept. An approach that reduces the dimensionality of the model by projecting the input vector onto a lower-dimensional space. Simple example: two-dimensional input vectors projected down to one dimension. [Figure: two classes of two-dimensional points and the direction onto which they are projected.]

Fisher's linear discriminant. Start with a two-class problem, y = w^T x, whose class mean vectors are given by \mathbf{m}_1 = \frac{1}{N_1} \sum_{n \in C_1} x_n and \mathbf{m}_2 = \frac{1}{N_2} \sum_{n \in C_2} x_n. Choose w to maximize the separation of the projected class means, m_2 - m_1 = w^T(\mathbf{m}_2 - \mathbf{m}_1), where the scalar m_k = w^T \mathbf{m}_k is the projection of the class mean \mathbf{m}_k.

Fisher's linear discriminant. Maximizing the separation of the class means alone: [Figure: the data projected onto the direction that maximizes the separation of the means.] However, the classes still overlap.

Fisher's linear discriminant. Add the condition of minimizing the within-class variance, which is given by s_k^2 = \sum_{n \in C_k} (y_n - m_k)^2. Fisher's criterion is based on maximizing the separation of the class means while minimizing the within-class variance. These two conditions are captured in the ratio between the variance of the class means and the within-class variance, given by J(w) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2}.

Fisher's linear discriminant. Casting this ratio back into the original frame of reference, J(w) = \frac{w^T S_B w}{w^T S_W w}, where S_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T and S_W = \sum_{n \in C_1}(x_n - \mathbf{m}_1)(x_n - \mathbf{m}_1)^T + \sum_{n \in C_2}(x_n - \mathbf{m}_2)(x_n - \mathbf{m}_2)^T. Take the derivative with respect to w and set it to zero to find the maximum.

Fisher's linear discriminant, derivative: (w^T S_B w) S_W w = (w^T S_W w) S_B w. Only the direction of w is important, so w \propto S_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1). To make this a discriminant function, a threshold y_0 is chosen so that x ∈ C_1 if y(x) ≥ y_0 and x ∈ C_2 if y(x) < y_0.
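
A sketch of the two-class Fisher direction w ∝ S_W^{-1}(m_2 - m_1) in NumPy, on synthetic data made up for illustration:

    import numpy as np

    rng = np.random.default_rng(1)
    X1 = rng.normal(loc=[0.0, 0.0], size=(50, 2))              # class C_1 samples
    X2 = rng.normal(loc=[3.0, 2.0], size=(50, 2))              # class C_2 samples

    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)                  # class means
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)    # within-class scatter
    w = np.linalg.solve(S_W, m2 - m1)                          # w proportional to S_W^{-1}(m2 - m1)

    # With w pointing from m1 toward m2, projections above the midpoint of the
    # projected means are assigned to C_2, those below to C_1.
    y0 = 0.5 * (w @ m1 + w @ m2)
    def classify(x):
        return "C_2" if w @ x >= y0 else "C_1"

    print(classify(np.array([3.0, 2.0])))                      # near the C_2 mean -> C_2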

Fisher's linear discriminant. Second consideration: minimize the within-class variance. [Figure: the projection found by Fisher's criterion; the two classes are nicely separated in the dimensionally reduced space.]


The perceptron. This model takes the form y(x) = f(w^T Φ(x)), where Φ is a transformation function that creates the feature vector from the input vector (we will use the identity transformation to begin our discussion), and where f(·) is given by f(a) = +1 if a ≥ 0, and f(a) = -1 if a < 0.

The perceptron - the binary problem. There is a change in target coding: t is now a scalar, taking the value of either 1 or -1. This value is interpreted as the input vector belonging to C_1 if t = 1, else C_2 when t = -1. In considering w, we want w^T Φ(x_n) > 0 for x_n ∈ C_1 and w^T Φ(x_n) < 0 for x_n ∈ C_2, which means we want w^T Φ(x_n) t_n > 0 for all x_n ∈ X.

The perceptron - weight update. The perceptron error: E_P(w) = -\sum_{n \in M} w^T Φ_n t_n, where M is the set of misclassified examples. The perceptron update rule, applied when x_n is misclassified: w^{(τ+1)} = w^{(τ)} - η ∇E_P(w) = w^{(τ)} + η Φ_n t_n.
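
A minimal sketch of this update loop (identity feature map with an appended bias term; the synthetic data is made up for illustration):

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(100, 2))
    t = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)     # linearly separable targets in {+1, -1}
    Phi = np.hstack([X, np.ones((100, 1))])        # Phi(x) = (x, 1): bias absorbed into w

    w = np.zeros(3)
    eta = 1.0
    for _ in range(100):                           # passes over the data
        mistakes = 0
        for phi_n, t_n in zip(Phi, t):
            if (w @ phi_n) * t_n <= 0:             # misclassified: w^T Phi(x_n) t_n <= 0
                w = w + eta * phi_n * t_n          # w <- w + eta Phi_n t_n
                mistakes += 1
        if mistakes == 0:                          # convergence: every example correct
            break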

The perceptron - example. [Figure, two panels: a) a misclassified example; b) w after the update.]

The perceptron - example, continued. [Figure, two panels: a) the next misclassified example; b) w after the update.]

The perceptron - considerations. The update rule is guaranteed to reduce the error contribution from that specific example. It does not guarantee to reduce the error contribution from the other misclassified examples, and it could change a previously correctly classified example into a misclassified one. However, the perceptron convergence theorem does guarantee that an exact solution will be found if one exists, and that it will be found in a finite number of steps.

Review. We have seen three techniques for learning the parameters of a discriminant function: least squares, Fisher's linear discriminant, and the perceptron.


Logistic Regression. A logit - what is it when it's at home? A logit is simply the natural log of the odds. Odds are simply the ratio of two probabilities. In a binary classification problem, the two posterior probabilities sum to 1: if p(c_1|x) is the probability that x belongs to c_1, then p(c_2|x) = 1 - p(c_1|x). So the odds are odds = \frac{p(c_1|x)}{1 - p(c_1|x)}.

A logit - what benefits? Example: if an individual is 6 feet tall, then according to census data the probability that the individual is male is 0.9. This makes the probability of being female 1 - 0.9 = 0.1. The odds on being male are 0.9/0.1 = 9; however, the odds on being female are 0.1/0.9 ≈ 0.11. The lack of symmetry is unappealing: intuition would prefer the odds on being female to be the opposite of the odds on being male.

A logit - linear model. The natural log supplies this symmetry: ln(9.0) = 2.197 and ln(0.111) = -2.197. Now, if we assume that the logit is linear with respect to x, we have logit(P) = \ln\left(\frac{P}{1-P}\right) = a + Bx, where a and B are parameters.

From logit to sigmoid. From logit(P) = \ln\left(\frac{P}{1-P}\right) = a + Bx, exponentiate both sides:
\frac{P}{1-P} = e^{a+Bx}
P = (1-P)e^{a+Bx} = e^{a+Bx} - P e^{a+Bx}
P + P e^{a+Bx} = e^{a+Bx}
P(1 + e^{a+Bx}) = e^{a+Bx}
P = \frac{e^{a+Bx}}{1 + e^{a+Bx}} = \frac{1}{1 + e^{-(a+Bx)}}
where a sets the probability when x is zero and B adjusts the rate at which the probability changes with x.

The sigmoid. Sigmoid means S-shaped. It is also called a squashing function because it maps a very large domain (the whole real line) into the relatively small interval (0, 1). [Figure: the sigmoid plotted for arguments from -5 to 5, rising from 0 toward 1 and passing through 0.5 at 0.]
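
A quick numerical check of the logit/sigmoid relationship, using the census example above; applying the sigmoid to the log-odds 2.197 recovers the probability 0.9:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    odds_male = 0.9 / 0.1
    log_odds = np.log(odds_male)           # 2.197...
    print(log_odds, sigmoid(log_odds))     # 2.1972  0.9  -> the sigmoid inverts the logit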

The model. The posterior probability of C_1 can be written p(c_1|Φ) = y(Φ) = σ(w^T Φ) = \frac{1}{1 + e^{-w^T Φ}}. w must be learned by adjusting its M components (the feature vector Φ has length M). Weight update: w^{(τ+1)} = w^{(τ)} - η ∇E_n, where ∇E(w) = \sum_{n=1}^{N} (y_n - t_n)Φ_n and ∇E_n = (y_n - t_n)Φ_n.

Maximum likelihood. Maximum likelihood concerns the probability p(t|w), read as the probability of the observed data set given the parameter vector w. It can be calculated by taking the product of the individual probabilities of the class assigned to each x_n ∈ D agreeing with t_n: p(t|w) = \prod_{n=1}^{N} p(c_n = t_n|x_n), where t_n ∈ {0, 1} and p(c_n = t_n|x_n) = p(c_1|Φ_n) if c_n = 1, and 1 - p(c_1|Φ_n) if c_n = 0.

Maximum likelihood. Since the target is either 1 or 0, this allows a mathematically convenient expression for the product: p(t|w) = \prod_{n=1}^{N} \big(p(c_1|Φ_n)\big)^{t_n} \big(1 - p(c_1|Φ_n)\big)^{1-t_n}. From p(c_1|Φ) = y(Φ) = σ(w^T Φ), p(t|w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1-t_n}, where y_n = y(Φ_n).

Maximum likelihood and error. The negative log of the likelihood function is E(w) = -\ln p(t|w) = -\sum_{n=1}^{N} \big(t_n \ln y_n + (1 - t_n)\ln(1 - y_n)\big). The gradient of this is ∇E(w) = \frac{d}{dw}\big(-\ln p(t|w)\big).

Maximum likelihood and error.
∇E(w) = -\sum_{n=1}^{N} \frac{d}{dw}\big(t_n \ln y_n + (1 - t_n)\ln(1 - y_n)\big)
= -\sum_{n=1}^{N} \left(\frac{t_n}{y_n}\frac{dy_n}{dw} + \frac{1 - t_n}{1 - y_n}\frac{d(1 - y_n)}{dw}\right)
= -\sum_{n=1}^{N} \left(\frac{t_n}{y_n}Φ_n y_n(1 - y_n) - \frac{1 - t_n}{1 - y_n}Φ_n y_n(1 - y_n)\right)
= -\sum_{n=1}^{N} (t_n - t_n y_n - y_n + t_n y_n)Φ_n = \sum_{n=1}^{N} (y_n - t_n)Φ_n

Logistic regression model. The model based on maximum likelihood: p(c_1|Φ) = y(Φ) = σ(w^T Φ) = \frac{1}{1 + e^{-w^T Φ}}. Weight update based on the gradient of the negative log likelihood: w^{(τ+1)} = w^{(τ)} - η ∇E_n = w^{(τ)} - η (y_n - t_n)Φ_n.
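
A minimal sketch of this gradient-based training loop (identity features with a bias column, synthetic data, and an arbitrary learning rate, all chosen for illustration):

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(200, 2))
    t = (X[:, 0] - X[:, 1] > 0).astype(float)          # targets in {0, 1}
    Phi = np.hstack([np.ones((200, 1)), X])            # bias feature plus the inputs

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    w = np.zeros(3)
    eta = 0.1
    for _ in range(50):                                # passes over the data
        for phi_n, t_n in zip(Phi, t):
            y_n = sigmoid(w @ phi_n)                   # y_n = sigma(w^T Phi_n)
            w = w - eta * (y_n - t_n) * phi_n          # w <- w - eta (y_n - t_n) Phi_n

    accuracy = np.mean((sigmoid(Phi @ w) >= 0.5) == (t == 1))
    print(accuracy)                                    # training accuracy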

A new model. The model based on iterative reweighted least squares: p(c_1|Φ) = y(Φ) = σ(w^T Φ) = \frac{1}{1 + e^{-w^T Φ}}, with a weight update based on a Newton-Raphson iterative optimization scheme: w^{new} = w^{old} - H^{-1} ∇E(w). The Hessian H is a matrix whose elements are the second derivatives of E(w) with respect to w. This is a numerical-analysis technique that is an alternative to the first one covered; the trade-off is faster convergence at the cost of more computationally expensive steps.
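
A sketch of the Newton-Raphson (IRLS) update, using the standard logistic-regression Hessian H = Φ^T R Φ with R = diag(y_n(1 - y_n)); that form of H is a known result but is not derived on these slides, and the data setup below is synthetic:

    import numpy as np

    rng = np.random.default_rng(4)
    X = rng.normal(size=(200, 2))
    noise = 0.5 * rng.normal(size=200)                  # label noise keeps the MLE finite
    t = (X[:, 0] - X[:, 1] + noise > 0).astype(float)
    Phi = np.hstack([np.ones((200, 1)), X])

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    w = np.zeros(3)
    for _ in range(10):                          # a few Newton steps usually suffice
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (y - t)                   # grad E(w) = sum_n (y_n - t_n) Phi_n
        R = np.diag(y * (1.0 - y))
        H = Phi.T @ R @ Phi                      # Hessian of E(w)
        w = w - np.linalg.solve(H, grad)         # w_new = w_old - H^{-1} grad E(w)

    print(w)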


Probabilistic generative models: the approach. This approach tends to be more computationally expensive. The training data, and any information on the distribution of the training data within the input space, are used to model the class-conditional probabilities. Then, using Bayes' theorem, the posterior probability is calculated. The label decision is made by choosing the class with the maximum posterior probability.

Modeling class-conditional probabilities with prior probabilities. The class-conditional probability is given by p(x|c_k), read as the probability of x given the class c_k. The prior probability p(c_k) is the probability of c_k independent of any other variable. The joint probability is p(x_n, c_1) = p(c_1) p(x_n|c_1).

Specific case of a binary label. Let t_n = 1 correspond to c_1 and t_n = 0 to c_2. Let p(c_1) = π, so p(c_2) = 1 - π. Let each class have a Gaussian class-conditional density with a shared covariance matrix: N(x|µ, Σ) = \frac{1}{(2π)^{D/2} |Σ|^{1/2}} \exp\left\{-\frac{1}{2}(x - µ)^T Σ^{-1} (x - µ)\right\}, where µ is a D-dimensional mean vector, Σ is a D × D covariance matrix, and |Σ| is the determinant of Σ.

Specific case of a binary label. The joint probabilities for each class are p(c_1)p(x_n|c_1) = π N(x_n|µ_1, Σ) and p(c_2)p(x_n|c_2) = (1 - π) N(x_n|µ_2, Σ). The likelihood function is given by p(t|π, µ_1, µ_2, Σ) = \prod_{n=1}^{N} \big[π N(x_n|µ_1, Σ)\big]^{t_n} \big[(1 - π) N(x_n|µ_2, Σ)\big]^{1 - t_n}.

Specific case of a binary label. The error is the negative log of the likelihood; the terms that depend on π are -\sum_{n=1}^{N}\big(t_n \ln π + (1 - t_n)\ln(1 - π)\big). We minimize this by setting the derivative with respect to π to zero and solving for π: π = \frac{1}{N}\sum_{n=1}^{N} t_n = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2}.
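
A sketch of fitting this generative model by maximum likelihood: π from the formula above, the class means as sample means, and the shared covariance taken as the weighted average of the per-class sample covariances (the standard maximum-likelihood result for this model, not derived on these slides); the data is synthetic:

    import numpy as np

    rng = np.random.default_rng(5)
    X1 = rng.normal(loc=[0.0, 0.0], size=(60, 2))     # class c_1 samples (t_n = 1)
    X2 = rng.normal(loc=[2.0, 1.0], size=(40, 2))     # class c_2 samples (t_n = 0)
    N1, N2 = len(X1), len(X2)
    N = N1 + N2

    pi = N1 / N                                       # pi = N_1 / (N_1 + N_2)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)       # class means
    S1 = (X1 - mu1).T @ (X1 - mu1) / N1               # per-class sample covariances
    S2 = (X2 - mu2).T @ (X2 - mu2) / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2             # shared covariance estimate

    def posterior_c1(x):
        # p(c_1 | x) by Bayes' theorem; the (2 pi)^{D/2} |Sigma|^{1/2} factor is the
        # same for both classes (shared Sigma) and cancels in the ratio.
        inv = np.linalg.inv(Sigma)
        dens = lambda z, mu: np.exp(-0.5 * (z - mu) @ inv @ (z - mu))
        a = pi * dens(x, mu1)
        b = (1 - pi) * dens(x, mu2)
        return a / (a + b)

    print(posterior_c1(np.array([0.0, 0.0])))         # close to 1: near the c_1 mean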


Review of Bayes' theorem. P(c_k|x) = \frac{P(x|c_k) P(c_k)}{P(x)}. P(x) is the prior probability that x will be observed, meaning the probability of x given no knowledge about which c_k is observed. It can be seen that as P(x) increases (with the numerator fixed), P(c_k|x) decreases: the higher the probability of an event independent of any other factor, the lower the probability of that event conditioned on another event.

Review of Bayes' theorem. P(x|c_k) is the class-conditional probability that x will be observed once class c_k is observed. Both P(x|c_k) and P(c_k) have been modeled, so the posterior probability P(c_k|x) can now be calculated. The label is assigned as the class that generates the maximum a posteriori (MAP) probability for the input vector: c_{MAP} \equiv \mathrm{argmax}_{c_k \in C} P(c_k|x) = \mathrm{argmax}_{c_k \in C} \frac{P(x|c_k) P(c_k)}{P(x)}, and since P(x) does not depend on c_k, c_{MAP} \equiv \mathrm{argmax}_{c_k \in C} P(x|c_k) P(c_k).



Discrete feature values. Each x is made up of an ordered set of feature values, x = (a_1, a_2, ..., a_i), where i is the number of attributes. Sample problem, Aunt Lacee's library: x = {book, Origin of Species, 1500-1900, biology, mint, ...}. Each attribute has a set of allowed values: a_1 ∈ {book, paperback, parchment, comic}; a_3 ∈ {<1200, 1200-1500, 1500-1900, 1900-1930, 1930-1960, 1960-current}.

Naïve Bayes assumption. Assume that the attributes are conditionally independent given the class: P(x|c_k) = P(a_1, a_2, ..., a_i|c_k) = \prod_i P(a_i|c_k), where any given P(a_i|c_k) is the number of training instances with the same a_i value and target value c_k, divided by the number of instances with target c_k. P(c_k) is the number of instances with target c_k divided by the total number of instances. The final label is determined by naïve Bayes as c_{NB} = \mathrm{argmax}_{c_k \in \{c_1, c_2\}} P(c_k) \prod_i P(a_i|c_k).
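
A minimal sketch of naïve Bayes with discrete attributes, using the counting estimates described above (the tiny training set, with type and condition as the only attributes, is made up for illustration):

    from collections import Counter, defaultdict

    # ((type, condition), class) pairs for Aunt Lacee's collection
    train = [
        (("paperback", "worn"), "toss"),
        (("paperback", "good"), "sell"),
        (("book", "good"), "sell"),
        (("book", "mint"), "keep"),
        (("parchment", "brittle"), "keep"),
    ]

    class_counts = Counter(c for _, c in train)
    attr_counts = defaultdict(Counter)                  # attr_counts[c][(i, a_i)] -> count
    for attrs, c in train:
        for i, a in enumerate(attrs):
            attr_counts[c][(i, a)] += 1

    def predict(attrs):
        best, best_score = None, -1.0
        for c, Nc in class_counts.items():
            score = Nc / len(train)                     # P(c_k)
            for i, a in enumerate(attrs):
                score *= attr_counts[c][(i, a)] / Nc    # P(a_i | c_k) by counting
            if score > best_score:
                best, best_score = c, score
        return best

    print(predict(("book", "mint")))                    # -> "keep"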

Review
Discriminant functions: least squares; Fisher's linear discriminant; the perceptron.
Probabilistic: logistic regression, with maximum-likelihood (gradient-based) error minimization and with the Newton-Raphson approach to error minimization.
Probabilistic generative functions: Gaussian class-conditional probabilities; discrete attribute values with the naïve Bayes classifier.