Introduction to Pattern Recognition. Lectures 4 and 5: Bayesian decision theory


Jussi Tohka (jussi.tohka@tut.fi), Institute of Signal Processing, Tampere University of Technology

Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification. The approach is based on quantifying the tradeoffs between the various classification decisions and their costs. The problem must be posed in probabilistic terms, and all of the associated probabilities must be completely known. The theory is largely a formalization of common-sense procedures; however, it provides a solid foundation on which various pattern classification methods can build.

The fish example: Terminology

We want to separate two kinds of fish: sea bass and salmon. No other kinds of fish are possible. The true class ω is a random variable (RV). Two classes are possible: ω = ω_1 for sea bass and ω = ω_2 for salmon. If the sea contains more sea bass than salmon, it is natural to assume (even when no data or features are available) that a caught fish is a sea bass. This is modeled with the prior probabilities P(ω_1) and P(ω_2), which are positive and sum to one. If there are more sea bass, then P(ω_1) > P(ω_2).

The fish example continued

If we had to decide at this point (for some curious reason) which fish we have, how would we decide? Because there are more sea bass, we would say that the fish is a sea bass. In other words, our decision rule becomes: decide ω_1 if P(ω_1) > P(ω_2), and otherwise decide ω_2. To develop better rules, we must extract some information, or features, from the data. This means, for instance, making lightness measurements on the fish.

The fish example continued

Suppose we have a lightness reading, say x, from the fish. What to do next? We know every probability relevant to the classification problem. In particular, we know P(ω_1), P(ω_2) and the class-conditional probability densities p(x | ω_1) and p(x | ω_2). Based on these we can compute the probability that the class is ω_1 given that the lightness reading is x, and similarly for salmon, by using Bayes' formula.

The decision rule: decide ω_1 if P(ω_1 | x) > P(ω_2 | x), and otherwise decide ω_2. Remember that we do not know P(ω_j | x) directly. The posteriors must be computed through Bayes' rule:

    P(ω_j | x) = p(x | ω_j) P(ω_j) / p(x),

where the evidence is p(x) = Σ_{j=1}^{2} p(x | ω_j) P(ω_j).

The justification for the rule: P(error | x) = P(ω_1 | x) if we decide ω_2, and P(error | x) = P(ω_2 | x) if we decide ω_1. The average error is

    P(error) = ∫ P(error, x) dx = ∫ P(error | x) p(x) dx.

Thus, if P(error | x) is minimal for every x, the average error is also minimized. The decision rule "decide ω_1 if P(ω_1 | x) > P(ω_2 | x) and otherwise decide ω_2" guarantees just that.

An equivalent decision rule is obtained by multiplying P(ω_j | x) in the previous rule by p(x). Because p(x) is a constant with respect to the class, this obviously does not affect the decision itself. We obtain the rule: decide ω_1 if p(x | ω_1) P(ω_1) > p(x | ω_2) P(ω_2), and otherwise decide ω_2.
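To make the mechanics concrete, here is a minimal Python sketch of this posterior computation and decision rule. The Gaussian lightness models and the prior values are assumptions made purely for illustration; the lecture does not specify them.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional lightness models p(x | omega_1), p(x | omega_2)
# and priors, assumed here only for illustration.
priors = np.array([2/3, 1/3])                 # P(omega_1), P(omega_2)
ccdfs = [norm(loc=4.0, scale=1.0),            # p(x | omega_1), sea bass
         norm(loc=6.0, scale=1.0)]            # p(x | omega_2), salmon

def posterior(x):
    """Return [P(omega_1 | x), P(omega_2 | x)] via Bayes' rule."""
    likelihoods = np.array([d.pdf(x) for d in ccdfs])
    joint = likelihoods * priors              # p(x | omega_j) P(omega_j)
    return joint / joint.sum()                # divide by the evidence p(x)

x = 5.2                                       # a lightness reading
post = posterior(x)
decision = 1 if post[0] > post[1] else 2      # decide omega_1 if P(omega_1|x) > P(omega_2|x)
print(post, "-> decide omega_%d" % decision)
```

Changing the priors shifts the decision toward the more common class, exactly as the prior-only rule above suggests.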

BDT - features

The purpose is to decide on an action based on the sensed object with the measured feature vector x. Each object to be classified has a corresponding feature vector, and we identify the feature vector x with the object to be classified. The set of all possible feature vectors is called the feature space, which we denote by F. Feature spaces correspond to sample spaces. Examples of feature spaces are R^d, {0, 1}^d, and R × {0, 1}^d. For the moment, we assume that the feature space is R^d.

BDT - classes

We denote by ω the (unknown) class or category of the object x. We use the symbols ω_1, ..., ω_c for the c categories, or classes, to which x can belong. At this stage, c is fixed. Each category ω_i has a prior probability P(ω_i). The prior probability tells us how likely a particular class is before making any observations. In addition, we know the probability density functions (pdfs) of feature vectors drawn from a certain class. These are called class-conditional density functions (ccdfs) and are denoted by p(x | ω_i) for i = 1, ..., c. The ccdfs tell us how probable the feature vector x is, provided that the class of the object is ω_i.

BDT - actions

We write α_1, ..., α_a for the a possible actions and α(x) for the action taken after observing x. Thought of as a function from the feature space to {α_1, ..., α_a}, α(x) is called a decision rule. In fact, any function from the feature space to {α_1, ..., α_a} is a decision rule. The number of actions a need not be equal to c, the number of classes. But if a = c and the actions α_i read "assign x to the class i", we often forget about the actions and simply talk about assigning x to a certain class.

BDT - loss function

We will develop the optimal decision rule based on statistics, but before that we need to tie actions and classes together. This is done with a loss function. The loss function, denoted by λ, tells how costly each action is, and it is used to convert a probability determination into a decision of an action. λ is a function from action/class pairs to the set of nonnegative real numbers. λ(α_i | ω_j) describes the loss incurred for taking action α_i if the true class is ω_j: λ(α | ω_j) is low for good actions given the class ω_j and high for bad actions given the class ω_j.

BDT - Bayes decision rule

If the true class is ω_j, by definition we incur the loss λ(α_i | ω_j) when taking the action α_i after observing x. The expected loss, or conditional risk, of taking action α_i after observing x is

    R(α_i | x) = Σ_{j=1}^{c} λ(α_i | ω_j) P(ω_j | x).

The total expected loss for the decision rule α, termed the overall risk, is

    R_total(α) = ∫ R(α(x) | x) p(x) dx.

Bayes decision rule

We would like to derive a decision rule α(x) that minimizes the overall risk. This decision rule is: select the action α_i that gives the minimum conditional risk R(α_i | x), i.e.

    α(x) = arg min_{α_i} R(α_i | x).

This is called the Bayes decision rule. The classifier built upon this rule is called the Bayes (minimum risk) classifier. The overall risk R_total for the Bayes decision rule is called the Bayes risk; it is the smallest overall risk that is possible.

Bayes decision rule

We now prove that the Bayes decision rule indeed minimizes the overall risk.

THEOREM. Let α : F → {α_1, ..., α_a} be an arbitrary decision rule and α_bayes : F → {α_1, ..., α_a} the Bayes decision rule. Then R_total(α_bayes) ≤ R_total(α).

PROOF. By the definition of the Bayes decision rule, R(α_bayes(x) | x) ≤ R(α(x) | x) for every x. Hence, because p(x) ≥ 0, R(α_bayes(x) | x) p(x) ≤ R(α(x) | x) p(x). Therefore

    R_total(α_bayes) = ∫ R(α_bayes(x) | x) p(x) dx ≤ ∫ R(α(x) | x) p(x) dx = R_total(α).
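As a quick illustration, the sketch below selects the minimum-risk action for a single feature vector. The loss matrix and the posterior values are made-up illustrative numbers, and the posteriors are assumed to have been computed already via Bayes' rule.

```python
import numpy as np

# A minimal sketch of the Bayes minimum-risk rule.
# loss[i, j] = lambda(alpha_i | omega_j): cost of taking action alpha_i
# when the true class is omega_j. These numbers are illustrative only.
loss = np.array([[0.0, 2.0, 5.0],
                 [4.0, 0.0, 1.0]])          # 2 actions, 3 classes

def bayes_action(posteriors):
    """Return the index of the action minimizing R(alpha_i | x), plus all risks."""
    risks = loss @ posteriors               # R(alpha_i | x) = sum_j loss[i, j] P(omega_j | x)
    return int(np.argmin(risks)), risks

posteriors = np.array([0.5, 0.3, 0.2])      # P(omega_j | x), assumed given
action, risks = bayes_action(posteriors)
print(risks, "-> take action alpha_%d" % (action + 1))
```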

Two category classification

The possible classes are ω_1 and ω_2. The action α_1 corresponds to deciding that the true class is ω_1, and α_2 corresponds to deciding that the true class is ω_2. Write λ(α_i | ω_j) = λ_ij. Then

    R(α_1 | x) = λ_11 P(ω_1 | x) + λ_12 P(ω_2 | x),
    R(α_2 | x) = λ_21 P(ω_1 | x) + λ_22 P(ω_2 | x).

The Bayes decision rule: decide that the true class is ω_1 if R(α_1 | x) < R(α_2 | x), and ω_2 otherwise. Equivalently, we decide (that the true class is) ω_1 if

    (λ_21 - λ_11) P(ω_1 | x) > (λ_12 - λ_22) P(ω_2 | x).

Ordinarily λ_21 - λ_11 and λ_12 - λ_22 are positive; that is, the loss is greater when making a mistake. Assume λ_21 > λ_11. Then the Bayes decision rule can be written as: decide ω_1 if

    p(x | ω_1) / p(x | ω_2) > [(λ_12 - λ_22) / (λ_21 - λ_11)] · [P(ω_2) / P(ω_1)].

Example: SPAM filtering

We have two actions: α_1 stands for "keep the mail" and α_2 stands for "delete as SPAM". There are two classes: ω_1 (normal mail) and ω_2 (SPAM, i.e. junk mail). Let P(ω_1) = 0.4, P(ω_2) = 0.6 and λ_11 = 0, λ_21 = 3, λ_12 = 1, λ_22 = 0. That is, deleting important mail as SPAM is more costly than keeping a SPAM mail. We receive a message with the feature vector x, and p(x | ω_1) = 0.35, p(x | ω_2) = 0.65. How does the Bayes minimum risk classifier act?

    P(ω_1 | x) = (0.35 · 0.4) / (0.35 · 0.4 + 0.65 · 0.6) ≈ 0.264,   P(ω_2 | x) ≈ 0.736.
    R(α_1 | x) = 0 · 0.264 + 1 · 0.736 = 0.736,   R(α_2 | x) = 3 · 0.264 + 0 · 0.736 ≈ 0.792.

Since R(α_1 | x) < R(α_2 | x): don't delete the mail!

Zero-one loss

An important loss function is the zero-one loss,

    λ(α_i | ω_j) = 0 if i = j, and 1 if i ≠ j.

In matrix form, the loss matrix has zeros on the diagonal and ones everywhere else.
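The following sketch reproduces the SPAM example numerically from the stated priors, class-conditional values, and losses.

```python
import numpy as np

# Reproducing the SPAM-filtering example.
priors = np.array([0.4, 0.6])            # P(omega_1) normal mail, P(omega_2) SPAM
likelihoods = np.array([0.35, 0.65])     # p(x | omega_1), p(x | omega_2)
loss = np.array([[0.0, 1.0],             # lambda_11, lambda_12 (alpha_1 = keep)
                 [3.0, 0.0]])            # lambda_21, lambda_22 (alpha_2 = delete)

joint = likelihoods * priors
posteriors = joint / joint.sum()         # approx [0.264, 0.736]
risks = loss @ posteriors                # approx [0.736, 0.792]
print(posteriors.round(3), risks.round(3))
print("keep the mail" if risks[0] < risks[1] else "delete as SPAM")
```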

Minimum error rate classification

Assume that our loss function is the zero-one loss and the actions α_i read "decide that the true class is ω_i". We can then identify the action α_i with the class ω_i. The Bayes decision rule applied to this case leads to the minimum error rate classification rule and the Bayes (minimum error) classifier. The minimum error rate classification rule is: decide ω_i if P(ω_i | x) > P(ω_j | x) for all j ≠ i.

Recap: The Bayes Classifier

Given a feature vector x, compute the conditional risk R(α_i | x) for taking action α_i for all i = 1, ..., a and select the action that gives the smallest conditional risk. Classification with zero-one loss: compute the probability P(ω_i | x) for all categories ω_1, ..., ω_c and select the category that gives the largest probability. Remember to use Bayes' rule in computing the probabilities.

Discriminant functions

Note: in what follows we will assume that a = c and use ω_i and α_i interchangeably. There are many ways to represent pattern classifiers. One of the most useful is in terms of a set of discriminant functions g_1(x), ..., g_c(x) for a c-category classifier. The classifier assigns a feature vector x to class ω_i if

    g_i(x) > g_j(x) for all j ≠ i.

For the Bayes classifier, g_i(x) = -R(α_i | x).

(Figure 2.5 from Duda, Hart, Stork: Pattern Classification, Wiley, 2001.)
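For zero-one loss the minimum-risk rule reduces to choosing the largest posterior, since R(α_i | x) = 1 - P(ω_i | x). The short check below uses illustrative posterior values.

```python
import numpy as np

# With zero-one loss, minimizing the conditional risk equals maximizing the posterior.
c = 3
loss = 1.0 - np.eye(c)                    # zero-one loss matrix: 0 on diagonal, 1 elsewhere
posteriors = np.array([0.2, 0.5, 0.3])    # illustrative P(omega_j | x)

risks = loss @ posteriors                 # equals 1 - posteriors
print(risks, 1.0 - posteriors)
assert np.argmin(risks) == np.argmax(posteriors)
```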

Equivalent discriminant functions

The choice of discriminant functions is not unique. Many distinct sets of discriminant functions lead to the same classifier, that is, to the same decision rule. We say that two sets of discriminant functions are equivalent if they lead to the same classifier, or, to put it more formally, if their corresponding decision rules give equal decisions for all x. The following holds: let f be a monotonically increasing function (f(x) < f(y) whenever x < y) and let g_i(x), i = 1, ..., c be the discriminant functions representing a classifier. Then the discriminant functions f(g_i(x)), i = 1, ..., c represent essentially the same classifier as g_i(x), i = 1, ..., c.

Example: equivalent sets of discriminant functions for the minimum error rate (Bayes) classifier are

    g_i(x) = P(ω_i | x) = p(x | ω_i) P(ω_i) / Σ_{j=1}^{c} p(x | ω_j) P(ω_j),
    g_i(x) = p(x | ω_i) P(ω_i),
    g_i(x) = ln p(x | ω_i) + ln P(ω_i).

Linear discriminant functions

A discriminant function is linear if it can be written in the form g_i(x) = w_i^T x + w_i0, where w_i = [w_i1, ..., w_id]^T. The term w_i0 is called the threshold or bias for the ith category. Note that a linear discriminant function is linear with respect to w_i and w_i0, but actually affine with respect to x. Don't let this bother you too much! A classifier is linear if it can be represented using entirely linear discriminant functions. Linear classifiers have some important properties which we will study later during this course.

Discriminant functions: Two categories

In the two-category case, we may combine the two discriminant functions into a single discriminant function. The decision rule "decide ω_1 if g_1(x) > g_2(x) and otherwise decide ω_2" becomes, with g(x) = g_1(x) - g_2(x), the equivalent decision rule: decide ω_1 if g(x) > 0.
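The equivalence of these discriminant-function sets is easy to check numerically. The sketch below compares the posterior form with the log form on an assumed pair of one-dimensional Gaussian class models (chosen for illustration only).

```python
import numpy as np
from scipy.stats import norm

# Check that two equivalent sets of discriminant functions give identical decisions.
# The class models and priors here are assumptions for illustration.
priors = np.array([0.6, 0.4])
ccdfs = [norm(0.0, 1.0), norm(1.0, 1.0)]    # p(x | omega_1), p(x | omega_2)

def decide_posterior(x):                     # g_i(x) = P(omega_i | x)
    g = np.array([d.pdf(x) for d in ccdfs]) * priors
    return int(np.argmax(g / g.sum()))

def decide_log(x):                           # g_i(x) = ln p(x | omega_i) + ln P(omega_i)
    g = np.array([d.logpdf(x) for d in ccdfs]) + np.log(priors)
    return int(np.argmax(g))

xs = np.linspace(-3.0, 4.0, 29)
assert all(decide_posterior(x) == decide_log(x) for x in xs)
print("both discriminant sets give the same decisions")
```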

Decision regions

The effect of any decision rule is to divide the feature space (in this case R^d) into c disjoint decision regions R_1, ..., R_c. Decision rules can be written with the help of decision regions: if x ∈ R_i, decide ω_i. (Therefore decision regions form a representation for a classifier.) Decision regions can be derived from discriminant functions:

    R_i = {x : g_i(x) > g_j(x) for all j ≠ i}.

Note that decision regions are properties of the classifier, and they are not affected if the discriminant functions are changed to equivalent ones. Boundaries of decision regions, i.e. places where two or more discriminant functions yield the same value, are called decision boundaries.

(Figure 2.6 from Duda, Hart, Stork: Pattern Classification, Wiley.)

Decision regions example

Consider a two-category classification problem with P(ω_1) = 0.6, P(ω_2) = 0.4 and

    p(x | ω_1) = (1/√(2π)) exp[-0.5 x²],   p(x | ω_2) = (1/√(2π)) exp[-0.5 (x - 1)²].

Find the decision regions and boundaries for the minimum error rate Bayes classifier. The decision region R_1 is the set of points where P(ω_1 | x) > P(ω_2 | x). The decision region R_2 is the set of points where P(ω_2 | x) > P(ω_1 | x). The decision boundary is the set of points where P(ω_1 | x) = P(ω_2 | x).

(Figure: the class-conditional densities for class 1 and class 2.)
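Before working the boundary out analytically (next slides), it can be located numerically by finding the root of g(x) = g_1(x) - g_2(x) for this example; the value should match the analytic result below.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

# Numerical check of the decision boundary x* for the example above.
P1, P2 = 0.6, 0.4
g = lambda x: (norm(0, 1).logpdf(x) + np.log(P1)) - (norm(1, 1).logpdf(x) + np.log(P2))

x_star = brentq(g, -5.0, 5.0)                 # root of g_1(x) - g_2(x) = 0
print(x_star, 0.5 + np.log(P1 / P2))          # both are approx 0.905
```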

Decision regions example

Let us begin with the decision boundary:

    P(ω_1 | x) = P(ω_2 | x)  ⟺  p(x | ω_1) P(ω_1) = p(x | ω_2) P(ω_2),

where we used Bayes' formula and multiplied by p(x). Taking logarithms,

    ln[p(x | ω_1) P(ω_1)] = ln[p(x | ω_2) P(ω_2)]
    ⟺  -x²/2 + ln 0.6 = -(x - 1)²/2 + ln 0.4
    ⟺  -x² + 2 ln 0.6 = -x² + 2x - 1 + 2 ln 0.4.

The decision boundary is therefore x* = 1/2 + ln 0.6 - ln 0.4 ≈ 0.91, and

    R_1 = {x : x < x*},   R_2 = {x : x > x*}.

(Figure: the posteriors P(ω_1 | x) and P(ω_2 | x) and the resulting decision regions for class 1 and class 2.)

The normal density

Univariate case:

    p(x) = (1/(√(2π) σ)) exp[-(1/2) ((x - µ)/σ)²],

where the parameter σ > 0. Multivariate case:

    p(x) = (1/((2π)^{d/2} √(det Σ))) exp[-(1/2) (x - µ)^T Σ^{-1} (x - µ)],

where x is a d-dimensional column vector and Σ is a positive definite matrix. (For a positive definite matrix Σ, x^T Σ x > 0 for all x ≠ 0.)

The normal density - properties

The normal density has several properties which give it a special position among probability densities. To a large extent this is due to analytical tractability, as we shall soon see, but there are also other reasons for favoring normal densities. X ∼ N(µ, Σ) stands for "X is a RV having the normal density with parameters µ, Σ". The expected value of X is E[X] = µ. The (co)variance of X is Var[X] = Σ.

The normal density - properties

Let X = [X_1, ..., X_d]^T ∼ N(µ, Σ). Then X_i ∼ N(µ_i, σ_ii). Let A and B be d × d matrices. Then AX ∼ N(Aµ, AΣA^T), and AX and BX are independent if and only if AΣB^T is the zero matrix. The sum of two (jointly) normally distributed RVs is also normally distributed. Central Limit Theorem: the (suitably normalized) sum of n independent, identically distributed (i.i.d.) RVs tends to a normally distributed RV as n approaches infinity.

DFs for the normal density

The minimum error rate classifier can be represented by the discriminant functions (DFs)

    g_i(x) = ln p(x | ω_i) + ln P(ω_i),   i = 1, ..., c.

Letting p(x | ω_i) = N(µ_i, Σ_i), we obtain

    g_i(x) = -(1/2) (x - µ_i)^T Σ_i^{-1} (x - µ_i) - (d/2) ln 2π - (1/2) ln det(Σ_i) + ln P(ω_i).

DFs for the normal density: Σ_i = σ²I

Σ_i = σ²I: the features are independent and each feature has the same variance σ². Geometrically, this corresponds to the situation in which the samples fall in equally-sized (hyper)spherical clusters, and the cluster for the ith class is centered around µ_i. We will now derive linear DFs equivalent to the DFs above. 1) All constant terms, such as -(d/2) ln 2π, can be dropped; dropping them does not affect the classification result. 2) In this particular case the determinants of the covariance matrices all have the same value (σ^{2d}), so the determinant term can be dropped as well. 3) Σ^{-1} = (1/σ²) I, and hence

    g_i(x) = -||x - µ_i||² / (2σ²) + ln P(ω_i).

We (sloppily) use the = sign to indicate that the discriminant functions are equivalent.

DFs for the normal density: Σ_i = σ²I

    g_i(x) = -||x - µ_i||² / (2σ²) + ln P(ω_i)
           = -(1/(2σ²)) (x - µ_i)^T (x - µ_i) + ln P(ω_i)
           = -(1/(2σ²)) (x^T x - 2 µ_i^T x + µ_i^T µ_i) + ln P(ω_i).

From the last expression we see that the quadratic term x^T x is the same for all categories and can be dropped. Hence, we obtain the equivalent linear discriminant functions

    g_i^linear(x) = (1/σ²) (µ_i^T x - (1/2) µ_i^T µ_i) + ln P(ω_i).

Minimum distance classifier

If we assume that all P(ω_i) are equal and Σ_i = σ²I, we obtain the minimum distance classifier. Note that equal priors means P(ω_i) = 1/c. The name of the classifier follows from the set of discriminant functions used: g_i(x) = -||x - µ_i||. Hence, a feature vector is assigned to the category with the nearest mean. Note that a minimum distance classifier can also be implemented as a linear classifier.

DFs for the normal density: Σ_i = Σ

Consider a slightly more complicated model. All the covariance matrices still have the same value, but there may exist some correlation between the features. Also in this case we have a linear classifier:

    g_i(x) = w_i^T x + w_i0,

where w_i = Σ^{-1} µ_i and w_i0 = -(1/2) µ_i^T Σ^{-1} µ_i + ln P(ω_i).

DFs for the normal density: Σ_i arbitrary

It is time to consider the most general Gaussian model, where the features from each category are assumed to be normally distributed, but nothing more is assumed. In this case the discriminant functions

    g_i(x) = -(1/2) (x - µ_i)^T Σ_i^{-1} (x - µ_i) - (d/2) ln 2π - (1/2) ln det(Σ_i) + ln P(ω_i)

cannot be simplified much; only the constant term can be dropped. The discriminant functions are now necessarily quadratic, which means that the decision regions may have more complicated shapes than in the linear case.
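A small sketch of the shared-covariance (Σ_i = Σ) linear discriminant functions; the means, covariance, and priors are illustrative values, not from the lecture.

```python
import numpy as np

# Linear discriminant functions for Gaussian classes with a shared covariance matrix.
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]     # mu_1, mu_2 (illustrative)
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
priors = np.array([0.6, 0.4])

Sigma_inv = np.linalg.inv(Sigma)
W = [Sigma_inv @ m for m in means]                       # w_i = Sigma^{-1} mu_i
w0 = [-0.5 * m @ Sigma_inv @ m + np.log(p)               # w_i0 = -1/2 mu_i^T Sigma^{-1} mu_i + ln P(omega_i)
      for m, p in zip(means, priors)]

def classify(x):
    g = [w @ x + b for w, b in zip(W, w0)]               # g_i(x) = w_i^T x + w_i0
    return int(np.argmax(g)) + 1                         # 1-based class index

print(classify(np.array([1.5, 0.5])))
```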

BDT - Discrete features

In many practical applications the feature space is discrete; the components of the feature vectors are then binary or higher-integer valued. This simply means that the integrals of the continuous case must be replaced with sums, and probability densities must be replaced with probabilities. For example, the minimum error rate classification rule becomes: decide ω_i if

    P(ω_i | x) = P(x | ω_i) P(ω_i) / P(x) > P(ω_j | x) for all j ≠ i.

Independent binary features

Consider the two-category problem, where the feature vectors x = [x_1, ..., x_d]^T are binary, i.e. each x_i is either 0 or 1. Assume further that the features are (conditionally) independent, that is,

    P(x | ω_j) = Π_{i=1}^{d} P(x_i | ω_j).

Denote p_i = P(x_i = 1 | ω_1) and q_i = P(x_i = 1 | ω_2). Then

    P(x | ω_1) = Π_{i=1}^{d} p_i^{x_i} (1 - p_i)^{1-x_i},   P(x | ω_2) = Π_{i=1}^{d} q_i^{x_i} (1 - q_i)^{1-x_i}.

Use the discriminant functions g_1(x) = ln P(ω_1 | x) and g_2(x) = ln P(ω_2 | x). Recall that g(x) = g_1(x) - g_2(x), and that the classifier then studies the sign of g(x). By Bayes' rule, this leads to

    g(x) = ln [P(x | ω_1) / P(x | ω_2)] + ln [P(ω_1) / P(ω_2)]
         = Σ_{i=1}^{d} [ x_i ln(p_i / q_i) + (1 - x_i) ln((1 - p_i) / (1 - q_i)) ] + ln [P(ω_1) / P(ω_2)]
         = Σ_{i=1}^{d} x_i ln [p_i (1 - q_i) / (q_i (1 - p_i))] + Σ_{i=1}^{d} ln [(1 - p_i) / (1 - q_i)] + ln [P(ω_1) / P(ω_2)].

Note that we have a linear machine: g(x) = Σ_{i=1}^{d} w_i x_i + w_0. The magnitude of w_i determines the importance of a "yes" (1) answer for x_i. If p_i = q_i, the value of x_i gives no information about the class. The prior probabilities appear only in the bias term w_0: increasing P(ω_1) biases the decision in favor of ω_1.
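A sketch of the resulting linear machine for independent binary features; the probabilities p_i, q_i and the priors are illustrative.

```python
import numpy as np

# Two-category classifier for d independent binary features.
# p[i] = P(x_i = 1 | omega_1), q[i] = P(x_i = 1 | omega_2); illustrative values.
p = np.array([0.8, 0.6, 0.3])
q = np.array([0.4, 0.5, 0.7])
P1, P2 = 0.5, 0.5

w = np.log(p * (1 - q) / (q * (1 - p)))                    # weights w_i
w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(P1 / P2)   # bias term w_0

def decide(x):
    g = w @ x + w0                                         # g(x) = sum_i w_i x_i + w_0
    return 1 if g > 0 else 2                               # decide omega_1 if g(x) > 0

print(decide(np.array([1, 1, 0])), decide(np.array([0, 0, 1])))
```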

Receiver operating characteristic

For signal detection theory and the receiver operating characteristic (ROC), see milos/courses/cs2750/lectures/class9.pdf.

BDT - context

So far we have assumed that our interest is in classifying a single object at a time. However, in applications we may need to classify several objects at the same time; image segmentation is one example. If we assume that the class of one object is independent of the classes of the remaining objects, nothing changes. If there is some dependence, then the basic principles remain the same, i.e. we assign objects to the most probable categories. But now we also need to take the categories of the other objects into account and place all objects in such categories that the probability of the whole ensemble is maximized. Computational difficulties follow.
