Statistics 202: Data Mining. c Jonathan Taylor. Week 6 Based in part on slides from textbook, slides of Susan Holmes. October 29, / 1

Size: px

Start display at page:

Download "Statistics 202: Data Mining. c Jonathan Taylor. Week 6 Based in part on slides from textbook, slides of Susan Holmes. October 29, / 1"

Douglas Lester
6 years ago
Views:

1 Week 6 Based in part on slides from textbook, slides of Susan Holmes October 29, / 1

2 Part I Other classification techniques 2 / 1

3 Rule based classifiers Rule based classifiers Examples: if Refund==No and Marital_Status==Married = Cheat=No if Blood_Type==Warm (Lay_Eggs==Yes) = Class=Birds LHS=rule antecedent or condition RHS=rule consequent 3 / 1

4 Rule based classifiers Rule-based Classifier (Example) R1: (Give Birth = no)! (Can Fly = yes) " Birds R2: (Give Birth = no)! (Live in Water = yes) " Fishes R3: (Give Birth = yes)! (Blood Type = warm) " Mammals R4: (Give Birth = no)! (Can Fly = no) " Reptiles R5: (Live in Water = sometimes) " Amphibians 4 / 1

5 Rule based classifiers Rule based classifiers A rule covers an case or instance, if the features satisfy the antecedent or condition. The coverage of a rule on a dataset is the fraction of cases the rule covers. The accuracy of a rule is the fraction of cases it covers for which it the consequent agrees with the label. Ideal rules are rules with high coverage and high accuracy. 5 / 1

6 Rule based classifiers overage and Accuracy rage of a rule: raction of records at satisfy the tecedent of a rule racy of a rule: raction of records at satisfy both the tecedent and nsequent of a le (Status=Single)! No Coverage = 40%, Accuracy = 50% Kumar Introduction to 4/18/ / 1

7 Rule based classifiers How does Rule-based Classifier Work? R1: (Give Birth = no)! (Can Fly = yes) " Birds R2: (Give Birth = no)! (Live in Water = yes) " Fishes R3: (Give Birth = yes)! (Blood Type = warm) " Mammals R4: (Give Birth = no)! (Can Fly = no) " Reptiles R5: (Live in Water = sometimes) " Amphibians A lemur triggers rule R3, so it is classified as a mammal A turtle triggers both R4 and R5 A dogfish shark triggers none of the rules 7 / 1

8 Rule based classifiers Rule based classifiers A rule-based classifier is determined by a set of rules. The rules are mutually exclusive if each case is covered by one and only one rule. The rules are exhaustive if each case is covered by at least one rule. A decision tree is an example of a rule-based classifier. 8 / 1

9 Rule based classifiers 9 / 1

10 es To Rules Rule based classifiers 10 / 1

11 Rule based classifiers Rules Can Be Simplified Initial Rule: (Refund=No)! (Status=Married) " No We can simplify the Married==No rule. Simplified Rule: (Status=Married) " No Tan,Steinbach, Kumar Introduction to 4/18/ / 1

12 Rule based classifiers Possible issues Rules may not be mutually exclusive: they can be ordered, or use majority rule. Rules may not be exhaustive: use a default class. Finding rules Try to extract directly from data (see PRIM in ESL). Extracted from a decision tree (some tree software such as C4.5 does this). 12 / 1

13 PRIM Learning (Patient a rule Rule Induction) 13 / 1 Tan,Steinbach, Kumar Introduction to 4/18/

14 Removing a discovered rule to find additional rules 14 / 1

15 Rule based classifiers Evaluating a rule Accuracy(R)= 1 - Misclassification Error(R) = nc(r) n(r) Laplace(R)= nc(r)+1 n(r)+k where k is the number of classes... M-estimate(R)= nc(r)+αp i(r) n(r)+α where i(r) is the majority class and p i is a set of prior probabilities on each of the k classes and α > 0 is some prior weight. The Laplace rule is effectively a posterior estimate of the probability of an instance belonging to a class based on a prior probability / 1

16 Rule based classifiers Properties of rule-based classifiers Flexible like decision trees (even more flexible). As expressive as decision trees. Easy to classify new cases (once ordering established). Can be extracted from decision trees. Disadvantage: there is no clear loss function they are optimizing / 1

17 Nearest-neighbour classifiers Nearest-neighbour classifiers For each x find k nearest neighbours in the cases data matrix X. Let k (x) denote the majority rule among the k-nearest neighbours. Assign f (x) = k (x). Easy to describe, but can be expensive to compute many nearest neighbours / 1

18 Nearest Neighbor Classifiers Nearest neighbour classifier! Basic idea: If it walks like a duck, quacks like a duck, then it s probably a duck Compute Distance Test Record Training Records Choose k of the nearest records Tan,Steinbach, Kumar Introduction to 4/18/ / 1

19 o large, Choice neighborhood of k may include points fr lasses 19 / 1

20 Nearest neighbour classifier k = 1 20 / 1

21 Nearest neighbour classifier k = 3 21 / 1

22 Nearest neighbour classifier k = / 1

23 Nearest neighbour classifier k = / 1

Linear Discriminant Analysis using (sepal.width, sepal.length) 5.0 4.

24 Linear Discriminant Analysis using (sepal.width, sepal.length) / 1

25 Naive Bayes classifiers Bayesian classifiers In LDA / QDA, we used Bayes Theorem to classify. Formally, given a partition E 1,..., E k Bayes Theorem says P(E i A) = P(A E i)p(e i ) P(A) P(A E i )P(E i ) = k j=1 P(A E j)p(e j ) We applied this to the events E i = {Y = c} and A = {X = x} yielding P(Y = c X = x) = P(X = x Y = c)p(y = c) k j=1 P(X = x Y = j)p(y = j) P(X = x Y = c)p(y = c) 25 / 1

26 Naive Bayes classifiers Naive Bayes classifiers A Naive Bayes classifier applies Bayes Theorem based on several features x = (x 1,..., x p ) P(Y = c X 1 = x 1,..., X p = x p ) P(X 1 = x 1,..., X p = x p Y = c)p(y = c) To use the classifier, we need to estimate the RHS above. 26 / 1

27 Naive Bayes classifiers Naive Bayes classifiers The Naive Bayes classifier assumes that the features are independent given the class label. Therefore, P(Y = c X 1 = x 1,..., X p = x p ( p ) P(X l = x l Y = c) P(Y = c) l=1 Fits a discriminant model within each feature. For continuous features, typically a 1-dimensional QDA model is used (i.e. Gaussian within each class). 27 / 1

28 Naive Bayes classifiers Laplace smoothing For discrete features, it uses a discrete estimate P(X j = l Y = c) = # {ix ij = l, Y i = c}. # {Y i = c} The discrete estimates have the problem that if no instances of class c had feature X j = l then any new instances x with x j = l cannot be classified to class c. One solution: use the Laplace smoothed probabilities P(X j = l Y = c) = # {i : X ij = l, Y i = c} + α. # {Y i = c} + α k 28 / 1

29 Naive Bayes classifier 29 / 1

Quadratic Discriminant Analysis using (sepal.width, sepal.length) 5.

30 Quadratic Discriminant Analysis using (sepal.width, sepal.length) / 1

31 Naive Bayes classifiers Properties Features may not be conditionally independent, which may force some bias. Continuous features may not be Gaussian (LDA suffers from this also... ) Can be slow to predict new instances (at least e1071 s implementation). 31 / 1

Classification Part 2. Overview

Classification Part 2. Overview Classification Part 2 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Overview Rule based Classifiers Nearest-neighbor Classifiers Data Mining