Lecture Notes for Chapter 5. Introduction to Data Mining


1 Data Mining Classification: Alternative Techniques Lecture Notes for Chapter 5 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

2 Rule-Based Classifier Classify records by using a collection of if...then... rules Rule: (Condition) → y, where Condition is a conjunction of attribute tests and y is the class label LHS: rule antecedent or condition RHS: rule consequent Condition = (A1 op v1) ∧ (A2 op v2) ∧ ... ∧ (Ak op vk) Examples of classification rules: (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 2

3 Rule-based Classifier (Example) Name Blood Type Give Birth Can Fly Live in Water Class human warm yes no no mammals python cold no no no reptiles salmon cold no no yes fishes whale warm yes no yes mammals frog cold no no sometimes amphibians komodo cold no no no reptiles bat warm yes yes no mammals pigeon warm no yes no birds cat warm yes no no mammals leopard shark cold yes no yes fishes turtle cold no no sometimes reptiles penguin warm no no sometimes birds porcupine warm yes no no mammals eel cold no no yes fishes salamander cold no no sometimes amphibians gila monster cold no no no reptiles platypus warm no no no mammals owl warm no yes no birds dolphin warm yes no yes mammals eagle warm no yes no birds R1: (Give Birth = no) (Can Fly = yes) Birds R2: (Give Birth = no) (Live in Water = yes) Fishes R3: (Give Birth = yes) (Blood Type = warm) Mammals R4: (Give Birth = no) (Can Fly = no) Reptiles R5: (Live in Water = sometimes) Amphibians Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 3

4 Application of Rule-Based Classifier A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule R1: (Give Birth = no) (Can Fly = yes) Birds R2: (Give Birth = no) (Live in Water = yes) Fishes R3: (Give Birth = yes) (Blood Type = warm) Mammals R4: (Give Birth = no) (Can Fly = no) Reptiles R5: (Live in Water = sometimes) Amphibians Name Blood Type Give Birth Can Fly Live in Water Class hawk warm no yes no? grizzly bear warm yes no no? The rule R1 covers a hawk => Bird The rule R3 covers the grizzly bear => Mammal Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 4

5 10 Rule Coverage and Accuracy Coverage of a rule: Fraction of records that satisfy the antecedent of a rule Accuracy of a rule: Fraction of records that satisfy both the antecedent and consequent of a rule Tid Refund Marital Status Taxable Income Class 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes Ex: (Status=Single) No Coverage = 40%, Accuracy = 50% Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 5

6 How does Rule-based Classifier Work? R1: (Give Birth = no) (Can Fly = yes) Birds R2: (Give Birth = no) (Live in Water = yes) Fishes R3: (Give Birth = yes) (Blood Type = warm) Mammals R4: (Give Birth = no) (Can Fly = no) Reptiles R5: (Live in Water = sometimes) Amphibians Name Blood Type Give Birth Can Fly Live in Water Class lemur warm yes no no? turtle cold no no sometimes? dogfish shark cold yes no yes? A lemur triggers rule R3, so it is classified as a mammal A turtle triggers both R4 and R5 A dogfish shark triggers none of the rules Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 6

7 Characteristics of Rule-Based Classifier Mutually exclusive rules Classifier contains mutually exclusive rules if the rules are independent of each other Every record is covered by at most one rule Exhaustive rules Classifier has exhaustive coverage if it accounts for every possible combination of attribute values Each record is covered by at least one rule Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 7

8 From Decision Trees To Rules Classification Rules Yes NO Refund {Single, Divorced} Taxable Income No Marital Status < 80K > 80K {Married} NO (Refund=Yes) ==> No (Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No (Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes (Refund=No, Marital Status={Married}) ==> No NO YES Rules are mutually exclusive and exhaustive Rule set contains as much information as the tree Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 8

9 10 Rules Can Be Simplified Yes NO NO Refund {Single, Divorced} Taxable Income No Marital Status < 80K > 80K YES {Married} NO Tid Refund Marital Status Taxable Income 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No Cheat 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes Initial Rule: (Refund=No) (Status=Married) No Simplified Rule: (Status=Married) No Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 9

10 Effect of Rule Simplification Rules are no longer mutually exclusive A record may trigger more than one rule Solution? Ordered rule set Unordered rule set use voting schemes Rules are no longer exhaustive A record may not trigger any rules Solution? Use a default class Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

11 Ordered Rule Set Rules are rank ordered according to their priority An ordered rule set is known as a decision list When a test record is presented to the classifier It is assigned to the class label of the highest ranked rule it has triggered If none of the rules fired, it is assigned to the default class R1: (Give Birth = no) (Can Fly = yes) Birds R2: (Give Birth = no) (Live in Water = yes) Fishes R3: (Give Birth = yes) (Blood Type = warm) Mammals R4: (Give Birth = no) (Can Fly = no) Reptiles R5: (Live in Water = sometimes) Amphibians Name Blood Type Give Birth Can Fly Live in Water Class turtle cold no no sometimes? Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

12 Rule Ordering Schemes Rule-based ordering Individual rules are ranked based on their quality Class-based ordering Rules that belong to the same class appear together Rule-based Ordering (Refund=Yes) ==> No (Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No (Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes (Refund=No, Marital Status={Married}) ==> No Class-based Ordering (Refund=Yes) ==> No (Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No (Refund=No, Marital Status={Married}) ==> No (Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

13 Building Classification Rules Direct Method: Extract rules directly from data, e.g. RIPPER, CN2, Holte's 1R. Holte's 1R Algorithm: for each attribute, for each value of that attribute, find the class distribution for attribute=value; conc = most frequent class; make rule: attribute=value -> conc; calculate the error rate of the rule set; select the rule set with the lowest error rate. Indirect Method: Extract rules from other classification models (e.g. decision trees, neural networks), e.g. C4.5rules Tan,Steinbach, Kumar Introduction to Data Mining 4/18/
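
A minimal Python sketch of Holte's 1R as described in the pseudocode above; the function name, the dictionary-of-records data layout, and the toy animal data are illustrative assumptions rather than part of the original algorithm description.

    from collections import Counter, defaultdict

    def one_r(records, class_key):
        """Holte's 1R: build one rule set per attribute, keep the one with the lowest error rate."""
        best_rules, best_errors = None, float("inf")
        attributes = [a for a in records[0] if a != class_key]
        for attr in attributes:
            dist = defaultdict(Counter)          # class distribution for each value of this attribute
            for r in records:
                dist[r[attr]][r[class_key]] += 1
            # rule: attribute=value -> most frequent class; errors = covered records not in that class
            rules = {v: counts.most_common(1)[0][0] for v, counts in dist.items()}
            errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in dist.values())
            if errors < best_errors:
                best_rules, best_errors = (attr, rules), errors
        return best_rules, best_errors / len(records)

    # toy example in the spirit of the vertebrate data
    animals = [
        {"Blood Type": "warm", "Give Birth": "yes", "Class": "mammals"},
        {"Blood Type": "cold", "Give Birth": "no", "Class": "reptiles"},
        {"Blood Type": "warm", "Give Birth": "no", "Class": "birds"},
    ]
    print(one_r(animals, "Class"))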

14 Direct Method: Sequential Covering 1. Start from an empty rule 2. Grow a rule using the Learn-One-Rule function 3. Remove training records covered by the rule 4. Repeat Step (2) and (3) until stopping criterion is met Tan,Steinbach, Kumar Introduction to Data Mining 4/18/
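
A sketch of this loop in Python; learn_one_rule is left abstract (any Learn-One-Rule strategy, such as the general-to-specific growing shown later), and the covers helper, the rule representation, and the minimum-coverage stopping criterion are illustrative assumptions.

    def covers(rule, record):
        """A rule's antecedent is assumed to be a dict of attribute tests, e.g. {"Give Birth": "no"}."""
        return all(record.get(a) == v for a, v in rule["antecedent"].items())

    def sequential_covering(records, target_class, learn_one_rule, min_coverage=1):
        """Greedily learn rules for target_class, removing covered records after each rule."""
        rules, remaining = [], list(records)
        while any(r["Class"] == target_class for r in remaining):
            rule = learn_one_rule(remaining, target_class)            # step 2: grow a rule
            covered = [r for r in remaining if covers(rule, r)]       # records satisfying the antecedent
            if len(covered) < min_coverage:                           # assumed stopping criterion
                break
            rules.append(rule)
            remaining = [r for r in remaining if not covers(rule, r)] # step 3: remove covered records
        return rules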

15 Example of Sequential Covering (i) Original Data (ii) Step 1 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

16 Example of Sequential Covering R1 R1 R2 (iii) Step 2 (iv) Step 3 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

17 Aspects of Sequential Covering Rule Growing Instance Elimination Rule Evaluation Stopping Criterion Rule Pruning Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

18 Rule Growing Two common strategies { } Yes: 3 No: 4 Refund=No, Status=Single, Income=85K (Class=Yes) Refund=No, Status=Single, Income=90K (Class=Yes) Refund= No Yes: 3 No: 4 Status = Single Yes: 2 No: 1 Status = Divorced Yes: 1 No: 0 Status = Married Yes: 0 No: 3... Income > 80K Yes: 3 No: 1 Refund=No, Status = Single (Class = Yes) (a) General-to-specific (b) Specific-to-general Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

19 Rule Growing (Examples) CN2 Algorithm: Start from an empty conjunct: {} Add the conjunct that minimizes the entropy measure: {A}, {A,B}, ... Determine the rule consequent by taking the majority class of instances covered by the rule RIPPER Algorithm: Start from an empty rule: {} => class Add the conjunct that maximizes FOIL's information gain measure: R0: {} => class (initial rule) R1: {A} => class (rule after adding conjunct) Gain(R0, R1) = t × [ log2( p1/(p1+n1) ) − log2( p0/(p0+n0) ) ] where t: number of positive instances covered by both R0 and R1 p0: number of positive instances covered by R0 n0: number of negative instances covered by R0 p1: number of positive instances covered by R1 n1: number of negative instances covered by R1 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/
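
FOIL's information gain written out as a small Python helper; the function name is illustrative. Using the counts from the rule-growing figure on the previous slide (Refund=No covers 3 positives and 4 negatives, adding Status=Single leaves 2 positives and 1 negative), the gain is about 1.27 bits.

    from math import log2

    def foil_gain(p0, n0, p1, n1):
        """Gain from extending rule R0 (p0 pos / n0 neg covered) to R1 (p1 pos / n1 neg covered).
        t, the positives covered by both rules, equals p1 when R1 specializes R0."""
        t = p1
        return t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

    print(round(foil_gain(p0=3, n0=4, p1=2, n1=1), 2))  # ~1.27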

20 Aspects of Sequential Covering Rule Growing Instance Elimination Rule Evaluation Stopping Criterion Rule Pruning Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

21 Instance Elimination Why do we need to eliminate instances? Otherwise, the next rule is identical to previous rule Why do we remove positive instances? Ensure that the next rule is different Why do we remove negative instances? Prevent underestimating accuracy of rule Compare rules R2 and R3 in the diagram class = + class = - R1 R3 R Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

22 Aspects of Sequential Covering Rule Growing Instance Elimination Rule Evaluation Stopping Criterion Rule Pruning Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

23 Rule Evaluation Metrics: Accuracy = nc / n Laplace = (nc + 1) / (n + k) M-estimate = (nc + k·p) / (n + k) FOIL's gain = pos' × [ log2( pos'/(pos'+neg') ) − log2( pos/(pos+neg) ) ] pos/neg are the # of positive/negative tuples covered by the rule R; pos'/neg' are those covered by the extended rule R' n : number of instances covered by rule nc : number of instances covered by rule that belong to the rule's class k : number of classes p : prior probability Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

24 Aspects of Sequential Covering Rule Growing Instance Elimination Rule Evaluation Stopping Criterion Rule Pruning Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

25 Stopping Criterion and Rule Pruning Stopping criterion: Compute the gain; if the gain is not significant, discard the new rule Rule Pruning: Similar to post-pruning of decision trees Metric: FOIL_Prune(R) = (pos − neg) / (pos + neg) Reduced Error Pruning: Remove one of the conjuncts in the rule Compare the error rate on the validation set before and after pruning If the error improves, prune the conjunct Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

26 Summary of Direct Method Grow a single rule Remove Instances from rule Prune the rule (if necessary) Add rule to Current Rule Set Repeat Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

27 Direct Method: RIPPER For 2-class problem, choose one of the classes as positive class, and the other as negative class Learn rules for positive class Negative class will be default class For multi-class problem Order the classes according to increasing class prevalence (fraction of instances that belong to a particular class) Learn the rule set for smallest class first, treat the rest as negative class Repeat with next smallest class as positive class Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

28 Direct Method: RIPPER Growing a rule: Start from empty rule Add conjuncts as long as they improve FOIL s information gain Stop when rule no longer covers negative examples Prune the rule immediately using incremental reduced error pruning Measure for pruning: v = (p-n)/(p+n) p: number of positive examples covered by the rule in the validation set n: number of negative examples covered by the rule in the validation set Pruning method: delete any final sequence of conditions that maximizes v Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

29 Direct Method: RIPPER Building a Rule Set: Use sequential covering algorithm Finds the best rule that covers the current set of positive examples Eliminate both positive and negative examples covered by the rule Each time a rule is added to the rule set, compute the new description length stop adding new rules when the new description length is d bits longer than the smallest description length obtained so far Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

30 Direct Method: RIPPER Optimize the rule set: For each rule r in the rule set R Consider 2 alternative rules: Replacement rule (r*): grow new rule from scratch Revised rule(r ): add conjuncts to extend the rule r Compare the rule set for r against the rule set for r* and r Choose rule set that minimizes MDL principle Repeat rule generation and rule optimization for the remaining positive examples Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

31 Indirect Methods P Q No Yes R Rule Set No Yes No Yes Q No Yes - + r1: (P=No,Q=No) ==> - r2: (P=No,Q=Yes) ==> + r3: (P=Yes,R=No) ==> + r4: (P=Yes,R=Yes,Q=No) ==> - r5: (P=Yes,R=Yes,Q=Yes) ==> + Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

32 Indirect Method: C4.5rules Extract rules from an unpruned decision tree For each rule r: A → y, consider an alternative rule r': A' → y, where A' is obtained by removing one of the conjuncts in A Compare the pessimistic error rate of r against that of each r' Prune if one of the r' has a lower pessimistic error rate Repeat until we can no longer improve the generalization error Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

33 Indirect Method: C4.5rules Instead of ordering the rules, order subsets of rules (class ordering) Each subset is a collection of rules with the same rule consequent (class) Compute description length of each subset Description length = L(error) + g L(model) g is a parameter that takes into account the presence of redundant attributes in a rule set (default value = 0.5) Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

34 Example Name Give Birth Lay Eggs Can Fly Live in Water Have Legs Class human yes no no no yes mammals python no yes no no no reptiles salmon no yes no yes no fishes whale yes no no yes no mammals frog no yes no sometimes yes amphibians komodo no yes no no yes reptiles bat yes no yes no yes mammals pigeon no yes yes no yes birds cat yes no no no yes mammals leopard shark yes no no yes no fishes turtle no yes no sometimes yes reptiles penguin no yes no sometimes yes birds porcupine yes no no no yes mammals eel no yes no yes no fishes salamander no yes no sometimes yes amphibians gila monster no yes no no yes reptiles platypus no yes no no yes mammals owl no yes yes no yes birds dolphin yes no no yes no mammals eagle no yes yes no yes birds Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

35 C4.5 versus C4.5rules versus RIPPER Give Birth? C4.5rules: (Give Birth=No, Can Fly=Yes) Birds Yes No (Give Birth=No, Live in Water=Yes) Fishes (Give Birth=Yes) Mammals Mammals Live In Water? (Give Birth=No, Can Fly=No, Live in Water=No) Reptiles ( ) Amphibians Yes No RIPPER: (Live in Water=Yes) Fishes Sometimes (Have Legs=No) Reptiles Fishes Amphibians Can Fly? (Give Birth=No, Can Fly=No, Live In Water=No) Reptiles (Can Fly=Yes,Give Birth=No) Birds Yes No () Mammals Birds Reptiles Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

36 C4.5 versus C4.5rules versus RIPPER [Figure: two confusion matrices, one for C4.5/C4.5rules and one for RIPPER, with actual class in the rows and predicted class in the columns over Amphibians, Fishes, Reptiles, Birds, Mammals] Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

37 Advantages of Rule-Based Classifiers As highly expressive as decision trees Easy to interpret Easy to generate Can classify new instances rapidly Performance comparable to decision trees Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

38 Instance-Based Classifiers Set of Stored Cases Atr1... AtrN Class A B B C A C B Store the training records Use training records to predict the class label of unseen cases Unseen Case Atr1... AtrN Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

39 Instance Based Classifiers Examples: Rote-learner Memorizes entire training data and performs classification only if attributes of record match one of the training examples exactly Nearest neighbor Uses k closest points (nearest neighbors) for performing classification Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

40 Nearest Neighbor Classifiers Basic idea: If it walks like a duck, quacks like a duck, then it s probably a duck Compute Distance Test Record Training Records Choose k of the nearest records Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

41 Nearest-Neighbor Classifiers Unknown record Requires three things The set of stored records Distance Metric to compute distance between records The value of k, the number of nearest neighbors to retrieve To classify an unknown record: Compute distance to other training records Identify k nearest neighbors Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by taking majority vote) Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

42 Definition of Nearest Neighbor X X X (a) 1-nearest neighbor (b) 2-nearest neighbor (c) 3-nearest neighbor K-nearest neighbors of a record x are data points that have the k smallest distance to x Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

43 1 nearest-neighbor Voronoi Diagram Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

44 Nearest Neighbor Classification Compute distance between two points, e.g. Euclidean distance: d(p, q) = sqrt( Σ_i (p_i − q_i)² ) Determine the class from the nearest neighbor list: take the majority vote of class labels among the k-nearest neighbors Weigh the vote according to distance, e.g. weight factor w = 1/d² Tan,Steinbach, Kumar Introduction to Data Mining 4/18/
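
A compact k-nearest-neighbor sketch using the Euclidean distance and the distance-weighted vote w = 1/d² described above; plain Python, with the tiny 2-D dataset an illustrative assumption.

    from collections import defaultdict
    from math import sqrt

    def knn_predict(train, query, k=3):
        """train: list of (feature_vector, label); distance-weighted majority vote of the k closest."""
        dists = sorted((sqrt(sum((p - q) ** 2 for p, q in zip(x, query))), y) for x, y in train)
        votes = defaultdict(float)
        for d, label in dists[:k]:
            votes[label] += 1.0 / (d ** 2 + 1e-12)   # w = 1/d^2; tiny epsilon avoids division by zero
        return max(votes, key=votes.get)

    train = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((4.0, 4.2), "-"), ((4.5, 3.8), "-")]
    print(knn_predict(train, (1.1, 1.0), k=3))  # "+"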

45 Nearest Neighbor Classification Choosing the value of k: If k is too small, sensitive to noise points If k is too large, neighborhood may include points from other classes X Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

46 Nearest Neighbor Classification Scaling issues Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes Example: height of a person may vary from 1.5m to 1.8m weight of a person may vary from 90lb to 300lb income of a person may vary from $10K to $1M Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

47 Nearest Neighbor Classification Problem with Euclidean measure: High dimensional data: curse of dimensionality Can produce counter-intuitive results Solution: Normalize the vectors to unit length Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

48 Nearest neighbor Classification k-NN classifiers are lazy learners They do not build models explicitly Unlike eager learners such as decision tree induction and rule-based systems Classifying unknown records is relatively expensive Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

49 Example: PEBLS PEBLS: Parallel Exemplar-Based Learning System (Cost & Salzberg) Works with both continuous and nominal features For nominal features, the distance between two nominal values is computed using the modified value difference metric (MVDM) Each record is assigned a weight factor Number of nearest neighbors: k = 1 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

50 Example: PEBLS Tid Refund Marital Status Taxable Income Cheat 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes Class counts per nominal value (Yes / No): Marital Status: Single 2/2, Married 0/4, Divorced 1/1; Refund: Yes 0/3, No 3/4 Distance between nominal attribute values: d(V1, V2) = Σ_i | n1i/n1 − n2i/n2 | d(Single, Married) = |2/4 − 0/4| + |2/4 − 4/4| = 1 d(Single, Divorced) = |2/4 − 1/2| + |2/4 − 1/2| = 0 d(Married, Divorced) = |0/4 − 1/2| + |4/4 − 1/2| = 1 d(Refund=Yes, Refund=No) = |0/3 − 3/7| + |3/3 − 4/7| = 6/7 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/
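
A sketch of the modified value difference metric computed from class counts, reproducing the d(Single, Married) style calculations above; the dictionary layout of the contingency counts is an illustrative assumption.

    def mvdm(counts_v1, counts_v2):
        """counts_vX: dict class -> number of records with that attribute value and class.
        d(V1, V2) = sum_i | n1i/n1 - n2i/n2 |"""
        n1, n2 = sum(counts_v1.values()), sum(counts_v2.values())
        classes = set(counts_v1) | set(counts_v2)
        return sum(abs(counts_v1.get(c, 0) / n1 - counts_v2.get(c, 0) / n2) for c in classes)

    # Marital Status class counts (Yes / No) from the 10-record table
    single, married, divorced = {"Yes": 2, "No": 2}, {"Yes": 0, "No": 4}, {"Yes": 1, "No": 1}
    print(mvdm(single, married))    # 1.0
    print(mvdm(single, divorced))   # 0.0
    print(mvdm(married, divorced))  # 1.0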

51 Example: PEBLS Tid Refund Marital Status Taxable Income Cheat X Yes Single 125K No Y No Married 100K No Distance between record X and record Y: Δ(X, Y) = w_X · w_Y · Σ_{i=1}^{d} d(X_i, Y_i)² where w_X = (number of times X is used for prediction) / (number of times X predicts correctly) w_X ≈ 1 if X makes accurate predictions most of the time w_X > 1 if X is not reliable for making predictions Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

52 Naïve Bayes Classifier Thomas Bayes Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

53 Antenna Length Grasshoppers Katydids Abdomen Length Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

54 Antenna Length With a lot of data, we can build a histogram. Let us just build one for Antenna Length for now Katydids Grasshoppers Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

55 We can leave the histograms as they are, or we can summarize them with two normal distributions. Let us use two normal distributions for ease of visualization in the following slides Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

56 P(c_j | d) = probability of class c_j, given that we have observed d P(Grasshopper | 3) = 10 / (10 + 2) = 0.833 P(Katydid | 3) = 2 / (10 + 2) = 0.167 Antennae length is 3 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

57 P(c_j | d) = probability of class c_j, given that we have observed d P(Grasshopper | 7) = 3 / (3 + 9) = 0.25 P(Katydid | 7) = 9 / (3 + 9) = 0.75 Antennae length is 7 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

58 P(c_j | d) = probability of class c_j, given that we have observed d P(Grasshopper | 5) = 6 / (6 + 6) = 0.5 P(Katydid | 5) = 6 / (6 + 6) = 0.5 Antennae length is 5? Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

59 Bayesian Theorem Given training data X, the posterior probability of a hypothesis H, P(H|X), follows the Bayes theorem: P(H|X) = P(X|H) P(H) / P(X) Informally, this can be written as posterior = likelihood × prior / evidence Predict that X belongs to C_i iff the probability P(C_i|X) is the highest among all the P(C_k|X) for all the k classes Practical difficulty: requires initial knowledge of many probabilities => significant computational cost Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

60 Example Name/Sex pairs: Chris Male, Claudia Female, Chris Female, Chris Female, Alberto Male, Karin Female, Nina Female, Sergio Male, Chris Female Is Chris a Male or Female? P(H|X) = P(X|H) P(H) / P(X) P(Male | Chris) = (1/3 × 3/9) / (4/9) = 0.25 P(Female | Chris) = (3/6 × 6/9) / (4/9) = 0.75 => Chris is a Female Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ /03/26 60

61 Bayes Classifier A probabilistic framework for solving classification problems Conditional Probability: P(C|A) = P(A,C) / P(A), P(A|C) = P(A,C) / P(C) Bayes theorem: P(C|A) = P(A|C) P(C) / P(A) Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

62 Bayesian Classifiers Consider each attribute and class label as random variables Given a record with attributes (A 1, A 2,,A n ) Goal is to predict class C Specifically, we want to find the value of C that maximizes P(C A 1, A 2,,A n ) Can we estimate P(C A 1, A 2,,A n ) directly from data? Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

63 Bayesian Classifiers Approach: compute the posterior probability P(C | A1, A2, ..., An) for all values of C using the Bayes theorem: P(C | A1, A2, ..., An) = P(A1, A2, ..., An | C) P(C) / P(A1, A2, ..., An) Choose the value of C that maximizes P(C | A1, A2, ..., An) Equivalent to choosing the value of C that maximizes P(A1, A2, ..., An | C) P(C) How to estimate P(A1, A2, ..., An | C)? Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

64 Naïve Bayes Classifier Assume independence among attributes Ai when the class is given: P(A1, A2, ..., An | Cj) = P(A1 | Cj) P(A2 | Cj) ... P(An | Cj) Estimating the joint probability of a full attribute combination, e.g. A1 (Height=170) and A2 (Weight=50), directly is difficult; estimating P(Ai | Cj) for each Ai and Cj separately is much easier A new point is classified to Cj if P(Cj) Π_i P(Ai | Cj) is maximal, since P(Cj | A1, A2, ..., An) = P(A1, A2, ..., An | Cj) P(Cj) / P(A1, A2, ..., An) Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

65 How to Estimate Probabilities from Data? Tid Refund (categorical) Marital Status (categorical) Taxable Income (continuous) Evade (class) 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes Class prior: P(C) = Nc / N, e.g., P(No) = 7/10, P(Yes) = 3/10 For discrete attributes: P(Ai | Ck) = |Aik| / Nc, where |Aik| is the number of instances having attribute value Ai and belonging to class Ck Examples: P(Status=Married | No) = 4/7, P(Refund=Yes | Yes) = 0 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/
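
A sketch of estimating P(C) and P(Ai | Ck) for discrete attributes by counting, as described above; the dictionary-of-records layout is an illustrative assumption, and no smoothing is applied yet, so unseen value/class pairs get probability 0 (the Laplace and m-estimate corrections a few slides later address this).

    from collections import Counter, defaultdict

    def fit_discrete_nb(records, class_key):
        """Return class priors P(C) = Nc/N and a lookup for conditionals P(Ai=v | Ck) = |Aik|/Nc."""
        n = len(records)
        class_counts = Counter(r[class_key] for r in records)
        cond_counts = defaultdict(Counter)              # (attribute, class) -> Counter over values
        for r in records:
            for attr, value in r.items():
                if attr != class_key:
                    cond_counts[(attr, r[class_key])][value] += 1
        priors = {c: nc / n for c, nc in class_counts.items()}
        def conditional(attr, value, c):
            return cond_counts[(attr, c)][value] / class_counts[c]
        return priors, conditional

    # On the 10-record tax table this reproduces P(No) = 7/10 and P(Status=Married | No) = 4/7.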

66 Training dataset Class: C1:buys_computer= yes C2:buys_computer= no Data sample: X (age<=30, Income=medium, Student=yes, Credit_rating=Fair) age income student credit_rating buys_computer <=30 high no fair no <=30 high no excellent no high no fair yes >40 medium no fair yes >40 low yes fair yes >40 low yes excellent no low yes excellent yes <=30 medium no fair no <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes medium no excellent yes high yes fair yes >40 medium no excellent no Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

67 Naïve Bayesian Classifier: Example X = (age<=30, income=medium, student=yes, credit_rating=fair) P(buys_computer=yes) = 9/14, P(buys_computer=no) = 5/14 Compute P(X|Ci) for each class: P(age=<=30 | buys_computer=yes) = 2/9 = 0.222 P(income=medium | buys_computer=yes) = 4/9 = 0.444 P(student=yes | buys_computer=yes) = 6/9 = 0.667 P(credit_rating=fair | buys_computer=yes) = 6/9 = 0.667 => P(X | buys_computer=yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044 P(age=<=30 | buys_computer=no) = 3/5 = 0.6 P(income=medium | buys_computer=no) = 2/5 = 0.4 P(student=yes | buys_computer=no) = 1/5 = 0.2 P(credit_rating=fair | buys_computer=no) = 2/5 = 0.4 => P(X | buys_computer=no) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019 P(X|Ci) * P(Ci): P(X | buys_computer=yes) * P(buys_computer=yes) = 0.044 x 9/14 = 0.028 P(X | buys_computer=no) * P(buys_computer=no) = 0.019 x 5/14 = 0.007 => X belongs to class buys_computer=yes Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

68 Example of Naïve Bayes Classifier Given a Test Record: X = (Refund=No, Married, Income=120K) Naive Bayes Classifier: P(Refund=Yes | No) = 3/7 P(Refund=No | No) = 4/7 P(Refund=Yes | Yes) = 0 P(Refund=No | Yes) = 1 P(Marital Status=Single | No) = 2/7 P(Marital Status=Divorced | No) = 1/7 P(Marital Status=Married | No) = 4/7 P(Marital Status=Single | Yes) = 2/3 P(Marital Status=Divorced | Yes) = 1/3 P(Marital Status=Married | Yes) = 0 For taxable income: If class=No: sample mean=110, sample variance=2975 If class=Yes: sample mean=90, sample variance=25 P(X | Class=No) = P(Refund=No | Class=No) x P(Married | Class=No) x P(Income=120K | Class=No) = 4/7 x 4/7 x 0.0072 = 0.0024 P(X | Class=Yes) = P(Refund=No | Class=Yes) x P(Married | Class=Yes) x P(Income=120K | Class=Yes) = 1 x 0 x P(Income=120K | Class=Yes) = 0 Since P(X|No)P(No) > P(X|Yes)P(Yes), therefore P(No|X) > P(Yes|X) => Class = No Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

69 Example of Naïve Bayes Classifier Name Give Birth Can Fly Live in Water Have Legs Class human yes no no yes mammals python no no no no non-mammals salmon no no yes no non-mammals whale yes no yes no mammals frog no no sometimes yes non-mammals komodo no no no yes non-mammals bat yes yes no yes mammals pigeon no yes no yes non-mammals cat yes no no yes mammals leopard shark yes no yes no non-mammals turtle no no sometimes yes non-mammals penguin no no sometimes yes non-mammals porcupine yes no no yes mammals eel no no yes no non-mammals salamander no no sometimes yes non-mammals gila monster no no no yes non-mammals platypus no no no yes mammals owl no yes no yes non-mammals dolphin yes no yes no mammals eagle no yes no yes non-mammals A: attributes, M: mammals, N: non-mammals Test record A: Give Birth=yes, Can Fly=no, Live in Water=yes, Have Legs=no, Class=? P(A|M) = 6/7 x 6/7 x 2/7 x 2/7 = 0.06 P(A|N) = 1/13 x 10/13 x 3/13 x 4/13 = 0.0042 P(A|M) P(M) = 0.06 x 7/20 = 0.021 P(A|N) P(N) = 0.0042 x 13/20 = 0.0027 P(A|M)P(M) > P(A|N)P(N) => Mammals Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

70 How to Estimate Probabilities from Data? For continuous attributes: Discretize the range into bins one ordinal attribute per bin violates independence assumption Two-way split: (A < v) or (A > v) choose only one of the two splits as new attribute Probability density estimation: Assume attribute follows a normal distribution Use data to estimate parameters of distribution (e.g., mean and standard deviation) Once probability distribution is known, can use it to estimate the conditional probability P(A i c) Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

71 How to Estimate Probabilities from Data? Tid Refund (categorical) Marital Status (categorical) Taxable Income (continuous) Evade (class) 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes Normal distribution, one for each (Ai, ci) pair: P(Ai | cj) = 1 / sqrt(2 π σij²) × exp( −(Ai − μij)² / (2 σij²) ) For (Income, Class=No): if Class=No, sample mean = 110, sample variance = 2975 P(Income=120 | No) = 1 / (sqrt(2 π) × 54.54) × exp( −(120 − 110)² / (2 × 2975) ) = 0.0072 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/
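
The Gaussian estimate above as a short Python function, reproducing P(Income=120 | No) from the sample mean 110 and sample variance 2975 given in the slide.

    from math import exp, pi, sqrt

    def gaussian_conditional(x, mean, variance):
        """P(Ai = x | c) under a normal distribution fitted for one (attribute, class) pair."""
        return exp(-(x - mean) ** 2 / (2 * variance)) / sqrt(2 * pi * variance)

    print(round(gaussian_conditional(120, mean=110, variance=2975), 4))  # ~0.0072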

72 Naïve Bayes Classifier If one of the conditional probabilities is zero, then the entire expression becomes zero Probability estimation: Original: P(Ai | C) = Nic / Nc Laplace: P(Ai | C) = (Nic + 1) / (Nc + c) m-estimate: P(Ai | C) = (Nic + m·p) / (Nc + m) c: number of classes p: prior probability m: parameter Tan,Steinbach, Kumar Introduction to Data Mining 4/18/
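
The three estimators above in one small helper; c is the number of classes for the Laplace correction and m, p parameterize the m-estimate, as in the slide. The example shows how Laplace smoothing turns the troublesome P(Refund=Yes | Yes) = 0/3 into a non-zero value.

    def conditional_estimate(n_ic, n_c, method="original", c=2, m=0, p=0.0):
        """Original: n_ic/n_c; Laplace: (n_ic+1)/(n_c+c); m-estimate: (n_ic+m*p)/(n_c+m)."""
        if method == "laplace":
            return (n_ic + 1) / (n_c + c)
        if method == "m-estimate":
            return (n_ic + m * p) / (n_c + m)
        return n_ic / n_c

    print(conditional_estimate(0, 3))                  # 0.0
    print(conditional_estimate(0, 3, "laplace", c=2))  # 0.2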

73 Naïve Bayes (Summary) Robust to isolated noise points Handle missing values by ignoring the instance during probability estimate calculations Robust to irrelevant attributes Independence assumption may not hold for some attributes Use other techniques such as Bayesian Belief Networks (BBN) Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

74 Naïve Bayesian Classifier: Comments Advantages : Easy to implement Good results obtained in most of the cases Disadvantages Assumption: class conditional independence Practically, dependencies exist among variables E.g., hospitals: patients: Profile: age, family history etc Symptoms: fever, cough etc., Disease: lung cancer, diabetes etc Dependencies among these cannot be modeled by Naïve Bayesian Classifier How to deal with these dependencies? Bayesian Belief Networks Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

75 Bayesian Networks A Bayesian belief network allows a subset of the variables to be conditionally independent A graphical model of causal relationships Represents dependency among the variables Gives a specification of the joint probability distribution [Example graph with nodes X, Y, Z, P] Nodes: random variables Links: dependency X and Y are the parents of Z, and Y is the parent of P No dependency between Z and P Has no loops or cycles Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ /03/26 76

76 Bayesian Belief Network: An Example One conditional probability table (CPT) for each variable [Figure: network with nodes Family History, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea] The CPT for the variable LungCancer shows the conditional probability of LC and ~LC for each possible combination of its parents: (FH, S), (FH, ~S), (~FH, S), (~FH, ~S) Derivation of the probability of a particular combination of values of X from the CPTs: P(x1, ..., xn) = Π_{i=1}^{n} P(xi | Parents(Yi)) Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

77 Classification: A Mathematical Mapping Classification: predicts categorical class labels E.g., personal homepage classification: xi = (x1, x2, x3, ...), yi = +1 or −1 x1: # of occurrences of the word homepage x2: # of occurrences of the word welcome x3: ... Mathematically, x ∈ X = ℜ^n, y ∈ Y = {+1, −1} We want a function f: X → Y Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ /03/26 78

78 Linear Classification Binary classification problem [Figure: 2-D scatter of x and o points separated by a red line] The data above the red line belong to class x The data below the red line belong to class o Examples: Perceptron, SVM, Probabilistic Classifiers Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

79 Neural Networks Analogy to Biological Systems (Indeed a great example of a good learning system) Massive Parallelism allowing for computational efficiency The first learning algorithm came in 1959 (Rosenblatt) If a target output value is provided for a single neuron with fixed inputs, one can incrementally change weights to learn to produce these outputs using the perceptron learning rule Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

80 Neural Networks [Figure: a biological neuron (dendrites, axon, terminal branches of axon) alongside an artificial neuron with inputs x1, x2, x3, ..., xn, weights w1, w2, w3, ..., wn and a summation unit S] Binary classifier functions Threshold activation function Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

81 Activation Function Step / threshold activation function Sigmoid: y_hid(u) = 1 / (1 + e^(−u)) Linear: y = u = Σ_{n=1}^{N} w_n x_n Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

82 Neural Networks The general artificial neuron model has five components, shown in the following list 1. A set of inputs, x i. 2. A set of weights, w i. 3. A bias, u. A Neuron 4. An activation function, f. x 0 w 0 - k 5. Neuron output, y x 1 x n w 1 w n f output y Input vector x weight vector w weighted sum Activation function Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

83 Artificial Neural Networks (ANN) X 1 X 2 X 3 Y Input X 1 X 2 X 3 Black box Output Y Output Y is 1 if at least two of the three inputs are equal to 1. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

84 Artificial Neural Networks (ANN) [Figure: input nodes X1, X2, X3 connected with weight 0.3 each to an output node Y with threshold t = 0.4] ŷ = 1 if 0.3 x1 + 0.3 x2 + 0.3 x3 − 0.4 > 0, and ŷ = −1 otherwise Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

85 Artificial Neural Networks (ANN) Model is an assembly of inter-connected nodes and weighted links [Figure: input nodes X1, X2, X3 with weights w1, w2, w3 feeding an output node Y] The output node sums up its input values according to the weights of its links and compares the result against some threshold t Perceptron Model: Y = I( Σ_i w_i X_i − t > 0 ) or Y = sign( Σ_i w_i X_i − t ) Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

86 General Structure of ANN x 1 x 2 x 3 x 4 x 5 Input Layer Input Neuron i Output Hidden Layer I 1 I 2 I 3 w i1 w i2 w i3 S i Activation function g(s i ) O i O i threshold, t Output Layer Training ANN means learning the weights of the neurons y Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

87 Layered Networks Inputs. Hidden Neurons Output Neurons Outputs x y 1 h1 (u) y o1 (u) y 1 x y 2 h2 (u) y o2 (u) y 2 x n y hn (u) y om (u) y m Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

88 Perceptron Learning Algorithm 1. Let D = {(xi, yi) | i=1,2,...,n} be the training instances 2. Initialize the weight vector with random values w(0) 3. repeat 4. for each instance (xi, yi) in D 5. Compute the predicted output ŷi(k) 6. for each weight wj do 7. wj(k+1) = wj(k) + λ (yi − ŷi(k)) xij 8. end for 9. end for 10. until stopping condition is met (λ is the learning rate; yi − ŷi(k) is the prediction error) Tan,Steinbach, Kumar Introduction to Data Mining 4/18/
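
A sketch of the perceptron learning rule above in plain Python; the sign activation, the bias handled as a weight on a constant extra input, and the toy target ("output 1 if at least two of the three binary inputs are 1", taken from the earlier ANN slide, with 0 replaced by -1) are illustrative assumptions.

    from itertools import product

    def sign(u):
        return 1 if u > 0 else -1

    def train_perceptron(data, lam=0.1, epochs=100):
        """data: list of (inputs, label in {+1,-1}); returns weights, the last entry being the bias weight."""
        n = len(data[0][0])
        w = [0.0] * (n + 1)                           # w[-1] multiplies a constant input of 1
        for _ in range(epochs):
            updated = False
            for x, y in data:
                xb = list(x) + [1.0]
                y_hat = sign(sum(wj * xj for wj, xj in zip(w, xb)))
                if y_hat != y:                        # w_j <- w_j + lambda * (y - y_hat) * x_j
                    w = [wj + lam * (y - y_hat) * xj for wj, xj in zip(w, xb)]
                    updated = True
            if not updated:                           # stopping condition: a full pass with no errors
                break
        return w

    data = [(x, 1 if sum(x) >= 2 else -1) for x in product([0, 1], repeat=3)]
    w = train_perceptron(data)
    print(all(sign(sum(wj * xj for wj, xj in zip(w, list(x) + [1.0]))) == y for x, y in data))  # True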

89 Algorithm for learning ANN Initialize the weights (w0, w1, ..., wk) Adjust the weights in such a way that the output of the ANN is consistent with the class labels of the training examples Objective function (sum of squared errors): E = Σ_{i=1}^{N} ( yi − ŷi )² Find the weights wi that minimize the above objective function, e.g. with the backpropagation algorithm Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

90 Error Back-Propagation (BP) [Figure: input x feeding hidden neurons N1...Nn with weights w_hid,1...w_hid,n and activations y_hid(u_hid,i); their outputs feed an output neuron through weights w_out,1...w_out,n, producing y_out, which is compared with the training target y_train] Training error term: e = ½ ( y − ŷ )² Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

91 Support Vector Machines Find a linear hyperplane (decision boundary) that will separate the data Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

92 Support Vector Machines B 1 One Possible Solution Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

93 Support Vector Machines B 2 Another possible solution Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

94 Support Vector Machines B 2 Other possible solutions Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

95 Support Vector Machines B 1 B 2 Which one is better? B1 or B2? How do you define better? Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

96 Support Vector Machines B 1 B 2 b 21 b 22 margin b 11 Find hyperplane maximizes the margin => B1 is better than B2 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ b 12

97 Support Vector Machines Decision boundary B1: w · x + b = 0, with parallel hyperplanes b11: w · x + b = 1 and b12: w · x + b = −1 f(x) = 1 if w · x + b ≥ 1, and f(x) = −1 if w · x + b ≤ −1 Margin = 2 / ||w|| Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

98 Support Vector Machines We want to maximize the margin 2 / ||w||, which is equivalent to minimizing L(w) = ||w||² / 2 Subject to the constraints: f(xi) = 1 if w · xi + b ≥ 1, and f(xi) = −1 if w · xi + b ≤ −1 (i.e., yi (w · xi + b) ≥ 1) This is a constrained optimization problem Numerical approaches solve it (e.g., quadratic programming) Tan,Steinbach, Kumar Introduction to Data Mining 4/18/
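
A small numeric check of the quantities above: given a candidate (w, b), the code verifies the constraints y_i (w · x_i + b) >= 1 and reports the margin 2/||w||. The toy points and the hand-picked (w, b) are illustrative assumptions, not the solution of the quadratic program.

    from math import sqrt

    def margin_and_feasible(w, b, points):
        """points: list of (x, y) with y in {+1, -1}; check y*(w.x + b) >= 1 and return 2/||w||."""
        feasible = all(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1 for x, y in points)
        return 2 / sqrt(sum(wi * wi for wi in w)), feasible

    points = [((3.0, 3.0), 1), ((4.0, 4.0), 1), ((1.0, 1.0), -1), ((0.0, 1.0), -1)]
    w, b = (0.5, 0.5), -2.0                   # a hand-picked separating hyperplane, not the optimum
    print(margin_and_feasible(w, b, points))  # (2.828..., True)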

99 Support Vector Machines What if the problem is not linearly separable? Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

100 Support Vector Machines What if the problem is not linearly separable? Introduce slack variables ξi Need to minimize: L(w) = ||w||² / 2 + C Σ_{i=1}^{N} ξi^k Subject to: f(xi) = 1 if w · xi + b ≥ 1 − ξi, and f(xi) = −1 if w · xi + b ≤ −1 + ξi Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

101 Nonlinear Support Vector Machines What if decision boundary is not linear? Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

102 Nonlinear Support Vector Machines Transform data into higher dimensional space Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

103 Ensemble Methods Construct a set of classifiers from the training data Predict class label of previously unseen records by aggregating predictions made by multiple classifiers Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

104 General Idea D Original Training data Step 1: Create Multiple Data Sets... D 1 D 2 D t-1 D t Step 2: Build Multiple Classifiers C 1 C 2 C t -1 C t Step 3: Combine Classifiers C * Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

105 Why does it work? Suppose there are 25 base classifiers Each classifier has error rate ε = 0.35 Assume the classifiers are independent Probability that the ensemble classifier makes a wrong prediction: Σ_{i=13}^{25} C(25, i) ε^i (1 − ε)^(25−i) = 0.06 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/
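
The 0.06 figure is just the binomial tail above; a two-line check in Python:

    from math import comb

    eps = 0.35
    p_wrong = sum(comb(25, i) * eps**i * (1 - eps)**(25 - i) for i in range(13, 26))
    print(round(p_wrong, 2))  # 0.06 -- the majority vote errs only if 13 or more of the 25 base classifiers err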

106 Examples of Ensemble Methods How to generate an ensemble of classifiers? Method By manipulating the training set Bagging Boosting By manipulating the input features By manipulating the learning algorithm By manipulating the class label Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

107 Bagging Sampling with replacement: each round draws a bootstrap sample of the same size as the original data [Table: original data and the records drawn in Bagging Rounds 1-3] Build a classifier on each bootstrap sample Each record has probability 1 − (1 − 1/n)^n (about 0.632 for large n) of being selected in a given bootstrap sample Tan,Steinbach, Kumar Introduction to Data Mining 4/18/
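
A bagging sketch in Python: draw bootstrap samples with replacement, fit one base classifier per sample, and predict by majority vote. The base_learner argument is abstract (anything with fit/predict methods), so this is a skeleton under that assumption.

    import random
    from collections import Counter

    def bagging_fit(records, labels, base_learner, rounds=10, seed=0):
        """Each round trains base_learner() on a bootstrap sample of size n drawn with replacement."""
        rng = random.Random(seed)
        n, models = len(records), []
        for _ in range(rounds):
            idx = [rng.randrange(n) for _ in range(n)]       # bootstrap sample indices
            model = base_learner()
            model.fit([records[i] for i in idx], [labels[i] for i in idx])
            models.append(model)
        return models

    def bagging_predict(models, x):
        """Majority vote over the base classifiers."""
        return Counter(m.predict(x) for m in models).most_common(1)[0][0]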

108 Boosting An iterative procedure to adaptively change distribution of training data by focusing more on previously misclassified records Initially, all N records are assigned equal weights Unlike bagging, weights may change at the end of boosting round Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

109 Boosting Records that are wrongly classified will have their weights increased Records that are classified correctly will have their weights decreased Original Data Boosting (Round 1) Boosting (Round 2) Boosting (Round 3) Example 4 is hard to classify Its weight is increased, therefore it is more likely to be chosen again in subsequent rounds Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

110 Example: AdaBoost Base classifiers: C1, C2, ..., CT Error rate of classifier Ci: εi = (1/N) Σ_{j=1}^{N} wj δ( Ci(xj) ≠ yj ) Importance of a classifier: αi = ½ ln( (1 − εi) / εi ) Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

111 Example: AdaBoost Weight update: wi(j+1) = ( wi(j) / Zj ) × exp(−αj) if Cj(xi) = yi, and ( wi(j) / Zj ) × exp(αj) if Cj(xi) ≠ yi, where Zj is the normalization factor If any intermediate round produces an error rate higher than 50%, the weights are reverted back to 1/n and the resampling procedure is repeated Classification: C*(x) = argmax_y Σ_{j=1}^{T} αj δ( Cj(x) = y ) Tan,Steinbach, Kumar Introduction to Data Mining 4/18/
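
A sketch of the AdaBoost updates above using weighted decision stumps on one-dimensional data; the stump learner, the toy dataset, and the simplified weight reset when a round's error exceeds 50% are illustrative assumptions.

    import math

    def stump_learner(xs, ys, weights):
        """Weighted 1-D decision stump: predict s if x > t else -s, minimizing weighted error."""
        best = None
        for t in sorted(set(xs)):
            for s in (1, -1):
                err = sum(w for x, y, w in zip(xs, ys, weights) if (s if x > t else -s) != y)
                if best is None or err < best[0]:
                    best = (err, t, s)
        _, t, s = best
        return lambda x, t=t, s=s: s if x > t else -s

    def adaboost(xs, ys, rounds=3):
        n = len(xs)
        w = [1.0 / n] * n
        classifiers, alphas = [], []
        for _ in range(rounds):
            C = stump_learner(xs, ys, w)
            eps = sum(wi for xi, yi, wi in zip(xs, ys, w) if C(xi) != yi)
            if eps >= 0.5:                        # revert to uniform weights, as on the slide
                w = [1.0 / n] * n
                continue
            alpha = 0.5 * math.log((1 - eps) / max(eps, 1e-12))
            w = [wi * math.exp(-alpha if C(xi) == yi else alpha) for xi, yi, wi in zip(xs, ys, w)]
            Z = sum(w)
            w = [wi / Z for wi in w]              # normalize so the weights remain a distribution
            classifiers.append(C)
            alphas.append(alpha)
        def predict(x):                           # C*(x) = argmax_y sum_j alpha_j [C_j(x) = y]
            return 1 if sum(a * C(x) for a, C in zip(alphas, classifiers)) >= 0 else -1
        return predict

    xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    ys = [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]
    model = adaboost(xs, ys, rounds=3)
    print([model(x) for x in xs] == ys)  # True after three boosting rounds on this toy set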

112 Illustrating AdaBoost [Figure: the original data points with their initial weights 1/N and the sample drawn for Boosting Round 1 with the first base classifier B1] Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

113 Illustrating AdaBoost [Figure: the re-weighted samples and base classifiers B1, B2, B3 chosen in Boosting Rounds 1-3, combined into the overall classifier] Tan,Steinbach, Kumar Introduction to Data Mining 4/18/


More information

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview. Introduction to ML. Marek Petrik 4/25/2017 Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

More information

Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques Data Mining: Concepts and Techniques 1 Classification and Prediction What is classification? What is prediction? Issues regarding classification and prediction Classification by decision tree induction

More information

1 Handling of Continuous Attributes in C4.5. Algorithm

1 Handling of Continuous Attributes in C4.5. Algorithm .. Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. Data Mining: Classification/Supervised Learning Potpourri Contents 1. C4.5. and continuous attributes: incorporating continuous

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish

More information

Decision Tree Learning

Decision Tree Learning Decision Tree Learning Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Machine Learning, Chapter 3 2. Data Mining: Concepts, Models,

More information

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, 23 2013 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run

More information

Uncertainty. Yeni Herdiyeni Departemen Ilmu Komputer IPB. The World is very Uncertain Place

Uncertainty. Yeni Herdiyeni Departemen Ilmu Komputer IPB. The World is very Uncertain Place Uncertainty Yeni Herdiyeni Departemen Ilmu Komputer IPB The World is very Uncertain Place 1 Ketidakpastian Presiden Indonesia tahun 2014 adalah Perempuan Tahun depan saya akan lulus Jumlah penduduk Indonesia

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

Data classification (II)

Data classification (II) Lecture 4: Data classification (II) Data Mining - Lecture 4 (2016) 1 Outline Decision trees Choice of the splitting attribute ID3 C4.5 Classification rules Covering algorithms Naïve Bayes Classification

More information

Algorithmisches Lernen/Machine Learning

Algorithmisches Lernen/Machine Learning Algorithmisches Lernen/Machine Learning Part 1: Stefan Wermter Introduction Connectionist Learning (e.g. Neural Networks) Decision-Trees, Genetic Algorithms Part 2: Norman Hendrich Support-Vector Machines

More information

Decision Trees: Overfitting

Decision Trees: Overfitting Decision Trees: Overfitting Emily Fox University of Washington January 30, 2017 Decision tree recap Loan status: Root 22 18 poor 4 14 Credit? Income? excellent 9 0 3 years 0 4 Fair 9 4 Term? 5 years 9

More information

CIS519: Applied Machine Learning Fall Homework 5. Due: December 10 th, 2018, 11:59 PM

CIS519: Applied Machine Learning Fall Homework 5. Due: December 10 th, 2018, 11:59 PM CIS59: Applied Machine Learning Fall 208 Homework 5 Handed Out: December 5 th, 208 Due: December 0 th, 208, :59 PM Feel free to talk to other members of the class in doing the homework. I am more concerned

More information

1 Handling of Continuous Attributes in C4.5. Algorithm

1 Handling of Continuous Attributes in C4.5. Algorithm .. Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. Data Mining: Classification/Supervised Learning Potpourri Contents 1. C4.5. and continuous attributes: incorporating continuous

More information

Data Mining vs Statistics

Data Mining vs Statistics Data Mining Data mining emerged in the 1980 s when the amount of data generated and stored became overwhelming. Data mining is strongly influenced by other disciplines such as mathematics, statistics,

More information

Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees!

Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees! Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees! Summary! Input Knowledge representation! Preparing data for learning! Input: Concept, Instances, Attributes"

More information

Machine Learning for Signal Processing Bayes Classification and Regression

Machine Learning for Signal Processing Bayes Classification and Regression Machine Learning for Signal Processing Bayes Classification and Regression Instructor: Bhiksha Raj 11755/18797 1 Recap: KNN A very effective and simple way of performing classification Simple model: For

More information

CSE 151 Machine Learning. Instructor: Kamalika Chaudhuri

CSE 151 Machine Learning. Instructor: Kamalika Chaudhuri CSE 151 Machine Learning Instructor: Kamalika Chaudhuri Ensemble Learning How to combine multiple classifiers into a single one Works well if the classifiers are complementary This class: two types of

More information

Support Vector Machines. Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar

Support Vector Machines. Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar Data Mining Support Vector Machines Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar 02/03/2018 Introduction to Data Mining 1 Support Vector Machines Find a linear hyperplane

More information

FINAL EXAM: FALL 2013 CS 6375 INSTRUCTOR: VIBHAV GOGATE

FINAL EXAM: FALL 2013 CS 6375 INSTRUCTOR: VIBHAV GOGATE FINAL EXAM: FALL 2013 CS 6375 INSTRUCTOR: VIBHAV GOGATE You are allowed a two-page cheat sheet. You are also allowed to use a calculator. Answer the questions in the spaces provided on the question sheets.

More information

Chapter 14 Combining Models

Chapter 14 Combining Models Chapter 14 Combining Models T-61.62 Special Course II: Pattern Recognition and Machine Learning Spring 27 Laboratory of Computer and Information Science TKK April 3th 27 Outline Independent Mixing Coefficients

More information

CS 446 Machine Learning Fall 2016 Nov 01, Bayesian Learning

CS 446 Machine Learning Fall 2016 Nov 01, Bayesian Learning CS 446 Machine Learning Fall 206 Nov 0, 206 Bayesian Learning Professor: Dan Roth Scribe: Ben Zhou, C. Cervantes Overview Bayesian Learning Naive Bayes Logistic Regression Bayesian Learning So far, we

More information

Decision Tree And Random Forest

Decision Tree And Random Forest Decision Tree And Random Forest Dr. Ammar Mohammed Associate Professor of Computer Science ISSR, Cairo University PhD of CS ( Uni. Koblenz-Landau, Germany) Spring 2019 Contact: mailto: Ammar@cu.edu.eg

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

CS7267 MACHINE LEARNING

CS7267 MACHINE LEARNING CS7267 MACHINE LEARNING ENSEMBLE LEARNING Ref: Dr. Ricardo Gutierrez-Osuna at TAMU, and Aarti Singh at CMU Mingon Kang, Ph.D. Computer Science, Kennesaw State University Definition of Ensemble Learning

More information

Brief Introduction to Machine Learning

Brief Introduction to Machine Learning Brief Introduction to Machine Learning Yuh-Jye Lee Lab of Data Science and Machine Intelligence Dept. of Applied Math. at NCTU August 29, 2016 1 / 49 1 Introduction 2 Binary Classification 3 Support Vector

More information

CSCI-567: Machine Learning (Spring 2019)

CSCI-567: Machine Learning (Spring 2019) CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March

More information

Machine Learning Lecture 5

Machine Learning Lecture 5 Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning CS4375 --- Fall 2018 Bayesian a Learning Reading: Sections 13.1-13.6, 20.1-20.2, R&N Sections 6.1-6.3, 6.7, 6.9, Mitchell 1 Uncertainty Most real-world problems deal with

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Thomas G. Dietterich tgd@eecs.oregonstate.edu 1 Outline What is Machine Learning? Introduction to Supervised Learning: Linear Methods Overfitting, Regularization, and the

More information

Data Mining algorithms

Data Mining algorithms Data Mining algorithms 2017-2018 spring 02.07-09.2018 Overview Classification vs. Regression Evaluation I Basics Bálint Daróczy daroczyb@ilab.sztaki.hu Basic reachability: MTA SZTAKI, Lágymányosi str.

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Final Exam, Machine Learning, Spring 2009

Final Exam, Machine Learning, Spring 2009 Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Hypothesis Space variable size deterministic continuous parameters Learning Algorithm linear and quadratic programming eager batch SVMs combine three important ideas Apply optimization

More information

LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning

LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES Supervised Learning Linear vs non linear classifiers In K-NN we saw an example of a non-linear classifier: the decision boundary

More information

Introduction to Machine Learning

Introduction to Machine Learning Uncertainty Introduction to Machine Learning CS4375 --- Fall 2018 a Bayesian Learning Reading: Sections 13.1-13.6, 20.1-20.2, R&N Sections 6.1-6.3, 6.7, 6.9, Mitchell Most real-world problems deal with

More information

Machine Learning, Midterm Exam: Spring 2009 SOLUTION

Machine Learning, Midterm Exam: Spring 2009 SOLUTION 10-601 Machine Learning, Midterm Exam: Spring 2009 SOLUTION March 4, 2009 Please put your name at the top of the table below. If you need more room to work out your answer to a question, use the back of

More information

1 Machine Learning Concepts (16 points)

1 Machine Learning Concepts (16 points) CSCI 567 Fall 2018 Midterm Exam DO NOT OPEN EXAM UNTIL INSTRUCTED TO DO SO PLEASE TURN OFF ALL CELL PHONES Problem 1 2 3 4 5 6 Total Max 16 10 16 42 24 12 120 Points Please read the following instructions

More information

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007 Decision Trees, cont. Boosting Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 1 st, 2007 1 A Decision Stump 2 1 The final tree 3 Basic Decision Tree Building Summarized

More information

Notes on Machine Learning for and

Notes on Machine Learning for and Notes on Machine Learning for 16.410 and 16.413 (Notes adapted from Tom Mitchell and Andrew Moore.) Learning = improving with experience Improve over task T (e.g, Classification, control tasks) with respect

More information

Lecture 9: Bayesian Learning

Lecture 9: Bayesian Learning Lecture 9: Bayesian Learning Cognitive Systems II - Machine Learning Part II: Special Aspects of Concept Learning Bayes Theorem, MAL / ML hypotheses, Brute-force MAP LEARNING, MDL principle, Bayes Optimal

More information

Information Theory & Decision Trees

Information Theory & Decision Trees Information Theory & Decision Trees Jihoon ang Sogang University Email: yangjh@sogang.ac.kr Decision tree classifiers Decision tree representation for modeling dependencies among input variables using

More information

Nonlinear Classification

Nonlinear Classification Nonlinear Classification INFO-4604, Applied Machine Learning University of Colorado Boulder October 5-10, 2017 Prof. Michael Paul Linear Classification Most classifiers we ve seen use linear functions

More information