Machine Learning 2nd Edition


INTRODUCTION TO MACHINE LEARNING: Lecture Slides for Machine Learning, 2nd Edition. ETHEM ALPAYDIN, The MIT Press, 2010 (alpaydin@boun.edu.tr, http://www.cmpe.boun.edu.tr/~ethem/i2ml2e). Modified by Leonardo Bobadilla, with some parts from http://www.cs.tau.ac.il/~apartzin/machinelearning/ and www.cs.princeton.edu/courses/archive/fall01/cs302/notes/11/em.ppt

Outline. Previous class: Ch 8, Decision Trees. This class: Ch 9, Decision Trees. (Lecture Notes for E. Alpaydın 2010, Introduction to Machine Learning 2e, The MIT Press, V1.0)

CHAPTER 9: Decision Trees. Slides from: Blaž Zupan and Ivan Bratko, magix.fri.uni-lj.si/predavanja/uisp

[Figure: an example decision tree. The root tests Color? (red / yellow / green); lower nodes test Dot? (yes / no) and Outline? (dashed / solid); the leaves assign the classes triangle and square.]

Classification Trees. What is a good split function? Use an impurity measure. Assume $N_m$ training samples reach node m. Node m is pure if, for every class $C_i$, the estimated probability $p_m^i = \hat{P}(C_i \mid x, m)$ is either 0 or 1. (Based on E. Alpaydın 2004, Introduction to Machine Learning, The MIT Press, V1.1)

Entropy. Entropy measures the amount of uncertainty on a scale from 0 to 1. Example with two events: if $p_1 = p_2 = 0.5$, the entropy is 1, the maximum uncertainty; if $p_2 = 1 = 1 - p_1$, the entropy is 0, no uncertainty. For node m the impurity is the entropy (Eq. 9.3)
$$\mathcal{I}_m = -\sum_{i=1}^{K} p_m^i \log_2 p_m^i$$
(Based on E. Alpaydın 2004, Introduction to Machine Learning, The MIT Press, V1.1)
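A minimal sketch of this entropy computation in plain Python (the helper name `entropy` is hypothetical):

```python
import math

def entropy(probs):
    """Entropy of a class-probability distribution, in bits.
    Terms with p == 0 contribute 0 by convention."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Two-event examples from the slide:
print(entropy([0.5, 0.5]))  # 1.0 -> maximum uncertainty
print(entropy([1.0, 0.0]))  # 0.0 -> no uncertainty
```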

Entropy [figure slide]. (Based on E. Alpaydın 2004, Introduction to Machine Learning, The MIT Press, V1.1)

Best Split. If a node is impure, it needs to be split further. There are several candidate split criteria (coordinates), and we choose the one that minimizes the impurity (uncertainty) after the split; we stop splitting when the impurity is small enough. A stop impurity of zero gives a complex tree with large variance; a larger stop impurity gives smaller trees but larger bias. (Based on E. Alpaydın 2004, Introduction to Machine Learning, The MIT Press, V1.1)

Best Split. Impurity after the split: of the $N_m$ instances reaching node m, $N_{mj}$ take branch j, and $N_{mj}^i$ of those belong to class $C_i$, so $\hat{P}(C_i \mid x, m, j) = p_{mj}^i = N_{mj}^i / N_{mj}$. The total impurity after the split is (Eq. 9.8)
$$\mathcal{I}'_m = -\sum_{j=1}^{n} \frac{N_{mj}}{N_m} \sum_{i=1}^{K} p_{mj}^i \log_2 p_{mj}^i$$
Find the variable and split that minimize this impurity, over all variables and, for numeric variables, over all split positions. (Based on E. Alpaydın 2004, Introduction to Machine Learning, The MIT Press, V1.1)
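A minimal sketch of this split evaluation, assuming a simple representation in which `branches[j]` holds the class labels of the training instances that take branch j (all names hypothetical):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of the class labels reaching one branch."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_impurity(branches):
    """Weighted impurity after a split: sum_j (N_mj / N_m) * entropy(branch j)."""
    n_m = sum(len(b) for b in branches)
    return sum(len(b) / n_m * entropy(b) for b in branches if b)

def best_numeric_split(xs, ys):
    """Try a threshold between each pair of consecutive values of a numeric
    attribute and return the threshold with minimum post-split impurity."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    xs, ys = [xs[i] for i in order], [ys[i] for i in order]
    best = None
    for k in range(1, len(xs)):
        if xs[k - 1] == xs[k]:
            continue
        theta = (xs[k - 1] + xs[k]) / 2
        imp = split_impurity([ys[:k], ys[k:]])
        if best is None or imp < best[1]:
            best = (theta, imp)
    return best  # (threshold, impurity), or None if no split is possible
```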

[Figure: Information Gain for the split on Color? (red / yellow / green).]


Regression Trees. A leaf node stores a value, not a class label, so we need a different impurity measure: the average squared error. Let
$$b_m(x) = \begin{cases} 1 & \text{if } x \in X_m\text{: } x \text{ reaches node } m \\ 0 & \text{otherwise} \end{cases}$$
$$E_m = \frac{1}{N_m} \sum_t \left(r^t - g_m\right)^2 b_m(x^t), \qquad g_m = \frac{\sum_t b_m(x^t)\, r^t}{\sum_t b_m(x^t)}$$
(Based on E. Alpaydın 2004, Introduction to Machine Learning, The MIT Press, V1.1)

Regression Trees. After splitting:
$$b_{mj}(x) = \begin{cases} 1 & \text{if } x \in X_{mj}\text{: } x \text{ reaches node } m \text{ and takes branch } j \\ 0 & \text{otherwise} \end{cases}$$
$$E'_m = \frac{1}{N_m} \sum_j \sum_t \left(r^t - g_{mj}\right)^2 b_{mj}(x^t), \qquad g_{mj} = \frac{\sum_t b_{mj}(x^t)\, r^t}{\sum_t b_{mj}(x^t)}$$
(Lecture Notes for E. Alpaydın 2004, Introduction to Machine Learning, The MIT Press, V1.1)
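A minimal sketch of these regression-tree quantities, assuming the target values of the instances reaching the node are given directly as lists (names hypothetical):

```python
def leaf_value_and_error(targets):
    """g_m: mean of the targets reaching the node; E_m: their mean squared error."""
    g_m = sum(targets) / len(targets)
    e_m = sum((r - g_m) ** 2 for r in targets) / len(targets)
    return g_m, e_m

def split_error(branches):
    """E'_m after a split; branches[j] holds the targets of instances taking branch j."""
    n_m = sum(len(b) for b in branches)
    total = 0.0
    for b in branches:
        if not b:
            continue
        g_mj, _ = leaf_value_and_error(b)
        total += sum((r - g_mj) ** 2 for r in b)
    return total / n_m

# A split is worthwhile if split_error(branches) is clearly below the node's E_m.
```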

Example [figure slide]. (Lecture Notes for E. Alpaydın 2004, Introduction to Machine Learning, The MIT Press, V1.1)

Pruning Trees. If the number of data instances reaching a node is small (less than 5% of the training data), we do not want to split further regardless of impurity; removing such subtrees gives better generalization. Prepruning: early stopping during tree growth. Postpruning: grow the whole tree, then prune subtrees; set aside a pruning set and make sure that pruning does not significantly increase the error on it. (Lecture Notes for E. Alpaydın 2004, Introduction to Machine Learning, The MIT Press, V1.1)
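A minimal sketch of postpruning against a held-out pruning set, assuming a hypothetical dict-based tree where every node stores its 'majority' class and internal nodes also have 'feature' and 'children' (keyed by attribute value); instances are (x, y) pairs with x a dict of attribute values:

```python
def predict(node, x):
    """Follow the tree until a leaf is reached; x is a dict {attribute: value}."""
    while 'feature' in node:
        child = node['children'].get(x[node['feature']])
        if child is None:              # unseen attribute value: fall back to majority
            return node['majority']
        node = child
    return node['majority']

def error(node, pruning_set):
    return sum(predict(node, x) != y for x, y in pruning_set) / len(pruning_set)

def postprune(node, pruning_set):
    """Bottom-up: replace a subtree by a leaf if that does not increase the error
    on the pruning set."""
    if 'feature' not in node:
        return node
    for value, child in node['children'].items():
        node['children'][value] = postprune(child, pruning_set)
    leaf = {'majority': node['majority']}
    if error(leaf, pruning_set) <= error(node, pruning_set):
        return leaf
    return node
```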

Decision Trees and Feature Extraction. A univariate tree tests a single variable at each split, so some variables might not get used at all. Features used closer to the root have greater importance. (Based on E. Alpaydın 2004, Introduction to Machine Learning, The MIT Press, V1.1)

Interpretability. The conditions in a tree are simple to understand. Each path from the root to a leaf is one conjunction of tests, and all paths together can be written as a set of IF-THEN rules that form a rule base. The percentage of training data covered by a rule is its rule support. The tree is thus a tool for knowledge extraction and can be verified by experts. (Based on E. Alpaydın 2004, Introduction to Machine Learning, The MIT Press, V1.1)

Rule Extraction from Trees: C4.5Rules (Quinlan, 1993) [figure slide]. (Lecture Notes for E. Alpaydın 2010, Introduction to Machine Learning 2e, The MIT Press, V1.0)

Learning Rules. Rule induction is similar to tree induction, but tree induction is breadth-first while rule induction is depth-first: one rule is learned at a time. A rule set contains rules, and each rule is a conjunction of terms. A rule covers an example if all terms of the rule evaluate to true for the example. Sequential covering: generate rules one at a time until all positive examples are covered (see the sketch below). (Lecture Notes for E. Alpaydın 2010, Introduction to Machine Learning 2e, The MIT Press, V1.0)
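A minimal sketch of the sequential-covering loop. The helper `learn_one_rule` (which grows a single conjunctive rule) is assumed to be supplied by the caller, and rules are represented as dicts of required attribute values; all names are hypothetical:

```python
def covers(rule, example):
    """A rule is a dict {attribute: required_value}; it covers an example
    if every term evaluates to true for that example."""
    return all(example.get(attr) == val for attr, val in rule.items())

def sequential_covering(positives, negatives, learn_one_rule):
    """Learn rules one at a time until all positive examples are covered."""
    rules, remaining = [], list(positives)
    while remaining:
        rule = learn_one_rule(remaining, negatives)
        if rule is None:          # no further rule can be found
            break
        rules.append(rule)
        # remove the positives this rule covers and continue with the rest
        remaining = [x for x in remaining if not covers(rule, x)]
    return rules
```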

Rule-Based Classifier. Classify records by using a collection of IF-THEN rules. A rule has the form (Condition) → y, where Condition is a conjunction of attribute tests and y is the class label. Examples of classification rules: (Blood Type = Warm) ∧ (Lay Eggs = Yes) → Birds; (Taxable Income < 50K) ∧ (Refund = Yes) → Evade = No.

Rule-based Classifier (Example)

Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
human          warm        yes         no       no             mammals
python         cold        no          no       no             reptiles
salmon         cold        no          no       yes            fishes
whale          warm        yes         no       yes            mammals
frog           cold        no          no       sometimes      amphibians
komodo         cold        no          no       no             reptiles
bat            warm        yes         yes      no             mammals
pigeon         warm        no          yes      no             birds
cat            warm        yes         no       no             mammals
leopard shark  cold        yes         no       yes            fishes
turtle         cold        no          no       sometimes      reptiles
penguin        warm        no          no       sometimes      birds
porcupine      warm        yes         no       no             mammals
eel            cold        no          no       yes            fishes
salamander     cold        no          no       sometimes      amphibians
gila monster   cold        no          no       no             reptiles
platypus       warm        no          no       no             mammals
owl            warm        no          yes      no             birds
dolphin        warm        yes         no       yes            mammals
eagle          warm        no          yes      no             birds

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
(From: Luke Huan 2006)

Application of Rule-Based Classifier. A rule is said to cover an example if the example satisfies all the conditions of the rule.

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name          Blood Type  Give Birth  Can Fly  Live in Water  Class
hawk          warm        no          yes      no             ?
grizzly bear  warm        yes         no       no             ?

Rule R1 covers the hawk => Bird. Rule R3 covers the grizzly bear => Mammal. (From: Luke Huan 2006)
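A minimal sketch of applying such a rule set to a record, reusing the dict-based rule representation from the sequential-covering sketch above (names hypothetical):

```python
RULES = [
    ({'Give Birth': 'no',  'Can Fly': 'yes'},       'Birds'),
    ({'Give Birth': 'no',  'Live in Water': 'yes'}, 'Fishes'),
    ({'Give Birth': 'yes', 'Blood Type': 'warm'},   'Mammals'),
    ({'Give Birth': 'no',  'Can Fly': 'no'},        'Reptiles'),
    ({'Live in Water': 'sometimes'},                'Amphibians'),
]

def classify(record, rules=RULES, default='unknown'):
    """Return the class of the first rule whose conditions the record satisfies."""
    for condition, label in rules:
        if all(record.get(attr) == val for attr, val in condition.items()):
            return label
    return default

hawk = {'Blood Type': 'warm', 'Give Birth': 'no', 'Can Fly': 'yes', 'Live in Water': 'no'}
print(classify(hawk))  # Birds (covered by R1)
```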

[Figures: Example of Sequential Covering — (ii) Step 1, (iii) Step 2 (rule R1), (iv) Step 3 (rules R1, R2). From: Luke Huan 2006]

Ripper Algorithm. There are two kinds of loop in the Ripper algorithm (Cohen, 1995): an outer loop that adds one rule at a time to the rule base, and an inner loop that adds one condition at a time to the current rule. Conditions are added to the rule so as to maximize an information gain measure, until the rule covers no negative example. (Lecture Notes for E. Alpaydın 2010, Introduction to Machine Learning 2e, The MIT Press, V1.0)

Ripper Algorithm. In Ripper, conditions are added to the rule to maximize an information gain measure,
$$\text{Gain}(R', R) = s \left( \log_2 \frac{N'_+}{N'} - \log_2 \frac{N_+}{N} \right)$$
where R is the original rule, R' is the candidate rule after adding a condition, N (N') is the number of instances covered by R (R'), N_+ (N'_+) is the number of true positives in R (R'), and s is the number of true positives in both R and R' (after adding the condition). Conditions are added until the rule covers no negative example. (Lecture Notes for E. Alpaydın 2010, Introduction to Machine Learning 2e, The MIT Press, V1.0)
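A minimal sketch of this gain computation, using the same dict-based rule representation as the earlier sketches (names hypothetical):

```python
import math

def covers(rule, example):
    """rule: dict {attribute: value}; covers the example if all terms hold."""
    return all(example.get(a) == v for a, v in rule.items())

def rule_stats(rule, positives, negatives):
    """Return (N, N_+): instances covered by the rule, and covered positives."""
    tp = sum(covers(rule, x) for x in positives)
    fp = sum(covers(rule, x) for x in negatives)
    return tp + fp, tp

def gain(rule, candidate, positives, negatives):
    """Gain(R', R) = s * (log2(N'_+ / N') - log2(N_+ / N)), with R' = candidate."""
    n, n_pos = rule_stats(rule, positives, negatives)
    n2, n2_pos = rule_stats(candidate, positives, negatives)
    # R' only adds a condition to R, so every positive covered by R' is also
    # covered by R; hence s = N'_+.
    s = n2_pos
    if n_pos == 0 or n2_pos == 0:
        return 0.0
    return s * (math.log2(n2_pos / n2) - math.log2(n_pos / n))
```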

Stopping Criterion and Rule Pruning. Stopping criterion: compute the gain of the new rule; if the gain is not significant, discard the rule. Rule pruning is similar to the post-pruning of decision trees. Reduced-error pruning: remove one of the conjuncts in the rule, compare the error rate on a validation set before and after pruning, and keep the pruned rule if the error improves.

Summary of the Direct Method: grow a single rule, remove the instances covered by the rule, prune the rule (if necessary), add the rule to the current rule set, and repeat.

Direct Method: RIPPER. For a 2-class problem, choose one of the classes as the positive class and the other as the negative class; learn rules for the positive class, and let the negative class be the default class. For a multi-class problem, order the classes by increasing class prevalence (the fraction of instances that belong to a particular class), learn the rule set for the smallest class first while treating the rest as the negative class, and repeat with the next smallest class as the positive class.

Multivariate Trees [figure slide]. (Based on E. Alpaydın 2004, Introduction to Machine Learning, The MIT Press, V1.1)

Likelihood-based vs. Discriminant-based Classification. Likelihood-based: assume a model for $p(x \mid C_i)$, use Bayes' rule to calculate $P(C_i \mid x)$, and use $g_i(x) = \log P(C_i \mid x)$ as the discriminant. Discriminant-based: assume a model for $g_i(x \mid \Phi_i)$ directly; no density estimation. Estimating the boundaries is enough; there is no need to accurately estimate the densities inside the boundaries. Learning is the optimization of the model parameters $\Phi_i$ to maximize the classification accuracy on a given labeled training set. (Lecture Notes for E. Alpaydın 2010, Introduction to Machine Learning 2e, The MIT Press, V1.0)

Linear Discriminant. The linear discriminant is
$$g_i(x \mid w_i, w_{i0}) = w_i^T x + w_{i0} = \sum_{j=1}^{d} w_{ij} x_j + w_{i0}$$
Advantages: it is simple, requiring O(d) space and computation; it is easy to understand, since the final output is a weighted sum of several factors; and it is accurate: when the $p(x \mid C_i)$ are Gaussian with a shared covariance matrix, the optimal discriminant is linear. The linear discriminant can be used even when this assumption does not hold. (Lecture Notes for E. Alpaydın 2010, Introduction to Machine Learning 2e, The MIT Press, V1.0)
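A minimal sketch of evaluating such a discriminant, together with the two-class decision rule discussed below (plain Python, hypothetical names):

```python
def linear_discriminant(w, w0, x):
    """g(x | w, w0) = w^T x + w0 for one class; w and x are equal-length lists."""
    return sum(wj * xj for wj, xj in zip(w, x)) + w0

def choose_two_class(w1, w10, w2, w20, x):
    """Choose C1 if g(x) = g1(x) - g2(x) > 0, else C2."""
    g = linear_discriminant(w1, w10, x) - linear_discriminant(w2, w20, x)
    return 'C1' if g > 0 else 'C2'
```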

Generalized Linear Model. Quadratic discriminant:
$$g_i(x \mid W_i, w_i, w_{i0}) = x^T W_i x + w_i^T x + w_{i0}$$
Higher-order (product) terms, e.g. $z_1 = x_1,\ z_2 = x_2,\ z_3 = x_1^2,\ z_4 = x_2^2,\ z_5 = x_1 x_2$. Map from x to z using nonlinear basis functions and use a linear discriminant in z-space:
$$g_i(x) = \sum_{j=1}^{k} w_{ij}\, \phi_{ij}(x)$$
where the $\phi_{ij}$ are basis functions. (Lecture Notes for E. Alpaydın 2010, Introduction to Machine Learning 2e, The MIT Press, V1.0)
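A minimal sketch of the quadratic basis expansion for two inputs, after which a linear discriminant is applied in z-space; the bias term w0 is kept explicit here, which is an assumption (it could equally be absorbed as an extra basis function):

```python
def quadratic_basis(x1, x2):
    """Map (x1, x2) to z = (x1, x2, x1^2, x2^2, x1*x2)."""
    return [x1, x2, x1 * x1, x2 * x2, x1 * x2]

def discriminant_in_z(w, w0, x1, x2):
    """Linear discriminant in z-space: g(x) = sum_j w_j * phi_j(x) + w0."""
    z = quadratic_basis(x1, x2)
    return sum(wj * zj for wj, zj in zip(w, z)) + w0
```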

Two Classes.
$$g(x) = g_1(x) - g_2(x) = \left(w_1^T x + w_{10}\right) - \left(w_2^T x + w_{20}\right) = (w_1 - w_2)^T x + (w_{10} - w_{20}) = w^T x + w_0$$
Choose $C_1$ if $g(x) > 0$, and $C_2$ otherwise. (Lecture Notes for E. Alpaydın 2010, Introduction to Machine Learning 2e, The MIT Press, V1.0)

Geometry. For $g(x) = w^T x + w_0$: $w_0$ determines the location of the hyperplane with respect to the origin, and $w$ determines its orientation. (Lecture Notes for E. Alpaydın 2010, Introduction to Machine Learning 2e, The MIT Press, V1.0)

Multiple Classes.
$$g_i(x \mid w_i, w_{i0}) = w_i^T x + w_{i0}$$
Choose $C_i$ if $g_i(x) = \max_{j=1,\dots,K} g_j(x)$. $H_i$ is the hyperplane separating the examples of $C_i$ from the examples of all other classes; this assumes the classes are linearly separable. (Lecture Notes for E. Alpaydın 2010, Introduction to Machine Learning 2e, The MIT Press, V1.0)
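A minimal sketch of this multi-class decision rule, restating the `linear_discriminant` helper from the earlier sketch (names hypothetical):

```python
def linear_discriminant(w, w0, x):
    """g_i(x | w_i, w_i0) = w_i^T x + w_i0."""
    return sum(wj * xj for wj, xj in zip(w, x)) + w0

def choose_class(weights, x):
    """weights: list of (w_i, w_i0), one per class; return the index i of the
    class with the maximum discriminant value g_i(x)."""
    scores = [linear_discriminant(w, w0, x) for w, w0 in weights]
    return max(range(len(scores)), key=lambda i: scores[i])
```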

Pairwise Separation.
$$g_{ij}(x \mid w_{ij}, w_{ij0}) = w_{ij}^T x + w_{ij0}$$
$$g_{ij}(x) \begin{cases} > 0 & \text{if } x \in C_i \\ \le 0 & \text{if } x \in C_j \\ \text{don't care} & \text{otherwise} \end{cases}$$
Choose $C_i$ if $g_{ij}(x) > 0$ for all $j \ne i$. $H_{ij}$ is the hyperplane separating the examples of $C_i$ from the examples of $C_j$. (Lecture Notes for E. Alpaydın 2010, Introduction to Machine Learning 2e, The MIT Press, V1.0)