CSE 151 Machine Learning. Instructor: Kamalika Chaudhuri


Announcements: Midterm is graded! Average: 39, stdev: 6. HW2 is out today. HW2 is due Thursday, May 3, by 5pm in my mailbox.

Decision Tree Classifiers [Figure: a decision tree with internal nodes "Is X2 < 3?", "Is X1 < 3?", "Is X1 < 6?", and "Is X2 < 7?", shown alongside the corresponding partition of the (X1, X2) plane.]

Decision Tree Classifiers [Figure: the same tree with its leaves labeled a, b, c, ..., matched to the regions a, b, c, ... they define in the (X1, X2) plane.]

Decision Tree Classifiers [Figure: the decision tree from the previous slides.] Each internal node is based on a feature. At each node, there is a decision based on the value of that feature. Each leaf is a class (the same class may appear on multiple leaves).
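
To make this concrete, here is a minimal sketch of how such a tree might be represented and used to classify a point. The class and field names (Node, Leaf, feature, threshold) are hypothetical, not from the course:

```python
# Minimal decision-tree representation (hypothetical names, for illustration only).

class Leaf:
    def __init__(self, label):
        self.label = label          # class predicted at this leaf

class Node:
    def __init__(self, feature, threshold, left, right):
        self.feature = feature      # index of the feature tested at this node
        self.threshold = threshold  # decision: go left if x[feature] < threshold
        self.left = left            # subtree used when x[feature] < threshold
        self.right = right          # subtree used when x[feature] >= threshold

def predict(tree, x):
    """Walk from the root down to a leaf and return that leaf's class label."""
    while isinstance(tree, Node):
        tree = tree.left if x[tree.feature] < tree.threshold else tree.right
    return tree.label

# A tree in the spirit of the slide, with arbitrary class labels at the leaves.
tree = Node(feature=1, threshold=3,                          # "Is X2 < 3?"
            left=Node(0, 3, Leaf("red"), Leaf("blue")),      # "Is X1 < 3?"
            right=Node(0, 6, Leaf("blue"), Leaf("red")))     # "Is X1 < 6?"
print(predict(tree, (2, 5)))   # X2 = 5 >= 3, then X1 = 2 < 6, so "blue"
```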

An Example: Data. Iris data set; features: Petal Length (PL) and Petal Width (PW). [Figure: scatter plot of Petal Width vs. Petal Length; over successive slides, the splits "Is PW < 0.75?", "Is PL < 4.45?", "Is PW < 1.75?", ..., "Is PL < 5.5?" are added, each one further partitioning the data.]

In-Class Problem Draw a decision tree for the following training data set. ( (0, 0), 1), ( (0, 1), 0), ( (1, 0), 0), ( (1, 1), 0)

In-Class Problem Draw a decision tree for the following training data set. ( (0, 0), 1), ( (0, 1), 0), ( (1, 0), 0), ( (1, 1), 0) Solution: the root asks "Is X1 > 0.5?"; if yes, predict 0; if no, ask "Is X2 > 0.5?"; if yes, predict 0; if no, predict 1.

How to build a good decision tree? 1. How to choose a feature and a threshold value at the next node? 2. When do we stop?

Choosing a Feature (ID3 decision trees) Choose a feature and a threshold that reduce uncertainty the most. How do we measure uncertainty?

Measure of Uncertainty: Entropy Let X be a random variable (r.v.) that takes values v1, ..., vk with probabilities p1, ..., pk. Then the entropy of X, denoted by H(X), is defined as: H(X) = - ∑_{i=1}^{k} p_i log p_i. Examples: 1. X is a 0/1 random variable, P(X=1) = 0. What is H(X)? 2. X is a 0/1 random variable, P(X=1) = 0.5. What is H(X)? 3. X is an r.v. which takes values 1, ..., k with probabilities 1/k each. What is H(X)?
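
As a quick check on the three questions above, here is a small sketch (assuming natural logarithms, which match the numbers used later in this lecture):

```python
import math

def entropy(probs):
    """H(X) = -sum_i p_i log p_i, using the convention 0 log 0 = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

print(entropy([0.0, 1.0]))     # 1. P(X=1) = 0:   H(X) = 0
print(entropy([0.5, 0.5]))     # 2. P(X=1) = 0.5: H(X) = log 2, about 0.69
k = 4
print(entropy([1 / k] * k))    # 3. uniform over k values: H(X) = log k
```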

Measure of Uncertainty: Entropy Let X be a random variable (r.v.) that takes values v1, ..., vk with probabilities p1, ..., pk. Then the entropy of X, denoted by H(X), is defined as: H(X) = - ∑_{i=1}^{k} p_i log p_i. [Figure: heatmap of the entropy of a random variable with three values; each point in the triangle is a distribution (p1, p2, p3) with p1 + p2 + p3 = 1.]

Conditional Entropy Let X and Z be two random variables. The conditional entropy of X given Z is defined as: H(X | Z) = ∑_z Pr(Z = z) H(X | Z = z). Essentially: what is the average entropy of X, given that we know Z? In-Class Problems: 1. Suppose X = Z (so Z is a perfect predictor for X). What is H(X | Z)? 2. Suppose X and Z are independent. What is H(X | Z)?

Conditional Entropy Let X and Z be two random variables. The conditional entropy of X given Z is defined as: H(X | Z) = ∑_z Pr(Z = z) H(X | Z = z). Essentially: what is the average entropy of X, given that we know Z? Information Gain(Z) = H(X) - H(X | Z).
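
A short sketch of how H(X | Z) and the information gain can be estimated from (z, x) pairs; the helper names are hypothetical and natural logarithms are assumed:

```python
import math
from collections import Counter

def entropy(counts):
    """Entropy of the empirical distribution given by a list of counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def conditional_entropy(pairs):
    """H(X | Z) = sum_z Pr(Z = z) H(X | Z = z), estimated from (z, x) pairs."""
    by_z = {}
    for z, x in pairs:
        by_z.setdefault(z, Counter())[x] += 1
    n = len(pairs)
    return sum((sum(cnt.values()) / n) * entropy(list(cnt.values()))
               for cnt in by_z.values())

def information_gain(pairs):
    """Information Gain(Z) = H(X) - H(X | Z)."""
    h_x = entropy(list(Counter(x for _, x in pairs).values()))
    return h_x - conditional_entropy(pairs)

# In-class problems: if X = Z, then H(X | Z) = 0; if X and Z are independent, H(X | Z) = H(X).
print(conditional_entropy([(0, 0), (1, 1), (0, 0), (1, 1)]))   # 0.0
print(information_gain([(0, 0), (0, 1), (1, 0), (1, 1)]))      # 0.0 (independent)
```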

Choosing a Feature (ID3 Decision Trees) Suppose all features are discrete. Goal: Choose a feature that reduces uncertainty the most. We choose the feature that maximizes information gain, where Information Gain(Z) = H(X) - H(X | Z).

Computing the Information Gain Root: (5 R, 2 B), so H(X) = -(5/7) log(5/7) - (2/7) log(2/7) = 0.59. [Figure: split a partitions the (X1, X2) plane into the region Z = 0, containing (2 R, 0 B), and the region Z = 1, containing (3 R, 2 B).] H(X | Z=0) = -(2/2) log(2/2) - (0/2) log(0/2) = 0 (taking 0 log 0 = 0). H(X | Z=1) = -(3/5) log(3/5) - (2/5) log(2/5). P(Z=0) = (2+0)/(5+2) = 2/7, P(Z=1) = (3+2)/(5+2) = 5/7. H(X | Z) = P(Z=0) H(X | Z=0) + P(Z=1) H(X | Z=1) = 0.48.
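
As a sanity check, these numbers can be reproduced directly (natural logarithms, 0 log 0 = 0); the helper name H is just for this illustration:

```python
import math

def H(*probs):
    """Entropy with natural logarithm and the convention 0 log 0 = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

h_root = H(5/7, 2/7)                    # H(X) at the root
h_z0, h_z1 = H(2/2, 0/2), H(3/5, 2/5)   # entropies of the two child nodes
h_cond = (2/7) * h_z0 + (5/7) * h_z1    # H(X | Z)
print(round(h_root, 3), round(h_cond, 3))   # 0.598 0.481 (0.59 and 0.48 on the slide)
print(round(h_root - h_cond, 3))            # information gain, about 0.118
```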

Choosing a Feature: Example Recall, Information Gain(Z) = H(X) - H(X | Z). [Figure: the (X1, X2) plane with four candidate splits a, b, c, d.] Root: (5 R, 2 B), H(X) = 0.59. Split a: (2 R, 0 B), (3 R, 2 B), H(X | Z) = 0.48. Split b: (4 R, 0 B), (1 R, 2 B), H(X | Z) = 0.27. Split c: (5 R, 1 B), (0 R, 1 B), H(X | Z) = 0.38. Split d: (2 R, 2 B), (3 R, 0 B), H(X | Z) = 0.39. Pick split b, which has the largest information gain.

Choosing a Feature: Example 2 Recall, Information Gain(Z) = H(X) - H(X | Z). [Figure: the (X1, X2) plane with two candidate splits a and b.] Root: (2 R, 2 B), H(X) = 0.69. Split a: (1 R, 1 B), (1 R, 1 B), H(X | Z) = 0.69. Split b: (1 R, 1 B), (1 R, 1 B), H(X | Z) = 0.69. What is the information gain?

Choosing a Feature: Example 2 Recall, Information Gain(Z) = H(X) - H(X | Z). Root: (2 R, 2 B), H(X) = 0.69. Split a: (1 R, 1 B), (1 R, 1 B), H(X | Z) = 0.69. Split b: (1 R, 1 B), (1 R, 1 B), H(X | Z) = 0.69. [Figure: the correct tree, which asks "Is X1 < 1?" at the root and "Is X2 < 1?" in each branch.] The information gain of every split is 0, but the correct tree is in the figure. Lesson: We shouldn't necessarily stop if the information gain is 0.

How to build a decision tree? 1. How to choose a feature and a threshold value at the next node? 2. When do we stop?

Stopping Rule 1 Stop when every leaf node is pure (contains data from only one class). But this rule may overfit the training data.

An Example [Figure: the decision tree obtained if we stop only when every leaf is pure, shown as a partition of the (X1, X2) plane.] But perhaps the blue point is an outlier or an error, so we overfit the training data.

What is Overfitting? Data (features and labels) come from some underlying true data distribution D. We never see the true distribution -- we only see samples (training sets, test sets, etc.). When we choose a classifier h, we can talk about its training error: err̂(h) = (1/n) ∑_{i=1}^{n} 1(h(x_i) ≠ y_i). We can also talk about its true error: err(h) = P_{(x,y) ~ D}(h(x) ≠ y). For a fixed classifier, as we get more and more data, the training error should approach the true error.
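
For concreteness, the training error of a fixed classifier is just the fraction of misclassified training points; a minimal sketch with a hypothetical classifier h:

```python
def training_error(h, xs, ys):
    """Empirical error: fraction of training points with h(x_i) != y_i."""
    return sum(h(x) != y for x, y in zip(xs, ys)) / len(xs)

h = lambda x: 0                        # a (bad) classifier that always predicts class 0
xs = [(0, 0), (0, 1), (1, 0), (1, 1)]  # the in-class training set from earlier
ys = [1, 0, 0, 0]
print(training_error(h, xs, ys))       # 0.25: h is wrong only on ((0, 0), 1)
```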

Overfitting: Picture [Figure: true error and training error plotted against the number of nodes in the decision tree.] As we make the tree more complicated, training error decreases. But at some point, true error stops improving and may even get worse!

Overfitting The true underlying distribution has some structure, which we would like to capture. The training data captures some of this structure, but may also have some chance structure of its own, which we should try not to model into the classifier. [Figure: labeled + and - points with the true decision boundary and the decision boundary obtained from the training data.]

How to Avoid Overfitting? Avoid overfitting by pruning the tree using a validation dataset

How to Prune the Tree? 1. Split the training data into a training set S and a validation set V. 2. Build a tree T using training set S. 3. Prune using V: Repeat: For each node v in T: replace the subtree rooted at v by a single leaf node that predicts the majority label in v (according to S) to get a tree T'. If the error of T' on V is less than the error of T on V, then set T = T'. Continue until there is no such node v in T.
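
Here is a rough sketch of this pruning loop, assuming a hypothetical Node/Leaf tree representation (not the course's actual implementation); each node stores the majority label of the S-examples that reach it:

```python
class Leaf:
    def __init__(self, label):
        self.label = label              # class predicted at this leaf

class Node:
    def __init__(self, feature, threshold, left, right, majority):
        self.feature = feature          # feature tested at this node
        self.threshold = threshold      # go left if x[feature] < threshold
        self.left, self.right = left, right
        self.majority = majority        # majority label (w.r.t. S) of examples reaching this node

def predict(tree, x):
    while isinstance(tree, Node):
        tree = tree.left if x[tree.feature] < tree.threshold else tree.right
    return tree.label

def error(tree, data):
    """Fraction of (x, y) pairs in data that the tree misclassifies."""
    return sum(predict(tree, x) != y for x, y in data) / len(data)

def internal_nodes(tree):
    """All internal nodes of the tree (the candidates v for pruning)."""
    if isinstance(tree, Leaf):
        return []
    return [tree] + internal_nodes(tree.left) + internal_nodes(tree.right)

def prune(tree, validation):
    """Reduced-error pruning: replace subtrees by majority-label leaves while validation error drops."""
    improved = True
    while improved:
        improved = False
        for v in internal_nodes(tree):
            saved = (v.left, v.right)
            baseline = error(tree, validation)
            # Candidate T': the subtree at v behaves like a single leaf predicting v's majority label.
            v.left = v.right = Leaf(v.majority)
            if error(tree, validation) < baseline:
                improved = True                     # keep the pruned tree and rescan
            else:
                v.left, v.right = saved             # undo the pruning at v
    return tree
```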

Example: Pruning [Figure: a decision tree with nodes "Is X2 < 4?", "Is X1 < 4?", "Is X2 < 5?", "Is X1 < 5?", and "Is X2 < 7?", one of them marked v, shown alongside the corresponding partition of the (X1, X2) plane.] If the blue point is an error, then the subtree at v will hopefully have higher validation error than simply predicting red at v.

With Pruning [Figure: true error, validation error, and training error plotted against the number of nodes in the decision tree.] As we make the tree more complicated, training error decreases. But at some point, true error stops improving and may even get worse! This will be reflected in the validation error, if we have sufficient validation data.