ARTIFICIAL INTELLIGENCE. Supervised learning: classification


INFOB2KI 2017-2018, Utrecht University, The Netherlands. ARTIFICIAL INTELLIGENCE: Supervised learning: classification. Lecturer: Silja Renooij. These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html


Decision tree learning
Supervised learning of a decision tree classifier by means of 'splitting on attributes':
1. What is that?
2. How to split? (ID3)

When to play tennis?
Example dataset D: N=14 cases (D1-D14), 4 attributes (Outlook, Temperature, Humidity, Wind) and 1 class variable (PlayTennis); the full table is shown on the slide.

Decision Tree splits I
Let's start building the tree from scratch: we first need to decide on which attribute to make a decision. Say we select Humidity (NB: using ID3, you won't have to make this choice yourself) and split the data according to the attribute's values:
Humidity = high: D1,D2,D3,D4,D8,D12,D14
Humidity = normal: D5,D6,D7,D9,D10,D11,D13

Decision Tree splits - II
Now let's split the first subset (H=high: D1,D2,D3,D4,D8,D12,D14) using attribute Wind:
Humidity = high, Wind = strong: D2,D12,D14
Humidity = high, Wind = weak: D1,D3,D4,D8
Humidity = normal (not yet split): D5,D6,D7,D9,D10,D11,D13

Decision Tree splits - III
Now let's split the subset H=high & W=strong (D2,D12,D14) using attribute Outlook:
Outlook = sunny: No; Outlook = rain: No; Outlook = overcast: Yes — the entire subset is now classified.
(The subsets H=high & W=weak (D1,D3,D4,D8) and H=normal (D5,D6,D7,D9,D10,D11,D13) still remain to be split.)

Decision Tree splits - IV
Now let's split the subset H=high & W=weak (D1,D3,D4,D8) using attribute Outlook:
Outlook = sunny: No; Outlook = rain: Yes; Outlook = overcast: Yes — this subset is also fully classified.

Decision Tree splits V
Now let's split the subset H=normal (D5,D6,D7,D9,D10,D11,D13) using Outlook:
Outlook = sunny: Yes; Outlook = overcast: Yes; Outlook = rain: D5,D6,D10 — this last subset is not yet fully classified.

Decision Tree splits VI
Now let's split the subset H=normal & O=rain (D5,D6,D10) using Wind:
Wind = strong: No; Wind = weak: Yes — the entire subset is now classified.

Final Decision Tree
Note: the decision tree can be expressed as a set of if-then-else sentences, or, in case of binary outcomes, as a logical formula (here: the conditions under which PlayTennis = yes):
(humidity=high ∧ wind=strong ∧ outlook=overcast)
∨ (humidity=high ∧ wind=weak ∧ outlook=overcast)
∨ (humidity=high ∧ wind=weak ∧ outlook=rain)
∨ (humidity=normal ∧ outlook=sunny)
∨ (humidity=normal ∧ outlook=overcast)
∨ (humidity=normal ∧ outlook=rain ∧ wind=weak)
[Figure: the complete tree, with Humidity at the root; under high: Wind then Outlook; under normal: Outlook then Wind.]

Classifying with Decision Trees
Now classify instance <O=sunny, T=hot, H=normal, W=weak> = ??? [The final tree is shown again on the slide.]

Classifying with Decision Trees
Following the tree: Humidity = normal, then Outlook = sunny → PlayTennis = yes.
Note that this was an unseen instance (not in the data).
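For concreteness, the final tree can be written down as nested conditionals. A minimal sketch: the attribute and value names follow the slides, while the function name and the dictionary representation are my own choices.

```python
# The final tree from these slides as nested if/else rules.

def classify(instance):
    """Return 'Yes'/'No' for a dict like {'Outlook': 'sunny', 'Humidity': 'normal', ...}."""
    if instance['Humidity'] == 'high':
        if instance['Wind'] == 'strong':
            return 'Yes' if instance['Outlook'] == 'overcast' else 'No'
        else:  # weak wind
            return 'No' if instance['Outlook'] == 'sunny' else 'Yes'
    else:  # normal humidity
        if instance['Outlook'] in ('sunny', 'overcast'):
            return 'Yes'
        else:  # rain
            return 'Yes' if instance['Wind'] == 'weak' else 'No'

# The unseen instance from the slide: <O=sunny, T=hot, H=normal, W=weak>
print(classify({'Outlook': 'sunny', 'Temperature': 'hot',
                'Humidity': 'normal', 'Wind': 'weak'}))   # -> 'Yes'
```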

Alternative Decision Trees
Another tree can be built from the same data, using different attributes: we can build quite a large number of (unique) decision trees. So which attribute should we choose to branch on?

ID3: an entropy-based decision tree learner

Entropy
A measure of the disorder or randomness in a closed system with variable(s) of interest S:
Entropy(S) = - Σ_{i=1..n} p_i log2 p_i
where n = |S| is the number of values of S and p_i is the probability of the i-th value.
Convention: 0 log2 0 = 0.
For a degenerate distribution, the entropy will be 0 (why?); for a uniform distribution, the entropy will be log2 n (= 1 for a binary-valued variable).
Recall: log2 x = log_b x / log_b 2 for any base-b logarithm.
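To make the definition concrete, here is a minimal Python sketch; the helper name entropy and the list-of-probabilities representation are my own choices, not from the slides.

```python
from math import log2

def entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities.
    Terms with p = 0 are skipped, implementing the convention 0 * log2(0) = 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([1.0, 0.0]))   # degenerate distribution: 0 bits (Python prints -0.0)
print(entropy([0.5, 0.5]))   # uniform binary distribution: 1.0
print(entropy([0.25] * 4))   # uniform with n = 4 values: log2(4) = 2.0
```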

Entropy: example
In our system we have 1 variable of interest (S = PlayTennis) with 2 possible values (yes, no), so n = |S| = 2. Let p+ = p(PT=yes) and p- = p(PT=no), and use frequency counting to establish these probabilities from the data: 9 out of N=14 examples are positive, so p+ = 9/14; 5 of the 14 are negative, so p- = 5/14.
Entropy(PlayTennis) = -p+ log2 p+ - p- log2 p- = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
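A quick check of this number, with the values taken from the slide:

```python
from math import log2

# Direct check of the computation above, with p+ = 9/14 and p- = 5/14.
p_pos, p_neg = 9/14, 5/14
print(-p_pos * log2(p_pos) - p_neg * log2(p_neg))   # approx. 0.940
```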

Conditional Entropy
Conditional entropy represents the entropy in a system given the values of another variable. The entropy Entropy(S|X) of S conditioned on X is the expected value of the entropy given all possible values x of X:
Entropy(S|X) = Σ_x p(x) · Entropy(S|X=x)
where Entropy(S|X=x) = - Σ_i p(s_i|x) log2 p(s_i|x).
We will use the following shorthand notations: Entropy(S_X) for Entropy(S|X), and Entropy(S_x) for Entropy(S|X=x).

Conditional Entropy - example
We can now evaluate each attribute by calculating how much it changes the entropy. For example, we can evaluate the attribute Temperature, which has 3 values: hot, mild, cool. So we need to consider 3 subsystems: S_hot, S_mild, S_cool. For each subsystem, probabilities are assessed from a subset of the data D:
D_hot = {D1,D2,D3,D13}, so p(hot) = 4/14
D_mild = {D4,D8,D10,D11,D12,D14}, so p(mild) = 6/14
D_cool = {D5,D6,D7,D9}, so p(cool) = 4/14
Now first compute the entropy of the subsystems: Entropy(S_hot), Entropy(S_mild), Entropy(S_cool)

Conditional Entropy example II
D_hot = {D1(-),D2(-),D3(+),D13(+)}: p+_hot = 0.5 and p-_hot = 0.5
Entropy(S_hot) = -0.5 log2 0.5 - 0.5 log2 0.5 = 1
D_mild = {D4(+),D8(-),D10(+),D11(+),D12(+),D14(-)}: p+_mild = 0.666 and p-_mild = 0.333
Entropy(S_mild) = -0.666 log2 0.666 - 0.333 log2 0.333 = 0.918
D_cool = {D5(+),D6(-),D7(+),D9(+)}: p+_cool = 0.75 and p-_cool = 0.25
Entropy(S_cool) = -0.75 log2 0.75 - 0.25 log2 0.25 = 0.811

Conditional Entropy example III
The conditional entropy after splitting on Temperature now is:
Entropy(S_Temperature) = p(hot)·Entropy(S_hot) + p(mild)·Entropy(S_mild) + p(cool)·Entropy(S_cool)
= (4/14)*1 + (6/14)*0.918 + (4/14)*0.811 = 0.9108
Okay, but does this mean we should split on this attribute?
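The same weighted average can be checked with a short sketch; the (yes, no) counts per Temperature value are the ones listed above, and the helper name entropy2 is my own.

```python
from math import log2

def entropy2(pos, neg):
    """Entropy of a binary (pos, neg) count split, with the 0*log2(0) = 0 convention."""
    total = pos + neg
    return -sum(c / total * log2(c / total) for c in (pos, neg) if c > 0)

counts = {'hot': (2, 2), 'mild': (4, 2), 'cool': (3, 1)}   # (#yes, #no) per value
N = sum(p + n for p, n in counts.values())                 # 14 cases in total

cond_entropy = sum((p + n) / N * entropy2(p, n) for p, n in counts.values())
print(cond_entropy)   # approx. 0.911 (the slide's 0.9108 uses rounded intermediate values)
```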

Information Gain
We now define the Gain (reduction in entropy) of splitting on attribute X as:
Gain(S,X) = Entropy(S) - Entropy(S|X)
Information gain is always a non-negative value! (Why?)
If Entropy(S|X) = 0, then all cases in S_X are correctly classified → split on the attribute with the smallest conditional entropy; equivalently, split on the attribute with the highest gain.

Information Gain - example
The gain of splitting on Temperature is:
Gain(S, Temp) = 0.940 - 0.9108 = 0.029
Compute the gain of splitting for all other attributes:
Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
We therefore split on Outlook and repeat the process for:
S → S_sunny with D → D_sunny
S → S_overcast with D → D_overcast
S → S_rain with D → D_rain
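These four gains can be reproduced with a small script. This is a sketch under the assumption that the slides use the standard 14-case PlayTennis table from Mitchell (1997); that table matches the case numbers and subsets shown earlier in these slides.

```python
from math import log2
from collections import Counter

# The 14 cases D1..D14; each row is (Outlook, Temperature, Humidity, Wind, PlayTennis).
DATA = [
    ('sunny', 'hot', 'high', 'weak', 'no'),           # D1
    ('sunny', 'hot', 'high', 'strong', 'no'),         # D2
    ('overcast', 'hot', 'high', 'weak', 'yes'),       # D3
    ('rain', 'mild', 'high', 'weak', 'yes'),          # D4
    ('rain', 'cool', 'normal', 'weak', 'yes'),        # D5
    ('rain', 'cool', 'normal', 'strong', 'no'),       # D6
    ('overcast', 'cool', 'normal', 'strong', 'yes'),  # D7
    ('sunny', 'mild', 'high', 'weak', 'no'),          # D8
    ('sunny', 'cool', 'normal', 'weak', 'yes'),       # D9
    ('rain', 'mild', 'normal', 'weak', 'yes'),        # D10
    ('sunny', 'mild', 'normal', 'strong', 'yes'),     # D11
    ('overcast', 'mild', 'high', 'strong', 'yes'),    # D12
    ('overcast', 'hot', 'normal', 'weak', 'yes'),     # D13
    ('rain', 'mild', 'high', 'strong', 'no'),         # D14
]
ATTRS = {'Outlook': 0, 'Temperature': 1, 'Humidity': 2, 'Wind': 3}

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(data, col):
    """Gain(S, X) = Entropy(S) - Entropy(S|X) for the attribute in column `col`."""
    cond = 0.0
    for value in set(row[col] for row in data):
        subset = [row[-1] for row in data if row[col] == value]
        cond += len(subset) / len(data) * entropy(subset)
    return entropy([row[-1] for row in data]) - cond

for name, col in ATTRS.items():
    print(name, round(gain(DATA, col), 3))
# prints: Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048
# (the slides' 0.246 and 0.151 come from rounding intermediate values)
```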

ID3 (Decision Tree Algorithm)
Building a decision tree with the ID3 algorithm:
1. Start from an empty node
2. Select the attribute with the most information gain
3. Split: create the subsystems (children) for each value of the selected attribute
4. For each associated subset of the data: if not all elements belong to the same class, then repeat steps 2-3 for the subset
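A minimal recursive sketch of these four steps, reusing the row-list/column-index representation from the information-gain sketch above; refinements such as empty branches, tie-breaking and noise handling are omitted.

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(data, col):
    cond = 0.0
    for value in set(row[col] for row in data):
        subset = [row[-1] for row in data if row[col] == value]
        cond += len(subset) / len(data) * entropy(subset)
    return entropy([row[-1] for row in data]) - cond

def id3(data, attrs):
    """Return a nested dict tree; `attrs` maps attribute names to column indices."""
    labels = [row[-1] for row in data]
    if len(set(labels)) == 1:                 # all cases in the same class: leaf
        return labels[0]
    if not attrs:                             # no attributes left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(data, attrs[a]))   # step 2: highest gain
    col = attrs[best]
    remaining = {a: c for a, c in attrs.items() if a != best}
    # steps 3-4: one child per attribute value, recurse on each subset
    return {best: {value: id3([row for row in data if row[col] == value], remaining)
                   for value in set(row[col] for row in data)}}
```

Called as id3(DATA, ATTRS) on the PlayTennis table from the previous sketch, it returns a nested dictionary with Outlook at the root, matching the split chosen on the Information Gain slide.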

Domain for ID3 example
[Figure: a robot in a world with obstacles.] The robot can turn left & right, and move forward.

Cases for ID3 example S:
      LeftSensor   RightSensor   ForwardSensor   BackSensor   PreviousAction   Action
X1    Obstacle     Free          Obstacle        Free         MoveForward      TurnRight
X2    Free         Free          Obstacle        Free         TurnLeft         TurnLeft
X3    Free         Obstacle      Free            Free         MoveForward      MoveForward
X4    Free         Obstacle      Free            Obstacle     TurnLeft         MoveForward
X5    Obstacle     Free          Free            Free         TurnRight        MoveForward
X6    Free         Free          Free            Obstacle     TurnRight        MoveForward

ID3 Example
Entropy(S) = -1/6*log2(1/6) - 1/6*log2(1/6) - 4/6*log2(4/6) = 1.25
Entropy(S|LeftSensor) = 2/6*Entropy(S_LS=obstacle) + 4/6*Entropy(S_LS=free) = 2/6*1 + 4/6*0.811 = 0.874
Entropy(S|RightSensor) = 2/6*Entropy(S_RS=obstacle) + 4/6*Entropy(S_RS=free) = 2/6*0 + 4/6*1.5 = 1
Entropy(S|ForwardSensor) = 2/6*Entropy(S_FS=obstacle) + 4/6*Entropy(S_FS=free) = 2/6*1 + 4/6*0 = 0.333
Entropy(S|BackSensor) = 2/6*Entropy(S_BS=obstacle) + 4/6*Entropy(S_BS=free) = 2/6*0 + 4/6*1.5 = 1
Entropy(S|PreviousAction) = 2/6*Entropy(S_PA=MoveForw) + 2/6*Entropy(S_PA=TurnL) + 2/6*Entropy(S_PA=TurnR) = 2/6*1 + 2/6*1 + 2/6*0 = 0.666
Gain(S,LeftSensor) = 1.25 - 0.874 = 0.376
Gain(S,RightSensor) = 1.25 - 1 = 0.25
Gain(S,ForwardSensor) = 1.25 - 0.333 = 0.917
Gain(S,BackSensor) = 1.25 - 1 = 0.25
Gain(S,PreviousAction) = 1.25 - 0.666 = 0.584
→ Select ForwardSensor

Decision Tree ID3 Example (continued)
First split: ForwardSensor = free → MoveForward; ForwardSensor = obstacle → remaining cases {X1,X2} = S'.
Entropy(S') = -1/2*log2(1/2) - 1/2*log2(1/2) = 1 (X1: Action = TurnRight; X2: Action = TurnLeft)
Entropy(S'|LeftSensor) = 1/2*Entropy(S'_LS=obstacle) + 1/2*Entropy(S'_LS=free) = 1/2*0 + 1/2*0 = 0, so Gain = 1 - 0 = 1
Entropy(S'|RightSensor) = 1*Entropy(S'_RS=free) = 1*1 = 1, so Gain = 1 - 1 = 0
Entropy(S'|BackSensor) = exactly the same, so Gain = 1 - 1 = 0
Entropy(S'|PreviousAction) = 1/2*Entropy(S'_PA=MoveForw) + 1/2*Entropy(S'_PA=TurnL) = 1/2*0 + 1/2*0 = 0, so Gain = 1 - 0 = 1
→ Select either LeftSensor or PreviousAction, depending on the execution order

Decision Tree ID3 Example: resulting trees
Two equivalent trees result, depending on the second split:
(1) ForwardSensor: free → MoveForward; obstacle → LeftSensor: obstacle → TurnRight (X1), free → TurnLeft (X2)
(2) ForwardSensor: free → MoveForward; obstacle → PreviousAction: MoveForward → TurnRight (X1), TurnLeft → TurnLeft (X2)

ID3 preference bias example I (Babylon 5 universe)
Dataset S:
      Race      Name       BeenToB5   Good Person
D1    Minbari   Delenn     Yes        Yes
D2    Minbari   Draal      Yes        Yes
D3    Human     Morden     Yes        No
D4    Narn      G'Kar      Yes        Yes
D5    Human     Sheridan   Yes        Yes
p_yes = 0.8, p_no = 0.2
Entropy(S) = -0.2*log2 0.2 - 0.8*log2 0.8 = 0.72
Split on Race:
D_minbari = {D1(+),D2(+)}, Entropy(S_minbari) = 0
D_human = {D3(-),D5(+)}, Entropy(S_human) = 1
D_narn = {D4(+)}, Entropy(S_narn) = 0
Entropy(S|Race) = 2/5*0 + 2/5*1 + 1/5*0 = 2/5
Gain(S,Race) = 0.72 - 2/5 = 0.32

ID3 preference bias example II (Babylon 5 universe, same dataset)
p_yes = 0.8, p_no = 0.2
Entropy(S) = -0.2*log2 0.2 - 0.8*log2 0.8 = 0.72
Split on Name:
D_Delenn = {D1(+)}, Entropy(S_Delenn) = 0
D_Draal = {D2(+)}, Entropy(S_Draal) = 0
D_Morden = {D3(-)}, D_G'Kar = {D4(+)}, D_Sheridan = {D5(+)}: the entropies of all subsets are 0
Entropy(S|Name) = 0
Gain(S,Name) = 0.72 - 0 = 0.72
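The effect can be seen directly: a unique identifier such as Name puts every case in its own subset, so every subset is pure and the gain equals the full entropy. A small sketch of that argument with the numbers from this slide:

```python
from math import log2

# Good Person labels for D1..D5 from the slide; every Name value picks out one case.
labels = ['yes', 'yes', 'no', 'yes', 'yes']

entropy_S = -(4/5) * log2(4/5) - (1/5) * log2(1/5)     # approx. 0.72
cond_entropy_name = sum((1/5) * 0.0 for _ in labels)   # each singleton subset has entropy 0
print(entropy_S - cond_entropy_name)                   # Gain(S, Name) = Entropy(S), approx. 0.72
```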

ID3: Preference Bias
[Tree: splitting on Name gives one leaf per person: Delenn → Yes, Draal → Yes, Morden → No, G'Kar → Yes, Sheridan → Yes.]
ID3 prefers some trees over others:
It favors shorter trees over longer ones.
It selects trees that place the attributes with the highest information gain closest to the root.
Its bias is solely a consequence of the ordering of hypotheses by its search strategy.

ID3: Overfitting (illustrated)
Suppose we receive an additional data point (shown on the slide).

Extra point: Effect on Our Tree
NB: in the previous tree, the instance <O=sunny, ..., H=normal, ...> was classified as PlayTennis = yes.

Effects of ID3 Overfitting
Trees may grow to include irrelevant attributes (e.g., Date, Color, etc.); noisy examples may add spurious nodes to the tree.

ID3 Properties
ID3 is complete for consistent(!) training data.
ID3 is not optimal (greedy hill-climbing approach → no guarantees).
ID3 can overfit on the training data (accuracy of the learned model = prediction on a test set).
Use of information gain → preference bias.
Continuous data: many more places to split an attribute → time-consuming search for the best split.
ID3 has been further optimized, e.g. C4.5 and C5.0; ID3 for iterative online learning: ID4.

More powerful: Naive Bayes

Naïve Bayes classifier
[Illustration: a naïve Bayes classifier updated given a forecast.] Supervised learning of the naive Bayes classifier.

Naive Bayes classifier: learning
A naive Bayes classifier specifies:
a class variable C
feature variables F_1, ..., F_n
a prior distribution p(C)
conditional distributions p(F_i | C)
The distributions p(C) and p(F_i | C) can be 'learned' from data, e.g. with a simple approach: frequency counting. A more sophisticated approach also learns the structure of the model, i.e. determines which features to include; this requires a performance measure (e.g. accuracy).

Naive Bayes classifier: use
A naive Bayes classifier predicts a most likely value c for class C given observed features F_i = f_i from:
p(C=c | f_1, ..., f_n) = (1/Z) · p(c) · Π_{i=1..n} p(f_i | c)
where 1/Z = 1/p(f_1, ..., f_n) is a normalisation constant. This formula is based on Bayes' rule, p(A|B) = p(B|A)p(A)/p(B), and the naive assumption that all n feature variables are independent given the class variable.

Example data set: when to play tennis, again

Learn NBC - example
Model structure is fixed; we just need the probabilities from the data. Feature variables: Outlook, Temp., Humidity, Wind; class variable: PlayTennis. Probabilities are based on frequency counting, just as in the ID3 entropy computations.
Class priors: p(PlayTennis=yes) = 9/14, p(PT=no) = 5/14
Conditionals p(f_i | C):
              PT=yes   PT=no
O=sunny       2/9      3/5
O=overcast    4/9      0
O=rain        3/9      2/5
T=hot         2/9      2/5
T=mild        4/9      2/5
T=cool        3/9      1/5
H=high        3/9      4/5
H=normal      6/9      1/5
W=weak        6/9      2/5
W=strong      3/9      3/5
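A sketch of this frequency counting in Python, again assuming the standard Mitchell (1997) PlayTennis table as the underlying data; exact fractions are used so the entries can be compared with the table above.

```python
from collections import Counter, defaultdict
from fractions import Fraction

DATA = [  # (Outlook, Temperature, Humidity, Wind, PlayTennis), cases D1..D14
    ('sunny', 'hot', 'high', 'weak', 'no'), ('sunny', 'hot', 'high', 'strong', 'no'),
    ('overcast', 'hot', 'high', 'weak', 'yes'), ('rain', 'mild', 'high', 'weak', 'yes'),
    ('rain', 'cool', 'normal', 'weak', 'yes'), ('rain', 'cool', 'normal', 'strong', 'no'),
    ('overcast', 'cool', 'normal', 'strong', 'yes'), ('sunny', 'mild', 'high', 'weak', 'no'),
    ('sunny', 'cool', 'normal', 'weak', 'yes'), ('rain', 'mild', 'normal', 'weak', 'yes'),
    ('sunny', 'mild', 'normal', 'strong', 'yes'), ('overcast', 'mild', 'high', 'strong', 'yes'),
    ('overcast', 'hot', 'normal', 'weak', 'yes'), ('rain', 'mild', 'high', 'strong', 'no'),
]
FEATURES = {'Outlook': 0, 'Temperature': 1, 'Humidity': 2, 'Wind': 3}

# Class priors p(C) by frequency counting.
class_counts = Counter(row[-1] for row in DATA)
priors = {c: Fraction(n, len(DATA)) for c, n in class_counts.items()}

# Conditionals p(f_i | C): count each feature value within the rows of each class.
conditionals = defaultdict(dict)
for feature, col in FEATURES.items():
    for c, n_c in class_counts.items():
        for value in set(row[col] for row in DATA):
            count = sum(1 for row in DATA if row[-1] == c and row[col] == value)
            conditionals[(feature, value)][c] = Fraction(count, n_c)

print(priors['yes'])                                # 9/14
print(conditionals[('Humidity', 'normal')]['yes'])  # 2/3 (= 6/9 in the table)
print(conditionals[('Humidity', 'normal')]['no'])   # 1/5
```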

Classify with NBC - example
Feature variables: O, T, H, W; class variable: PT. Class priors: p(PT=yes) = 9/14, p(PT=no) = 5/14.
Using the conditionals from the table above, classify instance e = <O=sunny, T=hot, H=normal, W=weak>:
p(PT=yes | e) = 1/Z * 9/14 * 2/9 * 2/9 * 6/9 * 6/9 = 1/Z * 0.01411
> p(PT=no | e) = 1/Z * 5/14 * 3/5 * 2/5 * 1/5 * 2/5 = 1/Z * 0.00686
So the predicted class is PT = yes.
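The two products can be checked directly; a sketch using the numbers read off the table above (Z is omitted, since it cancels when we only compare classes).

```python
from math import prod  # Python 3.8+

priors = {'yes': 9/14, 'no': 5/14}
cond = {   # p(f_i | C) for the observed feature values, from the table above
    'yes': {'O=sunny': 2/9, 'T=hot': 2/9, 'H=normal': 6/9, 'W=weak': 6/9},
    'no':  {'O=sunny': 3/5, 'T=hot': 2/5, 'H=normal': 1/5, 'W=weak': 2/5},
}
evidence = ['O=sunny', 'T=hot', 'H=normal', 'W=weak']

# Unnormalised posterior score per class: prior times product of conditionals.
scores = {c: priors[c] * prod(cond[c][f] for f in evidence) for c in priors}
print(scores)                          # {'yes': 0.0141..., 'no': 0.0068...}
print(max(scores, key=scores.get))     # 'yes': the predicted most likely class
```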

NBC Properties
NBC learning is complete (probabilistic: can handle inconsistencies in the data).
NBC learning is not optimal (unrealistic independence assumptions → class posterior often unreliable; yet accurate prediction of the most likely value).
Time and space complexity: the independence assumptions strongly reduce dimensionality.
NBC can overfit on the training data (especially with a large number of features).
NBC has been further optimized: TAN/FAN/KDB.