INFOB2KI 2017-2018 Utrecht University The Netherlands ARTIFICIAL INTELLIGENCE Supervised learning: classification Lecturer: Silja Renooij These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html
Decision tree learning
Supervised learning of a decision tree classifier by means of 'splitting' on attributes:
1. What is that?
2. How to split? (ID3)
When to play tennis? Example dataset D: N=14 cases, 4 attributes, 1 class variable
Decision Tree splits - I
Let's start building the tree from scratch: we first need to decide on which attribute to make a decision. Let's say¹ we selected Humidity; split the data according to the attribute's values:
Humidity = high: D1,D2,D3,D4,D8,D12,D14
Humidity = normal: D5,D6,D7,D9,D10,D11,D13
¹ NB using ID3, you won't have to make this choice yourself
Decision Tree splits - II
Now let's split the first subset (H=high: D1,D2,D3,D4,D8,D12,D14) using attribute Wind:
Wind = strong: D2,D12,D14
Wind = weak: D1,D3,D4,D8
(The subset H=normal: D5,D6,D7,D9,D10,D11,D13 is still unsplit.)
Decision Tree splits - III
Now let's split the subset H=high & W=strong (D2,D12,D14) using attribute Outlook:
Outlook = sunny: No
Outlook = overcast: Yes
Outlook = rain: No
This entire subset is now classified.
Decision Tree splits - IV
Now let's split the subset H=high & W=weak (D1,D3,D4,D8) using attribute Outlook:
Outlook = sunny: No
Outlook = overcast: Yes
Outlook = rain: Yes
Decision Tree splits - V
Now let's split the subset H=normal (D5,D6,D7,D9,D10,D11,D13) using Outlook:
Outlook = sunny: Yes
Outlook = overcast: Yes
Outlook = rain: D5,D6,D10 (not yet classified)
Decision Tree splits - VI
Now let's split the subset H=normal & O=rain (D5,D6,D10) using Wind:
Wind = strong: No
Wind = weak: Yes
The tree is now complete.
Final Decision Tree
Tree: Humidity (high -> Wind; normal -> Outlook); under Wind: strong -> Outlook (sunny: No, overcast: Yes, rain: No), weak -> Outlook (sunny: No, overcast: Yes, rain: Yes); under Outlook (normal branch): sunny: Yes, overcast: Yes, rain -> Wind (strong: No, weak: Yes).
Note: the decision tree can be expressed as a set of if-then-else sentences, or, in case of binary outcomes, as a logical formula:
(humidity=high AND wind=strong AND outlook=overcast)
OR (humidity=high AND wind=weak AND outlook=overcast)
OR (humidity=high AND wind=weak AND outlook=rain)
OR (humidity=normal AND outlook=sunny)
OR (humidity=normal AND outlook=overcast)
OR (humidity=normal AND outlook=rain AND wind=weak)
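The if-then-else reading of the tree translates directly into code. A minimal Python sketch (the function name and lowercase value strings are illustrative choices, not from the slides):

```python
def classify(humidity, wind, outlook):
    """Final PlayTennis decision tree, written as nested if-then-else."""
    if humidity == "high":
        if wind == "strong":
            # strong wind: only overcast still allows play
            return "yes" if outlook == "overcast" else "no"
        else:  # wind == "weak"
            return "yes" if outlook in ("overcast", "rain") else "no"
    else:  # humidity == "normal"
        if outlook == "rain":
            return "yes" if wind == "weak" else "no"
        return "yes"  # sunny or overcast

print(classify("normal", "weak", "sunny"))  # -> yes
```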
Classifying with Decision Trees
Now classify the instance <O=sunny, T=hot, H=normal, W=weak> = ???
Classifying with Decision Trees
<O=sunny, T=hot, H=normal, W=weak>: follow the path Humidity=normal -> Outlook=sunny -> Yes.
Note that this was an unseen instance (not in the data).
Alternative Decision Trees
Another tree can be built from the same data, using different attributes: we can build quite a large number of (unique) decision trees. So which attribute should we choose at each branch?
ID3: an entropy-based decision tree learner
Entropy
A measure of the disorder or randomness in a closed system with variable(s) of interest S:

Entropy(S) = - SUM_{i=1..n} p(s_i) log2 p(s_i)

where n = |S| is the number of values of S.
Convention: 0 log2 0 = 0
For a degenerate distribution, the entropy will be 0 (why?)
For a uniform distribution, the entropy will be log2 n (= 1 for a binary-valued variable)
Recall: log2 x = log_b x / log_b 2 for any base-b logarithm
Entropy: example
In our system we have one variable of interest (S = PlayTennis) with 2 possible values (yes, no), so n = |S| = 2.
Let p+ = p(PT=yes) and p- = p(PT=no), and use frequency counting to establish these probabilities from the data:
9 out of N=14 examples are positive: p+ = 9/14
5 of these 14 are negative: p- = 5/14
Entropy(PlayTennis) = -p+ log2 p+ - p- log2 p- = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
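The entropy computation above is easy to check in code. A minimal sketch, assuming the probabilities are given as a list:

```python
import math

def entropy(probs):
    """Entropy in bits: -sum p*log2(p), with the convention 0*log2(0) = 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# PlayTennis: 9 positive and 5 negative out of N = 14 cases
print(round(entropy([9/14, 5/14]), 3))  # -> 0.94
print(entropy([1.0, 0.0]))              # degenerate distribution -> 0.0
print(entropy([0.5, 0.5]))              # uniform, binary -> 1.0
```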
Conditional Entropy
Conditional entropy represents the entropy in a system given the values of another variable. The entropy Entropy(S|X) of S conditioned on X is the expected value of the entropy given all possible values x of X:

Entropy(S|X) = SUM_x p(x) Entropy(S|X=x)

where Entropy(S|X=x) = - SUM_s p(s|x) log2 p(s|x).

We will use the following shorthand notations:
Entropy(S_X) for Entropy(S|X)
Entropy(S_x) for Entropy(S|X=x)
Conditional Entropy - example
We can now evaluate each attribute by calculating how much change it makes in entropy. For example, consider the attribute Temperature, which has 3 values: hot, mild, cool. So we need to consider 3 subsystems: S_hot, S_mild, S_cool. For each subsystem, probabilities are assessed from a subset of the data D:
D_hot = {D1,D2,D3,D13}, so p(hot) = 4/14
D_mild = {D4,D8,D10,D11,D12,D14}, so p(mild) = 6/14
D_cool = {D5,D6,D7,D9}, so p(cool) = 4/14
Now first compute the entropy in the subsystems: Entropy(S_hot), Entropy(S_mild), Entropy(S_cool).
Conditional Entropy - example II
D_hot = {D1(-),D2(-),D3(+),D13(+)}: p+_hot = 0.5 and p-_hot = 0.5
Entropy(S_hot) = -0.5 log2 0.5 - 0.5 log2 0.5 = 1
D_mild = {D4(+),D8(-),D10(+),D11(+),D12(+),D14(-)}: p+_mild = 2/3 and p-_mild = 1/3
Entropy(S_mild) = -(2/3) log2(2/3) - (1/3) log2(1/3) = 0.918
D_cool = {D5(+),D6(-),D7(+),D9(+)}: p+_cool = 0.75 and p-_cool = 0.25
Entropy(S_cool) = -0.75 log2 0.75 - 0.25 log2 0.25 = 0.811
Conditional Entropy - example III
The conditional entropy after splitting on Temperature now is:
Entropy(S_Temperature) = p(hot) Entropy(S_hot) + p(mild) Entropy(S_mild) + p(cool) Entropy(S_cool)
= (4/14)*1 + (6/14)*0.918 + (4/14)*0.811 = 0.911
Okay: but does this mean we should split on this attribute?
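The same three-subsystem computation can be sketched in a few lines; the (+,-) counts per temperature value are read off the subsets above:

```python
import math

def entropy(probs):
    """Entropy in bits, with the convention 0*log2(0) = 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# (positive, negative) class counts per Temperature value:
# D_hot = {D1-,D2-,D3+,D13+}, D_mild = {D4+,D8-,D10+,D11+,D12+,D14-}, D_cool = {D5+,D6-,D7+,D9+}
subsets = {"hot": (2, 2), "mild": (4, 2), "cool": (3, 1)}
N = 14

# Entropy(S|Temperature) = sum over values of p(value) * Entropy(subsystem)
cond = sum((pos + neg) / N * entropy([pos / (pos + neg), neg / (pos + neg)])
           for pos, neg in subsets.values())
print(round(cond, 3))  # -> 0.911

gain = entropy([9/14, 5/14]) - cond   # reduction in entropy (next slides)
print(round(gain, 3))  # -> 0.029
```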
Information Gain
We now define the Gain (reduction in entropy) of splitting on attribute X as:

Gain(S, X) = Entropy(S) - Entropy(S|X)

Information gain is always a non-negative value! (Why?)
If Entropy(S_X) = 0, then all cases in S_X are correctly classified.
So: split on the attribute with the smallest conditional entropy, or equivalently, split on the attribute with the highest gain.
Information Gain - example
The gain of splitting on Temperature is: Gain(S, Temp) = 0.940 - 0.911 = 0.029
Compute the gain of splitting for all other attributes:
Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
We therefore split on Outlook and repeat the process for:
S <- S_sunny with D <- D_sunny
S <- S_overcast with D <- D_overcast
S <- S_rain with D <- D_rain
ID3 (Decision Tree Algorithm)
Building a decision tree with the ID3 algorithm:
1. Start from an empty node
2. Select the attribute with the most information gain
3. Split: create the subsystems (children) for each value of the selected attribute
4. For each associated subset of the data: if not all elements belong to the same class, then repeat steps 2-3 for the subset
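The four steps above can be sketched as a short recursive function. The hard-coded dataset below is the standard PlayTennis table (as in Mitchell's Machine Learning textbook), which is consistent with all counts used on these slides; treat the table and all identifier names as illustrative assumptions:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-c / n * math.log2(c / n) for c in Counter(labels).values())

def id3(data, attrs, target="PlayTennis"):
    labels = [row[target] for row in data]
    if len(set(labels)) == 1:       # all cases in the same class: leaf node
        return labels[0]
    def cond_entropy(a):            # Entropy(S|A): weighted entropy of subsets
        return sum(len(sub) / len(data) * entropy([r[target] for r in sub])
                   for v in {row[a] for row in data}
                   for sub in [[r for r in data if r[a] == v]])
    best = min(attrs, key=cond_entropy)  # max gain = min conditional entropy
    return {best: {v: id3([r for r in data if r[best] == v],
                          [a for a in attrs if a != best], target)
                   for v in {row[best] for row in data}}}

cols = ["Outlook", "Temperature", "Humidity", "Wind", "PlayTennis"]
rows = [("Sunny","Hot","High","Weak","No"), ("Sunny","Hot","High","Strong","No"),
        ("Overcast","Hot","High","Weak","Yes"), ("Rain","Mild","High","Weak","Yes"),
        ("Rain","Cool","Normal","Weak","Yes"), ("Rain","Cool","Normal","Strong","No"),
        ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
        ("Sunny","Cool","Normal","Weak","Yes"), ("Rain","Mild","Normal","Weak","Yes"),
        ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
        ("Overcast","Hot","Normal","Weak","Yes"), ("Rain","Mild","High","Strong","No")]
data = [dict(zip(cols, r)) for r in rows]

tree = id3(data, cols[:-1])
print(list(tree))  # root attribute: Outlook (the attribute with gain 0.246)
```

Note that on this data ID3 puts Outlook at the root, not Humidity; the Humidity-first tree built earlier was a hand-made choice.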
Domain for ID3 example
(Figure: a robot in a grid world with obstacles.)
The robot can turn left & right, and move forward.
Cases for ID3 example
S:  LeftSensor | RightSensor | ForwardSensor | BackSensor | PreviousAction | Action
X1: Obstacle   | Free        | Obstacle      | Free       | MoveForward    | TurnRight
X2: Free       | Free        | Obstacle      | Free       | TurnLeft       | TurnLeft
X3: Free       | Obstacle    | Free          | Free       | MoveForward    | MoveForward
X4: Free       | Obstacle    | Free          | Obstacle   | TurnLeft       | MoveForward
X5: Obstacle   | Free        | Free          | Free       | TurnRight      | MoveForward
X6: Free       | Free        | Free          | Obstacle   | TurnRight      | MoveForward
ID3 Example
Entropy(S) = -1/6 log2(1/6) - 1/6 log2(1/6) - 4/6 log2(4/6) = 1.25
Entropy(S_LeftSensor) = 2/6*Entropy(S_LS=obstacle) + 4/6*Entropy(S_LS=free) = 2/6*1 + 4/6*0.811 = 0.874
Entropy(S_RightSensor) = 2/6*Entropy(S_RS=obstacle) + 4/6*Entropy(S_RS=free) = 2/6*0 + 4/6*1.5 = 1
Entropy(S_ForwardSensor) = 2/6*Entropy(S_FS=obstacle) + 4/6*Entropy(S_FS=free) = 2/6*1 + 4/6*0 = 0.333
Entropy(S_BackSensor) = 2/6*Entropy(S_BS=obstacle) + 4/6*Entropy(S_BS=free) = 2/6*0 + 4/6*1.5 = 1
Entropy(S_PreviousAction) = 2/6*Entropy(S_PA=MoveForward) + 2/6*Entropy(S_PA=TurnLeft) + 2/6*Entropy(S_PA=TurnRight) = 2/6*1 + 2/6*1 + 2/6*0 = 0.667
Gain(S, LeftSensor) = 1.25 - 0.874 = 0.376
Gain(S, RightSensor) = 1.25 - 1 = 0.25
Gain(S, ForwardSensor) = 1.25 - 0.333 = 0.917
Gain(S, BackSensor) = 1.25 - 1 = 0.25
Gain(S, PreviousAction) = 1.25 - 0.667 = 0.583
Select ForwardSensor.
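These entropies and gains can be reproduced mechanically from cases X1-X6. A sketch (the attribute names and the O/F/action abbreviations are shorthand for the table values, not notation from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-c / n * math.log2(c / n) for c in Counter(labels).values())

# X1..X6 as (Left, Right, Forward, Back, Previous) -> Action
# O = Obstacle, F = Free; MF/TL/TR = MoveForward/TurnLeft/TurnRight
cases = [("O","F","O","F","MF","TR"), ("F","F","O","F","TL","TL"),
         ("F","O","F","F","MF","MF"), ("F","O","F","O","TL","MF"),
         ("O","F","F","F","TR","MF"), ("F","F","F","O","TR","MF")]
attrs = ["Left", "Right", "Forward", "Back", "Previous"]

base = entropy([c[-1] for c in cases])  # ~1.25

gains = {}
for i, name in enumerate(attrs):
    cond = sum(len(sub) / len(cases) * entropy([c[-1] for c in sub])
               for v in {c[i] for c in cases}
               for sub in [[c for c in cases if c[i] == v]])
    gains[name] = base - cond
    print(name, round(gains[name], 3))
```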
Decision Tree ID3 Example (continued)
After the root split on ForwardSensor: free -> MoveForward; obstacle -> subset {X1,X2} = S'
Entropy(S') = -1/2 log2(1/2) - 1/2 log2(1/2) = 1 (X1: Action = TurnRight; X2: Action = TurnLeft)
Entropy(S'_LeftSensor) = 1/2*Entropy(S'_LS=obstacle) + 1/2*Entropy(S'_LS=free) = 1/2*0 + 1/2*0 = 0, so Gain = 1 - 0 = 1
Entropy(S'_RightSensor) = 1*Entropy(S'_RS=free) = 1*1 = 1, so Gain = 1 - 1 = 0
Entropy(S'_BackSensor): exactly the same, so Gain = 1 - 1 = 0
Entropy(S'_PreviousAction) = 1/2*Entropy(S'_PA=MoveForward) + 1/2*Entropy(S'_PA=TurnLeft) = 1/2*0 + 1/2*0 = 0, so Gain = 1 - 0 = 1
Select either LeftSensor or PreviousAction, depending on the execution order.
Decision Tree ID3 Example: resulting trees
ForwardSensor = free -> MoveForward; obstacle -> LeftSensor: obstacle -> TurnRight (X1), free -> TurnLeft (X2)
or, equivalently:
ForwardSensor = free -> MoveForward; obstacle -> PreviousAction: MoveForward -> TurnRight (X1), TurnLeft -> TurnLeft (X2)
ID3 preference bias - example I (Babylon 5 universe)
S:  Race    | Name     | BeenToB5 | GoodPerson
D1: Minbari | Delenn   | Yes      | Yes
D2: Minbari | Draal    | Yes      | Yes
D3: Human   | Morden   | Yes      | No
D4: Narn    | G'Kar    | Yes      | Yes
D5: Human   | Sheridan | Yes      | Yes
p_yes = 0.8, p_no = 0.2; Entropy(S) = -0.2 log2 0.2 - 0.8 log2 0.8 = 0.72
Split on Race:
D_minbari = {D1(+),D2(+)}: Entropy(S_minbari) = 0
D_human = {D3(-),D5(+)}: Entropy(S_human) = 1
D_narn = {D4(+)}: Entropy(S_narn) = 0
Entropy(S_Race) = 2/5*0 + 2/5*1 + 1/5*0 = 2/5
Gain(S, Race) = 0.72 - 2/5 = 0.32
ID3 preference bias - example II (Babylon 5 universe)
Same data; p_yes = 0.8, p_no = 0.2, so Entropy(S) = 0.72.
Split on Name:
D_Delenn = {D1(+)}: Entropy(S_Delenn) = 0
D_Draal = {D2(+)}: Entropy(S_Draal) = 0
D_Morden = {D3(-)}: ..
D_G'Kar = {D4(+)}: ..
D_Sheridan = {D5(+)}: ..
The entropies of all subsets are 0, so Entropy(S_Name) = 0 and Gain(S, Name) = 0.72 - 0 = 0.72
ID3: Preference Bias
Splitting on Name yields a one-level tree: Name = Delenn -> Yes, Draal -> Yes, Morden -> No, G'Kar -> Yes, Sheridan -> Yes.
ID3 prefers some trees over others:
It favors shorter trees over longer ones.
It selects trees that place the attributes with the highest information gain closest to the root.
Its bias is solely a consequence of the ordering of hypotheses by its search strategy.
ID3: Overfitting (illustrated)
Suppose we receive an additional data point.
Extra point: Effect on Our Tree
NB in the previous tree, the instance <O=sunny, ..., H=normal, ...> was classified as PlayTennis = yes.
Effects of ID3 Overfitting
Trees may grow to include irrelevant attributes (e.g., Date, Color, etc.)
Noisy examples may add spurious nodes to the tree
ID3 Properties
ID3 is complete for consistent(!) training data.
ID3 is not optimal (greedy hill-climbing approach, no guarantees).
ID3 can overfit on the training data (accuracy of the learned model = prediction on a test set).
Use of information gain gives a preference bias.
Continuous data: many more places to split an attribute, so a time-consuming search for the best split.
ID3 has been further optimized, e.g. C4.5 and C5.0.
ID3 for iterative online learning: ID4.
More powerful: Naive Bayes
Naive Bayes classifier
Supervised learning of a naive Bayes classifier.
Naive Bayes classifier: learning
A naive Bayes classifier specifies:
a class variable C
feature variables F_1, ..., F_n
a prior distribution p(C)
conditional distributions p(F_i | C)
The distributions p(C) and p(F_i | C) can be 'learned' from data, e.g. by a simple approach: frequency counting.
A more sophisticated approach also learns the structure of the model, i.e. determines which features to include; this requires a performance measure (e.g. accuracy).
Naive Bayes classifier: use
A naive Bayes classifier predicts a most likely value c for class C given observed features F_i = f_i from:

p(c | f_1, ..., f_n) = 1/Z * p(c) * PROD_{i=1..n} p(f_i | c)

where 1/Z = 1/p(f_1, ..., f_n) is a normalisation constant.
This formula is based on Bayes' rule, p(A|B) = p(B|A)p(A)/p(B), and the naive assumption that all n feature variables are independent given the class variable.
Example data set: when to play tennis, again
Learn NBC - example
Model structure is fixed; we just need the probabilities from the data.
Feature variables: Outlook, Temperature, Humidity, Wind; class variable: PlayTennis.
Class priors: p(PlayTennis=yes) = 9/14, p(PT=no) = 5/14.
Probabilities are based on frequency counting, just as in the ID3 entropy computations.
Conditionals p(F_i | C):
            PT=yes  PT=no
O=sunny     2/9     3/5
O=overcast  4/9     0
O=rain      3/9     2/5
T=hot       2/9     2/5
T=mild      4/9     2/5
T=cool      3/9     1/5
H=high      3/9     4/5
H=normal    6/9     1/5
W=weak      6/9     2/5
W=strong    3/9     3/5
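Frequency counting itself is a short computation. A sketch that reproduces a few table entries, assuming the standard PlayTennis data set (as in Mitchell's Machine Learning textbook), which matches the counts on these slides; the column abbreviations are illustrative:

```python
from fractions import Fraction

cols = ["Outlook", "Temp", "Humidity", "Wind", "PT"]
rows = [("Sunny","Hot","High","Weak","No"), ("Sunny","Hot","High","Strong","No"),
        ("Overcast","Hot","High","Weak","Yes"), ("Rain","Mild","High","Weak","Yes"),
        ("Rain","Cool","Normal","Weak","Yes"), ("Rain","Cool","Normal","Strong","No"),
        ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
        ("Sunny","Cool","Normal","Weak","Yes"), ("Rain","Mild","Normal","Weak","Yes"),
        ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
        ("Overcast","Hot","Normal","Weak","Yes"), ("Rain","Mild","High","Strong","No")]
data = [dict(zip(cols, r)) for r in rows]

def p(feature, value, pt):
    """Conditional p(feature=value | PT=pt) by frequency counting."""
    given = [r for r in data if r["PT"] == pt]
    return Fraction(sum(r[feature] == value for r in given), len(given))

print(p("Outlook", "Sunny", "Yes"))   # -> 2/9
print(p("Humidity", "Normal", "No"))  # -> 1/5
print(Fraction(sum(r["PT"] == "Yes" for r in data), len(data)))  # prior -> 9/14
```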
Classify with NBC - example
Classify the instance e = <O=sunny, T=hot, H=normal, W=weak>, using the priors and conditionals from the previous slide:
p(PT=yes | e) = 1/Z * 9/14 * 2/9 * 2/9 * 6/9 * 6/9 = 1/Z * 0.01411
> p(PT=no | e) = 1/Z * 5/14 * 3/5 * 2/5 * 1/5 * 2/5 = 1/Z * 0.00686
So PlayTennis = yes is the most likely class.
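Plugging the priors and conditionals into the formula can be sketched as follows (probabilities copied from the slide; the normalisation constant Z is obtained by summing the two unnormalised scores):

```python
import math

priors = {"yes": 9/14, "no": 5/14}
cond = {  # p(f_i | PT) for the observed features of e = <sunny, hot, normal, weak>
    "yes": {"O=sunny": 2/9, "T=hot": 2/9, "H=normal": 6/9, "W=weak": 6/9},
    "no":  {"O=sunny": 3/5, "T=hot": 2/5, "H=normal": 1/5, "W=weak": 2/5},
}

# unnormalised score per class: p(c) * prod_i p(f_i | c)
score = {c: priors[c] * math.prod(cond[c].values()) for c in priors}
Z = sum(score.values())
posterior = {c: s / Z for c, s in score.items()}

print({c: round(s, 5) for c, s in score.items()})  # yes ~0.01411, no ~0.00686
print(max(posterior, key=posterior.get))           # -> yes
```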
NBC Properties
NBC learning is complete (probabilistic: it can handle inconsistencies in the data).
NBC learning is not optimal (unrealistic independence assumptions make the class posterior often unreliable; yet it accurately predicts the most likely value).
Time and space complexity: the independence assumptions strongly reduce dimensionality.
NBC can overfit on the training data (especially with a large number of features).
NBC has been further optimized: TAN/FAN/KDB.