Artificial Intelligence Topic: Decision Trees and the ID3 Algorithm
What is a decision tree? A tree in which each branching node represents a choice between two or more alternatives, and every branching node lies on a path to a leaf node (the bottom of the tree). A leaf node represents the decision the tree derives for the given input.
ID3 Algorithm: a top-down, greedy search with no backtracking, so the algorithm proceeds from the top of the tree to the bottom.
What is a greedy search? At each step, make the decision that yields the greatest improvement in whatever you are trying to optimize, and do not backtrack (unless you hit a dead end). Such a search is unlikely to find a globally optimal solution, but it generally works well. What are we really doing here? At each node of the tree, decide which attribute best classifies the training data at that point, and never backtrack (in ID3). Do this for each branch of the tree. The end result is a tree structure representing the hypothesis that works best for the training data.
Constructing a decision tree using information gain. A decision tree can be constructed top-down using information gain in the following way:
1. Begin at the root node.
2. Determine the attribute with the highest information gain that is not already used by an ancestor node.
3. Add a child node for each possible value of that attribute.
4. Attach each example to the child node whose attribute value matches the example's value for that attribute.
5. If all examples attached to a child node can be classified uniquely, add that classification to the node and mark it as a leaf node; otherwise go back to step 2 if there are unused attributes left, or else label the node with the classification of the majority of its attached examples.
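A minimal Python sketch of this construction, assuming each training example is a dict that maps attribute names to values plus a boolean target field (all names here are illustrative, not from the slides):

```python
import math
from collections import Counter

def entropy(examples, target):
    """Binary entropy of a set of examples, with 0 * log2(0) taken as 0."""
    n = len(examples)
    if n == 0:
        return 0.0
    p1 = sum(1 for e in examples if e[target]) / n
    return sum(-p * math.log2(p) for p in (p1, 1.0 - p1) if p > 0)

def information_gain(examples, attr, target):
    """Reduction in entropy obtained by splitting the examples on attr."""
    n = len(examples)
    remainder = 0.0
    for v in set(e[attr] for e in examples):
        subset = [e for e in examples if e[attr] == v]
        remainder += len(subset) / n * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, attributes, target):
    """Top-down greedy construction: returns a leaf label or a nested dict
    {best_attribute: {value: subtree, ...}}. Never backtracks."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:          # step 5: uniquely classified -> leaf
        return labels[0]
    if not attributes:                 # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes,             # step 2: highest information gain
               key=lambda a: information_gain(examples, a, target))
    return {best: {v: id3([e for e in examples if e[best] == v],
                          [a for a in attributes if a != best],
                          target)
                   for v in set(e[best] for e in examples)}}
```

Unlike step 3, this sketch only branches on attribute values actually present in the attached examples; covering every possible value would require a majority-vote default for values not yet seen.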
Terminology: object = sample = example; attribute = variable = property; the target column is the decision.
[Table: training examples for "Shall we play tennis today?"]
The tree itself forms the hypothesis: a disjunction (ORs) of conjunctions (ANDs). Each path from root to leaf forms a conjunction of constraints on attributes; separate branches are disjunctions. Example from the PlayTennis decision tree: (Outlook=Sunny ∧ Humidity=Normal) ∨ (Outlook=Overcast) ∨ (Outlook=Rain ∧ Wind=Weak)
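Written out as code, the hypothesis is just this boolean expression; a small sketch (the function name and the dict x are hypothetical):

```python
def play_tennis(x):
    # One conjunct per root-to-leaf "Yes" path, joined by disjunction.
    return ((x["Outlook"] == "Sunny" and x["Humidity"] == "Normal")
            or x["Outlook"] == "Overcast"
            or (x["Outlook"] == "Rain" and x["Wind"] == "Weak"))

print(play_tennis({"Outlook": "Rain", "Wind": "Weak"}))  # True
```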
The hypothesis space can include disjunctive expressions; it is the set of possible decision trees. In fact, this hypothesis space is the complete space of finite discrete-valued functions.
Question: how do you determine which attribute best classifies the data?
What is entropy? Entropy is a measure of the uncertainty associated with a random variable
e.g. A series of coin tosses with a fair coin has maximum entropy, since there is no way to predict what will come next. A string of tosses with a two-headed coin has zero entropy, since the coin will always come up heads. A single toss of a fair coin has an entropy of one bit, but a particular result (e.g. "heads") has zero entropy, since it is entirely "predictable".
Entropy. The entropy (disorder, impurity) of a set of examples S, relative to a binary classification, is:
H(S) = −p₁ log₂(p₁) − p₀ log₂(p₀)
where p₁ is the fraction of positive examples in S and p₀ is the fraction of negatives. If all examples are in one category, the entropy is zero (we define 0 log(0) = 0). If the examples are equally mixed (p₁ = p₀ = 0.5), the entropy is at its maximum of 1.
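A direct transcription of this formula, using the 0 log(0) = 0 convention and checked against the coin-toss intuition above (the function name is illustrative):

```python
import math

def binary_entropy(p1):
    """H(S) = -p1*log2(p1) - p0*log2(p0), with 0*log2(0) defined as 0."""
    return sum(-p * math.log2(p) for p in (p1, 1.0 - p1) if p > 0)

print(binary_entropy(0.5))  # fair coin: 1.0, the maximum
print(binary_entropy(1.0))  # two-headed coin: 0.0, no uncertainty
```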
Information gain. Information gain measures how well an attribute classifies the training data; in other words, it is the reduction in entropy from splitting on that attribute. The mathematical expression for information gain is:
Gain(S, Aᵢ) = H(S) − Σ_{v ∈ Values(Aᵢ)} P(Aᵢ = v) · H(Sᵥ)
where Sᵥ is the subset of S for which Aᵢ = v.
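The same formula as a sketch that works directly from per-value positive/negative counts (the counts-dict layout is an assumption made for illustration):

```python
import math

def H(pos, neg):
    """Binary entropy from counts, with 0*log2(0) taken as 0."""
    total = pos + neg
    return sum(-c / total * math.log2(c / total) for c in (pos, neg) if c > 0)

def gain(counts):
    """counts maps each value v of attribute A_i to (N_pos, N_neg) within S_v;
    implements Gain(S, A_i) = H(S) - sum_v P(A_i = v) * H(S_v)."""
    n = sum(p + q for p, q in counts.values())
    pos = sum(p for p, _ in counts.values())
    weighted = sum((p + q) / n * H(p, q) for p, q in counts.values())
    return H(pos, n - pos) - weighted

# Outlook in the PlayTennis example below: Sunny 2+/3-, Overcast 4+/0-, Rain 3+/2-
print(f"{gain({'Sunny': (2, 3), 'Overcast': (4, 0), 'Rain': (3, 2)}):.4f}")
# 0.2467 (quoted below as 0.246, from rounding H(S) to 0.940 first)
```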
ID3 algorithm (for a boolean-valued function). Calculate the entropy over all training examples, positive and negative cases:
p₊ = #pos/total, p₋ = #neg/total
H(S) = −p₊ log₂(p₊) − p₋ log₂(p₋)
Use the attribute with the greatest information gain as the root.
Example: PlayTennis. Four attributes used for classification: Outlook = {Sunny, Overcast, Rain}, Temperature = {Hot, Mild, Cool}, Humidity = {High, Normal}, Wind = {Weak, Strong}. One predicted (target) attribute (binary): PlayTennis = {Yes, No}. Given 14 training examples: 9 positive, 5 negative.
[Table: the 14 training examples (examples are also called minterms, cases, objects, or test cases)]
Step 1: Calculate the entropy over all 14 cases (9 positive): N_pos = 9, N_neg = 5, N_tot = 14.
H(S) = −(9/14)·log₂(9/14) − (5/14)·log₂(5/14) = 0.940
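A quick check of this number:

```python
import math

p_pos, p_neg = 9 / 14, 5 / 14
H_S = -p_pos * math.log2(p_pos) - p_neg * math.log2(p_neg)
print(f"{H_S:.3f}")  # 0.940
```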
Step 2: Loop over all attributes and calculate the gain for each.
Attribute = Outlook. Loop over the values of Outlook:
Outlook = Sunny: N_pos = 2, N_neg = 3, N_tot = 5; H(Sunny) = −(2/5)·log₂(2/5) − (3/5)·log₂(3/5) = 0.971
Outlook = Overcast: N_pos = 4, N_neg = 0, N_tot = 4; H(Overcast) = −(4/4)·log₂(4/4) − (0/4)·log₂(0/4) = 0.00
Outlook = Rain: N_pos = 3, N_neg = 2, N_tot = 5; H(Rain) = −(3/5)·log₂(3/5) − (2/5)·log₂(2/5) = 0.971
Calculate the information gain for attribute Outlook:
Gain(S, Outlook) = H(S) − (N_Sunny/N_tot)·H(Sunny) − (N_Overcast/N_tot)·H(Overcast) − (N_Rain/N_tot)·H(Rain)
Gain(S, Outlook) = 0.940 − (5/14)·0.971 − (4/14)·0 − (5/14)·0.971 = 0.246
Attribute = Temperature (repeat the process, looping over {Hot, Mild, Cool}): Gain(S, Temperature) = 0.029
Attribute = Humidity (repeat the process, looping over {High, Normal}): Gain(S, Humidity) = 0.151
Attribute = Wind (repeat the process, looping over {Weak, Strong}): Gain(S, Wind) = 0.048
Find the attribute with the greatest information gain: Gain(S, Outlook) = 0.246, Gain(S, Temperature) = 0.029, Gain(S, Humidity) = 0.151, Gain(S, Wind) = 0.048. Outlook wins, so it becomes the root node of the tree.
Iterate the algorithm to find the attributes that best classify the training examples under each value of the root node. Example continued: take three subsets, Outlook = Sunny (N_tot = 5), Outlook = Overcast (N_tot = 4), Outlook = Rain (N_tot = 5). For each subset, repeat the above calculation, looping over all attributes other than Outlook.
For example, Outlook = Sunny (N_pos = 2, N_neg = 3, N_tot = 5), H(S_Sunny) = 0.971:
Temp = Hot (N_pos = 0, N_neg = 2, N_tot = 2): H = 0.0
Temp = Mild (N_pos = 1, N_neg = 1, N_tot = 2): H = 1.0
Temp = Cool (N_pos = 1, N_neg = 0, N_tot = 1): H = 0.0
Gain(S_Sunny, Temperature) = 0.971 − (2/5)·0 − (2/5)·1 − (1/5)·0 = 0.571
Similarly: Gain(S_Sunny, Humidity) = 0.971, Gain(S_Sunny, Wind) = 0.020.
Humidity classifies the Outlook = Sunny instances best and is placed as the node under the Sunny branch. Repeat this process for Outlook = Overcast and Outlook = Rain.
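The whole worked example can be reproduced with a short script. The table here is the standard 14-example PlayTennis training set (Mitchell, Machine Learning, 1997), which is consistent with every count quoted above:

```python
import math

# Columns: Outlook, Temperature, Humidity, Wind, PlayTennis (True = Yes)
DATA = [
    ("Sunny", "Hot", "High", "Weak", False), ("Sunny", "Hot", "High", "Strong", False),
    ("Overcast", "Hot", "High", "Weak", True), ("Rain", "Mild", "High", "Weak", True),
    ("Rain", "Cool", "Normal", "Weak", True), ("Rain", "Cool", "Normal", "Strong", False),
    ("Overcast", "Cool", "Normal", "Strong", True), ("Sunny", "Mild", "High", "Weak", False),
    ("Sunny", "Cool", "Normal", "Weak", True), ("Rain", "Mild", "Normal", "Weak", True),
    ("Sunny", "Mild", "Normal", "Strong", True), ("Overcast", "Mild", "High", "Strong", True),
    ("Overcast", "Hot", "Normal", "Weak", True), ("Rain", "Mild", "High", "Strong", False),
]
ATTRS = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}

def entropy(rows):
    n, pos = len(rows), sum(r[-1] for r in rows)
    return sum(-c / n * math.log2(c / n) for c in (pos, n - pos) if c > 0)

def gain(rows, attr):
    i = ATTRS[attr]
    rem = 0.0
    for v in set(r[i] for r in rows):
        subset = [r for r in rows if r[i] == v]
        rem += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - rem

for a in ATTRS:  # gains at the root
    print(a, round(gain(DATA, a), 3))
# Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048 -> Outlook wins

sunny = [r for r in DATA if r[0] == "Sunny"]
for a in ("Temperature", "Humidity", "Wind"):  # gains under Outlook = Sunny
    print(a, round(gain(sunny, a), 3))
# Temperature 0.571, Humidity 0.971, Wind 0.02 -> Humidity wins
```

Note the small rounding differences: the script prints 0.247 and 0.152 where the slides, working from rounded intermediate values, quote 0.246 and 0.151.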
We end up with the tree: Outlook at the root; under Sunny, a Humidity node (High → No, Normal → Yes); under Overcast, the leaf Yes; under Rain, a Wind node (Weak → Yes, Strong → No).
Drawback. A drawback of using decision trees is that the outcomes of decisions, subsequent decisions, and payoffs may be based primarily on expectations. For example, if you expect that your parents will pay for half of your college tuition when deciding to go to school, but later discover that you will have to pay for all of it, your expected payoff will be dramatically different from reality.
Controlling overfitting. Introduce a restriction on the hypothesis space to prevent overly complex hypotheses from being learned.
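One concrete way to impose such a restriction, sketched against the id3 function given earlier: cap the tree depth. The max_depth parameter is an illustrative pre-pruning device, not part of classic ID3:

```python
from collections import Counter

def id3_limited(examples, attributes, target, max_depth):
    """Like id3 above (information_gain as defined there), but refuses to
    grow past max_depth levels, returning a majority-vote leaf instead."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1 or not attributes or max_depth == 0:
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    return {best: {v: id3_limited([e for e in examples if e[best] == v],
                                  [a for a in attributes if a != best],
                                  target, max_depth - 1)
                   for v in set(e[best] for e in examples)}}
```

Shallower trees correspond to a smaller hypothesis space, trading training-set fit for simpler, less overfit hypotheses.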
Pros and cons of decision trees.
Advantages: works well with discrete data; produces human-comprehensible concepts; easy to understand; able to process both numerical and categorical data.
Disadvantages: trees created from numeric datasets can be complex; limited to one output attribute.