Supervised Learning (contd) Decision Trees. Mausam (based on slides by UW-AI faculty)

Decision Trees: To play or not to play?

Example data for learning the concept "Good day for tennis"

Day   Outlook  Humid  Wind  PlayTennis?
d1    s        h      w     n
d2    s        h      s     n
d3    o        h      w     y
d4    r        h      w     y
d5    r        n      w     y
d6    r        n      s     y
d7    o        n      s     y
d8    s        h      w     n
d9    s        n      w     y
d10   r        n      w     y
d11   s        n      s     y
d12   o        h      s     y
d13   o        n      w     y
d14   r        h      s     n

Outlook = sunny (s), overcast (o), rain (r)
Humidity = high (h), normal (n)
Wind = weak (w), strong (s)

A Decision Tree for the Same Data
Decision tree for PlayTennis?
Leaves = classification output; arcs = choice of value for the parent attribute.

Outlook
  Sunny: Humidity
    Normal: Yes
    High: No
  Overcast: Yes
  Rain: Wind
    Strong: No
    Weak: Yes

A decision tree is equivalent to logic in disjunctive normal form:
PlayTennis $\Leftrightarrow$ (Sunny $\wedge$ Normal) $\vee$ Overcast $\vee$ (Rain $\wedge$ Weak)
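The equivalence between the tree and its disjunctive normal form can be checked by brute force. The sketch below is only an illustration: the function names and the string encodings of the attribute values are assumptions, not part of the slides.

```python
from itertools import product

def tree_play_tennis(outlook, humidity, wind):
    """Walk the decision tree: test Outlook first, then Humidity or Wind."""
    if outlook == "sunny":
        return humidity == "normal"
    if outlook == "overcast":
        return True
    # Remaining case: outlook == "rain"
    return wind == "weak"

def dnf_play_tennis(outlook, humidity, wind):
    """Same concept as a disjunction of conjunctions, one per 'Yes' leaf."""
    return (
        (outlook == "sunny" and humidity == "normal")
        or outlook == "overcast"
        or (outlook == "rain" and wind == "weak")
    )

# Enumerate every combination of attribute values (3 * 2 * 2 = 12 cases).
cases = list(product(["sunny", "overcast", "rain"],
                     ["high", "normal"],
                     ["weak", "strong"]))
agree = all(tree_play_tennis(*c) == dnf_play_tennis(*c) for c in cases)
print(agree)
```

Each conjunction corresponds to one root-to-leaf path that ends in Yes, which is why the two functions agree on all 12 cases.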

Example: Decision Tree for Continuous Valued Features and Discrete Output
Input: real-valued attributes (x1, x2). Classification output: 0 or 1.
How do we branch using attribute values x1 and x2 to partition the space correctly?
[Figure: labeled points in the (x1, x2) plane]

Example: Classification of Continuous Valued Inputs
[Figure: decision tree and the corresponding partition of the (x1, x2) plane; the values 3 and 4 are marked]

Expressiveness of Decision Trees
Decision trees can express any function of the input attributes.
E.g., for Boolean functions, each truth table row corresponds to a path to a leaf.
Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example, but it most likely won't generalize to new examples.
Prefer to find more compact decision trees.

Learning Decision Trees
Example: When should I wait for a table at a restaurant?
Attributes (features) relevant to the Wait? decision:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)

Example Decision Tree
A decision tree for Wait? based on personal "rules of thumb".
[Figure: decision tree]

Input Data for Learning
Past examples when I did/did not wait for a table.
Classification of examples is positive (T) or negative (F).
[Table of examples not reproduced]

Decision Tree Learning
Aim: find a small tree consistent with the training examples.
Idea: (recursively) choose the "most significant" attribute as the root of each (sub)tree.

Choosing an attribute to split on
Idea: a good attribute should reduce uncertainty, e.g., split the examples into subsets that are (ideally) "all positive" or "all negative".
Patrons? is a better choice than Type?: after splitting on Type?, to wait or not to wait is still at 50%.
[Figure: the examples split by Patrons? and by Type?]

How do we quantify uncertainty?

Using information theory to quantify uncertainty
Entropy measures the amount of uncertainty in a probability distribution.
Entropy (or Information Content) of an answer to a question with possible answers $v_1, \dots, v_n$:
$I(P(v_1), \dots, P(v_n)) = \sum_{i=1}^{n} -P(v_i) \log_2 P(v_i)$
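As a minimal sketch, the entropy formula translates directly into a few lines of Python. The function name `entropy` and the list-of-probabilities input are illustrative choices; skipping zero-probability terms follows the usual convention that $0 \log_2 0 = 0$.

```python
import math

def entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    # Terms with probability 0 are skipped: the limit of -p*log2(p) as p -> 0 is 0.
    return sum(-p * math.log2(p) for p in probs if p > 0)

fair_coin = entropy([0.5, 0.5])      # maximal uncertainty for two outcomes
certain = entropy([1.0, 0.0])        # no uncertainty at all
four_way = entropy([0.25] * 4)       # four equally likely answers
print(fair_coin, certain, four_way)
```

A fair two-way question carries 1 bit, a certain answer carries 0 bits, and four equally likely answers carry 2 bits.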

Using information theory
Imagine we have p examples with Wait = true (positive) and n examples with Wait = false (negative).
Our best estimate of the probabilities of Wait = true or false is given by:
$P(\text{true}) \approx \frac{p}{p+n}, \quad P(\text{false}) \approx \frac{n}{p+n}$
Hence the entropy of Wait is given by:
$I\left(\frac{p}{p+n}, \frac{n}{p+n}\right) = -\frac{p}{p+n} \log_2 \frac{p}{p+n} - \frac{n}{p+n} \log_2 \frac{n}{p+n}$

Entropy
[Figure: entropy I (0 to 1.0) plotted against P(Wait = T) (0 to 1.00), with Wait = F at the left end and Wait = T at the right end]
Entropy is highest when uncertainty is greatest.

Choosing an attribute to split on
Idea: a good attribute should reduce uncertainty and result in a gain in information.
How much information do we gain if we disclose the value of some attribute?
Answer: uncertainty before − uncertainty after.

Back at the Restaurant
Before choosing an attribute:
Entropy $= -\frac{6}{12} \log_2 \frac{6}{12} - \frac{6}{12} \log_2 \frac{6}{12} = -\log_2 \frac{1}{2} = \log_2 2 = 1$ bit
There is 1 bit of information to be discovered.

Back at the Restaurant
If we choose Type: along branch "French" we have entropy = 1 bit; similarly for the others.
Information gain = 1 − 1 = 0 along any branch.
If we choose Patrons: in branches "None" and "Some", entropy = 0.
For "Full", entropy $= -\frac{2}{6} \log_2 \frac{2}{6} - \frac{4}{6} \log_2 \frac{4}{6} = 0.92$
Info gain = (1 − 0) or (1 − 0.92) bits > 0 in both cases.
So choosing Patrons gains more information!

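Entropy across branches
How do we combine the entropy of different branches?
Answer: compute the average entropy, weighting entropies according to the probabilities of the branches.
2/12 of the time we enter "None", so the weight for "None" = 1/6; "Some" has weight 4/12 = 1/3; "Full" has weight 6/12 = 1/2.
$\mathrm{AvgEntropy}(A) = \sum_{i} \frac{p_i + n_i}{p + n} \, \mathrm{Entropy}\left(\frac{p_i}{p_i + n_i}, \frac{n_i}{p_i + n_i}\right)$
Here i ranges over the values (branches) of A, the fraction $\frac{p_i + n_i}{p + n}$ is the weight for each branch, and the Entropy term is the entropy for each branch.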

Information gain
Information Gain (IG), or reduction in entropy, from using attribute A:
IG(A) = Entropy before − AvgEntropy after choosing A
Choose the attribute with the largest IG.

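Information gain in our example
$IG(\mathrm{Patrons}) = 1 - \left[\frac{2}{12} I(0,1) + \frac{4}{12} I(1,0) + \frac{6}{12} I\left(\frac{2}{6}, \frac{4}{6}\right)\right] = 0.541$ bits
$IG(\mathrm{Type}) = 1 - \left[\frac{2}{12} I\left(\frac{1}{2}, \frac{1}{2}\right) + \frac{2}{12} I\left(\frac{1}{2}, \frac{1}{2}\right) + \frac{4}{12} I\left(\frac{2}{4}, \frac{2}{4}\right) + \frac{4}{12} I\left(\frac{2}{4}, \frac{2}{4}\right)\right] = 0$ bits
Patrons has the highest IG of all attributes, so the DTL algorithm chooses Patrons as the root.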
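The two information-gain values can be reproduced numerically. This is a sketch under the assumption that each attribute is summarized by (positives, negatives) counts per branch, read off the weights and entropy arguments in the formulas; the helper names are illustrative.

```python
import math

def entropy(pos, neg):
    """Entropy in bits of a set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    return sum(-c / total * math.log2(c / total) for c in (pos, neg) if c > 0)

def information_gain(branches):
    """branches: list of (positives, negatives) counts, one pair per attribute value."""
    p = sum(pos for pos, _ in branches)
    n = sum(neg for _, neg in branches)
    # Weighted average of the branch entropies (AvgEntropy after the split).
    avg = sum((pi + ni) / (p + n) * entropy(pi, ni) for pi, ni in branches)
    return entropy(p, n) - avg

# Patrons: None (0+, 2-), Some (4+, 0-), Full (2+, 4-)
ig_patrons = information_gain([(0, 2), (4, 0), (2, 4)])
# Type: four branches with weights 2/12, 2/12, 4/12, 4/12, each split 50/50
ig_type = information_gain([(1, 1), (1, 1), (2, 2), (2, 2)])
print(round(ig_patrons, 3))
```

Patrons comes out at about 0.541 bits and Type at 0 bits (up to floating-point rounding), matching the values above.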

Should I stay or should I go? Learned Decision Tree
Decision tree learned from the 12 examples:
[Figure: learned decision tree]
Substantially simpler than the "rules-of-thumb" tree: a more complex hypothesis is not justified by the small amount of data.

Performance Evaluation
How do we know that the learned tree h ≈ f?
Answer: try h on a new test set of examples.
Learning curve = % correct on the test set as a function of training set size.

Overfitting
[Figure: accuracy (0.6 to 0.9) versus number of nodes in the decision tree, with one curve for training data and one for test data]

Rule #2 of Machine Learning
The best hypothesis almost never achieves 100% accuracy on the training data.
(Rule #1 was: you can't learn anything without inductive bias.)

Avoiding overfitting
Stop growing when a data split is not statistically significant, or grow the full tree and then prune.
How to select the best tree?
  Measure performance over the training data
  Measure performance over a separate validation set
  Add a complexity penalty to the performance measure

Early Stopping
[Figure: accuracy (0.6 to 0.9) versus number of nodes in the decision tree, with curves for training, validation, and test data]
Remember this tree (the one marked in the figure) and use it as the final classifier.

Reduced Error Pruning
Split data into train and validation sets.
Repeat until pruning is harmful:
  Remove each subtree in turn, replace it with the majority class, and evaluate on the validation set
  Remove the subtree that leads to the largest gain in accuracy
[Figure: data blocks labeled Tune, Tune, Tune, Test]

Reduced Error Pruning Example

Outlook
  Sunny: Humidity
    High: Don't play
    Low: Play
  Overcast: Play
  Rain: Wind
    Strong: Don't play
    Weak: Play

Validation set accuracy = 0.75

Reduced Error Pruning Example

Outlook
  Sunny: Don't play
  Overcast: Play
  Rain: Wind
    Strong: Don't play
    Weak: Play

Validation set accuracy = 0.80

Reduced Error Pruning Example

Outlook
  Sunny: Humidity
    High: Don't play
    Low: Play
  Overcast: Play
  Rain: Play

Validation set accuracy = 0.70

Reduced Error Pruning Example

Outlook
  Sunny: Don't play
  Overcast: Play
  Rain: Wind
    Strong: Don't play
    Weak: Play

Use this as the final tree.

Scaling Up
ID3 and C4.5 assume the data fits in main memory (ok for 100,000s of examples)
SPRINT, SLIQ: multiple sequential scans of the data (ok for millions of examples)
VFDT: at most one sequential scan (ok for billions of examples)

Decision Trees Strengths
Very popular technique
Fast
Useful when:
  Target function is discrete
  Concepts are likely to be disjunctions
  Attributes may be noisy

Decision Trees Weaknesses
Less useful for continuous outputs.
Can have difficulty with continuous input features as well.
E.g., what if your target concept is a circle in the x1, x2 plane?
  Hard to represent with decision trees
  Very simple with instance-based methods we'll discuss later

Supervised Learning (contd) Linear Separation. Mausam (based on slides by UW-AI faculty)

Images as Vectors
Binary handwritten characters; greyscale images.
Treat an image as a high-dimensional vector (e.g., by reading pixel values left to right, top to bottom row):
$I = (p_1, p_2, \dots, p_N)^T$
Pixel value $p_i$ can be 0 or 1 (binary image) or 0 to 255 (greyscale).

The human brain is extremely good at classifying images. Can we develop classification methods by emulating the brain?

Brain Computer: What is it?
The human brain contains a massively interconnected net of $10^{10}$ to $10^{11}$ (10 to 100 billion) neurons (cortical cells).
Biological neuron: the simple "arithmetic computing" element.

Biological Neurons
1. The soma, or cell body, is a large, round central body in which almost all the logical functions of the neuron are realized.
2. The axon (output) is a nerve fibre attached to the soma which can serve as a final output channel of the neuron. An axon is usually highly branched.
3. The dendrites (inputs) represent a highly branching tree of fibres. These long, irregularly shaped nerve fibres (processes) are attached to the soma.
4. Synapses are specialized contacts on a neuron which are the termination points for the axons from other neurons.
[Figure: the schematic model of a biological neuron, labeling synapses, soma, dendrites, axon, and the axon and dendrite from other neurons]

Neurons communicate via spikes
[Figure: inputs arriving at a neuron and its output spike (electrical pulse)]
The output spike is roughly dependent on whether the sum of all inputs reaches a threshold.

Neurons as Threshold Units
Artificial neuron: m binary inputs (-1 or 1), 1 output (-1 or 1)
Inputs $u_j$ (-1 or +1), synaptic weights $w_{ji}$, threshold $\mu_i$, output $v_i$ (-1 or +1):
$v_i = \Theta\left(\sum_j w_{ji} u_j - \mu_i\right)$
$\Theta(x) = 1$ if $x > 0$ and $-1$ if $x \le 0$

Perceptrons for Classification
Fancy name for a type of layered "feed-forward" network (no loops).
Uses artificial neurons ("units") with binary inputs and outputs.
[Figure: a single-layer network and a multilayer network]

Perceptrons and Classification
Consider a single-layer perceptron.
The weighted sum forms a linear hyperplane:
$\sum_j w_{ji} u_j - \mu_i = 0$
Everything on one side of this hyperplane is in class 1 (output = +1) and everything on the other side is in class 2 (output = -1).
Any function that is linearly separable can be computed by a perceptron.

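Linear Separability
Example: AND is linearly separable

u1   u2   AND
-1   -1   -1
 1   -1   -1
-1    1   -1
 1    1    1

Linear hyperplane with weights 1, 1 and threshold $\mu = 1.5$: v = 1 iff $u_1 + u_2 - 1.5 > 0$
[Figure: the four input points in the (u1, u2) plane, with the line separating (1,1) from the rest]
Similarly for OR and NOT.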
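A threshold unit for AND can be written in a few lines. In the sketch below the AND weights and threshold (1, 1 and 1.5) come from the example above; the weights chosen for OR and NOT are assumptions (one of many valid choices), and all function names are illustrative.

```python
def theta(x):
    """Threshold function: +1 if x > 0, otherwise -1."""
    return 1 if x > 0 else -1

def unit(weights, threshold, inputs):
    """Artificial neuron: weighted sum of the inputs minus the threshold, then theta."""
    return theta(sum(w * u for w, u in zip(weights, inputs)) - threshold)

def and_gate(u1, u2):
    return unit([1, 1], 1.5, [u1, u2])    # v = 1 iff u1 + u2 - 1.5 > 0

def or_gate(u1, u2):
    return unit([1, 1], -1.5, [u1, u2])   # assumed threshold: fires unless both are -1

def not_gate(u):
    return unit([-1], 0, [u])             # assumed weight: flips the sign of the input

for u1 in (-1, 1):
    for u2 in (-1, 1):
        print(u1, u2, and_gate(u1, u2), or_gate(u1, u2))
```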

How do we learn the appropriate weights given only examples of (input, output)? Idea: change the weights to decrease the error in the output.

Perceptron Training Rule
[Slide content not recovered]
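Since the rule itself did not survive on this slide, the following is only a sketch of one common form of the perceptron rule, which may differ from the slide's exact formulation: for each misclassified example, move each weight by $\eta (t - v) u_j$ and the threshold by $-\eta (t - v)$, where t is the target and v the current output. The learning rate `eta`, the epoch cap, and the function names are assumptions.

```python
def theta(x):
    """Threshold function: +1 if x > 0, otherwise -1."""
    return 1 if x > 0 else -1

def predict(w, mu, u):
    return theta(sum(wi * ui for wi, ui in zip(w, u)) - mu)

def train_perceptron(examples, eta=0.5, max_epochs=100):
    """examples: list of (inputs, target) with inputs and targets in {-1, +1}."""
    w = [0.0] * len(examples[0][0])
    mu = 0.0
    for _ in range(max_epochs):
        errors = 0
        for u, target in examples:
            output = predict(w, mu, u)
            if output != target:
                errors += 1
                step = eta * (target - output)   # +/- 2*eta on a mistake
                w = [wi + step * ui for wi, ui in zip(w, u)]
                mu -= step                        # threshold moves the opposite way
        if errors == 0:                           # every example classified correctly
            break
    return w, mu

and_examples = [((-1, -1), -1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]
w, mu = train_perceptron(and_examples)
learned = [predict(w, mu, u) for u, _ in and_examples]
print(w, mu, learned)
```

On linearly separable data such as AND the loop stops once an epoch passes without mistakes; on data that is not linearly separable it simply runs until `max_epochs`.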

What about the XOR function?

u1   u2   XOR
-1   -1   -1
 1   -1    1
-1    1    1
 1    1   -1

[Figure: the four input points in the (u1, u2) plane, labeled with their outputs]
Can a perceptron separate the +1 outputs from the -1 outputs?

Linear Inseparability
A perceptron with threshold units fails if the classification task is not linearly separable.
Example: XOR. No single line can separate the "yes" (+1) outputs from the "no" (-1) outputs!
Minsky and Papert's book showing such negative results put a damper on neural networks research for over a decade!
[Figure: the XOR points in the (u1, u2) plane, with a crossed-out separating line]

How do we deal with linear inseparability?

Idea 1: Multilayer Perceptrons
Removes limitations of single-layer networks: can solve XOR.
Example: two-layer perceptron that computes XOR of inputs x and y.
Output is +1 if and only if $x + y - 2\Theta(x + y - 1.5) - 0.5 > 0$
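The XOR condition can be checked directly. A minimal sketch, assuming inputs in {-1, +1} and the threshold function $\Theta$ defined earlier (+1 if its argument is positive, -1 otherwise); the function names are illustrative.

```python
def theta(x):
    """Threshold function: +1 if x > 0, otherwise -1."""
    return 1 if x > 0 else -1

def xor_net(x, y):
    """Two-layer perceptron: one hidden unit feeding one output unit."""
    hidden = theta(x + y - 1.5)             # +1 only when both inputs are +1
    return theta(x + y - 2 * hidden - 0.5)  # output unit also sees x and y directly

table = {(x, y): xor_net(x, y) for x in (-1, 1) for y in (-1, 1)}
print(table)
```

The hidden unit detects the (1, 1) corner and its weight of -2 pulls the output back below threshold there, which is exactly the case a single line cannot handle.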

Multilayer Perceptron: What does it do?
[Figure sequence: a two-layer network with inputs x and y and output "out", shown next to the (x, y) plane; each hidden unit contributes a line that splits the plane into a region labeled = 1 and a region labeled = -1, and the output unit combines these regions]

Perceptrons as Constraint Satisfaction Networks
[Figure: the same two-layer network and its regions in the (x, y) plane, each hidden unit acting as one linear constraint]

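Artificial Neuron: Most Popular Activation Functions
Linear activation: $\varphi(z) = z$
Logistic activation: $\varphi(z) = \frac{1}{1 + e^{-z}}$
Threshold activation: $\varphi(z) = \mathrm{sign}(z) = 1$ if $z \ge 0$, $-1$ if $z < 0$
Hyperbolic tangent activation: $\varphi(u) = \tanh(u) = \frac{1 - e^{-2u}}{1 + e^{-2u}}$
[Figure: plots of the four activation functions]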
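For reference, the four activation functions in Python. This is a sketch: the function names are illustrative, and the threshold activation returns +1 at z = 0, which is an assumption about the boundary case.

```python
import math

def linear(z):
    return z

def logistic(z):
    """Squashes any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def threshold(z):
    """Sign-style activation; returning +1 at z == 0 is an assumed convention."""
    return 1 if z >= 0 else -1

def tanh_from_exp(u):
    """tanh written with exponentials: (1 - e^(-2u)) / (1 + e^(-2u))."""
    return (1 - math.exp(-2 * u)) / (1 + math.exp(-2 * u))

for z in (-2.0, 0.0, 2.0):
    print(z, linear(z), logistic(z), threshold(z), tanh_from_exp(z))
```

The exponential form of tanh agrees with `math.tanh`, and the logistic function is a rescaled, shifted tanh: logistic(z) = (1 + tanh(z/2)) / 2.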

Neural Network Issues
Multi-layer perceptrons can represent any function
Training multi-layer perceptrons is hard
  Backpropagation
Early successes
  Keeping the car on the road
Difficult to debug
Opaque

Back to Linear Separability
Recall: the weighted sum in a perceptron forms a linear hyperplane
$\sum_i w_i x_i + b = 0$
Due to the threshold function, everything on one side of this hyperplane is labeled as class 1 (output = +1) and everything on the other side is labeled as class 2 (output = -1).

Separating Hyperplane
$\sum_i w_i x_i + b = 0$
[Figure: points of Class 1 (+1 output) and Class 2 (-1 output) separated by the hyperplane]
Need to choose w and b based on training data.

Separating Hyperplanes
Different choices of w and b give different hyperplanes.
[Figure: several hyperplanes separating Class 1 (+1 output) from Class 2 (-1 output)]
(This and the next few slides adapted from Andrew Moore's)

Which hyperplane is best?
[Figure: candidate hyperplanes separating Class 1 (+1 output) from Class 2 (-1 output)]

How about the one right in the middle?
Intuitively, this boundary seems good: it avoids misclassification of new test points if they are generated from the same distribution as the training points.

Margin
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

Maximum Margin and Support Vector Machine
Support vectors are those datapoints that the margin pushes up against.
The maximum margin classifier is called a Support Vector Machine (in this case, a Linear SVM or LSVM).

Why Maximum Margin?
Robust to small perturbations of data points near the boundary
There exists theory showing this is best for generalization to new points
Empirically works great

What if data is not linearly separable? Outliers (due to noise)

Approach 1: Soft Margin SVMs
Allow errors $\xi_i$ (deviations from the margin).
Trade off margin with errors.
Minimize: 1/margin + error-penalty

What if data is not linearly separable: Other ideas?
[Figure: a data set that is not linearly separable]

What if data is not linearly separable?
Approach 2: Map the original input space to a higher-dimensional feature space; use a linear classifier in the higher-dimensional space.
$x \rightarrow \varphi(x)$
Kernel: additional bias to convert into high-dimensional space

Problem with high dimensional spaces
$x \rightarrow \varphi(x)$
Computation in a high-dimensional feature space can be costly.
The high-dimensional projection function $\varphi(x)$ may be too complicated to compute.
Kernel trick to the rescue!

The Kernel Trick
Dual formulation: the SVM maximizes the quadratic function
$\sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$ subject to $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$
Insight: the data points only appear as inner products.
No need to compute the high-dimensional $\varphi(x)$ explicitly! Just replace the inner product $x_i \cdot x_j$ with a kernel function $K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j)$
E.g., Gaussian kernel: $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$
E.g., Polynomial kernel: $K(x_i, x_j) = (x_i \cdot x_j + 1)^d$
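To see that a kernel really is an inner product in a higher-dimensional space, the sketch below compares the degree-2 polynomial kernel on 2-D inputs with an explicit 6-dimensional feature map. The map `phi` is one standard choice that reproduces this kernel and is an assumption here, as are the function names and the sample points.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def poly_kernel(x, z, d=2):
    """Polynomial kernel (x . z + 1)^d, computed in the original input space."""
    return (dot(x, z) + 1) ** d

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian kernel exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D inputs."""
    x1, x2 = x
    r2 = math.sqrt(2)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]

x, z = [1.0, 2.0], [3.0, -1.0]
k_implicit = poly_kernel(x, z)      # one inner product in 2-D, then a square
k_explicit = dot(phi(x), phi(z))    # inner product of the 6-D feature vectors
print(k_implicit, k_explicit, gaussian_kernel(x, z))
```

Both routes give the same number, but the kernel never builds the 6-D vectors; for higher degrees and dimensions that saving grows quickly.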

K-Nearest Neighbors
A simple non-parametric classification algorithm.
Idea: look around you to see how your neighbors classify data.
Classify a new data point according to a majority vote of its k nearest neighbors.

Distance Metric
How do we measure what it means to be a neighbor (what is "close")?
The appropriate distance metric depends on the problem.
Examples:
  x discrete (e.g., strings): Hamming distance
    $d(x_1, x_2)$ = number of features on which $x_1$ and $x_2$ differ
  x continuous (e.g., vectors over reals): Euclidean distance
    $d(x_1, x_2) = \|x_1 - x_2\|$ = square root of the sum of squared differences between corresponding elements of the data vectors
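Both distance metrics take only a line or two each. A minimal sketch, assuming the two inputs have equal length; the function names and sample inputs are illustrative.

```python
import math

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(ai != bi for ai, bi in zip(a, b))

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

print(hamming("karolin", "kathrin"))
print(euclidean([0.0, 0.0], [3.0, 4.0]))
```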

Example
Input data: 2-D points $(x_1, x_2)$
Two classes: $C_1$ and $C_2$. New data point: +
K = 4: look at the 4 nearest neighbors of +
3 are in $C_1$, so classify + as $C_1$
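A k-nearest-neighbor classifier along these lines might look as follows. This is a sketch, not the slides' implementation: the training points and the query are made-up data chosen so that 3 of the 4 nearest neighbors are in C1, and `knn_classify` is an illustrative name. It uses Euclidean distance via `math.dist` (Python 3.8+).

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """train: list of (point, label). Returns the majority label of the k nearest points."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D data: three C1 points near the query, C2 points further away.
train = [
    ((1, 1), "C1"), ((1, 2), "C1"), ((2, 1), "C1"),
    ((3, 3), "C2"), ((5, 5), "C2"), ((6, 5), "C2"), ((6, 6), "C2"),
]
query = (2, 2)
label = knn_classify(train, query, k=4)
print(label)
```

With k = 4 the neighbors are three C1 points and one C2 point, so the vote goes to C1; with k = 7 every training point votes and C2 wins 4 to 3, which shows how the choice of k changes the decision.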

Decision Boundary using K-NN
Some points near the boundary may be misclassified (but they may be noise).

What if we want to learn continuous-valued functions?
[Figure: output plotted against input]

Regression
K-nearest neighbor: take the average of the k closest points
Linear/non-linear regression: fit parameters (gradient descent), minimizing the regression error/loss
Neural networks: remove the threshold function

Large Feature Spaces
Easy to overfit
Regularization: add a penalty for large weights
  prefer weights that are zero or close to zero
  minimize: regression error + C · (regularization penalty)

Regularizations
Shape: L1: diamond; L2: circle
Derivatives: L1: constant; L2: high for large weights
L1 is harder to optimize, but not too hard: its derivative is discontinuous, but the penalty is convex.

L1 vs. L2
[Figure not reproduced]

Ensemble Classifiers. Mausam (based on slides of Dan Weld)

Ensembles of Classifiers
Traditional approach: use one classifier
Alternative approach: use lots of classifiers
Approaches:
  Cross-validated committees
  Bagging
  Boosting
  Stacking
Daniel S. Weld

Ensembles of Classifiers
Assume errors are independent (suppose 30% error) and take a majority vote.
Probability that the majority is wrong = area under the binomial distribution.
[Figure: binomial distribution; probability (0.1, 0.2) versus number of classifiers in error]
If the individual error is 0.3, the area under the curve for 11 or more wrong is 0.026.
Order of magnitude improvement!
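The 0.026 figure can be reproduced with a binomial tail sum. The slide does not state the ensemble size; the sketch below assumes 21 classifiers, so that 11 wrong votes is the smallest wrong majority. The function name is illustrative, and `math.comb` requires Python 3.8+.

```python
from math import comb

def majority_error(n_classifiers, error_rate):
    """P(more than half of n independent classifiers are wrong at the same time)."""
    need = n_classifiers // 2 + 1   # smallest number of wrong votes that wins the majority
    return sum(
        comb(n_classifiers, k) * error_rate ** k * (1 - error_rate) ** (n_classifiers - k)
        for k in range(need, n_classifiers + 1)
    )

# Assumed ensemble size: 21 classifiers, each with 30% error.
p_wrong = majority_error(21, 0.3)
print(round(p_wrong, 3))
```

The result depends entirely on the independence assumption; classifiers trained on the same data usually make correlated errors, so the improvement in practice is smaller.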

Voting

Constructing Ensembles
Holdout
Cross-validated committees:
  Partition the examples into k disjoint equivalence classes
  Now create k training sets
    Each set is the union of all equivalence classes except one
    So each set has (k-1)/k of the original training data
  Now train a classifier on each set

Ensemble Construction II
Bagging
Generate k sets of training examples.
For each set: draw m examples randomly (with replacement) from the original set of m examples.
Each training set contains about 63.2% of the original examples (+ duplicates).
Now train a classifier on each set.
Intuition: sampling helps the algorithm become more robust to noise/outliers in the data.
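The 63.2% figure follows from the chance that a given example is never drawn in m draws with replacement, $(1 - 1/m)^m \approx e^{-1} \approx 0.368$. The sketch below checks this both analytically and by drawing one bootstrap sample; the function names, the seed, and the sample size are illustrative assumptions.

```python
import random

def expected_unique_fraction(m):
    """Expected fraction of distinct originals in a bootstrap sample of size m."""
    return 1 - (1 - 1 / m) ** m

def bootstrap_sample(examples, rng):
    """Draw len(examples) items uniformly at random, with replacement."""
    return [rng.choice(examples) for _ in examples]

rng = random.Random(0)           # fixed seed so the run is repeatable
data = list(range(10000))        # stand-in for m = 10000 training examples
sample = bootstrap_sample(data, rng)
observed = len(set(sample)) / len(data)
print(round(expected_unique_fraction(len(data)), 4), round(observed, 4))
```

Bagging repeats this sampling k times and trains one classifier per sample.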

Ensemble Creation III
Boosting
Maintain a probability distribution over the set of training examples.
Create k sets of training data iteratively. On iteration i:
  Draw m examples randomly (like bagging), but use the probability distribution to bias the selection
  Train classifier number i on this training set
  Test the partial ensemble (of i classifiers) on all training examples
  Modify the distribution: increase the probability of each error example

Ensemble Creation IV
Stacking
Train several base learners.
Next train a meta-learner:
  Learns when the base learners are right / wrong
  Now the meta-learner arbitrates
Train using cross-validated committees:
  Meta-learner inputs = base learner predictions
  Training examples = "test set" from cross validation

Why do ensembles work?
Statistical: search through hypothesis space; averaging reduces the risk of picking the wrong classifier
Computational: intractable to get the best hypothesis
Representational: increases the representable hypotheses

Example: Random Forests
Create k decision trees.
For each decision tree:
  Pick training data as in bagging
  Randomly sample f features in the data
  Construct the best tree based only on these features
Voting for the final prediction.
Advantages: efficient, highly accurate, handles thousands of variables