Supervised Learning (contd) Decision Trees. Mausam (based on slides by UW-AI faculty)


Decision Trees: To play or not to play?

Example data for learning the concept "Good day for tennis":

Day  Outlook  Humid  Wind  PlayTennis?
d1   s        h      w     n
d2   s        h      s     n
d3   o        h      w     y
d4   r        h      w     y
d5   r        n      w     y
d6   r        n      s     y
d7   o        n      s     y
d8   s        h      w     n
d9   s        n      w     y
d10  r        n      w     y
d11  s        n      s     y
d12  o        h      s     y
d13  o        n      w     y
d14  r        h      s     n

Outlook = sunny, overcast, rain; Humidity = high, normal; Wind = weak, strong.

A Decision Tree for the Same Data: a decision tree for PlayTennis? Leaves = classification output; arcs = choice of value for the parent attribute.
Outlook = Sunny -> Humidity: High -> No, Normal -> Yes
Outlook = Overcast -> Yes
Outlook = Rain -> Wind: Strong -> No, Weak -> Yes
The decision tree is equivalent to logic in disjunctive normal form: PlayTennis <=> (Sunny ∧ Normal) ∨ Overcast ∨ (Rain ∧ Weak)
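Read as code, this tree is just a set of nested if/else tests. A minimal sketch (the function name play_tennis and the lowercase value strings are illustrative, matching the attribute codes in the table above):

```python
def play_tennis(outlook, humidity, wind):
    """The PlayTennis tree above as nested if/else tests.
    outlook in {'sunny', 'overcast', 'rain'}, humidity in {'high', 'normal'},
    wind in {'weak', 'strong'}; returns True for Yes, False for No."""
    if outlook == 'sunny':
        return humidity == 'normal'   # Sunny branch: play only if humidity is normal
    if outlook == 'overcast':
        return True                   # Overcast branch: always play
    return wind == 'weak'             # Rain branch: play only if the wind is weak

print(play_tennis('sunny', 'high', 'weak'))   # False, matching example d1 in the table
```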

Example: Decision Tree for Continuous-Valued Features and Discrete Output. Input: real-valued attributes (x1, x2); classification output: 0 or 1. How do we branch on values of x1 and x2 to partition the space correctly? (figure: labeled points in the x1-x2 plane)

Example: Classification of Continuous-Valued Inputs. (figure: axis-aligned thresholds on x1 and x2 partition the plane, and the corresponding decision tree)

Expressiveness of Decision Trees. Decision trees can express any function of the input attributes; e.g., for Boolean functions, each truth-table row corresponds to a path to a leaf. Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example, but it most likely won't generalize to new examples. Prefer to find more compact decision trees.

Learning Decision Trees. Example: When should I wait for a table at a restaurant? Attributes (features) relevant to the Wait? decision:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)

Example Decision Tree. A decision tree for Wait? based on personal rules of thumb. (figure: hand-built decision tree)

Input Data for Learning. Past examples of when I did / did not wait for a table. The classification of each example is positive (T) or negative (F). (figure: table of training examples)

Decision Tree Learning. Aim: find a small tree consistent with the training examples. Idea: (recursively) choose the "most significant" attribute as the root of each (sub)tree.
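A minimal sketch of that recursive idea in Python (not the exact DTL pseudocode from the slides; examples are assumed to be (attribute-dict, label) pairs, and "most significant" is taken to mean highest information gain, which the next slides define):

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    return -sum((c / len(labels)) * math.log2(c / len(labels)) for c in counts.values())

def info_gain(examples, attr):
    labels = [y for _, y in examples]
    after = 0.0
    for value in {x[attr] for x, _ in examples}:
        subset = [y for x, y in examples if x[attr] == value]
        after += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - after

def dtl(examples, attributes, default=None):
    """Recursively pick the most significant attribute, split on it, and build subtrees."""
    if not examples:
        return default                                    # no data: fall back to parent majority
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:
        return labels[0]                                  # pure node: return the class
    if not attributes:
        return Counter(labels).most_common(1)[0][0]       # no attributes left: majority class
    best = max(attributes, key=lambda a: info_gain(examples, a))
    majority = Counter(labels).most_common(1)[0][0]
    children = {}
    for value in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == value]
        children[value] = dtl(subset, [a for a in attributes if a != best], majority)
    return (best, children)                               # internal node: (attribute, {value: subtree})
```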

Choosing an attribute to split on. Idea: a good attribute should reduce uncertainty, e.g., split the examples into subsets that are (ideally) "all positive" or "all negative". Patrons? is a better choice; for Type?, to wait or not to wait is still at 50%.

How do we quantify uncertainty?

Using information theory to quantify uncertainty. Entropy measures the amount of uncertainty in a probability distribution. The entropy (or information content) of an answer to a question with possible answers v_1, ..., v_n is

I(P(v_1), ..., P(v_n)) = \sum_{i=1}^{n} -P(v_i) \log_2 P(v_i)

Using information theory. Imagine we have p examples with Wait = true (positive) and n examples with Wait = false (negative). Our best estimates of the probabilities of Wait = true and Wait = false are

P(true) = p / (p + n),    P(false) = n / (p + n)

Hence the entropy of Wait is given by

I\left(\frac{p}{p+n}, \frac{n}{p+n}\right) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}
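A small sketch of that computation in Python (the function name entropy is illustrative):

```python
import math

def entropy(probs):
    """Entropy in bits of a discrete distribution, I(P(v1), ..., P(vn))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

p, n = 6, 6                                   # the restaurant data: 6 positive, 6 negative examples
print(entropy([p / (p + n), n / (p + n)]))    # 1.0 bit
```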

Entropy. (figure: the entropy I of Wait plotted against P(Wait = T); it is 0 at P = 0 and P = 1 and peaks at 1.0 when P = 0.5.) Entropy is highest when uncertainty is greatest.

Choosing an attribute to split on. Idea: a good attribute should reduce uncertainty and result in a gain in information. How much information do we gain if we disclose the value of some attribute? Answer: uncertainty before minus uncertainty after.

Back at the Restaurant. Before choosing an attribute: Entropy = -6/12 log2(6/12) - 6/12 log2(6/12) = -log2(1/2) = log2(2) = 1 bit. There is 1 bit of information to be discovered.

Back at the Restaurant. If we choose Type: along the branch French we have entropy = 1 bit, and similarly for the other branches, so the information gain is 1 - 1 = 0 along every branch. If we choose Patrons: in the branches None and Some, entropy = 0; for Full, entropy = -2/6 log2(2/6) - 4/6 log2(4/6) = 0.92. The information gain is (1 - 0) or (1 - 0.92) bits, > 0 in both cases. So choosing Patrons gains more information!

Entropy across branches. How do we combine the entropy of different branches? Answer: compute the average entropy, weighting each branch's entropy by the probability of entering that branch. We enter None 2/12 of the time, so its weight is 1/6; Some has weight 4/12 = 1/3; Full has weight 6/12 = 1/2.

AvgEntropy(A) = \sum_{i=1}^{n} \frac{p_i + n_i}{p + n} \, I\left(\frac{p_i}{p_i + n_i}, \frac{n_i}{p_i + n_i}\right)

where (p_i + n_i)/(p + n) is the weight of branch i and I(...) is the entropy of branch i.

Information gain. The information gain (IG), or reduction in entropy, from using attribute A is

IG(A) = Entropy before - AvgEntropy after choosing A

Choose the attribute with the largest IG.

Information gain in our example:

IG(Patrons) = 1 - [ 2/12 I(0,1) + 4/12 I(1,0) + 6/12 I(2/6, 4/6) ] = 0.541 bits
IG(Type) = 1 - [ 2/12 I(1/2, 1/2) + 2/12 I(1/2, 1/2) + 4/12 I(2/4, 2/4) + 4/12 I(2/4, 2/4) ] = 0 bits

Patrons has the highest IG of all attributes, so the DTL algorithm chooses Patrons as the root.
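These two numbers can be checked with a few lines of Python (a sketch; the helper names are illustrative):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def info_gain(parent, branches):
    """parent = (p, n) counts at the node; branches = list of (p_i, n_i) counts per branch."""
    p, n = parent
    before = entropy([p / (p + n), n / (p + n)])
    after = sum((pi + ni) / (p + n) * entropy([pi / (pi + ni), ni / (pi + ni)])
                for pi, ni in branches)
    return before - after

print(info_gain((6, 6), [(0, 2), (4, 0), (2, 4)]))            # Patrons: ~0.541 bits
print(info_gain((6, 6), [(1, 1), (1, 1), (2, 2), (2, 2)]))    # Type: 0.0 bits
```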

Should I stay or should I go? Learned Decision Tree. The decision tree learned from the 12 examples is substantially simpler than the rules-of-thumb tree: a more complex hypothesis is not justified by such a small amount of data.

Performance Evaluation. How do we know that the learned tree h approximates the target function f? Answer: try h on a new test set of examples. Learning curve = % correct on the test set as a function of training-set size.

Overfitting. (figure: accuracy vs. number of nodes in the decision tree; accuracy on the training data keeps increasing, while accuracy on the test data peaks and then falls.)


Rule #2 of Machine Learning: the best hypothesis almost never achieves 100% accuracy on the training data. (Rule #1 was: you can't learn anything without inductive bias.)

Avoiding overfitting:
- Stop growing the tree when a data split is not statistically significant, or
- Grow the full tree and then prune it.
How to select the best tree?
- Measure performance over the training data
- Measure performance over a separate validation set
- Add a complexity penalty to the performance measure

Early Stopping. (figure: accuracy vs. number of nodes for training, validation, and test data; remember the tree at the peak of the validation-data curve and use it as the final classifier.)

Reduced Error Pruning. Split the data into training and validation sets. Repeat until further pruning is harmful:
- Remove each subtree in turn, replace it with the majority class, and evaluate on the validation set
- Permanently remove the subtree whose replacement leads to the largest gain in validation accuracy
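A sketch of a common bottom-up variant of this procedure, operating on the (attribute, {value: subtree}) representation from the dtl() sketch above (the names classify, accuracy, and reduced_error_prune are illustrative, and the slide's greedy repeat-until-harmful loop is folded into a single recursive pass):

```python
from collections import Counter

def classify(tree, x):
    """Walk the tree until a class label (a non-tuple leaf) is reached."""
    while isinstance(tree, tuple):
        attr, children = tree
        tree = children.get(x[attr])
    return tree

def accuracy(tree, examples):
    if not examples:
        return 1.0
    return sum(classify(tree, x) == y for x, y in examples) / len(examples)

def reduced_error_prune(tree, train, val):
    """Prune the children first, then replace this subtree with the majority class of the
    training examples reaching it if that does not hurt validation accuracy."""
    if not isinstance(tree, tuple):
        return tree                                        # already a leaf
    attr, children = tree
    pruned_children = {}
    for value, child in children.items():
        sub_train = [(x, y) for x, y in train if x[attr] == value]
        sub_val = [(x, y) for x, y in val if x[attr] == value]
        pruned_children[value] = reduced_error_prune(child, sub_train, sub_val)
    pruned = (attr, pruned_children)
    majority = Counter(y for _, y in train).most_common(1)[0][0] if train else None
    # A bare label acts as a leaf that predicts itself for every example.
    return majority if accuracy(majority, val) >= accuracy(pruned, val) else pruned
```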

Reduced Error Pruning Example. Full tree:
Outlook = Sunny -> Humidity: High -> Don't play, Low -> Play
Outlook = Overcast -> Play
Outlook = Rain -> Wind: Strong -> Don't play, Weak -> Play
Validation set accuracy = 0.75

Reduced Error Pruning Example. Replace the Humidity subtree with its majority class:
Outlook = Sunny -> Don't play
Outlook = Overcast -> Play
Outlook = Rain -> Wind: Strong -> Don't play, Weak -> Play
Validation set accuracy = 0.80

Reduced Error Pruning Example. Alternatively, replace the Wind subtree with its majority class:
Outlook = Sunny -> Humidity: High -> Don't play, Low -> Play
Outlook = Overcast -> Play
Outlook = Rain -> Play
Validation set accuracy = 0.70

Reduced Error Pruning Example. Pruning the Humidity subtree gives the larger accuracy gain, so use this as the final tree:
Outlook = Sunny -> Don't play
Outlook = Overcast -> Play
Outlook = Rain -> Wind: Strong -> Don't play, Weak -> Play

Scaling Up. ID3 and C4.5 assume the data fits in main memory (OK for hundreds of thousands of examples). SPRINT and SLIQ make multiple sequential scans of the data (OK for millions of examples). VFDT makes at most one sequential scan (OK for billions of examples).

Decision Trees: Strengths. A very popular technique; fast; useful when the target function is discrete, concepts are likely to be disjunctions, and attributes may be noisy.

Decision Trees: Weaknesses. Less useful for continuous outputs, and can have difficulty with continuous input features as well. E.g., what if your target concept is a circle in the (x1, x2) plane? That is hard to represent with decision trees, but very simple with the instance-based methods we'll discuss later.

Supervised Learning (contd) Linear Separation. Mausam (based on slides by UW-AI faculty)

Images as Vectors. Binary handwritten characters and greyscale images: treat an image as a high-dimensional vector (e.g., by reading pixel values left to right, top to bottom, row by row):

I = (p_1, p_2, ..., p_N)^T

Pixel value p_i can be 0 or 1 (binary image) or 0 to 255 (greyscale).

The human brain is extremely good at classifying images. Can we develop classification methods by emulating the brain?

Brain Computer: What is it? The human brain contains a massively interconnected net of 10^10-10^11 (10 billion) neurons (cortical cells). The biological neuron is the simple arithmetic computing element.


Biological Neurons
1. Soma (cell body): a large, round central body in which almost all the logical functions of the neuron are realized.
2. Axon (output): a nerve fibre attached to the soma that serves as the final output channel of the neuron. An axon is usually highly branched.
3. Dendrites (inputs): a highly branching tree of fibres; these long, irregularly shaped nerve fibres (processes) are attached to the soma.
4. Synapses: specialized contacts on a neuron that are the termination points for axons from other neurons.
(figure: schematic model of a biological neuron showing the soma, dendrites, axon, and synapses)

Neurons communicate via spikes (electrical pulses). Whether a neuron emits an output spike depends roughly on whether the sum of all its inputs reaches a threshold.

Neurons as Threshold Units. Artificial neuron: m binary inputs u_j (-1 or +1), one output v_i (-1 or +1), synaptic weights w_ji, and a threshold \mu_i. The output is a thresholded weighted sum:

v_i = \Theta\left(\sum_j w_{ji} u_j - \mu_i\right),   where \Theta(x) = 1 if x > 0 and -1 if x <= 0.

Perceptrons for Classification. A fancy name for a type of layered feed-forward network (no loops) built from artificial neurons ("units") with binary inputs and outputs. Perceptrons can be single-layer or multilayer.

Perceptrons and Classification. Consider a single-layer perceptron. The weighted sum forms a linear hyperplane

\sum_j w_{ji} u_j - \mu_i = 0

Everything on one side of this hyperplane is in class 1 (output = +1) and everything on the other side is in class 2 (output = -1). Any function that is linearly separable can be computed by a perceptron.

Linear Separability. Example: AND is linearly separable.

u1  u2  AND
-1  -1  -1
 1  -1  -1
-1   1  -1
 1   1   1

v = 1 iff u1 + u2 - 1.5 > 0, so a single unit with weights (1, 1) and threshold 1.5 computes AND; the line u1 + u2 = 1.5 separates (1, 1) from the other three points. Similarly for OR and NOT.
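A quick check of this in Python (the function name threshold_unit is illustrative):

```python
def threshold_unit(inputs, weights, threshold):
    """The artificial neuron from above: +1 if the weighted sum exceeds the threshold, else -1."""
    s = sum(w * u for w, u in zip(weights, inputs)) - threshold
    return 1 if s > 0 else -1

# AND with +/-1 encoding: weights (1, 1), threshold 1.5.
for u1 in (-1, 1):
    for u2 in (-1, 1):
        print(u1, u2, threshold_unit((u1, u2), (1, 1), 1.5))   # +1 only for (1, 1)
```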

How do we learn the appropriate weights given only examples of (input, output) pairs? Idea: change the weights to decrease the error in the output.

Perceptron Training Rule
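A standard statement of the rule (a well-known fact, not taken verbatim from the slide): each weight is nudged in proportion to the output error on the current example,

w_{ji} \leftarrow w_{ji} + \eta \,(t - v_i)\, u_j

where \eta is the learning rate, t is the target output, v_i is the perceptron's current output, and u_j is the j-th input. When the output is correct, t - v_i = 0 and the weights are unchanged.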

What about the XOR function?

u1  u2  XOR
-1  -1  -1
 1  -1   1
-1   1   1
 1   1  -1

The +1 outputs and the -1 outputs sit on opposite diagonals of the square of input points. Can a perceptron separate the +1 outputs from the -1 outputs?

Linear Inseparability. A perceptron with threshold units fails if the classification task is not linearly separable, e.g., XOR: no single line can separate the "yes" (+1) outputs from the "no" (-1) outputs! Minsky and Papert's book showing such negative results put a damper on neural network research for over a decade.

How do we deal with linear inseparability? 56

Idea 1: Multilayer Perceptrons. Removes the limitations of single-layer networks: can solve XOR. Example: a two-layer perceptron that computes XOR of x and y. The output is +1 if and only if

x + y - 2\,\Theta(x + y - 1.5) - 0.5 > 0

where \Theta is the threshold function defined earlier (the hidden unit \Theta(x + y - 1.5) computes AND of the inputs).
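A short check of that formula in Python (function names are illustrative):

```python
def theta(x):
    """Threshold function: +1 if x > 0, else -1."""
    return 1 if x > 0 else -1

def xor_net(x, y):
    hidden = theta(x + y - 1.5)              # AND of the two inputs (+1 only when both are +1)
    return theta(x + y - 2 * hidden - 0.5)   # +1 exactly when the inputs differ

for x in (-1, 1):
    for y in (-1, 1):
        print(x, y, xor_net(x, y))
```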

Multilayer Perceptron: What does it do? (figure sequence: each hidden threshold unit defines a half-plane in the x-y plane, bounded by a line such as x + y - 2 = 0; the output unit combines the hidden units, so in this example the two-layer network outputs +1 exactly where both half-plane constraints hold, i.e., in the band between the two lines, and -1 elsewhere.)

Perceptrons as Constraint Satisfaction Networks. (figure: the same two-layer network redrawn so that each hidden unit enforces one linear inequality on x and y, and the output unit requires both constraints to hold.)

Artificial Neuron: Most Popular Activation Functions
Linear activation: f(z) = z
Logistic activation: f(z) = \frac{1}{1 + e^{-z}}
Threshold activation: sign(z) = 1 if z >= 0, -1 if z < 0
Hyperbolic tangent activation: \tanh(u/2) = \frac{1 - e^{-u}}{1 + e^{-u}}
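These four functions in NumPy, as a small sketch (the function names are illustrative):

```python
import numpy as np

def linear(z):
    return z

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def threshold(z):
    return np.where(z >= 0, 1.0, -1.0)

def tanh_half(u):
    return (1.0 - np.exp(-u)) / (1.0 + np.exp(-u))   # equals np.tanh(u / 2)

z = np.linspace(-3.0, 3.0, 7)
print(logistic(z))
print(threshold(z))
print(np.allclose(tanh_half(z), np.tanh(z / 2)))     # True
```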

Neural Network Issues. Multi-layer perceptrons can represent any function, but training them is hard (backpropagation). Early successes include keeping a car on the road. Drawbacks: difficult to debug, opaque.

Back to Linear Separability. Recall: the weighted sum in a perceptron forms a linear hyperplane

\sum_i w_i x_i + b = 0

Because of the threshold function, everything on one side of this hyperplane is labeled class 1 (output = +1) and everything on the other side is labeled class 2 (output = -1).

Separating Hyperplane. (figure: Class 1 points (+1 output) and Class 2 points (-1 output) separated by the hyperplane \sum_i w_i x_i + b = 0.) We need to choose w and b based on the training data.

Separating Hyperplanes. Different choices of w and b give different hyperplanes. (figure: several such hyperplanes separating Class 1, the +1 outputs, from Class 2, the -1 outputs.) (This and the next few slides are adapted from Andrew Moore's.)

Which hyperplane is best? (figure: several candidate hyperplanes, all of which separate Class 1 from Class 2.)

How about the one right in the middle? Intuitively, this boundary seems good: it avoids misclassification of new test points if they are generated from the same distribution as the training points.

Margin. Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

Maximum Margin and Support Vector Machine. Support vectors are those datapoints that the margin pushes up against. The maximum-margin classifier is called a Support Vector Machine (in this case, a linear SVM or LSVM).

Why Maximum Margin? It is robust to small perturbations of data points near the boundary; there is theory showing this is best for generalization to new points; and it works great empirically.

What if the data is not linearly separable? Outliers (due to noise).

Approach 1: Soft-Margin SVMs. Allow errors ξ_i (deviations from the margin) and trade off the margin against the errors: minimize a combination of (inverse) margin and error penalty.
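In the standard formulation (a well-known fact, not spelled out on the slide), this trade-off reads

\min_{w, b, \xi} \; \tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0

where minimizing \tfrac{1}{2}\lVert w \rVert^2 maximizes the margin and C controls how heavily the errors ξ_i are penalized.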

What if the data is not linearly separable: other ideas? (figure: a dataset that no single hyperplane can separate.)

What if the data is not linearly separable? Approach 2: map the original input space to a higher-dimensional feature space, x -> φ(x), and use a linear classifier in the higher-dimensional space. The choice of mapping (kernel) acts as an additional bias when converting the problem into the high-dimensional space.

Problem with high-dimensional spaces. Computation in the high-dimensional feature space can be costly, and the high-dimensional projection function φ(x) may be too complicated to compute. The kernel trick to the rescue!

The Kernel Trick. Dual formulation: the SVM maximizes the quadratic function

\sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{subject to} \quad \alpha_i \ge 0 \;\text{and}\; \sum_i \alpha_i y_i = 0

Insight: the data points only appear as inner products, so there is no need to compute the high-dimensional φ(x) explicitly. Just replace the inner product x_i · x_j with a kernel function K(x_i, x_j) = φ(x_i) · φ(x_j). E.g., the Gaussian kernel K(x_i, x_j) = exp(-||x_i - x_j||^2 / 2σ^2), or the polynomial kernel K(x_i, x_j) = (x_i · x_j + 1)^d.
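The two kernels mentioned on the slide, as a small NumPy sketch (the function names and sample vectors are illustrative):

```python
import numpy as np

def gaussian_kernel(xi, xj, sigma=1.0):
    """K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))."""
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def polynomial_kernel(xi, xj, d=2):
    """K(xi, xj) = (xi . xj + 1)^d."""
    return (np.dot(xi, xj) + 1.0) ** d

xi, xj = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(gaussian_kernel(xi, xj), polynomial_kernel(xi, xj))
# Both equal an inner product phi(xi) . phi(xj) in some higher-dimensional feature space.
```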

K-Nearest Neighbors. A simple non-parametric classification algorithm. Idea: look around you to see how your neighbors classify data; classify a new data point according to a majority vote of its k nearest neighbors.

Distance Metric. How do we measure what it means to be a neighbor (what is "close")? The appropriate distance metric depends on the problem. Examples: for discrete x (e.g., strings), the Hamming distance d(x1, x2) = number of features on which x1 and x2 differ; for continuous x (e.g., vectors over the reals), the Euclidean distance d(x1, x2) = ||x1 - x2|| = the square root of the sum of squared differences between corresponding elements of the data vectors.
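A small sketch of both metrics and the majority-vote classifier in Python (the names and the toy points are illustrative):

```python
import math
from collections import Counter

def hamming(x1, x2):
    """Number of features on which x1 and x2 differ."""
    return sum(a != b for a, b in zip(x1, x2))

def euclidean(x1, x2):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_classify(query, data, k=4, dist=euclidean):
    """Majority vote among the k nearest neighbors; data is a list of (point, label) pairs."""
    nearest = sorted(data, key=lambda item: dist(query, item[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

data = [((1, 1), 'C1'), ((1, 2), 'C1'), ((2, 1), 'C1'), ((5, 5), 'C2'), ((6, 5), 'C2')]
print(knn_classify((1.5, 1.5), data, k=4))   # 'C1' -- 3 of the 4 nearest neighbors are in C1
```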

Example. Input data: 2-D points (x1, x2) in two classes, C1 and C2. For a new data point "+", with K = 4 we look at the 4 nearest neighbors of "+"; 3 of them are in C1, so we classify "+" as C1.

Decision Boundary using K-NN. Some points near the boundary may be misclassified (but they may just be noise).

What if we want to learn continuous-valued functions? (figure: a scatter of (input, output) pairs to be fit by a regression function.)

Regression. K-nearest neighbor: take the average of the outputs of the k nearby points. Linear/non-linear regression: fit parameters (e.g., by gradient descent) minimizing the regression error/loss. Neural networks: remove the threshold function.

Large Feature Spaces. Large feature spaces are easy to overfit. Regularization: add a penalty for large weights, preferring weights that are zero or close to zero; minimize regression error + C · regularization penalty.

Regularizers. L1: diamond-shaped contours; L2: circular contours. Derivatives: the L1 penalty's derivative has constant magnitude, while the L2 penalty's derivative is large for large weights. L1 is harder to optimize (its derivative is discontinuous at zero) but still convex, so not too hard.
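A small gradient-descent sketch showing the difference (the function and variable names, the λ value, and the synthetic data are illustrative): the L2 penalty adds λw to the gradient, while the L1 penalty adds λ·sign(w), its constant-magnitude (sub)gradient.

```python
import numpy as np

def step(w, X, y, lr=0.01, lam=0.1, penalty='l2'):
    """One gradient-descent step on mean squared error plus a weight penalty."""
    grad = 2.0 * X.T @ (X @ w - y) / len(y)       # gradient of the mean squared error
    if penalty == 'l2':
        grad = grad + lam * w                     # L2: penalty gradient grows with the weights
    else:
        grad = grad + lam * np.sign(w)            # L1: constant-magnitude (sub)gradient
    return w - lr * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 0.0, -2.0]) + 0.1 * rng.normal(size=50)

w = np.zeros(3)
for _ in range(500):
    w = step(w, X, y, penalty='l1')
print(w)    # the weight for the irrelevant middle feature stays near zero
```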

L1 vs. L2. (figure: comparison of the L1 and L2 penalties.)

Ensemble Classifiers. Mausam (based on slides of Dan Weld)

Ensembles of Classifiers. Traditional approach: use one classifier. Alternative approach: use lots of classifiers. Approaches: cross-validated committees, bagging, boosting, stacking.

Ensembles of Classifiers. Assume the classifiers' errors are independent (suppose each has a 30% error rate) and we take a majority vote. The probability that the majority is wrong is the tail area under a binomial distribution over the number of classifiers in error. (figure: that binomial distribution.) If the individual error rate is 0.3, the area under the curve for 11 or more classifiers being wrong is 0.026: an order of magnitude improvement!
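That tail area can be computed directly (a sketch; the ensemble size of 21 classifiers is an assumption, chosen so that "11 or more wrong" is exactly a losing majority):

```python
from math import comb

def prob_majority_wrong(n, error_rate):
    """P(more than half of n independent classifiers are wrong) under a binomial model."""
    k_min = n // 2 + 1
    return sum(comb(n, k) * error_rate ** k * (1 - error_rate) ** (n - k)
               for k in range(k_min, n + 1))

print(prob_majority_wrong(21, 0.3))   # roughly 0.026, versus 0.3 for a single classifier
```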

Voting. (figure: the base classifiers' predictions combined by a vote.)

Constructing Ensembles: Cross-Validated Committees (a holdout-style construction).
- Partition the examples into k disjoint equivalence classes
- Create k training sets, each the union of all equivalence classes except one, so each set has (k-1)/k of the original training data
- Train a classifier on each set

Ensemble Construction II: Bagging.
- Generate k sets of training examples
- For each set, draw m examples randomly (with replacement) from the original set of m examples, so each training set covers about 63.2% of the original examples (plus duplicates)
- Train a classifier on each set
Intuition: sampling helps the algorithm become more robust to noise/outliers in the data.
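A minimal bagging sketch (the names bagging and vote, the toy data, and the trivial majority-label base learner are all illustrative; any base learner, such as the dtl() sketch above, could be plugged in):

```python
import random
from collections import Counter

def bagging(train, learn, k=10):
    """Train k classifiers, each on a bootstrap sample of the m original examples."""
    m = len(train)
    return [learn([random.choice(train) for _ in range(m)]) for _ in range(k)]

def vote(classifiers, predict, x):
    """Combine the ensemble by majority vote; predict(c, x) applies one classifier to x."""
    return Counter(predict(c, x) for c in classifiers).most_common(1)[0][0]

# Toy usage: the "classifier" just memorizes the majority label of its bootstrap sample.
train = [((0,), 'a'), ((1,), 'a'), ((2,), 'b'), ((3,), 'a'), ((4,), 'b')]
learn = lambda sample: Counter(y for _, y in sample).most_common(1)[0][0]
ensemble = bagging(train, learn, k=25)
print(vote(ensemble, lambda clf, x: clf, (0,)))   # most likely 'a'
```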

Ensemble Creation III: Boosting.
- Maintain a probability distribution over the set of training examples
- Create k training sets iteratively: on iteration i, draw m examples randomly (as in bagging), but use the probability distribution to bias the selection
- Train classifier number i on this training set
- Test the partial ensemble (of i classifiers) on all training examples
- Modify the distribution: increase the probability of each misclassified example

Ensemble Creation IV: Stacking.
- Train several base learners
- Next, train a meta-learner that learns when the base learners are right or wrong; the meta-learner then arbitrates
- Train using cross-validated committees: the meta-learner's inputs are the base learners' predictions, and its training examples come from the held-out test set of each cross-validation fold

Why do ensembles work?
- Statistical: in the search through hypothesis space, averaging reduces the risk of picking the wrong classifier
- Computational: it is intractable to find the single best hypothesis
- Representational: the ensemble increases the set of representable hypotheses

Example: Random Forests.
- Create k decision trees
- For each decision tree, pick training data as in bagging and randomly sample f of the features in the data; construct the best tree based only on those features
- Vote for the final prediction
Advantages: efficient, highly accurate, handles thousands of variables.
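A hedged usage sketch with scikit-learn's implementation (assuming scikit-learn is installed; note that RandomForestClassifier resamples features at every split rather than once per tree, a small variation on the description above, and the Iris data is just a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# k = 100 trees, each trained on a bootstrap sample (bagging) and restricted to a
# random subset of sqrt(#features) candidate features at every split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", bootstrap=True)
forest.fit(X, y)
print(forest.predict(X[:5]))   # final prediction is a (probability-averaged) vote over the trees
```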