ECE471-571 Pattern Recognition
Lecture 13: Decision Tree
Hairong Qi, Gonzalez Family Professor
Electrical Engineering and Computer Science, University of Tennessee, Knoxville
http://www.eecs.utk.edu/faculty/qi
Email: hqi@utk.edu

Pattern Classification
Statistical approach
- Supervised
  - Basic concepts: Bayesian decision rule (MPP, LR, discriminant function)
  - Parameter estimation (ML, BL)
  - Non-parametric learning (kNN)
  - LDF (Perceptron)
  - NN (BP), Support Vector Machine, Deep Learning (DL)
- Unsupervised
  - Basic concepts: distance
  - Agglomerative method, k-means, winner-takes-all, Kohonen maps, mean-shift
Non-statistical approach
- Decision tree
- Syntactic approach
Dimensionality reduction: FLD, PCA
Performance evaluation: ROC curve (TP, TN, FN, FP), cross-validation
Stochastic methods: local optimization (GD), global optimization (SA, GA)
Classifier fusion: majority voting, NB, BKS

Review - Bayes Decision Rule
Maximum posterior probability: P(ωj|x) = p(x|ωj) P(ωj) / p(x). For a given x, if P(ω1|x) > P(ω2|x), then x belongs to class 1; otherwise, class 2.
Likelihood ratio: if p(x|ω1)/p(x|ω2) > P(ω2)/P(ω1), then x belongs to class 1; otherwise, class 2.
Discriminant function: the classifier assigns a feature vector x to class ωi if gi(x) > gj(x) for all j ≠ i.
- Case 1: minimum Euclidean distance (linear machine), Σi = σ²I
- Case 2: minimum Mahalanobis distance (linear machine), Σi = Σ
- Case 3: quadratic classifier, Σi arbitrary
Non-parametric (kNN): for a given x, if k1/k > k2/k, then x belongs to class 1; otherwise, class 2.
Parameter estimation: Gaussian (Maximum Likelihood Estimation, MLE), two-modal Gaussian
Dimensionality reduction
Performance evaluation: ROC curve
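To make the review concrete, here is a minimal MATLAB sketch (not from the original slides) of the MPP and likelihood-ratio decisions for a two-class problem; the Gaussian class-conditional parameters and priors are assumed purely for illustration.

% Minimal sketch of the MPP and likelihood-ratio rules for two classes,
% assuming 1-D Gaussian class-conditional densities (illustrative values).
mu    = [0 2];                        % assumed means of p(x|w1), p(x|w2)
sigma = [1 1];                        % assumed standard deviations
prior = [0.6 0.4];                    % assumed priors P(w1), P(w2)
x = 1.3;                              % feature value to classify

likelihood = normpdf(x, mu, sigma);   % [p(x|w1), p(x|w2)]
posterior  = likelihood .* prior;     % unnormalized P(wj|x); p(x) cancels
if posterior(1) > posterior(2)        % MPP: choose the larger posterior
    label = 1;
else
    label = 2;
end

% Equivalent likelihood-ratio test: p(x|w1)/p(x|w2) > P(w2)/P(w1) => class 1
decideClass1 = likelihood(1)/likelihood(2) > prior(2)/prior(1);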

Nominal Data
Descriptions that are discrete and without any natural notion of similarity or even ordering.

Some Terminologies
Decision tree
- Root
- Link (branch) - directional
- Leaf
- Descendent node

CART
Classification and regression trees: a generic tree-growing methodology.
Issues studied:
- How many splits from a node?
- Which property to test at each node?
- When to declare a leaf?
- How to prune a large, redundant tree?
- If the leaf is impure, how to classify?
- How to handle missing data?

Number of Splits
Binary tree: expressive power and comparative simplicity in training.

Node Impurity - Occam's Razor
The fundamental principle underlying tree creation is that of simplicity: we prefer simple, compact trees with few nodes.
http://math.ucr.edu/home/baez/physics/occam.html
Occam's (or Ockham's) razor is a principle attributed to the 14th-century logician and Franciscan friar William of Occam. Ockham was the village in the English county of Surrey where he was born. The principle states that "entities should not be multiplied unnecessarily": when you have two competing theories which make exactly the same predictions, the simpler one is the better. Stephen Hawking explains in A Brief History of Time: "We could still imagine that there is a set of laws that determines events completely for some supernatural being, who could observe the present state of the universe without disturbing it. However, such models of the universe are not of much interest to us mortals. It seems better to employ the principle known as Occam's razor and cut out all the features of the theory which cannot be observed." Everything should be made as simple as possible, but not simpler.

Property Query and Impurity Measurement
We seek a property query T at each node N that makes the data reaching the immediate descendent nodes as pure as possible. We want i(N) to be 0 if all the patterns reaching the node bear the same category label.
Entropy impurity (information impurity):
i(N) = -Σj P(ωj) log2 P(ωj)
where P(ωj) is the fraction of patterns at node N that are in category ωj.
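As a sketch of how the entropy impurity could be computed in MATLAB (the function name entropyImpurity and its interface are illustrative, not from the slides):

function i = entropyImpurity(counts)
% Entropy impurity i(N) = -sum_j P(wj) log2 P(wj), computed from the
% vector of per-class pattern counts at node N.
p = counts / sum(counts);   % P(wj): fraction of patterns in category wj
p = p(p > 0);               % drop empty classes (convention: 0*log2(0) = 0)
i = -sum(p .* log2(p));
end

% Example calls:
%   entropyImpurity([100 0])  returns 0 (pure node)
%   entropyImpurity([50 50])  returns 1 (maximally impure, in bits)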

Other Impurity Measurements
Variance impurity (2-category case):
i(N) = P(ω1) P(ω2)
Gini impurity:
i(N) = Σ_{i≠j} P(ωi) P(ωj) = 1 - Σj P²(ωj)
Misclassification impurity:
i(N) = 1 - max_j P(ωj)

Choose the Property Test?
Choose the query that decreases the impurity as much as possible:
Δi(N) = i(N) - P_L i(N_L) - (1 - P_L) i(N_R)
- N_L, N_R: left and right descendent nodes
- i(N_L), i(N_R): their impurities
- P_L: fraction of patterns at node N that will go to N_L
Solve for the extrema (local extrema).

Example
Node N:
- 90 patterns in ω1
- 10 patterns in ω2
Split candidate:
- 70 ω1 patterns and 0 ω2 patterns go to the right
- 20 ω1 patterns and 10 ω2 patterns go to the left
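Applying the impurity-drop formula to this example gives a concrete number; the MATLAB sketch below reuses the hypothetical entropyImpurity helper sketched earlier.

% Impurity drop for the example split, using the entropy impurity.
N  = [90 10];                         % node N: 90 patterns in w1, 10 in w2
NL = [20 10];                         % left descendent after the split
NR = [70  0];                         % right descendent after the split
PL = sum(NL) / sum(N);                % fraction of patterns going left (0.3)

dI = entropyImpurity(N) ...
   - PL * entropyImpurity(NL) ...
   - (1 - PL) * entropyImpurity(NR);  % di(N) = i(N) - PL*i(NL) - (1-PL)*i(NR)
% Here i(N) is about 0.469 bits and the drop dI is about 0.19 bits, so the
% candidate split removes a substantial share of the node's impurity.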

When to Stop Splitting?
Two extreme scenarios:
- Overfitting (each leaf is one sample)
- High error rate
Approaches:
- Validation and cross-validation
  - 90% of the data set as training data
  - 10% of the data set as validation data
- Use a threshold
  - Unbalanced tree
  - Hard to choose the threshold
- Minimum description length (MDL)
  MDL = α · size + Σ_{leaf nodes} i(N)
  - The sum of i(N) over the leaf nodes measures the uncertainty of the training data
  - The size of the tree measures the complexity of the classifier itself

When to Stop Splitting? (cont'd)
Use a stopping criterion based on the statistical significance of the reduction of impurity:
- Use the chi-square statistic
- Test whether the candidate split differs significantly from a random split:
  χ² = Σ_{i=1}^{2} (n_iL - n_ie)² / n_ie
  where n_iL is the number of category-ωi patterns the candidate split sends to the left descendent node, and n_ie = P_L n_i is the number a random split with the same left fraction would be expected to send there.

Pruning
Another way to stop splitting.
Horizon effect:
- Lack of sufficient look-ahead
Let the tree fully grow, i.e. beyond any putative horizon, then consider all pairs of neighboring leaf nodes for elimination.
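A small MATLAB sketch of this test, reusing the counts from the earlier example (the interpretation step and significance level are only sketched):

% Chi-square test of whether the candidate split differs from a random split.
N   = [90 10];                       % class counts at node N
NL  = [20 10];                       % class counts sent left by the split
PL  = sum(NL) / sum(N);              % fraction of patterns sent left
nie = PL * N;                        % expected left counts for a random split

chi2 = sum((NL - nie).^2 ./ nie);    % about 18.1 for these counts
% Compare chi2 against a chi-square critical value (1 degree of freedom for
% two categories); if it exceeds the threshold, the split differs
% significantly from a random split and is kept; otherwise, stop splitting.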

Instability

Other Methods
Quinlan's ID3
C4.5 (successor and refinement of ID3)
http://www.rulequest.com/personal/

MATLAB Routine
classregtree
Classification tree vs. regression tree: depends on whether the target variable is categorical or numeric.
http://www.mathworks.com/help/stats/classregtree.html
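classregtree has since been superseded in the Statistics and Machine Learning Toolbox; a minimal sketch with the newer fitctree interface (the sample query point is illustrative) would be:

% Grow and query a classification tree on the built-in Fisher iris data.
load fisheriris                           % meas: 150x4 features, species: labels
tree = fitctree(meas, species);           % classification tree (fitrtree for regression)
view(tree, 'Mode', 'graph');              % inspect the learned splits
label = predict(tree, [5.9 3.0 4.2 1.5]); % classify one new feature vector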

http://www.simafore.com/blog/bid/62482/2-main-differences-between-classification-and-regression-trees

Random Forest
Potential issue with decision trees.
Prof. Leo Breiman
Ensemble learning methods:
- Bagging (bootstrap aggregating): proposed by Breiman in 1994 to improve classification by combining classifications of randomly generated training sets.
- Random forest: bagging + random selection of features at each node to determine a split.

MATLAB Implementation
B = TreeBagger(nTrees, train_x, train_y);
pred = B.predict(test_x);
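A slightly fuller usage sketch on the built-in Fisher iris data (the number of trees and the resubstitution evaluation are purely illustrative):

% Bagged tree ensemble on Fisher iris; requires the Statistics and
% Machine Learning Toolbox.
load fisheriris                          % meas: features, species: class labels
B    = TreeBagger(50, meas, species);    % ensemble of 50 bagged trees
pred = predict(B, meas);                 % cell array of predicted labels
acc  = mean(strcmp(pred, species));      % resubstitution accuracy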

Reference
[CART] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software, 1984.
[Bagging] L. Breiman, "Bagging predictors," Machine Learning, 24(2):123-140, August 1996. (citations: 16,393)
[RF] L. Breiman, "Random forests," Machine Learning, 45(1):5-32, October 2001.