Tree Structured Classifier

Reference: Classification and Regression Trees by L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Chapman & Hall, 1984.

A Medical Example (CART): Predict high risk patients who will not survive at least 30 days on the basis of the initial 24-hour data. 19 variables are measured during the first 24 hours. These include blood pressure, age, etc. A tree structured classification rule is as follows:
[Tree diagram: the root node asks "Is the minimum systolic blood pressure over the initial 24-hour period > 91?"; depending on the answer, the following nodes ask "Is sinus tachycardia present?" and "Is age > 62.5?", and each leaf is labeled low risk or high risk.]
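
As a minimal sketch (not the authors' code), such a rule can be written as nested if/else tests. The function and argument names are hypothetical, and the branch arrangement assumed here follows the CART book's version of this example.

```python
# A hypothetical rendering of the medical-example tree as nested if/else tests.
def classify_patient(min_systolic_bp, age, sinus_tachycardia_present):
    """Return 'high risk' or 'low risk' for the 30-day survival example."""
    if min_systolic_bp <= 91:            # first question: blood pressure over the initial 24 hours
        return "high risk"
    if age <= 62.5:                      # second question: age
        return "low risk"
    if sinus_tachycardia_present:        # third question: sinus tachycardia
        return "high risk"
    return "low risk"

print(classify_patient(min_systolic_bp=105, age=70, sinus_tachycardia_present=False))  # low risk
```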

Denote the feature space by 𝒳. The input vector X ∈ 𝒳 contains p features X_1, X_2, ..., X_p, some of which may be categorical. Tree structured classifiers are constructed by repeated splits of subsets of 𝒳 into two descendant subsets, beginning with 𝒳 itself.
Definitions: node, terminal node (leaf node), parent node, child node. The union of the regions occupied by two child nodes is the region occupied by their parent node. Every leaf node is assigned a class. A query is associated with the class of the leaf node it lands in.
Notation: A node is denoted by t. Its left child node is denoted by t_L and its right child by t_R. The collection of all the nodes is denoted by T, and the collection of all the leaf nodes by T̃. A split is denoted by s. The set of splits is denoted by S.

[Figure: the feature space 𝒳 is divided by successive binary splits (split 1, split 2, split 3, ...) into nested regions X_1, X_2, ..., X_8, shown as a tree of nodes.]

The Three Elements
The construction of a tree involves the following three elements:
1. The selection of the splits.
2. The decisions when to declare a node terminal or to continue splitting it.
3. The assignment of each terminal node to a class.
In particular, we need to decide the following:
1. A set Q of binary questions of the form {Is X ∈ A?}, A ⊆ 𝒳.
2. A goodness of split criterion Φ(s, t) that can be evaluated for any split s of any node t.
3. A stop-splitting rule.
4. A rule for assigning every terminal node to a class.

Standard Set of Questions
The input vector X = (X_1, X_2, ..., X_p) contains features of both categorical and ordered types. Each split depends on the value of only a single variable. For each ordered variable X_j, Q includes all questions of the form {Is X_j ≤ c?} for all real-valued c. Since the training data set is finite, there are only finitely many distinct splits that can be generated by the question {Is X_j ≤ c?}. If X_j is categorical, taking values, say, in {1, 2, ..., M}, then Q contains all questions of the form {Is X_j ∈ A?}, where A ranges over all subsets of {1, 2, ..., M}. The splits for all p variables constitute the standard set of questions.
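
The following sketch (hypothetical helper names, not from the source) enumerates the finite set of candidate questions from training data: midpoints between consecutive distinct values for an ordered variable, and one subset from each complementary pair for a categorical variable.

```python
from itertools import combinations

def ordered_splits(values):
    """Candidate thresholds c for questions {Is X_j <= c?} on an ordered variable."""
    v = sorted(set(values))
    # Midpoints between consecutive distinct values generate every distinct split.
    return [(a + b) / 2.0 for a, b in zip(v[:-1], v[1:])]

def categorical_splits(levels):
    """Candidate subsets A for questions {Is X_j in A?} on a categorical variable."""
    levels = sorted(set(levels))
    seen, splits = set(), []
    for r in range(1, len(levels)):
        for A in combinations(levels, r):
            if frozenset(levels) - frozenset(A) not in seen:  # skip complements of kept subsets
                seen.add(frozenset(A))
                splits.append(set(A))
    return splits

print(ordered_splits([3.1, 2.0, 2.0, 5.4]))       # [2.55, 4.25]
print(categorical_splits(["a", "b", "c"]))        # [{'a'}, {'b'}, {'c'}]
```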

Goodness of Split
The goodness of split is measured by an impurity function defined for each node. Intuitively, we want each leaf node to be pure, that is, one class dominates.
Definition: An impurity function is a function φ defined on the set of all K-tuples of numbers (p_1, ..., p_K) satisfying p_j ≥ 0, j = 1, ..., K, Σ_j p_j = 1, with the properties:
1. φ achieves its maximum only at the point (1/K, 1/K, ..., 1/K).
2. φ achieves its minimum only at the points (1, 0, ..., 0), (0, 1, 0, ..., 0), ..., (0, 0, ..., 0, 1).
3. φ is a symmetric function of p_1, ..., p_K, i.e., if you permute p_j, φ remains constant.

Definition: Given an impurity function φ, define the impurity measure i(t) of a node t as
i(t) = φ(p(1|t), p(2|t), ..., p(K|t)),
where p(j|t) is the estimated probability of class j within node t.
The goodness of a split s for node t, denoted by Φ(s, t), is defined by
Φ(s, t) = Δi(s, t) = i(t) − p_R i(t_R) − p_L i(t_L),
where p_R and p_L are the proportions of the samples in node t that go to the right child node t_R and the left child node t_L respectively.

Define I(t) = i(t)p(t), that is, the impurity of node t weighted by the estimated proportion of data that go to node t. The impurity of a tree T, I(T), is defined by
I(T) = Σ_{t ∈ T̃} I(t) = Σ_{t ∈ T̃} i(t)p(t).
Note that for any node t the following equations hold:
p(t_L) + p(t_R) = p(t),
p_L = p(t_L)/p(t), p_R = p(t_R)/p(t),
p_L + p_R = 1.
Define
ΔI(s, t) = I(t) − I(t_L) − I(t_R)
         = p(t)i(t) − p(t_L)i(t_L) − p(t_R)i(t_R)
         = p(t)(i(t) − p_L i(t_L) − p_R i(t_R))
         = p(t) Δi(s, t).

Possible impurity functions:
1. Entropy: −Σ_{j=1}^{K} p_j log p_j. If p_j = 0, use the limit lim_{p_j → 0} p_j log p_j = 0.
2. Misclassification rate: 1 − max_j p_j.
3. Gini index: Σ_{j=1}^{K} p_j(1 − p_j) = 1 − Σ_{j=1}^{K} p_j².
The Gini index seems to work best in practice for many problems.
The twoing rule: At a node t, choose the split s that maximizes
(p_L p_R / 4) [ Σ_j | p(j|t_L) − p(j|t_R) | ]².
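
Below is a minimal sketch (not from the source; numpy assumed, function names hypothetical) of the three impurity functions and of the goodness of split Δi(s, t) computed from class proportions.

```python
import numpy as np

def entropy(p):
    """Entropy impurity: -sum_j p_j log p_j, with 0 log 0 taken as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def misclassification(p):
    """Misclassification-rate impurity: 1 - max_j p_j."""
    return 1.0 - np.max(np.asarray(p, dtype=float))

def gini(p):
    """Gini index: sum_j p_j (1 - p_j) = 1 - sum_j p_j^2."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def goodness_of_split(p_parent, p_left, p_right, p_L, impurity=gini):
    """Delta i(s, t) = i(t) - p_L i(t_L) - p_R i(t_R), with p_R = 1 - p_L."""
    return impurity(p_parent) - p_L * impurity(p_left) - (1.0 - p_L) * impurity(p_right)

# Example: a node with proportions (0.5, 0.5) split into two pure children.
print(goodness_of_split([0.5, 0.5], [1.0, 0.0], [0.0, 1.0], p_L=0.5))  # 0.5 under the Gini index
```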

Estimate the posterior probabilities of classes in each node: The total number of samples is N and the number of samples in class j, 1 ≤ j ≤ K, is N_j. The number of samples going to node t is N(t); the number of samples with class j going to node t is N_j(t).
Σ_{j=1}^{K} N_j(t) = N(t). N_j(t_L) + N_j(t_R) = N_j(t).
For a full tree (balanced), the sum of N(t) over all the t's at the same level is N.
Denote the prior probability of class j by π_j. The priors π_j can be estimated from the data by N_j/N. Sometimes priors are given beforehand.
The estimated probability of a sample in class j going to node t is p(t|j) = N_j(t)/N_j. p(t_L|j) + p(t_R|j) = p(t|j). For a full tree, the sum of p(t|j) over all t's at the same level is 1.

The joint probability of a sample being in class j and going to node t is thus:
p(j, t) = π_j p(t|j) = π_j N_j(t)/N_j.
The probability of any sample going to node t is:
p(t) = Σ_{j=1}^{K} p(j, t) = Σ_{j=1}^{K} π_j N_j(t)/N_j.
Note p(t_L) + p(t_R) = p(t).
The probability of a sample being in class j given that it goes to node t is:
p(j|t) = p(j, t)/p(t). For any t, Σ_{j=1}^{K} p(j|t) = 1.
When π_j = N_j/N, we have the following simplification:
p(j|t) = N_j(t)/N(t), p(t) = N(t)/N, p(j, t) = N_j(t)/N.
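
A small sketch (hypothetical names, numpy assumed) of these estimates computed from the counts N_j, N_j(t) and the priors π_j:

```python
import numpy as np

def node_probabilities(N_j, N_j_t, priors=None):
    """Given per-class totals N_j and per-class counts N_j(t) in node t,
    return p(t) and the posteriors p(j|t) = p(j, t) / p(t)."""
    N_j = np.asarray(N_j, dtype=float)
    N_j_t = np.asarray(N_j_t, dtype=float)
    if priors is None:
        priors = N_j / N_j.sum()        # default: estimate priors by N_j / N
    p_t_given_j = N_j_t / N_j           # p(t|j) = N_j(t) / N_j
    p_j_t = priors * p_t_given_j        # p(j, t) = pi_j p(t|j)
    p_t = p_j_t.sum()                   # p(t) = sum_j p(j, t)
    return p_t, p_j_t / p_t             # p(j|t)

# With priors pi_j = N_j / N, this reduces to p(j|t) = N_j(t)/N(t) and p(t) = N(t)/N.
p_t, posterior = node_probabilities(N_j=[100, 100], N_j_t=[30, 10])
print(p_t, posterior)                   # 0.2 [0.75 0.25]
```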

Stopping Criteria
A simple criterion: stop splitting a node t when max_{s ∈ S} ΔI(s, t) < β, where β is a chosen threshold.
The above stopping criterion is unsatisfactory: a node with a small decrease of impurity after one step of splitting may have a large decrease after multiple levels of splits.

Class Assignment Rule
A class assignment rule assigns a class j ∈ {1, ..., K} to every terminal node t ∈ T̃. The class assigned to node t ∈ T̃ is denoted by κ(t).
For 0-1 loss, the class assignment rule is: κ(t) = arg max_j p(j|t).
The resubstitution estimate r(t) of the probability of misclassification, given that a case falls into node t, is
r(t) = 1 − max_j p(j|t) = 1 − p(κ(t)|t).
Denote R(t) = r(t)p(t). The resubstitution estimate for the overall misclassification rate R(T) of the tree classifier T is:
R(T) = Σ_{t ∈ T̃} R(t).
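
A brief sketch (hypothetical names) of the assignment rule and the resubstitution estimates:

```python
def assign_class_and_risk(p_j_given_t, p_t):
    """Return kappa(t) = argmax_j p(j|t), r(t) = 1 - max_j p(j|t), and R(t) = r(t) p(t)."""
    kappa = max(range(len(p_j_given_t)), key=lambda j: p_j_given_t[j])
    r_t = 1.0 - p_j_given_t[kappa]
    return kappa, r_t, r_t * p_t

# R(T) is the sum of R(t) over the leaf nodes; here two leaves as an example.
leaves = [([0.75, 0.25], 0.2), ([0.1, 0.9], 0.8)]
R_T = sum(assign_class_and_risk(p, pt)[2] for p, pt in leaves)
print(R_T)  # 0.25 * 0.2 + 0.1 * 0.8 = 0.13
```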

Proposition: For any split of a node t into t_L and t_R,
R(t) ≥ R(t_L) + R(t_R).
Proof: Denote j* = κ(t).
p(j*|t) = p(j*, t_L|t) + p(j*, t_R|t)
        = p(j*|t_L)p(t_L|t) + p(j*|t_R)p(t_R|t)
        = p_L p(j*|t_L) + p_R p(j*|t_R)
        ≤ p_L max_j p(j|t_L) + p_R max_j p(j|t_R).
Hence,
r(t) = 1 − p(j*|t)
     ≥ 1 − [p_L max_j p(j|t_L) + p_R max_j p(j|t_R)]
     = p_L (1 − max_j p(j|t_L)) + p_R (1 − max_j p(j|t_R))
     = p_L r(t_L) + p_R r(t_R).
Finally,
R(t) = p(t)r(t) ≥ p(t)p_L r(t_L) + p(t)p_R r(t_R) = p(t_L)r(t_L) + p(t_R)r(t_R) = R(t_L) + R(t_R).

Digit Recognition Example (CART)
The digits are shown by different on-off combinations of seven horizontal and vertical lights. Each digit is represented by a 7-dimensional vector of zeros and ones. The ith sample is x_i = (x_{i1}, x_{i2}, ..., x_{i7}). If x_{ij} = 1, the jth light is on; if x_{ij} = 0, the jth light is off.
[Table: the on-off pattern of the seven lights for each of the ten digits.]

The data for the example are generated by a malfunctioning calculator. Each of the seven lights has probability 0.1 of being in the wrong state independently. The training set contains 200 samples generated according to the specified distribution.
A tree structured classifier is applied. The set of questions Q contains: Is x_j = 0?, j = 1, 2, ..., 7. The twoing rule is used in splitting. The pruning cross-validation method is used to choose the right sized tree.
Classification performance: The error rate estimated by using a test set of size 5000 is 0.30. The error rate estimated by cross-validation using the training set is 0.30. The resubstitution estimate of the error rate is 0.29. The Bayes error rate is 0.26. There is little room for improvement over the tree classifier.
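
A sketch of how such noisy seven-segment data could be simulated. The encoding table and names below are hypothetical (the source does not reproduce the digit-to-light patterns); only the noise model, flipping each light independently with probability 0.1, is from the example.

```python
import random

# Hypothetical seven-segment encodings; the actual table used in the example is not reproduced here.
SEGMENTS = {
    0: (1, 1, 1, 0, 1, 1, 1), 1: (0, 0, 1, 0, 0, 1, 0), 2: (1, 0, 1, 1, 1, 0, 1),
    3: (1, 0, 1, 1, 0, 1, 1), 4: (0, 1, 1, 1, 0, 1, 0), 5: (1, 1, 0, 1, 0, 1, 1),
    6: (1, 1, 0, 1, 1, 1, 1), 7: (1, 0, 1, 0, 0, 1, 0), 8: (1, 1, 1, 1, 1, 1, 1),
    9: (1, 1, 1, 1, 0, 1, 1),
}

def noisy_digit_sample(rng=random):
    """Draw a digit uniformly, then flip each of its 7 lights independently with probability 0.1."""
    digit = rng.randrange(10)
    x = tuple(bit ^ (1 if rng.random() < 0.1 else 0) for bit in SEGMENTS[digit])
    return x, digit

training_set = [noisy_digit_sample() for _ in range(200)]  # 200 samples, as in the example
```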

[Figure: the fitted classification tree for the digit data; each internal node asks a question of the form "Is x_j = 0?" (yes/no) and each leaf is labeled with a digit.]
Coincidentally, every digit occupies one leaf node. In general, one class may occupy any number of leaf nodes and occasionally no leaf node. Two of the seven variables (one of them X_7) are never used.

Waveform Example (CART)
Three functions h_1(τ), h_2(τ), h_3(τ) are shifted versions of each other, as shown in the figure.
[Figure: the three triangular waveforms h_1, h_2, h_3 plotted over τ = 1, ..., 21.]
Each h_j is specified by the equilateral right triangle function. Its values at integers τ = 1, ..., 21 are measured.

The three classes of waveforms are random convex combinations of two of these waveforms plus independent Gaussian noise. Each sample is a 21-dimensional vector containing the values of the random waveform measured at τ = 1, 2, ..., 21.
To generate a sample in class 1, a random number u uniformly distributed in [0, 1] and 21 random numbers ε_1, ε_2, ..., ε_21, normally distributed with mean zero and variance 1, are generated. Then
x_j = u h_1(j) + (1 − u) h_2(j) + ε_j, j = 1, ..., 21.
To generate a sample in class 2, repeat the above process to generate a random number u and 21 random numbers ε_1, ..., ε_21 and set
x_j = u h_1(j) + (1 − u) h_3(j) + ε_j, j = 1, ..., 21.
Class 3 vectors are generated by
x_j = u h_2(j) + (1 − u) h_3(j) + ε_j, j = 1, ..., 21.
Example random waveforms are shown below.
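
A sketch of this generator (numpy assumed; the function names are hypothetical, and the triangle centers follow the usual waveform-data definition, h_1 peaking at τ = 11 with h_2, h_3 shifted by ±4, which the slide does not spell out):

```python
import numpy as np

TAU = np.arange(1, 22)                            # tau = 1, ..., 21

def h(i, tau):
    """Shifted triangular basis functions h_1, h_2, h_3 (assumed centers 11, 15, 7)."""
    centers = {1: 11, 2: 15, 3: 7}
    return np.maximum(6 - np.abs(tau - centers[i]), 0)

def waveform_sample(cls, rng):
    """x_j = u * h_a(j) + (1 - u) * h_b(j) + eps_j, with (a, b) determined by the class."""
    a, b = {1: (1, 2), 2: (1, 3), 3: (2, 3)}[cls]
    u = rng.uniform(0.0, 1.0)
    eps = rng.normal(0.0, 1.0, size=21)           # mean zero, variance one
    return u * h(a, TAU) + (1.0 - u) * h(b, TAU) + eps

rng = np.random.default_rng(0)
# 300 training samples with equal prior probabilities (1/3, 1/3, 1/3), as in the example.
X = np.array([waveform_sample(int(rng.integers(1, 4)), rng) for _ in range(300)])
```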

[Figure: example random waveforms from class 1, class 2, and class 3, plotted over τ = 1, ..., 21.]

300 random samples are generated using prior probabilities (1/3, 1/3, 1/3) for training.
Construction of the tree: The set of questions is {Is x_j ≤ c?} for c ranging over all real numbers and j = 1, ..., 21. The Gini index is used for measuring goodness of split. The final tree is selected by pruning and cross-validation.
Results: The cross-validation estimate of the misclassification rate is 0.29. The misclassification rate on a separate test set of size 5000 is 0.28. The Bayes classification rule can be derived; applying this rule to the test set yields a misclassification rate of 0.14.

[Figure: the pruned classification tree for the waveform data; internal nodes ask questions of the form x_j ≤ c, node labels show sample counts, and each leaf is assigned class 1, 2, or 3.]

Advantages of the Tree-Structured Approach
Handles both categorical and ordered variables in a simple and natural way.
Automatic stepwise variable selection and complexity reduction.
It provides an estimate of the misclassification rate for a query sample.
It is invariant under all monotone transformations of individual ordered variables.
Robust to outliers and misclassified points in the training set.
Easy to interpret.

Variable Combinations
Splits perpendicular to the coordinate axes are inefficient in certain cases. Use linear combinations of variables: Is Σ_j a_j x_j ≤ c?
The amount of computation is increased significantly. Price to pay: model complexity increases.

Missing Values
Certain variables are missing in some training samples. This often occurs in gene-expression microarray data. Suppose each variable has a 5% chance of being missing independently. Then for a training sample with 50 variables, the probability of missing some variables is as high as 1 − 0.95^50 ≈ 92.3%. A query sample to be classified may have missing variables.
Find surrogate splits. Suppose the best split for node t is s, which involves a question on X_m. Find another split s' on a variable X_j, j ≠ m, which is most similar to s in a certain sense. Similarly, the second best surrogate split, the third, and so on, can be found.
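
One simple (hypothetical) notion of similarity is the fraction of training samples that a candidate split sends to the same side as the primary split; the sketch below ranks surrogates that way rather than by the CART book's exact predictive measure of association.

```python
import numpy as np

def best_surrogate(X, primary_goes_left, skip_var):
    """Among splits {X_j <= c}, j != skip_var, find the one whose left/right assignment
    agrees most often with the primary split (allowing the sides to be swapped)."""
    best = (0.0, None, None)                      # (agreement, variable index, threshold)
    for j in range(X.shape[1]):
        if j == skip_var:
            continue
        for c in np.unique(X[:, j]):
            goes_left = X[:, j] <= c
            agree = max(np.mean(goes_left == primary_goes_left),
                        np.mean(goes_left != primary_goes_left))
            if agree > best[0]:
                best = (float(agree), j, float(c))
    return best

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
primary = X[:, 2] <= 0.0                          # primary split on the third variable
print(best_surrogate(X, primary, skip_var=2))     # the surrogate most similar to the primary split
```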