CN700 Additive Models and Trees Chapter 9: Hastie et al. (2001)

CN700 Additive Models and Trees, Chapter 9: Hastie et al. (2001). Madhusudana Shashanka, Department of Cognitive and Neural Systems, Boston University. March 02, 2004.

Overview: Generalized additive models; Tree-based models; PRIM (bump hunting); MARS.

Generalized Additive Models. Techniques that use predefined basis functions achieve nonlinearity. Another approach: generalized additive models, which are more automatic and flexible. In the regression setting, the model can be expressed as E(Y | X_1, X_2, ..., X_p) = α + f_1(X_1) + f_2(X_2) + ... + f_p(X_p). The f_j's are unspecified smooth (nonparametric) functions. Fit each function using a scatterplot smoother (cubic smoothing spline or kernel smoother), and estimate all p functions simultaneously.

Examples. In general, the conditional mean µ(X) of a response Y is related to an additive function of the predictors via a link function g: g[µ(X)] = α + f_1(X_1) + ... + f_p(X_p). g(µ) = µ is the identity link, used for linear and additive models for Gaussian response data. g(µ) = logit(µ) or g(µ) = probit(µ) for binomial probabilities; probit is the inverse Gaussian cumulative distribution function. g(µ) = log(µ) for log-linear or log-additive models for Poisson count data. More flexibility: mix linear and other parametric forms; nonlinear components in two or more variables; separate curves in X_j for each level of the factor X_k.

Examples. g(µ) = X^T β + α_k + f(Z): a semiparametric model, where α_k is the effect for the kth level of a qualitative input V. g(µ) = f(X) + g_k(Z), where g_k(Z) = g(V, Z) is an interaction term for the effect of V and Z. g(µ) = f(X) + g(Z, W), where g is nonparametric in two features. An example where additive models apply: the additive decomposition of a time series, Y_t = S_t + T_t + ɛ_t, where S_t is a seasonal component, T_t is a trend, and ɛ_t is an error term.

Fitting additive models. The model: Y = α + Σ_{j=1}^p f_j(X_j) + ɛ. Criterion: the penalized sum of squares PRSS(α, f_1, ..., f_p) = Σ_{i=1}^N [ y_i − α − Σ_{j=1}^p f_j(x_ij) ]^2 + Σ_{j=1}^p λ_j ∫ f_j''(t_j)^2 dt_j, where the λ_j ≥ 0 are tuning parameters. The minimizer is an additive cubic spline model: each f_j is a cubic spline in X_j with knots at each unique value x_ij, i = 1, ..., N. However, the solution is not unique without more restrictions: assume Σ_{i=1}^N f_j(x_ij) = 0 for all j, and that the matrix of input values is nonsingular.

Backfitting Algorithm. 1. Initialize: α̂ = (1/N) Σ_{i=1}^N y_i, f̂_j ≡ 0 for all j. 2. Cycle over j = 1, 2, ..., p, 1, 2, ..., p, ...: f̂_j ← S_j[ {y_i − α̂ − Σ_{k≠j} f̂_k(x_ik)}_{i=1}^N ], then f̂_j ← f̂_j − (1/N) Σ_{i=1}^N f̂_j(x_ij); repeat until the functions f̂_j change by less than a specified threshold. The algorithm is analogous to multiple regression for linear models.
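
A minimal NumPy sketch of the backfitting cycle, assuming the caller supplies a one-dimensional scatterplot smoother; the function names and signatures here are illustrative, not from the slides:

```python
import numpy as np

def backfit(x, y, smoother, n_iter=50, tol=1e-6):
    """Backfitting for an additive model (a sketch).

    x: (N, p) array of inputs, y: (N,) responses.
    smoother: callable (x_j, r) -> fitted values at x_j, e.g. a cubic
        smoothing spline or kernel smoother, assumed supplied by the user.
    """
    N, p = x.shape
    alpha = y.mean()                      # step 1: intercept is the mean of y
    f = np.zeros((N, p))                  # f_j(x_ij) initialized to zero
    for _ in range(n_iter):
        f_old = f.copy()
        for j in range(p):
            # partial residual: y minus intercept and all other components
            others = f.sum(axis=1) - f[:, j]
            r = y - alpha - others
            f[:, j] = smoother(x[:, j], r)        # apply the smoother S_j
            f[:, j] -= f[:, j].mean()             # recenter for identifiability
        if np.max(np.abs(f - f_old)) < tol:       # stop when changes are small
            break
    return alpha, f
```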

Backfitting Algorithm. Other fitting methods can be accommodated by specifying appropriate smoothing operators S_j: univariate regression smoothers such as local polynomial regression and kernel methods; linear regression operators yielding polynomial fits, piecewise constant fits, parametric spline fits, series and Fourier fits; and others, such as surface smoothers for second- or higher-order interactions, and periodic smoothers for seasonal effects.

Eg: Additive Logistic Regression. log[Pr(Y=1|X) / Pr(Y=0|X)] = α + f_1(X_1) + ... + f_p(X_p). Local Scoring Algorithm: 1. Compute starting values: α̂ = log[ȳ/(1 − ȳ)], where ȳ = ave(y_i). Set f̂_j ≡ 0 for all j. 2. Define η̂_i = α̂ + Σ_j f̂_j(x_ij) and p̂_i = 1/[1 + exp(−η̂_i)]. Iterate: construct working targets z_i = η̂_i + (y_i − p̂_i)/[p̂_i(1 − p̂_i)]; construct weights w_i = p̂_i(1 − p̂_i); fit an additive model to the targets z_i with weights w_i using a weighted backfitting algorithm, giving new estimates α̂ and f̂_j for all j. 3. Repeat step 2 until the change is less than a specified threshold.
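
A hedged sketch of the local scoring loop, assuming a weighted backfitting routine fit_additive is available; that helper and its return convention are hypothetical:

```python
import numpy as np

def local_scoring(x, y, fit_additive, n_iter=20, tol=1e-6):
    """Local scoring for additive logistic regression (a sketch).

    x: (N, p) inputs, y: (N,) binary responses in {0, 1}.
    fit_additive: assumed weighted backfitting routine returning
        (alpha_hat, f_hat) with f_hat of shape (N, p) at the training points.
    """
    ybar = y.mean()
    alpha = np.log(ybar / (1.0 - ybar))          # starting intercept
    f = np.zeros(x.shape)                        # f_j(x_ij) initialized to zero
    for _ in range(n_iter):
        eta = alpha + f.sum(axis=1)              # current additive predictor
        p = 1.0 / (1.0 + np.exp(-eta))           # fitted probabilities
        z = eta + (y - p) / (p * (1.0 - p))      # working targets
        w = p * (1.0 - p)                        # working weights
        alpha_new, f_new = fit_additive(x, z, w) # weighted backfitting step
        converged = np.max(np.abs(f_new - f)) < tol
        alpha, f = alpha_new, f_new
        if converged:
            break
    return alpha, f
```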

Summary: Additive Models. Flexible, yet interpretable. The familiar tools for modelling and inference in linear models are also available here. Backfitting is simple and modular: a fitting method appropriate for each input variable can be chosen. Limitations for large data-mining applications: backfitting fits all predictors, which is not feasible or desirable with large data.

Overview: Generalized additive models; Tree-based models; PRIM (bump hunting); MARS.

Introduction. Partition the feature space into a set of rectangles and fit a simple model (like a constant) in each one. Key advantage: interpretability. Prediction, for an example with five regions on two inputs: f̂(X) = Σ_{m=1}^5 c_m I{(X_1, X_2) ∈ R_m}.

Regression Trees. Data: (x_i, y_i) for i = 1, ..., N, with x_i = (x_i1, x_i2, ..., x_ip). Aim: an algorithm that automatically decides the splitting variables and split points, and the tree topology. Model: M regions R_1, R_2, ..., R_M and a constant response c_m in each region, f(x) = Σ_{m=1}^M c_m I(x ∈ R_m). Criterion: minimization of the sum of squares Σ_i (y_i − f(x_i))^2. Best ĉ_m: the average of y_i in R_m, i.e. ĉ_m = ave(y_i | x_i ∈ R_m). Finding the best binary partition is computationally infeasible. How to proceed? A greedy algorithm.

Best Split. Consider a splitting variable j and a split point s. Define the pair of half-planes R_1(j, s) = {X | X_j ≤ s} and R_2(j, s) = {X | X_j > s}. Find j and s that solve min_{j,s} [ min_{c_1} Σ_{x_i ∈ R_1(j,s)} (y_i − c_1)^2 + min_{c_2} Σ_{x_i ∈ R_2(j,s)} (y_i − c_2)^2 ]. The inner minimization is solved by ĉ_1 = ave(y_i | x_i ∈ R_1(j, s)) and ĉ_2 = ave(y_i | x_i ∈ R_2(j, s)). Find the best pair (j, s) by scanning through all split points for each splitting variable and then scanning through all variables.
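
The exhaustive scan over splitting variables and split points can be sketched as follows; a minimal NumPy illustration, not an optimized implementation:

```python
import numpy as np

def best_split(X, y):
    """Greedy search for the best (j, s) split of one node (a sketch).

    X: (N, p) inputs, y: (N,) responses. Returns (j, s, rss).
    """
    N, p = X.shape
    best = (None, None, np.inf)
    for j in range(p):                              # scan all variables
        for s in np.unique(X[:, j]):                # scan all split points
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            # inner minimization: the constants are the region means
            rss = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
            if rss < best[2]:
                best = (j, s, rss)
    return best
```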

Tree Size. Adaptively chosen from the data. Grow a large tree T_0 until some minimum node size is reached, then prune it using cost-complexity pruning. Cost-complexity criterion: C_α(T) = Σ_{m=1}^{|T|} N_m Q_m(T) + α|T|, where Q_m(T) = (1/N_m) Σ_{x_i ∈ R_m} (y_i − ĉ_m)^2 and ĉ_m = (1/N_m) Σ_{x_i ∈ R_m} y_i. Idea: for each α, find the subtree T_α ⊆ T_0 that minimizes C_α(T). The tuning parameter α ≥ 0 governs the tradeoff between tree size and goodness of fit.
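
A small sketch of the cost-complexity criterion for a fixed tree, assuming the responses in each terminal node have been collected into arrays; this data layout is an assumption for illustration only:

```python
import numpy as np

def cost_complexity(leaves, alpha):
    """C_alpha(T) for a tree given its terminal nodes (a sketch).

    leaves: list of 1-D arrays, the responses y_i falling in each terminal node.
    alpha: nonnegative tuning parameter.
    """
    total = 0.0
    for y_m in leaves:
        y_m = np.asarray(y_m, dtype=float)
        c_m = y_m.mean()                  # node estimate c_hat_m
        total += ((y_m - c_m)**2).sum()   # N_m * Q_m(T)
    return total + alpha * len(leaves)    # penalty on tree size |T|
```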

Tree Size. Tuning parameter: for each α, there is a unique smallest subtree T_α that minimizes C_α(T). Use weakest-link pruning to find T_α: successively collapse the internal node that produces the smallest per-node increase in Σ_m N_m Q_m(T), and continue until the single-node tree is reached. This sequence must contain T_α. α is estimated by cross-validation: choose α̂ to minimize the cross-validated sum of squares. The final tree is T_α̂.

Classification Trees. With K classes, the proportion of class k observations in node m is p̂_mk = (1/N_m) Σ_{x_i ∈ R_m} I(y_i = k). Observations in node m are classified to class k(m) = arg max_k p̂_mk. Different measures Q_m(T) of node impurity: Misclassification error: (1/N_m) Σ_{i ∈ R_m} I(y_i ≠ k(m)) = 1 − p̂_{m,k(m)}. Gini index: Σ_{k ≠ k'} p̂_mk p̂_mk' = Σ_{k=1}^K p̂_mk (1 − p̂_mk). Cross-entropy (deviance): −Σ_{k=1}^K p̂_mk log p̂_mk. Cross-entropy and the Gini index are differentiable and hence more amenable to numerical optimization.
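
The three impurity measures written directly from the node proportions p̂_mk; a minimal NumPy sketch:

```python
import numpy as np

def node_impurities(p_hat):
    """Misclassification error, Gini index, and cross-entropy for one node.

    p_hat: (K,) array of class proportions p_hat_mk in the node.
    """
    p_hat = np.asarray(p_hat, dtype=float)
    misclass = 1.0 - p_hat.max()                              # 1 - p_hat_{m,k(m)}
    gini = np.sum(p_hat * (1.0 - p_hat))                      # sum_k p(1 - p)
    nz = p_hat[p_hat > 0]
    entropy = -np.sum(nz * np.log(nz))                        # -sum_k p log p
    return misclass, gini, entropy
```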

Classification Trees. When growing the tree, either the Gini index or cross-entropy should be used. To guide cost-complexity pruning, the misclassification rate is typically used.

Other issues and modifications. Categorical Predictors: given a predictor with q possible unordered values and a binary outcome, order the predictor classes according to the proportion falling in outcome class 1, then split the predictor as if it were ordered. Loss Matrix: in the multi-class case, modify the Gini index to Σ_{k ≠ k'} L_{kk'} p̂_mk p̂_mk'. For two classes, weight the observations in class k by L_{kk'}. Missing Predictor Values: for categorical predictors, make a new "missing" category; the general approach is to construct surrogate variables.

Disadvantages. Instability and high variance, due to the hierarchical nature of the splitting. Lack of smoothness, which can degrade performance in the regression setting. Difficulty with additive structures: consider Y = c_1 I(X_1 < t_1) + c_2 I(X_2 < t_2) + ɛ. The first split is on X_1 near t_1; the next split at both nodes should then be on X_2 at t_2.

Overview: Generalized additive models; Tree-based models; PRIM (bump hunting); MARS.

Introduction. PRIM: patient rule induction method. Finds boxes in feature space where the response average is high; it looks for maxima in the target function, hence "bump hunting". The box definitions are not given by a binary tree; instead they are characterized by peeling and pasting.

Algorithm. 1. Start with a maximal box containing all the training data. 2. Shrink the box by compressing one face, so as to peel off a proportion α of the observations, choosing the peel that produces the highest response mean in the remaining box. 3. Repeat step 2 until some minimal number of observations remains in the box. 4. Expand the box along any face, as long as the resulting box mean increases. 5. Steps 1-4 give a sequence of boxes, with different numbers of observations in each box; use cross-validation to choose a member of the sequence and call that box B_1. 6. Remove the data in box B_1 from the dataset and repeat steps 2-5 to obtain a second box, and continue to get as many boxes as desired.
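
A rough NumPy sketch of the top-down peeling phase (steps 1-3); the pasting and cross-validation steps are omitted, and the quantile-based peel is an implementation choice, not something prescribed by the slides:

```python
import numpy as np

def prim_peel(X, y, alpha=0.05, min_obs=10):
    """Peeling phase of PRIM (a sketch).

    Shrinks an axis-aligned box by removing a proportion alpha of the remaining
    observations from one face at a time, always choosing the peel that leaves
    the highest response mean inside the box.
    """
    inside = np.ones(len(y), dtype=bool)          # maximal box: all data
    boxes = [(inside.copy(), y.mean())]
    while inside.sum() > min_obs:
        best = None
        for j in range(X.shape[1]):
            xj = X[inside, j]
            lo, hi = np.quantile(xj, alpha), np.quantile(xj, 1 - alpha)
            for keep in (X[:, j] >= lo, X[:, j] <= hi):   # peel lower or upper face
                trial = inside & keep
                if trial.sum() < min_obs or trial.sum() == inside.sum():
                    continue
                mean = y[trial].mean()
                if best is None or mean > best[1]:
                    best = (trial, mean)
        if best is None:
            break
        inside = best[0]
        boxes.append((inside.copy(), best[1]))
    return boxes   # sequence of boxes; a member would be chosen by cross-validation
```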

Algorithm Illustration. (Figure: two classes, blue (class 0) and red (class 1).)

PRIM and CART. PRIM handles categorical variables and missing values like CART. PRIM has no simple way to deal with k > 2 classes simultaneously; run PRIM separately for each class versus a baseline class. The advantage of PRIM over CART is patience: CART fragments the data quite quickly, taking log_2(N) − 1 splits before running out of data, whereas PRIM takes approximately −log(N)/log(1 − α) peeling steps before running out of data.

Overview: Generalized additive models; Tree-based models; PRIM (bump hunting); MARS.

Introduction. MARS: Multivariate Adaptive Regression Splines. A generalization of stepwise linear regression and a modification of CART for the regression setting; well suited for high-dimensional problems. Uses expansions in piecewise linear basis functions of the form (x − t)_+ and (t − x)_+. The two functions form a reflected pair with a knot at the value t. (Figure: the reflected pair (t − x)_+ and (x − t)_+ plotted against x, with the knot at t.)

MARS description. Idea: form reflected pairs for each input X_j with knots at each observed value x_ij of that input. The collection of basis functions is C = {(X_j − t)_+, (t − X_j)_+ : t ∈ {x_1j, x_2j, ..., x_Nj}, j = 1, 2, ..., p}. The model has the form f(X) = β_0 + Σ_{m=1}^M β_m h_m(X), where each h_m(X) is a function in C, or a product of two or more such functions. Given the h_m, the β_m are estimated by standard linear regression.
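
The reflected pairs and the candidate collection C can be written down explicitly; a minimal NumPy sketch:

```python
import numpy as np

def reflected_pair(x, t):
    """Reflected pair of piecewise-linear basis functions with knot t."""
    return np.maximum(x - t, 0.0), np.maximum(t - x, 0.0)

def mars_candidates(X):
    """Candidate set C: one (j, t) pair per input and per observed value.

    Each (j, t) stands for the reflected pair (X_j - t)_+ and (t - X_j)_+.
    """
    N, p = X.shape
    return [(j, t) for j in range(p) for t in np.unique(X[:, j])]
```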

Basis Functions. Start with the constant function h_0(X) = 1 in the model set M. At each stage, consider as a new basis-function pair all products of a function h_m in M with one of the reflected pairs in C. Add to M the term of the form β̂_{M+1} h_l(X)·(X_j − t)_+ + β̂_{M+2} h_l(X)·(t − X_j)_+, with h_l ∈ M, that produces the largest decrease in training error. Coefficients are estimated by least squares. Continue until M contains some preset maximum number of terms. Restriction: each input appears at most once in a product.

MARS illustration. (Figure: the model set M in the left column and the candidate set C in the right column; selected functions shown in red.)

Backward deletion. The model M is large and typically overfits the data. At each stage, the term whose removal causes the smallest increase in residual squared error is deleted, giving an estimated best model f̂_λ for each number of terms λ. Use generalized cross-validation (GCV) to choose the optimal λ. GCV criterion: GCV(λ) = Σ_{i=1}^N (y_i − f̂_λ(x_i))^2 / (1 − M(λ)/N)^2. M(λ) is the effective number of parameters: M(λ) = r + cK, with r the number of linearly independent basis functions in M, K the number of knots, and c = 3 (c = 2 when the model is restricted to be additive).
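
The GCV criterion computed from fitted values and the counts r and K; a minimal sketch with illustrative argument names:

```python
import numpy as np

def gcv(y, y_hat, n_basis, n_knots, additive_only=False):
    """Generalized cross-validation criterion for MARS (a sketch).

    y, y_hat: observed and fitted responses; n_basis: number of linearly
    independent basis functions r; n_knots: number of knots K.
    """
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    N = len(y)
    c = 2.0 if additive_only else 3.0          # c = 2 for additive models, else 3
    M_lambda = n_basis + c * n_knots           # effective number of parameters
    rss = np.sum((y - y_hat)**2)
    return rss / (1.0 - M_lambda / N)**2
```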

Advantages. Piecewise linear functions operate locally. Computations: consider the product of a function in M with each of the N reflected pairs for an input X_j; only O(N) operations are needed to try every knot. Multiway products are built up from products involving terms already in the model, a reasonable working assumption.

MARS for classification. Two classes: code the response as 0/1 and treat the problem as regression. Multiclass: use 0/1 indicator variables and multiresponse MARS regression; this approach can suffer from masking problems. PolyMARS is specifically designed for classification: it uses a multiple logistic framework, uses a quadratic approximation to the multinomial log-likelihood to search for the next basis-function pair, and fits the enlarged model by maximum likelihood.

References. Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag.