Kernel Density Estimation


Niels Richard Hansen (Univ. Copenhagen), Statistics Learning, October 13, 2011

If Y ∈ {1, ..., K} and g_k denotes the density of the conditional distribution of X given Y = k, the Bayes classifier is

    f(x) = \arg\max_k \pi_k g_k(x).

If ĝ_k, k = 1, ..., K, are density estimators (non-parametric kernel density estimators, say), then using the plug-in principle

    \hat{f}(x) = \arg\max_k \hat{\pi}_k \hat{g}_k(x)

is an estimator of the Bayes classifier. This is the non-parametric version of LDA.
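As an aside, the plug-in rule is straightforward to sketch in code. The following is a minimal illustration, assuming Gaussian kernel density estimators from scipy.stats.gaussian_kde and simulated two-class data; the data and all variable names are illustrative, not from the slides.

```python
# Minimal sketch of the plug-in Bayes classifier with kernel density estimates.
# Assumes numpy/scipy; data and names are illustrative only.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Simulated training data: two classes in R^2.
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(2.0, 1.0, size=(150, 2))])
y = np.array([0] * 200 + [1] * 150)

classes = np.unique(y)
# Class prior estimates pi_hat[k] and kernel density estimates g_hat[k].
pi_hat = {k: float(np.mean(y == k)) for k in classes}
g_hat = {k: gaussian_kde(X[y == k].T) for k in classes}  # gaussian_kde expects shape (d, n)

def f_hat(x_new):
    """Plug-in Bayes classifier: argmax_k pi_hat[k] * g_hat[k](x)."""
    x_new = np.atleast_2d(x_new)
    scores = np.stack([pi_hat[k] * g_hat[k](x_new.T) for k in classes])
    return classes[np.argmax(scores, axis=0)]

print(f_hat([[0.0, 0.0], [2.5, 2.0]]))  # expected: [0 1]
```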

Naive Bayes

High-dimensional kernel density estimation suffers from the curse of dimensionality. Assume that the X-coordinates are independent given Y; then

    g_k(x) = \prod_{i=1}^{p} g_{k,i}(x_i)

with g_{k,i} univariate densities. Consequently,

    \log \frac{\Pr(Y = k \mid X = x)}{\Pr(Y = K \mid X = x)}
        = \log \frac{\pi_k}{\pi_K} + \log \frac{g_k(x)}{g_K(x)}
        = \log \frac{\pi_k}{\pi_K} + \sum_{i=1}^{p} \underbrace{\log \frac{g_{k,i}(x_i)}{g_{K,i}(x_i)}}_{h_{k,i}(x_i)}
        = \log \frac{\pi_k}{\pi_K} + \sum_{i=1}^{p} h_{k,i}(x_i).

Naive Bayes Continued

The conditional distribution above is an example of a generalized additive model. Estimation of h_{k,i} using univariate (non-parametric) density estimators ĝ_{k,i},

    \hat{h}_{k,i}(x_i) = \log \frac{\hat{g}_{k,i}(x_i)}{\hat{g}_{K,i}(x_i)},

is known as naive (or even idiot's) Bayes.
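A hedged sketch of the corresponding naive Bayes classifier, with one univariate kernel density estimate per class and coordinate; the simulated data, function names, and settings are illustrative only.

```python
# Sketch of naive Bayes with univariate kernel density estimates per coordinate,
# classifying via log pi_k + sum_i log g_{k,i}(x_i). Illustrative only.
import numpy as np
from scipy.stats import gaussian_kde

def fit_naive_bayes_kde(X, y):
    classes = np.unique(y)
    pi_hat = {k: float(np.mean(y == k)) for k in classes}
    # One univariate KDE per class k and coordinate i: g_hat[k][i]
    g_hat = {k: [gaussian_kde(X[y == k, i]) for i in range(X.shape[1])] for k in classes}
    return classes, pi_hat, g_hat

def predict_naive_bayes(x, classes, pi_hat, g_hat):
    scores = []
    for k in classes:
        log_score = np.log(pi_hat[k])
        for i, xi in enumerate(x):
            log_score += np.log(g_hat[k][i](xi)[0])  # log g_{k,i}(x_i)
        scores.append(log_score)
    return classes[int(np.argmax(scores))]

# Illustrative usage with simulated two-class data in R^3.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 3)), rng.normal(1.5, 1.0, (100, 3))])
y = np.array([0] * 100 + [1] * 100)
model = fit_naive_bayes_kde(X, y)
print(predict_naive_bayes([0.0, 0.0, 0.0], *model),
      predict_naive_bayes([1.5, 1.5, 1.5], *model))
```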

Naive Bayes, Discrete Version

If some or all of the X variables are discrete, univariate kernel density estimation can be replaced by appropriate estimation of point probabilities. If all X_i take values in {a_1, ..., a_n}, the extreme implementation of naive Bayes is to estimate

    \hat{g}_{k,i}(r) = \frac{1}{N_k} \sum_{j : y_j = k} 1(x_{ji} = a_r), \qquad N_k = \sum_{j=1}^{N} 1(y_j = k).

This is a possible solution procedure for prac 7.
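For the discrete version the estimates are just relative frequencies. Below is a small illustrative sketch; the helper name and the toy data are made up, and this is not claimed to be the intended solution to prac 7.

```python
# Sketch of the discrete naive Bayes estimate: relative frequencies per class and coordinate.
# X is an (N, p) array of discrete values, y an (N,) array of class labels. Illustrative.
import numpy as np

def point_probabilities(X, y, levels):
    """Return g_hat[k][i][r] = (1/N_k) * #{j : y_j = k and x_ji = levels[r]}."""
    classes = np.unique(y)
    g_hat = {}
    for k in classes:
        Xk = X[y == k]
        g_hat[k] = [
            np.array([np.mean(Xk[:, i] == a) for a in levels])  # one vector per coordinate i
            for i in range(X.shape[1])
        ]
    return g_hat

# Example: three binary features, two classes.
X = np.array([[0, 1, 1], [0, 0, 1], [1, 1, 0], [1, 1, 1]])
y = np.array([0, 0, 1, 1])
print(point_probabilities(X, y, levels=[0, 1]))
```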

Generalized Additive Models

A generalized additive model of Y given X is specified by a link function g such that the mean μ(X) of Y given X satisfies

    g(\mu(X)) = \alpha + f_1(X_1) + \ldots + f_p(X_p).

This extends generalized linear models by allowing for non-linear but univariate effects given by the f_j-functions. The functions are not in general identifiable, and we can face a problem similar to collinearity, which is known as concurvity.

Generalized Additive Logistic Regression

An important example arises with Y ∈ {0, 1} and the logit link

    g(\mu) = \log\left(\frac{\mu}{1 - \mu}\right), \qquad \mu \in (0, 1).

Then

    \mu(X) = \Pr(Y = 1 \mid X) = \frac{\exp(\alpha + f_1(X_1) + \ldots + f_p(X_p))}{1 + \exp(\alpha + f_1(X_1) + \ldots + f_p(X_p))}.

As in logistic regression, we can use other link functions, such as the probit link g(μ) = Φ^{-1}(μ), where Φ is the distribution function of the standard normal distribution.
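To make the two links concrete, here is a tiny sketch that pushes an additive predictor through the logit and probit links; the numbers and names are illustrative only.

```python
# Tiny sketch of the logit and probit links applied to an additive predictor.
# Illustrative values; scipy provides expit (inverse logit) and the normal CDF/quantile.
import numpy as np
from scipy.special import expit
from scipy.stats import norm

alpha = -0.5
f_values = np.array([0.8, -0.3, 1.1])   # illustrative f_1(X_1), f_2(X_2), f_3(X_3)
eta = alpha + f_values.sum()            # additive predictor

mu_logit = expit(eta)                   # inverse logit: exp(eta) / (1 + exp(eta))
mu_probit = norm.cdf(eta)               # inverse probit: Phi(eta)

print(mu_logit, mu_probit)
# The links themselves: logit(mu) = log(mu / (1 - mu)), probit(mu) = norm.ppf(mu)
```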

Penalized Estimation

The general penalized minus-log-likelihood function is

    l_N(\alpha, f_1, \ldots, f_p) + \sum_{j=1}^{p} \lambda_j \int_a^b f_j''(x)^2 \, dx

with tuning parameters λ_1, ..., λ_p. The minimizer, if it exists, consists of natural cubic splines. For identification purposes we assume

    \sum_{i=1}^{N} f_j(x_{ij}) = 0, \qquad j = 1, \ldots, p.

This is equivalent to f_j = (f_j(x_{1j}), ..., f_j(x_{Nj}))^T being perpendicular to the column vector 1 for j = 1, ..., p. The penalization resolves overparameterization problems for the non-linear part of the fit but not for the linear part.

Informal Tests for Non-Linear Effects

With smoothing splines, each estimated function f̂_j has an associated linear smoother matrix S_j. The effective degrees of freedom for the non-linear part of the fit is

    df_j = \mathrm{trace}(S_j) - 1.

Implementations often perform ad hoc χ²-tests for the non-linear part using χ²-distributions with df_j degrees of freedom. These tests are at best justified by approximations and simulation studies, and can be used as guidelines only.

Spam Email Classification

Classifying an email as spam or regular is a good example of a problem where prediction is central and interpretation is secondary. The (simplistic) example in the book deals with 4601 emails to an employee at Hewlett-Packard. Each email is reduced to a 57-dimensional vector containing quantitative variables of word or special character percentages and quantitative variables describing the occurrence of capital letters.

Figure 9.1: Non-linear email spam predictor effects.

CART: Classification and Regression Trees

Trees can be viewed as basis expansions of simple functions,

    f(x) = \sum_{m=1}^{M} c_m 1(x \in R_m)

with R_1, ..., R_M ⊆ R^p disjoint. The CART algorithm is a heuristic, adaptive algorithm for basis function selection.

A recursive binary partition (a tree) is given by a list of splits

    \{(t_{01}), (t_{11}, t_{12}), (t_{21}, t_{22}, t_{23}, t_{24}), \ldots, (t_{n1}, \ldots, t_{n2^n})\}

and corresponding split variable indices

    \{(i_{01}), (i_{11}, i_{12}), (i_{21}, i_{22}, i_{23}, i_{24}), \ldots, (i_{n1}, \ldots, i_{n2^n})\}.

For instance,

    R_1 = (x_{i_{01}} < t_{01}) \cap (x_{i_{11}} < t_{11}) \cap \ldots \cap (x_{i_{n1}} < t_{n1}),

and we can determine if x ∈ R_1 in n steps, with M = 2^n.
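To illustrate the basis expansion view, a minimal sketch that evaluates f(x) = Σ_m c_m 1(x ∈ R_m) for disjoint axis-aligned rectangles; the regions and constants are made up for illustration.

```python
# Sketch: evaluating a tree as a basis expansion f(x) = sum_m c_m * 1(x in R_m)
# over disjoint, axis-aligned rectangles. Regions and constants are illustrative.
import numpy as np

# Each region is a list of (lower, upper) intervals, one per coordinate.
regions = [
    [(-np.inf, 0.5), (-np.inf, np.inf)],   # R_1: x_1 < 0.5
    [(0.5, np.inf), (-np.inf, 1.0)],       # R_2: x_1 >= 0.5, x_2 < 1.0
    [(0.5, np.inf), (1.0, np.inf)],        # R_3: x_1 >= 0.5, x_2 >= 1.0
]
c = [1.0, 2.5, -0.7]                        # constants c_m

def indicator(x, region):
    return all(lo <= xi < hi for xi, (lo, hi) in zip(x, region))

def f(x):
    return sum(c_m for c_m, R_m in zip(c, regions) if indicator(x, R_m))

print(f([0.2, 3.0]), f([1.0, 0.0]), f([1.0, 2.0]))  # 1.0 2.5 -0.7
```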

Figure 9.2: Recursive Binary Partitions

The figure shows a recursive partition of [0, 1]^2 and the representation of the partition by a tree. A binary tree of depth n can represent up to 2^n partitions/basis functions. We can determine which R_j an x belongs to by n recursive yes/no questions.

Figure 9.2: General Partitions

A general partition that cannot be represented by binary splits. With M sets in a general partition, we would in general need on the order of M yes/no questions to determine which set an x belongs to.

Figure 9.2: Recursive Binary Partitions (continued)

For a fixed partition R_1, ..., R_M the least squares estimates are

    \hat{c}_m = \bar{y}(R_m) = \frac{1}{N_m} \sum_{i : x_i \in R_m} y_i, \qquad N_m = |\{i \mid x_i \in R_m\}|.

The recursive partition allows for rapid computation of the estimates and rapid prediction for new observations.

Greedy Splitting Algorithm

With squared error loss and an unknown partition R_1, ..., R_M we would seek to minimize

    \sum_{i=1}^{N} (y_i - \bar{y}(R_{m(i)}))^2

over the possible binary, recursive partitions. But this is computationally difficult. An optimal single split of a region R is determined by

    \min_j \underbrace{\min_s \Big( \sum_{i : x_i \in R(j,s)} (y_i - \bar{y}(R(j,s)))^2 + \sum_{i : x_i \in R(j,s)^c} (y_i - \bar{y}(R(j,s)^c))^2 \Big)}_{\text{univariate optimization problem}}

with R(j, s) = \{x \in R \mid x_j < s\}. The tree growing algorithm recursively performs single, optimal splits on each of the regions obtained in the previous step.
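A hedged brute-force sketch of a single greedy split under squared error loss, scanning the split variable j and the split point s over observed values; real implementations are more efficient, and the data here is simulated.

```python
# Sketch of one greedy split under squared error loss: for each variable j and
# candidate split point s, compare the within-region sums of squares.
# Brute force (roughly O(N^2 p) per split); illustrative only.
import numpy as np

def rss(y):
    return float(np.sum((y - y.mean()) ** 2))

def best_split(X, y):
    best = (np.inf, None, None)  # (loss, j, s)
    N, p = X.shape
    for j in range(p):
        for s in np.unique(X[:, j]):
            left = X[:, j] < s
            if not left.any() or left.all():
                continue
            loss = rss(y[left]) + rss(y[~left])
            if loss < best[0]:
                best = (loss, j, s)
    return best

# Illustrative usage with simulated data; the true split is on variable 1 near 0.4.
rng = np.random.default_rng(1)
X = rng.uniform(size=(100, 3))
y = np.where(X[:, 1] < 0.4, 1.0, 3.0) + rng.normal(scale=0.1, size=100)
print(best_split(X, y))
```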

Tree Pruning

The full binary tree, T_0, representing the partition R_1, ..., R_M with M = 2^n may be too large. We prune it by snipping off leaves or subtrees. For any subtree T of T_0 with |T| leaves and partition R_1(T), ..., R_{|T|}(T), the cost-complexity of T is

    C_\alpha(T) = \sum_{i=1}^{N} (y_i - \bar{y}(R_{m(i)}(T)))^2 + \alpha |T|.

Theorem: There is a finite set of subtrees T_0 \supseteq T_{\alpha_1} \supseteq T_{\alpha_2} \supseteq \ldots \supseteq T_{\alpha_r} with 0 \leq \alpha_1 < \alpha_2 < \ldots < \alpha_r such that T_{\alpha_i} minimizes C_\alpha(T) for \alpha \in [\alpha_i, \alpha_{i+1}).
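In practice the cost-complexity path can be computed with standard software; for instance, scikit-learn exposes it (to the best of my knowledge, in recent versions) through cost_complexity_pruning_path and the ccp_alpha parameter. A sketch with simulated data:

```python
# Sketch of cost-complexity pruning with scikit-learn, assuming a version where
# cost_complexity_pruning_path and ccp_alpha are available. Data is simulated.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(size=(300, 2))
y = np.where(X[:, 0] < 0.5, 0.0, 2.0) + rng.normal(scale=0.3, size=300)

full_tree = DecisionTreeRegressor(random_state=0).fit(X, y)
path = full_tree.cost_complexity_pruning_path(X, y)   # the alpha breakpoints of the theorem
print(path.ccp_alphas[:5])

# Refit at a chosen alpha: larger alpha gives a smaller subtree T_alpha.
pruned = DecisionTreeRegressor(random_state=0,
                               ccp_alpha=path.ccp_alphas[len(path.ccp_alphas) // 2]).fit(X, y)
print(full_tree.get_n_leaves(), pruned.get_n_leaves())
```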

Node Impurities and Classification Trees

Define the node impurity as the average loss for the node R,

    Q(R) = \frac{1}{N(R)} \sum_{i : x_i \in R} (y_i - \bar{y}(R))^2.

The greedy split of R is found by

    \min_j \min_s \big( N(R(j,s)) Q(R(j,s)) + N(R(j,s)^c) Q(R(j,s)^c) \big)

with R(j, s) = \{x \in R \mid x_j < s\}, and we have

    C_\alpha(T) = \sum_{m=1}^{|T|} N(R_m(T)) Q(R_m(T)) + \alpha |T|.

If Y takes K discrete values, we focus on the node estimate for R_m(T) in the tree T given by

    \hat{p}_m(T)(k) = \frac{1}{N_m} \sum_{i : x_i \in R_m(T)} 1(y_i = k).

Node Impurities and Classification Trees (continued)

The loss functions for classification enter in the specification of the node impurities used for splitting and cost-complexity computations. Examples:

0-1 loss gives the misclassification error impurity:

    Q(R_m(T)) = 1 - \max\{\hat{p}(R_m(T))(1), \ldots, \hat{p}(R_m(T))(K)\}.

Likelihood loss gives the entropy impurity:

    Q(R_m(T)) = -\sum_{k=1}^{K} \hat{p}(R_m(T))(k) \log \hat{p}(R_m(T))(k).

The Gini index gives the impurity:

    Q(R_m(T)) = \sum_{k=1}^{K} \hat{p}(R_m(T))(k) \big(1 - \hat{p}(R_m(T))(k)\big).
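All three impurities are simple functions of the class proportions in a node; a small illustrative sketch follows.

```python
# Sketch of the three node impurities as functions of the class proportions p_hat.
import numpy as np

def misclassification(p):
    return 1.0 - float(np.max(p))

def entropy(p):
    p = p[p > 0]                      # convention: 0 * log 0 = 0
    return float(-np.sum(p * np.log(p)))

def gini(p):
    return float(np.sum(p * (1.0 - p)))

p_hat = np.array([0.7, 0.2, 0.1])     # illustrative class proportions in a node
print(misclassification(p_hat), entropy(p_hat), gini(p_hat))
```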

Figure 9.3: Node impurities.

Spam Example

Sensitivity and Specificity

The sensitivity is the probability of predicting 1 given that the true value is 1 (predict a case given that there is a case):

    \text{sensitivity} = \Pr(f(X) = 1 \mid Y = 1) = \frac{\Pr(Y = 1, f(X) = 1)}{\Pr(Y = 1, f(X) = 1) + \Pr(Y = 1, f(X) = 0)}.

The specificity is the probability of predicting 0 given that the true value is 0 (predict that there is no case given that there is no case):

    \text{specificity} = \Pr(f(X) = 0 \mid Y = 0) = \frac{\Pr(Y = 0, f(X) = 0)}{\Pr(Y = 0, f(X) = 0) + \Pr(Y = 0, f(X) = 1)}.
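Empirically, both quantities are simple ratios of counts; an illustrative sketch with made-up labels:

```python
# Sketch: empirical sensitivity and specificity from 0/1 predictions and true labels.
import numpy as np

def sensitivity_specificity(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    sensitivity = tp / (tp + fn)      # estimates Pr(f(X) = 1 | Y = 1)
    specificity = tn / (tn + fp)      # estimates Pr(f(X) = 0 | Y = 0)
    return sensitivity, specificity

print(sensitivity_specificity([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # (2/3, 1/2)
```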

ROC Curves

ROC: receiver operating characteristic.
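An ROC curve traces sensitivity (true positive rate) against 1 - specificity (false positive rate) as the classification threshold varies; a brief illustrative sketch with made-up scores:

```python
# Sketch: tracing an ROC curve by varying the threshold on predicted probabilities.
# Scores and labels are illustrative only.
import numpy as np

y_true = np.array([0, 0, 1, 0, 1, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.5, 0.3])  # e.g. estimated Pr(Y = 1 | X)

positives = y_true == 1
negatives = ~positives

roc_points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
for t in np.sort(np.unique(scores))[::-1]:
    y_pred = scores >= t
    tpr = float(np.mean(y_pred[positives]))   # sensitivity
    fpr = float(np.mean(y_pred[negatives]))   # 1 - specificity
    roc_points.append((fpr, tpr))

print(roc_points)  # ends at (1.0, 1.0) once the threshold falls below every score
```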