Variable Selection and Weighting by Nearest Neighbor Ensembles


Variable Selection and Weighting by Nearest Neighbor Ensembles Jan Gertheiss (joint work with Gerhard Tutz) Department of Statistics University of Munich WNI 2008

Nearest Neighbor Methods Introduction One of the simplest and most intuitive techniques for statistical discrimination (Fix & Hodges, 1951). Nonparametric and memory based. Given data $(y_i, x_i)$ with categorical response $y_i \in \{1,\dots,G\}$ and (metric) p-dimensional predictor $x_i$: place a new observation with unknown class label into the class of the observation from the training set that is closest to the new observation with respect to the covariates. Closeness, resp. the distance $d$ of two observations, is derived from a specific metric in the predictor space; here, the Euclidean metric $d(x_i, x_r) = \sqrt{\sum_{j=1}^{p} (x_{ij} - x_{rj})^2}$.
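For concreteness, a minimal base-R sketch of this classification rule (the function name and interface are invented for illustration; this is not code from the talk):

```r
## k-nearest-neighbor classification with the Euclidean metric, base R only.
knn_predict <- function(X_train, y_train, x_new, k = 1) {
  ## squared Euclidean distances from x_new to all training observations
  d2 <- rowSums(sweep(X_train, 2, x_new)^2)
  neighbors <- order(d2)[seq_len(k)]
  ## majority vote among the k nearest neighbors
  names(which.max(table(y_train[neighbors])))
}
```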

Nearest Neighbor Methods Class probability estimates Use not only the first but the k nearest neighbors. The relative frequency of predictions in favor of category g among these neighbors can be seen as an estimate of the probability of category g. The estimates $\hat\pi_{ig}$ can take the values $h/k$, $h \in \{0,\dots,k\}$. If the neighbors are weighted with respect to their distance to the observation of interest, $\hat\pi$ can in principle take all values in $[0, 1]$.

Nearest Neighbor Ensembles Basic Concept Final estimation by an ensemble of single predictors: use the ensemble formula for computing the probability that observation i falls into category g, $\hat\pi_{ig} = \sum_{j=1}^{p} c_j\,\hat\pi_{ig(j)}$, with $c_j \ge 0$ for all $j$ and $\sum_{j=1}^{p} c_j = 1$, where the k-nearest-neighbor estimates $\hat\pi_{ig(j)}$ are based on predictor j only. The weights, or coefficients, $c_1,\dots,c_p$ need to be determined.
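A rough sketch of how the per-predictor estimates $\hat\pi_{ig(j)}$ and their weighted combination could be computed (under the notation above; the function name and interface are invented for illustration, and the weight vector c_w is assumed to be given):

```r
## Per-predictor k-NN class probability estimates and their weighted average.
## X: n x p predictor matrix, y: factor with G levels, x_new: length-p vector,
## c_w: weight vector c = (c_1, ..., c_p).
nne_probs <- function(X, y, x_new, c_w, k = 3) {
  G <- nlevels(y); p <- ncol(X)
  pi_hat <- matrix(0, G, p, dimnames = list(levels(y), colnames(X)))
  for (j in seq_len(p)) {
    ## neighbors with respect to predictor j only
    nn <- order(abs(X[, j] - x_new[j]))[seq_len(k)]
    ## relative class frequencies among these k neighbors
    pi_hat[, j] <- table(factor(y[nn], levels = levels(y))) / k
  }
  ## ensemble estimate: weighted average over predictors
  drop(pi_hat %*% c_w)
}
```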

Nearest Neighbor Ensembles More Flexibility? Why not $\hat\pi_{ig} = \sum_j c_{gj}\,\hat\pi_{ig(j)}$, with $c_{gj} \ge 0$ for all $g, j$ and $\sum_j c_{gj} = 1$ for all $g$? It can be shown: the restriction $c_{1j} = \dots = c_{Gj} = c_j$ is the only possibility to ensure that 1. $\hat\pi_{ig} \ge 0$ for all $g$ and 2. $\sum_g \hat\pi_{ig} = 1$, for all possible future estimates $\{\hat\pi_{ig(j)}\}$ with 1. $\hat\pi_{ig(j)} \ge 0$ for all $g, j$ and 2. $\sum_g \hat\pi_{ig(j)} = 1$ for all $j$.

Determination of Weights Principles Given all $\{\hat\pi_{ig(j)}\}$, the matrix $\hat\Pi$ with $(\hat\Pi)_{ig} = \hat\pi_{ig}$ depends on $c = (c_1,\dots,c_p)^T$. Given the training data with predictors $x_1,\dots,x_n$ and true class labels $y = (y_1,\dots,y_n)^T$, a previously chosen loss function, or score, $L(y, \hat\Pi)$ is minimized over all possible $c$. Note: the categorical response $y_i$ is alternatively represented by a vector $z_i = (z_{i1},\dots,z_{iG})^T$ of dummy variables with $z_{ig} = 1$ if $y_i = g$ and $z_{ig} = 0$ otherwise.

Determination of Weights Possible loss functions Log Score: $L(y, \hat\Pi) = \sum_i \sum_g z_{ig} \log(1/\hat\pi_{ig})$. (+) likelihood based; (-) hypersensitive; (-) inapplicable for nearest neighbor estimates, since single $\hat\pi_{ig}$ can be exactly zero. Approximate Log Score: $L(y, \hat\Pi) = \sum_i \sum_g z_{ig} \left( (1 - \hat\pi_{ig}) + \tfrac{1}{2}(1 - \hat\pi_{ig})^2 \right)$. (+) hypersensitivity removed; (-) not "incentive compatible" (Selten, 1998), i.e. the expected loss $E(L) = \sum_{y=1}^{G} \pi_y L(y, \hat\pi)$ is not minimized by $\hat\pi_g = \pi_g$.

Determination of Weights Possible loss functions Quadratic Loss / Brier Score: $L(y, \hat\Pi) = \sum_i \sum_g (z_{ig} - \hat\pi_{ig})^2$ (introduced by Brier, 1950). (+) not hypersensitive; (+) incentive compatible (see e.g. Selten, 1998); (+) also takes into account how the estimated probabilities are distributed over the false classes.

Determination of Weights Practical implementation 1. For each observation i create a matrix $P_i$ of predictions: $(P_i)_{gj} = \hat\pi_{ig(j)}$. 2. Create a vector $z = (z_1^T,\dots,z_n^T)^T$ and a matrix $P = (P_1^T,\dots,P_n^T)^T$. 3. The Brier Score as a function of $c$ can then be written in matrix notation: $L(c) = (z - Pc)^T (z - Pc)$. 4. Given the restrictions $c_j \ge 0$ for all $j$ and $\sum_j c_j = 1$, the weights $c_j$ can be determined using quadratic programming methods, e.g. via the R add-on package quadprog (see the sketch below). Given the approximate log score, the weights can be determined in a similar way.
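A minimal sketch of step 4 (assuming P and z have been built as in steps 1 and 2; the function name, the small ridge term, and the final clipping are additions for this illustration, not part of the original talk):

```r
## Estimate the ensemble weights c by minimizing the Brier score
## L(c) = (z - Pc)'(z - Pc) subject to c_j >= 0 and sum_j c_j = 1,
## using quadprog::solve.QP.
library(quadprog)

fit_nne_weights <- function(P, z, ridge = 1e-8) {
  p <- ncol(P)
  ## solve.QP minimizes 1/2 c'Dc - d'c; with D = P'P and d = P'z this
  ## equals the Brier score up to an additive constant
  Dmat <- crossprod(P) + diag(ridge, p)  # tiny ridge keeps D positive definite
  dvec <- drop(crossprod(P, z))
  ## first constraint column: sum(c) = 1 (equality, meq = 1);
  ## remaining identity columns: c_j >= 0
  Amat <- cbind(rep(1, p), diag(p))
  bvec <- c(1, rep(0, p))
  sol  <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)
  c_hat <- pmax(sol$solution, 0)         # clip tiny negative round-off
  c_hat / sum(c_hat)
}
```

Since the approximate log score is also quadratic in c, the same routine applies with Dmat and dvec adapted accordingly.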

Variable Selection Variable selection means setting $c_j = 0$ for some $j$. Thresholding: hard, $c_j = 0$ if $c_j < t$ and $c_j$ kept otherwise, e.g. with $t = 0.25\,\max_j\{c_j\}$; soft, $c_j = (c_j - t)_+$; in both cases followed by rescaling. Lasso-based approximate solutions: if the restrictions are replaced by $\sum_j |c_j| \le s$, a lasso-type problem (Tibshirani, 1996) arises; the lasso-typical selection characteristics cause $c_j = 0$ for some $j$ (followed by rescaling and $c_j = (c_j)_+$). A small sketch of the thresholding rules is given below.
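An illustrative sketch of the two thresholding rules (the function names and the explicit rescaling step are shorthand for this illustration, not code from the talk):

```r
## Hard thresholding: set weights below t = frac * max(c_hat) to zero,
## then rescale so the remaining weights sum to 1.
hard_threshold <- function(c_hat, frac = 0.25) {
  t <- frac * max(c_hat)
  c_new <- ifelse(c_hat < t, 0, c_hat)
  c_new / sum(c_new)
}

## Soft thresholding: c_j <- (c_j - t)_+, followed by rescaling.
soft_threshold <- function(c_hat, t) {
  c_new <- pmax(c_hat - t, 0)
  c_new / sum(c_new)
}
```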

Including Interactions The matrix P may be augmented by including interactions of predictors, i.e. adding all predictions $\hat\pi_{ig(jl)}$, resp. $\hat\pi_{ig(jlm)}$, based on two or even three predictors. Feasible for small-scale problems only; P then has $p + \binom{p}{2} + \dots$ columns.
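To illustrate the combinatorial growth: for p = 10 predictors, including all two- and three-way terms gives 10 + 45 + 120 = 175 columns, which appears consistent with the term numbers in the simulation figures below. A quick check in R (illustrative only):

```r
p <- 10
p + choose(p, 2) + choose(p, 3)   # 175 columns of the augmented matrix P
## the index sets themselves can be enumerated with combn():
pairs   <- combn(p, 2)            # 2 x 45 matrix of predictor pairs
triples <- combn(p, 3)            # 3 x 120 matrix of predictor triples
```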

Simulation Studies I Two classification problems There are 10 independent features $x_j$, each uniformly distributed on $[0, 1]$. The 0/1-coded two-class response y is defined as follows (cf. Hastie et al., 2001): as an "easy" problem, $y = I(x_1 > 0.5)$, and as a "difficult" problem, $y = I\big(\operatorname{sign}\big(\prod_{j=1}^{3}(x_j - 0.5)\big) > 0\big)$.

Simulation Studies I Reference methods Nearest neighbor methods: nearest neighbor (k = 3) based extended forward/backward variable selection, with tuning parameter S as the number of simple forward/backward selection steps that are checked in each iteration (3NN FS / 3NN BS in the figures); weighted nearest neighbor (k = 5) prediction (w5NN), R add-on package kknn. Some alternative classification tools: linear discriminant analysis (LDA), R add-on package MASS; CART (Breiman et al., 1984) and Random Forests (Breiman, 2001), R add-on packages rpart and randomForest.

Simulation Studies I The easy problem Prediction performance on the test set in terms of the Brier Score and the number of misclassified observations. [Figure: loss boxplots for 3NN FS, 3NN BS, 3NNE LS, 3NNE QS, w5NN, LDA, CART and RF.]

Simulation Studies I The easy problem Variable selection/weighting by nearest neighbor based (extended) forward/backward selection (left) or nearest neighbor ensembles (right). [Figure: panel (1) approx. Log Score used, panel (2) Quadratic Score used; left: selected variables (variable no.) per realization, right: estimated weights by term no.]

Simulation Studies I The difficult problem Prediction performance on the test set in terms of the Brier Score and the number of misclassified observations. [Figure: loss boxplots for 3NN FS, 3NN BS, 3NNE LS, 3NNE QS, w5NN, LDA, CART and RF.]

Simulation Studies I The difficult problem Variable selection/weighting by nearest neighbor based (extended) forward/backward selection (left) or nearest neighbor ensembles (right). [Figure: panel (1) approx. Log Score used, panel (2) Quadratic Score used; left: selected variables (variable no.) per realization, right: estimated weights by term no.]

Simulation Studies II cf. Hastie & Tibshirani (1996)
1. 2-Dimensional Gaussian: two Gaussian classes in two dimensions.
2. 2-Dimensional Gaussian with 14 Noise: additionally 14 independent standard normal noise variables.
3. Unstructured: 4 classes, each with 3 spherical bivariate normal subclasses; means are chosen at random.
4. Unstructured with 8 Noise: augmented with 8 independent standard normal predictors.
5. 4-Dimensional Spheres with 6 Noise: first 4 predictors in class 1 independent standard normal, conditioned on radius > 3; class 2 without restrictions.
6. 10-Dimensional Spheres: all 10 predictors in class 1 conditioned on $22.4 < \text{radius}^2 < 40$.
7. Constant Class Probabilities: class probabilities (0.1, 0.2, 0.2, 0.5) are independent of the predictors.
8. Friedman's example: predictors in class 1 independent standard normal, in class 2 independent normal with mean and variance proportional to j and 1/j, respectively, j = 1,...,10.

Simulation Studies II Scenarios 1-4 [Figure: loss boxplots for 3NN FS, 3NN BS, 3NNE LS, 3NNE QS, w5NN, LDA, CART and RF in scenarios (1) 2 normals, (2) 2 normals with noise, (3) unstructured, (4) unstructured with noise.]

Simulation Studies II Scenarios 5-8 [Figure: loss boxplots for 3NN FS, 3NN BS, 3NNE LS, 3NNE QS, w5NN, LDA, CART and RF in scenarios (5) 4D sphere in 10 dimensions, (6) 10D sphere in 10 dimensions, (7) constant class probabilities, (8) Friedman's example.]

Real World Data Glass data, R package mlbench Forecast the type of glass (6 classes) on the basis of the chemical analysis, given in the form of 9 metric predictors. [Figure: result for 3NNE QS on all data; estimated weights by term no.]

Real World Data Glass data, R package mlbench Forecast the type of glass (6 classes) on the basis of the chemical analysis, given in the form of 9 metric predictors. [Figure: performance over 50 random splits; loss boxplots for 3NN FS, 3NN BS, 3NNE LS, 3NNE QS, w5NN, LDA, CART and RF.]

Summary Nearest neighbor ensembles Nonparametric probability estimation by an ensemble, i.e. a weighted average of nearest neighbor estimates, where each estimate is based on a single predictor or a very small subset of predictors. Not a black box (in contrast to many other ensemble methods). Good performance for small-scale problems, particularly if pure noise variables can be separated from relevant covariates. Direct application to high-dimensional problems with interactions is not recommended. For microarray data, the method is possibly useful as a nonparametric gene preselection tool. It may also be employed for the automatic choice of the most appropriate metric or the right neighborhood. Application to regression problems is possible as well.

References
BREIMAN, L. (2001): Random Forests, Machine Learning 45, 5-32.
BREIMAN, L. et al. (1984): Classification and Regression Trees, Monterey, CA: Wadsworth.
BRIER, G. W. (1950): Verification of forecasts expressed in terms of probability, Monthly Weather Review 78, 1-3.
FIX, E. and J. L. HODGES (1951): Discriminatory analysis - nonparametric discrimination: consistency properties, USAF School of Aviation Medicine, Randolph Field, Texas.
GERTHEISS, J. and G. TUTZ (2008): Feature Selection and Weighting by Nearest Neighbor Ensembles, University of Munich, Department of Statistics: Technical Report No. 33.
HASTIE, T. and R. TIBSHIRANI (1996): Discriminant adaptive nearest neighbor classification, IEEE Transactions on Pattern Analysis and Machine Intelligence 18, 607-616.
HASTIE, T. et al. (2001): The Elements of Statistical Learning, New York: Springer.
R Development Core Team (2007): R: A Language and Environment for Statistical Computing, Vienna: R Foundation for Statistical Computing.
SELTEN, R. (1998): Axiomatic characterization of the quadratic scoring rule, Experimental Economics 1, 43-62.
TIBSHIRANI, R. (1996): Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society B 58, 267-288.