The role of dimensionality reduction in classification


The role of dimensionality reduction in classification
Weiran Wang and Miguel Á. Carreira-Perpiñán
Electrical Engineering and Computer Science, University of California, Merced
http://eecs.ucmerced.edu

Low-dimensional classifiers

We consider constructing a nonlinear classifier by first reducing dimension nonlinearly and then classifying linearly. Three questions we ask in this paper:
❶ How to train a nonlinear classifier of the form y = g(F(x)), where g is a linear SVM, g(z) = wᵀz + b, and F is a nonlinear dimensionality reduction mapping (RBF network, deep net, GP, ...), optimizing jointly over g and F?
❷ What is the role of the nonlinear dimensionality reduction F in the resulting classifier?
❸ How good and fast is the resulting nonlinear, low-dimensional classifier?

❶ Learning low-dim features for a linear SVM

Consider binary classification throughout (see the multiclass case in the paper). Objective function over the parameters of g and F, given a dataset {(x_n, y_n)}, n = 1, ..., N, of (inputs, labels):

  min_{F,g,ξ}  λR(F) + ½‖w‖² + C Σ_{n=1}^N ξ_n
  s.t.  y_n (wᵀF(x_n) + b) ≥ 1 − ξ_n,  ξ_n ≥ 0,  n = 1, ..., N.

If F = identity, then y = g(F(x)) = wᵀx + b is a linear SVM and the problem is a convex quadratic program, easy to solve. If F is nonlinear, then y = g(F(x)) = wᵀF(x) + b is a nonlinear classifier and the problem is nonconvex, difficult to solve.
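To make the objective concrete, here is a minimal sketch (not from the paper; the names FX, reg_F, etc. are ours) that evaluates the objective above once the optimal slacks ξ_n = max(0, 1 − y_n(wᵀF(x_n) + b)) are substituted in:

```python
import numpy as np

def joint_objective(FX, y, w, b, C, reg_F=0.0, lam=1.0):
    """Value of lam*R(F) + 0.5*||w||^2 + C*sum(xi_n), with the optimal slacks
    xi_n = max(0, 1 - y_n*(w@F(x_n)+b)) plugged in.  FX holds F(x_n) as rows
    (N x L), y is in {-1,+1}^N, and reg_F is a caller-supplied value of R(F)."""
    slacks = np.maximum(0.0, 1.0 - y * (FX @ w + b))
    return lam * reg_F + 0.5 * (w @ w) + C * slacks.sum()
```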

❶ Learning low-dim features for a linear SVM (cont.)

Simple but suboptimal approach often used in practice ("filter" approach):
1. First reduce dimension of the inputs, ignoring the classifier. This requires optimizing a proxy objective function over F only (unsupervised, e.g. PCA; supervised, e.g. LDA).
2. Then fix F and train g: fit a linear SVM with inputs {F(x_n)} and labels {y_n}.
The features are not optimal for the linear SVM classifier, although they still help to remove out-of-manifold noise and speed up the classifier.

We want a "wrapper" approach instead, i.e. optimize the overall classification error jointly over the features F and the classifier g. We apply the method of auxiliary coordinates (MAC) for nested systems (Carreira-Perpiñán & Wang, AISTATS 2014).
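For reference, a minimal sketch of the filter baseline described above, using scikit-learn's PCA and LinearSVC on toy data (the data and hyperparameters are placeholders, not the paper's experimental setup):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Step 1: reduce dimension ignoring the classifier (PCA plays the proxy objective).
# Step 2: fix the projection and fit a linear SVM on the projected inputs.
X, y = make_classification(n_samples=500, n_features=50, n_informative=10, random_state=0)
filter_clf = make_pipeline(PCA(n_components=10), LinearSVC(C=1.0))
filter_clf.fit(X, y)
print("filter-approach training accuracy:", filter_clf.score(X, y))
```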

❶ Solution with auxiliary coordinates (MAC)

The problem is nested because of g(F(·)) in the constraints:

  min_{F,g,ξ}  λR(F) + ½‖w‖² + C Σ_{n=1}^N ξ_n
  s.t.  y_n (wᵀF(x_n) + b) ≥ 1 − ξ_n,  ξ_n ≥ 0,  n = 1, ..., N.

Break the nesting by introducing an auxiliary coordinate z_n per point:

  min_{F,g,ξ,Z}  λR(F) + ½‖w‖² + C Σ_{n=1}^N ξ_n
  s.t.  y_n (wᵀz_n + b) ≥ 1 − ξ_n,  ξ_n ≥ 0,  z_n = F(x_n),  n = 1, ..., N.

The auxiliary coordinates correspond to the low-dim projections but are treated as separate parameters.

❶ Solution with auxiliary coordinates (MAC) (cont.)

Solve this constrained problem with a quadratic-penalty method: optimize for a fixed penalty parameter μ > 0 and drive μ → ∞:

  min_{F,g,ξ,Z}  λR(F) + ½‖w‖² + C Σ_{n=1}^N ξ_n + (μ/2) Σ_{n=1}^N ‖z_n − F(x_n)‖²
  s.t.  y_n (wᵀz_n + b) ≥ 1 − ξ_n,  ξ_n ≥ 0,  n = 1, ..., N.

Alternating optimization of the penalty function for fixed μ:
g step: fit a linear SVM to {(z_n, y_n)}, n = 1, ..., N (a classifier using the auxiliary coordinates as inputs).
F step: fit a nonlinear mapping F to {(x_n, z_n)}, n = 1, ..., N (a regressor using the auxiliary coordinates as outputs).
Z step: set z_n = F(x_n) + γ_n y_n w, n = 1, ..., N (a closed-form update for each point's auxiliary coordinate z_n).
The g and F steps can run in parallel with each other, and the Z-step updates are independent across points, so they also parallelize.
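A compact sketch of one possible implementation of this loop (assumptions: binary labels in {−1, +1}, kernel ridge regression standing in for F, an untuned penalty schedule, and a random initialization of Z; this is not the authors' implementation):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import LinearSVC

def mac_train(X, y, L=2, C=1.0, mu_schedule=(1.0, 4.0, 16.0, 64.0), inner_its=5):
    rng = np.random.default_rng(0)
    Z = rng.normal(scale=0.1, size=(len(X), L))     # auxiliary coordinates (better: init from a filter)
    for mu in mu_schedule:                          # drive the penalty parameter up
        for _ in range(inner_its):
            # g step: linear SVM with the coordinates Z as inputs
            g = LinearSVC(C=C).fit(Z, y)
            w, b = g.coef_.ravel(), g.intercept_[0]
            # F step: nonlinear regressor with the coordinates Z as outputs
            F = KernelRidge(kernel="rbf", alpha=1e-3).fit(X, Z)
            FX = F.predict(X)
            # Z step: closed form z_n = F(x_n) + gamma_n * y_n * w (see the next slide)
            m = y * (FX @ w + b)                    # margins of the projected points
            gamma = np.where(m >= 1, 0.0, np.minimum(C / mu, (1 - m) / (w @ w)))
            Z = FX + (gamma * y)[:, None] * w
    return F, g
```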

❶ MAC algorithm: solution of the Z step

The Z step decouples into N independent problems, each a quadratic program on L parameters (L = dimension of the latent space):

  min_{z_n, ξ_n}  ‖z_n − F(x_n)‖² + (2C/μ) ξ_n
  s.t.  y_n (wᵀz_n + b) ≥ 1 − ξ_n,  ξ_n ≥ 0,  z_n ∈ R^L,

with closed-form solution z_n = F(x_n) + γ_n y_n w, where γ_n takes one of three possible values. Cost: O(L).

(Figure: the three cases of the margin-constraint multiplier, λ₁ = 0, 0 < λ₁ < c and λ₁ = c, showing the optimal z relative to F(x_n).)
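A per-point sketch of this closed form (our derivation from the quadratic program above, with c = 2C/μ: γ_n is 0 if the projected point already satisfies the margin, otherwise it either reaches the margin exactly or takes the capped step C/μ):

```python
import numpy as np

def z_step(Fx, y_n, w, b, C, mu):
    """Minimize ||z - F(x_n)||^2 + (2C/mu)*xi_n  s.t.  y_n*(w@z + b) >= 1 - xi_n, xi_n >= 0.
    Returns z = F(x_n) + gamma_n * y_n * w; cost O(L)."""
    margin = y_n * (w @ Fx + b)
    if margin >= 1:                              # constraint inactive: keep z = F(x_n)
        gamma = 0.0
    else:                                        # move along y_n*w: to the margin, or the capped step C/mu
        gamma = min(C / mu, (1 - margin) / (w @ w))
    return Fx + gamma * y_n * w
```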

❶ MAC algorithm: advantages

In summary, all the algorithm does is repeatedly fit a regressor and a linear SVM, and update each point's coordinates. It is like iterating a filter, but correcting the coordinates so that we converge to a true optimum.
No complicated gradients obtained from the chain rule; instead, it reuses existing algorithms: a linear SVM for the classifier g, and an RBF network, deep net, GP, etc. for the regressor F.
Parallel: the classifier and regressor are trained in parallel; in the multiclass case, K SVMs are trained in parallel (one per class); the N coordinates are updated in parallel.
Converges to a local optimum as μ → ∞.
Fast iterations if using standard optimization techniques (warm starts, caching matrix factorizations, inexact steps, etc.). Large progress in the first few iterations.

❷ Role of nonlinear dimension reduction in the linear SVM

The linear SVM g is less flexible than the nonlinear mapping F, so, intuitively, F will do most of the work.
Ideal case: collapse each class onto maximally linearly separable centroids (the corners of a simplex, if using L ≥ K − 1 dimensions).
In practice: the amount of collapse depends on how flexible F is (number of parameters, regularization).
This is very different from what PCA/LDA dimension reduction does. The optimal dimension reduction for linear classification destroys all manifold structure; it only cares about facilitating linear separability.

❷ Role of dimension reduction: experiments

We reduce dimension with radial basis function (RBF) networks:

  F(x) = Σ_{m=1}^M α_m φ_m(x),  where each basis function φ_m is a Gaussian.

Configuration of the low-dim projections as a function of the number of basis functions M: the more flexible F is (larger M), the more the classes collapse and separate.

(Figure: low-dim projections for a linear F and for M = 4, 10, 20, 40, 100, 200, 2000 basis functions.)
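A minimal sketch of such an RBF network (the centers and bandwidth are assumed given, e.g. from k-means, and the ridge penalty stands in for the regularizer λR(F)); the F step then reduces to a linear least-squares fit of the weights α_m to the current coordinates Z:

```python
import numpy as np

def rbf_features(X, centers, sigma):
    """N x M matrix of Gaussian basis functions phi_m(x) = exp(-||x - c_m||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_rbf_map(X, Z, centers, sigma, ridge=1e-3):
    """Fit the M x L weight matrix A (rows alpha_m) so that F(x) = Phi(x) @ A approximates Z."""
    Phi = rbf_features(X, centers, sigma)
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + ridge * np.eye(M), Phi.T @ Z)
```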

❷ Role of dimension reduction: experiments (cont.)

Relation between the dimension L and the number of classes K: the classification accuracy improves drastically as L increases from 1 and stabilizes at L ≥ K − 1, by which time the training samples are perfectly separated and the classes form point-like clusters approximately lying on the vertices of a simplex.

(Figure: latent projections for L = 1, ..., 10 with K = 2, 4, 6 classes, together with the original data.)

This gives a recipe to choose L: starting from L = K − 1, increase L until the classification error does not improve. Initialize the latent projections Z to the corners of a simplex.
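One way to build that initialization (an assumed construction, not necessarily the authors' exact one): place the K class centroids at the corners of a regular simplex in R^(K−1) and start each z_n near its class corner:

```python
import numpy as np

def simplex_corners(K):
    """K equidistant points in R^(K-1): center the K standard basis vectors of R^K
    and express them in an orthonormal basis of the (K-1)-dim subspace they span."""
    E = np.eye(K) - 1.0 / K
    _, _, Vt = np.linalg.svd(E)
    return E @ Vt[:K - 1].T                      # shape (K, K-1)

def init_Z(labels, K, noise=0.05, seed=0):
    """labels: integer class indices in {0, ..., K-1}."""
    rng = np.random.default_rng(seed)
    corners = simplex_corners(K)
    return corners[labels] + noise * rng.standard_normal((len(labels), K - 1))
```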

❸ How good/fast is the nonlinear low-dim classifier?

Consider dimension reduction with a radial basis function mapping

  F(x) = Σ_{m=1}^M α_m φ_m(x),  where each basis function φ_m is e.g. a Gaussian.

Then the nonlinear low-dim classifier has the form of a nonlinear SVM:

  y = Σ_{m=1}^M v_m φ_m(x) + b.

We have direct control over the classifier complexity through the number of basis functions M, unlike in a kernel SVM, where the number of support vectors is often very large. We can trade off classification error against runtime:
Training: linear in the sample size N and in the number of BFs M.
Testing: linear in the number of BFs M.
The error is comparable to an SVM's, but with far fewer basis functions.
In general, using the same MAC algorithm, we can use other forms of F (e.g. a neural net) and need not reduce dimension.
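Folding the learned linear SVM into the RBF map makes the v_m explicit: since y = wᵀF(x) + b = Σ_m (wᵀα_m) φ_m(x) + b, we get v_m = wᵀα_m. A minimal sketch (A is the M × L matrix of RBF weights from the F step; the names are ours):

```python
import numpy as np

def compose_classifier(A, w, b):
    """Collapse the linear SVM (w, b) and the RBF map F(x) = Phi(x) @ A into a single
    basis-function classifier:  y(x) = sign(Phi(x) @ v + b)  with  v = A @ w."""
    return A @ w, b
```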

❸ How good/fast is the nonlinear low-dim classifier? (cont.)

10k MNIST handwritten digit images of 28×28 = 784 dimensions, with K = 10 classes (one-versus-all). A low-dim SVM (L = K dimensions) does as well as a kernel SVM but uses 5.5 times fewer support vectors. The low-dim SVM is also faster to train and parallelizes trivially.

  Method                     Test error (%)   # BFs
  Nearest Neighbor           5.34             10 000
  Linear SVM                 9.20
  Gaussian SVM               2.93             13 827
  low-dim SVM                2.99              2 500
  LDA (9) + Gaussian SVM     10.67             8 740
  PCA (5) + Gaussian SVM     24.31            13 638
  PCA (10) + Gaussian SVM    7.44              5 894
  PCA (40) + Gaussian SVM    2.58             12 549
  PCA (40) + low-dim SVM     2.60              2 500

(Figures: classification error as a function of L; the low-dim space compared with a 2D PCA projection; parallel-processing speedup as a function of the number of processors.)

Conclusions

❶ Learning optimal nonlinear low-dim features for a linear SVM is a nonconvex problem, but it can be solved easily and efficiently with the method of auxiliary coordinates. The algorithm is very intuitive: repeatedly fit a regressor and a linear SVM, and update each point's coordinates. It is like iterating a filter, but correcting the coordinates so that we converge to a true optimum.
❷ The analysis shows that classes collapse and maximally separate in latent space, so the optimal dimensionality reduction for linear classification destroys the manifold structure. This justifies filter approaches that maximize class separability and minimize intra-class scatter, but also obviates them: just train features and classifier jointly.
❸ The resulting classifier is competitive with nonlinear SVMs but faster to train and at test time.
Matlab code: http://eecs.ucmerced.edu.
Future work: what is the interplay of dimensionality reduction with a nonlinear classifier?

Formulation for the multiclass problem

With K classes, we can use the one-versus-all scheme and have K SVMs, each of which classifies whether a point belongs to one class or not. The objective function is the sum of those of the K SVMs. In the Z step, we solve for each point a quadratic program on L + K variables with 2K constraints:

  min_{z, {ξ^k}}  ‖z − F(x)‖² + Σ_{k=1}^K C_k ξ^k
  s.t.  y^k ((w^k)ᵀz + b^k) ≥ 1 − ξ^k,  ξ^k ≥ 0,  k = 1, ..., K.

This has no closed-form solution, so we solve it numerically.
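A sketch of one way to solve this small QP per point (the slide only says it is solved numerically; the SciPy SLSQP solver and the variable names here are our choices):

```python
import numpy as np
from scipy.optimize import minimize

def z_step_multiclass(Fx, yk, W, bk, Ck):
    """Fx: F(x) in R^L; yk in {-1,+1}^K; W: K x L matrix of rows w_k; bk, Ck: length-K arrays.
    Minimizes ||z - F(x)||^2 + sum_k C_k xi_k  s.t.  y_k*(w_k@z + b_k) >= 1 - xi_k, xi_k >= 0."""
    L, K = len(Fx), len(yk)

    def objective(v):
        z, xi = v[:L], v[L:]
        return np.sum((z - Fx) ** 2) + Ck @ xi

    constraints = [{"type": "ineq",
                    "fun": lambda v, k=k: yk[k] * (W[k] @ v[:L] + bk[k]) - 1.0 + v[L + k]}
                   for k in range(K)]
    bounds = [(None, None)] * L + [(0.0, None)] * K      # xi_k >= 0
    v0 = np.concatenate([Fx, np.ones(K)])
    res = minimize(objective, v0, method="SLSQP", constraints=constraints, bounds=bounds)
    return res.x[:L]
```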

Document binary classification

Results on the PC/MAC subset of 20 newsgroups.

  Method                   % error (std)
  nearest neighbor         19.16 (0.74)
  linear SVM               13.5  (0.72)
  PCA (L = 2)              42.10 (1.22)
  LDA (L = 1)              14.21 (1.63)
  LMNN (L = 2)             15.91 (1.65)
  low-dim SVM (L = 1)      13.12 (0.67)
  low-dim SVM (L = 2)      12.94 (0.82)
  low-dim SVM (L = 20)     12.76 (0.81)

(Figure: 2D embeddings of the PC and MAC classes produced by PCA, LMNN, LDA and the low-dim SVM.)

MNIST odd/even classification

(Figure: classification errors of different algorithms as a function of dataset size (NN, linear SVM, Gaussian SVM, KPCA, KLDA, LDA+GSVM, PCA+GSVM, PCA+ours, ours), together with the KPCA and low-dim SVM embeddings of the odd and even digit classes.)