Learning with $L_{q<1}$ vs $L_1$-norm regularisation with exponentially many irrelevant features

Ata Kabán, Robert J. Durrant
School of Computer Science, The University of Birmingham, Birmingham B15 2TT, UK

ECML 2008, 15-19 September 2008, Antwerp, Belgium

Roadmap
- Intro: $L_{q<1}$-norm regularised logistic regression
- Analysis: generalisation & sample complexity
- Implementation
- Experiments & results
- Conclusions

Intro: $L_{q<1}$-regularised logistic regression

Training set $\{(x_j, y_j)\}_{j=1}^{n}$, with $x_j \in \mathbb{R}^m$ the inputs and $y_j \in \{-1, 1\}$ their labels.

Scenario of interest: few ($r \ll m$) relevant features, small sample size $n \ll m$.

Consider regularised logistic regression for concreteness:
$$\max_{w} \sum_{j=1}^{n} \log p(y_j \mid x_j, w) \quad \text{subject to} \quad \|w\|_q \le A \qquad (1)$$
where $p(y \mid w^T x) = 1/(1 + \exp(-y\, w^T x))$ and $\|w\|_q = \left( \sum_{i=1}^{m} |w_i|^q \right)^{1/q}$.

- if $q = 2$: L2-regularised ("ridge")
- if $q = 1$: L1-regularised ("lasso")
- if $q < 1$: $L_{q<1}$-regularised: non-convex, non-differentiable at 0
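To make the sparsity-inducing effect of $q < 1$ concrete, here is a minimal numeric illustration (added here, not part of the original slides): two weight vectors with the same $L_2$ norm, one concentrated on a single feature and one spread over all features, compared under $\|\cdot\|_q$ for $q \in \{2, 1, 0.5\}$.

```python
import numpy as np

def lq_norm(w, q):
    """(sum_i |w_i|^q)^(1/q): a norm for q >= 1, a quasi-norm for q < 1."""
    return np.sum(np.abs(w) ** q) ** (1.0 / q)

w_sparse = np.array([2.0, 0.0, 0.0, 0.0])   # all weight on one feature
w_dense  = np.array([1.0, 1.0, 1.0, 1.0])   # weight spread over four features

for q in (2.0, 1.0, 0.5):
    print(f"q={q}: sparse {lq_norm(w_sparse, q):.2f}, dense {lq_norm(w_dense, q):.2f}")
# q=2.0: both 2.00 (the L2 ball does not distinguish them)
# q=1.0: sparse 2.00, dense  4.00
# q=0.5: sparse 2.00, dense 16.00 (a constraint ||w||_q <= A strongly favours the sparse vector)
```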

Intro: $L_{q<1}$-regularised logistic regression

L1-regularisation is a workhorse in machine learning:
- sparsity
- convexity
- sample complexity logarithmic in $m$ (Ng, ICML 2004) [for L2-regularisation it is linear in $m$]

Non-convex norm regularisation seems to have added value:
- statistics (Fan & Li, 2001): oracle property
- signal reconstruction (Wipf & Rao, 2005)
- compressed sensing (Chartrand, 2008)
- 0-norm SVM classification (Weston et al., 2003) - but results are data-dependent
- genomic data classification (Liu et al., 2007) - used with data resampling

Question: When is $L_{q<1}$-norm regularisation superior for machine learning, i.e. in terms of generalisation ability and sample complexity?

Analysis: Sample complexity bound

- $\mathcal{H} = \{ h(x, y) = -\log p(y \mid w^T x) : x \in \mathbb{R}^m,\ y \in \{-1, 1\} \}$, the function class
- $er_P(h) = \mathbb{E}_{(x,y) \sim P}[h(x, y)]$, the true error of $h$
- $\hat{er}_z(h) = \frac{1}{n} \sum_{j=1}^{n} h(x_j, y_j)$, the sample error of $h$ on a training set $z$ of size $n$
- $opt_P(\mathcal{H}) = \inf_{h \in \mathcal{H}} er_P(h)$, the approximation error of $\mathcal{H}$
- $L(z) = \arg\min_{h \in \mathcal{H}} \hat{er}_z(h)$, the function returned by the learning algorithm

Theorem (extending a result of A. Ng, 2004 from $L_1$ to $L_{q<1}$). For all $\epsilon > 0$, $\delta > 0$, $m, n \ge 1$: in order to ensure that $er_P(L(z)) \le opt_P(\mathcal{H}) + \epsilon$ with probability at least $1 - \delta$, it is enough to have training set size
$$n = \Omega\!\left( (\log m) \cdot \mathrm{poly}\!\left(A,\ r^{1/q},\ 1/\epsilon,\ \log(1/\delta)\right) \right) \qquad (2)$$

- logarithmic in the data dimensionality $m$;
- polynomial in the number of relevant features, but growing with $r^{1/q}$.

Analysis: Comments

- The sample complexity bound derived is an upper bound on the log-loss error, and consequently also on the 0-1 error. It is quantitatively quite loose. However, it gives some insight into the behaviour of $L_q$-regularised logistic regression.
- For smaller $q$ the sample complexity grows faster in $r$, so a small $q$ is more advantageous in the small-$r$ case (rather than the large-$r$ case). This is also intuitive: when $r$ is large we may expect the sparsity-inducing effect not to be beneficial (under-fitting).
- In the very-small-$r$ case, the bound does not say whether a smaller $q$ is actually better, though experiments and some other existing kinds of analyses indicate so.
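As a concrete reading of the $r^{1/q}$ factor in (2) (the arithmetic below is an added illustration, not from the slides):
$$r^{1/q} = r \ \ (q = 1), \qquad r^{1/q} = r^2 \ \ (q = \tfrac{1}{2}), \qquad r^{1/q} = r^3 \ \ (q = \tfrac{1}{3});$$
$$\text{e.g. } r = 5:\ 5,\ 25,\ 125; \qquad r = 100:\ 100,\ 10^4,\ 10^6.$$
Halving $q$ squares the polynomial dependence on the number of relevant features, which is harmless when $r$ is a handful but dominates when $r$ is large.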

Analysis: Empirical check

[Figure: 0-1 error, test log-loss and validation log-loss as a function of $q$, with one curve per $r \in \{5, 10, 30, 50, 100\}$.]

Experiments on $m = 200$ dimensional data sets, varying the number of relevant features $r \in \{5, 10, 30, 50, 100\}$. The medians of 60 independent trials are shown and the error bars represent one standard error. The 0-1 errors are out of 100.

Implementation

Using the Lagrangian form of the objective, maximise w.r.t. $w$:
$$L = -\sum_{j=1}^{n} \log\left\{ 1 + \exp(-y_j\, w^T x_j) \right\} - \alpha \sum_{i=1}^{m} |w_i|^q \qquad (3)$$

Implementation is non-trivial since the objective function is not convex and not differentiable at zero. We discuss three methods:
- Method 1: Smoothing at zero
- Method 2: Local quadratic variational bounds
- Method 3: Local linear variational bounds

Methods 2 and 3 turn out to be equivalent in terms of finding a local optimum of the same objective.
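A minimal numpy sketch of objective (3), written as a loss to be minimised (i.e. the negative of (3)); this is an added illustration rather than the authors' code, and `np.logaddexp` is used for numerical stability.

```python
import numpy as np

def lq_penalised_logloss(w, X, y, alpha, q):
    """Negative of objective (3): sum_j log(1 + exp(-y_j w^T x_j)) + alpha * sum_i |w_i|^q.

    X : (n, m) inputs; y : (n,) labels in {-1, +1}; w : (m,) weight vector.
    """
    margins = y * (X @ w)                          # y_j * w^T x_j
    logloss = np.sum(np.logaddexp(0.0, -margins))  # log(1 + exp(-margin)), computed stably
    penalty = alpha * np.sum(np.abs(w) ** q)       # non-convex, non-differentiable at 0 for q < 1
    return logloss + penalty

# tiny usage example on random data where only feature 0 is relevant
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=20))
print(lq_penalised_logloss(np.zeros(5), X, y, alpha=1.0, q=0.5))
```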

Method 1: Using a smooth approximation

Proposed in (Liu et al., Statistical Applications in Genetics and Molecular Biology, 2007):
$$\sum_{i=1}^{m} |w_i|^q \approx \sum_{i=1}^{m} \left( w_i^2 + \gamma \right)^{q/2} \qquad (4)$$
where $\gamma$ is set to a small value.

Pro: Easy to implement; any nonlinear optimisation method applies.
Con: There is no good way to set $\gamma$:
- if $\gamma$ is too small, the optimisation becomes numerically unstable, especially when $q$ is also small;
- if $\gamma$ is larger, over-smoothing loses the sparsity effect.
Con: No guarantee of an increase in the objective at each iteration.
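A minimal sketch of Method 1 (an added illustration, not the implementation of Liu et al.): the penalty $|w_i|^q$ is replaced by $(w_i^2 + \gamma)^{q/2}$ and the resulting smooth objective is handed to a generic optimiser; the value $\gamma = 10^{-6}$ and the use of SciPy's L-BFGS-B are arbitrary choices made for the example.

```python
import numpy as np
from scipy.optimize import minimize

def smoothed_objective(w, X, y, alpha, q, gamma=1e-6):
    """Smoothed penalised log-loss: |w_i|^q is replaced by (w_i^2 + gamma)^(q/2), cf. (4)."""
    margins = y * (X @ w)
    logloss = np.sum(np.logaddexp(0.0, -margins))
    penalty = alpha * np.sum((w ** 2 + gamma) ** (q / 2.0))
    return logloss + penalty

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=50))    # one relevant feature out of ten

res = minimize(smoothed_objective, x0=np.zeros(10),
               args=(X, y, 1.0, 0.5), method="L-BFGS-B")
print(np.round(res.x, 3))   # weights on the irrelevant features shrink towards zero
```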

Method 2: Local quadratic variational bounds

With $q < 1$, $|w_i|^q$ is a concave function of $w_i^2$. Using convex duality, we get:
$$f(w_i) = |w_i|^q = \min_{\lambda_i} \left\{ \lambda_i w_i^2 - f^*(\lambda_i) \right\} \qquad (5)$$
$$f^*(\lambda_i) = \min_{\eta_i} \left\{ \lambda_i \eta_i^2 - f(\eta_i) \right\} \qquad (6)$$
$$|w_i|^q \le \frac{f'(\eta_i)}{2\eta_i}\left( w_i^2 - \eta_i^2 \right) + f(\eta_i) = \frac{1}{2}\left\{ q\,|\eta_i|^{q-2} w_i^2 + (2-q)\,|\eta_i|^q \right\} \qquad (7)$$
with equality when $\eta_i = \pm w_i$; the $\eta_i$ are variational parameters.

- The bound is quadratic in $w$.
- We note that in the case $q = 1$, the bound (7) recovers exactly the bound proposed in (Krishnapuram et al., PAMI, 2005).
- For $q < 1$ (in linear regression), the expression on the r.h.s. of (7) was proposed as an approximation in (Fan & Li, JASA, 2001). Now we see it is a rigorous upper bound (in fact an auxiliary function).
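A quick check of the tangency claim (an added verification, immediate from (7)): writing $g(w_i) = \frac{1}{2}\left\{ q|\eta_i|^{q-2} w_i^2 + (2-q)|\eta_i|^q \right\}$,
$$g(\pm\eta_i) = \tfrac{1}{2}\left\{ q|\eta_i|^{q} + (2-q)|\eta_i|^{q} \right\} = |\eta_i|^q,
\qquad
g'(w_i)\Big|_{w_i = \pm\eta_i} = q|\eta_i|^{q-2}(\pm\eta_i) = \frac{d}{dw_i}|w_i|^q\Big|_{w_i = \pm\eta_i},$$
so the quadratic bound agrees with $|w_i|^q$ in both value and slope at $w_i = \pm\eta_i$, as illustrated in the plots below.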

[Figure: the quadratic upper bound $\frac{1}{2}\left\{ q\,|\eta_i|^{q-2} w_i^2 + (2-q)\,|\eta_i|^q \right\}$ plotted against $|w_i|^q$. Left: $q = 0.5$, $\eta_i = 1$, so the quadratic upper bound is tangent at $\pm 1$. Right: $q = 0.5$, $\eta_i = 3$, so the quadratic upper bound is tangent at $\pm 3$.]

Iterative estimation algorithm

Initialisation
Loop until convergence to a local optimum:
- solve for the variational parameters $\eta$, i.e. tighten the bound
- solve for $w$, i.e. an L2-regularised logistic regression problem
End Loop

[Details in the paper.]
- We initialised $\eta$ to ones, which worked well. Other schemes could be investigated.
- We used the variational bound on the logistic term due to Jaakkola & Jordan, which conveniently yielded closed-form updates throughout. Other existing methods could also be used instead.
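A minimal sketch of this loop (an added illustration under assumptions: the inner weighted-L2 logistic regression subproblem is solved with a generic SciPy L-BFGS call rather than the Jaakkola-Jordan closed-form updates mentioned above):

```python
import numpy as np
from scipy.optimize import minimize

def fit_lq_logreg_quadratic_bound(X, y, alpha, q, n_outer=30, eps=1e-8):
    """Method 2 sketch: alternate between tightening the quadratic bound (7)
    (eta <- |w|) and minimising the resulting weighted-L2 surrogate objective."""
    n, m = X.shape
    w = np.zeros(m)
    eta = np.ones(m)                                  # initialise variational parameters to ones
    for _ in range(n_outer):
        # per-feature quadratic penalty weights from bound (7): (q/2) |eta_i|^(q-2)
        lam = 0.5 * q * np.maximum(np.abs(eta), eps) ** (q - 2.0)

        def surrogate(w_):
            margins = y * (X @ w_)
            logloss = np.sum(np.logaddexp(0.0, -margins))
            return logloss + alpha * np.sum(lam * w_ ** 2)   # weighted ridge penalty

        w = minimize(surrogate, w, method="L-BFGS-B").x      # inner L2-regularised logistic regression
        eta = np.abs(w)                                      # tighten the bound at the current w
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(70, 50))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=70))             # one relevant feature out of 50
w_hat = fit_lq_logreg_quadratic_bound(X, y, alpha=2.0, q=0.5)
print(np.flatnonzero(np.abs(w_hat) > 1e-3))                  # indices of features effectively retained
```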

Method 3: Local linear variational bound

Recently proposed by (Zou & Li, The Annals of Statistics, 2008); this appears to be a closer approximation.

[Figure: the linear upper bound plotted against $|w_i|^q$. Left: $q = 0.5$, $\eta_i = 0.3$, so the linear upper bound is tangent at $\pm 0.3$. Right: $q = 0.5$, $\eta_i = 3$, so the linear upper bound is tangent at $\pm 3$.]

We can derive it from convex duality in a similar way as before (details in the paper). We get:
$$|w_i|^q \le q\,|\eta_i|^{q-1}\,|w_i| + (1-q)\,|\eta_i|^q \qquad (8)$$
with equality when $\eta_i = \pm w_i$.

The estimation algorithm becomes:
Initialisation
Loop until convergence to a local optimum:
- solve for the variational parameters $\eta$, i.e. tighten the bound
- solve for $w$, i.e. an L1-regularised logistic regression problem
End Loop

However, the L1-regularised problem can itself be estimated exactly by the use of a local quadratic bound approximation. Consequently, deriving a generalised EM algorithm turns out to yield the same algorithm as the local quadratic bounding method above! [Details in the paper.] Therefore we implemented Method 2 for the experiments reported.
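For completeness, a minimal sketch of Method 3 as reweighted L1 (an added illustration, not the authors' implementation, which used Method 2): the per-feature L1 weights $c_i = q\,|\eta_i|^{q-1}$ from bound (8) are absorbed into a rescaling of the columns of $X$, so that a standard L1-penalised logistic regression solver (here scikit-learn's liblinear backend) can be reused for the inner step.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_lq_logreg_linear_bound(X, y, alpha, q, n_outer=20, eps=1e-10):
    """Method 3 sketch: each outer step solves an L1-regularised logistic regression
    with per-feature penalty weights c_i = q |eta_i|^(q-1), implemented by column rescaling."""
    n, m = X.shape
    w = np.zeros(m)
    eta = np.ones(m)                                        # variational parameters
    for _ in range(n_outer):
        c = q * np.maximum(np.abs(eta), eps) ** (q - 1.0)   # weights from bound (8)
        scale = alpha * c
        X_tilde = X / scale                                 # rescale columns: v_i = scale_i * w_i
        clf = LogisticRegression(penalty="l1", C=1.0,       # minimises ||v||_1 + sum_j logloss
                                 solver="liblinear", fit_intercept=False)
        clf.fit(X_tilde, y)
        w = clf.coef_.ravel() / scale                       # map back: w_i = v_i / (alpha * c_i)
        eta = np.abs(w)                                     # tighten the bound at the current w
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(70, 50))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=70))
print(np.flatnonzero(np.abs(fit_lq_logreg_linear_bound(X, y, alpha=2.0, q=0.5)) > 1e-6))
```

The column-rescaling trick works because $w^T x = v^T \tilde{x}$ with $v_i = \alpha c_i w_i$ and $\tilde{x}_i = x_i / (\alpha c_i)$, while $\alpha \sum_i c_i |w_i| = \|v\|_1$.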

Experiments & Results

Synthetic data sets following the experimental protocol previously used in (Ng, ICML 2004) in the first instance: 1000-dimensional data, with
- 1 relevant feature,
- 3 relevant features, or
- an exponential decay of feature relevance.

Training + validation set (to determine $\alpha$) size: 70 + 30 points. Test set of 100 points.

[Figure: test 0-1 errors, test log-loss, and number of features retained as a function of $q$, for the three settings: 1 relevant feature, 3 relevant features, and exponentially decaying feature relevance.]

Results on synthetic data from (Ng, 2004). Training set size $n_1 = 70$, validation set size = 30, and out-of-sample test set size = 100. The statistics are over 10 independent runs with dimensionality ranging from 100 to 1000.

[Figure: test 0-1 errors and test log-loss as a function of the data dimensionality (200 to 1000), with one curve per $q \in \{0.1, 0.3, 0.5, 0.7, 1\}$, for the 1-relevant-feature and 3-relevant-feature settings.]

Comparative results on 1000-dimensional synthetic data from (Ng, 2004). Each point is the median of more than 100 independent trials. The 0-1 errors are out of 100.

[Figure: 0-1 test errors and test log-loss for $q \in \{0.1, 0.5, 1\}$, with training + validation set sizes 52 + 23 and 35 + 15.]

Results on 5000-dimensional synthetic data with only one relevant feature and even smaller sample sizes. The improvement over L1 becomes larger. (The 0-1 errors are out of 100.)

Conclusions

- We studied $L_{q<1}$-norm regularisation for logistic regression both theoretically and empirically, for high-dimensional data with many irrelevant features.
- We developed a variational method for parameter estimation, and have shown an equivalence between the use of local quadratic and local linear variational bounds on the regularisation term.
- We found that $L_{q<1}$-norm regularisation is more suitable in cases where the number of relevant features is very small, and that it works very well despite a very large number of irrelevant features being present.
- In Bayesian terms, $L_{q<1}$-norm regularisation may be interpreted as MAP estimation with a generalised Laplace prior. Future work will assess the sample complexity of a full Bayesian estimate.

Selected References

J. Fan and R. Li. Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties. Journal of the American Statistical Association, Vol. 96, No. 456, Theory and Methods, December 2001.

B. Krishnapuram, L. Carin, M. Figueiredo, and A. Hartemink. Sparse Multinomial Logistic Regression: Fast Algorithms and Generalization Bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 6, pp. 957-968, 2005.

Z. Liu, F. Jiang, G. Tian, S. Wang, F. Sato, S. J. Meltzer, and M. Tan. Sparse Logistic Regression with Lp Penalty for Biomarker Identification. Statistical Applications in Genetics and Molecular Biology, Vol. 6, Issue 1, 2007.

A. Y. Ng. Feature Selection, L1 vs. L2 Regularization, and Rotational Invariance. Proc. ICML 2004.

H. Zou and R. Li. One-step Sparse Estimates in Nonconcave Penalized Likelihood Models. The Annals of Statistics, 2008.