Analysis of Greedy Algorithms


Jiahui Shen, Florida State University, Oct. 26th

Outline
- Introduction
- Regularity condition
- Analysis of orthogonal matching pursuit
- Analysis of the forward-backward greedy algorithm
- Analysis of hard-thresholding pursuit

Introduction
Greedy algorithms:
- Optimize in each step
- No global optimality guarantee
Examples: Boosting (AdaBoost, Gradient Boosting), Matching Pursuit (OMP, CoSaMP), Forward and Backward algorithms (FoBa)

Some notation
- Abbreviate $l(X\beta; y)$ as $l(\beta)$
- $J(\beta)$: support of $\beta$, i.e. $J(\beta) = \{j : \beta_j \neq 0\}$
- $X_S$: sub-matrix of $X$ formed by the columns in set $S$
- $\beta_S$: sub-vector of $\beta$ on set $S$
- $\beta^*$: the true coefficient vector; $\beta^t$: the estimate of $\beta$ at the $t$th iteration
- $J'\setminus J$: the elements in $J'$ but not in $J$, i.e. $J'\cap J^C$
- $|J|$: cardinality of the set $J$
- $e_j$: the vector whose $j$th element is 1 and all other elements are 0

Example
Consider OMP (greedy least squares) with the true model $y = X\beta^* + \varepsilon$.
Note: $p > n$, so $X^T X$ is not invertible.
OMP procedure:
- Select and update the support: $j_t = \arg\max_j |\nabla l(\beta^{t-1})_j| = \arg\max_j |\langle X_j,\, y - X\beta^{t-1}\rangle|$; $J^t = J^{t-1}\cup\{j_t\}$
- Update the estimator: $\beta^t = \arg\min_\beta l(\beta)$ subject to $J(\beta)\subset J^t$ (full correction)
- Orthogonal: the residual $y - X\beta^t$ is orthogonal to the selected columns (due to full correction)
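A minimal NumPy sketch of this loop (illustrative only; the function and variable names, the use of absolute correlations, and the fixed number of steps are my own choices, not from the talk):

```python
import numpy as np

def omp(X, y, n_steps):
    """Orthogonal matching pursuit with full correction (least-squares refit)."""
    n, p = X.shape
    beta = np.zeros(p)
    support = []
    for _ in range(n_steps):
        residual = y - X @ beta
        # Selection: feature most correlated with the residual, i.e. largest |gradient| entry.
        grad = X.T @ residual
        j = int(np.argmax(np.abs(grad)))
        if j not in support:
            support.append(j)
        # Full correction: refit least squares on the selected support.
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        beta = np.zeros(p)
        beta[support] = coef
    return beta, support

# Tiny demo: p > n, sparse truth, no noise.
rng = np.random.default_rng(0)
n, p, q = 50, 200, 5
X = rng.normal(size=(n, p)) / np.sqrt(n)
beta_star = np.zeros(p)
beta_star[:q] = rng.normal(size=q)
y = X @ beta_star
beta_hat, support = omp(X, y, n_steps=q)
print(sorted(support), np.linalg.norm(y - X @ beta_hat))
```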

Problem setup
Key ingredients in greedy algorithms:
- Choice of loss function: quadratic loss (regression); exponential loss; non-convex losses
- Selection criterion: select one or multiple features; choose the one with the largest gradient or the largest decrease in function value; with or without a backward procedure
- Iterative rule: keep the previous weights or modify them

Problem setup
Objective function: $\min_\beta\, l(X\beta; y)$ subject to $\|\beta\|_0 \le q$
- Consider learning problems with a large number of features ($p > n$)
- Sparse target: a linear combination of a small number of features ($q < n$)
- Directly solves the sparse learning problem ($L_0$ regularization)
- Given weak classifiers, Boosting can be formulated in this framework
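For contrast with the greedy methods, the $L_0$-constrained problem can be solved exactly only by enumerating supports; a brute-force sketch for tiny $p$ (my own illustration, with squared loss assumed):

```python
import itertools
import numpy as np

def best_subset(X, y, q):
    """Exact min ||y - X beta||^2 s.t. ||beta||_0 <= q by enumerating all size-q supports."""
    n, p = X.shape
    best = (np.inf, None, None)
    for S in itertools.combinations(range(p), q):
        cols = list(S)
        coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        loss = np.sum((y - X[:, cols] @ coef) ** 2)
        if loss < best[0]:
            best = (loss, S, coef)
    return best  # (loss, support, coefficients); cost grows like C(p, q)

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 10))
beta_star = np.zeros(10)
beta_star[[2, 7]] = [1.5, -2.0]
y = X @ beta_star + 0.1 * rng.normal(size=20)
print(best_subset(X, y, q=2)[:2])
```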

Example
Assumption: no noise; $\|X_j\|_2 = 1$ for each $j$ (unit-norm columns).
Intuition: make a connection between $l(\beta^t)$ and $l(\beta^{t-1})$.
In regression, $l(\beta) = \|y - X\beta\|_2^2$ and $\nabla l(\beta) = 2X^T(X\beta - y)$.
A simple analysis (here $\|y\|_{L_1}$ is not exactly the $L_1$ norm; definition omitted):
$$\|y - X\beta^t\|_2^2 \;\overset{\text{Optimal}}{\le}\; \|y - X\beta^{t-1} - \alpha X_{j_t}\|_2^2 \;=\; \|y - X\beta^{t-1}\|_2^2 - 2\alpha\,\langle y - X\beta^{t-1},\, X_{j_t}\rangle + \alpha^2 .$$
Select $\alpha = \langle y - X\beta^{t-1},\, X_{j_t}\rangle$, and use
$$\|y - X\beta^{t-1}\|_2^2 \;\overset{\text{FC}}{=}\; \langle y - X\beta^{t-1},\, y\rangle, \qquad \langle y - X\beta^{t-1},\, X_{j_t}\rangle \;\overset{\text{Optimal}}{\ge}\; \frac{\langle y - X\beta^{t-1},\, y\rangle}{\|y\|_{L_1}} .$$

Example
Combining the two displays:
$$\|y - X\beta^t\|_2^2 \;\le\; \|y - X\beta^{t-1}\|_2^2\left(1 - \frac{\|y - X\beta^{t-1}\|_2^2}{\|y\|_{L_1}^2}\right).$$
Result by induction:
$$\|X\beta^* - X\beta^t\|_2^2 \;\overset{\text{no noise}}{=}\; \|y - X\beta^t\|_2^2 \;\le\; \frac{\|y\|_{L_1}^2}{t+1} .$$
Drawbacks: What about noise? The estimation error? What is $\|y\|_{L_1}^2$?
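The induction behind the $\|y\|_{L_1}^2/(t+1)$ rate is short; a sketch, writing $a_t = \|y - X\beta^t\|_2^2$ and $M = \|y\|_{L_1}^2$, and assuming $a_0 \le M$ (which holds for the usual choice of $\|y\|_{L_1}$ with unit-norm columns):
$$\frac{1}{a_t} \;\ge\; \frac{1}{a_{t-1}\,(1 - a_{t-1}/M)} \;\ge\; \frac{1}{a_{t-1}}\Bigl(1 + \frac{a_{t-1}}{M}\Bigr) \;=\; \frac{1}{a_{t-1}} + \frac{1}{M},$$
so $1/a_t \ge 1/a_0 + t/M \ge (t+1)/M$, i.e. $a_t \le M/(t+1)$.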

Target of analysis
Commonly used:
- Prediction error: $\|X\beta^* - X\beta^t\|_2^2$
- Statistical (estimation) error: $\|\beta^* - \beta^t\|_2^2$
- Selection consistency (support recovery): $J(\beta^*) = J(\beta^t)$
Some others: minimax error bounds; iteration (termination) time.
Note: many papers consider the globally optimal solution instead of the true $\beta^*$. Most of the time the two can be interchanged (belief: $\beta^*$ should approximately minimize $l(\beta)$).

Regularity condition
Commonly used and well known:
- Restricted isometry property (RIP):
$$\rho_-(s)\,\|\beta\|_2^2 \;\le\; \|X\beta\|_2^2 \;\le\; \rho_+(s)\,\|\beta\|_2^2 \quad \text{for all } \beta\in\mathbb{R}^p \text{ with } \|\beta\|_0 \le s .$$
- Restricted strong convexity/smoothness (RSC/RSS):
$$\rho_-(s)\,\|\beta' - \beta\|_2^2 \;\le\; l(\beta') - l(\beta) - \langle \nabla l(\beta),\, \beta' - \beta\rangle \;\le\; \rho_+(s)\,\|\beta' - \beta\|_2^2 \quad \text{for all } \beta', \beta\in\mathbb{R}^p \text{ with } \|\beta' - \beta\|_0 \le s .$$

Regularity condition
[Figure: values of $\rho_+(s)$ and $\rho_-(s)$ when $n = 200$ and $s$ increases from 1 to $n$; $X$ has i.i.d. $N(0, 1/n)$ entries.]
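A rough way to reproduce numbers of this kind (a sketch: it samples random supports rather than taking the true restricted extremes over all supports, and the value of $p$ is my own choice since the slide does not give one):

```python
import numpy as np

def restricted_eigs(X, s, n_trials=2000, rng=None):
    """Approximate rho_-(s), rho_+(s): extreme eigenvalues of X_S^T X_S over random supports S."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n, p = X.shape
    lo, hi = np.inf, -np.inf
    for _ in range(n_trials):
        S = rng.choice(p, size=s, replace=False)
        eigs = np.linalg.eigvalsh(X[:, S].T @ X[:, S])
        lo, hi = min(lo, eigs[0]), max(hi, eigs[-1])
    return lo, hi

n, p = 200, 500
X = np.random.default_rng(0).normal(size=(n, p)) / np.sqrt(n)  # i.i.d. N(0, 1/n) entries
for s in (1, 5, 20, 50):
    print(s, restricted_eigs(X, s))
```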

Regularity condition
Other conditions:
- Restricted gradient optimal constant: $\langle \nabla l(\beta^*),\, \beta\rangle \le \epsilon_s(\beta^*)\,\|\beta\|_2$ for all $\|\beta\|_0 \le s$. Here $\epsilon_s(\beta^*)$ is a measure of the noise, on the $\sigma\sqrt{s\log p}$ level for regression.
- Sparse eigenvalue condition (another name for RIP, but only one side is used):
$$\rho_-(s) \;=\; \inf\left\{ \frac{\|X\beta\|_2^2}{\|\beta\|_2^2} \;:\; \|\beta\|_0 \le s \right\} .$$
We will use $\rho_-$ and $\rho_+$ with the RSC/RSS definitions in this talk.

Full correction effect
Full correction step: $\hat\beta = \arg\min_\beta l(\beta)$ subject to $J(\beta)\subset J$.
Effect: $\nabla l(\hat\beta)_j = 0$ for every $j\in J$.
Result: for any $\bar\beta$ with support $\bar J$,
$$l(\bar\beta) - l(\hat\beta) \;\ge\; \rho_-(s)\,\|\bar\beta - \hat\beta\|_2^2 + \bigl\langle \nabla l(\hat\beta)_{\bar J\setminus J},\, (\bar\beta - \hat\beta)_{\bar J\setminus J}\bigr\rangle, \qquad \text{where } s \ge |J\cup\bar J| .$$
Benefit: whenever $\langle \nabla l(\hat\beta),\, \bar\beta - \hat\beta\rangle$ appears, only $\langle \nabla l(\hat\beta)_{\bar J\setminus J},\, (\bar\beta - \hat\beta)_{\bar J\setminus J}\rangle$ has to be considered; a bound involving $\bar J\setminus J$ is better than one involving $\bar J\cup J$.
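A small numerical check of the full-correction effect for squared loss: after the restricted least-squares refit, the gradient vanishes on the refit support (the function names are mine):

```python
import numpy as np

def full_correction(X, y, support):
    """Refit least squares on `support`; the gradient of ||y - X beta||^2 is then zero there."""
    p = X.shape[1]
    beta = np.zeros(p)
    coef, *_ = np.linalg.lstsq(X[:, list(support)], y, rcond=None)
    beta[list(support)] = coef
    return beta

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 60)) / np.sqrt(30)
y = rng.normal(size=30)
beta = full_correction(X, y, support=[3, 10, 41])
grad = 2 * X.T @ (X @ beta - y)           # gradient of the quadratic loss
print(np.max(np.abs(grad[[3, 10, 41]])))  # ~0 on the support (up to numerical error)
```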

Forward effect
Two common choices (when adding only one feature per step):
- Select $j_t = \arg\min_{\eta,\, j}\, l(\beta + \eta e_j)$ (line search)
- Select $j_t = \arg\max_j |\nabla l(\beta)_j|$ (computationally efficient)
The same result holds for both selections under full correction (due to a crude bound):
$$|\bar J\setminus J|\,\Bigl\{ l(\beta) - \min_\eta l(\beta + \eta e_{j_t}) \Bigr\} \;\ge\; \frac{\rho_-(s)}{\rho_+(1)}\,\bigl\{ l(\beta) - l(\bar\beta) \bigr\} .$$
Comments:
- Interpretation: transfer $l(\beta) - \min_\eta l(\beta + \eta e_{j_t})$ into $l(\beta) - l(\bar\beta)$ for any $\bar\beta$
- Full correction turns $\bar J\cup J$ into $\bar J\setminus J$

Forward effect
More details:
- Select $j_t = \arg\min_{\eta,\, j}\, l(\beta + \eta e_j)$:
$$l(\beta) - \min_\eta l(\beta + \eta e_{j_t}) \;\overset{\text{optimality}}{\ge}\; l(\beta) - \min_{\eta,\, j\in\bar J\setminus J} l(\beta + \eta e_j) \;=\; l(\beta) - \min_{\eta,\, j\in\bar J\setminus J} l\bigl(\beta + \eta(\bar\beta_j - \beta_j)e_j\bigr)$$
- Select $j_t = \arg\max_j |\nabla l(\beta)_j|$:
$$l(\beta) - \min_\eta l(\beta + \eta e_{j_t}) \;=\; l(\beta) - \min_\eta l\bigl(\beta + \eta\,\mathrm{sgn}(\bar\beta_{j_t})\,e_{j_t}\bigr) \;\overset{\text{optimality}}{\ge}\; l(\beta) - \min_{\eta,\, j\in\bar J\setminus J} l\bigl(\beta + \eta\,\mathrm{sgn}(\bar\beta_j)\,e_j\bigr)$$
Comment: a union bound over $\bar J\setminus J$ is used to derive the final result.
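For the quadratic loss with unit-norm columns, the two selection rules coincide, since a one-dimensional line search along $e_j$ decreases the loss by exactly $\nabla l(\beta)_j^2/4$; a small sketch (my own illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 40
X = rng.normal(size=(n, p))
X /= np.linalg.norm(X, axis=0)           # unit-norm columns
y = rng.normal(size=n)
beta = np.zeros(p)                       # current iterate (here: the zero vector)

grad = 2 * X.T @ (X @ beta - y)          # gradient of ||y - X beta||^2

# Rule 1: coordinate with the largest |gradient| entry.
j_grad = int(np.argmax(np.abs(grad)))

# Rule 2: line search -- coordinate giving the largest one-dimensional decrease.
# For this loss with ||X_j|| = 1, min_eta l(beta + eta e_j) decreases l by grad_j^2 / 4.
decrease = grad ** 2 / 4
j_line = int(np.argmax(decrease))

print(j_grad, j_line)                    # the two rules pick the same coordinate here
```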

OMP
A slightly refined analysis using the forward effect:
$$l(\beta) - \min_\eta l(\beta + \eta e_{j_t}) \;\ge\; \frac{\rho_-(s)}{\rho_+(1)\,|\bar J\setminus J|}\,\bigl\{ l(\beta) - l(\bar\beta) \bigr\} .$$
Taking $\beta$ as $\beta^t$, $\beta + \eta e_{j_t}$ as $\beta^{t+1}$, and $\bar\beta$ as $\beta^*$, we have
$$l(\beta^{t+1}) - l(\beta^t) \;\le\; -c_t\,\bigl\{ l(\beta^t) - l(\beta^*) \bigr\}, \qquad c_t = \frac{\rho_-(s)}{\rho_+(1)\,|J^*\setminus J^t|} .$$
This can be transformed into
$$l(\beta^{t+1}) - l(\beta^*) \;\le\; (1 - c_t)\,\bigl\{ l(\beta^t) - l(\beta^*) \bigr\} \;\le\; e^{-c_t}\,\bigl\{ l(\beta^t) - l(\beta^*) \bigr\},$$
which gives
$$l(\beta^t) \;\le\; l(\beta^*) - e^{-\sum c_t}\, l(\beta^*) + e^{-\sum c_t}\, l(\beta^0) .$$

OMP
Recall the restricted gradient optimal constant: for $\|\beta\|_0 \le s$, $\langle \nabla l(\beta^*),\, \beta\rangle \le \epsilon_s(\beta^*)\,\|\beta\|_2$.
Usage: a statistical error bound can be obtained from $l(\beta) - l(\beta^*)$:
$$\rho_-(s)\,\|\beta - \beta^*\|_2^2 \;\le\; 2\,l(\beta) - 2\,l(\beta^*) + \frac{\epsilon_s(\beta^*)^2}{\rho_-(s)}, \qquad \text{where } s \ge |J^t\cup J^*| .$$
Key step in the proof:
$$l(\beta) - l(\beta^*) \;=\; l(\beta) - l(\beta^*) - \langle \nabla l(\beta^*),\, \beta - \beta^*\rangle + \langle \nabla l(\beta^*),\, \beta - \beta^*\rangle .$$
Once $l(\beta^t) - l(\beta^*)$ is controlled, a bound on $\|\beta^t - \beta^*\|_2^2$ follows.
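Spelling the key step out (a sketch, using only RSC at $\beta^*$, the restricted gradient optimal constant, and $ab \le \tfrac{c}{2}a^2 + \tfrac{1}{2c}b^2$):
$$l(\beta) - l(\beta^*) \;\ge\; \rho_-(s)\,\|\beta - \beta^*\|_2^2 + \langle \nabla l(\beta^*),\, \beta - \beta^*\rangle \;\ge\; \rho_-(s)\,\|\beta - \beta^*\|_2^2 - \epsilon_s(\beta^*)\,\|\beta - \beta^*\|_2 \;\ge\; \frac{\rho_-(s)}{2}\,\|\beta - \beta^*\|_2^2 - \frac{\epsilon_s(\beta^*)^2}{2\,\rho_-(s)},$$
which rearranges to the displayed bound.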

OMP
The analysis can be further refined using several techniques:
- Use a different $l(\bar\beta)$ in each step so the bound becomes more precise, with another term $q_k$; $q_k$ comes from the fact that
$$l(\beta) - l(\beta^*) \;\le\; 1.5\,\rho_+(s)\,\|\beta^*_{J^*\setminus J}\|_2^2 + 0.5\,\epsilon_s(\beta^*)^2/\rho_+(s) .$$
- Give a criterion on $t$ so that $c_t$ can be made into a constant and combined with $q_k$ in the induction.
Final result ($s = \|\beta^*\|_0 + t$, since we consider $\beta^* - \beta^t$):
$$l(\beta^t) \;\le\; l(\beta^*) + 2.5\,\epsilon_s(\beta^*)^2/\rho_-(s), \qquad \|\beta^t - \beta^*\|_2 \;\le\; 6\,\epsilon_s(\beta^*)/\rho_-(s) = O\bigl(\sigma\sqrt{|J(\beta^*)|\log p}\bigr),$$
when $t = 4\,|J^*|\,\frac{\rho_+(1)}{\rho_-(s)}\,\ln\frac{20\,\rho_+(|J^*|)}{\rho_-(s)}$.

Termination time
Intuition: if the decrease is significant in each step, then there should not be too many iterations.
Stop before any over-fitting happens: keep $l(\beta^t) \ge l(\beta^*)$.
A routine for getting a bound: the iteration count $t$ controls a certain parameter in another bound; a restriction on that parameter in turn gives a bound on the iteration count.

Forward-backward greedy algorithm
FoBa-obj / FoBa-gdt. Process:
- Forward: select the feature with the largest decrease in function value (FoBa-obj) or the largest gradient entry (FoBa-gdt), then do full correction; stop if $\delta^{(t)} = l(\beta^t) - l(\beta^{t+1}) \le \delta$
- Backward: delete a selected feature if $\min_j l(\beta^t - \beta^t_j e_j) - l(\beta^t) \le \delta^{(t)}/2$, then do full correction
Intuition behind FoBa (a sketch of the loop follows below):
- The forward procedure ensures a significant decrease in function value
- The backward procedure removes incorrectly selected features at an early stage
- If the decrease is significant, the gradient should be large; otherwise, there is a bound on the infinity norm of the gradient
- $\delta$ is used to control the forward and backward effects
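A compact sketch of the FoBa loop above for squared loss (my own simplification: the deletion criterion just zeroes a coefficient before comparing losses, and the stopping constants and example data are illustrative):

```python
import numpy as np

def loss(X, y, beta):
    return np.sum((y - X @ beta) ** 2)

def refit(X, y, support, p):
    """Full correction: least-squares refit restricted to `support`."""
    beta = np.zeros(p)
    if support:
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        beta[support] = coef
    return beta

def foba_gdt(X, y, delta, max_iter=50):
    """Sketch of the FoBa loop (gradient variant) for squared loss."""
    n, p = X.shape
    support, beta = [], np.zeros(p)
    for _ in range(max_iter):
        # Forward step: add the coordinate with the largest |gradient| entry.
        grad = 2 * X.T @ (X @ beta - y)
        j = int(np.argmax(np.abs(grad)))
        new_support = support + [j] if j not in support else support
        new_beta = refit(X, y, new_support, p)
        gain = loss(X, y, beta) - loss(X, y, new_beta)   # plays the role of delta^(t)
        if gain <= delta:                                # stop: forward gain too small
            break
        support, beta = new_support, new_beta
        # Backward step: drop a feature if doing so raises the loss by at most gain/2.
        while len(support) > 1:
            increases = []
            for k in support:
                b = beta.copy()
                b[k] = 0.0
                increases.append(loss(X, y, b) - loss(X, y, beta))
            i_min = int(np.argmin(increases))
            if increases[i_min] > gain / 2:
                break
            support = support[:i_min] + support[i_min + 1:]
            beta = refit(X, y, support, p)
    return beta, support

rng = np.random.default_rng(4)
n, p, q = 80, 200, 4
X = rng.normal(size=(n, p)) / np.sqrt(n)
beta_star = np.zeros(p)
beta_star[:q] = [3.0, -2.0, 2.5, 1.5]
y = X @ beta_star + 0.05 * rng.normal(size=n)
beta_hat, support = foba_gdt(X, y, delta=1e-3)
print(sorted(support))
```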

Backward effect
Assume $\beta^*$ is also the globally optimal solution.
Delete $j_t = \arg\min_j\, l(\beta - \beta_j e_j) - l(\beta)$ and do full correction.
This gives good control of $\beta$ on $J\setminus J^*$:
$$\|\beta_{J\setminus J^*}\|_2^2 \;\ge\; \frac{|J\setminus J^*|}{\rho_+(1)}\,\Bigl\{ \min_j\, l(\beta - \beta_j e_j) - l(\beta) \Bigr\} .$$
Crude usage: $\|\beta - \beta^*\|_2 \ge \|(\beta - \beta^*)_{J\setminus J^*}\|_2 = \|\beta_{J\setminus J^*}\|_2$.
Full correction turns $J\cup J^*$ into $J\setminus J^*$.
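Where this control comes from (a sketch combining the one-coordinate RSS bound with the full-correction property $\nabla l(\beta)_J = 0$): for any $j\in J$,
$$l(\beta - \beta_j e_j) - l(\beta) \;\le\; -\beta_j\,\nabla l(\beta)_j + \rho_+(1)\,\beta_j^2 \;=\; \rho_+(1)\,\beta_j^2 ,$$
so
$$\min_j\, l(\beta - \beta_j e_j) - l(\beta) \;\le\; \rho_+(1)\,\min_{j\in J\setminus J^*}\beta_j^2 \;\le\; \frac{\rho_+(1)}{|J\setminus J^*|}\,\|\beta_{J\setminus J^*}\|_2^2 ,$$
which rearranges to the display above.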

FoBa
How to analyze? $\delta$ can be a tool for bounding different quantities; $\delta^{(t)}$ can be a bridge connecting bounds.
A simple proof of a bound on the gradient, $\|\nabla l(\beta)\|_\infty^2 \lesssim \rho_+(1)\,\delta$:
$$\delta \;\ge\; l(\beta) - \min_{\eta,\, j}\, l(\beta + \eta e_j) \;\ge\; \max_{\eta,\, j}\,\bigl\{ -\eta\,\nabla l(\beta)_j - \rho_+(1)\,\eta^2 \bigr\} \;=\; \max_j \frac{\nabla l(\beta)_j^2}{4\,\rho_+(1)} .$$
Start with the assumption that $\delta$ is selected appropriately so that $l(\beta^*) \le l(\beta^t)$:
$$\delta \;>\; \frac{4\,\rho_+(1)}{\rho_-^2(s)}\,\|\nabla l(\beta^*)\|_\infty^2 .$$

General framework
Strategy I: use an auxiliary variable $\tilde\beta$, the optimal solution on $J(\tilde\beta) = J(\beta^*)\cup J(\beta^t)$, to help the analysis.
- The termination rule comes from $l(\beta^t) \ge l(\tilde\beta)$.
- Divide $l(\beta^t) - l(\tilde\beta)$ into $\{l(\beta^t) - l(\beta^*)\} - \{l(\tilde\beta) - l(\beta^*)\}$ and use the full correction result on each part.
- From the two parts we get $\|\beta^t - \tilde\beta\|$ and $\|\tilde\beta - \beta^*\|$.
- The forward step gives a bound on $\|\beta^t - \tilde\beta\|$; the backward step also gives a bound on $\|\beta^t - \tilde\beta\|$; both go through $\delta^{(t)}$.
- $\|\beta^t - \beta^*\| \le \|\beta^t - \tilde\beta\| + \|\tilde\beta - \beta^*\|$ gives a relationship between $\|\beta^t - \beta^*\|$ and $\|\beta^t - \tilde\beta\|$.

Termination time for FoBa
Full correction and RSC/RSS give
$$0 \;\le\; l(\beta^t) - l(\tilde\beta) \;=\; l(\beta^t) - l(\beta^*) - \bigl\{ l(\tilde\beta) - l(\beta^*) \bigr\} \;\le\; \bigl\{ \rho_+(s) - \rho_-(s)\,(k-1)^2 \bigr\}\,\|\beta^t - \tilde\beta\|_2^2,$$
where $k$ is defined through $\|\beta^t - \beta^*\|_2 \ge k\,\|\beta^t - \tilde\beta\|_2$.
Bound from the forward step: $\delta^{(t)}$ is controlled in terms of $\frac{\rho_-^2(s)}{\rho_+(1)\,|J^*\setminus J^{t-1}|}\,\|\beta^t - \beta^*\|_2^2$.
Bound from the backward step: $\|\beta^t - \tilde\beta\|_2^2$ is controlled in terms of $\frac{|J^t\setminus J^*|}{\rho_+(1)}\,\delta^{(t)}$.
Combining the two through $\delta^{(t)}$ gives a value of $k$ in terms of $\frac{\rho_-^2(s)}{\rho_+(1)}$, $|J^t\setminus J^*|$, and $|J^*\setminus J^{t-1}|$.
Recall $|J^*\cup J^t| \le |J^*| + t$, which gives an upper bound on the iteration count of the form
$$t \;\le\; (|J^*| + 1)\,\Bigl\{ \Bigl(\sqrt{\rho_+(s)/\rho_-(s)} + 1\Bigr)\,\frac{2\,\rho_+(1)}{\rho_-(s)} \Bigr\}^2 .$$

General framework
Strategy II (an easy approach): use simple inequalities together with the regularity conditions to derive bounds.
- Use RSC/RSS to transfer $l(\beta^t) - l(\beta^*)$ into terms involving the gradient and $\|\beta^* - \beta^t\|_2^2$.
- Use Hölder's inequality directly on the gradient term: $\langle \nabla l(\beta^t),\, \beta^t - \beta^*\rangle \le \|\nabla l(\beta^t)\|_\infty\,\|\beta^t - \beta^*\|_1$.
- $\|\beta^t - \beta^*\|_1$ transfers into a 2-norm bound; $\|\nabla l(\beta^t)\|_\infty$ is bounded by the design of the algorithm (it involves $\delta$).

FoBa
Details:
$$0 \;\ge\; l(\beta^*) - l(\beta^t) \;\ge\; \langle \nabla l(\beta^t),\, \beta^* - \beta^t\rangle + \rho_-(s)\,\|\beta^* - \beta^t\|_2^2 .$$
Final result:
$$\rho_-(s)\,\|\beta^* - \beta^t\|_2^2 \;\le\; \bigl|\bigl\langle \nabla l(\beta^t)_{J^*\setminus J^t},\, (\beta^* - \beta^t)_{J^*\setminus J^t}\bigr\rangle\bigr| \;\le\; \sqrt{\rho_+(1)\,\delta}\,\sqrt{|J^*\setminus J^t|}\,\|\beta^* - \beta^t\|_2,$$
$$\|\beta^* - \beta^t\|_2 \;\le\; \frac{\sqrt{\rho_+(1)\,\delta\,|J^*\setminus J^t|}}{\rho_-(s)}, \qquad \|\beta^* - \beta^t\|_2^2 \;\le\; \frac{2\,\rho_+(1)\,\delta}{\rho_-^2(s)}\,|\Delta|,$$
where $\Delta = \{ j\in J^*\setminus J^t : |\beta^*_j| < \gamma \}$ and $\gamma = \sqrt{2\,\rho_+(1)\,\delta}/\rho_-(s)$.
Other bounds can be obtained as well, e.g. on $l(\beta^t) - l(\beta^*)$ in terms of $\frac{\rho_+(1)\,\delta}{\rho_-(s)}$, and a bound relating $|J^t\setminus J^*|$ to $|J^*\setminus J^t|$ through $\frac{\rho_-^2(s)}{2\,\rho_+(1)}$.

FoBa
To make the bound look better (a trick): start from
$$\frac{\rho_+(1)\,\delta}{\rho_-^2(s)}\,|J^*\setminus J^t| \;\ge\; \|\beta^*_{J^*\setminus J^t}\|_2^2, \qquad \gamma = \sqrt{\frac{2\,\rho_+(1)\,\delta}{\rho_-^2(s)}} .$$
Then
$$\gamma^2\,\bigl|\{ j\in J^*\setminus J^t : |\beta^*_j| \ge \gamma \}\bigr| \;=\; \frac{2\,\rho_+(1)\,\delta}{\rho_-^2(s)}\,\bigl|\{ j\in J^*\setminus J^t : |\beta^*_j| \ge \gamma \}\bigr| \;\le\; \|\beta^*_{J^*\setminus J^t}\|_2^2,$$
so
$$|J^*\setminus J^t| \;\ge\; 2\,\bigl|\{ j\in J^*\setminus J^t : |\beta^*_j| \ge \gamma \}\bigr| \;=\; 2\,\bigl( |J^*\setminus J^t| - \bigl|\{ j\in J^*\setminus J^t : |\beta^*_j| < \gamma \}\bigr| \bigr),$$
which leads to $|J^*\setminus J^t| \le 2\,\bigl|\{ j\in J^*\setminus J^t : |\beta^*_j| < \gamma \}\bigr|$.

FoBa
Strategy III: use random matrix theory and simple inequalities to derive bounds.
For the quadratic loss,
$$\|X\beta^t - y\|_2^2 \;=\; \|X\beta^t - X\beta^*\|_2^2 - 2\,\langle \varepsilon,\, X\beta^t - X\beta^*\rangle + \|\varepsilon\|_2^2 .$$
Since $l(\beta^*) = \|\varepsilon\|_2^2$, a generalized version is
$$l(\beta^t) \;=\; \|X\beta^t - X\beta^*\|_2^2 - 2\,\langle \varepsilon,\, X\beta^t - X\beta^*\rangle + l(\beta^*) .$$
- $\langle \varepsilon,\, X\beta^t - X\beta^*\rangle$ can be bounded using random matrix theory.
- $l(\beta^t) - l(\beta^*)$ can be upper bounded through the forward and backward effects on $l(\beta^t) - l(\tilde\beta)$ and $l(\tilde\beta) - l(\beta^*)$, but a more precise analysis with extra tricks is involved.
- The termination time bound also changes accordingly.
Benefit: no assumption on RSS ($\rho_+$) is needed.

FoBa
Assume $\varepsilon$ is sub-Gaussian with parameter $\sigma$. Comparison between the results from Strategies II and III:
- Strategy II: $\|\beta^* - \beta^t\|_2^2 \lesssim \delta\,\rho_+(1)\,\rho_-^{-2}(s)$, with $\delta \gtrsim \rho_+(1)\,\rho_-^{-2}(s)\,\|X^T\varepsilon\|_\infty^2$
- Strategy III: $\|\beta^* - \beta^t\|_2^2 \lesssim \rho_-^{-1}(s)\,\sigma^2\,|J^*| + \delta\,\rho_-^{-2}(s)$, with $\delta \gtrsim \rho_-^{-1}(s)\,\sigma^2\log p$
Comparison with the LASSO:
- A bit better than the LASSO error bound $O(\sigma^2\,|J^*|\log p)$
- The LASSO also needs a stronger condition (the irrepresentable condition) for selection consistency

Selection consistency
Target: $J^* = J^t$. Several ways to evaluate it:
- In FoBa, bound $\max\{ |J^*\setminus J^t|,\, |J^t\setminus J^*| \}$; one needs the bound to be $< 1$ with high probability
- Suppose $\beta^*$ is known; build necessary/sufficient conditions and analyze them (e.g. via the KKT conditions)
- Derive an upper bound for $\|\beta^* - \beta^t\|$ and add a $\beta^*$-min (minimum signal strength) condition

Hard-thresholding pursuit
HTP procedure: after a gradient descent step, keep the $q$ features with the largest absolute values,
$$\beta^t = \Theta\bigl(\beta^{t-1} - \eta\,\nabla l(\beta^{t-1});\, q\bigr),$$
then do full correction.
The analysis in the paper uses the globally optimal solution in the discussion ($\bar\beta$ is the global minimum subject to $\|\beta\|_0 \le q$).
The globally optimal solution is easier to analyze; random matrix theory can then be used to derive bounds between $\beta^*$ and the globally optimal solution.
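A minimal sketch of this HTP iteration for squared loss (the step size, iteration count, and example data are illustrative choices of mine):

```python
import numpy as np

def hard_threshold(v, q):
    """Keep the q entries of v with largest absolute value, zero out the rest."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-q:]
    out[keep] = v[keep]
    return out

def htp(X, y, q, eta=0.5, n_iter=30):
    """Hard-thresholding pursuit: gradient step, keep top-q, then full correction."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ beta - y)
        pruned = hard_threshold(beta - eta * grad, q)
        support = np.flatnonzero(pruned)
        # Full correction: refit least squares on the selected support.
        beta = np.zeros(p)
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        beta[support] = coef
    return beta

rng = np.random.default_rng(5)
n, p, q = 80, 200, 5
X = rng.normal(size=(n, p)) / np.sqrt(n)
beta_star = np.zeros(p)
beta_star[:q] = [3.0, -2.0, 2.5, 1.5, -1.0]
y = X @ beta_star + 0.05 * rng.normal(size=n)
print(np.flatnonzero(htp(X, y, q)))
```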

Hard-thresholding pursuit
A naive analysis. Assume $l(\bar\beta) \le l(\beta^t)$. RSC and Hölder's inequality give
$$l(\bar\beta) - l(\beta^t) \;\ge\; -\|\nabla l(\beta^t)_{\bar J\setminus J^t}\|_2\,\|\beta^t - \bar\beta\|_2 + \rho_-(s)\,\|\beta^t - \bar\beta\|_2^2 ,$$
so
$$\|\beta^t - \bar\beta\|_2 \;\le\; \frac{\|\nabla l(\beta^t)_{\bar J\setminus J^t}\|_2}{\rho_-(s)} \;\le\; \frac{\sqrt{2q}\,\|\nabla l(\beta^t)\|_\infty}{\rho_-(2q)} .$$
If $J^t \neq \bar J$, then $\min_{j\in\bar J} |\bar\beta_j| \le \|\bar\beta - \beta^t\|_2$; hence
$$\min_{j\in\bar J} |\bar\beta_j| \;>\; \frac{\sqrt{2q}\,\|\nabla l(\beta^t)\|_\infty}{\rho_-(2q)}$$
guarantees support recovery.

Hard-thresholding pursuit
The complete analysis is more precise, with several lemmas and tricks (details omitted). Main ideas:
- Under certain conditions (involving unknown constant terms), HTP terminates once $\beta^t$ reaches $\bar\beta$
- HTP will not terminate before $\beta^t$ reaches $\bar\beta$
- The number of iterations is finite

Forward effect for HTP
Key idea: handle the gradient through the regularity conditions.
$$\|(\bar\beta - \beta)_J\|_2^2 \;=\; \langle \bar\beta - \beta,\, (\bar\beta - \beta)_J\rangle \;=\; \bigl\langle \bar\beta - \beta - \eta\,\nabla l(\bar\beta) + \eta\,\nabla l(\beta),\, (\bar\beta - \beta)_J \bigr\rangle - \eta\,\bigl\langle \nabla l(\beta),\, (\bar\beta - \beta)_J \bigr\rangle$$
$$\le\; \tilde\rho\,\|\bar\beta - \beta\|_2\,\|(\bar\beta - \beta)_J\|_2 + \eta\,\|\nabla l(\beta)_J\|_2\,\|(\bar\beta - \beta)_J\|_2 ,$$
where $\tilde\rho^2 = 1 - 2\eta\,\rho_-(s) + \eta^2\,\rho_+^2(s)$ is obtained from
$$\langle \bar\beta - \beta,\, \nabla l(\bar\beta) - \nabla l(\beta)\rangle \;\ge\; \rho_-(s)\,\|\bar\beta - \beta\|_2^2, \qquad \|\nabla l(\bar\beta) - \nabla l(\beta)\|_2 \;\le\; \rho_+(s)\,\|\bar\beta - \beta\|_2 .$$
Result:
$$\|\bar\beta - \beta\|_2 \;\le\; \frac{\|\bar\beta_{\bar J\setminus J}\|_2}{1 - \tilde\rho} + \frac{\eta\,\|\nabla l(\beta)_{\bar J\setminus J}\|_2}{1 - \tilde\rho} .$$
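One way to see where $\tilde\rho$ comes from (a sketch under the two displayed conditions):
$$\|\bar\beta - \beta - \eta(\nabla l(\bar\beta) - \nabla l(\beta))\|_2^2 \;=\; \|\bar\beta - \beta\|_2^2 - 2\eta\,\langle \bar\beta - \beta,\, \nabla l(\bar\beta) - \nabla l(\beta)\rangle + \eta^2\,\|\nabla l(\bar\beta) - \nabla l(\beta)\|_2^2 \;\le\; \bigl(1 - 2\eta\,\rho_-(s) + \eta^2\,\rho_+^2(s)\bigr)\,\|\bar\beta - \beta\|_2^2 \;=\; \tilde\rho^2\,\|\bar\beta - \beta\|_2^2 ,$$
and Cauchy-Schwarz then bounds the first inner product in the chain above.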

Some comments
- In general, full correction makes the analysis easier, but it is not necessarily better in practice
- Almost every analysis needs RSC/RSS (or an equivalent type of condition)
- Induction is still a good tool for the analysis, but the resulting bounds can be very complicated
- The so-called constant part of a bound can play a significant role in practice, so the method may fail

Literature
Forward-backward greedy algorithms:
- Barron, A. R., Cohen, A., Dahmen, W., & DeVore, R. A. (2008). Approximation and learning by greedy algorithms. The Annals of Statistics, 36(1), 64-94.
- Liu, J., Ye, J., & Fujimaki, R. (2014). Forward-backward greedy algorithms for general convex smooth functions over a cardinality constraint. In International Conference on Machine Learning (pp. 503-511).
- Zhang, T. (2011). Adaptive forward-backward greedy algorithm for learning sparse representations. IEEE Transactions on Information Theory, 57(7), 4689-4708.
Matching pursuit:
- Needell, D., & Tropp, J. A. (2009). CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3), 301-321.
- Zhang, T. (2009). On the consistency of feature selection using greedy least squares regression. Journal of Machine Learning Research, 10, 555-568.
- Zhang, T. (2011). Sparse recovery with orthogonal matching pursuit under RIP. IEEE Transactions on Information Theory, 57(9), 6215-6221.
Hard thresholding pursuit:
- Bahmani, S., Raj, B., & Boufounos, P. T. (2013). Greedy sparsity-constrained optimization. Journal of Machine Learning Research, 14, 807-841.
- Yuan, X., Li, P., & Zhang, T. (2014). Gradient hard thresholding pursuit for sparsity-constrained optimization. In International Conference on Machine Learning (pp. 127-135).
- Yuan, X., Li, P., & Zhang, T. (2016). Exact recovery of hard thresholding pursuit. In Advances in Neural Information Processing Systems (pp. 3558-3566).