Analysis of Greedy Algorithms
Jiahui Shen, Florida State University
Oct. 26th
Outline
- Introduction
- Regularity conditions
- Analysis of orthogonal matching pursuit
- Analysis of the forward-backward greedy algorithm
- Analysis of hard-thresholding pursuit
Introduction
Greedy algorithms:
- Optimize in each step
- No global optimality guarantee
Examples:
- Boosting (AdaBoost, Gradient Boosting)
- Matching pursuit (OMP, CoSaMP)
- Forward-backward algorithms (FoBa)
Some notation
- Abbreviate l(x^T β; y) as l(β)
- J(β): support of β, i.e. J(β) = {j : β_j ≠ 0}
- X_S: sub-matrix of X formed by the columns in set S
- β_S: sub-vector of β on set S
- β*: the true coefficient; β^t: the estimate of β in the t-th iteration
- J'\J: the elements in J' but not in J, i.e. J' ∩ J^C
- |J|: cardinality of set J
- e_j: the vector whose j-th element is 1 and all others are 0
Example
Consider OMP (greedy least squares) with true model y = Xβ* + ε
Note: p > n, so X^T X is not invertible
OMP procedure:
- Select and update support: j_t = argmax_j |∇l(β^{t-1})_j| = argmax_j |⟨X_j, y − Xβ^{t-1}⟩|; J^t = J^{t-1} ∪ {j_t}
- Update estimator: β^t = argmin_β l(β) subject to J(β) ⊆ J^t (full correction)
"Orthogonal": the residual y − Xβ^t is orthogonal to the selected columns (due to full correction)
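The two-step procedure above can be sketched in a few lines of NumPy for the quadratic loss; `omp` is a hypothetical helper name, and the restricted least-squares solve plays the role of the full correction step.

```python
import numpy as np

def omp(X, y, n_iter):
    """Greedy least squares: select the column most correlated with the
    residual, then fully correct on the selected support."""
    n, p = X.shape
    support = []
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (y - X @ beta)              # correlations with the residual
        candidates = [j for j in range(p) if j not in support]
        j_t = max(candidates, key=lambda j: abs(grad[j]))
        support.append(j_t)
        # Full correction: beta^t = argmin ||y - X beta||^2 s.t. J(beta) in J^t
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        beta = np.zeros(p)
        beta[support] = coef
    return beta, support
```

In the noiseless case, once the support covers J(β*), full correction drives the residual to zero and later selections leave the estimate unchanged.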
Problem setup
Key ingredients in greedy algorithms:
- Choice of loss function: quadratic loss (regression); exponential loss; non-convex losses
- Selection criterion: select one or multiple features; choose the one with the largest gradient or the largest decrease in function value; with or without a backward procedure
- Iterative rule: keep the previous weights, or modify them
Problem setup
Objective function: min_β l(x^T β; y) subject to ‖β‖_0 ≤ q
- Consider learning problems with a large number of features (p > n)
- Sparse target: a linear combination of a small number of features (q < n)
- Directly solves the sparse learning problem (L0 regularization)
- Given weak classifiers, boosting can be formulated in this framework
Example
Assumptions: no noise; ‖X_j‖_2 = 1 for each j (unit columns)
Intuition: make a connection between l(β^t) and l(β^{t-1})
In regression, l(β) = ‖y − Xβ‖_2^2, ∇l(β) = X^T(Xβ − y)
A simple analysis (here ‖y‖_{L1} is not exactly the L1 norm; definition omitted):
‖y − Xβ^t‖_2^2 ≤ min_α ‖y − Xβ^{t-1} − αX_{j_t}‖_2^2 = ‖y − Xβ^{t-1}‖_2^2 − 2α⟨y − Xβ^{t-1}, X_{j_t}⟩ + α^2 at the optimal α = ⟨y − Xβ^{t-1}, X_{j_t}⟩
By optimality of the selection, ⟨y − Xβ^{t-1}, X_{j_t}⟩ ≥ ⟨y − Xβ^{t-1}, y⟩ / ‖y‖_{L1}, and full correction (FC) gives ⟨y − Xβ^{t-1}, y⟩ = ‖y − Xβ^{t-1}‖_2^2
Example
Combining the two displays:
‖y − Xβ^t‖_2^2 ≤ ‖y − Xβ^{t-1}‖_2^2 (1 − ‖y − Xβ^{t-1}‖_2^2 / ‖y‖_{L1}^2)
Result by induction (no noise, so ‖Xβ* − Xβ^t‖_2^2 = ‖y − Xβ^t‖_2^2):
‖y − Xβ^t‖_2^2 ≤ ‖y‖_{L1}^2 / (t + 1)
Drawbacks: What about noise? What about the estimation error? How large is ‖y‖_{L1}^2?
Target of analysis
Commonly used:
- Prediction error: ‖Xβ* − Xβ^t‖_2^2
- Statistical error: ‖β* − β^t‖_2^2
- Selection consistency (support recovery): J(β*) = J(β^t)
Some others:
- Minimax error bounds
- Iteration count
Note: many papers consider the globally optimal solution instead of the true β*. Most of the time the two can be interchanged (belief: β* should approximately minimize l(β)).
Regularity conditions
Commonly used and well known:
- Restricted isometry property (RIP): ρ−(s)‖β‖_2^2 ≤ ‖Xβ‖_2^2 ≤ ρ+(s)‖β‖_2^2 for all β ∈ R^p with ‖β‖_0 ≤ s
- Restricted strong convexity/smoothness (RSC/RSS): ρ−(s)‖β' − β‖_2^2 ≤ l(β') − l(β) − ⟨∇l(β), β' − β⟩ ≤ ρ+(s)‖β' − β‖_2^2 for all β', β ∈ R^p with ‖β' − β‖_0 ≤ s
Regularity conditions
[Figure: values of ρ+(s) and ρ−(s) when n = 200 and s increases from 1 to n; X has i.i.d. N(0, 1/n) entries]
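The quantities in the figure can be approximated numerically. A Monte Carlo sketch (the helper name is hypothetical): sampling random supports only brackets the true restricted eigenvalues, since the exact values require a search over all (p choose s) supports.

```python
import numpy as np

def restricted_eigs(X, s, n_trials=300, seed=0):
    """Estimate rho-(s) and rho+(s) by the extreme eigenvalues of X_S^T X_S
    over randomly sampled supports S of size s (an inner approximation)."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    rho_minus, rho_plus = np.inf, -np.inf
    for _ in range(n_trials):
        S = rng.choice(p, size=s, replace=False)
        eigs = np.linalg.eigvalsh(X[:, S].T @ X[:, S])
        rho_minus = min(rho_minus, eigs[0])
        rho_plus = max(rho_plus, eigs[-1])
    return rho_minus, rho_plus
```

For s = 1 both values concentrate near 1 under the N(0, 1/n) design; as s grows, ρ−(s) shrinks toward 0 while ρ+(s) grows, matching the trend in the figure.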
Regularity conditions
Other conditions:
- Restricted gradient optimal constant: ⟨∇l(β*), β⟩ ≤ ε_s(β*)‖β‖_2 for all ‖β‖_0 ≤ s. ε_s(β*) is a measure of noise, on the σ√(s log p) level for regression
- Sparse eigenvalue condition (a different name for RIP, using only one side): ρ−(s) = inf { ‖Xβ‖_2^2 / ‖β‖_2^2 : ‖β‖_0 ≤ s }
We will use ρ− and ρ+ as defined in RSC/RSS in this talk
Full correction effect
Full correction step: β̂ = argmin_β l(β) subject to J(β) ⊆ J
Effect: ∇l(β̂)_J = 0 for β̂ supported on J
Result: l(β') − l(β̂) ≥ ρ−(s)‖β' − β̂‖_2^2 + ⟨∇l(β̂)_{J'\J}, (β' − β̂)_{J'\J}⟩, where s ≥ |J ∪ J'|
Benefit: whenever ⟨∇l(β̂), β' − β̂⟩ is considered, only ⟨∇l(β̂)_{J'\J}, (β' − β̂)_{J'\J}⟩ matters; a bound involving J'\J is better than one involving J' ∪ J
Forward effect
Two common choices (when adding only one feature per step):
- Select j_t = argmin_{η,j} l(β + ηe_j) (line search)
- Select j_t = argmax_j |∇l(β)_j| (computationally efficient)
The same result holds for both selections under full correction (due to the crude bound):
l(β) − min_η l(β + ηe_{j_t}) ≥ (ρ−(s) / (ρ+(1)|J̄\J|)) {l(β) − l(β̄)}
Comments:
- Interpretation: transfers l(β) − min_η l(β + ηe_{j_t}) into l(β) − l(β̄) for any β̄
- Full correction turns J̄ ∪ J into J̄\J
Forward effect
More details:
- Select j_t = argmin_{η,j} l(β + ηe_j):
l(β) − min_η l(β + ηe_{j_t}) ≥ (optimality) l(β) − min_{η, j ∈ J̄\J} l(β + ηe_j) = l(β) − min_{η, j ∈ J̄\J} l(β + η(β̄_j − β_j)e_j)
- Select j_t = argmax_j |∇l(β)_j|:
l(β) − min_η l(β + ηe_{j_t}) = l(β) − min_η l(β + η sgn(β̄_{j_t})e_{j_t}) ≥ (optimality) l(β) − min_{η, j ∈ J̄\J} l(β + η sgn(β̄_j)e_j)
Comment: a union bound over J̄\J is used to derive the final result
OMP
A slightly refined analysis using the forward effect:
l(β) − min_η l(β + ηe_{j_t}) ≥ (ρ−(s) / (ρ+(1)|J̄\J|)) {l(β) − l(β̄)}
Taking β as β^t, β + ηe_{j_t} as β^{t+1} and β̄ as β*, we have
l(β^{t+1}) − l(β^t) ≤ −c_t {l(β^t) − l(β*)}, where c_t = ρ−(s) / {ρ+(1)|J*\J^t|}
This can be transformed into
l(β^{t+1}) − l(β*) ≤ (1 − c_t){l(β^t) − l(β*)} ≤ e^{−c_t}{l(β^t) − l(β*)}
which gives l(β^t) − l(β*) ≤ e^{−Σ c_t}{l(β^0) − l(β*)}
OMP
Recall the restricted gradient optimal constant: for ‖β‖_0 ≤ s, ⟨∇l(β*), β⟩ ≤ ε_s(β*)‖β‖_2
Usage: a statistical error bound can be achieved through a bound on l(β) − l(β*):
ρ−(s)‖β − β*‖_2^2 ≤ 2l(β) − 2l(β*) + ε_s(β*)^2 / ρ−(s), where s ≥ |J^t ∪ J*|
Key step in the proof:
l(β) − l(β*) = l(β) − l(β*) − ⟨∇l(β*), β − β*⟩ + ⟨∇l(β*), β − β*⟩ ≥ ρ−(s)‖β − β*‖_2^2 − ε_s(β*)‖β − β*‖_2
Once we have a bound on l(β^t) − l(β*), a bound on ‖β^t − β*‖_2^2 follows
OMP
The analysis can be further refined using several techniques:
- Use a different reference point in place of β* in each step, so the bound is more precise, with an additional term q_k. The term q_k comes from the fact that l(β) − l(β*) ≤ 1.5ρ+(s)‖β*_{J*\J}‖_2^2 + 0.5ε_s(β*)^2 / ρ+(s)
- Give a criterion on t so that c_t can be made into a constant, to be combined with q_k in an induction
Final result (with s = |J(β*)| + t, since we consider β* − β^t):
l(β^t) ≤ l(β*) + 2.5ε_s(β*)^2 / ρ−(s)
‖β^t − β*‖_2 ≤ 6ε_s(β*) / ρ−(s) = O(σ√(|J(β*)| log p))
when t ≥ 4|J*| (ρ+(1)/ρ−(s)) ln(20ρ+(|J*|)/ρ−(s))
Termination time
Intuition: if the decrease is significant in each step, there should not be too many iterations
Stop before any over-fitting happens: l(β^t) ≥ l(β*)
A routine to get a bound: the iteration count t controls a certain parameter in another bound; a restriction on that parameter then gives a bound on the iteration count.
Forward-backward greedy algorithm
FoBa-obj / FoBa-gdt process:
- Forward: select the feature with the largest decrease in function value (FoBa-obj) or the largest gradient (FoBa-gdt), then do full correction; stop if δ^t = l(β^t) − l(β^{t+1}) ≤ δ
- Backward: delete a selected feature if min_j l(β^t − β_j e_j) − l(β^t) ≤ δ^t/2, then do full correction
Intuition of FoBa:
- The forward procedure ensures a significant decrease in function value
- The backward procedure removes incorrect features at an early stage
- If the decrease is significant, the gradient should be large; otherwise, there is a bound on the infinity norm of the gradient
- δ is used to control the forward and backward effects
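The forward/backward loop above can be sketched for the quadratic loss as follows; the function names (`foba`, `full_correction`, `loss`) are hypothetical helpers, and the stopping threshold δ and the δ^t/2 deletion rule follow the slide.

```python
import numpy as np

def full_correction(X, y, support):
    """argmin_beta ||y - X beta||^2 subject to J(beta) within `support`."""
    beta = np.zeros(X.shape[1])
    idx = sorted(support)
    if idx:
        beta[idx] = np.linalg.lstsq(X[:, idx], y, rcond=None)[0]
    return beta

def loss(X, y, beta):
    return np.sum((y - X @ beta) ** 2)

def foba(X, y, delta, max_iter=50):
    """FoBa-gdt sketch: forward step by largest gradient with full correction;
    backward deletion whenever it costs at most half the last forward gain."""
    p = X.shape[1]
    support, beta = set(), np.zeros(p)
    for _ in range(max_iter):
        if len(support) == p:
            break
        # Forward step
        grad = X.T @ (X @ beta - y)
        j = max(set(range(p)) - support, key=lambda k: abs(grad[k]))
        new_beta = full_correction(X, y, support | {j})
        delta_t = loss(X, y, beta) - loss(X, y, new_beta)
        if delta_t <= delta:            # decrease no longer significant: stop
            break
        support, beta = support | {j}, new_beta
        # Backward steps
        while len(support) > 1:
            costs = {k: loss(X, y, full_correction(X, y, support - {k}))
                        - loss(X, y, beta) for k in support}
            j_del = min(costs, key=costs.get)
            if costs[j_del] > delta_t / 2:
                break
            support = support - {j_del}
            beta = full_correction(X, y, support)
    return beta, support
```

Note the asymmetry in thresholds: a feature is added only when it buys δ^t > δ, but deleted only when the deletion costs at most δ^t/2, so the objective still decreases overall and the loop terminates.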
Backward effect
Assume β̄ is also the globally optimal solution
Delete j_t = argmin_j l(β − β_j e_j) − l(β) and do full correction
This gives good control of β on J\J̄:
‖β_{J\J̄}‖_2^2 ≤ (|J\J̄| / ρ+(1)) {min_j l(β − β_j e_j) − l(β)}
Crude usage: ‖β − β̄‖_2 ≥ ‖(β − β̄)_{J\J̄}‖_2 = ‖β_{J\J̄}‖_2
Full correction turns J ∪ J̄ into J\J̄
FoBa
How to analyze? δ can be a tool to bound different quantities; δ^t can be a bridge connecting bounds
A simple proof of a bound on the gradient, ‖∇l(β)‖_∞ ≤ 2√(ρ+(1)δ):
δ ≥ l(β) − min_{η,j} l(β + ηe_j) ≥ max_{η,j} {−⟨ηe_j, ∇l(β)⟩ − ρ+(1)η^2} = max_j ∇l(β)_j^2 / (4ρ+(1))
Start with an assumption on selecting an appropriate δ so that l(β^t) ≥ l(β*):
δ > (4ρ+(1) / ρ−^2(s)) ‖∇l(β*)‖_∞^2
General framework
Strategy I: use an auxiliary variable β̄, the optimal solution on J(β̄) = J(β*) ∪ J(β^t), to aid the analysis
- The termination rule comes from l(β^t) ≥ l(β*). Split l(β^t) − l(β*) into {l(β^t) − l(β̄)} − {l(β*) − l(β̄)} and use the full correction result on each part
- From the two parts we obtain ‖β^t − β̄‖ and ‖β* − β̄‖
- The forward step gives a bound on ‖β^t − β̄‖; the backward step gives a bound on ‖β* − β̄‖; both through δ^t
- ‖β^t − β*‖ ≤ ‖β^t − β̄‖ + ‖β̄ − β*‖ then gives a relationship between ‖β* − β̄‖ and ‖β^t − β̄‖
Termination time for FoBa
Full correction and RSC/RSS:
0 ≤ l(β^t) − l(β*) = {l(β^t) − l(β̄)} − {l(β*) − l(β̄)} ≤ {ρ+(s) − ρ−(s)(k − 1)^2} ‖β^t − β̄‖_2^2
where k is defined by ‖β^t − β*‖_2 ≥ k‖β^t − β̄‖_2
Bound from the forward step: δ^t ≥ (ρ−^2(s) / (ρ+(1)|J̄\J^{t-1}|)) ‖β^t − β̄‖_2^2
Bound from the backward step: ‖β^t − β*‖_2^2 ≤ (|J^t\J*| / ρ+(1)) δ^t
Combining through δ^t gives: k^2 = (ρ−^2(s) / ρ+^2(1)) |J^t\J*| / |J̄\J^{t-1}|
Recall |J̄| ≤ |J*| + t, which gives an upper bound on t of the form:
t ≤ (|J*| + 1) {(√(ρ+(s)/ρ−(s)) + 1) 2ρ+(1)/ρ−(s)}^2
General framework
Strategy II (an easy approach): use simple inequalities with the regularity conditions to derive bounds
- Use RSC/RSS to transfer l(β^t) − l(β*) into terms with the gradient and ‖β* − β^t‖_2^2
- Use Hölder's inequality directly on the gradient term: ⟨∇l(β^t), β^t − β*⟩ ≤ ‖∇l(β^t)‖_∞ ‖β^t − β*‖_1
- ‖β^t − β*‖_1 transfers into a 2-norm bound; ‖∇l(β^t)‖_∞ is bounded by the design of the algorithm (involving δ)
FoBa
Details (assuming l(β^t) ≥ l(β*)):
0 ≥ l(β*) − l(β^t) ≥ ⟨∇l(β^t), β* − β^t⟩ + ρ−(s)‖β* − β^t‖_2^2
Final result:
ρ−(s)‖β* − β^t‖_2^2 ≤ −⟨∇l(β^t)_{J*\J^t}, (β* − β^t)_{J*\J^t}⟩ ≤ 2√(ρ+(1)δ) √(|J*\J^t|) ‖β* − β^t‖_2
so that ‖β* − β^t‖_2^2 ≤ 4ρ+(1)δ |J*\J^t| / ρ−^2(s); the count can be restricted to Δ = {j ∈ J*\J^t : |β*_j| ≥ γ}, with γ = √(2ρ+(1)δ) / ρ−(s)
Other bounds can be achieved as well:
l(β^t) − l(β*) ≤ (ρ+(1)δ / ρ−(s)) |J*\J^t|; (ρ−(s) / (2ρ+(1))) |J^t\J*| ≤ |J*\J^t|
FoBa
To make the bound look better (a trick): starting from
‖β*_{J*\J^t}‖_2^2 ≤ (ρ+(1)δ / ρ−^2(s)) |J*\J^t|
set γ = √(2ρ+(1)δ / ρ−^2(s)), so that
γ^2 |{j ∈ J*\J^t : |β*_j| ≥ γ}| ≤ ‖β*_{J*\J^t}‖_2^2 ≤ (γ^2/2) |J*\J^t|
Then |{j ∈ J*\J^t : |β*_j| ≥ γ}| ≤ |J*\J^t| / 2, and since |J*\J^t| = |{j : |β*_j| ≥ γ}| + |{j : |β*_j| < γ}|, this leads to
|J*\J^t| ≤ 2 |{j ∈ J*\J^t : |β*_j| < γ}|
FoBa
Strategy III: use random matrix theory and simple inequalities to derive bounds
With y = Xβ* + ε,
‖Xβ^t − y‖_2^2 = ‖Xβ^t − Xβ*‖_2^2 − 2⟨ε, Xβ^t − Xβ*⟩ + ‖ε‖_2^2
Since l(β*) = ‖ε‖_2^2, a generalized version is
l(β^t) = ‖Xβ^t − Xβ*‖_2^2 − 2⟨ε, Xβ^t − Xβ*⟩ + l(β*)
- ⟨ε, Xβ^t − Xβ*⟩ can be bounded using random matrix theory
- l(β^t) − l(β*) can be upper bounded through the forward and backward effects on l(β^t) − l(β̄) and l(β*) − l(β̄), but some more precise analysis with tricks is involved
- The termination time bound also changes accordingly
Benefit: no RSS assumption (ρ+) is needed
FoBa
Assume ε is sub-Gaussian with parameter σ
Comparison between the results from Strategies II and III:
- Strategy II: ‖β* − β^t‖_2^2 ≲ δ ρ+^2(1) ρ−^{-1}(s), with δ ≳ ρ+^2(1) ρ−^{-1}(s) ‖ε‖_∞^2
- Strategy III: ‖β* − β^t‖_2^2 ≲ ρ−^{-1}(s) σ^2 |J*| + δ ρ−^{-2}(s), with δ ≳ ρ−^{-1}(s) σ^2 log p
Comparison with the LASSO:
- A bit better than the LASSO error bound O(σ^2 |J*| log p)
- The LASSO also needs a stronger condition (the irrepresentable condition) for selection consistency
Selection consistency
Target: J* = J^t
Several ways to evaluate:
- In FoBa, max{|J*\J^t|, |J^t\J*|} = O(·); need < 1 with high probability
- Suppose β* is known; build necessary/sufficient conditions and analyze (e.g. KKT)
- Derive an upper bound for ‖β* − β^t‖ and add a β*-min condition
Hard-thresholding pursuit
HTP procedure:
- Select the q features with the largest absolute values after a gradient descent step: β^t = Θ(β^{t-1} − η∇l(β^{t-1}); q); then do full correction
The analysis in the paper uses the globally optimal solution (β̄ is the global minimum of l under ‖β‖_0 ≤ q)
The globally optimal solution is easier to analyze; random matrix theory can then bound the distance between β* and the globally optimal solution
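The procedure above can be sketched for the quadratic loss as follows; `htp` is a hypothetical helper name, and the default step size η is an assumed conservative choice (inverse squared spectral norm), not prescribed by the slide.

```python
import numpy as np

def htp(X, y, q, eta=None, n_iter=100):
    """HTP sketch: gradient step, keep the q largest entries in magnitude
    (the operator Theta(.; q)), then fully correct on that support."""
    n, p = X.shape
    if eta is None:
        eta = 1.0 / np.linalg.norm(X, 2) ** 2   # step for 0.5 * ||y - X beta||^2
    beta = np.zeros(p)
    support = None
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)
        u = beta - eta * grad
        new_support = tuple(sorted(np.argsort(np.abs(u))[-q:]))  # Theta(u; q)
        if new_support == support:       # support stabilized: terminate
            break
        support = new_support
        beta = np.zeros(p)
        beta[list(support)] = np.linalg.lstsq(X[:, list(support)], y, rcond=None)[0]
    return beta, support
```

Because the estimate after full correction is determined entirely by the support, the iteration can stop as soon as the selected support repeats.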
Hard-thresholding pursuit
A naive analysis:
- Assume l(β̄) ≥ l(β^t)
- RSC and Hölder's inequality give:
l(β̄) − l(β^t) ≥ −‖∇l(β^t)_{J̄\J^t}‖_2 ‖β^t − β̄‖_2 + ρ−(s)‖β^t − β̄‖_2^2
- If J^t ≠ J̄, then min_j |β̄_j| ≤ ‖β̄ − β^t‖_2 ≤ √(2q)‖∇l(β^t)‖_∞ / ρ−(2q)
- Hence min_j |β̄_j| > √(2q)‖∇l(β^t)‖_∞ / ρ−(2q) guarantees support recovery
Hard-thresholding pursuit
The complete analysis is more precise, with several lemmas and tricks (details omitted)
Main ideas:
- Under certain conditions (with unknown constant terms involved), HTP terminates once β^t reaches β̄
- HTP will not terminate before β^t reaches β̄
- The iteration count is finite
Forward effect for HTP
Key idea: handle the gradient through the regularity conditions
‖(β̄ − β)_J‖_2^2 = ⟨β̄ − β − η∇l(β̄) + η∇l(β), (β̄ − β)_J⟩ − η⟨∇l(β), (β̄ − β)_J⟩
≤ (√ρ' ‖β̄ − β‖_2 + η‖∇l(β)_J‖_2) ‖(β̄ − β)_J‖_2
where ρ' = 1 − 2ηρ−(s) + η^2 ρ+^2(s) is obtained from
⟨β̄ − β, ∇l(β̄) − ∇l(β)⟩ ≥ ρ−(s)‖β̄ − β‖_2^2 and ‖∇l(β̄) − ∇l(β)‖_2 ≤ ρ+(s)‖β̄ − β‖_2
Result: ‖β̄ − β‖_2 ≤ ‖β̄_{J̄\J}‖_2 / (1 − √ρ') + η‖∇l(β)_{J̄\J}‖_2 / (1 − √ρ')
Some comments
- In general, full correction makes the analysis easier, but is not necessarily better in practice
- Almost every analysis needs RSC/RSS (or an equivalent condition)
- Induction is still a good analysis tool, but the bounds can become very complicated
- The so-called constant part of a bound can play a significant role in practice, so a method may still fail
Literature
Forward-backward greedy algorithms:
- Barron, A. R., Cohen, A., Dahmen, W., & DeVore, R. A. (2008). Approximation and learning by greedy algorithms. The Annals of Statistics, 36(1), 64-94.
- Liu, J., Ye, J., & Fujimaki, R. (2014). Forward-backward greedy algorithms for general convex smooth functions over a cardinality constraint. In International Conference on Machine Learning (pp. 503-511).
- Zhang, T. (2011). Adaptive forward-backward greedy algorithm for learning sparse representations. IEEE Transactions on Information Theory, 57(7), 4689-4708.
Matching pursuit:
- Needell, D., & Tropp, J. A. (2009). CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3), 301-321.
- Zhang, T. (2009). On the consistency of feature selection using greedy least squares regression. Journal of Machine Learning Research, 10, 555-568.
- Zhang, T. (2011). Sparse recovery with orthogonal matching pursuit under RIP. IEEE Transactions on Information Theory, 57(9), 6215-6221.
Hard-thresholding pursuit:
- Bahmani, S., Raj, B., & Boufounos, P. T. (2013). Greedy sparsity-constrained optimization. Journal of Machine Learning Research, 14, 807-841.
- Yuan, X., Li, P., & Zhang, T. (2014). Gradient hard thresholding pursuit for sparsity-constrained optimization. In International Conference on Machine Learning (pp. 127-135).
- Yuan, X., Li, P., & Zhang, T. (2016). Exact recovery of hard thresholding pursuit. In Advances in Neural Information Processing Systems (pp. 3558-3566).