Analysis of Greedy Algorithms
Slide 1: Analysis of Greedy Algorithms
Jiahui Shen, Florida State University, Oct. 26th
Slide 2: Outline
- Introduction
- Regularity conditions
- Analysis of orthogonal matching pursuit
- Analysis of the forward-backward greedy algorithm
- Analysis of hard-thresholding pursuit
Slide 3: Introduction
Greedy algorithms:
- Optimize at each step
- No global optimality guarantee
Examples:
- Boosting (AdaBoost, gradient boosting)
- Matching pursuit (OMP, CoSaMP)
- Forward-backward algorithms (FoBa)
Slide 4: Some notation
- Abbreviate $\ell(X\beta; y)$ as $\ell(\beta)$
- $J(\beta)$: support of $\beta$, i.e. $J(\beta) = \{j : \beta_j \neq 0\}$
- $X_S$: sub-matrix of $X$ formed by the columns in set $S$
- $\beta_S$: sub-vector of $\beta$ on set $S$
- $\beta^*$: the true coefficient vector; $\beta^t$: the estimate of $\beta$ in the $t$-th iteration
- $J^* \setminus J$: the elements in $J^*$ but not in $J$, i.e. $J^* \cap J^C$
- $|J|$: cardinality of the set $J$
- $e_j$: the vector whose $j$-th element is 1 and all other elements are 0
Slide 5: Example
Consider OMP (greedy least squares) with the true model $y = X\beta^* + \varepsilon$.
Note: $p > n$, so $X^T X$ is not invertible.
OMP procedure:
- Select and update the support: $j^t = \arg\max_j |\nabla\ell(\beta^{t-1})_j| = \arg\max_j |\langle X_j,\, y - X\beta^{t-1}\rangle|$; $J^t = J^{t-1} \cup \{j^t\}$
- Update the estimator: $\beta^t = \arg\min_\beta \ell(\beta)$ subject to $J(\beta) \subseteq J^t$ (full correction)
Orthogonal: the residual $y - X\beta^t$ is orthogonal to the selected columns (due to full correction).
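The procedure above can be sketched in a few lines of NumPy. This is a minimal illustration, not the slides' reference implementation; the function name `omp` and its arguments are assumed for the example, and the toy run uses orthonormal columns so exact recovery is guaranteed.

```python
import numpy as np

def omp(X, y, steps):
    """Sketch of OMP with full correction (greedy least squares).

    Illustrative names; the two steps mirror the slide:
    gradient-based selection, then least squares on the support.
    """
    p = X.shape[1]
    support = []
    beta = np.zeros(p)
    for _ in range(steps):
        residual = y - X @ beta
        # Selection: largest |<X_j, y - X beta>| (largest gradient coordinate)
        j = int(np.argmax(np.abs(X.T @ residual)))
        if j not in support:
            support.append(j)
        # Full correction: refit by least squares on the selected support
        beta = np.zeros(p)
        beta[support] = np.linalg.lstsq(X[:, support], y, rcond=None)[0]
    return beta, support

# Toy run with orthonormal columns, where each selection picks the
# largest remaining true coefficient
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((30, 30)))
beta_true = np.zeros(30)
beta_true[[2, 11, 25]] = [3.0, -2.0, 1.5]
y = Q @ beta_true
beta_hat, support = omp(Q, y, steps=3)
```

With orthonormal columns the correlations $X^T(y - X\beta)$ equal the residual coefficients exactly, so the three true features are chosen in order of magnitude.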
Slide 6: Problem setup
Key ingredients of a greedy algorithm:
- Choice of loss function: quadratic loss (regression); exponential loss; non-convex losses
- Selection criterion: select one or multiple features; choose the one with the largest gradient or the largest decrease in function value; with or without a backward procedure
- Iterative rule: keep the previous weights or modify them
Slide 7: Problem setup
Objective function: $\min \ell(X\beta; y)$ subject to $\|\beta\|_0 \le q$
- Consider learning problems with a large number of features ($p > n$)
- Sparse target: a linear combination of a small number of features ($q < n$)
- This directly solves the sparse learning problem ($L_0$ regularization)
- Given weak classifiers, boosting can be formulated in this framework
Slide 8: Example
Assumptions: no noise; $\|X_j\|_2 = 1$ for each $j$ (unit columns).
Intuition: connect $\ell(\beta^t)$ to $\ell(\beta^{t-1})$. In regression, $\ell(\beta) = \|y - X\beta\|^2$ and $\nabla\ell(\beta) = 2X^T(X\beta - y)$.
A simple analysis (here $\|y\|_{L_1}$ is not exactly the $L_1$ norm; definition omitted):
$$\|y - X\beta^t\|_2^2 \le \min_\alpha \|y - X\beta^{t-1} - \alpha X_{j^t}\|_2^2 = \|y - X\beta^{t-1}\|_2^2 - 2\alpha\langle y - X\beta^{t-1}, X_{j^t}\rangle + \alpha^2,$$
minimized at $\alpha = \langle y - X\beta^{t-1}, X_{j^t}\rangle$. By the selection rule, and since full correction makes the residual orthogonal to $X\beta^{t-1}$,
$$\langle y - X\beta^{t-1}, X_{j^t}\rangle \ge \frac{\langle y - X\beta^{t-1}, y\rangle}{\|y\|_{L_1}} = \frac{\|y - X\beta^{t-1}\|_2^2}{\|y\|_{L_1}}.$$
Slide 9: Example
Combining the two inequalities:
$$\|y - X\beta^t\|_2^2 \le \|y - X\beta^{t-1}\|_2^2 \Big(1 - \frac{\|y - X\beta^{t-1}\|_2^2}{\|y\|_{L_1}^2}\Big).$$
Result by induction (no noise, so $X\beta^* = y$):
$$\|X\beta^* - X\beta^t\|_2^2 = \|y - X\beta^t\|_2^2 \le \frac{\|y\|_{L_1}^2}{t+1}.$$
Drawbacks: What about noise? The estimation error? The size of $\|y\|_{L_1}^2$?
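The induction can be sanity-checked numerically: iterate the extremal case of the recurrence $a_t \le a_{t-1}(1 - a_{t-1}/M)$, with $a_t$ playing the role of $\|y - X\beta^t\|_2^2$ and $M$ of $\|y\|_{L_1}^2$, and compare against $M/(t+1)$. The constants below are arbitrary test values.

```python
M = 5.0          # stands in for ||y||_L1^2
a = 0.9 * M      # a_0 <= M; stands in for the initial squared residual
for t in range(1, 201):
    a = a * (1 - a / M)                   # extremal case of the recurrence
    assert a <= M / (t + 1) + 1e-12       # the claimed O(1/t) bound
```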
Slide 10: Targets of analysis
Commonly used:
- Prediction error: $\|X\beta^* - X\beta^t\|_2^2$
- Statistical error: $\|\beta^* - \beta^t\|_2^2$
- Selection consistency (support recovery): $J(\beta^*) = J(\beta^t)$
Some others:
- Minimax error bounds
- Iteration count
Note: many papers consider the globally optimal solution instead of the true $\beta^*$. Most of the time the two can be interchanged (the belief being that $\beta^*$ should approximately minimize $\ell(\beta)$).
Slide 11: Regularity conditions
Commonly used and well known:
- Restricted isometry property (RIP):
$$\rho_-(s)\,\|\beta\|_2^2 \le \|X\beta\|_2^2 \le \rho_+(s)\,\|\beta\|_2^2 \quad \text{for all } \beta \in \mathbb{R}^p \text{ with } \|\beta\|_0 \le s$$
- Restricted strong convexity/smoothness (RSC/RSS):
$$\rho_-(s)\,\|\beta' - \beta\|_2^2 \le \ell(\beta') - \ell(\beta) - \langle \nabla\ell(\beta),\, \beta' - \beta\rangle \le \rho_+(s)\,\|\beta' - \beta\|_2^2$$
for all $\beta', \beta \in \mathbb{R}^p$ with $\|\beta' - \beta\|_0 \le s$
Slide 12: Regularity conditions
[Figure: values of $\rho_+(s)$ and $\rho_-(s)$ for $n = 200$ as $s$ increases from 1 to $n$; $X$ has i.i.d. $N(0, 1/n)$ entries]
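The figure's quantities can be approximated empirically. The sketch below bounds $\rho_\pm(s)$ by the extreme eigenvalues of $X_S^T X_S$ over randomly sampled supports $S$; note this is only a Monte Carlo estimate over the sampled supports, not a certificate over all of them, and the sizes and trial count are arbitrary choices.

```python
import numpy as np

def restricted_eigs(X, s, trials=200, seed=0):
    """Monte Carlo estimate of restricted eigenvalues rho_-(s), rho_+(s)."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    lo, hi = np.inf, 0.0
    for _ in range(trials):
        S = rng.choice(p, size=s, replace=False)        # random support of size s
        evals = np.linalg.eigvalsh(X[:, S].T @ X[:, S]) # spectrum of the restricted Gram
        lo, hi = min(lo, evals[0]), max(hi, evals[-1])
    return lo, hi

rng = np.random.default_rng(1)
n, p = 200, 400
X = rng.standard_normal((n, p)) / np.sqrt(n)   # i.i.d. N(0, 1/n) entries, as in the figure
rho_minus, rho_plus = restricted_eigs(X, s=5)
```

For small $s$ both values stay close to 1, with the gap widening as $s$ grows toward $n$.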
Slide 13: Regularity conditions
Others:
- Restricted gradient optimal constant: $\langle \nabla\ell(\beta^*),\, \beta\rangle \le \epsilon_s(\beta^*)\,\|\beta\|_2$ for all $\|\beta\|_0 \le s$. Here $\epsilon_s(\beta^*)$ is a measure of the noise, on the $\sigma\sqrt{s\log p}$ level for regression.
- Sparse eigenvalue condition (another name for RIP, but using only one side):
$$\rho_-(s) = \inf\Big\{\frac{\|X\beta\|_2^2}{\|\beta\|_2^2} :\ \|\beta\|_0 \le s\Big\}$$
We will use $\rho_-$ and $\rho_+$ with the RSC/RSS definition in this talk.
Slide 14: Full correction effect
Full correction step: $\hat\beta = \arg\min_\beta \ell(\beta)$ subject to $J(\beta) \subseteq J$.
Effect: $\nabla\ell(\hat\beta)_J = 0$.
Result: for any $\beta'$ with support $J'$,
$$\ell(\beta') - \ell(\hat\beta) \ge \rho_-(s)\,\|\beta' - \hat\beta\|_2^2 + \langle \nabla\ell(\hat\beta)_{J'\setminus J},\, (\beta' - \hat\beta)_{J'\setminus J}\rangle, \qquad s \ge |J' \cup J|$$
Benefit: whenever $\langle \nabla\ell(\hat\beta),\, \beta' - \hat\beta\rangle$ appears, only $\langle \nabla\ell(\hat\beta)_{J'\setminus J},\, (\beta' - \hat\beta)_{J'\setminus J}\rangle$ has to be considered; a bound in terms of $J'\setminus J$ is better than one in terms of $J' \cup J$.
Slide 15: Forward effect
Two common choices (when adding one feature per step):
- Select $j^t = \arg\min_{\eta,\, j}\, \ell(\beta + \eta e_j)$ (line search)
- Select $j^t = \arg\max_j |\nabla\ell(\beta)_j|$ (computationally efficient)
With full correction, both selections satisfy the same result (due to a crude bound): for any $\beta'$,
$$|J'\setminus J|\,\Big\{\ell(\beta) - \min_\eta \ell(\beta + \eta e_{j^t})\Big\} \ge \frac{\rho_-(s)}{\rho_+(1)}\,\{\ell(\beta) - \ell(\beta')\}$$
Comments:
- Interpretation: transfer $\ell(\beta) - \min_\eta \ell(\beta + \eta e_{j^t})$ into $\ell(\beta) - \ell(\beta')$ for any $\beta'$
- Full correction turns $J' \cup J$ into $J' \setminus J$
Slide 16: Forward effect
More details:
- Selecting $j^t = \arg\min_{\eta,\, j}\, \ell(\beta + \eta e_j)$:
$$\ell(\beta) - \min_\eta \ell(\beta + \eta e_{j^t}) \overset{\text{optimality}}{\ge} \ell(\beta) - \min_{\eta,\ j \in J'\setminus J} \ell(\beta + \eta e_j) = \ell(\beta) - \min_{\eta,\ j \in J'\setminus J} \ell\big(\beta + \eta(\beta'_j - \beta_j)e_j\big)$$
- Selecting $j^t = \arg\max_j |\nabla\ell(\beta)_j|$:
$$\ell(\beta) - \min_\eta \ell(\beta + \eta e_{j^t}) = \ell(\beta) - \min_\eta \ell\big(\beta + \eta\,\mathrm{sgn}(\beta'_{j^t})\,e_{j^t}\big) \overset{\text{optimality}}{\ge} \ell(\beta) - \min_{\eta,\ j \in J'\setminus J} \ell\big(\beta + \eta\,\mathrm{sgn}(\beta'_j)\,e_j\big)$$
Comment: a union bound over $J' \setminus J$ is used to derive the final result.
Slide 17: OMP
A slightly refined analysis using the forward effect:
$$\ell(\beta) - \min_\eta \ell(\beta + \eta e_{j^t}) \ge \frac{\rho_-(s)}{\rho_+(1)\,|J'\setminus J|}\,\{\ell(\beta) - \ell(\beta')\}$$
Taking $\beta$ as $\beta^t$, $\beta + \eta e_{j^t}$ as $\beta^{t+1}$ and $\beta'$ as $\beta^*$, we have
$$\ell(\beta^{t+1}) - \ell(\beta^t) \le -c_t\,\{\ell(\beta^t) - \ell(\beta^*)\}, \qquad c_t = \frac{\rho_-(s)}{\rho_+(1)\,|J^*\setminus J^t|}$$
This can be transformed into
$$\ell(\beta^{t+1}) - \ell(\beta^*) \le (1 - c_t)\,\{\ell(\beta^t) - \ell(\beta^*)\} \le e^{-c_t}\,\{\ell(\beta^t) - \ell(\beta^*)\},$$
which gives
$$\ell(\beta^t) \le (1 - e^{-\Sigma c_t})\,\ell(\beta^*) + e^{-\Sigma c_t}\,\ell(\beta^0).$$
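The middle step only uses the elementary inequality $1 - c \le e^{-c}$, which turns the per-step contraction factor into an exponential that sums over steps. A quick grid check (purely illustrative):

```python
import numpy as np

# 1 - c <= exp(-c) for all c; here checked on the range [0, 1] that the
# contraction factor c_t lives in.
c = np.linspace(0.0, 1.0, 1001)
gap = np.exp(-c) - (1.0 - c)
assert np.all(gap >= -1e-15)
```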
Slide 18: OMP
Recall the restricted gradient optimal constant: for $\|\beta\|_0 \le s$, $\langle \nabla\ell(\beta^*),\, \beta\rangle \le \epsilon_s(\beta^*)\,\|\beta\|_2$.
Usage: a statistical error bound can be obtained from $\ell(\beta) - \ell(\beta^*)$:
$$\rho_-(s)\,\|\beta - \beta^*\|_2^2 \le 2\ell(\beta) - 2\ell(\beta^*) + \frac{\epsilon_s(\beta^*)^2}{\rho_-(s)}, \qquad s \ge |J^t \cup J^*|$$
Key step in the proof:
$$\ell(\beta) - \ell(\beta^*) = \big[\ell(\beta) - \ell(\beta^*) - \langle \nabla\ell(\beta^*),\, \beta - \beta^*\rangle\big] + \langle \nabla\ell(\beta^*),\, \beta - \beta^*\rangle$$
Once a bound on $\ell(\beta^t) - \ell(\beta^*)$ is available, a bound on $\|\beta^t - \beta^*\|_2^2$ follows.
Slide 19: OMP
The analysis can be further refined using several techniques:
- Use a different reference value in place of $\ell(\beta^*)$ in each step so the bound becomes more precise, with an extra term $q_k$. The term $q_k$ comes from a bound of the form $\ell(\beta) - \ell(\beta^*) \le 1.5\,\rho_+(s)\,\|\beta^*_{J^*\setminus J}\|^2$, with entries truncated at the level $\epsilon_s(\beta^*)/\rho_+(s)$.
- Give a criterion on $t$ so that $c_t$ can be made a constant, to combine with $q_k$ in the induction.
Final result (with $s = \|\beta^*\|_0 + t$, since we consider $\beta^* - \beta^t$):
$$\ell(\beta^t) \le \ell(\beta^*) + \frac{2.5\,\epsilon_s(\beta^*)^2}{\rho_-(s)}, \qquad \|\beta^t - \beta^*\|_2 \le \frac{6\,\epsilon_s(\beta^*)}{\rho_-(s)} = O\big(\sigma\sqrt{|J(\beta^*)|\log p}\big)$$
when $t = 4|J^*|\,\frac{\rho_+(1)}{\rho_-(s)}\,\ln\frac{20\,\rho_+(|J^*|)}{\rho_-(s)}$.
Slide 20: Termination time
Intuition: if the decrease is significant in every step, there cannot be too many iterations.
Stop before any over-fitting happens: $\ell(\beta^t) \ge \ell(\beta^*)$.
A routine to get a bound: the iteration count $t$ controls a certain parameter in another bound; a restriction on that parameter then gives a bound on the iteration count.
Slide 21: Forward-backward greedy algorithm (FoBa-obj/FoBa-gdt)
Process:
- Forward: select the feature with the largest decrease in function value (FoBa-obj) or the largest gradient (FoBa-gdt), then do full correction; stop if $\delta^t = \ell(\beta^t) - \ell(\beta^{t+1}) \le \delta$
- Backward: delete a selected feature if $\min_j \ell(\beta^t - \beta^t_j e_j) - \ell(\beta^t) \le \delta^t/2$, then do full correction
Intuition behind FoBa:
- The forward procedure ensures a significant decrease in function value
- The backward procedure removes incorrect features at an early stage
- If the decrease is significant, the gradient must be large; otherwise there is a bound on the infinity norm of the gradient
- $\delta$ is used to control the forward and backward effects
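The loop above can be sketched for the quadratic loss as follows. This is a minimal illustration under assumptions, not the paper's reference implementation: the function name, the brute-force forward search (FoBa-gdt would select by gradient instead), and the stopping constants are all illustrative choices.

```python
import numpy as np

def foba(X, y, delta, max_steps=50):
    """Sketch of FoBa-obj for l(b) = ||y - X b||^2 (illustrative names)."""
    p = X.shape[1]
    support, beta = [], np.zeros(p)

    def full_correct(S):
        b = np.zeros(p)
        if S:
            b[S] = np.linalg.lstsq(X[:, S], y, rcond=None)[0]
        return b

    def loss(b):
        return np.sum((y - X @ b) ** 2)

    for _ in range(max_steps):
        # Forward step: add the single feature with the largest decrease
        best_gain, best_j = -np.inf, None
        for j in range(p):
            if j in support:
                continue
            gain = loss(beta) - loss(full_correct(support + [j]))
            if gain > best_gain:
                best_gain, best_j = gain, j
        if best_gain <= delta:               # stop: decrease no longer significant
            break
        support.append(best_j)
        beta = full_correct(support)
        delta_t = best_gain
        # Backward steps: drop features whose deletion costs at most delta_t / 2
        while len(support) > 1:
            cost, j = min((loss(full_correct([k for k in support if k != j]))
                           - loss(beta), j) for j in support)
            if cost > delta_t / 2:
                break
            support.remove(j)
            beta = full_correct(support)
    return beta, support

# Toy run with orthonormal columns (gains equal squared coefficients)
rng = np.random.default_rng(4)
Q, _ = np.linalg.qr(rng.standard_normal((30, 30)))
beta_true = np.zeros(30)
beta_true[[1, 7, 19]] = [3.0, -2.0, 1.5]
y = Q @ beta_true
beta_hat, support = foba(Q, y, delta=1e-6)
```

In the orthonormal toy case no backward deletion fires, since removing a true feature would cost its full squared coefficient, well above $\delta^t/2$.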
Slide 22: Backward effect
Assume $\beta'$ is also the globally optimal solution. Delete $j^t = \arg\min_j \ell(\beta - \beta_j e_j) - \ell(\beta)$ and do full correction.
This gives good control of $\beta$ on $J \setminus J'$:
$$\|\beta_{J\setminus J'}\|_2^2 \ge \frac{|J\setminus J'|}{\rho_+(1)}\,\Big\{\min_j \ell(\beta - \beta_j e_j) - \ell(\beta)\Big\}$$
Crude usage: $\|\beta - \beta'\|_2 \ge \|(\beta - \beta')_{J\setminus J'}\|_2 = \|\beta_{J\setminus J'}\|_2$.
Full correction turns $J \cup J'$ into $J \setminus J'$.
Slide 23: FoBa
How to analyze? $\delta$ can be a tool for bounding different quantities; $\delta^t$ can be a bridge connecting the bounds.
A simple proof of a bound on the gradient, $\|\nabla\ell(\beta)\|_\infty \le 2\sqrt{\rho_+(1)\,\delta}$:
$$\delta \ge \ell(\beta) - \min_{\eta,\, j} \ell(\beta + \eta e_j) \ge \max_{\eta,\, j}\big\{-\langle \eta e_j,\, \nabla\ell(\beta)\rangle - \rho_+(1)\,\eta^2\big\} = \max_j \frac{\nabla\ell(\beta)_j^2}{4\rho_+(1)}$$
Start from an assumption on selecting an appropriate $\delta$ so that $\ell(\beta^*) \le \ell(\beta^t)$:
$$\delta > \frac{4\rho_+(1)}{\rho_-^2(s)}\,\|\nabla\ell(\beta^*)\|_\infty^2$$
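For the quadratic loss $\ell(\beta) = \|y - X\beta\|^2$ with unit-norm columns (so $\rho_+(1) = 1$), the chain above is tight: $\ell(\beta + \eta e_j) = \ell(\beta) + \eta\,\nabla\ell(\beta)_j + \eta^2$, so the best single-coordinate decrease equals $\nabla\ell(\beta)_j^2/4$ exactly. A small numeric check (sizes are arbitrary test values):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 60
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)        # unit-norm columns => rho_+(1) = 1
y = rng.standard_normal(n)
beta = np.zeros(p)

loss = lambda b: np.sum((y - X @ b) ** 2)
g = 2.0 * X.T @ (X @ beta - y)        # gradient of the quadratic loss

# Closed form says the best decrease along coordinate j is g_j^2 / 4;
# verify by a fine grid search over the step size eta.
j = int(np.argmax(np.abs(g)))
e_j = np.zeros(p); e_j[j] = 1.0
etas = np.linspace(-6.0, 6.0, 12001)
brute = loss(beta) - min(loss(beta + eta * e_j) for eta in etas)
```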
Slide 24: General framework
Strategy I: use an auxiliary variable $\beta'$, the optimal solution on $J(\beta') = J(\beta^*) \cup J(\beta^t)$, to help the analysis.
- The termination rule comes from $\ell(\beta^t) \ge \ell(\beta^*)$. Divide $\ell(\beta^t) - \ell(\beta^*)$ into $\{\ell(\beta^t) - \ell(\beta')\} - \{\ell(\beta^*) - \ell(\beta')\}$ and use the full correction result on each part.
- Each part yields $\|\beta^t - \beta'\|$ and $\|\beta^* - \beta'\|$.
- The forward step gives a bound on $\|\beta^t - \beta'\|$; the backward step gives a bound on $\|\beta^* - \beta'\|$; both through $\delta^t$.
- $\|\beta^t - \beta^*\| \le \|\beta^t - \beta'\| + \|\beta^* - \beta'\|$ gives a relationship between $\|\beta^* - \beta'\|$ and $\|\beta^t - \beta'\|$.
Slide 25: Termination time for FoBa
Full correction and RSC/RSS:
$$0 \le \ell(\beta^t) - \ell(\beta^*) = \{\ell(\beta^t) - \ell(\beta')\} - \{\ell(\beta^*) - \ell(\beta')\} \le \big\{\rho_+(s) - \rho_-(s)(k-1)^2\big\}\,\|\beta^t - \beta'\|_2^2$$
where $\|\beta^* - \beta'\|_2 \le k\,\|\beta^t - \beta'\|_2$.
Bound from the forward step: $\delta^t \ge \dfrac{\rho_-^2(s)}{\rho_+(1)\,|J^*\setminus J^{t-1}|}\,\|\beta^t - \beta'\|_2^2$
Bound from the backward step: $\|\beta^* - \beta'\|_2^2 \le \dfrac{|J'\setminus J^*|}{\rho_+(1)}\,\delta^t$
Combining the two through $\delta^t$ gives $k^2 = \dfrac{\rho_-^2(s)\,|J'\setminus J^*|}{\rho_+^2(1)\,|J^*\setminus J^{t-1}|}$.
Recall $|J'| = |J^* \cup J^t| \le |J^*| + t$, which gives an upper bound on $t$:
$$t \le (|J^*| + 1)\,\Big\{\Big(\sqrt{\rho_+(s)/\rho_-(s)} + 1\Big)\,\frac{2\rho_+(1)}{\rho_-(s)}\Big\}^2$$
Slide 26: General framework
Strategy II (an easy approach): use simple inequalities together with the regularity condition to derive a bound.
- Use RSC/RSS to transfer $\ell(\beta^t) - \ell(\beta^*)$ into terms involving the gradient and $\|\beta^* - \beta^t\|_2^2$.
- Use Holder's inequality directly to handle the gradient term:
$$\langle \nabla\ell(\beta^t),\, \beta^t - \beta^*\rangle \le \|\nabla\ell(\beta^t)\|_\infty\,\|\beta^t - \beta^*\|_1$$
- $\|\beta^t - \beta^*\|_1$ transfers into a 2-norm bound; $\|\nabla\ell(\beta)\|_\infty$ is bounded by the design of the algorithm (involving $\delta$).
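The Holder step can be checked numerically on random vectors (purely illustrative; the inequality holds for any pair):

```python
import numpy as np

# |<g, v>| <= ||g||_inf * ||v||_1 : splits the gradient term into an
# infinity-norm factor (controlled by delta) and an l1 factor.
rng = np.random.default_rng(3)
holds = True
for _ in range(100):
    g = rng.standard_normal(50)
    v = rng.standard_normal(50)
    holds &= abs(g @ v) <= np.max(np.abs(g)) * np.sum(np.abs(v)) + 1e-12
```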
Slide 27: FoBa
Details (with $s \ge |J^* \cup J^t|$ and $\ell(\beta^*) \le \ell(\beta^t)$):
$$0 \ge \ell(\beta^*) - \ell(\beta^t) \ge \langle \nabla\ell(\beta^t),\, \beta^* - \beta^t\rangle + \rho_-(s)\,\|\beta^* - \beta^t\|_2^2$$
By full correction and the gradient bound $\|\nabla\ell(\beta^t)\|_\infty \le 2\sqrt{\rho_+(1)\,\delta}$:
$$\rho_-(s)\,\|\beta^* - \beta^t\|_2^2 \le -\langle \nabla\ell(\beta^t)_{J^*\setminus J^t},\, (\beta^* - \beta^t)_{J^*\setminus J^t}\rangle \le 2\sqrt{\rho_+(1)\,\delta\,|J^*\setminus J^t|}\;\|\beta^* - \beta^t\|_2$$
Final result:
$$\|\beta^* - \beta^t\|_2^2 \le \frac{4\rho_+(1)\,\delta}{\rho_-^2(s)}\,|J^*\setminus J^t|$$
In the refined statement, $|J^*\setminus J^t|$ is replaced by $|\Delta|$ with $\Delta = \{j \in J^*\setminus J^t : |\beta^*_j| \ge \gamma\}$, $\gamma = 2\sqrt{2\rho_+(1)\,\delta}/\rho_-(s)$.
Other bounds can be derived as well, e.g. (up to constants)
$$\ell(\beta^t) - \ell(\beta^*) \lesssim \frac{\rho_+(1)\,\delta}{\rho_-(s)}\,|J^*\setminus J^t|, \qquad |J^t\setminus J^*| \lesssim \frac{\rho_+^2(1)}{\rho_-^2(s)}\,|J^*\setminus J^t|$$
Slide 28: FoBa
To make the bound look better (a trick): start from
$$\frac{4\rho_+(1)\,\delta}{\rho_-^2(s)}\,|J^*\setminus J^t| \ge \|\beta^*_{J^*\setminus J^t}\|_2^2 \ge \gamma^2\,\big|\{j \in J^*\setminus J^t : |\beta^*_j| \ge \gamma\}\big|, \qquad \gamma = \frac{2\sqrt{2\rho_+(1)\,\delta}}{\rho_-(s)}.$$
Since $\gamma^2 = 8\rho_+(1)\,\delta/\rho_-^2(s)$,
$$|J^*\setminus J^t| \ge 2\,\big|\{j \in J^*\setminus J^t : |\beta^*_j| \ge \gamma\}\big| = 2\big(|J^*\setminus J^t| - |\{j \in J^*\setminus J^t : |\beta^*_j| < \gamma\}|\big),$$
which leads to
$$|J^*\setminus J^t| \le 2\,\big|\{j \in J^*\setminus J^t : |\beta^*_j| < \gamma\}\big|.$$
Slide 29: FoBa
Strategy III: use random matrix theory and simple inequalities to derive the bound. For quadratic loss, with $y = X\beta^* + \varepsilon$,
$$\|X\beta^t - y\|_2^2 = \|X\beta^t - X\beta^*\|_2^2 - 2\langle \varepsilon,\, X\beta^t - X\beta^*\rangle + \|\varepsilon\|_2^2$$
Since $\ell(\beta^*) = \|\varepsilon\|_2^2$, a generalized version is
$$\ell(\beta^t) = \|X\beta^t - X\beta^*\|_2^2 - 2\langle \varepsilon,\, X\beta^t - X\beta^*\rangle + \ell(\beta^*)$$
- $\langle \varepsilon,\, X\beta^t - X\beta^*\rangle$ can be bounded using random matrix theory
- $\ell(\beta^t) - \ell(\beta^*)$ can be upper bounded through the forward and backward effects on $\ell(\beta^t) - \ell(\beta')$ and $\ell(\beta^*) - \ell(\beta')$, though some more precise analysis with tricks is involved
- The termination time bound changes accordingly
- Benefit: no assumption on RSS ($\rho_+$) is needed
Slide 30: FoBa
Assume $\varepsilon$ is sub-Gaussian with parameter $\sigma$.
Comparison between the results from strategies II and III (up to constants):
- Strategy II: $\|\beta^* - \beta^t\|_2^2 \lesssim \delta\,\rho_+^2(1)\,\rho_-^{-1}(s)$, with $\delta \gtrsim \rho_+^2(1)\,\rho_-^{-1}(s)\,\|\varepsilon\|^2$
- Strategy III: $\|\beta^* - \beta^t\|_2^2 \lesssim \rho_-^{-1}(s)\,\sigma^2|J^*| + \delta\,\rho_-^{-2}(s)$, with $\delta \gtrsim \rho_-^{-1}(s)\,\sigma^2\log p$
Comparison with the LASSO:
- A bit better than the LASSO error bound $O(\sigma^2 |J^*| \log p)$
- The LASSO also needs a stronger condition (the irrepresentable condition) for selection consistency
Slide 31: Selection consistency
Target: $J^* = J^t$.
Several ways to evaluate it:
- In FoBa, $\max\{|J^*\setminus J^t|,\, |J^t\setminus J^*|\} = O(\cdot)$; the bound needs to be $< 1$ with high probability
- Suppose $\beta^*$ is known; build a necessary/sufficient condition and analyze it (e.g. via the KKT conditions)
- Derive an upper bound for $\|\beta^* - \beta^t\|$ and add a $\beta^*$-min condition
Slide 32: Hard-thresholding pursuit
HTP procedure: select the $q$ features with the largest absolute values after a gradient-descent step,
$$\beta^t = \Theta\big(\beta^{t-1} - \eta\,\nabla\ell(\beta^{t-1});\ q\big),$$
then do full correction.
The analysis in the paper uses the globally optimal solution in the discussion ($\bar\beta$ is the global minimum subject to $\|\beta\|_0 \le q$).
The global optimum is easier to analyze; random matrix theory can then be used to derive bounds between $\beta^*$ and the global optimum.
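A minimal sketch of the HTP iteration for the quadratic loss, where $\Theta(\cdot;\, q)$ keeps the $q$ largest entries in absolute value. Function and argument names are illustrative assumptions, and the toy run uses orthonormal columns, for which a single iteration finds the true support.

```python
import numpy as np

def htp(X, y, q, eta=1.0, max_steps=20):
    """Sketch of hard-thresholding pursuit for l(b) = (1/2)||y - X b||^2."""
    p = X.shape[1]
    beta = np.zeros(p)
    support = []
    for _ in range(max_steps):
        grad = X.T @ (X @ beta - y)        # gradient of the quadratic loss
        z = beta - eta * grad              # gradient-descent step
        new_support = sorted(np.argsort(np.abs(z))[-q:])   # Theta(.; q)
        if list(new_support) == list(support):
            break                          # support stabilized: terminate
        support = new_support
        beta = np.zeros(p)
        # Full correction on the thresholded support
        beta[support] = np.linalg.lstsq(X[:, support], y, rcond=None)[0]
    return beta, support

# Toy run with orthonormal columns
rng = np.random.default_rng(5)
Q, _ = np.linalg.qr(rng.standard_normal((30, 30)))
beta_true = np.zeros(30)
beta_true[[4, 9, 22]] = [3.0, -2.0, 1.5]
y = Q @ beta_true
beta_hat, support = htp(Q, y, q=3)
```

With orthonormal columns and $\eta = 1$, the first gradient step from zero is exactly $\beta^*$, so thresholding keeps the true support and full correction recovers the coefficients.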
Slide 33: Hard-thresholding pursuit
A naive analysis. Assume $\ell(\beta^*) \ge \ell(\beta^t)$. RSC and Holder's inequality give
$$\ell(\beta^*) - \ell(\beta^t) \ge -\|\nabla\ell(\beta^t)_{J^*\setminus J^t}\|_2\,\|\beta^t - \beta^*\|_2 + \rho_-(s)\,\|\beta^t - \beta^*\|_2^2$$
If $J^t \neq J^*$, then $\min_{j\in J^*} |\beta^*_j| \le \|\beta^* - \beta^t\|_2 \le \dfrac{\sqrt{2q}\,\|\nabla\ell(\beta^t)\|_\infty}{\rho_-(2q)}$, so
$$\min_{j\in J^*} |\beta^*_j| > \frac{\sqrt{2q}\,\|\nabla\ell(\beta^t)\|_\infty}{\rho_-(2q)}$$
guarantees support recovery.
Slide 34: Hard-thresholding pursuit
The complete analysis is more precise, with several lemmas and tricks (details omitted).
Main ideas:
- Under certain conditions (with some unknown constant terms involved), HTP terminates once $\beta^t$ reaches the $q$-sparse global optimum
- HTP does not terminate before $\beta^t$ reaches the global optimum
- The number of iterations is finite
Slide 35: Forward effect for HTP
Key idea: handle the gradient through the regularity condition.
$$\|(\beta' - \beta)_J\|_2^2 = \langle \beta' - \beta,\, (\beta' - \beta)_J\rangle = \langle \beta' - \beta - \eta\,\nabla\ell(\beta') + \eta\,\nabla\ell(\beta),\, (\beta' - \beta)_J\rangle - \eta\,\langle \nabla\ell(\beta),\, (\beta' - \beta)_J\rangle + \eta\,\langle \nabla\ell(\beta'),\, (\beta' - \beta)_J\rangle$$
$$\le \big(\sqrt{\tilde\rho}\,\|\beta' - \beta\|_2 + \eta\,\|\nabla\ell(\beta)_J\|_2\big)\,\|(\beta' - \beta)_J\|_2$$
(the $\nabla\ell(\beta')$ term vanishes on $J$ when $\beta'$ is full-corrected), where $\tilde\rho = 1 - 2\eta\,\rho_-(s) + \eta^2\rho_+(s)^2$ is obtained from
$$\langle \beta' - \beta,\, \nabla\ell(\beta') - \nabla\ell(\beta)\rangle \ge \rho_-(s)\,\|\beta' - \beta\|_2^2, \qquad \|\nabla\ell(\beta') - \nabla\ell(\beta)\|_2 \le \rho_+(s)\,\|\beta' - \beta\|_2$$
Result:
$$\|\beta' - \beta\|_2 \le \frac{\|\beta'_{J'\setminus J}\|_2}{1 - \sqrt{\tilde\rho}} + \frac{\eta\,\|\nabla\ell(\beta)_{J'\setminus J}\|_2}{1 - \sqrt{\tilde\rho}}$$
Slide 36: Some comments
- In general, full correction makes the analysis easier, but not necessarily the practical performance better
- Almost every analysis needs RSC/RSS (or an equivalent condition)
- Induction is still a good analysis tool, but the resulting bounds can be very complicated
- The so-called constant part of a bound can play a significant role in practice, so a method may fail despite a good-looking bound
Slide 37: Literature
Forward-backward greedy algorithms:
- Barron, A. R., Cohen, A., Dahmen, W., & DeVore, R. A. (2008). Approximation and learning by greedy algorithms. The Annals of Statistics, 36(1).
- Liu, J., Ye, J., & Fujimaki, R. (2014). Forward-backward greedy algorithms for general convex smooth functions over a cardinality constraint. In International Conference on Machine Learning.
- Zhang, T. (2011). Adaptive forward-backward greedy algorithm for learning sparse representations. IEEE Transactions on Information Theory, 57(7).
Matching pursuit:
- Needell, D., & Tropp, J. A. (2009). CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3).
- Zhang, T. (2009). On the consistency of feature selection using greedy least squares regression. Journal of Machine Learning Research, 10(Mar).
- Zhang, T. (2011). Sparse recovery with orthogonal matching pursuit under RIP. IEEE Transactions on Information Theory, 57(9).
Hard-thresholding pursuit:
- Bahmani, S., Raj, B., & Boufounos, P. T. (2013). Greedy sparsity-constrained optimization. Journal of Machine Learning Research, 14(Mar).
- Yuan, X., Li, P., & Zhang, T. (2014). Gradient hard thresholding pursuit for sparsity-constrained optimization. In International Conference on Machine Learning.
- Yuan, X., Li, P., & Zhang, T. (2016). Exact recovery of hard thresholding pursuit. In Advances in Neural Information Processing Systems.
More informationCSC 576: Variants of Sparse Learning
CSC 576: Variants of Sparse Learning Ji Liu Department of Computer Science, University of Rochester October 27, 205 Introduction Our previous note basically suggests using l norm to enforce sparsity in
More informationConfidence Intervals for Low-dimensional Parameters with High-dimensional Data
Confidence Intervals for Low-dimensional Parameters with High-dimensional Data Cun-Hui Zhang and Stephanie S. Zhang Rutgers University and Columbia University September 14, 2012 Outline Introduction Methodology
More informationBoosting Methods: Why They Can Be Useful for High-Dimensional Data
New URL: http://www.r-project.org/conferences/dsc-2003/ Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003) March 20 22, Vienna, Austria ISSN 1609-395X Kurt Hornik,
More informationOptimization methods
Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to
More informationOptimization for Compressed Sensing
Optimization for Compressed Sensing Robert J. Vanderbei 2014 March 21 Dept. of Industrial & Systems Engineering University of Florida http://www.princeton.edu/ rvdb Lasso Regression The problem is to solve
More informationSparse Approximation and Variable Selection
Sparse Approximation and Variable Selection Lorenzo Rosasco 9.520 Class 07 February 26, 2007 About this class Goal To introduce the problem of variable selection, discuss its connection to sparse approximation
More informationPenalized Squared Error and Likelihood: Risk Bounds and Fast Algorithms
university-logo Penalized Squared Error and Likelihood: Risk Bounds and Fast Algorithms Andrew Barron Cong Huang Xi Luo Department of Statistics Yale University 2008 Workshop on Sparsity in High Dimensional
More informationOptimization methods
Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,
More informationAnalysis of Multi-stage Convex Relaxation for Sparse Regularization
Journal of Machine Learning Research 11 (2010) 1081-1107 Submitted 5/09; Revised 1/10; Published 3/10 Analysis of Multi-stage Convex Relaxation for Sparse Regularization Tong Zhang Statistics Department
More informationMethods for sparse analysis of high-dimensional data, II
Methods for sparse analysis of high-dimensional data, II Rachel Ward May 23, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 47 High dimensional
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Proximal-Gradient Mark Schmidt University of British Columbia Winter 2018 Admin Auditting/registration forms: Pick up after class today. Assignment 1: 2 late days to hand in
More informationLasso Regression: Regularization for feature selection
Lasso Regression: Regularization for feature selection Emily Fox University of Washington January 18, 2017 Feature selection task 1 Why might you want to perform feature selection? Efficiency: - If size(w)
More informationA Short Introduction to the Lasso Methodology
A Short Introduction to the Lasso Methodology Michael Gutmann sites.google.com/site/michaelgutmann University of Helsinki Aalto University Helsinki Institute for Information Technology March 9, 2016 Michael
More informationSparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda
Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016 Regression Least-squares regression Example: Global warming Logistic
More informationA new method on deterministic construction of the measurement matrix in compressed sensing
A new method on deterministic construction of the measurement matrix in compressed sensing Qun Mo 1 arxiv:1503.01250v1 [cs.it] 4 Mar 2015 Abstract Construction on the measurement matrix A is a central
More informationNon-linear Supervised High Frequency Trading Strategies with Applications in US Equity Markets
Non-linear Supervised High Frequency Trading Strategies with Applications in US Equity Markets Nan Zhou, Wen Cheng, Ph.D. Associate, Quantitative Research, J.P. Morgan nan.zhou@jpmorgan.com The 4th Annual
More informationCOMS 4721: Machine Learning for Data Science Lecture 6, 2/2/2017
COMS 4721: Machine Learning for Data Science Lecture 6, 2/2/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University UNDERDETERMINED LINEAR EQUATIONS We
More informationAn iterative hard thresholding estimator for low rank matrix recovery
An iterative hard thresholding estimator for low rank matrix recovery Alexandra Carpentier - based on a joint work with Arlene K.Y. Kim Statistical Laboratory, Department of Pure Mathematics and Mathematical
More informationLecture 5 : Projections
Lecture 5 : Projections EE227C. Lecturer: Professor Martin Wainwright. Scribe: Alvin Wan Up until now, we have seen convergence rates of unconstrained gradient descent. Now, we consider a constrained minimization
More informationDescent methods. min x. f(x)
Gradient Descent Descent methods min x f(x) 5 / 34 Descent methods min x f(x) x k x k+1... x f(x ) = 0 5 / 34 Gradient methods Unconstrained optimization min f(x) x R n. 6 / 34 Gradient methods Unconstrained
More informationNonlinear Optimization for Optimal Control
Nonlinear Optimization for Optimal Control Pieter Abbeel UC Berkeley EECS Many slides and figures adapted from Stephen Boyd [optional] Boyd and Vandenberghe, Convex Optimization, Chapters 9 11 [optional]
More informationExponential decay of reconstruction error from binary measurements of sparse signals
Exponential decay of reconstruction error from binary measurements of sparse signals Deanna Needell Joint work with R. Baraniuk, S. Foucart, Y. Plan, and M. Wootters Outline Introduction Mathematical Formulation
More informationA New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables
A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider the problem of
More informationarxiv: v2 [cs.lg] 6 May 2017
arxiv:170107895v [cslg] 6 May 017 Information Theoretic Limits for Linear Prediction with Graph-Structured Sparsity Abstract Adarsh Barik Krannert School of Management Purdue University West Lafayette,
More informationMachine Learning: Chenhao Tan University of Colorado Boulder LECTURE 5
Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 5 Slides adapted from Jordan Boyd-Graber, Tom Mitchell, Ziv Bar-Joseph Machine Learning: Chenhao Tan Boulder 1 of 27 Quiz question For
More informationStatistics for high-dimensional data: Group Lasso and additive models
Statistics for high-dimensional data: Group Lasso and additive models Peter Bühlmann and Sara van de Geer Seminar für Statistik, ETH Zürich May 2012 The Group Lasso (Yuan & Lin, 2006) high-dimensional
More informationSolving Corrupted Quadratic Equations, Provably
Solving Corrupted Quadratic Equations, Provably Yuejie Chi London Workshop on Sparse Signal Processing September 206 Acknowledgement Joint work with Yuanxin Li (OSU), Huishuai Zhuang (Syracuse) and Yingbin
More information