STAT 200C: High-dimensional Statistics

1 STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57

2 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57

3 Linear regression setup The data is (y, X) where y ∈ R^n and X ∈ R^{n×d}, and the model is
y = Xθ* + w,
where θ* ∈ R^d is an unknown parameter and w ∈ R^n is the vector of noise variables. Equivalently,
y_i = ⟨θ*, x_i⟩ + w_i, i = 1,…,n,
where x_i ∈ R^d is the ith row of X:
X = [x_1^T; x_2^T; …; x_n^T] ∈ R^{n×d}.
Recall ⟨θ*, x_i⟩ = Σ_{j=1}^d θ*_j x_{ij}. 3 / 57

4 Sparsity models When n < d, there is no hope of estimating θ* unless we impose some sort of low-dimensional model on θ*. Support of θ* (recall [d] = {1,…,d}):
supp(θ*) := S(θ*) = { j ∈ [d] : θ*_j ≠ 0 }.
Hard sparsity assumption: s = |S(θ*)| ≪ d. Weaker sparsity assumption via ℓ_q balls for q ∈ [0, 1]:
B_q(R_q) = { θ ∈ R^d : Σ_{j=1}^d |θ_j|^q ≤ R_q }.
q = 1 gives the ℓ_1 ball; q = 0 gives the ℓ_0 ball, which is the same as hard sparsity:
‖θ‖_0 := |S(θ)| = #{ j : θ_j ≠ 0 }. 4 / 57

5 (from HDS book) 5 / 57

6 Basis pursuit Consider the noiseless case y = Xθ*. We assume that ‖θ*‖_0 is small. Ideal program to solve:
min_{θ ∈ R^d} ‖θ‖_0 subject to y = Xθ.
The ℓ_0 norm is highly non-convex, so relax it to ℓ_1:
min_{θ ∈ R^d} ‖θ‖_1 subject to y = Xθ. (1)
This is called basis pursuit (regression). (1) is a convex program; in fact, it can be written as a linear program.¹ Global solutions can be obtained very efficiently.
¹ Exercise: introduce auxiliary variables s_j ∈ R and note that minimizing Σ_j s_j subject to |θ_j| ≤ s_j gives the ℓ_1 norm of θ. 6 / 57
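To make the linear-program reduction in the footnote concrete, here is a minimal sketch (assuming NumPy and SciPy are available; basis_pursuit is a hypothetical helper name, not part of the course material) that solves (1) with scipy.optimize.linprog using the auxiliary variables s_j:

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, y):
    """Solve min ||theta||_1 subject to y = X theta, written as an LP.

    Variables are (theta, s) with -s_j <= theta_j <= s_j; minimize sum_j s_j.
    """
    n, d = X.shape
    c = np.concatenate([np.zeros(d), np.ones(d)])      # objective: sum of s_j
    A_ub = np.block([[np.eye(d), -np.eye(d)],          #  theta - s <= 0
                     [-np.eye(d), -np.eye(d)]])        # -theta - s <= 0
    b_ub = np.zeros(2 * d)
    A_eq = np.hstack([X, np.zeros((n, d))])            # X theta = y
    bounds = [(None, None)] * d + [(0, None)] * d
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y, bounds=bounds)
    return res.x[:d]

# noiseless toy example with a sparse theta* and n < d
rng = np.random.default_rng(0)
n, d = 20, 50
theta_star = np.zeros(d); theta_star[:3] = [1.0, -2.0, 0.5]
X = rng.standard_normal((n, d))
theta_hat = basis_pursuit(X, X @ theta_star)
print(np.max(np.abs(theta_hat - theta_star)))   # ~0 when the restricted null space property holds
```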

7 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 7 / 57

8 Define
C(S) = { Δ ∈ R^d : ‖Δ_{S^c}‖_1 ≤ ‖Δ_S‖_1 }. (2)
Theorem 1. The following two are equivalent:
(a) For any θ* ∈ R^d with support S, the basis pursuit program (1) applied to the data (y = Xθ*, X) has unique solution θ̂ = θ*.
(b) The restricted null space (RNS) property holds, i.e.,
C(S) ∩ ker(X) = {0}. (3)
8 / 57

9 Proof Consider the tangent cone to the ℓ_1 ball (of radius ‖θ*‖_1) at θ*:
T(θ*) = { Δ ∈ R^d : ‖θ* + tΔ‖_1 ≤ ‖θ*‖_1 for some t > 0 },
i.e., the set of descent directions for the ℓ_1 norm at the point θ*. The feasible set is θ* + ker(X), i.e., ker(X) is the set of feasible directions Δ = θ − θ*. Hence, there is a minimizer other than θ* if and only if
T(θ*) ∩ ker(X) ≠ {0}. (4)
It is enough to show that C(S) = ∪_{θ* ∈ R^d : supp(θ*) ⊆ S} T(θ*). 9 / 57

10 [Figure: the ℓ_1 ball B_1, the kernel ker(X), the tangent cones T(θ^(1)) and T(θ^(2)), and the cone C(S).] Here d = 2, [d] = {1, 2}, S = {2}, θ^(1) = (0, 1), θ^(2) = (0, −1), and C(S) = { (Δ_1, Δ_2) : |Δ_1| ≤ |Δ_2| }. 10 / 57

11 It is enough to show that
C(S) = ∪_{θ* ∈ R^d : supp(θ*) ⊆ S} T(θ*). (5)
We have Δ ∈ T_1(θ*) iff²
‖Δ_{S^c}‖_1 ≤ ‖θ*_S‖_1 − ‖θ*_S + Δ_S‖_1.
We have Δ ∈ T_1(θ*) for some θ* ∈ R^d s.t. supp(θ*) ⊆ S iff
‖Δ_{S^c}‖_1 ≤ sup_{θ*_S ∈ R^s} [ ‖θ*_S‖_1 − ‖θ*_S + Δ_S‖_1 ] = ‖Δ_S‖_1.
² Let T_1(θ*) be the subset of T(θ*) where t = 1, and argue that w.l.o.g. we can work with this subset. 11 / 57

12 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 12 / 57

13 Sufficient conditions for restricted nullspace [d] := {1,…,d}. For a matrix X ∈ R^{n×d}, let X_j be its jth column (for j ∈ [d]). The pairwise incoherence of X is defined as
δ_PW(X) := max_{i,j ∈ [d]} | ⟨X_i, X_j⟩ / n − 1{i = j} |.
Alternative form: X^T X is the Gram matrix of X, with (X^T X)_{ij} = ⟨X_i, X_j⟩, so
δ_PW(X) = ‖ X^T X / n − I_d ‖_∞,
where ‖·‖_∞ is the elementwise (vector) ℓ_∞ norm of the matrix. 13 / 57
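As a quick illustration of the two equivalent forms of δ_PW, the following sketch (pairwise_incoherence is a hypothetical helper, not from the slides) computes the elementwise ℓ_∞ deviation of the scaled Gram matrix from the identity for a random design:

```python
import numpy as np

def pairwise_incoherence(X):
    """delta_PW(X) = max_{i,j} | <X_i, X_j>/n - 1{i=j} |, i.e. the max-entry norm of X^T X / n - I."""
    n, d = X.shape
    return np.max(np.abs(X.T @ X / n - np.eye(d)))

rng = np.random.default_rng(0)
n, d = 2000, 50
X = rng.standard_normal((n, d))        # i.i.d. standard Gaussian entries
print(pairwise_incoherence(X))         # small; roughly of order sqrt(log d / n)
```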

14 Proposition 1 (HDS Prop. 7.1). (Uniform) restricted nullspace holds for all S with |S| ≤ s if
δ_PW(X) ≤ 1/(3s).
Proof: Exercise. 14 / 57

15 A more relaxed condition: Definition 1 (RIP). X ∈ R^{n×d} satisfies a restricted isometry property (RIP) of order s with constant δ_s(X) > 0 if
‖ X_S^T X_S / n − I_s ‖_op ≤ δ_s(X), for all S with |S| ≤ s.
PW incoherence is close to RIP with s = 2; for example, when ‖X_j/√n‖_2 = 1 for all j, we have δ_2(X) = δ_PW(X). In general, for any s ≥ 2 (Exercise 7.4),
δ_PW(X) ≤ δ_s(X) ≤ s δ_PW(X). 15 / 57
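For small d and s one can compute δ_s(X) exactly by brute force, which makes the relation δ_PW(X) ≤ δ_s(X) ≤ s δ_PW(X) easy to check numerically. A hedged sketch (rip_constant is a hypothetical helper, feasible only for tiny problems):

```python
import numpy as np
from itertools import combinations

def rip_constant(X, s):
    """delta_s(X) = max_{|S| <= s} || X_S^T X_S / n - I ||_op, by enumeration.

    Subsets of size exactly s suffice: for S' contained in S, the deviation matrix for S'
    is a principal submatrix of that for S, so its operator norm is no larger.
    """
    n, d = X.shape
    delta = 0.0
    for S in combinations(range(d), s):
        idx = list(S)
        G = X[:, idx].T @ X[:, idx] / n
        delta = max(delta, np.linalg.norm(G - np.eye(s), ord=2))
    return delta

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 12))
print(rip_constant(X, 2), rip_constant(X, 3))   # delta_2 <= delta_3; both small for this design
```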

16 Definition (RIP). X ∈ R^{n×d} satisfies a restricted isometry property (RIP) of order s with constant δ_s(X) > 0 if
‖ X_S^T X_S / n − I_s ‖_op ≤ δ_s(X), for all S with |S| ≤ s.
Let x_i^T be the ith row of X. Consider the sample covariance matrix
Σ̂ := X^T X / n = (1/n) Σ_{i=1}^n x_i x_i^T ∈ R^{d×d}.
Then Σ̂_SS = X_S^T X_S / n; hence RIP is ‖Σ̂_SS − I‖_op ≤ δ < 1, i.e., Σ̂_SS ≈ I_s. More precisely,
(1 − δ) ‖u‖_2^2 ≤ ⟨u, Σ̂_SS u⟩ ≤ (1 + δ) ‖u‖_2^2, for all u ∈ R^s. 16 / 57

17 RIP gives sufficient conditions: Proposition 2 (HDS Prop. 7.2). (Uniform) restricted null space holds for all S with |S| ≤ s if
δ_2s(X) ≤ 1/3.
Consider a sub-Gaussian matrix X with i.i.d. entries (Exercise 7.7): we have
n ≳ s² log d ⟹ δ_PW(X) < 1/(3s) w.h.p.,
n ≳ s log(ed/s) ⟹ δ_2s(X) < 1/3 w.h.p.
The sample complexity requirement for RIP is milder. The above corresponds to Σ = αI. For more general covariance Σ, it is harder to satisfy either PW or RIP. 17 / 57

18 Neither RIP nor PW is necessary. Consider X ∈ R^{n×d} with i.i.d. rows X_i ∼ N(0, Σ). Letting 1 ∈ R^d be the all-ones vector, set
Σ := (1 − µ) I_d + µ 1 1^T for µ ∈ [0, 1). (A spiked covariance matrix.)
We have γ_max(Σ_SS) = 1 + µ(s − 1) → ∞ as s → ∞. Exercise 7.8:
(a) PW is violated w.h.p. unless µ ≲ 1/s.
(b) RIP is violated w.h.p. unless µ ≲ 1/√s. In fact δ_2s grows like µ√s for any fixed µ ∈ (0, 1).
However, for any µ ∈ [0, 1), basis pursuit succeeds w.h.p. if n ≳ s log(ed/s). (A later result shows this.) 18 / 57

19 19 / 57

20 Noisy sparse regression A very popular estimator is the ℓ_1-regularized least squares (the Lasso):
θ̂ ∈ argmin_{θ ∈ R^d} [ (1/2n) ‖y − Xθ‖_2^2 + λ ‖θ‖_1 ]. (6)
The idea: minimizing the ℓ_1 norm leads to sparse solutions. (6) is a convex program; a global solution can be obtained efficiently. Other options: the constrained form of the Lasso and relaxed basis pursuit:
min_{‖θ‖_1 ≤ R} (1/2n) ‖y − Xθ‖_2^2, (7)
min_{θ ∈ R^d} ‖θ‖_1 s.t. (1/2n) ‖y − Xθ‖_2^2 ≤ b². (8)
20 / 57
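As a sketch of how (6) is used in practice: scikit-learn's Lasso minimizes exactly the objective in (6), (1/2n)‖y − Xθ‖_2^2 + λ‖θ‖_1, with its alpha parameter playing the role of λ. The simulation parameters below are illustrative only; the chosen λ is of the same order as the theoretical choice used later in these notes.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d, s, sigma = 200, 500, 5, 1.0
X = rng.standard_normal((n, d))
theta_star = np.zeros(d); theta_star[:s] = 1.0
y = X @ theta_star + sigma * rng.standard_normal(n)

lam = 2 * sigma * np.sqrt(2 * np.log(d) / n)       # same order as the theory's choice of lambda
theta_hat = Lasso(alpha=lam, fit_intercept=False, max_iter=50_000).fit(X, y).coef_
print(np.linalg.norm(theta_hat - theta_star))      # roughly sigma * sqrt(s log d / n), up to constants
print(np.flatnonzero(theta_hat))                   # sparse: most coordinates are exactly zero
```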

21 For a constant α ≥ 1, define
C_α(S) := { Δ ∈ R^d : ‖Δ_{S^c}‖_1 ≤ α ‖Δ_S‖_1 }.
A strengthening of RNS is: Definition 2 (RE condition). A matrix X satisfies the restricted eigenvalue (RE) condition over S with parameters (κ, α) if
(1/n) ‖XΔ‖_2^2 ≥ κ ‖Δ‖_2^2 for all Δ ∈ C_α(S).
Intuition: θ̂ minimizes L(θ) := (1/2n) ‖Xθ − y‖_2^2. Ideally, δL := L(θ̂) − L(θ*) is small. We want to translate deviations in the loss to deviations in the parameter, θ̂ − θ*. This is controlled by the curvature of the loss, captured by the Hessian ∇²L(θ) = X^T X / n. 21 / 57

22 Ideally we would like strong convexity (in all directions):
⟨Δ, ∇²L(θ) Δ⟩ ≥ κ ‖Δ‖_2^2, for all Δ ∈ R^d \ {0},
or in the context of regression,
(1/n) ‖XΔ‖_2^2 ≥ κ ‖Δ‖_2^2, for all Δ ∈ R^d \ {0}. 22 / 57

23 In high dimensions we cannot guarantee this in all directions: the loss is flat over ker(X). 23 / 57

24 Side note: Strong convexity A twice differentiable function is strongly convex if ∇²L(θ) ⪰ κI for all θ, in other words if ∇²L(θ) − κI ⪰ 0 for all θ: the Hessian is uniformly bounded below (in all directions). By Taylor expansion, the function then has a quadratic lower bound:
L(θ* + Δ) ≥ L(θ*) + ⟨∇L(θ*), Δ⟩ + (κ/2) ‖Δ‖_2^2.
Alternatively, L(θ) is strongly convex if L(θ) − (κ/2) ‖θ‖_2^2 is convex. In contrast, assuming smoothness, L is strictly convex iff ∇²L(θ) ≻ 0, not necessarily uniformly lower bounded. Example: f(x) = e^x on R is strictly convex but not strongly convex: f''(x) > 0 for all x but f''(x) → 0 as x → −∞. Similarly: f(x) = 1/x over (0, ∞). 24 / 57

25 Theorem 2. Assume that y = Xθ* + w, where X ∈ R^{n×d} and θ* ∈ R^d, θ* is supported on S ⊆ [d] with |S| ≤ s, and X satisfies RE(κ, 3) over S. Define
z = X^T w / n and γ² := ‖w‖_2^2 / (2n).
Then we have the following:
(a) Any solution of the Lasso (6) with λ ≥ 2‖z‖_∞ satisfies ‖θ̂ − θ*‖_2 ≤ (3/κ) √s λ.
(b) Any solution of the constrained Lasso (7) with R = ‖θ*‖_1 satisfies ‖θ̂ − θ*‖_2 ≤ (4/κ) √s ‖z‖_∞.
(c) Any solution of relaxed basis pursuit (8) with b² ≥ γ² satisfies ‖θ̂ − θ*‖_2 ≤ (4/κ) √s ‖z‖_∞ + 2 √( (b² − γ²)/κ ).
25 / 57

26 Example (fixed design regression) Assume y = Xθ* + w where w ∼ N(0, σ² I_n), and X ∈ R^{n×d} is fixed, satisfying the RE condition and the column normalization
max_{j=1,…,d} ‖X_j‖_2 / √n ≤ C,
where X_j is the jth column of X. Recall z = X^T w / n. It is easy to show that w.p. 1 − 2e^{−nδ²/2},
‖z‖_∞ ≤ Cσ ( √(2 log d / n) + δ ).
Thus, setting λ = 2Cσ ( √(2 log d / n) + δ ), the Lasso solution satisfies
‖θ̂ − θ*‖_2 ≤ (6Cσ/κ) √s ( √(2 log d / n) + δ )
w.p. at least 1 − 2e^{−nδ²/2}. 26 / 57
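A quick Monte Carlo check of the tail bound on ‖z‖_∞ used above (a sketch; the design, δ and the number of trials are arbitrary, and the columns are rescaled so that C = 1):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma, delta, trials = 400, 1000, 1.0, 0.1, 200
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=0) / np.sqrt(n)            # enforce ||X_j||_2 / sqrt(n) = 1, so C = 1
bound = sigma * (np.sqrt(2 * np.log(d) / n) + delta)   # C*sigma*(sqrt(2 log d / n) + delta)

hits = 0
for _ in range(trials):
    w = sigma * rng.standard_normal(n)
    hits += np.max(np.abs(X.T @ w / n)) <= bound
print(hits / trials)   # empirically near 1; the theory guarantees >= 1 - 2*exp(-n*delta^2/2) ~ 0.73 here
```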

27 Taking δ = √(2 log d / n), we have w.h.p. (i.e., with probability at least 1 − 2d^{−1})
‖θ̂ − θ*‖_2 ≲ σ √( s log d / n ).
This is the typical high-dimensional scaling in sparse problems. Had we known the support S in advance, our rate would be (w.h.p.)
‖θ̂ − θ*‖_2 ≲ σ √( s / n ).
The log d factor is the price for not knowing the support; roughly, it is the price for searching over the collection of (d choose s) ≈ d^s candidate supports. 27 / 57

28 Proof of Theorem 2 Let us simplify the loss L(θ) := (1/2n) ‖Xθ − y‖_2^2. Setting Δ = θ − θ*,
L(θ) = (1/2n) ‖X(θ − θ*) − w‖_2^2 = (1/2n) ‖XΔ − w‖_2^2
= (1/2n) ‖XΔ‖_2^2 − (1/n) ⟨XΔ, w⟩ + const.
= (1/2n) ‖XΔ‖_2^2 − (1/n) ⟨Δ, X^T w⟩ + const.
= (1/2n) ‖XΔ‖_2^2 − ⟨Δ, z⟩ + const.,
where z = X^T w / n. Hence,
L(θ) − L(θ*) = (1/2n) ‖XΔ‖_2^2 − ⟨Δ, z⟩. (9)
Exercise: show that (9) is the Taylor expansion of L around θ*. 28 / 57

29 Proof (constrained version) By optimality of θ̂ and feasibility of θ*: L(θ̂) ≤ L(θ*). The error vector Δ̂ := θ̂ − θ* satisfies the basic inequality
(1/2n) ‖XΔ̂‖_2^2 ≤ ⟨z, Δ̂⟩.
Using Hölder's inequality,
(1/2n) ‖XΔ̂‖_2^2 ≤ ‖z‖_∞ ‖Δ̂‖_1.
Since ‖θ̂‖_1 ≤ ‖θ*‖_1, we have Δ̂ = θ̂ − θ* ∈ C_1(S), hence
‖Δ̂‖_1 = ‖Δ̂_S‖_1 + ‖Δ̂_{S^c}‖_1 ≤ 2 ‖Δ̂_S‖_1 ≤ 2 √s ‖Δ̂‖_2.
Combined with the RE condition (Δ̂ ∈ C_3(S) as well),
(κ/2) ‖Δ̂‖_2^2 ≤ 2 √s ‖z‖_∞ ‖Δ̂‖_2,
which gives the desired result. 29 / 57

30 Proof (Lagrangian version) Let L̃(θ) := L(θ) + λ‖θ‖_1 be the regularized loss. The basic inequality is
L(θ̂) + λ‖θ̂‖_1 ≤ L(θ*) + λ‖θ*‖_1.
Rearranging,
(1/2n) ‖XΔ̂‖_2^2 ≤ ⟨z, Δ̂⟩ + λ( ‖θ*‖_1 − ‖θ̂‖_1 ).
We have
‖θ*‖_1 − ‖θ̂‖_1 = ‖θ*_S‖_1 − ‖θ*_S + Δ̂_S‖_1 − ‖Δ̂_{S^c}‖_1 ≤ ‖Δ̂_S‖_1 − ‖Δ̂_{S^c}‖_1.
Since λ ≥ 2‖z‖_∞,
(1/n) ‖XΔ̂‖_2^2 ≤ λ ‖Δ̂‖_1 + 2λ( ‖Δ̂_S‖_1 − ‖Δ̂_{S^c}‖_1 ) ≤ λ( 3‖Δ̂_S‖_1 − ‖Δ̂_{S^c}‖_1 ).
It follows that Δ̂ ∈ C_3(S) and the rest of the proof follows. 30 / 57

31 RE condition for anisotropic design For a PSD matrix Σ, let ρ²(Σ) = max_i Σ_ii.
Theorem 3. Let X ∈ R^{n×d} have rows i.i.d. from N(0, Σ). Then there exist universal constants c_1 < 1 < c_2 such that
‖Xθ‖_2^2 / n ≥ c_1 ‖√Σ θ‖_2^2 − c_2 ρ²(Σ) (log d / n) ‖θ‖_1^2, for all θ ∈ R^d, (10)
with probability at least 1 − e^{−n/32} / (1 − e^{−n/32}).
Exercise 7.11: (10) implies the RE condition over C_3(S) uniformly over all subsets of cardinality
|S| ≤ c_1 γ_min(Σ) n / ( 32 c_2 ρ²(Σ) log d ).
In other words, n ≳ s log d ⟹ RE condition over C_3(S) for all |S| ≤ s. 31 / 57

32 Examples
Toeplitz family: Σ_ij = ν^{|i−j|}, ρ²(Σ) = 1, γ_min(Σ) ≥ (1 − ν)² > 0.
Spiked model: Σ := (1 − µ) I_d + µ 1 1^T, ρ²(Σ) = 1, γ_min(Σ) = 1 − µ.
For future applications, note that (10) implies
‖Xθ‖_2^2 / n ≥ α_1 ‖θ‖_2^2 − α_2 ‖θ‖_1^2, for all θ ∈ R^d,
where α_1 = c_1 γ_min(Σ) and α_2 = c_2 ρ²(Σ) log d / n. 32 / 57
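Both covariance families are easy to inspect numerically; the sketch below (with arbitrary ν, µ, d) just confirms ρ²(Σ) and γ_min(Σ) for each:

```python
import numpy as np
from scipy.linalg import toeplitz

d, nu, mu = 200, 0.5, 0.3
Sigma_toep = toeplitz(nu ** np.arange(d))                    # Sigma_ij = nu^{|i-j|}
Sigma_spike = (1 - mu) * np.eye(d) + mu * np.ones((d, d))    # spiked model

for name, S in [("Toeplitz", Sigma_toep), ("Spiked", Sigma_spike)]:
    rho2 = np.max(np.diag(S))                 # rho^2(Sigma) = max_i Sigma_ii
    gamma_min = np.linalg.eigvalsh(S)[0]      # smallest eigenvalue
    print(name, rho2, gamma_min)
# Toeplitz: rho2 = 1, gamma_min >= (1 - nu)^2 = 0.25;  Spiked: rho2 = 1, gamma_min = 1 - mu = 0.7
```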

33 Lasso oracle inequality For simplicity, let κ = γ_min(Σ) and ρ² = ρ²(Σ) = max_i Σ_ii.
Theorem 4. Under condition (10), consider the Lagrangian Lasso with regularization parameter λ ≥ 2‖z‖_∞ where z = X^T w/n. For any θ* ∈ R^d, any optimal solution θ̂ satisfies the bound
‖θ̂ − θ*‖_2^2 ≤ (λ² / (c_1² κ²)) |S| + (16 λ / (c_1 κ)) ‖θ*_{S^c}‖_1 + (c_2 ρ² / (c_1 κ)) (log d / n) ‖θ*_{S^c}‖_1^2, (11)
where the first term is the estimation error and the last two terms are the approximation error, valid for any subset S with cardinality
|S| ≤ c_1 κ n / ( 64 c_2 ρ² log d ). 33 / 57

34 Simplifying the bound:
‖θ̂ − θ*‖_2^2 ≲ κ_1 λ² |S| + κ_2 λ ‖θ*_{S^c}‖_1 + κ_3 (log d / n) ‖θ*_{S^c}‖_1^2,
where κ_1, κ_2, κ_3 are constants depending on Σ. Assume σ = 1 (noise variance) for simplicity. Since ‖z‖_∞ ≲ √(log d / n) w.h.p., we can take λ of this order:
‖θ̂ − θ*‖_2^2 ≲ (log d / n) |S| + √(log d / n) ‖θ*_{S^c}‖_1 + (log d / n) ‖θ*_{S^c}‖_1^2.
Optimizing the bound over S:
‖θ̂ − θ*‖_2^2 ≲ inf_{|S| ≲ n / log d} [ (log d / n) |S| + √(log d / n) ‖θ*_{S^c}‖_1 + (log d / n) ‖θ*_{S^c}‖_1^2 ].
An oracle that knows θ* can choose the optimal S. 34 / 57

35 Example: ℓ_q-ball sparsity Assume that θ* ∈ B_q, i.e., Σ_{j=1}^d |θ*_j|^q ≤ 1, for some q ∈ [0, 1]. Then, assuming σ² = 1, we have the rate (Exercise 7.12)
‖θ̂ − θ*‖_2^2 ≲ ( log d / n )^{1 − q/2}.
Sketch: Trick: take S = { i : |θ*_i| > τ } and find a good threshold τ later. Show that ‖θ*_{S^c}‖_1 ≤ τ^{1−q} and |S| ≤ τ^{−q}. The bound would be of the form (with ε := √(log d / n))
ε² τ^{−q} + ε τ^{1−q} + ( ε τ^{1−q} )².
Ignore the last term (assuming ε τ^{1−q} ≲ 1, it is not dominant) and balance the remaining two terms over τ; see the worked calculation below. 35 / 57
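For completeness, here is the threshold calculation behind the stated rate, written out as a short derivation (a sketch under the assumptions ετ^{1−q} ≲ 1 and q ∈ (0, 1)):

```latex
% Balancing the first two terms of  \epsilon^2 \tau^{-q} + \epsilon \tau^{1-q},
% with \epsilon := \sqrt{\log d / n}.
\[
  \frac{d}{d\tau}\Big(\epsilon^2 \tau^{-q} + \epsilon \tau^{1-q}\Big)
  = -q\,\epsilon^2 \tau^{-q-1} + (1-q)\,\epsilon\,\tau^{-q} = 0
  \;\;\Longrightarrow\;\; \tau = \tfrac{q}{1-q}\,\epsilon \asymp \epsilon .
\]
Plugging $\tau \asymp \epsilon$ back in,
\[
  \epsilon^2 \tau^{-q} + \epsilon \tau^{1-q} \asymp \epsilon^{2-q}
  = \Big(\tfrac{\log d}{n}\Big)^{1-q/2},
\]
which matches the rate claimed above; the ignored term $(\epsilon\tau^{1-q})^2 \asymp \epsilon^{2(2-q)}$ is indeed of lower order.
```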

36 Proof of Theorem 4 Let L̃(θ) := L(θ) + λ‖θ‖_1 be the regularized loss. The basic inequality is
L(θ̂) + λ‖θ̂‖_1 ≤ L(θ*) + λ‖θ*‖_1.
Rearranging,
(1/2n) ‖XΔ̂‖_2^2 ≤ ⟨z, Δ̂⟩ + λ( ‖θ*‖_1 − ‖θ̂‖_1 ).
We have
‖θ*‖_1 − ‖θ̂‖_1 = ‖θ*_S‖_1 − ‖θ*_S + Δ̂_S‖_1 − ‖θ*_{S^c} + Δ̂_{S^c}‖_1 + ‖θ*_{S^c}‖_1 ≤ ‖Δ̂_S‖_1 − ‖Δ̂_{S^c}‖_1 + 2‖θ*_{S^c}‖_1.
Since λ ≥ 2‖z‖_∞,
(1/n) ‖XΔ̂‖_2^2 ≤ λ ‖Δ̂‖_1 + 2λ( ‖Δ̂_S‖_1 − ‖Δ̂_{S^c}‖_1 + 2‖θ*_{S^c}‖_1 ) ≤ λ( 3‖Δ̂_S‖_1 − ‖Δ̂_{S^c}‖_1 + 4‖θ*_{S^c}‖_1 ). 36 / 57

37 Let s = |S| and b := 2‖θ*_{S^c}‖_1. Then the error satisfies ‖Δ̂_{S^c}‖_1 ≤ 3‖Δ̂_S‖_1 + 2b. That is, ‖Δ̂‖_1 ≤ 4‖Δ̂_S‖_1 + 2b, hence
‖Δ̂‖_1^2 ≤ ( 4‖Δ̂_S‖_1 + 2b )^2 ≤ 32 ‖Δ̂_S‖_1^2 + 8b² ≤ 32 s ‖Δ̂_S‖_2^2 + 8b² ≤ 32 s ‖Δ̂‖_2^2 + 8b².
Bound (10) can be written as (with α_1, α_2 > 0)
‖XΔ‖_2^2 / n ≥ α_1 ‖Δ‖_2^2 − α_2 ‖Δ‖_1^2, for all Δ ∈ R^d.
Applying this to Δ̂, we have
‖XΔ̂‖_2^2 / n ≥ α_1 ‖Δ̂‖_2^2 − α_2 ( 32 s ‖Δ̂‖_2^2 + 8b² ) = (α_1 − 32 α_2 s) ‖Δ̂‖_2^2 − 8 α_2 b². 37 / 57

38 We want α_1 − 32α_2 s to be strictly positive. Assume that α_1/2 ≥ 32α_2 s, so that α_1 − 32α_2 s ≥ α_1/2. We obtain
(α_1/2) ‖Δ̂‖_2^2 − 8α_2 b² ≤ λ( 3‖Δ̂_S‖_1 − ‖Δ̂_{S^c}‖_1 + 2b ).
Drop −‖Δ̂_{S^c}‖_1, use ‖Δ̂_S‖_1 ≤ √s ‖Δ̂‖_2 and rearrange:
(α_1/2) ‖Δ̂‖_2^2 ≤ λ( 3√s ‖Δ̂‖_2 + 2b ) + 8α_2 b².
This is a quadratic inequality in ‖Δ̂‖_2. Using the inequality on the next slide,
‖Δ̂‖_2^2 ≤ 2(3λ√s)² / (α_1/2)² + 2( 2λ b + 8α_2 b² ) / (α_1/2) = 72 λ² s / α_1^2 + 16 λ ‖θ*_{S^c}‖_1 / α_1 + 128 α_2 ‖θ*_{S^c}‖_1^2 / α_1,
where we used b = 2‖θ*_{S^c}‖_1. This proves the theorem with better constants! 38 / 57

39 In general, ax² ≤ bx + c and x ≥ 0 imply
x ≤ b/a + √(c/a),
which itself implies
x² ≤ 2b²/a² + 2c/a. 39 / 57
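A one-line verification of this fact, filled in for completeness (not on the original slide):

```latex
% From  a x^2 \le b x + c  with  x \ge 0  and  a > 0:
\[
  x \;\le\; \frac{b + \sqrt{b^2 + 4ac}}{2a}
    \;\le\; \frac{b + b + 2\sqrt{ac}}{2a}
    \;=\; \frac{b}{a} + \sqrt{\frac{c}{a}},
\]
using $\sqrt{b^2 + 4ac} \le b + 2\sqrt{ac}$. Then, by $(u+v)^2 \le 2u^2 + 2v^2$,
\[
  x^2 \;\le\; \Big(\frac{b}{a} + \sqrt{\tfrac{c}{a}}\Big)^2 \;\le\; \frac{2b^2}{a^2} + \frac{2c}{a}.
\]
```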

40 Bounds on prediction error The following can be thought of as the mean-squared prediction error:
(1/n) ‖X(θ̂ − θ*)‖_2^2 = (1/n) Σ_{i=1}^n ⟨x_i, θ̂ − θ*⟩².
Letting f_θ(x) = ⟨x, θ⟩ be the regression function, in prediction we are interested in estimating the function f_{θ*}(·). Defining the empirical norm
‖f‖_n = ( (1/n) Σ_{i=1}^n f(x_i)² )^{1/2},
we can write
(1/n) ‖X(θ̂ − θ*)‖_2^2 = ‖f_{θ̂} − f_{θ*}‖_n². (12)
For sufficiently regular points, ‖f‖_n² ≈ ‖f‖_{L²}² = ∫ f²(x) dx. There is another explanation in HDS. 40 / 57

41 Prediction error bounds Theorem 5. Consider the Lagrangian Lasso with λ ≥ 2‖z‖_∞ where z = X^T w/n:
(a) Any optimal solution θ̂ satisfies
(1/n) ‖X(θ̂ − θ*)‖_2^2 ≤ 12 ‖θ*‖_1 λ.
(b) If θ* is supported on S with |S| ≤ s and X satisfies the (κ, 3)-RE condition over S, then any optimal solution satisfies
(1/n) ‖X(θ̂ − θ*)‖_2^2 ≤ (9/κ) s λ². 41 / 57

42 Example (fixed design regression, no RE) Assume y = Xθ* + w where w has i.i.d. σ-sub-Gaussian entries, and X ∈ R^{n×d} is fixed and satisfies the C-column normalization
max_{j=1,…,d} ‖X_j‖_2 / √n ≤ C,
where X_j is the jth column of X. Recalling z = X^T w/n, w.p. 1 − 2e^{−nδ²/2},
‖z‖_∞ ≤ Cσ ( √(2 log d / n) + δ ).
Thus, setting λ = 2Cσ ( √(2 log d / n) + δ ), the Lasso solution satisfies
(1/n) ‖X(θ̂ − θ*)‖_2^2 ≤ 24 ‖θ*‖_1 Cσ ( √(2 log d / n) + δ )
w.p. at least 1 − 2e^{−nδ²/2}. 42 / 57

43 Example (fixed design regression, with RE) Assume y = Xθ* + w where w has i.i.d. σ-sub-Gaussian entries, and X ∈ R^{n×d} is fixed and satisfies the RE condition and the C-column normalization
max_{j=1,…,d} ‖X_j‖_2 / √n ≤ C,
where X_j is the jth column of X. Recalling z = X^T w/n, with probability 1 − 2e^{−nδ²/2},
‖z‖_∞ ≤ Cσ ( √(2 log d / n) + δ ).
Thus, setting λ = 2Cσ ( √(2 log d / n) + δ ), the Lasso solution satisfies
(1/n) ‖X(θ̂ − θ*)‖_2^2 ≲ (C² σ² / κ) ( 2 s log d / n + δ² s )
w.p. at least 1 − 2e^{−nδ²/2}. 43 / 57

44 Under very mild assumptions (no RE), slow rate:
(1/n) ‖X(θ̂ − θ*)‖_2^2 ≤ 24 ‖θ*‖_1 Cσ ( √(2 log d / n) + δ ).
Under stronger assumptions (e.g., the RE condition), fast rate:
(1/n) ‖X(θ̂ − θ*)‖_2^2 ≲ (C² σ² / κ) ( 2 s log d / n + δ² s ).
Is RE needed for fast rates? 44 / 57

45 Proof of part (a) Recall the basic inequality, where Δ̂ = θ̂ − θ*:
(1/2n) ‖XΔ̂‖_2^2 ≤ ⟨z, Δ̂⟩ + λ( ‖θ*‖_1 − ‖θ̂‖_1 ).
Using λ ≥ 2‖z‖_∞ and Hölder's inequality,
⟨z, Δ̂⟩ ≤ ‖z‖_∞ ‖Δ̂‖_1 ≤ (λ/2) ‖Δ̂‖_1 ≤ (λ/2)( ‖θ̂‖_1 + ‖θ*‖_1 ).
Putting the pieces together we conclude ‖θ̂‖_1 ≤ 3‖θ*‖_1. The triangle inequality gives ‖Δ̂‖_1 ≤ 4‖θ*‖_1. Since ‖θ*‖_1 − ‖θ̂‖_1 ≤ ‖Δ̂‖_1,
(1/2n) ‖XΔ̂‖_2^2 ≤ (λ/2) ‖Δ̂‖_1 + λ ‖Δ̂‖_1 ≤ (3λ/2) · 4‖θ*‖_1 = 6λ ‖θ*‖_1,
using the upper bound on ‖Δ̂‖_1 above, which gives (1/n) ‖XΔ̂‖_2^2 ≤ 12 λ ‖θ*‖_1. 45 / 57

46 Proof of part (b) As before, we obtain
(1/n) ‖XΔ̂‖_2^2 ≤ 3λ √s ‖Δ̂‖_2
and that Δ̂ ∈ C_3(S). We now apply the RE condition to the other side:
‖XΔ̂‖_2^2 / n ≤ 3λ √s ‖Δ̂‖_2 ≤ 3λ √s · (1/√κ) · ‖XΔ̂‖_2 / √n,
which gives the desired result. 46 / 57

47 Variable selection using lasso Can we recover the exact support of θ*? This needs the most stringent conditions:
Lower eigenvalue: γ_min( X_S^T X_S / n ) ≥ c_min > 0. (LowEig)
Mutual incoherence: there exists some α ∈ [0, 1) such that
max_{j ∈ S^c} ‖ (X_S^T X_S)^{-1} X_S^T X_j ‖_1 ≤ α. (MuI)
The expression is the ℓ_1 norm of ω̂ where ω̂ = argmin_{ω ∈ R^s} ‖ X_j − X_S ω ‖_2^2. Letting Σ̂ = X^T X / n be the sample covariance, we can write (MuI) as
‖ Σ̂_{SS}^{-1} Σ̂_{S S^c} ‖_1 ≤ α
(the induced ℓ_1 norm, i.e., the maximum column ℓ_1 norm). Consider the projection matrix projecting onto Im(X_S)^⊥:
Π_S = I_n − X_S (X_S^T X_S)^{-1} X_S^T. 47 / 57
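Both conditions are directly computable for a given design. The sketch below (mutual_incoherence and proj_complement are hypothetical helper names) evaluates (MuI) and builds the projection Π_S for a random Gaussian design:

```python
import numpy as np

def mutual_incoherence(X, S):
    """max_{j in S^c} || (X_S^T X_S)^{-1} X_S^T X_j ||_1; (MuI) asks this to be <= alpha < 1."""
    S = list(S)
    Sc = np.setdiff1d(np.arange(X.shape[1]), S)
    XS = X[:, S]
    Omega = np.linalg.solve(XS.T @ XS, XS.T @ X[:, Sc])   # columns are the regression coefficients omega_j
    return np.max(np.abs(Omega).sum(axis=0))

def proj_complement(XS):
    """Pi_S = I - X_S (X_S^T X_S)^{-1} X_S^T, the projection onto Im(X_S)^perp."""
    n = XS.shape[0]
    return np.eye(n) - XS @ np.linalg.solve(XS.T @ XS, XS.T)

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 50))
S = [0, 1, 2]
print(mutual_incoherence(X, S))             # well below 1 for this design
Pi = proj_complement(X[:, S])
print(np.allclose(Pi @ X[:, S], 0.0))       # True: Pi_S annihilates the columns in S
```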

48 Theorem 6. Consider the S-sparse linear model with design X satisfying (LowEig) and (MuI). Let w̄ := w/n and assume that
λ ≥ (2/(1−α)) ‖ X_{S^c}^T Π_S w̄ ‖_∞. (13)
Then the Lagrangian Lasso has the following properties:
(a) There is a unique optimal solution θ̂.
(b) No false inclusion: Ŝ := supp(θ̂) ⊆ S.
(c) ℓ_∞ bounds: the error θ̂ − θ* satisfies
‖θ̂_S − θ*_S‖_∞ ≤ ‖Σ̂_{SS}^{-1} X_S^T w̄‖_∞ + ‖Σ̂_{SS}^{-1}‖_∞ λ =: τ(λ; X). (14)
(d) No false exclusion: the Lasso includes all indices i ∈ S such that |θ*_i| > τ(λ; X), hence it is variable selection consistent if min_{i∈S} |θ*_i| > τ(λ; X). 48 / 57

49 Theorem 7. Consider the S-sparse linear model with design X satisfying (LowEig) and (MuI). Assume that the noise vector w is zero mean with i.i.d. σ-sub-Gaussian entries, and that X is a C-column normalized deterministic design. Take the regularization parameter to be
λ = (2Cσ/(1−α)) { √( 2 log(d−s) / n ) + δ }
for some δ > 0. Then the optimal solution θ̂ is unique, has support contained in S, and satisfies the ℓ_∞ error bound
‖θ̂_S − θ*_S‖_∞ ≤ (σ/√c_min) ( √( 2 log s / n ) + δ ) + ‖Σ̂_{SS}^{-1}‖_∞ λ,
all with probability at least 1 − 4e^{−nδ²/2}. 49 / 57

50 Need to verify (13): it is enough to control Z_j := X_j^T Π_S w̄ for j ∈ S^c. We have ‖Π_S X_j‖_2 ≤ ‖X_j‖_2 ≤ C√n. (Projections are nonexpansive.) Hence Z_j is sub-Gaussian with squared-parameter C²σ²/n. (Exercise.) It follows that
P[ max_{j∈S^c} |Z_j| ≥ t ] ≤ 2(d−s) e^{−n t² / (2C²σ²)}.
The choice of λ satisfies (13) with high probability. Define Z_S = Σ̂_{SS}^{-1} X_S^T w̄. Each Z_i = e_i^T Σ̂_{SS}^{-1} X_S^T w̄ is sub-Gaussian with parameter at most
(σ²/n) ‖Σ̂_{SS}^{-1}‖_op ≤ σ² / (c_min n).
It follows that
P[ max_{i=1,…,s} |Z_i| > (σ/√c_min)( √(2 log s / n) + δ ) ] ≤ 2e^{−nδ²/2}. 50 / 57

51 Exercise Assume that w ∈ R^d has independent sub-Gaussian entries, with sub-Gaussian squared-parameter σ². Let x ∈ R^d be a deterministic vector. Then x^T w is sub-Gaussian with squared-parameter σ² ‖x‖_2^2. 51 / 57

52 The corollary applies to fixed designs. A similar result holds for Gaussian random designs (rows i.i.d. from N(0, Σ)), assuming that Σ satisfies α-incoherence. (Exercise 7.19): the sample covariance Σ̂ then satisfies α-incoherence w.h.p. if n ≳ s log(d − s). 52 / 57

53 Detour: subgradients Consider a convex function f : R^d → R. A vector z ∈ R^d is a subgradient of f at θ, denoted z ∈ ∂f(θ), if
f(θ + Δ) ≥ f(θ) + ⟨z, Δ⟩, for all Δ ∈ R^d.
For a convex function, "θ minimizes f" is equivalent to 0 ∈ ∂f(θ). For the ℓ_1 norm, i.e. f(θ) = ‖θ‖_1,
z ∈ ∂‖θ‖_1 ⟺ z_j ∈ sign(θ_j),
where sign(·) is the generalized sign, i.e. sign(0) = [−1, 1]. For the Lasso, (θ̂, ẑ) is primal-dual optimal if θ̂ is a minimizer and ẑ ∈ ∂‖θ̂‖_1. Equivalently, the primal-dual optimality conditions can be written as
(1/n) X^T (Xθ̂ − y) + λẑ = 0, (15)
ẑ ∈ ∂‖θ̂‖_1, (16)
where (15) is the zero subgradient condition. 53 / 57
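The zero-subgradient conditions (15)–(16) can be verified numerically for any Lasso solution; a hedged sketch (tolerances and data are arbitrary), using scikit-learn's Lasso as the solver:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d, lam = 100, 30, 0.1
X = rng.standard_normal((n, d))
y = X[:, :3] @ np.array([1.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(n)

theta_hat = Lasso(alpha=lam, fit_intercept=False, tol=1e-12, max_iter=100_000).fit(X, y).coef_
z_hat = -X.T @ (X @ theta_hat - y) / (n * lam)     # dual variable implied by (15)

on = np.abs(theta_hat) > 1e-8                      # estimated support
print(np.allclose(z_hat[on], np.sign(theta_hat[on]), atol=1e-4))   # z_j = sign(theta_j) on the support
print(np.max(np.abs(z_hat[~on])) <= 1 + 1e-4)                      # |z_j| <= 1 off the support, by (16)
```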

54 Proof of Theorem 6 Primal-dual witness (PDW) construction:
1. Set θ̂_{S^c} = 0.
2. Determine (θ̂_S, ẑ_S) ∈ R^s × R^s by solving the oracle subproblem
θ̂_S ∈ argmin_{θ_S ∈ R^s} (1/2n) ‖y − X_S θ_S‖_2^2 + λ ‖θ_S‖_1
and choosing ẑ_S ∈ ∂‖θ̂_S‖_1 such that ∇f(θ_S)|_{θ_S = θ̂_S} + λ ẑ_S = 0, where f(θ_S) = (1/2n) ‖y − X_S θ_S‖_2^2.
3. Solve for ẑ_{S^c} ∈ R^{d−s} via the zero subgradient equation and check for strict dual feasibility ‖ẑ_{S^c}‖_∞ < 1.
Lemma 1. Under condition (LowEig), the success of the PDW construction implies that (θ̂_S, 0) is the unique optimal solution of the Lasso.
Proof: We only need to show uniqueness, which follows from this: under strong duality, the set of saddle points of the Lagrangian forms a Cartesian product, i.e., we can mix and match the primal and dual parts of two primal-dual pairs to also get primal-dual pairs. 54 / 57
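A numerical sketch of the PDW construction (pdw_check is a hypothetical helper; it reuses a generic Lasso solver for the oracle subproblem and then tests strict dual feasibility):

```python
import numpy as np
from sklearn.linear_model import Lasso

def pdw_check(X, y, S, lam):
    """Steps 1-3 of the primal-dual witness construction for a candidate support S."""
    n, d = X.shape
    S = list(S)
    Sc = np.setdiff1d(np.arange(d), S)
    # Step 2: oracle subproblem restricted to the columns in S
    theta_S = Lasso(alpha=lam, fit_intercept=False, tol=1e-12, max_iter=100_000).fit(X[:, S], y).coef_
    # Step 3: z_{S^c} from the zero-subgradient equation (theta_{S^c} = 0 by Step 1)
    z_Sc = -X[:, Sc].T @ (X[:, S] @ theta_S - y) / (n * lam)
    return theta_S, np.max(np.abs(z_Sc)) < 1        # strict dual feasibility ||z_{S^c}||_inf < 1 ?

rng = np.random.default_rng(0)
n, d, s = 300, 100, 3
X = rng.standard_normal((n, d))
theta_star = np.zeros(d); theta_star[:s] = [2.0, -2.0, 1.5]
y = X @ theta_star + 0.2 * rng.standard_normal(n)
theta_S, success = pdw_check(X, y, S=range(s), lam=0.2)
print(success, theta_S)    # success = True means supp(theta_hat) is contained in S and theta_hat is unique
```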

55 Using y = Xθ* + w, θ̂_{S^c} = 0 (by construction) and θ*_{S^c} = 0 (by assumption), we can write the zero subgradient condition in block form:
[ Σ̂_{SS}    Σ̂_{SS^c}   ] [ θ̂_S − θ*_S ]   [ u_S     ]       [ ẑ_S     ]   [ 0 ]
[ Σ̂_{S^cS}  Σ̂_{S^cS^c} ] [ 0          ] − [ u_{S^c} ] + λ [ ẑ_{S^c} ] = [ 0 ],
where u = X^T w / n, so that u_S = X_S^T w / n and so on. The top equation is satisfied since (θ̂_S, ẑ_S) is chosen to solve the oracle Lasso. We only need to satisfy the bottom equation, which we do by choosing ẑ_{S^c} as needed:
ẑ_{S^c} = −(1/λ) Σ̂_{S^cS} ( θ̂_S − θ*_S ) + u_{S^c} / λ.
Since by assumption Σ̂_{SS} is invertible, we can solve for θ̂_S − θ*_S from the first equation:
θ̂_S − θ*_S = Σ̂_{SS}^{-1} ( u_S − λ ẑ_S ).
Combining,
ẑ_{S^c} = Σ̂_{S^cS} Σ̂_{SS}^{-1} ẑ_S + (1/λ)( u_{S^c} − Σ̂_{S^cS} Σ̂_{SS}^{-1} u_S ). 55 / 57

56 We had
ẑ_{S^c} = Σ̂_{S^cS} Σ̂_{SS}^{-1} ẑ_S + (1/λ)( u_{S^c} − Σ̂_{S^cS} Σ̂_{SS}^{-1} u_S ).
Note that (with w̄ = w/n)
u_{S^c} − Σ̂_{S^cS} Σ̂_{SS}^{-1} u_S = X_{S^c}^T w̄ − Σ̂_{S^cS} Σ̂_{SS}^{-1} X_S^T w̄ = X_{S^c}^T [ I − X_S (X_S^T X_S)^{-1} X_S^T ] w̄ = X_{S^c}^T Π_S w̄.
Thus, we have
ẑ_{S^c} = Σ̂_{S^cS} Σ̂_{SS}^{-1} ẑ_S + X_{S^c}^T Π_S w̄ / λ =: µ + v.
By (MuI) we have ‖µ‖_∞ ≤ α, and by our choice of λ, ‖v‖_∞ < 1 − α. This verifies strict dual feasibility ‖ẑ_{S^c}‖_∞ < 1, hence the constructed pair is primal-dual optimal and the primal solution is unique. 56 / 57

57 It remains to show the ℓ_∞ bound, which follows from applying the triangle inequality to
θ̂_S − θ*_S = Σ̂_{SS}^{-1} ( u_S − λ ẑ_S ),
leading to
‖θ̂_S − θ*_S‖_∞ ≤ ‖Σ̂_{SS}^{-1} u_S‖_∞ + λ ‖Σ̂_{SS}^{-1}‖_∞,
using the sub-multiplicative property of operator norms and ‖ẑ_S‖_∞ ≤ 1. 57 / 57
