Lecture 12: February 28


10-716: Advanced Machine Learning, Spring 2019

Lecturer: Pradeep Ravikumar
Scribes: Jacob Tyo, Rishub Jain, Ojash Neopane

Note: LaTeX template courtesy of UC Berkeley EECS dept.

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

In the previous lecture we talked about the Restricted Nullspace Property (RNSP). We begin this lecture by recounting the relevant theorems, and then improve on these results with Restricted Eigenvalues (RE).

Preliminaries

We first recall some definitions:

Definition 12.1 (Restricted Nullspace Property (RNSP)) A matrix $X$ satisfies the restricted nullspace property (RNSP) with respect to $S \subseteq \{1, \dots, d\}$ if
\[ C(S) \cap \mathrm{null}(X) = \{0\} \tag{12.1} \]
where $C(S) := \{\Delta \in \mathbb{R}^d \mid \|\Delta_{S^c}\|_1 \le \|\Delta_S\|_1\}$ is the cone of vectors whose $\ell_1$-norm off the support is dominated by the $\ell_1$-norm on the support.

We are interested in the RNSP because when it is satisfied, we know that solving the Basis Pursuit Linear Program (BPLP)
\[ \min_{\theta \in \mathbb{R}^d} \|\theta\|_1 \quad \text{such that } X\theta = y \tag{12.2} \]
is equivalent to solving the $\ell_0$ problem
\[ \min_{\theta \in \mathbb{R}^d} \|\theta\|_0 \quad \text{such that } X\theta = y \tag{12.3} \]
which is typically computationally infeasible to solve (since it requires a search over a space which grows exponentially with the sparsity of $\theta$). This is formalized by the following theorem:

Theorem 12.2 The following two properties are equivalent:
1. For any vector $\theta^* \in \mathbb{R}^d$ with support $S$, the BPLP (12.2) applied with $y = X\theta^*$ has a unique solution $\hat\theta = \theta^*$.
2. The matrix $X$ satisfies the RNSP with respect to $S$.

For a given $X$, identifying the subsets $S$ which satisfy the RNSP runs into the same difficulties as solving the $\ell_0$ problem. To circumvent this, we introduce two definitions which allow us to more easily certify when the RNSP holds:

Definition 12.3 (Pairwise Incoherence) The pairwise incoherence of a design matrix $X \in \mathbb{R}^{n \times d}$ with columns $x_1, \dots, x_d$, denoted $\delta_{PW}(X)$, is defined as
\[ \delta_{PW}(X) := \max_{j \ne k} \frac{|\langle x_j, x_k \rangle|}{n} \tag{12.4} \]

Definition 12.4 (Restricted Isometry Property) For a given integer $s \in \{1, \dots, d\}$, we say that $X \in \mathbb{R}^{n \times d}$ satisfies a restricted isometry property (RIP) of order $s$ with constant $\delta_s^{RIP}(X) > 0$ if
\[ \left\| \frac{X_S^T X_S}{n} - I_s \right\|_2 \le \delta_s^{RIP}(X) \tag{12.5} \]
for all subsets $S$ of size at most $s$. Here $\|\cdot\|_2$ denotes the $\ell_2$-operator norm of a matrix, corresponding to its maximum singular value.

From this, we have the theorems:

Theorem 12.5
\[ \delta_{PW}(X) \le \delta_s^{RIP}(X) \le s\,\delta_{PW}(X) \tag{12.6} \]

Theorem 12.6 If the pairwise incoherence of $X$ satisfies the bound
\[ \delta_{PW}(X) \le \frac{1}{3s} \tag{12.7} \]
then the RNSP holds for all subsets $S$ of cardinality at most $s$.

Theorem 12.7 If the RIP constant of order $2s$ is bounded as
\[ \delta_{2s}^{RIP}(X) < \frac{1}{3} \tag{12.8} \]
then the RNSP holds for any subset $S$ of cardinality $|S| \le s$.

Notice that the hypothesis of Theorem 12.6 is much stronger than that of Theorem 12.7: applying the upper bound in (12.6) at order $2s$ gives $\delta_{2s}^{RIP}(X) \le 2s\,\delta_{PW}(X)$, so a sufficiently small pairwise incoherence already forces a small RIP constant.

These results are hard to use in practice, because the RNSP is hard to check. To improve upon them, and to further our understanding of sufficient conditions, we will come up with a better condition that corresponds to the RNSP holding. This leads us to restricted eigenvalues.

Estimation in Noisy Settings

Everything we have discussed so far has been under the assumption that we don't have any noise in our observations, so that we observe the pair $(y, X) \in \mathbb{R}^n \times \mathbb{R}^{n \times d}$ related by the linear model
\[ y = X\theta^* \tag{12.9} \]
However, in most settings we are interested in solving the more realistic problem
\[ y = X\theta^* + w \tag{12.10} \]

which is similar to Problem (12.9) except that there is a noise vector $w \in \mathbb{R}^n$. Analogous to the BPLP, we can define the following equivalent $\ell_1$ regularized problems:
\[ \min_{\theta \in \mathbb{R}^d} \left\{ \frac{1}{2n} \|y - X\theta\|_2^2 + \lambda_n \|\theta\|_1 \right\} \quad \text{(Lagrangian Lasso)} \tag{12.11} \]
\[ \min_{\theta \in \mathbb{R}^d} \left\{ \frac{1}{2n} \|y - X\theta\|_2^2 \right\} \quad \text{such that } \|\theta\|_1 \le R \quad \text{(Constrained Lasso)} \tag{12.12} \]
\[ \min_{\theta \in \mathbb{R}^d} \|\theta\|_1 \quad \text{such that } \frac{1}{2n} \|y - X\theta\|_2^2 \le b^2 \quad \text{(Relaxed Basis Pursuit Program)} \tag{12.13} \]

Remark: All of the above formulations are equivalent in the sense that, for specific choices of the parameters $\lambda_n$, $R$, $b$, they all return the same solution.

Having defined the analogs of the BPLP, we might be tempted to ask under what conditions we can recover $\theta^*$. However, because of the noise, this is an unreasonable question to ask, as it will not always be possible to recover $\theta^*$ exactly. Instead we ask the following more appropriate questions:
(1) Under what assumptions can we bound the error $\|\hat\theta - \theta^*\|_2$ between the Lasso solution $\hat\theta$ and the unknown regression vector $\theta^*$?
(2) Under these assumptions, what bounds can we obtain?

The Restricted Eigenvalue Condition

In this section we will answer the first question: under what assumptions can we bound $\|\hat\theta - \theta^*\|_2$? To frame the problem, remember that the point of studying this is to determine when we can optimally solve a sparse linear model. Previously, we saw that such models can be solved exactly via the $\ell_0$ problem, but that this is NP-hard and computationally impractical. However, with some simple conditions on the data, we can make this problem tractable.

For intuition with respect to the cone, think of the situation where
\[ Y = X\theta^*, \qquad S = \mathrm{supp}(\theta^*). \]
Then, imagine that there exists a nonzero $\tilde\theta$ such that $S = \mathrm{supp}(\theta^*) = \mathrm{supp}(\tilde\theta)$ and $X\tilde\theta = 0$. Then
\[ Y = X(\theta^* + \tilde\theta) = X\theta^*, \]
and thus this is an under-determined and unidentifiable linear system. In summary, if the null space of $X$ contains a nonzero vector supported on $S$, then the problem is unidentifiable. However, if the intersection is trivial, it is solvable:
\[ \mathrm{null}(X) \cap \mathcal{S} = \{0\} \]

where $\mathcal{S} := \{\Delta \in \mathbb{R}^d \mid \Delta_{S^c} = 0\}$ is the set of vectors supported on $S$ (this is the condition for the $\ell_0$ problem). To make the problem solvable with the $\ell_1$ penalty, more vectors must be excluded from the null space. For example, instead of requiring a trivial intersection with just the y-axis, we now require it for the y-axis plus all of the vectors close to the y-axis, as shown in Figure 12.1. Thus, if the null space of $X$ has only the trivial intersection with the cone of vectors around the y-axis, then the $\ell_1$ problem is solvable.

Figure 12.1: (a) shows how in high dimensions, a convex function is often curved in some directions but flat in others. (b) visualizes the cone $C_\alpha(S)$ of vectors close to the y-axis. Figure from [wainwright].

Let
\[ C_\alpha(S) := \left\{ \Delta \in \mathbb{R}^d \mid \|\Delta_{S^c}\|_1 \le \alpha \|\Delta_S\|_1 \right\} \tag{12.14} \]
Now we want a null space property with respect to $C_\alpha(S)$. Think about
\[ \min_{\Delta \ne 0} \frac{\frac{1}{n}\|X\Delta\|_2^2}{\|\Delta\|_2^2}, \]
which is the minimum eigenvalue of $X^T X / n$. Instead, restrict $\Delta$ to lie in the cone $C_\alpha(S)$. This leads us to a form of the Restricted Eigenvalue (RE) condition:
\[ \min_{\Delta \in C_\alpha(S),\ \Delta \ne 0} \frac{\frac{1}{n}\|X\Delta\|_2^2}{\|\Delta\|_2^2} \ge \kappa > 0 \tag{12.15} \]
If the RE condition (12.15) is satisfied, then the RNSP is also satisfied: if the RNSP failed, there would be a nonzero $\Delta$ in the cone with $X\Delta = 0$, and the minimum in (12.15) would equal 0. Because the RE condition forces this minimum to be strictly positive, the RNSP must hold.

For some intuition on why RE is required, think of it as the curvature of the squared loss. If the curvature is not large enough, then having a small error in the loss may not imply being close to the true parameter. To better understand why we need this, consider the squared loss:

\[ L_n(\theta) = \frac{1}{2n} \|Y - X\theta\|_2^2 \]
The Hessian represents the curvature:
\[ \nabla^2 L_n(\theta) = \frac{X^T X}{n} \]
The above is a $d \times d$ matrix with rank at most $n$, which is less than $d$ in the high-dimensional setting.

As previously noted, in machine learning problems we are primarily concerned with minimizing loss. However, getting $\epsilon$-close in loss does not guarantee that we are $\epsilon$-close in terms of the actual parameter (see Figure 12.2). This information is captured by the Hessian, and therefore we want to bound its curvature from below, to ensure that being close in loss means that we are also close in the parameter.

Figure 12.2: Relationship between the curvature of the cost function and estimation error. Figure from [wainwright].

This reliance on curvature is problematic in the sparse setting, because the Hessian has eigenvalues equal to zero: the loss landscape can be flat in many directions (shown in part (a) of Figure 12.1). However, the restricted eigenvalue condition means that we do not need curvature in all directions, only in the directions that lie in the cone. We will see that imposing the restricted eigenvalue condition is enough.
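The contrast between the rank-deficient Hessian and curvature restricted to the cone can be illustrated numerically. The following is a minimal sketch, assuming a standard Gaussian design and a support $S$ made of the first $s$ coordinates; the helper names (`hessian_min_eig`, `sample_cone`, `restricted_curvature`) are hypothetical and not part of the lecture. Random sampling only gives an upper bound on the true minimum over the cone, so this is an illustration and not a certificate that the RE condition holds.

```python
import numpy as np

def hessian_min_eig(X):
    """Smallest eigenvalue of the Hessian X^T X / n of the squared loss."""
    n = X.shape[0]
    return float(np.linalg.eigvalsh(X.T @ X / n)[0])

def sample_cone(rng, d, S, alpha):
    """Draw a random vector with ||delta_{S^c}||_1 <= alpha * ||delta_S||_1."""
    delta = rng.standard_normal(d)
    off = np.ones(d, dtype=bool)
    off[S] = False
    l1_on = np.abs(delta[S]).sum()
    l1_off = np.abs(delta[off]).sum()
    # Rescale the off-support part to use a random fraction of the l1 budget.
    budget = alpha * l1_on * rng.uniform()
    delta[off] *= budget / l1_off
    return delta

def restricted_curvature(X, delta):
    """The ratio (1/n) ||X delta||_2^2 / ||delta||_2^2 from the RE condition."""
    n = X.shape[0]
    return float(np.sum((X @ delta) ** 2) / (n * np.sum(delta ** 2)))

# Assumed toy setup: n < d, so the Hessian must be singular.
rng = np.random.default_rng(0)
n, d, s, alpha = 50, 200, 5, 3.0
X = rng.standard_normal((n, d))
S = np.arange(s)

global_min = hessian_min_eig(X)
cone_min = min(
    restricted_curvature(X, sample_cone(rng, d, S, alpha)) for _ in range(2000)
)
print("min eigenvalue of X^T X / n:", global_min)
print("smallest sampled curvature in C_3(S):", cone_min)
```

The unrestricted minimum eigenvalue is zero up to floating-point error, while the sampled directions in the cone all show strictly positive curvature.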

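Bounds on $\ell_2$-error for Hard Sparse Models

Now that we have answered the first question we turn to the second one: under the assumption that the design matrix $X$ satisfies the restricted eigenvalue condition, what bounds can we obtain on $\|\hat\theta - \theta^*\|_2$? For the remainder of this subsection, we will be operating under the following assumptions:
(A1) The vector $\theta^*$ is supported on a subset $S \subseteq \{1, \dots, d\}$ with $|S| = s$.
(A2) The design matrix satisfies the restricted eigenvalue condition over $S$ with parameters $(\kappa, 3)$.

Theorem 12.8 Suppose that assumptions (A1) and (A2) are satisfied and that
\[ \lambda_n \ge 2 \left\| \frac{X^T w}{n} \right\|_\infty \tag{12.16} \]
Then
\[ \|\hat\theta - \theta^*\|_2 \le \frac{3}{\kappa} \sqrt{s}\, \lambda_n \tag{12.17} \]
where $\hat\theta$ is a solution to the Lagrangian Lasso (Eq. 12.11).

Proof: Assume $\hat\theta$ is a solution to the Lagrangian Lasso. This implies
\[ L_n(\hat\theta) + \lambda_n \|\hat\theta\|_1 \le L_n(\theta^*) + \lambda_n \|\theta^*\|_1 \tag{12.18} \]
Now, expanding $L_n(\hat\theta)$ we see that
\begin{align}
L_n(\hat\theta) &= \frac{1}{2n} \|y - X\hat\theta\|_2^2 \tag{12.19} \\
&= \frac{1}{2n} \|X(\hat\theta - \theta^*) - w\|_2^2 \quad \text{(setting } y = X\theta^* + w\text{)} \tag{12.20} \\
&= \frac{1}{2n} \|X(\hat\theta - \theta^*)\|_2^2 + \frac{1}{2n} \|w\|_2^2 - \frac{1}{n} w^T X(\hat\theta - \theta^*) \tag{12.21}
\end{align}
Similarly we can see that $L_n(\theta^*) = \frac{1}{2n}\|w\|_2^2$ by setting $\hat\theta = \theta^*$ in Eq. (12.21). Plugging these back into Eq. (12.18) and setting $\hat\Delta = \hat\theta - \theta^*$ we have
\begin{align}
0 &\le \frac{1}{2n} \|X\hat\Delta\|_2^2 \tag{12.22} \\
&\le \underbrace{\frac{1}{n} w^T X \hat\Delta}_{\text{Term 1}} + \lambda_n \underbrace{\left( \|\theta^*\|_1 - \|\theta^* + \hat\Delta\|_1 \right)}_{\text{Term 2}} \tag{12.23}
\end{align}
Now, we will analyze Term 1 and Term 2 separately. For Term 1, we have
\begin{align}
\frac{1}{n} w^T X \hat\Delta &= \left\langle \frac{X^T w}{n}, \hat\Delta \right\rangle \tag{12.24} \\
&\le \left\| \frac{X^T w}{n} \right\|_\infty \|\hat\Delta\|_1 \quad \text{(Hölder)} \tag{12.25} \\
&\le \frac{\lambda_n}{2} \|\hat\Delta\|_1 \quad \text{(assumption on } \lambda_n\text{, Eq. 12.16)} \tag{12.26} \\
&= \frac{\lambda_n}{2} \left( \|\hat\Delta_S\|_1 + \|\hat\Delta_{S^c}\|_1 \right) \tag{12.27}
\end{align}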

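For Term 2, since $\theta^*_{S^c} = 0$, we have
\begin{align}
\|\theta^*\|_1 - \|\theta^* + \hat\Delta\|_1 &= \|\theta^*_S\|_1 - \|\theta^*_S + \hat\Delta_S\|_1 - \|\hat\Delta_{S^c}\|_1 \tag{12.28} \\
&\le \|\theta^*_S\|_1 - \|\theta^*_S\|_1 + \|\hat\Delta_S\|_1 - \|\hat\Delta_{S^c}\|_1 \tag{12.29} \\
&= \|\hat\Delta_S\|_1 - \|\hat\Delta_{S^c}\|_1 \tag{12.30}
\end{align}
Combining the bounds on Term 1 and Term 2 and multiplying by 2, we get
\begin{align}
\frac{1}{n} \|X\hat\Delta\|_2^2 &\le \lambda_n \left( 3\|\hat\Delta_S\|_1 - \|\hat\Delta_{S^c}\|_1 \right) \tag{12.31} \\
\|\hat\Delta_{S^c}\|_1 &\le 3 \|\hat\Delta_S\|_1 \tag{12.32}
\end{align}
where (12.32) follows because the left-hand side of (12.31) is nonnegative. Hence $\hat\Delta \in C_3(S)$, which by the restricted eigenvalue condition implies $\frac{1}{n}\|X\hat\Delta\|_2^2 \ge \kappa \|\hat\Delta\|_2^2$. Together with (12.31) this gives
\[ \kappa \|\hat\Delta\|_2^2 \le 3\lambda_n \|\hat\Delta_S\|_1 \le 3\lambda_n \sqrt{s}\, \|\hat\Delta_S\|_2 \le 3\lambda_n \sqrt{s}\, \|\hat\Delta\|_2. \]
Solving for $\|\hat\Delta\|_2$ gives the desired result.

Corollary 12.9 Let $w_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$ for $i \in \{1, \dots, n\}$, and suppose the columns of $X$ satisfy
\[ \max_{j \in \{1, \dots, d\}} \frac{\|X_j\|_2}{\sqrt{n}} \le c. \]
Then the bound (12.33) below holds with high probability.

Now analyze $X_j^T w / n$. Looking at the variance shows:
\[ \mathrm{Var}\left( \frac{X_j^T w}{n} \right) = \frac{1}{n^2} E\left[ (X_j^T w)(X_j^T w)^T \right] = \frac{\sigma^2}{n^2} X_j^T I X_j = \frac{\|X_j\|_2^2 \sigma^2}{n^2} \le \frac{c^2 \sigma^2}{n} \]
Therefore $Z_j := X_j^T w / n$ is sub-Gaussian with parameter at most $c\sigma/\sqrt{n}$. Bounding the maximum of these sub-Gaussian random variables with a union bound:
\[ P\left( \max_{j = 1, \dots, d} |Z_j| > t \right) \le 2d \exp\left\{ -\frac{n t^2}{2 c^2 \sigma^2} \right\} = 2 \exp\left\{ -\frac{n t^2}{2 c^2 \sigma^2} + \log d \right\} \]
Now let
\[ t_\delta = c\sigma \sqrt{\frac{2 \log d}{n}} + c\sigma\delta, \]
and therefore:
\[ P\left( \left\| \frac{X^T w}{n} \right\|_\infty > t_\delta \right) \le 2 e^{-n\delta^2/2} \]
This allows us to set $\lambda_n$:
\[ \lambda_n = 2 c \sigma \left( \sqrt{\frac{2 \log d}{n}} + \delta \right) \]
which implies:
\[ \|\hat\theta - \theta^*\|_2 \le \frac{3}{\kappa} \sqrt{s}\, \lambda_n = \frac{6 c \sigma}{\kappa} \sqrt{\frac{2 s \log d}{n}} + \frac{6 c \sigma \sqrt{s}\, \delta}{\kappa} \tag{12.33} \]
The above holds with probability at least $1 - 2\exp\{-n\delta^2/2\}$. One interesting observation from this bound is that the $\log d$ factor is the only extra loss suffered, and it can be thought of as the cost of searching.

References

[wainwright] M. Wainwright, High Dimensional Statistics, Prerelease, 2019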
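As a numerical illustration of Theorem 12.8 and Corollary 12.9, the sketch below simulates a sparse linear model, sets $\lambda_n$ as in the corollary, and solves the Lagrangian Lasso (12.11) with a plain proximal-gradient (ISTA) loop. The solver, the Gaussian design, the problem sizes, and the value of $\delta$ are all assumptions made for this example and are not taken from the lecture. Since $\kappa$ is not computed, the code does not verify the bound (12.17) itself; it only checks the ingredients of the proof: the condition (12.16) on $\lambda_n$, the basic inequality (12.18), and the cone membership (12.32).

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (coordinate-wise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def objective(X, y, theta, lam):
    """Lagrangian Lasso objective (1/2n)||y - X theta||_2^2 + lam * ||theta||_1."""
    n = X.shape[0]
    return float(np.sum((y - X @ theta) ** 2) / (2 * n) + lam * np.abs(theta).sum())

def lasso_ista(X, y, lam, n_iter=2000):
    """Proximal gradient descent (ISTA) on the Lagrangian Lasso objective."""
    n, d = X.shape
    # Step size 1/L, where L = ||X||_2^2 / n is the gradient's Lipschitz constant.
    step = n / np.linalg.norm(X, 2) ** 2
    theta = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y) / n
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta

# Assumed toy problem: s-sparse theta_star, Gaussian design, Gaussian noise.
rng = np.random.default_rng(0)
n, d, s, sigma, delta = 200, 500, 5, 0.5, 0.1
X = rng.standard_normal((n, d))
theta_star = np.zeros(d)
theta_star[:s] = 1.0
w = sigma * rng.standard_normal(n)
y = X @ theta_star + w

# Column-norm constant c and the regularization level from Corollary 12.9.
c = np.max(np.linalg.norm(X, axis=0)) / np.sqrt(n)
lam = 2 * c * sigma * (np.sqrt(2 * np.log(d) / n) + delta)
noise_level = np.max(np.abs(X.T @ w)) / n  # ||X^T w / n||_inf

theta_hat = lasso_ista(X, y, lam)
Delta = theta_hat - theta_star
on_l1 = np.abs(Delta[:s]).sum()
off_l1 = np.abs(Delta[s:]).sum()
err = float(np.linalg.norm(Delta))

print("lambda_n:", lam, " 2 * ||X^T w / n||_inf:", 2 * noise_level)
print("l1 error off support:", off_l1, " 3 * l1 error on support:", 3 * on_l1)
print("l2 error:", err)
```

ISTA is used here only because it fits in a few lines of NumPy; any Lasso solver that minimizes the same objective could be substituted.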
