
ECE598: Information-theoretic methods in high-dimensional statistics, Spring 2016

Lecture 19: Denoising sparse vectors - Risk upper bound

Lecturer: Yihong Wu. Scribe: Ravi Kiran Raman, Apr 1, 2016

This lecture focuses on the problem of denoising a sparse vector and on upper bounds for the corresponding minimax risk. Let $\theta \in \Theta = B_0(k) = \{\theta \in \mathbb{R}^p : \|\theta\|_0 \le k\}$ be a $k$-sparse vector. We observe $Y = \theta + Z$, where $Z \sim N(0, I_p)$. Recall that the last lecture bounded the minimax risk from below using the mutual information method:

\[ R^*(\Theta) = \inf_{\hat\theta} \sup_{\theta \in \Theta} E_\theta\left[\|\hat\theta - \theta\|_2^2\right] \gtrsim \log \binom{p}{k}, \]

where $\log\binom{p}{k} \asymp k \log\frac{ep}{k}$. This lecture obtains a matching upper bound on the minimax risk by analyzing the risk of the maximum likelihood estimator.

Remark 19.1. Estimators $\hat\theta$ are typically efficiently computable for the denoising problem defined above. Further, adaptive estimators that function without knowledge of $k$ can be defined.

19.1 Maximum Likelihood estimator and risk upper bound

19.1.1 MLE and Basic Inequality

The maximum likelihood estimator for the denoising problem under additive Gaussian noise is given by

\[ \hat\theta_{\rm MLE}(y) \in \arg\min_{\theta \in B_0(k)} \|y - \theta\|_2. \qquad (19.1) \]

We now show that, for all $\theta \in B_0(k)$,

\[ \|\hat\theta_{\rm MLE} - \theta\|_2^2 \lesssim \log\binom{p}{k} \]

holds both in expectation and with high probability. For ease of notation, we henceforth refer to the ML estimator as $\hat\theta$.

We observe that the ground truth $\theta$ is a feasible solution of (19.1). Since the estimator minimizes the $\ell_2$ distance, we have

\[ \|Z - h\|_2 = \|y - \hat\theta\|_2 \le \|y - \theta\|_2 = \|Z\|_2, \]

where $h = \hat\theta - \theta$. Since both $\hat\theta$ and $\theta$ are $k$-sparse, $\|h\|_0 \le 2k$. Squaring both sides and expanding $\|Z - h\|_2^2$ gives

\[ \|h\|_2^2 \le 2\langle h, Z\rangle = 2\|h\|_2 \left\langle Z, \frac{h}{\|h\|_2}\right\rangle \le 2\|h\|_2 \sup_{u \in S^{p-1} \cap B_0(2k)} \langle Z, u\rangle, \]

and hence

\[ \|h\|_2 \le 2 \sup_{u \in S^{p-1} \cap B_0(2k)} \langle Z, u\rangle. \qquad (19.2) \]
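The basic inequality (19.2) can be checked numerically. The sketch below relies on two standard facts that the notes do not state explicitly, so treat them as assumptions of the example: the minimizer in (19.1) keeps the $k$ largest-magnitude entries of $y$, and the supremum of $\langle Z, u\rangle$ over unit vectors with at most $s$ nonzeros equals the Euclidean norm of the $s$ largest-magnitude entries of $Z$. The function names, the signal amplitude, and the default sizes are hypothetical choices made for illustration.

```python
import numpy as np

def sparse_mle(y, k):
    """Projection of y onto B_0(k): keep the k largest-magnitude entries."""
    theta_hat = np.zeros_like(y)
    if k > 0:
        keep = np.argsort(np.abs(y))[-k:]   # indices of the k largest |y_i|
        theta_hat[keep] = y[keep]
    return theta_hat

def sup_over_sparse_sphere(z, s):
    """sup of <z, u> over unit vectors u with at most s nonzero entries."""
    top = np.sort(z ** 2)[-s:]              # s largest squared entries
    return float(np.sqrt(top.sum()))

def check_basic_inequality(p=200, k=5, seed=0):
    """Return (||h||_2, 2 * sup <Z, u>) for one simulated observation."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(p)
    theta[:k] = rng.normal(scale=3.0, size=k)   # hypothetical k-sparse signal
    z = rng.standard_normal(p)
    y = theta + z
    h = sparse_mle(y, k) - theta
    lhs = float(np.linalg.norm(h))
    rhs = 2.0 * sup_over_sparse_sphere(z, 2 * k)
    return lhs, rhs
```

Calling `check_basic_inequality()` returns the two sides of (19.2); the left side should never exceed the right side, since the inequality is deterministic once $\hat\theta$ is a minimizer.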

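19.1.2 Risk upper bound through Gaussian width

Let $G = S^{p-1} \cap B_0(2k)$. Thus, from (19.2), we have

\[ E[\|h\|_2] \le 2 E\left[\sup_{u \in G} \langle u, Z\rangle\right] = 2 w(G), \]

where $w(\cdot)$ is the Gaussian width. We know that Sudakov minorization lower bounds the Gaussian width as

\[ w(G) \gtrsim \epsilon \sqrt{\log N(G, \|\cdot\|_2, \epsilon)} \gtrsim \sqrt{k \log\frac{ep}{k}}, \]

for a suitable constant $\epsilon \le 1$. The above result follows from the Gilbert-Varshamov lower bound via packing Hamming spheres. However, we are interested in an upper bound for the Gaussian width here. One way to obtain this is using Dudley's entropy integral method [?]:

\[ w(G) \lesssim \int_0^{{\rm rad}(G)} \sqrt{\log N(G, \|\cdot\|_2, \epsilon)}\, d\epsilon \le \int_0^1 \sqrt{\log\left(\binom{p}{2k}\left(\frac{3}{\epsilon}\right)^{2k}\right)}\, d\epsilon \qquad (19.3) \]
\[ \lesssim \sqrt{k \log\frac{pe}{k}}, \]

where (19.3) follows from the fact that, once the support is fixed, the vectors restricted to that support lie on $S^{2k-1}$, and the fact that there are $\binom{p}{2k}$ possible supports.

19.1.3 Risk upper bound through covering argument

We now provide an alternate proof showing that the upper bound holds with high probability (and consequently in expectation). Let $J$ denote a set of indices. Decompose $G$ as

\[ G = \bigcup_{|J| = 2k} G_J, \qquad G_J = \{x \in \mathbb{R}^p : {\rm supp}(x) \subseteq J,\ x_J \in S^{2k-1}\}. \]

Hence, we have

\[ \sup_{u \in G} \langle u, Z\rangle = \max_{|J| = 2k} \sup_{u \in G_J} \langle u, Z\rangle = \max_{|J| = 2k} \|Z_J\|_2. \]

Fix an index set $J$ such that $|J| = 2k$. Let $U = \{u_1, \ldots, u_N\}$ be an $\epsilon$-net of $G_J$. This set of vectors forms a cover of a $2k$-dimensional sphere, so

\[ N = N(S^{2k-1}, \|\cdot\|_2, \epsilon) \le \left(\frac{3}{\epsilon}\right)^{2k}. \]

Now, for every $u \in G_J$ there exists $i \in [N]$ such that $\|u - u_i\|_2 \le \epsilon$. Thus $u = u_i + r$, where ${\rm supp}(r) \subseteq J$ and $\|r\|_2 \le \epsilon$. Thus we have

\[ \sup_{u \in G_J} \langle u, Z\rangle \le \max_{i \in [N]} \langle u_i, Z\rangle + \sup_{r:\ {\rm supp}(r) \subseteq J,\ \|r\|_2 \le \epsilon} \langle r, Z\rangle. \]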

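Now, since every such $r$ is $\|r\|_2$ times an element of $G_J$, we know that

\[ \sup_{r:\ {\rm supp}(r) \subseteq J,\ \|r\|_2 \le \epsilon} \langle r, Z\rangle \le \epsilon \sup_{u \in G_J} \langle u, Z\rangle. \]

Using this, we have

\[ \sup_{u \in G_J} \langle u, Z\rangle \le \frac{1}{1 - \epsilon} \max_{i \in [N]} \langle u_i, Z\rangle \lesssim \max_{i \in [N]} \langle u_i, Z\rangle, \]

when $\epsilon$ is an appropriately chosen constant. Here $\langle u_i, Z\rangle \sim N(0, 1)$ as $\|u_i\|_2 = 1$. Since $\binom{p}{2k}$ choices of index sets are possible, we bound the tail probability using the union bound as

\[ P\left[\sup_{u \in G} \langle u, Z\rangle > t\right] \le \sum_{|J| = 2k} P\left[\sup_{u \in G_J} \langle u, Z\rangle > t\right] \le \sum_{|J| = 2k} \sum_{i \in [N]} P\left[\langle u_i, Z\rangle > (1 - \epsilon) t\right] \]
\[ \le \binom{p}{2k} \exp(Ck)\, Q((1 - \epsilon) t) \le \exp\left(2k \log\frac{ep}{2k}\right) \exp(Ck) \exp\left(-\frac{(1 - \epsilon)^2 t^2}{2}\right), \]

where $C = 2\log(3/\epsilon)$ and the last step follows from bounding the size of the $\epsilon$-net and the Q-function. Thus, for $t \gtrsim \sqrt{k \log\frac{ep}{k}}$ (scaled by an appropriately large constant), the tail probability is arbitrarily low. Thus, with high probability,

\[ \sup_{u \in G} \langle u, Z\rangle \lesssim \sqrt{k \log\frac{pe}{k}}. \]

19.1.4 Risk upper bound using tail bound for $\chi^2$ distribution

As observed earlier,

\[ \sup_{u \in G} \langle u, Z\rangle = \max_{|J| = 2k} \|Z_J\|_2. \]

Since $Z \sim N(0, I_p)$, $\|Z_J\|_2^2 \sim \chi^2_{2k}$ for a given $J$. We first study a few properties of the $\chi^2$ random variable. Let $L \sim \chi^2_m$. Then $E[L] = m$ and ${\rm var}(L) = 2m$, i.e. $\sigma_L = \sqrt{2m}$.

Theorem 19.1 ([?]). If $L \sim \chi^2_m$, then for every $S > 0$,

\[ P\left[L - m > 2\sqrt{Sm} + 2S\right] \le \exp(-S), \qquad P\left[L - m < -2\sqrt{Sm}\right] \le \exp(-S). \]

Now, applying the above concentration inequality with $m = 2k$ and $S = ck\log\frac{p}{k}$, we have

\[ P\left[\|Z_J\|_2^2 > 2k + 2k\sqrt{2c\log\frac{p}{k}} + 2ck\log\frac{p}{k}\right] \le \exp\left(-ck\log\frac{p}{k}\right). \]

Since $2\sqrt{Sm} \le S + m$, the threshold on the left is at most $4k + 3ck\log\frac{p}{k} \le C' k\log\frac{pe}{k}$ with $C' = \max(4, 3c)$, so

\[ P\left[\|Z_J\|_2^2 > C' k\log\frac{pe}{k}\right] \le \exp\left(-ck\log\frac{p}{k}\right). \]

Taking $c = 3$ (so that $C' = 9$) and a union bound over the $\binom{p}{2k}$ index sets, we obtain that, with high probability when $p/k$ is large,

\[ \sup_{u \in G} \langle u, Z\rangle \le 3\sqrt{k \log\frac{pe}{k}}. \]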
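The high-probability bound on $\sup_{u \in G}\langle u, Z\rangle$ can be explored by simulation. The sketch below uses the identity $\sup_{u \in G}\langle u, Z\rangle = \max_{|J| = 2k}\|Z_J\|_2$ from the text, which is attained by the $2k$ largest-magnitude entries of $Z$, and compares the simulated values with $3\sqrt{k\log(pe/k)}$. The function names, the sizes $p = 1000$, $k = 10$, and the number of trials are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def sup_over_sparse_sphere(z, s):
    """max over index sets J of size s of ||z_J||_2 (the s largest |z_i|)."""
    top = np.sort(z ** 2)[-s:]
    return float(np.sqrt(top.sum()))

def simulate_sup(p=1000, k=10, trials=200, seed=0):
    """Simulate sup_{u in G} <u, Z> and return it with the reference bound."""
    rng = np.random.default_rng(seed)
    sups = np.array([
        sup_over_sparse_sphere(rng.standard_normal(p), 2 * k)
        for _ in range(trials)
    ])
    bound = 3.0 * np.sqrt(k * np.log(np.e * p / k))   # 3 * sqrt(k log(pe/k))
    return sups, float(bound)
```

Comparing `sups.max()` with `bound` gives a rough empirical sense of how much slack the constant 3 leaves at these sizes; it is a sanity check on one configuration, not evidence for the bound in general.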

19.2 Thresholding schemes and risk upper bounds

19.2.1 Hard and soft thresholding

For the denoising problem defined above, the hard thresholding estimate corresponding to the threshold $\tau$ is given by

\[ \hat\theta^{\rm HT}(y)_i = \begin{cases} y_i, & \text{if } |y_i| > \tau \\ 0, & \text{if } |y_i| \le \tau. \end{cases} \]

Similarly, the soft thresholding estimate is given by

\[ \hat\theta^{\rm ST}(y)_i = \begin{cases} y_i - \tau, & \text{if } y_i > \tau \\ y_i + \tau, & \text{if } y_i < -\tau \\ 0, & \text{if } |y_i| \le \tau. \end{cases} \]

The HT estimate is not continuous in $y$, and the corresponding risk function does not vary monotonically. Soft thresholding avoids both of these issues.

The hard and soft thresholding estimators can alternatively be written as solutions of penalized optimization problems. Consider the problem

\[ \theta^*(y) = \arg\min_{\theta \in \mathbb{R}^p} \|y - \theta\|_2^2 + \lambda \|\theta\|_0. \]

Then, for an appropriately chosen penalty factor $\lambda$ (namely $\lambda = \tau^2$), $\theta^*(y) = \hat\theta^{\rm HT}(y)$. Similarly, for the problem

\[ \theta^*(y) = \arg\min_{\theta \in \mathbb{R}^p} \|y - \theta\|_2^2 + \lambda \|\theta\|_1, \]

with an appropriately chosen $\lambda$ (namely $\lambda = 2\tau$), $\theta^*(y) = \hat\theta^{\rm ST}(y)$.

Note: Under such thresholding schemes, we may not necessarily obtain a $k$-sparse vector as we desire. However, we shall ignore this fact, as we are interested only in the risk upper bounds.

19.2.2 $\ell_\infty$-constrained procedure

Consider the following $\ell_\infty$-constrained formulation of the problem:

\[ \hat\theta(y) \in \arg\min_{\theta \in \mathbb{R}^p :\ \|y - \theta\|_\infty \le \tau} \|\theta\|_0. \]

We observe that the hard thresholding estimate is a solution to the above problem. (The set of minimizers of the above objective is in reality a continuum of points.) The constraint of interest is $\|y - \theta\|_\infty \le \tau$. Setting $\theta_i = 0$ when $|y_i| \le \tau$ and $\theta_i = y_i$ when $|y_i| > \tau$ satisfies the constraint. Further, any feasible $\theta$ must be nonzero wherever $|y_i| > \tau$, so this estimate also minimizes the $\ell_0$ norm, and thus $\hat\theta^{\rm HT}(y)$ is a solution.

Theorem 19.2. For all $\theta \in B_0(k)$ and any solution $\hat\theta$ of the above problem with $\tau = \sqrt{2\log p}$, with high probability,

\[ \|\hat\theta - \theta\|_2^2 \le 16 k \log p. \]
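The two thresholding rules and the bound in the theorem stated above can be sketched in a few lines. The example assumes $\tau = \sqrt{2\log p}$ and uses the hard thresholding estimate as the solution of the $\ell_\infty$-constrained problem; the function names, the signal amplitude of 10, and the sizes are hypothetical choices for illustration. On the event $\|Z\|_\infty \le \tau$, on which the ground truth is feasible, the squared error should not exceed $16 k \log p$.

```python
import numpy as np

def hard_threshold(y, tau):
    """Keep y_i when |y_i| > tau, otherwise set the entry to zero."""
    return np.where(np.abs(y) > tau, y, 0.0)

def soft_threshold(y, tau):
    """Shrink every entry of y toward zero by tau, clipping at zero."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

def check_linf_bound(p=500, k=8, seed=1):
    """Compare the hard thresholding error with the 16 k log p bound."""
    rng = np.random.default_rng(seed)
    tau = np.sqrt(2.0 * np.log(p))
    theta = np.zeros(p)
    theta[:k] = 10.0                       # hypothetical strong k-sparse signal
    z = rng.standard_normal(p)
    y = theta + z
    truth_feasible = bool(np.max(np.abs(z)) <= tau)   # event ||Z||_inf <= tau
    err = float(np.sum((hard_threshold(y, tau) - theta) ** 2))
    bound = 16.0 * k * np.log(p)
    return truth_feasible, err, float(bound)
```

`check_linf_bound()` returns whether the ground truth was feasible for the drawn noise, the squared error, and the bound. The bound is only claimed on the feasibility event, so a draw with `truth_feasible == False` says nothing about the theorem.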

Proof. We decompose the proof into three steps.

Step 1: Set $\tau$ to ensure feasibility of the ground truth. Since $Y = \theta + Z$, we have $\|y - \theta\|_\infty = \|Z\|_\infty \le \sqrt{2\log p}$ with high probability. Thus the ground truth is feasible with high probability.

Step 2: Analyze the structure of the error. The error is given by $h = \hat\theta - \theta$. Since $\theta$ is a feasible solution, $\|\hat\theta\|_0 \le \|\theta\|_0 \le k$. Thus $\|h\|_0 \le 2k$.

Step 3: Bound the $\ell_2$ norm.

\[ \|h\|_2^2 \le \|h\|_0 \|h\|_\infty^2 \le 2k \|\hat\theta - \theta\|_\infty^2 \le 2k \left(\|\hat\theta - y\|_\infty + \|y - \theta\|_\infty\right)^2 \qquad (19.4) \]
\[ \le 8k\tau^2 = 16 k \log p, \]

where (19.4) follows from the triangle inequality. All the above statements hold with high probability, on the event that the ground truth is feasible.

Similarly, consider the problem

\[ \hat\theta(y) \in \arg\min_{\theta \in \mathbb{R}^p :\ \|y - \theta\|_\infty \le \tau} \|\theta\|_1. \]

We observe here that for any $\theta$ satisfying the constraint,

\[ \|\theta\|_1 \ge \sum_{i=1}^p (|y_i| - \tau)\, 1\{|y_i| > \tau\}. \]

The soft thresholding estimate attains the above bound and satisfies the constraint, and is thus a solution to the problem.

Theorem 19.3. For all $\theta \in B_0(k)$ and any solution $\hat\theta$ of the above problem with $\tau = \sqrt{2\log p}$, with high probability,

\[ \|\hat\theta - \theta\|_2^2 \le 32 k \log p. \]

Proof. We proceed in similar fashion to the earlier proof.

Step 1: Set $\tau$ to ensure feasibility of the ground truth. Since $Y = \theta + Z$, we have $\|y - \theta\|_\infty = \|Z\|_\infty \le \sqrt{2\log p}$ with high probability. Thus the ground truth is feasible with high probability.

Step 2: Analyze the structure of the error.

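The error is given by $h = \hat\theta - \theta$. Since both $\hat\theta$ and $\theta$ are feasible, the triangle inequality gives $\|h\|_\infty \le 2\tau$. Let $J = {\rm supp}(\theta)$. Define the cone

\[ C = \{x \in \mathbb{R}^p : \|x_{J^c}\|_1 \le \|x_J\|_1\}. \]

We now have

\[ \|h_J\|_1 - \|h_{J^c}\|_1 = \sum_{i \in J} |\hat\theta_i - \theta_i| - \sum_{i \in J^c} |\hat\theta_i| \ge \|\theta\|_1 - \|\hat\theta\|_1 \ge 0, \]

which follows from the triangle inequality and the feasibility of $\theta$. Thus $h \in C$.

Step 3: Bound the $\ell_2$ norm.

\[ \|h\|_2^2 \le \|h\|_1 \|h\|_\infty \qquad (19.5) \]
\[ \le 4\tau \|h_J\|_1 \le 4\tau\sqrt{k}\, \|h_J\|_2 \qquad (19.6) \]
\[ \le 4\tau\sqrt{k}\, \|h\|_2, \]

so that $\|h\|_2^2 \le 16 k \tau^2 = 32 k \log p$. Here (19.5) and (19.6) follow from Holder's inequality and the Cauchy-Schwarz inequality respectively, and the bound $\|h\|_1 \le 2\|h_J\|_1$ uses $h \in C$.

Remark 19.2 (Approximate Sparsity). Let $J$ be a set of indices of size $k$. Let $h \in C = \{x \in \mathbb{R}^p : \|x_{J^c}\|_1 \le \|x_J\|_1\}$. Consider the $k$ largest elements (in magnitude) of $h_{J^c}$, indexed by the set $K$. Then

\[ \|h_{(J \cup K)^c}\|_2 \le \|h_{J \cup K}\|_2. \]

Proof. Let $h_{J^c}^{(i)}$ denote the $i$-th largest element of $h_{J^c}$ in magnitude. For every element, we have

\[ |h_{J^c}^{(i)}| \le \frac{1}{i} \|h_{J^c}\|_1. \]

Thus,

\[ \|h_{(J \cup K)^c}\|_2^2 = \sum_{i > k} |h_{J^c}^{(i)}|^2 \le \sum_{i > k} \frac{1}{i^2} \|h_{J^c}\|_1^2 \le \frac{1}{k} \|h_{J^c}\|_1^2 \le \frac{1}{k} \|h_J\|_1^2 \le \|h_J\|_2^2 \le \|h_{J \cup K}\|_2^2, \]

which follows from the Cauchy-Schwarz inequality and the fact that $h \in C$.

Remark 19.3. When the vector is sufficiently sparse, specifically $k = o(p)$,

\[ R^* = (2 + o(1))\, k \log\frac{p}{k}. \]

Further, the bound can be achieved in the adaptive case too. If $k = \Theta(p)$, i.e., $\frac{k}{p} \to \alpha \in (0, 1]$ as $p \to \infty$, then

\[ R^* = p\,(\beta(\alpha) + o(1)), \]

where $\beta(\alpha)$ is a constant dependent on $\alpha$.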
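The proof of the $\ell_1$-constrained bound can also be traced numerically. The sketch below uses the soft thresholding estimate with $\tau = \sqrt{2\log p}$ as the solution of the $\ell_1$ problem, then checks the cone condition $\|h_{J^c}\|_1 \le \|h_J\|_1$ from Step 2 and the final bound $32 k \log p$. As before, the function names, the signal amplitude, and the sizes are hypothetical, and both checks are only meaningful on the event $\|Z\|_\infty \le \tau$.

```python
import numpy as np

def soft_threshold(y, tau):
    """Shrink every entry of y toward zero by tau, clipping at zero."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

def check_l1_bound(p=500, k=8, seed=2):
    """Check the cone condition and the 32 k log p bound for one draw."""
    rng = np.random.default_rng(seed)
    tau = np.sqrt(2.0 * np.log(p))
    theta = np.zeros(p)
    theta[:k] = 10.0                       # hypothetical signal, support J = {0..k-1}
    z = rng.standard_normal(p)
    y = theta + z
    h = soft_threshold(y, tau) - theta
    truth_feasible = bool(np.max(np.abs(z)) <= tau)   # event ||Z||_inf <= tau
    # Cone condition: l1 mass of h off the support is at most the mass on it.
    in_cone = bool(np.abs(h[k:]).sum() <= np.abs(h[:k]).sum() + 1e-12)
    err = float(np.sum(h ** 2))
    bound = 32.0 * k * np.log(p)
    return truth_feasible, in_cone, err, float(bound)
```

On a draw where `truth_feasible` is true, `in_cone` should be true and `err` should be at most `bound`, mirroring Steps 2 and 3 of the proof.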
