36-755: Advanced Statistical Theory I, Fall 2016

Lecture 6: September 19
Lecturer: Alessandro Rinaldo    Scribe: YJ Choe

Note: LaTeX template courtesy of UC Berkeley EECS dept.

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

6.1 Metric entropy and its uses

Let $(\mathcal{X}, d)$ be a metric space. We gave some examples of metric spaces, including $(\mathbb{R}^d, \|\cdot\|_p)$, the $d$-dimensional real space with the $\ell_p$-norm, and $L^p([0,1], \mu)$, the $L^p$ function space on $[0,1]$ with measure $\mu$, for $p \geq 1$. We are interested in measuring how big these spaces are.

6.1.1 Covering numbers and metric entropy

Definition 6.1 (Covering numbers) Let $\epsilon > 0$. An $\epsilon$-covering, or $\epsilon$-net, of $(\mathcal{X}, d)$ is any set $\{\theta_1, \ldots, \theta_N\} \subseteq \mathcal{X}$, where $N = N(\epsilon)$, such that for any $\theta \in \mathcal{X}$ there exists $i \in [N]$ with
$$d(\theta, \theta_i) \leq \epsilon.$$
The $\epsilon$-covering number of $(\mathcal{X}, d)$, denoted $N(\epsilon, \mathcal{X}, d)$, is the size of a smallest $\epsilon$-covering.

There are several remarks:

1. For any $(\mathcal{X}, d)$, the $\epsilon$-covering number is unique, but there can be several $\epsilon$-coverings of that size.

2. Let $B(\theta_i, \epsilon) = \{\theta \in \mathcal{X} : d(\theta, \theta_i) \leq \epsilon\}$. Then
$$\mathcal{X} \subseteq \bigcup_{i=1}^{N(\epsilon, \mathcal{X}, d)} B(\theta_i, \epsilon).$$

3. We will only consider metric spaces $(\mathcal{X}, d)$ that are totally bounded, i.e., $N(\epsilon, \mathcal{X}, d) < \infty$ for any $\epsilon > 0$. Note that $\mathrm{diam}(\mathcal{X}) = \sup_{\theta, \theta'} d(\theta, \theta') < \infty$ in that case.

4. In general, $N(\epsilon, \mathcal{X}, d)$ decreases as $\epsilon$ increases, and diverges to $\infty$ as $\epsilon \to 0$.
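To make Definition 6.1 concrete, here is a small sketch (our own illustration, not part of the lecture; all names are made up) that builds an $\epsilon$-covering of a discretized square by a greedy pass. The greedy set is an $\epsilon$-covering whose size upper-bounds the covering number of the finite set, but it need not be a smallest one.

```python
import math

def greedy_cover(points, eps, dist):
    """Greedily select centers so that every point lies within eps of one.

    The returned set is an eps-covering of `points`; its size upper-bounds
    the covering number N(eps) of this finite set, but need not attain it.
    """
    centers = []
    for p in points:
        # Keep p as a new center only if no existing center eps-covers it.
        if all(dist(p, c) > eps for c in centers):
            centers.append(p)
    return centers

# Discretization of the square [-1, 1]^2 with the Euclidean metric.
grid = [(x / 10, y / 10) for x in range(-10, 11) for y in range(-10, 11)]
euclid = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])

net = greedy_cover(grid, 0.5, euclid)

# Remark 2: the grid is contained in the union of eps-balls around the net.
assert all(min(euclid(p, c) for c in net) <= 0.5 for p in grid)
print(len(net))
```

As a bonus, the greedy centers are pairwise more than $\epsilon$ apart, a fact that becomes relevant when packing numbers are introduced below.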
Example. Let $\mathcal{X} = [-1, 1]$ and $d(x, y) = |x - y|$ for $x, y \in \mathcal{X}$. Then
$$N(\epsilon, \mathcal{X}, d) \leq 1 + \frac{1}{\epsilon} \leq \frac{C}{\epsilon}$$
for some $C > 0$. If $\mathcal{X} = [-1, 1]^p$, then $N(\epsilon, \mathcal{X}, d) \leq \left(\frac{C}{\epsilon}\right)^p$.

Definition 6.2 (Metric entropy) The metric entropy of $(\mathcal{X}, d)$ is defined as
$$\log N(\epsilon, \mathcal{X}, d).$$

Typically, for bounded subsets of $\mathbb{R}^p$ with any of the (equivalent) norms on $\mathbb{R}^p$, the metric entropy scales as
$$C p \log\left(\frac{1}{\epsilon}\right).$$
In general, bounded subsets of $\mathbb{R}^p$ are considered small spaces. For non-Euclidean spaces, e.g. function spaces, the metric entropy scales differently; we consider these large spaces.

Example. Let $\mathcal{F} = \{f : [0,1] \to \mathbb{R} \mid f \text{ is } L\text{-Lipschitz}\}$. Then
$$\log N(\epsilon, \mathcal{F}, d) \lesssim \frac{L}{\epsilon},$$
where $\lesssim$ denotes less than or equal up to positive constants. The bound generalizes to $L$-Lipschitz functions on $[0,1]^p$:
$$\log N(\epsilon, \mathcal{F}, d) \lesssim \left(\frac{L}{\epsilon}\right)^p.$$

Further notions in the book can be useful depending on the area of interest.

6.1.2 Packing numbers

Definition 6.3 (Packing numbers) An $\epsilon$-packing of $(\mathcal{X}, d)$ is any set
$$\{\theta_1, \ldots, \theta_M\} \subseteq \mathcal{X},$$
where $M = M(\epsilon)$, such that
$$d(\theta_i, \theta_j) > \epsilon \quad \text{for all } i \neq j.$$
The $\epsilon$-packing number of $(\mathcal{X}, d)$, denoted $M(\epsilon, \mathcal{X}, d)$, is the size of a largest $\epsilon$-packing.

Again, the $\epsilon$-packing number is unique, while the $\epsilon$-packing set that achieves it need not be. Sometimes covering numbers are more convenient to work with, and sometimes packing numbers are. Figure 6.1 shows an example of an $\epsilon$-covering and an $\epsilon$-packing.

The following is a classic lemma on the relationship between covering and packing numbers.

Lemma 6.4 For any $\epsilon > 0$,
$$M(2\epsilon, \mathcal{X}, d) \leq N(\epsilon, \mathcal{X}, d) \leq M(\epsilon, \mathcal{X}, d).$$

Proof: Homework.
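The middle inequality $N(\epsilon, \mathcal{X}, d) \leq M(\epsilon, \mathcal{X}, d)$ rests on the observation that a maximal $\epsilon$-packing is automatically an $\epsilon$-covering: any point left uncovered could be added to the packing. A small sketch on a discretized interval (our own illustration, using nothing beyond the definitions):

```python
def greedy_packing(points, eps):
    """Build a maximal eps-packing of `points` under the absolute-value
    metric: pairwise distances exceed eps and no further point fits."""
    pack = []
    for p in points:
        if all(abs(p - q) > eps for q in pack):
            pack.append(p)
    return pack

# Fine discretization of X = [-1, 1].
xs = [i / 1000 for i in range(-1000, 1001)]
eps = 0.3
pack = greedy_packing(xs, eps)

# Packing property: all pairwise distances exceed eps.
assert all(abs(a - b) > eps for i, a in enumerate(pack) for b in pack[i + 1:])

# Maximality makes it an eps-covering as well: every point either entered
# the packing or was rejected because a packing point already eps-covers it.
assert all(min(abs(x - q) for q in pack) <= eps for x in xs)
```

This is exactly the mechanism used again in the proof of Proposition 6.5 below, where a maximal packing serves as a covering.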
Figure 6.1: A comparison of an $\epsilon$-covering (left) and an $\epsilon$-packing (right). Figures from [GKKW06].

6.1.3 Volumetric ratios and covering numbers

Proposition 6.5 Let $\|\cdot\|$ and $\|\cdot\|'$ be two norms on $\mathbb{R}^p$ (e.g. $\|\cdot\|_1$ and $\|\cdot\|_2$). Let $B$ and $B'$ be the corresponding unit balls.¹ Then,
$$\left(\frac{1}{\epsilon}\right)^p \frac{\mathrm{Vol}(B)}{\mathrm{Vol}(B')} \leq N(\epsilon, B, \|\cdot\|') \leq \frac{\mathrm{Vol}\left(\frac{2}{\epsilon} B + B'\right)}{\mathrm{Vol}(B')},$$
where, for $\alpha, \beta > 0$, $\alpha B = \{\alpha x : x \in B\}$ and $\beta B + B' = \{\beta x + y : x \in B, y \in B'\}$.

Proof: First note that, by homework, $\mathrm{Vol}(\epsilon B') = \epsilon^p \mathrm{Vol}(B')$ for any $\epsilon > 0$. Also, if $\{x_1, \ldots, x_N\}$ is an $\epsilon$-covering of $B$ in $\|\cdot\|'$, then
$$B \subseteq \bigcup_{i=1}^N \{x_i + \epsilon B'\},$$
where $\{x_i + \epsilon B'\} = \{x : \|x - x_i\|' \leq \epsilon\}$. Together, we get
$$\mathrm{Vol}(B) \leq N \, \mathrm{Vol}(\epsilon B') = N \epsilon^p \mathrm{Vol}(B').$$
Note that the volume here is the Lebesgue volume on $\mathbb{R}^p$, which is translation invariant. This gives us the lower bound
$$N(\epsilon, B, \|\cdot\|') \geq \frac{\mathrm{Vol}(B)}{\mathrm{Vol}(B')} \left(\frac{1}{\epsilon}\right)^p.$$

¹See the previous lecture notes for examples of norm balls.
To get the upper bound, let $\{y_1, \ldots, y_M\}$ be a maximal $\epsilon$-packing of $B$ in $\|\cdot\|'$. Then this set is also an $\epsilon$-covering of $B$ in $\|\cdot\|'$: otherwise we could find another point that contradicts the maximality of the $\epsilon$-packing. The $\frac{\epsilon}{2}$-balls $\{y_i + \frac{\epsilon}{2} B'\}$ are disjoint by the packing property. Thus,
$$\bigcup_{i=1}^M \left\{y_i + \frac{\epsilon}{2} B'\right\} \subseteq B + \frac{\epsilon}{2} B'.$$
Taking volumes, we get
$$M \left(\frac{\epsilon}{2}\right)^p \mathrm{Vol}(B') \leq \mathrm{Vol}\left(B + \frac{\epsilon}{2} B'\right) = \left(\frac{\epsilon}{2}\right)^p \mathrm{Vol}\left(\frac{2}{\epsilon} B + B'\right).$$
Note that the volume of the union on the left-hand side is simply $M$ times the volume of a single ball, because the balls are disjoint. Thus,
$$M(\epsilon, B, \|\cdot\|') \leq \frac{\mathrm{Vol}\left(\frac{2}{\epsilon} B + B'\right)}{\mathrm{Vol}(B')}.$$
Since the $\epsilon$-covering number is bounded above by the $\epsilon$-packing number (Lemma 6.4), we have the upper bound as well.

In our applications, we can simply take $\|\cdot\|' = \|\cdot\|$ to conclude that
$$p \log\left(\frac{1}{\epsilon}\right) \leq \log N(\epsilon, B, \|\cdot\|) \leq p \log\left(1 + \frac{2}{\epsilon}\right) \leq p \log\left(\frac{3}{\epsilon}\right),$$
where the last inequality holds for $\epsilon \leq 1$. Note once again that this result holds for any norm on $\mathbb{R}^p$, including the Euclidean norm.

6.1.4 Discretization

Covering and packing numbers can be used to discretize a supremum over an infinite space into a maximum over a finite covering or packing set. We can then bound this maximum, as done e.g. in Theorem 6.7 for sub-Gaussian random vectors.

Definition 6.6 (Sub-Gaussian random vectors) A random vector $X \in \mathbb{R}^d$ with $\mathbb{E}[X] = 0$ is sub-Gaussian with parameter $\sigma^2$, denoted $X \sim \mathrm{SG}_d(\sigma^2)$, if
$$v^T X \sim \mathrm{SG}(\sigma^2) \quad \text{for all } v \in S^{d-1},$$
where $S^{d-1} = \{v \in \mathbb{R}^d : \|v\|_2 = 1\}$ is the unit sphere in $\mathbb{R}^d$.

Theorem 6.7 Let $X \sim \mathrm{SG}_d(\sigma^2)$, and let $B^d$ be the unit ball in $(\mathbb{R}^d, \|\cdot\|_2)$. Then,
$$\mathbb{E}\left[\max_{\theta \in B^d} \theta^T X\right] = \mathbb{E}\left[\max_{\theta \in B^d} |\theta^T X|\right] \leq 4\sigma\sqrt{d}.$$
In other words, for any $\delta \in (0, 1)$,
$$\max_{\theta \in B^d} \theta^T X \leq 4\sigma\sqrt{d} + 2\sigma\sqrt{2 \log(1/\delta)}$$
with probability at least $1 - \delta$.
Proof: Let $N_{1/2}$ be a smallest $\frac{1}{2}$-covering of $B^d$ in $\|\cdot\|_2$. Then, by the volumetric bound above,
$$|N_{1/2}| \leq 5^d.$$
Next, for any $\theta \in B^d$, there exists $z = z(\theta) \in N_{1/2}$ such that
$$\theta = z + x$$
for some $x \in \mathbb{R}^d$ with $\|x\|_2 \leq \frac{1}{2}$. Thus,
$$\max_{\theta \in B^d} \theta^T X \leq \max_{z \in N_{1/2}} z^T X + \max_{x \in \frac{1}{2} B^d} x^T X.$$
Now, notice that $\max_{x \in \frac{1}{2} B^d} x^T X = \frac{1}{2} \max_{\theta \in B^d} \theta^T X$. This implies that
$$\max_{\theta \in B^d} \theta^T X \leq 2 \max_{z \in N_{1/2}} z^T X.$$
This holds almost surely. Taking expectations, we get
$$\mathbb{E}\left[\max_{\theta \in B^d} \theta^T X\right] \leq 2 \, \mathbb{E}\left[\max_{z \in N_{1/2}} z^T X\right] \leq 2\sigma\sqrt{2 \log |N_{1/2}|} \leq 2\sigma\sqrt{2d \log 5} \leq 4\sigma\sqrt{d},$$
where the second inequality is the maximal inequality for finitely many sub-Gaussian random variables (each $z^T X \sim \mathrm{SG}(\sigma^2)$, since $\|z\|_2 \leq 1$).

For the second claim, we use the union bound (second inequality below). For any $t > 0$,
$$\mathbb{P}\left(\max_{\theta \in B^d} \theta^T X \geq t\right) \leq \mathbb{P}\left(2 \max_{z \in N_{1/2}} z^T X \geq t\right) \leq \sum_{z \in N_{1/2}} \mathbb{P}\left(z^T X \geq \frac{t}{2}\right) \leq |N_{1/2}| \exp\left\{-\frac{t^2}{8\sigma^2}\right\} \leq 5^d \exp\left\{-\frac{t^2}{8\sigma^2}\right\}.$$
Setting the last expression equal to $\delta$ and solving for $t$ (using $\sqrt{a+b} \leq \sqrt{a} + \sqrt{b}$), it suffices to take
$$t = \sigma\sqrt{8d \log 5} + 2\sigma\sqrt{2 \log(1/\delta)} \leq 4\sigma\sqrt{d} + 2\sigma\sqrt{2 \log(1/\delta)}.$$
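As a quick sanity check of Theorem 6.7 (our own simulation, not from the lecture), note that for the Euclidean ball $\max_{\theta \in B^d} \theta^T X = \|X\|_2$ by Cauchy-Schwarz, and a standard Gaussian vector is $\mathrm{SG}_d(1)$:

```python
import math
import random

random.seed(0)

d = 25
sigma = 1.0       # X ~ N(0, I_d) is SG_d(1)
trials = 2000

# The maximum of theta^T X over the unit ball is attained at
# theta = X / ||X||_2, so the supremum equals ||X||_2.
vals = []
for _ in range(trials):
    x = [random.gauss(0.0, sigma) for _ in range(d)]
    vals.append(math.sqrt(sum(xi * xi for xi in x)))

avg = sum(vals) / trials
bound = 4 * sigma * math.sqrt(d)   # Theorem 6.7: E[max] <= 4*sigma*sqrt(d)

assert avg <= bound
print(round(avg, 2), "<=", bound)
```

The Monte Carlo average lands near $\sqrt{d}$, well under the theorem's $4\sigma\sqrt{d}$; the constant 4 reflects the cost of the discretization argument rather than a tight value.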
6.2 Covariance estimation

Using these techniques, we will show various bounds on estimating the covariance matrix of a random vector. First, recall the following result covered in homework 1.

Theorem 6.8 (Lemma 12, [Yuan10]; Lemma 1, [RWRY11]) Let $(X_1, \ldots, X_d) \in \mathbb{R}^d$ be a zero-mean random vector with covariance $\Sigma$ such that $X_i / \sqrt{\Sigma_{ii}} \sim \mathrm{SG}(\sigma^2)$ for $i = 1, \ldots, d$. Let $\hat{\Sigma}$ be the empirical covariance matrix based on $n$ i.i.d. samples. Then, for any $t > 0$,
$$\max_{i,j} |\hat{\Sigma}_{ij} - \Sigma_{ij}| \lesssim \sqrt{\frac{t + \log d}{n}}$$
with probability at least $1 - e^{-t}$.

Note that $d$ can be larger than $n$ (the bound remains meaningful as long as $\log d$ is small relative to $n$, e.g. when $d$ is a polynomial in $n$), and that the empirical covariance matrix need not be positive definite.

We first review some basic notions of matrix algebra. For $A \in \mathbb{R}^{m \times n}$ with $\mathrm{rank}(A) = r \leq \min\{m, n\}$, the singular value decomposition (SVD) of $A$ is given by
$$A = U D V^T,$$
where $D = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ with singular values $\sigma_1 \geq \cdots \geq \sigma_r > 0$, and $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$ each have $r$ orthonormal columns. Note that, for $j = 1, \ldots, r$,
$$A A^T u_j = \sigma_j^2 u_j \quad \text{and} \quad A^T A v_j = \sigma_j^2 v_j,$$
where $u_j \in S^{m-1}$ is the $j$th column of $U$ and $v_j \in S^{n-1}$ is the $j$th column of $V$.

The largest singular value can also be characterized as the operator norm:
$$\sigma_{\max}(A) = \max_{x \neq 0} \frac{\|Ax\|_2}{\|x\|_2} = \max_{x \in S^{m-1}, \, y \in S^{n-1}} x^T A y.$$
If $A \in \mathbb{R}^{n \times n}$ is symmetric and positive definite, then its singular values coincide with its eigenvalues (in general, the singular values are the square roots of the eigenvalues of $A^T A$).

In the next lecture, we will give a bound on the distance between $\hat{\Sigma}$ and $\Sigma$ in the operator norm.
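To see Theorem 6.8 in action, here is a small simulation (our own sketch; the constant 10 below is illustrative and generous, not the constant of the theorem) with $\Sigma = I_d$, so the true covariance is known exactly:

```python
import math
import random

random.seed(1)

d, n = 20, 500
# True distribution: X ~ N(0, I_d), so Sigma = I_d and each X_i is SG(1).
samples = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

# Empirical covariance of a known-zero-mean sample:
# Sigma_hat[i][j] = (1/n) * sum_k X_k[i] * X_k[j].
sigma_hat = [[sum(x[i] * x[j] for x in samples) / n for j in range(d)]
             for i in range(d)]

# Entrywise error against the true covariance (the identity here).
err = max(abs(sigma_hat[i][j] - (1.0 if i == j else 0.0))
          for i in range(d) for j in range(d))

# Theorem 6.8 predicts an error of order sqrt((t + log d)/n); taking
# t = log d gives sqrt(2 log d / n), about 0.11 for these (d, n).
assert err <= 10 * math.sqrt(2 * math.log(d) / n)
print(round(err, 3))
```

Doubling $n$ (or shrinking $d$) shrinks the observed maximum error at the predicted $\sqrt{\log d / n}$ rate, which is what makes the bound useful even when $d > n$.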
References

[GKKW06] L. Györfi, M. Kohler, A. Krzyzak and H. Walk, A Distribution-Free Theory of Nonparametric Regression, Springer Science & Business Media, 2006.

[Yuan10] M. Yuan, High dimensional inverse covariance matrix estimation via linear programming, Journal of Machine Learning Research 11, 2010, pp. 2261-2286.

[RWRY11] P. Ravikumar, M. Wainwright, G. Raskutti and B. Yu, High-dimensional covariance estimation by minimizing $\ell_1$-penalized log-determinant divergence, Electronic Journal of Statistics 5, 2011, pp. 935-980.