
36-755: Advanced Statistical Theory I                                  Fall 2016
Lecture 6: September 19
Lecturer: Alessandro Rinaldo                                     Scribe: YJ Choe

Note: LaTeX template courtesy of UC Berkeley EECS dept.

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

6.1 Metric entropy and its uses

Let $(\mathcal{X}, d)$ be a metric space. We gave some examples of metric spaces, including $(\mathbb{R}^d, \|\cdot\|_p)$, the $d$-dimensional real space with the $\ell_p$-norm, and $L_p([0,1], \mu)$ (the $L_p$ function space on $[0,1]$ with measure $\mu$) for $p \geq 1$. We are interested in measuring how big these spaces are.

6.1.1 Covering numbers and metric entropy

Definition 6.1 (Covering numbers) Let $\epsilon > 0$. An $\epsilon$-covering or $\epsilon$-net of $(\mathcal{X}, d)$ is any set $\{\theta_1, \ldots, \theta_N\} \subseteq \mathcal{X}$, where $N = N(\epsilon)$, such that for any $\theta \in \mathcal{X}$ there exists $i \in [N]$ with $d(\theta, \theta_i) \leq \epsilon$. The $\epsilon$-covering number of $(\mathcal{X}, d)$, denoted $N(\epsilon, \mathcal{X}, d)$, is the size of a smallest $\epsilon$-covering.

There are several remarks:

1. For any $(\mathcal{X}, d)$, the $\epsilon$-covering number is unique, but there can be several $\epsilon$-coverings of that size.

2. Let $B(\theta_i, \epsilon) = \{\theta \in \mathcal{X} : d(\theta, \theta_i) \leq \epsilon\}$. Then $\mathcal{X} \subseteq \bigcup_{i=1}^{N(\epsilon, \mathcal{X}, d)} B(\theta_i, \epsilon)$.

3. We will only consider metric spaces $(\mathcal{X}, d)$ that are totally bounded, i.e., $N(\epsilon, \mathcal{X}, d) < \infty$ for every $\epsilon > 0$. Note that $\mathrm{diam}(\mathcal{X}) = \sup_{\theta, \theta'} d(\theta, \theta') < \infty$ in that case.

4. In general, $N(\epsilon, \mathcal{X}, d)$ decreases as $\epsilon$ increases and diverges to $\infty$ as $\epsilon \to 0$.
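To make Definition 6.1 concrete, here is a minimal numerical sketch (not part of the original notes; it assumes NumPy, and the helper names `is_eps_covering` and `greedy_covering` are ours). It checks whether a candidate set is an $\epsilon$-covering of a finite point cloud in the Euclidean metric, and greedily builds one.

```python
import numpy as np

def is_eps_covering(centers, points, eps):
    """Check that every point is within distance eps of some center (Euclidean metric)."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return bool(np.all(dists.min(axis=1) <= eps))

def greedy_covering(points, eps):
    """Greedily pick uncovered points as centers until everything is covered.
    The result is an eps-covering, though not necessarily a smallest one."""
    centers = []
    uncovered = points.copy()
    while len(uncovered) > 0:
        c = uncovered[0]
        centers.append(c)
        uncovered = uncovered[np.linalg.norm(uncovered - c, axis=1) > eps]
    return np.array(centers)

# A fine grid standing in for X = [-1, 1]^2.
grid = np.stack(np.meshgrid(np.linspace(-1, 1, 50), np.linspace(-1, 1, 50)),
                axis=-1).reshape(-1, 2)
eps = 0.25
C = greedy_covering(grid, eps)
print(len(C), is_eps_covering(C, grid, eps))  # covering size and a check that it covers
```

The greedy covering size is only an upper bound on $N(\epsilon, \mathcal{X}, d)$, but it already exhibits the $(C/\epsilon)^p$ scaling discussed next.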

Example. Let $\mathcal{X} = [-1, 1]$ and $d(x, y) = |x - y|$ for $x, y \in \mathcal{X}$. Then

$$N(\epsilon, \mathcal{X}, d) \leq 1 + \frac{1}{\epsilon} \leq \frac{C}{\epsilon}$$

for some $C > 0$. If $\mathcal{X} = [-1, 1]^p$, then

$$N(\epsilon, \mathcal{X}, d) \leq \left(\frac{C}{\epsilon}\right)^p.$$

Definition 6.2 (Metric entropy) The metric entropy of $(\mathcal{X}, d)$ is defined as $\log N(\epsilon, \mathcal{X}, d)$.

Typically, for bounded subsets of $\mathbb{R}^p$ with $\|\cdot\|_2$, or any of its equivalent norms, the metric entropy scales as

$$\log N(\epsilon, \mathcal{X}, d) \asymp C\, p \log\left(\frac{1}{\epsilon}\right).$$

In general, bounded subsets of $\mathbb{R}^p$ are considered small spaces. For non-Euclidean spaces, e.g. function spaces, the metric entropy scales differently; we consider these large spaces.

Example. Let $\mathcal{F} = \{f : [0,1] \to \mathbb{R} \mid f \text{ is } L\text{-Lipschitz}\}$. Then

$$\log N(\epsilon, \mathcal{F}, d) \lesssim \frac{L}{\epsilon},$$

where $\lesssim$ denotes less than or equal to up to positive constants. The bound generalizes to $L$-Lipschitz functions on $[0,1]^p$:

$$\log N(\epsilon, \mathcal{F}, d) \lesssim \left(\frac{L}{\epsilon}\right)^p.$$

Further notions in the book can be useful depending on the area of interest.

6.1.2 Packing numbers

Definition 6.3 (Packing numbers) An $\epsilon$-packing of $(\mathcal{X}, d)$ is any set $\{\theta_1, \ldots, \theta_M\} \subseteq \mathcal{X}$, where $M = M(\epsilon)$, such that $d(\theta_i, \theta_j) > \epsilon$ for all $i \neq j$. The $\epsilon$-packing number of $(\mathcal{X}, d)$, denoted $M(\epsilon, \mathcal{X}, d)$, is the size of a largest $\epsilon$-packing.

Again, the $\epsilon$-packing number is unique while the $\epsilon$-packing set that achieves it need not be. Sometimes we would prefer using covering numbers, and sometimes packing numbers. Figure 6.1 shows an example of an $\epsilon$-covering and an $\epsilon$-packing. The following is a classic lemma on the relationship between covering and packing numbers.

Lemma 6.4 For any $\epsilon > 0$,

$$M(2\epsilon, \mathcal{X}, d) \leq N(\epsilon, \mathcal{X}, d) \leq M(\epsilon, \mathcal{X}, d).$$

Proof: Homework.
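The second inequality of Lemma 6.4 is constructive: a maximal $\epsilon$-packing is automatically an $\epsilon$-covering. Here is a small sketch (ours, not from the notes; it assumes NumPy) that builds greedy maximal packings on a fine grid standing in for $\mathcal{X} = [-1, 1]$ and checks that the numbers behave as the lemma predicts.

```python
import numpy as np

def greedy_maximal_packing(points, eps):
    """Greedy maximal eps-packing: keep a point only if it is more than eps away
    from every point kept so far. By maximality, the result also eps-covers points."""
    kept = []
    for p in points:
        if all(np.linalg.norm(p - q) > eps for q in kept):
            kept.append(p)
    return np.array(kept)

pts = np.linspace(-1, 1, 2001).reshape(-1, 1)   # fine grid standing in for [-1, 1]
eps = 0.1
P_eps = greedy_maximal_packing(pts, eps)        # a valid eps-packing, hence <= M(eps)
P_2eps = greedy_maximal_packing(pts, 2 * eps)   # a valid 2eps-packing, hence <= M(2eps)

# Consistent with Lemma 6.4, the 2eps-packing is much smaller (about 10 vs 20 points).
print(len(P_2eps), len(P_eps))

# The maximal eps-packing also covers the grid within eps.
print(np.abs(pts - P_eps.T).min(axis=1).max() <= eps)
```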

Figure 6.1: A comparison of an $\epsilon$-covering (left) and an $\epsilon$-packing (right). Figures from [GKKW06].

6.1.3 Volumetric ratios and covering numbers

Proposition 6.5 Let $\|\cdot\|$ and $\|\cdot\|'$ be two norms on $\mathbb{R}^p$ (e.g. $\|\cdot\|_1$ and $\|\cdot\|_2$). Let $B$ and $B'_p$ be the corresponding unit balls (see the previous lecture notes for examples of norm balls). Then

$$\left(\frac{1}{\epsilon}\right)^p \frac{\mathrm{Vol}(B)}{\mathrm{Vol}(B'_p)} \;\leq\; N(\epsilon, B, \|\cdot\|') \;\leq\; \frac{\mathrm{Vol}\left(\frac{2}{\epsilon} B + B'_p\right)}{\mathrm{Vol}(B'_p)},$$

where, for $\alpha, \beta > 0$, $\alpha B = \{\alpha x : x \in B\}$ and $\beta B + B'_p = \{\beta x + y : x \in B, y \in B'_p\}$.

Proof: First note that, by homework, $\mathrm{Vol}(\epsilon B) = \epsilon^p \mathrm{Vol}(B)$ for any $\epsilon > 0$. Also, if $\{x_1, \ldots, x_N\}$ is an $\epsilon$-covering of $B$ in $\|\cdot\|'$, then

$$B \subseteq \bigcup_{i=1}^N \left\{x_i + \epsilon B'_p\right\},$$

where $x_i + \epsilon B'_p = \{x : \|x - x_i\|' \leq \epsilon\}$. Together, we get

$$\mathrm{Vol}(B) \leq N\, \mathrm{Vol}(\epsilon B'_p) = N \epsilon^p\, \mathrm{Vol}(B'_p).$$

Note that we assume the norm is equivalent to the $\ell_p$ norm, so that we have invariance of volumes. This gives us the lower bound

$$N(\epsilon, B, \|\cdot\|') \geq \frac{\mathrm{Vol}(B)}{\mathrm{Vol}(B'_p)} \left(\frac{1}{\epsilon}\right)^p.$$
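The two volume ratios in Proposition 6.5 can be checked numerically in low dimension. Below is a rough sketch (ours, not from the notes; it assumes NumPy): the size $S$ of a greedy maximal $\epsilon$-packing of a dense sample of the Euclidean unit disk sits between (approximately) $N(\epsilon, B, \|\cdot\|_2)$ and $M(\epsilon, B, \|\cdot\|_2)$, so it should land between the volumetric bounds $(1/\epsilon)^p$ and $(1 + 2/\epsilon)^p$.

```python
import numpy as np

rng = np.random.default_rng(0)
p, eps = 2, 0.2

# Dense sample of the Euclidean unit ball B in R^p, standing in for the continuum.
pts = rng.uniform(-1, 1, size=(20000, p))
pts = pts[np.linalg.norm(pts, axis=1) <= 1.0]

# Greedy maximal eps-packing of the sample; its size S is a valid packing size,
# and (approximately) at least the covering number of the ball.
kept = []
for x in pts:
    if all(np.linalg.norm(x - y) > eps for y in kept):
        kept.append(x)
S = len(kept)

lower = (1 / eps) ** p        # volume lower bound on N(eps, B, ||.||_2)
upper = (1 + 2 / eps) ** p    # volume upper bound on M(eps, B, ||.||_2)
print(lower, S, upper)        # S should land between the two (here 25 <= S <= 121)
```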

To get the upper bound, let $\{y_1, \ldots, y_M\}$ be a maximal $\epsilon$-packing of $B$ in $\|\cdot\|'$. Then this set is also an $\epsilon$-covering of $B$ in $\|\cdot\|'$, because otherwise we could find another point that would contradict the maximality of the $\epsilon$-packing. The $\frac{\epsilon}{2}$-balls $\{y_i + \frac{\epsilon}{2} B'_p\}$ are disjoint by the packing property. Thus,

$$\bigcup_{i=1}^M \left\{y_i + \frac{\epsilon}{2} B'_p\right\} \subseteq B + \frac{\epsilon}{2} B'_p.$$

Taking volumes, we get

$$M \left(\frac{\epsilon}{2}\right)^p \mathrm{Vol}(B'_p) \leq \mathrm{Vol}\left(B + \frac{\epsilon}{2} B'_p\right).$$

Note that the volume of the union simply becomes a product on the left-hand side, because the balls are disjoint. Thus,

$$M(\epsilon, B, \|\cdot\|') \leq \frac{\mathrm{Vol}\left(B + \frac{\epsilon}{2} B'_p\right)}{\left(\frac{\epsilon}{2}\right)^p \mathrm{Vol}(B'_p)} = \frac{\mathrm{Vol}\left(\frac{2}{\epsilon} B + B'_p\right)}{\mathrm{Vol}(B'_p)}.$$

Since the $\epsilon$-covering number is bounded above by the $\epsilon$-packing number (Lemma 6.4), we have the upper bound as well.

In our applications, we can simply take $B = B'_p$ to conclude that

$$p \log\left(\frac{1}{\epsilon}\right) \leq \log N(\epsilon, B, \|\cdot\|) \leq p \log\left(1 + \frac{2}{\epsilon}\right) \leq p \log\left(\frac{3}{\epsilon}\right),$$

where the last inequality holds for $\epsilon \leq 1$. Note once again that this result holds for any norm on $\mathbb{R}^d$, including the Euclidean norm.

6.1.4 Discretization

Covering and packing numbers can be used to discretize a supremum over an infinite space into a maximum over a finite covering or packing set. We can then bound this maximum, as is done for sub-Gaussian random vectors in Theorem 6.7.

Definition 6.6 (Sub-Gaussian random vectors) A random vector $X \in \mathbb{R}^d$ with $\mathbb{E}[X] = 0$ is sub-Gaussian with parameter $\sigma^2$, denoted $X \sim \mathrm{SG}_d(\sigma^2)$, if

$$v^T X \sim \mathrm{SG}(\sigma^2) \quad \text{for all } v \in S^{d-1},$$

where $S^{d-1} = \{v \in \mathbb{R}^d : \|v\|_2 = 1\}$ is the unit sphere in $\mathbb{R}^d$.

Theorem 6.7 Let $X \sim \mathrm{SG}_d(\sigma^2)$, and let $B^d$ be the unit ball in $(\mathbb{R}^d, \|\cdot\|_2)$. Then

$$\mathbb{E}\left[\max_{\theta \in B^d} \theta^T X\right] = \mathbb{E}\left[\|X\|_2\right] \leq 4\sigma\sqrt{d}.$$

Moreover, for any $\delta \in (0,1)$, with probability at least $1 - \delta$,

$$\max_{\theta \in B^d} \theta^T X \leq 4\sigma\sqrt{d} + 2\sigma\sqrt{2\log\left(\frac{1}{\delta}\right)}.$$
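A quick Monte Carlo sanity check of the expectation bound (ours, not part of the notes; it assumes NumPy and uses a standard Gaussian vector, which is $\mathrm{SG}_d(\sigma^2)$ with $\sigma = 1$):

```python
import numpy as np

rng = np.random.default_rng(1)
d, sigma, n_trials = 50, 1.0, 5000

# For X ~ N(0, sigma^2 I_d), the maximum of theta^T X over the Euclidean unit
# ball B^d is attained at theta = X / ||X||_2 and equals ||X||_2.
X = rng.normal(0.0, sigma, size=(n_trials, d))
sup_values = np.linalg.norm(X, axis=1)

print(sup_values.mean())         # roughly sigma * sqrt(d) for Gaussians (about 7 here)
print(4 * sigma * np.sqrt(d))    # the bound of Theorem 6.7: E[sup] <= 4 sigma sqrt(d)
```

For Gaussians the bound is loose by roughly a factor of 4; the point of the theorem is that the $\sqrt{d}$ rate holds for the whole sub-Gaussian class.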

Proof: Let $N_{1/2}$ be a $\frac{1}{2}$-covering of $B^d$ in $\|\cdot\|_2$. Then, by the volumetric bound above,

$$|N_{1/2}| \leq 5^d.$$

Next, for any $\theta \in B^d$, there exists $z = z(\theta) \in N_{1/2}$ such that

$$\theta = z + x$$

for some $x \in \mathbb{R}^d$ with $\|x\|_2 \leq \frac{1}{2}$. Thus,

$$\max_{\theta \in B^d} \theta^T X \leq \max_{z \in N_{1/2}} z^T X + \max_{x \in \frac{1}{2} B^d} x^T X.$$

Now, notice that $\max_{x \in \frac{1}{2} B^d} x^T X = \frac{1}{2} \max_{\theta \in B^d} \theta^T X$. This implies that

$$\max_{\theta \in B^d} \theta^T X \leq 2 \max_{z \in N_{1/2}} z^T X.$$

This holds almost everywhere. Taking expectations, we get

$$\mathbb{E}\left[\max_{\theta \in B^d} \theta^T X\right] \leq 2\, \mathbb{E}\left[\max_{z \in N_{1/2}} z^T X\right] \leq 2\sigma\sqrt{2 \log |N_{1/2}|} \leq 2\sigma\sqrt{2 d \log 5} \leq 4\sigma\sqrt{d},$$

where the second inequality is the standard bound on the expected maximum of finitely many $\mathrm{SG}(\sigma^2)$ variables.

For the second claim, we use the union bound (second inequality below). For any $t > 0$,

$$\mathbb{P}\left(\max_{\theta \in B^d} \theta^T X \geq t\right) \leq \mathbb{P}\left(2 \max_{z \in N_{1/2}} z^T X \geq t\right) \leq \sum_{z \in N_{1/2}} \mathbb{P}\left(z^T X \geq \frac{t}{2}\right) \leq |N_{1/2}| \exp\left\{-\frac{t^2}{8\sigma^2}\right\} \leq 5^d \exp\left\{-\frac{t^2}{8\sigma^2}\right\}.$$

Find $t$ such that this last expression is bounded by $\delta$: the choice

$$t = \sigma\sqrt{8 d \log 5} + 2\sigma\sqrt{2 \log(1/\delta)}$$

works, since then $t^2 \geq 8\sigma^2 \left(d \log 5 + \log(1/\delta)\right)$; finally $\sigma\sqrt{8 d \log 5} \leq 4\sigma\sqrt{d}$ gives the stated bound.

6.2 Covariance estimation

Using these techniques, we will show various bounds on estimating the covariance matrix of a random vector. First, recall the following result covered in Homework 1.
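The proof gives an explicit high-probability threshold $t$. A small sketch (ours, not from the notes; it assumes NumPy and a Gaussian $X$, hence $\mathrm{SG}_d(1)$) compares it with the empirical $(1-\delta)$-quantile of the supremum:

```python
import numpy as np

rng = np.random.default_rng(4)
d, sigma, delta, n_trials = 50, 1.0, 0.05, 20000

# For X ~ N(0, sigma^2 I_d), max_{theta in B^d} theta^T X = ||X||_2.
sup_values = np.linalg.norm(rng.normal(0.0, sigma, size=(n_trials, d)), axis=1)

# Threshold from the proof, and the simplified form stated in Theorem 6.7.
t_proof = sigma * np.sqrt(8 * d * np.log(5)) + 2 * sigma * np.sqrt(2 * np.log(1 / delta))
t_thm = 4 * sigma * np.sqrt(d) + 2 * sigma * np.sqrt(2 * np.log(1 / delta))

print(np.quantile(sup_values, 1 - delta))    # empirical (1 - delta)-quantile
print(t_proof, t_thm)                        # both dominate the quantile
print(np.mean(sup_values > t_thm) <= delta)  # the tail bound holds (quite loosely here)
```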

Theorem 6.8 (Lemma 12, [Yuan10]; Lemma 1, [RWRY11]) Let $X = (X_1, \ldots, X_d) \in \mathbb{R}^d$ be a zero-mean random vector with covariance $\Sigma$ such that $X_i / \sqrt{\Sigma_{ii}} \sim \mathrm{SG}(\sigma^2)$ for $i = 1, \ldots, d$. Let $\hat\Sigma$ be the empirical covariance matrix based on $n$ i.i.d. samples. Then, for any $t > 0$,

$$\max_{i,j} \left|\hat\Sigma_{ij} - \Sigma_{ij}\right| \lesssim \sqrt{\frac{t + \log d}{n}}$$

with probability at least $1 - e^{-t}$.

Note that $d$ can be larger than $n$: the bound remains useful as long as $\log d$ is small relative to $n$, e.g. when $d$ is polynomial in $n$. Also note that when $d > n$ the empirical covariance matrix need not be positive definite.

We first review some basic notions in matrix algebra. For $A \in \mathbb{R}^{m \times n}$ with $\mathrm{rank}(A) = r \leq \min\{m, n\}$, the singular value decomposition (SVD) of $A$ is given by

$$A = U D V^T,$$

where $D = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ with singular values $\sigma_1 \geq \cdots \geq \sigma_r > 0$, and $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$ each have $r$ orthonormal columns. Note that, for $j = 1, \ldots, r$,

$$A A^T u_j = \sigma_j^2 u_j, \qquad A^T A v_j = \sigma_j^2 v_j,$$

where $u_j \in S^{m-1}$ is the $j$th column of $U$ and $v_j \in S^{n-1}$ is the $j$th column of $V$.

The largest singular value can also be characterized as the operator norm:

$$\sigma_{\max}(A) = \max_{x \neq 0} \frac{\|Ax\|_2}{\|x\|_2} = \max_{x \in S^{m-1},\, y \in S^{n-1}} x^T A y.$$

If $A \in \mathbb{R}^{n \times n}$ is symmetric and positive definite, then its singular values coincide with its eigenvalues (in general, the singular values are the square roots of the eigenvalues of $A^T A$). In the next lecture, we will give a bound on the distance between $\hat\Sigma$ and $\Sigma$ in the operator norm.
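To connect Theorem 6.8 with the matrix-algebra review, here is a small simulation sketch (ours, not part of the notes; it assumes NumPy and uses a Gaussian design with an AR(1)-type covariance as a stand-in). It compares the maximum entrywise error of the empirical covariance with the $\sqrt{\log d / n}$ rate, and checks the stated SVD and operator-norm facts on the error matrix $\hat\Sigma - \Sigma$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 200

# Test covariance: AR(1)-type correlation matrix Sigma_ij = 0.5^|i-j| (our choice).
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
Sigma_hat = (X.T @ X) / n                      # empirical covariance (mean known to be zero)

# Entrywise error vs the rate of Theorem 6.8 (up to constants).
E = Sigma_hat - Sigma
print(np.max(np.abs(E)), np.sqrt(np.log(d) / n))

# SVD review: E = U diag(s) V^T with E E^T u_1 = s_1^2 u_1, and s_1 is the operator norm.
U, s, Vt = np.linalg.svd(E)
print(np.allclose(E @ E.T @ U[:, 0], s[0] ** 2 * U[:, 0]))
print(np.isclose(s[0], np.linalg.norm(E, 2)))  # sigma_max(E) = max_x ||E x||_2 / ||x||_2
# E is symmetric, so its singular values are the absolute values of its eigenvalues.
print(np.allclose(np.sort(s), np.sort(np.abs(np.linalg.eigvalsh(E)))))
```

The operator norm of $\hat\Sigma - \Sigma$ computed here is exactly the quantity that will be bounded in the next lecture.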

References

[GKKW06] L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer Science & Business Media, 2006.

[Yuan10] M. Yuan. High dimensional inverse covariance matrix estimation via linear programming. Journal of Machine Learning Research, 11:2261-2286, 2010.

[RWRY11] P. Ravikumar, M. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing $\ell_1$-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935-980, 2011.