Matrix concentration inequalities

1 ELE 538B: Mathematics of High-Dimensional Data
Matrix concentration inequalities
Yuxin Chen, Princeton University, Fall 2018

2 Recap: matrix Bernstein inequality

Consider a sequence of independent random matrices $\{X_l \in \mathbb{R}^{d_1 \times d_2}\}$:
- $\mathbb{E}[X_l] = 0$ and $\|X_l\| \le B$ for each $l$
- variance statistic: $v := \max\Big\{ \big\|\mathbb{E}\big[\sum_l X_l X_l^\top\big]\big\|,\ \big\|\mathbb{E}\big[\sum_l X_l^\top X_l\big]\big\| \Big\}$

Theorem 3.1 (Matrix Bernstein inequality). For all $\tau \ge 0$,
$$\mathbb{P}\Big\{\Big\|\sum_l X_l\Big\| \ge \tau\Big\} \le (d_1 + d_2)\, \exp\Big(\frac{-\tau^2/2}{v + B\tau/3}\Big)$$

3 Recap: matrix Bernstein inequality

[figure: tail behavior of the matrix Bernstein bound, showing a Gaussian tail in the moderate-deviation regime and an exponential tail in the large-deviation regime]

4 This lecture: a detailed introduction to matrix Bernstein

[book cover: An Introduction to Matrix Concentration Inequalities, Joel Tropp, 2015]

5 Outline

- Matrix theory background
- Matrix Laplace transform method
- Matrix Bernstein inequality
- Application: random features

6 Matrix theory background

7 Matrix function

Suppose the eigendecomposition of a symmetric matrix $A \in \mathbb{R}^{d \times d}$ is
$$A = U \begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_d \end{bmatrix} U^\top$$
Then we can define
$$f(A) := U \begin{bmatrix} f(\lambda_1) & & \\ & \ddots & \\ & & f(\lambda_d) \end{bmatrix} U^\top$$
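For concreteness, here is a minimal NumPy sketch of this spectral definition (our own illustration, not from the slides; the helper name `matrix_function` is ours):

```python
# Spectral definition of a matrix function: apply f to the eigenvalues
# and rotate back. Assumes A is symmetric.
import numpy as np
from scipy.linalg import expm

def matrix_function(A, f):
    """Compute f(A) = U diag(f(lambda_1), ..., f(lambda_d)) U^T."""
    eigvals, U = np.linalg.eigh(A)            # eigendecomposition of symmetric A
    return U @ np.diag(f(eigvals)) @ U.T

# Sanity check: the spectral matrix exponential matches scipy's expm.
A = np.random.randn(4, 4); A = (A + A.T) / 2
assert np.allclose(matrix_function(A, np.exp), expm(A))
```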

8 Examples of matrix functions

Let $f(a) = c_0 + \sum_{k=1}^{\infty} c_k a^k$; then $f(A) := c_0 I + \sum_{k=1}^{\infty} c_k A^k$.

- matrix exponential: $e^A := I + \sum_{k=1}^{\infty} \frac{1}{k!} A^k$ (why?)
  monotonicity: if $A \preceq H$, then $\operatorname{tr} e^A \le \operatorname{tr} e^H$
- matrix logarithm: $\log(e^A) := A$
  monotonicity: if $0 \preceq A \preceq H$, then $\log A \preceq \log H$
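A small numerical check of the first monotonicity fact (our own demo, not from the slides; it verifies only the trace-exponential statement, under the construction $H = A + P$ with $P$ PSD so that $A \preceq H$ by construction):

```python
# Check: A <= H in the semidefinite order implies tr e^A <= tr e^H.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d)); A = (A + A.T) / 2
P = rng.standard_normal((d, d)); P = P @ P.T       # P is PSD, hence A <= A + P
H = A + P
print(np.trace(expm(A)) <= np.trace(expm(H)))      # True
```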

9 Matrix moments and cumulants

Let $X$ be a random symmetric matrix. Then:
- matrix moment generating function (MGF): $M_X(\theta) := \mathbb{E}[e^{\theta X}]$
- matrix cumulant generating function (CGF): $\Xi_X(\theta) := \log \mathbb{E}[e^{\theta X}]$

10 Matrix Laplace transform method

11 Matrix Laplace transform

A key step for a scalar random variable $Y$: by Markov's inequality,
$$\mathbb{P}\{Y \ge t\} \le \inf_{\theta > 0} e^{-\theta t}\, \mathbb{E}\big[e^{\theta Y}\big]$$
This can be generalized to the matrix case.

12 Matrix Laplace transform

Lemma 3.2. Let $Y$ be a random symmetric matrix. For all $t \in \mathbb{R}$,
$$\mathbb{P}\{\lambda_{\max}(Y) \ge t\} \le \inf_{\theta > 0} e^{-\theta t}\, \mathbb{E}\big[\operatorname{tr} e^{\theta Y}\big]$$
We can thus control the extreme eigenvalues of $Y$ via the trace of the matrix MGF.

13 Proof of Lemma 3.2

For any $\theta > 0$,
$$\mathbb{P}\{\lambda_{\max}(Y) \ge t\} = \mathbb{P}\big\{e^{\theta \lambda_{\max}(Y)} \ge e^{\theta t}\big\} \le \frac{\mathbb{E}\big[e^{\theta \lambda_{\max}(Y)}\big]}{e^{\theta t}} \quad \text{(Markov's inequality)}$$
$$= \frac{\mathbb{E}\big[e^{\lambda_{\max}(\theta Y)}\big]}{e^{\theta t}} = \frac{\mathbb{E}\big[\lambda_{\max}(e^{\theta Y})\big]}{e^{\theta t}} \quad \big(e^{\lambda_{\max}(Z)} = \lambda_{\max}(e^{Z})\big)$$
$$\le \frac{\mathbb{E}\big[\operatorname{tr} e^{\theta Y}\big]}{e^{\theta t}}$$
This completes the proof, since the bound holds for any $\theta > 0$.
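The two spectral facts used above are easy to check numerically; a quick sketch (our own, not from the slides):

```python
# Check: e^{lambda_max(Z)} = lambda_max(e^Z), and lambda_max(e^Z) <= tr e^Z
# for symmetric Z (all eigenvalues of e^Z are positive).
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
Z = rng.standard_normal((4, 4)); Z = (Z + Z.T) / 2

lam_max = np.linalg.eigvalsh(Z)[-1]                    # eigvalsh sorts ascending
print(np.isclose(np.exp(lam_max), np.linalg.eigvalsh(expm(Z))[-1]))  # True
print(np.exp(lam_max) <= np.trace(expm(Z)))                          # True
```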

14 Issues of the matrix MGF

The Laplace transform method is effective for controlling an independent sum when the MGF decomposes. In the scalar case, where $X = X_1 + \cdots + X_n$ with independent $\{X_l\}$:
$$M_X(\theta) = \mathbb{E}\big[e^{\theta X_1 + \cdots + \theta X_n}\big] = \mathbb{E}\big[e^{\theta X_1}\big] \cdots \mathbb{E}\big[e^{\theta X_n}\big] = \prod_{l=1}^{n} M_{X_l}(\theta)$$
so one can look at each $X_l$ separately.

Issues in the matrix setting:
- $e^{X_1 + X_2} \neq e^{X_1} e^{X_2}$ unless $X_1$ and $X_2$ commute
- $\operatorname{tr}\, e^{X_1 + \cdots + X_n} \nleq \operatorname{tr}\big(e^{X_1} \cdots e^{X_n}\big)$ in general
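The first failure is easy to see numerically; a quick illustration (our own, not from the slides):

```python
# e^{X1+X2} != e^{X1} e^{X2} for non-commuting symmetric matrices,
# while the identity does hold when the two matrices commute.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
X1 = rng.standard_normal((3, 3)); X1 = (X1 + X1.T) / 2
X2 = rng.standard_normal((3, 3)); X2 = (X2 + X2.T) / 2

print(np.allclose(expm(X1 + X2), expm(X1) @ expm(X2)))   # False in general
print(np.allclose(expm(X1 + X1), expm(X1) @ expm(X1)))   # True: X1 commutes with itself
```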

15 Subadditivity of the matrix CGF

Fortunately, the matrix CGF satisfies certain subadditivity rules, allowing us to decompose independent matrix components.

Lemma 3.3. Consider a finite sequence $\{X_l\}_{1 \le l \le n}$ of independent random symmetric matrices. Then for any $\theta \in \mathbb{R}$,
$$\underbrace{\mathbb{E}\Big[\operatorname{tr}\, e^{\theta \sum_l X_l}\Big]}_{=\,\operatorname{tr} \exp\left(\Xi_{\sum_l X_l}(\theta)\right)} \le \underbrace{\operatorname{tr} \exp\Big(\sum_l \log \mathbb{E}\big[e^{\theta X_l}\big]\Big)}_{=\,\operatorname{tr} \exp\left(\sum_l \Xi_{X_l}(\theta)\right)}$$
This is a deep result based on Lieb's Theorem!

16 Lieb's Theorem

Theorem 3.4 (Lieb '73). Fix a symmetric matrix $H$. Then
$$A \mapsto \operatorname{tr} \exp(H + \log A)$$
is concave on the positive-semidefinite cone.

[photo: Elliott Lieb]

Lieb's Theorem immediately implies (exercise: Jensen's inequality)
$$\mathbb{E}\big[\operatorname{tr} \exp(H + X)\big] \le \operatorname{tr} \exp\big(H + \log \mathbb{E}\big[e^{X}\big]\big) \tag{3.1}$$

17 Proof of Lemma 3.3

$$\mathbb{E}\big[\operatorname{tr}\, e^{\theta \sum_l X_l}\big] = \mathbb{E}\Big[\operatorname{tr} \exp\Big(\theta \textstyle\sum_{l=1}^{n-1} X_l + \theta X_n\Big)\Big]$$
$$\le \mathbb{E}\Big[\operatorname{tr} \exp\Big(\theta \textstyle\sum_{l=1}^{n-1} X_l + \log \mathbb{E}\big[e^{\theta X_n}\big]\Big)\Big] \quad \text{(condition on } X_1, \ldots, X_{n-1} \text{ and apply (3.1))}$$
$$\le \mathbb{E}\Big[\operatorname{tr} \exp\Big(\theta \textstyle\sum_{l=1}^{n-2} X_l + \log \mathbb{E}\big[e^{\theta X_{n-1}}\big] + \log \mathbb{E}\big[e^{\theta X_n}\big]\Big)\Big] \le \cdots$$
$$\le \operatorname{tr} \exp\Big(\textstyle\sum_{l=1}^{n} \log \mathbb{E}\big[e^{\theta X_l}\big]\Big)$$

18 Master bounds

Combining the Laplace transform method with the subadditivity of the CGF yields:

Theorem 3.5 (Master bounds for a sum of independent matrices). Consider a finite sequence $\{X_l\}$ of independent random symmetric matrices. Then
$$\mathbb{P}\Big\{\lambda_{\max}\Big(\sum_l X_l\Big) \ge t\Big\} \le \inf_{\theta > 0} \frac{\operatorname{tr} \exp\big(\sum_l \log \mathbb{E}\big[e^{\theta X_l}\big]\big)}{e^{\theta t}}$$
This is a general result underlying the proofs of the matrix Bernstein inequality and beyond (e.g. matrix Chernoff).

19 Matrix Bernstein inequality

20 Matrix CGF

$$\mathbb{P}\Big\{\lambda_{\max}\Big(\sum_l X_l\Big) \ge t\Big\} \le \inf_{\theta > 0} \frac{\operatorname{tr} \exp\big(\sum_l \log \mathbb{E}\big[e^{\theta X_l}\big]\big)}{e^{\theta t}}$$
To invoke the master bound, one needs to control the matrix CGF $\log \mathbb{E}\big[e^{\theta X_l}\big]$; this is the main step in proving matrix Bernstein.

21 Symmetric case

Consider a sequence of independent random symmetric matrices $\{X_l \in \mathbb{R}^{d \times d}\}$:
- $\mathbb{E}[X_l] = 0$ and $\lambda_{\max}(X_l) \le B$ for each $l$
- variance statistic: $v := \big\| \mathbb{E}\big[\sum_l X_l^2\big] \big\|$

Theorem 3.6 (Matrix Bernstein inequality: symmetric case). For all $\tau \ge 0$,
$$\mathbb{P}\Big\{\lambda_{\max}\Big(\sum_l X_l\Big) \ge \tau\Big\} \le d \exp\Big(\frac{-\tau^2/2}{v + B\tau/3}\Big)$$
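To see the theorem in action, here is a minimal Monte Carlo sketch (our own illustration; the dimensions and the Rademacher-series construction below are arbitrary choices, not from the slides) comparing the empirical tail of $\lambda_{\max}(\sum_l X_l)$ with the bound of Theorem 3.6:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, trials = 10, 200, 2000

# Fixed symmetric matrices A_l; X_l = eps_l * A_l with Rademacher signs eps_l,
# so E[X_l] = 0, lambda_max(X_l) <= max spectral radius of the A_l,
# and E[X_l^2] = A_l^2.
As = []
for _ in range(n):
    M = rng.standard_normal((d, d)) / np.sqrt(d * n)
    As.append((M + M.T) / 2)

B = max(np.abs(np.linalg.eigvalsh(A)).max() for A in As)   # per-summand bound
v = np.linalg.norm(sum(A @ A for A in As), 2)              # v = ||sum_l E[X_l^2]||

def lam_max_of_sum():
    eps = rng.choice([-1.0, 1.0], size=n)
    return np.linalg.eigvalsh(sum(e * A for e, A in zip(eps, As)))[-1]

lam = np.array([lam_max_of_sum() for _ in range(trials)])
for tau in [1.5, 2.0, 2.5]:
    empirical = (lam >= tau).mean()
    bound = d * np.exp(-(tau**2 / 2) / (v + B * tau / 3))
    print(f"tau={tau}: empirical {empirical:.4f}, Bernstein bound {min(bound, 1):.4f}")
```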

22 Bounding matrix CGF

For bounded random matrices, one can control the matrix CGF as follows:

Lemma 3.7. Suppose $\mathbb{E}[X] = 0$ and $\lambda_{\max}(X) \le B$. Then for $0 < \theta < 3/B$,
$$\log \mathbb{E}\big[e^{\theta X}\big] \preceq \frac{\theta^2/2}{1 - \theta B/3}\, \mathbb{E}[X^2]$$

23 Proof of Theorem 3.6

Let $g(\theta) := \frac{\theta^2/2}{1 - \theta B/3}$. It follows from the master bound that
$$\mathbb{P}\Big\{\lambda_{\max}\Big(\sum_{i=1}^{n} X_i\Big) \ge t\Big\} \le \inf_{\theta > 0} \frac{\operatorname{tr} \exp\big(\sum_{i=1}^{n} \log \mathbb{E}\big[e^{\theta X_i}\big]\big)}{e^{\theta t}}$$
$$\overset{\text{Lemma 3.7}}{\le} \inf_{0 < \theta < 3/B} \frac{\operatorname{tr} \exp\big(g(\theta) \sum_{i=1}^{n} \mathbb{E}[X_i^2]\big)}{e^{\theta t}} \le \inf_{0 < \theta < 3/B} \frac{d \exp\big(g(\theta)\, v\big)}{e^{\theta t}}$$
Taking $\theta = \frac{t}{v + Bt/3}$ and simplifying the above expression, we establish matrix Bernstein.
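The final simplification is routine algebra; a short symbolic check (our own, not from the slides) that this choice of $\theta$ turns the exponent $g(\theta)v - \theta t$ into the Bernstein exponent:

```python
# Verify symbolically: with theta = t/(v + B*t/3),
#   g(theta)*v - theta*t = -t^2/2 / (v + B*t/3).
import sympy as sp

t, v, B = sp.symbols('t v B', positive=True)
theta = t / (v + B * t / 3)
g = (theta**2 / 2) / (1 - theta * B / 3)
print(sp.simplify(g * v - theta * t + (t**2 / 2) / (v + B * t / 3)))  # 0
```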

24 Proof of Lemma 3.7

Define $f(x) = \frac{e^{\theta x} - 1 - \theta x}{x^2}$; then for any $X$ with $\lambda_{\max}(X) \le B$:
$$e^{\theta X} = I + \theta X + \big(e^{\theta X} - I - \theta X\big) = I + \theta X + X f(X) X \preceq I + \theta X + f(B)\, X^2$$
In addition, we note an elementary inequality (using $k! \ge 2 \cdot 3^{k-2}$ for $k \ge 2$): for any $0 < \theta < 3/B$,
$$f(B) = \frac{e^{\theta B} - 1 - \theta B}{B^2} = \frac{1}{B^2} \sum_{k=2}^{\infty} \frac{(\theta B)^k}{k!} \le \frac{\theta^2}{2} \sum_{k=2}^{\infty} \frac{(\theta B)^{k-2}}{3^{k-2}} = \frac{\theta^2/2}{1 - \theta B/3}$$
$$\Longrightarrow \quad e^{\theta X} \preceq I + \theta X + \frac{\theta^2/2}{1 - \theta B/3}\, X^2$$
Since $X$ is zero-mean, one further has
$$\mathbb{E}\big[e^{\theta X}\big] \preceq I + \frac{\theta^2/2}{1 - \theta B/3}\, \mathbb{E}[X^2] \preceq \exp\Big(\frac{\theta^2/2}{1 - \theta B/3}\, \mathbb{E}[X^2]\Big)$$

25 Application: random features

26 Kernel trick

A modern idea in machine learning: replace the inner product by a kernel evaluation (i.e. a certain similarity measure).

Advantage: work beyond the Euclidean domain via task-specific similarity measures.

27 Similarity measure

Define the similarity measure $\Phi$:
- $\Phi(x, x) = 1$
- $|\Phi(x, y)| \le 1$
- $\Phi(x, y) = \Phi(y, x)$

Example: angular similarity
$$\Phi(x, y) = \frac{2}{\pi} \arcsin\Big(\frac{\langle x, y\rangle}{\|x\|_2 \|y\|_2}\Big) = 1 - \frac{2\,\angle(x, y)}{\pi}$$

28 Kernel matrix

Consider $N$ data points $x_1, \ldots, x_N \in \mathbb{R}^d$. Then the kernel matrix $G \in \mathbb{R}^{N \times N}$ is
$$G_{i,j} = \Phi(x_i, x_j), \qquad 1 \le i, j \le N$$
The kernel $\Phi$ is said to be positive definite if $G \succeq 0$ for any $\{x_i\}$.

Challenge: kernel matrices are usually large; the cost of constructing $G$ is $O(dN^2)$.
Question: can we approximate $G$ more efficiently?
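A small sketch of building the angular-similarity kernel matrix (our own illustration; the helper name `angular_kernel_matrix` and the sizes are ours), where the $O(dN^2)$ cost shows up in the pairwise Gram computation:

```python
import numpy as np

def angular_kernel_matrix(X):
    """X: (N, d) data matrix. Returns the N x N angular-similarity kernel."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)   # normalize rows
    C = np.clip(Xn @ Xn.T, -1.0, 1.0)                   # cosines of pairwise angles
    return (2 / np.pi) * np.arcsin(C)                   # = 1 - 2*angle/pi

rng = np.random.default_rng(0)
G = angular_kernel_matrix(rng.standard_normal((100, 5)))
print(G.shape, np.linalg.eigvalsh(G).min() >= -1e-8)    # PSD up to round-off
```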

29 Random features

Introduce a random variable $w$ and a feature map $\psi$ such that
$$\Phi(x, y) = \mathbb{E}_w\big[\psi(x; w)\, \psi(y; w)\big] \qquad \text{(decouples } x \text{ and } y\text{)}$$

Example (angular similarity): with $w$ uniformly drawn from the unit sphere,
$$\Phi(x, y) = 1 - \frac{2\,\angle(x, y)}{\pi} = \mathbb{E}_w\big[\operatorname{sgn}\langle x, w\rangle\, \operatorname{sgn}\langle y, w\rangle\big] \tag{3.2}$$
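Identity (3.2) is easy to verify by simulation; a quick Monte Carlo check (our own, with an arbitrary sample size $m$):

```python
# Check (3.2): E_w[sgn<x,w> sgn<y,w>] = 1 - 2*angle(x,y)/pi,
# for w uniform on the unit sphere.
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 200_000
x, y = rng.standard_normal(d), rng.standard_normal(d)

W = rng.standard_normal((m, d))                  # normalized Gaussian rows are
W /= np.linalg.norm(W, axis=1, keepdims=True)    # uniform on the unit sphere

mc = np.mean(np.sign(W @ x) * np.sign(W @ y))
angle = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
print(mc, 1 - 2 * angle / np.pi)                 # the two should nearly agree
```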

30 Random features

Introduce a random variable $w$ and a feature map $\psi$ such that
$$\Phi(x, y) = \mathbb{E}_w\big[\psi(x; w)\, \psi(y; w)\big]$$
This results in a random feature vector
$$z = \begin{bmatrix} z_1 \\ \vdots \\ z_N \end{bmatrix} = \begin{bmatrix} \psi(x_1; w) \\ \vdots \\ \psi(x_N; w) \end{bmatrix}$$
The rank-1 matrix $zz^\top$ is an unbiased estimate of $G$, i.e. $G = \mathbb{E}[zz^\top]$.

31 Example

Angular similarity:
$$\Phi(x, y) = 1 - \frac{2\,\angle(x, y)}{\pi} = \mathbb{E}_w\big[\operatorname{sign}\langle x, w\rangle\, \operatorname{sign}\langle y, w\rangle\big]$$
where $w$ is uniformly drawn from the unit sphere. As a result, the random feature map is
$$\psi(x; w) = \operatorname{sign}\langle x, w\rangle$$

32 Random feature approximation

Generate $n$ independent copies of $R = zz^\top$, i.e. $\{R_l\}_{1 \le l \le n}$.
Estimator of the kernel matrix $G$:
$$\widehat{G} = \frac{1}{n} \sum_{l=1}^{n} R_l$$
Question: how many random features are needed to guarantee accurate estimation?
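A compact sketch of this estimator with sign features, compared against the exact angular-similarity kernel (our own illustration; the sizes $N$, $d$, $n$ are arbitrary; note $\operatorname{sign}\langle x, w\rangle$ depends only on the direction of $w$, so unnormalized Gaussian directions suffice):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, n = 200, 10, 2000

X = rng.standard_normal((N, d))
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
G = (2 / np.pi) * np.arcsin(np.clip(Xn @ Xn.T, -1.0, 1.0))   # exact kernel, O(d N^2)

W = rng.standard_normal((n, d))     # n random directions w_l (scale is irrelevant)
Z = np.sign(X @ W.T)                # (N, n): column l is z_l, entries sign<x_i, w_l>
G_hat = (Z @ Z.T) / n               # (1/n) * sum_l z_l z_l^T

err = np.linalg.norm(G_hat - G, 2) / np.linalg.norm(G, 2)
print(f"relative spectral-norm error: {err:.3f}")
```

Increasing `n` drives the error down at the rate quantified on the next two slides.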

33 Statistical guarantees for random feature approximation

Consider the angular similarity example (3.2). To begin with,
$$\mathbb{E}[R_l^2] = \mathbb{E}\big[zz^\top zz^\top\big] = N\, \mathbb{E}\big[zz^\top\big] = NG \quad \Longrightarrow \quad v = \Big\|\frac{1}{n^2} \sum_{l=1}^{n} \mathbb{E}[R_l^2]\Big\| = \frac{N}{n} \|G\|$$
(here $z^\top z = \|z\|_2^2 = N$, since each entry of $z$ is $\pm 1$). Next,
$$\Big\|\frac{1}{n} R\Big\| = \frac{1}{n} \|z\|_2^2 = \frac{N}{n} =: B$$
Applying the matrix Bernstein inequality yields, with high probability,
$$\big\|\widehat{G} - G\big\| \lesssim \sqrt{v \log N} + B \log N \asymp \sqrt{\frac{N}{n} \|G\| \log N} + \frac{N}{n} \log N \asymp \sqrt{\frac{N}{n} \|G\| \log N}$$
for sufficiently large $n$.

34 Sample complexity

Define the intrinsic dimension of $G$ as
$$\operatorname{intdim}(G) = \frac{\operatorname{tr} G}{\|G\|} = \frac{N}{\|G\|}$$
(here $\operatorname{tr} G = N$, since the diagonal entries of $G$ equal 1). If $n \gtrsim \varepsilon^{-2}\, \operatorname{intdim}(G) \log N$, then we have
$$\frac{\big\|\widehat{G} - G\big\|}{\|G\|} \le \varepsilon$$
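As a rule of thumb, this sample-complexity bound translates into a one-line sizing helper (our own sketch; the function name and the absence of constants are ours, so treat the output as an order-of-magnitude guide):

```python
import numpy as np

def suggested_num_features(G, eps):
    """n ~ eps^{-2} * intdim(G) * log N, with intdim(G) = tr(G)/||G||."""
    N = G.shape[0]
    intdim = np.trace(G) / np.linalg.norm(G, 2)
    return int(np.ceil(intdim * np.log(N) / eps**2))

# Usage: n = suggested_num_features(G, eps=0.1) for a kernel matrix G.
```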

35 Reference

[1] "An introduction to matrix concentration inequalities," J. Tropp, Foundations and Trends in Machine Learning, 2015.
[2] "Convex trace functions and the Wigner-Yanase-Dyson conjecture," E. Lieb, Advances in Mathematics, 1973.
[3] "User-friendly tail bounds for sums of random matrices," J. Tropp, Foundations of Computational Mathematics, 2012.
[4] "Random features for large-scale kernel machines," A. Rahimi and B. Recht, NIPS, 2007.
