Risi Kondor, The University of Chicago. Nedelina Teneva, UChicago. Vikas K. Garg, TTI-C, MIT


1 Risi Kondor, The University of Chicago; Nedelina Teneva, UChicago; Vikas K. Garg, TTI-C, MIT

2 Risi Kondor, The University of Chicago; Nedelina Teneva, UChicago; Vikas K. Garg, TTI-C, MIT

3 $\{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\} \;\longrightarrow\; f \colon x \mapsto y$ 3/51

4 $f^* = \operatorname{argmin}_{f \in \mathcal{F}} \left[ \frac{1}{m} \sum_{i=1}^{m} \ell(f(x_i), y_i) + \lambda \|f\|_2^2 \right]$
A general framework (regularized risk minimization) that encompasses:
- kernel methods, including Gaussian processes and SVMs, etc.
- most of Bayesian statistics
- boosting, etc.
[Vapnik & Chervonenkis, 1971] 4/51
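
A minimal sketch (not from the slides) of this framework with squared loss and a linear model, i.e. ridge regression; the data `X`, `y` and the value of `lam` are made up.

```python
# Regularized risk minimization with squared loss and an l2 penalty (ridge
# regression), solved in closed form; all data below is synthetic.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))                 # m = 100 examples, n = 5 features
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.standard_normal(100)
lam = 0.1

# f(x) = <w, x>;  argmin_w (1/m)||Xw - y||^2 + lam ||w||^2 has a closed form:
m, n = X.shape
w = np.linalg.solve(X.T @ X / m + lam * np.eye(n), X.T @ y / m)
print(w)
```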

5 5 5/51

6 Data: $\{(x_i, y_i)\}_{i=1}^m$ → Features: $\{(\phi_1(x_i), \dots, \phi_n(x_i), y_i)\}_{i=1}^m$ → $f : x \mapsto y$ via optimization 6/51

7 $f^* = \operatorname{argmin}_{f \in \mathcal{F}} \left[ \frac{1}{m} \sum_{i=1}^{m} \ell(f(x_i), y_i) + \lambda \|f\|_1 \right]$
The $\ell_1$ norm induces sparsity. Crucially, minimizing $\|f\|_1$ is almost as good as minimizing $\|f\|_0$. This led to an explosion of activity related to the Lasso [Tibshirani, 1996], Basis Pursuit [Donoho, 1998], the Elastic Net [Zou & Hastie, 2005], and Compressed Sensing [Donoho, 2004; Candès et al., 2005; Osher]. However, there is no longer a Representer Theorem. 7/51
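
A minimal sketch of the $\ell_1$-regularized counterpart solved by iterative soft thresholding (ISTA), one standard Lasso solver; this is illustrative only, and all data and parameters are made up.

```python
# Lasso: (1/m)||Xw - y||^2 + lam ||w||_1, minimized by proximal gradient (ISTA).
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]                      # sparse ground truth
y = X @ w_true + 0.05 * rng.standard_normal(100)
lam, step = 0.1, 1.0 / np.linalg.norm(X, 2) ** 2   # conservative step size

w = np.zeros(20)
for _ in range(500):
    g = X.T @ (X @ w - y) / len(y)                 # gradient of the smooth part
    w = w - step * g
    w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)   # soft threshold
print(np.round(w, 2))                              # most coordinates are exactly zero
```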

8 Data: $\{(x_i, y_i)\}_{i=1}^m$ → Features: $\{(\phi_1(x_i), \dots, \phi_n(x_i), y_i)\}_{i=1}^m$ → $f : x \mapsto y$ via optimization 8/51

9 Black-box machine learning ignores the structure of the data itself:
- the algorithmic part is separated from the statistical part
- $O(n^3)$ complexity is totally unrealistic in today's world
- no obvious parallelism
- it ignores the most important ideas of the Applied Math of the last 30 years, such as multiresolution analysis, fast multipole methods and multigrid
(The relationship of deep learning to all this is not clear yet.) 9/51

10

11 Data → Features → Multiresolution (e.g., Scattering) → $f : x \mapsto y$ 11/51

12 MMF is a multilevel factorization of the data similarity (kernel) matrix $A$ of the form
$A \approx Q_1^\top Q_2^\top \cdots Q_L^\top \, H \, Q_L \cdots Q_2 Q_1$,
where
- $P$ is just a permutation matrix to reorder the rows and columns of $A$,
- each $Q_\ell$ is an orthogonal matrix with special, sparse structure,
- outside its $[Q_\ell]_{1:\delta_{\ell-1},\,1:\delta_{\ell-1}}$ block, each $Q_\ell$ is just the identity, for some fixed sequence $n \ge \delta_1 \ge \dots \ge \delta_L$,
- $H$ is close to diagonal (core-diagonal).
A hierarchical, multipole-type description of the data. In factorized form, $A$ is much easier to deal with. 12/51
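
A minimal sketch of why the factorized form is easier to deal with: a matrix-vector product with $A = Q_1^\top \cdots Q_L^\top H Q_L \cdots Q_1$ only touches the sparse factors. The rotation list and core matrix below are made-up stand-ins, not factors produced by an actual MMF computation.

```python
# Fast matvec through the factored form: each Q_l is stored as a Givens
# rotation (i, j, theta) acting on two coordinates, H is a sparse core matrix.
import numpy as np
import scipy.sparse as sp

n = 8
rotations = [(0, 1, 0.3), (2, 3, 1.1), (0, 2, -0.7)]    # (i, j, angle), one per level
H = sp.diags(np.arange(1.0, n + 1))                      # stand-in core matrix

def apply_Q(x, i, j, theta, transpose=False):
    """Apply the rotation acting only on coordinates (i, j) to the vector x."""
    c, s = np.cos(theta), np.sin(theta)
    if transpose:
        s = -s
    y = x.copy()
    y[i] = c * x[i] + s * x[j]
    y[j] = -s * x[i] + c * x[j]
    return y

def mmf_matvec(x):
    for (i, j, t) in rotations:                  # x -> Q_L ... Q_1 x
        x = apply_Q(x, i, j, t)
    x = H @ x                                    # core
    for (i, j, t) in reversed(rotations):        # x -> Q_1^T ... Q_L^T x
        x = apply_Q(x, i, j, t, transpose=True)
    return x

print(mmf_matvec(np.ones(n)))
```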

13 The eigendecomposition (PCA) of a symmetric matrix $A \in \mathbb{R}^{n \times n}$ is $A = Q^\top D Q$ with $Q$ orthogonal and $D$ diagonal. The MMF factorization is $A \approx Q_1^\top \cdots Q_L^\top \, H \, Q_L \cdots Q_1$.
The eigendecomposition always exists, is essentially unique, and truncating it after $k$ terms gives the optimal approximation in the Frobenius and nuclear norms. But it is expensive to compute, $O(n^3)$, the eigenvectors are dense, and it does not take advantage of hierarchical structure. 13/51
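
For contrast, a minimal sketch of the truncated eigendecomposition on a made-up symmetric matrix: keeping the $k$ eigenpairs of largest magnitude gives the optimal rank-$k$ approximation mentioned above, at $O(n^3)$ cost and with dense eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((50, 50))
A = (B + B.T) / 2                                # symmetric test matrix
k = 10

vals, vecs = np.linalg.eigh(A)                   # A = vecs @ diag(vals) @ vecs.T
top = np.argsort(np.abs(vals))[-k:]              # k eigenvalues largest in magnitude
A_k = vecs[:, top] @ np.diag(vals[top]) @ vecs[:, top].T
print(np.linalg.norm(A - A_k, "fro"))            # optimal rank-k Frobenius error
```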

14 In general, multiresolution analysis on a space $X$ is a filtration $L(X) = V_0 \supseteq V_1 \supseteq V_2 \supseteq \dots$ where $V_\ell = V_{\ell+1} \oplus W_{\ell+1}$, each $V_\ell$ has orthonormal basis $\{\phi^\ell_m\}_m$, and each $W_\ell$ has orthonormal basis $\{\psi^\ell_m\}_m$. The spaces are chosen so that as $\ell$ increases, $V_\ell$ contains functions that are increasingly smooth w.r.t. some self-adjoint operator $T : L(X) \to L(X)$. 14/51

15 The central dogma of harmonic analysis is that the structure of the space of functions on a set $X$ can shed light on the structure of $X$ itself ($G \leftrightarrow L(G)$). The interplay between geometry of sets, function spaces on sets, and operators on sets is classical in Harmonic Analysis [Coifman & Maggioni, 2006]. Multiresolution analysis is an attractive paradigm for ML because
- data naturally clusters into (soft) hierarchies,
- the resulting data structures can form the basis of fast algorithms. 15/51

16 But how does a matrix induce multiresolution??? 16 16/51

17

18 Mallat [1989] defined multiresolution on $\mathbb{R}$ by the following axioms:
1. $\bigcap_\ell V_\ell = \{0\}$,
2. $\bigcup_\ell V_\ell$ is dense in $L_2(\mathbb{R})$,
3. if $f \in V_\ell$ then $f'(x) = f(x - 2^\ell m)$ is also in $V_\ell$ for any $m \in \mathbb{Z}$,
4. if $f \in V_\ell$, then $f'(x) = f(2x)$ is in $V_{\ell-1}$,
which imply the existence of a mother wavelet $\psi$ and a father wavelet $\phi$ s.t. $\psi^\ell_m(x) = 2^{-\ell/2}\, \psi(2^{-\ell}x - m)$ and $\phi^\ell_m(x) = 2^{-\ell/2}\, \phi(2^{-\ell}x - m)$. 18/51
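
The Haar system is the standard concrete instance of these axioms. A minimal sketch of one level of the Haar transform (father/scaling and mother/wavelet coefficients) on a made-up signal:

```python
import numpy as np

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 3.0])
smooth = (x[0::2] + x[1::2]) / np.sqrt(2)     # <x, phi_m>: local averages
detail = (x[0::2] - x[1::2]) / np.sqrt(2)     # <x, psi_m>: local differences

# The transform is orthogonal, so it inverts exactly:
x_rec = np.empty_like(x)
x_rec[0::2] = (smooth + detail) / np.sqrt(2)
x_rec[1::2] = (smooth - detail) / np.sqrt(2)
assert np.allclose(x, x_rec)
print(smooth, detail)
```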

19 Which of the ideas from classical multiresolution still make sense?
- Recursively split $L(X)$ into smoother and rougher parts
- Basis functions should be localized in space & frequency
- Each $\Phi^\ell \xrightarrow{\,Q_\ell\,} \Phi^{\ell+1} \cup \Psi^{\ell+1}$ transform is orthogonal and sparse
- Each $\psi^\ell_m$ is derived by translating $\psi^\ell$: MAYBE
- Each $\psi^\ell$ is derived by scaling $\psi$: ???
19/51

20
1. The sequence $L(X) = V_0 \supseteq V_1 \supseteq V_2 \supseteq \dots$ is a filtration of $\mathbb{R}^n$ in terms of smoothness with respect to $T$, in the sense that $\mu_\ell = \inf_{f \in V_\ell \setminus \{0\}} \langle f, Tf\rangle / \langle f, f\rangle$ increases at a given rate.
2. The wavelets are localized, in the sense that $\inf_{x \in X}\sup_{y \in X} |\psi^\ell_m(y)|\, d(x, y)^\alpha$ increases no faster than a certain rate.
3. Letting $Q_\ell$ be the matrix expressing $\Phi^\ell \cup \Psi^\ell$ in the previous basis $\Phi^{\ell-1}$, i.e., $\phi^\ell_m = \sum_{i=1}^{\dim(V_{\ell-1})} [Q_\ell]_{m,i}\, \phi^{\ell-1}_i$ and $\psi^\ell_m = \sum_{i=1}^{\dim(V_{\ell-1})} [Q_\ell]_{m+\dim(V_\ell),\,i}\, \phi^{\ell-1}_i$, each $Q_\ell$ orthogonal transform is sparse, guaranteeing the existence of a fast wavelet transform. ($\Phi^0$ is taken to be the standard basis, $\phi^0_m = e_m$.)
20/51

21

22 If $|X| = n$ is finite, representing $T$ by a symmetric matrix $A \in \mathbb{R}^{n \times n}$, each basis transform $V_\ell \to V_{\ell+1} \oplus W_{\ell+1}$ is like applying a rotation matrix,
$A \;\mapsto\; Q_1 A Q_1^\top \;\mapsto\; Q_2 Q_1 A Q_1^\top Q_2^\top \;\mapsto\; \dots,$
and then fixing a subset of the coordinates as wavelets. In addition, $Q_1, \dots, Q_L$ must obey sparsity constraints. Multiresolution analysis $\Longleftrightarrow$ multilevel matrix factorization. 22/51

23 Given a symmetric matrix $A \in \mathbb{R}^{n \times n}$, a class $\mathcal{Q}$ of sparse rotations, and a sequence $n \ge \delta_1 \ge \dots \ge \delta_L$, a multiresolution matrix factorization (MMF) of $A$ is
$A = Q_1^\top Q_2^\top \cdots Q_L^\top \, H \, Q_L \cdots Q_2 Q_1,$
where each rotation $Q_\ell \in \mathcal{Q}$ satisfies $[Q_\ell]_{[n]\setminus S_\ell,\,[n]\setminus S_\ell} = I_{n-\delta_{\ell-1}}$ for some nested sequence of sets $[n] = S_1 \supseteq S_2 \supseteq \dots \supseteq S_{L+1}$ with $|S_\ell| = \delta_{\ell-1}$, and $H$ is $S_{L+1}$-core-diagonal. If this factorization is exact, we say that $A$ is fully multiresolution factorizable (over $\mathcal{Q}$ with $\delta_1, \dots, \delta_L$): a generalization of rank. 23/51

24 It is critical that the $Q_\ell$ be very simple and local rotations. Two choices:
1. Jacobi MMFs: $Q = I_{n-k} \oplus_{(i_1,\dots,i_k)} O$ for some $O \in SO(k)$; for $k = 2$ this is just a Givens rotation.
2. Parallel MMFs: $Q = \oplus_{(i^1_1,\dots,i^1_{k_1})} O_1 \;\oplus_{(i^2_1,\dots,i^2_{k_2})} O_2 \;\cdots\; \oplus_{(i^m_1,\dots,i^m_{k_m})} O_m$ for some $O_1, \dots, O_m \in SO(k)$.
24/51
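
A minimal sketch of the two rotation classes as explicit matrices: a single embedded Givens rotation for the Jacobi case, and several disjoint 2x2 blocks applied simultaneously for the parallel case. The index sets and angles are made up.

```python
import numpy as np

def embedded_rotation(n, blocks):
    """blocks: list of ((i, j), theta); returns the n x n compound rotation."""
    Q = np.eye(n)
    for (i, j), theta in blocks:
        c, s = np.cos(theta), np.sin(theta)
        Q[i, i], Q[j, j] = c, c
        Q[i, j], Q[j, i] = -s, s
    return Q

Q_jacobi = embedded_rotation(6, [((1, 4), 0.6)])                   # Givens rotation
Q_parallel = embedded_rotation(6, [((0, 3), 0.2), ((1, 4), -0.9), ((2, 5), 1.3)])
for Q in (Q_jacobi, Q_parallel):
    assert np.allclose(Q @ Q.T, np.eye(6))       # both are orthogonal, and sparse
```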

25 Given $A$, ideally we would like to solve
$\min_{[n] \supseteq S_1 \supseteq \dots \supseteq S_L;\;\, H \in \mathcal{H}_n^{S_{L+1}};\;\, Q_1, \dots, Q_L \in \mathcal{Q}} \; \| A - Q_1^\top \cdots Q_L^\top \, H \, Q_L \cdots Q_1 \|^2_{\mathrm{Frob}}$
for a given class $\mathcal{Q}$ of local rotations and dimensions $\delta_1 \ge \delta_2 \ge \dots \ge \delta_L$.
- In general, this optimization problem is combinatorially hard.
- It is easy to approximate in a greedy way (level by level).
- To solve the combinatorial part of the problem (at each level), use a deterministic strategy or a randomized strategy (a toy sketch of the latter follows below).
25/51
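
A toy sketch of the greedy, level-by-level approach with the randomized strategy and $k = 2$. This is not the authors' implementation: it pairs a random active row with its most correlated partner and uses the classical Jacobi angle (which zeroes the chosen off-diagonal entry) as a stand-in for minimizing the exact per-level error.

```python
import numpy as np

def jacobi_mmf(A, L, rng=np.random.default_rng(0)):
    A = A.copy()
    n = A.shape[0]
    active = list(range(n))
    rotations, wavelets = [], []
    for _ in range(L):
        i = rng.choice(active)                               # randomized strategy
        others = [p for p in active if p != i]
        j = max(others, key=lambda p: abs(A[i] @ A[p]))      # most correlated row
        theta = 0.5 * np.arctan2(2 * A[i, j], A[j, j] - A[i, i])  # zeroes A[i, j]
        c, s = np.cos(theta), np.sin(theta)
        G = np.eye(n)
        G[i, i] = G[j, j] = c
        G[i, j], G[j, i] = -s, s
        A = G @ A @ G.T                                      # A_l = Q_l A_{l-1} Q_l^T
        rotations.append((i, j, theta))
        wavelets.append(j)
        active.remove(j)                                     # j retires as a wavelet
    return rotations, wavelets, A                            # A now plays the role of H

B = np.random.default_rng(1).standard_normal((8, 8))
rots, wavs, H = jacobi_mmf(B @ B.T, L=5)
print(wavs)
```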

26 If $Q_\ell = I_{n-k} \oplus_I O$ with $I = (i_1, \dots, i_k)$ and $J_\ell = \{i_k\}$, then the contribution of level $\ell$ to the MMF approximation error (in Frobenius norm) is
$E_\ell = \mathcal{E}_I^O = 2 \sum_{p=1}^{k-1} \big[ O [A_{\ell-1}]_{I,I}\, O^\top \big]^2_{k,p} + 2 \big[ O B O^\top \big]_{k,k},$
where $B = [A_{\ell-1}]_{I, S_\ell} \big([A_{\ell-1}]_{I, S_\ell}\big)^\top$. In the special case of $k = 2$ and $I_\ell = (i, j)$,
$E_\ell = \mathcal{E}^O_{(i,j)} = 2 \big[ O [A_{\ell-1}]_{(i,j),(i,j)}\, O^\top \big]^2_{2,1} + 2 \big[ O B O^\top \big]_{2,2}$
with $B = [A_{\ell-1}]_{(i,j), S_\ell} \big([A_{\ell-1}]_{(i,j), S_\ell}\big)^\top$. 26/51

27 Let $A \in \mathbb{R}^{2 \times 2}$ be diagonal, $B \in \mathbb{R}^{2 \times 2}$ symmetric, and $O = \begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix}$. Set $a = (A_{1,1} - A_{2,2})^2/4$, $b = B_{1,2}$, $c = (B_{2,2} - B_{1,1})/2$, $e = \sqrt{b^2 + c^2}$, $\theta = 2\alpha$ and $\omega = \arctan(c/b)$. If $\alpha$ minimizes $([O A O^\top]_{2,1})^2 + [O B O^\top]_{2,2}$, then $\theta$ satisfies $(a/e)\sin(2\theta) + \sin(\theta + \omega + \pi/2) = 0$. 27/51
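
A minimal numerical check of this stationarity condition on made-up 2x2 inputs: find roots of the displayed equation and compare against brute-force minimization of the objective over $\alpha$ (brentq and the grid sizes are just convenient choices).

```python
import numpy as np
from scipy.optimize import brentq

A = np.diag([3.0, 1.0])
B = np.array([[2.0, 0.7], [0.7, 0.5]])
a = (A[0, 0] - A[1, 1]) ** 2 / 4
b, c = B[0, 1], (B[1, 1] - B[0, 0]) / 2
e, w = np.hypot(b, c), np.arctan(c / b)

def g(theta):                                    # the stationarity condition above
    return (a / e) * np.sin(2 * theta) + np.sin(theta + w + np.pi / 2)

def J(alpha):                                    # the objective from the slide
    O = np.array([[np.cos(alpha), -np.sin(alpha)], [np.sin(alpha), np.cos(alpha)]])
    return (O @ A @ O.T)[1, 0] ** 2 + (O @ B @ O.T)[1, 1]

grid = np.linspace(-np.pi, np.pi, 2000)
roots = [brentq(g, t0, t1) for t0, t1 in zip(grid[:-1], grid[1:]) if g(t0) * g(t1) < 0]
best_alpha = min((t / 2 for t in roots), key=J)  # theta = 2 * alpha
print(J(best_alpha), min(J(al) for al in np.linspace(-np.pi, np.pi, 100001)))
```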

28 If $Q_\ell$ is a compound rotation of the form $Q_\ell = \oplus_{I_1} O_1 \,\cdots\, \oplus_{I_m} O_m$ for some partition $I_1 \cup \dots \cup I_m$ of $[n]$ with $k_1, \dots, k_m \le k$, and some sequence of orthogonal matrices $O_1, \dots, O_m$, then level $\ell$'s contribution to the MMF error obeys
$E_\ell \le 2 \sum_{j=1}^{m} \Big[ \sum_{p=1}^{k_j - 1} \big[ O_j [A_{\ell-1}]_{I_j, I_j}\, O_j^\top \big]^2_{k_j, p} + \big[ O_j B_j O_j^\top \big]_{k_j, k_j} \Big], \qquad (1)$
where $B_j = [A_{\ell-1}]_{I_j,\, S_{\ell-1}\setminus I_j} \big([A_{\ell-1}]_{I_j,\, S_{\ell-1}\setminus I_j}\big)^\top$.
For compression tasks, parallel MMFs are generally preferable to Jacobi MMFs because
- unrelated parts of the matrix are processed independently, in parallel,
- they give more compact factorizations,
- Jacobi MMFs can exhibit cascades.
The sets $I_1, \dots, I_m$ can be found by a randomized strategy or by exact matching ($O(n^3)$ time). 28/51

29 The sequence in which MMF (with $k \ge 3$) eliminates dimensions induces a (soft) hierarchical clustering amongst the dimensions (a mixture of trees). Connection to hierarchical clustering. 29/51

30
1. Find a (hierarchically) sparse basis for $A$
2. Hierarchically cluster data
3. Find community structure
4. Generate hierarchical graphs
5. Compress graphs & matrices
6. Provide a basis for sparse approximations such as the LASSO
7. Provide a basis for fast numerics (NLA, multigrid, etc.)
30/51

31 Diffusion wavelets also start with the matrix representation of a smoothing operator (the diffusion operator) and compress it in multiple stages. However, at each stage the wavelets are constructed from the columns of $A$ itself by a rank-revealing QR type process:
$A \approx Q_1 R_1, \qquad A^2 \approx Q_1 \underbrace{R_1 R_1^\top}_{Q_2 R_2} Q_1^\top, \qquad A^4 \approx Q_1 Q_2 \underbrace{R_2 R_2^\top}_{Q_3 R_3} Q_2^\top Q_1^\top, \;\dots$
Very strong theoretical foundations, but the sparsity (locality) of the $Q_\ell$ matrices is hard to control. [Coifman & Maggioni, 2006] 31/51
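
A minimal sketch in the spirit of one diffusion-wavelet compression stage (not Coifman & Maggioni's actual construction): take a rank-revealing, column-pivoted QR of a made-up diffusion operator $A$, keep the numerically significant columns as the next-level scaling functions, and represent $A^2$ on that basis.

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(3)
G = rng.standard_normal((60, 8))
A = G @ G.T / 8                                   # low-rank-ish smoothing operator
Q1, R1, piv = qr(A, pivoting=True)                # A[:, piv] = Q1 @ R1
r = np.sum(np.abs(np.diag(R1)) > 1e-8 * abs(R1[0, 0]))   # numerical rank
Q1 = Q1[:, :r]                                    # columns kept for the next level
A2 = Q1.T @ (A @ A) @ Q1                          # A^2 represented on the kept basis
print(A.shape, "->", A2.shape)
```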

32 Treelets are a special case of Jacobi MMF, $Q_3 Q_2 Q_1 A Q_1^\top Q_2^\top Q_3^\top \cdots$, but
- restricted to Givens rotations ($k = 2$), so they only recover a single tree,
- each $Q_i$ is chosen to eliminate the maximal off-diagonal entry, rather than to minimize the overall error,
- not intended as a factorization method: $A$ is regarded as a covariance matrix, leading to a probabilistic analysis.
[Lee, Nadler & Wasserman, 2008] 32/51

33 Multigrid methods solve systems of PDEs by shuttling back and forth between grids/meshes at different levels of resolution [Brandt, 1973; Livne & Brandt, 2010]. Fast multipole methods evaluate a kernel (such as the Gaussian kernel) between a large number of particles by aggregating them at different levels [Greengard & Rokhlin, 1987]. $\mathcal{H}$-matrices [Hackbusch, 1999], $\mathcal{H}^2$-matrices [Börm, 2007] and Hierarchically Semi-Separable matrices [Chandrasekaran et al., 2005] iteratively decompose a matrix into blocks, with low-rank structure in each of the blocks. 33/51

34 Related methods: Eigendecomposition (PCA), Singular Value Decomposition, Laplacian eigenmaps, etc., Rank-Revealing QR. [Johnson & Lindenstrauss, 1986; Halko, Martinsson & Tropp, 2011; Ailon & Chazelle, 2006; Sarlós, 2006; Le, Sarlós & Smola, 2013; Jolliffe, 1972; Drineas et al., 2009; Deshpande et al., 2006; Boutsidis et al., 2009; Coifman & Maggioni, 2006; Lee, Nadler & Wasserman, 2008; Drineas & Mahoney, 2006; Koutis, Miller & Peng, 2010; Benczúr & Karger, 1996; Spielman & Teng, 2004; Achlioptas & McSherry, 2001; Williams & Seeger, 2001; Fowlkes et al., 2004; Drineas & Mahoney, 2005; Kumar et al.] 34/51

35 In classical wavelet transforms one proves that if $f$ is $\alpha$-Hölder, i.e., $|f(x) - f(y)| \le c_H\, d(x, y)^\alpha$ for all $x, y \in X$, then the wavelet coefficients $|\langle f, \psi^\ell_m \rangle|$ decay at a rate governed by $\alpha + \beta$. Results of this type generally hold for spaces of homogeneous type, in which $\mathrm{Vol}(B(x, 2r)) \le c_{\mathrm{hom}}\, \mathrm{Vol}(B(x, r))$ for all $x \in X$, $r > 0$. The natural notion of distance between rows in MMF is $d(i, j) = \langle A_{i,:}, A_{j,:} \rangle^{-1}$. 35/51

36 We say that a symmetric matrix $A \in \mathbb{R}^{n \times n}$ is $\Lambda$-rank homogeneous up to order $K$ if, for any $S \subseteq [n]$ of size at most $K$, letting $Q = A_{S,:} A_{:,S}$, setting $D$ to be the diagonal matrix with $D_{i,i} = \|Q_{i,:}\|_1$, and $\overline{Q} = D^{-1/2} Q D^{-1/2}$, the eigenvalues $\lambda_1, \dots, \lambda_{|S|}$ of $\overline{Q}$ satisfy $\Lambda < \lambda_i < 1 - \Lambda$, and furthermore $c_T^{-1} \le D_{i,i} \le c_T$ for some constant $c_T$.
Intuitively: different rows are neither too parallel nor totally orthogonal. A generalization of the restricted isometry property from compressed sensing [Candès & Tao, 2005]. 36/51

37 Let $A \in \mathbb{R}^{n \times n}$ be a symmetric matrix that is $\Lambda$-rank homogeneous up to order $K$ and has an MMF factorization $A = U_1^\top \cdots U_L^\top \, H \, U_L \cdots U_1$. Assume $\psi^\ell_m$ is a wavelet in this factorization arising from row $i$ of $A_{\ell-1}$, supported on a set $S$ of size $K' \le K$, and that $\|H_{i,:}\|_2 \le \epsilon$. Then if $f : [n] \to \mathbb{R}$ is $(c_H, 1/2)$-Hölder with respect to $d(i, j) = \langle A_{i,:}, A_{j,:} \rangle^{-1}$, then $|\langle f, \psi^\ell_m \rangle| \le c_T\, c_H\, c_\Lambda\, \epsilon^{1/2} K'$, with $c_\Lambda = 4/(1 - (1 - 2\Lambda)^2)$. 37/51

38

39 Frobenius norm error on the Zachary Karate Club graph (left) and a matrix of genetic relationship between 50 individuals from [Crossett, 2013] (right). 39/51

40 Frobenius norm error of the MMF and Nyström methods on a vs a (Kronecker product) matrix 40 40/51

41 Frobenius norm error of the MMF and Nyström methods on large network datasets 41 41/51

42
- MMF is a new type of matrix factorization mirroring multiresolution analysis; a generalization of rank.
- MMF exploits hierarchical structure, but does not enforce a single hierarchy.
- Empirical evidence suggests that MMF is a good model for real data.
- Finding MMF factorizations is a fundamentally local and parallelizable process; $O(n \log n)$ algorithms should be within reach.
- Once in MMF form, a range of matrix computations become faster.
- MMF has strong ties to: diffusion wavelets, treelets, multiscale SVD, structured matrices, algebraic multigrid, and fast multipole methods.
42/51
