Large Scale Matrix Analysis and Inference

Size: px

Start display at page:

Download "Large Scale Matrix Analysis and Inference"

Roderick Singleton
5 years ago
Views:

1 Large Scale Matrix Analysis and Inference Wouter M. Koolen - Manfred Warmuth Reza Bosagh Zadeh - Gunnar Carlsson - Michael Mahoney Dec 9, NIPS / 32

2 Introductory musing What is a matrix? a i,j 1 A vector of n 2 parameters 2 A covariance 3 A generalized probability distribution / 32

3 1. A vector of n 2 parameters When you regularize with the squared Frobenius norm min W W 2 F + n loss(tr(wx n )) 3 / 32

4 1. A vector of n 2 parameters When you regularize with the squared Frobenius norm min W W 2 F + n loss(tr(wx n )) Equivalent to min vec(w) vec(w) n loss(vec(w) vec(x n )) No structure: n 2 independent variables 4 / 32

5 2. A covariance View the symmetric positive definite matrix C as a covariance matrix of some random feature vector c R n, i.e. C = E ((c E(c))(c E(c)) ) n features plus their pairwise interactions 5 / 32

6 Symmetric matrices as ellipses Ellipse = {Cu : u 2 = 1} Dotted lines connect point u on unit ball with point Cu on ellipse 6 / 32

7 Symmetric matrices as ellipses Eigenvectors form axes Eigenvalues are lengths 7 / 32

8 Dyads uu, where u unit vector One eigenvalue one All others zero Rank one projection matrix 8 / 32

9 Directional variance along direction u V(c u) = u Cu = tr(c uu ) 0 The outer figure eight is direction u times the variance u C u PCA: find direction of largest variance 9 / 32

10 3 dimensional variance plots tr(c uu ) is generalized probability when tr(c) = 1 10 / 32

11 3. Generalized probability distributions Probability vector ω = (.2,.1.,.6,.1) = i ω i }{{} mixture coefficients Density matrix W = i ω i }{{} mixture coefficients e i }{{} pure events w i w i }{{} pure density matrices 11 / 32

12 3. Generalized probability distributions Probability vector ω = (.2,.1.,.6,.1) = i ω i }{{} mixture coefficients Density matrix W = i ω i }{{} mixture coefficients Matrices as generalized distributions e i }{{} pure events w i w i }{{} pure density matrices 12 / 32

13 3. Generalized probability distributions Probability vector ω = (.2,.1.,.6,.1) = i ω i }{{} mixture coefficients Density matrix W = i ω i }{{} mixture coefficients Matrices as generalized distributions Many mixtures lead to same density matrix e i }{{} pure events w i w i }{{} pure density matrices There always exists a decomposition into n eigendyads Density matrix: Symmetric positive matrix of trace one 13 / 32

14 It s like a probability! Total variance along orthogonal set of directions is 1 u 1 Wu 1 +u 2 Wu 2 = 1 a + b + c = 1 14 / 32

15 Uniform density? 1 n I All dyads have generalized probability 1 n tr( 1 n I uu ) = 1 n tr(uu ) = 1 n Generalized probabilities of n orthogonal dyads sum to 1 15 / 32

16 Conventional Bayes Rule P(M i y) = P(M i)p(y M i ) P(y) 4 updates with the same data likelihood Update maintains uncertainty information about maximum likelihood Soft max 16 / 32

17 Conventional Bayes Rule P(M i y) = P(M i)p(y M i ) P(y) 4 updates with the same data likelihood Update maintains uncertainty information about maximum likelihood Soft max 17 / 32

18 Conventional Bayes Rule P(M i y) = P(M i)p(y M i ) P(y) 4 updates with the same data likelihood Update maintains uncertainty information about maximum likelihood Soft max 18 / 32

19 Conventional Bayes Rule P(M i y) = P(M i)p(y M i ) P(y) 4 updates with the same data likelihood Update maintains uncertainty information about maximum likelihood Soft max 19 / 32

20 Bayes Rule for density matrices D(M y) = exp (log D(M) + log D(y M)) tr (above matrix) 1 update with data likelyhood matrix D(y M) Update maintains uncertainty information about maximum eigenvalue Soft max eigenvalue calculation 20 / 32

21 Bayes Rule for density matrices D(M y) = exp (log D(M) + log D(y M)) tr (above matrix) 2 updates with same data likelyhood matrix D(y M) Update maintains uncertainty information about maximum eigenvalue Soft max eigenvalue calculation 21 / 32

22 Bayes Rule for density matrices D(M y) = exp (log D(M) + log D(y M)) tr (above matrix) 3 updates with same data likelyhood matrix D(y M) Update maintains uncertainty information about maximum eigenvalue Soft max eigenvalue calculation 22 / 32

23 Bayes Rule for density matrices D(M y) = exp (log D(M) + log D(y M)) tr (above matrix) 4 updates with same data likelyhood matrix D(y M) Update maintains uncertainty information about maximum eigenvalue Soft max eigenvalue calculation 23 / 32

24 Bayes Rule for density matrices D(M y) = exp (log D(M) + log D(y M)) tr (above matrix) 10 updates with same data likelyhood matrix D(y M) Update maintains uncertainty information about maximum eigenvalue Soft max eigenvalue calculation 24 / 32

25 Bayes Rule for density matrices D(M y) = exp (log D(M) + log D(y M)) tr (above matrix) 20 updates with same data likelyhood matrix D(y M) Update maintains uncertainty information about maximum eigenvalue Soft max eigenvalue calculation 25 / 32

26 Bayes rules vector matrix Bayes rule P(M i y)= P(M i ) P(y M i ) j P(M j ) P(y M j ) D(M y) = D(M) D(y M) tr(d(m) D(y M) A B := exp(log A + log B) 26 / 32

27 Bayes rules vector matrix Bayes rule P(M i y)= P(M i ) P(y M i ) j P(M j ) P(y M j ) D(M y) = D(M) D(y M) tr(d(m) D(y M) A B := exp(log A + log B) Regularizer Entropy Quantum Entropy 27 / 32

28 Vector case as special case of matrix case Vectors as diagonal matrices All matrices same eigensystem Fancy becomes Often the hardest problem ie bounds for the vector case lift to the matrix case 28 / 32

29 Vector case as special case of matrix case Vectors as diagonal matrices All matrices same eigensystem Fancy becomes Often the hardest problem ie bounds for the vector case lift to the matrix case This phenomenon has been dubbed the free matrix lunch Size of matrix = size of vector = n 29 / 32

30 PCA setup Data vectors C = n x nx n max u C u unit u }{{} not convex in u = max tr(cuu ) dyad uu }{{} linear in uu Corresponding vector problem max e i c e i }{{} linear in e i Vector problem is matrix problem when everything happens in the same eigensystem Uncertainty over unit: probability vector Uncertainty over dyads: density matrix Uncertainty over k-sets of units: capped probability vector Uncertainty over rank k projection matrices: capped density matrix 30 / 32

31 For PCA Solve the vector problem first Do all bounds Lift to matrix case: essentially replace by Regret bounds stay the same Free Matrix Lunch 31 / 32

32 Questions When can you lift vector case to matrix case? When is there a free matrix lunch? Lifting matrices to tensors? Efficient algorithms for large matrices? Approximations of Avoid eigenvalue decomposition by sampling / 32

On-line Variance Minimization

On-line Variance Minimization Manfred Warmuth Dima Kuzmin University of California - Santa Cruz 19th Annual Conference on Learning Theory M. Warmuth, D. Kuzmin (UCSC) On-line Variance Minimization COLT06