Rank, Trace-Norm & Max-Norm


1 Rank, Trace-Norm & Max-Norm as measures of matrix complexity. Nati Srebro (University of Toronto), Adi Shraibman (Hebrew University).

2 Matrix Learning. [Figure: partially observed users × movies rating matrix Y, with "?" for unobserved entries, and a model matrix X fit to it.] Reconstructing a latent signal: gene expression (biological processes). Capturing structure in a corpus: documents, images, etc. (topics, etc.). Prediction: collaborative filtering of movie ratings. Fit the (partially) observed Y with X from a hypothesis class of matrices: Low Rank: { X : rank(X) ≤ k }; Low Trace-Norm: { X : ‖X‖_Σ ≤ B }; Low Max-Norm: { X : ‖X‖_max ≤ B }. Low rank corresponds to a low-dimensional factorization; low trace-norm and low max-norm correspond to low-norm factorizations (MMMF [NIPS 04]). In this talk: the three hypothesis classes (measures of matrix complexity); generalization error bounds (for predicting unobserved entries); relationships between the three measures / hypothesis classes.
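As a concrete illustration of fitting the observed entries of Y with a low-dimensional factorization, here is a minimal numpy sketch using squared loss and alternating least squares (this is not the talk's MMMF method, which uses a margin/hinge loss; the helper name and toy matrix are mine):

```python
# Minimal sketch: alternating least squares for X = U V^T on the observed
# entries of a partially observed matrix Y (squared loss, small ridge term).
import numpy as np

def fit_low_rank(Y, observed, k=2, sweeps=50, reg=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = rng.standard_normal((n, k))
    V = rng.standard_normal((m, k))
    for _ in range(sweeps):
        for i in range(n):                      # re-fit row i of U against its observed entries
            w = observed[i]
            A = V[w].T @ V[w] + reg * np.eye(k)
            U[i] = np.linalg.solve(A, V[w].T @ Y[i, w])
        for j in range(m):                      # re-fit row j of V symmetrically
            w = observed[:, j]
            A = U[w].T @ U[w] + reg * np.eye(k)
            V[j] = np.linalg.solve(A, U[w].T @ Y[w, j])
    return U @ V.T

# Toy users x movies matrix; NaN marks unobserved entries.
Y = np.array([[5., np.nan, 3.], [np.nan, 5., np.nan], [5., 2., np.nan]])
observed = ~np.isnan(Y)
print(np.round(fit_low_rank(np.nan_to_num(Y), observed, k=2), 2))
```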

3 Learning with Low Rank Matrices. X (rank ≤ k) = UV with U ∈ ℝ^{n×k} and V ∈ ℝ^{k×m}, so Low Rank: { X : rank(X) ≤ k } = { UV : U ∈ ℝ^{n×k}, V ∈ ℝ^{k×m} }. For binary target matrices, only the signs matter: A = sign(X). Dimensional complexity of a binary matrix A: dc(A) = min{ rank(X) : sign(X) = A }.

4 Learning with Low Rank Matrices: Geometric Interpretation. dc(A) = min{ rank(X) : sign(X) = A }. [Diagram: rows of U are data points; columns of V are hypotheses (linear separators).] What is the minimum dimension into which a concept class can be embedded as linear separators?

5 Max-Margin Matrix Factorization: bound the norms of U and V instead of their dimensionality. Low-norm factorization X = UV. Bound the norms uniformly: (max_i ‖U_i‖²)(max_j ‖V_j‖²) ≤ R², where U_i are the rows of U and V_j the columns of V. For each Y_ij ∈ ±1 require Y_ij X_ij ≥ Margin, with X_ij = ⟨U_i, V_j⟩.

6 Max-Margin Matrix Factorization: bound the norms of U and V instead of their dimensionality. Low-norm factorization X = UV (U has n rows, V has m columns). Bound the norms uniformly: (max_i ‖U_i‖²)(max_j ‖V_j‖²) ≤ R², giving ‖X‖_max = min_{X=UV} (max_i ‖U_i‖)(max_j ‖V_j‖). Bound the norms on average: (Σ_i ‖U_i‖²)(Σ_j ‖V_j‖²) ≤ nmR², giving ‖X‖_Σ = min_{X=UV} ‖U‖_Fro ‖V‖_Fro. Optimizing V given U: each column is an SVM. For each Y_ij ∈ ±1 require Y_ij X_ij ≥ 1, with X_ij = ⟨U_i, V_j⟩.
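The two norms can be read off directly from factorizations. A small numpy check of the inequalities implicit in these definitions (helper names are mine; the trace norm is computed exactly via singular values, while a given factorization only supplies upper bounds):

```python
# ||X||_Sigma = sum of singular values; any factorization X = U V^T gives
#   ||X||_Sigma <= ||U||_Fro * ||V||_Fro                (average row-norm bound)
#   ||X||_max   <= (max_i ||U_i||) * (max_j ||V_j||)    (uniform row-norm bound)
import numpy as np

def trace_norm(X):
    return np.linalg.svd(X, compute_uv=False).sum()

def factorization_bounds(U, V):
    tr_bound = np.linalg.norm(U) * np.linalg.norm(V)          # Frobenius norms
    max_bound = np.linalg.norm(U, axis=1).max() * np.linalg.norm(V, axis=1).max()
    return tr_bound, max_bound

rng = np.random.default_rng(0)
U, V = rng.standard_normal((5, 2)), rng.standard_normal((4, 2))
X = U @ V.T
print(trace_norm(X), factorization_bounds(U, V))
```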

7 Max-Margin Matrix Factorization: bound the norms of U and V instead of their dimensionality. Bound the norms uniformly: (max_i ‖U_i‖²)(max_j ‖V_j‖²) ≤ R²; ‖X‖_max = min_{X=UV} (max_i ‖U_i‖)(max_j ‖V_j‖); max-margin complexity mc(A) = min{ ‖X‖_max : A_ij X_ij ≥ 1 }. Bound the norms on average: (Σ_i ‖U_i‖²)(Σ_j ‖V_j‖²) ≤ nmR²; ‖X‖_Σ = min_{X=UV} ‖U‖_Fro ‖V‖_Fro; average-margin complexity ac(A) = min{ ‖X‖_Σ/√(nm) : A_ij X_ij ≥ 1 }. For each Y_ij ∈ ±1 require Y_ij X_ij ≥ 1, with X_ij = ⟨U_i, V_j⟩.

8 Three Measures of Matrix Complexity. Used for fitting observed data matrices; for matrices representing concept classes, they measure what embedding the class as linear classifiers requires. Bound the norms uniformly: (max_i ‖U_i‖²)(max_j ‖V_j‖²) ≤ R²; ‖X‖_max = min_{X=UV} (max_i ‖U_i‖)(max_j ‖V_j‖); mc(A) = min{ ‖X‖_max : A_ij X_ij ≥ 1 } is the 1/margin required for embedding as linear classifiers. Bound the norms on average: (Σ_i ‖U_i‖²)(Σ_j ‖V_j‖²) ≤ nmR²; ‖X‖_Σ = min_{X=UV} ‖U‖_Fro ‖V‖_Fro; ac(A) = min{ ‖X‖_Σ/√(nm) : A_ij X_ij ≥ 1 }. These norm-based measures are convex (semi-definite programming) [NIPS 04]. Bound the dimensionality of U, V: dc(A) = min{ rank(X) : A_ij X_ij > 0 } is the dimension required for embedding as linear classifiers; not convex.
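A hedged sketch of the convex trace-norm variant as an off-the-shelf optimization problem (this assumes the cvxpy package and a toy problem of my own; the talk's SDP formulation of MMMF is equivalent in spirit, but this is not the authors' code):

```python
# Trace-norm-regularized matrix completion with a hinge (margin) loss on the
# observed entries S of a +/-1 target Y -- one convex formulation of MMMF-style
# fitting.  Requires cvxpy (solved here with its default conic solver).
import cvxpy as cp
import numpy as np

Y = np.array([[ 1., -1.,  1.],
              [-1.,  1.,  1.],
              [ 1.,  1., -1.]])
S = np.array([[1., 1., 0.],
              [0., 1., 1.],
              [1., 0., 1.]])          # 1 = observed entry
C = 1.0                               # trade-off between norm and margin violations

X = cp.Variable(Y.shape)
hinge = cp.pos(1 - cp.multiply(Y, X))                         # penalize Y_ij X_ij < 1
objective = cp.Minimize(cp.normNuc(X) + C * cp.sum(cp.multiply(S, hinge)))
cp.Problem(objective).solve()
print(np.round(X.value, 2))
```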

9 Outline. Three measures of matrix complexity; generalization error bounds; relationships between the three measures.

10 Generalization Error Bounds. D(X;Y) = |{ ij : X_ij Y_ij ≤ 0 }| / nm (generalization error). Prior work assumes a low-rank structure (eigengap): asymptotic behavior [Azar+01]; sample complexity and query strategy [Drineas+02]. [Diagram: hypothesis X, target Y, observed set S.]

11 Generalization Error Bounds. D(X;Y) = |{ ij : X_ij Y_ij ≤ 0 }| / nm (generalization error); D_S(X;Y) = |{ ij ∈ S : X_ij Y_ij ≤ 0 }| / |S| (empirical error). Goal: Pr_S( ∀X: D(X;Y) < D_S(X;Y) + ε ) > 1-δ, where X is the hypothesis, Y is the unknown, assumption-free target (source distribution), and S is the random training set of observed entries.
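A literal toy implementation of the two error measures (S as a boolean mask over the entries; function names are mine):

```python
import numpy as np

def D(X, Y):
    """Generalization error: fraction of all entries with X_ij * Y_ij <= 0."""
    return np.mean(X * Y <= 0)

def D_S(X, Y, S):
    """Empirical error: fraction of observed entries (boolean mask S) with X_ij * Y_ij <= 0.
    The following slides use the margin version, (X * Y)[S] < 1, instead."""
    return np.mean((X * Y)[S] <= 0)
```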

12 Generalization Error Bounds: Learning with Low Rank Matrices. D(X;Y) = |{ ij : X_ij Y_ij ≤ 0 }| / nm (generalization error); D_S(X;Y) = |{ ij ∈ S : X_ij Y_ij < 1 }| / |S| (empirical error); goal: Pr_S( ∀X: D(X;Y) < D_S(X;Y) + ε ) > 1-δ for the class { X : rank(X) ≤ k }. Only signs matter, so this is equivalent to using the class { A ∈ ±1^{n×m} : A = sign(X), rank(X) ≤ k } = { A ∈ ±1^{n×m} : dc(A) ≤ k }. This is a finite class of size |{ A ∈ ±1^{n×m} : dc(A) ≤ k }| ≤ (8em/k)^{k(n+m)}, and a union bound yields a generalization bound with ε = √[ ( k(n+m) log(8em/k) + log(1/δ) ) / (2|S|) ] [S, Alon, Jaakkola 04]. [Ben-David, Eiron, Simon 01]: most concept classes have high dc(A), i.e. cannot be embedded as low-dimensional linear classifiers.

13 Generalization Error Bounds: Learning with Low Trace-Norm Matrices. (Error definitions and the PAC statement as before, now for the class { X : ‖X‖²_Σ ≤ nmR² }.) ‖X‖_Σ = min_{X=UV} ‖U‖_Fro ‖V‖_Fro = Σ (singular values of X): writing X = Σ_i s_i u_i v_i′ with Σ_i s_i = ‖X‖_Σ expresses X as a combination of outer products of norm-1 vectors, i.e. rank-1, norm-1 matrices. Hence { X : ‖X‖²_Σ ≤ nmR² } = √(nm) R · convex-hull( { uv′ : u ∈ ℝⁿ, v ∈ ℝᵐ, ‖u‖ = ‖v‖ = 1 } ). [Diagram: X = U S V with unit-norm columns of U and unit-norm rows of V.]

14 Generalization Error Bounds: Learning with Low Trace-Norm Matrices. (Definitions as before; { X : ‖X‖²_Σ ≤ nmR² } = √(nm) R · convex-hull( { uv′ : u ∈ ℝⁿ, v ∈ ℝᵐ, ‖u‖ = ‖v‖ = 1 } ).) Plan: the Rademacher complexity of { uv′ : u ∈ ℝⁿ, v ∈ ℝᵐ, ‖u‖ = ‖v‖ = 1 } gives the Rademacher complexity of { X : ‖X‖²_Σ ≤ nmR² } (convex hull and scaling), which gives generalization error bounds for { X : ‖X‖²_Σ ≤ nmR² }. The relevant quantity is E_S E_{random signs σ}[ sup_{X=uv′} Σ_{s∈S} σ_s X(s) ] / |S|.

15 Generalization Error Bounds: Learning with Low Trace-Norm Matrices. (Definitions as before.) E_S E_{random signs σ}[ sup_{X=uv′} Σ_{ij∈S} σ_ij X_ij ] / |S| = E[ ‖σ_S‖₂ ] / |S|, the expected spectral norm of the random sign matrix supported on S; bounding it with the [Seginer00] bound on the singular values of a random matrix gives ε = K R √[ ( (n+m) log n + log(1/δ) ) / |S| ].
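The key identity behind this slide is that the supremum over unit-norm rank-one matrices is a spectral norm of a random sign matrix. A Monte-Carlo sketch (toy sizes; names are mine):

```python
# Empirical Rademacher-complexity term for the class {u v^T : ||u|| = ||v|| = 1}:
#   sup over the class of sum_{ij in S} sigma_ij u_i v_j = ||sigma restricted to S||_2
# (the spectral norm), which is what the Seginer bound controls.  Scaling the
# class by sqrt(nm) R then gives the trace-norm ball's complexity.
import numpy as np

def rademacher_rank_one(S_mask, trials=200, seed=0):
    rng = np.random.default_rng(seed)
    S = S_mask.sum()
    total = 0.0
    for _ in range(trials):
        sigma = rng.choice([-1.0, 1.0], size=S_mask.shape) * S_mask
        total += np.linalg.norm(sigma, 2)      # largest singular value
    return total / (trials * S)

n = m = 60
S_mask = (np.random.default_rng(1).random((n, m)) < 0.3).astype(float)
print(rademacher_rank_one(S_mask))
```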

16 Generalization Error Bounds: Learning with Low Max-Norm Matrices. (Definitions as before.) The class is { X : ‖X‖_max ≤ R } = R · convex-hull( ??? ): we have conv( { uv′ : u ∈ ±1ⁿ, v ∈ ±1ᵐ } ) ⊆ { X : ‖X‖_max ≤ 1 } ⊆ 1.79 · conv( { uv′ : u ∈ ±1ⁿ, v ∈ ±1ᵐ } ) by Grothendieck's inequality. This gives ε = √[ ( 12 R²(n+m) + log(1/δ) ) / |S| ].

17 Generalization Error Bounds: Low Trace-Norm, Max-Norm or Rank. D(X;Y) = |{ ij : X_ij Y_ij ≤ 0 }| / nm (generalization error); D_S(X;Y) = |{ ij ∈ S : X_ij Y_ij < 1 }| / |S| (empirical error); Pr_S( ∀X: D(X;Y) < D_S(X;Y) + ε ) > 1-δ. For { X ∈ ℝ^{n×m} : rank(X) ≤ k }, with dc(A) = min{ rank(X) : A_ij X_ij > 0 }: ε = √[ ( k(n+m) log(8en/k) + log(1/δ) ) / (2|S|) ]. For { X ∈ ℝ^{n×m} : ‖X‖²_Σ/nm ≤ R² }, with ac²(A) = min{ ‖X‖²_Σ/nm : A_ij X_ij ≥ 1 }: ε = K R √[ ( (n+m) log n + log(1/δ) ) / |S| ]. For { X ∈ ℝ^{n×m} : ‖X‖²_max ≤ R² }, with mc²(A) = min{ ‖X‖²_max : A_ij X_ij ≥ 1 }: ε = √[ ( 12 R²(n+m) + log(1/δ) ) / |S| ].
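To get a feel for how the three bounds compare, a small script plugging toy numbers into the formulas as reconstructed above (the constant K is unspecified on the slide and set to 1 here purely for illustration; function names are mine):

```python
import numpy as np

def eps_rank(n, m, k, S, delta):
    return np.sqrt((k * (n + m) * np.log(8 * np.e * n / k) + np.log(1 / delta)) / (2 * S))

def eps_trace(n, m, R, S, delta, K=1.0):       # K: unspecified universal constant
    return K * R * np.sqrt(((n + m) * np.log(n) + np.log(1 / delta)) / S)

def eps_max(n, m, R, S, delta):
    return np.sqrt((12 * R**2 * (n + m) + np.log(1 / delta)) / S)

n = m = 1000; S = 200_000; delta = 0.01
print(eps_rank(n, m, k=10, S=S, delta=delta),
      eps_trace(n, m, R=2.0, S=S, delta=delta),
      eps_max(n, m, R=2.0, S=S, delta=delta))
```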

18 Relationship between average margin complexity, max margin complexity and dimensional complexity. ac²(A) (average margin) ≤ mc²(A) (maximum margin) ≤ dc(A) ≤ 10 mc²(A) log(3nm). The last inequality comes from randomly projecting a high-dimensional large-margin arrangement to obtain a low-dimensional arrangement [Arriaga, Vempala 99]. Reverse inequalities? Gaps?

19 Between the Rank and the Max-Norm. Low margin complexity ⇒ low dimensional complexity: dc(A) ≤ 10 mc²(A) log(3nm). Counting: ½(k-1)n log(m) ≤ log |{ A ∈ ±1^{n×m} : dc(A) ≤ k }| ≤ k(n+m) log(8en/k), and R² n log(n/R) ≤ log |{ A ∈ ±1^{n×m} : mc(A) ≤ R }| ≤ 10 R²(n+m) log(3nm) log(n/R). Low dimensional complexity ⇒ low margin complexity? No: there exist A with ( dc(A) log(n) )^p ≤ mc²(A). Examples: A_ij = Parity(bits(i) & bits(j)); mc²(T_n) = Θ(log n) while dc(T_n) = 2 [Ben-David, Eiron, Simon 01]; mc²(H_n) = n while n^0.5 ≤ dc(H_n) < n^0.8 [Forster et al 02, 03]; Kronecker exponent of triangular matrices.
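The T_n example cited here is easy to verify numerically: the "greater-than" sign matrix has dimensional complexity 2 even though it has full rank as a ±1 matrix. A small sketch (the witness X_ij = i - j + ½ is one standard choice, not necessarily the one used in the reference):

```python
import numpy as np

n = 8
i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
T = np.where(i >= j, 1, -1)          # triangular +/-1 matrix T_n
X = i - j + 0.5                      # X = i*1' + 1*(0.5 - j)': sum of two rank-one matrices

assert np.linalg.matrix_rank(X) == 2          # so dc(T_n) <= 2
assert np.all(np.sign(X) == T)                # and X is sign-consistent with T_n
print("rank of T_n itself:", np.linalg.matrix_rank(T))   # full rank n
```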

20 Between the Rank and the Max-Norm. Low margin complexity ⇒ low dimensional complexity: dc(A) ≤ 10 mc²(A) log(3nm); ½(k-1)n log(m) ≤ log |{ A ∈ ±1^{n×m} : dc(A) ≤ k }| ≤ k(n+m) log(8en/k); R² n log(n/R) ≤ log |{ A ∈ ±1^{n×m} : mc(A) ≤ R }| ≤ 10 R²(n+m) log(3nm) log(n/R). Low dimensional complexity ⇒ low margin complexity: no (previous slide). Low max-norm ⇒ a similar matrix with low rank: rank(X′) ≤ (9/ε²) ‖X‖²_max log(3nm) for some X′ with ‖X′ - X‖_∞ < ε; in particular mc(A) ≤ M implies dc(A) ≤ k = 10 M² log(3nm). Open: does low dc(A) imply low mc²(A′) for some A′ ≈ A? Does low rank(X) imply low ‖X′‖²_max for some X′ with ‖X′ - X‖_∞ ≤ 1?
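The low-max-norm ⇒ low-rank step can be sketched with a Gaussian random projection of the two factors (toy sizes and names of mine; the slide's constants are not reproduced, only the qualitative behaviour):

```python
# Given X = U V^T with bounded row norms (so bounded ||X||_max), projecting U
# and V to d dimensions gives X' of rank <= d with entrywise error that shrinks
# like sqrt(log(nm)/d), matching rank ~ (||X||_max / eps)^2 * log(nm).
import numpy as np

rng = np.random.default_rng(0)
n, m, D, d = 60, 60, 1000, 200
U = rng.standard_normal((n, D)) / np.sqrt(D)      # row norms ~ 1
V = rng.standard_normal((m, D)) / np.sqrt(D)
X = U @ V.T

P = rng.standard_normal((D, d)) / np.sqrt(d)      # random projection; E[(UP)(VP)^T] = X
X_prime = (U @ P) @ (V @ P).T

max_norm_bound = np.linalg.norm(U, axis=1).max() * np.linalg.norm(V, axis=1).max()
print("rank(X') <=", d,
      " ||X||_max <=", round(max_norm_bound, 2),
      " ||X' - X||_inf =", round(np.abs(X_prime - X).max(), 3))
```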

21 Between the Trace-Norm and Max-Norm. ac²(A) ≤ mc²(A). [Diagram: an n×n sign matrix containing an arbitrary ("anything") r×r block.] For such a matrix, ac²(A) < 2r³/n² + 2.

22 Between the Trace-Norm and Max-Norm. ac²(A) ≤ mc²(A); for an n×n sign matrix with an arbitrary r×r block, ac²(A) < 2r³/n² + 2. For r = n^{2/3} we can get ac²(A) < 4 while mc²(A) > n^{2/3} and dc²(A) > n^{1/3}.

23 Random Sampling Assumption for the Generalization Error Bounds. Here D(X;Y) = Pr_ij( X_ij Y_ij ≤ 0 ) and D_S(X;Y) = |{ ij ∈ S : X_ij Y_ij < 1 }| / |S|, with Pr_S( ∀X: D(X;Y) < D_S(X;Y) + ε ) > 1-δ; Y is unknown and assumption-free, S is random. The trace-norm class { X ∈ ℝ^{n×m} : ‖X‖²_Σ/nm ≤ R² }, with ε = K R √[ ( (n+m) log n + log(1/δ) ) / |S| ], requires uniform sampling of the entries. The max-norm class { X ∈ ℝ^{n×m} : ‖X‖²_max ≤ R² }, with ε = √[ ( 12 R²(n+m) + log(1/δ) ) / |S| ], and the rank class { X ∈ ℝ^{n×m} : rank(X) ≤ k }, with ε = √[ ( k(n+m) log(8en/k) + log(1/δ) ) / (2|S|) ], apply to any distribution over index pairs ij (over which entries are observed).

24 Between the Trace-Norm and Max-Norm. ac²(A) ≤ mc²(A); for an n×n sign matrix with an arbitrary r×r block, ac²(A) < 2r³/n² + 2. For r = n^{2/3} we can get ac²(A) < 4 while mc²(A) > n^{2/3} and dc²(A) > n^{1/3}. This is the largest gap possible: mc²(A) ≤ 9 ( ac²(A) · n² )^{1/3}. Using similar techniques: (R^{4/3} - 2) n^{4/3} ≤ log |{ A ∈ ±1^{n×m} : ac(A) ≤ R }| ≤ 7 R^{2/3} n^{5/3} log(3nm) log(n/R²).

25 Rank, Max-Norm and Trace-Norm as complexity measures for fitting a data matrix. rank(X) = min{ dim(U,V) : X = UV } and dc(A) = min{ rank(X) : A_ij X_ij > 0 }; ‖X‖_max = min_{X=UV} (max_i ‖U_i‖)(max_j ‖V_j‖) and mc(A) = min{ ‖X‖_max : A_ij X_ij ≥ 1 }; ‖X‖_Σ = min_{X=UV} ‖U‖_Fro ‖V‖_Fro and ac(A) = min{ ‖X‖_Σ/√(nm) : A_ij X_ij ≥ 1 }. (dc(A) and mc(A) were previously studied as embeddability of a concept class.) The generalization error bounds scale with k = rank(X), R² = ‖X‖²_max, or R² = ‖X‖²_Σ/nm. Relationships: dc(A) ≤ 10 mc²(A) log(3nm), but there exist A with ( dc(A) log(n) )^p < ac²(A) ≤ mc²(A); ac²(A) ≤ mc²(A) ≤ 9 ( ac²(A) · n² )^{1/3}, and this is tight. Open issues: does low dc(A) imply low mc²(A′) for some A′ ≈ A? Does low rank(X) imply low ‖X′‖²_max for some X′ with ‖X′ - X‖_∞ ≤ 1? Is dc(A) = O(mc²(A))? Also: tighten the estimate of |{ A ∈ ±1^{n×m} : mc(A) ≤ R }| (median mc(A)?) and tighten the polynomial gap in the bounds on log |{ A ∈ ±1^{n×m} : ac(A) ≤ R }|.
