A Robust Form of Kruskal's Identifiability Theorem
Aditya Bhaskara (Google NYC). Joint work with Moses Charikar (Princeton University) and Aravindan Vijayaraghavan (NYU / Northwestern).
Background: understanding data

Data comes from many sources. QUESTION ("simple explanation"): can we think of the data as being generated from a model with a small number of parameters? This approach has been very successful, e.g. topic models for documents, hidden Markov models for speech, ...
Topic model for documents

We have a large collection of documents. Caricature model: each document is about a single topic, and each topic is a distribution over n words. E.g., a "sports" topic and a "wildlife" topic each assign probabilities to the words aardvark, apple, ball, lion, ..., zebra.

Model parameters:
- probability that a document is on topic r: w_r (these sum to 1)
- word-probability vector for topic r: p_r ∈ R^n
Topic model for documents

Generating a document, say with ℓ words: first pick a topic according to the weights (w_1, ..., w_R); then draw each of the ℓ words independently from the chosen topic's word distribution p_r.

ASSUMPTION: picking a document from the corpus at random and then sampling ℓ of its words is distributed the same as generating an ℓ-word document as above.

Goal: find the parameters {w_r, p_r}.
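The generative process above can be simulated in a few lines. This is a toy sketch; the vocabulary, topic weights, and word distributions are all made up for illustration (none of these numbers come from the talk):

```python
# Toy simulation of the topic-model generative process (all parameters invented).
import numpy as np

rng = np.random.default_rng(0)
vocab = ["aardvark", "apple", "ball", "lion", "zebra"]
w = np.array([0.6, 0.4])                        # topic weights w_r, sum to 1
P = np.array([[0.05, 0.40, 0.45, 0.05, 0.05],   # p_1: "sports" word distribution
              [0.30, 0.05, 0.05, 0.30, 0.30]])  # p_2: "wildlife" word distribution

def sample_document(num_words: int) -> list[str]:
    topic = rng.choice(len(w), p=w)             # pick topic r with probability w_r
    words = rng.choice(len(vocab), size=num_words, p=P[topic])
    return [vocab[i] for i in words]

print(sample_document(5))
```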
What about tensors?

Experiment: pick three random words from a random document. QUESTION: what is Pr[word 1 = i, word 2 = j, word 3 = k]?

It is precisely equal to Σ_{r=1}^R w_r p_r(i) p_r(j) p_r(k). That is, the tensor T of these probabilities satisfies T = Σ_{r=1}^R w_r (p_r ⊗ p_r ⊗ p_r). Finding the parameters is thus a tensor decomposition problem!
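The identity above is easy to check numerically. The following sketch builds the third-moment tensor for a small random model (the sizes and the use of Dirichlet-sampled parameters are my own choices for illustration):

```python
# Build T(i,j,k) = sum_r w_r p_r(i) p_r(j) p_r(k) for a toy topic model.
import numpy as np

rng = np.random.default_rng(0)
n, R = 5, 3                          # vocabulary size, number of topics
w = rng.dirichlet(np.ones(R))        # topic weights, sum to 1
P = rng.dirichlet(np.ones(n), R).T   # n x R; column r is the word distribution p_r

# einsum contracts the shared topic index r across the three copies of P
T = np.einsum('r,ir,jr,kr->ijk', w, P, P, P)

# Sanity checks: T is a probability distribution over word triples (i, j, k)
assert np.isclose(T.sum(), 1.0)
assert np.isclose(T[1, 2, 3],
                  sum(w[r] * P[1, r] * P[2, r] * P[3, r] for r in range(R)))
```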
Recipe for tensor methods in mixture models

From the algebraic statistics literature [Allman, Matias, Rhodes 09] and others:
1. Estimate a tensor whose decomposition allows reading off the model parameters.
2. Use tensor decomposition.

Many applications: mixtures of Gaussians, hidden Markov models, communities, crabs, ...
Are we done?
Two caveats

Efficiency: we need polynomial-time algorithms for decomposition! Given T = Σ_{r=1}^R p_r ⊗ p_r ⊗ p_r, can we find {p_r}_{r=1}^R?

Robustness: the algorithm needs to work with a noisy tensor. In applications we estimate the tensor from samples; N samples typically give error about 1/√N per entry.

GOAL: given a target accuracy ε and T = Σ_{r=1}^R p_r ⊗ p_r ⊗ p_r + N, with ‖N‖ < ε/poly(n), recover {p_r} up to error ε.
A success story: the full-rank case

Given T = Σ_{r=1}^R p_r ⊗ p_r ⊗ p_r + N with ‖N‖ < ε/poly(n), define P to be the n × R matrix with columns {p_r}.

Theorem. If σ_R(P) > 1/poly(n), then we can find {p_r} up to error ε in polynomial time.

Discovered many times: [Harshman 72], [Leurgans et al. 93], [Chang 96], [Anandkumar et al. 10], [Goyal et al. 13], ...
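One classical route to this theorem is simultaneous diagonalization (often credited to Jennrich, and appearing in the works cited above). A non-robust sketch for the square, noiseless case R = n with generic components might look like this; it is an illustration of the idea, not the robust algorithm:

```python
# Sketch of simultaneous diagonalization for T = sum_r p_r ⊗ p_r ⊗ p_r,
# assuming R = n and generic (hence full-rank) components.
import numpy as np

rng = np.random.default_rng(1)
n = R = 4
A = rng.standard_normal((n, R))            # columns are the components p_r
T = np.einsum('ir,jr,kr->ijk', A, A, A)    # T = sum_r p_r ⊗ p_r ⊗ p_r

# Contract the third mode with random vectors x, y:
#   M_x = A D_x A^T with D_x = diag(<p_r, x>), and similarly M_y.
x, y = rng.standard_normal(n), rng.standard_normal(n)
Mx = np.einsum('ijk,k->ij', T, x)
My = np.einsum('ijk,k->ij', T, y)

# M_x M_y^{-1} = A diag(<p_r,x>/<p_r,y>) A^{-1}, so its eigenvectors are the
# columns of A (up to scaling) and its eigenvalues are the ratios <p_r,x>/<p_r,y>.
vals, vecs = np.linalg.eig(Mx @ np.linalg.inv(My))
ratios = (A.T @ x) / (A.T @ y)
assert np.allclose(np.sort(vals.real), np.sort(ratios), atol=1e-6)
```

The full-rank theorem is the statement that a (more careful) version of this procedure tolerates 1/poly(n) noise.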
Beyond non-degeneracy

Question: can we approximately recover the parameters under weaker conditions? What about the case R > n?

This talk: the easier problem of identifiability.

Theorem (informal). If the Kruskal rank condition holds, then it is possible to recover the decomposition up to error ε.

Notes: this is not efficient; it is an open problem to do it efficiently. It can be done efficiently when the components are nearly orthogonal [Anandkumar et al. 14].
Kruskal's theorem (1977)

T = Σ_{r=1}^R a_r ⊗ b_r ⊗ c_r. GOAL: find conditions under which the decomposition is unique.

Kruskal rank: for an n × R matrix A, the Kruskal rank is the largest integer k(A) such that every k(A) columns of A are linearly independent. Note: this is much stronger than the usual notion of rank; it is reminiscent of the restricted isometry property.

Theorem (Kruskal). Suppose T is as above, and A, B, C are the n × R matrices with columns a_r, b_r, c_r respectively. Then a sufficient condition for uniqueness of the decomposition is k(A) + k(B) + k(C) ≥ 2R + 2.
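Computing the Kruskal rank is expensive in general, but a brute-force sketch (exponential in R; for illustration only) makes the definition concrete and shows how it differs from the usual rank:

```python
# Brute-force Kruskal rank: check all k-subsets of columns, from largest k down.
from itertools import combinations
import numpy as np

def kruskal_rank(A: np.ndarray) -> int:
    """Largest k such that every set of k columns of A is linearly independent."""
    n, R = A.shape
    for k in range(min(n, R), 0, -1):
        if all(np.linalg.matrix_rank(A[:, list(S)]) == k
               for S in combinations(range(R), k)):
            return k
    return 0

# Example: columns 1, 2, 4 are linearly dependent, so k(A) < 3.
A = np.array([[1., 0., 0., 1.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
print(kruskal_rank(A))  # -> 2 (while the usual rank of A is 3)
```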
Our result

T = Σ_{r=1}^R a_r ⊗ b_r ⊗ c_r + N. GOAL: find conditions under which the decomposition is robustly unique.

Robust Kruskal rank: for an n × R matrix A and a parameter τ > 0, the robust Kruskal rank is the largest integer k_τ(A) such that every n × k_τ(A) column-submatrix of A has condition number < τ. (Recall that the condition number of B is σ_max(B)/σ_min(B).)

Theorem (rough). Let T be as above, and A, B, C be n × R matrices as before. Then the decomposition is robustly unique if k_τ(A) + k_τ(B) + k_τ(C) ≥ 2R + 2.
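The robust definition can be checked the same brute-force way, replacing the linear-independence test with a condition-number test (again exponential in R; for illustration only):

```python
# Brute-force robust Kruskal rank, following the condition-number definition.
from itertools import combinations
import numpy as np

def robust_kruskal_rank(A: np.ndarray, tau: float) -> int:
    """Largest k such that every n x k column-submatrix has condition number < tau."""
    n, R = A.shape
    for k in range(min(n, R), 0, -1):
        ok = True
        for S in combinations(range(R), k):
            s = np.linalg.svd(A[:, list(S)], compute_uv=False)
            if s[-1] == 0 or s[0] / s[-1] >= tau:
                ok = False
                break
        if ok:
            return k
    return 0

# Columns 2 and 3 are linearly independent but nearly parallel, so they count
# toward the usual Kruskal rank yet not toward the robust one.
A = np.array([[1., 0., 1e-3],
              [0., 1., 1.0]])
print(robust_kruskal_rank(A, tau=10.0))  # -> 1
```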
Our result

[Figure: two nearby rank-R tensors T and T'.] We show: for any ε > 0, there is an ε' = ε/poly(n) such that ‖T − T'‖ < ε' implies the decompositions are ε-close, up to a permutation.

T = Σ_{r=1}^R a_r ⊗ b_r ⊗ c_r;  T' = Σ_{r=1}^R a'_r ⊗ b'_r ⊗ c'_r.

Theorem (formal). Suppose T, T' are defined as above, and A, B, C, A', B', C' are n × R matrices. Further, suppose for some τ > 0 that k_τ(A) + k_τ(B) + k_τ(C) ≥ 2R + 2. Then for any ε > 0, there is an ε' = ε/poly(n, τ) such that if ‖T − T'‖ < ε', then there exist diagonal matrices Γ_A, Γ_B, Γ_C and a permutation Π such that Γ_A Γ_B Γ_C = I, and A' ≈_ε Γ_A Π A, B' ≈_ε Γ_B Π B, C' ≈_ε Γ_C Π C.
Remarks

- The conditions are only about k_τ(A), k_τ(B), k_τ(C); nothing is needed about A', B', C' (except an upper bound on the column lengths).
- Implies that no tensor close to T has rank < R.
- Naturally generalizes to higher-order tensors (similar to De Lathauwer's extension of Kruskal's theorem).
- Main difficulty in the proof: handling 1/poly(n) noise. (If ε' = ε/exp(n), it is much easier to show that the decompositions are ε-close.)
Proof overview

Suppose A, B, C, A', B', C' are n × R with column lengths ≤ ρ, and k_τ(A) = k_τ(B) = k_τ(C) = n; R = 4n/3. (I.e., any n columns of A, B, C are well conditioned, and thus linearly independent.)

1. Show that A' is a scaled permutation of A, B' of B, and C' of C.
2. Show that the three permutations are equal, and the scalings multiply to the identity.

Crux: the first part, the permutation lemma.
Permutation lemma

Sufficient conditions for A, A' having the same columns (up to permutation). (Suppose for now that no two columns of A are parallel.)

Key idea: look at the spaces spanned by subsets of the columns of A, A'. Definition: A_S := span of the columns of A indexed by S ⊆ [R].

Informal: suppose that for every S of size (n − 1), A_S contains at least |S| columns of A'. Then the same is true for all S. (If true for |S| = 1, then every column of A' is parallel to some column of A.)
Downward induction: base case |S| = (n − 1)

[Figure: the subspace A_S, with T = a_1 ⊗ b_1 ⊗ c_1 + a_2 ⊗ b_2 ⊗ c_2 + ... alongside T' = a'_1 ⊗ b'_1 ⊗ c'_1 + a'_2 ⊗ b'_2 ⊗ c'_2 + ...]

For any w, Σ_{r=1}^{4n/3} ⟨w, a_r⟩ (b_r ⊗ c_r) ≈ Σ_{r=1}^{4n/3} ⟨w, a'_r⟩ (b'_r ⊗ c'_r).

1. Pick w to be orthogonal to A_S.
2. Only n/3 + 1 terms survive on the left-hand side ⇒ at most this number can survive on the right-hand side (by the Kruskal condition).
3. This implies that at least (n − 1) of the columns {a'_r} belong to A_S.
Downward induction

Show for |S| = (n − 2), ..., 1:
- start with some S of size (n − 2)
- define S_i = S ∪ {i}, for the various i ∉ S (there are (n/3) + 2 such i)
- let T_i be the set of indices of columns of A' that lie in A_{S_i}; |T_i| ≥ (n − 1)

Main idea: any column indexed by T_i ∩ T_j must be contained in A_S.
1. Say we have a' = Σ_{r∈S} α_r a_r + α_i a_i = Σ_{r∈S} β_r a_r + β_j a_j.
2. Since a_i, a_j are not parallel, this forces α_i = β_j = 0.

Thus if T is the set of columns contained in A_S, then for any i ≠ j, T_i ∩ T_j = T. Thus the sets T_i form a sunflower family.
Downward induction (counting)

Recall: if T is the set of columns contained in A_S, then for any i ≠ j, T_i ∩ T_j = T, so the sets T_i form a sunflower family.

Claim: this implies that the core T has size ≥ (n − 2).

Proof: a basic counting argument. We know |T_i| ≥ n − 1 for all i. Say |T| = n − 3; then |T_i \ T| ≥ 2 for all i. These "petals" do not intersect for different i (sunflower!). Thus the number of columns of A' is ≥ (n − 3) + 2(n/3 + 2) > 4n/3, a contradiction.
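The sunflower property used here (every pairwise intersection equals a common core) is easy to test on small examples. A toy checker, with sets invented for illustration:

```python
# Check whether a family of sets is a sunflower: all pairwise intersections
# must equal the common core (the intersection of the whole family).
from itertools import combinations

def sunflower_core(sets):
    """Return the common core if the sets form a sunflower, else None."""
    if not sets:
        return set()
    core = set.intersection(*sets)
    for X, Y in combinations(sets, 2):
        if X & Y != core:
            return None
    return core

petals = [{1, 2, 10, 11}, {1, 2, 20, 21}, {1, 2, 30, 31}]
print(sunflower_core(petals))                             # -> {1, 2}
print(sunflower_core([{1, 2, 3}, {1, 2, 4}, {2, 3, 4}]))  # -> None
```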
Making things robust: the permutation lemma

Sufficient conditions for A, A' having approximately the same columns (up to permutation).

Key idea: look at the spaces spanned by subsets of the columns of A, A'. Informal: suppose that for every S of size (n − 1), A_S contains at least |S| columns of A', up to error ε. Then the same is true for all S, with error ε'. (If true for singletons, then every column of A' is ε'-close to a scaling of a column of A.)
Issue with the inductive proof

Suppose we have the claim for |S| = (n − 1) with error ε; then we can prove the claim for |S| = (n − 2) only with error ε·n. Continuing this way, we get the claim for |S| = 1 only with error ε·exp(n), which means the initial error would have to be exponentially small.

The fix:
- a more combinatorial invariant (a different definition of the sets T)
- a stronger sunflower argument
- a better base case
Directions

- Are there polynomial-time algorithms under Kruskal's conditions? What about other tensor-based algorithms in the case R > n?
  (a) Higher-order tensors: [Cardoso 92], next talk (also [Goyal et al. 13])
  (b) Recent work assuming incoherence [Anandkumar et al. 14]
- Can we show robust uniqueness under more algebraic conditions?
- Can we show that a generic n × n × n tensor has a stable unique decomposition up to rank n²/4?
Thank you! Questions?
1 9/5 Matrices, vectors, and their applications Algebra: study of objects and operations on them. Linear algebra: object: matrices and vectors. operations: addition, multiplication etc. Algorithms/Geometric
More informationMTH 309 Supplemental Lecture Notes Based on Robert Messer, Linear Algebra Gateway to Mathematics
MTH 309 Supplemental Lecture Notes Based on Robert Messer, Linear Algebra Gateway to Mathematics Ulrich Meierfrankenfeld Department of Mathematics Michigan State University East Lansing MI 48824 meier@math.msu.edu
More informationRepresentations of Sp(6,R) and SU(3) carried by homogeneous polynomials
Representations of Sp(6,R) and SU(3) carried by homogeneous polynomials Govindan Rangarajan a) Department of Mathematics and Centre for Theoretical Studies, Indian Institute of Science, Bangalore 560 012,
More informationA Simple Proof of Fiedler's Conjecture Concerning Orthogonal Matrices
University of Wyoming Wyoming Scholars Repository Mathematics Faculty Publications Mathematics 9-1-1997 A Simple Proof of Fiedler's Conjecture Concerning Orthogonal Matrices Bryan L. Shader University
More informationSingular Value Decomposition
Singular Value Decomposition (Com S 477/577 Notes Yan-Bin Jia Sep, 7 Introduction Now comes a highlight of linear algebra. Any real m n matrix can be factored as A = UΣV T where U is an m m orthogonal
More informationDuke University, Department of Electrical and Computer Engineering Optimization for Scientists and Engineers c Alex Bronstein, 2014
Duke University, Department of Electrical and Computer Engineering Optimization for Scientists and Engineers c Alex Bronstein, 2014 Linear Algebra A Brief Reminder Purpose. The purpose of this document
More informationsublinear time low-rank approximation of positive semidefinite matrices Cameron Musco (MIT) and David P. Woodru (CMU)
sublinear time low-rank approximation of positive semidefinite matrices Cameron Musco (MIT) and David P. Woodru (CMU) 0 overview Our Contributions: 1 overview Our Contributions: A near optimal low-rank
More information12x + 18y = 30? ax + by = m
Math 2201, Further Linear Algebra: a practical summary. February, 2009 There are just a few themes that were covered in the course. I. Algebra of integers and polynomials. II. Structure theory of one endomorphism.
More informationQuivers of Period 2. Mariya Sardarli Max Wimberley Heyi Zhu. November 26, 2014
Quivers of Period 2 Mariya Sardarli Max Wimberley Heyi Zhu ovember 26, 2014 Abstract A quiver with vertices labeled from 1,..., n is said to have period 2 if the quiver obtained by mutating at 1 and then
More informationCS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works
CS68: The Modern Algorithmic Toolbox Lecture #8: How PCA Works Tim Roughgarden & Gregory Valiant April 20, 206 Introduction Last lecture introduced the idea of principal components analysis (PCA). The
More informationMore on Bracket Algebra
7 More on Bracket Algebra Algebra is generous; she often gives more than is asked of her. D Alembert The last chapter demonstrated, that determinants (and in particular multihomogeneous bracket polynomials)
More informationThe Cyclic Decomposition of a Nilpotent Operator
The Cyclic Decomposition of a Nilpotent Operator 1 Introduction. J.H. Shapiro Suppose T is a linear transformation on a vector space V. Recall Exercise #3 of Chapter 8 of our text, which we restate here
More informationVII Selected Topics. 28 Matrix Operations
VII Selected Topics Matrix Operations Linear Programming Number Theoretic Algorithms Polynomials and the FFT Approximation Algorithms 28 Matrix Operations We focus on how to multiply matrices and solve
More informationChapter 2. Error Correcting Codes. 2.1 Basic Notions
Chapter 2 Error Correcting Codes The identification number schemes we discussed in the previous chapter give us the ability to determine if an error has been made in recording or transmitting information.
More information1. Introduction to commutative rings and fields
1. Introduction to commutative rings and fields Very informally speaking, a commutative ring is a set in which we can add, subtract and multiply elements so that the usual laws hold. A field is a commutative
More informationSingular Value Decomposition
Chapter 5 Singular Value Decomposition We now reach an important Chapter in this course concerned with the Singular Value Decomposition of a matrix A. SVD, as it is commonly referred to, is one of the
More informationMath 205, Summer I, Week 3a (continued): Chapter 4, Sections 5 and 6. Week 3b. Chapter 4, [Sections 7], 8 and 9
Math 205, Summer I, 2016 Week 3a (continued): Chapter 4, Sections 5 and 6. Week 3b Chapter 4, [Sections 7], 8 and 9 4.5 Linear Dependence, Linear Independence 4.6 Bases and Dimension 4.7 Change of Basis,
More information1 Review Session. 1.1 Lecture 2
1 Review Session Note: The following lists give an overview of the material that was covered in the lectures and sections. Your TF will go through these lists. If anything is unclear or you have questions
More informationc 1 v 1 + c 2 v 2 = 0 c 1 λ 1 v 1 + c 2 λ 1 v 2 = 0
LECTURE LECTURE 2 0. Distinct eigenvalues I haven t gotten around to stating the following important theorem: Theorem: A matrix with n distinct eigenvalues is diagonalizable. Proof (Sketch) Suppose n =
More informationTangent spaces, normals and extrema
Chapter 3 Tangent spaces, normals and extrema If S is a surface in 3-space, with a point a S where S looks smooth, i.e., without any fold or cusp or self-crossing, we can intuitively define the tangent
More informationLecture 1: Systems of linear equations and their solutions
Lecture 1: Systems of linear equations and their solutions Course overview Topics to be covered this semester: Systems of linear equations and Gaussian elimination: Solving linear equations and applications
More informationStochastic Realization of Binary Exchangeable Processes
Stochastic Realization of Binary Exchangeable Processes Lorenzo Finesso and Cecilia Prosdocimi Abstract A discrete time stochastic process is called exchangeable if its n-dimensional distributions are,
More informationFirst we introduce the sets that are going to serve as the generalizations of the scalars.
Contents 1 Fields...................................... 2 2 Vector spaces.................................. 4 3 Matrices..................................... 7 4 Linear systems and matrices..........................
More informationLecture 18: The Rank of a Matrix and Consistency of Linear Systems
Lecture 18: The Rank of a Matrix and Consistency of Linear Systems Winfried Just Department of Mathematics, Ohio University February 28, 218 Review: The linear span Definition Let { v 1, v 2,..., v n }
More informationBALANCING GAUSSIAN VECTORS. 1. Introduction
BALANCING GAUSSIAN VECTORS KEVIN P. COSTELLO Abstract. Let x 1,... x n be independent normally distributed vectors on R d. We determine the distribution function of the minimum norm of the 2 n vectors
More informationTopics in linear algebra
Chapter 6 Topics in linear algebra 6.1 Change of basis I want to remind you of one of the basic ideas in linear algebra: change of basis. Let F be a field, V and W be finite dimensional vector spaces over
More informationA NOTE ON THE JORDAN CANONICAL FORM
A NOTE ON THE JORDAN CANONICAL FORM H. Azad Department of Mathematics and Statistics King Fahd University of Petroleum & Minerals Dhahran, Saudi Arabia hassanaz@kfupm.edu.sa Abstract A proof of the Jordan
More informationMin-Rank Conjecture for Log-Depth Circuits
Min-Rank Conjecture for Log-Depth Circuits Stasys Jukna a,,1, Georg Schnitger b,1 a Institute of Mathematics and Computer Science, Akademijos 4, LT-80663 Vilnius, Lithuania b University of Frankfurt, Institut
More informationMATH 51H Section 4. October 16, Recall what it means for a function between metric spaces to be continuous:
MATH 51H Section 4 October 16, 2015 1 Continuity Recall what it means for a function between metric spaces to be continuous: Definition. Let (X, d X ), (Y, d Y ) be metric spaces. A function f : X Y is
More informationBy allowing randomization in the verification process, we obtain a class known as MA.
Lecture 2 Tel Aviv University, Spring 2006 Quantum Computation Witness-preserving Amplification of QMA Lecturer: Oded Regev Scribe: N. Aharon In the previous class, we have defined the class QMA, which
More informationAn algebraic perspective on integer sparse recovery
An algebraic perspective on integer sparse recovery Lenny Fukshansky Claremont McKenna College (joint work with Deanna Needell and Benny Sudakov) Combinatorics Seminar USC October 31, 2018 From Wikipedia:
More informationConditions for Robust Principal Component Analysis
Rose-Hulman Undergraduate Mathematics Journal Volume 12 Issue 2 Article 9 Conditions for Robust Principal Component Analysis Michael Hornstein Stanford University, mdhornstein@gmail.com Follow this and
More informationSemidefinite and Second Order Cone Programming Seminar Fall 2001 Lecture 5
Semidefinite and Second Order Cone Programming Seminar Fall 2001 Lecture 5 Instructor: Farid Alizadeh Scribe: Anton Riabov 10/08/2001 1 Overview We continue studying the maximum eigenvalue SDP, and generalize
More informationIntroduction How it works Theory behind Compressed Sensing. Compressed Sensing. Huichao Xue. CS3750 Fall 2011
Compressed Sensing Huichao Xue CS3750 Fall 2011 Table of Contents Introduction From News Reports Abstract Definition How it works A review of L 1 norm The Algorithm Backgrounds for underdetermined linear
More informationMODEL ANSWERS TO THE FIRST HOMEWORK
MODEL ANSWERS TO THE FIRST HOMEWORK 1. Chapter 4, 1: 2. Suppose that F is a field and that a and b are in F. Suppose that a b = 0, and that b 0. Let c be the inverse of b. Multiplying the equation above
More informationAlgebraic Methods in Combinatorics
Algebraic Methods in Combinatorics Po-Shen Loh 27 June 2008 1 Warm-up 1. (A result of Bourbaki on finite geometries, from Răzvan) Let X be a finite set, and let F be a family of distinct proper subsets
More informationDeterminants of Partition Matrices
journal of number theory 56, 283297 (1996) article no. 0018 Determinants of Partition Matrices Georg Martin Reinhart Wellesley College Communicated by A. Hildebrand Received February 14, 1994; revised
More informationIDEALS DEFINING UNIONS OF MATRIX SCHUBERT VARIETIES
IDEALS DEFINING UNIONS OF MATRIX SCHUBERT VARIETIES A. S. BERTIGER Abstract. This note computes a Gröbner basis for the ideal defining a union of matrix Schubert varieties. Moreover, the theorem presented
More information12. Perturbed Matrices
MAT334 : Applied Linear Algebra Mike Newman, winter 208 2. Perturbed Matrices motivation We want to solve a system Ax = b in a context where A and b are not known exactly. There might be experimental errors,
More informationLecture 4: Proof of Shannon s theorem and an explicit code
CSE 533: Error-Correcting Codes (Autumn 006 Lecture 4: Proof of Shannon s theorem and an explicit code October 11, 006 Lecturer: Venkatesan Guruswami Scribe: Atri Rudra 1 Overview Last lecture we stated
More informationCentral Groupoids, Central Digraphs, and Zero-One Matrices A Satisfying A 2 = J
Central Groupoids, Central Digraphs, and Zero-One Matrices A Satisfying A 2 = J Frank Curtis, John Drew, Chi-Kwong Li, and Daniel Pragel September 25, 2003 Abstract We study central groupoids, central
More informationMODEL ANSWERS TO THE FIRST QUIZ. 1. (18pts) (i) Give the definition of a m n matrix. A m n matrix with entries in a field F is a function
MODEL ANSWERS TO THE FIRST QUIZ 1. (18pts) (i) Give the definition of a m n matrix. A m n matrix with entries in a field F is a function A: I J F, where I is the set of integers between 1 and m and J is
More informationRepresentations of moderate growth Paul Garrett 1. Constructing norms on groups
(December 31, 2004) Representations of moderate growth Paul Garrett Representations of reductive real Lie groups on Banach spaces, and on the smooth vectors in Banach space representations,
More informationACI-matrices all of whose completions have the same rank
ACI-matrices all of whose completions have the same rank Zejun Huang, Xingzhi Zhan Department of Mathematics East China Normal University Shanghai 200241, China Abstract We characterize the ACI-matrices
More information0.2 Vector spaces. J.A.Beachy 1
J.A.Beachy 1 0.2 Vector spaces I m going to begin this section at a rather basic level, giving the definitions of a field and of a vector space in much that same detail as you would have met them in a
More informationseries. Utilize the methods of calculus to solve applied problems that require computational or algebraic techniques..
1 Use computational techniques and algebraic skills essential for success in an academic, personal, or workplace setting. (Computational and Algebraic Skills) MAT 203 MAT 204 MAT 205 MAT 206 Calculus I
More informationLecture 1 and 2: Random Spanning Trees
Recent Advances in Approximation Algorithms Spring 2015 Lecture 1 and 2: Random Spanning Trees Lecturer: Shayan Oveis Gharan March 31st Disclaimer: These notes have not been subjected to the usual scrutiny
More informationSAMPLE OF THE STUDY MATERIAL PART OF CHAPTER 1 Introduction to Linear Algebra
1.1. Introduction SAMPLE OF THE STUDY MATERIAL PART OF CHAPTER 1 Introduction to Linear algebra is a specific branch of mathematics dealing with the study of vectors, vector spaces with functions that
More informationSupport weight enumerators and coset weight distributions of isodual codes
Support weight enumerators and coset weight distributions of isodual codes Olgica Milenkovic Department of Electrical and Computer Engineering University of Colorado, Boulder March 31, 2003 Abstract In
More informationVolodymyr Kuleshov ú Arun Tejasvi Chaganty ú Percy Liang. May 11, 2015
Tensor Factorization via Matrix Factorization Volodymyr Kuleshov ú Arun Tejasvi Chaganty ú Percy Liang Stanford University May 11, 2015 Kuleshov, Chaganty, Liang (Stanford University) Tensor Factorization
More information