TENSOR DECOMPOSITIONS AND THEIR APPLICATIONS


1 TENSOR DECOMPOSITIONS AND THEIR APPLICATIONS ANKUR MOITRA MASSACHUSETTS INSTITUTE OF TECHNOLOGY

2 SPEARMAN'S HYPOTHESIS Charles Spearman (1904): There are two types of intelligence, eductive and reproductive

3 SPEARMAN'S HYPOTHESIS Charles Spearman (1904): There are two types of intelligence, eductive and reproductive. eductive (adj): the ability to make sense out of complexity; reproductive (adj): the ability to store and reproduce information

4 SPEARMAN'S HYPOTHESIS Charles Spearman (1904): There are two types of intelligence, eductive and reproductive. To test this theory, he invented Factor Analysis: M (students (1000) × tests (10)) = A B^T. eductive (adj): the ability to make sense out of complexity; reproductive (adj): the ability to store and reproduce information

5 SPEARMAN'S HYPOTHESIS Charles Spearman (1904): There are two types of intelligence, eductive and reproductive. To test this theory, he invented Factor Analysis: M (students (1000) × tests (10)) = A B^T, with inner dimension (2). eductive (adj): the ability to make sense out of complexity; reproductive (adj): the ability to store and reproduce information

6 Given: M = Σ_i a_i b_i^T = A B^T (the correct factors)

7 Given: M = Σ_i a_i b_i^T = A B^T (the correct factors). When can we recover the factors a_i and b_i uniquely?

8 Given: M = Σ_i a_i b_i^T = A B^T = (AR)(R^{-1} B^T) (the correct factors vs. an alternative factorization). When can we recover the factors a_i and b_i uniquely?

9 Given: M = Σ_i a_i b_i^T = A B^T = (AR)(R^{-1} B^T) (the correct factors vs. an alternative factorization). When can we recover the factors a_i and b_i uniquely? Claim: The factors {a_i} and {b_i} are not determined uniquely unless we impose additional conditions on them

10 Given: M = Σ_i a_i b_i^T = A B^T = (AR)(R^{-1} B^T) (the correct factors vs. an alternative factorization). When can we recover the factors a_i and b_i uniquely? Claim: The factors {a_i} and {b_i} are not determined uniquely unless we impose additional conditions on them, e.g. if {a_i} and {b_i} are orthogonal, or rank(M) = 1

11 Given: M = Σ_i a_i b_i^T = A B^T = (AR)(R^{-1} B^T) (the correct factors vs. an alternative factorization). When can we recover the factors a_i and b_i uniquely? Claim: The factors {a_i} and {b_i} are not determined uniquely unless we impose additional conditions on them, e.g. if {a_i} and {b_i} are orthogonal, or rank(M) = 1. This is called the rotation problem; it is a major issue in factor analysis and motivates the study of tensor methods
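
The rotation problem is easy to see numerically. Below is a minimal sketch in Python (assuming numpy is available; the dimensions and the particular rotation R are made up for illustration): any invertible R turns one factorization of M into a different one with exactly the same product.

    import numpy as np

    # The rotation problem: (A, B) and (A R, B R^{-T}) give exactly the same M = A B^T,
    # so the factors cannot be identified from M alone. Sizes here are illustrative.
    rng = np.random.default_rng(0)
    n, r = 10, 2
    A = rng.standard_normal((n, r))
    B = rng.standard_normal((n, r))
    M = A @ B.T

    theta = 0.7                                  # any invertible R works; here a rotation
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    A_alt = A @ R
    B_alt = B @ np.linalg.inv(R).T               # chosen so that A_alt @ B_alt.T == M

    print(np.allclose(M, A_alt @ B_alt.T))       # True: an alternative factorization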

12 OUTLINE The focus of this tutorial is on Algorithms/Applications/Models for tensor decompositions. Part I: Algorithms (The Rotation Problem; Jennrich's Algorithm). Part II: Applications (Phylogenetic Reconstruction; Pure Topic Models). Part III: Smoothed Analysis (Overcomplete Problems; Kruskal Rank and the Khatri-Rao Product)

13 MATRIX DECOMPOSITIONS M = a_1 b_1^T + a_2 b_2^T + ... + a_R b_R^T

14 MATRIX DECOMPOSITIONS M = a_1 b_1^T + a_2 b_2^T + ... + a_R b_R^T; TENSOR DECOMPOSITIONS T = a_1 ⊗ b_1 ⊗ c_1 + ... + a_R ⊗ b_R ⊗ c_R, where the (i, j, k) entry of x ⊗ y ⊗ z is x(i) y(j) z(k)
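
To make the definition concrete, here is a small sketch (Python with numpy; the sizes are arbitrary) that builds T = Σ_r a_r ⊗ b_r ⊗ c_r and spot-checks the (i, j, k)-entry formula above.

    import numpy as np

    # Build T = sum_r a_r ⊗ b_r ⊗ c_r and check T[i, j, k] = sum_r a_r(i) b_r(j) c_r(k).
    rng = np.random.default_rng(1)
    n, R = 5, 3
    A = rng.standard_normal((n, R))              # column r is a_r
    B = rng.standard_normal((n, R))
    C = rng.standard_normal((n, R))

    T = np.einsum('ir,jr,kr->ijk', A, B, C)

    i, j, k = 0, 1, 2
    assert np.isclose(T[i, j, k], sum(A[i, r] * B[j, r] * C[k, r] for r in range(R)))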

15 When are tensor decompositions unique?

16 When are tensor decompositions unique? Theorem [Jennrich 1970]: Suppose {a_i} and {b_i} are linearly independent and no pair of vectors in {c_i} is a scalar multiple of each other

17 When are tensor decompositions unique? Theorem [Jennrich 1970]: Suppose {a_i} and {b_i} are linearly independent and no pair of vectors in {c_i} is a scalar multiple of each other. Then T = a_1 ⊗ b_1 ⊗ c_1 + ... + a_R ⊗ b_R ⊗ c_R is unique up to permuting the rank-one terms and rescaling the factors.

18 When are tensor decompositions unique? Theorem [Jennrich 1970]: Suppose {a_i} and {b_i} are linearly independent and no pair of vectors in {c_i} is a scalar multiple of each other. Then T = a_1 ⊗ b_1 ⊗ c_1 + ... + a_R ⊗ b_R ⊗ c_R is unique up to permuting the rank-one terms and rescaling the factors. Equivalently, the rank-one factors are unique

19 When are tensor decompositions unique? Theorem [Jennrich 1970]: Suppose {a_i} and {b_i} are linearly independent and no pair of vectors in {c_i} is a scalar multiple of each other. Then T = a_1 ⊗ b_1 ⊗ c_1 + ... + a_R ⊗ b_R ⊗ c_R is unique up to permuting the rank-one terms and rescaling the factors. Equivalently, the rank-one factors are unique. There is a simple algorithm to compute the factors too!

20 JENNRICH'S ALGORITHM Compute T(·, ·, x), i.e. add up the matrix slices weighted by x: Σ_i x_i T_i

21 JENNRICH'S ALGORITHM Compute T(·, ·, x), i.e. add up the matrix slices weighted by x: Σ_i x_i T_i. If T = a ⊗ b ⊗ c then T(·, ·, x) = ⟨c, x⟩ a ⊗ b

22 JENNRICH'S ALGORITHM Compute T(·, ·, x), i.e. add up the matrix slices weighted by x: Σ_i x_i T_i

23 JENNRICH'S ALGORITHM Compute T(·, ·, x) = Σ_i ⟨c_i, x⟩ a_i ⊗ b_i, i.e. add up the matrix slices weighted by x: Σ_i x_i T_i

24 JENNRICH'S ALGORITHM Compute T(·, ·, x) = Σ_i ⟨c_i, x⟩ a_i ⊗ b_i, i.e. add up the matrix slices weighted by x: Σ_i x_i T_i (x is chosen uniformly at random from S^{n-1})

25 JENNRICH'S ALGORITHM Compute T(·, ·, x) = A D_x B^T where D_x = diag(⟨c_i, x⟩), i.e. add up the matrix slices weighted by x: Σ_i x_i T_i (x is chosen uniformly at random from S^{n-1})

26 JENNRICH'S ALGORITHM Compute T(·, ·, x) = A D_x B^T

27 JENNRICH'S ALGORITHM Compute T(·, ·, x) = A D_x B^T. Compute T(·, ·, y) = A D_y B^T

28 JENNRICH'S ALGORITHM Compute T(·, ·, x) = A D_x B^T. Compute T(·, ·, y) = A D_y B^T. Diagonalize T(·, ·, x) T(·, ·, y)^{-1}

29 JENNRICH'S ALGORITHM Compute T(·, ·, x) = A D_x B^T. Compute T(·, ·, y) = A D_y B^T. Diagonalize T(·, ·, x) T(·, ·, y)^{-1} = A D_x B^T (B^T)^{-1} D_y^{-1} A^{-1}

30 JENNRICH'S ALGORITHM Compute T(·, ·, x) = A D_x B^T. Compute T(·, ·, y) = A D_y B^T. Diagonalize T(·, ·, x) T(·, ·, y)^{-1} = A D_x D_y^{-1} A^{-1}

31 JENNRICH'S ALGORITHM Compute T(·, ·, x) = A D_x B^T. Compute T(·, ·, y) = A D_y B^T. Diagonalize T(·, ·, x) T(·, ·, y)^{-1} = A D_x D_y^{-1} A^{-1}. Claim: whp (over x, y) the eigenvalues are distinct, so the eigendecomposition is unique and recovers the a_i's

32 JENNRICH'S ALGORITHM Compute T(·, ·, x) = A D_x B^T. Compute T(·, ·, y) = A D_y B^T. Diagonalize T(·, ·, x) T(·, ·, y)^{-1}

33 JENNRICH'S ALGORITHM Compute T(·, ·, x) = A D_x B^T. Compute T(·, ·, y) = A D_y B^T. Diagonalize T(·, ·, x) T(·, ·, y)^{-1}. Diagonalize T(·, ·, y) T(·, ·, x)^{-1}

34 JENNRICH'S ALGORITHM Compute T(·, ·, x) = A D_x B^T. Compute T(·, ·, y) = A D_y B^T. Diagonalize T(·, ·, x) T(·, ·, y)^{-1}. Diagonalize T(·, ·, y) T(·, ·, x)^{-1}. Match up the factors (their eigenvalues are reciprocals) and find {c_i} by solving a linear system.
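
Putting the steps above together, here is a minimal sketch of Jennrich's algorithm in Python (assuming numpy; the test tensor, dimensions, and variable names are illustrative). One small deviation from the slides: the second eigendecomposition is applied to the transposed slice products so that its eigenvectors give the b_i directly; the pairing still uses the fact that the eigenvalues on the two sides are reciprocals of each other.

    import numpy as np

    rng = np.random.default_rng(2)
    n, R = 8, 5
    A, B, C = (rng.standard_normal((n, R)) for _ in range(3))
    T = np.einsum('ir,jr,kr->ijk', A, B, C)          # T = sum_r a_r ⊗ b_r ⊗ c_r

    x, y = rng.standard_normal(n), rng.standard_normal(n)
    Tx = np.einsum('ijk,k->ij', T, x)                # = A diag(<c_r, x>) B^T
    Ty = np.einsum('ijk,k->ij', T, y)                # = A diag(<c_r, y>) B^T

    # Diagonalize Tx Ty^+ (eigenvectors ~ a_r) and Ty^T (Tx^T)^+ (eigenvectors ~ b_r).
    lam, U = np.linalg.eig(Tx @ np.linalg.pinv(Ty))
    mu, V = np.linalg.eig(Ty.T @ np.linalg.pinv(Tx.T))

    # Keep the R eigenpairs with the largest eigenvalues and pair up the two sides:
    # matching eigenvalues are (approximately) reciprocals of each other.
    ia = np.argsort(-np.abs(lam))[:R]
    ib = np.argsort(-np.abs(mu))[:R]
    A_hat = np.real(U[:, ia])
    order = [min(ib, key=lambda j: abs(mu[j] - 1.0 / lam[i])) for i in ia]
    B_hat = np.real(V[:, order])

    # Recover the c_r by a least-squares solve of T = sum_r a_r ⊗ b_r ⊗ c_r.
    KR = np.einsum('ir,jr->ijr', A_hat, B_hat).reshape(n * n, R)
    C_hat = np.linalg.lstsq(KR, T.reshape(n * n, n), rcond=None)[0].T

    T_hat = np.einsum('ir,jr,kr->ijk', A_hat, B_hat, C_hat)
    print(np.allclose(T, T_hat))                     # True whp over the choice of x, y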

35 Given: M = Σ_i a_i b_i^T. When can we recover the factors a_i and b_i uniquely? This is only possible if {a_i} and {b_i} are orthonormal, or rank(M) = 1

36 Given: M = Σ_i a_i b_i^T. When can we recover the factors a_i and b_i uniquely? This is only possible if {a_i} and {b_i} are orthonormal, or rank(M) = 1. Given: T = Σ_i a_i ⊗ b_i ⊗ c_i. When can we recover the factors a_i, b_i and c_i uniquely?

37 Given: M = Σ_i a_i b_i^T. When can we recover the factors a_i and b_i uniquely? This is only possible if {a_i} and {b_i} are orthonormal, or rank(M) = 1. Given: T = Σ_i a_i ⊗ b_i ⊗ c_i. When can we recover the factors a_i, b_i and c_i uniquely? Jennrich: If {a_i} and {b_i} are full rank and no pair in {c_i} are scalar multiples of each other

38 OUTLINE The focus of this tutorial is on Algorithms/Applications/Models for tensor decompositions. Part I: Algorithms (The Rotation Problem; Jennrich's Algorithm). Part II: Applications (Phylogenetic Reconstruction; Pure Topic Models). Part III: Smoothed Analysis (Overcomplete Problems; Kruskal Rank and the Khatri-Rao Product)

39 PHYLOGENETIC RECONSTRUCTION (tree figure: internal nodes x, y, z are extinct, leaves a, b, c, d are extant) Tree of Life

40 PHYLOGENETIC RECONSTRUCTION (tree figure: internal nodes x, y, z are extinct, leaves a, b, c, d are extant)

41 PHYLOGENETIC RECONSTRUCTION root: π : Σ → R_+, the initial distribution (Σ = alphabet; internal nodes extinct, leaves extant)

42 PHYLOGENETIC RECONSTRUCTION root: π : Σ → R_+, the initial distribution; each edge, e.g. (z, b), carries a conditional distribution R_{z,b} (Σ = alphabet; internal nodes extinct, leaves extant)

43 PHYLOGENETIC RECONSTRUCTION root: π : Σ → R_+, the initial distribution; each edge carries a conditional distribution R_{z,b}, etc. (Σ = alphabet). In each sample, we observe a symbol (from Σ) at each extant node, where we sample from π for the root and propagate it using R_{x,y}, etc.

44 HIDDEN MARKOV MODELS (figure: hidden chain x → y → z with observed symbols a, b, c)

45 HIDDEN MARKOV MODELS π : Σ_s → R_+, the initial distribution (hidden chain x → y → z with observed symbols a, b, c)

46 HIDDEN MARKOV MODELS π : Σ_s → R_+, the initial distribution; observation matrices O_{x,a}, etc.; transition matrices R_{x,y}, etc. (hidden chain x → y → z with observed symbols a, b, c)

47 HIDDEN MARKOV MODELS π : Σ_s → R_+, the initial distribution; observation matrices O_{x,a}, etc.; transition matrices R_{x,y}, etc. In each sample, we observe a symbol (from Σ_o) at each observed node, where we sample from π (over Σ_s) for the start and propagate it using R_{x,y}, etc.
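
As a sketch of the generative process just described (Python with numpy; the alphabet sizes, chain length, and matrix names are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(3)
    S, O, steps = 4, 6, 3                        # hidden states, output symbols, chain length
    pi = rng.dirichlet(np.ones(S))               # initial distribution over hidden states
    R = rng.dirichlet(np.ones(S), size=S)        # transitions: R[s, s'] = Pr[next = s' | cur = s]
    Obs = rng.dirichlet(np.ones(O), size=S)      # observations: Obs[s, o] = Pr[obs = o | state = s]

    def sample():
        # Propagate the hidden state along the chain, emitting one symbol per step.
        obs, state = [], rng.choice(S, p=pi)
        for _ in range(steps):
            obs.append(int(rng.choice(O, p=Obs[state])))
            state = rng.choice(S, p=R[state])
        return obs

    print(sample())                              # one random sample of observed symbols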

48 Question: Can we reconstruct just the topology from random samples?

49 Question: Can we reconstruct just the topology from random samples? Usually, we assume the transition matrices (R_{x,y}, etc.) are full rank so that we can re-root the tree arbitrarily

50 Question: Can we reconstruct just the topology from random samples? Usually, we assume the transition matrices (R_{x,y}, etc.) are full rank so that we can re-root the tree arbitrarily. [Steel, 1994]: The following is a distance function on the edges: d_{x,y} = −ln det(P_{x,y}) + ½ Σ_{σ ∈ Σ} ln π_{x,σ} + ½ Σ_{σ ∈ Σ} ln π_{y,σ}, where P_{x,y} is the joint distribution

51 Question: Can we reconstruct just the topology from random samples? Usually, we assume the transition matrices (R_{x,y}, etc.) are full rank so that we can re-root the tree arbitrarily. [Steel, 1994]: The following is a distance function on the edges: d_{x,y} = −ln det(P_{x,y}) + ½ Σ_{σ ∈ Σ} ln π_{x,σ} + ½ Σ_{σ ∈ Σ} ln π_{y,σ}, where P_{x,y} is the joint distribution, and the distance between leaves is the sum of the distances on the path in the tree

52 Question: Can we reconstruct just the topology from random samples? Usually, we assume the transition matrices (R_{x,y}, etc.) are full rank so that we can re-root the tree arbitrarily. [Steel, 1994]: The following is a distance function on the edges: d_{x,y} = −ln det(P_{x,y}) + ½ Σ_{σ ∈ Σ} ln π_{x,σ} + ½ Σ_{σ ∈ Σ} ln π_{y,σ}, where P_{x,y} is the joint distribution, and the distance between leaves is the sum of the distances on the path in the tree (It's not even obvious it's nonnegative!)
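
A small sketch of Steel's distance as written above (Python with numpy; here P_xy is the |Σ| × |Σ| joint distribution of the symbols at x and y, and its row and column sums play the role of the marginals π_x and π_y):

    import numpy as np

    def steel_distance(P_xy):
        # P_xy: joint distribution of the symbols at nodes x and y (|Σ| x |Σ|).
        pi_x = P_xy.sum(axis=1)                  # marginal distribution at x
        pi_y = P_xy.sum(axis=0)                  # marginal distribution at y
        return (-np.log(abs(np.linalg.det(P_xy)))
                + 0.5 * np.log(pi_x).sum()
                + 0.5 * np.log(pi_y).sum())

    P = np.array([[0.4, 0.1],
                  [0.1, 0.4]])                   # toy 2-symbol joint distribution
    print(steel_distance(P))                     # ≈ 0.51 for this toy example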

53 Question: Can we reconstruct just the topology from random samples? Usually, we assume the transition matrices (R_{x,y}, etc.) are full rank so that we can re-root the tree arbitrarily

54 Question: Can we reconstruct just the topology from random samples? Usually, we assume the transition matrices (R_{x,y}, etc.) are full rank so that we can re-root the tree arbitrarily. [Erdős, Steel, Székely, Warnow, 1997]: Used Steel's distance function and quartet tests (choosing among the three possible topologies on each quartet of leaves a, b, c, d) to reconstruct the topology

55 Question: Can we reconstruct just the topology from random samples? Usually, we assume the transition matrices (R_{x,y}, etc.) are full rank so that we can re-root the tree arbitrarily. [Erdős, Steel, Székely, Warnow, 1997]: Used Steel's distance function and quartet tests (choosing among the three possible topologies on each quartet of leaves a, b, c, d) to reconstruct the topology, from polynomially many samples

56 Question: Can we reconstruct just the topology from random samples? Usually, we assume the transition matrices (R_{x,y}, etc.) are full rank so that we can re-root the tree arbitrarily. [Erdős, Steel, Székely, Warnow, 1997]: Used Steel's distance function and quartet tests (choosing among the three possible topologies on each quartet of leaves a, b, c, d) to reconstruct the topology, from polynomially many samples. For many problems (e.g. HMMs) finding the transition matrices is the main issue

57 [Chang, 1996]: The model is identifiable (if the R's are full rank)

58 [Chang, 1996]: The model is identifiable (if the R's are full rank) (tree figure with edge matrix R_{z,b})

59 [Chang, 1996]: The model is identifiable (if the R's are full rank) (tree figure with edge matrix R_{z,b})

60 [Chang, 1996]: The model is identifiable (if the R's are full rank) (figure: zoom in on a hidden node z with three observed leaves a, b, c)

61 [Chang, 1996]: The model is identifiable (if the R's are full rank) (figure: zoom in on a hidden node z with three observed leaves a, b, c). Joint distribution over (a, b, c): Σ_σ Pr[z = σ] Pr[a | z = σ] ⊗ Pr[b | z = σ] ⊗ Pr[c | z = σ]

62 [Chang, 1996]: The model is identifiable (if the R's are full rank) (figure: zoom in on a hidden node z with three observed leaves a, b, c). Joint distribution over (a, b, c): Σ_σ Pr[z = σ] Pr[a | z = σ] ⊗ Pr[b | z = σ] ⊗ Pr[c | z = σ], where the conditional distributions Pr[b | z = σ] are the columns of R_{z,b}
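
The rank-one structure of this joint distribution is easy to write down explicitly. A small sketch (Python with numpy; the alphabet sizes and the random conditional distributions are illustrative):

    import numpy as np

    rng = np.random.default_rng(4)
    S, O = 4, 6                                  # hidden alphabet size, observed alphabet size
    p_z = rng.dirichlet(np.ones(S))              # Pr[z = sigma]
    P_a = rng.dirichlet(np.ones(O), size=S)      # P_a[sigma] = Pr[a | z = sigma]
    P_b = rng.dirichlet(np.ones(O), size=S)
    P_c = rng.dirichlet(np.ones(O), size=S)

    # T[i, j, k] = sum_sigma Pr[z=sigma] Pr[a=i|z=sigma] Pr[b=j|z=sigma] Pr[c=k|z=sigma]
    T = np.einsum('s,si,sj,sk->ijk', p_z, P_a, P_b, P_c)
    assert np.isclose(T.sum(), 1.0)              # it is a probability distribution over (a, b, c)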

63 [Mossel, Roch, 2006]: There is an algorithm to PAC learn a phylogenetic tree or an HMM (if its transition/output matrices are full rank) from polynomially many samples

64 [Mossel, Roch, 2006]: There is an algorithm to PAC learn a phylogenetic tree or an HMM (if its transition/output matrices are full rank) from polynomially many samples. Question: Is the full-rank assumption necessary?

65 [Mossel, Roch, 2006]: There is an algorithm to PAC learn a phylogenetic tree or an HMM (if its transition/output matrices are full rank) from polynomially many samples. Question: Is the full-rank assumption necessary? [Mossel, Roch, 2006]: It is as hard as noisy-parity to learn the parameters of a general HMM

66 [Mossel, Roch, 2006]: There is an algorithm to PAC learn a phylogenetic tree or an HMM (if its transition/output matrices are full rank) from polynomially many samples. Question: Is the full-rank assumption necessary? [Mossel, Roch, 2006]: It is as hard as noisy-parity to learn the parameters of a general HMM. Noisy-parity is an infamous problem in learning, where O(n) samples suffice but the best algorithms run in time 2^{n/log(n)}, due to [Blum, Kalai, Wasserman, 2003]

67 [Mossel, Roch, 2006]: There is an algorithm to PAC learn a phylogenetic tree or an HMM (if its transition/output matrices are full rank) from polynomially many samples. Question: Is the full-rank assumption necessary? [Mossel, Roch, 2006]: It is as hard as noisy-parity to learn the parameters of a general HMM. Noisy-parity is an infamous problem in learning, where O(n) samples suffice but the best algorithms run in time 2^{n/log(n)}, due to [Blum, Kalai, Wasserman, 2003]. (It's now used as a hard problem to build cryptosystems!)

68 THE POWER OF CONDITIONAL INDEPENDENCE [Phylogenetic Trees/HMMs]: (joint distribution on leaves a, b, c) Σ_σ Pr[z = σ] Pr[a | z = σ] ⊗ Pr[b | z = σ] ⊗ Pr[c | z = σ]

69 PURE TOPIC MODELS (figure: topic matrix A, words (m) × topics (r)) Each topic is a distribution on words

70 PURE TOPIC MODELS (figure: topic matrix A, words (m) × topics (r)) Each topic is a distribution on words. Each document is about only one topic (stochastically generated)

71 PURE TOPIC MODELS (figure: topic matrix A, words (m) × topics (r)) Each topic is a distribution on words. Each document is about only one topic (stochastically generated). For each document, we sample L words from its topic's distribution

72–78 PURE TOPIC MODELS (figure, built up over several slides: the word-by-document matrix M factors as M = A W)

79 PURE TOPIC MODELS (figure: M = A W) [Anandkumar, Hsu, Kakade, 2012]: Algorithm for learning pure topic models from polynomially many samples (if A is full rank)

80 PURE TOPIC MODELS (figure: M = A W) [Anandkumar, Hsu, Kakade, 2012]: Algorithm for learning pure topic models from polynomially many samples (if A is full rank). Question: Where can we find three conditionally independent random variables?

81 PURE TOPIC MODELS (figure: M = A W) [Anandkumar, Hsu, Kakade, 2012]: Algorithm for learning pure topic models from polynomially many samples (if A is full rank)

82 PURE TOPIC MODELS (figure: M = A W) [Anandkumar, Hsu, Kakade, 2012]: Algorithm for learning pure topic models from polynomially many samples (if A is full rank). The first, second and third words are independent conditioned on the topic t (and are random samples from A_t)
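
A small sketch of this conditional-independence structure (Python with numpy; the vocabulary size, number of topics, and number of documents are made up for illustration): the empirical joint distribution of the first three words converges to Σ_j Pr[topic = j] A_j ⊗ A_j ⊗ A_j.

    import numpy as np

    rng = np.random.default_rng(5)
    m, r, docs = 10, 3, 100_000                  # words, topics, number of documents
    w = rng.dirichlet(np.ones(r))                # Pr[topic = j]
    A = rng.dirichlet(np.ones(m), size=r).T      # column A_j is topic j's word distribution

    # Empirical joint distribution of the first three words of each document.
    M3 = np.zeros((m, m, m))
    topics = rng.choice(r, size=docs, p=w)       # each document is about one topic
    for t in range(r):
        words = rng.choice(m, size=(int(np.sum(topics == t)), 3), p=A[:, t])
        np.add.at(M3, (words[:, 0], words[:, 1], words[:, 2]), 1.0 / docs)

    exact = np.einsum('j,ij,kj,lj->ikl', w, A, A, A)
    print(np.abs(M3 - exact).max())              # small, and shrinking as docs grows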

83 THE POWER OF CONDITIONAL INDEPENDENCE [Phylogenetic Trees/HMMs]: (joint distribution on leaves a, b, c) Σ_σ Pr[z = σ] Pr[a | z = σ] ⊗ Pr[b | z = σ] ⊗ Pr[c | z = σ]

84 THE POWER OF CONDITIONAL INDEPENDENCE [Phylogenetic Trees/HMMs]: (joint distribution on leaves a, b, c) Σ_σ Pr[z = σ] Pr[a | z = σ] ⊗ Pr[b | z = σ] ⊗ Pr[c | z = σ]. [Pure Topic Models/LDA]: (joint distribution on the first three words) Σ_j Pr[topic = j] A_j ⊗ A_j ⊗ A_j

85 THE POWER OF CONDITIONAL INDEPENDENCE [Phylogenetic Trees/HMMs]: (joint distribution on leaves a, b, c) Σ_σ Pr[z = σ] Pr[a | z = σ] ⊗ Pr[b | z = σ] ⊗ Pr[c | z = σ]. [Pure Topic Models/LDA]: (joint distribution on the first three words) Σ_j Pr[topic = j] A_j ⊗ A_j ⊗ A_j. [Community Detection]: (counting stars) Σ_j Pr[C_x = j] (C_A Π)_j ⊗ (C_B Π)_j ⊗ (C_C Π)_j

86 OUTLINE The focus of this tutorial is on Algorithms/Applications/Models for tensor decompositions. Part I: Algorithms (The Rotation Problem; Jennrich's Algorithm). Part II: Applications (Phylogenetic Reconstruction; Pure Topic Models). Part III: Smoothed Analysis (Overcomplete Problems; Kruskal Rank and the Khatri-Rao Product)

87 So far, Jennrich's algorithm has been the key, but it has a crucial limitation.

88 So far, Jennrich's algorithm has been the key, but it has a crucial limitation. Let T = Σ_{i=1}^{R} a_i ⊗ a_i ⊗ a_i, where the {a_i} are n-dimensional vectors

89 So far, Jennrich's algorithm has been the key, but it has a crucial limitation. Let T = Σ_{i=1}^{R} a_i ⊗ a_i ⊗ a_i, where the {a_i} are n-dimensional vectors. Question: What if R is much larger than n?

90 So far, Jennrich's algorithm has been the key, but it has a crucial limitation. Let T = Σ_{i=1}^{R} a_i ⊗ a_i ⊗ a_i, where the {a_i} are n-dimensional vectors. Question: What if R is much larger than n? This is called the overcomplete case, e.g. the number of factors is much larger than the number of observations

91 So far, Jennrich's algorithm has been the key, but it has a crucial limitation. Let T = Σ_{i=1}^{R} a_i ⊗ a_i ⊗ a_i, where the {a_i} are n-dimensional vectors. Question: What if R is much larger than n? This is called the overcomplete case, e.g. the number of factors is much larger than the number of observations. In such cases, why stop at third-order tensors?

92 Consider a sixth-order tensor T: T = Σ_{i=1}^{R} a_i ⊗ a_i ⊗ a_i ⊗ a_i ⊗ a_i ⊗ a_i

93 Consider a sixth-order tensor T: T = Σ_{i=1}^{R} a_i ⊗ a_i ⊗ a_i ⊗ a_i ⊗ a_i ⊗ a_i. Question: Can we find its factors, even if R is much larger than n?

94 Consider a sixth-order tensor T: T = Σ_{i=1}^{R} a_i ⊗ a_i ⊗ a_i ⊗ a_i ⊗ a_i ⊗ a_i. Question: Can we find its factors, even if R is much larger than n? Let's flatten it: flat(T) = Σ_{i=1}^{R} b_i ⊗ b_i ⊗ b_i, where b_i = a_i ⊗_KR a_i (the Khatri-Rao product) is the n^2-dimensional vector whose (j, k)-th entry is the product of the j-th and k-th entries of a_i

95 Consider a sixth-order tensor T: T = Σ_{i=1}^{R} a_i ⊗ a_i ⊗ a_i ⊗ a_i ⊗ a_i ⊗ a_i. Question: Can we find its factors, even if R is much larger than n? Let's flatten it by rearranging its entries into a third-order tensor: flat(T) = Σ_{i=1}^{R} b_i ⊗ b_i ⊗ b_i, where b_i = a_i ⊗_KR a_i (the Khatri-Rao product) is the n^2-dimensional vector whose (j, k)-th entry is the product of the j-th and k-th entries of a_i
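
A small sketch of the flattening step (Python with numpy; n and R are illustrative): grouping the six modes into three pairs turns the sixth-order tensor into a third-order tensor whose factors are the Khatri-Rao squares b_i = a_i ⊗_KR a_i.

    import numpy as np

    rng = np.random.default_rng(6)
    n, R = 4, 6
    A = rng.standard_normal((n, R))

    T6 = np.einsum('ar,br,cr,dr,er,fr->abcdef', *([A] * 6))   # sixth-order, factors a_r
    flatT = T6.reshape(n * n, n * n, n * n)                   # pair up the six modes

    B = np.einsum('jr,kr->jkr', A, A).reshape(n * n, R)       # columns b_r = a_r ⊗_KR a_r
    assert np.allclose(flatT, np.einsum('ir,jr,kr->ijk', B, B, B))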

96 Question: Can we apply Jennrich's Algorithm to flat(T)?

97 Question: Can we apply Jennrich's Algorithm to flat(T)? When are the new factors b_i = a_i ⊗_KR a_i linearly independent?

98 Question: Can we apply Jennrich's Algorithm to flat(T)? When are the new factors b_i = a_i ⊗_KR a_i linearly independent? Example #1: Let {a_i} be all (n choose 2) vectors with exactly two ones

99 Question: Can we apply Jennrich's Algorithm to flat(T)? When are the new factors b_i = a_i ⊗_KR a_i linearly independent? Example #1: Let {a_i} be all (n choose 2) vectors with exactly two ones. Then {b_i} are vectorizations of the matrices a_i a_i^T

100 Question: Can we apply Jennrich's Algorithm to flat(T)? When are the new factors b_i = a_i ⊗_KR a_i linearly independent? Example #1: Let {a_i} be all (n choose 2) vectors with exactly two ones. Then {b_i} are vectorizations of the matrices a_i a_i^T, each non-zero only in the entries indexed by the two ones of a_i

101 Question: Can we apply Jennrich's Algorithm to flat(T)? When are the new factors b_i = a_i ⊗_KR a_i linearly independent? Example #1: Let {a_i} be all (n choose 2) vectors with exactly two ones. Then {b_i} are vectorizations of the matrices a_i a_i^T, each non-zero only in the entries indexed by the two ones of a_i, and they are linearly independent

102 Question: Can we apply Jennrich's Algorithm to flat(T)? When are the new factors b_i = a_i ⊗_KR a_i linearly independent?

103 Question: Can we apply Jennrich's Algorithm to flat(T)? When are the new factors b_i = a_i ⊗_KR a_i linearly independent? Example #2: Let {a_1, ..., a_n} and {a_{n+1}, ..., a_{2n}} be two random orthonormal bases

104 Question: Can we apply Jennrich's Algorithm to flat(T)? When are the new factors b_i = a_i ⊗_KR a_i linearly independent? Example #2: Let {a_1, ..., a_n} and {a_{n+1}, ..., a_{2n}} be two random orthonormal bases. Then there is a linear dependence with 2n terms:

105 Question: Can we apply Jennrich's Algorithm to flat(T)? When are the new factors b_i = a_i ⊗_KR a_i linearly independent? Example #2: Let {a_1, ..., a_n} and {a_{n+1}, ..., a_{2n}} be two random orthonormal bases. Then there is a linear dependence with 2n terms: Σ_{i=1}^{n} a_i ⊗_KR a_i − Σ_{i=n+1}^{2n} a_i ⊗_KR a_i = 0

106 Question: Can we apply Jennrich's Algorithm to flat(T)? When are the new factors b_i = a_i ⊗_KR a_i linearly independent? Example #2: Let {a_1, ..., a_n} and {a_{n+1}, ..., a_{2n}} be two random orthonormal bases. Then there is a linear dependence with 2n terms: Σ_{i=1}^{n} a_i ⊗_KR a_i − Σ_{i=n+1}^{2n} a_i ⊗_KR a_i = 0 (as matrices, both sums equal the identity)
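
This dependence is easy to verify numerically. A small sketch (Python with numpy; n is illustrative): for an orthonormal basis, Σ_i a_i ⊗_KR a_i is the vectorized identity, so the two sums agree.

    import numpy as np

    rng = np.random.default_rng(7)
    n = 5
    Q1, _ = np.linalg.qr(rng.standard_normal((n, n)))    # first orthonormal basis
    Q2, _ = np.linalg.qr(rng.standard_normal((n, n)))    # second orthonormal basis

    def kr_squares(Q):
        # Columns are a_i ⊗_KR a_i, i.e. the vectorizations of a_i a_i^T.
        return np.einsum('ji,ki->jki', Q, Q).reshape(n * n, n)

    # Each sum equals the vectorization of sum_i a_i a_i^T = I, so they cancel.
    diff = kr_squares(Q1).sum(axis=1) - kr_squares(Q2).sum(axis=1)
    print(np.allclose(diff, 0))                          # True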

107 THE KRUSKAL RANK

108 THE KRUSKAL RANK Definition: The Kruskal rank (k-rank) of {b_i} is the largest k s.t. every set of k vectors is linearly independent

109 THE KRUSKAL RANK Definition: The Kruskal rank (k-rank) of {b_i} is the largest k s.t. every set of k vectors is linearly independent. For b_i = a_i ⊗_KR a_i with k-rank({a_i}) = n:

110 THE KRUSKAL RANK Definition: The Kruskal rank (k-rank) of {b_i} is the largest k s.t. every set of k vectors is linearly independent. For b_i = a_i ⊗_KR a_i with k-rank({a_i}) = n: Example #1: k-rank({b_i}) = R = (n choose 2)

111 THE KRUSKAL RANK Definition: The Kruskal rank (k-rank) of {b_i} is the largest k s.t. every set of k vectors is linearly independent. For b_i = a_i ⊗_KR a_i with k-rank({a_i}) = n: Example #1: k-rank({b_i}) = R = (n choose 2). Example #2: k-rank({b_i}) = 2n − 1

112 THE KRUSKAL RANK Definition: The Kruskal rank (k-rank) of {b_i} is the largest k s.t. every set of k vectors is linearly independent. For b_i = a_i ⊗_KR a_i with k-rank({a_i}) = n: Example #1: k-rank({b_i}) = R = (n choose 2). Example #2: k-rank({b_i}) = 2n − 1. The Kruskal rank always adds under the Khatri-Rao product, but sometimes it multiplies, and that can allow us to handle R >> n
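
A brute-force sketch of the (non-robust) Kruskal rank (Python with numpy; exponential in the number of vectors, so only for tiny illustrative examples):

    import numpy as np
    from itertools import combinations

    def kruskal_rank(vectors, tol=1e-10):
        # Largest k such that every set of k of the vectors is linearly independent.
        vectors = [np.asarray(v, dtype=float) for v in vectors]
        for k in range(len(vectors), 0, -1):
            if all(np.linalg.matrix_rank(np.column_stack(S), tol=tol) == k
                   for S in combinations(vectors, k)):
                return k
        return 0

    e1, e2, e3 = np.eye(3)
    print(kruskal_rank([e1, e2, e3]))            # 3: every subset is independent
    print(kruskal_rank([e1, e2, e1 + e2]))       # 2: the three together are dependent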

113 [Allman, Matias, Rhodes, 2009]: Almost surely, the Kruskal rank multiplies under the Khatri-Rao product

114 [Allman, Matias, Rhodes, 2009]: Almost surely, the Kruskal rank multiplies under the Khatri-Rao product. Proof: The set of {a_i} where b_i = a_i ⊗_KR a_i and det({b_i}) = 0 is measure zero

115 [Allman, Matias, Rhodes, 2009]: Almost surely, the Kruskal rank multiplies under the Khatri-Rao product. Proof: The set of {a_i} where b_i = a_i ⊗_KR a_i and det({b_i}) = 0 is measure zero. But this yields a very weak bound on the condition number of {b_i}

116 [Allman, Matias, Rhodes, 2009]: Almost surely, the Kruskal rank multiplies under the Khatri-Rao product. Proof: The set of {a_i} where b_i = a_i ⊗_KR a_i and det({b_i}) = 0 is measure zero. But this yields a very weak bound on the condition number of {b_i}, which is what we need to apply it in learning/statistics, where we only have an estimate of T

117 [Allman, Matias, Rhodes, 2009]: Almost surely, the Kruskal rank multiplies under the Khatri-Rao product

118 [Allman, Matias, Rhodes, 2009]: Almost surely, the Kruskal rank multiplies under the Khatri-Rao product. Definition: The robust Kruskal rank (k-rank_γ) of {b_i} is the largest k s.t. every set of k vectors has condition number at most O(γ)

119 [Allman, Matias, Rhodes, 2009]: Almost surely, the Kruskal rank multiplies under the Khatri-Rao product. Definition: The robust Kruskal rank (k-rank_γ) of {b_i} is the largest k s.t. every set of k vectors has condition number at most O(γ). [Bhaskara, Charikar, Vijayaraghavan, 2013]: The robust Kruskal rank always adds under the Khatri-Rao product

120 [Allman, Matias, Rhodes, 2009]: Almost surely, the Kruskal rank multiplies under the Khatri-Rao product. Definition: The robust Kruskal rank (k-rank_γ) of {b_i} is the largest k s.t. every set of k vectors has condition number at most O(γ). [Bhaskara, Charikar, Vijayaraghavan, 2013]: The robust Kruskal rank always adds under the Khatri-Rao product. [Bhaskara, Charikar, Moitra, Vijayaraghavan, 2014]: Suppose the vectors {a_i} are ε-perturbed

121 [Allman, Matias, Rhodes, 2009]: Almost surely, the Kruskal rank multiplies under the Khatri-Rao product. Definition: The robust Kruskal rank (k-rank_γ) of {b_i} is the largest k s.t. every set of k vectors has condition number at most O(γ). [Bhaskara, Charikar, Vijayaraghavan, 2013]: The robust Kruskal rank always adds under the Khatri-Rao product. [Bhaskara, Charikar, Moitra, Vijayaraghavan, 2014]: Suppose the vectors {a_i} are ε-perturbed. Then k-rank_γ({b_i}) = R for R = n^2/2 and γ = poly(1/n, ε), with exponentially small failure probability (δ)
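
A quick numerical sanity check of this statement (Python with numpy; the base vectors, n, and the perturbation size ε are made up for illustration): form R = n²/2 Khatri-Rao squares of ε-perturbed vectors and look at the condition number of the resulting n² × R matrix.

    import numpy as np

    rng = np.random.default_rng(8)
    n, eps = 10, 0.1
    R = n * n // 2
    A = rng.standard_normal((n, R))                      # arbitrary base vectors
    A_pert = A + eps * rng.standard_normal((n, R))       # their epsilon-perturbations

    B = np.einsum('jr,kr->jkr', A_pert, A_pert).reshape(n * n, R)   # b_r = a_r ⊗_KR a_r
    s = np.linalg.svd(B, compute_uv=False)
    print(s[0] / s[-1])                                  # condition number of the R columns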

122 [Bhaskara, Charikar, Moitra, Vijayaraghavan, 2014]: Suppose the vectors {a_i} are ε-perturbed. Then k-rank_γ({b_i}) = R for R = n^2/2 and γ = poly(1/n, ε), with exponentially small failure probability (δ)

123 [Bhaskara, Charikar, Moitra, Vijayaraghavan, 2014]: Suppose the vectors {a_i} are ε-perturbed. Then k-rank_γ({b_i}) = R for R = n^2/2 and γ = poly(1/n, ε), with exponentially small failure probability (δ). Hence we can apply Jennrich's Algorithm to flat(T) with R >> n

124 [Bhaskara, Charikar, Moitra, Vijayaraghavan, 2014]: Suppose the vectors {a_i} are ε-perturbed. Then k-rank_γ({b_i}) = R for R = n^2/2 and γ = poly(1/n, ε), with exponentially small failure probability (δ). Hence we can apply Jennrich's Algorithm to flat(T) with R >> n. Note: These bounds are easy to prove with inverse polynomial failure probability, but then γ depends on δ

125 [Bhaskara, Charikar, Moitra, Vijayaraghavan, 2014]: Suppose the vectors {a_i} are ε-perturbed. Then k-rank_γ({b_i}) = R for R = n^2/2 and γ = poly(1/n, ε), with exponentially small failure probability (δ). Hence we can apply Jennrich's Algorithm to flat(T) with R >> n. Note: These bounds are easy to prove with inverse polynomial failure probability, but then γ depends on δ. This can be extended to any constant-order Khatri-Rao product

126 [Bhaskara, Charikar, Moitra, Vijayaraghavan, 2014]: Suppose the vectors {a_i} are ε-perturbed. Then k-rank_γ({b_i}) = R for R = n^2/2 and γ = poly(1/n, ε), with exponentially small failure probability (δ). Hence we can apply Jennrich's Algorithm to flat(T) with R >> n

127 [Bhaskara, Charikar, Moitra, Vijayaraghavan, 2014]: Suppose the vectors {a_i} are ε-perturbed. Then k-rank_γ({b_i}) = R for R = n^2/2 and γ = poly(1/n, ε), with exponentially small failure probability (δ). Hence we can apply Jennrich's Algorithm to flat(T) with R >> n. Sample application: Algorithm for learning mixtures of n^{O(1)} spherical Gaussians in R^n, if their means are ε-perturbed. This was also obtained independently by [Anderson, Belkin, Goyal, Rademacher, Voss, 2014]

128 Any Questions? Summary: Tensor decompositions are unique under much more general conditions, compared to matrix decompositions. Jennrich's Algorithm (rediscovered many times!), and its many applications in learning/statistics. Introduced new models to study overcomplete problems (R >> n). Are there algorithms for order-k tensors that work with R = n^{0.51 k}?
