A Practical Randomized CP Tensor Decomposition


Slide 1: A Practical Randomized CP Tensor Decomposition. Casey Battaglino (1), Grey Ballard (2), and Tamara G. Kolda (3). SIAM CSE 2017, Atlanta, GA. (1) Georgia Tech Computational Sci. and Engr., cbattaglino3@gatech.edu. (2) Wake Forest University, ballard@wfu.edu. (3) Sandia National Labs, tgkolda@sandia.gov.

Slide 2: Tensors. [Figure: arrays of increasing order, with modes labeled I1, I2, I3, I4, I5, etc.: order 1 (vector), order 2 (matrix), order 3, order 4, order 5, ..., order N.]

Slide 3: Problem. Matrix decomposition: a sum of rank-one terms, + + + ... (SVD).

Slide 4: Problem. Matrix decomposition: + + + ... (SVD). Tensor decomposition: + + + ... (CP).

Slide 5: CP: Multi-Way Data Analysis. [Figure: (A) PCA (2D): a neurons x time data matrix and its principal components analysis (PCA) reconstruction. (B) CP (3D): a neurons x phase x trials tensor and its Canonical Polyadic (CP) tensor decomposition into neuron factors (functional cell types), phase factors (within-trial dynamics from trial start to trial end), and trial factors (learning dynamics from first trial to last trial).] (source: Alex Williams, Stanford/Sandia)

Slide 6: Standard Method: Alternating Least Squares (CP-ALS). Solve a sequence of least squares problems, one for each mode (mode 1, mode 2, mode 3), and repeat; the solutions are the factor matrices A(1), A(2), A(3).

Slide 7: Our Method: Randomized Least Squares, also known as sketching.

Slide 8: Contributions. We show how randomized sketching techniques can extend from matrices to general dense tensors. We demonstrate a novel randomized least squares algorithm for the CP decomposition, with a novel stopping condition. This enables us to scale to much larger data sets. In addition to speed, there is evidence that this randomization increases robustness!

Slide 9: Example: Linear Regression. [Figure: an overdetermined system Ax = b with m >> n, shown alongside a scatter plot including a point (x3, y3).] Find the line of best fit satisfying min_x ‖Ax − b‖.

Slide 10: Example: Linear Regression. [Figure: the same overdetermined system Ax = b and fitted line.] Method of Normal Equations: solve min_x ‖Ax − b‖ via AᵀA x = Aᵀb, i.e., x = (AᵀA)⁻¹Aᵀb = A†b.

Slide 11: Example: Linear Regression. In MATLAB, min_x ‖Ax − b‖ is solved by x = A\b.

Slide 12: Sampling Linear Regression. What if we sample points (uniformly at random)? A row-sampling operator S selects s of the m rows, giving SA (s x n) and Sb.

Slide 13: Sampling Linear Regression. Sample A and b (MATLAB):
>> srows = randi(m,s,1);   % sample s rows with replacement
>> As = A(srows, :);
>> bs = b(srows);
>> x  = A \ b;             % full solve
>> xs = As \ bs;           % sampled solve
[Figure: the row-sampling operator applied to A and b, and the fitted lines from the full and sampled problems.]
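
A minimal runnable sketch of the comparison above, assuming a synthetic overdetermined system (the sizes m, n, s and the noise level are illustrative, not from the slides):

m = 100000; n = 10; s = 50*n;        % illustrative problem and sample sizes
A = randn(m,n); xtrue = randn(n,1);
b = A*xtrue + 0.01*randn(m,1);       % noisy right-hand side
srows = randi(m,s,1);                % sample s rows with replacement
xs = A(srows,:) \ b(srows);          % sampled solve
x  = A \ b;                          % full solve, for comparison
disp(norm(xs - x)/norm(x))           % typically small, since a Gaussian A is incoherent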

Slide 14: Sampling Linear Regression. Full problem (m >> n): Cholesky costs O(mn² + n³) flops, QR costs O(mn²) flops. Sampled problem (m >> s ≥ n): Cholesky costs O(sn² + n³) flops, QR costs O(sn²) flops. (Fast solution, approximate.)

Slide 15: Danger: Underdetermined Sampling. SA may be rank-deficient, i.e., the sampled problem is underdetermined!

Slide 16: Underdetermined Sampling. Given full-rank A (m x n): consider a row containing the only nonzero in its column.

Slide 17: Underdetermined Sampling. Given full-rank A: consider a row containing the only nonzero in its column. Any sampling must contain that row or be rank-deficient. With uniform sampling with replacement, s = O(m log m) samples are needed (so s > m; sampling is useless here!). (Avron, Maymounkov & Toledo 2010)

Slide 18: Mixing. Idea: transform A and b before sampling. (Ailon & Chazelle 2006 / Drineas, Mahoney, Muthukrishnan, Sarlós 2007 / Rokhlin & Tygert 2008)

Slide 19: Mixing. Trick: mix up the rows with an orthogonal transform F (good).

Slide 20: Mixing. Trick: mix up the rows with F. Problem: vectors that are sparse in the frequency domain stay concentrated after F (bad).

Slide 21: Mixing. Trick: mix up the rows. Random sign flipping (D) helps spread out smooth vectors before applying F (good).

Slide 22: Mixing. Trick: mix up the rows. Fast Johnson-Lindenstrauss Transform (Ailon & Chazelle, 2006): randomly sign-flip the rows (D), then apply one of FFT / WHT / DHT (F) to both A and b. These are orthogonal transformations, i.e., a change of basis.

Slide 23: Mixing. Trick: mix up the rows. Fast Johnson-Lindenstrauss Transform (Ailon & Chazelle, 2006): randomly sign-flip the rows, then apply one of FFT / WHT / DHT. Forming FDA costs O(mn log m) flops (the sign flips DA alone cost O(mn)); forming FDb costs O(m log m) flops (Db costs O(m)).

Slide 24: Mixing. Trick: mix up the rows. Fast Johnson-Lindenstrauss Transform (Ailon & Chazelle, 2006): randomly sign-flip the rows, then apply one of FFT / WHT / DHT (other options include the DCT, random Givens rotations, or a random orthogonal matrix). Costs as on the previous slide.

Slide 25: Mixing. After mixing (Â = FDA, b̂ = FDb), we should then be able to sample a small number of rows: SÂ and Sb̂.

Slide 26: Coherence. Given a matrix Q whose columns are an orthonormal basis for the range of A (e.g., [Q, R] = qr(A)), the coherence is the maximum leverage score: µ(A) = max_i ‖Q(i,:)‖₂², measuring correlation with the standard basis. It ranges from n/m for a well-mixed matrix (incoherent) to 1 for the identity matrix (coherent).
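
A small MATLAB sketch of this definition, assuming a dense m-by-n matrix A with m >= n:

[Q,~] = qr(A,0);        % thin QR: columns of Q are an orthonormal basis for range(A)
lev = sum(Q.^2, 2);     % leverage scores = squared row norms of Q
mu  = max(lev);         % coherence, between n/m (incoherent) and 1 (coherent)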

Slide 27: Number of Samples (to recover a full-rank preconditioner). µ(A) ranges from n/m (incoherent) to 1 (coherent). If µ(A) = n/m, then s = O(n log n) samples suffice (tiny). If µ(A) = 1, then s = O(m log m) samples are needed (larger than the input). (Avron, Maymounkov & Toledo 2010)

Slide 28: Faster Approximate Least Squares. Mix: min_x ‖Ax − b‖₂ becomes min_x ‖FDAx − FDb‖₂. Sample: min_x ‖SFDAx − SFDb‖₂. Solve the sampled problem exactly, x̂ = SFDA \ SFDb, and use x̂ as an approximate solution for the original problem. (Drineas, Mahoney, Muthukrishnan, Sarlós 2007)
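
A hedged MATLAB sketch of this mix-sample-solve pipeline for an m-by-n matrix A and m-vector b; the sample size s is an assumption, and the real/imaginary stacking (used later in CPRAND-MIX) keeps the solution real:

d  = sign(randn(m,1));                      % D: random sign flips
Am = fft(spdiags(d,0,m,m)*A)/sqrt(m);       % F*D*A: unitary FFT applied to each column
bm = fft(d.*b)/sqrt(m);                     % F*D*b
rows = randi(m,s,1);                        % S: uniform row sampling with replacement
As = [real(Am(rows,:)); imag(Am(rows,:))];  % stack real and imaginary parts
bs = [real(bm(rows)); imag(bm(rows))];
xhat = As \ bs;                             % approximate solution to min ||Ax - b||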

Slide 29: Sketching. min_x ‖Ax − b‖₂ → min_x ‖SFDAx − SFDb‖₂. This has found practical uses, e.g., Blendenpik: precondition using randomized least squares, then apply a standard iterative method (LSQR); roughly a 4x overall speedup over LAPACK. (Avron, Maymounkov & Toledo 2010)

Slide 30: Sketching. Can we apply this technique to tensor applications? [Figure: an I1 x I2 x I3 tensor.]

Slide 31: Yes. [Figure: fit vs. time(s) for our algorithms (CPRAND, CPRAND-FFT) and the standard algorithms (CP-ALS(R), CP-ALS(H)) on a synthetic tensor with R = 5.]

Slide 32: CP Decomposition. X ≈ X̂ = Σ_{r=1..R} a_r ∘ b_r ∘ c_r. SVD: a sum of rank-one terms + + + ... (2D); CP: + + + ... (3D). R is the number of components.

Slide 33: Deriving CP-ALS. For X of size I x J x K, X ≈ X̂ = Σ_{r=1..R} a_r ∘ b_r ∘ c_r. Store the factors in matrix form: A (I x R), B (J x R), C (K x R), whose columns are the vectors a_r, b_r, c_r.

Slide 34: Tensor Fibers. [Figure: an I x J x K tensor and its mode-1 fibers (columns), mode-2 fibers (rows), and mode-3 fibers (tubes).]

Slide 35: Matricization (aka unfolding). Arranging the mode-1, mode-2, and mode-3 fibers as columns gives the unfoldings X(1) of size I x JK, X(2) of size J x IK, and X(3) of size K x IJ.
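
A MATLAB sketch of these unfoldings for a dense I-by-J-by-K array X, following the ordering used here:

X1 = reshape(X, I, J*K);                    % mode-1 unfolding: columns are fibers X(:,j,k)
X2 = reshape(permute(X,[2 1 3]), J, I*K);   % mode-2 unfolding
X3 = reshape(permute(X,[3 1 2]), K, I*J);   % mode-3 unfolding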

Slide 36: Kronecker Product. A ⊗ B = [a_{11}B ... a_{1n}B; ... ; a_{m1}B ... a_{mn}B]: every pair of scalar entries is multiplied, in a block structure.

Slide 37: CP Decomposition & Khatri-Rao Product. The CP representation X ≈ X̂ = Σ_{r=1..R} a_r ∘ b_r ∘ c_r has the matricized form X(1) ≈ A(C ⊙ B)ᵀ, where A ⊙ B = [a_1 ⊗ b_1, a_2 ⊗ b_2, ..., a_R ⊗ b_R] is the Khatri-Rao (columnwise Kronecker) product.
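
For illustration, a direct MATLAB sketch of the Khatri-Rao product of C (K-by-R) and B (J-by-R); the Tensor Toolbox's khatrirao routine computes the same product more efficiently:

P = zeros(size(C,1)*size(B,1), size(C,2));
for r = 1:size(C,2)
    P(:,r) = kron(C(:,r), B(:,r));   % column r is the Kronecker product of the r-th columns
end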

Slide 38: CP Decomposition (order 3, extends to order N). With factors A (I x R), B (J x R), C (K x R), the matricized form X(1) ≈ A(C ⊙ B)ᵀ lets us express the least squares problem min_A ‖X(1) − A(C ⊙ B)ᵀ‖_F, where A is the unknown coefficient matrix.

Slide 39: Alternating Least Squares. min_A ‖X(1) − A(C ⊙ B)ᵀ‖_F with unknown coefficient matrix A. Solve for each factor A, B, C in alternating fashion: A = X(1) / (C ⊙ B)ᵀ (a transposed least squares problem, hence MATLAB's forward-slash).

Slide 40: Alternating Least Squares (order 3). Repeat: A = X(1) / (C ⊙ B)ᵀ, B = X(2) / (C ⊙ A)ᵀ, C = X(3) / (B ⊙ A)ᵀ. Exploiting the special structure of the normal equations, each solve reduces to an R x R system: A = X(1)(C ⊙ B) / (CᵀC ∗ BᵀB), B = X(2)(C ⊙ A) / (CᵀC ∗ AᵀA), C = X(3)(B ⊙ A) / (BᵀB ∗ AᵀA), where ∗ denotes the elementwise (Hadamard) product.
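
A hedged sketch of one such structured update in MATLAB, where X1 is the mode-1 unfolding from the sketch after slide 35 and khatrirao is the Tensor Toolbox routine (or the loop sketched after slide 37):

KR = khatrirao(C,B);          % (J*K)-by-R Khatri-Rao product
G  = (C'*C) .* (B'*B);        % R-by-R Gram matrix via the Hadamard product
A  = (X1 * KR) / G;           % R-by-R solve instead of a JK-by-R least squares problem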

Slide 41: CP-ALS Algorithm.
1: procedure CP-ALS(X, R)    (X ∈ R^{I x J x K})
2:   Initialize factor matrices B, C
3:   repeat
4:     A = X(1)[(C ⊙ B)ᵀ]† = X(1)(C ⊙ B)(CᵀC ∗ BᵀB)†
5:     B = X(2)[(C ⊙ A)ᵀ]† = X(2)(C ⊙ A)(CᵀC ∗ AᵀA)†
6:     C = X(3)[(B ⊙ A)ᵀ]† = X(3)(B ⊙ A)(BᵀB ∗ AᵀA)†
7:   until termination criteria met
8:   return factor matrices A, B, C
9: end procedure
Initialization: random or nvecs. Terminate when the fit, based on ‖X − X̂‖ / ‖X‖, stops changing.

Slide 42: CPRAND. Exact update: A = X(1) / (C ⊙ B)ᵀ, where X(1) is I x JK and (C ⊙ B)ᵀ is R x JK. Sampled update: A = X(1)S_Aᵀ / (S_A(C ⊙ B))ᵀ. CPRAND: just sample X(1) and (C ⊙ B)ᵀ.

Slide 43: CPRAND. Sampled update: A = X(1)S_Aᵀ / (S_A(C ⊙ B))ᵀ ... and hope that C ⊙ B is incoherent ...

Slide 44: Algorithm CPRAND (order 3 example, extends to arbitrary order).
1: procedure CPRAND(X, R, S)    (X ∈ R^{I x J x K})
2:   Initialize factor matrices B, C
3:   repeat
4:     Define uniform random sampling operators S_A, S_B, S_C
5:     A = X(1)S_Aᵀ [(S_A(C ⊙ B))ᵀ]†
6:     B = X(2)S_Bᵀ [(S_B(C ⊙ A))ᵀ]†
7:     C = X(3)S_Cᵀ [(S_C(B ⊙ A))ᵀ]†
8:   until termination criteria met
9:   return A, B, C
10: end procedure

Slide 45: Algorithm CPRAND (as above). Trick 1: instead of matricizing X and then sampling, sample fibers from X and then matricize the result.

Slide 46: Algorithm CPRAND (as above, with each sampled Khatri-Rao product formed directly): A = X(1)S_Aᵀ [SKR(S_A, C, B)ᵀ]† = X(1)S_Aᵀ [(S_A(C ⊙ B))ᵀ]†, and similarly for B and C. SKR samples the Khatri-Rao product without forming it. Trick 2: we don't need to form the full Khatri-Rao product; the sampled row of C ⊙ B corresponding to indices (k, j) is just the elementwise product of row k of C and row j of B.
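
A hedged MATLAB sketch of one sampled update of A combining Tricks 1 and 2, for a dense I-by-J-by-K array X and S sampled fibers:

js = randi(J,S,1); ks = randi(K,S,1);   % uniform fiber indices, with replacement
Zs = C(ks,:) .* B(js,:);                % S sampled rows of the Khatri-Rao product, never formed in full (Trick 2)
Xs = zeros(I,S);
for t = 1:S
    Xs(:,t) = X(:,js(t),ks(t));         % the matching sampled mode-1 fibers of X (Trick 1)
end
A = Xs / Zs';                           % small least squares solve: A*Zs' ~ Xs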

Slide 47: Algorithm CPRAND. No mixing is performed here, yet it converges in many cases. Our result: µ(C ⊙ B) ≤ µ(C)µ(B).

Slide 48: Coherence of Kronecker Product. Identity: (AB) ⊗ (CD) = (A ⊗ C)(B ⊗ D). Lemma 1: µ(A ⊗ B) = µ(A)µ(B). Proof: A ⊗ B = (Q_A R_A) ⊗ (Q_B R_B) = (Q_A ⊗ Q_B)(R_A ⊗ R_B) = Q_{A⊗B} R_{A⊗B}. Each row of Q_{A⊗B} has the form Q_A(i,:) ⊗ Q_B(j,:), and the result follows from ‖a ⊗ b‖ = ‖a‖ ‖b‖.
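
A quick numerical check of Lemma 1 in MATLAB (the matrix sizes are illustrative; leverage scores do not depend on which orthonormal basis QR returns):

A = randn(20,3); B = randn(15,2);
[QA,~] = qr(A,0); [QB,~] = qr(B,0); [QK,~] = qr(kron(A,B),0);
muA = max(sum(QA.^2,2)); muB = max(sum(QB.^2,2)); muK = max(sum(QK.^2,2));
disp(abs(muK - muA*muB))   % agrees up to rounding error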

Slide 49: Coherence of Khatri-Rao Product. Identity: (AB) ⊙ (CD) = (A ⊗ C)(B ⊙ D). Lemma 2: µ(A ⊙ B) ≤ µ(A)µ(B). Proof sketch: A ⊙ B = (Q_A ⊗ Q_B)(R_A ⊙ R_B) = (Q_A ⊗ Q_B)Q_R R_R = Q_{A⊙B} R_{A⊙B}, where R_A ⊙ R_B = Q_R R_R. Row i of Q_{A⊙B} is q̂_iᵀ Q_R, with q̂_iᵀ the corresponding row of Q_A ⊗ Q_B, so its leverage score ℓ_i = q̂_iᵀ Q_R Q_Rᵀ q̂_i ≤ ‖q̂_i‖², and the bound follows from Lemma 1.

Slide 50: Mixing Tensors. For better convergence and guarantees, we need to mix. Goals: space efficient (a single mixed tensor; avoid forming the full Khatri-Rao product) and time efficient (minimize mixing costs and sample sizes).

Slide 51: Mixing Tensors. CPRAND without mixing was A = X(1)S_Aᵀ / (S_A(C ⊙ B))ᵀ. Naive approach: mix X(1) (I x JK) and C ⊙ B (JK x R) directly, then sample. (Expensive.)

Slide 52: Mixing Factors. Can we avoid forming the full Khatri-Rao product C ⊙ B (JK x R) from the factors A (I x R), B (J x R), C (K x R)?

Slide 53: Mixing Factors. CPRAND without mixing was A = X(1)S_Aᵀ / (S_A(C ⊙ B))ᵀ. Trick 3: transform the individual factors, (F_C D_C C) ⊙ (F_B D_B B), and sample the Khatri-Rao product without fully forming it.

Slide 54: Mixing Factors. Fact: (AB) ⊙ (CD) = (A ⊗ C)(B ⊙ D). So the mixed factors expand: (F_C D_C C) ⊙ (F_B D_B B) = (F_C D_C ⊗ F_B D_B)(C ⊙ B). Apply the same mixing to X(1) (the left-hand side): X(1)(F_C D_C ⊗ F_B D_B)ᵀ.

Slide 55: Mode-n Multiplication. The mode-n product X ×_n U satisfies (X ×_n U)(n) = U X(n); e.g., X ×₃ U applies U to every mode-3 fiber of X. In particular, X(1)(F_C D_C ⊗ F_B D_B)ᵀ is the mode-1 unfolding of X ×₂ F_B D_B ×₃ F_C D_C.
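
A MATLAB sketch of a mode-3 product for a dense I-by-J-by-K array X and a P-by-K matrix U (unfold, multiply, fold back):

X3 = reshape(permute(X,[3 1 2]), K, I*J);     % mode-3 unfolding
Y3 = U * X3;                                  % multiply every mode-3 fiber by U
Y  = permute(reshape(Y3,[P I J]), [2 3 1]);   % fold back: Y is I-by-J-by-P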

Slides 56-60: Mixing Factors (animation). Trick 4: mix the original tensor only once: X̂ = X ×₁ F_A D_A ×₂ F_B D_B ×₃ F_C D_C. Matricize in one mode: X̂(1) = F_A D_A X(1)(F_C D_C ⊗ F_B D_B)ᵀ. Sample rows of the mixed Khatri-Rao product Ẑ(1) = (F_C D_C C) ⊙ (F_B D_B B) and the matching columns of X̂(1). Solve the small least squares problem: Â = X̂(1)S_Aᵀ [(S_A Ẑ(1))ᵀ]†. Inverse-mix the final, small solution: A = D_A F_A⁻¹ Â.

Slide 61: Mixing Factors. Trick 4: mix the original tensor only once, X̂ = X ×₁ F_A D_A ×₂ F_B D_B ×₃ F_C D_C. Summary of a mode-n update, A(n) = D_n F_n⁻¹ (X̂(n)Sᵀ)[(S Ẑ(n))ᵀ]†: efficiently sample S fibers from mode n of X̂ before reshaping (Trick 1); efficiently sample S rows of the Khatri-Rao product (Trick 2); mix the individual factors (Trick 3); solve a small system of size S x R; inverse-mix the small result at the end.
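
A hedged sketch of the one-time mixing step in plain MATLAB for a dense I-by-J-by-K array X, using sign flips and unitary FFTs along each mode (a real transform such as the DCT could be substituted, as on the CPRAND-PREMIX slide):

dA = sign(randn(I,1)); dB = sign(randn(J,1)); dC = sign(randn(K,1));
Xm = bsxfun(@times, X,  reshape(dA,[I 1 1])); Xm = fft(Xm,[],1)/sqrt(I);   % mode 1: F_A D_A
Xm = bsxfun(@times, Xm, reshape(dB,[1 J 1])); Xm = fft(Xm,[],2)/sqrt(J);   % mode 2: F_B D_B
Xm = bsxfun(@times, Xm, reshape(dC,[1 1 K])); Xm = fft(Xm,[],3)/sqrt(K);   % mode 3: F_C D_C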

Slide 62: Algorithm 3: CPRAND-MIX (order 3 example, extends to arbitrary order).
1: procedure CPRAND-MIX(X, R, S)    (X ∈ R^{I x J x K})
2:   Mix: X̂ = X ×₁ F_A D_A ×₂ F_B D_B ×₃ F_C D_C
3:   Initialize factor matrices B, C
4:   repeat
5:     Define uniform random sampling operators S_A, S_B, S_C
6:     A = D_A F_A⁻¹ ( X̂(1)S_Aᵀ [SKR(S_A, F_C D_C C, F_B D_B B)ᵀ]† )
7:     B = D_B F_B⁻¹ ( X̂(2)S_Bᵀ [SKR(S_B, F_C D_C C, F_A D_A A)ᵀ]† )
8:     C = D_C F_C⁻¹ ( X̂(3)S_Cᵀ [SKR(S_C, F_B D_B B, F_A D_A A)ᵀ]† )
9:   until termination criteria met
10:  return A, B, C
11: end procedure
Since the FFT produces complex data, each small least squares problem is solved over the reals by stacking real and imaginary parts: arg min_x ‖Ax − b‖₂ = arg min_{x ∈ R^n} ‖[Re(A); Im(A)]x − [Re(b); Im(b)]‖₂.

Slide 63: Algorithm 4: CPRAND-PREMIX. If we use a real-valued orthogonal transform such as the DCT, we can simply mix, call CPRAND, then unmix the output.
1: function CPRAND-PREMIX(X, R, S)    (X ∈ R^{I₁ x ... x I_N})
2:   Define random sign-flip operators D_n and orthogonal matrices F_n, n ∈ {1, ..., N}
3:   Mix: X̂ = X ×₁ F₁D₁ ×₂ ... ×_N F_N D_N
4:   [~, {Â(n)}] = CPRAND(X̂, R, S)
5:   for n = 1, ..., N do
6:     Unmix: A(n) = D_n F_nᵀ Â(n)
7:   end for
8:   return factor matrices {A(n)}
9: end function

Slide 64: Trick 5: Stopping Criterion. CP-ALS computes the relative error ‖X − X̂‖ / ‖X‖ = sqrt(‖X‖² + ‖X̂‖² − 2⟨X, X̂⟩) / ‖X‖, which requires the expensive inner product ⟨X, X̂⟩.

Slide 65: Trick 5: Stopping Criterion. CPRAND instead uniformly samples a subset of tensor entries: choose Î ⊆ I, where I = [I] x [J] x [K] indexes all entries of X.

Slide 66: Trick 5: Stopping Criterion. Exactly, ‖X − X̂‖ / ‖X‖ = sqrt(|I| · mean{e_i² : i ∈ I}) / ‖X‖, where e_i is the error at entry i. The approximate fit uses the sampled subset: ≈ sqrt(|I| · mean{e_i² : i ∈ Î}) / ‖X‖.

Slide 67: Trick 5: Stopping Criterion. How many samples are needed? Since we are approximating a mean, we can apply the Chernoff-Hoeffding bound.

Slide 68: Trick 5: Stopping Criterion. A very conservative bound (assuming the e_i are i.i.d., with µ_max = max(e_i²)): #samples P̂ ≥ 372 µ_max² yields 95% confidence that the sampled fit is within 5% of the true fit. In practice, P̂ = 2^14 yields error below 10⁻³ on synthetic data.
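
A hedged MATLAB sketch of the sampled fit estimate, where P is the number of sampled entries (e.g., 2^14), X is a dense I-by-J-by-K array, and normX = norm(X(:)) is computed once up front:

is = randi(I,P,1); js = randi(J,P,1); ks = randi(K,P,1);
e2 = zeros(P,1);
for t = 1:P
    model = sum(A(is(t),:) .* B(js(t),:) .* C(ks(t),:));   % model entry from the factors
    e2(t) = (X(is(t),js(t),ks(t)) - model)^2;              % squared entrywise error
end
fitapprox = 1 - sqrt(I*J*K * mean(e2)) / normX;            % approximate fit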

Slide 69: Trick 5: Stopping Criterion. Terminate after the approximate fit ceases to improve for a certain number of iterations.

Slide 70: Experimental Setup. Environment: MATLAB R2016a and Tensor Toolbox v2.6; Intel Xeon E5-2650 Ivy Bridge 2.0 GHz machine with 32 GB of memory. Data: synthetic random tensors with factors of rank Rtrue; the factors are generated with collinearity, and the observation is combined with Gaussian noise. Score measures how well the original factors are recovered, from 0 to 1.

Slide 71: Example Run. [Figure: fit vs. time for an example run. True rank = 5, target rank = 5, Noise = %, collinearity 0.9.]

Slide 72: Per-Iteration Times. [Figure: per-iteration times for R = 5, with sample size S on the order of R log R.]

Slide 73: Stopping Criterion Speedup. [Figure: speedup from the sampled stopping condition with P̂ = 2^14.]

Slide 74: Preprocessing Times. [Figure: nvecs on each mode-n unfolding vs. FFT on the fibers of each mode.]

Slide 75: Experiment 1. [Figure: time (s), score, fit, and iterations until termination for CPRAND-MIX, CPRAND, CP-ALS(R), and CP-ALS(H).] Size: 9x9x9x9; Rtrue: 5; R: {5, 6}; Noise: %; 6 runs; collinearity: {0.5, 0.9}.

Slide 76: Experiment 1 (continued). [Figure: time (s), score, fit, and iterations until termination for CPRAND-MIX, CPRAND, CP-ALS(R), and CP-ALS(H).] Size: 9x9x9x9; Rtrue: 5; R: {5, 6}; Noise: %; 6 runs; collinearity: {0.5, 0.9}.

Slide 77: Conclusion (Experiment 1). CPRAND can surprisingly outperform CP-ALS in both time and factor recovery (given the same termination criteria). CPRAND has a very low per-iteration time, but may take more iterations until termination. CPRAND-FFT takes fewer iterations and samples, at the cost of preprocessing time.

Slide 78: Experiment 2: Hazardous Gas Classification. We replicate an experiment from Vervliet & De Lathauwer, 2016: 72 sensors, 25,9 time steps, 3 experiments per gas (3 gases: CO, acetaldehyde, ammonia), giving a 259 x 72 x 9 tensor (time x sensor x experiment), 3.4 GB, dense. Goal: classify which gas is released in each experiment. Here we use CPRAND without mixing!

Slide 79: Experiment 2: Hazardous Gas Classification (over trials). [Table: median time (s), median fit, and median classification error (%) for CPRAND, CP-ALS(H), and CP-ALS(R).] S = , P̂ = 2^14. [Figure: factors recovered by CPRAND.]

Slide 80: Experiment 3: COIL-100 Image Set. We replicate an experiment from Zhou, Cichocki, and Xie (2013). COIL-100: 100 objects, images of each from 72 different poses; each image is RGB with size 128x128 pixels. This gives a 128 x 128 x 3 x 7200 tensor (2.8 GB): image size x image size x color channel x image id.

Slide 81: Experiment 3: COIL-100 Image Set. Here we use CPRAND-FFT (mixing). [Table: median speedup and median fit vs. number of samples (S).] Over trials, R = 2, P̂ = 2^14.

Slide 82: Future Work. Randomization in the Tucker model; randomization in sparse CP (for scalability).

Slide 83: Contact: Casey Battaglino.
