Machine Learning with Quantum-Inspired Tensor Networks


1 Machine Learning with Quantum-Inspired Tensor Networks. E.M. Stoudenmire and David J. Schwab, Advances in Neural Information Processing Systems 29, arXiv:1605.05775. RIKEN AICS, Mar 2017

2 Collaboration with David J. Schwab (Northwestern and CUNY Graduate Center). Quantum Machine Learning, Perimeter Institute, Aug 2016

3 An exciting time for machine learning: language processing, self-driving cars, medicine, materials science / chemistry

4 Progress in neural networks and deep learning [figure: neural network diagram]

5 [figure: side-by-side comparison of a convolutional neural network and the "MERA" tensor network]

6 Are tensor networks useful for machine learning? This talk: tensor networks fit naturally into kernel learning (there are also very strong connections to graphical models), with many benefits for learning: linear scaling, adaptivity, feature sharing.

7-8 [figure: word map of machine learning and physics topics — topological phases, phase transitions, neural nets, Boltzmann machines, quantum Monte Carlo sign problem, kernel learning, materials science & chemistry, supervised learning, unsupervised learning — with tensor networks at the center; the kernel-learning / supervised-learning portion is the subject of this talk]

9 What are Tensor Networks?

10 How do tensor networks arise in physics? Quantum systems are governed by the Schrödinger equation: $\hat{H}\psi = E\psi$. It is just an eigenvalue problem.

11 The problem is that $\hat{H}$ is a $2^N \times 2^N$ matrix, so the wavefunction $\psi$ solving $\hat{H}\psi = E\psi$ has $2^N$ components.

12 Natural to view the wavefunction as an order-$N$ tensor: $|\psi\rangle = \sum_{\{s\}} \psi^{s_1 s_2 s_3 \cdots s_N}\, |s_1 s_2 s_3 \cdots s_N\rangle$

13 [tensor diagram: the amplitude $\psi^{s_1 s_2 s_3 \cdots s_N}$ drawn as a single node with $N$ legs $s_1, \ldots, s_N$]

14-15 Tensor components are related to probabilities of, e.g., Ising model spin configurations: the component $\psi^{\uparrow\uparrow\downarrow\uparrow\uparrow\downarrow\uparrow}$ is the amplitude of the configuration ↑↑↓↑↑↓↑, and similarly for any other pattern of up and down spins.
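
To make the counting concrete, here is a minimal NumPy sketch (not from the talk; names and sizes are illustrative) of viewing a $2^N$-component wavefunction as an order-$N$ tensor and reading off one configuration's amplitude:

```python
import numpy as np

N = 6                                # number of spins (illustrative)
psi = np.random.randn(2**N)          # generic wavefunction: 2^N components
psi /= np.linalg.norm(psi)

# The same data viewed as an order-N tensor, one index per spin:
psi_tensor = psi.reshape([2] * N)

# Amplitude of one configuration, encoding up=0, down=1
# (e.g. up,up,down,up,up,down):
amp = psi_tensor[0, 0, 1, 0, 0, 1]
print(amp)
```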

16 Must find an approximation to this exponential problem. [tensor diagram: the order-$N$ tensor $\psi^{s_1 s_2 s_3 \cdots s_N}$ to be approximated]

17 Simplest approximation (mean field / rank-1): let the spins "do their own thing", $\psi^{s_1 s_2 s_3 s_4 s_5 s_6} \approx \phi^{s_1} \phi^{s_2} \phi^{s_3} \phi^{s_4} \phi^{s_5} \phi^{s_6}$. Expected values of individual spins come out ok, but there are no correlations.
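
A minimal sketch of the rank-1 approximation in NumPy (the vectors in phis are hypothetical stand-ins for optimized single-spin factors):

```python
import numpy as np

N = 6
phis = [np.random.randn(2) for _ in range(N)]   # one 2-vector per spin

# Mean-field tensor: psi^{s1...sN} = phi_1^{s1} * phi_2^{s2} * ... * phi_N^{sN}
psi_mf = phis[0]
for phi in phis[1:]:
    psi_mf = np.tensordot(psi_mf, phi, axes=0)  # outer product adds one leg

print(psi_mf.shape)   # (2,)*N: an order-N tensor that factorizes completely
```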

18-20 Restore correlations locally by connecting neighboring factors with bond indices $i_1, i_2, \ldots$: $\psi^{s_1 s_2 \cdots s_6} \approx \sum_{\{i\}} A^{s_1}_{i_1} A^{s_2}_{i_1 i_2} A^{s_3}_{i_2 i_3} A^{s_4}_{i_3 i_4} A^{s_5}_{i_4 i_5} A^{s_6}_{i_5}$ — a matrix product state (MPS). Local expected values are accurate, and correlations decay with spatial distance.

21 " # # " " # "Matrix product state" because retrieving an element = product of matrices

22 = "##""# "Matrix product state" because retrieving an element = product of matrices
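
A minimal sketch of that retrieval (random MPS tensors, shapes chosen for illustration):

```python
import numpy as np

# MPS cores A[j] of shape (d, left_bond, right_bond), d=2 physical values,
# bond dimension m in the bulk, and dummy bonds of size 1 at the boundaries.
N, d, m = 6, 2, 4
bond = [1] + [m] * (N - 1) + [1]
A = [np.random.randn(d, bond[j], bond[j + 1]) for j in range(N)]

def amplitude(spins):
    """psi^{s1...sN}: multiply the matrices picked out by each spin value."""
    M = A[0][spins[0]]              # shape (1, m)
    for j in range(1, N):
        M = M @ A[j][spins[j]]      # chain of matrix products
    return M[0, 0]                  # final shape (1, 1)

print(amplitude([0, 1, 1, 0, 0, 1]))   # e.g. up,down,down,up,up,down
```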

23 Tensor diagrams have a rigorous meaning: a node with one leg $j$ is a vector $v_j$, a node with two legs $i, j$ is a matrix $M_{ij}$, and a node with three legs $i, j, k$ is an order-3 tensor $T_{ijk}$.

24 Joining lines implies contraction over the shared index, and index names can be omitted: $\sum_j M_{ij} v_j$, $\sum_j A_{ij} B_{jk} = (AB)_{ik}$, $\sum_{i,j} A_{ij} B_{ji} = \mathrm{Tr}[AB]$.
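
These three diagrams translate directly to einsum calls; a small sketch with random tensors:

```python
import numpy as np

ni, nj, nk = 3, 4, 5
M = np.random.randn(ni, nj); v = np.random.randn(nj)
A = np.random.randn(ni, nj); B = np.random.randn(nj, nk)
C = np.random.randn(nj, ni)

Mv = np.einsum('ij,j->i', M, v)      # one joined line: matrix-vector product
AB = np.einsum('ij,jk->ik', A, B)    # two tensors joined: matrix product
tr = np.einsum('ij,ji->', A, C)      # all lines joined in a loop: Tr[AC]

assert np.allclose(tr, np.trace(A @ C))
```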

25 MPS = matrix product state. The MPS approximation is controlled by the bond dimension $m$ (analogous to an SVD rank): it compresses the $2^N$ parameters of the full tensor into roughly $2 N m^2$ parameters, and for large enough $m$ (growing exponentially with $N$) an MPS can represent any tensor exactly.
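
A minimal sketch of this compression via repeated truncated SVDs (a standard construction, not necessarily the exact procedure used in the talk):

```python
import numpy as np

def tensor_to_mps(psi, m_max):
    """Split an order-N tensor (all dims 2) into MPS cores;
    m_max plays the role of the bond dimension m."""
    N = psi.ndim
    cores, rest, left = [], psi, 1
    for _ in range(N - 1):
        U, S, Vh = np.linalg.svd(rest.reshape(left * 2, -1),
                                 full_matrices=False)
        m = min(m_max, len(S))                    # truncate to bond dimension m
        cores.append(U[:, :m].reshape(left, 2, m))
        rest, left = S[:m, None] * Vh[:m], m
    cores.append(rest.reshape(left, 2, 1))
    return cores

psi = np.random.randn(*([2] * 10))
mps = tensor_to_mps(psi, m_max=8)
print(sum(c.size for c in mps), "MPS parameters vs", psi.size, "in the full tensor")
```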

26 [figure: the friendly neighborhood of "quantum state space" — nested regions of states reachable with bond dimension m = 1, 2, 4, 8]

27 MPS lead to powerful optimization techniques (the DMRG algorithm). White, PRL 69, 2863 (1992); Stoudenmire, White, PRB 87, 155137 (2013)

28 Besides MPS, other successful tensor networks are PEPS (for 2D systems) and MERA (for critical systems). Verstraete, Cirac, cond-mat/0407066 (2004); Evenbly, Vidal, PRB 79, 144108 (2009); Orus, Ann. Phys. 349, 117 (2014)

29 Supervised Kernel Learning

30 Supervised learning is a very common task: given labeled training data (= supervised), find a decision function $f(x)$ such that $f(x) > 0$ for $x \in A$ and $f(x) < 0$ for $x \in B$, where the input vector $x$ is, e.g., image pixels.

31-33 ML overview: use training data to build a model, then generalize to unseen test data. [figure: labeled training points $x_1, \ldots, x_{16}$ with a decision boundary separating the two classes]

34 ML overview — popular approaches. Neural networks: $f(x) = \sigma_2(M_2\, \sigma_1(M_1 x))$. Non-linear kernel learning: $f(x) = W \cdot \Phi(x)$.

35 Non-linear kernel learning: we want $f(x)$ to separate the classes, but a linear classifier $f(x) = W \cdot x$ is often insufficient.

36-37 Non-linear kernel learning: apply a non-linear "feature map" $x \to \Phi(x)$, then use the decision function $f(x) = W \cdot \Phi(x)$.

38 Non-linear kernel learning: the decision function $f(x) = W \cdot \Phi(x)$ is a linear classifier in feature space.

39 Non-linear kernel learning, example of a feature map: $x = (x_1, x_2, x_3)$, $\Phi(x) = (1, x_1, x_2, x_3, x_1 x_2, x_1 x_3, x_2 x_3)$ — $x$ is "lifted" to feature space.
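
As a quick sketch (the weights here are random placeholders, not trained values), this lift plus a dot product already gives a classifier that is non-linear in $x$:

```python
import numpy as np

def lift(x):
    # The slide's example map for x = (x1, x2, x3):
    x1, x2, x3 = x
    return np.array([1, x1, x2, x3, x1*x2, x1*x3, x2*x3])

x = np.array([0.2, 0.5, 0.9])
W = np.random.randn(7)        # hypothetical trained weights
f = W @ lift(x)               # linear in Phi(x), non-linear in x
print(f)
```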

40 Proposal for Learning

41 Grayscale image data

42-44 Map pixels to "spins". [figure: each grayscale pixel is mapped to a two-component spin-like vector]

45 x = input. Local feature map of dimension d = 2: $\phi^{s_j}(x_j) = \left[\cos\!\left(\tfrac{\pi}{2} x_j\right),\ \sin\!\left(\tfrac{\pi}{2} x_j\right)\right]$ with $x_j \in [0, 1]$. Crucially, different grayscale values are not orthogonal.
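
The local map in code (a direct transcription of the formula above):

```python
import numpy as np

def local_feature_map(xj):
    """phi(x_j) = [cos(pi/2 x_j), sin(pi/2 x_j)] for x_j in [0, 1]."""
    return np.array([np.cos(np.pi / 2 * xj), np.sin(np.pi / 2 * xj)])

print(local_feature_map(0.0))   # pure white -> (1, 0)
print(local_feature_map(1.0))   # pure black -> (0, 1)
# Nearby grayscale values have large overlap, i.e. they are not orthogonal:
print(local_feature_map(0.3) @ local_feature_map(0.4))
```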

46 x = input, $\phi$ = local feature map. Total feature map: $\Phi^{s_1 s_2 \cdots s_N}(x) = \phi^{s_1}(x_1) \otimes \phi^{s_2}(x_2) \otimes \cdots \otimes \phi^{s_N}(x_N)$ — a tensor product of the local feature vectors, just like a product-state wavefunction of spins: a vector in a $2^N$-dimensional space.

47 More detailed notation: for raw inputs $x = [x_1, x_2, x_3, \ldots, x_N]$, the feature vector is $\Phi(x) = \begin{bmatrix} \phi^1(x_1) \\ \phi^2(x_1) \end{bmatrix} \otimes \begin{bmatrix} \phi^1(x_2) \\ \phi^2(x_2) \end{bmatrix} \otimes \begin{bmatrix} \phi^1(x_3) \\ \phi^2(x_3) \end{bmatrix} \otimes \cdots \otimes \begin{bmatrix} \phi^1(x_N) \\ \phi^2(x_N) \end{bmatrix}$

48 Tensor diagram notation: [diagram: $\Phi(x)$ drawn as $N$ disconnected nodes $\phi^{s_1}(x_1), \ldots, \phi^{s_N}(x_N)$, each with one free leg $s_j$]
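
For small N the total feature map can be built explicitly as a chain of Kronecker products (purely illustrative — at realistic N this $2^N$-vector is never formed):

```python
import numpy as np

def local_feature_map(xj):
    return np.array([np.cos(np.pi / 2 * xj), np.sin(np.pi / 2 * xj)])

x = np.array([0.1, 0.7, 0.4, 0.9])               # small N for illustration
Phi = np.array([1.0])
for xj in x:
    Phi = np.kron(Phi, local_feature_map(xj))    # tensor product of local vectors

print(Phi.shape)   # (2**N,): a vector in the 2^N-dimensional feature space
```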

49-52 Construct the decision function $f(x) = W \cdot \Phi(x)$. [tensor diagram: the order-$N$ weight tensor $W$ contracted with the $N$ free legs of $\Phi(x)$]

53 Main approximation: represent the order-$N$ weight tensor $W$ as a matrix product state (MPS).

54 MPS form of the decision function: $f(x) = W \cdot \Phi(x)$ with $W$ an MPS.

55-59 Linear scaling: one can use an algorithm similar to DMRG to optimize. The scaling is $N\, N_T\, m^3$, where $N$ = size of input, $N_T$ = size of training set, $m$ = MPS bond dimension. Could improve further with stochastic gradient methods.
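
The linear-in-N cost of just evaluating $f(x)$ is easy to see in code: contract the product-state feature map into the MPS one site at a time. A minimal sketch with random weights (the DMRG-style training sweep itself is not shown):

```python
import numpy as np

def local_feature_map(xj):
    return np.array([np.cos(np.pi / 2 * xj), np.sin(np.pi / 2 * xj)])

def f_mps(A, x):
    """f(x) = W . Phi(x), with W stored as MPS cores A[j] of shape (d, m_l, m_r).
    One pass over the sites: cost ~ N * d * m^2 rather than 2^N."""
    L = np.ones(1)
    for j, xj in enumerate(x):
        # contract the physical leg with phi(x_j), then absorb into the
        # running boundary vector
        L = L @ np.einsum('s,sab->ab', local_feature_map(xj), A[j])
    return L[0]

N, d, m = 8, 2, 4
bond = [1] + [m] * (N - 1) + [1]
A = [np.random.randn(d, bond[j], bond[j + 1]) for j in range(N)]
print(f_mps(A, np.random.rand(N)))
```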

60 Multi-class extension of the model. Decision function: $f^\ell(x) = W^\ell \cdot \Phi(x)$, where the index $\ell$ runs over the possible labels. The predicted label is $\operatorname{argmax}_\ell f^\ell(x)$.
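
One way to realize this, sketched below with random tensors and an arbitrarily chosen central site (an assumption for illustration, consistent with the talk's diagrams), is to attach the label index $\ell$ to a single central MPS tensor:

```python
import numpy as np

def local_feature_map(xj):
    return np.array([np.cos(np.pi / 2 * xj), np.sin(np.pi / 2 * xj)])

N, d, m, n_labels, c = 8, 2, 4, 10, 4        # c = site carrying the label leg
bond = [1] + [m] * (N - 1) + [1]
A = [np.random.randn(d, bond[j], bond[j + 1]) for j in range(N)]
A[c] = np.random.randn(n_labels, d, bond[c], bond[c + 1])

def predict(x):
    L = np.ones(1)
    for j in range(c):                        # contract everything left of c
        L = L @ np.einsum('s,sab->ab', local_feature_map(x[j]), A[j])
    R = np.ones(1)
    for j in range(N - 1, c, -1):             # contract everything right of c
        R = np.einsum('s,sab->ab', local_feature_map(x[j]), A[j]) @ R
    scores = np.einsum('a,lsab,s,b->l', L, A[c], local_feature_map(x[c]), R)
    return np.argmax(scores)                  # predicted label = argmax_l f^l(x)

print(predict(np.random.rand(N)))
```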

61 MNIST Experiment. MNIST is a benchmark data set of grayscale handwritten digits (labels $\ell = 0, 1, 2, \ldots, 9$): 60,000 labeled training images and 10,000 labeled test images.

62 MNIST Experiment. [figure: one-dimensional mapping — the 2D pixel grid is traversed along a 1D path to form the input chain]

63 MNIST Experiment, results (state of the art is < 1% test set error):
Bond dimension m = 10: ~5% test set error (500/10,000 incorrect)
Bond dimension m = 20: ~2% test set error (200/10,000 incorrect)
Bond dimension m = 120: 0.97% test set error (97/10,000 incorrect)

64 MNIST Experiment Demo Link:

65 Understanding tensor network models: $f(x) = W \cdot \Phi(x)$

66 Again assume $W$ is an MPS in $f(x) = W \cdot \Phi(x)$. This has many interesting benefits; two are: 1. adaptivity, 2. feature sharing.

67 1. Tensor networks are adaptive: for example, the boundary pixels of the grayscale training data are not useful for learning, and the MPS bond dimensions adapt accordingly.

68 2. Feature sharing: in $f^\ell(x) = W^\ell \cdot \Phi(x)$, each label $\ell$ gets a different central tensor, while the "wings" of the MPS are shared between the models — this regularizes the models.

69-72 2. Feature sharing: the shared wings progressively learn shared features and deliver them to the label-dependent central tensor $\ell$.

73 Nature of the weight tensor: the representer theorem says the exact optimum is $W = \sum_j \alpha_j \Phi(x_j)$. [figure: density plots of the trained $W^\ell$ for each label $\ell = 0, 1, \ldots, 9$]

74 Nature of the weight tensor: the tensor network approximation can violate this condition, $W_{\mathrm{MPS}} \neq \sum_j \alpha_j \Phi(x_j)$ for any $\{\alpha_j\}$, so tensor network learning is not interpolation. Interesting consequences for generalization?

75 Some future directions: apply to 1D data sets (audio, time series); try other tensor networks (TTN, PEPS, MERA); is it useful to interpret $|W \cdot \Phi(x)|^2$ as a probability? — that could import even more physics insights; what features are extracted by individual elements of the tensor network?

76 What functions are realized for arbitrary $W$? Instead of the "spin" local feature map, use* $\phi(x_j) = (1, x_j)$. Recall the total feature map is $\Phi(x) = \begin{bmatrix} 1 \\ x_1 \end{bmatrix} \otimes \begin{bmatrix} 1 \\ x_2 \end{bmatrix} \otimes \begin{bmatrix} 1 \\ x_3 \end{bmatrix} \otimes \cdots \otimes \begin{bmatrix} 1 \\ x_N \end{bmatrix}$. *Novikov et al., arXiv:1605.03795

77 N=2 case with $\phi(x_j) = (1, x_j)$: $\Phi(x) = \begin{bmatrix} 1 \\ x_1 \end{bmatrix} \otimes \begin{bmatrix} 1 \\ x_2 \end{bmatrix} = (1, x_1, x_2, x_1 x_2)$, and with $W = (W_{11}, W_{21}, W_{12}, W_{22})$: $f(x) = W \cdot \Phi(x) = W_{11} + W_{21} x_1 + W_{12} x_2 + W_{22} x_1 x_2$
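
A quick numerical check of this expansion (random $W$; the slide's index values 1, 2 correspond to 0, 1 below):

```python
import numpy as np

phi = lambda xj: np.array([1.0, xj])      # local map (1, x_j)
x1, x2 = 0.3, 0.8

W = np.random.randn(2, 2)                 # W_{s1 s2}
f = np.einsum('st,s,t->', W, phi(x1), phi(x2))

# Matches W11 + W21*x1 + W12*x2 + W22*x1*x2 in the slide's notation:
expected = W[0, 0] + W[1, 0] * x1 + W[0, 1] * x2 + W[1, 1] * x1 * x2
assert np.isclose(f, expected)
```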

78 N=3 case with $\phi(x_j) = (1, x_j)$: $\Phi(x) = \begin{bmatrix} 1 \\ x_1 \end{bmatrix} \otimes \begin{bmatrix} 1 \\ x_2 \end{bmatrix} \otimes \begin{bmatrix} 1 \\ x_3 \end{bmatrix} = (1, x_1, x_2, x_3, x_1 x_2, x_1 x_3, x_2 x_3, x_1 x_2 x_3)$, so $f(x) = W \cdot \Phi(x) = W_{111} + W_{211} x_1 + W_{121} x_2 + W_{112} x_3 + W_{221} x_1 x_2 + W_{212} x_1 x_3 + W_{122} x_2 x_3 + W_{222} x_1 x_2 x_3$

79 General N case, $x \in \mathbb{R}^N$: $f(x) = W \cdot \Phi(x)$ expands into a constant term, singles $x_i$, doubles $x_i x_j$, triples $x_i x_j x_k$, and so on up to the full N-tuple $x_1 x_2 x_3 \cdots x_N$, each with its own weight component. The model has exponentially many formal parameters. Novikov, Trofimov, Oseledets, arXiv:1605.03795 (2016)

80 Related Work. Novikov, Trofimov, Oseledets (arXiv:1605.03795): matrix product states + kernel learning, stochastic gradient descent. Cohen, Sharir, Shashua (series of papers): tree tensor networks, expressivity of tensor network models, correlations of data (an analogue of entanglement entropy), a generative proposal.

81 Other MPS-related work (MPS = "tensor trains"): Markov random field models — Novikov et al., Proceedings of the 31st ICML (2014); large-scale PCA — Lee, Cichocki (2014); feature extraction of tensor data — Bengua et al., IEEE Congress on Big Data (2015); compressing weights of neural nets — Novikov et al., Advances in Neural Information Processing Systems (2015).
