Applications of the tensor RG method to machine learning

1 Applications of the tensor RG method to machine learning Yannick Meurice The University of Iowa With Sam Foreman, Joel Giedt, and Judah Unmuth-Yockey Happy birthday Thom! TGASEPC, August 19, 2017

2 Content of the talk 1 The Tensor renormalization group (TRG) method in a nutshell 2 What is Machine Learning? 3 RG ideas applied to digit recognition using the MNIST data The MNIST data The multiple layer perceptron Principal Components Analysis (PCA): relevant features Coarse graining procedures 4 The 2D Ising model near T c (Sam Foreman) 5 Work in progress 6 Conclusions

3 Tensor Renormalization Group (TRG) Blocking in configuration space is hard! (2D Ising example) Tensor RG: blocking is simple and exact! TRG: first implementation of the Wilson program for lattice models without uncontrollable approximations In this talk, we focus on the 2D Ising model example Not discussed here: O(N) and principal chiral models, gauge models (Ising, U(1) and SU(2)), and their quantum simulators For details see: PRB , PRD 88, and PRD , PRA , PRD , PRA , (PRD in press)

4 The Renormalization Group (RG) for lattice models Main ingredients: 1 Replace microscopic degrees of freedom by coarser ones (block spins, decimation, coarse graining...). 2 Rescale the new variables in such a way that the new interactions (encoded in a Hamiltonian or tensor...) can be compared to the original ones. 3 Find the fixed points of the RG transformation defined by 1. and 2. 4 Linearize the RG transformation at the fixed point; separate relevant (expanding) and irrelevant (contracting) directions. The relevant directions characterize the universal large distance behavior.

5 The 2D Ising model System of spins on a square lattice: σ_{x,y} = ±1 Hamiltonian: H = −∑_{x,y} σ_{x,y}(σ_{x+1,y} + σ_{x,y+1}) Partition function: Z = ∑_{{σ}} e^{−βH} Second order phase transition (order-disorder) at the self-dual β_c: tanh(β_c) = exp(−2β_c), which implies β_c = ln(1 + √2)/2 ≈ 0.4407 Transfer matrix exactly diagonalizable at finite volume (Kaufman) Critical exponents are known exactly First model to try when developing a new renormalization group (RG) method
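
As a quick cross-check of the self-duality condition quoted above, here is a minimal Python snippet (a sketch; names are illustrative) that solves tanh(β_c) = exp(−2β_c) numerically and compares it with the closed form ln(1 + √2)/2:

```python
import numpy as np
from scipy.optimize import brentq

# Self-duality condition for the 2D Ising model: tanh(beta_c) = exp(-2*beta_c)
beta_c = brentq(lambda b: np.tanh(b) - np.exp(-2.0 * b), 0.1, 1.0)
print(beta_c)                          # ~0.440687
print(np.log(1.0 + np.sqrt(2.0)) / 2)  # closed form ln(1 + sqrt(2))/2, same value
```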

6

7

8

9 Block Spinning in Configuration Space: Step 1 Square blocks in an A-B checkerboard; the B blocks are fixed backgrounds. We can block spin in the A blocks. Example: Pr(φ_A = 4 | φ_i^background) ∝ exp(β(4 + ∑_{i=1}^{8} φ_i^background)) (Diagram: an A block surrounded by B blocks in the checkerboard.)

10 Block Spinning in Configuration Space: Step 2, etc. The next step is to try to block spin in the B blocks. We can use step 1 and block spin in a given B block, but there are 20 background spins. Finding the effective energy function is nontrivial. (Diagram: a B block surrounded by A blocks and by already-blocked B' and B'' blocks.)

11 TRG blocking: it's simple and exact! For each link: exp(βσ₁σ₂) = cosh(β)(1 + tanh(β)σ₁σ₂) = cosh(β) ∑_{n₁₂=0,1} (√tanh(β) σ₁ √tanh(β) σ₂)^{n₁₂}. Regroup the four terms involving a given spin σ_i and sum over its two values ±1. The results can be expressed in terms of a tensor T^{(i)}_{xx'yy'} which can be visualized as a cross attached to the site i with the four legs covering half of the four links attached to i. The horizontal indices x, x' and vertical indices y, y' take the values 0 and 1, as does the index n₁₂. T^{(i)}_{xx'yy'} = f_x f_{x'} f_y f_{y'} δ(mod[x + x' + y + y', 2]), where f_0 = 1 and f_1 = √tanh(β). The delta symbol is 1 if x + x' + y + y' is zero modulo 2 and zero otherwise.
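
The local tensor defined above can be written down directly; the following is a minimal sketch, assuming the index ordering (x, x', y, y') and the half-link weight f₁ = √tanh β as reconstructed above:

```python
import numpy as np

def ising_tensor(beta):
    """Local TRG tensor T_{x x' y y'} for the 2D Ising model,
    with f_0 = 1 and f_1 = sqrt(tanh(beta)) on each half-link."""
    f = np.array([1.0, np.sqrt(np.tanh(beta))])
    T = np.zeros((2, 2, 2, 2))
    for x in range(2):
        for xp in range(2):
            for y in range(2):
                for yp in range(2):
                    if (x + xp + y + yp) % 2 == 0:   # parity constraint (closed paths)
                        T[x, xp, y, yp] = f[x] * f[xp] * f[y] * f[yp]
    return T
```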

12 TRG blocking (graphically) Exact form of the partition function: Z = (cosh(β))^{2V} Tr ∏_i T^{(i)}_{xx'yy'}. Tr means contractions (sums over 0 and 1) over the links. Reproduces the closed paths of the HT expansion. TRG blocking separates the degrees of freedom inside the block, which are integrated over, from those kept to communicate with the neighboring blocks. (Figure: four tensors with outer legs x₁, x₂, x₁', x₂', y₁, y₂, y₁', y₂' and internal legs x_U, x_D, x_L, x_R, contracted into a blocked tensor with collective indices X, X', Y, Y'.)

13 TRG Blocking (formulas) Blocking defines a new rank-4 tensor T'_{XX'YY'} where each index now takes four values. T'_{X(x₁,x₂) X'(x₁',x₂') Y(y₁,y₂) Y'(y₁',y₂')} = ∑_{x_U,x_D,x_R,x_L} T_{x₁ x_U y₁ y_L} T_{x_U x₁' y₂ y_R} T_{x_D x₂' y_R y₂'} T_{x₂ x_D y_L y₁'}, where X(x₁, x₂) is a notation for the product states, e.g., X(0, 0) = 1, X(1, 1) = 2, X(1, 0) = 3, X(0, 1) = 4. The partition function can be written exactly as Z = (cosh(β))^{2V} Tr ∏_{2i} T'^{(2i)}_{XX'YY'}, where 2i denotes the sites of the coarser lattice with twice the lattice spacing of the original lattice.
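
A sketch of the corresponding contraction with np.einsum is given below; the pairing of the internal legs x_U, x_D, x_L, x_R follows the formula above, and the exact plaquette geometry is otherwise an assumption:

```python
import numpy as np

def block_tensor(T):
    """One TRG blocking step: contract four rank-4 tensors (index order
    x, x', y, y') on a 2x2 plaquette into a single coarse tensor with
    collective indices X=(x1,x2), X'=(x1',x2'), Y=(y1,y2), Y'=(y1',y2')."""
    Tp = np.einsum('iume,uknr,dlrp,jdeo->ijklmnop', T, T, T, T)
    D = T.shape[0]
    # merge (x1,x2)->X, (x1',x2')->X', (y1,y2)->Y, (y1',y2')->Y'
    return Tp.reshape(D * D, D * D, D * D, D * D)
```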

14 Accurate exponents from approximate tensor renormalizations (YM, Phys. Rev. B 87, ) For the Ising model on square and cubic lattices, the truncation method (HOSVD) sharply singles out a surprisingly small subspace of dimension two. In the two-state limit, the transformation can be handled analytically, yielding a value for the critical exponent ν much closer to the exact value 1 than obtained in Migdal-Kadanoff approximations. Alternative blocking procedures that preserve the isotropy can improve the accuracy to ν = and respectively. Applications to other classical lattice models are possible, including models with fermions. TRG could become a competitor of the Monte Carlo method, suited to calculating critical exponents accurately, taking continuum limits, and studying near-conformal systems in large volumes.

15 More than two states When a few more states are added, the quality of the approximation does not immediately improve. One first observes oscillations, false bifurcations, and approximate degeneracies. The same observations were made by Leo Kadanoff ( ) and collaborators. Can a systematic understanding of this phenomenon be reached using the exact solution at finite volume (B. Kaufman, 1949) or approximate functional evolution (Curtright and Zachos, 2011)? Figure: thermal exponent x_T = 1/ν versus the truncation size χ in various calculations, summarized by Efi Efrati, Zhe Wang, Amy Kolan, and Leo P. Kadanoff, Rev. Mod. Phys. 86, 647 (May 2014).

16 RG aspects of Machine Learning 1 What is Machine Learning? 2 RG ideas applied to digit recognition using the MNIST data The MNIST data The multiple layer perceptron Principal Components Analysis (PCA): relevant features Coarse graining procedures 3 The 2D Ising model near T c (Sam Foreman)

17 Rosenblatt perceptron (1958) and its digital version Using output functions of the form y_l = σ(∑_j W_{lj} v_j), you can recognize 91 percent of the digits of the MNIST data (v_j: pixels, W_{lj}: tunable weights, σ(x): the sigmoid function).

18 Machine Learning Course (Andrew Ng, Stanford) About this course: Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Topics include: (i) Supervised learning (parametric/non-parametric algorithms, support vector machines, kernels, neural networks). (ii) Unsupervised learning (clustering, dimensionality reduction, recommender systems, deep learning). (iii) Best practices in machine learning (bias/variance theory; innovation process in machine learning and AI). [...]

19 MNIST data 60,000 handwritten digits; the "correct" answer is known. (Figure: the first 9 MNIST digits.)

20 MNIST data: 28 × 28 grayscale pixels There is a UV cutoff (pixels are uniform), an IR cutoff close to the overall size, and a typical size (the width of the lines).

21 A perceptron for the MNIST data (one hidden layer) We consider a perceptron model where the visible variables v_i are the 28 × 28 = 784 pixels of the MNIST digits (grayscale values between 0 and 1). We decided to take 196 = 784/4 hidden variables h_k. Later (see the hierarchical approximations below), we will "attach" them rigidly to 2 × 2 blocks of pixels. The hidden variables are: h_k = σ(∑_{j=1}^{784} W^{(1)}_{kj} v_j), k = 1, ..., 196. We choose the sigmoid function σ(x) = 1/(1 + exp(−x)). The output variables are y_l = σ(∑_k W^{(2)}_{lk} h_k), with l = 0, 1, ..., 9, which we want to associate with the MNIST characters by having y_l ≈ 1 for l = digit while y_l ≈ 0 for the 9 others.
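
A minimal sketch of the forward pass described on this slide (bias terms omitted, as in the formulas above; array shapes are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(v, W1, W2):
    """Forward pass of the one-hidden-layer perceptron described above.
    v:  (784,) grayscale pixels in [0, 1]
    W1: (196, 784) input-to-hidden weights
    W2: (10, 196)  hidden-to-output weights"""
    h = sigmoid(W1 @ v)   # 196 hidden variables h_k
    y = sigmoid(W2 @ h)   # 10 outputs y_l, one per digit
    return h, y
```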

22 The sigmoid function The sigmoid function σ(x) = 1/(1 + exp(−x)) is a popular choice of activation function (0 at large negative input, 1 at large positive input). We have σ'(x) = σ(x) − σ(x)², which allows simple algebraic manipulations for the gradients. (Figure: the sigmoid function.)

23 Gradient search Given a learning set {v_i^{(n)}} with n = 1, 2, ..., N ≤ 60,000 and the corresponding target vectors {t_l^{(n)}}, we minimize the loss function E(W^{(1)}, W^{(2)}) = (1/2) ∑_{n=1}^{N} ∑_{l=0}^{9} (y_l^{(n)} − t_l^{(n)})², with y_l^{(n)} = σ(∑_k W^{(2)}_{lk} σ(∑_j W^{(1)}_{kj} v_j^{(n)})). The weight matrices W^{(1)} and W^{(2)} are initialized with random numbers following a normal distribution. They are then optimized using a gradient method with gradients: G^{(2)}_{lk} ≡ ∂E/∂W^{(2)}_{lk} = ∑_n (y_l − t_l)(y_l − y_l²) h_k, G^{(1)}_{kj} ≡ ∂E/∂W^{(1)}_{kj} = ∑_n ∑_l (y_l − t_l)(y_l − y_l²) W^{(2)}_{lk} (h_k − h_k²) v_j.
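
A per-sample sketch of these gradients (the sum over samples is left to the caller; the factor σ' = σ − σ² matches the sigmoid derivative quoted earlier):

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def gradients(v, t, W1, W2):
    """Gradients of the quadratic loss for a single sample (v, t);
    the sum over the N training samples is done by the caller."""
    h = sigmoid(W1 @ v)                    # (196,)
    y = sigmoid(W2 @ h)                    # (10,)
    delta_out = (y - t) * (y - y**2)       # uses sigma' = sigma - sigma^2
    G2 = np.outer(delta_out, h)            # dE/dW2, shape (10, 196)
    delta_hid = (W2.T @ delta_out) * (h - h**2)
    G1 = np.outer(delta_hid, v)            # dE/dW1, shape (196, 784)
    return G1, G2

# one gradient-descent step with learning rate eta:
#   W2 -= eta * G2;  W1 -= eta * G1
```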

24 Performance After one "learning cycle" (going through the entire MNIST training data), we get a performance of about 0.95 (number of correct identifications / number of attempts); after 10 learning cycles, the performance saturates near 0.98 (with 196 hidden variables and learning parameters 0.1 and 0.5). It is straightforward to introduce more hidden layers: y_l^{(n)} = σ(W^{(m)} ... σ(W^{(2)} σ(W^{(1)} v^{(n)}))...). With two hidden layers, the performance improves (but only very slightly). On the other hand, if we remove the hidden layer (as in Rosenblatt's experiment), the performance goes down to about 0.91.

25 Failures (2 percent) It is instructive to look at the outputs for the 2 percent of cases where the algorithm fails to identify the correct digit. Here, the outputs suggest 1, 7 and 9, with a slightly larger value for 1 than for 9, which is the correct answer according to the MNIST data. In the following, I will focus more on getting comparable performance with fewer learning parameters rather than on reducing the number of failures. (Figure: testing sample 320, a nine, and the outputs y_l for this sample.)

26 Recent attempts to use RG ideas It has been suggested by Mehta and Schwab ( ) that the hidden variables can be related to the RG "block variables" and that a hierarchical organization inspired by physics modeling could drastically reduce the number of learning parameters. This may be called "cheap" learning (Lin and Tegmark, ). The notion of criticality or fixed point has not been identified precisely on the ML side. It is not clear that the technical meaning of "relevant", as used in the RG context to describe unstable directions of the RG transformation linearized near a nontrivial fixed point, can be used generically in the ML context. We are exploring this question for the MNIST data (which is not "critical") and the more tunable Ising model.

27 Principal Component Analysis (PCA, 19th cent. tech.) Identifying "relevant directions" may allow a drastic reduction of the information necessary to calculate observables. Relevant directions: directions with the largest variance. v_i^{(n)}: grayscale value of the i-th pixel in the n-th sample. v̄_i: average grayscale value of the i-th pixel. Covariance matrix: C_{ij} = (1/N) ∑_{n=1}^{N} (v_i^{(n)} − v̄_i)(v_j^{(n)} − v̄_j). We can project the original data into a low-dimensional subspace corresponding to the largest eigenvalues of C_{ij}. Can we keep the crucial information in a small subspace? For the MNIST data, the 784-pixel digits are recognizable using projections onto low-dimensional subspaces.
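
A compact sketch of this PCA projection for flattened MNIST images (function name and shapes are illustrative):

```python
import numpy as np

def pca_project(V, d):
    """Project flattened images V (shape (N, 784)) onto the d leading
    principal components and map them back to pixel space."""
    vbar = V.mean(axis=0)                    # average grayscale value per pixel
    X = V - vbar
    C = X.T @ X / V.shape[0]                 # covariance matrix C_ij (784 x 784)
    evals, evecs = np.linalg.eigh(C)         # eigenvalues in ascending order
    U = evecs[:, -d:]                        # d leading eigenvectors ("eigenfaces")
    coords = X @ U                           # PCA coordinates, shape (N, d)
    return coords @ U.T + vbar               # low-dimensional reconstruction
```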

28 MNIST averages Figure: Empirical probability for 60,000 digits (includes all digits, left) and 5,923 zeros (right).

29 PCA eigenvectors ( eigenfaces")

30 PCA projections of an MNIST digit (Figure: PCA projections of an MNIST digit in subspaces of dimension 10, 20, 30, 40, 50, 60, 70 and 80, compared to the original 784-dimensional digit, lower right.)

31 Success rate of (drastic) PCA projections Figure: Success rate for PCA projections of the images (blue, using the PCA eigenvectors = images) or using the PCA coordinates only (red, no images) as a function of the dimension of the subspace. The green line represents the asymptotic value (98 percent) for the original version (dimension 784).

32 Blocking We have used the one-layer perceptron with blocked images. First we replaced each square of four pixels by a single pixel carrying the average value of the four blocked pixels (28 × 28 → 14 × 14). Using the blocked pictures with 49 hidden variables, the success rate goes down slightly (97 percent). Repeating once more, we obtain 7 × 7 images. With 25 hidden variables, we get a success rate of 92 percent. (Figure: a digit blocked once and blocked twice.)
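
The 2 × 2 averaging used here can be written in a couple of lines; a sketch assuming square images with an even side length:

```python
import numpy as np

def block_average(img):
    """Replace each 2x2 square of pixels by its average
    (e.g. 28x28 -> 14x14, then 14x14 -> 7x7)."""
    n = img.shape[0] // 2
    return img.reshape(n, 2, n, 2).mean(axis=(1, 3))
```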

33 Ising projection We can replace the grayscale pixels by black and white pixels. This barely affects the performance (97.6 percent) but drastically reduces the configuration space and allows a Restricted Boltzmann Machine treatment (not discussed here). (Figure: the original digit and its Ising-like version with graycut = 0.5.)
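
The Ising projection is just a threshold on the grayscale values; a one-function sketch with the graycut of 0.5 used above:

```python
import numpy as np

def ising_projection(img, graycut=0.5):
    """Map grayscale pixels in [0, 1] to black/white (0/1) pixels."""
    return (img > graycut).astype(np.int8)
```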

34 Cheap learning (hierarchical approximations) We have considered the hierarchical approximation where each hidden variable is only connected to a single 2 × 2 block of visible variables (pixels): ∑_j W^{(1)}_{lj} v_j → ∑_α W^{(1)}_{l,α} v_{l,α}, with α = 1, 2, 3, 4 the position in the 2 × 2 block and l = 1, ..., 196 the labeling of the blocks. Even though the number of parameters that we need to determine with the gradient method is significantly smaller (by a factor 196), the performance remains comparable. A generalization with 4 × 4 blocks leads to a 0.90 performance with 1/4 as many weights. This simplified version can be used as a starting point for a full gradient search (pretraining), but the hierarchical structure (sparsity of W_{ij}) is robust and remains visible during the training. This pretraining breaks the huge permutation symmetry of the hidden variables.
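
A sketch of this block-local first layer (each hidden variable sees only its own 2 × 2 block; the reshaping convention is an assumption):

```python
import numpy as np

def hidden_blocked(v_img, W1_blocks):
    """Hierarchical ("cheap") first layer: hidden variable l is connected
    only to its own 2x2 block of pixels.
    v_img:     (28, 28) image
    W1_blocks: (196, 4) one length-4 weight vector per block."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    blocks = v_img.reshape(14, 2, 14, 2).transpose(0, 2, 1, 3).reshape(196, 4)
    return sigmoid(np.sum(W1_blocks * blocks, axis=1))   # 196 hidden variables
```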

35 Transition to the 2D Ising model The MNIST data has a typical size built in, and UV details can be erased until the relevant information contained at that size is reached. One can think of the various digits as "phases", but there is nothing like a critical point. We now consider worm configurations for the 2D Ising model at different β = 1/T, some close to the critical value. The graphs for the Ising model calculations have been made by Sam Foreman.

36 Worm Algorithm 1 Randomly select a starting point on the lattice and set head = tail = j. Choose a nearest-neighbor site j' at random for a possible update. 2 If random() < R: move the head to the new site j' and update the bond number (the number of currently active bonds), n_b → n_b ± 1, where R is the acceptance ratio defined by R = βJ/(n_b + 1) when n_b → n_b + 1, and R = n_b/(βJ) when n_b → n_b − 1. (1) 3 While head ≠ tail, repeat 2. If head = tail, update the partition function estimator Z → Z + 1, and go to 1. 4 Note that when this update is accepted, the global number of active bonds N_b is updated by one, i.e. N_b → N_b ± 1. 5 For the Ising model, one can use resummed versions with links occupied 0 or 1 times with a weight tanh(β).
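
Below is a sketch of the resummed version mentioned in point 5 (each link occupied 0 or 1 times with weight tanh βJ); it is a simplified Metropolis worm and does not use the acceptance ratio of Eq. (1), which applies to the multiply-occupied representation. Here rng is assumed to be a numpy Generator, e.g. np.random.default_rng():

```python
import numpy as np

def worm_update(bonds, L, beta_J, rng):
    """One simplified Metropolis worm update in the resummed (0/1 bond)
    representation with link weight w = tanh(beta*J).
    bonds: dict {(site_a, site_b): 0 or 1} with site_a < site_b,
    on a periodic L x L lattice."""
    w = np.tanh(beta_J)
    tail = head = (int(rng.integers(L)), int(rng.integers(L)))
    while True:
        dx, dy = [(1, 0), (-1, 0), (0, 1), (0, -1)][rng.integers(4)]
        nxt = ((head[0] + dx) % L, (head[1] + dy) % L)
        link = (min(head, nxt), max(head, nxt))       # undirected link label
        occupied = bonds.get(link, 0)
        ratio = 1.0 / w if occupied else w            # toggle the bond occupation
        if rng.random() < ratio:
            bonds[link] = 1 - occupied
            head = nxt
        if head == tail:                              # worm closed: one update done
            return bonds
```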

37 Main points Legal worms (high-temperature graphs) = combinations of tensors. Specific heat: C_v = A(β) + B(β)⟨N_b⟩ + C(β)⟨(N_b − ⟨N_b⟩)²⟩, with N_b the number of links in a worm. ⟨(N_b − ⟨N_b⟩)²⟩ ∼ (2/π) ln(|T − T_c|). Conjecture: λ_max^PCA ≈ (3/2V)⟨(N_b − ⟨N_b⟩)²⟩. Blocking results: similar shapes as unblocked. Quantitative features and RG interpretations are in progress (not discussed here).

38 Sampling of Ising HT contributions ( worms")

39 Blocked worms (8x8)

40 Worms (16x16)

41 Blocked worms (16x16)

42 Blocking We implement a coarse-graining renormalization approach, where the lattice is divided into blocks of 2 × 2 squares. Each 2 × 2 square is then blocked into a single site, where the new external bonds in a given direction are determined by the number of active bonds exiting the square. If a given block has one external bond in a given direction, the blocked site retains this bond in the blocked configuration; otherwise it is ignored. Results for the leading PCA eigenvalue and the specific heat for blocked configurations are qualitatively similar to the unblocked results. These results and their RG interpretation are still preliminary and will not be presented here. The blocking procedure is approximate. Better approximations can be constructed with the TRG method (YM arxiv: , Phys. Rev. B 87, ). Worm implementations of the critical theory are in progress.
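
A sketch of this blocking rule for worm configurations, under the reading that a coarse bond is kept when exactly one of the two fine bonds leaving the block in that direction is active (variable names and the bond-array layout are assumptions):

```python
import numpy as np

def block_bonds(hbonds, vbonds):
    """Block a bond configuration on 2x2 squares.
    hbonds[x, y]: bond between sites (x, y) and (x+1, y) (periodic L x L),
    vbonds[x, y]: bond between sites (x, y) and (x, y+1).
    A coarse bond is kept when exactly one fine bond exits the block
    in that direction."""
    L = hbonds.shape[0]
    n = L // 2
    Hb = np.zeros((n, n), dtype=int)
    Vb = np.zeros((n, n), dtype=int)
    for X in range(n):
        for Y in range(n):
            x, y = 2 * X, 2 * Y
            right = hbonds[x + 1, y] + hbonds[x + 1, y + 1]   # bonds exiting to the right
            up = vbonds[x, y + 1] + vbonds[x + 1, y + 1]      # bonds exiting upward
            Hb[X, Y] = 1 if right == 1 else 0
            Vb[X, Y] = 1 if up == 1 else 0
    return Hb, Vb
```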

43 Specific Heat (Sam Foreman) The worm algorithm allows statistically exact calculations of the specific heat. In order to observe finite-size scaling effects, we performed this analysis for lattice sizes L = 4, 8, 16, 32.

44 Specific Heat (Sam Foreman)

45 Leading singularity fit

46 Comparison with exact results

47 PCA leading eigenvalue (Sam Foreman)

48 PCA leading eigenvalue (Sam Foreman)

49 PCA leading eigenvalue and specific heat

50 Conclusions ML has a clear RG flavor: how to extract relevant features from noisy pictures (configurations). RG ideas allow us to reduce the complexity of the perceptron algorithms for the MNIST data. Tensor RG methods are being used directly for the ML treatment of worm configurations of the 2D Ising model near criticality. Quantitative tests are in progress. Restricted Boltzmann Machines = Ising models. ML for Lattice Field Theory: compress configurations, find features or updates (Lei Wang ). Private funding for Lattice Field Theory?

51 Thanks for listening and waiting (the image is cooling!) β = 0.1

52 Thanks for listening and waiting (the image is cooling!) β = 0.2

53 Thanks for listening and waiting (the image is cooling!) β = 0.3

54 Thanks for listening and waiting (the image is cooling!) β = 0.4

55 Thanks for listening and waiting (the image is cooling!) β = 0.5

56 Thanks for listening and waiting (the image is cooling!) β = 0.6

57 Happy birthday Thom!

58 Acknowledgements: This research was supported in part by the Dept. of Energy under Award Numbers DE-SC , DE-SC , and DE-SC .

59 Truncation procedure M^{⟨ij⟩}_{X(x₁,x₂) X'(x₁',x₂') y y'} = ∑_{y''} T^{(i)}_{x₁ x₁' y y''} T^{(j)}_{x₂ x₂' y'' y'}, which can be put in a canonical form by using a Higher Order Singular Value Decomposition defined by a unitary transformation on each of the four indices (see Tao Xiang et al. PRB 86). The unitary transformation for each index is the one that diagonalizes the symmetric tensor obtained by summing over all the other indices in the following way: G_{X X̃} = ∑_{X', y, y'} M_{X X' y y'} M_{X̃ X' y y'}.
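
A sketch of this truncation step in Python (index ordering (x, x', y, y') as before; χ is the number of states kept; names are illustrative):

```python
import numpy as np

def hosvd_projector(T, chi):
    """Form M from two tensors, build G by summing over all indices but the
    first, and keep the chi eigenvectors of G with the largest eigenvalues."""
    D = T.shape[0]
    # M_{(x1,x2),(x1',x2'),y,y'} = sum_t T_{x1,x1',y,t} T_{x2,x2',t,y'}
    M = np.einsum('abyt,cdtz->acbdyz', T, T).reshape(D * D, D * D, D, D)
    # G_{X,Xt} = sum_{X',y,y'} M_{X,X',y,y'} M_{Xt,X',y,y'}
    G = np.einsum('aijk,bijk->ab', M, M)
    evals, evecs = np.linalg.eigh(G)           # ascending eigenvalues
    order = np.argsort(evals)[::-1][:chi]
    return evecs[:, order], evals[order]       # projector columns, kept eigenvalues
```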

60 Eigenvalues of G The two new states have the form |0'⟩ = cos φ |00⟩ + sin φ |11⟩ and |1'⟩ = (|01⟩ + |10⟩)/√2. The angle φ is obtained by diagonalizing the even-even block. The symmetric form of |1'⟩ is a consequence of G₃₃ = G₄₄, itself due to reflection symmetry.

61 A simple 3-parameter map With the two-state projection, we obtain a new rank-4 tensor with indices again taking two values and the same parity selection rule. We divide the new tensor by a constant in such a way that, dropping all the primes hereafter, T₀₀₀₀ = 1. In the isotropic case, reflection symmetry imposes T₁₀₁₀ = T₀₁₁₀ = T₁₀₀₁ = T₀₁₀₁ ≡ t₁, T₁₁₀₀ = T₀₀₁₁ ≡ t₂, T₁₁₁₁ ≡ t₃. For the initial tensor, we have what we call later the "Ising condition": t₁ = t₂ and t₃ = t₁², but this property is not preserved by the blocking procedure, which can be expressed as a mapping of the three-dimensional parameter space (t₁, t₂, t₃) into itself. The map consists of rational expressions involving the values of the tensor and trigonometric functions of φ. These are analytical expressions, and derivatives can be taken to calculate the linearized map.

62 Numerical Results A fixed point is found for β = , not far from the exact value . The fixed point is approximately t₁ = , t₂ = , t₃ = and shows significant departure from the Ising condition. The eigenvalues of the linearized RG transformation are λ₁ = 2.0177, λ₂ = and λ₃ = . The scaling factor b is 2, and this implies the values for the critical exponents ν = log b / log λ₁ and ω = −log λ₂ / log b ≈ 2.31, which can be compared with the exact values 1 and 2, respectively. For comparison, the (also isotropic) Migdal recursion for the D = 2 Ising model and b = 2 can be written as t' = 2t²/(1 + t⁴). The fixed point is at t = and corresponds to β_c = and ν = .

63 Restricted Boltzmann Machines (say for MNIST data) v (visible) and h (hidden) are vectors of binary variables (Ising spins). v are the 784 black and white pixels of a MNIST picture. p(v, h) = (1/Z) exp(∑_{kj} h_k W_{kj} v_j + ∑_k b_k h_k + ∑_j a_j v_j). p(v) = ∑_{{h_k=0,1}} (1/Z) exp(∑_{kj} h_k W_{kj} v_j + ∑_k b_k h_k + ∑_j a_j v_j). Z = ∑_{{h_k=0,1, v_j=0,1}} exp(∑_{kj} h_k W_{kj} v_j + ∑_k b_k h_k + ∑_j a_j v_j). Given an empirical image v, you can maximize the model probability (determine the learning parameters W_{kj} etc.) by using gradients: ∂ln p(v)/∂W_{kj} = (1/Z(v)) ∂Z(v)/∂W_{kj} − (1/Z) ∂Z/∂W_{kj}, with Z(v) = ∑_{{h_k=0,1}} exp(∑_{kj} h_k W_{kj} v_j + ∑_k b_k h_k + ∑_j a_j v_j). For Z(v), v corresponds to the black and white pixels of a given image; the summation over the binary hidden variables factorizes and can be performed analytically: ∏_k (1 + exp(b_k + ∑_j W_{kj} v_j)). On the other hand, Z requires the sum over the visible binary variables (v_j = (1 + σ_j)/2 are essentially Ising spins).
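
A sketch of the factorized sum over hidden variables and of the conditional p(h | v) it implies (names and shapes are illustrative, not a full training loop):

```python
import numpy as np

def log_unnormalized_p(v, W, a, b):
    """log Z(v): the sum over binary hidden variables done analytically, giving
    exp(sum_j a_j v_j) * prod_k (1 + exp(b_k + sum_j W_kj v_j)).
    v: (784,) binary pixels, W: (nh, 784), a: (784,), b: (nh,)."""
    return a @ v + np.sum(np.log1p(np.exp(b + W @ v)))

def sample_hidden(v, W, b, rng):
    """Conditional p(h_k = 1 | v) = sigmoid(b_k + sum_j W_kj v_j)."""
    p = 1.0 / (1.0 + np.exp(-(b + W @ v)))
    return (rng.random(p.shape) < p).astype(np.int8)
```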

64 Finding the phase from worm pictures (Sam Foreman) The one-layer perceptron with fully connected W can be used to try to determine the phase from a worm picture. The target values are determined using the known infinite-volume transition. Obviously, this information becomes more accurate as the size increases.

65 Output functions for fully connected W (Sam Foreman) Figure: Softmax outputs vs. temperature. These results can be improved using ConvNet algorithms which have a clear RG flavor (Sam Foreman, work in progress).
