Mathematics for Artificial Intelligence

Simon Merritt
5 years ago

1 Mathematics for Artificial Intelligence Reading Course Elena Agliari Dipartimento di Matematica Sapienza Università di Roma

2 (TENTATIVE) PLAN OF THE COURSE
Introduction
Chapter 1: Basics of statistical mechanics
  The Curie-Weiss model
Chapter 2: Neural networks for associative memory and pattern recognition
Chapter 3: The Hopfield model
  Hopfield model with low load and solution via log-constrained entropy
  Self-averaging, spurious states, phase diagram
  Hopfield model with high load and solution via stochastic stability
Chapter 4: Beyond the Hebbian paradigm
Chapter 5: A gentle introduction to machine learning
  Maximum likelihood and Bayesian learning
  Concepts and preliminaries in machine learning
Chapter 6: Neural networks for statistical learning and feature discovery
  Rosenblatt and Minsky & Papert perceptrons
  Restricted Boltzmann machines
Chapter 7: A few remarks on deep learning, complex patterns, and outlooks
  Multilayered Boltzmann machines and deep learning
  Mapping restricted Boltzmann machines and Hopfield networks
Seminars: Numerical tools for machine learning; Non-mean-field neural networks; (Bio-)Logic gates; Maximum entropy approach; Hamilton-Jacobi techniques for mean-field models

3 (Restricted) Boltzmann machines
A Boltzmann machine (BM) is a type of stochastic recurrent neural network. Assuming symmetric couplings among units, detailed balance holds and the invariant measure is of the Boltzmann-Gibbs type, after which these machines are named. The units in the BM are divided into visible units and hidden units. The visible units are those that receive information from the environment.
Learning is impractical in general Boltzmann machines, yet it can be made quite efficient in a restricted architecture, the Restricted Boltzmann Machine (RBM), which does not allow intralayer connections. RBMs were initially invented under the name Harmonium by Paul Smolensky in 1986, and rose to prominence after Geoffrey Hinton and collaborators invented fast learning algorithms for them in the mid-2000s.
After training one RBM, the activities of its hidden units can be treated as data for training a higher-level RBM. This method of stacking RBMs makes it possible to train many layers of hidden units efficiently and is one of the most common deep learning strategies. As each new layer is added, the overall generative model improves.

4 Restricted Boltzmann machines
Three layers composed of N+K+M binary neurons
Interaction matrix (symmetric, zero diagonal)

5 Hidden layer: each neuron σj corresponds to a feature and is set on/off according to whether the sampled individual matches feature j. Ex.: i=1 -> Thriller, i=2 -> SF, ..., i=4 -> Romance
Output layer: each neuron yk provides the likelihood of interest of the sampled individual towards an item k. Ex.: i=1 -> Asimov, i=2 -> Brontë, ..., i=4 -> King
Input layer: each neuron xi corresponds to a movie and is set on/off according to whether the sampled individual likes movie i. Ex.: i=1 -> Harry Potter, i=2 -> Avatar, ..., i=4 -> Titanic

6 Stochastic dynamics for neurons
The parameters $J_{ij}$ give the magnitude of the pairwise couplings between spins, the parameters $\theta_i$ determine the firing thresholds of the neurons, and the parameter $\beta$ controls the degree of randomness in the dynamics. For $\beta = 0$ the dynamics just assigns random values to the states of the updated neurons. For $\beta \to \infty$ the candidate neurons align their states strictly to the sign of the local fields.
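The role of β can be illustrated with a small sketch. The update rule itself is not spelled out in the text above, so the code assumes a Glauber-type rule, $P(s_i = +1) = \tfrac12[1 + \tanh(\beta h_i)]$ with local field $h_i = \sum_j J_{ij} s_j + \theta_i$; the function names and the two-neuron example are hypothetical.

```python
import math
import random

def local_field(i, s, J, theta):
    """Local field h_i = sum_{j != i} J[i][j] * s[j] + theta[i] acting on neuron i."""
    return sum(J[i][j] * s[j] for j in range(len(s)) if j != i) + theta[i]

def update_neuron(i, s, J, theta, beta, rng):
    """Resample neuron i in place (assumed rule): s_i = +1 with probability (1 + tanh(beta*h_i))/2."""
    p_up = 0.5 * (1.0 + math.tanh(beta * local_field(i, s, J, theta)))
    s[i] = 1 if rng.random() < p_up else -1
    return p_up

rng = random.Random(0)
J = [[0.0, 1.0], [1.0, 0.0]]     # symmetric couplings, zero diagonal
theta = [0.0, 0.0]

s = [1, 1]
p_noise = update_neuron(0, s, J, theta, beta=0.0, rng=rng)   # beta = 0: coin flip
s = [1, 1]
p_cold = update_neuron(0, s, J, theta, beta=50.0, rng=rng)   # large beta: follows sign of h_0
```

With β = 0 the probability is 1/2 whatever the field, while for large β the neuron follows the sign of its local field almost surely.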

7 Stochastic dynamics for neurons
Remark: focus on the probability $p_t(s)$ to find the system in a certain state s at a certain time t.
Remark: we have the freedom to update only a subset S of the neurons.
Prop: the unique stationary probability distribution of the neuronal dynamics formulated above is
$$p_\infty(s) = \frac{1}{Z}\, e^{-\beta H(s)}, \qquad H(s) = -\frac{1}{2}\sum_{ij} J_{ij} s_i s_j - \sum_i \theta_i s_i$$

8 [Figure: operation modes of the network. Panels: x is clamped; x and y are clamped.]

9 Task: learn a prescribed target joint input-output probability distribution q(x,y). The system has accomplished this when $p_\infty(x,y)$ (i.e., the equilibrium input-output probability distribution of the network) equals the target distribution q(x,y).
Performance measure: the distance between q(x,y) and $p_\infty(x,y)$, measured by the Kullback-Leibler distance
$$D(q\|p_\infty) = \sum_{x,y} q(x,y)\,\log_2\frac{q(x,y)}{p_\infty(x,y)}$$
1. Non-negative: $D(q\|p_\infty) \ge 0$
2. $D(q\|p_\infty) = 0$ iff $q(x,y) = p_\infty(x,y)$
3. Additive for independent distributions
4. Convex in the pair (q, p)
Interpretation: if your data comes from a probability distribution Q, but you use a compression scheme optimised for P, then D(Q‖P) is the number of extra bits you will need to store a record of each sample from Q.
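As a small numerical illustration of this distance, the sketch below evaluates D(q‖p) in bits for two hypothetical distributions over pairs (x, y). It assumes base-2 logarithms, in line with the "extra bits" reading, and that p is strictly positive wherever q is.

```python
import math

def kl_divergence_bits(q, p):
    """D(q||p) = sum_z q(z) * log2(q(z) / p(z)), with the convention 0 * log(0/p) = 0.

    Assumes p[z] > 0 wherever q[z] > 0; otherwise the distance is infinite.
    """
    total = 0.0
    for z, qz in q.items():
        if qz > 0.0:
            total += qz * math.log2(qz / p[z])
    return total

# Hypothetical target: x and y always agree. Hypothetical model: uniform.
q = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}
p = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

d_qp = kl_divergence_bits(q, p)   # one extra bit per sample
d_qq = kl_divergence_bits(q, q)   # zero: the distributions coincide
```

Coding samples from q with a code built for the uniform p costs two bits per sample, whereas q itself only needs one, hence the one-bit excess.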

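10 Gradient descent learning rule
$$\Delta\lambda = -\epsilon\,\frac{\partial}{\partial\lambda} D(q\|p_\infty)$$
where λ is any parameter in the system. This ensures that $D(q\|p_\infty)$ decreases monotonically until a stationary state is reached, which could, but need not, correspond to $D(q\|p_\infty) = 0$. Thus, we need
$$\frac{\partial}{\partial\lambda} D(q\|p_\infty) = -\frac{1}{\ln 2}\sum_{x,y} q(x,y)\,\frac{\partial}{\partial\lambda}\ln p_\infty(x,y)$$
Details of the calculation depend on the network operation mode (A, B, C).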

11 A: Operation with all neurons freely evolving
Each individual modification step, to be carried out repeatedly, thus involves:
(i) Operate the neuronal dynamics with input and output neuron states (x,y) fixed until equilibrium is reached, and measure the averages $\langle s_i s_j\rangle_+$ and $\langle s_i\rangle_+$; repeat this for many combinations of (x,y) generated according to the desired joint distribution q(x,y)
(ii) Operate the neuronal dynamics with all neurons evolving freely until equilibrium is reached, and measure the averages $\langle s_i s_j\rangle_-$ and $\langle s_i\rangle_-$
(iii) Insert the results of (i) and (ii) into the learning rule and execute it

12 B: Operation with hidden and output neurons freely evolving
Each individual modification step, to be carried out repeatedly, now involves:
(i) Operate the neuronal dynamics with input and output neuron states (x,y) fixed until equilibrium is reached, and measure the averages $\langle s_i s_j\rangle_+$ and $\langle s_i\rangle_+$; repeat this for many combinations of (x,y) generated according to the desired joint distribution q(x,y)
(ii) Operate the neuronal dynamics with all hidden and output neurons evolving freely until equilibrium is reached, and measure the averages $\langle s_i s_j\rangle_-$ and $\langle s_i\rangle_-$; repeat this for many input configurations x generated according to the desired distribution $q(x) = \sum_y q(x,y)$
(iii) Insert the results of (i) and (ii) into the learning rule and execute it

13 Case A: learning is accomplished when the first and second moments of the free system match those of the system with clamped input and output.
Case B: learning is accomplished when the first and second moments of the system with clamped input match those of the system with clamped input and output.
Evaluating the averages is complex!
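For a network small enough to enumerate, the moment-matching idea of Case A can be tried out exactly. The sketch below is a toy under stated assumptions: two visible ±1 units standing in for (x,y), one hidden unit, β = 1, all constant prefactors absorbed into a hypothetical learning rate `eps`, and a made-up target `q_target`. Averages are computed by brute-force summation, not by running the neuronal dynamics.

```python
import itertools
import math

N_VIS, N_HID = 2, 1                 # toy sizes (assumption)
N = N_VIS + N_HID
STATES = list(itertools.product([-1, 1], repeat=N))

def energy(s, J, theta):
    """H(s) = -1/2 sum_ij J_ij s_i s_j - sum_i theta_i s_i (J symmetric, zero diagonal)."""
    pair = sum(J[i][j] * s[i] * s[j] for i in range(N) for j in range(i + 1, N))
    return -pair - sum(theta[i] * s[i] for i in range(N))

def boltzmann(J, theta):
    """Equilibrium distribution over all 2^N states at beta = 1."""
    w = [math.exp(-energy(s, J, theta)) for s in STATES]
    Z = sum(w)
    return [x / Z for x in w]

def visible_marginal(p, v):
    return sum(ps for s, ps in zip(STATES, p) if s[:N_VIS] == v)

def moments(weights):
    """First and second moments <s_i> and <s_i s_j> under the given state weights."""
    m1 = [sum(w * s[i] for w, s in zip(weights, STATES)) for i in range(N)]
    m2 = [[sum(w * s[i] * s[j] for w, s in zip(weights, STATES)) for j in range(N)]
          for i in range(N)]
    return m1, m2

def clamped_weights(p, q):
    """q(v) * p(h | v): visible units follow the target, the hidden unit equilibrates."""
    return [q[s[:N_VIS]] * ps / visible_marginal(p, s[:N_VIS]) for s, ps in zip(STATES, p)]

def kl_bits(q, p):
    return sum(qv * math.log2(qv / visible_marginal(p, v)) for v, qv in q.items() if qv > 0)

def learn(q, steps=200, eps=0.2):
    J = [[0.0] * N for _ in range(N)]
    theta = [0.0] * N
    history = []
    for _ in range(steps):
        p = boltzmann(J, theta)
        history.append(kl_bits(q, p))
        m1_plus, m2_plus = moments(clamped_weights(p, q))   # clamped phase
        m1_minus, m2_minus = moments(p)                     # free phase
        for i in range(N):
            theta[i] += eps * (m1_plus[i] - m1_minus[i])
            for j in range(N):
                if i != j:
                    J[i][j] += eps * (m2_plus[i][j] - m2_minus[i][j])
    return J, theta, history

q_target = {(-1, -1): 0.4, (-1, 1): 0.1, (1, -1): 0.1, (1, 1): 0.4}
J_fit, theta_fit, kl_history = learn(q_target)
```

In this toy case the distance in `kl_history` shrinks towards zero and the free and clamped moments end up matching; in realistic networks the sums over all states are out of reach, which is the difficulty noted above.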

14 A telegraphic introduction to Markov Chain Monte Carlo (MCMC) sampling
Evaluating a thermodynamic average by summing over all possible configurations is beyond reach if N is large.
Evaluating a thermodynamic average by summing over a subset of randomly drawn configurations does not work either: relevant selections are unlikely.
Markov chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from a probability distribution, based on constructing a Markov chain that has the desired distribution as its equilibrium distribution. The state of the chain after a number of steps is then used as a sample of the desired distribution. The quality of the sample improves as a function of the number of steps.
$$\sigma(0) \to \sigma(1) \to \dots \to \sigma(t) \to \sigma(t+1) \to \dots \to \sigma(t+M)$$
Here $\sigma(0)$ is a random initial configuration, $\sigma(t)$ is an equilibrium configuration, and the average is taken over the M samples that follow:
$$\langle f(\sigma)\rangle \approx \frac{1}{M}\sum_{k=1}^{M} f(\sigma(t+k))$$

15 Let p be the probability distribution (over a finite set X) you want to sample from. Suppose you are in configuration $i \in X$ and there are $r-1$ possible new configurations you can reach in one move, that is, overall r possible outcomes (including retaining the current configuration).
Pick a neighbouring configuration $j \in X$ with probability $1/r$.
If you pick configuration j and $p(j) \ge p(i)$, then deterministically go to j.
Otherwise, $p(j) < p(i)$, and you go to j with probability $p(j)/p(i)$.
The probability weight $p_{ij}$ for the transition $i \to j$ is indeed a probability distribution for each configuration i and turns out to be
$$p_{ij} = \frac{1}{r}\min\left(1, \frac{p(j)}{p(i)}\right) \quad (j \ne i), \qquad p_{ii} = 1 - \sum_{j \ne i} p_{ij}$$
Also, p is the stationary distribution for this chain, as one can prove by checking that detailed balance holds. We could take a large sample $S \subset X$ via this solution to the sampling problem, and then compute the average value of f on that sample.
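The detailed-balance claim can be checked numerically on a toy example. The sketch below builds the transition matrix from the formula above for a hypothetical target on four configurations arranged on a ring, so that every configuration has the same number r of outcomes, which the construction assumes.

```python
def metropolis_matrix(p, neighbours):
    """Transition probabilities p_ij = (1/r) * min(1, p(j)/p(i)) for j != i."""
    n = len(p)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        r = len(neighbours[i]) + 1           # r - 1 neighbours plus "stay put"
        for j in neighbours[i]:
            P[i][j] = min(1.0, p[j] / p[i]) / r
        P[i][i] = 1.0 - sum(P[i])            # remaining weight: keep configuration i
    return P

# Hypothetical target distribution and neighbourhood structure (a 4-site ring).
p = [0.1, 0.2, 0.3, 0.4]
neighbours = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
P = metropolis_matrix(p, neighbours)

# Detailed balance: p(i) * p_ij == p(j) * p_ji for every pair.
detailed_balance = all(
    abs(p[i] * P[i][j] - p[j] * P[j][i]) < 1e-12
    for i in range(4) for j in range(4)
)
# Stationarity: one step of the chain leaves p unchanged.
stationary = [sum(p[i] * P[i][j] for i in range(4)) for j in range(4)]
```

Both sides of the balance condition equal $\min(p(i), p(j))/r$, which is symmetric in i and j; stationarity then follows by summing over i.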

16 For Boltzmann weights $p(i) \propto e^{-E_i/T}$, the acceptance factor in $p_{ij} = \frac{1}{r}\min(1, p(j)/p(i))$, with $p_{ii} = 1 - \sum_{j\ne i} p_{ij}$, reads
$$\min\left(1, \frac{p(j)}{p(i)}\right) = \begin{cases} 1 & \text{if } e^{-E_j/T}/e^{-E_i/T} = e^{-\Delta E/T} > 1 \text{ (energetically convenient)} \\ e^{-\Delta E/T} & \text{if } e^{-\Delta E/T} < 1 \text{ (energetically not convenient)} \end{cases}$$
Metropolis algorithm
Initialize the system at an arbitrary point $\sigma_0$ and choose a probability that suggests a candidate configuration for the next step. Iterate the following steps:
1. Extract a candidate configuration $\sigma'$.
2. With $\sigma$ the current configuration, calculate the energy variation following the move, $\Delta E = E(\sigma') - E(\sigma)$.
3. If $\Delta E < 0$ accept the move; if $\Delta E > 0$ generate a uniform random number $u \in [0,1]$ and accept the move if $u \le \exp(-\Delta E/T)$, otherwise reject.
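The three steps can be sketched for a small spin system. Everything specific here is an assumption made for illustration: single spin flips as candidate moves, the energy $E(s) = -\sum_{i<j} J_{ij} s_i s_j$, a ferromagnetic triangle at T = 2, and the burn-in and sample sizes. An exact enumeration is included only to compare against.

```python
import itertools
import math
import random

def energy(s, J):
    """E(s) = - sum_{i<j} J[i][j] * s_i * s_j."""
    n = len(s)
    return -sum(J[i][j] * s[i] * s[j] for i in range(n) for j in range(i + 1, n))

def metropolis_average(f, J, T, n_burn, n_samples, rng):
    n = len(J)
    s = [rng.choice([-1, 1]) for _ in range(n)]       # arbitrary starting point
    total = 0.0
    for step in range(n_burn + n_samples):
        i = rng.randrange(n)                          # step 1: candidate = flip spin i
        h = sum(J[i][j] * s[j] for j in range(n) if j != i)
        delta_e = 2.0 * s[i] * h                      # step 2: E(candidate) - E(current)
        if delta_e <= 0 or rng.random() <= math.exp(-delta_e / T):
            s[i] = -s[i]                              # step 3: accept the move
        if step >= n_burn:
            total += f(s)                             # average only after equilibration
    return total / n_samples

def exact_average(f, J, T):
    """Brute-force thermal average, feasible only for very small systems."""
    states = list(itertools.product([-1, 1], repeat=len(J)))
    weights = [math.exp(-energy(s, J) / T) for s in states]
    return sum(w * f(s) for w, s in zip(weights, states)) / sum(weights)

J = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]                 # ferromagnetic triangle (toy)
corr = lambda s: s[0] * s[1]
rng = random.Random(0)
estimate = metropolis_average(corr, J, T=2.0, n_burn=1000, n_samples=50000, rng=rng)
exact = exact_average(corr, J, T=2.0)
```

The estimate should agree with the enumerated value up to statistical fluctuations that shrink as the number of samples grows.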

17 Contrastive divergence
Consider a probability distribution over a vector x (assumed discrete w.l.o.g.)
$$p(x;\lambda) = \frac{1}{Z(\lambda)}\, e^{-E(x;\lambda)}$$
where λ is a vector of model parameters and the partition function Z(λ) is defined as
$$Z(\lambda) = \sum_x e^{-E(x;\lambda)}$$
We derive the model parameters λ by maximizing the likelihood, namely the probability of a training set of i.i.d. data $X = \{x_1, x_2, \dots, x_N\}$,
$$p(X;\lambda) = \prod_{k=1}^{N} \frac{1}{Z(\lambda)}\, e^{-E(x_k;\lambda)}$$
or, equivalently, by minimizing the negative log-likelihood (normalized by N)
$$L(X;\lambda) = \log Z(\lambda) + \frac{1}{N}\sum_{k=1}^{N} E(x_k;\lambda)$$

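18 The gradient equation is obtained by first writing down the partial derivative of L(X;λ):
$$\frac{\partial L(X;\lambda)}{\partial\lambda_i} = \frac{\partial \log Z(\lambda)}{\partial\lambda_i} + \frac{1}{N}\sum_{k=1}^{N}\frac{\partial E(x_k;\lambda)}{\partial\lambda_i} = \frac{\partial \log Z(\lambda)}{\partial\lambda_i} + \left\langle\frac{\partial E(x;\lambda)}{\partial\lambda_i}\right\rangle_{X}$$
where $\langle\cdot\rangle_X$ is the expectation of $\cdot$ given the data distribution X, i.e., $p_0(x) = \frac{1}{N}\sum_{n=1}^{N}\delta(x - x_n)$. The first term is
$$\frac{\partial \log Z(\lambda)}{\partial\lambda_i} = \frac{1}{Z(\lambda)}\frac{\partial Z(\lambda)}{\partial\lambda_i} = -\frac{1}{Z(\lambda)}\sum_x e^{-E(x;\lambda)}\,\frac{\partial E(x;\lambda)}{\partial\lambda_i} = -\left\langle\frac{\partial E(x;\lambda)}{\partial\lambda_i}\right\rangle_{p(x;\lambda)}$$
which can be numerically approximated by drawing samples from the proposed distribution. However, samples cannot be drawn directly from p(x;λ), as we do not know the value of the partition function; instead, we can use as many cycles of MCMC sampling as needed to transform our training data (drawn from the target distribution) into data drawn from the proposed distribution.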

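19 Denoting with $X^n$ the training data transformed using n cycles of MCMC, such that $X^0 \equiv X$, we get
$$\frac{\partial L(X;\lambda)}{\partial\lambda_i} = \left\langle\frac{\partial E(x;\lambda)}{\partial\lambda_i}\right\rangle_{X^0} - \left\langle\frac{\partial E(x;\lambda)}{\partial\lambda_i}\right\rangle_{X^\infty}$$
Update parameters (gradient descent with learning rate ε):
$$\lambda_i(t+1) = \lambda_i(t) - \epsilon\left[\left\langle\frac{\partial E(x;\lambda)}{\partial\lambda_i}\right\rangle_{X^0} - \left\langle\frac{\partial E(x;\lambda)}{\partial\lambda_i}\right\rangle_{X^\infty}\right]$$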

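20 Choosing a pairwise energy function in $p(x;\lambda) = \frac{1}{Z(\lambda)} e^{-E(x;\lambda)}$ we recover the previous results:
$$E(x;\lambda) \to E(s;J,\theta) = -\left[\frac{1}{2}\sum_{ij} J_{ij} s_i s_j + \sum_i s_i\theta_i\right], \qquad \frac{\partial E}{\partial J_{ij}} = -s_i s_j, \qquad \frac{\partial E}{\partial\theta_i} = -s_i$$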

21 How many MCMC cycles are required to compute an accurate gradient?
Hinton 2000: only a few MCMC cycles (empirically even just one) are needed to calculate an approximate gradient. The intuition is that after a few iterations the data will have moved from the target distribution (i.e., that of the training data) towards the proposed distribution, and so give an idea of the direction in which the proposed distribution should move to better model the training data. In the update rule, the average over fully equilibrated data is then replaced by the average over $X^n$ with small n.
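A one-cycle version of this update (often called CD-1) can be sketched for a tiny restricted Boltzmann machine. The choices below are assumptions made for illustration and are not taken from the slides: ±1 units, β = 1, energy $E(v,h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{ij} v_i W_{ij} h_j$, conditional means used for the hidden units in the update, and a made-up four-pixel data set.

```python
import math
import random

def sample_layer(fields, rng):
    """Draw +-1 units with P(+1) = (1 + tanh(field)) / 2."""
    return [1 if rng.random() < 0.5 * (1.0 + math.tanh(f)) else -1 for f in fields]

def hidden_fields(v, W, b):
    return [b[j] + sum(W[i][j] * v[i] for i in range(len(v))) for j in range(len(b))]

def visible_fields(h, W, a):
    return [a[i] + sum(W[i][j] * h[j] for j in range(len(h))) for i in range(len(a))]

def cd1_step(v0, W, a, b, eps, rng):
    """One update: X^0 is the data, X^1 the data after a single Gibbs cycle."""
    f0 = hidden_fields(v0, W, b)
    ph0 = [math.tanh(f) for f in f0]                        # <h_j> given the data
    h0 = sample_layer(f0, rng)
    v1 = sample_layer(visible_fields(h0, W, a), rng)        # one-cycle reconstruction
    ph1 = [math.tanh(f) for f in hidden_fields(v1, W, b)]   # <h_j> given v1
    for i in range(len(a)):
        a[i] += eps * (v0[i] - v1[i])
        for j in range(len(b)):
            W[i][j] += eps * (v0[i] * ph0[j] - v1[i] * ph1[j])
    for j in range(len(b)):
        b[j] += eps * (ph0[j] - ph1[j])

def reconstruction_error(data, W, a, b):
    """Mean squared distance between each pattern and its mean-field reconstruction."""
    err = 0.0
    for v in data:
        mh = [math.tanh(f) for f in hidden_fields(v, W, b)]
        mv = [math.tanh(f) for f in visible_fields(mh, W, a)]
        err += sum((x - y) ** 2 for x, y in zip(v, mv))
    return err / len(data)

rng = random.Random(0)
data = [[1, 1, -1, -1]] * 3 + [[-1, -1, 1, 1]]              # hypothetical training set
n_vis, n_hid = 4, 2
W = [[rng.gauss(0.0, 0.01) for _ in range(n_hid)] for _ in range(n_vis)]
a, b = [0.0] * n_vis, [0.0] * n_hid

err_before = reconstruction_error(data, W, a, b)
for epoch in range(300):
    for v in data:
        cd1_step(v, W, a, b, eps=0.05, rng=rng)
err_after = reconstruction_error(data, W, a, b)
```

The reconstruction error is only a rough progress monitor, not the quantity being minimized; it is used here because the likelihood itself would require the partition function.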

22 Another numerical method: Simulated annealing
Approximate global optimization in a large search space.
It is to be preferred to gradient descent for problems where finding an approximate global optimum is more important than finding a precise local optimum in a fixed amount of time.
Some thermal noise is useful to escape from local minima.
Too much thermal noise would make even global minima unstable.
S. Kirkpatrick, C.D. Gelatt, M.P. Vecchi (1983) Optimization by Simulated Annealing, Science

23 Another numerical method: Simulated annealing
Annealing in metallurgy: a technique involving heating and controlled cooling of a material to increase the size of its crystals and reduce their defects.
At each time step, the algorithm
- randomly selects a solution close to the current one (e.g., a spin configuration under single spin-flip)
- measures its quality (e.g., the energy change ΔE)
- decides probabilistically to move to it or to stay (e.g., $p \sim \exp[-\Delta E/T]$).
During the search, the temperature is progressively decreased from an initial positive value to zero. This affects the move probability: at each step, the probability of moving to a worse new solution is progressively changed towards zero.
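The loop described above can be sketched as follows. The cost function (an open ferromagnetic chain of ten spins), the linear cooling schedule and the step count are all hypothetical choices made to keep the example small; practical schedules are usually tuned to the problem.

```python
import math
import random

def chain_energy(s):
    """E(s) = - sum_i s_i * s_{i+1} for an open ferromagnetic chain (toy cost function)."""
    return -sum(s[i] * s[i + 1] for i in range(len(s) - 1))

def simulated_annealing(n_spins, t_start, n_steps, rng):
    s = [rng.choice([-1, 1]) for _ in range(n_spins)]
    e = chain_energy(s)
    for k in range(n_steps):
        T = t_start * (1.0 - k / n_steps)       # linear schedule, stays strictly positive
        i = rng.randrange(n_spins)              # nearby solution: single spin flip
        s[i] = -s[i]
        delta_e = chain_energy(s) - e           # quality of the proposed move
        if delta_e <= 0 or rng.random() < math.exp(-delta_e / T):
            e += delta_e                        # move accepted
        else:
            s[i] = -s[i]                        # move rejected: undo the flip
    return s, e

rng = random.Random(1)
final_state, final_energy = simulated_annealing(n_spins=10, t_start=3.0, n_steps=20000, rng=rng)
```

Early on, uphill moves are accepted often and the chain explores freely; as T shrinks, they become rare and the configuration settles into a low-energy state (the minimum here is -9, with all spins aligned).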

24 - configuration of the system
- random generator of rearrangements of the elements in a configuration
- cost function containing the trade-offs that have to be made
- annealing schedule of the temperatures and lengths of time for which the system is to be evolved
For a given problem, the annealing schedule may be developed by trial and error, or may consist of just warming the system until it is obviously melted, then cooling in slow stages until diffusion of the components ceases.
Inventing the most effective sets of moves and deciding which factors to incorporate into the objective function require insight into the problem being solved and may not be obvious.
[Figure, vertical axis -E. Image by Kingpin3, own work, CC0.]

25 Another method: Mean-field formulation
Learning consists of adjusting the weights and thresholds in such a way that the Boltzmann distribution on the visible units, $p_\infty$, approximates a target distribution q as closely as possible. This can be recast as the problem of minimizing the KL distance using gradient descent, which yields the learning rules for weights and thresholds. The parameter ε is the learning rate. The brackets $\langle\cdot\rangle_-$ and $\langle\cdot\rangle_+$ denote the free and clamped expectation values, respectively:
$$\langle f(x,\sigma,y)\rangle_+ = \sum_{x,\sigma,y} f(x,\sigma,y)\, p_\infty(\sigma|x,y)\, q(x,y)$$
$$\langle f(x,\sigma,y)\rangle_- = \sum_{x,\sigma,y} f(x,\sigma,y)\, p_\infty(x,\sigma,y)$$
Naive mean-field approximation: the free and the clamped expectations are approximated by their classic, infinite-size, mean-field values
$$\langle s_i\rangle \approx m_i, \qquad \langle s_i s_j\rangle \approx m_i m_j, \qquad m_i = \tanh\left[\beta\left(\sum_{j\ne i} J_{ij} m_j + \theta_i\right)\right]$$

26 Naive mean-field approximation: the free and the clamped expectations are approximated by their classic, infinite-size, mean-field values
$$\langle s_i\rangle \approx m_i, \qquad \langle s_i s_j\rangle \approx m_i m_j, \qquad m_i = \tanh\left[\beta\left(\sum_{j\ne i} J_{ij} m_j + \theta_i\right)\right]$$
In each step of the gradient descent procedure, one must solve these self-consistent equations for the current parameter values, evaluate the distance with respect to the clamped averages, and update the parameters accordingly.
This method is O(10) times faster than the MC method. However, it may not converge! In fact, it leads to a converging gradient descent algorithm only when the data are such that
$$\langle s_i s_j\rangle_+ = \langle s_i\rangle_+ \langle s_j\rangle_+, \qquad i \ne j$$
This is a property of the data. It is equivalent to stating that the target probability distribution q is factorized in all its variables: $q(s) = \prod_i q_i(s_i)$.
The quality of this naive method depends on the extent to which the previous equation is violated. Improvements exist, e.g., the linear response correction.
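Solving the self-consistent equations is typically done by fixed-point iteration. The sketch below assumes the form $m_i = \tanh[\beta(\sum_{j\ne i} J_{ij} m_j + \theta_i)]$ and uses a damped iteration; the damping factor, tolerance and example couplings are hypothetical choices, and convergence is not guaranteed for strong couplings.

```python
import math

def mean_field_magnetizations(J, theta, beta=1.0, damping=0.5, tol=1e-12, max_iter=10000):
    """Damped fixed-point iteration for m_i = tanh(beta * (sum_{j != i} J_ij m_j + theta_i))."""
    n = len(theta)
    m = [0.0] * n
    for _ in range(max_iter):
        new = [
            math.tanh(beta * (sum(J[i][j] * m[j] for j in range(n) if j != i) + theta[i]))
            for i in range(n)
        ]
        change = max(abs(x - y) for x, y in zip(new, m))
        m = [(1.0 - damping) * y + damping * x for x, y in zip(new, m)]
        if change < tol:
            break
    return m

def mean_field_correlation(m, i, j):
    """Naive mean-field estimate <s_i s_j> ~ m_i * m_j for i != j."""
    return m[i] * m[j]

# Hypothetical two-neuron examples: uncoupled and weakly coupled.
m_free = mean_field_magnetizations([[0.0, 0.0], [0.0, 0.0]], [0.5, -1.0])
J_weak = [[0.0, 0.3], [0.3, 0.0]]
theta_weak = [0.2, -0.1]
m_weak = mean_field_magnetizations(J_weak, theta_weak)
```

For uncoupled neurons the result reduces to $m_i = \tanh(\beta\theta_i)$, which is exact; with couplings, the factorized estimate of the correlations is where the approximation enters.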

27 A few examples: MNIST database
[Figure panels: "Features", "Two".]

28 Natural images database Pines

29 Olivetti database
The dataset consists of 400 greyscale images of 64 × 64 pixels. There are 10 images for each person, so there are 40 persons (the targets). The first 6 faces in the Olivetti dataset look like this:
[Figure: sample faces from the Olivetti dataset.]
First, convert the image array to binary (black and white). This is needed because an RBM can extract features more efficiently from binary data than from greyscale data.

30 After running the RBM, we see the extracted features on the first 6 faces. These look similar to eigenfaces.

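31 A simple example to see frustration at work
[Figure: three spins σ1, σ2, σ3 on a triangle, coupled by J12, J13, J23.]
$$\langle\sigma_i(t+1)\rangle = \tfrac{1}{2}\tanh(2\beta)\,\left[J_{ij}\langle\sigma_j(t)\rangle + J_{ik}\langle\sigma_k(t)\rangle\right], \qquad i,j,k = 1,2,3$$
with $u_i(t) = \langle\sigma_i(t)\rangle$, $i = 1,2,3$, and couplings $J_{ij} = J_{ji}$, $J_{ij} = \pm 1 \ \forall i,j$.
$$\frac{d}{dt}u_1(t) = -u_1(t) + \tfrac{1}{2}\tanh(2\beta)\,[J_{12}u_2(t) + J_{13}u_3(t)]$$
$$\frac{d}{dt}u_2(t) = -u_2(t) + \tfrac{1}{2}\tanh(2\beta)\,[J_{12}u_1(t) + J_{23}u_3(t)]$$
$$\frac{d}{dt}u_3(t) = -u_3(t) + \tfrac{1}{2}\tanh(2\beta)\,[J_{13}u_1(t) + J_{23}u_2(t)]$$
In matrix notation $\dot u(t) = A\,u(t)$, where A collects the coefficients above and $\Theta = \tanh(2\beta)$. The relaxation rates are the roots of the characteristic polynomial, $\det(A - \lambda I) = 0$; they depend on the couplings only through the product $J_{12}J_{23}J_{13}$, and two of the three roots coincide.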

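32 The averages $\langle\sigma_1(t)\rangle$, $\langle\sigma_2(t)\rangle$, $\langle\sigma_3(t)\rangle$ are linear combinations of the exponentials $e^{-t/\tau_k}$, whose characteristic times $\tau_k$ follow from the eigenvalues above and depend on the couplings only through $J_{12}J_{23}J_{13}$. The constants $C_{1,2,3}$ depend on the initial configuration.
$\langle\sigma_i(t)\rangle$ relax to zero* as $t \to \infty$ with characteristic time τ.
* The system is finite!
$J_{12}J_{13}J_{23} = 1$: simple system. Eigenvalues are peaked at short time scales.
$J_{12}J_{13}J_{23} = -1$: complex system. Eigenvalues are peaked at long time scales.
[Figure: characteristic times versus β for the simple and the complex system; in both panels two of the three times coincide.]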
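That only the product of the three couplings matters can be checked directly on the coupling matrix. The sketch below, written for illustration, verifies that the symmetric zero-diagonal matrix with entries ±1 has eigenvalues 2P, -P, -P with $P = J_{12}J_{13}J_{23}$, and iterates the discrete-time map for the averages to show the relaxation to zero at finite β; the value β = 1 and the starting point are arbitrary choices.

```python
import itertools
import math

def char_poly(J12, J13, J23, lam):
    """det(J - lam*I) for the symmetric, zero-diagonal 3x3 coupling matrix J."""
    return -lam**3 + lam * (J12**2 + J13**2 + J23**2) + 2 * J12 * J13 * J23

def iterate_map(J12, J13, J23, beta, u, steps):
    """Apply <s_i(t+1)> = (1/2) tanh(2 beta) [J_ij <s_j(t)> + J_ik <s_k(t)>] repeatedly."""
    half_theta = 0.5 * math.tanh(2.0 * beta)
    for _ in range(steps):
        u = [half_theta * (J12 * u[1] + J13 * u[2]),
             half_theta * (J12 * u[0] + J23 * u[2]),
             half_theta * (J13 * u[0] + J23 * u[1])]
    return u

spectra = {}
for J12, J13, J23 in itertools.product([-1, 1], repeat=3):
    P = J12 * J13 * J23                       # +1: unfrustrated, -1: frustrated
    roots = (2 * P, -P, -P)
    assert all(char_poly(J12, J13, J23, r) == 0 for r in roots)
    spectra[(J12, J13, J23)] = roots

u_simple = iterate_map(1, 1, 1, beta=1.0, u=[1.0, 1.0, 1.0], steps=200)
u_complex = iterate_map(-1, -1, -1, beta=1.0, u=[1.0, -1.0, 0.0], steps=200)
```

Flipping the sign of the product swaps which end of the spectrum is doubly degenerate, which is the origin of the different distribution of time scales in the two cases.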

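33 Pairwise correlation functions
$$u_{ij}(t) = \langle\sigma_i(t)\sigma_j(t)\rangle, \qquad i,j = 1,2,3$$
$$\frac{d}{dt}u_{12}(t) = -2u_{12}(t) + \frac{\Theta}{2}\left(2J_{12} + J_{23}u_{13}(t) + J_{13}u_{23}(t)\right)$$
$$\frac{d}{dt}u_{13}(t) = -2u_{13}(t) + \frac{\Theta}{2}\left(2J_{13} + J_{23}u_{12}(t) + J_{12}u_{23}(t)\right)$$
$$\frac{d}{dt}u_{23}(t) = -2u_{23}(t) + \frac{\Theta}{2}\left(2J_{23} + J_{13}u_{12}(t) + J_{12}u_{13}(t)\right)$$
with $\Theta = \tanh(2\beta)$. The roots of the characteristic polynomial again depend on the couplings only through $J_{12}J_{23}J_{13}$, and two of them coincide.
As $t \to \infty$, $\langle\sigma_i(t)\sigma_j(t)\rangle$ relax to a non-zero value independent of the initial configuration; setting the time derivatives to zero gives, e.g.,
$$\langle\sigma_1\sigma_2\rangle_\infty = \frac{2\Theta J_{12} + \Theta^2 J_{13}J_{23}}{4 - \Theta^2}$$
and analogously for the other pairs. The transients are combinations of exponentials $e^{-t/\tau_k}$ with constants $C_{1,2,3}$ that depend on the initial configuration.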

34 [Figure: characteristic times of the pairwise correlation functions versus β. Panels: $J_{12}J_{13}J_{23} = 1$, simple system; $J_{12}J_{13}J_{23} = -1$, complex system. In both panels $\tau_2 = \tau_3$.]

35 Conditioned reflex
[Figure: two neurons σi and σj connected by a dynamical coupling Jij.]
[Equations: coupled evolution equations for the averages $\langle\sigma_i\rangle$, $\langle J_{ij}\rangle$, $\langle\sigma_i\sigma_j\rangle$ and $\langle J_{ij}\sigma_i\rangle$, with tanh nonlinearities of the local fields.]
Ivan Pavlov (1849-1936)
E Agliari, A Barra, K Gervasi-Vidal, F Guerra, Journal of Biological Dynamics (2012)

36 Some fun: Google DeepDream
It was initially invented to help scientists and engineers see what a deep neural network sees when it looks at a given image. Later the algorithm became a new form of psychedelic and abstract art.
The network's 'answer' comes from its final output layer. In doing this, the software builds up an idea of what it thinks an object looks like.
Other images were created by feeding a picture into the network and then asking the software to recognise a feature of it and to modify the picture to emphasise the feature it recognises, such as animals and eyes. That modified picture is then fed back into the network, which is again tasked to recognise features and emphasise them, and so on. Eventually, the feedback loop modifies the picture beyond all recognition.

37 Neural art
Neural networks are trained to work out what makes each artist's style unique.
