Introduction to neural networks, part 4 (3 hrs). Stefano Rovetta, 20/23-Jul.

Back to optimization

Stochastic optimization
Optimize a cost that is a random variable. Types of randomness:
- measurement plus noise: R + ν
- multiple effects mixed together (we might use a mixture model)
- unknown statistical properties

Monte Carlo integration
Expectation of a random variable X:
$E\{X\} = \int_E \xi\, p_x(\xi)\, d\xi$ (over the whole data space E)
... but only a sample $\{x_1, \dots, x_n\}$ is given (the training set).
Empirical distribution: $P_x(\xi) = \frac{1}{n} \sum_{l=1}^{n} \delta(\xi - x_l)$
Approximate (empirical) expectation of X:
$\hat{E}\{X\} = \int_E \xi\, P_x(\xi)\, d\xi = \frac{1}{n} \sum_{l=1}^{n} x_l$
This is a Monte Carlo integral.

Suppose that R is classification performance (risk). We want to optimize the true risk, the one computed on all possible, infinite data:
$R(w) = \int R(y(x), w)\, p(x)\, dx$
This is a function of w (the weights identify one specific neural net). It is also a function of the data distribution p(x) (the performance is estimated on the data).

When training a neural network we don't have p(x), but only the training set $\{x_1, \dots, x_n\}$. From the training set we have the empirical distribution
$P_x(\xi) = \frac{1}{n} \sum_{l=1}^{n} \delta(\xi - x_l)$
so we can compute a Monte Carlo estimate of the risk
$\hat{R}(w, X) = \frac{1}{np} \sum_{l=1}^{np} R(y(x_l), w)$
This is the empirical risk.
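
A small numpy sketch of the empirical risk as a Monte Carlo estimate of the true risk; the linear scorer y(x) = sign(x · w), the 0-1 loss, and the toy data are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: n points in 2-D with +/-1 labels (assumed for illustration).
n = 200
X = rng.normal(size=(n, 2))
t = np.sign(X[:, 0] + 0.3 * rng.normal(size=n))

def zero_one_loss(y, target):
    """0-1 classification loss R(y(x), w) for one pattern."""
    return float(y != target)

def empirical_risk(w, X, t):
    """Monte Carlo estimate of the true risk: average loss over the sample."""
    losses = [zero_one_loss(np.sign(x @ w), ti) for x, ti in zip(X, t)]
    return np.mean(losses)

w = np.array([1.0, 0.0])
print("empirical risk:", empirical_risk(w, X, t))
```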

Training by epoch
Optimize using the whole training set to estimate the cost. It means computing $\hat{R}$ (and the $\Delta w$) on the basis of a Monte Carlo estimate of the risk. This finds the optimal value of an approximate (empirical) cost function.

Stochastic approximation
A special kind of stochastic optimization: R is estimated at each input pattern, using that pattern alone. Extremely unreliable estimation, but it converges in probability! (Robbins and Monro, 1951; Kiefer and Wolfowitz, 1952)

Convergence in probability:
$\lim_{n \to \infty} \Pr\left( |\hat{R}_n - R| \geq \varepsilon \right) = 0$
where $\hat{R}_n$ is the estimate of R on a training set of size n.

Stochastic approximation
Given:
- a function R whose gradient $\nabla R$ we want to set to zero, or minimize (but which we cannot compute analytically);
- a sequence $G_1, G_2, \dots, G_l, \dots$ of random samples of $\nabla R$, affected by random noise;
- a decreasing sequence $\eta_1, \eta_2, \dots, \eta_l, \dots$ of step-size coefficients.
Basic iteration: $w(l+1) = w(l) - \eta_l G_l$
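
A minimal numpy sketch of this basic iteration on a toy quadratic risk; the risk, its noisy gradient samples, and the 1/l step-size schedule are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy risk R(w) = 0.5 * ||w - w_star||^2; its gradient is (w - w_star).
w_star = np.array([2.0, -1.0])

def noisy_gradient(w):
    """A random sample G_l of the gradient of R, corrupted by noise."""
    return (w - w_star) + rng.normal(scale=0.5, size=w.shape)

w = np.zeros(2)
for l in range(1, 5001):
    eta_l = 1.0 / l                      # decreasing step-size sequence
    w = w - eta_l * noisy_gradient(w)    # basic iteration w(l+1) = w(l) - eta_l G_l

print(w)   # close to w_star despite the noisy gradient estimates
```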

Stochastic approximation: the intuition
Each sample gives a noisy (stochastic) estimate of the gradient: $\nabla R$ + noise. By averaging over time, the noise cancels out. Random variations also make it possible to escape local minima.

Results on convergence of stochastic approximation
If R is twice differentiable and convex, then stochastic approximation converges with a rate of $O(1/l)$.
A condition of convergence (not of optimal rate of convergence): $0 < \sum_l \eta_l^2 = A < \infty$
Usually the hypotheses are not met (complex cost landscape) and we don't have guarantees.

Training by pattern
It means computing $\hat{R}$ (and the $\Delta w$) on the basis of an estimate of the risk on a single point: an extreme Monte Carlo estimate, on a training set of one observation only. This finds the approximate optimal value of an approximate cost function.

Implementation of training
- By epoch: estimation loop, then update.
- By pattern: estimation + update loop.
- By pattern on a training set: pick the pattern index l at random.
Learning rate η: by pattern, keep it low; by epoch, make it adaptive.
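
A skeleton of the two training loops in numpy; the toy least-squares loss, its gradient, and the step sizes are assumptions used only to make the sketch runnable.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_on_pattern(w, x, t):
    """Gradient of the per-pattern loss; a least-squares toy case (assumed)."""
    return (w @ x - t) * x

def train_by_epoch(w, X, T, eta=0.1, epochs=100):
    # Estimation loop over the whole training set, then a single update.
    for _ in range(epochs):
        g = np.mean([grad_on_pattern(w, x, t) for x, t in zip(X, T)], axis=0)
        w = w - eta * g
    return w

def train_by_pattern(w, X, T, eta=0.01, steps=5000):
    # Estimation + update at each (randomly chosen) pattern; keep eta low.
    for _ in range(steps):
        l = rng.integers(len(X))
        w = w - eta * grad_on_pattern(w, X[l], T[l])
    return w

# Tiny usage example on noiseless linear data (assumed).
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
T = X @ w_true
print(train_by_epoch(np.zeros(3), X, T))
print(train_by_pattern(np.zeros(3), X, T))
```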

Multi-layer neural networks

Connectionism and Parallel Distributed Processing
David Rumelhart, James McClelland, Geoffrey Hinton

What is connectionism?
Connectionism is an approach to cognitive science that characterizes learning and memory through the discrete interactions between nodes of neural networks. The representation of concepts and rules is not concentrated in symbols with a lot of meaning, but in sub-symbolic neural encodings (neuron activations) which have a meaning only if taken collectively, as patterns. Neural networks are distributed and massively parallel. They rely on spontaneously generated internal representations.

Network topologies
Most general: feedback. Units may be visible or hidden.

Network topologies
A special type of feedback is lateral connections.

Network topologies
Less general: a topology where cycles are forbidden: feedforward. Visible units may be input or output.

Network topologies
Least general: multi-layer.

Why multi-layer?
- Linear separability
- Feature discovery
- Hierarchies of abstractions

Example: parity
Problem: given any input string of d bits, tell whether the number of bits set (= 1) is even. It generalizes XOR: it is not linearly separable.

Example: parity
The solution requires d hidden units.
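
A sketch of the classical hand-built construction realizing d-bit parity (the XOR generalization, i.e. detecting an odd number of set bits; the "even" answer is just its complement) with exactly d hidden threshold units. The specific weights and the exhaustive check for d = 4 are illustrative assumptions, not taken from the slides.

```python
import numpy as np
from itertools import product

def heaviside(z):
    return (z >= 0).astype(float)

def parity_net(x, d):
    """Two-layer threshold network computing the parity (odd number of 1s) of d bits."""
    # Hidden unit j fires when at least j input bits are set: weights all 1, bias -(j - 0.5).
    h = heaviside(np.sum(x) - (np.arange(1, d + 1) - 0.5))
    # Output weights alternate +1/-1, so the pre-activation is 1 for odd counts, 0 for even.
    v = np.array([(-1.0) ** j for j in range(d)])   # +1, -1, +1, ...
    return heaviside(h @ v - 0.5)

d = 4
for bits in product([0, 1], repeat=d):
    x = np.array(bits, dtype=float)
    assert parity_net(x, d) == (sum(bits) % 2)
print("parity of", d, "bits computed with", d, "hidden units")
```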

Universal approximation theorem (G. Cybenko, 1989)
A feed-forward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of $\mathbb{R}^d$.

How do we train a multi-layer neural network?
1. With a suitable algorithm
2. With a sequence of independent trainings

As we have seen, learning (e.g., learning to recognize) can be cast as the problem of optimizing a suitable cost function (risk). But most optimization methods rely on the necessary minimum condition $\nabla E = 0$, or on the direction of the gradient $\nabla E$.
Requirement: E must be at least differentiable (even better if also convex, but that's not always possible).
Even if E is differentiable, for hidden units we cannot compute an error term like $(t - a)^2$ (mse).
Requirement: we need a way to do this.

A differentiable activation function
Let's write the discriminant function for a problem with two Gaussian, spherical, equal-variance classes. After a translation of the origin and a rotation of the axes, we get a 1-dimensional, symmetrical problem in x with only two parameters:
$p(x \mid \omega_1) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right], \qquad p(x \mid \omega_2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x+\mu)^2}{2\sigma^2}\right]$

By Bayes' theorem:
$P(\omega_1 \mid x) = \frac{p(x \mid \omega_1)\, P(\omega_1)}{p(x \mid \omega_1)\, P(\omega_1) + p(x \mid \omega_2)\, P(\omega_2)}, \qquad P(\omega_2 \mid x) = \frac{p(x \mid \omega_2)\, P(\omega_2)}{p(x \mid \omega_1)\, P(\omega_1) + p(x \mid \omega_2)\, P(\omega_2)}$

2-class discriminant function (removing the common factors $1/(\sqrt{2\pi}\,\sigma)$ and assuming equal priors $P(\omega_1) = P(\omega_2)$):
$g(x) = P(\omega_1 \mid x) - P(\omega_2 \mid x) = \frac{\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] - \exp\left[-\frac{(x+\mu)^2}{2\sigma^2}\right]}{\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] + \exp\left[-\frac{(x+\mu)^2}{2\sigma^2}\right]}$

Expanding the squares, $\exp\left[-\frac{(x \mp \mu)^2}{2\sigma^2}\right] = \exp\left[-\frac{x^2+\mu^2}{2\sigma^2}\right]\exp\left[\pm\frac{x\mu}{\sigma^2}\right]$, so
$g(x) = \frac{\exp\left[-\frac{x^2+\mu^2}{2\sigma^2}\right]\exp\left[\frac{x\mu}{\sigma^2}\right] - \exp\left[-\frac{x^2+\mu^2}{2\sigma^2}\right]\exp\left[-\frac{x\mu}{\sigma^2}\right]}{\exp\left[-\frac{x^2+\mu^2}{2\sigma^2}\right]\exp\left[\frac{x\mu}{\sigma^2}\right] + \exp\left[-\frac{x^2+\mu^2}{2\sigma^2}\right]\exp\left[-\frac{x\mu}{\sigma^2}\right]}$
The common positive factor $\exp\left[-\frac{x^2+\mu^2}{2\sigma^2}\right]$ cancels out:
$g(x) = \frac{e^{x\mu/\sigma^2} - e^{-x\mu/\sigma^2}}{e^{x\mu/\sigma^2} + e^{-x\mu/\sigma^2}}$

We replace x with the score $r = x \cdot w$, absorbing the factor $\mu/\sigma^2$ into the norm of w: $w' = \frac{\mu}{\sigma^2}\, w$. We obtain
$g(r) = \frac{e^{r} - e^{-r}}{e^{r} + e^{-r}} = \tanh(r), \qquad r = x \cdot w'$
the hyperbolic tangent activation. The logistic or sigmoid activation is
$\sigma(r) = \frac{1}{1 + e^{-r}} = \frac{\tanh(r/2) + 1}{2}$
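
A quick numpy check of the relation just stated between the logistic and the hyperbolic tangent (the grid of test points is an arbitrary choice):

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

r = np.linspace(-5, 5, 11)
# The logistic is a rescaled, shifted hyperbolic tangent: sigma(r) = (tanh(r/2) + 1) / 2.
assert np.allclose(sigmoid(r), (np.tanh(r / 2) + 1) / 2)
print("sigmoid(r) == (tanh(r/2) + 1) / 2 on all test points")
```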

[Plot: sigmoid and tanh activation functions, activation a versus stimulus r]

[Plot: Heaviside step and sign activation functions, activation a versus stimulus r]

The sigmoid is the solution of the logistic equation $y' = y(1-y)$. Therefore, by definition,
$\frac{\partial \sigma(r)}{\partial r} = \sigma(r)\,\bigl(1 - \sigma(r)\bigr)$
Also,
$\frac{\partial \tanh(r)}{\partial r} = 1 - \tanh^2(r)$
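
These two derivative identities are what back-propagation relies on; a small numpy sketch verifying them against central finite differences (step size and test points are arbitrary assumptions):

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

r = np.linspace(-4, 4, 9)
eps = 1e-6

# sigma'(r) = sigma(r) * (1 - sigma(r)), checked against a central finite difference.
num = (sigmoid(r + eps) - sigmoid(r - eps)) / (2 * eps)
assert np.allclose(num, sigmoid(r) * (1 - sigmoid(r)))

# tanh'(r) = 1 - tanh(r)^2.
num = (np.tanh(r + eps) - np.tanh(r - eps)) / (2 * eps)
assert np.allclose(num, 1 - np.tanh(r) ** 2)
print("both derivative identities verified numerically")
```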

The error back-propagation algorithm
Discovered by Amari / Werbos / Parker / Rumelhart, Hinton and Williams, from 1974 to 1986. The name appears in Rosenblatt's book Principles of Neurodynamics in 1962. A clever application of the chain rule of differential calculus: we can perform gradient descent in a distributed way and without actually computing derivatives. The responsibility for errors is back-propagated from the outputs back inside the network, and distributed among the hidden layers.

The chain rule
$\frac{d f(g(x))}{dx} = \frac{d f(y)}{dy}\bigg|_{y=g(x)} \frac{d g(x)}{dx}$
Where is the chain?
$\frac{d f(g(h(x)))}{dx} = \frac{d f(g)}{dg}\,\frac{d g(h)}{dh}\,\frac{d h(x)}{dx}$
which, for instance, can be used to prove that
$\frac{\partial \sigma(r)}{\partial w_i} = \frac{d \sigma(r)}{dr}\,\frac{\partial r}{\partial w_i} = \sigma'(r)\, x_i = \sigma(r)\,\bigl(1-\sigma(r)\bigr)\, x_i \quad (1)$

Notation:
np: number of patterns in the training set
ni: number of input units
nh: number of hidden units
no: number of output units
nw: total number of weights, nw = (ni + 1) nh + (nh + 1) no
i: index for input components
j: index for hidden units
k: index for output units
x_i: i-th component of the input pattern
r_j: net stimulus of the j-th hidden unit
r_k: net stimulus of the k-th output unit
sh_j: j-th hidden unit activation value
so_k: k-th component of the output
tg_k: k-th component of the target
whi_ji: weight to the j-th hidden unit from the i-th input unit [(ni + 1) × nh weights]
woh_kj: weight to the k-th output unit from the j-th hidden unit [(nh + 1) × no weights]

Loss function: $\lambda(so_k, tg_k) = (tg_k - so_k)^2$
1. In general there may be several output units;
2. the overall cost function is not quadratic (a paraboloid), because the network is non-linear.
Non-convex cost function.

Expected cost:
$E = \int \frac{1}{2}\,\frac{1}{no} \sum_{k=1}^{no} \bigl(so_k(x) - tg_k(x)\bigr)^2\, p(x)\, dx \quad (2)$
E is known only through its estimate on the training set (here, by epoch):
$\hat{E} = \frac{1}{np} \sum_{l=1}^{np} \frac{1}{2}\,\frac{1}{no} \sum_{k=1}^{no} \bigl(so_k(x_l) - tg_k(x_l)\bigr)^2 \quad (3)$

Summation and differentiation are both linear and can therefore be exchanged freely. We only consider one pattern:
$\hat{E} = \frac{1}{2}\,\frac{1}{no} \sum_{k=1}^{no} \bigl(so_k - tg_k\bigr)^2 \quad (4)$
For training online (= by pattern), we apply each $\Delta w$ immediately, as we did with the perceptron and Adaline. For training by epoch, we sum several $\Delta w$ and apply them only at the end of each pass (a training epoch). For training by batch, we sum several $\Delta w$ and apply them after some percentage of a complete pass.

The operation of the multilayer perceptron is divided in two steps: activation forward-propagation and error back-propagation.

Forward propagation

Forward propagation
$r_j = \sum_{i=0}^{ni} whi_{ji}\, x_i, \qquad sh_j = \sigma(r_j) \qquad \forall j \quad (5)$
$r_k = \sum_{j=0}^{nh} woh_{kj}\, sh_j, \qquad so_k = \sigma(r_k) \qquad \forall k \quad (6)$
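
A minimal numpy sketch of equations (5)-(6) in the slide notation, with index 0 reserved for the bias terms; tanh is used here as the sigmoidal activation, and the layer sizes in the usage example are assumptions.

```python
import numpy as np

def sigma(r):
    return np.tanh(r)   # sigmoidal activation (tanh chosen here)

def forward(x, whi, woh):
    """Forward pass in the slide notation.

    whi: (nh, ni+1) input-to-hidden weights (column 0 is the bias, x_0 = 1)
    woh: (no, nh+1) hidden-to-output weights (column 0 is the bias, sh_0 = 1)
    """
    x1 = np.concatenate(([1.0], x))        # prepend the bias input x_0 = 1
    rj = whi @ x1                          # net stimuli of hidden units
    sh = sigma(rj)                         # hidden activations sh_j
    sh1 = np.concatenate(([1.0], sh))      # prepend the bias unit sh_0 = 1
    rk = woh @ sh1                         # net stimuli of output units
    so = sigma(rk)                         # outputs so_k
    return rj, sh, rk, so

# Tiny usage example with ni = 3, nh = 4, no = 2 (shapes assumed for illustration).
rng = np.random.default_rng(0)
whi = rng.normal(scale=0.5, size=(4, 3 + 1))
woh = rng.normal(scale=0.5, size=(2, 4 + 1))
print(forward(rng.normal(size=3), whi, woh)[-1])
```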

Error back-propagation

Error back-propagation and update
We start from the computation of the partial derivatives, i.e., the gradient of the error. Here w is generically any of the weights of the network. We need all the components of the gradient $\nabla \hat{E}$, that is, $\partial \hat{E}/\partial w$ for all possible w.

$\frac{\partial \hat{E}}{\partial w} = \frac{\partial}{\partial w}\,\frac{1}{2}\,\frac{1}{no} \sum_{k=1}^{no} (so_k - tg_k)^2 = \frac{1}{no} \sum_{k=1}^{no} (so_k - tg_k)\,\frac{\partial so_k}{\partial w} \quad (7)$
Depending on whether w is a woh or a whi, we will have different expansions of the above expression.

Hidden-to-output weights woh_kj:
$\frac{\partial \hat{E}}{\partial woh_{kj}} = \frac{1}{no} \sum_{k'=1}^{no} (so_{k'} - tg_{k'})\,\frac{\partial so_{k'}}{\partial r_{k'}}\,\frac{\partial r_{k'}}{\partial woh_{kj}} \quad (8)$
We can drop all terms not depending on k, those with $k' \neq k$:
$\frac{\partial \hat{E}}{\partial woh_{kj}} = \frac{1}{no}\,(so_k - tg_k)\,\frac{\partial so_k}{\partial r_k}\, sh_j \quad (9)$
We plug in quantities known from the forward pass:
$\frac{\partial \hat{E}}{\partial woh_{kj}} = \frac{1}{no}\,(so_k - tg_k)\,\sigma'(r_k)\, sh_j \quad (10)$

If we define
$\delta_k = (so_k - tg_k)\,\sigma'(r_k) \quad (11)$
we have a generalization of the delta term which we have seen in the delta rule by Widrow and Hoff. Generalized delta rule for the hidden-to-output weights:
$\Delta woh_{kj} = -\eta\,\delta_k\, sh_j \quad (12)$

Problem with the input-to-hidden weights: not all terms are readily available. We use the chain rule again to find another formulation for $\partial \hat{E}/\partial whi_{ji}$.

$\frac{\partial \hat{E}}{\partial whi_{ji}} = \frac{\partial}{\partial whi_{ji}}\,\frac{1}{2}\,\frac{1}{no} \sum_{k=1}^{no} (so_k - tg_k)^2 \quad (13)$
$= \frac{1}{no} \sum_{k=1}^{no} (so_k - tg_k)\,\frac{\partial so_k}{\partial r_k}\,\frac{\partial r_k}{\partial sh_j}\,\frac{\partial sh_j}{\partial whi_{ji}} \quad (14)$

Now the quantities appearing in the last equation are available, again from either the forward pass or theory:
$(so_k - tg_k)\,\frac{\partial so_k}{\partial r_k} = \delta_k, \qquad \frac{\partial r_k}{\partial sh_j} = woh_{kj}, \qquad \frac{\partial sh_j}{\partial whi_{ji}} = \frac{\partial sh_j}{\partial r_j}\,\frac{\partial r_j}{\partial whi_{ji}} = \sigma'(r_j)\, x_i$

$\frac{\partial \hat{E}}{\partial whi_{ji}} = \frac{1}{no} \sum_{k=1}^{no} (so_k - tg_k)\,\frac{\partial so_k}{\partial r_k}\,\frac{\partial r_k}{\partial sh_j}\,\frac{\partial sh_j}{\partial whi_{ji}} \quad (15)$
$= \frac{1}{no} \sum_{k=1}^{no} \bigl[\delta_k\, woh_{kj}\bigr]\,\bigl[\sigma'(r_j)\, x_i\bigr] \quad (16)$
Note that the summation here does not disappear.

We can further manipulate the expression, by first isolating the terms which do not contribute to the summation:
$\frac{\partial \hat{E}}{\partial whi_{ji}} = \left[\frac{1}{no} \sum_{k=1}^{no} \delta_k\, woh_{kj}\right]\sigma'(r_j)\, x_i \quad (17)$
and then identifying the generalized delta for the input-to-hidden weights:
$\delta_j = \sigma'(r_j)\,\frac{1}{no} \sum_{k=1}^{no} \delta_k\, woh_{kj} \quad (18)$

Generalized delta rule for the input-to-hidden weights:
$\Delta whi_{ji} = -\eta\,\delta_j\, x_i \quad (19)$
amazingly similar in form to that for the hidden-to-output weights.
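
A compact numpy sketch of one by-pattern update implementing equations (5)-(19): forward pass, output deltas, hidden deltas, and the two generalized delta rules. The tanh activation, the XOR usage example, the network sizes, and the learning rate are assumptions; since the cost is non-convex, convergence on the example is not guaranteed.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigma(r):            # tanh activation
    return np.tanh(r)

def dsigma(r):           # sigma'(r) = 1 - tanh(r)^2
    return 1.0 - np.tanh(r) ** 2

def backprop_step(x, tg, whi, woh, eta=0.2):
    """One by-pattern update with the generalized delta rules (a sketch)."""
    no = woh.shape[0]
    # Forward pass (eqs. 5-6), with index 0 as the bias.
    x1 = np.concatenate(([1.0], x))
    rj = whi @ x1;  sh = sigma(rj)
    sh1 = np.concatenate(([1.0], sh))
    rk = woh @ sh1; so = sigma(rk)
    # Output deltas (eq. 11): delta_k = (so_k - tg_k) * sigma'(r_k).
    delta_k = (so - tg) * dsigma(rk)
    # Hidden deltas (eq. 18): delta_j = sigma'(r_j) * (1/no) * sum_k delta_k * woh_kj.
    delta_j = dsigma(rj) * (delta_k @ woh[:, 1:]) / no
    # Exact gradient steps; eq. (12) absorbs the 1/no factor into eta, here it is explicit.
    woh = woh - eta * np.outer(delta_k, sh1) / no
    whi = whi - eta * np.outer(delta_j, x1)
    return whi, woh, np.mean((so - tg) ** 2) / 2

# Usage example: the XOR task (architecture and targets are assumptions).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[-1.], [1.], [1.], [-1.]])          # targets in the tanh range
whi = rng.normal(scale=1.0, size=(4, 3))          # nh=4 hidden units, ni=2 inputs (+bias)
woh = rng.normal(scale=1.0, size=(1, 5))
for step in range(30000):
    l = rng.integers(4)
    whi, woh, err = backprop_step(X[l], T[l], whi, woh)

def net_out(x):
    sh1 = np.concatenate(([1.0], sigma(whi @ np.concatenate(([1.0], x)))))
    return sigma(woh @ sh1)[0]

print([round(net_out(x), 2) for x in X])   # should approach the targets (not guaranteed)
```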

Important property of multi-layer networks
The layered network is the simplest possible connectivity that has the universal approximation property. It should be large enough or deep enough.

Generalization and overfitting
The number of weights needs to be high. We must take care of controlling overfitting.

Overfitting
It is the situation where $\hat{R}$ is low but $|\hat{R} - R|$ is high.
Symptom: while training we are happy, but then the tests fail! No generalization, because of too much specialization (learning the training set, not the classification rule).

Multi-layer perceptrons: not a good model for the brain? There is some evidence that the brain uses sparse (localized) rather than dense (distributed) representations. Probably both.

Deep neural networks

David Hubel and Torsten Wiesel

Hubel and Wiesel placed electrodes in animals' brains (visual cortex). They discovered the columnar organization of neurons.

Each layer in a cortical column extracts features from the input it receives from the previous layer. These features are more and more abstract:
- edges
- simple shapes
- composite shapes
- eyes, mouths, noses...
- grandmother (the Grandmother Cell hypothesis)

Learning features in neural networks
Internal representation in hidden layers. A hierarchy requires many layers (deep networks).

Learning: limits of multi-layer networks
Error back-propagation does not work well with very deep structures. Vanishing gradient phenomenon: at each layer, the back-propagated components of the gradient become exponentially smaller. To avoid the problem: use shallow networks (theoretically sufficient).
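
A toy numpy illustration of the vanishing gradient: the norm of a back-propagated gradient is multiplied at each layer by a weight matrix and by σ'(r) ≤ 1/4, so it shrinks roughly exponentially with depth. The layer width, depth, and random weights are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def dsigmoid(r):
    s = 1.0 / (1.0 + np.exp(-r))
    return s * (1 - s)

# Back-propagate a gradient through an increasing number of sigmoid layers
# with random weights and net stimuli (a toy illustration only).
width, depth = 50, 30
g = np.ones(width)
for layer in range(depth):
    W = rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
    r = rng.normal(size=width)              # pretend net stimuli at this layer
    g = (W.T @ g) * dsigmoid(r)             # chain rule: multiply by W^T and sigma'(r)
    if layer % 5 == 4:
        print(f"after {layer + 1} layers: |gradient| = {np.linalg.norm(g):.3e}")
```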

Example of a shallow architecture: support vector machines.

Representational advantage of depth
In the '80s and early '90s, some works proved that some logical functions, which can be implemented with a depth of k layers, require exponentially more units if reduced to k − 1 layers. In the 2010s: dependent inputs (variables) need very deep networks.

How can we avoid training the whole network all at once?

Multi-level hierarchies of networks
Cascaded networks of unsupervised layers, trained one after the other, plus a final classification layer. The whole structure is finally trained with error back-propagation.

The idea is not new: the Neocognitron (K. Fukushima, 1987).

Unsupervised learning principles

Information Bottleneck

Techniques using the "information bottleneck" principle
Using statistics and entropy:
- coding theory
- stochastic complexity and minimum description length
Using errors:
- autoencoders
- PCA
- rate-distortion theory

Autoencoders
An autoencoder is a special case of a multi-layer perceptron, characterized by two aspects:
1. Structure: number of units in the input layer = number of units in the output layer > number of hidden units.
2. Learned task: an autoencoder is trained to approximate the identity function (= replicate its input at the output).
An autoencoder is not a classifier.
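
A minimal numpy autoencoder sketch with d inputs = d outputs and h < d hidden units, trained by gradient descent on the reconstruction error; the toy data, the tanh encoder with a linear decoder, and the learning rate are assumptions made to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 points in 5-D that really live near a 2-D subspace (an assumption).
n, d, h = 100, 5, 2
Z_true = rng.normal(size=(n, 2))
X = Z_true @ rng.normal(size=(2, d)) + 0.05 * rng.normal(size=(n, d))

# Autoencoder: d inputs = d outputs, h < d hidden units; the task is to reproduce the input.
W1 = rng.normal(scale=0.1, size=(d, h)); b1 = np.zeros(h)     # encoder
W2 = rng.normal(scale=0.1, size=(h, d)); b2 = np.zeros(d)     # decoder

eta = 0.01
for epoch in range(2000):
    H = np.tanh(X @ W1 + b1)          # hidden code (the interesting part)
    Xhat = H @ W2 + b2                # reconstruction of the input
    E = Xhat - X
    dW2 = H.T @ E / n; db2 = E.mean(axis=0)
    dH = (E @ W2.T) * (1 - H ** 2)
    dW1 = X.T @ dH / n; db1 = dH.mean(axis=0)
    W1 -= eta * dW1; b1 -= eta * db1
    W2 -= eta * dW2; b2 -= eta * db2

print("reconstruction MSE:", float(np.mean((Xhat - X) ** 2)))
```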

Autoencoders
What is interesting is not the output value (it is just an approximation of the input) but the pattern present on the hidden layer. Since we don't use any target (the target coincides with the input), the autoencoder task is unsupervised; it is sometimes termed "self-supervised".

Learned features from a set of images

Recognizing handwritten digits

Features for recognizing 0 from 8

Features for recognizing 1 from 8

An example of an autoencoder for learning features from symbolic data
Task: diagnose Lyme disease from patient records.
Problem: many features (observed signs and symptoms) are binary and very sparse.

An example of an autoencoder for learning features from symbolic data: learning the features

An example of an autoencoder for learning features from symbolic data: using the learned features

Principal component analysis
It is an instance of factor analysis: discover the few unobservable factors that give rise to observable (measurable) variables.

Example of a factor analysis problem: discover the abilities underlying performance in school tests.
Observed variables: marks in the algebra test, geometry test, literature test, foreign language test, music test, and essay.
Hidden factors: linguistic ability, spatial ability, symbolic processing ability.

Principal Component Analysis, or PCA, is a linear solution to the factor analysis problem. Linear: factors are linear combinations of patterns:
$v = \lambda_1 x_1 + \lambda_2 x_2 + \dots + \lambda_d x_d$

PCA works on the covariance matrix of the data. Covariance between input $x_i$ and input $x_j$:
$\sigma_{i,j} = \sigma_{j,i} = E\{(x_i - \bar{x}_i)(x_j - \bar{x}_j)\}$
where E{} is the expectation (or the mean over the training set) and $\bar{x}_i$ is the mean of the i-th input.
$\Sigma = \begin{pmatrix} \sigma_{1,1} & \sigma_{1,2} & \dots & \sigma_{1,d} \\ \sigma_{2,1} & \sigma_{2,2} & \dots & \sigma_{2,d} \\ \vdots & & \ddots & \vdots \\ \sigma_{d,1} & \sigma_{d,2} & \dots & \sigma_{d,d} \end{pmatrix}$

Note: if X is the training set as a matrix and all inputs have zero mean, i.e., $X \leftarrow X - \bar{X}$, then $\Sigma = X^T X$ (up to a normalizing factor).
In Matlab: X = X - repmat(mean(X), size(X,1), 1)

Principal components
The "factors" in PCA are called principal components and are given by the eigenvectors of Σ: $v_1, \dots, v_d$.
If we project pattern $x = [x_1, x_2, \dots, x_d]$ onto the component $v_i = [v_{i,1}, v_{i,2}, \dots, v_{i,d}]$ we obtain the value of the i-th factor, or component, or feature, for pattern x:
$a_i = x \cdot v_i = \sum_{j=1}^{d} x_j\, v_{i,j}$
OK, components; but why "principal"?

Properties
1. The eigenvectors of Σ can be ordered by the corresponding eigenvalues, from largest to smallest.
2. The eigenvectors are thus ordered by variance (or energy, or level of activity), from largest to smallest.
3. Projection of the training set X onto the first r (principal) components gives the best rank-r approximation to X itself, when measured by mean square error.
PCA is a form of lossy compression. The principal components are features useful to represent the data in a synthetic way (information bottleneck).
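
A short numpy sketch of PCA on the covariance matrix: eigen-decomposition, ordering by eigenvalue, projection onto the first r components, and the rank-r reconstruction; the toy data and the choice r = 2 are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data matrix X, one pattern per row (assumed for illustration).
X = rng.normal(size=(200, 4)) @ np.diag([3.0, 2.0, 0.5, 0.1])

Xc = X - X.mean(axis=0)                  # zero-mean the inputs
Sigma = Xc.T @ Xc / len(Xc)              # covariance matrix of the data
evals, evecs = np.linalg.eigh(Sigma)     # eigenvalues/eigenvectors (ascending order)
order = np.argsort(evals)[::-1]          # reorder from largest to smallest eigenvalue
evals, V = evals[order], evecs[:, order]

r = 2
A = Xc @ V[:, :r]                        # project onto the first r principal components
X_approx = A @ V[:, :r].T                # best rank-r approximation (in mean square error)
print("kept variance fraction:", evals[:r].sum() / evals.sum())
print("reconstruction MSE:", np.mean((Xc - X_approx) ** 2))
```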

It has been proved that an autoencoder with linear activations learns the principal components. This is because the objective is the mean squared reconstruction error of a lower-rank representation, the same as in PCA.

Oja's neuron
A single-unit model with linear (identity) activation: $a = x \cdot w$
Learning rule: $w \leftarrow w + \eta\, a\,(x - a\, w)$

It can be proven that, for small η, Oja's learning rule is a first-order Taylor approximation of the Rayleigh quotient iteration method of finding the principal eigenvector. At convergence, w is the principal component of Σ.
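
A numpy sketch of Oja's rule on zero-mean toy data, compared at the end against the principal eigenvector returned by an explicit eigensolver; the data, the number of epochs, and the (small) learning rate are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Zero-mean toy data with a dominant direction (an assumption for illustration).
X = rng.normal(size=(2000, 3)) * np.array([3.0, 1.0, 0.5])
X -= X.mean(axis=0)

w = rng.normal(size=3)
eta = 0.001                              # small eta, as the convergence result requires
for epoch in range(20):
    for x in X[rng.permutation(len(X))]:
        a = x @ w                        # linear unit: a = x . w
        w += eta * a * (x - a * w)       # Oja's learning rule

Sigma = X.T @ X / len(X)
evals, evecs = np.linalg.eigh(Sigma)
v1 = evecs[:, -1]                        # principal eigenvector from an explicit eigensolver
print("|cosine| between w and v1:", abs(w @ v1) / np.linalg.norm(w))
```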

Oja's neuron is a neural principal component analyzer.
Advantages over using explicit eigensolvers (e.g., the LAPACK eigensolvers, or Matlab's eig function):
1. distributed
2. online (big data!)
Disadvantages:
1. stochastic (convergence in probability)
2. slower, because of the requirement of a small η

Restricted Boltzmann Machines
A generative model, invented by G. Hinton. Started in the Eighties (Boltzmann machines), then developed in the following decades.

Boltzmann machines:
- binary-valued units
- bi-directional connections
- symmetric weights (equal in the two directions)
- general topology (feedback possible)

The restricted version has the limitation that its topology must be a bipartite graph. This makes it more tractable.

Energy
Let $v = [v_i]$ and $h = [h_j]$ be the visible and hidden unit activation values, respectively, $w_{i,j}$ the weight between $v_i$ and $h_j$, and $a_i$ and $b_j$ the biases of visible and hidden units, respectively. Then we can define an "energy"
$E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_i \sum_j v_i\, w_{i,j}\, h_j$

Probability of states
The probability of any possible network state is
$P(v, h) = \frac{1}{Z}\, e^{-E(v,h)}$
with Z the partition function (normalizer).

Probability of states
Since intra-layer connections are not present, the probability of activation of one unit does not depend on that of the other units in the same layer, only on those in the other layer:
$P(v_i = 1 \mid h) = \sigma\!\left(a_i + \sum_j w_{i,j}\, h_j\right), \qquad P(h_j = 1 \mid v) = \sigma\!\left(b_j + \sum_i w_{i,j}\, v_i\right)$
where σ is the logistic function.

Training an RBM
The algorithm is called contrastive divergence. It uses random sampling from the probabilities (computed as above):
- Apply one input v.
- Compute the probability P(h | v); sample from it to generate a hidden configuration h.
- Compute a positive update step $\Delta w^{+} = v\, h^T$ (outer product).
- Generate one possible input v' from the hidden configuration.
- Compute the probability P(h | v').
- Compute a negative update step $\Delta w^{-} = v'\, h'^T$.
- Apply the update: $w \leftarrow w + \eta\, (\Delta w^{+} - \Delta w^{-})$.
This does not optimize any explicit objective function!
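
A minimal numpy sketch of one CD-1 update following the steps above, using the logistic conditionals P(h | v) and P(v | h); the layer sizes, learning rate, bias-update form, and toy binary data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# RBM with nv visible and nh hidden binary units (sizes and data are assumptions).
nv, nh = 6, 3
W = 0.01 * rng.normal(size=(nv, nh))
a = np.zeros(nv)                     # visible biases
b = np.zeros(nh)                     # hidden biases

def cd1_step(v, W, a, b, eta=0.1):
    """One contrastive-divergence (CD-1) update for a single input pattern."""
    p_h = sigmoid(b + v @ W)                        # P(h_j = 1 | v)
    h = (rng.random(nh) < p_h).astype(float)        # sample a hidden configuration
    pos = np.outer(v, p_h)                          # positive statistics v h^T
    p_v = sigmoid(a + W @ h)                        # P(v_i = 1 | h)
    v_neg = (rng.random(nv) < p_v).astype(float)    # generate a "reconstructed" input
    p_h_neg = sigmoid(b + v_neg @ W)
    neg = np.outer(v_neg, p_h_neg)                  # negative statistics v' h'^T
    W += eta * (pos - neg)
    a += eta * (v - v_neg)
    b += eta * (p_h - p_h_neg)
    return W, a, b

# Toy binary training patterns.
data = (rng.random((50, nv)) < 0.3).astype(float)
for epoch in range(100):
    for v in data:
        W, a, b = cd1_step(v, W, a, b)
print("trained weight matrix:\n", np.round(W, 2))
```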

Training RBMs of large size is not simple. There are tricks to make the task easier, for example weight sharing and convolutional neural networks. These help with data having correlated inputs, as in images, video, speech, and general time series.

Deep Belief Networks
A DBN is a sequence of RBMs. Each RBM can be trained independently of the following ones (greedy strategy). The last layer can be a classifier.

Deep networks can be built out of RBMs, but also out of autoencoders. Autoencoders are more sensitive to random noise.

Neural networks: why bother?
Deep learning has achieved success in very complex tasks and won many competitions. Example: extracting words from audio and transforming them into automatic subtitles (cf. YouTube).

THE END
