Deep neural networks

Size: px

Start display at page:

Download "Deep neural networks"

Joel Roberts
5 years ago
Views:

1 Deep neural networks

2 On-line Resources Online book by Michael Nielsen - of convolutions demo/mnist.html - demo of CNN - 3D visualization Stanford CS class CS231n: Convolutional Neural Networks for Visual Recognition. MIT Press book in prep from Bengio

3 A history of neural networks 1940s-60 s: McCulloch & Pitts; Hebb: modeling real neurons Rosenblatt, Widrow-Hoff: : perceptrons 1969: Minskey & Papert, Perceptrons book showed formal limitations of one-layer linear network 1970 s-mid-1980 s: mid-1980 s mid-1990 s: backprop and multi-layer networks Rumelhart and McClelland PDP book set Sejnowski s NETTalk, BP-based text-to-speech Neural Info Processing Systems (NIPS) conference starts Mid 1990 s-early 2000 s: Mid-2000 s to current: More and more interest and experimental success

6 ANNs in the 90 s Mostly 2-layer networks or else carefully constructed deep networks Worked well but training was slow and einicky Nov 1998 Yann LeCunn, Bo2ou, Bengio, Haffner

training typically took weeks when guided by an

7 ANNs in the 90 s Mostly 2-layer networks or else carefully constructed deep networks Worked well but training typically took weeks when guided by an expert SVM: % accurate CNNs: % accurate

8 A history of neural networks 1940s-60 s: McCulloch & Pitts; Hebb: modeling real neurons Rosenblatt, Widrow-Hoff: : perceptrons 1969: Minskey & Papert, Perceptrons book showed formal limitations of one-layer linear network 1970 s-mid-1980 s: mid-1980 s mid-1990 s: backprop and multi-layer networks Rumelhart and McClelland PDP book set Sejnowski s NETTalk, BP-based text-to-speech Neural Info Processing Systems (NIPS) conference starts Mid 1990 s-early 2000 s: Mid-2000 s to current: The next two lectures

9 Outline of lectures What s new in ANNs in the last 5-10 years? What types of ANNs are most successful and why? What are the hot research topics for deep learning?

10 Outline What s new in ANNs in the last 5-10 years? Deeper networks, more data, and faster training Scalability and use of GPUs Symbolic differentiation Some subtle changes to cost function, architectures, optimization methods What types of ANNs are most successful and why? Convolutional networks (CNNs) Long term/short term memory networks (LSTM) Word2vec and embeddings What are the hot research topics for deep learning?

11 PARALLEL TRAINING FOR ANNS

12 Recap: logistic regression with SGD P(Y =1 X = x) = p = 1 1+ e x w This part computes inner product <x,w> This part logisjc of <x,w> 12

13 Recap: logistic regression with SGD P(Y =1 X = x) = p = 1 1+ e x w a z On one example: computes inner product <x,w> There s some chance to compute this in parallel can we do more? 13

14 In ANNs we have many many logistic regression nodes

15 Recap: logistic regression with SGD Let x be an example Let w i be the input weights for the i-th hidden unit Then output a i = x. w i a i z i 15

16 Recap: logistic regression with SGD Let x be an example Let w i be the input weights for the i-th hidden unit Then a = x W is output for all m units w 1 w 2 w 3 w m W = a i z i 16

Recap: logistic regression with SGD Let X be a matrix with k examples Let w i be the input weights for the i-th hidden unit Then A = X W is output for all m units for

17 Recap: logistic regression with SGD Let X be a matrix with k examples Let w i be the input weights for the i-th hidden unit Then A = X W is output for all m units for all k examples w 1 w 2 w 3 w m x x 2 x k There s a lot of chances to do this in parallel XW = x 1.w 1 x 1.w 2 x 1.w m x k.w 1 x k.w m 17

18 ANNs and multicore CPUs Modern libraries (Matlab, numpy, ) do matrix operations fast, in parallel Many ANN implementations exploit this parallelism automatically Key implementation issue is working with matrices comfortably

19 ANNs and GPUs GPUs do matrix operations very fast, in parallel For dense matrixes, not sparse ones! Training ANNs on GPUs is common SGD and minibatch sizes of 128 Modern ANN implementations can exploit this GPUs are not super-expensive $500 for high-end one large models with O(10 7 ) parameters can eit in a large-memory GPU (12Gb) Speedups of 20x-50x have been reported

20 ANNs and multi-gpu systems There are ways to set up ANN computations so that they are spread across multiple GPUs Sometimes needed for very large networks Not especially easy to implement and do with most current tools

21 Outline What s new in ANNs in the last 5-10 years? Deeper networks, more data, and faster training Scalability and use of GPUs Symbolic differentiation Some subtle changes to cost function, architectures, optimization methods But eirst why? What types of ANNs are most successful and why? Convolutional networks (CNNs) Long term/short term memory networks (LSTM) Word2vec and embeddings What are the hot research topics for deep learning?

22 WHY ARE DEEPER NETWORKS USEFUL?

23 Recap: mul6layer networks Classifier is a muljlayer network of logis/c units Each unit takes some inputs and produces one output using a logisjc classifier Output of one unit can be the input of another Input layer Hidden layer Output layer 1 w 0,2 w 0,1 w 1,1 v 1 =S(w T X) w 1 x 1 w 1,2 w 2,1 w 2 z 1 =S(w T V) x 2 w 2,2 v 2 =S(w T X)

24 Recap: Learning a mul6layer network Define a loss which is squared error But over a network of logisjc units Minimize loss with gradient descent J X,y (w) = i ( y i y ˆ i ) 2 But output is network output

Recap: weight updates for mul6layer ANN For nodes k in

errors backward BACKPROP Can carry this recursion out

25 Recap: weight updates for mul6layer ANN For nodes k in output layer: δ k ( t k a ) k a ( k 1 a ) k For nodes j in hidden layer: δ j ( δ k w ) kj a j 1 a j k For all weights: ( ) w kj = w kj ε δ k a j Propagate errors backward BACKPROP Can carry this recursion out further if you have multiple hidden layers w ji = w ji ε δ j a i

ANNs are expressive One logisjc unit can implement and AND or an OR of a subset of inputs e.g., (x 3 AND x 5 AND AND x 19 ) Every boolean funcjon can be expressed as an OR of ANDs e.

26 ANNs are expressive One logisjc unit can implement and AND or an OR of a subset of inputs e.g., (x 3 AND x 5 AND AND x 19 ) Every boolean funcjon can be expressed as an OR of ANDs e.g., (x 3 AND x 5 ) OR (x 7 AND x 19 ) OR So one hidden layer can express any BF (But it might need lots and lots of hidden units)

27 ANNs are expressive One logisjc unit can implement and AND or an OR of a subset of inputs e.g., (x 3 AND x 5 AND AND x 19 ) Every boolean funcjon can be expressed as an OR of ANDs e.g., (x 3 AND x 5 ) OR (x 7 AND x 19 ) OR So one hidden layer can express any BF Example: parity(x1,,xn)=1 iff off number of xi s are set to one Parity(a,b,c,d) = (a & -b & -c & -d) OR (-a & b & -c & -d) OR #list all the 1s (a & b & c & -d) OR (a & b & -c & d) OR #list all the 3s Size in general is O(2 N )

Deeper ANNs are more expressive A two-layer

1 s) With O(logN) units in a binary tree Deep

28 Deeper ANNs are more expressive A two-layer network needs O(2 N ) units A two-layer network can express binary XOR A 2*logN layer network can express the parity of N inputs (even/odd number of 1 s) With O(logN) units in a binary tree Deep network + parameter tying ~= subroujnes x1 x2 x3 x4 x5 x6 x7 x8

29 Hypothe6cal code for face recogni6on..

30 WHY ARE DEEP NETWORKS HARD TO TRAIN?

31 Recap: weight updates for mul6layer ANN For nodes k in output layer L: ( ) a k ( 1 a k ) δ L k t k a k For nodes j in hidden layer h: ( ) δ h δ h+1 w j j kj a j 1 a j k ( ) What happens as the layers get further and further from the output layer? E.g., what s gradient for the bias term with several layers after it?

32 Gradients are unstable Max at 1/4 If weights are usually < 1 then we are multiplying by many numbers < 1 so the gradients get very small. The vanishing gradient problem What happens as the layers get further and further from the output layer? E.g., what s gradient for the bias term with several layers after it in a trivial net?

The exploding gradient problem What happens (less common as the layers but possible) get

33 Gradients are unstable Max at 1/4 If weights are usually > 1 then we are multiplying by many numbers > 1 so the gradients get very big. The exploding gradient problem What happens (less common as the layers but possible) get further and further from the output layer? E.g., what s gradient for the bias term with several layers after it in a trivial net?

34 AI Stats 2010 input output Histogram of gradients in a 5-layer network for an artificial image recognition task

35 AI Stats 2010 We will get to these tricks eventually.

36 It s easy for sigmoid units to saturate Learning rate approaches zero and unit is stuck

37 It s easy for sigmoid units to saturate For a big network there are lots of weighted inputs to each neuron. If any of them are too large then the neuron will saturate. So neurons get stuck with a few large inputs OR many small ones.

38 It s easy for sigmoid units to saturate If there are 500 non-zero inputs inijalized with a Gaussian ~N(0,1) then the SD is

39 It s easy for sigmoid units to saturate SaturaJon visualizajon from Glorot & Bengio using a smarter inijalizajon scheme Bottom layer still stuck for first 100 epochs

40 Outline What s new in ANNs in the last 5-10 years? Deeper networks, more data, and faster training Scalability and use of GPUs Symbolic differenjajon Some subtle changes to cost func6on, architectures, op6miza6on methods What types of ANNs are most successful and why? ConvoluJonal networks (CNNs) Long term/short term memory networks (LSTM) Word2vec and embeddings What are the hot research topics for deep learning?

41 WHAT S DIFFERENT ABOUT MODERN ANNS?

42 Some key differences Use of soomax and entropic loss instead of quadrajc loss. Use of alternate non-linearijes relu and hyperbolic tangent Be2er understanding of weight inijalizajon Data augmentajon Especially for image data

43 Cross-entropy loss Compare to gradient for square loss when a~=1 y=0 and x=1 C w = σ (z) y a z

44 Cross-entropy loss

45 Cross-entropy loss

$So\max output layer Softmax$ distribution!

46 So\max output layer Softmax Network outputs a probability distribution! Cross-entropy loss after a softmax layer gives a very simple, numerically stable gradient Δw ij = (y i -z i )y j

47 Some key differences Use of soomax and entropic loss instead of quadrajc loss. Ooen learning is faster and more stable as well as gepng be2er accuracies in the limit Use of alternate non-linearijes Be2er understanding of weight inijalizajon Data augmentajon Especially for image data

48 Some key differences Use of soomax and entropic loss instead of quadrajc loss. Ooen learning is faster and more stable as well as gepng be2er accuracies in the limit Use of alternate non-linearijes relu and hyperbolic tangent Be2er understanding of weight inijalizajon Data augmentajon Especially for image data

49 Alterna6ve non-lineari6es Changes so far Changed the loss from square error to crossentropy (no effect at test Jme) Proposed adding another output layer (soomax) A new change: modifying the nonlinearity The logisjc is not widely used in modern ANNs

50 Alterna6ve non-lineari6es A new change: modifying the nonlinearity The logisjc is not widely used in modern ANNs Alternate 1: tanh Like logistic function but shifted to range [-1, +1]

51 AI Stats 2010 depth 4? We will get to these tricks eventually.

52 Alterna6ve non-lineari6es A new change: modifying the nonlinearity relu ooen used in vision tasks Alternate 2: rectified linear unit Linear with a cutoff at zero (Implement: clip the gradient when you pass zero)

53 Alterna6ve non-lineari6es A new change: modifying the nonlinearity relu ooen used in vision tasks Alternate 2: rectified linear unit Soft version: log(exp(x)+1) Doesn t saturate (at one end) Sparsifies outputs Helps with vanishing gradient

54 Some key differences Use of soomax and entropic loss instead of quadrajc loss. Ooen learning is faster and more stable as well as gepng be2er accuracies in the limit Use of alternate non-linearijes relu and hyperbolic tangent Be2er understanding of weight inijalizajon Data augmentajon Especially for image data

55 It s easy for sigmoid units to saturate For a big network there are lots of weighted inputs to each neuron. If any of them are too large then the neuron will saturate. So neurons get stuck with a few large inputs OR many small ones.

56 It s easy for sigmoid units to saturate If there are 500 non-zero inputs inijalized with a Gaussian ~N(0,1) then the SD is Common heurisjcs for inijalizing weights:! N # 0, " 1 $ & #inputs % " U $ # 1 #inputs, 1 #inputs % ' &

57 It s easy for sigmoid units to saturate SaturaJon visualizajon from Glorot & Bengio 2010 using " U $ # 1 #inputs, Bottom layer still stuck for first 100 epochs 1 #inputs % ' &

Ini6alizing to avoid satura6on In Glorot and Bengio they suggest weights if level j First (with breakthrough n j inputs) deep from learning results were based on clever pre-training

58 Ini6alizing to avoid satura6on In Glorot and Bengio they suggest weights if level j First (with breakthrough n j inputs) deep from learning results were based on clever pre-training initialization schemes, where deep networks were seeded with weights learned from unsupervised strategies This is not always the solution but good initialization is very important for deep nets!

59 Outline What s new in ANNs in the last 5-10 years? Deeper networks, more data, and faster training Scalability and use of GPUs Symbolic differenjajon Some subtle changes to cost funcjon, architectures, opjmizajon methods What types of ANNs are most successful and why? Convolu6onal networks (CNNs) Long term/short term memory networks (LSTM) Word2vec and embeddings What are the hot research topics for deep learning?

60 WHAT S A CONVOLUTIONAL NEURAL NETWORK?

61 Model of vision in animals

62 Vision with ANNs

63 What s a convolu6on? 1-D

64 What s a convolu6on? 1-D

65 What s a convolu6on?

66 What s a convolu6on?

67 What s a convolu6on?

68 What s a convolu6on?

69 What s a convolu6on?

70 What s a convolu6on?

71 What s a convolu6on? Basic idea: Pick a 3-3 matrix F of weights Slide this over an image and compute the inner product (similarity) of F and the corresponding field of the image, and replace the pixel in the center of the field with the output of the inner product operajon Key point: Different convolujons extract different types of low-level features from an image All that we need to vary to generate these different features is the weights of F

72 How do we convolve an image with an ANN? Note that the parameters in the matrix defining the convolution are tied across all places that it is used

73 How do we do many convolu6ons of an image with an ANN?

74 Example: 6 convolu6ons of a digit

75 CNNs typically alternate convolu6ons, non-linearity, and then downsampling Downsampling is usually averaging or (more common in recent CNNs) max-pooling

76 Why do max-pooling? Saves space Reduces overfipng? Because I m going to add more convolujons aoer it! Allows the short-range convolujons to extend over larger subfields of the images So we can spot larger objects Eg, a long horizontal line, or a corner, or

77 Another CNN visualiza6on

80 Why do max-pooling? Saves space Reduces overfipng? Because I m going to add more convolujons aoer it! Allows the short-range convolujons to extend over larger subfields of the images So we can spot larger objects Eg, a long horizontal line, or a corner, or At some point the feature maps start to get very sparse and blobby they are indicators of some semanjc property, not a recognizable transformajon of the image Then just use them as features in a normal ANN

81 Why do max-pooling? Saves space Reduces overfipng? Because I m going to add more convolujons aoer it! Allows the short-range convolujons to extend over larger subfields of the images So we can spot larger objects Eg, a long horizontal line, or a corner, or

82 Alterna6ng convolu6on and downsampling 5 layers up The subfield in a large dataset that gives the strongest output for a neuron

DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY

DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY 1 On-line Resources http://neuralnetworksanddeeplearning.com/index.html Online book by Michael Nielsen http://matlabtricks.com/post-5/3x3-convolution-kernelswith-online-demo