Introduction to (Convolutional) Neural Networks

1 Introduction to (Convolutional) Neural Networks, Philipp Grohs, Summer School DL and Vis, Sept 2018

2 Syllabus 1 Motivation and Definition 2 Universal Approximation 3 Backpropagation 4 Stochastic Gradient Descent 5 The Basic Recipe 6 Going Deep 7 Convolutional Neural Networks 8 What I didn't tell you

3 1 Motivation and Definition

4 Which Method to Choose? We have seen linear regression, kernel regression, regularization, K-PCA, K-SVM, ... and there exist a zillion other methods. Is there a universally best method?

6 No Free Lunch Theorem

7 No Free Lunch Theorem. No Free Lunch Theorem [Wolpert (1996)], Informal Version: Of course not!! Proof of the Theorem: If $\rho$ is completely arbitrary and nothing is known, we cannot possibly infer anything about $\rho$ from samples $((x_i, y_i))_{i=1}^m$... Every algorithm will have a specific preference, for example as specified through the hypothesis class $\mathcal{H}$; all categories are artificial! We want our algorithm to reproduce the artificial categories produced by our brain, so let's build a hypothesis class that mimics our thinking!

11 Neuroscience

12 Neuroscience The Brain as Biological Neural Network In neuroscience, a biological neural network is a series of interconnected neurons whose activation defines a recognizable linear pathway. The interface through which neurons interact with their neighbors usually consists of several axon terminals connected via synapses to dendrites on other neurons. If the sum of the input signals into one neuron surpasses a certain threshold, the neuron sends an action potential (AP) at the axon hillock and transmits this electrical signal along the axon. Source: Wikipedia

13 Neurons

14 Neurons recall: If the sum of the input signals into one neuron surpasses a certain threshold, [...] the neuron transmits this [...] signal [...].

15 Artificial Neurons. Figure: schematic of an artificial neuron with inputs $x_1, x_2, x_3$, weights $w_1, w_2, w_3$, threshold $b$, a summation node $\Sigma$, and output $y$: the activation is $1$ if $\sum_i x_i w_i - b > 0$ and $0$ if $\sum_i x_i w_i - b \le 0$.

16 Artificial Neurons. Artificial Neuron: An artificial neuron with weights $w_1, \dots, w_s$, bias $b$ and activation function $\sigma : \mathbb{R} \to \mathbb{R}$ is defined as the function $f(x_1, \dots, x_s) = \sigma\!\left(\sum_{i=1}^s x_i w_i - b\right)$.
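A minimal NumPy sketch of this definition; the function name, example inputs and weights below are illustrative, not taken from the slides.

```python
import numpy as np

def artificial_neuron(x, w, b, sigma):
    """Artificial neuron: sigma(sum_i x_i * w_i - b)."""
    return sigma(np.dot(x, w) - b)

# Heaviside activation, as in the biological motivation
heaviside = lambda t: np.where(t > 0, 1.0, 0.0)

# Example with three inputs, three weights and threshold b
x = np.array([0.5, -1.0, 2.0])
w = np.array([1.0, 0.5, 0.25])
b = 0.3
print(artificial_neuron(x, w, b, heaviside))  # 1.0, since 0.5*1 - 1*0.5 + 2*0.25 - 0.3 > 0
```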

17 Activation Functions. Figure: the Heaviside activation function (as in the biological motivation). Figure: the sigmoid activation function $\sigma(x) = \frac{1}{1 + e^{-x}}$.

19 Artificial Neural Networks

20 Artificial Neural Networks. Artificial neural networks consist of a graph connecting artificial neurons! The dynamics are difficult to model, due to loops, etc.

22 Artificial Feedforward Neural Networks. Use a directed, acyclic graph!

24 Artificial Feedforward Neural Networks. Definition: Let $L, d, N_1, \dots, N_L \in \mathbb{N}$. A map $\Phi : \mathbb{R}^d \to \mathbb{R}^{N_L}$ given by $\Phi(x) = A_L \sigma(A_{L-1} \sigma(\dots \sigma(A_1(x))))$, $x \in \mathbb{R}^d$, is called a neural network. It is composed of affine linear maps $A_l : \mathbb{R}^{N_{l-1}} \to \mathbb{R}^{N_l}$, $1 \le l \le L$ (where $N_0 = d$), and a non-linear function, often referred to as the activation function, $\sigma$ acting component-wise. Here, $d$ is the dimension of the input layer, $L$ denotes the number of layers, $N_1, \dots, N_{L-1}$ are the dimensions of the $L-1$ hidden layers, and $N_L$ is the dimension of the output layer. An affine map $A_l : \mathbb{R}^{N_{l-1}} \to \mathbb{R}^{N_l}$ is given by $x \mapsto Wx + b$ with weight matrix $W \in \mathbb{R}^{N_l \times N_{l-1}}$ and bias vector $b \in \mathbb{R}^{N_l}$.
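A small NumPy sketch of this definition, applying $\sigma$ component-wise after every affine map except the last; the layer sizes and random weights are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, sigma=sigmoid):
    """Evaluate Phi(x) = A_L sigma(A_{L-1} sigma(... sigma(A_1(x)))).

    weights[l-1], biases[l-1] define the affine map A_l : a -> W a + b."""
    a = x
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ a + b
        a = sigma(z) if l < len(weights) - 1 else z   # no activation after A_L
    return a

# Example with d = 2, N_1 = 3, N_2 = 1 (random weights, purely illustrative)
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases = [rng.standard_normal(3), rng.standard_normal(1)]
print(forward(np.array([1.0, -1.0]), weights, biases))
```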

26 Artificial Feedforward Neural Networks

27 On the Biological Motivation

28 On the Biological Motivation. Artificial (feedforward) neural networks should not be confused with a model for our brain: neurons are more complicated than simply weighted linear combinations; our brain is not feedforward; biological neural networks evolve with time (neuronal plasticity); ... Artificial feedforward neural networks constitute a mathematically and computationally convenient but very simplistic construct which is inspired by our understanding of how the brain works.

34 Terminology

35 Terminology. Neural Network Learning: use neural networks of a fixed topology as hypothesis class for regression or classification tasks. This requires optimizing the weights and bias parameters. Deep Learning: neural network learning with neural networks consisting of many (e.g., $\geq 3$) layers.

38 2 Universal Approximation

39 Approximation Question. Main Approximation Problem: Can every (continuous, or measurable) function $f : \mathbb{R}^d \to \mathbb{R}^{N_L}$ be arbitrarily well approximated by a neural network, provided that we choose $N_1, \dots, N_{L-1}, L$ large enough? Surely not! Suppose that $\sigma$ is a polynomial of degree $r$. Then $\sigma(A(x))$ is a polynomial of degree $r$ for every affine map $A$, and therefore any neural network with activation function $\sigma$ will itself be a polynomial of bounded degree, so it cannot approximate every continuous function.

42 Approximation Question. Main Approximation Problem: Under which conditions on the activation function $\sigma$ can every (continuous, or measurable) function $f : \mathbb{R}^d \to \mathbb{R}^{N_L}$ be arbitrarily well approximated by a neural network, provided that we choose $N_1, \dots, N_{L-1}, L$ large enough?

43 Universal Approximation Theorem. Theorem: Suppose that $\sigma : \mathbb{R} \to \mathbb{R}$ is continuous and not a polynomial, and fix $d \ge 1$, $L \ge 2$, $N_L \in \mathbb{N}$ and a compact subset $K \subset \mathbb{R}^d$. Then for any continuous $f : \mathbb{R}^d \to \mathbb{R}^{N_L}$ and any $\varepsilon > 0$ there exist $N_1, \dots, N_{L-1} \in \mathbb{N}$ and affine linear maps $A_l : \mathbb{R}^{N_{l-1}} \to \mathbb{R}^{N_l}$, $1 \le l \le L$, such that the neural network $\Phi(x) = A_L \sigma(A_{L-1} \sigma(\dots \sigma(A_1(x))))$, $x \in \mathbb{R}^d$, approximates $f$ to within accuracy $\varepsilon$, i.e., $\sup_{x \in K} \|\Phi(x) - f(x)\| \le \varepsilon$. Neural networks are universal approximators, and one hidden layer is enough if the number of nodes is sufficient!

45 Proof of the Universal Approximation Theorem. For simplicity we only treat the case of one hidden layer, i.e., $L = 2$, and one output neuron, i.e., $N_L = 1$: $\Phi(x) = \sum_{i=1}^{N_1} c_i\, \sigma(w_i \cdot x - b_i)$, $w_i \in \mathbb{R}^d$, $c_i, b_i \in \mathbb{R}$.

47 Proof of the Universal Approximation Theorem. We will show the following. Theorem: For $d \in \mathbb{N}$ and $\sigma : \mathbb{R} \to \mathbb{R}$ continuous, consider $R(\sigma, d) := \mathrm{span}\{\,\sigma(w \cdot x - b) : w \in \mathbb{R}^d,\ b \in \mathbb{R}\,\}$. Then $R(\sigma, d)$ is dense in $C(\mathbb{R}^d)$ if and only if $\sigma$ is not a polynomial.

48 Proof for d = 1 and σ smooth. If $\sigma$ is not a polynomial, there exists $x_0 \in \mathbb{R}$ with $\sigma^{(k)}(-x_0) \neq 0$ for all $k \in \mathbb{N}$. Constant functions can be approximated because $\sigma(hx - x_0) \to \sigma(-x_0) \neq 0$ as $h \to 0$. Linear functions can be approximated because $\frac{1}{h}\big(\sigma((\lambda + h)x - x_0) - \sigma(\lambda x - x_0)\big) \to x\,\sigma'(-x_0)$ as $h, \lambda \to 0$. By the same argument (higher-order difference quotients), polynomials in $x$ can be approximated. The Stone-Weierstrass Theorem yields the result.

54 General d. Note that the functions $\mathrm{span}\{\,g(w \cdot x - b) : w \in \mathbb{R}^d,\ b \in \mathbb{R},\ g \in C(\mathbb{R}) \text{ arbitrary}\,\}$ are dense in $C(\mathbb{R}^d)$ (just take $g$ as $\sin(w \cdot x)$, $\cos(w \cdot x)$, just as in the Fourier series case). First approximate $f \in C(\mathbb{R}^d)$ by $\sum_{i=1}^N d_i\, g_i(v_i \cdot x - e_i)$, $v_i \in \mathbb{R}^d$, $d_i, e_i \in \mathbb{R}$, $g_i \in C(\mathbb{R})$. Then apply our univariate result to approximate the univariate functions $t \mapsto g_i(t - e_i)$ using neural networks.

58 The case that σ is nonsmooth. Pick a family $(g_\varepsilon)_{\varepsilon > 0}$ of mollifiers, i.e., $\lim_{\varepsilon \to 0} \sigma * g_\varepsilon = \sigma$ uniformly on compacta. Apply the previous result to the smooth function $\sigma * g_\varepsilon$ and let $\varepsilon \to 0$.

62 3 Backpropagation

63 Regression/Classification with Neural Networks. Neural Network Hypothesis Class: Given $d, L, N_1, \dots, N_L$ and $\sigma$, define the associated hypothesis class $\mathcal{H}_{[d, N_1, \dots, N_L], \sigma} := \{\, A_L \sigma(A_{L-1} \sigma(\dots \sigma(A_1(x)))) : A_l : \mathbb{R}^{N_{l-1}} \to \mathbb{R}^{N_l} \text{ affine linear} \,\}$. Typical Regression/Classification Task: Given data $z = ((x_i, y_i))_{i=1}^m \subset \mathbb{R}^d \times \mathbb{R}^{N_L}$, find the empirical regression function $f_z \in \mathrm{argmin}_{f \in \mathcal{H}_{[d, N_1, \dots, N_L], \sigma}} \sum_{i=1}^m \mathcal{L}(f, x_i, y_i)$, where $\mathcal{L} : C(\mathbb{R}^d) \times \mathbb{R}^d \times \mathbb{R}^{N_L} \to \mathbb{R}_+$ is the loss function (in least squares problems we have $\mathcal{L}(f, x, y) = \|f(x) - y\|^2$).
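As a rough illustration, the least-squares empirical risk of a candidate $f$ over the data $z$ could be evaluated as follows; the helper name empirical_loss and the data format are assumptions, and f can be any callable such as the forward() sketch above.

```python
import numpy as np

def empirical_loss(f, data):
    """Least-squares empirical risk: sum_i ||f(x_i) - y_i||^2 over z = ((x_i, y_i))_i."""
    return sum(np.sum((f(x) - y) ** 2) for x, y in data)

# e.g. risk = empirical_loss(lambda x: forward(x, weights, biases), data)
```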

65 Example: Handwritten Digits. MNIST Database for handwritten digit recognition (exdb/mnist/). Every image is given as a matrix $x \in \mathbb{R}^{28 \times 28} \cong \mathbb{R}^{784}$. Every label is given as a 10-dim vector $y \in \mathbb{R}^{10}$ describing the probability of each digit. Given labeled training data $(x_i, y_i)_{i=1}^m \subset \mathbb{R}^{784} \times \mathbb{R}^{10}$. Fix the network topology, e.g., the number of layers (for example $L = 3$) and the numbers of neurons ($N_1 = 20$, $N_2 = 20$). The learning goal is to find the empirical regression function $f_z \in \mathcal{H}_{[784,20,20,10],\sigma}$. How? The optimization problem is non-linear and non-convex.

76 Gradient Descent: The Simplest Optimization Method. Gradient Descent: The gradient of $F : \mathbb{R}^N \to \mathbb{R}$ is defined by $\nabla F(u) = \left(\frac{\partial F(u)}{\partial (u)_1}, \dots, \frac{\partial F(u)}{\partial (u)_N}\right)^T$. Gradient descent with stepsize $\eta > 0$ is defined by $u_{n+1} = u_n - \eta \nabla F(u_n)$. Converges (slowly) to a stationary point of $F$.
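A minimal sketch of this iteration, assuming the gradient is available as a callable grad_F; the stepsize and iteration count are arbitrary choices.

```python
import numpy as np

def gradient_descent(grad_F, u0, eta=0.1, n_steps=100):
    """Iterate u_{n+1} = u_n - eta * grad F(u_n)."""
    u = np.asarray(u0, dtype=float)
    for _ in range(n_steps):
        u = u - eta * grad_F(u)
    return u

# Example: minimize F(u) = ||u||^2, whose gradient is 2u; the iterates approach 0.
print(gradient_descent(lambda u: 2 * u, u0=[3.0, -2.0]))
```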

80 Backprop. In our problem: $F = \sum_{i=1}^m \mathcal{L}(f, x_i, y_i)$ and $u = ((W_l, b_l))_{l=1}^L$. Since $\nabla_{((W_l, b_l))_{l=1}^L} F = \sum_{i=1}^m \nabla_{((W_l, b_l))_{l=1}^L} \mathcal{L}(f, x_i, y_i)$, we need to determine (for $x, y \in \mathbb{R}^d \times \mathbb{R}^{N_L}$ fixed) $\frac{\partial \mathcal{L}(f,x,y)}{\partial (W_l)_{i,j}}$ and $\frac{\partial \mathcal{L}(f,x,y)}{\partial (b_l)_i}$, $l = 1, \dots, L$. For simplicity suppose that $\mathcal{L}(f, x, y) = \|f(x) - y\|^2$, so that $\frac{\partial \mathcal{L}(f,x,y)}{\partial (W_l)_{i,j}} = 2(f(x) - y)^T \frac{\partial f(x)}{\partial (W_l)_{i,j}}$ and $\frac{\partial \mathcal{L}(f,x,y)}{\partial (b_l)_i} = 2(f(x) - y)^T \frac{\partial f(x)}{\partial (b_l)_i}$.

83 A worked example for backprop: take $x = ((x)_1, (x)_2)^T \in \mathbb{R}^2$ and the three-layer network $a_1 = \sigma(z_1) = \sigma(W_1 x + b_1)$, $a_2 = \sigma(z_2) = \sigma(W_2 a_1 + b_2)$, $\Phi(x) = z_3 = W_3 a_2 + b_3$, with $W_1 \in \mathbb{R}^{3 \times 2}$, $b_1 \in \mathbb{R}^3$, $W_2 \in \mathbb{R}^{3 \times 3}$, $b_2 \in \mathbb{R}^3$, $W_3 \in \mathbb{R}^{2 \times 3}$, $b_3 \in \mathbb{R}^2$. Then $\frac{\partial (z_3)_1}{\partial (W_3)_{1,2}} = \frac{\partial}{\partial (W_3)_{1,2}}\big((W_3)_{1,1}(a_2)_1 + (W_3)_{1,2}(a_2)_2 + (W_3)_{1,3}(a_2)_3\big) = (a_2)_2$, while $\frac{\partial (z_3)_2}{\partial (W_3)_{1,2}} = \frac{\partial}{\partial (W_3)_{1,2}}\big((W_3)_{2,1}(a_2)_1 + (W_3)_{2,2}(a_2)_2 + (W_3)_{2,3}(a_2)_3\big) = 0$. In general, $\frac{\partial (z_3)_k}{\partial (W_3)_{i,j}} = (a_2)_j$ if $i = k$ and $0$ if $i \neq k$, and $\frac{\partial (z_3)_k}{\partial (b_3)_i} = 1$ if $i = k$ and $0$ if $i \neq k$. Equivalently, $\frac{\partial \Phi(x)}{\partial (W_3)_{i,j}} = (a_2)_j\, e_i$ and $\frac{\partial \Phi(x)}{\partial (b_3)_i} = e_i$, where $e_i \in \mathbb{R}^2$ denotes the $i$-th standard basis vector.

96 Backprop: Last Layer. $\frac{\partial \mathcal{L}(f,x,y)}{\partial (W_L)_{i,j}} = 2(f(x) - y)^T \frac{\partial f(x)}{\partial (W_L)_{i,j}}$, $\frac{\partial \mathcal{L}(f,x,y)}{\partial (b_L)_i} = 2(f(x) - y)^T \frac{\partial f(x)}{\partial (b_L)_i}$. Let $f(x) = W_L\, \sigma(W_{L-1}(\dots) + b_{L-1}) + b_L$. It follows that $\frac{\partial f(x)}{\partial (W_L)_{i,j}} = (0, \dots, \sigma(W_{L-1}(\dots) + b_{L-1})_j, \dots, 0)^T$ (nonzero only in the $i$-th entry) and $\frac{\partial f(x)}{\partial (b_L)_i} = (0, \dots, 1, \dots, 0)^T$ ($1$ in the $i$-th entry), hence $2(f(x) - y)^T \frac{\partial f(x)}{\partial (W_L)_{i,j}} = 2(f(x) - y)_i\, \sigma(W_{L-1}(\dots) + b_{L-1})_j$ and $2(f(x) - y)^T \frac{\partial f(x)}{\partial (b_L)_i} = 2(f(x) - y)_i$. In matrix notation: $\frac{\partial \mathcal{L}(f,x,y)}{\partial W_L} = \underbrace{2(f(x) - y)}_{\delta_L}\, \underbrace{\big(\sigma(\overbrace{W_{L-1}(\dots) + b_{L-1}}^{z_{L-1}})\big)^T}_{a_{L-1}^T}$, $\frac{\partial \mathcal{L}(f,x,y)}{\partial b_L} = 2(f(x) - y)$.

101 Backprop: Second-to-last Layer. Define $a_{l+1} = \sigma(z_{l+1})$ where $z_{l+1} = W_{l+1} a_l + b_{l+1}$, $a_0 = x$, $f(x) = z_L$. We have computed $\frac{\partial \mathcal{L}(f,x,y)}{\partial W_L}$ and $\frac{\partial \mathcal{L}(f,x,y)}{\partial b_L}$. Then, use the chain rule: $\frac{\partial \mathcal{L}(f,x,y)}{\partial W_{L-1}} = \frac{\partial \mathcal{L}(f,x,y)}{\partial a_{L-1}} \cdot \frac{\partial a_{L-1}}{\partial W_{L-1}} = 2(f(x) - y)^T W_L\, \frac{\partial a_{L-1}}{\partial W_{L-1}} = 2(f(x) - y)^T W_L\, \mathrm{diag}(\sigma'(z_{L-1}))\, \frac{\partial z_{L-1}}{\partial W_{L-1}}$ (same as before!) $= \underbrace{\mathrm{diag}(\sigma'(z_{L-1}))\, W_L^T\, 2(f(x) - y)}_{\delta_{L-1}}\; a_{L-2}^T$. Similar arguments yield $\frac{\partial \mathcal{L}(f,x,y)}{\partial b_{L-1}} = \delta_{L-1}$.

108 The Backprop Algorithm. 1 Calculate $a_l = \sigma(z_l)$, $z_l = A_l(a_{l-1})$ for $l = 1, \dots, L$, $a_0 = x$ (forward pass). 2 Set $\delta_L = 2(f(x) - y)$. 3 Then $\frac{\partial \mathcal{L}(f,x,y)}{\partial b_L} = \delta_L$ and $\frac{\partial \mathcal{L}(f,x,y)}{\partial W_L} = \delta_L a_{L-1}^T$. 4 For $l$ from $L-1$ to $1$ do: $\delta_l = \mathrm{diag}(\sigma'(z_l))\, W_{l+1}^T\, \delta_{l+1}$; then $\frac{\partial \mathcal{L}(f,x,y)}{\partial b_l} = \delta_l$ and $\frac{\partial \mathcal{L}(f,x,y)}{\partial W_l} = \delta_l a_{l-1}^T$. 5 Return $\frac{\partial \mathcal{L}(f,x,y)}{\partial b_l}$, $\frac{\partial \mathcal{L}(f,x,y)}{\partial W_l}$, $l = 1, \dots, L$.
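A compact NumPy sketch of this algorithm for the squared loss, with sigmoid activations on the hidden layers and a linear output layer ($f(x) = z_L$); the variable names and example shapes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(weights, biases, x, y):
    """Gradients of L(f,x,y) = ||f(x) - y||^2 with respect to all W_l, b_l."""
    L = len(weights)
    a = [np.asarray(x, dtype=float)]   # a_0 = x
    zs = []
    for l, (W, b) in enumerate(zip(weights, biases)):      # forward pass
        z = W @ a[-1] + b
        zs.append(z)
        a.append(sigmoid(z) if l < L - 1 else z)           # f(x) = z_L
    grads_W, grads_b = [None] * L, [None] * L
    delta = 2.0 * (a[-1] - y)                              # delta_L = 2(f(x) - y)
    grads_b[L - 1] = delta
    grads_W[L - 1] = np.outer(delta, a[L - 1])             # delta_L a_{L-1}^T
    for l in range(L - 2, -1, -1):                         # backward pass
        s = sigmoid(zs[l])
        delta = (s * (1 - s)) * (weights[l + 1].T @ delta)  # recursion delta_l = diag(sigma'(z_l)) W_{l+1}^T delta_{l+1}
        grads_b[l] = delta
        grads_W[l] = np.outer(delta, a[l])
    return grads_W, grads_b

# Usage with the shapes from the worked example above (W_1: 3x2, W_2: 3x3, W_3: 2x3)
rng = np.random.default_rng(0)
Ws = [rng.standard_normal(s) for s in [(3, 2), (3, 3), (2, 3)]]
bs = [rng.standard_normal(s[0]) for s in [(3, 2), (3, 3), (2, 3)]]
gW, gb = backprop(Ws, bs, x=np.array([1.0, -1.0]), y=np.array([0.0, 1.0]))
```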

116 Computational Graphs

117 Automatic Differentiation

118 4 Stochastic Gradient Descent

119 The Complexity of Gradient Descent. Recall that one gradient descent step requires the calculation of $\sum_{i=1}^m \nabla_{((W_l, b_l))_{l=1}^L} \mathcal{L}(f, x_i, y_i)$, and each of the summands requires one backpropagation run. Thus, the total complexity of one gradient descent step is equal to $m \cdot \mathrm{complexity}(\text{backprop})$. The complexity of backprop is asymptotically equal to the number of DOFs of the network: $\mathrm{complexity}(\text{backprop}) \sim \sum_{l=1}^L (N_{l-1} N_l + N_l)$.

122 An Example. The ImageNet database consists of 1.2m images and 1000 categories. AlexNet, a neural network with 160m DOFs, is one of the most successful annotation methods. One step of gradient descent requires an enormous number of flops (and memory units)!!

126 Stochastic Gradient Descent (SGD). Approximate $\sum_{i=1}^m \nabla_{((W_l,b_l))_{l=1}^L} \mathcal{L}(f, x_i, y_i)$ by $\nabla_{((W_l,b_l))_{l=1}^L} \mathcal{L}(f, x_i, y_i)$ for some $i$ chosen uniformly at random from $\{1, \dots, m\}$. In expectation we have $\mathbb{E}\, \nabla_{((W_l,b_l))_{l=1}^L} \mathcal{L}(f, x_i, y_i) = \frac{1}{m} \sum_{i=1}^m \nabla_{((W_l,b_l))_{l=1}^L} \mathcal{L}(f, x_i, y_i)$.

128 The SGD Algorithm. Goal: Find a stationary point of the function $F = \sum_{i=1}^m F_i : \mathbb{R}^N \to \mathbb{R}$. 1 Set a starting value $u_0$ and $n = 0$. 2 While (error is large) do: pick $i \in \{1, \dots, m\}$ uniformly at random; update $u_{n+1} = u_n - \eta \nabla F_i(u_n)$; set $n = n + 1$. 3 Return $u_n$.
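A minimal sketch of this loop (with a fixed number of steps instead of an error test), assuming grad_F_i(u, i) returns the gradient of the i-th summand F_i at u.

```python
import numpy as np

def sgd(grad_F_i, m, u0, eta=0.01, n_steps=1000, seed=0):
    """Stochastic gradient descent for F = sum_{i=1}^m F_i."""
    rng = np.random.default_rng(seed)
    u = np.asarray(u0, dtype=float)
    for _ in range(n_steps):
        i = rng.integers(m)              # pick i in {1,...,m} uniformly at random
        u = u - eta * grad_F_i(u, i)     # u_{n+1} = u_n - eta * grad F_i(u_n)
    return u

# Example: F_i(u) = ||u - c_i||^2 with random centers c_i; SGD hovers around their mean.
c = np.random.default_rng(1).standard_normal((50, 2))
print(sgd(lambda u, i: 2 * (u - c[i]), m=50, u0=[0.0, 0.0]))
```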

135 Typical Behavior. Figure: comparison between GD and SGD, where $m$ steps of SGD are counted as one iteration. Initially very fast convergence, followed by stagnation!

137 Minibatch SGD. For every $\{i_1, \dots, i_K\} \subset \{1, \dots, m\}$ chosen uniformly at random, it holds that $\mathbb{E}\, \frac{1}{K} \sum_{k=1}^K \nabla_{((W_l,b_l))_{l=1}^L} \mathcal{L}(f, x_{i_k}, y_{i_k}) = \frac{1}{m} \sum_{i=1}^m \nabla_{((W_l,b_l))_{l=1}^L} \mathcal{L}(f, x_i, y_i)$, i.e., we have an unbiased estimator for the gradient. $K = 1$: SGD. $K > 1$: Minibatch SGD with batchsize $K$.
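A sketch of the corresponding minibatch gradient estimator; the helper name and signature are assumptions.

```python
import numpy as np

def minibatch_gradient(grad_F_i, m, u, K, rng):
    """Unbiased estimate (1/K) sum_k grad F_{i_k}(u) of the mean gradient (1/m) sum_i grad F_i(u)."""
    idx = rng.choice(m, size=K, replace=False)   # {i_1, ..., i_K} drawn uniformly at random
    return sum(grad_F_i(u, i) for i in idx) / K

# e.g. g = minibatch_gradient(grad_F_i, m, u, K=32, rng=np.random.default_rng(0))
```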

140 Some Heuristics. The sample mean $\frac{1}{K} \sum_{k=1}^K \nabla_{((W_l,b_l))_{l=1}^L} \mathcal{L}(f, x_{i_k}, y_{i_k})$ is itself a random variable with expected value $\frac{1}{m} \sum_{i=1}^m \nabla_{((W_l,b_l))_{l=1}^L} \mathcal{L}(f, x_i, y_i)$. In order to assess the deviation of the sample mean from its expected value we may compute its standard deviation $\sigma/\sqrt{K}$, where $\sigma$ is the standard deviation of the individual gradients $\nabla_{((W_l,b_l))_{l=1}^L} \mathcal{L}(f, x_i, y_i)$. Increasing the batch size by a factor 100 therefore improves the standard deviation only by a factor 10, while the complexity increases by a factor 100! Common batchsize for large models: K = 16, 32.

145 5 The Basic Recipe

146 The Basic Neural Network Recipe for Learning. 1 Neuro-inspired model. 2 Backprop. 3 Minibatch SGD. Now let's try classifying handwritten digits!

151 Results. MNIST dataset, 30 epochs, learning rate $\eta = 3.0$, minibatch size $K = 10$, training set size $m = 50000$, test set size 10000. Network size [784, 30, 10]: classification accuracy 94.84%. Network size [784, 30, 30, 10]: classification accuracy 95.81%. Network size [784, 30, 30, 30, 10]: classification accuracy 95.07%. Deep learning might not help after all...

159 6 Going Deep (?)

160 Problems with Deep Networks. Overfitting (as usual...). The Vanishing/Exploding Gradient Problem.

163 Dealing with Overfitting: Regularization. Rather than minimizing $\sum_{i=1}^m \mathcal{L}(f, x_i, y_i)$, minimize for example $\sum_{i=1}^m \mathcal{L}(f, x_i, y_i) + \lambda\, \Omega((W_l)_{l=1}^L)$, with $\Omega((W_l)_{l=1}^L) = \sum_{l,i,j} |(W_l)_{i,j}|^p$. The gradient update has to be augmented by $\lambda\, \frac{\partial \Omega((W_l)_{l=1}^L)}{\partial (W_l)_{i,j}} = \lambda\, p\, |(W_l)_{i,j}|^{p-1}\, \mathrm{sgn}((W_l)_{i,j})$.
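A one-line sketch of the extra gradient term for the weights W of a single layer, stored as a NumPy array (the function name is an assumption).

```python
import numpy as np

def regularization_gradient(W, lam, p=2):
    """Gradient of lam * sum_{i,j} |W_ij|^p: lam * p * |W_ij|^(p-1) * sgn(W_ij)."""
    return lam * p * np.abs(W) ** (p - 1) * np.sign(W)

# For p = 2 this reduces to the familiar weight-decay term 2 * lam * W.
```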

165 Sparsity-Promoting Regularization. Since $\lim_{p \to 0} \sum_{l,i,j} |(W_l)_{i,j}|^p = \#\{\text{nonzero weights}\}$, regularization with $p \le 1$ promotes sparse connectivity (and hence small memory requirements)!

167 Dropout. During each feedforward/backprop step, drop nodes with probability $p$. After training, multiply all weights by $p$. The final output is an average over many sparse network models.
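A minimal dropout sketch; it uses the convention that p is the probability of keeping a node, so that the test-time scaling by p matches the slide. Names and signature are assumptions.

```python
import numpy as np

def dropout_forward(a, p, rng, train=True):
    """At train time keep each entry of the activation a with probability p (zeroing the rest);
    at test time scale by p so the expected activation matches."""
    if train:
        mask = (rng.random(a.shape) < p).astype(a.dtype)
        return a * mask
    return a * p
```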

171 Dataset Augmentation. Use invariances in the dataset to generate more data! Sometimes noise is also added to the weights to favour robust stationary points.

176 The Vanishing Gradient Problem. Figure: an extremely deep network, i.e., a chain of scalar neurons: $\Phi(x) = w_5 \sigma(w_4 \sigma(w_3 \sigma(w_2 \sigma(w_1 x + b_1) + b_2) + b_3) + b_4) + b_5$. Then $\frac{\partial \Phi(x)}{\partial b_1} = \prod_{l=2}^{L} w_l \prod_{l=1}^{L-1} \sigma'(z_l)$. If $\sigma(x) = \frac{1}{1+e^{-x}}$, it holds that $\sigma'(x) \le 2 e^{-|x|}$, and thus $\left|\frac{\partial \Phi(x)}{\partial b_1}\right| \le 2^{L-1} \prod_{l=2}^{L} |w_l|\; e^{-\sum_{l=1}^{L-1} |z_l|}$. The bottom layers will learn much slower than the top layers and not contribute to learning. Is depth a nuisance!?

183 Dealing with the Vanishing Gradient Problem. Use an activation function with large gradient. ReLU: The Rectified Linear Unit is defined as $\mathrm{ReLU}(x) := x$ for $x > 0$ and $0$ else.
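For illustration, ReLU and its derivative in NumPy; on the active region the derivative is exactly 1, so the factors $\sigma'(z_l)$ in the gradient product do not decay there.

```python
import numpy as np

def relu(x):
    """ReLU(x) = x for x > 0, 0 else."""
    return np.maximum(x, 0.0)

def relu_prime(x):
    """Derivative: 1 on the active region, 0 elsewhere (value at 0 chosen as 0)."""
    return (x > 0).astype(float)
```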

186 7 Convolutional Neural Networks

188 Is there a cat in this image?

190 Suppose we have a cat-filter $W = (w_{i,j})_{i,j=1}^3 \in \mathbb{R}^{3 \times 3}$. C-S Inequality (Cat-Selection): for any $X = (x_{i,j})_{i,j=1}^3$ we have $\langle X, W \rangle \le \langle X, X \rangle^{1/2} \langle W, W \rangle^{1/2}$, with equality if and only if $X$ is parallel to a cat. So perform the cat-test on all $3 \times 3$ image subpatches!

212 Convolution. Definition: Suppose that $X, Y \in \mathbb{R}^{n \times n}$. Then $Z = X * Y \in \mathbb{R}^{n \times n}$ is defined as $Z[i,j] = \sum_{k,l=0}^{n-1} X[i-k, j-l]\, Y[k,l]$, where periodization or zero-padding of $X, Y$ is used if $i-k$ or $j-l$ is not in $\{0, \dots, n-1\}$. Efficient computation is possible via FFT (or directly if $X$ or $Y$ is sparse)!
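A direct (and deliberately unoptimized) NumPy sketch of this definition with zero-padding; the commented line at the end gives the periodized variant via FFT.

```python
import numpy as np

def conv2d(X, Y):
    """Z[i,j] = sum_{k,l} X[i-k, j-l] * Y[k,l], with zero-padding outside {0,...,n-1}."""
    n = X.shape[0]
    Z = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            for k in range(n):
                for l in range(n):
                    if 0 <= i - k < n and 0 <= j - l < n:
                        Z[i, j] += X[i - k, j - l] * Y[k, l]
    return Z

# Periodized version via FFT:
# Z_per = np.real(np.fft.ifft2(np.fft.fft2(X) * np.fft.fft2(Y)))
```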

214 Example: Detecting Vertical Edges. Given a vertical-edge-detection filter $W$. Figures: an input image, the filter, and the result of the convolution, in which the vertical edges of the image are highlighted.
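A small usage example, assuming a standard 3x3 vertical-edge filter (the exact filter shown on the slide is not reproduced here) and using scipy.signal.convolve2d for the zero-padded convolution.

```python
import numpy as np
from scipy.signal import convolve2d

# Illustrative vertical-edge filter (an assumption, not necessarily the one on the slide)
W = np.array([[1, 0, -1],
              [1, 0, -1],
              [1, 0, -1]], dtype=float)

# Test image: left half dark, right half bright, i.e. one vertical edge in the middle
X = np.zeros((8, 8))
X[:, 4:] = 1.0

Z = convolve2d(X, W, mode="same", boundary="fill")  # zero-padded convolution
print(np.abs(Z).max(axis=0))  # the largest responses appear near the edge columns
```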

220 Introducing Convolutional Nodes. A convolutional node accepts as input a stack of images, e.g., $X \in \mathbb{R}^{n_1 \times n_2 \times S}$. Given a filter $W \in \mathbb{R}^{F \times F \times S}$, where $F$ is the spatial extent, and a bias $b \in \mathbb{R}$, it computes the matrix $Z = W *_{1,2} X := \sum_{i=1}^S X[:,:,i] * W[:,:,i] + b$.

223 Introducing Convolutional Layers. A convolutional layer consists of $K$ convolutional nodes $((W_i, b_i))_{i=1}^K \subset \mathbb{R}^{F \times F \times S} \times \mathbb{R}$ and produces as output a stack $Z \in \mathbb{R}^{n_1 \times n_2 \times K}$ via $Z[:,:,i] = W_i *_{1,2} X + b_i$. A convolutional layer can be written as a conventional neural network layer!
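A sketch of a convolutional node and a convolutional layer built on scipy.signal.convolve2d; the function names and the choice of 'same' zero-padding are assumptions.

```python
import numpy as np
from scipy.signal import convolve2d

def conv_node(X, W, b):
    """Z = sum_{i=1}^S X[:,:,i] * W[:,:,i] + b for X in R^{n1 x n2 x S}, W in R^{F x F x S}."""
    S = X.shape[2]
    Z = sum(convolve2d(X[:, :, i], W[:, :, i], mode="same", boundary="fill")
            for i in range(S))
    return Z + b

def conv_layer(X, filters, biases):
    """Stack the outputs of K convolutional nodes into Z in R^{n1 x n2 x K}."""
    return np.stack([conv_node(X, W, b) for W, b in zip(filters, biases)], axis=2)
```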

225 Activation Layers. The activation layer is defined in the same way as before, e.g., $Z \in \mathbb{R}^{n_1 \times n_2 \times K}$ is mapped to $A = \mathrm{ReLU}(Z)$, where ReLU is applied component-wise.

226 Pooling Layers. Reduce dimensionality after filtering. Definition: A pooling operator $R$ acts layer-wise on a tensor $X \in \mathbb{R}^{n_1 \times n_2 \times S}$ to produce a tensor $R(X) \in \mathbb{R}^{m_1 \times m_2 \times S}$, where $m_1 < n_1$ and $m_2 < n_2$. Examples: downsampling, max-pooling.
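A minimal max-pooling sketch with a 2x2 window (the window size and the cropping behaviour are illustrative choices).

```python
import numpy as np

def max_pool(X, window=2):
    """Max-pooling applied layer-wise to X in R^{n1 x n2 x S}, giving R^{m1 x m2 x S}."""
    n1, n2, S = X.shape
    m1, m2 = n1 // window, n2 // window
    X = X[:m1 * window, :m2 * window, :]          # crop to a multiple of the window size
    X = X.reshape(m1, window, m2, window, S)
    return X.max(axis=(1, 3))                     # shape (m1, m2, S), m1 < n1, m2 < n2
```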

230 Convolutional Neural Networks (CNNs). Definition: A CNN with $L$ layers consists of $L$ iterative applications of a convolutional layer, followed by an activation layer, (possibly) followed by a pooling layer. Typical architectures consist of a CNN (as a feature extractor), followed by a fully connected NN (as a classifier). Figure: LeNet (1998, LeCun et al.): the first successful CNN architecture, used for reading handwritten digits.
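For orientation, a LeNet-style model of this kind written in Keras; the filter counts, activations and layer sizes are illustrative and not the exact LeNet-5 configuration.

```python
import tensorflow as tf

# Convolution + pooling layers as the feature extractor, fully connected layers as the classifier.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(6, kernel_size=5, activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Conv2D(16, kernel_size=5, activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(120, activation="relu"),
    tf.keras.layers.Dense(84, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
```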

233 Feature Extractor vs. Classifier. Figures: a CNN acting as a feature extractor followed by a classifier, illustrated on portraits labelled D. Trump, B. Sanders, A. Merkel, B. Johnson.

237 Python Library. TensorFlow, developed by Google Brain, based on symbolic computational graphs.

239 8 What I didn't tell you

241 1 Data structures & algorithms for efficient deep learning (computational graphs, automatic differentiation, adaptive learning rates, hardware, ...). 2 Things to do besides regression or classification. 3 Finetuning (choice of activation function, choice of loss function, ...). 4 More sophisticated training procedures for feature extractor layers (Autoencoders, Restricted Boltzmann Machines, ...). 5 Recurrent Neural Networks.

246 Questions?


More information

Lecture 8: Introduction to Deep Learning: Part 2 (More on backpropagation, and ConvNets)

Lecture 8: Introduction to Deep Learning: Part 2 (More on backpropagation, and ConvNets) COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 8: Introduction to Deep Learning: Part 2 (More on backpropagation, and ConvNets) Sanjeev Arora Elad Hazan Recap: Structure of a deep

More information

Neural Networks: Introduction

Neural Networks: Introduction Neural Networks: Introduction Machine Learning Fall 2017 Based on slides and material from Geoffrey Hinton, Richard Socher, Dan Roth, Yoav Goldberg, Shai Shalev-Shwartz and Shai Ben-David, and others 1

More information

Neural Networks: Backpropagation

Neural Networks: Backpropagation Neural Networks: Backpropagation Machine Learning Fall 2017 Based on slides and material from Geoffrey Hinton, Richard Socher, Dan Roth, Yoav Goldberg, Shai Shalev-Shwartz and Shai Ben-David, and others

More information

How to do backpropagation in a brain

How to do backpropagation in a brain How to do backpropagation in a brain Geoffrey Hinton Canadian Institute for Advanced Research & University of Toronto & Google Inc. Prelude I will start with three slides explaining a popular type of deep

More information

Artificial Neural Networks. MGS Lecture 2

Artificial Neural Networks. MGS Lecture 2 Artificial Neural Networks MGS 2018 - Lecture 2 OVERVIEW Biological Neural Networks Cell Topology: Input, Output, and Hidden Layers Functional description Cost functions Training ANNs Back-Propagation

More information

Lecture 4: Perceptrons and Multilayer Perceptrons

Lecture 4: Perceptrons and Multilayer Perceptrons Lecture 4: Perceptrons and Multilayer Perceptrons Cognitive Systems II - Machine Learning SS 2005 Part I: Basic Approaches of Concept Learning Perceptrons, Artificial Neuronal Networks Lecture 4: Perceptrons

More information

Machine Learning

Machine Learning Machine Learning 10-315 Maria Florina Balcan Machine Learning Department Carnegie Mellon University 03/29/2019 Today: Artificial neural networks Backpropagation Reading: Mitchell: Chapter 4 Bishop: Chapter

More information

Neural networks COMS 4771

Neural networks COMS 4771 Neural networks COMS 4771 1. Logistic regression Logistic regression Suppose X = R d and Y = {0, 1}. A logistic regression model is a statistical model where the conditional probability function has a

More information

18.6 Regression and Classification with Linear Models

18.6 Regression and Classification with Linear Models 18.6 Regression and Classification with Linear Models 352 The hypothesis space of linear functions of continuous-valued inputs has been used for hundreds of years A univariate linear function (a straight

More information

Introduction to Deep Learning CMPT 733. Steven Bergner

Introduction to Deep Learning CMPT 733. Steven Bergner Introduction to Deep Learning CMPT 733 Steven Bergner Overview Renaissance of artificial neural networks Representation learning vs feature engineering Background Linear Algebra, Optimization Regularization

More information

CSE 190 Fall 2015 Midterm DO NOT TURN THIS PAGE UNTIL YOU ARE TOLD TO START!!!!

CSE 190 Fall 2015 Midterm DO NOT TURN THIS PAGE UNTIL YOU ARE TOLD TO START!!!! CSE 190 Fall 2015 Midterm DO NOT TURN THIS PAGE UNTIL YOU ARE TOLD TO START!!!! November 18, 2015 THE EXAM IS CLOSED BOOK. Once the exam has started, SORRY, NO TALKING!!! No, you can t even say see ya

More information

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD WHAT IS A NEURAL NETWORK? The simplest definition of a neural network, more properly referred to as an 'artificial' neural network (ANN), is provided

More information

CSC242: Intro to AI. Lecture 21

CSC242: Intro to AI. Lecture 21 CSC242: Intro to AI Lecture 21 Administrivia Project 4 (homeworks 18 & 19) due Mon Apr 16 11:59PM Posters Apr 24 and 26 You need an idea! You need to present it nicely on 2-wide by 4-high landscape pages

More information

ARTIFICIAL INTELLIGENCE. Artificial Neural Networks

ARTIFICIAL INTELLIGENCE. Artificial Neural Networks INFOB2KI 2017-2018 Utrecht University The Netherlands ARTIFICIAL INTELLIGENCE Artificial Neural Networks Lecturer: Silja Renooij These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html

More information

Artificial Neural Networks. Introduction to Computational Neuroscience Tambet Matiisen

Artificial Neural Networks. Introduction to Computational Neuroscience Tambet Matiisen Artificial Neural Networks Introduction to Computational Neuroscience Tambet Matiisen 2.04.2018 Artificial neural network NB! Inspired by biology, not based on biology! Applications Automatic speech recognition

More information

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4 Neural Networks Learning the network: Backprop 11-785, Fall 2018 Lecture 4 1 Recap: The MLP can represent any function The MLP can be constructed to represent anything But how do we construct it? 2 Recap:

More information

RegML 2018 Class 8 Deep learning

RegML 2018 Class 8 Deep learning RegML 2018 Class 8 Deep learning Lorenzo Rosasco UNIGE-MIT-IIT June 18, 2018 Supervised vs unsupervised learning? So far we have been thinking of learning schemes made in two steps f(x) = w, Φ(x) F, x

More information

CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning

CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning Lei Lei Ruoxuan Xiong December 16, 2017 1 Introduction Deep Neural Network

More information

Deep Learning (CNNs)

Deep Learning (CNNs) 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Deep Learning (CNNs) Deep Learning Readings: Murphy 28 Bishop - - HTF - - Mitchell

More information

From perceptrons to word embeddings. Simon Šuster University of Groningen

From perceptrons to word embeddings. Simon Šuster University of Groningen From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Yongjin Park 1 Goal of Feedforward Networks Deep Feedforward Networks are also called as Feedforward neural networks or Multilayer Perceptrons Their Goal: approximate some function

More information

Deep Feedforward Networks. Seung-Hoon Na Chonbuk National University

Deep Feedforward Networks. Seung-Hoon Na Chonbuk National University Deep Feedforward Networks Seung-Hoon Na Chonbuk National University Neural Network: Types Feedforward neural networks (FNN) = Deep feedforward networks = multilayer perceptrons (MLP) No feedback connections

More information

Advanced computational methods X Selected Topics: SGD

Advanced computational methods X Selected Topics: SGD Advanced computational methods X071521-Selected Topics: SGD. In this lecture, we look at the stochastic gradient descent (SGD) method 1 An illustrating example The MNIST is a simple dataset of variety

More information

CSC321 Lecture 5: Multilayer Perceptrons

CSC321 Lecture 5: Multilayer Perceptrons CSC321 Lecture 5: Multilayer Perceptrons Roger Grosse Roger Grosse CSC321 Lecture 5: Multilayer Perceptrons 1 / 21 Overview Recall the simple neuron-like unit: y output output bias i'th weight w 1 w2 w3

More information

Neural networks. Chapter 19, Sections 1 5 1

Neural networks. Chapter 19, Sections 1 5 1 Neural networks Chapter 19, Sections 1 5 Chapter 19, Sections 1 5 1 Outline Brains Neural networks Perceptrons Multilayer perceptrons Applications of neural networks Chapter 19, Sections 1 5 2 Brains 10

More information

Machine Learning for Computer Vision 8. Neural Networks and Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

Machine Learning for Computer Vision 8. Neural Networks and Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group Machine Learning for Computer Vision 8. Neural Networks and Deep Learning Vladimir Golkov Technical University of Munich Computer Vision Group INTRODUCTION Nonlinear Coordinate Transformation http://cs.stanford.edu/people/karpathy/convnetjs/

More information

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton Motivation Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses

More information

Artificial Neural Networks

Artificial Neural Networks Introduction ANN in Action Final Observations Application: Poverty Detection Artificial Neural Networks Alvaro J. Riascos Villegas University of los Andes and Quantil July 6 2018 Artificial Neural Networks

More information

The connection of dropout and Bayesian statistics

The connection of dropout and Bayesian statistics The connection of dropout and Bayesian statistics Interpretation of dropout as approximate Bayesian modelling of NN http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf Dropout Geoffrey Hinton Google, University

More information

Course 395: Machine Learning - Lectures

Course 395: Machine Learning - Lectures Course 395: Machine Learning - Lectures Lecture 1-2: Concept Learning (M. Pantic) Lecture 3-4: Decision Trees & CBC Intro (M. Pantic & S. Petridis) Lecture 5-6: Evaluating Hypotheses (S. Petridis) Lecture

More information

Machine Learning Basics III

Machine Learning Basics III Machine Learning Basics III Benjamin Roth CIS LMU München Benjamin Roth (CIS LMU München) Machine Learning Basics III 1 / 62 Outline 1 Classification Logistic Regression 2 Gradient Based Optimization Gradient

More information

DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY

DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY 1 On-line Resources http://neuralnetworksanddeeplearning.com/index.html Online book by Michael Nielsen http://matlabtricks.com/post-5/3x3-convolution-kernelswith-online-demo

More information

Machine Learning and Data Mining. Multi-layer Perceptrons & Neural Networks: Basics. Prof. Alexander Ihler

Machine Learning and Data Mining. Multi-layer Perceptrons & Neural Networks: Basics. Prof. Alexander Ihler + Machine Learning and Data Mining Multi-layer Perceptrons & Neural Networks: Basics Prof. Alexander Ihler Linear Classifiers (Perceptrons) Linear Classifiers a linear classifier is a mapping which partitions

More information

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017 Simple Techniques for Improving SGD CS6787 Lecture 2 Fall 2017 Step Sizes and Convergence Where we left off Stochastic gradient descent x t+1 = x t rf(x t ; yĩt ) Much faster per iteration than gradient

More information

Neural networks and optimization

Neural networks and optimization Neural networks and optimization Nicolas Le Roux Criteo 18/05/15 Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 1 / 85 1 Introduction 2 Deep networks 3 Optimization 4 Convolutional

More information

Deep Feedforward Networks. Sargur N. Srihari

Deep Feedforward Networks. Sargur N. Srihari Deep Feedforward Networks Sargur N. srihari@cedar.buffalo.edu 1 Topics Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation and Other Differentiation

More information

Non-Convex Optimization. CS6787 Lecture 7 Fall 2017

Non-Convex Optimization. CS6787 Lecture 7 Fall 2017 Non-Convex Optimization CS6787 Lecture 7 Fall 2017 First some words about grading I sent out a bunch of grades on the course management system Everyone should have all their grades in Not including paper

More information

Neural networks. Chapter 20. Chapter 20 1

Neural networks. Chapter 20. Chapter 20 1 Neural networks Chapter 20 Chapter 20 1 Outline Brains Neural networks Perceptrons Multilayer networks Applications of neural networks Chapter 20 2 Brains 10 11 neurons of > 20 types, 10 14 synapses, 1ms

More information

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others) Machine Learning Neural Networks (slides from Domingos, Pardo, others) For this week, Reading Chapter 4: Neural Networks (Mitchell, 1997) See Canvas For subsequent weeks: Scaling Learning Algorithms toward

More information

Lecture 3 Feedforward Networks and Backpropagation

Lecture 3 Feedforward Networks and Backpropagation Lecture 3 Feedforward Networks and Backpropagation CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 3, 2017 Things we will look at today Recap of Logistic Regression

More information

Topic 3: Neural Networks

Topic 3: Neural Networks CS 4850/6850: Introduction to Machine Learning Fall 2018 Topic 3: Neural Networks Instructor: Daniel L. Pimentel-Alarcón c Copyright 2018 3.1 Introduction Neural networks are arguably the main reason why

More information

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

More information

Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks

Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks Jan Drchal Czech Technical University in Prague Faculty of Electrical Engineering Department of Computer Science Topics covered

More information

Feedforward Neural Networks

Feedforward Neural Networks Feedforward Neural Networks Michael Collins 1 Introduction In the previous notes, we introduced an important class of models, log-linear models. In this note, we describe feedforward neural networks, which

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Neural networks Daniel Hennes 21.01.2018 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Logistic regression Neural networks Perceptron

More information

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Lecture - 27 Multilayer Feedforward Neural networks with Sigmoidal

More information

Artificial Neural Networks

Artificial Neural Networks Artificial Neural Networks Oliver Schulte - CMPT 310 Neural Networks Neural networks arise from attempts to model human/animal brains Many models, many claims of biological plausibility We will focus on

More information

EVERYTHING YOU NEED TO KNOW TO BUILD YOUR FIRST CONVOLUTIONAL NEURAL NETWORK (CNN)

EVERYTHING YOU NEED TO KNOW TO BUILD YOUR FIRST CONVOLUTIONAL NEURAL NETWORK (CNN) EVERYTHING YOU NEED TO KNOW TO BUILD YOUR FIRST CONVOLUTIONAL NEURAL NETWORK (CNN) TARGETED PIECES OF KNOWLEDGE Linear regression Activation function Multi-Layers Perceptron (MLP) Stochastic Gradient Descent

More information

Deep Learning Lab Course 2017 (Deep Learning Practical)

Deep Learning Lab Course 2017 (Deep Learning Practical) Deep Learning Lab Course 207 (Deep Learning Practical) Labs: (Computer Vision) Thomas Brox, (Robotics) Wolfram Burgard, (Machine Learning) Frank Hutter, (Neurorobotics) Joschka Boedecker University of

More information

Machine Learning for Signal Processing Neural Networks Continue. Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016

Machine Learning for Signal Processing Neural Networks Continue. Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016 Machine Learning for Signal Processing Neural Networks Continue Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016 1 So what are neural networks?? Voice signal N.Net Transcription Image N.Net Text

More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

1 What a Neural Network Computes

1 What a Neural Network Computes Neural Networks 1 What a Neural Network Computes To begin with, we will discuss fully connected feed-forward neural networks, also known as multilayer perceptrons. A feedforward neural network consists

More information

Convolutional Neural Networks II. Slides from Dr. Vlad Morariu

Convolutional Neural Networks II. Slides from Dr. Vlad Morariu Convolutional Neural Networks II Slides from Dr. Vlad Morariu 1 Optimization Example of optimization progress while training a neural network. (Loss over mini-batches goes down over time.) 2 Learning rate

More information

Introduction to Deep Learning

Introduction to Deep Learning Introduction to Deep Learning A. G. Schwing & S. Fidler University of Toronto, 2015 A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2015 1 / 39 Outline 1 Universality of Neural Networks

More information

Linear discriminant functions

Linear discriminant functions Andrea Passerini passerini@disi.unitn.it Machine Learning Discriminative learning Discriminative vs generative Generative learning assumes knowledge of the distribution governing the data Discriminative

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

Introduction to Convolutional Neural Networks 2018 / 02 / 23

Introduction to Convolutional Neural Networks 2018 / 02 / 23 Introduction to Convolutional Neural Networks 2018 / 02 / 23 Buzzword: CNN Convolutional neural networks (CNN, ConvNet) is a class of deep, feed-forward (not recurrent) artificial neural networks that

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Neural Networks: A brief touch Yuejie Chi Department of Electrical and Computer Engineering Spring 2018 1/41 Outline

More information

DeepLearning on FPGAs

DeepLearning on FPGAs DeepLearning on FPGAs Introduction to Deep Learning Sebastian Buschäger Technische Universität Dortmund - Fakultät Informatik - Lehrstuhl 8 October 21, 2017 1 Recap Computer Science Approach Technical

More information

CSC321 Lecture 2: Linear Regression

CSC321 Lecture 2: Linear Regression CSC32 Lecture 2: Linear Regression Roger Grosse Roger Grosse CSC32 Lecture 2: Linear Regression / 26 Overview First learning algorithm of the course: linear regression Task: predict scalar-valued targets,

More information

CS:4420 Artificial Intelligence

CS:4420 Artificial Intelligence CS:4420 Artificial Intelligence Spring 2018 Neural Networks Cesare Tinelli The University of Iowa Copyright 2004 18, Cesare Tinelli and Stuart Russell a a These notes were originally developed by Stuart

More information

Based on the original slides of Hung-yi Lee

Based on the original slides of Hung-yi Lee Based on the original slides of Hung-yi Lee Google Trends Deep learning obtains many exciting results. Can contribute to new Smart Services in the Context of the Internet of Things (IoT). IoT Services

More information

CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska. NEURAL NETWORKS Learning

CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska. NEURAL NETWORKS Learning CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska NEURAL NETWORKS Learning Neural Networks Classifier Short Presentation INPUT: classification data, i.e. it contains an classification (class) attribute.

More information

Lecture 2: Learning with neural networks

Lecture 2: Learning with neural networks Lecture 2: Learning with neural networks Deep Learning @ UvA LEARNING WITH NEURAL NETWORKS - PAGE 1 Lecture Overview o Machine Learning Paradigm for Neural Networks o The Backpropagation algorithm for

More information