Introduction to (Convolutional) Neural Networks
1 Introduction to (Convolutional) Neural Networks Philipp Grohs Summer School DL and Vis, Sept 2018
2 Syllabus
1 Motivation and Definition
2 Universal Approximation
3 Backpropagation
4 Stochastic Gradient Descent
5 The Basic Recipe
6 Going Deep
7 Convolutional Neural Networks
8 What I didn't tell you
3 1 Motivation and Definition
5 Which Method to Choose?
We have seen linear regression, kernel regression, regularization, K-PCA, K-SVM, ... and there exist a zillion other methods.
Is there a universally best method?
10 No Free Lunch Theorem
No Free Lunch Theorem [Wolpert(1996)], Informal Version: Of course not!!
Proof of the Theorem: If $\rho$ is completely arbitrary and nothing is known, we cannot possibly infer anything about $\rho$ from samples $((x_i, y_i))_{i=1}^m$...
Every algorithm will have a specific preference, for example as specified through the hypothesis class $\mathcal{H}$: all categories are artificial!
We want our algorithm to reproduce the artificial categories produced by our brain, so let's build a hypothesis class that mimics our thinking!
12 Neuroscience
The Brain as Biological Neural Network: In neuroscience, a biological neural network is a series of interconnected neurons whose activation defines a recognizable linear pathway. The interface through which neurons interact with their neighbors usually consists of several axon terminals connected via synapses to dendrites on other neurons. If the sum of the input signals into one neuron surpasses a certain threshold, the neuron sends an action potential (AP) at the axon hillock and transmits this electrical signal along the axon. (Source: Wikipedia)
14 Neurons
Recall: "If the sum of the input signals into one neuron surpasses a certain threshold, [...] the neuron transmits this [...] signal [...]."
16 Artificial Neurons
Figure: an artificial neuron with inputs $x_1, x_2, x_3$, weights $w_1, w_2, w_3$, a summation node $\Sigma$ with threshold $b$, and an activation that produces the output $y = 1$ if $\sum_i x_i w_i - b > 0$ and $y = 0$ if $\sum_i x_i w_i - b \le 0$.
Artificial Neuron: An artificial neuron with weights $w_1, \dots, w_s$, bias $b$ and activation function $\sigma : \mathbb{R} \to \mathbb{R}$ is defined as the function
$$f(x_1, \dots, x_s) = \sigma\Big(\sum_{i=1}^{s} x_i w_i - b\Big).$$
18 Activation Functions
Figure: Heaviside activation function (as in biological motivation).
Figure: Sigmoid activation function $\sigma(x) = \frac{1}{1 + e^{-x}}$.
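The neuron definition and the two activation functions above can be sketched in a few lines of plain Python. The input, weight and bias values are hypothetical and chosen only for illustration; the function names are not from the slides.

```python
import math

def heaviside(t):
    """Threshold activation: 1 if t > 0, else 0."""
    return 1.0 if t > 0 else 0.0

def sigmoid(t):
    """Sigmoid activation 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def neuron(x, w, b, activation):
    """Artificial neuron: activation(sum_i x_i * w_i - b)."""
    return activation(sum(xi * wi for xi, wi in zip(x, w)) - b)

# Hypothetical example values (assumptions, not from the slides).
x = [1.0, 0.5, -1.0]
w = [0.2, 0.4, 0.1]
b = 0.1

fired = neuron(x, w, b, heaviside)   # weighted sum 0.3 exceeds the threshold 0.1
smooth = neuron(x, w, b, sigmoid)    # sigmoid of roughly 0.2
```

With the Heaviside activation the neuron either fires or not, while the sigmoid gives a smooth value between 0 and 1 for the same weighted sum.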
21 Artificial Neural Networks
Artificial neural networks consist of a graph connecting artificial neurons!
The dynamics are difficult to model, due to loops, etc...
25 Artificial Feedforward Neural Networks
Use a directed, acyclic graph!
Definition: Let $L, d, N_1, \dots, N_L \in \mathbb{N}$. A map $\Phi : \mathbb{R}^d \to \mathbb{R}^{N_L}$ given by
$$\Phi(x) = A_L \sigma(A_{L-1} \sigma(\dots \sigma(A_1(x)))), \quad x \in \mathbb{R}^d,$$
is called a neural network. It is composed of affine linear maps $A_l : \mathbb{R}^{N_{l-1}} \to \mathbb{R}^{N_l}$, $1 \le l \le L$ (where $N_0 = d$), and a non-linear function $\sigma$, often referred to as the activation function, acting component-wise. Here, $d$ is the dimension of the input layer, $L$ denotes the number of layers, $N_1, \dots, N_{L-1}$ stand for the dimensions of the $L-1$ hidden layers, and $N_L$ is the dimension of the output layer.
An affine map $A : \mathbb{R}^{N_{l-1}} \to \mathbb{R}^{N_l}$ is given by $x \mapsto Wx + b$ with weight matrix $W \in \mathbb{R}^{N_l \times N_{l-1}}$ and bias vector $b \in \mathbb{R}^{N_l}$.
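As a minimal sketch of the definition, the following plain-Python code evaluates $\Phi(x)$ layer by layer. The weights are hypothetical, the sigmoid is just one possible choice of $\sigma$, and the helper names `affine` and `network` are illustrative only.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def affine(W, b, x):
    """Return W x + b, with W given as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def network(layers, x, activation=sigmoid):
    """Evaluate Phi(x) = A_L sigma(A_{L-1} sigma(... sigma(A_1 x))).

    `layers` is a list of (W, b) pairs; the activation is applied after
    every affine map except the last one.
    """
    a = x
    for W, b in layers[:-1]:
        a = [activation(z) for z in affine(W, b, a)]
    W, b = layers[-1]
    return affine(W, b, a)

# Hypothetical weights for d = 2, N_1 = 3, N_2 = 1 (so L = 2).
layers = [
    ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]),  # W_1 is 3 x 2
    ([[1.0, 1.0, 1.0]], [0.5]),                               # W_2 is 1 x 3
]
out = network(layers, [0.0, 0.0])
```

For the zero input every hidden neuron outputs $\sigma(0) = 0.5$, so the single output is $3 \cdot 0.5 + 0.5 = 2$.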
33 On the Biological Motivation
Artificial (feedforward) neural networks should not be confused with a model for our brain:
Neurons are more complicated than simply weighted linear combinations.
Our brain is not feedforward.
Biological neural networks evolve with time (neuronal plasticity).
...
Artificial feedforward neural networks constitute a mathematically and computationally convenient but very simplistic mathematical construct which is inspired by our understanding of how the brain works.
37 Terminology
Neural Network Learning: Use neural networks of a fixed topology as hypothesis class for regression or classification tasks. This requires optimizing the weights and bias parameters.
Deep Learning: Neural network learning with neural networks consisting of many (e.g., $\ge 3$) layers.
38 2 Universal Approximation
42 Approximation Question
Main Approximation Problem: Can every (continuous, or measurable) function $f : \mathbb{R}^d \to \mathbb{R}^{N_L}$ be arbitrarily well approximated by a neural network, provided that we choose $N_1, \dots, N_{L-1}, L$ large enough?
Surely not! Suppose that $\sigma$ is a polynomial of degree $r$. Then $\sigma(Ax)$ is a polynomial of degree at most $r$ for all affine maps $A$, and therefore any neural network with activation function $\sigma$ and one hidden layer is a polynomial of degree at most $r$ (at most $r^{L-1}$ for $L$ layers). Polynomials of bounded degree cannot approximate every continuous function.
Main Approximation Problem, refined: Under which conditions on the activation function $\sigma$ can every (continuous, or measurable) function $f : \mathbb{R}^d \to \mathbb{R}^{N_L}$ be arbitrarily well approximated by a neural network, provided that we choose $N_1, \dots, N_{L-1}, L$ large enough?
44 Universal Approximation Theorem
Theorem: Suppose that $\sigma : \mathbb{R} \to \mathbb{R}$ is continuous and not a polynomial, and fix integers $d \ge 1$, $L \ge 2$, $N_L \ge 1$ and a compact subset $K \subset \mathbb{R}^d$. Then for any continuous $f : \mathbb{R}^d \to \mathbb{R}^{N_L}$ and any $\varepsilon > 0$ there exist $N_1, \dots, N_{L-1} \in \mathbb{N}$ and affine linear maps $A_l : \mathbb{R}^{N_{l-1}} \to \mathbb{R}^{N_l}$, $1 \le l \le L$, such that the neural network
$$\Phi(x) = A_L \sigma(A_{L-1} \sigma(\dots \sigma(A_1(x)))), \quad x \in \mathbb{R}^d,$$
approximates $f$ to within accuracy $\varepsilon$, i.e.,
$$\sup_{x \in K} |f(x) - \Phi(x)| \le \varepsilon.$$
Neural networks are universal approximators, and one hidden layer is enough if the number of nodes is sufficient!
47 Proof of the Universal Approximation Theorem
For simplicity we only consider the case of one hidden layer, i.e., $L = 2$, and one output neuron, i.e., $N_L = 1$:
$$\Phi(x) = \sum_{i=1}^{N_1} c_i \sigma(w_i \cdot x - b_i), \quad w_i \in \mathbb{R}^d, \; c_i, b_i \in \mathbb{R}.$$
We will show the following.
Theorem: For $d \in \mathbb{N}$ and $\sigma : \mathbb{R} \to \mathbb{R}$ continuous consider
$$\mathcal{R}(\sigma, d) := \operatorname{span}\{\sigma(w \cdot x - b) : w \in \mathbb{R}^d, \, b \in \mathbb{R}\}.$$
Then $\mathcal{R}(\sigma, d)$ is dense in $C(\mathbb{R}^d)$ if and only if $\sigma$ is not a polynomial.
53 Proof for $d = 1$ and $\sigma$ smooth
If $\sigma$ is not a polynomial, there exists $x_0 \in \mathbb{R}$ with $\sigma^{(k)}(-x_0) \ne 0$ for all $k \in \mathbb{N}$.
Constant functions can be approximated because $\sigma(hx - x_0) \to \sigma(-x_0) \ne 0$ as $h \to 0$.
Linear functions can be approximated because
$$\frac{1}{h}\big(\sigma((\lambda + h)x - x_0) - \sigma(\lambda x - x_0)\big) \to x \, \sigma'(-x_0), \quad h, \lambda \to 0.$$
By the same argument, polynomials in $x$ can be approximated.
The Stone-Weierstrass Theorem yields the result.
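The limit used for linear functions can be checked numerically. The sketch below assumes the sigmoid as the smooth activation and the hypothetical values $x_0 = 1$, $x = 2$; it evaluates the difference quotient (a linear combination of two neurons) for shrinking $h = \lambda$ and compares it with $x \, \sigma'(-x_0)$.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def sigmoid_prime(t):
    s = sigmoid(t)
    return s * (1.0 - s)

def difference_quotient(x, x0, lam, h):
    """(1/h) * (sigma((lam + h) x - x0) - sigma(lam x - x0))."""
    return (sigmoid((lam + h) * x - x0) - sigmoid(lam * x - x0)) / h

x0 = 1.0   # assumed point with sigma'(-x0) != 0
x = 2.0    # assumed evaluation point
target = x * sigmoid_prime(-x0)

# Let h and lambda tend to zero together and record the approximation error.
errors = [abs(difference_quotient(x, x0, lam=h, h=h) - target)
          for h in (1e-1, 1e-2, 1e-3)]
```

The errors shrink roughly in proportion to $h$, which is consistent with the convergence claimed on the slide.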
57 General $d$
Note that the functions
$$\operatorname{span}\{g(w \cdot x - b) : w \in \mathbb{R}^d, \, b \in \mathbb{R}, \, g \in C(\mathbb{R}) \text{ arbitrary}\}$$
are dense in $C(\mathbb{R}^d)$ (just take $g$ as $\sin(w \cdot x)$, $\cos(w \cdot x)$, just as in the Fourier series case).
First approximate $f \in C(\mathbb{R}^d)$ by
$$\sum_{i=1}^{N} d_i \, g_i(v_i \cdot x - e_i), \quad v_i \in \mathbb{R}^d, \; d_i, e_i \in \mathbb{R}, \; g_i \in C(\mathbb{R}).$$
Then apply our univariate result to approximate the univariate functions $t \mapsto g_i(t - e_i)$ using neural networks.
61 The case that $\sigma$ is nonsmooth
Pick a family $(g_\varepsilon)_{\varepsilon > 0}$ of mollifiers, i.e., $\sigma * g_\varepsilon \to \sigma$ as $\varepsilon \to 0$ uniformly on compacta.
Apply the previous result to the smooth function $\sigma * g_\varepsilon$ and let $\varepsilon \to 0$.
62 3 Backpropagation
64 Regression/Classification with Neural Networks
Neural Network Hypothesis Class: Given $d, L, N_1, \dots, N_L$ and $\sigma$, define the associated hypothesis class
$$\mathcal{H}_{[d, N_1, \dots, N_L], \sigma} := \{A_L \sigma(A_{L-1} \sigma(\dots \sigma(A_1(x)))) : A_l : \mathbb{R}^{N_{l-1}} \to \mathbb{R}^{N_l} \text{ affine linear}\}.$$
Typical Regression/Classification Task: Given data $z = ((x_i, y_i))_{i=1}^m \subset \mathbb{R}^d \times \mathbb{R}^{N_L}$, find the empirical regression function
$$f_z \in \operatorname{argmin}_{f \in \mathcal{H}_{[d, N_1, \dots, N_L], \sigma}} \sum_{i=1}^{m} \mathcal{L}(f, x_i, y_i),$$
where $\mathcal{L} : C(\mathbb{R}^d) \times \mathbb{R}^d \times \mathbb{R}^{N_L} \to \mathbb{R}_+$ is the loss function (in least squares problems we have $\mathcal{L}(f, x, y) = |f(x) - y|^2$).
75 Example: Handwritten Digits
MNIST Database for handwritten digit recognition: exdb/mnist/
Every image is given as a matrix $x \in \mathbb{R}^{28 \times 28} \cong \mathbb{R}^{784}$.
Every label is given as a 10-dim vector $y \in \mathbb{R}^{10}$ describing the probability of each digit.
Given labeled training data $((x_i, y_i))_{i=1}^m \subset \mathbb{R}^{784} \times \mathbb{R}^{10}$.
Fix the network topology, e.g., the number of layers (for example $L = 3$) and the numbers of neurons ($N_1 = 20$, $N_2 = 20$).
The learning goal is to find the empirical regression function $f_z \in \mathcal{H}_{[784, 20, 20, 10], \sigma}$.
How??? The optimization problem is non-linear and non-convex.
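To make the label format concrete, here is a small sketch that assumes each training label puts probability 1 on the true digit (a one-hot vector); the slides only say that the label describes the probability of each digit, so this encoding and the helper names are assumptions.

```python
def one_hot(digit, num_classes=10):
    """Label vector in R^10 with probability 1 on the given digit."""
    return [1.0 if k == digit else 0.0 for k in range(num_classes)]

def predicted_digit(scores):
    """Index of the largest entry of a 10-dim network output."""
    return max(range(len(scores)), key=lambda k: scores[k])

y = one_hot(3)
# A hypothetical network output that is most confident about the digit 3.
guess = predicted_digit([0.0, 0.1, 0.0, 0.7, 0.0, 0.1, 0.0, 0.1, 0.0, 0.0])
```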
79 Gradient Descent: The Simplest Optimization Method
The gradient of $F : \mathbb{R}^N \to \mathbb{R}$ is defined by
$$\nabla F(u) = \Big(\frac{\partial F(u)}{\partial (u)_1}, \dots, \frac{\partial F(u)}{\partial (u)_N}\Big)^T.$$
Gradient descent with stepsize $\eta > 0$ is defined by
$$u_{n+1} \leftarrow u_n - \eta \nabla F(u_n).$$
It converges (slowly) to a stationary point of $F$.
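The update rule can be illustrated on a toy problem. The quadratic $F(u) = (u_1 - 1)^2 + 2(u_2 + 3)^2$, the stepsize and the number of steps below are hypothetical choices, picked so that the iteration provably contracts towards the minimizer $(1, -3)$.

```python
def gradient_descent(grad, u0, eta, steps):
    """Iterate u_{n+1} = u_n - eta * grad(u_n) for a fixed number of steps."""
    u = list(u0)
    for _ in range(steps):
        g = grad(u)
        u = [ui - eta * gi for ui, gi in zip(u, g)]
    return u

def grad_F(u):
    """Gradient of the toy objective F(u) = (u_1 - 1)^2 + 2 (u_2 + 3)^2."""
    return [2.0 * (u[0] - 1.0), 4.0 * (u[1] + 3.0)]

u_star = gradient_descent(grad_F, [0.0, 0.0], eta=0.1, steps=200)
```

For a non-convex objective such as a neural network loss, the same iteration is only expected to reach a stationary point, as stated on the slide.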
82 Backprop
In our problem: $F = \sum_{i=1}^m \mathcal{L}(f, x_i, y_i)$ and $u = ((W_l, b_l))_{l=1}^L$.
Since
$$\nabla_{((W_l, b_l))_{l=1}^L} F = \sum_{i=1}^m \nabla_{((W_l, b_l))_{l=1}^L} \mathcal{L}(f, x_i, y_i),$$
we need to determine (for $(x, y) \in \mathbb{R}^d \times \mathbb{R}^{N_L}$ fixed)
$$\frac{\partial \mathcal{L}(f, x, y)}{\partial (W_l)_{i,j}}, \quad \frac{\partial \mathcal{L}(f, x, y)}{\partial (b_l)_i}, \quad l = 1, \dots, L.$$
For simplicity suppose that $\mathcal{L}(f, x, y) = |f(x) - y|^2$, so that
$$\frac{\partial \mathcal{L}(f, x, y)}{\partial (W_l)_{i,j}} = 2 (f(x) - y)^T \frac{\partial f(x)}{\partial (W_l)_{i,j}}, \quad \frac{\partial \mathcal{L}(f, x, y)}{\partial (b_l)_i} = 2 (f(x) - y)^T \frac{\partial f(x)}{\partial (b_l)_i}.$$
95 Example network (two inputs, two hidden layers with three neurons each, two outputs):
$$x = \begin{pmatrix} (x)_1 \\ (x)_2 \end{pmatrix}, \quad a_1 = \sigma(z_1) = \sigma(W_1 x + b_1), \quad a_2 = \sigma(z_2) = \sigma(W_2 a_1 + b_2), \quad \Phi(x) = z_3 = W_3 a_2 + b_3,$$
with
$$W_1 = \begin{pmatrix} (W_1)_{1,1} & (W_1)_{1,2} \\ (W_1)_{2,1} & (W_1)_{2,2} \\ (W_1)_{3,1} & (W_1)_{3,2} \end{pmatrix}, \quad b_1 = \begin{pmatrix} (b_1)_1 \\ (b_1)_2 \\ (b_1)_3 \end{pmatrix},$$
$$W_2 = \begin{pmatrix} (W_2)_{1,1} & (W_2)_{1,2} & (W_2)_{1,3} \\ (W_2)_{2,1} & (W_2)_{2,2} & (W_2)_{2,3} \\ (W_2)_{3,1} & (W_2)_{3,2} & (W_2)_{3,3} \end{pmatrix}, \quad b_2 = \begin{pmatrix} (b_2)_1 \\ (b_2)_2 \\ (b_2)_3 \end{pmatrix},$$
$$W_3 = \begin{pmatrix} (W_3)_{1,1} & (W_3)_{1,2} & (W_3)_{1,3} \\ (W_3)_{2,1} & (W_3)_{2,2} & (W_3)_{2,3} \end{pmatrix}, \quad b_3 = \begin{pmatrix} (b_3)_1 \\ (b_3)_2 \end{pmatrix}.$$
Derivatives of the output with respect to the last-layer parameters:
$$\frac{\partial (z_3)_1}{\partial (W_3)_{1,2}} = \frac{\partial}{\partial (W_3)_{1,2}} \big((W_3)_{1,1} (a_2)_1 + (W_3)_{1,2} (a_2)_2 + (W_3)_{1,3} (a_2)_3\big) = (a_2)_2,$$
$$\frac{\partial (z_3)_2}{\partial (W_3)_{1,2}} = \frac{\partial}{\partial (W_3)_{1,2}} \big((W_3)_{2,1} (a_2)_1 + (W_3)_{2,2} (a_2)_2 + (W_3)_{2,3} (a_2)_3\big) = 0.$$
In general,
$$\frac{\partial (z_3)_k}{\partial (W_3)_{i,j}} = \begin{cases} (a_2)_j & i = k \\ 0 & i \ne k \end{cases}, \qquad \frac{\partial (z_3)_k}{\partial (b_3)_i} = \begin{cases} 1 & i = k \\ 0 & i \ne k \end{cases}.$$
Collecting the vectors $\partial \Phi(x) / \partial (W_3)_{i,j}$ and $\partial \Phi(x) / \partial (b_3)_i$ entry by entry:
$$\frac{\partial \Phi(x)}{\partial W_3} = \begin{pmatrix} \begin{pmatrix} (a_2)_1 \\ 0 \end{pmatrix} & \begin{pmatrix} (a_2)_2 \\ 0 \end{pmatrix} & \begin{pmatrix} (a_2)_3 \\ 0 \end{pmatrix} \\ \begin{pmatrix} 0 \\ (a_2)_1 \end{pmatrix} & \begin{pmatrix} 0 \\ (a_2)_2 \end{pmatrix} & \begin{pmatrix} 0 \\ (a_2)_3 \end{pmatrix} \end{pmatrix}, \qquad \frac{\partial \Phi(x)}{\partial b_3} = \begin{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix} \\ \begin{pmatrix} 0 \\ 1 \end{pmatrix} \end{pmatrix}.$$
100 Backprop: Last Layer
$$\frac{\partial \mathcal{L}(f, x, y)}{\partial (W_L)_{i,j}} = 2 (f(x) - y)^T \frac{\partial f(x)}{\partial (W_L)_{i,j}}, \quad \frac{\partial \mathcal{L}(f, x, y)}{\partial (b_L)_i} = 2 (f(x) - y)^T \frac{\partial f(x)}{\partial (b_L)_i}.$$
Let $f(x) = W_L \sigma(W_{L-1}(\dots) + b_{L-1}) + b_L$. It follows that
$$\frac{\partial f(x)}{\partial (W_L)_{i,j}} = (0, \dots, \sigma(W_{L-1}(\dots) + b_{L-1})_j, \dots, 0)^T, \quad \frac{\partial f(x)}{\partial (b_L)_i} = (0, \dots, 1, \dots, 0)^T,$$
where the only nonzero entry sits in position $i$. Hence
$$2 (f(x) - y)^T \frac{\partial f(x)}{\partial (W_L)_{i,j}} = 2 (f(x) - y)_i \, \sigma(W_{L-1}(\dots) + b_{L-1})_j, \quad 2 (f(x) - y)^T \frac{\partial f(x)}{\partial (b_L)_i} = 2 (f(x) - y)_i.$$
In matrix notation, with $\delta_L := 2(f(x) - y)$, $z_{L-1} := W_{L-1}(\dots) + b_{L-1}$ and $a_{L-1} := \sigma(z_{L-1})$:
$$\frac{\partial \mathcal{L}(f, x, y)}{\partial W_L} = 2 (f(x) - y) \cdot (\sigma(z_{L-1}))^T = \delta_L \, a_{L-1}^T, \quad \frac{\partial \mathcal{L}(f, x, y)}{\partial b_L} = 2 (f(x) - y) = \delta_L.$$
107 Backprop: Second-to-last Layer
Define $a_{l+1} = \sigma(z_{l+1})$ where $z_{l+1} = W_{l+1} a_l + b_{l+1}$, $a_0 = x$, $f(x) = z_L$.
We have computed $\frac{\partial \mathcal{L}(f, x, y)}{\partial W_L}$ and $\frac{\partial \mathcal{L}(f, x, y)}{\partial b_L}$.
Then, use the chain rule:
$$\frac{\partial \mathcal{L}(f, x, y)}{\partial W_{L-1}} = \frac{\partial \mathcal{L}(f, x, y)}{\partial a_{L-1}} \cdot \frac{\partial a_{L-1}}{\partial W_{L-1}} = 2 (f(x) - y)^T W_L \cdot \frac{\partial a_{L-1}}{\partial W_{L-1}}$$
$$= 2 (f(x) - y)^T W_L \operatorname{diag}(\sigma'(z_{L-1})) \frac{\partial z_{L-1}}{\partial W_{L-1}},$$
where the last factor is computed in the same way as before. This gives
$$\frac{\partial \mathcal{L}(f, x, y)}{\partial W_{L-1}} = \delta_{L-1} \, a_{L-2}^T, \quad \delta_{L-1} := \operatorname{diag}(\sigma'(z_{L-1})) \, W_L^T \, 2 (f(x) - y).$$
Similar arguments yield $\frac{\partial \mathcal{L}(f, x, y)}{\partial b_{L-1}} = \delta_{L-1}$.
115 The Backprop Algorithm
1 Calculate $a_l = \sigma(z_l)$, $z_l = A_l(a_{l-1})$ for $l = 1, \dots, L$, $a_0 = x$ (forward pass).
2 Set $\delta_L = 2(f(x) - y)$.
3 Then $\frac{\partial \mathcal{L}(f, x, y)}{\partial b_L} = \delta_L$ and $\frac{\partial \mathcal{L}(f, x, y)}{\partial W_L} = \delta_L \, a_{L-1}^T$.
4 for $l$ from $L - 1$ to $1$ do:
  $\delta_l = \operatorname{diag}(\sigma'(z_l)) \, W_{l+1}^T \, \delta_{l+1}$
  Then $\frac{\partial \mathcal{L}(f, x, y)}{\partial b_l} = \delta_l$ and $\frac{\partial \mathcal{L}(f, x, y)}{\partial W_l} = \delta_l \, a_{l-1}^T$.
5 return $\frac{\partial \mathcal{L}(f, x, y)}{\partial b_l}$, $\frac{\partial \mathcal{L}(f, x, y)}{\partial W_l}$, $l = 1, \dots, L$.
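The five steps can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: it assumes the sigmoid activation, the squared loss $|f(x) - y|^2$, an output layer without activation ($f(x) = z_L$), and hypothetical weights for a network with widths $[2, 3, 2]$. A finite-difference quotient is used as a sanity check for one weight.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def forward(params, x):
    """Forward pass: returns [a_0, ..., a_{L-1}] and [z_1, ..., z_L]."""
    a, zs = [x], []
    for l, (W, b) in enumerate(params):
        z = [s + bi for s, bi in zip(matvec(W, a[-1]), b)]
        zs.append(z)
        if l < len(params) - 1:      # no activation on the output layer
            a.append([sigmoid(t) for t in z])
    return a, zs

def loss(params, x, y):
    _, zs = forward(params, x)
    return sum((f - t) ** 2 for f, t in zip(zs[-1], y))

def backprop(params, x, y):
    """Return a list of (dL/dW_l, dL/db_l), l = 1..L, for the loss |f(x) - y|^2."""
    a, zs = forward(params, x)
    delta = [2.0 * (f - t) for f, t in zip(zs[-1], y)]   # delta_L
    grads = [None] * len(params)
    for l in range(len(params) - 1, -1, -1):
        # Index l is 0-based: delta holds delta_{l+1}, a[l] holds a_l.
        grads[l] = ([[d * aj for aj in a[l]] for d in delta], list(delta))
        if l > 0:
            W = params[l][0]                             # W_{l+1}
            back = [sum(W[i][j] * delta[i] for i in range(len(delta)))
                    for j in range(len(W[0]))]           # W_{l+1}^T delta_{l+1}
            sp = [sigmoid(t) * (1.0 - sigmoid(t)) for t in zs[l - 1]]
            delta = [s * bk for s, bk in zip(sp, back)]  # delta_l
    return grads

# Hypothetical parameters for widths [2, 3, 2].
params = [
    ([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [0.0, 0.1, -0.1]),
    ([[0.3, -0.1, 0.2], [0.2, 0.5, -0.4]], [0.05, -0.05]),
]
x, y = [1.0, 2.0], [0.5, -0.5]
grads = backprop(params, x, y)

# Finite-difference check of dL/d(W_1)_{2,1}.
eps = 1e-6
base = loss(params, x, y)
params[0][0][1][0] += eps
numeric = (loss(params, x, y) - base) / eps
params[0][0][1][0] -= eps
analytic = grads[0][0][1][0]
```

The backward loop reuses the quantities stored during the forward pass, which is why one backprop run costs about as much as a few forward evaluations.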
116 Computational Graphs
117 Automatic Differentiation
118 4 Stochastic Gradient Descent
121 The Complexity of Gradient Descent
Recall that one gradient descent step requires the calculation of
$$\sum_{i=1}^{m} \nabla_{((W_l, b_l))_{l=1}^L} \mathcal{L}(f, x_i, y_i),$$
and each of the summands requires one backpropagation run.
Thus, the total complexity of one gradient descent step is equal to $m \cdot \text{complexity(backprop)}$.
The complexity of backprop is asymptotically equal to the number of DOFs of the network:
$$\text{complexity(backprop)} \sim \sum_{l=1}^{L} N_{l-1} N_l + N_l.$$
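As a quick worked instance of the DOF formula, the sketch below counts the parameters of the topology $[784, 20, 20, 10]$ from the handwritten digits example. The function name `num_dofs` is illustrative only.

```python
def num_dofs(widths):
    """Sum over layers of N_{l-1} * N_l + N_l for widths [N_0, ..., N_L]."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(widths, widths[1:]))

# (784*20 + 20) + (20*20 + 20) + (20*10 + 10) = 15700 + 420 + 210
dofs = num_dofs([784, 20, 20, 10])  # → 16330
```

One full gradient descent step on $m$ training samples then costs on the order of $m$ times this number.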
125 An Example
The ImageNet database consists of 1.2m images and 1000 categories.
AlexNet, a neural network with 160m DOFs, is one of the most successful annotation methods.
One step of gradient descent requires on the order of $m \cdot \text{complexity(backprop)}$, i.e. about 1.2m $\times$ 160m, flops (and memory units)!!
127 Stochastic Gradient Descent (SGD)
Approximate
$$\sum_{i=1}^{m} \nabla_{((W_l, b_l))_{l=1}^L} \mathcal{L}(f, x_i, y_i) \quad \text{by} \quad \nabla_{((W_l, b_l))_{l=1}^L} \mathcal{L}(f, x_{i^*}, y_{i^*})$$
for some $i^*$ chosen uniformly at random from $\{1, \dots, m\}$.
In expectation we have
$$\mathbb{E} \, \nabla_{((W_l, b_l))_{l=1}^L} \mathcal{L}(f, x_{i^*}, y_{i^*}) = \frac{1}{m} \sum_{i=1}^{m} \nabla_{((W_l, b_l))_{l=1}^L} \mathcal{L}(f, x_i, y_i).$$
134 The SGD Algorithm
Goal: Find a stationary point of the function $F = \sum_{i=1}^m F_i : \mathbb{R}^N \to \mathbb{R}$.
1 Set starting value $u_0$ and $n = 0$
2 while (error is large) do:
  Pick $i \in \{1, \dots, m\}$ uniformly at random
  update $u_{n+1} = u_n - \eta \nabla F_i(u_n)$
  $n = n + 1$
3 return $u_n$
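A minimal sketch of the SGD loop on a toy least squares problem is given below. It replaces the stopping criterion "error is large" by a fixed number of steps, and it assumes hypothetical, noise-free data $y_i = 3 x_i$ so that every $F_i(u) = (u x_i - y_i)^2$ has the same minimizer $u = 3$; with noisy data a constant stepsize would not converge exactly.

```python
import random

def sgd(grad_i, m, u0, eta, steps, seed=0):
    """Run `steps` updates u <- u - eta * grad F_i(u) with i drawn uniformly."""
    rng = random.Random(seed)
    u = list(u0)
    for _ in range(steps):
        i = rng.randrange(m)          # uniform index in {0, ..., m-1}
        g = grad_i(i, u)
        u = [uj - eta * gj for uj, gj in zip(u, g)]
    return u

# Hypothetical training data with y_i = 3 * x_i.
xs = [0.5, 1.0, 1.5, 2.0]
ys = [3.0 * x for x in xs]

def grad_i(i, u):
    """Gradient of F_i(u) = (u_1 * x_i - y_i)^2 with respect to u_1."""
    return [2.0 * (u[0] * xs[i] - ys[i]) * xs[i]]

u = sgd(grad_i, len(xs), [0.0], eta=0.05, steps=2000)
```

Each update touches a single sample, so one step costs one backprop run instead of $m$.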
136 Typical Behavior
Figure: Comparison between GD and SGD. m steps of SGD are counted as one iteration.
Initially very fast convergence, followed by stagnation!
139 Minibatch SGD
For every $\{i_1, \dots, i_K\} \subset \{1, \dots, m\}$ chosen uniformly at random, it holds that
$$\mathbb{E}\, \frac{1}{K} \sum_{k=1}^{K} \nabla_{((W_l,b_l))_{l=1}^{L}} \mathcal{L}(f, x_{i_k}, y_{i_k}) = \frac{1}{m} \sum_{i=1}^{m} \nabla_{((W_l,b_l))_{l=1}^{L}} \mathcal{L}(f, x_i, y_i),$$
i.e., we have an unbiased estimator for the gradient.
$K = 1$: SGD
$K > 1$: Minibatch SGD with batch size $K$.
144 Some Heuristics
The sample mean $\frac{1}{K} \sum_{k=1}^{K} \nabla_{((W_l,b_l))_{l=1}^{L}} \mathcal{L}(f, x_{i_k}, y_{i_k})$ is itself a random variable that has expected value $\frac{1}{m} \sum_{i=1}^{m} \nabla_{((W_l,b_l))_{l=1}^{L}} \mathcal{L}(f, x_i, y_i)$.
In order to assess the deviation of the sample mean from its expected value we may compute its standard deviation $\sigma / \sqrt{K}$, where $\sigma$ is the standard deviation of $i \mapsto \nabla_{((W_l,b_l))_{l=1}^{L}} \mathcal{L}(f, x_i, y_i)$.
Increasing the batch size by a factor 100 yields an improvement of the standard deviation by a factor 10 while the complexity increases by a factor 100!
Common batch size for large models: K = 16, 32.
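The two claims about the minibatch mean (it is unbiased, and its spread shrinks like one over the square root of the batch size) can be checked numerically. The sketch below is only an illustration under assumptions: the per-sample gradients are replaced by scalar Gaussian stand-ins `g`, indices are drawn with replacement, and all names are hypothetical.

```python
import random
import statistics

rng = random.Random(0)
m = 1000
# Stand-ins for the m per-sample gradients (scalars here; assumed toy data).
g = [rng.gauss(0.0, 1.0) for _ in range(m)]
full_mean = statistics.fmean(g)      # (1/m) * sum_i g_i
sigma = statistics.pstdev(g)         # std of g_i when i is uniform on {1, ..., m}

def minibatch_mean(K):
    """Mean of K per-sample values, indices drawn uniformly with replacement."""
    return statistics.fmean([g[rng.randrange(m)] for _ in range(K)])

results = {}
for K in (1, 100):
    draws = [minibatch_mean(K) for _ in range(5000)]
    bias = statistics.fmean(draws) - full_mean   # should be near 0 (unbiased)
    spread = statistics.pstdev(draws)            # should be near sigma / sqrt(K)
    results[K] = (bias, spread)
    print(K, bias, spread, sigma / K ** 0.5)
```

Going from K = 1 to K = 100 costs 100 times as many gradient evaluations per step but reduces the spread only by a factor of about 10, which is the trade-off behind small batch sizes.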
145 5 The Basic Recipe
150 The basic Neural Network Recipe for Learning
1 Neuro-inspired model
2 Backprop
3 Minibatch SGD
Now let's try classifying handwritten digits!
158 Results
MNIST dataset, 30 epochs, learning rate η = 3.0, minibatch size K = 10, training set size m = 50000, test set size =
network size [784, 30, 10]. Classification accuracy 94.84%.
network size [784, 30, 30, 10]. Classification accuracy 95.81%.
network size [784, 30, 30, 30, 10]. Classification accuracy 95.07%.
Deep learning might not help after all...
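To put the three network sizes in perspective, one can count their trainable parameters. The helper below is a small sketch that assumes fully connected layers with one bias per output node; the name `n_parameters` is made up for this example.

```python
def n_parameters(sizes):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(sizes, sizes[1:]))

for sizes in ([784, 30, 10], [784, 30, 30, 10], [784, 30, 30, 30, 10]):
    print(sizes, n_parameters(sizes))
# [784, 30, 10] 23860
# [784, 30, 30, 10] 24790
# [784, 30, 30, 30, 10] 25720
```

Under this assumption each extra hidden layer of width 30 adds only 930 parameters to a model that already has about 24000.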
159 6 Going Deep (?)
162 Problems with Deep Networks
Overfitting (as usual...)
Vanishing/Exploding Gradient Problem
164 Dealing with Overfitting: Regularization
Rather than minimizing
$$\sum_{i=1}^{m} \mathcal{L}(f, x_i, y_i),$$
minimize for example
$$\sum_{i=1}^{m} \mathcal{L}(f, x_i, y_i) + \lambda\, \Omega((W_l)_{l=1}^{L}), \qquad \Omega((W_l)_{l=1}^{L}) = \sum_{l,i,j} |(W_l)_{i,j}|^{p}.$$
Gradient update has to be augmented by
$$\lambda \frac{\partial}{\partial (W_l)_{i,j}} \Omega((W_l)_{l=1}^{L}) = \lambda\, p\, |(W_l)_{i,j}|^{p-1}\, \mathrm{sgn}((W_l)_{i,j}).$$
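The penalty and the extra gradient term can be written out directly. This is a minimal sketch, assuming the weights of all layers have been flattened into one list; the function names and the convention of returning 0 at a zero weight are choices made for the example, not part of the lecture.

```python
def lp_penalty(weights, p):
    """Omega(W) = sum over all entries of |w|**p."""
    return sum(abs(w) ** p for w in weights)

def lp_penalty_grad(weights, p):
    """Entrywise derivative p * |w|**(p-1) * sgn(w); taken to be 0 at w == 0."""
    def sgn(w):
        return (w > 0) - (w < 0)
    return [p * abs(w) ** (p - 1) * sgn(w) if w != 0 else 0.0 for w in weights]

w = [0.5, -2.0, 0.0, 1.5]          # a flattened list of weights (hypothetical values)
print(lp_penalty_grad(w, p=2))     # p = 2 gives 2*w, i.e. classical weight decay
print(lp_penalty_grad(w, p=1))     # p = 1 gives sgn(w)
```

In a training step, λ times this vector would be added to the gradient of the data term before the update.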
166 Sparsity-Promoting Regularization
Since
$$\lim_{p \to 0} \sum_{l,i,j} |(W_l)_{i,j}|^{p} = \#\,\text{nonzero weights},$$
regularization with $p \le 1$ promotes sparse connectivity (and hence small memory requirements)!
170 Dropout
During each feedforward/backprop step drop nodes with probability p.
After training, multiply all weights by 1 − p (the probability that a node is kept).
Final output is average over many sparse network models.
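The averaging interpretation of dropout can be illustrated on a single layer of activations. The sketch below is an assumption-laden toy, not the lecture's implementation: nodes are dropped independently with probability `p_drop`, and at test time the activations (equivalently, the outgoing weights) are scaled by the retention probability 1 − p, which equals p only for p = 1/2.

```python
import random

def dropout_train(activations, p_drop, rng):
    """Training pass: zero each node independently with probability p_drop."""
    return [0.0 if rng.random() < p_drop else a for a in activations]

def dropout_test(activations, p_drop):
    """Test pass: keep every node but scale by the retention probability 1 - p_drop."""
    return [(1.0 - p_drop) * a for a in activations]

rng = random.Random(0)
a = [1.0, 2.0, 3.0]        # hypothetical activations of one layer
p_drop = 0.2
n = 20000
avg = [0.0] * len(a)
for _ in range(n):         # average over many randomly thinned networks
    for j, v in enumerate(dropout_train(a, p_drop, rng)):
        avg[j] += v / n
print(avg, dropout_test(a, p_drop))   # the two lists should nearly agree
```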
175 Dataset Augmentation
Use invariances in dataset to generate more data!
Sometimes also noise is added to the weights to favour robust stationary points.
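As a small illustration of augmentation, the sketch below assumes that the label of an image does not change under a one-pixel translation and generates a shifted copy of each sample. The helper names and the tiny 3 × 3 "image" are hypothetical.

```python
def shift_right(image, fill=0):
    """Shift a 2D image (list of rows) one pixel to the right, padding with `fill`."""
    return [[fill] + row[:-1] for row in image]

def augment(image, label):
    """Return the original sample plus one shifted copy with the same label."""
    return [(image, label), (shift_right(image), label)]

img = [[0, 1, 0],
       [0, 1, 0],
       [0, 1, 0]]
samples = augment(img, label=1)
for image, label in samples:
    print(label, image)
```

Which transformations count as invariances depends on the data; a shift is plausible for digits, while a mirror image generally is not.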
182 The Vanishing Gradient Problem
Figure: Extremely Deep Network
$$\Phi(x) = w_5\, \sigma(w_4\, \sigma(w_3\, \sigma(w_2\, \sigma(w_1 x + b_1) + b_2) + b_3) + b_4) + b_5$$
$$\frac{\partial \Phi(x)}{\partial b_1} = \prod_{l=2}^{L} w_l \cdot \prod_{l=1}^{L-1} \sigma'(z_l)$$
(here $L = 5$ and $z_l$ is the input to the $l$-th activation).
If $\sigma(x) = \frac{1}{1+e^{-x}}$, it holds that $|\sigma'(x)| \le 2 e^{-|x|}$, and thus
$$\left| \frac{\partial \Phi(x)}{\partial b_1} \right| \le 2^{L-1} \prod_{l=2}^{L} |w_l| \cdot e^{-\sum_{l=1}^{L-1} |z_l|}.$$
Bottom layers will learn much slower than top layers and not contribute to learning.
Is depth a nuisance!?
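The decay of the derivative with respect to the first bias can be observed numerically on the scalar chain network. The following is a minimal sketch under assumptions: all weights are set to 1, all biases to 0, the depth is varied freely, and the function name `dphi_db1` is hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dphi_db1(x, w, b):
    """d Phi / d b_1 for Phi(x) = w_L * sigma(... sigma(w_1 x + b_1) ...) + b_L."""
    L = len(w)
    a, grad = x, 1.0
    for l in range(L - 1):            # layers 1, ..., L-1 carry a sigmoid
        z = w[l] * a + b[l]           # pre-activation z_{l+1} in 1-based numbering
        a = sigmoid(z)
        grad *= a * (1.0 - a)         # sigma'(z) = sigma(z) * (1 - sigma(z))
        grad *= w[l + 1]              # weight of the next layer
    return grad

for L in (2, 5, 10, 20):
    print(L, dphi_db1(0.5, [1.0] * L, [0.0] * L))
```

Since each sigmoid factor is at most 1/4, the printed values shrink at least geometrically with depth unless the weights are large enough to compensate.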
185 Dealing with the Vanishing Gradient Problem
Use activation function with large gradient.
ReLU: The Rectified Linear Unit is defined as
$$\mathrm{ReLU}(x) := \begin{cases} x & x > 0 \\ 0 & \text{else} \end{cases}$$
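A short sketch of why ReLU helps: on the active side its derivative is exactly 1, so a product of such factors along a deep chain does not decay. The depth and the positive pre-activations below are hypothetical values chosen for illustration.

```python
def relu(x):
    """ReLU(x) = x for x > 0 and 0 otherwise."""
    return x if x > 0 else 0.0

def relu_prime(x):
    """Derivative of ReLU away from 0; the value at x == 0 is set to 0 by convention."""
    return 1.0 if x > 0 else 0.0

depth = 20
z = [0.5] * depth                      # hypothetical positive pre-activations
relu_factor = 1.0
for z_l in z:
    relu_factor *= relu_prime(z_l)     # stays exactly 1.0
print(relu_factor, 0.25 ** depth)      # 1.0 versus the best case for a sigmoid chain
```

The flip side is that a node with negative pre-activation passes no gradient at all.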
186 7 Convolutional Neural Networks
188 Is there a cat in this image?
195 Suppose we have a cat-filter
$$W = \begin{pmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \end{pmatrix}$$
(Cat-Selection) C-S Inequality
For any
$$X = \begin{pmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \end{pmatrix}$$
we have
$$X \cdot W \le (X \cdot X)^{1/2} (W \cdot W)^{1/2}$$
with equality if and only if X is parallel to a cat.
Perform cat-test on all 3 × 3 image subpatches!
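The "cat-test" amounts to comparing the inner product of each subpatch with the filter against the Cauchy-Schwarz upper bound. The sketch below is an illustration only: the filter `W` is a made-up cross pattern rather than an actual cat detector, and the function names are hypothetical.

```python
import math

def dot(A, B):
    """Entrywise inner product X . W of two equally sized 2D lists."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def match_score(X, W):
    """X.W / (|X| |W|): at most 1, with 1 only if X is a positive multiple of W."""
    denom = math.sqrt(dot(X, X) * dot(W, W))
    return dot(X, W) / denom if denom > 0 else 0.0

def scan(image, W):
    """Score every F x F subpatch of `image` against the filter W."""
    F = len(W)
    rows, cols = len(image) - F + 1, len(image[0]) - F + 1
    return [[match_score([r[j:j + F] for r in image[i:i + F]], W)
             for j in range(cols)] for i in range(rows)]

W = [[0, 1, 0],
     [1, 1, 1],
     [0, 1, 0]]                 # hypothetical 3 x 3 pattern standing in for the cat-filter
image = [[0, 2, 0, 0],
         [2, 2, 2, 0],
         [0, 2, 0, 0],
         [0, 0, 0, 0]]          # contains 2 * W in its top-left 3 x 3 patch
scores = scan(image, W)
print(scores)                   # the top-left score attains the bound 1
```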
213 Convolution
Definition: Suppose that $X, Y \in \mathbb{R}^{n \times n}$. Then $Z = X * Y \in \mathbb{R}^{n \times n}$ is defined as
$$Z[i, j] = \sum_{k,l=0}^{n-1} X[i-k, j-l]\, Y[k, l],$$
where periodization or zero-padding of X, Y is used if $i - k$ or $j - l$ is not in $\{0, \dots, n-1\}$.
Efficient computation possible via FFT (or directly if X or Y are sparse)!
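The definition translates directly into code. The sketch below implements the periodized variant with plain nested lists and direct summation; it does not show the FFT route, and the function name is an assumption made for the example.

```python
def conv2d_periodic(X, Y):
    """Z[i][j] = sum_{k,l} X[(i-k) mod n][(j-l) mod n] * Y[k][l] for n x n lists."""
    n = len(X)
    return [[sum(X[(i - k) % n][(j - l) % n] * Y[k][l]
                 for k in range(n) for l in range(n))
             for j in range(n)] for i in range(n)]

X = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
delta = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]   # unit impulse at [0][0]
shift = [[0, 1, 0], [0, 0, 0], [0, 0, 0]]   # unit impulse at [0][1]
print(conv2d_periodic(X, delta))            # reproduces X
print(conv2d_periodic(X, shift))            # cyclic shift of every row by one column
```

Direct summation costs on the order of n^4 operations for dense inputs, which is why the FFT or sparsity of one factor matters in practice.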
219 Example: Detecting Vertical Edges
Given a vertical-edge-detection filter W (filter entries and example images not preserved in this transcription).
222 Introducing Convolutional Nodes
A convolutional node accepts as input a stack of images, e.g. $X \in \mathbb{R}^{n_1 \times n_2 \times S}$.
Given a filter $W \in \mathbb{R}^{F \times F \times S}$, where F is the spatial extent, and a bias $b \in \mathbb{R}$, it computes a matrix
$$Z = W *_{12} X + b, \qquad W *_{12} X := \sum_{i=1}^{S} X[:, :, i] * W[:, :, i].$$
224 Introducing Convolutional Layers
A convolutional layer consists of K convolutional nodes $((W_i, b_i))_{i=1}^{K} \subset \mathbb{R}^{F \times F \times S} \times \mathbb{R}$ and produces as output a stack $Z \in \mathbb{R}^{n_1 \times n_2 \times K}$ via
$$Z[:, :, i] = W_i *_{12} X + b_i, \qquad W_i *_{12} X = \sum_{s=1}^{S} X[:, :, s] * W_i[:, :, s].$$
A convolutional layer can be written as a conventional neural network layer!
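A convolutional node and layer can be sketched with nested lists. Several choices here are assumptions made for the illustration: the stack is stored channel-first (a list of S images rather than an array indexed as X[:, :, s]), zero-padding is used at the border so the output keeps the size n1 × n2, and the names `conv_node` and `conv_layer` are hypothetical.

```python
def conv_node(X, W, b):
    """One convolutional node: sum over channels of X[s] * W[s], plus bias b.

    X: list of S images (n1 x n2), W: list of S filters (F x F); zero-padding.
    """
    n1, n2, F = len(X[0]), len(X[0][0]), len(W[0])

    def x(s, i, j):                       # zero-padded read of channel s
        return X[s][i][j] if 0 <= i < n1 and 0 <= j < n2 else 0.0

    return [[b + sum(x(s, i - k, j - l) * W[s][k][l]
                     for s in range(len(X)) for k in range(F) for l in range(F))
             for j in range(n2)] for i in range(n1)]

def conv_layer(X, nodes):
    """K nodes (W_i, b_i) map a stack of S images to a stack of K images."""
    return [conv_node(X, W, b) for W, b in nodes]

X = [[[1, 2], [3, 4]],
     [[10, 20], [30, 40]]]                               # S = 2 channels of size 2 x 2
nodes = [
    ([[[1, 0], [0, 0]], [[1, 0], [0, 0]]], 0.0),        # adds the two channels
    ([[[1, 0], [0, 0]], [[0, 0], [0, 0]]], 0.5),        # first channel plus bias 0.5
]
Z = conv_layer(X, nodes)
print(Z)   # a stack of K = 2 output images
```

Every output pixel is a fixed linear combination of input pixels plus a bias, which is the sense in which such a layer is a conventional layer with a sparse, weight-shared matrix.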
225 Activation Layers
The activation layer is defined in the same way as before, e.g., $Z \in \mathbb{R}^{n_1 \times n_2 \times K}$ is mapped to $A = \mathrm{ReLU}(Z)$, where ReLU is applied component-wise.
229 Pooling Layers
Reduce dimensionality after filtering.
Definition: A pooling operator R acts layer-wise on a tensor $X \in \mathbb{R}^{n_1 \times n_2 \times S}$ to result in a tensor $R(X) \in \mathbb{R}^{m_1 \times m_2 \times S}$, where $m_1 < n_1$ and $m_2 < n_2$.
Examples:
Downsampling
Max-pooling
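Both pooling operators named on the slide are easy to sketch for a single image of the stack. The window size 2, the non-overlapping windows, and the function names are assumptions made for this illustration.

```python
def max_pool(image, size=2):
    """Non-overlapping size x size max-pooling of a 2D list (dims divisible by size)."""
    return [[max(image[i + di][j + dj] for di in range(size) for dj in range(size))
             for j in range(0, len(image[0]), size)]
            for i in range(0, len(image), size)]

def downsample(image, step=2):
    """Keep every `step`-th pixel in both directions."""
    return [row[::step] for row in image[::step]]

img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
print(max_pool(img))     # [[6, 8], [14, 16]]
print(downsample(img))   # [[1, 3], [9, 11]]
```

Applied to a stack, either operator would be run on each of the S images separately, so only the spatial dimensions shrink.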
232 Convolutional Neural Networks (CNNs)
Definition: A CNN with L layers consists of L iterative applications of a convolutional layer, followed by an activation layer, (possibly) followed by a pooling layer.
Typical architectures consist of a CNN (as a feature extractor), followed by a fully connected NN (as a classifier).
Figure: LeNet (1998, LeCun et al.): the first successful CNN architecture, used for reading handwritten digits
235 Feature Extractor vs. Classifier
Figure labels: D. Trump, B. Sanders, A. Merkel, B. Johnson
238 Python Library
TensorFlow, developed by Google Brain, based on symbolic computational graphs
239 8 What I didn't tell you
245 1 Data structures & algorithms for efficient deep learning (computational graphs, automatic differentiation, adaptive learning rate, hardware,...)
2 Things to do besides regression or classification
3 Fine-tuning (choice of activation function, choice of loss function,...)
4 More sophisticated training procedures for feature extractor layers (Autoencoder, Restricted Boltzmann Machines,...)
5 Recurrent Neural Networks
246 Questions?
More informationCourse 395: Machine Learning - Lectures
Course 395: Machine Learning - Lectures Lecture 1-2: Concept Learning (M. Pantic) Lecture 3-4: Decision Trees & CBC Intro (M. Pantic & S. Petridis) Lecture 5-6: Evaluating Hypotheses (S. Petridis) Lecture
More informationMachine Learning Basics III
Machine Learning Basics III Benjamin Roth CIS LMU München Benjamin Roth (CIS LMU München) Machine Learning Basics III 1 / 62 Outline 1 Classification Logistic Regression 2 Gradient Based Optimization Gradient
More informationDEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY
DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY 1 On-line Resources http://neuralnetworksanddeeplearning.com/index.html Online book by Michael Nielsen http://matlabtricks.com/post-5/3x3-convolution-kernelswith-online-demo
More informationMachine Learning and Data Mining. Multi-layer Perceptrons & Neural Networks: Basics. Prof. Alexander Ihler
+ Machine Learning and Data Mining Multi-layer Perceptrons & Neural Networks: Basics Prof. Alexander Ihler Linear Classifiers (Perceptrons) Linear Classifiers a linear classifier is a mapping which partitions
More informationSimple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017
Simple Techniques for Improving SGD CS6787 Lecture 2 Fall 2017 Step Sizes and Convergence Where we left off Stochastic gradient descent x t+1 = x t rf(x t ; yĩt ) Much faster per iteration than gradient
More informationNeural networks and optimization
Neural networks and optimization Nicolas Le Roux Criteo 18/05/15 Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 1 / 85 1 Introduction 2 Deep networks 3 Optimization 4 Convolutional
More informationDeep Feedforward Networks. Sargur N. Srihari
Deep Feedforward Networks Sargur N. srihari@cedar.buffalo.edu 1 Topics Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation and Other Differentiation
More informationNon-Convex Optimization. CS6787 Lecture 7 Fall 2017
Non-Convex Optimization CS6787 Lecture 7 Fall 2017 First some words about grading I sent out a bunch of grades on the course management system Everyone should have all their grades in Not including paper
More informationNeural networks. Chapter 20. Chapter 20 1
Neural networks Chapter 20 Chapter 20 1 Outline Brains Neural networks Perceptrons Multilayer networks Applications of neural networks Chapter 20 2 Brains 10 11 neurons of > 20 types, 10 14 synapses, 1ms
More informationMachine Learning. Neural Networks. (slides from Domingos, Pardo, others)
Machine Learning Neural Networks (slides from Domingos, Pardo, others) For this week, Reading Chapter 4: Neural Networks (Mitchell, 1997) See Canvas For subsequent weeks: Scaling Learning Algorithms toward
More informationLecture 3 Feedforward Networks and Backpropagation
Lecture 3 Feedforward Networks and Backpropagation CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 3, 2017 Things we will look at today Recap of Logistic Regression
More informationTopic 3: Neural Networks
CS 4850/6850: Introduction to Machine Learning Fall 2018 Topic 3: Neural Networks Instructor: Daniel L. Pimentel-Alarcón c Copyright 2018 3.1 Introduction Neural networks are arguably the main reason why
More informationNONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function
More informationStatistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks
Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks Jan Drchal Czech Technical University in Prague Faculty of Electrical Engineering Department of Computer Science Topics covered
More informationFeedforward Neural Networks
Feedforward Neural Networks Michael Collins 1 Introduction In the previous notes, we introduced an important class of models, log-linear models. In this note, we describe feedforward neural networks, which
More informationGrundlagen der Künstlichen Intelligenz
Grundlagen der Künstlichen Intelligenz Neural networks Daniel Hennes 21.01.2018 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Logistic regression Neural networks Perceptron
More informationPattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore
Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Lecture - 27 Multilayer Feedforward Neural networks with Sigmoidal
More informationArtificial Neural Networks
Artificial Neural Networks Oliver Schulte - CMPT 310 Neural Networks Neural networks arise from attempts to model human/animal brains Many models, many claims of biological plausibility We will focus on
More informationEVERYTHING YOU NEED TO KNOW TO BUILD YOUR FIRST CONVOLUTIONAL NEURAL NETWORK (CNN)
EVERYTHING YOU NEED TO KNOW TO BUILD YOUR FIRST CONVOLUTIONAL NEURAL NETWORK (CNN) TARGETED PIECES OF KNOWLEDGE Linear regression Activation function Multi-Layers Perceptron (MLP) Stochastic Gradient Descent
More informationDeep Learning Lab Course 2017 (Deep Learning Practical)
Deep Learning Lab Course 207 (Deep Learning Practical) Labs: (Computer Vision) Thomas Brox, (Robotics) Wolfram Burgard, (Machine Learning) Frank Hutter, (Neurorobotics) Joschka Boedecker University of
More informationMachine Learning for Signal Processing Neural Networks Continue. Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016
Machine Learning for Signal Processing Neural Networks Continue Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016 1 So what are neural networks?? Voice signal N.Net Transcription Image N.Net Text
More informationLecture 5: Logistic Regression. Neural Networks
Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture
More information1 What a Neural Network Computes
Neural Networks 1 What a Neural Network Computes To begin with, we will discuss fully connected feed-forward neural networks, also known as multilayer perceptrons. A feedforward neural network consists
More informationConvolutional Neural Networks II. Slides from Dr. Vlad Morariu
Convolutional Neural Networks II Slides from Dr. Vlad Morariu 1 Optimization Example of optimization progress while training a neural network. (Loss over mini-batches goes down over time.) 2 Learning rate
More informationIntroduction to Deep Learning
Introduction to Deep Learning A. G. Schwing & S. Fidler University of Toronto, 2015 A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2015 1 / 39 Outline 1 Universality of Neural Networks
More informationLinear discriminant functions
Andrea Passerini passerini@disi.unitn.it Machine Learning Discriminative learning Discriminative vs generative Generative learning assumes knowledge of the distribution governing the data Discriminative
More informationMachine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.
Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted
More informationIntroduction to Convolutional Neural Networks 2018 / 02 / 23
Introduction to Convolutional Neural Networks 2018 / 02 / 23 Buzzword: CNN Convolutional neural networks (CNN, ConvNet) is a class of deep, feed-forward (not recurrent) artificial neural networks that
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Neural Networks: A brief touch Yuejie Chi Department of Electrical and Computer Engineering Spring 2018 1/41 Outline
More informationDeepLearning on FPGAs
DeepLearning on FPGAs Introduction to Deep Learning Sebastian Buschäger Technische Universität Dortmund - Fakultät Informatik - Lehrstuhl 8 October 21, 2017 1 Recap Computer Science Approach Technical
More informationCSC321 Lecture 2: Linear Regression
CSC32 Lecture 2: Linear Regression Roger Grosse Roger Grosse CSC32 Lecture 2: Linear Regression / 26 Overview First learning algorithm of the course: linear regression Task: predict scalar-valued targets,
More informationCS:4420 Artificial Intelligence
CS:4420 Artificial Intelligence Spring 2018 Neural Networks Cesare Tinelli The University of Iowa Copyright 2004 18, Cesare Tinelli and Stuart Russell a a These notes were originally developed by Stuart
More informationBased on the original slides of Hung-yi Lee
Based on the original slides of Hung-yi Lee Google Trends Deep learning obtains many exciting results. Can contribute to new Smart Services in the Context of the Internet of Things (IoT). IoT Services
More informationCSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska. NEURAL NETWORKS Learning
CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska NEURAL NETWORKS Learning Neural Networks Classifier Short Presentation INPUT: classification data, i.e. it contains an classification (class) attribute.
More informationLecture 2: Learning with neural networks
Lecture 2: Learning with neural networks Deep Learning @ UvA LEARNING WITH NEURAL NETWORKS - PAGE 1 Lecture Overview o Machine Learning Paradigm for Neural Networks o The Backpropagation algorithm for
More information