Machine Learning (CSE 446): Backpropagation
Noah Smith
© 2017 University of Washington
nasmith@cs.washington.edu
November 8, 2017
Neuron-Inspired Classifiers

[Figure: the input $\mathbf{x}_n$ enters with weights $\mathbf{W}$ and bias $\mathbf{b}$; hidden units apply a tanh activation; weights $\mathbf{v}$ combine the hidden units into the classifier output $f = \hat{y}$, which is compared against the correct output $y_n$ by the loss $L_n$.]
Two-Layer Neural Network

$$f(\mathbf{x}) = \mathrm{sign}\left(\sum_{h=1}^{H} v_h \tanh\left(\mathbf{w}_h \cdot \mathbf{x} + b_h\right)\right) = \mathrm{sign}\left(\mathbf{v} \cdot \tanh(\mathbf{W}\mathbf{x} + \mathbf{b})\right)$$

Two-layer networks allow decision boundaries that are nonlinear.

It's fairly easy to show that XOR can be simulated (recall conjunction features from the practical issues lecture on 10/18); see the sketch after this slide.

Theoretical result: any continuous function on a bounded region in $\mathbb{R}^d$ can be approximated arbitrarily well, with a finite number of hidden units.

The number of hidden units affects how complicated your decision boundary can be and how easily you will overfit.
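As a concrete, purely illustrative check of the XOR claim, here is a minimal numpy sketch of $f(\mathbf{x}) = \mathrm{sign}(\mathbf{v} \cdot \tanh(\mathbf{W}\mathbf{x} + \mathbf{b}))$. The hand-picked weights are assumptions chosen for this example; a third hidden unit acts as a near-constant, standing in for the output bias the model lacks.

```python
import numpy as np

def two_layer_classifier(x, W, b, v):
    """f(x) = sign(v . tanh(W x + b)) -- the two-layer network from the slide."""
    return np.sign(v @ np.tanh(W @ x + b))

# Hand-picked parameters (assumed for illustration): unit 1 detects "both inputs +1",
# unit 2 detects "both inputs -1", unit 3 is saturated and acts as a constant.
W = np.array([[10.0, 10.0],
              [-10.0, -10.0],
              [0.0, 0.0]])
b = np.array([-10.0, -10.0, 10.0])
v = np.array([-1.0, -1.0, -1.0])

for x1 in (-1.0, 1.0):
    for x2 in (-1.0, 1.0):
        y = two_layer_classifier(np.array([x1, x2]), W, b, v)
        print(x1, x2, "->", y)   # matches XOR: +1 exactly when the inputs disagree
```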
Learning with a Two-Layer Network

Parameters: $\mathbf{W} \in \mathbb{R}^{H \times d}$, $\mathbf{b} \in \mathbb{R}^{H}$, and $\mathbf{v} \in \mathbb{R}^{H}$.

If we choose a differentiable loss, then the whole function will be differentiable with respect to all parameters.

Because of the squashing function, which is not convex, the overall learning problem is not convex. What does (stochastic) (sub)gradient descent do with non-convex functions? It finds a local minimum.

To calculate gradients, we need to use the chain rule from calculus. Special name for (S)GD with chain rule invocations: backpropagation.
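To make "(S)GD with chain rule invocations" concrete, here is a hedged sketch of the outer training loop. `backprop_gradients` is a placeholder for the procedure developed on the next slides, and the learning rate and epoch count are arbitrary assumptions.

```python
import numpy as np

def sgd(examples, params, backprop_gradients, lr=0.1, epochs=10):
    """Plain SGD: for each (x, y), step against the per-example gradient.
    `params` is a dict {"W": ..., "b": ..., "v": ...}; `backprop_gradients`
    returns a dict of gradients with the same keys (defined in a later sketch)."""
    for _ in range(epochs):
        np.random.shuffle(examples)        # examples: a list of (x, y) pairs
        for x, y in examples:
            grads = backprop_gradients(x, y, params)
            for name in params:
                params[name] -= lr * grads[name]
    return params
```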
Backpropagation

For every node in the computation graph, we wish to calculate the first derivative of $L_n$ with respect to that node. For any node $a$, let:

$$\bar{a} = \frac{\partial L_n}{\partial a}$$

Base case:

$$\bar{L}_n = \frac{\partial L_n}{\partial L_n} = 1$$

From here on, we overload notation and let $a$ and $b$ refer to nodes and to their values.

After working forwards through the computation graph to obtain the loss $L_n$, we work backwards through the computation graph, using the chain rule to calculate $\bar{a}$ for every node $a$, making use of the work already done for nodes that depend on $a$:

$$\bar{a} = \sum_{b : a \rightarrow b} \frac{\partial L_n}{\partial b} \cdot \frac{\partial b}{\partial a} = \sum_{b : a \rightarrow b} \bar{b} \cdot \frac{\partial b}{\partial a}$$

$$\frac{\partial b}{\partial a} = \begin{cases} 1 & \text{if } b = a + c \text{ for some } c \\ c & \text{if } b = a \cdot c \text{ for some } c \\ 1 - b^2 & \text{if } b = \tanh(a) \end{cases}$$
Backpropagation with Vectors

Pointwise ("Hadamard") product for vectors in $\mathbb{R}^n$:

$$\mathbf{a} \odot \mathbf{b} = \begin{bmatrix} \mathbf{a}[1]\,\mathbf{b}[1] \\ \mathbf{a}[2]\,\mathbf{b}[2] \\ \vdots \\ \mathbf{a}[n]\,\mathbf{b}[n] \end{bmatrix}$$

$$\bar{\mathbf{a}} = \sum_{\mathbf{b} : \mathbf{a} \rightarrow \mathbf{b}} \sum_{i=1}^{n} \bar{\mathbf{b}}[i] \frac{\partial \mathbf{b}[i]}{\partial \mathbf{a}} = \sum_{\mathbf{b} : \mathbf{a} \rightarrow \mathbf{b}} \begin{cases} \bar{\mathbf{b}} & \text{if } \mathbf{b} = \mathbf{a} + \mathbf{c} \text{ for some } \mathbf{c} \\ \bar{\mathbf{b}} \odot \mathbf{c} & \text{if } \mathbf{b} = \mathbf{a} \odot \mathbf{c} \text{ for some } \mathbf{c} \\ \bar{\mathbf{b}} \odot (1 - \mathbf{b} \odot \mathbf{b}) & \text{if } \mathbf{b} = \tanh(\mathbf{a}) \end{cases}$$
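The three local rules above are all that is needed to differentiate the two-layer network. The sketch below is one way it might be coded (an assumption, not code from the lecture): a forward pass through the graph, then the sum, product, and tanh rules applied backwards, with squared loss chosen for concreteness. It can serve as the `backprop_gradients` placeholder in the earlier SGD sketch.

```python
import numpy as np

def backprop_gradients(x, y, params):
    """Forward and backward pass for f = v . tanh(W x + b) with squared loss.
    Node names follow the lecture's computation graph: d, a, e, f, g."""
    W, b, v = params["W"], params["b"], params["v"]

    # Forward pass.
    d = W @ x                 # d = W x
    a = b + d                 # a = b + d
    e = np.tanh(a)            # e = tanh(a)
    f = v * e                 # f = v ⊙ e (pointwise product)
    g = np.sum(f)             # g = Σ_h f[h]; the loss is (y - g)²

    # Backward pass, applying the local rules from the slides.
    g_bar = -2.0 * (y - g)            # loss-specific (squared loss assumed)
    f_bar = g_bar * np.ones_like(f)   # sum rule
    e_bar = f_bar * v                 # product rule (other factor is v)
    v_bar = f_bar * e                 # product rule (other factor is e)
    a_bar = e_bar * (1.0 - e * e)     # tanh rule
    b_bar = a_bar                     # sum rule
    d_bar = a_bar                     # sum rule
    W_bar = np.outer(d_bar, x)        # d = W x, so the gradient w.r.t. W is d_bar xᵀ

    return {"W": W_bar, "b": b_bar, "v": v_bar}
```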
Backpropagation, Illustrated

[Figure: the computation graph for one example. Forward pass: $d = Wx_n$, $a = b + d$, $e = \tanh(a)$, $f = v \odot e$, $g = \sum_h f[h]$; the loss $L_n$ compares $g$ with the correct output $y_n$. Parameters: $W$, $b$, $v$.]

Intermediate nodes are de-anonymized, to make notation easier. Working backwards through the graph, one node at a time:
- Base case: $\bar{L}_n = 1$.
- The form of $\bar{g}$ will be loss-function specific (e.g., $-2(y_n - g)$ for squared loss).
- Sum: $\bar{f} = \bar{g}\,\mathbf{1}$.
- Product: $\bar{e} = \bar{g}\,v$ and $\bar{v} = \bar{g}\,e$.
- Hyperbolic tangent: $\bar{a} = \bar{g}\,v \odot (1 - e \odot e)$.
- Sum: $\bar{b} = \bar{a}$ and $\bar{d} = \bar{a}$.
- Product: $\bar{W} = \bar{a}\,x_n^\top$.
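One way to sanity-check a hand-derived backward pass (a suggestion, not part of the slides) is a centered finite-difference comparison against the `backprop_gradients` sketch above; `finite_difference_check` and its parameter values are hypothetical.

```python
import numpy as np

def finite_difference_check(x, y, params, eps=1e-6):
    """Compare each backprop gradient to (L(θ+eps) - L(θ-eps)) / (2 eps)."""
    def loss(p):
        g = p["v"] @ np.tanh(p["W"] @ x + p["b"])
        return (y - g) ** 2

    grads = backprop_gradients(x, y, params)   # from the sketch above
    for name, grad in grads.items():
        numeric = np.zeros_like(grad)
        it = np.nditer(grad, flags=["multi_index"])
        for _ in it:
            idx = it.multi_index
            perturbed = {k: arr.copy() for k, arr in params.items()}
            perturbed[name][idx] += eps
            hi = loss(perturbed)
            perturbed[name][idx] -= 2 * eps
            lo = loss(perturbed)
            numeric[idx] = (hi - lo) / (2 * eps)
        print(name, "max abs difference:", np.max(np.abs(grad - numeric)))

# Example usage with random parameters (H=3 hidden units, d=2 inputs).
rng = np.random.default_rng(0)
params = {"W": rng.normal(size=(3, 2)), "b": rng.normal(size=3), "v": rng.normal(size=3)}
finite_difference_check(np.array([0.5, -1.0]), 1.0, params)
```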
Practical Notes

Don't initialize all parameters to zero; add some random noise.

Random restarts: train K networks with different initializers, and you'll get K different classifiers of varying quality (see the sketch after this slide).

Hyperparameters?
- number of training iterations
- learning rate for SGD
- number of hidden units (H)
- number of layers (and number of hidden units in each layer)
- amount of randomness in initialization
- regularization

Interpretability?
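A minimal sketch of random restarts, assuming the `sgd` and `backprop_gradients` sketches above, a development set for choosing among the K runs, and placeholder hyperparameter values:

```python
import numpy as np

def random_restarts(train, dev, H, d, K=5, noise=0.1):
    """Train K networks from different small random initializations and
    keep the one with the fewest mistakes on the development set."""
    best_params, best_errors = None, float("inf")
    for k in range(K):
        rng = np.random.default_rng(k)
        params = {"W": noise * rng.normal(size=(H, d)),
                  "b": noise * rng.normal(size=H),
                  "v": noise * rng.normal(size=H)}
        params = sgd(train, params, backprop_gradients)
        errors = sum(1 for x, y in dev
                     if np.sign(params["v"] @ np.tanh(params["W"] @ x + params["b"])) != y)
        if errors < best_errors:
            best_params, best_errors = params, errors
    return best_params
```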
Challenge of Deeper Networks

Backpropagation aims to assign credit (or blame) to each parameter. In a deep network, credit/blame is shared across all layers. So parameters at early layers tend to have very small gradients.

One solution is to train a shallow network, then use it to initialize a deeper network, perhaps gradually increasing network depth. This is called layer-wise training.
Radial Basis Function Networks

[Figure: the input $\mathbf{x}_n$ with weights $\mathbf{w}_1, \mathbf{w}_2$ feeds two hidden units; each computes $\mathrm{sqd}(\mathbf{x}_n, \mathbf{w}_h)$, scales it by $\gamma_h$, and exponentiates; weights $v_1, v_2$ combine the activations into the classifier output $f = \hat{y}$, which is compared against the correct output $y_n$ by the loss $L_n$.]

In the diagram, $\mathrm{sqd}(\mathbf{x}, \mathbf{w}) = \|\mathbf{x} - \mathbf{w}\|_2^2$.
Radial Basis Function Networks

Generalizing to H hidden units:

$$f(\mathbf{x}) = \mathrm{sign}\left(\sum_{h=1}^{H} \mathbf{v}[h] \exp\left(-\gamma_h \|\mathbf{x} - \mathbf{w}_h\|_2^2\right)\right)$$

Each hidden unit is like a little bump in data space. $\mathbf{w}_h$ is the position of the bump, and $\gamma_h$ is inversely proportional to its width.
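For concreteness, a small sketch (my own, under the same notation) of the RBF network's decision function; the example centers, widths, and weights are placeholders:

```python
import numpy as np

def rbf_classifier(x, centers, gammas, v):
    """f(x) = sign( Σ_h v[h] exp(-γ_h ||x - w_h||²) ).
    `centers` has one row per hidden unit (the bump positions w_h)."""
    sqd = np.sum((centers - x) ** 2, axis=1)   # squared distance to each w_h
    activations = np.exp(-gammas * sqd)        # one bump per hidden unit
    return np.sign(v @ activations)

# Example: two bumps, one voting positive and one voting negative (placeholder values).
centers = np.array([[0.0, 0.0], [2.0, 2.0]])
gammas = np.array([1.0, 0.5])
v = np.array([+1.0, -1.0])
print(rbf_classifier(np.array([0.1, -0.2]), centers, gammas, v))   # near first bump: +1
print(rbf_classifier(np.array([2.2, 1.9]), centers, gammas, v))    # near second bump: -1
```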
A Gentle Reading on Backpropagation

http://colah.github.io/posts/2015-08-backprop/