
Neural networks COMS 4771

1. Logistic regression

Logistic regression

Suppose the feature space is $\mathbb{R}^d$ and the label space is $\{0, 1\}$. A logistic regression model is a statistical model in which the conditional probability function has a particular form:
$$Y \mid X = x \;\sim\; \mathrm{Bern}\big(\mathrm{logistic}(x^T w)\big), \quad x \in \mathbb{R}^d,$$
with
$$\mathrm{logistic}(z) \;:=\; \frac{1}{1 + e^{-z}} \;=\; \frac{e^{z}}{1 + e^{z}}, \quad z \in \mathbb{R}.$$
[Figure: plot of the logistic function for $z \in [-6, 6]$.]
Parameters: $w = (w_1, \dots, w_d) \in \mathbb{R}^d$.
Conditional probability function: $\eta_w(x) = \mathrm{logistic}(x^T w)$.
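As a minimal sketch (not from the slides), the conditional probability function $\eta_w$ can be computed in a few lines of NumPy; the function names and example values below are illustrative.

```python
import numpy as np

def logistic(z):
    # logistic(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def eta(w, x):
    # P(Y = 1 | X = x) under the model: logistic(x^T w)
    return logistic(x @ w)

# Illustrative values with d = 3 features.
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.2, 0.4])
print(eta(w, x))   # a probability in (0, 1)
```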

Logistic regression: network diagram

Network diagram for $\eta_w$: the inputs $x_1, \dots, x_d$ feed into a single output unit $v$ through weights $w_1, \dots, w_d$, with
$$v := g(z), \qquad z := \sum_{j=1}^{d} w_j x_j, \qquad (g = \mathrm{logistic}).$$
Here, $g$ is called the link function.

Learning w from data

Training data $((x_i, y_i))_{i=1}^{n}$ from $\mathbb{R}^d \times \{0, 1\}$.
Could use MLE to learn $w$ from the data.
Another option: squared-loss ERM (with link function $g$),
$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \big(g(x_i^T w) - y_i\big)^2.$$
Observe that for any $(X, Y) \sim P$ (not necessarily logistic regression),
$$\mathbb{E}\Big[\big(g(X^T w) - Y\big)^2 \,\Big|\, X = x\Big] = \big(g(x^T w) - \eta(x)\big)^2 + \operatorname{var}(Y \mid X = x),$$
where $\eta(x) = \mathbb{P}(Y = 1 \mid X = x)$.
Algorithm for squared-loss ERM with link function $g$?

Stochastic gradient method

$$\nabla_w \big\{ (g(x^T w) - y)^2 \big\} = 2\,\big(g(x^T w) - y\big)\, g'(x^T w)\, x.$$
Stochastic gradient method for squared-loss ERM with link function $g$:
1: Start with some initial $w^{(1)} \in \mathbb{R}^d$.
2: for $t = 1, 2, \dots$ until some stopping condition is satisfied do
3: Pick $(X^{(t)}, Y^{(t)})$ uniformly at random from $(x_1, y_1), \dots, (x_n, y_n)$.
4: Update: $w^{(t+1)} := w^{(t)} - 2\eta_t \big(g(\langle X^{(t)}, w^{(t)} \rangle) - Y^{(t)}\big)\, g'(\langle X^{(t)}, w^{(t)} \rangle)\, X^{(t)}$.
5: end for
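Below is a minimal sketch of this procedure, assuming the logistic link and a constant step size; the function names and hyperparameter values are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_grad(z):
    s = logistic(z)
    return s * (1.0 - s)   # g'(z) for the logistic link

def sgd_squared_loss(X, y, steps=10_000, step_size=0.1, seed=0):
    """Stochastic gradient method for squared-loss ERM with link g = logistic."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)                      # initial w^(1)
    for t in range(steps):
        i = rng.integers(n)              # pick (x_i, y_i) uniformly at random
        z = X[i] @ w
        # w^(t+1) := w^(t) - 2 eta_t (g(z) - y_i) g'(z) x_i
        w = w - step_size * 2.0 * (logistic(z) - y[i]) * logistic_grad(z) * X[i]
    return w
```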

Extensions

Other loss functions (e.g., the cross-entropy loss $y \ln \frac{1}{p} + (1 - y) \ln \frac{1}{1 - p}$).
Other link functions (e.g., $g(z) =$ some polynomial in $z$).
Is the overall objective function convex? Sometimes, but not always.
Nevertheless, the stochastic gradient method is still often effective at finding approximate local minima.

2. Multilayer neural networks

Two-output network

The inputs $x_1, \dots, x_d$ feed into two output units $v_1, v_2$:
$$v_j := g(z_j), \qquad z_j := \sum_{i=1}^{d} W_{i,j}\, x_i, \qquad j \in \{1, 2\}.$$

k-output network

The inputs $x_1, \dots, x_d$ feed into $k$ output units $v_1, \dots, v_k$:
$$v_j := g(z_j), \qquad z_j := \sum_{i=1}^{d} W_{i,j}\, x_i, \qquad j \in \{1, \dots, k\}.$$

A motivating example: multitask learning

$k$ binary prediction tasks over a single feature vector (e.g., predicting tags for images). Labeled examples are of the form $(x_i, (y_{i,1}, \dots, y_{i,k})) \in \mathbb{R}^d \times \{0, 1\}^k$.
Option 1: $k$ independent logistic regression models; learn $w_1, \dots, w_k$ by minimizing (e.g.)
$$\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} \big(g(x_i^T w_j) - y_{i,j}\big)^2.$$
Suppose the labels $y_{i,1}, \dots, y_{i,k}$ are not independent; e.g., if $y_{i,1} = 1$, then it is also more likely that $y_{i,2} = 1$.
Option 2: Do Option 1, but also learn to combine the predictions of $y_{i,1}, \dots, y_{i,k}$ to get better predictions for each $y_{i,j}$.

Multilayer neural network

The inputs $x_1, \dots, x_d$ feed through weight matrix $W^{(1)}$ into units $v^{(1)}_1, \dots, v^{(1)}_k$, which feed through $W^{(2)}$ into units $v^{(2)}_1, \dots, v^{(2)}_k$.
Columns of $W^{(1)} \in \mathbb{R}^{d \times k}$: parameters of the original logistic regression models.
Columns of $W^{(2)} \in \mathbb{R}^{k \times k}$: parameters of the new logistic regression models that combine the predictions of the original models.
Each node is called a unit. Non-input and non-output units are called hidden.

Compositional structure

Suppose we have two functions
$$f_{W_1} : \mathbb{R}^d \to \mathbb{R}^k \;\; (W_1 \in \mathbb{R}^{d \times k}), \qquad f_{W_2} : \mathbb{R}^k \to \mathbb{R}^l \;\; (W_2 \in \mathbb{R}^{k \times l}),$$
where $f_W(x) := g(W^T x)$, and $g$ applies the link function $g$ coordinate-wise to a vector.
Composition: $f_{W_1, W_2} := f_{W_2} \circ f_{W_1}$ is defined by
$$f_{W_1, W_2}(x) := f_{W_2}\big(f_{W_1}(x)\big).$$
This is a two-layer neural network (inputs $x_1, \dots, x_d$, hidden units $v^{(1)}_1, \dots, v^{(1)}_k$, outputs $v^{(2)}_1, \dots, v^{(2)}_l$).
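A short sketch of this composition in NumPy; the dimensions $d, k, l$, the random weights, and the logistic link are illustrative.

```python
import numpy as np

def g(z):
    # link function, applied coordinate-wise
    return 1.0 / (1.0 + np.exp(-z))

def f(W, x):
    # f_W(x) := g(W^T x)
    return g(W.T @ x)

d, k, l = 4, 3, 2
rng = np.random.default_rng(0)
W1 = rng.normal(size=(d, k))    # W_1 in R^{d x k}
W2 = rng.normal(size=(k, l))    # W_2 in R^{k x l}
x = rng.normal(size=d)

out = f(W2, f(W1, x))           # f_{W_1, W_2}(x) = f_{W_2}(f_{W_1}(x))
print(out.shape)                # (l,) = (2,)
```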

Necessity of multiple layers

A one-layer neural network with a monotonic link function is a linear (or affine) classifier, and so it cannot represent the XOR function (Minsky and Papert, 1969).
[Figure from Stuart Russell: the points of $\{0,1\}^2$ labeled by (a) $x_1$ AND $x_2$, (b) $x_1$ OR $x_2$, (c) $x_1$ XOR $x_2$; only the first two are linearly separable.]
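To make the contrast concrete, here is a small sketch (not from the slides) of a two-layer network that does compute XOR, assuming hard-threshold units and a constant bias term added to each unit's weighted sum.

```python
def step(z):
    # hard-threshold unit; a steep logistic behaves essentially the same way
    return 1.0 if z > 0 else 0.0

def xor_net(x1, x2):
    h_or  = step(x1 + x2 - 0.5)      # hidden unit: x1 OR x2
    h_and = step(x1 + x2 - 1.5)      # hidden unit: x1 AND x2
    return step(h_or - h_and - 0.5)  # output: (x1 OR x2) AND NOT (x1 AND x2) = XOR

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, int(xor_net(x1, x2)))   # prints labels 0, 1, 1, 0
```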

Approximation power of multilayer neural networks

Theorem (Cybenko, 1989; Hornik, 1991; Barron, 1993). Any continuous function $f$ can be approximated arbitrarily well by a two-layer neural network $f \approx f_{W_2} \circ f_{W_1}$, with $f_{W_1} : \mathbb{R}^d \to \mathbb{R}^k$ and $f_{W_2} : \mathbb{R}^k \to \mathbb{R}$.
However: this may require a very large number of hidden units.
Theorem (Telgarsky, 2015; Eldan and Shamir, 2015). Some functions can be approximated with exponentially fewer hidden units by using more than two layers.
Note: none of this speaks directly to learning neural networks from data.

3. Computation and learning with neural networks

General structure of a neural network

Neural network for $f : \mathbb{R}^d \to \mathbb{R}$. (Easy to generalize to $f : \mathbb{R}^d \to \mathbb{R}^k$.)
Directed acyclic graph $G = (V, E)$; the vertices are regarded as formal variables.
There are $d$ source vertices, one per input variable, called $x_1, \dots, x_d$, and a single sink vertex, called $\hat{y}$.
Each edge $(u, v) \in E$ has a weight $w_{u,v} \in \mathbb{R}$.
The value of a vertex $v$, given the values of its parents $\pi_G(v) := \{u \in V : (u, v) \in E\}$, is
$$v := g(z_v), \qquad z_v := \sum_{u \in \pi_G(v)} w_{u,v}\, u,$$
where $g$ is the link function (e.g., the logistic function).
[Diagram: a DAG with inputs $x_1, x_2, \dots, x_d$ at the bottom, an internal edge $(u, v)$ with weight $w_{u,v}$, and the output $\hat{y}$ at the top.]

Organizing and evaluating a neural network

The vertices of $V$ are partitioned into layers $V_0, V_1, \dots$:
$V_0 := \{x_1, \dots, x_d\}$, just the input variables;
put $v$ in $V_l$ if the longest path in $G$ from some $x_i$ to $v$ has $l$ edges;
the final layer consists only of the sink vertex $\hat{y}$.
Given input values $x = (x_1, \dots, x_d) \in \mathbb{R}^d$, how do we compute $f(x)$?
1. Compute the values of all vertices in $V_1$, given the values of the vertices in $V_0$ (i.e., the input variables):
$$v := g(z_v), \qquad z_v := \sum_{u \in \pi_G(v)} w_{u,v}\, u.$$
(All parents of $v \in V_1$ are in $V_0$.)
2. Compute the values of all vertices in $V_2$, given the values of the vertices in $V_0 \cup V_1$. (All parents of $v \in V_2$ are in $V_0 \cup V_1$.)
3. Etc., until $\hat{y} = f(x)$ is computed.
This is called forward propagation.
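A minimal sketch of forward propagation over a layered graph; the dictionary-based representation of vertices, parents, and edge weights is a hypothetical choice made for this example, not something specified by the slides.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(layers, parents, weights, x, g=logistic):
    """layers[0] lists the input vertices; parents[v] = pi_G(v); weights[(u, v)] = w_{u,v}.
    Returns the value and pre-activation z_v of every vertex on input x (a dict)."""
    value, z = dict(x), {}
    for layer in layers[1:]:          # process V_1, V_2, ... in order
        for v in layer:
            z[v] = sum(weights[(u, v)] * value[u] for u in parents[v])
            value[v] = g(z[v])        # v := g(z_v)
    return value, z

# Tiny illustrative graph: x1, x2 -> hidden unit h -> output yhat.
layers  = [["x1", "x2"], ["h"], ["yhat"]]
parents = {"h": ["x1", "x2"], "yhat": ["h"]}
weights = {("x1", "h"): 1.0, ("x2", "h"): -2.0, ("h", "yhat"): 0.5}
value, z = forward(layers, parents, weights, {"x1": 1.0, "x2": 0.5})
print(value["yhat"])
```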

Training a neural network

How do we fit a neural network to data? Use the stochastic gradient method!
The basic computational problem: compute the partial derivative of the loss on a labeled example with respect to a parameter.
Mid-to-late 20th-century researchers discovered how to use the chain rule to organize this gradient computation: the backpropagation algorithm.
Given:
a labeled example $(x, y) \in \mathbb{R}^d \times \mathbb{R}$;
the current weights $w_{u,v} \in \mathbb{R}$ for all $(u, v) \in E$;
the values $v$ and $z_v$ for all non-source $v \in V$ on input $x$ (can first run forward propagation to get the $v$'s and $z_v$'s).
Let $l$ denote the loss of the prediction $\hat{y} = f(x)$ (e.g., $l := (\hat{y} - y)^2$).
Goal: compute $\frac{\partial l}{\partial w_{u,v}}$ for all $(u, v) \in E$.

Backpropagation: exploiting the chain rule

Strategy: use the chain rule,
$$\frac{\partial l}{\partial w_{u,v}} = \frac{\partial l}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial v} \cdot \frac{\partial v}{\partial w_{u,v}}.$$
For the squared loss $l = (\hat{y} - y)^2$, $\frac{\partial l}{\partial \hat{y}} = 2(\hat{y} - y)$; this is easy to compute for other losses as well. ($\hat{y}$ is computed in forward propagation.)
Since $v = g(z_v)$ where $z_v = w_{u,v}\, u + (\text{terms not involving } w_{u,v})$,
$$\frac{\partial v}{\partial w_{u,v}} = \frac{\partial v}{\partial z_v} \cdot \frac{\partial z_v}{\partial w_{u,v}} = g'(z_v)\, u.$$
($z_v$ and $u$ are computed in forward propagation.)

Backpropagation: the recursive part

Key trick: compute $\frac{\partial \hat{y}}{\partial v}$ for all $v \in V_l$, in decreasing order of the layer index $l$.
Strategy: for $v \neq \hat{y}$, use the multivariate chain rule. Let $k = \text{out-deg}(v)$, with outgoing edges $(v, v_1), \dots, (v, v_k) \in E$:
$$\frac{\partial \hat{y}}{\partial v} = \sum_{i=1}^{k} \frac{\partial \hat{y}}{\partial v_i} \cdot \frac{\partial v_i}{\partial v}.$$
Since $v_i = g(z_{v_i})$ where $z_{v_i} = w_{v,v_i}\, v + (\text{terms not involving } v)$,
$$\frac{\partial v_i}{\partial v} = g'(z_{v_i})\, w_{v,v_i}.$$
(The $z_{v_i}$'s are computed in forward propagation.)
Since the $v_i$ are in a higher layer than $v$, each $\frac{\partial \hat{y}}{\partial v_i}$ has already been computed!
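Continuing the (hypothetical) graph representation from the forward-propagation sketch above, a minimal backward pass computes $\partial \hat{y} / \partial v$ in decreasing layer order and then the weight gradients for the squared loss.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_grad(z):
    s = logistic(z)
    return s * (1.0 - s)

def backward(layers, weights, value, z, y, g_prime=logistic_grad):
    """Gradients dl/dw_{u,v} of the squared loss l = (yhat - y)^2, given the
    forward-pass values `value` and pre-activations `z`."""
    yhat = value["yhat"]
    children = {}                            # children[v] = [v_1, ..., v_k]
    for (u, v) in weights:
        children.setdefault(u, []).append(v)

    dy = {"yhat": 1.0}                       # d yhat / d yhat = 1
    for layer in reversed(layers[:-1]):      # decreasing order of layer index
        for v in layer:
            # multivariate chain rule over the outgoing edges (v, v_i)
            dy[v] = sum(dy[vi] * g_prime(z[vi]) * weights[(v, vi)]
                        for vi in children.get(v, []))

    # dl/dw_{u,v} = (dl/dyhat) * (dyhat/dv) * g'(z_v) * u
    return {(u, v): 2.0 * (yhat - y) * dy[v] * g_prime(z[v]) * value[u]
            for (u, v) in weights}

# Usage, reusing layers/weights/value/z from the forward-propagation sketch:
# grads = backward(layers, weights, value, z, y=1.0)
```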

Matrix view of forward/backward propagation

Recall that a general neural network function $f : \mathbb{R}^d \to \mathbb{R}$ can be written as
$$f(x) = g_L\big(W_L \cdots g_2(W_2\, g_1(W_1 x)) \cdots \big),$$
where $W_i \in \mathbb{R}^{d_i \times d_{i-1}}$ is the weight matrix for layer $i$, and $g_i : \mathbb{R}^{d_i} \to \mathbb{R}^{d_i}$ collects the non-linear link functions for layer $i$.
The first layer thus computes $f_1(x) := g_1(z_1)$ where $z_1 := W_1 x$, and the $i$-th layer computes $f_i(x) := g_i(z_i)$ where $z_i := W_i f_{i-1}(x)$. (Convention: $f_0(x) := x$.)
The previous backprop equations prove the correctness of the following matrix derivative formula:
$$\nabla_{W_i} l = \Big[ g_i'(z_i) \odot \Big( W_{i+1}^T \cdots g_{L-1}'(z_{L-1}) \odot \big( W_L^T\, g_L'(z_L)\, l'(f_L(x)) \big) \cdots \Big) \Big] f_{i-1}(x)^T,$$
where $\odot$ denotes the element-wise product.
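A NumPy sketch of the matrix view, again for the squared loss and a logistic link at every layer (both are illustrative); the backward pass applies the nested formula via the usual recursion over layers.

```python
import numpy as np

def g(z):                      # logistic link, coordinate-wise
    return 1.0 / (1.0 + np.exp(-z))

def g_prime(z):
    s = g(z)
    return s * (1.0 - s)

def forward(Ws, x):
    """f_i(x) := g(z_i), z_i := W_i f_{i-1}(x); returns all f_i and z_i."""
    fs, zs = [x], []
    for W in Ws:
        zs.append(W @ fs[-1])
        fs.append(g(zs[-1]))
    return fs, zs

def backward(Ws, fs, zs, y):
    """Gradients of l = (f_L(x) - y)^2 with respect to each W_i."""
    L = len(Ws)
    delta = g_prime(zs[-1]) * 2.0 * (fs[-1] - y)   # dl/dz at the last layer
    grads = [None] * L
    for i in range(L - 1, -1, -1):
        grads[i] = np.outer(delta, fs[i])          # delta times f_{i-1}(x)^T
        if i > 0:
            delta = g_prime(zs[i - 1]) * (Ws[i].T @ delta)   # element-wise product
    return grads

# Illustrative sizes: 4 inputs, one hidden layer of 3 units, scalar output.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 4)) / 2.0, rng.normal(size=(1, 3)) / np.sqrt(3.0)]
fs, zs = forward(Ws, rng.normal(size=4))
grads = backward(Ws, fs, zs, y=np.array([1.0]))
print([G.shape for G in grads])   # [(3, 4), (1, 3)]
```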

Practical tips

Apply the stochastic gradient method to examples in random order. Several examples can be used to form the gradient estimate (a mini-batch): at step $t$, average the gradients over a batch of $b$ examples,
$$\frac{1}{b} \sum_{i=(t-1)b+1}^{tb} \frac{\partial l_i}{\partial W} \bigg|_{W = W^{(t)}},$$
where $l_i$ is the loss on the $i$-th training example.
Standardize the inputs (i.e., center and divide by the standard deviation). Doing this at every layer during training is called batch normalization. (Must also apply the same/similar normalization at test time.)
Initialization: take care that the initial weights are neither too large nor too small. E.g., for a node with $d$ inputs, draw weights i.i.d. from $N(0, 1/d)$.
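A brief sketch of these tips; the batch size, shapes, and function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(fan_in, fan_out):
    # Draw each weight i.i.d. from N(0, 1/fan_in) for a node with fan_in inputs.
    return rng.normal(scale=1.0 / np.sqrt(fan_in), size=(fan_out, fan_in))

def standardize(X):
    # Center each input feature and divide by its standard deviation.
    return (X - X.mean(axis=0)) / X.std(axis=0)

def minibatches(n, b):
    # Visit the n examples in random order, b at a time; each yielded index
    # set is used to average the per-example gradients.
    order = rng.permutation(n)
    for start in range(0, n, b):
        yield order[start:start + b]
```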

Modern networks

Two kinds of linear layers:
dense / fully-connected layers, $W_i f_{i-1}(x)$ as before;
convolutional layers, which have a special sparse representation.
Common link functions:
ReLU: $z_i \mapsto \max\{0, z_i\}$ (applied element-wise);
softmax: $z_i \mapsto e^{z_i} / \sum_j e^{z_j}$ (normalizes across all coordinates);
max-pooling: $z \mapsto \max_i z_i$ (used after convolutional layers; $\mathbb{R}^d \to \mathbb{R}$).
Trend in architectures (in some domains, like vision and speech):
a few convolutional layers followed by many dense layers (AlexNet);
more convolutional layers (VGGNet);
nearly purely convolutional, with many (100+) layers and a variety of identity connections throughout (ResNet, DenseNet).
Intermediate computations (e.g., $f_i(x)$) can be used as a feature expansion! Indispensable in visual and audio tasks; application attempts are constant in all other disciplines.
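For reference, minimal NumPy versions of the link functions listed above (the max-subtraction in softmax is a standard numerical-stability choice, not something from the slides):

```python
import numpy as np

def relu(z):
    # element-wise: z_i -> max{0, z_i}
    return np.maximum(0.0, z)

def softmax(z):
    # z_i -> e^{z_i} / sum_j e^{z_j}
    e = np.exp(z - np.max(z))
    return e / e.sum()

def max_pool(z):
    # R^d -> R: z -> max_i z_i
    return np.max(z)

z = np.array([-1.0, 0.5, 2.0])
print(relu(z), softmax(z), max_pool(z))
```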

Wrap-up

Neural networks are at the frontier of experimental machine learning research, yet the basic ideas are the same as in the 1980s. What's new?
Fast hardware.
Tons of data.
Many-layered networks (made possible through many adjustments).
Applications: visual detection and recognition, speech recognition, general function fitting (e.g., learning reward functions for different actions in video games), etc.
In cutting-edge applications, training is still delicate.

Key takeaways

1. The structure of neural networks; the concept of link functions.
2. The motivation for multilayer neural networks and arguments for depth.
3. The forward and backward propagation algorithms.