
Neural networks COMS 4771

1. Logistic regression

Logistic regression

Suppose the feature space is $\mathbb{R}^d$ and the label space is $\{0, 1\}$. A logistic regression model is a statistical model in which the conditional probability function has a particular form:
$$Y \mid X = x \;\sim\; \mathrm{Bern}\big(\mathrm{logistic}(x^T w)\big), \quad x \in \mathbb{R}^d,$$
with
$$\mathrm{logistic}(z) \;:=\; \frac{1}{1 + e^{-z}} \;=\; \frac{e^{z}}{1 + e^{z}}, \quad z \in \mathbb{R}.$$
[Figure: plot of the logistic function for $z \in [-6, 6]$.]
Parameters: $w = (w_1, \dots, w_d) \in \mathbb{R}^d$.
Conditional probability function: $\eta_w(x) = \mathrm{logistic}(x^T w)$.
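As a minimal sketch (not from the slides), the conditional probability function $\eta_w$ can be computed in a few lines of NumPy; the function names and example values below are illustrative.

```python
import numpy as np

def logistic(z):
    # logistic(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def eta(w, x):
    # P(Y = 1 | X = x) under the model: logistic(x^T w)
    return logistic(x @ w)

# Illustrative values with d = 3 features.
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.2, 0.4])
print(eta(w, x))   # a probability in (0, 1)
```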

Logistic regression: network diagram

Network diagram for $\eta_w$: the inputs $x_1, \dots, x_d$ feed into a single output unit $v$ through weights $w_1, \dots, w_d$, with
$$v := g(z), \qquad z := \sum_{j=1}^{d} w_j x_j, \qquad (g = \mathrm{logistic}).$$
Here, $g$ is called the link function.

Learning w from data

Training data $((x_i, y_i))_{i=1}^{n}$ from $\mathbb{R}^d \times \{0, 1\}$.
Could use MLE to learn $w$ from the data.
Another option: squared-loss ERM (with link function $g$),
$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \big(g(x_i^T w) - y_i\big)^2.$$
Observe that for any $(X, Y) \sim P$ (not necessarily logistic regression),
$$\mathbb{E}\Big[\big(g(X^T w) - Y\big)^2 \,\Big|\, X = x\Big] = \big(g(x^T w) - \eta(x)\big)^2 + \operatorname{var}(Y \mid X = x),$$
where $\eta(x) = \mathbb{P}(Y = 1 \mid X = x)$.
Algorithm for squared-loss ERM with link function $g$?

Stochastic gradient method

$$\nabla_w \big\{ (g(x^T w) - y)^2 \big\} = 2\,\big(g(x^T w) - y\big)\, g'(x^T w)\, x.$$
Stochastic gradient method for squared-loss ERM with link function $g$:
1: Start with some initial $w^{(1)} \in \mathbb{R}^d$.
2: for $t = 1, 2, \dots$ until some stopping condition is satisfied do
3: Pick $(X^{(t)}, Y^{(t)})$ uniformly at random from $(x_1, y_1), \dots, (x_n, y_n)$.
4: Update: $w^{(t+1)} := w^{(t)} - 2\eta_t \big(g(\langle X^{(t)}, w^{(t)} \rangle) - Y^{(t)}\big)\, g'(\langle X^{(t)}, w^{(t)} \rangle)\, X^{(t)}$.
5: end for
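Below is a minimal sketch of this procedure, assuming the logistic link and a constant step size; the function names and hyperparameter values are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_grad(z):
    s = logistic(z)
    return s * (1.0 - s)   # g'(z) for the logistic link

def sgd_squared_loss(X, y, steps=10_000, step_size=0.1, seed=0):
    """Stochastic gradient method for squared-loss ERM with link g = logistic."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)                      # initial w^(1)
    for t in range(steps):
        i = rng.integers(n)              # pick (x_i, y_i) uniformly at random
        z = X[i] @ w
        # w^(t+1) := w^(t) - 2 eta_t (g(z) - y_i) g'(z) x_i
        w = w - step_size * 2.0 * (logistic(z) - y[i]) * logistic_grad(z) * X[i]
    return w
```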

Extensions

Other loss functions (e.g., the cross-entropy loss $y \ln \frac{1}{p} + (1 - y) \ln \frac{1}{1 - p}$).
Other link functions (e.g., $g(z) =$ some polynomial in $z$).
Is the overall objective function convex? Sometimes, but not always.
Nevertheless, the stochastic gradient method is still often effective at finding approximate local minima.

2. Multilayer neural networks

Two-output network

The inputs $x_1, \dots, x_d$ feed into two output units $v_1, v_2$:
$$v_j := g(z_j), \qquad z_j := \sum_{i=1}^{d} W_{i,j}\, x_i, \qquad j \in \{1, 2\}.$$

k-output network

The inputs $x_1, \dots, x_d$ feed into $k$ output units $v_1, \dots, v_k$:
$$v_j := g(z_j), \qquad z_j := \sum_{i=1}^{d} W_{i,j}\, x_i, \qquad j \in \{1, \dots, k\}.$$

A motivating example: multitask learning

$k$ binary prediction tasks over a single feature vector (e.g., predicting tags for images). Labeled examples are of the form $(x_i, (y_{i,1}, \dots, y_{i,k})) \in \mathbb{R}^d \times \{0, 1\}^k$.
Option 1: $k$ independent logistic regression models; learn $w_1, \dots, w_k$ by minimizing (e.g.)
$$\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} \big(g(x_i^T w_j) - y_{i,j}\big)^2.$$
Suppose the labels $y_{i,1}, \dots, y_{i,k}$ are not independent; e.g., if $y_{i,1} = 1$, then it is also more likely that $y_{i,2} = 1$.
Option 2: Do Option 1, but also learn to combine the predictions of $y_{i,1}, \dots, y_{i,k}$ to get better predictions for each $y_{i,j}$.

Multilayer neural network

The inputs $x_1, \dots, x_d$ feed through weight matrix $W^{(1)}$ into units $v^{(1)}_1, \dots, v^{(1)}_k$, which feed through $W^{(2)}$ into units $v^{(2)}_1, \dots, v^{(2)}_k$.
Columns of $W^{(1)} \in \mathbb{R}^{d \times k}$: parameters of the original logistic regression models.
Columns of $W^{(2)} \in \mathbb{R}^{k \times k}$: parameters of the new logistic regression models that combine the predictions of the original models.
Each node is called a unit. Non-input and non-output units are called hidden.

Compositional structure

Suppose we have two functions
$$f_{W_1} : \mathbb{R}^d \to \mathbb{R}^k \;\; (W_1 \in \mathbb{R}^{d \times k}), \qquad f_{W_2} : \mathbb{R}^k \to \mathbb{R}^l \;\; (W_2 \in \mathbb{R}^{k \times l}),$$
where $f_W(x) := g(W^T x)$, and $g$ applies the link function $g$ coordinate-wise to a vector.
Composition: $f_{W_1, W_2} := f_{W_2} \circ f_{W_1}$ is defined by
$$f_{W_1, W_2}(x) := f_{W_2}\big(f_{W_1}(x)\big).$$
This is a two-layer neural network (inputs $x_1, \dots, x_d$, hidden units $v^{(1)}_1, \dots, v^{(1)}_k$, outputs $v^{(2)}_1, \dots, v^{(2)}_l$).
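A short sketch of this composition in NumPy; the dimensions $d, k, l$, the random weights, and the logistic link are illustrative.

```python
import numpy as np

def g(z):
    # link function, applied coordinate-wise
    return 1.0 / (1.0 + np.exp(-z))

def f(W, x):
    # f_W(x) := g(W^T x)
    return g(W.T @ x)

d, k, l = 4, 3, 2
rng = np.random.default_rng(0)
W1 = rng.normal(size=(d, k))    # W_1 in R^{d x k}
W2 = rng.normal(size=(k, l))    # W_2 in R^{k x l}
x = rng.normal(size=d)

out = f(W2, f(W1, x))           # f_{W_1, W_2}(x) = f_{W_2}(f_{W_1}(x))
print(out.shape)                # (l,) = (2,)
```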

Necessity of multiple layers

A one-layer neural network with a monotonic link function is a linear (or affine) classifier, and so it cannot represent the XOR function (Minsky and Papert, 1969).
[Figure from Stuart Russell: the points of $\{0,1\}^2$ labeled by (a) $x_1$ AND $x_2$, (b) $x_1$ OR $x_2$, (c) $x_1$ XOR $x_2$; only the first two are linearly separable.]
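To make the contrast concrete, here is a small sketch (not from the slides) of a two-layer network that does compute XOR, assuming hard-threshold units and a constant bias term added to each unit's weighted sum.

```python
def step(z):
    # hard-threshold unit; a steep logistic behaves essentially the same way
    return 1.0 if z > 0 else 0.0

def xor_net(x1, x2):
    h_or  = step(x1 + x2 - 0.5)      # hidden unit: x1 OR x2
    h_and = step(x1 + x2 - 1.5)      # hidden unit: x1 AND x2
    return step(h_or - h_and - 0.5)  # output: (x1 OR x2) AND NOT (x1 AND x2) = XOR

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, int(xor_net(x1, x2)))   # prints labels 0, 1, 1, 0
```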

Approximation power of multilayer neural networks

Theorem (Cybenko, 1989; Hornik, 1991; Barron, 1993). Any continuous function $f$ can be approximated arbitrarily well by a two-layer neural network $f \approx f_{W_2} \circ f_{W_1}$, with $f_{W_1} : \mathbb{R}^d \to \mathbb{R}^k$ and $f_{W_2} : \mathbb{R}^k \to \mathbb{R}$.
However: this may require a very large number of hidden units.
Theorem (Telgarsky, 2015; Eldan and Shamir, 2015). Some functions can be approximated with exponentially fewer hidden units by using more than two layers.
Note: none of this speaks directly to learning neural networks from data.

3. Computation and learning with neural networks

General structure of a neural network

Neural network for $f : \mathbb{R}^d \to \mathbb{R}$. (Easy to generalize to $f : \mathbb{R}^d \to \mathbb{R}^k$.)
Directed acyclic graph $G = (V, E)$; the vertices are regarded as formal variables.
There are $d$ source vertices, one per input variable, called $x_1, \dots, x_d$, and a single sink vertex, called $\hat{y}$.
Each edge $(u, v) \in E$ has a weight $w_{u,v} \in \mathbb{R}$.
The value of a vertex $v$, given the values of its parents $\pi_G(v) := \{u \in V : (u, v) \in E\}$, is
$$v := g(z_v), \qquad z_v := \sum_{u \in \pi_G(v)} w_{u,v}\, u,$$
where $g$ is the link function (e.g., the logistic function).
[Diagram: a DAG with inputs $x_1, x_2, \dots, x_d$ at the bottom, an internal edge $(u, v)$ with weight $w_{u,v}$, and the output $\hat{y}$ at the top.]

Organizing and evaluating a neural network

The vertices of $V$ are partitioned into layers $V_0, V_1, \dots$:
$V_0 := \{x_1, \dots, x_d\}$, just the input variables;
put $v$ in $V_l$ if the longest path in $G$ from some $x_i$ to $v$ has $l$ edges;
the final layer consists only of the sink vertex $\hat{y}$.
Given input values $x = (x_1, \dots, x_d) \in \mathbb{R}^d$, how do we compute $f(x)$?
1. Compute the values of all vertices in $V_1$, given the values of the vertices in $V_0$ (i.e., the input variables):
$$v := g(z_v), \qquad z_v := \sum_{u \in \pi_G(v)} w_{u,v}\, u.$$
(All parents of $v \in V_1$ are in $V_0$.)
2. Compute the values of all vertices in $V_2$, given the values of the vertices in $V_0 \cup V_1$. (All parents of $v \in V_2$ are in $V_0 \cup V_1$.)
3. Etc., until $\hat{y} = f(x)$ is computed.
This is called forward propagation.
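A minimal sketch of forward propagation over a layered graph; the dictionary-based representation of vertices, parents, and edge weights is a hypothetical choice made for this example, not something specified by the slides.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(layers, parents, weights, x, g=logistic):
    """layers[0] lists the input vertices; parents[v] = pi_G(v); weights[(u, v)] = w_{u,v}.
    Returns the value and pre-activation z_v of every vertex on input x (a dict)."""
    value, z = dict(x), {}
    for layer in layers[1:]:          # process V_1, V_2, ... in order
        for v in layer:
            z[v] = sum(weights[(u, v)] * value[u] for u in parents[v])
            value[v] = g(z[v])        # v := g(z_v)
    return value, z

# Tiny illustrative graph: x1, x2 -> hidden unit h -> output yhat.
layers  = [["x1", "x2"], ["h"], ["yhat"]]
parents = {"h": ["x1", "x2"], "yhat": ["h"]}
weights = {("x1", "h"): 1.0, ("x2", "h"): -2.0, ("h", "yhat"): 0.5}
value, z = forward(layers, parents, weights, {"x1": 1.0, "x2": 0.5})
print(value["yhat"])
```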

Training a neural network

How do we fit a neural network to data? Use the stochastic gradient method!
The basic computational problem: compute the partial derivative of the loss on a labeled example with respect to a parameter.
Mid-to-late 20th-century researchers discovered how to use the chain rule to organize this gradient computation: the backpropagation algorithm.
Given:
a labeled example $(x, y) \in \mathbb{R}^d \times \mathbb{R}$;
the current weights $w_{u,v} \in \mathbb{R}$ for all $(u, v) \in E$;
the values $v$ and $z_v$ for all non-source $v \in V$ on input $x$ (can first run forward propagation to get the $v$'s and $z_v$'s).
Let $l$ denote the loss of the prediction $\hat{y} = f(x)$ (e.g., $l := (\hat{y} - y)^2$).
Goal: compute $\frac{\partial l}{\partial w_{u,v}}$ for all $(u, v) \in E$.

Backpropagation: exploiting the chain rule

Strategy: use the chain rule,
$$\frac{\partial l}{\partial w_{u,v}} = \frac{\partial l}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial v} \cdot \frac{\partial v}{\partial w_{u,v}}.$$
For the squared loss $l = (\hat{y} - y)^2$, $\frac{\partial l}{\partial \hat{y}} = 2(\hat{y} - y)$; this is easy to compute for other losses as well. ($\hat{y}$ is computed in forward propagation.)
Since $v = g(z_v)$ where $z_v = w_{u,v}\, u + (\text{terms not involving } w_{u,v})$,
$$\frac{\partial v}{\partial w_{u,v}} = \frac{\partial v}{\partial z_v} \cdot \frac{\partial z_v}{\partial w_{u,v}} = g'(z_v)\, u.$$
($z_v$ and $u$ are computed in forward propagation.)

Backpropagation: the recursive part

Key trick: compute $\frac{\partial \hat{y}}{\partial v}$ for all $v \in V_l$, in decreasing order of the layer index $l$.
Strategy: for $v \neq \hat{y}$, use the multivariate chain rule. Let $k = \text{out-deg}(v)$, with outgoing edges $(v, v_1), \dots, (v, v_k) \in E$:
$$\frac{\partial \hat{y}}{\partial v} = \sum_{i=1}^{k} \frac{\partial \hat{y}}{\partial v_i} \cdot \frac{\partial v_i}{\partial v}.$$
Since $v_i = g(z_{v_i})$ where $z_{v_i} = w_{v,v_i}\, v + (\text{terms not involving } v)$,
$$\frac{\partial v_i}{\partial v} = g'(z_{v_i})\, w_{v,v_i}.$$
(The $z_{v_i}$'s are computed in forward propagation.)
Since the $v_i$ are in a higher layer than $v$, each $\frac{\partial \hat{y}}{\partial v_i}$ has already been computed!
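Continuing the (hypothetical) graph representation from the forward-propagation sketch above, a minimal backward pass computes $\partial \hat{y} / \partial v$ in decreasing layer order and then the weight gradients for the squared loss.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_grad(z):
    s = logistic(z)
    return s * (1.0 - s)

def backward(layers, weights, value, z, y, g_prime=logistic_grad):
    """Gradients dl/dw_{u,v} of the squared loss l = (yhat - y)^2, given the
    forward-pass values `value` and pre-activations `z`."""
    yhat = value["yhat"]
    children = {}                            # children[v] = [v_1, ..., v_k]
    for (u, v) in weights:
        children.setdefault(u, []).append(v)

    dy = {"yhat": 1.0}                       # d yhat / d yhat = 1
    for layer in reversed(layers[:-1]):      # decreasing order of layer index
        for v in layer:
            # multivariate chain rule over the outgoing edges (v, v_i)
            dy[v] = sum(dy[vi] * g_prime(z[vi]) * weights[(v, vi)]
                        for vi in children.get(v, []))

    # dl/dw_{u,v} = (dl/dyhat) * (dyhat/dv) * g'(z_v) * u
    return {(u, v): 2.0 * (yhat - y) * dy[v] * g_prime(z[v]) * value[u]
            for (u, v) in weights}

# Usage, reusing layers/weights/value/z from the forward-propagation sketch:
# grads = backward(layers, weights, value, z, y=1.0)
```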

Matrix view of forward/backward propagation

Recall that a general neural network function $f : \mathbb{R}^d \to \mathbb{R}$ can be written as
$$f(x) = g_L\big(W_L \cdots g_2(W_2\, g_1(W_1 x)) \cdots \big),$$
where $W_i \in \mathbb{R}^{d_i \times d_{i-1}}$ is the weight matrix for layer $i$, and $g_i : \mathbb{R}^{d_i} \to \mathbb{R}^{d_i}$ collects the non-linear link functions for layer $i$.
The first layer thus computes $f_1(x) := g_1(z_1)$ where $z_1 := W_1 x$, and the $i$-th layer computes $f_i(x) := g_i(z_i)$ where $z_i := W_i f_{i-1}(x)$. (Convention: $f_0(x) := x$.)
The previous backprop equations prove the correctness of the following matrix derivative formula:
$$\nabla_{W_i} l = \Big[ g_i'(z_i) \odot \Big( W_{i+1}^T \cdots g_{L-1}'(z_{L-1}) \odot \big( W_L^T\, g_L'(z_L)\, l'(f_L(x)) \big) \cdots \Big) \Big] f_{i-1}(x)^T,$$
where $\odot$ denotes the element-wise product.
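A NumPy sketch of the matrix view, again for the squared loss and a logistic link at every layer (both are illustrative); the backward pass applies the nested formula via the usual recursion over layers.

```python
import numpy as np

def g(z):                      # logistic link, coordinate-wise
    return 1.0 / (1.0 + np.exp(-z))

def g_prime(z):
    s = g(z)
    return s * (1.0 - s)

def forward(Ws, x):
    """f_i(x) := g(z_i), z_i := W_i f_{i-1}(x); returns all f_i and z_i."""
    fs, zs = [x], []
    for W in Ws:
        zs.append(W @ fs[-1])
        fs.append(g(zs[-1]))
    return fs, zs

def backward(Ws, fs, zs, y):
    """Gradients of l = (f_L(x) - y)^2 with respect to each W_i."""
    L = len(Ws)
    delta = g_prime(zs[-1]) * 2.0 * (fs[-1] - y)   # dl/dz at the last layer
    grads = [None] * L
    for i in range(L - 1, -1, -1):
        grads[i] = np.outer(delta, fs[i])          # delta times f_{i-1}(x)^T
        if i > 0:
            delta = g_prime(zs[i - 1]) * (Ws[i].T @ delta)   # element-wise product
    return grads

# Illustrative sizes: 4 inputs, one hidden layer of 3 units, scalar output.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 4)) / 2.0, rng.normal(size=(1, 3)) / np.sqrt(3.0)]
fs, zs = forward(Ws, rng.normal(size=4))
grads = backward(Ws, fs, zs, y=np.array([1.0]))
print([G.shape for G in grads])   # [(3, 4), (1, 3)]
```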

Practical tips

Apply the stochastic gradient method to examples in random order. Several examples can be used to form the gradient estimate (a mini-batch): at step $t$, average the gradients over a batch of $b$ examples,
$$\frac{1}{b} \sum_{i=(t-1)b+1}^{tb} \frac{\partial l_i}{\partial W} \bigg|_{W = W^{(t)}},$$
where $l_i$ is the loss on the $i$-th training example.
Standardize the inputs (i.e., center and divide by the standard deviation). Doing this at every layer during training is called batch normalization. (Must also apply the same/similar normalization at test time.)
Initialization: take care that the initial weights are neither too large nor too small. E.g., for a node with $d$ inputs, draw weights i.i.d. from $N(0, 1/d)$.
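A brief sketch of these tips; the batch size, shapes, and function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(fan_in, fan_out):
    # Draw each weight i.i.d. from N(0, 1/fan_in) for a node with fan_in inputs.
    return rng.normal(scale=1.0 / np.sqrt(fan_in), size=(fan_out, fan_in))

def standardize(X):
    # Center each input feature and divide by its standard deviation.
    return (X - X.mean(axis=0)) / X.std(axis=0)

def minibatches(n, b):
    # Visit the n examples in random order, b at a time; each yielded index
    # set is used to average the per-example gradients.
    order = rng.permutation(n)
    for start in range(0, n, b):
        yield order[start:start + b]
```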

Modern networks

Two kinds of linear layers:
dense / fully-connected layers, $W_i f_{i-1}(x)$ as before;
convolutional layers, which have a special sparse representation.
Common link functions:
ReLU: $z_i \mapsto \max\{0, z_i\}$ (applied element-wise);
softmax: $z_i \mapsto e^{z_i} / \sum_j e^{z_j}$ (normalizes across all coordinates);
max-pooling: $z \mapsto \max_i z_i$ (used after convolutional layers; $\mathbb{R}^d \to \mathbb{R}$).
Trend in architectures (in some domains, like vision and speech):
a few convolutional layers followed by many dense layers (AlexNet);
more convolutional layers (VGGNet);
nearly purely convolutional, with many (100+) layers and a variety of identity connections throughout (ResNet, DenseNet).
Intermediate computations (e.g., $f_i(x)$) can be used as a feature expansion! Indispensable in visual and audio tasks; application attempts are constant in all other disciplines.
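For reference, minimal NumPy versions of the link functions listed above (the max-subtraction in softmax is a standard numerical-stability choice, not something from the slides):

```python
import numpy as np

def relu(z):
    # element-wise: z_i -> max{0, z_i}
    return np.maximum(0.0, z)

def softmax(z):
    # z_i -> e^{z_i} / sum_j e^{z_j}
    e = np.exp(z - np.max(z))
    return e / e.sum()

def max_pool(z):
    # R^d -> R: z -> max_i z_i
    return np.max(z)

z = np.array([-1.0, 0.5, 2.0])
print(relu(z), softmax(z), max_pool(z))
```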

Wrap-up

Neural networks are at the frontier of experimental machine learning research, yet the basic ideas are the same as in the 1980s. What's new?
Fast hardware.
Tons of data.
Many-layered networks (made possible through many adjustments).
Applications: visual detection and recognition, speech recognition, general function fitting (e.g., learning reward functions for different actions in video games), etc.
In cutting-edge applications, training is still delicate.

Key takeaways

1. The structure of neural networks; the concept of link functions.
2. The motivation for multilayer neural networks and arguments for depth.
3. The forward and backward propagation algorithms.