CS 4850/6850: Introduction to Machine Learning Fall 2018 Topic 3: Neural Networks Instructor: Daniel L. Pimentel-Alarcón c Copyright 2018 3.1 Introduction Neural networks are arguably the main reason why Machine Learning is making a big splash. Similar to logistic regression and random forests, their main purpose is prediction/classification. Classical applications of neural networks include: 1. Image classification. For example, determining the digit contained in an image, as in Figure 3.1. 2. Natural language processing. For example, interpreting voice commands, like Siri and Alexa do. 3. Stock market prediction. Each of these problems can be rephrased mathematically as finding an unknown (and highly complex) function f, such that given certain data x, f (x) gives the desired classification/prediction. In our examples: 1. Image classification. In our example, x is a vectorized image of a digit, and f (x) {0,..., 9} is the digit depicted in such figure. For the images x 1, x 2,..., x 5 in Figure 3.1, f (x 1 ) = 5, f (x 2 ) = 0,..., f (x 5 ) = 9. 2. Natural language processing. In our example, x is a vector containing the values in a voice signal, as in Figure 3.2, and f (x) is the interpretation, in this case turn off the t.v.. 3. Stock market prediction. Here x could be a sequence of stock market prices at times t = 1,..., T, and f (x) could be the prediction of such stock market price at time T + 1. All of these tasks can be tremendously challenging. For example, whether an image contains a 0, or a 1, or any other digit, depends not only on the values of isolated pixels, but on the way that pixels interact with one an other in complex manners. The main idea behind ANNs is to use a sequence of simpler functions that interact with one another in a networked way, so that combined, they approximate f with arbitrary precision. Figure 3.1: Digit images from the MNIST dataset. The goal is to automatically determine the digit contained in each image, in this case {5, 0, 4, 1, 9}. 3-1
Topic 3: Neural Networks 3-2 Figure 3.2: Voice signal corresponding to the command turn off the t.v.. 3.2 Intuition To build some intuition, suppose we want to estimate the following function f : Figure 3.3: Function f (x) that we want to estimate. One option is to use a Riemann-type approximation ˆf = L l=1 g l where g l are step functions of the form: { wl if β(l 1) x < βl g l (x) = 0 otherwise. Here w l indicates the height of f in the l th interval [β(l 1), βl). This type of approximation would look something like:
Topic 3: Neural Networks 3-3 Notice that the parameters w 1,..., w l, together with the interval width β determine ˆf. Unfortunately, estimating this type of function would require to observe at least one sample x on each interval [β(l 1), βl). For the figure above, this would require at least 21 samples. This may not sound like much. However, as we move to higher dimensions, this number scales exponentially, and quickly becomes prohibitively large: if f is a multivariate function of x R d (instead of a single variable function of x R), then a similar approximation would require 21 d samples. With d as little as 10, this would mean more than 1 trillion samples. As we will see, in most of the big data applications that we are interest, d is often in the order of hundreds, and often thousands or millions, whence this approach is completely infeasible. Another option is to allow varying size intervals, such that g l (x) = { wl if β l 1 x < β l 0 otherwise. In this case, the parameters are the weights w 1,..., w L and the interval boundaries β 0,..., β L. This type of approximation would look more like: and will potentially require far fewer samples. The downside is that now ˆf may be quite inaccurate (on top of being discontinuous). To address this, we can make each g l a simple nonlinear yet continuous function. For example, a sigmoid: which looks as follows: σ(x) = 1, (3.1) 1 + e x Notice that σ(wx β) shifts the function by β, and squeezes the function by w. So if we let g(x) = σ(wx β), and we choose the right parameters w and β, and add a bunch of these functions, then we can approximate
Topic 3: Neural Networks 3-4 f much more accurately. Moreover, if f is a function of more than one variable, say of a vector x R d, then we can generalize g in a very natural way: g(x) = σ(w T x β), (3.2) where now w R d is a parameter vector. Just as the step function is the building block of the Riemann-type approximation above, this type of function g is precisely the building block (often called neuron) of a neural network, usually depicted as follows: The main insight of a neural network is that by adding a bunch of these building blocks, one can approximate f with arbitrary accuracy. For example, with two neurons we would obtain the following: which we can depict as follows: ˆf(x) = g 1 (x) + g 2 (x), There is no reason why we cannot include weights to each g, nor a final shift, to obtain: [ ] [ ] w1 w ˆf(x) = w 1 g 1 (x) + w 2 g 2 (x) β = 2 g1 (x) β, =: w T g(x) β, g 2 (x) where w 1, w 2, and β are new parameters. This function would be represented like: There is no reason to stop there. We can add more neurons, and more layers to obtain more powerful networks, capable of approximating more complex functions. Let L be the number of layers, and let n l denote the number of neurons in the l th layer. For l = 2, 3,..., L, let W l R n l n l 1 be the matrix formed by transposing and stacking the weight vectors of the n l neurons in the l th layer, and let β l R n l be the vector containing the shifting coefficients (often called bias coefficients) of the neurons in the l th layer, such that the output at the second layer is given by: g 2 (x) = σ(w 2 x β 2 ),
Topic 3: Neural Networks 3-5 Figure 3.4: Components of a neural network. and for l = 3,..., L, the output at the l th layer is given by: g l (x) = σ(w l g l 1 (x) β l ), where g L (x) is the final output of the network, also denoted as: ˆf(x) := σ(w L σ(w L 1 σ(w 3 σ(w 2 x β 2 ) β 3 ) β L 1 ) β L ). (3.3) Notice that ˆf(x) R nl may be a vector (if n L > 1), as opposed to a scalar (if n L = 1), and so neural networks allow to infer vector functions. See Figure 3.4 to build some intuition. Going back to estimating the function in Figure 3.3, we can construct the following network: Depending on the parameters {W l, β l } L l=2 that we choose, may produce functions like the following:
Topic 3: Neural Networks 3-6 An approximation of the function in Figure 3.3 with a neural network like this would look like: However, this will require to identify the right parameters {W l, β l } L l=2 that produce this approximation.
Topic 3: Neural Networks 3-7 3.3 Identifying the Right Parameters As discussed before, with the right parameters {W l, β l } L l=2, the function ˆf(x) in (3.3) can approximate any function f (x) with arbitrary accuracy. The challenge is to find the right parameters {W l, β l } L l=2. Hence, our goal can be summarized as follows: find the parameters {W l, β l } L l=2 of the function ˆf in (3.3), such that ˆf(x) f (x). Recall that we do not know f. However, we do have training data. More precisely, we have N samples x 1, x 2,..., x N of N, as well as their response y 1, y 2,..., y N. For example, if we were studying diabetes, x i could contain demographic information about the i th person in our study, and y i could be a binary variable indicating whether this person is diabetic or not. In other words, we know that f (x i ) = y i for every i = 1, 2,..., N. Intuitively, this means that we get to see N snapshots of f at points x i. Our goal is to exploit this information to find the parameters {W l, β l } L l=2 such that ˆf f, such that the network can reproduce the response y whenever a new vector x is fed to the network. This can be done by minimizing the error (over all training data) between the network s prediction ˆf(x i ) and its corresponding observation y i. Mathematically, we can achieve this by solving the following optimization min {W l,β l } L l=2 N y i ˆf(x i ) 2. (3.4) i=1 The most widely used technique to solve (3.4) is through stochastic gradient descent and back-propagation. The key insight is to observe that the gradient of each weight matrix can be computed backwards in terms of the gradients of the weight matrices of subsequent layers. To see this, first define z 1 i = y 1 i = x i, and then for l = 2, 3,..., L, recursively define z l i := W l y l 1 i β l, where y l i := g l (x i ) = σ(z l i ) denotes the output at the l th layer, so that ˆf(x i ) = y L i. Define the cost of the ith sample as c i := y i ˆf(x i ), and δ L i := c i σ (z L i ), δ l i := [ (W l+1 ) T δ l+1 i ] σ (z l i ), 2 l L 1, where represents the Hadamard product, and σ represents the derivative of σ. Then with a simple chain rule we obtain the following gradients: i W l := c i 2 W l = 2 δ l i (y l 1 i ) T, i β l := c i 2 β l = 2 δ l i, 2 l L. (3.5) Training the neural network is equivalent to identifying the parameters {W l, β l } L l=2 such that ˆf f. This can be done using stochastic gradient descent, which iteratively updates the parameters of interest according to (3.5). More precisely, at each training time t we select a random subsample Ω t {1, 2,..., N} of the training data (hence the term stochastic), and we update our parameters as follows W l t = W l t 1 η i Ω t i W l t 1, β l t = β l t 1 η i Ω t i β l t 1, where η is the gradient step parameter to be tuned. This iteration is repeated until convergence to obtain the final parameters {Ŵl, β l } L l=2.
Topic 3: Neural Networks 3-8 3.4 Neural Networks Flavors At this point you may be wondering whether g l (x) = σ(w l g l 1 (x) β) is the only option for g, where σ is the sigmoid function in (3.1). The answer is: no. For instance, if you use then you obtain a linear layer/network. If you use σ(x) = x, σ(x) = max(0, x), then you obtain the so-called rectified linear unit (ReLU) layer/network. function. Similarly, if instead of g l (x) = σ(w l g l 1 (x) β) you use σ is usually called activation g l (x) = σ(w l g l 1 (x) β), where denotes the convolution operator, then you obtain a convolutional layer/network. The topology (shape) of the network is defined by the number of layers (L) and the number of neurons in each layer (n 1, n 2,..., n L ). Networks with large L are often called deep, as opposed to shallow networks that have small L. Depending on the problem at hand, one choice of L, n 1,..., n L, σ, and g (or combinations) may be better than another. For example, for image classification, people usually like to mix cascades of one linear layer, followed by a ReLU layer, followed by a convolutional layer, with large n l in each case, and sometimes with fixed (predetermined) W s in the convolutional layers. The possibilities are endless, and deciding on the best choice of topology remains more of an art than science. 3.5 A Word of Warning We have mentioned before that using neural networks we can approximate any function f with arbitrary accuracy. Put another way, for every f, and every ɛ > 0 there exists a function ˆf θ with parameters θ := {W l, β l } L l=2 such that, f (x) ˆf θ (x) 2 < ɛ for every x in the domain of f. (3.6) The challenge is to find the parameter θ for which (3.6) is true, which we aim to do using gradient descent to minimize the following cost function: c(θ) := N f (x i ) ˆf θ (x i ) 2. i=1 The wrinkle is that the cost function that we are trying to minimize may be non-convex, which implies that we may never find the right parameter θ. In other words, for every f there will always exist a neural network (parametrized by θ) that approximates f arbitrarily well. However, we may be unable to find such network.