Understanding Neural Networks : Part I


1 TensorFlow Workshop 2018
Understanding Neural Networks, Part I: Artificial Neurons and Network Optimization
Nick Winovich, Department of Mathematics, Purdue University
July 2018

2 Outline
1. Neural Networks
   - Artificial Neurons and Hidden Layers
   - Universal Approximation Theorem
   - Regularization and Batch Norm
2. Network Optimization
   - Evaluating Network Performance
   - Stochastic Gradient Descent Algorithms
   - Backprop and Automatic Differentiation


5 Artificial Neural Networks
Neural networks are a class of simple, yet effective, computing systems with a diverse range of applications. In these systems, small computational units, or nodes, are arranged to form networks in which connectivity is leveraged to carry out complex calculations.
References: Deep Learning by Goodfellow, Bengio, and Courville; Convolutional Neural Networks for Visual Recognition at Stanford.

6 Artificial Neurons Diagram modified from Stack Exchange post answered by Gonzalo Medina. Weights are first used to scale inputs; the results are summed with a bias term and passed through an activation function.

7 Formula and Vector Representation
The diagram from the previous slide can be interpreted as:
$$y = f(x_1 w_1 + x_2 w_2 + x_3 w_3 + b)$$
which can be conveniently represented in vector form via:
$$y = f(w^T x + b)$$
by interpreting the neuron inputs and weights as column vectors.
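A minimal sketch of the single-neuron formula in plain Python. The function name `neuron`, the choice of `tanh` as the activation, and the numeric values are assumptions made for illustration only.

```python
import math

def neuron(x, w, b, f=math.tanh):
    """Compute y = f(w^T x + b) for one artificial neuron."""
    # Scale each input by its weight, sum the results, and add the bias.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Pass the weighted sum through the activation function.
    return f(z)

# Three inputs, as in the formula above; the numbers are arbitrary.
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.25]
b = 0.75
y = neuron(x, w, b)
```

With these particular values the weighted sum plus bias is exactly zero, so the output is `tanh(0)`.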

8 Artificial Neurons: Multiple Outputs

9 Matrix Representation
This corresponds to a pair of equations, one for each output:
$$y_1 = f(w_1^T x + b_1), \qquad y_2 = f(w_2^T x + b_2)$$
which can be represented in matrix form by the system:
$$y = f(W x + b)$$
where we assume the activation function has been vectorized (i.e. applied componentwise).

10 Fully-Connected Neural Layers The resulting layers, referred to as fully-connected or dense, can be visualized as a collection of nodes connected by edges corresponding to weights (bias/activations are typically omitted)
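A fully-connected layer with N inputs and M outputs can be sketched as one row of weights per output. The helper names `relu` and `dense` and the sample values are hypothetical; a practical implementation would use a matrix library rather than nested Python lists.

```python
def relu(z):
    return max(0.0, z)

def dense(x, W, b, f=relu):
    """Compute y = f(W x + b); W is a list of M rows, each of length N."""
    return [f(sum(wij * xj for wij, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

x = [1.0, 2.0, 3.0]          # N = 3 inputs
W = [[1.0, 0.0, -1.0],       # weights w_1 for output y_1
     [0.5, 0.5, 0.5]]        # weights w_2 for output y_2
b = [0.0, 1.0]
y = dense(x, W, b)           # M = 2 outputs
```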

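11 Floating Point Operation Count
Matrix-Vector Multiplication:
$$\begin{bmatrix} w_{11} & \cdots & w_{1N} \\ \vdots & \ddots & \vdots \\ w_{M1} & \cdots & w_{MN} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_N \end{bmatrix}$$
Mult: $MN$ (one product $w_{ij} x_j$ for each matrix entry)
$$\begin{matrix} w_{11} x_1 & \cdots & w_{1N} x_N \\ \vdots & & \vdots \\ w_{M1} x_1 & \cdots & w_{MN} x_N \end{matrix}$$
Add: $M(N-1)$ (summing the $N$ products in each of the $M$ rows)
$$\begin{matrix} w_{11} x_1 + \cdots + w_{1N} x_N \\ \vdots \\ w_{M1} x_1 + \cdots + w_{MN} x_N \end{matrix}$$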

12 Floating Point Operation Count
So we see that when bias terms are omitted, the number of FLOPs required for a neural connection between $N$ inputs and $M$ outputs is:
$$2MN - M = MN \text{ multiplies} + M(N-1) \text{ adds}$$
When bias terms are included, an additional $M$ addition operations are required, resulting in a total of $2MN$ FLOPs.
Note: This omits the computation required for applying the activation function to the $M$ values resulting from the linear operations. Depending on the activation function selected, this may or may not have a significant impact on the overall computational complexity.
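The count above can be written as a small helper. The function name `dense_flops` and the layer sizes (784 inputs, 128 outputs) are illustrative assumptions, not values from the slides.

```python
def dense_flops(n_inputs, n_outputs, bias=True):
    """FLOPs for y = W x (+ b), excluding the activation function."""
    mults = n_outputs * n_inputs         # one product per weight: MN
    adds = n_outputs * (n_inputs - 1)    # summing N products per output: M(N - 1)
    if bias:
        adds += n_outputs                # one extra addition per output: M
    return mults + adds

flops_no_bias = dense_flops(784, 128, bias=False)    # 2MN - M
flops_with_bias = dense_flops(784, 128, bias=True)   # 2MN
```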

13 Activation Functions
Activation functions are a fundamental component of neural network architectures; these functions are responsible for:
- Providing all of the network's non-linear modeling capacity
- Controlling the gradient flows that guide the training process
While activation functions play a fundamental role in all neural networks, it is still desirable to limit their computational demands (e.g. avoid defining them in terms of a Krylov subspace method...). In practice, activations such as rectified linear units (ReLUs), which have the most trivial function and derivative definitions, often suffice.

14 Activation Functions
Rectified Linear Unit (ReLU):
$$f(x) = \begin{cases} x & x \ge 0 \\ 0 & x < 0 \end{cases}$$
SoftPlus Activation:
$$f(x) = \ln\big(1 + \exp(x)\big)$$

15 Activation Functions
Sigmoidal Unit:
$$f(x) = \frac{1}{1 + \exp(-x)}$$
Hyperbolic Tangent Unit:
$$f(x) = \tanh(x)$$

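16 Activation Functions (Parameterized)
Exponential Linear Unit (ELU):
$$f_\alpha(x) = \begin{cases} x & x \ge 0 \\ \alpha\,(e^x - 1) & x < 0 \end{cases}$$
Leaky Rectified Linear Unit:
$$f_\alpha(x) = \begin{cases} x & x \ge 0 \\ \alpha x & x < 0 \end{cases}$$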

17 Activation Functions (Learnable Parameters)
Parameterized ReLU:
$$f_\beta(x) = \begin{cases} x & x \ge 0 \\ \beta x & x < 0 \end{cases}$$
Swish Units:
$$f_\beta(x) = \frac{x}{1 + \exp(-\beta x)}$$
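The activation functions from the preceding slides can be sketched as scalar Python functions. The default parameter values are illustrative assumptions, and the parameterized ReLU is written in its usual form, with the learnable slope `beta` applied to negative inputs. These versions are not hardened against overflow for inputs of very large magnitude.

```python
import math

def relu(x):
    return x if x >= 0 else 0.0

def softplus(x):
    return math.log1p(math.exp(x))      # ln(1 + exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def elu(x, alpha=1.0):
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)

def leaky_relu(x, alpha=0.01):
    return x if x >= 0 else alpha * x

def prelu(x, beta):
    # Same form as leaky ReLU, but beta is learned during training.
    return x if x >= 0 else beta * x

def swish(x, beta=1.0):
    return x / (1.0 + math.exp(-beta * x))
```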

18 Hidden Layers
Intermediate, or hidden, layers can be added between the input and output nodes to allow for additional non-linear processing. For example, we can first define a layer such as:
$$h = f_1(W_1 x + b_1)$$
and construct a subsequent layer to produce the final output:
$$y = f_2(W_2 h + b_2)$$

19 Hidden Layers

20 Multiple Hidden Layers

21 Multiple Hidden Layers
Multiple hidden layers can easily be defined in the same way:
$$h_1 = f_1(W_1 x + b_1)$$
$$h_2 = f_2(W_2 h_1 + b_2)$$
$$y = f_3(W_3 h_2 + b_3)$$
One of the challenges of working with additional layers is the need to determine the impact that earlier layers have on the final output. This will be necessary for tuning/optimizing network parameters (i.e. weights and biases) to produce accurate predictions.
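The layered definition amounts to repeatedly applying the same dense operation. The sketch below assumes ReLU for the hidden layers and an identity output activation; the layer sizes and weights are arbitrary values chosen for illustration.

```python
def dense(x, W, b, f):
    """One fully-connected layer: f(W x + b)."""
    return [f(sum(wij * xj for wij, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def relu(z):
    return max(0.0, z)

def identity(z):
    return z

def forward(x, layers):
    """Apply a list of (W, b, f) layers in order and return the output."""
    h = x
    for W, b, f in layers:
        h = dense(h, W, b, f)
    return h

layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]], [0.0, 0.0, 0.0], relu),  # h1: 2 -> 3
    ([[1.0, 1.0, 1.0], [1.0, 0.0, -1.0]], [0.0, 0.5], relu),          # h2: 3 -> 2
    ([[2.0, -1.0]], [0.25], identity),                                # y:  2 -> 1
]
y = forward([2.0, 1.0], layers)
```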

23 Universal Approximators: Cybenko (1989)
Cybenko, G., Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4).
Basic Idea of Result: Let $I_n$ denote the unit hypercube in $\mathbb{R}^n$; the collection of functions which can be expressed in the form:
$$\sum_{i=1}^{N} \alpha_i \, \sigma(w_i^T x + b_i), \qquad x \in I_n$$
is dense in the space of continuous functions $C(I_n)$ defined on $I_n$: i.e. $\forall f \in C(I_n)$, $\varepsilon > 0$ there exist constants $N, \alpha_i, w_i, b_i$ such that
$$\left| f(x) - \sum_{i=1}^{N} \alpha_i \, \sigma(w_i^T x + b_i) \right| < \varepsilon \qquad \forall x \in I_n$$

24 Universal Approximators: Hornik et al. / Funahashi
Hornik, K., Stinchcombe, M. and White, H., Multilayer feedforward networks are universal approximators. Neural Networks, 2(5).
Funahashi, K.I., On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3).
Summary of Results: For any compact set $K \subset \mathbb{R}^n$, multi-layer feedforward neural networks are dense in the space of continuous functions $C(K)$ on $K$, with respect to the supremum norm, provided that the activation function used for the network layers is:
- Continuous and increasing
- Non-constant and bounded

25 Universal Approximators: Leshno et al. (1992)
Leshno, M., Lin, V.Y., Pinkus, A. and Schocken, S., Multilayer feedforward networks with a non-polynomial activation function can approximate any function.
"A standard multilayer feedforward network with a locally bounded piecewise continuous activation function can approximate any continuous function to any degree of accuracy if and only if the network's activation function is not a polynomial." (Leshno et al.)
- Here the notion of approximation is also defined in terms of the supremum norm, and the domains are assumed to be compact
- The result does not hold without thresholds (i.e. bias terms)

27 Overfitting
In some cases the network is capable of learning too much from the specific training data used; this phenomenon is referred to as overfitting and occurs when the model performs well on the training dataset, but does not generalize to accurate predictions on data which has not been seen during training.

28 L1 and L2 Weight Regularization
One simple technique to help avoid overfitting is to add a penalty for network parameters with large L1 or L2 norms. This is similar to the underlying idea behind LASSO regression and can be loosely interpreted as a form of applying the principle of Ockham's Razor: i.e. the simplest solution often turns out to be the correct solution.
- L2 regularization is a fairly general regularization technique which places an emphasis on reducing the largest weights
- L1 regularization helps to encourage sparsity in the network and improves performance when the problem has a sparse solution
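One way to sketch these penalties is to add scaled norm terms to the primary loss. The names `lam1` and `lam2`, the sample weights, and the use of the squared L2 norm (a common convention) are assumptions rather than details taken from the slides.

```python
def l1_penalty(weights):
    """Sum of absolute values; its gradient pushes weights toward exactly zero."""
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    """Sum of squares; large weights dominate the penalty."""
    return sum(w * w for w in weights)

def regularized_loss(primary_loss, weights, lam1=0.0, lam2=0.0):
    # lam1 and lam2 are hypothetical coefficients controlling penalty strength.
    return primary_loss + lam1 * l1_penalty(weights) + lam2 * l2_penalty(weights)

weights = [3.0, -4.0, 0.0]
total = regularized_loss(1.0, weights, lam1=0.1, lam2=0.01)
```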

29 Applying Dropout
Applying dropout to hidden network layers also helps to avoid overfitting. This technique consists of removing, or dropping, units/nodes randomly at each step of the training process.
- A fixed drop rate $p \in (0, 1)$ is specified prior to training, and nodes in the layer are dropped according to a collection of i.i.d. random Bernoulli samples drawn at each training step
- Since all nodes will be used after training, the outputs of the remaining nodes are rescaled by a factor of $1/(1 - p)$ to ensure that the expected values during training and testing coincide
Loosely speaking, this can be thought of as a way to ensure that no individual node plays too large of a role in the final prediction.
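A minimal sketch of the dropout rule described above, applied to a list of node outputs. The function signature and the optional `rng` argument are assumptions made so the example is reproducible.

```python
import random

def dropout(values, p, training, rng=random):
    """Drop each value with probability p during training; rescale survivors."""
    if not training:
        # All nodes are used after training, with no rescaling needed.
        return list(values)
    scale = 1.0 / (1.0 - p)
    # One independent Bernoulli draw per node at every call.
    return [0.0 if rng.random() < p else v * scale for v in values]

rng = random.Random(0)
layer_output = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(layer_output, p=0.25, training=True, rng=rng)
test_out = dropout(layer_output, p=0.25, training=False)
```

Each surviving value is multiplied by 1/(1 - 0.25) = 4/3, so the expected value of every entry during training matches the value used at test time.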

30 Example: Dropout with Rate = 0.25 [ Training ]

33 Example: Dropout with Rate = 0.25 [ Testing ]

34 Motivation for Batch Normalization
Ioffe, S. and Szegedy, C., Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint.
Internal Covariate Shift: As network parameters change during training, the distributions of the input values to each layer change.
- Training could be more efficient if the layers were receiving inputs with a fixed distribution throughout the entire process
- Achieving this using normalization requires a technique which is compatible with gradient-based optimization

35 Batch Normalization
The proposed batch normalization technique corresponds to first performing a normalization with respect to the batch statistics:
$$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}} \qquad \text{with} \qquad \mu_B = \frac{1}{m} \sum_{x \in B} x, \qquad \sigma_B^2 = \frac{1}{m} \sum_{x \in B} (x - \mu_B)^2$$
where $m$ is a fixed batch size, and $\varepsilon > 0$ is included for numerical stability. A linear map with learnable parameters $\gamma$ and $\beta$ is then applied:
$$y_i = \gamma \, \hat{x}_i + \beta$$
and the normalized values $\{y_i\}$ are passed to the activation function to apply the non-linear transformation for the layer.
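The training-time computation can be sketched for a single scalar feature as follows. The function name, the value of `eps`, and the sample batch are illustrative assumptions.

```python
import math

def batch_norm_train(batch, gamma, beta, eps=1e-5):
    """Normalize a batch of scalars with its own statistics, then scale and shift."""
    m = len(batch)
    mu = sum(batch) / m                                    # batch mean
    var = sum((x - mu) ** 2 for x in batch) / m            # batch variance (divides by m)
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in batch]
    y = [gamma * xh + beta for xh in x_hat]                # learnable linear map
    return y, mu, var

y, mu, var = batch_norm_train([1.0, 2.0, 3.0, 4.0], gamma=1.0, beta=0.0)
```

With `gamma = 1` and `beta = 0` the outputs have mean zero and variance very close to one; the small gap comes from `eps`.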

36 Batch Normalization after Training
After training, we need a way to freeze the model in place for making predictions. This is accomplished by specifying a fixed normalization rule for each layer; rather than use sample statistics from a specific batch, it is natural to incorporate the entire dataset:
$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \varepsilon}}$$
where $\mu$ is the empirical mean $E_D[x]$ and $\sigma^2$ is the variance $\mathrm{Var}_D[x]$ taken with respect to the complete training dataset $D$.
These values can be tracked using moving averages during training to avoid direct computation and provide accurate estimates when parameter changes are small near the end of the training process.
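A rough sketch of how the frozen statistics might be tracked and used. The decay value 0.99, the initial values, and the function names are assumptions; frameworks differ in their defaults and conventions.

```python
import math

def update_moving_average(running, batch_value, decay=0.99):
    """Blend the latest batch statistic into the running estimate."""
    return decay * running + (1.0 - decay) * batch_value

def batch_norm_inference(x, running_mu, running_var, gamma, beta, eps=1e-5):
    """Apply the fixed normalization rule used after training."""
    return gamma * (x - running_mu) / math.sqrt(running_var + eps) + beta

# Pretend every training batch reported mean 2.5 and variance 1.25.
running_mu, running_var = 0.0, 1.0
for _ in range(2000):
    running_mu = update_moving_average(running_mu, 2.5)
    running_var = update_moving_average(running_var, 1.25)

y = batch_norm_inference(2.5, running_mu, running_var, gamma=1.0, beta=0.0)
```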

39 Evaluating a Network
Up until now, we have not discussed how to quantify the performance of a neural network. The most common strategy is to define loss functions for the network, with low loss corresponding to high performance. In supervised training, where the true labels/solutions are known, the network loss function is typically composed of:
- A primary loss term corresponding to a measure of how close the network predictions are to the true solutions
- Auxiliary loss components, such as weight regularization penalties, designed to help guide the training process
Once the network performance is quantified, we can specify an optimization algorithm designed to minimize the network loss.

40 Loss Functions
Two of the most common applications of neural networks are regression (for predicting continuous properties/values) and classification (for predicting discrete properties or labels).
For regression, a standard loss is given by the mean squared error:
$$\text{Loss} = \frac{1}{|I|} \sum_{i \in I} (\hat{y}_i - y_i)^2$$
where $I$ is the set of indices of the output data (e.g. pixels of an image).
For classification, softmax cross entropy can be used when labels are mutually exclusive (e.g. classifying a digit as 0, 1, 2 etc.); sigmoid cross entropy can be used when labels are not mutually exclusive (e.g. determining which objects are in an image).
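The three losses mentioned above can be sketched in plain Python as follows. The slides do not spell out the cross entropy formulas, so the standard definitions are assumed here, operating on raw network outputs (logits). The sigmoid version is written for clarity rather than numerical robustness.

```python
import math

def mse(pred, target):
    """Mean squared error over the output indices."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def softmax(logits):
    m = max(logits)                          # subtract the max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_cross_entropy(logits, one_hot):
    """For mutually exclusive labels: -sum_k t_k * log(p_k)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def sigmoid_cross_entropy(logits, labels):
    """For non-exclusive labels: an independent binary loss per label."""
    total = 0.0
    for z, t in zip(logits, labels):
        p = 1.0 / (1.0 + math.exp(-z))
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total
```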

41 One Hot Encoding
It is also fundamentally important to consider how the data will be represented within the network. When classifying digits, for example, networks will typically perform extremely poorly if the labels are represented as a single number: 0.0, 1.0, 2.0, etc.
To better distinguish the differences between e.g. 0, 1, and 2, it is useful to instead store the values using a one hot encoding:
0 = [ 1 0 0 0 0 0 0 0 0 0 ]
1 = [ 0 1 0 0 0 0 0 0 0 0 ]
2 = [ 0 0 1 0 0 0 0 0 0 0 ]
One hot encodings are also typically used for word prediction (by specifying a dictionary of possibilities) and character level predictions (by specifying the admissible character set).
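A short sketch of the encoding and its inverse. The helper names are hypothetical, and ten classes are assumed for the digit example.

```python
def one_hot(label, num_classes):
    """Return a vector with a single 1.0 at the index of the label."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

def decode(vec):
    """Recover the label as the index of the largest entry."""
    return max(range(len(vec)), key=lambda i: vec[i])

encoded = one_hot(2, num_classes=10)
```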

42 Data Preparation
In general, it is also good practice to first process/prepare the input values of a dataset before training. For example, if the input values are centered around 100 and all lie within the interval [99.9, 100.1], it is typically better to center and rescale these values beforehand:
$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \varepsilon}} \qquad \text{where} \qquad \mu = \frac{1}{|D|} \sum_{x \in D} x \qquad \text{and} \qquad \sigma^2 = \frac{1}{|D|} \sum_{x \in D} (x - \mu)^2$$
The values of $\mu$ and $\sigma^2$ can then be saved, and predictions can be made on arbitrary inputs during testing by applying the above normalization before passing the test inputs to the network.
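The preparation step can be sketched as fitting the statistics once on the training data and reusing them for test inputs. The sample values and the size of `eps` are assumptions for illustration.

```python
import math

def fit_standardizer(data):
    """Compute the mean and variance of the training inputs."""
    mu = sum(data) / len(data)
    var = sum((x - mu) ** 2 for x in data) / len(data)
    return mu, var

def standardize(x, mu, var, eps=1e-8):
    """Center and rescale one value using saved statistics."""
    return (x - mu) / math.sqrt(var + eps)

train = [99.9, 99.95, 100.0, 100.05, 100.1]
mu, var = fit_standardizer(train)                  # saved after training
train_scaled = [standardize(x, mu, var) for x in train]
test_scaled = standardize(100.02, mu, var)         # same rule at test time
```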

44 Gradient Descent
Gradient descent provides a simple, iterative algorithm for finding local minima of a real-valued function $F$ numerically.
The main idea behind gradient descent is relatively straightforward: compute the gradient of the function that we want to minimize and take a step in the direction of steepest descent (i.e. $-\nabla F(x)$).
The iteration step of the algorithm is defined in terms of a step size parameter $\alpha$ (or by a decreasing sequence $\{\alpha_i\}$) by setting:
$$x_{i+1} = x_i - \alpha \nabla F(x_i)$$
Note: Convergence is only guaranteed under certain assumptions on the function $F$ (e.g. convexity, Lipschitz continuity, etc.).
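A minimal sketch of the iteration on a simple convex quadratic. The test function, step size, and step count are assumptions chosen so that convergence is easy to verify.

```python
def gradient_descent(grad, x0, alpha=0.1, steps=100):
    """Iterate x_{i+1} = x_i - alpha * grad(x_i) from the starting point x0."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x

# F(x) = (x_0 - 3)^2 + 2 (x_1 + 1)^2 has its minimum at (3, -1).
def grad_F(x):
    return [2.0 * (x[0] - 3.0), 4.0 * (x[1] + 1.0)]

x_min = gradient_descent(grad_F, [0.0, 0.0])
```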

45 Loss Functions for Large Datasets
In the context of neural networks, gradient descent appears to provide a reasonable approach for tuning network parameters. The initial weights and biases can be interpreted as a single vector $\theta_0$, and the iteration steps from the previous slide could, in theory, be used to identify the optimal parameters $\theta^*$ for the model.
The issue with this approach is that the function we are actually trying to minimize is defined in terms of the entire dataset $D$:
$$F(\theta) = \frac{1}{|D|} \sum_{x \in D} f(x \mid \theta)$$
where $f(x \mid \theta)$ denotes the loss for a single example $x$ when using the model parameters $\theta$. So the standard algorithm would require computing the average loss over the full dataset at each step of the iterative scheme...

46 Stochastic Gradient Descent
Since computing the true gradient $\nabla F(\theta)$ at every step is impractical for large datasets, we can instead try to approximate this gradient using a smaller, more manageable mini-batch of data:
$$\nabla F(\theta) \approx \nabla F_i(\theta) = \frac{1}{|B_i|} \sum_{x \in B_i} \nabla f(x \mid \theta)$$
where the batches $\{B_i\}$ partition the dataset into smaller subsets (typically of equal size). The iteration step is then taken to be:
$$\theta_{i+1} = \theta_i - \alpha \nabla F_i(\theta_i)$$
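As an illustration, the sketch below applies mini-batch updates to a two-parameter linear model with a squared-error loss. The model, the noise-free synthetic data, and all hyperparameter values are assumptions; the dataset is reshuffled and partitioned into batches on every pass.

```python
import random

def sgd_linear_fit(data, alpha=0.05, batch_size=4, epochs=200, seed=0):
    """Fit y = slope * x + intercept by mini-batch gradient steps."""
    rng = random.Random(seed)
    slope, intercept = 0.0, 0.0
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        # Consecutive slices of the shuffled list partition the dataset.
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            g_slope = g_intercept = 0.0
            for x, y in batch:
                err = slope * x + intercept - y     # per-example loss is err^2
                g_slope += 2.0 * err * x
                g_intercept += 2.0 * err
            # Average the per-example gradients over the batch, then step.
            slope -= alpha * g_slope / len(batch)
            intercept -= alpha * g_intercept / len(batch)
    return slope, intercept

# Noise-free samples of y = 2x - 1 on [-1, 1].
data = [(k / 10, 2.0 * (k / 10) - 1.0) for k in range(-10, 11)]
slope, intercept = sgd_linear_fit(data)
```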

47 Potential Obstacles
- Fixed learning rates typically lead to suboptimal performance
- Defining a learning rate schedule manually does not allow the algorithm to adapt to the particular problem under consideration
- Different parameters often require different learning rates
- Since the directions/magnitudes of previous updates are not taken into consideration, defining optimization policies on small batches of data may lead to a noisy, inefficient training process

48 Importance of Selecting the Correct Learning Rate
[Figure: two panels, "High Learning Rate" and "Low Learning Rate"]

49 Nesterov Momentum
Nesterov, Y.E., A method for solving the convex programming problem with convergence rate $O(1/k^2)$. In Dokl. Akad. Nauk SSSR (Vol. 269).
One method for learning from previous steps is to incorporate momentum into the update policy. This can be done by setting:
$$v_i = \gamma \, v_{i-1} + \alpha \nabla F(\theta_i), \qquad \theta_{i+1} = \theta_i - v_i$$
An accelerated form was introduced by Nesterov in 1983 which leverages the value of looking ahead before making updates:
$$v_i = \gamma \, v_{i-1} + \alpha \nabla F(\theta_i - \gamma \, v_{i-1})$$
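Both update rules can be sketched with one function and a flag. The quadratic test function and the values of `alpha`, `gamma`, and `steps` are illustrative assumptions, and the full gradient is used in place of a mini-batch estimate.

```python
def momentum_descent(grad, theta0, alpha=0.05, gamma=0.9, steps=300, nesterov=False):
    """Gradient descent with classical or Nesterov momentum."""
    theta = list(theta0)
    v = [0.0] * len(theta)
    for _ in range(steps):
        if nesterov:
            # Look ahead to where the previous velocity would carry us.
            point = [t - gamma * vi for t, vi in zip(theta, v)]
        else:
            point = theta
        g = grad(point)
        v = [gamma * vi + alpha * gi for vi, gi in zip(v, g)]
        theta = [t - vi for t, vi in zip(theta, v)]
    return theta

# F(theta) = (theta_0 - 3)^2 + 2 (theta_1 + 1)^2, minimized at (3, -1).
def grad_F(theta):
    return [2.0 * (theta[0] - 3.0), 4.0 * (theta[1] + 1.0)]

theta_classical = momentum_descent(grad_F, [0.0, 0.0])
theta_nesterov = momentum_descent(grad_F, [0.0, 0.0], nesterov=True)
```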

50 AdaGrad and RMSProp
Duchi, J., Hazan, E. and Singer, Y., Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul).
Hinton, G., Srivastava, N. and Swersky, K., Neural Networks for Machine Learning, Lecture 6a: Overview of mini-batch gradient descent.
- AdaGrad defines parameter-specific updates which are normalized by the square root of the sum of the squares of previous gradients; this leads to a natural learning rate decay (often too much)
- RMSProp keeps moving averages of the squared gradients for each parameter which are used to rescale updates
- AdaDelta provides another method for rescaling updates, and many other variants (e.g. including momentum) exist...
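The two update rules can be sketched as single-step functions. These follow the commonly stated forms of AdaGrad and RMSProp; the default values of `alpha`, `rho`, and `eps` are assumptions, and library implementations differ in details such as where `eps` is added.

```python
import math

def adagrad_step(theta, g, accum, alpha=0.1, eps=1e-8):
    """Accumulate all squared gradients; divide each update by their root."""
    accum = [a + gi * gi for a, gi in zip(accum, g)]
    theta = [t - alpha * gi / (math.sqrt(a) + eps)
             for t, gi, a in zip(theta, g, accum)]
    return theta, accum

def rmsprop_step(theta, g, avg, alpha=0.01, rho=0.9, eps=1e-8):
    """Keep a moving average of squared gradients instead of a running sum."""
    avg = [rho * a + (1.0 - rho) * gi * gi for a, gi in zip(avg, g)]
    theta = [t - alpha * gi / (math.sqrt(a) + eps)
             for t, gi, a in zip(theta, g, avg)]
    return theta, avg

# Two AdaGrad steps with the same gradient: the second step is smaller.
theta1, accum1 = adagrad_step([1.0], [2.0], [0.0])
theta2, accum2 = adagrad_step(theta1, [2.0], accum1)
first_step = 1.0 - theta1[0]
second_step = theta1[0] - theta2[0]

theta_r, avg_r = rmsprop_step([1.0], [2.0], [0.0])
```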

51 Exponential Moving Averages
One common method for estimating an average incrementally is to keep an exponential moving average of the values. This method applies an exponential decay to terms in the average which places an emphasis on the most recent values; this allows the average to move, or correct itself, as the distribution of the values changes.
To track the gradient $g_t = \nabla F(\theta_{t-1})$ of the loss with respect to the parameters $\theta$, we can define an average recursively by setting:
$$\begin{cases} m_0 = 0 \\ m_t = \beta \, m_{t-1} + (1 - \beta) \, g_t \end{cases} \quad \Longrightarrow \quad m_t = (1 - \beta) \sum_{\tau=1}^{t} \beta^{\,t-\tau} g_\tau$$
where the parameter $\beta$ is used to specify the exponential decay rate and is typically taken to be close to, but smaller than, 1.
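The equivalence between the recursion and the closed-form sum can be checked numerically. The function names and the sample values are assumptions for illustration.

```python
def ema_recursive(values, beta):
    """m_0 = 0, m_t = beta * m_{t-1} + (1 - beta) * g_t."""
    m = 0.0
    for g in values:
        m = beta * m + (1.0 - beta) * g
    return m

def ema_closed_form(values, beta):
    """m_t = (1 - beta) * sum_{tau=1..t} beta^(t - tau) * g_tau."""
    t = len(values)
    return (1.0 - beta) * sum(beta ** (t - tau) * g
                              for tau, g in enumerate(values, start=1))

gradients = [0.5, -1.0, 2.0, 0.25, 1.5]
m_rec = ema_recursive(gradients, beta=0.9)
m_sum = ema_closed_form(gradients, beta=0.9)
```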

52 The Adam Optimization Algorithm
Kingma, D.P. and Ba, J., Adam: A method for stochastic optimization. arXiv preprint.
The Adam optimizer, whose name is derived from adaptive moment estimation, proposes keeping exponential moving averages of both the first moment $g_t$ and the (uncentered) second moment $g_t^2$ of the gradient. In addition, a bias correction is introduced to address the issue of arbitrarily initializing the exponential moving averages with zero.
The first moment average $m_t$ and second moment average $v_t$ are defined with decay rates $\beta_1$ and $\beta_2$, respectively, and the bias correction procedure is defined by the rescaling:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
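A compact sketch of the resulting update, using the parameter step theta - alpha * m_hat / (sqrt(v_hat) + eps) from the Adam paper. The quadratic test function, step count, and hyperparameter values are illustrative assumptions.

```python
import math

def adam(grad, theta0, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=3000):
    """Minimize a function from its gradient using Adam-style updates."""
    theta = list(theta0)
    m = [0.0] * len(theta)   # moving average of the gradient
    v = [0.0] * len(theta)   # moving average of the squared gradient
    for t in range(1, steps + 1):
        g = grad(theta)
        m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # Bias correction compensates for initializing m and v at zero.
        m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
        v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
        theta = [th - alpha * mh / (math.sqrt(vh) + eps)
                 for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta

# F(theta) = (theta_0 - 3)^2 + 2 (theta_1 + 1)^2, minimized at (3, -1).
def grad_F(theta):
    return [2.0 * (theta[0] - 3.0), 4.0 * (theta[1] + 1.0)]

theta_first = adam(grad_F, [0.0, 0.0], steps=1)
theta_final = adam(grad_F, [0.0, 0.0])
```

After a single step the bias-corrected ratio is g / |g|, so each parameter moves by approximately `alpha` in the descent direction regardless of the gradient's magnitude.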

53 The Adam Optimization Algorithm

55 Backpropagation
Rumelhart, D.E., Hinton, G.E. and Williams, R.J., Learning representations by back-propagating errors. Nature, 323(6088), p.533.
While this theoretical framework for neural network optimization may seem complete, one fundamental question still remains: How are the gradients of network parameters actually computed?
One approach, referred to as backpropagation, was proposed in 1986; it dealt with sigmoidal activations $\sigma$ and defined the loss $E$ in terms of predictions $y_j$ and true values $d_j$ via:
$$E = \frac{1}{2} \sum_j (y_j - d_j)^2$$

56 Backpropagation: Rumelhart, Hinton, and Williams

60 Backpropagation
Now that the error contribution associated with $y_i$ is known, i.e.
$$\frac{\partial E}{\partial y_i} = \sum_j \frac{\partial E}{\partial x_j} \, w_{ji}$$
contributions from network parameters of the previous layer can be computed using the same methodology that was applied to $y_j$: e.g.
$$\frac{\partial E}{\partial x_i} = \frac{\partial E}{\partial y_i} \, \frac{dy_i}{dx_i} = \frac{\partial E}{\partial y_i} \, \frac{d\,\sigma(x_i)}{dx_i}$$
In this way, gradient calculations for all network parameters can be computed by propagating back the error contributions from the parameters in subsequent layers which depend on their values.
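The propagation rule can be sketched for a small network with one hidden layer, sigmoidal units, and the squared-error loss E. Here each unit's total input is the weighted sum of the outputs of the layer below. Bias terms are omitted and the weights are arbitrary; both are simplifying assumptions for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, W1, W2):
    """Return hidden and output activations; each row of W holds w_ji for unit j."""
    hidden = [sigmoid(sum(w * y for w, y in zip(row, inputs))) for row in W1]
    outputs = [sigmoid(sum(w * y for w, y in zip(row, hidden))) for row in W2]
    return hidden, outputs

def loss(outputs, targets):
    return 0.5 * sum((y - d) ** 2 for y, d in zip(outputs, targets))

def backprop(inputs, targets, W1, W2):
    hidden, outputs = forward(inputs, W1, W2)
    # Output layer: dE/dy_j = y_j - d_j and dE/dx_j = dE/dy_j * y_j (1 - y_j).
    dE_dx_out = [(y - d) * y * (1.0 - y) for y, d in zip(outputs, targets)]
    grad_W2 = [[dx * h for h in hidden] for dx in dE_dx_out]
    # Propagate back: dE/dy_i = sum_j dE/dx_j * w_ji.
    dE_dy_hid = [sum(dE_dx_out[j] * W2[j][i] for j in range(len(W2)))
                 for i in range(len(hidden))]
    dE_dx_hid = [dy * h * (1.0 - h) for dy, h in zip(dE_dy_hid, hidden)]
    grad_W1 = [[dx * x for x in inputs] for dx in dE_dx_hid]
    return grad_W1, grad_W2

inputs, targets = [0.5, -1.0], [1.0]
W1 = [[0.1, -0.2], [0.4, 0.3]]
W2 = [[0.7, -0.5]]
grad_W1, grad_W2 = backprop(inputs, targets, W1, W2)
```

A finite-difference comparison on individual weights is a simple way to sanity-check gradients computed this way.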

61 Symbolic and Numeric Differentiation
Two commonly used methods for automating the process of computing derivatives are symbolic differentiation and numeric differentiation; however, both of these methods have severe practical limitations in the context of training neural networks.
- Symbolic differentiation produces exact derivatives through direct manipulation of the mathematical expressions used to define functions; however, the resulting expressions can be lengthy, can contain unnecessary computations, and are inefficient unless additional expression simplification steps are included
- Numeric differentiation techniques are widely applicable and efficient; however, the resulting inexact gradient estimates can entirely undermine the training process for large networks

62 Automatic Differentiation
Automatic differentiation (AD) in reverse mode provides a generalization of backpropagation and gives us a way to carry out the required gradient calculations exactly and efficiently.
- Computes derivatives using the underlying computational graph
- Very efficient with respect to evaluation time
- A trace of all elementary operations is stored on an evaluation tape, or Wengert list; potentially large storage requirements
Baydin, A.G., Pearlmutter, B.A., Radul, A.A. and Siskind, J.M., Automatic differentiation in machine learning: a survey.
University of Washington CSE599W: Spring 2018, Slides from Lecture 4: Backpropagation and Automatic Differentiation
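A toy sketch of reverse-mode AD for scalars. The class `Var` and the helper functions are hypothetical names. Rather than an explicit tape, each node stores links to its parents along with the local derivatives, and the ordered list built in `backward` plays the role of the recorded trace. Only addition, multiplication, and tanh are supported.

```python
import math

class Var:
    """A scalar value that remembers how it was computed."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents      # tuples of (parent Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        other = _wrap(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))
    __radd__ = __add__

    def __mul__(self, other):
        other = _wrap(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))
    __rmul__ = __mul__

def _wrap(x):
    return x if isinstance(x, Var) else Var(x)

def tanh(v):
    t = math.tanh(v.value)
    return Var(t, ((v, 1.0 - t * t),))

def backward(output):
    """Accumulate d(output)/d(node) into node.grad for every node in the graph."""
    order, seen = [], set()
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)          # parents are always listed before children
    visit(output)
    output.grad = 1.0
    for node in reversed(order):    # process each node after all of its consumers
        for parent, local in node.parents:
            parent.grad += local * node.grad

w, x, b = Var(0.5), Var(2.0), Var(-1.0)
y = tanh(w * x + b)                 # a single neuron with a tanh activation
backward(y)
```

For this input the pre-activation is zero, so the derivative of tanh equals one and the gradients reduce to dy/dw = x, dy/dx = w, and dy/db = 1.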
