Ch.6 Deep Feedforward Networks (2/3)

1 Ch.6 Deep Feedforward Networks (2/3) (Mon.) System Software Lab., Dept. of Mechanical & Information Eng. Woonggy Kim

2 Contents
6.3. Hidden Units
  Rectified Linear Units and Their Generalizations
  Logistic Sigmoid and Hyperbolic Tangent
  Other Hidden Units
6.4. Architecture Design
  Universal Approximation Properties and Depth
  Other Architectural Considerations
6.5. Back-Propagation and Other Differentiation Algorithms
  Computational Graphs
  Chain Rule of Calculus
  Recursively Applying the Chain Rule to Obtain Backprop

3 Hidden Units
The design of hidden units: in this section (6.3), we discuss some of the basic intuitions motivating each type of hidden unit. The design of hidden units is an extremely active area of research and does not yet have many definitive guiding theoretical principles. Rectified linear units (ReLUs) are a good default choice for most models, but it is usually impossible to predict in advance which type of unit will work best. Most hidden units can be described as accepting a vector of inputs $x$, computing an affine transformation $z = W^\top x + b$, and applying an element-wise nonlinear function $g(z)$.

4 ReLUs Generalization
Rectified linear units (ReLUs) use the activation function $g(z) = \max\{0, z\}$. They are easy to optimize because they are so similar to linear units. They are typically used on top of an affine transformation: $h = g(W^\top x + b)$. Setting all elements of $b$ to a small positive value, such as 0.1, makes it very likely that the rectified linear units are initially active for most inputs in the training set.
Three generalizations of ReLUs use a nonzero slope $\alpha_i$ when $z_i < 0$: $h_i = g(z, \alpha)_i = \max(0, z_i) + \alpha_i \min(0, z_i)$.
Absolute value rectification: fixes $\alpha_i = -1$, giving $g(z) = |z|$.
Leaky ReLU: fixes $\alpha_i$ to a small value such as 0.01.
Parametric ReLU (PReLU): treats $\alpha_i$ as a learnable parameter.
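
The three generalizations differ only in how the slope $\alpha_i$ for negative inputs is chosen. The following is a minimal sketch in plain Python (no numerical library, to stay dependency-free); the function name `generalized_relu` and the sample values are illustrative assumptions, not part of the slides.

```python
def generalized_relu(z, alpha):
    """Apply h_i = max(0, z_i) + alpha_i * min(0, z_i) element-wise."""
    return [max(0.0, zi) + ai * min(0.0, zi) for zi, ai in zip(z, alpha)]

# Hypothetical pre-activations z = W^T x + b for four units.
z = [-2.0, -0.5, 0.0, 1.5]
n = len(z)

relu = generalized_relu(z, [0.0] * n)       # alpha = 0: plain ReLU
abs_rect = generalized_relu(z, [-1.0] * n)  # alpha = -1: g(z) = |z|
leaky = generalized_relu(z, [0.01] * n)     # alpha = 0.01: leaky ReLU

print(relu, abs_rect, leaky)
```

A PReLU would use the same function but update the entries of `alpha` by gradient descent along with the weights.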

5 ReLUs Generalization
Maxout units: instead of applying an element-wise function $g(z)$, maxout units divide $z$ into groups of $k$ values, and each unit outputs the maximum element of one group: $g(z)_i = \max_{j \in \mathbb{G}^{(i)}} z_j$, where $\mathbb{G}^{(i)}$ is the set of indices into the inputs for group $i$. A maxout unit can learn a piecewise linear, convex function with up to $k$ pieces, so it can be seen as learning the activation function itself. In some cases, one can gain statistical and computational advantages because the next layer requires fewer parameters.
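
As a rough illustration of the grouping, the sketch below splits a vector of pre-activations into consecutive groups of $k$ values and keeps the maximum of each group. The helper name `maxout`, the consecutive grouping, and the sample vector are assumptions made for this example.

```python
def maxout(z, k):
    """Split z into consecutive groups of k values and return each group's maximum."""
    if len(z) % k != 0:
        raise ValueError("len(z) must be a multiple of k")
    return [max(z[start:start + k]) for start in range(0, len(z), k)]

# Hypothetical pre-activations: six affine outputs, grouped three at a time.
z = [0.3, -1.2, 2.0, 0.5, 0.1, -0.4]
h = maxout(z, k=3)  # two groups of three values -> two maxout units

print(h)
```

Because each entry of `z` is itself an affine function of the input, each output is the maximum of $k$ affine functions, which is piecewise linear and convex.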

6 Logistic Sigmoid / Hyperbolic Tangent
Logistic sigmoid: $g(z) = \sigma(z) = \frac{1}{1 + e^{-z}}$. Hyperbolic tangent: $g(z) = \tanh(z)$. Sigmoidal units saturate across most of their domain, which can make gradient-based learning very difficult. Their use as output units is compatible with gradient-based learning when an appropriate cost function can undo the saturation of the sigmoid in the output layer. Sigmoidal activation functions are more common in settings other than feedforward networks.
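
Saturation can be made concrete by looking at the derivatives $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ and $\tanh'(z) = 1 - \tanh^2(z)$, which shrink toward zero as $|z|$ grows. The short sketch below prints them for a few arbitrarily chosen values of $z$; the function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # d/dz sigma(z) = sigma(z) * (1 - sigma(z)); largest at z = 0
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    # d/dz tanh(z) = 1 - tanh(z)^2; largest at z = 0
    return 1.0 - math.tanh(z) ** 2

for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z={z:5.1f}  sigmoid'={sigmoid_grad(z):.6f}  tanh'={tanh_grad(z):.6f}")
```

The gradient that flows back through a saturated unit is multiplied by one of these small factors, which is why learning slows down.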

7 Other Hidden Units
In general, a wide variety of differentiable functions can be used. New hidden unit types are usually published only if they are clearly demonstrated to provide a significant improvement. Especially useful and distinctive ones (1/3):
No activation $g(z)$ at all: equivalently, using the identity function as the activation function. A linear unit can be useful as a hidden unit as well as an output unit, and it offers an effective way of reducing the number of parameters in a network.
Softmax unit: may sometimes be used as a hidden unit.
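
One way to see the parameter saving from a linear hidden layer: a layer with $n$ inputs and $p$ outputs has $np$ weights, whereas inserting a linear hidden layer of width $q$ factors the weight matrix into two matrices with $(n + p)q$ weights in total. The sketch below only counts weights (biases are ignored); the sizes are hypothetical.

```python
def dense_params(n, p):
    """Weights in a single n -> p layer."""
    return n * p

def factored_params(n, p, q):
    """Weights in n -> q (linear, no activation) followed by q -> p."""
    return (n + p) * q

# Hypothetical sizes chosen for illustration.
n, p, q = 1000, 1000, 50
print(dense_params(n, p), factored_params(n, p, q))
```

The saving comes at the cost of restricting the linear map to rank at most $q$, so it helps only when $q$ is small relative to $n$ and $p$.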

8 Other Hidden Units
Especially useful and distinctive ones (2/3):
Radial basis function (RBF): $h_i = \exp\left(-\frac{1}{\sigma_i^2} \lVert W_{:,i} - x \rVert^2\right)$
Softplus: $g(a) = \xi(a) = \log(1 + e^a)$
[Figure: rectifier and softplus functions]

9 Other Hidden Units
Especially useful and distinctive ones (3/3):
Hard tanh: $g(a) = \max(-1, \min(1, a))$
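
The radial basis function, softplus, and hard tanh units can each be written in a line or two. The following plain-Python sketch assumes the forms $\exp(-\lVert W_{:,i} - x \rVert^2 / \sigma_i^2)$, $\log(1 + e^a)$, and $\max(-1, \min(1, a))$; the function names and the numerically stable rewriting of softplus are choices made for this example.

```python
import math

def softplus(a):
    # log(1 + e^a), computed stably as max(a, 0) + log(1 + e^{-|a|})
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def hard_tanh(a):
    # Clip a to the interval [-1, 1]
    return max(-1.0, min(1.0, a))

def rbf_unit(x, w, sigma):
    # exp(-||w - x||^2 / sigma^2), where w plays the role of the column W[:, i]
    sq_dist = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return math.exp(-sq_dist / sigma ** 2)

print(softplus(0.0), hard_tanh(3.0), rbf_unit([1.0, 2.0], [1.0, 2.0], 1.0))
```

The RBF unit is most active when `x` is close to the template `w` and decays toward zero elsewhere, which is why it saturates for most inputs.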

10 Architecture Design
Determining the architecture: the word architecture refers to the overall structure of the network. Most neural networks are organized into groups of units called layers, arranged in a chain so that each layer is a function of the layer that preceded it.
First layer: $h^{(1)} = g^{(1)}\left(W^{(1)\top} x + b^{(1)}\right)$
Second layer: $h^{(2)} = g^{(2)}\left(W^{(2)\top} h^{(1)} + b^{(2)}\right)$
The main architectural considerations are the depth of the network and the width of each layer.
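
The chain structure can be sketched as two calls to the same affine-plus-activation step. In this example the weights, biases, layer widths, and the choice of ReLU for both $g^{(1)}$ and $g^{(2)}$ are all hypothetical; `W` is stored as a list of rows with shape (inputs, units) so that `affine` computes $W^\top x + b$.

```python
def affine(W, x, b):
    """Compute W^T x + b, with W stored as a list of rows (shape: inputs x units)."""
    return [sum(W[i][j] * x[i] for i in range(len(x))) + b[j] for j in range(len(b))]

def relu(z):
    return [max(0.0, zi) for zi in z]

# Hypothetical sizes: 3 inputs -> hidden layer of width 2 -> layer of width 1.
x = [1.0, 2.0, 0.5]
W1 = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.8]]
b1 = [0.1, 0.1]
W2 = [[1.0], [0.5]]
b2 = [0.0]

h1 = relu(affine(W1, x, b1))   # first layer
h2 = relu(affine(W2, h1, b2))  # second layer, a function of the first

print(h1, h2)
```

Depth corresponds to the number of such steps in the chain, and width to the length of each intermediate vector.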

11 Universal Approximation Properties and Depth
A linear model can by definition represent only linear functions, but we often want to learn nonlinear functions. Universal approximation theorem: a feedforward network with a linear output layer and at least one hidden layer with any squashing activation function (such as the logistic sigmoid) can approximate any Borel measurable function from one finite-dimensional space to another with any desired nonzero amount of error, provided the network has enough hidden units. Any continuous function on a closed and bounded subset of $\mathbb{R}^n$ is Borel measurable. This means that regardless of what function we are trying to learn, a large MLP will be able to represent it.

12 Universal Approximation Properties and Depth
Even if the MLP is able to represent the function, learning can fail for two reasons: the optimization algorithm used for training may not be able to find the parameter values that correspond to the desired function, and the training algorithm might choose the wrong function due to overfitting. The universal approximation theorem says that there exists a network large enough to achieve any degree of accuracy we desire, but it does not say how large this network will be. In the worst case, an exponential number of hidden units may be required. A feedforward network with a single hidden layer is sufficient to represent any function, but the layer may be infeasibly large and may fail to learn and generalize correctly.

13 Universal Approximation Properties and Depth
There exist families of functions that can be approximated efficiently by an architecture with depth greater than some value $d$, but that require a much larger model if depth is restricted to be less than or equal to $d$. In many cases, the number of hidden units required by the shallow model is exponential in $n$. Such results were first proved for models that do not resemble the continuous, differentiable neural networks used for machine learning, but have since been extended to these models (see p.199).

14 Universal Approximation Properties and Depth
Leshno et al. (1993) demonstrated that shallow networks with a broad family of non-polynomial activation functions (including rectified linear units) have universal approximation properties, but these results do not address the questions of depth or efficiency. Montufar et al. (2014) showed that functions representable with a deep rectifier network can require an exponential number of hidden units with a shallow network. They also showed that piecewise linear networks can represent functions with a number of regions that is exponential in the depth of the network.

15 Universal Approximation Properties and Depth

16 Universal Approximation Properties and Depth
Choosing a deep model encodes a very general belief that the function we want to learn involves the composition of several simpler functions. Empirically, greater depth does seem to result in better generalization for a wide variety of tasks.

19 Other Architectural Considerations
In practice, neural networks show considerably more diversity than a simple chain of layers. Many architectures build a main chain but then add extra architectural features to it, such as skip connections going from layer $i$ to layer $i + 2$ or higher. In the default neural network layer, every input unit is connected to every output unit. Reducing the number of connections reduces the number of parameters and the amount of computation required to evaluate the network.

20 Back-Propagation and Other Differentiation Algorithms
The back-propagation algorithm: the input $x$ provides the initial information, which then propagates up to the hidden units at each layer and finally produces the output $\hat{y}$. The back-propagation algorithm (often simply called backprop) allows the information from the cost to then flow backwards through the network in order to compute the gradient. Numerically evaluating an analytical expression for the gradient can be computationally expensive; back-propagation computes the gradient using a simple and inexpensive procedure.

21 Computational Graphs
To describe the back-propagation algorithm more precisely, it is helpful to have a computational graph language. Here, each node indicates a variable (a scalar, vector, matrix, tensor, or a variable of another type). An operation is a simple function of one or more variables. In the examples here, an operation is defined to return only a single output variable. If a variable $y$ is computed by applying an operation to a variable $x$, then a directed edge is drawn from $x$ to $y$.

22 Computational Graphs

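24 Chain Rule of Calculus
The chain rule of calculus (not to be confused with the chain rule of probability): let $x$ be a real number, and let $f$ and $g$ both be functions mapping from a real number to a real number. Suppose that $y = g(x)$ and $z = f(g(x)) = f(y)$. Then the chain rule states that $\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}$.
It can be generalized beyond the scalar case. Suppose that $x \in \mathbb{R}^m$, $y \in \mathbb{R}^n$, $g$ maps from $\mathbb{R}^m$ to $\mathbb{R}^n$, and $f$ maps from $\mathbb{R}^n$ to $\mathbb{R}$. If $y = g(x)$ and $z = f(y)$, then $\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j} \frac{\partial y_j}{\partial x_i}$. In vector notation, $\nabla_x z = \left(\frac{\partial y}{\partial x}\right)^\top \nabla_y z$, where $\frac{\partial y}{\partial x}$ is the $n \times m$ Jacobian matrix of $g$.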
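
The vector form of the chain rule can be checked numerically on a small example. The sketch below uses made-up functions $g: \mathbb{R}^2 \to \mathbb{R}^3$ and $f: \mathbb{R}^3 \to \mathbb{R}$ (both are assumptions chosen only because their derivatives are easy to write by hand), computes the gradient as the transposed Jacobian times $\nabla_y z$, and compares it with central finite differences.

```python
def g(x):
    # g: R^2 -> R^3
    return [x[0] * x[1], x[0] + x[1], x[0] ** 2]

def f(y):
    # f: R^3 -> R
    return y[0] + 2.0 * y[1] + y[2] ** 2

def jacobian_g(x):
    # n x m matrix with entry dy_j / dx_i in row j, column i
    return [[x[1], x[0]], [1.0, 1.0], [2.0 * x[0], 0.0]]

def grad_f(y):
    return [1.0, 2.0, 2.0 * y[2]]

def grad_x(x):
    # dz/dx_i = sum_j (dz/dy_j) * (dy_j/dx_i), i.e. J^T times grad_y z
    J, gy = jacobian_g(x), grad_f(g(x))
    return [sum(gy[j] * J[j][i] for j in range(len(gy))) for i in range(len(x))]

def numeric_grad(x, eps=1e-6):
    # Central finite differences on z = f(g(x)), used only as a check
    out = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        out.append((f(g(xp)) - f(g(xm))) / (2.0 * eps))
    return out

x = [1.5, -0.5]
analytic = grad_x(x)
numeric = numeric_grad(x)
print(analytic, numeric)
```

The two results should agree up to finite-difference error; the finite-difference version needs two evaluations of the whole function per input dimension, which hints at why an analytic procedure is preferred.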

25 Recursively Applying the Chain Rule to Obtain Backprop
Actually evaluating an expression on a computer introduces some extra considerations. Any procedure that computes gradients needs to choose whether to store subexpressions or to recompute them several times. In some cases, computing the same subexpression twice would simply be wasteful (see fig. 6.9). First, consider a computational graph describing how to compute a single scalar $u^{(n)}$ (Algorithm 6.1). Then consider the back-propagation algorithm, which is designed to reduce the number of common subexpressions without regard to memory (Algorithm 6.2).

27 $u^{(i)} = f(\mathbb{A}^{(i)})$, where $\mathbb{A}^{(i)}$ is the set of all nodes that are parents of $u^{(i)}$.
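
The idea of computing each node from its parents, storing the result, and then sweeping backwards can be sketched on a tiny scalar graph. This is an illustrative sketch in the spirit of the forward and backward passes, not a transcription of Algorithms 6.1 and 6.2; the graph, the `OPS` table, and the function names are all hypothetical.

```python
import math

# Each operation: (function of the parents, local partial derivatives w.r.t. each parent).
OPS = {
    "mul": (lambda a, b: a * b, lambda a, b: (b, a)),
    "add": (lambda a, b: a + b, lambda a, b: (1.0, 1.0)),
    "exp": (lambda a: math.exp(a), lambda a: (math.exp(a),)),
}

# Hypothetical graph with inputs u0, u1: u2 = u0 * u1, u3 = exp(u2), u4 = u3 + u0.
# Each entry lists the operation and the indices of the node's parents.
GRAPH = [("mul", (0, 1)), ("exp", (2,)), ("add", (3, 0))]

def forward(inputs):
    """Compute every node from its parents, storing all values for reuse."""
    u = list(inputs)
    for op, parents in GRAPH:
        u.append(OPS[op][0](*(u[p] for p in parents)))
    return u

def backward(u, n_inputs):
    """Return d u_n / d u_i for each input by visiting nodes in reverse order."""
    grad = [0.0] * len(u)
    grad[-1] = 1.0  # derivative of the output with respect to itself
    for k in range(len(GRAPH) - 1, -1, -1):
        op, parents = GRAPH[k]
        node = n_inputs + k
        local = OPS[op][1](*(u[p] for p in parents))
        for p, d in zip(parents, local):
            grad[p] += grad[node] * d  # accumulate over all children of p
    return grad[:n_inputs]

u = forward([2.0, 0.5])
grads = backward(u, n_inputs=2)
print(u[-1], grads)
```

Each stored value in `u` is read during the backward sweep rather than recomputed, and each edge of the graph is visited once, which is the trade of memory for computation described above.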
