Deep Feedforward Networks Liu Yang March 30, 2017
Overview
1. Background
  A general introduction
  Example
2. Gradient-based learning
  Cost functions
  Output units
3. Hidden units
4. Architecture design
5. Back-propagation and other differentiation algorithms
6. Regularization in deep learning
Deep feedforward networks
A deep feedforward network is also called a feedforward neural network or a multilayer perceptron (MLP).
The goal of a deep feedforward network is to approximate (learn) a function $y = f^*(x)$.
Why we call it feedforward: there are no feedback connections that feed the outputs of the model back into itself.
A defining property of a deep feedforward network is that it composes many different functions: $f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))$
Figure: An example
Example: learning XOR
XOR operates on two binary values: it returns 1 when exactly one of the two values equals 1, and 0 otherwise.
Input: $X = \{(0, 0), (0, 1), (1, 1), (1, 0)\}$
Suppose we would like to fit a model $y = f(x; \theta)$ to learn the target function $f^*$. The loss function is then:
$J(\theta) = \frac{1}{4} \sum_{x \in X} (f^*(x) - f(x; \theta))^2$ (1)
Example: learning XOR
A linear model can be used as a first try: $f(x; w, b) = x^T w + b$, that is, $f(x; w, b) = x_1 w_1 + x_2 w_2 + b$
Solution: $w = 0$, $b = 0.5$, which outputs 0.5 everywhere
Why does the linear function fail?
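The linear solution quoted above can be checked numerically. The following is a minimal sketch, assuming the squared-error loss from the previous slide and an ordinary least-squares fit; the variable names (`A`, `params`, `predictions`) are illustrative and not part of the slides.

```python
import numpy as np

# The four XOR inputs and their targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Append a column of ones so the bias b is fitted together with w.
A = np.hstack([X, np.ones((4, 1))])

# Least-squares fit of f(x; w, b) = x^T w + b.
params, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = params[:2], params[2]
predictions = A @ params

print("w =", w, "b =", b)
print("predictions =", predictions)
```

The fitted weights are (numerically) zero and the bias is 0.5, so the model predicts 0.5 for every input and cannot distinguish the two classes.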
Linear approach for XOR
Major challenge for a single-layer perceptron network: the two classes must be linearly separable.
In the XOR example no single line can separate the two classes, although two lines can.
Thus a multilayer perceptron network can be used to provide a solution.
Basic Components
There are several basic components of a deep feedforward network:
  Cost function
  Output units
  Hidden units
  Architecture design
  Back-propagation algorithms
Cost functions
Learning a conditional distribution with maximum likelihood (the cost is the negative log-likelihood):
$J(\theta) = -\mathbb{E}_{x, y \sim \hat{p}_{data}} \log p_{model}(y \mid x)$ (2)
Learning a conditional statistic:
$f^* = \arg\min_f \mathbb{E}_{x, y \sim \hat{p}_{data}} \| y - f(x) \|^2$ (3)
Output Units
Linear units for Gaussian output distributions: $\hat{y} = W^T h + b$
Sigmoid units for Bernoulli output distributions: $\hat{y} = \sigma(w^T h + b)$
Softmax units for Multinoulli output distributions:
$\mathrm{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$ (4)
where $z_i = \log \tilde{p}(y = i \mid x)$ is an unnormalized log-probability
Other output types: mixture units
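As an illustration of the softmax output unit, here is a small sketch in NumPy. Subtracting the maximum before exponentiating is a common numerical-stability trick that is assumed here rather than taken from the slides; it does not change the result because softmax is invariant to adding a constant to every $z_i$.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis: exp(z_i) / sum_j exp(z_j)."""
    z = np.asarray(z, dtype=float)
    # Shift by the maximum so that exp() never overflows.
    shifted = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

print(softmax([1.0, 2.0, 3.0]))
```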
Hidden Units
Activation functions are used to compute the hidden layer values.
How do we choose the type of hidden unit to use in the hidden layers?
Rectified Linear Units (ReLU)
  ReLUs use the activation function $g(z) = \max\{0, z\}$
  ReLUs are used on top of an affine transformation: $h = g(W^T x + b)$
  Noisy ReLU: $g(z) = \max(0, z + Y)$, $Y \sim \mathcal{N}(0, \sigma(z))$
  Absolute value rectification: $g(z) = |z|$
  Leaky ReLU: $g(z, \alpha) = \max(0, z) + \alpha \min(0, z)$, $\alpha = 0.01$
  Parametric ReLU: treat $\alpha$ as a learnable parameter
Logistic sigmoid and hyperbolic tangent: use the activation function $g(z) = \sigma(z)$ or $g(z) = \tanh(z)$
Other hidden units: RBF, softplus and hard tanh
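The rectifier variants listed above are all instances of $g(z, \alpha) = \max(0, z) + \alpha \min(0, z)$. The sketch below shows this under that assumption; the function names are illustrative, and the choice $\alpha = -1$ for absolute value rectification follows from $\max(0, z) - \min(0, z) = |z|$.

```python
import numpy as np

def rectifier(z, alpha=0.0):
    """Generalized rectifier g(z, alpha) = max(0, z) + alpha * min(0, z)."""
    z = np.asarray(z, dtype=float)
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

def relu(z):
    return rectifier(z, alpha=0.0)      # plain ReLU

def leaky_relu(z, alpha=0.01):
    return rectifier(z, alpha=alpha)    # small slope for z < 0

def abs_rectifier(z):
    return rectifier(z, alpha=-1.0)     # g(z) = |z|

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), leaky_relu(z), abs_rectifier(z))
```

A parametric ReLU would use the same formula but update `alpha` by gradient descent along with the weights.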
Architecture Design
Architecture refers to the overall structure of the network: how many units it should have and how these units should be connected to each other.
  Universal approximation properties and depth
  Other architectural considerations
Back-Propagation and other differentiation algorithms
Back-propagation allows information from the cost to flow backward through the network in order to compute the gradient.
Back-propagation uses the chain rule to compute the gradients layer by layer, from the output back toward the input.
Back-propagation requires the activation function to be differentiable (at least almost everywhere, as with ReLU).
Back-Propagation and other differentiation algorithms
Suppose we have a loss function $E$ and a three-layer network $y = f(h(x))$.
Our goal is to minimize the loss function and obtain a solution for the weights $w^{(1)}$ from the input layer to the hidden layer and the weights $w^{(2)}$ from the hidden layer to the output units.
$E = \frac{1}{2} \| o - t \|^2$, where $o$ is the output and $t$ is the target value.
Back-Propagation and other differentiation algorithms
Back-propagated error for an output unit: $o_j^{(2)}$ is the value of the $j$-th output, $t_j$ is the target value, and $w_{ij}^{(2)}$ is the weight from the $i$-th hidden unit to the $j$-th output unit.
Figure: the right part of each circle is the target function and the left part is the gradient of the target function.
Back-Propagation and other differentiation algorithms
Back-propagated error for the hidden layer: $o_j^{(1)}$ is the value of the $j$-th hidden unit, $t_j$ is the target value, $w_{jq}^{(2)}$ is the weight from the $j$-th hidden unit to the $q$-th output unit, and $\delta_q^{(2)}$ is the back-propagated error for the $q$-th output unit.
Back-Propagation and other differentiation algorithms
The back-propagation algorithm can be divided into two phases:
Phase 1: Propagation
  Forward propagation of a training pattern's input through the neural network in order to generate the network's output value(s).
  Backward propagation of the output activations through the neural network, using the training pattern's target, in order to generate the deltas (derived from the difference between the targeted and actual output values) of all output and hidden neurons.
Phase 2: Weight update
  The weight's output delta and input activation are multiplied to find the gradient of the weight.
  A ratio (percentage) of the weight's gradient, set by the learning rate, is subtracted from the weight.
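The two phases can be sketched for a small three-layer network. This is an illustrative sketch, not the slides' own pseudocode: it assumes sigmoid units in both layers, no bias terms, the loss $E = \frac{1}{2}\|o - t\|^2$, and a hypothetical learning rate `lr`. The weight matrices are stored so that `W2[j, i]` connects hidden unit `i` to output unit `j`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    """Phase 1a: forward propagation through hidden and output layers."""
    o1 = sigmoid(W1 @ x)    # hidden activations o^(1)
    o2 = sigmoid(W2 @ o1)   # output activations o^(2)
    return o1, o2

def loss(x, t, W1, W2):
    _, o2 = forward(x, W1, W2)
    return 0.5 * np.sum((o2 - t) ** 2)

def backprop(x, t, W1, W2):
    """Phase 1b: backward propagation of the deltas via the chain rule."""
    o1, o2 = forward(x, W1, W2)
    delta2 = o2 * (1.0 - o2) * (o2 - t)          # output-layer error
    delta1 = o1 * (1.0 - o1) * (W2.T @ delta2)   # hidden-layer error
    # Gradient of each weight = its output delta times its input activation.
    return np.outer(delta1, x), np.outer(delta2, o1)

def sgd_step(x, t, W1, W2, lr=0.5):
    """Phase 2: subtract a fraction (lr) of each gradient from the weights."""
    g1, g2 = backprop(x, t, W1, W2)
    return W1 - lr * g1, W2 - lr * g2
```

A finite-difference comparison of `loss` against the output of `backprop` is a useful sanity check when writing such code by hand.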
Back-Propagation and other differentiation algorithms
How back-propagation works in a three-layer network
Figure: Pseudocode for a stochastic gradient algorithm
Example: learning XOR
Since a linear approach fails on the original inputs, we can consider changing the input space.
Figure: A linear approach. Left: original x space. Right: learned h space.
In the learned h space a linear model can solve the problem, using one line to separate the two classes.
Example: learning XOR
How can we apply a nonlinear transformation to get the h space?
Use a neural network with $h = f^{(1)}(x)$ and $y = f^{(2)}(h)$. If both were linear, $f^{(1)}(x) = W^T x$ and $f^{(2)}(h) = h^T w$, the composition $f(x) = w^T W^T x$ would still be linear in $x$, so a nonlinearity is needed.
Use a hidden layer defined as: $h = g(W^T x + c)$
The activation function $g$ can be defined as the rectified linear unit (ReLU): $g(z) = \max\{0, z\}$
Example: learning XOR
The complete network is:
$f(x; W, c, w, b) = f^{(2)}(f^{(1)}(x)) = w^T \max\{0, W^T x + c\} + b$ (5)
Now walk through how the model processes a batch of inputs:
  Form the design matrix $X$ for the four points
  First step: compute $XW$
  Add $c$
  Compute $h$ by applying the ReLU
  Multiply by $w$
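The walk-through can be reproduced in a few lines of NumPy. The parameter values below are not given on the slide; they are the hand-picked solution used for this example in the Goodfellow, Bengio and Courville textbook, and the rows of `X` are ordered (0,0), (0,1), (1,0), (1,1).

```python
import numpy as np

# Design matrix: one XOR input per row.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Assumed parameters (textbook solution, not stated on the slide).
W = np.array([[1.0, 1.0], [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = 0.0

XW = X @ W                    # first step: XW
affine = XW + c               # add c (broadcast over rows)
h = np.maximum(0.0, affine)   # compute h with the ReLU
outputs = h @ w + b           # multiply by w

print(outputs)  # [0. 1. 1. 0.]
```

In the h space the four points become (0,0), (1,0), (1,0) and (2,1), which a single linear function can separate.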
Regularization
Regularization is widely used in machine learning methods.
Goal: reduce the generalization error but not the training error
Regularization
Parameter Norm Penalties
  $L^2$ Parameter Regularization
  $L^1$ Regularization
Norm Penalties as Constrained Optimization
Regularization and Under-Constrained Problems
Dataset Augmentation
Noise Robustness
  Injecting Noise at the Output Targets
Semi-Supervised Learning
Multitask Learning
Regularization and Optimization in Deep Learning Libo Wang Department of Statistics Florida State University Mar 3rd, 2017
Early Stopping
Motivation:
  Avoiding over-fitting and over-optimization
  Reducing complexity and computational cost (too many layers and nodes in the neural network)
  Having relatively large training and validation datasets
Early Stopping
Most commonly used form of regularization in deep learning, because of its effectiveness and simplicity
Computational cost: (1) running the validation set periodically, (2) maintaining a copy of the best parameters
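Early Stopping
Problem of early stopping: the validation set is held out, so training does not include all of the data
  Strategy 1: Re-initialize the model and retrain on all of the data for the same number of steps that early stopping selected
  Strategy 2: Keep the parameters obtained from the first round and continue training on all of the data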
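A minimal sketch of an early-stopping loop follows. It is a generic illustration rather than the procedure from the slides: `step_fn`, `val_loss_fn`, `eval_every` and `patience` are hypothetical names, and stopping after `patience` evaluations without improvement is one common criterion among several. The loop shows the two costs mentioned earlier (periodic validation and keeping a copy of the best parameters) and returns the best step count, which Strategy 1 would reuse.

```python
import copy

def train_with_early_stopping(step_fn, val_loss_fn, params,
                              eval_every=1, patience=3, max_steps=1000):
    """Train until the validation loss stops improving.

    step_fn(params) -> params after one training step (hypothetical).
    val_loss_fn(params) -> validation loss (hypothetical).
    """
    best_params = copy.deepcopy(params)
    best_loss = float("inf")
    best_step = 0
    bad_evals = 0
    step = 0
    while step < max_steps and bad_evals < patience:
        for _ in range(eval_every):
            params = step_fn(params)
        step += eval_every
        current = val_loss_fn(params)       # cost (1): periodic validation
        if current < best_loss:
            best_loss = current
            best_params = copy.deepcopy(params)  # cost (2): keep best copy
            best_step = step
            bad_evals = 0
        else:
            bad_evals += 1
    return best_params, best_step, best_loss

# Toy usage: the "parameter" is a number and the validation loss is
# smallest at 5, so training past step 5 only makes things worse.
best, steps, val = train_with_early_stopping(
    step_fn=lambda p: p + 1,
    val_loss_fn=lambda p: (p - 5) ** 2,
    params=0,
)
print(best, steps, val)  # 5 5 0
```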
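Early Stopping
How does early stopping act as $L^2$ regularization, theoretically?
Let $H = Q \Sigma Q^T$ be the eigendecomposition of the Hessian of the cost at its minimizer $\theta^*$ (quadratic approximation).
Gradient descent for $\tau$ steps with learning rate $\epsilon$, starting from $\theta^{(0)} = 0$: $Q^T \theta^{(\tau)} = [I - (I - \epsilon \Sigma)^{\tau}] Q^T \theta^*$
$L^2$ with coefficient $\alpha$: $Q^T \tilde{\theta} = (\Sigma + \alpha I)^{-1} \Sigma Q^T \theta^* = [I - (\Sigma + \alpha I)^{-1} \alpha] Q^T \theta^*$
The two solutions agree when $(I - \epsilon \Sigma)^{\tau} = (\Sigma + \alpha I)^{-1} \alpha$.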
Ensemble method
When and why does model averaging work?
How does bagging work?
Example: Netflix Grand Prize (Koren, 2009)
Dropout
Motivation: provides an inexpensive approximation to training and evaluating a bagged ensemble of networks.
Each hidden unit is set to 0 with probability $p$: $h^{(k)}(x) = g(a^{(k)}(x)) \odot m^{(k)}$, where $m^{(k)}$ is a binary mask
Dropout models share parameters that are inherited from the parent neural network.
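The masking step can be sketched as follows. This is an illustration under stated assumptions: the activation `g` is taken to be a ReLU, `p` is the probability of dropping a unit (so each mask entry is 1 with probability `1 - p`), and no rescaling of the surviving activations is applied. The function name `dropout_layer` is hypothetical.

```python
import numpy as np

def dropout_layer(a, p, rng):
    """Return h = g(a) * m and the mask m for pre-activations a."""
    h = np.maximum(0.0, a)                          # g = ReLU (assumed)
    # Each unit is kept (mask 1) with probability 1 - p, dropped otherwise.
    m = (rng.random(a.shape) >= p).astype(a.dtype)
    return h * m, m

rng = np.random.default_rng(0)
a = np.array([1.0, -2.0, 3.0, 4.0])
h, m = dropout_layer(a, p=0.5, rng=rng)
print(m, h)
```

Every call samples a new mask, which corresponds to training a different sub-network that shares its weights with the parent network.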
Dropout
The sharing of weights among dropout models means that every model is very strongly regularized (Wager, 2013; Hinton, 2012).
Dropout can be a better regularizer than $L^2$ or $L^1$ penalties because it pulls the weights towards what the other models want instead of towards 0.
For a single hidden layer, it is equivalent to taking the geometric average of all neural networks, with all possible binary masks:
$\tilde{p}(y \mid x) = \sqrt[2^d]{\prod_{\theta} p(y \mid x, \theta)}$, where the product runs over the $2^d$ masked sub-models $\theta$
Adversarial Training
Causes of adversarial examples: excessive linearity
Advantages?
Optimization in Deep Learning
Challenges in neural network optimization
Second-order method
Basic algorithm: SGD
Improvements: Momentum, Polyak averaging
Popular variants: AdaGrad, RMSProp, Adam...
Challenges in neural network optimization
Ill-conditioning
Local minima
Saddle points and flat regions
Second-order method
Newton's method, with a regularized update for non-convex surfaces or saddle points:
$\theta^* = \theta_0 - [H(f(\theta_0)) + \alpha I]^{-1} \nabla_\theta f(\theta_0)$
Conjugate gradients: avoid calculating $H^{-1}$
$d_t = \nabla_\theta J(\theta) + \beta_t d_{t-1}$
$\beta_t = \frac{(\nabla_\theta J(\theta_t) - \nabla_\theta J(\theta_{t-1}))^T \nabla_\theta J(\theta_t)}{\nabla_\theta J(\theta_{t-1})^T \nabla_\theta J(\theta_{t-1})}$
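The regularized Newton update can be illustrated on a small quadratic. This is a sketch under assumptions: the test function $f(\theta) = \frac{1}{2}\theta^T A \theta - b^T \theta$, the matrix `A` and the vector `b` are invented for the example, and the linear system is solved directly instead of forming the inverse.

```python
import numpy as np

def regularized_newton_step(theta0, grad, hess, alpha=0.0):
    """theta = theta0 - [H(theta0) + alpha * I]^{-1} grad(theta0)."""
    H = hess(theta0) + alpha * np.eye(theta0.size)
    # Solve the linear system rather than inverting H explicitly.
    return theta0 - np.linalg.solve(H, grad(theta0))

# Hypothetical quadratic: gradient A @ theta - b, constant Hessian A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda theta: A @ theta - b
hess = lambda theta: A

theta0 = np.zeros(2)
print(regularized_newton_step(theta0, grad, hess))             # exact minimizer
print(regularized_newton_step(theta0, grad, hess, alpha=1.0))  # damped step
```

With `alpha = 0` a single step lands on the minimizer of a quadratic; a positive `alpha` shortens the step, which is what makes the update usable when the Hessian has small or negative eigenvalues.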
Basic algorithm: Stochastic Gradient Descent
Motivation: for large datasets it is redundant to compute the sum of the gradients over all training examples for every update.
Advantage: performs a parameter update for each training example (faster updating).
Learning rate decay: $\epsilon_k = (1 - \alpha) \epsilon_0 + \alpha \epsilon_\tau$ with $\alpha = k / \tau$; after iteration $\tau$ the learning rate is held at $\epsilon_\tau$
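The decay schedule and the per-example update can be sketched together. The data, the one-parameter model $y = w x$ and the specific values of `eps0`, `eps_tau` and `tau` below are assumptions made for illustration only.

```python
import numpy as np

def linear_decay_lr(k, eps0, eps_tau, tau):
    """eps_k = (1 - alpha) * eps0 + alpha * eps_tau, alpha = k / tau,
    held constant at eps_tau once k reaches tau."""
    alpha = min(k / tau, 1.0)
    return (1.0 - alpha) * eps0 + alpha * eps_tau

# Toy problem: recover w_true from noise-free samples of y = w_true * x.
rng = np.random.default_rng(0)
x_data = rng.uniform(-1.0, 1.0, size=200)
w_true = 2.0
y_data = w_true * x_data

w = 0.0
for k in range(2000):
    i = rng.integers(len(x_data))                 # pick one training example
    g = (w * x_data[i] - y_data[i]) * x_data[i]   # gradient of 0.5 * error^2
    w -= linear_decay_lr(k, eps0=0.5, eps_tau=0.01, tau=100) * g

print(w)
```

Each update touches a single example, so its cost does not grow with the size of the dataset.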
Improvements: Momentum, Polyak averaging
Reasons to use momentum:
  SGD has trouble navigating ravines (regions where the surface curves much more steeply in one dimension than in another), which are common around local optima.
  Momentum helps accelerate SGD in the relevant direction and dampens oscillations.
Improvements: Momentum, Polyak averaging
Algorithm of momentum
Improvements: Momentum, Polyak averaging
Nesterov Momentum
Benefit: Nesterov momentum first makes a big jump in the direction of the previously accumulated gradient, then measures the gradient at that point and makes a correction.
It is better to make a correction after the mistake has been made.
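Classical and Nesterov momentum differ only in where the gradient is measured, which the sketch below makes explicit. It uses a common formulation with a velocity `v`, a momentum coefficient `mu` and a learning rate `lr`; these names and the toy objective are assumptions, not taken from the slides.

```python
import numpy as np

def momentum_step(theta, v, grad_fn, lr=0.1, mu=0.9, nesterov=False):
    """One momentum update; returns the new parameters and velocity."""
    # Nesterov: jump along the accumulated velocity first, then measure
    # the gradient there. Classical: measure it at the current point.
    lookahead = theta + mu * v if nesterov else theta
    g = grad_fn(lookahead)
    v = mu * v - lr * g
    return theta + v, v

# Toy objective f(theta) = 0.5 * theta^2, whose gradient is theta.
grad_fn = lambda theta: theta

theta, v = np.array([1.0]), np.array([0.0])
for _ in range(500):
    theta, v = momentum_step(theta, v, grad_fn)

theta_n, v_n = np.array([1.0]), np.array([0.0])
for _ in range(500):
    theta_n, v_n = momentum_step(theta_n, v_n, grad_fn, nesterov=True)

print(theta, theta_n)
```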
Improvements: Momentum, Polyak averaging
Polyak-Ruppert averaging: to reduce the variance of the estimate, we can average the iterates using
$\bar{\theta}_k = \frac{1}{k} \sum_{t=1}^{k} \theta_t$
This can be implemented recursively as
$\bar{\theta}_t = \bar{\theta}_{t-1} - \frac{1}{t} (\bar{\theta}_{t-1} - \theta_t)$
The $\theta_k$ estimates quickly converge to near the optimum and then wander around it, while $\bar{\theta}_k$ averages out these fluctuations.
We should not start the averaging process until after a burn-in phase.
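The recursive form of the average avoids storing all iterates. A minimal sketch, assuming the iterates are given as a sequence of arrays and that the first `burn_in` of them are simply skipped (the function name and the `burn_in` argument are illustrative):

```python
import numpy as np

def polyak_average(iterates, burn_in=0):
    """Running average of the iterates after discarding a burn-in prefix."""
    avg = None
    count = 0
    for t, theta in enumerate(iterates):
        if t < burn_in:
            continue                      # skip the burn-in phase
        theta = np.asarray(theta, dtype=float)
        count += 1
        if avg is None:
            avg = theta.copy()
        else:
            # avg_t = avg_{t-1} - (avg_{t-1} - theta_t) / t
            avg = avg - (avg - theta) / count
    return avg

iterates = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 9.0])]
print(polyak_average(iterates))             # [3. 5.]
print(polyak_average(iterates, burn_in=1))  # [4.  6.5]
```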
Popular variants: AdaGrad, RMSProp, Adam...
AdaGrad: well suited to sparse data in which infrequent features are highly predictive, for example text mining
Benefit: largely eliminates the need to manually tune the learning rate
Weakness: the accumulated sum of squared gradients keeps growing during training, so the effective learning rate shrinks and eventually becomes infinitesimally small
Popular variants: AdaGrad, RMSProp, Adam...
RMSProp: an extension of AdaGrad that deals with its radically diminishing learning rates by replacing the accumulated sum with an exponentially decaying average of squared gradients
Popular variants: AdaGrad, RMSProp, Adam...
Adam: adds bias-correction and momentum to RMSProp
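The relationship between the three methods is easiest to see side by side. The sketch below uses commonly quoted default hyperparameters, which are assumptions here and not values from the slides; `r`, `m` and `v` are the running statistics each method keeps per parameter.

```python
import numpy as np

def adagrad_step(theta, g, r, lr=0.01, eps=1e-8):
    r = r + g * g                              # sum of squared gradients
    return theta - lr * g / (np.sqrt(r) + eps), r

def rmsprop_step(theta, g, r, lr=0.001, rho=0.9, eps=1e-8):
    r = rho * r + (1.0 - rho) * g * g          # decaying average instead
    return theta - lr * g / (np.sqrt(r) + eps), r

def adam_step(theta, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * g          # momentum (first moment)
    v = beta2 * v + (1.0 - beta2) * g * g      # RMSProp-style second moment
    m_hat = m / (1.0 - beta1 ** t)             # bias correction, t >= 1
    v_hat = v / (1.0 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

theta = np.array([1.0, -2.0])
g = np.array([0.5, -4.0])
zeros = np.zeros_like(theta)

print(adagrad_step(theta, g, zeros)[0])
print(rmsprop_step(theta, g, zeros)[0])
print(adam_step(theta, g, zeros, zeros, t=1)[0])
```

With a constant gradient, the AdaGrad step after `k` updates is proportional to `1 / sqrt(k)`, which is the shrinking learning rate named as its weakness, while the RMSProp and Adam steps do not decay to zero.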
Popular variants: AdaGrad, RMSProp, Adam...
Summary: Which optimizer to use?
For fast convergence when training a deep or complex neural network, we should choose one of the adaptive learning rate methods.
With adaptive learning rate methods we usually do not need to tune the learning rate and are likely to achieve good results with the default value.
RMSProp, Adadelta, and Adam are very similar algorithms that do well in similar circumstances.
Kingma et al. [15] show that its bias-correction helps Adam slightly outperform RMSProp towards the end of optimization as gradients become sparser.