Deep Feedforward Networks


Deep Feedforward Networks. Liu Yang. March 30, 2017

Overview
1. Background: a general introduction; example
2. Gradient-based learning: cost functions; output units
3. Hidden units
4. Architecture design
5. Back-propagation and other differentiation algorithms
6. Regularization in deep learning

Background: A general introduction
Deep feedforward networks
- A deep feedforward network is also called a feedforward neural network or multilayer perceptron (MLP).
- The goal of a deep feedforward network is to approximate (learn) a function $y = f^*(x)$.
- Why "feedforward": there are no feedback connections through which outputs are fed back into the model.
- A defining property of a deep feedforward network is that it composes many different functions: $f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))$.

Figure: An example

Background: Example
Example: learning XOR
- XOR is an operation on two binary values: it returns 1 if exactly one of the values equals 1, and 0 otherwise.
- Input: $X = \{(0, 0), (0, 1), (1, 1), (1, 0)\}$
- Suppose we would like to fit a model $y = f(x; \theta)$ to the target function $f^*$. The loss function is then

  $J(\theta) = \frac{1}{4} \sum_{x \in X} \left(f^*(x) - f(x; \theta)\right)^2$  (1)

Example: learning XOR
- A linear model can be used as a first attempt: $f(x; w, b) = x^\top w + b$, that is, $f(x; w, b) = x_1 w_1 + x_2 w_2 + b$.
- Solution: $w = 0$, $b = 0.5$, which outputs 0.5 everywhere.
- Why does the linear function fail?
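
The stated solution of the linear model can be checked numerically. The following is a minimal NumPy sketch, assuming the model is fit by ordinary least squares on the four XOR points; the variable names are illustrative and do not come from the slides.

```python
import numpy as np

# The four XOR inputs (in the order used on the slide) and their targets.
X = np.array([[0, 0], [0, 1], [1, 1], [1, 0]], dtype=float)
y = np.array([0, 1, 0, 1], dtype=float)

# Append a column of ones so the bias b is fit together with w.
A = np.hstack([X, np.ones((4, 1))])

# Least-squares solution of A @ [w1, w2, b] ~= y.
params, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = params[:2], params[2]

predictions = A @ params
print("w =", w, "b =", b)
print("predictions =", predictions)
```

The fitted weights should be numerically zero and every prediction close to 0.5, which is why the linear model cannot distinguish the two classes.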

Linear approach for XOR
- The major limitation of a single-layer perceptron is that the two classes must be linearly separable. In the XOR example, no single line can separate the two classes, although two lines can.
- A multilayer perceptron can therefore be used to provide a solution.

Basic components
A deep feedforward network has several basic components:
- Cost function
- Output units
- Hidden units
- Architecture design
- Back-propagation algorithm

Gradient-based learning: Cost functions
- Learning a conditional distribution with maximum likelihood:

  $J(\theta) = -\mathbb{E}_{x, y \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(y \mid x)$  (2)

- Learning a conditional statistic:

  $f^* = \arg\min_f \mathbb{E}_{x, y \sim \hat{p}_{\text{data}}} \lVert y - f(x) \rVert^2$  (3)

Gradient-based learning: Output units
- Linear units for Gaussian output distributions: $\hat{y} = W^\top h + b$
- Sigmoid units for Bernoulli output distributions: $\hat{y} = \sigma(w^\top h + b)$
- Softmax units for multinoulli output distributions:

  $\text{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$  (4)

  where $z_i = \log \tilde{P}(y = i \mid x)$ is an unnormalized log-probability.
- Other output types: mixture units
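
The three output units listed above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the hidden features `h`, the weights `W`, and the bias `b` are hypothetical values, and subtracting the maximum inside `softmax` is a common numerical-stability trick rather than something stated on the slide.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps a real score to a Bernoulli probability."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Softmax over the last axis; subtracting the max avoids overflow."""
    shifted = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

h = np.array([0.5, -1.0, 2.0])            # hypothetical hidden features
W = np.array([[0.1, 0.2, 0.3],
              [0.0, -0.5, 0.4],
              [0.7, 0.1, -0.2]])           # hypothetical weights, one column per class
b = np.zeros(3)

z = W.T @ h + b      # linear unit: y_hat = W^T h + b
p = softmax(z)       # multinoulli output: probabilities over three classes
q = sigmoid(z[0])    # Bernoulli output computed from a single score
```

Because softmax is invariant to adding a constant to every score, `softmax(z + 100.0)` gives the same probabilities as `softmax(z)`.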

Hidden units
- Activation functions are used to compute the hidden layer values. How should we choose the type of hidden unit to use in the hidden layers?
- Rectified linear units (ReLU)
  - ReLUs use the activation function $g(z) = \max\{0, z\}$.
  - ReLUs are applied on top of an affine transformation: $h = g(W^\top x + b)$.
  - Noisy ReLU: $g(z) = \max(0, z + Y)$, $Y \sim \mathcal{N}(0, \sigma(z))$
  - Absolute value rectification: $g(z) = |z|$
  - Leaky ReLU: $g(z, \alpha) = \max(0, z) + \alpha \min(0, z)$, with $\alpha = 0.01$
  - Parametric ReLU: treat $\alpha$ as a learnable parameter
- Logistic sigmoid and hyperbolic tangent: $g(z) = \sigma(z)$ or $g(z) = \tanh(z)$
- Other hidden units: RBF, softplus, and hard tanh
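
Several of the activation functions listed above are one-liners in NumPy. The sketch below is illustrative; the function names are my own, and the hard tanh is assumed to clip to the interval [-1, 1].

```python
import numpy as np

def relu(z):
    # g(z) = max{0, z}
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # g(z, alpha) = max(0, z) + alpha * min(0, z)
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

def abs_rectification(z):
    # g(z) = |z|, the same formula with alpha = -1
    return np.abs(z)

def softplus(z):
    # log(1 + exp(z)), computed stably as logaddexp(0, z)
    return np.logaddexp(0.0, z)

def hard_tanh(z):
    # assumed form: clip to [-1, 1]
    return np.clip(z, -1.0, 1.0)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))
print(leaky_relu(z))
```

A parametric ReLU uses the same formula as `leaky_relu`, with `alpha` updated by gradient descent along with the weights.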

Architecture design
- Architecture refers to the overall structure of the network: how many units it should have and how these units should be connected to each other.
- Universal approximation properties and depth
- Other architectural considerations

Back-propagation and other differentiation algorithms
- Back-propagation allows information from the cost to flow backward through the network in order to compute the gradient.
- Back-propagation uses the chain rule to compute the gradients layer by layer.
- Back-propagation requires the activation functions to be differentiable.

- Suppose we have a loss function $E$ and a three-layer network $y = f(h(x))$.
- Our goal is to minimize the loss function and obtain the weights $w^{(1)}$ from the input layer to the hidden layer and the weights $w^{(2)}$ from the hidden layer to the output units.
- $E = \frac{1}{2} \lVert o - t \rVert^2$, where $o$ is the output and $t$ is the target value.

Back-propagated error for an output unit:
- $o^{(2)}_j$ is the value of the $j$-th output unit, $t_j$ is its target value, and $w^{(2)}_{ij}$ is the weight from the $i$-th hidden unit to the $j$-th output unit.
- In the diagram, the right part of each circle is the function evaluated at that unit and the left part is its gradient.

Back-propagated error for a hidden unit:
- $o^{(1)}_j$ is the value of the $j$-th hidden unit, $t_j$ is the target value, $w^{(2)}_{jq}$ is the weight from the $j$-th hidden unit to the $q$-th output unit, and $\delta^{(2)}_q$ is the back-propagated error for the $q$-th output unit.

The back-propagation algorithm can be divided into two phases.
Phase 1: Propagation
- Forward propagation of a training pattern's input through the neural network in order to generate the network's output value(s).
- Backward propagation of the output activations through the neural network, using the training pattern's target, in order to generate the deltas (the differences between the target and actual output values) of all output and hidden neurons.
Phase 2: Weight update
- The weight's output delta and input activation are multiplied to find the gradient of the weight.
- A ratio (percentage) of the weight's gradient is subtracted from the weight.
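
The two phases can be sketched for a single training pattern as follows. This is a minimal illustration, not the slides' own pseudocode: it assumes sigmoid activations in both layers, the squared-error loss $E = \frac{1}{2}\lVert o - t\rVert^2$, no bias terms, and hypothetical random weights `W1` and `W2`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    """Phase 1a: forward propagation through the hidden and output layers."""
    o1 = sigmoid(W1.T @ x)     # hidden activations
    o2 = sigmoid(W2.T @ o1)    # output activations
    return o1, o2

def backward(x, t, W2, o1, o2):
    """Phase 1b: back-propagate the deltas for E = 0.5 * ||o2 - t||^2."""
    delta2 = (o2 - t) * o2 * (1.0 - o2)          # output-layer error
    delta1 = (W2 @ delta2) * o1 * (1.0 - o1)     # hidden-layer error
    # Gradient of a weight = its input activation times its output delta.
    grad_W1 = np.outer(x, delta1)
    grad_W2 = np.outer(o1, delta2)
    return grad_W1, grad_W2

def loss(x, t, W1, W2):
    return 0.5 * np.sum((forward(x, W1, W2)[1] - t) ** 2)

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0])           # one training pattern
t = np.array([1.0])                # its target
W1 = rng.normal(size=(2, 3))       # input -> hidden weights (hypothetical)
W2 = rng.normal(size=(3, 1))       # hidden -> output weights (hypothetical)

o1, o2 = forward(x, W1, W2)
grad_W1, grad_W2 = backward(x, t, W2, o1, o2)

# Phase 2: subtract a fraction (the learning rate) of each gradient.
lr = 0.1
W1_new = W1 - lr * grad_W1
W2_new = W2 - lr * grad_W2
```

A useful sanity check for any hand-written backward pass is to compare the gradients against central finite differences of `loss`.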

How back-propagation works in a three-layer network
Figure: Pseudocode for a stochastic gradient algorithm

Example: learning XOR
- Because a linear approach fails in the original input space, we can consider changing the input space.
- Figure (left): original $x$ space. Figure (right): learned $h$ space.
- In the $h$ space, a linear model can solve the problem, using one line to separate the two classes.
Figure: A linear approach

Example: learning XOR
- How can we apply a nonlinear transformation to obtain an $h$ space?
- Use a neural network: $f^{(1)}(x) = W^\top x$ and $f^{(2)}(h) = h^\top w$.
- Define the hidden layer as $h = g(W^\top x + c)$.
- The activation function $g$ can be the rectified linear unit (ReLU): $g(z) = \max\{0, z\}$.

Example: learning XOR
- The complete network is

  $f(x; W, c, w, b) = f^{(2)}(f^{(1)}(x)) = w^\top \max\{0, W^\top x + c\} + b$  (5)

- To see how the model processes a batch of inputs:
  1. Form the design matrix $X$ for the four points.
  2. First step: compute $XW$.
  3. Add $c$.
  4. Compute $h$ by applying the ReLU.
  5. Multiply by $w$.
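
The batch walk-through can be reproduced step by step in NumPy. The slide does not give parameter values, so the sketch below assumes the hand-picked solution from the worked XOR example in the Deep Learning textbook ($W$ all ones, $c = (0, -1)$, $w = (1, -2)$, $b = 0$); treat these values as an assumption.

```python
import numpy as np

# Design matrix: the four XOR inputs, one per row, and their targets.
X = np.array([[0, 0], [0, 1], [1, 1], [1, 0]], dtype=float)
targets = np.array([0, 1, 0, 1], dtype=float)

# Assumed parameter values (textbook solution, not given on the slide).
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = 0.0

XW = X @ W                              # first step: XW
pre_activation = XW + c                 # add c (broadcast over the rows)
H = np.maximum(0.0, pre_activation)     # compute h with the ReLU
outputs = H @ w + b                     # multiply by w and add b

print(H)
print(outputs)
```

In the learned $h$ space the inputs (0, 1) and (1, 0) collapse onto the same point, which is what makes the final linear layer sufficient.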

Regularization in deep learning
- Regularization is widely used in machine learning methods.
- Goal: reduce the generalization error but not the training error.

Regularization
- Parameter norm penalties
  - $L^2$ parameter regularization
  - $L^1$ regularization
- Norm penalties as constrained optimization
- Regularization and under-constrained problems
- Dataset augmentation
- Noise robustness
  - Injecting noise at the output targets
- Semi-supervised learning
- Multitask learning

References
- Ian Goodfellow, Yoshua Bengio, and Aaron Courville (2017). Deep Learning.
- R. Rojas (1996). Neural Networks. Springer-Verlag.

Regularization and Optimization in Deep Learning Libo Wang Department of Statistics Florida State University Mar 3rd, 2017

Early stopping
Motivation:
- Avoiding over-fitting and over-optimization
- Reducing complexity and computational cost (too many layers and nodes in the neural network)
- Having relatively large training and validation datasets

Early stopping
- The most commonly used form of regularization in deep learning, because of its effectiveness and simplicity.
- Computational cost: (1) evaluating on the validation set periodically, and (2) maintaining a copy of the best parameters.

Early stopping
- Problem with early stopping: the held-out validation data are not used for training.
- Strategy 1: retrain on all of the data for the same number of steps.
- Strategy 2: keep the parameters obtained and continue training on all of the data.

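How does early stopping act as $L^2$ regularization, theoretically?
- Gradient descent for $\tau$ steps: $Q^\top \theta^{(\tau)} = [I - (I - \epsilon\Sigma)^{\tau}]\, Q^\top \theta^*$
- $L^2$ regularization: $Q^\top \tilde{\theta} = (\Sigma + \alpha I)^{-1} \Sigma\, Q^\top \theta^* = [I - (\Sigma + \alpha I)^{-1} \alpha]\, Q^\top \theta^*$
Here $\theta^*$ is the unregularized optimum, and $Q$ and $\Sigma$ hold the eigenvectors and eigenvalues of the Hessian of the cost at $\theta^*$.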
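
The correspondence can be checked numerically one eigenvalue at a time. For a Hessian eigenvalue $\lambda$, gradient descent stopped after $\tau$ steps with learning rate $\epsilon$ scales the matching component of the optimum by $1 - (1 - \epsilon\lambda)^{\tau}$, while $L^2$ regularization with coefficient $\alpha$ scales it by $\lambda / (\lambda + \alpha)$. The sketch below assumes the matching rule $\tau \approx 1/(\epsilon\alpha)$ from the textbook analysis (not stated on the slide) and uses hypothetical small eigenvalues, for which the approximation is expected to hold.

```python
import numpy as np

eps = 0.01        # learning rate (assumed value)
alpha = 1.0       # L2 coefficient (assumed value)
tau = int(round(1.0 / (eps * alpha)))    # matching number of steps

lam = np.array([0.01, 0.05, 0.1])        # hypothetical small Hessian eigenvalues

# Per-eigenvalue shrinkage applied to the unregularized optimum.
early_stopping_factor = 1.0 - (1.0 - eps * lam) ** tau
l2_factor = lam / (lam + alpha)

print(early_stopping_factor)
print(l2_factor)
```

The two factors should agree closely for these eigenvalues; the agreement degrades as $\lambda / \alpha$ grows, since the equivalence is only approximate.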

Ensemble methods
- When and why does model averaging work?
- How does bagging work?
- Example: Netflix Grand Prize (Koren, 2009)

Dropout
- Motivation: dropout provides an approximation to evaluating a bagged ensemble of networks.
- Each hidden unit is set to 0 with probability $p$: $h^{(k)}(x) = g(a^{(k)}(x)) \odot m^{(k)}$, where $m^{(k)}$ is a binary mask.
- Dropout models share parameters inherited from the parent neural network.
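
The masking step can be sketched as follows. This is an illustration under assumptions: the activation `g` defaults to tanh, the pre-activations are random placeholder values, and the function name `dropout_forward` is hypothetical.

```python
import numpy as np

def dropout_forward(a, p_drop, rng, g=np.tanh):
    """Compute h = g(a) * m, where each mask entry is 0 with probability p_drop."""
    m = (rng.random(a.shape) >= p_drop).astype(a.dtype)
    return g(a) * m, m

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 1000))     # hypothetical pre-activations of one layer
h, m = dropout_forward(a, p_drop=0.5, rng=rng)

print("fraction of units kept:", m.mean())
```

The sketch covers only the training-time mask. In practice the activations are also rescaled, either at test time (weight scaling) or during training (inverted dropout), so that their expected value matches between training and testing.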

Dropout
- Sharing weights among the dropout models means that every model is very strongly regularized (Wager, 2013; Hinton, 2012).
- Dropout can be a better regularizer than $L^2$ or $L^1$ penalties because it pulls the weights towards what other models want instead of towards 0.
- For a single hidden layer, it is equivalent to taking the geometric average of all neural networks, with all possible binary masks:

  $p(y \mid x) = \sqrt[2^d]{\prod_{\theta} p(y \mid x, \theta)}$

Adversarial training
- Cause of adversarial examples: excessive linearity
- Advantages?

Optimization in deep learning
- Challenges in neural network optimization
- Second-order methods
- Basic algorithm: SGD
- Improvements: momentum, Polyak averaging
- Popular variants: AdaGrad, RMSProp, Adam, ...

Challenges in neural network optimization
- Ill-conditioning
- Local minima
- Saddle points and flat regions

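Second-order methods
- Newton's method, with a regularized update for non-convex problems or saddle points:

  $\theta^* = \theta_0 - [H(f(\theta_0)) + \alpha I]^{-1} \nabla_\theta f(\theta_0)$

- Conjugate gradients, which avoid calculating $H^{-1}$:

  $d_t = \nabla_\theta J(\theta) + \beta_t d_{t-1}$

  $\beta_t = \frac{(\nabla_\theta J(\theta_t) - \nabla_\theta J(\theta_{t-1}))^\top \nabla_\theta J(\theta_t)}{\nabla_\theta J(\theta_{t-1})^\top \nabla_\theta J(\theta_{t-1})}$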
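
The effect of the $\alpha I$ term near a saddle point can be seen on a toy problem. The sketch below assumes a hypothetical two-dimensional quadratic with one positive and one negative curvature direction; the function name `damped_newton_step` is illustrative.

```python
import numpy as np

def damped_newton_step(theta0, grad, hess, alpha):
    """One step theta0 - (H + alpha I)^{-1} grad, using a linear solve."""
    H = hess(theta0)
    return theta0 - np.linalg.solve(H + alpha * np.eye(len(theta0)), grad(theta0))

# Hypothetical saddle: f(theta) = 0.5 * theta^T A theta with eigenvalues 2 and -0.5.
A = np.diag([2.0, -0.5])
f = lambda th: 0.5 * th @ A @ th
grad = lambda th: A @ th
hess = lambda th: A

theta0 = np.array([1.0, 1.0])
plain = damped_newton_step(theta0, grad, hess, alpha=0.0)    # undamped Newton
damped = damped_newton_step(theta0, grad, hess, alpha=1.0)   # alpha exceeds |-0.5|

print("plain Newton:", plain, "f =", f(plain))
print("damped Newton:", damped, "f =", f(damped))
```

On this example the undamped step lands exactly on the saddle point at the origin, while the damped step moves along the negative-curvature direction and lowers the objective. This works here because $\alpha$ was chosen larger than the magnitude of the negative eigenvalue.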

Basic algorithm: stochastic gradient descent
- Motivation: it is redundant to compute the sum of gradients over all examples for large datasets.
- Advantage: performs a parameter update for each training example (faster updating).
- Learning rate decay: $\epsilon_k = (1 - \alpha)\epsilon_0 + \alpha\epsilon_\tau$ with $\alpha = k/\tau$
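
The decay rule is a linear interpolation between an initial rate $\epsilon_0$ and a final rate $\epsilon_\tau$. A minimal sketch follows; holding the rate constant after iteration $\tau$ is a common convention and an assumption here, and the numeric values are placeholders.

```python
def learning_rate(k, eps0, eps_tau, tau):
    """Linearly decay from eps0 to eps_tau over tau iterations, then hold."""
    if k >= tau:
        return eps_tau            # assumed: constant after iteration tau
    alpha = k / tau
    return (1.0 - alpha) * eps0 + alpha * eps_tau

# Hypothetical schedule: from 0.1 down to 0.001 over 100 iterations.
for k in (0, 50, 100, 200):
    print(k, learning_rate(k, eps0=0.1, eps_tau=0.001, tau=100))
```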

Improvements: momentum, Polyak averaging
Reasons to use momentum:
- SGD has trouble navigating ravines, which are common around local optima.
- Momentum helps accelerate SGD in the relevant direction and dampens oscillations.

Algorithm: momentum

Nesterov momentum
- Benefit: Nesterov momentum first makes a big jump in the direction of the previously accumulated gradient, then measures the gradient at that point and makes a correction.
- It is better to correct a mistake after making it.
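
Classical and Nesterov momentum differ only in where the gradient is measured. The sketch below is an illustration under assumptions: it uses full (non-stochastic) gradients of a hypothetical ill-conditioned quadratic standing in for a ravine, and the learning rate and momentum coefficient are placeholder values.

```python
import numpy as np

def gd_momentum(grad, theta, lr, momentum, steps, nesterov=False):
    """Gradient descent with classical or Nesterov momentum.

    v <- momentum * v - lr * grad(point);  theta <- theta + v
    Classical momentum evaluates the gradient at theta; Nesterov momentum
    evaluates it after the jump, at theta + momentum * v.
    """
    v = np.zeros_like(theta)
    for _ in range(steps):
        point = theta + momentum * v if nesterov else theta
        v = momentum * v - lr * grad(point)
        theta = theta + v
    return theta

# Hypothetical ravine: f(x, y) = 0.5 * (x^2 + 100 * y^2), minimum at the origin.
A = np.diag([1.0, 100.0])
grad = lambda th: A @ th
theta0 = np.array([1.0, 1.0])

plain = gd_momentum(grad, theta0, lr=0.005, momentum=0.0, steps=100)
classical = gd_momentum(grad, theta0, lr=0.005, momentum=0.9, steps=100)
nesterov = gd_momentum(grad, theta0, lr=0.005, momentum=0.9, steps=100, nesterov=True)

for name, th in [("plain", plain), ("classical", classical), ("nesterov", nesterov)]:
    print(name, np.linalg.norm(th))
```

With these settings both momentum variants should end much closer to the minimum than plain gradient descent, which progresses slowly along the shallow direction of the ravine.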

Polyak-Ruppert averaging
- To reduce the variance of the estimate, we can average the iterates:

  $\bar{\theta}_k = \frac{1}{k} \sum_{t=1}^{k} \theta_t$

- This can be implemented recursively as

  $\bar{\theta}_k = \bar{\theta}_{k-1} - \frac{1}{k} (\bar{\theta}_{k-1} - \theta_k)$

- The $\theta_k$ estimates quickly converge to near the optimum and then wander around it, while $\bar{\theta}_k$ averages out these fluctuations.
- We should not start the averaging process until after a burn-in phase.
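
The recursive form of the running average needs only the previous average and the current iterate. The sketch below uses hypothetical noisy iterates that wander around an optimum at 3.0 in place of real SGD output, and omits the burn-in phase for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical iterates fluctuating around an optimum at 3.0.
iterates = 3.0 + rng.normal(scale=0.5, size=1000)

theta_bar = 0.0
for k, theta_k in enumerate(iterates, start=1):
    # Running mean: theta_bar_k = theta_bar_{k-1} - (theta_bar_{k-1} - theta_k) / k
    theta_bar = theta_bar - (theta_bar - theta_k) / k

print("last iterate:", iterates[-1])
print("averaged estimate:", theta_bar)
```

The recursion reproduces the plain arithmetic mean of the iterates without storing them, and the averaged estimate fluctuates far less than any single iterate.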

Popular variants: AdaGrad, RMSProp, Adam, ...
AdaGrad
- Well suited to features that are infrequent but predictive, for example in text mining.
- Benefit: reduces the need to manually tune the learning rate.
- Weakness: the accumulated sum of squared gradients keeps growing during training, so the learning rate shrinks and eventually becomes infinitesimally small.

RMSProp: an extension of AdaGrad that deals with its radically diminishing learning rates.

Adam: adds bias correction and momentum to RMSProp.
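
The relationship between the three methods is easiest to see with the update rules side by side. The following is a minimal sketch of one step of each; the function names, the `state` dictionary, and the default hyperparameters are assumptions chosen to resemble commonly used values, not settings taken from the slides.

```python
import numpy as np

def adagrad_step(theta, g, state, lr=0.01, delta=1e-7):
    # Accumulate the sum of squared gradients; it can only grow.
    state["r"] = state.get("r", 0.0) + g * g
    return theta - lr * g / (delta + np.sqrt(state["r"]))

def rmsprop_step(theta, g, state, lr=0.001, rho=0.9, delta=1e-6):
    # Replace the sum with an exponentially decaying average.
    state["r"] = rho * state.get("r", 0.0) + (1.0 - rho) * g * g
    return theta - lr * g / np.sqrt(delta + state["r"])

def adam_step(theta, g, state, lr=0.001, beta1=0.9, beta2=0.999, delta=1e-8):
    state["t"] = state.get("t", 0) + 1
    t = state["t"]
    # First moment (momentum) and second moment (as in RMSProp).
    state["s"] = beta1 * state.get("s", 0.0) + (1.0 - beta1) * g
    state["r"] = beta2 * state.get("r", 0.0) + (1.0 - beta2) * g * g
    # Bias correction for the zero initialization of both moments.
    s_hat = state["s"] / (1.0 - beta1 ** t)
    r_hat = state["r"] / (1.0 - beta2 ** t)
    return theta - lr * s_hat / (np.sqrt(r_hat) + delta)

g = np.array([2.0, -0.5])          # hypothetical constant gradient
theta = np.zeros(2)

adagrad_state, steps = {}, []
for _ in range(3):
    new_theta = adagrad_step(theta, g, adagrad_state)
    steps.append(np.abs(new_theta - theta))
    theta = new_theta

adam_first = adam_step(np.zeros(2), g, {})
rmsprop_first = rmsprop_step(np.zeros(2), g, {})
```

With a constant gradient, the AdaGrad step sizes in `steps` shrink from one iteration to the next, which is the weakness noted above. Thanks to bias correction, the first Adam step has magnitude close to `lr` in every coordinate regardless of the gradient's scale.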

Summary: which optimizer to use?
- For fast convergence when training a deep or complex neural network, we should choose one of the adaptive learning rate methods.
- With adaptive learning rate methods we often do not need to tune the learning rate and are likely to achieve good results with the default value.
- RMSProp, Adadelta, and Adam are very similar algorithms that do well in similar circumstances.
- Kingma et al. [15] show that its bias correction helps Adam slightly outperform RMSProp towards the end of optimization as gradients become sparser.