Deep Feedforward Networks

Liu Yang, March 30, 2017

Overview
1. Background: a general introduction; example
2. Gradient based learning: cost functions; output units
3. Hidden units
4. Architecture design
5. Back-propagation and other differentiation algorithms
6. Regularization in deep learning

Background: A general introduction
Deep feedforward network
A deep feedforward network is also called a feedforward neural network or multilayer perceptron (MLP).
The goal of a deep feedforward network is to approximate (learn) a function $y = f^*(x)$.
It is called feedforward because there are no feedback connections that feed the outputs of the model back into itself.
A defining property of a deep feedforward network is that it composes many different functions: $f^{(3)}(f^{(2)}(f^{(1)}(x)))$.

Figure: An example

Background: Example
Example: learning XOR
XOR is an operation on two binary values: it returns 1 when exactly one of the two values equals 1, and 0 otherwise.
Input: $X = \{(0, 0), (0, 1), (1, 1), (1, 0)\}$
Suppose we fit a model $y = f(x; \theta)$ to learn the target function $f^*$. The loss function is then
$J(\theta) = \frac{1}{4} \sum_{x \in X} (f^*(x) - f(x; \theta))^2$ (1)

Example: learning XOR
A linear model can be used as a first try:
$f(x; w, b) = x^T w + b$, or equivalently $f(x; w, b) = x_1 w_1 + x_2 w_2 + b$
Solution: $w = 0$, $b = 0.5$, which outputs 0.5 everywhere.
Why does the linear function fail?

Linear approach for XOR
The major challenge for a single-layer perceptron network is that the two classes must be linearly separable. In the XOR example, no single line can separate the two classes, although two lines can.
A multilayer perceptron network can therefore be used to provide a solution.
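
The failure of the linear model can be checked numerically. The sketch below is only an illustration: the function names, the starting point, the step size and the number of steps are assumptions, not part of the slides. It minimizes the mean squared error from equation (1) with plain gradient descent.

```python
# XOR inputs (same order as on the slide) and their targets f*(x).
X = [(0, 0), (0, 1), (1, 1), (1, 0)]
Y = [0, 1, 0, 1]


def mse(w1, w2, b):
    """J = (1/4) * sum over the four points of (f*(x) - f(x))^2 for the linear model."""
    return sum((y - (x1 * w1 + x2 * w2 + b)) ** 2 for (x1, x2), y in zip(X, Y)) / 4


def fit_linear(steps=2000, lr=0.1):
    """Minimize J by gradient descent from an arbitrary (assumed) starting point."""
    w1, w2, b = 0.3, -0.2, 0.0
    for _ in range(steps):
        g1 = g2 = gb = 0.0
        for (x1, x2), y in zip(X, Y):
            err = x1 * w1 + x2 * w2 + b - y
            g1 += 2 * err * x1 / 4
            g2 += 2 * err * x2 / 4
            gb += 2 * err / 4
        w1, w2, b = w1 - lr * g1, w2 - lr * g2, b - lr * gb
    return w1, w2, b


w1, w2, b = fit_linear()
print(w1, w2, b, mse(w1, w2, b))
```

The weights shrink towards 0 and the bias towards 0.5, leaving a loss of 0.25: the best linear model predicts 0.5 for every input.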

Basic Components
There are several basic components of a deep feedforward network:
Cost function
Output units
Hidden units
Architecture design
Back-propagation algorithms

Gradient based learning: Cost functions
Learning a conditional distribution with maximum likelihood:
$J(\theta) = -\mathbb{E}_{x, y \sim \hat{p}_{data}} \log p_{model}(y \mid x)$ (2)
Learning conditional statistics:
$f^* = \arg\min_f \mathbb{E}_{x, y \sim \hat{p}_{data}} \lVert y - f(x) \rVert^2$ (3)

Gradient based learning: Output Units
Linear units for Gaussian output distributions: $\hat{y} = W^T h + b$
Sigmoid units for Bernoulli output distributions: $\hat{y} = \sigma(w^T h + b)$
Softmax units for Multinoulli output distributions:
$\mathrm{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$ (4)
where $z_i = \log \tilde{P}(y = i \mid x)$ is an unnormalized log-probability.
Other output types: mixture units
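
As a small illustration of the sigmoid and softmax output units, here is a minimal sketch in plain Python. The function names are illustrative, and subtracting the maximum before exponentiating is a common numerical safeguard that the slides do not mention.

```python
import math


def sigmoid(z):
    """Logistic sigmoid used by the Bernoulli output unit."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(z):
    """softmax(z)_i = exp(z_i) / sum_j exp(z_j), as in equation (4)."""
    m = max(z)  # shifting by the max leaves the result unchanged but avoids overflow
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
print(probs, sum(probs))
```

The outputs are positive and sum to 1, so they can be read as class probabilities.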

Hidden Units
Activation functions are used to compute the hidden layer values.
How should we choose the type of hidden unit to use in the hidden layers?
Rectified linear units (ReLU):
  ReLUs use the activation function $g(z) = \max\{0, z\}$.
  ReLUs are applied on top of an affine transformation: $h = g(W^T x + b)$.
  Noisy ReLU: $g(z) = \max(0, z + Y)$, $Y \sim \mathcal{N}(0, \sigma(z))$
  Absolute value rectification: $g(z) = |z|$
  Leaky ReLU: $g(z, \alpha) = \max(0, z) + \alpha \min(0, z)$, $\alpha = 0.01$
  Parametric ReLU: treat $\alpha$ as a learnable parameter
Logistic sigmoid and hyperbolic tangent: use the activation function $g(z) = \sigma(z)$ or $g(z) = \tanh(z)$.
Other hidden units: RBF, softplus and hard tanh
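
The activation functions listed above are simple enough to write out directly. This is a scalar sketch with illustrative function names. The slide only names softplus and hard tanh, so their formulas here, $\log(1 + e^z)$ and $\max(-1, \min(1, z))$, are the usual textbook definitions rather than something stated on the slide.

```python
import math


def relu(z):
    """g(z) = max{0, z}"""
    return max(0.0, z)


def absolute_rectifier(z):
    """g(z) = |z|"""
    return abs(z)


def leaky_relu(z, alpha=0.01):
    """g(z, alpha) = max(0, z) + alpha * min(0, z)"""
    return max(0.0, z) + alpha * min(0.0, z)


def softplus(z):
    """Smooth version of ReLU: log(1 + exp(z))."""
    return math.log1p(math.exp(z))


def hard_tanh(z):
    """Piecewise-linear tanh: clip z to [-1, 1]."""
    return max(-1.0, min(1.0, z))


for z in (-2.0, 0.0, 2.0):
    print(z, relu(z), absolute_rectifier(z), leaky_relu(z), softplus(z), hard_tanh(z))
```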

Architecture Design
Architecture refers to the overall structure of the network: how many units it should have and how these units should be connected to each other.
Universal approximation properties and depth
Other architectural considerations

Back-Propagation and other differentiation algorithms
Back-propagation allows information from the cost to flow backward through the network in order to compute the gradient.
Back-propagation uses the chain rule to iteratively compute the gradients for each layer.
Back-propagation requires the activation function to be differentiable (in practice, differentiable almost everywhere, as with ReLU).

Back-Propagation and other differentiation algorithms
Suppose we have a loss function $E$ and a three-layer network $y = f(h(x))$. Our goal is to minimize the loss function and obtain a solution for the weights $w^{(1)}$ from the input layer to the hidden layer and the weights $w^{(2)}$ from the hidden layer to the output unit.
$E = \frac{1}{2} \lVert o - t \rVert^2$, where $o$ is the output unit and $t$ is the target value.

Back-Propagation and other differentiation algorithms
Back-propagated error for the output unit: $o_j^{(2)}$ is the value of the $j$-th output unit, $t_j$ is the target value, and $w_{ij}^{(2)}$ is the weight from the $i$-th hidden unit to the $j$-th output unit.
In the diagram, the right part of each circle is the target function and the left part is the gradient of the target function.

Back-Propagation and other differentiation algorithms
Back-propagated error for the hidden layer: $o_j^{(1)}$ is the value of the $j$-th hidden unit, $t_j$ is the target value, $w_{jq}^{(2)}$ is the weight from the $j$-th hidden unit to the $q$-th output unit, and $\delta_q^{(2)}$ is the back-propagated error for the $q$-th output unit.

Back-Propagation and other differentiation algorithms
The back-propagation algorithm can be divided into two phases.
Phase 1: Propagation
  Forward propagation of a training pattern's input through the neural network in order to generate the network's output value(s).
  Backward propagation of the propagation's output activations through the neural network, using the training pattern's target, in order to generate the deltas (the difference between the targeted and actual output values) of all output and hidden neurons.
Phase 2: Weight update
  The weight's output delta and input activation are multiplied to find the gradient of the weight.
  A ratio (percentage) of the weight's gradient is subtracted from the weight. This ratio is the learning rate.
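
The two phases can be sketched for a small three-layer network. The slides' formulas for the deltas did not survive extraction, so the following is an illustration under stated assumptions rather than the author's exact derivation: sigmoid units in both layers, no bias terms, the loss $E = \frac{1}{2} \lVert o - t \rVert^2$, and hypothetical names such as `backprop` and `sgd_step`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, W2):
    """W1[i][j]: input i -> hidden j.  W2[j][q]: hidden j -> output q."""
    o1 = [sigmoid(sum(x[i] * W1[i][j] for i in range(len(x)))) for j in range(len(W1[0]))]
    o2 = [sigmoid(sum(o1[j] * W2[j][q] for j in range(len(o1)))) for q in range(len(W2[0]))]
    return o1, o2


def loss(x, t, W1, W2):
    _, o2 = forward(x, W1, W2)
    return 0.5 * sum((o - tq) ** 2 for o, tq in zip(o2, t))


def backprop(x, t, W1, W2):
    """Phase 1: forward pass, then deltas for output and hidden units."""
    o1, o2 = forward(x, W1, W2)
    # Output delta: derivative of the sigmoid times the output error.
    delta2 = [o * (1 - o) * (o - tq) for o, tq in zip(o2, t)]
    # Hidden delta: derivative of the sigmoid times the back-propagated output deltas.
    delta1 = [o1[j] * (1 - o1[j]) * sum(W2[j][q] * delta2[q] for q in range(len(delta2)))
              for j in range(len(o1))]
    # Gradient of each weight = input activation * output delta.
    grad_W2 = [[o1[j] * delta2[q] for q in range(len(delta2))] for j in range(len(o1))]
    grad_W1 = [[x[i] * delta1[j] for j in range(len(delta1))] for i in range(len(x))]
    return grad_W1, grad_W2


def sgd_step(x, t, W1, W2, lr=0.1):
    """Phase 2: subtract a fraction (the learning rate) of each gradient."""
    gW1, gW2 = backprop(x, t, W1, W2)
    new_W1 = [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(W1, gW1)]
    new_W2 = [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(W2, gW2)]
    return new_W1, new_W2


x, t = [0.5, -1.0], [1.0]
W1 = [[0.1, -0.3], [0.4, 0.2]]
W2 = [[0.7], [-0.5]]
before = loss(x, t, W1, W2)
for _ in range(50):
    W1, W2 = sgd_step(x, t, W1, W2)
print(before, loss(x, t, W1, W2))
```

A finite-difference comparison against `loss` is a cheap way to check that the analytic gradients are right before trusting a training run.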

Back-Propagation and other differentiation algorithms
How back-propagation works in a three-layer network
Figure: Pseudocode for a stochastic gradient algorithm

Example: learning XOR
Because a linear approach fails in the original input space, we can consider transforming the input space.
Figure: A linear approach. Left: original x space. Right: learned h space.
In the learned h space, a linear model can solve the problem, using one line to separate the two classes.

Example: learning XOR
How can we apply a nonlinear transformation to obtain the h space?
Use a neural network: $f^{(1)}(x) = W^T x$ and $f^{(2)}(h) = h^T w$
Use a hidden layer function defined as $h = g(W^T x + c)$.
The activation function $g$ can be defined as the rectified linear unit (ReLU): $g(z) = \max\{0, z\}$

Example: learning XOR
The complete network is now
$f(x; W, c, w, b) = f^{(2)}(f^{(1)}(x)) = w^T \max\{0, W^T x + c\} + b$ (5)
Now walk through how the model processes a batch of inputs:
Design matrix $X$ for the four points
First step: compute $XW$
Add $c$
Compute $h$
Multiply by $w$
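
The walk-through can be traced with concrete numbers. The slide does not list parameter values, so the ones below are an assumption: they are the hand-specified XOR solution from the Deep Learning book cited in the references. The design matrix follows the input order given earlier in these slides.

```python
# Design matrix: one row per input point, in the order used on the slides.
X = [(0, 0), (0, 1), (1, 1), (1, 0)]

# Assumed parameters (textbook XOR solution, not stated on the slide).
W = [[1, 1], [1, 1]]
c = [0, -1]
w = [1, -2]
b = 0


def forward_batch(X):
    """Return every intermediate quantity of f(x) = w^T max{0, W^T x + c} + b."""
    XW = [[sum(x[i] * W[i][j] for i in range(2)) for j in range(2)] for x in X]
    pre = [[row[j] + c[j] for j in range(2)] for row in XW]   # add c to every row
    H = [[max(0, v) for v in row] for row in pre]             # apply ReLU
    out = [sum(h[j] * w[j] for j in range(2)) + b for h in H]  # multiply by w, add b
    return XW, pre, H, out


XW, pre, H, out = forward_batch(X)
print(XW)   # [[0, 0], [1, 1], [2, 2], [1, 1]]
print(pre)  # [[0, -1], [1, 0], [2, 1], [1, 0]]
print(H)    # [[0, 0], [1, 0], [2, 1], [1, 0]]
print(out)  # [0, 1, 0, 1]
```

After the ReLU, the points (0, 1) and (1, 0) collapse onto the same point in h space, which is what lets a single linear output layer separate the two classes.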

Regularization in deep learning
Regularization is widely used in machine learning methods.
Goal: reduce the generalization error, but not the training error.

Regularization
Parameter norm penalties
  L2 parameter regularization
  L1 regularization
Norm penalties as constrained optimization
Regularization and under-constrained problems
Dataset augmentation
Noise robustness
  Injecting noise at the output targets
Semi-supervised learning
Multitask learning

References
Ian Goodfellow, Yoshua Bengio, and Aaron Courville (2017). Deep Learning.
R. Rojas (1996). Neural Networks. Springer-Verlag.

Regularization and Optimization in Deep Learning
Libo Wang, Department of Statistics, Florida State University, Mar 3rd, 2017

Early Stopping
Motivation:
Avoiding over-fitting and over-optimization
Reducing complexity and computational cost (too many layers and nodes in the neural network)
Having relatively large training and validation datasets

Early Stopping
The most commonly used form of regularization in deep learning, owing to its effectiveness and simplicity.
Computational cost: (1) evaluating on the validation set periodically, (2) maintaining a copy of the best parameters.
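
The two costs named above, periodic validation and keeping the best parameters, show up directly in a minimal early-stopping loop. This is a generic sketch: `step`, `validate`, `patience` and `eval_every` are hypothetical names, and `step` is assumed to return a snapshot of the parameters that later updates will not mutate.

```python
def train_with_early_stopping(step, validate, patience=3, eval_every=1, max_steps=1000):
    """Run `step` repeatedly; stop once validation error fails to improve `patience` times."""
    best_err = float("inf")
    best_params = None
    best_step = 0
    bad_evals = 0
    for i in range(1, max_steps + 1):
        params = step()                      # one training update, returns current parameters
        if i % eval_every != 0:
            continue
        err = validate(params)               # cost (1): periodic validation
        if err < best_err:
            best_err, best_params, best_step = err, params, i  # cost (2): keep the best copy
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                break
    return best_params, best_step, best_err


# Toy usage: the "parameter" is a counter and validation error is smallest at 5.
state = {"p": 0}


def toy_step():
    state["p"] += 1
    return state["p"]


def toy_validate(p):
    return (p - 5) ** 2


print(train_with_early_stopping(toy_step, toy_validate))  # (5, 5, 0)
```

Training continues for three more steps after the best point before the loop gives up, which is the price of the patience window.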

Early Stopping
Problem of early stopping: the held-out validation data is not used for training, so not all of the data is included.
Strategy 1: retrain on all of the data for the same number of steps.
Strategy 2: keep the parameters obtained and continue training on all of the data.

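Early Stopping
How does early stopping act as L2 regularization, theoretically?
Gradient descent after $\tau$ steps: $Q^T \theta^{(\tau)} = [I - (I - \epsilon \Sigma)^{\tau}] Q^T \theta^*$
L2 regularization: $Q^T \tilde{\theta} = (\Sigma + \alpha I)^{-1} \Sigma Q^T \theta^* = [I - (\Sigma + \alpha I)^{-1} \alpha] Q^T \theta^*$
The two solutions coincide when $(I - \epsilon \Sigma)^{\tau} = (\Sigma + \alpha I)^{-1} \alpha$.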

Ensemble method
When and why does model averaging work?
How does bagging work?
Example: Netflix Grand Prize (Koren, 2009)

Dropout
Motivation: dropout provides an approximation to evaluating a bagged ensemble of networks.
Each hidden unit is set to 0 with probability $p$: $h^{(k)}(x) = g(a^{(k)}(x)) \odot m^{(k)}$
Dropout models share parameters that are inherited from the parent neural network.
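
The masking step can be sketched in a few lines. The function names are illustrative. The test-time rule shown here, scaling activations by the keep probability $1 - p$, is the usual weight-scaling approximation and is an addition to what the slide states.

```python
import random


def dropout_forward(activations, p, rng):
    """Sample a binary mask m and return g(a) * m; each unit is zeroed with probability p."""
    mask = [0 if rng.random() < p else 1 for _ in activations]
    return [a * m for a, m in zip(activations, mask)], mask


def scaled_inference(activations, p):
    """Test-time approximation of the ensemble: scale by the keep probability 1 - p."""
    return [a * (1 - p) for a in activations]


rng = random.Random(0)  # fixed seed so the sampled mask is reproducible
h = [1.0, 2.0, 3.0, 4.0]
dropped, mask = dropout_forward(h, p=0.5, rng=rng)
print(mask, dropped)
print(scaled_inference(h, p=0.5))
```

Every call samples a different mask, so each training step effectively trains a different sub-network that shares its weights with all the others.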

Dropout
The sharing of weights among the dropout models means that every model is very strongly regularized (Wager, 2013; Hinton, 2012).
Dropout is argued to be a better regularizer than L2 or L1 penalties because it pulls the weights towards what the other models want instead of towards 0.
For a single hidden layer, it is equivalent to taking the geometric mean of the predictions of all neural networks, over all possible binary masks:
$\tilde{p}(y \mid x) = \sqrt[2^d]{\prod_{\theta} p(y \mid x, \theta)}$

Adversarial Training
Causes of adversarial examples: excessive linearity
Advantages?

Optimization in Deep Learning
Challenges in neural network optimization
Second-order method
Basic algorithm: SGD
Improvements: momentum, Polyak averaging
Popular variants: AdaGrad, RMSProp, Adam

Challenges in neural network optimization
Ill-conditioning
Local minima
Saddle points and flat regions

Second-order method
Newton's method, with a regularized update for non-convex problems or saddle points:
$\theta^* = \theta_0 - [H(f(\theta_0)) + \alpha I]^{-1} \nabla_\theta f(\theta_0)$
Conjugate gradients, which avoid calculating $H^{-1}$:
$d_t = \nabla_\theta J(\theta) + \beta_t d_{t-1}$
$\beta_t = \frac{(\nabla_\theta J(\theta_t) - \nabla_\theta J(\theta_{t-1}))^T \nabla_\theta J(\theta_t)}{\nabla_\theta J(\theta_{t-1})^T \nabla_\theta J(\theta_{t-1})}$
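
To see why the $\alpha I$ term matters near a saddle point, here is a small worked example. The objective $f(\theta) = \theta_1^2 - \theta_2^2$, the starting point and the value of $\alpha$ are assumptions chosen for illustration, and the 2x2 linear solve is written out by hand.

```python
def regularized_newton_step(theta, grad, hessian, alpha):
    """theta - (H + alpha*I)^{-1} grad for a two-parameter problem."""
    a = hessian[0][0] + alpha
    b = hessian[0][1]
    c = hessian[1][0]
    d = hessian[1][1] + alpha
    det = a * d - b * c
    # Solve (H + alpha*I) s = grad with the closed-form 2x2 inverse.
    s0 = (d * grad[0] - b * grad[1]) / det
    s1 = (-c * grad[0] + a * grad[1]) / det
    return [theta[0] - s0, theta[1] - s1]


def f(theta):
    return theta[0] ** 2 - theta[1] ** 2  # saddle point at the origin


theta0 = [1.0, 1.0]
grad0 = [2.0 * theta0[0], -2.0 * theta0[1]]
H = [[2.0, 0.0], [0.0, -2.0]]

plain = regularized_newton_step(theta0, grad0, H, alpha=0.0)
damped = regularized_newton_step(theta0, grad0, H, alpha=3.0)
print(plain, f(plain))
print(damped, f(damped))
```

With $\alpha = 0$ the step lands exactly on the saddle point. With $\alpha = 3$, which exceeds the magnitude of the negative eigenvalue, the step moves away from the saddle along the direction of negative curvature and the objective decreases.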

Basic algorithm: Stochastic Gradient Descent
Motivation: computing the gradient summed over the entire dataset is redundant for large datasets.
Advantage: SGD performs a parameter update for each training example (faster updating).
Learning rate decay: $\epsilon_k = (1 - \alpha) \epsilon_0 + \alpha \epsilon_\tau$ with $\alpha = k / \tau$
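
The decay schedule is a linear interpolation between the initial rate $\epsilon_0$ and the final rate $\epsilon_\tau$. In this sketch the function name is illustrative, and holding the rate constant after iteration $\tau$ is a common convention that the slide does not spell out.

```python
def linear_decay(k, eps0, eps_tau, tau):
    """Learning rate at iteration k: (1 - alpha) * eps0 + alpha * eps_tau, alpha = k / tau."""
    if k >= tau:
        return eps_tau  # assumed: keep the rate constant once iteration tau is reached
    alpha = k / tau
    return (1 - alpha) * eps0 + alpha * eps_tau


for k in (0, 50, 100, 200):
    print(k, linear_decay(k, eps0=0.1, eps_tau=0.001, tau=100))
```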

Improvements: Momentum, Polyak averaging
Reasons to use momentum:
SGD has trouble navigating ravines, which are common around local optima.
Momentum helps accelerate SGD in the relevant direction and dampens oscillations.

Improvements: Momentum, Polyak averaging
Algorithm of momentum

Improvements: Momentum, Polyak averaging
Nesterov momentum
Benefit: Nesterov momentum first makes a big jump in the direction of the previously accumulated gradient, then measures the gradient at that point and makes a correction.
It is better to correct a mistake after it has been made.
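
The momentum algorithm figure did not survive extraction, so here is a minimal sketch of the standard update rules for classical and Nesterov momentum. The toy quadratic objective, the step size 0.05 and the momentum coefficient 0.9 are assumptions for illustration.

```python
def grad(theta):
    """Gradient of the toy ravine f(theta) = 0.5 * (theta_1^2 + 10 * theta_2^2)."""
    return [theta[0], 10.0 * theta[1]]


def momentum_step(theta, v, lr=0.05, mu=0.9):
    """Classical momentum: v <- mu*v - lr*grad(theta); theta <- theta + v."""
    g = grad(theta)
    v = [mu * vi - lr * gi for vi, gi in zip(v, g)]
    theta = [ti + vi for ti, vi in zip(theta, v)]
    return theta, v


def nesterov_step(theta, v, lr=0.05, mu=0.9):
    """Nesterov momentum: jump first, measure the gradient there, then correct."""
    lookahead = [ti + mu * vi for ti, vi in zip(theta, v)]
    g = grad(lookahead)
    v = [mu * vi - lr * gi for vi, gi in zip(v, g)]
    theta = [ti + vi for ti, vi in zip(theta, v)]
    return theta, v


def run(step, n=500):
    theta, v = [1.0, 1.0], [0.0, 0.0]
    for _ in range(n):
        theta, v = step(theta, v)
    return theta


print(run(momentum_step))
print(run(nesterov_step))
```

The only difference between the two functions is where the gradient is evaluated: at the current parameters, or at the point reached after applying the accumulated velocity.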

Improvements: Momentum, Polyak averaging
Polyak-Ruppert averaging: to reduce the variance of the estimate, we can average the iterates using
$\bar{\theta}_k = \frac{1}{k} \sum_{t=1}^{k} \theta_t$
This can be implemented recursively as
$\bar{\theta}_t = \bar{\theta}_{t-1} - \frac{1}{t} (\bar{\theta}_{t-1} - \theta_t)$
The $\theta_k$ estimates quickly converge to near the optimum and then wander around it, while $\bar{\theta}_k$ averages out these fluctuations.
We should not start the averaging process until after a burn-in phase.
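
The recursive form gives the same result as the arithmetic mean without storing the iterates. In this sketch the function name and the `burn_in` parameter are illustrative, and each iterate is represented as a list of floats.

```python
def running_average(iterates, burn_in=0):
    """Average the iterates after `burn_in` steps using avg <- avg - (avg - theta) / n."""
    avg = None
    count = 0
    for t, theta in enumerate(iterates, start=1):
        if t <= burn_in:
            continue  # skip the initial transient before averaging starts
        count += 1
        if avg is None:
            avg = list(theta)
        else:
            avg = [a - (a - th) / count for a, th in zip(avg, theta)]
    return avg


iterates = [[1.0], [2.0], [3.0], [4.0]]
print(running_average(iterates))             # mean of all four iterates
print(running_average(iterates, burn_in=2))  # mean of the last two iterates
```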

Popular variants: AdaGrad, RMSProp, Adam
AdaGrad: useful when features are infrequent but predictive, e.g. in text mining.
Benefit: it eliminates the need to manually tune the learning rate.
Weakness: the accumulated sum of squared gradients keeps growing during training, so the learning rate shrinks and eventually becomes infinitesimally small.

Popular variants: AdaGrad, RMSProp, Adam
RMSProp: an extension of AdaGrad that deals with its radically diminishing learning rates.

Popular variants: AdaGrad, RMSProp, Adam
Adam: adds bias-correction and momentum to RMSProp.
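
The algorithm figures for these three methods are missing from the extracted slides, so the sketch below writes out the standard update rules side by side. The function names are illustrative, and the default hyperparameters are commonly suggested values, not values taken from the slides.

```python
import math


def adagrad_step(theta, g, r, lr=0.01, delta=1e-7):
    """AdaGrad: accumulate all squared gradients, so the effective rate only shrinks."""
    r = [ri + gi * gi for ri, gi in zip(r, g)]
    theta = [ti - lr * gi / (delta + math.sqrt(ri)) for ti, gi, ri in zip(theta, g, r)]
    return theta, r


def rmsprop_step(theta, g, r, lr=0.001, rho=0.9, delta=1e-6):
    """RMSProp: replace the growing sum with an exponentially decaying average."""
    r = [rho * ri + (1 - rho) * gi * gi for ri, gi in zip(r, g)]
    theta = [ti - lr * gi / math.sqrt(delta + ri) for ti, gi, ri in zip(theta, g, r)]
    return theta, r


def adam_step(theta, g, s, r, t, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """Adam: RMSProp-style scaling plus a momentum term, both bias-corrected (t starts at 1)."""
    s = [rho1 * si + (1 - rho1) * gi for si, gi in zip(s, g)]
    r = [rho2 * ri + (1 - rho2) * gi * gi for ri, gi in zip(r, g)]
    s_hat = [si / (1 - rho1 ** t) for si in s]
    r_hat = [ri / (1 - rho2 ** t) for ri in r]
    theta = [ti - lr * sh / (math.sqrt(rh) + delta) for ti, sh, rh in zip(theta, s_hat, r_hat)]
    return theta, s, r


g = [2.0]
print(adagrad_step([0.0], g, [0.0]))
print(rmsprop_step([0.0], g, [0.0]))
print(adam_step([0.0], g, [0.0], [0.0], t=1))
```

On the very first step, bias correction makes Adam move by almost exactly the learning rate, whereas RMSProp without correction moves by roughly $\epsilon / \sqrt{1 - \rho}$ because its running average starts at zero.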

Popular variants: AdaGrad, RMSProp, Adam
Summary: which optimizer should we use?
For fast convergence when training a deep or complex neural network, we should choose one of the adaptive learning rate methods.
With adaptive learning rate methods we do not need to tune the learning rate, and we will likely achieve the best results with the default value.
RMSProp, Adadelta, and Adam are very similar algorithms that do well in similar circumstances. Kingma et al. [15] show that its bias-correction helps Adam slightly outperform RMSProp towards the end of optimization as gradients become sparser.

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

More information

Deep Feedforward Networks. Seung-Hoon Na Chonbuk National University

Deep Feedforward Networks. Seung-Hoon Na Chonbuk National University Deep Feedforward Networks Seung-Hoon Na Chonbuk National University Neural Network: Types Feedforward neural networks (FNN) = Deep feedforward networks = multilayer perceptrons (MLP) No feedback connections

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information

Overview of gradient descent optimization algorithms. HYUNG IL KOO Based on

Overview of gradient descent optimization algorithms. HYUNG IL KOO Based on Overview of gradient descent optimization algorithms HYUNG IL KOO Based on Problem Statement Machine Learning Optimization Problem Training samples:

More information

Lecture 3 Feedforward Networks and Backpropagation

Lecture 3 Feedforward Networks and Backpropagation Lecture 3 Feedforward Networks and Backpropagation CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 3, 2017 Things we will look at today Recap of Logistic Regression

More information

Neural Networks: Optimization & Regularization

Neural Networks: Optimization & Regularization Neural Networks: Optimization & Regularization Shan-Hung Wu Department of Computer Science, National Tsing Hua University, Taiwan Machine Learning Shan-Hung Wu (CS, NTHU) NN Opt & Reg

More information

Lecture 3 Feedforward Networks and Backpropagation

Lecture 3 Feedforward Networks and Backpropagation Lecture 3 Feedforward Networks and Backpropagation CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 3, 2017 Things we will look at today Recap of Logistic Regression

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Day 3 Lecture 3. Optimizing deep networks

Day 3 Lecture 3. Optimizing deep networks Day 3 Lecture 3 Optimizing deep networks Convex optimization A function is convex if for all α [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient

More information

Machine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Machine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science Machine Learning CS 4900/5900 Razvan C. Bunescu School of Electrical Engineering and Computer Science Machine Learning is Optimization Parametric ML involves minimizing an objective function

More information

Deep Feedforward Networks. Han Shao, Hou Pong Chan, and Hongyi Zhang

Deep Feedforward Networks. Han Shao, Hou Pong Chan, and Hongyi Zhang Deep Feedforward Networks Han Shao, Hou Pong Chan, and Hongyi Zhang Deep Feedforward Networks Goal: approximate some function f e.g., a classifier, maps input to a class y = f (x) x y Defines a mapping

More information

1 What a Neural Network Computes

1 What a Neural Network Computes Neural Networks 1 What a Neural Network Computes To begin with, we will discuss fully connected feed-forward neural networks, also known as multilayer perceptrons. A feedforward neural network consists

More information

Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks

Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks Jan Drchal Czech Technical University in Prague Faculty of Electrical Engineering Department of Computer Science Topics covered

More information

Lecture 6 Optimization for Deep Neural Networks

Lecture 6 Optimization for Deep Neural Networks Lecture 6 Optimization for Deep Neural Networks CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 12, 2017 Things we will look at today Stochastic Gradient Descent Things

More information

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms Neural networks Comments Assignment 3 code released implement classification algorithms use kernels for census dataset Thought questions 3 due this week Mini-project: hopefully you have started 2 Example:

More information

Ch.6 Deep Feedforward Networks (2/3)

Ch.6 Deep Feedforward Networks (2/3) Ch.6 Deep Feedforward Networks (2/3) 16. 10. 17. (Mon.) System Software Lab., Dept. of Mechanical & Information Eng. Woonggy Kim 1 Contents 6.3. Hidden Units 6.3.1. Rectified Linear Units and Their Generalizations

More information

Introduction to Neural Networks

Introduction to Neural Networks CUONG TUAN NGUYEN SEIJI HOTTA MASAKI NAKAGAWA Tokyo University of Agriculture and Technology Copyright by Nguyen, Hotta and Nakagawa 1 Pattern classification Which category of an input? Example: Character

More information

CS 6501: Deep Learning for Computer Graphics. Basics of Neural Networks. Connelly Barnes

CS 6501: Deep Learning for Computer Graphics. Basics of Neural Networks. Connelly Barnes CS 6501: Deep Learning for Computer Graphics Basics of Neural Networks Connelly Barnes Overview Simple neural networks Perceptron Feedforward neural networks Multilayer perceptron and properties Autoencoders

More information

Machine Learning Basics III

Machine Learning Basics III Machine Learning Basics III Benjamin Roth CIS LMU München Benjamin Roth (CIS LMU München) Machine Learning Basics III 1 / 62 Outline 1 Classification Logistic Regression 2 Gradient Based Optimization Gradient

More information

Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann

Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann Neural Networks with Applications to Vision and Language Feedforward Networks Marco Kuhlmann Feedforward networks Linear separability x 2 x 2 0 1 0 1 0 0 x 1 1 0 x 1 linearly separable not linearly separable

More information

Optimization for neural networks

Optimization for neural networks 0 - : Optimization for neural networks Prof. J.C. Kao, UCLA Optimization for neural networks We previously introduced the principle of gradient descent. Now we will discuss specific modifications we make

More information

Deep Learning & Artificial Intelligence WS 2018/2019

Deep Learning & Artificial Intelligence WS 2018/2019 Deep Learning & Artificial Intelligence WS 2018/2019 Linear Regression Model Model Error Function: Squared Error Has no special meaning except it makes gradients look nicer Prediction Ground truth / target

More information

Feed-forward Network Functions

Feed-forward Network Functions Feed-forward Network Functions Sargur Srihari Topics 1. Extension of linear models 2. Feed-forward Network Functions 3. Weight-space symmetries 2 Recap of Linear Models Linear Models for Regression, Classification

More information

Rapid Introduction to Machine Learning/ Deep Learning

Rapid Introduction to Machine Learning/ Deep Learning Rapid Introduction to Machine Learning/ Deep Learning Hyeong In Choi Seoul National University 1/59 Lecture 4a Feedforward neural network October 30, 2015 2/59 Table of contents 1 1. Objectives of Lecture

More information

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, 2017 Spis treści Website Acknowledgments Notation xiii xv xix 1 Introduction 1 1.1 Who Should Read This Book?

More information

Gradient-Based Learning. Sargur N. Srihari

Gradient-Based Learning. Sargur N. Srihari Gradient-Based Learning Sargur N. 1 Topics Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation and Other Differentiation

More information

Understanding Neural Networks : Part I

Understanding Neural Networks : Part I TensorFlow Workshop 2018 Understanding Neural Networks Part I : Artificial Neurons and Network Optimization Nick Winovich Department of Mathematics Purdue University July 2018 Outline 1 Neural Networks

More information


OPTIMIZATION METHODS IN DEEP LEARNING Tutorial outline OPTIMIZATION METHODS IN DEEP LEARNING Based on Deep Learning, chapter 8 by Ian Goodfellow, Yoshua Bengio and Aaron Courville Presented By Nadav Bhonker Optimization vs Learning Surrogate

More information

Deep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation

Deep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation Deep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation Steve Renals Machine Learning Practical MLP Lecture 5 16 October 2018 MLP Lecture 5 / 16 October 2018 Deep Neural Networks

More information

Deep Learning book, by Ian Goodfellow, Yoshua Bengio and Aaron Courville

Deep Learning book, by Ian Goodfellow, Yoshua Bengio and Aaron Courville Deep Learning book, by Ian Goodfellow, Yoshua Bengio and Aaron Courville Chapter 6 :Deep Feedforward Networks Benoit Massé Dionyssos Kounades-Bastian Benoit Massé, Dionyssos Kounades-Bastian Deep Feedforward

More information

Serious limitations of (single-layer) perceptrons: Cannot learn non-linearly separable tasks. Cannot approximate (learn) non-linear functions

Serious limitations of (single-layer) perceptrons: Cannot learn non-linearly separable tasks. Cannot approximate (learn) non-linear functions BACK-PROPAGATION NETWORKS Serious limitations of (single-layer) perceptrons: Cannot learn non-linearly separable tasks Cannot approximate (learn) non-linear functions Difficult (if not impossible) to design

More information

Deep Feedforward Networks. Lecture slides for Chapter 6 of Deep Learning Ian Goodfellow Last updated

Deep Feedforward Networks. Lecture slides for Chapter 6 of Deep Learning  Ian Goodfellow Last updated Deep Feedforward Networks Lecture slides for Chapter 6 of Deep Learning Ian Goodfellow Last updated 2016-10-04 Roadmap Example: Learning XOR Gradient-Based Learning Hidden Units

More information

Computational statistics

Computational statistics Computational statistics Lecture 3: Neural networks Thierry Denœux 5 March, 2016 Neural networks A class of learning methods that was developed separately in different fields statistics and artificial

More information

More Tips for Training Neural Network. Hung-yi Lee

More Tips for Training Neural Network. Hung-yi Lee More Tips for Training Neural Network Hung-yi ee Outline Activation Function Cost Function Data Preprocessing Training Generalization Review: Training Neural Network Neural network: f ; θ : input (vector)

More information

CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent

CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent April 27, 2018 1 / 32 Outline 1) Moment and Nesterov s accelerated gradient descent 2) AdaGrad and RMSProp 4) Adam 5) Stochastic

More information

Regularization and Optimization of Backpropagation

Regularization and Optimization of Backpropagation Regularization and Optimization of Backpropagation The Norwegian University of Science and Technology (NTNU) Trondheim, Norway October 17, 2017 Regularization Definition of Regularization

More information

4. Multilayer Perceptrons

4. Multilayer Perceptrons 4. Multilayer Perceptrons This is a supervised error-correction learning algorithm. 1 4.1 Introduction A multilayer feedforward network consists of an input layer, one or more hidden layers, and an output

More information

CSC 578 Neural Networks and Deep Learning

CSC 578 Neural Networks and Deep Learning CSC 578 Neural Networks and Deep Learning Fall 2018/19 3. Improving Neural Networks (Some figures adapted from NNDL book) 1 Various Approaches to Improve Neural Networks 1. Cost functions Quadratic Cross

More information

Deep Learning II: Momentum & Adaptive Step Size

Deep Learning II: Momentum & Adaptive Step Size Deep Learning II: Momentum & Adaptive Step Size CS 760: Machine Learning Spring 2018 Mark Craven and David Page 1 Goals for the Lecture You should understand the following

More information

CS60010: Deep Learning

CS60010: Deep Learning CS60010: Deep Learning Sudeshna Sarkar Spring 2018 16 Jan 2018 FFN Goal: Approximate some unknown ideal function f : X! Y Ideal classifier: y = f*(x) with x and category y Feedforward Network: Define parametric

More information



More information

Stochastic Gradient Descent

Stochastic Gradient Descent Stochastic Gradient Descent Weihang Chen, Xingchen Chen, Jinxiu Liang, Cheng Xu, Zehao Chen and Donglin He March 26, 2017 Outline What is Stochastic Gradient Descent Comparison between BGD and SGD Analysis

More information

Neural Networks. Yan Shao Department of Linguistics and Philology, Uppsala University 7 December 2016

Neural Networks. Yan Shao Department of Linguistics and Philology, Uppsala University 7 December 2016 Neural Networks Yan Shao Department of Linguistics and Philology, Uppsala University 7 December 2016 Outline Part 1 Introduction Feedforward Neural Networks Stochastic Gradient Descent Computational Graph

More information



More information

CSC321 Lecture 8: Optimization

CSC321 Lecture 8: Optimization CSC321 Lecture 8: Optimization Roger Grosse Roger Grosse CSC321 Lecture 8: Optimization 1 / 26 Overview We ve talked a lot about how to compute gradients. What do we actually do with them? Today s lecture:

More information

Neural Networks. David Rosenberg. July 26, New York University. David Rosenberg (New York University) DS-GA 1003 July 26, / 35
