CSC 578 Neural Networks and Deep Learning

Similar documents
Multilayer Perceptrons (MLPs)

Neural Network Training

Reading Group on Deep Learning Session 1

Lecture 6. Notes on Linear Algebra. Perceptron

Day 3 Lecture 3. Optimizing deep networks

Cheng Soon Ong & Christian Walder. Canberra February June 2018

CSE 190 Fall 2015 Midterm DO NOT TURN THIS PAGE UNTIL YOU ARE TOLD TO START!!!!

Serious limitations of (single-layer) perceptrons: Cannot learn non-linearly separable tasks. Cannot approximate (learn) non-linear functions

Deep Learning & Artificial Intelligence WS 2018/2019

Deep Feedforward Networks

Introduction to Neural Networks

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Neural Networks. Nicholas Ruozzi University of Texas at Dallas

Neural Networks (Part 1) Goals for the lecture

Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann

Deep Learning II: Momentum & Adaptive Step Size

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Computational statistics

Neural Networks. David Rosenberg. July 26, New York University. David Rosenberg (New York University) DS-GA 1003 July 26, / 35

Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Vasil Khalidov & Miles Hansard. C.M. Bishop s PRML: Chapter 5; Neural Networks

COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS16

Based on the original slides of Hung-yi Lee

Neural Network Tutorial & Application in Nuclear Physics. Weiguang Jiang ( 蒋炜光 ) UTK / ORNL

10-701/ Machine Learning, Fall

CS60010: Deep Learning

Neural Networks and Deep Learning

Artificial Neuron (Perceptron)

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box

Machine Learning for Computer Vision 8. Neural Networks and Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY

CS 6501: Deep Learning for Computer Graphics. Basics of Neural Networks. Connelly Barnes

CSC321 Lecture 5: Multilayer Perceptrons

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms

4. Multilayer Perceptrons

Introduction to Convolutional Neural Networks (CNNs)

Artificial Neural Networks. MGS Lecture 2

Machine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science

More Tips for Training Neural Network. Hung-yi Lee

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

Lecture 5: Logistic Regression. Neural Networks

epochs epochs

Tips for Deep Learning

1 What a Neural Network Computes

Stochastic Gradient Descent

Logistic Regression & Neural Networks

Feed-forward Network Functions

Neural Networks. Yan Shao Department of Linguistics and Philology, Uppsala University 7 December 2016

Artificial Neural Networks D B M G. Data Base and Data Mining Group of Politecnico di Torino. Elena Baralis. Politecnico di Torino

Radial Basis Function (RBF) Networks

Machine Learning Basics III

Neural Networks: Optimization & Regularization

CS249: ADVANCED DATA MINING

Machine Learning Lecture 14

Statistical Machine Learning from Data

Deep Feedforward Networks. Seung-Hoon Na Chonbuk National University

Regularization in Neural Networks

ECE521 lecture 4: 19 January Optimization, MLE, regularization

Dan Roth 461C, 3401 Walnut

Rapid Introduction to Machine Learning/ Deep Learning

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs)

Understanding Neural Networks : Part I

Tips for Deep Learning

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs)

Introduction to Machine Learning

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4

DeepLearning on FPGAs

Multiclass Logistic Regression

Course 395: Machine Learning - Lectures

Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound. Lecture 3: Introduction to Deep Learning (continued)

Neural Networks, Computation Graphs. CMSC 470 Marine Carpuat

Non-Convex Optimization. CS6787 Lecture 7 Fall 2017

Deep Neural Networks (1) Hidden layers; Back-propagation

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017

ECE521 Lecture 7/8. Logistic Regression

Gradient-Based Learning. Sargur N. Srihari

Multilayer Perceptron

From perceptrons to word embeddings. Simon Šuster University of Groningen

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017

CSCI567 Machine Learning (Fall 2018)

CSC 411 Lecture 10: Neural Networks

Deep Feedforward Networks

Neural Nets Supervised learning

Introduction to Natural Computation. Lecture 9. Multilayer Perceptrons and Backpropagation. Peter Lewis

Classification. Sandro Cumani. Politecnico di Torino

Lecture 10. Neural networks and optimization. Machine Learning and Data Mining November Nando de Freitas UBC. Nonlinear Supervised Learning

An overview of deep learning methods for genomics

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann

ECE521 Lectures 9 Fully Connected Neural Networks

Deep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation

Deep Feedforward Networks. Lecture slides for Chapter 6 of Deep Learning Ian Goodfellow Last updated

Deep Learning & Neural Networks Lecture 4

Overview of gradient descent optimization algorithms. HYUNG IL KOO Based on

Lecture 6 Optimization for Deep Neural Networks

CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent

Optimization for Training I. First-Order Methods Training algorithm

Deep Neural Networks (1) Hidden layers; Back-propagation

Neural Networks and Deep Learning.

Transcription:

CSC 578 Neural Networks and Deep Learning Fall 2018/19 3. Improving Neural Networks (Some figures adapted from NNDL book) 1

Various Approaches to Improve Neural Networks 1. Cost functions Quadratic Cross Entropy Log likelihood 2. Regularization L2 regularization L1 regularization Dropout Data expansion 3. Initial weights 4. Hyper-parameters Hessian technique Momentum 5. Other approaches to minimize error/cost Conjugate gradient L-BFGS 6. Other Activation/transfer functions Tanh Rectified Linear Unit (ReLU) Softmax 2

For classification: Quadratic 1 Various Cost Functions Cross Entropy CC = 1 nn xx y(x) ln aa (1 yy(xx)) ln(1 aa) Likelihood CC = 1 nn xx ln aa yy LL usually used with Softmax activation function (for multiple output nodes) For regression: Mean Squared Error (MSE) CC = 1 nn (yy xx aa) 2 xx Or Root Mean Squared Error (RMSE) 3

Differentials of Cost Functions General form (in gradient): For a cost function C and an activation function a (and z is the weighted sum, zz = ww ii xx ii ), Quadratic: CC = CC ww ii aa CC = CC ww ii ww ii aa zz = 1 2nn (yy xx aa) 2 xx = 1 2nn 2 yy xx aa xx = 1 nn (yy xx aa) xx zz ww ii ww ii ww ii ww ii 4

Cross Entropy: CC = CC ww ii ww ii = 1 nn xx yy xx ln aa (1 yy xx ) ln(1 aa) ww ii = 1 nn xx yy xx ln aa + (1 yy xx ) ln(1 aa) ww ii = 1 nn xx yy xx aa + 1 yy xx ln 1 aa ww ii = 1 nn xx yy xx aa 1 yy xx + 1 aa ( 1) ww ii = 1 nn xx aa yy xx aa (1 aa) ww ii 5

Log Likelihood, for one data instance <x, y>: CC = CC ww ii ww ii = aa kk ln aa yy LL ww ii = 1 aa yy LL ww ii where aa yy LL is the y th activation value where y is the index of the target category. The derivative values for all other components are 0. A good reference on cross-entropy cost function is at https://www.ics.uci.edu/~pjsadows/notes.pdf. 6

1.1 Cross Entropy Since sigmoid curve is flat on both ends, cost functions whose gradient include sigmoid is slow to converge (to minimize the cost), especially when the output is very far from the target. Using sigmoid neurons but with a different cost function, in particular Cross Entropy, helps speed up convergence. With Cross Entropy, C is always > 0. where y is the target output, and a = network output 7

Cross-entropy cost (or loss ) can be divided into two separate cost functions: one for y=1 and one for y=0. The two cases are combined into one expression (since the cases are mutually exclusive). If y=0, the first side cancels out. If y=1, the second side cancels out. In both cases we only perform the operation we need to perform. ML Cheatsheet, https://ml-cheatsheet.readthedocs.io/en/latest/logistic_regression.html#cost-function 8

Differentials of the cost functions when aa = σσ(zz): Quadratic = ww jj aa aa zz zz ww jj = 1 nn xx σσ(zz) yy σσ zz xx jj Cross-Entropy = ww jj aa aa zz zz ww jj = 1 nn xx σσ zz yy σσ zz 1 σσ zz σσ zz xx jj = 1 nn xx (σσ zz yy) xx jj o Since the gradient of Cross-Entropy s formula removes the term σσ (zz), the weight convergence could become faster. o This also affects the BP algorithm, since the errors become different: o Step 3: Output error: δδ LL = aa LL yy o Step 4: Backpropagate the error: δδ ll = ww ll+1 TT δδ ll+1 o Step 5: Weight update is the same: ww jjjj = ηη δδ jj xx ii 9

10

You can specify the cost function in the constructor call (by cost= ). 11

1.2 Log Likelihood Log likelihood (or Maximum Likelihood Estimate) is essentially equivalent to cross entropy, but for multiple outcomes. CC = 1 nn xx ln aa yy LL where aa LL yy is the y th activation value where y is the index of the target category. Log likelihood cost function is used when the activation function is Softmax (applicable only when there are multiple output nodes). Softmax converts the activation values in the output layer to a probability distribution. where zz jj LL = mm ww jjjj LL aa mm LL 1 + bb jj LL 12

13

2 Regularization Methods Overfitting is a big problem in Machine Learning. Regularization is applied in Neural Networks to avoid overfitting. There are several regularization methods, such as: Weight decay / L2, L1 normalization Dropout Artificially enhancing the training data 14

Weight decay. 2.1 L2 Regularization Add an extra term, a quadratic values of the weights, in the cost function: For Cross-Entropy: For Quadratic: The extra term causes the gradient descent search to seek weight vectors with small magnitudes, thereby reducing the risk of overfitting (by biasing against complex model). 15

Differentials of the cost function with L2 regularization. An extra term for weights: No change for bias: Weight update rule in the BP algorithm: o A negative value ηηλλ nn ww in weights => weight decay: o No change for bias: 16

17

18

2.2 L1 Regularlization Similar to L1, but adds an extra term that is the absolute values of the weights in the cost function: Differentials of the cost function with L1 regularlization. An extra term for weights: where sgn ww = +1 if ww > 0 0 if ww == 0 1 if ww < 0 19

Weight update rule in the BP algorithm: o A negative value ηηλλ nn ssssss(ww) in weights => weight decay o No change for bias 20

L2 vs. L1 In L1 regularization, the weights shrink by a constant amount toward 0. In L2 regularization, the weights shrink by an amount which is proportional to w. So when a particular weight has a large magnitude, w, L1 regularization shrinks the weight much less than L2 regularization does. By contrast, when w is small, L1 regularization shrinks the weight much more than L2 regularization. The net result is that L1 regularization tends to concentrate the weight of the network in a relatively small number of high-importance connections, while the other weights are driven toward zero. 21

2.3 Dropout Dropout modifies the network by periodically disconnecting some nodes in the network. We start by randomly (and temporarily) deleting half the hidden neurons in the network, while leaving the input and output neurons untouched. We forward-propagate the input x through the modified network, and then backpropagate the result, also through the modified network. We then repeat the process, first restoring the dropout neurons, then choosing a new random subset of hidden neurons to delete,.. 22

Robustness (and generalization power) of dropout: Heuristically, when we dropout different sets of neurons, it's rather like we're training different neural networks. And so the dropout procedure is like averaging the effects of a very large number of different networks. The different networks will overfit in different ways, and so, hopefully, the net effect of dropout will be to reduce overfitting. we can think of dropout as a way of making sure that the model is robust to the loss of any individual piece of evidence. Dropout has been shown to be effective especially in training large, deep networks, where the problem of overfitting is often acute. 23

2.4 Artificial Expansion of Data In general, neural networks learn better with larger training data. We can expand the data by inserting artificially generated data instances obtained by distorting the original data. The general principle is to expand the training data by applying operations that reflect real-world variation. 24

3 Initial Weights Setting good values for the initial weights could speed up the learning (convergence). Common values: Uniform random (in a fixed range) Gaussian (µ = 0, σ = 1) Gaussian (µ = 0, σ = 1 for bias, for weights on the input layer, nn iiii where nn iiii is the number of nodes) 1 25

4 Hyper-parameters How do we set good values for hyper parameters such as η (learning rate), λ (regularization parameter) to speed up the experimentation? A broad strategy is to try to obtain a signal during the early stages of experiment that gives a promising result. From there, you tweak hyper-parameters. Heuristics for specific hyper-parameters. 1. Learning rate: Try values of different magnitude, e.g. 0.025, 0.25, 2.5 2. Number of epochs: Try early stopping and monitor overfitting. 3. Regularization parameter: Start with 0 and fix η. Then try 1.0, then * or / by 10. 26

5 Other Cost-minimization Techniques In addition to Stochastic Gradient Descent, there are other techniques to minimize cost function: 1. Hessian technique 2. Momentum 3. Other learning techniques 27

5.1 Hessian Technique The Hessian matrix or Hessian is a square matrix of second-order partial derivatives of a scalar-valued function (of many variables), ff: R nn R. Hessian is used to approximate the minimization of cost function. First approximate C(w) by CC ww = CC ww + ww at a point w (based on Taylor Series). That gives By dropping higher order terms above quadratic, we have Finally we set ww = HH 1 CC to minimize CC ww + ww. 28

Then in the algorithm, the weight update becomes ww + ww = ww HH 1 CC The Hessian technique incorporates information about second-order changes in the cost function, and has been shown to converge to a minimum faster. However, computing H at every step, for every weight, is computationally costly (with regard to memory as well computation time). 29

5.2 Momentum The momentum method is similar to Hessian in that it incorporates the second-order changes in the cost function as an analogical notion of velocity. Stochastic gradient descent with momentum remembers the update Δw at each iteration, and determines the next update as a linear combination of the gradient (ηη CC) and the previous update (α ww), that is, ww = α ww ηη CC and the base weight update rule: ww = ww + ww gives the final weight update formula ww + ww = ww + (α ww ηη CC) [Wikipedia] Momentum adds extra term for the same direction, thereby preventing oscillation. 30

5.3 Other Learning Techniques Resilient Propagation: Use only the magnitude of the gradient and allow each neuron to learn at its own rate. No need for learning rate/momentum; however, only works in full batch mode. Nesterov accelerated gradient: Helps mitigate the risk of choosing a bad mini-batch. Adagrad: Allows an automatically decaying per-weight learning rate and momentum concept. Adadelta: Extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate. Non-Gradient Methods: Non-gradient methods can sometimes be useful, though rarely outperform gradient-based backpropagation methods. These include: simulated annealing, genetic algorithms, particle swarm optimization, Nelder Mead, and many more. https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class06_backpropagation.ipynb 31

6 Other Neuron Types In addition to sigmoid, there are other types of neurons(with different activation functions), for example: 1. Tanh 2. Rectified Linear Unit (ReLU) 3. Softmax A good reference on various activation functions is found at https://mlcheatsheet.readthedocs.io/en/latest/activation_functions.ht ml?highlight=softmax 32

6.1 Tanh (Hyperbolic Tangent) Definition: tanh zz = eezz ee zz ee zz + ee zz The range of tanh is between -1 and 1 (rather than 0 and 1). Tanh is also a transformation of sigmoid: σσ zz = 1+tanh h 2 2 Derivative: Let y = ff zz = eezz ee zz ee zz + ee zz yy = ff xx = 1 yy 2, with ff 0 = 0 33

6.2 ReLU (Rectified Linear Unit) Definition:ff zz = zz + = max(0, zz) This function is also called a Ramp function. Advantage -- Neurons do not saturate when z > 0. Disadvantage -- Gradient vanishes when z < 0. Derivative: ff zz = 1, zz > 0 0, zz 0 But the function is technically NOT differentiable at 0. => 0 is arbitrarily substituted above. 34

6.3 Softmax Softmax function can be applied to neurons on the output layer only. where zz jj LL = mm ww jjjj LL aa mm LL 1 + bb jj LL First the function ensures all values are positive by taking the exponent. Then it normalizes the resulting values and makes them a probability distribution (which sum up to 1). And if we use the log likelihood as the cost function CC = ln aa yy LL, we have the error δδ jj LL = aa jj LL yy ii and derivatives: 35

Derivative of Softmax The derivative of Softmax (for a layer of node activations a 1... a n ) is a 2D matrix, NOT a vector because the activation of a j depends on all other activations ( kk aa kk ). Derivative: Let ss ii = eezz ii kk ee zz kk ss ii = ss ii 1 ss jj when ii = jj aa jj ss ii ss jj when ii jj References: [1] Eli Bendersky [2] Aerin Kim 36

Softmax activation + LogLikelihood cost function: 37

7 Two Problems to review 1. Vanishing Gradient problem With gradient descent, sometimes the gradient becomes vanishingly small, effectively preventing the weight from changing its value. In the worst case, this may completely stop the neural network from further training. 2. Neuron Saturation When a neuron s activation value becomes close to the maximum or minimum of the range (e.g. 0/1 for sigmoid, -1/1 for tanh), the neuron is said to be saturated. In those cases, the gradient becomes small (vanishing gradient), and the network stops learning. 38