Computational statistics
Lecture 3: Neural networks
Thierry Denœux
5 March, 2016

Neural networks
A class of learning methods developed separately in two fields, statistics and artificial intelligence, based on essentially identical models. The central idea is to extract linear combinations of the inputs as derived features, and then to model the target as a nonlinear function of these features. The result is a powerful learning method, with widespread applications in many fields. There exist many neural network models; here we describe the most widely used "vanilla" neural net, sometimes called the single hidden layer back-propagation network, or single layer perceptron.

Overview: Neural network model; Learning; Practical issues; Examples (simulated data, example in R).

Multi-layer architecture
A neural network is a two-stage regression or classification model, typically represented by a network diagram as on the next slide. It is composed of three layers: input, hidden and output. The same network applies to both regression and classification. For regression, typically K = 1 and there is a single output unit Y_1 at the top; however, these networks can handle multiple quantitative responses in a seamless fashion, so we deal with the general case. For K-class classification, there are K units at the top, the kth unit modeling the probability of class k. There are K target values Y_k, k = 1, ..., K, each coded as a 0-1 indicator of class k.

Graphical representation

Propagation equations
Derived features Z_m are created from linear combinations of the inputs, and the target Y_k is then modeled as a function of linear combinations of the Z_m:
Z_m = σ(α_{0m} + α_m^T X), m = 1, ..., M
T_k = β_{0k} + β_k^T Z, k = 1, ..., K
f_k(X) = g_k(T), k = 1, ..., K
where Z = (Z_1, Z_2, ..., Z_M) and T = (T_1, T_2, ..., T_K).
Terminology:
α_{mj}: connection weight between input unit j and hidden unit m
β_{km}: connection weight between hidden unit m and output unit k
α_{0m}, β_{0k}: bias terms
σ: hidden unit activation function
g_k: output activation functions
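As an illustration, the following R sketch implements these propagation equations for a single input vector x, with the identity output function (regression case). The weight objects alpha, alpha0, beta and beta0 are hypothetical placeholders chosen only to make the example run.

## Forward pass through a single-hidden-layer network (sketch)
## alpha: M x p matrix of hidden-layer weights, alpha0: length-M biases
## beta:  K x M matrix of output-layer weights, beta0:  length-K biases
sigmoid <- function(v) 1 / (1 + exp(-v))

forward <- function(x, alpha0, alpha, beta0, beta) {
  Z <- sigmoid(alpha0 + alpha %*% x)   # derived features Z_m, length M
  T <- beta0 + beta %*% Z              # linear outputs T_k, length K
  drop(T)                              # identity output function g_k (regression)
}

## Small example with p = 2 inputs, M = 3 hidden units, K = 1 output
set.seed(1)
alpha  <- matrix(rnorm(3 * 2, sd = 0.1), 3, 2); alpha0 <- rnorm(3, sd = 0.1)
beta   <- matrix(rnorm(1 * 3, sd = 0.1), 1, 3); beta0  <- rnorm(1, sd = 0.1)
forward(c(0.5, -1), alpha0, alpha, beta0, beta)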

Activation function
The activation function σ(v) is usually chosen to be the sigmoid (see next slide):
σ(v) = 1 / (1 + e^{-v})
Variant: sometimes the output of hidden unit m is defined using a Gaussian radial basis function,
Z_m = exp(-γ_m ‖X - α_m‖^2)
We then have a radial basis function network.

Sigmoid activation function
Plot of σ(sv) as a function of v, for s = 0.5 (blue curve), s = 1 (red curve) and s = 10 (purple curve). The scale parameter s controls the activation rate.
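A minimal R sketch reproducing this kind of plot; the scale values and colours follow the caption, everything else is illustrative.

## Sigmoid sigma(s * v) for several values of the scale parameter s
sigmoid <- function(v) 1 / (1 + exp(-v))
v <- seq(-10, 10, length.out = 400)

plot(v, sigmoid(0.5 * v), type = "l", col = "blue", ylim = c(0, 1),
     xlab = "v", ylab = expression(sigma(s * v)))
lines(v, sigmoid(1 * v),  col = "red")
lines(v, sigmoid(10 * v), col = "purple")
legend("topleft", legend = c("s = 0.5", "s = 1", "s = 10"),
       col = c("blue", "red", "purple"), lty = 1)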

Output function
The output function g_k(T) allows a final transformation of the vector of outputs T. For regression we typically choose the identity function g_k(T) = T_k. Early work in K-class classification also used the identity function, but this was later abandoned in favor of the softmax function
g_k(T) = exp(T_k) / Σ_{l=1}^K exp(T_l)
This is exactly the transformation used in the multi-class logistic regression model; it produces positive estimates that sum to one.
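A short R version of the softmax transformation; the max subtraction is a standard numerical-stability trick and does not change the result.

## Softmax output function: positive values summing to one
softmax <- function(T) {
  e <- exp(T - max(T))   # subtract max(T) for numerical stability
  e / sum(e)
}
softmax(c(1, 2, 3))      # approximately 0.090 0.245 0.665, sums to 1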

Why "neural networks"?
The name neural networks derives from the fact that they were first developed as models of the human brain. Each unit represents a neuron, and the connections represent synapses. In early models, a neuron fired when the total signal passed to that unit exceeded a certain threshold; in the model above, this corresponds to using a step function for σ(Z) and g_k(T). Later, the neural network was recognized as a useful tool for nonlinear statistical modeling, and for this purpose the step function is not smooth enough for optimization. Hence, the step function was replaced by the smoother sigmoid function.

Overview: Neural network model; Learning; Practical issues; Examples (simulated data, example in R).

Learning problem
The neural network model has unknown parameters, often called weights, and we seek values for them that make the model fit the training data well. We denote the complete set of weights by θ, which consists of:
M(p + 1) hidden-layer weights {α_{0m}, α_m; m = 1, 2, ..., M}
K(M + 1) output-layer weights {β_{0k}, β_k; k = 1, 2, ..., K}
To train a neural network, we need a measure of fit, or error function.

Error functions
For regression, we use the sum-of-squared errors
R(θ) = Σ_{i=1}^n Σ_{k=1}^K (y_{ik} - f_k(x_i))^2
For classification we use either squared error or cross-entropy (deviance)
R(θ) = -Σ_{i=1}^n Σ_{k=1}^K y_{ik} log f_k(x_i)
and the corresponding classifier is G(x) = arg max_k f_k(x).
With the softmax activation function and the cross-entropy error function, the neural network model is exactly a linear logistic regression model in the hidden units, and all the parameters are estimated by maximum likelihood.
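A minimal sketch of both criteria, assuming Y is an n x K matrix of targets (0-1 indicators in the classification case) and F an n x K matrix of fitted values f_k(x_i); all names are illustrative.

## Sum-of-squared errors (regression) and cross-entropy (classification)
sse <- function(Y, F) sum((Y - F)^2)

cross_entropy <- function(Y, F, eps = 1e-12) {
  -sum(Y * log(pmax(F, eps)))   # clamp F away from 0 to avoid log(0)
}

## Classifier derived from the fitted probabilities: G(x) = argmax_k f_k(x)
classify <- function(F) max.col(F)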

Back-propagation algorithm: principle
The generic approach to minimizing R(θ) is gradient descent, called back-propagation in this setting. Because of the compositional form of the model, the gradient can easily be derived using the chain rule for differentiation. It can be computed by a forward and backward sweep over the network, keeping track only of quantities local to each unit.

Back-propagation algorithm: gradient calculation
Let z_{mi} = σ(α_{0m} + α_m^T x_i) and z_i = (z_{1i}, z_{2i}, ..., z_{Mi}). Then R(θ) = Σ_{i=1}^n R_i with R_i = Σ_{k=1}^K (y_{ik} - f_k(x_i))^2.
Derivatives:
∂R_i/∂β_{km} = ∂R_i/∂f_k(x_i) · ∂f_k(x_i)/∂β_{km} = -2(y_{ik} - f_k(x_i)) g_k'(β_k^T z_i) z_{mi}
∂R_i/∂α_{mj} = Σ_{k=1}^K ∂R_i/∂f_k(x_i) · ∂f_k(x_i)/∂z_{mi} · ∂z_{mi}/∂α_{mj} = -Σ_{k=1}^K 2(y_{ik} - f_k(x_i)) g_k'(β_k^T z_i) β_{km} σ'(α_m^T x_i) x_{ij}

Back-propagation algorithm: weight updates
Given these derivatives, a gradient descent update at the (t+1)st iteration has the form
β_{km}^{(t+1)} = β_{km}^{(t)} - η_t Σ_{i=1}^n ∂R_i/∂β_{km}^{(t)}
α_{mj}^{(t+1)} = α_{mj}^{(t)} - η_t Σ_{i=1}^n ∂R_i/∂α_{mj}^{(t)}
where η_t is the learning rate, discussed below.
How can the gradient be computed efficiently?

Back-propagation algorithm: back-propagation equations
We can write
∂R_i/∂β_{km} = δ_{ki} z_{mi},   ∂R_i/∂α_{mj} = s_{mi} x_{ij}
The quantities δ_{ki} and s_{mi} are errors from the current model at the output and hidden layer units, respectively. They satisfy
s_{mi} = σ'(α_m^T x_i) Σ_{k=1}^K β_{km} δ_{ki}
Using this, the weight updates can be implemented with a two-pass algorithm:
1. In the forward pass, the current weights are fixed and the predicted values f_k(x_i) are computed.
2. In the backward pass, the errors δ_{ki} are computed, and then back-propagated via the equation above to give the errors s_{mi}.
Both sets of errors are then used to compute the gradients for the weight updates.
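As a concrete sketch, the R function below performs one forward and backward sweep for a single training case (x_i, y_i), under the squared-error loss with identity output function and sigmoid hidden units. The weight objects alpha, alpha0, beta, beta0 and the function name are hypothetical.

## One forward/backward sweep for a single training case (sketch)
## alpha: M x p, alpha0: length M, beta: K x M, beta0: length K
sigmoid       <- function(v) 1 / (1 + exp(-v))
sigmoid_prime <- function(v) sigmoid(v) * (1 - sigmoid(v))

backprop_one <- function(x, y, alpha0, alpha, beta0, beta) {
  ## Forward pass: derived features and fitted values
  a <- alpha0 + drop(alpha %*% x)       # hidden-unit pre-activations
  z <- sigmoid(a)                       # z_mi
  f <- beta0 + drop(beta %*% z)         # f_k(x_i), identity output

  ## Backward pass: output- and hidden-layer errors
  delta <- -2 * (y - f)                                 # delta_ki
  s     <- sigmoid_prime(a) * drop(t(beta) %*% delta)   # s_mi

  ## Gradients of R_i with respect to the weights
  list(grad_beta   = outer(delta, z),   # dR_i / d beta_km  = delta_ki * z_mi
       grad_beta0  = delta,
       grad_alpha  = outer(s, x),       # dR_i / d alpha_mj = s_mi * x_ij
       grad_alpha0 = s)
}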

Advantages of back-propagation
The advantages of back-propagation are its simple, local nature. In the back-propagation algorithm, each hidden unit passes and receives information only to and from units that share a connection. Hence it can be implemented efficiently on a parallel-architecture computer.

Batch vs. online training
The previous update equations are a kind of batch learning, with the parameter updates being a sum over all of the training cases. Learning can also be carried out online, processing each observation one at a time, updating the gradient after each training case, and cycling through the training cases many times. In this case, the update equations become
β_{km}^{(t+1)} = β_{km}^{(t)} - η_t ∂R_i/∂β_{km}^{(t)}
α_{mj}^{(t+1)} = α_{mj}^{(t)} - η_t ∂R_i/∂α_{mj}^{(t)}
A training epoch refers to one sweep through the entire training set. Online training allows the network to handle very large training sets, and also to update the weights as new observations come in.
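A sketch of the online (stochastic gradient descent) loop, reusing the hypothetical backprop_one() helper from the earlier sketch, with X an n x p matrix of inputs, Y an n x K matrix of targets, and a decreasing learning rate η_t = 1/t.

## Online training: one weight update per training case, several epochs (sketch)
online_train <- function(X, Y, alpha0, alpha, beta0, beta, n_epochs = 50) {
  t <- 0
  for (epoch in seq_len(n_epochs)) {
    for (i in sample(nrow(X))) {        # visit training cases in random order
      t   <- t + 1
      eta <- 1 / t                      # learning rate decreasing to zero
      g   <- backprop_one(X[i, ], Y[i, ], alpha0, alpha, beta0, beta)
      beta   <- beta   - eta * g$grad_beta
      beta0  <- beta0  - eta * g$grad_beta0
      alpha  <- alpha  - eta * g$grad_alpha
      alpha0 <- alpha0 - eta * g$grad_alpha0
    }
  }
  list(alpha0 = alpha0, alpha = alpha, beta0 = beta0, beta = beta)
}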

Learning rate tuning and optimization algorithms
The learning rate η_t for batch learning was originally taken to be a constant; it can also be optimized by a line search that minimizes the error function at each update. With online learning, η_t should decrease to zero as the iteration number t increases, in order to ensure convergence (e.g., η_t = 1/t). Back-propagation can be very slow, and for that reason is usually not the method of choice. Second-order techniques such as Newton's method are not attractive, because the second derivative matrix of R (the Hessian) can be very large. Better approaches to fitting include conjugate gradients and variable metric methods, which avoid explicit computation of the second derivative matrix while still providing faster convergence.
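In R, one way to use such methods is through the general-purpose optimizer optim(), which offers conjugate-gradient ("CG") and quasi-Newton ("BFGS", a variable metric method) algorithms. The sketch below assumes hypothetical functions R_theta() and grad_R_theta() that compute the error function and its gradient for a parameter vector theta packing all the alphas and betas, and a starting vector theta_init.

## Fitting the weights with a quasi-Newton (variable metric) method (sketch)
fit <- optim(par     = theta_init,
             fn      = function(theta) R_theta(theta, X, Y),
             gr      = function(theta) grad_R_theta(theta, X, Y),
             method  = "BFGS",          # or "CG" for conjugate gradients
             control = list(maxit = 500))
theta_hat <- fit$par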

Overview: Neural network model; Learning; Practical issues; Examples (simulated data, example in R).

The art of neural network training
There is quite an art in training neural networks. The model is generally overparametrized, and the optimization problem is nonconvex and unstable unless certain guidelines are followed. Here we summarize some of the important issues.

Starting values
If the weights are near zero, then the operative part of the sigmoid is roughly linear, and hence the neural network collapses into an approximately linear model. Usually, starting values for the weights are chosen to be random values near zero; the model thus starts out nearly linear and becomes nonlinear as the weights increase. Use of exact zero weights leads to zero derivatives and perfect symmetry, and the algorithm never moves. Starting instead with large weights often leads to poor solutions.

Overfitting: early stopping
Neural networks often have too many weights and will overfit the data at the global minimum of R. In early developments of neural networks, an early stopping rule was used to avoid overfitting: the model is trained only for a while, and training is stopped well before the global minimum is approached. Since the weights start at a highly regularized (linear) solution, this has the effect of shrinking the final model toward a linear model. A validation dataset is useful for determining when to stop, since we expect the validation error to start increasing.
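A sketch of such an early stopping loop; init_weights(), train_one_epoch() and validation_error() are hypothetical helpers standing in for a weight initializer, one epoch of gradient updates, and the error on the validation set.

## Early stopping: halt when the validation error starts increasing (sketch)
weights <- init_weights()               # hypothetical: small random starting weights
best_err <- Inf; best_weights <- weights; patience <- 5; bad_epochs <- 0
for (epoch in 1:500) {
  weights <- train_one_epoch(weights, train_data)     # hypothetical helper
  err <- validation_error(weights, validation_data)   # hypothetical helper
  if (err < best_err) {
    best_err <- err; best_weights <- weights; bad_epochs <- 0
  } else {
    bad_epochs <- bad_epochs + 1
    if (bad_epochs >= patience) break   # stop well before the global minimum
  }
}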

Overfitting: weight decay
A more explicit method for regularization is weight decay, which is analogous to ridge regression used for linear models. We add a penalty to the error function, R(θ) + λ J(θ), with
J(θ) = Σ_{k,m} β_{km}^2 + Σ_{m,j} α_{mj}^2
The hyper-parameter λ is usually determined by cross-validation. Other forms for the penalty have been proposed, for example
J(θ) = Σ_{k,m} β_{km}^2 / (1 + β_{km}^2) + Σ_{m,j} α_{mj}^2 / (1 + α_{mj}^2)
known as the weight elimination penalty. This has the effect of shrinking smaller weights more than weight decay does.
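A sketch of both penalties in R, assuming alpha and beta are the matrices of connection weights (the bias terms α_{0m} and β_{0k} are not penalized, as in the formulas above).

## Weight-decay and weight-elimination penalties on the connection weights
weight_decay       <- function(alpha, beta) sum(beta^2) + sum(alpha^2)
weight_elimination <- function(alpha, beta)
  sum(beta^2 / (1 + beta^2)) + sum(alpha^2 / (1 + alpha^2))

## The criterion actually minimized is then R(theta) + lambda * J(theta),
## e.g. sse(Y, F) + lambda * weight_decay(alpha, beta)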

Example

Heat maps of estimated weights

Scaling of the inputs
Since the scaling of the inputs determines the effective scaling of the weights in the bottom layer, it can have a large effect on the quality of the final solution. It is best to standardize all inputs to have mean zero and standard deviation one. This ensures that all inputs are treated equally in the regularization process, and allows one to choose a meaningful range for the random starting weights. With standardized inputs, it is typical to take random uniform weights over the range [-0.7, +0.7].
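A short sketch of this recipe, assuming X is the n x p matrix of raw inputs; the network sizes M and K are illustrative.

## Standardize the inputs and draw small random starting weights (sketch)
X_std <- scale(X)                      # mean 0, standard deviation 1 per column
p <- ncol(X_std); M <- 10; K <- 1      # illustrative sizes

alpha  <- matrix(runif(M * p, -0.7, 0.7), M, p)   # hidden-layer weights
alpha0 <- runif(M, -0.7, 0.7)                     # hidden-layer biases
beta   <- matrix(runif(K * M, -0.7, 0.7), K, M)   # output-layer weights
beta0  <- runif(K, -0.7, 0.7)                     # output-layer biases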

Overview: Neural network model; Learning; Practical issues; Examples (simulated data, example in R).

Simulated data
Model: Y = f(X) + ε, with f(X) = σ(a_1^T X) + σ(a_2^T X), X = (X_1, X_2), a_1 = (3, 3), a_2 = (3, -3), and Var(f(X))/Var(ε) = 4.
Training sample of size 100, test sample of size 10,000.
Neural networks fitted with weight decay and various numbers of hidden units.
Average test error reported for each of 10 random starting weight configurations.
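A sketch of how such data could be simulated in R; the Gaussian design for X is an assumption, and the noise standard deviation is chosen so that Var(f(X))/Var(ε) = 4.

## Simulate Y = sigma(a1'X) + sigma(a2'X) + noise  (sketch)
sigmoid <- function(v) 1 / (1 + exp(-v))
a1 <- c(3, 3); a2 <- c(3, -3)

simulate_data <- function(n) {
  X <- matrix(rnorm(n * 2), n, 2)                        # X = (X1, X2), assumed Gaussian
  f <- as.vector(sigmoid(X %*% a1) + sigmoid(X %*% a2))
  sigma_eps <- sqrt(var(f) / 4)                          # so that Var(f)/Var(eps) = 4
  data.frame(x1 = X[, 1], x2 = X[, 2], y = f + rnorm(n, sd = sigma_eps))
}

train <- simulate_data(100)
test  <- simulate_data(10000)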

Results without and with weight decay

Influence of the weight decay hyper-parameter

Example with the nnet package

library(MASS)
mcycle.data <- data.frame(mcycle, x = scale(mcycle$times))
test.data <- data.frame(x = seq(-2, 3, 0.01))
library(nnet)
nn1 <- nnet(accel ~ x, data = mcycle.data, size = 2, linout = TRUE, decay = 0)
pred1 <- predict(nn1, newdata = test.data)
nn2 <- nnet(accel ~ x, data = mcycle.data, size = 10, linout = TRUE, decay = 0)
pred2 <- predict(nn2, newdata = test.data)
nn3 <- nnet(accel ~ x, data = mcycle.data, size = 10, linout = TRUE, decay = 1)
pred3 <- predict(nn3, newdata = test.data)

Results: predicted acceleration curves (accel vs. x) for M = 2; M = 10, λ = 0; and M = 10, λ = 1.
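A sketch of how the three fitted curves could be plotted over the data; the colours and legend placement are arbitrary.

## Plot the data and the three fitted curves (sketch)
plot(mcycle.data$x, mcycle.data$accel, xlab = "x", ylab = "accel")
lines(test.data$x, pred1, col = "blue")    # M = 2,  no decay
lines(test.data$x, pred2, col = "red")     # M = 10, lambda = 0
lines(test.data$x, pred3, col = "purple")  # M = 10, lambda = 1
legend("bottomright", legend = c("M=2", "M=10, lambda=0", "M=10, lambda=1"),
       col = c("blue", "red", "purple"), lty = 1)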

Selection of λ by 10-fold cross-validation (plot of the CV criterion as a function of λ).
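A sketch of how λ could be selected by 10-fold cross-validation with nnet; the grid of λ values and the choice of M = 10 hidden units are illustrative assumptions.

## 10-fold cross-validation over the weight-decay parameter lambda (sketch)
library(nnet)
set.seed(1)
lambdas <- seq(0, 10, by = 1)
folds   <- sample(rep(1:10, length.out = nrow(mcycle.data)))

cv_error <- sapply(lambdas, function(lam) {
  fold_err <- sapply(1:10, function(k) {
    fit <- nnet(accel ~ x, data = mcycle.data[folds != k, ],
                size = 10, linout = TRUE, decay = lam, trace = FALSE)
    pred <- predict(fit, newdata = mcycle.data[folds == k, ])
    mean((mcycle.data$accel[folds == k] - pred)^2)   # held-out squared error
  })
  mean(fold_err)
})

lambdas[which.min(cv_error)]   # selected value of lambda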

Conclusions
Neural networks are a powerful and very general approach to regression and classification. Training a neural network means solving a non-convex optimization problem with a large number of parameters: it is a computer-intensive approach. Neural networks are especially effective in settings where prediction without interpretation is the goal. They are less effective when the goal is to describe the physical process that generated the data and the roles of individual inputs. The difficulty of interpreting these models has limited their use in fields like medicine, where interpretation of the model is very important.