Neural Networks

Haiming Zhou
Division of Statistics, Northern Illinois University
zhouh@niu.edu
Neural Networks

The term neural network has evolved to encompass a large class of models and learning methods. We focus on the most widely used "vanilla" neural net, sometimes called the single hidden layer back-propagation network, or single layer perceptron. The central idea is to extract linear combinations of the inputs as derived features, and then model the target as a nonlinear function of these features. A neural network can thus be viewed as a multi-stage regression or classification model, typically represented by a network diagram.
Schematic of a Single Hidden Layer Network

[Figure: network diagram with inputs $X_1, \dots, X_p$ at the bottom, hidden units $Z_1, \dots, Z_M$ in the middle, and outputs $Y_1, \dots, Y_K$ at the top; not reproduced here.]
Neural Networks

For K-class classification, there are K units at the top, with the kth unit modeling the probability of class k. There are K target measurements $Y_k$, $k = 1, \dots, K$, each being coded as a 0/1 variable for the kth class. Derived features $Z_m$ are created from linear combinations of the inputs, and then the target $Y_k$ is modeled as a function of linear combinations of the $Z_m$:

$Z_m = \sigma(\alpha_{0m} + \alpha_m^T X), \quad m = 1, \dots, M,$
$T_k = \beta_{0k} + \beta_k^T Z, \quad k = 1, \dots, K,$
$f_k(X) = g_k(T), \quad k = 1, \dots, K,$

where $Z = (Z_1, \dots, Z_M)^T$ and $T = (T_1, \dots, T_K)^T$. $\sigma(\cdot)$ is called the activation function; e.g., the logistic function takes the form $\sigma(v) = 1/(1 + e^{-v})$.
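As a concrete illustration, here is a minimal R sketch of the derived features $Z_m$ for a single input vector. The dimensions (p = 3 inputs, M = 4 hidden units) and the random weights are illustrative assumptions, not values from the text.

  ## Derived features Z_m = sigma(alpha_{0m} + alpha_m^T x) for one input x
  set.seed(1)
  p <- 3; M <- 4
  x      <- rnorm(p)                    # one input vector (illustrative)
  alpha0 <- rnorm(M)                    # intercepts alpha_{0m}
  alpha  <- matrix(rnorm(M * p), M, p)  # row m holds alpha_m^T

  sigma <- function(v) 1 / (1 + exp(-v))  # logistic activation

  Z <- sigma(alpha0 + alpha %*% x)      # M x 1 vector of hidden-unit values
  Z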
Neural Networks

The output function $g_k(T)$ allows a final transformation of the vector of outputs $T$. For regression (i.e., $T = T_1$) we typically choose the identity function $g_k(T_1) = T_1$. For K-class classification, we can use the softmax function

$g_k(T) = \frac{e^{T_k}}{\sum_{l=1}^{K} e^{T_l}}.$

The units in the middle of the network, computing the derived features $Z_m$, are called hidden units because the values $Z_m$ are not directly observed. In general there can be more than one hidden layer. We can think of the $Z_m$ as a basis expansion of the original inputs $X$; the neural network is then a standard linear (multilogit) model using these transformations as inputs. Here, unlike before, the parameters of the basis functions are learned from the data.
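Continuing the sketch above, the following completes the forward pass through the output layer with a softmax transformation. K = 2 classes and the beta weights are again illustrative assumptions.

  ## Output layer: T_k = beta_{0k} + beta_k^T Z, then softmax g_k(T)
  K <- 2
  beta0 <- rnorm(K)
  beta  <- matrix(rnorm(K * M), K, M)     # row k holds beta_k^T

  T_out <- beta0 + beta %*% Z             # linear outputs T_k
  f     <- exp(T_out) / sum(exp(T_out))   # softmax: e^{T_k} / sum_l e^{T_l}
  f                                       # class probabilities, sum to 1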
Neural Networks

Notice that if $\sigma$ is the identity function, then the entire model collapses to a linear model in the inputs. By introducing the nonlinear transformation $\sigma$, the neural network greatly enlarges the class of models beyond the linear one. The rate of activation depends on $\|\alpha_m\|$; if $\|\alpha_m\|$ is small, the unit will indeed be operating in the (approximately) linear part of its activation function.
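The collapse to a linear model is easy to verify numerically. This sketch reuses the objects (x, alpha0, alpha, beta0, beta) defined in the earlier sketches and replaces the logistic activation with the identity.

  ## With identity activation, the outputs are exactly linear in x
  identity_act <- function(v) v
  Z_lin <- identity_act(alpha0 + alpha %*% x)   # Z is now linear in x
  T_lin <- beta0 + beta %*% Z_lin               # so T is also linear in x

  ## T_lin equals an intercept plus a single linear map of x:
  all.equal(T_lin, beta0 + beta %*% alpha0 + (beta %*% alpha) %*% x)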
Fitting Neural Networks

The neural network model has unknown parameters, often called weights. Denote the complete set of weights by $\theta$, which consists of

$\{\alpha_{0m}, \alpha_m : m = 1, \dots, M\} \cup \{\beta_{0k}, \beta_k : k = 1, \dots, K\}.$

For regression, we can use sum-of-squared errors as our measure of fit ($K = 1$):

$R(\theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} (y_{ik} - f_k(x_i))^2.$

For classification we use either squared error or cross-entropy:

$R(\theta) = -\sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log f_k(x_i).$
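Both criteria are one-liners in R. This sketch assumes a hypothetical n x K matrix Y of 0/1 targets and a matching matrix F of fitted values or probabilities.

  ## Measures of fit for an n x K target matrix Y and fitted matrix F
  sse           <- function(Y, F) sum((Y - F)^2)     # sum-of-squared errors
  cross_entropy <- function(Y, F) -sum(Y * log(F))   # cross-entropy (F in (0,1))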
Back-propagation

Here is back-propagation in detail for squared error loss. Let $z_{mi} = \sigma(\alpha_{0m} + \alpha_m^T x_i)$ and $z_i = (z_{1i}, \dots, z_{Mi})^T$. Then

$R(\theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} (y_{ik} - f_k(x_i))^2 = \sum_{i=1}^{n} R_i,$

with derivatives

$\frac{\partial R_i}{\partial \beta_{km}} = -2(y_{ik} - f_k(x_i))\, g_k'(\beta_k^T z_i)\, z_{mi} \stackrel{\text{set}}{=} \delta_{ki} z_{mi},$

$\frac{\partial R_i}{\partial \alpha_{ml}} = -\sum_{k=1}^{K} 2(y_{ik} - f_k(x_i))\, g_k'(\beta_k^T z_i)\, \beta_{km}\, \sigma'(\alpha_m^T x_i)\, x_{il} \stackrel{\text{set}}{=} s_{mi} x_{il}.$

The quantities $\delta_{ki}$ and $s_{mi}$ satisfy

$s_{mi} = \sigma'(\alpha_m^T x_i) \sum_{k=1}^{K} \beta_{km} \delta_{ki}, \qquad (1)$

known as the back-propagation equations.
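The backward pass for one observation can be written out directly. A minimal sketch, reusing sigma, x, alpha0, alpha, beta0, and beta from the earlier sketches, and assuming the identity output function $g$ (so $g' = 1$) to keep the algebra simple; the one-hot target y is illustrative, and the intercept gradients are omitted.

  ## Backward pass for one (x_i, y_i) under squared error, identity output
  sigma_prime <- function(v) sigma(v) * (1 - sigma(v))  # logistic derivative

  y <- c(1, 0)                    # one-hot target, K = 2 (illustrative)
  a <- alpha0 + alpha %*% x       # hidden-layer pre-activations
  z <- sigma(a)                   # z_{mi}
  f <- beta0 + beta %*% z         # f_k(x_i), identity output

  delta <- -2 * (y - f)                          # delta_{ki}, K x 1
  s     <- sigma_prime(a) * (t(beta) %*% delta)  # back-prop equation (1), M x 1

  grad_beta  <- delta %*% t(z)    # dR_i/dbeta_{km} = delta_{ki} z_{mi}, K x M
  grad_alpha <- s %*% t(x)        # dR_i/dalpha_{ml} = s_{mi} x_{il},  M x p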
Back-propagation

A gradient descent update at the (r+1)st iteration has the form

$\beta_{km}^{(r+1)} = \beta_{km}^{(r)} - \gamma_r \sum_{i=1}^{n} \frac{\partial R_i}{\partial \beta_{km}^{(r)}},$

$\alpha_{ml}^{(r+1)} = \alpha_{ml}^{(r)} - \gamma_r \sum_{i=1}^{n} \frac{\partial R_i}{\partial \alpha_{ml}^{(r)}},$

where $\gamma_r$ is the learning rate.

Implement these updates with a two-pass algorithm: In the forward pass, the current weights are fixed and the predicted values $\hat{f}_k(x_i)$ are computed. In the backward pass, the errors $\delta_{ki}$ are computed, and then back-propagated via (1) to give the errors $s_{mi}$. This two-pass procedure is what is known as back-propagation.
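A minimal sketch of the update loop, under the assumption that theta is a list holding the weight arrays and that a hypothetical helper grad_R runs the two passes over all n observations and returns the summed gradients; the learning rate and iteration count are arbitrary placeholders.

  ## Batch gradient descent tying the two passes together (sketch)
  fit_nn <- function(theta, grad_R, gamma = 0.1, iters = 100) {
    for (r in seq_len(iters)) {
      g <- grad_R(theta)                         # forward + backward pass
      theta$alpha <- theta$alpha - gamma * g$alpha
      theta$beta  <- theta$beta  - gamma * g$beta
    }
    theta
  }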
R Resources

Fitting a neural network in R: the neuralnet package. Link
R-Session 11 - Statistical Learning - Neural Networks. Link
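A quick example with the neuralnet package, following its documented interface and the infert dataset used in the package's own examples; hidden = 2 is an illustrative choice.

  ## Single-hidden-layer network for a binary outcome with neuralnet
  library(neuralnet)

  fit <- neuralnet(case ~ age + parity + induced + spontaneous,
                   data = infert, hidden = 2,
                   err.fct = "ce",            # cross-entropy loss
                   linear.output = FALSE)     # logistic output for 0/1 target

  plot(fit)                                   # network diagram with weights
  head(fit$net.result[[1]])                   # fitted probabilities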
References

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd Edition. http://statweb.stanford.edu/~tibs/elemstatlearn/

Prof. Xiaogang Su's Data Mining and Statistical Learning I. https://sites.google.com/site/xgsu00/

Prof. Yaser Abu-Mostafa's lecture. Link