Neural Networks. Advanced Data Mining. Yongdai Kim, Department of Statistics, Seoul National University, South Korea
What is a Neural Network? A supervised learning method that uses one or more hidden layers, loosely imitating the structure of the brain. Pros: strong predictive power. Cons: limited interpretability.
What is a Neural Network? In 1943, the psychologist W.S. McCulloch and the mathematical logician W. Pitts presented a mathematical model, the M-P model (W.S. McCulloch and W. Pitts, 1943), which is viewed as the earliest neural network. In 1958, F. Rosenblatt developed artificial neural networks further by proposing the perceptron model and its learning algorithm (F. Rosenblatt, 1958). A perceptron can only solve linearly separable binary problems. A low-tide period in neural network research followed, and the field did not regain broad attention until the 1980s, when Hopfield networks (J.J. Hopfield, 1982), Boltzmann machines (D.H. Ackley et al., 1985) and multilayer perceptrons (D.E. Rumelhart et al., 1986) were presented.
What is a Neural Network? Examples of neural networks. Figure: Single-layer perceptron and multi-layer perceptron
Structure of Neural Networks. Consider the K-class classification neural network model with one hidden layer:
z_m^(0) = b_m^(0) + w_m^(0)T x, m = 1, ..., M
h_m = σ(z_m^(0)), m = 1, ..., M
z_k^(1) = b_k^(1) + w_k^(1)T h, k = 1, ..., K
f_k(x) = g_k(z^(1)), k = 1, ..., K
where x = (x_1, ..., x_P) and h = (h_1, ..., h_M).
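The layered computation above can be sketched in plain Python. This is a minimal illustration, not the lecture's implementation: it assumes a sigmoid activation, a softmax output, and a list-of-lists weight layout (W0 has one row of length P per hidden unit, W1 one row of length M per output unit).

```python
import math

def forward(x, W0, b0, W1, b1):
    # hidden pre-activations z_m^(0) = b_m^(0) + w_m^(0)T x
    z0 = [b0[m] + sum(w * xi for w, xi in zip(W0[m], x)) for m in range(len(b0))]
    # hidden activations h_m = sigma(z_m^(0)), sigma = sigmoid (assumed)
    h = [1.0 / (1.0 + math.exp(-z)) for z in z0]
    # output pre-activations z_k^(1) = b_k^(1) + w_k^(1)T h
    z1 = [b1[k] + sum(w * hm for w, hm in zip(W1[k], h)) for k in range(len(b1))]
    # output f_k(x) = g_k(z), here g = softmax (max subtracted for stability)
    mx = max(z1)
    ez = [math.exp(z - mx) for z in z1]
    s = sum(ez)
    return [e / s for e in ez]
```

With P = 2 inputs, M = 2 hidden units and K = 2 classes, `forward` returns a length-K probability vector.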
Structure of Neural Networks.
x_p, p = 1, ..., P : input units (nodes)
h_m, m = 1, ..., M : hidden units (nodes)
y_k, k = 1, ..., K : output units (nodes)
w_m^(0), w_k^(1) : weight vectors
b_m^(0), b_k^(1) : biases
σ : activation function
g_k : output function
What is a Neural Network? Figure: Schematic of a single-hidden-layer neural network
Structure of Neural Networks. Activation functions:
Sigmoid function: sigmoid(x) = exp(x) / (1 + exp(x))
tanh function: tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
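The two activation functions can be written directly from their definitions; a small sketch:

```python
import math

def sigmoid(x):
    # sigmoid(x) = exp(x) / (1 + exp(x))
    return math.exp(x) / (1.0 + math.exp(x))

def tanh(x):
    # tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))
```

Note the two are related by tanh(x) = 2 sigmoid(2x) - 1, so tanh is a rescaled sigmoid with range (-1, 1) instead of (0, 1).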
Structure of Neural Networks. Output function (final transformation function):
Softmax function (classification case): g_k(z) = exp(z_k) / Σ_{l=1}^{K} exp(z_l), k = 1, ..., K
Identity function (regression case): g_k(z) = z_k, k = 1, ..., K
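A minimal softmax sketch. The max-subtraction trick is an implementation detail not in the formula above: it leaves the result mathematically unchanged but avoids overflow for large z_k.

```python
import math

def softmax(z):
    # subtract max(z) before exponentiating; the result is identical
    # because the common factor exp(-max) cancels in the ratio
    m = max(z)
    e = [math.exp(zk - m) for zk in z]
    s = sum(e)
    return [ek / s for ek in e]
```

The output is a probability vector: nonnegative entries summing to 1, preserving the ordering of the inputs.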
Fitting Neural Networks using Back-Propagation. Objective (loss) function. Consider the 1-hidden-layer neural network and assume there are N samples.
θ^(0) := {b^(0), W^(0)} : M(P + 1) parameters.
θ^(1) := {b^(1), W^(1)} : K(M + 1) parameters.
Parameter set to estimate: θ := (θ^(0), θ^(1)).
Fitting Neural Networks using Back-Propagation.
Regression case: we use sum-of-squared errors: l(θ) = Σ_{i=1}^{N} Σ_{k=1}^{K} (y_ik - f_k(x_i))^2
Classification case: we usually use cross-entropy: l(θ) = -Σ_{i=1}^{N} Σ_{k=1}^{K} y_ik log f_k(x_i)
and the corresponding classifier is G(x) = argmax_k f_k(x).
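Both losses translate directly into code. A sketch, assuming Y holds the targets y_ik row by row (one-hot labels in the classification case) and F holds the network outputs f_k(x_i):

```python
import math

def sse_loss(Y, F):
    # sum-of-squared errors: sum_i sum_k (y_ik - f_k(x_i))^2
    return sum((y - f) ** 2 for yr, fr in zip(Y, F) for y, f in zip(yr, fr))

def cross_entropy_loss(Y, F):
    # cross-entropy: -sum_i sum_k y_ik * log f_k(x_i)
    # (terms with y_ik = 0 contribute nothing, so they are skipped)
    return -sum(y * math.log(f) for yr, fr in zip(Y, F) for y, f in zip(yr, fr) if y > 0)
```

For example, a one-hot target (1, 0) against the prediction (0.5, 0.5) gives SSE 0.5 and cross-entropy log 2.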
Fitting Neural Networks using Back-Propagation. Unlike simple models, the loss function of a neural network is complex: there is no closed form for the estimates, so we find them with an iterative algorithm. Figure: Iterative algorithm
Fitting Neural Networks using Back-Propagation. Gradient descent algorithm. A widely used iterative, first-order optimization algorithm. To find a local minimum of a function with gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point. Also known as the steepest descent algorithm.
Fitting Neural Networks using Back-Propagation. Gradient descent algorithm:
1. Input: a differentiable (loss) function f(θ) and initial parameters θ^(0).
2. For t in 1 : T:
   Calculate the gradient vector: grad(θ^(t-1)) = ∇_θ f(θ) |_{θ = θ^(t-1)}.
   Update the parameters: θ^(t) ← θ^(t-1) - ε grad(θ^(t-1)).
3. Output: final estimates θ^(T).
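The three steps above can be sketched as a short function; the parameter vector is represented as a plain Python list, and `grad` is any callable returning the gradient at a point:

```python
def gradient_descent(grad, theta0, eps=0.1, T=100):
    # theta^(t) <- theta^(t-1) - eps * grad(theta^(t-1)), repeated T times
    theta = list(theta0)
    for _ in range(T):
        g = grad(theta)
        theta = [p - eps * gi for p, gi in zip(theta, g)]
    return theta
```

On the convex test function f(θ) = θ_1^2 + θ_2^2, whose gradient is (2θ_1, 2θ_2), the iterates contract toward the minimum at the origin by a factor 1 - 2ε per step.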
Fitting Neural Networks using Back-Propagation. Choosing the learning rate ε. It is important to choose the learning rate ε well. Large learning rate: the algorithm may not converge. Small learning rate: it converges very slowly and may not reach the minimum within the iteration budget.
Fitting Neural Networks using Back-Propagation. There are many algorithms for choosing the learning rate. The most popular is the line search algorithm. Line search algorithm: at iteration t, choose the learning rate ε_t which minimizes
φ(ε) = f(θ^(t-1) - ε grad(θ^(t-1))),
and update θ^(t) ← θ^(t-1) - ε_t grad(θ^(t-1)). Figure: Line search algorithm
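A minimal line search sketch. Instead of minimizing φ(ε) exactly, it evaluates φ on a fixed grid of candidate step sizes (an illustrative simplification; practical implementations use exact or backtracking searches):

```python
def line_search_step(f, theta, g, eps_grid=(1.0, 0.5, 0.25, 0.125, 0.0625)):
    # pick the step size on the grid minimizing phi(eps) = f(theta - eps * g)
    best_eps = min(eps_grid, key=lambda e: f([p - e * gi for p, gi in zip(theta, g)]))
    # update theta^(t) <- theta^(t-1) - eps_t * g
    return [p - best_eps * gi for p, gi in zip(theta, g)]
```

For f(θ) = θ_1^2 + θ_2^2 at θ = (4, 0) with gradient g = (8, 0), φ(ε) = (4 - 8ε)^2 is minimized at ε = 0.5, which lands exactly on the minimum.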
Fitting Neural Networks using Back-Propagation. Back-propagation algorithm. It is just the gradient descent algorithm applied to a neural network. Let's check the origin of the name "back-propagation".
Fitting Neural Networks using Back-Propagation. Some notation. Consider an L-hidden-layer neural network. For simplicity of the formulas, we use no biases.
h^(l), l = 0, ..., L+1 : the l-th layer (h^(0) = x, h^(L+1) = f(x)); each layer h^(l) has n_l nodes.
z_j^(l) = Σ_{i=1}^{n_{l-1}} w_{ij}^(l-1) h_i^(l-1), l = 1, ..., L+1
h_j^(l) = σ(z_j^(l)), l = 1, ..., L : hidden layers
h^(L+1) = g(z^(L+1)) : output
l(θ) : loss function
Fitting Neural Networks using Back-Propagation. Back-propagation algorithm:
∂l/∂w_{ij}^(l) = (∂z_j^(l+1)/∂w_{ij}^(l)) (∂l/∂z_j^(l+1)) = h_i^(l) (∂l/∂z_j^(l+1)) = h_i^(l) σ'(z_j^(l+1)) (∂l/∂h_j^(l+1))
Fitting Neural Networks using Back-Propagation.
If l = L: ∂l/∂h^(L+1) = ∂l(h^(L+1))/∂h^(L+1).
Else (i.e. l < L): ∂l/∂h_j^(l+1) = Σ_{k=1}^{n_{l+2}} (∂z_k^(l+2)/∂h_j^(l+1)) (∂l/∂z_k^(l+2)) = Σ_{k=1}^{n_{l+2}} w_{jk}^(l+1) σ'(z_k^(l+2)) (∂l/∂h_k^(l+2)).
The gradients of the l-th layer parameters depend only on the values of the upper layers.
Fitting Neural Networks using Back-Propagation. Example of the BP algorithm. Consider the 1-hidden-layer neural network for regression:
z_m^(0) = b_m^(0) + w_m^(0)T x, m = 1, ..., M
h_m = σ(z_m^(0)), m = 1, ..., M
z_k^(1) = b_k^(1) + w_k^(1)T h, k = 1, ..., K
f_k(x) = z_k^(1), k = 1, ..., K
where x = (x_1, ..., x_P) and h = (h_1, ..., h_M). We use sum-of-squared errors:
l(θ) = Σ_{i=1}^{N} Σ_{k=1}^{K} (y_ik - f_k(x_i))^2
Fitting Neural Networks using Back-Propagation. Calculate the gradients:
∂l/∂w_{mk}^(1) = -2 Σ_{i=1}^{N} (y_ik - f_k(x_i)) h_im
∂l/∂b_k^(1) = -2 Σ_{i=1}^{N} (y_ik - f_k(x_i))
∂l/∂w_{pm}^(0) = -2 Σ_{i=1}^{N} Σ_{k=1}^{K} (y_ik - f_k(x_i)) w_{mk}^(1) σ'(z_im^(0)) x_ip
∂l/∂b_m^(0) = -2 Σ_{i=1}^{N} Σ_{k=1}^{K} (y_ik - f_k(x_i)) w_{mk}^(1) σ'(z_im^(0))
Repeat until convergence: θ ← θ - ε grad(θ)
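The four gradient formulas above can be checked with a direct sketch. This is an illustrative pure-Python version assuming a sigmoid activation σ (so σ'(z) = σ(z)(1 - σ(z))); W0 is indexed [p][m] and W1 is indexed [m][k], matching the slide's subscripts:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradients(X, Y, W0, b0, W1, b1):
    """Gradients of l = sum_i sum_k (y_ik - f_k(x_i))^2 for the
    one-hidden-layer regression network (identity output)."""
    M, K, P = len(b0), len(b1), len(X[0])
    gW0 = [[0.0] * M for _ in range(P)]
    gb0 = [0.0] * M
    gW1 = [[0.0] * K for _ in range(M)]
    gb1 = [0.0] * K
    for x, y in zip(X, Y):
        # forward pass
        z0 = [b0[m] + sum(W0[p][m] * x[p] for p in range(P)) for m in range(M)]
        h = [sigmoid(z) for z in z0]
        f = [b1[k] + sum(W1[m][k] * h[m] for m in range(M)) for k in range(K)]
        r = [y[k] - f[k] for k in range(K)]            # residuals y_ik - f_k(x_i)
        # output-layer gradients: -2 * residual * (input to that weight)
        for k in range(K):
            gb1[k] += -2.0 * r[k]
            for m in range(M):
                gW1[m][k] += -2.0 * r[k] * h[m]
        # hidden-layer gradients: back-propagate through W1 and sigma'
        for m in range(M):
            s = sum(r[k] * W1[m][k] for k in range(K))
            ds = h[m] * (1.0 - h[m])                   # sigma'(z_m^(0)) for sigmoid
            gb0[m] += -2.0 * s * ds
            for p in range(P):
                gW0[p][m] += -2.0 * s * ds * x[p]
    return gW0, gb0, gW1, gb1
```

With a single sample and all weights zero, the hidden activation is sigmoid(0) = 0.5 and the output is 0, so the output-layer gradients are -2(y - f) and -2(y - f)h, while the hidden-layer gradients vanish because the back-propagated signal passes through W1 = 0.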
Vanishing gradient problem. Gradient-based algorithms work well in shallow neural networks, but a difficulty arises when training deep networks with gradient-based learning methods: the back-propagated signals diminish in the lower layers, so the algorithm stops early with the parameters of the lower layers hardly changed at all.
Vanishing gradient problem Figure: Vanishing gradient problem
Vanishing gradient problem. In deep structures there are many bad local minima. Because of the vanishing gradient problem, the learning process of a deep neural network may be very slow and often gets trapped in poor local minima. As a result, the predictive power of a trained deep network can be bad, even worse than that of a shallow network.
Vanishing gradient problem. Explaining the vanishing gradient problem. Consider the simplest deep neural network: one with just a single neuron in each layer. Here is a network with three hidden layers:
z_j = w_j h_{j-1} + b_j, h_j = σ(z_j),
where w_1, ..., w_4 are the weights, b_1, ..., b_4 are the biases, C is some loss function, σ is the sigmoid function, and h_0 = x.
Vanishing gradient problem. Explaining the vanishing gradient problem. We can calculate the gradient vector easily. Example:
∂C/∂b_3 = σ'(z_3) w_4 σ'(z_4) ∂C/∂h_4
∂C/∂b_1 = σ'(z_1) w_2 σ'(z_2) w_3 σ'(z_3) w_4 σ'(z_4) ∂C/∂h_4
Figure: max_z σ'(z) = 1/4 < 1
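The shrinkage can be observed numerically with a small sketch of the single-neuron chain. For illustration, ∂C/∂h_4 is taken to be 1, so each gradient is just the product of σ'(z_i) and w_i factors in the formulas above:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_wrt_b(weights, biases, x):
    """dC/db_j for the chain h_j = sigmoid(w_j * h_{j-1} + b_j),
    with dC/dh_last set to 1 for illustration."""
    # forward pass, storing the pre-activations z_j
    h, zs = x, []
    for w, b in zip(weights, biases):
        z = w * h + b
        zs.append(z)
        h = sigmoid(z)
    # dC/db_j = sigma'(z_j) * prod_{i > j} w_i * sigma'(z_i)
    grads = []
    for j in range(len(zs)):
        g = sigmoid(zs[j]) * (1.0 - sigmoid(zs[j]))    # sigma'(z_j) <= 1/4
        for i in range(j + 1, len(zs)):
            g *= weights[i] * sigmoid(zs[i]) * (1.0 - sigmoid(zs[i]))
        grads.append(g)
    return grads
```

With four layers, unit weights and zero biases, the gradient for b_1 picks up three extra factors of w σ'(z) ≤ 1/4 compared with the gradient for b_4, so it is smaller by roughly two orders of magnitude.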
Theoretical foundation of Neural Networks. There is a wealth of literature discussing approximation, estimation and complexity of artificial neural networks (e.g. M. Anthony and P. Bartlett, 2009). Neural networks as universal approximators: a well-known result states that a neural network with a single, sufficiently large hidden layer is a universal approximator (G. Cybenko, 1989; K. Hornik et al., 1989).
Theoretical foundation of Neural Networks. G. Cybenko (1989): finite sums of the form Σ_j α_j σ(w_j^T x + b_j), with σ a continuous sigmoidal function, are dense in the space of continuous functions on the unit cube, where sigmoidal (in this theorem) means:
σ(t) → 1 as t → +∞, σ(t) → 0 as t → -∞.
References
D.H. Ackley, G.E. Hinton and T.J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1), pp. 147-169, 1985.
M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.
G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), pp. 303-314, 1989.
J.J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8), pp. 2554-2558, 1982.
References
K. Hornik, M. Stinchcombe and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), pp. 359-366, 1989.
W.S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4), pp. 115-133, 1943.
F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), p. 386, 1958.
D.E. Rumelhart, G.E. Hinton and R.J. Williams. Learning internal representations by error propagation. DTIC Document, 1985.