Neural Networks. Advanced data-mining. Yongdai Kim. Department of Statistics, Seoul National University, South Korea


What are Neural Networks? A supervised learning method that uses one or more hidden layers, loosely imitating the structure of the brain. Pros: strong predictive power. Cons: limited interpretability.

What are Neural Networks? In 1943, the psychologist W.S. McCulloch and the mathematical logician W. Pitts presented the first mathematical model of a neuron, the MP model (W.S. McCulloch and W. Pitts, 1943), which is viewed as the earliest neural network. In 1957, F. Rosenblatt developed artificial neural networks further by proposing the perceptron model and its learning algorithm (F. Rosenblatt, 1958). The perceptron can only solve linearly separable binary problems. A low-tide period in neural network research followed, and the field did not regain broad attention until the 1980s, when Hopfield networks (J.J. Hopfield, 1982), Boltzmann machines (D.H. Ackley et al., 1985) and multilayer perceptrons trained by back-propagation (D.E. Rumelhart et al., 1985) were presented.

What are Neural Networks? Examples of neural networks. Figure: single-layer perceptron and multi-layer perceptron.

Structure of Neural Networks

Consider a neural network for K-class classification with one hidden layer:

$$z_m^{(0)} = b_m^{(0)} + w_m^{(0)\top} x, \quad m = 1, \ldots, M$$
$$h_m = \sigma(z_m^{(0)}), \quad m = 1, \ldots, M$$
$$z_k^{(1)} = b_k^{(1)} + w_k^{(1)\top} h, \quad k = 1, \ldots, K$$
$$f_k(x) = g_k(z^{(1)}), \quad k = 1, \ldots, K$$

where $x = (x_1, \ldots, x_P)$ and $h = (h_1, \ldots, h_M)$.

Structure of Neural Networks

Notation:
$x_p,\ p = 1, \ldots, P$ : input unit (node)
$h_m,\ m = 1, \ldots, M$ : hidden unit (node)
$y_k,\ k = 1, \ldots, K$ : output unit (node)
$w_m^{(0)},\ w_k^{(1)}$ : weight vectors
$b_m^{(0)},\ b_k^{(1)}$ : biases
$\sigma$ : activation function
$g_k$ : output function

What are Neural Networks? Figure: schematic of a single-hidden-layer neural network.

Structure of Neural Networks

Activation function:

Sigmoid function: $\mathrm{sigmoid}(x) = \dfrac{\exp(x)}{1 + \exp(x)}$

tanh function: $\tanh(x) = \dfrac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}$
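
As a small illustration, here is a NumPy sketch of these two activation functions (the function names are my own; np.tanh implements the same formula written above):

```python
import numpy as np

def sigmoid(x):
    # sigmoid(x) = exp(x) / (1 + exp(x)) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
    return np.tanh(x)

print(sigmoid(0.0), tanh(0.0))  # 0.5 0.0
```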

Structure of Neural Networks

Output function (final transformation function):

Softmax function (classification case): $g_k(z) = \dfrac{\exp(z_k)}{\sum_{l=1}^{K} \exp(z_l)}, \quad k = 1, \ldots, K$

Identity function (regression case): $g_k(z) = z_k, \quad k = 1, \ldots, K$
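
Putting the structure, activation and output functions together, a hedged NumPy sketch of the forward pass $f_k(x)$ might look like this (array shapes, variable names and the choice of the sigmoid activation are my own assumptions, not fixed by the slides):

```python
import numpy as np

def forward(x, W0, b0, W1, b1, task="classification"):
    """x: (P,) input; W0: (M, P), b0: (M,); W1: (K, M), b1: (K,)."""
    z0 = b0 + W0 @ x                 # z_m^(0) = b_m^(0) + w_m^(0)^T x
    h = 1.0 / (1.0 + np.exp(-z0))    # h_m = sigmoid(z_m^(0))
    z1 = b1 + W1 @ h                 # z_k^(1) = b_k^(1) + w_k^(1)^T h
    if task == "classification":
        # softmax output: g_k(z) = exp(z_k) / sum_l exp(z_l)
        e = np.exp(z1 - z1.max())    # shift for numerical stability
        return e / e.sum()
    return z1                        # identity output for regression

# Example with random weights: P = 4 inputs, M = 3 hidden units, K = 2 classes.
rng = np.random.default_rng(0)
f = forward(rng.normal(size=4),
            rng.normal(size=(3, 4)), rng.normal(size=3),
            rng.normal(size=(2, 3)), rng.normal(size=2))
print(f, f.sum())  # class probabilities summing to 1
```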

Fitting Neural Networks using Back-Propagation

Objective (loss) function. Consider a 1-hidden-layer neural network and assume there are N samples.
$\theta^{(0)} := \{b^{(0)}, W^{(0)}\}$ : $M(P+1)$ parameters.
$\theta^{(1)} := \{b^{(1)}, W^{(1)}\}$ : $K(M+1)$ parameters.
Parameter set to estimate: $\theta := (\theta^{(0)}, \theta^{(1)})$.

Fitting Neural Networks using Back-Propagation

Regression case. We use the sum-of-squared errors:
$$l(\theta) = \sum_{i=1}^{N} \sum_{k=1}^{K} \bigl(y_{ik} - f_k(x_i)\bigr)^2$$

Classification case. We usually use the cross-entropy:
$$l(\theta) = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log f_k(x_i)$$

and the corresponding classifier is $G(x) = \arg\max_k f_k(x)$.
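
A minimal sketch of the two objective functions and the induced classifier, assuming the targets are stored as an N-by-K matrix (one-hot rows in the classification case); the function names are my own:

```python
import numpy as np

def sum_of_squared_errors(Y, F):
    """Y, F: (N, K) arrays of targets y_ik and network outputs f_k(x_i)."""
    return np.sum((Y - F) ** 2)

def cross_entropy(Y, F, eps=1e-12):
    """Y: (N, K) one-hot targets; F: (N, K) predicted class probabilities."""
    return -np.sum(Y * np.log(F + eps))  # eps guards against log(0)

def classify(F):
    # G(x) = argmax_k f_k(x)
    return np.argmax(F, axis=1)
```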

Fitting Neural Networks using Back-Propagation

Unlike simpler models, the loss function of a neural network is highly complex: there is no closed-form expression for the estimates, so we find them with an iterative algorithm. Figure: iterative algorithm.

Fitting Neural Networks using Back-Propagation

Gradient descent algorithm. A widely used iterative, first-order optimization algorithm. To find a local minimum of a function with gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point. Also known as the steepest descent algorithm.

Fitting Neural Networks using Back-Propagation

Gradient descent algorithm
1. Input: a differentiable (loss) function $f(\theta)$ and initial parameters $\theta^{(0)}$.
2. For $t$ in $1 : T$:
   Calculate the gradient vector: $\mathrm{grad}(\theta^{(t-1)}) = \nabla_\theta f(\theta)\big|_{\theta = \theta^{(t-1)}}$.
   Update the parameters: $\theta^{(t)} \leftarrow \theta^{(t-1)} - \epsilon\, \mathrm{grad}(\theta^{(t-1)})$.
3. Output: final estimates $\theta^{(T)}$.
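
As an illustration of the loop above, a small sketch with a toy quadratic objective (the example function and all names are my own, not from the slides):

```python
import numpy as np

def gradient_descent(grad, theta0, eps=0.1, T=100):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(T):
        theta = theta - eps * grad(theta)   # theta^(t) <- theta^(t-1) - eps * grad
    return theta

# Example: minimize f(theta) = ||theta - 3||^2, whose gradient is 2*(theta - 3).
theta_hat = gradient_descent(lambda th: 2.0 * (th - 3.0), theta0=[0.0, 0.0])
print(theta_hat)  # approaches [3, 3]
```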

Fitting Neural Networks using Back-Propagation

Choosing the learning rate $\epsilon$. The choice of the learning rate $\epsilon$ is important. Too large a learning rate: the algorithm may not converge. Too small a learning rate: it converges very slowly and may not reach the minimum within the iteration budget.

Fitting Neural Networks using Back-Propagation

There are many algorithms for choosing the learning rate. The most popular one is the line search algorithm.

Line search algorithm. At iteration $t$, choose the learning rate $\epsilon_t$ that minimizes
$$\phi(\epsilon) = f\bigl(\theta^{(t-1)} - \epsilon\, \mathrm{grad}(\theta^{(t-1)})\bigr),$$
and update
$$\theta^{(t)} \leftarrow \theta^{(t-1)} - \epsilon_t\, \mathrm{grad}(\theta^{(t-1)}).$$

Figure: line search algorithm
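
A hedged sketch of one line-search step: here $\phi(\epsilon)$ is simply minimized over a grid of candidate step sizes (the grid itself is my own assumption; an exact line search would use a one-dimensional optimizer instead):

```python
import numpy as np

def line_search_step(f, grad, theta, candidates=np.logspace(-4, 0, 20)):
    g = grad(theta)
    phi = [f(theta - eps * g) for eps in candidates]   # phi(eps) = f(theta - eps * grad)
    eps_t = candidates[int(np.argmin(phi))]
    return theta - eps_t * g, eps_t

# Example with the same toy quadratic as before.
f = lambda th: np.sum((th - 3.0) ** 2)
grad = lambda th: 2.0 * (th - 3.0)
theta, eps_t = line_search_step(f, grad, np.zeros(2))
print(theta, eps_t)
```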

Fitting Neural Networks using Back-Propagation

Back-propagation algorithm. It is just the gradient descent algorithm applied to a neural network. Let's see where the name "back-propagation" comes from.

Fitting Neural Networks using Back-Propagation

Some notation. Consider an L-hidden-layer neural network. For simplicity of the formulas, we do not use any biases.
$h^{(l)},\ l = 0, \ldots, L+1$ : the $l$-th layer, with $h^{(0)} = x$ and $h^{(L+1)} = f(x)$. Each layer $h^{(l)}$ has $n_l$ nodes.
$$z_j^{(l)} = \sum_{i=1}^{n_{l-1}} w_{ij}^{(l-1)} h_i^{(l-1)}, \quad l = 1, \ldots, L+1$$
$h_j^{(l)} = \sigma(z_j^{(l)}),\ l = 1, \ldots, L$ : hidden layers
$h_j^{(L+1)} = g(z_j^{(L+1)})$ : output
$l(\theta)$ : loss function

Fitting Neural Networks using Back-Propagation

Back-propagation algorithm
$$\frac{\partial l}{\partial w_{ij}^{(l)}}
= \frac{\partial z_j^{(l+1)}}{\partial w_{ij}^{(l)}} \frac{\partial l}{\partial z_j^{(l+1)}}
= h_i^{(l)} \frac{\partial l}{\partial z_j^{(l+1)}}
= h_i^{(l)}\, \sigma'(z_j^{(l+1)})\, \frac{\partial l}{\partial h_j^{(l+1)}}$$

Fitting Neural Networks using Back-Propagation

If $l = L$,
$$\frac{\partial l}{\partial h^{(L+1)}} = \frac{\partial l(h^{(L+1)})}{\partial h^{(L+1)}}.$$
Else (i.e. $l < L$),
$$\frac{\partial l}{\partial h_j^{(l+1)}}
= \sum_{k=1}^{n_{l+2}} \frac{\partial z_k^{(l+2)}}{\partial h_j^{(l+1)}} \frac{\partial l}{\partial z_k^{(l+2)}}
= \sum_{k=1}^{n_{l+2}} w_{jk}^{(l+1)} \frac{\partial l}{\partial z_k^{(l+2)}}
= \sum_{k=1}^{n_{l+2}} w_{jk}^{(l+1)}\, \sigma'(z_k^{(l+2)})\, \frac{\partial l}{\partial h_k^{(l+2)}}.$$

The gradients of the $l$-th layer parameters depend only on quantities from the upper (later) layers.

Fitting Neural Networks using Back-Propagation

Example of the BP algorithm. Consider the 1-hidden-layer neural network for regression:
$$z_m^{(0)} = b_m^{(0)} + w_m^{(0)\top} x, \quad m = 1, \ldots, M$$
$$h_m = \sigma(z_m^{(0)}), \quad m = 1, \ldots, M$$
$$z_k^{(1)} = b_k^{(1)} + w_k^{(1)\top} h, \quad k = 1, \ldots, K$$
$$f_k(x) = z_k^{(1)}, \quad k = 1, \ldots, K$$
where $x = (x_1, \ldots, x_P)$ and $h = (h_1, \ldots, h_M)$. We use the sum-of-squared errors:
$$l(\theta) = \sum_{i=1}^{N} \sum_{k=1}^{K} \bigl(y_{ik} - f_k(x_i)\bigr)^2.$$

Fitting Neural Networks using Back-Propagation

Calculate the gradient:
$$\frac{\partial l}{\partial w_{mk}^{(1)}} = -2 \sum_{i=1}^{N} \bigl(y_{ik} - f_k(x_i)\bigr)\, h_{im}$$
$$\frac{\partial l}{\partial b_k^{(1)}} = -2 \sum_{i=1}^{N} \bigl(y_{ik} - f_k(x_i)\bigr)$$
$$\frac{\partial l}{\partial w_{pm}^{(0)}} = -2 \sum_{i=1}^{N} \sum_{k=1}^{K} \bigl(y_{ik} - f_k(x_i)\bigr)\, w_{mk}^{(1)}\, \sigma'(z_{im}^{(0)})\, x_{ip}$$
$$\frac{\partial l}{\partial b_m^{(0)}} = -2 \sum_{i=1}^{N} \sum_{k=1}^{K} \bigl(y_{ik} - f_k(x_i)\bigr)\, w_{mk}^{(1)}\, \sigma'(z_{im}^{(0)})$$

Repeat until convergence:
$$\theta \leftarrow \theta - \epsilon\, \mathrm{grad}(\theta)$$
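
Translating these formulas directly into NumPy, one back-propagation step for this regression network might look like the sketch below (array shapes, variable names and the full-batch update are my own assumptions; for the sigmoid, $\sigma'(z) = \sigma(z)(1-\sigma(z))$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bp_step(X, Y, W0, b0, W1, b1, eps=1e-3):
    """X: (N, P) inputs, Y: (N, K) targets, W0: (M, P), b0: (M,), W1: (K, M), b1: (K,)."""
    Z0 = X @ W0.T + b0            # (N, M): z_im^(0)
    H = sigmoid(Z0)               # (N, M): h_im
    F = H @ W1.T + b1             # (N, K): f_k(x_i), identity output
    R = Y - F                     # residuals y_ik - f_k(x_i)

    dW1 = -2.0 * R.T @ H          # (K, M): dl/dw_mk^(1)
    db1 = -2.0 * R.sum(axis=0)    # (K,):   dl/db_k^(1)
    S = (R @ W1) * H * (1.0 - H)  # (N, M): sum_k (y_ik - f_k(x_i)) w_mk^(1) sigma'(z_im^(0))
    dW0 = -2.0 * S.T @ X          # (M, P): dl/dw_pm^(0)
    db0 = -2.0 * S.sum(axis=0)    # (M,):   dl/db_m^(0)

    # one gradient descent update: theta <- theta - eps * grad(theta)
    return W0 - eps * dW0, b0 - eps * db0, W1 - eps * dW1, b1 - eps * db1

# Example usage with random data (N = 50, P = 4, M = 3, K = 2):
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(50, 4)), rng.normal(size=(50, 2))
W0, b0 = 0.1 * rng.normal(size=(3, 4)), np.zeros(3)
W1, b1 = 0.1 * rng.normal(size=(2, 3)), np.zeros(2)
for _ in range(200):
    W0, b0, W1, b1 = bp_step(X, Y, W0, b0, W1, b1)
```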

Vanishing gradient problem

Gradient-based algorithms work well for shallow neural networks, but a difficulty arises when training deep networks with gradient-based learning methods: the back-propagated signals diminish in the lower layers, so the algorithm effectively stops early and the parameters of the lower layers are hardly changed at all.

Vanishing gradient problem Figure: Vanishing gradient problem

Vanishing gradient problem

A deep structure has many bad local minima. Because of the vanishing gradient problem, training a deep neural network can be very slow and often gets trapped in poor local minima. As a result, the predictive power of a deep network trained this way can be poor, sometimes even worse than that of a shallow network.

Vanishing gradient problem

Explaining the vanishing gradient problem. Let's consider the simplest deep neural network: one with just a single neuron in each layer. Here is a network with three hidden layers, where $w_1, \ldots, w_4$ are the weights, $b_1, \ldots, b_4$ are the biases and $C$ is some loss function:
$$z_j = w_j h_{j-1} + b_j, \qquad h_j = \sigma(z_j),$$
and $\sigma$ is the sigmoid function.

Vanishing gradient problem

Explaining the vanishing gradient problem. We can calculate the gradient vector easily. Example:
$$\frac{\partial C}{\partial b_3} = \sigma'(z_3)\, w_4\, \sigma'(z_4)\, \frac{\partial C}{\partial h_4}$$
$$\frac{\partial C}{\partial b_1} = \sigma'(z_1)\, w_2\, \sigma'(z_2)\, w_3\, \sigma'(z_3)\, w_4\, \sigma'(z_4)\, \frac{\partial C}{\partial h_4}$$

Figure: $\max_z \sigma'(z) = \frac{1}{4} < 1$
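
A small numerical illustration of the product above: since $\sigma'(z) \le 1/4$, every extra layer multiplies the gradient by another $|w_j|\,\sigma'(z_j)$ factor, which is typically less than one. The weights and pre-activations below are my own toy choices:

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)            # sigma'(z) = sigma(z)(1 - sigma(z)) <= 1/4

rng = np.random.default_rng(0)
for depth in (1, 3, 10, 30):
    w = rng.normal(0.0, 1.0, size=depth)   # typical random initial weights
    z = rng.normal(0.0, 1.0, size=depth)   # pre-activations
    factor = np.prod(np.abs(w) * sigmoid_prime(z))
    print(depth, factor)   # the factor multiplying dC/dh_4 for the lowest layer
# The factor shrinks rapidly with depth: gradients in the lowest layers vanish.
```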

Theoretical foundation of Neural Networks

There is a wealth of literature discussing the approximation, estimation and complexity properties of artificial neural networks (e.g. M. Anthony and P. Bartlett, 2009).

Neural networks as universal approximators. A well-known result states that a neural network with a single, sufficiently large hidden layer is a universal approximator (G. Cybenko, 1989; K. Hornik et al., 1989).

Theoretical foundation of Neural Networks

G. Cybenko (1989), where "sigmoidal" (in this theorem) means:
$$\sigma(t) \to \begin{cases} 1 & \text{as } t \to +\infty \\ 0 & \text{as } t \to -\infty \end{cases}$$
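
The theorem statement itself appears as an image in the original slides; the following LaTeX snippet is a paraphrase of Cybenko's 1989 result in the notation of these slides, not the slide's own wording:

```latex
% Paraphrase of Cybenko (1989); wording is mine, not the slide's.
Let $\sigma$ be a continuous sigmoidal function. Then finite sums of the form
\[
  G(x) = \sum_{j=1}^{M} \alpha_j \, \sigma\!\bigl(w_j^{\top} x + b_j\bigr)
\]
are dense in $C([0,1]^P)$: for every continuous $f$ on $[0,1]^P$ and every
$\varepsilon > 0$, there exist $M$, $\alpha_j \in \mathbb{R}$, $w_j \in \mathbb{R}^P$
and $b_j \in \mathbb{R}$ such that $\sup_{x \in [0,1]^P} |G(x) - f(x)| < \varepsilon$.
```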

References
D.H. Ackley, G.E. Hinton and T.J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1), pp. 147-169, 1985.
M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.
G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), pp. 303-314, 1989.
J.J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8), pp. 2554-2558, 1982.

References
K. Hornik, M. Stinchcombe and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), pp. 359-366, 1989.
W.S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4), pp. 115-133, 1943.
F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), p. 386, 1958.
D.E. Rumelhart, G.E. Hinton and R.J. Williams. Learning internal representations by error propagation. DTIC Document, 1985.