Neural Networks (Part 1)

Neural Networks (Part 1)
Mark Craven and David Page
Computer Sciences 760, Spring 2018
www.biostat.wisc.edu/~craven/cs760/
Some of the slides in these lectures have been adapted/borrowed from materials developed by Tom Dietterich, Pedro Domingos, Tom Mitchell, David Page, and Jude Shavlik.

Goals for the lecture
You should understand the following concepts:
perceptrons
the perceptron training rule
linear separability
hidden units
multilayer neural networks
gradient descent
stochastic (online) gradient descent
activation functions: sigmoid, hyperbolic tangent, ReLU
objective (error, loss) functions: squared error, cross entropy

Neural networks
a.k.a. artificial neural networks, connectionist models
inspired by interconnected neurons in biological systems
simple processing units: each unit receives a number of real-valued inputs and produces a single real-valued output

Perceptrons [McCulloch & Pitts, 1943; Rosenblatt, 1959; Widrow & Hoff, 1960]
[Figure: a perceptron with inputs x_1, ..., x_n and weights w_0, w_1, ..., w_n.]
o = 1 if w_0 + Σ_{i=1}^{n} w_i x_i > 0, and 0 otherwise
input units: represent the given x
output unit: represents the binary classification
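As a concrete illustration of the output rule above (not part of the original slides), here is a minimal Python sketch of a perceptron's forward computation; the weights and input are made-up values:

```python
def perceptron_output(w0, w, x):
    """Threshold unit: returns 1 if w0 + sum_i w_i * x_i > 0, else 0."""
    net = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if net > 0 else 0

# hypothetical weights and input, for illustration only
w0, w = -0.5, [0.7, -0.2, 0.4]
print(perceptron_output(w0, w, [1, 0, 1]))  # net = -0.5 + 0.7 + 0.4 = 0.6 > 0 -> 1
```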

Perceptron example
[Figure: a perceptron over four input features with example weights.]
o = 1 if w_0 + Σ_{i=1}^{n} w_i x_i > 0, and 0 otherwise
features and class labels are represented numerically
for the input x = <1, 0, 0, 1> the weighted sum w_0 + Σ_i w_i x_i is not positive, so o = 0; for the input x = <1, 0, 1, 1> the weighted sum is 0.2, so o = 1

Learning a perceptron: the perceptron training rule
1. randomly initialize weights
2. iterate through training instances until convergence:
   2a. calculate the output for the given instance: o = 1 if w_0 + Σ_{i=1}^{n} w_i x_i > 0, and 0 otherwise
   2b. update each weight: Δw_i = η(y − o)x_i, then w_i ← w_i + Δw_i
η is the learning rate; set it to a value << 1
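A minimal Python sketch of the training rule above, assuming numeric feature vectors and 0/1 labels; the dataset, learning rate, and epoch limit are illustrative, not from the slides:

```python
import random

def train_perceptron(data, n_features, eta=0.1, max_epochs=100):
    """Perceptron training rule: delta_w_i = eta * (y - o) * x_i."""
    # 1. randomly initialize weights (w[0] plays the role of the bias weight w0)
    w = [random.uniform(-0.05, 0.05) for _ in range(n_features + 1)]
    # 2. iterate through training instances until convergence
    for _ in range(max_epochs):
        converged = True
        for x, y in data:
            # 2a. calculate the output for the given instance
            net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            o = 1 if net > 0 else 0
            # 2b. update each weight
            if o != y:
                converged = False
                w[0] += eta * (y - o)              # bias input is implicitly 1
                for i, xi in enumerate(x):
                    w[i + 1] += eta * (y - o) * xi
        if converged:
            break
    return w

# learn logical AND, which is linearly separable
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
print(train_perceptron(and_data, n_features=2))
```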

Representational power of perceptrons
Perceptrons can represent only linearly separable concepts.
o = 1 if w_0 + Σ_{i=1}^{n} w_i x_i > 0, and 0 otherwise
In two dimensions, the unit outputs 1 if w_0 + w_1 x_1 + w_2 x_2 > 0, so the decision boundary is given by w_1 x_1 + w_2 x_2 = −w_0, i.e. x_2 = −(w_1/w_2) x_1 − w_0/w_2 (see the sketch below).
[Figure: positive (+) and negative (−) examples in the (x_1, x_2) plane separated by this line.]

Representational power of perceptrons
In the previous example the feature space was 2D, so the decision boundary was a line; in higher dimensions, the decision boundary is a hyperplane.
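A small sketch (not from the slides) of the 2D boundary formula above, using made-up weights:

```python
def boundary_line(w0, w1, w2):
    """Return (slope, intercept) of the decision boundary x2 = slope * x1 + intercept."""
    return -w1 / w2, -w0 / w2

# hypothetical weights for illustration
slope, intercept = boundary_line(w0=-1.0, w1=2.0, w2=1.0)
print(slope, intercept)  # x2 = -2*x1 + 1; since w2 > 0, points above this line get o = 1
```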

Some linearly separable functions
Label the four points a = (0, 0), b = (0, 1), c = (1, 0), d = (1, 1).
AND: a → 0, b → 0, c → 0, d → 1
OR: a → 0, b → 1, c → 1, d → 1
[Figures: the four points in the (x_1, x_2) plane; in each case a line separates the 0s from the 1s.]

XOR is not linearly separable
XOR: a → 0, b → 1, c → 1, d → 0
[Figure: no single line separates the 0s from the 1s.]
A multilayer perceptron can represent XOR (see the sketch below); the slide's figure assumes w_0 = 0 for all nodes.
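To make the XOR claim concrete, here is a small Python sketch of a two-layer network of threshold units that computes XOR with hand-set weights. These weights are one possible choice chosen for illustration; they use nonzero biases, so they are not the same construction as the slide's figure (which assumes w_0 = 0 for all nodes):

```python
def threshold_unit(w0, w, x):
    """Threshold unit: 1 if w0 + sum_i w_i * x_i > 0, else 0."""
    return 1 if w0 + sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def xor_net(x1, x2):
    h_or  = threshold_unit(-0.5, [1.0, 1.0], [x1, x2])   # hidden unit: x1 OR x2
    h_and = threshold_unit(-1.5, [1.0, 1.0], [x1, x2])   # hidden unit: x1 AND x2
    # output unit: OR but not AND, i.e. XOR
    return threshold_unit(-0.5, [1.0, -1.0], [h_or, h_and])

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_net(a, b))   # prints 0, 1, 1, 0
```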

Example multilayer neural network
[Figure from Huang & Lippmann, NIPS 1988: a network with input units, hidden units, and output units.]
input: two features from spectral analysis of a spoken sound
output: vowel sound occurring in the context "h__d"

Decision regions of a multilayer neural network
[Figure from Huang & Lippmann, NIPS 1988: the decision regions learned by the network.]
input: two features from spectral analysis of a spoken sound
output: vowel sound occurring in the context "h__d"

Learning in multilayer networks
Work on neural nets fizzled in the 1960s:
single-layer networks had representational limitations (linear separability)
there were no effective methods for training multilayer networks (how to determine the error signal for hidden units?)
Revived again with the invention of the backpropagation method [Rumelhart & McClelland, 1986; also Werbos, 1975].
key insight: require the neural network to be differentiable; use gradient descent

Gradient descent in weight space
Given a training set D = {(x^(1), y^(1)), ..., (x^(m), y^(m))}, we can specify an error measure that is a function of our weight vector w:
E(w) = (1/2) Σ_{d∈D} (y^(d) − o^(d))^2
[Figure from Cho & Chow, Neurocomputing 1999: an error surface over weights w_1, w_2.]
This objective function defines a surface over the model (i.e. weight) space.
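A minimal sketch (not from the slides) of the squared-error measure above, given targets and network outputs; the values are illustrative:

```python
def squared_error(ys, os):
    """E(w) = 1/2 * sum_d (y_d - o_d)^2, given targets ys and network outputs os."""
    return 0.5 * sum((y - o) ** 2 for y, o in zip(ys, os))

# illustrative targets and outputs
print(squared_error([1, 0, 1], [0.9, 0.2, 0.6]))  # 0.5 * (0.01 + 0.04 + 0.16) = 0.105
```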

Gradient descent in weight space
Gradient descent is an iterative process aimed at finding a minimum in the error surface.
On each iteration:
the current weights define a point in this space
find the direction in which the error surface descends most steeply
take a step (i.e. update the weights) in that direction
[Figure: the error surface over weights w_1, w_2.]

Gradient descent in weight space
Calculate the gradient of E:
∇E(w) = [∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_n]
Take a step in the opposite direction:
Δw = −η ∇E(w), i.e. Δw_i = −η ∂E/∂w_i
[Figure: the components ∂E/∂w_1 and ∂E/∂w_2 on the error surface.]
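A sketch of one gradient-descent step in Python (not from the slides). It uses a finite-difference approximation of ∂E/∂w_i and a toy quadratic error surface as illustrative stand-ins for the analytic gradient derived later:

```python
def numeric_gradient(E, w, eps=1e-6):
    """Approximate the gradient of E at w by central finite differences."""
    grad = []
    for i in range(len(w)):
        w_plus = w[:i] + [w[i] + eps] + w[i + 1:]
        w_minus = w[:i] + [w[i] - eps] + w[i + 1:]
        grad.append((E(w_plus) - E(w_minus)) / (2 * eps))
    return grad

def gradient_step(E, w, eta=0.1):
    """Take one step in the direction of steepest descent: w <- w - eta * grad E."""
    grad = numeric_gradient(E, w)
    return [wi - eta * gi for wi, gi in zip(w, grad)]

# toy quadratic error surface E(w) = w1^2 + w2^2, for illustration only
E = lambda w: w[0] ** 2 + w[1] ** 2
w = [1.0, -2.0]
for _ in range(5):
    w = gradient_step(E, w)
print(w)  # moves toward the minimum at (0, 0)
```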

The sigmoid function
To be able to differentiate E with respect to w_i, our network must represent a continuous function. To do this, we can use sigmoid functions instead of threshold functions in our hidden and output units:
f(z) = 1 / (1 + e^(−z))

The sigmoid function
For the case of a single-layer network:
f(net) = 1 / (1 + e^(−net)), where net = w_0 + Σ_i w_i x_i
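A small sketch of the sigmoid and its derivative in Python (not from the slides). The identity f'(z) = f(z)(1 − f(z)) is a standard fact about the sigmoid, not spelled out on this slide; the weights and input are made-up:

```python
import math

def sigmoid(z):
    """f(z) = 1 / (1 + e^{-z})"""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    """f'(z) = f(z) * (1 - f(z)), the standard sigmoid derivative identity."""
    fz = sigmoid(z)
    return fz * (1.0 - fz)

# single-layer case: output = sigmoid(w0 + sum_i w_i x_i), with hypothetical weights
w0, w, x = -0.5, [0.7, -0.2], [1.0, 2.0]
net = w0 + sum(wi * xi for wi, xi in zip(w, x))
print(sigmoid(net), sigmoid_derivative(net))
```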

Batch neural network training (see the sketch below)
given: network structure and a training set D = {(x^(1), y^(1)), ..., (x^(m), y^(m))}
initialize all weights in w to small random numbers
until stopping criteria met do
    initialize the error: E(w) = 0
    for each (x^(d), y^(d)) in the training set
        input x^(d) to the network and compute output o^(d)
        increment the error: E(w) = E(w) + (1/2)(y^(d) − o^(d))^2
    calculate the gradient: ∇E(w) = [∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_n]
    update the weights: Δw = −η ∇E(w)

Online vs. batch training
Standard gradient descent (batch training): calculates the error gradient for the entire training set before taking a step in weight space.
Stochastic gradient descent (online training): calculates the error gradient for a single instance (or a small set of instances, a "mini-batch"), then takes a step in weight space.
much faster convergence
less susceptible to local minima
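A minimal Python sketch of the batch procedure above for the special case of a single sigmoid output unit with squared error. It uses the standard analytic gradient ∂E/∂w_i = −Σ_d (y^(d) − o^(d)) o^(d) (1 − o^(d)) x_i^(d); the dataset and hyperparameters are illustrative, not from the slides:

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def batch_train(data, n_features, eta=0.5, n_iters=1000):
    """Batch gradient descent for a single sigmoid unit with squared error."""
    # initialize all weights to small random numbers (w[0] is the bias weight w0)
    w = [random.uniform(-0.05, 0.05) for _ in range(n_features + 1)]
    for _ in range(n_iters):
        grad = [0.0] * len(w)                  # accumulate gradient over the whole set
        for x, y in data:
            o = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
            delta = (y - o) * o * (1 - o)      # -dE/dnet for squared error
            grad[0] -= delta                   # bias input is 1
            for i, xi in enumerate(x):
                grad[i + 1] -= delta * xi
        # one weight update per pass over the entire training set
        w = [wi - eta * gi for wi, gi in zip(w, grad)]
    return w

# learn logical OR (linearly separable), an illustrative dataset
or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
print(batch_train(or_data, n_features=2))
```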

Online neural network training (stochastic gradient descent); see the sketch below
given: network structure and a training set D = {(x^(1), y^(1)), ..., (x^(m), y^(m))}
initialize all weights in w to small random numbers
until stopping criteria met do
    for each (x^(d), y^(d)) in the training set
        input x^(d) to the network and compute output o^(d)
        calculate the error: E(w) = (1/2)(y^(d) − o^(d))^2
        calculate the gradient: ∇E(w) = [∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_n]
        update the weights: Δw = −η ∇E(w)

Other activation functions
The sigmoid is just one choice for an activation function; there are others we can use, including:
hyperbolic tangent: f(x) = tanh(x) = 2 / (1 + e^(−2x)) − 1
rectified linear (ReLU): f(x) = 0 if x < 0, x if x ≥ 0
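The online (stochastic) variant differs from the batch sketch above only in where the weight update happens: after each instance rather than after a full pass. A Python sketch under the same single-sigmoid-unit, squared-error assumptions, with the tanh and ReLU activations from this slide included for reference; the hyperparameters are illustrative:

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def online_train(data, n_features, eta=0.5, n_epochs=1000):
    """Stochastic (online) gradient descent for a single sigmoid unit, squared error."""
    w = [random.uniform(-0.05, 0.05) for _ in range(n_features + 1)]
    for _ in range(n_epochs):
        # visit instances in a shuffled order (common practice, not required)
        for x, y in random.sample(data, len(data)):
            o = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
            delta = (y - o) * o * (1 - o)
            # take a step in weight space immediately, using this instance's gradient only
            w[0] += eta * delta
            for i, xi in enumerate(x):
                w[i + 1] += eta * delta * xi
    return w

# the other activation functions mentioned above
relu = lambda x: x if x >= 0 else 0.0
tanh = lambda x: 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
print(online_train(or_data, n_features=2))
```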

Other objective functions
Squared error is just one choice for an objective function; there are others we can use, including (both are sketched in code below):
cross entropy: E(w) = − Σ_{d∈D} [ y^(d) ln o^(d) + (1 − y^(d)) ln(1 − o^(d)) ]
multiclass cross entropy: E(w) = − Σ_{d∈D} Σ_{i=1}^{#classes} y_i^(d) ln o_i^(d)

Convergence of gradient descent
Gradient descent will converge to a minimum in the error function:
for a multilayer network, this may be a local minimum (i.e. there may be a better solution elsewhere in weight space)
for a single-layer network, this will be a global minimum (i.e. gradient descent will find the best solution)
Recent analysis suggests that local minima are probably rare in high dimensions; saddle points are more of a challenge [Dauphin et al., NIPS 2014].
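A small Python sketch (not from the slides) of the two cross-entropy objectives above; the target and output values are illustrative:

```python
import math

def binary_cross_entropy(ys, os):
    """E(w) = -sum_d [ y_d ln o_d + (1 - y_d) ln(1 - o_d) ]"""
    return -sum(y * math.log(o) + (1 - y) * math.log(1 - o) for y, o in zip(ys, os))

def multiclass_cross_entropy(Y, O):
    """E(w) = -sum_d sum_i y_{d,i} ln o_{d,i}, where each row of Y is a one-hot target
    and each row of O holds predicted class probabilities."""
    return -sum(y * math.log(o) for yd, od in zip(Y, O) for y, o in zip(yd, od))

# illustrative values
print(binary_cross_entropy([1, 0], [0.9, 0.2]))
print(multiclass_cross_entropy([[1, 0, 0], [0, 0, 1]],
                               [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]))
```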