COMP 551 Applied Machine Learning Lecture 14: Neural Networks

COMP 551 Applied Machine Learning Lecture 14: Neural Networks Instructor: Ryan Lowe (ryan.lowe@mail.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless otherwise noted, all material posted for this course is copyright of the instructor, and cannot be reused or reposted without the instructor's written permission.

Announcements Assignment 3 deadline postponed. New deadline: Monday, Feb 26, noon EST. Questions about assignment 1 grading? See the grading TAs during office hours. My office hours (for now): Monday, 12pm-1pm, MC 232. 2

Recall: the perceptron We take a linear combination of the inputs and apply a hard threshold, e.g. output 1 if w · x > 0 and 0 otherwise. The thresholded output is taken as the predicted class. 3
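
As a minimal sketch (not from the slides), a perceptron prediction with a hand-picked weight vector, using the convention that x includes a leading 1 multiplying the bias weight:

import numpy as np

def perceptron_predict(w, x):
    """Hard-threshold a linear combination: return 1 if w.x > 0, else 0."""
    return 1 if np.dot(w, x) > 0 else 0

# Example: a weight vector implementing Boolean AND on inputs x1, x2,
# where x = [1, x1, x2] and the leading 1 multiplies the bias weight.
w_and = np.array([-1.5, 1.0, 1.0])
print(perceptron_predict(w_and, np.array([1, 1, 1])))  # 1
print(perceptron_predict(w_and, np.array([1, 1, 0])))  # 0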

Decision surface of a perceptron A single perceptron can represent many functions, but only linearly separable ones. To represent non-linearly separable functions (e.g. XOR), we could use a network of perceptron-like elements. However, if we connect perceptrons into networks, the error surface for the network is not differentiable (because of the hard threshold). 4

Example: A network representing XOR [Figure: a two-layer network with hidden units N1 and N2 feeding output unit N3; one panel shows the decision boundaries of the two neurons in the first hidden layer, the other the decision boundary of the output neuron.] 5
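
As an illustration (not from the slides), a hand-wired two-layer network of hard-threshold units that computes XOR; the particular weights below are one of many possible choices:

def step(z):
    """Hard threshold used by each perceptron-like unit."""
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    # Hidden unit N1 computes OR(x1, x2); hidden unit N2 computes NAND(x1, x2).
    h1 = step(-0.5 + 1.0 * x1 + 1.0 * x2)   # OR
    h2 = step(1.5 - 1.0 * x1 - 1.0 * x2)    # NAND
    # Output unit N3 computes AND(h1, h2), which equals XOR(x1, x2).
    return step(-1.5 + 1.0 * h1 + 1.0 * h2)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))   # last column is 0, 1, 1, 0 (XOR)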

Recall the sigmoid function The sigmoid provides a soft threshold, whereas the perceptron provides a hard threshold. The sigmoid function is σ(z) = 1 / (1 + e^(-z)), so a sigmoid unit outputs σ(w · x) = 1 / (1 + e^(-w · x)). It has the following nice property: dσ(z)/dz = σ(z)(1 - σ(z)). We can derive a gradient descent rule to train: one sigmoid unit; multi-layer networks of sigmoid units. 6
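
A minimal sketch (with assumed function names, not from the slides) of the sigmoid and the derivative identity above, checked numerically:

import numpy as np

def sigmoid(z):
    """Soft threshold: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    """Uses the identity d sigma/dz = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Numerical check of the identity at z = 0.3 via a central finite difference.
z, eps = 0.3, 1e-6
finite_diff = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(sigmoid_deriv(z), finite_diff)  # the two values agree closely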

Feed-forward neural networks A collection of neurons with non-linear activation functions, arranged in layers. Layer 0 is the input layer; its units just copy the input. The last layer (layer K) is the output layer; its units provide the output. Layers 1, ..., K-1 are hidden layers; their outputs are not directly observable from outside the network. 7

Why this name? In feed-forward networks the output of units in layer k becomes input to the units in layers k+1, k+2, ..., K. There are no cross-connections between units in the same layer, and no backward ("recurrent") connections from layers downstream. Typically, units in layer k provide input to units in layer k+1 only. In fully-connected networks, all units in layer k provide input to all units in layer k+1. 8

Feed-forward neural networks Notation: w_ji denotes the weight on the connection from unit i to unit j. By convention, x_j0 = 1 for all j; the corresponding weight is also called the bias, b. The output of unit j, denoted o_j, is computed using a sigmoid: o_j = σ(w_j · x_j), where w_j is the vector of weights entering unit j and x_j is the vector of inputs to unit j. By definition, x_ji = o_i. Given an input, how do we compute the output? How do we train the weights? 9

Computing the output of the network Suppose we want the network to make a prediction for an instance <x, y=?>. Run a forward pass through the network. For each layer k = 1, ..., K: 1. Compute the output of every neuron j in layer k: o_j = σ(w_j · x_j). 2. Copy these outputs as the inputs to the units in the next layer. The output of the last layer is the predicted output y. 10
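
A minimal numpy sketch of such a forward pass for a fully-connected network of sigmoid units (the layer sizes and weight shapes below are illustrative assumptions, not the slides' example):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(x, weights):
    """weights[k] has one row per unit in layer k+1; column 0 holds the bias
    weights, which multiply a constant input of 1 (the convention x_j0 = 1)."""
    activations = [x]
    for W in weights:
        x = np.concatenate(([1.0], x))   # prepend the constant bias input
        x = sigmoid(W @ x)               # o_j = sigma(w_j . x_j) for every unit j
        activations.append(x)
    return activations                   # the last entry is the network's prediction

# Example: 2 inputs -> 3 hidden units -> 1 output, with small random weights.
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(3, 3)), rng.normal(scale=0.1, size=(1, 4))]
print(forward_pass(np.array([0.5, -1.0]), weights)[-1])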

Learning in feed-forward neural networks Assume the network structure (units + connections) is given. The learning problem is finding a good set of weights to minimize the error at the output of the network. Approach: gradient descent, because the hypothesis h_w computed by the network is: Differentiable! Because of the choice of sigmoid units. Very complex! Hence direct computation of the optimal weights is not possible. 11

Gradient-descent preliminaries for NN Assume we have a fully connected network: N input units (indexed 1, ..., N), H hidden units in a single layer (indexed N+1, ..., N+H), and one output unit (indexed N+H+1). Suppose you want to compute the weight update after seeing instance <x, y>. Let o_i, i = 1, ..., N+H+1, be the outputs of all units in the network for the given input x. For regression, the sum-squared error function is J(w) = (1/2)(y - o_(N+H+1))^2. 12

Gradient-descent update for output node Derivative with respect to the weights w_(N+H+1,j) entering o_(N+H+1): use the chain rule, ∂J(w)/∂w = (∂J(w)/∂σ)(∂σ/∂w). Note: j here is any node in the hidden layer. From the error function, ∂J(w)/∂σ = -(y - o_(N+H+1)). 13

Gradient-descent update for output node Derivative with respect to the weights w_(N+H+1,j) entering o_(N+H+1): combining the chain rule ∂J(w)/∂w = (∂J(w)/∂σ)(∂σ/∂w) with the sigmoid derivative, we can write ∂J(w)/∂w_(N+H+1,j) = -δ_(N+H+1) x_(N+H+1,j), where δ_(N+H+1) = (y - o_(N+H+1)) o_(N+H+1) (1 - o_(N+H+1)). 14

Gradient-descent update for hidden node The derivative with respect to the other weights, w_(l,j) where j = 1, ..., N and l = N+1, ..., N+H, can be computed using the chain rule. Recall that x_(N+H+1,l) = o_l, so the error reaches w_(l,j) only through o_l. Note: now j is any node in the input layer, and l is any node in the hidden layer. Putting these together and using similar notation as before: ∂J(w)/∂w_(l,j) = -δ_l x_(l,j), where δ_l = o_l (1 - o_l) w_(N+H+1,l) δ_(N+H+1). 15
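
For reference, a worked version of the chain-rule derivation under the definitions above (standard backpropagation for this one-hidden-layer, single-output network; the δ notation matches the slides' "correction" terms):

J(w) = \tfrac{1}{2}\,(y - o_{N+H+1})^2, \qquad
o_j = \sigma(w_j \cdot x_j), \qquad \sigma'(z) = \sigma(z)\,(1 - \sigma(z))

% Output unit:
\frac{\partial J}{\partial w_{N+H+1,j}}
  = -(y - o_{N+H+1})\, o_{N+H+1}(1 - o_{N+H+1})\, x_{N+H+1,j}
  = -\,\delta_{N+H+1}\, x_{N+H+1,j}

% Hidden unit l (the error reaches w_{l,j} only through o_l = x_{N+H+1,l}):
\frac{\partial J}{\partial w_{l,j}}
  = -\,\delta_{N+H+1}\, w_{N+H+1,l}\; o_l(1 - o_l)\, x_{l,j}
  = -\,\delta_l\, x_{l,j}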

Gradient-descent update for hidden node Note: now h is any node in the hidden layer Image from: http://openi.nlm.nih.gov/detailedresult.php?img=2716495_bcr2257-1&req=4 16

Stochastic gradient descent (SGD) Initialization: initialize all weights to small random numbers. Repeat until convergence: pick a training example <x, y>. Forward pass: feed the example through the network to compute the output o = o_(N+H+1). Backpropagation: for the output unit, compute the correction δ_(N+H+1) = o(1 - o)(y - o); for each hidden unit h, compute its share of the correction δ_h = o_h(1 - o_h) w_(N+H+1,h) δ_(N+H+1). Gradient descent: update each network weight, w_(j,i) ← w_(j,i) + α δ_j x_(j,i), where α is the learning rate. 17
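
A minimal sketch of this training loop for a network with one hidden layer and one sigmoid output, with assumed function and variable names (the XOR table below is just a toy dataset; none of this is the course's released code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sgd(X, y, n_hidden=3, lr=0.5, epochs=10000, seed=0):
    """One hidden layer, one sigmoid output, squared error, per-example updates."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(n_hidden, X.shape[1] + 1))  # hidden weights (+bias col)
    W2 = rng.normal(scale=0.5, size=(n_hidden + 1,))             # output weights (+bias)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            x = np.concatenate(([1.0], X[i]))          # bias input for the hidden layer
            h = sigmoid(W1 @ x)                        # forward pass: hidden outputs
            h_b = np.concatenate(([1.0], h))           # bias input for the output unit
            o = sigmoid(W2 @ h_b)                      # forward pass: network output
            delta_o = o * (1 - o) * (y[i] - o)         # correction for the output unit
            delta_h = h * (1 - h) * W2[1:] * delta_o   # each hidden unit's share
            W2 += lr * delta_o * h_b                   # gradient-descent weight updates
            W1 += lr * np.outer(delta_h, x)
    return W1, W2

# Toy usage: learn XOR (needs the hidden layer, as discussed earlier).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)
W1, W2 = train_sgd(X, y)
for x_row in X:
    h_b = np.concatenate(([1.0], sigmoid(W1 @ np.concatenate(([1.0], x_row)))))
    print(x_row, float(sigmoid(W2 @ h_b)))   # outputs should approach 0, 1, 1, 0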

Flavours of gradient descent Stochastic gradient descent: compute the error on a single example at a time (as in the previous slide). Batch gradient descent: compute the error on all examples; loop through the training data, accumulating the weight changes, then update all weights and repeat. Mini-batch gradient descent: compute the error on a small subset; randomly select a mini-batch (i.e. a subset of training examples), calculate the error on it, use it to update the weights, and repeat. 18
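
As a small illustrative sketch (the grad helper referenced in the comment is hypothetical, not from the slides), the only difference between the three flavours is how many examples contribute to each update; mini-batch sampling can look like this:

import numpy as np

def minibatch_indices(n_examples, batch_size, rng):
    """Randomly select a mini-batch of example indices without replacement."""
    return rng.choice(n_examples, size=batch_size, replace=False)

# Hypothetical usage, assuming some grad(w, X_batch, y_batch) that averages
# per-example gradients over the batch (batch GD uses all rows, SGD uses one).
rng = np.random.default_rng(0)
idx = minibatch_indices(n_examples=100, batch_size=16, rng=rng)
# w -= lr * grad(w, X[idx], y[idx])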

Expressiveness of feed-forward NN A single sigmoid neuron? Same representational power as a perceptron: Boolean AND, OR, NOT, but not XOR. A neural network with a single hidden layer? 19

Expressiveness of feed-forward NN (non-linearity) Image from: Hugo Larochelle's & Pascal Vincent's slides 20

Expressiveness of feed-forward NN Image from: Hugo Larochelle's & Pascal Vincent's slides 21

Expressiveness of feed-forward NN Image from: Hugo Larochelle's & Pascal Vincent's slides 22

Expressiveness of feed-forward NN A single sigmoid neuron? Same representational power as a perceptron: Boolean AND, OR, NOT, but not XOR. A neural network with a single hidden layer? Universal approximation theorem (Hornik, 1991): every bounded continuous function can be approximated with arbitrary precision by a neural network with a single hidden layer. But this might require a number of hidden units that is exponential in the number of inputs. Also, it doesn't mean that we can easily learn the parameter values! 23

Expressiveness of feed-forward NN A single sigmoid neuron? Same representational power as a perceptron: Boolean AND, OR, NOT, but not XOR. A neural network with a single hidden layer? Universal approximation theorem (Hornik, 1991), but it might require a number of hidden units that is exponential in the number of inputs, and it doesn't mean that we can easily learn the parameter values! A neural network with two hidden layers? Any function can be approximated to arbitrary accuracy by a network with two hidden layers. 24

Final notes What you should know: definition / components of neural networks; training by backpropagation; stochastic gradient descent and its variants. Additional information about neural networks: Video & slides from the Montreal Deep Learning Summer School: http://videolectures.net/deeplearning2017_larochelle_neural_networks/ https://drive.google.com/file/d/0byukrdicdk7-c2s2rjbisms2uza/view?usp=drive_web https://drive.google.com/file/d/0byukrdicdk7-uxb1r1zpx082mek/view?usp=drive_web Manifold perspective on neural nets with cool visualizations: http://colah.github.io/posts/2014-03-nn-manifolds-topology/ 25