Lecture 10. Neural networks and optimization. Machine Learning and Data Mining, November 2009. Nando de Freitas, UBC. Nonlinear Supervised Learning



Gradient

Searching for a good solution can be interpreted as looking for a minimum of some error (loss) function in parameter space.

The gradient is the vector of partial derivatives:

∇E(θ_{1:d}) = ( ∂E/∂θ_1, ..., ∂E/∂θ_d )

The gradient vector is orthogonal to the contours of the error function. Hence, to minimise the error, we follow the negative gradient (the direction of steepest descent in error).

Gradient for linear model

Let's go back to the linear model Y = Xθ with quadratic error function E = (Y − Xθ)^T (Y − Xθ). The gradient for this model is:

∇E = −2 X^T (Y − Xθ)

The gradient descent learning rule, at iteration t, is:

θ^{(t)} = θ^{(t−1)} − α ∇E(θ^{(t−1)}) = θ^{(t−1)} + α X^T (Y − X θ^{(t−1)})

where α is a user-specified learning rate (the constant factor of 2 from the gradient has been absorbed into α).
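To make the update concrete, here is a minimal NumPy sketch of this batch gradient descent rule; the function name, the synthetic data and the learning rate are illustrative choices, not part of the lecture.

import numpy as np

def gradient_descent_linear(X, Y, alpha=0.005, iterations=2000):
    """Batch gradient descent for the linear model Y = X theta with
    quadratic error E = (Y - X theta)^T (Y - X theta)."""
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        residual = Y - X @ theta                 # (Y - X theta)
        theta = theta + alpha * X.T @ residual   # theta <- theta + alpha X^T (Y - X theta)
    return theta

# Illustrative usage on synthetic data.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])  # bias column plus 2 inputs
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
print(gradient_descent_linear(X, Y))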

Online learning

In some situations, we might want to learn the parameters by going over the data online, one example at a time:

θ^{(t)} = θ^{(t−1)} + α x^{(t)} (y^{(t)} − x^{(t)} θ^{(t−1)})

This is the least mean squares (LMS) algorithm. This learning rule is a stochastic approximation technique, also known as the Robbins-Monro procedure. It is stochastic because the data are assumed to come from a stochastic process. If α decreases at rate 1/n, one can show that this algorithm converges. If the θ vary slowly with time, it is also possible to obtain convergence proofs with α set to a small constant. There are many tricks, including averaging, momentum and minibatches, to accelerate convergence.
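A matching sketch of the LMS rule in Python; the 1/t learning-rate schedule constant and the synthetic data are assumptions made for the example.

import numpy as np

def lms_online(X, Y, alpha0=0.5):
    """Least mean squares: one stochastic gradient step per example,
    with the learning rate decreasing at rate 1/t."""
    theta = np.zeros(X.shape[1])
    for t in range(X.shape[0]):
        x_t, y_t = X[t], Y[t]
        alpha = alpha0 / (t + 1)                          # decreasing step size
        theta = theta + alpha * x_t * (y_t - x_t @ theta)
    return theta

# Illustrative usage: a single online pass over synthetic data.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(1000), rng.normal(size=(1000, 2))])
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)
print(lms_online(X, Y))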

Hessian of linear model

The Newton-Raphson algorithm uses the gradient learning rule, with the inverse Hessian matrix in place of α:

θ^{(t)} = θ^{(t−1)} − H^{−1} ∇E,   where H = ∂²E / ∂θ²

Very fast convergence! Note, however, that α is a scalar while H is a large matrix, so there is a trade-off between speed of convergence and storage.
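A sketch of the Newton-Raphson update for the quadratic error above, where the Hessian is the constant matrix 2 X^T X; the data are again illustrative.

import numpy as np

def newton_linear(X, Y, iterations=3):
    """Newton-Raphson for E = (Y - X theta)^T (Y - X theta).
    Gradient: -2 X^T (Y - X theta); Hessian: H = 2 X^T X (constant here)."""
    theta = np.zeros(X.shape[1])
    H = 2.0 * X.T @ X
    for _ in range(iterations):
        grad = -2.0 * X.T @ (Y - X @ theta)
        theta = theta - np.linalg.solve(H, grad)   # theta <- theta - H^{-1} grad
    return theta

# For a quadratic error a single Newton step already reaches the least-squares solution.
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
print(newton_linear(X, Y))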

Multi-layer perceptrons

Gradient descent techniques allow us to learn complex, nonlinear supervised neural networks known as multi-layer perceptrons (MLPs).

[Figure: a small MLP with inputs x_1 and x_2, hidden units with pre-activations u_1, u_2 and outputs o_1, o_2, a single output y, and weights θ_1, ..., θ_9.]

Mathematically, an MLP is a nonlinear function approximator:

ŷ = φ_j( φ_i(X θ_j) θ_i )

where φ(·) is the sigmoidal (logistic) function:

φ_i(X θ_j) = 1 / (1 + e^{−X θ_j})
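A minimal forward-pass sketch of this particular 2-2-1 network in Python; the weight values are arbitrary placeholders, and the assignment of θ_7, θ_8, θ_9 to the second hidden unit mirrors the first, which is an assumption consistent with the figure rather than something stated on the slide.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x1, x2, theta):
    """Forward pass of the lecture's small MLP.
    theta[1..9] follow the slide's numbering; theta[0] is unused padding."""
    u1 = theta[4] + theta[5] * x1 + theta[6] * x2     # hidden pre-activations
    u2 = theta[7] + theta[8] * x1 + theta[9] * x2     # (assumed weight assignment)
    o1, o2 = sigmoid(u1), sigmoid(u2)                 # sigmoidal hidden outputs
    return theta[1] + theta[2] * o1 + theta[3] * o2   # linear output layer

theta = np.array([0.0, 0.1, -0.4, 0.3, 0.2, 0.5, -0.7, -0.1, 0.6, 0.8])  # placeholder weights
print(mlp_forward(0.5, -1.0, theta))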

Loss functions for regression

Assume we are given the data {X_{1:n}, Y_{1:n}} and want to come up with a nonlinear mapping Ŷ = f(X, θ), where θ is obtained by minimising a loss function. When doing regression, we use the quadratic loss:

E = (Y − f(X, θ))^T (Y − f(X, θ))

What is the likelihood model?
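A standard way to answer this (not spelled out on the slide): the quadratic loss corresponds to maximum-likelihood estimation under an i.i.d. Gaussian noise model, y_i = f(x_i, θ) + ε_i with ε_i ~ N(0, σ²), since

−log p(Y | X, θ) = (1 / 2σ²) ∑_{i=1}^{n} (y_i − f(x_i, θ))² + constant,

so minimising the quadratic loss E is equivalent to maximising the Gaussian likelihood.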

Loss functions for classification

Backpropagation

The synaptic weights θ can be learned by following gradients:

θ^{(t+1)} = θ^{(t)} + α (Y − Ŷ) ∂Ŷ/∂θ^{(t)}

where Ŷ = f(X, θ^{(t)}). The output layer mapping for our example is given by:

ŷ = θ_1 + θ_2 o_1 + θ_3 o_2

and consequently, the derivatives with respect to the weights are given by:

∂ŷ/∂θ_1 = 1,   ∂ŷ/∂θ_2 = o_1,   ∂ŷ/∂θ_3 = o_2

Backpropagation

The hidden layer mapping for the top neuron is:

o_1 = 1 / (1 + exp(−u_1)),   where u_1 = θ_4 + θ_5 x_1 + θ_6 x_2

Note that ∂o_1/∂u_1 = o_1 (1 − o_1). The derivatives with respect to the weights are obtained with the chain rule:

∂ŷ/∂θ_4 = (∂ŷ/∂o_1)(∂o_1/∂u_1)(∂u_1/∂θ_4) = θ_2 o_1 (1 − o_1)
∂ŷ/∂θ_5 = θ_2 o_1 (1 − o_1) x_1
∂ŷ/∂θ_6 = θ_2 o_1 (1 − o_1) x_2

Backpropagation

The derivatives with respect to the weights of the other hidden-layer neuron can be calculated following the same procedure. Once we have all the derivatives, we can use either steepest descent or Newton-Raphson to update the weights.
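Putting the derivatives together, here is a minimal steepest-descent training sketch for this small network; the second hidden unit's weights mirror the first (an assumption consistent with the figure), and the XOR-style data, learning rate and epoch count are illustrative choices whose convergence depends on the random initialisation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mlp(X, Y, alpha=0.5, epochs=5000, seed=0):
    """Online steepest descent for the 2-2-1 MLP of the lecture.
    theta[1..9] follow the slide's numbering; theta[0] is unused padding."""
    rng = np.random.default_rng(seed)
    theta = np.concatenate([[0.0], 0.5 * rng.normal(size=9)])
    for _ in range(epochs):
        for (x1, x2), y in zip(X, Y):
            # Forward pass.
            u1 = theta[4] + theta[5] * x1 + theta[6] * x2
            u2 = theta[7] + theta[8] * x1 + theta[9] * x2
            o1, o2 = sigmoid(u1), sigmoid(u2)
            y_hat = theta[1] + theta[2] * o1 + theta[3] * o2
            # Backward pass: derivatives of y_hat with respect to each weight.
            d = np.zeros(10)
            d[1], d[2], d[3] = 1.0, o1, o2             # output layer
            d[4] = theta[2] * o1 * (1.0 - o1)          # hidden unit 1
            d[5], d[6] = d[4] * x1, d[4] * x2
            d[7] = theta[3] * o2 * (1.0 - o2)          # hidden unit 2 (same procedure)
            d[8], d[9] = d[7] * x1, d[7] * x2
            # Steepest-descent step: theta <- theta + alpha (y - y_hat) dy/dtheta.
            theta = theta + alpha * (y - y_hat) * d
    return theta

# Illustrative usage on XOR-style data (success depends on the initialisation).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([0.0, 1.0, 1.0, 0.0])
theta = train_mlp(X, Y)
for (x1, x2), y in zip(X, Y):
    o1 = sigmoid(theta[4] + theta[5] * x1 + theta[6] * x2)
    o2 = sigmoid(theta[7] + theta[8] * x1 + theta[9] * x2)
    print((x1, x2), y, theta[1] + theta[2] * o1 + theta[3] * o2)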