MACHINE LEARNING AND PATTERN RECOGNITION, Fall 2005, Lecture 4: Gradient-Based Learning III: Architectures
Yann LeCun, The Courant Institute, New York University
http://yann.lecun.com

A Trainer class

[Figure: "Simple Trainer" block diagram: INPUT and PARAM feed MACHINE; MACHINE's output MOUT and the desired OUTPUT feed COST; COST produces ENERGY, which feeds the LOSS L.]

The trainer object is designed to train a particular machine with a given energy function and loss. The example below uses the simple energy loss.

(defclass simple-trainer object
  input    ; the input state
  output   ; the output/label state
  machine  ; the machine
  mout     ; the output of the machine
  cost     ; the cost module
  energy   ; the energy (output of the cost)
  param    ; the trainable parameter vector
  )

A Trainer class: running the machine

[Figure: the same "Simple Trainer" block diagram as above.]

The run method takes an input and a vector of possible labels (each of which is a vector, hence <label-set> is a matrix) and returns the index of the label that minimizes the energy. It fills up the vector <energies> with the energy produced by each possible label.

(defmethod simple-trainer run (sample label-set energies)
  (==> input resize (idx-dim sample 0))
  (idx-copy sample :input:x)
  (==> machine fprop input mout)
  (idx-bloop ((label label-set) (e energies))
    (==> output resize (idx-dim label 0))
    (idx-copy label :output:x)
    (==> cost fprop mout output energy)
    (e (:energy:x)))
  ;; find index of lowest energy
  (idx-d1indexmin energies))

A Trainer class: training the machine

[Figure: the same "Simple Trainer" block diagram as above.]

The learn-sample method performs a learning update on one sample. <sample> is the input sample, <label> is the desired category (an integer), <label-set> is a matrix where the i-th row is the desired output for the i-th category, and <update-args> is a list of arguments for the parameter update method (e.g. learning rate and weight decay).

(defmethod simple-trainer learn-sample (sample label label-set update-args)
  (==> input resize (idx-dim sample 0))
  (idx-copy sample :input:x)
  (==> machine fprop input mout)
  (==> output resize (idx-dim label-set 1))
  (idx-copy (select label-set 0 (label 0)) :output:x)
  (==> cost fprop mout output energy)
  (==> cost bprop mout output energy)
  (==> machine bprop input mout)
  (==> param update update-args)
  (:energy:x))
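
For readers who do not use Lush, here is a rough Python/NumPy sketch of the same idea. It is my own illustration, not the lecture's code; LinearMachine and EuclideanCost are stand-in modules with fprop/bprop methods.

import numpy as np

class LinearMachine:
    def __init__(self, n_in, n_out, rng):
        self.W = 0.1 * rng.standard_normal((n_out, n_in))
        self.dW = np.zeros_like(self.W)
    def fprop(self, x):
        return self.W @ x
    def bprop(self, x, dout):
        self.dW = np.outer(dout, x)        # gradient w.r.t. parameters
        return self.W.T @ dout             # gradient w.r.t. input

class EuclideanCost:
    def fprop(self, mout, target):
        return 0.5 * np.sum((mout - target) ** 2)   # energy
    def bprop(self, mout, target):
        return mout - target                        # dE/dmout

class SimpleTrainer:
    def __init__(self, machine, cost):
        self.machine, self.cost = machine, cost
    def run(self, sample, label_set):
        # return the index of the label with the lowest energy
        energies = [self.cost.fprop(self.machine.fprop(sample), lab)
                    for lab in label_set]
        return int(np.argmin(energies)), energies
    def learn_sample(self, sample, label, label_set, lr=0.05):
        target = label_set[label]
        mout = self.machine.fprop(sample)
        energy = self.cost.fprop(mout, target)
        dmout = self.cost.bprop(mout, target)
        self.machine.bprop(sample, dmout)
        self.machine.W -= lr * self.machine.dW      # parameter update
        return energy

rng = np.random.default_rng(0)
trainer = SimpleTrainer(LinearMachine(4, 3, rng), EuclideanCost())
labels = np.eye(3)                         # one target vector per class
x = rng.standard_normal(4)
trainer.learn_sample(x, label=1, label_set=labels)
print(trainer.run(x, labels)[0])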

Other Topologies

The back-propagation procedure is not limited to feed-forward cascades. It can be applied to networks of modules with any topology, as long as the connection graph is acyclic. If the graph is acyclic (no loops), we can easily find a suitable order in which to call the fprop method of each module; the bprop methods are then called in the reverse order. If the graph has cycles (loops), we have a so-called recurrent network. This will be studied in a subsequent lecture.
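
As a sketch of this ordering argument: the fprop order can be obtained by a topological sort of the acyclic module graph, and bprop visits the modules in reverse. The module names below are made up for illustration; this is not from the lecture.

from graphlib import TopologicalSorter

# edges: module -> set of modules whose outputs it consumes
graph = {"linear1": {"input"},
         "sigmoid": {"linear1"},
         "linear2": {"sigmoid", "input"},   # a skip connection is fine
         "cost":    {"linear2"}}

fprop_order = list(TopologicalSorter(graph).static_order())
bprop_order = list(reversed(fprop_order))
print(fprop_order)   # predecessors always come before their consumers
print(bprop_order)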

More Modules

A rich repertoire of learning machines can be constructed with just a few module types in addition to the linear, sigmoid, and Euclidean modules we have already seen. We will review a few important modules: the branch/plus module, the switch module, the softmax module, and the logsum module.

The Branch/Plus Module

The PLUS module: a module with K inputs $X_1, \ldots, X_K$ (of any type) that computes the sum of its inputs: $X_{out} = \sum_k X_k$. Back-prop: $\partial E/\partial X_k = \partial E/\partial X_{out}$ for all $k$.

The BRANCH module: a module with one input and K outputs $X_1, \ldots, X_K$ (of any type) that simply copies its input to its outputs: $X_k = X_{in}$ for all $k \in [1..K]$. Back-prop: $\partial E/\partial X_{in} = \sum_k \partial E/\partial X_k$.
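
A minimal NumPy sketch of these two modules (my own illustration, with the gradient formulas above hard-coded; it assumes all inputs are arrays of the same shape):

import numpy as np

class Plus:
    # K inputs, one output: X_out = sum_k X_k
    def fprop(self, xs):
        return sum(xs)
    def bprop(self, xs, dout):
        # dE/dX_k = dE/dX_out for every input
        return [dout.copy() for _ in xs]

class Branch:
    # one input, K outputs: X_k = X_in
    def __init__(self, k):
        self.k = k
    def fprop(self, x):
        return [x.copy() for _ in range(self.k)]
    def bprop(self, x, douts):
        # dE/dX_in = sum_k dE/dX_k
        return sum(douts)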

The Switch Module

A module with K inputs $X_1, \ldots, X_K$ (of any type) and one additional discrete-valued input Y. The value of the discrete input determines which of the K inputs is copied to the output: $X_{out} = \sum_k \delta(Y - k) X_k$. Back-prop: $\partial E/\partial X_k = \delta(Y - k)\, \partial E/\partial X_{out}$. The gradient with respect to the output is copied to the gradient with respect to the switched-in input; the gradients of all other inputs are zero.
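
A corresponding sketch (again my own illustration, in the same NumPy style):

import numpy as np

class Switch:
    # K inputs plus a discrete input y; copies input number y to the output
    def fprop(self, xs, y):
        return xs[y]
    def bprop(self, xs, y, dout):
        # only the switched-in input receives a gradient
        grads = [np.zeros_like(x) for x in xs]
        grads[y] = dout.copy()
        return grads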

The Logsum Module

fprop: $X_{out} = -\frac{1}{\beta} \log \sum_k \exp(-\beta X_k)$

bprop: $\frac{\partial E}{\partial X_k} = \frac{\partial E}{\partial X_{out}} \frac{\exp(-\beta X_k)}{\sum_j \exp(-\beta X_j)}$, or equivalently $\frac{\partial E}{\partial X_k} = \frac{\partial E}{\partial X_{out}} P_k$ with $P_k = \frac{\exp(-\beta X_k)}{\sum_j \exp(-\beta X_j)}$.
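
A small NumPy sketch of this module (mine, not the lecture's code), written so that the exponentials do not overflow:

import numpy as np

def logsum_fprop(x, beta=1.0):
    # X_out = -(1/beta) * log(sum_k exp(-beta * x_k)), computed stably
    m = np.min(x)
    return m - np.log(np.sum(np.exp(-beta * (x - m)))) / beta

def logsum_bprop(x, dout, beta=1.0):
    # dE/dx_k = dout * P_k,  P_k = exp(-beta x_k) / sum_j exp(-beta x_j)
    e = np.exp(-beta * (x - np.min(x)))
    return dout * e / e.sum()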

Log-Likelihood Loss Function and Logsum Modules

MAP/MLE loss: $L_{ll}(W, Y^i, X^i) = E(W, Y^i, X^i) + \frac{1}{\beta} \log \sum_k \exp(-\beta E(W, k, X^i))$

A classifier trained with the log-likelihood loss can be transformed into an equivalent machine trained with the energy loss. The transformed machine contains multiple replicas of the classifier: one replica for the desired output, and K replicas, one for each possible value of Y.

Softmax Module

A single vector as input, and a normalized vector as output: $(X_{out})_i = \frac{\exp(-\beta x_i)}{\sum_k \exp(-\beta x_k)}$

Exercise: find the bprop, i.e. $\frac{\partial (X_{out})_i}{\partial x_j} = \; ?$
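
One way to check a candidate answer to the exercise is a finite-difference probe of the softmax fprop. The sketch below is my own (it assumes beta = 1 by default) and prints the numerical Jacobian without giving the closed form away:

import numpy as np

def softmax_fprop(x, beta=1.0):
    e = np.exp(-beta * (x - x.min()))       # shifted for numerical stability
    return e / e.sum()

def numerical_jacobian(f, x, eps=1e-6):
    # J[i, j] ~= d f(x)_i / d x_j, estimated by central differences
    n = x.size
    J = np.zeros((n, n))
    for j in range(n):
        d = np.zeros(n); d[j] = eps
        J[:, j] = (f(x + d) - f(x - d)) / (2 * eps)
    return J

x = np.array([0.5, -1.0, 2.0])
print(numerical_jacobian(softmax_fprop, x))   # compare with your analytic bprop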

Radial Basis Function Network (RBF Net)

Linearly combined Gaussian bumps: $F(X, W, U) = \sum_i u_i \exp(-k_i \|X - W_i\|^2)$

The centers of the bumps can be initialized with the K-means algorithm (see below), and subsequently adjusted with gradient descent. This is a good architecture for regression and function approximation.
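
A small NumPy sketch of this forward pass (my own illustration; the centers here are initialized at random rather than with K-means):

import numpy as np

def rbf_fprop(X, centers, widths, u):
    # F(X) = sum_i u_i * exp(-k_i * ||X - W_i||^2)
    sq_dists = np.sum((centers - X) ** 2, axis=1)   # one distance per bump
    return np.sum(u * np.exp(-widths * sq_dists))

rng = np.random.default_rng(0)
centers = rng.standard_normal((5, 3))   # 5 bumps in a 3-d input space
widths  = np.full(5, 2.0)               # the k_i
u       = rng.standard_normal(5)        # linear output weights
print(rbf_fprop(rng.standard_normal(3), centers, widths, u))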

MAP/MLE Loss and Cross-Entropy

Classification (y is scalar and discrete). Let us denote $E(y, X, W) = E_y(X, W)$.

MAP/MLE loss function: $L(W) = \frac{1}{P} \sum_{i=1}^P \left[ E_{y^i}(X^i, W) + \frac{1}{\beta} \log \sum_k \exp(-\beta E_k(X^i, W)) \right]$

This loss can be written as $L(W) = -\frac{1}{P} \sum_{i=1}^P \frac{1}{\beta} \log \frac{\exp(-\beta E_{y^i}(X^i, W))}{\sum_k \exp(-\beta E_k(X^i, W))}$

Cross-Entropy and KL-Divergence

Let us denote $P(j | X^i, W) = \frac{\exp(-\beta E_j(X^i, W))}{\sum_k \exp(-\beta E_k(X^i, W))}$. Then

$L(W) = \frac{1}{P} \sum_{i=1}^P \frac{1}{\beta} \log \frac{1}{P(y^i | X^i, W)}$

$L(W) = \frac{1}{P} \sum_{i=1}^P \frac{1}{\beta} \sum_k D_k(y^i) \log \frac{D_k(y^i)}{P(k | X^i, W)}$

with $D_k(y^i) = 1$ iff $k = y^i$, and 0 otherwise.

Example 1: $D = (0, 0, 1, 0)$ and $P(\cdot | X^i, W) = (0.1, 0.1, 0.7, 0.1)$. With $\beta = 1$, $L^i(W) = \log(1/0.7) = 0.3567$.
Example 2: $D = (0, 0, 1, 0)$ and $P(\cdot | X^i, W) = (0, 0, 1, 0)$. With $\beta = 1$, $L^i(W) = \log(1/1) = 0$.
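
The two numerical examples can be checked directly; this is a throwaway Python check of my own, not part of the lecture:

import numpy as np

def per_sample_loss(D, P, beta=1.0):
    # L_i = (1/beta) * sum_k D_k * log(D_k / P_k), with terms where D_k = 0 dropped
    D, P = np.asarray(D, float), np.asarray(P, float)
    mask = D > 0
    return np.sum(D[mask] * np.log(D[mask] / P[mask])) / beta

print(per_sample_loss([0, 0, 1, 0], [0.1, 0.1, 0.7, 0.1]))  # ~0.3567
print(per_sample_loss([0, 0, 1, 0], [0.0, 0.0, 1.0, 0.0]))  # 0.0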

Cross-Entropy and KL-Divergence

$L(W) = \frac{1}{P} \sum_{i=1}^P \frac{1}{\beta} \sum_k D_k(y^i) \log \frac{D_k(y^i)}{P(k | X^i, W)}$

L(W) is proportional to the cross-entropy between the conditional distribution of y given by the machine, $P(k | X^i, W)$, and the desired distribution over classes for sample i, $D_k(y^i)$ (equal to 1 for the desired class, and 0 for the other classes).

The Kullback-Leibler divergence between two distributions Q(k) and P(k) is defined as $\sum_k Q(k) \log \frac{Q(k)}{P(k)}$. It measures a sort of dissimilarity between two distributions; for a one-hot Q such as D it coincides with the cross-entropy, since the entropy of Q is zero. The KL-divergence is not a distance, because it is not symmetric and it does not satisfy the triangle inequality.

Multiclass Classification and KL-Divergence

Assume that our discriminant module F(X, W) produces a vector of energies, with one energy $E_k(X, W)$ for each class. A switch module selects the smallest $E_k$ to perform the classification. As shown above, the MAP/MLE loss below can be seen as a KL-divergence between the desired distribution for y and the distribution produced by the machine:

$L(W) = \frac{1}{P} \sum_{i=1}^P \left[ E_{y^i}(X^i, W) + \frac{1}{\beta} \log \sum_k \exp(-\beta E_k(X^i, W)) \right]$

Multiclass Classification and Softmax

The previous machine (discriminant function with one output per class + switch, trained with the MAP/MLE loss) is equivalent to the following machine: discriminant function with one output per class + softmax + switch + log loss:

$L(W) = -\frac{1}{P} \sum_{i=1}^P \frac{1}{\beta} \log P(y^i | X^i, W)$ with $P(j | X^i, W) = \frac{\exp(-\beta E_j(X^i, W))}{\sum_k \exp(-\beta E_k(X^i, W))}$ (the softmax of the $E_j$'s).

Machines can be transformed into various equivalent forms to factorize the computation in advantageous ways.

Multiclass Classification with a Junk Category

Sometimes one of the categories is "none of the above". How can we handle that? We add an extra energy wire $E_0$ for the junk category, which does not depend on the input. $E_0$ can be a hand-chosen constant, or it can be a trainable parameter (call it $w_0$). Everything else stays the same.

NN-RBF Hybrids

Sigmoid units are generally more appropriate for low-level feature extraction. Euclidean/RBF units are generally more appropriate for the final classification, particularly if there are many classes. Hybrid architecture for multiclass classification: sigmoids below, RBFs on top, followed by a softmax and a log loss.

Parameter-Space Transforms

Reparameterize the function by transforming the parameter space: $E(Y, X, W) \rightarrow E(Y, X, G(U))$ with $W = G(U)$.

Gradient descent in U space: $U \leftarrow U - \eta \left(\frac{\partial G}{\partial U}\right)^\top \frac{\partial E(Y, X, W)}{\partial W}$

This is equivalent to the following algorithm in W space: $W \leftarrow W - \eta \frac{\partial G}{\partial U} \left(\frac{\partial G}{\partial U}\right)^\top \frac{\partial E(Y, X, W)}{\partial W}$

Dimensions: $[N_w \times N_u][N_u \times N_w][N_w]$.
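
A quick numeric sanity check of this equivalence, on a made-up quadratic energy and a made-up linear reparameterization W = G(U) = A U (my own example, where the equivalence is exact):

import numpy as np

rng = np.random.default_rng(0)
n_w, n_u, eta = 4, 3, 0.1
A = rng.standard_normal((n_w, n_u))          # G(U) = A @ U, so dG/dU = A
c = rng.standard_normal(n_w)

def grad_E_wrt_W(W):
    return W - c                             # E(W) = 0.5 * ||W - c||^2

U = rng.standard_normal(n_u)
W = A @ U

# one step in U space, then map back to W space
U_new = U - eta * A.T @ grad_E_wrt_W(W)
W_from_U = A @ U_new

# the equivalent step taken directly in W space
W_direct = W - eta * A @ A.T @ grad_E_wrt_W(W)

print(np.allclose(W_from_U, W_direct))       # True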

Parameter-Space Transforms: Weight Sharing

A single parameter is replicated multiple times in a machine: $E(Y, X, w_1, \ldots, w_i, \ldots, w_j, \ldots) \rightarrow E(Y, X, w_1, \ldots, u_k, \ldots, u_k, \ldots)$

Gradient: $\frac{\partial E}{\partial u_k} = \frac{\partial E}{\partial w_i} + \frac{\partial E}{\partial w_j}$

$w_i$ and $w_j$ are tied, or equivalently, $u_k$ is shared between two locations.

Parameter Sharing between Replicas

We have seen this before: a parameter controls several replicas of a machine. $E(Y_1, Y_2, X, W) = E_1(Y_1, X, W) + E_1(Y_2, X, W)$

Gradient: $\frac{\partial E(Y_1, Y_2, X, W)}{\partial W} = \frac{\partial E_1(Y_1, X, W)}{\partial W} + \frac{\partial E_1(Y_2, X, W)}{\partial W}$

W is shared between two (or more) instances of the machine: just sum up the gradient contributions from each instance.
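
A tiny numeric illustration of this summation rule (my own example: a shared scalar weight in two quadratic replicas E1, checked against a finite difference):

# E(Y1, Y2, X, W) = E1(Y1, X, W) + E1(Y2, X, W) with a shared scalar W
def e1(y, x, w):
    return 0.5 * (w * x - y) ** 2

def de1_dw(y, x, w):
    return (w * x - y) * x

x, w, y1, y2 = 2.0, 0.3, 1.0, -0.5
total_grad = de1_dw(y1, x, w) + de1_dw(y2, x, w)   # sum of per-replica gradients

eps = 1e-6
numeric = ((e1(y1, x, w + eps) + e1(y2, x, w + eps))
           - (e1(y1, x, w - eps) + e1(y2, x, w - eps))) / (2 * eps)
print(total_grad, abs(total_grad - numeric) < 1e-6)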

Path Summation (Path Integral)

One variable influences the output through several others: $E(Y, X, W) = E(Y, F_1(X, W), F_2(X, W), F_3(X, W), V)$

Gradient with respect to X: $\frac{\partial E(Y, X, W)}{\partial X} = \sum_i \frac{\partial E}{\partial S_i} \frac{\partial F_i(X, W)}{\partial X}$, where $S_i = F_i(X, W)$.

Gradient with respect to W: $\frac{\partial E(Y, X, W)}{\partial W} = \sum_i \frac{\partial E}{\partial S_i} \frac{\partial F_i(X, W)}{\partial W}$

There is no need to implement these rules explicitly; they fall out naturally of the object-oriented implementation.

Mixtures of Experts

Sometimes the function to be learned is consistent in restricted domains of the input space, but globally inconsistent. Example: a piecewise linearly separable function. Solution: a machine composed of several experts that are specialized on subdomains of the input space. The output is a weighted combination of the outputs of each expert; the weights are produced by a gater network that identifies which subdomain the input vector is in.

$F(X, W) = \sum_k u_k F_k(X, W_k)$ with $u_k = \frac{\exp(-\beta G_k(X, W_0))}{\sum_j \exp(-\beta G_j(X, W_0))}$

The expert weights $u_k$ are obtained by softmax-ing the outputs of the gater. Example: the two experts are linear regressors, and the gater is a logistic regressor.
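
A compact NumPy sketch of the two-expert example (my own illustration: linear experts, a softmax gater, forward pass only):

import numpy as np

def moe_fprop(x, expert_weights, gater_weights, beta=1.0):
    # expert outputs: one linear regressor per expert
    expert_out = np.array([w @ x for w in expert_weights])
    # gater scores, turned into mixture weights u_k by a softmax
    g = np.array([v @ x for v in gater_weights])
    e = np.exp(-beta * (g - g.min()))
    u = e / e.sum()
    return np.sum(u * expert_out)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
experts = [rng.standard_normal(4), rng.standard_normal(4)]   # two linear experts
gater   = [rng.standard_normal(4), rng.standard_normal(4)]
print(moe_fprop(x, experts, gater))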

Sequence Processing: Time-Delayed Inputs

The input is a sequence of vectors $X_t$. Simple idea: the machine takes a time window as input: $R = F(X_t, X_{t-1}, X_{t-2}, W)$

Examples of use: predict the next sample in a time series (e.g. stock market, water consumption); predict the next character or word in a text; classify an intron/exon transition in a DNA sequence.
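
Building such time windows from a sequence is straightforward; a small NumPy sketch of my own:

import numpy as np

def time_windows(X, width=3):
    # X: array of shape (T, d); returns one row per valid t,
    # containing (X_t, X_{t-1}, ..., X_{t-width+1}) flattened
    T = X.shape[0]
    return np.stack([X[t - width + 1:t + 1][::-1].ravel()
                     for t in range(width - 1, T)])

X = np.arange(12, dtype=float).reshape(6, 2)   # a toy sequence of 6 vectors
print(time_windows(X).shape)                   # (4, 6): one window per valid t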

Sequence Processing: Time-Delay Networks

One layer produces a sequence for the next layer: stacked time-delayed layers.

Layer 1: $X^1_t = F_1(X_t, X_{t-1}, X_{t-2}, W_1)$
Layer 2: $X^2_t = F_2(X^1_t, X^1_{t-1}, X^1_{t-2}, W_2)$
Cost: $E_t = C(X^2_t, Y_t)$

Examples: predict the next sample in a time series with long-term memory (e.g. stock market, water consumption); recognize spoken words; recognize gestures and handwritten characters on a pen computer.

How do we train?

Training a TDNN

Idea: isolate the minimal network that influences the energy at one particular time step t. In our example, this energy is influenced by 5 time steps of the input. Train this network in isolation, taking those 5 time steps as the input.

Surprise: we have three identical replicas of the first-layer units that share the same weights. We know how to deal with that: do the regular backprop, and add up the contributions to the gradient from the 3 replicas.

Convolutional Module

If the first layer is a set of linear units with sigmoids, we can view it as performing multiple discrete convolutions of the input sequence.

1D convolution operation: $S^1_t = \sum_{j=1}^T w^1_j X_{t-j}$, where $w^1_j, j \in [1, T]$ is a convolution kernel.

Sigmoid: $X^1_t = \tanh(S^1_t)$

Derivative: $\frac{\partial E}{\partial w^1_j} = \sum_{t=1}^{3} \frac{\partial E}{\partial S^1_t} X_{t-j}$
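
A small NumPy sketch of this 1D convolution and its kernel gradient (my own illustration, single kernel, computed only at positions where the kernel fits):

import numpy as np

def conv1d_fprop(x, w):
    # S_t = sum_j w_j * x_{t-j}
    T = len(w)
    return np.array([np.dot(w, x[t - np.arange(1, T + 1)])
                     for t in range(T, len(x))])

def conv1d_kernel_grad(x, w, dS):
    # dE/dw_j = sum_t dE/dS_t * x_{t-j}
    T = len(w)
    return np.array([np.dot(dS, x[np.arange(T, len(x)) - j])
                     for j in range(1, T + 1)])

x = np.sin(np.arange(10.0))
w = np.array([0.5, -0.2, 0.1])
S = conv1d_fprop(x, w)
print(conv1d_kernel_grad(x, w, np.ones_like(S)))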

Simple Recurrent Machines

The output of a machine is fed back to some of its inputs Z: $Z_{t+1} = F(X_t, Z_t, W)$, where t is a time index. The input X is not just a vector but a sequence of vectors $X_t$. This machine is a dynamical system with an internal state $Z_t$. Hidden Markov Models are a special case of recurrent machines where F is linear.

Unfolded Recurrent Nets and Backprop Through Time

To train a recurrent net: unfold it in time and turn it into a feed-forward net with as many layers as there are time steps in the input sequence. An unfolded recurrent net is a very deep machine in which all the layers are identical and share the same weights:

$\frac{\partial E}{\partial W} = \sum_t \frac{\partial E}{\partial Z_{t+1}} \frac{\partial F(X_t, Z_t, W)}{\partial W}$

This method is called back-propagation through time. Examples of use: process control (steel mill, chemical plant, pollution control...), robot control, dynamical system modelling...
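
A rough Python sketch of back-propagation through time for a tiny recurrent machine. This is my own illustration, not the lecture's example: a scalar state with $Z_{t+1} = \tanh(w_x X_t + w_z Z_t)$ and a squared-error cost on the final state.

import numpy as np

def bptt(xs, z0, w_x, w_z, target):
    # forward: unfold the recurrence Z_{t+1} = tanh(w_x * X_t + w_z * Z_t)
    zs = [z0]
    for x in xs:
        zs.append(np.tanh(w_x * x + w_z * zs[-1]))
    E = 0.5 * (zs[-1] - target) ** 2

    # backward: one identical "layer" per time step, shared weights,
    # so the gradient contributions of all steps are summed
    dz = zs[-1] - target
    g_wx = g_wz = 0.0
    for t in reversed(range(len(xs))):
        ds = dz * (1.0 - zs[t + 1] ** 2)     # back through the tanh
        g_wx += ds * xs[t]                   # contribution of replica t
        g_wz += ds * zs[t]
        dz = ds * w_z                        # pass the gradient to Z_t
    return E, g_wx, g_wz

xs = [0.5, -0.3, 0.8]
print(bptt(xs, z0=0.0, w_x=0.7, w_z=0.2, target=0.5))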