Statistical Machine Learning from Data

Statistical Machine Learning from Data: Other Artificial Neural Networks
Samy Bengio, IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland
bengio@idiap.ch, http://www.idiap.ch/~bengio
January 17, 2006

Outline: 1. Radial Basis Functions; 2. Recurrent Neural Networks and Long-Term Dependencies; 3. Auto-Associative Networks; 4. Mixture of Experts; 5. Time Delay Neural Networks and LeNet

Generic Mechanism: if $a = f(b, c; \theta_f)$ is differentiable and $b = g(c; \theta_g)$ is differentiable, then you should be able to compute the gradient of a cost applied to $a$ with respect to $b$, $c$, $\theta_f$, $\theta_g$, ... Hence, only your imagination prevents you from inventing yet another neural network machine! (Diagram: $a = f(b, c)$, $b = g(c)$, $c$.)
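As an illustration only, here is a minimal NumPy sketch of this mechanism, with an arbitrary made-up choice of f and g: the local derivatives of the two differentiable modules combine through the chain rule into gradients for every quantity involved.

```python
import numpy as np

# Two composed differentiable modules: b = g(c; theta_g) and a = f(b, c; theta_f).
# The particular f and g below are illustrative choices, not from the slides.
def g(c, theta_g):               # b = tanh(theta_g * c)
    return np.tanh(theta_g * c)

def f(b, c, theta_f):            # a = theta_f * b + c
    return theta_f * b + c

c, theta_g, theta_f = 0.5, 1.3, -0.7
b = g(c, theta_g)
a = f(b, c, theta_f)

# Local derivatives, combined with the chain rule:
da_db = theta_f
da_dtheta_f = b
db_dc = (1.0 - b ** 2) * theta_g        # derivative of tanh(theta_g * c) w.r.t. c
db_dtheta_g = (1.0 - b ** 2) * c
da_dc = da_db * db_dc + 1.0             # c reaches a both through b and directly
da_dtheta_g = da_db * db_dtheta_g
print(da_dc, da_dtheta_f, da_dtheta_g)
```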

1. Radial Basis Functions

Difference Between RBFs and MLPs (figure contrasting RBF hidden units with MLP hidden units).

Radial Basis Function (RBF) Models: a normal MLP, except that a hidden layer $l$ is encoded as follows:
$s_i^l = -\frac{1}{2} \sum_j (\gamma_{i,j}^l)^2 (y_j^{l-1} - \mu_{i,j}^l)^2$
$y_i^l = \exp(s_i^l)$
The parameters of such a layer $l$ are $\theta^l = \{\gamma_{i,j}^l, \mu_{i,j}^l : i, j\}$. These layers are useful to extract local features (whereas tanh layers extract global features).

Gradients for RBFs:
$s_i^l = -\frac{1}{2} \sum_j (\gamma_{i,j}^l)^2 (y_j^{l-1} - \mu_{i,j}^l)^2 \qquad y_i^l = \exp(s_i^l)$
$\frac{\partial y_i^l}{\partial s_i^l} = \exp(s_i^l) = y_i^l$
$\frac{\partial s_i^l}{\partial y_j^{l-1}} = -(\gamma_{i,j}^l)^2 (y_j^{l-1} - \mu_{i,j}^l)$
$\frac{\partial s_i^l}{\partial \mu_{i,j}^l} = (\gamma_{i,j}^l)^2 (y_j^{l-1} - \mu_{i,j}^l)$
$\frac{\partial s_i^l}{\partial \gamma_{i,j}^l} = -\gamma_{i,j}^l (y_j^{l-1} - \mu_{i,j}^l)^2$

Warning with Variances: for initialization, use K-Means for instance. One has to be very careful when learning $\gamma$ by gradient descent. Remember: $y_i^l = \exp\!\big(-\frac{1}{2} \sum_j (\gamma_{i,j}^l)^2 (y_j^{l-1} - \mu_{i,j}^l)^2\big)$. If $\gamma$ becomes too large, the gradients of the RBF layer can explode! One solution: constrain the $\gamma$ to a reasonable range. Otherwise, do not train them at all (keep the K-Means estimate for $\gamma$).
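A minimal NumPy sketch of such an RBF layer following the equations above, including the suggested option of constraining γ; the shapes, names and clipping range are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def rbf_forward(y_prev, mu, gamma):
    """y_prev: (J,) previous layer; mu, gamma: (I, J) parameters of layer l.
    s_i = -0.5 * sum_j gamma_ij^2 (y_prev_j - mu_ij)^2 ;  y_i = exp(s_i)."""
    diff = y_prev[None, :] - mu                       # (I, J)
    s = -0.5 * np.sum((gamma ** 2) * diff ** 2, axis=1)
    return s, np.exp(s)

def rbf_backward(y_prev, mu, gamma, y, dC_dy):
    """Gradients following the formulas of the previous slide."""
    diff = y_prev[None, :] - mu
    dC_ds = dC_dy * y                                  # dy_i/ds_i = exp(s_i) = y_i
    dC_dmu = dC_ds[:, None] * (gamma ** 2) * diff      # ds_i/dmu_ij = gamma_ij^2 (y_prev_j - mu_ij)
    dC_dgamma = dC_ds[:, None] * (-gamma * diff ** 2)  # ds_i/dgamma_ij = -gamma_ij (y_prev_j - mu_ij)^2
    dC_dyprev = -np.sum(dC_ds[:, None] * (gamma ** 2) * diff, axis=0)
    return dC_dmu, dC_dgamma, dC_dyprev

# After each gradient step, keep gamma in a reasonable range (or do not train it at all):
# gamma = np.clip(gamma - lr * dC_dgamma, 1e-3, 10.0)   # bounds are arbitrary examples
```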

2. Recurrent Neural Networks and Long-Term Dependencies

Recurrent Neural Networks: such models admit layers $l$ with integration functions $s_i^l = f(y_j^{l+k})$ where $k \geq 0$, hence loops, or recurrences. Such layers $l$ encode the notion of a temporal state. They are useful to search for relations in temporal data, without having to specify the exact delay of the relation. In order to compute the gradient, one must unfold in time all the relations between the data: $s_i^l(t) = f(y_j^{l+k}(t-1))$ where $k \geq 0$. Hence, one needs to exhibit the whole time-dependent graph between the input sequence and the output sequence. Caveat: it can be shown that the gradient vanishes exponentially fast through time.

Recurrent NNs (Graphical View): figure of the recurrent network on inputs x(t) and outputs y(t), unfolded in time over x(t-2), x(t-1), x(t), x(t+1) and y(t-2), y(t-1), y(t), y(t+1).

Recurrent NNs, Example of Derivation: consider the following simple recurrent neural network:
$h_t = \tanh(d\, h_{t-1} + w\, x_t + b)$
$\hat{y}_t = v\, h_t + c$
with $\{d, w, b, v, c\}$ the set of parameters. Cost to minimize (for one sequence):
$C = \sum_{t=1}^T C_t = \sum_{t=1}^T \frac{1}{2}(y_t - \hat{y}_t)^2$
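A scalar NumPy sketch of this model and its cost; the toy sequences and parameter values are made up for illustration.

```python
import numpy as np

def rnn_forward(x, y, d, w, b, v, c):
    """h_t = tanh(d h_{t-1} + w x_t + b),  yhat_t = v h_t + c,  C = sum_t 0.5 (y_t - yhat_t)^2."""
    T = len(x)
    h = np.zeros(T + 1)              # h[0] is the initial state, taken to be 0 here
    yhat = np.zeros(T)
    C = 0.0
    for t in range(1, T + 1):
        h[t] = np.tanh(d * h[t - 1] + w * x[t - 1] + b)
        yhat[t - 1] = v * h[t] + c
        C += 0.5 * (y[t - 1] - yhat[t - 1]) ** 2
    return h, yhat, C

x = np.array([0.1, -0.4, 0.3, 0.8])   # toy input sequence
y = np.array([0.0,  0.2, 0.1, -0.5])  # toy target sequence
h, yhat, C = rnn_forward(x, y, d=0.9, w=0.5, b=0.0, v=1.0, c=0.0)
```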

Derivation of the Gradient: we need to derive $\frac{\partial C}{\partial d}$, $\frac{\partial C}{\partial w}$, $\frac{\partial C}{\partial b}$, $\frac{\partial C}{\partial v}$, $\frac{\partial C}{\partial c}$. Let us do it for, say, $\frac{\partial C}{\partial w}$:
$\frac{\partial C}{\partial w} = \sum_{t=1}^T \frac{\partial C_t}{\partial w} = \sum_{t=1}^T \frac{\partial C_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial w} = \sum_{t=1}^T \frac{\partial C_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial h_t} \frac{\partial h_t}{\partial w}$

Derivation of the Gradient (2):
$\frac{\partial C}{\partial w} = \sum_{t=1}^T \frac{\partial C_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial h_t} \frac{\partial h_t}{\partial w} = \sum_{t=1}^T \frac{\partial C_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial h_t} \sum_{s=1}^t \frac{\partial h_t}{\partial h_s} \frac{\partial h_s}{\partial w}$
with
$\frac{\partial C_t}{\partial \hat{y}_t} = \hat{y}_t - y_t \qquad \frac{\partial \hat{y}_t}{\partial h_t} = v \qquad \frac{\partial h_t}{\partial h_s} = \prod_{i=s+1}^t \frac{\partial h_i}{\partial h_{i-1}} = \prod_{i=s+1}^t (1 - h_i^2)\, d \qquad \frac{\partial h_s}{\partial w} = (1 - h_s^2)\, x_s$
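The derived expression can be coded directly; the sketch below (plain Python, hypothetical function name) computes $\frac{\partial C}{\partial w}$ from the states produced by the forward pass sketched above.

```python
def grad_w(x, y, h, yhat, d, v):
    """dC/dw = sum_t (yhat_t - y_t) * v * sum_{s<=t} [ prod_{i=s+1}^t (1 - h_i^2) d ] * (1 - h_s^2) * x_s."""
    T = len(x)
    dC_dw = 0.0
    for t in range(1, T + 1):
        dCt_dht = (yhat[t - 1] - y[t - 1]) * v         # dC_t/dyhat_t * dyhat_t/dh_t
        for s in range(1, t + 1):
            prod = 1.0
            for i in range(s + 1, t + 1):              # dh_t/dh_s = prod_i (1 - h_i^2) d
                prod *= (1.0 - h[i] ** 2) * d
            dC_dw += dCt_dht * prod * (1.0 - h[s] ** 2) * x[s - 1]
    return dC_dw
```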

Long Term Dependencies: suppose we want to classify a sequence according to its first frame, but the target is only known at the end (figure: a long sequence whose meaningful input is at the beginning, followed by noise, with the target at the end). Unfortunately, the gradient vanishes:
$\frac{\partial C_T}{\partial w} = \sum_{t=1}^T \frac{\partial C_T}{\partial h_T} \frac{\partial h_T}{\partial h_t} \frac{\partial h_t}{\partial w}$
This is because, for $t \ll T$,
$\frac{\partial C_T}{\partial h_t} = \frac{\partial C_T}{\partial h_T} \frac{\partial h_T}{\partial h_t} \approx 0$, since $\frac{\partial h_T}{\partial h_t} = \prod_{\tau=t+1}^T \frac{\partial h_\tau}{\partial h_{\tau-1}}$ and $\left|\frac{\partial h_\tau}{\partial h_{\tau-1}}\right| < 1$.
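A quick numerical illustration of the vanishing factor: with the previous slide's formula, $\frac{\partial h_T}{\partial h_t}$ is a product of terms $(1 - h_i^2)\, d$, each of magnitude below 1 here, so it shrinks exponentially as $T - t$ grows (toy data and illustrative parameter values).

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, w, b = 50, 0.9, 0.5, 0.0
x = rng.normal(size=T)
h = np.zeros(T + 1)
for t in range(1, T + 1):
    h[t] = np.tanh(d * h[t - 1] + w * x[t - 1] + b)

# |dh_T/dh_t| = prod_{tau=t+1}^T |(1 - h_tau^2) d|  -- shrinks exponentially with T - t
for t in (T - 1, T - 10, T - 25, 1):
    factor = np.prod([(1.0 - h[i] ** 2) * d for i in range(t + 1, T + 1)])
    print(t, abs(factor))
```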

3. Auto-Associative Networks

Auto-Associative Nets (Graphical View): figure of a network mapping the inputs to an internal space and back to the outputs, with targets = inputs.

Auto-Associative Networks: the apparent objective is to learn to reconstruct the input; in such models, the target vector is the same as the input vector! The real objective is to learn an internal representation of the data. If there is one hidden layer of linear units, then after learning, the model implements a principal component analysis with the first N principal components (N being the number of hidden units). If there are non-linearities and more hidden layers, then the system implements a kind of non-linear principal component analysis.
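A sketch of the linear, one-hidden-layer case: a linear auto-associator trained with squared error should end up spanning the same subspace as the first N principal components. The data, hyperparameters and convergence check are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))   # toy data
X -= X.mean(axis=0)
N = 3                                   # number of (linear) hidden units
W1 = 0.01 * rng.normal(size=(10, N))    # encoder
W2 = 0.01 * rng.normal(size=(N, 10))    # decoder
lr = 1e-3
for _ in range(2000):                   # gradient descent on the reconstruction error ||X W1 W2 - X||^2
    H = X @ W1
    R = H @ W2 - X
    grad_W1 = X.T @ (R @ W2.T) / len(X)
    grad_W2 = H.T @ R / len(X)
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2

# Fraction of the decoder lying outside the top-N principal subspace (should become small):
_, _, Vt = np.linalg.svd(X, full_matrices=False)
print(np.linalg.norm(W2 @ Vt[N:].T) / np.linalg.norm(W2))
```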

4. Mixture of Experts

Mixture of Experts: let $f_i(x; \theta_{f_i})$ be a differentiable parametric function, and let there be $N$ such functions $f_i$. Let $g(x; \theta_g)$ be a gater: a differentiable function with $N$ positive outputs such that
$\sum_{i=1}^N g(x; \theta_g)[i] = 1$
Then a mixture of experts is a function $h(x; \theta)$:
$h(x; \theta) = \sum_{i=1}^N g(x; \theta_g)[i]\, f_i(x; \theta_{f_i})$

Mixture of Experts (Graphical View): figure of N experts applied to the input, their outputs combined by a weighted sum ($\Sigma$) whose weights are given by the gater.

Mixture of Experts, Training: we can compute the gradient with respect to every parameter. For the parameters of expert $f_i$:
$\frac{\partial h(x; \theta)}{\partial \theta_{f_i}} = \frac{\partial h(x; \theta)}{\partial f_i(x; \theta_{f_i})} \frac{\partial f_i(x; \theta_{f_i})}{\partial \theta_{f_i}} = g(x; \theta_g)[i]\, \frac{\partial f_i(x; \theta_{f_i})}{\partial \theta_{f_i}}$
For the parameters of the gater $g$:
$\frac{\partial h(x; \theta)}{\partial \theta_g} = \sum_{i=1}^N \frac{\partial h(x; \theta)}{\partial g(x; \theta_g)[i]} \frac{\partial g(x; \theta_g)[i]}{\partial \theta_g} = \sum_{i=1}^N f_i(x; \theta_{f_i})\, \frac{\partial g(x; \theta_g)[i]}{\partial \theta_g}$
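A sketch with linear experts and a softmax gater (both are illustrative choices that satisfy the definitions above, not prescribed by the slides), implementing the forward pass and the two gradients.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mixture_forward(x, expert_W, gater_W):
    """h(x) = sum_i g(x)[i] f_i(x), with f_i(x) = expert_W[i] @ x and g(x) = softmax(gater_W @ x)."""
    f = expert_W @ x                  # (N,) scalar expert outputs
    g = softmax(gater_W @ x)          # (N,) positive weights summing to 1
    return (g * f).sum(), f, g

def mixture_backward(x, f, g, dC_dh):
    """Gradients of a cost C w.r.t. expert and gater parameters, given dC/dh."""
    dC_dexpert_W = (dC_dh * g)[:, None] * x[None, :]      # dh/df_i = g[i], df_i/dW_i = x
    dC_dz = dC_dh * g * (f - (g * f).sum())                # backprop through the softmax gater
    dC_dgater_W = dC_dz[:, None] * x[None, :]
    return dC_dexpert_W, dC_dgater_W

rng = np.random.default_rng(0)
x = np.array([1.0, -0.5, 0.2])
h, f, g = mixture_forward(x, rng.normal(size=(4, 3)), rng.normal(size=(4, 3)))
dE_dW, dG_dW = mixture_backward(x, f, g, dC_dh=1.0)
```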

Mixture of Experts, Discussion: the gater implements a soft partition of the input space (to be compared with, say, the hard partition of K-Means). This is useful when there might be different regimes in the data. Special case: when the experts can be trained by EM, the mixture can also be trained by EM.

Hierarchical Mixture of Experts: obtained when the experts are themselves represented as mixtures of experts (figure: a tree in which sub-mixtures, each with its own gater and weighted sum $\Sigma$, are combined by a higher-level gater).

5. Time Delay Neural Networks and LeNet

Time Delay Neural Networks (TDNNs): TDNNs are models to analyze sequences or time series. Hypothesis: some regularities exist over time; the same pattern can be seen many times during the same time series (or even over many time series). First idea: attribute one hidden unit to model each pattern. These hidden units should have associated parameters which are the same over time. Hence, the hidden unit associated with a given pattern $p_i$ at time $t$ will share the same parameters as the hidden unit associated with the same pattern $p_i$ at time $t + k$. Note that we are also going to learn what the patterns are!

TDNNs: Convolutions. How do we formalize this first idea? Using a convolution operator. This operator can be used not only between the input and the first hidden layer, but between any hidden layers. (Figure: convolution over the time axis of the feature representation.)

Convolutions: Equations. Let $s_{t,i}^l$ be the input value at time $t$ of unit $i$ of layer $l$, and $y_{t,i}^l$ the corresponding output value (input values: $y_{t,i}^0 = x_{t,i}$). Let $w_{i,j,k}^l$ be the weight between unit $i$ of layer $l$ at any time $t$ and unit $j$ of layer $l-1$ at time $t-k$, and let $b_i^l$ be the bias of unit $i$ of layer $l$. Convolution operator for windows of size $K$:
$s_{t,i}^l = \sum_{k=0}^{K-1} \sum_j w_{i,j,k}^l\, y_{t-k,j}^{l-1} + b_i^l$
Transfer: $y_{t,i}^l = \tanh(s_{t,i}^l)$
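A direct NumPy sketch of this convolution layer. Indices follow the equation; $t$ starts at $K - 1$ so that the window stays inside the sequence (a boundary convention the slides do not specify).

```python
import numpy as np

def tdnn_conv_layer(y_prev, W, b):
    """y_prev: (T, J) previous layer over time; W: (I, J, K) time-shared weights; b: (I,) biases.
    s[t, i] = sum_{k=0}^{K-1} sum_j W[i, j, k] * y_prev[t - k, j] + b[i];  y = tanh(s)."""
    T, J = y_prev.shape
    I, _, K = W.shape
    s = np.zeros((T, I))
    for t in range(K - 1, T):                      # only where the full window is available
        window = y_prev[t - K + 1:t + 1][::-1]     # (K, J), window[k] = y_prev[t - k]
        s[t] = np.tensordot(W, window.T, axes=([1, 2], [0, 1])) + b
    return s, np.tanh(s)
```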

Convolutions (Graphical View): (figure: the convolution applied along the time axis of the feature representation). Note: the weights $w_{i,j,k}^l$ and biases $b_i^l$ do not depend on time. Hence the number of parameters of such a model is independent of the length of the time series, and each unit $s_{t,i}^l$ represents the value of the same function at each time step.

TDNNs: Subsampling. The convolution functions always work with a fixed-size window ($K$ in our case, which can differ for each unit/layer), but some regularities might exist at different granularities. Hence, second idea: subsampling (in fact, it is more a kind of smoothing operator). Between consecutive convolution layers, let us add a subsampling layer. This subsampling layer provides a way to analyze the time series at a coarser level.

Subsampling: Equations. Let $y_{t,i}^l$ be the output value at time $t$ of unit $i$ of layer $l$ (input values: $y_{t,i}^0 = x_{t,i}$), and let $r$ be the subsampling ratio, often set to values such as 2 to 4. Subsampling operator:
$y_{t,i}^l = \frac{1}{r} \sum_{k=0}^{r-1} y_{rt-k,i}^{l-1}$
Only compute the values $y_{t,i}^l$ such that $(t \bmod r) = 0$. Note: there are no parameters in the subsampling layer (but it is possible to add some, for instance replacing $\frac{1}{r}$ by a parameter and adding a bias term).
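The subsampling operator in the same style, implemented as block averaging over time (the time-origin convention is an illustrative choice).

```python
import numpy as np

def tdnn_subsample(y_prev, r=2):
    """Average-pool over time with ratio r: each output frame is the mean of r consecutive
    input frames, i.e. the same block averaging as the slide's formula, one output per r inputs."""
    T, I = y_prev.shape
    T_out = T // r
    return y_prev[:T_out * r].reshape(T_out, r, I).mean(axis=1)
```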

TDNNs (Graphical View): figure of a TDNN over time and features, alternating convolution and subsampling layers.

Learning in TDNNs: TDNNs can be trained by normal gradient descent techniques. Note that, as for MLPs, each layer is a differentiable function; we just need to compute the local gradients. Convolution layers:
$\frac{\partial C}{\partial w_{i,j,k}^l} = \sum_t \frac{\partial C}{\partial s_{t,i}^l} \frac{\partial s_{t,i}^l}{\partial w_{i,j,k}^l} = \sum_t \frac{\partial C}{\partial s_{t,i}^l}\, y_{t-k,j}^{l-1}$
Subsampling layers:
$\frac{\partial C}{\partial y_{rt-k,i}^{l-1}} = \frac{\partial C}{\partial y_{t,i}^l} \frac{\partial y_{t,i}^l}{\partial y_{rt-k,i}^{l-1}} = \frac{\partial C}{\partial y_{t,i}^l}\, \frac{1}{r}$
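A sketch of the local gradient for the convolution weights, matching the sum over $t$ above; dC_ds is assumed to have been backpropagated from the layers above (through the tanh).

```python
import numpy as np

def tdnn_conv_grad_W(y_prev, dC_ds, K):
    """dC/dW[i, j, k] = sum_t dC/ds[t, i] * y_prev[t - k, j]."""
    T, J = y_prev.shape
    I = dC_ds.shape[1]
    dC_dW = np.zeros((I, J, K))
    for k in range(K):
        for t in range(K - 1, T):
            dC_dW[:, :, k] += np.outer(dC_ds[t], y_prev[t - k])
    return dC_dW
```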

LeNet for Images: TDNNs are useful to handle regularities of time series (1D data). Could we use the same trick for images (2D data)? After all, regularities are often visible in images. It has indeed been proposed as well, under the name LeNet.

LeNet (Graphical View): figure of a LeNet architecture: convolution, subsampling, convolution, then a decision layer.