The XOR problem. Machine learning for vision, Fall 2013. Roland Memisevic.

The XOR problem

Machine learning for vision, Fall 2013. Lecture 9, February 25, 2015.

(Figure: the XOR problem shown in the (x_1, x_2) plane; picture adapted from Bishop, 2006.)

It's the features, stupid

The common vision pipeline (prior to 2012):
1. Find interest points.
2. Crop patches around them.
3. Represent each patch with a sparse local descriptor.
4. Combine the descriptors into a representation for the image.

What do good features look like?

(Figure: patch descriptors f_1^1, ..., f_M^1 through f_1^n, ..., f_M^n extracted from an image; two classes, "cathedral" and "high-rise", separated in a 2-D feature space with axes f_1 and f_2.)

This creates a representation that even a linear classifier can deal with. There are many incarnations: SIFT, HOG, GIST, SURF, ... Common to all is that they summarize local oriented structure, followed by several stages of non-linear processing.

Neural networks

Neural networks can in principle perform the same computation if the weights are set appropriately. Finding the right weights is slow in principle, but trivially parallelizable. (Picture adapted from Bishop, 2006.)

Back-prop

It is easy to compute derivatives if the network is composed of modules that provide the following three functions:
A function fprop() to compute outputs, given inputs and parameters.
A function bprop() to compute derivatives of some function wrt. its inputs, given the derivatives of that function wrt. its outputs.
A function grad() to compute derivatives of some function wrt. its parameters, given inputs and the derivatives of that function wrt. its outputs.
This is based on the chain rule of differentiation, but it is better than using the chain rule on paper, because the computation of derivatives can now be automated.

(Figure: a stack of modules h(x; w_h), g(h; w_g), f(g; w_f) with their fprop, bprop, and grad connections.)

A neural network with a single hidden layer

(This and most of the following images are from Bishop, 2006.)

A feed-forward neural net (AKA backprop net) computes its output layer by layer:

y_k(x) = \sum_{j=0}^{M} w^{(2)}_{kj} \, h\Big( \sum_{i=0}^{D} w^{(1)}_{ji} x_i \Big)
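For concreteness, here is a minimal NumPy sketch of such a module interface: a linear module and a sigmoid module chained into a small network. The class names, sizes, and the squared-error cost at the end are illustrative, not taken from the slides.

    import numpy as np

    class Linear:
        """out = W x; the matrix W is the module's parameter array."""
        def __init__(self, n_in, n_out):
            self.W = 0.01 * np.random.randn(n_out, n_in)
        def fprop(self, x):
            # Compute outputs, given inputs and parameters.
            self.x = x
            return self.W @ x
        def bprop(self, dout):
            # Derivative of the cost wrt. the inputs, given its derivative wrt. the outputs.
            return self.W.T @ dout
        def grad(self, dout):
            # Derivative of the cost wrt. the parameters, given inputs and output derivatives.
            return np.outer(dout, self.x)

    class Sigmoid:
        """Elementwise logistic sigmoid; has no parameters."""
        def fprop(self, a):
            self.z = 1.0 / (1.0 + np.exp(-a))
            return self.z
        def bprop(self, dout):
            return dout * self.z * (1.0 - self.z)   # h'(a) = h(a)(1 - h(a))
        def grad(self, dout):
            return None

    # Chain the modules: fprop left to right, then bprop/grad right to left.
    layers = [Linear(5, 3), Sigmoid(), Linear(3, 1)]
    x, t = np.random.randn(5), np.ones(1)
    y = x
    for layer in layers:
        y = layer.fprop(y)
    d = y - t                          # derivative of 0.5*||y - t||^2 wrt. the outputs
    grads = []
    for layer in reversed(layers):
        grads.append(layer.grad(d))
        d = layer.bprop(d)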

A neural network with a single hidden layer

With explicit bias terms:

y_k(x) = \sum_{j=1}^{M} w^{(2)}_{kj} \, h\Big( \sum_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0} \Big) + w^{(2)}_{k0}

Output activation functions

The last layer determines the functionality of the network. For example:
linear outputs + squared error loss = non-linear regression
softmax outputs + log-loss = non-linear logistic regression

Backpropagation in detail

Define the training cost as the sum over per-example costs:

E(w) = \sum_{n=1}^{N} E_n(w)

Now we can compute the derivatives \partial E_n / \partial w individually for each training case and add them up afterwards.

Definitions:
Let a_j = \sum_i w_{ji} z_i be the net input of unit j.
Let z_i be the output of unit i, in other words z_i = h(a_i).
We need derivatives of the cost E_n with respect to each weight w_{ji} connecting nodes i and j in the network.
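As a small illustration of these two pairings (the function names and the example numbers are ours, not from the slides):

    import numpy as np

    def squared_error(y, t):
        # Linear outputs + squared error loss: non-linear regression.
        return 0.5 * np.sum((y - t) ** 2)

    def softmax(a):
        e = np.exp(a - a.max())          # shift for numerical stability
        return e / e.sum()

    def log_loss(a, k):
        # Softmax outputs + log-loss: non-linear logistic regression (k = correct class).
        return -np.log(softmax(a)[k])

    a = np.array([1.0, -0.5, 0.3])       # net inputs of the output layer
    print(squared_error(a, np.array([0.0, 1.0, 0.0])))
    print(log_loss(a, 1))

In both cases the derivative of the loss with respect to the output-layer net inputs takes the simple form y - t, which is what the backward pass described next starts from.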

Backpropagation in detail

By the chain rule of differentiation, we have

\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}}

The second factor is easy:

\frac{\partial a_j}{\partial w_{ji}} = z_i

And to compute it for all i: just run the network!

Apply the chain rule once more to get

\frac{\partial E_n}{\partial a_j} = \sum_k \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j}

where the sum is over those units k connected to j. Intuitively, this reflects the fact that if we wiggle a_j this will affect the cost function through all the a_k. With \partial a_k / \partial a_j = w_{kj} h'(a_j) this simplifies to:

\frac{\partial E_n}{\partial a_j} = h'(a_j) \sum_k w_{kj} \frac{\partial E_n}{\partial a_k}

Thus, we can use a recursion to compute all \partial E_n / \partial a_j starting at the outputs. For squared error (regression), we have

E_n = \frac{1}{2} \| y^{(n)}(x, w) - t^{(n)} \|^2, \qquad \frac{\partial E_n}{\partial a_k} = y^{(n)}_k - t^{(n)}_k

since y^{(n)}_k = a^{(n)}_k in the case of regression. Same for classification if we define the log-probability as a softmax and use the negative log-probability as the loss (exercise).

If h is the logistic sigmoid, we have:

h'(a_j) = h(a_j)(1 - h(a_j))

So in this case we may use the activations themselves to compute the derivatives. But any activation function that is differentiable almost everywhere will work.
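The recursion can be written out directly for the single-hidden-layer network above. The NumPy sketch below mirrors the equations line by line; the sizes and data are arbitrary.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    # Single hidden layer, sigmoid hiddens, linear outputs. Shapes: W1 (M, D), W2 (K, M).
    D, M, K = 4, 3, 2
    rng = np.random.default_rng(0)
    W1 = 0.1 * rng.standard_normal((M, D))
    W2 = 0.1 * rng.standard_normal((K, M))
    x, t = rng.standard_normal(D), rng.standard_normal(K)

    # Forward pass ("just run the network"): net inputs a, unit outputs z.
    a1 = W1 @ x
    z1 = sigmoid(a1)
    y = W2 @ z1                                 # linear outputs: y_k = a_k

    # Backward pass: the recursion starting at the outputs.
    delta2 = y - t                              # dE_n/da_k = y_k - t_k (squared error)
    delta1 = z1 * (1 - z1) * (W2.T @ delta2)    # dE_n/da_j = h'(a_j) * sum_k w_kj delta_k

    # Derivatives wrt. the weights: dE_n/dw_ji = delta_j * z_i.
    dW2 = np.outer(delta2, z1)
    dW1 = np.outer(delta1, x)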

Backpropagation in detail

Backpropagation summary (Bishop, page 244):
1. Given input x_n, propagate forward to compute the activations of all hiddens z_i and the outputs.
2. Evaluate E_n and \partial E_n / \partial a_k for all output units.
3. Compute \partial E_n / \partial a_j recursively for each hidden unit.
4. Compute the derivatives for each w_ji by multiplying the appropriate \partial E_n / \partial a_j and z_i terms.

Learning arbitrary non-linear functions

A network with a single hidden layer can model any non-linear function under fairly mild conditions to arbitrary accuracy (e.g. Funahashi, 1989). Unfortunately, the proof relies on using an exponentially large number of hidden units, so the practical relevance of this result is very limited. In practice, networks with many layers have proven to be much more useful.

Implementing backprop

There are several software packages that implement backprop. theano (http://deeplearning.net/software/theano) takes the idea to the extreme by using symbolic differentiation, so you don't even need to implement bprop and grad yourself:

    import theano
    import theano.tensor as T
    from numpy.random import randn

    x = T.dmatrix("x")
    w = T.dmatrix("w")
    somefunction = T.dot(w, x).sum()

    # Compile the symbolic expression into a callable Python function.
    python_function = theano.function([x, w], somefunction)
    python_function(randn(100, 10), randn(10, 100))

    # Symbolic differentiation: no hand-written bprop or grad needed.
    derivative = T.grad(somefunction, w)
    gradient_function = theano.function([x, w], derivative)

Weight sharing

A common approach to reducing the number of model parameters is weight sharing: force different parts of the network to use the same parameters. It is trivial to implement weight sharing using backprop: just let your modules make use of the same parameter array. Derivatives for these parameters will simply accumulate.

Convolutional networks

Convolutional networks ("conv nets") are probably the most common application of weight sharing. They are neural networks designed specifically for visual tasks (though there are examples of conv nets used in other domains). Since structure in images is local and invariant, they use local receptive fields with weight sharing. This defines a convolution with (flipped) filters, which are learned discriminatively using back-prop.

Alternating sub-sampling layers are commonly used to get invariance to small shifts and to reduce the spatial extent of the representation towards the higher layers (LeCun et al., 1989).

Convolutional networks were inspired by Hubel & Wiesel's complex/simple cell results. There are various related models (but without back-prop learning): the Neocognitron (Fukushima, 1980), HMAX (Riesenhuber & Poggio, 1999). A standard reference for conv nets is "Gradient-based learning applied to document recognition", Y. LeCun et al., 1998. (Interestingly, that paper introduced another concept now heavily used in vision: conditional random fields.)
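To make the "derivatives simply accumulate" point concrete, here is a minimal 1-D sketch of a shared filter and its gradient. It is illustrative only; real conv nets use 2-D filters, multiple feature maps, and sub-sampling.

    import numpy as np

    def conv_fprop(x, w):
        # Slide the shared filter w over the signal x (a "valid" sliding dot product;
        # equivalently, a convolution with the flipped filter).
        n, k = len(x), len(w)
        return np.array([np.dot(w, x[i:i + k]) for i in range(n - k + 1)])

    def conv_grad(x, dout, k):
        # Because the same w is used at every position, its gradient accumulates
        # one contribution per output position.
        dw = np.zeros(k)
        for i, d in enumerate(dout):
            dw += d * x[i:i + k]
        return dw

    x = np.random.randn(10)
    w = np.random.randn(3)
    y = conv_fprop(x, w)                     # feature map of length 8
    dw = conv_grad(x, np.ones_like(y), len(w))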

Autoencoders

(Figure: an autoencoder with inputs x_j, hiddens s_k computed with encoder weights W_{jk}, and reconstructions \hat{x}_j computed with decoder weights A_{kj}.)

Autoencoders are simple neural networks that are trained to reconstruct their input:

cost = \| r(x) - x \|^2

The hidden layer is a bottleneck that forces the model to compress its input. Linear autoencoders implement a variation of PCA (Baldi, Hornik; 1989).

Overcomplete autoencoders

With overcomplete hiddens, the model can cheat and learn the identity. One solution: corrupt the inputs during training, but train the model to reconstruct the original, uncorrupted inputs (Vincent et al., 2008):

cost = \| r(x + noise) - x \|^2

Sparse autoencoders

As discussed in the context of ICA, another way to define an overcomplete autoencoder is by forcing hiddens to be sparse. This will also let the hidden layer act like a bottleneck. For example, to train a linear autoencoder with an L_1 sparsity term:

\min_W \; \sum_t \| W W^T z_t - z_t \|^2 + \lambda \sum_t \sum_i | w_i^T z_t |

K-means clustering can be viewed as a (very) sparse autoencoder, too. Other sparse autoencoders include contractive autoencoders, shrinking autoencoders, and others.

Factorial representations

In K-means clustering, each hidden unit represents a convex blob in the data-space. Thus, hidden units cannot collaborate to define regions in input space. A sparse autoencoder may be viewed as a way to allow for some collaboration between hiddens. The number of blobs that a set of hiddens can represent thus becomes, in principle, exponential in the number of hiddens involved. Codes that can collaborate are commonly called factorial.
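A minimal NumPy sketch of one denoising-autoencoder update with tied weights (W = A^T), sigmoid hiddens and linear reconstruction. The sizes, noise level and learning rate are arbitrary, not from the slides.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    rng = np.random.default_rng(0)
    n_vis, n_hid = 20, 50                        # overcomplete: more hiddens than inputs
    W = 0.01 * rng.standard_normal((n_hid, n_vis))
    x = rng.standard_normal(n_vis)

    x_noisy = x + 0.1 * rng.standard_normal(n_vis)   # corrupt the input ...
    s = sigmoid(W @ x_noisy)                         # encoder
    r = W.T @ s                                      # decoder (tied weights)
    cost = np.sum((r - x) ** 2)                      # ... but reconstruct the clean x

    # Gradient wrt. the shared W: decoder path and encoder path accumulate.
    d_r = 2.0 * (r - x)
    d_s = (W @ d_r) * s * (1.0 - s)
    dW = np.outer(s, d_r) + np.outer(d_s, x_noisy)
    W -= 0.01 * dW                                   # one gradient step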

Relationship between encoder and decoder weights

Take an autoencoder with tied weights (W = A^T). If the model is defined on whitened data, the decoder weights (in terms of the original data) will be smoothed encoder weights (also in terms of the original data).

To see this, let C = U L U^T be the eigendecomposition of the data covariance, so that z(x) = L^{-1/2} U^T x is the whitened data. Write the encoder weights in terms of the original, unwhitened data as

s(x) = W z(x) = W L^{-1/2} U^T x =: \tilde{W} x

and the decoder weights in terms of the original, unwhitened data as

x(s) = U L^{1/2} W^T s =: \bar{A} s

Now multiply the encoder weights by the data covariance matrix:

C \tilde{W}^T = C (W L^{-1/2} U^T)^T = C U L^{-1/2} W^T = U L U^T U L^{-1/2} W^T = U L^{1/2} W^T = \bar{A}

Autoencoders as dynamical systems

An autoencoder maps points x \in R^n to reconstructions r(x) \in R^n. This defines a dynamical system. We can plot r(x) - x as a vector field. (Seung, 1998; Alain, Bengio, 2013)
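The identity can be checked numerically. In the sketch below (the variable names W_tilde and A_bar correspond to the quantities above and are ours), C is estimated from random correlated data and W is an arbitrary encoder matrix defined on the whitened data.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 5)) @ rng.standard_normal((5, 5))  # correlated data
    C = np.cov(X, rowvar=False, bias=True)                            # data covariance
    L, U = np.linalg.eigh(C)                                          # C = U diag(L) U^T

    W = rng.standard_normal((3, 5))                # encoder weights on whitened data
    W_tilde = W @ np.diag(L ** -0.5) @ U.T         # encoder in terms of original data
    A_bar = U @ np.diag(L ** 0.5) @ W.T            # decoder in terms of original data

    print(np.allclose(C @ W_tilde.T, A_bar))       # True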

Autoencoders as dynamical systems

The dynamical systems perspective provides another explanation for why denoising and contractive training works: training can be viewed as making training data points attractive fixed points of the network dynamics. (Seung, 1998; Swersky et al., 2011; Vincent, 2011; Alain, Bengio, 2013; Kamyshanska, Memisevic, 2013)

Computing the energy of an autoencoder

Some dynamical systems can be defined as the derivative of a scalar function (AKA scalar field or "potential energy") E(x). If such an energy exists, extrema of the scalar field will be fixed points of the dynamical system, and running the dynamical system will amount to performing gradient descent in the energy. A sufficient condition for the energy function to exist is that the weights are tied (Clairaut's theorem), in other words W = A^T.

We can compute the energy of these autoencoders by integration (Kamyshanska, Memisevic; 2013):

E(x) = \int (r(x) - x) \, dx = \sum_k \int h(s_k) \, ds_k - \frac{1}{2} \| x - b_r \|^2 + const

where b_r is the vector of (visible) bias terms. Thus, computing the energy boils down to the following recipe:
1. Replace the hidden activation function by its anti-derivative (e.g., softplus for sigmoid, half-square for rectifier, etc.).
2. Sum up these new activations.
3. Subtract \frac{1}{2} \| x - b_r \|^2.
For binary-output models, the last term turns into b_r^T x. For sigmoid hiddens, this recipe comes down to computing exactly the RBM free energy! Potential energies are additive in the hiddens, so autoencoders are ICA-like.
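Following this recipe for sigmoid hiddens gives a very short energy computation. The NumPy sketch below (names and sizes are ours) uses softplus as the anti-derivative of the sigmoid.

    import numpy as np

    def softplus(a):
        return np.logaddexp(0.0, a)        # numerically stable log(1 + exp(a))

    def energy_sigmoid(x, W, b_h, b_r):
        s = W @ x + b_h                    # net inputs of the hiddens
        # Sum the anti-derivatives, subtract 0.5*||x - b_r||^2 (up to a constant);
        # cf. the binary-Gaussian RBM free energy mentioned in the slides.
        return np.sum(softplus(s)) - 0.5 * np.sum((x - b_r) ** 2)

    rng = np.random.default_rng(0)
    W = 0.1 * rng.standard_normal((8, 4))
    b_h, b_r = np.zeros(8), np.zeros(4)
    x = rng.standard_normal(4)
    print(energy_sigmoid(x, W, b_h, b_r))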

Examples

Sigmoid activation \sigma(x) = (1 + \exp(-x))^{-1}:

E_{sigmoid}(x) = \sum_k softplus(s_k) - \frac{1}{2} \| x - b_r \|^2 + const

same as the free energy of a binary-Gaussian RBM.

Sigmoid activation (binary inputs):

E_{sigmoid}(x) = \sum_k softplus(s_k) + b_r^T x + const

same as the free energy of a binary-binary RBM.

Hyperbolic tangent activation tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}:

E_{tanh}(x) = \sum_k \log(\cosh(s_k)) - \frac{1}{2} \| x - b_r \|^2 + const

Linear activation h(s) = s:

E_{linear}(x) = \frac{1}{2} (W x + b_h)^T (W x + b_h) - \frac{1}{2} \| x - b_r \|^2 + const

the norm of the latent representation, same as a PCA generative classifier.

Rectifier hiddens h(x) = 0 if x < 0, x otherwise:

E_{relu}(x) = \sum_k \frac{sign(s_k) + 1}{2} \, \frac{s_k^2}{2} - \frac{1}{2} \| x - b_r \|^2 + const