Learning Deep Architectures for AI. Part II - Vijay Chakilam


Learning Deep Architectures for AI - Yoshua Bengio Part II - Vijay Chakilam

Limitations of the Perceptron. There is no value of W and b such that the model produces the right target for every example. [Figure: a graphical view of the XOR problem - the four points (0,0), (0,1), (1,0), (1,1) in the x1-x2 plane, where no single weight plane can separate the output = 1 cases from the output = 0 cases.]

Learning representations. A hidden layer $h = \varphi(Wx + b)$ followed by a linear output $y = Vh + c$ can solve XOR, e.g. with $W = \begin{pmatrix}1 & 1\\ 1 & 1\end{pmatrix}$, $b = (0, -1)^T$, $V = (1, -2)$ and $c = 0$. [Figure: the network x → h → y, and the four inputs mapped into the hidden space, where they land on (0,0), (1,0) and (2,1) and become linearly separable.]

Backpropagation. For a sigmoid unit $y_j = \mathrm{sigm}(z_j)$ with input $z_j = \sum_i w_{ij} y_i$ and squared error over the output units $E = \frac{1}{2} \sum_{j \in \text{output}} (t_j - y_j)^2$:
$\frac{\partial E}{\partial y_j} = -(t_j - y_j)$
$\frac{\partial E}{\partial z_j} = \frac{dy_j}{dz_j} \frac{\partial E}{\partial y_j} = y_j (1 - y_j) \frac{\partial E}{\partial y_j}$
$\frac{\partial E}{\partial w_{ij}} = \frac{\partial z_j}{\partial w_{ij}} \frac{\partial E}{\partial z_j} = y_i \frac{\partial E}{\partial z_j}$
$\frac{\partial E}{\partial y_i} = \sum_j \frac{dz_j}{dy_i} \frac{\partial E}{\partial z_j} = \sum_j w_{ij} \frac{\partial E}{\partial z_j}$
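
As an illustration of these equations, here is a minimal NumPy sketch of the forward and backward pass for one layer of sigmoid output units with squared error; the variable names (W, y_in, t, etc.) and toy sizes are my own, not from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy setup: 3 input units feeding 2 sigmoid output units.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))      # w_ij: weight from input i to output j
y_in = rng.random(3)             # activities y_i of the layer below
t = np.array([1.0, 0.0])         # targets t_j

# Forward pass: z_j = sum_i w_ij * y_i, y_j = sigm(z_j)
z = y_in @ W
y = sigmoid(z)
E = 0.5 * np.sum((t - y) ** 2)

# Backward pass, following the slide's equations.
dE_dy = -(t - y)                 # dE/dy_j = -(t_j - y_j)
dE_dz = y * (1 - y) * dE_dy      # dE/dz_j = y_j (1 - y_j) dE/dy_j
dE_dW = np.outer(y_in, dE_dz)    # dE/dw_ij = y_i dE/dz_j
dE_dy_in = W @ dE_dz             # dE/dy_i = sum_j w_ij dE/dz_j
```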

Deep vs Shallow. Pascanu et al. (2013) compared deep rectifier networks with their shallow counterparts. For a deep model with $n_0$ inputs and $k$ hidden layers of width $n$, the maximal number of response regions per parameter behaves as $\Omega\!\left(\left(\frac{n}{n_0}\right)^{k-1} n^{n_0 - 2}\right)$. For a shallow model with $n_0$ inputs and $nk$ hidden units, the maximal number of response regions per parameter behaves as $O\!\left(k^{n_0 - 1} n^{n_0 - 1}\right)$.

Deep vs Shallow. Pascanu et al. (2013) showed that the deep model can generate exponentially more regions per parameter in terms of the number of hidden layers $k$, and at least polynomially more regions per parameter in terms of the layer width $n$. Montufar et al. (2014) derived a significantly improved lower bound on the maximal number of linear regions.

Unsupervised Pre-training. Motivations: scarcity of labeled examples and availability of many unlabeled examples; unknown future tasks. Once a good high-level representation is learned, other learning tasks can become much easier. For example, kernel machines can be very powerful when given an appropriate kernel, i.e. an appropriate feature space.

DNN with unsupervised pre-training using RBMs. [Figure: a stack of RBMs trained layer-wise - X-Z1 with weights W1, Z1-Z2 with W2, Z2-Z3 with W3 - whose weights initialize the feedforward network X → Z1 → Z2 → Z3 → Y with weights W1, W2, W3, W4.]

DNN with unsupervised pre-training using Autoencoders. [Figure: a stack of tied-weight autoencoders trained layer-wise - X reconstructed as X̂ through W1, W1ᵀ; Z1 reconstructed as Ẑ1 through W2, W2ᵀ; Z2 reconstructed as Ẑ2 through W3, W3ᵀ - whose encoder weights initialize the feedforward network X → Z1 → Z2 → Z3 → Y with weights W1, W2, W3, W4.]

Restricted Boltzmann Machines

Restricted Boltzmann Machine. Energy function: $E(x, h) = -h^T W x - c^T x - b^T h = -\sum_{j,k} W_{j,k} h_j x_k - \sum_k c_k x_k - \sum_j b_j h_j$. Distribution: $p(x, h) = \exp(-E(x, h)) / Z$. [Figure: a bipartite graph with hidden units h connected to visible units x through weights W.]
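
A minimal NumPy sketch of this energy function and the corresponding unnormalized probability; the function names and the tiny example values are my own, not part of the original slide.

```python
import numpy as np

def rbm_energy(x, h, W, b, c):
    """E(x, h) = -h^T W x - c^T x - b^T h."""
    return -h @ W @ x - c @ x - b @ h

def unnormalized_prob(x, h, W, b, c):
    """exp(-E(x, h)); dividing by Z (sum over all (x, h)) would give p(x, h)."""
    return np.exp(-rbm_energy(x, h, W, b, c))

# Tiny example with 2 hidden and 3 visible units.
W = np.array([[1.0, -0.5, 0.2],
              [0.3,  0.8, -1.0]])   # shape (n_hidden, n_visible)
b = np.zeros(2)                     # hidden biases
c = np.zeros(3)                     # visible biases
x = np.array([1.0, 0.0, 1.0])
h = np.array([1.0, 1.0])
print(rbm_energy(x, h, W, b, c), unnormalized_prob(x, h, W, b, c))
```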

An example of how weights define a distribution. [Figure: two hidden units h1, h2 and two visible units x1, x2, with weights x1-h1 = +2, x2-h2 = +1 and h1-h2 = -1, and no biases.]

x1 x2 | h1 h2 |  -E | e^(-E) | p(x,h) | p(x)
 1  1 |  1  1 |   2 |  7.39  |  .186  |
 1  1 |  1  0 |   2 |  7.39  |  .186  |
 1  1 |  0  1 |   1 |  2.72  |  .069  | 0.466
 1  1 |  0  0 |   0 |  1     |  .025  |
 1  0 |  1  1 |   1 |  2.72  |  .069  |
 1  0 |  1  0 |   2 |  7.39  |  .186  |
 1  0 |  0  1 |   0 |  1     |  .025  | 0.305
 1  0 |  0  0 |   0 |  1     |  .025  |
 0  1 |  1  1 |   0 |  1     |  .025  |
 0  1 |  1  0 |   0 |  1     |  .025  |
 0  1 |  0  1 |   1 |  2.72  |  .069  | 0.144
 0  1 |  0  0 |   0 |  1     |  .025  |
 0  0 |  1  1 |  -1 |  0.37  |  .009  |
 0  0 |  1  0 |   0 |  1     |  .025  | 0.084
 0  0 |  0  1 |   0 |  1     |  .025  |
 0  0 |  0  0 |   0 |  1     |  .025  |
Z = 39.70
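
The table can be reproduced by brute-force enumeration. Here is a sketch assuming the weights read off the figure (x1-h1 = +2, x2-h2 = +1, h1-h2 = -1, no biases); note that, read this way, the h1-h2 connection makes the toy example a general Boltzmann machine rather than a restricted one.

```python
import numpy as np
from itertools import product

# Weights assumed from the figure: x1-h1 = +2, x2-h2 = +1, h1-h2 = -1.
def neg_energy(x1, x2, h1, h2):
    return 2.0 * x1 * h1 + 1.0 * x2 * h2 - 1.0 * h1 * h2

states = list(product([1, 0], repeat=4))            # (x1, x2, h1, h2)
weights = {s: np.exp(neg_energy(*s)) for s in states}
Z = sum(weights.values())                           # about 39.70

for (x1, x2, h1, h2), w in weights.items():
    print(x1, x2, h1, h2, round(w / Z, 3))          # joint p(x, h)

# Marginal p(x): sum over the hidden configurations.
for x1, x2 in product([1, 0], repeat=2):
    px = sum(weights[(x1, x2, h1, h2)] for h1, h2 in product([1, 0], repeat=2)) / Z
    print("p(x1=%d, x2=%d) = %.3f" % (x1, x2, px))
```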

Log-likelihood. To train an RBM, we'd like to minimize the average negative log-likelihood $\frac{1}{T} \sum_t l(f(x^{(t)})) = \frac{1}{T} \sum_t -\log p(x^{(t)})$. Its gradient is
$\frac{\partial\, {-\log p(x^{(t)})}}{\partial \theta} = \mathbb{E}_h\!\left[ \frac{\partial E(x^{(t)}, h)}{\partial \theta} \,\middle|\, x^{(t)} \right] - \mathbb{E}_{x,h}\!\left[ \frac{\partial E(x, h)}{\partial \theta} \right]$

Contrastive Divergence. Idea: replace the expectation over the model distribution by a point estimate at $\tilde{x}$. Obtain the point $\tilde{x}$ by Gibbs sampling, starting the chain at $x^{(t)}$ and alternately sampling $h \sim p(h \mid x)$ and $x \sim p(x \mid h)$: $x^{(t)} \rightarrow x^1 \rightarrow \cdots \rightarrow x^k = \tilde{x}$.

Conditional probabilities. $p(h \mid x) = \prod_j p(h_j \mid x)$ with $p(h_j = 1 \mid x) = \frac{1}{1 + \exp(-(b_j + W_{j \cdot} x))} = \mathrm{sigm}(b_j + W_{j \cdot} x)$, and $p(x \mid h) = \prod_k p(x_k \mid h)$ with $p(x_k = 1 \mid h) = \frac{1}{1 + \exp(-(c_k + h^T W_{\cdot k}))} = \mathrm{sigm}(c_k + h^T W_{\cdot k})$.
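
A NumPy sketch of these conditionals and of one alternating Gibbs step built from them; the function names are my own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_h_given_x(x, W, b):
    """p(h_j = 1 | x) = sigm(b_j + W_{j.} x)."""
    return sigmoid(b + W @ x)

def p_x_given_h(h, W, c):
    """p(x_k = 1 | h) = sigm(c_k + h^T W_{.k})."""
    return sigmoid(c + W.T @ h)

def gibbs_step(x, W, b, c, rng):
    """Sample h ~ p(h | x), then a new x ~ p(x | h)."""
    h = (rng.random(b.shape) < p_h_given_x(x, W, b)).astype(float)
    x_new = (rng.random(c.shape) < p_x_given_h(h, W, c)).astype(float)
    return x_new, h
```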

Conditional probabilities

Contrastive Divergence Parameter Updates. Derivation of $\frac{\partial E(x, h)}{\partial \theta}$ for $\theta = W_{jk}$:
$\frac{\partial E(x, h)}{\partial W_{jk}} = \frac{\partial}{\partial W_{jk}} \left( -\sum_{j,k} W_{jk} h_j x_k - \sum_k c_k x_k - \sum_j b_j h_j \right) = -\frac{\partial}{\partial W_{jk}} \sum_{j,k} W_{jk} h_j x_k = -h_j x_k$
so $\nabla_W E(x, h) = -h x^T$.

Contrastive Divergence Parameter Updates. Derivation of $\mathbb{E}_h\!\left[ \frac{\partial E(x, h)}{\partial \theta} \,\middle|\, x \right]$ for $\theta = W_{jk}$:
$\mathbb{E}_h\!\left[ \frac{\partial E(x, h)}{\partial W_{jk}} \,\middle|\, x \right] = \mathbb{E}_h\!\left[ -h_j x_k \mid x \right] = -\sum_{h_j \in \{0,1\}} h_j x_k\, p(h_j \mid x) = -x_k\, p(h_j = 1 \mid x)$
so $\mathbb{E}_h\!\left[ \nabla_W E(x, h) \mid x \right] = -h(x)\, x^T$, where $h(x)_j = p(h_j = 1 \mid x)$.

Contrastive Divergence Parameter Updates. Given $x^{(t)}$ and $\tilde{x}$, the learning rule for $\theta = W$ becomes:
$W \Leftarrow W - \alpha\, \nabla_W \left( -\log p(x^{(t)}) \right)$
$\Leftarrow W - \alpha \left( \mathbb{E}_h\!\left[ \nabla_W E(x^{(t)}, h) \mid x^{(t)} \right] - \mathbb{E}_{x,h}\!\left[ \nabla_W E(x, h) \right] \right)$
$\approx W - \alpha \left( \mathbb{E}_h\!\left[ \nabla_W E(x^{(t)}, h) \mid x^{(t)} \right] - \mathbb{E}_h\!\left[ \nabla_W E(\tilde{x}, h) \mid \tilde{x} \right] \right)$
$= W + \alpha \left( h(x^{(t)})\, x^{(t)T} - h(\tilde{x})\, \tilde{x}^T \right)$

Contrastive Divergence - Algorithm. For each training example $x^{(t)}$: generate a negative sample $\tilde{x}$ using $k$ steps of Gibbs sampling, starting at $x^{(t)}$, then update the parameters:
$W \Leftarrow W + \alpha \left( h(x^{(t)})\, x^{(t)T} - h(\tilde{x})\, \tilde{x}^T \right)$
$b \Leftarrow b + \alpha \left( h(x^{(t)}) - h(\tilde{x}) \right)$
$c \Leftarrow c + \alpha \left( x^{(t)} - \tilde{x} \right)$
Repeat until the stopping criterion is met.
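
A minimal NumPy sketch of one CD-k update following this algorithm; the function name, learning rate, and shapes are my own choices, not the original code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd_k_update(x_t, W, b, c, k=1, alpha=0.1, rng=None):
    """One contrastive-divergence update for a single training example x_t.

    W has shape (n_hidden, n_visible); b, c are hidden/visible biases.
    """
    rng = np.random.default_rng() if rng is None else rng
    h = lambda x: sigmoid(b + W @ x)          # h(x)_j = p(h_j = 1 | x)

    # Negative sample: k steps of Gibbs sampling, starting at x_t.
    x_neg = x_t
    for _ in range(k):
        h_samp = (rng.random(b.shape) < h(x_neg)).astype(float)
        x_neg = (rng.random(c.shape) < sigmoid(c + W.T @ h_samp)).astype(float)

    # Parameter updates from the slide.
    W += alpha * (np.outer(h(x_t), x_t) - np.outer(h(x_neg), x_neg))
    b += alpha * (h(x_t) - h(x_neg))
    c += alpha * (x_t - x_neg)
    return W, b, c
```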

Autoencoders

Autoencoder. Encoder: $h(x) = g(a(x)) = \mathrm{sigm}(b + W x)$. Decoder: $\hat{x} = o(\hat{a}(x)) = \mathrm{sigm}(c + W^* h(x))$ for binary inputs, with tied weights $W^* = W^T$.

Autoencoder: Loss Function. Use $l(f(x)) = -\log p(x \mid \mu)$ as the loss function. Example: by choosing a Gaussian distribution with mean $\mu$ and identity covariance, $p(x \mid \mu) = \frac{1}{(2\pi)^{D/2}} \exp\!\left( -\frac{1}{2} \sum_k (x_k - \mu_k)^2 \right)$, and choosing $\mu = c + W^* h(x)$, we get the squared error loss $l(f(x)) = \frac{1}{2} \sum_k (\hat{x}_k - x_k)^2$. For binary inputs, the cross-entropy loss: $l(f(x)) = -\sum_k \left( x_k \log \hat{x}_k + (1 - x_k) \log(1 - \hat{x}_k) \right)$.

Autoencoder: Gradient. For both binary and real-valued inputs, the gradient with respect to the decoder pre-activation has a very simple form: $\nabla_{\hat{a}(x^{(t)})}\, l(f(x^{(t)})) = \hat{x}^{(t)} - x^{(t)}$. Parameter gradients are obtained by backpropagating this gradient as in a regular network.
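
Under the assumption of binary inputs and tied weights, here is a NumPy sketch of the forward pass, the cross-entropy loss, and the parameter gradients obtained by backpropagating x̂ − x; the function name and shapes are my own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_grads(x, W, b, c):
    """Loss and gradients for one binary example x, tied weights W* = W^T."""
    hx = sigmoid(b + W @ x)                 # encoder h(x)
    x_hat = sigmoid(c + W.T @ hx)           # decoder reconstruction
    loss = -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))

    grad_a_hat = x_hat - x                  # gradient at decoder pre-activation
    grad_c = grad_a_hat
    grad_hx = W @ grad_a_hat                # backprop into h(x)
    grad_a = grad_hx * hx * (1 - hx)        # through the encoder sigmoid
    grad_b = grad_a
    # Tied weights: W appears in both the encoder and the decoder.
    grad_W = np.outer(grad_a, x) + np.outer(hx, grad_a_hat)
    return loss, grad_W, grad_b, grad_c
```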

Denoising Autoencoder. The representation should be robust to the introduction of noise: e.g. randomly set a subset of the inputs to 0 (each with probability v), or add Gaussian noise. The reconstruction x̂ is computed from the corrupted input x̃, but the loss function compares the reconstruction with the original input x. [Figure: x is corrupted by the noise process p(x̃ | x), encoded to h(x̃) through W and bias b, and decoded to x̂ through tied weights W* = Wᵀ and bias c.]
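
A small sketch of the two corruption processes mentioned above (masking noise and additive Gaussian noise); the function names and the noise parameters are placeholders of my own.

```python
import numpy as np

def corrupt_masking(x, v, rng):
    """Randomly set each input to 0 with probability v."""
    mask = rng.random(x.shape) >= v
    return x * mask

def corrupt_gaussian(x, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return x + rng.normal(scale=sigma, size=x.shape)

# The denoising autoencoder encodes the corrupted input x_tilde,
# but the loss compares the reconstruction against the clean x.
rng = np.random.default_rng(0)
x = rng.random(784)
x_tilde = corrupt_masking(x, v=0.25, rng=rng)
```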

Dataset. MNIST ("Modified National Institute of Standards and Technology") is the de facto "hello world" dataset of computer vision. It contains gray-scale images of hand-drawn digits, from zero through nine. Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255, inclusive.

Softmax. The output units in a softmax group use a non-local non-linearity: $y_i = \frac{e^{z_i}}{\sum_{j \in \text{group}} e^{z_j}}$, with derivative $\frac{\partial y_i}{\partial z_i} = y_i (1 - y_i)$.
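
A numerically stable NumPy sketch of the softmax and of the diagonal of its Jacobian as written above; the shift by the maximum is a standard trick, not something from the slide.

```python
import numpy as np

def softmax(z):
    """y_i = exp(z_i) / sum_j exp(z_j), shifted by max(z) for numerical stability."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])
y = softmax(z)
dy_dz_diag = y * (1 - y)   # dy_i/dz_i = y_i (1 - y_i)
```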

Theano. Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Features: tight integration with NumPy; transparent use of a GPU, to perform data-intensive computations much faster than on a CPU; efficient symbolic differentiation, so Theano does your derivatives for functions with one or many inputs; speed and stability optimizations.

Theano Installation and more information: http://deeplearning.net/software/theano/ Latest reference: Theano: A Python framework for fast computation of mathematical expressions

Theano.Tensor. A Theano module for symbolic NumPy types and operations. There are many types of symbolic expressions; basic tensor functionality includes creation of tensor variables such as scalar, vector, matrix, etc.

Theano Functions. Create algebraic expressions on symbolic variables, as in the sketch below.
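
The slide originally showed a code screenshot; here is a minimal Theano sketch of the same idea - declaring symbolic variables, building an algebraic expression, and compiling it into a callable function. The particular expression is my own example.

```python
import theano
import theano.tensor as T

# Symbolic variables (no values yet).
x = T.scalar('x')
v = T.vector('v')
A = T.matrix('A')

# An algebraic expression on the symbolic variables.
expr = x * T.dot(A, v) + v ** 2

# Compile the expression into a callable function.
f = theano.function(inputs=[x, A, v], outputs=expr)
print(f(2.0, [[1.0, 0.0], [0.0, 1.0]], [3.0, 4.0]))   # evaluates to [15., 24.]
```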

Create a shared variable so we can do gradient descent. Define a cost function that has a minimum value. Compute the gradient (see the sketch below).
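
Again a hedged sketch of what such a slide typically shows: a shared variable w, a convex cost with a known minimum, T.grad, and a compiled update step. The cost (w - 3)^2 and the learning rate are my own choices, not the original code.

```python
import numpy as np
import theano
import theano.tensor as T

# A shared variable holds state that persists across function calls,
# which is what gradient descent updates.
w = theano.shared(np.float64(20.0), name='w')

# A cost function with a known minimum (at w = 3).
cost = (w - 3.0) ** 2

# Theano computes the symbolic gradient for us.
grad_w = T.grad(cost, w)

# Each call performs one gradient-descent step on w.
train = theano.function(inputs=[], outputs=cost,
                        updates=[(w, w - 0.3 * grad_w)])

for _ in range(50):
    train()
print(w.get_value())   # close to 3.0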

Get data and initialize network

Define Theano variables and expressions

Define training expressions and functions

Loop through the data in batches
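
The four preceding slides originally showed code screenshots (get data and initialize the network, define Theano variables and expressions, define the training function, loop over batches). Below is a compressed, hedged sketch of that pipeline for a one-hidden-layer MLP; the random placeholder data, layer sizes, and hyperparameters are my own, not the original code (real MNIST arrays would be substituted for the placeholders).

```python
import numpy as np
import theano
import theano.tensor as T

def init(shape, rng=np.random.default_rng(0)):
    return theano.shared(rng.normal(scale=0.01, size=shape).astype(theano.config.floatX))

# Get data and initialize the network (placeholder data with MNIST's shapes).
rng = np.random.default_rng(1)
Xtrain = rng.random((1000, 784)).astype(theano.config.floatX)
Ytrain = rng.integers(0, 10, size=1000).astype(np.int32)
W1, b1 = init((784, 300)), init((300,))
W2, b2 = init((300, 10)), init((10,))

# Define Theano variables and expressions.
X = T.matrix('X')
Y = T.ivector('Y')
H = T.nnet.sigmoid(X.dot(W1) + b1)
P = T.nnet.softmax(H.dot(W2) + b2)
cost = -T.mean(T.log(P[T.arange(Y.shape[0]), Y]))   # negative log-likelihood

# Define training expressions and functions.
params = [W1, b1, W2, b2]
grads = T.grad(cost, params)
lr = 0.1
train = theano.function(inputs=[X, Y], outputs=cost,
                        updates=[(p, p - lr * g) for p, g in zip(params, grads)])

# Loop through the data in batches.
batch_size = 128
for epoch in range(10):
    for i in range(0, len(Xtrain), batch_size):
        c = train(Xtrain[i:i + batch_size], Ytrain[i:i + batch_size])
```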

DNN without unsupervised pre-training

DNN with unsupervised pre-training using RBMs. [Figure: a stack of RBMs trained layer-wise - X-Z1 with weights W1, Z1-Z2 with W2, Z2-Z3 with W3 - whose weights initialize the feedforward network X → Z1 → Z2 → Z3 → Y with weights W1, W2, W3, W4.]

DNN unsupervised pre-training with RBMs
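
This slide also showed a code screenshot. Below is a hedged NumPy-only sketch of the greedy layer-wise idea: train an RBM with CD-1 on the current representation, propagate p(h|x) to form the next layer's input, repeat, and keep the learned weights to initialize the feedforward network. Layer sizes, epochs, learning rate, and the random placeholder data are my own choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(X, n_hidden, epochs=5, alpha=0.01, rng=None):
    """Train one RBM with CD-1 on data X of shape (N, n_visible)."""
    rng = np.random.default_rng(0) if rng is None else rng
    n_visible = X.shape[1]
    W = rng.normal(scale=0.01, size=(n_hidden, n_visible))
    b = np.zeros(n_hidden)   # hidden biases
    c = np.zeros(n_visible)  # visible biases
    for _ in range(epochs):
        for x in X:
            ph = sigmoid(b + W @ x)
            h = (rng.random(n_hidden) < ph).astype(float)
            x_neg = (rng.random(n_visible) < sigmoid(c + W.T @ h)).astype(float)
            ph_neg = sigmoid(b + W @ x_neg)
            W += alpha * (np.outer(ph, x) - np.outer(ph_neg, x_neg))
            b += alpha * (ph - ph_neg)
            c += alpha * (x - x_neg)
    return W, b, c

# Greedy layer-wise pre-training: X -> Z1 -> Z2 -> Z3 (placeholder binary data).
X = (np.random.default_rng(1).random((200, 784)) > 0.5).astype(float)
layer_sizes = [500, 300, 100]
weights, data = [], X
for n_hidden in layer_sizes:
    W, b, c = train_rbm(data, n_hidden)
    weights.append((W, b))
    data = sigmoid(b + data @ W.T)   # propagate p(h|x) to the next layer
# `weights` now initializes W1, W2, W3 of the supervised network (W4 is trained from scratch).
```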

DNN with unsupervised pre-training using Autoencoders. [Figure: a stack of tied-weight autoencoders trained layer-wise - X reconstructed as X̂ through W1, W1ᵀ; Z1 reconstructed as Ẑ1 through W2, W2ᵀ; Z2 reconstructed as Ẑ2 through W3, W3ᵀ - whose encoder weights initialize the feedforward network X → Z1 → Z2 → Z3 → Y with weights W1, W2, W3, W4.]

DNN unsupervised pre-training with AutoEncoders

Unsupervised Pre-training. Layer-wise unsupervised learning: the author believes that greedy layer-wise unsupervised pre-training overcomes the challenges of deep learning by introducing a useful prior into the supervised fine-tuning procedure. Gradients of a criterion defined at the output layer can become less useful as they are propagated backwards to lower layers. The author states it is reasonable to believe that an unsupervised learning criterion defined at the level of a single layer could be used to move its parameters in a favorable direction.

Unsupervised Pre-training. The author claims unsupervised pre-training acts as a regularizer by establishing an initialization point of the fine-tuning procedure inside a region of parameter space in which the parameters are henceforth restricted. Another motivation is that greedy layer-wise unsupervised pre-training naturally decomposes the problem into subproblems associated with different levels of abstraction, extracting salient information about the input distribution and capturing it in a distributed representation.

References
Yoshua Bengio, Learning Deep Architectures for AI: http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf
Hugo Larochelle, Neural networks: http://info.usherbrooke.ca/hlarochelle/neural_networks/content.html
Geoff Hinton, Neural networks for machine learning: https://www.coursera.org/learn/neural-networks
Goodfellow et al., Deep Learning: http://www.deeplearningbook.org/
Montufar et al., On the number of linear regions of deep neural networks: https://arxiv.org/pdf/1402.1869.pdf

References
Pascanu et al., On the number of response regions of deep feed forward neural networks with piece-wise linear activations: https://arxiv.org/pdf/1312.6098.pdf
Perceptron weight update graphics: https://www.cs.utah.edu/~piyush/teaching/8-9-print.pdf
Proof of Theorem (Block, 1962, and Novikoff, 1962): http://cs229.stanford.edu/notes/cs229-notes6.pdf
Python and Theano code reference: https://deeplearningcourses.com/

References
Erhan et al., Why does unsupervised pre-training help deep learning: http://www.jmlr.org/papers/volume11/erhan10a/erhan10a.pdf
Theano installation and more information: http://deeplearning.net/software/theano/
Theano latest reference: Theano: A Python framework for fast computation of mathematical expressions