Large-Scale Feature Learning with Spike-and-Slab Sparse Coding

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding
Ian J. Goodfellow, Aaron Courville, Yoshua Bengio. ICML 2012.
Presented by Xin Yuan, January 17, 2013.

Outline
1. Contributions
2. The Spike-and-Slab Sparse Coding (S3C) Model
3. Algorithm: Variational EM for S3C
4. Results

Contributions
Introduce a new feature learning and extraction procedure based on a factor model, Spike-and-Slab Sparse Coding (S3C), and overcome two major scaling challenges:
1) Scale inference in the spike-and-slab coding model to the large problem sizes required for object recognition.
2) Use the enhanced regularization properties of spike-and-slab sparse coding to scale object recognition techniques to large numbers of classes and small amounts of labeled data.

The Spike-and-Slab Sparse Coding Model
The model consists of:
1. Latent binary spike variables h \in \{0, 1\}^N
2. Latent real-valued slab variables s \in \mathbb{R}^N
3. A real-valued D-dimensional visible vector v \in \mathbb{R}^D
generated according to
p(h_i = 1) = \sigma(b_i)
p(s_i \mid h_i) = \mathcal{N}(s_i \mid h_i \mu_i, \alpha_{ii}^{-1})
p(v \mid s, h) = \mathcal{N}(v \mid W(h \circ s), \beta^{-1})
where \sigma is the logistic sigmoid function, b is a set of biases on the spike variables, \mu governs the linear dependence of s on h, h \circ s is the element-wise product of h and s, and W governs the linear dependence of v on s. The columns of W have unit norm, and \alpha and \beta are diagonal precision matrices. The state of a hidden unit is best understood as h_i s_i.
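
As a concrete illustration of this generative process, here is a minimal NumPy sketch of ancestral sampling from S3C (not the authors' code; the dimensions, parameter values, and the function name sample_s3c are illustrative assumptions):

import numpy as np

def sample_s3c(W, b, mu, alpha, beta, rng):
    # W: (D, N) dictionary with unit-norm columns
    # b: (N,) spike biases; mu: (N,) slab means
    # alpha: (N,) and beta: (D,) diagonal precisions
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    h = rng.binomial(1, sigmoid(b))                  # p(h_i = 1) = sigma(b_i)
    s = rng.normal(h * mu, 1.0 / np.sqrt(alpha))     # p(s_i | h_i) = N(h_i mu_i, 1/alpha_ii)
    v = rng.normal(W @ (h * s), 1.0 / np.sqrt(beta)) # p(v | s, h) = N(W(h o s), beta^-1)
    return h, s, v

rng = np.random.default_rng(0)
D, N = 8, 16
W = rng.normal(size=(D, N))
W /= np.linalg.norm(W, axis=0)                       # enforce unit-norm columns
h, s, v = sample_s3c(W, b=np.full(N, -2.0), mu=np.ones(N),
                     alpha=np.ones(N), beta=np.full(D, 10.0), rng=rng)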

Comparison to Sparse Coding
1. One can derive the S3C model from sparse coding by replacing the factorial Cauchy or Laplace prior with a spike-and-slab prior.
2. One drawback of sparse coding is that the latent variables are not merely encouraged to be sparse; they are encouraged to remain close to 0 even when they are active. S3C avoids this by separating the spike variables h (whether a unit is active) from the slab variables s (its value when active), as sketched below.
3. Another drawback of sparse coding is that the factors are not actually sparse in the generative distribution.
4. Sparse coding is also difficult to integrate into a deep generative model of data such as natural images.
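
To make point 2 concrete, here is a standard-notation sketch (not taken verbatim from the slides) of the two priors. With a Laplace prior p(s_i) \propto \exp(-\lambda |s_i|), MAP inference in sparse coding solves

\min_s \tfrac{1}{2} \| v - W s \|_2^2 + \lambda \| s \|_1,

so every active coefficient is still pulled toward 0 by the \ell_1 penalty. Under the spike-and-slab prior, p(h_i = 1) = \sigma(b_i) and p(s_i \mid h_i) = \mathcal{N}(s_i \mid h_i \mu_i, \alpha_{ii}^{-1}): sparsity is controlled by the binary spike h_i, while the value of an active unit, h_i s_i, is only subject to the Gaussian slab prior rather than being shrunk toward 0.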

Comparison to Restricted Boltzmann Machines
The spike-and-slab RBM (ssRBM) is an energy-based model: it defines a joint distribution over (v, s, h) through an energy function and a global partition function. S3C can be viewed as a variant of the ssRBM in which the undirected model is replaced by a directed graphical model. Effects of this change: 1. Partition function, 2. Posterior, 3. Prior (each discussed below, after a brief review of RBMs).

Review of RBM (1)
A restricted Boltzmann machine (RBM) is a generative stochastic neural network that can learn a probability distribution over its set of inputs. RBMs are a variant of Boltzmann machines with the restriction that their neurons form a bipartite graph: there are visible units, corresponding to features of the input, and hidden units that are trained, and every connection links a visible unit to a hidden unit (there are no visible-visible or hidden-hidden connections). RBMs have found applications in dimensionality reduction, classification, collaborative filtering, and topic modeling. They can be trained in either supervised or unsupervised ways, depending on the task.
[Figure: a restricted Boltzmann machine with three visible units and four hidden units (no bias units).]

Review of RBM (2)
A joint configuration (v, h) of the visible and hidden units has an energy given by
E(v, h) = -\sum_{i \in \text{visible}} a_i v_i - \sum_{j \in \text{hidden}} b_j h_j - \sum_{i, j} v_i h_j w_{ij}.
The network assigns a probability to every possible pair of a visible and a hidden vector via this energy function:
p(v, h) = \frac{1}{Z} e^{-E(v, h)}, \quad Z = \sum_{v, h} e^{-E(v, h)}.
The probability that the network assigns to a visible vector v is obtained by summing over all hidden vectors:
p(v) = \frac{1}{Z} \sum_h e^{-E(v, h)}.
Geoffrey Hinton (2010). A Practical Guide to Training Restricted Boltzmann Machines. UTML TR 2010-003, University of Toronto.
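
As a small, self-contained illustration of these formulas (not part of the original slides), the following NumPy sketch computes the energy and the exact probability p(v) for a toy RBM small enough to enumerate every state; the sizes and parameter values are arbitrary assumptions:

import numpy as np
from itertools import product

rng = np.random.default_rng(1)
n_v, n_h = 3, 2                        # tiny RBM so Z is tractable by enumeration
W = rng.normal(scale=0.5, size=(n_v, n_h))
a = rng.normal(size=n_v)               # visible biases
b = rng.normal(size=n_h)               # hidden biases

def energy(v, h):
    # E(v, h) = -a.v - b.h - v^T W h
    return -a @ v - b @ h - v @ W @ h

# Enumerate all binary states to compute the partition function Z exactly.
states_v = [np.array(s) for s in product([0, 1], repeat=n_v)]
states_h = [np.array(s) for s in product([0, 1], repeat=n_h)]
Z = sum(np.exp(-energy(v, h)) for v in states_v for h in states_h)

v0 = states_v[5]
p_v0 = sum(np.exp(-energy(v0, h)) for h in states_h) / Z   # p(v) = (1/Z) sum_h exp(-E(v, h))
print(p_v0)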

Effects: from Undirected Modeling to Directed Modeling
1. Partition function: the partition function becomes tractable.
2. Posterior: RBMs have a factorial posterior, but S3C and sparse coding have a complicated posterior due to the explaining-away effect. Being able to selectively activate a small set of features that cooperate to explain the input likely provides S3C a major advantage in discriminative capability.
3. Prior: S3C introduces a factorial prior. This probably makes it a poor generative model, but that is not a problem for the purpose of feature discovery.
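
The reason the partition function becomes tractable is that the directed S3C model is built from locally normalized conditionals (a standard property of directed models, stated here explicitly rather than quoted from the slide):

p(v, s, h) = p(v \mid s, h)\, p(s \mid h)\, p(h) = \mathcal{N}(v \mid W(h \circ s), \beta^{-1}) \prod_i \mathcal{N}(s_i \mid h_i \mu_i, \alpha_{ii}^{-1})\, \sigma(b_i)^{h_i} \big(1 - \sigma(b_i)\big)^{1 - h_i},

which already sums and integrates to one, so no global normalizing constant has to be estimated, unlike the intractable partition function of the undirected ssRBM.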

Variational EM for S3C
A variant of the EM algorithm:
1) E-step: compute a variational approximation to the posterior rather than the posterior itself.
2) M-step: a closed-form solution exists, but online learning with small gradient steps on the M-step objective worked better in practice.
The goal of the variational E-step is to maximize the energy functional with respect to a distribution Q over the unobserved variables, i.e., to select the Q that minimizes the KL divergence D_{KL}(Q(h, s) \,\|\, P(h, s \mid v)), where Q is drawn from a restricted family of distributions. This family can be chosen to ensure that Q is tractable.
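
The equivalence between maximizing the energy functional and minimizing this KL divergence follows from a standard identity (spelled out here for completeness; it is not written on the slide):

\mathcal{L}(Q) = \mathbb{E}_{Q}[\log p(v, h, s)] + \mathbb{H}[Q] = \log p(v) - D_{KL}\big(Q(h, s) \,\|\, P(h, s \mid v)\big),

and \log p(v) does not depend on Q, so maximizing \mathcal{L}(Q) over Q is the same as minimizing the KL term.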

Parallel Updates to All Units (1)
1) Start each iteration by partially minimizing the KL divergence with respect to \hat{s}. The terms of the KL divergence that depend on \hat{s} make up a quadratic function, so this can be minimized via conjugate gradient descent. Conjugate gradient descent is implemented efficiently by using the R-operator to perform Hessian-vector products rather than computing the entire Hessian explicitly. This step is guaranteed to improve the KL divergence on each iteration.
2) Next, update \hat{h} in parallel, shrinking the update by a damping coefficient. This approach is not guaranteed to decrease the KL divergence on each iteration, but it is a widely applied approach.
Pearlmutter, B. A. (1994). Fast exact multiplication by the Hessian. Neural Computation, 6(1), 147-160.
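
A minimal sketch of the idea of running conjugate gradient with Hessian-vector products only, using a generic positive-definite quadratic rather than the S3C-specific KL terms; SciPy's LinearOperator stands in for what the R-operator provides in the real implementation:

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

# Minimizing a quadratic f(s) = 0.5 s^T A s - c^T s is equivalent to solving A s = c.
# Conjugate gradient only ever needs products A @ p, never the matrix A itself.
rng = np.random.default_rng(0)
n = 50
M = rng.normal(size=(n, n))
A = M @ M.T + n * np.eye(n)            # an illustrative positive-definite Hessian
c = rng.normal(size=n)

def hessian_vector_product(p):
    # In the real algorithm this product would come from the R-operator,
    # without ever forming the Hessian explicitly.
    return A @ p

H = LinearOperator((n, n), matvec=hessian_vector_product)
s_hat, info = cg(H, c)                 # info == 0 means CG converged
print(info, np.linalg.norm(A @ s_hat - c))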

Parallel Updates to All Units (2)
Replace the conjugate gradient update to \hat{s} with a more heuristic approach, which in practice converges faster while reaching equally good solutions: use a parallel damped update on \hat{s}, much like the one used for \hat{h}. An additional heuristic modification to the update rule is made necessary by the unbounded nature of \hat{s}: clip the update to \hat{s} so that if the new value has the opposite sign from the old one, its magnitude is capped. This prevents a case where multiple mutually inhibitory s units inhibit each other so strongly that, rather than being driven to 0, they change sign and actually increase in magnitude. This case is a failure mode of the parallel updates that can result in \hat{s} amplifying without bound if clipping is not used.
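
A minimal NumPy sketch of the damped parallel update with sign-clipping, assuming (since the slide leaves the exact bound unspecified) that a sign-flipping coordinate is capped at a hypothetical fraction rho of its old magnitude; the values of damping and rho are illustrative:

import numpy as np

def damped_clipped_update(s_old, s_proposed, damping=0.5, rho=0.5):
    # Damped parallel update on s_hat.
    s_new = (1.0 - damping) * s_old + damping * s_proposed
    flipped = np.sign(s_new) != np.sign(s_old)
    # If a unit flips sign, cap its new magnitude so that mutually inhibitory
    # units cannot amplify each other without bound.
    limit = rho * np.abs(s_old)
    s_new = np.where(flipped & (np.abs(s_new) > limit),
                     np.sign(s_new) * limit, s_new)
    return s_new

# Example: two strongly mutually inhibitory units whose raw parallel update overshoots.
s_old = np.array([1.0, -1.0])
s_proposed = np.array([-3.0, 3.0])
print(damped_clipped_update(s_old, s_proposed))   # magnitudes capped at 0.5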

[Slides 13-15: figures only; no transcribed text.]

Performance Results [slides 16-18: figures only]

Classification Results [slide 19: figures only]

CIFAR-100 Classification Results
The CIFAR-10 and CIFAR-100 datasets are labeled subsets of the 80 Million Tiny Images dataset, collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class; there are 50,000 training images and 10,000 test images. CIFAR-100 has 100 classes containing 600 images each, with 500 training images and 100 test images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses, and each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs).

Transfer Learning Challenge
NIPS 2011 Workshop on Challenges in Learning Hierarchical Models.
Dataset: 32x32 color images, including 100,000 unlabeled examples, 50,000 labeled examples of 100 object classes not present in the test set, and 120 labeled examples of 10 object classes present in the test set.
Results: a test set accuracy of 48.6%. This approach disregards the 50,000 labels and treats the transfer learning problem as a semi-supervised learning problem. The authors experimented with some transfer learning techniques, but the transfer-free approach performed best on leave-one-out cross-validation on the 120-example training set, so the transfer-free technique was entered in the challenge.