Convolutional Dictionary Learning and Feature Design

1 Convolutional Dictionary Learning and Feature Design
Lawrence Carin, Duke University
16 September 2014

2 Outline 1 Background 2 Convolutional Dictionary Learning 3 Hierarchical, Deep Architecture 4 Convolutional Neural Network 5 Summary

3 Motivation
In sensing systems, the representation used for the incoming data may significantly impact performance
This is typically termed feature extraction, and is often performed with hand-crafted features
Ideally the data representation is learned jointly with the learning task
There has been much recent success using deep architectures and convolutional neural networks
Here we make connections to recent ideas in dictionary learning and sparse signal representation, with the goal of demystifying these methods

4 Dictionary Learning
There has been significant recent interest in dictionary learning for image representation
Consider an image, which we represent via the matrix $X \in \mathbb{R}^{n \times m}$
The overall, large image is represented by a set of patches $\{x_i\}_{i=1,N}$, which may be overlapping

5 Dictionary Learning: Formulation
Given a set of image patches $\{x_i\}_{i=1,N}$, we may infer a dictionary representation
$$x_i = D w_i + \epsilon_i$$
where $x_i \in \mathbb{R}^p$, $D \in \mathbb{R}^{p \times K}$, and $w_i \in \mathbb{R}^K$ is sparse
Typically the dictionary is over-complete, meaning $K > p$; for images we often set $p = 8 \times 8 \times 3 = 192$
Typical setup:
$$\hat{D}, \{\hat{w}_i\} = \operatorname*{argmin}_{D,\{w_i\}} \sum_{i=1}^{N} \|x_i - D w_i\|_2^2 + \lambda_1 \sum_{k=1}^{K} \|d_k\|_2^2 + \lambda_2 \sum_{i=1}^{N} \|w_i\|_1
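A minimal sketch of this patch-based setup, assuming scikit-learn is available; the patch size, number of atoms, and sparsity penalty below are illustrative choices, not the values used in the talk:

```python
import numpy as np
from sklearn.datasets import load_sample_image
from sklearn.feature_extraction.image import extract_patches_2d
from sklearn.decomposition import MiniBatchDictionaryLearning

# Grayscale image X in R^{n x m}
X = load_sample_image("china.jpg").mean(axis=2) / 255.0

# Collect (possibly overlapping) patches x_i, flattened to vectors in R^p
patch_size = (8, 8)                               # p = 8*8 = 64 for a grayscale image
patches = extract_patches_2d(X, patch_size, max_patches=5000, random_state=0)
patches = patches.reshape(len(patches), -1)
patches -= patches.mean(axis=1, keepdims=True)    # remove patch means

# Jointly fit an over-complete dictionary D (K > p) and sparse codes w_i;
# alpha plays the role of the l1 penalty lambda_2
dico = MiniBatchDictionaryLearning(n_components=128, alpha=1.0, random_state=0)
W = dico.fit_transform(patches)                   # sparse codes, one row per patch
D = dico.components_                              # dictionary, K x p

print("average nonzeros per code:", np.mean(np.sum(W != 0, axis=1)))
```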

6 Inferred Dictionary and Activations (with Guillermo Sapiro and David Dunson)
[Figure: inferred dictionary elements and activation probabilities, comparing BP (PSNR 26.9 dB) and dHBP reconstructions]

7 Outline 1 Background 2 Convolutional Dictionary Learning 3 Hierarchical, Deep Architecture 4 Convolutional Neural Network 5 Summary

8 Convolutional Dictionary Learning
Using the patch representation yields many redundant dictionary elements, which are simply shifts of a basic dictionary type
Now perform dictionary learning directly on the entire image X:
$$X = \sum_{k=1}^{K} d_k * W_k + E$$
where $d_k$ is again a dictionary element over a small patch size, but now $W_k$ is a sparse activation map over the entire image
Solution methodology:
$$\hat{D}, \{\hat{W}_k\} = \operatorname*{argmin}_{D,\{W_k\}} \Big\|X - \sum_{k=1}^{K} d_k * W_k\Big\|_F^2 + \lambda_1 \sum_{k=1}^{K} \|d_k\|_2^2 + \lambda_2 \sum_{k=1}^{K} \|W_k\|_1
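As a concrete illustration of the convolutional synthesis model above (a sketch only; the dictionary and activation maps are random placeholders, assuming SciPy for the 2-D convolutions):

```python
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)
n, m, K, s = 64, 64, 8, 7            # image size, number of elements, element size

# Small dictionary elements d_k and sparse activation maps W_k over the whole image
D = rng.standard_normal((K, s, s))
W = rng.standard_normal((K, n, m)) * (rng.random((K, n, m)) < 0.02)  # ~2% active

# Synthesis model: X = sum_k d_k * W_k + E  (2-D convolution for each element)
X = sum(fftconvolve(W[k], D[k], mode="same") for k in range(K))
X += 0.01 * rng.standard_normal((n, m))   # additive noise E

print(X.shape, "active coefficients:", int((W != 0).sum()))
```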

9 Efficient Implementation
$$\hat{D}, \{\hat{W}_k\} = \operatorname*{argmin}_{D,\{W_k\}} \Big\|X - \sum_{k=1}^{K} d_k * W_k\Big\|_F^2 + \lambda_1 \sum_{k=1}^{K} \|d_k\|_2^2 + \lambda_2 \sum_{k=1}^{K} \|W_k\|_1$$
By moving away from the patch-based solution, we remove many redundant dictionary elements that are shifted and clipped versions of basic types
By using FFTs, the convolution operation may be made fast
We learn the dictionary $\hat{D}$ offline, based on many training images
For a test image, D is fixed, and we must solve
$$\{\hat{W}_k\} = \operatorname*{argmin}_{\{W_k\}} \Big\|X - \sum_{k=1}^{K} d_k * W_k\Big\|_F^2 + \lambda_2 \sum_{k=1}^{K} \|W_k\|_1$$
Still relatively expensive at test time; we return to this
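A minimal ISTA-style sketch of this test-time inference with D held fixed; this is an illustration of the kind of iterative solver that makes test time expensive, not the talk's own implementation, and the step size and penalty are illustrative:

```python
import numpy as np
from scipy.signal import fftconvolve

def soft_threshold(Z, t):
    """Elementwise soft-thresholding, the proximal operator of the l1 norm."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def infer_activations(X, D, lam2=0.1, step=1e-3, n_iter=200):
    """ISTA for  min_W ||X - sum_k d_k * W_k||_F^2 + lam2 * sum_k ||W_k||_1,
    with the dictionary D (shape K x s x s) held fixed."""
    K = D.shape[0]
    W = np.zeros((K,) + X.shape)
    D_flip = D[:, ::-1, ::-1]          # correlation = convolution with the flipped kernel
    for _ in range(n_iter):
        recon = sum(fftconvolve(W[k], D[k], mode="same") for k in range(K))
        residual = X - recon
        for k in range(K):
            grad_k = -2.0 * fftconvolve(residual, D_flip[k], mode="same")
            W[k] = soft_threshold(W[k] - step * grad_k, step * lam2)
    return W

# Tiny example (small sizes to keep it fast)
rng = np.random.default_rng(0)
D = rng.standard_normal((4, 5, 5))
X = rng.standard_normal((32, 32))
W_hat = infer_activations(X, D, n_iter=50)
print(W_hat.shape)
```

In practice one would use FFTs over the whole objective and an accelerated solver (FISTA, ADMM); the point here is only the per-image cost of this optimization at test time.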

10 Outline 1 Background 2 Convolutional Dictionary Learning 3 Hierarchical, Deep Architecture 4 Convolutional Neural Network 5 Summary

11 Multiscale Convolutional Dictionary Learning
$$\hat{D}, \{\hat{W}_k\} = \operatorname*{argmin}_{D,\{W_k\}} \Big\|X - \sum_{k=1}^{K} d_k * W_k\Big\|_F^2 + \lambda_1 \sum_{k=1}^{K} \|d_k\|_2^2 + \lambda_2 \sum_{k=1}^{K} \|W_k\|_1$$
The mapping $X \to \{W_k\}$ constitutes a set of K feature (dictionary activation) maps
The scale of these features is dictated by the size of the dictionary elements $d_k$
We can use the feature stack $\{W_k\}_{k=1,K}$ as new input data, and then perform dictionary learning on these features
Useful to perform pooling when going to the next layer, to increase the effective size of the dictionary elements at the next level in the hierarchy

12 Multiscale Convolutional Dictionary Learning
[Figure: max pooling and collection of coefficients from layer one, for analysis at layer two; the same procedure is used when transiting between any two consecutive layers in the hierarchical model]

13 Unsupervised Feature Learning
[Fig. 2: explanation of max pooling and collection of coefficients from layer one, for analysis at layer two (a similar procedure is implemented when transiting between any two consecutive layers in the hierarchical model). The matrix $W^{(1)}_{nk}$ holds the shift-dependent coefficients $\{w^{(1)}_{nki}\}$ for all two-dimensional shifts of dictionary element $d^{(1)}_k$, for image n (the same max pooling is performed for all images $n \in \{1,\ldots,N\}$). The matrix of coefficients is partitioned into spatially contiguous blocks, and the maximum-amplitude coefficient within each block is retained to define $\hat{W}^{(1)}_{nk}$. This is done for all dictionary elements $d^{(1)}_k$, $k \in \{1,\ldots,K^{(1)}\}$, and the max-pooled matrices $\{\hat{W}^{(1)}_{nk}\}$ are stacked to define the second-layer tensor $X^{(2)}_n$; in practice only the $\hat{K}^{(2)} \leq K^{(1)}$ layers corresponding to dictionary elements actually used in the expansion of the N images are retained. After max pooling the original image X is represented in terms of L tensors $\{X^{(l)}_n\}$; because of the pooling step the number of spatial positions decreases as one moves to higher levels, so the computational complexity decreases with increasing layer, and this process may be continued for additional layers (up to three in the experiments).]
The multiscale dictionary elements may be learned in an unsupervised manner
Allows leveraging of massive quantities of unlabeled, but relevant data
Fine-scale tuning may be performed subsequently, using labels or rewards (in the RL case)
Connect the label/action/policy to features at the top of the hierarchy
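A small sketch of this block-wise max pooling and stacking step, under the simplifying assumption that the maximum-amplitude coefficient is selected by absolute value within non-overlapping blocks; the names, sizes, and random activation maps below are illustrative:

```python
import numpy as np

def block_max_pool(W_k, block=(4, 4)):
    """Keep the maximum-amplitude coefficient in each non-overlapping block of W_k."""
    bh, bw = block
    H = (W_k.shape[0] // bh) * bh
    Wd = (W_k.shape[1] // bw) * bw
    blocks = W_k[:H, :Wd].reshape(H // bh, bh, Wd // bw, bw)
    flat = blocks.transpose(0, 2, 1, 3).reshape(H // bh, Wd // bw, -1)
    idx = np.abs(flat).argmax(axis=-1)               # position of max |coefficient|
    return np.take_along_axis(flat, idx[..., None], axis=-1)[..., 0]

# Stack the max-pooled activation maps of all K^(1) layer-one dictionary elements
# to form the second-layer input tensor X^(2) for one image.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((16, 64, 64)) * (rng.random((16, 64, 64)) < 0.05)
X2 = np.stack([block_max_pool(W1[k]) for k in range(W1.shape[0])], axis=0)
print(X2.shape)   # (16, 16, 16): K^(1) pooled maps of reduced spatial size
```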

14 Learned Convolutional Dictionary Elements, Applied to the Caltech 101 Database
Layer 1 dictionary elements: learned by a Gibbs sampler for the Caltech 101 dataset (Fig. 7); these are typical of Layer 1 dictionary elements for natural images
Second- (left) and third-layer (right) dictionary elements: Figure 8 shows Layer 2 and Layer 3 dictionary elements from five additional classes of the Caltech 101 dataset (the data from each class are analyzed separately). These dictionary elements take on a form well matched to the associated image class, and similar class-specific Layer 2 and object-specific Layer 3 dictionary elements were found when each Caltech 101 class was analyzed in isolation
[Figure 1(b) shows which of the candidate dictionary elements $d^{(2)}_k$ are used to represent a given image (white indicates used, black not used; elements are ordered from most to least probable to be used). From Figures 1(c) and 1(f), roughly 134 of the candidate dictionary elements are used frequently at layer two, and 34 are used frequently at layer three.]

15 Outline 1 Background 2 Convolutional Dictionary Learning 3 Hierarchical, Deep Architecture 4 Convolutional Neural Network 5 Summary

16 Fast Approximate Analysis
Recall that, for a test image X, using the previously learned convolutional dictionary D, we must solve
$$\{\hat{W}_k\} = \operatorname*{argmin}_{\{W_k\}} \Big\|X - \sum_{k=1}^{K} d_k * W_k\Big\|_F^2 + \lambda_2 \sum_{k=1}^{K} \|W_k\|_1$$
This is relatively expensive
Instead, implement a convolutional filterbank followed by a pointwise nonlinear function $\hat{g}(\cdot)$:
$$\hat{g}, \hat{f}_k = \operatorname*{argmin}_{g, f_k} \big\|W_k - g(X * f_k)\big\|_F$$
In practice we have a series of convolutional filterbanks, each followed by a nonlinear function and then a pooling operation
The pooling plays an important role in achieving robustness, and in the accuracy of the above approximation
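A toy sketch of this approximation, assuming a rectifier for the pointwise nonlinearity and max pooling (the talk does not specify these choices); the filters here are random stand-ins for learned $\hat{f}_k$:

```python
import numpy as np
from scipy.signal import fftconvolve

def approx_feature_maps(X, filters, bias=0.0, pool=(2, 2)):
    """Approximate the sparse activation maps with one filterbank pass:
    correlate X with each filter f_k, apply a pointwise nonlinearity g (here a
    rectifier), then max-pool, instead of solving the l1 optimization."""
    maps = []
    for f in filters:
        response = fftconvolve(X, f[::-1, ::-1], mode="same")   # correlation with f_k
        g = np.maximum(response - bias, 0.0)                    # pointwise nonlinearity
        ph, pw = pool
        H, W = (g.shape[0] // ph) * ph, (g.shape[1] // pw) * pw
        pooled = g[:H, :W].reshape(H // ph, ph, W // pw, pw).max(axis=(1, 3))
        maps.append(pooled)
    return np.stack(maps, axis=0)

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 64))
filters = rng.standard_normal((8, 7, 7))
print(approx_feature_maps(X, filters).shape)   # (8, 32, 32)
```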

17 Our Implementation Methodology
When learning the model architecture, sparsity plays a key role
We have utilized a structure related to the adaptive Lasso, via the Laplace scale-mixture representation
$$\frac{1}{2}\sqrt{\tau/\gamma_j}\,\exp\!\left(-|w_j|\sqrt{\tau/\gamma_j}\right) = \int_0^\infty \mathcal{N}\!\left(w_j; 0, \tau^{-1}\alpha^{-1}\right)\,\mathrm{InvGa}\!\left(\alpha; 1, (2\gamma_j)^{-1}\right)\,d\alpha$$
Sparsity promotion is imposed via a heavy-tailed gamma process on the $\gamma_j$
The Bayesian setup is used to infer the number of filterbanks at each level of the hierarchy
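A quick numerical check of this scale-mixture identity (a sketch, assuming SciPy; the values of $\tau$ and $\gamma_j$ are arbitrary): drawing $\alpha \sim \mathrm{InvGa}(1,(2\gamma_j)^{-1})$ and then $w_j \sim \mathcal{N}(0,\tau^{-1}\alpha^{-1})$ should reproduce a Laplace density with scale $\sqrt{\gamma_j/\tau}$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
tau, gamma_j = 2.0, 0.5
n = 200_000

# Hierarchical draw: alpha ~ InvGa(1, (2*gamma_j)^(-1)), then w | alpha ~ N(0, 1/(tau*alpha))
alpha = stats.invgamma.rvs(a=1.0, scale=1.0 / (2.0 * gamma_j), size=n, random_state=rng)
w = rng.normal(0.0, np.sqrt(1.0 / (tau * alpha)))

# Marginal should match Laplace(0, sqrt(gamma_j/tau)), i.e. density
# (1/2)*sqrt(tau/gamma_j) * exp(-|w| * sqrt(tau/gamma_j))
grid = np.linspace(-3, 3, 13)
hist, edges = np.histogram(w, bins=grid, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.round(hist, 3))
print(np.round(stats.laplace.pdf(centers, scale=np.sqrt(gamma_j / tau)), 3))
```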

18 Bayesian Inferred Dictionary Usage
[Figure, panel (c): posterior probability (y-axis) versus number of dictionary elements used (x-axis)]

19 Outline 1 Background 2 Convolutional Dictionary Learning 3 Hierarchical, Deep Architecture 4 Convolutional Neural Network 5 Summary