Why Sparse Coding Works


Why Sparse Coding Works. Mathematical Challenges in Deep Learning. Shaowei Lin (UC Berkeley), shaowei@math.berkeley.edu. 10 Aug 2011, Deep Learning Kickoff Meeting.

What is Sparse Coding? There are many formulations and algorithms for sparse coding in the literature. We isolate the basic mathematical idea in order to study its properties. Let y ∈ R^n be a data point in a data set Y. Assume that y can be expressed as y = Aa, where A ∈ R^{n×m} is a dictionary matrix and a ∈ R^m is a sparse vector. (Ignore noise in the data y for now.) Assume a has at most k nonzero entries (k-sparse). Usually, the dictionary A is overcomplete: there are fewer rows than columns (n < m). The columns of A are the atoms of the dictionary.

What is Sparse Coding? y = Aa, e.g. y a small image patch, A a dictionary of features, a a sparse representation of y. In compressed sensing, we attempt to recover a, assuming we know the dictionary A. In sparse coding, we attempt to find a dictionary A so that each y ∈ Y has a sparse representation. When A is overcomplete, both problems are ill-posed without the sparsity condition.
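To make the two problems concrete, here is a minimal numpy sketch (not part of the original slides): it builds an overcomplete dictionary A with unit-norm atoms, draws a k-sparse code a, forms y = Aa, and then recovers a from (A, y) with orthogonal matching pursuit, a standard greedy heuristic used here purely for illustration; the sizes n, m, k are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 20, 50, 3              # overcomplete: fewer rows than columns (n < m)

# Dictionary A with unit-norm columns (the atoms).
A = rng.standard_normal((n, m))
A /= np.linalg.norm(A, axis=0)

# A k-sparse code a and the observed data point y = A a.
a = np.zeros(m)
support = rng.choice(m, size=k, replace=False)
a[support] = rng.standard_normal(k)
y = A @ a

def omp(A, y, k):
    """Orthogonal matching pursuit: greedily pick the atom most correlated
    with the residual, then re-fit y on the chosen atoms."""
    residual, chosen = y.copy(), []
    for _ in range(k):
        chosen.append(int(np.argmax(np.abs(A.T @ residual))))
        coef, *_ = np.linalg.lstsq(A[:, chosen], y, rcond=None)
        residual = y - A[:, chosen] @ coef
    a_hat = np.zeros(A.shape[1])
    a_hat[chosen] = coef
    return a_hat

# Compressed-sensing view: recover a from y when A is known.
print(np.allclose(omp(A, y, k), a, atol=1e-6))   # typically True at these sizes
```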

Sparse Coding in Deep Learning Implement sparse coding on each layer of the autoencoder. Each layer is a Restricted Boltzmann Machine (RBM). Without sparse coding, the autoencoder learns a low-dimensional representation similar to Principal Component Analysis.

Sparse Coding in Deep Learning Aim: Find a code map b(y) and a dictionary B such that ŷ = B b(y) is as close to y as possible. Method: Relax the problem and optimize

min_{B, b(·)} Σ_{y ∈ Y} [ ‖y − B b(y)‖² + β S(b(y)) ] + λ W(B)

where S(b) is a sparsity penalty, W(B) is a weight penalty that limits the growth of B, and the penalties are controlled by the parameters β, λ.

Sparse Coding in Deep Learning

min_{B, b(·)} Σ_{y ∈ Y} [ ‖y − B b(y)‖² + β S(b(y)) ] + λ W(B)

Solve using gradient descent methods. This leads to local update rules for B such as Hebb's learning rule, Oja's learning rule, etc. QN: Why does sparse coding work?
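A minimal sketch of the relaxed objective and its gradient-descent solution, assuming S(·) is the ℓ1 norm and W(·) the squared Frobenius norm (the slides leave both penalties unspecified). The dictionary update is a residual-times-activity product, which is where Hebbian-style local rules come from; all names and hyperparameters here are hypothetical.

```python
import numpy as np

def learn_dictionary(Y, m, beta=0.1, lam=0.01, lr=0.01, iters=300, seed=0):
    """Alternating (sub)gradient descent on
        sum_y ||y - B b(y)||^2 + beta*S(b(y)) + lam*W(B),
    with S = l1 norm and W = squared Frobenius norm (assumed choices).
    Y is n x N, one data point per column; the codes b(y) are the columns of C."""
    rng = np.random.default_rng(seed)
    n, N = Y.shape
    B = 0.1 * rng.standard_normal((n, m))   # dictionary
    C = np.zeros((m, N))                    # codes
    for _ in range(iters):
        # code step: squared-error gradient plus l1 subgradient
        C += lr * (B.T @ (Y - B @ C) - beta * np.sign(C))
        # dictionary step: residual-times-activity (Hebbian flavour),
        # averaged over data points for a stable step size, plus weight decay
        B += lr * ((Y - B @ C) @ C.T / N - lam * B)
    return B, C

# Example: 200 data points in R^20, learn an overcomplete dictionary of 50 atoms.
Y = np.random.default_rng(1).standard_normal((20, 200))
B, C = learn_dictionary(Y, m=50)
```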

What do we mean by Why? Two ways we want to understand sparse coding: (1) Sparse coding only applies if the data actually have sparse structure [Olshausen and Field, 1997]. How can we explain this rigorously using statistical learning theory and model selection? (2) The brain evolved to extract sparse structure from data effectively and efficiently. How does the brain do this? We want to mimic what goes on in the brain for many real-world applications.

Sparse Structure in Data In statistical learning theory, when the sample size is large and all given models attain the true distribution of the data, model selection can be accomplished using Watanabe's Generalized Bayesian Information Criterion:

NS + λ log N − (θ − 1) log log N

where N is the sample size, S is the entropy of the true distribution, and λ, θ are the learning coefficient and its multiplicity.

Sparse Structure in Data

NS + λ log N − (θ − 1) log log N

Conventional BIC: λ = ½ (model dimension), θ = 1. The model dimension grows with the number of hidden factors. This may explain why sparsity is a good candidate as a model selection criterion. See [Geiger and Rusakov, 2005] for the computation of λ for binary RBMs with one hidden node. What is the learning coefficient of a general RBM?
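A small numerical illustration of the criterion above, with hypothetical values of N, S, and the model dimension: a regular model pays the conventional BIC penalty λ = d/2, while a singular model with a smaller learning coefficient pays less for the same fit.

```python
import numpy as np

def asymptotic_free_energy(N, S, lam, theta=1):
    """Watanabe's asymptotic expansion: N*S + lam*log(N) - (theta - 1)*log(log(N))."""
    return N * S + lam * np.log(N) - (theta - 1) * np.log(np.log(N))

N, S = 10_000, 1.7        # hypothetical sample size and entropy of the true distribution
d = 30                    # hypothetical model dimension

# Conventional BIC: a regular model has lam = d/2 and theta = 1.
regular = asymptotic_free_energy(N, S, lam=d / 2)

# A singular model (e.g. an RBM) can have lam much smaller than d/2.
singular = asymptotic_free_energy(N, S, lam=8.0, theta=2)

print(regular - singular)   # the gap is the penalty saved by the singular structure
```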

Sparse Structure in Data As hidden features start having dependencies among themselves, is sparsity still effective for model selection? Perhaps we can discover new learning criteria from this exploration.

Uniqueness in Sparse Coding In principle, optimal recovery in sparse coding is NP-hard and not necessarily unique. But there are many suboptimal methods for approximating a solution (e.g. optimization by gradient descent). In fact, given some mild assumptions about the dictionary and code sparsity, one can show that solutions to the sparse coding problem are unique [Hillar and Sommer, 2011]. We say a dictionary A is incoherent if Aa₁ = Aa₂ for k-sparse a₁, a₂ ∈ R^m implies a₁ = a₂.

Uniqueness in Sparse Coding Reconstructing the representation a: Given A ∈ R^{n×m} and y = Aa for some k-sparse a, if n < Ck log(m/k), then it is impossible to recover a from y. (C is some constant.) Reconstructing the dictionary A: ACS Reconstruction Theorem. Suppose A ∈ R^{n×m} is incoherent. There exist k-sparse a₁, ..., a_N ∈ R^m such that: if B ∈ R^{n×m} and k-sparse b₁, ..., b_N ∈ R^m satisfy Aaᵢ = Bbᵢ for all i, then A = BPD for some permutation matrix P ∈ R^{m×m} and invertible diagonal matrix D ∈ R^{m×m}. [Hillar and Sommer, 2011]
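The incoherence condition above can be checked by brute force on small examples: it is equivalent to every set of 2k columns of A being linearly independent. The sketch below (not from the slides; the sizes are hypothetical) tests exactly that.

```python
import numpy as np
from itertools import combinations

def is_incoherent(A, k):
    """Brute-force check of the incoherence condition: A a1 = A a2 for
    k-sparse a1, a2 implies a1 = a2.  This holds exactly when every set of
    2k columns of A is linearly independent, so we test all such column
    subsets (feasible only for small m and k)."""
    n, m = A.shape
    size = min(2 * k, m)
    return all(
        np.linalg.matrix_rank(A[:, list(cols)]) == size
        for cols in combinations(range(m), size)
    )

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 10))   # small random dictionary (n = 6, m = 10)
print(is_incoherent(A, k=2))       # a generic random A passes with probability 1
```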

Mimicking the Brain The proof of the ACS Reconstruction Theorem requires delicate combinatorial tools from Ramsey theory. The theorem assures us that if there is sparse structure in the data, then this structure can be recovered if we can solve the sparse coding problem optimally. Are there also mild assumptions on the suboptimal methods that ensure they find this optimal solution? This may show us the extent to which learning in the brain depends on the formulation of the learning rule. Can we implement these methods at the neuron level? Can we build parallel computers for deep learning?

Summary (1) Sparse coding and model selection. (2) Correct recovery in sparse coding. (3) Learning group symmetries: how do our brains account for group symmetries in the data (rotations, translations, etc.)? Do these symmetries arise naturally from temporal coherence (i.e. videos instead of still images)?

References
1. A. CUETO, J. MORTON AND B. STURMFELS: Geometry of the restricted Boltzmann machine, in Algebraic Methods in Statistics and Probability (eds. M. Viana and H. Wynn), AMS, Contemporary Mathematics 516 (2010) 135-153.
2. D. GEIGER AND D. RUSAKOV: Asymptotic model selection for naive Bayesian networks, J. Mach. Learn. Res. 6 (2005) 1-35.
3. S. GLEICHMAN: Blind compressed sensing, preprint (2010).
4. C. J. HILLAR AND F. T. SOMMER: Ramsey theory reveals the conditions when sparse dictionary learning on subsampled data is unique, preprint (2011).
5. S. LIN: Algebraic methods for evaluating integrals in Bayesian statistics, PhD dissertation, Dept. Mathematics, UC Berkeley (2011).
6. A. NG, J. NGIAM, C. Y. FOO, Y. MAI, C. SUEN: Unsupervised feature learning and deep learning tutorial, ufldl.stanford.edu (2011).
7. B. OLSHAUSEN AND D. FIELD: Sparse coding with an overcomplete basis set: A strategy employed by V1?, Vision Research 37 (1997) 3311-3325.
8. S. WATANABE: Algebraic Geometry and Statistical Learning Theory, Cambridge Monographs on Applied and Computational Mathematics 25 (2009).