Machine Learning for Signal Processing Sparse and Overcomplete Representations


Machine Learning for Signal Processing: Sparse and Overcomplete Representations. Abelino Jimenez (slides from Bhiksha Raj and Sourish Chaudhuri), Oct 2017.

So far: Data ≈ Bases x Weights, with bases that are either data-independent or learned from data (ICA, PCA, NNMF), and #bases <= dim(X).

So far: Can we use the same structure to identify basic units that compose the signal?

Key Topics in this Lecture: Basics; component-based representations; overcomplete and sparse representations; dictionaries; pursuit algorithms; how to learn a dictionary; why an overcomplete representation is powerful.

Representing Data: A dictionary (codebook) of atoms. Each atom is a basic unit that can be used to compose larger units.

Representing Data: (figure slides showing example atoms from the dictionary and an input composed from them, with example weights w_1 ... w_7 on the atoms)

Representing Data: The data is a linear combination of elements in the dictionary: X = w_1·(atom 1) + w_2·(atom 2) + ...

Quick Linear Algebra Refresher: Remember, #(basis vectors) = #unknowns. Input data = Basis vectors (from the dictionary) x Weights.

Overcomplete Representations: What is the dimensionality of the input image? (Say a 64x64 image.) 4096. What is the dimensionality of the dictionary? (Each atom is a 64x64 image.) 4096 x N, which is VERY LARGE! If N > 4096 (as it likely is) we have an overcomplete representation. More generally: if #(basis vectors) > dimensionality of the input, we have an overcomplete representation.

Quick Linear Algebra Refresher: Remember, #(basis vectors) = #unknowns. Input data = Basis vectors (from the dictionary) x Weights.

Dictionary-based Representations: Overcomplete dictionary-based representations are composition-based representations with more bases than the dimensionality of the data: X = Dα, where the bases matrix D is wide (more bases than dimensions) and X is the input.

Why Dictionary-based Representations? Dictionary-based representations are semantically more meaningful. They enable content-based description: bases can capture entire structures in the data, e.g. notes in music, or image structures (such as faces) in images. They enable content-based processing: reconstructing, separating, denoising, and manipulating speech/music signals; coding, compression, etc. There are also statistical reasons: we will get to that shortly.

Problems: How do we obtain the dictionary (one that will give us meaningful representations)? How do we compute the weights? X = Dα, where X is the input.


Quick Linear Algebra Refresher: Remember, #(basis vectors) = #unknowns. Input data = Basis vectors x Weights. When can we solve for α?

Quick Linear Algebra Refresher: If D is square and full rank, X = Dα has a unique solution. If D is tall (more equations than unknowns), we may have no exact solution. If D is wide (more unknowns than equations), there are infinitely many solutions. This last situation is our case.

Using the pseudo-inverse? The pseudo-inverse gives the minimum-L2-norm solution of the underdetermined system, which is generally dense rather than sparse.
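As a quick illustration (a minimal NumPy sketch with a made-up random dictionary; all sizes and names here are only for illustration), the pseudo-inverse reconstructs X exactly but spreads energy over essentially every coefficient:

import numpy as np

np.random.seed(0)
D = np.random.randn(20, 50)                   # wide dictionary: 20-dim data, 50 atoms
alpha_true = np.zeros(50)
alpha_true[[3, 17, 31]] = [1.0, -2.0, 0.5]    # a 3-sparse "ground truth"
X = D @ alpha_true

alpha_pinv = np.linalg.pinv(D) @ X            # minimum-norm solution of X = D @ alpha
print(np.allclose(D @ alpha_pinv, X))         # True: it reconstructs X exactly
print(np.sum(np.abs(alpha_pinv) > 1e-6))      # ~50 non-zeros: not sparse at all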

Overcomplete Representation: X = Dα with α unknown, and #(basis vectors) > dimensionality of the input.

Representing Data: Using bases that we know, but no more than k = 4 bases.

Overcompleteness and Sparsity: To solve an overcomplete system of the type X = Dα, make assumptions about the data. Suppose we say that X is composed of no more than a fixed number k of bases from D (k << dim(X)); the term "bases" is an abuse of terminology. Now we can find the set of k bases that best fit the data point X.

Overcompleteness and Sparsity: (figure of the dictionary atoms; no more than k = 4 of them are active.)

No more than 4 bases: X = Dα, where only the components of α corresponding to the 4 key dictionary entries are non-zero. Most of α is zero: α is SPARSE.

Sparsity, a definition: Sparse representations are representations that account for most or all information of a signal with a linear combination of a small number of atoms. (From: www.see.ed.ac.uk/~tblumens/sparse/sparse.html)

The Sparsity Problem: We don't really know k. You are given a signal X. Assuming X was generated using the dictionary, can we find the α that generated it?

The Sparsity Problem: We want to use as few basis vectors as possible to do this:
min_α ||α||_0  s.t.  X = Dα
where ||α||_0 counts the number of non-zero elements in α. Ockham's razor: choose the simplest explanation, invoking the fewest variables. How can we solve the above?

Obtaining Sparse Solutions: We will look at two algorithms: Matching Pursuit (MP) and Basis Pursuit (BP).

Matching Pursuit (MP): A greedy algorithm. Find the atom in the dictionary that best matches the input signal; remove the weighted contribution of this atom from the signal; again find the atom that best matches the remaining signal; continue until a defined stopping condition is satisfied.

Matching Pursuit: Find the dictionary atom that best matches the given signal; its weight is w_1. Remove the weighted atom to obtain the updated (residual) signal, and find the best match for this residual in the dictionary. Iterate until you reach a stopping condition, e.g. norm(residual signal) < threshold.

Matching Pursuit pseudocode: from http://en.wikipedia.org/wiki/matching_pursuit (a sketch follows below).
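A minimal matching-pursuit sketch in NumPy (it assumes unit-norm dictionary columns; the function and variable names are illustrative, not from the slides):

import numpy as np

def matching_pursuit(X, D, max_atoms=4, tol=1e-6):
    """Greedy MP: repeatedly pick the atom most correlated with the residual."""
    # D: (dim, n_atoms) with unit-norm columns; X: (dim,)
    alpha = np.zeros(D.shape[1])
    residual = X.copy()
    for _ in range(max_atoms):
        correlations = D.T @ residual          # match every atom against the residual
        k = np.argmax(np.abs(correlations))    # best-matching atom
        w = correlations[k]                    # its weight (for unit-norm atoms)
        alpha[k] += w                          # accumulate (MP may revisit an atom)
        residual -= w * D[:, k]                # remove this atom's contribution
        if np.linalg.norm(residual) < tol:     # stopping condition from the slides
            break
    return alpha, residual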

Matching Pursuit: Problems? The main problem is computational complexity: the entire dictionary has to be searched at every iteration.

Comparing MP and BP (preview): Matching Pursuit uses hard thresholding, greedy optimization at each step, and weights obtained using greedy rules. Basis Pursuit: remember the equations that follow.

Basis Pursuit (BP): Remember, min_α ||α||_0 s.t. X = Dα. In the general case this is intractable: it requires combinatorial optimization.

Basis Pursuit: Replace the intractable expression by an expression that is solvable: min_α ||α||_1 (the objective) s.t. X = Dα (the constraint). This will provide identical solutions when D obeys the Restricted Isometry Property.

Basis Pursuit: We can formulate the optimization as a single term combining the constraint and the objective: min_α { ||X - Dα||_2^2 + λ||α||_1 }. λ is a penalty on the non-zero elements and promotes sparsity. This is equivalent to the LASSO; for more details, see the paper by Tibshirani: http://www-stat.stanford.edu/~tibs/ftp/lasso.ps

Basis Pursuit: ||α||_1 is not differentiable at α_j = 0. For a (sub)gradient update on min_α { ||X - Dα||_2^2 + λ||α||_1 }, the subgradient of |α_j| is sign(α_j) if α_j ≠ 0, and any value in [-1, 1] if α_j = 0. At the optimum the following conditions hold: D_j^T (X - Dα) = (λ/2)·sign(α_j) if α_j ≠ 0, and |D_j^T (X - Dα)| <= λ/2 if α_j = 0.

Basis Pursuit: There are efficient ways to solve the LASSO formulation, e.g. glmnet: http://web.stanford.edu/~hastie/glmnet_matlab/. The simplest solution: coordinate descent algorithms (a sketch follows; code on the course webpage).
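A minimal coordinate-descent sketch for the LASSO form min_α ||X - Dα||_2^2 + λ||α||_1 (each coordinate update is a soft-thresholding step that satisfies the optimality conditions above; this is an illustrative sketch, not the glmnet implementation, and it assumes non-zero dictionary columns):

import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_coordinate_descent(X, D, lam, n_iters=100):
    """Minimize ||X - D a||_2^2 + lam * ||a||_1 by cycling over coordinates."""
    n_atoms = D.shape[1]
    a = np.zeros(n_atoms)
    col_norms = np.sum(D * D, axis=0)          # ||d_j||^2 for each column (assumed > 0)
    for _ in range(n_iters):
        for j in range(n_atoms):
            r_j = X - D @ a + D[:, j] * a[j]   # residual ignoring coordinate j
            rho = D[:, j] @ r_j
            a[j] = soft_threshold(rho, lam / 2.0) / col_norms[j]
    return a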

L1 vs L0: min_α ||α||_0 s.t. X = Dα, with X expressed over an overcomplete set of 6 bases. L0 minimization gives a two-sparse solution, but ANY pair of bases can explain X with zero error.

L1 vs L0: min_α ||α||_1 s.t. X = Dα, with X expressed over an overcomplete set of 6 bases. L1 minimization also gives a two-sparse solution, but all else being equal, the two bases closest to X are chosen.

Comparing MP and BP: Matching Pursuit uses hard thresholding, greedy optimization at each step, and weights obtained using greedy rules. Basis Pursuit uses soft thresholding (remember the equations), global optimization, and can force N-sparsity with appropriately chosen penalty weights.

General Formalisms:
L0 minimization: min_α ||α||_0 s.t. X = Dα
L1 minimization: min_α ||α||_1 s.t. X = Dα
L0-constrained optimization: min_α ||X - Dα||_2^2 s.t. ||α||_0 <= C
L1-constrained optimization: min_α ||X - Dα||_2^2 s.t. ||α||_1 <= C

Many Other Methods: Iterative Hard Thresholding (IHT), CoSaMP, OMP, ...

Problems: How do we obtain the dictionary (one that will give us meaningful representations)? How do we compute the weights? X = Dα, where X is the input.

Trivial Solution: set D = the training data itself. Impractical and useless.

Dictionaries for Compressive Sensing: just random vectors in D!

More Structured Ways of Constructing Dictionaries: Dictionary entries must be structurally meaningful and represent true compositional units of the data. We have already encountered two ways of building dictionaries: NMF for non-negative data, and K-means.

K-Means for Composing Dictionaries: Train the codebook from training data using K-means; every vector is approximated by the centroid of the cluster it falls into. The cluster means are the codebook entries (dictionary entries), and are also compositional units that compose the data.

K-Means for Dictionaries: Each column of D is a codeword (centroid) from the codebook. In X = Dα, α must be 1-sparse, and its single non-zero entry must equal 1.

K-Means: Learn codewords to minimize the total squared distance of the training vectors from their closest codeword.

Length-unconstrained K-Means for Dictionaries: Each column of D is a codeword (centroid) from the codebook. In X = Dα, α must still be 1-sparse, but there is no restriction on the value of its non-zero entry.

SVD K-Means: Learn codewords to minimize the total squared projection error of the training vectors onto the closest codeword.

SVD K-Means:
1. Initialize a set of centroids randomly.
2. For each data point x, compute its projection onto the centroid of each cluster: p_cluster = x^T m_cluster.
3. Put the data point in the cluster of the closest centroid, i.e. the cluster for which p_cluster is maximum.
4. When all data points are clustered, recompute the centroids: m_cluster = PrincipalEigenvector({x : x in cluster}).
5. If not converged, go back to 2.
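A minimal sketch of these steps (assignment uses the absolute projection, since an eigenvector's sign is arbitrary; names, sizes, and the convergence criterion are illustrative assumptions):

import numpy as np

def svd_kmeans(X, n_codewords=8, n_iters=20, seed=0):
    """X: (n_points, dim). Returns unit-norm codeword directions, one per cluster."""
    rng = np.random.default_rng(seed)
    dim = X.shape[1]
    M = rng.standard_normal((n_codewords, dim))
    M /= np.linalg.norm(M, axis=1, keepdims=True)       # step 1: random unit centroids
    for _ in range(n_iters):
        proj = np.abs(X @ M.T)                          # step 2: |projection| on each centroid
        assign = np.argmax(proj, axis=1)                # step 3: cluster with maximum projection
        for k in range(n_codewords):                    # step 4: principal eigenvector per cluster
            Xk = X[assign == k]
            if len(Xk) == 0:
                continue
            # principal eigenvector of sum(x x^T) = first right singular vector of Xk
            _, _, Vt = np.linalg.svd(Xk, full_matrices=False)
            M[k] = Vt[0]
    return M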

Problem: This only represents radial patterns.

What about this pattern? Dictionary entries that represent radial patterns will not capture this structure; 1-sparse representations will not do. We need AFFINE patterns: each vector is modeled by a linear combination of K (here 2) bases. Every point on the line is a (constrained) 2-sparse combination of two bases: Line = a·b_1 + (1 - a)·b_2.

Codebooks for K-sparsity? Each column of D is a codeword (centroid) from the codebook. In X = Dα, α must be k-sparse, with no restriction on the values of its non-zero entries.

Formalizing: Given training data {X_i}, we want to find a dictionary D such that X_i ≈ D α_i, with each α_i sparse.

Formalizing: There are two objectives: approximation quality and sparsity of the coefficients. The joint problem is NON-convex!

An iterative method: Given D, estimate the α's; we can use any pursuit method to get a sparse solution. Given the α's, estimate D; this is the difficult part!

K-SVD: Initialize the codebook D. Step 1: for every training vector, compute a K-sparse α using any pursuit algorithm.

K-SVD, step 2: For each codeword k, and for each vector x, subtract the contribution of all other codewords to obtain the codeword-specific residual e_k(x) = x - Σ_{j≠k} α_j(x) D_j; then set D_k to the principal eigenvector of {e_k(x)}. Step 3: return to step 1.
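A simplified K-SVD-style sketch of this dictionary update (sparse coding in step 1 can reuse matching_pursuit above or any other pursuit; following the full K-SVD of Aharon et al., this sketch also refreshes the coefficients of atom k from the same rank-1 SVD, and it restricts the residuals to the signals that actually use atom k — both details are assumptions beyond what the slide states):

import numpy as np

def ksvd_dictionary_update(X, D, A):
    """One K-SVD-style update. X: (dim, n_signals), D: (dim, n_atoms), A: (n_atoms, n_signals) sparse codes."""
    for k in range(D.shape[1]):
        used = np.nonzero(A[k, :])[0]              # signals that actually use atom k
        if len(used) == 0:
            continue
        # codeword-specific residual: e_k(x) = x - sum_{j != k} a_j(x) D_j
        E_k = X[:, used] - D @ A[:, used] + np.outer(D[:, k], A[k, used])
        # rank-1 fit of E_k: the principal left singular vector becomes the new atom
        U, S, Vt = np.linalg.svd(E_k, full_matrices=False)
        D[:, k] = U[:, 0]
        A[k, used] = S[0] * Vt[0]                  # refresh the matching coefficients
    return D, A

The outer loop alternates this update with re-computing the K-sparse codes (step 1) until convergence.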

K-SVD: At the termination of each iteration we have an updated dictionary. Conclusion: a dictionary in which any data vector can be composed of at most K dictionary entries; more generally, a sparse composition.

Problems: How do we obtain the dictionary (one that will give us meaningful representations)? How do we compute the weights? X = Dα, where X is the input.

Applications of Sparse Representations: Many, many applications: signal representation, statistical modelling, etc. We've already seen one, compressive sensing. Another popular use is denoising.

Denoising: As the name suggests, remove noise!

A toy example: (dictionary built from an identity matrix and translations of a Gaussian pulse.)

Image Denoising: Here's what we want.

The Image Denoising Problem: Given an image, remove additive Gaussian noise from it.

Image Denoising: The noisy input is Y = X + ε, where X is the original image and ε ~ N(0, σ) is Gaussian noise.

Image Denoising: Remove the noise from Y to obtain X as well as possible, using sparse representations over learned dictionaries. We will learn the dictionaries. What data will we use to learn them? The corrupted image itself!

Image Denoising: We use the data to be denoised to learn the dictionary, so training and denoising become an iterated process. We use image patches of size n x n pixels (e.g. if the image is 64x64, the patches might be 8x8).

Image Denoising: The patch dictionary D has one row per patch pixel and k columns, with k larger than the patch dimensionality, so it is overcomplete. It is known and fixed to start with. Every image patch can be sparsely represented using D.

Image Denoising: Recall our equations from before. We want to find α so as to minimize: min_α { ||X - Dα||_2^2 + λ||α||_1 }

Image Denoising: In min_α { ||X - Dα||_2^2 + λ||α||_1 }, X is a single patch. If the larger image is fully expressed by the patches in it, how do we go from patches to the image?

Image Denoising: min_{α_ij, X} { μ||X - Y||_2^2 + Σ_ij ||R_ij X - D α_ij||_2^2 + Σ_ij λ_ij ||α_ij||_0 }
Here (X - Y) is the error between the input and the denoised image, and μ is a penalty on that error. The middle term bounds the error in each patch: R_ij selects the (ij)-th patch, and the number of terms in the summation equals the number of patches. λ forces sparsity.
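To make the R_ij notation concrete, a small sketch of the patch-extraction operators (here R_ij simply crops the n x n patch whose top-left corner is (i, j); the patch size and stride are illustrative assumptions):

import numpy as np

def extract_patches(image, n=8, stride=4):
    """Apply every R_ij: return the n x n patches as flattened column vectors."""
    H, W = image.shape
    patches, positions = [], []
    for i in range(0, H - n + 1, stride):
        for j in range(0, W - n + 1, stride):
            patches.append(image[i:i + n, j:j + n].reshape(-1))   # R_ij X
            positions.append((i, j))
    return np.stack(patches, axis=1), positions                   # (n*n, #patches)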

Image Denoising: But we don't know our dictionary D; we want to estimate D as well. We can use the previous equation itself, now minimizing over D, α_ij, and X: min_{D, α_ij, X} { μ||X - Y||_2^2 + Σ_ij ||R_ij X - D α_ij||_2^2 + Σ_ij λ_ij ||α_ij||_0 }

Image Denoising: How do we estimate all three at once? We cannot estimate them at the same time! Instead, fix two and find the optimal third. To start, initialize X = Y.

Image Denoising: With X = Y fixed and D initialized, the remaining problem is min_{α_ij} Σ_ij ||R_ij X - D α_ij||_2^2 + Σ_ij λ_ij ||α_ij||_0, and you know how to solve this for α: MP, BP!

Image Denoising: Now update the dictionary D, one column at a time, following the K-SVD algorithm; K-SVD maintains the sparsity structure. Iteratively update α and D.

Image Denoising: Learned dictionary for face image denoising. From: M. Elad and M. Aharon, Image denoising via learned dictionaries and sparse representation, CVPR 2006.

Image Denoising: With D and α known, minimize over X (the sparsity term is constant w.r.t. X):
min_X { μ||X - Y||_2^2 + Σ_ij ||R_ij X - D α_ij||_2^2 }
This quadratic term has a closed-form solution:
X = (μ I + Σ_ij R_ij^T R_ij)^{-1} (μ Y + Σ_ij R_ij^T D α_ij)
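Because Σ_ij R_ij^T R_ij is diagonal (it just counts how many patches cover each pixel), the closed-form update reduces to a per-pixel weighted average of the noisy image and the overlapping denoised patches. A minimal sketch (reusing the positions from extract_patches above; μ and the patch size are illustrative assumptions):

import numpy as np

def denoised_image_update(Y, D, A, positions, n=8, mu=1.0):
    """X = (mu*I + sum R^T R)^{-1} (mu*Y + sum R^T D a_ij), exploiting the diagonal structure."""
    num = mu * Y.astype(float)              # numerator: mu*Y + sum of R_ij^T (D a_ij)
    den = np.full(Y.shape, mu, dtype=float) # denominator: mu + per-pixel patch coverage count
    recon = D @ A                           # denoised patches, one column per (i, j)
    for col, (i, j) in enumerate(positions):
        num[i:i + n, j:j + n] += recon[:, col].reshape(n, n)
        den[i:i + n, j:j + n] += 1.0
    return num / den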

Image Denoising, summarizing: We wanted to obtain three things: the weights α (via your favorite pursuit algorithm), the dictionary D (via K-SVD), and the denoised image X (via the closed-form solution), iterating among them.

Image Denoising: Here's what we want.

Comparing to Other Techniques: Non-Gaussian data, fit with PCA or ICA; which is which? PCA predicts data in some regions but doesn't predict it in others; ICA does pretty well. Images from Lewicki and Sejnowski, Learning Overcomplete Representations, 2000.

Comparing to Other Techniques: The data are still in a 2-D space, but ICA doesn't capture the underlying representation, which overcomplete representations can.

Summary: Overcomplete representations can be more powerful than component-analysis techniques. The dictionary can be learned from data. The pursuit algorithms have relative advantages and disadvantages.