EE 381V: Large Scale Optimization                                        Fall 2012
Lecture 24, April 11
Lecturer: Caramanis & Sanghavi                                  Scribe: Tao Huang

24.1 Review

In past classes we studied the problem of sparsity: we are given a set of basis vectors D (a dictionary, or design matrix) and a signal y, and we want to find a high-dimensional vector x with only a few nonzero entries. If our measurements are linear, the problem can be expressed as y = Dx, where x is the data we want to recover from the linear measurements y and the matrix D. Written as an optimization problem, this is

$$\min_x \|x\|_0 \quad \text{s.t.} \quad y = Dx,$$

or, allowing only approximate equality $y \approx Dx$,

$$\min_x \|x\|_0 \quad \text{s.t.} \quad \|y - Dx\|_p \le \varepsilon.$$

Alternatively, if we limit the sparsity to be s, the problem becomes

$$\min_x \|y - Dx\|_2^2 \quad \text{s.t.} \quad \|x\|_0 \le s.$$

24.2 Dictionary Learning

Now we look at the reverse problem: can we design dictionaries by learning? Our goal is to find a dictionary D that yields sparse representations for the training signals. We believe that such dictionaries have the potential to outperform commonly used predetermined dictionaries. With ever-growing computational capabilities, computational cost may become secondary in importance to the improved performance achievable by methods that adapt dictionaries to special classes of signals.

Given $y_1, y_2, \ldots, y_N$, where N is large, find a dictionary matrix D with $K \ll N$ columns such that every $y_i$ is a sparse combination of the columns of D. From now on, denote the dictionary matrix by D, with columns $\{d_k\}_{k=1}^K$; the collection of signals by the $m \times N$ matrix Y, whose columns are the points $y_i$; and the representation matrix by X, a $K \times N$ matrix whose columns $x_i$ are the representations of the $y_i$ in the dictionary D.
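To make the notation concrete, here is a minimal numpy sketch (not from the original notes; the sizes m, K, N and the sparsity level s are arbitrary) that builds a synthetic instance $Y \approx DX$ with s-sparse columns $x_i$ and evaluates the fit $\|Y - DX\|_F^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, K, N, s = 20, 50, 1000, 3              # illustrative sizes: signals in R^m, K atoms, N signals, s-sparse codes

# Dictionary D: m x K, with each column (atom) normalized to unit l2 norm
D = rng.standard_normal((m, K))
D /= np.linalg.norm(D, axis=0)

# Representation matrix X: K x N, each column has only s nonzero entries
X = np.zeros((K, N))
for i in range(N):
    support = rng.choice(K, size=s, replace=False)
    X[support, i] = rng.standard_normal(s)

# Signal matrix Y: m x N, columns y_i = D x_i plus a little noise
Y = D @ X + 0.01 * rng.standard_normal((m, N))

# The Frobenius-norm objective used throughout: ||Y - DX||_F^2 = sum_i ||y_i - D x_i||_2^2
print(np.linalg.norm(Y - D @ X, "fro") ** 2)
```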

Intuitively, the natural formulation is

$$\min_{D, \{x_i\}} \sum_i \|y_i - D x_i\|^2 \quad \text{s.t.} \quad \|x_i\|_0 \le s \;\; \forall i \qquad (24.1)$$

But formulation (24.1) has several issues to be considered:

Issue 1: there is a combinatorial sparsity constraint;

Issue 2: the optimization is over both D and $\{x_i\}$, so this is a non-convex problem due to the bilinearity in the objective function.

24.2.1 Dictionary Learning, Idea 1

Applying a Lagrangian relaxation to formulation (24.1) and relaxing the $\ell_0$ norm to the $\ell_1$ norm, we get

$$\min_{D, \{x_i\}} \|Y - DX\|_F^2 + \lambda \sum_i \|x_i\|_1 \qquad (24.2)$$

Here $\|Y - DX\|_F^2 = \sum_i \|y_i - D x_i\|^2$. Formulation (24.2) penalizes the $\ell_1$ norm of the coefficients, but does not penalize the entries of D the way it does those of x. Thus the solution will tend to inflate the dictionary entries in order to allow the coefficients to shrink toward zero. This difficulty is handled by constraining the $\ell_2$ norm of each basis element, so that the output variance of the coefficients is kept at an appropriate level.

An iterative method was suggested for solving (24.2). It includes two main steps in each iteration:

Sparse coding: fix D, calculate the coefficients $x_i$;

Dictionary update: fix $\{x_i\}$, update D.
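A minimal sketch of this alternation, using iterative soft-thresholding (ISTA) for the sparse coding step and a least-squares update followed by column renormalization (reflecting the $\ell_2$ constraint on the atoms) for the dictionary step; the function names, iteration counts, and the value of $\lambda$ are illustrative choices, not prescribed by the notes:

```python
import numpy as np

def soft_threshold(Z, tau):
    """Entrywise soft-thresholding, the proximal operator of tau * ||.||_1."""
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def sparse_coding_ista(Y, D, lam, n_iter=200):
    """Approximately minimize ||Y - DX||_F^2 + lam * sum_i ||x_i||_1 over X by ISTA."""
    L = 2.0 * np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the smooth part's gradient
    X = np.zeros((D.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ X - Y)         # gradient of ||Y - DX||_F^2 with respect to X
        X = soft_threshold(X - grad / L, lam / L)
    return X

def dictionary_update_ls(Y, X):
    """Least-squares update of D for fixed X, then renormalize each atom to unit l2 norm (heuristic)."""
    D = Y @ np.linalg.pinv(X)
    return D / (np.linalg.norm(D, axis=0, keepdims=True) + 1e-12)

def dictionary_learning(Y, K, lam=0.1, n_outer=20, seed=0):
    """Alternate the two steps: sparse coding with D fixed, dictionary update with X fixed."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((Y.shape[0], K))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_outer):
        X = sparse_coding_ista(Y, D, lam)      # sparse coding step
        D = dictionary_update_ls(Y, X)         # dictionary update step
    return D, X
```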

The iterative method leads to a local minimum, so the quality of the solution is sensitive to the initial D and $\{x_i\}$ we choose and to the methods we use to update the dictionary and the coefficients.

In the sparse coding step, a number of methods can be used: OMP, $\ell_1$ minimization, iterative thresholding, etc., which were more or less already covered in earlier classes.

In the dictionary update step, with $\{x_i\}$ fixed, we are left with updating D.

Option 1: directly solve the least squares problem

$$\min_D \|Y - DX\|_F^2 \qquad (24.3)$$

Issue: the current guess X might be lousy. By solving (24.3) exactly, the whole process gets pushed to the local minimum nearest the initial guess. The situation is similar to over-fitting the current observations in a regression model, where noise is unsuitably absorbed by the optimization and pushes the regression model in a bad direction.

Option 2: a simple gradient descent step with a small step size $\eta$:

$$D^{(n+1)} = D^{(n)} - \eta \sum_{i=1}^{N} (D^{(n)} x_i - y_i)\, x_i^\top$$

Initializing D: greedy algorithm

1. Pick the largest ($\ell_2$ norm) column of Y and move it from Y to D;
2. From all columns that remain in Y, subtract their orthogonal projection onto range(D).

Repeat (1) and (2) until a full basis D is found (see the sketch below).
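A small numpy sketch of the Option 2 gradient step and of the greedy initialization; the step size $\eta$, the normalization of the selected columns, and the function names are our choices, not from the notes:

```python
import numpy as np

def dictionary_gradient_step(D, Y, X, eta=1e-3):
    """One step D <- D - eta * sum_i (D x_i - y_i) x_i^T; note sum_i (D x_i - y_i) x_i^T = (DX - Y) X^T."""
    return D - eta * (D @ X - Y) @ X.T

def greedy_init(Y, K):
    """Greedy initialization: repeatedly take the largest-norm residual column as a new atom,
    then remove from the remaining columns their projection onto range(D)."""
    R = Y.astype(float).copy()
    atoms = []
    for _ in range(K):
        j = np.argmax(np.linalg.norm(R, axis=0))            # step (1): largest l2-norm column
        d = R[:, j] / (np.linalg.norm(R[:, j]) + 1e-12)     # normalize the new atom (our choice)
        atoms.append(d)
        # step (2): subtract the projection onto the new atom; earlier atoms were already removed,
        # so the residual stays orthogonal to range(D), and the selected column becomes (numerically)
        # zero, which plays the role of moving it out of Y.
        R = R - np.outer(d, d @ R)
    return np.stack(atoms, axis=1)
```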

24.2.2 Dictionary Learning, Idea 2: Unions of orthonormal dictionaries

Consider a dictionary composed as a union of orthonormal bases,

$$D = [D_1, \ldots, D_L],$$

where $D_i \in O(m)$ (m is the dimension of the $y_i$'s), that is, $D_i^\top D_i = D_i D_i^\top = I_{m \times m}$ for $i = 1, \ldots, L$.

Correspondingly, divide X into L submatrices, each submatrix $X_i$ containing the coefficients of $D_i$ (stacked vertically):

$$X = [X_1^\top, X_2^\top, \ldots, X_L^\top]^\top$$

The update algorithm needs to preserve the orthonormality of the $D_i$'s in each iteration. Assuming known coefficients, the proposed algorithm updates each orthonormal basis $D_i$ sequentially. The update of $D_i$ is done by first computing the residual matrix

$$E_i = Y - \sum_{j \ne i} D_j X_j.$$

Then we solve $\min_{D_i \in O(m)} \|E_i - D_i X_i\|_F^2$ for the updated $D_i$.
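This constrained update is an orthogonal Procrustes problem: since $D_i$ is orthogonal, $\|E_i - D_i X_i\|_F^2 = \|E_i\|_F^2 + \|X_i\|_F^2 - 2\,\mathrm{tr}(D_i^\top E_i X_i^\top)$, and the minimizer over $O(m)$ is $D_i = U V^\top$, where $E_i X_i^\top = U \Sigma V^\top$ is an SVD. A minimal numpy sketch of one sweep over the L blocks (function names are ours):

```python
import numpy as np

def update_orthonormal_block(E_i, X_i):
    """Solve min over orthogonal D_i of ||E_i - D_i X_i||_F^2 (orthogonal Procrustes):
    the minimizer is D_i = U V^T, where E_i X_i^T = U S V^T."""
    U, _, Vt = np.linalg.svd(E_i @ X_i.T)
    return U @ Vt

def update_union_of_bases(Y, D_blocks, X_blocks):
    """One sequential sweep over the L orthonormal blocks.
    Y is m x N; D_blocks is a list of m x m orthogonal matrices; X_blocks is a list of m x N blocks."""
    L = len(D_blocks)
    for i in range(L):
        # residual after removing the contribution of every other basis
        E_i = Y - sum(D_blocks[j] @ X_blocks[j] for j in range(L) if j != i)
        D_blocks[i] = update_orthonormal_block(E_i, X_blocks[i])
    return D_blocks
```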

24.3 K-SVD algorithm

Overall, this algorithm is a generalization of the K-means algorithm for clustering the points $\{y_i\}_{i=1}^N$.

K-means algorithm

In K-means, we are first given K centers, denoted $C = [c_1, c_2, \ldots, c_K]$; this can also be seen as a basis or dictionary. The index j of point $y_i$ is selected by finding j such that

$$\|y_i - C e_j\|_2^2 \le \|y_i - C e_k\|_2^2 \quad \forall k \ne j.$$

This is analogous to the sparse coding stage: we force the sparsity of each coefficient vector to be 1, and the nonzero component to equal 1. Written as an optimization problem in matrix form,

$$\min_X \|Y - CX\|_F^2 \quad \text{s.t.} \quad x_i = e_k \text{ for some } k, \;\; \forall i \quad (\text{i.e., } \|x_i\|_0 = 1).$$

In the stage that updates $\{c_k\}$, let $R_k$ be the set of indices of the points clustered into the k-th cluster; then

$$c_k = \frac{1}{|R_k|} \sum_{i \in R_k} y_i = \arg\min_{c} \sum_{i \in R_k} \|y_i - c\|_2^2.$$

Combining all K such optimization problems, the update of C is

$$C = \arg\min_{C} \|Y - CX\|_F^2.$$

Remember that here X comes from the sparse coding stage, and it records which cluster each point of Y lies in. Denote the k-th row of X by $x_T^k$; then

$$Y - CX = Y - \sum_k c_k x_T^k = \Big(Y - \sum_{j \ne k} c_j x_T^j\Big) - c_k x_T^k.$$

Since the whole problem decomposes into subproblems of finding each $c_k$, it is equivalent to sequentially searching for the $c_k$ that minimizes $\|E_k - c_k x_T^k\|_F^2$, where $E_k = Y - \sum_{j \ne k} c_j x_T^j$; up to terms that do not involve $c_k$, this equals $\sum_{i \in R_k} \|y_i - c_k\|_2^2$.

Back to the dictionary learning problem: defining $E_k = Y - \sum_{j \ne k} d_j x_T^j$, we have

$$Y - DX = \Big(Y - \sum_{j \ne k} d_j x_T^j\Big) - d_k x_T^k = E_k - d_k x_T^k,$$

so $\|Y - DX\|_F^2 = \|E_k - d_k x_T^k\|_F^2$. In the dictionary update step, analogous to K-means, assuming X is known, we can sequentially solve

$$\min_{d_k} \|E_k - d_k x_T^k\|_F^2$$

to update the columns of D.

K-SVD algorithm:

Data: $\{y_i\}_{i=1}^N$

Initialization: $D^{(0)} \in \mathbb{R}^{m \times K}$, with each column normalized; t = 1

Repeat until convergence:

1. Do sparse coding: solve for $\{x_i\}_{i=1}^N$ using a suitable algorithm;
2. For each column $k = 1, \ldots, K$ of $D^{(t-1)}$:
   1. Define $R_k = \{i : 1 \le i \le N, \ x_T^k(i) \ne 0\}$;
   2. Compute $E_k = Y - \sum_{j \ne k} d_j x_T^j$ and extract the columns with indices in $R_k$, denoted $E_k^R \in \mathbb{R}^{m \times |R_k|}$;
   3. Apply the SVD $E_k^R = U \Sigma V^\top$; choose the updated dictionary column $d_k$ to be the first column of U, and update the nonzero entries of the coefficient row $x_T^k$ with $x_R^k$, where $x_R^k$ is the first column of V multiplied by $\sigma_1(E_k^R)$.
3. Set t = t + 1.
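A minimal numpy sketch of the column-wise update in step 2, assuming the sparse coding step has already produced X; skipping unused atoms (empty $R_k$) is our convention, not from the notes:

```python
import numpy as np

def ksvd_atom_update(Y, D, X, k):
    """Update atom d_k and the nonzero entries of row x_T^k via a rank-1 SVD (step 2 of K-SVD).
    Shapes: Y is m x N, D is m x K, X is K x N; D and X are modified in place."""
    R_k = np.nonzero(X[k, :])[0]                 # indices i with x_T^k(i) != 0
    if R_k.size == 0:
        return D, X                              # atom unused this round: skip it (our convention)
    # residual without atom k: E_k = Y - sum_{j != k} d_j x_T^j = Y - DX + d_k x_T^k
    E_k = Y - D @ X + np.outer(D[:, k], X[k, :])
    E_kR = E_k[:, R_k]                           # restrict to the signals that actually use atom k
    U, S, Vt = np.linalg.svd(E_kR, full_matrices=False)
    D[:, k] = U[:, 0]                            # updated atom: first left singular vector
    X[k, R_k] = S[0] * Vt[0, :]                  # updated coefficients: sigma_1 times first right singular vector
    return D, X

def ksvd_dictionary_update(Y, D, X):
    """One sweep over all K columns of D (the inner loop of step 2)."""
    for k in range(D.shape[1]):
        D, X = ksvd_atom_update(Y, D, X, k)
    return D, X
```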