Efficient Data-Driven Learning of Sparse Signal Models and Its Applications


Efficient Data-Driven Learning of Sparse Signal Models and Its Applications. Saiprasad Ravishankar, Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor. Dec 10, 2015.

Outline of Talk: Synthesis and transform models. Transform learning: efficient, scalable, effective, with guarantees. Transform learning methods: union of transforms (OCTOBOS) learning; online transform learning for big data. Applications: compression, denoising, compressed sensing, classification. Conclusions. (Fig. from B. Wen, UIUC.)

Synthesis Model for Sparse Representation. Given a signal $y \in \mathbb{R}^n$ and a dictionary $D \in \mathbb{R}^{n \times K}$, we assume $y = Dx$ with $\|x\|_0 \ll K$. Real-world signals are modeled as $y = Dx + e$, where $e$ is a deviation term. Given $D$ and a sparsity level $s$, the synthesis sparse coding problem is
$$\hat{x} = \arg\min_x \|y - Dx\|_2^2 \quad \text{s.t. } \|x\|_0 \le s$$
This problem is NP-hard, and dictionary-based approaches are often computationally expensive.

Transform Model for Sparse Representation. Given a signal $y \in \mathbb{R}^n$ and a transform $W \in \mathbb{R}^{m \times n}$, we model $Wy = x + \eta$ with $\|x\|_0 \ll m$ and $\eta$ an error term. Natural signals are approximately sparse in wavelets, DCT. Given $W$ and sparsity $s$, transform sparse coding is
$$\hat{x} = \arg\min_x \|Wy - x\|_2^2 \quad \text{s.t. } \|x\|_0 \le s$$
Here $\hat{x} = H_s(Wy)$ is computed by thresholding $Wy$ to its $s$ largest-magnitude elements. Sparse coding is cheap! The signal is recovered as $W^{\dagger}\hat{x}$. Sparsifying transforms are exploited for compression (JPEG2000), etc.
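To make the thresholding operator $H_s$ concrete, here is a minimal NumPy sketch (the function and variable names are illustrative, not taken from the talk's code):

```python
import numpy as np

def transform_sparse_code(W, y, s):
    """Transform sparse coding: keep the s largest-magnitude entries of W @ y."""
    z = W @ y                                # transform-domain representation
    x = np.zeros_like(z)
    if s > 0:
        idx = np.argsort(np.abs(z))[-s:]     # indices of the s largest magnitudes
        x[idx] = z[idx]                      # zero out everything else
    return x

# Tiny usage example with a random square transform
rng = np.random.default_rng(0)
n, s = 16, 3
W = rng.standard_normal((n, n))
y = rng.standard_normal(n)
x_hat = transform_sparse_code(W, y, s)
y_rec = np.linalg.pinv(W) @ x_hat            # recover the signal as W^† x̂
```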

Key Topic of Talk: Sparsifying Transform Learning. Square transform models: unstructured transform learning [IEEE TSP, 2013 & 2015]; doubly sparse transform learning [IEEE TIP, 2013]; online learning for big data [IEEE JSTSP, 2015]; convex formulations for transform learning [ICASSP, 2014]. Overcomplete transform models: unstructured overcomplete transform learning [ICASSP, 2013]; learning structured overcomplete transforms with block cosparsity (OCTOBOS) [IJCV, 2015]. Applications: image & video denoising, classification, magnetic resonance imaging (MRI) [SPIE 2015, ICIP 2015].

Square Transform Learning Formulation [Ravishankar & Bresler '12]:
$$(P0)\quad \min_{W,X}\ \underbrace{\|WY - X\|_F^2}_{\text{sparsification error}} + \underbrace{\lambda\big(\|W\|_F^2 - \log|\det W|\big)}_{\text{regularizer } v(W)} \quad \text{s.t. } \|X_i\|_0 \le s\ \forall i$$
$Y = [Y_1\ Y_2\ \ldots\ Y_N] \in \mathbb{R}^{n \times N}$: matrix of training signals. $X = [X_1\ X_2\ \ldots\ X_N] \in \mathbb{R}^{n \times N}$: matrix of sparse codes of the $Y_i$. The sparsification error measures the deviation of the data from perfect sparsity in the transform domain. $\lambda > 0$. The regularizer $v(W)$ prevents trivial solutions and fully controls the condition number of $W$. (P0) is limited by using a single $W$ for all the data.

Why Union of Transforms? Natural images typically have diverse features or textures.

Why Union of Transforms? Union of transforms: one for each class of textures or features.

OCTOBOS Learning Formulation:
$$(P1)\quad \min_{\{W_k, X_i, C_k\}}\ \underbrace{\sum_{k=1}^{K} \sum_{i \in C_k} \|W_k Y_i - X_i\|_2^2}_{\text{sparsification error}} + \underbrace{\sum_{k=1}^{K} \lambda_k \big(\|W_k\|_F^2 - \log|\det W_k|\big)}_{\text{regularizer}} \quad \text{s.t. } \|X_i\|_0 \le s\ \forall i,\ \{C_k\}_{k=1}^{K} \in G$$
$C_k$ is the set of indices of signals in class $k$. $G$ is the set of all possible partitions of $[1:N]$ into $K$ disjoint subsets. (P1) jointly learns the union of transforms $\{W_k\}$ and clusters the data $Y$. The regularizer is necessary to control the scaling and conditioning ($\kappa$) of the transforms. Setting $\lambda_k = \lambda_0 \|Y_{C_k}\|_F^2$, with $Y_{C_k}$ the matrix of all $Y_i$, $i \in C_k$, achieves scale invariance of the solution of (P1). As $\lambda_0 \to \infty$, $\kappa(W_k) \to 1$ and $\|W_k\|_2 \to 1/\sqrt{2}$ for all $k$, for solutions of (P1).

Alternating Minimization Algorithm for (P1):
$$(P1)\quad \min_{\{W_k, X_i, C_k\}}\ \underbrace{\sum_{k=1}^{K} \sum_{i \in C_k} \|W_k Y_i - X_i\|_2^2}_{\text{sparsification error}} + \underbrace{\sum_{k=1}^{K} \lambda_k \big(\|W_k\|_F^2 - \log|\det W_k|\big)}_{\text{regularizer}} \quad \text{s.t. } \|X_i\|_0 \le s\ \forall i,\ \{C_k\}_{k=1}^{K} \in G$$
The algorithm alternates between a transform update step and a sparse coding & clustering step.

Alternating OCTOBOS Learning Algorithm, Step 1. Transform Update: solve for only the $\{W_k\}$ in (P1):
$$\min_{\{W_k\}} \sum_{k=1}^{K} \Big\{ \sum_{i \in C_k} \|W_k Y_i - X_i\|_2^2 + \lambda_k v(W_k) \Big\} \qquad (1)$$
Closed-form solution using the singular value decomposition (SVD):
$$\hat{W}_k = 0.5\, R_k \big(\Sigma_k + (\Sigma_k^2 + 2\lambda_k I)^{\frac{1}{2}}\big) V_k^T L_k^{-1}, \quad \forall k \qquad (2)$$
Here $I$ is the identity matrix and $\lambda_k = \lambda_0 \|Y_{C_k}\|_F^2$. The matrix $L_k$ is a matrix square root satisfying $Y_{C_k} Y_{C_k}^T + \lambda_k I = L_k L_k^T$, and the full SVD is $L_k^{-1} Y_{C_k} X_{C_k}^T = V_k \Sigma_k R_k^T$.
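A minimal NumPy sketch of this closed-form update for a single cluster, following the slide's formula (the names Y_c, X_c, lam are placeholders):

```python
import numpy as np

def transform_update(Y_c, X_c, lam):
    """Closed-form transform update for one cluster, per Eq. (2).

    Y_c: n x Nc data matrix of the cluster, X_c: n x Nc sparse codes,
    lam: the regularization weight lambda_k."""
    n = Y_c.shape[0]
    # L L^T = Y_c Y_c^T + lam * I   (L is a matrix square root)
    L = np.linalg.cholesky(Y_c @ Y_c.T + lam * np.eye(n))
    Linv = np.linalg.inv(L)
    # Full SVD: L^{-1} Y_c X_c^T = V Sigma R^T
    V, sig, Rt = np.linalg.svd(Linv @ Y_c @ X_c.T)
    Sigma = np.diag(sig)
    # W = 0.5 * R (Sigma + (Sigma^2 + 2 lam I)^{1/2}) V^T L^{-1}
    return 0.5 * Rt.T @ (Sigma + np.sqrt(Sigma**2 + 2 * lam * np.eye(n))) @ V.T @ Linv
```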

Alternating OCTOBOS Learning Algorithm, Step 2. Sparse Coding & Clustering: solve for only the $\{C_k, X_i\}$ in (P1):
$$\min_{\{C_k\},\{X_i\}} \sum_{k=1}^{K} \sum_{i \in C_k} \Big\{ \|W_k Y_i - X_i\|_2^2 + \lambda_0 \|Y_i\|_2^2\, v(W_k) \Big\} \quad \text{s.t. } \|X_i\|_0 \le s\ \forall i,\ \{C_k\} \in G \qquad (3)$$
Exact clustering: the global optimum $\{\hat{C}_k\}$ in (3) is found as
$$\{\hat{C}_k\} = \arg\min_{\{C_k\}} \sum_{k=1}^{K} \sum_{i \in C_k} \underbrace{\Big\{ \|W_k Y_i - H_s(W_k Y_i)\|_2^2 + \lambda_0 \|Y_i\|_2^2\, v(W_k) \Big\}}_{\text{clustering measure } M_{k,i}} \qquad (4)$$
For each $Y_i$, the optimal cluster index is $\hat{k}_i = \arg\min_k M_{k,i}$. Exact and cheap sparse coding: $\hat{X}_i = H_s(W_k Y_i)$ for all $i \in \hat{C}_k$ and all $k$.
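A minimal sketch of the clustering-and-sparse-coding step, assuming square transforms and the regularizer $v(W) = \|W\|_F^2 - \log|\det W|$ (all names are illustrative placeholders):

```python
import numpy as np

def cluster_and_sparse_code(Ws, Y, s, lam0):
    """Assign each signal to the transform with the smallest clustering measure
    M_{k,i} (Eq. (4)) and sparse code it by hard thresholding."""
    K, N = len(Ws), Y.shape[1]

    def H_s(z):                               # keep the s largest-magnitude entries
        x = np.zeros_like(z)
        idx = np.argsort(np.abs(z))[-s:]
        x[idx] = z[idx]
        return x

    def v(W):                                 # regularizer v(W)
        return np.linalg.norm(W, 'fro')**2 - np.log(abs(np.linalg.det(W)))

    labels = np.zeros(N, dtype=int)
    X = np.zeros_like(Y)
    for i in range(N):
        best = np.inf
        for k, W in enumerate(Ws):
            z = W @ Y[:, i]
            M = np.linalg.norm(z - H_s(z))**2 + lam0 * np.linalg.norm(Y[:, i])**2 * v(W)
            if M < best:
                best, labels[i], X[:, i] = M, k, H_s(z)
    return labels, X
```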

Computational Advantages of OCTOBOS. Cost per iteration for learning an OCTOBOS $W \in \mathbb{R}^{Kn \times n}$, assuming the number of training signals $N \gg m = Kn$: the clustering & sparse coding step costs $O(mnN)$, the transform update step costs $O(n^2 N)$, so the cost is dominated by clustering.

Per-iteration cost comparison (with $m, s \propto n$): square transform $W \in \mathbb{R}^{n \times n}$: $O(n^2 N)$; OCTOBOS $W \in \mathbb{R}^{m \times n}$: $O(n^2 N)$; K-SVD dictionary $D \in \mathbb{R}^{n \times m}$: $O(n^3 N)$.

In practice, OCTOBOS learning converges in a few iterations, and is cheaper than dictionary learning by K-SVD [Aharon et al. '06].

Global Convergence Guarantees for OCTOBOS. Consider the formulation (P1) above. The alternating OCTOBOS learning algorithm is globally convergent to the set of partial minimizers of the objective in (P1). These partial minimizers are global minimizers with respect to $\{W_k\}$ and $\{X_i, C_k\}$, respectively, and local minimizers with respect to $\{W_k, X_i\}$. Under certain (mild) conditions, the algorithm converges to the set of stationary points of the equivalent objective
$$f(W) \triangleq \sum_{i=1}^{N} \min_k \Big\{ \|W_k Y_i - H_s(W_k Y_i)\|_2^2 + \lambda_0\, v(W_k)\, \|Y_i\|_2^2 \Big\}$$

Algorithm Insensitive to Initializations. [Figure: objective function and sparsification error versus iteration number, for 8x8 patches and K = 2, under various initializations of {W_k} (KLT, DCT, random, identity) and of {C_k} (random, k-means, equal, single), including the K = 1 (single-transform) case.]

Visualization of Learned OCTOBOS. The square blocks of a learnt OCTOBOS, i.e., the cluster-specific $W_k$, are NOT similar to one another. OCTOBOS transforms $W$ learned with different initializations appear different, yet sparsify the data equally well. [Figure: cross-gram matrix between $W_1$ and $W_2$ for the KLT initialization; learnt transforms for the random-matrix and KLT initializations.]

Application: Unsupervised Classification. The overlapping image patches are first clustered by OCTOBOS learning. Each image pixel is then classified by a majority vote among the patches that cover that pixel. [Figure: clustering results for two example images, comparing the input image, K-means, and OCTOBOS.]

Application: Compressed Sensing (CS). CS enables accurate recovery of images from far fewer measurements than the number of unknowns. It requires sparsity of the image in a transform domain or dictionary, a measurement procedure that is incoherent with the transform, and non-linear reconstruction. The reconstruction problem (NP-hard) is
$$\min_x\ \underbrace{\|Ax - y\|_2^2}_{\text{data fidelity}} + \lambda \underbrace{\|\Psi x\|_0}_{\text{regularizer}} \qquad (5)$$
where $x \in \mathbb{C}^P$ is the vectorized image, $y \in \mathbb{C}^m$ the measurements ($m < P$), $A$ a fat sensing matrix, and $\Psi$ a transform; the $\ell_0$ "norm" counts non-zeros. CS with a non-adaptive regularizer is limited to low undersampling in MRI.

UNITE-BCS: Union of Transforms Blind CS.
$$(P2)\quad \min_{x,\, B,\, \{W_k, C_k\}}\ \underbrace{\nu \|Ax - y\|_2^2}_{\text{data fidelity}} + \underbrace{\sum_{k=1}^{K} \sum_{j \in C_k} \|W_k R_j x - b_j\|_2^2}_{\text{sparsification error}} + \underbrace{\eta^2 \sum_{k=1}^{K} \sum_{j \in C_k} \|b_j\|_0}_{\text{sparsity penalty}}$$
$$\text{s.t. } W_k^H W_k = I\ \forall k,\quad \{C_k\} \in G,\quad \|x\|_2 \le C$$
$R_j \in \mathbb{R}^{n \times P}$ extracts patches. $W_k \in \mathbb{C}^{n \times n}$ is a cluster-specific transform, with $W_k R_j x \approx b_j$ for $j \in C_k$, $\forall k$, and $b_j \in \mathbb{C}^n$ sparse; $B \triangleq [b_1\ b_2\ \ldots\ b_N]$. (P2) learns a union of unitary transforms, reconstructs $x$, and clusters the patches of $x$, using only the undersampled $y$: the model is adaptive to the underlying image. $\|x\|_2 \le C$ is an energy or range constraint, $C > 0$.

Block Coordinate Descent (BCD) Algorithm for (P2). Sparse Coding & Clustering: solve for only $\{C_k\}$ and $B$ in (P2):
$$\min_{\{C_k\},\{b_j\}} \sum_{k=1}^{K} \sum_{j \in C_k} \Big\{ \|W_k R_j x - b_j\|_2^2 + \eta^2 \|b_j\|_0 \Big\} \quad \text{s.t. } \{C_k\} \in G \qquad (6)$$
Exact clustering: the global optimum $\{\hat{C}_k\}$ in (6) is found as
$$\{\hat{C}_k\} = \arg\min_{\{C_k\}} \sum_{k=1}^{K} \sum_{j \in C_k} \underbrace{\Big\{ \|W_k R_j x - H_\eta(W_k R_j x)\|_2^2 + \eta^2 \|H_\eta(W_k R_j x)\|_0 \Big\}}_{\text{clustering measure } M_{k,j}} \qquad (7)$$
For patch $R_j x$, the optimal cluster index is $\hat{k}_j = \arg\min_k M_{k,j}$. Exact sparse coding by hard thresholding: $\hat{b}_j = H_\eta(W_k R_j x)$ for all $j \in \hat{C}_k$ and all $k$.
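Note that $H_\eta$ here zeroes out all entries with magnitude below $\eta$, unlike $H_s$, which keeps the $s$ largest entries. A minimal sketch with illustrative names:

```python
import numpy as np

def H_eta(z, eta):
    """Hard thresholding at level eta: zero entries with magnitude below eta."""
    return np.where(np.abs(z) >= eta, z, 0)

def clustering_measure(W, patch, eta):
    """Clustering measure M_{k,j} for one patch and one transform, per Eq. (7)."""
    z = W @ patch
    b = H_eta(z, eta)
    return np.linalg.norm(z - b)**2 + eta**2 * np.count_nonzero(b)
```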

BCD Algorithm: Transform Update Step. Solve (P2) for $\{W_k\}$. For each $k$, solve
$$\min_{W_k} \|W_k X_{C_k} - B_{C_k}\|_F^2 \quad \text{s.t. } W_k^H W_k = I \qquad (8)$$
where $X_{C_k}$ is the matrix with columns $R_j x$ for $j \in C_k$, and $B_{C_k}$ is the matrix of the corresponding sparse codes. Closed-form solution:
$$\hat{W}_k = V U^H \qquad (9)$$
where $X_{C_k} B_{C_k}^H = U \Sigma V^H$ is an SVD.
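This is a Procrustes-type problem; a minimal NumPy sketch of Eq. (9), with X_c and B_c as placeholder names for the cluster's patch and sparse-code matrices:

```python
import numpy as np

def unitary_transform_update(X_c, B_c):
    """Closed-form unitary update: W = V U^H, where X_c B_c^H = U Sigma V^H."""
    U, _, Vh = np.linalg.svd(X_c @ B_c.conj().T)
    return Vh.conj().T @ U.conj().T
```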

BCD Algorithm: Image Update Step. Solve (P2) for $x$ with the other variables fixed:
$$\min_x\ \nu \|Ax - y\|_2^2 + \sum_{k=1}^{K} \sum_{j \in C_k} \|W_k R_j x - b_j\|_2^2 \quad \text{s.t. } \|x\|_2 \le C \qquad (10)$$
This is a least squares problem with an $\ell_2$ norm constraint. Solve the Lagrangian formulation of the least squares problem via the normal equation:
$$\Big( \sum_{j=1}^{N} R_j^T R_j + \nu A^H A + \hat{\mu} I \Big) x = \sum_{k=1}^{K} \sum_{j \in C_k} R_j^T W_k^H b_j + \nu A^H y \qquad (11)$$
The optimal multiplier $\hat{\mu} \in \mathbb{R}_+$ is the smallest real value such that $\|\hat{x}\|_2 \le C$. Both $\hat{\mu}$ and $\hat{x}$ can be found cheaply in applications such as MRI.
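A toy dense-matrix sketch of solving the normal equation (11) for a fixed multiplier $\mu$; in applications such as MRI the same system is instead solved cheaply (e.g., in the Fourier domain), which this illustration does not attempt, and all names are assumptions:

```python
import numpy as np

def image_update(A, y, Ws, R_list, labels, B, nu, mu=0.0):
    """Solve (sum_j R_j^T R_j + nu A^H A + mu I) x = sum_j R_j^T W_k^H b_j + nu A^H y.

    A: m x P sensing matrix, R_list[j]: n x P patch extractor for patch j,
    Ws[labels[j]]: the transform assigned to patch j, B[:, j]: its sparse code."""
    P = A.shape[1]
    lhs = nu * A.conj().T @ A + mu * np.eye(P)
    rhs = nu * A.conj().T @ y
    for j, Rj in enumerate(R_list):
        Wk = Ws[labels[j]]
        lhs = lhs + Rj.T @ Rj
        rhs = rhs + Rj.T @ Wk.conj().T @ B[:, j]
    return np.linalg.solve(lhs, rhs)
```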

BCS Convergence Guarantees: Notation. Define the barrier function
$$\varphi(W) = \begin{cases} 0, & W^H W = I \\ +\infty, & \text{else} \end{cases}$$
and let $\chi(x)$ be the barrier function corresponding to the constraint $\|x\|_2 \le C$. Then (P2) can be written in the unconstrained form
$$h(W, B, \Gamma, x) = \nu \|Ax - y\|_2^2 + \sum_{k=1}^{K} \sum_{j \in C_k} \Big\{ \|W_k R_j x - b_j\|_2^2 + \eta^2 \|b_j\|_0 \Big\} + \sum_{k=1}^{K} \varphi(W_k) + \chi(x)$$
where the OCTOBOS transform $W$ is obtained by stacking the $W_k$'s, and $\Gamma$ is a row vector whose entries are the cluster indices of the patches.

Union of Transforms BCS Convergence Guarantees. Theorem 1. Let $\{W^t, B^t, \Gamma^t, x^t\}$ be the sequence generated by the BCD algorithm with initialization $(W^0, B^0, \Gamma^0, x^0)$. Then:
$\{h(W^t, B^t, \Gamma^t, x^t)\}$ converges to $h^* = h^*(W^0, B^0, \Gamma^0, x^0)$.
$\{W^t, B^t, \Gamma^t, x^t\}$ is bounded, and all its accumulation points are equivalent, i.e., they achieve the same value $h^*$ of the objective.
$\|x^t - x^{t-1}\|_2 \to 0$ as $t \to \infty$.
Every accumulation point $(W, B, \Gamma, x)$ satisfies the partial global optimality conditions
$$x \in \arg\min_{\tilde{x}}\ h(W, B, \Gamma, \tilde{x}), \qquad W \in \arg\min_{\tilde{W}}\ h(\tilde{W}, B, \Gamma, x) \qquad (12)$$
$$(B, \Gamma) \in \arg\min_{\tilde{B}, \tilde{\Gamma}}\ h(W, \tilde{B}, \tilde{\Gamma}, x) \qquad (13)$$

UNITE-BCS Convergence Guarantees. Theorem 2. Each accumulation point $(W, B, \Gamma, x)$ of $\{W^t, B^t, \Gamma^t, x^t\}$ also satisfies the following partial local optimality condition for all $\Delta x \in \mathbb{C}^P$ and all $\Delta B \in \mathbb{C}^{n \times N}$ with entries of magnitude smaller than $\eta/2$:
$$h(W, B + \Delta B, \Gamma, x + \Delta x) \ge h(W, B, \Gamma, x) = h^* \qquad (14)$$

UNITE-BCS Global Convergence Guarantees. Corollary 1: For each initialization, the iterate sequence of the BCD algorithm converges to an equivalence class (with the same objective value) of accumulation points that are also partial global and partial local minimizers of the objective. Corollary 2: The BCD algorithm is globally convergent to (a subset of) the set of partial minimizers of the objective.

CS MRI Example, 2.5x Undersampling (K = 3). [Figure: sampling mask; zero-filled reconstruction (24.9 dB); UNITE-MRI reconstruction (37.3 dB); reference image.]

Convergence Behavior: UTMRI (K = 1) and UNITE-MRI (K = 3). [Figure: objective function, sparsity fraction of B, $\|x^t - x^{t-1}\|_2 / \|x_{\mathrm{ref}}\|_2$, and UNITE-MRI PSNR and HFEN, plotted versus iteration number for both methods.]

UNITE-MRI Clustering with K = 3. [Figure: UNITE-MRI reconstruction; the three patch clusters; real and imaginary parts of the learnt W for cluster 2.]

UNITE-MRI Clustering with K = 4. [Figure: the four patch clusters; real and imaginary parts of the learnt W for cluster 4.]

Reconstructions, Cartesian 2.5x Undersampling (K = 16). [Figure: UNITE-MRI reconstruction (37.4 dB, 631 s); error maps for DLMRI [Ravishankar & Bresler '11] (36.6 dB, 1797 s), UNITE-MRI, and UTMRI (37.2 dB, 125 s).]

Example, Cartesian 2.5x Undersampling (K = 16). [Figure: sampling mask; plot of PSNR (dB) versus the number of clusters K; error maps for UTMRI (42.5 dB) and UNITE-MRI (44.3 dB).]

Online Transform Learning

Why Online Transform Learning? Batch learning uses all the training data simultaneously. With big data, large training sets make batch learning computationally expensive in time and memory. In real-time or streaming applications, data arrives sequentially and must be processed sequentially to limit latency. Online learning uses sequential model adaptation and signal reconstruction, with cheap computations and modest memory requirements.


Online Transform Learning Formulation. For $t = 1, 2, 3, \ldots$, solve
$$(P3)\quad \{\hat{W}_t, \hat{x}_t\} = \arg\min_{W, x_t} \frac{1}{t} \sum_{j=1}^{t} \Big\{ \|W y_j - x_j\|_2^2 + \lambda_j v(W) \Big\} \quad \text{s.t. } \|x_t\|_0 \le s,\ x_j = \hat{x}_j \text{ for } 1 \le j \le t-1$$
Here $\lambda_j = \lambda_0 \|y_j\|_2^2$ with $\lambda_0 > 0$, and $v(W) \triangleq \|W\|_F^2 - \log|\det W|$. $\lambda_0$ controls the condition number and scaling of the learned $\hat{W}_t$. $\hat{W}_t^{-1} \hat{x}_t$ is an (e.g., denoised) estimate of $y_t$ computed efficiently. For non-stationary data, a forgetting factor $\rho \in [0, 1]$ diminishes the influence of old data:
$$\frac{1}{t} \sum_{j=1}^{t} \rho^{\,t-j} \Big\{ \|W y_j - x_j\|_2^2 + \lambda_j v(W) \Big\} \qquad (15)$$

Mini-Batch Transform Learning. For $J = 1, 2, 3, \ldots$, solve
$$(P4)\quad \{\hat{W}_J, \hat{X}_J\} = \arg\min_{W, X_J} \frac{1}{JM} \sum_{j=1}^{J} \Big\{ \|W Y_j - X_j\|_F^2 + \Lambda_j v(W) \Big\} \quad \text{s.t. } \|x_{JM-M+i}\|_0 \le s,\ 1 \le i \le M$$
Here $Y_J = [y_{JM-M+1}\ y_{JM-M+2}\ \ldots\ y_{JM}]$, with $M$ the mini-batch size, $X_J = [x_{JM-M+1}\ x_{JM-M+2}\ \ldots\ x_{JM}]$, and $\Lambda_j = \lambda_0 \|Y_j\|_F^2$. Mini-batch learning can provide reductions in operation count over online learning, at the cost of increased latency and memory requirements. Alternative: the sparsity constraints can be replaced with $\ell_0$ penalties.

Online Transform Learning Algorithm.
Sparse coding: solve for $x_t$ in (P3) with fixed $W = \hat{W}_{t-1}$:
$$\min_{x_t} \|W y_t - x_t\|_2^2 \quad \text{s.t. } \|x_t\|_0 \le s \qquad (16)$$
Cheap solution: $\hat{x}_t = H_s(W y_t)$.
Transform update: solve for $W$ in (P3) with $x_t = \hat{x}_t$:
$$\min_W \frac{1}{t} \sum_{j=1}^{t} \Big\{ \|W y_j - x_j\|_2^2 + \lambda_j \big(\|W\|_F^2 - \log|\det W|\big) \Big\} \qquad (17)$$
$$\hat{W}_t = 0.5\, R_t \big(\Sigma_t + (\Sigma_t^2 + 2\beta_t I)^{\frac{1}{2}}\big) Q_t^T L_t^{-1} \qquad (18)$$
where $\frac{1}{t} \sum_{j=1}^{t} \big( y_j y_j^T + \lambda_0 \|y_j\|_2^2 I \big) = L_t L_t^T$ (performed as a rank-1 update), $\beta_t = \frac{\lambda_0}{t} \sum_{j=1}^{t} \|y_j\|_2^2$, and $Q_t \Sigma_t R_t^T$ is the full SVD of $L_t^{-1} \Theta_t = \frac{1}{t} \sum_{j=1}^{t} L_t^{-1} y_j x_j^T$. Since $L_t^{-1} \Theta_t \approx (1 - t^{-1}) L_{t-1}^{-1} \Theta_{t-1} + t^{-1} L_t^{-1} y_t x_t^T$, a rank-1 SVD update suffices: no matrix-matrix products are needed, and the approximation error is bounded and cheaply monitored.
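A simplified sketch of one online step that recomputes the factorization from the running statistics at each update, rather than performing the rank-1 SVD update described above (that update is an efficiency refinement not reproduced here; all names are illustrative):

```python
import numpy as np

class OnlineTransformLearner:
    """Simplified online transform learning, following Eqs. (16)-(18)."""

    def __init__(self, n, s, lam0):
        self.n, self.s, self.lam0 = n, s, lam0
        self.t = 0
        self.gram = np.zeros((n, n))    # running sum of y y^T + lam0 ||y||^2 I
        self.theta = np.zeros((n, n))   # running sum of y x^T
        self.beta_sum = 0.0             # running sum of lam0 ||y||^2
        self.W = np.eye(n)

    def step(self, y):
        # Sparse coding: keep the s largest-magnitude entries of W y
        z = self.W @ y
        x = np.zeros_like(z)
        idx = np.argsort(np.abs(z))[-self.s:]
        x[idx] = z[idx]

        # Accumulate statistics
        self.t += 1
        e = self.lam0 * float(y @ y)
        self.gram += np.outer(y, y) + e * np.eye(self.n)
        self.theta += np.outer(y, x)
        self.beta_sum += e

        # Closed-form transform update, per Eq. (18)
        L = np.linalg.cholesky(self.gram / self.t)
        Linv = np.linalg.inv(L)
        Q, sig, Rt = np.linalg.svd(Linv @ (self.theta / self.t))
        Sigma = np.diag(sig)
        bt = self.beta_sum / self.t
        self.W = 0.5 * Rt.T @ (Sigma + np.sqrt(Sigma**2 + 2 * bt * np.eye(self.n))) @ Q.T @ Linv
        return x
```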

Mini-Batch Transform Learning Algorithm.
Sparse coding: solve for $X_J$ in (P4) with fixed $W = \hat{W}_{J-1}$:
$$\min_{X_J} \|W Y_J - X_J\|_F^2 \quad \text{s.t. } \|x_{JM-M+i}\|_0 \le s\ \forall i \qquad (19)$$
Cheap solution: $\hat{x}_{JM-M+i} = H_s(W y_{JM-M+i})$ for all $i \in \{1, \ldots, M\}$.
Transform update: solve for $W$ in (P4) with fixed $\{X_j\}_{j=1}^{J}$:
$$\min_W \frac{1}{JM} \sum_{j=1}^{J} \Big\{ \|W Y_j - X_j\|_F^2 + \Lambda_j \big(\|W\|_F^2 - \log|\det W|\big) \Big\} \qquad (20)$$
This has a closed-form solution involving SVDs. For $M \ll n$, use rank-$M$ updates; for $M \sim O(n)$, use direct SVDs.

Comparison of Computations, Memory, and Latency.

Property | Online | Mini-batch (small $M \ll n$) | Mini-batch (large $M$) | Batch
--- | --- | --- | --- | ---
Computations per sample | $O(n^2 \log_2 n)$ | $O(n^2 \log_2 n)$ | $O(n^2)$ | $O(Pn^2)$
Memory | $O(n^2)$ | $O(n^2)$ | $O(nM)$ | $O(nN)$
Latency | $0$ | $M-1$ | $M-1$ | $N-1$

Latency is the maximum time between the arrival of a signal and the generation of its output. $P$: number of batch iterations, $N$: total number of samples, $M$: mini-batch size, $n$: signal size. When $\log_2 n < P$, the online scheme is computationally cheaper than the batch scheme. For big data, the online and mini-batch schemes have low memory and latency costs. Online synthesis dictionary learning [Mairal et al. '10] has a high computational cost per sample: $O(n^3)$.

Online Learning Convergence Analysis: Notation. The objective in the transform update step of (P3) is
$$\hat{g}_t(W) = \frac{1}{t} \sum_{j=1}^{t} \Big\{ \|W y_j - \hat{x}_j\|_2^2 + \lambda_0 \|y_j\|_2^2\, v(W) \Big\} \qquad (21)$$
The empirical objective function is
$$g_t(W) = \frac{1}{t} \sum_{j=1}^{t} \Big\{ \|W y_j - H_s(W y_j)\|_2^2 + \lambda_0 \|y_j\|_2^2\, v(W) \Big\} \qquad (22)$$
This is the objective that is minimized in batch transform learning. In the online setting, the sparse codes of past signals cannot be optimally reset at future times $t$.

Expected Transform Learning Cost. Assumption: the $y_t$ are i.i.d. random samples from the sphere $S^n = \{y \in \mathbb{R}^n : \|y\|_2 = 1\}$ with an absolutely continuous probability measure $p$. We consider the minimization of the expected learning cost
$$g(W) = \mathbb{E}_y\big[ \|W y - H_s(W y)\|_2^2 + \lambda_0 \|y\|_2^2\, v(W) \big] \qquad (23)$$
It follows from the assumption that $\lim_{t \to \infty} g_t(W) = g(W)$ almost surely. Given a specific training set, it is unnecessary to minimize the batch objective $g_t(W)$ to high precision, since $g_t(W)$ only approximates $g(W)$: even an inaccurate minimizer of $g_t(W)$ could provide the same, or a better, value of $g(W)$ than a fully optimized one.

Main Convergence Results. Theorem 3. For the sequence $\{\hat{W}_t\}$ generated by our online scheme:
(i) As $t \to \infty$, $\hat{g}_t(\hat{W}_t)$, $g_t(\hat{W}_t)$, and $g(\hat{W}_t)$ all converge a.s. to a common limit, say $g^*$.
(ii) The sequence $\{\hat{W}_t\}$ is bounded. Every accumulation point $\hat{W}$ of $\{\hat{W}_t\}$ satisfies $\nabla g(\hat{W}) = 0$ and $g(\hat{W}) = g^*$ with probability 1.
(iii) The distance between $\hat{W}_t$ and the set of stationary points of $g(W)$ converges to 0 a.s.
(iv) $|\hat{g}_{t+1}(\hat{W}_{t+1}) - \hat{g}_t(\hat{W}_t)|$ and $\|\hat{W}_{t+1} - \hat{W}_t\|$ both decay as $O(1/t)$.

Empirical Convergence Behavior. [Figure: objective function, sparsification error, and $\|\hat{W}_{t+1} - \hat{W}_t\|_F$ versus the number of signals processed, for the online and mini-batch ($M = 320$) schemes.] The $\{y_t\}$ are generated as $\{W^{-1} x_t\}$ with a random unitary $20 \times 20$ $W$ and random $x_t$ with $\|x_t\|_0 = 3$. The objective converges quickly for both the online and mini-batch schemes. The sparsification error converges to zero, and $\kappa(W) \in [1.02, 1.04]$ for both schemes: a good model is learned.

Online Video Denoising by 3D Transform Learning. $z_t$ is a noisy video frame and $\hat{z}_t$ is its denoised version. $G_t$ is a tensor of $m$ frames formed using a sliding-window scheme. The overlapping 3D patches in the $G_t$'s are denoised sequentially using adaptive mini-batch denoising, and the denoised patches are averaged at their 3D locations to yield the frame estimates.

Video Denoising by Online Transform Learning. Denoising PSNR (dB):

Video | σ | 3D DCT | SK-SVD | VBM3D | VBM4D | VIDOLSAT (n = 512) | VIDOLSAT (n = 768)
--- | --- | --- | --- | --- | --- | --- | ---
Salesman | 10 | 36.9 | 37.0 | 37.3 | 37.1 | 37.8 | 38.0
Salesman | 20 | 33.1 | 33.2 | 34.1 | 33.3 | 34.0 | 34.3
Salesman | 50 | 27.8 | 28.4 | 28.3 | 28.3 | 29.3 | 29.7
Miss America | 10 | 39.5 | 39.7 | 39.6 | 39.9 | 40.3 | 40.3
Miss America | 20 | 36.2 | 37.3 | 38.0 | 37.8 | 38.3 | 38.4
Miss America | 50 | 30.6 | 33.4 | 34.6 | 34.3 | 35.2 | 35.3
Coastguard | 10 | 34.6 | 34.8 | 34.8 | 35.4 | 35.7 | 35.7
Coastguard | 20 | 31.1 | 31.3 | 31.7 | 31.7 | 32.2 | 32.3
Coastguard | 50 | 26.6 | 27.1 | 26.9 | 27.1 | 28.0 | 28.1

The proposed VIDOLSAT is simulated at two patch sizes: 8×8×8 (n = 512) and 8×8×12 (n = 768). VIDOLSAT provides 1.7 dB, 1.2 dB, 0.8 dB, and 0.8 dB better PSNRs than 3D DCT, sparse K-SVD [Rubinstein et al. '10], VBM3D [Dabov et al. '07], and VBM4D [Maggioni et al. '12], respectively.

Video Denoising Example: Salesman. [Figure: noisy frame; VIDOLSAT result (PSNR = 30.97 dB); error maps for VBM4D (PSNR = 27.20 dB) and VIDOLSAT.]

Conclusions. We introduced several data-driven sparse model adaptation techniques. Transform learning methods are highly efficient and scalable, enjoy good theoretical and empirical convergence behavior, and are highly effective in many applications. Highly promising results were obtained using transform learning for denoising and compressed sensing. Future work: online blind compressed sensing. Acknowledgments: Yoram Bresler, Bihan Wen. Transform learning webpage: http://transformlearning.csl.illinois.edu

Thank you! Questions?