Representation and Processing of Big Visual Data: A Data-driven Perspective


1 Representation and Processing of Big Visual Data. National University of Singapore, July 7, 2017

2 Main topics
- Linear representation in Hilbert spaces: wavelet and Gabor systems
- Sparse learning in classification/recognition
- Numerical methods for the related optimization problems
- From sparse coding to convolutional neural networks

3 Development of real applications: problem-specific modeling, sparsity-motivated solutions, methods for the related optimization problems, and the design of application systems.

4 Section I: Linear representation and analysis
Let H denote a Hilbert space with inner product ⟨·,·⟩. Linear representation: consider a system X = {x_n}_{n∈Z} that represents f ∈ H via a coefficient sequence {c[n]}_{n∈Z} ∈ ℓ²(Z): f = Σ_{n∈Z} c[n] x_n. The properties of X determine the basic characteristics of the linear representation, i.e., the relationship between {c[n]} and {⟨f, x_n⟩}: existence, completeness, uniqueness, correspondence. Two types of fundamental systems (span(X) = H):
- Non-redundant: Riesz bases and orthonormal bases
- Redundant: frames and tight frames

5 Basic linear operators in Hilbert spaces
Theorem (adjoint operator). Let H₁, H₂ be two Hilbert spaces, and let T be a bounded linear operator from H₁ to H₂. Then there exists a unique bounded linear operator T*: H₂ → H₁ such that ⟨Tx, y⟩_{H₂} = ⟨x, T*y⟩_{H₁} for all x ∈ H₁, y ∈ H₂, and ‖T*‖ = ‖T‖.
Theorem (self-adjoint operator). An operator S on a Hilbert space H is called self-adjoint if S* = S. Suppose there exist two constants A, B > 0 such that A‖f‖² ≤ ⟨Sf, f⟩ ≤ B‖f‖² for all f ∈ H. Then S is an invertible operator with ‖S‖ ≤ B and ‖S⁻¹‖ ≤ A⁻¹.

6 Operators for linear representation
A sequence X = {x_n}_{n∈Z} is called a Bessel sequence if there exists a constant B > 0 such that Σ_n |⟨f, x_n⟩|² ≤ B‖f‖² for all f ∈ H. For a Bessel sequence X, define
- the synthesis operator T: ℓ²(Z) → H, Tc = Σ_n c[n] x_n;
- the analysis operator T*: H → ℓ²(Z), (T*f)[n] = ⟨f, x_n⟩, n ∈ Z.
T and T* are an adjoint pair with ‖T‖ = ‖T*‖ ≤ √B, and Ker(T*) = (Range(T))^⊥. X is linearly independent if T is injective on ℓ²(Z); X is fundamental (i.e., span(X) = H) if T* is injective.

7 Definitions: Bessel sequences with additional properties
Bessel sequences with Riesz properties:
- A Bessel sequence X is called a Riesz sequence if A Σ_{n∈Z} |c[n]|² ≤ ‖Σ_{n∈Z} c[n] x_n‖² ≤ B Σ_{n∈Z} |c[n]|².
- A Riesz sequence is called an orthonormal sequence if A = B = 1.
- A Riesz (or orthonormal) sequence X is called a Riesz (or orthonormal) basis if span(X) = H.
Bessel sequences with frame properties:
- A Bessel sequence X is called a frame if A‖f‖² ≤ Σ_n |⟨f, x_n⟩|² ≤ B‖f‖² for all f ∈ H; A and B are called the lower and upper frame bounds.
- A frame is called a tight frame if A = B = 1.

8 Two adjoint-composed operators for analysis
Consider a Bessel sequence X = {x_k}_{k∈Z} and define two operators:
- the Gramian operator T*T: ℓ²(Z) → ℓ²(Z);
- the frame operator TT*: H → H.
Both are bounded self-adjoint operators. The Gramian operator T*T analyzes Riesz properties: (T*T)[i, j] = ⟨x_j, x_i⟩. The frame operator TT* analyzes frame properties: (TT*)(f) = Σ_{k∈Z} ⟨f, x_k⟩ x_k.

9 Characterization via operators
Characterization of Riesz properties via the Gramian operator T*T:
- X is ℓ²-independent iff T*T is injective;
- X is a Riesz sequence iff T*T has a bounded inverse;
- X is an orthonormal sequence iff T*T = I.
Characterization of frame properties via the frame operator TT*:
- X is fundamental iff TT* is injective;
- X is a frame iff TT* has a bounded inverse;
- X is a tight frame iff TT* = I.
Connections between Riesz bases and frames: a Riesz basis is an exact frame, and a frame becomes a Riesz basis if it is ℓ²-independent. An orthonormal basis is a tight frame, and a tight frame becomes an orthonormal basis if ‖x_n‖ = 1 for all n.

10 Gramian and dual Gramian
The Gramian and dual Gramian matrices represent T*T and TT*.
Definition (pre-Gramian). For a Bessel sequence X, the pre-Gramian matrix J_X is defined as J_X := (⟨x_n, e_k⟩)_{k,n}, where {e_k}_{k∈Z} is an orthonormal basis of H. J_X is the matrix form of the synthesis operator T under the orthonormal basis {e_k}_{k∈Z}.
Definition (Gramian and dual Gramian). For a Bessel sequence X, the matrix G_X = J_X* J_X is called the Gramian matrix of X, and G̃_X = J_X J_X* is called the dual Gramian matrix.

11 Eigenvalue analysis via Gramian and dual Gramian
The Gramian G_X = J_X* J_X corresponds to T*T for Riesz properties; the dual Gramian G̃_X = J_X J_X* corresponds to TT* for frame properties. Let U denote the unitary operator that maps ℓ²(Z) to H.
Theorem (bridge between the Gramians and the operators). Consider a Bessel sequence X. We have T*T c = G_X c for all c ∈ ℓ²(Z), and U*(TT*)U d = G̃_X d for all d ∈ ℓ²(Z).
Theorem (Riesz properties and frame properties). Consider a Bessel sequence X. Then X is ℓ²-independent iff G_X is injective; a Riesz sequence iff G_X is bounded below; an orthonormal sequence iff G_X = I. X is fundamental iff G̃_X is injective; a frame iff G̃_X is bounded below; a tight frame iff G̃_X = I.

12 Duality principle
Definition (adjoint system). A system Y in a Hilbert space H̃ is called an adjoint system of a Bessel sequence X in H if J_Y = V₁ J_X* V₂, where V₁, V₂ are two unitary operators. X and Y are then connected via G̃_X = V₁ G_Y V₁* and G_X = V₂ G̃_Y V₂*, i.e., up to unitary operators, T_X = T_Y* and T_X* = T_Y.
Theorem (duality principle). Consider a Bessel sequence X ⊂ H and let Y denote its adjoint system in H̃. Then X forms a frame for H iff Y forms a Riesz sequence in H̃, and X forms a tight frame for H iff Y forms an orthonormal sequence in H̃.

13 Canonical dual frame
For a frame X, the frame operator S = TT* is self-adjoint, positive definite and invertible.
Definition (dual frame). For a frame X, a system Y is called a dual frame of X if T_Y T_X* = T_X T_Y* = I, i.e., f = Σ_n ⟨f, x_n⟩ y_n = Σ_n ⟨f, y_n⟩ x_n for all f ∈ H. For a frame X, Y = S⁻¹X is called its canonical dual frame.
Theorem (frame and dual frame). Consider a frame X ⊂ H. Then U G̃_X⁻¹ U* X is the canonical dual frame of X, and U G̃_X^{−1/2} U* X forms a tight frame for H.

14 Example in R²
Tight frame X = {x₁, x₂, x₃} with x₁ = √(2/3) [1, 0], x₂ = √(2/3) [cos 2π/3, sin 2π/3], x₃ = √(2/3) [cos 4π/3, sin 4π/3]. Its frame operator is the identity, TT* = [1 0; 0 1], while the Gramian T*T has 2/3 on the diagonal and −1/3 off the diagonal: X is a tight frame for R² but not an orthonormal basis.
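The frame operator and Gramian of this example can be checked numerically; a small numpy sketch (variable names are illustrative):

```python
import numpy as np

# The three frame vectors from the slide: x_i = sqrt(2/3) [cos(2*pi*i/3), sin(2*pi*i/3)]
X = np.sqrt(2/3) * np.array(
    [[np.cos(2*np.pi*i/3), np.sin(2*np.pi*i/3)] for i in range(3)]
).T                      # columns are x_1, x_2, x_3; shape (2, 3)

TTstar = X @ X.T         # frame operator T T*: the 2x2 identity
G = X.T @ X              # Gramian T* T: 2/3 on the diagonal, -1/3 off it
```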

15 Example in L²[0,1]: X = {√b e^{2πibkt}}_{k∈Z} with b ≤ 1 forms a tight frame
Proof. Consider the orthonormal basis {e_k(t) = e^{2πikt}}_{k∈Z} for L²[0,1]. The pre-Gramian of X is J_X[k, n] = √b ∫₀¹ e^{−2πi(k − bn)t} dt. The system Y = {b^{−1/2} e^{2πikb^{−1}t} 1_{[0,b]}(t)}_{k∈Z} is an adjoint system of X with J_Y[k, n] = (J_X[n, k])*, since the substitution t → bt gives J_Y[k, n] = b^{−1/2} ∫₀^b e^{2πi(nb^{−1} − k)t} dt = √b ∫₀¹ e^{2πi(n − bk)t} dt. Clearly, Y forms an orthonormal sequence in L²[0,1], and thus X forms a tight frame for L²[0,1].

16 References
1. A. Ron, Z. Shen, Gramian analysis of affine bases and affine frames, Approximation Theory VIII: Wavelets and Multilevel Approximation, World Scientific Publishing (1995).
2. A. Ron, Z. Shen, Frames and stable bases for subspaces of L²(R^d): the duality principle of Weyl-Heisenberg sets, Proceedings of the Lanczos Centenary Conference, SIAM (1993).
3. P. G. Casazza, The art of frame theory, arXiv preprint (1999).
4. Z. Fan, H. Ji, Z. Shen, Dual Gramian analysis: duality principle and unitary extension principle, Mathematics of Computation, 85 (2016).
5. Z. Fan, A. Heinecke, Z. Shen, Duality for frames, Journal of Fourier Analysis and Applications, 22(1) (2016).

17 Windowed Fourier transform (Gabor transform)
Motivation: analyzing the local frequency information of signals. Basic idea: localize a function by a window function and run the Fourier transform on it. Consider an even window function g ∈ L²(R) with translation u and modulation η: g_{u,η}(t) = e^{iηt} g(t − u), so that ĝ_{u,η}(ω) = e^{−iu(ω−η)} ĝ(ω − η).
Definition (Gabor transform). Gf(u, η) = ∫_R f(t) g_{u,η}(t)* dt = (1/2π) ∫_R f̂(ω) ĝ_{u,η}(ω)* dω.
Heisenberg uncertainty principle: the time and frequency variances of a window function g satisfy σ_t² σ_ω² ≥ 1/4, with the minimum attained only when g is a Gaussian function, i.e., g = c₀ e^{−(t−u₀)²/(2σ²)}.

18 From the Gabor transform to discrete Gabor systems
A discrete Gabor system for L²(R) is obtained by sampling the Gabor transform on a lattice in the time-frequency plane (u, η): g_{j,k}(t) = g(t − u₀j) e^{i2πkη₀t}, j, k ∈ Z:
- slice the time domain into intervals by translations of a window function with compact support (or with fast decay);
- construct a Fourier system on each interval.
Example (constructing a localized o.n. basis for L²(R)). Define the window function g = 1_{[0,1)} and let {e_k}_{k∈Z} denote an orthonormal basis for L²([0,1]). Then the system {g(t − j) e_k(t − j)}_{j,k∈Z} forms an orthonormal basis for L²(R).

19 Orthonormal bases with Gabor structure
Consider the indicator window function g = 1_{[−u₀/2, u₀/2)}.
- Using the Fourier basis for L²[−u₀/2, u₀/2), e_k(t) = e^{i2πkη₀t} with η₀ = u₀⁻¹, we obtain the blocked Fourier basis for L²(R): {g(t − u₀j) e^{i2πkη₀t}}_{j,k∈Z}.
- Using the Cosine-I basis for L²([−u₀/2, u₀/2)), e_k(t) = λ_k cos(πkη₀t) with λ_k = 1 if k = 0 and √2 otherwise, we obtain the blocked cosine basis for L²(R): {λ_k g(t − u₀j) cos(πη₀kt)}_{j,k∈Z}.
Main weakness of the indicator window function: signals are multiplied by a discontinuous window g, which creates artificial discontinuities.

20 Tight frames with Gabor structure
Theorem (non-existence of smooth localized Fourier bases). Consider a compactly supported non-negative window function g ∈ L²(R) and positive constants u₀, η₀. The system {g(t − u₀j) e^{i2πkη₀t}}_{j,k∈Z} forms an orthonormal basis if and only if u₀η₀ = 1 and g = 1_{[0,u₀)} (up to a translation).
Theorem (tight frame with localized atoms). Consider a window function g with supp(g) = [0, a] and g > 0 on (0, a). Then {g(t − u₀j) e^{iη₀kt}}_{j,k∈Z} forms a tight frame for L²(R) iff (i) η₀ a ≤ 2π; and (ii) Σ_{j∈Z} g(t − u₀j)² = η₀.

21 From discrete systems to digital systems
From a discrete Gabor system in L²(R), a digital system for ℓ²(Z) is constructed by directly sampling the system in L²(R).
Definition (local DFT/DCT). Digital orthonormal bases from the local DFT/DCT:
- DFT: {e_k[n] = N^{−1/2} e^{i2πkn/N}}_{k=0}^{N−1};
- DCT: {e_k[n] = λ_k N^{−1/2} cos(kπ(n + 1/2)/N)}_{k=0}^{N−1}, with λ₀ = 1 and λ_k = √2 for k ≥ 1.
Each forms an orthonormal basis for C^N (resp. R^N). Consider the digital indicator sequence g ∈ ℓ²(Z) defined by g[n] = 1_{[0:N−1]}[n]. Then the system {g_{j,k}: g_{j,k}[n] = g[n − jN] e_k[n − jN], n ∈ Z}_{j∈Z, 0≤k≤N−1} forms an orthonormal basis for ℓ²(Z).
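The orthonormality of the local DCT basis above can be verified directly; a small numpy sketch (the matrix E stacks the basis vectors e_k as rows):

```python
import numpy as np

N = 8
n = np.arange(N)
lam = np.where(n == 0, 1.0, np.sqrt(2.0))     # lambda_0 = 1, lambda_k = sqrt(2)
# DCT basis: e_k[n] = lambda_k / sqrt(N) * cos(k * pi * (n + 1/2) / N)
E = (lam[:, None] / np.sqrt(N)) * np.cos(np.outer(n, n + 0.5) * np.pi / N)
```

Since the rows form an orthonormal basis of R^N, E E^T is the identity matrix.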

22 Digital Gabor tight frames
Consider a digital system X = {g_{j,k}}_{j,k} ⊂ ℓ²(Z) defined by g_{j,k}[n] = g[n − u₀j] e^{i2πη₀kn}, where u₀ is a non-zero integer.
Theorem (digital Gabor tight frame for ℓ²(Z)). Suppose that g ∈ ℓ²(Z) is a non-negative sequence with support [0 : N − 1]. Then the system X is a tight frame for ℓ²(Z) iff (i) η₀ ≤ N⁻¹; and (ii) Σ_j g[n − u₀j]² = η₀.
The same conclusion holds when using the DCT as the local system: g_{j,k}[n] = g[n − u₀j] λ_k cos(πη₀k(n + 1/2)).

23 Examples of smooth window functions
On the integer knot sequence Z, the B-spline of order 1 is defined as B₁ = 1_{[0,1)}, and the B-spline of order m is defined recursively by convolution: B_m(t) = (B_{m−1} ∗ B₁)(t) = ∫₀¹ B_{m−1}(t − x) dx.
We have (i) supp(B_m) = [0, m] and B_m(t) > 0 for t ∈ (0, m); (ii) Σ_{j∈Z} B_m(· − j) = 1 (partition of unity).
Consider the window function g = √(B_m(·/u₀)). Then Σ_j g(t − u₀j)² = Σ_j B_m(t/u₀ − j) = 1.
Remark: the suitably normalized and rescaled B-spline converges uniformly to a Gaussian function as m → ∞.
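A minimal numpy sketch of the B-spline evaluation and the partition-of-unity property, using the equivalent Cox-de Boor recursion B_m(t) = (t·B_{m−1}(t) + (m − t)·B_{m−1}(t − 1))/(m − 1), which agrees with the convolution definition above:

```python
import numpy as np

def bspline(m, t):
    # Cardinal B-spline of order m with support [0, m], via the Cox-de Boor recursion
    t = np.asarray(t, dtype=float)
    if m == 1:
        return ((t >= 0) & (t < 1)).astype(float)
    return (t * bspline(m - 1, t) + (m - t) * bspline(m - 1, t - 1)) / (m - 1)

m = 4
t = np.linspace(0.05, 0.95, 10)
# partition of unity: sum_j B_m(t - j) = 1 on the grid
pou = sum(bspline(m, t - j) for j in range(-m, m + 1))
```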

24 References
1. C. de Boor, B(asic)-spline basics, Mathematics Research Center, University of Wisconsin-Madison (1986).
2. A. Ron, Z. Shen, Weyl-Heisenberg frames and Riesz bases in L²(R^d), Duke Mathematical Journal, 89 (1997).
3. S. Mallat, A wavelet tour of signal processing, Academic Press.
4. P. G. Casazza, G. Kutyniok, M. C. Lammers, Duality principles, localization of frames, and Gabor theory, Vol. 10 (2005).
5. H. Ji, Z. Shen, Y. Zhao, Directional frames for image recovery: discrete Gabor frames, Journal of Fourier Analysis and Applications (2016).

25 Wavelet systems
Non-stationary signals have varying local structures. An affine system with multi-scale and multi-band structure (each sub-band covering a range of frequencies): X = {2^{n/2} ψ₁(2^n · − k), 2^{n/2} ψ₂(2^n · − k), ..., 2^{n/2} ψ_L(2^n · − k)}_{n,k∈Z}.
Definition (wavelet tight frame). An affine system X = {2^{n/2} ψ_l(2^n · − k)}_{l∈[1:L], n,k∈Z} ⊂ L²(R) is called a wavelet tight frame if X is a tight frame for L²(R), i.e., f = Σ_{l,n,k} ⟨f, 2^{n/2} ψ_l(2^n · − k)⟩ 2^{n/2} ψ_l(2^n · − k) for all f ∈ L²(R).

26 Multi-resolution analysis (MRA)
Definition (multi-resolution analysis). A sequence {V_n}_{n∈Z} of closed subspaces forms an MRA for L²(R) if (i) V_n ⊂ V_{n+1}; (ii) the closure of ∪_n V_n is L²(R); (iii) ∩_n V_n = {0}.
A function φ ∈ L²(R) with φ̂(0) = 1 is called the scaling function of the MRA if V_n = span{2^{n/2} φ(2^n · − k)}_{k∈Z}, which gives the refinability property φ = √2 Σ_k a₀[k] φ(2· − k).

27 MRA-based wavelet tight frames
A block decomposition in the scale-time plane: V₁ = V₀ ⊕ W₀; V₂ = V₁ ⊕ W₁ = V₀ ⊕ W₀ ⊕ W₁; .... Then ⊕_{n∈Z} W_n = L²(R) and W_i ⊥ W_j for i ≠ j.
Definition (MRA-based wavelet tight frame). The system X = {2^{n/2} ψ_l(2^n · − k)}_{l∈[1:L], n,k∈Z} ⊂ L²(R) is an MRA-based tight frame if X forms a tight frame for L²(R) and ψ_l = √2 Σ_k a_l[k] φ(2· − k), 1 ≤ l ≤ L.
The set of sequences A = {a₀, a₁, ..., a_L} ⊂ ℓ²(Z) is called the filter bank of an MRA-based wavelet tight frame, where a₀ is a low-pass filter and the others are high-pass filters.

28 Unitary extension principle (UEP)
Theorem (UEP). Consider an MRA with scaling function φ. The affine system X = {2^{n/2} ψ_l(2^n · − k)}_{l∈[1:L], n,k∈Z} ⊂ L²(R) forms a tight frame for L²(R) if the filter bank A = {a₀, a₁, ..., a_L} ⊂ ℓ²(Z) satisfies, for all m ∈ Z,
Σ_{l=0}^{L} Σ_{n∈Z} a_l[2n] a_l[2n + m] = δ_m and Σ_{l=0}^{L} Σ_{n∈Z} a_l[2n + 1] a_l[2n + 1 + m] = δ_m. (1)
Conversely, a finitely supported filter a₀ with Σ_n a₀[n] = √2 admits a compactly supported refinable function φ ∈ L²(R) of an MRA.
Theorem (filter bank and continuum space). A finitely supported filter bank satisfying Σ_n a₀[n] = √2 and (1) admits an MRA-based compactly supported wavelet tight frame for L²(R).

29 Examples
Example (Daubechies DB4 wavelets). Filter bank:
a₀ = √2 [0.1629, 0.5055, 0.4461, −0.0198, −0.1323, 0.0218, 0.0233, −0.0075]
a₁ = √2 [0.0075, 0.0233, −0.0218, −0.1323, 0.0198, 0.4461, −0.5055, 0.1629]
Example (linear B-spline framelet). Filter bank: a₀ = (1/4)[1, 2, 1]; a₁ = (√2/4)[−1, 0, 1]; a₂ = (1/4)[−1, 2, −1]. (Figure: plots of φ, ψ₁ and ψ₂.)
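The UEP conditions of slide 28 can be checked numerically for the linear B-spline framelet filter bank; in the sketch below the filters are rescaled by √2 so that Σ_n a₀[n] = √2, the normalization under which the sums in (1) equal δ_m:

```python
import numpy as np

# Linear B-spline framelet filters, rescaled by sqrt(2)
s2 = np.sqrt(2.0)
filters = [s2 * np.array([1, 2, 1]) / 4,
           s2 * np.array([-1, 0, 1]) * s2 / 4,
           s2 * np.array([-1, 2, -1]) / 4]

def uep_sum(filters, parity, m):
    # sum_l sum_{n in parity + 2Z} a_l[n] a_l[n + m], filters supported on [0 : T-1]
    total = 0.0
    for a in filters:
        T = len(a)
        for n in range(parity, T, 2):
            if 0 <= n + m < T:
                total += a[n] * a[n + m]
    return total
```

Both the even-indexed (parity 0) and odd-indexed (parity 1) sums equal 1 at m = 0 and vanish for all other shifts.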

30 Basic operators in a filter bank
Definition (discrete convolution). For a signal f ∈ ℓ²(Z) and a filter a ∈ ℓ²(Z), the discrete convolution of f and a is defined by (f ∗ a)[k] = Σ_{j∈Z} f[j] a[k − j], k ∈ Z.
Definition (down/up-sampling). The down-sampling operator with sampling rate p, denoted ↓p, is defined as (f↓p)[n] = f[pn], n ∈ Z. Its adjoint operator, the up-sampling operator ↑p, is defined by (g↑p)[pn] = g[n], n ∈ Z, with all other entries of g↑p being zero.
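The adjoint relation between ↓p and ↑p, i.e., ⟨f↓p, g⟩ = ⟨f, g↑p⟩, can be verified directly; a small numpy sketch on finite vectors:

```python
import numpy as np

def down(f, p):
    # (f ↓ p)[n] = f[pn]
    return f[::p]

def up(g, p):
    # (g ↑ p)[pn] = g[n], all other entries zero
    out = np.zeros(len(g) * p)
    out[::p] = g
    return out

rng = np.random.default_rng(1)
f, g = rng.standard_normal(16), rng.standard_normal(8)
lhs = np.dot(down(f, 2), g)   # <f↓2, g>
rhs = np.dot(f, up(g, 2))     # <f, g↑2>
```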

31 Cascade algorithm for dyadic wavelets
Consider f ∈ L²(R) and define the quasi-interpolant g_n = Σ_k ⟨f, φ_{n,k}⟩ φ_{n,k}, where φ_{n,k} = 2^{n/2} φ(2^n · − k) and ψ_{l,n,k} = 2^{n/2} ψ_l(2^n · − k), n = 0, 1, ....
Define the coefficient sequences w_{0,n}[k] = ⟨g, φ_{n,k}⟩ and w_{l,n}[k] = ⟨g, ψ_{l,n,k}⟩, 1 ≤ l ≤ L.
The unitary extension principle yields the multi-scale cascade algorithm:
- decomposition: w_{l,n}[k] = Σ_{m∈Z} a_l[m − 2k] w_{0,n−1}[m], l = 0, ..., L;
- reconstruction: w_{0,n−1}[k] = Σ_{l=0}^{L} Σ_{m∈Z} a_l[k − 2m] w_{l,n}[m], l = 0, ..., L.
(Diagram: analysis with filters a_l(−·) followed by ↓2; synthesis with ↑2 followed by filters a_l; ∗ denotes convolution, ↓2 down-sampling, ↑2 up-sampling.)
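A minimal sketch of one decomposition/reconstruction level with the linear B-spline framelet filter bank of slide 29 (periodic boundary extension; with the Σ_n a₀[n] = 1 normalization in which those filters are printed, the synthesis step carries an extra factor of 2):

```python
import numpy as np

# Linear B-spline framelet filter bank, sum(a0) = 1 normalization
filters = [np.array([1, 2, 1]) / 4,
           np.array([-1, 0, 1]) * np.sqrt(2) / 4,
           np.array([-1, 2, -1]) / 4]

def decompose(f, filters):
    # w_l[k] = sum_m a_l[m - 2k] f[m], periodic boundary extension
    N = len(f)
    return [np.array([sum(a[i] * f[(i + 2*k) % N] for i in range(len(a)))
                      for k in range(N // 2)]) for a in filters]

def reconstruct(ws, filters, N):
    # f[k] = 2 * sum_l sum_m a_l[k - 2m] w_l[m]
    f = np.zeros(N)
    for a, w in zip(filters, ws):
        for k in range(N):
            for m in range(N // 2):
                i = (k - 2*m) % N
                if i < len(a):
                    f[k] += a[i] * w[m]
    return 2 * f   # the factor 2 compensates the sum(a0) = 1 normalization

rng = np.random.default_rng(0)
f = rng.standard_normal(16)
f_rec = reconstruct(decompose(f, filters), filters, len(f))
```

Perfect reconstruction holds: f_rec coincides with f up to floating-point error.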

32 An adjoint system for a filter bank frame
Consider the system X = {a_l[· − 2k]}_{l∈[0:L], k∈Z}. The pre-Gramian matrix of X is J_X = (a_l[n − 2k])_{n∈Z, (l,k)∈[0:L]×Z}. Its adjoint system is Y = {y_j}_{j∈Z}, with y_j = (a_l[n])_{l∈[0:L], n∈j+2Z}. The dual Gramian matrix G̃_X is G̃_X[n, n′] = Σ_{l=0}^{L} Σ_{k∈Z} a_l[n − 2k] a_l[n′ − 2k].
X is a tight frame iff G̃_X = I, i.e., for m ∈ Z and j ∈ [0 : 1], Σ_{l=0}^{L} Σ_{n∈j+2Z} a_l[n] a_l[n + m] = δ_m.

33 Digital multi-scale wavelet tight frames
Example (single-level wavelet tight frame for ℓ²(Z)). Define the system X = {X₀, X₁, ..., X_L} ⊂ ℓ²(Z) with X_l = {a_l(· − 2k)}_{k∈Z}. Then the system X forms a tight frame for ℓ²(Z).
Example (N-level wavelet tight frame for ℓ²(Z)). Define the system X = {{X_{l,1}}_{l=1}^{L}, {X_{l,2}}_{l=1}^{L}, ..., {X_{l,N}}_{l=1}^{L}, X_{0,N}} ⊂ ℓ²(Z), where X_{l,n} denotes the sub-system of the l-th sub-band channel at scale 2^n: X_{l,n} = {a_{l,n}(· − 2^n k)}_{k∈Z}, l = 0, ..., L, n = 1, ..., N, with a_{l,n} = a_{0,n−1} ∗ (a_l ↑ 2^{n−1}) and a_{0,0} = δ.

34 Analysis operator and synthesis operator
Consider the periodic boundary extension for finite convolution: (h ∗ f)[k] = Σ_j h[j] f[(k − j) mod N], f, h ∈ C^N.
Define a basic operator W_a: C^{2N} → C^N by W_a f = (a(−·) ∗ f)↓2; its adjoint is W_a* g = a ∗ (g↑2).
Analysis operator W and synthesis operator W*: W = [W_{a₀}; W_{a₁}; ...; W_{a_L}] (stacked), W* = [W_{a₀}*, W_{a₁}*, ..., W_{a_L}*]. We have W*W = I_{2N×2N}, while WW* ≠ I_{(L+1)N×(L+1)N}.

35 Tensor product approach for multi-dimensional data
Tensor product of Hilbert spaces: ⟨f₁ ⊗ g₁, f₂ ⊗ g₂⟩ = ⟨f₁, f₂⟩⟨g₁, g₂⟩.
Consider a scaling function φ and one wavelet function ψ. Then V_{n+1} ⊗ V_{n+1} = (V_n ⊗ V_n) ⊕ W_n^{(0,1)} ⊕ W_n^{(1,0)} ⊕ W_n^{(1,1)}, where
- LH: W_n^{(0,1)} = span{2^n φ(2^n x − k₁) ψ(2^n y − k₂): k₁, k₂ ∈ Z};
- HL: W_n^{(1,0)} = span{2^n ψ(2^n x − k₁) φ(2^n y − k₂): k₁, k₂ ∈ Z};
- HH: W_n^{(1,1)} = span{2^n ψ(2^n x − k₁) ψ(2^n y − k₂): k₁, k₂ ∈ Z}.
(Diagram: a discrete decomposition/reconstruction scheme, filtering f[n₁, n₂] along the rows with a₀[−n₁], a₁[−n₁] and ↓2, then along the columns with a₀[−n₂], a₁[−n₂] and ↓2, producing the LL, LH, HL and HH sub-bands.)
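The separable row/column scheme can be illustrated with the orthonormal Haar filters (a stand-in for the filter bank in the diagram); since the transform is orthonormal, the energies of the four sub-bands sum to the energy of the image:

```python
import numpy as np

def haar_1d(x):
    # one level of the orthonormal Haar transform along axis 0
    s = (x[0::2] + x[1::2]) / np.sqrt(2)   # low-pass + downsample
    d = (x[0::2] - x[1::2]) / np.sqrt(2)   # high-pass + downsample
    return s, d

rng = np.random.default_rng(2)
img = rng.standard_normal((8, 8))
L, H = haar_1d(img)                              # transform along the first axis
LL, LH = haar_1d(L.T)                            # then along the second axis
HL, HH = haar_1d(H.T)
subbands = [LL, LH, HL, HH]
```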

36 Other variations and extensions
- Complex-valued wavelet systems: the only real-valued symmetric/anti-symmetric wavelet orthonormal bases are the Haar wavelets; complex-valued symmetric/anti-symmetric wavelet orthonormal bases can be smooth.
- Translation invariance for better processing with fewer artifacts: un-decimated discrete wavelet systems and frames/tight frames without the down-sampling (up-sampling) step.
- Larger sampling rates for computational efficiency: p-dilation wavelet systems whose down-sampling (up-sampling) rate is p ≥ 2.
- Non-separable wavelet systems for multi-dimensional signals, as opposed to a separable function g(x, y) = g_x(x) g_y(y).

37 Connection between Gabor frames and MRA-based wavelet tight frames
Consider a discrete Gabor system X = {g[n − 2j] e^{i2π(k/L)n}} and its digital Gabor filter bank {g_l}_{l=0}^{L} with g_l[n] = g[n] e^{i2πln/L}. The analysis operator can be implemented as a single-level wavelet filter bank, up to a diagonal matrix of phase shifts: f ↦ [(g₀ ∗ f)↓2, (g₁ ∗ f)↓2, ..., (g_L ∗ f)↓2].
Question: can MRA and affine structure be introduced to Gabor systems?
Theorem. The filter bank {g_l[n] = g[n] e^{i2πln/L}}_{l=0}^{L} satisfies the UEP if and only if g = L⁻¹ [1, 1, ..., 1].
Remark: there is no MRA-based wavelet tight frame generated by a digital Gabor filter bank with a smooth window function.

38 Wavelet systems for multi-dimensional data
Multi-dimensional signals, e.g. images, have geometrical regularities: discontinuities occur only along the normal direction of edges. Tensor-product real-valued wavelets only have two orientations.
(Figures: original image and its local edges; 2D filters of linear spline frames; 2D filters of the DCT.)

39 Multi-dimensional systems with strong orientation selectivity
Multi-dimensional systems for gaining more orientation selectivity:
- non-separable real-valued systems and related constructions: ridgelets, curvelets, shearlets and many others;
- tensor products of complex-valued systems: the dual-tree complex wavelet transform and discrete Gabor bi-frames.
(Figures: real and imaginary parts of Gabor filters.)

40 References
1. S. Mallat, A wavelet tour of signal processing, Academic Press.
2. I. Daubechies, Ten lectures on wavelets, Vol. 61, Society for Industrial and Applied Mathematics, Philadelphia.
3. Z. Shen, Wavelet frames and image restorations, Proceedings of ICM, Vol. IV, Hyderabad, India (2010).
4. B. Dong, Z. Shen, MRA-based wavelet frames and applications, IAS/Park City Mathematics Series: The Mathematics of Image Processing, Vol. 19 (2010).
5. B. Dong, Z. Shen, Image restoration: a data-driven perspective, Proceedings of ICIAM, Beijing, China (2015).
6. Z. Fan, H. Ji, Z. Shen, Dual Gramian analysis: duality principle and unitary extension principle, Mathematics of Computation, 85 (2016).
7. H. Ji, Z. Shen, Y. Zhao, Digital Gabor filters do generate MRA-based wavelet tight frames, preprint (2016).

41 Sparse representation of signals
Definition of sparsity and compressibility:
- A vector is called sparse if most entries are zero or close to zero.
- A class of signals is called compressible if there exists a system X such that the linear expansion of any signal in this class is sparse.
When the system X is a redundant (tight) frame, the linear expansion is no longer unique. Consider a signal f ∈ H and a redundant frame X = {x_k}_{k∈Z}.
- Synthesis sparse model: there exists a sparse coefficient sequence c such that f = Σ_k c[k] x_k.
- Analysis sparse model: the canonical coefficient sequence {⟨f, x_k⟩}_{k∈Z} is sparse.

42 Linear inverse problems
Consider a linear system Af = b + η, where A ∈ R^{M×N} is the measurement matrix, f ∈ R^N the unknown signal, b ∈ R^M the available observation, and η ∈ R^M the measurement noise. The matrix A is often ill-conditioned or non-invertible:
- image in-painting: a non-invertible down-sampling matrix;
- image deconvolution: an ill-conditioned convolution matrix;
- compressed sensing: a random matrix with N ≫ M;
- tomography: a non-invertible partial Radon transform.
A straightforward inversion will magnify the noise η in the solution.

43 Sparse solutions of ill-posed inverse problems
Consider an ill-posed or non-invertible linear system Af = b + η, and suppose there exists a system W such that the signal f is compressible.
- Synthesis model: A W*c = b + η, where most entries of c are zero. Suppose the support Ω of c is known by an oracle, and define c_Ω[k] = c[i_k], i_k ∈ Ω. Then the system A W_Ω* c_Ω = b + η is well-posed if the cardinality |Ω| is sufficiently small.
- Analysis model: most entries of Wf are zero. Suppose the support of Wf is given by an oracle, and let Ω^c denote its complement. Then the stacked system [A; W_{Ω^c}] f = [b; 0] + [η; ε] is well-posed if the cardinality of Ω^c is sufficiently large.

44 Sparsity-based convex regularization
Use the ℓ₁-norm as the convex relaxation of the sparsity-prompting ℓ₀-norm.
- Synthesis sparse model: x = W*c, c := argmin_c ½‖A(W*c) − b‖₂² + λ‖c‖₁.
- Analysis sparse model: x := argmin_x ½‖Ax − b‖₂² + λ‖Wx‖₁.
- Balanced sparse model: x = W*c, c := argmin_c ½‖A(W*c) − b‖₂² + (µ/2)‖(I − WW*)c‖₂² + λ‖c‖₁.
When µ = 0, the balanced model becomes the synthesis model; when µ → +∞, it becomes the analysis model.
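These convex models can be solved by proximal methods; a minimal ISTA (iterative soft-thresholding) sketch for the special case W = I, i.e., min_x ½‖Ax − b‖₂² + λ‖x‖₁, with illustrative problem data:

```python
import numpy as np

def soft(x, t):
    # soft-thresholding: proximal operator of t * ||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, b, lam, iters=500):
    # minimizes 0.5 * ||A x - b||^2 + lam * ||x||_1
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = soft(x - (A.T @ (A @ x - b)) / L, lam / L)
    return x

rng = np.random.default_rng(3)
A = rng.standard_normal((30, 60)) / np.sqrt(30)
x_true = np.zeros(60); x_true[[3, 17, 42]] = [2.0, -1.5, 1.0]
b = A @ x_true
x_hat = ista(A, b, lam=1e-3)
```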

45 Mutual coherence and exact recovery
Definition (mutual coherence). The mutual coherence of a system X = {x_l}_{l=1}^{N} with normalized atoms is defined as µ(X) = max_{i≠j} |⟨x_i, x_j⟩|. A vector x ∈ R^N is called s-sparse if at most s entries are non-zero.
Theorem (sufficient condition). Consider a full-rank matrix X ∈ R^{d×N} with normalized columns. Any s-sparse vector ȳ ∈ R^N satisfying s ≤ ½(1 + µ(X)⁻¹) is the unique solution of min ‖y‖₁ s.t. Xy = Xȳ.
Theorem (boundedness of mutual coherence). For a d × N redundant system X, µ(X) ≥ ((N − d)/(d(N − 1)))^{1/2}.
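Both the mutual coherence and the lower bound in the second theorem (the Welch bound) are easy to compute; a small numpy sketch with an illustrative random dictionary:

```python
import numpy as np

def mutual_coherence(X):
    # X: d x N dictionary with unit-norm columns
    G = np.abs(X.T @ X)
    np.fill_diagonal(G, 0.0)
    return G.max()

rng = np.random.default_rng(4)
d, N = 16, 64
X = rng.standard_normal((d, N))
X /= np.linalg.norm(X, axis=0)                 # normalize the atoms
mu = mutual_coherence(X)
welch = np.sqrt((N - d) / (d * (N - 1)))       # lower bound from the theorem
```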

46 Exact recovery in the noise-free case
Definition (null space property). A matrix X is said to satisfy the null space property of order s if there exists a constant 0 ≤ β < 1 such that for any h ∈ ker(X) and any index set Ω of size |Ω| ≤ s, ‖h_Ω‖₁ ≤ β‖h_{Ω^c}‖₁.
Theorem (sufficient and necessary condition). Consider a matrix X ∈ R^{M×N}. Every s-sparse vector ȳ is the unique solution of min ‖y‖₁ s.t. Xy = Xȳ if and only if X satisfies the null space property of order s.
The mutual-coherence-based condition is easy to check but very strong; the null space property is much weaker but difficult to check.

47 Demonstration
Image deconvolution (figures: original image, blurry image, de-convolved image). Image in-painting (figures: original image, damaged image, in-painted image).

48 Real applications
Blind image deconvolution. Model: g = p ∗ f + η, a bi-linear problem with unknowns p and f (figures: motion-blurred image and recovered image).
Electron microscopy. Model: g_k = A_k f + η, a linear problem with a very low signal-to-noise ratio (figures: multiple EM images and the reconstructed 3D structure of a virus).

49 References
1. B. Dong, Z. Shen, MRA-based wavelet frames and applications, IAS/Park City Mathematics Series: The Mathematics of Image Processing, Vol. 19 (2010).
2. J. Cai, S. Osher, Z. Shen, Split Bregman methods and frame based image restoration, SIAM Multiscale Modeling and Simulation, 8(2) (2009).
3. J. Cai, H. Ji, C. Liu, Z. Shen, Blind motion deblurring from a single image using sparse approximation, IEEE CVPR, Miami (2009).
4. M. Li, Z. Fan, H. Ji, Z. Shen, Wavelet frame based algorithm for 3D reconstruction in electron microscopy, SIAM Journal on Scientific Computing, 36(1), B45-B69 (2014).
5. H. Ji, K. Wang, Robust image deconvolution with an inaccurate blur kernel, IEEE Transactions on Image Processing, 21(4) (2012).
6. B. Dong, H. Ji, J. Li, Z. Shen, Y. Xu, Wavelet frame based blind image inpainting, Applied and Computational Harmonic Analysis, 32(2) (2012).
7. H. Ji, Y. Luo, Z. Shen, Image recovery via geometrically structured approximation, Applied and Computational Harmonic Analysis, to appear (2016).

50 Dictionary learning
For a sequence f ∈ ℓ²(Z), partition it into finitely supported blocks (with possible overlap): {f_n = f|_{Ω_n}}_{n∈Z} ⊂ R^T, Ω_n = [nP + 1, nP + T] ∩ Z.
The K-SVD method by Aharon et al. learns a dictionary D ∈ R^{T×L} which sparsifies {f_n}, by solving min_{{‖D_l‖=1}_{l=1}^{L}, {c_n}_n} Σ_n ½‖f_n − Dc_n‖₂² + λ‖c_n‖₀.
Suppose that (i) span{D₁, ..., D_L} = R^T; and (ii) A‖f‖₂² ≤ Σ_n ‖f_n‖₂² ≤ B‖f‖₂². Then the system {D_{l,n}}_{l,n} forms a frame for ℓ²(Z), where D_{l,n} denotes the translate of D_l to the n-th block. (Remark: condition (i) is not guaranteed by the model.)

51 Dictionary-based representation and Gabor/wavelet systems
Given an index partition Ω_n = [nP, nP + T − 1] and f_n = f|_{Ω_n}:
- For a Gabor system with window function g, step size u₀ = P and supp(g) = [0 : T − 1], the entries of its analysis operator are ⟨f, D_{l,n}⟩ = ⟨f_n, D_l⟩, where D_l[k] = g[k] e^{i2πblk}.
- For a single-level wavelet system with dilation factor p = P and supp(a_l) ⊂ [0 : T − 1], the entries of its analysis operator are ⟨f, D_{l,n}⟩ = ⟨f_n, D_l⟩, where D_l[k] = a_l[k].
In terms of the analysis operator, Gabor and wavelet systems are exactly dictionary-based representations with a predefined dictionary: either Gabor atoms or wavelet filters.

52 Data-driven tight frames
Dictionary-based frames (existing dictionary learning techniques): no closed-form linear expansion, and the completeness of D in R^T is difficult to guarantee.
Data-driven tight frames: a fast linear expansion f = Σ_k ⟨f, x_k⟩ x_k, and the UEP on D generates tight frames with multi-scale structure.
A general data-driven tight frame model: min_{D, {c_n}} Σ_n ½‖f_n − Dc_n‖₂² + λ‖c_n‖₀, where the filter bank D satisfies the UEP. The wavelet system generated by D forms a wavelet tight frame as long as {f_n}_n is a uniform (overlapped) partition of f.

53 A simple and efficient construction
Consider an un-decimated discrete wavelet tight frame, i.e., X := {X_n}_{n∈Z}, where X_n = [a₀[· − n], ..., a_{L−1}[· − n]]. The UEP for such a system simplifies to Σ_{l=0}^{L−1} Σ_n a_l[n + k] a_l[n] = δ_k.
Consider a finitely supported filter bank {a₀, ..., a_{L−1}} with supp(a_l) ⊂ [0 : T − 1], and define a dictionary D ∈ R^{T×L} by D_l = a_{l−1}[0 : T − 1], l = 1, ..., L.
Theorem (local and global). The filter bank {a₀, ..., a_{L−1}} generates an un-decimated wavelet tight frame for ℓ²(Z), provided that {T^{−1/2} D_l} forms a tight frame for R^T, i.e., DD* = T⁻¹ I.

54 Variational model for dictionary learning
The sparsity-based model for data-driven tight frames: min_{D, {c_k}} Σ_k ‖f_k − Dc_k‖₂² + µ‖c_k‖₀, s.t. DD* = T⁻¹ I, which, after rescaling D by √T, is equivalent to the real-valued dictionary learning model min_{D, C} ‖Y − DC‖_F² + λ‖C‖₀, s.t. DD* = I, where Y = √T [..., f₋₁, f₀, f₁, ...] and C = [..., c₋₁, c₀, c₁, ...].
The optimization problem is a challenging non-convex problem. When D is an over-complete tight frame, the subproblem of computing the sparse code C under D is an NP-hard problem.

55 Fast methods for orthogonal dictionary learning
A further simplification: consider a square matrix D ∈ R^{T×T} with D*D = DD* = I, which gives min_{D*D=I, C} ‖C − D*Y‖_F² + λ‖C‖₀.
Alternating iteration scheme: for k = 0, 1, 2, ...,
P1: C_{k+1} := argmin_C ‖C − D_k*Y‖_F² + λ‖C‖₀;
P2: D_{k+1} := argmin_{D*D=I} ‖Y − DC_{k+1}‖_F².
Each step of the iteration has a closed-form solution:
P1: C_{k+1} := Γ_λ(D_k*Y);
P2: D_{k+1} := UV*,
where Γ_µ denotes the hard-thresholding operator (Γ_µ(x) = x if |x| > µ and 0 otherwise), and U, V denote the orthogonal matrices of the SVD of Y C_{k+1}* = UΣV*.
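A minimal numpy sketch of the alternating scheme (the threshold √λ solves the entrywise problem min_c (c − v)² + λ·1_{c≠0}); since D is square orthogonal, both steps minimize the same objective, so it decreases monotonically:

```python
import numpy as np

rng = np.random.default_rng(5)
T, M, lam = 8, 100, 0.05
Y = rng.standard_normal((T, M))

def hard(X, tau):
    # hard-thresholding Gamma_tau: keep entries with |x| > tau
    return np.where(np.abs(X) > tau, X, 0.0)

D = np.linalg.qr(rng.standard_normal((T, T)))[0]   # orthogonal initialization
tau = np.sqrt(lam)
obj = []
for _ in range(15):
    C = hard(D.T @ Y, tau)                 # P1: closed-form sparse code
    U, _, Vt = np.linalg.svd(Y @ C.T)      # P2: orthogonal Procrustes
    D = U @ Vt
    obj.append(np.linalg.norm(C - D.T @ Y) ** 2 + lam * np.count_nonzero(C))
```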

56 Demonstration of data-driven filter banks
(Figures: images and the corresponding learned filter banks.)

57 References
1. K. Kreutz-Delgado, J. Murray, B. Rao, K. Engan, T. Lee, T. Sejnowski, Dictionary learning algorithms for sparse representation, Neural Computation, 15(2) (2003).
2. J. Mairal, F. Bach, J. Ponce, G. Sapiro, A. Zisserman, Supervised dictionary learning, NIPS (2008).
3. M. Aharon, M. Elad, A. Bruckstein, K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation, IEEE Transactions on Signal Processing, 54(11) (2006).
4. M. Elad, M. Aharon, Image denoising via sparse and redundant representations over learned dictionaries, IEEE Transactions on Image Processing, 15(12) (2006).
5. J. Cai, H. Ji, Z. Shen, G. Ye, Data-driven tight frame construction and image denoising, Applied and Computational Harmonic Analysis, 37(1) (2014).
6. Y. Quan, H. Ji, Z. Shen, Data-driven multi-scale non-local wavelet frame construction and image recovery, Journal of Scientific Computing, 63(2) (2015).
7. C. Bao, H. Ji, Z. Shen, Convergence analysis for iterative data-driven tight frame construction scheme, Applied and Computational Harmonic Analysis (2015).

58 Structured dictionary learning
Consider a data matrix Y ∈ R^{N×M} whose columns represent input features. Dictionary learning is about determining a dictionary D ∈ R^{N×K} under which the linear expansion {c_n} of Y_n, Y_n ≈ Σ_{k=1}^{K} D_k c_n[k], has desired properties. A general variational model: ½‖Y − DC‖_F² + λΨ(C) + Φ(D), where
- Ψ is a functional on the code C for desired properties, e.g. sparse approximation ‖·‖₀, or label (L) consistency of a linear classifier V: min_V ‖L − VC‖_F²;
- Φ is a functional on the dictionary for desired structures, e.g. normalization constraints Φ(·) = δ_{‖D_k‖=1}(·), or orthogonality (tight frame) constraints Φ(·) = δ_{D*D=I}(·), where δ_𝒟(D) = 0 if D ∈ 𝒟 and +∞ otherwise.

59 A list of dictionary learning models
1. K-SVD for general sparse coding: ½‖Y − DC‖_F² + λ‖C‖₀, s.t. ‖D_k‖₂ = 1, k ∈ [1 : K].
2. Orthogonal dictionary (tight frame): ½‖Y − DC‖_F² + λ‖C‖₀, s.t. D*D = I.
3. Discriminative dictionary learning for recognition: ½‖Y − DC‖_F² + (α/2)‖L − WC‖_F² + λ‖C‖₀ + µ‖W‖_F², s.t. ‖D_k‖₂ = 1, k ∈ [1 : K], where the matrix L denotes the class labels of the training samples and W denotes the linear classifier.

60 In-coherent dictionary learning for sparse coding
Recall that for an input feature y, finding a sparse vector c such that y = Dc is NP-hard if D is an over-complete dictionary. Consider two representative methods:
- orthogonal matching pursuit (OMP) for solving min ‖c‖₀ s.t. y = Dc;
- the ℓ₁-norm related convex relaxation: min ‖c‖₁ s.t. y = Dc.
Both exactly recover an s-sparse c provided the mutual coherence µ(D) = max_{i≠j} |⟨D_i, D_j⟩| is sufficiently small; for example, µ(D) ≤ (2s − 1)⁻¹ for OMP.
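A minimal OMP sketch (greedy atom selection followed by a least-squares refit on the selected atoms); the sanity check uses an orthonormal dictionary, where exact recovery is guaranteed:

```python
import numpy as np

def omp(D, y, s):
    # orthogonal matching pursuit: s greedy steps for y ≈ D c
    residual, support = y.copy(), []
    for _ in range(s):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        c_sub, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ c_sub
    c = np.zeros(D.shape[1])
    c[support] = c_sub
    return c

rng = np.random.default_rng(6)
D = np.linalg.qr(rng.standard_normal((16, 16)))[0]   # orthonormal dictionary
c_true = np.zeros(16); c_true[[2, 7, 11]] = [3.0, -2.0, 1.5]
y = D @ c_true
c_hat = omp(D, y, s=3)
```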

61 Model for in-coherent dictionary learning
Theorem (boundedness of mutual coherence). Consider an over-complete normalized dictionary D ∈ R^{N×K} in R^N. We have µ(D) ≥ ((K − N)/(N(K − 1)))^{1/2}.
Remark: the lower bound on the mutual coherence is achieved when the dictionary is an equiangular frame for R^N satisfying |⟨D_i, D_j⟩| = const for i ≠ j. An equiangular frame is a tight frame after normalization.
As the uniform norm max_{i≠j} |⟨D_i, D_j⟩| is not differentiable, we relax it by a squared Frobenius norm, which leads to the model min_{‖D_l‖=1, C} ‖Y − DC‖_F² + λΨ(C) + µ‖D*D − I‖_F².

62 Demonstration of dictionary learning for sparse coding
Visualization of the mutual coherence of dictionaries: K-SVD vs. in-coherent dictionary learning (IDL).
Table: classification accuracies (%) on face and object recognition, comparing K-SVD and IDL on the Extended YaleB (face), AR Face (face) and Caltech (object) datasets.

63 Kernel methods for classification
Capturing non-linear patterns in data: non-linear regression and non-linear classification. Kernels make linear models work in non-linear settings by mapping the data to a higher dimension where it exhibits linear patterns. (Figure: Euclidean space vs. kernel space.)

64 Feature mapping
If the data set is not linearly separable, adding new features may make it linearly separable; any m + 1 points in R^m in general position are linearly separable. (Figure: mapping x_k ↦ z_k = (x_k, x_k²) separates the classes.)
Consider a data set {x_k} ⊂ R^m. Map it to a larger feature space via a so-called feature mapping function, e.g. φ(x) = [x₁, x₁², sin(x₂), e^{x₂+x₃}, tan(x₄), ...].

65 Feature mapping and kernels
1. The data set is often mapped to a very high (possibly infinite) dimensional space. How can this be computed?
2. Many classifiers use only inner products, e.g. the perceptron and the SVM. For two feature vectors x_i, x_j, let κ(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩. Can we just evaluate κ(x_i, x_j) without explicitly calling φ?
Definition (kernel). A kernel K: X × X → R has an associated feature mapping Φ: X → H such that K(x, z) = ⟨Φ(x), Φ(z)⟩, where X is the input space and H is a Hilbert feature space with inner product ⟨·,·⟩.

66 Mercer's theorem

Conditions on K:
1. K(·,·) ∈ L2(X × X) is symmetric: K(x, y) = K(y, x).
2. K is positive semi-definite: Σ_i Σ_j c_i c_j K(x_i, x_j) ≥ 0 for all finite {x_i} ⊂ X and real {c_i}.
Define the associated Hilbert-Schmidt integral operator on L2(X):
    [T_K φ](x) = ∫_X K(x, s) φ(s) ds.

Theorem (Mercer). Suppose K is continuous, symmetric and positive semi-definite. Then there exists an orthonormal basis {e_i} of L2(X) such that each e_i is an eigenfunction of T_K with eigenvalue λ_i ≥ 0, and
    K(x, y) = Σ_j λ_j e_j(x) e_j(y).
Examples: the polynomial kernel K(x, z) = (1 + ⟨x, z⟩)^d; the radial basis function K(x, z) = e^{−γ‖x − z‖_2^2}.
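The positive semi-definiteness condition can be sanity-checked numerically for the radial basis function kernel; sizes, γ, and seed below are arbitrary:

```python
import numpy as np

def rbf_gram(X, gamma=0.5):
    """Gram matrix K_ij = exp(-gamma * ||x_i - x_j||^2) for the rows of X."""
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    return np.exp(-gamma * np.maximum(d2, 0.0))      # clip tiny negative round-off

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
K = rbf_gram(X)
eigs = np.linalg.eigvalsh(K)
print(eigs.min() > -1e-10)   # all eigenvalues (numerically) non-negative
```

For distinct points the RBF Gram matrix is in fact strictly positive definite, which is why it is such a reliable default kernel.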

67 Equiangular frames in kernel space

Definition. A frame X for a Hilbert space H is called a Grassmannian frame, or equiangular frame, for H if there exists a constant c_0 > 0 such that (i) ‖x_k‖ = 1 for all x_k ∈ X; (ii) |⟨x_i, x_j⟩| = c_0 for all i ≠ j.
An equiangular frame is, up to a constant, a tight frame. Among all finite-dimensional frames of the same cardinality, the equiangular frame has the minimum mutual coherence.

Theorem. Let Φ denote the feature mapping from X to a Hilbert space H with associated kernel K(x, y) = γ(‖x − y‖^2) for some function γ. Then the image of an equiangular frame X in X, denoted Φ(X), is also an equiangular frame of the linear space span{Φ(X)}.
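A concrete equiangular tight frame is the three-vector "Mercedes-Benz" frame in R^2; the short check below verifies both defining properties (this classical example is added for illustration):

```python
import numpy as np

# Mercedes-Benz frame: three unit vectors in R^2 separated by 120 degrees
angles = np.array([np.pi / 2, np.pi / 2 + 2 * np.pi / 3, np.pi / 2 + 4 * np.pi / 3])
X = np.stack([np.cos(angles), np.sin(angles)])     # 2 x 3, columns are frame vectors

G = X.T @ X                                        # Gram matrix
off = np.abs(G[~np.eye(3, dtype=bool)])            # off-diagonal inner products
print(np.allclose(off, 0.5))                       # equiangular: |<x_i, x_j>| = 1/2
print(np.allclose(X @ X.T, 1.5 * np.eye(2)))       # tight: frame operator = (3/2) I
```

Its coherence 1/2 attains the lower bound ((K − N)/(N(K − 1)))^{1/2} = (1/4)^{1/2} for N = 2, K = 3, illustrating the minimality claim above.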

68 Equiangular kernel dictionary learning

A direct imposition of equiangular constraints leads to challenging optimization problems. A simplified approach considers an orthogonal dictionary D ∈ R^{T×T} with DᵀD = I.
Variational model for equiangular kernel dictionary learning:
    min_{DᵀD = I, C}  ‖Φ(Y) − Φ(D)C‖^2 + λψ(C).
Let D* denote its minimizer. Then Φ(D*) forms an equiangular dictionary in H. When solving the optimization problem above, only the kernel K(·,·) needs to be evaluated, e.g. the radial basis function K(x, z) = e^{−γ‖x − z‖^2}.

69 References

1. J. Mairal, F. Bach, J. Ponce, and G. Sapiro, Online learning for matrix factorization and sparse coding, J. Mach. Learn. Res., 11, 19-60, 2010.
2. Q. Zhang and B. Li, Discriminative K-SVD for dictionary learning in face recognition, CVPR, 2010.
3. C. Bao, H. Ji, Y. Quan and Z. Shen, Dictionary learning for sparse coding: algorithms and convergence, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
4. Y. Quan, C. Bao and H. Ji, Equiangular kernel dictionary learning with applications to dynamic texture, IEEE CVPR, Las Vegas, 2016.
5. C. Bao, Y. Quan and H. Ji, A convergent incoherent dictionary learning algorithm for sparse coding, ECCV, Zurich, 2014.
6. S. Gao, I. W. Tsang, and L.-T. Chia, Sparse representation with kernels, IEEE Trans. Image Process., 22(2), 2013.

70 Basics of the proximal gradient descent method

Define the proximal mapping of a proper and lower semi-continuous function g:
    prox_{αg}(x) = argmin_y (1/(2α)) ‖y − x‖^2 + g(y),  α > 0.
Examples:
- l1-norm, g(x) = ‖x‖_1 (soft thresholding): prox_{α‖·‖_1}(x) = sign(x) max(|x| − α, 0).
- l0-norm, g(x) = ‖x‖_0 (hard thresholding):
    prox_{α‖·‖_0}(x) = x, if |x| > √(2α); {0, x}, if |x| = √(2α); 0, if |x| < √(2α).
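Both proximal mappings above have closed forms that take two lines of numpy each; the test vector and α values are illustrative:

```python
import numpy as np

def prox_l1(x, alpha):
    """prox of alpha * ||.||_1: componentwise soft thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

def prox_l0(x, alpha):
    """prox of alpha * ||.||_0: componentwise hard thresholding at sqrt(2*alpha)."""
    return np.where(np.abs(x) > np.sqrt(2.0 * alpha), x, 0.0)

x = np.array([-3.0, -0.4, 0.0, 0.9, 2.5])
soft = prox_l1(x, 1.0)   # shrinks all entries toward 0, kills the small ones
hard = prox_l0(x, 0.5)   # threshold sqrt(2 * 0.5) = 1: keep or kill, no shrinkage
```

The contrast is visible on the same input: soft thresholding biases the surviving entries toward zero, while hard thresholding leaves them untouched.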

71 Proximal gradient descent method

Consider the unconstrained minimization
    min_x f(x) + g(x),
where
- f ∈ C^1 has a Lipschitz continuous gradient with modulus L;
- g is proper and lower semi-continuous.
Example: the synthesis (or balanced) model for sparse recovery, with f(c) = (1/2)‖A(Wc) − b‖_2^2 and g(c) = λ‖c‖_1.
Proximal gradient method (PG): for k = 0, 1, ...,
    x^{k+1} = prox_{αg}(x^k − α∇f(x^k)).
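With the soft-thresholding prox, PG applied to the l1 model (taking W = Id for simplicity) is the classical ISTA iteration; a self-contained sketch on a synthetic problem, with sizes, λ, iteration count, and seed chosen arbitrarily:

```python
import numpy as np

def ista(A, b, lam, alpha, iters=500):
    """Proximal gradient (ISTA) for min_x 0.5*||Ax - b||^2 + lam*||x||_1."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        z = x - alpha * (A.T @ (A @ x - b))                    # gradient step on f
        x = np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)  # prox step on g
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((60, 100))
x_true = np.zeros(100)
x_true[[5, 37]] = [1.0, -2.0]
b = A @ x_true
L = np.linalg.norm(A, 2) ** 2        # Lipschitz modulus of the gradient of f
x_hat = ista(A, b, lam=0.05, alpha=1.0 / L)
```

With step size 1/L each iteration is a monotone descent step on the composite objective, which the test below relies on.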

72 Acceleration and convergence

Accelerated proximal gradient method (APG): for k = 0, 1, ...,
    t^k = (1 + √(1 + 4(t^{k−1})^2)) / 2,
    y^k = x^k + ((t^{k−1} − 1)/t^k)(x^k − x^{k−1}),
    x^{k+1} = prox_{αg}(y^k − α∇f(y^k)).

Theorem (convergence behavior of PG and APG). Let h* = min h(x), where h = f + g, let x* be any minimizer, and let 0 < α < 1/L. Then:
- the proximal gradient method satisfies h(x^k) − h* ≤ ‖x^0 − x*‖^2 / (2αk);
- the accelerated proximal gradient method satisfies h(x^k) − h* ≤ 2‖x^0 − x*‖^2 / (α(k + 1)^2).
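Specialized to the same l1 problem, the APG recursion above is FISTA; a sketch on the same kind of synthetic setup (all concrete numbers are illustrative):

```python
import numpy as np

def fista(A, b, lam, alpha, iters=300):
    """Accelerated proximal gradient (FISTA) for min_x 0.5*||Ax - b||^2 + lam*||x||_1."""
    x = x_prev = np.zeros(A.shape[1])
    t = 1.0
    for _ in range(iters):
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)       # extrapolation (momentum) step
        z = y - alpha * (A.T @ (A @ y - b))               # gradient step at y
        x_prev, x = x, np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)
        t = t_next
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((60, 100))
x_true = np.zeros(100)
x_true[[5, 37]] = [1.0, -2.0]
b = A @ x_true
alpha = 1.0 / np.linalg.norm(A, 2) ** 2
x_hat = fista(A, b, lam=0.05, alpha=alpha)
obj = 0.5 * np.linalg.norm(A @ x_hat - b) ** 2 + 0.05 * np.abs(x_hat).sum()
```

The O(1/k^2) bound in the theorem guarantees the objective gap after 300 iterations is already tiny here, at the cost of losing the per-step monotonicity of plain PG.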

73 Alternating direction method of multipliers

Consider a convex model with linear constraints:
    min_{x,y} f(x) + h(y),  s.t.  Ax + By = b.
Example: analysis model based sparse recovery,
    min_{x,y} (1/2)‖Ax − b‖_2^2 + λ‖y‖_1,  s.t.  Wx − y = 0.
Define the augmented Lagrangian
    L_σ(x, y; z) = f(x) + h(y) + ⟨z, Ax + By − b⟩ + (σ/2)‖Ax + By − b‖_2^2.
Alternating direction method of multipliers (ADMM):
    x^{k+1} ∈ argmin_x L_σ(x, y^k; z^k),
    y^{k+1} ∈ argmin_y L_σ(x^{k+1}, y; z^k),
    z^{k+1} = z^k + σ(Ax^{k+1} + By^{k+1} − b).

74 Convergence of ADMM

Assumptions on the functions f and h:
1. The extended-real-valued functions f and h are closed, proper and convex.
2. The unaugmented Lagrangian L_0 has a saddle point.
3. Both sub-problems in ADMM have bounded solutions.

Theorem (convergence of ADMM). Suppose assumptions 1-3 hold. Let (x*, y*; z*) be a saddle point of L_0, let p* be the optimal value, and let (x^k, y^k; z^k) be the infinite sequence generated by ADMM. Then we have:
- residual convergence: Ax^k + By^k − b → 0 as k → ∞;
- objective convergence: f(x^k) + h(y^k) → p* as k → ∞;
- dual variable convergence: z^k → z* as k → ∞.
Convergence of the primal variables can be obtained under additional assumptions, e.g. A surjective and B = Id.

75 Multi-block iterations for non-convex problems

Consider the unconstrained non-convex minimization
    min_{x = (x_1, ..., x_n)}  F(x) = f(x) + Σ_{i=1}^n g_i(x_i),  (2)
where
- inf F > −∞, inf f > −∞ and inf g_i > −∞ for all i;
- f ∈ C^1 and ∇f is Lipschitz continuous on any bounded set;
- for each block x_i, ∇_i f is L_i-Lipschitz continuous;
- g_i, i = 1, ..., n, are proper and lower semi-continuous.
Example: multi-linear problems with (non-)convex regularization.
- Blind deconvolution: f(x, h) = (1/2)‖y − x ∗ h‖^2, g_1(x) = ‖W_1 x‖_1, g_2(h) = ‖W_2 h‖_1 + µ‖h‖_2^2.
- Dictionary learning: f(D, C) = (1/2)‖Y − DC‖^2, g_1(C) = ‖C‖_0, g_2(D) = δ_{‖D_k‖ = 1}(D).

76 Kurdyka-Łojasiewicz property

Define
    X_η(x̄) = X ∩ {x : f(x̄) < f(x) < f(x̄) + η},
    Φ_η = {φ ∈ C^1(0, η) ∩ C[0, η) : φ(0) = 0, φ is concave, φ' > 0 on (0, η)}.

Definition (Kurdyka-Łojasiewicz (KL) property). Let f be a proper and lower semi-continuous function. The function is said to have the KL property at x̄ if there exist η > 0 and φ ∈ Φ_η such that
    φ'(f(x) − f(x̄)) · dist(0, ∂f(x)) ≥ 1
holds for all x ∈ X_η(x̄). If f satisfies the KL property at each point of dom ∂f = {x ∈ R^d : ∂f(x) ≠ ∅}, then f is called a KL function.

Example (typical KL functions). Semi-algebraic or real-analytic functions: polynomial functions, norm functions, the TV norm, the l0 norm, the rank function, the logistic regression function, etc.

77 Hybrid proximal alternating method

Recall the problem
    min_{x = (x_1, ..., x_n)}  F(x) = f(x) + Σ_{i=1}^n g_i(x_i).
Define f_i^k(x_i) = f(x_1^k, ..., x_{i−1}^k, x_i, x_{i+1}^{k−1}, ..., x_n^{k−1}).
Hybrid proximal alternating iteration: for k = 0, 1, ...,
    for i = 1, ..., n:
        x_i^{k+1} = prox_{λ_i^k (f_i^k + g_i)}(x_i^k)   or   x_i^{k+1} = prox_{λ_i^k g_i}(x_i^k − λ_i^k ∇f_i^k(x_i^k)).

Theorem (global convergence). Let F be as defined in (2) and let {x^k = (x_1^k, ..., x_n^k)} be the infinite sequence generated by the hybrid proximal alternating method. If F is a KL function and {x^k} is bounded, then {x^k} converges to a critical point x* of F, i.e. 0 ∈ ∂F(x*).

78 Demonstration in dictionary learning

The performance depends on the block partition, the initialization, and the updating order and scheme. Consider a classic dictionary learning problem:
    min_{D,C} (1/2)‖Y − DC‖_F^2 + λ‖C‖_0,  s.t.  ‖C‖_∞ ≤ M, ‖d_j‖_2 = 1 for all j.
Choice of block partition:
- C1: (x_1, ..., x_n) = (c_1, ..., c_q, d_1, ..., d_q);
- C2: (x_1, ..., x_n) = (C, d_1, ..., d_q).
Choice of update scheme and order:
- S1 Proximal alternating: C1 + proximal method.
- S2 Proximal linearized alternating: C1/C2 + proximal linearized method.
- S3 Hybrid method with acceleration: C2 + proximal method, with an updating order that re-updates the coefficient c_i right after updating the corresponding dictionary atom d_i.
All sub-problems in S1-S3 have closed-form solutions.
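A simplified S2-style (proximal linearized alternating) sketch of this problem is below; it illustrates the block structure only — it is not the authors' accelerated S3 scheme, it drops the bound ‖C‖_∞ ≤ M, and its step sizes follow the 1/L_i rule:

```python
import numpy as np

def dict_learn(Y, q, lam=0.1, iters=50, seed=0):
    """Proximal linearized alternating sketch for
    min 0.5*||Y - DC||_F^2 + lam*||C||_0, s.t. unit-norm atoms."""
    rng = np.random.default_rng(seed)
    m, _ = Y.shape
    D = rng.standard_normal((m, q))
    D /= np.linalg.norm(D, axis=0)
    C = np.zeros((q, Y.shape[1]))
    for _ in range(iters):
        # C-block: gradient step on the fit term, then hard thresholding (prox of l0)
        tC = 1.0 / (np.linalg.norm(D, 2) ** 2 + 1e-12)
        G = C - tC * (D.T @ (D @ C - Y))
        C = np.where(np.abs(G) > np.sqrt(2.0 * lam * tC), G, 0.0)
        # D-block: gradient step, then project each atom back onto the unit sphere
        tD = 1.0 / (np.linalg.norm(C, 2) ** 2 + 1e-12)
        D = D - tD * ((D @ C - Y) @ C.T)
        norms = np.linalg.norm(D, axis=0)
        norms[norms < 1e-12] = 1.0
        D /= norms
    return D, C

rng = np.random.default_rng(1)
Y = rng.standard_normal((16, 200))
D, C = dict_learn(Y, q=32)
fit0 = 0.5 * np.linalg.norm(Y) ** 2                                   # objective at C = 0
fit = 0.5 * np.linalg.norm(Y - D @ C) ** 2 + 0.1 * np.count_nonzero(C)
```

Because each block step uses a step size strictly below 1/L_i, the objective is non-increasing along the iterations, which matches the descent behavior the convergence theory requires.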

79 References

1. J. Bolte, et al., Proximal alternating linearized minimization for nonconvex and nonsmooth problems, Math. Program., 136.
2. H. Attouch, et al., Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods, Math. Program., 137(1-2), 2013.
3. H. Attouch, et al., Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Łojasiewicz inequality, Mathematics of Operations Research, 35(2), 2010.
4. Y. Xu and W. Yin, A block coordinate descent method for multi-convex optimization with applications to nonnegative tensor factorization and completion, SIAM J. Imaging Sci., 6(3), 2013.
5. C. Bao, et al., Dictionary learning for sparse coding: algorithms and convergence, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
6. J.-F. Cai, et al., Split Bregman methods and frame based image restoration, Multiscale Modeling and Simulation, 8(2), 2009.
7. Z. Shen, et al., An accelerated proximal gradient algorithm for frame-based image restoration via the balanced approach, SIAM Journal on Imaging Sciences, 4(2), 2011.


More information

Frames and operator representations of frames

Frames and operator representations of frames Frames and operator representations of frames Ole Christensen Joint work with Marzieh Hasannasab HATA DTU DTU Compute, Technical University of Denmark HATA: Harmonic Analysis - Theory and Applications

More information

Wavelets: Theory and Applications. Somdatt Sharma

Wavelets: Theory and Applications. Somdatt Sharma Wavelets: Theory and Applications Somdatt Sharma Department of Mathematics, Central University of Jammu, Jammu and Kashmir, India Email:somdattjammu@gmail.com Contents I 1 Representation of Functions 2

More information

5742 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 12, DECEMBER /$ IEEE

5742 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 12, DECEMBER /$ IEEE 5742 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 12, DECEMBER 2009 Uncertainty Relations for Shift-Invariant Analog Signals Yonina C. Eldar, Senior Member, IEEE Abstract The past several years

More information

Recent Advances in Structured Sparse Models

Recent Advances in Structured Sparse Models Recent Advances in Structured Sparse Models Julien Mairal Willow group - INRIA - ENS - Paris 21 September 2010 LEAR seminar At Grenoble, September 21 st, 2010 Julien Mairal Recent Advances in Structured

More information

EXAMPLES OF REFINABLE COMPONENTWISE POLYNOMIALS

EXAMPLES OF REFINABLE COMPONENTWISE POLYNOMIALS EXAMPLES OF REFINABLE COMPONENTWISE POLYNOMIALS NING BI, BIN HAN, AND ZUOWEI SHEN Abstract. This short note presents four examples of compactly supported symmetric refinable componentwise polynomial functions:

More information

Symmetric Wavelet Tight Frames with Two Generators

Symmetric Wavelet Tight Frames with Two Generators Symmetric Wavelet Tight Frames with Two Generators Ivan W. Selesnick Electrical and Computer Engineering Polytechnic University 6 Metrotech Center, Brooklyn, NY 11201, USA tel: 718 260-3416, fax: 718 260-3906

More information

PAIRS OF DUAL PERIODIC FRAMES

PAIRS OF DUAL PERIODIC FRAMES PAIRS OF DUAL PERIODIC FRAMES OLE CHRISTENSEN AND SAY SONG GOH Abstract. The time-frequency analysis of a signal is often performed via a series expansion arising from well-localized building blocks. Typically,

More information

RKHS, Mercer s theorem, Unbounded domains, Frames and Wavelets Class 22, 2004 Tomaso Poggio and Sayan Mukherjee

RKHS, Mercer s theorem, Unbounded domains, Frames and Wavelets Class 22, 2004 Tomaso Poggio and Sayan Mukherjee RKHS, Mercer s theorem, Unbounded domains, Frames and Wavelets 9.520 Class 22, 2004 Tomaso Poggio and Sayan Mukherjee About this class Goal To introduce an alternate perspective of RKHS via integral operators

More information

1 Sparsity and l 1 relaxation

1 Sparsity and l 1 relaxation 6.883 Learning with Combinatorial Structure Note for Lecture 2 Author: Chiyuan Zhang Sparsity and l relaxation Last time we talked about sparsity and characterized when an l relaxation could recover the

More information

Preconditioning techniques in frame theory and probabilistic frames

Preconditioning techniques in frame theory and probabilistic frames Preconditioning techniques in frame theory and probabilistic frames Department of Mathematics & Norbert Wiener Center University of Maryland, College Park AMS Short Course on Finite Frame Theory: A Complete

More information

Contents. Acknowledgments

Contents. Acknowledgments Table of Preface Acknowledgments Notation page xii xx xxi 1 Signals and systems 1 1.1 Continuous and discrete signals 1 1.2 Unit step and nascent delta functions 4 1.3 Relationship between complex exponentials

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

Machine Learning. Support Vector Machines. Manfred Huber

Machine Learning. Support Vector Machines. Manfred Huber Machine Learning Support Vector Machines Manfred Huber 2015 1 Support Vector Machines Both logistic regression and linear discriminant analysis learn a linear discriminant function to separate the data

More information

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015 EE613 Machine Learning for Engineers Kernel methods Support Vector Machines jean-marc odobez 2015 overview Kernel methods introductions and main elements defining kernels Kernelization of k-nn, K-Means,

More information

Recovering overcomplete sparse representations from structured sensing

Recovering overcomplete sparse representations from structured sensing Recovering overcomplete sparse representations from structured sensing Deanna Needell Claremont McKenna College Feb. 2015 Support: Alfred P. Sloan Foundation and NSF CAREER #1348721. Joint work with Felix

More information

An Introduction to Filterbank Frames

An Introduction to Filterbank Frames An Introduction to Filterbank Frames Brody Dylan Johnson St. Louis University October 19, 2010 Brody Dylan Johnson (St. Louis University) An Introduction to Filterbank Frames October 19, 2010 1 / 34 Overview

More information

EUSIPCO

EUSIPCO EUSIPCO 013 1569746769 SUBSET PURSUIT FOR ANALYSIS DICTIONARY LEARNING Ye Zhang 1,, Haolong Wang 1, Tenglong Yu 1, Wenwu Wang 1 Department of Electronic and Information Engineering, Nanchang University,

More information

EE 367 / CS 448I Computational Imaging and Display Notes: Image Deconvolution (lecture 6)

EE 367 / CS 448I Computational Imaging and Display Notes: Image Deconvolution (lecture 6) EE 367 / CS 448I Computational Imaging and Display Notes: Image Deconvolution (lecture 6) Gordon Wetzstein gordon.wetzstein@stanford.edu This document serves as a supplement to the material discussed in

More information

Sparse Optimization Lecture: Dual Methods, Part I

Sparse Optimization Lecture: Dual Methods, Part I Sparse Optimization Lecture: Dual Methods, Part I Instructor: Wotao Yin July 2013 online discussions on piazza.com Those who complete this lecture will know dual (sub)gradient iteration augmented l 1 iteration

More information

A New Look at First Order Methods Lifting the Lipschitz Gradient Continuity Restriction

A New Look at First Order Methods Lifting the Lipschitz Gradient Continuity Restriction A New Look at First Order Methods Lifting the Lipschitz Gradient Continuity Restriction Marc Teboulle School of Mathematical Sciences Tel Aviv University Joint work with H. Bauschke and J. Bolte Optimization

More information

Topics in Harmonic Analysis, Sparse Representations, and Data Analysis

Topics in Harmonic Analysis, Sparse Representations, and Data Analysis Topics in Harmonic Analysis, Sparse Representations, and Data Analysis Norbert Wiener Center Department of Mathematics University of Maryland, College Park http://www.norbertwiener.umd.edu Thesis Defense,

More information

A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices

A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices Ryota Tomioka 1, Taiji Suzuki 1, Masashi Sugiyama 2, Hisashi Kashima 1 1 The University of Tokyo 2 Tokyo Institute of Technology 2010-06-22

More information

Proximal Minimization by Incremental Surrogate Optimization (MISO)

Proximal Minimization by Incremental Surrogate Optimization (MISO) Proximal Minimization by Incremental Surrogate Optimization (MISO) (and a few variants) Julien Mairal Inria, Grenoble ICCOPT, Tokyo, 2016 Julien Mairal, Inria MISO 1/26 Motivation: large-scale machine

More information

A BRIEF INTRODUCTION TO HILBERT SPACE FRAME THEORY AND ITS APPLICATIONS AMS SHORT COURSE: JOINT MATHEMATICS MEETINGS SAN ANTONIO, 2015 PETER G. CASAZZA Abstract. This is a short introduction to Hilbert

More information

Variable Metric Forward-Backward Algorithm

Variable Metric Forward-Backward Algorithm Variable Metric Forward-Backward Algorithm 1/37 Variable Metric Forward-Backward Algorithm for minimizing the sum of a differentiable function and a convex function E. Chouzenoux in collaboration with

More information

Kernel Methods. Machine Learning A W VO

Kernel Methods. Machine Learning A W VO Kernel Methods Machine Learning A 708.063 07W VO Outline 1. Dual representation 2. The kernel concept 3. Properties of kernels 4. Examples of kernel machines Kernel PCA Support vector regression (Relevance

More information

Wavelet Bi-frames with Uniform Symmetry for Curve Multiresolution Processing

Wavelet Bi-frames with Uniform Symmetry for Curve Multiresolution Processing Wavelet Bi-frames with Uniform Symmetry for Curve Multiresolution Processing Qingtang Jiang Abstract This paper is about the construction of univariate wavelet bi-frames with each framelet being symmetric.

More information

A primer on the theory of frames

A primer on the theory of frames A primer on the theory of frames Jordy van Velthoven Abstract This report aims to give an overview of frame theory in order to gain insight in the use of the frame framework as a unifying layer in the

More information