Deep Learning: Approximation of Functions by Composition

Zuowei Shen, Department of Mathematics, National University of Singapore

Outline
1. A brief introduction of approximation theory
2. Deep learning: approximation of functions by composition
3. Approximation of CNNs and sparse coding
4. Approximation in feature space

A brief introduction of approximation theory

For a given function $f:\mathbb{R}^d\to\mathbb{R}$ and $\epsilon>0$, approximation is to find a simple function $g$ such that $\|f-g\|<\epsilon$. The function $g:\mathbb{R}^n\to\mathbb{R}$ can be as simple as $g(x)=a^\top x$. To make sense of this approximation, we need to find a map $T:\mathbb{R}^d\to\mathbb{R}^n$ such that $\|f-g\circ T\|<\epsilon$.

1. Classical approximation: $T$ is independent of the data and $n$ depends on $\epsilon$.
2. Deep learning: $T$ is learned from the data, and $n$ can be independent of $\epsilon$.
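
To make the distinction concrete, here is a minimal numerical sketch (my own illustration, not from the talk): a fixed, data-independent feature map $T$ built from $n$ hat functions on $[0,1]$, with a linear $g(x)=a^\top x$ fitted by least squares. As the classical picture predicts, the error is controlled by $n$.

```python
import numpy as np

# Fixed (data-independent) feature map T made of n hat functions on [0, 1]; g is linear.
rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)                  # the function to approximate

n = 20
centers = np.linspace(0, 1, n)
width = 1.0 / (n - 1)
T = lambda x: np.maximum(0.0, 1.0 - np.abs(x[:, None] - centers) / width)

x = rng.uniform(0, 1, 500)
a, *_ = np.linalg.lstsq(T(x), f(x), rcond=None)      # fit g(z) = a^T z by least squares
err = np.max(np.abs(T(x) @ a - f(x)))
print(f"max error of g(T(x)) on the samples: {err:.3e}")   # decreases as n grows
```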

Classical approximation

Linear approximation: given a finite set of generators $\{\phi_1,\dots,\phi_n\}$, e.g. splines, wavelet frames, finite elements, or generators of a reproducing kernel Hilbert space, define $T=[\phi_1,\phi_2,\dots,\phi_n]:\mathbb{R}^d\to\mathbb{R}^n$ and $g(x)=a^\top x$. The linear approximation is to find $a\in\mathbb{R}^n$ such that
$$ g\circ T=\sum_{i=1}^n a_i\phi_i \approx f. $$
It is linear because $f_1\approx g_1,\ f_2\approx g_2 \Rightarrow f_1+f_2\approx g_1+g_2$.

Nonlinear approximation: given infinitely many generators $\Phi=\{\phi_i\}_{i=1}^\infty$, define $T=[\phi_1,\phi_2,\dots]:\mathbb{R}^d\to\mathbb{R}^\infty$ and $g(x)=a^\top x$. The nonlinear approximation of $f$ is to find a finitely supported $a$ such that $g\circ T\approx f$. It is nonlinear because $f_1\approx g_1,\ f_2\approx g_2$ does not imply $f_1+f_2\approx g_1+g_2$, since the support of the approximant $g$ of $f$ depends on $f$.

Examples

Consider the function space $L_2(\mathbb{R}^d)$ and let $\{\phi_i\}_{i=1}^\infty$ be an orthonormal basis of $L_2(\mathbb{R}^d)$.

Linear approximation: for a given $n$, take $T=[\phi_1,\dots,\phi_n]$ and $g=a^\top x$ with $a_j=\langle f,\phi_j\rangle$. Denote $H=\mathrm{span}\{\phi_1,\dots,\phi_n\}\subset L_2(\mathbb{R}^d)$. Then
$$ g\circ T=\sum_{i=1}^n \langle f,\phi_i\rangle\,\phi_i $$
is the orthogonal projection of $f$ onto $H$ and is the best approximation of $f$ from $H$. It provides a good approximation of $f$ when the sequence $\{\langle f,\phi_j\rangle\}_{j=1}^\infty$ decays fast as $j\to+\infty$.

Therefore:
1. Linear approximation works well for smooth functions.
2. When $n=\infty$, it reproduces any function in $L_2(\mathbb{R}^d)$.
3. Advantage: it is a good approximation scheme when $d$ is small, the domain is simple, and the function is complicated in form but smooth.
4. Disadvantage: if $d$ is large and $\epsilon$ is small, $n$ is huge.
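
A discrete illustration of the linear scheme (a toy of my own, not from the slides): take the orthonormal DCT-II basis of $\mathbb{R}^N$ as a stand-in for $\{\phi_j\}$ and approximate a smooth sample vector by its orthogonal projection onto the span of the first $n$ basis vectors.

```python
import numpy as np

# Orthonormal DCT-II basis of R^N as a stand-in for {phi_j}; linear approximation keeps
# the first n coefficients (orthogonal projection onto span{phi_1, ..., phi_n}).
N = 256
k = np.arange(N)
B = np.cos(np.pi * np.outer(k, k + 0.5) / N)
B[0] *= np.sqrt(1 / N)
B[1:] *= np.sqrt(2 / N)
assert np.allclose(B @ B.T, np.eye(N))               # the rows form an orthonormal basis

f = np.exp(-20 * (k / N - 0.5) ** 2)                 # a smooth sample vector
coef = B @ f                                         # <f, phi_j>
for n in (8, 16, 32):
    g = B[:n].T @ coef[:n]                           # projection onto the first n elements
    print(n, np.linalg.norm(f - g))                  # error decays quickly for smooth f
```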

Examples

Nonlinear approximation: take $T=(\phi_j)_{j=1}^\infty:\mathbb{R}^d\to\mathbb{R}^\infty$ and $g(x)=a^\top x$, where
$$ a_j=\begin{cases}\langle f,\phi_j\rangle, & \text{for the $n$ largest terms of the sequence } \{|\langle f,\phi_j\rangle|\}_{j=1}^\infty,\\[2pt] 0, & \text{otherwise.}\end{cases} $$
The approximation of $f$ by $g\circ T$ depends less on the decay of the sequence $\{\langle f,\phi_j\rangle\}_{j=1}^\infty$. Therefore:
1. Nonlinear approximation is better than linear approximation when $f$ is nonsmooth.
2. Curse of dimensionality: if $d$ is large and $\epsilon$ is small, $n$ is still huge.
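
The adaptivity of the $n$-term choice is easiest to see against a nonsmooth target. The sketch below (again my own toy, using a discrete Haar basis) compares keeping the first $n$ coefficients with keeping the $n$ largest ones for a step function.

```python
import numpy as np

# Discrete Haar basis of R^N (N a power of two), coarse scales first.
def haar(N):
    H = np.array([[1.0]])
    while H.shape[0] < N:
        n = H.shape[0]
        H = np.vstack([np.kron(H, [1.0, 1.0]),
                       np.kron(np.eye(n), [1.0, -1.0])]) / np.sqrt(2)
    return H

N = 256
B = haar(N)
f = (np.arange(N) > N // 3).astype(float)            # a step function: nonsmooth
coef = B @ f
n = 24
g_lin = B[:n].T @ coef[:n]                           # linear: keep the first n coefficients
idx = np.argsort(np.abs(coef))[-n:]                  # nonlinear: keep the n largest |<f, phi_j>|
g_non = B[idx].T @ coef[idx]
print("linear error   :", np.linalg.norm(f - g_lin))
print("nonlinear error:", np.linalg.norm(f - g_non)) # far smaller: the chosen terms adapt to f
```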

Both linear and nonlinear approximation are schemes for approximating a class of functions with a fixed $T$; they essentially change the basis used to represent or approximate a certain class of functions. Neither is well suited to approximating $f$ when $f$ is defined on a complex domain, e.g. a manifold in a very high dimensional space. In deep learning, however, $T$ is constructed from the given data and is adaptive to the underlying function to be approximated: $T$ changes variables and maps the domain of $f$ to a feature domain where approximation becomes simpler, more robust, and more efficient. Deep learning approximation is to find a map $T$ that sends the domain of $f$ to a simpler/better domain so that simple classical approximation can be applied.

Part 2. Deep learning: approximation of functions by composition

Approximation for deep learning

Given data $\{(x_i,f(x_i))\}_{i=1}^m$:
1. The key to deep learning is to construct $T$ from the given data.
2. $T$ can simplify the domain of $f$ through a change of variables.
3. $T$ captures the key features of $f$ and of its domain, so that
4. it is easy to find $g$ such that $g\circ T$ gives a good approximation of $f$.

What is the mathematics behind this? Setting: construct a map $T:\mathbb{R}^d\to\mathbb{R}^n$ and a simple function $g$ (e.g. $g(x)=a^\top x$) from data such that $g\circ T$ provides a good approximation of $f$.

Approximation by compositions

Question 1: For an arbitrarily given $\epsilon>0$, is there a $T:\mathbb{R}^d\to\mathbb{R}^n$ such that $\|f-g\circ T\|\le\epsilon$?

Answer: Yes!

Theorem. Let $f:\mathbb{R}^d\to\mathbb{R}$ and $g:\mathbb{R}^n\to\mathbb{R}$, and assume $\mathrm{Im}(f)\subseteq\mathrm{Im}(g)$. For an arbitrarily given $\epsilon>0$, there exists $T:\mathbb{R}^d\to\mathbb{R}^n$ such that $\|f-g\circ T\|\le\epsilon$.

$T$ can be written out explicitly in terms of $f$ and $g$, but $T$ can be complex. This leads to the next question.

Approximation by compositions

Question 2: Can $T$ be a composition of simple maps? That is, can we write $T=T_1\circ\cdots\circ T_J$ with each $T_i$, $i=1,2,\dots,J$, simple, e.g. a perturbation of the identity?

Answer: Yes!

Theorem. Let $f:\mathbb{R}^d\to\mathbb{R}$ and $g:\mathbb{R}^n\to\mathbb{R}$. For an arbitrarily given $\epsilon>0$, if $\mathrm{Im}(f)\subseteq\mathrm{Im}(g)$, then there exist $J$ simple maps $T_i$, $i=1,2,\dots,J$, such that $T=T_1\circ T_2\circ\cdots\circ T_J:\mathbb{R}^d\to\mathbb{R}^n$ and $\|f-g\circ T_1\circ\cdots\circ T_J\|\le\epsilon$.

The maps $T_i$, $i=1,2,\dots,J$, can be written out explicitly in terms of $T$.

Approximation by compositions

Question 3: Can the $T_i$, $i=1,2,\dots,J$, be constructed mathematically by some scheme?

Answer: Yes! They can be constructed by solving the minimization problem
$$ \min_{T_1,T_2,\dots,T_J}\ \|f-g\circ T_1\circ\cdots\circ T_J\|. $$
A constructive proof is given in the paper.

Question 4: Given training data $\{(x_i,f(x_i))\}_{i=1}^m$, can we design a numerical scheme to find $T_i$, $i=1,2,\dots,J$, and $g$ such that $\|f-g\circ T_1\circ\cdots\circ T_J\|\le\epsilon$ with high probability, by minimizing
$$ \frac{1}{m}\sum_{i=1}^m \bigl(f(x_i)-g\circ T_1\circ\cdots\circ T_J(x_i)\bigr)^2\,? $$
Answer: Yes! We have designed deep neural networks for this, and numerical simulations show they perform well.
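
A hedged sketch of the kind of numerical scheme Question 4 refers to, under my own simplifications (not the authors' implementation): each $T_i$ is a small residual block, i.e. a perturbation of the identity, $g$ is linear, and all parameters are found by minimizing the empirical squared loss with a generic optimizer (PyTorch here).

```python
import torch
import torch.nn as nn

# Learn T = T_1 ∘ ... ∘ T_J, each T_i close to the identity, plus a linear g,
# by minimizing (1/m) sum_i (f(x_i) - g(T(x_i)))^2 on toy data.
torch.manual_seed(0)
d, width, J, m = 2, 16, 4, 2048

class SimpleMap(nn.Module):
    """T_i(x) = x + small correction: a residual block (perturbation of identity)."""
    def __init__(self, d, width):
        super().__init__()
        self.U = nn.Linear(d, width)
        self.V = nn.Linear(width, d)
    def forward(self, x):
        return x + 0.1 * self.V(torch.relu(self.U(x)))

T = nn.Sequential(*[SimpleMap(d, width) for _ in range(J)])
g = nn.Linear(d, 1)                                   # g(z) = a^T z + bias

f = lambda x: torch.sin(3 * x[:, :1]) * torch.cos(2 * x[:, 1:2])   # toy target f
x = 2 * torch.rand(m, d) - 1                          # training samples x_i
opt = torch.optim.Adam(list(T.parameters()) + list(g.parameters()), lr=1e-2)
for _ in range(1000):
    opt.zero_grad()
    loss = ((g(T(x)) - f(x)) ** 2).mean()             # empirical squared loss
    loss.backward()
    opt.step()
print("empirical MSE:", loss.item())
```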

Ideas

One of the simplest ideas is the decomposition
$$ \|f-\hat g\circ \hat T_1\circ\cdots\circ \hat T_J\| \;\le\; \|f-g\circ T_1\circ\cdots\circ T_J\| \;+\; \|g\circ T_1\circ\cdots\circ T_J-\hat g\circ \hat T_1\circ\cdots\circ \hat T_J\| \;=:\; \text{Bias}+\text{Variance}, $$
where $g\circ T_1\circ\cdots\circ T_J$ denotes the constructed approximant and $\hat g\circ \hat T_1\circ\cdots\circ \hat T_J$ the one learned from data.

Li, Shen and Tai, Deep learning: approximation of functions by composition, 2018.

This theory is complete, but it does not answer all questions. For example, we do not yet have an approximation order in terms of the number of nodes. Moreover, many machine learning architectures, e.g. convolutional neural networks (CNNs) and sparse coding based classifiers, differ from the architecture designed here. Next, we present an approximation theory of CNNs and of sparse coding based classifiers for image classification.

Part 3. Approximation of CNNs and sparse coding

Approximation for image classification

Binary classification: $\Omega=\Omega_0\cup\Omega_1$ with $\Omega_0\cap\Omega_1=\emptyset$. Let $f$ be the oracle classifier, i.e. $f:\Omega\subset\mathbb{R}^d\to\{0,1\}$ with
$$ f(x)=\begin{cases}0, & x\in\Omega_0,\\ 1, & x\in\Omega_1.\end{cases} $$
The goal is to construct a feature map $T$ and a classifier $g$ such that $\|f-g\circ T\|$ is small.

Approximation for image classification

$g$ is normally a fully connected layer followed by a softmax, defined on the feature space. One constructs a feature map $T$ so that $g\circ T$ approximates $f$ well, since constructing a fully connected layer that approximates $f$ directly from the data in image space is hard.

$g\circ T$ gives a good approximation of $f$ if $T$ satisfies
$$ \|T(x)-T(y)\|\le C\|x-y\|,\qquad \forall\, x,y\in\Omega, \tag{3.1} $$
$$ \|T(x)-T(y)\|> c\|x-y\|,\qquad \forall\, x\in\Omega_0,\ y\in\Omega_1, \tag{3.2} $$
for some $C,c>0$. These inequalities are not easy to prove for CNNs or for sparse coding based classifiers, especially when $n\ll d$.
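
Conditions (3.1) and (3.2) can at least be probed empirically on data. A small check of this kind, with synthetic stand-ins for $\Omega_0$, $\Omega_1$ and an arbitrary fixed feature map (all of which are my assumptions, for illustration only), might look as follows.

```python
import numpy as np

# Synthetic stand-ins: two well-separated Gaussian clouds for Omega_0 and Omega_1,
# and an arbitrary fixed feature map T (a random tanh layer).
rng = np.random.default_rng(0)
X0 = rng.normal(loc=-1.0, scale=0.3, size=(200, 5))
X1 = rng.normal(loc=+1.0, scale=0.3, size=(200, 5))
A = rng.normal(size=(5, 8))
T = lambda X: np.tanh(X @ A)

def ratios(P, Q):
    """||T(x) - T(y)|| / ||x - y|| over all pairs x in P, y in Q (zero distances masked)."""
    dx = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)
    dT = np.linalg.norm(T(P)[:, None, :] - T(Q)[None, :, :], axis=-1)
    mask = dx > 1e-12
    return dT[mask] / dx[mask]

X = np.vstack([X0, X1])
C_hat = ratios(X, X).max()        # smallest C for which (3.1) holds on the samples
c_hat = ratios(X0, X1).min()      # largest c for which (3.2) holds on the samples
print(f"C ~ {C_hat:.2f}, c ~ {c_hat:.2f}")
```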

Accuracy of image classification by CNNs

CNNs achieve the desired classification accuracy with high probability, and all numerical results confirm it.

Setting: given $J$ sets of convolution kernels $\{w_i\}_{i=1}^J$ and biases $\{b_i\}_{i=1}^J$, a $J$-layer convolutional neural network is a nonlinear function $g\circ T$, where
$$ T(x)=\sigma\bigl(w_J * \sigma(w_{J-1}*(\cdots\sigma(w_1 * x+b_1)\cdots)+b_{J-1})+b_J\bigr), $$
$$ g(x)=\sum_{i=1}^n a_i\,\sigma(w_i^\top x+b_i),\qquad \sigma(x)=\max(0,x). $$
Normally the convolution kernels $\{w_i\}_{i=1}^J$ have small size.

Given $m$ training samples $\{(x_i,y_i)\}_{i=1}^m$, $T$ and $g$ are learned from
$$ \min_{g,T}\ \frac{1}{m}\sum_{i=1}^m \bigl(y_i-g\circ T(x_i)\bigr)^2. $$
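
A forward-pass sketch of the $g\circ T$ form above, under simplifying assumptions of my own (1-D signals, scalar biases, random untrained parameters standing in for learned ones): a $J$-layer convolutional feature map with filters of length 3 followed by a one-hidden-layer $g$.

```python
import numpy as np

# Forward pass only; all parameters are random stand-ins for learned ones.
rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

J, d = 4, 32
filters = [rng.normal(scale=0.5, size=3) for _ in range(J)]   # small kernels w_i (size 3)
biases = [rng.normal(scale=0.1) for _ in range(J)]            # scalar biases b_i

def T(x):
    """Feature map: J layers of (convolution with a size-3 kernel) + ReLU."""
    for w, b in zip(filters, biases):
        x = relu(np.convolve(x, w, mode="same") + b)
    return x

n = 16
A = rng.normal(size=n)                       # outer weights a_i of g
Wg = rng.normal(scale=0.3, size=(n, d))      # inner weights w_i of g
bg = rng.normal(scale=0.1, size=n)           # biases b_i of g
g = lambda z: A @ relu(Wg @ z + bg)          # g(z) = sum_i a_i sigma(w_i^T z + b_i)

x = rng.normal(size=d)
print("g(T(x)) =", g(T(x)))
```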

Accuracy of image classification by CNNs

Question: Can we give a rigorous proof of this statement?

Answer: Yes!

Theorem. For any given $\epsilon>0$ and sample data $Z$ of size $m$, there exists a CNN classifier, whose filter size can be as small as 3, such that the classification accuracy $A$ satisfies
$$ \mathbb{P}(A\ge 1-\epsilon)\ge 1-\eta(\epsilon,m),\qquad \eta(\epsilon,m)\to 0 \text{ as } m\to+\infty. $$
The difficult part is to prove the inequalities (3.1) and (3.2) for $T$.

Bao, Shen, Tai, Wu and Xiang, Approximation and scaling analysis of convolutional neural networks, 2017.

Accuracy of image classification by sparse coding

Given $m$ training samples $\{(x_i,y_i)\}_{i=1}^m$, a sparse coding based classifier learns $D$ and $W$ by solving
$$ \min_{\|d_k\|=1,\ \{c_i\}_{i=1}^m,\ W}\ \frac{1}{m}\sum_{i=1}^m \bigl\{\|x_i-Dc_i\|^2+\lambda\|c_i\|_0+\gamma\|y_i-Wc_i\|^2\bigr\}. $$
There are numerical algorithms with a global convergence property for this minimization.

The sparse coding based classifier is $g\circ T$, where $g(x)=Wx$ and
$$ T(x)\in\arg\min_c\ \|x-Dc\|^2+\lambda\|c\|_0. $$
There is no mathematical analysis of the classification accuracy of $g\circ T$, i.e. of $\|f-g\circ T\|$, for this model.

Bao, Ji, Quan and Shen, Dictionary learning for sparse coding: Algorithms and convergence analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(7), (2016), 1356-1369.
Bao, Ji, Quan and Shen, $\ell_0$ norm based dictionary learning by proximal methods with global convergence, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, (2014).
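
At inference time the classifier only needs $T(x)$ and $W$. Since the exact $\ell_0$ problem is combinatorial, the sketch below approximates $T(x)$ by iterative hard thresholding (proximal gradient on the nonconvex objective); the dictionary $D$ and classifier $W$ are random stand-ins for the learned ones.

```python
import numpy as np

# D and W are random stand-ins for the learned dictionary and classifier; the exact
# l0 problem is combinatorial, so T is approximated by iterative hard thresholding.
rng = np.random.default_rng(0)
d, K, n_classes, lam = 64, 128, 10, 0.05
D = rng.normal(size=(d, K))
D /= np.linalg.norm(D, axis=0)                 # unit-norm atoms (||d_k|| = 1)
W = rng.normal(size=(n_classes, K))

t = 1.0 / (2 * np.linalg.norm(D, 2) ** 2)      # step size <= 1 / Lipschitz constant

def T(x, n_iter=200):
    """Approximate argmin_c ||x - Dc||^2 + lam*||c||_0 by proximal gradient (IHT)."""
    c = np.zeros(K)
    for _ in range(n_iter):
        c = c - t * 2 * D.T @ (D @ c - x)      # gradient step on ||x - Dc||^2
        c[c ** 2 <= 2 * t * lam] = 0.0         # hard thresholding = prox of t*lam*||.||_0
    return c

x = D @ (rng.normal(size=K) * (rng.random(K) < 0.05))   # a synthetic test signal
print("predicted class:", int(np.argmax(W @ T(x))))
```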

Accuracy of image classification by sparse coding

Consider an orthogonal dictionary learning (ODL) scheme
$$ \min_{D^\top D=I,\ \{c_i\}_{i=1}^m,\ g}\ \frac{1}{m}\sum_{i=1}^m \bigl\{\|x_i-Dc_i\|^2+\lambda\|c_i\|_1+\gamma\|y_i-g(c_i)\|^2\bigr\}, \tag{3.3} $$
where $g$ is a fully connected layer. The numerical algorithm for this problem has a global convergence property, and the classification accuracy is similar to that of the previous models.

Classification accuracies (%):

Dataset               | K-SVD | D-KSVD | IDL   | ODL
Face: Extended Yale B | 93.10 | 94.10  | 95.72 | 96.12
Face: AR face         | 86.50 | 88.80  | 96.18 | 96.37
Object: Caltech101    | 68.70 | 68.60  | 72.29 | 72.54

For this model, we have a mathematical analysis of accuracy.
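
One pleasant consequence of the orthogonality constraint is that the sparse coding step has a closed form: with $D^\top D=I$, the $\ell_1$ problem decouples and $T(x)$ is a soft thresholding of $D^\top x$ at $\lambda/2$. The snippet below illustrates this with a random orthogonal dictionary (a stand-in for the learned $D$).

```python
import numpy as np

# A random orthogonal D stands in for the learned dictionary; with D^T D = I the
# l1 sparse coding step reduces to soft thresholding of D^T x at lambda/2.
rng = np.random.default_rng(0)
d, lam = 16, 0.2
D, _ = np.linalg.qr(rng.normal(size=(d, d)))

soft = lambda z, t: np.sign(z) * np.maximum(np.abs(z) - t, 0.0)
T = lambda x: soft(D.T @ x, lam / 2)           # argmin_c ||x - Dc||^2 + lam*||c||_1

x = rng.normal(size=d)
c = T(x)
obj = lambda c: np.sum((x - D @ c) ** 2) + lam * np.sum(np.abs(c))
# c is the global minimizer of this convex objective; nearby points are no better:
print(obj(c) <= obj(c + 1e-3 * rng.normal(size=d)))     # True
# A fully connected layer g would then be applied to c to produce class scores.
```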

Sparse coding approximation for image classification

Let $g$ be the fully connected layer and $D$ the dictionary obtained by solving (3.3). Define
$$ T(x)=\arg\min_c\ \|x-Dc\|^2+\lambda\|c\|_1. $$
The sparse coding based classifier from the ODL model is $g\circ T$.

Theorem. Consider the ODL model. For any given $\epsilon>0$ and sample data $Z$ of size $m$, there exists a sparse coding based classifier such that the classification accuracy $A$ satisfies
$$ \mathbb{P}(A\ge 1-\epsilon)\ge 1-\eta(\epsilon,m),\qquad \eta(\epsilon,m)\to 0 \text{ as } m\to+\infty. $$
Proving the two inequalities (3.1) and (3.2) for $T$ is not easy.

Bao, Ji and Shen, Classification accuracy of sparse coding based classifier, 2018.

Data-driven tight frame is convolutional sparse coding

Given an image $g$, the data-driven tight frame model solves
$$ \min_{c,\,W}\ \|W^\top c-g\|_2^2+\|(I-WW^\top)c\|_2^2+\lambda^2\|c\|_0\qquad \text{s.t. } W^\top W=I. \tag{3.4} $$
When $W^\top W=I$, the rows of $W$ form a tight frame. The minimization model (3.4) is equivalent to
$$ \min_{c,\,W}\ \|Wg-c\|_2^2+\lambda^2\|c\|_0\qquad \text{s.t. } W^\top W=I. \tag{3.5} $$
By the structure of $W$, each channel of $W$ corresponds to a convolution kernel.

To solve (3.5) we use an alternating scheme (ADM): for fixed $W$, $c$ is obtained by hard thresholding; for fixed $c$, $W$ has an analytical solution that is easy to compute, thanks to the convolution structure of $W$ and the tight frame property. The iterative algorithm converges.
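
A linear-algebra sketch of that alternating scheme, under my own simplifications (W treated as a single square orthogonal matrix acting on vectorized patches, rather than a bank of small convolution filters): the c-step is hard thresholding and the W-step is an orthogonal Procrustes problem solved by an SVD.

```python
import numpy as np

# Simplified analogue: W is a square orthogonal matrix acting on vectorized patches
# (the real scheme builds W from small convolution filters); G collects the patches.
rng = np.random.default_rng(0)
p, N, lam = 8, 500, 0.5
G = rng.normal(size=(p, N))          # columns = patches extracted from the image g
W = np.eye(p)                        # any orthogonal initialization (e.g. a DCT) works

for _ in range(30):
    C = W @ G
    C[np.abs(C) <= lam] = 0.0        # c-step: hard thresholding solves the l0 subproblem
    U, _, Vt = np.linalg.svd(C @ G.T)
    W = U @ Vt                       # W-step: orthogonal Procrustes, min ||WG - C|| s.t. W^T W = I
print("W^T W = I:", np.allclose(W.T @ W, np.eye(p)))
```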

Data-driven tight frame for image denoising

[1] Cai, Huang, Ji, Shen and Ye, Data-driven tight frame construction and image denoising, Applied and Computational Harmonic Analysis, 37(1), (2014), 89-105.
[2] Bao, Ji and Shen, Convergence analysis for iterative data-driven tight frame construction scheme, Applied and Computational Harmonic Analysis, 38(3), (2015), 510-523.

Part 4. Approximation in feature space

Back to classical approximation on the feature domain

Classical approximation is still useful for approximation in feature space.

Given noisy data $\{(x_i,y_i)\}_{i=1}^m$ with $y_i=(Sf)(x_i)+n_i$, where the $\{y_i\}_{i=1}^m$ are samples of $f$ corrupted by noise $n_i$. By applying some data fitting scheme, e.g. a wavelet frame scheme, one obtains a denoised result $\{y_i^\star\}_{i=1}^m$. Let $g$ be the function reconstructed from the $y_i^\star$ through some approximation scheme.

Questions:
1. What is the error between $g$ and $f$?
2. Do we have $g\to f$ when the sampled data are sufficiently dense?

Data fitting

Let $\Omega:=[0,1]\times[0,1]$ and $f\in L_2(\Omega)$. Let $\phi$ be the tensor product of B-spline functions and denote the scaled functions by $\phi_{n,\alpha}:=2^n\phi(2^n\cdot-\alpha)$. Let $(Sf)[\alpha]=2^n\langle f,\phi_{n,\alpha}\rangle$.

Given noisy observations
$$ y[\alpha]=(Sf)[\alpha]+n_\alpha,\qquad \alpha=(\alpha_1,\alpha_2),\ 0\le \alpha_1,\alpha_2\le 2^n-1, $$
the data fitting problem is to recover $f$ on $\Omega$ from $y$.

Wavelet frames

Let $\phi$ be a refinable function and $\Psi:=\{\psi_i,\ i=1,\dots,r\}$ the wavelet functions associated with $\phi$ in $L_2(\mathbb{R}^2)$. Denote the scaled functions by $\phi_{n,\alpha}:=2^n\phi(2^n\cdot-\alpha)$ and $\psi_{i,n,\alpha}:=2^n\psi_i(2^n\cdot-\alpha)$.

$X(\Psi):=\{\psi_{i,n,\alpha}\}$ is a tight frame if
$$ \|f\|_2^2=\sum_{i,n,\alpha}|\langle f,\psi_{i,n,\alpha}\rangle|^2\qquad \text{for all } f\in L_2(\mathbb{R}^2). $$

Unitary extension principle (UEP): if the masks $h_i$ satisfy
$$ 2\sum_{i=0}^{r}\sum_{k\in\mathbb{Z}} h_i(m+2k+l)\,h_i(2k+l)=\delta_m \qquad \text{for all } m,l\in\mathbb{Z}, $$
then $X(\Psi):=\{\psi_{i,n,\alpha}\}$ is a tight frame.

Ron and Shen, Affine systems in $L_2(\mathbb{R}^d)$: the analysis of the analysis operator, Journal of Functional Analysis, 148(2), (1997), 408-447.
Daubechies, Han, Ron and Shen, Framelets: MRA-based constructions of wavelet frames, Applied and Computational Harmonic Analysis, 14(1), (2003), 1-46.

Examples of spline wavelets from the UEP

Piecewise linear refinable B-spline and the corresponding framelets: refinement mask $h_0=[\tfrac14,\tfrac12,\tfrac14]$; high-pass filters $h_1=[\tfrac14,-\tfrac12,\tfrac14]$ and $h_2=[\tfrac{\sqrt2}{4},0,-\tfrac{\sqrt2}{4}]$.

Piecewise cubic refinable B-spline and the corresponding framelets: refinement mask $h_0=[\tfrac1{16},\tfrac14,\tfrac38,\tfrac14,\tfrac1{16}]$; high-pass filters $h_1=[\tfrac1{16},-\tfrac14,\tfrac38,-\tfrac14,\tfrac1{16}]$, $h_2=[\tfrac18,\tfrac14,0,-\tfrac14,-\tfrac18]$, $h_3=[\tfrac{\sqrt6}{16},0,-\tfrac{\sqrt6}{8},0,\tfrac{\sqrt6}{16}]$ and $h_4=[\tfrac18,-\tfrac14,0,\tfrac14,-\tfrac18]$.

Discrete wavelet (framelet) transform $W$:
$$ \{a_{n,\alpha}:=\langle f,\phi_{n,\alpha}\rangle\}\ \xrightarrow{\ W\ }\ \{\langle f,\psi_{i,n,\alpha}\rangle\}_{i=0}^r=\{h_i * a_{n,\alpha}\}_{i=0}^r, $$
where the $\{h_i\}$ are the wavelet frame filters.
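
The UEP condition above can be verified numerically for the piecewise linear filters; the short check below does exactly that (filter indices are taken to run over $\{-1,0,1\}$, an assumption on where the filters are centered).

```python
import numpy as np

# Filters of the piecewise linear framelet, indexed over {-1, 0, 1}.
h = {
    0: {-1: 1/4, 0: 1/2, 1: 1/4},                       # refinement mask h0
    1: {-1: 1/4, 0: -1/2, 1: 1/4},                      # high-pass h1
    2: {-1: np.sqrt(2)/4, 0: 0.0, 1: -np.sqrt(2)/4},    # high-pass h2
}
val = lambda i, k: h[i].get(k, 0.0)

ok = True
for m in range(-4, 5):
    for l in range(-4, 5):
        s = 2 * sum(val(i, m + 2 * k + l) * val(i, 2 * k + l)
                    for i in h for k in range(-10, 11))
        ok &= bool(np.isclose(s, 1.0 if m == 0 else 0.0))
print("UEP satisfied:", ok)          # True: the filters generate a tight wavelet frame
```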

Wavelet frame based data fitting scheme

Example: analysis-based wavelet approach to data fitting. Let $f_n$ be a minimizer of the model
$$ E_n(f):=\|f-y\|_2^2+\|\mathrm{diag}(\lambda_n)\,W_nf\|_1, $$
where $\lambda_n$ is a vector that scales the different wavelet channels. Let
$$ g_n:=\sum_{\alpha\in I_n} f_n(\alpha)\,\phi(2^n\cdot-\alpha). $$

Questions: What is the bound on $\|g_n-f\|_{L_2(\Omega)}$? Does $g_n$ converge to $f$ in $L_2(\Omega)$ as $n\to\infty$?
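
As a concrete, hedged illustration of frame-based data fitting in 1-D (not the analysis model $E_n$ itself, but the classical one-step frame shrinkage that is often used as an inexpensive surrogate for it): decompose noisy samples under the undecimated piecewise linear framelet transform, soft-threshold the wavelet channels, and reconstruct. The signal, noise level, and threshold below are my own toy choices.

```python
import numpy as np

# One-step frame shrinkage in 1-D with the piecewise linear framelet filters; FFTs
# implement circular convolution so that W^T W = I holds exactly on the grid.
rng = np.random.default_rng(0)
N = 256
t = np.arange(N) / N
f_true = np.sin(2 * np.pi * t) + (t > 0.5)                  # smooth signal plus a jump
y = f_true + 0.1 * rng.normal(size=N)                       # noisy samples

h = [np.array([1.0, 2.0, 1.0]) / 4,                         # h0
     np.array([1.0, -2.0, 1.0]) / 4,                        # h1
     np.array([np.sqrt(2), 0.0, -np.sqrt(2)]) / 4]          # h2
H = [np.fft.fft(np.concatenate([hk, np.zeros(N - 3)])) for hk in h]

soft = lambda z, lam: np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

Y = np.fft.fft(y)
coeffs = [np.fft.ifft(np.conj(Hk) * Y).real for Hk in H]    # analysis:  W y
coeffs[1:] = [soft(c, 0.15) for c in coeffs[1:]]            # shrink the wavelet channels only
rec = sum(np.fft.ifft(Hk * np.fft.fft(c)).real for Hk, c in zip(H, coeffs))  # synthesis: W^T c
print("error before:", np.linalg.norm(y - f_true))
print("error after :", np.linalg.norm(rec - f_true))
```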

Regularity assumption on $f$: there exists $\beta>1$ such that
$$ \sum_{\alpha}|\langle f,\phi_{0,\alpha}\rangle|+\sum_{n\ge 0}2^{\beta n}\sum_{i,\alpha}|\langle f,\psi_{i,n,\alpha}\rangle|<\infty. $$

Then, for an arbitrarily given $0<\delta<1$, the inequality
$$ \|g_n-f\|_{L_2(\Omega)}\le C_1\,2^{-n\min\{\frac{1+\beta}{2},\,\frac12\}}\log\frac1\delta+C_2\,\sigma^2 $$
holds with confidence $1-\delta$.

As $n\to\infty$, one can design a data fitting scheme such that
$$ \lim_{n\to\infty}\mathbb{E}\bigl(\|g_n-f\|_{L_2(\Omega)}\bigr)=0. $$
In this case, the data are given on uniform grids.

When the noisy data are obtained on nonuniform grids, or by random sampling, can we have a similar result?

Yes. It is technical, but it has been carefully studied.

Cai, Shen and Ye, Approximation of frame based missing data recovery, Applied and Computational Harmonic Analysis, 31(2), (2011), 185-204.
Yang, Stahl and Shen, An analysis of wavelet frame based scattered data reconstruction, Applied and Computational Harmonic Analysis, 42(3), (2017), 480-507.
Yang, Dong and Shen, Approximation of analog signals from noisy data, manuscript, 2018.

Thank you! http://www.math.nus.edu.sg/~matzuows/