MLISP: Machine Learning in Signal Processing, Spring 2018
Lecture 10, May 11
Prof. Venia Morgenshtern
Scribe: Mohamed Elshawi
Illustrations: The Elements of Statistical Learning, Hastie, Tibshirani, Friedman

Agenda:
1. Wavelets with good frequency localisation
2. Wavelet transform of NMR signal
3. Denoising via wavelet shrinkage

1 Wavelets with good frequency localisation

Haar wavelets are simple to understand, but they are not smooth enough for most purposes. The Daubechies symmlet-8 wavelets, constructed in a similar way, have the same orthonormality properties as the Haar wavelets, but are smoother.

[Figure: comparison of the father functions of the two systems, and of their shifts.]
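The figures themselves are not reproduced in these notes. As a rough substitute, the following sketch plots the two father (scaling) functions with PyWavelets; the library `pywt`, the wavelet names "haar" and "sym8", and the approximation level are assumptions made for the illustration, not part of the original notes.

```python
import matplotlib.pyplot as plt
import pywt

# Plot the father (scaling) functions of the Haar and symmlet-8 systems together.
for name in ("haar", "sym8"):
    w = pywt.Wavelet(name)
    phi, psi, x = w.wavefun(level=10)  # cascade approximation of phi and psi
    plt.plot(x, phi, label=f"{name}: father function")
plt.legend()
plt.title("Haar vs. symmlet-8 father functions")
plt.show()
```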

Let's compare the two systems. Each symmlet-8 wavelet has a support covering 15 consecutive time intervals, rather than one for the Haar basis. More generally, the symmlet-p family has a support of 2p − 1 consecutive intervals. The wider the support, the more time the wavelet has to die out to zero, and so it can achieve this more smoothly. Note that the effective support seems to be much narrower than the theoretical one.

The symmlet-p wavelet ψ(x) has p vanishing moments:

\[
\int \psi(x)\, x^j \, dx = 0, \qquad j = 0, \ldots, p - 1. \tag{1}
\]

It follows that any order-p polynomial over the N = 2^J time points is reproduced exactly in V_0. The Haar wavelets have one vanishing moment, and V_0 can reproduce any constant function.
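As a quick numerical illustration of the support and vanishing-moment claims, the sketch below inspects the symmlet-8 wavelet with PyWavelets. The library, its attribute names, and the cascade approximation level are assumptions for the illustration, not part of the notes.

```python
import numpy as np
import pywt

w = pywt.Wavelet("sym8")
print("filter length:", w.dec_len)                           # 2p = 16 -> support of 2p - 1 = 15 intervals
print("vanishing moments of psi:", w.vanishing_moments_psi)  # p = 8

# Numerically check (1): the moments of psi vanish for j = 0, ..., p - 1
# (relative to the same integral with |psi|, up to discretization error of the cascade approximation).
phi, psi, x = w.wavefun(level=12)
dx = x[1] - x[0]
for j in range(w.vanishing_moments_psi):
    moment = np.sum(psi * x**j) * dx
    scale = np.sum(np.abs(psi) * x**j) * dx
    print(f"j = {j}: relative moment = {moment / scale:.1e}")
```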

Consider the time-frequency plane. It is instructive to compare how wavelets tile this plane to how short-time Fourier transform functions tile it. [Figure: time-frequency tiling of the wavelet basis vs. the short-time Fourier basis.] We observe that in the case of wavelets, every time we go up the hierarchy stack, the number of functions doubles, each function becomes more localized (narrower), and therefore its frequency content doubles.

2 Wavelet transform of NMR signal

In the following figure we see an NMR (nuclear magnetic resonance) signal, which appears to be composed of smooth components and isolated spikes, plus some noise. The wavelet transform, using the symmlet basis, is shown in the lower left panel. The wavelet coefficients are arranged in rows, from the lowest scale at the bottom to the highest scale at the top. The length of each line segment indicates the size of the coefficient. These rows correspond to the decomposition that we studied before:

\[
\text{signal span} = V_4 \oplus W_4 \oplus W_5 \oplus W_6 \oplus W_7 \oplus W_8 \oplus W_9. \tag{2}
\]

The bottom right panel shows the wavelet coefficients after they have been thresholded via soft-thresholding (which we will study next). Note that many of the smaller coefficients have been set to zero. The green curve in the top panel shows the inverse transform of the thresholded coefficients. This is the denoised version of the original signal.
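The NMR data itself is not included with these notes, but a multilevel decomposition of the form (2) is easy to reproduce on a synthetic signal with the same character (smooth part, isolated spikes, noise). The sketch below assumes PyWavelets; the signal, the choice of six levels, and the boundary mode are illustrative assumptions only.

```python
import numpy as np
import pywt

rng = np.random.default_rng(0)
N = 2**10  # N = 2^J lattice points

# Synthetic stand-in for the NMR signal: smooth component + isolated spikes + noise.
t = np.linspace(0, 1, N, endpoint=False)
y = np.sin(4 * np.pi * t)
y += 3.0 * (np.abs(t - 0.3) < 0.01) + 5.0 * (np.abs(t - 0.7) < 0.005)
y += 0.3 * rng.standard_normal(N)

# Multilevel DWT: coeffs = [cA_L, cD_L, cD_{L-1}, ..., cD_1], i.e. the coarse space V
# followed by the detail spaces W from coarsest to finest, as in decomposition (2).
coeffs = pywt.wavedec(y, "sym8", mode="periodization", level=6)
for k, c in enumerate(coeffs):
    label = "V (coarse space)" if k == 0 else f"W (detail level {len(coeffs) - k})"
    print(f"{label:20s}: {len(c):4d} coefficients, largest |coeff| = {np.max(np.abs(c)):.2f}")
```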

3 Denoising via wavelet shrinkage

Let's now give a mathematically precise method to obtain the wavelet smoothing. Suppose our NMR signal is sampled at N = 2^J lattice points; call the discrete signal y. Since wavelets form an orthonormal basis, we can stack the wavelet basis functions into an orthonormal matrix W^T. Each row of W^T is one wavelet basis function, sampled at the lattice of N points t = [t_1, ..., t_N]:

\[
W^T =
\begin{bmatrix}
\phi(t_1) & \phi(t_2) & \dots & \phi(t_N) \\
\psi(2t_1) & \psi(2t_2) & \dots & \psi(2t_N) \\
\psi(2t_1 - 1) & \psi(2t_2 - 1) & \dots & \psi(2t_N - 1) \\
\psi(2^2 t_1) & \psi(2^2 t_2) & \dots & \psi(2^2 t_N) \\
\psi(2^2 t_1 - 1) & \psi(2^2 t_2 - 1) & \dots & \psi(2^2 t_N - 1) \\
\psi(2^2 t_1 - 2) & \psi(2^2 t_2 - 2) & \dots & \psi(2^2 t_N - 2) \\
\vdots & \vdots & & \vdots
\end{bmatrix}.
\]

Then y^* = W^T y is called the discrete wavelet transform of y.
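To make the construction concrete, here is a small sketch that builds the matrix W^T explicitly by applying a library DWT to the standard basis vectors and checks that it is orthonormal. The use of PyWavelets, the Haar wavelet, and periodized boundary handling are assumptions made for the illustration; they are not prescribed by the notes.

```python
import numpy as np
import pywt

N = 16  # N = 2^J lattice points, J = 4

def dwt_vector(v, wavelet="haar"):
    """Full-depth discrete wavelet transform of v, flattened into one vector."""
    coeffs = pywt.wavedec(v, wavelet, mode="periodization")
    flat, _ = pywt.coeffs_to_array(coeffs)
    return flat

# The DWT is linear, so applying it to the standard basis vectors e_1, ..., e_N
# gives the columns of W^T (its rows are the sampled wavelet basis functions).
I_N = np.eye(N)
WT = np.column_stack([dwt_vector(I_N[:, j]) for j in range(N)])

# Orthonormality: W^T (W^T)^T = I, i.e. W is an orthogonal matrix.
print(np.allclose(WT @ WT.T, np.eye(N)))

# The matrix agrees with the library transform: y* = W^T y.
y = np.random.default_rng(0).standard_normal(N)
print(np.allclose(WT @ y, dwt_vector(y)))
```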

Here is a signal y at the top and its wavelet transform coefficients arranged by scale in the remaining rows. Notice that in this representation the coefficients do not descend all the way to V_0 but stop at V_4, which has 16 basis functions. As we ascend to each new level of detail, the coefficients get smaller, except in the locations where spiky behavior is present.

We would like to somehow capture the fact that the true signal should have few nonzero coefficients: only those that correspond to spiky behavior. One approach to do this is to solve

\[
\min_{\theta}\ \underbrace{\| y - W\theta \|_2^2}_{\text{least squares regression term}} \;+\; \underbrace{2\lambda \cdot (\#\text{ of nonzero elements in } \theta)}_{\text{regularization term}}.
\]

Above, y is the noisy data, W is the transformation matrix from the wavelet domain to the signal domain, θ is the vector of estimated wavelet coefficients, and λ is the parameter that controls the trade-off between data fidelity and sparsity.

Unfortunately, the above optimization problem is computationally infeasible! It is non-convex and even non-smooth; we cannot use gradient descent. Instead we solve

\[
\min_{\theta}\ \| y - W\theta \|_2^2 + 2\lambda \|\theta\|_1, \tag{3}
\]

where the term \|\theta\|_1 promotes sparsity, as we will shortly see. In general, an optimization problem of the form (3) is convex, and therefore the minimum can easily be found numerically. The case we are considering now is special, because W is an orthonormal matrix. In this case the solution can be found analytically, as follows. First note that since W is orthogonal, multiplying by W^T does not change the norm of a vector, so that

\[
\| y - W\theta \|_2^2 = \| W^T (y - W\theta) \|_2^2 = \| W^T y - \underbrace{W^T W}_{= I}\,\theta \|_2^2 = \| y^* - \theta \|_2^2.
\]

Hence, (3) is equivalent to

\[
\min_{\theta}\ \| y^* - \theta \|_2^2 + 2\lambda \|\theta\|_1,
\]

which can be written as

\[
\min_{\theta}\ \sum_{i=1}^{N} (y^*_i - \theta_i)^2 + 2\lambda \sum_{i=1}^{N} |\theta_i|.
\]

Therefore, the problem decouples and we just need to solve

\[
\min_{\theta_i}\ (y^*_i - \theta_i)^2 + 2\lambda |\theta_i|
\]

for each i separately.
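As a quick sanity check of the key identity used above, ||y − Wθ||₂² = ||W^T y − θ||₂², the following sketch verifies it numerically for an orthogonal matrix. A random orthogonal matrix from a QR factorization stands in for the wavelet matrix W; this is an illustration, not the notes' construction.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8

# Any orthogonal matrix works for this check; QR of a random matrix gives one.
W, _ = np.linalg.qr(rng.standard_normal((N, N)))

y = rng.standard_normal(N)
theta = rng.standard_normal(N)
y_star = W.T @ y  # the discrete wavelet transform in the notes' notation

lhs = np.sum((y - W @ theta) ** 2)   # || y - W theta ||_2^2
rhs = np.sum((y_star - theta) ** 2)  # || y* - theta ||_2^2
print(lhs, rhs)                      # agree up to floating-point error
```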

Fix one coordinate and write y^* for y^*_i and θ for θ_i. Let f(θ) = (y^* − θ)² + 2λ|θ|. What is the minimum of f(θ)? To find the minimum of this function, take the derivative and set it to zero:

\[
f'(\theta) = -2(y^* - \theta) + 2\lambda\, \mathrm{sign}(\theta),
\]

where we used d|θ|/dθ = sign(θ) for θ ≠ 0, as can be seen from the plot of the absolute value function. We find that f'(θ) = 0 iff

\[
-y^* + \theta + \lambda\, \mathrm{sign}(\theta) = 0. \tag{4}
\]

Claim: The solution to (4) is

\[
\hat{\theta} = \mathrm{sign}(y^*)\,(|y^*| - \lambda)_+, \qquad \text{where } (u)_+ = \begin{cases} u, & u > 0, \\ 0, & \text{otherwise.} \end{cases}
\]

Proof. If θ > 0, then (4) is equivalent to

\[
-y^* + \theta + \lambda = 0 \;\Rightarrow\; \theta = y^* - \lambda = (y^* - \lambda)_+ = \mathrm{sign}(y^*)(|y^*| - \lambda)_+,
\]

where the last step follows because θ = y^* − λ > 0 ⟹ y^* > λ > 0 ⟹ sign(y^*) = 1.

If θ < 0, then (4) is equivalent to

\[
-y^* + \theta - \lambda = 0 \;\Rightarrow\; \theta = y^* + \lambda = -(|y^*| - \lambda)_+ = \mathrm{sign}(y^*)(|y^*| - \lambda)_+,
\]

where the last step follows because θ = y^* + λ < 0 ⟹ y^* < −λ < 0 ⟹ sign(y^*) = −1. (If |y^*| ≤ λ, neither case applies and the minimum is attained at the kink θ = 0, which is again consistent with the formula.)

What is the meaning of the formula θ̂ = sign(y^*)(|y^*| − λ)_+? The function

\[
f_\lambda(u) = \mathrm{sign}(u)\,(|u| - \lambda)_+
\]

is called the soft-thresholding function. [Figure: the soft-thresholding function f_λ.] Therefore, all the coefficients that are smaller than λ in absolute value are set to zero. All other coefficients are reduced by λ in absolute value.
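The following short sketch implements the soft-thresholding function and checks the claim by brute force: for a few values of y* it grid-searches the minimizer of f(θ) = (y* − θ)² + 2λ|θ| and compares it to sign(y*)(|y*| − λ)_+. The code and the name `soft_threshold` are illustrative, not from the notes.

```python
import numpy as np

def soft_threshold(y_star, lam):
    # theta_hat = sign(y*) * (|y*| - lambda)_+
    return np.sign(y_star) * np.maximum(np.abs(y_star) - lam, 0.0)

lam = 1.0
theta = np.linspace(-10, 10, 200_001)  # dense grid, spacing 1e-4
for y_star in (-3.2, -0.4, 0.0, 0.7, 2.5):
    f = (y_star - theta) ** 2 + 2 * lam * np.abs(theta)
    brute = theta[np.argmin(f)]
    print(f"y* = {y_star:5.2f}: grid minimizer {brute:8.4f}, "
          f"soft threshold {soft_threshold(y_star, lam):8.4f}")
```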

What is a good choice for λ? The answer depends on the amount of noise in the signal. Suppose y = s + z, where s is the true signal and z = [z_1, ..., z_N]^T is the noise vector. Assume z_i ~ N(0, σ²), independent over i. Now

\[
y^* = W^T y = W^T s + W^T z.
\]

Because W is an orthogonal matrix, the elements z^*_i of z^* = W^T z = [z^*_1, ..., z^*_N]^T are again independent over i and distributed as z^*_i ~ N(0, σ²). Let \|z^*\|_\infty = max_i |z^*_i|. It is not difficult to calculate that

\[
\mathbb{E}\, \|z^*\|_\infty \approx \sigma \sqrt{2 \log N}.
\]

Hence, if we set all coefficients that are smaller than σ√(2 log N) in absolute value to zero, we are likely to remove all the noise from the signal. This means that a principled choice is λ := σ√(2 log N). With this choice we obtain the denoising result shown in the corresponding figure (orange line, top plot).

Let's compare the spline and the wavelet smoothing for the two different signals in the figure above. In the NMR example, splines introduce many unnecessary features, while wavelets fit the localized spikes nicely. In the second plot, the true function is smooth and the noise is high. The wavelet fit has left some unnecessary wiggles, a price it pays in variance for the additional adaptivity.
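To tie the whole procedure together, here is a small end-to-end sketch of wavelet shrinkage: transform, soft-threshold at λ = σ√(2 log N), inverse transform. It uses PyWavelets and a synthetic spiky signal as a stand-in for the NMR data (the actual data and the notes' exact decomposition levels are not reproduced here), so treat it as an illustration under those assumptions.

```python
import numpy as np
import pywt

rng = np.random.default_rng(0)
N = 2**10      # N = 2^J samples
sigma = 0.5    # noise level, assumed known for this illustration

# Synthetic stand-in for the NMR signal: smooth part plus isolated spikes.
t = np.linspace(0, 1, N, endpoint=False)
s = np.sin(6 * np.pi * t)
s += 4.0 * (np.abs(t - 0.37) < 0.004) + 6.0 * (np.abs(t - 0.62) < 0.002)
y = s + sigma * rng.standard_normal(N)

# 1. Discrete wavelet transform (orthonormal with periodized boundary handling).
coeffs = pywt.wavedec(y, "sym8", mode="periodization", level=6)

# 2. Soft-threshold the detail coefficients at lambda = sigma * sqrt(2 log N).
#    (Leaving the coarsest approximation coefficients untouched is a common practical
#    choice; the notes' formula thresholds every coefficient.)
lam = sigma * np.sqrt(2 * np.log(N))
coeffs_thr = [coeffs[0]] + [pywt.threshold(c, lam, mode="soft") for c in coeffs[1:]]

# 3. Inverse transform of the thresholded coefficients = denoised signal.
s_hat = pywt.waverec(coeffs_thr, "sym8", mode="periodization")

print("RMSE of noisy signal:   ", np.sqrt(np.mean((y - s) ** 2)))
print("RMSE of denoised signal:", np.sqrt(np.mean((s_hat - s) ** 2)))
```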