MLISP: Machine Learning in Signal Processing, Spring 2018
Prof. Veniamin Morgenshtern
Lecture 8-9, May 4-7
Scribe: Mohamed Solomon

Agenda
1. Wavelets: beyond smoothness
2. A problem with Fourier transform
3. Short time Fourier transform
4. Wavelets: the basic idea
5. Haar wavelets

1 Wavelets: beyond smoothness

Recall the example of phoneme classification from the previous lecture. In that example, we saw how we can improve performance by using splines to smooth the data and regularize the linear classifier. Concretely, our approach can be summarized as follows. For a signal $x \in \mathbb{R}^{256}$ we would normally need 256 basis functions to represent $x$ exactly. Instead, we used only $D = 12$ basis functions $h_1(\cdot), \dots, h_{12}(\cdot)$, designed in such a way that every function in the span of $h_1(\cdot), \dots, h_{12}(\cdot)$ is smooth. The processed data is:
$$\tilde{x}_m = [H^T x]_m = \sum_{j=1}^{256} h_m(f_j)\, x_j = h_m^T x.$$

Let us decompose $x = x_{\text{smooth}} + x_{\text{non-smooth}}$, where $x_{\text{smooth}}$ is the projection of $x$ onto the span of $h_1(\cdot), \dots, h_{12}(\cdot)$, and $x_{\text{non-smooth}}$ is the projection of $x$ onto the orthogonal complement of that span. Then,
$$h_m^T x = h_m^T (x_{\text{smooth}} + x_{\text{non-smooth}}) = h_m^T x_{\text{smooth}} + \underbrace{h_m^T x_{\text{non-smooth}}}_{0} = h_m^T x_{\text{smooth}}.$$

We see that the non-smooth part of the data does not affect the coefficients $\tilde{x}_m$, and hence does not affect learning. This is filtering: the non-smooth part of the signal happens to be irrelevant for separating the "ao" and "aa" phonemes. Hence, removing that part improves the outcome.

Often, our true signal is far from smooth:

[Figure: example of a non-smooth signal.]
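The filtering argument above can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions: it works in $\mathbb{R}^4$ with two hand-picked "smooth" vectors (a constant and a linear trend) instead of the 12 spline basis functions in $\mathbb{R}^{256}$, and the helper names `orthonormalize` and `project` are hypothetical.

```python
# Toy version of the filtering argument: R^4 and 2 basis vectors
# stand in for R^256 and the 12 spline basis functions (an assumption).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormalize(vectors):
    """Gram-Schmidt: return an orthonormal basis of span(vectors)."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        basis.append([wi / norm for wi in w])
    return basis

def project(x, basis):
    """Orthogonal projection of x onto the span of an orthonormal basis."""
    out = [0.0] * len(x)
    for q in basis:
        c = dot(x, q)
        out = [o + c * qi for o, qi in zip(out, q)]
    return out

h_vectors = [[1.0, 1.0, 1.0, 1.0],   # constant
             [0.0, 1.0, 2.0, 3.0]]   # linear trend
x = [1.0, 4.0, 2.0, 5.0]

x_smooth = project(x, orthonormalize(h_vectors))
x_non_smooth = [a - b for a, b in zip(x, x_smooth)]

coeffs_full = [dot(hm, x) for hm in h_vectors]           # h_m^T x
coeffs_smooth = [dot(hm, x_smooth) for hm in h_vectors]  # h_m^T x_smooth
leak = [dot(hm, x_non_smooth) for hm in h_vectors]       # h_m^T x_non-smooth

print("h_m^T x            :", coeffs_full)
print("h_m^T x_smooth     :", coeffs_smooth)
print("h_m^T x_non-smooth :", leak)
```

Up to rounding, the last line is zero for every $m$, so the coefficients computed from $x$ and from $x_{\text{smooth}}$ agree.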
Discontinuities are an important part of natural phenomena. For example, in images, edges carry important information:

[Figure: an image with edges.]

Wavelets are tools that allow us to capture structures beyond smoothness and still separate out the irrelevant part.

2 A problem with Fourier transform

The Fourier transform reveals important information about a signal: its frequency content. If the function $f(t)$ is periodic on the interval $[-\pi, \pi]$, it can be written as the Fourier series:
$$f(t) = \sum_{n=-\infty}^{+\infty} c_n \phi_n(t),$$
where
$$c_n = \langle \phi_n, f \rangle = \int_{-\pi}^{\pi} \overline{\phi_n(t)}\, f(t)\, dt$$
and the basis functions are
$$\phi_n(t) = \frac{1}{\sqrt{2\pi}} e^{int}.$$

Importantly, the Fourier basis functions form an orthonormal basis: $\langle \phi_n, \phi_n \rangle = 1$ and $\langle \phi_n, \phi_k \rangle = 0$ if $n \neq k$. This property makes all computations and algorithms very simple.

If $f(t)$ represents a piece of music, then the Fourier coefficients $\{c_n\}$ reveal which tones form that piece of music.

The problem is that the Fourier transform is a global transformation: a local change in the signal affects the Fourier transform everywhere.

Concretely, consider a function with one discontinuity. The first 5 terms of the Fourier series approximate the function as follows:

[Figure: approximation by the first 5 terms of the Fourier series.]
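Partial sums of this kind can be reproduced numerically. The following is a minimal sketch, assuming a specific test function that is not taken from the notes: the square wave $\operatorname{sign}(t)$ on $[-\pi, \pi]$, whose Fourier series is $\frac{4}{\pi} \sum_{m \ge 0} \frac{\sin((2m+1)t)}{2m+1}$. Here "terms" means nonzero terms of that series, and the function names are hypothetical.

```python
import math

def square_wave_partial_sum(t, n_terms):
    """First n_terms nonzero terms of the Fourier series of sign(t)."""
    return (4.0 / math.pi) * sum(
        math.sin((2 * m + 1) * t) / (2 * m + 1) for m in range(n_terms)
    )

def overshoot_fraction(n_terms, grid=4000):
    """Largest excess of the partial sum over the true value 1 on (0, pi/2),
    expressed as a fraction of the jump height (which is 2 for sign(t))."""
    peak = max(
        square_wave_partial_sum((i + 0.5) * (math.pi / 2) / grid, n_terms)
        for i in range(grid)
    )
    return (peak - 1.0) / 2.0

overshoots = {n: overshoot_fraction(n) for n in (5, 20, 40)}
for n, frac in overshoots.items():
    print(f"{n:3d} terms: overshoot = {100 * frac:.2f}% of the jump")
```

The printed overshoot barely changes with the number of terms: adding terms makes the oscillation near the discontinuity narrower, not smaller.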
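The first 20 terms of the Fourier series approximate it as follows:

[Figure: approximation by the first 20 terms of the Fourier series.]

The first 40 terms of the Fourier series approximate it as follows:

[Figure: approximation by the first 40 terms of the Fourier series.]

We observe an overshoot of about 9% of the jump near the singularity, no matter how many terms we take.

In general, singularities (discontinuities in $f(t)$ or in its derivatives) cause high frequency coefficients ($c_n$ with large $|n|$) in the Fourier series
$$f(t) = \frac{1}{\sqrt{2\pi}} \sum_{n=-\infty}^{+\infty} c_n e^{int}$$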
to be large (for a jump discontinuity they decay only like $1/|n|$). Note that the singularity is at only one point, but it affects all $c_n$ with large $|n|$. This is bad for two reasons:

(1) If there is even one singularity in the signal, then many Fourier coefficients will be large. Hence, we cannot use Fourier coefficients for efficient signal compression.

(2) The Fourier series tells us nothing about the time location of interesting features in the signal. For example, consider two signals:

[Figure: two signals containing a frequency burst at different times.]

For both signals $c_{100}$ is large, but the information about where in time the frequency burst happens is not immediately available.

3 Short time Fourier transform

One possible solution for the time localisation problem is to use a window function $g(t)$ like this:

[Figure: window function.]

and calculate the Fourier coefficients of the windowed function $f(t)\, g(t - k t_0)$. Specifically,
$$c_{nk} = \frac{1}{\sqrt{2\pi}} \int_{-\pi}^{\pi} f(t)\, g(t - k t_0)\, e^{-int}\, dt.$$

This is the short-time Fourier transform. It decomposes the 1D signal into a 2D time-frequency representation ($k$ indexes time, $n$ indexes frequency) like this:

[Figure: time-frequency representation.]
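A small numerical sketch of the windowed coefficients, under assumptions that are not in the notes: a rectangular window $g$ equal to 1 on $[0, t_0)$ with $t_0 = \pi/2$, a burst $\sin(n_0 t)$ with $n_0 = 10$ lasting one window length, and a midpoint-rule integral. All names are hypothetical.

```python
import cmath
import math

T0 = math.pi / 2   # window length t_0 (assumed)
N0 = 10            # frequency of the burst (assumed)
M = 4000           # number of midpoint-rule samples on [-pi, pi]

def burst(start):
    """Signal equal to sin(N0 * t) on [start, start + T0) and 0 elsewhere."""
    return lambda t: math.sin(N0 * t) if start <= t < start + T0 else 0.0

def rect_window(t):
    """Rectangular window g(t): 1 on [0, T0), 0 elsewhere."""
    return 1.0 if 0.0 <= t < T0 else 0.0

def coeff(f, n, g=lambda t: 1.0, k=0):
    """Midpoint-rule estimate of (1/sqrt(2 pi)) * int f(t) g(t - k T0) e^{-int} dt.
    With the default g == 1 this is the ordinary Fourier coefficient c_n."""
    step = 2 * math.pi / M
    total = 0j
    for i in range(M):
        t = -math.pi + (i + 0.5) * step
        total += f(t) * g(t - k * T0) * cmath.exp(-1j * n * t) * step
    return total / math.sqrt(2 * math.pi)

f_early, f_late = burst(-math.pi), burst(0.0)

# Ordinary Fourier coefficient at n = N0: same size for both signals.
global_early = abs(coeff(f_early, N0))
global_late = abs(coeff(f_late, N0))

# Windowed coefficients |c_{nk}| for k = -2, -1, 0, 1 (the windows tile [-pi, pi)).
stft_early = [abs(coeff(f_early, N0, rect_window, k)) for k in range(-2, 2)]
stft_late = [abs(coeff(f_late, N0, rect_window, k)) for k in range(-2, 2)]

print("global |c_n|:", global_early, global_late)
print("early burst :", stft_early)
print("late burst  :", stft_late)
```

The ordinary coefficient cannot tell the two signals apart, while the windowed coefficients are nonzero only for the window index $k$ that contains the burst.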
This representation is very similar to the way people write sheet music. Such a representation is often useful. We will return to it when we discuss the Shazam algorithm.

Note, however, that the functions
$$g_{nk}(t) = g(t - k t_0)\, e^{int}$$
are not orthogonal, unlike $\phi_n(t) = \frac{1}{\sqrt{2\pi}} e^{int}$ in the Fourier series. Therefore these functions do not form a nice basis. We need something better.

4 Wavelets: the basic idea

The basic idea is to start with an appropriate basic function $h(t)$ (the father wavelet), which might look like this:

[Figure: an example of a father wavelet $h(t)$.]

Then form all possible translations by integers and all possible stretchings by powers of 2 of the father wavelet:
$$h_{nk}(t) = 2^{n/2}\, h(2^n t - k).$$
Above, $2^{n/2}$ is just a normalization constant. To illustrate this construction, below we display $h(2t)$ and $h(4t - 3)$:

[Figure: $h(2t)$ and $h(4t - 3)$.]
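The effect of scaling and translation can be made concrete with a short sketch. The prototype function below (one period of a sine on $[0, 1)$) is purely an assumption for illustration, since the notes do not fix a particular $h$ at this point; `support` and `norm_squared` are hypothetical helper names.

```python
import math

def h(t):
    """Hypothetical prototype: one sine period on [0, 1), zero elsewhere."""
    return math.sin(2 * math.pi * t) if 0.0 <= t < 1.0 else 0.0

def h_nk(n, k, t):
    """Scaled and translated copy 2^{n/2} h(2^n t - k)."""
    return 2.0 ** (n / 2) * h(2.0 ** n * t - k)

def support(n, k):
    """Interval on which h_nk can be nonzero when h lives on [0, 1)."""
    return (k / 2.0 ** n, (k + 1) / 2.0 ** n)

def norm_squared(n, k, samples=2000):
    """Midpoint-rule estimate of the integral of h_nk(t)^2 over its support."""
    a, b = support(n, k)
    step = (b - a) / samples
    return sum(h_nk(n, k, a + (i + 0.5) * step) ** 2 for i in range(samples)) * step

for n, k in [(0, 0), (1, 0), (2, 3)]:
    print(f"n={n}, k={k}: support={support(n, k)}, ||h_nk||^2={norm_squared(n, k):.6f}")
```

The pair $(n, k) = (2, 3)$ corresponds to $h(4t - 3)$, which lives on $[3/4, 1)$. The factor $2^{n/2}$ keeps the squared norm the same at every scale.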
Define the expansion coefficients as:
$$c_{nk} = \int_{-\infty}^{+\infty} dt\, f(t)\, h_{nk}(t).$$

It turns out that if $h$ is chosen properly, then the $h_{nk}(t)$ are orthogonal:
$$\langle h_{nk}, h_{n'k'} \rangle = \int_{-\infty}^{+\infty} h_{nk}(t)\, h_{n'k'}(t)\, dt = 0 \quad \text{unless } n = n' \text{ and } k = k'.$$

Furthermore, the $h_{nk}$ form a complete set, so that every function can be expanded as:
$$f(t) = \sum_{n,k} c_{nk}\, h_{nk}(t).$$

Indeed, by completeness:
$$f(t) = \sum_{n,k} a_{nk}\, h_{nk}(t)$$
for some $a_{nk}$. To find the coefficients, use the orthonormality of $h_{nk}(t)$:
$$\langle f, h_{n'k'} \rangle = \sum_{n,k} a_{nk} \langle h_{nk}, h_{n'k'} \rangle = a_{n'k'}.$$
This is just like in the case of Fourier series.

There are major advantages compared to Fourier series. For high frequencies ($n$ large), the functions $h_{nk}(t)$ have good localization (they get thinner as $n \to \infty$). The location of high frequency components can be seen from wavelet analysis, but not from Fourier series.

Next we consider the simplest possible wavelet construction: Haar wavelets.
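The step "take the inner product with a basis element to read off its coefficient" can be checked in a finite-dimensional setting. The sketch below assumes a hand-picked orthonormal basis of $\mathbb{R}^4$ as a stand-in for the functions $h_{nk}$; the vectors and names are illustrative only.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

r = 1.0 / math.sqrt(2.0)
# A hand-picked orthonormal basis of R^4 (stand-in for the functions h_nk).
basis = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 0.5, -0.5, -0.5],
    [r, -r, 0.0, 0.0],
    [0.0, 0.0, r, -r],
]

f = [4.0, 2.0, 5.0, 5.0]

# Coefficients are inner products with the basis vectors: a_i = <f, e_i>.
coeffs = [dot(f, e) for e in basis]

# Reconstruction: f = sum_i a_i e_i.
reconstruction = [sum(a * e[j] for a, e in zip(coeffs, basis)) for j in range(4)]

# Gram matrix of the basis: the identity matrix for an orthonormal set.
gram = [[dot(u, v) for v in basis] for u in basis]

print("coefficients  :", coeffs)
print("reconstruction:", reconstruction)
```

Because the Gram matrix is the identity, every cross term in $\langle f, e_i \rangle$ vanishes and only the coefficient of $e_i$ survives, exactly as in the derivation above.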
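5 Haar wavelets

5.1 The pixel approximation spaces

Suppose we have a pixel function
$$\phi(t) = \begin{cases} 1, & \text{if } 0 \le t < 1 \\ 0, & \text{otherwise.} \end{cases}$$

We wish to build all other functions out of $\phi(t)$ and its translates $\phi(t - k)$:

[Figure: the pixel function and its translates.]

Note that any function that is constant on integer intervals can be written as a linear combination of the $\phi(t - k)$, for example:
$$f(t) = 2\phi(t) + 3\phi(t - 1) + 2\phi(t - 2) + 4\phi(t - 3). \qquad (1)$$

Given a function $f(t)$ that is not constant on integer intervals, we can approximate it by linear combinations of the $\phi(t - k)$:

[Figure: approximation of a function by a linear combination of pixel functions.]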
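Both uses of the pixel function can be sketched in a few lines. The choice of coefficients for the approximation (the average of $f$ over each integer interval) and the target function $t^2$ are assumptions made for this illustration; the notes do not specify them.

```python
def phi(t):
    """Pixel function: 1 on [0, 1), 0 elsewhere."""
    return 1.0 if 0.0 <= t < 1.0 else 0.0

def combination(coeffs, t):
    """Evaluate sum_k a_k * phi(t - k) for k = 0, 1, ..., len(coeffs) - 1."""
    return sum(a * phi(t - k) for k, a in enumerate(coeffs))

# Equation (1): f(t) = 2 phi(t) + 3 phi(t-1) + 2 phi(t-2) + 4 phi(t-3).
eq1_coeffs = [2.0, 3.0, 2.0, 4.0]
eq1_values = [combination(eq1_coeffs, k + 0.5) for k in range(4)]

def interval_average(f, k, samples=1000):
    """Midpoint-rule average of f over [k, k + 1)."""
    return sum(f(k + (i + 0.5) / samples) for i in range(samples)) / samples

def target(t):
    """Hypothetical function to approximate on [0, 4)."""
    return t * t

approx_coeffs = [interval_average(target, k) for k in range(4)]

print("values of (1) at interval midpoints:", eq1_values)
print("pixel coefficients for t^2         :", approx_coeffs)
```

The staircase `combination(approx_coeffs, t)` takes the value $k^2 + k + 1/3$ on $[k, k+1)$, which is the mean of $t^2$ over that interval.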
Let $V_0$ denote the set of all square integrable functions that are constant on integer intervals. Every element $g \in V_0$ can be written as a linear combination of integer translates of the pixel $\phi$:
$$g(t) = \sum_k a_k\, \phi(t - k). \qquad (2)$$

The elements of $V_0$ look like this:

[Figure: an element of $V_0$.]

To get a better approximation, shrink the basic function:

Figure 1: $\phi(t)$, $\phi(2t)$, $\phi(2^2 t)$

In detail, $\phi(2t)$ looks like this:

[Figure: graph of $\phi(2t)$.]
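and $\phi(2t - 1)$ looks like this:

[Figure: graph of $\phi(2t - 1)$.]

To see this, note that $2t = 1$ iff $t = \frac{1}{2}$, and $2t - 1 = 1$ iff $t = 1$.

The smaller functions $\phi(2t - k)$ give a better approximation to $f(t)$:

[Figure: approximation of $f(t)$ by linear combinations of $\phi(2t - k)$.]

Let $V_1$ denote the set of all square integrable functions that are constant on all half-integer intervals. Every element $g \in V_1$ can be written as a linear combination of the functions $\phi(2t - k)$, $k \in \mathbb{Z}$:
$$g(t) = \sum_k a_k\, \phi(2t - k). \qquad (3)$$

The elements of $V_1$ look like this:

[Figure: an element of $V_1$.]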
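We can continue in the analogous fashion. The elements of $V_2$ look like this:

[Figure: an element of $V_2$.]

In general, let $V_j$ denote the set of all square integrable functions that are constant on intervals of length $2^{-j}$. Every element $g \in V_j$ can be written as a linear combination of the functions $\phi(2^j t - k)$, $k \in \mathbb{Z}$:
$$g(t) = \sum_k a_k\, \phi(2^j t - k).$$

5.2 The general theory

The pixel function $\phi(\cdot)$ generates the approximation spaces we need. However, its scaled versions and translates are not all orthogonal to each other: for example, $\phi(t)$ and $\phi(2t)$ overlap. We fix this problem next.

Define the father function (in the standard wavelet literature, $\psi$ is usually called the mother wavelet):
$$\psi(t) = \begin{cases} 1, & \text{if } 0 \le t < \frac{1}{2} \\ -1, & \text{if } \frac{1}{2} \le t < 1 \\ 0, & \text{otherwise.} \end{cases}$$

This function looks like this:

[Figure: graph of $\psi(t)$.]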
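A minimal sketch of the two functions defined so far, using midpoint-rule integrals on $[0, 1)$. The identity $\psi(t) = \phi(2t) - \phi(2t - 1)$ checked at the end is not stated in the notes; it follows directly from the two piecewise definitions with half-open intervals. The helper `inner` is a hypothetical name.

```python
def phi(t):
    """Pixel function: 1 on [0, 1), 0 elsewhere."""
    return 1.0 if 0.0 <= t < 1.0 else 0.0

def psi(t):
    """Haar function: +1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere."""
    if 0.0 <= t < 0.5:
        return 1.0
    if 0.5 <= t < 1.0:
        return -1.0
    return 0.0

def inner(u, v, lo=0.0, hi=1.0, samples=1024):
    """Midpoint-rule estimate of the integral of u(t) * v(t) over [lo, hi)."""
    step = (hi - lo) / samples
    return sum(
        u(lo + (i + 0.5) * step) * v(lo + (i + 0.5) * step) for i in range(samples)
    ) * step

# phi(t) and phi(2t) overlap on [0, 1/2), so they are not orthogonal.
pixel_overlap = inner(phi, lambda t: phi(2 * t))

# psi is orthogonal to phi and integrates to zero.
pixel_vs_psi = inner(phi, psi)

# psi can be built from two half-width pixels.
sample_points = (-0.25, 0.1, 0.4, 0.5, 0.75, 1.0, 1.5)
identity_holds = all(psi(t) == phi(2 * t) - phi(2 * t - 1) for t in sample_points)

print("<phi(t), phi(2t)> =", pixel_overlap)
print("<phi, psi>        =", pixel_vs_psi)
print("psi(t) == phi(2t) - phi(2t - 1) at sample points:", identity_holds)
```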
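It is important not to confuse $\psi(\cdot)$ and $\phi(\cdot)$!

From the father function we generate the family of Haar wavelets by integer translation, $\psi_{0,5}(t) = \psi(t - 5)$:

[Figure: graph of $\psi_{0,5}$.]

and scaling, $\psi_{3,7}(t) = 2^{3/2}\, \psi(2^3 t - 7)$:

[Figure: graph of $\psi_{3,7}$.]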
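The two examples follow the pattern $2^{j/2}\, \psi(2^j t - k)$, which the sketch below evaluates for arbitrary $j$ and $k$. It then estimates a few inner products with a midpoint rule on a dyadic grid over $[0, 8)$; the grid, the integration range, and the function names are choices made for this illustration.

```python
def psi(t):
    """Haar function: +1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere."""
    if 0.0 <= t < 0.5:
        return 1.0
    if 0.5 <= t < 1.0:
        return -1.0
    return 0.0

def psi_jk(j, k, t):
    """Scaled and translated Haar wavelet 2^{j/2} psi(2^j t - k)."""
    return 2.0 ** (j / 2) * psi(2.0 ** j * t - k)

STEP = 2.0 ** -10  # dyadic step, so midpoints never hit an interval boundary

def inner(j1, k1, j2, k2, lo=0.0, hi=8.0):
    """Midpoint-rule estimate of the integral of psi_{j1,k1} * psi_{j2,k2} over [lo, hi)."""
    n = int((hi - lo) / STEP)
    return sum(
        psi_jk(j1, k1, lo + (i + 0.5) * STEP) * psi_jk(j2, k2, lo + (i + 0.5) * STEP)
        for i in range(n)
    ) * STEP

norm_0_5 = inner(0, 5, 0, 5)      # psi_{0,5} with itself
norm_3_7 = inner(3, 7, 3, 7)      # psi_{3,7} with itself
disjoint = inner(0, 5, 3, 7)      # supports [5, 6) and [7/8, 1) do not meet
same_scale = inner(3, 6, 3, 7)    # neighbours at the same scale
nested = inner(0, 0, 3, 7)        # support of psi_{3,7} lies inside that of psi_{0,0}

print(norm_0_5, norm_3_7, disjoint, same_scale, nested)
```

The factor $2^{j/2}$ makes each wavelet have unit norm, and the three mixed inner products come out as zero up to rounding. In the nested case, $\psi_{0,0}$ is constant on the support of $\psi_{3,7}$, and $\psi_{3,7}$ has equal positive and negative parts.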
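In general,
$$\psi_{jk}(t) = 2^{j/2}\, \psi(2^j t - k).$$

It is not difficult to see that Haar wavelets are orthogonal across scales and within the same scale:
$$\langle \psi_{jk}, \psi_{j'k'} \rangle = \int_{-\infty}^{+\infty} dt\, \psi_{jk}(t)\, \psi_{j'k'}(t) = 0 \quad \text{if } j \neq j' \text{ or } k \neq k'.$$

Indeed, if $j = j'$ and $k \neq k'$, then $\langle \psi_{jk}, \psi_{j'k'} \rangle = 0$ because $\psi_{jk}(t) = 0$ for those $t$ for which $\psi_{j'k'}(t) \neq 0$ and vice versa. If $j \neq j'$, then
$$\langle \psi_{jk}, \psi_{j'k'} \rangle = \int_{-\infty}^{+\infty} \psi_{jk}(t)\, \psi_{j'k'}(t)\, dt = 0,$$
as can be seen from the figure: either the supports are disjoint, or the wider wavelet is constant on the support of the narrower one, and the narrower wavelet integrates to zero.

[Figure: two Haar wavelets at different scales.]

Can every function be represented as a combination of Haar wavelets?

Recall that $V_j$ is the space of square integrable functions that are constant on dyadic intervals of length $2^{-j}$. The elements of this space can be written as $\sum_k a_k\, \phi(2^j t - k)$. If $j$ is negative, the intervals are of length greater than 1. Concretely, $V_{-1}$ contains functions constant on intervals of length 2; $V_{-2}$ contains functions constant on intervals of length 4, etc.

We will next list the properties of $\{V_j\}$:

1. If a function is piecewise constant on integer intervals then it is piecewise constant on half-integer intervals. Therefore,
$$\cdots \subset V_{-2} \subset V_{-1} \subset V_0 \subset V_1 \subset V_2 \subset V_3 \subset \cdots$$

2. It can be shown that $\bigcap_n V_n = \{0\}$ (the constant zero function).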
3. Every function can be approximated arbitrarily well by a staircase function if the width of the stairs is small enough. Therefore, $\bigcup_n V_n$ is dense in the space of square integrable functions.

4. Take a function that is constant on intervals of length $2^{-n}$. Shrink it by a factor of 2. The result is constant on intervals of length $2^{-n-1}$. Therefore, if $f(t) \in V_n$ then $f(2t) \in V_{n+1}$.

5. Translating a function by an integer does not change the fact that it is constant on integer intervals. Therefore, if $f(t) \in V_0$ then $f(t - k) \in V_0$.

6. The family of functions $\phi_{0k}(t) = \phi(t - k)$ forms an orthonormal basis for $V_0$. The function $\phi$ is called the scaling function.

Definition 1. A sequence of spaces $\{V_j\}$ together with a scaling function $\phi$ that generates $V_0$ so that (1)-(6) are satisfied is called a multiresolution analysis.

Definition 2. Assume $M_1$ and $M_2$ are orthogonal subspaces, i.e. $w_1 \perp w_2$ for all $w_1 \in M_1$ and $w_2 \in M_2$. The subspace $V$ is the orthogonal direct sum of $M_1$ and $M_2$, denoted as $V = M_1 \oplus M_2$, if every $v \in V$ can be written uniquely as
$$v = w_1 + w_2$$
with $w_1 \in M_1$ and $w_2 \in M_2$.

Now consider again our hierarchy of subspaces
$$\cdots \subset V_{-2} \subset V_{-1} \subset V_0 \subset V_1 \subset V_2 \subset V_3 \subset \cdots$$
Since $V_0 \subset V_1$, there is a subspace $W_0$ such that $V_0 \oplus W_0 = V_1$. We denote this subspace $W_0 = V_1 \ominus V_0$. Similarly, define $W_1 = V_2 \ominus V_1$ and, in general, $W_{j-1} = V_j \ominus V_{j-1}$.

We have:
$$V_3 = V_2 \oplus W_2 = V_1 \oplus W_1 \oplus W_2 = V_0 \oplus W_0 \oplus W_1 \oplus W_2 = V_{-1} \oplus W_{-1} \oplus W_0 \oplus W_1 \oplus W_2 = \cdots$$
and therefore for $v_3 \in V_3$:
$$v_3 = v_2 + w_2 = v_1 + w_1 + w_2 = v_0 + w_0 + w_1 + w_2 = v_{-1} + w_{-1} + w_0 + w_1 + w_2 = \cdots$$
with $v_i \in V_i$ and $w_i \in W_i$.

In conclusion, the relationship between $\{V_j\}$ and $\{W_j\}$ looks like this:

[Figure: relationship between the spaces $V_j$ and $W_j$.]
How can we characterize the $W_j$ spaces?

Claim 3. $W_0$ is the set of functions that are constant on half-integer intervals and take equal and opposite values on the two halves of each integer interval.

For example, here is an element of $W_0$:

[Figure: an element of $W_0$.]

Proof. Let $A$ be the set of functions that are constant on half-integer intervals and take equal and opposite values on the two halves of each integer interval. We need to show that $W_0 = A$. We first show that $V_0$ and $A$ are orthogonal, and then that $V_0 \oplus A = V_1$. It will then follow that:
$$A = V_1 \ominus V_0 = W_0. \qquad (4)$$

First, take $f \in V_0$ and $g \in A$. These functions look like this:

[Figure: a function $f \in V_0$ and a function $g \in A$.]

Thus it can be directly verified that
$$\langle f, g \rangle = \int_{-\infty}^{+\infty} f(t)\, g(t)\, dt = 0$$
because $f(t) g(t)$ takes on
equal and opposite values on each half of every integer interval, and so integrates to 0 on each integer interval.

Second, for every $f \in V_1$ there exist $f_0 \in V_0$ and $g_0 \in A$ such that $f = f_0 + g_0$. Indeed, take $f \in V_1$. It is constant on the half-integer intervals:

[Figure: a function $f \in V_1$.]

Define $f_0$ to be the function that is constant on each integer interval and whose value is the average of the two values of $f$ on that interval:

[Figure: the function $f_0$.]

Then $f_0$ is constant on integer intervals, and so $f_0 \in V_0$. Now define $g_0(t) = f(t) - f_0(t)$. Clearly $g_0$ takes on equal and opposite values on each half of every integer interval, and so $g_0 \in A$.

To summarise, we have $f(t) = f_0(t) + g_0(t)$ where $f_0 \in V_0$ and $g_0 \in A$. Therefore,
$$V_1 = V_0 \oplus A \quad \Rightarrow \quad A = W_0.$$

Similarly, we can show that $W_j$ is the space of square integrable functions that are constant on intervals of length $2^{-j-1}$ and take on equal and opposite values on the two halves of each dyadic interval of length $2^{-j}$.

To summarise, we designed the following structure:

[Figure: the structure of the spaces $V_j$ and $W_j$.]
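The construction of $f_0$ and $g_0$ in the proof can be written out directly. In this sketch a function in $V_1$ is represented by the list of its values on consecutive half-integer intervals; the sample values and the helper names `split` and `inner` are made up for illustration.

```python
def split(values):
    """Split f in V_1 (values on consecutive half-integer intervals, even count)
    into f0 (average on each integer interval) and g0 = f - f0."""
    f0, g0 = [], []
    for i in range(0, len(values), 2):
        mean = (values[i] + values[i + 1]) / 2.0
        f0 += [mean, mean]
        g0 += [values[i] - mean, values[i + 1] - mean]
    return f0, g0

def inner(u, v, piece_length=0.5):
    """Inner product of two piecewise constant functions given by their values
    on the same pieces, each of length piece_length."""
    return sum(a * b for a, b in zip(u, v)) * piece_length

f = [2.0, 6.0, 3.0, 1.0, 4.0, 4.0]   # hypothetical element of V_1 on [0, 3)
f0, g0 = split(f)

print("f0      :", f0)
print("g0      :", g0)
print("<f0, g0>:", inner(f0, g0))
```

On each integer interval `f0` repeats the average, `g0` holds two equal and opposite values, and the inner product of the two parts is zero.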
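Every function with $\frac{1}{16}$ quantization (that is, constant on intervals of length $\frac{1}{16}$):

[Figure: a function with $\frac{1}{16}$ quantization.]

can be decomposed into 8 functions in $W_3$, plus $8 = 4 + 2 + 1 + 1$ in the lower layers ($W_2$, $W_1$, $W_0$ and $V_0$).

Typically the representation will be sparse. For example, consider the function we started with:

[Figure: the function with one discontinuity.]

The representation is sparse because most of the coefficients are zero at high levels of detail. Concretely, consider $\psi_{3,i} \in W_3$:

[Figure: the wavelets $\psi_{3,i}$ overlaid on the function.]

Note that $\langle f, \psi_{3,i} \rangle = 0$ for all $i \neq 3, 4$. Only the coefficients with $i = 3, 4$ can be nonzero. Therefore, one singularity in the function affects only a few selected coefficients, unlike in the Fourier transform case.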
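The counting $16 = 8 + 4 + 2 + 1 + 1$ and the sparsity can be reproduced with a short sketch. It assumes a function on $[0, 1)$ given by 16 equal pieces, and a hypothetical step signal with a single jump; this is not the function shown in the notes. Coefficients are computed by repeated pairwise averaging and differencing, using $\langle f, \psi_{j,i} \rangle = 2^{-j/2} (a - b)/2$, where $a$ and $b$ are the averages of $f$ on the two halves of the support of $\psi_{j,i}$.

```python
def haar_coefficients(values):
    """For f on [0, 1) made of 2**J equal pieces, return
    (<f, phi>, {j: [<f, psi_{j,i}> for i in range(2**j)]}) for j = 0..J-1."""
    s = list(values)
    j = len(s).bit_length() - 1           # J, assuming len(values) == 2**J
    details = {}
    while len(s) > 1:
        j -= 1
        scale = 2.0 ** (-j / 2) / 2.0
        pairs = [(s[2 * i], s[2 * i + 1]) for i in range(len(s) // 2)]
        details[j] = [scale * (a - b) for a, b in pairs]
        s = [(a + b) / 2.0 for a, b in pairs]   # averages on the coarser grid
    return s[0], details

# Hypothetical step signal: one jump, located inside the support of psi_{3,3}.
signal = [1.0] * 7 + [3.0] * 9
scaling, details = haar_coefficients(signal)

nonzero = {j: [i for i, c in enumerate(cs) if abs(c) > 1e-12] for j, cs in details.items()}
energy = scaling ** 2 + sum(c * c for cs in details.values() for c in cs)

print("coefficients per level:", {j: len(cs) for j, cs in details.items()})
print("nonzero indices       :", nonzero)
print("energy check          :", energy, "vs", sum(v * v for v in signal) / len(signal))
```

For this signal only one coefficient per level is nonzero, so 5 of the 16 numbers (4 wavelet coefficients and the scaling coefficient) describe the function exactly. The energy check confirms that the squared coefficients add up to $\int f^2$.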