Tfy Lecture 5


Non-linear signal processing
Mark van Gils

Why non-linear methods?
The application field, biomedicine, usually deals with systems (humans or animals) that do not even come close to being mathematically convenient and well-behaved. For example:
- physiological processes interact non-linearly
- signal statistics are non-stationary
- noise or signals have non-Gaussian distributions
- noise depends on the signal (multiplicative noise)
- human perception and information processing is highly non-linear

Why linear methods?
- physiological signals/systems can often be viewed as having linear as well as non-linear components, so linear methods may well be used as a first (and sometimes very adequate) attempt to describe systems and signals
- many non-linear methods require prior information about the possible non-linearities (which may not be available)
- the behaviour of linear methods is easier to understand than that of non-linear ones
- non-linear methods may be superior to linear methods under laboratory circumstances, but in the real world linear techniques are still the most common

Linear and non-linear signals and methods
Linearity has been defined as a system property - so what is a linear or a non-linear signal?
Definition: a linear signal is completely characterized by its 2nd order statistical properties.
2nd order properties: 1st and 2nd moment (mean and variance), autocorrelation function, power spectral density (i.e. "frequency domain and basic statistics").
Definition: a non-linear signal is any signal that is not a linear signal.

Origin of non-linear signals
Non-linear signals are usually generated by non-linear systems (such as human physiological processes). A linear system has a non-linear output signal if and only if the system input is a non-linear signal. Non-linear analysis methods aim to characterize non-linear signals better than linear methods can, i.e. to somehow quantify characteristics related to higher order statistics. Still, linear methods are very often successfully applied to characterize (the 2nd order statistics of) non-linear signals.

Processing and analysis of non-linear signals
- (linear methods)
- non-linear time series modelling
- higher order statistics and higher order spectral analysis
- weighted order statistic filtering
- deterministic chaos analysis, complexity measures and predictability
- Poincaré analysis and return maps
- fractals and the 1/f scaling phenomenon
- analysis of dimensionality
- entropy
- pattern analysis, e.g. the Lempel-Ziv complexity measure
- artificial neural networks
- ...many more

Higher order statistics and spectral analysis
The idea follows directly from the definition: use higher order (higher than 2nd) statistics to characterize the non-linear signal, in either the time or the frequency domain (higher order spectral analysis). Most often: skewness (3rd order) and kurtosis (4th order) of the distribution, or their frequency domain equivalents.
Challenges:
- numerical estimation methods often require large amounts of data and are relatively complex
- interpretation is sometimes tricky

Higher order spectral analysis
In power spectral analysis (linear analysis) only the magnitude of the frequency components can be seen, but NO phase relations between frequencies. Higher order spectral analysis may reveal more complex (non-linear) relationships. Bispectral analysis reveals couplings between frequencies. The bispectrum of a stationary process {x(k)} is defined as

B(ω1, ω2) = Σ_m Σ_n C(m, n) e^{-j(ω1 m + ω2 n)},   with   C(m, n) = E[x(k) x(k+m) x(k+n)]

This is a third order statistic! For a Gaussian process C(m, n) = 0 for every m and n.
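As a concrete illustration of the time-domain side (not from the lecture itself), the sketch below estimates skewness, kurtosis and a simple third-order cumulant with NumPy/SciPy; the test signals and the quadratic distortion are made-up examples.

```python
# Minimal sketch: higher-order sample statistics. Gaussian noise gives skewness and
# excess kurtosis near 0; a quadratically distorted version of the same noise does not.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
gauss = rng.standard_normal(10_000)
nonlin = gauss + 0.5 * gauss**2                  # simple, made-up quadratic non-linearity

for name, x in [("gaussian ", gauss), ("nonlinear", nonlin)]:
    print(name, "skew =", round(stats.skew(x), 3), "kurtosis =", round(stats.kurtosis(x), 3))

def third_order_cumulant(x, m, n):
    """Sample estimate of C(m, n) = E[x(k) x(k+m) x(k+n)] for a zero-mean signal."""
    x = x - x.mean()
    k = len(x) - max(m, n)
    return np.mean(x[:k] * x[m:m + k] * x[n:n + k])

# C(0, 0) is the third central moment: ~0 for the Gaussian data, non-zero otherwise
print(round(third_order_cumulant(gauss, 0, 0), 3), round(third_order_cumulant(nonlin, 0, 0), 3))
```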

Bispectral analysis analyses the relationship between two primary frequencies f1 and f2 and a modulation component at f1+f2 (the triplet f1, f2, f1+f2). The quantity B(f1, f2) contains both phase and power information; these can be separated into the bicoherence, BIC(f1, f2), containing the phase information, and the real triple product, RTP(f1, f2), containing the magnitude information.

Bicoherence
A high BIC(f1, f2) indicates phase coupling among f1, f2 and f1+f2. This may indicate that f1 and f2 have a common generator, or that they have a non-linear interaction that creates a new, dependent, frequency at f1+f2.

a: f1+f2 = f3, but f3 is independent of f1 and f2
b: f4+f5 = f6, and f6 is the result of a coupling between f4 and f5
A power spectrum cannot discriminate whether situation a or b exists; the bispectrum can!
In A, three waves having no phase relationships are mixed, producing the waveform at the upper right; the bispectrum is zero everywhere. In B, two waves are combined non-linearly, creating a signal that contains the two original waves plus one at 13 Hz that is phase locked to the 3 Hz and 10 Hz waves. In this case the bispectrum shows a spike at f1 = 10 Hz and f2 = 3 Hz.

Calculation - the direct method
- from a digitized epoch x(i), calculate the FFT to generate the complex spectrum X(f)
- for each possible triplet, calculate the product of the spectral values, using the complex conjugate of the spectral value at f1+f2:

B(f1, f2) = X(f1) X(f2) X*(f1+f2)

- if there is a large spectral amplitude at each component and the phase angles are aligned, the product is large; if one of the component sinusoids is small, or if the phases are not aligned, the product is small
- finally, the complex bispectrum is converted to a real number by taking the magnitude of the complex product

Calculation in practice
For example, 4 s of EEG sampled at 128 Hz. Fourier spectrum: 0 to 64 Hz in steps of 0.25 Hz = 256 frequencies. Calculating all triplets: 256*256 = 65536 evaluations of the complex product. However, there is symmetry, B(f1, f2) = B(f2, f1), and f1+f2 cannot be evaluated above half the sampling frequency. Thus only a wedge of the whole (f1, f2) plane needs to be evaluated (see the shaded area several slides back). Still, this is computationally burdensome (see the sketch below).
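A rough sketch of the direct method described above (the helper name and parameters are my own, not a standard API): each epoch is windowed and FFT'd, and the triple products X(f1) X(f2) X*(f1+f2) are averaged over epochs, evaluating only the symmetric wedge below the Nyquist limit.

```python
import numpy as np

def bispectrum_direct(x, fs, nfft=256):
    """Average the triple product X(f1) X(f2) X*(f1+f2) over non-overlapping epochs."""
    n_epochs = len(x) // nfft
    epochs = x[:n_epochs * nfft].reshape(n_epochs, nfft)
    epochs = epochs - epochs.mean(axis=1, keepdims=True)
    X = np.fft.rfft(epochs * np.hanning(nfft), axis=1)      # one-sided spectra per epoch
    nf = X.shape[1]
    B = np.zeros((nf, nf), dtype=complex)
    for f1 in range(nf):
        for f2 in range(f1 + 1):                            # symmetry: B(f1,f2) = B(f2,f1)
            if f1 + f2 < nf:                                 # f1+f2 must stay below Nyquist
                B[f1, f2] = np.mean(X[:, f1] * X[:, f2] * np.conj(X[:, f1 + f2]))
    return B, np.fft.rfftfreq(nfft, d=1.0 / fs)

# Example: 3 Hz and 10 Hz waves plus a quadratic product (adds 7 Hz and 13 Hz components)
fs = 128
t = np.arange(0, 64, 1 / fs)
x = np.sin(2*np.pi*3*t) + np.sin(2*np.pi*10*t) + 0.4*np.sin(2*np.pi*3*t)*np.sin(2*np.pi*10*t)
B, freqs = bispectrum_direct(x, fs)
f1, f2 = np.unravel_index(np.argmax(np.abs(B)), B.shape)
print("largest |B| at", freqs[f1], "Hz and", freqs[f2], "Hz")
```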

Calculation: another method
The method using the FT of x(i) (sometimes called the direct method) on the previous slide is widely used. However, we can also use the earlier definition

B(ω1, ω2) = Σ_m Σ_n C(m, n) e^{-j(ω1 m + ω2 n)},   with   C(m, n) = E[x(k) x(k+m) x(k+n)]

and calculate the bispectrum from estimated third-order cumulants: the indirect method. In general the two methods give somewhat different results, but both are asymptotically unbiased and consistent estimators.

Further analysis
If one is only interested in examining phase relationships, the variations in signal magnitude must be normalised out of the bispectrum. The amplitude of X(f) is determined by the magnitude of the complex value. The real triple product, RTP(f1, f2), uses the squared magnitudes of the three values in the triplet:

RTP(f1, f2) = |X(f1)|² |X(f2)|² |X(f1+f2)|²

Bicoherence
The square root of the RTP is used to normalise the bispectrum into the bicoherence, a number between 0 and 1 quantifying the amount of phase coupling between the frequencies:

BIC(f1, f2) = |B(f1, f2)| / sqrt(RTP(f1, f2))

(Figure: raw EEG, fs = 125 Hz; B(f1, f2) as a logarithmic contour map; PSD, in which the 1 Hz component is by far the biggest; the same bispectrum as a semi-3D plot on a linear scale (RTP looks similar); PSD on a logarithmic scale, where other frequencies appear as well; BIC showing phase coupling over many frequencies.)
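Continuing the sketch above (again with my own helper names): given the per-epoch one-sided spectra, the bicoherence is the bispectrum magnitude normalised by the square root of the RTP; by the Cauchy-Schwarz inequality this ratio lies between 0 and 1.

```python
import numpy as np

def bicoherence(X):
    """X: (n_epochs, n_freqs) complex one-sided spectra. Returns BIC(f1, f2) in [0, 1]."""
    nf = X.shape[1]
    BIC = np.zeros((nf, nf))
    for f1 in range(nf):
        for f2 in range(f1 + 1):                # symmetric wedge, as before
            if f1 + f2 >= nf:
                continue
            triple = X[:, f1] * X[:, f2] * np.conj(X[:, f1 + f2])
            B = np.abs(np.mean(triple))                        # bispectrum magnitude
            rtp = np.mean(np.abs(X[:, f1]) ** 2 * np.abs(X[:, f2]) ** 2
                          * np.abs(X[:, f1 + f2]) ** 2)        # real triple product
            BIC[f1, f2] = B / np.sqrt(rtp) if rtp > 0 else 0.0
    return BIC
```

The per-epoch spectra X can be taken directly from the direct-method sketch a few slides back.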

Motivations for bispectral analysis
- the bispectrum of a Gaussian process is zero, so bispectral analysis can be used to suppress Gaussian noise in signals
- examination of phase relationships between frequency components
- identification of non-linearities

Example - a depth-of-anaesthesia monitor: the Bispectral Index (BIS)
- developed to assess the hypnotic/sedative component of anaesthesia
- 0 ≤ BIS ≤ 100
- BIS = f(power spectral variables, bispectral variables, ...)
- based on a large (> 2000 patients under different types of anaesthesia) annotated database and statistical analysis

BIS
- hypnosis covered by one single number (= easy)
- if BIS < 70: the probability of awareness is low
- if BIS ~ 90: consciousness
- 50 < BIS < 60 for maintenance anaesthesia
- note: this is an indicator of hypnosis, NOT of analgesia (pain suppression)

BIS index
A purely empirically obtained function that happens to work well under many circumstances. Three components are weighted and summed in a non-linear fashion to obtain a number between 0 and 100; the weights of the summation change within this range.
- from the time domain: the Burst Suppression Ratio (BSR), the fraction of time in an epoch during which the EEG is suppressed, and QUAZI suppression, which allows burst suppression detection in the presence of a wandering voltage baseline
- from the frequency domain: the relative beta ratio, the log ratio of the power in two frequency bands (the borders of these bands have been obtained empirically)
- from the bispectrum: the SynchFastSlow parameter, the log ratio of the sum of the bispectrum peaks in a broad band over the sum of the bispectrum in the narrower B40-47 area
(Figure: the areas for the SynchFastSlow calculation for a signal sampled at 200 Hz, with the f1 axis extending to fs/2 = 100 Hz and the f2 axis to fs/4. Dashed line: the support area of the bispectrum calculation; solid line: the B40-47 area; dotted line: the broader bispectral band.)


Higher orders than the bispectrum
The bispectrum was 'one step' above the usual spectrum. We can extend the idea to higher-order spectra (HOS) in general; the trispectrum gives three-dimensional information on cubic phase coupling, etc. These are much harder to interpret but may give additional information (e.g. they have been used in ECG analysis).

Non-linear time series modelling
As linear time series modelling, but with non-linear elements (operators, interactions). The signal is modelled as the output of a non-linear system (filter) whose input is (usually) white noise. This allows quantification of (known) non-linearities and potentially a better fit to the data than linear models.
Uses:
- quantification of linear and non-linear system characteristics such as kernels
- (sometimes) control or prediction
Problems:
- numerical complexity and the amount of data needed increase compared with linear models
- the type of non-linearity needs to be known, or to match the model well, to get good results

Example: non-linear modelling of the HRV signal
Heart-rate fluctuation is due to regulation by the ANS (autonomic nervous system). Processes: breathing (RSA, respiratory sinus arrhythmia, f > 0.15 Hz), blood pressure regulation and other mechanisms (f < 0.15 Hz). Linear models (ARMA) exist to model the relationship between instantaneous lung volume (ILV), arterial blood pressure (ABP) and heart rate (HR). Such methods, however, cannot show the non-linear coupling between the processes, which has been shown experimentally to exist. Non-causal effects also occur: e.g. the brainstem controls both respiration and heart rate, with heart rate changes often preceding changes in lung volume.

Non-linear modelling of the autonomic control of the heart
Heart rate variability is modelled with a non-linear model whose inputs are respiration (ILV) and blood pressure (ABP); the model has linear and non-linear parts. The linear effects are as in linear modelling, and the non-linear part explains what remains after the linear part. The non-linearities are quadratic versions of the inputs and cross-terms: a 2nd order time-invariant Volterra model (Chon et al., IEEE T-BME).
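To make the model structure concrete, here is a toy sketch of a second-order (quadratic) Volterra model with a single input, fitted by ordinary least squares. It is not the Chon et al. model: the memory length, input and coefficients are invented for illustration.

```python
import numpy as np
from itertools import combinations_with_replacement

def volterra2_fit(u, y, M=3):
    """Least-squares fit of y[n] ~ h0 + sum_i h1[i] u[n-i] + sum_{i<=j} h2[i,j] u[n-i] u[n-j]."""
    idx = np.arange(M - 1, len(u))
    L = np.column_stack([u[idx - i] for i in range(M)])          # lagged inputs u[n-i]
    Q = np.column_stack([L[:, i] * L[:, j]
                         for i, j in combinations_with_replacement(range(M), 2)])
    Phi = np.column_stack([np.ones(len(idx)), L, Q])             # constant + linear + quadratic
    theta, *_ = np.linalg.lstsq(Phi, y[idx], rcond=None)
    return theta

# Toy data: the output depends linearly and quadratically (cross-term) on the input
rng = np.random.default_rng(1)
u = rng.standard_normal(2000)
y = 0.8 * u + 0.5 * np.roll(u, 1) + 0.3 * u * np.roll(u, 1) + 0.05 * rng.standard_normal(2000)
theta = volterra2_fit(u, y, M=3)
print(np.round(theta[:4], 2))   # roughly [0, 0.8, 0.5, 0]: constant, u[n], u[n-1], u[n-2]
```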

Non-linear methods
Instead of the common LTI systems, such as linear filters, we can also use non-linear processing systems. Examples are artificial neural networks and non-linear filters such as median filters.

Spike removal
In general it is difficult to remove spikes from signals using, e.g., FIR filters. Spikes may be removed by using a linear interpolation between the samples where the large slopes occur, instead of the actually measured data. An alternative is to use median filters.

Spike removal using non-linear filtering: the median filter
- take an input sequence of N samples
- reorder so that the samples are arranged in ascending order of magnitude: x(1) ≤ x(2) ≤ x(3) ≤ ... ≤ x(N)
- the output of the median filter is the centre sample of the ordered sequence:

med(x) = x(k+1)                if N = 2k+1
med(x) = ½ (x(k) + x(k+1))     if N = 2k

Properties:
- good at removing sharp, short-lasting artefacts ('shot/spike noise')
- good at preserving step changes / edges
- problem: the response in the frequency domain depends on the input (non-linearity)
- gets computationally heavy for large N
(Note: the median filter is the simplest example of a Weighted Order Statistics filter, with all the weights (the relative importance of each sample) equal.)

Best application for median filtering: spike noise (see the sketch below).
(Figure, horizontal axis: time in seconds, vertical axis: amplitude (a.u.). Blue: original ECG with spike noise; green: FIR (N=151) filtered ECG; red: median (N=3) filtered ECG.)
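A minimal sketch of the comparison in the figure above, using scipy.signal.medfilt; the 'ECG-like' wave and the spike amplitudes are made up.

```python
import numpy as np
from scipy.signal import medfilt

t = np.arange(0, 2, 1 / 250)
clean = np.sin(2 * np.pi * 1.2 * t)             # stand-in for a slow ECG-like wave
noisy = clean.copy()
noisy[100] += 5.0                               # isolated spike artefacts
noisy[300] -= 4.0

median_out = medfilt(noisy, kernel_size=3)                       # non-linear: spikes removed
linear_out = np.convolve(noisy, np.ones(3) / 3, mode="same")     # linear: spikes only smeared

print("max error, median filter :", float(np.max(np.abs(median_out - clean))))
print("max error, moving average:", float(np.max(np.abs(linear_out - clean))))
```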

Median filtering and a step change in the signal
For a median filter a step change in the signal remains preserved.
(Figure, horizontal axis: heart-beat number, vertical axis: RR interval in ms, the distance from one R peak to the next in an ECG. Blue: original RRI; green: FIR (N=300) filtered RRI; red: median (N=300) filtered RRI.)

Segmentation example
Often, features are calculated over segments ('windows') of the ongoing data. For many features it is advantageous to take as long a segment as possible (e.g. to get better resolution in frequency descriptors), but the segment should not be too long, to avoid getting non-stationary data within segments.

Segmentation
Breaking a signal up into equally sized segments is easy and fast, but not the best method. Adaptive segmentation results in differently sized segments, each having the maximum length of stationary signal. Different criteria for stationarity can be used, for example tracking the variability of feature vectors (containing e.g. power or spectral parameters); this requires tuning of parameters and relatively many computing operations.

Example: estimation of variable-duration windows
Compare statistics (e.g. power, variance, frequencies) of the data in a sliding window with those of the data in a reference window. If the difference grows to exceed a preset threshold, a segment border is identified. The reference window can either be constant or grow over time. (A rough sketch follows below.)
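A rough sketch of the reference-window idea above; the window length, the variance statistic and the threshold are arbitrary choices for illustration.

```python
import numpy as np

def segment_by_variance(x, win=128, threshold=3.0):
    """Mark a border when the sliding-window variance differs too much from the reference."""
    borders = [0]
    ref = np.var(x[:win])                       # reference window statistic
    pos = win
    while pos + win <= len(x):
        test = np.var(x[pos:pos + win])
        ratio = max(test, ref) / max(min(test, ref), 1e-12)
        if ratio > threshold:                   # difference exceeds the preset threshold
            borders.append(pos)
            ref = test                          # restart the reference in the new segment
        pos += win
    return borders

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0, 1.0, 2000), rng.normal(0, 3.0, 2000)])
print(segment_by_variance(x))                   # expect one border close to sample 2000
```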

Non-linear energy operator (NLEO) for segmentation (Agarwal et al.)
Calculate the NLEO output Ψ over the samples in the windows at positions n and n+1 and use it as the segmentation criterion: for a sine wave, both a change in frequency and a change in amplitude give a step change in Ψ. Detect changes in Ψ to define segment borders.
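The sketch below uses the classic Teager-Kaiser form of the non-linear energy operator, psi[x(n)] = x(n)^2 - x(n+1) x(n-1); Agarwal et al. use a generalised, smoothed variant, but the idea is the same: for a sinusoid the operator output is approximately A^2 sin^2(omega), so a step in amplitude or frequency produces a step in psi that can be used to place segment borders.

```python
import numpy as np

def nleo(x):
    """Classic Teager-Kaiser energy operator psi[x(n)] = x(n)^2 - x(n+1) x(n-1)."""
    psi = np.zeros_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[2:] * x[:-2]
    return psi

# Test signal: amplitude and frequency both change half-way through
fs = 200
t1, t2 = np.arange(0, 2, 1 / fs), np.arange(2, 4, 1 / fs)
x = np.concatenate([1.0 * np.sin(2 * np.pi * 5 * t1), 2.0 * np.sin(2 * np.pi * 12 * t2)])

psi = nleo(x)
smooth = np.convolve(psi, np.ones(40) / 40, mode="same")   # smooth before looking for steps
jump = np.abs(np.diff(smooth[60:-60]))                     # ignore the filter edge effects
print("largest change near sample", int(np.argmax(jump)) + 60, "- true border at", len(t1))
```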

Above: example EEG during anaesthesia induced with sevoflurane, showing different types of activity (normal, spiky/epileptiform, burst suppression and return to normal). Below: the borders found with the NLEO.
Example of EEG segmentation techniques applied to a six-channel reference recording shown in (a), from left and right frontal, parietal and occipital electrodes. Vertical bars indicate segmentation boundaries. (b) Segmentation criterion for the left- and right-sided channels, respectively. (c) Overall segmentation criterion used for the final simultaneous segmentation of the left and right sides.

Long data recordings may be transformed into colour-coded representations of the segments, thus providing a quick overview of the data characteristics.

Another example of non-linear filtering: multi-layer perceptron neural networks
We saw earlier that simple-to-implement adaptive linear filtering can be very powerful. A non-linear extension of an adaptive linear element is the perceptron. Many perceptrons in parallel make up one type of artificial neural network (ANN). ANNs can be used for many tasks: filtering, prediction, pattern recognition, ...

Artificial Neural Networks
- a large number of simple (non-linear) units that are densely interconnected
- information is stored in the weights associated with the interconnections
- the network 'learns' by adapting the network weights in response to information present in its environment

General ANN structure
Information is processed by multiplying measured values (e.g. measure 1, measure 2) by weights and transmitting them through a network of non-linear elements; the output can represent, for instance, membership of class '1'.

Output of one processing element ('neuron')
(Figure: a neuron with example inputs labelled age, lat sin, head flex and sacr. flex; the weighted sum of the input values is passed through the transfer function, e.g. y = f(0.9·x1 + ...) = 0.91.)
The weights are adaptable; they can be tuned to change the output, y.

Some terms
- the inputs of a processing element are described by an input vector, x
- the weights associated with the input connections are described by a weight vector, w
- output signal, y
- transfer function, f
For example:

y = f(w, x) = f(Σ_i w_i x_i) = f(w·x)

The processing element can be seen as a non-linear version of an adaptive linear element (ADALINE): y = f(w·x).

Often used transfer functions
- a sigmoid function: f(x) = 1 / (1 + e^(-x)), with output between 0 and 1
- a hyperbolic tangent: f(x) = (e^x - e^(-x)) / (e^x + e^(-x)), with output between -1 and 1
An extra non-linearity (steepness) parameter, λ, may be used to adjust the 'steepness' of the function, in which case the functions do not operate on x but on λx. (A small sketch follows below.)

Training a neural network
- initially the network will produce nonsense output in response to input data, due to the random initialization values of the connection weights
- but, assuming we have example cases, we know what kind of output should have been produced in response to that input
- adapt the weight connections so that the network output comes closer to that desired output
- take another example case, see what the network now gives as output, compare it with what it should have been, and adapt the weights again
- repeat until the actual network output is 'always' close to the desired output (this may take 1000s of iterations)
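A small sketch of the two transfer functions and of a single processing element; the weights and inputs are arbitrary example values.

```python
import numpy as np

def sigmoid(x, lam=1.0):
    """f(x) = 1 / (1 + e^(-lam*x)), output between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-lam * x))

def hyperbolic_tangent(x, lam=1.0):
    """f(x) = (e^(lam*x) - e^(-lam*x)) / (e^(lam*x) + e^(-lam*x)), output between -1 and 1."""
    return np.tanh(lam * x)

def neuron_output(w, x, f=sigmoid):
    """One processing element: y = f(w . x), with a constant 1 prepended as the bias input."""
    x = np.concatenate(([1.0], x))
    return f(np.dot(w, x))

x = np.array([0.5, -1.2, 0.3])            # example input vector
w = np.array([0.1, 0.9, -0.4, 0.7])       # first weight acts on the bias input
print(neuron_output(w, x), neuron_output(w, x, f=hyperbolic_tangent))
```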

Initialization, iterative training, evaluation
- the weights are randomly initialized
- during iterative training the weights are adapted by comparing the actual outputs with the desired outputs for each example case
- evaluation is done using other data than the training data
(The figure illustrated actual outputs, e.g. 0.64, being compared with desired outputs of 0 or 1.)
- often, one input is kept at a value of 1: the bias
- a local memory contains data variables such as the learning rate and momentum
- a learning rule influences the behaviour of the ANN by adapting the network weights

Learning in ANNs
Different groups of training/learning methods:
- supervised learning
- unsupervised learning
- reinforcement learning
Each group contains many methods. Popular examples: the (generalised) delta rule, as used in "backpropagation networks" (supervised learning), and the Kohonen learning rule, as used in self-organising maps (SOMs) (unsupervised learning).

Usage
In non-trivial applications ANNs are typically used as a module co-operating with other modules that may use other techniques (rules, algorithms, ...) in so-called hybrid systems. Rarely can an ANN be used to solve everything.

Sufficient usable data is crucial
Large data sets are needed for training. Note: a data set with many occurrences of disease A and only 2 occurrences of disease B may still effectively be a small data set (quite often we would like the ANN to help us exactly with identifying those rare disease B cases). Missing data often forms a serious problem in clinical applications.

Next slides: an example of a popular training algorithm, the generalised delta rule, used in so-called 'backpropagation networks'. Don't learn the details by heart, but try to understand the main principles.

Delta Rule (aka the Widrow, Widrow-Hoff, or LMS learning rule)
Originally used in the 1960s in ADALINEs (ADAptive LINear Elements); in its generalised form very popular (backpropagation). For an n-dimensional input vector x, the ADALINE output is

y = w_0 + w_1 x_1 + ... + w_n x_n = w·x

so the element acts as a hyperplane classifier: w·x = 0 on the decision boundary, w·x > 0 on one side, w·x < 0 on the other.
Input/desired-output pair k is (x_k, y_k*). Define the cost function, G, as the expectation of the squared error:

G(w) = E[(y_k* - y_k)²]   or   G(w) = lim_{N→∞} (1/N) Σ_{k=1}^{N} (y_k* - y_k)²

This is a parabolic surface; slide towards the minimum by moving along -∇_w G. Note that ∇_w y_k = x_k, so

∇_w G(w) = lim_{N→∞} (1/N) Σ_{k=1}^{N} 2 (y_k* - y_k)(-x_k) = -2 E[ε_k x_k]

where ε_k is the error for input k.

This implies: average a large number of ε_k x_k vectors, multiply by 2 and move the weights in that direction (the negative gradient direction). Widrow & Hoff: update the weights after every input presentation.
Delta rule, with learning rate η:

w_{k+1} = w_k + η ε_k x_k

Alternatives:
- update only after a number of input presentations (batched version)
- use the most recent weight update also in the current weight update (momentum version), with momentum α:

w_{k+1} = w_k + η ε_k x_k + α (w_k - w_{k-1})

The ADALINE has limited performance due to its linearity (see the sketch below).

Multi-Layer Perceptron (MLP) trained with the backpropagation algorithm
Perceptron: a non-linear ADALINE,

y = f( Σ_{i=0}^{n} w_i x_i )

where f is +1 if its argument is larger than or equal to 0, and -1 otherwise. The learning rule is similar to the delta rule, and the decision regions are built of half planes. More complex decision regions can be obtained by using multiple layers of perceptrons.
Q: how to train such a configuration? A: backpropagation.
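A sketch of the per-presentation delta rule with an optional momentum term, applied to a toy linear problem; the learning rate, momentum and data are made up.

```python
import numpy as np

def train_adaline(X, y_star, eta=0.05, alpha=0.5, epochs=50):
    """Widrow-Hoff (delta rule) training of a linear element, one update per presentation."""
    X = np.column_stack([np.ones(len(X)), X])        # bias input fixed at 1
    w = np.zeros(X.shape[1])
    prev_update = np.zeros_like(w)
    for _ in range(epochs):
        for x_k, y_k_star in zip(X, y_star):
            eps_k = y_k_star - np.dot(w, x_k)                    # error for input k
            update = eta * eps_k * x_k + alpha * prev_update     # delta rule + momentum
            w += update
            prev_update = update
    return w

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 2))
y_star = 1.5 * X[:, 0] - 0.7 * X[:, 1] + 0.2         # linear target the ADALINE can represent
print(np.round(train_adaline(X, y_star), 2))          # weights approach [0.2, 1.5, -0.7]
```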

Backpropagation
Minimise a cost function G_p, defined for an input pattern p and a network with m outputs:

G_p = ½ Σ_{j=1}^{m} (y*_pj - y_pj)²

with y_pj the output of output element j when pattern p is presented, and y*_pj the desired output.
Again, we use gradient descent to change the weights: the elements w_ij of weight vector w_j change according to

Δ_p w_ij ∝ -∂G_p/∂w_ij = -(∂G_p/∂I_pj)(∂I_pj/∂w_ij)

where I_pj is the input applied to processing element j as a result of presenting pattern p:

I_pj = Σ_{i=1}^{K} w_ij y_pi

(The figure shows a network with an input layer presenting pattern p, a hidden layer with elements i = 1, ..., l producing outputs y_pi, and an output layer with elements j = 1, ..., m.)

For the output elements the weight change can be calculated straightforwardly. Using I_pj = Σ_i w_ij y_pi, we get

∂I_pj/∂w_ij = y_pi

We would like to use a weight update rule of the form

Δ_p w_ij = η δ_pj y_pi

(this is called the generalised delta rule), with learning rate η (usually decreasing with time), and with δ_pj the error made by processing element j as a result of the application of pattern p. Comparing with

Δ_p w_ij = -η (∂G_p/∂I_pj)(∂I_pj/∂w_ij) = -η (∂G_p/∂I_pj) y_pi

we thus need to estimate δ_pj = -∂G_p/∂I_pj. For output elements we can easily calculate this, since we know their desired and actual output (and thus their error): with G_p = ½ Σ_j (y*_pj - y_pj)² and y_pj = f_j(I_pj),

-∂G_p/∂I_pj = -(∂G_p/∂y_pj)(∂y_pj/∂I_pj) = (y*_pj - y_pj) f'_j(I_pj)

Eventually we get, for the output elements:

δ_pj = (y*_pj - y_pj) f'_j(I_pj)

For hidden nodes, however, this is a bit more complex: we have to apply the chain rule for differentiation and use the errors of the output elements. We thus backpropagate the error through the network to calculate all weight updates. For the elements in the hidden layer(s) we get

δ_pi = f'_i(I_pi) Σ_{j=1}^{M} δ_pj w_ij

where f'_i is the derivative of the transfer function of processing element i.
Note: since the derivative of the transfer function is used, this function must be differentiable for every input value.

Backpropagation training
With the weight update rule and the expressions for δ we have the tools to calculate the weight changes for every processing element after each pattern presentation p.
Iterative training process (see the sketch below):
- present a new input pattern
- calculate the network output using the summations and transfer functions
- calculate the errors of the output elements
- calculate the errors of the hidden elements (backpropagation)
- update the weights using the generalised delta rule
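A minimal backpropagation sketch following the procedure above: one hidden layer of sigmoid units, per-pattern updates with the generalised delta rule, trained on the XOR problem. The architecture, learning rate and iteration count are my own choices, not from the lecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(4)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)        # XOR targets

W1 = rng.normal(scale=0.5, size=(3, 4))                # input (+bias) -> 4 hidden units
W2 = rng.normal(scale=0.5, size=(5, 1))                # hidden (+bias) -> 1 output unit
eta = 0.5

for _ in range(20000):
    for x, y_star in zip(X, Y):
        # forward pass
        x1 = np.concatenate(([1.0], x))                # bias input
        h = sigmoid(x1 @ W1)
        h1 = np.concatenate(([1.0], h))
        y = sigmoid(h1 @ W2)
        # errors: delta = (desired - actual) * f'(I); for the sigmoid, f'(I) = y (1 - y)
        delta_out = (y_star - y) * y * (1 - y)
        delta_hid = h * (1 - h) * (W2[1:] @ delta_out)   # backpropagated error
        # generalised delta rule: delta_w_ij = eta * delta_j * y_i
        W2 += eta * np.outer(h1, delta_out)
        W1 += eta * np.outer(x1, delta_hid)

for x in X:                                            # outputs typically approach 0, 1, 1, 0
    h1 = np.concatenate(([1.0], sigmoid(np.concatenate(([1.0], x)) @ W1)))
    print(x, "->", np.round(sigmoid(h1 @ W2), 2))      # (a local minimum is possible, see below)
```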

Backpropagation process
This iterative process will eventually lead us to a weight configuration that is associated with the global minimum of the very-many-dimensional cost-function surface (or at least we hope so). There is no guarantee that we will actually reach the global minimum: the process can get stuck in local minima as well. The training process can be very time-consuming; often training is stopped when a predefined number of iterations has been reached, when the error on the training set or on a training evaluation set drops below a certain threshold, or when the weights do not change significantly over a long time. Again, variations such as batched weight updating and the use of momentum terms may be used to speed up the process.
