Applied Machine Learning for Biomedical Engineering
Enrico Grisan
enrico.grisan@dei.unipd.it
Data representation
To find a representation that approximates elements of a signal class with a linear combination of base signals:
$y = Dx, \qquad x = \arg\min_x \|y - Dx\|_2$
Orthonormal basis
Fourier basis: $d_k = e^{j 2\pi k t}$, $k \in \mathbb{Z}$, $t \in [0, 1]$
$y(t) = DX = \sum_{k=-\infty}^{\infty} X_k \, e^{j 2\pi k t}$
$X_k = \int_{-1/2}^{1/2} y(t) \, e^{-j 2\pi k t} \, dt$
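As an aside, a minimal NumPy sketch of the discrete analogue (not from the slides): the unitary DFT matrix acts as an orthonormal dictionary, with analysis by projection and exact synthesis.

```python
import numpy as np

# Discrete analogue of the Fourier basis: the unitary DFT matrix.
m = 64
k = np.arange(m)
D = np.exp(2j * np.pi * np.outer(k, k) / m) / np.sqrt(m)  # m x m, orthonormal columns

y = np.random.randn(m)      # a test signal
X = D.conj().T @ y          # analysis: project onto the basis
y_rec = (D @ X).real        # synthesis: perfect reconstruction
assert np.allclose(y, y_rec)
```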
Orthonormal bases
Fourier, DCT, Hadamard, wavelets, ...
Lots of good properties:
- projections
- fast transforms
Drawbacks:
- not spatially compact (bases have global support)
- few nonzero coefficients only for periodic signals
Discrete cosine transform
Haar wavelets
Orthonormal wavelets
Spatially compact, multiresolution, fast transforms.
$d_{kn} = \psi_{kn}(t) = \frac{1}{\sqrt{2^k}} \, \psi\!\left(\frac{t - 2^k n}{2^k}\right)$
Designing filter banks
Wedgelets, curvelets, contourlets
Gabor filters
Learning bases
Given a dataset, can we learn a dictionary that best represents the signals?
Principal component analysis: the best linear approximation to the data.
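A hedged sketch of this idea: the top left singular vectors of the centered data give the best rank-$r$ linear approximation (Eckart-Young); all names below are illustrative.

```python
import numpy as np

# PCA as a learned basis: the top-r left singular vectors of the
# centered data matrix give the best rank-r linear approximation.
rng = np.random.default_rng(0)
Y = rng.standard_normal((64, 1000))        # m x N data matrix (columns are samples)
Yc = Y - Y.mean(axis=1, keepdims=True)     # center each feature

U, S, Vt = np.linalg.svd(Yc, full_matrices=False)
r = 10
D = U[:, :r]                               # learned basis (m x r), orthonormal columns
X = D.T @ Yc                               # coefficients: projection onto the basis
err = np.linalg.norm(Yc - D @ X, 'fro')**2 # residual energy = discarded singular values
assert np.isclose(err, np.sum(S[r:]**2))
```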
Learning sparse bases
To find a representation that:
1. approximates elements of a signal class
2. uses as few elements as possible
Sparse representation
Given:
- $Y \in \mathbb{R}^{m \times N}$ with $N$ samples
- sparsity level $s$
- dictionary $D \in \mathbb{R}^{m \times n}$; each column is called an atom (or word)
Sparse representation problem to solve:
$X = \arg\min_X \|Y - DX\|_F^2$ subject to $\|x_l\|_0 \le s$, $l = 1, \dots, N$
Notation
$x_l$ is the $l$-th column of the representation matrix $X$
$\|x\|_0$ is the number of nonzero elements in a vector
Representation error: $E = Y - DX$, with $\|E\|_F^2 = \sum_{i=1}^{m} \sum_{l=1}^{N} e_{il}^2$
Sparse representation of data
Greedy approach
Solve the problem separately for each data sample: the Frobenius norm decouples over columns, so each $x_l$ solves $\min_x \|y_l - Dx\|_2^2$ subject to $\|x\|_0 \le s$.
Orthogonal matching pursuit
Find the words one by one. Assume that at some point the support is $I$.
The residual is: $e = y - \sum_{j \in I} x_j d_j$
Choose the new word: $k = \arg\max_{j \notin I} |e^T d_j|$
Add the new word to the support: $I \leftarrow I \cup \{k\}$
The new optimal representation is: $x_I = (D_I^T D_I)^{-1} D_I^T y$
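A minimal NumPy sketch of the OMP loop just described (function and variable names are mine, not from the slides):

```python
import numpy as np

def omp(y, D, s):
    """Orthogonal matching pursuit: greedy s-sparse coding of y over D.
    Assumes the atoms (columns of D) have unit norm."""
    n = D.shape[1]
    I = []                                  # current support
    e = y.astype(float).copy()              # residual
    x_I = np.zeros(0)
    for _ in range(s):
        # After the least-squares step, e is orthogonal to the atoms in I,
        # so the argmax never re-selects an atom already in the support.
        k = int(np.argmax(np.abs(D.T @ e)))
        I.append(k)
        D_I = D[:, I]
        x_I, *_ = np.linalg.lstsq(D_I, y, rcond=None)  # (D_I^T D_I)^-1 D_I^T y
        e = y - D_I @ x_I                   # update the residual
    x = np.zeros(n)
    x[I] = x_I
    return x
```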
What kind of dictionaries?
- Preset: made from the rows of a classic transform
- Random
- Especially built, e.g. for incoherence
- Learned: from training signals, for each specific application
Learned dictionaries
Advantages:
- maximize performance for the application at hand
- learning can be done before the application
Drawbacks:
- no structure, hence no fast algorithms
- learning dictionaries takes time and might be hard
Dictionary learning
Given:
- $Y \in \mathbb{R}^{m \times N}$ with $N$ samples
- sparsity level $s$
- dictionary $D \in \mathbb{R}^{m \times n}$; each column is called an atom (or word)
Dictionary learning problem to solve:
$\{D, X\} = \arg\min_{D, X} \|Y - DX\|_F^2$
subject to $\|x_l\|_0 \le s$, $l = 1, \dots, N$, and $\|d_j\|_2 = 1$, $j = 1, \dots, n$
More notation
Indeterminations:
- multiplicative: removed by word normalization
- permutation of words: not significant
The positions of the nonzero elements of $X$ are $\Omega = \{(i, l) : x_{il} \neq 0\}$, so that $X_{\Omega^c} = 0$.
Problem analysis (in short)
- NP-hard due to the sparsity constraint
- If the sparsity pattern $\Omega$ is fixed, the problem is biquadratic, hence still nonconvex
- The problem is convex in $D$ if $X$ is fixed and normalization is ignored, and in $X$ if $D$ and $\Omega$ are fixed
Difficulties
- Many local minima, at least one for each $\Omega$
- Big size, many variables. Example: $m = 64$, $n = 128$, $N = 10000$, $s = 6$:
  - $D$ is a $64 \times 128$ full matrix: 8192 variables
  - $X$ has 60,000 nonzeros in 1,280,000 possible positions
Subproblem 1: sparse coding
With fixed dictionary, compute sparse representations:
$X = \arg\min_X \|Y - DX\|_F^2$ subject to $\|x_l\|_0 \le s$, $l = 1, \dots, N$
Subproblem 2: dictionary update
With fixed sparsity pattern $\Omega$:
$\{D, X\} = \arg\min_{D, X} \|Y - DX\|_F^2$ subject to $X_{\Omega^c} = 0$
Basic algorithm
Alternate between sparse coding and dictionary update.
Initial dictionary:
- random words
- random selection of data
Stopping criteria:
- number of iterations
- error convergence
Basic algorithm structure
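A possible skeleton of the alternating scheme, reusing the `omp` sketch above; `update_dictionary` is a placeholder for any of the update rules discussed next (my naming, not the course code):

```python
import numpy as np

def dictionary_learning(Y, n, s, n_iter=50, update_dictionary=None):
    """Skeleton of the alternating scheme: sparse coding, then dictionary
    update. `update_dictionary(Y, D, X)` stands in for any of the update
    rules discussed next (Sparsenet, MOD, K-SVD, ...)."""
    m, N = Y.shape
    rng = np.random.default_rng(0)
    # Initialization: a random selection of data columns, normalized.
    D = Y[:, rng.choice(N, size=n, replace=False)].astype(float)
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_iter):                  # stopping: fixed iteration count
        # Sparse coding, column by column (see the OMP sketch above).
        X = np.column_stack([omp(Y[:, l], D, s) for l in range(N)])
        D = update_dictionary(Y, D, X)       # dictionary update step
        D /= np.linalg.norm(D, axis=0)       # re-normalize the atoms
    return D, X
```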
Basic algorithms
For sparse coding: use OMP.
For dictionary update: the methods presented next (gradient descent, Sparsenet, MOD, K-SVD).
Gradient descent
$f(D) = \|Y - DX\|_F^2$
$\nabla_D f(D) = 2(DX - Y)X^T = -2EX^T$
Sparsenet update
Fixed-step gradient descent; update one word at a time:
$d_j = d_j + \alpha \, (Y - DX)\,(x_j^T)^T$
where $x_j^T$ is row $j$ of $X$ and $\alpha$ is the step size.
Poor trade-off between complexity and convergence speed.
Sparsenet algorithm
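A sketch of how the Sparsenet-style update might look in NumPy (one pass over the atoms; the step size value and the per-atom normalization are assumptions on my part):

```python
import numpy as np

def sparsenet_update(Y, D, X, alpha=0.01):
    """One pass of the Sparsenet-style update: fixed-step gradient
    step on each atom, one at a time (a sketch, not the original code)."""
    for j in range(D.shape[1]):
        E = Y - D @ X                       # current residual
        D[:, j] += alpha * E @ X[j, :]      # d_j <- d_j + alpha * (Y - DX) x_j
        D[:, j] /= np.linalg.norm(D[:, j])  # keep the atom at unit norm
    return D
```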
MOD: method of optimal directions
The dictionary update is convex with respect to $D$ when there is no word/atom normalization.
Setting $\nabla_D f(D) = 0$:
$D = YX^T (XX^T)^{-1}$
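The closed-form MOD update translates directly; a short sketch (the small ridge term is my addition for numerical safety, and `D` is unused but kept for a uniform update interface):

```python
import numpy as np

def mod_update(Y, D, X, reg=1e-10):
    """MOD dictionary update: D = Y X^T (X X^T)^{-1}."""
    n = X.shape[0]
    G = X @ X.T + reg * np.eye(n)         # the n x n matrix noted in the slides
    return np.linalg.solve(G, X @ Y.T).T  # equals Y X^T (X X^T)^{-1}
```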
Normalization or no normalization?
MOD analysis
Advantages:
- good performance due to the optimal dictionary update
But: the update is optimal in terms of the dictionary, not of the representations (with fixed sparsity pattern).
Drawbacks:
- the matrix $XX^T$ is $n \times n$
- computing the whole dictionary is costlier than updating all atoms one at a time
Optimizing a single word
Goal: optimize atom $d_j$ with everything else fixed.
Indices of the signals that use $d_j$ in their representation: $I_j = \{l : (j, l) \in \Omega\}$
If word $d_j$ is ignored, the representation error is:
$F = \left[\, Y - \sum_{i \neq j} d_i x_i^T \,\right]_{I_j}$ (restricted to the columns in $I_j$)
Optimal word 1
Optimization without normalization; standard least squares:
$d = \arg\min_d \|F - dx^T\|_F^2 \;\Rightarrow\; d = \frac{Fx}{\|x\|^2}$ (with $x^T = X_{j, I_j}$)
Remembering that $E = Y - DX$, we obtain $F = E_{I_j} + d_j X_{j, I_j}$.
Sequential generalization of K-means
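A sketch of the resulting sequential update (SGK-style): each atom is refit by least squares with the others fixed, and the residual is refreshed in place. Names are mine.

```python
import numpy as np

def sgk_update(Y, D, X):
    """Sequential atom update without normalization (SGK-style sketch):
    d_j = F x / ||x||^2, with F the error matrix when atom j is removed."""
    E = Y - D @ X
    for j in range(D.shape[1]):
        I = np.flatnonzero(X[j, :])         # signals that use atom j
        if I.size == 0:
            continue                        # unused atom: handled separately
        x = X[j, I]
        F = E[:, I] + np.outer(D[:, j], x)  # error with atom j taken out
        d_new = F @ x / (x @ x)             # least-squares optimal atom
        E[:, I] = F - np.outer(d_new, x)    # refresh residual with the new atom
        D[:, j] = d_new
    return D
```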
Optimal word 2
Optimization with normalization:
$d = \frac{Fx}{\|Fx\|}$
After the word update, the representation can be optimized: $x = F^T d$
Alternate optimization of words and representations.
Approximate K-SVD
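A sketch of one approximate K-SVD pass, alternating $d = Fx/\|Fx\|$ and $x = F^T d$ as above (my rendering, not the reference implementation):

```python
import numpy as np

def aksvd_update(Y, D, X):
    """Approximate K-SVD sketch: for each atom, one alternation of
    d = F x / ||F x|| and x = F^T d, updating D and X in place."""
    E = Y - D @ X
    for j in range(D.shape[1]):
        I = np.flatnonzero(X[j, :])         # signals that use atom j
        if I.size == 0:
            continue
        x = X[j, I]
        F = E[:, I] + np.outer(D[:, j], x)  # error with atom j taken out
        d = F @ x
        d /= np.linalg.norm(d)              # unit-norm atom
        x = F.T @ d                         # optimal coefficients for the new atom
        D[:, j] = d
        X[j, I] = x
        E[:, I] = F - np.outer(d, x)        # refresh the residual
    return D, X
```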
Optimal atom 3: K-SVD
$d = \arg\min_{\|d\|=1,\, x} \|F - dx^T\|_F^2 = \arg\min_{\|d\|=1} \left( \|F\|_F^2 - d^T F F^T d \right)$
(substituting the optimal $x = F^T d$). The minimum is obtained when $d$ is the first eigenvector of $FF^T$, i.e. the first left singular vector of $F$.
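The exact K-SVD atom update via the leading singular pair, as a short sketch:

```python
import numpy as np

def ksvd_atom(F):
    """Exact K-SVD atom update: best unit-norm rank-one fit of F."""
    U, S, Vt = np.linalg.svd(F, full_matrices=False)
    d = U[:, 0]          # first eigenvector of F F^T
    x = S[0] * Vt[0, :]  # corresponding optimal representation row
    return d, x
```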
Dictionary size
To optimize $n$, we can fix the target error in the dictionary learning procedure:
$\min_{D, X} n$ subject to $\|Y - DX\|_F^2 \le \varepsilon$, $\|x_l\|_0 \le s$, $l = 1, \dots, N$
Dictionary reduction methods
General idea: train the dictionary with a DL algorithm, then replace clusters of nearby atoms with a single one.
How to form clusters? How big?
- mean shift
- competitive agglomeration
- subtractive clustering
- K-means, K-subspaces
Unused words
During learning, atom $d_j$ may end up not used in any representation: $I_j = \emptyset$.
Similarly, the atom may hardly contribute to the representations, which means $\|X_{j, I_j}\|$ is small.
Solutions:
- replace the atom with a random vector
- eliminate the atom, thus decreasing $n$
Similar words
During learning, two atoms may become very similar: the absolute inner product $|d_i^T d_j|$ is almost 1.
Both are used, although one could replace them both.
Solution: replace one of the atoms with a random vector (see the sketch below).
More generally, a low number of atoms may become linearly dependent: use regularization.
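A sketch combining the two replacement heuristics above, for unused and near-duplicate atoms (the similarity threshold value is my choice):

```python
import numpy as np

def refresh_atoms(D, X, sim_thresh=0.99):
    """Replace unused or near-duplicate atoms with random unit vectors."""
    rng = np.random.default_rng()
    m, n = D.shape
    usage = np.count_nonzero(X, axis=1)  # |I_j| for each atom
    G = np.abs(D.T @ D)                  # absolute inner products between atoms
    np.fill_diagonal(G, 0.0)
    for j in range(n):
        if usage[j] == 0 or G[j].max() > sim_thresh:
            d = rng.standard_normal(m)
            D[:, j] = d / np.linalg.norm(d)  # fresh unit-norm atom
            G[j, :] = G[:, j] = 0.0          # only one of a similar pair is replaced
    return D
```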
Applications
Inpainting
Immunohistochemical images