Applied Machine Learning for Biomedical Engineering
Enrico Grisan, enrico.grisan@dei.unipd.it

Data representation
To find a representation that approximates the elements of a signal class with a linear combination of base signals: $y = Dx$, with $x = \arg\min_x \|y - Dx\|_2^2$.
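
As a minimal illustration of the least-squares fit above (my own sketch, not from the slides), the coefficients for a generic dictionary can be computed with NumPy; the dictionary `D` and signal `y` below are random placeholders.

```python
import numpy as np

# Toy example: represent a signal y as a linear combination of the columns of D.
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 16))   # 16 base signals of length 64 (placeholder dictionary)
y = rng.standard_normal(64)         # signal to represent

# Least-squares coefficients: x = argmin_x ||y - D x||_2^2
x, *_ = np.linalg.lstsq(D, y, rcond=None)
print("representation error:", np.linalg.norm(y - D @ x))
```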

Orthonormal basis
Fourier basis: $D_k = e^{j 2\pi k t}$, $k \in \mathbb{Z}$, $t \in [0, 1]$.
$y(t) = DX = \sum_{k=-\infty}^{+\infty} X_k\, e^{j 2\pi k t}$, with $X_k = \int_{-1/2}^{1/2} y(t)\, e^{-j 2\pi k t}\, dt$.

Orthonormal bases
Fourier, DCT, Hadamard, wavelets, ...
Good properties: projections, fast transforms.
Drawbacks: not spatially compact, bases have global support, few non-zero coefficients only for periodic signals.

Discrete cosine transform

Haar wavelets

Orthonormal wavelets
Spatially compact, multiresolution, fast transforms.
$D_{kn} = \psi_{kn}(t) = \frac{1}{\sqrt{2^k}}\, \psi\!\left(\frac{t - 2^k n}{2^k}\right)$
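
As an aside (my own sketch, not from the slides), one level of the orthonormal Haar transform can be written in a few lines of NumPy; the pairwise average/difference structure is what makes the transform fast and spatially compact.

```python
import numpy as np

def haar_level(signal):
    """One level of the orthonormal Haar transform: averages and details."""
    s = np.asarray(signal, dtype=float)
    assert s.size % 2 == 0, "signal length must be even"
    approx = (s[0::2] + s[1::2]) / np.sqrt(2)   # low-pass (scaling) coefficients
    detail = (s[0::2] - s[1::2]) / np.sqrt(2)   # high-pass (wavelet) coefficients
    return approx, detail

y = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
a, d = haar_level(y)
print(a, d)   # energy is preserved: ||y||^2 == ||a||^2 + ||d||^2
```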

Designing filter banks
Wedgelets, curvelets, contourlets; Gabor filters.

Learning bases
Given a dataset, can we learn a dictionary that best represents the signals? Principal component analysis: the best linear approximation to the data.
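
A minimal sketch (my illustration, assuming nothing beyond the slide) of learning a PCA basis from data via the SVD; `Y` is a random placeholder data matrix with one signal per column.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((64, 500))   # 500 signals of length 64 (placeholder data)

# PCA basis: leading left singular vectors of the centered data matrix.
Yc = Y - Y.mean(axis=1, keepdims=True)
U, S, Vt = np.linalg.svd(Yc, full_matrices=False)
D_pca = U[:, :16]                    # keep the 16 leading components as a dictionary

# Best linear approximation of the data in this 16-dimensional subspace.
X = D_pca.T @ Yc
print("relative error:", np.linalg.norm(Yc - D_pca @ X) / np.linalg.norm(Yc))
```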

Learning sparse bases
To find a representation that: 1. approximates the elements of a signal class, 2. with as few elements as possible.

Sparse representation
Given: $Y \in \mathbb{R}^{m \times N}$ with $N$ samples; sparsity level $s$; dictionary $D \in \mathbb{R}^{m \times n}$, each column named an atom or word.
Sparse representation problem to solve:
$X = \arg\min_X \|Y - DX\|_F^2$ subject to $\|x_l\|_0 \le s$, $l = 1, \ldots, N$.

Notation
$x_l$ is the $l$-th column of the representation matrix $X$; $\|\cdot\|_0$ counts the non-zero elements of a vector.
The representation error is $E = Y - DX$, with $\|E\|_F^2 = \sum_{i=1}^{m} \sum_{l=1}^{N} e_{il}^2$.

Sparse representation of data

Greedy approach Solve the problem separately for each data sample

Orthogonal matching pursuit
Find the words one by one. Assume that at some point the support is $I$.
The residual is: $e = y - \sum_{j \in I} x_j d_j$.
Choose the new word $d_k$ with $k = \arg\max_{j \notin I} |e^T d_j|$.
Add the new word to the support: $I \leftarrow I \cup \{k\}$.
The new optimal representation is: $x_I = (D_I^T D_I)^{-1} D_I^T y$.
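
A from-scratch sketch of the OMP loop above (my own illustration; function and variable names are placeholders). It assumes the dictionary columns have unit norm, as required later for dictionary learning.

```python
import numpy as np

def omp(D, y, s):
    """Orthogonal matching pursuit: greedily build a support of size s."""
    m, n = D.shape
    I = []                                    # current support
    x = np.zeros(n)
    residual = y.copy()
    for _ in range(s):
        # Pick the atom most correlated with the current residual.
        correlations = np.abs(D.T @ residual)
        correlations[I] = -np.inf             # exclude atoms already selected
        k = int(np.argmax(correlations))
        I.append(k)
        # Re-fit all coefficients on the support (orthogonal projection).
        x_I, *_ = np.linalg.lstsq(D[:, I], y, rcond=None)
        residual = y - D[:, I] @ x_I
    x[I] = x_I
    return x

# Toy usage: the true support is typically recovered here.
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128))
D /= np.linalg.norm(D, axis=0)                # unit-norm atoms
y = D[:, [3, 40, 77]] @ np.array([1.0, -2.0, 0.5])
print(np.nonzero(omp(D, y, s=3))[0])
```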

What kind of dictionaries?
Preset: made from the rows of a classic transform.
Random.
Especially built, e.g. for incoherence.
Learned: learned from training signals for each specific application.

Learned dictionaries
Advantages: maximize performance for the application at hand; learning can be done before the application.
Drawbacks: no structure, hence no fast algorithms; learning dictionaries takes time and might be hard.

Dictionary learning
Given: $Y \in \mathbb{R}^{m \times N}$ with $N$ samples; sparsity level $s$; dictionary $D \in \mathbb{R}^{m \times n}$, each column named an atom or word.
Dictionary learning problem to solve:
$\{D, X\} = \arg\min_{D, X} \|Y - DX\|_F^2$ subject to $\|x_l\|_0 \le s$, $l = 1, \ldots, N$, and $\|d_j\|_2 = 1$, $j = 1, \ldots, n$.

More notation
Indeterminations: multiplicative (removed by word normalization); permutation of words (not significant).
The positions of the nonzero elements of $X$ are $\Omega = \{(i, l) : x_{il} \ne 0\}$, so that $X_{\Omega^c} = 0$.

Problem analysis (in short)
NP-hard due to the sparsity constraint.
If the sparsity pattern $\Omega$ is fixed, the problem is biquadratic, hence still nonconvex.
The problem is convex: in $D$, if $X$ is fixed and normalization is ignored; in $X$, if $D$ and $\Omega$ are fixed.

Difficulties
Many local minima, at least one for each $\Omega$.
Big size, many variables. Example: $m = 64$, $n = 128$, $N = 10000$, $s = 6$: $D$ is a full $64 \times 128$ matrix (8192 variables); $X$ has 60,000 nonzeros out of $128 \times 10000 = 1{,}280{,}000$ possible positions.

Subproblem 1: sparse coding
With a fixed dictionary, compute the sparse representations:
$X = \arg\min_X \|Y - DX\|_F^2$ subject to $\|x_l\|_0 \le s$, $l = 1, \ldots, N$.
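
In practice this subproblem is solved column by column with a pursuit algorithm. As a hedged alternative to the hand-written OMP above (not the course's code), scikit-learn's `orthogonal_mp` codes all columns at once:

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128))
D /= np.linalg.norm(D, axis=0)                # unit-norm atoms
Y = rng.standard_normal((64, 1000))           # placeholder data, one signal per column

# Sparse coding of every column of Y with at most s = 6 nonzeros each.
X = orthogonal_mp(D, Y, n_nonzero_coefs=6)
print(X.shape, (X != 0).sum(axis=0).max())    # (128, 1000), at most 6 nonzeros per column
```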

Subproblem 2: dictionary update
With a fixed sparsity pattern $\Omega$:
$\{D, X\} = \arg\min_{D, X} \|Y - DX\|_F^2$ subject to $X_{\Omega^c} = 0$.

Basic algorithm
Alternate between sparse coding and dictionary update.
Initial dictionary: random words, or a random selection of data samples.
Stopping criteria: number of iterations, error convergence.

Basic algorithm structure

Basic algorithms
For sparse coding: use OMP.
For dictionary update: the methods described next (gradient descent, Sparsenet, MOD, K-SVD).

Gradient descent
$f(D) = \|Y - DX\|_F^2$
$\nabla_D f(D) = 2(DX - Y)X^T = -2 E X^T$

Sparsenet update
Fixed-step gradient descent; update one word at a time.
Update: $d_j \leftarrow d_j + \alpha\, (Y - DX)\,(x_j^T)^T$, where $x_j^T$ is row $j$ of $X$ and $\alpha$ is the step size.
Poor trade-off between complexity and convergence speed.
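
A minimal sketch of one such fixed-step atom update (my illustration; `alpha`, `D`, `X`, `Y` are placeholders), following the formula above:

```python
import numpy as np

def sparsenet_atom_update(D, X, Y, j, alpha):
    """One fixed-step gradient update of atom j: d_j += alpha * (Y - D X) (x_j^T)^T."""
    E = Y - D @ X                      # current residual
    D = D.copy()
    D[:, j] += alpha * (E @ X[j, :])   # X[j, :] is row j of X
    return D

# Toy usage with placeholder sizes.
rng = np.random.default_rng(0)
m, n, N = 64, 128, 200
D = rng.standard_normal((m, n)); D /= np.linalg.norm(D, axis=0)
X = rng.standard_normal((n, N)) * (rng.random((n, N)) < 0.05)   # sparse coefficients
Y = D @ X + 0.01 * rng.standard_normal((m, N))
D_new = sparsenet_atom_update(D, X, Y, j=0, alpha=1e-3)
```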

Sparsenet algorithm

MOD: method of optimal directions
The dictionary update is convex with respect to $D$ when there is no word/atom normalization.
Setting $\nabla_D f(D) = 0$ gives $D = Y X^T (X X^T)^{-1}$.
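
Putting the pieces together, here is a hedged sketch of the basic alternating algorithm using OMP for sparse coding and the MOD update above. It is my own illustration, not the course's reference implementation; it renormalizes the atoms after each MOD step and rescales the rows of X accordingly, so the product DX is unchanged.

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

def dictionary_learning_mod(Y, n_atoms, s, n_iter=20, seed=0):
    """Alternate OMP sparse coding and the MOD dictionary update."""
    rng = np.random.default_rng(seed)
    m, N = Y.shape
    # Initialize the dictionary with a random selection of data samples.
    D = Y[:, rng.choice(N, size=n_atoms, replace=False)].astype(float)
    D /= np.linalg.norm(D, axis=0) + 1e-12
    for it in range(n_iter):
        # Sparse coding with the dictionary fixed.
        X = orthogonal_mp(D, Y, n_nonzero_coefs=s)
        # MOD update: D = Y X^T (X X^T)^{-1}, with a small ridge for numerical safety.
        G = X @ X.T + 1e-8 * np.eye(n_atoms)
        D = Y @ X.T @ np.linalg.inv(G)
        # Renormalize atoms and rescale the corresponding rows of X.
        norms = np.linalg.norm(D, axis=0) + 1e-12
        D /= norms
        X *= norms[:, None]
        err = np.linalg.norm(Y - D @ X) / np.linalg.norm(Y)
        print(f"iteration {it:2d}, relative error {err:.4f}")
    return D, X

# Toy usage on synthetic data.
rng = np.random.default_rng(1)
Y = rng.standard_normal((64, 2000))
D, X = dictionary_learning_mod(Y, n_atoms=128, s=6, n_iter=5)
```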

Normalization or no normalization?

MOD analysis
Advantages: good performance due to the optimal dictionary update.
But: the update is optimal in terms of the dictionary, not of the representations (with a fixed sparsity pattern).
Drawbacks: the matrix $X X^T$ is $n \times n$; computing the whole dictionary is costlier than updating all atoms one at a time.

Optimizing a single word
Goal: optimize atom $d_j$ with everything else fixed.
Indices of the signals that use $d_j$ in their representation: $I_j = \{l : (j, l) \in \Omega\}$.
If word $d_j$ is ignored, the representation error (restricted to the columns in $I_j$) is $F = \left(Y - \sum_{i \ne j} d_i x_i^T\right)_{I_j}$.

Optimal word 1
Optimization without normalization. Standard least squares:
$d = \arg\min_d \|F - d x^T\|_F^2$, giving $d = \frac{F x}{\|x\|_2^2}$.
Remembering that $E = Y - DX$, we can obtain $F = E_{I_j} + d_j X_{j, I_j}$.

Sequential generalization of K-means

Optimal word 2
Optimization with normalization: $d = \frac{F x}{\|F x\|}$.
After the word update, the representation can be optimized: $x = F^T d$.
Alternate optimization of words and representations.
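
A hedged sketch of this alternating single-atom update, i.e. the approximate K-SVD step named on the next slide (function and variable names are mine; `D`, `X`, `Y` are placeholders).

```python
import numpy as np

def atom_update_approx_ksvd(D, X, Y, j):
    """Update atom j and its coefficients with one normalized least-squares pass."""
    I_j = np.nonzero(X[j, :])[0]                  # signals that use atom j
    if I_j.size == 0:
        return D, X                               # nothing to update
    # Error matrix with atom j removed, restricted to the signals in I_j.
    E = Y - D @ X
    F = E[:, I_j] + np.outer(D[:, j], X[j, I_j])
    x = X[j, I_j]
    # d = F x / ||F x||, then x = F^T d.
    d = F @ x
    d /= np.linalg.norm(d) + 1e-12
    D = D.copy(); X = X.copy()
    D[:, j] = d
    X[j, I_j] = F.T @ d
    return D, X
```

Looping this update over all atoms, alternated with sparse coding, gives the approximate K-SVD algorithm.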

Approximate K-SVD

Optimal atom 3: K-SVD
$d = \arg\min_{\|d\|=1,\, x} \|F - d x^T\|_F^2 = \arg\min_{\|d\|=1} \left( \|F\|_F^2 - d^T F F^T d \right)$
The minimum is obtained when $d$ is the first eigenvector of $F F^T$.
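
A hedged sketch of the exact K-SVD atom update via the SVD of $F$ (my illustration): the first left singular vector of $F$ gives $d$, and the matching coefficients are the first singular value times the first right singular vector.

```python
import numpy as np

def atom_update_ksvd(D, X, Y, j):
    """Exact K-SVD update of atom j: best rank-1 approximation of the restricted error F."""
    I_j = np.nonzero(X[j, :])[0]
    if I_j.size == 0:
        return D, X
    E = Y - D @ X
    F = E[:, I_j] + np.outer(D[:, j], X[j, I_j])
    U, S, Vt = np.linalg.svd(F, full_matrices=False)   # F ~ S[0] * U[:, 0] @ Vt[0, :]
    D = D.copy(); X = X.copy()
    D[:, j] = U[:, 0]                 # first left singular vector (unit norm)
    X[j, I_j] = S[0] * Vt[0, :]       # matching coefficients
    return D, X
```

Compared with the approximate update above, this computes an SVD of $F$ for every atom, which is costlier but gives the exact constrained minimizer.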

Dictionary size
To optimize $n$, we can fix the target error in the dictionary learning procedure:
$\min_{D, X} n$ subject to $\|Y - DX\|_F^2 \le \varepsilon$ and $\|x_l\|_0 \le s$, $l = 1, \ldots, N$.

Dictionary reduction methods
General idea: train the dictionary with a DL algorithm, then replace clusters of nearby atoms with a single one.
How to form the clusters? How big? Mean shift, competitive agglomeration, subtractive clustering, K-means, K-subspaces.

Unused words
During learning, atom $d_j$ may not be used in any representation; this means $I_j = \emptyset$.
Similarly, the atom may hardly contribute to the representations, which means that $\|X_{j, I_j}\|$ is small.
Solutions: replace the atom with a random vector, or eliminate the atom and so decrease $n$.

Similar words
During learning, two atoms may become very similar: the absolute inner product $|d_{j_1}^T d_{j_2}|$ is almost 1.
Both are used, although one could replace them both. Solution: replace one of the two atoms with a random vector.
More generally, a small number of atoms may become linearly dependent: use regularization.
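
A small sketch of this dictionary maintenance step (my illustration; the thresholds `usage_tol` and `coherence_tol` are arbitrary placeholders): unused or near-duplicate atoms are replaced with normalized random vectors, as the slides suggest.

```python
import numpy as np

def refresh_atoms(D, X, usage_tol=1e-6, coherence_tol=0.99, seed=0):
    """Replace unused or near-duplicate atoms with normalized random vectors."""
    rng = np.random.default_rng(seed)
    D = D.copy()
    m, n = D.shape
    usage = np.abs(X).sum(axis=1)            # how much each atom contributes overall
    G = np.abs(D.T @ D) - np.eye(n)          # coherences between distinct (unit-norm) atoms
    for j in range(n):
        if usage[j] < usage_tol or G[j].max() > coherence_tol:
            new_atom = rng.standard_normal(m)
            D[:, j] = new_atom / np.linalg.norm(new_atom)
    return D
```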

Applications

Inpainting

Immunohistochemical images