Compressed Sensing and Neural Networks

Compressed Sensing and Neural Networks
Jan Vybíral (Charles University & Czech Technical University Prague, Czech Republic)
NOMAD Summer, Berlin, September 25-29, 2017 1 / 31

Outline
Part I: Lasso & Compressed Sensing
Part II: Neural Networks (Introduction, Notation, Training the network, Applications) 2 / 31

Part I: Lasso & Compressed Sensing 3 / 31

Least squares
Fitting a cloud of points by a linear hyperplane. Considered already by Gauss and Legendre in the 18th century.
In 2D: (figure) 4 / 31

Least squares
Objects (= points) described by Ω real numbers:
$d_1 = (d_{1,1}, \dots, d_{1,\Omega}) \in \mathbb{R}^\Omega, \quad \dots, \quad d_N = (d_{N,1}, \dots, d_{N,\Omega}) \in \mathbb{R}^\Omega$
N - number of objects; D - the N × Ω matrix with rows $d_1, \dots, d_N$
$P = (P_1, \dots, P_N)$ are the properties of interest
We look for a linear dependence P = f(d) with a linear f, i.e.
$P_i = \sum_{j=1}^{\Omega} c_j d_{i,j}$, or $P = Dc$ 5 / 31

Least squares
The solution is found by minimizing the least-squares error:
$\hat{c} = \arg\min_{c \in \mathbb{R}^\Omega} \sum_{i=1}^{N} \Big( P_i - \sum_{j=1}^{\Omega} c_j d_{i,j} \Big)^2 = \arg\min_{c \in \mathbb{R}^\Omega} \|P - Dc\|_2^2$
A closed formula exists
Convex objective function
$\hat{c}$ has all coordinates occupied (non-zero in general)
The absolute term is incorporated by an additional column full of ones 6 / 31
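As a small, self-contained illustration of the closed formula (the normal equations $D^\top D\,\hat{c} = D^\top P$), here is a Python sketch; the data matrix D, the property vector P and the coefficients are made-up example values, and np.linalg.lstsq is used as a numerically stable way to solve the least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(0)

N, Omega = 50, 3                        # number of objects and of features
D = rng.standard_normal((N, Omega))     # data matrix with rows d_1, ..., d_N
c_true = np.array([2.0, -1.0, 0.5])
P = D @ c_true + 0.01 * rng.standard_normal(N)   # noisy properties of interest

# Absolute term: append a column full of ones to D.
D1 = np.hstack([D, np.ones((N, 1))])

# Least-squares fit; lstsq solves arg min ||P - D1 c||_2^2.
c_hat, *_ = np.linalg.lstsq(D1, P, rcond=None)
print("estimated coefficients and intercept:", c_hat)
```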

Regularization
How to include prior knowledge on c? Say we prefer a linear fit with small coefficients. We simply weight the error of the fit against the size of the coefficients:
$\hat{c} = \arg\min_{c \in \mathbb{R}^\Omega} \|P - Dc\|_2^2 + \lambda \|c\|_2^2$
λ > 0 - regularization parameter
λ → 0: least squares
λ → ∞: c = 0 7 / 31
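A minimal sketch of the regularized fit above, assuming synthetic data; for λ > 0 the minimizer has the closed form $\hat{c} = (D^\top D + \lambda I)^{-1} D^\top P$, and the loop at the end illustrates the two limits λ → 0 and λ → ∞.

```python
import numpy as np

rng = np.random.default_rng(1)
N, Omega = 50, 3
D = rng.standard_normal((N, Omega))
P = D @ np.array([2.0, -1.0, 0.5]) + 0.01 * rng.standard_normal(N)

def ridge(D, P, lam):
    """Minimizer of ||P - Dc||_2^2 + lam * ||c||_2^2 (closed form)."""
    Omega = D.shape[1]
    return np.linalg.solve(D.T @ D + lam * np.eye(Omega), D.T @ P)

for lam in (0.0, 1.0, 1e4):
    print(lam, ridge(D, P, lam))   # lam -> 0: least squares, lam -> infinity: c -> 0
```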

Tractability
Convexity: the minimizer is unique; a local minimum of a convex function is also a global one; many effective methods exist (convex optimization)
P vs. NP: P-problems are solvable in polynomial time (in the size of the input); NP-problems have solutions verifiable in polynomial time; P ⊆ NP
One-million-dollar problem: P = NP? (computational complexity) 8 / 31

Sparsity
If Ω is large (especially Ω ≫ N), we are often interested in selecting features, i.e. in c with many coordinates equal to zero.
$\|c\|_0 := \#\{i : c_i \neq 0\}$ - the number of non-zero coordinates of c
Looking for a linear fit using only two features:
$\hat{c} = \arg\min_{c \in \mathbb{R}^\Omega,\ \|c\|_0 \le 2} \|P - Dc\|_2^2$
Regularized version:
$\hat{c} = \arg\min_{c \in \mathbb{R}^\Omega} \|P - Dc\|_2^2 + \lambda \|c\|_0$
NP-hard! 9 / 31
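To make the combinatorial nature of the $\|c\|_0$-constrained problem concrete, here is a brute-force sketch that tries every support of size 2 and solves a small least-squares problem on each; the data are synthetic, and the loop over all $\binom{\Omega}{2}$ supports is exactly what becomes infeasible when Ω is large.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
N, Omega = 30, 10
D = rng.standard_normal((N, Omega))
c_true = np.zeros(Omega); c_true[[1, 7]] = [3.0, -2.0]   # only two active features
P = D @ c_true

best = (np.inf, None, None)
for S in combinations(range(Omega), 2):                  # all supports of size 2
    c_S, *_ = np.linalg.lstsq(D[:, S], P, rcond=None)
    err = np.sum((P - D[:, S] @ c_S) ** 2)
    if err < best[0]:
        best = (err, S, c_S)
print("best support:", best[1], "coefficients:", best[2])
```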

ℓ1-minimization
Other ways to measure the size of c: the ℓp-norms
$\|c\|_p = \Big( \sum_{j=1}^{\Omega} |c_j|^p \Big)^{1/p}, \qquad \|c\|_\infty = \max_{j=1,\dots,\Omega} |c_j|$
Unit balls of ℓp in $\mathbb{R}^2$: (figure)
p ≥ 1 - convex problem
p ≤ 1 - promotes sparsity 10 / 31

ℓ1-minimization
p ≤ 1 promotes sparsity. Solution of
$S_p = \arg\min_{z \in \mathbb{R}^2} \|z\|_p \quad \text{s.t.} \quad Az = y$
shown for p = 1 and p = 2 (figure) 11 / 31

ℓ1-minimization
Take p = 1 (Lasso; Tibshirani, 1996):
$\hat{c} = \arg\min_{c \in \mathbb{R}^\Omega} \|P - Dc\|_2^2 + \lambda \|c\|_1$
Chen, Donoho, Saunders: Basis pursuit (1998)
λ → 0: least squares
λ → ∞: $\hat{c} = 0$
In between: λ selects the sparsity 12 / 31
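A minimal numerical sketch of the Lasso objective above, solved by iterative soft-thresholding (ISTA); this is just one standard solver, not the one used in the talk, and the data, the value of λ and the number of iterations are ad hoc choices.

```python
import numpy as np

def lasso_ista(D, P, lam, n_iter=500):
    """Approximate arg min ||P - Dc||_2^2 + lam * ||c||_1 by iterative soft-thresholding."""
    step = 1.0 / (2 * np.linalg.norm(D, 2) ** 2)     # 1 / Lipschitz constant of the gradient
    c = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2 * D.T @ (D @ c - P)                 # gradient of the quadratic part
        z = c - step * grad
        c = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft-thresholding
    return c

rng = np.random.default_rng(3)
D = rng.standard_normal((40, 100))
c_true = np.zeros(100); c_true[[5, 30, 70]] = [1.5, -2.0, 1.0]
P = D @ c_true
print(np.nonzero(lasso_ista(D, P, lam=1.0))[0])      # indices of the non-zero Lasso coefficients
```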

ℓ1-minimization
Effect of λ > 0 on the support of the minimizer $\hat{c}$ (figure) 13 / 31

Compressed Sensing (aka Compressive Sensing, Compressive Sampling)
Theorem: Let $D \in \mathbb{R}^{N \times \Omega}$ have independent Gaussian entries. Let $0 < \varepsilon < 1$, let s be a natural number and
$N \ge C\, s\, \big( \log(\Omega) + \log(1/\varepsilon) \big),$
where C is a universal constant. If $c \in \mathbb{R}^\Omega$ is s-sparse, $P = Dc$ and $\hat{c}$ is the minimizer of
$\hat{c} = \arg\min_{u \in \mathbb{R}^\Omega} \|u\|_1 \quad \text{s.t.} \quad P = Du,$
then $c = \hat{c}$ with probability at least $1 - \varepsilon$. 14 / 31
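A hedged sketch of the statement: draw a Gaussian D, an s-sparse c, and solve the equality-constrained ℓ1-minimization as a linear program by splitting u into its positive and negative parts; scipy's linprog is used here only as a convenient generic LP solver, and the sizes N, Ω, s are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(D, P):
    """Solve min ||u||_1 subject to D u = P via the LP reformulation u = u_plus - u_minus."""
    N, Omega = D.shape
    cost = np.ones(2 * Omega)
    A_eq = np.hstack([D, -D])
    res = linprog(cost, A_eq=A_eq, b_eq=P, bounds=(0, None), method="highs")
    return res.x[:Omega] - res.x[Omega:]

rng = np.random.default_rng(4)
N, Omega, s = 60, 200, 5
D = rng.standard_normal((N, Omega))
c = np.zeros(Omega); c[rng.choice(Omega, s, replace=False)] = rng.standard_normal(s)
P = D @ c                                # N measurements of the s-sparse vector c

c_hat = basis_pursuit(D, P)
print("exact recovery:", np.allclose(c, c_hat, atol=1e-6))
```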

Compressed Sensing (aka Compressive Sensing, Compressive Sampling)
Candès, Romberg, Tao (2006); Donoho (2006)
Extensive theory of recovery of sparse vectors from linear measurements
Optimal conditions on the number of measurements (i.e. data points): $N \ge C s \log \Omega$
Only true if most of the features (i.e. the columns of D) are incoherent with the majority of the others (if two features are very similar, it is difficult to distinguish between them)
H. Boche, R. Calderbank, G. Kutyniok, J. Vybíral, A Survey of Compressed Sensing, first chapter in Compressed Sensing and its Applications, Birkhäuser, Springer, 2015 15 / 31

Dictionaries
Real-life signals are (almost) never sparse in the canonical basis of $\mathbb{R}^\Omega$; more often they are sparse in some orthonormal basis, i.e. $x = Bc$, where $c \in \mathbb{R}^\Omega$ is sparse and the columns (and rows) of $B \in \mathbb{R}^{\Omega \times \Omega}$ are orthonormal vectors - wavelets, Fourier basis, etc.
Compressed sensing then applies without any essential change! ...just replace D with DB... i.e. you rotate the problem... 16 / 31
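A small sketch of the "replace D with DB" remark, assuming a random orthonormal basis B generated by a QR decomposition: a signal sparse in B is recovered from Gaussian measurements by running ℓ1-minimization on DB (the LP-based basis pursuit helper is repeated here so the example stays self-contained).

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, b):
    """min ||v||_1 s.t. A v = b, as a linear program (same reformulation as in the previous sketch)."""
    m, n = A.shape
    res = linprog(np.ones(2 * n), A_eq=np.hstack([A, -A]), b_eq=b,
                  bounds=(0, None), method="highs")
    return res.x[:n] - res.x[n:]

rng = np.random.default_rng(5)
N, Omega, s = 60, 200, 5
B, _ = np.linalg.qr(rng.standard_normal((Omega, Omega)))   # random orthonormal basis
c = np.zeros(Omega); c[rng.choice(Omega, s, replace=False)] = 1.0
x = B @ c                                                  # signal, sparse in the basis B
D = rng.standard_normal((N, Omega))
P = D @ x                                                  # measurements of x

c_hat = basis_pursuit(D @ B, P)                            # rotate the problem: replace D with DB
x_hat = B @ c_hat
print("signal recovered:", np.allclose(x, x_hat, atol=1e-6))
```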

Dictionaries
Even more often, the signal is represented in an overcomplete dictionary/lexicon: $x = Lc$, where $c \in \mathbb{R}^l$ is sparse and the columns of $L \in \mathbb{R}^{\Omega \times l}$ form the dictionary/lexicon - an overcomplete system ($l > \Omega$)
x is a sparse combination of non-orthogonal vectors - the columns of L
Examples: unions of two or more orthonormal bases, each capturing different features 17 / 31

Dictionaries
Compressed sensing can be adapted also to this situation. Optimization:
$\hat{x} = \arg\min_{u \in \mathbb{R}^\Omega} \|L^* u\|_1 \quad \text{s.t.} \quad P = Du$
We do not recover the (non-unique!) sparse coefficients c, but (an approximation of) the signal x.
The error bound involves $L^* x$ and is reasonably small for example when $L^* L$ is nearly diagonal... not too many features in the dictionary are too correlated... 18 / 31

ℓ1-based optimization
ℓ1-SVM: Support vector machines are a standard tool for classification problems. An ℓ1-penalty term leads to sparse classifiers.
Nuclear norm: Minimizing the nuclear norm (= sum of the singular values) of a matrix leads to low-rank matrices.
TV (= total variation) norm: Minimizing $\sum_{i,j} |u_{i,j+1} - u_{i,j}|$ over images u gives images with edges and flat parts.
L1: Minimizing the L1-norm (= integral of the absolute value) of a function leads to functions with small support.
TV-norm of f: Minimizing $\int |\nabla f|$ leads to functions with jumps along curves. 19 / 31
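For concreteness, a short sketch computing two of the penalties listed above for a toy image: the nuclear norm (sum of singular values) and the anisotropic TV norm; the image itself is an arbitrary made-up example.

```python
import numpy as np

rng = np.random.default_rng(6)
# Toy "image": a flat background with one bright rectangle, plus a low-rank perturbation.
u = np.zeros((64, 64))
u[20:40, 10:30] = 1.0
u += 0.1 * np.outer(rng.standard_normal(64), rng.standard_normal(64))

nuclear = np.linalg.norm(u, "nuc")                 # sum of singular values; small for low-rank matrices
tv = (np.abs(np.diff(u, axis=0)).sum()             # sum_{i,j} |u_{i+1,j} - u_{i,j}|
      + np.abs(np.diff(u, axis=1)).sum())          # + sum_{i,j} |u_{i,j+1} - u_{i,j}|
print(f"nuclear norm = {nuclear:.2f}, TV norm = {tv:.2f}")
```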

Part II: Neural Networks (Introduction, Notation, Training the network, Applications) 20 / 31

Neural Networks: Introduction
W. McCulloch, W. Pitts (1943)
Motivated by biological research on the human brain and neurons
A neural network is a graph of nodes, partially connected. Nodes represent neurons; oriented connections between the nodes represent the transfer of outputs of some neurons to inputs of other neurons. 21 / 31

Neural Networks: Introduction
In the 70's and 80's a number of obstacles appeared - insufficient computer power to train large neural networks, theoretical problems with processing the exclusive-or, etc.
Support vector machines (and other simpler algorithms) took over the field of machine learning
2010's: Algorithmic advances and higher computational power allowed training large neural networks to human (and superhuman) performance in pattern recognition
Large neural networks (a.k.a. deep learning) are used successfully in many tasks 22 / 31

Neural Networks: Artificial Neuron
An artificial neuron... gets activated if a linear combination of its inputs grows over a certain threshold...
Inputs $x = (x_1, \dots, x_n) \in \mathbb{R}^n$
Weights $w = (w_1, \dots, w_n) \in \mathbb{R}^n$
Comparing $\langle w, x \rangle$ with a threshold $b \in \mathbb{R}$
Plugging the result into the activation function - a jump (or smoothed jump) function σ
An artificial neuron is a function $x \mapsto \sigma(\langle x, w \rangle - b)$, where $\sigma : \mathbb{R} \to \mathbb{R}$ might be $\sigma(x) = \mathrm{sgn}(x)$ or $\sigma(x) = e^x/(1+e^x)$, etc. 23 / 31
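A direct transcription of the neuron $x \mapsto \sigma(\langle x, w \rangle - b)$ into code, with the logistic function $e^t/(1+e^t)$ as the smoothed jump; the inputs, weights and threshold are made-up numbers.

```python
import numpy as np

def sigma(t):
    """Smoothed jump function: the logistic sigmoid e^t / (1 + e^t)."""
    return 1.0 / (1.0 + np.exp(-t))

def neuron(x, w, b):
    """Artificial neuron: activation of <x, w> compared with the threshold b."""
    return sigma(np.dot(x, w) - b)

x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([1.0, 0.5, 0.25])   # weights
b = 0.3                          # threshold
print(neuron(x, w, b))           # close to 1 if <x, w> is well above b, close to 0 if well below
```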

Neural Networks: Layers
An artificial neural network is a directed, acyclic graph of artificial neurons
The neurons are grouped by their distance to the input into layers 24 / 31

Neural Networks: Layers
Input: $x = (x_1, \dots, x_n) \in \mathbb{R}^n$
First layer of neurons: $y_1 = \sigma(\langle x, w^1_1 \rangle - b^1_1), \dots, y_{n_1} = \sigma(\langle x, w^1_{n_1} \rangle - b^1_{n_1})$
The outputs $y = (y_1, \dots, y_{n_1})$ become the inputs for the next layer, and so on; the last layer outputs $y \in \mathbb{R}$
Training the network: given inputs $x_1, \dots, x_N$ and outputs $y_1, \dots, y_N$, optimize over the weights w and thresholds b 25 / 31
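The layer structure above written as a forward pass: each layer applies $\sigma(Wx - b)$, with one row of W per neuron and a linear last layer; the layer widths and parameters are arbitrary illustrative values.

```python
import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

def forward(x, layers):
    """Propagate x through a list of (W, b) pairs; the last layer is linear with a single output."""
    for W, b in layers[:-1]:
        x = sigma(W @ x - b)          # y_k = sigma(<x, w_k> - b_k), one row of W per neuron
    W, b = layers[-1]
    return (W @ x - b).item()         # scalar output of the last layer

rng = np.random.default_rng(7)
n, n1, n2 = 4, 8, 5                   # input size and two hidden layer widths
layers = [(rng.standard_normal((n1, n)), rng.standard_normal(n1)),
          (rng.standard_normal((n2, n1)), rng.standard_normal(n2)),
          (rng.standard_normal((1, n2)), rng.standard_normal(1))]
print(forward(rng.standard_normal(n), layers))
```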

Neural Networks: Training
The parameters p of the network are initialized (for example in a random way), giving a network $N_p$
For a set of input/output pairs $(x_i, y_i)$ we calculate the output of the neural network with the current parameters, $z_i = N_p(x_i)$. In an optimal case, $z_i = y_i$ for all inputs
Update the parameters of the neural network to minimize/decrease the loss function $\sum_i \|y_i - z_i\|^2$
... and repeat... 26 / 31

Neural Networks: Training
Non-convex minimization over a huge space! A huge number of local minimizers exists; the initialization of the minimization algorithm is important
Backpropagation algorithm: the error at the output is redistributed to the neurons of the last hidden layer, then to the previous one, etc.
The error is distributed back through the network and used to update the parameters of each neuron by a gradient descent method 27 / 31
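A compact sketch of the training loop: a network with one hidden layer is fitted by plain gradient descent, with the gradients of the squared loss written out by hand (backpropagation); the data, the architecture and the learning rate are all illustrative choices, not those of the talk.

```python
import numpy as np

rng = np.random.default_rng(8)
sigma = lambda t: 1.0 / (1.0 + np.exp(-t))

# Toy data: the target is the sine of the first input coordinate.
X = rng.uniform(-2, 2, size=(200, 3))
Y = np.sin(X[:, 0])

h = 16                                                  # hidden layer width
W1, b1 = 0.5 * rng.standard_normal((h, 3)), np.zeros(h)
W2, b2 = 0.5 * rng.standard_normal(h), 0.0
lr = 0.05

for epoch in range(2000):
    # Forward pass.
    A = X @ W1.T - b1                                   # pre-activations of the hidden layer
    H = sigma(A)                                        # hidden layer outputs
    Z = H @ W2 - b2                                     # network outputs z_i
    # Backward pass: gradient of the mean squared error over the data set.
    dZ = 2 * (Z - Y) / len(X)
    gW2, gb2 = H.T @ dZ, -dZ.sum()
    dA = np.outer(dZ, W2) * H * (1 - H)                 # chain rule through the sigmoid
    gW1, gb1 = dA.T @ X, -dA.sum(axis=0)
    # Gradient descent update.
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

print("final mean squared error:",
      np.mean((sigma(X @ W1.T - b1) @ W2 - b2 - Y) ** 2))
```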

Neural Networks: Training
Backpropagation was discovered in the 1960's and applied to neural networks in the 1970's, with theoretical progress in the 1980's and 1990's
It profited from the increased computational power of the 2010's, which allowed applications to large data sets and to neural networks with tens or hundreds of layers
Achieved human and super-human performance in pattern recognition and later on in many other applications 28 / 31

Neural Networks: Deep learning
Training of networks with a large number (~100) of layers
Made possible by the use of GPUs (Nvidia), which accelerated deep learning by a factor of about 100
The use of many parameters makes it sensitive to overfitting (= too exact an adaptation to the training data, not observed in other data from the same area)
Overfitting is reduced by regularization methods: ℓ2 (weight decay) or ℓ1 (sparsity) penalties on the weights
Further tricks are used to accelerate the learning algorithm 29 / 31

Neural Networks: Applications
Pattern recognition, computer vision, speech recognition, social network filtering, recommendation systems, bioinformatics, AlphaGo, ... 30 / 31

Thank you for your attention! 31 / 31