and Jan Vybíral (Charles University & Czech Technical University Prague, Czech Republic)
NOMAD Summer, Berlin, September 25-29, 2017
Outline
- Lasso & Compressed Sensing
- Introduction
- Notation
- Training the network
- Applications
Part I: Lasso & Compressed Sensing
Least squares
- Fitting a cloud of points by a linear hyperplane
- Considered already by Gauss and Legendre in the 18th century
- In 2D: [figure: a line fitted to a cloud of points]
Least squares
- Objects (= points) described by $\Omega$ real numbers:
  $d_1 = (d_{1,1}, \dots, d_{1,\Omega}) \in \mathbb{R}^\Omega, \quad \dots, \quad d_N = (d_{N,1}, \dots, d_{N,\Omega}) \in \mathbb{R}^\Omega$
- $N$: number of objects; $D$: the $N \times \Omega$ matrix with rows $d_1, \dots, d_N$
- $P = (P_1, \dots, P_N)$ are the properties of interest
- We look for a linear dependence $P = f(d)$ with a linear $f$, i.e.
  $P_i = \sum_{j=1}^{\Omega} c_j d_{i,j}$, or $P = Dc$
Least squares
- The solution is found by minimizing the least-squares error:
  $\hat{c} = \arg\min_{c \in \mathbb{R}^\Omega} \sum_{i=1}^{N} \Big(P_i - \sum_{j=1}^{\Omega} c_j d_{i,j}\Big)^2 = \arg\min_{c \in \mathbb{R}^\Omega} \|P - Dc\|_2^2$
- A closed formula exists
- Convex objective function
- $\hat{c}$ has, in general, all coordinates occupied (i.e. non-zero)
- The absolute term is incorporated by an additional column full of ones
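A minimal numerical sketch of this step, assuming numpy; the data below are randomly generated placeholders, not from the lecture:

```python
import numpy as np

# toy data: N objects described by Omega features (placeholder values)
rng = np.random.default_rng(0)
N, Omega = 100, 5
D = rng.normal(size=(N, Omega))
P = D @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=N)

# absolute term: append a column full of ones to D
D1 = np.hstack([D, np.ones((N, 1))])

# least-squares fit; equivalent to the closed formula c = (D1^T D1)^{-1} D1^T P
c_hat, *_ = np.linalg.lstsq(D1, P, rcond=None)
print(c_hat)  # last entry is the absolute term (intercept)
```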
Regularization
- How to include prior knowledge on $c$? Say, we prefer a linear fit with small coefficients.
- We just weight the error of the fit against the size of the coefficients:
  $\hat{c} = \arg\min_{c \in \mathbb{R}^\Omega} \|P - Dc\|_2^2 + \lambda \|c\|_2^2$
- $\lambda > 0$: regularization parameter
- $\lambda \to 0$: least squares; $\lambda \to \infty$: $\hat{c} = 0$
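This regularized problem also has a closed formula, $\hat{c} = (D^\top D + \lambda I)^{-1} D^\top P$; a small self-contained sketch, again with placeholder data:

```python
import numpy as np

rng = np.random.default_rng(0)
N, Omega = 100, 5
D = rng.normal(size=(N, Omega))
P = D @ rng.normal(size=Omega)

lam = 1.0                                                  # regularization parameter lambda
c_reg = np.linalg.solve(D.T @ D + lam * np.eye(Omega), D.T @ P)
print(c_reg)                                               # coefficients shrink towards 0 as lam grows
```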
Tractability
Convexity
- The minimizer is unique (for a strictly convex objective)
- A local minimum of a convex function is also a global one
- Many effective methods exist (convex optimization)
P vs. NP
- P-problems: solvable in polynomial time (in the size of the input)
- NP-problems: solution verifiable in polynomial time; P ⊆ NP
- One-million-dollar problem: P = NP?
- [figure: computational complexity]
Sparsity
- If $\Omega$ is large (especially $\Omega \gg N$), we are often interested in selecting features, i.e. in $c$ with many coordinates equal to zero.
- $\|c\|_0 := \#\{i : c_i \neq 0\}$: the number of non-zero coordinates of $c$
- Looking for a linear fit using only two features:
  $\hat{c} = \arg\min_{c \in \mathbb{R}^\Omega,\ \|c\|_0 \le 2} \|P - Dc\|_2^2$
- Regularized version:
  $\hat{c} = \arg\min_{c \in \mathbb{R}^\Omega} \|P - Dc\|_2^2 + \lambda \|c\|_0$
- NP-hard! (cf. the brute-force sketch below)
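Why this is combinatorial: an exact solver for the two-feature fit has to try every pair of columns. A brute-force sketch with placeholder data, assuming numpy; the number of subsets grows as $\binom{\Omega}{s}$, which is what makes the $\ell_0$ problems intractable for large $\Omega$:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
N, Omega = 100, 8
D = rng.normal(size=(N, Omega))
c_true = np.zeros(Omega)
c_true[[2, 5]] = [1.5, -2.0]                 # only two features are active
P = D @ c_true

best_err, best_pair = np.inf, None
for pair in combinations(range(Omega), 2):   # try every 2-element subset of features
    c_sub, *_ = np.linalg.lstsq(D[:, pair], P, rcond=None)
    err = np.sum((P - D[:, pair] @ c_sub) ** 2)
    if err < best_err:
        best_err, best_pair = err, pair
print(best_pair)                             # -> (2, 5); the search explodes combinatorially in Omega
```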
ℓ1-minimization
- Other ways to measure the size of $c$: the $\ell_p$-norms
  $\|c\|_p = \Big(\sum_{j=1}^{\Omega} |c_j|^p\Big)^{1/p}$, and $\|c\|_\infty = \max_{j=1,\dots,\Omega} |c_j|$ for $p = \infty$
- [figure: unit balls of $\ell_p$ in $\mathbb{R}^2$]
- $p \ge 1$: convex problem
- $p \le 1$: promotes sparsity
ℓ1-minimization
- $p \le 1$ promotes sparsity
- [figure: solution of $S_p = \arg\min_{z \in \mathbb{R}^2} \|z\|_p$ s.t. $Az = y$, for $p = 1$ and $p = 2$]
ℓ1-minimization
- Take $p = 1$ (Lasso; Tibshirani, 1996):
  $\hat{c} = \arg\min_{c \in \mathbb{R}^\Omega} \|P - Dc\|_2^2 + \lambda \|c\|_1$
- Chen, Donoho, Saunders: Basis pursuit (1998)
- $\lambda \to 0$: least squares; $\lambda \to \infty$: $\hat{c} = 0$; in between: $\lambda$ selects the sparsity
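A minimal Lasso sketch, assuming scikit-learn is available; note that its objective scales the data term by $1/(2N)$, so its alpha corresponds to the λ above only up to a constant factor:

```python
import numpy as np
from sklearn.linear_model import Lasso       # assumes scikit-learn is installed

rng = np.random.default_rng(0)
N, Omega = 50, 200                           # many more features than data points
D = rng.normal(size=(N, Omega))
c_true = np.zeros(Omega)
c_true[[3, 17, 42]] = [2.0, -1.0, 0.7]       # sparse ground truth
P = D @ c_true + 0.01 * rng.normal(size=N)

lasso = Lasso(alpha=0.05, fit_intercept=False)
lasso.fit(D, P)
print(np.flatnonzero(lasso.coef_))           # only a few coordinates survive, ideally [3, 17, 42]
```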
ℓ1-minimization
- [figure: effect of $\lambda > 0$ on the support of the solution]
Compressed Sensing (aka Compressive Sensing, Compressive Sampling)
Theorem: Let $D \in \mathbb{R}^{N \times \Omega}$ have independent Gaussian entries. Let $0 < \varepsilon < 1$, let $s$ be a natural number, and let
$N \ge C\big(s \log(\Omega) + \log(1/\varepsilon)\big)$, where $C$ is a universal constant.
If $c \in \mathbb{R}^\Omega$ is $s$-sparse, $P = Dc$, and $\hat{c}$ is the minimizer of
$\hat{c} = \arg\min_{u \in \mathbb{R}^\Omega} \|u\|_1 \quad \text{s.t.} \quad P = Du,$
then $c = \hat{c}$ with probability at least $1 - \varepsilon$.
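A small sketch of the equality-constrained $\ell_1$ minimization (basis pursuit) from the theorem, assuming the cvxpy package; with $N$ on the order of $s \log \Omega$ Gaussian measurements, the $s$-sparse vector is typically recovered exactly:

```python
import numpy as np
import cvxpy as cp                               # assumes cvxpy is installed

rng = np.random.default_rng(0)
Omega, s, N = 200, 5, 60                         # N ~ C * s * log(Omega) measurements
D = rng.normal(size=(N, Omega)) / np.sqrt(N)     # Gaussian measurement matrix

c = np.zeros(Omega)
c[rng.choice(Omega, size=s, replace=False)] = rng.normal(size=s)   # s-sparse vector
P = D @ c

u = cp.Variable(Omega)
cp.Problem(cp.Minimize(cp.norm1(u)), [D @ u == P]).solve()
print(np.max(np.abs(u.value - c)))               # ~ 0: exact recovery with high probability
```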
Compressed Sensing (aka Compressive Sensing, Compressive Sampling)
- Candès, Romberg, Tao (2006); Donoho (2006)
- Extensive theory of recovery of sparse vectors from linear measurements
- Optimal conditions on the number of measurements (i.e. data points): $N \ge C s \log \Omega$
- Only true if most of the features (i.e. the columns of $D$) are incoherent with the majority of the others (if two features are very similar, it is difficult to distinguish between them)
- H. Boche, R. Calderbank, G. Kutyniok, J. Vybíral, A Survey of Compressed Sensing, first chapter in Compressed Sensing and its Applications, Birkhäuser, Springer, 2015
Dictionaries
- Real-life signals are (almost) never sparse in the canonical basis of $\mathbb{R}^\Omega$; more often they are sparse in some orthonormal basis, i.e. $x = Bc$, where $c \in \mathbb{R}^\Omega$ is sparse and the columns (and rows) of $B \in \mathbb{R}^{\Omega \times \Omega}$ are orthonormal vectors (wavelets, Fourier basis, etc.)
- Compressed sensing applies then without any essential change! ... just replace $D$ with $DB$ ... i.e. you rotate the problem ... (a small sketch follows below)
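A sketch of this replacement, assuming cvxpy and using a random orthonormal matrix as a stand-in for a wavelet or Fourier basis B:

```python
import numpy as np
import cvxpy as cp                               # assumes cvxpy is installed

rng = np.random.default_rng(0)
Omega, N = 128, 60
B, _ = np.linalg.qr(rng.normal(size=(Omega, Omega)))   # orthonormal basis (stand-in for wavelets/Fourier)

c = np.zeros(Omega)
c[rng.choice(Omega, size=5, replace=False)] = rng.normal(size=5)
x = B @ c                                        # signal: sparse in B, dense in coordinates
D = rng.normal(size=(N, Omega))
P = D @ x                                        # linear measurements of x

u = cp.Variable(Omega)
cp.Problem(cp.Minimize(cp.norm1(u)), [(D @ B) @ u == P]).solve()   # just replace D with DB
x_hat = B @ u.value                              # rotate back to signal space
print(np.linalg.norm(x_hat - x))                 # ~ 0: the signal is recovered
```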
Dictionaries
- Even more often, the signal is represented in an overcomplete dictionary/lexicon: $x = Lc$, where $c \in \mathbb{R}^l$ is sparse and $L \in \mathbb{R}^{\Omega \times l}$ is the dictionary/lexicon: its columns form an overcomplete system ($l > \Omega$)
- $x$ is a sparse combination of non-orthogonal vectors, the columns of $L$
- Examples: unions of two or more orthonormal bases, each capturing different features
Dictionaries
- Compressed sensing can be adapted also to this situation
- Optimization: $\hat{x} = \arg\min_{u \in \mathbb{R}^\Omega} \|L^* u\|_1 \quad \text{s.t.} \quad P = Du$
- We do not recover the (non-unique!) sparse coefficients $c$, but (an approximation of) the signal $x$
- The error bound involves $L^* x$ and is reasonably small, for example, when $L^* L$ is nearly diagonal ... not too many features in the dictionary are too correlated ...
ℓ1-based optimization
- ℓ1-SVM: Support vector machines are a standard tool for classification problems; an $\ell_1$-penalty term leads to sparse classifiers.
- Nuclear norm: minimizing the nuclear norm (= sum of singular values) of a matrix leads to low-rank matrices.
- TV (= total variation) norm: minimizing $\sum_{i,j} |u_{i,j+1} - u_{i,j}| + |u_{i+1,j} - u_{i,j}|$ over images $u$ gives images with edges and flat parts.
- $L_1$: minimizing the $L_1$-norm (= integral of the absolute value) of a function leads to functions with small support.
- TV-norm of $f$: minimizing $\int |\nabla f|$ leads to functions with jumps along curves.
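A small sketch of the discrete (anisotropic) TV term for an image, assuming numpy; this is the quantity whose minimization, together with a data-fit term, produces piecewise-flat images with sharp edges:

```python
import numpy as np

def tv_norm(u: np.ndarray) -> float:
    """Anisotropic total variation: sum of absolute horizontal and vertical differences."""
    dh = np.abs(u[:, 1:] - u[:, :-1]).sum()      # |u_{i,j+1} - u_{i,j}|
    dv = np.abs(u[1:, :] - u[:-1, :]).sum()      # |u_{i+1,j} - u_{i,j}|
    return float(dh + dv)

flat = np.ones((32, 32))
noisy = flat + 0.1 * np.random.default_rng(0).normal(size=(32, 32))
print(tv_norm(flat), tv_norm(noisy))             # the flat image has TV 0, the noisy one does not
```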
Part II: Neural Networks
Neural Networks: Introduction
- W. McCulloch, W. Pitts (1943)
- Motivated by biological research on the human brain and neurons
- A neural network is a graph of nodes, partially connected. Nodes represent neurons; oriented connections between the nodes represent the transfer of outputs of some neurons to inputs of other neurons.
Neural Networks: Introduction
- In the 70s and 80s a number of obstacles appeared: insufficient computer power to train large neural networks, theoretical problems with processing exclusive-or, etc.
- Support vector machines (and other simple algorithms) took over the field of machine learning
- 2010s: algorithmic advances and higher computational power allowed training large neural networks to human (and superhuman) performance in pattern recognition
- Large neural networks (a.k.a. deep learning) are used successfully in many tasks
Neural Networks: Artificial Neuron
- Artificial neuron: ... gets activated if a linear combination of its inputs grows over a certain threshold ...
- Inputs $x = (x_1, \dots, x_n) \in \mathbb{R}^n$
- Weights $w = (w_1, \dots, w_n) \in \mathbb{R}^n$
- Comparing $\langle w, x \rangle$ with a threshold $b \in \mathbb{R}$
- Plugging the result into the activation function: a jump (or smoothed jump) function $\sigma$
- An artificial neuron is a function $x \mapsto \sigma(\langle x, w \rangle - b)$, where $\sigma : \mathbb{R} \to \mathbb{R}$ might be $\sigma(x) = \operatorname{sgn}(x)$ or $\sigma(x) = e^x/(1 + e^x)$, etc.
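A minimal sketch of a single artificial neuron with the smoothed-jump (sigmoid) activation, assuming numpy; the weights and threshold are arbitrary illustrative values:

```python
import numpy as np

def sigma(t: float) -> float:
    return float(np.exp(t) / (1.0 + np.exp(t)))  # smoothed jump: e^t / (1 + e^t)

def neuron(x: np.ndarray, w: np.ndarray, b: float) -> float:
    """Artificial neuron: x -> sigma(<x, w> - b)."""
    return sigma(np.dot(x, w) - b)

x = np.array([0.5, -1.0, 2.0])                   # inputs
w = np.array([1.0, 0.3, -0.7])                   # weights (illustrative)
b = 0.1                                          # threshold
print(neuron(x, w, b))                           # close to 1 if <x,w> is well above b, close to 0 otherwise
```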
Neural Networks: Layers
- An artificial neural network is a directed, acyclic graph of artificial neurons
- The neurons are grouped into layers by their distance to the input
Neural Networks: Layers
- Input: $x = (x_1, \dots, x_n) \in \mathbb{R}^n$
- First layer of neurons: $y_1 = \sigma(\langle x, w^1_1 \rangle - b^1_1), \dots, y_{n_1} = \sigma(\langle x, w^1_{n_1} \rangle - b^1_{n_1})$
- The outputs $y = (y_1, \dots, y_{n_1})$ become inputs for the next layer ...; the last layer outputs $y \in \mathbb{R}$
- Training the network: given inputs $x_1, \dots, x_N$ and outputs $y_1, \dots, y_N$, optimize over the weights $w$'s and $b$'s
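A small sketch of the layer-by-layer forward pass, assuming numpy; the layer sizes are arbitrary, each hidden layer applies $\sigma(Wx - b)$, and the last layer is taken linear here so the output is a single real number:

```python
import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

def forward(x, layers):
    """layers: list of (W, b) pairs; each hidden layer maps x -> sigma(W x - b)."""
    for W, b in layers[:-1]:
        x = sigma(W @ x - b)
    W, b = layers[-1]
    return W @ x - b                             # last layer kept linear (a modelling choice)

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 1]                             # input dim 4, two hidden layers, scalar output
layers = [(rng.normal(size=(m, n)), rng.normal(size=m))
          for n, m in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=4)
print(forward(x, layers))                        # the network output y in R (a length-1 array)
```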
Neural Networks: Training
- The parameters $p$ of the network are initialized (for example randomly) $\Rightarrow N_p$
- For a set of input/output pairs $(x_i, y_i)$ we calculate the output of the neural network with the current parameters $\Rightarrow z_i = N_p(x_i)$
- In an optimal case, $z_i = y_i$ for all inputs
- Update the parameters of the neural network to minimize/decrease the loss function $\sum_i \|y_i - z_i\|^2$ ... and repeat ...
Neural Networks: Training
- Non-convex minimization over a huge space!
- A huge number of local minimizers exists
- Initialization of the minimization algorithm is important
- Backpropagation algorithm: the error at the output is redistributed to the neurons of the last hidden layer, then to the previous one, etc.
- The error is distributed back through the network and used to update the parameters of each neuron by a gradient descent method (see the sketch below)
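A compact sketch of the whole training loop for a one-hidden-layer network, assuming numpy: forward pass, squared loss, the backpropagated gradients written out by hand, and a gradient-descent update. Data, sizes, and step size are illustrative:

```python
import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
n, n1, N = 3, 16, 200
X = rng.normal(size=(N, n))
Y = np.sin(X @ np.array([1.0, -1.0, 0.5]))        # toy target function

# parameters p = (W1, b1, w2, b2), randomly initialized
W1 = rng.normal(size=(n1, n)) * 0.5; b1 = np.zeros(n1)
w2 = rng.normal(size=n1) * 0.5;      b2 = 0.0
lr = 0.05                                          # gradient-descent step size

for step in range(2000):
    # forward pass for all samples
    H = sigma(X @ W1.T - b1)                       # hidden layer outputs, shape (N, n1)
    Z = H @ w2 - b2                                # network outputs z_i = N_p(x_i)
    loss = np.mean((Z - Y) ** 2)

    # backpropagation: push the output error back through the layers
    dZ = 2.0 * (Z - Y) / N                         # d loss / d z_i
    dw2 = H.T @ dZ;              db2 = -dZ.sum()
    dH = np.outer(dZ, w2)                          # error redistributed to the hidden layer
    dPre = dH * H * (1.0 - H)                      # through the sigmoid derivative
    dW1 = dPre.T @ X;            db1 = -dPre.sum(axis=0)

    # gradient-descent update of every parameter
    W1 -= lr * dW1; b1 -= lr * db1; w2 -= lr * dw2; b2 -= lr * db2

print(loss)                                        # should have decreased substantially
```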
Neural Networks: Training
- Backpropagation discovered in the 1960s
- Applied to neural networks in the 1970s
- Theoretical progress in the 1980s and 1990s
- Profited from increased computational power in the 2010s, which allowed applications to large data sets and neural networks with tens or hundreds of layers
- Achieved human and super-human performance in pattern recognition and later on in many other applications
Neural Networks: Deep learning
- Training of a network with a large number (~100) of layers
- Made possible by the use of GPUs (Nvidia), which accelerated deep learning by ca. 100 times
- The use of many parameters makes it sensitive to overfitting (= too exact adaptation to the training data, not observed in other data from the same area)
- Overfitting is reduced by regularization methods: $\ell_2$ (decay) or $\ell_1$ (sparsity) penalties on the weights (a sketch follows below)
- Further tricks are used to accelerate the learning algorithm
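A minimal sketch of how the $\ell_2$ (weight decay) or $\ell_1$ (sparsity) penalty enters the gradient step for a generic weight matrix W, assuming numpy; the names and values are illustrative:

```python
import numpy as np

def penalty_grad(W: np.ndarray, lam: float, kind: str = "l2") -> np.ndarray:
    """Gradient of the regularizer: lam * ||W||_2^2 (decay) or lam * ||W||_1 (sparsity)."""
    if kind == "l2":
        return 2.0 * lam * W                     # weight decay: pulls all weights towards 0
    return lam * np.sign(W)                      # l1: pushes small weights exactly to 0

# inside a training step one would use, e.g.:
#   dW_total = dW_from_data + penalty_grad(W, lam=1e-4, kind="l2")
#   W -= lr * dW_total
W = np.array([[0.5, -0.2], [0.0, 1.5]])
print(penalty_grad(W, 1e-4, "l2"))
print(penalty_grad(W, 1e-4, "l1"))
```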
Applications
- Pattern recognition
- Computer vision
- Speech recognition
- Social network filtering
- Recommendation systems
- Bioinformatics
- AlphaGo
- ...
Thank you for your attention!