From approximation theory to machine learning
1 1/40 From approximation theory to machine learning
New perspectives in the theory of function spaces and their applications
September 2017, Bedlewo, Poland
Jan Vybíral, Charles University / Czech Technical University, Prague, Czech Republic
2 2/40 Introduction
Approximation Theory vs. Data Analysis (or Machine Learning, or Learning Theory)
Example I: ℓ1-minimization
Example II: Support vector machines
Example III: Ridge functions and neural networks
3 3/40 Approximation theory
K - class of objects of interest: vectors, functions, operators,
approximated by (same or simpler) objects, usually with only limited information about the unknown object...
with the similarity (approximation error) measured by some sort of distance.
Worst case - supremum of the approximation error over K.
Average case - mean value of the (square of the) approximation error over K w.r.t. some probability measure on K.
4 4/40 Approximation of multivariate functions
Let f : Ω ⊂ R^d → R be a function of many (d ≫ 1) variables.
We want to approximate f using only (a small number of) function values f(x_1), ..., f(x_n).
Questions:
Which functions (assumptions on f)?
How to measure the error?
How to choose the sampling points?
Decay of the error with growing number of sampling points?
Algorithms and optimality?
Dependence on d?
5 5/40 Data Analysis
Discovering structure in given data, allowing for predictions, conclusions, etc.
The given data can have different (and also heterogeneous) formats.
For inputs x_1, ..., x_n ∈ R^d and outputs y_1, ..., y_n ∈ R we would like to study the functional dependence y_i ≈ f(x_i).
In practice, the performance is often tested by learning on a subset of the data and testing on the rest.
To allow theoretical results, we often assume that the data is drawn independently from some (unknown) underlying probability distribution, i.e. (x_i, y_i) ∈ A with probability P(A).
6 6/40 Least squares & ridge regression
Least squares:
x_1, ..., x_n ∈ R^d: n data points (inputs) in R^d
y_1, ..., y_n ∈ R: n real values (outputs)
We look for a dependence y_i ≈ ⟨ω, x_i⟩.
We minimize Σ_{i=1}^n (y_i − ⟨ω, x_i⟩)^2 = ‖y − Xω‖_2^2 over ω ∈ R^d.
Ridge regression: adding the regularization term λ‖ω‖_2^2
LASSO: adding the regularization term λ‖ω‖_1
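A minimal numerical sketch of the least squares and ridge objectives above (synthetic data; the sizes and the value of λ are made up, not from the talk). The ℓ1 term of LASSO has no closed form and is revisited on the LASSO slide below.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 100, 10, 0.1
X = rng.standard_normal((n, d))                    # rows are the inputs x_1, ..., x_n
w_true = rng.standard_normal(d)
y = X @ w_true + 0.01 * rng.standard_normal(n)     # noisy outputs y_1, ..., y_n

# Least squares: minimize ||y - X w||_2^2 over w
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Ridge regression: minimize ||y - X w||_2^2 + lam * ||w||_2^2,
# closed form: w = (X^T X + lam * I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print(np.linalg.norm(w_ls - w_true), np.linalg.norm(w_ridge - w_true))
```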
7 7/40 Principal Component Analysis (PCA)
PCA: classical dimensionality-reduction technique.
Given data points x_1, ..., x_n ∈ R^d, minimize over subspaces L with dim L = k the distance of the x_i's to L, i.e.
min_{L : dim L = k} Σ_{i=1}^n ‖x_i − P_L x_i‖_2^2;
the answer is given by the singular value decomposition of the n × d data matrix.
... swiss army knife of numerical linear algebra ...
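As the slide says, the minimizing subspace can be read off from the SVD; a short numpy sketch on synthetic data (practical PCA usually centers the data first, which the slide's formulation does not require):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 200, 20, 3
X = rng.standard_normal((n, d))              # rows are the data points x_1, ..., x_n

U, S, Vt = np.linalg.svd(X, full_matrices=False)
B = Vt[:k]                                   # orthonormal basis of the optimal k-dim subspace L
P_X = (X @ B.T) @ B                          # projections P_L x_i (as rows)
residual = np.sum((X - P_X) ** 2)            # the minimized value of the objective
print(residual, np.sum(S[k:] ** 2))          # equal: the discarded singular values squared
```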
8 8/40 Support Vector Machines
V. N. Vapnik and A. Ya. Chervonenkis (1963)
B. E. Boser, I. M. Guyon and V. N. Vapnik (1992) - non-linear
C. Cortes and V. N. Vapnik (1995) - soft margin
Given data points x_1, ..., x_n ∈ R^d and labels y_i ∈ {−1, +1}, separate the sets {x_i : y_i = −1} and {x_i : y_i = +1} by a linear hyperplane,
i.e. f(x) = g(⟨a, x⟩), where g(t) = +1 if t ≥ 0 and g(t) = −1 if t < 0 is known.
9 9/40 Soft margin SVM:
min_{ω ∈ R^d} Σ_{i=1}^n (1 − y_i⟨ω, x_i⟩)_+ + λ‖ω‖_2^2
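A plain subgradient-descent sketch of this objective, for illustration only (the data, step size, and iteration count are invented; practical solvers use more refined optimization):

```python
import numpy as np

def soft_margin_svm(X, y, lam=0.1, steps=2000, lr=0.01):
    """Minimize sum_i (1 - y_i <w, x_i>)_+ + lam * ||w||_2^2 by subgradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        active = y * (X @ w) < 1                      # points with positive hinge loss
        subgrad = -(y[active, None] * X[active]).sum(axis=0) + 2 * lam * w
        w -= lr * subgrad
    return w

rng = np.random.default_rng(2)
a = np.array([1.0, -1.0, 0.5])
X = rng.standard_normal((200, 3))
y = np.sign(X @ a)                                    # labels from a linear classifier
w = soft_margin_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```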
10 10/40 Limits of the theory...
All theory, dear friend, is grey, but the golden tree of actual life springs ever green. Johann Wolfgang von Goethe
It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts. Sherlock Holmes, A Scandal in Bohemia
Approximation theory:
The class of functions of interest is unclear.
The data points sometimes cannot be designed.
Data analysis - learning theory:
The independence is not always ensured.
The underlying probability distribution (and its basic properties) is unknown.
The success, measured on left-out data, differs from one case to the other.
11 11/40 Active choice of points!
A link between data analysis and approximation theory. Difference between regression and sampling? ... the active choice of the points ...
Nowadays, more and more data analysis is done on calculated data!
In materials science, material properties (like color or conductivity) are calculated ab initio from the molecular information, not measured in the lab.
Typically, these properties are governed by a PDE (like Schrödinger's equation), where the structure of the material comes in as initial data.
12 12/40 Calculation of one property for one material is then a numerical solution of a parametric PDE: PDE(u, p, x) = 0.
The (unknown) functional dependence f = f(p) of the material property on the input parameters is calculated for materials described by parameters p_1, ..., p_m.
Sampling f at p_j = querying the solution = running an expensive simulation.
Nearly noise-free sampling is possible!
Trade-off: more noisy samples vs. fewer, less noisy samples.
The materials with parameters p_1, ..., p_m can be actively chosen!
13 13/40 Summary of the introduction
Approximation theory and Data Analysis often study similar problems from different points of view.
The choice of approximation algorithms is free in both areas.
Approximation theory (usually) assumes that the data can be designed/chosen.
The aim of the talk is to show the benefits of combining both approaches.
14 14/40 ℓ1-norm minimization - Data Analysis
LASSO (least absolute shrinkage and selection operator): Tibshirani (1996)
Least squares with the penalization term λ‖ω‖_1.
X = (x_1, ..., x_n) ∈ R^{n×d}: n data points (inputs) in R^d
y = (y_1, ..., y_n) ∈ R^n: n real values (outputs)
LASSO: argmin_{ω ∈ R^d} ‖y − Xω‖_2^2 + λ‖ω‖_1
weights the fit ‖y − Xω‖_2 against the number of non-zero coordinates of ω.
Leaves open a number of questions - how many data points, and distributed how, are sufficient to identify/approximate ω?
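A short iterative soft-thresholding (ISTA) sketch for this objective; an illustration on invented data, not the solver used in the references:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, steps=500):
    """Minimize ||y - X w||_2^2 + lam * ||w||_1 by proximal gradient descent."""
    step = 0.5 / np.linalg.norm(X, 2) ** 2       # = 1/L, L the Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y)
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(3)
n, d, s = 50, 200, 5
X = rng.standard_normal((n, d))
w_true = np.zeros(d); w_true[:s] = 1.0           # sparse ground truth
w_hat = lasso_ista(X, X @ w_true, lam=0.1)
print("recovered support:", np.flatnonzero(np.abs(w_hat) > 0.1))
```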
15 15/40 ℓ1-norm minimization - Approximation theory
Compressed Sensing: Let w ∈ R^d be sparse (with a small number s ≪ d of non-zero coefficients on unknown positions) and consider the linear function g(x) = ⟨x, w⟩.
Design n (as small as possible) sampling points x_1, ..., x_n, take the function values y_j = g(x_j) = ⟨x_j, w⟩, and find a recovery mapping Δ : R^n → R^d such that Δ(y) is equal/close to w.
Take the x_j's independent at random, i.e. y = Xw.
Use ℓ1-minimization for recovery: Δ(y) = argmin_{ω ∈ R^d} ‖ω‖_1 s.t. Xω = y.
It is enough to take n ≳ s log(d).
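The equality-constrained ℓ1-minimization above is a linear program after splitting off the absolute values (|ω_i| ≤ t_i); a sketch with scipy on a small synthetic instance:

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, y):
    """Solve min ||w||_1 s.t. X w = y via the LP  min sum(t)  s.t.  -t <= w <= t, X w = y."""
    n, d = X.shape
    c = np.concatenate([np.zeros(d), np.ones(d)])
    A_ub = np.block([[np.eye(d), -np.eye(d)],          #  w - t <= 0
                     [-np.eye(d), -np.eye(d)]])        # -w - t <= 0
    A_eq = np.hstack([X, np.zeros((n, d))])            #  X w = y
    bounds = [(None, None)] * d + [(0, None)] * d      #  w free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * d), A_eq=A_eq, b_eq=y, bounds=bounds)
    return res.x[:d]

rng = np.random.default_rng(4)
n, d, s = 40, 100, 4                                   # roughly s*log(d) Gaussian measurements
X = rng.standard_normal((n, d))
w = np.zeros(d); w[rng.choice(d, s, replace=False)] = rng.standard_normal(s)
w_rec = basis_pursuit(X, X @ w)
print("recovery error:", np.linalg.norm(w_rec - w))
```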
17 16/40 ℓ1-norm minimization - Approximation theory
Theorem (Donoho; Candès, Romberg, Tao). There is a constant C > 0 such that if n ≥ Cs log(d) and the components of the x_i are independent Gaussian random variables, then Δ(y) = w (with high probability).
Many different aspects (other choices of x_i, noisy sampling, nearly-sparse vectors w ∈ R^d, etc.) were studied intensively.
Benefits for Data Analysis:
Estimates of the minimal number of data points.
If possible, choose mostly uncorrelated data points.
Better information about the possibilities and limits of LASSO.
18 17/40 ℓ1-SVM - Data Analysis
ℓ1-SVM:
P. S. Bradley, O. L. Mangasarian (1998)
J. Zhu, S. Rosset, T. Hastie, R. Tibshirani (2004)
Support vector machine with an ℓ1-penalization term.
X = (x_1, ..., x_n) ∈ R^{n×d}: n data points (inputs) in R^d
y = (y_1, ..., y_n) ∈ {−1, +1}^n: n labels
ℓ1-SVM: min_{ω ∈ R^d} Σ_{i=1}^n (1 − y_i⟨ω, x_i⟩)_+ + λ‖ω‖_1
Weights the quality of the classification against the number of non-zero coordinates of ω.
Leaves open a number of questions - how many data points, and distributed how, are sufficient to identify ω?
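For illustration, scikit-learn's LinearSVC accepts an ℓ1 penalty; note that it then uses the squared hinge loss, so it is only a close relative of the ℓ1-SVM written above. The data and parameters below are invented:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(5)
n, d, s = 100, 500, 5
w_true = np.zeros(d); w_true[:s] = 1.0                 # sparse separating direction
X = rng.standard_normal((n, d))
y = np.sign(X @ w_true)

clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=1.0)
clf.fit(X, y)
w_hat = clf.coef_.ravel()
print("non-zero coordinates found:", np.flatnonzero(np.abs(w_hat) > 1e-6))
```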
19 18/40 ℓ1-SVM - Data Analysis
Standard technique of sparse classification:
Bioinformatics - gene selection, microarray classification, cancer classification
Feature selection
Face recognition
...
20 19/40 ℓ1-SVM - Data Analysis
Results:
I. Steinwart, Support vector machines are universally consistent, J. Complexity (2002)
I. Steinwart, Consistency of support vector machines and other regularized kernel classifiers, IEEE Trans. Inf. Theory (2005):
The SVMs are consistent (in the general setting with kernels), i.e. they can learn the hidden classifier for n → ∞ almost surely; the parameter of the SVM has to grow to infinity.
21 20/40 ℓ1-SVM - Approximation theory
We want to identify a sparse (or nearly sparse) ω ∈ R^d.
We assume ω ∈ K = {x ∈ R^d : ‖x‖_p ≤ 1}, p < 1.
We are allowed to take only 1-bit measurements y_i = sign(⟨ω, x_i⟩) - non-linear!
We are allowed to design the sampling points x_1, ..., x_n ∈ R^d and a (non-linear) recovery algorithm.
Recent area of 1-bit Compressed Sensing.
22 21/40 ℓ1-SVM - Approximation theory
1-bit Compressed Sensing: Boufounos & Baraniuk (2008), Y. Plan & R. Vershynin (2013)
ω ∈ K = {x ∈ R^d : ‖x‖_2 ≤ 1, ‖x‖_1 ≤ √s}
Result: For n ≥ Cδ^{-6} w(K)^2, x_i ~ N(0, id), y_i = sign(⟨ω, x_i⟩), and
ω̂ := argmax_{w ∈ K} Σ_{i=1}^n y_i⟨x_i, w⟩,
it holds that ‖ω̂ − ω‖_2^2 ≤ δ log(e/δ) with probability at least 1 − 8 exp(−cδ^2 n).
Mean Gaussian width: w(K) = E sup_{x ∈ K−K} ⟨g, x⟩
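For intuition: if K were just the Euclidean unit ball, the maximizer of the linear functional above is the normalized sum Σ_i y_i x_i. A tiny sketch of that special case (the theorem itself works over the smaller set K with the ℓ1 constraint; the data below is synthetic):

```python
import numpy as np

rng = np.random.default_rng(6)
d, n = 200, 2000
omega = np.zeros(d); omega[:3] = [0.8, -0.5, 0.33]
omega /= np.linalg.norm(omega)                 # unknown unit vector

X = rng.standard_normal((n, d))                # rows x_i ~ N(0, id)
y = np.sign(X @ omega)                         # 1-bit measurements y_i = sign(<omega, x_i>)

# argmax over ||w||_2 <= 1 of sum_i y_i <x_i, w> is (sum_i y_i x_i), normalized
w_hat = X.T @ y
w_hat /= np.linalg.norm(w_hat)
print("estimation error:", np.linalg.norm(w_hat - omega))
```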
23 22/40 ℓ1-SVM - Approximation theory
What if we insist on the ℓ1-SVM as the recovery algorithm?
Setting: ‖ω‖_2 = 1, ‖ω‖_1 ≤ R,
x_i ~ N(0, r^2 id), y_i = sign(⟨x_i, ω⟩), i = 1, ..., n,
ω̂ ∈ R^d a minimizer of the ℓ1-SVM
min_{w ∈ R^d} Σ_{i=1}^n [1 − y_i⟨x_i, w⟩]_+ subject to ‖w‖_1 ≤ R.
Theorem (Kolleck, V. (2016)). Let 0 < ε < 0.18, r > 2π(0.57 − πε)^{-1}, n ≥ Cε^{-2} r^2 R^2 ln(d). Then it holds that
‖ω − ω̂/‖ω̂‖_2‖_2 ≤ C(ε + 1/r)
with probability at least 1 − exp(−C ln(d)).
24 23/40 ℓ1-SVM
Summary on SVMs:
A non-asymptotic analysis of (ℓ1-)SVMs allows one to predict the limits of the algorithm.
For random (i.e. highly uncorrelated) data, a small number n of measurements/data points is sufficient.
Motivated by the analysis, we proposed an SVM which combines the ℓ1 and the ℓ2 penalty; this method performs better in the analysis, but also in numerical simulations!
Use of the 1-bit CS algorithm in Machine Learning?
25 24/40 Ridge functions & Neural networks
Ridge functions: Let g : R → R and a ∈ R^d \ {0}. The ridge function with ridge profile g and ridge vector a is the function f(x) := g(⟨a, x⟩).
It is constant along the hyperplane a^⊥ = {y ∈ R^d : ⟨y, a⟩ = 0} and its translates.
More generally, if g : R^k → R and A ∈ R^{k×d} with k ≤ d, then f(x) := g(Ax) is a k-ridge function.
26 25/40 Ridge functions in mathematics
Kernels of important transforms (Fourier, Radon).
Plane waves in PDEs: solutions to Π_{i=1}^r (b_i ∂/∂x − a_i ∂/∂y) F = 0 are of the form F(x, y) = Σ_{i=1}^r f_i(a_i x + b_i y).
Ridgelets, curvelets, shearlets, ...: wavelet-like frames capturing singularities along curves and edges (Candès, Donoho, Kutyniok, Labate, ...).
27 26/40 Ridge functions in approximation theory
Approximation of a function by functions from the dictionary D_ridge = {ϱ(⟨k, x⟩ − b) : k ∈ R^d, b ∈ R}.
Fundamentality, greedy algorithms.
Lin & Pinkus, Fundamentality of ridge functions, J. Approx. Theory 75 (1993), no. 3
Cybenko, Approximation by superpositions of a sigmoidal function, Math. Control Signals Systems 2 (1989)
Leshno, Lin, Pinkus & Schocken, Multilayer feedforward networks with a nonpolynomial activation function can approximate any function, Neural Networks 6 (1993)
28 27/40 Neural networks
Motivated by biological research on the human brain and neurons: W. McCulloch, W. Pitts (1943); M. Minsky, S. Papert (1969).
Artificial neuron: ... gets activated if a linear combination of its inputs grows over a certain threshold ...
Inputs x = (x_1, ..., x_n) ∈ R^n
Weights w = (w_1, ..., w_n) ∈ R^n
Comparing ⟨w, x⟩ with a threshold b ∈ R
Plugging the result into the activation function - a jump (or smoothed jump) function σ
An artificial neuron is a function x ↦ σ(⟨x, w⟩ − b), where σ : R → R might be σ(x) = sign(x) or σ(x) = e^x/(1 + e^x), etc.
29 28/40 Neural networks
An artificial neural network is a directed, acyclic graph of artificial neurons.
Input: x = (x_1, ..., x_n) ∈ R^n
First layer of neurons: y_1 = σ(⟨x, w_1^1⟩ − b_1^1), ..., y_{n_1} = σ(⟨x, w_{n_1}^1⟩ − b_{n_1}^1)
The outputs y = (y_1, ..., y_{n_1}) become inputs for the next layer, ...; the last layer outputs y ∈ R.
Deep Learning relies on an artificial neural network with many layers.
Training the network: given inputs x_1, ..., x_N and outputs y_1, ..., y_N, optimize over the weights w and thresholds b.
Non-convex minimization over a huge space...???
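A minimal numpy sketch of the forward pass just described: one hidden layer, the smoothed-jump activation from the previous slide, and random weights (the slide defines only the architecture, not a training procedure):

```python
import numpy as np

def sigma(t):
    return np.exp(t) / (1.0 + np.exp(t))       # smoothed jump (logistic) activation

def forward(x, layers):
    """layers = [(W, b), ...]; each neuron computes sigma(<x, w> - b)."""
    for W, b in layers:
        x = sigma(W @ x - b)
    return x

rng = np.random.default_rng(7)
d, n1 = 5, 8                                   # input dimension, hidden-layer width
layers = [(rng.standard_normal((n1, d)), rng.standard_normal(n1)),   # first layer
          (rng.standard_normal((1, n1)), rng.standard_normal(1))]    # output layer
x = rng.standard_normal(d)
print("network output:", forward(x, layers))
```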
30 29/40 Approximation of ridge functions (and their sums)
k = 1: f(x) = g(⟨a, x⟩), ‖a‖_2 = 1, g smooth.
The approximation has two parts: approximation of g and of a.
Recovery of a from ∇f(x): ∇f(x) = g'(⟨a, x⟩) a, ∇f(0) = g'(0) a.
After recovering a, the problem becomes essentially one-dimensional and one can use an arbitrary sampling method to approximate g.
g'(0) ≠ 0 ... normalize g'(0) = 1.
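A sketch of the gradient step for k = 1: approximate ∇f(0) by central differences along the coordinate directions and normalize (g and a below are invented, with g'(0) = 1 as in the normalization above). This uses about 2d samples; the approaches on the following slides need far fewer when a is sparse or compressible.

```python
import numpy as np

rng = np.random.default_rng(8)
d, h = 50, 1e-5
a = rng.standard_normal(d); a /= np.linalg.norm(a)
g = lambda t: t + 0.3 * t**2                   # smooth profile with g'(0) = 1
f = lambda x: g(x @ a)                         # ridge function f(x) = g(<a, x>)

# central differences along the coordinates approximate grad f(0) = g'(0) a
grad = np.array([(f(h * e) - f(-h * e)) / (2 * h) for e in np.eye(d)])
a_hat = grad / np.linalg.norm(grad)
print("error in the ridge direction:", np.linalg.norm(a_hat - a))
```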
31 30/40 Buhmann & Pinkus '99
Identifying linear combinations of ridge functions: approximation of functions
f(x) = Σ_{i=1}^m g_i(⟨a_i, x⟩), x ∈ R^d,
g_i ∈ C^{2m−1}(R), i = 1, ..., m; g_i^{(2m−1)}(0) ≠ 0, i = 1, ..., m;
using
(D_u^{2m−1−k} D_v^k f)(0) = Σ_{i=1}^m ⟨u, a_i⟩^{2m−1−k} ⟨v, a_i⟩^k g_i^{(2m−1)}(0)
for k = 0, ..., 2m−1 and v_1, ..., v_d ∈ R^d, and solving this system of equations.
32 31/40 A. Cohen, I. Daubechies, R. DeVore, G. Kerkyacharian, D. Picard, '12
Capturing ridge functions in high dimensions from point queries
k = 1: f(x) = g(⟨a, x⟩), f : [0, 1]^d → R,
g ∈ C^s([0, 1]), 1 < s, ‖g‖_{C^s} ≤ M_0,
‖a‖_{ℓ_q^d} ≤ M_1, 0 < q ≤ 1, a ≥ 0.
Then
‖f − f̂‖_∞ ≤ C M_0 [ L^{−s} + M_1 ( (1 + log(d/L)) / L )^{1/q − 1} ]
using 3L + 2 sampling points.
33 32/40 First sampling along the diagonal: (i/L)·1 = (i/L)(1, ..., 1), i = 0, ..., L:
f((i/L)·1) = g(⟨(i/L)·1, a⟩) = g(i‖a‖_1/L).
Recovery of g on a grid of [0, ‖a‖_1].
Finding i_0 with the largest difference g((i_0 + 1)‖a‖_1/L) − g(i_0‖a‖_1/L).
Approximating D_{ϕ_j} f((i_0/L)·1) = g'(i_0‖a‖_1/L) ⟨a, ϕ_j⟩ by first order differences.
Then recovery of a from ⟨a, ϕ_1⟩, ..., ⟨a, ϕ_m⟩ by methods of compressed sensing (CS).
34 33/40 M. Fornasier, K. Schnass, J. V., Learning functions of few arbitrary linear parameters in high dimensions (2012)
Put f : B(0, 1) → R, ‖a‖_2 = 1, 0 < q ≤ 1, ‖a‖_q ≤ c, g ∈ C^2[−1, 1].
y_j := (f(hϕ_j) − f(0)) / h ≈ g'(0) ⟨a, ϕ_j⟩, j = 1, ..., m_Φ, m_Φ ≪ d,
where h > 0 is small and ϕ_{j,k} = ±1/√(m_Φ), k = 1, ..., d.
35 34/40 The y_j are the scalar products ⟨a, ϕ_j⟩ corrupted by deterministic noise.
ã = argmin_{z ∈ R^d} ‖z‖_1, s.t. ⟨ϕ_j, z⟩ = y_j, j = 1, ..., m_Φ.
â = ã/‖ã‖_2 - approximation of a.
ĝ is obtained by sampling f along â: ĝ(t) := f(â t), t ∈ (−1, 1).
Then f̂(x) := ĝ(⟨â, x⟩) has the approximation property
‖f − f̂‖_∞ ≤ C [ ( m_Φ / (log(d/m_Φ) + 1) )^{−(1/q − 1/2)} + h√(m_Φ) ].
36 35/40 Active coordinates:
R. DeVore, G. Petrova, P. Wojtaszczyk, Approximation of functions of few variables in high dimensions, Constr. Approx. '11
K. Schnass, J. V., Compressed learning of high-dimensional sparse functions, Proceedings of ICASSP '11
f(x) = g(x_{i_1}, ..., x_{i_k})
Use of low-rank matrix recovery: H. Tyagi, V. Cevher, Learning non-parametric basis independent models from point queries via low-rank methods, ACHA '14
37 36/40 I. Daubechies, M. Fornasier, J. V., Approximation of sums of ridge functions, in preparation:
Sums of ridge functions: recovery of f(x) = Σ_{j=1}^k g_j(⟨a_j, x⟩).
We would like to identify a_1, ..., a_k, then g_1, ..., g_k.
Step 1: Sampling ∇f(x) = Σ_{j=1}^k g_j'(⟨a_j, x⟩) a_j at different points gives elements of span{a_1, ..., a_k} ⊂ R^d.
Afterwards, we can reduce the dimension to d = k ... one k-dimensional problem ...
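A rough sketch of Step 1 on synthetic data: approximate ∇f at a few random points by finite differences; these (approximate) gradients lie in span{a_1, ..., a_k}, which can be read off from the leading right singular vectors. The profiles, dimensions, and number of sample points below are invented:

```python
import numpy as np

rng = np.random.default_rng(9)
d, k, h = 30, 2, 1e-5
A = rng.standard_normal((k, d))                # rows are the ridge vectors a_1, ..., a_k
profiles = [np.tanh, np.sin]
f = lambda x: sum(g(A[j] @ x) for j, g in enumerate(profiles))

def grad_fd(f, x, h):
    """Central finite-difference approximation of the gradient of f at x."""
    return np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(len(x))])

points = rng.standard_normal((10, d))          # a few sampling points
G = np.array([grad_fd(f, x, h) for x in points])   # rows lie (approximately) in span{a_1, ..., a_k}
U, S, Vt = np.linalg.svd(G)
span_est = Vt[:k]                              # orthonormal basis of the estimated span
print("singular values (only k of them are significant):", np.round(S, 3))
```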
38 37/40 Recovery of the individual a_i's for d = k?
Step 2: Second order derivatives: ∇²f(x) = Σ_{j=1}^k g_j''(⟨a_j, x⟩) a_j a_j^T.
We can recover L̃ - an approximation of L = span{a_i a_i^T, i = 1, ..., k} ⊂ R^{k×k}.
39 38/40 Step 3: We try to find the a_i a_i^T in L̃.
We look for matrices in L̃ with the smallest rank.
We analyze the non-convex problem argmax ‖M‖, s.t. M ∈ L̃, ‖M‖_F ≤ 1.
Every algorithm which is able to find a_1 a_1^T can also find a_j a_j^T, j = 2, ..., k; hence it must be non-convex.
40 39/40 Summary
Approximation theory and Machine Learning study similar problems from different points of view.
They can inspire/enrich one another.
Non-linear, non-convex classes; non-linear information.
Sparsity and other structural assumptions.
Open problem: geometrical properties of the class of functions which can be modeled by neural networks,
N_{d,L} = {f : R^d → R : f is a neural network with L layers}.
41 40/40 Thank you for your attention!