From approximation theory to machine learning
1 1/40 From approximation theory to machine learning
New perspectives in the theory of function spaces and their applications
September 2017, Bedlewo, Poland
Jan Vybíral, Charles University / Czech Technical University, Prague, Czech Republic
2 2/40 Introduction
Approximation Theory vs. Data Analysis (or Machine Learning, or Learning Theory)
Example I: ℓ1-minimization
Example II: Support vector machines
Example III: Ridge functions and neural networks
3 3/40 Approximation theory
K - class of objects of interest: vectors, functions, operators,
approximated by (same or simpler) objects, usually with only limited information about the unknown object...
with the similarity (approximation error) measured by some sort of distance.
Worst case - supremum of the approximation error over K.
Average case - mean value of the (square of the) approximation error over K w.r.t. some probability measure on K.
4 4/40 Approximation of multivariate functions
Let f : Ω ⊂ R^d → R be a function of many (d ≫ 1) variables.
We want to approximate f using only (a small number of) function values f(x_1), ..., f(x_n).
Questions:
Which functions (assumptions on f)?
How to measure the error?
How to choose the sampling points?
Decay of the error with growing number of sampling points?
Algorithms and optimality?
Dependence on d?
5 5/40 Data Analysis
Discovering structure in given data, allowing for predictions, conclusions, etc.
The given data can have different (and also heterogeneous) formats.
For inputs x_1, ..., x_n ∈ R^d and outputs y_1, ..., y_n ∈ R we would like to study the functional dependence y_i ≈ f(x_i).
In practice, the performance is often tested by learning on a subset of the data and testing on the rest.
To allow theoretical results, we often assume that the data is drawn independently from some (unknown) underlying probability distribution, i.e. (x_i, y_i) ∈ A with probability P(A).
6 6/40 Least squares & ridge regression
Least squares:
x_1, ..., x_n ∈ R^d: n data points (inputs) in R^d
y_1, ..., y_n ∈ R: n real values (outputs)
We look for a dependence y_i ≈ ⟨ω, x_i⟩.
We minimize Σ_{i=1}^n (y_i − ⟨ω, x_i⟩)^2 = ‖y − Xω‖_2^2 over ω ∈ R^d.
Ridge regression: adding the regularization term λ‖ω‖_2^2
LASSO: adding the regularization term λ‖ω‖_1
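A minimal numerical sketch of the least squares and ridge objectives above (synthetic data; the sizes and the value of λ are made up, not from the talk). The ℓ1 term of LASSO has no closed form and is revisited on the LASSO slide below.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 100, 10, 0.1
X = rng.standard_normal((n, d))                    # rows are the inputs x_1, ..., x_n
w_true = rng.standard_normal(d)
y = X @ w_true + 0.01 * rng.standard_normal(n)     # noisy outputs y_1, ..., y_n

# Least squares: minimize ||y - X w||_2^2 over w
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Ridge regression: minimize ||y - X w||_2^2 + lam * ||w||_2^2,
# closed form: w = (X^T X + lam * I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print(np.linalg.norm(w_ls - w_true), np.linalg.norm(w_ridge - w_true))
```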
7 7/40 Principal Component Analysis (PCA)
PCA: classical dimensionality-reduction technique.
Given data points x_1, ..., x_n ∈ R^d, minimize over subspaces L with dim L = k the distance of the x_i's to L, i.e.
min_{L : dim L = k} Σ_{i=1}^n ‖x_i − P_L x_i‖_2^2;
the answer is given by the singular value decomposition of the n × d data matrix.
... swiss army knife of numerical linear algebra ...
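As the slide says, the minimizing subspace can be read off from the SVD; a short numpy sketch on synthetic data (practical PCA usually centers the data first, which the slide's formulation does not require):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 200, 20, 3
X = rng.standard_normal((n, d))              # rows are the data points x_1, ..., x_n

U, S, Vt = np.linalg.svd(X, full_matrices=False)
B = Vt[:k]                                   # orthonormal basis of the optimal k-dim subspace L
P_X = (X @ B.T) @ B                          # projections P_L x_i (as rows)
residual = np.sum((X - P_X) ** 2)            # the minimized value of the objective
print(residual, np.sum(S[k:] ** 2))          # equal: the discarded singular values squared
```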
8 8/40 Support Vector Machines
V. N. Vapnik and A. Ya. Chervonenkis (1963)
B. E. Boser, I. M. Guyon and V. N. Vapnik (1992) - non-linear
C. Cortes and V. N. Vapnik (1995) - soft margin
Given data points x_1, ..., x_n ∈ R^d and labels y_i ∈ {−1, +1}, separate the sets {x_i : y_i = −1} and {x_i : y_i = +1} by a linear hyperplane,
i.e. f(x) = g(⟨a, x⟩), where g(t) = +1 if t ≥ 0 and g(t) = −1 if t < 0 is known.
9 9/40 Soft margin SVM:
min_{ω ∈ R^d} Σ_{i=1}^n (1 − y_i⟨ω, x_i⟩)_+ + λ‖ω‖_2^2
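A plain subgradient-descent sketch of this objective, for illustration only (the data, step size, and iteration count are invented; practical solvers use more refined optimization):

```python
import numpy as np

def soft_margin_svm(X, y, lam=0.1, steps=2000, lr=0.01):
    """Minimize sum_i (1 - y_i <w, x_i>)_+ + lam * ||w||_2^2 by subgradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        active = y * (X @ w) < 1                      # points with positive hinge loss
        subgrad = -(y[active, None] * X[active]).sum(axis=0) + 2 * lam * w
        w -= lr * subgrad
    return w

rng = np.random.default_rng(2)
a = np.array([1.0, -1.0, 0.5])
X = rng.standard_normal((200, 3))
y = np.sign(X @ a)                                    # labels from a linear classifier
w = soft_margin_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```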
10 10/40 Limits of the theory...
All theory, dear friend, is grey, but the golden tree of actual life springs ever green. Johann Wolfgang von Goethe
It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts. Sherlock Holmes, A Scandal in Bohemia
Approximation theory:
The class of functions of interest is unclear.
The data points sometimes cannot be designed.
Data analysis - learning theory:
The independence is not always ensured.
The underlying probability distribution (and its basic properties) is unknown.
The success, measured on left-out data, differs from one case to the other.
11 11/40 Active choice of points!
A link between data analysis and approximation theory. Difference between regression and sampling? ... the active choice of the points ...
Nowadays, more and more data analysis is done on calculated data!
In materials science, material properties (like color or conductivity) are calculated ab initio from the molecular information, not measured in the lab.
Typically, these properties are governed by a PDE (like Schrödinger's equation), where the structure of the material comes in as initial data.
12 12/40 Calculation of one property for one material is then a numerical solution of a parametric PDE: PDE(u, p, x) = 0.
The (unknown) functional dependence f = f(p) of the material property on the input parameters is calculated for materials described by parameters p_1, ..., p_m.
Sampling f at p_j = querying the solution = running an expensive simulation.
Nearly noise-free sampling is possible!
Trade-off: more noisy samples vs. fewer, less noisy samples.
The materials with parameters p_1, ..., p_m can be actively chosen!
13 13/40 Summary of the introduction
Approximation theory and Data Analysis often study similar problems from different points of view.
The choice of approximation algorithms is free in both areas.
Approximation theory (usually) assumes that the data can be designed/chosen.
The aim of the talk is to show the benefits of combining both approaches.
14 14/40 ℓ1-norm minimization - Data Analysis
LASSO (least absolute shrinkage and selection operator): Tibshirani (1996)
Least squares with the penalization term λ‖ω‖_1.
X = (x_1, ..., x_n) ∈ R^{n×d}: n data points (inputs) in R^d
y = (y_1, ..., y_n) ∈ R^n: n real values (outputs)
LASSO: argmin_{ω ∈ R^d} ‖y − Xω‖_2^2 + λ‖ω‖_1
weights the fit ‖y − Xω‖_2 against the number of non-zero coordinates of ω.
Leaves open a number of questions - how many data points, and distributed how, are sufficient to identify/approximate ω?
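A short iterative soft-thresholding (ISTA) sketch for this objective; an illustration on invented data, not the solver used in the references:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, steps=500):
    """Minimize ||y - X w||_2^2 + lam * ||w||_1 by proximal gradient descent."""
    step = 0.5 / np.linalg.norm(X, 2) ** 2       # = 1/L, L the Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y)
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(3)
n, d, s = 50, 200, 5
X = rng.standard_normal((n, d))
w_true = np.zeros(d); w_true[:s] = 1.0           # sparse ground truth
w_hat = lasso_ista(X, X @ w_true, lam=0.1)
print("recovered support:", np.flatnonzero(np.abs(w_hat) > 0.1))
```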
15 15/40 ℓ1-norm minimization - Approximation theory
Compressed Sensing: Let w ∈ R^d be sparse (with a small number s ≪ d of non-zero coefficients on unknown positions) and consider the linear function g(x) = ⟨x, w⟩.
Design n (as small as possible) sampling points x_1, ..., x_n, take the function values y_j = g(x_j) = ⟨x_j, w⟩, and find a recovery mapping Δ : R^n → R^d such that Δ(y) is equal/close to w.
Take the x_j's independent at random, i.e. y = Xw.
Use ℓ1-minimization for recovery: Δ(y) = argmin_{ω ∈ R^d} ‖ω‖_1 s.t. Xω = y.
It is enough to take n ≳ s log(d).
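The equality-constrained ℓ1-minimization above is a linear program after splitting off the absolute values (|ω_i| ≤ t_i); a sketch with scipy on a small synthetic instance:

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, y):
    """Solve min ||w||_1 s.t. X w = y via the LP  min sum(t)  s.t.  -t <= w <= t, X w = y."""
    n, d = X.shape
    c = np.concatenate([np.zeros(d), np.ones(d)])
    A_ub = np.block([[np.eye(d), -np.eye(d)],          #  w - t <= 0
                     [-np.eye(d), -np.eye(d)]])        # -w - t <= 0
    A_eq = np.hstack([X, np.zeros((n, d))])            #  X w = y
    bounds = [(None, None)] * d + [(0, None)] * d      #  w free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * d), A_eq=A_eq, b_eq=y, bounds=bounds)
    return res.x[:d]

rng = np.random.default_rng(4)
n, d, s = 40, 100, 4                                   # roughly s*log(d) Gaussian measurements
X = rng.standard_normal((n, d))
w = np.zeros(d); w[rng.choice(d, s, replace=False)] = rng.standard_normal(s)
w_rec = basis_pursuit(X, X @ w)
print("recovery error:", np.linalg.norm(w_rec - w))
```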
17 16/40 ℓ1-norm minimization - Approximation theory
Theorem (Donoho; Candès, Romberg, Tao). There is a constant C > 0 such that if n ≥ Cs log(d) and the components of the x_i are independent Gaussian random variables, then Δ(y) = w (with high probability).
Many different aspects (other choices of x_i, noisy sampling, nearly-sparse vectors w ∈ R^d, etc.) were studied intensively.
Benefits for Data Analysis:
Estimates of the minimal number of data points.
If possible, choose mostly uncorrelated data points.
Better information about the possibilities and limits of LASSO.
18 17/40 ℓ1-SVM - Data Analysis
ℓ1-SVM:
P. S. Bradley, O. L. Mangasarian (1998)
J. Zhu, S. Rosset, T. Hastie, R. Tibshirani (2004)
Support vector machine with an ℓ1-penalization term.
X = (x_1, ..., x_n) ∈ R^{n×d}: n data points (inputs) in R^d
y = (y_1, ..., y_n) ∈ {−1, +1}^n: n labels
ℓ1-SVM: min_{ω ∈ R^d} Σ_{i=1}^n (1 − y_i⟨ω, x_i⟩)_+ + λ‖ω‖_1
Weights the quality of the classification against the number of non-zero coordinates of ω.
Leaves open a number of questions - how many data points, and distributed how, are sufficient to identify ω?
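For illustration, scikit-learn's LinearSVC accepts an ℓ1 penalty; note that it then uses the squared hinge loss, so it is only a close relative of the ℓ1-SVM written above. The data and parameters below are invented:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(5)
n, d, s = 100, 500, 5
w_true = np.zeros(d); w_true[:s] = 1.0                 # sparse separating direction
X = rng.standard_normal((n, d))
y = np.sign(X @ w_true)

clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=1.0)
clf.fit(X, y)
w_hat = clf.coef_.ravel()
print("non-zero coordinates found:", np.flatnonzero(np.abs(w_hat) > 1e-6))
```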
19 18/40 ℓ1-SVM - Data Analysis
Standard technique of sparse classification:
Bioinformatics - gene selection, microarray classification, cancer classification
Feature selection
Face recognition
...
20 19/40 ℓ1-SVM - Data Analysis
Results:
I. Steinwart, Support vector machines are universally consistent, J. Complexity (2002)
I. Steinwart, Consistency of support vector machines and other regularized kernel classifiers, IEEE Trans. Inf. Theory (2005):
The SVMs are consistent (in the general setting with kernels), i.e. they can learn the hidden classifier for n → ∞ almost surely; the parameter of the SVM has to grow to infinity.
21 20/40 ℓ1-SVM - Approximation theory
We want to identify a sparse (or nearly sparse) ω ∈ R^d.
We assume ω ∈ K = {x ∈ R^d : ‖x‖_p ≤ 1}, p < 1.
We are allowed to take only 1-bit measurements y_i = sign(⟨ω, x_i⟩) - non-linear!
We are allowed to design the sampling points x_1, ..., x_n ∈ R^d and a (non-linear) recovery algorithm.
Recent area of 1-bit Compressed Sensing.
22 21/40 ℓ1-SVM - Approximation theory
1-bit Compressed Sensing: Boufounos & Baraniuk (2008), Y. Plan & R. Vershynin (2013)
ω ∈ K = {x ∈ R^d : ‖x‖_2 ≤ 1, ‖x‖_1 ≤ √s}
Result: For n ≥ Cδ^{-6} w(K)^2, x_i ~ N(0, id), y_i = sign(⟨ω, x_i⟩), and
ω̂ := argmax_{w ∈ K} Σ_{i=1}^n y_i⟨x_i, w⟩,
it holds that ‖ω̂ − ω‖_2^2 ≤ δ log(e/δ) with probability at least 1 − 8 exp(−cδ^2 n).
Mean Gaussian width: w(K) = E sup_{x ∈ K−K} ⟨g, x⟩
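For intuition: if K were just the Euclidean unit ball, the maximizer of the linear functional above is the normalized sum Σ_i y_i x_i. A tiny sketch of that special case (the theorem itself works over the smaller set K with the ℓ1 constraint; the data below is synthetic):

```python
import numpy as np

rng = np.random.default_rng(6)
d, n = 200, 2000
omega = np.zeros(d); omega[:3] = [0.8, -0.5, 0.33]
omega /= np.linalg.norm(omega)                 # unknown unit vector

X = rng.standard_normal((n, d))                # rows x_i ~ N(0, id)
y = np.sign(X @ omega)                         # 1-bit measurements y_i = sign(<omega, x_i>)

# argmax over ||w||_2 <= 1 of sum_i y_i <x_i, w> is (sum_i y_i x_i), normalized
w_hat = X.T @ y
w_hat /= np.linalg.norm(w_hat)
print("estimation error:", np.linalg.norm(w_hat - omega))
```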
23 22/40 ℓ1-SVM - Approximation theory
What if we insist on the ℓ1-SVM as the recovery algorithm?
Setting: ‖ω‖_2 = 1, ‖ω‖_1 ≤ R,
x_i ~ N(0, r^2 id), y_i = sign(⟨x_i, ω⟩), i = 1, ..., n,
ω̂ ∈ R^d a minimizer of the ℓ1-SVM
min_{w ∈ R^d} Σ_{i=1}^n [1 − y_i⟨x_i, w⟩]_+ subject to ‖w‖_1 ≤ R.
Theorem (Kolleck, V. (2016)). Let 0 < ε < 0.18, r > 2π(0.57 − πε)^{-1}, n ≥ Cε^{-2} r^2 R^2 ln(d). Then it holds that
‖ω − ω̂/‖ω̂‖_2‖_2 ≤ C(ε + 1/r)
with probability at least 1 − exp(−C ln(d)).
24 23/40 ℓ1-SVM
Summary on SVMs:
A non-asymptotic analysis of (ℓ1-)SVMs allows one to predict the limits of the algorithm.
For random (i.e. highly uncorrelated) data, a small number n of measurements/data points is sufficient.
Motivated by the analysis, we proposed an SVM which combines the ℓ1 and the ℓ2 penalty; this method performs better in the analysis, but also in numerical simulations!
Use of the 1-bit CS algorithm in Machine Learning?
25 24/40 Ridge functions & Neural networks
Ridge functions: Let g : R → R and a ∈ R^d \ {0}. The ridge function with ridge profile g and ridge vector a is the function f(x) := g(⟨a, x⟩).
It is constant along the hyperplane a^⊥ = {y ∈ R^d : ⟨y, a⟩ = 0} and its translates.
More generally, if g : R^k → R and A ∈ R^{k×d} with k ≤ d, then f(x) := g(Ax) is a k-ridge function.
26 25/40 Ridge functions in mathematics
Kernels of important transforms (Fourier, Radon).
Plane waves in PDEs: solutions to Π_{i=1}^r (b_i ∂/∂x − a_i ∂/∂y) F = 0 are of the form F(x, y) = Σ_{i=1}^r f_i(a_i x + b_i y).
Ridgelets, curvelets, shearlets, ...: wavelet-like frames capturing singularities along curves and edges (Candès, Donoho, Kutyniok, Labate, ...).
27 26/40 Ridge functions in approximation theory
Approximation of a function by functions from the dictionary D_ridge = {ϱ(⟨k, x⟩ − b) : k ∈ R^d, b ∈ R}.
Fundamentality, greedy algorithms.
Lin & Pinkus, Fundamentality of ridge functions, J. Approx. Theory 75 (1993), no. 3
Cybenko, Approximation by superpositions of a sigmoidal function, Math. Control Signals Systems 2 (1989)
Leshno, Lin, Pinkus & Schocken, Multilayer feedforward networks with a nonpolynomial activation function can approximate any function, Neural Networks 6 (1993)
28 27/40 Neural networks
Motivated by biological research on the human brain and neurons: W. McCulloch, W. Pitts (1943); M. Minsky, S. Papert (1969).
Artificial neuron: ... gets activated if a linear combination of its inputs grows over a certain threshold ...
Inputs x = (x_1, ..., x_n) ∈ R^n
Weights w = (w_1, ..., w_n) ∈ R^n
Comparing ⟨w, x⟩ with a threshold b ∈ R
Plugging the result into the activation function - a jump (or smoothed jump) function σ
An artificial neuron is a function x ↦ σ(⟨x, w⟩ − b), where σ : R → R might be σ(x) = sign(x) or σ(x) = e^x/(1 + e^x), etc.
29 28/40 Neural networks
An artificial neural network is a directed, acyclic graph of artificial neurons.
Input: x = (x_1, ..., x_n) ∈ R^n
First layer of neurons: y_1 = σ(⟨x, w_1^1⟩ − b_1^1), ..., y_{n_1} = σ(⟨x, w_{n_1}^1⟩ − b_{n_1}^1)
The outputs y = (y_1, ..., y_{n_1}) become inputs for the next layer, ...; the last layer outputs y ∈ R.
Deep Learning relies on an artificial neural network with many layers.
Training the network: given inputs x_1, ..., x_N and outputs y_1, ..., y_N, optimize over the weights w and thresholds b.
Non-convex minimization over a huge space...???
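A minimal numpy sketch of the forward pass just described: one hidden layer, the smoothed-jump activation from the previous slide, and random weights (the slide defines only the architecture, not a training procedure):

```python
import numpy as np

def sigma(t):
    return np.exp(t) / (1.0 + np.exp(t))       # smoothed jump (logistic) activation

def forward(x, layers):
    """layers = [(W, b), ...]; each neuron computes sigma(<x, w> - b)."""
    for W, b in layers:
        x = sigma(W @ x - b)
    return x

rng = np.random.default_rng(7)
d, n1 = 5, 8                                   # input dimension, hidden-layer width
layers = [(rng.standard_normal((n1, d)), rng.standard_normal(n1)),   # first layer
          (rng.standard_normal((1, n1)), rng.standard_normal(1))]    # output layer
x = rng.standard_normal(d)
print("network output:", forward(x, layers))
```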
30 29/40 Approximation of ridge functions (and their sums)
k = 1: f(x) = g(⟨a, x⟩), ‖a‖_2 = 1, g smooth.
The approximation has two parts: approximation of g and of a.
Recovery of a from ∇f(x): ∇f(x) = g'(⟨a, x⟩) a, ∇f(0) = g'(0) a.
After recovering a, the problem becomes essentially one-dimensional and one can use an arbitrary sampling method to approximate g.
g'(0) ≠ 0 ... normalize g'(0) = 1.
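A sketch of the gradient step for k = 1: approximate ∇f(0) by central differences along the coordinate directions and normalize (g and a below are invented, with g'(0) = 1 as in the normalization above). This uses about 2d samples; the approaches on the following slides need far fewer when a is sparse or compressible.

```python
import numpy as np

rng = np.random.default_rng(8)
d, h = 50, 1e-5
a = rng.standard_normal(d); a /= np.linalg.norm(a)
g = lambda t: t + 0.3 * t**2                   # smooth profile with g'(0) = 1
f = lambda x: g(x @ a)                         # ridge function f(x) = g(<a, x>)

# central differences along the coordinates approximate grad f(0) = g'(0) a
grad = np.array([(f(h * e) - f(-h * e)) / (2 * h) for e in np.eye(d)])
a_hat = grad / np.linalg.norm(grad)
print("error in the ridge direction:", np.linalg.norm(a_hat - a))
```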
31 30/40 Buhmann & Pinkus '99
Identifying linear combinations of ridge functions: approximation of functions
f(x) = Σ_{i=1}^m g_i(⟨a_i, x⟩), x ∈ R^d,
g_i ∈ C^{2m−1}(R), i = 1, ..., m; g_i^{(2m−1)}(0) ≠ 0, i = 1, ..., m;
using
(D_u^{2m−1−k} D_v^k f)(0) = Σ_{i=1}^m ⟨u, a_i⟩^{2m−1−k} ⟨v, a_i⟩^k g_i^{(2m−1)}(0)
for k = 0, ..., 2m−1 and v_1, ..., v_d ∈ R^d, and solving this system of equations.
32 31/40 A. Cohen, I. Daubechies, R. DeVore, G. Kerkyacharian, D. Picard, '12
Capturing ridge functions in high dimensions from point queries
k = 1: f(x) = g(⟨a, x⟩), f : [0, 1]^d → R,
g ∈ C^s([0, 1]), 1 < s, ‖g‖_{C^s} ≤ M_0,
‖a‖_{ℓ_q^d} ≤ M_1, 0 < q ≤ 1, a ≥ 0.
Then
‖f − f̂‖_∞ ≤ C M_0 [ L^{−s} + M_1 ( (1 + log(d/L)) / L )^{1/q − 1} ]
using 3L + 2 sampling points.
33 32/40 First sampling along the diagonal: (i/L)·1 = (i/L)(1, ..., 1), i = 0, ..., L:
f((i/L)·1) = g(⟨(i/L)·1, a⟩) = g(i‖a‖_1/L).
Recovery of g on a grid of [0, ‖a‖_1].
Finding i_0 with the largest difference g((i_0 + 1)‖a‖_1/L) − g(i_0‖a‖_1/L).
Approximating D_{ϕ_j} f((i_0/L)·1) = g'(i_0‖a‖_1/L) ⟨a, ϕ_j⟩ by first order differences.
Then recovery of a from ⟨a, ϕ_1⟩, ..., ⟨a, ϕ_m⟩ by methods of compressed sensing (CS).
34 33/40 M. Fornasier, K. Schnass, J. V., Learning functions of few arbitrary linear parameters in high dimensions (2012)
Put f : B(0, 1) → R, ‖a‖_2 = 1, 0 < q ≤ 1, ‖a‖_q ≤ c, g ∈ C^2[−1, 1].
y_j := (f(hϕ_j) − f(0)) / h ≈ g'(0) ⟨a, ϕ_j⟩, j = 1, ..., m_Φ, m_Φ ≪ d,
where h > 0 is small and ϕ_{j,k} = ±1/√(m_Φ), k = 1, ..., d.
35 34/40 The y_j are the scalar products ⟨a, ϕ_j⟩ corrupted by deterministic noise.
ã = argmin_{z ∈ R^d} ‖z‖_1, s.t. ⟨ϕ_j, z⟩ = y_j, j = 1, ..., m_Φ.
â = ã/‖ã‖_2 - approximation of a.
ĝ is obtained by sampling f along â: ĝ(t) := f(â t), t ∈ (−1, 1).
Then f̂(x) := ĝ(⟨â, x⟩) has the approximation property
‖f − f̂‖_∞ ≤ C [ ( m_Φ / (log(d/m_Φ) + 1) )^{−(1/q − 1/2)} + h√(m_Φ) ].
36 35/40 Active coordinates:
R. DeVore, G. Petrova, P. Wojtaszczyk, Approximation of functions of few variables in high dimensions, Constr. Approx. '11
K. Schnass, J. V., Compressed learning of high-dimensional sparse functions, Proceedings of ICASSP '11
f(x) = g(x_{i_1}, ..., x_{i_k})
Use of low-rank matrix recovery: H. Tyagi, V. Cevher, Learning non-parametric basis independent models from point queries via low-rank methods, ACHA '14
37 36/40 I. Daubechies, M. Fornasier, J. V., Approximation of sums of ridge functions, in preparation:
Sums of ridge functions: recovery of f(x) = Σ_{j=1}^k g_j(⟨a_j, x⟩).
We would like to identify a_1, ..., a_k, then g_1, ..., g_k.
Step 1: Sampling ∇f(x) = Σ_{j=1}^k g_j'(⟨a_j, x⟩) a_j at different points gives elements of span{a_1, ..., a_k} ⊂ R^d.
Afterwards, we can reduce the dimension to d = k ... one k-dimensional problem ...
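A rough sketch of Step 1 on synthetic data: approximate ∇f at a few random points by finite differences; these (approximate) gradients lie in span{a_1, ..., a_k}, which can be read off from the leading right singular vectors. The profiles, dimensions, and number of sample points below are invented:

```python
import numpy as np

rng = np.random.default_rng(9)
d, k, h = 30, 2, 1e-5
A = rng.standard_normal((k, d))                # rows are the ridge vectors a_1, ..., a_k
profiles = [np.tanh, np.sin]
f = lambda x: sum(g(A[j] @ x) for j, g in enumerate(profiles))

def grad_fd(f, x, h):
    """Central finite-difference approximation of the gradient of f at x."""
    return np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(len(x))])

points = rng.standard_normal((10, d))          # a few sampling points
G = np.array([grad_fd(f, x, h) for x in points])   # rows lie (approximately) in span{a_1, ..., a_k}
U, S, Vt = np.linalg.svd(G)
span_est = Vt[:k]                              # orthonormal basis of the estimated span
print("singular values (only k of them are significant):", np.round(S, 3))
```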
38 37/40 Recovery of the individual a_i's for d = k?
Step 2: Second order derivatives: ∇²f(x) = Σ_{j=1}^k g_j''(⟨a_j, x⟩) a_j a_j^T.
We can recover L̃ - an approximation of L = span{a_i a_i^T, i = 1, ..., k} ⊂ R^{k×k}.
39 38/40 Step 3: We try to find the a_i a_i^T in L̃.
We look for matrices in L̃ with the smallest rank.
We analyze the non-convex problem argmax ‖M‖, s.t. M ∈ L̃, ‖M‖_F ≤ 1.
Every algorithm which is able to find a_1 a_1^T can also find a_j a_j^T, j = 2, ..., k; hence it must be non-convex.
40 39/40 Summary
Approximation theory and Machine Learning study similar problems from different points of view.
They can inspire/enrich one another.
Non-linear, non-convex classes; non-linear information.
Sparsity and other structural assumptions.
Open problem: geometrical properties of the class of functions which can be modeled by neural networks,
N_{d,L} = {f : R^d → R : f is a neural network with L layers}.
41 40/40 Thank you for your attention!