Optimization Methods for Machine Learning Beyond Perceptron Feed Forward neural networks (FFN)


1 Optimization Methods for Machine Learning Beyond Perceptron Feed Forward neural networks (FFN)
Laura Palagi, http://www.dis.uniroma1.it/~palagi
Dipartimento di Ingegneria informatica automatica e gestionale A. Ruberti, Sapienza Università di Roma, Via Ariosto 25

2 Perceptron algorithm
Linear classifier: $f_{w,b}(x) = \mathrm{sgn}(w^T x + b)$.
Find $w, b$ such that the empirical risk $R_{emp} = 0$. This works only for linearly separable sets.
Alternatively, use a continuously differentiable sigmoidal activation function $g$ and minimize the average quadratic error
$$E(w) = \frac{1}{2} \sum_{p=1}^{P} \left( y^p - g(w^T x^p + b) \right)^2 .$$
Find $w, b$ minimizing $E(w)$ (without requiring $R_{emp} = 0$).
Linear classifier ($y^p \in \{-1, 1\}$): $f_{w,b}(x) = \mathrm{sign}(w^T x + b)$
Linear regression ($y^p \in \mathbb{R}^m$): $f_{w,b}(x) = g(w^T x + b)$
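
The perceptron update rule is not spelled out on the slide; the following is a minimal sketch of the classic version, assuming labels in {-1, +1} and plain Python lists. The function names `perceptron` and `predict` and the toy data (logical AND) are illustrative assumptions, not taken from the slides.

```python
def perceptron(X, y, max_epochs=100):
    """Classic perceptron updates; X is a list of input vectors, y has labels in {-1, +1}."""
    n = len(X[0])
    w = [0.0] * n
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x_p, y_p in zip(X, y):
            score = sum(w_i * x_i for w_i, x_i in zip(w, x_p)) + b
            # A sample is misclassified when y^p (w^T x^p + b) <= 0.
            if y_p * score <= 0:
                w = [w_i + y_p * x_i for w_i, x_i in zip(w, x_p)]
                b += y_p
                mistakes += 1
        if mistakes == 0:  # empirical risk is zero on the training set
            break
    return w, b


def predict(w, b, x):
    """Linear classifier sgn(w^T x + b), returning -1 on the boundary."""
    return 1 if sum(w_i * x_i for w_i, x_i in zip(w, x)) + b > 0 else -1


# Hypothetical linearly separable toy data: logical AND with labels in {-1, +1}.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y = [-1, -1, -1, 1]
w, b = perceptron(X, y)
predictions = [predict(w, b, x) for x in X]
```

On a set that is not linearly separable the loop would simply run out of epochs, which is the limitation that motivates the differentiable alternative.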

3 Supervised learning regression problem
Given observed input-output pairs
$$T = \{(x^p, y^p) : x^p \in X \subseteq \mathbb{R}^n,\ y^p \in \mathbb{R}^m,\ p = 1,\dots,P\}$$
drawn with an unknown probability distribution, we look for a function $f \in F$, parametrized by $\omega \in \mathbb{R}^d$, such that $f(x^p;\omega)$ is close to the true $y^p$.
How supervised learning works:
choose the class $F$: $E_{app}$ measures how closely functions in $F$ can approximate the optimal solution $f^*$;
minimize the empirical risk: $E_{est}$ measures the effect of minimizing the empirical risk.
$$\underbrace{R(f^*,\omega^*) - R(f_{emp},\omega_{emp})}_{E_{app} + E_{est}}$$

4 Large scale learning
Large-scale supervised machine learning: large $P$, large $n$, large $d$.
$d$: number of parameters
$n$: dimension of each observation (input)
$P$: number of observations
In the large scale setting another term affects the error:
$$\underbrace{R(f^*,\omega^*) - R(f_{emp},\omega_{emp})}_{E_{app} + E_{est} + E_{opt}}$$
The optimization error $E_{opt}$ reflects the fact that, when computational time is restricted, algorithms return an approximate solution $\tilde{\omega}$ of $\min R_{emp}$ rather than $\omega_{emp}$.

5 Tradeoff of large scale learning [Bousquet and Bottou, 2008]
The best optimization algorithms are not the best learning algorithms.
For large-scale problems, the iteration complexity may not be an appropriate measure of goodness; the cost of achieving a given accuracy $\varepsilon$ may be a better one.
This observation has led to a huge effort to define algorithms that optimize $R_{emp}(\omega)$ and reach a given accuracy in a short time, rather than algorithms that reach high accuracy but take longer.

6 The learning paradigm
Optimize $R_{emp}(\omega)$, BUT use $R_{test}(\omega)$, as a surrogate of $R$, to check the goodness of the solution (generalization properties).
To penalize the complexity of the class (that is, to control the estimation-approximation tradeoff), add a regularization term to $R_{emp}$.
Minimization of the regularized empirical risk:
$$\min_{\omega \in \mathbb{R}^q}\ \underbrace{\frac{1}{P} \sum_{p=1}^{P} \ell(f(x^p;\omega), y^p)}_{\text{loss function}} + \underbrace{R(\lambda,\omega)}_{\text{regularization term}}$$
which depends on an additional hyper-parameter $\lambda$.

7 Define the optimization learning problem
In order to define the optimization problem:
choose the measure of closeness: loss function $\ell$
choose the class of functions $F$
choose the regularization term $R$
Typical losses $\ell(\hat{y}^p, y^p)$:
for regression, quadratic (square) loss: $\|\hat{y}^p - y^p\|^2$
hinge loss: $\max\{0, 1 - y^p \hat{y}^p\}$
logistic loss: $\log(1 + e^{-y^p \hat{y}^p})$
The regularization term is usually $R(\omega,\lambda) = \lambda \|\omega\|_p^p$, with $\lambda \ge 0$:
$\lambda \|\omega\|_2^2$: ridge regularization
$\lambda \|\omega\|_1$: LASSO method
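
The losses and the regularizer above can be written down directly. This is a minimal sketch for scalar predictions, assuming the sign conventions of the standard hinge and logistic losses; all function names here are illustrative, not from the slides.

```python
import math


def quadratic_loss(y_hat, y):
    """Square loss (y_hat - y)^2 for a scalar prediction."""
    return (y_hat - y) ** 2


def hinge_loss(y_hat, y):
    """max{0, 1 - y * y_hat}, with y in {-1, +1}."""
    return max(0.0, 1.0 - y * y_hat)


def logistic_loss(y_hat, y):
    """log(1 + exp(-y * y_hat)), with y in {-1, +1}."""
    return math.log(1.0 + math.exp(-y * y_hat))


def lp_regularizer(omega, lam, p=2):
    """lam * ||omega||_p^p; p=2 gives ridge, p=1 gives the LASSO penalty."""
    return lam * sum(abs(w) ** p for w in omega)


def regularized_risk(loss, predictions, targets, omega, lam, p=2):
    """Average loss over the P samples plus the regularization term."""
    P = len(targets)
    data_term = sum(loss(y_hat, y) for y_hat, y in zip(predictions, targets)) / P
    return data_term + lp_regularizer(omega, lam, p)
```

For instance, `regularized_risk(hinge_loss, predictions, targets, omega, lam)` would give the soft-margin objective for whatever model produced `predictions`.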

8 Examples with Linear class
Linear $f = w^T x$, regularization term $\|w\|_2^2$: convex problems.
Ridge Regression (quadratic loss):
$$R_{emp} = \frac{1}{P} \sum_{p=1}^{P} (w^T x^p - y^p)^2 + \lambda \|w\|^2$$
Unbiased soft Linear SVM (hinge loss):
$$R_{emp} = \frac{1}{P} \sum_{p=1}^{P} \max\{0, 1 - y^p w^T x^p\} + \lambda \|w\|^2$$
Logistic Regression (logistic loss):
$$R_{emp} = \frac{1}{P} \sum_{p=1}^{P} \log(1 + e^{-y^p w^T x^p}) + \lambda \|w\|^2$$

9 Feedforward Neural Network (FNN)
It is a layered structure.
[Figure: input nodes $x_1, x_2$; 1st layer (hidden); 2nd layer (hidden); 3rd layer (output) producing $y(x)$]

10 FNN architectures
Different structures depending on:
Number of layers $L$ and number of neurons per layer $N_l$: they affect the dimension of the optimization problem. $L = 1$: shallow network; $L > 1$: deep network.
Hidden unit type (activation function $g$ and its hyper-parameters $\pi$): it affects the structure of the optimization problem.
The unit type $g$ defines the FNN architecture:
1. Multilayer Perceptron networks (shallow and deep networks): the activation function $g$ acts as a trigger (sigmoid, tanh, etc.) and may depend on hyper-parameters; $\omega$ are the weights on the arcs connecting units (the biases in the units are included as fictitious inputs).
2. Radial Basis Function networks (later on).

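11 Internal structure of a single neuron $j$ at layer $l$
[Figure: inputs $z_i^{(l-1)}$, $i = 1,\dots,N^{(l-1)}$, weighted by $w_{ji}^{(l)}$ and summed together with the bias $b_j^{(l)}$ (fictitious input $z_0^{(l-1)} = 1$) into $a_j^{(l)}$, which is passed through $g_j^{(l)}$]
$$z_j^{(l)} = g_j^{(l)}\big(a_j^{(l)}\big)$$
$$a_j^{(l)} = \sum_{i=1}^{N^{(l-1)}} w_{ji}^{(l)} z_i^{(l-1)} + b_j^{(l)} = \sum_{i=0}^{N^{(l-1)}} w_{ji}^{(l)} z_i^{(l-1)}$$
where $z_0^{(l-1)} = 1$ and $w_{j0}^{(l)} = b_j^{(l)}$.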
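
A minimal sketch of the computation inside one neuron, showing that the explicit-bias form and the fictitious-input form give the same output. The function names and the logistic activation used in the example are assumptions made for illustration.

```python
import math


def neuron_output(z_prev, weights, bias, g):
    """z_j = g(a_j) with a_j = sum_i w_ji * z_i + b_j (explicit bias)."""
    a = sum(w * z for w, z in zip(weights, z_prev)) + bias
    return g(a)


def neuron_output_augmented(z_prev, weights_with_bias, g):
    """Same neuron with the bias stored as weight w_j0 on a fictitious input z_0 = 1."""
    z_aug = [1.0] + list(z_prev)
    a = sum(w * z for w, z in zip(weights_with_bias, z_aug))
    return g(a)


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


# Hypothetical values for the outputs of layer l-1, the weights and the bias.
z_prev = [1.0, 2.0]
weights = [0.5, -1.0]
bias = 0.25
z_explicit = neuron_output(z_prev, weights, bias, sigmoid)
z_augmented = neuron_output_augmented(z_prev, [bias] + weights, sigmoid)
```

Storing the bias as an extra weight is what lets all parameters of the network be collected in a single vector $\omega$.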

12 Logistic/sigmoid
The MLP activation function $g$ acts as a trigger:
$$g(t) = \frac{1}{1 + e^{-ct}}, \qquad \dot{g}(t) = \frac{c\, e^{-ct}}{(1 + e^{-ct})^2}, \qquad c > 0$$
It is a differentiable approximation of a threshold function (or Heaviside step function), which is obtained, in the limit, for hyper-parameter $c \to \infty$.

13 Hyperbolic tangent
The hyper-parameter is $c$:
$$g(t) = \tanh(ct/2) = \frac{1 - e^{-ct}}{1 + e^{-ct}}, \qquad \dot{g}(t) = \frac{2c\, e^{-ct}}{(1 + e^{-ct})^2}, \qquad c > 0$$
Figure: $\tanh(ct/2)$ for $c = 5, 1, 0.5$
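
The closed-form derivatives of the logistic and hyperbolic tangent activations can be checked numerically. The sketch below assumes the standard forms $1/(1+e^{-ct})$ and $(1-e^{-ct})/(1+e^{-ct})$; the helper `central_difference` and all function names are illustrative.

```python
import math


def logistic(t, c=1.0):
    return 1.0 / (1.0 + math.exp(-c * t))


def logistic_prime(t, c=1.0):
    e = math.exp(-c * t)
    return c * e / (1.0 + e) ** 2


def tanh_act(t, c=1.0):
    e = math.exp(-c * t)
    return (1.0 - e) / (1.0 + e)


def tanh_act_prime(t, c=1.0):
    e = math.exp(-c * t)
    return 2.0 * c * e / (1.0 + e) ** 2


def central_difference(f, t, h=1e-6):
    """Second-order finite-difference approximation of f'(t)."""
    return (f(t + h) - f(t - h)) / (2.0 * h)


# Largest gap between analytic and numerical derivatives on a few sample points.
max_gap = 0.0
for c in (0.5, 1.0, 5.0):
    for t in (-1.0, 0.0, 0.7):
        gap_logistic = abs(central_difference(lambda s: logistic(s, c), t) - logistic_prime(t, c))
        gap_tanh = abs(central_difference(lambda s: tanh_act(s, c), t) - tanh_act_prime(t, c))
        max_gap = max(max_gap, gap_logistic, gap_tanh)
```

Larger values of `c` make both functions steeper around the origin, which is the sense in which they approach a threshold function.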

14 Two layer MLP
[Figure: two-layer MLP with inputs $x_1, x_2$, three hidden neurons with activation $g$, input weights $w_{ji}$, biases $b_j$, output weights $v_j$, and output $y(x)$]

15 Two layer MLP
$N$: number of neurons of the hidden layer;
$w_{ji}$: weight of the arc connecting input node $i$ with neuron $j$ of the hidden layer;
$b_j$: threshold of hidden neuron $j$;
$v_j$: weight of the arc connecting hidden neuron $j$ to the output;
$g$: activation function of the hidden neurons;
the activation function of the output neuron is a linear function of the inputs.
Then we can write
$$y(x) = \sum_{j=1}^{N} v_j\, g\Big( \sum_{i=1}^{n} w_{ji} x_i + b_j \Big) = \sum_{j=1}^{N} v_j\, g\big( w_j^T x + b_j \big)$$
where $w_j = (w_{j1},\dots,w_{jn})^T$.
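
The input-output map of the two-layer MLP is short enough to sketch directly. The function name `mlp_output`, the default activation and the example values below are assumptions for illustration only.

```python
import math


def mlp_output(x, W, b, v, g=math.tanh):
    """y(x) = sum_j v_j * g(w_j^T x + b_j); W[j] holds the weight vector w_j."""
    y = 0.0
    for w_j, b_j, v_j in zip(W, b, v):
        a_j = sum(w_ji * x_i for w_ji, x_i in zip(w_j, x)) + b_j
        y += v_j * g(a_j)
    return y


# Hypothetical network with n = 2 inputs and N = 3 hidden neurons.
W = [[0.5, -0.2], [1.0, 0.3], [-0.7, 0.8]]
b = [0.1, -0.4, 0.0]
v = [1.5, -2.0, 0.5]
y_out = mlp_output([1.0, 2.0], W, b, v)
```

Note that the output is linear in `v` for fixed `W` and `b`, but nonlinear in the hidden-layer parameters.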

16 Interpolation property of MLP
Given $P$ distinct points in $\mathbb{R}^n$, $X = \{x^i \in \mathbb{R}^n,\ i = 1,\dots,P\}$, and a corresponding set of real numbers $Y = \{y^i \in \mathbb{R},\ i = 1,\dots,P\}$, the interpolation problem consists in finding a function $f : \mathbb{R}^n \to \mathbb{R}$, in a given class of real functions $F$, which satisfies
$$f(x^i) = y^i, \quad i = 1,\dots,P. \qquad (1)$$
Theorem (Pinkus 1999). Let $g \in C(\mathbb{R})$ be not a polynomial. Then there exist $w_j \in \mathbb{R}^n$ and $v_j, b_j \in \mathbb{R}$, for $j = 1,\dots,P$, such that
$$\sum_{j=1}^{P} v_j\, g(w_j^T x^i - b_j) = y^i, \quad i = 1,\dots,P.$$
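
The theorem is an existence statement. The sketch below only illustrates it on one small case: with hidden parameters fixed by hand, the output weights $v_j$ solve a linear system with as many unknowns as points. The data, the hand-picked $w_j, b_j$, the choice $g = \tanh$ and the helper `solve_linear_system` are all assumptions, and a different choice of hidden parameters could make the system singular.

```python
import math


def solve_linear_system(A, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * sol[k] for k in range(r + 1, n))
        sol[r] = (M[r][n] - s) / M[r][r]
    return sol


# Hypothetical data: P = 3 distinct points in R (n = 1) with arbitrary targets.
xs = [0.0, 1.0, 2.0]
ys = [1.0, -2.0, 0.5]
# Hand-picked hidden parameters; tanh is continuous and not a polynomial.
ws = [1.0, 2.0, -1.0]
bs = [0.0, 1.0, 0.5]

# G[i][j] = g(w_j * x^i - b_j); interpolation requires G v = y.
G = [[math.tanh(w_j * x_i - b_j) for w_j, b_j in zip(ws, bs)] for x_i in xs]
v = solve_linear_system(G, ys)


def network(x):
    return sum(v_j * math.tanh(w_j * x - b_j) for v_j, w_j, b_j in zip(v, ws, bs))


residuals = [network(x_i) - y_i for x_i, y_i in zip(xs, ys)]
```

The network has exactly as many hidden neurons as data points, which is why exact interpolation says nothing about behaviour away from those points.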

17 Approximation theory
The Universal Approximation Theorem states that a shallow network can approximate a continuous function $f$ over a compact set $X$ to any degree of accuracy: for every $\varepsilon > 0$ there is a shallow network whose input-output map $y(x)$ satisfies
$$|f(x) - y(x)| < \varepsilon \quad \text{for all } x \in X$$
(provided a continuous non-polynomial hidden activation function $g$ is used).
For a long period shallow networks played the leading role. However:
this does not imply that there exists a learning algorithm that can learn the correct values of the parameters and hyper-parameters;
an exponential number of neurons may be needed.
This motivated the move to deep learning.

18 Convexity in ML
1. Linear predictions $w^T x^p$:
$$\min_{\omega \in \mathbb{R}^q} \frac{1}{P} \sum_{p=1}^{P} \ell(f(x^p;\omega), y^p)$$
Convex with a convex loss; strongly convex with a strongly convex loss.
2. FFN: highly nonlinear and non-convex. It may have local minimizers, saddle points and plateaus (flat regions where algorithms tend to stall for a long time before a sudden improvement) (important).

19 What about convexity in MLP?
The input $x$ of the network propagates through the layers by sequentially applying the unit function $g$. Denoting by $W_j$ the weight matrix from the $(j-1)$-th to the $j$-th layer and by $z_j = g_j(\cdot\,;\pi_j)$ the output vector at layer $j$, we can write the output as
$$\hat{y}^p = W_L\, g_{L-1}\big(W_{L-1}, \pi_{L-1};\ g_{L-2}(W_{L-2}, \pi_{L-2};\ \dots,\ g_1(W_1, \pi_1; x^p) \dots)\big)$$
The problem is not convex even in the easiest case of a shallow network with linear $g$ and quadratic loss:
$$R_{emp} = \sum_{p} \|W_1 W_2 x^p - y^p\|^2$$
In this easiest case there are no local minimizers that are not global, and the saddle points have negative curvature: this means that a global minimizer can be found by suitable algorithms.
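
A scalar instance makes the non-convexity concrete. The worked example below assumes a single training pair with $x = 1$, $y = 1$ and scalar weights $w_1, w_2$; this is an illustrative choice, not taken from the slides.

```latex
\[
E(w_1, w_2) = (w_1 w_2 - 1)^2 .
\]
Both $(1,1)$ and $(-1,-1)$ are global minimizers with $E = 0$, but at their midpoint
\[
E(0,0) = 1 > \tfrac{1}{2} E(1,1) + \tfrac{1}{2} E(-1,-1) = 0 ,
\]
so $E$ is not convex. The origin is a stationary point, since
\[
\nabla E(w_1, w_2) = 2 (w_1 w_2 - 1) \begin{pmatrix} w_2 \\ w_1 \end{pmatrix},
\qquad \nabla E(0,0) = 0 ,
\]
and the Hessian there,
\[
\nabla^2 E(0,0) = \begin{pmatrix} 0 & -2 \\ -2 & 0 \end{pmatrix},
\]
has eigenvalues $\pm 2$: a saddle point with negative curvature along the direction $(1,1)$.
```

Moving from the origin along $(1,1)$ therefore decreases $E$, which is the property a suitable algorithm can exploit to escape the saddle.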

20 Does Convexity really matter?
Statistics answer: not really!
In deep networks, finding the global solution seems not to be important for the efficiency (generalization property).
Local minimizers and saddle points of $R_{emp}$ are usually good enough for $R_{test}$.
An early stopping rule prevents reaching the global solution even when this would be viable.
Optimization answer: it depends on the algorithm.
The optimization difficulty is due to the presence of plateaus: flat regions where algorithms tend to stall for a long time before a sudden improvement.
Look for stationary points.
Most advanced gradient-based algorithms need convexity for their convergence analysis.

21 The target problem
We consider
$$\min_{\omega \in \mathbb{R}^q} E(\omega,\pi) = \frac{1}{P} \Big\{ \sum_{p=1}^{P} \|\hat{y}(\omega; x^p) - y^p\|^2 + \lambda \|\omega\|^2 \Big\} = \frac{1}{P} \sum_{p=1}^{P} E_p$$
In principle we look for a global solution $(\omega^*, \pi^*)$, namely for a setting of the parameters such that
$$E(\omega^*;\pi^*) \le E(\omega;\pi) \quad \forall\, \omega \in \mathbb{R}^q \text{ and all possible settings } \pi \in \Pi$$
In particular we focus on two-phase approaches that split the solution into
the choice of the hyper-parameters $\pi$ tied to the topology of the network;
the choice of the weights on arcs and/or units of the network.

22 Two-phase training procedure
Given a training set $T = \{(x^p, y^p) : x^p \in \mathbb{R}^n;\ y^p \in \mathbb{R},\ p = 1,\dots,P\}$
Repeat
1. [Hyper-parameter selection] Choose the hyper-parameters $\pi$;
2. [Weights selection] Choose the parameters $\omega$;
Until a stopping criterion is satisfied

23 Grid-Search Two-phase training procedure
Given a training set $T = \{(x^p, y^p) : x^p \in \mathbb{R}^n;\ y^p \in \mathbb{R},\ p = 1,\dots,P\}$.
[Parameters selection] Let $\Pi = \{\pi^1,\dots,\pi^T\}$.
For $t = 1,\dots,T$
  [Weights Optimization] Find $\omega^t = \arg\min_{\omega} E(\omega, \pi^t)$
End For
[Solution selection] Select $(\bar{\omega}, \bar{\pi}) = \arg\min_{t=1,\dots,T} R_{val}(\omega^t, \pi^t)$
Return $(\bar{\omega}, \bar{\pi})$
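
The grid-search loop can be sketched on a deliberately small stand-in problem. Here the model is a one-dimensional linear predictor $\omega x$, and the only hyper-parameter on the grid is the ridge coefficient $\lambda$, chosen because the inner weights optimization then has a closed form; this substitution, the data and all function names are assumptions, not the network setting of the slides.

```python
def fit_weights(train, lam):
    """argmin over omega of (1/P) sum (omega*x - y)^2 + lam*omega^2 (closed form in 1-D)."""
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    return sxy / (sxx + len(train) * lam)


def validation_risk(omega, val):
    """Average quadratic error on the held-out validation pairs."""
    return sum((omega * x - y) ** 2 for x, y in val) / len(val)


def grid_search(train, val, grid):
    best = None
    for pi_t in grid:
        omega_t = fit_weights(train, pi_t)       # weights optimization
        risk_t = validation_risk(omega_t, val)   # score on validation data
        if best is None or risk_t < best[0]:
            best = (risk_t, omega_t, pi_t)       # solution selection
    return best[1], best[2]


# Hypothetical training and validation pairs (x, y).
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
val = [(4.0, 8.0), (5.0, 10.1)]
grid = [0.0, 0.1, 1.0]
omega_bar, pi_bar = grid_search(train, val, grid)
```

For a neural network, `fit_weights` would be replaced by an iterative training run for each hyper-parameter setting, which is what makes the outer loop expensive.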

24 Two-phase unconstrained problem
We look for a global solution $(\omega^*, \pi^*)$ of the problem
$$\min_{\omega;\pi} E(\omega;\pi)$$
Existence of a global solution
Optimality conditions (for a point to be a local solution)
Definition of an iterative algorithm: $\omega^{k+1} = \omega^k + \alpha^k d^k$
Convergence
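
One common instance of the iteration $\omega^{k+1} = \omega^k + \alpha^k d^k$ is gradient descent, where $d^k = -\nabla E(\omega^k)$. The sketch below assumes a constant step size and a simple convex quadratic test function; both are illustrative choices, and the slides do not commit to a particular direction or step rule here.

```python
import math


def gradient_descent(grad, omega0, alpha=0.1, tol=1e-8, max_iter=1000):
    """Iterate omega <- omega + alpha * d with d = -grad(omega)."""
    omega = list(omega0)
    for _ in range(max_iter):
        g = grad(omega)
        # Stop at an (approximate) stationary point.
        if math.sqrt(sum(g_i * g_i for g_i in g)) < tol:
            break
        d = [-g_i for g_i in g]
        omega = [w_i + alpha * d_i for w_i, d_i in zip(omega, d)]
    return omega


def grad_E(omega):
    """Gradient of the hypothetical E(omega) = (omega_1 - 1)^2 + 2*(omega_2 + 0.5)^2."""
    return [2.0 * (omega[0] - 1.0), 4.0 * (omega[1] + 0.5)]


omega_star = gradient_descent(grad_E, [0.0, 0.0])
```

On this quadratic the stationary point is also the global minimizer $(1, -0.5)$; for a non-convex $E$ the same stopping test only certifies stationarity.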
