Optimization Methods for Machine Learning Decomposition methods for FFN

Optimization Methods for Machine Learning Laura Palagi http://www.dis.uniroma1.

1 Optimization Methods for Machine Learning
Laura Palagi, Dipartimento di Ingegneria informatica automatica e gestionale A. Ruberti, Sapienza Università di Roma, Via Ariosto 25

2 Unconstrained problem

$$\min_{\omega \in \mathbb{R}^q} E(\omega) = \sum_{p=1}^{P} E_p(\omega)$$

Batch Backpropagation (BP) method:

$$\omega^{k+1} = \omega^k - \eta \nabla E(\omega^k)$$

Exploiting the objective function:
Sample-wise decomposition, exploiting the additive structure of $E$: online gradient methods and improvements.
Block coordinate-wise decomposition.
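As a minimal sketch of the difference between the batch step and a sample-wise (online) step, the snippet below uses toy per-sample errors $E_p(w) = \frac{1}{2}(a_p^T w - b_p)^2$. The data `A`, `b`, the step size, and the function names are assumptions made only for illustration; they do not come from the slides.

```python
import numpy as np

# Toy per-sample errors (an assumption): E_p(w) = 0.5 * (a_p^T w - b_p)^2
A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])

def grad_Ep(w, p):
    """Gradient of the single-sample error E_p at w."""
    return (A[p] @ w - b[p]) * A[p]

def batch_step(w, eta):
    """Batch step: the full gradient is the sum of the per-sample gradients."""
    g = sum(grad_Ep(w, p) for p in range(len(b)))
    return w - eta * g

def online_step(w, eta, p):
    """Online step: only the gradient of sample p is used."""
    return w - eta * grad_Ep(w, p)

w0 = np.zeros(2)
w_batch = batch_step(w0, eta=0.1)
w_online = online_step(w0, eta=0.1, p=0)
```

The batch step touches all $P$ samples per update, while the online step touches one, which is the additive structure that sample-wise decomposition exploits.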

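3 Block methods

Consider the unconstrained minimization problem $\min_{\omega \in \mathbb{R}^q} E(\omega)$ with compact level sets. At each iteration $k$, a block method chooses an index set $B^k \subseteq \{1,\dots,q\}$ and updates only the variables $\omega_i$ with $i \in B^k$:

$$\omega_i^{k+1} = \begin{cases} \omega_i^k & \text{if } i \notin B^k \\ \omega_i^k + s_i^k & \text{if } i \in B^k \end{cases}$$

$B^k = \{1,\dots,q\}$: full step
$B^k \subset \{1,\dots,q\}$: block-coordinate
$B^k = \{i_k\} \subset \{1,\dots,q\}$: coordinate descent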

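4 Block gradient methods

Consider the gradient method for the unconstrained minimization problem $\min_{w \in \mathbb{R}^q} E(w)$. Assume an iteration in which only the components in the index set $B^k$ are changed. Define

$$\nabla_{B^k} E = \left[ \frac{\partial E}{\partial w_i} \right]_{i \in B^k}$$

The block gradient update is

$$w^{k+1} = w^k - \eta^k \, [I_i]_{i \in B^k} \, \nabla_{B^k} E(w^k)$$

where $[I_i]_{i \in B^k}$ is the $q \times |B^k|$ matrix made of the identity columns indexed by $B^k$.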
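The block gradient update can be sketched in a few lines of NumPy. The function name `block_gradient_step`, the toy objective $E(w) = \frac{1}{2}\|w\|^2$, and the chosen block and step size are assumptions for illustration only.

```python
import numpy as np

def block_gradient_step(w, grad_fn, block, eta):
    """One block gradient step: only the components of w indexed by `block` move."""
    g = grad_fn(w)                  # full gradient of E at w
    w_new = w.copy()
    w_new[block] -= eta * g[block]  # same effect as multiplying by the identity columns
    return w_new

# Toy objective (an assumption): E(w) = 0.5 * ||w||^2, so grad E(w) = w.
w0 = np.array([1.0, -2.0, 3.0, 4.0])
w1 = block_gradient_step(w0, lambda w: w, block=np.array([0, 2]), eta=0.5)
```

Indexing with `block` plays the role of the identity-column matrix: components outside the block are copied unchanged. For simplicity the sketch evaluates the full gradient; a practical implementation would compute only the partial derivatives in the block.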

5 Why should we use a decomposition method in SL?

Different reasons apply to Supervised Learning:
Fixing some variables, the subproblems in the remaining variables may have a special structure that can be conveniently exploited (e.g. FNN with linear output unit, Extreme Learning).
For large/huge $q$, subproblems are smaller: $q_k = |B^k|$ (e.g. Deep Networks).
To achieve good generalization performance with less training time: early stopping activates earlier.
Escaping from the attraction of irrelevant local minimizers.
Distributed/parallel optimization techniques.

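6 Shallow FNN with linear output

Assume $y \in \mathbb{R}$ and a linear output unit:

$$y(x) = \omega_2^T g(\omega_1; x)$$

[Figure: shallow network with inputs $x_1, x_2$, hidden units $g_1(\cdot), g_2(\cdot), g_3(\cdot)$ with input weights $u_{ij}$, and output weights $v_1, v_2, v_3$ feeding the linear output $y(x)$; $\omega_1 = u$, $\omega_2 = v$]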

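7 Shallow FNN with linear output

A natural two-block splitting of the MLP weights $\omega = (\omega_1, \omega_2)$ is

$$\omega_1 = u \in \mathbb{R}^{nN}, \qquad \omega_2 = v \in \mathbb{R}^{N}$$

The problem is equivalently written as

$$\min_{\omega_1 \in \mathbb{R}^{nN},\ \omega_2 \in \mathbb{R}^{N}} E(\omega_1, \omega_2)$$

with

$$E(\omega_1, \omega_2) = \frac{1}{P} \sum_{p=1}^{P} \left| y^p - \omega_2^T g(\omega_1; x^p) \right|^2 + \lambda_1 \|\omega_1\|^2 + \lambda_2 \|\omega_2\|^2$$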

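8 What happens when fixing $\omega_1$ or $\omega_2$?

Fixing $\omega_1 = u$, we have that

$$g(\omega_1; x^p) = \{ g_i(\omega_1; x^p) \}_{i=1,\dots,N}, \qquad g_i(\omega_1; x^p) = g(u_i^T x^p)$$

in

$$E(\omega_1, \omega_2) = \frac{1}{P} \sum_{p=1}^{P} \left| y^p - \omega_2^T g(\omega_1; x^p) \right|^2 + \lambda_1 \|\omega_1\|^2 + \lambda_2 \|\omega_2\|^2$$

Let $G = \{G_{pi}\}$ be the $P \times N$ hidden matrix with $G_{pi} = g(u_i^T x^p)$, and $\omega_2 \in \mathbb{R}^N$:

$$G = \begin{pmatrix} \cdots & g(u_i^T x^1) & \cdots \\ \cdots & g(u_i^T x^2) & \cdots \\ & \vdots & \\ \cdots & g(u_i^T x^P) & \cdots \end{pmatrix}$$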
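A minimal sketch of how the hidden matrix $G$ with entries $g(u_i^T x^p)$ could be assembled. The slides do not fix the activation $g$; `np.tanh` is an assumption, as are the array layout (rows of `U` are the $u_i$, rows of `X` are the $x^p$) and the random data.

```python
import numpy as np

def hidden_matrix(U, X, g=np.tanh):
    """Return the P x N matrix G with G[p, i] = g(u_i^T x^p).

    U has shape (N, n), one row per hidden unit; X has shape (P, n), one row per sample.
    """
    return g(X @ U.T)

# Hypothetical data, only to exercise the function.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))   # P = 5 samples, n = 2 inputs
U = rng.normal(size=(3, 2))   # N = 3 hidden units
G = hidden_matrix(U, X)
```

Once `U` is fixed, `G` is a constant matrix, which is why the subproblem in $v$ becomes linear least squares.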

9 The problem reduces to

$$\min_{v} \ \frac{1}{P} \| G v - y \|^2 + \lambda \| v \|^2$$

where $y \in \mathbb{R}^P$. For $\lambda > 0$ it is a strongly convex linear least squares problem (LLSP) in the weights $v \in \mathbb{R}^N$, where $G = \{G_{pi}\}$ is the $P \times N$ hidden matrix with $G_{pi} = g(u_i^T x^p)$ and $N$ is the number of neurons of the last hidden layer. It is convex quadratic!
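Because the subproblem is a convex quadratic, its global minimizer can be obtained from a linear system. The sketch below assumes the objective is read as $\frac{1}{P}\|Gv - y\|^2 + \lambda\|v\|^2$; the function name `solve_output_weights` and the random data are hypothetical.

```python
import numpy as np

def solve_output_weights(G, y, lam):
    """Minimize (1/P) * ||G v - y||^2 + lam * ||v||^2 over v.

    Setting the gradient (2/P) G^T (G v - y) + 2 lam v to zero gives the
    linear system (G^T G + lam * P * I) v = G^T y.
    """
    P, N = G.shape
    return np.linalg.solve(G.T @ G + lam * P * np.eye(N), G.T @ y)

# Hypothetical data, only to exercise the function.
rng = np.random.default_rng(0)
G = np.tanh(rng.normal(size=(30, 4)))
y = rng.normal(size=30)
lam = 0.1
v = solve_output_weights(G, y, lam)

# The gradient with respect to v should vanish at the solution.
grad_v = (2.0 / len(y)) * G.T @ (G @ v - y) + 2.0 * lam * v
```

With $\lambda > 0$ the system matrix is positive definite, so the solution is unique. For large $N$ an iterative solver could replace the direct solve.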

10 Two-phase optimization: Extreme Learning

Two-phase methods have been derived to exploit the structure of

$$\min_{\omega_1 \in \mathbb{R}^{q_1},\ v \in \mathbb{R}^{N}} E(\omega_1, v)$$

They are not to be confused with two-block decomposition.

1. Fix, randomly or by some procedure, a guess $\omega_1^0$.
2. Choose $v^0$ by solving, exactly or approximately, to global optimality the LLSP

$$\min_{v \in \mathbb{R}^{N}} \ \sum_{p=1}^{P} \left| v^T g(\omega_1^0, x^p) - y^p \right|^2 + \lambda \| v \|^2$$
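The two phases can be sketched as follows. The `tanh` activation, the uniform sampling range $[-1, 1]$, the function name `two_phase_fit`, and the synthetic data are all assumptions. The objective is taken in the sum form $\sum_p |v^T g(\omega_1^0, x^p) - y^p|^2 + \lambda\|v\|^2$.

```python
import numpy as np

def two_phase_fit(X, y, N, lam, rng):
    """Two-phase (extreme-learning style) fit of a shallow network.

    Phase 1: draw the hidden weights at random (the sampling range is an assumption).
    Phase 2: solve the regularized least squares problem in v exactly.
    """
    P, n = X.shape
    U0 = rng.uniform(-1.0, 1.0, size=(N, n))                  # phase 1: random omega_1^0
    G = np.tanh(X @ U0.T)                                     # hidden matrix, P x N
    v0 = np.linalg.solve(G.T @ G + lam * np.eye(N), G.T @ y)  # phase 2: global min in v
    return U0, v0

# Hypothetical regression data.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]
U0, v0 = two_phase_fit(X, y, N=4, lam=1e-2, rng=rng)
```

The hidden weights are never revisited after phase 1, which is what distinguishes this scheme from a two-block decomposition.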

11 Randomization methods

Universal approximation of Random FNN: under proper assumptions, the shallow random network possesses a universal approximation capability, with the order of the approximation error comparable to that of a fully adaptable network. [4, 1]

PROs: cheap.
CONs: the approximation result relies on proper choices of a range and a sampling distribution of the parameters; it is possible to sample a network with extremely poor accuracy [2].

12 From two-phase to two-block iterations

The solution $(\omega_1^0, v^0)$ returned by a two-phase randomization algorithm satisfies

$$\nabla_v E(\omega_1^0, v^0) = 0 \quad \text{but} \quad \nabla_{\omega_1} E(\omega_1^0, v^0) \neq 0$$

so $(\omega_1^0, v^0)$ is not even a stationary point of $E(\omega)$.

We can do better using block decomposition methods! Starting with $(\omega_1^0, v^0)$, a two-block decomposition method alternates further optimization steps in $\omega_1$ and $v$ respectively, producing a sequence $\{(\omega_1^k, v^k)\}$, $k = 0, 1, 2, \dots$
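The stationarity gap of a two-phase solution can be checked numerically. The sketch assumes a `tanh` activation, the sum-form objective $\sum_p |v^T g(\omega_1, x^p) - y^p|^2 + \lambda\|v\|^2$, and synthetic data; none of these specifics come from the slides.

```python
import numpy as np

# Hypothetical data and a random two-phase solution (U0, v0).
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]
N, lam = 4, 1e-2

U0 = rng.uniform(-1.0, 1.0, size=(N, X.shape[1]))
G = np.tanh(X @ U0.T)
v0 = np.linalg.solve(G.T @ G + lam * np.eye(N), G.T @ y)

r = G @ v0 - y                                   # residuals v^T g(u, x^p) - y^p
grad_v = 2.0 * G.T @ r + 2.0 * lam * v0          # gradient w.r.t. the output weights
# Chain rule through tanh: d tanh(t)/dt = 1 - tanh(t)^2
grad_U = 2.0 * ((r[:, None] * (1.0 - G**2)) * v0[None, :]).T @ X

print(np.linalg.norm(grad_v), np.linalg.norm(grad_U))
```

The first norm is zero up to rounding because $v^0$ solves the least squares problem exactly; the second is generically nonzero, which is the room for improvement that two-block iterations exploit.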

13 Two-block methods

A two-block decomposition method splits the variables into two blocks $B_1^k$ and $B_2^k = \{1,\dots,q\} \setminus B_1^k$; it selects one of the two blocks $B_h^k$, $h \in \{1, 2\}$, and moves only the variables in $\omega_{B_h}$. Consider the block update

$$\omega_{B_i^k}^{k+1} = \begin{cases} \omega_{B_i^k}^{k} & \text{if } i \neq h \\ \omega_{B_i^k}^{k} + s_{B_i^k}^{k} & \text{if } i = h \end{cases}$$

In FNN the two blocks $B_1, B_2$ are actually independent of the iteration $k$:

$$\omega_{B^k} = \begin{cases} u \\ v \end{cases}$$

A two-block decomposition method for FNN alternates updates of the same two blocks of variables.

14 Two-block decomposition for SL

What about the two-block decomposition in FFN?

Fixing $\omega_1$: minimization w.r.t. the weights $v$. Compute $v^{k+1}$ by solving

$$\min_{v} \ \frac{1}{P} \| G v - y \|^2 + \lambda \| v \|^2$$

It is convex. Global minimization w.r.t. $v$ is OK!

Fixing $\omega_2 = v$: minimization w.r.t. $\omega_1$. Compute

$$\omega_1^{k+1} = \arg\min_{\omega_1} \ \frac{1}{P} \sum_{p=1}^{P} \left| v^T g(\omega_1; x^p) - y^p \right|^2 + \lambda \| v \|^2$$

It is highly nonlinear and nonconvex! How do we find a global solution? Hard problem!

15 From (G-S) to (G-S)$^2$

Can global minimization be relaxed? YES! [Grippo & Sciandrone, 1999 [3]]

(G-S)$^2$ combines two readings of the same initials: (G-S) = Gauss-Seidel and (G-S) = Grippo-Sciandrone.

1. Exact/global minimization is needed only w.r.t. one of the two blocks. With $\omega_2$ fixed, the exact global minimization w.r.t. $\omega_1$ is replaced by a local algorithm that returns an $\omega_1^k$ satisfying

$$E(\omega_1^k, \omega_2) \le E(\omega_1^{k-1}, \omega_2), \qquad \nabla_{\omega_1} E(\omega_1^k, \omega_2) = 0$$

or even less.

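16 A possible (G-S)$^2$ scheme

(G-S)$^2$ two-block algorithm for FNN

Choose a starting guess $\omega_1^0 \in \mathbb{R}^{nN}$. Set $k = 0$.
Repeat over $k$:

1. Exact global minimization w.r.t. the weights $v$. Compute $v^k$ as the global minimizer of

$$\min_{v \in \mathbb{R}^{N}} \ \sum_{p=1}^{P} \left| v^T g(\omega_1^k, x^p) - y^p \right|^2 + \lambda \| v \|^2$$

2. Minimization w.r.t. $\omega_1$:

$$\omega_1^{k+1} \approx \arg\min_{\omega_1} E(\omega_1, v^k)$$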
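A minimal sketch of a scheme of this kind is given below. It is not the authors' implementation. Step 2 is simplified to a single gradient step on $\omega_1$ with backtracking, accepted only if $E$ does not increase. The `tanh` activation, all function names, the step-size rule, and the synthetic data are assumptions.

```python
import numpy as np

def objective(U, v, X, y, lam):
    """E(U, v) = sum_p (v^T tanh(U x^p) - y^p)^2 + lam * ||v||^2."""
    r = np.tanh(X @ U.T) @ v - y
    return r @ r + lam * (v @ v)

def solve_v(U, X, y, lam):
    """Step 1: exact global minimization w.r.t. v (regularized least squares)."""
    G = np.tanh(X @ U.T)
    return np.linalg.solve(G.T @ G + lam * np.eye(G.shape[1]), G.T @ y)

def grad_U(U, v, X, y):
    """Gradient of E w.r.t. the hidden weights U (chain rule through tanh)."""
    G = np.tanh(X @ U.T)
    r = G @ v - y
    return 2.0 * ((r[:, None] * (1.0 - G**2)) * v[None, :]).T @ X

def gs2_train(X, y, U0, lam, iters=20, eta0=1.0):
    U = U0.copy()
    history = []
    for _ in range(iters):
        v = solve_v(U, X, y, lam)
        E_old = objective(U, v, X, y, lam)
        history.append(E_old)
        # Step 2 (simplified): one gradient step, halving eta until E does not increase.
        g = grad_U(U, v, X, y)
        eta = eta0
        while eta > 1e-12:
            U_try = U - eta * g
            if objective(U_try, v, X, y, lam) <= E_old:
                U = U_try
                break
            eta *= 0.5
    v = solve_v(U, X, y, lam)
    history.append(objective(U, v, X, y, lam))
    return U, v, history

# Hypothetical data and random starting guess.
rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]
U0 = rng.uniform(-1.0, 1.0, size=(4, 2))
U, v, history = gs2_train(X, y, U0, lam=1e-2)
```

The first entry of `history` is the objective of the two-phase solution built from `U0`. Later entries cannot exceed it, because the step in $v$ is a global minimization and the step in $\omega_1$ is accepted only when it does not increase $E$.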

17 Observations on the (G-S)$^2$

1. For the solution of the nonconvex subproblem, any convergent gradient-type scheme can be used.
2. It allows a further simplification, in that the accuracy of the minimization at Step 1 and Step 2 can be calibrated with the iterations: the farther from the solution, the less the computational effort.
3. At each $k \ge 1$ the returned solution is never worse (and usually better) than the initial guess. When applied to the MLP training problem, it outperforms randomization algorithms (any two-phase algorithm) in terms of the quality of the returned solution:

$$E(\omega_1^{k+1}, v^{k+1}) \le E(\omega_1^{k}, v^{k+1}) \le E(\omega_1^{0}, v^{0})$$

18 References

[1] C. Buzzi, L. Grippo, and M. Sciandrone. Convergent decomposition techniques for training RBF neural networks. Neural Computation, 13(8).
[2] L. Grippo, A. Manno, and M. Sciandrone. Decomposition techniques for multilayer perceptron training. IEEE Transactions on Neural Networks and Learning Systems, 27(11).
[3] L. Grippo and M. Sciandrone. Globally convergent block-coordinate techniques for unconstrained optimization. Optimization Methods and Software, 10(4), 1999.
[4] B. Igelnik and Y.-H. Pao. Stochastic choice of basis functions in adaptive function approximation and the functional-link net. IEEE Transactions on Neural Networks, 6(6), 1995.

19 References (continued)

[5] J.-Y. Li, W. S. Chow, B. Igelnik, and Y.-H. Pao. Comments on "Stochastic choice of basis functions in adaptive function approximation and the functional-link net" [with reply]. IEEE Transactions on Neural Networks, 8(2).
[6] J. C. Principe and B. Chen. Universal approximation with convex optimization: Gimmick or reality? [discussion forum]. IEEE Computational Intelligence Magazine, 10(2):68-77, 2015.

Block-wise Decomposition methods for FNN. Lecture 1 - Part B, July 3, 2017. Laura Palagi, Dipartimento di Ingegneria Informatica Automatica e Gestionale, Sapienza Università di Roma. SUMMER SCHOOL VEROLI