Quantile regression with multilayer perceptrons.



S.-F. Dimby and J. Rynkiewicz
Université Paris 1 - SAMM
90 Rue de Tolbiac, Paris - France

Abstract. We consider nonlinear quantile regression involving multilayer perceptrons (MLP). In this paper we investigate the asymptotic behavior of quantile regression in a general framework: first by allowing possibly non-identifiable regression models, like MLPs with redundant hidden units, then by relaxing the conditions on the density of the noise. We present a universal bound for the overfitting of such models under weak assumptions. The main application of this bound is to give a hint about determining the true architecture of the MLP quantile regression model. As an illustration, we use this theoretical result to propose and compare effective criteria to find the true architecture of such a regression model.

1 Introduction

Quantiles are points taken at regular intervals from the cumulative distribution function (CDF) of a random variable. Some q-quantiles have special names: the 2-quantile is called the median, the 4-quantiles are called quartiles and the 10-quantiles are called deciles. We can define the quantile through a simple alternative expedient, as an optimization problem. Just as we can define the sample mean as the solution to the problem of minimizing a sum of squared residuals, we can define the median as the solution to the problem of minimizing a sum of absolute residuals. More generally, if $y_1, \dots, y_n$ are observed values, the $\tau$-th sample quantile is obtained by solving

$$\min_{m \in \mathbb{R}} \sum_{i=1}^{n} \rho_\tau(y_i - m) \tag{1}$$

where the cost function $\rho_\tau(z) = \tau\,|z|\,\mathbf{1}_{\mathbb{R}^+}(z) + (1-\tau)\,|z|\,\mathbf{1}_{\mathbb{R}^-}(z)$ is the tilted absolute value function. Having succeeded in defining the unconditional quantiles as an optimization problem, it is easy to define conditional quantiles in an analogous fashion. To obtain an estimate of the conditional quantile, we simply replace the scalar $m$ in equation (1) by a function $f(x_i)$, where the $x_i$ are the covariates.

2 The model

The basic model is a possibly nonlinear regression model with an additive error. It is given by

$$Y_t = f_\theta(X_t) + \varepsilon_t \tag{2}$$

where $(Y_t)_{1 \le t \le n}$ are the observations, $(X_t)_{1 \le t \le n}$ are random covariates and $(\varepsilon_t)_{1 \le t \le n}$ are unobserved error terms. The regression function $f_\theta$ is assumed to be an MLP function with $k$ hidden units, which can be written:

$$f_\theta(x) = \beta + \sum_{i=1}^{k} a_i\,\phi\!\left(w_i^T x + b_i\right),$$

with $\theta = (\beta, a_1, \dots, a_k, b_1, \dots, b_k, w_{11}, \dots, w_{1d}, \dots, w_{k1}, \dots, w_{kd})$ the parameter vector of the model and $\phi$ a bounded transfer function, usually a sigmoidal function. $\theta$ belongs to $\Theta_k \subset \mathbb{R}^{k(d+2)+1}$, a compact (i.e. closed and bounded) set of possible parameters. The quantile regression estimator $f_{\hat\theta_\tau}$ is obtained by solving the optimization problem:

$$\min_{\theta \in \Theta_k} M_n^\tau(f_\theta) \quad\text{with}\quad M_n^\tau(f_\theta) = \sum_{i=1}^{n} \rho_\tau\big(y_i - f_\theta(x_i)\big) \tag{3}$$

where the function $\rho_\tau(\cdot)$ is

$$\rho_\tau(z) = \tau\,|z|\,\mathbf{1}_{\mathbb{R}^+}(z) + (1-\tau)\,|z|\,\mathbf{1}_{\mathbb{R}^-}(z) \tag{4}$$

In the sequel, let $f_{\theta_\tau}$ be a, possibly not unique, function such that

$$f_{\theta_\tau} = \arg\min_{\theta \in \Theta_k} m(f_\theta) \quad\text{with}\quad m(f_\theta) = \int \rho_\tau\big(y - f_\theta(x)\big)\,dP(x, y). \tag{5}$$

$f_{\theta_\tau}$ is the optimal function for the theoretical quantile regression problem.

2.1 Asymptotic distribution

If the possible functions $f_\theta$ are parametric, identifiable and smooth enough, and if the density of the noise exists and is positive, the asymptotic normality of the M-estimator can be shown (see Koenker and Bassett [1] for the linear case and Weiss [6] for the non-linear case and the $\frac{1}{2}$-quantile). However, it is possible to give more general results using empirical processes theory. In this paper we prove a general bound valid even if the optimal functions $f_{\theta_\tau}$ are not unique and without assumptions on the density of the noise, except moment conditions.

A general bound for $M_n^\tau(f_\theta)$

We will prove an inequality bounding the difference between $M_n^\tau(f_\theta)$ and $M_n^\tau(f_{\theta_\tau})$. For a square integrable function $g(X, Y)$, the $L^2$ norm is:

$$\|g(X, Y)\|_2 := \sqrt{\int g^2(x, y)\,dP(x, y)}.$$
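The cost function (4), the MLP regression function and the empirical criterion (3) defined above can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of `tanh` as the bounded transfer function and the random test data are all assumptions made for the example. The last lines check the claim of the introduction that minimizing the sum of tilted absolute residuals over a constant yields a sample quantile.

```python
import numpy as np


def rho_tau(z, tau):
    """Tilted absolute value loss: tau*|z| for z >= 0, (1 - tau)*|z| for z < 0."""
    z = np.asarray(z, dtype=float)
    return np.where(z >= 0, tau * z, (tau - 1.0) * z)


def mlp(x, beta, a, w, b):
    """One-hidden-layer MLP: beta + sum_i a_i * phi(w_i^T x + b_i).

    x has shape (n, d), a and b have shape (k,), w has shape (k, d).
    phi = tanh is an assumed choice of bounded sigmoidal transfer function.
    """
    return beta + np.tanh(x @ w.T + b) @ a


def empirical_criterion(y, x, tau, beta, a, w, b):
    """M_n^tau(f_theta): sum of tilted absolute residuals, as in equation (3)."""
    return rho_tau(y - mlp(x, beta, a, w, b), tau).sum()


# Unconditional case, equation (1): the criterion is piecewise linear and convex
# in m, so a minimizer can be searched among the observed values themselves.
rng = np.random.default_rng(0)
y = rng.normal(size=201)
tau = 0.25
losses = np.array([rho_tau(y - m, tau).sum() for m in y])
m_star = y[np.argmin(losses)]  # coincides with the empirical 0.25-quantile of y
```

In practice the conditional problem (3) would be minimized over all of theta with a numerical optimizer; the sketch only evaluates the criterion for given parameters.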

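Let $\lambda > 0$ be a constant. The generalized derivative function is defined as:

$$d_\theta^\lambda(X, Y) = \frac{\dfrac{e^{-\lambda\rho_\tau(Y - f_\theta(X))} - e^{-\lambda\rho_\tau(Y - f_{\theta_\tau}(X))}}{e^{-\lambda\rho_\tau(Y - f_{\theta_\tau}(X))}}}{\left\|\dfrac{e^{-\lambda\rho_\tau(Y - f_\theta(X))} - e^{-\lambda\rho_\tau(Y - f_{\theta_\tau}(X))}}{e^{-\lambda\rho_\tau(Y - f_{\theta_\tau}(X))}}\right\|_2} = \frac{e^{-\lambda\rho_\tau(Y - f_\theta(X)) + \lambda\rho_\tau(Y - f_{\theta_\tau}(X))} - 1}{\left\|e^{-\lambda\rho_\tau(Y - f_\theta(X)) + \lambda\rho_\tau(Y - f_{\theta_\tau}(X))} - 1\right\|_2}$$

and let us define $\left(d_\theta^\lambda\right)_-(x, y) = \min\{0, d_\theta^\lambda(x, y)\}$. For now, let us assume that $d_\theta^\lambda$ is well defined; this point will be discussed later. We can state the following inequality:

Inequality: for $\lambda > 0$,

$$\sup_{\theta \in \Theta_k}\left(M_n^\tau(f_{\theta_\tau}) - M_n^\tau(f_\theta)\right) \le \frac{1}{2\lambda}\,\sup_{\theta \in \Theta_k}\frac{\left(\sum_{i=1}^{n} d_\theta^\lambda(x_i, y_i)\right)^2}{\sum_{i=1}^{n}\left(d_\theta^\lambda\right)_-^2(x_i, y_i)}$$

Proof: The proof is very similar to the proof for the least squares estimator obtained by Rynkiewicz [4]. We have

$$\begin{aligned}
M_n^\tau(f_{\theta_\tau}) - M_n^\tau(f_\theta)
&= \frac{1}{\lambda}\sum_{i=1}^{n}\log\left(1 + \left\|\frac{e^{-\lambda\rho_\tau(Y - f_\theta(X))} - e^{-\lambda\rho_\tau(Y - f_{\theta_\tau}(X))}}{e^{-\lambda\rho_\tau(Y - f_{\theta_\tau}(X))}}\right\|_2 d_\theta^\lambda(x_i, y_i)\right)\\
&\le \sup_{0 \le p \le \left\|\frac{e^{-\lambda\rho_\tau(Y - f_\theta(X))} - e^{-\lambda\rho_\tau(Y - f_{\theta_\tau}(X))}}{e^{-\lambda\rho_\tau(Y - f_{\theta_\tau}(X))}}\right\|_2}\ \frac{1}{\lambda}\sum_{i=1}^{n}\log\left(1 + p\,d_\theta^\lambda(x_i, y_i)\right)\\
&\le \sup_{p \ge 0}\ \frac{1}{\lambda}\left(p\sum_{i=1}^{n} d_\theta^\lambda(x_i, y_i) - \frac{p^2}{2}\sum_{i=1}^{n}\left(d_\theta^\lambda\right)_-^2(x_i, y_i)\right),
\end{aligned} \tag{6}$$

since for any real number $u > -1$, $\log(1 + u) \le u - \frac{1}{2}u_-^2$, with $u_- = \min\{0, u\}$. Finally, replacing $p$ by the optimal value, we find

$$M_n^\tau(f_{\theta_\tau}) - M_n^\tau(f_\theta) \le \frac{1}{2\lambda}\,\frac{\left(\sum_{i=1}^{n} d_\theta^\lambda(x_i, y_i)\right)^2}{\sum_{i=1}^{n}\left(d_\theta^\lambda\right)_-^2(x_i, y_i)} \tag{7}$$

This inequality allows us to prove that $M_n^\tau(f_{\theta_\tau}) - M_n^\tau(f_\theta)$ is bounded in probability under simple assumptions. This may be applied to model selection, as discussed in the next section.

2.2 Application: selection of models

In this section, the set $\Theta$ of possible parameters will be set to $\Theta = \bigcup_{k=1}^{K}\Theta_k$, with $\Theta_{k_1} \subset \Theta_{k_2}$ for $k_1 < k_2$, where $K$ is a possibly huge fixed constant. Let $k_0$ be the minimal dimension of the functional space needed to realize the true regression function $f_\tau$. For multilayer perceptrons, $\Theta_k$ may be the set of MLPs with $k$ hidden units. We define the minimum-penalized estimator of $k_0$ as the minimizer $\hat{k}$ of

$$T_n(k) = \min_{\theta \in \Theta_k}\left(M_n^\tau(f_\theta) + a_n(k)\right) \tag{8}$$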
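Returning to the last step of the proof in Section 2.1, "replacing p by the optimal value" amounts to maximizing a concave quadratic in p. The step can be spelled out as below; the shorthands A, B and g are introduced here for illustration only and are not part of the original notation, and B > 0 is assumed.

```latex
% A, B and g are illustrative shorthands; B > 0 is assumed.
\begin{align*}
A &= \sum_{i=1}^{n} d_\theta^\lambda(x_i, y_i), \qquad
B = \sum_{i=1}^{n} \big(d_\theta^\lambda\big)_-^2(x_i, y_i), \\
g(p) &= \frac{1}{\lambda}\Big(pA - \frac{p^2}{2}B\Big), \qquad
g'(p) = \frac{1}{\lambda}\,(A - pB) = 0 \iff p^\ast = \frac{A}{B}, \\
g(p^\ast) &= \frac{1}{\lambda}\Big(\frac{A^2}{B} - \frac{A^2}{2B}\Big) = \frac{A^2}{2\lambda B}.
\end{align*}
```

When A is negative the supremum over p ≥ 0 is attained at p = 0 and equals 0, which is still below A²/(2λB), so the stated bound holds in both cases.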

Let us make the following assumptions:

A1) $a_n(\cdot)$ is increasing, $n\,(a_n(k_1) - a_n(k_2))$ tends to infinity as $n$ tends to infinity for any $k_1 > k_2$, and $a_n(k)$ tends to 0 as $n$ tends to infinity for any $k$.

A2) There exists $\lambda > 0$ such that $\{d_\theta^\lambda, \theta \in \Theta\}$ is a Donsker class (see van der Vaart [5]).

We now have:

Theorem: Under A1) and A2), $\hat{k}$ converges in probability to the true dimension $k_0$.

The proof of this theorem is exactly the same as in Rynkiewicz [4]. Assumption A1) is fairly standard for model selection; in the Gaussian case, A1) will be fulfilled by BIC-like criteria. Assumption A2) is more difficult to check. First we note:

$$\left(e^{-\lambda\left(\rho_\tau(Y - f_\theta(X)) - \rho_\tau(Y - f_{\theta_\tau}(X))\right)} - 1\right)^2 = e^{-2\lambda\left(\rho_\tau(Y - f_\theta(X)) - \rho_\tau(Y - f_{\theta_\tau}(X))\right)} - 2e^{-\lambda\left(\rho_\tau(Y - f_\theta(X)) - \rho_\tau(Y - f_{\theta_\tau}(X))\right)} + 1$$

So $d_\theta^\lambda$ is well defined if

$$E\left[e^{-2\lambda\left(\rho_\tau(Y - f_\theta(X)) - \rho_\tau(Y - f_{\theta_\tau}(X))\right)}\right] < \infty.$$

Since an MLP function is bounded, $d_\theta^\lambda$ is well defined if $Y$ admits exponential moments. Finally, using the same reparameterization techniques as in Rynkiewicz [3], assumption A2) can be shown to be true for linear regressions or MLP models with sigmoidal transfer functions, if the set of possible parameters $\Theta$ is compact.

3 A little experiment

The theoretical penalization terms of the previous section can be chosen among a wide range of functions (see condition A1). In the sequel, a little experiment is conducted to assess the right rate of penalization to guess the true architecture of a model. Consider a simulated model:

$$Y_t = F_{\theta^0}(X_{1t}, X_{2t}) + \varepsilon_t, \quad t = 1, \dots, n,$$

with $\big((X_{11}, X_{21}), \dots, (X_{1n}, X_{2n})\big)$ i.i.d., $(X_{1t}, X_{2t}) \sim \mathcal{N}(0_{\mathbb{R}^2}, 3 I_2)$, where $I_2$ is the identity matrix. The noise sequence $\varepsilon_1, \dots, \varepsilon_n$ is independent and identically distributed following a Gaussian distribution $\mathcal{N}(0, 1)$ and

$$F_{\theta^0}(x_1, x_2) = \tanh(6 x_1\ [\ldots]\ 2 x_2) + 2\tanh(8 x_1 + 3 x_2)\ [\ldots]\ 3\tanh(2\ [\ldots]\ 6 x_1\ [\ldots]\ 2 x_2) \tag{9}$$

Here, the true model is an MLP with 2 inputs, 3 hidden units and one output. In order to avoid too long a computation time, the number of hidden units is assumed to be between 1 and 10.

Let $D$ be the size of the parameter vector (the dimension of the model, or the number of weights of the MLP). We consider the quantile regression with $\tau = 0.5$, so we minimize the sum of absolute residuals. We compare 3 criteria, from the least penalized (AIC like) to the most penalized (Very Strong Penalization). The following penalized criteria are assessed:

AIC like: $\left(\frac{1}{n}\sum_{t=1}^{n}\rho_{0.5}\big(z_t - F_\theta(x_t, y_t)\big)\right)\left(1 + \frac{2D}{n}\right)$

BIC like: $\left(\frac{1}{n}\sum_{t=1}^{n}\rho_{0.5}\big(z_t - F_\theta(x_t, y_t)\big)\right)\left(1 + \frac{D\log n}{n}\right)$

SP (Strong Penalization): $\left(\frac{1}{n}\sum_{t=1}^{n}\rho_{0.5}\big(z_t - F_\theta(x_t, y_t)\big)\right)\left(1 + D\ [\ldots]\right)$

We simulate $n = 100$, $n = 500$ and $n = 1000$ data according to the true model (9); for each $n$ the experiment is repeated 100 times. The following architectures are chosen by the penalized criteria:

[Table: for $n = 100$, $n = 500$ and $n = 1000$, number of models selected by the AIC like, BIC like and SP criteria for each number of hidden units.]

The BIC like criterion and the Strong Penalization often choose the true architecture, even for a small number of data. According to the theory, the AIC like criterion is not consistent (see condition A1), and the chosen architecture is always too large. The Strong Penalization chooses too small an architecture when the number of data is small ($n = 100$); however, it is a consistent criterion, so its behavior is correct for larger numbers of data ($n = 500$ and $n = 1000$). The BIC like criterion seems to be the best for this cost function.
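The selection procedure of this experiment can be sketched as follows, assuming the multiplicative penalty factors 1 + 2D/n (AIC like) and 1 + D log(n)/n (BIC like) and the weight count D = k(d+2)+1 of Section 2. The function names and the fitted mean losses are hypothetical values chosen for illustration, not results from the paper. The SP criterion is left out because its penalty term is not fully specified in the text above.

```python
import math


def n_weights(k, d):
    """Number of MLP parameters for k hidden units and d inputs: k(d+2)+1."""
    return k * (d + 2) + 1


def penalized_criteria(mean_loss, D, n):
    """Mean tilted absolute loss multiplied by (1 + penalty), for two assumed penalties."""
    return {
        "AIC-like": mean_loss * (1.0 + 2.0 * D / n),
        "BIC-like": mean_loss * (1.0 + D * math.log(n) / n),
    }


def select_architecture(mean_losses, d, n):
    """Return, for each criterion, the number of hidden units minimizing it.

    mean_losses maps k to (1/n) * sum_t rho_0.5(residual_t) of the fitted
    model with k hidden units.
    """
    selected = {}
    for name in ("AIC-like", "BIC-like"):
        selected[name] = min(
            mean_losses,
            key=lambda k: penalized_criteria(mean_losses[k], n_weights(k, d), n)[name],
        )
    return selected


# Hypothetical fitted losses: the gain becomes marginal beyond 3 hidden units.
mean_losses = {1: 0.62, 2: 0.48, 3: 0.400, 4: 0.390, 5: 0.388}
selected = select_architecture(mean_losses, d=2, n=500)
```

With these made-up losses the lighter AIC-like penalty accepts a fourth hidden unit while the BIC-like penalty stops at three, which mirrors the qualitative behavior described above.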

4 Conclusion

The conventional least squares estimator may be seriously deficient in case of non-Gaussian errors. It seems reasonable to pay a small premium, in the form of sacrificed efficiency, in order to get more robust regression models. The class of statistical models called regression quantiles is known to have good properties under some restrictive assumptions. In this paper we have shown that some results may be obtained under more general assumptions. We have proven an inequality showing that overfitting of these models is moderate if the noise admits exponential moments. This bound justifies the use of penalized criteria similar to the BIC criterion in order to fit the dimension of models. Finally, a more challenging task may be to get a more precise tuning of the penalization term which, according to our result, can be chosen among a wide range of functions.

References

[1] Koenker, R. and Bassett, G., Regression quantiles. Econometrica, 46:1, pages 33-50, 1978.
[2] Engel, E., Die produktions- und Konsumptionverhaltnisse des Konigreichs Sachsen. International Statistical Institute Bulletin, 9, pages 1-125, 1857.
[3] Rynkiewicz, J., Consistent estimation of the architecture of multilayer perceptrons. In M. Verleysen, editor, Proceedings of the 14th European Symposium on Artificial Neural Networks (ESANN 2006), d-side pub., April 28-30, Bruges (Belgium).
[4] Rynkiewicz, J., General bound of overfitting for MLP regression models. Neurocomputing, to appear.
[5] van der Vaart, A.W., Asymptotic statistics. Cambridge University Press, Cambridge.
[6] Weiss, A., Estimating nonlinear dynamic models using least absolute error estimation. Econometric Theory, 7, pages 46-68, 1991.
