Testing the number of parameters with multidimensional MLP


Joseph Rynkiewicz

To cite this version: Joseph Rynkiewicz. Testing the number of parameters with multidimensional MLP. ASMDA 2005, 2005, Brest, France.

fr/hal-00258206. Submitted on 21 Feb 2008. HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not.


Testing the number of parameters of multidimensional MLP

Joseph Rynkiewicz
SAMOS - MATISSE, Université de Paris I, 72 rue Regnault, Paris, France
joseph.rynkiewicz@univ-paris1.fr

Abstract. This work concerns testing the number of parameters in a one hidden layer multilayer perceptron (MLP). For this purpose we assume that the models are identifiable up to a finite group of transformations on the weights; this is the case, for example, when the number of hidden units is known. In this framework, we show that we get a simple asymptotic distribution if we use the logarithm of the determinant of the empirical error covariance matrix as cost function.

Keywords: Multilayer Perceptron, Statistical test, Asymptotic distribution.

1 Introduction

Consider a sequence $(Y_t, Z_t)_{t \in \mathbb{N}}$ of i.i.d. random vectors (i.e. identically distributed and independent; see footnote 1). So each couple $(Y_t, Z_t)$ has the same law as a generic variable $(Y, Z) \in \mathbb{R}^d \times \mathbb{R}^{d'}$.

Footnote 1. It is not hard to extend everything shown in this paper to stationary mixing variables, and so to time series.

1.1 The model

Assume that the model can be written

$$Y_t = F_{W^0}(Z_t) + \varepsilon_t$$

where
- $F_{W^0}$ is a function represented by a one hidden layer MLP with parameters (or weights) $W^0$ and sigmoidal functions in the hidden units.
- The noise $(\varepsilon_t)_{t \in \mathbb{N}}$ is a sequence of i.i.d. centered variables with unknown invertible covariance matrix $\Gamma(W^0)$. Write $\varepsilon$ for the generic variable with the same law as each $\varepsilon_t$.

Note that a finite number of transformations of the weights leave the MLP functions invariant; these permutations form a finite group (see [Sussmann, 1992]). To overcome this problem, we will consider equivalence classes of MLP: two MLP are in the same class if the first one is the image of the second one by such a transformation. The considered set of parameters is then the quotient space of parameters by the finite group of transformations. In this space, we assume that the model is identifiable; this can be done if we consider only MLP with the true number of hidden units (see [Sussmann, 1992]). Note that, if the number of hidden units is over-estimated, then such a test can have very bad behavior (see [Fukumizu, 2003]). We agree that the assumption of identifiability is very restrictive, but we want to emphasize the fact that, even in this framework, the classical test of the number of parameters in the case of multidimensional output MLP is not satisfactory, and we propose to improve it.

1.2 Testing the number of parameters

Let $q$ be an integer smaller than $s$. We want to test

$$H_0 : W \in \Theta_q \subset \mathbb{R}^q \quad \text{against} \quad H_1 : W \in \Theta_s \subset \mathbb{R}^s,$$

where the sets $\Theta_q$ and $\Theta_s$ are compact. $H_0$ expresses the fact that $W$ belongs to a subset of $\Theta_s$ with a parametric dimension smaller than $s$ or, equivalently, that $s - q$ weights of the MLP in $\Theta_s$ are null. If we consider the classic cost function

$$V_n(W) = \sum_{t=1}^{n} \| Y_t - F_W(Z_t) \|^2,$$

where $\|x\|$ denotes the Euclidean norm of $x$, we get the following test statistic:

$$S_n = \min_{W \in \Theta_q} V_n(W) - \min_{W \in \Theta_s} V_n(W).$$

It is shown in [Yao, 2000] that $S_n$ converges in law to a weighted sum of $\chi^2_1$ variables,

$$S_n \xrightarrow{\ D\ } \sum_{i=1}^{s-q} \lambda_i \chi^2_{i,1},$$

where the $\chi^2_{i,1}$ are $s - q$ i.i.d. $\chi^2_1$ variables and the $\lambda_i$ are strictly positive values, different from 1 if the true covariance matrix of the noise is not the identity. So, in the general case where the true covariance matrix of the noise is not the identity, the asymptotic distribution is not known, because the $\lambda_i$ are not known, and it is difficult to compute the asymptotic level of the test.

To overcome this difficulty we propose to use instead the cost function

$$U_n(W) := \ln \det \left( \frac{1}{n} \sum_{t=1}^{n} (Y_t - F_W(Z_t))(Y_t - F_W(Z_t))^T \right). \qquad (1)$$

We will show that, under suitable assumptions, the test statistic

$$T_n = n \times \left( \min_{W \in \Theta_q} U_n(W) - \min_{W \in \Theta_s} U_n(W) \right) \qquad (2)$$

converges to a classical $\chi^2_{s-q}$, so the asymptotic level of the test is very easy to compute. The sequel of this paper is devoted to the proof of this property.
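As an illustration of how the log-determinant cost and the test statistic $T_n$ of this section might be evaluated in practice, here is a minimal Python sketch. It is not taken from the paper: the function names (`log_det_cost`, `chi2_sf`, `parameter_count_test`) are hypothetical, and it assumes the two residual arrays come from models that were each fitted by minimizing the log-determinant cost over $\Theta_q$ and $\Theta_s$ respectively. The $\chi^2$ tail probability is computed with the standard library only, using the recurrence of the regularized incomplete gamma function for integer degrees of freedom.

```python
import math
import numpy as np

def log_det_cost(residuals):
    """Log-determinant of the empirical error covariance for an (n, d) residual array."""
    residuals = np.asarray(residuals, dtype=float)
    n = residuals.shape[0]
    gamma_n = residuals.T @ residuals / n
    sign, logdet = np.linalg.slogdet(gamma_n)
    if sign <= 0:
        raise ValueError("empirical covariance matrix is not positive definite")
    return logdet

def chi2_sf(x, df):
    """P(chi2_df > x) for an integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Base cases: Q(1/2, y) = erfc(sqrt(y)) for odd df, Q(1, y) = exp(-y) for even df.
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(y))
    else:
        a, q = 1.0, math.exp(-y)
    # Recurrence Q(a + 1, y) = Q(a, y) + y^a exp(-y) / Gamma(a + 1), up to a = df / 2.
    while a < df / 2.0:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q

def parameter_count_test(res_restricted, res_full, s, q):
    """Return (T_n, asymptotic p-value) for H0: q parameters against H1: s parameters.

    Assumption: both residual arrays were obtained at minimizers of the
    log-determinant cost, over the restricted and the full parameter set.
    """
    if not 0 <= q < s:
        raise ValueError("need 0 <= q < s")
    n = len(res_full)
    t_n = n * (log_det_cost(res_restricted) - log_det_cost(res_full))
    return t_n, chi2_sf(t_n, s - q)
```

A small p-value would suggest rejecting $H_0$, i.e. keeping the $s - q$ extra weights. The asymptotic level is only as reliable as the identifiability assumption discussed in the introduction.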

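2 Asymptotic properties of $T_n$

In order to investigate the asymptotic properties of the test we have to prove the consistency and the asymptotic normality of $\hat W_n = \arg\min_{W \in \Theta_s} U_n(W)$. Assume, in the sequel, that $\varepsilon$ has a moment of order at least 2 and write

$$\Gamma_n(W) = \frac{1}{n} \sum_{t=1}^{n} (Y_t - F_W(Z_t))(Y_t - F_W(Z_t))^T.$$

Remark that the matrices $\Gamma_n(W)$ and their inverses are symmetric. In the same way, we write $\Gamma(W) = \lim_{n \to \infty} \Gamma_n(W)$, which is well defined because of the moment condition on $\varepsilon$.

2.1 Consistency of $\hat W_n$

First we have to identify the contrast function associated with $U_n(W)$.

Lemma 1.
$$U_n(W) - U_n(W^0) \xrightarrow{\ a.s.\ } K(W, W^0)$$
with $K(W, W^0) \ge 0$ and $K(W, W^0) = 0$ if and only if $W = W^0$.

Proof: By the strong law of large numbers we have

$$U_n(W) - U_n(W^0) \xrightarrow{\ a.s.\ } \ln\det\Gamma(W) - \ln\det\Gamma(W^0) = \ln\frac{\det\Gamma(W)}{\det\Gamma(W^0)} = \ln\det\left( \Gamma^{-1}(W^0)\left(\Gamma(W) - \Gamma(W^0)\right) + I_d \right),$$

where $I_d$ denotes the identity matrix of $\mathbb{R}^d$. So the lemma is true if $\Gamma(W) - \Gamma(W^0)$ is a positive matrix, null only if $W = W^0$. But this property is true since

$$\begin{aligned}
\Gamma(W) &= E\left[ (Y - F_W(Z))(Y - F_W(Z))^T \right] \\
&= E\left[ (Y - F_{W^0}(Z) + F_{W^0}(Z) - F_W(Z))(Y - F_{W^0}(Z) + F_{W^0}(Z) - F_W(Z))^T \right] \\
&= E\left[ (Y - F_{W^0}(Z))(Y - F_{W^0}(Z))^T \right] + E\left[ (F_{W^0}(Z) - F_W(Z))(F_{W^0}(Z) - F_W(Z))^T \right] \\
&= \Gamma(W^0) + E\left[ (F_{W^0}(Z) - F_W(Z))(F_{W^0}(Z) - F_W(Z))^T \right].
\end{aligned}$$

We then deduce the consistency theorem:

Theorem 1. If $E(\|\varepsilon\|^2) < \infty$, then
$$\hat W_n \xrightarrow{\ P\ } W^0.$$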
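Before the proof of Theorem 1, the positivity step used in Lemma 1 can be spelled out. The following is a sketch of a standard linear-algebra argument, not part of the original text; the symbols $P(W)$ and $\mu_i$ are introduced here only for illustration.

```latex
\begin{align*}
P(W) &:= \Gamma(W) - \Gamma(W^0)
      = E\big[(F_{W^0}(Z) - F_W(Z))(F_{W^0}(Z) - F_W(Z))^T\big] \succeq 0, \\
K(W, W^0) &= \ln\det\big(I_d + \Gamma^{-1}(W^0)\,P(W)\big)
           = \ln\det\big(I_d + \Gamma^{-1/2}(W^0)\,P(W)\,\Gamma^{-1/2}(W^0)\big)
           = \sum_{i=1}^{d} \ln(1 + \mu_i) \;\ge\; 0,
\end{align*}
% where \mu_1, ..., \mu_d >= 0 are the eigenvalues of the symmetric positive
% semi-definite matrix \Gamma^{-1/2}(W^0) P(W) \Gamma^{-1/2}(W^0).
```

Equality holds only when every $\mu_i$ is zero, that is when $P(W) = 0$, which means $F_W(Z) = F_{W^0}(Z)$ almost surely. Under the identifiability assumption this gives $W = W^0$ in the quotient space of parameters.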

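Proof: Remark that there exists a constant $B$ such that

$$\sup_{W \in \Theta_s} \| Y - F_W(Z) \|^2 < \|Y\|^2 + B$$

because $\Theta_s$ is compact, so $F_W(Z)$ is bounded. For a matrix $A \in \mathbb{R}^{d \times d}$, let $\|A\|$ be a norm, for example $\|A\|^2 = \mathrm{tr}(A A^T)$. We have

$$\liminf_{n \to \infty} \inf_{W \in \Theta_s} \|\Gamma_n(W)\| = \|\Gamma(W^0)\| > 0,$$
$$\limsup_{n \to \infty} \sup_{W \in \Theta_s} \|\Gamma_n(W)\| := C < \infty,$$

and since the function

$$\Gamma \mapsto \ln\det\Gamma, \quad \text{for } C \ge \|\Gamma\| \ge \|\Gamma(W^0)\|,$$

is uniformly continuous, by the same argument as in example 19.8 of [Van der Vaart, 1998] the set of functions $U_n(W)$, $W \in \Theta_s$, is Glivenko-Cantelli. Finally, theorem 5.7 of [Van der Vaart, 1998] shows that $\hat W_n$ converges in probability to $W^0$.

2.2 Asymptotic normality

For this purpose we have to compute the first and second derivatives of $U_n(W)$ with respect to the parameters. First, we introduce a notation: if $F_W(X)$ is a $d$-dimensional parametric function depending on a parameter $W$, write $\frac{\partial F_W(X)}{\partial W_k}$ (resp. $\frac{\partial^2 F_W(X)}{\partial W_k \partial W_l}$) for the $d$-dimensional vector of partial derivatives (resp. second order partial derivatives) of each component of $F_W(X)$.

First derivatives: if $\Gamma_n(W)$ is a matrix depending on the parameter vector $W$, we get from [Magnus and Neudecker, 1988]

$$\frac{\partial}{\partial W_k} \ln\det\Gamma_n(W) = \mathrm{tr}\left( \Gamma_n^{-1}(W) \frac{\partial}{\partial W_k} \Gamma_n(W) \right).$$

Hence, if we write

$$A_n(W_k) = \frac{1}{n} \sum_{t=1}^{n} \left( - \frac{\partial F_W(z_t)}{\partial W_k} (y_t - F_W(z_t))^T \right),$$

then $\frac{\partial}{\partial W_k}\Gamma_n(W) = A_n(W_k) + A_n^T(W_k)$ and, using the fact that

$$\mathrm{tr}\left( \Gamma_n^{-1}(W) A_n(W_k) \right) = \mathrm{tr}\left( A_n^T(W_k) \Gamma_n^{-1}(W) \right) = \mathrm{tr}\left( \Gamma_n^{-1}(W) A_n^T(W_k) \right),$$

we get

$$\frac{\partial}{\partial W_k} \ln\det\Gamma_n(W) = 2\,\mathrm{tr}\left( \Gamma_n^{-1}(W) A_n(W_k) \right). \qquad (3)$$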
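The first-derivative formula (3) can be checked numerically against central finite differences. The sketch below is only an illustration under stated assumptions: the parametric function `model` is a hypothetical toy stand-in, $F_W(z) = \tanh(Wz)$, chosen because its partial derivatives are easy to write down; it is not the one hidden layer MLP of the paper, and all function names are invented for this example.

```python
import numpy as np

def model(W, Z):
    """Toy stand-in for F_W: F_W(z) = tanh(W z), with W of shape (d, p) and Z of shape (n, p)."""
    return np.tanh(Z @ W.T)

def cost(W, Y, Z):
    """U_n(W) = ln det Gamma_n(W)."""
    R = Y - model(W, Z)
    return np.linalg.slogdet(R.T @ R / len(Y))[1]

def gradient_eq3(W, Y, Z):
    """Gradient of U_n from equation (3): dU_n/dW_k = 2 tr(Gamma_n^{-1} A_n(W_k))."""
    n = len(Y)
    F = model(W, Z)
    R = Y - F
    gamma_inv = np.linalg.inv(R.T @ R / n)
    slope = 1.0 - F ** 2                    # derivative of tanh at each pre-activation
    grad = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            dF = np.zeros_like(F)           # dF_W(z_t)/dW_ij, nonzero only in output i
            dF[:, i] = slope[:, i] * Z[:, j]
            A = -(dF.T @ R) / n             # A_n(W_k) = (1/n) sum_t -dF_t (y_t - F_t)^T
            grad[i, j] = 2.0 * np.trace(gamma_inv @ A)
    return grad

def gradient_fd(W, Y, Z, eps=1e-6):
    """Central finite-difference gradient of U_n, used only as a reference."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        step = np.zeros_like(W)
        step[idx] = eps
        grad[idx] = (cost(W + step, Y, Z) - cost(W - step, Y, Z)) / (2 * eps)
    return grad

# Synthetic data (arbitrary sizes and noise level, chosen for the illustration).
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 3))
W0 = rng.normal(size=(2, 3))
Y = model(W0, Z) + 0.3 * rng.normal(size=(200, 2))
W = W0 + 0.1 * rng.normal(size=W0.shape)
max_gap = np.max(np.abs(gradient_eq3(W, Y, Z) - gradient_fd(W, Y, Z)))
```

If formula (3) is implemented correctly, `max_gap` should be at the level of finite-difference error. For a real MLP the double loop would be replaced by backpropagated Jacobians.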

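Second derivatives: We now write

$$B_n(W_k, W_l) := \frac{1}{n} \sum_{t=1}^{n} \frac{\partial F_W(z_t)}{\partial W_k} \left( \frac{\partial F_W(z_t)}{\partial W_l} \right)^T$$

and

$$C_n(W_k, W_l) := \frac{1}{n} \sum_{t=1}^{n} \left( -(y_t - F_W(z_t)) \left( \frac{\partial^2 F_W(z_t)}{\partial W_k \partial W_l} \right)^T \right).$$

We get

$$\frac{\partial^2 U_n(W)}{\partial W_k \partial W_l} = \frac{\partial}{\partial W_l}\, 2\,\mathrm{tr}\left( \Gamma_n^{-1}(W) A_n(W_k) \right) = 2\,\mathrm{tr}\left( \frac{\partial \Gamma_n^{-1}(W)}{\partial W_l} A_n(W_k) \right) + 2\,\mathrm{tr}\left( \Gamma_n^{-1}(W) B_n(W_k, W_l) \right) + 2\,\mathrm{tr}\left( \Gamma_n^{-1}(W) C_n(W_k, W_l) \right).$$

Now, [Magnus and Neudecker, 1988] give an analytic form of the derivative of an inverse matrix, $\frac{\partial \Gamma_n^{-1}(W)}{\partial W_l} = -\Gamma_n^{-1}(W) \frac{\partial \Gamma_n(W)}{\partial W_l} \Gamma_n^{-1}(W)$, so we get

$$\frac{\partial^2 U_n(W)}{\partial W_k \partial W_l} = -2\,\mathrm{tr}\left( \Gamma_n^{-1}(W) \left( A_n(W_l) + A_n^T(W_l) \right) \Gamma_n^{-1}(W) A_n(W_k) \right) + 2\,\mathrm{tr}\left( \Gamma_n^{-1}(W) B_n(W_k, W_l) \right) + 2\,\mathrm{tr}\left( \Gamma_n^{-1}(W) C_n(W_k, W_l) \right)$$

and so

$$\frac{\partial^2 U_n(W)}{\partial W_k \partial W_l} = -4\,\mathrm{tr}\left( \Gamma_n^{-1}(W) A_n(W_l) \Gamma_n^{-1}(W) A_n(W_k) \right) + 2\,\mathrm{tr}\left( \Gamma_n^{-1}(W) B_n(W_k, W_l) \right) + 2\,\mathrm{tr}\left( \Gamma_n^{-1}(W) C_n(W_k, W_l) \right). \qquad (4)$$

Asymptotic distribution of $\hat W_n$: The previous equations allow us to give the asymptotic properties of the estimator minimizing the cost function $U_n(W)$; namely, from equations (3) and (4) we can compute the asymptotic properties of the first and second derivatives of $U_n(W)$. If the variable $Z$ has a moment of order at least 3, then we get the following theorem:

Theorem 2. Assume that $E(\|\varepsilon\|^2) < \infty$ and $E(\|Z\|^3) < \infty$. Let $\nabla U_n(W^0)$ be the gradient vector of $U_n(W)$ at $W^0$ and $HU_n(W^0)$ be the Hessian matrix of $U_n(W)$ at $W^0$. Finally, write

$$B(W_k, W_l) := \frac{\partial F_W(Z)}{\partial W_k} \left( \frac{\partial F_W(Z)}{\partial W_l} \right)^T.$$

We then get

1. $HU_n(W^0) \xrightarrow{\ a.s.\ } 2 I_0$
2. $\sqrt{n}\, \nabla U_n(W^0) \xrightarrow{\ Law\ } \mathcal{N}(0, 4 I_0)$
3. $\sqrt{n}\, (\hat W_n - W^0) \xrightarrow{\ Law\ } \mathcal{N}(0, I_0^{-1})$

where the component $(k, l)$ of the matrix $I_0$ is

$$\mathrm{tr}\left( \Gamma_0^{-1} E\left( B(W^0_k, W^0_l) \right) \right),$$

with $\Gamma_0 = \Gamma(W^0)$.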
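The matrix $I_0$ of Theorem 2 can be approximated by replacing the expectation with an empirical mean and $\Gamma_0$ with the empirical error covariance at the fitted weights. The sketch below illustrates this plug-in idea; it is an assumption of this example rather than a procedure given in the paper, and the names `information_matrix` and `standard_errors` as well as the array layout of the Jacobians are hypothetical.

```python
import numpy as np

def information_matrix(jacobians, residuals):
    """Plug-in estimate of I_0.

    jacobians: array of shape (n, d, s); jacobians[t, :, k] is dF_W(z_t)/dW_k
               evaluated at the fitted weights (assumed to be supplied by the caller).
    residuals: array of shape (n, d) holding y_t - F_W(z_t).
    """
    J = np.asarray(jacobians, dtype=float)
    R = np.asarray(residuals, dtype=float)
    n = R.shape[0]
    gamma_inv = np.linalg.inv(R.T @ R / n)
    # Entry (k, l): (1/n) sum_t dF_k^T Gamma^{-1} dF_l = tr(Gamma^{-1} mean_t dF_k dF_l^T).
    return np.einsum("tak,ab,tbl->kl", J, gamma_inv, J) / n

def standard_errors(jacobians, residuals):
    """Approximate standard errors of the weights from N(0, I_0^{-1} / n)."""
    n = len(residuals)
    cov = np.linalg.inv(information_matrix(jacobians, residuals)) / n
    return np.sqrt(np.diag(cov))
```

The estimate is symmetric by construction, and it is invertible only when the Jacobian columns are not collinear, which is one practical face of the identifiability assumption.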

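Proof: We can show easily that, for all $x \in \mathbb{R}^d$, we have, for some constant $\mathrm{Cte}$:

$$\left\| \frac{\partial F_W(Z)}{\partial W_k} \right\| \le \mathrm{Cte}\,(1 + \|Z\|),$$
$$\left\| \frac{\partial^2 F_W(Z)}{\partial W_k \partial W_l} \right\| \le \mathrm{Cte}\,(1 + \|Z\|^2),$$
$$\left\| \frac{\partial^2 F_W(Z)}{\partial W_k \partial W_l} - \frac{\partial^2 F_{W^0}(Z)}{\partial W_k \partial W_l} \right\| \le \mathrm{Cte}\, \|W - W^0\|\, \|Z\|^3.$$

Write

$$A(W_k) = -\frac{\partial F_W(Z)}{\partial W_k} (Y - F_W(Z))^T$$

and $U(W) := \log\det(Y - F_W(Z))$. Note that the component $(k, l)$ of the matrix $4 I_0$ is

$$E\left( \frac{\partial U(W^0)}{\partial W_k} \frac{\partial U(W^0)}{\partial W_l} \right) = E\left( 2\,\mathrm{tr}\left( \Gamma_0^{-1} A^T(W^0_k) \right) \times 2\,\mathrm{tr}\left( \Gamma_0^{-1} A(W^0_l) \right) \right)$$

and, since the trace of a product is invariant under circular permutation,

$$\begin{aligned}
E\left( \frac{\partial U(W^0)}{\partial W_k} \frac{\partial U(W^0)}{\partial W_l} \right)
&= 4 E\left( \frac{\partial F_{W^0}(Z)^T}{\partial W_k} \Gamma_0^{-1} (Y - F_{W^0}(Z))(Y - F_{W^0}(Z))^T \Gamma_0^{-1} \frac{\partial F_{W^0}(Z)}{\partial W_l} \right) \\
&= 4 E\left( \frac{\partial F_{W^0}(Z)^T}{\partial W_k} \Gamma_0^{-1} \frac{\partial F_{W^0}(Z)}{\partial W_l} \right) \\
&= 4\,\mathrm{tr}\left( \Gamma_0^{-1} E\left( \frac{\partial F_{W^0}(Z)}{\partial W_k} \frac{\partial F_{W^0}(Z)^T}{\partial W_l} \right) \right) \\
&= 4\,\mathrm{tr}\left( \Gamma_0^{-1} E\left( B(W^0_k, W^0_l) \right) \right).
\end{aligned}$$

Now, the derivative $\frac{\partial F_W(Z)}{\partial W_k}$ is square integrable, so $\nabla U_n(W^0)$ fulfills Lindeberg's condition (see [Hall and Heyde, 1980]) and

$$\sqrt{n}\, \nabla U_n(W^0) \xrightarrow{\ Law\ } \mathcal{N}(0, 4 I_0).$$

For the component $(k, l)$ of the expectation of the Hessian matrix, denoted $H_n(W^0)$, remark first that

$$\lim_{n \to \infty} \mathrm{tr}\left( \Gamma_n^{-1}(W^0) A_n(W^0_l) \Gamma_n^{-1}(W^0) A_n(W^0_k) \right) = 0$$

and

$$\lim_{n \to \infty} \mathrm{tr}\left( \Gamma_n^{-1}(W^0) C_n(W^0_k, W^0_l) \right) = 0,$$

so

$$\begin{aligned}
\lim_{n \to \infty} H_n(W^0) &= \lim_{n \to \infty} \Big[ -4\,\mathrm{tr}\left( \Gamma_n^{-1}(W^0) A_n(W^0_l) \Gamma_n^{-1}(W^0) A_n(W^0_k) \right) + 2\,\mathrm{tr}\left( \Gamma_n^{-1}(W^0) B_n(W^0_k, W^0_l) \right) + 2\,\mathrm{tr}\left( \Gamma_n^{-1}(W^0) C_n(W^0_k, W^0_l) \right) \Big] \\
&= 2\,\mathrm{tr}\left( \Gamma_0^{-1} E\left( B(W^0_k, W^0_l) \right) \right).
\end{aligned}$$

Now, since $\left\| \frac{\partial^2 F_W(Z)}{\partial W_k \partial W_l} \right\| \le \mathrm{Cte}\,(1 + \|Z\|^2)$ and $\left\| \frac{\partial^2 F_W(Z)}{\partial W_k \partial W_l} - \frac{\partial^2 F_{W^0}(Z)}{\partial W_k \partial W_l} \right\| \le \mathrm{Cte}\, \|W - W^0\|\, \|Z\|^3$, by standard arguments found, for example, in [Yao, 2000] we get

$$\sqrt{n}\, (\hat W_n - W^0) \xrightarrow{\ Law\ } \mathcal{N}(0, I_0^{-1}).$$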
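The "standard arguments" that link the limits of the gradient and of the Hessian to the limit law of the estimator can be sketched as follows. This is the usual Taylor-expansion argument for M-estimators, written out here as an illustration and not quoted from the paper; it assumes that $\hat W_n$ is an interior point of $\Theta_s$ (so the gradient vanishes there) and that $I_0$ is invertible. The intermediate point $\tilde W_n$ is notation introduced for this sketch.

```latex
\begin{align*}
0 = \nabla U_n(\hat W_n)
  &= \nabla U_n(W^0) + HU_n(\tilde W_n)\,(\hat W_n - W^0),
  \qquad \tilde W_n \text{ between } \hat W_n \text{ and } W^0
  \text{ (possibly different for each row)}, \\
\sqrt{n}\,(\hat W_n - W^0)
  &= -\big(HU_n(\tilde W_n)\big)^{-1}\,\sqrt{n}\,\nabla U_n(W^0)
  \xrightarrow{\ Law\ }
  \mathcal{N}\big(0,\ (2 I_0)^{-1}\,(4 I_0)\,(2 I_0)^{-1}\big)
  = \mathcal{N}\big(0,\ I_0^{-1}\big).
\end{align*}
```

The Lipschitz-type bound on the second derivatives is what allows $HU_n(\tilde W_n)$ to be replaced by its limit $2 I_0$ once $\hat W_n$ is known to be consistent.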

2.3 Asymptotic distribution of $T_n$

In this section, we write $\hat W_n = \arg\min_{W \in \Theta_s} U_n(W)$ and $\hat W_n^0 = \arg\min_{W \in \Theta_q} U_n(W)$, where $\Theta_q$ is viewed as a subset of $\mathbb{R}^s$. The asymptotic distribution of $T_n$ is then a consequence of the previous section: if we replace $U_n(W)$ by its Taylor expansion around $\hat W_n$ and $\hat W_n^0$, following [Van der Vaart, 1998], chapter 16, we have

$$T_n = \sqrt{n}\, (\hat W_n - \hat W_n^0)^T\, I_0\, \sqrt{n}\, (\hat W_n - \hat W_n^0) + o_P(1) \xrightarrow{\ D\ } \chi^2_{s-q}.$$

3 Conclusion

It has been shown that, in the case of multidimensional output, the cost function $U_n(W)$ leads to a test for the number of parameters in MLP that is simpler than with the traditional mean square cost function. In fact, the estimator $\hat W_n$ is also more efficient than the least squares estimator (see [Rynkiewicz, 2003]). We can also remark that $U_n(W)$ matches twice the concentrated Gaussian log-likelihood, but we have to emphasize that its nice asymptotic properties need only moment conditions on $\varepsilon$ and $Z$, so it works even if the distribution of the noise is not Gaussian.

Another solution could be to use an approximation of the error covariance matrix to compute a generalized least squares estimator:

$$\frac{1}{n} \sum_{t=1}^{n} (Y_t - F_W(Z_t))^T\, \Gamma^{-1}\, (Y_t - F_W(Z_t)),$$

assuming that $\Gamma$ is a good approximation of the true covariance matrix of the noise $\Gamma(W^0)$. However, it takes time to compute a good matrix $\Gamma$, and if we try to compute the best matrix $\Gamma$ with the data, it leads to the cost function $U_n(W)$ (see for example [Gallant, 1987]).

Finally, as we see in this paper, the computation of the derivatives of $U_n(W)$ is easy, so we can use effective differential optimization techniques to estimate $\hat W_n$. Numerical examples can be found in [Rynkiewicz, 2003].

References

[Fukumizu, 2003] K. Fukumizu. Likelihood ratio of unidentifiable models and multilayer neural networks. Annals of Statistics, 31:3.
[Gallant, 1987] R.A. Gallant. Non linear statistical models. J. Wiley and Sons, New-York.
[Hall and Heyde, 1980] P. Hall and C. Heyde. Martingale limit theory and its applications. Academic Press, New-York.
[Magnus and Neudecker, 1988] Jan R. Magnus and Heinz Neudecker. Matrix differential calculus with applications in statistics and econometrics. J. Wiley and Sons, New-York, 1988.
[Rynkiewicz, 2003] J. Rynkiewicz. Estimation of multidimensional regression model with multilayer perceptrons. In J. Mira and J.R. Alvarez, editors, Computational methods in neural modeling, volume 2686 of Lecture Notes in Computer Science.
[Sussmann, 1992] H.J. Sussmann. Uniqueness of the weights for minimal feedforward nets with a given input-output map. Neural Networks.
[Van der Vaart, 1998] A.W. Van der Vaart. Asymptotic statistics. Cambridge University Press, Cambridge, UK.
[Yao, 2000] J. Yao. On least square estimation for stable nonlinear AR processes. The Annals of Institut of Mathematical Statistics, 52, 2000.
