Some Applications of Stochastic Gradient Processes
Applied Mathematical Sciences, Vol. 2, 2008, no. 25

A. Bennar (1), A. Bouamaine (2) and A. Namir (1)

(1) Département de Mathématiques et Informatique, Faculté des Sciences Ben M'sik, Université Hassan II Mohammedia, B.P., Casablanca, Maroc

(2) Ecole Nationale Supérieure d'Electricité et de Mécanique, Equipe de Recherche : Architecture des Systèmes, Université Hassan II, Ain Chok, B.P., Casablanca, Maroc

Abstract

We consider a stochastic gradient process, used for the estimation of a conditional expectation:
$$X_{n+1} = X_n - a_n\,\nabla_x\phi(V_n, X_n)\,(\phi(V_n, X_n) - U_n).$$
We give one theorem of almost sure convergence and one theorem of convergence in quadratic mean. Several applications are given: linear estimation of a conditional expectation, sequential estimation of the parameters of a mixture of laws in classification, estimation of an observable function at random points, estimation of a function $h(x) = E[Z(x)]$, estimation of the parameters of a linear regression, and estimation of a Bayesian discriminant function.

Mathematics Subject Classification: 62L20

Keywords: Stochastic approximation, conditional expectation, stochastic gradient
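To fix ideas before the formal statements, here is a minimal numerical sketch of the process in the simplest possible setting (the setup, constants and step sizes below are our illustrative choices, not the paper's):

```python
import numpy as np

# Simplest instance of the process (illustrative, not from the paper):
# take phi(v, x) = x, so grad_x phi = 1 and the recursion becomes
#   X_{n+1} = X_n - a_n (X_n - U_n).
# With a_n = 1/n this is exactly the running sample mean of U_1, ..., U_n,
# which converges a.s. to E[U] = E[E[U|V]] by the strong law of large numbers.
rng = np.random.default_rng(0)
us = 2.0 + rng.standard_normal(100000)   # draws U_n with E[U] = 2

x = 0.0                                  # X_1 = 0
for n, u in enumerate(us, start=1):
    x = x - (1.0 / n) * (x - u)          # a_n = 1/n satisfies (H1)

print(x)  # equals the sample mean of us, close to E[U] = 2
```

For a general $\phi$ the same recursion performs a noisy gradient descent on $g(x) = E[(\phi(V,x) - U)^2]$, and the hypotheses H1 to H5 below govern its almost sure behaviour.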
1. Introduction

We consider a random vector $X_n$ in $\mathbb{R}^p$ defined by:
$$X_{n+1} = X_n - a_n\,\nabla_x\phi(V_n, X_n)\,(\phi(V_n, X_n) - U_n)$$
where $(a_n)$ is a sequence of positive real numbers; $(U_1,V_1), (U_2,V_2), \ldots, (U_n,V_n)$ is a sample of independent couples of random variables with the same probability law as $(U,V)$; and $\phi(\cdot,\cdot)$ is a known real measurable function on $\mathbb{R}^k \times \mathbb{R}^p$.

In what follows, $\langle\cdot,\cdot\rangle$ and $\|\cdot\|$ are respectively the usual inner product and norm in $\mathbb{R}^k$; $A'$ denotes the transpose of the matrix $A$ and $\lambda_{\min}(B)$ the smallest eigenvalue of $B$; the abbreviation a.s. means almost surely and q.m. means quadratic mean. We write $g(x) = E[(\phi(V,x) - U)^2]$ for the criterion minimized by the process.

2. Convergence

2.1 Almost Sure Convergence

Let us make the following hypotheses:

(H1) $a_n > 0$, $\sum_{n\ge 1} a_n = \infty$, $\sum_{n\ge 1} a_n^2 < \infty$.

(H2) There exist $a$ and $b$ such that, for all $x = (x_1, x_2, \ldots, x_p) \in \mathbb{R}^p$ and all $i = 1, 2, \ldots, p$,
$$\operatorname{Var}\!\left[\frac{\partial\phi(V,x)}{\partial x_i}\,(\phi(V,x) - U)\right] < a\,g(x) + b.$$

(H3) There exists $K > 0$ such that, for all $x = (x_1, x_2, \ldots, x_p)$ and all $i, j = 1, 2, \ldots, p$,
$$\left|\frac{\partial^2 g(x)}{\partial x_i\,\partial x_j}\right| < K.$$

(H4) $\theta^*$ is a local minimum of $g$: there exists $\alpha > 0$ such that $(x \neq \theta^*,\ \|x - \theta^*\| < \alpha) \Rightarrow g(\theta^*) < g(x)$.
(H5) $\theta^*$ is the unique stationary point of $g$: for all $x \in \mathbb{R}^p$, $x \neq \theta^* \Rightarrow \nabla_x g(x) \neq 0$.

Theorem 2.1
Under hypotheses H1, H2, H3, H4, H5, we have:
$$X_n \xrightarrow{\ a.s.\ } \theta^* \quad \text{or} \quad \|X_n\| \xrightarrow{\ a.s.\ } +\infty.$$

Proof. See [3].

2.2 Quadratic Mean Convergence

In this subsection the process is written, as in [3], in the variables $(X, Y)$ and parameter $\theta$: $\theta_{n+1} = \theta_n - a_n\,\nabla_\theta\phi(X_n,\theta_n)\,(\phi(X_n,\theta_n) - Y_n)$, with $(X_1,Y_1), \ldots, (X_n,Y_n)$ a sample of independent couples distributed as $(X, Y)$. Let us make the following hypotheses:

(H8) $\phi(x,\theta)$ and $\nabla_\theta\phi(x,\theta)$ are uniformly bounded in $x$ and $\theta$.

(H9) There exist two positive real functions $h$ and $h'$ such that, for all $\theta, \theta' \in \mathbb{R}^p$ and all $x \in \mathbb{R}^q$:
$$|\phi(x,\theta) - \phi(x,\theta')| \le h(x)\,\|\theta - \theta'\|,$$
$$\|\nabla_\theta\phi(x,\theta) - \nabla_\theta\phi(x,\theta')\| \le h'(x)\,\|\theta - \theta'\|,$$
with $E[h(X)] < \infty$ and $E[h'(X)] < \infty$.

(H10) $Y$ is a bounded real random variable.

Theorem 2.2
Under hypotheses H1, H3, H8, H9, H10, we have:
$$\nabla_\theta g(\theta_n) \xrightarrow{\ a.s.\ } 0 \quad \text{and} \quad \nabla_\theta g(\theta_n) \xrightarrow{\ q.m.\ } 0.$$

Proof. See [3].

3. Applications

3.1 Sequential Estimation of a Conditional Expectation by the Linear Model
Let $\rho_1, \rho_2, \ldots, \rho_p$ be $p$ known measurable functions of $q$ real variables, and put $\rho = (\rho_1, \rho_2, \ldots, \rho_p)'$. To estimate the value $\theta^*$ of $x$ that minimizes $E\big[(E[U/V] - x'\rho(V))^2\big]$, we consider the stochastic approximation process $(X_n)$ in $\mathbb{R}^p$ defined by:
$$X_{n+1} = X_n - a_n\,\rho(V_n)\,(\rho'(V_n)X_n - U_n),$$
where $(U_1,V_1), (U_2,V_2), \ldots, (U_n,V_n)$ is a sample of $(U,V)$ formed of independent and identically distributed random variables.

Let us make the hypotheses:

(H6) $\rho_1(V), \rho_2(V), \ldots, \rho_p(V)$ are linearly independent.

(H7) The moments of order 4 of the vector $(\rho_1(V), \rho_2(V), \ldots, \rho_p(V), U)$ exist.

(H8) $X_1$ is a random variable such that $E[\|X_1\|^2] < \infty$.

Corollary
Under hypotheses H1, H6, H7, H8, we have: $X_n \xrightarrow{\ a.s.\ } \theta^*$.

Proof. Let $\phi$ be the real function on $\mathbb{R}^q \times \mathbb{R}^p$ defined by:
$$\phi(V,x) = x'\rho(V) = \sum_{j=1}^p x_j\,\rho_j(V).$$
For $j = 1, 2, \ldots, p$ we have $\dfrac{\partial\phi(V,x)}{\partial x_j} = \rho_j(V)$, so that $\nabla_x\phi(V,x) = \rho(V)$.

Let $A = E[\rho(V)\rho'(V)]$. Under H6 and H7, the matrix $A$ is symmetric positive definite, therefore invertible. Then $\theta^*$ is the unique solution of the equation
$$\nabla_x g(x) = 2E[\rho(V)(\rho'(V)x - U)] = 0,$$
and we have $\theta^* = A^{-1}E[\rho(V)U]$.
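As a numerical sketch of this corollary (the basis $\rho(v) = (1, v)'$, the law of $(U,V)$ and the steps $a_n = 1/n$ below are our illustrative assumptions, not the paper's), the process can be simulated and the iterate compared with the closed form $\theta^* = A^{-1}E[\rho(V)U]$:

```python
import numpy as np

# Illustrative setup (ours, not from the paper): V ~ N(0,1),
# U = 1 + 2V + noise, basis rho(v) = (1, v).  Then A = E[rho(V) rho(V)'] = I
# (identity), so theta* = A^{-1} E[rho(V) U] = (E[U], E[VU]) = (1, 2).
rng = np.random.default_rng(0)

def rho(v):
    return np.array([1.0, v])

x = np.zeros(2)                  # X_1 = 0, so E[||X_1||^2] < inf (H8)
for n in range(1, 50001):
    v = rng.standard_normal()
    u = 1.0 + 2.0 * v + 0.1 * rng.standard_normal()
    r = rho(v)
    # X_{n+1} = X_n - a_n rho(V_n) (rho'(V_n) X_n - U_n), with a_n = 1/n,
    # which satisfies (H1): sum a_n = inf, sum a_n^2 < inf.
    x = x - (1.0 / n) * r * (r @ x - u)

print(x)  # close to theta* = (1, 2)
```

The same loop works for any measurable basis $\rho$; only the closed-form target changes with $A$ and $E[\rho(V)U]$.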
Let us prove that hypothesis H2 is verified. We have
$$g(x) = E[U^2] + \|x\|_A^2 - 2\langle x, \theta^*\rangle_A = E[U^2] + \|x - \theta^*\|_A^2 - \|\theta^*\|_A^2$$
$$\ge c\,\|x - \theta^*\|^2 + d \qquad (c = \lambda_{\min}(A),\ d = E[U^2] - \|\theta^*\|_A^2)$$
$$\ge \tfrac{1}{2}\,c\,\|x\|^2 - c\,\|\theta^*\|^2 + d \ \ge\ e\,\|x\|^2 + f \qquad (e = \tfrac{c}{2},\ f = d - c\,\|\theta^*\|^2).$$
Therefore, for $i = 1, 2, \ldots, p$, using $\operatorname{Var}[W] \le E[W^2]$ and $(s+t)^2 \le 2s^2 + 2t^2$:
$$\operatorname{Var}\!\left[\frac{\partial\phi(V,x)}{\partial x_i}\,(\phi(V,x) - U)\right] \le E[\rho_i^2(V)(x'\rho(V) - U)^2] \le a\,\|x\|^2 + b$$
$$(a = 2E[\rho_i^2(V)\,\|\rho(V)\|^2],\ b = 2E[\rho_i^2(V)U^2])$$
$$\le a'\,g(x) + b' \qquad (a' = \tfrac{a}{e},\ b' = b - \tfrac{af}{e}).$$

Let us prove that hypothesis H3 is verified. For $i = 1, 2, \ldots, p$ we have
$$\frac{\partial g(x)}{\partial x_i} = 2E[(x'\rho(V) - U)\,\rho_i(V)],$$
therefore, for $i, j = 1, 2, \ldots, p$,
$$\frac{\partial^2 g(x)}{\partial x_i\,\partial x_j} = 2E[\rho_j(V)\rho_i(V)],$$
which does not depend on $x$ and is finite under H7.

The hypotheses of Theorem 2.1 are verified, therefore $X_n \xrightarrow{\ a.s.\ } \theta^*$ or $\|X_n\| \xrightarrow{\ a.s.\ } +\infty$.

Let us prove that we cannot have $\|X_n\| \xrightarrow{\ a.s.\ } +\infty$. Indeed, since $\sum_n a_n\,\|\nabla_x g(X_n)\|^2 < \infty$ a.s. (see [3], Lemma 2.1) and $\sum_{n\ge 1} a_n = +\infty$, there exists a subsequence of integers $(n_l)$ such that $\nabla_x g(X_{n_l}) \xrightarrow{\ a.s.\ } 0$. Besides,
$$\nabla_x g(x) = 2E[\rho(V)(\rho'(V)x - U)] = 2A(x - \theta^*),$$
therefore $\|\nabla_x g(X_n)\|^2 \ge 4\lambda_{\min}^2(A)\,\|X_n - \theta^*\|^2$ with $\lambda_{\min}(A) > 0$. Hence, if $\|X_n\| \xrightarrow{\ a.s.\ } +\infty$ then $\|\nabla_x g(X_n)\| \xrightarrow{\ a.s.\ } +\infty$, which contradicts the existence of the subsequence above.

3.2 Sequential Estimation of the Parameters of a Mixture of Laws in Classification

Let $X_1, X_2, \ldots, X_n, \ldots$ be a sample of $X$ formed of independent, identically distributed random variables of law $\mu$, with values in $\mathbb{R}^q$, defined on a probability space $(\Omega, \mathcal{A}, P)$. We suppose that $\mu$ is a mixture of laws:
$$\mu = \sum_{j=1}^r p_j\,\mu_j, \qquad \forall j:\ p_j \ge 0,\ \sum_{j=1}^r p_j = 1,$$
where each $\mu_j$ is a probability law on $\mathbb{R}^q$. Let $F$ (resp. $F_j$) be the distribution function of $\mu$ (resp. $\mu_j$). We have, for all $x \in \mathbb{R}^q$, $F(x) = \sum_{j=1}^r p_j F_j(x)$.

For $j = 1, 2, \ldots, r$, we suppose that $p_j$ depends on a parameter $\beta_j$ and $F_j$ on a multidimensional parameter $m_j$. Let $\theta = (\beta_1, \beta_2, \ldots, \beta_r, m_1, m_2, \ldots, m_r)$, let $p$ be the dimension of $\theta$, and let $\Phi$ be the real measurable function on $\mathbb{R}^p \times \mathbb{R}^q$ defined by:
$$\Phi(x, \theta) = \sum_{j=1}^r \pi(\beta_j)\,\Psi(m_j, x).$$
We suppose that the functions $\pi$ and $\Psi$ are known and verify: $\forall j,\ \pi(\beta_j) \ge 0$, $\sum_{j=1}^r \pi(\beta_j) = 1$, and for each $j$, $\Psi(m_j, \cdot)$ is a distribution function on $\mathbb{R}^q$.

We wish to determine the parameter $\theta^*$ of $\mathbb{R}^p$ such that $\Phi(x,\theta)$ approaches $F(x)$ in the least squares sense. Let $f(\theta) = E\big[(\Phi(X,\theta) - F(X))^2\big]$; we look for $\theta^*$ such that $f$ is minimal at $\theta = \theta^*$.

Let $Z$ be a random variable in $\mathbb{R}^q$ of law $\mu$ and $Y$ the indicator defined by:
$$Y = I_Z(X) = \begin{cases} 1 & \text{if } X \in D_Z \\ 0 & \text{otherwise} \end{cases}$$
with $D_z = \{x \in \mathbb{R}^q : z \le x\}$, where for $x = (x_1, \ldots, x_q)$ and $z = (z_1, \ldots, z_q)$, $z \le x$ means $z_j \le x_j$ for $j = 1, 2, \ldots, q$.

We have $E[I_Z(x)] = P(Z \le x) = F(x)$ and $E[Y/X] = E[I_Z(X)/X] = F(X)$. Therefore $f(\theta) = E\big[(E[Y/X] - \Phi(X,\theta))^2\big]$.

Let $g(\theta) = E[(Y - \Phi(X,\theta))^2] = E[(I_Z(X) - \Phi(X,\theta))^2]$. The problem of estimating the $\theta^*$ that minimizes $f$ becomes that of finding $\theta^*$ such that $g$ is minimal at $\theta = \theta^*$. We have:
$$\nabla_\theta g(\theta) = 2E\big[\nabla_\theta\Phi(X,\theta)\,(\Phi(X,\theta) - I_Z(X))\big].$$

To estimate $\theta^*$ by a sequential scheme, we define the process $(\Theta_n)$ in $\mathbb{R}^p$ by:
$$\Theta_{n+1} = \Theta_n - a_n\,\nabla_\theta\Phi(X_n, \Theta_n)\,(\Phi(X_n, \Theta_n) - I_{Z_n}(X_n)),$$
with $(X_1,Z_1), (X_2,Z_2), \ldots, (X_n,Z_n)$ a sample of $(X,Z)$ formed of independent, identically distributed random variables, and $(a_n)$ a sequence of positive real numbers.

Corollary
Under H1, H2, H3, H4, H5, we have $\Theta_n \xrightarrow{\ a.s.\ } \theta^*$ or $\|\Theta_n\| \xrightarrow{\ a.s.\ } +\infty$.

Proof. It is a consequence of the previous theorem.

3.3 Estimation of an Observable Function at Random Points

Let $h$ be a real function of $m$ real variables and $X$ a random variable with values in $\mathbb{R}^m$. We suppose that we can observe the real random variable $h(X)$. We have $E[h(X)/X] = h(X)$. We can approximate $h(X)$ by a linear combination of functions $\Upsilon_i(X)$, $i = 1, 2, \ldots, p$, and thus estimate the function $h$ by the general gradient method, with $Y = h(X)$.

3.4 Estimation of a Function $h(x) = E[Z(x)]$

Let $\{Z(x),\ x \in \mathbb{R}^m\}$ be a family of real random variables, let $E[Z(x)] = h(x)$, and let $X$ be a random variable with values in $\mathbb{R}^m$. We suppose that we can observe the real random variable $Z(X)$. We have $E[Z(X)/X] = h(X)$. We can approximate $h(X)$ by a linear combination of functions $\Upsilon_i(X)$, $i = 1, 2, \ldots, p$, and thus estimate the function $h$ by the general gradient method, with $Y = Z(X)$.

3.5 Estimation of the Parameters of a Linear Regression

The most direct application of the stochastic gradient method is linear regression. $Y$ is the explained random variable and $X_1, X_2, \ldots, X_m$ are the explanatory random variables, which constitute the variable $X \in \mathbb{R}^m$. We approximate $E[Y/X_1, X_2, \ldots, X_m]$ by a linear combination of the $\Upsilon_i(X)$, $i = 1, 2, \ldots, p$.

3.6 Estimation of a Bayesian Discriminant Function

We distinguish $r$ classes $C_1, C_2, \ldots, C_r$ in a set of individuals. To classify a new individual, we measure $m$ variables $X_1, X_2, \ldots, X_m$, which constitute the variable $X \in \mathbb{R}^m$. The use of the Bayesian classification rule requires knowledge of the posterior probabilities $P(C_i/X)$. Since $E[I_{C_i}/X] = P(C_i/X)$, these posterior probabilities can again be estimated by the general gradient method, taking for $Y$ the indicator of the class $C_i$.

References

[1] A. Bennar, Approximation stochastique : convergence dans le cas de plusieurs solutions et étude de modèles de corrélations, Thèse de doctorat de 3ème cycle, Université de Nancy I, 1985.

[2] A. Bennar, J.M. Monnez, Almost sure convergence of a stochastic approximation process in a convex set, International Journal of Applied Mathematics, vol. 20, no. 5, 2007.

[3] A. Bennar, A. Bouamaine, A. Namir, Almost sure convergence and in quadratic mean of the stochastic gradient process for the sequential estimation of a conditional expectation, Applied Mathematical Sciences, vol. 2, no. 8, 2008.

[4] A. Bouamaine, Méthodes d'approximation stochastique en analyse des données, Thèse de doctorat d'état, Université Mohamed V, 1996.
[5] E.M. Braverman, L.I. Rozonoer, Convergence of random processes in learning machines theory, Parts I and II, Automation and Remote Control, vol. 30, 1969.

[6] J. Dippon, Globally convergent stochastic optimization with optimal asymptotic distribution, Journal of Applied Probability, 35, 1998.

[7] D.A. Freedman, L.E. Dubins, A sharper form of the Borel-Cantelli lemma and the strong law, Annals of Mathematical Statistics, vol. 36.

[8] J.M. Monnez, Etude d'un processus général multidimensionnel d'approximation stochastique sous contraintes convexes. Applications à l'estimation statistique, Thèse de doctorat d'Etat ès Sciences Mathématiques, Université de Nancy I, 1982.

[9] J.M. Monnez, Almost sure convergence of stochastic gradient processes with matrix step sizes, Statistics and Probability Letters, 76, 2006.

[10] J.P. Parisot, Optimisation stochastique : le processus de Kiefer-Wolfowitz, essai de synthèse et quelques compléments, Thèse de doctorat de troisième cycle, Université de Nancy I, 1981.

[11] H. Robbins, D. Siegmund, A convergence theorem for nonnegative almost supermartingales and some applications, in: Optimizing Methods in Statistics, J.S. Rustagi (ed.), Academic Press, New York, 1971.

[12] H. Robbins, S. Monro, A stochastic approximation method, Annals of Mathematical Statistics, vol. 22, 1951.

[13] J.H. Venter, On convergence of the Kiefer-Wolfowitz approximation procedure, Annals of Mathematical Statistics, vol. 38.

[14] J.C. Spall, Introduction to Stochastic Search and Optimization, Wiley, Hoboken, New Jersey, 2003.
Received: November 19, 2007
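As a closing illustration of the sequential scheme of Section 3.2, the sketch below estimates the mixing weight of a two-component Gaussian mixture on $\mathbb{R}$ with known component means. All concrete choices (the component laws, the parameterization $\pi(\beta) = \operatorname{sigmoid}(\beta)$, the steps $a_n = n^{-0.6}$) are our illustrative assumptions, not the paper's:

```python
import math
import random

# Toy special case of the scheme of Section 3.2 (illustrative setup, ours):
# mu = p N(-2, 1) + (1 - p) N(2, 1), with component means known and only the
# mixing weight unknown.  We parameterize it as pi(beta) = sigmoid(beta), so
# the two weights are nonnegative and sum to 1.
random.seed(0)

P_TRUE = 0.3                     # true weight of the component centered at -2

def psi(m, x):                   # Psi(m, .): distribution function of N(m, 1)
    return 0.5 * (1.0 + math.erf((x - m) / math.sqrt(2.0)))

def sigmoid(b):
    return 1.0 / (1.0 + math.exp(-b))

def draw():                      # one draw from the mixture law mu
    m = -2.0 if random.random() < P_TRUE else 2.0
    return random.gauss(m, 1.0)

beta = 0.0
for n in range(1, 200001):
    xn, zn = draw(), draw()      # (X_n, Z_n): independent draws from mu
    y = 1.0 if zn <= xn else 0.0 # Y_n = I_{Z_n}(X_n), so E[Y|X] = F(X)
    s = sigmoid(beta)
    phi_val = s * psi(-2.0, xn) + (1.0 - s) * psi(2.0, xn)   # Phi(X_n, beta)
    grad = s * (1.0 - s) * (psi(-2.0, xn) - psi(2.0, xn))    # dPhi/dbeta
    # Theta_{n+1} = Theta_n - a_n grad_theta Phi (Phi - Y), a_n = n^{-0.6},
    # which satisfies (H1): sum a_n = inf, sum a_n^2 < inf.
    beta -= n ** -0.6 * grad * (phi_val - y)

print(round(sigmoid(beta), 2))   # estimated weight, close to 0.3
```

Estimating the component parameters $m_j$ as well uses the same recursion with the extra coordinates $\partial\Phi/\partial m_j$ in the gradient.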
More informationJournal of Inequalities in Pure and Applied Mathematics
Journal of Inequalities in Pure and Applied athematics http://jipam.vu.edu.au/ Volume 4, Issue 5, Article 98, 2003 ASYPTOTIC BEHAVIOUR OF SOE EQUATIONS IN ORLICZ SPACES D. ESKINE AND A. ELAHI DÉPARTEENT
More informationMulti-operator scaling random fields
Available online at www.sciencedirect.com Stochastic Processes and their Applications 121 (2011) 2642 2677 www.elsevier.com/locate/spa Multi-operator scaling random fields Hermine Biermé a, Céline Lacaux
More informationDivision of the Humanities and Social Sciences. Sums of sets, etc.
Division of the Humanities and Social Sciences Sums of sets, etc. KC Border September 2002 Rev. November 2012 Rev. September 2013 If E and F are subsets of R m, define the sum E + F = {x + y : x E; y F
More informationEconomics 620, Lecture 2: Regression Mechanics (Simple Regression)
1 Economics 620, Lecture 2: Regression Mechanics (Simple Regression) Observed variables: y i ; x i i = 1; :::; n Hypothesized (model): Ey i = + x i or y i = + x i + (y i Ey i ) ; renaming we get: y i =
More informationON THE PATHWISE UNIQUENESS OF SOLUTIONS OF STOCHASTIC DIFFERENTIAL EQUATIONS
PORTUGALIAE MATHEMATICA Vol. 55 Fasc. 4 1998 ON THE PATHWISE UNIQUENESS OF SOLUTIONS OF STOCHASTIC DIFFERENTIAL EQUATIONS C. Sonoc Abstract: A sufficient condition for uniqueness of solutions of ordinary
More informationSTABILITY OF n-th ORDER FLETT S POINTS AND LAGRANGE S POINTS
SARAJEVO JOURNAL OF MATHEMATICS Vol.2 (4) (2006), 4 48 STABILITY OF n-th ORDER FLETT S POINTS AND LAGRANGE S POINTS IWONA PAWLIKOWSKA Abstract. In this article we show the stability of Flett s points and
More informationOptimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison
Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big
More informationEconomics 620, Lecture 8: Asymptotics I
Economics 620, Lecture 8: Asymptotics I Nicholas M. Kiefer Cornell University Professor N. M. Kiefer (Cornell University) Lecture 8: Asymptotics I 1 / 17 We are interested in the properties of estimators
More informationBeyond stochastic gradient descent for large-scale machine learning
Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - CAP, July
More informationThe ϵ-capacity of a gain matrix and tolerable disturbances: Discrete-time perturbed linear systems
IOSR Journal of Mathematics (IOSR-JM) e-issn: 2278-5728, p-issn: 2319-765X. Volume 11, Issue 3 Ver. IV (May - Jun. 2015), PP 52-62 www.iosrjournals.org The ϵ-capacity of a gain matrix and tolerable disturbances:
More informationMATH 104: INTRODUCTORY ANALYSIS SPRING 2008/09 PROBLEM SET 8 SOLUTIONS
MATH 04: INTRODUCTORY ANALYSIS SPRING 008/09 PROBLEM SET 8 SOLUTIONS. Let f : R R be continuous periodic with period, i.e. f(x + ) = f(x) for all x R. Prove the following: (a) f is bounded above below
More information1 Basics of probability theory
Examples of Stochastic Optimization Problems In this chapter, we will give examples of three types of stochastic optimization problems, that is, optimal stopping, total expected (discounted) cost problem,
More informationFUNCTIONAL SEQUENTIAL AND TRIGONOMETRIC SUMMABILITY OF REAL AND COMPLEX FUNCTIONS
International Journal of Analysis Applications ISSN 91-8639 Volume 15, Number (17), -8 DOI: 1.894/91-8639-15-17- FUNCTIONAL SEQUENTIAL AND TRIGONOMETRIC SUMMABILITY OF REAL AND COMPLEX FUNCTIONS M.H. HOOSHMAND
More informationChp 4. Expectation and Variance
Chp 4. Expectation and Variance 1 Expectation In this chapter, we will introduce two objectives to directly reflect the properties of a random variable or vector, which are the Expectation and Variance.
More informationPattern Recognition and Machine Learning. Bishop Chapter 6: Kernel Methods
Pattern Recognition and Machine Learning Chapter 6: Kernel Methods Vasil Khalidov Alex Kläser December 13, 2007 Training Data: Keep or Discard? Parametric methods (linear/nonlinear) so far: learn parameter
More informationLinear Models Review
Linear Models Review Vectors in IR n will be written as ordered n-tuples which are understood to be column vectors, or n 1 matrices. A vector variable will be indicted with bold face, and the prime sign
More informationCHANGE DETECTION WITH UNKNOWN POST-CHANGE PARAMETER USING KIEFER-WOLFOWITZ METHOD
CHANGE DETECTION WITH UNKNOWN POST-CHANGE PARAMETER USING KIEFER-WOLFOWITZ METHOD Vijay Singamasetty, Navneeth Nair, Srikrishna Bhashyam and Arun Pachai Kannu Department of Electrical Engineering Indian
More information13. Nonlinear least squares
L. Vandenberghe ECE133A (Fall 2018) 13. Nonlinear least squares definition and examples derivatives and optimality condition Gauss Newton method Levenberg Marquardt method 13.1 Nonlinear least squares
More informationProduct measures, Tonelli s and Fubini s theorems For use in MAT4410, autumn 2017 Nadia S. Larsen. 17 November 2017.
Product measures, Tonelli s and Fubini s theorems For use in MAT4410, autumn 017 Nadia S. Larsen 17 November 017. 1. Construction of the product measure The purpose of these notes is to prove the main
More informationHamad Talibi Alaoui and Radouane Yafia. 1. Generalities and the Reduced System
FACTA UNIVERSITATIS (NIŠ) Ser. Math. Inform. Vol. 22, No. 1 (27), pp. 21 32 SUPERCRITICAL HOPF BIFURCATION IN DELAY DIFFERENTIAL EQUATIONS AN ELEMENTARY PROOF OF EXCHANGE OF STABILITY Hamad Talibi Alaoui
More informationExercise List: Proving convergence of the Gradient Descent Method on the Ridge Regression Problem.
Exercise List: Proving convergence of the Gradient Descent Method on the Ridge Regression Problem. Robert M. Gower September 5, 08 Introduction Ridge regression is perhaps the simplest example of a training
More informationLimiting distribution for subcritical controlled branching processes with random control function
Available online at www.sciencedirect.com Statistics & Probability Letters 67 (2004 277 284 Limiting distribution for subcritical controlled branching processes with random control function M. Gonzalez,
More information