Monte-Carlo (MMD-MA), Université Paris-Dauphine
Xiaolu Tan, tan@ceremade.dauphine.fr
September 2015
Contents

1 Introduction
  1.1 The principle
  1.2 The error analysis of the Monte-Carlo method
2 Simulation of random vectors or stochastic processes
  2.1 Random variable of uniform distribution on [0, 1]
  2.2 Inverse method
  2.3 Transformation method
  2.4 Reject method
  2.5 Simulation of Gaussian vectors
  2.6 Simulation of Brownian motion
3 Variance reduction techniques
  3.1 The antithetic variable
  3.2 Variate control method
  3.3 Stratification
  3.4 Importance sampling method
4 Stochastic gradient algorithm
Chapter 1

Introduction

1.1 The principle

The principle of the Monte-Carlo method is to use simulations of random variables to estimate a quantity such as
$$E[f(X)] = \int_{\mathbb{R}^d} f(x)\,\mu(dx), \qquad (1.1)$$
where $f : \mathbb{R}^d \to \mathbb{R}$ is a function and $X$ is a random vector taking values in $\mathbb{R}^d$ with distribution $\mu$. It may also be used to solve an optimization problem of the kind
$$\min_{\theta \in \Theta} E[f_\theta(X)], \qquad (1.2)$$
where $(f_\theta : \mathbb{R}^d \to \mathbb{R})_{\theta \in \Theta}$ is a family of functions.

To solve the basic problem (1.1), the method consists in simulating a sequence of i.i.d. random vectors $(X_k)_{k \ge 1}$ with the same distribution as $X$, and then estimating $E[f(X)]$ by the empirical mean
$$\bar Y_n := \frac{1}{n} \sum_{k=1}^n Y_k := \frac{1}{n} \sum_{k=1}^n f(X_k). \qquad (1.3)$$

The advantages of the Monte-Carlo method are usually its simplicity, its flexibility and its efficiency for high-dimensional problems. It can also serve as an alternative method (or benchmark) for other numerical methods.

1.2 The error analysis of the Monte-Carlo method

Theorem 1.1 (Law of large numbers) Let $(Y_k)_{k \ge 1}$ be a sequence of i.i.d. random variables such that $E|Y_1| < \infty$. Then, with $\bar Y_n$ defined in (1.3), one has
$$P\Big( \lim_{n \to \infty} \bar Y_n = E[Y_1] \Big) = 1.$$
Theorem 1.2 (Central limit theorem) Let $(Y_k)_{k \ge 1}$ be a sequence of i.i.d. random variables such that $E[Y_1^2] < \infty$. Then
$$\sqrt{n}\,\big(\bar Y_n - E[Y_1]\big) \Longrightarrow N\big(0, \mathrm{Var}(Y_1)\big),$$
and consequently
$$\frac{\sqrt{n}}{\hat\sigma_n}\big(\bar Y_n - E[Y_1]\big) \Longrightarrow N(0, 1), \quad \text{where } \hat\sigma_n^2 := \frac{1}{n}\sum_{k=1}^n \big(Y_k - \bar Y_n\big)^2. \qquad (1.4)$$

Notice that $\hat\sigma_n^2$ defined in (1.4) is an estimator of the variance $\mathrm{Var}(Y_1)$ based on the sequence $(Y_k)_{k \ge 1}$, and it admits the representation
$$\hat\sigma_n^2 = \frac{1}{n}\sum_{k=1}^n Y_k^2 - \big(\bar Y_n\big)^2.$$

The central limit theorem implies that the asymptotic confidence interval of level $p(R) := \frac{1}{\sqrt{2\pi}}\int_{-R}^{R} e^{-x^2/2}\,dx$ for the estimator $\bar Y_n$ is given by
$$\Big[\bar Y_n - \frac{R}{\sqrt n}\,\hat\sigma_n,\ \bar Y_n + \frac{R}{\sqrt n}\,\hat\sigma_n\Big]. \qquad (1.5)$$
More precisely, it means that
$$P\Big(\bar Y_n - \frac{R}{\sqrt n}\,\hat\sigma_n \le E[Y_1] \le \bar Y_n + \frac{R}{\sqrt n}\,\hat\sigma_n\Big) \longrightarrow p(R), \quad \text{as } n \to \infty.$$

Remark 1.1 In practice, for $p(R) = 95\%$, we know $R \approx 1.96$.

Conclusions. To use the Monte-Carlo method, the first issue is how to simulate a sequence of random vectors $(X_k)_{k\ge1}$, given its law $\mu$ or given its definition in terms of other random elements; the second issue is how to improve the estimator by reducing its error. The error of the Monte-Carlo method is measured by its confidence interval (1.5), whose length is $2R\hat\sigma_n/\sqrt n$. When the confidence level is fixed, $R$ is fixed. One can then use a larger $n$, at the cost of a computation time which is in general proportional to $n$. Otherwise, one can reduce $\hat\sigma_n$ by finding some other random variable $\tilde Y$ such that $E[\tilde Y] = E[Y]$ and $\mathrm{Var}(\tilde Y) < \mathrm{Var}(Y)$. We shall address these issues in the rest of the course.
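As a quick illustration (this code is not part of the original notes; the function names and the test integrand $f(x) = x^2$ are our own choices), the estimator (1.3) together with the confidence interval (1.5) can be sketched in Python with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

def monte_carlo_ci(f, sampler, n, R=1.96):
    """Monte-Carlo estimate of E[f(X)] with the asymptotic confidence
    interval (1.5): [Ybar - R*sigma_hat/sqrt(n), Ybar + R*sigma_hat/sqrt(n)]."""
    y = f(sampler(n))
    y_bar = y.mean()
    sigma_hat = y.std()               # sqrt of (1/n) sum (Y_k - Ybar)^2, as in (1.4)
    half_width = R * sigma_hat / np.sqrt(n)
    return y_bar, (y_bar - half_width, y_bar + half_width)

# Estimate E[X^2] for X ~ N(0, 1); the true value is 1.
est, (lo, hi) = monte_carlo_ci(lambda x: x**2, rng.standard_normal, n=100_000)
```

With $R = 1.96$ the interval has asymptotic level 95%, as in Remark 1.1.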
Chapter 2

Simulation of random vectors or stochastic processes

2.1 Random variable of uniform distribution on [0, 1]

Here we take for granted that we know how to simulate a random variable of uniform distribution $U([0,1])$. In particular, most programming environments come equipped with a generator of the uniform distribution. However, it is worth noticing that a random number generator in a computer is a deterministic program, and hence it always produces a sequence of deterministic values rather than a sequence of independent random variables. In practice, we look for a generator whose output behaves, statistically, like an i.i.d. uniform sample.

2.2 Inverse method

Let $X$ be a random variable; its distribution function, defined by $F(x) := P(X \le x)$, is a right-continuous non-decreasing function from $\mathbb{R}$ to $[0,1]$. We then define its generalized inverse (quantile) function by
$$F^{-1}(u) := \inf\{x \in \mathbb{R} : F(x) \ge u\}.$$

Theorem 2.1 Let $X$ be a random variable with distribution function $F$, and let $U$ be a random variable of uniform distribution $U([0,1])$ on the interval $[0,1]$. Then $F^{-1}(U)$ has the same law as $X$.
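In practice, Theorem 2.1 is applied with an explicit formula for $F^{-1}$. A Python sketch (not part of the notes; it uses the exponential distribution, whose inverse is computed in Example 2.1 below):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_exponential(lam, n):
    """Inverse method: for Exp(lam), F(x) = 1 - exp(-lam*x), so
    F^{-1}(U) = -log(1-U)/lam, and -log(U)/lam has the same law."""
    u = rng.random(n)
    return -np.log(u) / lam

x = sample_exponential(lam=2.0, n=200_000)
# For Exp(lam): mean = 1/lam and variance = 1/lam^2.
```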
Proof. Notice that for any $y \in \mathbb{R}$ and $u \in (0,1)$, we have $F^{-1}(u) \le y \iff u \le F(y)$. It follows that for any $y \in \mathbb{R}$,
$$P\big(F^{-1}(U) \le y\big) = P\big(U \le F(y)\big) = F(y).$$

Example 2.1 (i) Let $X$ be a random variable of discrete distribution, $P(X = x_k) = p_k$, where $x_k \in \mathbb{R}$, $p_k \ge 0$ for $k \in \mathbb{N}$ and $\sum_{k \in \mathbb{N}} p_k = 1$. Let $U \sim U([0,1])$, and let $Z$ be the random variable defined by
$$Z := x_n, \quad \text{if } U \in \Big( \sum_{k=0}^{n-1} p_k,\ \sum_{k=0}^{n} p_k \Big].$$
Then $Z$ has the same distribution as $X$. The definition of $Z$ can be interpreted as $F^{-1}(U)$, with $F$ the distribution function of $X$.

(ii) Let $X \sim \mathcal{E}(\lambda)$ be a random variable of exponential distribution with parameter $\lambda > 0$, i.e. the density function is $f(x) := \lambda e^{-\lambda x} 1_{x \ge 0}$ and the distribution function is $F(x) := 1 - e^{-\lambda x}$ for $x \ge 0$. By direct computation, $F^{-1}(u) = -\lambda^{-1}\log(1-u)$ for every $u \in (0,1)$. Then for $U \sim U([0,1])$,
$$F^{-1}(U) = -\lambda^{-1}\log(1-U) \ \stackrel{d}{=}\ -\lambda^{-1}\log(U) \ \sim\ \mathcal{E}(\lambda),$$
since $1-U$ and $U$ have the same distribution when $U \sim U([0,1])$.

2.3 Transformation method

Proposition 2.1 (Box-Muller) Suppose that $U$ and $V$ are independent random variables of uniform distribution on the interval $[0,1]$. Let
$$X := \sqrt{-2\log(U)}\,\cos(2\pi V) \quad \text{and} \quad Y := \sqrt{-2\log(U)}\,\sin(2\pi V).$$
Then $X$ and $Y$ are two independent random variables of Gaussian distribution $N(0,1)$.

Proof.

Exercise 2.1 Let $(U, V)$ be a random vector uniformly distributed on the disk $\{(u,v) : u^2 + v^2 \le 1\}$. Let
$$X := U\sqrt{\frac{-2\log(U^2+V^2)}{U^2+V^2}} \quad \text{and} \quad Y := V\sqrt{\frac{-2\log(U^2+V^2)}{U^2+V^2}}.$$
Prove that $(X, Y) \sim N(0, I_2)$.

2.4 Reject method

Let $f : \mathbb{R}^d \to \mathbb{R}_+$ and $g : \mathbb{R}^d \to \mathbb{R}_+$ be two density functions such that, for some constant $\gamma > 0$, one has
$$f(x) \le \gamma\, g(x), \quad \text{for all } x \in \mathbb{R}^d.$$
In practice, $g$ is the density of some distribution with a well-known simulation method (such as the Gaussian, uniform or exponential distribution), while $f$ is the density of some distribution without an easy simulation method. The objective is to use simulations of random variables with density $g$, together with a rejection procedure, to simulate a random variable with density $f$.

Proposition 2.2 Let $(Y_k)_{k\ge1}$ be an i.i.d. sequence of random variables with density $g$, and $(U_k)_{k\ge1}$ an i.i.d. sequence of random variables of distribution $U([0,1])$, with $(Y_k)_{k\ge1}$ and $(U_k)_{k\ge1}$ independent of each other. Define a sequence $(X_n)_{n\ge1}$ of random variables by
$$X_n := Y_{N_n}, \quad \text{with } N_0 := 0 \ \text{ and } \ N_{n+1} := \min\Big\{ k > N_n : U_k \le \frac{f(Y_k)}{\gamma\, g(Y_k)} \Big\}.$$
Then $(X_n)_{n\ge1}$ is a sequence of i.i.d. random variables with density $f$.

Proof.

Exercise 2.2 Let $f : \mathbb{R} \to \mathbb{R}_+$ be defined by $f(x) := (1 - |x|)^+$. Give a numerical algorithm (based on the above reject method) to simulate an i.i.d. sequence of random variables with density $f$.

2.5 Simulation of Gaussian vectors

The case of dimension 2. Let $(Z_1, Z_2) \sim N(0, I_2)$ be two independent standard Gaussian random variables, and let $\rho \in [-1, 1]$ be a constant. Define
$$X_1 := Z_1 \quad \text{and} \quad X_2 := \rho Z_1 + \sqrt{1-\rho^2}\, Z_2.$$
It is clear that $X_1, X_2 \sim N(0,1)$ and
$$\mathrm{Cov}(X_1, X_2) = \mathrm{Cov}\big(Z_1,\ \rho Z_1 + \sqrt{1-\rho^2}\,Z_2\big) = \rho,$$
which means that
$$\binom{X_1}{X_2} \sim N\Big( \binom{0}{0},\ \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \Big).$$
More generally, for $(Z_1, Z_2) \sim N(0, I_2)$, let
$$X_1 := \mu_1 + \sigma_1 Z_1 \quad \text{and} \quad X_2 := \mu_2 + \sigma_2\big(\rho Z_1 + \sqrt{1-\rho^2}\,Z_2\big);$$
then
$$\binom{X_1}{X_2} \sim N\Big( \binom{\mu_1}{\mu_2},\ \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix} \Big). \qquad (2.1)$$
Notice also that any Gaussian vector of dimension 2 can be written in the form (2.1).

General case: Cholesky's method. Let $Z \sim N(0, I_d)$ be a standard Gaussian random vector of dimension $d$, and let $A$ be a lower triangular $d \times d$ matrix, i.e.
$$Z = \begin{pmatrix} Z_1 \\ \vdots \\ Z_d \end{pmatrix} \quad \text{and} \quad A = \begin{pmatrix} A_{11} & 0 & \cdots & 0 \\ A_{21} & A_{22} & \cdots & 0 \\ \vdots & & \ddots & \\ A_{d1} & A_{d2} & \cdots & A_{dd} \end{pmatrix}.$$
Then the vector $X := AZ \sim N(0, \Sigma)$, with variance-covariance matrix $\Sigma := AA^T$. Cholesky's method consists in finding a lower triangular matrix $A$ such that $AA^T = \Sigma$, where $\Sigma$ is a given variance-covariance matrix. Writing the equation $AA^T = \Sigma$ entry by entry and solving it column by column, the first column gives
$$A_{11}^2 = \Sigma_{11}, \quad A_{21}A_{11} = \Sigma_{21}, \quad \ldots, \quad A_{d1}A_{11} = \Sigma_{d1},$$
and then, for $j < i$,
$$A_{ii} = \sqrt{\Sigma_{ii} - \sum_{k=1}^{i-1} A_{ik}^2}, \qquad A_{ij} = \Big(\Sigma_{ij} - \sum_{k=1}^{j-1} A_{ik}A_{jk}\Big) \Big/ A_{jj}.$$

Exercise 2.3 Provide a pseudo-code for the algorithm of Cholesky's method.
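One possible answer to Exercise 2.3, written directly in Python rather than pseudo-code (a sketch, not part of the notes; the example matrix is arbitrary):

```python
import numpy as np

def cholesky_lower(sigma):
    """Cholesky's method: build the lower triangular A with A @ A.T == sigma,
    using the column-by-column formulas of Section 2.5."""
    d = sigma.shape[0]
    A = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1):
            s = sigma[i, j] - A[i, :j] @ A[j, :j]
            if i == j:
                A[i, j] = np.sqrt(s)      # A_ii = sqrt(Sigma_ii - sum_{k<i} A_ik^2)
            else:
                A[i, j] = s / A[j, j]     # A_ij = (Sigma_ij - sum_{k<j} A_ik A_jk) / A_jj
    return A

# Simulate X = A Z ~ N(0, Sigma) from Z ~ N(0, I_d).
rng = np.random.default_rng(3)
sigma = np.array([[1.0, 0.5], [0.5, 2.0]])
A = cholesky_lower(sigma)
Z = rng.standard_normal((2, 100_000))
X = A @ Z
```

NumPy's built-in `np.linalg.cholesky` computes the same factor and would normally be used in practice.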
2.6 Simulation of Brownian motion

Definition 2.1 A standard Brownian motion $W$ is a stochastic process starting from 0 and having
(i) continuous paths (i.e. $t \mapsto W_t$ is almost surely continuous);
(ii) independent increments (i.e. $W_t - W_s$ is independent of $W_s - W_r$ for all $0 \le r \le s \le t$);
(iii) stationary Gaussian increments (i.e. $W_t - W_s \sim N(0, t-s)$).

Forward simulation. Using the independence and stationarity of the Gaussian increments, one can simulate a path of a Brownian motion in a forward way. Let $0 = t_0 < t_1 < \cdots$ be a discrete grid of $\mathbb{R}_+$ and $(Z_k)_{k\ge1}$ a sequence of i.i.d. random variables of Gaussian distribution $N(0,1)$; we define $W$ by
$$W_0 := 0 \quad \text{and} \quad W_{t_{k+1}} := W_{t_k} + \sqrt{t_{k+1} - t_k}\, Z_{k+1}.$$
Then $W$ is a sample path of the Brownian motion on the discrete grid $(t_k)_{k\ge0}$.

Brownian bridge. The forward simulation method consists in simulating $W_{t_{k+1}}$ knowing the value of $W_{t_k}$. There is also a backward simulation method: one simulates first the variable $W_{t_n}$, and then the variables $W_{t_{n-1}}, W_{t_{n-2}}, \ldots, W_{t_2}, W_{t_1}$ recursively.

Proposition 2.3 Let $0 = t_0 < t_1 < \cdots$ be a discrete grid; then the conditional distribution of $W_{t_k}$ knowing $(W_{t_i}, i \ne k)$ is a Gaussian distribution $N(\mu, \sigma^2)$ with
$$\mu = \frac{t_{k+1} - t_k}{t_{k+1} - t_{k-1}}\, W_{t_{k-1}} + \frac{t_k - t_{k-1}}{t_{k+1} - t_{k-1}}\, W_{t_{k+1}} \quad \text{and} \quad \sigma^2 = \frac{(t_{k+1} - t_k)(t_k - t_{k-1})}{t_{k+1} - t_{k-1}};$$
in particular,
$$\mathcal{L}\big(W_{t_k} \,\big|\, W_{t_{k-1}} = x,\ W_{t_{k+1}} = y\big) = N\Big( \frac{t_{k+1} - t_k}{t_{k+1} - t_{k-1}}\,x + \frac{t_k - t_{k-1}}{t_{k+1} - t_{k-1}}\,y,\ \frac{(t_{k+1} - t_k)(t_k - t_{k-1})}{t_{k+1} - t_{k-1}} \Big).$$

Proof.

Exercise 2.4 Give the backward simulation algorithm for a Brownian motion on $[0, 1]$, using the above results.
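The forward simulation scheme can be sketched in Python (not part of the notes; the grid and the number of paths are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)

def brownian_path(t, n_paths):
    """Forward simulation on a grid t[0]=0 < t[1] < ...:
    W_{t_{k+1}} = W_{t_k} + sqrt(t_{k+1} - t_k) * Z_{k+1}, with Z i.i.d. N(0,1)."""
    dt = np.diff(t)
    z = rng.standard_normal((n_paths, dt.size))
    increments = np.sqrt(dt) * z
    w = np.concatenate([np.zeros((n_paths, 1)), np.cumsum(increments, axis=1)], axis=1)
    return w

t = np.linspace(0.0, 1.0, 101)
w = brownian_path(t, n_paths=50_000)
# The terminal value W_1 should be distributed as N(0, 1).
```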
Chapter 3

Variance reduction techniques

Recall that the principle of the Monte-Carlo method is to estimate $E[Y]$, where $Y := f(X)$, by
$$\bar Y_n := \frac{1}{n}\sum_{k=1}^n Y_k := \frac{1}{n}\sum_{k=1}^n f(X_k), \qquad (3.1)$$
with simulations $(Y_k)_{k\ge1}$ (or more precisely $(X_k)_{k\ge1}$) of the random variable $Y$. In view of the confidence interval (1.5), it is clear that to reduce the error, one should either increase the number of simulations $n$ (at the cost of computation time), or reduce the variance $\hat\sigma_n^2$. More precisely, since the variance $\mathrm{Var}(Y)$ of $Y$ is fixed, the real issue is to find some other random variable $\tilde Y$ satisfying
$$E[\tilde Y] = E[Y] \quad \text{and} \quad \mathrm{Var}(\tilde Y) < \mathrm{Var}(Y). \qquad (3.2)$$
In most cases, $\tilde Y$ admits the representation $\tilde Y := g(X)$ for some function $g : \mathbb{R}^d \to \mathbb{R}$. Then, using simulations of $\tilde Y$, one can expect an estimator of $E[Y] = E[\tilde Y]$ with smaller error.

3.1 The antithetic variable

For many random variables (vectors) $X$, the distribution has some symmetry property and admits a simple transformation $A : \mathbb{R}^d \to \mathbb{R}^d$ such that $A(X)$ and $X$ have the same distribution. We call $A(X)$ the antithetic variable of $X$. For example, let $X \sim U([0,1])$; then $A(X) := 1 - X \sim U([0,1])$. Let $X \sim N(0, \sigma^2)$; then $A(X) := -X \sim N(0, \sigma^2)$. It follows that
$E[f(A(X))] = E[f(X)]$, and hence $E[\tilde Y] = E[Y]$ with
$$\tilde Y := \frac{f(X) + f(A(X))}{2}.$$
Then a new Monte-Carlo estimator can be given by
$$\tilde Y_n := \frac{1}{n}\sum_{k=1}^n \frac{f(X_k) + f(A(X_k))}{2} = \frac{1}{2n}\sum_{k=1}^n \big(f(X_k) + f(A(X_k))\big). \qquad (3.3)$$
In some contexts, we can expect $\mathrm{Var}(\tilde Y)$ to be much smaller than $\mathrm{Var}(Y)$ (see the criterion (3.2)).

Example 3.1 (Naive examples) (i) Let $f(x) := x$ and let $X \sim N(0, \sigma^2)$ be a Gaussian r.v., so that $Y := f(X) = X$. The random variable $X$ admits the antithetic variable $-X$. Then $\tilde Y := \frac{f(X) + f(-X)}{2} \equiv 0$, and it is clear that
$$E[\tilde Y] = E[Y] \quad \text{and} \quad \mathrm{Var}(Y) > \mathrm{Var}(\tilde Y) = 0,$$
i.e. (3.2) holds for this example.

(ii) Let $f(x) := x$, $Y := f(U)$ and $U \sim U([0,1])$, which admits the antithetic variable $1-U$. Then $\tilde Y := \frac{f(U) + f(1-U)}{2} \equiv \frac12$, and it is clear that (3.2) holds in this context.

Exercise 3.1 Let $U \sim U([0,1])$; then $X := -\frac{\log(U)}{\lambda} \sim \mathcal{E}(\lambda)$ and $\tilde X := -\frac{\log(1-U)}{\lambda} \sim \mathcal{E}(\lambda)$. Then
$$E\Big[\frac{X + \tilde X}{2}\Big] = E[X].$$
Compare the variance of $X$ with that of $\frac{X + \tilde X}{2}$.

Variance analysis. By direct computation, we have
$$\mathrm{Var}(\tilde Y) = \frac14\Big(\mathrm{Var}(f(X)) + 2\,\mathrm{Cov}\big(f(X), f(A(X))\big) + \mathrm{Var}(f(A(X)))\Big) = \frac12\,\mathrm{Var}(Y) + \frac12\,\mathrm{Cov}\big(f(X), f(A(X))\big).$$
Then one has
$$\mathrm{Var}(\tilde Y) \le \frac12\,\mathrm{Var}(Y) \qquad (3.4)$$
whenever $\mathrm{Cov}(f(X), f(A(X))) \le 0$. Intuitively, since $A(X)$ is the antithetic variable, we can expect $f(A(X))$ to be negatively correlated with $f(X)$. In practice, the computational cost of the estimators $\tilde Y_n$ (in (3.3))
and $\bar Y_{2n}$ (in (3.1)) should be the same, and under condition (3.4) one has $\mathrm{Var}(\tilde Y_n) \le \mathrm{Var}(\bar Y_{2n})$.

Remark 3.1 It is very important to use the same simulation $X_k$ in both terms of the estimator $\tilde Y_n$ in (3.3). Otherwise, imagine that $(\tilde X_k)_{k\ge1}$ is i.i.d. with the same law as $(X_k)_{k\ge1}$ but independent of it, and consider
$$\hat Y_n := \frac{1}{n}\sum_{k=1}^n \frac{f(X_k) + f(A(\tilde X_k))}{2}.$$
Then one has $\mathrm{Var}(\hat Y_n) = \mathrm{Var}(\bar Y_{2n})$, which means that the estimator $\hat Y_n$ is no better than the classical estimator.

Case of the Gaussian distribution. When $X$ has a Gaussian distribution, we can provide a more precise criterion for condition (3.4).

Proposition 3.1 Let $X \sim N(\mu, \sigma^2)$, which admits the antithetic variable $A(X) := 2\mu - X$. Let $f : \mathbb{R} \to \mathbb{R}$ be a monotone function; then
$$\mathrm{Cov}\big(f(X), f(A(X))\big) \le 0.$$

Proof. Without loss of generality, we can suppose that $X \sim N(0,1)$, so that $A(X) = -X$. Let $X_1, X_2$ be two independent r.v.s of distribution $N(0,1)$; then for a monotone function $f$, one has
$$\big(f(X_1) - f(X_2)\big)\big(f(-X_1) - f(-X_2)\big) \le 0,$$
and hence
$$E\Big[\big(f(X_1) - f(X_2)\big)\big(f(-X_1) - f(-X_2)\big)\Big] \le 0.$$
By direct computation, it follows that
$$0 \ge E\Big[\big(f(X_1) - f(X_2)\big)\big(f(-X_1) - f(-X_2)\big)\Big] = 2\,\mathrm{Cov}\big(f(X_1), f(-X_1)\big) = 2\,\mathrm{Cov}\big(f(X), f(-X)\big).$$

Example 3.2 In financial applications, a typical problem is to compute
$$E\big[e^{-rT}(S_T - K)^+\big], \quad \text{with} \quad S_T := S_0\, e^{(r - \sigma^2/2)T + \sigma W_T},$$
where $W$ is a Brownian motion, i.e. $W_T \sim N(0, T)$. In this case, it is clear that $Y := e^{-rT}(S_T - K)^+$ can be expressed as an increasing function of $W_T$, and one can then use the antithetic variable technique in the Monte-Carlo method.
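A sketch of Example 3.2 with the antithetic variable $A(W_T) = -W_T$ (not part of the notes; the market parameters below are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative parameters (our choice, not from the notes).
S0, K, r, sigma, T = 100.0, 100.0, 0.05, 0.2, 1.0
n = 100_000

def payoff(wt):
    """Discounted call payoff e^{-rT}(S_T - K)^+, an increasing function of W_T."""
    st = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * wt)
    return np.exp(-r * T) * np.maximum(st - K, 0.0)

wt = np.sqrt(T) * rng.standard_normal(n)
plain = payoff(wt)                               # classical estimator based on W_T
antithetic = 0.5 * (payoff(wt) + payoff(-wt))    # A(W_T) = -W_T has the same law

est_plain, est_anti = plain.mean(), antithetic.mean()
var_plain, var_anti = plain.var(), antithetic.var()
```

Since the payoff is monotone in $W_T$, Proposition 3.1 guarantees a non-positive covariance, so the antithetic estimator has at most half the variance per function evaluation.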
3.2 Variate control method

We recall that the random variable of interest takes the form $Y = f(X)$ for some random vector $X$ and some $f : \mathbb{R}^d \to \mathbb{R}$. Suppose that there is some other function $g : \mathbb{R}^d \to \mathbb{R}$ (close to $f$) such that the constant $m := E[g(X)]$ can be computed explicitly. Then for every constant $b \in \mathbb{R}$, one has $E[Y] = E[\tilde Y(b)]$ with
$$\tilde Y(b) := f(X) - b\big(g(X) - m\big).$$
This yields another Monte-Carlo estimator of $E[Y]$, with simulations $(X_k)_{k\ge1}$:
$$\frac{1}{n}\sum_{k=1}^n \tilde Y_k(b), \quad \text{where } \tilde Y_k(b) := f(X_k) - b\big(g(X_k) - m\big). \qquad (3.5)$$

Example 3.3 Let $X \sim U([0,1])$, let $f : [0,1] \to \mathbb{R}$ be some function, and let $Y := f(X)$. By approximation, one may find some polynomial function $g : [0,1] \to \mathbb{R}$ such that $f \approx g$. Moreover, the constant $m := E[g(X)]$ is known explicitly whenever $g$ is a polynomial. Taking $b = 1$, it follows that
$$E[Y] = E\big[f(X) - g(X) + m\big],$$
and we can expect that
$$\mathrm{Var}\big(f(X) - g(X) + m\big) = \mathrm{Var}\big(f(X) - g(X)\big) < \mathrm{Var}\big(f(X)\big),$$
since $g$ is an approximation of $f$.

Variance analysis. By direct computation, it follows that
$$\mathrm{Var}\big(\tilde Y(b)\big) = \mathrm{Var}\big(f(X) - b(g(X) - m)\big) = \mathrm{Var}(f(X)) - 2b\,\mathrm{Cov}\big(f(X), g(X)\big) + b^2\,\mathrm{Var}(g(X)).$$
We then minimize the variance over the control coefficient $b$:
$$\min_{b\in\mathbb{R}} \mathrm{Var}\big(\tilde Y(b)\big) = \mathrm{Var}(Y) - \frac{\mathrm{Cov}\big(f(X), g(X)\big)^2}{\mathrm{Var}(g(X))} = \mathrm{Var}(Y)\,\big(1 - \rho^2\big(f(X), g(X)\big)\big),$$
with the optimal control coefficient
$$b^* := \frac{\mathrm{Cov}\big(f(X), g(X)\big)}{\mathrm{Var}(g(X))}. \qquad (3.6)$$
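A sketch of the method with the coefficient of (3.6) estimated on a small pilot sample (not part of the notes; $f(x) = e^x$ and the control $g(x) = x$ with $m = 1/2$ are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(9)

f = np.exp                  # estimate E[e^U], U ~ U([0,1]); true value is e - 1
g = lambda x: x             # control variate with known mean m = E[U] = 1/2
m = 0.5

# Pilot sample: empirical estimate of the optimal coefficient b* of (3.6).
u_pilot = rng.random(1_000)
y_p, g_p = f(u_pilot), g(u_pilot)
b_hat = np.cov(y_p, g_p)[0, 1] / g_p.var(ddof=1)

# Independent main sample: control-variate estimator (3.5) with b = b_hat.
u = rng.random(100_000)
y, gz = f(u), g(u)
est_plain = y.mean()
est_cv = (y - b_hat * (gz - m)).mean()
var_plain, var_cv = y.var(), (y - b_hat * (gz - m)).var()
```

Using two independent samples (pilot for $b$, main for the mean) avoids any correlation between the estimated coefficient and the simulations used in (3.5).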
Remark 3.2 (i) The above computation shows that, to use the variate control method, one should search for a function $g : \mathbb{R}^d \to \mathbb{R}$ such that $m := E[g(X)]$ is known explicitly and the correlation $\rho(f(X), g(X))$ is large in absolute value.
(ii) As in Remark 3.1, it is very important to use the same simulation $X_k$ in the estimator (3.5). Otherwise, imagine that $(\tilde X_k)_{k\ge1}$ is i.i.d. with the same law as $(X_k)_{k\ge1}$ but independent of it, and consider
$$\hat Y_n := \frac{1}{n}\sum_{k=1}^n \Big( f(X_k) - b\big(g(\tilde X_k) - m\big) \Big).$$
Then one has $\rho(f(X), g(\tilde X)) = 0$, which means that the estimator $\hat Y_n$ is no better than the classical estimator.

Estimation of the optimal coefficient $b^*$. In practice, we use the Monte-Carlo method precisely because $E[f(X)]$ cannot be computed explicitly. Then there is no reason to expect that $\mathrm{Cov}(f(X), g(X))$ can be computed explicitly either, and hence we need to estimate it in order to obtain an estimate of $b^*$ in (3.6). A natural estimator of $b^*$ is
$$\hat b_n := \frac{\sum_{k=1}^n (Y_k - \bar Y_n)\big(g(X_k) - \bar G_n\big)}{\sum_{k=1}^n \big(g(X_k) - \bar G_n\big)^2}, \quad \text{with } \bar G_n := \frac1n \sum_{k=1}^n g(X_k). \qquad (3.7)$$
Further, to avoid correlation between the estimator $\hat b_n$ and the simulations $\tilde Y_k(\hat b_n)$ in (3.5), one can first compute $\hat b_n$ with a small number $n$ of simulations $(X_k)_{1\le k\le n}$, and then use a large number $m$ of independent simulations $(X_k)_{n+1\le k\le n+m}$ to estimate $E[Y]$, i.e. use the estimator
$$\frac{1}{m}\sum_{k=1}^m \tilde Y_{n+k}(\hat b_n).$$

Multi-variate controls. One can also consider several functions $(g_k : \mathbb{R}^d \to \mathbb{R})_{k=1,\ldots,n}$. Denote $Z_k := g_k(X)$ and $Z := (Z_1, \ldots, Z_n)$, and suppose that $E[Z]$ is known explicitly; we then have a new control-variate candidate
$$\tilde Y(b) := Y - \langle b,\ Z - E[Z] \rangle, \quad b = (b_1, \ldots, b_n) \in \mathbb{R}^n.$$
It is clear that $E[Y] = E[\tilde Y(b)]$, and by a similar computation one has
$$\min_{b\in\mathbb{R}^n} \mathrm{Var}\big(\tilde Y(b)\big) = \min_{b\in\mathbb{R}^n} \big( \sigma_Y^2 - 2\,b^T \Sigma_{ZY} + b^T \Sigma_{ZZ}\, b \big),$$
14 where Σ Y, Σ Y Z and Σ ZZ are given by Var Y Z = Σ Y Σ Y Z Σ Y Z Σ ZZ ). The optimal control b is provided by b := Σ 1 ZZ Σ ZY. 3.3 Stratification Recall that X is random vector and Y := fx) for some function f : R d R. Let g : R d R m be some function and denote Z := gx) another random vector. Let A k ) 1 k K be partition of the support of Z in R n, i.e. A 1,, A K are disjoints such that K p k = 1 with p k := PZ A k ), k = 1, K. It follows by Bayes s formula that E Y = K p k E Y Z A k. Assumption 3.1 i) The values of probability p k ) 1 k K are known explicitly. ii) One knows how to simulate a random variable following the conditional distribution LY Z A k ). Under Assumption 3.1, we can propose another Monte-Carlo estimator of E Y : for each k = 1,, K, let Y k) i ) i 1 be a sequence of i.i.d random variable such that LY k) 1 ) = LY Z A k ), then for n = n 1,, n K ) N K, denote Ŷ n := K p k 1 n k n k i=1 ) Y k) i. 3.8) It is clear that E Ŷ n = EY and Ŷ n EY as n 1,, n K ). Simulation of conditional distribution i)suppose that X is a random variable with distribution function F, Z = X and A k = a k, a k+1 for some constant a 1, a 2,, a K+1. Let X k) := F 1 F a k ) + U F a k+1 ) F a k ) )) where U U0, 1,
15 and Y k) := fx k) ). Then L X k)) = LX X A k ) and L Y k) ) ) = L fx k) ) ) = LY X A k ). ii) Suppose that X is a random vector of density ρ : R d R and Z = X. Define ρ k x) := 1 ρx)1 x Ak with p k := PX A k ) = ρx)dx. p k A k Then ρ k is the density function of the conditional distribution of LX X A k ). Variance analysis Denote µ k := EY k), σ 2 k := VarY k), q k := n k n with n := K n k. Then Var Ŷ n = = 1 n K p 2 k K 1 n k Var Y k) = K p 1 k 1 nq k σ 2 k p 2 k q k σ 2 k = 1 n σ2 q), where σ 2 q) := K p 2 k q k σ 2 k. Recall also that Var Y n = 1 n Var Y = 1 n σ2. Then one compare the value σ 2 q) and σ 2. i) Proportional allocation: Let n k n =: qk = p k, then σ 2 q ) := K p kσk 2, and one has σ 2 = σ 2 q ) + K K p k µ 2 k ) 2 p k µ k σ 2 q ), 3.9) where the last inequality follows by Jensen s inequality. Remark 3.3 Let us define a random variable η by η := K k1 Z Ak. Then we have µ k := EY k) = EY η = k and σ 2 k = VarY k) = VarY η = k. Moreover, E VarY η = K p k σk 2 and Var EY η = K K p k µ 2 k ) 2. p k µ k Then the equality 3.9) can be interpreted as a variance decomposition: σ 2 = VarY = E VarY η + Var EY η.
(ii) Optimal allocation. Let us consider the minimal variance problem
$$\min_q \sum_{k=1}^K \frac{p_k^2}{q_k}\,\sigma_k^2 \quad \text{subject to} \quad q_k \ge 0,\ \sum_{k=1}^K q_k = 1.$$
The Lagrangian is given by
$$L(\lambda, q_1, \ldots, q_K) := \sum_{k=1}^K \frac{p_k^2\,\sigma_k^2}{q_k} + \lambda\Big(\sum_{k=1}^K q_k - 1\Big).$$
Then the first-order condition gives
$$\frac{\partial L}{\partial q_k} = -\frac{p_k^2\,\sigma_k^2}{q_k^2} + \lambda = 0,$$
which implies that $\frac{p_k\,\sigma_k}{q_k} = \sqrt\lambda$ for all $k = 1, \ldots, K$. Thus $q_k$ is proportional to $p_k\,\sigma_k$, and it follows that
$$q_k^* = \frac{p_k\,\sigma_k}{\sum_{i=1}^K p_i\,\sigma_i}.$$

Application. Let $S_t$ be defined by $S_t := S_0\,e^{-\sigma^2 t/2 + \sigma W_t}$, where $S_0$ and $\sigma$ are positive constants. Denote $X := W_T/\sqrt T \sim N(0,1)$. Motivated by applications in finance, we often need to compute the value $E[Y]$ with
$$Y := f(S_T) = g(X),$$
for some functions $f : \mathbb{R} \to \mathbb{R}$ and $g : \mathbb{R} \to \mathbb{R}$ (notice that $S_T$ can be expressed as a function of $X$). Let $\Phi : \mathbb{R} \to (0,1)$ be the distribution function of the Gaussian distribution $N(0,1)$, choose constants $0 = a_1 < a_2 < \cdots < a_{K+1} = 1$, and stratify according to $\Phi(X) \in [a_k, a_{k+1})$, so that $p_k = a_{k+1} - a_k$ and
$$X^{(k)} := \Phi^{-1}\big(a_k + (a_{k+1} - a_k)U\big), \qquad Y^{(k)} := g\big(X^{(k)}\big).$$
We then obtain the following algorithm:

Algorithm 3.1 (i) Choose the stratification levels $(a_k)_{1\le k\le K+1}$.
(ii) For each $k = 1, \ldots, K$, simulate a sequence of i.i.d. random variables $(U_i^{(k)})_{i\ge1}$ of uniform distribution $U([0,1])$, and let
$$X_i^{(k)} := \Phi^{-1}\big(a_k + (a_{k+1} - a_k)\,U_i^{(k)}\big).$$
(iii) Estimate $E[Y]$ by
$$\sum_{k=1}^K (a_{k+1} - a_k)\Big( \frac{1}{n_k}\sum_{i=1}^{n_k} g\big(X_i^{(k)}\big) \Big).$$

3.4 Importance sampling method

For importance sampling, let us begin with a simple example. Suppose that $X \sim N(0,1)$ and $h : \mathbb{R} \to \mathbb{R}$ is some function. Then
$$E[h(X)] = \int_{\mathbb{R}} h(x)\,\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\,dx = \int_{\mathbb{R}} h(x)\,e^{-x^2/2 + (x-\mu)^2/2}\,\frac{1}{\sqrt{2\pi}}e^{-(x-\mu)^2/2}\,dx$$
$$= \int_{\mathbb{R}} h(x)\,e^{-\mu x + \mu^2/2}\,\frac{1}{\sqrt{2\pi}}e^{-(x-\mu)^2/2}\,dx = E\big[h(Y)\,e^{-\mu Y + \mu^2/2}\big] \quad (\text{where } Y \sim N(\mu, 1))$$
$$= E\big[h(X + \mu)\,e^{-\mu X - \mu^2/2}\big]. \qquad (3.10)$$
In some contexts, we can expect that $\mathrm{Var}(h(X)) > \mathrm{Var}\big(h(X+\mu)\,e^{-\mu X - \mu^2/2}\big)$, and then we can use the latter expectation to propose a Monte-Carlo estimator. To deduce the equality (3.10), the main idea is to divide the integrand by some density function and then multiply by it. We can use this idea in a more general context.

Importance sampling method. Let $X$ be a random vector with density function $f : \mathbb{R}^d \to \mathbb{R}_+$ and let $h : \mathbb{R}^d \to \mathbb{R}$; the objective is to compute $E[h(X)]$. Suppose that there is some other density function $g : \mathbb{R}^d \to \mathbb{R}_+$ such that $g(x) > 0$ for every $x \in \mathbb{R}^d$ with $f(x) > 0$. Then by direct computation, we have
$$E[h(X)] = \int_{\mathbb{R}^d} h(x)\,f(x)\,dx = \int_{\mathbb{R}^d} h(x)\,\frac{f(x)}{g(x)}\,g(x)\,dx = E\Big[h(Z)\,\frac{f(Z)}{g(Z)}\Big],$$
where $Z$ is a random vector with density function $g$. Then an importance sampling estimator for $E[h(X)]$ is given, with a sequence $(Z_k)_{k\ge1}$ of i.i.d. simulations of $Z$, by
$$\frac{1}{n}\sum_{k=1}^n h(Z_k)\,\frac{f(Z_k)}{g(Z_k)}. \qquad (3.11)$$
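A sketch of the Gaussian-shift identity (3.10), used as the estimator (3.11), for a rare-event probability where importance sampling is especially effective (not part of the notes; the threshold and the shift are our choices):

```python
import numpy as np

rng = np.random.default_rng(7)

def importance_sampling_tail(a, mu, n):
    """Estimate P(X > a) for X ~ N(0,1) via (3.10):
    E[h(X)] = E[h(X + mu) * exp(-mu*X - mu^2/2)], here with h = 1_{x > a}."""
    x = rng.standard_normal(n)
    h = (x + mu > a).astype(float)
    weights = np.exp(-mu * x - 0.5 * mu**2)
    return (h * weights).mean()

# P(X > 4) is about 3.17e-5: far too rare for a plain Monte-Carlo sample of
# this size, but accurately estimated after shifting the sampling law by mu = 4.
p = importance_sampling_tail(a=4.0, mu=4.0, n=100_000)
```

Shifting by $\mu = a$ centers the sampling distribution on the rare region, so roughly half the samples contribute instead of a fraction $P(X > a)$.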
Variance analysis. Let us compute the variance of the new estimator:
$$\mathrm{Var}\Big(h(Z)\,\frac{f(Z)}{g(Z)}\Big) = E\Big[h^2(Z)\,\frac{f^2(Z)}{g^2(Z)}\Big] - \Big(E\Big[h(Z)\,\frac{f(Z)}{g(Z)}\Big]\Big)^2 = \int_{\mathbb{R}^d} h^2(z)\,\frac{f^2(z)}{g^2(z)}\,g(z)\,dz - \big(E[h(X)]\big)^2 = E\Big[h^2(X)\,\frac{f(X)}{g(X)}\Big] - \big(E[h(X)]\big)^2.$$
Hence the problem of minimizing the variance becomes
$$\min_g\ \mathrm{Var}\Big(h(Z)\,\frac{f(Z)}{g(Z)}\Big) \iff \min_g\ E\Big[h^2(X)\,\frac{f(X)}{g(X)}\Big]. \qquad (3.12)$$

Example 3.4 (i) Suppose that $h(x) = 1_A(x)$ for some subset $A \subset \mathbb{R}^d$. Then the minimization problem
$$\min_g\ \mathrm{Var}\Big(1_A(Z)\,\frac{f(Z)}{g(Z)}\Big)$$
is solved by $g^*(z) := \frac{f(z)\,1_A(z)}{\alpha}$, where $\alpha$ is the constant making $g^*$ a density function.
(ii) Suppose that $h$ is positive; then the minimization problem (3.12) is solved by $g^*(z) := \frac{1}{\alpha}\,f(z)\,h(z)$, where $\alpha$ is the constant making $g^*$ a density function.

The above two examples cannot be implemented, since to make $g^*$ a density function we would need to choose $\alpha := \int_{\mathbb{R}^d} f(z)\,h(z)\,dz = E[h(X)]$, which is precisely the unknown quantity. Therefore, the minimal variance problem (3.12) is not a well-posed practical problem. In practice, we consider a family of density functions $(g_\theta)_{\theta\in\Theta}$, and then solve the minimal variance problem within this family:
$$\min_{\theta\in\Theta}\ \mathrm{Var}\Big(h(Z)\,\frac{f(Z)}{g_\theta(Z)}\Big) \iff \min_{\theta\in\Theta}\ E\Big[h^2(X)\,\frac{f(X)}{g_\theta(X)}\Big].$$
Gaussian vector case. Let $X = (X_1, \ldots, X_n) \sim N(0, \sigma^2 I_n)$, which admits the density function
$$f(x_1, \ldots, x_n) := \prod_{k=1}^n \rho(x_k), \quad \text{with } \rho(x) := \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{x^2}{2\sigma^2}}.$$
Let $\theta = (\theta_1, \ldots, \theta_n) \in \mathbb{R}^n$ and
$$g_\theta(x_1, \ldots, x_n) := \prod_{k=1}^n \rho_{\theta_k}(x_k), \quad \text{with } \rho_{\theta_k}(x_k) := \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x_k - \theta_k)^2}{2\sigma^2}}.$$
Then the ratio of the density functions turns out to be
$$\frac{f(x)}{g_\theta(x)} = \exp\Big( \Big( -\sum_{k=1}^n \theta_k x_k + \frac12 \sum_{k=1}^n \theta_k^2 \Big) \Big/ \sigma^2 \Big),$$
and therefore
$$E\big[h(X_1, \ldots, X_n)\big] = E\Big[ h\big(X_1 + \theta_1, \ldots, X_n + \theta_n\big)\,\exp\Big( -\Big( \sum_{k=1}^n \theta_k X_k + \frac12\sum_{k=1}^n \theta_k^2 \Big) \Big/ \sigma^2 \Big) \Big].$$

Example 3.5 In financial applications, one usually considers a Brownian motion $W$ and denotes $\Delta W_k := W_{t_k} - W_{t_{k-1}}$ on the discrete time grid $0 = t_0 < t_1 < \cdots$ with constant step $\Delta t$, and one needs to compute $E[h(\Delta W_1, \ldots, \Delta W_n)]$ for some function $h$. Let $X_k = \Delta W_k$, $\sigma^2 = \Delta t$ and $\theta_k = \mu_k\,\Delta t$; then
$$E\big[h(\Delta W_1, \ldots, \Delta W_n)\big] = E\Big[ h\big(\Delta W_1 + \mu_1\Delta t, \ldots, \Delta W_n + \mu_n\Delta t\big)\,\exp\Big( -\sum_{k=1}^n \Big( \mu_k\,\Delta W_k + \frac12\,\mu_k^2\,\Delta t \Big) \Big) \Big].$$
Chapter 4

Stochastic gradient algorithm

The objective is to solve an optimization problem of the form
$$\min_{\theta\in\Theta} E\big[F(\theta, X)\big], \qquad (4.1)$$
where $(F(\theta, \cdot))_{\theta\in\Theta}$ is a family of functions.

An iterative algorithm to find the root

Proposition 4.1 Let $f : \mathbb{R}^d \to \mathbb{R}^d$ be a bounded continuous function and $\theta^* \in \mathbb{R}^d$ such that $f(\theta^*) = 0$ and
$$(\theta - \theta^*) \cdot f(\theta) > 0, \quad \forall\, \theta \in \mathbb{R}^d \setminus \{\theta^*\}. \qquad (4.2)$$
Let $(\gamma_n)_{n\ge1}$ be a sequence of numbers satisfying
$$\gamma_n > 0,\ \forall n \ge 1, \qquad \sum_{n\ge1}\gamma_n = \infty, \qquad \sum_{n\ge1}\gamma_n^2 < \infty. \qquad (4.3)$$
Further, with some $\theta_0 \in \mathbb{R}^d$, define a sequence $(\theta_n)_{n\ge1}$ by
$$\theta_{n+1} = \theta_n - \gamma_{n+1}\,f(\theta_n), \quad n \ge 0.$$
Then $\theta_n \to \theta^*$ as $n \to \infty$.

Proof. (i) First, by definition, we have
$$|\theta_{n+1} - \theta^*|^2 = |\theta_n - \theta^*|^2 + 2(\theta_n - \theta^*)\cdot(\theta_{n+1} - \theta_n) + |\theta_{n+1} - \theta_n|^2$$
$$= |\theta_n - \theta^*|^2 - 2\gamma_{n+1}\,f(\theta_n)\cdot(\theta_n - \theta^*) + \gamma_{n+1}^2\,|f(\theta_n)|^2 \ \le\ |\theta_n - \theta^*|^2 + \gamma_{n+1}^2\,|f(\theta_n)|^2,$$
where the last inequality follows from (4.2), since $f(\theta_n)\cdot(\theta_n - \theta^*) \ge 0$. Define
$$x_n := |\theta_n - \theta^*|^2 - \sum_{k=1}^n \gamma_k^2\,|f(\theta_{k-1})|^2.$$
Then the sequence $(x_n)_{n\ge1}$ is non-increasing. Moreover, it is bounded from below, since $x_n \ge -\|f\|_\infty^2 \sum_{k\ge1}\gamma_k^2$. Therefore, there is some $x$ such that $x_n \to x$, and hence
$$|\theta_n - \theta^*|^2 \ \longrightarrow\ \ell := x + \sum_{k\ge1}\gamma_k^2\,|f(\theta_{k-1})|^2.$$
It is clear that $\ell \ge 0$, since it is the limit of $|\theta_n - \theta^*|^2$. We claim that $\ell = 0$, which concludes the proof.

(ii) We now prove $\ell = 0$ by contradiction. Assume that $\ell > 0$; then there is some $N > 0$ such that $\ell/2 \le |\theta_n - \theta^*|^2 \le 2\ell$ for every $n \ge N$. Besides, by the continuity of $f$ and (4.2), we have
$$\eta := \inf_{\ell/2 \,\le\, |\theta - \theta^*|^2 \,\le\, 2\ell} (\theta - \theta^*)\cdot f(\theta) > 0.$$
Therefore,
$$\sum_{n\ge1} \gamma_n\,f(\theta_{n-1})\cdot(\theta_{n-1} - \theta^*) \ \ge\ \sum_{n\ge N} \gamma_n\,f(\theta_{n-1})\cdot(\theta_{n-1} - \theta^*) \ \ge\ \eta \sum_{n\ge N}\gamma_n = \infty.$$
However, we also have
$$\sum_{n\ge1}\gamma_n\,f(\theta_{n-1})\cdot(\theta_{n-1} - \theta^*) = -\sum_{n\ge1}(\theta_n - \theta_{n-1})\cdot(\theta_{n-1} - \theta^*)$$
$$= -\frac12\sum_{n\ge1}\big(|\theta_n - \theta^*|^2 - |\theta_n - \theta_{n-1}|^2 - |\theta_{n-1} - \theta^*|^2\big) = \frac12\Big(\sum_{n\ge1}\gamma_n^2\,|f(\theta_{n-1})|^2 - \ell + |\theta_0 - \theta^*|^2\Big) < \infty.$$
This is a contradiction, and hence the claim $\ell = 0$ holds true.

Stochastic gradient algorithm

Theorem 4.1 In the context of Proposition 4.1, suppose that $f(\theta) = E[F(\theta, X)]$ for some function $F : \mathbb{R}^d \times \mathbb{R}^n \to \mathbb{R}^d$ and some random vector $X$. Suppose that $f$ satisfies (4.2) and that the sequence $(\gamma_n)_{n\ge1}$ satisfies (4.3). Then, with some $\theta_0 \in \mathbb{R}^d$ and a sequence $(X_n)_{n\ge1}$ of i.i.d. simulations of $X$, define a sequence $(\theta_n)_{n\ge1}$ by
$$\theta_{n+1} = \theta_n - \gamma_{n+1}\,F(\theta_n, X_{n+1}). \qquad (4.4)$$
Then $\theta_n \to \theta^*$ almost surely as $n \to \infty$.
Proof.

Application: optimal importance sampling. In the context of Section 3.4, we solve the minimal variance problem
$$\min_{\mu\in\mathbb{R}}\ \mathrm{Var}\big(h(X+\mu)\,e^{-\mu X - \mu^2/2}\big) \iff \min_{\mu\in\mathbb{R}}\ E\big[h^2(X+\mu)\,e^{-2\mu X - \mu^2}\big] \iff \min_{\mu\in\mathbb{R}}\ E\big[h^2(X)\,e^{-\mu X + \mu^2/2}\big]. \qquad (4.5)$$
Let us denote
$$L(\mu, X) := h^2(X)\,e^{-\mu X + \mu^2/2} \quad \text{and} \quad l(\mu) := E[L(\mu, X)],$$
and
$$F(\mu, X) := \frac{\partial L}{\partial \mu}(\mu, X) = (\mu - X)\,h^2(X)\,e^{-\mu X + \mu^2/2} \quad \text{and} \quad f(\mu) := E[F(\mu, X)].$$
Then the minimal variance problem (4.5) is equivalent to finding the $\mu^*$ such that $f(\mu^*) = l'(\mu^*) = 0$. Notice that
$$f'(\mu) = l''(\mu) = E\big[\big(1 + (\mu - X)^2\big)\,h^2(X)\,e^{-\mu X + \mu^2/2}\big] > 0,$$
so $f$ is strictly increasing and such a $\mu^*$ is unique. Therefore, we can use the stochastic gradient algorithm (4.4) to find the optimal $\mu^*$.

Algorithm 4.1 (i) Simulate a sequence $(X_n)_{n\ge1}$ of i.i.d. simulations of $X$.
(ii) With $\mu_0 = 0$, use the iteration
$$\mu_{n+1} = \mu_n - \gamma_{n+1}\,F(\mu_n, X_{n+1}).$$
(iii) The estimator of $E[h(X)]$ is given by
$$\bar Y_{n+1} := \frac{1}{n+1}\sum_{k=1}^{n+1} h(X_k + \mu_{k-1})\,e^{-\mu_{k-1}X_k - \mu_{k-1}^2/2} = \frac{n}{n+1}\,\bar Y_n + \frac{1}{n+1}\,h(X_{n+1} + \mu_n)\,e^{-\mu_n X_{n+1} - \mu_n^2/2}.$$
The advantage of the above algorithm is that one does not need to store the simulations $(X_n)_{n\ge1}$ during the iteration.
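A sketch of Algorithm 4.1 in Python for $h(x) = 1_{x > 2}$, i.e. estimating $P(X > 2)$ for $X \sim N(0,1)$ while learning the shift $\mu$ (not part of the notes; the step sizes $\gamma_n = \gamma_0/n$ and all parameter values are our choices):

```python
import numpy as np

rng = np.random.default_rng(8)

def adaptive_is(h, n, gamma0=1.0):
    """Algorithm 4.1: stochastic gradient search for the optimal shift mu in
    (3.10), combined with the running importance-sampling estimator of E[h(X)].
    F(mu, x) = (mu - x) * h(x)^2 * exp(-mu*x + mu^2/2), with gamma_n = gamma0/n."""
    mu = 0.0
    y_bar = 0.0
    for k in range(1, n + 1):
        x = rng.standard_normal()
        # importance-sampling sample with the current shift mu (step iii)
        y = h(x + mu) * np.exp(-mu * x - 0.5 * mu**2)
        y_bar += (y - y_bar) / k
        # stochastic gradient step on mu, driven by the same simulation (step ii)
        grad = (mu - x) * h(x)**2 * np.exp(-mu * x + 0.5 * mu**2)
        mu -= (gamma0 / k) * grad
    return y_bar, mu

# Estimate P(X > 2) for X ~ N(0,1); the true value is about 0.02275.
p, mu_star = adaptive_is(lambda x: float(x > 2.0), n=200_000)
```

Each term of the running average uses the shift $\mu_{k-1}$ computed from the previous simulations, so it is conditionally unbiased, and nothing needs to be stored besides $\mu$ and the running mean.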