Monte-Carlo MMD-MA, Université Paris-Dauphine. Xiaolu Tan

Similar documents
Mathematical Methods for Neurosciences. ENS - Master MVA Paris 6 - Master Maths-Bio ( )

Stochastic Simulation Variance reduction methods Bo Friis Nielsen

Random Variables and Their Distributions

Exercises and Answers to Chapter 1

PCMI Introduction to Random Matrix Theory Handout # REVIEW OF PROBABILITY THEORY. Chapter 1 - Events and Their Probabilities

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539

Chp 4. Expectation and Variance

Statistics (1): Estimation

UC Berkeley Department of Electrical Engineering and Computer Sciences. EECS 126: Probability and Random Processes

ELEMENTS OF PROBABILITY THEORY

2 Random Variable Generation

Lecture 6 Basic Probability

Nonlife Actuarial Models. Chapter 14 Basic Monte Carlo Methods

Part IA Probability. Definitions. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015

Joint Distributions. (a) Scalar multiplication: k = c d. (b) Product of two matrices: c d. (c) The transpose of a matrix:

Qualifying Exam CS 661: System Simulation Summer 2013 Prof. Marvin K. Nakayama

1 Simulating normal (Gaussian) rvs with applications to simulating Brownian motion and geometric Brownian motion in one and two dimensions

5 Operations on Multiple Random Variables

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

µ X (A) = P ( X 1 (A) )

Formulas for probability theory and linear models SF2941

MTH739U/P: Topics in Scientific Computing Autumn 2016 Week 6

conditional cdf, conditional pdf, total probability theorem?

A Probability Review

Lecture 25: Review. Statistics 104. April 23, Colin Rundel

Multivariate Analysis and Likelihood Inference

Gaussian vectors and central limit theorem

F denotes cumulative density. denotes probability density function; (.)

Scientific Computing: Monte Carlo

2905 Queueing Theory and Simulation PART IV: SIMULATION

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others.

CSCI-6971 Lecture Notes: Monte Carlo integration

Probability Models. 4. What is the definition of the expectation of a discrete random variable?

LIST OF FORMULAS FOR STK1100 AND STK1110

1 Exercises for lecture 1

Statement: With my signature I confirm that the solutions are the product of my own work. Name: Signature:.

PROBABILITY: LIMIT THEOREMS II, SPRING HOMEWORK PROBLEMS

4. CONTINUOUS RANDOM VARIABLES

STOR Lecture 16. Properties of Expectation - I

Lecture 11. Multivariate Normal theory

18.440: Lecture 28 Lectures Review

Introduction to Computational Finance and Financial Econometrics Probability Review - Part 2

Chapter 7. Extremal Problems. 7.1 Extrema and Local Extrema

Approximation of BSDEs using least-squares regression and Malliavin weights

Variance reduction. Michel Bierlaire. Transport and Mobility Laboratory. Variance reduction p. 1/18

Part IA Probability. Theorems. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015

DUBLIN CITY UNIVERSITY

Probability Theory and Statistics. Peter Jochumzen

Structural Reliability

ACM 116: Lectures 3 4

Negative Association, Ordering and Convergence of Resampling Methods

{σ x >t}p x. (σ x >t)=e at.

Hochdimensionale Integration

Mathematical Statistics

A note on general sliding window processes

Lecture 21: Convergence of transformations and generating a random variable

IEOR E4703: Monte-Carlo Simulation

Exercises with solutions (Set D)

CALCULUS JIA-MING (FRANK) LIOU

Quick Tour of Basic Probability Theory and Linear Algebra

1 Probability and Random Variables

For a stochastic process {Y t : t = 0, ±1, ±2, ±3, }, the mean function is defined by (2.2.1) ± 2..., γ t,

Signed Measures. Chapter Basic Properties of Signed Measures. 4.2 Jordan and Hahn Decompositions

18.440: Lecture 28 Lectures Review

National Sun Yat-Sen University CSE Course: Information Theory. Maximum Entropy and Spectral Estimation

IE 581 Introduction to Stochastic Simulation

Markov Chain Monte Carlo Lecture 1

On a class of stochastic differential equations in a financial network model

Probability and Measure

Introduction to Machine Learning. Lecture 2

STAT Chapter 5 Continuous Distributions

Probability: Handout

We introduce methods that are useful in:

A NEW NONLINEAR FILTER

Accumulation constants of iterated function systems with Bloch target domains

Introduction to ARMA and GARCH processes

Actuarial Science Exam 1/P

Discretization of SDEs: Euler Methods and Beyond

Chapter 4 : Expectation and Moments

MATH2715: Statistical Methods

Supermodular ordering of Poisson arrays

Introduction to Normal Distribution

Hence, (f(x) f(x 0 )) 2 + (g(x) g(x 0 )) 2 < ɛ

Probability and Measure

Empirical Processes: General Weak Convergence Theory

In particular, if A is a square matrix and λ is one of its eigenvalues, then we can find a non-zero column vector X with

X n D X lim n F n (x) = F (x) for all x C F. lim n F n(u) = F (u) for all u C F. (2)

An idea how to solve some of the problems. diverges the same must hold for the original series. T 1 p T 1 p + 1 p 1 = 1. dt = lim

Gaussian, Markov and stationary processes

STA205 Probability: Week 8 R. Wolpert

CS 195-5: Machine Learning Problem Set 1

Multivariate Distributions

P (A G) dp G P (A G)

Stability of optimization problems with stochastic dominance constraints

2 n k In particular, using Stirling formula, we can calculate the asymptotic of obtaining heads exactly half of the time:

Continuous Functions on Metric Spaces

Lecture 15 Random variables

E n = n + 1. λ n k (x k ˆm) 2. n 1. (x k m) 332:525 Homework Set 1

f(x) f(z) c x z > 0 1

Chapter 4: Monte-Carlo Methods

Transcription:

Monte-Carlo MMD-MA, Université Paris-Dauphine Xiaolu Tan tan@ceremade.dauphine.fr Septembre 2015

Contents 1 Introduction 1 1.1 The principle.................................. 1 1.2 The error analysis of Monte-Carlo method.................. 1 2 Simulation of random vectors or stochastic processes 3 2.1 Random variable of uniform distribution on 0, 1.............. 3 2.2 Inverse method................................. 3 2.3 Transformation method............................ 4 2.4 Reject method................................. 5 2.5 Simulation of Gaussian vector......................... 5 2.6 Simulation of Brownian motion........................ 7 3 Variance reduction techniques 9 3.1 The antithetic variable............................. 9 3.2 Variate control method............................. 12 3.3 Stratification.................................. 14 3.4 Importance sampling method......................... 17 4 Stochastic gradient algorithm 21 iii

Chapter 1 Introduction 1.1 The principle The principle of Monte-Carlo method is to use simulations of the random variables to estimate a quantity such as E fx) = R d fx) µdx), 1.1) where f : R d R is a function, X is some random vector taking value in R d with distribution µ. It may also be used to solve an optimization problem of the kind where f θ : R d R) θ Θ is a family of functions. min E f θ X), 1.2) θ Θ To solve the basic problem 1.1), the method consists in simulating a sequence of i.i.d. random vectors X k ) k 1 with the same distribution of X, and then estimate EfX) by the empirical mean value Y n := 1 n n Y k := 1 n n fx k ). 1.3) The advantages of the Monte-Carlo are usually its simplicity, flexibility and efficiency for high dimensional problems. It can also be served as an alternative method or benchmark) for other numerical methods. 1.2 The error analysis of Monte-Carlo method Theorem 1.1 Law of large number) Let Y k ) k 1 be a sequence of i.i.d. random variables such that E Y <. Then with Y n defined in 1.3), one has ) P lim Y n = EY n 1 = 1.

2 Theorem 1.2 Central limit theorem) Let Y k ) k 1 be a sequence of i.i.d. random variables such that E Y 2 <. Then n Y n EY ) N 0, VarY ) ). And consequently, n ˆσ n Y n EY ) N 0, 1 ), where ˆσ 2 n : = 1 n n ) 2. Yk Y n 1.4) Notice that ˆσ 2 n defined in 1.4) is an estimator of the variance VarY from the sequence Y k ) k 1, and it admits the representation ˆσ 2 n = 1 n n Y 2 k Y n) 2. The central limit theorem induces that the asymptotic confidence interval of level pr) := 1 2π e x2 /2 dx of the estimator Y n is given by x R Y n R n ˆσ n, Y n + R n ˆσ n. 1.5) More precisely, it means that P Y n n R ˆσ n EY Y n + n R ˆσ n pr), as n. Remark 1.1 In practice, for pr) = 95%, we know R 1.96. Conclusions To utilize the Monte-Carlo method, the first issue is then how to simulate a sequence of random vector X k ) k 1 given its law µ or given its definition based on other random elements, the second issue is how to improve the estimator by reducing its error. The error of Monte-Carlo method is measured by its confidence interval 1.5), whose length is given by 2Rˆσ n / n. When the confidence level is fixed, R is fixed. One can then use a larger n, where the cost is the computation time which is proportional to n in general. Otherwise, one can reduce ˆσ n by find some other random variable Ỹ such that E Ỹ = EY and Var Ỹ < VarY. We shall address these issues in the following of the course.

Chapter 2 Simulation of random vectors or stochastic processes 2.1 Random variable of uniform distribution on 0, 1 Here we admits that we know how to simulate a random variable of uniform distribution U0, 1. In particular, most of the programming environment are equipped with the computer with a generator of uniform distribution. However, it worths noticing that a generator of random variables in a computer is a deterministic program, and hence it generates always a sequence of deterministic variables, in place of a sequence of independent random variables. In practice, we search for a generator such that the sequence of generated variables has similar performance statistically. 2.2 Inverse method Let X be random variable, its distribution function, defined by F x) := PX x), is a right-continuous non-decreasing function from R to 0, 1. We then define its rightcontinuous generalized inverse function by F 1 u) := inf { x R : F x) u }. Theorem 2.1 Let X be a random variable with distribution function F and U be a random variable of uniform distribution U0, 1 on interval 0, 1. Then X F 1 U) in law. 3

4 Proof. Notice that for any y R, we have F 1 u) y u F y). It follows that for any y R, P F 1 U) y = P U F y) = F y). Example 2.1 i) Let X be a random variable of discrete distribution, PX = x k ) = p k where x k R, p k 0 for k N and k N p k = 1. Then let U U0, 1, and Z be the random variable defined by Z := x n, if U n 1 p k, k=0 ) n p k. Then Z has the same distribution of X. The definition of Z can be interpreted as F 1 U) with the distribution function F of X. ii) Let X Eλ) be a random variable of exponential distribution of parameter λ > 0, i.e. the density function is given by fx) := λe λx 1 x 0, and the distribution is given by F x) := 1 e λx. By direct computation, F 1 u) = λ 1 log1 u) for every u 0, 1). Then for U U0, 1), k=0 F 1 U) = λ 1 log1 U) λ 1 logu) Eλ), since 1 U and U have the same distribution when U U0, 1). 2.3 Transformation method Proposition 2.1 Box-Muller) Suppose that U and V are independent random variables of uniform distribution on the interval 0, 1. Let X := 2 logu) cos2πv ) and Y := 2 logu) sin2πv ). Then X and Y are two independent random variables of Gassian distribution N0, 1). Proof. Exercise 2.1 Let U, V ) be a random vector which is uniformly distributed on the disk {u, v) : u 2 + v 2 1}. Let X := U 2 logu 2 + V 2 U 2 + V 2 and Y := V 2 logu 2 + V 2 U 2 + V 2.

5 Prove that X, Y ) N0, I 2 ). 2.4 Reject method Let f : R d R + and g : R d R + be two density functions such that, for some constant γ > 0, one has fx) γgx), for all x R d. In practice, g is the density function of some distribution with well-known simulation method such as Gaussian distribution, uniform distribution, exponential distribution, etc.), but f is the density function of some distribution without an easy simulation method. The objective is to use the simulations of random variables of distribution g, together with a rejection procedure, to simulate the random variable of distribution f. Proposition 2.2 Let Y k ) k 1 be an i.i.d. sequence of random variables of density g, and U k ) k 1 be an i.i.d. sequence of random variable of distribution U0, 1. Moreover, Y k ) k 1 and U k ) k 1 are also independent. Define a sequence X n ) n 1 of random variables { X n := Y Nn, with N 0 := 0, andn n+1 := min k > N n : U k fx } k). γgx k ) Then X n ) n 1 is a sequence of i.i.d. random variable of density f. Proof. Exercise 2.2 Let f : R R + be defined by fx) := 1 x ) +. Give a numerical algorithm based on the above reject method) to simulate an i.i.d. sequence of random variables of density f. 2.5 Simulation of Gaussian vector The case of dimension 2 of Gaussian distribution, ρ 1, 1 a constant. Define Let Z 1, Z 2 ) N0, I 2 ) be two independent random variable X 1 := Z 1 and X 2 := ρz 1 + 1 ρ 2 Z 2.

6 It is clear that X 1 X 2 N0, 1) and CovX 1, X 2 ) = CovZ 1, ρz 1 + 1 ρ 2 Z 2 ) = ρ, which means that X 1 X 2 ) N More generally, for Z 1, Z 2 ) N0, I 2 ), let 0 0 ), 1 ρ ρ 1 )) X 1 := µ 1 + σ 1 Z 1 and X 2 := µ 2 + σ 2 ρz 1 + 1 ρ 2 Z 2 ),. then X 1 X 2 ) N µ 1 µ 2 ), σ 2 1 ρσ 1 σ 2 ρσ 1 σ 2 σ 2 2 )). 2.1) Notice also that any Gaussian vector of dimension 2 can be written in the form 2.1). General case: Cholesky s method Let Z N0, I d ) be standard Gaussian random vector of dimension d, and A be a lower triangular matrix of dimension d d, i.e. Z = Z 1 Z d and A = A 11 0 0 A 21 A 22 0 A d1 A d2 A dd. Then the vector X := AZ N0, Σ) with variance-covariance matrix Σ := AA T. Cholesky s method consists in finding a lower triangular matrix A such that AA T = Σ, where Σ is a given variance-covariance matrix. Let us write the equation AA T = Σ as A 11 0 0 A 11 A 21 A d1 Σ 11 Σ 12 Σ 1d A 21 A 22 0 0 A 22 A d2 = Σ 21 Σ 22 Σ 2d. A d1 A d2 A dd. 0 0 A dd. Σ d1 Σ d2 Σ dd The solution is given by A 2 11 = Σ 11 A 21 A 11 = Σ 21 A d1 A 11 = Σ d1 A ii = Σ ii i 1 A2 ik A ij = Σ ii j 1 A ) ika jk /Ajj, j < i. Exercise 2.3 Provide a pseudo code for the algorithm of Cholesky s method.

7 2.6 Simulation of Brownian motion Definition 2.1 A standard Brownian motion W is a stochastic process starting from 0, and having i) continuous paths i.e. t W t is almost surely continuous), ii) independent increments i.e. W t W s W s W r, 0 r s t), iii) stationary and Gaussian increments i.e. W t W s N0, t s)). Forward simulation Using the the independent and stationary Gaussian increments property, one can simulate a path of a Brownian motion in a forward way. Let 0 = t 0 < t 1 < be a discrete grid of R +, Z k ) k 1 be a sequence of i.i.d. random variable of Gaussian distribution N0, 1), we define W by W 0 := 0 and W tk+1 := W tk + t k+1 t k Z k+1. Then W is a sample of paths of the Brownian motion on the discrete grid t k ) k 0. Brownian bridge The forward simulation method consists in simulating W tk+1 knowing the value of W tk. There is backward simulation method, i.e. one simulates first the variable W tn, and then simulates the variables W tn 1, W tn 2,, W t2, W t1 recursively. Proposition 2.3 Let 0 = t 0 < t 1 < be a discrete grid, then the conditional distribution of W k knowing W ti, i k ) is a Gaussian distribution Nµ, σ 2 ) with µ = t k+1 t t W tk 1 + t k t k 1 W tk+1 and σ 2 = t k+1 t k )t k t k 1 ), t k+1 t k 1 t k+1 t k 1 t k+1 t k 1 in particular, L W tk Proof. W tk 1 = x, W tk+1 = y ) = N x + t k t k 1 y, t k+1 t k 1 t k+1 t k )t k t k 1 ) ). t k+1 t k 1 Exercise 2.4 Give the backward simulation algorithm for a Brownian motion on 0, 1, using the above results.

8

Chapter 3 Variance reduction techniques Recall that the principle of the Monte-Carlo method is to estimate EY, where Y := fx) ) by Y n := 1 n n Y k := 1 n fx k ), 3.1) n with simulations Y k ) k 1 or more precisely X k ) k 1 ) of random variables Y. In view of the confidence interval 1.5), it is clear that to reduce the error, one should either augment the simulation number n in cost of computation time), or reduce the variance ˆσ n. 2 More precisely, since the variance VarY of Y is fixed, the real issue is to find some other random variable Ỹ satisfying E Ỹ = EY and Var Ỹ < VarY. 3.2) In most of cases, Ỹ admits the representation Ỹ := gx) with some function g : Rd R. Then using the simulations of Ỹ, one could expect an estimator of EY = E Ỹ ) with smaller error. 3.1 The antithetic variable For many random variables vectors) X, their distributions have some symmetric property and admits a simple transformation A : R d R d such that AX) and X have the same distribution. We call AX) the antithetic variable of X. For example, let X U0, 1, then AX) := 1 X U0, 1; let X N0, σ 2 ), then AX) := X N0, σ 2 ). It follows that 9

10 E fax)) = EfX), and hence E Ỹ = EY with Ỹ := fx) + f AX) ). 2 Then a new Monte-Carlo estimator can be given by Ỹ n := 1 n n fx k ) + fax k )) 2 = 1 2n n ) fx k ) + fax k )). 3.3) In some context, we can expect that VarỸ much smaller than VarY see the criteria 3.2)). Example 3.1 Naive Examples) i) Let fx) := x and X N0, σ 2 ) be a Gaussian r.v., then Y := fx) = X. The random variable X admits an antithetic variable X. fx)+f X) Then Ỹ := 2 0 and it is clear that i.e. 3.2) is true for this example. E Ỹ = EY and VarY > Var Ỹ = 0. ii) Let fx) := x, Y := fu) and U U0, 1 which admits an antithetic variable 1 U. fu)+f1 U) Then Ỹ := 2 1 2 and it is clear that 3.2) holds true in this context. Exercise 3.1 Let U U0, 1, then X := logu) λ Then Comparer the variance of X and that of E X X + X = E 2 X+ X 2. Eλ) and X := log1 U) λ. Eλ). Variance analysis Then one has whenever Var Ỹ = 1 4 By direct computation, we have VarfX) + 2Cov fx), fax)) ) + VarfAX)) = 1 2 VarY + 1 2 Cov fx), fax)). Var Ỹ 1 VarY 3.4) 2 Cov fx), fax)) 0. Intuitively, since AX) is the antithetic variable, we can expect that AX) has a negative correlation with X. In practice, the computation error of estimators Ỹn in 3.3))

11 and Y 2n in 3.1)) should be the same, and under the condition 3.4), one has Var Ỹ n Var Y 2n. Remark 3.1 It is very important to use the same simulation X k for estimator Ỹn in 3.3). Otherwise, image that X k ) k 1 is i.i.d with X k ) k 1, and consider Ŷ n := 1 n n fx k ) + fa X k )). 2 Then one has Var Ŷ n = Var Y 2n, which means that the estimator Ŷn is not better than the classical estimator. Case of Gaussian distribution When X is of Gaussian distribution, we can provide more precise criteria for condition 3.4). Proposition 3.1 Let X Nµ, σ 2 ), which admits an antithetic variable AX) := 2µ X. Let f : R R be a monotone function, then Cov fx), fax)) 0. Proof. Without loss of generality, we can suppose that X N0, 1). Let X 1, X 2 be two independent r.v. of distribution N0, 1), then for a monotone function, one has fx1 ) fx 2 ) ) f X 1 ) f X 2 ) ) 0. And hence E fx1 ) fx 2 ) ) f X 1 ) f X 2 ) ) 0. By direct computation, it follows that fx1 0 E ) fx 2 ) ) f X 1 ) f X 2 ) ) = 2 Cov fx 1 ), f X 1 ) = 2 Cov fx), f X). Example 3.2 In application of finance, a problem may be E e rt S T K) + with S T := S 0 e r σ2 /2)T +σw T, where W is a Brownian motion, i.e. W T N0, T ). In this case, it is clear that Y := e rt S T K) + can be expressed as an increasing function of W T, and one can then use the antithetic variable technique in the Monte-Carlo method.

12 3.2 Variate control method We recall that the random variable takes the form fx) with some random vector X and f : R d R. Suppose that there is some other function g : R d R close to f) and such that the constant m := EgX) can be computed explicitly. Then for every constant b R, one has EY = EỸ b) with Ỹ b) := fx) b gx) m ). It follows another Monte-Carlo estimator of EY, with simulations X k ) k 1, 1 n n Ỹ k b) where Ỹ k b) := fx k ) b gx k ) m ). 3.5) Example 3.3 Let X U0, 1, f : 0, 1 R be some function, and Y := fx). By approximation, one may find some polynomial function g : 0, 1 R such that f g. Besides, the constant m := EgX) is known explicitly whenever g is a polynomial. Take b = 1, it follows that and we can expect that EY = E fx) gx) + m Var fx) gx) + m = Var fx) gx) < Var fx), since g is an approximation of f. Variance analysis By direct computation, it follows that Var Ỹ b) = Var fx) b gx) m ) = Var fx) 2b Cov fx), gx) + b 2 Var gx). We then minimize the variance on the control variable b: min Var Ỹ b) = VarY b R with the optimal control variable Cov fx), gx) ) 2 Var gx) = VarY 1 ρ 2 fx), gx) ), b := Cov fx), gx) Var gx). 3.6)

13 Remark 3.2 i) The above computation shows that to use the variate control method, one should search for a function g : R d R such that m := E gx) is known explicitly, and ρ fx), gx) ) is big. ii) As in Remark 3.1, it is very important to use the same simulation X k for estimator Ỹ n in 3.5). Otherwise, image that X k ) k 1 is i.i.d with X k ) k 1, and consider Ŷ n := 1 n fxk ) bg n X k ) m) ), Then one has ρ fx), g X) ) = 0, which means that the estimator Ŷn is not better than the classical estimator. Estimation of the optimal control variable b In practice, we use Monte-Carlo method to compute EfX) since it can be computed explicitly. Then there is no reason we know how to compute Cov fx), gx), and hence we need to estimate it to obtain an estimation of b in 3.6). A natural estimator of b should then be ˆbn := n Y k Y n )gx k ) G n ) n gx k) G n ) 2, with G n := 1 n n gx k ). 3.7) Further, to avoid the correlation between the estimator ˆb n and the simulations Ỹkˆb n ) in 3.5), we can estimate first ˆb n with a small number n of simulations of X k ) 1 k n, then use a large number m of simulations X k ) n+1 k n+m to estimate EY, i.e. to obtain the estimator 1 m m Ỹ n+k ˆb n ). Multi-variate controls On can also consider several functions g k : R d R),,n. Denote Z k := g k X), Z = Z 1,, Z n ) and suppose that EZ is known explicitly, we can then have a new variate control candidate Ỹ b) := Y b, Z EZ, b = b 1,, b n ) R n. It is clear that EY = E Ỹ b), and by similar computation, one has min Var Ỹ b) ) = min σ 2 b R n b R d Y 2bΣ Y Z + b T Σ ZZ b,

14 where Σ Y, Σ Y Z and Σ ZZ are given by Var Y Z = Σ Y Σ Y Z Σ Y Z Σ ZZ ). The optimal control b is provided by b := Σ 1 ZZ Σ ZY. 3.3 Stratification Recall that X is random vector and Y := fx) for some function f : R d R. Let g : R d R m be some function and denote Z := gx) another random vector. Let A k ) 1 k K be partition of the support of Z in R n, i.e. A 1,, A K are disjoints such that K p k = 1 with p k := PZ A k ), k = 1, K. It follows by Bayes s formula that E Y = K p k E Y Z A k. Assumption 3.1 i) The values of probability p k ) 1 k K are known explicitly. ii) One knows how to simulate a random variable following the conditional distribution LY Z A k ). Under Assumption 3.1, we can propose another Monte-Carlo estimator of E Y : for each k = 1,, K, let Y k) i ) i 1 be a sequence of i.i.d random variable such that LY k) 1 ) = LY Z A k ), then for n = n 1,, n K ) N K, denote Ŷ n := K p k 1 n k n k i=1 ) Y k) i. 3.8) It is clear that E Ŷ n = EY and Ŷ n EY as n 1,, n K ). Simulation of conditional distribution i)suppose that X is a random variable with distribution function F, Z = X and A k = a k, a k+1 for some constant a 1, a 2,, a K+1. Let X k) := F 1 F a k ) + U F a k+1 ) F a k ) )) where U U0, 1,

15 and Y k) := fx k) ). Then L X k)) = LX X A k ) and L Y k) ) ) = L fx k) ) ) = LY X A k ). ii) Suppose that X is a random vector of density ρ : R d R and Z = X. Define ρ k x) := 1 ρx)1 x Ak with p k := PX A k ) = ρx)dx. p k A k Then ρ k is the density function of the conditional distribution of LX X A k ). Variance analysis Denote µ k := EY k), σ 2 k := VarY k), q k := n k n with n := K n k. Then Var Ŷ n = = 1 n K p 2 k K 1 n k Var Y k) = K p 1 k 1 nq k σ 2 k p 2 k q k σ 2 k = 1 n σ2 q), where σ 2 q) := K p 2 k q k σ 2 k. Recall also that Var Y n = 1 n Var Y = 1 n σ2. Then one compare the value σ 2 q) and σ 2. i) Proportional allocation: Let n k n =: qk = p k, then σ 2 q ) := K p kσk 2, and one has σ 2 = σ 2 q ) + K K p k µ 2 k ) 2 p k µ k σ 2 q ), 3.9) where the last inequality follows by Jensen s inequality. Remark 3.3 Let us define a random variable η by η := K k1 Z Ak. Then we have µ k := EY k) = EY η = k and σ 2 k = VarY k) = VarY η = k. Moreover, E VarY η = K p k σk 2 and Var EY η = K K p k µ 2 k ) 2. p k µ k Then the equality 3.9) can be interpreted as a variance decomposition: σ 2 = VarY = E VarY η + Var EY η.

16 ii) Optimal allocation: Let us consider the minimal variance problem min q K p 2 k q k σ 2 k subject to q : q k 0, K q k = 1. The Lagrange multiplier is given by Lλ, q 1,, q K ) := K p 2 k σk 2 q + λ K ) q k 1. k Then the first order condition gives which implies that L = p2 k σ2 k q k qk 2 + λ = 0, p k σ k q k = λ, k = 1,, K. Thus q k = λp k σ k for all k = 1,, K, and it follows that q k = p k σ k K i=1 p iσ i. Application: Let S t be defined by S t = S 0 e σ2 /t+σw t, where S 0 and σ are some positive constant. Denote X := W T / T N0, 1), motivated by its application in finance, we usually need to compute the value E Y with Y := fs T ) = gx), for some function f : R R and g : R R notice that S T can be expressed as a function of X). Let Φ : R 0, 1) be the distribution function of the Gaussian distribution N0, 1), and take A k = a k, a k+1 for some constant a k ) 1 k K+1, X k) := Φ 1 a k + a k+1 a k )U ) Y k) := gx k) ). We then obtain the following algorithm: Algorithm 3.1 i) Choose the sequence of stratification a k ) 1 k K+1. ii) For each k = 1,, K, simulate a sequence of i.i.d. random variable Ui k) i 1 of uniform distribution U0, 1, and let X k) i := Φ 1 a k + a k+1 a k )Ui k ).

17 iii) Estimate EY by K ak+1 a k ) 1 n k n k i=1 ) gxi k ). 3.4 Importance sampling method For the importance sampling, let us begin with a simple example. Suppose that X N0, 1) and h : R R is some function, then E hx) 1 = hx) e x2 /2 dx = hx)e x2 /2+x µ) 2 /2 1 e x µ)2/2 dx R 2π R 2π = hx)e µx+µ2 /2 1 e x µ)2/2 dx R 2π = E hy )e µy +µ2 /2 where Y Nµ, 1)) = E hx + µ)e µx µ2 /2. 3.10) In some context, we can expect that Var hx) > Var hx + µ)e µx µ2 /2, then we can use the latter expectation to propose a Monte-Carlo estimator. To deduce the equality 3.10), the main idea is to divide the function hx) by some density function and then multiple it. We can use this idea in a more general context. Importance sampling method Let X be a random vector of density function f : R d R + and h : R d R, the objective is to compute E hx). Suppose that there is some other density function g : R d R + such that gx) > 0 for every x R d such that fx) > 0. Then by direct computation, we have E hx) = R d hx)fx)dx = R d hx) fx) gx) gx)dx = E hz) fz) gz) where Z is a random vector of density function g. Then an importance sampling estimator for EhX) is given, with a sequence Z k ) k 1 of i.i.d. simulations of Y, by, 1 n n hz k ) fz k) gz k ). 3.11)

18 Variance analysis Let us compute the variance of the new estimator. Var hz) fz) gz) = E h 2 Z) f 2 Z) g 2 Z) E hz) fz) gz) ) 2 = h 2 z) f 2 z) R d g 2 z) gz)dz E hx) ) 2 = E h 2 X) fx) E hx) ) 2. gx) And hence the problem of minimizing the variance turns to be min g Var hz) fz) gz) min g E h 2 X) fx) gx). 3.12) Example 3.4 i) Suppose that hx) = 1 A x) for some subset A R d. Then the minimization problem min g Var hz) fz) gz) = min g Var 1 A Z) fz) gz) is solved by gz) := fz)1 Az) α, where α is the constant making g a density function. ii) Suppose that h is positive, then the minimization problem 3.12) is solved by gz) := 1 αfz)hz), where α is the constant making g a density function., The above two examples can not be implemented since to make g a density function, we need to choose α := R d fz)hz)dz = E hx), which is not known a priori. Therefore, the minimum variance problem 3.12) is not a well posed problem. In practice, we consider a family of density functions g θ ) ) θ Θ, and then solve the minimum variance problem: min Var hz) fz) θ Θ g θ Z) min E h 2 X) fx). θ Θ g θ X) Gaussian vector case function Let X = X 1, X n ) N0, σ 2 I n ), which admits density fx 1,, x n ) := Π n ρx k), with ρx) := 1 2πσ e x2 2σ 2. Let θ = θ 1,, θ n ) R n, and g θ x 1,, x n ) := Π n ρ θ k x k ), with ρ θk x k ) := 1 e x θ k ) 2 2πσ 2σ 2.

19 Then the ratio of the density function turns to be fx) g θ x) n µ2 k n = exp µ kx k + 1 ) 2 σ 2. Then E hx 1,, X n ) = E h ) X 1 + θ 1,, X n + θ n e n µ kx k 1 n ) 2 µ2 k /σ 2. Example 3.5 In application in finance, one usually considers a Brownian motion W, and denote W k := W tk W tk 1 on the discrete time grid 0 = t 0 < t 1 < and one needs to compute E h ) W 1,, W n for some function h. Let Xk = W k, σ 2 = t and µ k = θ k / t, then E h ) W 1,, W n = E h W 1 + µ 1 t,, W n + µ n t ) exp n µ k W k 1 2 µ2 k ). t

20

Chapter 4 Stochastic gradient algorithm The objective is solve an optimization problem of the form where F θ, )) θ Θ is a family of functions. min E F θ, X), 4.1) θ Θ An iterative algorithm to find the root Proposition 4.1 Let f : R d R d a bounded continuous function and θ R d such that fθ ) = 0 and θ θ ) fθ) > 0, θ R d \ {θ }. 4.2) Let γ n ) n 1 be a sequence of numbers satisfying γ n > 0, n 1, and n 1 γ n =, γn 2 <. 4.3) n 1 Further, with some θ 0 R d, define a sequence θ n ) n 1 by θ n+1 = θ n γ n+1 fθ n ), n 0. Then, θ n θ as n. Proof. i) First, by its definition, we have θ n+1 θ 2 = θ n θ 2 + 2θ n θ ) θ n+1 θ n ) + θ n+1 θ n 2 = θ n θ 2 2γ n+1 fθ n ) θ n θ ) + γn fθ 2 n ) 2 θ n θ 2 + γn fθ 2 n ) 2, 21

22 where the last inequality follows by 4.3). Define x n := θ n θ 2 n γk 2 fθ k 1) 2. Then the sequence x n ) n 1 is non-increasing. Moreover, it is bounded from below since x n f k 1 γ k. Therefore, there is some x such that x n x and hence θ n θ 2 l := x + k 1 γ 2 k fθ k 1) 2. It is clear l 0 since it is the limit of θ n θ 2. We claim that l = 0, which can conclude the proof. ii) We now prove l = 0 by contradiction. Assume that l > 0, then there is some N > 0 such that l/2 θ θ 2 2l for every n N. Besides, by the continuity of f and 4.2), we have Therefore, η := inf l/2 θ θ 2 2l θ θ ) fθ) > 0. n 1 γ n fθ n=1 ) θ n 1 θ ) n N γ n fθ n=1 ) θ n 1 θ ) η n N However, we have also θ n θ n 1 ) θ n 1 θ ) γ n =. γ n fθ n=1 ) θ n 1 θ ) = n 1 n 1 = 1 θ n θ 2 θ n θ n 1 2 θ n 1 θ 2) 2 n 1 = 1 γ 2 2 n fθ n 1 ) l + θ 0 θ 2) <. n 1 This is a contradiction, and hence the claim l = 0 is true. Stochastic gradient algorithm Theorem 4.1 In the context of Proposition 4.1, suppose that fθ) = E F θ, X) for some function F : R d R n R and some random vector X. Suppose that f satisfies 4.2) and a sequence of numbers γ) n 1 satisfies 4.3). Then, with some θ 0 R d and a sequence X n ) n 1 of i.i.d. simulations of X, we define a sequence θ n ) n 1 by θ n+1 = θ n γ n+1 F θ n, X n+1 ). 4.4) Then, θ n θ almost surely as n.

23 Proof. Application: optimal importance sampling the minimum variance problem Let us denote In the context of Section 3.4, we solve min Var hx + µ)e µx µ2 /2 min E h 2 X + µ)e 2µX µ2 µ R µ R min E h 2 X)e µx+µ2 /2 4.5) µ R Lµ, X) := h 2 X)e µx+µ2 /2 and lµ) := E Lµ, X), and F µ, X) := L µ µ, X) := µ X)h2 X)e µx+µ2 /2 and fµ) := E F µ, X). Then the minimum variance problem 4.5) is equivalent to find the µ such that fµ ) = l µ ) = 0. Notice that 1 f µ) = l µ) = E + µ X) 2 ) h 2 X)e µx+µ2 /2 > 0, and hence such a µ is separate for f. Therefore, we can use the stochastic gradient algorithm 4.4) to find the optimal µ. Algorithm 4.1 i) Simulate a sequence X n ) n 1 of i.i.d. simulations of X. ii) With µ 0 = 0, use the iteration: µ n+1 = µ n γ n+1 F µ n, X n+1 ) iii) The estimator of EhX) is given by Y n+1 := = 1 n + 1 n+1 hx k + µ k 1 )e µ k 1X k µ 2 /2) k 1 n n + 1 Y n + 1 n + 1 hx n+1 + µ n )e µnx n+1 µ 2 n/2. The advantage of the above algorithm is that one does not need to memorized the simulation X n ) n 1 in the iteration.

24