Case study: stochastic simulation via Rademacher bootstrap


Maxim Raginsky
December 4, 2013

In this lecture, we will look at an application of statistical learning theory to the problem of efficient stochastic simulation, which arises frequently in engineering design. The basic question is as follows. Suppose we have a system with input space $\mathsf{Z}$. The system has a tunable parameter $\theta$ that lies in some set $\Theta$. We have a performance index $\ell : \mathsf{Z} \times \Theta \to [0,1]$, where we assume that the lower the value of $\ell$, the better the performance. Thus, if we use the parameter setting $\theta \in \Theta$ and apply the input $z \in \mathsf{Z}$, the performance of the corresponding system is given by the scalar $\ell(z,\theta) \in [0,1]$.

Now let us suppose that the input to the system is actually a random variable $Z \in \mathsf{Z}$ with some distribution $P_Z \in \mathcal{P}(\mathsf{Z})$. Then we can define the operating characteristic

$$L(\theta) := \mathbb{E}_{P_Z}[\ell(Z,\theta)] = \int_{\mathsf{Z}} \ell(z,\theta)\, P_Z(dz), \qquad \theta \in \Theta. \tag{1}$$

The goal is to find an optimal operating point $\theta^* \in \Theta$ that achieves (or comes arbitrarily close to) $\inf_{\theta\in\Theta} L(\theta)$.

In practice, the problem of minimizing $L(\theta)$ is quite difficult for large-scale systems. First of all, computing the integral in (1) may be a challenge. Secondly, we may not even know the distribution $P_Z$. Thirdly, there may be more than one distribution of the input, each corresponding to a different operating regime and/or environment. For these reasons, engineers often resort to Monte Carlo simulation techniques: assuming we can sample efficiently from $P_Z$, we draw a large number of independent samples $Z_1, Z_2, \ldots, Z_n$ and compute

$$\hat{\theta}_n = \arg\min_{\theta\in\Theta} L_n(\theta) := \arg\min_{\theta\in\Theta} \frac{1}{n}\sum_{i=1}^n \ell(Z_i,\theta),$$

where $L_n(\cdot)$ denotes the empirical version of the operating characteristic (1). Given an accuracy parameter $\varepsilon > 0$ and a confidence parameter $\delta \in (0,1)$, we simply need to draw enough samples so that

$$L(\hat{\theta}_n) \le \inf_{\theta\in\Theta} L(\theta) + \varepsilon$$

with probability at least $1-\delta$, regardless of what the true distribution $P_Z$ happens to be. This is, of course, just another instance of the ERM algorithm we have been studying extensively.
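Before turning to the theory, here is a minimal numerical sketch of the Monte Carlo approach just described. The quadratic performance index `ell`, the Gaussian stand-in for $P_Z$, and the finite grid of candidate parameters are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def ell(z, theta):
    # Hypothetical performance index l : Z x Theta -> [0, 1] (illustrative).
    return np.clip((z - theta) ** 2, 0.0, 1.0)

def empirical_erm(thetas, n, rng):
    """Draw Z_1, ..., Z_n i.i.d. and minimize the empirical operating
    characteristic L_n over a finite set of candidate parameters."""
    z = rng.standard_normal(n)                    # stand-in for sampling from P_Z
    L_n = np.array([ell(z, th).mean() for th in thetas])
    j = int(np.argmin(L_n))
    return thetas[j], L_n[j]

rng = np.random.default_rng(0)
thetas = np.linspace(-1.0, 1.0, 51)               # candidate operating points
theta_hat, L_hat = empirical_erm(thetas, n=10_000, rng=rng)
print(f"theta_hat = {theta_hat:.3f}, L_n(theta_hat) = {L_hat:.4f}")
```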

However, there are two issues. One is how many samples we need to guarantee that the empirically optimal operating point will be good. The other is the complexity of actually computing an empirical minimizer.

The first issue has already come up in the course under the name of sample complexity of learning. The second issue is often handled by relaxing the problem a bit: we choose a probability distribution $Q$ over $\Theta$ (assuming $\Theta$ can be equipped with an appropriate $\sigma$-algebra), set a level parameter $\alpha \in (0,1)$, and, instead of minimizing $L(\theta)$ over all of $\Theta$, seek any $\hat\theta \in \Theta$ for which there exists some exceptional set $\Lambda \subseteq \Theta$ with $Q(\Lambda) \le \alpha$, such that

$$\inf_{\theta\in\Theta} L(\theta) - \varepsilon \,\le\, L(\hat\theta) \,\le \inf_{\theta\in\Theta\setminus\Lambda} L(\theta) + \varepsilon \tag{2}$$

with probability at least $1-\delta$. Unless the actual optimal operating point $\theta^*$ happens to lie in the exceptional set $\Lambda$, we will come to within $\varepsilon$ of the optimum with confidence at least $1-\delta$. Then we just need to draw a large enough number $n$ of samples $Z_1,\ldots,Z_n$ from $P_Z$ and a large enough number $m$ of samples $\theta_1,\ldots,\theta_m$ from $Q$, and then compute

$$\hat\theta = \arg\min_{\theta\in\{\theta_1,\ldots,\theta_m\}} L_n(\theta).$$

In the next several lectures, we will see how statistical learning theory can be used to develop such simulation procedures. Moreover, we will learn how to use Rademacher averages¹ to determine how many samples we need in the process of learning. The use of statistical learning theory for simulation was pioneered in the context of control by M. Vidyasagar [Vid98, Vid01]; the refinement of his techniques using Rademacher averages is due to Koltchinskii et al. [KAA+00a, KAA+00b]. We will essentially follow their presentation, but with slightly better constants.

Our plan is as follows. First, we will revisit the abstract ERM problem and its sample complexity. Then we will introduce a couple of refined tools pertaining to Rademacher averages. Next, we will look at sequential algorithms for empirical approximation, in which the sample complexity is not set a priori, but is rather determined by a data-driven stopping rule. Finally, we will see how these sequential algorithms can be used to develop robust and efficient stochastic simulation strategies.

1 Empirical Risk Minimization: a quick review

Recall the abstract Empirical Risk Minimization problem: we have a space $\mathsf{Z}$, a class $\mathcal{P}$ of probability distributions over $\mathsf{Z}$, and a class $\mathcal{F}$ of measurable functions $f : \mathsf{Z} \to [0,1]$. Given an i.i.d. sample $Z^n$ drawn according to some unknown $P \in \mathcal{P}$, we compute

$$\hat{f}_n \in \arg\min_{f\in\mathcal{F}} P_n(f) := \arg\min_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^n f(Z_i).$$

¹More precisely, their stochastic counterpart, in which we do not take the expectation over the Rademacher sequence, but rather use it as a resource to aid the simulation.

We would like $P(\hat{f}_n)$ to be close to $\inf_{f\in\mathcal{F}} P(f)$ with high probability. To that end, we have derived the bound

$$P(\hat{f}_n) - \inf_{f\in\mathcal{F}} P(f) \le 2\,\|P_n - P\|_{\mathcal{F}},$$

where, as before, we have defined the uniform deviation

$$\|P_n - P\|_{\mathcal{F}} := \sup_{f\in\mathcal{F}} \big|P_n(f) - P(f)\big| = \sup_{f\in\mathcal{F}} \left|\frac{1}{n}\sum_{i=1}^n f(Z_i) - \mathbb{E}_P f(Z)\right|.$$

Hence, if $n$ is sufficiently large that, for every $P \in \mathcal{P}$, $\|P_n - P\|_{\mathcal{F}} \le \varepsilon/2$ with $P$-probability at least $1-\delta$, then $P(\hat{f}_n)$ will be $\varepsilon$-close to $\inf_{f\in\mathcal{F}} P(f)$ with probability at least $1-\delta$. This motivates the following definition:

Definition 1. Given the pair $(\mathcal{F},\mathcal{P})$, an accuracy parameter $\varepsilon > 0$, and a confidence parameter $\delta \in (0,1)$, the sample complexity of empirical approximation is

$$N(\varepsilon;\delta) := \min\left\{ n \in \mathbb{N} : \sup_{P\in\mathcal{P}} P\{\|P_n - P\|_{\mathcal{F}} \ge \varepsilon\} \le \delta \right\}. \tag{3}$$

In other words, for any $\varepsilon > 0$ and any $\delta \in (0,1)$, $N(\varepsilon/2;\delta)$ is an upper bound on the number of samples needed to guarantee that $P(\hat{f}_n) \le \inf_{f\in\mathcal{F}} P(f) + \varepsilon$ with probability (confidence) at least $1-\delta$.

2 Empirical Rademacher averages

As before, let $Z^n$ be an i.i.d. sample of length $n$ from some $P \in \mathcal{P}(\mathsf{Z})$. On multiple occasions we have seen that the performance of the ERM algorithm is controlled by the Rademacher average

$$R_n(\mathcal{F}(Z^n)) := \frac{1}{n}\,\mathbb{E}_{\sigma^n}\left[\sup_{f\in\mathcal{F}} \sum_{i=1}^n \sigma_i f(Z_i)\right], \tag{4}$$

where $\sigma^n = (\sigma_1,\ldots,\sigma_n)$ is an $n$-tuple of i.i.d. Rademacher random variables independent of $Z^n$. More precisely, we have established the fundamental symmetrization inequality

$$\mathbb{E}\|P_n - P\|_{\mathcal{F}} \le 2\,\mathbb{E} R_n(\mathcal{F}(Z^n)), \tag{5}$$

as well as the concentration bounds

$$P\{\|P_n - P\|_{\mathcal{F}} \ge \mathbb{E}\|P_n - P\|_{\mathcal{F}} + \varepsilon\} \le e^{-2n\varepsilon^2} \tag{6}$$

$$P\{\|P_n - P\|_{\mathcal{F}} \le \mathbb{E}\|P_n - P\|_{\mathcal{F}} - \varepsilon\} \le e^{-2n\varepsilon^2}. \tag{7}$$

These results show two things:

1. The uniform deviation $\|P_n - P\|_{\mathcal{F}}$ tightly concentrates around its expected value.
2. The expected value $\mathbb{E}\|P_n - P\|_{\mathcal{F}}$ is bounded from above by $2\,\mathbb{E} R_n(\mathcal{F}(Z^n))$.

It turns out that the expected Rademacher average $\mathbb{E} R_n(\mathcal{F}(Z^n))$ also furnishes a lower bound on $\mathbb{E}\|P_n - P\|_{\mathcal{F}}$:

Lemma 1 (Desymmetrization inequality). For any class $\mathcal{F}$ of measurable functions $f : \mathsf{Z} \to [0,1]$, we have

$$\frac{1}{2}\,\mathbb{E} R_n(\mathcal{F}(Z^n)) - \frac{1}{2\sqrt{n}} \,\le\, \frac{1}{2n}\,\mathbb{E}\left[\sup_{f\in\mathcal{F}} \left|\sum_{i=1}^n \sigma_i\big[f(Z_i) - P(f)\big]\right|\right] \,\le\, \mathbb{E}\|P_n - P\|_{\mathcal{F}}. \tag{8}$$

Proof. We will first prove the second inequality in (8). To that end, for each $1 \le i \le n$ and each $f \in \mathcal{F}$, let us define $U_i(f) := f(Z_i) - P(f)$; then $\mathbb{E} U_i(f) = 0$. Let $Z'_1,\ldots,Z'_n$ be an independent copy of $Z_1,\ldots,Z_n$, and define $U'_i(f)$, $1 \le i \le n$, similarly. Since $\mathbb{E} U'_i(f) = 0$, we can write

$$\begin{aligned}
\mathbb{E}\left[\sup_{f\in\mathcal{F}}\left|\sum_{i=1}^n \sigma_i\big[f(Z_i)-P(f)\big]\right|\right]
&= \mathbb{E}\left[\sup_{f\in\mathcal{F}}\left|\sum_{i=1}^n \sigma_i U_i(f)\right|\right] \\
&= \mathbb{E}\left[\sup_{f\in\mathcal{F}}\left|\sum_{i=1}^n \sigma_i\big[U_i(f) - \mathbb{E} U'_i(f)\big]\right|\right] \\
&\le \mathbb{E}\left[\sup_{f\in\mathcal{F}}\left|\sum_{i=1}^n \sigma_i\big[U_i(f) - U'_i(f)\big]\right|\right].
\end{aligned}$$

Since, for each $i$, $U_i(f)$ and $U'_i(f)$ are i.i.d., the difference $U_i(f) - U'_i(f)$ is a symmetric random variable. Therefore,

$$\left\{\sigma_i\big[U_i(f) - U'_i(f)\big] : 1 \le i \le n\right\} \overset{(d)}{=} \left\{U_i(f) - U'_i(f) : 1 \le i \le n\right\}.$$

Using this fact and the triangle inequality, we get

$$\begin{aligned}
\mathbb{E}\left[\sup_{f\in\mathcal{F}}\left|\sum_{i=1}^n \sigma_i\big[U_i(f)-U'_i(f)\big]\right|\right]
&= \mathbb{E}\left[\sup_{f\in\mathcal{F}}\left|\sum_{i=1}^n \big[U_i(f)-U'_i(f)\big]\right|\right] \\
&\le 2\,\mathbb{E}\left[\sup_{f\in\mathcal{F}}\left|\sum_{i=1}^n U_i(f)\right|\right] \\
&= 2\,\mathbb{E}\left[\sup_{f\in\mathcal{F}}\left|\sum_{i=1}^n \big[f(Z_i)-P(f)\big]\right|\right] \\
&= 2n\,\mathbb{E}\|P_n - P\|_{\mathcal{F}}.
\end{aligned}$$

To prove the first inequality in (8), we write

$$\begin{aligned}
\mathbb{E} R_n(\mathcal{F}(Z^n))
&= \frac{1}{n}\,\mathbb{E}\left[\sup_{f\in\mathcal{F}} \sum_{i=1}^n \sigma_i\big(f(Z_i) - P(f) + P(f)\big)\right] \\
&\le \frac{1}{n}\,\mathbb{E}\left[\sup_{f\in\mathcal{F}}\left|\sum_{i=1}^n \sigma_i\big[f(Z_i)-P(f)\big]\right|\right] + \frac{1}{n}\,\mathbb{E}\left[\sup_{f\in\mathcal{F}} |P(f)|\,\left|\sum_{i=1}^n \sigma_i\right|\right] \\
&\le \frac{1}{n}\,\mathbb{E}\left[\sup_{f\in\mathcal{F}}\left|\sum_{i=1}^n \sigma_i\big[f(Z_i)-P(f)\big]\right|\right] + \frac{1}{\sqrt{n}},
\end{aligned}$$

where the last step uses $|P(f)| \le 1$ and $\mathbb{E}\big|\sum_{i=1}^n \sigma_i\big| \le \sqrt{n}$. Rearranging, we get the desired inequality.

In this section, we will see that we can get a lot of mileage out of the stochastic version of the Rademacher average. To that end, let us define

$$r_n(\mathcal{F}(Z^n)) := \frac{1}{n} \sup_{f\in\mathcal{F}} \sum_{i=1}^n \sigma_i f(Z_i). \tag{9}$$

The key difference between (4) and (9) is that, in the latter, we do not take the expectation over the Rademacher sequence $\sigma^n$. In other words, both $R_n(\mathcal{F}(Z^n))$ and $r_n(\mathcal{F}(Z^n))$ are random variables, but the former depends only on the training data $Z^n$, while the latter also depends on the $n$ Rademacher random variables $\sigma_1,\ldots,\sigma_n$. We see immediately that

$$R_n(\mathcal{F}(Z^n)) = \mathbb{E}\big[r_n(\mathcal{F}(Z^n)) \,\big|\, Z^n\big] \qquad \text{and} \qquad \mathbb{E} R_n(\mathcal{F}(Z^n)) = \mathbb{E}\, r_n(\mathcal{F}(Z^n)),$$

where the expectation on the right-hand side is over both $Z^n$ and $\sigma^n$. The following result will be useful:

Lemma 2 (Concentration inequalities for Rademacher averages). For any $\varepsilon > 0$,

$$P\left\{ r_n(\mathcal{F}(Z^n)) \ge \mathbb{E} R_n(\mathcal{F}(Z^n)) + \varepsilon \right\} \le e^{-n\varepsilon^2/2} \tag{10}$$

and

$$P\left\{ r_n(\mathcal{F}(Z^n)) \le \mathbb{E} R_n(\mathcal{F}(Z^n)) - \varepsilon \right\} \le e^{-n\varepsilon^2/2}. \tag{11}$$

Proof. For each $1 \le i \le n$, let $U_i := (Z_i, \sigma_i)$. Then $r_n(\mathcal{F}(Z^n))$ can be represented as a real-valued function $g(U^n)$. Moreover, it is easy to see that this function has bounded differences with $c_1 = \ldots = c_n = 2/n$. Hence, McDiarmid's inequality tells us that for any $\varepsilon > 0$

$$P\left\{ g(U^n) \ge \mathbb{E} g(U^n) + \varepsilon \right\} \le e^{-n\varepsilon^2/2},$$

and the same bound holds for the probability that $g(U^n) \le \mathbb{E} g(U^n) - \varepsilon$. This completes the proof.
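The stochastic Rademacher average (9) is cheap to compute: one draw of the signs and a single pass over the data. Below is a small sketch for a finite class, with the class and the data chosen purely for illustration; averaging many realizations of $r_n$ with $Z^n$ fixed estimates the conditional Rademacher average (4).

```python
import numpy as np

def stochastic_rademacher(F_values, rng):
    """One realization of r_n in (9). F_values has shape (n, m), with
    (i, j) entry f_j(Z_i) for a finite class F = {f_1, ..., f_m}."""
    n = F_values.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=n)       # i.i.d. Rademacher signs
    return np.max(sigma @ F_values) / n

rng = np.random.default_rng(0)
n, m = 1000, 20
Z = rng.random(n)                                 # illustrative data
thetas = np.linspace(0.0, 1.0, m)
F_values = np.abs(Z[:, None] - thetas[None, :])   # f_j(z) = |z - theta_j|
r_n = stochastic_rademacher(F_values, rng)
# Averaging over many sign draws (Z^n fixed) estimates R_n(F(Z^n)) in (4).
R_n_est = np.mean([stochastic_rademacher(F_values, rng) for _ in range(200)])
print(r_n, R_n_est)
```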

3 Sequential learning algorithms

In a sequential learning algorithm, the sample complexity is a random variable: it is not known in advance, but rather is computed from data in the process of learning. In other words, instead of using a training sequence of fixed length, we keep drawing independent samples until we decide that we have acquired enough of them, and then compute an empirical risk minimizer.

To formalize this idea, we need the notion of a stopping time. Let $\{U_n\}_{n=1}^\infty$ be a random process. A random variable $\tau$ taking values in $\mathbb{N}$ is called a stopping time if, for each $n \ge 1$, the occurrence of the event $\{\tau = n\}$ is determined by $U^n = (U_1,\ldots,U_n)$. More precisely:

Definition 2. For each $n$, let $\Sigma_n$ denote the $\sigma$-algebra generated by $U^n$ (in other words, $\Sigma_n$ consists of all events that occur by time $n$). Then a random variable $\tau$ taking values in $\mathbb{N}$ is a stopping time if and only if, for each $n \ge 1$, the event $\{\tau = n\} \in \Sigma_n$.

In other words, denoting by $U^\infty$ the entire sample path $(U_1,U_2,\ldots)$ of our random process, we can view $\tau$ as a function that maps $U^\infty$ into $\mathbb{N}$. For each $n$, the indicator function of the event $\{\tau = n\}$ is a function of $U^\infty$: $1_{\{\tau=n\}} \equiv 1_{\{\tau(U^\infty)=n\}}$. Then $\tau$ is a stopping time if and only if, for each $n$ and for all $U^\infty, V^\infty$ with $U^n = V^n$, we have $1_{\{\tau(U^\infty)=n\}} = 1_{\{\tau(V^\infty)=n\}}$.

Our sequential learning algorithms will work as follows. Given a desired accuracy parameter $\varepsilon > 0$ and a confidence parameter $\delta \in (0,1)$, let $n(\varepsilon,\delta)$ be the initial sample size; we will assume that $n(\varepsilon,\delta)$ is a nonincreasing function of both $\varepsilon$ and $\delta$. Let $\mathcal{T}(\varepsilon,\delta)$ denote the set of all stopping times $\tau$ such that

$$\sup_{P\in\mathcal{P}} P\{\|P_\tau - P\|_{\mathcal{F}} \ge \varepsilon\} \le \delta.$$

Now if $\tau \in \mathcal{T}(\varepsilon,\delta)$ and we let

$$\hat{f}_\tau \in \arg\min_{f\in\mathcal{F}} P_\tau(f) := \arg\min_{f\in\mathcal{F}} \frac{1}{\tau}\sum_{i=1}^\tau f(Z_i),$$

then we immediately see that

$$\sup_{P\in\mathcal{P}} P\left\{ P(\hat{f}_\tau) \ge \inf_{f\in\mathcal{F}} P(f) + 2\varepsilon \right\} \le \delta.$$

Of course, the whole question is how to construct an appropriate stopping time without knowing $P$.

Definition 3. A parametric family of stopping times $\{\nu(\varepsilon,\delta) : \varepsilon > 0, \delta \in (0,1)\}$ is called strongly efficient (SE) (w.r.t. $\mathcal{F}$ and $\mathcal{P}$) if there exist constants $K_1, K_2, K_3 \ge 1$ such that, for all $\varepsilon > 0$, $\delta \in (0,1)$, and for all $\tau \in \mathcal{T}(\varepsilon,\delta)$,

$$\nu(\varepsilon,\delta) \in \mathcal{T}(K_1\varepsilon, \delta) \tag{12}$$

$$\sup_{P\in\mathcal{P}} P\{\nu(K_2\varepsilon,\delta) > \tau\} \le K_3\delta. \tag{13}$$

In other words, Eq. (12) says that any SE stopping time $\nu(\varepsilon,\delta)$ guarantees that we can approximate statistical expectations by empirical expectations with accuracy $K_1\varepsilon$ and confidence $1-\delta$; similarly, Eq. (13) says that, with probability at least $1-K_3\delta$, we will require at most as many samples as would be needed by any sequential algorithm for empirical approximation with accuracy $\varepsilon/K_2$ and confidence $1-\delta$.

Definition 4. A family of stopping times $\{\nu(\varepsilon,\delta) : \varepsilon > 0, \delta \in (0,1)\}$ is weakly efficient (WE) for $(\mathcal{F},\mathcal{P})$ if there exist constants $K_1, K_2, K_3 \ge 1$ such that, for all $\varepsilon > 0$ and $\delta \in (0,1)$,

$$\nu(\varepsilon,\delta) \in \mathcal{T}(K_1\varepsilon,\delta) \tag{14}$$

$$\sup_{P\in\mathcal{P}} P\{\nu(K_2\varepsilon,\delta) > N(\varepsilon;\delta)\} \le K_3\delta. \tag{15}$$

If $\nu(\varepsilon,\delta)$ is a WE stopping time, then Eq. (14) says that we can solve the empirical approximation problem with accuracy $K_1\varepsilon$ and confidence $1-\delta$; Eq. (15) says that, with probability at least $1-K_3\delta$, the number of samples used will not exceed the sample complexity of empirical approximation with accuracy $\varepsilon/K_2$ and confidence $1-\delta$. If $N(\varepsilon;\delta) \ge n(\varepsilon,\delta)$, then $N(\varepsilon;\delta) \in \mathcal{T}(\varepsilon,\delta)$. Hence, any WE family of stopping times is also SE. The converse, however, is not true.

3.1 A strongly efficient sequential learning algorithm

Let $\{Z_n\}_{n=1}^\infty$ be an infinite sequence of i.i.d. draws from some $P \in \mathcal{P}$; let $\{\sigma_n\}_{n=1}^\infty$ be an i.i.d. Rademacher sequence independent of $\{Z_n\}$. Choose

$$n(\varepsilon,\delta) \ge \frac{2}{\varepsilon^2}\,\log\frac{2}{\delta\big(1 - e^{-\varepsilon^2/2}\big)} + 1 \tag{16}$$

and let

$$\nu(\varepsilon,\delta) := \min\left\{ n \ge n(\varepsilon,\delta) : r_n(\mathcal{F}(Z^n)) \le \varepsilon \right\}. \tag{17}$$

This is clearly a stopping time for each $\varepsilon > 0$ and each $\delta \in (0,1)$.
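In code, the rule (16)-(17) might look as follows. This is a sketch under assumed interfaces, not part of the notes: `sample_z` draws one $Z \sim P_Z$, and `evaluate_F` returns the NumPy vector $(f(Z))_{f\in\mathcal{F}}$ for a finite class $\mathcal{F}$.

```python
import numpy as np

def initial_sample_size(eps, delta):
    # n(eps, delta) from (16).
    return int(np.ceil(
        (2.0 / eps**2) * np.log(2.0 / (delta * (1.0 - np.exp(-eps**2 / 2.0))))
    )) + 1

def se_stopping_rule(eps, delta, sample_z, evaluate_F, rng):
    """Keep sampling until r_n(F(Z^n)) <= eps, as in (17). Returns the
    stopping time nu(eps, delta) and the collected sample."""
    n_bar = initial_sample_size(eps, delta)
    data, S, n = [], 0.0, 0
    while True:
        row = evaluate_F(sample_z())              # vector (f(Z_n))_{f in F}
        data.append(row)
        S = S + rng.choice([-1.0, 1.0]) * row     # running sums sum_i sigma_i f(Z_i)
        n += 1
        if n >= n_bar and np.max(S) / n <= eps:   # r_n(F(Z^n)) <= eps
            return n, np.asarray(data)
```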

Theorem 1. The family $\{\nu(\varepsilon,\delta) : \varepsilon > 0, \delta \in (0,1)\}$ defined in (17), with $n(\varepsilon,\delta)$ set according to (16), is SE for any class $\mathcal{F}$ of measurable functions $f : \mathsf{Z} \to [0,1]$ and $\mathcal{P} = \mathcal{P}(\mathsf{Z})$, with $K_1 = 5$, $K_2 = 6$, $K_3 = 1$.

Proof. Let $\bar{n} = n(\varepsilon,\delta)$. We will first show that, for any $P \in \mathcal{P}(\mathsf{Z})$,

$$\|P_n - P\|_{\mathcal{F}} \le 2\, r_n(\mathcal{F}(Z^n)) + 3\varepsilon, \qquad \forall n \ge \bar{n} \tag{18}$$

with probability at least $1-\delta$. Since $r_n(\mathcal{F}(Z^n)) \le \varepsilon$ for $n = \nu(\varepsilon,\delta) \ge \bar{n}$, we will immediately be able to conclude that

$$P\left\{ \|P_{\nu(\varepsilon,\delta)} - P\|_{\mathcal{F}} \ge 5\varepsilon \right\} \le \delta,$$

which will imply that $\nu(\varepsilon,\delta) \in \mathcal{T}(5\varepsilon,\delta)$.

Now we prove (18). First of all, applying Lemma 2 and the union bound, we can write

$$P\left\{ \bigcup_{n\ge\bar{n}} \left\{ r_n(\mathcal{F}(Z^n)) \le \mathbb{E} R_n(\mathcal{F}(Z^n)) - \varepsilon \right\} \right\} \le \sum_{n\ge\bar{n}} e^{-n\varepsilon^2/2} = e^{-\bar{n}\varepsilon^2/2} \sum_{n\ge0} e^{-n\varepsilon^2/2} = \frac{e^{-\bar{n}\varepsilon^2/2}}{1 - e^{-\varepsilon^2/2}} \le \delta/2.$$

From the symmetrization inequality (5), we know that $\mathbb{E}\|P_n - P\|_{\mathcal{F}} \le 2\,\mathbb{E} R_n(\mathcal{F}(Z^n))$. Moreover, using (6) and the union bound, we can write

$$P\left\{ \bigcup_{n\ge\bar{n}} \left\{ \|P_n - P\|_{\mathcal{F}} \ge \mathbb{E}\|P_n - P\|_{\mathcal{F}} + \varepsilon \right\} \right\} \le \sum_{n\ge\bar{n}} e^{-2n\varepsilon^2} \le \sum_{n\ge\bar{n}} e^{-n\varepsilon^2/2} \le \delta/2.$$

Therefore, with probability at least $1-\delta$,

$$\|P_n - P\|_{\mathcal{F}} \le \mathbb{E}\|P_n - P\|_{\mathcal{F}} + \varepsilon \le 2\,\mathbb{E} R_n(\mathcal{F}(Z^n)) + \varepsilon \le 2\, r_n(\mathcal{F}(Z^n)) + 3\varepsilon, \qquad \forall n \ge \bar{n},$$

which is (18). This shows that (12) holds for $\nu(\varepsilon,\delta)$ with $K_1 = 5$.

Next, we will prove that, for any $P \in \mathcal{P}(\mathsf{Z})$,

$$P\left\{ \min_{\bar{n} \le n < \nu(6\varepsilon,\delta)} \|P_n - P\|_{\mathcal{F}} < \varepsilon \right\} \le \delta. \tag{19}$$

In other words, (19) says that, with probability at least $1-\delta$, $\|P_n - P\|_{\mathcal{F}} \ge \varepsilon$ for all $\bar{n} \le n < \nu(6\varepsilon,\delta)$. This means that, for any $\tau \in \mathcal{T}(\varepsilon,\delta)$, $\nu(6\varepsilon,\delta) \le \tau$ with probability at least $1-\delta$, which will give us (13) with $K_2 = 6$ and $K_3 = 1$.

To prove (19), we have by (7) and the union bound that

$$P\left\{ \bigcup_{n\ge\bar{n}} \left\{ \|P_n - P\|_{\mathcal{F}} \le \mathbb{E}\|P_n - P\|_{\mathcal{F}} - \varepsilon \right\} \right\} \le \delta/2.$$

By the desymmetrization inequality (8), we have

$$\mathbb{E}\|P_n - P\|_{\mathcal{F}} \ge \frac{1}{2}\,\mathbb{E} R_n(\mathcal{F}(Z^n)) - \frac{1}{2\sqrt{n}}, \qquad \forall n.$$

Finally, by the concentration inequality (10) and the union bound,

$$P\left\{ \bigcup_{n\ge\bar{n}} \left\{ r_n(\mathcal{F}(Z^n)) \ge \mathbb{E} R_n(\mathcal{F}(Z^n)) + \varepsilon \right\} \right\} \le \delta/2.$$

Therefore, with probability at least 1 δ, P n P F 1 2 r n(f (Z n )) 1 2 n 3ε, n n. 2 If n n < ν(6ε,δ), then r n (F (Z n )) > 6ε. Therefore, using the fact that n n and n(ε,δ) 1/2 ε, we see that, with probability at least 1 δ, P n P F > 3ε 2 1 2 n 3ε 2 1 2 ε, n n < ν(6ε,δ). n This proves (19), and we are done. 3.2 A weakly efficient sequential learning algorithm Now choose 2 n(ε,δ) ε 2 log 4 + 1, (20) δ for each k = 0,1,2,... let n k 2 k n(ε,δ), and let ν(ε,δ) min { n k : r nk (F (Z n k )) ε }. (21) Theorem 2. The family {ν(ε,δ) : ε > 0,δ (0,1/2)} defined in (21) with n(ε,δ) set according to (20) is WE for any class F of measurable functions f : Z [0,1 and P = P (Z) with K 1 = 5, K 2 = 18, K 3 = 3. Proof. As before, let n = n(ε,δ). The proof of (14) is similar to what we have done in the proof of Theorem 1, except we use the bounds { { P rnk (F (Z n k )) ER nk (F (Z n k )) + ε }} e 2k nε 2 /2 k=0 k=0 = e nε2 /2 + e nε2 /2 k=1 e nε2 /2 + e nε2 /2 e nε2 /2 + e nε2 /2 2e nε2 /2 δ/2, where in the third step we have used the fact that nε 2 /2 1. Similarly, { { P Pnk P F E P nk P F + ε }} δ 2. k=0 e nε2 2 (2k 1) e (2k 1) k=1 e k k=1 9

Theorem 2. The family $\{\nu(\varepsilon,\delta) : \varepsilon > 0, \delta \in (0,1/2)\}$ defined in (21), with $n(\varepsilon,\delta)$ set according to (20), is WE for any class $\mathcal{F}$ of measurable functions $f : \mathsf{Z} \to [0,1]$ and $\mathcal{P} = \mathcal{P}(\mathsf{Z})$, with $K_1 = 5$, $K_2 = 18$, $K_3 = 3$.

Proof. As before, let $\bar{n} = n(\varepsilon,\delta)$. The proof of (14) is similar to what we have done in the proof of Theorem 1, except we use the bounds

$$\begin{aligned}
\sum_{k=0}^\infty P\left\{ r_{n_k}(\mathcal{F}(Z^{n_k})) \le \mathbb{E} R_{n_k}(\mathcal{F}(Z^{n_k})) - \varepsilon \right\}
&\le \sum_{k=0}^\infty e^{-2^k \bar{n}\varepsilon^2/2} \\
&= e^{-\bar{n}\varepsilon^2/2} + e^{-\bar{n}\varepsilon^2/2} \sum_{k=1}^\infty e^{-\frac{\bar{n}\varepsilon^2}{2}(2^k - 1)} \\
&\le e^{-\bar{n}\varepsilon^2/2} + e^{-\bar{n}\varepsilon^2/2} \sum_{k=1}^\infty e^{-(2^k - 1)} \\
&\le e^{-\bar{n}\varepsilon^2/2} + e^{-\bar{n}\varepsilon^2/2} \sum_{k=1}^\infty e^{-k} \\
&\le 2 e^{-\bar{n}\varepsilon^2/2} \le \delta/2,
\end{aligned}$$

where in the third step we have used the fact that $\bar{n}\varepsilon^2/2 \ge 1$, and in the fourth the bound $2^k - 1 \ge k$. Similarly,

$$\sum_{k=0}^\infty P\left\{ \|P_{n_k} - P\|_{\mathcal{F}} \ge \mathbb{E}\|P_{n_k} - P\|_{\mathcal{F}} + \varepsilon \right\} \le \frac{\delta}{2}.$$

Therefore, with probability at least $1-\delta$,

$$\|P_{n_k} - P\|_{\mathcal{F}} \le 2\, r_{n_k}(\mathcal{F}(Z^{n_k})) + 3\varepsilon, \qquad k = 0,1,2,\ldots,$$

and consequently

$$P\left\{ \|P_{\nu(\varepsilon,\delta)} - P\|_{\mathcal{F}} \ge 5\varepsilon \right\} \le \delta,$$

which proves (14).

Now we prove (15). Let $N = N(\varepsilon;\delta)$, the sample complexity of empirical approximation that we have defined in (3). Let us choose $k$ so that $n_k \le N < n_{k+1}$, which is equivalent to $2^k \bar{n} \le N < 2^{k+1}\bar{n}$. Then

$$P\{\nu(18\varepsilon,\delta) > N\} \le P\{\nu(18\varepsilon,\delta) > n_k\}.$$

We will show that the probability on the right-hand side is at most $3\delta$. First of all, since $N \ge \bar{n}$ (by hypothesis), we have $n_k > N/2 \ge \bar{n}/2 \ge 1/\varepsilon^2$, so that $\frac{1}{2\sqrt{n_k}} \le \frac{\varepsilon}{2}$. Therefore, with probability at least $1-\delta$,

$$\|P_{n_k} - P\|_{\mathcal{F}} \ge \frac{1}{2}\, r_{n_k}(\mathcal{F}(Z^{n_k})) - \frac{1}{2\sqrt{n_k}} - \frac{9\varepsilon}{2} \ge \frac{1}{2}\, r_{n_k}(\mathcal{F}(Z^{n_k})) - 5\varepsilon. \tag{22}$$

If $\nu(18\varepsilon,\delta) > n_k$, then by definition $r_{n_k}(\mathcal{F}(Z^{n_k})) > 18\varepsilon$. Writing $r_{n_k} = r_{n_k}(\mathcal{F}(Z^{n_k}))$ for brevity, we get

$$\begin{aligned}
P\{\nu(18\varepsilon,\delta) > n_k\}
&\le P\left\{ r_{n_k} > 18\varepsilon \right\} \\
&= P\left\{ r_{n_k} > 18\varepsilon,\ \|P_{n_k} - P\|_{\mathcal{F}} \ge 4\varepsilon \right\} + P\left\{ r_{n_k} > 18\varepsilon,\ \|P_{n_k} - P\|_{\mathcal{F}} < 4\varepsilon \right\} \\
&\le P\left\{ \|P_{n_k} - P\|_{\mathcal{F}} \ge 4\varepsilon \right\} + P\left\{ r_{n_k} > 18\varepsilon,\ \|P_{n_k} - P\|_{\mathcal{F}} < 4\varepsilon \right\}.
\end{aligned}$$

If $r_{n_k} > 18\varepsilon$ but $\|P_{n_k} - P\|_{\mathcal{F}} < 4\varepsilon$, the event in (22) cannot occur. Indeed, suppose it does; then it must be the case that $4\varepsilon > \|P_{n_k} - P\|_{\mathcal{F}} \ge 9\varepsilon - 5\varepsilon = 4\varepsilon$, which is a contradiction. Therefore,

$$P\left\{ r_{n_k} > 18\varepsilon,\ \|P_{n_k} - P\|_{\mathcal{F}} < 4\varepsilon \right\} \le \delta,$$

and hence

$$P\{\nu(18\varepsilon,\delta) > n_k\} \le P\left\{ \|P_{n_k} - P\|_{\mathcal{F}} \ge 4\varepsilon \right\} + \delta.$$

For each $f \in \mathcal{F}$ and each $n \in \mathbb{N}$, define

$$S_n(f) := \sum_{i=1}^n \big[f(Z_i) - P(f)\big],$$

and let $\|S_n\|_{\mathcal{F}} := \sup_{f\in\mathcal{F}} |S_n(f)|$. Then, using $N < 2n_k$,

$$P\left\{ \|P_{n_k} - P\|_{\mathcal{F}} \ge 4\varepsilon \right\} = P\left\{ \|S_{n_k}\|_{\mathcal{F}} \ge 4\varepsilon n_k \right\} \le P\left\{ \|S_{n_k}\|_{\mathcal{F}} \ge 2\varepsilon N \right\}.$$

Since $n_k \le N$, the $\mathcal{F}$-indexed stochastic processes $S_{n_k}(f)$ and $S_N(f) - S_{n_k}(f)$ are independent. Therefore, we can use a technical result, stated as Lemma 4 in the appendix, with $\xi_1 = S_{n_k}$ and $\xi_2 = S_N - S_{n_k}$ to write

$$P\left\{ \|S_{n_k}\|_{\mathcal{F}} \ge 2\varepsilon N \right\} \le \frac{P\left\{ \|S_N\|_{\mathcal{F}} \ge \varepsilon N \right\}}{\inf_{f\in\mathcal{F}} P\left\{ |S_N(f) - S_{n_k}(f)| \le \varepsilon N \right\}}.$$

By the definition of $N = N(\varepsilon;\delta)$, the probability in the numerator is at most $\delta$. To analyze the probability in the denominator, we use Hoeffding's inequality to get

$$\inf_{f\in\mathcal{F}} P\left\{ |S_N(f) - S_{n_k}(f)| \le \varepsilon N \right\} = 1 - \sup_{f\in\mathcal{F}} P\left\{ |S_N(f) - S_{n_k}(f)| > \varepsilon N \right\} \ge 1 - 2e^{-N\varepsilon^2/2} \ge 1 - \delta.$$

Therefore,

$$P\{\nu(18\varepsilon,\delta) > n_k\} \le \frac{\delta}{1-\delta} + \delta \le 3\delta$$

for $\delta < 1/2$. Therefore, $\{\nu(\varepsilon,\delta) : \varepsilon > 0, \delta \in (0,1/2)\}$ is WE with $K_1 = 5$, $K_2 = 18$, $K_3 = 3$.

4 A sequential algorithm for stochastic simulation

Armed with these results on sequential learning algorithms, we can take up the question of constructing efficient simulation strategies. We fix an accuracy parameter $\varepsilon > 0$, a confidence parameter $\delta \in (0,1)$, and a level parameter $\alpha \in (0,1)$. Given two probability distributions, $P_Z$ on the input space $\mathsf{Z}$ and $Q$ on the parameter space $\Theta$, we draw a large i.i.d. sample $Z_1,\ldots,Z_n$ from $P_Z$ and a large i.i.d. sample $\theta_1,\ldots,\theta_m$ from $Q$. We then compute

$$\hat\theta = \arg\min_{\theta\in\{\theta_1,\ldots,\theta_m\}} L_n(\theta), \qquad \text{where } L_n(\theta) := \frac{1}{n}\sum_{i=1}^n \ell(Z_i,\theta).$$

The goal is to pick $n$ and $m$ large enough so that, with probability at least $1-\delta$, $\hat\theta$ is an $\varepsilon$-minimizer of $L$ to level $\alpha$; that is, there exists some set $\Lambda \subseteq \Theta$ with $Q(\Lambda) \le \alpha$ such that Eq. (2) holds with probability at least $1-\delta$. To that end, consider the following algorithm based on Theorem 2, proposed by Koltchinskii et al. [KAA+00a, KAA+00b]:

Algorithm 1

    choose positive integers $m$ and $n$ such that
        $m \ge \dfrac{\log(2/\delta)}{\log[1/(1-\alpha)]}$ and $n \ge \dfrac{50}{\varepsilon^2}\log\dfrac{8}{\delta} + 1$
    draw $m$ independent samples $\theta_1,\ldots,\theta_m$ from $Q$
    draw $n$ independent samples $Z_1,\ldots,Z_n$ from $P_Z$
    evaluate the stopping variable
        $\gamma = \max_{1\le j\le m} \dfrac{1}{n}\sum_{i=1}^n \sigma_i\, \ell(Z_i,\theta_j)$,
        where $\sigma_1,\ldots,\sigma_n$ are i.i.d. Rademacher r.v.'s independent of $\theta^m$ and $Z^n$
    if $\gamma > \varepsilon/5$: draw $n$ more i.i.d. samples from $P_Z$ (doubling the sample size), set $n \leftarrow 2n$, and repeat
    else: stop and output $\hat\theta = \arg\min_{\theta\in\{\theta_1,\ldots,\theta_m\}} L_n(\theta)$

Then we claim that, with probability at least $1-\delta$, $\hat\theta$ is an $\varepsilon$-minimizer of $L$ to level $\alpha$.
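Here is one way Algorithm 1 might look in code. It reuses the hypothetical interfaces from the earlier sketches (`ell`, `sample_z`, `sample_theta`), and the constants come from the box above; this is an illustrative sketch, not the authors' reference implementation. It keeps one Rademacher sign per drawn sample, matching the single i.i.d. sequence $\{\sigma_n\}$ used in Section 3.

```python
import numpy as np

def algorithm_1(eps, delta, alpha, sample_z, sample_theta, ell, rng):
    m = int(np.ceil(np.log(2.0 / delta) / np.log(1.0 / (1.0 - alpha))))
    n = int(np.ceil((50.0 / eps**2) * np.log(8.0 / delta))) + 1
    thetas = [sample_theta() for _ in range(m)]   # theta_1, ..., theta_m ~ Q
    rows, signs = [], []                          # row i = (ell(Z_i, theta_j))_j
    while True:
        while len(rows) < n:                      # top up the Z-sample
            z = sample_z()
            rows.append([ell(z, th) for th in thetas])
            signs.append(rng.choice([-1.0, 1.0]))
        gamma = np.max(np.asarray(signs) @ np.asarray(rows)) / n
        if gamma <= eps / 5.0:                    # stopping condition met
            L_n = np.asarray(rows).mean(axis=0)   # empirical L_n(theta_j)
            return thetas[int(np.argmin(L_n))]
        n *= 2                                    # double the sample size, repeat
```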

To see this, we need the following result [Vid03, Lemma 11.1]:

Lemma 3. Let $Q$ be a probability distribution on the parameter set $\Theta$, and let $h : \Theta \to \mathbb{R}$ be a measurable real-valued function on $\Theta$, bounded from above, i.e., $h(\theta) < +\infty$ for all $\theta \in \Theta$. Let $\theta_1,\ldots,\theta_m$ be $m$ i.i.d. samples from $Q$, and let

$$\hat{h}(\theta^m) := \max_{1\le j\le m} h(\theta_j).$$

Then for any $\alpha \in (0,1)$,

$$Q\left(\left\{ \theta \in \Theta : h(\theta) > \hat{h}(\theta^m) \right\}\right) \le \alpha \tag{23}$$

with probability at least $1 - (1-\alpha)^m$.

Proof. For each $c \in \mathbb{R}$, let $F(c) := Q(\{\theta \in \Theta : h(\theta) \le c\})$. Note that $F$ is the CDF of the random variable $\xi = h(\theta)$ with $\theta \sim Q$; therefore, it is right-continuous, i.e., $\lim_{c'\downarrow c} F(c') = F(c)$. Now define $c_\alpha := \inf\{c : F(c) \ge 1-\alpha\}$. Since $F$ is right-continuous, $F(c_\alpha) \ge 1-\alpha$. Moreover, if $c < c_\alpha$, then $F(c) < 1-\alpha$. Now let us suppose that $\hat{h}(\theta^m) \ge c_\alpha$. Then, since $F$ is monotone nondecreasing,

$$Q\left(\left\{ \theta\in\Theta : h(\theta) \le \hat{h}(\theta^m) \right\}\right) = F\big(\hat{h}(\theta^m)\big) \ge F(c_\alpha) \ge 1-\alpha,$$

or, equivalently, if $\hat{h}(\theta^m) \ge c_\alpha$, then

$$Q\left(\left\{ \theta\in\Theta : h(\theta) > \hat{h}(\theta^m) \right\}\right) \le \alpha.$$

Therefore, if $\theta^m$ is such that

$$Q\left(\left\{ \theta\in\Theta : h(\theta) > \hat{h}(\theta^m) \right\}\right) > \alpha$$

(the complement of the event in (23)), then it must be the case that $\hat{h}(\theta^m) < c_\alpha$, which in turn implies that $F\big(\hat{h}(\theta^m)\big) < 1-\alpha$. But $\hat{h}(\theta^m) < c_\alpha$ means that $h(\theta_j) < c_\alpha$ for every $1 \le j \le m$. Since the $\theta_j$'s are independent, the events $\{h(\theta_j) < c_\alpha\}$ are independent, and each occurs with probability at most $1-\alpha$. Therefore,

$$Q^m\left(\left\{ \theta^m \in \Theta^m : Q\left(\left\{ \theta\in\Theta : h(\theta) > \hat{h}(\theta^m) \right\}\right) > \alpha \right\}\right) \le (1-\alpha)^m,$$

which is what we intended to prove.

We apply this lemma to the function $h(\theta) = -L(\theta)$. Then, provided $m$ is chosen as described in Algorithm 1 (so that $(1-\alpha)^m \le \delta/2$), we will have

$$Q\left(\left\{ \theta\in\Theta : L(\theta) < \min_{1\le j\le m} L(\theta_j) \right\}\right) \le \alpha$$

with probability at least $1 - \delta/2$. Now consider the finite class of functions $\mathcal{F} = \{f_j(z) = \ell(z,\theta_j) : 1 \le j \le m\}$. By Theorem 2, the final output $\hat\theta \in \{\theta_1,\ldots,\theta_m\}$ will satisfy

$$L(\hat\theta) \le \min_{1\le j\le m} L(\theta_j) + \varepsilon$$

with probability at least $1-\delta/2$. Hence, with probability at least $1-\delta$, there exists a set $\Lambda \subseteq \Theta$ with $Q(\Lambda) \le \alpha$ such that (2) holds. Moreover, the total number of samples used up by Algorithm 1 will be, with probability at least $1 - 3\delta/2$, no more than

$$N_{\mathcal{F},P_Z}(\varepsilon/18;\delta/2) := \min\left\{ n \in \mathbb{N} : P\big( \|P_n - P_Z\|_{\mathcal{F}} > \varepsilon/18 \big) < \delta/2 \right\}.$$

We can estimate $N_{\mathcal{F},P_Z}(\varepsilon/18;\delta/2)$ as follows. First of all, the function

$$\Delta(Z^n) := \|P_n - P_Z\|_{\mathcal{F}} = \max_{1\le j\le m} |P_n(f_j) - P_Z(f_j)|$$

has bounded differences with $c_1 = \ldots = c_n = 1/n$. Therefore, by McDiarmid's inequality,

$$P\left( \Delta(Z^n) \ge \mathbb{E}\Delta(Z^n) + t \right) \le e^{-2nt^2}, \qquad \forall t > 0.$$

Secondly, since the class $\mathcal{F}$ is finite with $|\mathcal{F}| = m$, the symmetrization inequality (5) and the Finite Class Lemma give the bound

$$\mathbb{E}\|P_n - P_Z\|_{\mathcal{F}} \le 4\sqrt{\frac{\log m}{n}}.$$

Therefore, if we choose $t = \varepsilon/18 - 4\sqrt{n^{-1}\log m}$ and $n$ is large enough that $t > \varepsilon/20$ (say), then

$$P\big( \|P_n - P_Z\|_{\mathcal{F}} > \varepsilon/18 \big) \le e^{-n\varepsilon^2/200}.$$

Hence, a fairly conservative estimate is

$$N_{\mathcal{F},P_Z}(\varepsilon/18;\delta/2) \le \max\left\{ \frac{200}{\varepsilon^2}\log\frac{2}{\delta} + 1,\ \left(\frac{720}{\varepsilon}\right)^2 \log m + 1 \right\}.$$

It is instructive to compare Algorithm 1 with a simple Monte Carlo strategy:

Algorithm 0

    choose positive integers $m$ and $n$ such that
        $m \ge \dfrac{\log(2/\delta)}{\log[1/(1-\alpha)]}$ and $n \ge \dfrac{1}{2\varepsilon^2}\log\dfrac{4m}{\delta}$
    draw $m$ independent samples $\theta_1,\ldots,\theta_m$ from $Q$
    draw $n$ independent samples $Z_1,\ldots,Z_n$ from $P_Z$
    for $j = 1$ to $m$: compute $L_n(\theta_j) = \frac{1}{n}\sum_{i=1}^n \ell(Z_i,\theta_j)$
    output $\hat\theta = \arg\min_{\theta\in\{\theta_1,\ldots,\theta_m\}} L_n(\theta)$

The selection of $m$ is guided by the same considerations as in Algorithm 1. Moreover, for each $1 \le j \le m$, $L_n(\theta_j)$ is an average of $n$ independent random variables $\ell(Z_i,\theta_j) \in [0,1]$, and $L(\theta_j) = \mathbb{E} L_n(\theta_j)$. Hence, Hoeffding's inequality says that

$$P\left(\left\{ Z^n \in \mathsf{Z}^n : |L_n(\theta_j) - L(\theta_j)| > \varepsilon \right\}\right) \le 2e^{-2n\varepsilon^2}.$$

If we choose $n$ as described in Algorithm 0, then

$$P\left( \max_{1\le j\le m} |L_n(\theta_j) - L(\theta_j)| > \varepsilon \right) \le \sum_{j=1}^m P\left( |L_n(\theta_j) - L(\theta_j)| > \varepsilon \right) \le 2m\, e^{-2n\varepsilon^2} \le \delta/2.$$

Hence, with probability at least $1-\delta$, there exists a set $\Lambda \subseteq \Theta$ with $Q(\Lambda) \le \alpha$ such that (2) holds.

It may seem at first glance that Algorithm 0 is more efficient than Algorithm 1. However, this is not the case in high-dimensional situations: there, one can actually show that, with probability practically equal to one, the empirical minimum of $L$ can be much larger than the true minimum (cf. [KAA+00b] for a very vivid numerical illustration). This is an instance of the so-called Curse of Dimensionality, which adaptive schemes like Algorithm 1 can often avoid.
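For comparison, here is a sketch of Algorithm 0 under the same assumed interfaces as the earlier snippets; the sample sizes are fixed in advance by Hoeffding's inequality rather than chosen adaptively from the data.

```python
import numpy as np

def algorithm_0(eps, delta, alpha, sample_z, sample_theta, ell):
    m = int(np.ceil(np.log(2.0 / delta) / np.log(1.0 / (1.0 - alpha))))
    n = int(np.ceil(np.log(4.0 * m / delta) / (2.0 * eps**2)))
    thetas = [sample_theta() for _ in range(m)]   # theta_1, ..., theta_m ~ Q
    zs = [sample_z() for _ in range(n)]           # Z_1, ..., Z_n ~ P_Z
    L_n = [np.mean([ell(z, th) for z in zs]) for th in thetas]
    return thetas[int(np.argmin(L_n))]
```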

A Technical lemma

Lemma 4. Let $\{\xi_1(f) : f \in \mathcal{F}\}$ and $\{\xi_2(f) : f \in \mathcal{F}\}$ be two independent $\mathcal{F}$-indexed stochastic processes with

$$\|\xi_j\|_{\mathcal{F}} := \sup_{f\in\mathcal{F}} |\xi_j(f)| < \infty, \qquad j = 1,2.$$

Then, for all $t > 0$ and $c > 0$,

$$P\{\|\xi_1\|_{\mathcal{F}} \ge t + c\} \le \frac{P\{\|\xi_1 - \xi_2\|_{\mathcal{F}} \ge t\}}{\inf_{f\in\mathcal{F}} P\{|\xi_2(f)| \le c\}}. \tag{24}$$

Proof. If $\|\xi_1\|_{\mathcal{F}} \ge t + c$, then there exists some $f \in \mathcal{F}$ such that $|\xi_1(f)| \ge t + c$. For this particular $f$, the triangle inequality shows that

$$|\xi_2(f)| \le c \quad \Longrightarrow \quad |\xi_1(f) - \xi_2(f)| \ge t.$$

Therefore,

$$\inf_{f'\in\mathcal{F}} P_{\xi_2}\{|\xi_2(f')| \le c\} \le P_{\xi_2}\{|\xi_2(f)| \le c\} \le P_{\xi_2}\{|\xi_1(f) - \xi_2(f)| \ge t\} \le P_{\xi_2}\{\|\xi_1 - \xi_2\|_{\mathcal{F}} \ge t\}.$$

The leftmost and the rightmost terms in the above inequality do not depend on the particular $f$, and the inequality between them is valid on the event $\{\|\xi_1\|_{\mathcal{F}} \ge t + c\}$. Therefore, integrating the two sides w.r.t. $\xi_1$ over this event, we get

$$\inf_{f\in\mathcal{F}} P\{|\xi_2(f)| \le c\} \cdot P_{\xi_1}\{\|\xi_1\|_{\mathcal{F}} \ge t + c\} \le P_{\xi_1,\xi_2}\{\|\xi_1 - \xi_2\|_{\mathcal{F}} \ge t\}.$$

Rearranging, we get (24).

References

[KAA+00a] V. Koltchinskii, C. T. Abdallah, M. Ariola, P. Dorato, and D. Panchenko. Improved sample complexity estimates for statistical learning control of uncertain systems. IEEE Transactions on Automatic Control, 45(12):2383-2388, 2000.

[KAA+00b] V. Koltchinskii, C. T. Abdallah, M. Ariola, P. Dorato, and D. Panchenko. Statistical learning control of uncertain systems: it is better than it seems. Technical Report EECE-TR-00-001, University of New Mexico, April 2000.

[Vid98] M. Vidyasagar. Statistical learning theory and randomized algorithms for control. IEEE Control Systems Magazine, 18(6):162-190, 1998.

[Vid01] M. Vidyasagar. Randomized algorithms for robust controller synthesis using statistical learning theory. Automatica, 37:1515-1528, 2001.

[Vid03] M. Vidyasagar. Learning and Generalization. Springer, 2nd edition, 2003.