Case study: stochastic simulation via Rademacher bootstrap
Maxim Raginsky

December 4, 2013

In this lecture, we will look at an application of statistical learning theory to the problem of efficient stochastic simulation, which arises frequently in engineering design. The basic question is as follows. Suppose we have a system with input space $\mathsf{Z}$. The system has a tunable parameter $\theta$ that lies in some set $\Theta$. We have a performance index $\ell : \mathsf{Z} \times \Theta \to [0,1]$, where we assume that the lower the value of $\ell$, the better the performance. Thus, if we use the parameter setting $\theta \in \Theta$ and apply input $z \in \mathsf{Z}$, the performance of the corresponding system is given by the scalar $\ell(z,\theta) \in [0,1]$. Now let's suppose that the input to the system is actually a random variable $Z \in \mathsf{Z}$ with some distribution $P_Z \in \mathcal{P}(\mathsf{Z})$. Then we can define the operating characteristic

$$L(\theta) \triangleq \mathbf{E}_{P_Z}[\ell(Z,\theta)] = \int_{\mathsf{Z}} \ell(z,\theta)\, P_Z(dz), \qquad \theta \in \Theta. \tag{1}$$

The goal is to find an optimal operating point $\theta^* \in \Theta$ that achieves (or comes arbitrarily close to) $\inf_{\theta \in \Theta} L(\theta)$.

In practice, the problem of minimizing $L(\theta)$ is quite difficult for large-scale systems. First of all, computing the integral in (1) may be a challenge. Secondly, we may not even know the distribution $P_Z$. Thirdly, there may be more than one distribution of the input, each corresponding to a different operating regime and/or environment. For this reason, engineers often resort to Monte Carlo simulation techniques: assuming we can efficiently sample from $P_Z$, we draw a large number of independent samples $Z_1, Z_2, \ldots, Z_n$ and compute

$$\widehat{\theta}_n = \operatorname*{argmin}_{\theta \in \Theta} L_n(\theta) \triangleq \operatorname*{argmin}_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n \ell(Z_i,\theta),$$

where $L_n(\cdot)$ denotes the empirical version of the operating characteristic (1). Given an accuracy parameter $\varepsilon > 0$ and a confidence parameter $\delta \in (0,1)$, we simply need to draw enough samples so that

$$L(\widehat{\theta}_n) \le \inf_{\theta \in \Theta} L(\theta) + \varepsilon$$

with probability at least $1-\delta$, regardless of what the true distribution $P_Z$ happens to be. This is, of course, just another instance of the ERM algorithm we have been studying extensively.
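To make the Monte Carlo recipe above concrete, here is a minimal sketch. All specifics are hypothetical stand-ins chosen for illustration: a toy performance index $\ell(z,\theta) = \min\{(z-\theta)^2, 1\}$, a Gaussian input distribution playing the role of $P_Z$, and a grid of candidate parameters playing the role of $\Theta$.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(z, theta):
    # Toy performance index l(z, theta), clipped to take values in [0, 1].
    return np.minimum((z - theta) ** 2, 1.0)

def empirical_operating_characteristic(thetas, z):
    # L_n(theta) = (1/n) * sum_i l(Z_i, theta), for each candidate theta.
    return np.array([loss(z, th).mean() for th in thetas])

# Draw n i.i.d. inputs Z_1, ..., Z_n from a stand-in for P_Z.
z = rng.normal(loc=0.5, scale=0.2, size=5000)

# Minimize the empirical operating characteristic over a grid of parameters.
thetas = np.linspace(0.0, 1.0, 101)
L_n = empirical_operating_characteristic(thetas, z)
theta_hat = thetas[np.argmin(L_n)]
```

With this quadratic toy loss, the population minimizer is $\mathbf{E}[Z] = 0.5$, so for large $n$ the empirical minimizer `theta_hat` should land near $0.5$; how large $n$ must be for a given $(\varepsilon, \delta)$ is exactly the sample-complexity question studied below.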
However, there are two issues. One is how many samples we need to guarantee that the empirically optimal operating point will be good. The other is the complexity of actually computing an empirical minimizer. The first issue has already come up in the course under the name of sample complexity of learning. The second issue is often handled by relaxing the problem a bit: we choose a probability distribution $Q$ over $\Theta$ (assuming it can be equipped with an appropriate $\sigma$-algebra) and, instead of minimizing $L(\theta)$ over $\theta \in \Theta$, set some level parameter $\alpha \in (0,1)$ and seek any $\widehat{\theta} \in \Theta$ for which there exists some exceptional set $\Lambda \subset \Theta$ with $Q(\Lambda) \le \alpha$, such that

$$\inf_{\theta \in \Theta} L(\theta) - \varepsilon \le L(\widehat{\theta}) \le \inf_{\theta \in \Theta \setminus \Lambda} L(\theta) + \varepsilon \tag{2}$$

with probability at least $1-\delta$. Unless the actual optimal operating point $\theta^*$ happens to lie in the exceptional set $\Lambda$, we will come to within $\varepsilon$ of the optimum with confidence at least $1-\delta$. Then we just need to draw a large enough number $n$ of samples $Z_1,\ldots,Z_n$ from $P_Z$ and a large enough number $m$ of samples $\theta_1,\ldots,\theta_m$ from $Q$, and then compute

$$\widehat{\theta} = \operatorname*{argmin}_{\theta \in \{\theta_1,\ldots,\theta_m\}} L_n(\theta).$$

In the next several lectures, we will see how statistical learning theory can be used to develop such simulation procedures. Moreover, we will learn how to use Rademacher averages^1 to determine how many samples we need in the process of learning. The use of statistical learning theory for simulation was pioneered in the context of control by M. Vidyasagar [Vid98, Vid01]; the refinement of his techniques using Rademacher averages is due to Koltchinskii et al. [KAA+00a, KAA+00b]. We will essentially follow their presentation, but with slightly better constants.

The plan is as follows. First, we will revisit the abstract ERM problem and its sample complexity. Then we will introduce a couple of refined tools pertaining to Rademacher averages. Next, we will look at sequential algorithms for empirical approximation, in which the sample complexity is not set a priori, but is rather determined by a data-driven stopping rule.
And, finally, we will see how these sequential algorithms can be used to develop robust and efficient stochastic simulation strategies.

1 Empirical Risk Minimization: a quick review

Recall the abstract Empirical Risk Minimization problem: we have a space $\mathsf{Z}$, a class $\mathcal{P}$ of probability distributions over $\mathsf{Z}$, and a class $\mathcal{F}$ of measurable functions $f : \mathsf{Z} \to [0,1]$. Given an i.i.d. sample $Z^n$ drawn according to some unknown $P \in \mathcal{P}$, we compute

$$\widehat{f}_n \in \operatorname*{argmin}_{f \in \mathcal{F}} P_n(f) \triangleq \operatorname*{argmin}_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n f(Z_i).$$

^1 More precisely, their stochastic counterpart, in which we do not take the expectation over the Rademacher sequence, but rather use it as a resource to aid the simulation.
We would like $P(\widehat{f}_n)$ to be close to $\inf_{f \in \mathcal{F}} P(f)$ with high probability. To that end, we have derived the bound

$$P(\widehat{f}_n) - \inf_{f \in \mathcal{F}} P(f) \le 2 \|P_n - P\|_{\mathcal{F}},$$

where, as before, we have defined the uniform deviation

$$\|P_n - P\|_{\mathcal{F}} \triangleq \sup_{f \in \mathcal{F}} |P_n(f) - P(f)| = \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^n f(Z_i) - \mathbf{E}_P f(Z) \right|.$$

Hence, if $n$ is sufficiently large so that, for every $P \in \mathcal{P}$, $\|P_n - P\|_{\mathcal{F}} \le \varepsilon/2$ with $P$-probability at least $1-\delta$, then $P(\widehat{f}_n)$ will be $\varepsilon$-close to $\inf_{f \in \mathcal{F}} P(f)$ with probability at least $1-\delta$. This motivates the following definition:

Definition 1. Given the pair $(\mathcal{F}, \mathcal{P})$, an accuracy parameter $\varepsilon > 0$, and a confidence parameter $\delta \in (0,1)$, the sample complexity of empirical approximation is

$$N(\varepsilon;\delta) \triangleq \min\left\{ n \in \mathbb{N} : \sup_{P \in \mathcal{P}} P\{\|P_n - P\|_{\mathcal{F}} \ge \varepsilon\} \le \delta \right\}. \tag{3}$$

In other words, for any $\varepsilon > 0$ and any $\delta \in (0,1)$, $N(\varepsilon/2;\delta)$ is an upper bound on the number of samples needed to guarantee that $P(\widehat{f}_n) \le \inf_{f \in \mathcal{F}} P(f) + \varepsilon$ with probability (confidence) at least $1-\delta$.

2 Empirical Rademacher averages

As before, let $Z^n$ be an i.i.d. sample of length $n$ from some $P \in \mathcal{P}(\mathsf{Z})$. On multiple occasions we have seen that the performance of the ERM algorithm is controlled by the Rademacher average

$$R_n(\mathcal{F}(Z^n)) \triangleq \frac{1}{n} \mathbf{E}_{\sigma^n}\left[ \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^n \sigma_i f(Z_i) \right| \right], \tag{4}$$

where $\sigma^n = (\sigma_1, \ldots, \sigma_n)$ is an $n$-tuple of i.i.d. Rademacher random variables independent of $Z^n$. More precisely, we have established the fundamental symmetrization inequality

$$\mathbf{E}\|P_n - P\|_{\mathcal{F}} \le 2\, \mathbf{E} R_n(\mathcal{F}(Z^n)), \tag{5}$$

as well as the concentration bounds

$$P\{ \|P_n - P\|_{\mathcal{F}} \ge \mathbf{E}\|P_n - P\|_{\mathcal{F}} + \varepsilon \} \le e^{-2n\varepsilon^2} \tag{6}$$

$$P\{ \|P_n - P\|_{\mathcal{F}} \le \mathbf{E}\|P_n - P\|_{\mathcal{F}} - \varepsilon \} \le e^{-2n\varepsilon^2}. \tag{7}$$

These results show two things:

1. The uniform deviation $\|P_n - P\|_{\mathcal{F}}$ tightly concentrates around its expected value.
2. The expected value $\mathbf{E}\|P_n - P\|_{\mathcal{F}}$ is bounded from above by $\mathbf{E} R_n(\mathcal{F}(Z^n))$.

It turns out that the expected Rademacher average $\mathbf{E} R_n(\mathcal{F}(Z^n))$ also furnishes a lower bound on $\mathbf{E}\|P_n - P\|_{\mathcal{F}}$:

Lemma 1 (Desymmetrization inequality). For any class $\mathcal{F}$ of measurable functions $f : \mathsf{Z} \to [0,1]$, we have

$$\frac{1}{2}\, \mathbf{E} R_n(\mathcal{F}(Z^n)) - \frac{1}{2\sqrt{n}} \le \frac{1}{2n}\, \mathbf{E}\left[ \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^n \sigma_i [f(Z_i) - P(f)] \right| \right] \le \mathbf{E}\|P_n - P\|_{\mathcal{F}}. \tag{8}$$

Proof. We will first prove the second inequality in (8). To that end, for each $1 \le i \le n$ and each $f \in \mathcal{F}$, let us define $U_i(f) \triangleq f(Z_i) - P(f)$. Then $\mathbf{E} U_i(f) = 0$. Let $Z'_1, \ldots, Z'_n$ be an independent copy of $Z_1, \ldots, Z_n$. Then we can define $U'_i(f)$, $1 \le i \le n$, similarly. Moreover, since $\mathbf{E} U'_i(f) = 0$, we can write

$$\mathbf{E}\left[ \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^n \sigma_i [f(Z_i) - P(f)] \right| \right] = \mathbf{E}\left[ \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^n \sigma_i U_i(f) \right| \right] = \mathbf{E}\left[ \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^n \sigma_i [U_i(f) - \mathbf{E} U'_i(f)] \right| \right] \le \mathbf{E}\left[ \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^n \sigma_i [U_i(f) - U'_i(f)] \right| \right].$$

Since, for each $i$, $U_i(f)$ and $U'_i(f)$ are i.i.d., the difference $U_i(f) - U'_i(f)$ is a symmetric random variable. Therefore,

$$\left\{ \sigma_i [U_i(f) - U'_i(f)] : 1 \le i \le n \right\} \stackrel{(d)}{=} \left\{ U_i(f) - U'_i(f) : 1 \le i \le n \right\}.$$

Using this fact and the triangle inequality, we get

$$\mathbf{E}\left[ \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^n \sigma_i [U_i(f) - U'_i(f)] \right| \right] = \mathbf{E}\left[ \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^n [U_i(f) - U'_i(f)] \right| \right] \le 2\, \mathbf{E}\left[ \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^n U_i(f) \right| \right] = 2\, \mathbf{E}\left[ \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^n [f(Z_i) - P(f)] \right| \right] = 2n\, \mathbf{E}\|P_n - P\|_{\mathcal{F}}.$$
To prove the first inequality in (8), we write

$$\mathbf{E} R_n(\mathcal{F}(Z^n)) = \frac{1}{n}\, \mathbf{E}\left[ \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^n \sigma_i [f(Z_i) - P(f) + P(f)] \right| \right] \le \frac{1}{n}\, \mathbf{E}\left[ \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^n \sigma_i [f(Z_i) - P(f)] \right| \right] + \frac{1}{n}\, \mathbf{E}\left[ \sup_{f \in \mathcal{F}} |P(f)| \left| \sum_{i=1}^n \sigma_i \right| \right] \le \frac{1}{n}\, \mathbf{E}\left[ \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^n \sigma_i [f(Z_i) - P(f)] \right| \right] + \frac{1}{\sqrt{n}}.$$

Rearranging, we get the desired inequality.

In this section, we will see that we can get a lot of mileage out of the stochastic version of the Rademacher average. To that end, let us define

$$r_n(\mathcal{F}(Z^n)) \triangleq \frac{1}{n} \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^n \sigma_i f(Z_i) \right|. \tag{9}$$

The key difference between (4) and (9) is that, in the latter, we do not take the expectation over the Rademacher sequence $\sigma^n$. In other words, both $R_n(\mathcal{F}(Z^n))$ and $r_n(\mathcal{F}(Z^n))$ are random variables, but the former depends only on the training data $Z^n$, while the latter also depends on the $n$ Rademacher random variables $\sigma_1, \ldots, \sigma_n$. We see immediately that

$$R_n(\mathcal{F}(Z^n)) = \mathbf{E}[r_n(\mathcal{F}(Z^n)) \mid Z^n] \qquad \text{and} \qquad \mathbf{E} R_n(\mathcal{F}(Z^n)) = \mathbf{E}\, r_n(\mathcal{F}(Z^n)),$$

where the expectation on the right-hand side is over both $Z^n$ and $\sigma^n$. The following result will be useful:

Lemma 2 (Concentration inequalities for Rademacher averages). For any $\varepsilon > 0$,

$$P\{ r_n(\mathcal{F}(Z^n)) \ge \mathbf{E} R_n(\mathcal{F}(Z^n)) + \varepsilon \} \le e^{-n\varepsilon^2/2} \tag{10}$$

and

$$P\{ r_n(\mathcal{F}(Z^n)) \le \mathbf{E} R_n(\mathcal{F}(Z^n)) - \varepsilon \} \le e^{-n\varepsilon^2/2}. \tag{11}$$

Proof. For each $1 \le i \le n$, let $U_i \triangleq (Z_i, \sigma_i)$. Then $r_n(\mathcal{F}(Z^n))$ can be represented as a real-valued function $g(U^n)$. Moreover, it is easy to see that this function has bounded differences with $c_1 = \ldots = c_n = 2/n$. Hence, McDiarmid's inequality tells us that for any $\varepsilon > 0$

$$P\{ g(U^n) \ge \mathbf{E} g(U^n) + \varepsilon \} \le e^{-n\varepsilon^2/2},$$

and the same holds for the probability that $g(U^n) \le \mathbf{E} g(U^n) - \varepsilon$. This completes the proof.
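The stochastic Rademacher average (9) is cheap to compute from data: one needs only the function values on the sample and one draw of the sign sequence. Here is a hedged sketch for a small finite class; the class of threshold functions and the uniform input distribution are hypothetical stand-ins chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def stochastic_rademacher_average(values):
    """r_n(F(Z^n)) as in Eq. (9), for a finite class F = {f_1, ..., f_m}.

    `values` is an (m, n) array whose row j holds f_j(Z_1), ..., f_j(Z_n).
    A single fresh Rademacher sequence sigma_1, ..., sigma_n is drawn here;
    averaging over many such draws would estimate R_n(F(Z^n)) of Eq. (4).
    """
    m, n = values.shape
    sigma = rng.choice([-1.0, 1.0], size=n)   # i.i.d. Rademacher signs
    return np.abs(values @ sigma).max() / n   # (1/n) sup_f |sum_i sigma_i f(Z_i)|

# Hypothetical data: n draws of Z, and m threshold functions f_j(z) = 1{z <= t_j}.
n, m = 2000, 10
z = rng.uniform(size=n)
thresholds = np.linspace(0.1, 0.9, m)
values = (z[None, :] <= thresholds[:, None]).astype(float)

r = stochastic_rademacher_average(values)
```

By Lemma 2, this single realization `r` already sits within $\varepsilon$ of $\mathbf{E} R_n(\mathcal{F}(Z^n))$ with probability $1 - 2e^{-n\varepsilon^2/2}$, which is what makes it usable as a data-driven stopping statistic below.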
3 Sequential learning algorithms

In a sequential learning algorithm, the sample complexity is a random variable. It is not known in advance, but rather is computed from data in the process of learning. In other words, instead of using a training sequence of fixed length, we keep drawing independent samples until we decide that we have acquired enough of them, and then compute an empirical risk minimizer.

To formalize this idea, we need the notion of a stopping time. Let $\{U_n\}_{n=1}^\infty$ be a random process. A random variable $\tau$ taking values in $\mathbb{N}$ is called a stopping time if and only if, for each $n \ge 1$, the occurrence of the event $\{\tau = n\}$ is determined by $U^n = (U_1, \ldots, U_n)$. More precisely:

Definition 2. For each $n$, let $\Sigma_n$ denote the $\sigma$-algebra generated by $U^n$ (in other words, $\Sigma_n$ consists of all events that occur by time $n$). Then a random variable $\tau$ taking values in $\mathbb{N}$ is a stopping time if and only if, for each $n \ge 1$, the event $\{\tau = n\} \in \Sigma_n$.

In other words, denoting by $U^\infty$ the entire sample path $(U_1, U_2, \ldots)$ of our random process, we can view $\tau$ as a function that maps $U^\infty$ into $\mathbb{N}$. For each $n$, the indicator function of the event $\{\tau = n\}$ is a function of $U^\infty$: $\mathbf{1}_{\{\tau = n\}} \equiv \mathbf{1}_{\{\tau(U^\infty) = n\}}$. Then $\tau$ is a stopping time if and only if, for each $n$ and for all $U^\infty, V^\infty$ with $U^n = V^n$, we have $\mathbf{1}_{\{\tau(U^\infty) = n\}} = \mathbf{1}_{\{\tau(V^\infty) = n\}}$.

Our sequential learning algorithms will work as follows. Given a desired accuracy parameter $\varepsilon > 0$ and a confidence parameter $\delta > 0$, let $n(\varepsilon,\delta)$ be the initial sample size; we will assume that $n(\varepsilon,\delta)$ is a nonincreasing function of both $\varepsilon$ and $\delta$. Let $\mathcal{T}(\varepsilon,\delta)$ denote the set of all stopping times $\tau$ such that

$$\sup_{P \in \mathcal{P}} P\{ \|P_\tau - P\|_{\mathcal{F}} \ge \varepsilon \} \le \delta.$$

Now if $\tau \in \mathcal{T}(\varepsilon,\delta)$ and we let

$$\widehat{f}_\tau \in \operatorname*{argmin}_{f \in \mathcal{F}} P_\tau(f) \triangleq \operatorname*{argmin}_{f \in \mathcal{F}} \frac{1}{\tau} \sum_{i=1}^\tau f(Z_i),$$

then we immediately see that

$$\sup_{P \in \mathcal{P}} P\left\{ P(\widehat{f}_\tau) \ge \inf_{f \in \mathcal{F}} P(f) + 2\varepsilon \right\} \le \delta.$$

Of course, the whole question is how to construct an appropriate stopping time without knowing $P$.

Definition 3. A parametric family of stopping times $\{\nu(\varepsilon,\delta) : \varepsilon > 0,\ \delta \in (0,1)\}$ is called strongly efficient (SE) (w.r.t.
$\mathcal{F}$ and $\mathcal{P}$) if there exist constants $K_1, K_2, K_3 \ge 1$, such that for all $\varepsilon > 0$, $\delta \in (0,1)$, and for all $\tau \in \mathcal{T}(\varepsilon,\delta)$:

$$\nu(\varepsilon,\delta) \in \mathcal{T}(K_1 \varepsilon, \delta) \tag{12}$$

$$\sup_{P \in \mathcal{P}} P\{ \nu(K_2 \varepsilon, \delta) > \tau \} \le K_3 \delta. \tag{13}$$
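Before analyzing specific stopping times, here is a hedged toy sketch of the sequential idea: keep doubling the sample and stop once the stochastic Rademacher average (9) of the class falls below the accuracy threshold. The finite class of threshold functions, the uniform input distribution, and the initial sample size are all hypothetical stand-ins, and no efficiency guarantee is claimed for this particular toy setup.

```python
import numpy as np

rng = np.random.default_rng(2)

def stochastic_rademacher(values):
    # r_n(F(Z^n)) of Eq. (9), with rows of `values` holding f_j(Z_1..Z_n).
    m, n = values.shape
    sigma = rng.choice([-1.0, 1.0], size=n)
    return np.abs(values @ sigma).max() / n

def evaluate_class(z, thresholds):
    # f_j(z) = 1{z <= t_j}: an (m, n) array of function values.
    return (z[None, :] <= thresholds[:, None]).astype(float)

def sequential_sample_size(eps, n_init, thresholds, max_rounds=12):
    """Keep doubling the sample until r_n <= eps; return (n, r_n).

    The returned sample size is data-driven: whether we stop at a given
    stage depends only on the data drawn so far, as in Definition 2.
    """
    z = rng.uniform(size=n_init)
    for _ in range(max_rounds):
        r = stochastic_rademacher(evaluate_class(z, thresholds))
        if r <= eps:
            return len(z), r
        z = np.concatenate([z, rng.uniform(size=len(z))])  # double the sample
    return len(z), r

thresholds = np.linspace(0.1, 0.9, 10)
n_stop, r_stop = sequential_sample_size(eps=0.05, n_init=100, thresholds=thresholds)
```

The point of Definitions 3 and 4 below is to certify rules of exactly this shape: stopping when the observed $r_n$ is small must both guarantee accuracy (Eqs. (12)/(14)) and not consume many more samples than necessary (Eqs. (13)/(15)).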
In other words, Eq. (12) says that any SE stopping time $\nu(\varepsilon,\delta)$ guarantees that we can approximate statistical expectations by empirical expectations with accuracy $K_1 \varepsilon$ and confidence $1-\delta$; similarly, Eq. (13) says that, with probability at least $1 - K_3 \delta$, we will require at most as many samples as would be needed by any sequential algorithm for empirical approximation with accuracy $\varepsilon/K_2$ and confidence $1-\delta$.

Definition 4. A family of stopping times $\{\nu(\varepsilon,\delta) : \varepsilon > 0,\ \delta \in (0,1)\}$ is weakly efficient (WE) for $(\mathcal{F}, \mathcal{P})$ if there exist constants $K_1, K_2, K_3 \ge 1$, such that for all $\varepsilon > 0$ and $\delta \in (0,1)$:

$$\nu(\varepsilon,\delta) \in \mathcal{T}(K_1 \varepsilon, \delta) \tag{14}$$

$$\sup_{P \in \mathcal{P}} P\{ \nu(K_2 \varepsilon, \delta) > N(\varepsilon;\delta) \} \le K_3 \delta. \tag{15}$$

If $\nu(\varepsilon,\delta)$ is a WE stopping time, then Eq. (14) says that we can solve the empirical approximation problem with accuracy $K_1 \varepsilon$ and confidence $1-\delta$; Eq. (15) says that, with probability at least $1 - K_3 \delta$, the sample complexity will be at most the sample complexity of empirical approximation with accuracy $\varepsilon/K_2$ and confidence $1-\delta$. If $N(\varepsilon;\delta) \ge n(\varepsilon,\delta)$, then $N(\varepsilon;\delta) \in \mathcal{T}(\varepsilon,\delta)$. Hence, any WE stopping time is also SE. The converse, however, is not true.

3.1 A strongly efficient sequential learning algorithm

Let $\{Z_n\}_{n=1}^\infty$ be an infinite sequence of i.i.d. draws from some $P \in \mathcal{P}$; let $\{\sigma_n\}_{n=1}^\infty$ be an i.i.d. Rademacher sequence independent of $\{Z_n\}$. Choose

$$n(\varepsilon,\delta) \ge \frac{2}{\varepsilon^2} \log\left( \frac{2}{\delta\,(1 - e^{-\varepsilon^2/2})} \right) \tag{16}$$

and let

$$\nu(\varepsilon,\delta) \triangleq \min\{ n \ge n(\varepsilon,\delta) : r_n(\mathcal{F}(Z^n)) \le \varepsilon \}. \tag{17}$$

This is clearly a stopping time for each $\varepsilon > 0$ and each $\delta \in (0,1)$.

Theorem 1. The family $\{\nu(\varepsilon,\delta) : \varepsilon > 0,\ \delta \in (0,1)\}$ defined in (17), with $n(\varepsilon,\delta)$ set according to (16), is SE for any class $\mathcal{F}$ of measurable functions $f : \mathsf{Z} \to [0,1]$ and $\mathcal{P} = \mathcal{P}(\mathsf{Z})$, with $K_1 = 5$, $K_2 = 6$, $K_3 = 1$.

Proof. Let $\bar{n} = n(\varepsilon,\delta)$. We will first show that, for any $P \in \mathcal{P}(\mathsf{Z})$,

$$\|P_n - P\|_{\mathcal{F}} \le 2\, r_n(\mathcal{F}(Z^n)) + 3\varepsilon \qquad \forall n \ge \bar{n} \tag{18}$$

with probability at least $1-\delta$. Since for $n = \nu(\varepsilon,\delta) \ge \bar{n}$ we have $r_n(\mathcal{F}(Z^n)) \le \varepsilon$, we will immediately be able to conclude that

$$P\{ \|P_{\nu(\varepsilon,\delta)} - P\|_{\mathcal{F}} \ge 5\varepsilon \} \le \delta,$$
which will imply that $\nu(\varepsilon,\delta) \in \mathcal{T}(5\varepsilon,\delta)$.

Now we prove (18). First of all, applying Lemma 2 and the union bound, we can write

$$P\left\{ \bigcup_{n \ge \bar{n}} \left\{ r_n(\mathcal{F}(Z^n)) \le \mathbf{E} R_n(\mathcal{F}(Z^n)) - \varepsilon \right\} \right\} \le \sum_{n \ge \bar{n}} e^{-n\varepsilon^2/2} = e^{-\bar{n}\varepsilon^2/2} \sum_{n \ge 0} e^{-n\varepsilon^2/2} = \frac{e^{-\bar{n}\varepsilon^2/2}}{1 - e^{-\varepsilon^2/2}} \le \delta/2.$$

From the symmetrization inequality (5), we know that $\mathbf{E}\|P_n - P\|_{\mathcal{F}} \le 2\, \mathbf{E} R_n(\mathcal{F}(Z^n))$. Moreover, using (6) and the union bound, we can write

$$P\left\{ \bigcup_{n \ge \bar{n}} \{ \|P_n - P\|_{\mathcal{F}} \ge \mathbf{E}\|P_n - P\|_{\mathcal{F}} + \varepsilon \} \right\} \le \sum_{n \ge \bar{n}} e^{-2n\varepsilon^2} \le \sum_{n \ge \bar{n}} e^{-n\varepsilon^2/2} \le \delta/2.$$

Therefore, with probability at least $1-\delta$,

$$\|P_n - P\|_{\mathcal{F}} \le \mathbf{E}\|P_n - P\|_{\mathcal{F}} + \varepsilon \le 2\, \mathbf{E} R_n(\mathcal{F}(Z^n)) + \varepsilon \le 2\, r_n(\mathcal{F}(Z^n)) + 3\varepsilon \qquad \forall n \ge \bar{n},$$

which is (18). This shows that (12) holds for $\nu(\varepsilon,\delta)$ with $K_1 = 5$.

Next, we will prove that, for any $P \in \mathcal{P}(\mathsf{Z})$,

$$P\left\{ \min_{\bar{n} \le n < \nu(6\varepsilon,\delta)} \|P_n - P\|_{\mathcal{F}} < \varepsilon \right\} \le \delta. \tag{19}$$

In other words, (19) says that, with probability at least $1-\delta$, $\|P_n - P\|_{\mathcal{F}} \ge \varepsilon$ for all $\bar{n} \le n < \nu(6\varepsilon,\delta)$. This means that, for any $\tau \in \mathcal{T}(\varepsilon,\delta)$, $\nu(6\varepsilon,\delta) \le \tau$ with probability at least $1-\delta$, which will give us (13) with $K_2 = 6$ and $K_3 = 1$.

To prove (19), we have by (7) and the union bound that

$$P\left\{ \bigcup_{n \ge \bar{n}} \{ \|P_n - P\|_{\mathcal{F}} \le \mathbf{E}\|P_n - P\|_{\mathcal{F}} - \varepsilon \} \right\} \le \delta/2.$$

By the desymmetrization inequality (8), we have

$$\mathbf{E}\|P_n - P\|_{\mathcal{F}} \ge \frac{1}{2}\, \mathbf{E} R_n(\mathcal{F}(Z^n)) - \frac{1}{2\sqrt{n}} \qquad \forall n.$$

Finally, by the concentration inequality (10) and the union bound,

$$P\left\{ \bigcup_{n \ge \bar{n}} \left\{ r_n(\mathcal{F}(Z^n)) \ge \mathbf{E} R_n(\mathcal{F}(Z^n)) + \varepsilon \right\} \right\} \le \delta/2.$$
Therefore, with probability at least $1-\delta$,

$$\|P_n - P\|_{\mathcal{F}} \ge \frac{1}{2}\, r_n(\mathcal{F}(Z^n)) - \frac{1}{2\sqrt{n}} - \frac{3\varepsilon}{2} \qquad \forall n \ge \bar{n}.$$

If $\bar{n} \le n < \nu(6\varepsilon,\delta)$, then $r_n(\mathcal{F}(Z^n)) > 6\varepsilon$. Therefore, using the fact that $n \ge \bar{n}$ and $n(\varepsilon,\delta)^{-1/2} \le \varepsilon$, we see that, with probability at least $1-\delta$,

$$\|P_n - P\|_{\mathcal{F}} > 3\varepsilon - \frac{1}{2\sqrt{n}} - \frac{3\varepsilon}{2} \ge \varepsilon \qquad \forall\, \bar{n} \le n < \nu(6\varepsilon,\delta).$$

This proves (19), and we are done.

3.2 A weakly efficient sequential learning algorithm

Now choose

$$n(\varepsilon,\delta) \ge \frac{2}{\varepsilon^2} \log\frac{4}{\delta} + 1, \tag{20}$$

for each $k = 0, 1, 2, \ldots$ let $n_k \triangleq 2^k n(\varepsilon,\delta)$, and let

$$\nu(\varepsilon,\delta) \triangleq \min\{ n_k : r_{n_k}(\mathcal{F}(Z^{n_k})) \le \varepsilon \}. \tag{21}$$

Theorem 2. The family $\{\nu(\varepsilon,\delta) : \varepsilon > 0,\ \delta \in (0,1/2)\}$ defined in (21), with $n(\varepsilon,\delta)$ set according to (20), is WE for any class $\mathcal{F}$ of measurable functions $f : \mathsf{Z} \to [0,1]$ and $\mathcal{P} = \mathcal{P}(\mathsf{Z})$, with $K_1 = 5$, $K_2 = 18$, $K_3 = 3$.

Proof. As before, let $\bar{n} = n(\varepsilon,\delta)$. The proof of (14) is similar to what we have done in the proof of Theorem 1, except we use the bounds

$$P\left\{ \bigcup_{k=0}^\infty \left\{ r_{n_k}(\mathcal{F}(Z^{n_k})) \le \mathbf{E} R_{n_k}(\mathcal{F}(Z^{n_k})) - \varepsilon \right\} \right\} \le \sum_{k=0}^\infty e^{-2^k \bar{n} \varepsilon^2/2} = e^{-\bar{n}\varepsilon^2/2} + e^{-\bar{n}\varepsilon^2/2} \sum_{k=1}^\infty e^{-\bar{n}\varepsilon^2 (2^k - 1)/2} \le e^{-\bar{n}\varepsilon^2/2} + e^{-\bar{n}\varepsilon^2/2} \sum_{k=1}^\infty e^{-(2^k - 1)} \le e^{-\bar{n}\varepsilon^2/2} + e^{-\bar{n}\varepsilon^2/2} \sum_{k=1}^\infty e^{-k} \le 2 e^{-\bar{n}\varepsilon^2/2} \le \delta/2,$$

where in the third step we have used the fact that $\bar{n}\varepsilon^2/2 \ge 1$, and in the fourth the fact that $2^k - 1 \ge k$. Similarly,

$$P\left\{ \bigcup_{k=0}^\infty \left\{ \|P_{n_k} - P\|_{\mathcal{F}} \ge \mathbf{E}\|P_{n_k} - P\|_{\mathcal{F}} + \varepsilon \right\} \right\} \le \frac{\delta}{2}.$$
Therefore, with probability at least $1-\delta$,

$$\|P_{n_k} - P\|_{\mathcal{F}} \le 2\, r_{n_k}(\mathcal{F}(Z^{n_k})) + 3\varepsilon, \qquad k = 0, 1, 2, \ldots,$$

and consequently

$$P\{ \|P_{\nu(\varepsilon,\delta)} - P\|_{\mathcal{F}} \ge 5\varepsilon \} \le \delta,$$

which proves (14).

Now we prove (15). Let $N = N(\varepsilon;\delta)$, the sample complexity of empirical approximation that we have defined in (3). Let us choose $k$ so that $n_k \le N < n_{k+1}$, which is equivalent to $2^k \bar{n} \le N < 2^{k+1} \bar{n}$. Then

$$P\{ \nu(18\varepsilon,\delta) > N \} \le P\{ \nu(18\varepsilon,\delta) > n_k \}.$$

We will show that the probability on the right-hand side is at most $3\delta$. First of all, since $N \ge \bar{n}$ (by hypothesis), we have $n_k > N/2 \ge \bar{n}/2 \ge 1/\varepsilon^2$. Therefore, with probability at least $1-\delta$,

$$\|P_{n_k} - P\|_{\mathcal{F}} \ge \frac{1}{2}\, r_{n_k}(\mathcal{F}(Z^{n_k})) - \frac{1}{2\sqrt{n_k}} - \frac{9\varepsilon}{2} \ge \frac{1}{2}\, r_{n_k}(\mathcal{F}(Z^{n_k})) - 5\varepsilon. \tag{22}$$

If $\nu(18\varepsilon,\delta) > n_k$, then by definition $r_{n_k}(\mathcal{F}(Z^{n_k})) > 18\varepsilon$. Writing $r_{n_k} = r_{n_k}(\mathcal{F}(Z^{n_k}))$ for brevity, we get

$$P\{ \nu(18\varepsilon,\delta) > n_k \} \le P\{ r_{n_k} > 18\varepsilon \} = P\{ r_{n_k} > 18\varepsilon,\ \|P_{n_k} - P\|_{\mathcal{F}} \ge 4\varepsilon \} + P\{ r_{n_k} > 18\varepsilon,\ \|P_{n_k} - P\|_{\mathcal{F}} < 4\varepsilon \} \le P\{ \|P_{n_k} - P\|_{\mathcal{F}} \ge 4\varepsilon \} + P\{ r_{n_k} > 18\varepsilon,\ \|P_{n_k} - P\|_{\mathcal{F}} < 4\varepsilon \}.$$

If $r_{n_k} > 18\varepsilon$ but $\|P_{n_k} - P\|_{\mathcal{F}} < 4\varepsilon$, the event in (22) cannot occur. Indeed, suppose it does. Then it must be the case that $4\varepsilon > 9\varepsilon - 5\varepsilon = 4\varepsilon$, which is a contradiction. Therefore,

$$P\{ r_{n_k} > 18\varepsilon,\ \|P_{n_k} - P\|_{\mathcal{F}} < 4\varepsilon \} \le \delta,$$

and hence

$$P\{ \nu(18\varepsilon,\delta) > n_k \} \le P\{ \|P_{n_k} - P\|_{\mathcal{F}} \ge 4\varepsilon \} + \delta.$$

For each $f \in \mathcal{F}$ and each $n \in \mathbb{N}$, define

$$S_n(f) \triangleq \sum_{i=1}^n [f(Z_i) - P(f)],$$

and let $\|S_n\|_{\mathcal{F}} \triangleq \sup_{f \in \mathcal{F}} |S_n(f)|$. Then

$$P\{ \|P_{n_k} - P\|_{\mathcal{F}} \ge 4\varepsilon \} = P\{ \|S_{n_k}\|_{\mathcal{F}} \ge 4\varepsilon n_k \} \le P\{ \|S_{n_k}\|_{\mathcal{F}} \ge 2\varepsilon N \},$$

where we used $4\varepsilon n_k > 2\varepsilon N$ (recall $N < 2 n_k$). Since $n_k \le N$, the $\mathcal{F}$-indexed stochastic processes $S_{n_k}(f)$ and $S_N(f) - S_{n_k}(f)$ are independent. Therefore, we can use a technical result stated as Lemma 4 in the appendix, with $\xi_1 = S_{n_k}$ and $\xi_2 = S_N - S_{n_k}$, to write

$$P\{ \|S_{n_k}\|_{\mathcal{F}} \ge 2\varepsilon N \} \le \frac{ P\{ \|S_N\|_{\mathcal{F}} \ge \varepsilon N \} }{ \inf_{f \in \mathcal{F}} P\{ |S_N(f) - S_{n_k}(f)| \le \varepsilon N \} }.$$
By definition of $N = N(\varepsilon;\delta)$, the probability in the numerator is at most $\delta$. To analyze the probability in the denominator, we use Hoeffding's inequality to get

$$\inf_{f \in \mathcal{F}} P\{ |S_N(f) - S_{n_k}(f)| \le \varepsilon N \} = 1 - \sup_{f \in \mathcal{F}} P\{ |S_N(f) - S_{n_k}(f)| > \varepsilon N \} \ge 1 - 2 e^{-N\varepsilon^2/2} \ge 1 - \delta.$$

Therefore,

$$P\{ \nu(18\varepsilon,\delta) > n_k \} \le \frac{\delta}{1-\delta} + \delta \le 3\delta$$

for $\delta < 1/2$. Therefore, $\{\nu(\varepsilon,\delta) : \varepsilon > 0,\ \delta \in (0,1/2)\}$ is WE with $K_1 = 5$, $K_2 = 18$, $K_3 = 3$.

4 A sequential algorithm for stochastic simulation

Armed with these results on sequential learning algorithms, we can take up the question of constructing efficient simulation strategies. We fix an accuracy parameter $\varepsilon > 0$, a confidence parameter $\delta \in (0,1)$, and a level parameter $\alpha \in (0,1)$. Given two probability distributions, $P_Z$ on the input space $\mathsf{Z}$ and $Q$ on the parameter space $\Theta$, we draw a large i.i.d. sample $Z_1, \ldots, Z_n$ from $P_Z$ and a large i.i.d. sample $\theta_1, \ldots, \theta_m$ from $Q$. We then compute

$$\widehat{\theta} = \operatorname*{argmin}_{\theta \in \{\theta_1, \ldots, \theta_m\}} L_n(\theta), \qquad \text{where} \qquad L_n(\theta) \triangleq \frac{1}{n} \sum_{i=1}^n \ell(Z_i, \theta).$$

The goal is to pick $n$ and $m$ large enough so that, with probability at least $1-\delta$, $\widehat{\theta}$ is an $\varepsilon$-minimizer of $L$ to level $\alpha$, i.e., there exists some set $\Lambda \subset \Theta$ with $Q(\Lambda) \le \alpha$ such that Eq. (2) holds. To that end, consider the following algorithm based on Theorem 2, proposed by Koltchinskii et al. [KAA+00a, KAA+00b]:
Algorithm 1

choose positive integers $m$ and $n$ such that
$$m \ge \frac{\log(2/\delta)}{\log[1/(1-\alpha)]} \qquad \text{and} \qquad n \ge \frac{50}{\varepsilon^2} \log\frac{8}{\delta} + 1$$
draw $m$ independent samples $\theta_1, \ldots, \theta_m$ from $Q$
draw $n$ independent samples $Z_1, \ldots, Z_n$ from $P_Z$
evaluate the stopping variable
$$\gamma = \max_{1 \le j \le m} \left| \frac{1}{n} \sum_{i=1}^n \sigma_i\, \ell(Z_i, \theta_j) \right|,$$
where $\sigma_1, \ldots, \sigma_n$ are i.i.d. Rademacher r.v.'s independent of $\theta^m$ and $Z^n$
if $\gamma > \varepsilon/5$, then add $n$ more i.i.d. samples from $P_Z$ (doubling the current sample size) and repeat
else stop and output $\widehat{\theta} = \operatorname*{argmin}_{\theta \in \{\theta_1, \ldots, \theta_m\}} L_n(\theta)$

Then we claim that, with probability at least $1-\delta$, $\widehat{\theta}$ is an $\varepsilon$-minimizer of $L$ to level $\alpha$. To see this, we need the following result [Vid03, Lemma 11.1]:

Lemma 3. Let $Q$ be a probability distribution on the parameter set $\Theta$, and let $h : \Theta \to \mathbb{R}$ be a (measurable) real-valued function on $\Theta$, bounded from above, i.e., $h(\theta) < +\infty$ for all $\theta \in \Theta$. Let $\theta_1, \ldots, \theta_m$ be $m$ i.i.d. samples from $Q$, and let

$$h(\theta^m) \triangleq \max_{1 \le j \le m} h(\theta_j).$$

Then for any $\alpha \in (0,1)$,

$$Q\left(\left\{ \theta \in \Theta : h(\theta) > h(\theta^m) \right\}\right) \le \alpha \tag{23}$$

with probability at least $1 - (1-\alpha)^m$.

Proof. For each $c \in \mathbb{R}$, let $F(c) \triangleq Q(\{\theta \in \Theta : h(\theta) \le c\})$. Note that $F$ is the CDF of the random variable $\xi = h(\theta)$ with $\theta \sim Q$; therefore, it is right-continuous, i.e., $\lim_{c' \downarrow c} F(c') = F(c)$. Now define $c_\alpha \triangleq \inf\{c : F(c) \ge 1-\alpha\}$. Since $F$ is right-continuous, $F(c_\alpha) \ge 1-\alpha$. Moreover, if $c < c_\alpha$, then $F(c) < 1-\alpha$. Now let us suppose that $h(\theta^m) \ge c_\alpha$. Then, since $F$ is monotone nondecreasing,

$$Q\left(\left\{ \theta \in \Theta : h(\theta) \le h(\theta^m) \right\}\right) = F(h(\theta^m)) \ge F(c_\alpha) \ge 1-\alpha,$$

or, equivalently, if $h(\theta^m) \ge c_\alpha$, then

$$Q\left(\left\{ \theta \in \Theta : h(\theta) > h(\theta^m) \right\}\right) \le \alpha.$$

Therefore, if $\theta^m$ is such that

$$Q\left(\left\{ \theta \in \Theta : h(\theta) > h(\theta^m) \right\}\right) > \alpha,$$
then it must be the case that $h(\theta^m) < c_\alpha$, which in turn implies that $F(h(\theta^m)) < 1-\alpha$, the complement of the event in (23). But $h(\theta^m) < c_\alpha$ means that $h(\theta_j) < c_\alpha$ for every $1 \le j \le m$. Since the $\theta_j$'s are independent, the events $\{h(\theta_j) < c_\alpha\}$ are independent, and each occurs with probability at most $1-\alpha$. Therefore,

$$P\left(\left\{ \theta^m \in \Theta^m : Q\left(\left\{ \theta \in \Theta : h(\theta) > h(\theta^m) \right\}\right) > \alpha \right\}\right) \le (1-\alpha)^m,$$

which is what we intended to prove.

We apply this lemma to the function $h(\theta) = -L(\theta)$. Then, provided $m$ is chosen as described in Algorithm 1, we will have, with probability at least $1 - (1-\alpha)^m \ge 1 - \delta/2$,

$$Q\left(\left\{ \theta \in \Theta : L(\theta) < \min_{1 \le j \le m} L(\theta_j) \right\}\right) \le \alpha.$$

Now consider the finite class of functions $\mathcal{F} = \{ f_j(z) = \ell(z, \theta_j) : 1 \le j \le m \}$. By Theorem 2, the final output $\widehat{\theta} \in \{\theta_1, \ldots, \theta_m\}$ will satisfy

$$L(\widehat{\theta}) \le \min_{1 \le j \le m} L(\theta_j) + \varepsilon$$

with probability at least $1 - \delta/2$. Hence, with probability at least $1-\delta$, there exists a set $\Lambda \subset \Theta$ with $Q(\Lambda) \le \alpha$ such that (2) holds. Moreover, the total number of samples used up by Algorithm 1 will be, with probability at least $1 - 3\delta/2$, no more than

$$N_{\mathcal{F}, P_Z}(\varepsilon/18, \delta/2) \triangleq \min\{ n \in \mathbb{N} : P( \|P_n - P_Z\|_{\mathcal{F}} > \varepsilon/18 ) < \delta/2 \}.$$

We can estimate $N_{\mathcal{F}, P_Z}(\varepsilon/18, \delta/2)$ as follows. First of all, the function

$$\Delta(Z^n) \triangleq \|P_n - P_Z\|_{\mathcal{F}} \equiv \max_{1 \le j \le m} |P_n(f_j) - P_Z(f_j)|$$

has bounded differences with $c_1 = \ldots = c_n = 1/n$. Therefore, by McDiarmid's inequality,

$$P\left( \Delta(Z^n) \ge \mathbf{E}\Delta(Z^n) + t \right) \le e^{-2nt^2} \qquad \forall t > 0.$$

Secondly, since the class $\mathcal{F}$ is finite with $|\mathcal{F}| = m$, the symmetrization inequality (5) and the Finite Class Lemma give the bound

$$\mathbf{E}\|P_n - P_Z\|_{\mathcal{F}} \le 4\sqrt{\frac{\log m}{n}}.$$

Therefore, if we choose $t = \varepsilon/18 - 4\sqrt{\log m / n}$ and $n$ is large enough so that $t > \varepsilon/20$ (say), then $P( \|P_n - P_Z\|_{\mathcal{F}} > \varepsilon/18 ) \le e^{-n\varepsilon^2/200}$. Hence, a fairly conservative estimate is

$$N_{\mathcal{F}, P_Z}(\varepsilon/18, \delta/2) \le \max\left\{ \frac{200}{\varepsilon^2} \log\frac{2}{\delta} + 1,\ \left( \frac{720}{\varepsilon} \right)^2 \log m + 1 \right\}.$$

It is instructive to compare Algorithm 1 with a simple Monte Carlo strategy:
Algorithm 0

choose positive integers $m$ and $n$ such that
$$m \ge \frac{\log(2/\delta)}{\log[1/(1-\alpha)]} \qquad \text{and} \qquad n \ge \frac{1}{2\varepsilon^2} \log\frac{4m}{\delta}$$
draw $m$ independent samples $\theta_1, \ldots, \theta_m$ from $Q$
draw $n$ independent samples $Z_1, \ldots, Z_n$ from $P_Z$
for $j = 1$ to $m$, compute $L_n(\theta_j) = \frac{1}{n} \sum_{i=1}^n \ell(Z_i, \theta_j)$
output $\widehat{\theta} = \operatorname*{argmin}_{\theta \in \{\theta_1, \ldots, \theta_m\}} L_n(\theta)$

The selection of $m$ is guided by the same considerations as in Algorithm 1. Moreover, for each $1 \le j \le m$, $L_n(\theta_j)$ is an average of $n$ independent random variables $\ell(Z_i, \theta_j) \in [0,1]$, and $L(\theta_j) = \mathbf{E} L_n(\theta_j)$. Hence, Hoeffding's inequality says that

$$P\left(\left\{ Z^n \in \mathsf{Z}^n : |L_n(\theta_j) - L(\theta_j)| > \varepsilon \right\}\right) \le 2 e^{-2n\varepsilon^2}.$$

If we choose $n$ as described in Algorithm 0, then

$$P\left( \max_{1 \le j \le m} |L_n(\theta_j) - L(\theta_j)| > \varepsilon \right) \le \sum_{j=1}^m P\left( |L_n(\theta_j) - L(\theta_j)| > \varepsilon \right) \le 2m\, e^{-2n\varepsilon^2} \le \delta/2.$$

Hence, with probability at least $1-\delta$, there exists a set $\Lambda \subset \Theta$ with $Q(\Lambda) \le \alpha$ such that (2) holds.

It may seem at first glance that Algorithm 0 is more efficient than Algorithm 1. However, this is not the case in high-dimensional situations. There, one can actually show that, with probability practically equal to one, the empirical minimum of $L$ can be much larger than the true minimum (cf. [KAA+00b] for a very vivid numerical illustration). This is an instance of the so-called Curse of Dimensionality, which adaptive schemes like Algorithm 1 can often avoid.

A Technical lemma

Lemma 4. Let $\{\xi_1(f) : f \in \mathcal{F}\}$ and $\{\xi_2(f) : f \in \mathcal{F}\}$ be two independent $\mathcal{F}$-indexed stochastic processes with

$$\|\xi_j\|_{\mathcal{F}} \triangleq \sup_{f \in \mathcal{F}} |\xi_j(f)| < \infty, \qquad j = 1, 2.$$

Then for all $t > 0$, $c > 0$,

$$P\{ \|\xi_1\|_{\mathcal{F}} \ge t + c \} \le \frac{ P\{ \|\xi_1 - \xi_2\|_{\mathcal{F}} \ge t \} }{ \inf_{f \in \mathcal{F}} P\{ |\xi_2(f)| \le c \} }. \tag{24}$$
Proof. If $\|\xi_1\|_{\mathcal{F}} \ge t + c$, then there exists some $f^* \in \mathcal{F}$ such that $|\xi_1(f^*)| \ge t + c$. For this particular $f^*$, the triangle inequality shows that $|\xi_2(f^*)| \le c$ implies $|\xi_1(f^*) - \xi_2(f^*)| \ge t$. Therefore,

$$\inf_{f \in \mathcal{F}} P_{\xi_2}\{ |\xi_2(f)| \le c \} \le P_{\xi_2}\{ |\xi_2(f^*)| \le c \} \le P_{\xi_2}\{ |\xi_1(f^*) - \xi_2(f^*)| \ge t \} \le P_{\xi_2}\{ \|\xi_1 - \xi_2\|_{\mathcal{F}} \ge t \}.$$

The leftmost and the rightmost terms in the above inequality do not depend on the particular $f^*$, and the inequality between them is valid on the event $\{\|\xi_1\|_{\mathcal{F}} \ge t + c\}$. Therefore, integrating the two sides w.r.t. $\xi_1$ on this event, we get

$$\inf_{f \in \mathcal{F}} P_{\xi_2}\{ |\xi_2(f)| \le c \} \cdot P_{\xi_1}\{ \|\xi_1\|_{\mathcal{F}} \ge t + c \} \le P_{\xi_1, \xi_2}\{ \|\xi_1 - \xi_2\|_{\mathcal{F}} \ge t \}.$$

Rearranging, we get (24).

References

[KAA+00a] V. Koltchinskii, C. T. Abdallah, M. Ariola, P. Dorato, and D. Panchenko. Improved sample complexity estimates for statistical learning control of uncertain systems. IEEE Transactions on Automatic Control, 45(12), 2000.

[KAA+00b] V. Koltchinskii, C. T. Abdallah, M. Ariola, P. Dorato, and D. Panchenko. Statistical learning control of uncertain systems: it is better than it seems. Technical Report EECE-TR, University of New Mexico, April 2000.

[Vid98] M. Vidyasagar. Statistical learning theory and randomized algorithms for control. IEEE Control Systems Magazine, 18(6), 1998.

[Vid01] M. Vidyasagar. Randomized algorithms for robust controller synthesis using statistical learning theory. Automatica, 37, 2001.

[Vid03] M. Vidyasagar. Learning and Generalization. Springer, 2nd edition, 2003.
More informationUnderstanding Generalization Error: Bounds and Decompositions
CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the
More information5 Measure theory II. (or. lim. Prove the proposition. 5. For fixed F A and φ M define the restriction of φ on F by writing.
5 Measure theory II 1. Charges (signed measures). Let (Ω, A) be a σ -algebra. A map φ: A R is called a charge, (or signed measure or σ -additive set function) if φ = φ(a j ) (5.1) A j for any disjoint
More informationLecture 7 Introduction to Statistical Decision Theory
Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7
More informationMATH 51H Section 4. October 16, Recall what it means for a function between metric spaces to be continuous:
MATH 51H Section 4 October 16, 2015 1 Continuity Recall what it means for a function between metric spaces to be continuous: Definition. Let (X, d X ), (Y, d Y ) be metric spaces. A function f : X Y is
More information10.1 The Formal Model
67577 Intro. to Machine Learning Fall semester, 2008/9 Lecture 10: The Formal (PAC) Learning Model Lecturer: Amnon Shashua Scribe: Amnon Shashua 1 We have see so far algorithms that explicitly estimate
More informationOn the Complexity of Best Arm Identification with Fixed Confidence
On the Complexity of Best Arm Identification with Fixed Confidence Discrete Optimization with Noise Aurélien Garivier, Emilie Kaufmann COLT, June 23 th 2016, New York Institut de Mathématiques de Toulouse
More informationFoundations of Machine Learning
Introduction to ML Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu page 1 Logistics Prerequisites: basics in linear algebra, probability, and analysis of algorithms. Workload: about
More informationSequences of Real Numbers
Chapter 8 Sequences of Real Numbers In this chapter, we assume the existence of the ordered field of real numbers, though we do not yet discuss or use the completeness of the real numbers. In the next
More informationMachine Learning Basics Lecture 4: SVM I. Princeton University COS 495 Instructor: Yingyu Liang
Machine Learning Basics Lecture 4: SVM I Princeton University COS 495 Instructor: Yingyu Liang Review: machine learning basics Math formulation Given training data x i, y i : 1 i n i.i.d. from distribution
More informationStochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions
International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.
More information1 Review of The Learning Setting
COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #8 Scribe: Changyan Wang February 28, 208 Review of The Learning Setting Last class, we moved beyond the PAC model: in the PAC model we
More informationEcon Lecture 3. Outline. 1. Metric Spaces and Normed Spaces 2. Convergence of Sequences in Metric Spaces 3. Sequences in R and R n
Econ 204 2011 Lecture 3 Outline 1. Metric Spaces and Normed Spaces 2. Convergence of Sequences in Metric Spaces 3. Sequences in R and R n 1 Metric Spaces and Metrics Generalize distance and length notions
More informationActive Learning: Disagreement Coefficient
Advanced Course in Machine Learning Spring 2010 Active Learning: Disagreement Coefficient Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz In previous lectures we saw examples in which
More informationTopic 4 Notes Jeremy Orloff
Topic 4 Notes Jeremy Orloff 4 auchy s integral formula 4. Introduction auchy s theorem is a big theorem which we will use almost daily from here on out. Right away it will reveal a number of interesting
More informationLecture 5. If we interpret the index n 0 as time, then a Markov chain simply requires that the future depends only on the present and not on the past.
1 Markov chain: definition Lecture 5 Definition 1.1 Markov chain] A sequence of random variables (X n ) n 0 taking values in a measurable state space (S, S) is called a (discrete time) Markov chain, if
More informationLecture 3. Econ August 12
Lecture 3 Econ 2001 2015 August 12 Lecture 3 Outline 1 Metric and Metric Spaces 2 Norm and Normed Spaces 3 Sequences and Subsequences 4 Convergence 5 Monotone and Bounded Sequences Announcements: - Friday
More informationLECTURE-15 : LOGARITHMS AND COMPLEX POWERS
LECTURE-5 : LOGARITHMS AND COMPLEX POWERS VED V. DATAR The purpose of this lecture is twofold - first, to characterize domains on which a holomorphic logarithm can be defined, and second, to show that
More informationCalculating credit risk capital charges with the one-factor model
Calculating credit risk capital charges with the one-factor model Susanne Emmer Dirk Tasche September 15, 2003 Abstract Even in the simple Vasicek one-factor credit portfolio model, the exact contributions
More informationLecture 1. Stochastic Optimization: Introduction. January 8, 2018
Lecture 1 Stochastic Optimization: Introduction January 8, 2018 Optimization Concerned with mininmization/maximization of mathematical functions Often subject to constraints Euler (1707-1783): Nothing
More informationUpper Bounds on the Time and Space Complexity of Optimizing Additively Separable Functions
Upper Bounds on the Time and Space Complexity of Optimizing Additively Separable Functions Matthew J. Streeter Computer Science Department and Center for the Neural Basis of Cognition Carnegie Mellon University
More informationMATH 117 LECTURE NOTES
MATH 117 LECTURE NOTES XIN ZHOU Abstract. This is the set of lecture notes for Math 117 during Fall quarter of 2017 at UC Santa Barbara. The lectures follow closely the textbook [1]. Contents 1. The set
More informationDistirbutional robustness, regularizing variance, and adversaries
Distirbutional robustness, regularizing variance, and adversaries John Duchi Based on joint work with Hongseok Namkoong and Aman Sinha Stanford University November 2017 Motivation We do not want machine-learned
More informationStatistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation
Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider
More informationStatistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 2: Introduction to statistical learning theory. 1 / 22 Goals of statistical learning theory SLT aims at studying the performance of
More informationSolutions Manual for Homework Sets Math 401. Dr Vignon S. Oussa
1 Solutions Manual for Homework Sets Math 401 Dr Vignon S. Oussa Solutions Homework Set 0 Math 401 Fall 2015 1. (Direct Proof) Assume that x and y are odd integers. Then there exist integers u and v such
More informationPenalized Squared Error and Likelihood: Risk Bounds and Fast Algorithms
university-logo Penalized Squared Error and Likelihood: Risk Bounds and Fast Algorithms Andrew Barron Cong Huang Xi Luo Department of Statistics Yale University 2008 Workshop on Sparsity in High Dimensional
More informationStochastic Dynamic Programming: The One Sector Growth Model
Stochastic Dynamic Programming: The One Sector Growth Model Esteban Rossi-Hansberg Princeton University March 26, 2012 Esteban Rossi-Hansberg () Stochastic Dynamic Programming March 26, 2012 1 / 31 References
More informationStochastic bandits: Explore-First and UCB
CSE599s, Spring 2014, Online Learning Lecture 15-2/19/2014 Stochastic bandits: Explore-First and UCB Lecturer: Brendan McMahan or Ofer Dekel Scribe: Javad Hosseini In this lecture, we like to answer this
More informationRobustness and duality of maximum entropy and exponential family distributions
Chapter 7 Robustness and duality of maximum entropy and exponential family distributions In this lecture, we continue our study of exponential families, but now we investigate their properties in somewhat
More informationHomework 4, 5, 6 Solutions. > 0, and so a n 0 = n + 1 n = ( n+1 n)( n+1+ n) 1 if n is odd 1/n if n is even diverges.
2..2(a) lim a n = 0. Homework 4, 5, 6 Solutions Proof. Let ɛ > 0. Then for n n = 2+ 2ɛ we have 2n 3 4+ ɛ 3 > ɛ > 0, so 0 < 2n 3 < ɛ, and thus a n 0 = 2n 3 < ɛ. 2..2(g) lim ( n + n) = 0. Proof. Let ɛ >
More informationStatistical Approaches to Learning and Discovery. Week 4: Decision Theory and Risk Minimization. February 3, 2003
Statistical Approaches to Learning and Discovery Week 4: Decision Theory and Risk Minimization February 3, 2003 Recall From Last Time Bayesian expected loss is ρ(π, a) = E π [L(θ, a)] = L(θ, a) df π (θ)
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 17: Stochastic Optimization Part II: Realizable vs Agnostic Rates Part III: Nearest Neighbor Classification Stochastic
More informationEfficient Implementation of Approximate Linear Programming
2.997 Decision-Making in Large-Scale Systems April 12 MI, Spring 2004 Handout #21 Lecture Note 18 1 Efficient Implementation of Approximate Linear Programming While the ALP may involve only a small number
More informationProving languages to be nonregular
Proving languages to be nonregular We already know that there exist languages A Σ that are nonregular, for any choice of an alphabet Σ. This is because there are uncountably many languages in total and
More informationScalar multiplication and addition of sequences 9
8 Sequences 1.2.7. Proposition. Every subsequence of a convergent sequence (a n ) n N converges to lim n a n. Proof. If (a nk ) k N is a subsequence of (a n ) n N, then n k k for every k. Hence if ε >
More informationIntroduction to Real Analysis Alternative Chapter 1
Christopher Heil Introduction to Real Analysis Alternative Chapter 1 A Primer on Norms and Banach Spaces Last Updated: March 10, 2018 c 2018 by Christopher Heil Chapter 1 A Primer on Norms and Banach Spaces
More informationMachine Learning. Model Selection and Validation. Fabio Vandin November 7, 2017
Machine Learning Model Selection and Validation Fabio Vandin November 7, 2017 1 Model Selection When we have to solve a machine learning task: there are different algorithms/classes algorithms have parameters
More information2. The Concept of Convergence: Ultrafilters and Nets
2. The Concept of Convergence: Ultrafilters and Nets NOTE: AS OF 2008, SOME OF THIS STUFF IS A BIT OUT- DATED AND HAS A FEW TYPOS. I WILL REVISE THIS MATE- RIAL SOMETIME. In this lecture we discuss two
More informationCombinatorics in Banach space theory Lecture 12
Combinatorics in Banach space theory Lecture The next lemma considerably strengthens the assertion of Lemma.6(b). Lemma.9. For every Banach space X and any n N, either all the numbers n b n (X), c n (X)
More informationLecture 35: December The fundamental statistical distances
36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose
More informationCHAPTER 8: EXPLORING R
CHAPTER 8: EXPLORING R LECTURE NOTES FOR MATH 378 (CSUSM, SPRING 2009). WAYNE AITKEN In the previous chapter we discussed the need for a complete ordered field. The field Q is not complete, so we constructed
More informationLecture 4: Graph Limits and Graphons
Lecture 4: Graph Limits and Graphons 36-781, Fall 2016 3 November 2016 Abstract Contents 1 Reprise: Convergence of Dense Graph Sequences 1 2 Calculating Homomorphism Densities 3 3 Graphons 4 4 The Cut
More informationHypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3
Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest
More informationTail bound inequalities and empirical likelihood for the mean
Tail bound inequalities and empirical likelihood for the mean Sandra Vucane 1 1 University of Latvia, Riga 29 th of September, 2011 Sandra Vucane (LU) Tail bound inequalities and EL for the mean 29.09.2011
More informationStratégies bayésiennes et fréquentistes dans un modèle de bandit
Stratégies bayésiennes et fréquentistes dans un modèle de bandit thèse effectuée à Telecom ParisTech, co-dirigée par Olivier Cappé, Aurélien Garivier et Rémi Munos Journées MAS, Grenoble, 30 août 2016
More informationPart 2 Continuous functions and their properties
Part 2 Continuous functions and their properties 2.1 Definition Definition A function f is continuous at a R if, and only if, that is lim f (x) = f (a), x a ε > 0, δ > 0, x, x a < δ f (x) f (a) < ε. Notice
More informationMAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9
MAT 570 REAL ANALYSIS LECTURE NOTES PROFESSOR: JOHN QUIGG SEMESTER: FALL 204 Contents. Sets 2 2. Functions 5 3. Countability 7 4. Axiom of choice 8 5. Equivalence relations 9 6. Real numbers 9 7. Extended
More informationLecture 3 January 28
EECS 28B / STAT 24B: Advanced Topics in Statistical LearningSpring 2009 Lecture 3 January 28 Lecturer: Pradeep Ravikumar Scribe: Timothy J. Wheeler Note: These lecture notes are still rough, and have only
More informationCONSIDER a measurable space and a probability
1682 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL 46, NO 11, NOVEMBER 2001 Learning With Prior Information M C Campi and M Vidyasagar, Fellow, IEEE Abstract In this paper, a new notion of learnability is
More informationGeneralized Neyman Pearson optimality of empirical likelihood for testing parameter hypotheses
Ann Inst Stat Math (2009) 61:773 787 DOI 10.1007/s10463-008-0172-6 Generalized Neyman Pearson optimality of empirical likelihood for testing parameter hypotheses Taisuke Otsu Received: 1 June 2007 / Revised:
More informationLecture notes for Analysis of Algorithms : Markov decision processes
Lecture notes for Analysis of Algorithms : Markov decision processes Lecturer: Thomas Dueholm Hansen June 6, 013 Abstract We give an introduction to infinite-horizon Markov decision processes (MDPs) with
More information1 Stat 605. Homework I. Due Feb. 1, 2011
The first part is homework which you need to turn in. The second part is exercises that will not be graded, but you need to turn it in together with the take-home final exam. 1 Stat 605. Homework I. Due
More informationComputational and Statistical Learning theory
Computational and Statistical Learning theory Problem set 2 Due: January 31st Email solutions to : karthik at ttic dot edu Notation : Input space : X Label space : Y = {±1} Sample : (x 1, y 1,..., (x n,
More informationLecture 21. Hypothesis Testing II
Lecture 21. Hypothesis Testing II December 7, 2011 In the previous lecture, we dened a few key concepts of hypothesis testing and introduced the framework for parametric hypothesis testing. In the parametric
More information40.530: Statistics. Professor Chen Zehua. Singapore University of Design and Technology
Singapore University of Design and Technology Lecture 9: Hypothesis testing, uniformly most powerful tests. The Neyman-Pearson framework Let P be the family of distributions of concern. The Neyman-Pearson
More informationAn asymptotic ratio characterization of input-to-state stability
1 An asymptotic ratio characterization of input-to-state stability Daniel Liberzon and Hyungbo Shim Abstract For continuous-time nonlinear systems with inputs, we introduce the notion of an asymptotic
More informationQuantifying Stochastic Model Errors via Robust Optimization
Quantifying Stochastic Model Errors via Robust Optimization IPAM Workshop on Uncertainty Quantification for Multiscale Stochastic Systems and Applications Jan 19, 2016 Henry Lam Industrial & Operations
More informationOn A-distance and Relative A-distance
1 ADAPTIVE COMMUNICATIONS AND SIGNAL PROCESSING LABORATORY CORNELL UNIVERSITY, ITHACA, NY 14853 On A-distance and Relative A-distance Ting He and Lang Tong Technical Report No. ACSP-TR-08-04-0 August 004
More informationMachine Learning. Lecture 9: Learning Theory. Feng Li.
Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell
More informationStatistical inference
Statistical inference Contents 1. Main definitions 2. Estimation 3. Testing L. Trapani MSc Induction - Statistical inference 1 1 Introduction: definition and preliminary theory In this chapter, we shall
More informationNonparametric one-sided testing for the mean and related extremum problems
Nonparametric one-sided testing for the mean and related extremum problems Norbert Gaffke University of Magdeburg, Faculty of Mathematics D-39016 Magdeburg, PF 4120, Germany E-mail: norbert.gaffke@mathematik.uni-magdeburg.de
More informationarxiv:math/ v1 [math.fa] 31 Mar 1994
arxiv:math/9403211v1 [math.fa] 31 Mar 1994 New proofs of Rosenthal s l 1 theorem and the Josefson Nissenzweig theorem Ehrhard Behrends Abstract. We give elementary proofs of the theorems mentioned in the
More informationWe are now going to go back to the concept of sequences, and look at some properties of sequences in R
4 Lecture 4 4. Real Sequences We are now going to go back to the concept of sequences, and look at some properties of sequences in R Definition 3 A real sequence is increasing if + for all, and strictly
More informationWe are going to discuss what it means for a sequence to converge in three stages: First, we define what it means for a sequence to converge to zero
Chapter Limits of Sequences Calculus Student: lim s n = 0 means the s n are getting closer and closer to zero but never gets there. Instructor: ARGHHHHH! Exercise. Think of a better response for the instructor.
More information1 More finite deterministic automata
CS 125 Section #6 Finite automata October 18, 2016 1 More finite deterministic automata Exercise. Consider the following game with two players: Repeatedly flip a coin. On heads, player 1 gets a point.
More informationChapter 11 - Sequences and Series
Calculus and Analytic Geometry II Chapter - Sequences and Series. Sequences Definition. A sequence is a list of numbers written in a definite order, We call a n the general term of the sequence. {a, a
More information
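The Monte Carlo ERM procedure described above can be sketched in code for the simplest case of a finite parameter grid. The sketch below uses the elementary Hoeffding-plus-union-bound sample size rather than the Rademacher bootstrap that is the subject of this lecture; the function and variable names (`erm_sample_size`, `monte_carlo_erm`, the quadratic toy loss) are illustrative choices, not part of the lecture.

```python
import math
import random

def erm_sample_size(num_thetas, eps, delta):
    """Hoeffding + union bound over a finite grid of num_thetas parameters:
    with this many i.i.d. samples, every empirical mean L_n(theta) is within
    eps of L(theta) with probability at least 1 - delta, so the empirical
    minimizer is 2*eps-optimal."""
    return math.ceil(math.log(2 * num_thetas / delta) / (2 * eps ** 2))

def monte_carlo_erm(loss, thetas, sample_input, n):
    """Draw n independent inputs and return the parameter minimizing the
    empirical operating characteristic L_n(theta) = (1/n) sum_i loss(z_i, theta)."""
    zs = [sample_input() for _ in range(n)]
    def L_n(theta):
        return sum(loss(z, theta) for z in zs) / n
    return min(thetas, key=L_n)

# Toy example: performance index l(z, theta) = min((z - theta)^2, 1),
# inputs uniform on [0, 1]; the optimal operating point is near E[Z] = 0.5.
random.seed(0)
thetas = [i / 20 for i in range(21)]
n = erm_sample_size(len(thetas), eps=0.05, delta=0.01)
theta_hat = monte_carlo_erm(
    lambda z, t: min((z - t) ** 2, 1.0), thetas, random.random, n)
```

Note that this union-bound sample size grows with the log of the grid size and is distribution-free, exactly the "regardless of what the true distribution P_Z happens to be" guarantee stated above; the Rademacher bootstrap refines it by estimating the complexity term from the data itself.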