EMPIRICAL PROCESS THEORY AND APPLICATIONS. Sara van de Geer. Handout WS 2006 ETH Zürich


Contents

Preface
1 Introduction
  1.1 Law of large numbers for real-valued random variables
  1.2 R^d-valued random variables
  1.3 Definition of Glivenko-Cantelli classes of sets
  1.4 Convergence of averages to their expectations
2 Exponential probability inequalities
  2.1 Chebyshev's inequality
  2.2 Bernstein's inequality
  2.3 Hoeffding's inequality
  2.4 Exercise
3 Symmetrization
  3.1 Symmetrization with means
  3.2 Symmetrization with probabilities
  3.3 Some facts about conditional expectations
4 Uniform laws of large numbers
  4.1 Classes of functions
  4.2 Classes of sets
  4.3 Vapnik-Chervonenkis classes
  4.4 VC graph classes of functions
  4.5 Exercises
5 M-estimators
  5.1 What is an M-estimator?
  5.2 Consistency
  5.3 Exercises
6 Uniform central limit theorems
  6.1 Real-valued random variables
  6.2 R^d-valued random variables
  6.3 Donsker's Theorem
  6.4 Donsker classes
  6.5 Chaining and the increments of empirical processes
7 Asymptotic normality of M-estimators
  7.1 Asymptotic linearity
  7.2 Conditions a, b and c for asymptotic normality
  7.3 Asymptotics for the median
  7.4 Conditions A, B and C for asymptotic normality
  7.5 Exercises
8 Rates of convergence for least squares estimators
  8.1 Gaussian errors
  8.2 Rates of convergence
  8.3 Examples
  8.4 Exercises
9 Penalized least squares
  9.1 Estimation and approximation error
  9.2 Finite models
  9.3 Nested finite models
  9.4 General penalties
  9.5 Application to a classical penalty
  9.6 Exercise

Preface

This preface motivates why, from a statistician's point of view, it is interesting to study empirical processes. We indicate that any estimator is some function of the empirical measure. In these lectures, we study convergence of the empirical measure as the sample size increases.

In the simplest case, a data set consists of observations on a single variable, say real-valued observations. Suppose there are $n$ such observations, denoted by $X_1, \ldots, X_n$. For example, $X_i$ could be the reaction time of individual $i$ to a given stimulus, or the number of car accidents on day $i$, etc. Suppose now that each observation follows the same probability law $P$. This means that the observations are relevant if one wants to predict the value of a new observation $X$ (say, the reaction time of a hypothetical new subject, or the number of car accidents on a future day, etc.). Thus, a common underlying distribution $P$ allows one to generalize the outcomes.

An estimator is any given function $T_n(X_1, \ldots, X_n)$ of the data. Let us review some common estimators.

The empirical distribution. The unknown $P$ can be estimated from the data in the following way. Suppose first that we are interested in the probability that an observation falls in $A$, where $A$ is a certain set chosen by the researcher. We denote this probability by $P(A)$. From the frequentist point of view, the probability of an event is nothing else than the limit of relative frequencies of occurrences of that event, as the number of occasions of possible occurrences $n$ grows without limit. So it is natural to estimate $P(A)$ with the frequency of $A$, i.e., with
$$P_n(A) = \frac{\text{number of times an observation } X_i \text{ falls in } A}{\text{total number of observations}} = \frac{\#\{X_i \in A\}}{n}.$$
We now define the empirical measure $P_n$ as the probability law that assigns to a set $A$ the probability $P_n(A)$. We regard $P_n$ as an estimator of the unknown $P$.

The empirical distribution function. The distribution function of $X$ is defined as
$$F(x) = P(X \le x),$$
and the empirical distribution function is
$$\hat F_n(x) = \frac{\#\{X_i \le x\}}{n}.$$

[Figure 1: the distribution function $F(x) = 1 - 1/x^2$, $x \ge 1$ (smooth curve), and the empirical distribution function $\hat F_n$ (step function) of a sample from $F$ with sample size $n = 200$.]
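As a quick numerical illustration (a Python sketch, not part of the handout; NumPy is assumed, and the seed and evaluation points are arbitrary choices), we can compute $P_n(A)$ and $\hat F_n$ for a sample from the Pareto distribution of Figure 1:

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed, for reproducibility
n = 200
# Sample from F(x) = 1 - 1/x^2, x >= 1, by inverse transform:
# X = (1 - U)^(-1/2) for U uniform on [0, 1).
X = (1.0 - rng.uniform(size=n)) ** (-0.5)

def P_n(A, X):
    """Empirical measure of a set A: fraction of observations falling in A."""
    return np.mean(A(X))

def F_hat(x, X):
    """Empirical distribution function at x: fraction of X_i <= x."""
    return np.mean(X <= x)

# P_n of A = (1, 2] versus the true probability P(A) = F(2) - F(1) = 3/4.
print(P_n(lambda x: (x > 1) & (x <= 2), X), 0.75)
print(F_hat(1.5, X), 1 - 1 / 1.5**2)    # empirical vs. true F at x = 1.5
```

Both printed pairs should agree up to sampling error of order $1/\sqrt n$.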

Means and averages. The theoretical mean $\mu := EX$ ($E$ stands for expectation) can be estimated by the sample average
$$\bar X_n := \frac{X_1 + \cdots + X_n}{n}.$$
More generally, let $g$ be a real-valued function on $\mathbb R$. Then
$$\frac{g(X_1) + \cdots + g(X_n)}{n}$$
is an estimator of $Eg(X)$.

Sample median. The median of $X$ is the value $m$ that satisfies $F(m) = 1/2$ (assuming there is a unique solution). Its empirical version is any value $\hat m_n$ such that $\hat F_n(\hat m_n)$ is equal, or as close as possible, to $1/2$. In the above example $F(x) = 1 - 1/x^2$, so the theoretical median is $m = \sqrt 2 = 1.414\ldots$. In the ordered sample, the 100th observation equals 1.466 and the 101st observation equals 1.492. A common choice for the sample median is to take the average of these two values. This gives $\hat m_n = 1.479$.

Properties of estimators. Let $T_n = T_n(X_1, \ldots, X_n)$ be an estimator of the real-valued parameter $\theta$. Then it is desirable that $T_n$ is in some sense close to $\theta$. A minimum requirement is that the estimator approaches $\theta$ as the sample size increases. This is called consistency. To be more precise, suppose the sample $X_1, \ldots, X_n$ are the first $n$ of an infinite sequence $X_1, X_2, \ldots$ of independent copies of $X$. Then $T_n$ is called strongly consistent if, with probability one, $T_n \to \theta$ as $n \to \infty$. Note that consistency of frequencies as estimators of probabilities, or of means as estimators of expectations, follows from the strong law of large numbers. In general, an estimator $T_n$ can be a complicated function of the data. In that case, it is helpful to know that the convergence of means to their expectations is uniform over a class. The latter is a major topic in empirical process theory.

Parametric models. The distribution $P$ may be partly known beforehand. The unknown parts of $P$ are called parameters of the model. For example, if the $X_i$ are yes/no answers to a certain question (the binary case), we know that $P$ allows only two possibilities, say 1 and 0 (yes = 1, no = 0). There is only one parameter, say the probability of a yes answer $\theta = P(X = 1)$. More generally, in a parametric model, it is assumed that $P$ is known up to a finite number of parameters $\theta = (\theta_1, \ldots, \theta_d)$. We then often write $P = P_\theta$. When there are infinitely many parameters (which is for example the case when $P$ is completely unknown), the model is called nonparametric.

Nonparametric models. An example of a nonparametric model is one where it is assumed that the density $f$ of the distribution function $F$ exists, but all one assumes about it is some kind of smoothness (e.g., the continuous first derivative of $f$ exists). In that case, one may propose e.g. to use the histogram as estimator of $f$. This is an example of a nonparametric estimator.

Histograms. Our aim is estimating the density $f(x)$ at a given point $x$. The density is defined as the derivative of the distribution function $F$ at $x$:
$$f(x) = \lim_{h \downarrow 0} \frac{F(x+h) - F(x)}{h} = \lim_{h \downarrow 0} \frac{P(x, x+h]}{h}.$$
Here, $(x, x+h]$ is the interval with left endpoint $x$ (not included) and right endpoint $x+h$ (included). Unfortunately, replacing $P$ by $P_n$ here does not work, since for $h$ small enough, $P_n(x, x+h]$ will be equal to zero. Therefore, instead of taking the limit as $h \downarrow 0$, we fix $h$ at a small positive value, called the bandwidth. The estimator of $f(x)$ thus becomes
$$\hat f_n(x) = \frac{P_n(x, x+h]}{h} = \frac{\#\{X_i \in (x, x+h]\}}{nh}.$$
A plot of this estimator at the points $x \in \{x_0, x_0 + h, x_0 + 2h, \ldots\}$ is called a histogram.
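The histogram estimator is easy to compute directly. The following sketch (again not from the handout; the seed and bin choices are arbitrary) evaluates $\hat f_n$ on the Pareto sample, whose true density is $f(x) = 2/x^3$:

```python
import numpy as np

rng = np.random.default_rng(1)                      # arbitrary seed
n, h = 200, 0.5                                     # sample size and bandwidth as in the example
X = (1.0 - rng.uniform(size=n)) ** (-0.5)           # Pareto sample, F(x) = 1 - 1/x^2

def f_hat(x, X, h):
    """Histogram estimator: fraction of X_i in (x, x + h], divided by h."""
    return np.sum((X > x) & (X <= x + h)) / (len(X) * h)

x0 = 1.0                                            # left endpoint of the first bin
for j in range(6):                                  # bins (x0, x0+h], (x0+h, x0+2h], ...
    x = x0 + j * h
    true_f = 2.0 / (x + h / 2) ** 3                 # f(x) = 2/x^3 at the bin midpoint
    print(f"bin ({x:.1f}, {x + h:.1f}]: f_hat = {f_hat(x, X, h):.3f}, "
          f"f(midpoint) = {true_f:.3f}")
```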

Example. [Figure 2: the histogram, with bandwidth $h = 0.5$, for the sample of size $n = 200$ from the Pareto distribution with parameter $\theta = 2$; the solid line is the density $f$ of this distribution.]

Conclusion. An estimator $T_n$ is some function of the data $X_1, \ldots, X_n$. If it is a symmetric function of the data (which we can in fact assume without loss of generality when the ordering in the data contains no information), we may write $T_n = T(P_n)$, where $P_n$ is the empirical distribution. Roughly speaking, the main purpose of theoretical statistics is studying the difference between $T(P_n)$ and $T(P)$. We are therefore interested in convergence of $P_n$ to $P$ in a broad enough sense. This is what empirical process theory is about.

1 Introduction

This chapter introduces the notation and part of the problem setting. Let $X_1, \ldots, X_n, \ldots$ be i.i.d. copies of a random variable $X$ with values in $\mathcal X$ and with distribution $P$. The distribution of the sequence $X_1, X_2, \ldots$ (+ perhaps some auxiliary variables) is denoted by $\mathbf P$.

Definition 1.1. Let $\{T_n, T\}$ be a collection of real-valued random variables. Then $T_n$ converges in probability to $T$ if for all $\epsilon > 0$,
$$\lim_{n \to \infty} \mathbf P(|T_n - T| > \epsilon) = 0.$$
Notation: $T_n \stackrel{\mathbf P}{\to} T$. Moreover, $T_n$ converges almost surely (a.s.) to $T$ if
$$\mathbf P\big(\lim_{n \to \infty} T_n = T\big) = 1.$$

Remark. Convergence almost surely implies convergence in probability.

1.1 Law of large numbers for real-valued random variables

Consider the case $\mathcal X = \mathbb R$. Suppose the mean $\mu := EX$ exists. Define the average
$$\bar X_n := \frac 1 n \sum_{i=1}^n X_i.$$
Then, by the law of large numbers, as $n \to \infty$,
$$\bar X_n \to \mu, \quad \text{a.s.}$$
Now, let
$$F(t) := P(X \le t), \quad t \in \mathbb R,$$
be the theoretical distribution function, and
$$F_n(t) := \frac 1 n \#\{X_i \le t,\ 1 \le i \le n\}, \quad t \in \mathbb R,$$
be the empirical distribution function. Then, by the law of large numbers, as $n \to \infty$,
$$F_n(t) \to F(t), \quad \text{a.s., for all } t.$$
We will prove in Chapter 4 the Glivenko-Cantelli Theorem, which says that
$$\sup_t |F_n(t) - F(t)| \to 0, \quad \text{a.s.}$$
This is a uniform law of large numbers.

Application: Kolmogorov's goodness-of-fit test. We want to test $H_0 : F = F_0$. Test statistic:
$$D_n := \sup_t |F_n(t) - F_0(t)|.$$
Reject $H_0$ for large values of $D_n$.
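The statistic $D_n$ is straightforward to compute, since the supremum is attained at the order statistics. A Monte Carlo sketch (our own illustration; NumPy assumed, arbitrary seed) also exhibits the Glivenko-Cantelli behaviour $D_n \to 0$ under $H_0$:

```python
import numpy as np

rng = np.random.default_rng(2)                      # arbitrary seed

def kolmogorov_statistic(X, F0):
    """D_n = sup_t |F_n(t) - F0(t)|; the sup is attained at the order statistics."""
    Xs = np.sort(X)
    n = len(Xs)
    F0s = F0(Xs)
    # F_n jumps from (i-1)/n to i/n at the i-th order statistic.
    upper = np.max(np.arange(1, n + 1) / n - F0s)
    lower = np.max(F0s - np.arange(0, n) / n)
    return max(upper, lower)

F0 = lambda x: 1.0 - 1.0 / x**2                     # the Pareto distribution function
for n in [100, 1000, 10000]:
    X = (1.0 - rng.uniform(size=n)) ** (-0.5)       # a sample actually drawn from F0
    print(n, kolmogorov_statistic(X, F0))           # shrinks at rate ~ 1/sqrt(n)
```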

1.2 $\mathbb R^d$-valued random variables

Questions:
(i) What is a natural extension of half-intervals in $\mathbb R$ to higher dimensions?
(ii) Does Glivenko-Cantelli hold for this extension?

1.3 Definition of Glivenko-Cantelli classes of sets

Let, for any measurable $A \subset \mathcal X$,
$$P_n(A) := \frac 1 n \#\{X_i \in A,\ 1 \le i \le n\}.$$
We call $P_n$ the empirical measure based on $X_1, \ldots, X_n$. Let $\mathcal D$ be a collection of subsets of $\mathcal X$.

Definition 1.3. The collection $\mathcal D$ is called a Glivenko-Cantelli (GC) class if
$$\sup_{D \in \mathcal D} |P_n(D) - P(D)| \to 0, \quad \text{a.s.}$$

Example. Let $\mathcal X = \mathbb R$. The class of half-intervals $\mathcal D = \{(-\infty, t] : t \in \mathbb R\}$ is GC. But when e.g. $P$ = the uniform distribution on $[0,1]$ (i.e., $F(t) = t$, $0 \le t \le 1$), the class $\mathcal B = \{\text{all Borel subsets of } [0,1]\}$ is not GC.

1.4 Convergence of averages to their expectations

Notation. For a function $g : \mathcal X \to \mathbb R$, we write
$$Pg := Eg(X), \quad P_n g := \frac 1 n \sum_{i=1}^n g(X_i).$$
Let $\mathcal G$ be a collection of real-valued functions on $\mathcal X$.

Definition 1.4. The class $\mathcal G$ is called a Glivenko-Cantelli (GC) class if
$$\sup_{g \in \mathcal G} |P_n g - Pg| \to 0, \quad \text{a.s.}$$
We will often use the notation
$$\|P_n - P\|_{\mathcal G} := \sup_{g \in \mathcal G} |P_n g - Pg|.$$
We will skip measurability issues, and most of the time do not mention explicitly the requirement of measurability of certain sets or functions. This means that everything has to be understood modulo measurability.

2 Exponential probability inequalities

A statistician is almost never sure about something, but often says that something holds with large probability. We study probability inequalities for deviations of means from their expectations. These are exponential inequalities; that is, the probability that the deviation is large is exponentially small. We will in fact see that the inequalities are similar to those obtained if we assume normality. Exponentially small probabilities are useful indeed when one wants to prove that with large probability a whole collection of events holds simultaneously: it then suffices to show that adding up the small probabilities that one such event does not hold still gives something small. We will use this argument in Chapter 4.

2.1 Chebyshev's inequality

Chebyshev's inequality. Consider a random variable $X \in \mathbb R$ with distribution $P$, and an increasing function $\phi : \mathbb R \to [0, \infty)$. Then for all $a$ with $\phi(a) > 0$, we have
$$\mathbf P(X \ge a) \le \frac{E\phi(X)}{\phi(a)}.$$

Proof.
$$E\phi(X) = \int \phi(x)\,dP(x) = \int_{x \ge a} \phi(x)\,dP(x) + \int_{x < a} \phi(x)\,dP(x) \ge \int_{x \ge a} \phi(a)\,dP(x) = \phi(a)\,\mathbf P(X \ge a). \qquad \square$$

Let $X$ be $N(0,1)$-distributed. By Exercise 2.4,
$$\mathbf P(X \ge a) \le \exp[-a^2/2], \quad a > 0.$$

Corollary 2.1. Let $X_1, \ldots, X_n$ be independent real-valued random variables, and suppose, for all $i$, that $X_i$ is $N(0, \sigma_i^2)$-distributed. Define $b^2 = \sum_{i=1}^n \sigma_i^2$. Then for all $a > 0$,
$$\mathbf P\Big(\sum_{i=1}^n X_i \ge a\Big) \le \exp\Big[-\frac{a^2}{2b^2}\Big].$$

2.2 Bernstein's inequality

Bernstein's inequality. Let $X_1, \ldots, X_n$ be independent real-valued random variables with expectation zero. Suppose that for all $i$,
$$E|X_i|^m \le \frac{m!}{2} K^{m-2} \sigma_i^2, \quad m = 2, 3, \ldots$$
Define
$$b^2 = \sum_{i=1}^n \sigma_i^2.$$
We have for any $a > 0$,
$$\mathbf P\Big(\sum_{i=1}^n X_i \ge a\Big) \le \exp\Big[-\frac{a^2}{2(aK + b^2)}\Big].$$

Proof. We have for $0 < \lambda < 1/K$,
$$E\exp[\lambda X_i] = 1 + \sum_{m=2}^\infty \frac{\lambda^m}{m!} EX_i^m \le 1 + \sum_{m=2}^\infty \frac{\lambda^2}{2} (\lambda K)^{m-2} \sigma_i^2 = 1 + \frac{\lambda^2 \sigma_i^2}{2(1 - \lambda K)} \le \exp\Big[\frac{\lambda^2 \sigma_i^2}{2(1 - \lambda K)}\Big].$$
It follows that
$$E\exp\Big[\lambda \sum_{i=1}^n X_i\Big] = \prod_{i=1}^n E\exp[\lambda X_i] \le \exp\Big[\frac{\lambda^2 b^2}{2(1 - \lambda K)}\Big].$$
Now, apply Chebyshev's inequality to $\sum_{i=1}^n X_i$, with $\phi(x) = \exp[\lambda x]$, $x \in \mathbb R$. We arrive at
$$\mathbf P\Big(\sum_{i=1}^n X_i \ge a\Big) \le \exp\Big[\frac{\lambda^2 b^2}{2(1 - \lambda K)} - \lambda a\Big].$$
Take
$$\lambda = \frac{a}{Ka + b^2}$$
to complete the proof. $\square$

2.3 Hoeffding's inequality

Hoeffding's inequality. Let $X_1, \ldots, X_n$ be independent real-valued random variables with expectation zero. Suppose that for all $i$, and for certain constants $c_i > 0$,
$$|X_i| \le c_i.$$
Then for all $a > 0$,
$$\mathbf P\Big(\sum_{i=1}^n X_i \ge a\Big) \le \exp\Big[-\frac{a^2}{2\sum_{i=1}^n c_i^2}\Big].$$

Proof. Let $\lambda > 0$. By the convexity of the exponential function $\exp[\lambda x]$, we know that for any $0 \le \alpha \le 1$,
$$\exp[\alpha \lambda x + (1-\alpha)\lambda y] \le \alpha \exp[\lambda x] + (1-\alpha)\exp[\lambda y].$$
Define now
$$\alpha_i = \frac{c_i - X_i}{2c_i}.$$
Then
$$X_i = \alpha_i(-c_i) + (1 - \alpha_i)c_i,$$
so
$$\exp[\lambda X_i] \le \alpha_i \exp[-\lambda c_i] + (1 - \alpha_i)\exp[\lambda c_i].$$
But then, since $E\alpha_i = 1/2$, we find
$$E\exp[\lambda X_i] \le \frac 1 2 \exp[-\lambda c_i] + \frac 1 2 \exp[\lambda c_i].$$

Now, for all $x$,
$$\exp[-x] + \exp[x] = 2\sum_{k=0}^\infty \frac{x^{2k}}{(2k)!},$$
whereas
$$2\exp[x^2/2] = 2\sum_{k=0}^\infty \frac{x^{2k}}{2^k k!}.$$
Since $(2k)! \ge 2^k k!$, we thus know that
$$\exp[-x] + \exp[x] \le 2\exp[x^2/2],$$
and hence
$$E\exp[\lambda X_i] \le \exp[\lambda^2 c_i^2/2].$$
Therefore,
$$E\exp\Big[\lambda \sum_{i=1}^n X_i\Big] \le \exp\Big[\lambda^2 \sum_{i=1}^n c_i^2/2\Big].$$
It follows now from Chebyshev's inequality that
$$\mathbf P\Big(\sum_{i=1}^n X_i \ge a\Big) \le \exp\Big[\lambda^2 \sum_{i=1}^n c_i^2/2 - \lambda a\Big].$$
Take $\lambda = a/\sum_{i=1}^n c_i^2$ to complete the proof. $\square$

2.4 Exercise

Exercise 1. Let $X$ be $N(0,1)$-distributed. Show that for $\lambda > 0$,
$$E\exp[\lambda X] = \exp[\lambda^2/2].$$
Conclude that for all $a > 0$,
$$\mathbf P(X \ge a) \le \exp[\lambda^2/2 - \lambda a].$$
Take $\lambda = a$ to find the inequality
$$\mathbf P(X \ge a) \le \exp[-a^2/2].$$
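Hoeffding's bound is easy to check by simulation. The following sketch (our own illustration, with arbitrary choices of $n$, $a$ and the distribution of the $X_i$) compares the empirical tail of a sum of bounded, mean-zero variables with the bound of Section 2.3:

```python
import numpy as np

rng = np.random.default_rng(3)                      # arbitrary seed
n, reps, a = 100, 20000, 15.0                       # illustrative choices

# X_i uniform on [-1, 1]: mean zero, |X_i| <= c_i with c_i = 1.
sums = rng.uniform(-1.0, 1.0, size=(reps, n)).sum(axis=1)

empirical = np.mean(sums >= a)
hoeffding = np.exp(-a**2 / (2.0 * n))               # exp[-a^2 / (2 sum c_i^2)]
print(f"P(sum >= {a}) ~ {empirical:.5f} <= {hoeffding:.5f}")
```

The empirical probability is typically far below the bound: Hoeffding's inequality is not sharp, but it is exponential, which is what the union-bound argument of Chapter 4 needs.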

3 Symmetrization

Symmetrization is a technique based on the following idea. Suppose you have some estimation method, and want to know how well it performs. Suppose you have a sample of size $n$, the so-called training set, and a second sample, say also of size $n$, the so-called test set. Then we may use the training set to calculate the estimator, and the test set to check its performance. For example, suppose we want to know how large the maximal deviation is between certain averages and expectations. We cannot calculate this maximal deviation directly, as the expectations are unknown. Instead, we can calculate the maximal deviation between the averages in the two samples. Symmetrization is closely related: it splits the sample of size $n$ randomly into two subsamples.

Let $X \in \mathcal X$ be a random variable with distribution $P$. We consider two independent sets of independent copies of $X$: $\mathbf X := (X_1, \ldots, X_n)$ and $\mathbf X' := (X_1', \ldots, X_n')$. Let $\mathcal G$ be a class of real-valued functions on $\mathcal X$. Consider the empirical measures
$$P_n := \frac 1 n \sum_{i=1}^n \delta_{X_i}, \quad P_n' := \frac 1 n \sum_{i=1}^n \delta_{X_i'}.$$
Here $\delta_x$ denotes a point mass at $x$. Define
$$\|P_n - P\|_{\mathcal G} := \sup_{g \in \mathcal G} |P_n g - Pg|,$$
and likewise
$$\|P_n' - P\|_{\mathcal G} := \sup_{g \in \mathcal G} |P_n' g - Pg|, \quad \|P_n - P_n'\|_{\mathcal G} := \sup_{g \in \mathcal G} |P_n g - P_n' g|.$$

3.1 Symmetrization with means

Lemma 3.1. We have
$$E\|P_n - P\|_{\mathcal G} \le E\|P_n - P_n'\|_{\mathcal G}.$$

Proof. For a function $f$ on $\mathcal X^{2n}$, let $E_{\mathbf X} f(\mathbf X, \mathbf X')$ denote the conditional expectation of $f(\mathbf X, \mathbf X')$ given $\mathbf X$. Then obviously,
$$E_{\mathbf X} P_n g = P_n g, \quad E_{\mathbf X} P_n' g = Pg.$$
So
$$P_n g - Pg = E_{\mathbf X}(P_n - P_n')g.$$
Hence
$$\|P_n - P\|_{\mathcal G} = \sup_{g \in \mathcal G} |E_{\mathbf X}(P_n - P_n')g|.$$
Now, use that for any function $f(Z, t)$ depending on a random variable $Z$ and a parameter $t$, we have
$$\sup_t |Ef(Z, t)| \le \sup_t E|f(Z, t)| \le E\sup_t |f(Z, t)|.$$
So
$$\sup_{g \in \mathcal G} |E_{\mathbf X}(P_n - P_n')g| \le E_{\mathbf X}\|P_n - P_n'\|_{\mathcal G}.$$
So we now showed that
$$\|P_n - P\|_{\mathcal G} \le E_{\mathbf X}\|P_n - P_n'\|_{\mathcal G}.$$
Finally, we use that the expectation of the conditional expectation is the unconditional expectation:
$$EE_{\mathbf X} f(\mathbf X, \mathbf X') = Ef(\mathbf X, \mathbf X').$$

So
$$E\|P_n - P\|_{\mathcal G} \le EE_{\mathbf X}\|P_n - P_n'\|_{\mathcal G} = E\|P_n - P_n'\|_{\mathcal G}. \qquad \square$$

Definition 3.2. A Rademacher sequence $\{\sigma_i\}_{i=1}^n$ is a sequence of independent random variables $\sigma_i$, with
$$\mathbf P(\sigma_i = 1) = \mathbf P(\sigma_i = -1) = \frac 1 2, \quad \forall\, i.$$
Let $\{\sigma_i\}_{i=1}^n$ be a Rademacher sequence, independent of the two samples $\mathbf X$ and $\mathbf X'$. We define the symmetrized empirical measure
$$P_n^\sigma g := \frac 1 n \sum_{i=1}^n \sigma_i g(X_i), \quad g \in \mathcal G.$$
Let
$$\|P_n^\sigma\|_{\mathcal G} = \sup_{g \in \mathcal G} |P_n^\sigma g|.$$

Lemma 3.3. We have
$$E\|P_n - P\|_{\mathcal G} \le 2E\|P_n^\sigma\|_{\mathcal G}.$$

Proof. Consider the symmetrized version of the second sample $\mathbf X'$:
$$P_n^{\prime,\sigma} g = \frac 1 n \sum_{i=1}^n \sigma_i g(X_i').$$
Then $\|P_n - P_n'\|_{\mathcal G}$ has the same distribution as $\|P_n^\sigma - P_n^{\prime,\sigma}\|_{\mathcal G}$. So
$$E\|P_n - P_n'\|_{\mathcal G} = E\|P_n^\sigma - P_n^{\prime,\sigma}\|_{\mathcal G} \le E\|P_n^\sigma\|_{\mathcal G} + E\|P_n^{\prime,\sigma}\|_{\mathcal G} = 2E\|P_n^\sigma\|_{\mathcal G}. \qquad \square$$

3.2 Symmetrization with probabilities

Lemma 3.2.1. Let $\delta > 0$. Suppose that for all $g \in \mathcal G$,
$$\mathbf P(|P_n g - Pg| > \delta/2) \le \frac 1 2.$$
Then
$$\mathbf P(\|P_n - P\|_{\mathcal G} > \delta) \le 2\mathbf P\Big(\|P_n - P_n'\|_{\mathcal G} > \frac\delta 2\Big).$$

Proof. Let $\mathbf P_{\mathbf X}$ denote the conditional probability given $\mathbf X$. If $\|P_n - P\|_{\mathcal G} > \delta$, we know that for some random function $g^* = g^*_{\mathbf X}$ depending on $\mathbf X$,
$$|P_n g^* - Pg^*| > \delta.$$
Because $\mathbf X'$ is independent of $\mathbf X$, we also know that
$$\mathbf P_{\mathbf X}\Big(|P_n' g^* - Pg^*| > \delta/2\Big) \le \frac 1 2.$$
Thus,
$$\mathbf P\Big(|P_n g^* - Pg^*| > \delta \text{ and } |P_n' g^* - Pg^*| \le \frac\delta 2\Big) = E\,\mathbf P_{\mathbf X}\Big(|P_n g^* - Pg^*| > \delta \text{ and } |P_n' g^* - Pg^*| \le \frac\delta 2\Big)$$

$$= E\Big[\mathbf P_{\mathbf X}\Big(|P_n' g^* - Pg^*| \le \frac\delta 2\Big)\,\mathbf 1\{|P_n g^* - Pg^*| > \delta\}\Big] \ge \frac 1 2 E\,\mathbf 1\{|P_n g^* - Pg^*| > \delta\} = \frac 1 2 \mathbf P(|P_n g^* - Pg^*| > \delta).$$
It follows that
$$\mathbf P(\|P_n - P\|_{\mathcal G} > \delta) \le \mathbf P(|P_n g^* - Pg^*| > \delta) \le 2\mathbf P\Big(|P_n g^* - Pg^*| > \delta \text{ and } |P_n' g^* - Pg^*| \le \frac\delta 2\Big) \le 2\mathbf P\Big(|P_n g^* - P_n' g^*| > \frac\delta 2\Big). \qquad \square$$

Corollary 3.2.2. Let $\delta > 0$. Suppose that for all $g \in \mathcal G$,
$$\mathbf P(|P_n g - Pg| > \delta/2) \le \frac 1 2.$$
Then
$$\mathbf P(\|P_n - P\|_{\mathcal G} > \delta) \le 4\mathbf P\Big(\|P_n^\sigma\|_{\mathcal G} > \frac\delta 4\Big).$$

3.3 Some facts about conditional expectations

Let $X$ and $Y$ be two random variables. We write the conditional expectation of $Y$ given $X$ as
$$E_X Y = E(Y \mid X).$$
Then
$$EY = EE_X Y.$$
Let $f$ be some function of $X$ and $g$ be some function of $(X, Y)$. We have
$$E_X f(X)g(X, Y) = f(X)E_X g(X, Y).$$
The conditional probability given $X$ is
$$\mathbf P_X\big((X, Y) \in B\big) = E_X \mathbf 1_B(X, Y).$$
Hence,
$$\mathbf P_X\big(X \in A,\ (X, Y) \in B\big) = \mathbf 1_A(X)\,\mathbf P_X\big((X, Y) \in B\big).$$
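To see Lemma 3.3 at work numerically, the following sketch (our own illustration; the class of half-interval indicators is approximated by a finite grid of thresholds, and the seed is arbitrary) compares the two sides in expectation by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(4)                      # arbitrary seed
n, reps = 50, 2000
t_grid = np.linspace(0.0, 1.0, 201)                 # finite grid standing in for all half-intervals

lhs, rhs = 0.0, 0.0
for _ in range(reps):
    X = rng.uniform(size=n)
    sigma = rng.choice([-1.0, 1.0], size=n)         # a Rademacher sequence
    ind = (X[:, None] <= t_grid)                    # g_t(X_i) = 1{X_i <= t}
    lhs += np.max(np.abs(ind.mean(axis=0) - t_grid))                  # ||P_n - P||_G, since P g_t = t
    rhs += 2.0 * np.max(np.abs((sigma[:, None] * ind).mean(axis=0)))  # 2 ||P_n^sigma||_G

print(lhs / reps, "<=", rhs / reps)                 # Lemma 3.3, in expectation
```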

4 Uniform laws of large numbers

In this chapter, we prove uniform laws of large numbers for the empirical mean of functions $g$ of the individual observations, when $g$ varies over a class $\mathcal G$ of functions. First, we study the case where $\mathcal G$ is finite. Symmetrization is used in order to be able to apply Hoeffding's inequality. Hoeffding's inequality gives exponentially small probabilities for the deviation of averages from their expectations. So, considering only a finite number of such averages, the difference between these averages and their expectations will be small for all averages simultaneously, with large probability. If $\mathcal G$ is not finite, we approximate it by a finite set. A $\delta$-approximation is called a $\delta$-covering, and the number of elements of a minimal $\delta$-covering is called the $\delta$-covering number. We introduce Vapnik-Chervonenkis (VC) classes. These are classes with small covering numbers.

Let $X \in \mathcal X$ be a random variable with distribution $P$. Consider a class $\mathcal G$ of real-valued functions on $\mathcal X$, and consider i.i.d. copies $\{X_1, X_2, \ldots\}$ of $X$. In this chapter, we address the problem of proving
$$\|P_n - P\|_{\mathcal G} \stackrel{\mathbf P}{\to} 0.$$
If this is the case, we call $\mathcal G$ a Glivenko-Cantelli (GC) class.

Remark. It can be shown that if $\|P_n - P\|_{\mathcal G} \stackrel{\mathbf P}{\to} 0$, then also $\|P_n - P\|_{\mathcal G} \to 0$ almost surely. This involves e.g. martingale arguments. We will not consider this issue.

4.1 Classes of functions

Notation. The sup-norm of a function $g$ is
$$\|g\|_\infty := \sup_{x \in \mathcal X} |g(x)|.$$

Elementary observation. Let $\{A_k\}_{k=1}^N$ be a finite collection of events. Then
$$\mathbf P\big(\cup_{k=1}^N A_k\big) \le \sum_{k=1}^N \mathbf P(A_k) \le N\max_{1 \le k \le N} \mathbf P(A_k).$$

Lemma 4.1. Let $\mathcal G$ be a finite class of functions, with cardinality $|\mathcal G| := N > 1$. Suppose that for some finite constant $K$, $\max_{g \in \mathcal G} \|g\|_\infty \le K$. Then for all
$$\delta \ge 2K\sqrt{\frac{\log N}{n}},$$
we have
$$\mathbf P(\|P_n^\sigma\|_{\mathcal G} > \delta) \le 2\exp\Big[-\frac{n\delta^2}{4K^2}\Big]$$
and
$$\mathbf P(\|P_n - P\|_{\mathcal G} > 4\delta) \le 8\exp\Big[-\frac{n\delta^2}{4K^2}\Big].$$

Proof. By Hoeffding's inequality, for each $g \in \mathcal G$,
$$\mathbf P(|P_n^\sigma g| > \delta) \le 2\exp\Big[-\frac{n\delta^2}{2K^2}\Big].$$
Use the elementary observation to conclude that
$$\mathbf P(\|P_n^\sigma\|_{\mathcal G} > \delta) \le 2N\exp\Big[-\frac{n\delta^2}{2K^2}\Big] = 2\exp\Big[\log N - \frac{n\delta^2}{2K^2}\Big] \le 2\exp\Big[-\frac{n\delta^2}{4K^2}\Big].$$

By Chebyshev's inequality, for each $g \in \mathcal G$,
$$\mathbf P(|P_n g - Pg| > \delta) \le \frac{\mathrm{var}(g(X))}{n\delta^2} \le \frac{K^2}{4K^2 \log N} \le \frac 1 2.$$
Hence, by symmetrization with probabilities,
$$\mathbf P(\|P_n - P\|_{\mathcal G} > 4\delta) \le 4\mathbf P(\|P_n^\sigma\|_{\mathcal G} > \delta) \le 8\exp\Big[-\frac{n\delta^2}{4K^2}\Big]. \qquad \square$$

Definition 4.2. The envelope $G$ of a collection of functions $\mathcal G$ is defined by
$$G(x) = \sup_{g \in \mathcal G} |g(x)|, \quad x \in \mathcal X.$$
In Exercise 2 of this chapter, the assumption $\sup_{g \in \mathcal G} \|g\|_\infty \le K$ used in Lemma 4.1 is weakened to $PG < \infty$.

Definition 4.3. Let $S$ be some subset of a metric space $(\Lambda, d)$. For $\delta > 0$, the $\delta$-covering number $N(\delta, S, d)$ of $S$ is the minimum number of balls with radius $\delta$ necessary to cover $S$, i.e., the smallest value of $N$ such that there exist $s_1, \ldots, s_N$ in $\Lambda$ with
$$\min_{j=1,\ldots,N} d(s, s_j) \le \delta, \quad \forall\, s \in S.$$
The set $\{s_1, \ldots, s_N\}$ is then called a $\delta$-covering of $S$. The logarithm $\log N(\cdot, S, d)$ of the covering number is called the entropy of $S$.

[Figure 3: a $\delta$-covering of a set by balls of radius $\delta$.]
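Covering numbers of finite point sets can be computed (up to the usual greedy slack) with a few lines of code. The following sketch is our own illustration, not part of the handout; greedily chosen centers form a maximal $\delta$-packing, hence also a $\delta$-covering, so their number lies between $N(\delta, S, d)$ and the packing number:

```python
import numpy as np

def greedy_covering(points, delta):
    """Greedy centers: a maximal delta-packing of the point set, hence also
    a delta-covering; len(centers) sandwiches N(delta, S, d) from above."""
    centers = []
    for p in points:
        if all(np.linalg.norm(p - c) > delta for c in centers):
            centers.append(p)
    return centers

rng = np.random.default_rng(5)                      # arbitrary seed
S = rng.uniform(size=(1000, 2))                     # 1000 points in the unit square
for delta in [0.5, 0.25, 0.125]:
    print(delta, len(greedy_covering(S, delta)))    # grows like (1/delta)^2 for a 2-d set
```

The growth rate in $1/\delta$ reflects the "dimension" of the set, which is exactly what the entropy conditions below quantify.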

Notation. Let
$$d_{1,n}(g, \tilde g) = P_n|g - \tilde g|.$$

Theorem 4.4. Suppose
$$\|g\|_\infty \le K, \quad \forall\, g \in \mathcal G.$$
Assume moreover that, for all $\delta > 0$,
$$\frac 1 n \log N(\delta, \mathcal G, d_{1,n}) \stackrel{\mathbf P}{\to} 0.$$
Then
$$\|P_n - P\|_{\mathcal G} \stackrel{\mathbf P}{\to} 0.$$

Proof. Let $\delta > 0$. Let $g_1, \ldots, g_N$, with $N = N(\delta, \mathcal G, d_{1,n})$, be a $\delta$-covering of $(\mathcal G, d_{1,n})$. When $P_n|g - g_j| \le \delta$, we have
$$|P_n^\sigma g| \le |P_n^\sigma g_j| + \delta.$$
So
$$\|P_n^\sigma\|_{\mathcal G} \le \max_{j=1,\ldots,N} |P_n^\sigma g_j| + \delta.$$
By Hoeffding's inequality and the elementary observation, for
$$\delta \ge 2K\sqrt{\frac{\log N}{n}},$$
we have
$$\mathbf P_{\mathbf X}\Big(\max_{j=1,\ldots,N} |P_n^\sigma g_j| > \delta\Big) \le 2\exp\Big[-\frac{n\delta^2}{4K^2}\Big].$$
Conclude that for
$$\delta \ge 2K\sqrt{\frac{\log N}{n}},$$
we have
$$\mathbf P_{\mathbf X}(\|P_n^\sigma\|_{\mathcal G} > 2\delta) \le 2\exp\Big[-\frac{n\delta^2}{4K^2}\Big].$$
But then
$$\mathbf P(\|P_n^\sigma\|_{\mathcal G} > 2\delta) \le 2\exp\Big[-\frac{n\delta^2}{4K^2}\Big] + \mathbf P\Big(2K\sqrt{\frac{\log N(\delta, \mathcal G, d_{1,n})}{n}} > \delta\Big).$$
We thus get, as $n \to \infty$,
$$\mathbf P(\|P_n^\sigma\|_{\mathcal G} > 2\delta) \to 0.$$
Now, use the symmetrization with probabilities to conclude
$$\mathbf P(\|P_n - P\|_{\mathcal G} > 8\delta) \to 0.$$
Since $\delta$ is arbitrary, this concludes the proof. $\square$

Again, the assumption $\sup_{g \in \mathcal G} \|g\|_\infty \le K$ used in Theorem 4.4 can be weakened to $PG < \infty$, $G$ being the envelope of $\mathcal G$. See Exercise 3 of this chapter.

4.2 Classes of sets

Let $\mathcal D$ be a collection of subsets of $\mathcal X$, and let $\{\xi_1, \ldots, \xi_n\}$ be $n$ points in $\mathcal X$.

Definition 4.2.1. We write
$$\Delta^{\mathcal D}(\xi_1, \ldots, \xi_n) = \mathrm{card}\big\{D \cap \{\xi_1, \ldots, \xi_n\} : D \in \mathcal D\big\}$$
= the number of subsets of $\{\xi_1, \ldots, \xi_n\}$ that $\mathcal D$ can distinguish. That is, count the number of sets in $\mathcal D$, where two sets $D_1$ and $D_2$ are considered as equal if
$$(D_1 \triangle D_2) \cap \{\xi_1, \ldots, \xi_n\} = \emptyset.$$
Here $D_1 \triangle D_2 = (D_1 \cap D_2^c) \cup (D_1^c \cap D_2)$ is the symmetric difference between $D_1$ and $D_2$.

Remark. For our purposes, we will not need to calculate $\Delta^{\mathcal D}(\xi_1, \ldots, \xi_n)$ exactly, but only a good enough upper bound.

Example. Let $\mathcal X = \mathbb R$ and
$$\mathcal D = \{(-\infty, t] : t \in \mathbb R\}.$$
Then for all $\{\xi_1, \ldots, \xi_n\} \subset \mathbb R$,
$$\Delta^{\mathcal D}(\xi_1, \ldots, \xi_n) \le n + 1.$$

Example. Let $\mathcal D$ be the collection of all finite subsets of $\mathcal X$. Then, if the points $\xi_1, \ldots, \xi_n$ are distinct,
$$\Delta^{\mathcal D}(\xi_1, \ldots, \xi_n) = 2^n.$$

Theorem 4.2.2 (Vapnik and Chervonenkis, 1971). We have
$$\frac 1 n \log \Delta^{\mathcal D}(X_1, \ldots, X_n) \stackrel{\mathbf P}{\to} 0$$
if and only if
$$\sup_{D \in \mathcal D} |P_n(D) - P(D)| \stackrel{\mathbf P}{\to} 0.$$

Proof of the if-part. This follows from applying Theorem 4.4 to $\mathcal G = \{\mathbf 1_D : D \in \mathcal D\}$. Note that a class of indicator functions is uniformly bounded by 1, i.e., we can take $K = 1$ in Theorem 4.4. Define now
$$d_{\infty,n}(g, \tilde g) = \max_{i=1,\ldots,n} |g(X_i) - \tilde g(X_i)|.$$
Then $d_{1,n} \le d_{\infty,n}$, so also
$$N(\cdot, \mathcal G, d_{1,n}) \le N(\cdot, \mathcal G, d_{\infty,n}).$$
But for $0 < \delta < 1$,
$$N(\delta, \{\mathbf 1_D : D \in \mathcal D\}, d_{\infty,n}) = \Delta^{\mathcal D}(X_1, \ldots, X_n).$$
So indeed, if $\frac 1 n \log \Delta^{\mathcal D}(X_1, \ldots, X_n) \stackrel{\mathbf P}{\to} 0$, then also
$$\frac 1 n \log N(\delta, \{\mathbf 1_D : D \in \mathcal D\}, d_{1,n}) \le \frac 1 n \log \Delta^{\mathcal D}(X_1, \ldots, X_n) \stackrel{\mathbf P}{\to} 0. \qquad \square$$

4.3 Vapnik-Chervonenkis classes

Definition 4.3.1. Let
$$m^{\mathcal D}(n) = \sup\{\Delta^{\mathcal D}(\xi_1, \ldots, \xi_n) : \xi_1, \ldots, \xi_n \in \mathcal X\}.$$
We say that $\mathcal D$ is a Vapnik-Chervonenkis (VC) class if for certain constants $c$ and $V$, and for all $n$,
$$m^{\mathcal D}(n) \le cn^V,$$
i.e., if $m^{\mathcal D}(n)$ does not grow faster than a polynomial in $n$.

Important conclusion: for sets, VC $\Rightarrow$ GC.

Examples.
(a) $\mathcal X = \mathbb R$, $\mathcal D = \{(-\infty, t] : t \in \mathbb R\}$. Since $m^{\mathcal D}(n) \le n + 1$, $\mathcal D$ is VC.
(b) $\mathcal X = \mathbb R^d$, $\mathcal D = \{(-\infty, t] : t \in \mathbb R^d\}$. Since $m^{\mathcal D}(n) \le (n+1)^d$, $\mathcal D$ is VC.
(c) $\mathcal X = \mathbb R^d$, $\mathcal D = \{\{x : \theta^T x > t\} : (\theta, t) \in \mathbb R^{d+1}\}$. Since $m^{\mathcal D}(n) \le 2^d n^d$, $\mathcal D$ is VC.

The VC property is closed under measure-theoretic operations:

Lemma 4.3.2. Let $\mathcal D$, $\mathcal D_1$ and $\mathcal D_2$ be VC. Then the following classes are also VC:
(i) $\mathcal D^c = \{D^c : D \in \mathcal D\}$,
(ii) $\mathcal D_1 \sqcap \mathcal D_2 = \{D_1 \cap D_2 : D_1 \in \mathcal D_1,\ D_2 \in \mathcal D_2\}$,
(iii) $\mathcal D_1 \sqcup \mathcal D_2 = \{D_1 \cup D_2 : D_1 \in \mathcal D_1,\ D_2 \in \mathcal D_2\}$.

Proof. Exercise.

Examples:
- the class of intersections of two halfspaces,
- all ellipsoids,
- all half-ellipsoids,
- in $\mathbb R$, the class $\big\{\{x : \theta_1 x + \cdots + \theta_r x^r \le t\} : (\theta, t) \in \mathbb R^{r+1}\big\}$.
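The trace counts of Definition 4.2.1 can be computed by brute force for small point sets. The sketch below (our own illustration; the point set, seed, and the two example classes are arbitrary choices) checks the polynomial bound for half-intervals against the exponential count of all subsets:

```python
import numpy as np

def trace_count(points, sets):
    """Delta^D(xi_1,...,xi_n): number of distinct subsets picked out by the sets."""
    traces = {tuple(bool(D(p)) for p in points) for D in sets}
    return len(traces)

rng = np.random.default_rng(6)                      # arbitrary seed
xi = rng.uniform(size=8)

# Half-intervals (-inf, t]: thresholds between consecutive order statistics suffice.
ts = np.concatenate(([xi.min() - 1], np.sort(xi)))
half_intervals = [lambda x, t=t: x <= t for t in ts]
print(trace_count(xi, half_intervals), "<=", len(xi) + 1)   # at most n + 1 traces

# Intervals (s, t]: O(n^2) traces, still polynomial in n, hence VC.
pts = np.sort(xi)
intervals = [lambda x, s=s, t=t: (x > s) & (x <= t)
             for s in np.concatenate(([pts[0] - 1], pts)) for t in pts]
print(trace_count(xi, intervals))
```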

There are classes that are GC, but not VC.

Example. Let $\mathcal X = [0,1]^2$, and let $\mathcal D$ be the collection of all convex subsets of $\mathcal X$. Then $\mathcal D$ is not VC, but when $P$ is uniform, $\mathcal D$ is GC.

Definition 4.3.3. The VC dimension of $\mathcal D$ is
$$V_{\mathcal D} = \inf\{n : m^{\mathcal D}(n) < 2^n\}.$$
The following lemma is nice to know, but to avoid digressions, we will not provide a proof.

Lemma 4.3.4. We have that $\mathcal D$ is VC if and only if $V_{\mathcal D} < \infty$. In fact, we have for $V = V_{\mathcal D}$,
$$m^{\mathcal D}(n) \le \sum_{k=0}^V \binom n k.$$

4.4 VC graph classes of functions

Definition 4.4.1. The subgraph of a function $g : \mathcal X \to \mathbb R$ is
$$\mathrm{subgraph}(g) = \{(x, t) \in \mathcal X \times \mathbb R : g(x) \ge t\}.$$
A collection of functions $\mathcal G$ is called a VC graph class if the subgraphs $\{\mathrm{subgraph}(g) : g \in \mathcal G\}$ form a VC class.

Example. $\mathcal G = \{\mathbf 1_D : D \in \mathcal D\}$ is a VC graph class if $\mathcal D$ is VC.

Examples ($\mathcal X = \mathbb R^d$):
(a) $\mathcal G = \{g(x) = \theta_0 + \theta_1 x_1 + \cdots + \theta_d x_d : \theta \in \mathbb R^{d+1}\}$,
(b) $\mathcal G = \{g(x) = |\theta_0 + \theta_1 x_1 + \cdots + \theta_d x_d| : \theta \in \mathbb R^{d+1}\}$,
(c) for $d = 1$, $\mathcal G = \Big\{g(x) = \begin{cases} a + bx & \text{if } x \le c \\ d + ex & \text{if } x > c \end{cases} : (a, b, c, d, e) \in \mathbb R^5\Big\}$,
(d) for $d = 1$, $\mathcal G = \{g(x) = e^{\theta x} : \theta \in \mathbb R\}$.

Definition (packing numbers). Let $S$ be some subset of a metric space $(\Lambda, d)$. For $\delta > 0$, the $\delta$-packing number $D(\delta, S, d)$ of $S$ is the largest value of $N$ such that there exist $s_1, \ldots, s_N$ in $S$ with
$$d(s_k, s_j) > \delta, \quad k \ne j.$$

Note. For all $\delta > 0$,
$$N(\delta, S, d) \le D(\delta, S, d).$$

Theorem 4.4.2. Let $Q$ be any probability measure on $\mathcal X$. Define
$$d_{1,Q}(g, \tilde g) = Q|g - \tilde g|.$$
For a VC graph class $\mathcal G$ with VC dimension $V$, we have for a constant $A$ depending only on $V$,
$$N(\delta\, QG, \mathcal G, d_{1,Q}) \le \max\big(A\delta^{-2V}, e^{\delta/4}\big), \quad \delta > 0.$$

Proof. Without loss of generality, assume $QG = 1$. Choose $S \in \mathcal X$ with distribution $dQ_S = G\,dQ$. Given $S = s$, choose $T$ uniformly in the interval $[-G(s), G(s)]$. Let $g_1, \ldots, g_N$ be a maximal set in $\mathcal G$, such that $Q|g_j - g_k| > \delta$ for $j \ne k$. Consider a pair $j \ne k$. Given $S = s$, the probability that $T$ falls in between the two graphs of $g_j$ and $g_k$ is
$$\frac{|g_j(s) - g_k(s)|}{2G(s)}.$$
So the unconditional probability that $T$ falls in between the two graphs of $g_j$ and $g_k$ is
$$\int \frac{|g_j(s) - g_k(s)|}{2G(s)}\,dQ_S(s) = \frac{Q|g_j - g_k|}{2} \ge \frac\delta 2.$$

Now, choose $n$ independent copies $\{(S_i, T_i)\}_{i=1}^n$ of $(S, T)$. The probability that none of these falls in between the graphs of $g_j$ and $g_k$ is then at most
$$(1 - \delta/2)^n.$$
The probability that for some $j \ne k$, none of these falls in between the graphs of $g_j$ and $g_k$ is then at most
$$N^2(1 - \delta/2)^n \le \exp\Big[2\log N - \frac{n\delta}2\Big] < 1,$$
when we choose $n$ the smallest integer such that
$$n \ge \frac{4\log N}\delta.$$
So for such a value of $n$, with positive probability, for any $j \ne k$, some of the $(S_i, T_i)$ fall in between the graphs of $g_j$ and $g_k$. Therefore, since the subgraphs form a VC class, we must have
$$N \le cn^V \le c\Big(\frac{4\log N}\delta + 1\Big)^V.$$
But then, for $N \ge \exp[\delta/4]$ (so that $4\log N/\delta \ge 1$),
$$N \le c\Big(\frac{8\log N}\delta\Big)^V.$$
Since $\log N \le 2VN^{1/(2V)}$, this gives
$$N \le c\Big(\frac{16V}\delta\Big)^V N^{1/2},$$
so
$$N \le c^2\Big(\frac{16V}\delta\Big)^{2V} = A\delta^{-2V}, \quad A = c^2(16V)^{2V}. \qquad \square$$

Corollary 4.4.3. Suppose $\mathcal G$ is VC graph and that $\int G\,dP < \infty$. Then, by Theorem 4.4.2 and Theorem 4.4, we have
$$\|P_n - P\|_{\mathcal G} \stackrel{\mathbf P}{\to} 0.$$

4.5 Exercises

Exercise 1. Let $\mathcal G$ be a finite class of functions, with cardinality $|\mathcal G| := N > 1$. Suppose that for some finite constant $K$, $\max_{g \in \mathcal G} \|g\|_\infty \le K$. Use Bernstein's inequality to show that for
$$\delta^2 \ge \frac{4\log N}n\big[\delta K + K^2\big],$$
one has
$$\mathbf P(\|P_n - P\|_{\mathcal G} > \delta) \le 2\exp\Big[-\frac{n\delta^2}{4(\delta K + K^2)}\Big].$$

Exercise 2. Let $\mathcal G$ be a finite class of functions, with cardinality $|\mathcal G| := N > 1$. Suppose that $\mathcal G$ has envelope $G$ satisfying $PG < \infty$. Let $0 < \delta < 1$, and take $K$ large enough, so that
$$P\,G\,\mathbf 1\{G > K\} \le \frac{\delta^2}2.$$
Show that for $\delta \ge 4K\sqrt{\frac{\log N}n}$,

$$\mathbf P(\|P_n - P\|_{\mathcal G} > 4\delta) \le 8\exp\Big[-\frac{n\delta^2}{16K^2}\Big] + \delta.$$
Hint: use
$$|P_n g - Pg| \le \big|P_n g\,\mathbf 1\{G \le K\} - Pg\,\mathbf 1\{G \le K\}\big| + P_n G\,\mathbf 1\{G > K\} + PG\,\mathbf 1\{G > K\}.$$

Exercise 3. Let $\mathcal G$ be a class of functions, with envelope $G$, satisfying $PG < \infty$ and, for all $\delta > 0$,
$$\frac 1 n \log N(\delta, \mathcal G, d_{1,n}) \stackrel{\mathbf P}{\to} 0.$$
Show that $\|P_n - P\|_{\mathcal G} \stackrel{\mathbf P}{\to} 0$.

Exercise 4. Are the following classes of sets (functions) VC? Why (not)?
1. The class of all rectangles in $\mathbb R^d$.
2. The class of all monotone functions on $\mathbb R$.
3. The class of functions on $[0,1]$ given by $\mathcal G = \{g(x) = ae^{bx} + ce^{dx} : (a, b, c, d) \in [0,1]^4\}$.
4. The class of all sections in $\mathbb R^2$ (a section is of the form $\{(x_1, x_2) : x_1 = a_1 + r\sin t,\ x_2 = a_2 + r\cos t,\ \theta_1 \le t \le \theta_2\}$, for some $(a_1, a_2) \in \mathbb R^2$, some $r > 0$, and some $0 \le \theta_1 \le \theta_2 \le 2\pi$).
5. The class of all star-shaped sets in $\mathbb R^2$ (a set $D$ is star-shaped if for some $a \in D$ and all $b \in D$, also all points on the line segment joining $a$ and $b$ are in $D$).

Exercise 5. Let $\mathcal G$ be the class of all functions $g$ on $[0,1]$ with derivative $\dot g$ satisfying $\|\dot g\|_\infty \le 1$. Check that $\mathcal G$ is not VC. Show that $\mathcal G$ is GC, by using partial integration and the Glivenko-Cantelli Theorem for the empirical distribution function.

5 M-estimators

5.1 What is an M-estimator?

Let $X_1, \ldots, X_n, \ldots$ be i.i.d. copies of a random variable $X$ with values in $\mathcal X$ and with distribution $P$. Let $\Theta$ be a parameter space (a subset of some metric space), and let, for $\theta \in \Theta$,
$$\gamma_\theta : \mathcal X \to \mathbb R$$
be some loss function. We assume $P|\gamma_\theta| < \infty$ for all $\theta \in \Theta$. We estimate the unknown parameter
$$\theta_0 := \arg\min_{\theta \in \Theta} P\gamma_\theta$$
by the M-estimator
$$\hat\theta_n := \arg\min_{\theta \in \Theta} P_n\gamma_\theta.$$
We assume that $\theta_0$ exists and is unique, and that $\hat\theta_n$ exists.

Examples.
(i) Location estimators: $\mathcal X = \mathbb R$, $\Theta = \mathbb R$, and
(i.a) $\gamma_\theta(x) = (x - \theta)^2$ (estimating the mean),
(i.b) $\gamma_\theta(x) = |x - \theta|$ (estimating the median).
(ii) Maximum likelihood: $\{p_\theta : \theta \in \Theta\}$ a family of densities w.r.t. a $\sigma$-finite dominating measure $\mu$, and $\gamma_\theta = -\log p_\theta$. If $dP/d\mu = p_{\theta_0}$, $\theta_0 \in \Theta$, then indeed $\theta_0$ is a minimizer of $P\gamma_\theta$, $\theta \in \Theta$.
(ii.a) Poisson distribution: $p_\theta(x) = e^{-\theta}\frac{\theta^x}{x!}$, $\theta > 0$, $x \in \{0, 1, 2, \ldots\}$.
(ii.b) Logistic distribution:
$$p_\theta(x) = \frac{e^{\theta - x}}{(1 + e^{\theta - x})^2}, \quad \theta \in \mathbb R,\ x \in \mathbb R.$$

5.2 Consistency

Define for $\theta \in \Theta$,
$$R(\theta) = P\gamma_\theta, \quad R_n(\theta) = P_n\gamma_\theta.$$
We first present an easy proposition, with a too stringent condition.

Proposition 5.2.1. Suppose that $\theta \mapsto R(\theta)$ is continuous. Assume moreover that
$$\sup_{\theta \in \Theta} |R_n(\theta) - R(\theta)| \stackrel{\mathbf P}{\to} 0,$$
i.e., that $\{\gamma_\theta : \theta \in \Theta\}$ is a GC class. Then
$$\hat\theta_n \stackrel{\mathbf P}{\to} \theta_0.$$

Proof. We have
$$0 \le R(\hat\theta_n) - R(\theta_0) = \big[R(\hat\theta_n) - R(\theta_0)\big] - \big[R_n(\hat\theta_n) - R_n(\theta_0)\big] + \big[R_n(\hat\theta_n) - R_n(\theta_0)\big]$$
$$\le \big[R(\hat\theta_n) - R(\theta_0)\big] - \big[R_n(\hat\theta_n) - R_n(\theta_0)\big] \stackrel{\mathbf P}{\to} 0.$$
So $R(\hat\theta_n) \stackrel{\mathbf P}{\to} R(\theta_0)$, and hence $\hat\theta_n \stackrel{\mathbf P}{\to} \theta_0$. $\square$

The assumption is hardly ever met, because it is close to requiring compactness of $\Theta$.
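The location examples (i.a) and (i.b) can be computed by direct numerical minimization of $R_n$. A sketch (our own; SciPy is assumed, and the sample and seed are arbitrary):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(7)                      # arbitrary seed
X = rng.standard_normal(100) + 3.0                  # location 3

def m_estimate(X, gamma):
    """Minimize the empirical risk R_n(theta) = P_n gamma_theta over theta."""
    return minimize_scalar(lambda th: np.mean(gamma(X - th))).x

print(m_estimate(X, lambda u: u**2), X.mean())      # (i.a): recovers the sample mean
print(m_estimate(X, np.abs), np.median(X))          # (i.b): recovers the sample median
```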

Lemma 5.2.2. Suppose that $(\Theta, d)$ is compact and that $\theta \mapsto \gamma_\theta$ is continuous. Moreover, assume that $PG < \infty$, where
$$G = \sup_{\theta \in \Theta} |\gamma_\theta|.$$
Then
$$\sup_{\theta \in \Theta} |R_n(\theta) - R(\theta)| \stackrel{\mathbf P}{\to} 0.$$

Proof. Let
$$w(\theta, \rho) = \sup_{\{\tilde\theta :\, d(\tilde\theta, \theta) < \rho\}} |\gamma_{\tilde\theta} - \gamma_\theta|.$$
By dominated convergence,
$$Pw(\theta, \rho) \to 0, \quad \rho \downarrow 0.$$
Let $\delta > 0$ be arbitrary. Take $\rho_\theta$ in such a way that
$$Pw(\theta, \rho_\theta) \le \delta.$$
Let $B_\theta = \{\tilde\theta : d(\tilde\theta, \theta) < \rho_\theta\}$, and let $B_{\theta_1}, \ldots, B_{\theta_N}$ be a finite cover of $\Theta$. Then, for $\mathcal G = \{\gamma_\theta : \theta \in \Theta\}$, the functions $\gamma_{\theta_1}, \ldots, \gamma_{\theta_N}$ eventually form a $2\delta$-covering for $d_{1,n}$ (since $P_n w(\theta_j, \rho_{\theta_j}) \stackrel{\mathbf P}{\to} Pw(\theta_j, \rho_{\theta_j}) \le \delta$), i.e.,
$$\mathbf P\big(N(2\delta, \mathcal G, d_{1,n}) > N\big) \to 0.$$
So the result follows from Theorem 4.4. $\square$

We give a lemma which replaces compactness by a convexity assumption.

Lemma 5.2.3. Suppose that $\Theta$ is a convex subset of $\mathbb R^r$, and that $\theta \mapsto \gamma_\theta$, $\theta \in \Theta$, is continuous and convex. Suppose $PG_\epsilon < \infty$ for some $\epsilon > 0$, where
$$G_\epsilon = \sup_{\|\theta - \theta_0\| \le \epsilon} |\gamma_\theta|.$$
Then $\hat\theta_n \stackrel{\mathbf P}{\to} \theta_0$.

Proof. Because $\Theta$ is finite-dimensional, the set $\{\|\theta - \theta_0\| \le \epsilon\}$ is compact. So, by Lemma 5.2.2,
$$\sup_{\|\theta - \theta_0\| \le \epsilon} |R_n(\theta) - R(\theta)| \stackrel{\mathbf P}{\to} 0.$$
Define
$$\alpha = \frac\epsilon{\epsilon + \|\hat\theta_n - \theta_0\|}$$
and
$$\tilde\theta_n = \alpha\hat\theta_n + (1 - \alpha)\theta_0.$$
Then $\|\tilde\theta_n - \theta_0\| \le \epsilon$. Moreover,
$$R_n(\tilde\theta_n) \le \alpha R_n(\hat\theta_n) + (1 - \alpha)R_n(\theta_0) \le R_n(\theta_0).$$
It follows from the arguments used in the proof of Proposition 5.2.1 that $\|\tilde\theta_n - \theta_0\| \stackrel{\mathbf P}{\to} 0$. But then also
$$\|\hat\theta_n - \theta_0\| = \frac{\epsilon\,\|\tilde\theta_n - \theta_0\|}{\epsilon - \|\tilde\theta_n - \theta_0\|} \stackrel{\mathbf P}{\to} 0. \qquad \square$$

5.3 Exercises

Exercise 1. Let $Y \in \{0, 1\}$ be a binary response variable and $Z \in \mathbb R$ a covariable. Assume the logistic regression model
$$\mathbf P_{\theta_0}(Y = 1 \mid Z = z) = \frac 1{1 + \exp[\alpha_0 + \beta_0 z]},$$
where $\theta_0 = (\alpha_0, \beta_0) \in \mathbb R^2$ is an unknown parameter. Let $\{(Y_i, Z_i)\}_{i=1}^n$ be i.i.d. copies of $(Y, Z)$. Show consistency of the MLE of $\theta_0$.

Exercise 2. Suppose $X_1, \ldots, X_n$ are i.i.d. real-valued random variables with density $f_0 = dP/d\mu$ on $[0,1]$. Here, $\mu$ is Lebesgue measure on $[0,1]$. Suppose it is given that $f_0 \in \mathcal F$, with $\mathcal F$ the set of all decreasing densities bounded from above by 2 and from below by $1/2$. Let $\hat f_n$ be the MLE. Can you show consistency of $\hat f_n$? For what metric?

6 Uniform central limit theorems

After having studied uniform laws of large numbers, a natural question is: can we also prove uniform central limit theorems? It turns out that precisely defining what a uniform central limit theorem is, is quite involved, and actually beyond our scope. In Sections 6.1-6.4 we will therefore only briefly indicate the results, and not present any proofs. These sections only reveal a glimpse of the topic of weak convergence on abstract spaces. The thing to remember from them is the concept of asymptotic continuity, because we will use that concept in our statistical applications. In Section 6.5 we will prove that the empirical process indexed by a VC graph class is asymptotically continuous. This result will be a corollary of another result of interest to theoretical statisticians: a result relating the increments of the empirical process to the entropy of $\mathcal G$.

6.1 Real-valued random variables

Let $\mathcal X = \mathbb R$.

Central limit theorem in $\mathbb R$. Suppose $\mu = EX$ and $\sigma^2 = \mathrm{var}(X)$ exist. Then
$$\mathbf P\Big(\frac{\sqrt n(\bar X_n - \mu)}\sigma \le z\Big) \to \Phi(z), \quad \text{for all } z,$$
where $\Phi$ is the standard normal distribution function.

Notation:
$$\frac{\sqrt n(\bar X_n - \mu)}\sigma \stackrel{\mathcal L}{\to} N(0, 1), \quad \text{or} \quad \sqrt n(\bar X_n - \mu) \stackrel{\mathcal L}{\to} N(0, \sigma^2).$$

6.2 $\mathbb R^d$-valued random variables

Let $X_1, X_2, \ldots$ be i.i.d. $\mathbb R^d$-valued random variables (copies of $X$), $X \in \mathcal X = \mathbb R^d$, with expectation $\mu = EX$ and covariance matrix $\Sigma = EXX^T - \mu\mu^T$.

Central limit theorem in $\mathbb R^d$. We have
$$\sqrt n(\bar X_n - \mu) \stackrel{\mathcal L}{\to} N(0, \Sigma),$$
i.e.,
$$\sqrt n\,a^T(\bar X_n - \mu) \stackrel{\mathcal L}{\to} N(0, a^T\Sigma a), \quad \text{for all } a \in \mathbb R^d.$$

6.3 Donsker's Theorem

Let $\mathcal X = \mathbb R$. Recall the definitions of the distribution function $F$ and the empirical distribution function $F_n$:
$$F(t) = P(X \le t), \quad F_n(t) = \frac 1 n \#\{X_i \le t,\ 1 \le i \le n\}, \quad t \in \mathbb R.$$
Define
$$W_n(t) := \sqrt n\big(F_n(t) - F(t)\big), \quad t \in \mathbb R.$$
By the central limit theorem in $\mathbb R$ (Section 6.1), for all $t$,
$$W_n(t) \stackrel{\mathcal L}{\to} N\big(0, F(t)(1 - F(t))\big).$$
Also, by the central limit theorem in $\mathbb R^2$ (Section 6.2), for all $s < t$,
$$\begin{pmatrix} W_n(s) \\ W_n(t) \end{pmatrix} \stackrel{\mathcal L}{\to} N(0, \Sigma(s, t)),$$
where
$$\Sigma(s, t) = \begin{pmatrix} F(s)(1 - F(s)) & F(s)(1 - F(t)) \\ F(s)(1 - F(t)) & F(t)(1 - F(t)) \end{pmatrix}.$$

We are now going to consider the stochastic process $W_n = \{W_n(t) : t \in \mathbb R\}$. The process $W_n$ is called the classical empirical process.

Definition 6.3.1. Let $K_0$ be the collection of bounded functions on $[0,1]$. The stochastic process $B \in K_0$ is called the standard Brownian bridge if
- $B(0) = B(1) = 0$,
- for all $r$ and all $t_1, \ldots, t_r \in (0, 1)$, the vector $(B(t_1), \ldots, B(t_r))$ is multivariate normal with mean zero,
- for all $s \le t$, $\mathrm{cov}(B(s), B(t)) = s(1 - t)$,
- the sample paths of $B$ are a.s. continuous.

We now consider the process $W_F$ defined as
$$W_F(t) = B(F(t)), \quad t \in \mathbb R.$$
Thus, $W_F = B \circ F$.

Donsker's theorem. Consider $W_n$ and $W_F$ as elements of the space $K$ of bounded functions on $\mathbb R$. We have
$$W_n \stackrel{\mathcal L}{\to} W_F,$$
that is, for all continuous and bounded functions $f$,
$$Ef(W_n) \to Ef(W_F).$$

Reflection. Suppose $F$ is continuous. Then, since $B$ is almost surely continuous, also $W_F = B \circ F$ is almost surely continuous. So $W_n$ must be approximately continuous as well, in some sense. Indeed, we have for any $t$ and any sequence $t_n$ converging to $t$,
$$W_n(t_n) - W_n(t) \stackrel{\mathbf P}{\to} 0.$$
This is called asymptotic continuity.
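Donsker's theorem can be glimpsed by simulation. The sketch below (our own illustration; uniform observations so that $F(t) = t$, a finite grid for the Brownian bridge, and an arbitrary seed) compares the distribution of $\sup_t |W_n(t)|$ with that of $\sup_t |B(t)|$:

```python
import numpy as np

rng = np.random.default_rng(8)                      # arbitrary seed
n, reps = 500, 2000

# sup_t |W_n(t)| for uniform samples (F(t) = t, so W_F = B).
sup_Wn = np.empty(reps)
for r in range(reps):
    U = np.sort(rng.uniform(size=n))
    i = np.arange(1, n + 1)
    # The sup over t is attained at the order statistics (as for D_n in Chapter 1).
    sup_Wn[r] = np.sqrt(n) * max(np.max(i / n - U), np.max(U - (i - 1) / n))

# sup_t |B(t)| for a Brownian bridge on a grid: B(t) = W(t) - t W(1).
grid = 1000
incr = rng.standard_normal((reps, grid)) / np.sqrt(grid)
W = np.cumsum(incr, axis=1)
B = W - np.linspace(1 / grid, 1.0, grid) * W[:, -1:]
print(np.quantile(sup_Wn, 0.95), np.quantile(np.abs(B).max(axis=1), 0.95))  # both ~ 1.36
```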

6.4 Donsker classes

Let $X_1, \ldots, X_n, \ldots$ be i.i.d. copies of a random variable $X$, with values in the space $\mathcal X$, and with distribution $P$. Consider a class $\mathcal G$ of functions $g : \mathcal X \to \mathbb R$. The theoretical mean of a function $g$ is $Pg := Eg(X)$, and the empirical average based on the $n$ observations $X_1, \ldots, X_n$ is
$$P_n g := \frac 1 n \sum_{i=1}^n g(X_i).$$
Here $P_n$ is the empirical distribution based on $X_1, \ldots, X_n$.

Definition 6.4.1. The empirical process indexed by $\mathcal G$ is
$$\nu_n(g) := \sqrt n(P_n g - Pg), \quad g \in \mathcal G.$$
Let us recall the central limit theorem for $g$ fixed. Denote the variance of $g(X)$ by
$$\sigma^2(g) := \mathrm{var}(g(X)) = Pg^2 - (Pg)^2.$$
If $\sigma^2(g) < \infty$, we have
$$\nu_n(g) \stackrel{\mathcal L}{\to} N(0, \sigma^2(g)).$$
The central limit theorem also holds for finitely many $g$ simultaneously. Let $g_k$ and $g_l$ be two functions, and denote the covariance between $g_k(X)$ and $g_l(X)$ by
$$\sigma(g_k, g_l) := \mathrm{cov}(g_k(X), g_l(X)) = Pg_kg_l - Pg_kPg_l.$$
Then, whenever $\sigma^2(g_k) < \infty$ for $k = 1, \ldots, r$,
$$\begin{pmatrix} \nu_n(g_1) \\ \vdots \\ \nu_n(g_r) \end{pmatrix} \stackrel{\mathcal L}{\to} N(0, \Sigma_{g_1,\ldots,g_r}), \tag{*}$$
where $\Sigma_{g_1,\ldots,g_r}$ is the variance-covariance matrix
$$\Sigma_{g_1,\ldots,g_r} = \begin{pmatrix} \sigma^2(g_1) & \cdots & \sigma(g_1, g_r) \\ \vdots & & \vdots \\ \sigma(g_1, g_r) & \cdots & \sigma^2(g_r) \end{pmatrix}.$$

Definition 6.4.2. Let $\nu$ be a Gaussian process indexed by $\mathcal G$. Assume that for each $r \in \mathbb N$ and for each finite collection $\{g_1, \ldots, g_r\} \subset \mathcal G$, the $r$-dimensional vector $(\nu(g_1), \ldots, \nu(g_r))$ has a $N(0, \Sigma_{g_1,\ldots,g_r})$-distribution, with $\Sigma_{g_1,\ldots,g_r}$ defined in (*). We then call $\nu$ the $P$-Brownian bridge indexed by $\mathcal G$.

Definition 6.4.3. Consider $\nu_n$ and $\nu$ as bounded functions on $\mathcal G$. We call $\mathcal G$ a $P$-Donsker class if $\nu_n \stackrel{\mathcal L}{\to} \nu$, that is, if for all continuous and bounded functions $f$, we have
$$Ef(\nu_n) \to Ef(\nu).$$

Definition 6.4.4. The process $\nu_n$ on $\mathcal G$ is called asymptotically continuous if for all $g_0 \in \mathcal G$, and all (possibly random) sequences $\{g_n\} \subset \mathcal G$ with $\sigma(g_n - g_0) \stackrel{\mathbf P}{\to} 0$, we have
$$\nu_n(g_n) - \nu_n(g_0) \stackrel{\mathbf P}{\to} 0.$$
We will use the notation
$$\|g\|_{2,P}^2 := Pg^2,$$
i.e., $\|\cdot\|_{2,P}$ is the $L_2(P)$-norm.

Remark. Note that $\sigma(g) \le \|g\|_{2,P}$.

Definition 6.4.5. The class $\mathcal G$ is called totally bounded for the metric
$$d_{2,P}(g, \tilde g) := \|g - \tilde g\|_{2,P}$$
induced by $\|\cdot\|_{2,P}$, if its entropy $\log N(\cdot, \mathcal G, d_{2,P})$ is finite.

Theorem 6.4.6. Suppose that $\mathcal G$ is totally bounded. Then $\mathcal G$ is a $P$-Donsker class if and only if $\nu_n$ (as process on $\mathcal G$) is asymptotically continuous.

6.5 Chaining and the increments of the empirical process

6.5.1 Chaining

We will consider the increments of the symmetrized empirical process in Section 6.5.2. There, we will work conditionally on $\mathbf X = (X_1, \ldots, X_n)$. We now describe the chaining technique in this context. Let
$$\|g\|_{2,n}^2 := P_n g^2.$$
Let $d_{2,n}(g, \tilde g) = \|g - \tilde g\|_{2,n}$, i.e., $d_{2,n}$ is the metric induced by $\|\cdot\|_{2,n}$. Suppose that $\|g\|_{2,n} \le R$ for all $g \in \mathcal G$. For notational convenience, we index the functions in $\mathcal G$ by a parameter $\theta \in \Theta$:
$$\mathcal G = \{g_\theta : \theta \in \Theta\}.$$

Let, for $s = 0, 1, 2, \ldots$, $\{g_j^s\}_{j=1}^{N_s}$ be a minimal $2^{-s}R$-covering set of $(\mathcal G, d_{2,n})$. So $N_s = N(2^{-s}R, \mathcal G, d_{2,n})$, and for each $\theta$ there exists a $g_\theta^s \in \{g_1^s, \ldots, g_{N_s}^s\}$ such that
$$\|g_\theta - g_\theta^s\|_{2,n} \le 2^{-s}R.$$
We use the parameter $\theta$ here to indicate which function in the covering set approximates a particular $g$. We may choose $g_\theta^0 \equiv 0$, since $\|g_\theta\|_{2,n} \le R$. Then for any $S$,
$$g_\theta = \sum_{s=1}^S (g_\theta^s - g_\theta^{s-1}) + (g_\theta - g_\theta^S).$$
One can think of this as telescoping from $g_\theta$ to $g_\theta^S$, i.e., we follow a path taking smaller and smaller steps. As $S \to \infty$, we have
$$\max_{1 \le i \le n} |g_\theta(X_i) - g_\theta^S(X_i)| \to 0.$$
The term $\sum_{s=1}^\infty (g_\theta^s - g_\theta^{s-1})$ can be handled by exploiting the fact that, as $\theta$ varies, each summand involves only finitely many functions.

6.5.2 Increments of the symmetrized process

We use the notation
$$a \vee b = \max\{a, b\}, \quad a \wedge b = \min\{a, b\}.$$

Lemma 6.5.2. On the set $\sup_{g \in \mathcal G} \|g\|_{2,n} \le R$, and for
$$\sqrt n\,\delta \ge 14\sum_{s=1}^\infty 2^{-s}R\sqrt{\log N(2^{-s}R, \mathcal G, d_{2,n})}\ \vee\ \sqrt{70\log 2}\,R,$$
we have
$$\mathbf P_{\mathbf X}\Big(\sup_{g \in \mathcal G} |P_n^\sigma g| \ge \delta\Big) \le 4\exp\Big[-\frac{n\delta^2}{70R^2}\Big].$$

Proof. Let $\{g_j^s\}_{j=1}^{N_s}$ be a minimal $2^{-s}R$-covering set of $\mathcal G$, $s = 0, 1, \ldots$. So $N_s = N(2^{-s}R, \mathcal G, d_{2,n})$. Now, use chaining: write
$$g_\theta = \sum_{s=1}^\infty (g_\theta^s - g_\theta^{s-1}).$$
Note that, by the triangle inequality,
$$\|g_\theta^s - g_\theta^{s-1}\|_{2,n} \le \|g_\theta^s - g_\theta\|_{2,n} + \|g_\theta - g_\theta^{s-1}\|_{2,n} \le 2^{-s}R + 2^{-s+1}R = 3 \cdot 2^{-s}R.$$
Let $\{\eta_s\}$ be positive numbers satisfying $\sum_{s=1}^\infty \eta_s \le 1$. Then, by Hoeffding's inequality (applied conditionally on $\mathbf X$) and a union bound over the at most $N_s^2$ pairs $(g_\theta^s, g_\theta^{s-1})$,
$$\mathbf P_{\mathbf X}\Big(\sup_{\theta \in \Theta} |P_n^\sigma g_\theta| \ge \delta\Big) \le \sum_{s=1}^\infty \mathbf P_{\mathbf X}\Big(\sup_{\theta \in \Theta} |P_n^\sigma(g_\theta^s - g_\theta^{s-1})| \ge \delta\eta_s\Big) \le \sum_{s=1}^\infty 2\exp\Big[2\log N_s - \frac{n\delta^2\eta_s^2}{18 \cdot 2^{-2s}R^2}\Big].$$
What is a good choice for $\eta_s$? We take
$$\eta_s = \frac{7 \cdot 2^{-s}R\sqrt{\log N_s}}{\sqrt n\,\delta} + \frac{2^{-s}\sqrt s}8.$$
Then indeed, by our condition on $\sqrt n\,\delta$,
$$\sum_{s=1}^\infty \eta_s = \sum_{s=1}^\infty \frac{7 \cdot 2^{-s}R\sqrt{\log N_s}}{\sqrt n\,\delta} + \sum_{s=1}^\infty \frac{2^{-s}\sqrt s}8 \le \frac 1 2 + \frac 1 2 = 1.$$
Here, we used the bound
$$\sum_{s=1}^\infty 2^{-s}\sqrt s \le 1 + \int_0^\infty 2^{-x}\sqrt x\,dx = 1 + \frac{\sqrt\pi/2}{(\log 2)^{3/2}} \le 4.$$

Observe that
$$\eta_s \ge \frac{7 \cdot 2^{-s}R\sqrt{\log N_s}}{\sqrt n\,\delta},$$
so that
$$2\log N_s \le \frac{2n\delta^2\eta_s^2}{49 \cdot 2^{-2s}R^2}.$$
Thus, since $\frac 1{18} - \frac 2{49} \ge \frac 1{70}$,
$$\sum_{s=1}^\infty 2\exp\Big[2\log N_s - \frac{n\delta^2\eta_s^2}{18 \cdot 2^{-2s}R^2}\Big] \le \sum_{s=1}^\infty 2\exp\Big[-\frac{n\delta^2\eta_s^2}{70 \cdot 2^{-2s}R^2}\Big].$$
Next, invoke that $\eta_s \ge 2^{-s}\sqrt s/8$; summing the resulting geometric series gives
$$\sum_{s=1}^\infty 2\exp\Big[-\frac{n\delta^2 s}{70R^2}\Big] = \frac{2\exp[-n\delta^2/(70R^2)]}{1 - \exp[-n\delta^2/(70R^2)]} \le 4\exp\Big[-\frac{n\delta^2}{70R^2}\Big],$$
where in the last inequality, we used the assumption that $n\delta^2 \ge 70R^2\log 2$. Thus, we have shown that
$$\mathbf P_{\mathbf X}\Big(\sup_{g \in \mathcal G} |P_n^\sigma g| \ge \delta\Big) \le 4\exp\Big[-\frac{n\delta^2}{70R^2}\Big]. \qquad \square$$

Remark. It is easy to see that
$$\sum_{s=1}^\infty 2^{-s}R\sqrt{\log N(2^{-s}R, \mathcal G, d_{2,n})} \le 2\int_0^R \sqrt{\log N(u, \mathcal G, d_{2,n})}\,du.$$

6.5.3 Asymptotic equicontinuity of the empirical process

Fix some $g_0 \in \mathcal G$ and let
$$\mathcal G(\delta) = \{g \in \mathcal G : \|g - g_0\|_{2,P} \le \delta\}.$$
Write $H(\cdot, \mathcal G, d_{2,n}) := \log N(\cdot, \mathcal G, d_{2,n})$ for the (random) entropy, and let $H(u)$, $u > 0$, be a given nonrandom function.

Lemma 6.5.3. Suppose that $\mathcal G$ has envelope $G$, with $PG^2 < \infty$, and that, for all $\delta > 0$,
$$\frac 1 n \log N(\delta, \mathcal G, d_{2,n}) \stackrel{\mathbf P}{\to} 0.$$
Then for each $\delta > 0$ fixed (i.e., not depending on $n$), and for
$$a \ge 4 \cdot 28\,A^{1/2}\Big(\int_0^{2\delta} H^{1/2}(u)\,du\ \vee\ 2\delta\Big),$$
we have
$$\limsup_{n \to \infty} \mathbf P\Big(\sup_{g \in \mathcal G(\delta)} |\nu_n(g) - \nu_n(g_0)| > a\Big) \le 16\exp\Big[-\frac{a^2}{4480\,\delta^2}\Big] + \limsup_{n \to \infty} \mathbf P\Big(\sup_{u > 0} \frac{H(u, \mathcal G, d_{2,n})}{H(u)} > A\Big).$$

Proof. The conditions $PG^2 < \infty$ and $\frac 1 n \log N(\delta, \mathcal G, d_{2,n}) \stackrel{\mathbf P}{\to} 0$ imply that
$$\sup_{g \in \mathcal G} \big|\,\|g - g_0\|_{2,n} - \|g - g_0\|_{2,P}\big| \stackrel{\mathbf P}{\to} 0.$$
So eventually, for each fixed $\delta > 0$, with large probability,
$$\sup_{g \in \mathcal G(\delta)} \|g - g_0\|_{2,n} \le 2\delta.$$
The result now follows from Lemma 6.5.2. $\square$

Our next step is proving asymptotic equicontinuity of the empirical process. This means that we shall take $a$ small in Lemma 6.5.3, which is possible if the entropy integral converges. Assume that
$$\int_0^1 H^{1/2}(u)\,du < \infty,$$
and define
$$J(\delta) = \int_0^\delta H^{1/2}(u)\,du.$$
Roughly speaking, the increment at $g_0$ of the empirical process $\nu_n(g)$ behaves like $J(\delta)$ for $\|g - g_0\| \le \delta$. So, since $J(\delta) \to 0$ as $\delta \downarrow 0$, the increments can be made arbitrarily small by taking $\delta$ small enough.

Theorem 6.5.3.2. Suppose that $\mathcal G$ has envelope $G$ with $PG^2 < \infty$. Suppose that
$$\lim_{A \to \infty} \limsup_{n \to \infty} \mathbf P\Big(\sup_{u > 0} \frac{H(u, \mathcal G, d_{2,n})}{H(u)} > A\Big) = 0.$$
Also assume
$$\int_0^1 H^{1/2}(u)\,du < \infty.$$
Then the empirical process $\nu_n$ is asymptotically continuous at $g_0$, i.e., for all $\eta > 0$, there exists a $\delta > 0$ such that
$$\limsup_{n \to \infty} \mathbf P\Big(\sup_{g \in \mathcal G(\delta)} |\nu_n(g) - \nu_n(g_0)| > \eta\Big) < \eta.$$

Proof. Take $A$ sufficiently large, such that
$$16\exp[-A] < \frac\eta 2 \quad \text{and} \quad \limsup_{n \to \infty} \mathbf P\Big(\sup_{u > 0} \frac{H(u, \mathcal G, d_{2,n})}{H(u)} > A\Big) < \frac\eta 2.$$
Next, take $\delta$ sufficiently small, such that
$$4 \cdot 28\,A^{1/2}J(2\delta) < \eta.$$
Then, by Lemma 6.5.3,
$$\limsup_{n \to \infty} \mathbf P\Big(\sup_{g \in \mathcal G(\delta)} |\nu_n(g) - \nu_n(g_0)| > \eta\Big) \le 16\exp\Big[-\frac{\eta^2}{4480\,\delta^2}\Big] + \frac\eta 2 \le 16\exp[-A] + \frac\eta 2 < \eta,$$

where we used $J(2\delta) \ge 2\delta$ (so that $\eta^2 \ge (4 \cdot 28)^2 A J^2(2\delta) \ge 4480\,A\delta^2$). $\square$

Remark. Because the conditions in Theorem 6.5.3.2 do not depend on $g_0$, its result holds for each $g_0$, i.e., we have in fact shown that $\nu_n$ is asymptotically continuous.

6.5.4 Application to VC graph classes

Theorem 6.5.4. Suppose that $\mathcal G$ is a VC graph class with envelope $G = \sup_{g \in \mathcal G} |g|$ satisfying $PG^2 < \infty$. Then $\{\nu_n(g) : g \in \mathcal G\}$ is asymptotically continuous, and so $\mathcal G$ is $P$-Donsker.

Proof. Apply Theorem 6.5.3.2 and Theorem 4.4.2. $\square$

Remark. In particular, suppose that the VC graph class $\mathcal G$ with square-integrable envelope $G$ is parametrized by $\theta$ in some parameter space $\Theta \subset \mathbb R^r$, i.e., $\mathcal G = \{g_\theta : \theta \in \Theta\}$. Let $z_n(\theta) = \nu_n(g_\theta)$. Question: do we have that, for a random sequence $\theta_n$ with $\theta_n \to \theta_0$ in probability, also
$$z_n(\theta_n) - z_n(\theta_0) \stackrel{\mathbf P}{\to} 0\,?$$
Indeed, if $\|g_\theta - g_{\theta_0}\|_{2,P} \to 0$ as $\theta$ converges to $\theta_0$, the answer is yes.

7 Asymptotic normality of M-estimators

Consider an M-estimator $\hat\theta_n$ of a finite-dimensional parameter $\theta_0$. We will give conditions for asymptotic normality of $\hat\theta_n$. It turns out that these conditions in fact imply asymptotic linearity. Our first set of conditions includes differentiability in $\theta$, at each $x$, of the loss function $\gamma_\theta(x)$. The proof of asymptotic normality is then the easiest. In the second set of conditions, only differentiability in quadratic mean of $\gamma_\theta$ is required. The results of the previous chapter (asymptotic continuity) supply us with an elegant way to handle remainder terms in the proofs. In this chapter, we assume that $\theta_0$ is an interior point of $\Theta \subset \mathbb R^r$. Moreover, we assume that we already showed that $\hat\theta_n$ is consistent.

7.1 Asymptotic linearity

Definition 7.1.1. The sequence of estimators $\hat\theta_n$ of $\theta_0$ is called asymptotically linear if we may write
$$\sqrt n(\hat\theta_n - \theta_0) = \sqrt n\,P_n l + o_{\mathbf P}(1),$$
where
$$l = \begin{pmatrix} l_1 \\ \vdots \\ l_r \end{pmatrix} : \mathcal X \to \mathbb R^r$$
satisfies $Pl = 0$ and $Pl_k^2 < \infty$, $k = 1, \ldots, r$. The function $l$ is then called the influence function. For the case $r = 1$, we call $\sigma^2 := Pl^2$ the asymptotic variance.

Definition 7.1.2. Let $\hat\theta_{n,1}$ and $\hat\theta_{n,2}$ be two asymptotically linear estimators of $\theta_0$, with asymptotic variance $\sigma_1^2$ and $\sigma_2^2$ respectively. Then
$$e_{1,2} := \frac{\sigma_2^2}{\sigma_1^2}$$
is called the asymptotic relative efficiency of $\hat\theta_{n,1}$ as compared to $\hat\theta_{n,2}$.

7.2 Conditions a, b and c for asymptotic normality

We start with 3 conditions a, b and c, which are easier to check but more stringent. We later relax them to conditions A, B and C.

Condition a. There exists an $\epsilon > 0$ such that $\theta \mapsto \gamma_\theta(x)$ is differentiable for all $\|\theta - \theta_0\| < \epsilon$ and all $x$, with derivative
$$\psi_\theta(x) := \frac\partial{\partial\theta}\gamma_\theta(x), \quad x \in \mathcal X.$$

Condition b. We have, as $\theta \to \theta_0$,
$$P(\psi_\theta - \psi_{\theta_0}) = V(\theta - \theta_0) + o(\|\theta - \theta_0\|),$$
where $V$ is a positive definite matrix.

Condition c. There exists an $\epsilon > 0$ such that the class $\{\psi_\theta : \|\theta - \theta_0\| < \epsilon\}$ is $P$-Donsker with envelope $\Psi$ satisfying $P\Psi^2 < \infty$. Moreover,
$$\lim_{\theta \to \theta_0} \|\psi_\theta - \psi_{\theta_0}\|_{2,P} = 0.$$

Lemma 7.2.1. Suppose conditions a, b and c. Then $\hat\theta_n$ is asymptotically linear with influence function
$$l = -V^{-1}\psi_{\theta_0},$$
so
$$\sqrt n(\hat\theta_n - \theta_0) \stackrel{\mathcal L}{\to} N(0, V^{-1}JV^{-1}),$$

where $J = P\psi_{\theta_0}\psi_{\theta_0}^T$.

Proof. Recall that $\theta_0$ is an interior point of $\Theta$, and minimizes $P\gamma_\theta$, so that $P\psi_{\theta_0} = 0$. Because $\hat\theta_n$ is consistent, it is eventually a solution of the score equations
$$P_n\psi_{\hat\theta_n} = 0.$$
Rewrite the score equations as
$$0 = P_n\psi_{\hat\theta_n} = P_n(\psi_{\hat\theta_n} - \psi_{\theta_0}) + P_n\psi_{\theta_0} = (P_n - P)(\psi_{\hat\theta_n} - \psi_{\theta_0}) + P\psi_{\hat\theta_n} + P_n\psi_{\theta_0}.$$
Now, use condition b and the asymptotic equicontinuity of $\{\psi_\theta : \|\theta - \theta_0\| \le \epsilon\}$ (see Chapter 6), to obtain
$$0 = o_{\mathbf P}(n^{-1/2}) + V(\hat\theta_n - \theta_0) + o(\|\hat\theta_n - \theta_0\|) + P_n\psi_{\theta_0}.$$
This yields
$$\hat\theta_n - \theta_0 = -V^{-1}P_n\psi_{\theta_0} + o_{\mathbf P}(n^{-1/2}). \qquad \square$$

Example: Huber estimator. Let $\mathcal X = \mathbb R$, $\Theta = \mathbb R$. The Huber estimator corresponds to the loss function $\gamma_\theta(x) = \gamma(x - \theta)$, with
$$\gamma(x) = x^2\,\mathbf 1\{|x| \le k\} + (2k|x| - k^2)\,\mathbf 1\{|x| > k\}, \quad x \in \mathbb R.$$
Here, $0 < k < \infty$ is some fixed constant, chosen by the statistician. We will now verify a, b and c.

(a)
$$\psi_\theta(x) = \begin{cases} +2k & \text{if } x - \theta \le -k, \\ -2(x - \theta) & \text{if } |x - \theta| \le k, \\ -2k & \text{if } x - \theta \ge k. \end{cases}$$

(b) We have
$$\frac d{d\theta}\int\psi_\theta\,dP = 2\big(F(k + \theta) - F(-k + \theta)\big),$$
where $F(t) = P(X \le t)$, $t \in \mathbb R$, is the distribution function. So
$$V = 2\big(F(k + \theta_0) - F(-k + \theta_0)\big).$$

(c) Clearly, $\{\psi_\theta : \theta \in \mathbb R\}$ is a VC graph class, with envelope $\Psi \equiv 2k$.

So the Huber estimator $\hat\theta_n$ has influence function
$$l(x) = \begin{cases} \dfrac{-k}{F(k + \theta_0) - F(-k + \theta_0)} & \text{if } x - \theta_0 \le -k, \\[1ex] \dfrac{x - \theta_0}{F(k + \theta_0) - F(-k + \theta_0)} & \text{if } |x - \theta_0| \le k, \\[1ex] \dfrac{k}{F(k + \theta_0) - F(-k + \theta_0)} & \text{if } x - \theta_0 \ge k. \end{cases}$$
The asymptotic variance is
$$\sigma^2 = \frac{k^2F(-k + \theta_0) + \int_{-k + \theta_0}^{k + \theta_0}(x - \theta_0)^2\,dF(x) + k^2\big(1 - F(k + \theta_0)\big)}{\big(F(k + \theta_0) - F(-k + \theta_0)\big)^2}.$$
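The Huber estimator itself is easy to compute numerically, since $\theta \mapsto P_n\gamma_\theta$ is convex. A sketch (our own; SciPy assumed, and the cutoff $k = 1.345$ and the sample are arbitrary illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(10)                     # arbitrary seed
k = 1.345                                           # arbitrary cutoff chosen by the statistician
X = rng.standard_normal(2000) + 5.0                 # location 5

def huber_loss(u, k):
    """gamma(u) = u^2 for |u| <= k, and 2k|u| - k^2 otherwise."""
    return np.where(np.abs(u) <= k, u**2, 2 * k * np.abs(u) - k**2)

theta_hat = minimize_scalar(lambda th: np.mean(huber_loss(X - th, k))).x
print(theta_hat)                                    # close to 5
```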

7.3 Asymptotics for the median

The median (see Example (i.b) in Chapter 5) can be regarded as the limiting case of a Huber estimator, with $k \downarrow 0$. However, the loss function $\gamma_\theta(x) = |x - \theta|$ is not differentiable, i.e., it does not satisfy condition a. For even sample sizes, we do nevertheless have the score equation
$$F_n(\hat\theta_n) - \frac 1 2 = 0.$$
Let us investigate this more closely.

Let $X \in \mathbb R$ have distribution $F$, and let $F_n$ be the empirical distribution function. The population median $\theta_0$ is a solution of the equation
$$F(\theta_0) - \frac 1 2 = 0.$$
We assume this solution exists, and also that $F$ has positive density $f$ in a neighborhood of $\theta_0$. Consider now, for simplicity, even sample sizes $n$, and let the sample median $\hat\theta_n$ be any solution of
$$F_n(\hat\theta_n) - \frac 1 2 = 0.$$
Then we get
$$0 = F_n(\hat\theta_n) - F(\theta_0) = \big[F_n(\hat\theta_n) - F(\hat\theta_n)\big] + \big[F(\hat\theta_n) - F(\theta_0)\big] = \frac 1{\sqrt n}W_n(\hat\theta_n) + \big[F(\hat\theta_n) - F(\theta_0)\big],$$
where $W_n = \sqrt n(F_n - F)$ is the empirical process. Since $F$ is continuous at $\theta_0$, and $\hat\theta_n \stackrel{\mathbf P}{\to} \theta_0$, we have, by the asymptotic continuity of the empirical process (Section 6.3), that
$$W_n(\hat\theta_n) = W_n(\theta_0) + o_{\mathbf P}(1).$$
We thus arrive at
$$0 = W_n(\theta_0) + \sqrt n\big[F(\hat\theta_n) - F(\theta_0)\big] + o_{\mathbf P}(1) = W_n(\theta_0) + \sqrt n\,[f(\theta_0) + o(1)]\,[\hat\theta_n - \theta_0] + o_{\mathbf P}(1).$$
In other words,
$$\sqrt n(\hat\theta_n - \theta_0) = -\frac{W_n(\theta_0)}{f(\theta_0)} + o_{\mathbf P}(1).$$
So the influence function is
$$l(x) = \begin{cases} +\dfrac 1{2f(\theta_0)} & \text{if } x > \theta_0, \\[1ex] -\dfrac 1{2f(\theta_0)} & \text{if } x \le \theta_0, \end{cases}$$
and the asymptotic variance is
$$\sigma^2 = \frac 1{4f(\theta_0)^2}.$$
We can now compare median and mean. It is easily seen that the asymptotic relative efficiency of the mean as compared to the median is
$$e_{1,2} = \frac 1{4\sigma_0^2 f(\theta_0)^2},$$
where $\sigma_0^2 = \mathrm{var}(X)$. So $e_{1,2} = \pi/2$ for the normal distribution, and $e_{1,2} = 1/2$ for the double exponential (Laplace) distribution. The density of the double exponential distribution is
$$f(x) = \frac 1{\sqrt{2\sigma_0^2}}\exp\Big[-\sqrt{\frac 2{\sigma_0^2}}\,|x - \theta_0|\Big], \quad x \in \mathbb R.$$
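The formula $\sigma^2 = 1/(4f(\theta_0)^2)$ is easily checked by simulation. For standard normal data, $f(\theta_0) = 1/\sqrt{2\pi}$, so the asymptotic variance of the median is $\pi/2$; a sketch (our own, arbitrary seed and sample sizes):

```python
import numpy as np

rng = np.random.default_rng(9)                      # arbitrary seed
n, reps = 1001, 5000

# Standard normal: theta_0 = 0, f(theta_0) = 1/sqrt(2 pi),
# so 1/(4 f(theta_0)^2) = pi/2.
medians = np.median(rng.standard_normal((reps, n)), axis=1)
print(n * medians.var(), np.pi / 2)                 # simulated vs. theoretical variance

# Asymptotic relative efficiency of the mean vs. the median, e_{1,2} = pi/2.
means = rng.standard_normal((reps, n)).mean(axis=1)
print((n * medians.var()) / (n * means.var()))      # roughly 1.57
```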

7.4 Conditions A, B and C for asymptotic normality

We are now going to relax the condition of differentiability of $\gamma_\theta$.

Condition A (differentiability in quadratic mean). There exists a function $\psi_0 : \mathcal X \to \mathbb R^r$, with $P\psi_{0,k}^2 < \infty$, $k = 1, \ldots, r$, such that
$$\lim_{\theta \to \theta_0} \frac{\big\|\gamma_\theta - \gamma_{\theta_0} - (\theta - \theta_0)^T\psi_0\big\|_{2,P}}{\|\theta - \theta_0\|} = 0.$$

Condition B. We have, as $\theta \to \theta_0$,
$$P\gamma_\theta - P\gamma_{\theta_0} = \frac 1 2(\theta - \theta_0)^TV(\theta - \theta_0) + o(\|\theta - \theta_0\|^2),$$
with $V$ a positive definite matrix.

Condition C. Define for $\theta \ne \theta_0$,
$$g_\theta = \frac{\gamma_\theta - \gamma_{\theta_0}}{\|\theta - \theta_0\|}.$$
Suppose that for some $\epsilon > 0$, the class $\{g_\theta : 0 < \|\theta - \theta_0\| < \epsilon\}$ is a $P$-Donsker class with envelope $G$ satisfying $PG^2 < \infty$.

Lemma 7.4.1. Suppose conditions A, B and C are met. Then $\hat\theta_n$ has influence function
$$l = -V^{-1}\psi_0,$$
and so
$$\sqrt n(\hat\theta_n - \theta_0) \stackrel{\mathcal L}{\to} N(0, V^{-1}JV^{-1}),$$
where $J = \int \psi_0\psi_0^T\,dP$.

Proof. Since $\{g_\theta : \|\theta - \theta_0\| \le \epsilon\}$ is a $P$-Donsker class, and $\hat\theta_n$ is consistent, we may write
$$0 \ge P_n(\gamma_{\hat\theta_n} - \gamma_{\theta_0}) = (P_n - P)(\gamma_{\hat\theta_n} - \gamma_{\theta_0}) + P(\gamma_{\hat\theta_n} - \gamma_{\theta_0})$$
$$= (P_n - P)g_{\hat\theta_n}\,\|\hat\theta_n - \theta_0\| + P(\gamma_{\hat\theta_n} - \gamma_{\theta_0})$$
$$= (P_n - P)(\hat\theta_n - \theta_0)^T\psi_0 + o_{\mathbf P}(n^{-1/2})\|\hat\theta_n - \theta_0\| + P(\gamma_{\hat\theta_n} - \gamma_{\theta_0})$$
$$= (P_n - P)(\hat\theta_n - \theta_0)^T\psi_0 + o_{\mathbf P}(n^{-1/2})\|\hat\theta_n - \theta_0\| + \frac 1 2(\hat\theta_n - \theta_0)^TV(\hat\theta_n - \theta_0) + o(\|\hat\theta_n - \theta_0\|^2).$$
This implies $\|\hat\theta_n - \theta_0\| = O_{\mathbf P}(n^{-1/2})$. But then
$$\Big\|V^{1/2}(\hat\theta_n - \theta_0) + V^{-1/2}(P_n - P)\psi_0\Big\|^2 \le o_{\mathbf P}(n^{-1}).$$
Therefore,
$$\hat\theta_n - \theta_0 = -V^{-1}(P_n - P)\psi_0 + o_{\mathbf P}(n^{-1/2}).$$
Because $P\psi_0 = 0$, the result follows, and the asymptotic covariance matrix is $V^{-1}JV^{-1}$. $\square$

7.5 Exercises

Exercise 1. Suppose $X$ has the logistic distribution with location parameter $\theta$ (see Example (ii.b) of Chapter 5). Show that the maximum likelihood estimator has asymptotic variance equal to 3, and the median has asymptotic variance equal to 4. Hence, the asymptotic relative efficiency of the maximum likelihood estimator as compared to the median is 4/3.

Exercise 2. Let $(X_i, Y_i)$, $i = 1, \ldots, n$, be i.i.d. copies of $(X, Y)$, where $X \in \mathbb R^d$ and $Y \in \mathbb R$. Suppose that the conditional distribution of $Y$ given $X = x$ has median
$$m(x) = \alpha_0 + \alpha_1 x_1 + \cdots + \alpha_d x_d,$$
with
$$\alpha = \begin{pmatrix} \alpha_0 \\ \alpha_1 \\ \vdots \\ \alpha_d \end{pmatrix} \in \mathbb R^{d+1}.$$
Assume moreover that, given $X = x$, the random variable $Y - m(x)$ has a density $f$ not depending on $x$, with $f$ positive in a neighborhood of zero. Suppose moreover that
$$\Sigma = E\begin{pmatrix} 1 \\ X \end{pmatrix}\begin{pmatrix} 1 \\ X \end{pmatrix}^T$$
exists. Let
$$\hat\alpha_n = \arg\min_{a \in \mathbb R^{d+1}} \frac 1 n \sum_{i=1}^n |Y_i - a_0 - a_1X_{i,1} - \cdots - a_dX_{i,d}|$$
be the least absolute deviations (LAD) estimator. Show that
$$\sqrt n(\hat\alpha_n - \alpha) \stackrel{\mathcal L}{\to} N\Big(0, \frac{\Sigma^{-1}}{4f^2(0)}\Big),$$
by verifying conditions A, B and C.

8 Rates of convergence for least squares estimators

Probability inequalities for the least squares estimator are obtained, under conditions on the entropy of the class of regression functions. In the examples, we study smooth regression functions, functions of bounded variation, concave functions, analytic functions, and image restoration. Results for the entropies of various classes of functions are taken from the literature on approximation theory.

Let $Y_1, \ldots, Y_n$ be real-valued observations, satisfying
$$Y_i = g_0(z_i) + W_i, \quad i = 1, \ldots, n,$$
with $z_1, \ldots, z_n$ fixed covariates in a space $\mathcal Z$, $W_1, \ldots, W_n$ independent errors with expectation zero, and with the unknown regression function $g_0$ in a given class $\mathcal G$ of regression functions. The least squares estimator is
$$\hat g_n := \arg\min_{g \in \mathcal G} \sum_{i=1}^n \big(Y_i - g(z_i)\big)^2.$$
Throughout, we assume that a minimizer $\hat g_n \in \mathcal G$ of the sum of squares exists, but it need not be unique.

The following notation will be used. The empirical measure of the covariates is
$$Q_n := \frac 1 n \sum_{i=1}^n \delta_{z_i}.$$
For $g$ a function on $\mathcal Z$, we denote its squared $L_2(Q_n)$-norm by
$$\|g\|_n^2 := \|g\|_{2,Q_n}^2 := \frac 1 n \sum_{i=1}^n g^2(z_i).$$
The empirical inner product between error and regression function is written as
$$(w, g)_n = \frac 1 n \sum_{i=1}^n W_ig(z_i).$$
Finally, we let
$$\mathcal G(R) := \{g \in \mathcal G : \|g - g_0\|_n \le R\}$$
denote a ball around $g_0$ with radius $R$, intersected with $\mathcal G$.

The main idea to arrive at rates of convergence for $\hat g_n$ is to invoke the basic inequality
$$\|\hat g_n - g_0\|_n^2 \le 2(w, \hat g_n - g_0)_n.$$
The modulus of continuity of the process $\{(w, g - g_0)_n : g \in \mathcal G(R)\}$ can be derived from the entropy of $\mathcal G(R)$, endowed with the metric
$$d_n(g, \tilde g) := \|g - \tilde g\|_n.$$

8.1 Gaussian errors

When the errors are Gaussian, it is not hard to extend the maximal inequality of Lemma 6.5.2. Therefore, and to simplify the exposition, we will assume in this chapter that
$$W_1, \ldots, W_n \text{ are i.i.d. } N(0, 1)\text{-distributed}.$$
Then, as in Lemma 6.5.2, for
$$\sqrt n\,\delta \ge 28\int_0^R \sqrt{\log N(u, \mathcal G(R), d_n)}\,du\ \vee\ \sqrt{70\log 2}\,R,$$
we have
$$\mathbf P\Big(\sup_{g \in \mathcal G(R)} (w, g - g_0)_n \ge \delta\Big) \le 4\exp\Big[-\frac{n\delta^2}{70R^2}\Big]. \tag{*}$$

8.2 Rates of convergence

Define
$$J(\delta, \mathcal G(\delta), d_n) = \int_0^\delta \sqrt{\log N(u, \mathcal G(\delta), d_n)}\,du\ \vee\ \delta.$$

Theorem 8.2. Take $\Psi(\delta) \ge J(\delta, \mathcal G(\delta), d_n)$ in such a way that $\Psi(\delta)/\delta^2$ is a non-increasing function of $\delta$. Then for a constant $c_1$, and for
$$\sqrt n\,\delta_n^2 \ge c_1\Psi(\delta_n),$$
we have for all $\delta \ge \delta_n$,
$$\mathbf P(\|\hat g_n - g_0\|_n > \delta) \le c\exp\Big[-\frac{n\delta^2}{c^2}\Big].$$

Proof. We have, by the basic inequality,
$$\mathbf P(\|\hat g_n - g_0\|_n > \delta) \le \sum_{s=0}^\infty \mathbf P\Big(\sup_{g \in \mathcal G(2^{s+1}\delta)} (w, g - g_0)_n \ge 2^{2s-1}\delta^2\Big) := \sum_{s=0}^\infty P_s.$$
Now, if $\sqrt n\,\delta_n^2 \ge c_1\Psi(\delta_n)$, then also, for all $2^{s+1}\delta > \delta_n$,
$$\sqrt n\,2^{2s+2}\delta^2 \ge c_1\Psi(2^{s+1}\delta).$$
So, for an appropriate choice of $c_1$, we may apply (*) to each $P_s$. This gives, for some constants $c_2$, $c_3$,
$$\sum_{s=0}^\infty P_s \le \sum_{s=0}^\infty c_2\exp\Big[-\frac{n2^{4s-2}\delta^4}{c_2^2\,2^{2s+2}\delta^2}\Big] \le c_3\exp\Big[-\frac{n\delta^2}{c_3^2}\Big].$$
Take $c = \max\{c_1, c_2, c_3\}$. $\square$

8.3 Examples

Example 8.3.1 (Linear regression). Let
$$\mathcal G = \{g(z) = \theta_1\psi_1(z) + \cdots + \theta_r\psi_r(z) : \theta \in \mathbb R^r\}.$$
One may verify
$$\log N(u, \mathcal G(\delta), d_n) \le r\log\Big(\frac{\delta + 4u}u\Big), \quad \text{for all } 0 < u < \delta,\ \delta > 0.$$
So
$$\int_0^\delta \sqrt{\log N(u, \mathcal G(\delta), d_n)}\,du \le r^{1/2}\int_0^\delta \sqrt{\log\Big(\frac{\delta + 4u}u\Big)}\,du = r^{1/2}\delta\int_0^1 \sqrt{\log\Big(\frac{1 + 4v}v\Big)}\,dv := A_0r^{1/2}\delta.$$
So Theorem 8.2 can be applied with
$$\delta_n \ge cA_0\sqrt{\frac rn}.$$
It yields that for some constant $c$ (not the same at each appearance) and for all $T \ge c$,
$$\mathbf P\Big(\|\hat g_n - g_0\|_n > T\sqrt{\frac rn}\Big) \le c\exp\Big[-\frac{T^2r}{c^2}\Big].$$
Note that we made extensive use here of the fact that it suffices to calculate the local entropy of $\mathcal G$.
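The $\sqrt{r/n}$ rate of Example 8.3.1 shows up clearly in simulation. A sketch (our own illustration; polynomial basis functions $\psi_k(z) = z^{k-1}$ and Gaussian errors as in Section 8.1, arbitrary seed):

```python
import numpy as np

rng = np.random.default_rng(11)                     # arbitrary seed
r = 5                                               # dimension of the linear model

def ls_error(n, reps=200):
    """Monte Carlo estimate of E ||g_hat - g_0||_n for linear least squares."""
    errs = []
    for _ in range(reps):
        z = rng.uniform(size=n)
        Psi = z[:, None] ** np.arange(r)            # design: psi_k(z) = z^(k-1)
        theta0 = np.ones(r)
        y = Psi @ theta0 + rng.standard_normal(n)   # Gaussian errors
        theta_hat = np.linalg.lstsq(Psi, y, rcond=None)[0]
        errs.append(np.sqrt(np.mean((Psi @ (theta_hat - theta0)) ** 2)))
    return np.mean(errs)

for n in [100, 400, 1600]:
    print(n, ls_error(n), np.sqrt(r / n))           # the error tracks sqrt(r/n)
```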

Example 8.3.2 (Smooth functions). Let
$$\mathcal G = \Big\{g : [0,1] \to \mathbb R,\ \int \big(g^{(m)}(z)\big)^2\,dz \le M^2\Big\}.$$
Let $\psi_k(z) = z^{k-1}$, $k = 1, \ldots, m$, $\psi(z) = (\psi_1(z), \ldots, \psi_m(z))^T$, and
$$\Sigma_n = \int \psi\psi^T\,dQ_n.$$
Denote the smallest eigenvalue of $\Sigma_n$ by $\lambda_n$, and assume that $\lambda_n \ge \lambda > 0$, for all $n \ge n_0$. One can show (Kolmogorov and Tihomirov, 1959) that
$$\log N(\delta, \mathcal G(\delta), d_n) \le A\delta^{-1/m}, \quad \text{for small } \delta > 0,$$
where the constant $A$ depends on $\lambda$. Hence, we find from Theorem 8.2 that for $T \ge c$,
$$\mathbf P\Big(\|\hat g_n - g_0\|_n > Tn^{-\frac m{2m+1}}\Big) \le c\exp\Big[-\frac{T^2n^{\frac 1{2m+1}}}{c^2}\Big].$$

Example 8.3.3 (Functions of bounded variation in $\mathbb R$). Let
$$\mathcal G = \Big\{g : \mathbb R \to \mathbb R,\ \int |g'(z)|\,dz \le M\Big\}.$$
Without loss of generality, we may assume that $z_1 \le \cdots \le z_n$. The derivative should be understood in the generalized sense:
$$\int |g'(z)|\,dz := \sum_{i=2}^n |g(z_i) - g(z_{i-1})|.$$
Define for $g \in \mathcal G$,
$$\alpha := \int g\,dQ_n.$$
Then it is easy to see that
$$\max_{i=1,\ldots,n} |g(z_i)| \le |\alpha| + M.$$
One can now show (Birman and Solomjak, 1967) that
$$\log N(\delta, \mathcal G(\delta), d_n) \le A\delta^{-1}, \quad \text{for small } \delta > 0,$$
and therefore, for all $T \ge c$,
$$\mathbf P\big(\|\hat g_n - g_0\|_n > Tn^{-1/3}\big) \le c\exp\Big[-\frac{T^2n^{1/3}}{c^2}\Big].$$

Example 8.3.4 (Functions of bounded variation in $\mathbb R^2$). Suppose that $z_i = (u_k, v_l)$, $i = (k, l)$, $k = 1, \ldots, n_1$, $l = 1, \ldots, n_2$, $n = n_1n_2$, with $u_1 \le \cdots \le u_{n_1}$, $v_1 \le \cdots \le v_{n_2}$. Consider the class
$$\mathcal G = \{g : \mathbb R^2 \to \mathbb R,\ I(g) \le M\},$$
where
$$I(g) := I_0(g) + I_1(g_1) + I_2(g_2),$$
$$I_0(g) := \sum_{k=2}^{n_1}\sum_{l=2}^{n_2} \big|g(u_k, v_l) - g(u_{k-1}, v_l) - g(u_k, v_{l-1}) + g(u_{k-1}, v_{l-1})\big|,$$
$$g_1(u) := \frac 1{n_2}\sum_{l=1}^{n_2} g(u, v_l),$$


More information

Brownian Motion. 1 Definition Brownian Motion Wiener measure... 3

Brownian Motion. 1 Definition Brownian Motion Wiener measure... 3 Brownian Motion Contents 1 Definition 2 1.1 Brownian Motion................................. 2 1.2 Wiener measure.................................. 3 2 Construction 4 2.1 Gaussian process.................................

More information

Stability of optimization problems with stochastic dominance constraints

Stability of optimization problems with stochastic dominance constraints Stability of optimization problems with stochastic dominance constraints D. Dentcheva and W. Römisch Stevens Institute of Technology, Hoboken Humboldt-University Berlin www.math.hu-berlin.de/~romisch SIAM

More information

Part II Probability and Measure

Part II Probability and Measure Part II Probability and Measure Theorems Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed by the lecturers, and I have modified them (often significantly)

More information

Lecture 2. We now introduce some fundamental tools in martingale theory, which are useful in controlling the fluctuation of martingales.

Lecture 2. We now introduce some fundamental tools in martingale theory, which are useful in controlling the fluctuation of martingales. Lecture 2 1 Martingales We now introduce some fundamental tools in martingale theory, which are useful in controlling the fluctuation of martingales. 1.1 Doob s inequality We have the following maximal

More information

The main results about probability measures are the following two facts:

The main results about probability measures are the following two facts: Chapter 2 Probability measures The main results about probability measures are the following two facts: Theorem 2.1 (extension). If P is a (continuous) probability measure on a field F 0 then it has a

More information

Theory and Applications of Stochastic Systems Lecture Exponential Martingale for Random Walk

Theory and Applications of Stochastic Systems Lecture Exponential Martingale for Random Walk Instructor: Victor F. Araman December 4, 2003 Theory and Applications of Stochastic Systems Lecture 0 B60.432.0 Exponential Martingale for Random Walk Let (S n : n 0) be a random walk with i.i.d. increments

More information

2. Variance and Higher Moments

2. Variance and Higher Moments 1 of 16 7/16/2009 5:45 AM Virtual Laboratories > 4. Expected Value > 1 2 3 4 5 6 2. Variance and Higher Moments Recall that by taking the expected value of various transformations of a random variable,

More information

Selected Exercises on Expectations and Some Probability Inequalities

Selected Exercises on Expectations and Some Probability Inequalities Selected Exercises on Expectations and Some Probability Inequalities # If E(X 2 ) = and E X a > 0, then P( X λa) ( λ) 2 a 2 for 0 < λ

More information

Some basic elements of Probability Theory

Some basic elements of Probability Theory Chapter I Some basic elements of Probability Theory 1 Terminology (and elementary observations Probability theory and the material covered in a basic Real Variables course have much in common. However

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 25: Semiparametric Models

Introduction to Empirical Processes and Semiparametric Inference Lecture 25: Semiparametric Models Introduction to Empirical Processes and Semiparametric Inference Lecture 25: Semiparametric Models Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations

More information

Supplementary Material for Nonparametric Operator-Regularized Covariance Function Estimation for Functional Data

Supplementary Material for Nonparametric Operator-Regularized Covariance Function Estimation for Functional Data Supplementary Material for Nonparametric Operator-Regularized Covariance Function Estimation for Functional Data Raymond K. W. Wong Department of Statistics, Texas A&M University Xiaoke Zhang Department

More information

Riemann integral and volume are generalized to unbounded functions and sets. is an admissible set, and its volume is a Riemann integral, 1l E,

Riemann integral and volume are generalized to unbounded functions and sets. is an admissible set, and its volume is a Riemann integral, 1l E, Tel Aviv University, 26 Analysis-III 9 9 Improper integral 9a Introduction....................... 9 9b Positive integrands................... 9c Special functions gamma and beta......... 4 9d Change of

More information

A LOCALIZATION PROPERTY AT THE BOUNDARY FOR MONGE-AMPERE EQUATION

A LOCALIZATION PROPERTY AT THE BOUNDARY FOR MONGE-AMPERE EQUATION A LOCALIZATION PROPERTY AT THE BOUNDARY FOR MONGE-AMPERE EQUATION O. SAVIN. Introduction In this paper we study the geometry of the sections for solutions to the Monge- Ampere equation det D 2 u = f, u

More information

Lecture Notes in Advanced Calculus 1 (80315) Raz Kupferman Institute of Mathematics The Hebrew University

Lecture Notes in Advanced Calculus 1 (80315) Raz Kupferman Institute of Mathematics The Hebrew University Lecture Notes in Advanced Calculus 1 (80315) Raz Kupferman Institute of Mathematics The Hebrew University February 7, 2007 2 Contents 1 Metric Spaces 1 1.1 Basic definitions...........................

More information

Homework # , Spring Due 14 May Convergence of the empirical CDF, uniform samples

Homework # , Spring Due 14 May Convergence of the empirical CDF, uniform samples Homework #3 36-754, Spring 27 Due 14 May 27 1 Convergence of the empirical CDF, uniform samples In this problem and the next, X i are IID samples on the real line, with cumulative distribution function

More information

THE LINDEBERG-FELLER CENTRAL LIMIT THEOREM VIA ZERO BIAS TRANSFORMATION

THE LINDEBERG-FELLER CENTRAL LIMIT THEOREM VIA ZERO BIAS TRANSFORMATION THE LINDEBERG-FELLER CENTRAL LIMIT THEOREM VIA ZERO BIAS TRANSFORMATION JAINUL VAGHASIA Contents. Introduction. Notations 3. Background in Probability Theory 3.. Expectation and Variance 3.. Convergence

More information

Closest Moment Estimation under General Conditions

Closest Moment Estimation under General Conditions Closest Moment Estimation under General Conditions Chirok Han Victoria University of Wellington New Zealand Robert de Jong Ohio State University U.S.A October, 2003 Abstract This paper considers Closest

More information

MAS223 Statistical Inference and Modelling Exercises

MAS223 Statistical Inference and Modelling Exercises MAS223 Statistical Inference and Modelling Exercises The exercises are grouped into sections, corresponding to chapters of the lecture notes Within each section exercises are divided into warm-up questions,

More information

Closest Moment Estimation under General Conditions

Closest Moment Estimation under General Conditions Closest Moment Estimation under General Conditions Chirok Han and Robert de Jong January 28, 2002 Abstract This paper considers Closest Moment (CM) estimation with a general distance function, and avoids

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 59 Classical case: n d. Asymptotic assumption: d is fixed and n. Basic tools: LLN and CLT. High-dimensional setting: n d, e.g. n/d

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

is a Borel subset of S Θ for each c R (Bertsekas and Shreve, 1978, Proposition 7.36) This always holds in practical applications.

is a Borel subset of S Θ for each c R (Bertsekas and Shreve, 1978, Proposition 7.36) This always holds in practical applications. Stat 811 Lecture Notes The Wald Consistency Theorem Charles J. Geyer April 9, 01 1 Analyticity Assumptions Let { f θ : θ Θ } be a family of subprobability densities 1 with respect to a measure µ on a measurable

More information

The Heine-Borel and Arzela-Ascoli Theorems

The Heine-Borel and Arzela-Ascoli Theorems The Heine-Borel and Arzela-Ascoli Theorems David Jekel October 29, 2016 This paper explains two important results about compactness, the Heine- Borel theorem and the Arzela-Ascoli theorem. We prove them

More information

L p Spaces and Convexity

L p Spaces and Convexity L p Spaces and Convexity These notes largely follow the treatments in Royden, Real Analysis, and Rudin, Real & Complex Analysis. 1. Convex functions Let I R be an interval. For I open, we say a function

More information

Existence and Uniqueness

Existence and Uniqueness Chapter 3 Existence and Uniqueness An intellect which at a certain moment would know all forces that set nature in motion, and all positions of all items of which nature is composed, if this intellect

More information

Statistical inference on Lévy processes

Statistical inference on Lévy processes Alberto Coca Cabrero University of Cambridge - CCA Supervisors: Dr. Richard Nickl and Professor L.C.G.Rogers Funded by Fundación Mutua Madrileña and EPSRC MASDOC/CCA student workshop 2013 26th March Outline

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Chapter 8 Maximum Likelihood Estimation 8. Consistency If X is a random variable (or vector) with density or mass function f θ (x) that depends on a parameter θ, then the function f θ (X) viewed as a function

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Chapter 7 Maximum Likelihood Estimation 7. Consistency If X is a random variable (or vector) with density or mass function f θ (x) that depends on a parameter θ, then the function f θ (X) viewed as a function

More information

Statistical Properties of Numerical Derivatives

Statistical Properties of Numerical Derivatives Statistical Properties of Numerical Derivatives Han Hong, Aprajit Mahajan, and Denis Nekipelov Stanford University and UC Berkeley November 2010 1 / 63 Motivation Introduction Many models have objective

More information

40.530: Statistics. Professor Chen Zehua. Singapore University of Design and Technology

40.530: Statistics. Professor Chen Zehua. Singapore University of Design and Technology Singapore University of Design and Technology Lecture 9: Hypothesis testing, uniformly most powerful tests. The Neyman-Pearson framework Let P be the family of distributions of concern. The Neyman-Pearson

More information

March 1, Florida State University. Concentration Inequalities: Martingale. Approach and Entropy Method. Lizhe Sun and Boning Yang.

March 1, Florida State University. Concentration Inequalities: Martingale. Approach and Entropy Method. Lizhe Sun and Boning Yang. Florida State University March 1, 2018 Framework 1. (Lizhe) Basic inequalities Chernoff bounding Review for STA 6448 2. (Lizhe) Discrete-time martingales inequalities via martingale approach 3. (Boning)

More information

1 Weak Convergence in R k

1 Weak Convergence in R k 1 Weak Convergence in R k Byeong U. Park 1 Let X and X n, n 1, be random vectors taking values in R k. These random vectors are allowed to be defined on different probability spaces. Below, for the simplicity

More information

3 Integration and Expectation

3 Integration and Expectation 3 Integration and Expectation 3.1 Construction of the Lebesgue Integral Let (, F, µ) be a measure space (not necessarily a probability space). Our objective will be to define the Lebesgue integral R fdµ

More information

or E ( U(X) ) e zx = e ux e ivx = e ux( cos(vx) + i sin(vx) ), B X := { u R : M X (u) < } (4)

or E ( U(X) ) e zx = e ux e ivx = e ux( cos(vx) + i sin(vx) ), B X := { u R : M X (u) < } (4) :23 /4/2000 TOPIC Characteristic functions This lecture begins our study of the characteristic function φ X (t) := Ee itx = E cos(tx)+ie sin(tx) (t R) of a real random variable X Characteristic functions

More information

Sequences and Series of Functions

Sequences and Series of Functions Chapter 13 Sequences and Series of Functions These notes are based on the notes A Teacher s Guide to Calculus by Dr. Louis Talman. The treatment of power series that we find in most of today s elementary

More information

Kernel Density Estimation

Kernel Density Estimation EECS 598: Statistical Learning Theory, Winter 2014 Topic 19 Kernel Density Estimation Lecturer: Clayton Scott Scribe: Yun Wei, Yanzhen Deng Disclaimer: These notes have not been subjected to the usual

More information

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi Real Analysis Math 3AH Rudin, Chapter # Dominique Abdi.. If r is rational (r 0) and x is irrational, prove that r + x and rx are irrational. Solution. Assume the contrary, that r+x and rx are rational.

More information

We will briefly look at the definition of a probability space, probability measures, conditional probability and independence of probability events.

We will briefly look at the definition of a probability space, probability measures, conditional probability and independence of probability events. 1 Probability 1.1 Probability spaces We will briefly look at the definition of a probability space, probability measures, conditional probability and independence of probability events. Definition 1.1.

More information

3 Continuous Random Variables

3 Continuous Random Variables Jinguo Lian Math437 Notes January 15, 016 3 Continuous Random Variables Remember that discrete random variables can take only a countable number of possible values. On the other hand, a continuous random

More information

DA Freedman Notes on the MLE Fall 2003

DA Freedman Notes on the MLE Fall 2003 DA Freedman Notes on the MLE Fall 2003 The object here is to provide a sketch of the theory of the MLE. Rigorous presentations can be found in the references cited below. Calculus. Let f be a smooth, scalar

More information

< k 2n. 2 1 (n 2). + (1 p) s) N (n < 1

< k 2n. 2 1 (n 2). + (1 p) s) N (n < 1 List of Problems jacques@ucsd.edu Those question with a star next to them are considered slightly more challenging. Problems 9, 11, and 19 from the book The probabilistic method, by Alon and Spencer. Question

More information

1 Complete Statistics

1 Complete Statistics Complete Statistics February 4, 2016 Debdeep Pati 1 Complete Statistics Suppose X P θ, θ Θ. Let (X (1),..., X (n) ) denote the order statistics. Definition 1. A statistic T = T (X) is complete if E θ g(t

More information

Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics

Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics Data from one or a series of random experiments are collected. Planning experiments and collecting data (not discussed here). Analysis:

More information

McGill University Math 354: Honors Analysis 3

McGill University Math 354: Honors Analysis 3 Practice problems McGill University Math 354: Honors Analysis 3 not for credit Problem 1. Determine whether the family of F = {f n } functions f n (x) = x n is uniformly equicontinuous. 1st Solution: The

More information

1 Review of Probability

1 Review of Probability 1 Review of Probability Random variables are denoted by X, Y, Z, etc. The cumulative distribution function (c.d.f.) of a random variable X is denoted by F (x) = P (X x), < x

More information

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu

More information

Part IA Probability. Theorems. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015

Part IA Probability. Theorems. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015 Part IA Probability Theorems Based on lectures by R. Weber Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly) after lectures.

More information

Generalization theory

Generalization theory Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1

More information

Lecture 9: October 25, Lower bounds for minimax rates via multiple hypotheses

Lecture 9: October 25, Lower bounds for minimax rates via multiple hypotheses Information and Coding Theory Autumn 07 Lecturer: Madhur Tulsiani Lecture 9: October 5, 07 Lower bounds for minimax rates via multiple hypotheses In this lecture, we extend the ideas from the previous

More information

Universal examples. Chapter The Bernoulli process

Universal examples. Chapter The Bernoulli process Chapter 1 Universal examples 1.1 The Bernoulli process First description: Bernoulli random variables Y i for i = 1, 2, 3,... independent with P [Y i = 1] = p and P [Y i = ] = 1 p. Second description: Binomial

More information

THE INVERSE FUNCTION THEOREM

THE INVERSE FUNCTION THEOREM THE INVERSE FUNCTION THEOREM W. PATRICK HOOPER The implicit function theorem is the following result: Theorem 1. Let f be a C 1 function from a neighborhood of a point a R n into R n. Suppose A = Df(a)

More information

δ xj β n = 1 n Theorem 1.1. The sequence {P n } satisfies a large deviation principle on M(X) with the rate function I(β) given by

δ xj β n = 1 n Theorem 1.1. The sequence {P n } satisfies a large deviation principle on M(X) with the rate function I(β) given by . Sanov s Theorem Here we consider a sequence of i.i.d. random variables with values in some complete separable metric space X with a common distribution α. Then the sample distribution β n = n maps X

More information

Chapter 4: Asymptotic Properties of the MLE

Chapter 4: Asymptotic Properties of the MLE Chapter 4: Asymptotic Properties of the MLE Daniel O. Scharfstein 09/19/13 1 / 1 Maximum Likelihood Maximum likelihood is the most powerful tool for estimation. In this part of the course, we will consider

More information

the convolution of f and g) given by

the convolution of f and g) given by 09:53 /5/2000 TOPIC Characteristic functions, cont d This lecture develops an inversion formula for recovering the density of a smooth random variable X from its characteristic function, and uses that

More information

REAL AND COMPLEX ANALYSIS

REAL AND COMPLEX ANALYSIS REAL AND COMPLE ANALYSIS Third Edition Walter Rudin Professor of Mathematics University of Wisconsin, Madison Version 1.1 No rights reserved. Any part of this work can be reproduced or transmitted in any

More information

Introduction to Topology

Introduction to Topology Introduction to Topology Randall R. Holmes Auburn University Typeset by AMS-TEX Chapter 1. Metric Spaces 1. Definition and Examples. As the course progresses we will need to review some basic notions about

More information

(Part 1) High-dimensional statistics May / 41

(Part 1) High-dimensional statistics May / 41 Theory for the Lasso Recall the linear model Y i = p j=1 β j X (j) i + ɛ i, i = 1,..., n, or, in matrix notation, Y = Xβ + ɛ, To simplify, we assume that the design X is fixed, and that ɛ is N (0, σ 2

More information

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3 Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest

More information

STAT 7032 Probability Spring Wlodek Bryc

STAT 7032 Probability Spring Wlodek Bryc STAT 7032 Probability Spring 2018 Wlodek Bryc Created: Friday, Jan 2, 2014 Revised for Spring 2018 Printed: January 9, 2018 File: Grad-Prob-2018.TEX Department of Mathematical Sciences, University of Cincinnati,

More information

Spring 2014 Advanced Probability Overview. Lecture Notes Set 1: Course Overview, σ-fields, and Measures

Spring 2014 Advanced Probability Overview. Lecture Notes Set 1: Course Overview, σ-fields, and Measures 36-752 Spring 2014 Advanced Probability Overview Lecture Notes Set 1: Course Overview, σ-fields, and Measures Instructor: Jing Lei Associated reading: Sec 1.1-1.4 of Ash and Doléans-Dade; Sec 1.1 and A.1

More information

Brownian Motion. Chapter Stochastic Process

Brownian Motion. Chapter Stochastic Process Chapter 1 Brownian Motion 1.1 Stochastic Process A stochastic process can be thought of in one of many equivalent ways. We can begin with an underlying probability space (Ω, Σ,P and a real valued stochastic

More information

9 Brownian Motion: Construction

9 Brownian Motion: Construction 9 Brownian Motion: Construction 9.1 Definition and Heuristics The central limit theorem states that the standard Gaussian distribution arises as the weak limit of the rescaled partial sums S n / p n of

More information

Lecture 8: Information Theory and Statistics

Lecture 8: Information Theory and Statistics Lecture 8: Information Theory and Statistics Part II: Hypothesis Testing and I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 23, 2015 1 / 50 I-Hsiang

More information

n E(X t T n = lim X s Tn = X s

n E(X t T n = lim X s Tn = X s Stochastic Calculus Example sheet - Lent 15 Michael Tehranchi Problem 1. Let X be a local martingale. Prove that X is a uniformly integrable martingale if and only X is of class D. Solution 1. If If direction:

More information

Chapter 1. Measure Spaces. 1.1 Algebras and σ algebras of sets Notation and preliminaries

Chapter 1. Measure Spaces. 1.1 Algebras and σ algebras of sets Notation and preliminaries Chapter 1 Measure Spaces 1.1 Algebras and σ algebras of sets 1.1.1 Notation and preliminaries We shall denote by X a nonempty set, by P(X) the set of all parts (i.e., subsets) of X, and by the empty set.

More information

Proving the central limit theorem

Proving the central limit theorem SOR3012: Stochastic Processes Proving the central limit theorem Gareth Tribello March 3, 2019 1 Purpose In the lectures and exercises we have learnt about the law of large numbers and the central limit

More information

ECE 275A Homework 7 Solutions

ECE 275A Homework 7 Solutions ECE 275A Homework 7 Solutions Solutions 1. For the same specification as in Homework Problem 6.11 we want to determine an estimator for θ using the Method of Moments (MOM). In general, the MOM estimator

More information

Gaussian vectors and central limit theorem

Gaussian vectors and central limit theorem Gaussian vectors and central limit theorem Samy Tindel Purdue University Probability Theory 2 - MA 539 Samy T. Gaussian vectors & CLT Probability Theory 1 / 86 Outline 1 Real Gaussian random variables

More information

4 Uniform convergence

4 Uniform convergence 4 Uniform convergence In the last few sections we have seen several functions which have been defined via series or integrals. We now want to develop tools that will allow us to show that these functions

More information

7 Convergence in R d and in Metric Spaces

7 Convergence in R d and in Metric Spaces STA 711: Probability & Measure Theory Robert L. Wolpert 7 Convergence in R d and in Metric Spaces A sequence of elements a n of R d converges to a limit a if and only if, for each ǫ > 0, the sequence a

More information

Bayesian estimation of the discrepancy with misspecified parametric models

Bayesian estimation of the discrepancy with misspecified parametric models Bayesian estimation of the discrepancy with misspecified parametric models Pierpaolo De Blasi University of Torino & Collegio Carlo Alberto Bayesian Nonparametrics workshop ICERM, 17-21 September 2012

More information