10/36-702: Minimax Theory

Size: px
Start display at page:

Download "10/36-702: Minimax Theory"

Transcription

1 0/36-70: Minimax Theory Introduction When solving a statistical learning problem, there are often many procedures to choose from. This leads to the following question: how can we tell if one statistical learning procedure is better than another? One answer is provided by minimax theory which is a set of techniques for finding the minimum, worst case behavior of a procedure. Definitions and Notation Let P be a set of distributions and let X,..., X n be a sample from some distribution P P. Let θp be some function of P. For example, θp could be the mean of P, the variance of P or the density of P. Let θ = θx,..., X n denote an estimator. Given a metric d, the minimax risk is R n R n P = inf θ sup E P [d θ, θp ] where the infimum is over all estimators. The sample complexity is } nɛ, P = min n : R n P ɛ. Example Suppose that P = Nθ, : θ R} where Nθ, denotes a Gaussian with mean θ and variance. Consider estimating θ with the metric da, b = a b. The minimax risk is R n = inf sup E P [ θ θ ]. θ 3 In this example, θ is a scalar. Example Let X, Y,..., X n, Y n be a sample from a distribution P. Let mx = E P Y X = x = y dp y X = x be the regression function. In this case, we might use the metric dm, m = m x m x dx in which case the minimax risk is In this example, θ is a function. [ R n = inf sup E P m mx mx ]. 4

2 Notation. Recall that the Kullback-Leibler distance between two distributions P 0 and P with densities p 0 and p is defined to be dp0 p0 x KLP 0, P = log dp 0 log p 0 xdx. dp p x The appendix defines several other distances between probability distributions and explains how these distances are related. We write a b = mina, b} and a b = maxa, b}. If P is a distribution with density p, the product distribution for n iid observations is P n with density p n x = n i= px i. It is easy to check that KLP0 n, P n = nklp 0, P. For positive sequences a n and b n we write a n = Ωb n to mean that there exists C > 0 such that a n Cb n for all large n. a n b n if a n /b n is strictly bounded away from zero and infinity for all large n; that is, a n = Ob n and b n = Oa n. 3 Bounding the Minimax Risk Usually, we do not find R n directly. Instead, we find an upper bound U n and a lower bound L n on R n. To find an upper bound, let θ be any estimator. Then R n = inf θ sup E P [d θ, θp ] sup E P [d θ, θp ] U n. 5 So the maximum risk of any estimator provides an upper bound U n. Finding a lower bound L n is harder. We will consider three methods: the Le Cam method, the Fano method and Tsybakov s bound. If the lower and upper bound are close, then we have succeeded. For example, if L n = cn α and U n = Cn α for some positive constants c, C and α, then we have established that the minimax rate of convergence is n α. All the lower bound methods involve a the following trick: we reduce the problem to a hypothesis testing problem. It works like this. First, we will choose a finite set of distributions M = P,..., P N } P. Then R n = inf θ sup E P [d θ, θp ] inf max E j[d θ, θ j ] 6 θ P j M where θ j = θp j and E j is the expectation under P j. Let s = min j k dθ j, θ k. By Markov s inequality, P d θ, θ > t E[d θ, θ] t and so E[d θ, θ] tp d θ, θ > t. Setting t = s/, and using 6, we have R n s inf θ max P jd θ, θ j > s/. 7 P j M

3 Given any estimator θ, define Now, if ψ j then, letting k = ψ, s dθ j, θ k dθ j, θ + dθ k, θ ψ = argmin d θ, θ j. j dθ j, θ + dθ j, θ since ψ j implies that d θ, θ k d θ, θ j = dθ j, θ. So ψ j implies that dθ j, θ s/. Thus P j d θ, θ j > s/ P j ψ j inf ψ P jψ j where the infimum is over all maps ψ form the data to,..., N}. We can think of ψ is a multiple hypothesis test. Thus we have We can summarize this as a theorem: R n s inf ψ max P jψ j. P j M Theorem 3 Let M = P,..., P N } P and let s = min j k dθ j, θ k. Then R n = inf θ sup E P [d θ, θp ] s inf max P jψ j. 8 ψ P j M Getting a good lower bound involves carefully selecting M = P,..., P N }. If M is too big, s will be small. If M is too small, then max Pj M P j ψ j will be small. 4 Distances We will need some distances between distributions. Specifically, Total Variation TVP,Q = sup A P A QA L P Q = p q Kullback-Leibler KLP,Q = p logp/q χ χ P, Q = p q dq = p q Hellinger HP,Q = p q. We also define the affinity between P and q by ap, q = p q. 3

4 We then have TVP, Q = P Q = ap, q and H P, Q TVP, Q KLP, Q χ P, Q. The appendix contains more information about these distances. 5 Lower Bound Method : Le Cam Theorem 4 Let P be a set of distributions. For any pair P 0, P P, inf sup E P [d θ, θp ] s [p n θ 4 0x p n x]dx = s [ ] TVP0 n, P n 4 9 where s = dθp 0, θp. We also have: inf θ sup E P [d θ, θp ] s 8 e nklp 0,P s P 0,P 8 e nχ 0 and inf θ sup E P [d θ, θp ] s 8 p 0 p n. Corollary 5 Suppose there exist P 0, P P such that KLP 0, P log /n. Then inf θ sup E P [d θ, θp ] s 6 where s = dθp 0, θp. Proof. Let θ 0 = θp 0, θ = θp and s = dθ 0, θ. First suppose that n = so that we have a single observation X. From Theorem 3, inf θ sup E P [d θ, θp ] s π where π = inf ψ max P jψ j. 3 j=0, 4

5 Since a maximum is larger than an average, π = inf ψ Define the Neyman-Pearson test max P jψ j inf j=0, ψ ψ x = P 0 ψ 0 + P ψ. 0 if p0 x p x if p 0 x < p x. In Lemma 7 below, we show that the sum of the errors P 0 ψ 0 + P ψ is minimized by ψ. Now P 0 ψ 0 + P ψ = p 0 xdx + p xdx p >p 0 p 0 >p = [p 0 x p x]dx + [p 0 x p x]dx = [p 0 x p x]dx. p >p 0 p 0 >p Thus, Thus we have shown that P 0 ψ 0 + P ψ inf θ = sup E P [d θ, θp ] s 4 [p 0 x p x]dx. [p 0 x p x]dx. Now suppose we have n observations. Then, replacing p 0 and p with p n 0x = n i= p 0x i and p n x = n i= p x i, we have inf sup E P [d θ, θp ] s [p n θ 4 0x p n x]dx. In Lemma 7 below, we show that p q e KLP,Q. Since KLP0 n, P n = nklp 0, P, we have inf sup E P [d θ, θp ] s θ 8 e nklp 0,P. The other results follow from the inequalities on the distances. Lemma 6 Let ψ be the Neyman-Pearson test. For any test ψ, P 0 ψ = + P ψ = 0 P 0 ψ = + P ψ = 0. 5

6 Proof. Recall that p 0 > p when ψ = 0 and that p 0 < p when ψ =. So P 0 ψ = + P ψ = 0 = p 0 xdx + p xdx ψ= ψ=0 = p 0 xdx + p 0 xdx + p xdx + p xdx ψ=,ψ = ψ=,ψ =0 ψ=0,ψ =0 ψ=0,ψ = p 0 xdx + p xdx + p xdx + p 0 xdx ψ=,ψ = ψ=,ψ =0 ψ=0,ψ =0 ψ=0,ψ = = p 0 xdx + p xdx ψ = ψ =0 = P 0 ψ = + P ψ = 0. Lemma 7 For any P and Q, p q e KLP,Q. Proof. First note that, since a b + a b = a + b, we have p q + p q =. 4 Hence p q p q p q = p q [ = p q p q from 4 ] p q p q p q Cauchy Schwartz pq = = exp log = exp log p q/p pq exp where we used Jensen s inequality in the last inequality. p log q = e KLP,Q p Example 8 Consider data X, Y,..., X n, Y n where X i Uniform0,, Y i = mx i +ɛ i and ɛ i N0,. Assume that } m M = m : my mx L x y, for all x, y [0, ]. 6

7 So P is the set of distributions of the form px, y = pxpy x = φy mx where m M. How well can we estimate mx at some point x? Without loss of generality, let s take x = 0 so the parameter of interest is θ = m0. Let dθ 0, θ = θ 0 θ. Let m 0 x = 0 for all x. Let 0 ɛ and define Lɛ x 0 x ɛ m x = 0 x ɛ. Then m 0, m M and s = m 0 m 0 0 = Lɛ. The KL distance is KLP 0, P = = = = ɛ 0 ɛ Now, KLNµ,, Nµ, = µ µ /. So = 0 KLP 0, P = L p0 x, y p 0 x, y log dydx p x, y p0 xp 0 y x p 0 xp 0 y x log p xp y x φy φy log dydx φy m x φy φy log dydx φy m x KLN0,, Nm x, dx. ɛ 0 ɛ x dx = L ɛ 3 6. dydx Let ɛ = 6 log /L n /3. Then, KLP 0, P = log /n and hence, by Corollary 5, inf θ sup E P [d θ, θp ] s 6 = Lɛ 6 = L 6 /3 6 log c /3 =. 5 L n n It is easy to show that the regressogram regression histogram θ = m0 has risk E P [d θ, θp ] /3 C. n Thus we have proved that inf θ sup E P [d θ, θp ] n 3. 6 The same calculations in d dimensions yield inf θ sup E P [d θ, θp ] n d+. 7 7

8 On the squared scale we have inf θ Similar rates hold in density estimation. sup E P [d θ, θp ] n d+. 8 There is a more general version of Le Cam s lemma that is sometimes useful. Lemma 9 Let P, Q,..., Q N be distributions such that dθp, θq j s for all j. Then inf sup E P [d θ, θp ] s p n q n θ 4 where q = N j q j. Example 0 Let Y i = θ i + n Z i, i =,..., d where Z, Z,..., Z d N0, and θ = θ,..., θ d Θ where Θ = θ R d : θ 0 }. Let P = N0, n I. Let Q j have mean 0 expect that j th coordinate has mean a log d/n where 0 < a <. Let q = N j q j. Some algebra good homework question! shows that χ q, p 0 as d. By the generalized Le Cam lemma, R n a log d/n using squared error loss. We can estimate θ by thresholding Bonferroni. This gives a matching upper bound. 6 Lower Bound Method II: Fano For metrics like df, g = f g, Le Cam s method will usually not give a tight bound. Instead, we use Fano s method. Instead of choosing two distributions P 0, P, we choose a finite set of distributions P,..., P N P. We start with Fano s lemma. Lemma Fano Let X,..., X n P where P P,..., P N }. Let ψ be any function of X,..., X n taking values in,..., N}. Let β = max j k KLP j, P k. Then N P j ψ j nβ + log. log N 8

9 Proof: See Lemma 8 in the appendix.. Now we can state and prove the Fano minimax bound. Theorem Let F = P,..., P N } P. Let θp be a parameter taking values in a metric space with metric d. Then E P d θ, θp s nβ + log 9 log N inf θ sup where and s = min j k d θp j, θp k, 0 β = max j k KLP j, P k. Corollary 3 Fano Minimax Bound Suppose there exists F = P,..., P N } P such that N 6 and β = max KLP j, P k log N j k 4n. Then Proof. From Theorem 3, inf θ max E P [ ] d θ, θp s 4. 3 R n s inf ψ max P jψ j s P j F N P j ψ j where the latter is due to the fact that a max is larger than an average. By Fano s lemma, N P j ψ j nβ + log. log N Thus, inf θ sup E P d θ, θp inf max θ P F E P d θ, θp s nβ + log. 4 log N 9

10 7 Lower Bound Method III: Tsybakov s Bound This approach is due to Tsybakov 009. appendix. The proof of the following theorem is in the Theorem 4 Tsybakov 009 Let X,..., X n P P. Let P 0, P,..., P N } P where N 3. Assume that P 0 is absolutely continuous with respect to each P j. Suppose that Then where N inf θ s = KLP j, P 0 log N 6. sup E P [d θ, θp ] s 6 max dθp j, θp k. 0 j<k N 8 Hypercubes To use Fano s method or Tsybakov s method, we need to construct a finite class of distributions F. Sometimes we use a set of the form } F = P ω : ω Ω where Ω = } ω = ω,..., ω m : ω i 0, }, i =,..., m which is called a hypercube. There are N = m distributions in F. For ω, ν Ω, define the Hamming distance Hω, ν = m Iω k ν k. One problem with a hypercube is that some pairs P, Q F might be very close together which will make s = min j k d θp j, θp k small. This will result in a poor lower bound. We can fix this problem by pruning the hypercube. That is, we can find a subset Ω Ω which has nearly the same number of elements as Ω but such that each pair P, Q F = P ω : ω Ω } is far apart. We call Ω a pruned hypercube. The technique for constructing Ω is the Varshamov-Gilbert lemma. Lemma 5 Varshamov-Gilbert Let Ω = } ω = ω,..., ω N : ω j 0, }. Suppose that N 8. There exists ω 0, ω,..., ω M Ω such that i ω 0 = 0,..., 0, ii M N/8 and iii Hω j, ω k N/8 for 0 j < k M. We call Ω = ω 0, ω,..., ω M } a pruned hypercube. 0

11 Proof. Let D = N/8. Set ω 0 = 0,..., 0. Define Ω 0 = Ω and Ω = ω Ω : Hω, ω 0 > D}. Let ω be any element in Ω. Thus we have eliminated ω Ω : Hω, ω 0 D}. Continue this way recursively and at the j th step define Ω j = ω Ω j : Hω, ω j > D} where j =,..., M. Let n j be the number of elements eliminated at step j, that is, the number of elements in A j = ω Ω j : Hω, ω j D}. It follows that n j D i=0 N i The sets A 0,..., A M partition Ω and so n 0 + n + + n M = N. Thus, D N M + N. i Thus M + D i=0 N N i i=0 =. N P i= Z i m/8 where Z,..., Z N are iid Bernoulli / random variables. By Hoeffding s inequaity, N P Z i m/8 e 9N/3 < N/4. i= Therefore, M N/8 as long as N 8. Finally, note that, by construction, Hω j, ω k D + N/8. Example 6 Consider data X, Y,..., X n, Y n where X i Uniform0,, Y i = fx i + ɛ i and ɛ i N0,. The assumption that X is uniform is not crucial. Assume that f is in the Holder class F defined by } F = f : f l y f l x L x y β l, for all x, y [0, ] where l = β. P is the set of distributions of the form px, y = pxpy x = φy mx where f F. Let Ω be a pruned hypercube and let } m F = f ω x = ω j φ j x : ω Ω where m = cn β+, φj x = Lh β Kx X j /h, and h = /m. Here, K is any sufficiently smooth function supported on /, /. Let d f, g = f g. Some calculations show that, for ω, ν Ω, df ω, f ν = Hω, νlh β+ K m 8 Lhβ+ K c h β.

12 We used the Varshamov-Gilbert result which implies that Hω, ν m/8. Furthermore, To apply Corollary 3, we need to have Now log N 4n KLP ω, P ν c h β. KLP ω, P ν log N 4n. = log m/8 4n = m 3n = 3nh. So we set h = c/n /β+. In that case, df ω, f ν c h β = c c/n β/β+. Corollary 3 implies that inf sup E P [d f, f] n β β+. f It follows that inf f sup E P f f n β β+. It can be shown that there are kernel estimators that achieve this rate of convergence. The kernel has to be chosen carefully to take advantage of the degree of smoothness β. A similar calculation in d dimensions shows that inf f sup E P f f n β β+d. 9 Further Examples 9. Parametric Maximum Likelihood For parametric models that satisfy weak regularity conditions, the maximum likelihood estimator is approximately minimax. Consider squared error loss which is squared bias plus variance. In parametric models with large samples, it can be shown that the variance term dominates the bias so the risk of the mle θ roughly equals the variance: Rθ, θ = Var θ θ + bias Var θ θ. 5 The variance of the mle is approximately Var θ where Iθ is the Fisher information. niθ Hence, nrθ, θ Iθ. 6 Typically, the squared bias is order On while the variance is of order On.

13 For any other estimator θ, it can be shown that for large n, Rθ, θ Rθ, θ. For d- dimensional vectors we have Rθ, θ Iθ /n = Od/n. Here is a more precise statement, due to Hájek and Le Cam. The family of distributions P θ : θ Θ with densities P θ : θ Θ is differentiable in quadratic mean if there exists l θ such that pθ+h p θ ht l θ pθ dµ = o h. 7 Theorem 7 Hájek and Le Cam Suppose that P θ : θ Θ is differentiable in quadratic mean where Θ R k and that the Fisher information I θ is nonsingular. Let ψ be differentiable. Then ψ θ n, where θ n is the mle, is asymptotically, locally, uniformly minimax in the sense that, for any estimator T n, and any bowl-shaped l, sup lim inf I I n sup E θ+h/ n l h I n T n ψ θ + h n ElU 8 where I is the class of all finite subsets of R k and U N0, ψ θ I θ ψ θ T. For a proof, see van der Vaart 998. Note that the right hand side of the displayed formula is the risk of the mle. In summary: in well-behaved parametric models, with large samples, the mle is approximately minimax. 9. Estimating a Smooth Density Here we use the general strategy to derive the minimax rate of convergence for estimating a smooth density. See Yu 008 for more details. Let F be all probability densities f on [0, ] such that 0 < c 0 fx c <, f x c <. We observe X,..., X n P where P has density f F. We will use the squared Hellinger distance d f, g = 0 fx gx dx as a loss function. Upper Bound. Let f n be the kernel estimator with bandwidth h = n /5. Then, using bias-variance calculations, we have that sup E f fx fx dx f F Cn 4/5 3

14 for some C. But fx gx fx gx dx = dx C fx + gx fx gx dx 9 for some C. Hence sup f E f d f, f n Cn 4/5 which gives us an upper bound. Lower Bound. For the lower bound we use Fano s inequality. Let g be a bounded, twice differentiable function on [ /, /] such that / / / gxdx = 0, g xdx = a > 0, / / / g x dx = b > 0. Fix an integer m and for j =,..., m define x j = j //m and g j x = c m gmx x j for x [0, ] where c is a small positive constant. Let M denote the Varshamov-Gilbert pruned version of the set m f τ = + τ j g j x : τ = τ,..., τ m, +} }. m For f τ M, let fτ n denote the product density for n observations and let M n = } M. Some calculations show that, for all τ, τ, f n τ : f τ KLfτ n, fτ n = nklf τ, f τ C n β. 30 m4 By Lemma 5, we can choose a subset F of M with N = e c 0m elements where c 0 is a constant and such that d f τ, f τ C m 4 α 3 for all pairs in F. Choosing m = cn /5 gives β log N/4 and d f τ, f τ C. Fano s n 4/5 lemma implies that max E j d f, f j j C n 4/5. Hence the minimax rate is n 4/5 which is achieved by the kernel estimator. Thus we have shown that R n P n 4/5. This result can be generalized to higher dimensions and to more general measures of smoothness. Since the proof is similar to the one dimensional case, we state the result without proof. 4

15 Theorem 8 Let Z be a compact subset of R d. Let Fp, C denote all probability density functions on Z such that p z p z p fz d d dz C where the sum is over al p,..., p d such that j p j = p. Then there exists a constant D > 0 such that p inf sup E f f n z fz p+ dz D. 3 f f Fp,C n The kernel estimator with an appropriate kernel with bandwidth h n = n /p+d achieves this rate of convergence. 9.3 Minimax Classification Let us now turn to classification. We focus on some results of Yang 999, Tsybakov 004, Mammen and Tsybakov 999, Audibert and Tsybakov 005 and Tsybakov and van de Geer 005. The data are Z = X, Y,..., X n, Y n where Y i 0, }. Recall that a classifier is a function of the form hx = Ix G for some set G. The classification risk is RG = PY hx = PY IX G = EY IX G. 33 The optimal classifier is h x = Ix G where G = x : mx /} and mx = EY X = x. We are interested in how close RG is to RG. Following Tsybakov 004 we define dg, G = RG RG = mx dp Xx 34 G G where A B = A B c A c B and P X is the marginal distribution of X. There are two common types of classifiers. The first type are plug-in classifiers of the form ĥx = I mx / where m is an estimate of the regression function. The second type are empirical risk minimizers where ĥ is taken to be the h that minimizes the observed error rate n n i= Y i hx i as h varies over a set of classifiers H. Sometimes one minimizes the error rate plus a penalty term. According to Yang 999, the classification problem has, under weak conditions, the same order of difficulty in terms of minimax rates as estimating the regression function mx. Therefore the rates are given in Example 39. According to Tsybakov 004 and Mammen and Tsybakov 999, classification is easier than regression. The apparent discrepancy is due to differing assumptions. 5

16 To see that classification error cannot be harder than regression, note that for any m and corresponding Ĝ dg, Ĝ = mx dp X x 35 G Ĝ mx mx dp X x mx mx dp X x 36 so the rate of convergence of dg, G is at least as fast as the regression function. Instead of putting assumptions on the regression function m, Mammen and Tsybakov 999 put an entropy assumption on the set of decision sets G. They assume log Nɛ, G, d Aɛ ρ 37 where Nɛ, G, d is the smallest number of balls of radius ɛ required to cover G. They show that, if 0 < ρ <, then there are classifiers with rate sup EdĜ, G = On / 38 P independent of dimension d. Moreover, if we add the margin or low noise assumption P X 0 < mx t Ct α for all t > 0 39 we get sup EdĜ, G = O n +α/+α+αρ 40 P which can be nearly /n for large α and small ρ. The classifiers can be taken to be plug-in estimators using local polynomial regression. Moreover, they show that this rate is minimax. We will discuss classification in the low noise setting in more detail in another chapter. 9.4 Estimating a Large Covariance Matrix Let X,..., X n be iid Gaussian vectors of dimension d. Let Σ = σ ij i,j d be the d d covariance matrix for X i. Estimating Σ when d is large is very challenging. Sometimes we can take advantage of special structure. Bickel and Levina 008 considered the class of covariance matrices Σ whose entries have polynomial decay. Specifically, Θ = Θα, ɛ, M is all covariance matrices Σ such that 0 < ɛ λ min Σ λ max Σ /ɛ and such that } max j i σ ij : i j > k Mk α 6

17 for all k. The loss function is Σ Σ where is the operator norm A = sup Ax. x: x = Bickel and Levina 008 constructed an estimator that that converges at rate log d/n α/α+. Cai, Zhang and Zhou 009 showed that the minimax rate is min n α/α+ + log d n, d } n so the Bickel-Levina estimator is not rate minimax. Cai, Zhang and Zhou then constructed an estimator that is rate minimax. 9.5 Semisupervised Prediction Suppose we have data X, Y,..., X n, Y n for a classification or regression problem. In addition, suppose we have extra unlabelled data X n+,..., X N. Methods that make use of the unlabeled are called semisupervised methods. We discuss semisupervised methods in another Chapter. When do the unlabeled data help? Two minimax analyses have been carried out to answer that question, namely, Lafferty and Wasserman 007 and Singh, Nowak and Zhu 008. Here we briefly summarize the results of the latter. Suppose we want to estimate mx = EY X = x where x R d and y R. Let p be the density of X. To use the unlabelled data we need to link m and p in some way. A common assumption is the cluster assumption: m is smooth over clusters of the marginal px. Suppose that p has clusters separated by a amount γ and that m is α smooth over each cluster. Singh, Nowak and Zhu 008 obtained the following upper and lower minimax bounds as γ varies in 6 zones which we label I to VI. These zones relate the size of γ and the number of unlabeled points: γ semisupervised supervised unlabelled data help? upper bound lower bound I n α/α+d n α/α+d NO II n α/α+d n α/α+d NO III n α/α+d n /d YES IV n /d n /d NO V n α/α+d n /d YES VI n α/α+d n /d YES The important message is that there are precise conditions when the unlabeled data help and conditions when the unlabeled data do not help. These conditions arise from computing the minimax bounds. 7

18 9.6 Graphical Models Elsewhere in the book, we discuss the problem of estimating graphical models. Here, we shall briefly mention some minimax results for this problem. Let X be a random vector from a multivariate Normal distribution P with mean vector µ and covariance matrix Σ. Note that X is a random vector of length d, that is, X = X,..., X d T. The d d matrix Ω = Σ is called the precision matrix. There is one node for each component of X. The undirected graph associated with P has no edge between X j and X j if and only if Ω jk = 0. The edge set is E = j, k : Ω jk 0}. The graph is G = V, E where V =,..., d} and E is the edge set. Given a random sample of vectors X,..., X n P we want to estimate G. Only the edge set needs to be estimated; the nodes are known. Wang, Wainwright and Ramchandran 00 found the minimax risk for estimating G under zero-one loss. Let G d,r λ denote all the multivariate Normals whose graphs have edge sets with degree at most r and such that min i,j E Ω jk Ωjj Ω kk λ. The sample complexity nd, r, λ is the smallest sample size n needed to recover the true graph with high probability. They show that for any λ [0, /], log d r log d nd, r, λ > max, r 4λ. 4 log + rλ λ Thus, assuming λ /r, we get that n Cr logd r. rλ +r λ 9.7 Deconvolution and Measurement Error A problem has seems to have received little attention in the machine learning literature is deconvolution. Suppose that X,..., X n P where P has density p. We have seen that the minimax rate for estimating p in squared error loss is n β β+ where β is the assumed amount of smoothness. Suppose we cannot observe X i directly but instead we observe X i with error. Thus, we observe Y,..., Y n where Y i = X i + ɛ i, i =,..., n. 4 The minimax rates for estimating p change drastically. A good account is given in Fan 99. As an example, if the noise ɛ i is Gaussian, then Fan shows that the minimax risk satisfies β R n C log n 8

19 which means that the problem is essentially hopeless. Similar results hold for nonparametric regression. In the usual nonparametric regression problem we observe Y i = mx i + ɛ i and we want to estimate the function m. If we observe Xi = X i + δ i instead of X i then again the minimax rates change drastically and are logarithmic of the δ i s are Normal Fan and Truong 993. This is known as measurement error or errors in variables. This is an interesting example where minimax theory reveals surprising and important insight. 9.8 Normal Means Perhaps the best understood cases in minimax theory involve normal means. First suppose that X,..., X n Nθ, σ where σ is known. A function g is bowl-shaped if the sets x : gx c} are convex and symmetric about the origin. We will say that a loss function l is bowl-shaped if lθ, θ = gθ θ for some bowl-shaped function g. Theorem 9 The unique estimator that is minimax for every bowl-shaped loss function is the sample mean X n. For a proof, see Wolfowitz 950. Now consider estimating several normal means. Let X j = θ j + ɛ j / n for j =,..., n and suppose we and to estimate θ = θ,..., θ n with loss function l θ, θ = n θ j θ j. Here, ɛ,..., ɛ n N0, σ. This is called the normal means problem. There are strong connections between the normal means problem and nonparametric learning. For example, suppose we want to estimate a regression function fx and we observe data Z i = fi/n +δ i where δ i N0, σ. Expand f in an othonormal basis: fx = j θ jψ j x. An estimate of θ j is X j = n n i= Z i ψ j i/n. It follows that X j Nθ j, σ /n. This connection can be made very rigorous; see Brown and Low 996. The minimax risk depends on the assumptions about θ. Theorem 0 Pinsker. If Θ n = R n then R n = σ and θ = X = X,..., X n is minimax.. If Θ n = θ : n j θ j C } then Up to sets of measure 0. lim inf n inf θ sup R θ, θ = σ C θ Θ n σ + C. 43 9

20 Define the James-Stein estimator Then θ JS = Hence, θ JS is asymptotically minimax. n σ n X. 44 n X j lim sup R θ JS, θ = σ C n θ Θ n σ + C Let X j = θ j + ɛ j for j =,,..., where ɛ j N0, σ /n. } Θ = θ : θj a j C 46 where a j = πj p. Let R n denote the minimax risk. Then min n p p+ σ p p p+ Rn = C p+ p p+ p + p+. 47 n π p + Hence, R n n p p+. An asymptotically minimax estimator is the Pinsker estimator defined by θ = w X, w X,..., where w j = [ a j /µ] + and µ is determined by the equation σ a j µ a j + = C. n j The set Θ in 46 is called a Sobolev ellipsoid. This set corresponds to smooth functions in the function estimation problem. The Pinsker estimator corresponds to estimating a function by smoothing. The main message to take away from all of this is that minimax estimation under smoothness assumptions requires shrinking the data appropriately. 0 Adaptation The results in this chapter provide minimax rates of convergence and estimators that achieve these rates. However, the estimators depend on the assumed parameter space. For example, estimating a β-times differential regression function requires using an estimator tailored to the assumed amount of smoothness to achieve the minimax rate n β β+. There are estimators that are adaptive, meaning that they achieve the minimax rate without the user having to know the amount of smoothness. See, for example, Chapter 9 of Wasserman 006 and the references therein. 0

21 Minimax Hypothesis Testing Let Y,..., Y n P where P P and let P 0 P. We want to test H 0 : P = P 0 versus H : P P 0. Recall that a size α test is a function φ of y,..., y n such that φy,..., y n 0, } and P n 0 φ = α. Let Φ n be the set level α tests based on n observations where 0 < α < is fixed. We want to find the minimax type II error β n ɛ = inf sup φ Φ n ɛ P n φ = 0 48 where Pɛ = } P P : dp 0, Q > ɛ and d is some metric. The minimax testing rate is ɛ n = inf ɛ : β n ɛ δ }. Lower Bound. Define Q by QA = P n AdµP 49 where µ is any distribution whose support is contained in Pɛ. In particular, if µ is uniform on a finite set P,..., P N then QA = Pj n A. 50 N j Define the likelihood ratio L n = dq dp n 0 = py n p 0 y n dµp = py j dµp. 5 p 0 y j j Lemma Let 0 < δ < α. If then β n ɛ δ. E 0 [L n] + 4 α δ 5

22 Proof. Since P n 0 φ = α for each φ, we have β n ɛ = inf φ sup P n φ = 0 inf ɛ φ Qφ = 0 inf P 0 n φ = 0 + [Qφ = 0 P0 n φ = 0] φ α + [Qφ = 0 P0 n φ = 0] α sup QA P0 n A A = α Q P n 0. Now, Q P0 n = L n y n dp 0 y n = E 0 L n y n E 0 [L n]. The result follows from 5. Upper Bound. Let φ be any size α test. Then gives an upper bound. β n ɛ sup P n φ = 0 ɛ. Multinomials Let Y,..., Y n P where Y i,..., d} S. The minimax estimation rate under L loss for this problem is O d/n. We will show that the minimax testing rate is d /4 / n. We will focus on the uniform distribution p 0. So p 0 y = /d for all y. Let } Pɛ = p : p p 0 > ɛ. We want to find ɛ n such that β n Pɛ n = δ. Theorem Let where ɛ Cd/4 n C = [log + 4 α δ ] /4. Assume that ɛ. Then β n ɛ δ.

23 Proof. Let γ = ɛ/d. Let η = η,..., η d where η j, +} and j η j = 0. Let R be all such sequences. This is like a Radamacher sequence except that they are not quite independent. Define p η by p η y = p 0 y + γ j η j ψ j y where ψ j y = if y = j and ψ j y = 0 otherwise. Then p η j 0 for all j and j p ηj =. Also p 0 p η = γη j = γd = ɛ. j Let N be the number of such probability functions. Now and, for any η, ν R, L n = N η ν i L n = N p η Y i p ν Y i p 0 Y i p 0 Y i p η Y i p 0 Y i η = p 0 Y i + γ j η jψ j Y i p 0 Y i + γ j ν jψ j Y i N p η ν i 0 Y i p 0 Y i = + dγ η N j ψ j Y i + dγ ν j ψ j Y i. η ν i j j Taking the expected value over Y,..., Y n, and using the fact that ψ j Y i ψ k Y i = 0 for j k, and that E 0 [ψ j Y ] = P 0 Y = j = /d, we have E 0 [L n] = + dγ n η N j ν j exp ndγ η, ν N η ν j η ν = E η,ν e n dγ η,ν where E η,ν denotes expectation over random draws of η and ν. The average over η and ν can thought of as averages with respect to random assignmetn of the s and s. Hence, because these are negatively associated, i E 0 [L n] E η,ν [e ndγ η,ν ] j Ee ndγ η j ν j = j coshndγ j + n d γ 4 j e n d γ 4 = e n d 3 γ 4 = e n d 3 ɛ 4 /d α δ 3

24 where we used the fact that, γ = ɛ n /d and for small x, coshx + x. From Lemma, it follows that that β n Pɛ, L δ. For the upper bound we use a test due to Paninski 008. Let T n be the number of bins with example one observation. Define t n by Let φ = IT n < t n. Thus φ Φ n. P 0 T n < t n = α. Lemma 3 There exists C such that, if p p 0 > C d /4 / n then P φ = 0 < δ.. Testing Densities Now consider Y,..., Y n P where Y i [0, ] d. Let p 0 be the uniform density. We want to test H 0 : P = P 0. Define Pɛ, s, L = f : f 0 f ɛ} Hs L 53 where f H s L if, for all x, y, and f t L for t =,..., s. f t y f t x L x y t Theorem 4 Fix δ > 0. Define ɛ n by βɛ n = δ. Then, there exist c, c > 0 such that c n s 4s+d ɛn c n s 4s+d. Proof. Here is a proof outline. Divide the spce into k = n /4s+d equal size bins. Let η be a Radamacher sequence. Let ψ be a smooth function such that ψ = 0 and ψ =. For the j th bin B j, let ψ j be the function ψ rescaled and recentered to be supported on B j with ψ j =. Define p η y = p 0 y + γ j η j ψ j y where γ = cn s+d/4s+d. It may be verified p η is a density in H s L and that p 0 p η ɛ. Let N be the number of Rademacher sequences. Then L n = N p η Y i p 0 Y i η i 4

25 and L n = N η ν = N η ν i i p η Y i p ν Y i p 0 Y i p 0 Y i = p 0 Y i + j γη jψ j Y i p 0 Y i + j γν jψ j Y i N p η ν i 0 Y i p 0 Y i + +. j γη jψ j Y i p 0 Y i j γν jψ j Y i p 0 Y i Taking the expcted value over Y,..., Y n, and using the fact that the ψ j s are orthogonal, E 0 [L n] = + n γ η N j ν j exp n γ η N j ν j. η ν j η ν j Thus E 0 [L n] E η,ν e n η,ν where η, ν = γ j η jν j. Hence, E 0 [L n] E η,ν e n η,ν = j Ee nη jν j = j coshnρ j j + n ρ 4 j j e n γ 4 = e kn γ 4 C 0. From Lemma, it follows that that β n Pɛ, L δ. The upper bound can be obtained by using a χ test on the bins. Surprisingly, the story gets much more complicated for non-uniform densities. Summary Minimax theory allows us to state precisely the best possible performance of any procedure under given conditions. The key tool for finding lower bounds on the minimax risk is Fano s inequality. Finding an upper bound usually involves finding a specific estimator and computing its risk. 3 Bibliographic remarks There is a vast literature on minimax theory however much of it is scattered in various journal articles. Some texts that contain minimax theory include Tsybakov 009, van de Geer 000, van der Vaart 998 and Wasserman 006. References: Arias-Castro Ingster xxxx, Yu 008, Tsybakov 009, van der Vaart 998, Wasserman 04. 5

26 4 Appendix 4. Metrics For Probability Distributions Minimax theory often makes use of various metrics for probability distributions. Here we summarize some of these metrics and their properties. Let P and Q be two distributions with densities p and q. We write the distance between P and Q as either dp, Q or dp, q whichever is convenient. We define the following distances and related quantities. Total variation TVP, Q = sup A P A QA L distance d P, Q = p q L distance d P, Q = p q Hellinger distance hp, Q = p q Kullback-Leibler distance KLP, Q = p logp/q χ χ P, Q = p q /p Affinity P Q = p q = minpx, qx}dx Hellinger affinity AP, Q = pq There are many relationships between these quantities. These are summarized in the next two theorems. We leave the proofs as exercises. Theorem 5 The following relationships hold:. TVP, Q = d P, Q = p q. Scheffés Theorem.. TVP, Q = P A QA where A = x : px > qx} hp, Q. 4. h P, Q = AP, Q. 5. P Q = d P, Q. h P,Q 6. P Q A P, Q = 7. h P, Q TVP, Q = d P, Q hp, Q 8. TVP, Q KLP, Q/. Pinsker s inequality. 9. log dp/dq + dp KLP, Q + KLP, Q/. 0. P Q e KLP,Q.. TVP, Q hp, Q KLP, Q χ P, Q.. Le Cam s inequalities. h P,Q 4. 6

27 Let P n denote the product measure based on n independent samples from P. Theorem 6 The following relationships hold:. h P n, Q n = n h P,Q.. P n Q n A P n, Q n = 3. P n Q n d P, Q n. 4. KLP n, Q n = nklp, Q. h P, Q n. 4. Fano s Lemma For 0 < p < define the entropy hp = p log p p log p and note that 0 hp log. Let Y, Z be a pair of random variables each taking values in,..., N} with joint distribution P Y,Z. Then the mutual information is defined to be IY ; Z = KLP Y,Z, P Y P Z = HY HY Z 54 where HY = j PY = j log PY = j is the entropy of Y and HY Z is the entropy of Y given Z. We will use the fact that IY ; hz IY ; Z for an function h. Lemma 7 Let Y be a random variable taking values in,..., N}. Let P,..., P N } be a set of distributions. Let X be drawn from P j for some j,..., N}. Thus P X A Y = j = P j A. Let Z = gx be an estimate of Y taking values in,..., N}. Then, HY X PZ Y logn + hpz = Y. 55 We follow the proof from Cover and Thomas 99. Proof. Let E = IZ Y. Then HE, Y X = HY X + HE X, Y = HY X since HE X, Y = 0. Also, HE, Y X = HE X + HY E, X. But HE X HE = hp Z = Y. Also, HY E, X = P E = 0HY X, E = 0 + P E = HY X, E = P E = hp Z = Y logn 7

28 since Y = gx when E = 0 and, when E =, HY X, E = by the log number of remaining outcomes. Combining these gives HY X PZ Y logn +hpz = Y. Lemma 8 Fano s Inequality Let P = P,..., P N } and β = max j k KLP j, P k. For any random variable Z taking values on,..., N}, N P j Z j nβ + log. log N Proof. For simplicity, assume that n =. The general case follows since KLP n, Q n = nklp, Q. Let Y have a uniform distribution on,..., N}. Given Y = j, let X have distribution P j. This defines a joint distribution P for X, Y given by P X A, Y = j = P X A Y = jp Y = j = N P ja. Hence, From 55, N P Z j Y = j = P Z Y. HY Z P Z Y logn + hp Z = Y P Z Y logn + h/ = P Z Y logn + log. Therefore, P Z Y logn HY Z log = HY IY ; Z log = log N IY ; Z log log N β log. 56 The last inequality follows since IY ; Z IY ; X = N KLP j, P N KLP j, P k β 57 j,k where P = N N P j and we used the convexity of K. Equation 56 shows that and the result follows. P Z Y logn log N β log 8

29 4.3 Tsybakov s Method Proof of Theorem 4. Let X = X,..., X n and let ψ ψx 0,,..., N}. Fix τ > 0 and define } dp 0 A j = τ. dp j Then P 0 ψ 0 = = τ τ P 0 ψ = j P 0 ψ = j A j P 0 ψ = j A j P j ψ = j P j ψ = j A j A j = τn N P j ψ = j A j P j ψ = j τ P j A c j P j ψ = j τn N = τnp 0 a P j A c j where p 0 = N P j ψ = j, a = N dp0 P j < τ. dp j 9

30 Hence, max P jψ j = max P 0 ψ 0, max P jψ j 0 j N j N max τnp 0 a, max P jψ j j N } max τnp 0 a, P j ψ j N } τnp 0 a, p 0 = max min 0 p max = [ τn + τn N τnp a, p } = } } ] dp0 P j τ. dp j τn a + τn In Lemma 9 below we show that N dp0 P j τ dp j a + a / log/τ where α = N j KP j, P 0. Choosing τ = / N we get N max P jψ j 0 j N + a + a / N log/τ N = + a + a / N log N = By Theorem 3, R n s 3 8 s 8. Lemma 9 Let a = N j KP j, P 0. Then, N dp0 P j τ dp j a + a /. log/τ 30

31 Proof. Note that dp0 dpj P j τ = P j dp j dp 0 τ = P j log dp j log dp 0. τ By Markov s inequality, P j log dp [ j log P j log dp ] j dp 0 τ dp 0 + log τ [ log dp ] j dp j. log/τ dp 0 + According to Pinsker s second inequality see Thheorem 5 in the appendix and Tsyabakov Lemma.5, [ log dp ] j dp j KP j, P 0 + KP j, K 0 /. dp 0 + So P j log dp [ ] j log KP j, P 0 + KP j, K 0 /. dp 0 τ log/τ Using Jensen s inequality, N j KP j, P 0 N KP j, P 0 = a. j So N dp0 P j τ KP j, P 0 dp j log/τ N log/τ N j a log/τ a / log/τ. j KP j, P 0 / 4.4 Assouad s Lemma Assouad s Lemma is another way to get a lower bound using hypercubes. Let } Ω = ω = ω,..., ω N : ω j 0, } be the set of binary sequences of length N. Let P = P ω : ω Ω} be a set of N distributions indexed by the elements of Ω. Let hω, ν = N Iω j ν j be the Hamming distance between ω, ν Ω. 3

32 Lemma 30 Let P ω : ω Ω} be a set of distributions indexed by ω and let θp be a parameter. For any p > 0 and any metric d, max E ω d p θ, θp ω N d p θp ω, θp ν min min P ω Ω p+ ω,ν hω, ν ω,ν ω P ν. 58 hω,ν 0 hω,ν= For a proof, see van der Vaart 998 or Tsybakov The Bayesian Connection Another way to find the minimax risk and to find a minimax estimator is to use a carefully constructed Bayes estimator. In this section we assume we have a parametric family of densities px; θ : θ Θ} and that our goal is to estimate the parameter θ. Since the distributions are indexed by θ, we can write the risk as Rθ, θ n and the maximum risk as sup θ Θ Rθ, θ n. Let Q be a prior distribution for θ. The Bayes risk with respect to Q is defined to be B Q θ n = Rθ, θ n dqθ 59 and the Bayes estimator with respect to Q is the estimator θ n that minimizes B Q θ n. For simplicity, assume that Q has a density q. The posterior density is then qθ X n = px..., X n ; θqθ mx,..., X n where mx,..., x n = px,..., x n ; θqθdθ. Lemma 3 The Bayes risk can be written as Lθ, θ n qθ x,..., x n dθ mx,..., x n dx dx n. It follows from this lemma that the Bayes estimator can be obtained by finding θ n = θ n x..., x n to minimize the inner integral Lθ, θ n qθ x,..., x n dθ. Often, this is an easy calculation. Example 3 Suppose that Lθ, θ n = θ θ n. Then the Bayes estimator is the posterior mean θ Q = θ qθ x,..., x n dθ. 3

33 Now we link Bayes estimators to minimax estimators. Theorem 33 Let θ n be an estimator. Suppose that i the risk function Rθ, θ n is constant as a function of θ and ii θ n is the Bayes estimator for some prior Q. Then θ n is minimax. Proof. We will prove this by contradiction. Suppose that θ n is not minimax. Then there is some other estimator θ such that sup θ Θ Rθ, θ < sup Rθ, θ n. 60 θ Θ Now, B Q θ = Rθ, θ dqθ definition of Bayes risk sup Rθ, θ θ Θ average is less than sup < sup Rθ, θ n from 60 = θ Θ Rθ, θ n dqθ since risk is constant = B Q θ n definition of Bayes risk. So B Q θ < B Q θ n. This is a contradiction because θ n is the Bayes estimator for Q so it must minimize B Q. Example 34 Let X Binomialn, θ. The mle is X/n. Let Lθ, θ n = θ θ n. Define X θ + n 4n n =. + Some calculations show that this is the posterior mean under a Betaα, β prior with α = β = n/4. By computing the bias and variance of θ n it can be seen that Rθ, θ n is constant. Since θ n is Bayes and has constant risk, it is minimax. n Example 35 Let us now show that the sample mean is minimax for the Normal model. Let X N p θ, I be multivariate Normal with mean vector θ = θ,..., θ p. We will prove that θ n = X is minimax when Lθ, θ n = θ n θ. Assign the prior Q = N0, c I. Then the posterior is c x N + c, c + c I. 6 33

34 The Bayes risk B Q θ n = Rθ, θ n dqθ is minimized by the posterior mean θ = c X/ + c. Direct computation shows that B Q θ = pc /+c. Hence, if θ is any estimator, then pc + c = B Q θ B Q θ = Rθ, θdqθ sup Rθ, θ. θ This shows that RΘ pc / + c for every c > 0 and hence But the risk of θ n = X is p. So, θ n = X is minimax. RΘ p Nonparametric Maximum Likelihood and the Le Cam Equation In some cases, the minimax rate can be found by finding ɛ to solve the equation Hɛ n = nɛ n where Hɛ = log Nɛ and Nɛ is the smallest number of balls of size ɛ in the Hellinger metric needed to cover P. Hɛ is called the Hellinger entropy of P. The equation Hɛ = nɛ is known as the Le Cam equation. In this section we consider one case where this is true. For more general versions of this argument, see Shen and Wong 995, Barron and Yang 999 and Birgé and Massart 993. Our goal is to estimate the density function using maximum likelihood. The loss function is Hellinger distance. Let P be a set of probability density functions. We have in mind the nonparametric situation where P does not correspond to some finite dimensional parametric family. Let Nɛ denote the Hellinger covering number of P. We will make the following assumptions: A We assume that there exist 0 < c < c < such that c px c for all x and all p P. A We assume that there exists a > 0 such that where Bp, δ = q : hp, q δ}. Haɛ, P, h sup Hɛ, Bp, 4ɛ, h p P A3 We assume nɛ n as n where Hɛ n nɛ n. 34

35 Assumption A is a very strong and is made only to make the proofs simpler. Assumption A says that the local entropy and global entropy are of the same order. This is typically true in nonparametric models. Assumption A3 says that the rate of convergence is slower than O/ n which again is typical of nonparametric problems. An example of a class P that satisfies these conditions is P = p : [0, ] [c, c ] : 0 pxdx =, 0 p x dx C }. Thanks to A we have, p q KLp, q χ p, q = p q p c = p q p + q c where C = 4c /c. 4c c p q = Ch p, q 63 Let ɛ n solve the Le Cam equation. More precisely, let } ɛ ɛ n = min ɛ : H nɛ. 64 C 6C We will show that ɛ n is the minimax rate. Upper Bound. To show the upper bound, we will find an estimator that achieves the rate. Let P n = p,..., p N } be an ɛ n / C covering set where N = Nɛ n / C. The set P n is an approximation to P that grows with sample size n. Such a set is called sieve. Let p be the mle over P n, that is, p = argmax p Pn Lp where Lp = n i= px i is the likelihood function. We call p, a sieve maximum likelihood estimator. It is crucial that the estimator is computed over P n rather than over P to prevent overfitting. Using a sieve is a type of regularization. We need the following lemma. Lemma 36 Wong and Shen Let p 0 and p be two densities and let δ = hp 0, p. Let Z,..., Z n be a sample from p 0. Then Lp P Lp 0 > / e e nδ nδ /4. 35

36 Proof. Lp P Lp 0 > / e nδ n n pz i = P p i= 0 Z i > /4 e nδ e nδ /4 pz i E p i= 0 Z i n n = e nδ /4 pz i E = e nδ /4 p0 p p 0 Z i n = e nδ /4 h p 0, p = e nδ /4 exp n log h p 0, p e nδ /4 e nh p 0,p/ = e nδ /4. In what follows, we use c, c, c,..., to denote various positive constants. Theorem 37 sup E p hp, p = Oɛ n. Proof. Let p 0 denote the true density. Let p be the element of P n that minimizes KLp 0, p j. Hence, KLp 0, p Cd p 0, p Cɛ n/c = ɛ n/. Let where A = / C. Then Now B = p P n : dp, p > Aɛ n } Ph p, p 0 > Dɛ n Ph p, p + hp 0, p > Dɛ n Ph p, p + ɛ > Dɛ n C = Ph p, p > Aɛ n = P p B P sup P sup p B P sup p B P + P. Lp Lp > e nɛ n A /+ Lp Lp > e nɛ n A / p B Lp Lp > Lp 0 + P sup p B Lp > enɛ n P Lp P Lp > e nɛ n A / Nɛ/ Ce nɛ n A /4 e nɛ n /6C p B where we used Lemma 36 and the definition of ɛ n. To bound P, define K n = n Hence, EK n = KLp 0, p ɛ /. Also, σ Var log p 0Z E log p 0Z c log p Z p Z c c c ɛ = log KLp 0, p log c 3ɛ n c c 36 E log p 0Z p Z n i= log p 0Z i p Z i.

37 where we used 63. So, by Bernstein s inequality, P = PK n > ɛ n = PK n KLp 0, p > ɛ n KLp 0, p P K n KLp 0, p > ɛ n nɛ 4 n exp 8σ + c 4 ɛ n exp c 5 nɛn. Thus, P + P exp c 6 nɛ n. Now, Eh p, p 0 = Ph p, p 0 > tdt 0 Dɛn = Ph p, p 0 > tdt + Ph p, p 0 > tdt 0 Dɛ n Dɛ n + exp c 6 nɛn c7 ɛ n. Lower Bound. Now we derive the lower bound. Theorem 38 Let ɛ n be the smallest ɛ such that Haɛ 64C nɛ. Then inf p Proof. Pick any p P. Let B = q : packing set for B. Then Hence, for any P j, P k F, sup E p hp, p = Ωɛ n. hp, q 4ɛ n }. Let F = p,..., p N } be an ɛ n N = log P ɛ n, B, h log Hɛ n, B, h log Haɛ n 64C nɛ. KLP n j, P n k = nklp j, P k Cnh P j, P k 6Cnɛ n N 4. It follows from Fano s inequality that as claimed. inf p sup E p hp, p p P 4 min hp j, p k ɛ n j k 4 In summary, we get the minimax rate by solving Hɛ n nɛ n. Now we can use the Le Cam equation to compute some rates: 37

38 Example 39 Here are some standard examples: Space Entropy Rate Sobolev α ɛ /α n α/α+ Sobolev α dimension d ɛ d/α n α/α+d Lipschitz α ɛ d/α n α/α+d Monotone /ɛ n /3 Besov Bp,q α ɛ d/α n α/α+d Neural Nets see below see below m-dimensional m log/ɛ m/n parametric model In the neural net case we have fx = c 0 + i c iσv T i x + b i where c C, v i = and σ is a step function or a Lipschitz sigmoidal function. Then and hence /+/d Hɛ ɛ /+/d log/ɛ 65 ɛ n +/d/+/d log n +/d+/d/+/d ɛ n n/ log n +/d/+/d

Minimax lower bounds I

Minimax lower bounds I Minimax lower bounds I Kyoung Hee Kim Sungshin University 1 Preliminaries 2 General strategy 3 Le Cam, 1973 4 Assouad, 1983 5 Appendix Setting Family of probability measures {P θ : θ Θ} on a sigma field

More information

21.2 Example 1 : Non-parametric regression in Mean Integrated Square Error Density Estimation (L 2 2 risk)

21.2 Example 1 : Non-parametric regression in Mean Integrated Square Error Density Estimation (L 2 2 risk) 10-704: Information Processing and Learning Spring 2015 Lecture 21: Examples of Lower Bounds and Assouad s Method Lecturer: Akshay Krishnamurthy Scribes: Soumya Batra Note: LaTeX template courtesy of UC

More information

Lecture 35: December The fundamental statistical distances

Lecture 35: December The fundamental statistical distances 36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose

More information

Lecture 8: Minimax Lower Bounds: LeCam, Fano, and Assouad

Lecture 8: Minimax Lower Bounds: LeCam, Fano, and Assouad 40.850: athematical Foundation of Big Data Analysis Spring 206 Lecture 8: inimax Lower Bounds: LeCam, Fano, and Assouad Lecturer: Fang Han arch 07 Disclaimer: These notes have not been subjected to the

More information

MGMT 69000: Topics in High-dimensional Data Analysis Falll 2016

MGMT 69000: Topics in High-dimensional Data Analysis Falll 2016 MGMT 69000: Topics in High-dimensional Data Analysis Falll 2016 Lecture 14: Information Theoretic Methods Lecturer: Jiaming Xu Scribe: Hilda Ibriga, Adarsh Barik, December 02, 2016 Outline f-divergence

More information

Optimal Estimation of a Nonsmooth Functional

Optimal Estimation of a Nonsmooth Functional Optimal Estimation of a Nonsmooth Functional T. Tony Cai Department of Statistics The Wharton School University of Pennsylvania http://stat.wharton.upenn.edu/ tcai Joint work with Mark Low 1 Question Suppose

More information

ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016

ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 ECE598: Information-theoretic methods in high-dimensional statistics Spring 06 Lecture : Mutual Information Method Lecturer: Yihong Wu Scribe: Jaeho Lee, Mar, 06 Ed. Mar 9 Quick review: Assouad s lemma

More information

Local Minimax Testing

Local Minimax Testing Local Minimax Testing Sivaraman Balakrishnan and Larry Wasserman Carnegie Mellon June 11, 2017 1 / 32 Hypothesis testing beyond classical regimes Example 1: Testing distribution of bacteria in gut microbiome

More information

Lecture 21: Minimax Theory

Lecture 21: Minimax Theory Lecture : Minimax Theory Akshay Krishnamurthy akshay@cs.umass.edu November 8, 07 Recap In the first part of the course, we spent the majority of our time studying risk minimization. We found many ways

More information

Fast learning rates for plug-in classifiers under the margin condition

Fast learning rates for plug-in classifiers under the margin condition Fast learning rates for plug-in classifiers under the margin condition Jean-Yves Audibert 1 Alexandre B. Tsybakov 2 1 Certis ParisTech - Ecole des Ponts, France 2 LPMA Université Pierre et Marie Curie,

More information

RATE-OPTIMAL GRAPHON ESTIMATION. By Chao Gao, Yu Lu and Harrison H. Zhou Yale University

RATE-OPTIMAL GRAPHON ESTIMATION. By Chao Gao, Yu Lu and Harrison H. Zhou Yale University Submitted to the Annals of Statistics arxiv: arxiv:0000.0000 RATE-OPTIMAL GRAPHON ESTIMATION By Chao Gao, Yu Lu and Harrison H. Zhou Yale University Network analysis is becoming one of the most active

More information

Plug-in Approach to Active Learning

Plug-in Approach to Active Learning Plug-in Approach to Active Learning Stanislav Minsker Stanislav Minsker (Georgia Tech) Plug-in approach to active learning 1 / 18 Prediction framework Let (X, Y ) be a random couple in R d { 1, +1}. X

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

Bayesian Adaptation. Aad van der Vaart. Vrije Universiteit Amsterdam. aad. Bayesian Adaptation p. 1/4

Bayesian Adaptation. Aad van der Vaart. Vrije Universiteit Amsterdam.  aad. Bayesian Adaptation p. 1/4 Bayesian Adaptation Aad van der Vaart http://www.math.vu.nl/ aad Vrije Universiteit Amsterdam Bayesian Adaptation p. 1/4 Joint work with Jyri Lember Bayesian Adaptation p. 2/4 Adaptation Given a collection

More information

Homework 1 Due: Wednesday, September 28, 2016

Homework 1 Due: Wednesday, September 28, 2016 0-704 Information Processing and Learning Fall 06 Homework Due: Wednesday, September 8, 06 Notes: For positive integers k, [k] := {,..., k} denotes te set of te first k positive integers. Wen p and Y q

More information

Lecture 17: Density Estimation Lecturer: Yihong Wu Scribe: Jiaqi Mu, Mar 31, 2016 [Ed. Apr 1]

Lecture 17: Density Estimation Lecturer: Yihong Wu Scribe: Jiaqi Mu, Mar 31, 2016 [Ed. Apr 1] ECE598: Information-theoretic methods in high-dimensional statistics Spring 06 Lecture 7: Density Estimation Lecturer: Yihong Wu Scribe: Jiaqi Mu, Mar 3, 06 [Ed. Apr ] In last lecture, we studied the minimax

More information

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model.

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model. Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model By Michael Levine Purdue University Technical Report #14-03 Department of

More information

Lecture 8: Information Theory and Statistics

Lecture 8: Information Theory and Statistics Lecture 8: Information Theory and Statistics Part II: Hypothesis Testing and I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 23, 2015 1 / 50 I-Hsiang

More information

Lecture Notes 15 Prediction Chapters 13, 22, 20.4.

Lecture Notes 15 Prediction Chapters 13, 22, 20.4. Lecture Notes 15 Prediction Chapters 13, 22, 20.4. 1 Introduction Prediction is covered in detail in 36-707, 36-701, 36-715, 10/36-702. Here, we will just give an introduction. We observe training data

More information

Homework 1 Due: Thursday 2/5/2015. Instructions: Turn in your homework in class on Thursday 2/5/2015

Homework 1 Due: Thursday 2/5/2015. Instructions: Turn in your homework in class on Thursday 2/5/2015 10-704 Homework 1 Due: Thursday 2/5/2015 Instructions: Turn in your homework in class on Thursday 2/5/2015 1. Information Theory Basics and Inequalities C&T 2.47, 2.29 (a) A deck of n cards in order 1,

More information

Unlabeled Data: Now It Helps, Now It Doesn t

Unlabeled Data: Now It Helps, Now It Doesn t institution-logo-filena A. Singh, R. D. Nowak, and X. Zhu. In NIPS, 2008. 1 Courant Institute, NYU April 21, 2015 Outline institution-logo-filena 1 Conflicting Views in Semi-supervised Learning The Cluster

More information

10-704: Information Processing and Learning Fall Lecture 21: Nov 14. sup

10-704: Information Processing and Learning Fall Lecture 21: Nov 14. sup 0-704: Information Processing and Learning Fall 206 Lecturer: Aarti Singh Lecture 2: Nov 4 Note: hese notes are based on scribed notes from Spring5 offering of this course LaeX template courtesy of UC

More information

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting

More information
