Peter Hoff
Minimax estimation
October 31


Contents

1 Motivation and definition
2 Least favorable prior
3 Least favorable prior sequence
4 Nonparametric problems
5 Minimax and admissibility
6 Superefficiency and sparsity

Most of this material comes from Lehmann and Casella [1998], Section 5.1, and some comes from Ferguson [1967].

1 Motivation and definition

Let $X \sim \text{binomial}(n, \theta)$, and let $\bar X = X/n$. Via admissibility of unique Bayes estimators,
$$\delta_{w,\theta_0}(X) = w \bar X + (1 - w)\theta_0$$
is admissible under squared error loss for all $w \in (0, 1)$ and $\theta_0 \in (0, 1)$. Its risk is
$$R(\theta, \delta_{w,\theta_0}) = \mathrm{Var}[\delta_{w,\theta_0}] + \mathrm{Bias}^2[\delta_{w,\theta_0}] = w^2 \mathrm{Var}[\bar X] + (1-w)^2(\theta - \theta_0)^2 = w^2\,\theta(1-\theta)/n + (1-w)^2(\theta - \theta_0)^2.$$
Risk functions for three such estimators are shown in Figure 1. Requiring admissibility is not enough to identify a unique procedure.
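A minimal Python sketch of this risk calculation (illustrative only; the grid of $\theta$ values is an arbitrary choice), evaluating the three curves plotted in Figure 1:

```python
import numpy as np

def risk_linear(theta, w, theta0, n):
    """Risk of delta(X) = w*Xbar + (1 - w)*theta0 under squared error loss."""
    return w**2 * theta * (1 - theta) / n + (1 - w)**2 * (theta - theta0)**2

n, w = 10, 0.8
theta = np.linspace(0, 1, 5)          # coarse grid, just to print a few values
for theta0 in (0.25, 0.50, 0.75):
    vals = risk_linear(theta, w, theta0, n)
    print(f"theta0 = {theta0}: risk at theta = {theta.round(2)} is {vals.round(4)}")
```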

[Figure 1: Three risk functions for estimating a binomial proportion when $n = 10$, $w = 0.8$ and $\theta_0 \in \{1/4, 2/4, 3/4\}$.]

This is good; we can require more of our estimator. This is bad: what should we require?

One idea is to avoid a "worst case scenario": evaluate procedures by their maximum risk.

Definition 1 (minimax risk, minimax estimator). The minimax risk is defined as
$$R_m(\Theta) = \inf_{\delta} \sup_{\theta \in \Theta} R(\theta, \delta).$$
An estimator $\delta_m$ is a minimax estimator of $\theta$ if
$$\sup_{\theta \in \Theta} R(\theta, \delta_m) = \inf_{\delta} \sup_{\theta \in \Theta} R(\theta, \delta) = R_m(\Theta),$$
i.e. $\sup_\Theta R(\theta, \delta_m) \le \sup_\Theta R(\theta, \delta)$ for all $\delta \in D$.
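A short sketch of the maximum-risk comparison (illustrative; the supremum is approximated by a grid maximum): within the three-estimator family of Figure 1, the minimax criterion prefers whichever has the smallest maximum.

```python
import numpy as np

n, w = 10, 0.8
theta = np.linspace(0, 1, 2001)

def sup_risk(w, theta0):
    """Approximate sup-risk of w*Xbar + (1 - w)*theta0 by maximizing over a theta grid."""
    R = w**2 * theta * (1 - theta) / n + (1 - w)**2 * (theta - theta0)**2
    return R.max()

maxima = {theta0: sup_risk(w, theta0) for theta0 in (0.25, 0.50, 0.75)}
for theta0, m in maxima.items():
    print(f"theta0 = {theta0}: sup-risk ~= {m:.4f}")
print("smallest maximum risk among these three: theta0 =", min(maxima, key=maxima.get))
```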

2 Least favorable prior

Identifying a minimax estimator seems difficult: one would need to minimize the supremum risk over all estimators. However, in many cases we can identify a minimax estimator using some intuition:

- Suppose some values of $\theta$ are harder to estimate than others. Example: for the binomial model $\mathrm{Var}[X/n] = \theta(1-\theta)/n$, so $\theta = 1/2$ is hard to estimate well.
- Consider a prior $\pi(\theta)$ that heavily weights the difficult values of $\theta$. The Bayes estimator $\delta_\pi$ will do well for these difficult values, where the supremum risk of most estimators is likely to occur:
$$R(\pi, \delta_\pi) = \int R(\theta, \delta_\pi)\,\pi(d\theta) \le \int R(\theta, \delta)\,\pi(d\theta) = R(\pi, \delta).$$
This means that $R(\theta, \delta_\pi) \le R(\theta, \delta)$ in places of high $\pi$-probability, i.e. the difficult parts of $\Theta$. Since $\delta_\pi$ does well in the difficult region, maybe it is minimax.

Definition 2 (least favorable prior). A prior distribution $\pi$ is least favorable if
$$R(\pi, \delta_\pi) \ge R(\tilde\pi, \delta_{\tilde\pi}) \quad \text{for all priors } \tilde\pi \text{ on } \Theta.$$

Intuition: $\delta_\pi$ is the best you can do under the worst prior.

Analogy to competitive games: you get to choose an estimator $\delta$, your adversary gets to choose a prior $\pi$. Your adversary wants you to have high loss, you want to have low loss. For any $\pi$, your best strategy is $\delta_\pi$. Your adversary's best strategy is then for $\pi$ to be least favorable.
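To see the adversary's side of the game numerically, the following sketch (illustrative; it integrates the frequentist risk of the posterior mean against the prior on a truncated grid) computes the Bayes risk $R(\pi, \delta_\pi)$ for a few symmetric beta($a$, $a$) priors with $n = 10$; larger Bayes risk corresponds to a harder prior. The value $a = \sqrt n/2$ is included because it turns out later in these notes to be least favorable.

```python
import numpy as np
from scipy.stats import beta

n = 10
theta = np.linspace(0.001, 0.999, 1999)   # interior grid keeps the beta(1/2, 1/2) density finite

def bayes_risk(a):
    """Bayes risk of the posterior-mean estimator (X + a)/(n + 2a) under a beta(a, a) prior."""
    # frequentist risk (variance + squared bias) of the posterior mean under squared error loss
    R = (n * theta * (1 - theta) + a**2 * (1 - 2 * theta)**2) / (n + 2 * a)**2
    return np.trapz(R * beta.pdf(theta, a, a), theta)   # integrate R(theta) against the prior

for a in (0.5, 1.0, np.sqrt(n) / 2, 5.0):
    print(f"a = {a:.3f}: Bayes risk ~= {bayes_risk(a):.5f}")
```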

Bounding the minimax risk: The least favorable prior provides a lower bound on the minimax risk $R_m(\Theta)$. For any prior $\pi$ over $\Theta$ and any estimator $\delta$,
$$R(\pi, \delta) = \int R(\theta, \delta)\,\pi(d\theta) \le \int \Big[\sup_{\theta'} R(\theta', \delta)\Big]\,\pi(d\theta) = \sup_{\theta} R(\theta, \delta).$$
Minimizing over all estimators $\delta \in D$, we have
$$R(\pi, \delta_\pi) = \inf_\delta R(\pi, \delta) \le \inf_\delta \sup_\theta R(\theta, \delta) = R_m(\Theta).$$
The Bayes risk of any prior gives a lower bound for the minimax risk. Maximizing over $\pi$ gives the sharpest lower bound:
$$\sup_\pi R(\pi, \delta_\pi) \le R_m(\Theta).$$

Finding the LFP and minimax estimator: For any prior $\pi$, we have shown
$$R(\pi, \delta_\pi) \le \inf_\delta \sup_\theta R(\theta, \delta).$$
On the other hand, for any estimator $\delta_\pi$, we have
$$\inf_\delta \sup_\theta R(\theta, \delta) \le \sup_\theta R(\theta, \delta_\pi).$$
Putting these together gives
$$R(\pi, \delta_\pi) \le \inf_\delta \sup_\theta R(\theta, \delta) \le \sup_\theta R(\theta, \delta_\pi).$$
Therefore, $\delta_\pi$ achieves the minimax risk if $R(\pi, \delta_\pi) = \sup_\theta R(\theta, \delta_\pi)$.
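A quick numerical check of this sandwich for one particular prior (illustrative sketch; uniform prior, $n = 10$): the Bayes risk sits strictly below the supremum risk of the corresponding Bayes estimator, so the certificate $R(\pi, \delta_\pi) = \sup_\theta R(\theta, \delta_\pi)$ fails for this prior.

```python
import numpy as np

n, a = 10, 1.0                          # uniform prior corresponds to beta(1, 1)
theta = np.linspace(0.0, 1.0, 2001)

# frequentist risk of the posterior-mean estimator (X + a)/(n + 2a) under squared error loss
R = (n * theta * (1 - theta) + a**2 * (1 - 2 * theta)**2) / (n + 2 * a)**2
bayes_risk = np.trapz(R, theta)         # uniform prior density is 1 on (0, 1)
print(f"R(pi, delta_pi)       ~= {bayes_risk:.5f}")
print(f"sup_theta R(theta, .) ~= {R.max():.5f}   (strictly larger)")
```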

Theorem 1 (LC 5.1.4). Let $\delta_\pi$ be (unique) Bayes for $\pi$, and suppose $R(\pi, \delta_\pi) = \sup_\theta R(\theta, \delta_\pi)$. Then
1. $\delta_\pi$ is (unique) minimax;
2. $\pi$ is least favorable.

Proof.
1. For any other estimator $\delta$,
$$\sup_\theta R(\theta, \delta) \ge \int R(\theta, \delta)\,\pi(d\theta) \ge \int R(\theta, \delta_\pi)\,\pi(d\theta) = \sup_\theta R(\theta, \delta_\pi).$$
If $\delta_\pi$ is unique Bayes under $\pi$, then the second inequality is strict and $\delta_\pi$ is unique minimax.
2. Let $\tilde\pi$ be a prior over $\Theta$. Then
$$R(\tilde\pi, \delta_{\tilde\pi}) = \int R(\theta, \delta_{\tilde\pi})\,\tilde\pi(d\theta) \le \int R(\theta, \delta_\pi)\,\tilde\pi(d\theta) \le \sup_\theta R(\theta, \delta_\pi) = R(\pi, \delta_\pi).$$

Notes: regarding the condition $R(\pi, \delta_\pi) = \sup_\theta R(\theta, \delta_\pi)$,
- the condition is sufficient but not necessary for $\delta_\pi$ to be minimax;
- the condition is very restrictive - it is met only if
$$\pi\big(\{\theta : R(\theta, \delta_\pi) = \sup_{\theta'} R(\theta', \delta_\pi)\}\big) = 1.$$

Bayes estimators that satisfy this condition are sometimes called equalizer rules:

Definition 3. An estimator $\delta$ is an equalizer rule with respect to $\pi$ if $\delta$ is Bayes with respect to $\pi$ and
$$R(\theta, \delta) = \sup_{\theta'} R(\theta', \delta) \quad \text{a.e. } \pi.$$

The Theorem implies that equalizer rules are minimax:

Corollary 1 (LC cor 5.1.6). An equalizer rule is minimax.

In the definition of an equalizer rule, it is not enough that $R(\theta, \delta)$ is constant a.e. $\pi$. For example, if the supremum risk of an estimator occurs on a set of $\pi$-measure zero, then generally $R(\pi, \delta) < \sup_\theta R(\theta, \delta)$ and the conditions of the theorem do not hold.

To summarize: we need $R(\pi, \delta_\pi) = \sup_\theta R(\theta, \delta_\pi)$.
- This condition will be met if $\delta_\pi$ has constant risk.
- This condition will not be met if $\delta_\pi$ has constant risk a.e. $\pi$, but higher risk on a set of $\pi$-measure zero.
- This condition will be met if $\delta_\pi$ achieves its maximum risk on a set of $\pi$-probability 1.

Example (binomial proportion): $X \sim \text{binomial}(n, \theta)$, estimate $\theta$ with squared error loss. Let's try to find a Bayes estimator with constant risk. Can this be done with conjugate priors? Under a beta($a$, $b$) prior, we have
$$\delta_{ab}(X) = \frac{a + X}{a + b + n},$$
$$R(\theta, \delta_{ab}) = \mathrm{Var}[\delta_{ab}] + \mathrm{Bias}^2[\delta_{ab}] = \frac{n\theta(1-\theta)}{(a+b+n)^2} + \frac{(a - \theta(a+b))^2}{(a+b+n)^2}.$$
Can we make this constant as a function of $\theta$? The numerator is
$$n\theta - n\theta^2 + (a+b)^2\theta^2 - 2a(a+b)\theta + c(a, b, n).$$

This will be constant in $\theta$ if
$$n = 2a(a+b) \quad \text{and} \quad n = (a+b)^2;$$
solving for $a$ and $b$ gives $a = b = \sqrt n/2$. Therefore, the estimator
$$\delta_{\sqrt n/2}(X) = \frac{X + \sqrt n/2}{n + \sqrt n} = \frac{\sqrt n}{\sqrt n + 1}\,\bar X + \frac{1}{\sqrt n + 1}\cdot\frac12$$
has constant risk and is Bayes, therefore an equalizer rule, and therefore minimax. beta($\sqrt n/2$, $\sqrt n/2$) is a least favorable prior.

Risk comparison:
$$R(\theta, \delta_{\sqrt n/2}(X)) = \frac{1}{(2(1 + \sqrt n))^2}, \qquad R(\theta, \bar X) = \frac{\theta(1-\theta)}{n}.$$
At $\theta = 1/2$ (a difficult value of $\theta$), we have
$$R(1/2, \bar X) = \frac{1}{4n} > \frac{1}{4(n + 2\sqrt n + 1)} = R(1/2, \delta_{\sqrt n/2}(X)).$$
Draw the picture. Notes:
- The region in which $R(\theta, \delta_{\sqrt n/2}(X)) < R(\theta, \bar X)$ decreases in size with $n$. Draw the picture.
- The least favorable prior is not unique. To see this, note that under any prior $\pi$,
$$\delta_\pi(x) = E[\theta \mid X = x] = \frac{\int \theta^{x+1}(1-\theta)^{n-x}\,\pi(d\theta)}{\int \theta^{x}(1-\theta)^{n-x}\,\pi(d\theta)},$$
and so the estimator only depends on the first $n + 1$ moments of $\theta$ under $\pi$.
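In place of the picture, a small numerical sketch (illustrative; grid and sample sizes are arbitrary) confirming the constant risk of $\delta_{\sqrt n/2}$ and locating the interval of $\theta$ on which it beats $\bar X$; the interval shrinks with $n$.

```python
import numpy as np

theta = np.linspace(0.001, 0.999, 1999)
for n in (10, 100):
    rn = np.sqrt(n)
    R_minimax = 1.0 / (2.0 * (1.0 + rn))**2          # constant risk of the equalizer rule
    R_xbar = theta * (1 - theta) / n                  # risk of Xbar
    better = theta[R_minimax < R_xbar]                # where the minimax rule has smaller risk
    print(f"n = {n:>3}: constant risk = {R_minimax:.5f}, "
          f"beats Xbar roughly for theta in ({better.min():.3f}, {better.max():.3f})")
```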

The minimax estimator is sensitive to the loss function: under $L(\theta, d) = (\theta - d)^2/[\theta(1-\theta)]$, $X/n$ is minimax (it is an equalizer rule under beta(1, 1)).

Example (difference in proportions): $X \sim \text{binomial}(n, \theta_x)$, $Y \sim \text{binomial}(n, \theta_y)$, estimate $\theta_y - \theta_x$. Can we proceed from the one-sample case? Consider
$$\frac{Y + \sqrt n/2}{n + \sqrt n} - \frac{X + \sqrt n/2}{n + \sqrt n} = \frac{Y - X}{n + \sqrt n}.$$
It turns out that this is not minimax. Consider estimators of the form $\delta(X, Y) = c\,(Y - X)$ for some $c \in (0, 1/n)$.

Strategy: For each $c$, where does $c(Y - X)$ have maximum risk? Is $c(Y - X)$ Bayes w.r.t. a prior that assigns probability 1 to points of maximum risk?
$$R(\theta, \delta_c) = \mathrm{Var}[\delta_c] + \big(E[c(Y - X)] - (\theta_y - \theta_x)\big)^2 = c^2 n[\theta_x(1-\theta_x) + \theta_y(1-\theta_y)] + (cn - 1)^2(\theta_y - \theta_x)^2.$$
Taking derivatives, the maximum risk is attained at $(\theta_x, \theta_y)$ satisfying
$$[2(cn-1)^2 - 2c^2 n]\,\theta_x - 2(cn-1)^2\,\theta_y = -c^2 n$$
$$[2(cn-1)^2 - 2c^2 n]\,\theta_y - 2(cn-1)^2\,\theta_x = -c^2 n.$$
We have two equations and two unknowns, and so for a given $c$, maximum risk is generally attained at some point $\theta_c = (\theta_{x,c}, \theta_{y,c})$. The theorem suggests trying a prior

$\pi_c(\theta) = 1(\theta = \theta_c)$. But then the Bayes estimator $\delta_{\pi_c}$ is the constant $\theta_{y,c} - \theta_{x,c}$, not $\delta_c$. Furthermore, $\delta_{\pi_c}$ has its minimum risk at $\theta_c$.

However, consider $\Theta_0 = \{\theta : \theta_x + \theta_y = 1\} \subset \Theta$. On this set, $\theta_y(1-\theta_y) = \theta_x(1-\theta_x)$ and $\theta_y - \theta_x = 1 - 2\theta_x$, and the risk function is
$$R(\theta, \delta_c) = c^2 n\big(2\theta_x(1-\theta_x)\big) + (cn-1)^2(1 - 2\theta_x)^2.$$
Is there a value of $c$ that gives this estimator constant risk (on $\Theta_0$)?
$$\frac{dR(\theta_x, c(Y - X))}{d\theta_x} = \frac{d}{d\theta_x}\Big[c^2 n\, 2(\theta_x - \theta_x^2) + (cn-1)^2(1 - 4\theta_x + 4\theta_x^2)\Big] = c^2 n\, 2(1 - 2\theta_x) - (cn-1)^2\, 4(1 - 2\theta_x).$$
Setting this to zero for all $\theta_x$ and solving for $c$, we have
$$c^2 n = 2(cn - 1)^2 \;\Longrightarrow\; \pm\sqrt n\, c = \sqrt 2\,(cn - 1) \;\Longrightarrow\; \pm\sqrt{n/2} = n - 1/c \;\Longrightarrow\; c = \frac{1}{n \mp \sqrt{n/2}} = \frac{1}{n}\cdot\frac{\sqrt{2n}}{\sqrt{2n} \mp 1}.$$
Using the "$-$" solution (the one with $\sqrt{2n} - 1$ in the denominator) could give estimators outside the parameter space, so consider the "$+$" solution:
$$c_e = \frac{1}{n}\cdot\frac{\sqrt{2n}}{\sqrt{2n} + 1}, \qquad \delta_e(X, Y) = \frac{\sqrt{2n}}{\sqrt{2n} + 1}\,(Y - X)/n,$$
i.e. $\delta_e$ is a shrunken version of the UMVUE. By construction, this estimator has constant risk on $\{\theta : \theta_x + \theta_y = 1\}$. Does it achieve its maximum risk on this set? Recall the risk function is
$$R(\theta, \delta_c) = c^2 n[\theta_x(1-\theta_x) + \theta_y(1-\theta_y)] + (cn-1)^2(\theta_y - \theta_x)^2 = c^2 n\Big(\theta_x(1-\theta_x) + \theta_y(1-\theta_y) + (\theta_y - \theta_x)^2/2\Big),$$

where we are using the fact that $2(cn-1)^2 = c^2 n$ for our risk-equalizing value of $c$. Taking derivatives, we have that the maximum risk occurs when
$$c^2 n\big[(1 - 2\theta_x) + (\theta_x - \theta_y)\big] = c^2 n(1 - \theta_x - \theta_y) = 0$$
$$c^2 n\big[(1 - 2\theta_y) + (\theta_y - \theta_x)\big] = c^2 n(1 - \theta_x - \theta_y) = 0,$$
which are both satisfied when $\theta_x + \theta_y = 1$. So far, we have shown:
- $\delta_e$ has constant risk on $\Theta_0 = \{\theta : \theta_x + \theta_y = 1\}$;
- $\delta_e$ achieves its maximum risk on this set.
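Both claims are easy to check numerically; the sketch below (illustrative, with $n = 10$ and an arbitrary grid) evaluates $R(\theta, \delta_e)$ over $(\theta_x, \theta_y)$ and confirms that the risk is constant along $\theta_x + \theta_y = 1$ and that the overall maximum is attained there.

```python
import numpy as np

n = 10
c = np.sqrt(2 * n) / (n * (np.sqrt(2 * n) + 1))     # the risk-equalizing choice c_e

tx, ty = np.meshgrid(np.linspace(0, 1, 401), np.linspace(0, 1, 401))
R = c**2 * n * (tx * (1 - tx) + ty * (1 - ty)) + (c * n - 1)**2 * (ty - tx)**2

on_line = np.isclose(tx + ty, 1.0)                  # grid points with theta_x + theta_y = 1
print("risk range on Theta_0:", R[on_line].min(), "to", R[on_line].max())   # essentially constant
print("overall maximum risk :", R.max())            # equals the constant value on Theta_0
```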

To show that it is minimax, what remains is to show that it is Bayes with respect to a prior $\pi$ on $\Theta_0$. If it is, then the condition $R(\pi, \delta_e) = \sup_{\theta \in \Theta} R(\theta, \delta_e)$ will be met and $\delta_e$ will therefore be minimax.

Consider estimation of $\theta_y = 1 - \theta_x$ from $X$ and $Y$. For what prior on $\theta_y$ will $\delta_e$ be Bayes? Note that on $\Theta_0$, $n + Y - X \sim \text{binomial}(2n, \theta_y)$. Recall that beta($\sqrt n/2$, $\sqrt n/2$) is least favorable for estimating a binomial proportion with $n$ samples. Here our sample size is $2n$, so let's see if beta($\sqrt{2n}/2$, $\sqrt{2n}/2$) is least favorable:
$$\pi(\theta_y \mid n + Y - X) = \text{dbeta}\big(n + Y - X + \sqrt{2n}/2,\; n + X - Y + \sqrt{2n}/2\big)$$
$$E[\theta_y \mid n + Y - X] = \frac{n + Y - X + \sqrt{2n}/2}{2n + \sqrt{2n}}.$$
Now recall we are trying to estimate $\theta_y - \theta_x$, which on $\Theta_0$ is equal to $2\theta_y - 1$.

Under squared error loss, the Bayes estimator of $2\theta_y - 1$ is $2E[\theta_y \mid n + Y - X] - 1$, which is
$$2E[\theta_y \mid n + Y - X] - 1 = \frac{2(n + Y - X) + \sqrt{2n} - (2n + \sqrt{2n})}{2n + \sqrt{2n}} = \frac{2(Y - X)}{2n + \sqrt{2n}} = \frac{\sqrt{2n}}{\sqrt{2n} + 1}\,(Y - X)/n = \delta_e.$$
So we have shown:
- $\delta_e$ is Bayes with respect to a prior $\pi$ on $\Theta_0 \subset \Theta$;
- $\delta_e$ has constant risk a.e. $\pi$;
- $\sup_\Theta R(\theta, \delta_e)$ is achieved on $\Theta_0$.
Therefore, $\delta_e$ is an equalizer rule, and hence minimax.

3 Least favorable prior sequence

In many problems there are no least favorable priors, and the main theorem from above is of no help.

Example (normal mean, known variance): $X_1, \ldots, X_n \sim$ i.i.d. normal($\theta$, $\sigma^2$), $\sigma^2$ known. The estimator $\bar X$ has constant risk $R(\theta, \bar X) = \sigma^2/n$, so it seems potentially minimax. But no prior $\pi$ over $\Theta$ gives $\delta_\pi(X) = \bar X$. You can also see that there is no least favorable prior: under any prior $\pi$,
$$R(\pi, \delta_\pi) < R(\pi, \bar X) = \sigma^2/n,$$

but you can find priors whose Bayes risk is arbitrarily close to $\sigma^2/n$ (i.e., the set of Bayes risks is not closed). However, $\bar X$ is a limit of Bayes estimators: under the prior $\theta \sim$ normal($0$, $\tau^2$), as $\tau^2 \to \infty$,
$$\delta_{\tau^2} = \frac{n/\sigma^2}{n/\sigma^2 + 1/\tau^2}\,\bar X \to \bar X, \qquad R(\pi_{\tau^2}, \delta_{\tau^2}) = \frac{1}{n/\sigma^2 + 1/\tau^2} \to \sigma^2/n = R(\theta, \bar X).$$

More generally, consider
- $(\pi_k, \delta_k)$, a sequence of prior distributions and corresponding Bayes estimators such that $R(\pi_k, \delta_k) \to R$;
- $\delta$, an estimator such that $\sup_\theta R(\theta, \delta) = R$.

For each $k$, we have $R(\pi_k, \delta_k) \le R_m$. However, we always have $R_m \le \sup_\theta R(\theta, \delta)$. Putting these together gives
$$R(\pi_k, \delta_k) \le R_m \le \sup_\theta R(\theta, \delta).$$
If $R(\pi_k, \delta_k) \to \sup_\theta R(\theta, \delta)$, then we must have $R_m = \sup_\theta R(\theta, \delta)$, and so $\delta$ must be minimax.

Definition 4. A sequence $\{\pi_k\}$ of priors is least favorable if for every prior $\pi$,
$$R(\pi, \delta_\pi) \le \lim_{k\to\infty} R(\pi_k, \delta_k).$$
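A tiny numerical illustration of the least favorable sequence in the normal-mean example (the values of $n$, $\sigma^2$ and $\tau^2$ are arbitrary choices): the Bayes risks $1/(n/\sigma^2 + 1/\tau^2)$ increase toward $\sigma^2/n$, the constant risk of $\bar X$.

```python
n, sigma2 = 25, 1.0
for tau2 in (0.1, 1.0, 10.0, 100.0, 10_000.0):
    bayes_risk = 1.0 / (n / sigma2 + 1.0 / tau2)    # Bayes risk under a normal(0, tau2) prior
    print(f"tau^2 = {tau2:>9}: Bayes risk = {bayes_risk:.6f}")
print(f"constant risk of Xbar: sigma^2 / n = {sigma2 / n:.6f}")
```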

Theorem 2 (LC thm 5.1.12). Let $\{\pi_k\}$ be a sequence of prior distributions and $\delta$ an estimator such that $R(\pi_k, \delta_k) \to \sup_\theta R(\theta, \delta)$. Then
1. $\delta$ is minimax, and
2. $\{\pi_k\}$ is least favorable.

Proof.
1. For any estimator $\delta'$ and every $k$,
$$\sup_\theta R(\theta, \delta') \ge R(\pi_k, \delta') \ge R(\pi_k, \delta_k),$$
so $\sup_\theta R(\theta, \delta') \ge \lim_k R(\pi_k, \delta_k) = \sup_\theta R(\theta, \delta)$.
2. For any prior $\pi$,
$$R(\pi, \delta_\pi) \le R(\pi, \delta) \le \sup_\theta R(\theta, \delta) = \lim_k R(\pi_k, \delta_k).$$

Exercise: Prove the following (weaker) variant of the above theorem: any constant risk generalized Bayes estimator is minimax.

Example (normal mean, known variance): Let $X_1, \ldots, X_n \sim$ i.i.d. $N_p(\theta, \sigma^2 I)$, $\sigma^2$ known. Let $\pi_k(\theta)$ be such that $\theta \sim N_p(0, \tau_k^2 I)$ with $\tau_k^2 \to \infty$, and let
$$\delta_k = \frac{n/\sigma^2}{n/\sigma^2 + 1/\tau_k^2}\,\bar X.$$
Under average squared-error loss,
$$R(\pi_k, \delta_k) = \frac{1}{n/\sigma^2 + 1/\tau_k^2} \to \sigma^2/n = R(\theta, \bar X).$$
Thus $\bar X$ is minimax.

Alert! This result should seem a bit odd to you - $\bar X$ is inadmissible for $p \ge 3$. How can it be dominated if it is minimax? We even have the following theorem:

Theorem 3 (LC lemma). Any unique minimax estimator is admissible.

Proof. If $\delta$ is unique minimax and $\delta'$ is any other estimator, then
$$\sup_\theta R(\theta, \delta) < \sup_\theta R(\theta, \delta') \;\Rightarrow\; \exists\,\theta_0 : R(\theta_0, \delta) < R(\theta_0, \delta'),$$
so $\delta$ can't be dominated.

So how can $\bar X$ be minimax and not admissible? The only possibility left by the theorem is that $\bar X$ is not unique minimax. In fact, for $p \ge 3$, estimators of the form
$$\delta(x) = \left(1 - \frac{c(\|x\|)\,\sigma^2 (p-2)}{\|x\|^2}\right) x$$
are minimax as long as $c : \mathbb{R}_+ \to [0, 2]$ is nondecreasing (LC theorem 5.5.5). For these estimators, the supremum risk occurs in the limit as $\|\theta\| \to \infty$.

Example (normal mean, unknown variance): Let $X_1, \ldots, X_n \sim$ i.i.d. $N_p(\theta, \sigma^2 I)$, $\sigma^2 \in \mathbb{R}_+$ unknown. Then $\sup_{\theta, \sigma^2} R((\theta, \sigma^2), \bar X) = \infty$. In fact, we can prove that every estimator has infinite maximum risk:
$$\sup_{\theta, \sigma^2} R((\theta, \sigma^2), \delta) = \sup_{\sigma^2} \sup_\theta R(\theta, \delta(X) \mid \sigma^2) \ge \sup_{\sigma^2} \sigma^2/n = \infty,$$
where the inequality holds because $\bar X$ is minimax in the known-variance case. Therefore, every estimator is trivially minimax, with maximum risk of infinity.

Here is a not-entirely-satisfying solution to this problem: assume $\sigma^2 \le M$, $M$ known. Applying exactly the same argument as above, we have
$$\sup_{\theta, \sigma^2} R((\theta, \sigma^2), \delta) = \sup_{\sigma^2 \le M} \sup_\theta R(\theta, \delta(X) \mid \sigma^2) \ge \sup_{\sigma^2 \le M} \sigma^2/n = M/n,$$
and so $\bar X$ is minimax with maximum risk $M/n$.

Note that the value of $M$ doesn't affect the estimator - $\bar X$ is minimax no matter what $M$ is. Does this mean that $\bar X$ is the minimax estimator for $\sigma^2 \in \mathbb{R}_+$? Here are some thoughts:

- For $p = 1$, $\bar X$ is unique minimax for $\theta \in \mathbb{R}$, $\sigma^2 \in (0, M]$, for all $M$. For $\sigma^2 \in \mathbb{R}_+$ it is still minimax, although not unique.
- For $p > 2$, $\bar X$ is minimax, but not unique minimax, for $\sigma^2 \in (0, M]$ or even known $\sigma^2$.

A better solution to this problem is to change the loss function to $L((\theta, \sigma^2), d) = (\theta - d)^2/\sigma^2$, standardized squared-error loss.

Exercise: Show that $\bar X$ is minimax under standardized squared-error loss.

4 Nonparametric problems

The normal mean, unknown variance problem above gave some indication that we can deal with nuisance parameters (such as the variance $\sigma^2$) when obtaining a minimax estimator for a parameter of interest (such as the mean $\theta$). What about more general nuisance parameters?

Let $X \sim P \in \mathcal{P}$. Suppose we want to estimate $\theta = g(P)$, some functional of $P$. Suppose $\delta$ is minimax for $\theta = g(P)$ when $P \in \mathcal{P}_0 \subset \mathcal{P}$. When will it also be minimax for $P \in \mathcal{P}$?

Theorem (LC). If $\delta$ is minimax for $\theta$ when $P \in \mathcal{P}_0 \subset \mathcal{P}$ and
$$\sup_{P \in \mathcal{P}_0} R(P, \delta) = \sup_{P \in \mathcal{P}} R(P, \delta),$$
then $\delta$ is minimax for $\theta$ when $P \in \mathcal{P}$.

Proof. For any other estimator $\delta'$,
$$\sup_{P \in \mathcal{P}} R(P, \delta') \ge \sup_{P \in \mathcal{P}_0} R(P, \delta') \ge \sup_{P \in \mathcal{P}_0} R(P, \delta) = \sup_{P \in \mathcal{P}} R(P, \delta).$$

Example (difference in binomial proportions):
$$\mathcal{P} = \{\text{dbinom}(x, n, \theta_x)\,\text{dbinom}(y, n, \theta_y) : \theta_x, \theta_y \in [0,1]\}, \qquad \mathcal{P}_0 = \{\text{dbinom}(x, n, \theta_x)\,\text{dbinom}(y, n, \theta_y) : \theta_x = 1 - \theta_y\}.$$
- For estimation of $\theta_y - \theta_x$ on $\mathcal{P}_0$, we showed that $c\,(Y - X)/n$, with $c = \sqrt{2n}/(\sqrt{2n} + 1)$, has constant risk on $\mathcal{P}_0$.
- $c\,(Y - X)/n$ is Bayes w.r.t. a beta($\sqrt{2n}/2$, $\sqrt{2n}/2$) prior on $\theta_y$, and so $c\,(Y - X)/n$ is minimax on $\mathcal{P}_0$.
- We then showed that $c\,(Y - X)/n$ achieves its maximum risk on $\mathcal{P}_0$, and so by the theorem it is minimax.

Example (population mean, bounded variance): Let $\mathcal{P} = \{P : \mathrm{Var}[X \mid P] \le M\}$, $M$ known. Is $\bar X$ minimax for $\theta = E[X \mid P]$? Let $\mathcal{P}_0 = \{\text{dnorm}(x, \theta, \sigma) : \theta \in \mathbb{R}, \sigma^2 \le M\}$. Then
1. $\bar X$ is minimax for $\mathcal{P}_0$;
2. $\sup_{P \in \mathcal{P}_0} R(P, \bar X) = \sup_{P \in \mathcal{P}} R(P, \bar X) = M/n$.
Thus $\bar X$ is minimax (and unique in the univariate case, as shown in LC 5.2).

Example (population mean, bounded range): Let $X_1, \ldots, X_n \sim$ i.i.d. $P \in \mathcal{P}$, where $P([0, 1]) = 1$ for all $P \in \mathcal{P}$.

Least favorable perspective: our previous experience tells us to find a minimax estimator that is best under the worst conditions. What are the worst conditions? Guess: the most difficult situation is where the mass of each $P$ is as spread out as possible. Let $\mathcal{P}_0 = \{\text{binary}(x, \theta) : \theta \in [0, 1]\}$. As we've already derived, the minimax estimator of $E[X_i] = \Pr(X_i = 1 \mid \theta) = \theta$ based on $X_1, \ldots, X_n \sim$ i.i.d. binary($\theta$) is
$$\delta(X) = \frac{n\bar X + \sqrt n/2}{n + \sqrt n} = \frac{\sqrt n}{\sqrt n + 1}\,\bar X + \frac{1}{\sqrt n + 1}\cdot\frac12.$$

To use the theorem above to show this is minimax for $\mathcal{P}$, we need to show
$$\sup_{P \in \mathcal{P}} R(P, \delta) = \sup_{P \in \mathcal{P}_0} R(P, \delta).$$
Let's calculate the risk of $\delta$ for $P \in \mathcal{P}$, writing $\theta = E[X \mid P]$:
$$\mathrm{Var}[\delta(X)] = \frac{n}{(\sqrt n + 1)^2}\,\mathrm{Var}[X \mid P]/n = \frac{\mathrm{Var}[X \mid P]}{(\sqrt n + 1)^2}$$
$$E[\delta(X)] = \frac{\sqrt n}{\sqrt n + 1}\,\theta + \frac{1}{2(\sqrt n + 1)} = \frac{\sqrt n\,\theta + 1/2}{\sqrt n + 1}$$
$$\mathrm{Bias}^2(\delta \mid P) = \frac{(\theta - 1/2)^2}{(\sqrt n + 1)^2},$$
and so
$$R(P, \delta) = \frac{1}{(\sqrt n + 1)^2}\big[\mathrm{Var}[X \mid P] + (\theta - 1/2)^2\big].$$
Where does this risk achieve its maximum value? Note that
$$\sup_{P \in \mathcal{P}} R(P, \delta) = \sup_{\theta \in [0,1]} \sup_{P \in \mathcal{P}_\theta} R(P, \delta),$$
where $\mathcal{P}_\theta = \{P \in \mathcal{P} : E[X \mid P] = \theta\}$. To do the inner supremum, note that for any $P \in \mathcal{P}_\theta$, since $X$ takes values in $[0, 1]$,
$$\mathrm{Var}[X \mid P] = E[X^2] - E[X]^2 \le E[X] - E[X]^2 = E[X](1 - E[X]) = \theta(1 - \theta),$$
with equality only if $P$ is the binary($\theta$) measure. Therefore, for each $\theta$ the supremum risk is attained at the binary($\theta$) distribution, and so the maximum risk of $\delta$ is attained on $\mathcal{P}_0$.
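A quick Monte Carlo sanity check of the "binary is worst" step (an illustrative sketch; the beta(3, 7) comparison distribution is an arbitrary mean-matched choice): for $\theta = 0.3$ and $n = 20$, the simulated risk of $\delta$ is larger under binary sampling than under a mean-matched beta distribution on $[0, 1]$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta, reps = 20, 0.3, 200_000
w = np.sqrt(n) / (np.sqrt(n) + 1)

def mc_risk(sample):
    """Monte Carlo risk of delta = w*Xbar + (1 - w)/2 from a (reps x n) array of samples."""
    est = w * sample.mean(axis=1) + (1 - w) * 0.5
    return np.mean((est - theta) ** 2)

risk_binary = mc_risk(rng.binomial(1, theta, size=(reps, n)).astype(float))
risk_beta = mc_risk(rng.beta(3.0, 7.0, size=(reps, n)))       # also has mean 0.3, smaller variance
print(f"risk under binary(0.3): {risk_binary:.5f}")
print(f"risk under beta(3, 7) : {risk_beta:.5f}   (smaller, as the argument predicts)")
```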

To recap,
$$\sup_{P \in \mathcal{P}} R(P, \delta) = \sup_{\theta \in [0,1]} \sup_{P \in \mathcal{P}_\theta} R(P, \delta) = \sup_{\theta \in [0,1]} \sup_{P \in \mathcal{P}_\theta \cap \mathcal{P}_0} R(P, \delta) = \sup_{P \in \mathcal{P}_0} R(P, \delta).$$
Thus $\delta$ is minimax on $\mathcal{P}_0$, achieves its maximum risk there, and so is minimax over all of $\mathcal{P}$.

5 Minimax and admissibility

We already showed a unique minimax estimator can't be dominated:

Theorem (LC lemma). Any unique minimax estimator is admissible.

This is reminiscent of our theorem about admissibility of unique Bayes estimators, which has a similar proof: if $\delta'$ were to dominate $\delta$, then $\delta'$ would be at least as good as $\delta$ in terms of both Bayes risk and maximum risk, and so such a $\delta$ can't be either unique Bayes or unique minimax.

What about the other direction? We have shown in some important cases that admissibility of an estimator implies it is Bayes, or close to Bayes. Does admissibility imply minimaxity? The answer is yes for constant risk estimators:

Theorem (LC lemma). If $\delta$ is constant risk and admissible, then it is minimax.

Proof. Let $\delta'$ be another estimator. Since $\delta$ is not dominated, there is a $\theta_0$ s.t. $R(\theta_0, \delta) \le R(\theta_0, \delta')$. But since $\delta$ has constant risk,
$$\sup_\theta R(\theta, \delta) = R(\theta_0, \delta) \le R(\theta_0, \delta') \le \sup_\theta R(\theta, \delta').$$

Below is a diagram summarizing some of the relationships between admissible, Bayes and minimax estimators we've covered so far:

[Diagram not reproduced in this transcription.]

6 Superefficiency and sparsity

Let $X_1, \ldots, X_n \sim$ i.i.d. $N(\theta, 1)$, where $\theta \in \mathbb{R}$. A version of Hodges' estimator of $\theta$ is given by
$$\delta_H(x) = \begin{cases} \bar x & \text{if } |\bar x| > 1/n^{1/4} \\ 0 & \text{if } |\bar x| < 1/n^{1/4}, \end{cases}$$
or more compactly, $\delta_H(x) = \bar x\, 1(|\bar x| > 1/n^{1/4})$.

Exercise: Show that the asymptotic distribution of $\delta_H(X)$ is given by
$$\sqrt n\,(\delta_H(X) - \theta) \to N(0, v(\theta)), \qquad \text{where } v(\theta) = \begin{cases} 1 & \text{if } \theta \ne 0 \\ 0 & \text{if } \theta = 0. \end{cases}$$

The asymptotic variance of $\sqrt n\,(\delta_H(X) - \theta)$ makes $\delta_H$ seem useful in many situations: often we are trying to estimate a parameter that could potentially be zero, or very close to zero. For such parameters, $\delta_H$ seems to be as good as $\bar X$ asymptotically for all $\theta$, and much better than $\bar X$ at the special value $\theta = 0$. In particular, if $\theta = 0$, then $\Pr(\delta_H(X) = 0) \to 1$. This seems much better than simple consistency at $\theta = 0$, which says
$$\Pr(|\delta_H(X)| < \epsilon) \to 1$$
for every $\epsilon > 0$. An estimator consistent at $\theta = 0$ never actually has to be zero, whereas the Hodges estimator is more and more likely to actually be zero as $n \to \infty$.
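A small simulation sketch of the two behaviors just described (illustrative; the values of $n$, $\theta$ and the seed are arbitrary): at $\theta = 0$ the estimator is exactly zero with probability approaching one, while for fixed $\theta \ne 0$ the scaled MSE settles near 1, with a visible bump at moderate $n$.

```python
import numpy as np

rng = np.random.default_rng(1)
reps = 100_000

def hodges(xbar, n):
    """Hodges' estimator: xbar, thresholded to zero when |xbar| <= n**(-1/4)."""
    return np.where(np.abs(xbar) > n ** (-0.25), xbar, 0.0)

for n in (25, 100, 400):
    for theta in (0.0, 0.5):
        xbar = rng.normal(theta, 1.0 / np.sqrt(n), size=reps)   # sampling distribution of Xbar
        est = hodges(xbar, n)
        print(f"n = {n:>4}, theta = {theta}: Pr(delta_H = 0) ~= {np.mean(est == 0.0):.3f}, "
              f"n * MSE ~= {n * np.mean((est - theta) ** 2):.2f}")
```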

However, there is a price to pay. Let's compare the maximum of the risk function for $\delta_H$ to that of $\bar X$, under squared error loss. For Hodges' estimator we have, for any $\theta \in \mathbb{R}$,
$$\sup_{\theta'} E_{\theta'}[(\delta_H(X) - \theta')^2] \ge E_\theta[(\delta_H(X) - \theta)^2] \ge E_\theta[(\delta_H(X) - \theta)^2\, 1(|\bar X| < 1/n^{1/4})]$$
$$= E_\theta[\theta^2\, 1(|\bar X| < 1/n^{1/4})] \qquad (\text{since } \delta_H(X) = 0 \text{ on this event})$$
$$= \theta^2 \Pr(-n^{-1/4} < \bar X < n^{-1/4})$$
$$= \theta^2 \Pr\big(\sqrt n\,(-n^{-1/4} - \theta) < \sqrt n\,(\bar X - \theta) < \sqrt n\,(n^{-1/4} - \theta)\big)$$
$$= \theta^2\big[\Phi(-\sqrt n\,\theta + n^{1/4}) - \Phi(-\sqrt n\,\theta - n^{1/4})\big].$$
Letting $\theta = \theta_0/\sqrt n$, we have
$$\sup_\theta R(\theta, \delta_H) \ge \frac{\theta_0^2}{n}\big[\Phi(-\theta_0 + n^{1/4}) - \Phi(-\theta_0 - n^{1/4})\big].$$
Note that this holds for any $\theta_0 \in \mathbb{R}$. On the other hand, the risk of $\bar X$ is constant, $1/n$ for all $\theta$. Therefore,
$$\frac{\sup_\theta R(\theta, \delta_H)}{\sup_\theta R(\theta, \bar X)} \ge \theta_0^2\big[\Phi(-\theta_0 + n^{1/4}) - \Phi(-\theta_0 - n^{1/4})\big].$$
You can play around with various values of $\theta_0$ to see how big you can make this bound for a given $n$. The thing to keep in mind is that the inequality holds for all $n$ and $\theta_0$. As $n \to \infty$, the normal CDF term goes to one, so
$$\lim_{n\to\infty} \frac{\sup_\theta R(\theta, \delta_H)}{\sup_\theta R(\theta, \bar X)} \ge \theta_0^2.$$
As this holds for all $\theta_0$, we have
$$\lim_{n\to\infty} \frac{\sup_\theta R(\theta, \delta_H)}{\sup_\theta R(\theta, \bar X)} = \infty.$$
Yikes! So the Hodges estimator becomes infinitely worse than $\bar X$ as $n \to \infty$. Is this asymptotic result relevant for finite-sample comparisons of estimators? A semi-closed form expression for the risk of $\delta_H$ is given by Lehmann and Casella [1998, page 442]. A plot of the finite-sample risk of $\delta_H$ is shown in Figure 2.
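Playing around with $\theta_0$ is easy to do in code; this sketch (illustrative; grids are arbitrary) maximizes the displayed lower bound over a grid of $\theta_0$ for several $n$, showing how the bound grows.

```python
import numpy as np
from scipy.stats import norm

def ratio_bound(theta0, n):
    """Lower bound on sup-risk(delta_H) / sup-risk(Xbar) derived above."""
    return theta0**2 * (norm.cdf(-theta0 + n**0.25) - norm.cdf(-theta0 - n**0.25))

for n in (25, 100, 2500, 10_000):
    theta0_grid = np.linspace(0.1, n**0.25, 500)
    best = ratio_bound(theta0_grid, n).max()
    print(f"n = {n:>6}: largest bound found on the grid ~= {best:.1f}")
```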

[Figure 2: Risk functions of $\delta_H$ for $n \in \{25, 100, 250\}$, with the risk bound in blue.]

Related to this calculation is something of more modern interest, the problem of variable selection in regression. Consider the following model:
$$X_1, \ldots, X_n \sim \text{i.i.d. } P_X, \qquad \epsilon_1, \ldots, \epsilon_n \sim \text{i.i.d. } N(0, 1), \qquad y_i = \beta^T X_i + \epsilon_i.$$
You may have attended one or more seminars where someone presented an estimator $\hat\beta$ of $\beta$ with the following property:
$$\Pr(\hat\beta_j = 0 \mid \beta) \to 1 \text{ as } n \to \infty \qquad \forall\,\beta : \beta_j = 0.$$
Again, such an estimator $\hat\beta_j$ is not just consistent at $\beta_j = 0$, it actually equals 0 with increasingly high probability as $n \to \infty$. This property has been coined "sparsistency" by people studying asymptotics of model selection procedures.

This special consistency at $\beta_j = 0$ seems similar to the properties of Hodges' estimator when $\theta = 0$. What does the behavior at 0 imply about the risk function elsewhere?
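As an illustration only (this is a made-up "Hodges-style" procedure, not any particular published sparsistent estimator): hard-thresholding the OLS coefficients at level $n^{-1/4}$ already produces this kind of behavior, with the truly-zero coefficient set exactly to zero with probability approaching one.

```python
import numpy as np

rng = np.random.default_rng(2)
p, beta = 3, np.array([0.0, 0.5, 1.0])       # the first coefficient is truly zero

for n in (50, 200, 800):
    reps, zeros = 2000, 0
    for _ in range(reps):
        X = rng.normal(size=(n, p))
        y = X @ beta + rng.normal(size=n)
        b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
        b_thr = np.where(np.abs(b_ols) > n ** (-0.25), b_ols, 0.0)   # Hodges-style thresholding
        zeros += (b_thr[0] == 0.0)
    print(f"n = {n:>4}: Pr(beta_hat_1 = 0 | beta_1 = 0) ~= {zeros / reps:.3f}")
```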

It is difficult to say anything about the entire risk function for all such estimators, but we can gain some general insight by looking at the supremum risk. Let $\beta_n = \beta_0/\sqrt n$, where $\beta_0$ is arbitrary. Then
$$\sup_\beta E_\beta[\|\hat\beta - \beta\|^2] \ge E_{\beta_n}[\|\hat\beta - \beta_n\|^2] \ge E_{\beta_n}[\|\hat\beta - \beta_n\|^2\, 1(\hat\beta = 0)] = \|\beta_n\|^2 \Pr(\hat\beta = 0 \mid \beta_n) = \frac{1}{n}\|\beta_0\|^2 \Pr(\hat\beta = 0 \mid \beta_n).$$
Now consider the risk of the OLS estimator:
$$E_\beta[\|\hat\beta_{\text{ols}} - \beta\|^2] = E_{P_X}[\mathrm{tr}((X^T X)^{-1})] \equiv v_n.$$
So we have
$$\frac{\sup_\beta R(\beta, \hat\beta)}{\sup_\beta R(\beta, \hat\beta_{\text{ols}})} \ge \frac{\|\beta_0\|^2 \Pr(\hat\beta = 0 \mid \beta_n)}{n v_n},$$
which holds for all $n$ and all $\beta_0$.

Now you should believe that $n v_n$ converges to some number $v > 0$. What about $\Pr(\hat\beta = 0 \mid \beta_n)$? Now $\beta_n \to 0$ as $n \to \infty$, and so it seems that, since $\hat\beta$ has the sparsistency property, eventually $\hat\beta$ will equal 0 with high probability. Next quarter you will learn about contiguity of sequences of distributions and will be able to show that
$$\lim_{n\to\infty} \Pr(\hat\beta = 0 \mid \beta = \beta_n) = \lim_{n\to\infty} \Pr(\hat\beta = 0 \mid \beta = 0) = 1.$$
The first equality is due to contiguity, the second to the sparsistency property of $\hat\beta$. This result gives
$$\lim_{n\to\infty} \frac{\sup_\beta R(\beta, \hat\beta)}{\sup_\beta R(\beta, \hat\beta_{\text{ols}})} \ge \|\beta_0\|^2/v.$$
Since this result holds for all $\beta_0$, the limit is in fact infinite. The result is that, in terms of supremum risk (under squared error loss), a sparsistent estimator becomes infinitely worse than the OLS estimator. This result is due to Leeb and Pötscher [2008] (see also Pötscher and Leeb [2009]).

Discuss:
- Is supremum risk an appropriate comparison?

- Is squared error an appropriate loss?
- When would you use a sparsistent estimator?

References

Thomas S. Ferguson. Mathematical Statistics: A Decision Theoretic Approach. Probability and Mathematical Statistics, Vol. 1. Academic Press, New York, 1967.

Hannes Leeb and Benedikt M. Pötscher. Sparse estimators and the oracle property, or the return of Hodges' estimator. Journal of Econometrics, 142(1), 2008.

E. L. Lehmann and George Casella. Theory of Point Estimation. Springer Texts in Statistics. Springer-Verlag, New York, second edition, 1998.

Benedikt M. Pötscher and Hannes Leeb. On the distribution of penalized maximum likelihood estimators: the LASSO, SCAD, and thresholding. Journal of Multivariate Analysis, 100(9), 2009.
