Minimax estimation
Peter Hoff
October 31
Contents

1 Motivation and definition
2 Least favorable prior
3 Least favorable prior sequence
4 Nonparametric problems
5 Minimax and admissibility
6 Superefficiency and sparsity

Most of this material comes from Lehmann and Casella [1998], Section 5.1, and some comes from Ferguson [1967].

1 Motivation and definition

Let $X \sim \mathrm{binomial}(n, \theta)$, and let $\bar X = X/n$. Via admissibility of unique Bayes estimators,
$$\delta_{w,\theta_0}(X) = w \bar X + (1-w)\theta_0$$
is admissible under squared error loss for all $w \in (0,1)$ and $\theta_0 \in (0,1)$. Its risk is
$$R(\theta, \delta_{w,\theta_0}) = \mathrm{Var}[\delta_{w,\theta_0}] + \mathrm{Bias}^2[\delta_{w,\theta_0}] = w^2 \theta(1-\theta)/n + (1-w)^2(\theta - \theta_0)^2.$$
Risk functions for three such estimators are in Figure 1. Requiring admissibility is not enough to identify a unique procedure.
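The variance-plus-squared-bias formula above can be checked numerically by computing the risk exactly as a sum over the binomial pmf. A sketch (the values of $n$, $w$ and $\theta_0$ are illustrative):

```python
from math import comb

def exact_risk(theta, w, theta0, n):
    # Exact risk of delta(X) = w*(X/n) + (1-w)*theta0 under squared error loss,
    # summing over the binomial(n, theta) pmf.
    r = 0.0
    for x in range(n + 1):
        pmf = comb(n, x) * theta**x * (1 - theta)**(n - x)
        delta = w * (x / n) + (1 - w) * theta0
        r += (delta - theta)**2 * pmf
    return r

def formula_risk(theta, w, theta0, n):
    # Variance plus squared bias, as derived above.
    return w**2 * theta * (1 - theta) / n + (1 - w)**2 * (theta - theta0)**2

n, w, theta0 = 10, 0.8, 0.25
max_err = max(abs(exact_risk(t, w, theta0, n) - formula_risk(t, w, theta0, n))
              for t in (0.1, 0.3, 0.5, 0.7, 0.9))
```

The two computations agree because the estimator is linear in $X$, so the exact risk decomposes exactly into variance plus squared bias.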
Figure 1: Three risk functions for estimating a binomial proportion when $n = 10$, $w = 0.8$ and $\theta_0 \in \{1/4, 2/4, 3/4\}$.

This is good: we can require more of our estimator. This is bad: what more should we require? One idea is to avoid a worst-case scenario: evaluate procedures by their maximum risk.

Definition 1 (minimax risk, minimax estimator). The minimax risk is defined as
$$R_m(\Theta) = \inf_\delta \sup_{\theta \in \Theta} R(\theta, \delta).$$
An estimator $\delta_m$ is a minimax estimator of $\theta$ if
$$\sup_{\theta \in \Theta} R(\theta, \delta_m) = \inf_\delta \sup_{\theta \in \Theta} R(\theta, \delta) = R_m(\Theta),$$
i.e. $\sup_\Theta R(\theta, \delta_m) \le \sup_\Theta R(\theta, \delta)$ for all $\delta \in \mathcal D$.
2 Least favorable prior

Identifying a minimax estimator seems difficult: one would need to minimize the supremum risk over all estimators. However, in many cases we can identify a minimax estimator using some intuition:

- Suppose some values of $\theta$ are harder to estimate than others. Example: for the binomial model, $\mathrm{Var}[X/n] = \theta(1-\theta)/n$, so $\theta$ near $1/2$ is hard to estimate well.
- Consider a prior $\pi(\theta)$ that heavily weights the difficult values of $\theta$. The Bayes estimator $\delta_\pi$ will do well for these difficult values, where the supremum risk of most estimators is likely to occur:
$$R(\pi, \delta_\pi) = \int R(\theta, \delta_\pi)\,\pi(d\theta) \le \int R(\theta, \delta)\,\pi(d\theta) = R(\pi, \delta).$$
This means that, on average, $R(\theta, \delta_\pi) \le R(\theta, \delta)$ in places of high $\pi$-probability, i.e. the difficult parts of $\Theta$. Since $\delta_\pi$ does well in the difficult region, maybe it is minimax.

Definition 2 (least favorable prior). A prior distribution $\pi$ is least favorable if $R(\pi, \delta_\pi) \ge R(\tilde\pi, \delta_{\tilde\pi})$ for all priors $\tilde\pi$ on $\Theta$.

Intuition: $\delta_\pi$ is the best you can do under the worst prior.

Analogy to competitive games: you get to choose an estimator $\delta$, and your adversary gets to choose a prior $\pi$. Your adversary wants you to have high loss; you want to have low loss. For any $\pi$, your best strategy is $\delta_\pi$. Your adversary's best strategy is then for $\pi$ to be least favorable.
Bounding the minimax risk: The least favorable prior provides a lower bound on the minimax risk $R_m(\Theta)$. For any prior $\pi$ over $\Theta$,
$$R(\pi, \delta) = \int R(\theta, \delta)\,\pi(d\theta) \le \int \Big[\sup_{\theta'} R(\theta', \delta)\Big]\,\pi(d\theta) = \sup_\theta R(\theta, \delta).$$
Minimizing over all estimators $\delta \in \mathcal D$, we have
$$R(\pi, \delta_\pi) = \inf_\delta R(\pi, \delta) \le \inf_\delta \sup_\theta R(\theta, \delta) = R_m(\Theta).$$
The Bayes risk of any prior gives a lower bound for the minimax risk. Maximizing over $\pi$ gives the sharpest lower bound:
$$\sup_\pi R(\pi, \delta_\pi) \le R_m(\Theta).$$

Finding the LFP and minimax estimator: For any prior $\pi$, we have shown
$$R(\pi, \delta_\pi) \le \inf_\delta \sup_\theta R(\theta, \delta).$$
On the other hand, for any estimator $\delta_\pi$, we have
$$\inf_\delta \sup_\theta R(\theta, \delta) \le \sup_\theta R(\theta, \delta_\pi).$$
Putting these together gives
$$R(\pi, \delta_\pi) \le \inf_\delta \sup_\theta R(\theta, \delta) \le \sup_\theta R(\theta, \delta_\pi).$$
Therefore, $\delta_\pi$ achieves the minimax risk if $R(\pi, \delta_\pi) = \sup_\theta R(\theta, \delta_\pi)$.
Theorem 1 (LC 5.1.4). Let $\delta_\pi$ be (unique) Bayes for $\pi$, and suppose $R(\pi, \delta_\pi) = \sup_\theta R(\theta, \delta_\pi)$. Then

1. $\delta_\pi$ is (unique) minimax;
2. $\pi$ is least favorable.

Proof.

1. For any other estimator $\delta$,
$$\sup_\theta R(\theta, \delta) \ge \int R(\theta, \delta)\,\pi(d\theta) \ge \int R(\theta, \delta_\pi)\,\pi(d\theta) = \sup_\theta R(\theta, \delta_\pi).$$
If $\delta_\pi$ is unique Bayes under $\pi$, then the second inequality is strict and $\delta_\pi$ is unique minimax.
2. Let $\tilde\pi$ be a prior over $\Theta$. Then
$$R(\tilde\pi, \delta_{\tilde\pi}) = \int R(\theta, \delta_{\tilde\pi})\,\tilde\pi(d\theta) \le \int R(\theta, \delta_\pi)\,\tilde\pi(d\theta) \le \sup_\theta R(\theta, \delta_\pi) = R(\pi, \delta_\pi).$$

Notes regarding the condition $R(\pi, \delta_\pi) = \sup_\theta R(\theta, \delta_\pi)$:

- the condition is sufficient but not necessary for $\delta_\pi$ to be minimax;
- the condition is very restrictive: it is met only if $\pi(\{\theta : R(\theta, \delta_\pi) = \sup_{\theta'} R(\theta', \delta_\pi)\}) = 1$.

Bayes estimators that satisfy this condition are sometimes called equalizer rules:
Definition 3. An estimator $\delta$ is an equalizer rule with respect to $\pi$ if $\delta$ is Bayes with respect to $\pi$ and $R(\theta, \delta) = \sup_{\theta'} R(\theta', \delta)$ a.e. $\pi$.

The theorem implies that equalizer rules are minimax:

Corollary 1 (LC cor 5.1.6). An equalizer rule is minimax.

In the definition of an equalizer rule, it is not enough that $R(\theta, \delta)$ is constant a.e. $\pi$: if the supremum risk of an estimator occurs on a set of $\pi$-measure zero, then generally $R(\pi, \delta) < \sup_\theta R(\theta, \delta)$ and the conditions of the theorem do not hold. To summarize, we need $R(\pi, \delta_\pi) = \sup_\theta R(\theta, \delta_\pi)$:

- this condition will be met if $\delta_\pi$ has constant risk;
- this condition will not be met if $\delta_\pi$ has constant risk a.e. $\pi$ but higher risk on a set of $\pi$-measure zero;
- this condition will be met if $\delta_\pi$ achieves its maximum risk on a set of $\pi$-probability 1.

Example (binomial proportion): $X \sim \mathrm{binomial}(n, \theta)$; estimate $\theta$ with squared error loss. Let's try to find a Bayes estimator with constant risk. Can this be done with conjugate priors? Under a beta$(a, b)$ prior, we have
$$\delta_{ab}(X) = \frac{a + X}{a + b + n},$$
$$R(\theta, \delta_{ab}) = \mathrm{Var}[\delta_{ab}] + \mathrm{Bias}^2[\delta_{ab}] = \frac{n\theta(1-\theta)}{(a+b+n)^2} + \frac{(a - \theta(a+b))^2}{(a+b+n)^2}.$$
Can we make this constant as a function of $\theta$? The numerator is
$$n\theta - n\theta^2 + (a+b)^2\theta^2 - 2a(a+b)\theta + c(a, b, n),$$
where $c(a, b, n)$ does not depend on $\theta$.
This will be constant in $\theta$ if
$$n = 2a(a+b) \quad\text{and}\quad n = (a+b)^2;$$
solving for $a$ and $b$ gives $a = b = \sqrt n/2$. Therefore, the estimator
$$\delta_{\sqrt n/2}(X) = \frac{X + \sqrt n/2}{n + \sqrt n} = \frac{\sqrt n\,\bar X + 1/2}{\sqrt n + 1}$$
is constant risk and Bayes, therefore an equalizer rule, and therefore minimax. beta$(\sqrt n/2, \sqrt n/2)$ is a least favorable prior.

Risk comparison:
$$R(\theta, \delta_{\sqrt n/2}) = \frac{1}{(2(1 + \sqrt n))^2}, \qquad R(\theta, \bar X) = \frac{\theta(1-\theta)}{n}.$$
At $\theta = 1/2$ (a difficult value of $\theta$), we have
$$R(1/2, \bar X) = \frac{1}{4n} > \frac{1}{4(n + 2\sqrt n + 1)} = R(1/2, \delta_{\sqrt n/2}).$$
Draw the picture.

Notes:

- The region in which $R(\theta, \delta_{\sqrt n/2}) < R(\theta, \bar X)$ decreases in size with $n$. Draw the picture.
- The least favorable prior is not unique. To see this, note that under any prior $\pi$,
$$\delta_\pi(x) = E[\theta \mid X = x] = \frac{\int \theta^{x+1}(1-\theta)^{n-x}\,\pi(d\theta)}{\int \theta^{x}(1-\theta)^{n-x}\,\pi(d\theta)},$$
and so the estimator only depends on the first $n+1$ moments of $\theta$ under $\pi$.
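The constant-risk claim can be verified by exact computation: the risk of $\delta_{\sqrt n/2}$ should equal $1/(2(1+\sqrt n))^2$ at every $\theta$, and should beat $\bar X$ at $\theta = 1/2$. A sketch with an illustrative $n$:

```python
from math import comb, sqrt

def risk(theta, n):
    # Exact risk of delta(X) = (X + sqrt(n)/2)/(n + sqrt(n)) under squared error,
    # summing over the binomial(n, theta) pmf.
    a = sqrt(n) / 2
    r = 0.0
    for x in range(n + 1):
        pmf = comb(n, x) * theta**x * (1 - theta)**(n - x)
        r += ((x + a) / (n + 2 * a) - theta)**2 * pmf
    return r

n = 10
const = 1.0 / (2 * (1 + sqrt(n)))**2                  # claimed constant risk
risks = [risk(t, n) for t in (0.05, 0.25, 0.5, 0.75, 0.95)]
err = max(abs(r - const) for r in risks)              # deviation from constancy
beats_xbar_at_half = const < 1 / (4 * n)              # vs R(1/2, Xbar) = 1/(4n)
```

The exact risks agree with the constant to floating-point accuracy, illustrating the equalizer property.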
The minimax estimator is sensitive to the loss function: under $L(\theta, d) = (\theta - d)^2/[\theta(1-\theta)]$, $X/n$ is minimax (it is an equalizer rule under beta$(1,1)$).

Example (difference in proportions): $X \sim \mathrm{binomial}(n, \theta_x)$ and $Y \sim \mathrm{binomial}(n, \theta_y)$ independently; estimate $\theta_y - \theta_x$. Can we proceed from the one-sample case? Consider
$$\frac{Y + \sqrt n/2}{n + \sqrt n} - \frac{X + \sqrt n/2}{n + \sqrt n} = \frac{Y - X}{n + \sqrt n}.$$
It turns out that this is not minimax. Consider estimators of the form $\delta_c(X, Y) = c(Y - X)$ for some $c \in (0, 1/n)$.

Strategy: For each $c$, where does $c(Y - X)$ have maximum risk? Is $c(Y - X)$ Bayes w.r.t. a prior that assigns probability 1 to points of maximum risk?
$$R(\theta, \delta_c) = \mathrm{Var}[\delta_c] + (E[c(Y - X)] - (\theta_y - \theta_x))^2 = c^2 n[\theta_x(1-\theta_x) + \theta_y(1-\theta_y)] + (cn - 1)^2(\theta_y - \theta_x)^2.$$
Taking derivatives, the maximum risk is attained at $(\theta_x, \theta_y)$ satisfying
$$[2(cn-1)^2 - 2c^2 n]\,\theta_x - 2(cn-1)^2\,\theta_y = -c^2 n,$$
$$[2(cn-1)^2 - 2c^2 n]\,\theta_y - 2(cn-1)^2\,\theta_x = -c^2 n.$$
We have two equations and two unknowns, and so for a given $c$, maximum risk is generally attained at some point $\theta_c = (\theta_{xc}, \theta_{yc})$. The theorem suggests trying the prior
$\pi_c(\theta) = 1(\theta = \theta_c)$. But then the Bayes estimator is the constant $\delta_{\pi_c} = \theta_{yc} - \theta_{xc}$, not $\delta_c$. Furthermore, $\delta_{\pi_c}$ has its minimum risk at $\theta_c$.

However, consider $\Theta_0 = \{\theta : \theta_x + \theta_y = 1\} \subset \Theta$. On this set, $\theta_y(1-\theta_y) = \theta_x(1-\theta_x)$ and $\theta_y - \theta_x = 1 - 2\theta_x$, and the risk function is
$$R(\theta, \delta_c) = c^2 n \cdot 2\theta_x(1-\theta_x) + (cn-1)^2(1 - 2\theta_x)^2.$$
Is there a value of $c$ that gives this estimator constant risk (on $\Theta_0$)?
$$\frac{dR(\theta_x, c(Y-X))}{d\theta_x} = \frac{d}{d\theta_x}\Big[c^2 n \cdot 2(\theta_x - \theta_x^2) + (cn-1)^2(1 - 4\theta_x + 4\theta_x^2)\Big] = c^2 n \cdot 2(1 - 2\theta_x) - (cn-1)^2 \cdot 4(1 - 2\theta_x).$$
Setting this to zero for all $\theta_x$ and solving for $c$, we have
$$c^2 n = 2(cn - 1)^2 \;\Longrightarrow\; n - 1/c = \pm\sqrt{n/2} \;\Longrightarrow\; c = \frac{1}{n} \cdot \frac{\sqrt{2n}}{\sqrt{2n} \pm 1}.$$
The solution with $-$ in the denominator gives $cn > 1$ and could give estimators outside the parameter space, so consider the $+$ solution:
$$c_e = \frac{1}{n} \cdot \frac{\sqrt{2n}}{\sqrt{2n} + 1}, \qquad \delta_e(X, Y) = \frac{\sqrt{2n}}{\sqrt{2n} + 1}\,\frac{Y - X}{n},$$
i.e. $\delta_e$ is a shrunken version of the UMVUE. By construction, this estimator has constant risk on $\{\theta : \theta_x + \theta_y = 1\}$. Does it achieve its maximum risk on this set? Recall the risk function is
$$R(\theta, \delta_c) = c^2 n[\theta_x(1-\theta_x) + \theta_y(1-\theta_y)] + (cn-1)^2(\theta_y - \theta_x)^2 = c^2 n\big(\theta_x(1-\theta_x) + \theta_y(1-\theta_y) + \tfrac12(\theta_y - \theta_x)^2\big),$$
where we are using the fact that $2(cn-1)^2 = c^2 n$ for our risk-equalizing value of $c$. Taking derivatives, the maximum risk occurs when
$$c^2 n[(1 - 2\theta_x) + (\theta_x - \theta_y)] = c^2 n(1 - \theta_x - \theta_y) = 0,$$
$$c^2 n[(1 - 2\theta_y) + (\theta_y - \theta_x)] = c^2 n(1 - \theta_x - \theta_y) = 0,$$
which are both satisfied when $\theta_x + \theta_y = 1$.

So far, we have shown:

- $\delta_e$ has constant risk on $\Theta_0 = \{\theta : \theta_x + \theta_y = 1\}$;
- $\delta_e$ achieves its maximum risk on this set.

To show that it is minimax, what remains is to show that it is Bayes with respect to a prior $\pi$ on $\Theta_0$. If it is, then the condition $R(\pi, \delta_e) = \sup_\Theta R(\theta, \delta_e)$ will be met, and $\delta_e$ will therefore be minimax.

Consider estimation of $\theta_y = 1 - \theta_x$ from $X$ and $Y$. For what prior on $\theta_y$ will $\delta_e$ be Bayes? Note that on $\Theta_0$, $n + Y - X \sim \mathrm{binomial}(2n, \theta_y)$. Recall that beta$(\sqrt n/2, \sqrt n/2)$ is least favorable for estimating a binomial proportion with $n$ samples. Here, our sample size is $2n$, so let's see if beta$(\sqrt{2n}/2, \sqrt{2n}/2)$ is least favorable:
$$\pi(\theta_y \mid n + Y - X) = \mathrm{beta}\big(n + Y - X + \sqrt{2n}/2,\; n + X - Y + \sqrt{2n}/2\big),$$
$$E[\theta_y \mid n + Y - X] = \frac{n + Y - X + \sqrt{2n}/2}{2n + \sqrt{2n}}.$$
Now recall we are trying to estimate $\theta_y - \theta_x$, which on $\Theta_0$ is equal to $2\theta_y - 1$.
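The two facts just listed, constant risk on $\Theta_0$ and maximum risk attained on $\Theta_0$, can be checked numerically from the risk formula for $\delta_c$. A sketch ($n$ and the grid resolution are illustrative):

```python
from math import sqrt

def risk(tx, ty, c, n):
    # Risk of delta_c(X, Y) = c*(Y - X): variance plus squared bias,
    # with X ~ binomial(n, tx) and Y ~ binomial(n, ty) independent.
    return c**2 * n * (tx * (1 - tx) + ty * (1 - ty)) + (c * n - 1)**2 * (ty - tx)**2

n = 20
c_e = sqrt(2 * n) / ((sqrt(2 * n) + 1) * n)   # the risk-equalizing choice

grid = [i / 100 for i in range(101)]
on_theta0 = [risk(t, 1 - t, c_e, n) for t in grid]           # along tx + ty = 1
spread = max(on_theta0) - min(on_theta0)                      # should be ~0
overall = max(risk(tx, ty, c_e, n) for tx in grid for ty in grid)
gap = overall - max(on_theta0)                                # should be ~0
```

With this choice of $c$ the risk turns out to depend on $(\theta_x, \theta_y)$ only through $\theta_x + \theta_y$, which is why the maximum over the whole square is attained exactly on the line $\theta_x + \theta_y = 1$.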
Under squared error loss, the Bayes estimator of $2\theta_y - 1$ is $2E[\theta_y \mid n + Y - X] - 1$, which is
$$2E[\theta_y \mid n + Y - X] - 1 = \frac{2(n + Y - X) + \sqrt{2n}}{2n + \sqrt{2n}} - 1 = \frac{2(Y - X)}{2n + \sqrt{2n}} = \frac{\sqrt{2n}}{\sqrt{2n} + 1}\,\frac{Y - X}{n} = \delta_e.$$
So we have shown:

- $\delta_e$ is Bayes with respect to a prior $\pi$ on $\Theta_0 \subset \Theta$;
- $\delta_e$ is constant risk a.e. $\pi$;
- $\sup_\Theta R(\theta, \delta_e)$ is achieved on $\Theta_0$.

Therefore, $\delta_e$ is an equalizer rule, and hence minimax.

3 Least favorable prior sequence

In many problems there are no least favorable priors, and the main theorem from above is of no help.

Example (normal mean, known variance): $X_1, \ldots, X_n \sim$ i.i.d. normal$(\theta, \sigma^2)$, $\sigma^2$ known. $R(\theta, \bar X) = \sigma^2/n$ is constant in $\theta$, so $\bar X$ seems potentially minimax. But no prior $\pi$ over $\Theta$ gives $\delta_\pi(X) = \bar X$. You can also see that there is no least favorable prior: under any prior $\pi$,
$$R(\pi, \delta_\pi) < R(\pi, \bar X) = \sigma^2/n,$$
but you can find priors whose Bayes risk is arbitrarily close to $\sigma^2/n$ (i.e., the set of Bayes risks is not closed). However, $\bar X$ is a limit of Bayes estimators: under $\pi_{\tau^2} = $ normal$(0, \tau^2)$, as $\tau^2 \to \infty$,
$$\delta_{\tau^2}(X) = \frac{n/\sigma^2}{n/\sigma^2 + 1/\tau^2}\,\bar X \to \bar X, \qquad R(\pi_{\tau^2}, \delta_{\tau^2}) = \frac{1}{n/\sigma^2 + 1/\tau^2} \to \sigma^2/n = R(\theta, \bar X).$$
More generally, consider

- $(\pi_k, \delta_k)$, a sequence of prior distributions and their Bayes estimators, such that $R(\pi_k, \delta_k) \to R$;
- $\delta$, an estimator such that $\sup_\theta R(\theta, \delta) = R$.

For each $k$, we have $R(\pi_k, \delta_k) \le R_m$. However, we always have $R_m \le \sup_\theta R(\theta, \delta)$. Putting these together gives
$$R(\pi_k, \delta_k) \le R_m \le \sup_\theta R(\theta, \delta).$$
If $R(\pi_k, \delta_k) \to \sup_\theta R(\theta, \delta)$, then we must have $R_m = \sup_\theta R(\theta, \delta)$, and so $\delta$ must be minimax.

Definition 4. A sequence $\{\pi_k\}$ of priors is least favorable if for every prior $\pi$,
$$R(\pi, \delta_\pi) \le \lim_{k\to\infty} R(\pi_k, \delta_k).$$

Theorem 2 (LC thm 5.1.2). Let $\{\pi_k\}$ be a sequence of prior distributions and $\delta$ an estimator such that $R(\pi_k, \delta_k) \to \sup_\theta R(\theta, \delta)$. Then

1. $\delta$ is minimax, and
2. $\{\pi_k\}$ is least favorable.

Proof.
1. For any estimator $\delta'$,
$$\sup_\theta R(\theta, \delta') \ge R(\pi_k, \delta') \ge R(\pi_k, \delta_k) \text{ for every } k, \text{ and so } \sup_\theta R(\theta, \delta') \ge \lim_k R(\pi_k, \delta_k) = \sup_\theta R(\theta, \delta).$$
2. For any prior $\pi$,
$$R(\pi, \delta_\pi) \le R(\pi, \delta) \le \sup_\theta R(\theta, \delta) = \lim_k R(\pi_k, \delta_k).$$

Exercise: Prove the following (weaker) variant of the above theorem: any constant-risk generalized Bayes estimator is minimax.

Example (normal mean, known variance): Let $X_1, \ldots, X_n \sim$ i.i.d. $N_p(\theta, \sigma^2 I)$, $\sigma^2$ known. Let $\pi_k(\theta)$ be such that $\theta \sim N_p(0, \tau_k^2 I)$ with $\tau_k^2 \to \infty$, so that
$$\delta_k = \frac{n/\sigma^2}{n/\sigma^2 + 1/\tau_k^2}\,\bar X.$$
Under average squared-error loss,
$$R(\pi_k, \delta_k) = \frac{1}{n/\sigma^2 + 1/\tau_k^2} \to \sigma^2/n = R(\theta, \bar X).$$
Thus $\bar X$ is minimax.

Alert! This result should seem a bit odd to you: $\bar X$ is inadmissible for $p \ge 3$. How can it be dominated if it is minimax? We even have the following theorem:

Theorem 3 (LC lemma). Any unique minimax estimator is admissible.

Proof. If $\delta$ is unique minimax and $\delta'$ is any other estimator, then
$$\sup_\theta R(\theta, \delta) < \sup_\theta R(\theta, \delta') \;\Longrightarrow\; \exists\,\theta_0 : R(\theta_0, \delta) < R(\theta_0, \delta'),$$
so $\delta$ can't be dominated.
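For the normal-mean example above, the convergence of the Bayes risks to the constant risk $\sigma^2/n$ is easy to see numerically. A sketch ($n$ and $\sigma^2$ are illustrative):

```python
def bayes_risk(tau2, n, sigma2):
    # Bayes risk of the normal(0, tau2) prior for the normal mean problem:
    # 1 / (n/sigma2 + 1/tau2), which increases toward sigma2/n as tau2 grows.
    return 1.0 / (n / sigma2 + 1.0 / tau2)

n, sigma2 = 25, 4.0
minimax_risk = sigma2 / n
risks = [bayes_risk(t2, n, sigma2) for t2 in (1.0, 10.0, 100.0, 1e4, 1e8)]
increasing = all(a < b for a, b in zip(risks, risks[1:]))
all_below = all(r < minimax_risk for r in risks)
gap = minimax_risk - risks[-1]        # shrinks toward zero as tau2 grows
```

Every Bayes risk sits strictly below $\sigma^2/n$, consistent with the claim that the supremum of the Bayes risks is not attained by any single prior.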
So how can $\bar X$ be minimax and not admissible? The only possibility left by the theorem is that $\bar X$ is not unique minimax. In fact, for $p \ge 3$, estimators of the form
$$\delta(X) = \left(1 - c(\|\bar x\|)\,\frac{(p-2)\sigma^2/n}{\|\bar x\|^2}\right)\bar x$$
are minimax as long as $c : \mathbb R^+ \to [0, 2]$ is nondecreasing (LC theorem 5.5.5). For these estimators, the supremum risk occurs in the limit as $\|\theta\| \to \infty$.

Example (normal mean, unknown variance): Let $X_1, \ldots, X_n \sim$ i.i.d. $N_p(\theta, \sigma^2 I)$, with $\sigma^2 \in \mathbb R^+$ unknown. Then $\sup_{\theta, \sigma^2} R((\theta, \sigma^2), \bar X) = \infty$. In fact, we can prove that every estimator has infinite maximum risk:
$$\sup_{\theta, \sigma^2} R((\theta, \sigma^2), \delta) = \sup_{\sigma^2} \sup_\theta R(\theta, \delta \mid \sigma^2) \ge \sup_{\sigma^2} \sigma^2/n = \infty,$$
where the inequality holds because $\bar X$ is minimax in the known-variance case. Therefore, every estimator is trivially minimax, with maximum risk of infinity.

Here is a not-entirely-satisfying solution to this problem: assume $\sigma^2 \le M$, with $M$ known. Applying exactly the same argument as above, we have
$$\sup_{\theta, \sigma^2} R((\theta, \sigma^2), \delta) = \sup_{\sigma^2 \le M} \sup_\theta R(\theta, \delta \mid \sigma^2) \ge \sup_{\sigma^2 \le M} \sigma^2/n = M/n,$$
and so $\bar X$ is minimax with maximum risk $M/n$.

Note that the value of $M$ doesn't affect the estimator: $\bar X$ is minimax no matter what $M$ is. Does this mean that $\bar X$ is the minimax estimator for $\sigma^2 \in \mathbb R^+$? Here are some thoughts:
- For $p = 1$, $\bar X$ is unique minimax for $\theta \in \mathbb R$, $\sigma^2 \in (0, M]$, for all $M$. For $\sigma^2 \in \mathbb R^+$, it is still minimax, although not unique.
- For $p > 2$, $\bar X$ is minimax, but not unique minimax, for $\sigma^2 \in (0, M]$ or even known $\sigma^2$.

A better solution to this problem is to change the loss function to $L((\theta, \sigma^2), d) = \|\theta - d\|^2/\sigma^2$, standardized squared-error loss.

Exercise: Show that $\bar X$ is minimax under standardized squared-error loss.

4 Nonparametric problems

The normal mean, unknown variance problem above gave some indication that we can deal with nuisance parameters (such as the variance $\sigma^2$) when obtaining a minimax estimator for a parameter of interest (such as the mean $\theta$). What about more general nuisance parameters?

Let $X \sim P \in \mathcal P$. Suppose we want to estimate $\theta = g(P)$, some functional of $P$. Suppose $\delta$ is minimax for $\theta = g(P)$ when $P \in \mathcal P_0 \subset \mathcal P$. When will it also be minimax for $P \in \mathcal P$?

Theorem (LC). If $\delta$ is minimax for $\theta$ when $P \in \mathcal P_0$ and
$$\sup_{P \in \mathcal P_0} R(P, \delta) = \sup_{P \in \mathcal P} R(P, \delta),$$
then $\delta$ is minimax for $\theta$ when $P \in \mathcal P$.

Proof. For any other estimator $\delta'$,
$$\sup_{P \in \mathcal P} R(P, \delta') \ge \sup_{P \in \mathcal P_0} R(P, \delta') \ge \sup_{P \in \mathcal P_0} R(P, \delta) = \sup_{P \in \mathcal P} R(P, \delta).$$
Example (difference in binomial proportions):
$$\mathcal P = \{X \sim \mathrm{binomial}(n, \theta_x),\; Y \sim \mathrm{binomial}(n, \theta_y)\}, \qquad \mathcal P_0 = \{(\theta_x, \theta_y) : \theta_x = 1 - \theta_y\} \subset \mathcal P.$$
For estimation of $\theta_y - \theta_x$ on $\mathcal P_0$, we showed that $c(Y - X)/n$, with $c = \sqrt{2n}/(\sqrt{2n} + 1)$, is constant risk on $\mathcal P_0$ and Bayes w.r.t. a beta$(\sqrt{2n}/2, \sqrt{2n}/2)$ prior on $\theta_y$, and so $c(Y - X)/n$ is minimax on $\mathcal P_0$. We then showed that $c(Y - X)/n$ achieves its maximum risk on $\mathcal P_0$, and so by the theorem it is minimax.

Example (population mean, bounded variance): Let $\mathcal P = \{P : \mathrm{Var}[X \mid P] \le M\}$, with $M$ known. Is $\bar X$ minimax for $\theta = E[X \mid P]$? Let $\mathcal P_0 = \{\mathrm{normal}(\theta, \sigma^2) : \theta \in \mathbb R, \sigma^2 \le M\}$. Then

1. $\bar X$ is minimax for $\mathcal P_0$;
2. $\sup_{P \in \mathcal P_0} R(P, \bar X) = \sup_{P \in \mathcal P} R(P, \bar X) = M/n$.

Thus $\bar X$ is minimax (and unique in the univariate case, as shown in LC 5.2).

Example (population mean, bounded range): Let $X_1, \ldots, X_n \sim$ i.i.d. $P \in \mathcal P$, where $P([0, 1]) = 1$ for all $P \in \mathcal P$.

Least favorable perspective: our previous experience tells us to find a minimax estimator that is best under the worst conditions. What are the worst conditions? Guess: the most difficult situation is where the mass of each $P$ is as spread out as possible. Let $\mathcal P_0 = \{\mathrm{binary}(\theta) : \theta \in [0, 1]\}$. As we've already derived, the minimax estimator of $E[X_i \mid \theta] = \Pr(X_i = 1 \mid \theta) = \theta$ based on $X_1, \ldots, X_n \sim$ i.i.d. binary$(\theta)$ is
$$\delta(X) = \frac{n\bar X + \sqrt n/2}{n + \sqrt n} = \frac{\sqrt n}{\sqrt n + 1}\,\bar X + \frac{1}{\sqrt n + 1}\cdot\frac12.$$
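The guess that the fully spread-out (binary) distributions are the hardest can be probed numerically: for a three-point $P$ on $[0,1]$ and the binary distribution with the same mean, the exact risk of $\delta$ is larger under the binary law. A sketch, with an illustrative choice of $P$ and a small $n$ so that all samples can be enumerated:

```python
from math import sqrt
from itertools import product

n = 4  # small n so exact enumeration over all samples is cheap

def delta(xbar):
    # The minimax estimator from the binary subfamily, as derived above.
    return (n * xbar + sqrt(n) / 2) / (n + sqrt(n))

def exact_risk(vals, probs, mu):
    # Exact risk of delta under i.i.d. sampling from a discrete P on [0, 1]:
    # enumerate every n-tuple of support points with its probability.
    r = 0.0
    for sample in product(range(len(vals)), repeat=n):
        p = 1.0
        for i in sample:
            p *= probs[i]
        xbar = sum(vals[i] for i in sample) / n
        r += (delta(xbar) - mu)**2 * p
    return r

mu = 0.5
r_spread = exact_risk([0.0, 0.5, 1.0], [0.25, 0.5, 0.25], mu)  # mass pulled inward
r_binary = exact_risk([0.0, 1.0], [0.5, 0.5], mu)              # binary(1/2)
```

The binary law has the larger variance for a given mean, and hence the larger risk, which is the mechanism behind the variance bound used below.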
To use the theorem to show this is minimax for $\mathcal P$, we need to show
$$\sup_{P \in \mathcal P} R(P, \delta) = \sup_{P \in \mathcal P_0} R(P, \delta).$$
Let's calculate the risk of $\delta$ for $P \in \mathcal P$, writing $\theta = E[X \mid P]$:
$$\mathrm{Var}[\delta(X)] = \left(\frac{\sqrt n}{\sqrt n + 1}\right)^2 \frac{\mathrm{Var}[X \mid P]}{n} = \frac{\mathrm{Var}[X \mid P]}{(\sqrt n + 1)^2},$$
$$E[\delta(X)] = \frac{\sqrt n}{\sqrt n + 1}\,\theta + \frac{1/2}{\sqrt n + 1} = \theta + \frac{1/2 - \theta}{\sqrt n + 1}, \qquad \mathrm{Bias}^2(\delta \mid P) = \frac{(\theta - 1/2)^2}{(\sqrt n + 1)^2},$$
and so
$$R(P, \delta) = \frac{1}{(\sqrt n + 1)^2}\big[\mathrm{Var}[X \mid P] + (\theta - 1/2)^2\big].$$
Where does this risk achieve its maximum value? Note that
$$\sup_{P \in \mathcal P} R(P, \delta) = \sup_{\theta \in [0,1]}\; \sup_{P \in \mathcal P_\theta} R(P, \delta),$$
where $\mathcal P_\theta = \{P \in \mathcal P : E[X \mid P] = \theta\}$. For the inner supremum, note that for any $P \in \mathcal P_\theta$,
$$\mathrm{Var}[X \mid P] = E[X^2] - E[X]^2 \le E[X] - E[X]^2 = \theta(1 - \theta)$$
(since $X \in [0,1]$ implies $X^2 \le X$), with equality only if $P$ is the binary$(\theta)$ measure. Therefore, for each $\theta$ the supremum risk is attained at the binary$(\theta)$ distribution, and so the maximum risk of $\delta$ is attained on $\mathcal P_0$. To recap,
$$\sup_{P \in \mathcal P} R(P, \delta) = \sup_{\theta \in [0,1]} \sup_{P \in \mathcal P_\theta} R(P, \delta) = \sup_{\theta \in [0,1]} R(\mathrm{binary}(\theta), \delta) = \sup_{P \in \mathcal P_0} R(P, \delta).$$
Thus $\delta$ is minimax on $\mathcal P_0$, achieves its maximum risk there, and so is minimax over all of $\mathcal P$.

5 Minimax and admissibility

We already showed that a unique minimax estimator can't be dominated:

Theorem (LC lemma). Any unique minimax estimator is admissible.

This is reminiscent of our theorem about admissibility of unique Bayes estimators, which has a similar proof: if $\delta'$ were to dominate $\delta$, then $\delta'$ would be at least as good as $\delta$ in terms of both Bayes risk and maximum risk, and so such a $\delta$ can't be either unique Bayes or unique minimax.

What about the other direction? We have shown in some important cases that admissibility of an estimator implies it is Bayes, or close to Bayes. Does admissibility imply minimax? The answer is yes for constant-risk estimators:

Theorem (LC lemma). If $\delta$ is constant risk and admissible, then it is minimax.

Proof. Let $\delta'$ be another estimator. Since $\delta$ is not dominated, there is a $\theta_0$ s.t. $R(\theta_0, \delta) \le R(\theta_0, \delta')$. But since $\delta$ is constant risk,
$$\sup_\theta R(\theta, \delta) = R(\theta_0, \delta) \le R(\theta_0, \delta') \le \sup_\theta R(\theta, \delta').$$

Below is a diagram summarizing some of the relationships between admissible, Bayes and minimax estimators we've covered so far:
6 Superefficiency and sparsity

Let $X_1, \ldots, X_n \sim$ i.i.d. $N(\theta, 1)$, where $\theta \in \mathbb R$. A version of Hodges' estimator of $\theta$ is given by
$$\delta_H(x) = \begin{cases} \bar x & \text{if } |\bar x| > 1/n^{1/4} \\ 0 & \text{if } |\bar x| \le 1/n^{1/4}, \end{cases}$$
or more compactly, $\delta_H(x) = \bar x\,1(|\bar x| > 1/n^{1/4})$.

Exercise: Show that the asymptotic distribution of $\delta_H(X)$ is given by
$$\sqrt n(\delta_H(X) - \theta) \to N(0, v(\theta)), \qquad v(\theta) = \begin{cases} 1 & \text{if } \theta \ne 0 \\ 0 & \text{if } \theta = 0. \end{cases}$$
The asymptotic variance of $\sqrt n(\delta_H(X) - \theta)$ makes $\delta_H$ seem useful in many situations: often we are trying to estimate a parameter that could potentially be zero, or very close to zero. For such parameters, $\delta_H$ seems to be as good as $\bar X$ asymptotically for all $\theta$, and much better than $\bar X$ at the special value $\theta = 0$. In particular, if $\theta = 0$, then $\Pr(\delta_H(X) = 0) \to 1$. This seems much better than simple consistency at $\theta = 0$, which says that
$$\Pr(|\delta_H(X)| < \epsilon) \to 1$$
for every $\epsilon > 0$. An estimator consistent at $\theta = 0$ never actually has to be zero, whereas the Hodges estimator is more and more likely to actually be zero as $n \to \infty$.

However, there is a price to pay. Let's compare the maximum of the risk function of $\delta_H$ to that of $\bar X$, under squared error loss. For Hodges' estimator, we have for any
$\theta \in \mathbb R$,
$$\sup_{\theta'} E_{\theta'}[(\delta_H(X) - \theta')^2] \ge E_\theta[(\delta_H(X) - \theta)^2] \ge E_\theta[(\delta_H(X) - \theta)^2\,1(|\bar X| < 1/n^{1/4})]$$
$$= E_\theta[\theta^2\,1(|\bar X| < 1/n^{1/4})] = \theta^2 \Pr(-n^{-1/4} < \bar X < n^{-1/4})$$
$$= \theta^2 \Pr\big(\sqrt n(-n^{-1/4} - \theta) < \sqrt n(\bar X - \theta) < \sqrt n(n^{-1/4} - \theta)\big)$$
$$= \theta^2\big[\Phi(\sqrt n(n^{-1/4} - \theta)) - \Phi(\sqrt n(-n^{-1/4} - \theta))\big].$$
Letting $\theta = \theta_0/\sqrt n$, we have
$$\sup_\theta R(\theta, \delta_H) \ge \frac{\theta_0^2}{n}\big[\Phi(\theta_0 + n^{1/4}) - \Phi(\theta_0 - n^{1/4})\big].$$
Note that this holds for any $\theta_0 \in \mathbb R$. On the other hand, the risk of $\bar X$ is constant, $1/n$ for all $\theta$. Therefore,
$$\frac{\sup_\theta R(\theta, \delta_H)}{\sup_\theta R(\theta, \bar X)} \ge \theta_0^2\big[\Phi(\theta_0 + n^{1/4}) - \Phi(\theta_0 - n^{1/4})\big].$$
You can play around with various values of $\theta_0$ to see how big you can make this bound for a given $n$. The thing to keep in mind is that the inequality holds for all $n$ and $\theta_0$. As $n \to \infty$, the normal CDF term goes to one, so
$$\lim_{n\to\infty} \frac{\sup_\theta R(\theta, \delta_H)}{\sup_\theta R(\theta, \bar X)} \ge \theta_0^2.$$
As this holds for all $\theta_0$, we have
$$\lim_{n\to\infty} \frac{\sup_\theta R(\theta, \delta_H)}{\sup_\theta R(\theta, \bar X)} = \infty.$$
Yikes! So the Hodges estimator becomes infinitely worse than $\bar X$ as $n \to \infty$. Is this asymptotic result relevant for finite-sample comparisons of estimators? A semi-closed-form expression for the risk of $\delta_H$ is given by Lehmann and Casella [1998, page 442]. A plot of the finite-sample risk of $\delta_H$ is shown in Figure 2.
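The finite-sample price shows up in simulation as well: at the local parameter value $\theta = \theta_0/\sqrt n$, the scaled risk $n \cdot \mathrm{MSE}$ of $\delta_H$ sits far above the constant value $n \cdot R(\theta, \bar X) = 1$. A Monte Carlo sketch ($n$, $\theta_0$ and the seed are illustrative):

```python
import random

random.seed(1)

def hodges_mse(theta, n, reps=20000):
    # Monte Carlo estimate of E[(delta_H(X) - theta)^2], where delta_H
    # zeroes out the sample mean when |xbar| <= n^(-1/4).
    cutoff = n ** (-0.25)
    sd = n ** (-0.5)                  # sd of xbar for N(theta, 1) data
    total = 0.0
    for _ in range(reps):
        xbar = random.gauss(theta, sd)
        d = xbar if abs(xbar) > cutoff else 0.0
        total += (d - theta) ** 2
    return total / reps

n, theta0 = 10000, 5.0
theta = theta0 / n ** 0.5             # local parameter value near zero
ratio = n * hodges_mse(theta, n)      # n * MSE; for Xbar this is exactly 1
```

For these values the lower bound $\theta_0^2[\Phi(\theta_0 + n^{1/4}) - \Phi(\theta_0 - n^{1/4})] \approx \theta_0^2$ is essentially attained: the simulated ratio lands near $\theta_0^2 = 25$, while $\bar X$ stays at 1.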
Figure 2: Risk functions of $\delta_H$ for $n \in \{25, 100, 250\}$, with the risk bound in blue.

Related to this calculation is something of more modern interest: the problem of variable selection in regression. Consider the following model:
$$X_1, \ldots, X_n \sim \text{i.i.d. } P_X, \qquad \epsilon_1, \ldots, \epsilon_n \sim \text{i.i.d. } N(0, 1), \qquad y_i = \beta^T X_i + \epsilon_i.$$
You may have attended one or more seminars where someone presented an estimator $\hat\beta$ of $\beta$ with the following property:
$$\Pr(\hat\beta_j = 0 \mid \beta) \to 1 \text{ as } n \to \infty, \quad \text{for all } \beta \text{ such that } \beta_j = 0.$$
Again, such an estimator $\hat\beta_j$ is not just consistent at $\beta_j = 0$, it actually equals 0 with increasingly high probability as $n \to \infty$. This property has been coined "sparsistency" by people studying asymptotics of model selection procedures. This special consistency at $\beta_j = 0$ seems similar to the properties of Hodges' estimator when $\theta = 0$. What does the behavior at 0 imply about the risk function elsewhere? It is difficult to say anything about the entire risk function for all such estimators, but
we can gain some general insight by looking at the supremum risk. Let $\beta_n = \beta_0/\sqrt n$, where $\beta_0$ is arbitrary. Then
$$\sup_\beta E_\beta[\|\hat\beta - \beta\|^2] \ge E_{\beta_n}[\|\hat\beta - \beta_n\|^2] \ge E_{\beta_n}[\|\hat\beta - \beta_n\|^2\,1(\hat\beta = 0)] = \|\beta_n\|^2 \Pr(\hat\beta = 0 \mid \beta_n) = \frac{\|\beta_0\|^2}{n}\Pr(\hat\beta = 0 \mid \beta_n).$$
Now consider the risk of the OLS estimator:
$$E_\beta[\|\hat\beta_{\mathrm{ols}} - \beta\|^2] = E_{P_X}[\mathrm{tr}((\mathbf X^T \mathbf X)^{-1})] =: v_n.$$
So we have
$$\frac{\sup_\beta R(\beta, \hat\beta)}{\sup_\beta R(\beta, \hat\beta_{\mathrm{ols}})} \ge \frac{\|\beta_0\|^2 \Pr(\hat\beta = 0 \mid \beta_n)}{n v_n},$$
which holds for all $n$ and all $\beta_0$. Now, you should believe that $n v_n$ converges to some number $v > 0$. What about $\Pr(\hat\beta = 0 \mid \beta_n)$? We have $\beta_n \to 0$ as $n \to \infty$, and so it seems that, since $\hat\beta$ has the sparsistency property, eventually $\hat\beta$ will equal 0 with high probability. Next quarter you will learn about contiguity of sequences of distributions and will be able to show that
$$\lim_{n\to\infty} \Pr(\hat\beta = 0 \mid \beta = \beta_n) = \lim_{n\to\infty} \Pr(\hat\beta = 0 \mid \beta = 0) = 1.$$
The first equality is due to contiguity, the second to the sparsistency property of $\hat\beta$. This result gives
$$\lim_{n\to\infty} \frac{\sup_\beta R(\beta, \hat\beta)}{\sup_\beta R(\beta, \hat\beta_{\mathrm{ols}})} \ge \|\beta_0\|^2/v.$$
Since this result holds for all $\beta_0$, the limit is in fact infinite. The result is that, in terms of supremum risk (under squared error loss), a sparsistent estimator becomes infinitely worse than the OLS estimator. This result is due to Leeb and Pötscher [2008] (see also Pötscher and Leeb [2009]).

Discuss:

- Is supremum risk an appropriate comparison?
- Is squared error an appropriate loss?
- When would you use a sparsistent estimator?

References

Thomas S. Ferguson. Mathematical Statistics: A Decision Theoretic Approach. Probability and Mathematical Statistics, Vol. 1. Academic Press, New York, 1967.

Hannes Leeb and Benedikt M. Pötscher. Sparse estimators and the oracle property, or the return of Hodges' estimator. Journal of Econometrics, 142(1), 2008.

E. L. Lehmann and George Casella. Theory of Point Estimation. Springer Texts in Statistics. Springer-Verlag, New York, second edition, 1998.

Benedikt M. Pötscher and Hannes Leeb. On the distribution of penalized maximum likelihood estimators: the LASSO, SCAD, and thresholding. Journal of Multivariate Analysis, 100(9), 2009.
More information18.600: Lecture 24 Covariance and some conditional expectation exercises
18.600: Lecture 24 Covariance and some conditional expectation exercises Scott Sheffield MIT Outline Covariance and correlation Paradoxes: getting ready to think about conditional expectation Outline Covariance
More informationChapter 4 HOMEWORK ASSIGNMENTS. 4.1 Homework #1
Chapter 4 HOMEWORK ASSIGNMENTS These homeworks may be modified as the semester progresses. It is your responsibility to keep up to date with the correctly assigned homeworks. There may be some errors in
More informationPAC Learning. prof. dr Arno Siebes. Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht
PAC Learning prof. dr Arno Siebes Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht Recall: PAC Learning (Version 1) A hypothesis class H is PAC learnable
More informationLearning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013
Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description
More informationStat 5101 Lecture Slides: Deck 7 Asymptotics, also called Large Sample Theory. Charles J. Geyer School of Statistics University of Minnesota
Stat 5101 Lecture Slides: Deck 7 Asymptotics, also called Large Sample Theory Charles J. Geyer School of Statistics University of Minnesota 1 Asymptotic Approximation The last big subject in probability
More informationHypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3
Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest
More informationLinear Classifiers and the Perceptron
Linear Classifiers and the Perceptron William Cohen February 4, 2008 1 Linear classifiers Let s assume that every instance is an n-dimensional vector of real numbers x R n, and there are only two possible
More informationSequential Decisions
Sequential Decisions A Basic Theorem of (Bayesian) Expected Utility Theory: If you can postpone a terminal decision in order to observe, cost free, an experiment whose outcome might change your terminal
More information18.440: Lecture 25 Covariance and some conditional expectation exercises
18.440: Lecture 25 Covariance and some conditional expectation exercises Scott Sheffield MIT Outline Covariance and correlation Outline Covariance and correlation A property of independence If X and Y
More informationProbability Models of Information Exchange on Networks Lecture 1
Probability Models of Information Exchange on Networks Lecture 1 Elchanan Mossel UC Berkeley All Rights Reserved Motivating Questions How are collective decisions made by: people / computational agents
More informationORDER STATISTICS, QUANTILES, AND SAMPLE QUANTILES
ORDER STATISTICS, QUANTILES, AND SAMPLE QUANTILES 1. Order statistics Let X 1,...,X n be n real-valued observations. One can always arrangetheminordertogettheorder statisticsx (1) X (2) X (n). SinceX (k)
More informationSTAT 135 Lab 5 Bootstrapping and Hypothesis Testing
STAT 135 Lab 5 Bootstrapping and Hypothesis Testing Rebecca Barter March 2, 2015 The Bootstrap Bootstrap Suppose that we are interested in estimating a parameter θ from some population with members x 1,...,
More informationLECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b)
LECTURE 5 NOTES 1. Bayesian point estimators. In the conventional (frequentist) approach to statistical inference, the parameter θ Θ is considered a fixed quantity. In the Bayesian approach, it is considered
More informationLecture 1: Bayesian Framework Basics
Lecture 1: Bayesian Framework Basics Melih Kandemir melih.kandemir@iwr.uni-heidelberg.de April 21, 2014 What is this course about? Building Bayesian machine learning models Performing the inference of
More informationLecture notes for Analysis of Algorithms : Markov decision processes
Lecture notes for Analysis of Algorithms : Markov decision processes Lecturer: Thomas Dueholm Hansen June 6, 013 Abstract We give an introduction to infinite-horizon Markov decision processes (MDPs) with
More informationLecture 7 October 13
STATS 300A: Theory of Statistics Fall 2015 Lecture 7 October 13 Lecturer: Lester Mackey Scribe: Jing Miao and Xiuyuan Lu 7.1 Recap So far, we have investigated various criteria for optimal inference. We
More informationEcon 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines
Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the
More informationECE531 Lecture 10b: Maximum Likelihood Estimation
ECE531 Lecture 10b: Maximum Likelihood Estimation D. Richard Brown III Worcester Polytechnic Institute 05-Apr-2011 Worcester Polytechnic Institute D. Richard Brown III 05-Apr-2011 1 / 23 Introduction So
More informationFORMULATION OF THE LEARNING PROBLEM
FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we
More informationReview and continuation from last week Properties of MLEs
Review and continuation from last week Properties of MLEs As we have mentioned, MLEs have a nice intuitive property, and as we have seen, they have a certain equivariance property. We will see later that
More informationStatistical Measures of Uncertainty in Inverse Problems
Statistical Measures of Uncertainty in Inverse Problems Workshop on Uncertainty in Inverse Problems Institute for Mathematics and Its Applications Minneapolis, MN 19-26 April 2002 P.B. Stark Department
More informationUnbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others.
Unbiased Estimation Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. To compare ˆθ and θ, two estimators of θ: Say ˆθ is better than θ if it
More informationNotes 11: OLS Theorems ECO 231W - Undergraduate Econometrics
Notes 11: OLS Theorems ECO 231W - Undergraduate Econometrics Prof. Carolina Caetano For a while we talked about the regression method. Then we talked about the linear model. There were many details, but
More informationChapter 2. Limits and Continuity 2.6 Limits Involving Infinity; Asymptotes of Graphs
2.6 Limits Involving Infinity; Asymptotes of Graphs Chapter 2. Limits and Continuity 2.6 Limits Involving Infinity; Asymptotes of Graphs Definition. Formal Definition of Limits at Infinity.. We say that
More informationMachine learning, shrinkage estimation, and economic theory
Machine learning, shrinkage estimation, and economic theory Maximilian Kasy December 14, 2018 1 / 43 Introduction Recent years saw a boom of machine learning methods. Impressive advances in domains such
More informationEco517 Fall 2004 C. Sims MIDTERM EXAM
Eco517 Fall 2004 C. Sims MIDTERM EXAM Answer all four questions. Each is worth 23 points. Do not devote disproportionate time to any one question unless you have answered all the others. (1) We are considering
More informationLecture 3: Statistical Decision Theory (Part II)
Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical
More information10-704: Information Processing and Learning Fall Lecture 24: Dec 7
0-704: Information Processing and Learning Fall 206 Lecturer: Aarti Singh Lecture 24: Dec 7 Note: These notes are based on scribed notes from Spring5 offering of this course. LaTeX template courtesy of
More informationModel Selection and Geometry
Model Selection and Geometry Pascal Massart Université Paris-Sud, Orsay Leipzig, February Purpose of the talk! Concentration of measure plays a fundamental role in the theory of model selection! Model
More informationHigh-dimensional regression
High-dimensional regression Advanced Methods for Data Analysis 36-402/36-608) Spring 2014 1 Back to linear regression 1.1 Shortcomings Suppose that we are given outcome measurements y 1,... y n R, and
More informationCompressibility of Infinite Sequences and its Interplay with Compressed Sensing Recovery
Compressibility of Infinite Sequences and its Interplay with Compressed Sensing Recovery Jorge F. Silva and Eduardo Pavez Department of Electrical Engineering Information and Decision Systems Group Universidad
More informationHypothesis Testing. Part I. James J. Heckman University of Chicago. Econ 312 This draft, April 20, 2006
Hypothesis Testing Part I James J. Heckman University of Chicago Econ 312 This draft, April 20, 2006 1 1 A Brief Review of Hypothesis Testing and Its Uses values and pure significance tests (R.A. Fisher)
More informationBayesian analysis in finite-population models
Bayesian analysis in finite-population models Ryan Martin www.math.uic.edu/~rgmartin February 4, 2014 1 Introduction Bayesian analysis is an increasingly important tool in all areas and applications of
More informationLECTURE 10: REVIEW OF POWER SERIES. 1. Motivation
LECTURE 10: REVIEW OF POWER SERIES By definition, a power series centered at x 0 is a series of the form where a 0, a 1,... and x 0 are constants. For convenience, we shall mostly be concerned with the
More informationProbability and Measure
Chapter 4 Probability and Measure 4.1 Introduction In this chapter we will examine probability theory from the measure theoretic perspective. The realisation that measure theory is the foundation of probability
More informationLearning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley
Learning Methods for Online Prediction Problems Peter Bartlett Statistics and EECS UC Berkeley Course Synopsis A finite comparison class: A = {1,..., m}. Converting online to batch. Online convex optimization.
More informationThe Risk of James Stein and Lasso Shrinkage
Econometric Reviews ISSN: 0747-4938 (Print) 1532-4168 (Online) Journal homepage: http://tandfonline.com/loi/lecr20 The Risk of James Stein and Lasso Shrinkage Bruce E. Hansen To cite this article: Bruce
More informationLecture 7: Schwartz-Zippel Lemma, Perfect Matching. 1.1 Polynomial Identity Testing and Schwartz-Zippel Lemma
CSE 521: Design and Analysis of Algorithms I Winter 2017 Lecture 7: Schwartz-Zippel Lemma, Perfect Matching Lecturer: Shayan Oveis Gharan 01/30/2017 Scribe: Philip Cho Disclaimer: These notes have not
More informationLecture 8: Information Theory and Statistics
Lecture 8: Information Theory and Statistics Part II: Hypothesis Testing and Estimation I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 22, 2015
More informationLecture Notes 5 Convergence and Limit Theorems. Convergence with Probability 1. Convergence in Mean Square. Convergence in Probability, WLLN
Lecture Notes 5 Convergence and Limit Theorems Motivation Convergence with Probability Convergence in Mean Square Convergence in Probability, WLLN Convergence in Distribution, CLT EE 278: Convergence and
More informationNonparametric tests. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 704: Data Analysis I
1 / 16 Nonparametric tests Timothy Hanson Department of Statistics, University of South Carolina Stat 704: Data Analysis I Nonparametric one and two-sample tests 2 / 16 If data do not come from a normal
More informationTesting Statistical Hypotheses
E.L. Lehmann Joseph P. Romano Testing Statistical Hypotheses Third Edition 4y Springer Preface vii I Small-Sample Theory 1 1 The General Decision Problem 3 1.1 Statistical Inference and Statistical Decisions
More informationBayesian Learning in Social Networks
Bayesian Learning in Social Networks Asu Ozdaglar Joint work with Daron Acemoglu, Munther Dahleh, Ilan Lobel Department of Electrical Engineering and Computer Science, Department of Economics, Operations
More informationF (x) = P [X x[. DF1 F is nondecreasing. DF2 F is right-continuous
7: /4/ TOPIC Distribution functions their inverses This section develops properties of probability distribution functions their inverses Two main topics are the so-called probability integral transformation
More informationarxiv: v1 [math.st] 11 Apr 2007
Sparse Estimators and the Oracle Property, or the Return of Hodges Estimator arxiv:0704.1466v1 [math.st] 11 Apr 2007 Hannes Leeb Department of Statistics, Yale University and Benedikt M. Pötscher Department
More information9 Bayesian inference. 9.1 Subjective probability
9 Bayesian inference 1702-1761 9.1 Subjective probability This is probability regarded as degree of belief. A subjective probability of an event A is assessed as p if you are prepared to stake pm to win
More informationMaximum Likelihood Estimation
Chapter 8 Maximum Likelihood Estimation 8. Consistency If X is a random variable (or vector) with density or mass function f θ (x) that depends on a parameter θ, then the function f θ (X) viewed as a function
More informationIntroduction to Bayesian Methods. Introduction to Bayesian Methods p.1/??
to Bayesian Methods Introduction to Bayesian Methods p.1/?? We develop the Bayesian paradigm for parametric inference. To this end, suppose we conduct (or wish to design) a study, in which the parameter
More informationBias-Variance Tradeoff
What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff
More informationSTA205 Probability: Week 8 R. Wolpert
INFINITE COIN-TOSS AND THE LAWS OF LARGE NUMBERS The traditional interpretation of the probability of an event E is its asymptotic frequency: the limit as n of the fraction of n repeated, similar, and
More information5 + 9(10) + 3(100) + 0(1000) + 2(10000) =
Chapter 5 Analyzing Algorithms So far we have been proving statements about databases, mathematics and arithmetic, or sequences of numbers. Though these types of statements are common in computer science,
More informationP (A G) dp G P (A G)
First homework assignment. Due at 12:15 on 22 September 2016. Homework 1. We roll two dices. X is the result of one of them and Z the sum of the results. Find E [X Z. Homework 2. Let X be a r.v.. Assume
More informationUnderstanding Generalization Error: Bounds and Decompositions
CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the
More informationCS Communication Complexity: Applications and New Directions
CS 2429 - Communication Complexity: Applications and New Directions Lecturer: Toniann Pitassi 1 Introduction In this course we will define the basic two-party model of communication, as introduced in the
More informationMA131 - Analysis 1. Workbook 4 Sequences III
MA3 - Analysis Workbook 4 Sequences III Autumn 2004 Contents 2.3 Roots................................. 2.4 Powers................................. 3 2.5 * Application - Factorials *.....................
More informationLectures on Simple Linear Regression Stat 431, Summer 2012
Lectures on Simple Linear Regression Stat 43, Summer 0 Hyunseung Kang July 6-8, 0 Last Updated: July 8, 0 :59PM Introduction Previously, we have been investigating various properties of the population
More informationDensity Estimation. Seungjin Choi
Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationSection Example Determine the Maclaurin series of f (x) = e x and its the interval of convergence.
Example Determine the Maclaurin series of f (x) = e x and its the interval of convergence. Example Determine the Maclaurin series of f (x) = e x and its the interval of convergence. f n (0)x n Recall from
More informationLeast squares under convex constraint
Stanford University Questions Let Z be an n-dimensional standard Gaussian random vector. Let µ be a point in R n and let Y = Z + µ. We are interested in estimating µ from the data vector Y, under the assumption
More informationLecture 2 Sep 5, 2017
CS 388R: Randomized Algorithms Fall 2017 Lecture 2 Sep 5, 2017 Prof. Eric Price Scribe: V. Orestis Papadigenopoulos and Patrick Rall NOTE: THESE NOTES HAVE NOT BEEN EDITED OR CHECKED FOR CORRECTNESS 1
More information