Statistics 300B Winter 2018 Final Exam Due 24 Hours after receiving it



Directions: This test is open book and open internet, but must be done without consulting other students. Any consultation of other students or people is an honor code violation. Cite any results you use from the literature that we did not explicitly prove in class or homework.

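Question 1 (Asymptotics and an integrated delta method): Let $T_n$ be an estimator satisfying $r_n(T_n - \theta) \stackrel{d}{\to} T \in \mathbb{R}^d$ for some random variable $T$ and a non-random sequence $r_n$. Assume $T_n \stackrel{a.s.}{\to} \theta$. Let $\mu$ be a finite measure on a set $\mathcal{X}$ and for $z : \mathcal{X} \to \mathbb{R}$, let

$$F(z) = \int f(x, z(x)) \, d\mu(x),$$

where $f : \mathcal{X} \times \mathbb{R} \to \mathbb{R}$. Let $\varphi : \mathbb{R}^d \times \mathcal{X} \to \mathbb{R}$ be twice differentiable in its first argument, where for $t, \theta \in \mathbb{R}^d$,

$$\varphi(t, x) = \varphi(\theta, x) + \langle \nabla\varphi(\theta, x), t - \theta \rangle + \frac{1}{2}(t - \theta)^T \nabla^2\varphi(\tilde\theta, x)(t - \theta)$$

for some $\tilde\theta \in [t, \theta] = \{\lambda t + (1 - \lambda)\theta \mid \lambda \in [0, 1]\}$. Assume that $f$ satisfies the Lipschitz condition

$$|f(x, a) - f(x, b)| \le L(x)\,|a - b| \quad \text{for } a, b \in \mathbb{R},$$

where $L$ is square integrable, that is, $\int L(x)^2 \, d\mu(x) < \infty$. Define the process

$$Z_n(x) := r_n\left(\varphi(T_n, x) - \varphi(\theta, x)\right),$$

so that $Z_n : \mathcal{X} \to \mathbb{R}$. Assume that $\varphi$ is smooth enough around $\theta$ that

$$\int \sup_{\|t - \theta\| \le \epsilon} \|\nabla^2\varphi(t, x)\|_{\mathrm{op}}^2 \, d\mu(x) < \infty$$

for all small enough $\epsilon > 0$, and that $\int \|\nabla\varphi(\theta, x)\|^2 \, d\mu(x) < \infty$. Show that

$$F(Z_n) \stackrel{d}{\to} F_{\mathrm{fo}}(T),$$

where $F_{\mathrm{fo}} : \mathbb{R}^d \to \mathbb{R}$ is defined by

$$F_{\mathrm{fo}}(t) := \int f\left(x, \langle \nabla\varphi(\theta, x), t \rangle\right) d\mu(x).$$

Hint: You do not need any of the empirical process results we have proved to answer this question.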
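One possible first step for the integrated delta method question is sketched below. It is not a full solution. It assumes $r_n > 0$ and writes $\tilde\theta_x$ (notation not in the exam) for the intermediate point of the Taylor expansion at each $x$. The Lipschitz condition on $f$ then bounds the gap between $F(Z_n)$ and the first-order functional evaluated at $r_n(T_n - \theta)$:

```latex
\begin{align*}
Z_n(x) &= \langle \nabla\varphi(\theta, x),\, r_n(T_n - \theta) \rangle
  + \tfrac{1}{2}\, r_n\, (T_n - \theta)^T \nabla^2\varphi(\tilde\theta_x, x)\,(T_n - \theta), \\
\bigl| F(Z_n) - F_{\mathrm{fo}}(r_n(T_n - \theta)) \bigr|
  &\le \tfrac{1}{2}\, \| r_n(T_n - \theta) \| \, \| T_n - \theta \|
  \int L(x)\, \| \nabla^2\varphi(\tilde\theta_x, x) \|_{\mathrm{op}} \, d\mu(x).
\end{align*}
```

Under this reading, the first factor is bounded in probability, the second tends to zero almost surely, and the integral can be controlled by the Cauchy-Schwarz inequality once $T_n$ is within $\epsilon$ of $\theta$, which leaves a continuity argument for $F_{\mathrm{fo}}$.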

Question 2 (Logistic regression): Suppose that we observe data in pairs $(x, y) \in \mathbb{R}^d \times \{\pm 1\}$, where the data come from a logistic model with $X \sim P_0$ and

$$p_\theta(y \mid x) = \frac{1}{1 + e^{-y x^T \theta}},$$

with log loss $\ell_\theta(y \mid x) = \log(1 + \exp(-y x^T \theta))$. Let $\hat\theta_n$ minimize the empirical logistic loss,

$$L_n(\theta) := \frac{1}{n} \sum_{i=1}^n \ell_\theta(Y_i \mid X_i) = \frac{1}{n} \sum_{i=1}^n \log\left(1 + e^{-Y_i X_i^T \theta}\right),$$

for pairs $X, Y$ drawn from the logistic model with parameter $\theta_0$. Let $\hat\theta_n = \mathop{\mathrm{argmin}}_\theta L_n(\theta)$. Assume in addition that the data $X_i \in \mathbb{R}^d$ are i.i.d. and satisfy $E[X_i X_i^T] = \Sigma \succ 0$ and $E[\|X_i\|^4] < \infty$. That is, the second moment matrix of the $X_i$ is positive definite.

(a) Let $L(\theta) = E_0[\ell_\theta(Y \mid X)]$ denote the population logistic loss. Show that $\nabla^2 L(\theta_0) \succ 0$, that is, the Hessian of $L$ at $\theta_0$ is positive definite. You may assume that the order of differentiation and integration may be exchanged.

(b) Under these assumptions, argue that $\hat\theta_n$ is consistent for $\theta_0$, that is, that $\hat\theta_n \stackrel{a.s.}{\to} \theta_0$.

For the remainder of the question, assume that data are 1-dimensional and satisfy $x \in \{-1, 1\}$, as it makes things simpler.

(c) Give the asymptotic distribution of $\sqrt{n}(\hat\theta_n - \theta_0)$. You may assume that $\hat\theta_n$ is consistent. In one sentence or so, describe the effect of the parameter value $\theta_0$ on the efficiency of $\hat\theta_n$.

(d) In many scenarios, especially large-scale prediction problems, we do not particularly care about the value of the parameter, but we wish to make accurate predictions with confidence of a label $y$ given data $x$. For example, we might compare our predicted logistic model

$$p_{\hat\theta_n}(y \mid x) = \frac{1}{1 + \exp(-y \hat\theta_n x)}$$

to the true logistic prediction $p_{\theta_0}(y \mid x)$. With this in mind, define the risk

$$R(\theta) := E_0\left[\,\left|p_\theta(Y \mid X) - p_{\theta_0}(Y \mid X)\right|\,\right],$$

the expected absolute error in our predictions of labels (when the true distribution is $P_{\theta_0}$). What is the asymptotic distribution of $\sqrt{n}\, R(\hat\theta_n)$? In one sentence or so, describe the effect of the parameter value $\theta_0$ on the efficiency of $\hat\theta_n$.
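The one-dimensional setting of the logistic regression question can be explored numerically. The following is a minimal simulation sketch, not a solution: it assumes $x$ is drawn uniformly from $\{-1, 1\}$ (the exam does not specify $P_0$), and the function names, sample size, and use of Newton's method are illustrative choices rather than anything prescribed by the exam.

```python
import math
import random


def sigmoid(u):
    """Numerically stable logistic function 1 / (1 + exp(-u))."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def sample_logistic(n, theta0, rng):
    """Draw n pairs (x, y) with x uniform on {-1, 1} (an assumption)
    and P(y = 1 | x) = sigmoid(x * theta0)."""
    data = []
    for _ in range(n):
        x = rng.choice((-1.0, 1.0))
        y = 1.0 if rng.random() < sigmoid(x * theta0) else -1.0
        data.append((x, y))
    return data


def fit_logistic_1d(data, iters=25):
    """Minimize L_n(theta) = mean of log(1 + exp(-y * x * theta)) by Newton's method."""
    theta = 0.0
    n = len(data)
    for _ in range(iters):
        # Derivative of log(1 + exp(-y x theta)) is -y x sigmoid(-y x theta).
        grad = sum(-y * x * sigmoid(-y * x * theta) for x, y in data) / n
        # Second derivative is x^2 sigmoid(y x theta) sigmoid(-y x theta).
        hess = sum(x * x * sigmoid(y * x * theta) * sigmoid(-y * x * theta)
                   for x, y in data) / n
        theta -= grad / hess
    return theta


if __name__ == "__main__":
    rng = random.Random(0)
    data = sample_logistic(5000, 1.0, rng)
    print("theta_hat:", fit_logistic_1d(data))
```

Repeating the simulation over many seeds and several values of `theta0` gives an empirical spread of $\sqrt{n}(\hat\theta_n - \theta_0)$ against which a derived asymptotic variance can be compared.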

Question 3 (1 bit estimators of location): Let $f$ be a symmetric continuous density with $f(x) > 0$ for all $x \in \mathbb{R}$. Suppose we receive data $X_i \in \mathbb{R}$ drawn i.i.d. from the location family of distributions with densities $f(\cdot - \theta)$, $\theta \in \mathbb{R}$, where $f : \mathbb{R} \to \mathbb{R}_+$ is Lipschitz continuous, and the goal is to estimate $\theta$. However, there is an additional wrinkle: for reasons of communication and reducing memory and power usage, we may only observe a single bit associated with each $X_i$, that is, we observe $Z_i \in \{0, 1\}$, where each $Z_i$ is a function of $X_i$. To make this a bit more concrete, we assume that each $Z_i$ is in the form of a threshold, that is,

$$Z_i := 1\{X_i \le t_i\} \qquad (1)$$

where $t_i \in \mathbb{R}$ is real-valued.

(a) Assume that for each $i$, $t_i \equiv t$ is a constant and that we observe $Z_1, \ldots, Z_n$ of the form (1). Give a $\sqrt{n}$-consistent estimator $T_n = T_n(Z_1, \ldots, Z_n; t)$ of $\theta$ based on $Z_1, \ldots, Z_n$ and $t$.

(b) What is the asymptotic distribution of your estimator $T_n$?

(c) Now, suppose that we have a sample $\{X_i\}_{i=1}^N$ of size $N$, and we divide it into two parts: $S^{(1)} = \{X_1, \ldots, X_n\}$ and $S^{(2)} = \{X_{n+1}, \ldots, X_N\}$. On the first sample $S^{(1)}$, let $Z_i = 1\{X_i \le t\}$ as in part (a). Then define $\hat t_n := T_n(Z_1, \ldots, Z_n; t)$, where $T_n(\cdot)$ is your estimator from part (a). Now, let $Z_i = 1\{X_i \le \hat t_n\}$ for $i \in \{n + 1, \ldots, N\}$. Assume that $n/N \to 0$ as $N \to \infty$. Construct an estimator $\hat\theta_N$, a function of $Z_{n+1}, \ldots, Z_N$ and $\hat t_n$, such that

$$\sqrt{N}(\hat\theta_N - \theta_0) \stackrel{d}{\to} \mathsf{N}\left(0, \frac{1}{4 f(0)^2}\right)$$

when the data $X_i$ are i.i.d. with density $f(\cdot - \theta_0)$. Prove that you achieve this efficiency.

Hint: The Berry-Esseen theorem may be useful. As a reminder (or a definition), the Berry-Esseen theorem is as follows: if $Y_i$ are i.i.d. and $E[|Y_i - E[Y_i]|^3] < \infty$, then

$$\sup_{t \in \mathbb{R}} \left| P\left(\frac{\sqrt{n}\,(\bar Y_n - E[Y])}{\sqrt{\mathrm{Var}(Y)}} \le t\right) - \Phi(t) \right| \le \frac{E[|Y - E[Y]|^3]}{\mathrm{Var}(Y)^{3/2} \sqrt{n}},$$

where $\Phi$ is the standard normal CDF and $\bar Y_n = \frac{1}{n} \sum_{i=1}^n Y_i$ is the sample mean.
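The two-stage construction in part (c) of the one-bit question can be explored numerically. The following is a minimal sketch, not a solution. It assumes the bits take the form $Z_i = 1\{X_i \le t\}$, that $f$ is the standard normal density (so $F$ is its CDF), and a plug-in estimator $t - F^{-1}(\bar Z)$ motivated by $P(Z_i = 1) = F(t - \theta)$. The function names, sample sizes, and clipping rule are hypothetical choices, not taken from the exam.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()  # assumed noise density f: standard normal


def one_bit_estimate(bits, t):
    """Estimate theta from bits Z_i = 1{X_i <= t}, using P(Z = 1) = F(t - theta)."""
    n = len(bits)
    # Clip the empirical frequency away from 0 and 1 so inv_cdf stays finite.
    p_hat = min(max(fmean(bits), 0.5 / n), 1.0 - 0.5 / n)
    return t - STD_NORMAL.inv_cdf(p_hat)


def two_stage_estimate(xs, n_first, t):
    """Use the first n_first points to choose a threshold, the rest to estimate theta."""
    first_bits = [1 if x <= t else 0 for x in xs[:n_first]]
    t_hat = one_bit_estimate(first_bits, t)
    # Second stage: threshold at the pilot estimate, so P(Z = 1) is near 1/2.
    second_bits = [1 if x <= t_hat else 0 for x in xs[n_first:]]
    return one_bit_estimate(second_bits, t_hat)


def simulate(theta0=0.7, t=0.0, N=20000, n_first=500, seed=0):
    """Draw N points from the location family and return the two-stage estimate."""
    rng = random.Random(seed)
    xs = [theta0 + rng.gauss(0.0, 1.0) for _ in range(N)]
    return two_stage_estimate(xs, n_first, t)


if __name__ == "__main__":
    print("two-stage estimate of theta:", simulate())
```

Running `simulate` over many seeds and comparing the empirical variance of $\sqrt{N}(\hat\theta_N - \theta_0)$ with $1/(4 f(0)^2)$ gives an informal check of a proposed construction; it does not replace the proof the question asks for.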

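Question 4 (Stochastic convex optimization): Let $\theta \in \mathbb{R}^d$ and define the function

$$f(\theta) := E[F(\theta; X)] = \int F(\theta; x) \, dP(x),$$

where $F(\cdot\,; x)$ is convex in its first argument (in $\theta$) for all $x \in \mathcal{X}$, and $P$ is a probability distribution. We assume $F(\theta; \cdot)$ is integrable for all $\theta$. Recall that a function $h$ is convex if

$$h(t\theta + (1 - t)\theta') \le t\,h(\theta) + (1 - t)\,h(\theta') \quad \text{for all } \theta, \theta' \in \mathbb{R}^d,\ t \in [0, 1].$$

Let $\theta_0 \in \mathop{\mathrm{argmin}}_\theta f(\theta)$, and assume that $f$ satisfies the following $\nu$-strong convexity guarantee:

$$f(\theta) \ge f(\theta_0) + \frac{\nu}{2}\|\theta - \theta_0\|^2 \quad \text{for } \theta \text{ s.t. } \|\theta - \theta_0\| \le \beta,$$

where $\beta > 0$ is some constant. We also assume that the instantaneous functions $F(\cdot\,; x)$ are $L(x)$-Lipschitz continuous over the set $\{\theta \in \mathbb{R}^d \mid \|\theta - \theta_0\| \le \beta\}$, meaning that

$$|F(\theta; x) - F(\theta'; x)| \le L(x)\,\|\theta - \theta'\| \quad \text{when } \|\theta - \theta_0\| \le \beta,\ \|\theta' - \theta_0\| \le \beta.$$

We assume the (local) Lipschitz constant is sub-Gaussian, that is,

$$E[L(X)] < \infty \quad \text{and} \quad E\left[\exp\left(\lambda(L(X) - E[L(X)])\right)\right] \le \exp\left(\frac{\lambda^2 \sigma_{\mathrm{Lip}}^2}{2}\right) \quad \text{for all } \lambda \in \mathbb{R}.$$

We also make the following assumption on the relative errors in $F(\theta; X)$ and $F(\theta_0; X)$. Define

$$\Delta(\theta, x) = [F(\theta; x) - f(\theta)] - [F(\theta_0; x) - f(\theta_0)].$$

Then

$$\log\left(E\left[\exp(\lambda \Delta(\theta, X))\right]\right) \le \frac{\lambda^2 \sigma^2}{2}\|\theta - \theta_0\|^2 \quad \text{for all } \lambda \in \mathbb{R} \text{ if } \|\theta - \theta_0\| \le \beta.$$

Let $X_1, \ldots, X_n$ be an i.i.d. sample according to $P$, define $f_n(\theta) := \frac{1}{n} \sum_{i=1}^n F(\theta; X_i)$, and let

$$\hat\theta_n \in \mathop{\mathrm{argmin}}_{\theta \in \Theta} f_n(\theta).$$

Show that there exists a numerical constant $C$ such that for all suitably large $n$,

$$P\left(\|\hat\theta_n - \theta_0\| \ge C \frac{\sigma}{\nu \sqrt{n}} \left[\log\frac{1}{\delta} + d \cdot \tilde{O}(1)\right]\right) \le \delta + \exp\left(-\frac{n\,E[L(X)]^2}{C \sigma_{\mathrm{Lip}}^2}\right)$$

for all $\delta \in (0, 1)$, where $\tilde{O}(1)$ hides constants logarithmic in $n$, $E[L(X)]$, and $\nu$. (You do not need to show this, but $n$ such that $C \frac{\sigma}{\nu \sqrt{n}} \left[\log\frac{1}{\delta} + d \cdot \tilde{O}(1)\right] \le \beta$ is sufficiently large.)

Hint 1: You may use the following facts, which are proved in Question 7.0.

i. For any convex function $h$, if there is some $r > 0$ and a point $\theta_0$ such that $h(\theta) > h(\theta_0)$ for all $\theta$ such that $\|\theta - \theta_0\| = r$, then $h(\theta') > h(\theta_0)$ for all $\theta'$ with $\|\theta' - \theta_0\| > r$.

ii. The functions $f$ and $f_n$ are convex.

iii. The point $\theta_0$ is unique.

Hint 2: The bounded differences inequality will not work here. You should show that $f_n$ is locally Lipschitz with high probability.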
