Statistics 300B Winter 2018 Final Exam Due 24 Hours after receiving it

Directions: This test is open book and open internet, but it must be done without consulting other students. Any consultation with other students or other people is an honor code violation. Cite any results you use from the literature that we did not explicitly prove in class or homework.

Question 1 (Asymptotics and an integrated delta method): Let $T_n$ be an estimator satisfying $r_n(T_n - \theta) \xrightarrow{d} T \in \mathbb{R}^d$ for some random variable $T$ and a non-random sequence $r_n \to \infty$. Assume $T_n \xrightarrow{a.s.} \theta$. Let $\mu$ be a finite measure on a set $\mathcal{X}$, and for $z : \mathcal{X} \to \mathbb{R}$, let $F(z) = \int f(x, z(x)) \, d\mu(x)$, where $f : \mathcal{X} \times \mathbb{R} \to \mathbb{R}$. Let $\varphi : \mathbb{R}^d \times \mathcal{X} \to \mathbb{R}$ be twice differentiable in its first argument, where for $t, \theta \in \mathbb{R}^d$,
$$\varphi(t, x) = \varphi(\theta, x) + \langle \nabla \varphi(\theta, x), t - \theta \rangle + \tfrac{1}{2}(t - \theta)^T \nabla^2 \varphi(\tilde{\theta}, x)(t - \theta)$$
for some $\tilde{\theta} \in [t, \theta] = \{\lambda t + (1 - \lambda)\theta : \lambda \in [0, 1]\}$. Assume that $f$ satisfies the Lipschitz condition $|f(x, a) - f(x, b)| \le L(x)\,|a - b|$ for $a, b \in \mathbb{R}$, where $L$ is square integrable, that is, $\int L(x)^2 \, d\mu(x) < \infty$.

Define the process $Z_n(x) := r_n(\varphi(T_n, x) - \varphi(\theta, x))$, so that $Z_n : \mathcal{X} \to \mathbb{R}$. Assume that $\varphi$ is smooth enough around $\theta$ that
$$\int \sup_{\|t - \theta\| \le \epsilon} \|\nabla^2 \varphi(t, x)\|_{\mathrm{op}}^2 \, d\mu(x) < \infty$$
for all small enough $\epsilon > 0$, and that $\int \|\nabla \varphi(\theta, x)\|^2 \, d\mu(x) < \infty$. Show that $F(Z_n) \xrightarrow{d} F_{\mathrm{fo}}(T)$, where $F_{\mathrm{fo}} : \mathbb{R}^d \to \mathbb{R}$ is defined by
$$F_{\mathrm{fo}}(t) := \int f\big(x, \langle \nabla \varphi(\theta, x), t \rangle\big) \, d\mu(x).$$

Hint: You do not need any of the empirical process results we have proved to answer this question.
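For a quick numerical sanity check (not part of the exam), the following sketch instantiates the claim in a toy case of my own choosing: $d = 1$, $\mu$ the Lebesgue measure on $\mathcal{X} = [0, 1]$, $f(x, z) = |z|$ (so $L(x) = 1$), $\varphi(t, x) = \sin(tx)$, $\theta = 0$, and $T_n$ the mean of $n$ standard normal observations, so that $r_n = \sqrt{n}$ and $T \sim N(0, 1)$. In this instance $F_{\mathrm{fo}}(t) = \int_0^1 |xt| \, dx = |t|/2$, and the simulated quantiles of $F(Z_n)$ should roughly match those of $|T|/2$.

```python
# Toy Monte Carlo check of Question 1 (illustrative assumptions only; see text above).
import numpy as np

rng = np.random.default_rng(0)
n, reps = 2000, 5000
xgrid = np.linspace(0.0, 1.0, 1001)          # grid on X = [0, 1]; mu = Lebesgue measure

F_Zn = np.empty(reps)
for r in range(reps):
    Tn = rng.standard_normal(n).mean()       # T_n, with theta = 0 and r_n = sqrt(n)
    Zn = np.sqrt(n) * np.sin(Tn * xgrid)     # Z_n(x) = r_n (phi(T_n, x) - phi(theta, x))
    F_Zn[r] = np.abs(Zn).mean()              # F(Z_n) = integral of |Z_n(x)| over [0, 1], via grid average

limit = np.abs(rng.standard_normal(reps)) / 2.0   # F_fo(T) = |T| / 2 in this instance
for q in (0.25, 0.5, 0.75, 0.9):
    print(f"q={q}: F(Z_n) {np.quantile(F_Zn, q):.3f}  vs  F_fo(T) {np.quantile(limit, q):.3f}")
```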

Question 2 (Logistic regression): Suppose that we observe data in pairs $(x, y) \in \mathbb{R}^d \times \{\pm 1\}$, where the data come from a logistic model with $X \sim P_0$ and
$$p_\theta(y \mid x) = \frac{1}{1 + e^{-y x^T \theta}}, \qquad \text{with log loss} \quad \ell_\theta(y \mid x) = \log\big(1 + \exp(-y x^T \theta)\big).$$
Let $\hat{\theta}_n$ minimize the empirical logistic loss,
$$L_n(\theta) := \frac{1}{n} \sum_{i=1}^n \ell_\theta(Y_i \mid X_i) = \frac{1}{n} \sum_{i=1}^n \log\big(1 + e^{-Y_i X_i^T \theta}\big),$$
for pairs $(X_i, Y_i)$ drawn from the logistic model with parameter $\theta_0$; that is, $\hat{\theta}_n = \operatorname{argmin}_\theta L_n(\theta)$. Assume in addition that the data $X_i \in \mathbb{R}^d$ are i.i.d. and satisfy $E[X_i X_i^T] = \Sigma \succ 0$ and $E[\|X_i\|^4] < \infty$; that is, the second moment matrix of the $X_i$ is positive definite.

(a) Let $L(\theta) = E_0[\ell_\theta(Y \mid X)]$ denote the population logistic loss. Show that $\nabla^2 L(\theta_0) \succ 0$, that is, the Hessian of $L$ at $\theta_0$ is positive definite. You may assume that the order of differentiation and integration may be exchanged.

(b) Under these assumptions, argue that $\hat{\theta}_n$ is consistent for $\theta_0$, that is, that $\hat{\theta}_n \xrightarrow{a.s.} \theta_0$.

For the remainder of the question, assume that the data are 1-dimensional and satisfy $x \in \{-1, 1\}$, as it makes things simpler.

(c) Give the asymptotic distribution of $\sqrt{n}(\hat{\theta}_n - \theta_0)$. You may assume that $\hat{\theta}_n$ is consistent. In one sentence or so, describe the effect of the parameter value $\theta_0$ on the efficiency of $\hat{\theta}_n$.

(d) In many scenarios, especially large-scale prediction problems, we do not particularly care about the value of the parameter; rather, we wish to make accurate predictions, with confidence, of a label $y$ given data $x$. For example, we might compare our fitted logistic model $p_{\hat{\theta}_n}(y \mid x) = \frac{1}{1 + \exp(-y \hat{\theta}_n x)}$ to the true logistic prediction $p_{\theta_0}(y \mid x)$. With this in mind, define the risk
$$R(\theta) := E_0\big[\,|p_\theta(Y \mid X) - p_{\theta_0}(Y \mid X)|\,\big],$$
the expected absolute error in our predictions of labels (when the true distribution is $P_{\theta_0}$). What is the asymptotic distribution of $\sqrt{n}\, R(\hat{\theta}_n)$? In one sentence or so, describe the effect of the parameter value $\theta_0$ on the efficiency of $\hat{\theta}_n$.
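As an illustration of the 1-dimensional setting of parts (c) and (d), and not as the graded answer, the sketch below simulates data with $X$ uniform on $\{-1, +1\}$ and $\theta_0 = 1.5$ (both choices are mine), fits $\hat{\theta}_n$ by Newton's method on the empirical logistic loss, and compares the spread of $\sqrt{n}(\hat{\theta}_n - \theta_0)$ with $1/(\sigma(\theta_0)(1 - \sigma(\theta_0)))$, writing $\sigma(t) = 1/(1 + e^{-t})$; that comparison value is my own direct calculation of the inverse Fisher information when $x^2 = 1$, offered purely as a plausibility check.

```python
# Simulation sketch for Question 2(c) in one dimension (illustrative assumptions only).
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic_1d(x, y, iters=50):
    """Newton's method on the empirical logistic loss L_n(theta) in one dimension."""
    theta = 0.0
    for _ in range(iters):
        p = sigmoid(y * x * theta)
        grad = np.mean(-y * x * (1.0 - p))      # derivative of mean log(1 + exp(-y x theta))
        hess = np.mean(x**2 * p * (1.0 - p))    # second derivative; positive for logistic loss
        theta -= grad / hess
    return theta

rng = np.random.default_rng(1)
theta0, n, reps = 1.5, 2000, 2000
est = np.empty(reps)
for r in range(reps):
    x = rng.choice([-1.0, 1.0], size=n)                              # X uniform on {-1, +1} (assumption)
    y = np.where(rng.random(n) < sigmoid(x * theta0), 1.0, -1.0)     # Y from the logistic model
    est[r] = fit_logistic_1d(x, y)

s = sigmoid(theta0)
print("empirical var of sqrt(n)(theta_hat - theta0):", n * est.var())
print("1 / (sigma(theta0) (1 - sigma(theta0))):     ", 1.0 / (s * (1.0 - s)))
```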

Question 3 (1-bit estimators of location): Let $f$ be a symmetric continuous density with $f(x) > 0$ for all $x \in \mathbb{R}$. Suppose we receive data $X_i \in \mathbb{R}$ drawn i.i.d. from the location family of distributions with densities $f(\cdot - \theta)$, $\theta \in \mathbb{R}$, where $f : \mathbb{R} \to \mathbb{R}_+$ is Lipschitz continuous, and the goal is to estimate $\theta$. However, there is an additional wrinkle: for reasons of communication and reducing memory and power usage, we may only observe a single bit associated with each $X_i$; that is, we observe $Z_i \in \{0, 1\}$, where each $Z_i$ is a function of $X_i$. To make this a bit more concrete, we assume that each $Z_i$ is in the form of a threshold, that is,
$$Z_i := \mathbf{1}\{X_i \le t_i\}, \tag{1}$$
where $t_i \in \mathbb{R}$ is real-valued.

(a) Assume that for each $i$, $t_i \equiv t$ is a constant and that we observe $Z_1, \ldots, Z_n$ of the form (1). Give a $\sqrt{n}$-consistent estimator $T_n = T_n(Z_1, \ldots, Z_n; t)$ of $\theta$ based on $Z_1, \ldots, Z_n$ and $t$.

(b) What is the asymptotic distribution of your estimator $T_n$?

(c) Now, suppose that we have a sample $\{X_i\}_{i=1}^N$ of size $N$, and we divide it into two parts: $S^{(1)} = \{X_1, \ldots, X_n\}$ and $S^{(2)} = \{X_{n+1}, \ldots, X_N\}$. On the first sample $S^{(1)}$, let $Z_i = \mathbf{1}\{X_i \le t\}$ as in part (a). Then define $\hat{t}_n := T_n(Z_1, \ldots, Z_n; t)$, where $T_n(\cdot)$ is your estimator from part (a). Now, let $Z_i = \mathbf{1}\{X_i \le \hat{t}_n\}$ for $i \in \{n+1, \ldots, N\}$. Assume that $n/N \to 0$ as $N \to \infty$. Construct an estimator $\hat{\theta}_N$, a function of $Z_{n+1}, \ldots, Z_N$ and $\hat{t}_n$, such that
$$\sqrt{N}(\hat{\theta}_N - \theta_0) \xrightarrow{d} N\!\left(0, \frac{1}{4 f(0)^2}\right)$$
when the data $X_i$ are i.i.d. with density $f(\cdot - \theta_0)$. Prove that you achieve this efficiency.

Hint: The Berry-Esseen theorem may be useful. To remind you of (or define) it, the Berry-Esseen theorem is as follows: if $Y_i$ are i.i.d. and $E[|Y_i - E[Y_i]|^3] < \infty$, then
$$\sup_{t \in \mathbb{R}} \left| P\left( \frac{\sqrt{n}\,(\bar{Y}_n - E[Y])}{\sqrt{\operatorname{Var}(Y)}} \le t \right) - \Phi(t) \right| \le \frac{E[|Y - E[Y]|^3]}{\operatorname{Var}(Y)^{3/2} \sqrt{n}},$$
where $\Phi$ is the standard normal CDF and $\bar{Y}_n = \frac{1}{n} \sum_{i=1}^n Y_i$ is the sample mean.
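One natural candidate for part (a), sketched below under the illustrative assumption that $f$ is the standard normal density (my assumption, not the exam's), inverts the known CDF: $\bar{Z}_n$ estimates $P(X \le t) = F_0(t - \theta)$, where $F_0$ is the CDF corresponding to $f$, so one may take $T_n = t - F_0^{-1}(\bar{Z}_n)$. The delta-method variance printed for comparison is my own calculation, not a statement from the exam.

```python
# Sketch of a threshold-inversion estimator for Question 3(a), with f = standard normal (assumption).
import numpy as np
from scipy.stats import norm

def one_bit_estimator(Z, t):
    """T_n(Z_1, ..., Z_n; t) = t - F0^{-1}(mean(Z)), clipping away the 0/1 edge cases."""
    zbar = np.clip(np.mean(Z), 1e-6, 1.0 - 1e-6)
    return t - norm.ppf(zbar)

rng = np.random.default_rng(2)
theta, t, n, reps = 0.3, 0.0, 5000, 2000
Tn = np.empty(reps)
for r in range(reps):
    X = theta + rng.standard_normal(n)
    Z = (X <= t).astype(float)          # Z_i = 1{X_i <= t}: the only data we keep
    Tn[r] = one_bit_estimator(Z, t)

# Delta-method prediction for the variance of sqrt(n)(T_n - theta):
# F0(t - theta)(1 - F0(t - theta)) / f(t - theta)^2  (my own calculation, for comparison).
p = norm.cdf(t - theta)
print("empirical var of sqrt(n)(T_n - theta):", n * Tn.var())
print("delta-method prediction:              ", p * (1.0 - p) / norm.pdf(t - theta) ** 2)
```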

Question 4 (Stochastic convex optimization): Let $\theta \in \mathbb{R}^d$ and define the function
$$f(\theta) := E[F(\theta; X)] = \int_{\mathcal{X}} F(\theta; x) \, dP(x),$$
where $F(\cdot\,; x)$ is convex in its first argument (in $\theta$) for all $x \in \mathcal{X}$, and $P$ is a probability distribution. We assume $F(\theta; \cdot)$ is integrable for all $\theta$. Recall that a function $h$ is convex if
$$h(t\theta + (1 - t)\theta') \le t\, h(\theta) + (1 - t)\, h(\theta') \quad \text{for all } \theta, \theta' \in \mathbb{R}^d,\ t \in [0, 1].$$
Let $\theta_0 \in \operatorname{argmin}_\theta f(\theta)$, and assume that $f$ satisfies the following $\nu$-strong convexity guarantee:
$$f(\theta) \ge f(\theta_0) + \frac{\nu}{2} \|\theta - \theta_0\|^2 \quad \text{for } \theta \text{ such that } \|\theta - \theta_0\| \le \beta,$$
where $\beta > 0$ is some constant. We also assume that the instantaneous functions $F(\cdot\,; x)$ are $L(x)$-Lipschitz continuous over the set $\{\theta \in \mathbb{R}^d : \|\theta - \theta_0\| \le \beta\}$, meaning that
$$|F(\theta; x) - F(\theta'; x)| \le L(x) \|\theta - \theta'\| \quad \text{when } \|\theta - \theta_0\| \le \beta \text{ and } \|\theta' - \theta_0\| \le \beta.$$
We assume the (local) Lipschitz constant is sub-Gaussian, that is,
$$E[L(X)] < \infty \quad \text{and} \quad E\big[\exp\big(\lambda(L(X) - E[L(X)])\big)\big] \le \exp\left(\frac{\lambda^2 \sigma_{\mathrm{Lip}}^2}{2}\right) \quad \text{for all } \lambda \in \mathbb{R}.$$
We also make the following assumption on the relative errors in $F(\theta; X)$ and $F(\theta_0; X)$. Define
$$\Delta(\theta, x) = [F(\theta; x) - f(\theta)] - [F(\theta_0; x) - f(\theta_0)].$$
Then
$$\log\big(E[\exp(\lambda \Delta(\theta, X))]\big) \le \frac{\lambda^2 \sigma^2 \|\theta - \theta_0\|^2}{2} \quad \text{for all } \lambda \in \mathbb{R} \text{ if } \|\theta - \theta_0\| \le \beta.$$
Let $X_1, \ldots, X_n$ be an i.i.d. sample according to $P$, define $f_n(\theta) := \frac{1}{n} \sum_{i=1}^n F(\theta; X_i)$, and let
$$\hat{\theta}_n \in \operatorname{argmin}_{\theta \in \Theta} f_n(\theta).$$
Show that there exists a numerical constant $C$ such that for all suitably large $n$,
$$P\left( \|\hat{\theta}_n - \theta_0\|^2 \ge \frac{C \sigma^2}{\nu^2 n} \left[\log\frac{1}{\delta} + d \cdot \widetilde{O}(1)\right] \right) \le \delta + \exp\left(-\frac{n\, E[L(X)]^2}{C \sigma_{\mathrm{Lip}}^2}\right)$$
for all $\delta \in (0, 1)$, where $\widetilde{O}(1)$ hides constants logarithmic in $n$, $E[L(X)]$, and $\nu$. (You do not need to show this, but any $n$ such that $\frac{C \sigma^2}{\nu^2 n} \left[\log\frac{1}{\delta} + d \cdot \widetilde{O}(1)\right] \le \beta^2$ is sufficiently large.)

Hint 1: You may use the following facts, which are proved in Question 7.0.

i. For any convex function $h$, if there is some $r > 0$ and a point $\theta_0$ such that $h(\theta) > h(\theta_0)$ for all $\theta$ with $\|\theta - \theta_0\| = r$, then $h(\theta') > h(\theta_0)$ for all $\theta'$ with $\|\theta' - \theta_0\| > r$.

ii. The functions $f$ and $f_n$ are convex.

iii. The point $\theta_0$ is unique.

Hint 2: The bounded differences inequality will not work here. You should show that $f_n$ is locally Lipschitz with high probability.
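As a loose illustration of the rate in the bound (a toy instance of my own, not the exam's setting), the sketch below takes $F(\theta; x) = |\theta - x|$ with $X \sim N(0, 1)$: each $F(\cdot\,; x)$ is $1$-Lipschitz, $f(\theta) = E|\theta - X|$ is locally strongly convex around $\theta_0 = 0$, and $\hat{\theta}_n$ is the sample median. The high-probability error should shrink roughly like $1/\sqrt{n}$, so the printed 95th-percentile error should about halve each time $n$ quadruples.

```python
# Toy empirical-risk-minimization sketch for Question 4 (illustrative assumptions only).
import numpy as np

rng = np.random.default_rng(3)
reps = 2000
for n in (100, 400, 1600, 6400):
    err = np.empty(reps)
    for r in range(reps):
        X = rng.standard_normal(n)
        err[r] = abs(np.median(X))     # |theta_hat_n - theta_0|, since theta_0 = 0 here
    # 95th percentile of the error; halving as n quadruples reflects the 1/sqrt(n) rate.
    print(f"n = {n:5d}: 95th-percentile error = {np.quantile(err, 0.95):.4f}")
```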