Chapter 5

Convexity

Very rough draft

5.1 Approximation of convex functions

Convexity simplifies arguments from Chapter 3, for two reasons: local minima of convex functions are global minima, and pointwise convergence of a sequence of convex functions implies uniform convergence on compacta. The first Lemma in this Section contains the standard result (Rockafellar 1970, Theorem 10.8) that pointwise convergence of a sequence of (deterministic) convex functions {H_n} to a limit function on an open set G implies uniform convergence on each compact subset of G.

For a subset T of R^k and δ > 0, write T_δ for the set {y : d(y, T) ≤ δ}. For a bounded subset T of R^k, call a finite subset T_0 ⊆ T a δ-net for T if d(t, T_0) ≤ δ for every t in T. Equivalently, the union of closed balls ∪_{t ∈ T_0} B(t, δ) covers T; also equivalently, B(t, δ) ∩ T_0 ≠ ∅ for each t in T.

<1> Lemma. (Based on Pollard 1991.) Let K be a compact, convex subset of R^k. Let H(·) and A(·) be real-valued functions defined on K_{4δ}, with H convex and A satisfying the uniform continuity requirement

    |A(s) − A(t)| ≤ η    whenever |s − t| ≤ 2δ and s, t ∈ K_{4δ}.

Let T_0 be a δ-net for K_{3δ}. Then

    sup_{s ∈ K} |H(s) − A(s)| ≤ 4η + 3 max_{t ∈ T_0} |H(t) − A(t)|.

Proof. Write Δ for the maximum of |H(t) − A(t)| over the finite set T_0. The asserted inequality will follow from a pair of bounds:

<2>    H(s) ≤ A(s) + η + Δ        for all s in K_{2δ}
<3>    H(s) ≥ A(s) − 4η − 3Δ     for all s in K.

version: 9 Sept 2010; printed: 6 October 2010. Asymptopia © David Pollard

Proof of upper bound <2>. Let s be a point of K_{2δ}. First I prove that s lies in the convex hull of the finite set

    T(s) := {t ∈ T_0 : |s − t| ≤ 2δ} = T_0 ∩ B(s, 2δ).

Argue by contradiction. If s were not in the convex hull then, by the separating hyperplane theorem, there would exist some unit vector u for which

    u·s > u·t    for all t in T(s).

The closed ball B(s′, δ) around the point s′ = s + δu lies inside B(s, 2δ) ⊆ K_{4δ} and it contains no points of T_0: any t in T_0 ∩ B(s′, δ) would satisfy |t − s| ≤ 2δ and u·t ≥ u·s′ − δ = u·s, putting it in T(s) but violating the strict inequality. Because s′ belongs to K_{3δ}, the empty intersection contradicts the δ-net property of T_0.

It follows that s must be a convex combination of points of T(s). That is, s = Σ_{t ∈ T(s)} α_t t with α_t ≥ 0 and Σ_t α_t = 1. Then

    H(s) ≤ Σ_t α_t H(t)                by convexity of H
         ≤ Σ_t α_t (A(t) + Δ)          by definition of Δ
         ≤ Σ_t α_t (A(s) + η + Δ)      because |s − t| ≤ 2δ for t in T(s)
         = A(s) + η + Δ.

Proof of lower bound <3>. Let s be a point of K. Find a point t of T_0 with |s − t| ≤ δ. Define s′ = t − (s − t). Then s′ ∈ K_{2δ} and t = (s + s′)/2. By convexity of H,

    H(t) ≤ (H(s) + H(s′))/2.

From <2>, H(s′) ≤ A(s′) + η + Δ, and, by definition of Δ, H(t) ≥ A(t) − Δ. Combine the last three inequalities to get

    2A(t) − 2Δ ≤ H(s) + A(s′) + η + Δ.

Both A(s′) and A(t) lie within η of A(s). Inequality <3> follows.
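The δ-net bookkeeping is easy to exercise numerically. The following sketch is my own illustration, not from the text: it takes K = [−1, 1], builds an equally spaced δ-net for K_{3δ}, picks an arbitrary convex H and a nearby smooth A with η obtained from a crude derivative bound, and checks the inequality of Lemma <1>.

```python
import numpy as np

delta = 0.05
K = np.linspace(-1.0, 1.0, 2001)                  # the compact convex set K = [-1, 1]
T0 = np.arange(-1.15, 1.15 + delta / 2, delta)    # equally spaced delta-net for K_{3*delta}

def H(s):
    return np.abs(s) + 0.5 * s**2                 # convex

def A(s):
    return np.abs(s) + 0.5 * s**2 + 0.2 * np.cos(4 * s)   # smooth comparison function

# Modulus bound: |A'| <= 1 + 1.2 + 0.8 = 3 on K_{4*delta}, so |A(s) - A(t)| <= eta
# whenever |s - t| <= 2*delta there.
eta = 3 * 2 * delta

Delta = np.max(np.abs(H(T0) - A(T0)))             # max over the finite net
lhs = np.max(np.abs(H(K) - A(K)))                 # sup over K (on a fine grid)
print(lhs <= 4 * eta + 3 * Delta)
```

The check is crude (grids instead of exact suprema), but it exercises every quantity appearing in the Lemma.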

5.2 Minimizers of convex functions

Suppose M is a strictly convex function on R^k satisfying the growth condition

<4>    M(t)/|t| → ∞    as |t| → ∞.

For each z in R^k, the function M(t, z) := M(t) − z·t achieves its minimum value M*(z) = inf_{t ∈ R^k} (M(t) − z·t) at a unique point, ψ_M(z). I claim that the map z ↦ ψ_M(z) is continuous. I would also like to show that if M(t, z) ≤ M*(z) + ε then t must be close to ψ_M(z) in some suitably uniform sense. What else would be needed to establish a more general form of Lemma <5>?

5.3 A limit theorem for M-estimators

<5> Lemma. Suppose {H_n(t) : t ∈ R^k} is a stochastic process with convex sample paths. Assume

(i) H_n(t̂_n) ≤ inf_{t ∈ R^k} H_n(t) + o_p(1) for some random t̂_n;

(ii) there exists a sequence of random vectors {Z_n} of order O_p(1) for which H_n(t) + Z_n′t → ½|t|² in probability, for each fixed t in R^k.

Then t̂_n = Z_n + o_p(1).

Proof. Define A_n(t) = ½|t|² − Z_n′t = ½|Z_n − t|² − ½|Z_n|², so that H_n(t) − A_n(t) → 0 in probability, for each fixed t. Note that A_n is minimized at Z_n, which will imply that H_n is minimized near Z_n, with high probability.

Suppose ε > 0 and η > 0 are given, small quantities. By (ii), there exists an R > 0 for which P{|Z_n| > R} < ε eventually. Choose a δ < 1/4 small enough that

    |A_n(s) − A_n(t)| ≤ η    when |s − t| ≤ 2δ and |Z_n| ≤ R and s, t ∈ B(0, R + 1).
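For a concrete instance of the claims about ψ_M (my own one-dimensional example, not from the text): take M(t) = t⁴, which is strictly convex and satisfies <4>. Minimizing M(t) − zt sets the derivative 4t³ − z to zero, so ψ_M(z) = sign(z)(|z|/4)^{1/3}, visibly a continuous function of z. A grid search confirms the formula:

```python
import numpy as np

def psi_M(z):
    # closed-form minimizer of t**4 - z*t, from solving 4*t**3 = z
    return np.sign(z) * (np.abs(z) / 4.0) ** (1.0 / 3.0)

t = np.linspace(-3.0, 3.0, 600001)            # grid with spacing 1e-5
zs = [-2.0, -0.5, 0.0, 0.5, 2.0]
err = max(abs(t[np.argmin(t**4 - z * t)] - psi_M(z)) for z in zs)
print(err < 1e-4)   # grid argmin agrees with the closed form
```

Note the Hölder-1/3 behavior of ψ_M at z = 0: continuity can hold without Lipschitz continuity, which is part of why a uniform version of the claim needs extra work.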

Let T_0 be a δ-net for B(0, R + 1). Then there exists a set Ω_n with probability eventually greater than 1 − 2ε, on which

    |Z_n| ≤ R,
    Δ_n := max_{t ∈ T_0} |H_n(t) − A_n(t)| ≤ η,
    H_n(t̂_n) ≤ H_n(Z_n) + η.

On Ω_n, Lemma <1> implies that |H_n(t) − A_n(t)| ≤ 4η + 3Δ_n ≤ 7η for all |t| ≤ R. In particular, for all unit vectors v (with η small enough that 6√η ≤ 1),

    H_n(Z_n) − 7η ≤ A_n(Z_n) = A_n(Z_n + 6√η v) − 18η ≤ H_n(Z_n + 6√η v) − 11η.

Convexity of r ↦ H_n(Z_n + rv) then implies that H_n(Z_n) + 4η ≤ H_n(Z_n + rv) for all r ≥ 6√η and all unit vectors v, which forces t̂_n to lie within the ball of radius 6√η around Z_n.

I would like to strengthen the Lemma by replacing (ii) with: there exists a sequence of random vectors {Z_n} of order O_p(1) for which H_n(t) + Z_n′t → M(t) in probability, for each fixed t in R^k, where M is a strictly convex function satisfying the growth condition <4>. I would like to conclude something like

    t̂_n = ψ_M(Z_n) + o_p(1).

If we had Z_n ⇝ Z then we would get t̂_n ⇝ ψ_M(Z), a very clean limit theorem.

<6> Theorem. Let G_n(θ) = Σ_{i ≤ n} g_{n,i}(θ) be a random process with convex sample paths defined on R^k. Suppose there exist points θ_n in R^k and positive definite matrices J_n for which:

(i) θ̂_n is an estimator for which G_n(θ̂_n) ≤ inf_{θ ∈ R^k} G_n(θ) + o_p(1);

(ii) the function θ ↦ PG_n(θ) is minimized at θ_n;

(iii) each g_{n,i} has a linear approximation near θ_n,

    g_{n,i}(θ) = g_{n,i}(θ_n) − (θ − θ_n)′D_{n,i} + r_{n,i}(θ),

where the D_{n,i} are random vectors with zero expected value;

(iv) PG_n(θ_n + J_n^{−1/2}t) − PG_n(θ_n) = ½|t|² + o(1) for each t;

(v) var( Σ_{i ≤ n} r_{n,i}(θ_n + J_n^{−1/2}t) ) = o(1) for each t;

(vi) Z_n := J_n^{−1/2} Σ_{i ≤ n} D_{n,i} = O_p(1).

Then J_n^{1/2}(θ̂_n − θ_n) = Z_n + o_p(1).

Try to replace (iv) by PG_n(θ_n + J_n^{−1/2}t) − PG_n(θ_n) = M(t) + o(1) for each t, with M as in Section 5.2. Maybe the Theorem will then be strong enough to cover the lasso (Example <17>).

Proof. The standardized random vector t̂_n = J_n^{1/2}(θ̂_n − θ_n) comes within o_p(1) of minimizing the process

    H_n(t) := G_n(θ_n + J_n^{−1/2}t) − G_n(θ_n) = −t′Z_n + Σ_{i ≤ n} r_{n,i}(θ_n + J_n^{−1/2}t).

For a fixed t, assumption (v) ensures that the sum contributed by the {r_{n,i}} lies within o_p(1) of its expectation,

    P Σ_{i ≤ n} r_{n,i}(θ_n + J_n^{−1/2}t) = PG_n(θ_n + J_n^{−1/2}t) − PG_n(θ_n)    because PD_{n,i} = 0
                                          = ½|t|² + o(1)                            by (iv).

Thus H_n(t) + t′Z_n = ½|t|² + o_p(1) for each fixed t. An appeal to Lemma <5> completes the proof.

Remarks. The theorem is almost true if θ ranges over only a subset Θ of R^k, instead of over the whole of R^k.
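To see the shape of the theorem in the simplest possible case (my own toy illustration, not from the text): for least squares, g_{n,i}(θ) = ½(y_i − x_i′θ)², the expansion in (iii) is exact, with D_{n,i} = x_iε_i (up to the sign convention in (iii)) and a purely quadratic remainder, and J_n = Σ_i x_ix_i′. The conclusion J_n^{1/2}(θ̂_n − θ_n) = Z_n then holds with no o_p(1) error at all:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 3
X = rng.normal(size=(n, k))
theta0 = np.array([1.0, -2.0, 0.5])
eps = rng.normal(size=n)
y = X @ theta0 + eps

J = X.T @ X                                   # plays the role of J_n
w, V = np.linalg.eigh(J)
J_half = V @ np.diag(np.sqrt(w)) @ V.T        # symmetric square root J_n^{1/2}
Z = np.linalg.solve(J_half, X.T @ eps)        # Z_n = J_n^{-1/2} * sum_i D_{n,i}

theta_hat = np.linalg.solve(J, X.T @ y)       # exact minimizer of G_n
lhs = J_half @ (theta_hat - theta0)
print(np.allclose(lhs, Z))                    # exact identity for least squares
```

All the work in the theorem goes into showing that the remainder terms r_{n,i} do no more, asymptotically, than they do in this exactly quadratic case.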

5.4 Examples

<7> Example. Suppose X_1, X_2, ... are independent observations from a probability distribution P on the real line with median θ_0. Suppose also that there is a neighborhood of θ_0 in which P has a continuous, strictly positive density f (with respect to Lebesgue measure). Let θ̂_n be the sample median. Show that 2f(θ_0)√n(θ̂_n − θ_0) ⇝ N(0, 1).

<8> Example. Generalize Example <7> to cover the spatial median, argmin_t Σ_{i ≤ n} |t − X_i|, in higher dimensions.

<9> Example. Theorem 1 of Pollard (1991) for L1 regression.

<10> Example. Modify the following result from Dou, Pollard, and Zhou (2009), but with the improved argument from the talk at the Wellner conference:

Let {Q_λ : λ ∈ R} be an exponential family of probability measures with densities dQ_λ/dQ_0 = f_λ(y) = exp(λy − ψ(λ)). Remember that e^{ψ(λ)} = Q_0 e^{λy} and that the distribution Q_λ has mean ψ^{(1)}(λ) and variance ψ^{(2)}(λ). We assume:

(ψ3) There exists an increasing real function G on R^+ such that

    |ψ^{(3)}(λ + h)| ≤ ψ^{(2)}(λ) G(|h|)    for all λ and h.

Without loss of generality we assume G(0) ≥ 1.

(ψ2) For each ε > 0 there exists a finite constant C_ε for which ψ^{(2)}(λ) ≤ C_ε exp(ελ²) for all λ ∈ R. Equivalently, ψ^{(2)}(λ) ≤ exp(o(λ²)) as |λ| → ∞.

As shown in Section ??, these assumptions on the ψ function imply that

<11>    h²(Q_λ, Q_{λ+δ}) ≤ δ² ψ^{(2)}(λ) (1 + δ²) G(|δ|)    for all λ, δ ∈ R.

Remark. We may assume that ψ^{(2)}(λ) > 0 for every real λ. Otherwise we would have 0 = ψ^{(2)}(λ_0) = var_{λ_0}(y) = ν f_{λ_0}(y)(y − ψ^{(1)}(λ_0))² for some λ_0, which would make y = ψ^{(1)}(λ_0) for ν almost all y and Q_λ = Q_{λ_0} for every λ.

The theory in this section combines ideas from Portnoy (1988) and from Hjort and Pollard (1993). We write our results in a notation that makes the applications in Sections ?? and ?? more straightforward. The notational cost is that the parameters are indexed by {0, 1, ..., N}. To avoid an excess of parentheses we write N_+ for N + 1. In the applications N changes with the sample size n and Q is replaced by measures of the form Q_{n,a,b,N}.

Suppose ξ_1, ..., ξ_n are (nonrandom) vectors in R^{N_+}. Suppose Q = ⊗_{i ≤ n} Q_{λ_i} with λ_i = ξ_i′γ for a fixed γ = (γ_0, γ_1, ..., γ_N) in R^{N_+}. Under Q, the coordinate maps y_1, ..., y_n are independent random variables with y_i ∼ Q_{λ_i}. The log-likelihood for fitting the model is

    L_n(g) = Σ_{i ≤ n} (ξ_i′g)y_i − ψ(ξ_i′g)    for g ∈ R^{N_+},

which is maximized (over R^{N_+}) at the MLE ĝ (= ĝ_n).

Remark. As a small amount of extra bookkeeping in the following argument would show, we do not need ĝ to exactly maximize L_n. It would suffice to have L_n(ĝ) suitably close to sup_g L_n(g). In particular, we need not be concerned with questions regarding existence or uniqueness of the argmax.

Define

(i) J_n = Σ_{i ≤ n} ξ_iξ_i′ψ^{(2)}(λ_i), an N_+ × N_+ matrix;

(ii) w_i := J_n^{−1/2}ξ_i, an element of R^{N_+};

(iii) W_n = Σ_{i ≤ n} w_i ( y_i − ψ^{(1)}(λ_i) ), an element of R^{N_+}.

Notice that QW_n = 0 and

    var_Q(W_n) = Σ_{i ≤ n} w_iw_i′ψ^{(2)}(λ_i) = I_{N_+}    and    Q|W_n|² = trace( var_Q(W_n) ) = N_+.

<12> Lemma. Suppose 0 < ε_1 ≤ 1/2 and 0 < ε_2 < 1 and

    max_{i ≤ n} |w_i| ≤ ε_1ε_2 / (2G(1)N_+),

with G as in Assumption (ψ3). Then ĝ = γ + J_n^{−1/2}(W_n + r_n) with |r_n| ≤ ε_1 on the set {|W_n| ≤ √(N_+/ε_2)}, which has Q-probability greater than 1 − ε_2.
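The standardization in (i)–(iii) can be checked directly on a small synthetic instance (my own sketch; the Poisson-style choice ψ^{(2)}(λ) = e^λ and the random ξ_i are arbitrary): by construction, Σ_i w_iw_i′ψ^{(2)}(λ_i) collapses to the identity matrix, so Q|W_n|² = N_+.

```python
import numpy as np

rng = np.random.default_rng(2)
n, Nplus = 50, 4
xi = rng.normal(size=(n, Nplus))              # the (nonrandom) vectors xi_i, as rows
gamma = rng.normal(size=Nplus) / 4
psi2 = np.exp(xi @ gamma)                     # psi''(lambda_i) for a Poisson-style family

J = (xi * psi2[:, None]).T @ xi               # J_n = sum_i xi_i xi_i' psi''(lambda_i)
w_eig, V = np.linalg.eigh(J)
J_neg_half = V @ np.diag(w_eig ** -0.5) @ V.T # symmetric J_n^{-1/2}
w = xi @ J_neg_half                           # row i is w_i = J_n^{-1/2} xi_i

varW = (w * psi2[:, None]).T @ w              # var_Q(W_n) = sum_i w_i w_i' psi''(lambda_i)
print(np.allclose(varW, np.eye(Nplus)), np.isclose(np.trace(varW), Nplus))
```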

Proof. The equality Q|W_n|² = N_+ and Tchebychev give Q{|W_n| > √(N_+/ε_2)} ≤ ε_2.

Reparametrize by defining t = J_n^{1/2}(g − γ). The concave function

    L̃_n(t) := L_n(γ + J_n^{−1/2}t) − L_n(γ) = Σ_{i ≤ n} ( y_i w_i′t + ψ(λ_i) − ψ(λ_i + w_i′t) )

is maximized at t̂_n = J_n^{1/2}(ĝ − γ). It has gradient

    ∇L̃_n(t) = Σ_{i ≤ n} w_i ( y_i − ψ^{(1)}(λ_i + w_i′t) ).

For a fixed unit vector u in R^{N_+} and a fixed t in R^{N_+}, consider the real-valued function of the real variable s,

    H(s) := u′∇L̃_n(st) = Σ_{i ≤ n} u′w_i ( y_i − ψ^{(1)}(λ_i + s w_i′t) ),

which has derivatives

    H′(s) = −Σ_{i ≤ n} (u′w_i)(w_i′t) ψ^{(2)}(λ_i + s w_i′t)
    H″(s) = −Σ_{i ≤ n} (u′w_i)(w_i′t)² ψ^{(3)}(λ_i + s w_i′t).

Notice that H(0) = u′W_n and H′(0) = −u′ ( Σ_{i ≤ n} w_iw_i′ψ^{(2)}(λ_i) ) t = −u′t. Write M_n for max_{i ≤ n} |w_i|. By virtue of Assumption (ψ3),

    |H″(s)| ≤ Σ_{i ≤ n} |u′w_i| (w_i′t)² ψ^{(2)}(λ_i) G( |s w_i′t| )
            ≤ M_n G( M_n |s| |t| ) t′ ( Σ_{i ≤ n} w_iw_i′ψ^{(2)}(λ_i) ) t = M_n G( M_n |s| |t| ) |t|².

By Taylor expansion, for some 0 < s* < 1,

    |H(1) − H(0) − H′(0)| ≤ ½ |H″(s*)| ≤ ½ M_n G( M_n |t| ) |t|².

That is,

<13>    |u′( ∇L̃_n(t) − W_n + t )| ≤ ½ M_n G( M_n |t| ) |t|².

Approximation <13> will control the behavior of L(s) := L̃_n(W_n + su), a concave function of the real argument s, for each unit vector u. By concavity, the derivative

    L′(s) = u′∇L̃_n(W_n + su) = −s + R(s)

is a decreasing function of s with

    |R(s)| ≤ ½ M_n G( M_n |W_n + su| ) |W_n + su|².

On the set {|W_n| ≤ √(N_+/ε_2)} we have

    |W_n ± ε_1 u| ≤ √(N_+/ε_2) + ε_1,

implying

    M_n |W_n ± ε_1 u| ≤ [ε_1ε_2 / (2G(1)N_+)] ( √(N_+/ε_2) + ε_1 ) < 1,

so that

    |R(±ε_1)| ≤ ½ M_n G(1) |W_n ± ε_1 u|² ≤ [ε_1ε_2 / (4N_+)] ( √(N_+/ε_2) + ε_1 )² = (ε_1/4)( 1 + ε_1√(ε_2/N_+) )² < (5/8) ε_1.

Deduce that

    L′(ε_1) = −ε_1 + R(ε_1) ≤ −(3/8) ε_1 < 0,
    L′(−ε_1) = ε_1 + R(−ε_1) ≥ (3/8) ε_1 > 0.

The concave function s ↦ L̃_n(W_n + su) must therefore achieve its maximum at some s in the interval [−ε_1, ε_1], for each unit vector u. It follows that |t̂_n − W_n| ≤ ε_1.

<14> Corollary. Suppose ξ_i = Dη_i for some nonsingular matrix D, so that J_n = nDA_nD′ where A_n := n^{−1} Σ_{i ≤ n} η_iη_i′ψ^{(2)}(λ_i). Suppose also that B_n is another nonsingular matrix for which

<15>    |A_n − B_n| ≤ ( 2|B_n^{−1}| )^{−1}

and that

<16>    max_{i ≤ n} |η_i|² ≤ ε²n / ( 32 N_+² G(1)² |B_n^{−1}| )

for some 0 < ε < 1. Then for each set of vectors κ_0, ..., κ_N in R^{N_+} there is a set Y_{κ,ε} with QY_{κ,ε}^c < 2ε on which

    Σ_{0 ≤ j ≤ N} |κ_j′(ĝ − γ)|² ≤ ( 6|B_n^{−1}| / (nε) ) Σ_{0 ≤ j ≤ N} |D^{−1}κ_j|².

Remark. For our applications of the Corollary in Sections ?? and ??, we need D = diag(D_0, D_1, ..., D_N) and κ_j = e_j, the unit vector with a 1 in its jth position, for j ≤ m, and κ_j = 0 for j > m. In our companion paper we will need the more general κ_j's.

Proof. First we establish a bound on the spectral distance between A_n^{−1} and B_n^{−1}. Define H = B_n^{−1}A_n − I. Then |H| ≤ |B_n^{−1}| |A_n − B_n| ≤ 1/2, which justifies the expansion

    |A_n^{−1} − B_n^{−1}| = |( (I + H)^{−1} − I ) B_n^{−1}| ≤ Σ_{j ≥ 1} |H|^j |B_n^{−1}| ≤ |B_n^{−1}|.

As a consequence, |A_n^{−1}| ≤ 2|B_n^{−1}|.

Choose ε_1 = 1/2 and ε_2 = ε in Lemma <12>. The bound on max_{i ≤ n} |η_i| gives the bound on max_{i ≤ n} |w_i| needed by the Lemma:

    n|w_i|² = η_i′D′(J_n/n)^{−1}Dη_i = η_i′A_n^{−1}η_i ≤ |A_n^{−1}| |η_i|² ≤ 2|B_n^{−1}| |η_i|².

Define K_j := J_n^{−1/2}κ_j, so that |κ_j′(ĝ − γ)|² ≤ 2(K_j′W_n)² + 2(K_j′r_n)². By Cauchy–Schwarz,

    Σ_j (K_j′r_n)² ≤ Σ_j |K_j|² |r_n|² = U_κ |r_n|²

where

    U_κ := Σ_j κ_j′J_n^{−1}κ_j = Σ_j n^{−1} (D^{−1}κ_j)′ A_n^{−1} (D^{−1}κ_j) ≤ 2 n^{−1} |B_n^{−1}| Σ_j |D^{−1}κ_j|².

For the contribution V_κ := Σ_j (K_j′W_n)² the Cauchy–Schwarz bound is too crude. Instead, notice that QV_κ = U_κ, which ensures that the complement of the set

    Y_{κ,ε} := {|W_n| ≤ √(N_+/ε)} ∩ {V_κ ≤ U_κ/ε}

has Q-probability less than 2ε. On the set Y_{κ,ε},

    Σ_{0 ≤ j ≤ N} |κ_j′(ĝ − γ)|² ≤ 2V_κ + 2U_κ |r_n|² ≤ 3U_κ/ε.

The asserted bound follows.

<17> Example. Try to apply Theorem <6> to cover Theorem 2 of Knight and Fu (2000). Comment that more recent results, with p = p_n, are much closer to the way the lasso is used.
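The opening step of the proof is a general perturbation bound, worth isolating: if |B^{−1}| |A − B| ≤ 1/2 in spectral norm, then |A^{−1} − B^{−1}| ≤ |B^{−1}| and |A^{−1}| ≤ 2|B^{−1}|. A quick numerical sanity check (my own sketch, with arbitrary matrices standing in for A_n and B_n):

```python
import numpy as np

def spec(M):
    return np.linalg.norm(M, 2)                    # spectral (operator) norm

rng = np.random.default_rng(3)
d = 4
B = rng.normal(size=(d, d))
B = B @ B.T + 0.5 * np.eye(d)                      # positive definite stand-in for B_n
E = rng.normal(size=(d, d))
E = 0.4 * E / (spec(E) * spec(np.linalg.inv(B)))   # scaled so |B^-1| |A - B| <= 0.4 < 1/2
A = B + E

Binv, Ainv = np.linalg.inv(B), np.linalg.inv(A)
print(spec(Ainv - Binv) <= spec(Binv), spec(Ainv) <= 2 * spec(Binv))
```

The geometric-series argument in the proof guarantees both inequalities whenever |H| ≤ 1/2, which the scaling of E enforces here.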

Notes

Convexity: Huber (1964), Brown (1985), and Appendix A of Maritz (1981). Andersen and Gill (1982), Haberman (1989), Niemiro (1992), Bickel, Klaassen, Ritov, and Wellner (1993, pp 38, 473, 519), Ritov (1987), Jurečková (1977).

References

Andersen, P. K. and R. D. Gill (1982). Cox's regression model for counting processes: a large sample study. Annals of Statistics 10.

Bickel, P. J., C. A. J. Klaassen, Y. Ritov, and J. A. Wellner (1993). Efficient and Adaptive Estimation for Semiparametric Models. Baltimore: Johns Hopkins University Press.

Brown, B. M. (1985). Multiparameter linearization theorems. Journal of the Royal Statistical Society, Series B 47.

Dou, W., D. Pollard, and H. H. Zhou (2009). Functional regression for general exponential families. Technical report, Yale University Statistics Department.

Haberman, S. J. (1989). Concavity and estimation. Annals of Statistics 17.

Hjort, N. L. and D. Pollard (1993). Asymptotics for minimisers of convex processes. Technical report, Yale University.

Huber, P. J. (1964). Robust estimation of a location parameter. Annals of Mathematical Statistics 35.

Jurečková, J. (1977). Asymptotic relations of M-estimates and R-estimates in linear regression model. Annals of Statistics 5.

Knight, K. and W. Fu (2000). Asymptotics for lasso-type estimators. Annals of Statistics 28(5).

Maritz, J. S. (1981). Distribution-Free Statistical Methods. Chapman and Hall.

Niemiro, W. (1992). Asymptotics for M-estimators defined by convex minimization. Annals of Statistics 20.

Pollard, D. (1991). Asymptotics for least absolute deviation regression estimators. Econometric Theory 7.

Portnoy, S. (1988). Asymptotic behavior of likelihood methods for exponential families when the number of parameters tends to infinity. Annals of Statistics 16(1).

Ritov, Y. (1987). Tightness of monotone random fields. Journal of the Royal Statistical Society, Series B 49.

Rockafellar, R. T. (1970). Convex Analysis. Princeton, New Jersey: Princeton University Press.


MATH 722, COMPLEX ANALYSIS, SPRING 2009 PART 5 MATH 722, COMPLEX ANALYSIS, SPRING 2009 PART 5.. The Arzela-Ascoli Theorem.. The Riemann mapping theorem Let X be a metric space, and let F be a family of continuous complex-valued functions on X. We have

More information

. Then V l on K l, and so. e e 1.

. Then V l on K l, and so. e e 1. Sanov s Theorem Let E be a Polish space, and define L n : E n M E to be the empirical measure given by L n x = n n m= δ x m for x = x,..., x n E n. Given a µ M E, denote by µ n the distribution of L n

More information

The Skorokhod problem in a time-dependent interval

The Skorokhod problem in a time-dependent interval The Skorokhod problem in a time-dependent interval Krzysztof Burdzy, Weining Kang and Kavita Ramanan University of Washington and Carnegie Mellon University Abstract: We consider the Skorokhod problem

More information

1 Lyapunov theory of stability

1 Lyapunov theory of stability M.Kawski, APM 581 Diff Equns Intro to Lyapunov theory. November 15, 29 1 1 Lyapunov theory of stability Introduction. Lyapunov s second (or direct) method provides tools for studying (asymptotic) stability

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Chapter 7 Maximum Likelihood Estimation 7. Consistency If X is a random variable (or vector) with density or mass function f θ (x) that depends on a parameter θ, then the function f θ (X) viewed as a function

More information

Convex Geometry. Carsten Schütt

Convex Geometry. Carsten Schütt Convex Geometry Carsten Schütt November 25, 2006 2 Contents 0.1 Convex sets... 4 0.2 Separation.... 9 0.3 Extreme points..... 15 0.4 Blaschke selection principle... 18 0.5 Polytopes and polyhedra.... 23

More information

Ergodic Theorems. 1 Strong laws of large numbers

Ergodic Theorems. 1 Strong laws of large numbers Ergodic Theorems 1 Strong laws of large numbers.................. 1 2 Heuristics and proof for the Ergodic theorem......... 3 3 Stationary processes (Theorem ).............. 5 4 The strong law of large

More information

Model Selection and Geometry

Model Selection and Geometry Model Selection and Geometry Pascal Massart Université Paris-Sud, Orsay Leipzig, February Purpose of the talk! Concentration of measure plays a fundamental role in the theory of model selection! Model

More information

In particular, if A is a square matrix and λ is one of its eigenvalues, then we can find a non-zero column vector X with

In particular, if A is a square matrix and λ is one of its eigenvalues, then we can find a non-zero column vector X with Appendix: Matrix Estimates and the Perron-Frobenius Theorem. This Appendix will first present some well known estimates. For any m n matrix A = [a ij ] over the real or complex numbers, it will be convenient

More information

Convergence of generalized entropy minimizers in sequences of convex problems

Convergence of generalized entropy minimizers in sequences of convex problems Proceedings IEEE ISIT 206, Barcelona, Spain, 2609 263 Convergence of generalized entropy minimizers in sequences of convex problems Imre Csiszár A Rényi Institute of Mathematics Hungarian Academy of Sciences

More information

Sharp Sobolev Strichartz estimates for the free Schrödinger propagator

Sharp Sobolev Strichartz estimates for the free Schrödinger propagator Sharp Sobolev Strichartz estimates for the free Schrödinger propagator Neal Bez, Chris Jeavons and Nikolaos Pattakos Abstract. We consider gaussian extremisability of sharp linear Sobolev Strichartz estimates

More information

The expansion of random regular graphs

The expansion of random regular graphs The expansion of random regular graphs David Ellis Introduction Our aim is now to show that for any d 3, almost all d-regular graphs on {1, 2,..., n} have edge-expansion ratio at least c d d (if nd is

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 12: Glivenko-Cantelli and Donsker Results

Introduction to Empirical Processes and Semiparametric Inference Lecture 12: Glivenko-Cantelli and Donsker Results Introduction to Empirical Processes and Semiparametric Inference Lecture 12: Glivenko-Cantelli and Donsker Results Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics

More information

Problem Set. Problem Set #1. Math 5322, Fall March 4, 2002 ANSWERS

Problem Set. Problem Set #1. Math 5322, Fall March 4, 2002 ANSWERS Problem Set Problem Set #1 Math 5322, Fall 2001 March 4, 2002 ANSWRS i All of the problems are from Chapter 3 of the text. Problem 1. [Problem 2, page 88] If ν is a signed measure, is ν-null iff ν () 0.

More information

Doléans measures. Appendix C. C.1 Introduction

Doléans measures. Appendix C. C.1 Introduction Appendix C Doléans measures C.1 Introduction Once again all random processes will live on a fixed probability space (Ω, F, P equipped with a filtration {F t : 0 t 1}. We should probably assume the filtration

More information

Information in a Two-Stage Adaptive Optimal Design

Information in a Two-Stage Adaptive Optimal Design Information in a Two-Stage Adaptive Optimal Design Department of Statistics, University of Missouri Designed Experiments: Recent Advances in Methods and Applications DEMA 2011 Isaac Newton Institute for

More information

X n D X lim n F n (x) = F (x) for all x C F. lim n F n(u) = F (u) for all u C F. (2)

X n D X lim n F n (x) = F (x) for all x C F. lim n F n(u) = F (u) for all u C F. (2) 14:17 11/16/2 TOPIC. Convergence in distribution and related notions. This section studies the notion of the so-called convergence in distribution of real random variables. This is the kind of convergence

More information

Asymptotic normality of the L k -error of the Grenander estimator

Asymptotic normality of the L k -error of the Grenander estimator Asymptotic normality of the L k -error of the Grenander estimator Vladimir N. Kulikov Hendrik P. Lopuhaä 1.7.24 Abstract: We investigate the limit behavior of the L k -distance between a decreasing density

More information

The 123 Theorem and its extensions

The 123 Theorem and its extensions The 123 Theorem and its extensions Noga Alon and Raphael Yuster Department of Mathematics Raymond and Beverly Sackler Faculty of Exact Sciences Tel Aviv University, Tel Aviv, Israel Abstract It is shown

More information

Online Bounds for Bayesian Algorithms

Online Bounds for Bayesian Algorithms Online Bounds for Bayesian Algorithms Sham M. Kakade Computer and Information Science Department University of Pennsylvania Andrew Y. Ng Computer Science Department Stanford University Abstract We present

More information

Math 172 Problem Set 8 Solutions

Math 172 Problem Set 8 Solutions Math 72 Problem Set 8 Solutions Problem. (i We have (Fχ [ a,a] (ξ = χ [ a,a] e ixξ dx = a a e ixξ dx = iξ (e iax e iax = 2 sin aξ. ξ (ii We have (Fχ [, e ax (ξ = e ax e ixξ dx = e x(a+iξ dx = a + iξ where

More information

Generalized Neyman Pearson optimality of empirical likelihood for testing parameter hypotheses

Generalized Neyman Pearson optimality of empirical likelihood for testing parameter hypotheses Ann Inst Stat Math (2009) 61:773 787 DOI 10.1007/s10463-008-0172-6 Generalized Neyman Pearson optimality of empirical likelihood for testing parameter hypotheses Taisuke Otsu Received: 1 June 2007 / Revised:

More information

Functional Analysis HW #3

Functional Analysis HW #3 Functional Analysis HW #3 Sangchul Lee October 26, 2015 1 Solutions Exercise 2.1. Let D = { f C([0, 1]) : f C([0, 1])} and define f d = f + f. Show that D is a Banach algebra and that the Gelfand transform

More information

Calculus of Variations. Final Examination

Calculus of Variations. Final Examination Université Paris-Saclay M AMS and Optimization January 18th, 018 Calculus of Variations Final Examination Duration : 3h ; all kind of paper documents (notes, books...) are authorized. The total score of

More information

The Codimension of the Zeros of a Stable Process in Random Scenery

The Codimension of the Zeros of a Stable Process in Random Scenery The Codimension of the Zeros of a Stable Process in Random Scenery Davar Khoshnevisan The University of Utah, Department of Mathematics Salt Lake City, UT 84105 0090, U.S.A. davar@math.utah.edu http://www.math.utah.edu/~davar

More information

We denote the space of distributions on Ω by D ( Ω) 2.

We denote the space of distributions on Ω by D ( Ω) 2. Sep. 1 0, 008 Distributions Distributions are generalized functions. Some familiarity with the theory of distributions helps understanding of various function spaces which play important roles in the study

More information

1 Topology Definition of a topology Basis (Base) of a topology The subspace topology & the product topology on X Y 3

1 Topology Definition of a topology Basis (Base) of a topology The subspace topology & the product topology on X Y 3 Index Page 1 Topology 2 1.1 Definition of a topology 2 1.2 Basis (Base) of a topology 2 1.3 The subspace topology & the product topology on X Y 3 1.4 Basic topology concepts: limit points, closed sets,

More information

SUPPLEMENTARY MATERIAL FOR: ADAPTIVE ROBUST VARIABLE SELECTION

SUPPLEMENTARY MATERIAL FOR: ADAPTIVE ROBUST VARIABLE SELECTION SUPPLEMENTARY MATERIAL FOR: ADAPTIVE ROBUST VARIABLE SELECTION By Jianqing Fan, Yingying Fan and Emre Barut Princeton University, University of Southern California and IBM T.J. Watson Research Center APPENDIX

More information

Probability and Measure

Probability and Measure Probability and Measure Robert L. Wolpert Institute of Statistics and Decision Sciences Duke University, Durham, NC, USA Convergence of Random Variables 1. Convergence Concepts 1.1. Convergence of Real

More information

Notions such as convergent sequence and Cauchy sequence make sense for any metric space. Convergent Sequences are Cauchy

Notions such as convergent sequence and Cauchy sequence make sense for any metric space. Convergent Sequences are Cauchy Banach Spaces These notes provide an introduction to Banach spaces, which are complete normed vector spaces. For the purposes of these notes, all vector spaces are assumed to be over the real numbers.

More information

A User s Guide to Measure Theoretic Probability Errata and comments

A User s Guide to Measure Theoretic Probability Errata and comments A User s Guide to Measure Theoretic Probability Errata and comments Chapter 2. page 25, line -3: Upper limit on sum should be 2 4 n page 34, line -10: case of a probability measure page 35, line 20 23:

More information

Corrections. ir j r mod n 2 ). (i r j r ) mod n should be d. p. 83, l. 2 [p. 80, l. 5]: d. p. 92, l. 1 [p. 87, l. 2]: Λ Z d should be Λ Z d.

Corrections. ir j r mod n 2 ). (i r j r ) mod n should be d. p. 83, l. 2 [p. 80, l. 5]: d. p. 92, l. 1 [p. 87, l. 2]: Λ Z d should be Λ Z d. Corrections We list here typos and errors that have been found in the published version of the book. We welcome any additional corrections you may find! We indicate page and line numbers for the published

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

MATH 131A: REAL ANALYSIS (BIG IDEAS)

MATH 131A: REAL ANALYSIS (BIG IDEAS) MATH 131A: REAL ANALYSIS (BIG IDEAS) Theorem 1 (The Triangle Inequality). For all x, y R we have x + y x + y. Proposition 2 (The Archimedean property). For each x R there exists an n N such that n > x.

More information

Submitted to the Brazilian Journal of Probability and Statistics

Submitted to the Brazilian Journal of Probability and Statistics Submitted to the Brazilian Journal of Probability and Statistics Multivariate normal approximation of the maximum likelihood estimator via the delta method Andreas Anastasiou a and Robert E. Gaunt b a

More information

GRAPH PARTITIONING USING SINGLE COMMODITY FLOWS [KRV 06] 1. Preliminaries

GRAPH PARTITIONING USING SINGLE COMMODITY FLOWS [KRV 06] 1. Preliminaries GRAPH PARTITIONING USING SINGLE COMMODITY FLOWS [KRV 06] notes by Petar MAYMOUNKOV Theme The algorithmic problem of finding a sparsest cut is related to the combinatorial problem of building expander graphs

More information

Least Squares Based Self-Tuning Control Systems: Supplementary Notes

Least Squares Based Self-Tuning Control Systems: Supplementary Notes Least Squares Based Self-Tuning Control Systems: Supplementary Notes S. Garatti Dip. di Elettronica ed Informazione Politecnico di Milano, piazza L. da Vinci 32, 2133, Milan, Italy. Email: simone.garatti@polimi.it

More information

Global Maxwellians over All Space and Their Relation to Conserved Quantites of Classical Kinetic Equations

Global Maxwellians over All Space and Their Relation to Conserved Quantites of Classical Kinetic Equations Global Maxwellians over All Space and Their Relation to Conserved Quantites of Classical Kinetic Equations C. David Levermore Department of Mathematics and Institute for Physical Science and Technology

More information

Some Theories about Backfitting Algorithm for Varying Coefficient Partially Linear Model

Some Theories about Backfitting Algorithm for Varying Coefficient Partially Linear Model Some Theories about Backfitting Algorithm for Varying Coefficient Partially Linear Model 1. Introduction Varying-coefficient partially linear model (Zhang, Lee, and Song, 2002; Xia, Zhang, and Tong, 2004;

More information

The Heine-Borel and Arzela-Ascoli Theorems

The Heine-Borel and Arzela-Ascoli Theorems The Heine-Borel and Arzela-Ascoli Theorems David Jekel October 29, 2016 This paper explains two important results about compactness, the Heine- Borel theorem and the Arzela-Ascoli theorem. We prove them

More information

Problem Set 5: Solutions Math 201A: Fall 2016

Problem Set 5: Solutions Math 201A: Fall 2016 Problem Set 5: s Math 21A: Fall 216 Problem 1. Define f : [1, ) [1, ) by f(x) = x + 1/x. Show that f(x) f(y) < x y for all x, y [1, ) with x y, but f has no fixed point. Why doesn t this example contradict

More information

Introductory Analysis I Fall 2014 Homework #9 Due: Wednesday, November 19

Introductory Analysis I Fall 2014 Homework #9 Due: Wednesday, November 19 Introductory Analysis I Fall 204 Homework #9 Due: Wednesday, November 9 Here is an easy one, to serve as warmup Assume M is a compact metric space and N is a metric space Assume that f n : M N for each

More information

Concentration behavior of the penalized least squares estimator

Concentration behavior of the penalized least squares estimator Concentration behavior of the penalized least squares estimator Penalized least squares behavior arxiv:1511.08698v2 [math.st] 19 Oct 2016 Alan Muro and Sara van de Geer {muro,geer}@stat.math.ethz.ch Seminar

More information

ANALYSIS QUALIFYING EXAM FALL 2017: SOLUTIONS. 1 cos(nx) lim. n 2 x 2. g n (x) = 1 cos(nx) n 2 x 2. x 2.

ANALYSIS QUALIFYING EXAM FALL 2017: SOLUTIONS. 1 cos(nx) lim. n 2 x 2. g n (x) = 1 cos(nx) n 2 x 2. x 2. ANALYSIS QUALIFYING EXAM FALL 27: SOLUTIONS Problem. Determine, with justification, the it cos(nx) n 2 x 2 dx. Solution. For an integer n >, define g n : (, ) R by Also define g : (, ) R by g(x) = g n

More information

PROBLEMS. (b) (Polarization Identity) Show that in any inner product space

PROBLEMS. (b) (Polarization Identity) Show that in any inner product space 1 Professor Carl Cowen Math 54600 Fall 09 PROBLEMS 1. (Geometry in Inner Product Spaces) (a) (Parallelogram Law) Show that in any inner product space x + y 2 + x y 2 = 2( x 2 + y 2 ). (b) (Polarization

More information

Inference For High Dimensional M-estimates: Fixed Design Results

Inference For High Dimensional M-estimates: Fixed Design Results Inference For High Dimensional M-estimates: Fixed Design Results Lihua Lei, Peter Bickel and Noureddine El Karoui Department of Statistics, UC Berkeley Berkeley-Stanford Econometrics Jamboree, 2017 1/49

More information