Nonparametric estimation of s-concave and log-concave densities: alternatives to maximum likelihood
1 Nonparametric estimation of s-concave and log-concave densities: alternatives to maximum likelihood. Jon A. Wellner, University of Washington, Seattle. Cowles Foundation Seminar, Yale University, November 9, 2016.
2 Based on joint work with: Qiyang (Roy) Han, Charles Doss, Fadoua Balabdaoui, Kaspar Rufibach, Arseni Seregin.
3 Outline
A: Log-concave densities on R and R^d
B: s-concave densities on R and R^d
C: Maximum likelihood for log-concave and s-concave densities. 1: Basics. 2: On the model. 3: Off the model.
D: An alternative to ML: Rényi divergence estimators. 1: Basics. 2: On the model. 3: Off the model.
E: Summary: problems and open questions
4 A. Log-concave densities on R and R^d. If a density f on R^d is of the form
f(x) = f_ϕ(x) = exp(ϕ(x)),
where ϕ is concave (so −ϕ is convex), then f is log-concave. The class of all densities f on R^d of this form is called the class of log-concave densities, P_{log-concave} ≡ P_0.
Properties of log-concave densities:
- Every log-concave density f is unimodal (quasi-concave).
- P_0 is closed under convolution.
- P_0 is closed under marginalization.
- P_0 is closed under weak limits.
- A density f on R is log-concave if and only if its convolution with any unimodal density is again unimodal (Ibragimov, 1956).
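Log-concavity of a smooth density is easy to check numerically: ϕ = log f should have non-positive second differences on an equally spaced grid. A small sketch (my own illustration, not from the slides) contrasting the standard normal with the Cauchy density, which fails the check:

```python
import numpy as np

def is_log_concave(f_vals, tol=1e-9):
    """True if log f has non-positive second differences on an equally spaced grid."""
    return bool(np.all(np.diff(np.log(f_vals), n=2) <= tol))

x = np.linspace(-4.0, 4.0, 801)
normal = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # N(0,1): log-concave
cauchy = 1.0 / (np.pi * (1 + x**2))               # t_1: not log-concave

print(is_log_concave(normal), is_log_concave(cauchy))  # -> True False
```

The Cauchy check fails because (log f)'' = 2(x² − 1)/(1 + x²)² is positive for |x| > 1.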
5 Many parametric families are log-concave, for example:
- Normal(µ, σ²)
- Uniform(a, b)
- Gamma(r, λ) for r ≥ 1
- Beta(a, b) for a, b ≥ 1
The t_r densities with r > 0 are not log-concave. Tails of log-concave densities are necessarily sub-exponential. P_{log-concave} = the class of Pólya frequency functions of order 2, PF_2, in the terminology of Schoenberg (1951) and Karlin (1968). See Marshall and Olkin (1979), chapter 18, and Dharmadhikari and Joag-Dev (1988), page 150, for nice introductions.
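The Gamma threshold r ≥ 1 can be read off the log-density (r − 1) log x − λx, whose second derivative −(r − 1)/x² changes sign at r = 1. A quick grid check (my own illustration):

```python
import numpy as np

def gamma_is_log_concave(r, lam=1.0):
    """Check concavity of the Gamma(r, lam) log-density on a grid of (0, inf)."""
    x = np.linspace(0.05, 20.0, 2000)
    log_f = (r - 1) * np.log(x) - lam * x   # normalizing constant is irrelevant
    return bool(np.all(np.diff(log_f, n=2) <= 1e-9))

print(gamma_is_log_concave(2.0), gamma_is_log_concave(0.5))  # -> True False
```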
6 B. s-concave densities on R and R^d. If a density f on R^d is of the form
f(x) = f_ϕ(x) =
  (ϕ(x))^{1/s}, ϕ convex, if s < 0,
  exp(−ϕ(x)), ϕ convex, if s = 0,
  (ϕ(x))^{1/s}, ϕ concave, if s > 0,
then f is s-concave. The classes of all densities f on R^d of these forms are called the classes of s-concave densities, P_s. The following inclusions hold: if −∞ < s < 0 < r < ∞, then
P_r ⊂ P_0 ⊂ P_s ⊂ P_{−∞}.
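The placement of P_0 in this chain reflects the scalar limit (1 − s·h)^{1/s} → e^{−h} as s ↑ 0: an s-concave density built from the convex base function 1 − s·h (convex whenever h is, since s < 0) tends to the log-concave form exp(−h). A numerical illustration of my own, with h(x) = x²/2:

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 601)
h = x**2 / 2                        # convex; target is exp(-h), unnormalized N(0,1)

def f_s(s):
    """s-concave form with convex base function 1 - s*h (s < 0)."""
    return (1 - s * h) ** (1 / s)

err_coarse = np.max(np.abs(f_s(-0.1) - np.exp(-h)))
err_fine = np.max(np.abs(f_s(-0.01) - np.exp(-h)))
```

The sup-norm error shrinks as s ↑ 0 (roughly 0.03 at s = −0.1 versus 0.003 at s = −0.01 on this grid).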
7 Properties of s-concave densities:
- Every s-concave density f is quasi-concave.
- The Student t_ν density satisfies t_ν ∈ P_s for s ≤ −1/(1 + ν). Thus the Cauchy density (= t_1) is in P_{−1/2} ⊂ P_s for s ≤ −1/2.
- The classes P_s have interesting closure properties under convolution and marginalization which follow from the Borell-Brascamp-Lieb inequality: let 0 < λ < 1, −1/d ≤ s ≤ ∞, and let f, g, h : R^d → [0, ∞) be integrable functions such that
h(λx + (1 − λ)y) ≥ M_s(f(x), g(y); λ) for all x, y ∈ R^d,
where
M_s(a, b; λ) = (λa^s + (1 − λ)b^s)^{1/s}, M_0(a, b; λ) = a^λ b^{1−λ}.
Then
∫_{R^d} h(x) dx ≥ M_{s/(sd+1)}(∫_{R^d} f(x) dx, ∫_{R^d} g(x) dx; λ).
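Membership t_1 ∈ P_{−1/2} says that f_Cauchy^{−1/2} = √(π(1 + x²)) should be convex, and indeed √(1 + x²) is convex. A grid check (my own illustration):

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 2001)
cauchy = 1.0 / (np.pi * (1 + x**2))
g = cauchy ** (-0.5)          # should be convex if the Cauchy density is in P_{-1/2}
convex = bool(np.all(np.diff(g, n=2) >= -1e-9))
print(convex)  # -> True
```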
8 Special case s = 0: the Prékopa-Leindler inequality. Let 0 < λ < 1 and let f, g, h : R^d → [0, ∞). If
h(λx + (1 − λ)y) ≥ f(x)^λ g(y)^{1−λ} for all x, y ∈ R^d,
then
∫_{R^d} h(x) dx ≥ (∫_{R^d} f(x) dx)^λ (∫_{R^d} g(x) dx)^{1−λ}.
Proposition. (a) Marginalization preserves log-concavity. (b) Convolution preserves log-concavity.
Proof. (a) Suppose that h(x, y) is a log-concave density on R^m × R^n. Thus
h(λ(x, y) + (1 − λ)(x′, y′)) ≥ h(x, y)^λ h(x′, y′)^{1−λ}.
9 We want to show that f(y) ≡ ∫_{R^m} h(x, y) dx is log-concave. With h̃(x) ≡ h(x, λy + (1 − λ)y′), f̃(x) ≡ h(x, y), g̃(x) ≡ h(x, y′),
h̃(λx + (1 − λ)x′) = h(λx + (1 − λ)x′, λy + (1 − λ)y′) ≥ h(x, y)^λ h(x′, y′)^{1−λ} = f̃(x)^λ g̃(x′)^{1−λ}.
Thus Prékopa-Leindler yields
f(λy + (1 − λ)y′) = ∫_{R^m} h̃(x) dx ≥ (∫_{R^m} h(x, y) dx)^λ (∫_{R^m} h(x, y′) dx)^{1−λ} = f(y)^λ f(y′)^{1−λ}.
(b) If X ∼ f and Y ∼ g are independent with f, g log-concave, then (X, Y) ∼ f·g ≡ h is log-concave. Since log-concavity is preserved under affine transformations, (X − Y, X + Y) has a log-concave density. By (a), X + Y is log-concave.
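The proposition can be sanity-checked numerically (my own illustration): take the log-concave density on R² proportional to exp(−(x² + xy + y²)) and integrate out x; completing the square shows the marginal is proportional to exp(−3y²/4), again log-concave.

```python
import numpy as np

x = np.linspace(-8.0, 8.0, 1601)
y = np.linspace(-3.0, 3.0, 121)
X, Y = np.meshgrid(x, y)
h = np.exp(-(X**2 + X * Y + Y**2))       # log-concave on R^2 (pos.-def. quadratic)

# crude Riemann marginal: f(y) = ∫ h(x, y) dx
marginal = h.sum(axis=1) * (x[1] - x[0])
log_concave = bool(np.all(np.diff(np.log(marginal), n=2) <= 1e-8))
print(log_concave)  # -> True
```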
10 Moments and bounds:
- If f ∈ P_s and s > −1/(d + 1), then E_f|X| < ∞.
- If f ∈ P_s and s > −1/d, then ‖f‖_∞ < ∞.
- If f ∈ P_0, then for some a > 0 and b ∈ R, f(x) ≤ exp(−a|x| + b) for all x.
- If f ∈ P_s with −1/d < s < 0, then with r ≡ −1/s > d, for some a, b > 0, f(x) ≤ (b + a|x|)^{−r} for all x.
- If s < −1/d, then there exists f ∈ P_s with ‖f‖_∞ = ∞.
- If −1/d < s ≤ −1/(d + 1), then there exists f ∈ P_s with E_f|X| = ∞.
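The moment threshold matches the polynomial tail bound: with r = −1/s, the envelope (b + a|x|)^{−r} has a finite first moment on R exactly when r > 2, that is s > −1/2 = −1/(d + 1) for d = 1. Partial integrals illustrate the dichotomy (my own grid-based sketch):

```python
import numpy as np

def partial_first_moment(r, T):
    """Trapezoid-rule approximation of ∫_0^T x (1 + x)^(-r) dx."""
    x = np.linspace(0.0, T, int(200 * T) + 1)
    vals = x * (1 + x) ** (-r)
    dx = x[1] - x[0]
    return float((vals.sum() - 0.5 * (vals[0] + vals[-1])) * dx)

# r = 3 (s = -1/3): converges; r = 2 (s = -1/2): diverges logarithmically
m3_small, m3_big = partial_first_moment(3, 10), partial_first_moment(3, 1000)
m2_small, m2_big = partial_first_moment(2, 10), partial_first_moment(2, 1000)
```

Between T = 10 and T = 1000 the r = 3 integral barely moves (toward its limit 1/2), while the r = 2 integral keeps growing like log T.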
11 Figure: the t_r density
f(x) = (1/(√r B(r/2, 1/2))) (r/(r + x²))^{(1+r)/2},
with r = 0.05, so f ∈ P_{−1/(1+0.05)}.
12 Figure: t_r densities, r ∈ {0.1, 0.2, ..., 1.0}.
13 Figure: f(x) = (1/(2Γ(1/2))) |x|^{−1/2} exp(−|x|), with f ∈ P_{−2}.
14 C. Maximum likelihood: 0-concave and s-concave densities. MLE of f and ϕ: let C denote the class of all concave functions ϕ : R^d → [−∞, ∞). The estimator ϕ̂_n based on X_1, ..., X_n i.i.d. as f_0 is the maximizer over ϕ ∈ C of the adjusted criterion function
ℓ_n(ϕ) = ∫ log f_ϕ(x) dF_n(x) − ∫ f_ϕ(x) dx
= ∫ ϕ(x) dF_n(x) − ∫ e^{ϕ(x)} dx, for s = 0,
= (1/s) ∫ log(−ϕ(x))_+ dF_n(x) − ∫ (−ϕ(x))_+^{1/s} dx, for s < 0.
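For s = 0 the adjusted criterion can be evaluated directly. A small sketch of my own (restricting ϕ to normal log-densities rather than optimizing over the full concave class) shows the criterion prefers the true ϕ to a mis-scaled candidate:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=200)          # sample from f_0 = N(0,1)
grid = np.linspace(-10.0, 10.0, 4001)
dg = grid[1] - grid[0]

def ell_n(phi):
    """Adjusted criterion for s = 0: mean of phi(X) minus ∫ exp(phi(x)) dx."""
    return float(np.mean(phi(X)) - np.sum(np.exp(phi(grid))) * dg)

def normal_log_density(mu, sigma):
    return lambda t: -0.5 * ((t - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

good = ell_n(normal_log_density(0.0, 1.0))
bad = ell_n(normal_log_density(0.0, 3.0))
print(good > bad)  # -> True
```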
15 1. Basics
- The MLEs for P_0 exist and are unique when n ≥ d + 1.
- The MLEs for P_s exist for s ∈ (−1/d, 0) when n ≥ d (r/(r − d)), where r = −1/s. Thus n → ∞ as −1/s = r ↓ d.
- Uniqueness of the MLEs for P_s?
- The MLE ϕ̂_n is piecewise affine for −1/d < s ≤ 0.
- The MLE for P_s does not exist if s < −1/d. (Well known for s = −∞ and d = 1.)
16 2. On the model
- The MLEs are Hellinger and L_1 consistent.
- The log-concave MLEs f̂_{n,0} satisfy
∫ e^{a|x|} |f̂_{n,0}(x) − f_0(x)| dx →_{a.s.} 0
for a < a_0, where f_0(x) ≤ exp(−a_0|x| + b_0).
- The s-concave MLEs are computationally awkward; log is too aggressive a transform for an s-concave density. [Note that ML has difficulties even for location t families: multiple roots of the likelihood equations.]
- Pointwise distribution theory exists for f̂_{n,0} when d = 1; there is no pointwise distribution theory for f̂_{n,s} when d = 1, and none for f̂_{n,0} or f̂_{n,s} when d > 1.
- Global rates? H(f̂_{n,s}, f_0) = O_p(n^{−2/5}) for −1 < s ≤ 0, d = 1.
17 3. Off the model. Now suppose that Q is an arbitrary probability measure on R^d with density q and X_1, ..., X_n are i.i.d. q. The MLE f̂_n for P_0 satisfies
∫_{R^d} |f̂_n(x) − f*(x)| dx →_{a.s.} 0,
where, for the Kullback-Leibler divergence K(q, f) = ∫ q log(q/f) dλ,
f* = argmin_{f ∈ P_0(R^d)} K(q, f)
is the pseudo-true density in P_0(R^d) corresponding to q. In fact,
∫_{R^d} e^{a|x|} |f̂_n(x) − f*(x)| dx →_{a.s.} 0
for any a < a_0, where f*(x) ≤ exp(−a_0|x| + b_0).
18 The MLE f̂_n for P_s does not behave well off the model. Retracing the basic arguments of Cule and Samworth (2010) leads to negative conclusions. (How negative remains to be pinned down!) Conclusion: investigate alternative methods for estimation in the larger classes P_s with s < 0! This leads to the proposals of Koenker and Mizera (2010).
19 D. An alternative to ML: Rényi divergence estimators
0. Notation and definitions
- β = 1 + 1/s < 0; α is conjugate to β: α^{−1} + β^{−1} = 1.
- C(X) = all continuous functions on conv(X).
- C*(X) = all signed Radon measures on conv(X) = the dual space of C(X).
- G(X) = all closed convex (lower s.c.) functions on conv(X).
- G*(X) = {G ∈ C*(X) : ∫ g dG ≥ 0 for all g ∈ G(X)}, the polar (or dual) cone of G(X).
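For instance, s = −1/2 gives β = 1 + 1/s = −1 and conjugate α = 1/2. A two-line helper (my own illustration of the notation):

```python
def renyi_exponents(s):
    """beta = 1 + 1/s (< 0 for -1 < s < 0) and its conjugate alpha: 1/alpha + 1/beta = 1."""
    beta = 1.0 + 1.0 / s
    alpha = beta / (beta - 1.0)   # solve 1/alpha = 1 - 1/beta
    return alpha, beta

alpha, beta = renyi_exponents(-0.5)
print(alpha, beta)  # -> 0.5 -1.0
```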
20 Primal problems, P_0 and P_s:
P_0: min_{g ∈ G(X)} L_0(g, P_n), where L_0(g, P_n) = P_n g + ∫_{R^d} exp(−g(x)) dx.
P_s: min_{g ∈ G(X)} L_s(g, P_n), where L_s(g, P_n) = P_n g + (1/|β|) ∫_{R^d} g(x)^β dx.
21 Dual problems, D_0 and D_s:
D_0: max_f {−∫ f(y) log f(y) dy} subject to f = d(P_n − G)/dy for some G ∈ G*(X).
D_s: max_f {(1/α) ∫ f(y)^α dy} subject to f = d(P_n − G)/dy for some G ∈ G*(X).
22 Why do these make sense? Population version of P_0:
min_{g ∈ G} L_0(g, f_0), where L_0(g, f_0) = ∫ {g(x) f_0(x) + e^{−g(x)}} dx.
Minimizing the integrand pointwise in g = g(x) for fixed f_0(x) yields f_0(x) − e^{−g(x)} = 0, i.e. e^{−g(x)} = f_0(x).
Population version of P_s:
min_{g ∈ G} L_s(g, f_0), where L_s(g, f_0) = ∫ {g(x) f_0(x) + (1/|β|) g^β(x)} dx.
Minimizing the integrand pointwise in g = g(x) for fixed f_0(x) yields f_0(x) + (β/|β|) g^{β−1}(x) = f_0(x) − g^{β−1}(x) = 0, and hence, since β − 1 = 1/s,
g^{1/s}(x) = f_0(x).
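These pointwise minimizations are easy to confirm on a grid. My own check at a single point, with f_0(x) = 0.3 and s = −1/2 (so β = −1):

```python
import numpy as np

f0 = 0.3
t = np.linspace(0.01, 10.0, 200001)

# s = 0 case: minimize t*f0 + exp(-t); the minimizer should be -log f0
t_star0 = t[np.argmin(t * f0 + np.exp(-t))]

# s = -1/2 case (beta = -1): minimize t*f0 + t**beta / |beta|;
# the minimizer should satisfy t_star**(1/s) = f0
s, beta = -0.5, -1.0
t_star_s = t[np.argmin(t * f0 + t**beta / abs(beta))]

print(round(t_star0, 3), round(t_star_s ** (1 / s), 3))  # -> 1.204 0.3
```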
23 1. Basics for the Rényi divergence estimators
- (Koenker and Mizera, 2010) If conv(X) has non-empty interior, then strong duality between P_s and D_s holds. The dual optimal solution exists, is unique, and f̂_n = ĝ_n^{1/s}.
- (Koenker and Mizera, 2010) The solution f̄ = ḡ^{1/s} of the population version of the problem, when Q = P_0 has density p_0 ∈ P_s, is Fisher-consistent; i.e. f̄ = p_0.
24 2. Off the model: Han & W (2015). Let
Q_1 ≡ {Q on (R^d, B^d) : ∫ |x| dQ(x) < ∞}, Q_0 ≡ {Q on (R^d, B^d) : int(csupp(Q)) ≠ ∅}.
Theorem (Han & W, 2015). If −1/(d + 1) < s < 0 and Q ∈ Q_0 ∩ Q_1, then the primal problem P_s(Q) has a unique solution g ∈ G which satisfies f = g^{1/s}, where g is bounded away from 0 and f is a bounded density.
Theorem (Han & W, 2015). Let d = 1. If f̂_{n,s} denotes the solution to the primal problem P_s and f̂_{n,0} the solution to the primal problem P_0, then for any κ > 0 and p ≥ 1,
∫ (1 + |x|)^κ |f̂_{n,s}(x) − f̂_{n,0}(x)|^p dx → 0 as s ↑ 0.
25 Theorem (Han & W, 2015). Suppose that: (i) d ≥ 1, (ii) −1/(d + 1) < s < 0, and (iii) Q ∈ Q_0 ∩ Q_1. If f̂_{Q,s} denotes the (pseudo-true) solution to the primal problem P_s(Q), then for any κ < r − d = (−1/s) − d,
∫ (1 + |x|)^κ |f̂_{n,s}(x) − f̂_{Q,s}(x)| dx →_{a.s.} 0 as n → ∞.
26 3. On the model: Q has density f ∈ P_s; f = g^{1/s} for some g convex.
Consistency: suppose that (i) d ≥ 1 and −1/d < s < 0, and (ii) s′ satisfies s′ > s if s ≤ −1/(d + 1), and s′ = s if s > −1/(d + 1). Then for any κ < r′ − d = (−1/s′) − d,
∫ (1 + |x|)^κ |f̂_{n,s′}(x) − f(x)| dx →_{a.s.} 0 as n → ∞.
Thus H(f̂_{n,s′}, f) →_{a.s.} 0 as well.
Pointwise limit theory: paralleling the results of Balabdaoui, Rufibach, and W (2009) for s = 0.
27 Assumptions:
(A1) g_0 ∈ G and f_0 = g_0^{1/s} ∈ P_s(R) with −1/2 < s < 0.
(A2) f_0(x_0) > 0.
(A3) g_0 is locally C² in a neighborhood of x_0, with g_0^{(2)}(x_0) > 0.
Theorem 1 (pointwise limit theorem; Han & W (2016)). Under assumptions (A1)-(A3), with r = −1/s we have
n^{2/5} (ĝ_n(x_0) − g_0(x_0)) →_d (g_0^4(x_0) g_0^{(2)}(x_0) / (r^4 f_0(x_0)^2 (4)!))^{1/5} H_2^{(2)}(0),
n^{1/5} (ĝ_n′(x_0) − g_0′(x_0)) →_d (g_0^2(x_0) [g_0^{(2)}(x_0)]^3 / (r^2 f_0(x_0) [(4)!]^3))^{1/5} H_2^{(3)}(0),
and ...
28 ... furthermore
n^{2/5} (f̂_n(x_0) − f_0(x_0)) →_d (r f_0(x_0)^3 g_0^{(2)}(x_0) / (g_0(x_0) (4)!))^{1/5} H_2^{(2)}(0),
n^{1/5} (f̂_n′(x_0) − f_0′(x_0)) →_d (r^3 f_0(x_0)^4 [g_0^{(2)}(x_0)]^3 / (g_0^3(x_0) [(4)!]^3))^{1/5} H_2^{(3)}(0),
where H_2 is the unique lower envelope of the process Y_2 satisfying:
1. H_2(t) ≤ Y_2(t) for all t ∈ R;
2. H_2^{(2)} is concave;
3. H_2(t) = Y_2(t) if the slope of H_2^{(2)} decreases strictly at t;
4. Y_2(t) = ∫_0^t W(s) ds − t^4, t ∈ R, where W is two-sided Brownian motion started at 0.
29 Estimation of the mode for d = 1.
Theorem 2 (estimation of the mode). Assume (A1)-(A4) hold. Then
n^{1/5} (m̂_n − m_0) →_d (g_0(m_0)^2 [(4)!]^2 / (r^2 f_0(m_0) g_0^{(2)}(m_0)^2))^{1/5} M(H_2^{(2)}), (1)
where m̂_n = M(f̂_n), m_0 = M(f_0).
What is the price of assuming s < 0 when the truth f ∈ P_0? Assume −1/2 < s < 0 and k = 2. Let f_0 = exp(ϕ_0) be a log-concave density, where ϕ_0 : R → R is the underlying concave function. Then f_0 is also s-concave.
30 Let g_s := f_0^{−1/r} = exp(−ϕ_0/r) be the underlying convex function when f_0 is viewed as an s-concave density. Calculation yields
g_s^{(2)}(x_0) = (1/r^2) g_s(x_0) (ϕ_0′(x_0)^2 − r ϕ_0″(x_0)).
Hence the constant before H_2^{(2)}(0) appearing in the limit distribution for f̂_n becomes
(f_0(x_0)^3 ϕ_0′(x_0)^2 / (4! r) + f_0(x_0)^3 (−ϕ_0″(x_0)) / 4!)^{1/5}.
The second term is the constant involved in the limiting distribution when f_0(x_0) is estimated via the log-concave MLE: see (2.2), page 1305, in Balabdaoui, Rufibach, & W (2009). The ratio of the two constants (or asymptotic relative efficiency) is shown for f_0 standard normal (blue) and logistic (magenta) in the figure.
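For f_0 = N(0, 1), so ϕ_0′(x) = −x and ϕ_0″(x) = −1, the ratio of the two constants simplifies to (1 + x²/r)^{1/5}: it equals 1 at the mode x = 0 and grows away from it. A small sketch (my own simplification of the constants; treat it as an assumption):

```python
def efficiency_ratio(x, s=-0.5):
    """Ratio of the Rényi-estimator to the log-concave-MLE constant, f_0 = N(0,1)."""
    r = -1.0 / s
    return (1.0 + x**2 / r) ** 0.2

print(efficiency_ratio(0.0), efficiency_ratio(2.0) > 1.0)  # -> 1.0 True
```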
31 Figure: efficiency ratio, f_0 = N(0, 1) (blue); f_0 = logistic (magenta).
32 The first term is non-negative and is the price we pay for estimating a true log-concave density via the Rényi divergence estimator over the larger class of s-concave densities. Note that the first term vanishes as r → ∞ (i.e. s ↑ 0), and that the ratio is 1 at the mode of f_0. For estimation of the mode, the ratio of constants is always 1: nothing is lost by enlarging the class from s = 0 to s < 0!
33 E. Summary: problems and open questions
- Global rates of convergence?
- Limiting distribution(s) for d > 1? (n^{−r} with r = 2/(4 + d)?)
- MLE (rate-)inefficient for d ≥ 4 (or perhaps d ≥ 3)? How to penalize to get efficient rates?
- Can we go below s = −1/(d + 1) with other methods?
- Multivariate classes with nice preservation/closure properties, smoother than log-concave?
- Algorithms for computing f̂_n ∈ P_s? (Non-smooth and convex; or non-smooth and non-convex?)
- Related results for convex regression on R^d: Seijo and Sen, Ann. Statist. (2011).
34 Guvenen et al. (2014) have estimated models of income dynamics using very large (10 percent) samples of U.S. Social Security records linked to W2 data. The density is not log-concave, but an s-concave density with s = −1/2 fits well. [Figure: f(x) and log f(x) against x = annual increments of log income. Courtesy Roger Koenker.]
35 F. Selected references
- Dümbgen and Rufibach (2009).
- Cule, Samworth, and Stewart (2010).
- Cule and Samworth (2010).
- Dümbgen, Samworth, and Schuhmacher (2011).
- Balabdaoui, Rufibach, and W (2009).
- Seregin & W (2010), Ann. Statist.
- Koenker and Mizera (2010), Ann. Statist.
- Han & W (2016), Ann. Statist., & arXiv: v3.
- Doss & W (2016), Ann. Statist., & arXiv: v2.
- Guntuboyina & Sen (2015), Prob. Theory & Rel. Fields.
- Chatterjee, Guntuboyina, & Sen (2015), Ann. Statist.
36 Many thanks!
More informationA Bahadur Representation of the Linear Support Vector Machine
A Bahadur Representation of the Linear Support Vector Machine Yoonkyung Lee Department of Statistics The Ohio State University October 7, 2008 Data Mining and Statistical Learning Study Group Outline Support
More informationBoundary Correction Methods in Kernel Density Estimation Tom Alberts C o u(r)a n (t) Institute joint work with R.J. Karunamuni University of Alberta
Boundary Correction Methods in Kernel Density Estimation Tom Alberts C o u(r)a n (t) Institute joint work with R.J. Karunamuni University of Alberta November 29, 2007 Outline Overview of Kernel Density
More informationLecture 4: Convex Functions, Part I February 1
IE 521: Convex Optimization Instructor: Niao He Lecture 4: Convex Functions, Part I February 1 Spring 2017, UIUC Scribe: Shuanglong Wang Courtesy warning: These notes do not necessarily cover everything
More informationStatistica Sinica Preprint No: SS R2
Statistica Sinica Preprint No: SS-2014-0088R2 Title Maximum likelihood estimation of a unimodal probability mass function Manuscript ID SS-2014-0088R2 URL http://www.stat.sinica.edu.tw/statistica/ DOI
More informationTransformations and Bayesian Density Estimation
Transformations and Bayesian Density Estimation Andrew Bean 1, Steve MacEachern, Xinyi Xu The Ohio State University 10 th Conference on Bayesian Nonparametrics June 25, 2015 1 bean.243@osu.edu Transformations
More informationLinear Programming. Larry Blume Cornell University, IHS Vienna and SFI. Summer 2016
Linear Programming Larry Blume Cornell University, IHS Vienna and SFI Summer 2016 These notes derive basic results in finite-dimensional linear programming using tools of convex analysis. Most sources
More informationFE610 Stochastic Calculus for Financial Engineers. Stevens Institute of Technology
FE610 Stochastic Calculus for Financial Engineers Lecture 3. Calculaus in Deterministic and Stochastic Environments Steve Yang Stevens Institute of Technology 01/31/2012 Outline 1 Modeling Random Behavior
More informationAround the Brunn-Minkowski inequality
Around the Brunn-Minkowski inequality Andrea Colesanti Technische Universität Berlin - Institut für Mathematik January 28, 2015 Summary Summary The Brunn-Minkowski inequality Summary The Brunn-Minkowski
More informationRules of Differentiation
Rules of Differentiation The process of finding the derivative of a function is called Differentiation. 1 In the previous chapter, the required derivative of a function is worked out by taking the limit
More informationSeptember Math Course: First Order Derivative
September Math Course: First Order Derivative Arina Nikandrova Functions Function y = f (x), where x is either be a scalar or a vector of several variables (x,..., x n ), can be thought of as a rule which
More informationApplications of Linear Programming
Applications of Linear Programming lecturer: András London University of Szeged Institute of Informatics Department of Computational Optimization Lecture 9 Non-linear programming In case of LP, the goal
More informationJournal of Statistical Software
JSS Journal of Statistical Software March 2011, Volume 39, Issue 6. http://www.jstatsoft.org/ logcondens: Computations Related to Univariate Log-Concave Density Estimation Lutz Dümbgen University of Bern
More informationHigh-dimensional distributions with convexity properties
High-dimensional distributions with convexity properties Bo az Klartag Tel-Aviv University A conference in honor of Charles Fefferman, Princeton, May 2009 High-Dimensional Distributions We are concerned
More informationLECTURE 15: COMPLETENESS AND CONVEXITY
LECTURE 15: COMPLETENESS AND CONVEXITY 1. The Hopf-Rinow Theorem Recall that a Riemannian manifold (M, g) is called geodesically complete if the maximal defining interval of any geodesic is R. On the other
More informationBayesian Regularization
Bayesian Regularization Aad van der Vaart Vrije Universiteit Amsterdam International Congress of Mathematicians Hyderabad, August 2010 Contents Introduction Abstract result Gaussian process priors Co-authors
More informationPart II Probability and Measure
Part II Probability and Measure Theorems Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed by the lecturers, and I have modified them (often significantly)
More informationDensity Estimation under Independent Similarly Distributed Sampling Assumptions
Density Estimation under Independent Similarly Distributed Sampling Assumptions Tony Jebara, Yingbo Song and Kapil Thadani Department of Computer Science Columbia University ew York, Y 7 { jebara,yingbo,kapil
More information9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures
FE661 - Statistical Methods for Financial Engineering 9. Model Selection Jitkomut Songsiri statistical models overview of model selection information criteria goodness-of-fit measures 9-1 Statistical models
More informationHigh dimensional Ising model selection
High dimensional Ising model selection Pradeep Ravikumar UT Austin (based on work with John Lafferty, Martin Wainwright) Sparse Ising model US Senate 109th Congress Banerjee et al, 2008 Estimate a sparse
More informationSurrogate loss functions, divergences and decentralized detection
Surrogate loss functions, divergences and decentralized detection XuanLong Nguyen Department of Electrical Engineering and Computer Science U.C. Berkeley Advisors: Michael Jordan & Martin Wainwright 1
More informationCourse 212: Academic Year Section 1: Metric Spaces
Course 212: Academic Year 1991-2 Section 1: Metric Spaces D. R. Wilkins Contents 1 Metric Spaces 3 1.1 Distance Functions and Metric Spaces............. 3 1.2 Convergence and Continuity in Metric Spaces.........
More informationi=1 h n (ˆθ n ) = 0. (2)
Stat 8112 Lecture Notes Unbiased Estimating Equations Charles J. Geyer April 29, 2012 1 Introduction In this handout we generalize the notion of maximum likelihood estimation to solution of unbiased estimating
More informationPosterior Regularization
Posterior Regularization 1 Introduction One of the key challenges in probabilistic structured learning, is the intractability of the posterior distribution, for fast inference. There are numerous methods
More informationf(x θ)dx with respect to θ. Assuming certain smoothness conditions concern differentiating under the integral the integral sign, we first obtain
0.1. INTRODUCTION 1 0.1 Introduction R. A. Fisher, a pioneer in the development of mathematical statistics, introduced a measure of the amount of information contained in an observaton from f(x θ). Fisher
More informationA glimpse into convex geometry. A glimpse into convex geometry
A glimpse into convex geometry 5 \ þ ÏŒÆ Two basis reference: 1. Keith Ball, An elementary introduction to modern convex geometry 2. Chuanming Zong, What is known about unit cubes Convex geometry lies
More informationMATH MEASURE THEORY AND FOURIER ANALYSIS. Contents
MATH 3969 - MEASURE THEORY AND FOURIER ANALYSIS ANDREW TULLOCH Contents 1. Measure Theory 2 1.1. Properties of Measures 3 1.2. Constructing σ-algebras and measures 3 1.3. Properties of the Lebesgue measure
More informationMath 341: Convex Geometry. Xi Chen
Math 341: Convex Geometry Xi Chen 479 Central Academic Building, University of Alberta, Edmonton, Alberta T6G 2G1, CANADA E-mail address: xichen@math.ualberta.ca CHAPTER 1 Basics 1. Euclidean Geometry
More informationConvex Analysis and Economic Theory AY Elementary properties of convex functions
Division of the Humanities and Social Sciences Ec 181 KC Border Convex Analysis and Economic Theory AY 2018 2019 Topic 6: Convex functions I 6.1 Elementary properties of convex functions We may occasionally
More informationConstrained Optimization and Lagrangian Duality
CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may
More informationChapter 4: Asymptotic Properties of the MLE
Chapter 4: Asymptotic Properties of the MLE Daniel O. Scharfstein 09/19/13 1 / 1 Maximum Likelihood Maximum likelihood is the most powerful tool for estimation. In this part of the course, we will consider
More informationStat 5101 Lecture Notes
Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random
More informationMAT 578 FUNCTIONAL ANALYSIS EXERCISES
MAT 578 FUNCTIONAL ANALYSIS EXERCISES JOHN QUIGG Exercise 1. Prove that if A is bounded in a topological vector space, then for every neighborhood V of 0 there exists c > 0 such that tv A for all t > c.
More informationLecture Quantitative Finance Spring Term 2015
on bivariate Lecture Quantitative Finance Spring Term 2015 Prof. Dr. Erich Walter Farkas Lecture 07: April 2, 2015 1 / 54 Outline on bivariate 1 2 bivariate 3 Distribution 4 5 6 7 8 Comments and conclusions
More information