A tailor-made nonparametric density estimate
A tailor-made nonparametric density estimate
Daniel Carando (1), Ricardo Fraiman (2) and Pablo Groisman (1)
(1) Universidad de Buenos Aires, (2) Universidad de San Andrés
School and Workshop on Probability Theory and Applications, ICTP/CNPq, January 2007, Campos do Jordão, Brasil
The density estimation problem

X is a random variable on R^d with density f. The density f is unknown. We have an i.i.d. sample X_1, ..., X_n drawn from f.

We look for an estimate of f based on the sample, that is, a mapping f_n : R^d × (R^d)^n → R.

The density f is assumed to belong to a certain class F.
Maximum likelihood estimates

For every density function g, E(log g(X)) ≤ E(log f(X)).

The Maximum Likelihood Estimate (MLE) is defined as the maximizer over the class F of the empirical mean (the log-likelihood function)

L_n(g) := (1/n) Σ_{i=1}^n log g(X_i) = log ( Π_{i=1}^n g(X_i) )^{1/n}.

This problem is not always well posed (depending on the class F).
The parametric case

The class F can be parameterized (by a finite number of parameters).

Example: X ~ N(µ, σ²),

F = { f_{(µ,σ²)} : µ ∈ R, σ² > 0 },   f_{µ,σ²}(x) = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)}.

The maximizer f_n is the density of a Gaussian random variable with parameters

µ_n = (1/n) Σ_{i=1}^n X_i,   σ²_n = (1/n) Σ_{i=1}^n (X_i − µ_n)².
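As a quick sketch of the closed-form Gaussian MLE above (the four-point sample is made up for illustration):

```python
# Gaussian MLE sketch on a small illustrative sample (values are made up).
data = [1.0, 2.0, 3.0, 4.0]
n = len(data)

mu_n = sum(data) / n                                # MLE of the mean
sigma2_n = sum((x - mu_n) ** 2 for x in data) / n   # MLE of the variance (1/n, not 1/(n-1))
```

Note the 1/n factor in the variance: the MLE is biased in finite samples, unlike the usual 1/(n−1) sample variance.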
The parametric case

Under some (fairly weak) conditions, in the parametric case the MLE is known to be:
1. strongly consistent, i.e. f_n → f a.s.;
2. an asymptotically minimum-variance unbiased estimator;
3. asymptotically Gaussian.
The MLE makes use of the knowledge we have about f, since it depends strongly on the class F in which we look for the maximizer.
The nonparametric case

The class F is infinite-dimensional.

Example: F = { g ∈ L¹(R^d) : g ≥ 0, ‖g‖_{L¹} = 1 }, the class of all densities.

Here the MLE method fails, since L_n(g) is unbounded: approximations of the identity belong to F.
Alternatives: kernel density estimates (Parzen and Rosenblatt, 1956)

f_n(x) = (1/(n h^d)) Σ_{i=1}^n K( (x − X_i)/h ),   K ≥ 0, ∫ K = 1, h > 0.

Very popular. Very flexible. The kernel K and the bandwidth h must be chosen.
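A minimal sketch of the kernel estimate in d = 1, taking K to be the Gaussian kernel (the function name is mine, not from the talk):

```python
import math

def kernel_estimate(x, sample, h):
    """Kernel density estimate f_n(x) with a Gaussian kernel and bandwidth h (d = 1)."""
    gauss = lambda u: math.exp(-u * u / 2.0) / math.sqrt(2.0 * math.pi)
    # f_n(x) = (1 / (n h)) * sum_i K((x - X_i) / h)
    return sum(gauss((x - xi) / h) for xi in sample) / (len(sample) * h)
```

With a single sample point at 0 and h = 1, the estimate at 0 is just the kernel's peak value 1/√(2π).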
Alternatives: the kernel density estimate for different choices of K and h (figure).
Alternatives: the kernel density estimate is a universal estimate: it works for all densities f. It does not make use of any further knowledge about f.
Alternatives: maximum penalized likelihood estimates, Good and Gaskins (1971)

Idea: penalize the lack of smoothness. Instead of looking for a maximizer of L_n(g), we look for a maximizer of

(1/n) Σ_{i=1}^n log g(X_i) − h ‖g′‖².
Alternatives: tailor-designed Maximum Likelihood Estimates

If we have some knowledge about f, then F is not the class of all densities and, maybe, we can apply MLE techniques.
Tailor-designed Maximum Likelihood Estimates: Grenander's estimate

Grenander (1956) considered F to be the class of decreasing densities on R⁺.

In this case it turns out that the MLE is well defined and is the derivative of the least concave majorant of the empirical distribution function.

It is consistent and minimax optimal in this class.

Robertson (1967), Wegman (1969, 1970), Sager (1982) and Polonik (1998) generalized Grenander's estimate to other kinds of shape restrictions.
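A minimal sketch of Grenander's estimator: build the least concave majorant of the empirical distribution function with an upper-hull scan and read off its slopes (the function name is mine, not from the talk):

```python
def grenander(sample):
    """Grenander estimator for a decreasing density on R+: the slopes of the
    least concave majorant of the empirical distribution function.
    Returns a list of ((left, right), density_value) pieces."""
    n = len(sample)
    # ECDF knots: (0, 0), (X_(1), 1/n), ..., (X_(n), 1)
    pts = [(0.0, 0.0)] + [(x, (i + 1) / n) for i, x in enumerate(sorted(sample))]
    hull = [pts[0]]
    for x3, y3 in pts[1:]:
        # pop while keeping the last hull point would make slopes non-decreasing
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (y2 - y1) * (x3 - x2) <= (y3 - y2) * (x2 - x1):
                hull.pop()
            else:
                break
        hull.append((x3, y3))
    # the estimator is piecewise constant: the slope of each majorant segment
    return [((hull[i][0], hull[i + 1][0]),
             (hull[i + 1][1] - hull[i][1]) / (hull[i + 1][0] - hull[i][0]))
            for i in range(len(hull) - 1)]
```

For an evenly spread sample the majorant is a single segment, so the estimate collapses to a constant; in general the slopes are decreasing, as the shape restriction requires.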
MLE for Lipschitz densities

We consider F to be the class of densities g with compact support S(g) that satisfy

|g(x) − g(y)| ≤ κ ‖x − y‖,   x, y ∈ S(g).

That is, F is the class of Lipschitz densities with prescribed Lipschitz constant κ. We allow g to be discontinuous at the boundary of its support.

The support of the density f can be unknown (in this case we require S(f) to be convex).
Theorem

(i) There exists a unique maximizer f_n of L_n(g) in F. Moreover, f_n is supported on C_n, the convex hull of {X_1, ..., X_n}, and its value there is given by the maximum of n cone functions, i.e.

f_n(x) = max_{1≤i≤n} ( f_n(X_i) − κ ‖x − X_i‖ )₊.   (1)

(ii) f_n is consistent in the following sense: for every compact set K ⊂ S(f),

lim ‖f_n − f‖_{L^∞(K)} = 0 a.s.

(iii) Hence lim ‖f_n − f‖_{L¹(R^d)} = 0 a.s.
The MLE in dimension d = 1 (figure).
The MLE in dimension d = 2 (figure).
Proof

(i) Existence: picture. Uniqueness: we are looking for the maximum of a concave function on a convex set.

(ii) This is a consequence of Huber's Theorem (1967). Idea: to use Huber's Theorem we need a sequence f̂_n of (almost) maximizers of L_n belonging to a (fixed) compact class. We construct them as follows:

f̂_n := A_n max_{1≤i≤n} ( f_n(X_i) − κ ‖x − X_i‖ )₊,   for all x ∈ S(f),

where the constant A_n is chosen to guarantee ∫ f̂_n = 1. Then f̂_n ∈ Lip(κ, S(f)), which is compact, and

‖f_n − f̂_n‖_{L^∞(K)} ≤ |A_n − 1| ‖f_n‖_{L^∞(K)} → 0,   (2)

since A_n → 1 and ( ‖f_n‖_{L^∞(K)} )_n is bounded a.s.
(iii) Since C_n ⊂ S(f), |S(f)| < ∞ and f_n(x) ≤ κ diam(S(f)) + 1/|C_n|, we can find a compact K ⊂ S(f) such that

∫_{R^d} |f_n(x) − f(x)| dx ≤ ∫_K |f_n(x) − f(x)| dx + ∫_{S(f)∖K} |f_n(x) − f(x)| dx ≤ ε.
Computing the estimator

We have proved that the estimator lives in a certain finite-dimensional space and that it is determined by its value at the sample points. For y ∈ R^n we define

g_y(x) = max_{1≤i≤n} ( y_i − κ ‖x − X_i‖ )₊,   x ∈ C_n.

Our problem reads: find

argmax_{y ∈ P} Π_{i=1}^n y_i,

P = { y ∈ R^n : y_i > 0, |y_i − y_j| ≤ κ ‖X_i − X_j‖ (i ≠ j), ∫ g_y = 1 }.

P is convex and Σ_i log y_i is concave. To have an efficient method to solve this problem we need to decide (efficiently) whether a point y ∈ P. Easy in d = 1; not so easy if d > 1.
Computing the estimator: dimension d = 1

Let X_(1) ≤ ... ≤ X_(n) be the order statistics. The Lipschitz condition reads

−κ (X_(i+1) − X_(i)) ≤ y_{i+1} − y_i ≤ κ (X_(i+1) − X_(i)),   i = 1, ..., n−1,

and

∫ g_y(x) dx = (1/4) Σ_{i=1}^{n−1} [ (y_{i+1} − y_i)²/κ + 2 (y_{i+1} + y_i)(X_(i+1) − X_(i)) − κ (X_(i+1) − X_(i))² ].
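A sketch of the d = 1 membership test: check the Lipschitz constraints on consecutive order statistics and evaluate the integral of g_y piece by piece, with the Lipschitz constant κ carried through (helper names are mine, not from the talk):

```python
def lipschitz_ok(xs, ys, kappa):
    """Check |y_{i+1} - y_i| <= kappa (x_{(i+1)} - x_{(i)}) for sorted xs."""
    return all(abs(b - a) <= kappa * (v - u) + 1e-12
               for (u, a), (v, b) in zip(zip(xs, ys), zip(xs[1:], ys[1:])))

def integral_gy(xs, ys, kappa):
    """Integral of g_y, the maximum of the n cone functions, for sorted xs.
    Each gap [x_i, x_{i+1}] contributes the area under the max of the
    two cones anchored at its endpoints."""
    total = 0.0
    for (u, a), (v, b) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        d = v - u
        total += ((b - a) ** 2 / kappa + 2.0 * (a + b) * d - kappa * d ** 2) / 4.0
    return total
```

For two points at 0 and 1 with y = (1, 1) and κ = 1, the max of the two cones is a "W"-shaped tent with area 3/4, which the formula reproduces.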
Computing the estimator. Dimension d=1 - Sample size: n= (figure)
Computing the estimator. Dimension d=1 - Sample size: n= (figure)
Computing the estimator: dimension d > 1

We cannot order the sample points, and we have no explicit formula for the integral ∫ g_y(x) dx.
Some problems... Too many peaks. A nonlinear optimization problem has to be solved.
An alternative ML-type estimator: dimension one — PLMLE

V = V(X_1, ..., X_n) = { g ∈ Lip(κ, [X_(1), X_(n)]) : g|_{[X_(i), X_(i+1)]} is linear, ∫ g = 1 }.

Definition. The PLMLE is the maximizer f_n of L_n over V(X_1, ..., X_n).

Existence and uniqueness of this estimator are guaranteed since V is a finite-dimensional, compact and convex subset of F.

It has lower likelihood than f_n but is asymptotically the same.
Computation of the PLMLE

maximize Π_{i=1}^n y_i subject to −a ≤ Ay ≤ a, By = 1,

where A is the (n−1) × n difference matrix with rows (0, ..., −1, 1, ..., 0), so that (Ay)_i = y_{i+1} − y_i,

a = κ (x_2 − x_1, ..., x_{i+1} − x_i, ..., x_n − x_{n−1})ᵗ,

B = (1/2)(x_2 − x_1, x_3 − x_1, ..., x_{i+1} − x_{i−1}, ..., x_n − x_{n−2}, x_n − x_{n−1}).

The constraint −a ≤ Ay ≤ a guarantees the Lipschitz condition and By = 1 represents the restriction ∫ f_n = 1.
PLMLE demonstration. PLMLE vs. Kernels. Sample size: n= (figure)
PLMLE demonstration. PLMLE vs. Kernels. Sample size: n= (figure)
Delaunay triangulations (figures)
Voronoi tessellations and Delaunay triangulations (figures)
Some facts on Delaunay triangulations

- For sample points of absolutely continuous probabilities, it is well defined with probability one.
- It maximizes the minimum angle of the triangles among all possible triangulations of the points.
- It is useful for the numerical treatment of Partial Differential Equations by the Finite Element Method.
- It is useful to compute the Euclidean Minimum Spanning Tree of a set of points (which is a subgraph of the Delaunay triangulation).
- It is widely used in Computational Geometry.
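The property underlying these facts is the empty-circumcircle criterion: a triangulation is Delaunay precisely when no sample point lies strictly inside the circumcircle of any triangle. A sketch of the standard in-circle determinant test in the plane (the function name is mine):

```python
def in_circumcircle(a, b, c, d):
    """True iff point d lies strictly inside the circumcircle of triangle
    (a, b, c), which must be given in counter-clockwise order (2D points)."""
    rows = [(p[0] - d[0], p[1] - d[1],
             (p[0] - d[0]) ** 2 + (p[1] - d[1]) ** 2) for p in (a, b, c)]
    (a1, a2, a3), (b1, b2, b3), (c1, c2, c3) = rows
    # 3x3 determinant; positive sign means d is inside the circumcircle
    det = (a1 * (b2 * c3 - b3 * c2)
           - a2 * (b1 * c3 - b3 * c1)
           + a3 * (b1 * c2 - b2 * c1))
    return det > 0
```

For the right triangle (0,0), (1,0), (0,1), the circumcircle is centered at (0.5, 0.5): its own center passes the test, while a far-away point fails it.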
T = {τ_1, ..., τ_N}, the Delaunay tessellation. C_n = ∪_{i=1}^N τ_i.

For any i ≠ j, τ_i ∩ τ_j is either a point, a (d−1)-dimensional face, or the empty set. We consider now the class of piecewise linear functions on T:

V = V(X_1, ..., X_n) = { g ∈ Lip(κ, C_n) : g|_{τ_i} is linear, ∫ g = 1 }.

Definition. The PLMLE f_n is the argument that maximizes L_n over V(X_1, ..., X_n).
Theorem. For every compact set K ⊂ S(f) we have ‖f_n − f‖_{L^∞(K)} → 0 a.s.
Computing the estimator

V is a compact subset of the (finite-dimensional) vector space

Ṽ = { g : C_n → R : g|_{τ_i} is linear }.

We need a (good) basis for Ṽ. We borrow from FEM:

g ∈ Ṽ ⟹ g(x) = Σ_i g(X_i) ϕ_i(x),   ϕ_i(X_j) = δ_ij.
Computing the estimator

∫_{R^d} g(x) dx = ∫_{R^d} ( Σ_{i=1}^n g(X_i) ϕ_i(x) ) dx = By,

B = ( ∫ϕ_1, ..., ∫ϕ_n ),   y = (g(X_1), ..., g(X_n)).

We compute B just once!
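For linear hat functions on a triangulation, ∫_τ ϕ_i = area(τ)/3 for each vertex i of τ (and 0 for the others), so B can be assembled in one sweep over the triangles. A sketch for d = 2 (names are mine, not from the talk):

```python
def assemble_B(points, triangles):
    """B_i = integral of the hat function phi_i over C_n
    = sum over triangles incident to vertex i of area/3 (linear elements)."""
    B = [0.0] * len(points)
    for (i, j, k) in triangles:
        (x1, y1), (x2, y2), (x3, y3) = points[i], points[j], points[k]
        # triangle area via the cross product of two edge vectors
        area = abs((x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1)) / 2.0
        for v in (i, j, k):
            B[v] += area / 3.0
    return B
```

Since Σ_i ϕ_i ≡ 1 on C_n, the entries of B must sum to the total area of C_n, which gives a quick sanity check on the assembly.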
Computing the estimator

We also have

∇g|_{τ_k} = Σ_i y_i ∇ϕ_i|_{τ_k} = A_k y,   A_k = ( (∇ϕ_1|_{τ_k})ᵗ, ..., (∇ϕ_n|_{τ_k})ᵗ ).
Computing the estimator

The optimization problem reads:

maximize Π_{i=1}^n y_i subject to ‖A_k y‖ ≤ κ, 1 ≤ k ≤ N, By = 1.

If the norm is ‖·‖_∞, all the restrictions are linear.

The size of A_k grows linearly with d.
The PLMLE for a cone density. Sample size: n=250 (figure)
The PLMLE for a Uniform density. Sample size: n=200 (figure)
The PLMLE for a bivariate sum of uniform variables. Sample size: n= (figure)
COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION SEAN GERRISH AND CHONG WANG 1. WAYS OF ORGANIZING MODELS In probabilistic modeling, there are several ways of organizing models:
More information1 General problem. 2 Terminalogy. Estimation. Estimate θ. (Pick a plausible distribution from family. ) Or estimate τ = τ(θ).
Estimation February 3, 206 Debdeep Pati General problem Model: {P θ : θ Θ}. Observe X P θ, θ Θ unknown. Estimate θ. (Pick a plausible distribution from family. ) Or estimate τ = τ(θ). Examples: θ = (µ,
More informationLeast squares under convex constraint
Stanford University Questions Let Z be an n-dimensional standard Gaussian random vector. Let µ be a point in R n and let Y = Z + µ. We are interested in estimating µ from the data vector Y, under the assumption
More informationParametric Techniques
Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure
More informationMaximum likelihood estimation
Maximum likelihood estimation Guillaume Obozinski Ecole des Ponts - ParisTech Master MVA Maximum likelihood estimation 1/26 Outline 1 Statistical concepts 2 A short review of convex analysis and optimization
More informationLocal Whittle Likelihood Estimators and Tests for non-gaussian Linear Processes
Local Whittle Likelihood Estimators and Tests for non-gaussian Linear Processes By Tomohito NAITO, Kohei ASAI and Masanobu TANIGUCHI Department of Mathematical Sciences, School of Science and Engineering,
More informationStatistics & Data Sciences: First Year Prelim Exam May 2018
Statistics & Data Sciences: First Year Prelim Exam May 2018 Instructions: 1. Do not turn this page until instructed to do so. 2. Start each new question on a new sheet of paper. 3. This is a closed book
More informationChapter 2. Binary and M-ary Hypothesis Testing 2.1 Introduction (Levy 2.1)
Chapter 2. Binary and M-ary Hypothesis Testing 2.1 Introduction (Levy 2.1) Detection problems can usually be casted as binary or M-ary hypothesis testing problems. Applications: This chapter: Simple hypothesis
More informationNonparametric estimation under shape constraints, part 2
Nonparametric estimation under shape constraints, part 2 Piet Groeneboom, Delft University August 7, 2013 What to expect? Theory and open problems for interval censoring, case 2. Same for the bivariate
More informationEconomics 204 Fall 2011 Problem Set 1 Suggested Solutions
Economics 204 Fall 2011 Problem Set 1 Suggested Solutions 1. Suppose k is a positive integer. Use induction to prove the following two statements. (a) For all n N 0, the inequality (k 2 + n)! k 2n holds.
More informationSection 8: Asymptotic Properties of the MLE
2 Section 8: Asymptotic Properties of the MLE In this part of the course, we will consider the asymptotic properties of the maximum likelihood estimator. In particular, we will study issues of consistency,
More informationTesting Statistical Hypotheses
E.L. Lehmann Joseph P. Romano Testing Statistical Hypotheses Third Edition 4y Springer Preface vii I Small-Sample Theory 1 1 The General Decision Problem 3 1.1 Statistical Inference and Statistical Decisions
More informationImplications of the Constant Rank Constraint Qualification
Mathematical Programming manuscript No. (will be inserted by the editor) Implications of the Constant Rank Constraint Qualification Shu Lu Received: date / Accepted: date Abstract This paper investigates
More informationSparse Nonparametric Density Estimation in High Dimensions Using the Rodeo
Sparse Nonparametric Density Estimation in High Dimensions Using the Rodeo Han Liu John Lafferty Larry Wasserman Statistics Department Computer Science Department Machine Learning Department Carnegie Mellon
More informationMathematics 530. Practice Problems. n + 1 }
Department of Mathematical Sciences University of Delaware Prof. T. Angell October 19, 2015 Mathematics 530 Practice Problems 1. Recall that an indifference relation on a partially ordered set is defined
More informationLecture 2. We now introduce some fundamental tools in martingale theory, which are useful in controlling the fluctuation of martingales.
Lecture 2 1 Martingales We now introduce some fundamental tools in martingale theory, which are useful in controlling the fluctuation of martingales. 1.1 Doob s inequality We have the following maximal
More informationChapter 7. Confidence Sets Lecture 30: Pivotal quantities and confidence sets
Chapter 7. Confidence Sets Lecture 30: Pivotal quantities and confidence sets Confidence sets X: a sample from a population P P. θ = θ(p): a functional from P to Θ R k for a fixed integer k. C(X): a confidence
More informationMachine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io
Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem
More informationOptimal Design in Regression. and Spline Smoothing
Optimal Design in Regression and Spline Smoothing by Jaerin Cho A thesis submitted to the Department of Mathematics and Statistics in conformity with the requirements for the degree of Doctor of Philosophy
More informationComparison of relative density of two random geometric digraph families in testing spatial clustering. Elvan Ceyhan
Comparison of relative density of two random geometric digraph families in testing spatial clustering Elvan Ceyhan TEST An Official Journal of the Spanish Society of Statistics and Operations Research
More information3.10 Lagrangian relaxation
3.10 Lagrangian relaxation Consider a generic ILP problem min {c t x : Ax b, Dx d, x Z n } with integer coefficients. Suppose Dx d are the complicating constraints. Often the linear relaxation and the
More informationKernel-based density. Nuno Vasconcelos ECE Department, UCSD
Kernel-based density estimation Nuno Vasconcelos ECE Department, UCSD Announcement last week of classes we will have Cheetah Day (exact day TBA) what: 4 teams of 6 people each team will write a report
More informationLiquidation in Limit Order Books. LOBs with Controlled Intensity
Limit Order Book Model Power-Law Intensity Law Exponential Decay Order Books Extensions Liquidation in Limit Order Books with Controlled Intensity Erhan and Mike Ludkovski University of Michigan and UCSB
More information41903: Introduction to Nonparametrics
41903: Notes 5 Introduction Nonparametrics fundamentally about fitting flexible models: want model that is flexible enough to accommodate important patterns but not so flexible it overspecializes to specific
More informationMLE and GMM. Li Zhao, SJTU. Spring, Li Zhao MLE and GMM 1 / 22
MLE and GMM Li Zhao, SJTU Spring, 2017 Li Zhao MLE and GMM 1 / 22 Outline 1 MLE 2 GMM 3 Binary Choice Models Li Zhao MLE and GMM 2 / 22 Maximum Likelihood Estimation - Introduction For a linear model y
More informationLikelihood, MLE & EM for Gaussian Mixture Clustering. Nick Duffield Texas A&M University
Likelihood, MLE & EM for Gaussian Mixture Clustering Nick Duffield Texas A&M University Probability vs. Likelihood Probability: predict unknown outcomes based on known parameters: P(x q) Likelihood: estimate
More information1 Glivenko-Cantelli type theorems
STA79 Lecture Spring Semester Glivenko-Cantelli type theorems Given i.i.d. observations X,..., X n with unknown distribution function F (t, consider the empirical (sample CDF ˆF n (t = I [Xi t]. n Then
More informationApplied/Numerical Analysis Qualifying Exam
Applied/Numerical Analysis Qualifying Exam August 9, 212 Cover Sheet Applied Analysis Part Policy on misprints: The qualifying exam committee tries to proofread exams as carefully as possible. Nevertheless,
More informationReproducing Kernel Hilbert Spaces
Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 12, 2007 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert
More informationVector spaces. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis.
Vector spaces DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_fall17/index.html Carlos Fernandez-Granda Vector space Consists of: A set V A scalar
More informationRegression, Ridge Regression, Lasso
Regression, Ridge Regression, Lasso Fabio G. Cozman - fgcozman@usp.br October 2, 2018 A general definition Regression studies the relationship between a response variable Y and covariates X 1,..., X n.
More informationTime Series and Forecasting Lecture 4 NonLinear Time Series
Time Series and Forecasting Lecture 4 NonLinear Time Series Bruce E. Hansen Summer School in Economics and Econometrics University of Crete July 23-27, 2012 Bruce Hansen (University of Wisconsin) Foundations
More information