Logistic regression geometry

Karim Anaya-Izquierdo
London School of Hygiene and Tropical Medicine, London WC1E 7HT, UK
e-mail: karim.anaya@lshtm.ac.uk

Frank Critchley
Department of Mathematics and Statistics, The Open University, Milton Keynes, MK7 6AA, UK
e-mail: f.critchley@open.ac.uk

Paul Marriott
Department of Statistics and Actuarial Science, University of Waterloo, 200 University Avenue West, Waterloo, Ontario, Canada N2L 3G1
e-mail: pmarriot@uwaterloo.ca

arXiv:1304.1720v1 [stat.ME] 5 Apr 2013

1. Introduction

The fact that the maximum likelihood estimate in a logistic regression model may not exist is a well-known phenomenon, and a number of recent papers have explored its underlying geometrical basis. [9], [12] and [7] point out that existence, and non-existence, of the estimate can be fully characterised by considering the closure of the model as an exponential family. In this formulation it becomes clear that the maximum is always well-defined, but can lie on the boundary rather than in the relative interior. Furthermore, the boundary can be considered as a polytope characterised by a finite number of extremal points.

This paper builds on this work and shows that the boundary affects more than the existence of the maximum likelihood estimate. In particular, even when the estimate exists, the geometry and boundary can strongly affect inference procedures. First and higher order asymptotic results cannot be uniformly applied. Indeed, near the boundary, effects such as high skewness, discreteness and collinearity dominate, any of which could render inference based on asymptotic normality suspect. The paper presents a simple diagnostic tool which allows the analyst to check whether the boundary is going to have an appreciable effect on standard inferential techniques. The tool, and the effect that the boundary can have, are illustrated in a well-known example and through simulated datasets.

Example 1.
The Fisher iris data set [8] is often used to illustrate classification and binary regression. Even in this familiar case we show that the boundary is close enough to have significant effects on inference. Let us focus on the problem
of distinguishing the species Iris setosa (coded with 1 in the figures) from Iris versicolor (coded with 0) based on the length of the flower's sepal. The left hand panel in Fig. 1 shows the logistic regression fit, while the right hand panel shows a contour plot of the log-likelihood for intercept parameter $\alpha$ and slope parameter $\beta$. The near singularity of the observed Fisher information is evident. While this does have a geometric interpretation, this particular collinearity effect can easily be removed by centring the explanatory variable around its sample mean. For clarity of exposition we work exclusively with the centred model from now on.

[Fig 1. Fisher Iris data example: fit and log-likelihood contours. Left panel: binary response against the predictor. Right panel: log-likelihood contours in $(\alpha, \beta)$.]

Figure 2, to be discussed in Section 3, shows various aspects of sampling distributions under the maximum likelihood fit. These show failure of first, and higher, order asymptotics due to skewness and discreteness in a number of different ways. These effects can be explained by the closeness of the boundary, which panel (a) shows as points connected by solid line segments. In this example, the boundary is just close enough to start to play a significant role. More extreme examples are shown in Section 3.

[Fig 2. Fisher Iris data example: effects of the boundary on the sampling distribution. Panels (a), (b) and (c) each show a sampling distribution; the axes are the two sufficient statistics, the parameters $\alpha$ and $\beta$, and frequency against $\beta$.]

2. Overview of geometry

This section looks at the geometry underlying logistic regression. Section 2.1 describes the more general case of the geometry of extended multinomial models,
while Section 2.2 focuses on logistic regression. Section 2.3 defines the diagnostic which allows the analyst to check whether the boundary is close enough to substantially affect first order asymptotic results. Finally, in Section 3, we return to Example 1 and related variants.

2.1. Extended multinomial models

The information geometry of the extended multinomial model is considered in [1]. The cell probabilities of the extended multinomial define the simplex

$$\Delta^k := \left\{ \pi = (\pi_0, \pi_1, \ldots, \pi_k) : \pi_i \ge 0, \ \sum_{i=0}^{k} \pi_i = 1 \right\}. \tag{2.1}$$

The term extended refers to the fact that this is the closure of the multinomial model: zero cell probabilities are allowed. The closure of exponential families has been studied by [2], [4], [11] and [5].

Of central interest in this paper is the consequence of working in a closed extended family, rather than in the more familiar open parameter space. One important feature with obvious implications for first order asymptotics is that the Fisher information can become arbitrarily close to singular as the boundary is approached. This is clearly shown by considering its spectral decomposition. With $\pi_{(0)}$ denoting the vector of all probabilities except $\pi_0$, the Fisher information matrix for the natural (log-odds) parameters, written as a function of $\pi$, is the sample size times

$$I(\pi) := \mathrm{diag}(\pi_{(0)}) - \pi_{(0)} \pi_{(0)}^T.$$

Its explicit spectral decomposition is an example of interlacing eigenvalue results (see, for example, [10], Chapter 4). In particular, suppose $\{\pi_i\}_{i=1}^{k}$ comprises $g > 1$ distinct values $\lambda_1 > \cdots > \lambda_g > 0$, $\lambda_i$ occurring $m_i$ times, so that $\sum_{i=1}^{g} m_i = k$. Then the spectrum of $I(\pi)$ comprises $g$ simple eigenvalues $\{\tilde\lambda_i\}_{i=1}^{g}$ satisfying

$$\lambda_1 > \tilde\lambda_1 > \cdots > \lambda_g > \tilde\lambda_g \ge 0, \tag{2.2}$$

together, if $g < k$, with $\{\lambda_i : m_i > 1\}$, each such $\lambda_i$ having multiplicity $m_i - 1$.
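The interlacing pattern in (2.2) can be checked numerically. The following is a minimal sketch, not taken from the paper: it assumes the standard determinant identity for a rank-one update, under which the simple eigenvalues of $\mathrm{diag}(\pi_{(0)}) - \pi_{(0)}\pi_{(0)}^T$ are the roots of the secular function $f(x) = 1 - \sum_i m_i \lambda_i^2 / (\lambda_i - x)$, and it locates one root between each pair of consecutive distinct values by bisection. The function names and the example probabilities are hypothetical.

```python
from collections import Counter


def secular(x, grouped):
    """Secular function f(x) = 1 - sum_i m_i * lam_i**2 / (lam_i - x)."""
    return 1.0 - sum(m * lam * lam / (lam - x) for lam, m in grouped)


def simple_eigenvalues(pi_rest, iterations=200):
    """Return (distinct values, simple eigenvalues) of diag(p) - p p^T.

    pi_rest holds the strictly positive probabilities pi_1, ..., pi_k
    (every cell except pi_0). One simple eigenvalue is found by bisection
    between each pair of consecutive distinct values, and the smallest
    one on the interval [0, lam_g).
    """
    # Group repeated probabilities: (lam_i, m_i), in decreasing order of lam_i.
    grouped = sorted(Counter(pi_rest).items(), reverse=True)
    lams = [lam for lam, _ in grouped]
    roots = []
    for i, upper in enumerate(lams):
        lo = lams[i + 1] if i + 1 < len(lams) else 0.0
        hi = upper
        for _ in range(iterations):
            mid = 0.5 * (lo + hi)
            # f is strictly decreasing between poles, so it is positive
            # to the left of the root and negative to the right.
            if secular(mid, grouped) > 0.0:
                lo = mid
            else:
                hi = mid
        roots.append(0.5 * (lo + hi))
    return lams, roots


# Hypothetical cell probabilities: pi_0 = 0.4 and pi_(0) = (0.3, 0.1, 0.1, 0.1).
lams, roots = simple_eigenvalues([0.3, 0.1, 0.1, 0.1])
for lam, root in zip(lams, roots):
    print(f"lambda = {lam:.4f} > simple eigenvalue = {root:.6f}")
```

In this example $g = 2$ and $k = 4$, so besides the two simple eigenvalues the spectrum also contains $\lambda_2 = 0.1$ with multiplicity $m_2 - 1 = 2$. Shrinking $\pi_0$ towards zero pushes the smallest simple eigenvalue towards zero, which is the near-singularity described above.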
While the closure of the full multinomial model is easy to understand in the representation (2.1), the closure of lower dimensional sub-models of $\Delta^k$ expressed in the natural parameters, such as logistic regression models, is more problematic to compute, although it can be critical inferentially. In order to visualise the geometry of the problem of computing limits in exponential families, consider a low-dimensional example.

Example 2. Define a two dimensional full exponential subfamily $\pi(\alpha,\beta)$ of $\Delta^3$ where

$$\pi(\alpha,\beta) \propto \left( \exp\{\alpha v_{11} + \beta v_{21}\}, \exp\{\alpha v_{12} + \beta v_{22}\}, \exp\{\alpha v_{13} + \beta v_{23}\}, \exp\{\alpha v_{14} + \beta v_{24}\} \right)$$
for vectors $v_1 = (1,2,3,4)$, $v_2 = (1,4,9,-1)$. Consider directions from the origin $(\alpha,\beta) = (0,0)$ found by writing $\alpha = \theta\beta$ giving, for each $\theta$, a one dimensional full exponential family parameterized by $\beta$ in the direction $(\theta+1, 2\theta+4, 3\theta+9, 4\theta-1)$. The aspect of this vector which determines the connection to the boundary is the rank order of its elements, in particular which elements are the maximum and minimum. For example, suppose the first component is the maximum and the last the minimum. Then, as $\beta \to \pm\infty$, this one dimensional family will be connected to the first and fourth vertex of the embedding simplex $\Delta^3$, respectively.

In order to see all possible rank orderings of the components, see the right panel of Fig. 3, which shows the graphs of the functions $\{\theta+1, 2\theta+4, 3\theta+9, 4\theta-1\}$. The maximum and minimum ranks are determined by the upper and lower envelopes, shown as solid lines. From this analysis of the envelopes of a set of linear functions, it can be seen that the function $2\theta+4$ is redundant. It can be shown that only three of the four vertices of the ambient simplex $\Delta^3$ have been connected by the model. This is shown explicitly in the left panel of Fig. 3.

[Fig 3. Attaching a two dimensional example to the boundary of the simplex. Right panel: envelope of linear functions, plotted against $\theta$.]

In general, the problem of finding the limit points in full exponential families inside simplex models is a problem of finding redundant linear constraints. As shown in [6], this can be converted, via duality, into the problem of finding extremal points in a finite dimensional affine space. For an alternative approach, see [9].

2.2. Logistic regression

A logistic regression model is a full exponential family that lies in a very high dimensional simplex when considered as a model for the joint distribution of $N$ binary response variates. Consider an $N \times D$ design matrix $X$ whose $i$th row, $x_i^T$, contains the covariate values for the $i$th case and a binary response $t \in \{0,1\}^N$. Let $s(a) = \log\left(\frac{a}{1-a}\right)$, so that $s^{-1}(a) = \frac{\exp(a)}{1+\exp(a)}$, the logistic regression model being given by

$$P(T_i = 1) = s^{-1}(x_i^T \beta).$$

This is a full exponential family that lies in the $(2^N - 1)$-dimensional simplex when considered as a model for the joint distribution of the $N$ binary response
variates. A design matrix $X$ defines a $D$-dimensional subset, an affine subspace in the natural parameters, and changing the explanatory variates changes the orientation of this low dimensional space inside the space of all joint distributions.

Example 3. Consider a logistic regression with 20 cases in which $X$ comprises two columns: $1_{20}$, the vector of all ones, and $(1,2,\ldots,20)^T$. It is important to consider the way that this two-dimensional exponential family is attached to the boundary. The generalisation of Fig. 3 is shown in Fig. 4, where only lines which are part of the envelope are plotted. The corresponding vertices which the full exponential family reaches are given by vectors $z$ with the structure either $z_i = 0$ for $i = 1,\ldots,h$ and $z_i = 1$ for $i = h+1,\ldots,20$, or $z_i = 1$ for $i = 1,\ldots,h$ and $z_i = 0$ for $i = h+1,\ldots,20$. This generalises at once to any single covariate taking distinct values.

[Fig 4. Envelope of lines, plotted against $\theta$.]

2.3. A diagnostic tool

First order asymptotics essentially assumes that the parameter space can be treated as a Euclidean space with a fixed metric, typically the Fisher information evaluated either at a hypothesised value or at the maximum likelihood estimate. This observation allows a simple diagnostic tool to be developed which gives a sufficient condition for first order methods to be appropriate. If they were appropriate, then the closest point on the boundary should be a large distance, as measured in this fixed metric, from the maximum likelihood estimate, this length being calibrated using the quantiles of the relevant $\chi^2$ distribution. When the dimension of the extended multinomial model is high, relative to sample size, first order asymptotic approximations hold at best on low dimensional subspaces.
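The check described above can be sketched for a logistic regression with an intercept and a single covariate taking distinct values. This is a minimal illustration under stated assumptions, not the authors' implementation. It works with the sufficient statistics $(\sum_i T_i, \sum_i x_i T_i)$, takes the boundary vertices to have the threshold structure of Example 3, and measures squared distance from the mean of the sufficient statistics using the inverse of the Fisher information of the natural parameters (the covariance of the sufficient statistics). The metric is evaluated at a hypothesised parameter value rather than at a fitted estimate, and all function names and parameter values are hypothetical.

```python
import math


def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))


def mean_and_information(x, alpha, beta):
    """Mean of (sum T_i, sum x_i T_i) and the entries of the 2x2 information."""
    p = [logistic(alpha + beta * xi) for xi in x]
    w = [pi * (1.0 - pi) for pi in p]
    mean = (sum(p), sum(xi * pi for xi, pi in zip(x, p)))
    info = (sum(w),
            sum(xi * wi for xi, wi in zip(x, w)),
            sum(xi * xi * wi for xi, wi in zip(x, w)))
    return mean, info


def boundary_polygon(x):
    """Sufficient statistics of the threshold vertices, ordered as a polygon."""
    xs = sorted(x)
    n = len(xs)
    lower = [(h, sum(xs[:h])) for h in range(n + 1)]      # ones on the h smallest x
    upper = [(h, sum(xs[n - h:])) for h in range(n + 1)]  # ones on the h largest x
    return lower + upper[-2:0:-1]


def squared_distance_to_boundary(x, alpha, beta):
    """Smallest squared distance from the mean to any boundary edge."""
    mean, (i00, i01, i11) = mean_and_information(x, alpha, beta)
    det = i00 * i11 - i01 * i01

    def inner(u, v):
        # u^T I^{-1} v, with I the 2x2 Fisher information matrix
        return (i11 * u[0] * v[0] - i01 * (u[0] * v[1] + u[1] * v[0])
                + i00 * u[1] * v[1]) / det

    poly = boundary_polygon(x)
    best = math.inf
    for a, b in zip(poly, poly[1:] + poly[:1]):
        d = (b[0] - a[0], b[1] - a[1])
        r = (mean[0] - a[0], mean[1] - a[1])
        # Closest point on the segment a -> b in the chosen metric
        t = max(0.0, min(1.0, inner(r, d) / inner(d, d)))
        e = (r[0] - t * d[0], r[1] - t * d[1])
        best = min(best, inner(e, e))
    return best


def chi2_2_quantile(prob):
    """Quantile of the chi-squared distribution with 2 degrees of freedom."""
    return -2.0 * math.log(1.0 - prob)


# Design of Example 3 with the covariate centred; parameter values are hypothetical.
x = [i - 10.5 for i in range(1, 21)]
threshold = chi2_2_quantile(0.99)
for beta in (0.0, 1.0):
    dist = squared_distance_to_boundary(x, 0.0, beta)
    print(f"beta = {beta}: contour crosses boundary = {dist < threshold}")
```

The two-degrees-of-freedom quantile has the closed form used here because the $\chi^2_2$ distribution function is $1 - \exp(-x/2)$; other dimensions would need a numerical quantile routine.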
In the case of logistic regression, consider the mean parameter space which, with a small but common abuse of notation, we can also consider as the space of sufficient statistics. In this space, the vertices which define the boundary polytope are also the extremal points in the convex set of attainable values of the sufficient statistics, see Figs 2 (a) and 5. In Fig. 5, whose details are discussed in Section 3, the contours are defined by the squared distance
from the maximum likelihood estimate relative to the Fisher information there. The largest contour is calibrated by the 99% quantile of the $\chi^2_2$ distribution. The diagnostic is simply based on checking whether this contour crosses the boundary or not. If it does cross the boundary, for example marginally in the left panel and strongly in the right, then we regard first order asymptotic normality as suspect.

We note in passing that, in a general multinomial model setting, distances to the boundary can be easily computed, via quadratic programming, and in fact have a closed form. Let $Q_{\pi_0}(\pi)$ be the squared distance from $\pi_0$ to $\pi$ measured with respect to the Fisher information at $\pi_0$. Using this distance function, the squared distance to the face defined by the index set $I$ from the point $\pi_0$ is

$$Q_{\pi_0}(\pi) = \frac{\pi_I}{1 - \pi_I},$$

where $\pi_I = \sum_{i \in I} \pi_{0i}$.

3. Examples and discussion

Let us return to Example 1. Consider first panel (a) of Fig. 2. This is a plot of randomly sampled sufficient statistics, plotted as dots, generated from the distribution identified by the maximum likelihood estimate. The vertices in the extended multinomial model to which the logistic model is connected correspond to a set of points in the sample space, plotted with circles, and the edges which connect these points define a one dimensional boundary, plotted with straight lines. The image of the boundary is a polytope and it is clear that for the iris data this sample is getting very close to the boundary.

In the example the boundary is just close enough for the first order approximations to start to break down. This is happening in a number of ways. There is noticeable discreteness in the sample, each vertical streak corresponding to one of the extremal points on the boundary. This effect becomes more pronounced the closer the boundary becomes, as shown below.
The second way that the first order approximation breaks down is that the effects of higher order terms in the asymptotic expansions are starting to make themselves felt. This is illustrated by the solid contour lines, which are defined by the two dimensional Edgeworth expansion of the sampling distribution; see [3]. It can be seen that these are not centred ellipses, as would have been expected if the normal approximation were adequate. Rather, a distortion caused by the boundary is becoming evident.

Panel (b) in Fig. 2 shows the sampling distribution of the maximum likelihood estimate. Near the boundary the very high degree of non-linearity between the maximum likelihood estimates and the natural sufficient statistics has a very strong effect, as can be seen. The strong directional features correspond to directions of recession as described in [9]. The effect of this non-linearity can be further seen in panel (c), which plots the marginal distribution of $\hat\beta$, the estimate of the slope parameter, whose large skewness clearly indicates non-normality.
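One way to gauge how strongly a given parameter value is exposed to these boundary effects is to compute how much probability it places on the boundary vertices themselves. The sketch below is illustrative rather than part of the paper's analysis. It assumes a single covariate with distinct values plus an intercept, so that the vertices are the threshold response vectors listed in Example 3, and it sums their probabilities exactly. The function name and the parameter values are hypothetical.

```python
import math


def boundary_mass(x, alpha, beta):
    """Exact probability that the response vector is a boundary vertex.

    Vertices are the threshold patterns (0,...,0,1,...,1) and
    (1,...,1,0,...,0) over the cases ordered by covariate value.
    """
    xs = sorted(x)
    n = len(xs)
    p = [1.0 / (1.0 + math.exp(-(alpha + beta * xi))) for xi in xs]

    # Collect the distinct threshold patterns (all-zero and all-one appear once).
    vertices = set()
    for h in range(n + 1):
        vertices.add(tuple([0] * h + [1] * (n - h)))
        vertices.add(tuple([1] * h + [0] * (n - h)))

    total = 0.0
    for z in vertices:
        prob = 1.0
        for zi, pi in zip(z, p):
            prob *= pi if zi else 1.0 - pi
        total += prob
    return total


# Design of Example 3 with the covariate centred; slopes are hypothetical.
x = [i - 10.5 for i in range(1, 21)]
for beta in (0.0, 0.3, 1.0):
    print(f"beta = {beta}: boundary mass = {boundary_mass(x, 0.0, beta):.6f}")
```

With a zero slope all $2^{20}$ response vectors are equally likely, so the 40 vertices carry negligible mass; the mass grows as the slope moves the model towards the boundary.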
In order to show how all these aspects become stronger when the boundary gets closer, consider Fig. 5. The left hand panel shows the same information as in Fig. 2, but now with the contours of our proposed diagnostic plotted. This is the case where the maximum likelihood estimate from Example 1 is used and, as can be seen, the diagnostic line, calibrated by the 99% quantile of the $\chi^2_2$ distribution, just touches the boundary. This implies that boundary effects are starting to affect inference, as described above.

[Fig 5. Distribution of sufficient statistics near the boundary. Both panels plot one sufficient statistic against the other.]

In the right hand panel of Fig. 5, the sampling has been done from a distribution much closer to the boundary. As can be clearly seen, the diagnostic curves strongly cut the boundary. The effects on inference in this case are much stronger. The discretisation effects are more pronounced; in particular, note that most of the probability mass on the boundary lies on a relatively small number of vertices, plotted as circles in the figure. The effect on the sampling distribution of the maximum likelihood estimate is even more extreme. There is now an appreciably large probability that a sampled vector of sufficient statistics lies on a boundary point, which implies an infinite slope estimate. This gives very strong departures from normality for both the joint and marginal distributions.

References

[1] K. Anaya-Izquierdo, F. Critchley, P. Marriott, and P. W. Vos. Computational information geometry: foundations. Submitted to Geometric Science of Information 2013, 2013.
[2] O. E. Barndorff-Nielsen. Information and exponential families in statistical theory. John Wiley & Sons, 1978.
[3] O. E. Barndorff-Nielsen and D. R. Cox. Asymptotic techniques for use in statistics. Chapman & Hall, 1989.
[4] L. D.
Brown. Fundamentals of statistical exponential families: with applications in statistical decision theory. Institute of Mathematical Statistics, 1986.
[5] I. Csiszár and F. Matúš. Closures of exponential families. The Annals of Probability, 33(2):582-600, 2005.
[6] H. Edelsbrunner. Algorithms in combinatorial geometry. Springer-Verlag, New York, 1987.
[7] S. Fienberg and A. Rinaldo. Maximum likelihood estimation in log-linear models: Theory and algorithms. Annals of Statist., 40:996-1023, 2012.
[8] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179-188, 1936.
[9] C. J. Geyer. Likelihood inference in exponential families and directions of recession. Electron. J. Statist., 3:259-289, 2009.
[10] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
[11] S. L. Lauritzen. Graphical models. Oxford University Press, 1996.
[12] A. Rinaldo, S. Fienberg, and Y. Zhou. On the geometry of discrete exponential families with applications to exponential random graph models. Electron. J. Statist., 3:446-484, 2009.