arxiv: v1 [stat.me] 5 Apr 2013

Logistic regression geometry

Karim Anaya-Izquierdo
London School of Hygiene and Tropical Medicine, London WC1E 7HT, UK
e-mail: karim.anaya@lshtm.ac.uk

Frank Critchley
Department of Mathematics and Statistics, The Open University, Milton Keynes, MK7 6AA, UK
e-mail: f.critchley@open.ac.uk

Paul Marriott
Department of Statistics and Actuarial Science, University of Waterloo, 200 University Avenue West, Waterloo, Ontario, Canada N2L 3G1
e-mail: pmarriot@uwaterloo.ca

arXiv:1304.1720v1 [stat.ME] 5 Apr 2013

1. Introduction

The fact that the maximum likelihood estimate in a logistic regression model may not exist is a well-known phenomenon, and a number of recent papers have explored its underlying geometrical basis. [9], [12] and [7] point out that existence, and non-existence, of the estimate can be fully characterised by considering the closure of the model as an exponential family. In this formulation it becomes clear that the maximum is always well-defined, but can lie on the boundary rather than in the relative interior. Furthermore, the boundary can be considered as a polytope characterised by a finite number of extremal points.

This paper builds on this work and shows that the boundary affects more than the existence of the maximum likelihood estimate. In particular, even when the estimate exists, the geometry and boundary can strongly affect inference procedures. First and higher order asymptotic results cannot be uniformly applied. Indeed, near the boundary, effects such as high skewness, discreteness and collinearity dominate, any of which could render inference based on asymptotic normality suspect. The paper presents a simple diagnostic tool which allows the analyst to check if the boundary is going to have an appreciable effect on standard inferential techniques. The tool, and the effect that the boundary can have, are illustrated in a well-known example and through simulated datasets.

Example 1.
The Fisher iris data set [8] is often used to illustrate classification and binary regression. Even in this familiar case we show that the boundary is close enough to have significant effects on inference. Let us focus on the problem of distinguishing the species Iris setosa (coded with 1 in figures) from Iris versicolor (coded with 0) based on the length of the flower's sepal. The left hand panel in Fig. 1 shows the logistic regression fit, while the right hand panel shows a contour plot of the log-likelihood for intercept parameter α and slope parameter β. The near singularity of the observed Fisher information is evident. While this does have a geometric interpretation, this particular collinearity effect can easily be removed by centring the explanatory variable around its sample mean. For clarity of exposition we work exclusively with the centred model from now on.

[Figure 1: left panel, binary response against predictor with the fitted curve; right panel, "Log likelihood contours", β against α.]
Fig 1. Fisher Iris data example: fit and log-likelihood contours

Figure 2, to be discussed in Section 3, shows various aspects of sampling distributions under the maximum likelihood fit. These show failure of first, and higher, order asymptotics due to skewness and discreteness in a number of different ways. As explained in the sequel, these effects are due to the closeness of the boundary, shown in panel (a) as points connected with solid line segments. In this example, the boundary is just close enough to start to play a significant role. More extreme examples are shown in Section 3.

[Figure 2: three panels, each titled "Sampling distribution": (a) sufficient statistic 2 against sufficient statistic 1; (b) β against α; (c) frequency against β.]
Fig 2. Fisher Iris data example: effects of the boundary on the sampling distribution

2. Overview of geometry

This section looks at the geometry underlying logistic regression. Section 2.1 describes the more general case of the geometry of extended multinomial models, while Section 2.2 focuses on logistic regression. Section 2.3 defines the diagnostic which allows the analyst to check if the boundary is close enough to substantially affect first order asymptotic results. Finally, in Section 3, we return to Example 1 and related variants.

2.1. Extended multinomial models

The information geometry of the extended multinomial model is considered in [1]. The cell probabilities of the extended multinomial define the simplex

$$\Delta^k := \Big\{ \pi = (\pi_0, \pi_1, \dots, \pi_k) : \pi_i \ge 0, \ \sum_{i=0}^{k} \pi_i = 1 \Big\}. \qquad (2.1)$$

The term extended refers to the fact that this is the closure of the multinomial model: zero cell probabilities are allowed. The closure of exponential families has been studied by [2], [4], [11] and [5]. Of central interest in this paper is the consequence of working in a closed extended family, rather than in the more familiar open parameter space.

One important feature with obvious implications for first order asymptotics is that the Fisher information can become arbitrarily close to singular as the boundary is approached. This is clearly shown by considering its spectral decomposition. With $\pi_{(0)}$ denoting the vector of all probabilities except $\pi_0$, the Fisher information matrix for the natural (log-odds) parameters, written as a function of $\pi$, is the sample size times

$$I(\pi) := \mathrm{diag}(\pi_{(0)}) - \pi_{(0)} \pi_{(0)}^T.$$

Its explicit spectral decomposition is an example of interlacing eigenvalue results (see, for example, [10], Chapter 4). In particular, suppose $\{\pi_i\}_{i=1}^{k}$ comprises $g > 1$ distinct values $\lambda_1 > \dots > \lambda_g > 0$, $\lambda_i$ occurring $m_i$ times, so that $\sum_{i=1}^{g} m_i = k$. Then the spectrum of $I(\pi)$ comprises $g$ simple eigenvalues $\{\tilde\lambda_i\}_{i=1}^{g}$ satisfying

$$\lambda_1 > \tilde\lambda_1 > \dots > \lambda_g > \tilde\lambda_g \ge 0, \qquad (2.2)$$

together, if $g < k$, with $\{\lambda_i : m_i > 1\}$, each such $\lambda_i$ having multiplicity $m_i - 1$.
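
As a numerical illustration of (2.2), the following minimal sketch computes the spectrum of $I(\pi)$ in the simplest case, where $\pi_1, \dots, \pi_k$ are distinct and positive (so $g = k$ and every $m_i = 1$) and $\pi_0 > 0$. It is not taken from the paper: it assumes the standard secular-equation characterisation of a rank-one update, namely that the eigenvalues solve $1 - \sum_i \pi_i^2/(\pi_i - \lambda) = 0$, and finds one root by bisection in each interlacing interval. The function names and the example probabilities are hypothetical.

```python
import math


def secular(lam, p):
    """Secular function whose roots are the eigenvalues of diag(p) - p p^T."""
    return 1.0 - sum(pi * pi / (pi - lam) for pi in p)


def interlaced_eigenvalues(p, iters=200):
    """Eigenvalues of diag(p) - p p^T, largest first.

    Illustrative restriction: the entries of p are distinct, strictly
    positive and sum to less than one, so that pi_0 = 1 - sum(p) > 0.
    """
    knots = sorted(p, reverse=True) + [0.0]
    roots = []
    for hi, lo in zip(knots[:-1], knots[1:]):
        a, b = lo, hi  # secular function is positive near lo, negative near hi
        for _ in range(iters):
            mid = 0.5 * (a + b)
            if mid <= a or mid >= b:  # interval has shrunk to machine precision
                break
            if secular(mid, p) > 0.0:
                a = mid
            else:
                b = mid
        roots.append(0.5 * (a + b))
    return roots


# Hypothetical cell probabilities (pi_1, pi_2, pi_3), leaving pi_0 = 0.1.
p = [0.4, 0.3, 0.2]
eig = interlaced_eigenvalues(p)
lam = sorted(p, reverse=True)

# Interlacing as in (2.2): lam_1 > eig_1 > lam_2 > eig_2 > lam_3 > eig_3 >= 0.
interlaced = (
    all(lam[i] > eig[i] for i in range(3))
    and all(eig[i] > lam[i + 1] for i in range(2))
    and eig[-1] >= 0.0
)
print(eig, interlaced)

# The product of the eigenvalues should match pi_0 * pi_1 * ... * pi_k.
print(math.prod(eig), (1.0 - sum(p)) * math.prod(p))
```

By the matrix determinant lemma the determinant of $I(\pi)$ is $\pi_0 \pi_1 \cdots \pi_k$, so the second printed line gives a direct check that the matrix degenerates as any cell probability tends to zero.
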
While the closure of the full multinomial model is easy to understand in the representation (2.1), the closure of lower dimensional sub-models of $\Delta^k$ expressed in the natural parameters, such as logistic regression models, is more problematic to compute, although it can be critical inferentially. In order to visualise the geometry of the problem of computing limits in exponential families, consider a low-dimensional example.

Example 2. Define a two dimensional full exponential subfamily $\pi(\alpha,\beta)$ of $\Delta^3$, where

$$\pi(\alpha,\beta) \propto \big(\exp\{\alpha v_{11} + \beta v_{21}\},\ \exp\{\alpha v_{12} + \beta v_{22}\},\ \exp\{\alpha v_{13} + \beta v_{23}\},\ \exp\{\alpha v_{14} + \beta v_{24}\}\big)$$

for vectors $v_1 = (1,2,3,4)$, $v_2 = (1,4,9,-1)$. Consider directions from the origin $(\alpha,\beta) = (0,0)$ found by writing $\alpha = \theta\beta$, giving, for each $\theta$, a one dimensional full exponential family parameterized by $\beta$ in the direction $(\theta+1,\ 2\theta+4,\ 3\theta+9,\ 4\theta-1)$. The aspect of this vector which determines the connection to the boundary is the rank order of its elements, in particular which elements are the maximum and minimum. For example, suppose the first component is the maximum and the last the minimum; then as $\beta \to \pm\infty$ this one dimensional family is connected to the first and fourth vertex of the embedding simplex $\Delta^3$, respectively.

To see all possible rank orderings of the components, see the right panel of Fig. 3, which shows the graphs of the functions $\{\theta+1,\ 2\theta+4,\ 3\theta+9,\ 4\theta-1\}$. The maximum and minimum ranks are determined by the upper and lower envelopes, shown as solid lines. From this analysis of the envelopes of a set of linear functions, it can be seen that the function $2\theta+4$ is redundant: it is never the largest or the smallest of the four. Consequently only three of the four vertices of the ambient simplex are connected by the model. This is shown explicitly in the left panel of Fig. 3.

[Figure 3: left panel, the two dimensional family attached to the simplex; right panel, "Envelope of linear functions", linear function against θ.]
Fig 3. Attaching a two dimensional example to the boundary of the simplex.

In general, the problem of finding the limit points in full exponential families inside simplex models is a problem of finding redundant linear constraints. As shown in [6], this can be converted, via duality, into the problem of finding extremal points in a finite dimensional affine space. For an alternative approach, see [9].

2.2. Logistic regression

A logistic regression model is a full exponential family that lies in a very high dimensional simplex when considered as a model for the joint distribution of N binary response variates. Consider an $N \times D$ design matrix $X$ whose $i$th row, $x_i^T$, contains the covariate values for the $i$th case, and a binary response $t \in \{0,1\}^N$. Let $s(a) = \log\big(\frac{a}{1-a}\big)$, so that $s^{-1}(a) = \frac{\exp(a)}{1+\exp(a)}$, the logistic regression model being given by

$$P(T_i = 1) = s^{-1}(x_i^T \beta).$$

Specifically, the simplex containing this full exponential family is $(2^N - 1)$-dimensional. A design matrix $X$ defines a $D$-dimensional subset (an affine subspace in the natural parameters), and changing the explanatory variates changes the orientation of this low dimensional space inside the space of all joint distributions.

Example 3. Consider a logistic regression with 20 cases in which $X$ comprises two columns: $1_{20}$, the vector of all ones, and $(1,2,\dots,20)^T$. It is important to consider the way that this two-dimensional exponential family is attached to the boundary. The generalisation of Fig. 3 is shown in Fig. 4, where only lines which are part of the envelope are plotted. The corresponding vertices which the full exponential family reaches are given by vectors $z$ with the structure either $z_i = 0$ for $i = 1,\dots,h$ and $z_i = 1$ for $i = h+1,\dots,20$, or $z_i = 1$ for $i = 1,\dots,h$ and $z_i = 0$ for $i = h+1,\dots,20$. This generalises at once to any single covariate taking distinct values.

[Figure 4: the envelope lines plotted against theta.]
Fig 4. Envelope of lines

2.3. A diagnostic tool

First order asymptotics essentially assumes that the parameter space can be treated as a Euclidean space with a fixed metric, typically the Fisher information evaluated either at a hypothesised value or at the maximum likelihood estimate. This observation allows a simple diagnostic tool to be developed which gives a sufficient condition for first order methods to be appropriate. If they were appropriate, then the closest point on the boundary should be a large distance, as measured in this fixed metric, from the maximum likelihood estimate, this length being calibrated using the quantiles of the relevant $\chi^2$ distribution. When the dimension of the extended multinomial model is high, relative to sample size, first order asymptotic approximations hold at best on low dimensional subspaces.
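
For a two-parameter model, such as the intercept-and-slope regression of Example 3, this calibration is explicit, because the $\chi^2_2$ distribution is exponential with mean 2. The worked equation below is an illustrative addition; the symbols $q$ (the chosen probability level) and $c_q$ (the corresponding quantile) are introduced here and are not the paper's notation.

```latex
P(\chi^2_2 \le c) = 1 - e^{-c/2}
\quad\Longrightarrow\quad
c_q = -2\log(1 - q),
\qquad
c_{0.99} = 2\log 100 \approx 9.21 .
```

On this reading, a boundary point whose squared distance from the maximum likelihood estimate, in the fixed metric, is below about 9.21 lies inside the 99% contour.
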
In the case of logistic regression, consider the mean parameter space which, with a small but common abuse of notation, we can also consider as the space of sufficient statistics. In this space, the vertices which define the boundary polytope are also the extremal points in the convex set of attainable values of the sufficient statistics; see Figs 2(a) and 5. In Fig. 5, whose details are discussed in Section 3, the contours are defined by the squared distance from the maximum likelihood estimate relative to the Fisher information there. The largest contour is calibrated by the 99%-quantile of the $\chi^2_2$ distribution. The diagnostic is simply based on checking whether this contour crosses the boundary. If it does cross the boundary, for example marginally in the left panel and strongly in the right, then we regard first order asymptotic normality as suspect.

We note in passing that, in a general multinomial model setting, distances to the boundary can be easily computed, via quadratic programming, and in fact have a closed form. Let $Q_{\pi_0}(\pi)$ be the squared distance from $\pi_0$ to $\pi$ measured with respect to the Fisher information at $\pi_0$. Using this distance function, the squared distance to the face defined by the index set $I$ from the point $\pi_0$ is

$$Q_{\pi_0}(\pi) = \frac{\pi_I}{1 - \pi_I},$$

where $\pi_I = \sum_{i \in I} \pi_{0i}$.

3. Examples and discussion

Let us return to Example 1. Consider first panel (a) of Fig. 2. This is a plot of randomly sampled sufficient statistics, plotted as dots, generated from the distribution identified by the maximum likelihood estimate. The vertices in the extended multinomial model to which the logistic model is connected correspond to a set of points in the sample space, plotted with circles, and the edges which connect these points define a one dimensional boundary, plotted with straight lines. The image of the boundary is a polytope and it is clear that for the iris data this sample is getting very close to the boundary.

In the example the boundary is just close enough for the first order approximations to start to break down. This is happening in a number of ways. First, there is noticeable discreteness in the sample, each vertical streak corresponding to one of the extremal points on the boundary. This effect becomes more pronounced the closer the boundary becomes, as shown below.
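
The diagnostic of Section 2.3 can be sketched in code for the sufficient-statistic picture just described. The following is a minimal illustration under stated assumptions, not the authors' implementation: it takes a single covariate with distinct values, builds the boundary polygon from the step-shaped vertices of Example 3, reads "distance relative to the Fisher information" as the quadratic form defined by the inverse of the covariance $X^T W X$ of the sufficient statistics, and evaluates everything at a hypothetical parameter value standing in for the maximum likelihood estimate. All function names are hypothetical.

```python
import math


def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))


def boundary_vertices(x):
    """Sufficient statistics (sum z_i, sum x_i z_i) of the step-shaped 0/1
    vectors z, in order around the polygon (assumes distinct x values)."""
    xs = sorted(x)
    n = len(xs)
    lower = [(h, sum(xs[:h])) for h in range(n + 1)]      # ones on the h smallest x
    upper = [(h, sum(xs[n - h:])) for h in range(n + 1)]  # ones on the h largest x
    return lower + upper[-2:0:-1]  # closed loop, shared end points listed once


def min_boundary_distance(x, alpha, beta):
    """Smallest squared distance from the mean of the sufficient statistics to
    the boundary polygon, in the metric (X^T W X)^{-1} (an assumed reading)."""
    mu = [logistic(alpha + beta * xi) for xi in x]
    w = [m * (1.0 - m) for m in mu]
    m1 = sum(mu)
    m2 = sum(xi * m for xi, m in zip(x, mu))
    a = sum(w)
    b = sum(xi * wi for xi, wi in zip(x, w))
    c = sum(xi * xi * wi for xi, wi in zip(x, w))
    det = a * c - b * b

    def inner(u, v):
        # u^T (X^T W X)^{-1} v for 2-vectors u and v
        return (c * u[0] * v[0] - b * (u[0] * v[1] + u[1] * v[0]) + a * u[1] * v[1]) / det

    verts = boundary_vertices(x)
    best = float("inf")
    for p, q in zip(verts, verts[1:] + verts[:1]):
        d = (p[0] - m1, p[1] - m2)
        e = (q[0] - p[0], q[1] - p[1])
        t = min(1.0, max(0.0, -inner(d, e) / inner(e, e)))  # closest point on edge
        r = (d[0] + t * e[0], d[1] + t * e[1])
        best = min(best, inner(r, r))
    return best


# Hypothetical design: the 20 equally spaced covariate values of Example 3, centred.
x = [i - 10.5 for i in range(1, 21)]
threshold = -2.0 * math.log(1.0 - 0.99)  # 99% quantile of chi-squared with 2 df
for beta in (0.0, 1.0):  # hypothetical slopes, intercept fixed at 0
    dist = min_boundary_distance(x, 0.0, beta)
    print(beta, dist, "contour crosses boundary" if dist < threshold else "contour inside")
```

The distance is minimised over the edges of the polygon rather than over its vertices only, since the contour can cross an edge between two vertices.
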
The second way that the first order approximation breaks down is that the effects of higher order terms in the asymptotic expansions are starting to make themselves felt. This is illustrated by the solid contour lines, which are defined by the two dimensional Edgeworth expansion of the sampling distribution; see [3]. It can be seen that these are not centred ellipses, as would have been expected if the normal approximation were adequate. Rather, a distortion caused by the boundary is becoming evident.

Panel (b) in Fig. 2 shows the sampling distribution of the maximum likelihood estimate. Near the boundary the very high degree of non-linearity between the maximum likelihood estimates and the natural sufficient statistics has a very strong effect, as can be seen. The strong directional features correspond to directions of recession as described in [9]. The effect of this non-linearity can be further seen in panel (c), which plots the marginal distribution of $\hat\beta$, the estimate of the slope parameter, whose large skewness clearly indicates non-normality.
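
The boundary effects described above can also be explored by simulation. The following minimal sketch is an illustrative addition, not the authors' code. It relies on the vertex structure of Example 3: for a single covariate with distinct values, a response vector is taken to sit on the boundary when, ordered by the covariate, it is a run of zeros followed by a run of ones or the reverse, in which case the two groups are separated and the slope estimate is not finite. The design, the parameter values and the function names are hypothetical.

```python
import math
import random


def is_step_vector(t):
    """True if t (ordered by increasing covariate) is 0...01...1 or 1...10...0,
    including the all-zero and all-one vectors."""
    changes = sum(1 for a, b in zip(t, t[1:]) if a != b)
    return changes <= 1


def boundary_fraction(x, alpha, beta, n_samples, seed=0):
    """Monte Carlo estimate of the probability that a response vector drawn
    from the logistic model is one of the step-shaped boundary vectors."""
    rng = random.Random(seed)
    xs = sorted(x)
    probs = [1.0 / (1.0 + math.exp(-(alpha + beta * xi))) for xi in xs]
    hits = 0
    for _ in range(n_samples):
        t = [1 if rng.random() < p else 0 for p in probs]
        hits += is_step_vector(t)
    return hits / n_samples


x = [i - 10.5 for i in range(1, 21)]  # hypothetical centred design, as in Example 3
for beta in (0.1, 2.0):  # hypothetical slopes: far from, and close to, the boundary
    print(beta, boundary_fraction(x, 0.0, beta, n_samples=2000))
```

With a steep slope the fitted probabilities are close to 0 or 1 for most cases, so a large share of simulated samples fall on boundary vertices; with a shallow slope this is rare.
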

In order to show how all these aspects become stronger when the boundary gets closer, consider Fig. 5. The left hand panel shows the same information as in Fig. 2, but now with the contours of our proposed diagnostic plotted. This is the case where the maximum likelihood estimate from Example 1 is used and, as can be seen, the diagnostic line, calibrated by the 99% quantile of the $\chi^2_2$ distribution, just touches the boundary. This implies that boundary effects are starting to affect inference, as described above.

[Figure 5: two panels, each plotting one sufficient statistic against the other.]
Fig 5. Distribution of sufficient statistics near the boundary

In the right hand panel of Fig. 5, the sampling has been done from a distribution much closer to the boundary. As can be clearly seen, the diagnostic curves strongly cut the boundary. The effects on inference in this case are much stronger. The discretisation effects are more pronounced; in particular, note that most of the probability mass on the boundary lies on a relatively small number of vertices, plotted as circles in the figure. The effect on the sampling distribution of the maximum likelihood estimate is even more extreme. There is now an appreciably large probability that a sampled vector of sufficient statistics lies on a boundary point, which implies an infinite slope estimate. This gives very strong departures from normality for both the joint and marginal distributions.

References

[1] K. Anaya-Izquierdo, F. Critchley, P. Marriott, and P. W. Vos. Computational information geometry: foundations. Submitted to Geometric Science of Information 2013, 2013.
[2] O. E. Barndorff-Nielsen. Information and exponential families in statistical theory. John Wiley & Sons, 1978.
[3] O. E. Barndorff-Nielsen and D. R. Cox. Asymptotic techniques for use in statistics. Chapman & Hall, 1989.
[4] L. D. Brown. Fundamentals of statistical exponential families: with applications in statistical decision theory. Institute of Mathematical Statistics, 1986.
[5] I. Csiszar and F. Matus. Closures of exponential families. The Annals of Probability, 33(2):582-600, 2005.
[6] H. Edelsbrunner. Algorithms in combinatorial geometry. Springer-Verlag: New York, 1987.
[7] S. Fienberg and A. Rinaldo. Maximum likelihood estimation in log-linear models: Theory and algorithms. Annals of Statist., 40:996-1023, 2012.
[8] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179-188, 1936.
[9] C. J. Geyer. Likelihood inference in exponential families and directions of recession. Electron. J. Statist., 3:259-289, 2009.
[10] R. A. Horn and C. R. Johnson. Matrix Analysis. CUP, 1985.
[11] S. L. Lauritzen. Graphical models. Oxford University Press, 1996.
[12] A. Rinaldo, S. Fienberg, and Y. Zhou. On the geometry of discrete exponential families with applications to exponential random graph models. Electron. J. Statist., 3:446-484, 2009.