Infinitely Imbalanced Logistic Regression


Infinitely Imbalanced Logistic Regression
Art B. Owen, Journal of Machine Learning Research, April 2007
Presenter: Ivo D. Shterev

Outline
Motivation, Introduction, Numerical Examples, Notation, Silvapulle's Results, Overlap Conditions, Technical Lemmas, Main Results, Example

Motivation
Binary classification problems with two imbalanced classes: one class is very rare compared to the other. Applications include fraud detection, drug discovery, and modeling of rare events in political science.

Introduction
We observe $Y \in \{0, 1\}$, where $Y = 0$ is the common case and $Y = 1$ is the rare case. The limiting logistic regression coefficient $\beta$ satisfies
$$\bar{x} = \frac{\int \exp(x'\beta)\, x \, dF_0(x)}{\int \exp(x'\beta)\, dF_0(x)},$$
where $F_0$ is the distribution of $X$ given $Y = 0$ and $\bar{x}$ is the average of the sample values $x_i$ for which $y_i = 1$. When $F_0 = N(\mu_0, \Sigma_0)$,
$$\lim_{N \to \infty} \beta(N) = \Sigma_0^{-1}(\bar{x} - \mu_0).$$
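A brief derivation of the Gaussian special case, added here for completeness (it is not on the original slide): for $F_0 = N(\mu_0, \Sigma_0)$ the exponentially tilted distribution $\exp(x'\beta)\, dF_0(x) / \int \exp(x'\beta)\, dF_0(x)$ is again normal, namely $N(\mu_0 + \Sigma_0\beta, \Sigma_0)$, so
$$\frac{\int \exp(x'\beta)\, x\, dF_0(x)}{\int \exp(x'\beta)\, dF_0(x)} = \mu_0 + \Sigma_0\beta,$$
and setting this equal to $\bar{x}$ gives $\beta = \Sigma_0^{-1}(\bar{x} - \mu_0)$.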

Numerical Examples
Suppose there are $N$ observations with $(X \mid Y = 0) \sim N(0, 1)$ and a single observation with $(x \mid y = 1) = 1$. The data for the logistic regression are
$$(x_i, y_i) = \Bigl(\Phi^{-1}\bigl(\tfrac{i - 1/2}{N}\bigr),\, 0\Bigr), \quad i = 1, \ldots, N, \qquad (x_{N+1}, y_{N+1}) = (1, 1),$$
where $\Phi(\cdot)$ is the standard normal cumulative distribution function. As $N$ increases the problem becomes more imbalanced.
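Below is a minimal numerical sketch, added here and not part of the original slides, that fits this $(N + 1)$-point data set by maximum likelihood and tracks $\hat\alpha$ and $\hat\beta$ as $N$ grows; the helper name fit_logistic and the use of SciPy's BFGS optimizer are illustrative choices, not the presenter's code.

# Sketch: reproduce the imbalanced example with N grid points standing in for
# N(0,1) draws (y = 0) plus a single rare observation at x = 1 (y = 1).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_logistic(x, y):
    """Maximum likelihood fit of Pr(Y=1 | x) = exp(a + b*x) / (1 + exp(a + b*x))."""
    def negloglik(theta):
        a, b = theta
        eta = a + b * x
        # -l(a, b) = sum_i log(1 + exp(eta_i)) - sum_{i: y_i = 1} eta_i, computed stably
        return np.sum(np.logaddexp(0.0, eta)) - np.sum(eta[y == 1])
    return minimize(negloglik, x0=np.zeros(2), method="BFGS").x

for N in (100, 1_000, 10_000):
    grid = norm.ppf((np.arange(1, N + 1) - 0.5) / N)   # deterministic N(0,1)-like grid
    x = np.concatenate([grid, [1.0]])
    y = np.concatenate([np.zeros(N), [1.0]])
    a_hat, b_hat = fit_logistic(x, y)
    # Expect a_hat ~ const - log N (so a_hat + log N stabilizes) and b_hat -> 1.
    print(f"N={N:6d}  alpha+log(N)={a_hat + np.log(N):+.3f}  beta={b_hat:.3f}")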

Numerical Examples
It seems that $\hat\alpha(N) \approx -\log N$ and $\lim_{N \to \infty} \hat\beta(N) = 1$.

Numerical Examples
It seems that $\hat\alpha(N) = \mathrm{const} - \log N$ and $\lim_{N \to \infty} \hat\beta(N) = 0$.

Numerical Examples
It seems that $\hat\beta$ does not converge to a useful limit.

Notation
Data are $(x, y)$, where $x \in \mathbb{R}^d$ and $y \in \{0, 1\}$. There are $n$ observations with $y = 1$ and $N$ with $y = 0$, where $n \ll N$. Denote $(x \mid y = 1) = (x_{11}, \ldots, x_{1n})$ and $(x \mid y = 0) = (x_{01}, \ldots, x_{0N})$. The logistic regression model is
$$\Pr(Y = 1 \mid X = x) = \frac{\exp(\alpha + x'\beta)}{1 + \exp(\alpha + x'\beta)},$$
where $\alpha \in \mathbb{R}$ and $\beta \in \mathbb{R}^d$. The log-likelihood in logistic regression is
$$l(\alpha, \beta) = \sum_{i=1}^{n} \Bigl(\alpha + x_{1i}'\beta - \log\bigl(1 + \exp(\alpha + x_{1i}'\beta)\bigr)\Bigr) - \sum_{i=1}^{N} \log\bigl(1 + \exp(\alpha + x_{0i}'\beta)\bigr).$$

Notation
Centering the logistic regression around $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_{1i}$ gives
$$l(\alpha, \beta) = n\alpha + \sum_{i=1}^{n} (x_{1i} - \bar{x})'\beta - \sum_{i=1}^{n} \log\bigl(1 + \exp\bigl(\alpha + (x_{1i} - \bar{x})'\beta\bigr)\bigr) - N \int \log\bigl(1 + \exp\bigl(\alpha + (x - \bar{x})'\beta\bigr)\bigr)\, dF_0(x).$$
The study focuses on the maximum likelihood estimate (MLE) $(\hat\alpha, \hat\beta)$. The centered intercept is $\hat\alpha_0 = \hat\alpha + \bar{x}'\hat\beta$, while $\hat\beta$ stays the same. As $N \to \infty$, $\hat\alpha \to -\infty$, but $\hat\beta$ does not necessarily diverge.
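To make the centering step explicit (this derivation is added here and is not on the original slide): writing $\alpha + x'\beta = \alpha_0 + (x - \bar{x})'\beta$ with $\alpha_0 = \alpha + \bar{x}'\beta$ is a one-to-one reparameterization, so the maximized likelihood and $\hat\beta$ are unchanged and only the intercept shifts by $\bar{x}'\hat\beta$. Moreover
$$\sum_{i=1}^{n} (x_{1i} - \bar{x})'\beta = 0$$
by the definition of $\bar{x}$, so the $y = 1$ part of the centered log-likelihood contributes $n\alpha$ plus the individual $\log(1 + \exp(\cdot))$ terms. The sum over the $y = 0$ observations has been replaced by $N$ times an integral against $F_0$, the distribution of $X$ given $Y = 0$.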

Silvapulle's Results
Theorem 1: For $y = 1$ let $z_{1i} = (1, x_{1i}')'$ for $i = 1, \ldots, n_1$. For $y = 0$ let $z_{0i} = (1, x_{0i}')'$ for $i = 1, \ldots, n_0$. Let $\theta = (\alpha, \beta')'$, so that the logistic regression model has
$$\Pr(Y = 1 \mid X = x) = \frac{\exp(z'\theta)}{1 + \exp(z'\theta)}.$$
Employ the two convex cones
$$C_j = \Bigl\{ \sum_{i=1}^{n_j} k_{ji} z_{ji} \;\Bigm|\; k_{ji} > 0 \Bigr\}, \qquad j \in \{0, 1\}.$$
Assume that the $(n_0 + n_1) \times (d + 1)$ matrix with rows taken from the $z_{ji}'$ has rank $d + 1$. Then a unique finite MLE $\hat\theta = (\hat\alpha, \hat\beta')'$ exists if and only if $C_0 \cap C_1 \neq \emptyset$.

Silvapulle's Results
Lemma 2: Define the convex hulls of the $x$'s for $y = 0$ and $y = 1$,
$$H_j = \Bigl\{ \sum_{i=1}^{n_j} \lambda_{ji} x_{ji} \;\Bigm|\; \lambda_{ji} > 0,\; \sum_{i=1}^{n_j} \lambda_{ji} = 1 \Bigr\}.$$
Then $C_0 \cap C_1 \neq \emptyset$ is equivalent to $H_0 \cap H_1 \neq \emptyset$.
Proof: Suppose $x_0 \in H_0 \cap H_1$; then $z_0 = (1, x_0')' \in C_0 \cap C_1$. Conversely, suppose $z_0 \in C_0 \cap C_1$. Then we can write
$$z_0 = \sum_{i=1}^{n_0} k_{0i} \binom{1}{x_{0i}} = \sum_{i=1}^{n_1} k_{1i} \binom{1}{x_{1i}}.$$
The first component gives the common positive value $K = \sum_{i=1}^{n_0} k_{0i} = \sum_{i=1}^{n_1} k_{1i}$. Choosing $\lambda_{ji} = k_{ji}/K$ then gives a point $\sum_{i=1}^{n_0} \lambda_{0i} x_{0i} = \sum_{i=1}^{n_1} \lambda_{1i} x_{1i} \in H_0 \cap H_1$.
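Silvapulle's condition can be checked numerically: if some hyperplane strictly separates the two classes, the hulls cannot overlap and the MLE diverges. Below is a minimal sketch of such a check via a small linear program, added here and not from the slides; the formulation and the use of scipy.optimize.linprog are illustrative assumptions, and the borderline quasi-separated case is ignored.

# Check whether a hyperplane w'x + c strictly separates y=1 from y=0 points.
# If the best achievable margin t is <= 0, the convex hulls overlap and
# (given the rank condition) a unique finite logistic regression MLE exists.
import numpy as np
from scipy.optimize import linprog

def strictly_separable(x1, x0):
    """x1: (n1, d) array of y=1 points, x0: (n0, d) array of y=0 points."""
    n1, d = x1.shape
    n0 = x0.shape[0]
    # Variables [w (d entries), c, t]; maximize t subject to
    #   w'x + c >= t for y=1 points,  w'x + c <= -t for y=0 points,
    # with |w_j| <= 1 and t <= 1 so the margin stays bounded.
    A_ub = np.vstack([
        np.hstack([-x1, -np.ones((n1, 1)), np.ones((n1, 1))]),
        np.hstack([ x0,  np.ones((n0, 1)), np.ones((n0, 1))]),
    ])
    b_ub = np.zeros(n1 + n0)
    c_obj = np.zeros(d + 2)
    c_obj[-1] = -1.0                       # minimize -t, i.e. maximize t
    bounds = [(-1, 1)] * d + [(None, None), (None, 1.0)]
    res = linprog(c_obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[-1] > 1e-9                # True => separable, no finite MLE

rng = np.random.default_rng(0)
x0 = rng.normal(size=(200, 2))             # common class
x1 = np.array([[0.5, 0.5]])                # one rare observation inside the cloud
print("strictly separable:", strictly_separable(x1, x0))   # expected: False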

Overlap Conditions
Assume some overlap between the $y = 1$ observations and the distribution $F_0$, in order to get interesting results. Let $\Omega = \{\omega \in \mathbb{R}^d : \omega'\omega = 1\}$ be the unit sphere in $\mathbb{R}^d$.
Definition 3: $F$ on $\mathbb{R}^d$ has the point $x^*$ surrounded if
$$\int_{(x - x^*)'\omega > \epsilon} dF(x) > \delta$$
holds for some $\epsilon > 0$, some $\delta > 0$, and all $\omega \in \Omega$.
If $F$ has the point $x^*$ surrounded, then there exist $\eta$ and $\gamma$ satisfying
$$\inf_{\omega \in \Omega} \int_{(x - x^*)'\omega > 0} dF(x) \geq \eta > 0$$

Overlap Conditions
and
$$\inf_{\omega \in \Omega} \int \bigl[(x - x^*)'\omega\bigr]_+ \, dF(x) \geq \gamma > 0,$$
where $Z_+ = \max(Z, 0)$.
For Theorem 1 (Lemma 2) to hold, it is sufficient that there is some point $x^*$ that is surrounded by both $F_0$ and $F_1$. In the infinitely imbalanced case it is expected that $F_0$ will surround almost every $x$, but this is not required. It is not sufficient that $F_0$ surrounds only one of the $x_{1i}$, but it is sufficient that $F_0$ surrounds $\bar{x}$. It is not necessary that $F_1$ surrounds $\bar{x}$.

Technical Lemmas
Lemma 4: For $\alpha, z \in \mathbb{R}$,
$$\exp(\alpha + z) \;\geq\; \log\bigl(1 + \exp(\alpha + z)\bigr) \;\geq\; \log\bigl(1 + \exp(\alpha)\bigr) + z\,\frac{\exp(\alpha)}{1 + \exp(\alpha)},$$
and
$$\log\bigl(1 + \exp(\alpha + z)\bigr) \;\geq\; \Bigl[ z\,\frac{\exp(\alpha)}{1 + \exp(\alpha)} \Bigr]_+ = z_+\,\frac{\exp(\alpha)}{1 + \exp(\alpha)}.$$
Lemma 5: Let $n \geq 1$ and $x_{11}, \ldots, x_{1n} \in \mathbb{R}^d$ be given. Assume that $F_0$ surrounds $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_{1i}$ and that $0 < N < \infty$. Then $l(\alpha, \beta)$ has a unique maximizer $(\hat\alpha, \hat\beta)$.

Main Results
Lemma 6: Under the conditions of Lemma 5, let $\hat\alpha$ and $\hat\beta$ maximize $l$. Let $\eta$ satisfy
$$\inf_{\omega \in \Omega} \int_{(x - \bar{x})'\omega > 0} dF_0(x) \geq \eta > 0.$$
Then, for $N \geq 2n/\eta$, we have $\exp(\hat\alpha) \leq \dfrac{2n}{N\eta}$.
Lemma 7: Under the conditions of Lemma 5, let $\hat\alpha$ and $\hat\beta$ maximize $l$. Then $\limsup_{N \to \infty} \|\hat\beta\| < \infty$.
Theorem 8: Let $n \geq 1$ and $x_{11}, \ldots, x_{1n} \in \mathbb{R}^d$, and suppose that $F_0$ satisfies the tail condition
$$\int \exp(x'\beta)\,(1 + \|x\|)\, dF_0(x) < \infty \quad \text{for all } \beta \in \mathbb{R}^d$$
(i.e., $F_0$ does not have too heavy tails) and surrounds $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_{1i}$. Then the maximizer $(\hat\alpha, \hat\beta)$ of $l$ satisfies
$$\lim_{N \to \infty} \frac{\int \exp(x'\hat\beta)\, x\, dF_0(x)}{\int \exp(x'\hat\beta)\, dF_0(x)} = \bar{x}.$$

Main Results
In the limit $N \to \infty$, the logistic regression coefficient $\hat\beta$ depends on $x_{11}, \ldots, x_{1n}$ only through $\bar{x}$.

Example
Suppose $F_0$ is a Gaussian mixture,
$$F_0 = \sum_{k=1}^{K} \lambda_k N(\mu_k, \Sigma_k), \qquad \lambda_k > 0, \quad \sum_{k=1}^{K} \lambda_k = 1.$$
If at least one of the $\Sigma_k$ has full rank, then $F_0$ surrounds the point $\bar{x}$, and the solution $\beta$ is defined through
$$\bar{x} = \frac{\sum_{k=1}^{K} \lambda_k (\mu_k + \Sigma_k \beta)\, \exp\bigl(\beta'\mu_k + \tfrac{1}{2}\beta'\Sigma_k\beta\bigr)}{\sum_{k=1}^{K} \lambda_k \exp\bigl(\beta'\mu_k + \tfrac{1}{2}\beta'\Sigma_k\beta\bigr)}$$
or, equivalently,
$$0 = \sum_{k=1}^{K} \lambda_k (\mu_k + \Sigma_k \beta - \bar{x})\, \exp\bigl(\beta'\mu_k + \tfrac{1}{2}\beta'\Sigma_k\beta\bigr).$$

Example
Solving this equation can be done by convex optimization, using Newton's method with $O(d^3)$ computations per iteration.
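Below is a minimal sketch, added here and not from the slides, of one such Newton iteration: the equation on the previous slide is the stationarity condition of the convex function $g(\beta) = \log \sum_{k} \lambda_k \exp(\beta'\mu_k + \tfrac{1}{2}\beta'\Sigma_k\beta) - \beta'\bar{x}$, whose gradient and Hessian are available in closed form; the function name, the damping-free update, and the stopping rule are illustrative assumptions.

# Newton's method for the infinitely imbalanced limit when F_0 is a Gaussian
# mixture: minimize g(beta) = log sum_k lam_k exp(beta'mu_k + 0.5 beta'Sig_k beta)
# - beta'xbar, whose gradient is the tilted mean of X minus xbar.
import numpy as np

def solve_beta(lam, mus, Sigs, xbar, tol=1e-10, max_iter=50):
    lam, mus, Sigs, xbar = map(np.asarray, (lam, mus, Sigs, xbar))
    beta = np.zeros(xbar.shape[0])
    for _ in range(max_iter):
        expo = mus @ beta + 0.5 * np.einsum("i,kij,j->k", beta, Sigs, beta)
        w = lam * np.exp(expo - expo.max())            # unnormalized tilted weights
        w /= w.sum()
        centers = mus + Sigs @ beta                    # (K, d): mu_k + Sig_k beta
        m = w @ centers                                # tilted mean of X
        grad = m - xbar
        if np.linalg.norm(grad) < tol:
            break
        # Hessian of g: covariance of X under the tilted mixture (PSD, so g is convex)
        H = np.einsum("k,kij->ij", w, Sigs) \
            + np.einsum("k,ki,kj->ij", w, centers, centers) - np.outer(m, m)
        beta = beta - np.linalg.solve(H, grad)         # O(d^3) per iteration
    return beta

# Example: two-component mixture in d = 2 (illustrative numbers).
lam = [0.7, 0.3]
mus = [[0.0, 0.0], [2.0, 1.0]]
Sigs = [np.eye(2), 0.5 * np.eye(2)]
print(solve_beta(lam, mus, Sigs, np.array([1.0, 1.0])))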