Robust estimation of scale and covariance with P_n and its application to precision matrix estimation

Robust estimation of scale and covariance with P_n and its application to precision matrix estimation

Garth Tarr, Samuel Müller and Neville Weber

School of Mathematics and Statistics, The University of Sydney, 2013

Outline
- Robust scale estimator, P_n
- Covariance: covariance matrix, autocovariance
- Precision matrix
- Long range dependence and short range dependence

Outline
- Robust scale estimation
- Robust covariance estimation
- Robust covariance matrices
- Summary and key references

History of relative efficiencies at the Gaussian (n = 20)

[Figure: relative efficiency at the Gaussian of the SD (1.00) and P_n (0.85) versus the IQR, MAD, Q_n and S_n, plotted against year of introduction, 1894 to 2011.]

U-quantile statistics

Given data X = (X_1, \ldots, X_n) and a symmetric kernel h : \mathbb{R}^2 \to \mathbb{R}, a U-statistic of order 2 is defined as

U_n(X) := \binom{n}{2}^{-1} \sum_{i<j} h(X_i, X_j).   (1)

Let H(t) = P(h(X_1, X_2) \le t) be the cdf of the kernels, with corresponding empirical distribution function

H_n(t) := \binom{n}{2}^{-1} \sum_{i<j} I\{h(X_i, X_j) \le t\}, \quad t \in \mathbb{R}.   (2)

For 0 < p < 1, the corresponding sample U-quantile is

H_n^{-1}(p) := \inf\{t : H_n(t) \ge p\}.   (3)

Generalised L-statistics

A generalised L-statistic (GL-statistic) can be defined as

T_n(H_n) = \int_I J(p) H_n^{-1}(p) \, dp + \sum_{j=1}^{d} a_j H_n^{-1}(p_j),

where
- J is a function for smooth weighting of H_n^{-1}(p),
- I \subseteq [0, 1] is some interval,
- the a_j are discrete coefficients for H_n^{-1}(p_j). (Serfling, 1984)

Examples of GL-statistics
- Interquartile range: h(x) = x, \quad IQR = H_n^{-1}(0.75) - H_n^{-1}(0.25)
- Variance: h(x, y) = \tfrac{1}{2}(x - y)^2, \quad \int_0^1 H_n^{-1}(p) \, dp
- Winsorized variance: h(x, y) = \tfrac{1}{2}(x - y)^2, \quad \int_0^{0.75} H_n^{-1}(p) \, dp + 0.25\, H_n^{-1}(0.75)
- Rousseeuw and Croux's Q_n: h(x, y) = |x - y|, \quad H_n^{-1}(0.25)

Pairwise mean scale estimator: P_n

Consider the set of \binom{n}{2} pairwise means \{h(X_i, X_j) : 1 \le i < j \le n\}, where h(X_1, X_2) = (X_1 + X_2)/2. Let H_n be the corresponding empirical distribution function,

H_n(t) := \binom{n}{2}^{-1} \sum_{i<j} I\{h(X_i, X_j) \le t\}, \quad t \in \mathbb{R}.

Definition. P_n = c [H_n^{-1}(0.75) - H_n^{-1}(0.25)], where c \approx 1.048 is a correction factor making P_n consistent for the standard deviation when the underlying observations are Gaussian.
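
As a concrete illustration, here is a minimal numpy sketch of P_n as just defined: form the pairwise means, take their interquartile range and rescale by c \approx 1.048. The function name and the use of np.quantile's default interpolation are our choices, not part of the original.

```python
import numpy as np

def pn_scale(x, c=1.048):
    """Pairwise-mean scale estimator P_n: c times the IQR of the
    n-choose-2 pairwise means (X_i + X_j)/2 for i < j. The constant
    c ~ 1.048 makes P_n consistent for the standard deviation under
    Gaussian data (value taken from the slides)."""
    x = np.asarray(x, dtype=float)
    i, j = np.triu_indices(len(x), k=1)         # all index pairs i < j
    pairwise_means = (x[i] + x[j]) / 2.0        # kernel h(x, y) = (x + y)/2
    q25, q75 = np.quantile(pairwise_means, [0.25, 0.75])
    return c * (q75 - q25)

rng = np.random.default_rng(1)
print(pn_scale(rng.normal(0, 2, size=500)))     # should be close to sigma = 2
```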

Why pairwise means?

Consider 10 observations from N(0, 1).

[Figure: densities of the original data and of their pairwise means over [-4, 4].]

Influence curve

The influence curve of a functional T at a distribution F is

IC(x; T, F) = \lim_{\epsilon \downarrow 0} \frac{T((1-\epsilon)F + \epsilon\delta_x) - T(F)}{\epsilon},

where \delta_x puts all its mass at x. Serfling (1984) outlines the IC for GL-statistics.

Influence curve for P_n. Assuming that F has derivative f > 0 on [F^{-1}(\epsilon), F^{-1}(1-\epsilon)] for all \epsilon > 0,

IC(x; P_n, F) = c \left[ \frac{0.75 - F(2H_F^{-1}(0.75) - x)}{\int f(2H_F^{-1}(0.75) - u) f(u) \, du} - \frac{0.25 - F(2H_F^{-1}(0.25) - x)}{\int f(2H_F^{-1}(0.25) - u) f(u) \, du} \right].

Influence curves when F = Φ

[Figure: influence curves IC(x; T, Φ) for the SD, P_n, Q_n and the MAD over x in [-4, 4].]

Asymptotic normality

The empirical U-process is (\sqrt{n}(H_n(t) - H(t)))_{t \in \mathbb{R}}. Silverman (1976) proved that in this context \sqrt{n}(H_n(\cdot) - H(\cdot)) converges weakly to an almost surely continuous zero-mean Gaussian process W with covariance function

E\, W(s) W(t) = 4\, P(h(X_1, X_2) \le s,\, h(X_1, X_3) \le t) - 4 H(s) H(t).

For 0 < p < q < 1, if H', the derivative of H, is strictly positive on the interval [H^{-1}(p) - \varepsilon, H^{-1}(q) + \varepsilon] for some \varepsilon > 0, then the inverse map can be used to show

\sqrt{n}\left(H_n^{-1}(\cdot) - H^{-1}(\cdot)\right) \xrightarrow{D} \frac{W(H^{-1}(\cdot))}{H'(H^{-1}(\cdot))}.

Asymptotic normality

Recall P_n = c [H_n^{-1}(3/4) - H_n^{-1}(1/4)]. Hence

\sqrt{n}(P_n - \theta) \xrightarrow{D} N(0, V),

where \theta = c (H^{-1}(3/4) - H^{-1}(1/4)) and V both depend on the underlying distribution. When the underlying data are Gaussian,

V = \int IC(x; P_n, \Phi)^2 \, d\Phi(x) = 0.579.

This equates to an asymptotic efficiency of 0.86 at the normal, compared with 0.82 for Q_n and 0.37 for the MAD.

Breakdown value

Definition. The breakdown value is the smallest fraction of contamination that can cause arbitrarily large corruption of an estimator.

P_n breaks down if 25% of the pairwise means are contaminated. Arbitrarily changing m of the original observations leaves n - m uncontaminated, so \binom{n-m}{2} pairwise means remain uncontaminated. P_n stays bounded so long as more than 75% of the pairwise means are uncontaminated, i.e.

\binom{n-m}{2} \Big/ \binom{n}{2} > 0.75.

Writing \varepsilon = m/n and letting n \to \infty, this becomes (1 - \varepsilon)^2 > 0.75, so the asymptotic breakdown value of P_n is 1 - \sqrt{3}/2 \approx 13.4%.

Why the interquartile range?

Consider the generalised estimator

P_n(\tau) = H_n^{-1}\!\left(\tfrac{1+\tau}{2}\right) - H_n^{-1}\!\left(\tfrac{1-\tau}{2}\right),

whose asymptotic breakdown point is \varepsilon^* = 1 - \sqrt{(1+\tau)/2}.

[Figure: asymptotic breakdown point (falling from about 0.29 towards 0) and asymptotic efficiency (between 0.8 and 1.0) of P_n(\tau) as \tau ranges over (0, 1). The choice \tau = 0.5 recovers the interquartile range of the pairwise means.]

Asymptotic relative efficiency when f = t_\nu for \nu \in [1, 10]

[Figure: asymptotic relative efficiency of P_n, Q_n and the MAD as the degrees of freedom run from 1 to 10; y-axis from 0.4 to 1.0.]

Adaptive trimming: P_n^*

Given preliminary high-breakdown location and scale estimates m(X) and s(X), an observation X_i is trimmed if

\frac{|X_i - m(X)|}{s(X)} > d,   (4)

where d is a tuning parameter. This achieves the best possible breakdown value for a sensible choice of d.

Definition. Denote by P_n^* the adaptively trimmed P_n with d = 5.
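
A sketch of the trimming rule (4), reusing pn_scale from the earlier sketch. The slides do not pin down which preliminary estimates to use, so taking m(X) as the median and s(X) as the consistency-scaled MAD is our assumption; any high-breakdown pair would do.

```python
import numpy as np

def pn_trimmed(x, d=5.0, c=1.048):
    """Adaptively trimmed P_n: drop any X_i with |X_i - m(X)|/s(X) > d,
    then apply P_n to the retained observations. Median and MAD serve
    here as the preliminary high-breakdown location/scale estimates
    (an assumption; the slides leave m and s unspecified)."""
    x = np.asarray(x, dtype=float)
    m = np.median(x)
    s = 1.4826 * np.median(np.abs(x - m))   # MAD, Gaussian-consistent
    kept = x[np.abs(x - m) / s <= d]
    return pn_scale(kept, c=c)              # pn_scale from the earlier sketch
```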

Efficiency of P_n in finite samples

Following Randal (2008), efficiencies are estimated over m independent samples as

\widehat{\mathrm{eff}}(T) = \frac{\mathrm{Var}(\ln \hat\sigma_1, \ldots, \ln \hat\sigma_m)}{\mathrm{Var}(\ln T(X_1), \ldots, \ln T(X_m))},   (5)

where, for each i = 1, 2, \ldots, m, X_i is an independent sample of size n, \hat\sigma_i is the ML scale estimate, and T(X_i) is the proposed scale estimate.

Distributions considered:
- t distributions with degrees of freedom between 1 and 10;
- configural polysampling using Tukey's three corners: Gaussian, One-wild and Slash.
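
The estimator (5) is straightforward to reproduce by Monte Carlo. The sketch below checks a scale estimator against the Gaussian ML scale at n = 20; the replication count and seed are illustrative choices, and the slides also run this for t distributions and the other Tukey corners.

```python
import numpy as np

def estimated_efficiency(estimator, n=20, m=10_000, seed=0):
    """Estimate eff(T) = Var(ln sigma_hat_i) / Var(ln T(X_i)) over m
    Gaussian samples of size n, following (5). The benchmark sigma_hat
    is the Gaussian ML scale estimate."""
    rng = np.random.default_rng(seed)
    log_ml = np.empty(m)
    log_T = np.empty(m)
    for i in range(m):
        x = rng.normal(size=n)
        log_ml[i] = np.log(np.sqrt(np.mean((x - x.mean()) ** 2)))  # ML scale
        log_T[i] = np.log(estimator(x))
    return np.var(log_ml) / np.var(log_T)

print(estimated_efficiency(pn_scale))   # roughly 0.85 at the Gaussian, n = 20
```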

Aside: the slash distribution

If Z \sim N(0, 1) and U \sim U(0, 1) are independent, then X = \mu + \sigma Z / U follows the slash distribution. When \mu = 0 and \sigma = 1 we have the standard slash distribution.

[Figure: densities of the N(0, 1), Cauchy and standard slash distributions on [-4, 4].]
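
Sampling from the slash distribution is a one-liner given its definition; a small sketch:

```python
import numpy as np

def rslash(size, mu=0.0, sigma=1.0, seed=None):
    """Slash draws: X = mu + sigma * Z / U with Z ~ N(0, 1) and
    U ~ Uniform(0, 1) independent."""
    rng = np.random.default_rng(seed)
    return mu + sigma * rng.normal(size=size) / rng.uniform(size=size)
```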

Relative efficiencies: f = t_\nu for \nu \in [1, 10] and n = 20

[Figure: estimated relative efficiencies of P_n, P_n^* and Q_n as the degrees of freedom run from 1 to 10; y-axis from 0.4 to 1.0.]

Relative efficiencies at the Slash corner (n = 20)

[Figure: estimated relative efficiencies at the Slash corner: P_n 48%, P_n^* 75%, Q_n 95%, MAD 87%; the 10%-trimmed P_n, S_n and the IQR are also shown.]

Relative efficiencies at the One-wild corner (n = 20)

[Figure: estimated relative efficiencies at the One-wild corner: P_n 84%, P_n^* 72%, Q_n 68%, MAD 40%; the 10%-trimmed P_n, S_n and the IQR are also shown.]

Relative efficiencies at the Gaussian corner (n = 20)

[Figure: estimated relative efficiencies at the Gaussian corner: P_n 85%, P_n^* 81%, Q_n 67%, MAD 38%; the 10%-trimmed P_n, S_n and the IQR are also shown.]

Outline
- Robust scale estimation
- Robust covariance estimation
- Robust covariance matrices
- Summary and key references

From scale to covariance: the GK device

Gnanadesikan and Kettenring (1972) relate scale and covariance through the identity

\mathrm{cov}(X, Y) = \frac{1}{4\alpha\beta}\left[\mathrm{var}(\alpha X + \beta Y) - \mathrm{var}(\alpha X - \beta Y)\right],

where X and Y are random variables. In general X and Y can have different units, so we set \alpha = 1/\sqrt{\mathrm{var}(X)} and \beta = 1/\sqrt{\mathrm{var}(Y)}. Replacing the variances with P_n^2, we can similarly construct

\gamma_P(X, Y) = \frac{1}{4\alpha\beta}\left[P_n^2(\alpha X + \beta Y) - P_n^2(\alpha X - \beta Y)\right],

where \alpha = 1/P_n(X) and \beta = 1/P_n(Y).
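
The GK device translates directly into code. A sketch of \gamma_P, reusing pn_scale from the earlier sketch; passing a different scale function gives the corresponding \gamma_S.

```python
import numpy as np

def gk_cov(x, y, scale=pn_scale):
    """Pairwise covariance via the Gnanadesikan-Kettenring identity,
    with the variances replaced by a (robust) squared scale estimate:
    gamma(X, Y) = [s(aX + bY)^2 - s(aX - bY)^2] / (4ab),
    where a = 1/s(X) and b = 1/s(Y)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a, b = 1.0 / scale(x), 1.0 / scale(y)
    return (scale(a * x + b * y) ** 2 - scale(a * x - b * y) ** 2) / (4 * a * b)
```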

Robustness properties

\gamma_P inherits the 13.4% breakdown value from P_n. Following Genton and Ma (1999), we have shown that the influence curve, and therefore the gross error sensitivity, of \gamma_P can be derived from the IC of P_n.

[Figure: surface plot of IC((x, y); \gamma_P, \Phi) over (x, y) \in [-4, 4]^2.]

The IC is bounded, hence the gross error sensitivity is finite.

Asymptotic variance

Genton and Ma (1999) show that the asymptotic variance of the resulting covariance estimator is directly proportional to that of the underlying scale estimator:

V(\gamma_S, \Phi_\rho) = 2(1 + \rho^2)\, V(S, \Phi).

[Figure: V(\gamma_S, \Phi_\rho) as a function of \rho for \gamma_{MAD}, \gamma_P and \gamma_{SD}.]

The asymptotic efficiency of \gamma_P relative to the standard deviation at the Gaussian remains 86%.

Bivariate t_\nu for \nu \in [3, 10], n = 20 and \rho = 0.5

[Figure: estimated efficiency relative to the MLE of \rho_P, \rho_Q and \rho_{MAD} as \nu runs from 3 to 10; y-axis from 0.4 to 1.0.]

Bivariate t_\nu for \nu \in [3, 10], n = 20 and \rho = 0.9

[Figure: estimated relative efficiency of \rho_P, \rho_Q and \rho_{MAD} as \nu runs from 3 to 10; y-axis from 0.4 to 1.0.]

Outline
- Robust scale estimation
- Robust covariance estimation
- Robust covariance matrices
- Summary and key references

Existing robust covariance matrix estimation methods
- M-estimates have breakdown point 1/(1 + p).
- The minimum volume ellipsoid (MVE) converges at rate n^{1/3}.
- S-estimates behave like the MLE for large p.
- The minimum covariance determinant (MCD) requires subsampling, so it is slow for large p.

Idea: construct a covariance matrix from pairwise robust covariance estimates instead.

Covariance matrices

Good covariance matrices should be:
1. Positive semidefinite.
2. Affine equivariant: if \hat{C} is a covariance matrix estimator, A and a are constants and X is random, then \hat{C}(AX + a) = A \hat{C}(X) A^T.

How can we ensure that a matrix filled with pairwise covariances is a good covariance matrix? Maronna and Zamar (2002) outline the OGK routine, which ensures that the resulting covariance matrix is positive definite and approximately affine equivariant.

OGK procedure (Maronna and Zamar, 2002)

Let s(\cdot) be a scale function and X = [x_1, \ldots, x_p] \in \mathbb{R}^{n \times p} a data matrix with columns x_j.

1. Let D = \mathrm{diag}(s(x_1), \ldots, s(x_p)) and define Y = X D^{-1}.
2. Compute a correlation matrix by applying s(\cdot) to the columns of Y:
   u_{jk} = \tfrac{1}{4}\left(s(y_j + y_k)^2 - s(y_j - y_k)^2\right) for j \ne k, and u_{jj} = 1.
3. Compute the eigendecomposition U = E \Lambda E^T.
4. Project the data onto the eigenvector basis: Z = Y E.
5. Estimate the variances in the coordinate directions: \Gamma = \mathrm{diag}(s(z_1)^2, \ldots, s(z_p)^2).
6. The estimated covariance matrix is then \hat{\Sigma} = D E \Gamma E^T D.
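
A single pass of the OGK steps above, with the scale function pluggable so that s = P_n gives the OGK P_n estimator used in the simulations later. This is a sketch of one iteration only (Maronna and Zamar also discuss iterating and reweighting, omitted here), and the final step \hat{\Sigma} = D E \Gamma E^T D is our reading of the slide.

```python
import numpy as np

def ogk(X, scale=pn_scale):
    """One pass of the OGK procedure (Maronna and Zamar, 2002) with a
    pluggable scale function; scale=pn_scale gives OGK P_n."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    s_cols = np.array([scale(col) for col in X.T])
    D = np.diag(s_cols)
    Y = X / s_cols                                  # step 1: Y = X D^{-1}
    U = np.eye(p)                                   # step 2: robust 'correlations'
    for j in range(p):
        for k in range(j + 1, p):
            u = 0.25 * (scale(Y[:, j] + Y[:, k]) ** 2
                        - scale(Y[:, j] - Y[:, k]) ** 2)
            U[j, k] = U[k, j] = u
    _, E = np.linalg.eigh(U)                        # step 3: U = E Lambda E^T
    Z = Y @ E                                       # step 4: project onto eigenbasis
    Gamma = np.diag([scale(z) ** 2 for z in Z.T])   # step 5: directional variances
    return D @ E @ Gamma @ E.T @ D                  # step 6: positive definite by construction
```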

Precision matrices

In many practical applications the covariance matrix is not what is really required: PCA, Mahalanobis distances, LDA, etc. use the inverse covariance matrix, that is, the precision matrix \Theta = \Sigma^{-1}. Precision matrices are also of interest in modelling Gaussian Markov random fields, where zeros in the precision matrix correspond to conditional independence between variables.

Sparsity. In many applications it is useful to impose a level of sparsity on the estimated precision matrix.

Regularisation techniques

The graphical lasso (glasso) (Friedman, Hastie and Tibshirani, 2007) minimises the penalised negative Gaussian log-likelihood

f(\Theta) = \mathrm{tr}(\hat{\Sigma}\Theta) - \log\det\Theta + \lambda \|\Theta\|_1,

where \|\Theta\|_1 is the elementwise L_1 norm and \lambda is a tuning parameter controlling the amount of shrinkage.

Constrained l_1-minimisation for inverse matrix estimation (CLIME) (Cai, Liu and Luo, 2011) solves

\min \|\Theta\|_1 \quad \text{subject to } \|\hat{\Sigma}\Theta - I\|_\infty \le \lambda.

QUadratic Inverse Covariance (QUIC) (Hsieh et al., 2011) solves the same minimisation problem as the glasso but uses a second-order approach.

QUadratic Inverse Covariance (QUIC)

Given a regularisation parameter \lambda \ge 0, the l_1-regularised Gaussian MLE for \Theta = \Sigma^{-1} is found by solving the regularised log-determinant program

\arg\min_{\Theta \succ 0} f(\Theta) = \arg\min_{\Theta \succ 0} \big\{ \underbrace{\mathrm{tr}(\hat{\Sigma}\Theta) - \log\det\Theta}_{g(\Theta)} + \underbrace{\lambda \|\Theta\|_1}_{h(\Theta)} \big\},

where \hat{\Sigma} is the sample covariance matrix and \|\Theta\|_1 = \sum_{i,j} |\theta_{ij}|.

Key features:
- The l_1 regularisation promotes sparsity in \Theta.
- g(\Theta) is twice differentiable and strictly convex, so it admits a quadratic approximation.
- h(\Theta) is convex but not differentiable.

Algorithm 1: QUadratic Inverse Covariance

Input: symmetric matrix \hat{\Sigma}, scalar \lambda, stopping tolerance \epsilon
1: for t = 0, 1, 2, \ldots do
2:   Compute \Sigma_t = \Theta_t^{-1}.
3:   Form the second-order approximation \bar{f}(\Delta; \Theta_t) to f(\Theta_t + \Delta).
4:   Partition the variables into free and fixed sets.
5:   Find the Newton direction D_t using coordinate descent over the free variable set (a lasso problem).
6:   Determine a step size \alpha such that \Theta_{t+1} = \Theta_t + \alpha D_t is positive definite and the objective function sufficiently decreases.
7: end for
Output: sequence of iterates \Theta_t

Available as the R package QUIC.
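
QUIC itself ships as the R package noted above. As a hedged Python stand-in that minimises essentially the same l_1-penalised objective (by a first-order rather than second-order method), scikit-learn's graphical_lasso accepts a covariance matrix directly, so a robust plug-in such as the OGK P_n matrix from the earlier sketch can replace the sample covariance:

```python
import numpy as np
from sklearn.covariance import graphical_lasso

# Sketch: robust plug-in precision matrix estimation. graphical_lasso
# solves min tr(S Theta) - log det Theta + alpha * ||Theta||_1, the same
# objective QUIC targets; ogk and pn_scale are the earlier sketches.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(5), np.eye(5), size=120)
sigma_hat = ogk(X, scale=pn_scale)            # robust OGK P_n covariance
cov_est, theta_hat = graphical_lasso(sigma_hat, alpha=0.1)
print(np.round(theta_hat, 2))                 # sparse robust precision estimate
```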

Randomly scattered contamination

[Figure: 100 x 30 data matrix with randomly scattered contaminated cells. With p = 30 variables, contaminating even 0-10% of cells at random quickly contaminates anywhere up to 100% of the observation vectors (rows).]

Simulation design

Two scenarios: sparse and dense precision matrices \Theta.
- sample size n = 120
- number of variables p = 30
- data drawn from N(0, \Sigma), where \Sigma = \Theta^{-1}
- sparsity parameter \lambda = 0.1
- contamination from 0% to 3%, randomly scattered
- replications N = 1000

Performance metric: the entropy loss

L(\Theta, \hat{\Theta}) = \mathrm{tr}(\Theta^{-1}\hat{\Theta}) - \log|\Theta^{-1}\hat{\Theta}| - p.
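
The entropy loss is straightforward to compute; a minimal sketch for checking an estimate \hat{\Theta} against the truth (it is zero if and only if \hat{\Theta} = \Theta):

```python
import numpy as np

def entropy_loss(Theta, Theta_hat):
    """Entropy loss L = tr(Theta^{-1} Theta_hat) - log|Theta^{-1} Theta_hat| - p."""
    M = np.linalg.solve(Theta, Theta_hat)     # Theta^{-1} Theta_hat
    _, logdet = np.linalg.slogdet(M)
    return np.trace(M) - logdet - M.shape[0]
```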

Simulation design

[Figure: heatmaps of the 30 x 30 sparse and dense precision matrices \Theta, with entries ranging from -0.2 to 0.2.]

Sparse \Theta, outliers at 5

[Figure: entropy loss (y-axis from 2 to 6) against random contamination from 0% to 3% for the classical estimator, OGK P_n, OGK Q_n, OGK \tau, OGKw \tau and the MCD.]

Dense \Theta, outliers at 5

[Figure: entropy loss (y-axis from 2 to 6) against random contamination from 0% to 3% for the classical estimator, OGK P_n, OGK Q_n, OGK \tau, OGKw \tau and the MCD.]

Outline
- Robust scale estimation
- Robust covariance estimation
- Robust covariance matrices
- Summary and key references

Summary

1. Aim: efficient, robust and widely applicable scale and covariance estimators.
2. Method: the P_n scale estimator, combined with the GK identity and an orthogonalisation procedure for pairwise covariance matrices, or with the QUIC procedure for (sparse) precision matrix estimation.
3. Results: 86% asymptotic efficiency at the Gaussian distribution, with the robustness properties flowing through to the covariance estimates and beyond.

References

Hampel, F. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346):383-393.

Randal, J. (2008). A reinvestigation of robust scale estimation in finite samples. Computational Statistics & Data Analysis, 52(11):5014-5021.

Rousseeuw, P. and Croux, C. (1993). Alternatives to the median absolute deviation. Journal of the American Statistical Association, 88(424):1273-1283.

Serfling, R. J. (1984). Generalized L-, M-, and R-statistics. The Annals of Statistics, 12(1):76-86.

Genton, M. and Ma, Y. (1999). Robustness properties of dispersion estimators. Statistics & Probability Letters, 44(4):343-350.

Hsieh, C.-J., Sustik, M. A., Dhillon, I. S. and Ravikumar, P. K. (2011). Sparse inverse covariance matrix estimation using quadratic approximation. Advances in Neural Information Processing Systems 24, 2330-2338.

Gnanadesikan, R. and Kettenring, J. R. (1972). Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics, 28(1):81-124.

Maronna, R. and Zamar, R. (2002). Robust estimates of location and dispersion for high-dimensional datasets. Technometrics, 44(4):307-317.

Tarr, G., Müller, S. and Weber, N. C. (2012). A robust scale estimator based on pairwise means. Journal of Nonparametric Statistics, 24(1):187-199.