Geodesic Convexity and Regularized Scatter Estimation


Geodesic Convexity and Regularized Scatter Estimation
Lutz Duembgen (Bern), David Tyler (Rutgers)
Klaus Nordhausen (Turku/Vienna), Heike Schuhmacher (Bern), Markus Pauly (Ulm), Thomas Schweizer (Bern)
Düsseldorf, July 22, 2017

I. Geometry of Scatter Matrices
II. Geodesic Convexity and Coercivity
III. M-Functionals of Scatter
IV. Regularization

I. Geometry of Scatter Matrices

$\mathbb{R}^{q\times q}_{\mathrm{sym}} := \{A \in \mathbb{R}^{q\times q} : A = A^\top\}$
$\mathbb{R}^{q\times q}_{\mathrm{sym},+} := \{A \in \mathbb{R}^{q\times q}_{\mathrm{sym}} : A \text{ positive definite}\}$ (an open convex cone in $\mathbb{R}^{q\times q}_{\mathrm{sym}}$)
$\langle A, B \rangle := \mathrm{tr}(AB) = \sum_{i,j} A_{ij} B_{ij}$, $\quad \|A\|_F := \sqrt{\langle A, A \rangle}$

[Figure: the cone $\{z > \sqrt{x^2 + y^2}\}$ in coordinates $(x, y, z)$]

For $q = 2$, write
$A = \begin{bmatrix} z + x & y \\ y & z - x \end{bmatrix} = \begin{bmatrix} x & y \\ y & -x \end{bmatrix} + z I_2$.
Then $\|A\|_F^2 = 2(x^2 + y^2 + z^2)$, and $A$ is positive definite iff $z > \sqrt{x^2 + y^2}$.

$\mu \in \mathbb{R}^q$, $\Sigma \in \mathbb{R}^{q\times q}_{\mathrm{sym},+}$, and $\hat\Sigma$ = sample covariance matrix of $X_1, X_2, \dots, X_n$ i.i.d. $\sim N_q(\mu, \Sigma)$.

[Figure: m = 50 samples of size n = 100]

[Figure: m = 50 samples of size n = 500]

Suitable Geometry
Write $\hat\Sigma = \Sigma^{1/2} W \Sigma^{1/2}$; then $W$ is symmetric, has a universal distribution depending only on $(q, n)$, and $W \to_p I_q$ as $n \to \infty$.
Local distance measure at $\Sigma$:
$d_\Sigma(\Sigma, \hat\Sigma) := \|W - I_q\|_F$, i.e. $d_\Sigma(\Sigma_0, \Sigma_1) := \|\Sigma^{-1/2}(\Sigma_0 - \Sigma_1)\Sigma^{-1/2}\|_F$.

Global distance measure (geodesic distance)
$D_g(\Sigma_0, \Sigma_1) := \min \int_0^1 d_{\Sigma_t}(\Sigma_t, \Sigma_{t+dt}) = \min \int_0^1 \bigl\| \Sigma_t^{-1/2} \, \dot\Sigma_t \, \Sigma_t^{-1/2} \bigr\|_F \, dt$,
the minimum being taken over all smooth paths $[0,1] \ni t \mapsto \Sigma_t$ connecting $\Sigma_0$ and $\Sigma_1$.

Explicit solution
$A = \log\bigl(\Sigma_0^{-1/2} \Sigma_1 \Sigma_0^{-1/2}\bigr), \quad \Sigma_t = \Sigma_0^{1/2} \exp(tA)\, \Sigma_0^{1/2}, \quad D_g(\Sigma_0, \Sigma_1) = \|A\|_F.$
Note: $\exp(A) = \sum_{k=0}^\infty A^k/k!$, $\ \exp\bigl(U \mathrm{diag}(\lambda) U^\top\bigr) = U \mathrm{diag}(e^\lambda) U^\top$, $\ \log\bigl(U \mathrm{diag}(\lambda) U^\top\bigr) = U \mathrm{diag}(\log \lambda) U^\top$.
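These formulas are easy to evaluate numerically. The following is a minimal sketch (not part of the talk; numpy only, function names are mine) that computes the geodesic point $\Sigma_t$ and the distance $D_g$ via the spectral formulas in the note above, and checks two standard properties of this metric: the geodesic midpoint is equidistant from both endpoints, and $D_g$ is invariant under congruences $\Sigma \mapsto B\Sigma B^\top$.

```python
import numpy as np

def spd_power(S, p):
    """S^p for a symmetric positive definite S, via the spectral decomposition."""
    lam, U = np.linalg.eigh(S)
    return U @ np.diag(lam ** p) @ U.T

def spd_log(S):
    lam, U = np.linalg.eigh(S)
    return U @ np.diag(np.log(lam)) @ U.T

def spd_exp(A):
    lam, U = np.linalg.eigh(A)
    return U @ np.diag(np.exp(lam)) @ U.T

def geodesic_point(Sigma0, Sigma1, t):
    """Sigma_t = Sigma0^{1/2} exp(t A) Sigma0^{1/2} with A = log(Sigma0^{-1/2} Sigma1 Sigma0^{-1/2})."""
    S0h, S0hi = spd_power(Sigma0, 0.5), spd_power(Sigma0, -0.5)
    A = spd_log(S0hi @ Sigma1 @ S0hi)
    return S0h @ spd_exp(t * A) @ S0h

def geodesic_distance(Sigma0, Sigma1):
    """D_g(Sigma0, Sigma1) = ||log(Sigma0^{-1/2} Sigma1 Sigma0^{-1/2})||_F."""
    S0hi = spd_power(Sigma0, -0.5)
    return np.linalg.norm(spd_log(S0hi @ Sigma1 @ S0hi), 'fro')

rng = np.random.default_rng(0)
q = 4
M = rng.standard_normal((q, q)); Sigma0 = M @ M.T + np.eye(q)
M = rng.standard_normal((q, q)); Sigma1 = M @ M.T + np.eye(q)

# Midpoint property: the geodesic midpoint is equidistant from both endpoints.
mid = geodesic_point(Sigma0, Sigma1, 0.5)
print(geodesic_distance(Sigma0, Sigma1))
print(2 * geodesic_distance(Sigma0, mid), 2 * geodesic_distance(mid, Sigma1))  # same value

# Congruence invariance: D_g(B Sigma0 B', B Sigma1 B') = D_g(Sigma0, Sigma1).
B = rng.standard_normal((q, q))
print(geodesic_distance(B @ Sigma0 @ B.T, B @ Sigma1 @ B.T))
```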


Local-global parametrizations of $\mathbb{R}^{q\times q}_{\mathrm{sym},+}$: with $\Sigma = BB^\top$ for a nonsingular $B \in \mathbb{R}^{q\times q}$,
$\mathbb{R}^{q\times q}_{\mathrm{sym},+} = \bigl\{ B \exp(A) B^\top : A \in \mathbb{R}^{q\times q}_{\mathrm{sym}} \bigr\}$,
$\bigl\{ \Gamma \in \mathbb{R}^{q\times q}_{\mathrm{sym},+} : \det(\Gamma) = \det(\Sigma) \bigr\} = \bigl\{ B \exp(A) B^\top : A \in \mathbb{R}^{q\times q}_{\mathrm{sym}},\ \mathrm{tr}(A) = 0 \bigr\}$.

Note that for $q \ge 2$, the map $(\mathbb{R}^{q\times q}_{\mathrm{sym},+}, D_g) \ni \Sigma \mapsto \log(\Sigma) \in (\mathbb{R}^{q\times q}_{\mathrm{sym}}, \|\cdot\|_F)$ is not an isometry.

II. Geodesic Convexity and Coercivity

Geodesic Convexity: A function $f : \mathbb{R}^{q\times q}_{\mathrm{sym},+} \to \mathbb{R}$ is (strictly) geodesically convex if for every nonsingular $B \in \mathbb{R}^{q\times q}$ and nonzero $A \in \mathbb{R}^{q\times q}_{\mathrm{sym}}$, $f\bigl(B \exp(tA) B^\top\bigr)$ is (strictly) convex in $t \in \mathbb{R}$.
Equivalently: for every nonsingular $B \in \mathbb{R}^{q\times q}$, $f\bigl(B \,\mathrm{diag}(e^x)\, B^\top\bigr)$ is (strictly) convex in $x \in \mathbb{R}^q$.

Example: The function $f(\Sigma) := \log\det(\Sigma)$ is geodesically linear:
$\log\det\bigl(B \exp(A) B^\top\bigr) = \log\det(BB^\top) + \mathrm{tr}(A)$.

Verifying g-convexity for smooth functions (via the second characterization): for any nonsingular $B \in \mathbb{R}^{q\times q}$ and $x \in \mathbb{R}^q$,
$f\bigl(B \,\mathrm{diag}(e^x)\, B^\top\bigr) = f(BB^\top) + g_B^\top x + \tfrac{1}{2} x^\top H_B x + o(\|x\|^2)$ as $x \to 0$.
Then $f$ is g-convex iff $H_B \succeq 0$ for all $B$, and $f$ is strictly g-convex iff $H_B \succ 0$ for all $B$.

Example: For nonzero $v \in \mathbb{R}^q$, $f(\Sigma) := \log v^\top \Sigma v$ is g-convex. Indeed, for nonsingular $B \in \mathbb{R}^{q\times q}$ and $w := B^\top v$,
$f\bigl(B \,\mathrm{diag}(e^x)\, B^\top\bigr) = \log\bigl(w^\top \mathrm{diag}(e^x)\, w\bigr) = f(BB^\top) + g_B^\top x + \tfrac{1}{2} x^\top H_B x + o(\|x\|^2)$
with $g_B := \bigl(w_i^2/\|w\|^2\bigr)_{i=1}^q$ and $H_B := \mathrm{diag}(g_B) - g_B g_B^\top \succeq 0$.
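As a quick numerical sanity check (mine, not from the slides), one can trace $f\bigl(B \exp(tA) B^\top\bigr)$ for $f(\Sigma) = \log v^\top \Sigma v$ along a random geodesic direction and confirm that its second differences are nonnegative:

```python
import numpy as np

def spd_exp(A):
    """exp of a symmetric matrix via its spectral decomposition."""
    lam, U = np.linalg.eigh(A)
    return U @ np.diag(np.exp(lam)) @ U.T

rng = np.random.default_rng(1)
q = 5
v = rng.standard_normal(q)
B = rng.standard_normal((q, q))                       # nonsingular with probability 1
A = rng.standard_normal((q, q)); A = (A + A.T) / 2    # symmetric geodesic direction

def f_along(t):
    Sigma_t = B @ spd_exp(t * A) @ B.T
    return np.log(v @ Sigma_t @ v)

ts = np.linspace(-2.0, 2.0, 401)
vals = np.array([f_along(t) for t in ts])
second_diff = vals[:-2] - 2.0 * vals[1:-1] + vals[2:]
print(second_diff.min() >= -1e-10)                    # True: the profile is convex in t
```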

Remarks
- $\Sigma \mapsto f(\Sigma)$ is g-convex iff $\Sigma \mapsto f(\Sigma^{-1})$ is g-convex.
- Sums and pointwise suprema of g-convex functions are g-convex.
- Both $\log \lambda_{\max}(\Sigma)$ and $\log \lambda_{\max}(\Sigma^{-1}) = -\log \lambda_{\min}(\Sigma)$ are g-convex.
- If $f(\Sigma)$ is g-convex and $h : \mathbb{R} \to \mathbb{R}$ is convex and increasing, then $h(f(\Sigma))$ is g-convex.
- A local minimizer of a g-convex function is also a global minimizer.
- The only g-affine functions are $f(\Sigma) = c_1 + c_2 \log\det(\Sigma)$ with $c_1, c_2 \in \mathbb{R}$.

Geodesic Coercivity: Call $f$ g-coercive if $f(\Sigma) \to \infty$ as $\|\log(\Sigma)\|_F \to \infty$. Let $f : \mathbb{R}^{q\times q}_{\mathrm{sym},+} \to \mathbb{R}$ be g-convex / strictly g-convex. Then $\arg\min_\Sigma f(\Sigma)$ is nonempty and compact / a singleton iff $f$ is g-coercive.
Criterion: If $f$ is differentiable, it is g-coercive iff
$\lim_{t \to \infty} \frac{d}{dt} f\bigl(\exp(tA)\bigr) > 0$
for every nonzero $A \in \mathbb{R}^{q\times q}_{\mathrm{sym}}$.

III. M-Functionals of Scatter

True/empirical distribution: $P$ on $\mathbb{R}^q$ with center $0 \in \mathbb{R}^q$.
Working model/caricature for $P$:
$f_\Sigma(x) = C \det(\Sigma)^{-1/2} \exp\Bigl( -\frac{\rho(x^\top \Sigma^{-1} x)}{2} \Bigr)$
with $\rho(s)$ increasing in $s > 0$ and $s\rho'(s)$ increasing in $s > 0$; in other words, $\rho(e^x)$ is increasing and convex in $x \in \mathbb{R}$.

Target function (log-likelihood times $-2/n$):
$L(\Sigma, P) := -2 \int \log\bigl[f_\Sigma / f_I\bigr] \, dP = \int \bigl[\rho(x^\top \Sigma^{-1} x) - \rho(x^\top x)\bigr] \, P(dx) + \log\det(\Sigma)$
M-Functional of scatter: $\Sigma(P) := \arg\min_{\Sigma \in \mathbb{R}^{q\times q}_{\mathrm{sym},+}} L(\Sigma, P)$
M-estimator of scatter: with $\hat P$ the empirical distribution of $X_1, X_2, \dots, X_n$ i.i.d. $\sim P$, the estimator $\Sigma(\hat P)$ estimates $\Sigma(P)$.
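For the empirical distribution $\hat P$, the target function is a finite sum and is straightforward to evaluate. The sketch below (assumptions mine) uses the multivariate-t type function $\rho(s) = (\nu + q)\log(\nu + s)$, which satisfies the monotonicity conditions of the previous slide:

```python
import numpy as np

def L_empirical(Sigma, X, rho):
    """L(Sigma, P_hat) = mean of [rho(x' Sigma^{-1} x) - rho(x' x)] + log det(Sigma)."""
    Sigma_inv = np.linalg.inv(Sigma)
    d_Sigma = np.einsum('ij,jk,ik->i', X, Sigma_inv, X)   # x_i' Sigma^{-1} x_i
    d_I = np.einsum('ij,ij->i', X, X)                     # x_i' x_i
    _, logdet = np.linalg.slogdet(Sigma)
    return np.mean(rho(d_Sigma) - rho(d_I)) + logdet

nu, q = 3.0, 4
rho_t = lambda s: (nu + q) * np.log(nu + s)   # multivariate-t type rho

rng = np.random.default_rng(2)
X = rng.standard_normal((200, q))
print(L_empirical(np.eye(q), X, rho_t))        # exactly 0 by construction
print(L_empirical(2.0 * np.eye(q), X, rho_t))  # some other value; Sigma(P_hat) minimizes this over Sigma
```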

Recall
$L(\Sigma, P) = \int \bigl[\rho(x^\top \Sigma^{-1} x) - \rho(x^\top x)\bigr] \, P(dx) + \log\det(\Sigma), \qquad \Sigma(P) = \arg\min_{\Sigma \in \mathbb{R}^{q\times q}_{\mathrm{sym},+}} L(\Sigma, P).$
- $\rho(s) = s$: $\Sigma(P) = \mathrm{Var}(P)$.
- $s\rho'(s)$ bounded in $s \ge 0$: $\Sigma(\cdot)$ is moderately robust.
- $P$ elliptically symmetric with center $0$ and scatter $\Sigma$: $\Sigma(P) = c\,\Sigma$.

Good news: In general, $L(\cdot, P)$ is geodesically convex. Under mild regularity conditions on $P$ and $\rho$, $L(\cdot, P)$ is geodesically strictly convex and coercive.

Taylor expansion:
$L\bigl(B \,\mathrm{diag}(e^x)\, B^\top, P\bigr) = L(BB^\top, P) + g_B^\top x + \tfrac{1}{2} x^\top H_B x + o(\|x\|^2)$
with
$g_B := 1_q - \psi_B$, where $\psi_B := \int \rho'(\|x\|^2)\,(x_i^2)_{i=1}^q \, P_B(dx)$,
$H_B := \mathrm{diag}(\psi_B) + \int \rho''(\|x\|^2)\, \tilde{x}\tilde{x}^\top \, P_B(dx)$, where $\tilde{x} := (x_i^2)_{i=1}^q$,
$P_B := \mathcal{L}(B^{-1} X)$, $X \sim P$.
Consequences: existence, continuity and weak differentiability of $\Sigma(\cdot)$; fast algorithms for the computation of $\Sigma(\hat P)$ via a partial Newton method.
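The talk's partial Newton algorithm is not reproduced here. As a simpler, classical alternative (the standard reweighting/fixed-point scheme from the M-estimation literature, not the method above), $\Sigma(\hat P)$ can be approximated by iterating the estimating equation $\Sigma = n^{-1}\sum_i \rho'(x_i^\top \Sigma^{-1} x_i)\, x_i x_i^\top$; a sketch:

```python
import numpy as np

def scatter_m_estimator(X, rho_prime, n_iter=200, tol=1e-9):
    """Fixed-point iteration Sigma <- (1/n) sum_i rho'(x_i' Sigma^{-1} x_i) x_i x_i',
    for data assumed centered at 0."""
    n, q = X.shape
    Sigma = np.cov(X, rowvar=False)                          # starting value
    for _ in range(n_iter):
        d = np.einsum('ij,jk,ik->i', X, np.linalg.inv(Sigma), X)
        Sigma_new = (X * rho_prime(d)[:, None]).T @ X / n
        if np.linalg.norm(Sigma_new - Sigma, 'fro') <= tol * np.linalg.norm(Sigma, 'fro'):
            return Sigma_new
        Sigma = Sigma_new
    return Sigma

# Multivariate-t weights: rho(s) = (nu + q) log(nu + s), so rho'(s) = (nu + q) / (nu + s).
nu, q = 3.0, 4
rho_prime_t = lambda s: (nu + q) / (nu + s)

rng = np.random.default_rng(3)
X = rng.standard_normal((500, q)) @ np.diag([3.0, 2.0, 1.0, 1.0])
print(np.round(scatter_m_estimator(X, rho_prime_t), 2))      # roughly proportional to diag(9, 4, 1, 1)
```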

Symmetrization: Replace $\Sigma(P)$ with $\Sigma_s(P) := \Sigma(P \ominus P)$, where $P \ominus P := \mathcal{L}(X - X')$ with $X, X'$ i.i.d. $\sim P$.
The estimator uses
$\widehat{P \ominus P} := \binom{n}{2}^{-1} \sum_{1 \le i < j \le n} \delta_{X_j - X_i}$
or, with $1 \le k \le n$,
$\widehat{P \ominus P} := \frac{1}{nk} \sum_{i=1}^{n} \sum_{j=i+1}^{i+k} \delta_{X_j - X_i}$.
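A small sketch (mine) of how the symmetrized data sets behind these two empirical measures can be formed; for the incomplete version I assume the indices $j = i+1, \dots, i+k$ are taken modulo $n$:

```python
import numpy as np

def pairwise_differences_complete(X):
    """All n(n-1)/2 differences X_j - X_i with i < j."""
    i, j = np.triu_indices(X.shape[0], k=1)
    return X[j] - X[i]

def pairwise_differences_incomplete(X, k):
    """Only k 'forward neighbour' differences per observation (indices modulo n)."""
    diffs = [np.roll(X, -s, axis=0) - X for s in range(1, k + 1)]   # X_{i+s} - X_i
    return np.vstack(diffs)

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 3))
print(pairwise_differences_complete(X).shape)       # (4950, 3)
print(pairwise_differences_incomplete(X, 5).shape)  # (500, 3)
```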

No need to estimate the center of $P$.
$P$ elliptically symmetric around $\mu$ with scatter $\Sigma$: $\Sigma_s(P) = c\,\Sigma$.
Block independence property: $P = \mathcal{L}\Bigl(B \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}\Bigr)$ with independent $X_1 \in \mathbb{R}^{q(1)}$, $X_2 \in \mathbb{R}^{q(2)}$ implies
$\Sigma_s(P) = B \begin{bmatrix} \Sigma_1(P) & 0 \\ 0 & \Sigma_2(P) \end{bmatrix} B^\top.$

IV. Regularization

In high-dimensional settings replace $\Sigma(P)$ with
$\arg\min_{\Sigma \in \mathbb{R}^{q\times q}_{\mathrm{sym},+}} \bigl( L(\Sigma, P) + \alpha \,\mathrm{Pen}(\Sigma) \bigr), \qquad \alpha > 0,$
where $\mathrm{Pen} : \mathbb{R}^{q\times q}_{\mathrm{sym},+} \to \mathbb{R}$ satisfies
- $\mathrm{Pen}(c\Sigma) = \mathrm{Pen}(\Sigma)$ (scale invariance),
- $\mathrm{Pen}(\Sigma) \to \infty$ as $\lambda_{\max}(\Sigma)/\lambda_{\min}(\Sigma) \to \infty$.

Examples of penalties (with eigenvalues $\lambda_1, \dots, \lambda_q$ of $\Sigma$):
$\mathrm{Pen}_0(\Sigma) = \log \mathrm{tr}(\Sigma) + \log \mathrm{tr}(\Sigma^{-1}) = \log\Bigl(\sum_{i=1}^q \lambda_i\Bigr) + \log\Bigl(\sum_{i=1}^q \lambda_i^{-1}\Bigr)$
$\mathrm{Pen}_1(\Sigma) = q^{-1} \log\det(\Sigma) + \log \mathrm{tr}(\Sigma^{-1}) = q^{-1} \sum_{i=1}^q \log \lambda_i + \log\Bigl(\sum_{i=1}^q \lambda_i^{-1}\Bigr)$
$\mathrm{Pen}_2(\Sigma) = \log\det(\Sigma) + q \log \lambda_{\max}(\Sigma^{-1}) = \sum_{i=1}^q \log\bigl(\lambda_i / \lambda_{\min}(\Sigma)\bigr)$
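All three penalties depend on $\Sigma$ only through its eigenvalues, so they are cheap to compute. A small sketch (mine), which also illustrates their scale invariance:

```python
import numpy as np

def penalties(Sigma):
    """Pen_0, Pen_1, Pen_2 computed from the eigenvalues of Sigma."""
    lam = np.linalg.eigvalsh(Sigma)
    q = lam.size
    pen0 = np.log(lam.sum()) + np.log((1.0 / lam).sum())
    pen1 = np.mean(np.log(lam)) + np.log((1.0 / lam).sum())
    pen2 = np.sum(np.log(lam)) - q * np.log(lam.min())        # = sum_i log(lam_i / lam_min)
    return pen0, pen1, pen2

Sigma = np.diag([4.0, 1.0, 1.0])
print(penalties(Sigma))
print(penalties(3.0 * Sigma))   # essentially identical values: all three penalties are scale invariant
print(penalties(np.eye(3)))     # (2 log 3, log 3, 0): minimal at multiples of the identity
```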

These penalties $\mathrm{Pen}_j(\Sigma)$ are
- scale invariant,
- g-convex,
- g-coercive on $\{\Sigma : \det(\Sigma) = c\}$,
- strictly g-convex on $\{\Sigma : \det(\Sigma) = c\}$ ($\mathrm{Pen}_0$, $\mathrm{Pen}_1$),
with $\arg\min_\Sigma \mathrm{Pen}_j(\Sigma) = \{c I_q : c > 0\}$.

Example: Regularized version of Tyler's (1987) M-functional,
$f(\Sigma) = L(\Sigma, P) + \alpha \,\mathrm{Pen}(\Sigma)$ with $\rho(s) = q \log s$ and
$\mathrm{Pen}(\Sigma) = \begin{cases} \mathrm{Pen}_1(\Sigma) & \text{(Case 1)} \\ h(\mathrm{Pen}_1(\Sigma)) & \text{(Case 2)} \end{cases}$
On $\{\Sigma : \det(\Sigma) = 1\}$, $f$ is
- strictly g-convex,
- g-coercive in Case 1 if $P(V) < \bigl(1 + \tfrac{\alpha}{q}\bigr) \tfrac{\dim(V)}{q}$ whenever $V$ is a linear subspace with $1 \le \dim(V) < q$,
- g-coercive in Case 2 if $\lim_{s \to \infty} h(s)/s = \infty$.

Numerical experiment: For $q = 50$ and $n = 30$ consider $X_1, X_2, \dots, X_n$ i.i.d. $\sim \mathrm{Elliptic}_q(0, \Sigma)$ with $\Sigma = \mathrm{diag}(10, 5, 3, 2, 1, \dots, 1)^2$.

Compute
$\hat\Sigma_\alpha := \arg\min_\Sigma \bigl( L(\Sigma, \hat P) + \alpha\, h(\mathrm{Pen}_1(\Sigma)) \bigr)$
and $\hat\Sigma := \hat\Sigma_{\hat\alpha}$ with $\hat\alpha := \arg\min_{\alpha \in 2^{\mathbb{Z}}} \mathrm{CV}(\alpha)$, where
$\mathrm{CV}(\alpha) := \sum_{i=1}^n \bigl\{ \rho\bigl(x_i^\top \hat\Sigma_{\alpha,-i}^{-1} x_i\bigr) + \log\det\bigl(\hat\Sigma_{\alpha,-i}\bigr) \bigr\}$
and $\hat\Sigma_{\alpha,-i}$ is computed from the data without $x_i$ (leave-one-out cross-validation).
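Structurally, the cross-validation is a grid search over $\alpha = 2^k$. The sketch below (mine) spells this out; `fit_scatter` is a placeholder for any routine returning $\hat\Sigma_\alpha$ on a given data set, not the talk's algorithm:

```python
import numpy as np

def cv_score(X, alpha, rho, fit_scatter):
    """CV(alpha) = sum_i rho(x_i' Sigma_{alpha,-i}^{-1} x_i) + log det(Sigma_{alpha,-i})."""
    score = 0.0
    for i in range(X.shape[0]):
        Sigma_i = fit_scatter(np.delete(X, i, axis=0), alpha)   # leave observation i out
        d = X[i] @ np.linalg.inv(Sigma_i) @ X[i]
        score += rho(d) + np.linalg.slogdet(Sigma_i)[1]
    return score

def select_alpha(X, rho, fit_scatter, k_grid=range(1, 16)):
    """hat alpha = argmin over alpha = 2^k of CV(alpha)."""
    alphas = [2.0 ** k for k in k_grid]
    scores = [cv_score(X, a, rho, fit_scatter) for a in alphas]
    return alphas[int(np.argmin(scores))]

# Toy usage with a crude ridge-type stand-in (NOT the regularized M-estimator of the talk):
rng = np.random.default_rng(5)
X = rng.standard_normal((30, 10))
crude = lambda Y, a: np.cov(Y, rowvar=False) + a * np.eye(Y.shape[1])
print(select_alpha(X, lambda s: s, crude, k_grid=range(-6, 7)))
```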

[Figure: $\log\lambda(\Sigma)$ and $\log\lambda(\hat\Sigma)$ for $\hat\alpha = 2^7$]

[Figure: cross-validation, $\mathrm{CV}(2^k)$ versus $k$]

[Figure: first eigenvectors, $\|\hat u_1 - u_1\|$ versus $k$]

Eigenvalues: log λ(ˆσ ) log λ(σ ) versus k 5 10 15 20 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

[Figure: shape matrices, $D_g(\hat\Sigma, \Sigma)$ versus $k$]

Symmetrization and orthogonally invariant penalties:
$f(\Sigma) = L(\Sigma, P \ominus P) + \alpha\,\mathrm{Pen}(\Sigma)$ with $\mathrm{Pen}(U^\top \Sigma U) = \mathrm{Pen}(\Sigma)$ for all orthogonal $U \in \mathbb{R}^{q\times q}$.
Restricted block independence property: $P = \mathcal{L}\Bigl(U \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}\Bigr)$ with orthogonal $U \in \mathbb{R}^{q\times q}$ and independent $X_1 \in \mathbb{R}^{q(1)}$, $X_2 \in \mathbb{R}^{q(2)}$ implies
$\Sigma_s(P) = U \begin{bmatrix} \Sigma_1(P) & 0 \\ 0 & \Sigma_2(P) \end{bmatrix} U^\top.$

Open questions and ongoing work
- Symmetrized M-estimators: balanced incomplete versus complete U-statistics
- Asymptotics for regularized scatter estimators
- Algorithms for non-smooth g-convex penalties
- Using regularized scatter estimators in other contexts (classification, ICS/ICA, multivariate regression, ...)

References
- Auderset, Mazza & Ruh: Angular Gaussian and Cauchy estimation. JMVA (2005).
- Bhatia: Positive Definite Matrices. Princeton University Press (2007).
- Wiesel: Geodesic convexity and covariance estimation. IEEE Trans. Signal Process. (2012).
- Duembgen, Pauly & Schweizer: M-functionals of multivariate scatter. Statistics Surveys (2015).
- Duembgen, Nordhausen & Schuhmacher: New algorithms for M-estimation of multivariate scatter and location. JMVA (2016).
- R package fastM. CRAN (2014/2015).
- Duembgen & Tyler: Geodesic convexity and regularized scatter estimators. arXiv:1607.05455.