Sufficient dimension reduction via distance covariance


Sufficient dimension reduction via distance covariance
Wenhui Sheng and Xiangrong Yin, University of Georgia
July 17, 2013

Outline
1. Sufficient dimension reduction
2. The model
3. Distance covariance
4. Methodology
5. Simulation studies
6. Determining d
7. Real data analysis
8. Summary

Sufficient dimension reduction (SDR)
Dimension reduction subspace: Let $B$ be a $p \times q$ matrix with $q \le p$. If $Y \perp\!\!\!\perp X \mid B^T X$, then the space $\mathcal{S}(B)$, spanned by the columns of $B$, is a dimension reduction subspace.

Sufficient dimension reduction (SDR)
Central subspace (CS): If the intersection of all dimension reduction subspaces is itself a dimension reduction subspace, it is called the central subspace, denoted by $\mathcal{S}_{Y|X}$. Cook (1998) and Yin, Li and Cook (2008) showed that under mild conditions the CS exists and is unique. Since a dimension reduction subspace is not unique, the primary interest in SDR is to estimate the CS.

The model
We consider the following regression model:
$$Y \perp\!\!\!\perp X \mid \eta^T X, \qquad (2.1)$$
where $Y$ is a scalar response, $X$ is a $p \times 1$ predictor vector and $\eta$ is a $p \times d$ matrix with $d \le p$. Here $d = \dim(\mathcal{S}_{Y|X})$ is the structural dimension. Our goal is to estimate $\mathcal{S}_{Y|X}$ by finding an $\eta \in \mathbb{R}^{p \times d}$ that satisfies (2.1).

Distance covariance
Székely, Rizzo and Bakirov (2007) proposed distance covariance (DCOV) as a new distance measure of dependence between two random vectors. Let $U \in \mathbb{R}^p$ and $V \in \mathbb{R}^q$ have finite first moments. The DCOV between $U$ and $V$ is the nonnegative number $\mathcal{V}(U, V)$ defined by
$$\mathcal{V}^2(U, V) = \int_{\mathbb{R}^{p+q}} \left| f_{U,V}(t, s) - f_U(t) f_V(s) \right|^2 w(t, s) \, dt \, ds,$$

Distance covariance (cont'd)
where $f_U$ and $f_V$ are the characteristic functions of $U$ and $V$ respectively, and $f_{U,V}$ is their joint characteristic function. Here $|f|^2 = f \bar{f}$ for a complex-valued function $f$, with $\bar{f}$ the conjugate of $f$, and $w(t, s)$ is a specially chosen positive weight function; more details on $w(t, s)$ can be found in Székely, Rizzo and Bakirov (2007) and Székely and Rizzo (2009).

Properties of distance covariance
- $U$ and $V$ are independent if and only if $\mathcal{V}(U, V) = 0$.
- DCOV is effective at measuring nonlinear relationships.
- The sample version of $\mathcal{V}(U, V)$ is very simple, which benefits computation.
- Others.
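To illustrate how simple the sample version is, here is a minimal NumPy sketch of the sample squared distance covariance, using the double-centering formula given later in the talk; the function name `dcov2` is ours, not the authors'.

```python
import numpy as np

def dcov2(U, V):
    """Sample squared distance covariance V_n^2(U, V), following
    Szekely, Rizzo and Bakirov (2007). U and V are samples of size n,
    passed as (n,), (n, p) or (n, q) arrays."""
    U = np.asarray(U, dtype=float).reshape(len(U), -1)
    V = np.asarray(V, dtype=float).reshape(len(V), -1)
    a = np.linalg.norm(U[:, None, :] - U[None, :, :], axis=2)  # a_kl = ||U_k - U_l||
    b = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=2)  # b_kl = ||V_k - V_l||
    # Double centering: A_kl = a_kl - abar_k. - abar_.l + abar_..
    A = a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()
    B = b - b.mean(axis=1, keepdims=True) - b.mean(axis=0, keepdims=True) + b.mean()
    return (A * B).mean()  # equals (1/n^2) * sum_{k,l} A_kl * B_kl
```

For two independent samples, e.g. `dcov2(rng.standard_normal(100), rng.standard_normal(100))`, the statistic is close to zero, consistent with the characterization above.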

The method
We consider the squared distance covariance between $Y$ and $\beta^T X$, where $\beta$ is an arbitrary $p \times d_0$ matrix with $d_0 \le p$:
$$\mathcal{V}^2(\beta^T X, Y) = \int_{\mathbb{R}^{d_0+1}} \left| f_{\beta^T X, Y}(t, s) - f_{\beta^T X}(t) f_Y(s) \right|^2 w(t, s) \, dt \, ds.$$
We show that under a mild condition, solving
$$\max_{\beta^T \Sigma_X \beta = I_{d_0},\; 1 \le d_0 \le p} \mathcal{V}^2(\beta^T X, Y) \qquad (4.1)$$
will yield a basis of the central subspace. In (4.1) we use the scale constraint $\beta^T \Sigma_X \beta = I_{d_0}$, which is necessary to make the maximization procedure work.

The method (cont'd)
Proposition 1. Let $\eta$ be a basis of the CS and $\beta$ a $p \times d_0$ matrix with $d_0 \le d$, $\eta^T \Sigma_X \eta = I_d$ and $\beta^T \Sigma_X \beta = I_{d_0}$. Assume $\mathcal{S}(\beta) \subseteq \mathcal{S}(\eta)$. Then $\mathcal{V}^2(\beta^T X, Y) \le \mathcal{V}^2(\eta^T X, Y)$, with equality if and only if $\mathcal{S}(\beta) = \mathcal{S}(\eta)$.

The method (cont'd)
Proposition 2. Let $\eta$ be a basis of the CS and $\beta$ a $p \times d_1$ matrix with $\eta^T \Sigma_X \eta = I_d$ and $\beta^T \Sigma_X \beta = I_{d_1}$; here $d_1$ may be greater than, less than, or equal to $d$. Suppose $P^T_{\eta(\Sigma_X)} X \perp\!\!\!\perp Q^T_{\eta(\Sigma_X)} X$ and $\mathcal{S}(\beta) \neq \mathcal{S}(\eta)$. Then $\mathcal{V}^2(\beta^T X, Y) < \mathcal{V}^2(\eta^T X, Y)$.

The method (cont'd)
Independence condition: $P^T_{\eta(\Sigma_X)} X \perp\!\!\!\perp Q^T_{\eta(\Sigma_X)} X$, where $P_{\eta(\Sigma_X)}$ denotes the projection onto $\mathcal{S}(\eta)$ relative to the $\Sigma_X$ inner product and $Q_{\eta(\Sigma_X)} = I_p - P_{\eta(\Sigma_X)}$. The independence condition is satisfied when $X$ is normal. More generally, low-dimensional projections of the predictors are approximately multivariate normal (Diaconis and Freedman 1984; Hall and Li 1993).

Estimating the central subspace when d is specified
Suppose $d$ is known (a permutation test will be proposed later to estimate $d$). The estimator of $\eta$ is
$$\hat{\eta}_n = \arg\max_{\beta^T \hat{\Sigma}_X \beta = I_d} \mathcal{V}^2_n(\beta^T X, Y),$$
where $\mathcal{V}^2_n(\beta^T X, Y)$ is the sample version of $\mathcal{V}^2(\beta^T X, Y)$.

Estimating the central subspace when d is specified (cont'd)
The sample version of $\mathcal{V}^2(\beta^T X, Y)$ is
$$\mathcal{V}^2_n(\beta^T X, Y) = \frac{1}{n^2} \sum_{k,l=1}^n A_{kl}(\beta) B_{kl},$$
where
$$a_{kl}(\beta) = \|\beta^T X_k - \beta^T X_l\|, \quad \bar{a}_{k.}(\beta) = \frac{1}{n} \sum_{l=1}^n a_{kl}(\beta), \quad \bar{a}_{.l}(\beta) = \frac{1}{n} \sum_{k=1}^n a_{kl}(\beta), \quad \bar{a}_{..}(\beta) = \frac{1}{n^2} \sum_{k,l=1}^n a_{kl}(\beta),$$
$$A_{kl}(\beta) = a_{kl}(\beta) - \bar{a}_{k.}(\beta) - \bar{a}_{.l}(\beta) + \bar{a}_{..}(\beta), \qquad k, l = 1, \ldots, n.$$
Similarly, define $b_{kl} = |Y_k - Y_l|$ and $B_{kl} = b_{kl} - \bar{b}_{k.} - \bar{b}_{.l} + \bar{b}_{..}$ for $k, l = 1, \ldots, n$.
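The slides do not show how the constrained maximization is carried out numerically. Below is a rough sketch of one generic approach, not the authors' algorithm: sphere $X$ so the constraint $\beta^T \hat{\Sigma}_X \beta = I_d$ becomes orthonormality, reparameterize the orthonormal matrix through a QR factorization, and hand the nonconvex problem to a general-purpose optimizer with multistart. It reuses the `dcov2` sketch above; the name `estimate_eta` and all tuning choices are our assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.linalg import sqrtm

def estimate_eta(X, Y, d, n_starts=5, seed=0):
    """Naive estimator of a basis of the central subspace for known d:
    maximize dcov2(beta^T X, Y) subject to beta^T Sigma_hat beta = I_d."""
    n, p = X.shape
    Sig_half = np.real(sqrtm(np.cov(X, rowvar=False)))   # Sigma_hat^{1/2}
    Sig_half_inv = np.linalg.inv(Sig_half)
    Z = (X - X.mean(axis=0)) @ Sig_half_inv              # sphered predictors

    def neg_obj(w):
        # QR maps an unconstrained p*d vector to an orthonormal p x d matrix,
        # so gamma^T gamma = I_d and beta = Sigma^{-1/2} gamma meets the constraint.
        gamma, _ = np.linalg.qr(w.reshape(p, d))
        return -dcov2(Z @ gamma, Y)

    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_starts):                            # multistart: nonconvex objective
        res = minimize(neg_obj, rng.standard_normal(p * d), method="Nelder-Mead")
        if best is None or res.fun < best.fun:
            best = res
    gamma, _ = np.linalg.qr(best.x.reshape(p, d))
    return Sig_half_inv @ gamma                          # beta^T Sigma_hat beta = I_d
```

Sphering is convenient because distance covariance is invariant to shifts, so $\mathcal{V}^2_n(\beta^T X, Y)$ can be computed directly from the projected sphered data.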

Asymptotic properties
Proposition 3. Assume $\eta$ is a basis of the central subspace $\mathcal{S}_{Y|X}$ with $\eta^T \Sigma_X \eta = I_d$. Suppose the support of $X$ is compact, $E|Y| < \infty$ and $P^T_{\eta(\Sigma_X)} X \perp\!\!\!\perp Q^T_{\eta(\Sigma_X)} X$. Let $\hat{\eta}_n = \arg\max_{\beta^T \hat{\Sigma}_X \beta = I_d} \mathcal{V}^2_n(\beta^T X, Y)$. Then $\hat{\eta}_n$ is a consistent estimator of a basis of $\mathcal{S}_{Y|X}$; that is, there exists a rotation matrix $Q$ with $Q^T Q = I_d$ such that $\hat{\eta}_n \stackrel{P}{\longrightarrow} \eta Q$.

Asymptotic properties (cont'd)
Proposition 4. Assume $\eta$ is a basis of the central subspace $\mathcal{S}_{Y|X}$ with $\eta^T \Sigma_X \eta = I_d$. Suppose the support of $X$ is compact, $E|Y| < \infty$ and $P^T_{\eta(\Sigma_X)} X \perp\!\!\!\perp Q^T_{\eta(\Sigma_X)} X$. Let $\hat{\eta}_n = \arg\max_{\beta^T \hat{\Sigma}_X \beta = I_d} \mathcal{V}^2_n(\beta^T X, Y)$. Then, under the regularity conditions given in the supplementary file, there exists a rotation matrix $Q$ with $Q^T Q = I_d$ such that $\sqrt{n}(\hat{\eta}_n - \eta Q) \stackrel{D}{\longrightarrow} N(0, V_{11})$, where $V_{11}$ is the covariance matrix defined in the supplementary file.

Simulation studies
Consider the following two models. Let $\beta_1 = (1, 0, 0, 0, 0, 0)^T$, $\beta_2 = (0, 1, 0, 0, 0, 0)^T$, $\beta_3 = (1, 1, 1, 0, 0, 0)^T$, $\beta_4 = (1, 0, 0, 0, 1, 3)^T$ and $n = 100$. The models are
(A) $Y = (\beta_1^T X)^2 + \beta_2^T X + 0.1\epsilon$
(B) $Y = (\beta_3^T X)^2 + 3 \sin(\beta_4^T X / 4) + 0.2\epsilon$
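For concreteness, a small simulator for both models under the Part (1) design described on the next slide, assuming standard normal errors (the slides do not state the error distribution):

```python
import numpy as np

def simulate(model, n=100, seed=0):
    """Draw (X, Y) from model A or B with p = 6 and X ~ N(0, I_6)
    (the Part (1) design); eps ~ N(0, 1) is our assumption."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, 6))
    eps = rng.standard_normal(n)
    if model == "A":
        b1 = np.array([1., 0, 0, 0, 0, 0])
        b2 = np.array([0., 1, 0, 0, 0, 0])
        Y = (X @ b1) ** 2 + (X @ b2) + 0.1 * eps
    else:
        b3 = np.array([1., 1, 1, 0, 0, 0])
        b4 = np.array([1., 0, 0, 0, 1, 3])
        Y = (X @ b3) ** 2 + 3 * np.sin(X @ b4 / 4) + 0.2 * eps
    return X, Y
```

Both models have two true directions, so the structural dimension is $d = 2$; e.g. `X, Y = simulate("A")` followed by `estimate_eta(X, Y, 2)` recovers a basis of $\mathrm{span}(\beta_1, \beta_2)$ up to rotation.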

Simulation studies (cont'd)
For each model, three kinds of predictors are considered. Part (1): $X \sim N(0, I_6)$; Part (2): $X$ is continuous but nonnormal; Part (3): $X$ is discrete. We compare with SIR (Li 1991), SAVE (Cook and Weisberg 1991), PHD (Li 1992) and LAD (Cook and Forzani 2009).

Model A
Table: comparison under model A, with $(n, p) = (100, 6)$ in all parts; entries are $m$ (SE).

Method   Part (1)          Part (2)          Part (3)
SIR      0.9025 (0.1184)   0.6283 (0.1834)   0.5422 (0.1877)
PHD      0.8288 (0.1604)   0.8568 (0.1481)   0.7382 (0.2554)
SAVE     0.4227 (0.1822)   0.4688 (0.1859)   0.4945 (0.2506)
LAD      0.2952 (0.1047)   0.2869 (0.0936)   *
Dcov     0.2014 (0.0570)   0.1816 (0.0632)   0.0083 (0.0727)

(* LAD does not work sometimes.)

Model B
Table: comparison under model B, with $(n, p) = (100, 6)$ in all parts; entries are $m$ (SE).

Method   Part (1)          Part (2)          Part (3)
SIR      0.8606 (0.1719)   0.8585 (0.1705)   0.9607 (0.0648)
PHD      0.8916 (0.1299)   0.9594 (0.0644)   0.7673 (0.1978)
SAVE     0.5870 (0.2487)   0.6626 (0.2811)   0.7987 (0.1780)
LAD      0.2846 (0.1129)   0.2866 (0.1338)   *
Dcov     0.2271 (0.0903)   0.2205 (0.0839)   0.4200 (0.2593)

(* LAD does not work sometimes.)

Determining d
We want to test conditional independence. Given $\beta = (\beta_1, \beta_2, \ldots, \beta_k)$ with $\mathcal{S}(\beta) \subseteq \mathcal{S}_{Y|X}$ and $\beta_\perp = (\beta_{k+1}, \ldots, \beta_p)$ such that $(\beta, \beta_\perp)$ forms an orthogonal basis of $\mathbb{R}^p$, we want to test whether $Y \perp\!\!\!\perp X \mid \beta^T X$ holds. The permutation test we suggest here tests the independence between $(Y, \beta^T X)$ and $\beta_\perp^T X$. Without the further assumption $\beta^T X \perp\!\!\!\perp \beta_\perp^T X$, we can only obtain an upper bound of $d$.

Determining d (cont'd)
We use the permutation test to determine the dimensionality of the central subspace, $d = \dim(\mathcal{S}_{Y|X})$. The sample size is $n = 200$ with $p = 6$; for each model and each part we use 100 replications.

Table: permutation test

Model   Normal   Nonnormal   Discrete
A       93%      100%        100%
B       100%     87%         62%

Entries are the percentages of replications estimating $d = 2$ and $d = 3$; the true structural dimension is $d = 2$ for both models.
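A sketch of how this permutation scheme could be implemented, under our reading of the slide: for each candidate $k$, estimate the first $k$ directions, build an orthogonal complement, and compare the observed distance covariance between $(Y, \hat\beta^T X)$ and $\hat\beta_\perp^T X$ with its permutation distribution. The function names, the significance level, and the sequential search over $k$ are our assumptions; the code reuses `dcov2` and `estimate_eta` from the earlier sketches.

```python
import numpy as np

def perm_pvalue(Y, X_low, X_rest, n_perm=500, seed=0):
    """Permutation p-value for independence of (Y, X_low) and X_rest via dcov2."""
    rng = np.random.default_rng(seed)
    W = np.column_stack([Y, X_low])                    # the (Y, beta^T X) block
    stat = dcov2(W, X_rest)
    null = np.array([dcov2(W, X_rest[rng.permutation(len(Y))])
                     for _ in range(n_perm)])          # permuting rows breaks dependence
    return (1 + np.sum(null >= stat)) / (1 + n_perm)

def estimate_d(X, Y, alpha=0.05):
    """Smallest k whose test fails to reject, as a sketch of the scheme."""
    p = X.shape[1]
    for k in range(1, p):
        beta = estimate_eta(X, Y, k)                   # first k Dcov directions
        # Euclidean orthogonal complement of S(beta), via a full QR factorization
        Q, _ = np.linalg.qr(beta, mode="complete")
        beta_perp = Q[:, k:]
        if perm_pvalue(Y, X @ beta, X @ beta_perp) > alpha:
            return k
    return p
```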

Bird, plane or car
This data set concerns the identification of sounds made by birds, planes and cars. A two-hour recording was made in the city of Ermont, France, and 5-second snippets of interesting sounds were then manually selected. This resulted in 58 recordings identified as birds, 44 as cars and 67 as planes. Each recording was further processed and ultimately represented by 13 SDMFCCs (Scale-Dependent Mel-Frequency Cepstrum Coefficients).
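A hypothetical sketch of how the figure below could be produced once the 13 SDMFCC features and class labels are in hand. The file name, variable names, and the numeric coding of the class labels are our assumptions (the slides do not describe them), and `estimate_eta` is the earlier sketch.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data layout: feats is a (169, 13) matrix of SDMFCCs and
# labels an array of strings in {"bird", "plane", "car"}.
feats = np.loadtxt("sdmfcc.csv", delimiter=",")        # hypothetical file
labels = np.loadtxt("labels.csv", dtype=str)           # hypothetical file

Y = np.searchsorted(np.unique(labels), labels).astype(float)  # numeric class codes
beta = estimate_eta(feats, Y, d=2)                     # first two Dcov directions
proj = feats @ beta

for cls, color in [("bird", "black"), ("plane", "red"), ("car", "green")]:
    m = labels == cls
    plt.scatter(proj[m, 0], proj[m, 1], c=color, label=cls, s=12)
plt.xlabel("First Dcov direction")
plt.ylabel("Second Dcov direction")
plt.legend()
plt.show()
```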

Bird, plane or car
Figure: plot of the first two Dcov directions for the birds-planes-cars example (birds in black, planes in red, cars in green).

Summary
1. The article extends the methodology of the single-index paper to the multiple-index model, and it uses a permutation test to estimate the dimensionality of the central subspace.
2. The method has much weaker assumptions on the distribution of the predictors, and it works very efficiently with discrete predictors.
3. The article establishes new theoretical properties of $\mathcal{V}^2(\beta^T X, Y)$.

Thank You! Q & A