Sliced Inverse Regression

Sliced Inverse Regression. Ge Zhao (gzz13@psu.edu), Department of Statistics, The Pennsylvania State University.

Outline: Background of Sliced Inverse Regression (SIR); Dimension Reduction; Definition of SIR; Inverse Regression Curve; Algorithm of SIR; Discussion on SIR; Consistency and Sparsity of SIR.

Background. Regression analysis is a popular way of studying the relationship between a response variable $y \in \mathbb{R}$ and its explanatory variable $x \in \mathbb{R}^p$. In some cases finding a correct parametric model is not easy, which leads to nonparametric approaches; but as the dimension grows, more and more data are required. We therefore want a model that captures most or all of the interesting features with the fewest dimensions: $y = f(\beta_1' x, \beta_2' x, \ldots, \beta_K' x, \epsilon)$ with $K \ll p$.

Dimension Reduction. In the model $y = f(\beta_1' x, \ldots, \beta_K' x, \epsilon)$ with $K \ll p$, $f$ is not identifiable: it is an arbitrary function on $\mathbb{R}^{K+1}$. The $\beta_k$'s can also be changed, since $\beta' x$ only determines a projection onto a $K$-dimensional space. When $K$ is much smaller than $p$, we may claim to have reduced the dimension once most of the information about $y$ is retained and $\beta$ is estimated efficiently. Estimating the projection $\beta$ gives us the new space; each $\beta_k$ is called an effective dimension reduction (e.d.r.) direction.

Dimension Reduction (Continued). Ideal statement: $y = f(\beta_1' x, \ldots, \beta_K' x, \epsilon)$, $K \ll p$. Alternative statement: the conditional distribution of $y$ given $x$ depends on $x$ only through the $K$-dimensional variable $(\beta_1' x, \ldots, \beta_K' x)$, i.e. $y \perp\!\!\!\perp x \mid \beta' x$.
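
As a concrete illustration of such a model, here is a minimal simulated example (my own, not from the slides): $p = 10$ predictors, $K = 2$ e.d.r. directions, and a nonlinear link $f$ in the style of examples commonly used in the SIR literature. Later code sketches can be run on this toy data.

```python
import numpy as np

# Toy model with p = 10 predictors and K = 2 e.d.r. directions:
# y depends on x only through beta1'x and beta2'x.
rng = np.random.default_rng(0)
n, p = 500, 10
x = rng.standard_normal((n, p))

beta1 = np.zeros(p); beta1[0] = 1.0   # first e.d.r. direction (e_1)
beta2 = np.zeros(p); beta2[1] = 1.0   # second e.d.r. direction (e_2)
eps = 0.1 * rng.standard_normal(n)

# y = f(beta1'x, beta2'x, eps) with an unknown, nonlinear f
y = (x @ beta1) / (0.5 + (x @ beta2 + 1.5) ** 2) + eps
```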

Intuition of SIR. A difficulty arises: when the dimension $p$ is larger than $n$, regressing $y$ against $x$ does not make sense, and it is hard to view the data (coordinates) with traditional methods because of the high dimension. The idea: flip $y$ and $x$!

Definition of SIR. Consider the inverse model $x = g_p(y)$, so that each coordinate becomes a one-dimensional problem. The inverse regression curve $E(x \mid y)$ is a curve in $p$-dimensional space. Ideally, this curve hovers around a $K$-dimensional affine subspace. We will later show the relationship between this $K$-dimensional affine subspace and the effective dimension reduction space (the space spanned by the e.d.r. directions).

Definition of SIR (Continued). An affine-invariant criterion is the squared trace correlation, \[ R^2(b) = \max_{\beta \in B} \frac{(b' \Sigma_{xx} \beta)^2}{(b' \Sigma_{xx} b)(\beta' \Sigma_{xx} \beta)}. \] If $x$ is standardized as $z = \Sigma_{xx}^{-1/2}\{x - E(x)\}$, the inverse regression curve falls into a subspace which coincides with the e.d.r. space.
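
As a quick aid to interpreting this criterion, here is a small NumPy sketch (names are mine) that evaluates $R^2(b)$ for a candidate direction $b$ against a known e.d.r. space spanned by the columns of $B$, using the equivalent projection form of the maximization.

```python
import numpy as np

def squared_trace_correlation(b, B, Sigma):
    """R^2(b) = max_{beta in span(B)} (b' Sigma beta)^2 /
                ((b' Sigma b)(beta' Sigma beta)).

    Equivalently, the squared length of the projection of Sigma^{1/2} b
    onto the subspace Sigma^{1/2} span(B), normalized by ||Sigma^{1/2} b||^2.
    b: (p,) candidate direction; B: (p, K) columns spanning the e.d.r. space.
    """
    # symmetric square root of Sigma
    vals, vecs = np.linalg.eigh(Sigma)
    Sigma_half = vecs @ np.diag(np.sqrt(vals)) @ vecs.T
    u = Sigma_half @ b                      # Sigma^{1/2} b
    W, _ = np.linalg.qr(Sigma_half @ B)     # orthonormal basis of Sigma^{1/2} span(B)
    proj = W @ (W.T @ u)
    return float(proj @ proj / (u @ u))
```

For a $b$ inside the e.d.r. space the value is 1, and for a direction $\Sigma_{xx}$-orthogonal to it the value is 0; on the toy data above, `squared_trace_correlation(beta1, np.column_stack([beta1, beta2]), np.cov(x, rowvar=False))` should be close to 1.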

Algorithm of SIR. Given a data set $(y_i, x_i)$, $i = 1, 2, \ldots, n$:
1. Standardize $x$: $\tilde{x}_i = \hat\Sigma_{xx}^{-1/2}(x_i - \bar{x})$;
2. Divide the range of $y$ into $H$ slices $I_1, \ldots, I_H$; slice $h$ contains a proportion $\hat p_h$ of the observations;
3. Compute the sample mean of the standardized observations in each slice, denoted $\hat m_h$;
4. Conduct a weighted principal component analysis on the $\hat m_h$ via the weighted covariance matrix $\hat V = \sum_{h=1}^{H} \hat p_h \hat m_h \hat m_h'$;
5. Output $\hat\beta_k = \hat\eta_k' \hat\Sigma_{xx}^{-1/2}$, $k = 1, \ldots, K$, where $\hat\eta_1, \ldots, \hat\eta_K$ are the eigenvectors corresponding to the $K$ largest eigenvalues of $\hat V$.
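
A minimal NumPy sketch of these five steps (function and variable names are mine; equal-count slices are used, as discussed on a later slide):

```python
import numpy as np

def sir(x, y, H=10, K=2):
    """Sliced inverse regression, following the five steps on the slide.

    x: (n, p) predictors, y: (n,) response, H: number of slices,
    K: number of e.d.r. directions to return.
    Returns (betas, eigvals): betas is (K, p) with rows beta_k;
    eigvals are all p eigenvalues of V-hat in decreasing order.
    """
    n, p = x.shape

    # Step 1: standardize x_i -> Sigma_xx^{-1/2} (x_i - xbar)
    xbar = x.mean(axis=0)
    Sigma = np.cov(x, rowvar=False)
    vals, vecs = np.linalg.eigh(Sigma)
    Sigma_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T
    z = (x - xbar) @ Sigma_inv_half

    # Step 2: slice the range of y into H slices with (roughly) equal counts
    order = np.argsort(y)
    slices = np.array_split(order, H)

    # Steps 3-4: slice means of z and the weighted covariance V
    V = np.zeros((p, p))
    for idx in slices:
        p_h = len(idx) / n                  # proportion of observations in slice h
        m_h = z[idx].mean(axis=0)           # slice mean of standardized x
        V += p_h * np.outer(m_h, m_h)

    # Step 4 (cont.): principal components of V
    eigvals, eigvecs = np.linalg.eigh(V)
    top = np.argsort(eigvals)[::-1][:K]
    eta = eigvecs[:, top]                   # (p, K), columns eta_k

    # Step 5: transform back, beta_k = eta_k' Sigma_xx^{-1/2}
    betas = (Sigma_inv_half @ eta).T        # (K, p)
    return betas, np.sort(eigvals)[::-1]
```

On the toy data generated earlier, `sir(x, y, H=10, K=2)` returns two estimated directions together with the eigenvalues of $\hat V$ (the eigenvalues are reused in the test sketched further below).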

Remarks. Using the sample mean within each slice is just for simplicity; other methods can be used to estimate the inverse curve, such as kernel-based nonparametric regression, nearest neighbors, or smoothing splines. Here we are only interested in the orientation. The weighted version of PCA takes care of unequal slice sample sizes. In general we need $\lim_{n\to\infty} p/n = 0$ to guarantee consistency; if this is violated, we need more conditions. The first $K$ components locate the most important subspace; we will discuss how to decide $K$ later. The last step transforms back to the $\hat\beta_k$'s.

Further discussion. There is no need to standardize the $x_i$, but we then work with the sliced means directly through \[ \hat\Sigma_1 = \sum_{h=1}^{H} \hat p_h (\bar{x}_h - \bar{x})(\bar{x}_h - \bar{x})', \] where $\bar{x}_h$ is the sample mean within slice $h$. The eigen-decomposition is then performed on $\hat\Sigma_1$ relative to $\hat\Sigma_{xx}$ (a generalized eigenvalue problem) rather than on the standardized $\hat V$. One can also use other methods, such as a robust version, to standardize $x$; the purpose is to downweight or cut out influential design points.
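
A sketch of this unstandardized route (my own naming; it relies on the equivalence between PCA of $\hat V$ on the standardized scale and the generalized eigenproblem $\hat\Sigma_1 b = \lambda \hat\Sigma_{xx} b$):

```python
import numpy as np
from scipy.linalg import eigh

def sir_unstandardized(x, y, H=10, K=2):
    """SIR without explicitly standardizing x: form Sigma_1 from the raw
    sliced means and solve Sigma_1 b = lambda * Sigma_xx b, which is
    equivalent to the PCA step on the standardized scale."""
    n, p = x.shape
    xbar = x.mean(axis=0)
    Sigma_xx = np.cov(x, rowvar=False)

    order = np.argsort(y)
    Sigma_1 = np.zeros((p, p))
    for idx in np.array_split(order, H):
        d = x[idx].mean(axis=0) - xbar
        Sigma_1 += (len(idx) / n) * np.outer(d, d)

    # eigh(A, B) solves A v = lambda B v; take the K largest eigenvalues
    eigvals, eigvecs = eigh(Sigma_1, Sigma_xx)
    top = np.argsort(eigvals)[::-1][:K]
    return eigvecs[:, top].T, eigvals[top]   # rows are e.d.r. directions
```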

Further discussion (Continued). The slices could be of equal width, but we prefer to let the width vary from slice to slice so that the slices have similar sample sizes. We hope the range of each slice shrinks to 0 so that only local points contribute to the estimation; even with a large number of slices we still have good consistency. A common choice is \[ I_h = \big( F_y^{-1}\{(h-1)/H\},\; F_y^{-1}\{h/H\} \big], \] where $F_y$ is the distribution function of $y$. The choice of $H$ may affect the asymptotic variance of $\hat\beta$, but it is not as important as the bandwidth in a nonparametric model.

Further discussion (Continued). The expectation of the squared trace correlation between the $\beta_k' x$ and the $\hat\beta_k' x$ is given by \[ E\{R^2(\hat B)\} = 1 - \frac{p-K}{n}\left(1 + \frac{1}{K}\sum_{k=1}^{K} \lambda_k^{-1}\right) + o\!\left(\frac{1}{n}\right). \] To be really successful in picking up all $K$ dimensions for reduction, the inverse regression curve cannot be too straight; in other words, the first $K$ eigenvalues of $V$ must be significantly different from zero compared with the sampling error.

Further discussion (Continued). Theorem. If $x$ is normally distributed, then $n(p-K)\bar\lambda_{p-K}$ asymptotically follows a $\chi^2$ distribution with $(p-K)(H-K-1)$ degrees of freedom, where $\bar\lambda_{p-K}$ denotes the average of the smallest $p-K$ eigenvalues of $\hat V$. We can use this result to assess the number of components to retain.
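
A short sketch of how this test might be used in practice (the function name and the sequential-testing strategy are my own; the eigenvalues could come from the SIR sketch above, sorted in decreasing order):

```python
from scipy.stats import chi2

def test_num_components(eigvals, n, H, K):
    """Test of 'at most K significant directions' based on the theorem above:
    n*(p-K)*mean(smallest p-K eigenvalues of V-hat) is compared with a
    chi-square distribution on (p-K)*(H-K-1) degrees of freedom.

    eigvals: all p eigenvalues of V-hat, sorted in decreasing order.
    Returns the test statistic and its asymptotic p-value.
    """
    p = len(eigvals)
    lam_bar = eigvals[K:].mean()            # average of the smallest p-K eigenvalues
    stat = n * (p - K) * lam_bar
    df = (p - K) * (H - K - 1)
    return stat, chi2.sf(stat, df)
```

A common strategy is to evaluate the test for $K = 0, 1, 2, \ldots$ in turn and stop at the first $K$ for which the test no longer rejects.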

Theoretical results: Conditions.
1. The conditional distribution of $y$ given $x$ depends on $x$ only through the $K$-dimensional variable $(\beta_1' x, \ldots, \beta_K' x)$, i.e. $y \perp\!\!\!\perp x \mid \beta' x$;
2. For any $b \in \mathbb{R}^p$, the conditional expectation $E(b'x \mid \beta_1' x, \ldots, \beta_K' x)$ is a linear combination of $\beta_1' x, \ldots, \beta_K' x$.

Theoretical results: Inverse regression curve. Theorem. The centered inverse regression curve $E(x \mid y) - E(x)$ is contained in the linear subspace spanned by $\Sigma_{xx}\beta_k$, $k = 1, \ldots, K$, where $\Sigma_{xx}$ denotes the covariance matrix of $x$. Here $E\{E(x \mid y)\} = E(x)$ by the law of total expectation. The following corollary is straightforward. Corollary. Assume $x$ has been standardized to $z$; then the standardized inverse regression curve $E(z \mid y)$ is contained in the linear space generated by the standardized e.d.r. directions $\eta_1, \ldots, \eta_K$.

Theoretical results: Remarks. By the law of total variance, $E\{\mathrm{cov}(z \mid y)\} = \mathrm{cov}(z) - \mathrm{cov}\{E(z \mid y)\} = I - \mathrm{cov}\{E(z \mid y)\}$. Hence the two matrices share eigenvectors, and the largest eigenvalue of $E\{\mathrm{cov}(z \mid y)\}$ is one minus the smallest eigenvalue of $\mathrm{cov}\{E(z \mid y)\}$. The consistency is at the root-$n$ rate: with $p_h = \Pr\{y \in I_h\}$ and $m_h = E(z \mid y \in I_h)$, we have $\hat m_h \to m_h$ at rate $n^{-1/2}$, and $\hat V \to \sum_{h=1}^{H} p_h m_h m_h'$ at rate $n^{-1/2}$.

Conditions for a more detailed discussion.
1. Linearity condition: for any $b \in \mathbb{R}^p$, $E(b'x \mid \beta_1' x, \ldots, \beta_K' x)$ is a linear combination of $\beta_1' x, \ldots, \beta_K' x$;
2. Coverage condition: the dimension of the space spanned by the central curve is the same as the dimension of the central space;
3. Boundedness condition: there exist positive constants $C_1$ and $C_2$ such that $C_1 \le \lambda_{\min}(\Sigma_{xx}) \le \lambda_{\max}(\Sigma_{xx}) \le C_2$, where $\lambda_{\min}$ and $\lambda_{\max}$ are the smallest and largest eigenvalues of $\Sigma_{xx}$, respectively.

Conditions (Continued).
4. The central curve $E(x \mid y)$ has a finite fourth moment and is $\kappa$-sliced stable with respect to $y$. Sliced stability is an intrinsic property of $E(x \mid y)$: if we expect the slice estimate $\frac{1}{H}\sum_h \bar m_h \bar m_h'$ of $\mathrm{var}\{m(y)\}$ to be consistent, we must require the average loss of variance within each slice, $\frac{1}{c}\sum_i m_{h,i} m_{h,i}' - \bar m_h \bar m_h'$, to decrease as $H$ increases.

Consistency. Theorem. Assume the conditions above all hold. For sufficiently large $H$ and $n$, \[ \big\|\hat\Lambda_p - \Lambda_p\big\|_2 \le O_p\!\left( \frac{1}{H^{\kappa-1}} + \frac{H^2 p}{n} + \sqrt{\frac{H^2 p}{n}} \right), \] where $\Lambda_p = \mathrm{var}\{E(x \mid y)\}$ and $\hat\Lambda_p$ is its estimate. A direct corollary: if $p/n \to 0$, we may choose $H = \log(n/p)$ so that the right-hand side converges to 0; hence $\hat\Lambda_p$ is a consistent estimate of $\Lambda_p = \mathrm{var}\{E(x \mid y)\}$.
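
To make the corollary explicit, here is the routine check, under the bound as stated above and assuming $\kappa > 1$, that the choice $H = \log(n/p)$ drives every term to zero when $p/n \to 0$:
\[
\frac{1}{H^{\kappa-1}} = \frac{1}{\{\log(n/p)\}^{\kappa-1}} \to 0,
\qquad
\frac{H^2 p}{n} = \{\log(n/p)\}^{2}\,\frac{p}{n} \to 0,
\qquad
\sqrt{\frac{H^2 p}{n}} \to 0,
\]
since $\log(n/p) \to \infty$ and $t\{\log(1/t)\}^{2} \to 0$ as $t = p/n \to 0$.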

Consistency (Continue) Theorem Assume the conditions are all hold, x is sub-gaussian and lim n p/n = 0, then 1 Σ Λ xx p Σ 1 xx Λ p 2 0 with probability converging to 1 as n, where Σ xx = 1 n n x i x i. i=1

Consistency (Continued). Theorem. Assume the conditions all hold and $x \sim N(0, I_p)$. For the single index model $y = f(\beta^T x, \epsilon)$ we have:
1. When $\lim_{n\to\infty} p/n \in (0, \infty)$, $\|\hat\Lambda_p - \Lambda_p\|_2$ is (as a function of $p/n$) dominated by $p/n \vee \sqrt{p/n}$ as $H, n \to \infty$;
2. Let $\hat\beta$ be the principal eigenvector of $\hat\Lambda_p$. If $\lim_{n\to\infty} p/n \neq 0$, then there exists a positive constant $c(p/n) > 0$ such that $\liminf_{n\to\infty} E(\hat\beta, \beta) > c(p/n)$ with probability converging to 1.

Conditions for ultrahigh-dimensional SIR. We now discuss the case $p \gg n$.
5. $s = |S| \ll p$, where $S = \{\, i : \beta_j(i) \neq 0 \text{ for some } 1 \le j \le K \,\}$ and $|S|$ is the number of elements in $S$;
6. $\Sigma_{xx} \in \mathcal{U}(\epsilon_0, \alpha, C)$ and $\max_{1\le i\le p} r_i$ is bounded, where $r_i$ is the number of non-zero elements in the $i$-th row of $\Sigma_{xx}$ and \[ \mathcal{U}(\epsilon_0, \alpha, C) = \Big\{ \Sigma_{xx} : \max_j \sum_{i : |i-j| > l} |\sigma_{i,j}| \le C l^{-\alpha} \text{ for all } l > 0, \text{ and } 0 < \epsilon_0 \le \lambda_{\min}(\Sigma_{xx}) \le \lambda_{\max}(\Sigma_{xx}) \le \frac{1}{\epsilon_0} \Big\}. \]

Conditions for ultrahigh-dimensional SIR (Continued). Still for the case $p \gg n$:
7. There exist positive constants $C$ and $\omega$ such that $\mathrm{var}[E\{x(k) \mid y\}] > C/s^{\omega}$ whenever $E\{x(k) \mid y\}$ is not constant;
8. There exists a constant $K$ such that every coordinate $x(k)$ is sub-Gaussian and upper-exponentially bounded by $K$.
With these conditions we have the following theorems.

Ultrahigh-dimensional consistency. Theorem. Assume the conditions hold and let $t = a/s^{\omega}$, where $a$ is a sufficiently small positive constant such that $t < \mathrm{var}\{m(y, k)\}/2$ for any $k \in T$. Then:
1. $\hat T \subseteq T$ holds with probability at least \[ 1 - C_1 \exp\Big\{ -C_2 \frac{n}{H^2 s^{\omega}} + C_3 \log(H) + \log(p - s) \Big\}; \]
2. $T \subseteq \hat T$ holds with probability at least \[ 1 - C_4 \exp\Big\{ -C_5 \frac{n}{H^2 s^{\omega}} + C_6 \log(H) + \log(s) \Big\}; \]
for some positive constants $C_1, \ldots, C_6$.

Ultrahigh-dimensional consistency (Continued). Theorem. Under the same assumptions as the previous theorem and the same choice of $t$, let $\hat T = \hat T(t)$ and $H = \log\{n/(s^{\omega}\log p)\}$. Then \[ \big\| e\big(\hat\Lambda_p^{\hat T, \hat T}\big) - \Lambda_p \big\|_2 \to 0 \quad \text{as } n \to \infty, \] with probability converging to 1, where $e(\cdot)$ embeds the estimate computed on the coordinates in $\hat T$ back into a $p \times p$ matrix. As a direct corollary, \[ \big\| \hat\Sigma_{xx}^{-1} e\big(\hat\Lambda_p^{\hat T, \hat T}\big) - \Sigma_{xx}^{-1}\Lambda_p \big\|_2 \to 0 \quad \text{as } n \to \infty, \] with probability converging to 1.

Ultrahigh-dimensional algorithm.
1. Calculate $\mathrm{var}_{H,c}\{x(k)\}$ for $k = 1, \ldots, p$ according to \[ \mathrm{var}_{H,c}\{x(k)\} = \frac{1}{H-1}\sum_{h=1}^{H} \{\bar x_h(k) - \bar x(k)\}^2; \]
2. Let $\hat T = \{\, k : \mathrm{var}_{H,c}\{x(k)\} > t \,\}$ for an appropriate threshold $t$;
3. Let $\hat\Lambda_p^{\hat T, \hat T}$ be the SIR estimator of the conditional covariance matrix for the data $(y, x_{\hat T})$, computed as \[ \hat\Lambda = \frac{1}{H-1}\sum_{h=1}^{H} (\bar x_h - \bar x)(\bar x_h - \bar x)^T; \]

Ultrahigh-dimensional algorithm (Continued).
4. Calculate $\eta_i = e(\hat\eta_i^{\hat T})$, where $\hat\eta_i^{\hat T}$, $1 \le i \le K$, are the top $K$ eigenvectors of $\hat\Lambda^{\hat T, \hat T}$;
5. Calculate $\hat\beta_i = \hat\Sigma_{xx}^{-1}\eta_i$, where $\hat\Sigma_{xx}$ is a consistent estimate of $\Sigma_{xx}$;
6. The central space is estimated by the span of the $\hat\beta_i$'s.
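
A minimal NumPy sketch of steps 1-6 (my own naming; the default threshold and the use of only the $\hat T$-block of the sample covariance in step 5 are simplifying assumptions, standing in for "an appropriate $t$" and "a consistent estimate of $\Sigma_{xx}$"):

```python
import numpy as np

def dt_sir(x, y, H=10, K=2, t=None):
    """Sketch of the thresholded (ultrahigh-dimensional) SIR steps above.

    x: (n, p) predictors with p possibly >> n, y: (n,) response.
    t: screening threshold; if None, a simple placeholder default is used.
    Returns (betas, T_hat): betas is (K, p), rows are estimated directions.
    """
    n, p = x.shape
    xbar = x.mean(axis=0)

    # Slice means (equal-count slices, as in the basic algorithm)
    order = np.argsort(y)
    slice_means = np.vstack([x[idx].mean(axis=0)
                             for idx in np.array_split(order, H)])   # (H, p)

    # Step 1: coordinate-wise variance of the slice means
    var_Hc = ((slice_means - xbar) ** 2).sum(axis=0) / (H - 1)

    # Step 2: threshold to obtain the selected coordinates T_hat
    if t is None:
        # Placeholder default (an assumption): keep at most ~n/2 coordinates
        t = np.sort(var_Hc)[-min(n // 2, p)]
    T_hat = np.flatnonzero(var_Hc > t)

    # Step 3: SIR covariance estimate restricted to T_hat
    d = slice_means[:, T_hat] - xbar[T_hat]
    Lambda_TT = d.T @ d / (H - 1)

    # Step 4: top K eigenvectors on T_hat, embedded back into R^p
    vals, vecs = np.linalg.eigh(Lambda_TT)
    top = np.argsort(vals)[::-1][:K]
    eta = np.zeros((p, K))
    eta[T_hat, :] = vecs[:, top]

    # Step 5: multiply by an (estimated) inverse covariance; for simplicity
    # only the T_hat block of the sample covariance is inverted here.
    Sigma_TT = np.cov(x[:, T_hat], rowvar=False)
    betas = np.zeros((K, p))
    betas[:, T_hat] = np.linalg.solve(Sigma_TT, eta[T_hat, :]).T

    # Step 6: the central space is estimated by the span of the rows of betas
    return betas, T_hat
```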

Summary. Introduced sliced inverse regression (SIR); provided the algorithm of SIR; discussed the consistency of SIR; extended the original SIR to ultrahigh-dimensional SIR.

Thank you!