Cross-Sectional Regression after Factor Analysis: Two Applications

1 Cross-Sectional Regression after Factor Analysis: Two Applications. Joint work with Jingshu, Trevor, Art; Yang Song (GSB). May 7, 2016

2 Overview

3 Outline

4 Data matrix $Y \in \mathbb{R}^{n \times p}$. Panel data. Transposable data. Modern datasets: usually high dimensional (both $n, p \gg 1$). Two examples: 1. Gene expressions (row: cell; column: gene). 2. Mutual fund monthly returns (row: month; column: fund).

5 Gene discovery. Which genes are associated with a treatment/condition? Let $X \in \mathbb{R}^{n \times 1}$ be the treatment vector. Simple linear regression: $\mathrm{col}_j(Y) = \alpha_j X + \epsilon_j$. Equivalently, $Y_{n \times p} = X_{n \times 1} \alpha^T_{p \times 1} + \epsilon_{n \times p}$. Statistical significance of $\alpha_j$, multiple testing...

6 Mutual fund selection. How skillful is a mutual fund manager? Let $Z \in \mathbb{R}^{n \times d}$ be the well-known systematic risk factors. The Fama-French-Carhart four-factor model: 1. Market Minus Risk Free; 2. Small [market capitalization] Minus Big; 3. High [book-to-market ratio] Minus Low; 4. Momentum. Simple linear regression for mutual fund $j$: $\mathrm{col}_j(Y) = \alpha_j 1_n + Z \beta_j + \epsilon_j$, where $\beta_j \in \mathbb{R}^d$ holds the factor loadings of fund $j$. Equivalently, $Y_{n \times p} = 1_{n \times 1} \alpha^T_{p \times 1} + Z_{n \times d} \beta^T_{p \times d} + \epsilon_{n \times p}$. $\alpha_j$ is usually regarded as the skill of manager $j$.

7 A common model. In both examples, we can model the data matrix by $Y_{n \times p} = X_{n \times 1} \alpha^T_{p \times 1} + Z_{n \times d} \beta^T_{p \times d} + \epsilon$. $\alpha$ is the parameter of interest. $\beta$ is a nuisance parameter (not always included). $\epsilon$ is noise, assumed Gaussian and column-independent. In genomics testing, $X$ is the treatment and $Z$ contains other factors affecting $Y$. In mutual fund selection, $X$ is the intercept and $Z$ contains the systematic risk factors. Standard statistical method: linear regression for each column.
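The "linear regression for each column" step can be sketched in a few lines of NumPy. This is only an illustration on simulated data: the helper name `columnwise_ols`, the dimensions, and the sparse choice of $\alpha$ are assumptions made for the example, not part of the talk. It assumes $Z$ is fully observed.

```python
import numpy as np

def columnwise_ols(Y, X, Z=None):
    """Regress every column of Y on [X, Z]; return the coefficients on X."""
    n = Y.shape[0]
    X = X.reshape(n, -1)
    design = X if Z is None else np.hstack([X, Z])
    # lstsq solves all p column regressions at once (one right-hand side per column).
    coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
    return coef[: X.shape[1]].T  # shape (p, number of X columns)

# Simulated data from Y = X alpha^T + Z beta^T + eps (all values are made up).
rng = np.random.default_rng(0)
n, p, d = 200, 50, 2
X = rng.normal(size=(n, 1))
Z = rng.normal(size=(n, d))
alpha = np.zeros((p, 1))
alpha[:5] = 2.0                      # only the first five columns carry an effect
beta = rng.normal(size=(p, d))
Y = X @ alpha.T + Z @ beta.T + rng.normal(size=(n, p))

alpha_hat = columnwise_ols(Y, X, Z)
max_error = np.max(np.abs(alpha_hat - alpha))
```

When every covariate in $Z$ is measured, `alpha_hat` should track `alpha` closely; the interesting cases arise when part of $Z$ is missing from the design.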

8 Unmeasured variables. The adjustment covariates $Z$ are not always fully measured. In the biology example, $Z$ can be gender, age, microarray platform, batch,... In the finance example, $Z$ can be other systematic risk factors (hundreds are documented).

9 Is this a problem? $Y_{n \times p} = X_{n \times 1} \alpha^T_{p \times 1} + Z_{n \times d} \beta^T_{p \times d} + \epsilon$. NO if $Z \perp X$ ($\alpha$ is unconfounded). YES if $Z$ and $X$ are dependent ($\alpha$ is confounded).

10 Unconfounded case. $Y_{n \times p} = X_{n \times 1} \alpha^T_{p \times 1} + Z_{n \times d} \beta^T_{p \times d} + \epsilon$. The least squares estimator $\hat{\alpha}$ is still unbiased, but its entries are dependent across columns. This is troublesome for multiple testing (FDR control) if the latent variables are ignored. Solution: estimate $Z$ and $\beta$ by factor analysis, which gives the dependence structure.
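A small Monte Carlo check makes the "unbiased but dependent" point concrete. The setup below is an assumption chosen for illustration (two columns, one latent factor with loading 2 on each, true $\alpha = 0$); with these values the correlation between the two estimates should be close to $\beta_1 \beta_2 / \sqrt{(\beta_1^2 + 1)(\beta_2^2 + 1)} = 0.8$.

```python
import numpy as np

rng = np.random.default_rng(2)
reps, n = 2000, 50
beta = np.array([2.0, 2.0])          # loadings of two columns on one latent factor (d = 1)

X = rng.normal(size=(reps, n))       # treatment, drawn independently of Z (unconfounded)
Z = rng.normal(size=(reps, n))       # unmeasured factor
eps = rng.normal(size=(reps, n, 2))
Y = Z[:, :, None] * beta + eps       # true alpha is 0 for both columns

# OLS slope of each column on X (no intercept), one estimate per replication.
alpha_hat = np.einsum("rn,rnj->rj", X, Y) / np.sum(X**2, axis=1, keepdims=True)

bias = alpha_hat.mean(axis=0)                                # should be near 0
corr = np.corrcoef(alpha_hat[:, 0], alpha_hat[:, 1])[0, 1]   # should be near 0.8
```

The shared latent factor leaves each estimate centered at the truth but makes the test statistics move together, which is what breaks FDR procedures that assume independence.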

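11 Confounded case. $Y_{n \times p} = X_{n \times 1} \alpha^T_{p \times 1} + Z_{n \times d} \beta^T_{p \times d} + \epsilon$. The least squares estimator $\hat{\alpha}$ is biased (by how much?). Assume $Z_{n \times d} = X_{n \times 1} \gamma^T_{d \times 1} + W$ and $W \perp X$, then $Y_{n \times p} = X_{n \times 1} \tau^T_{p \times 1} + W_{n \times d} \beta^T_{p \times d} + \epsilon$, where $\tau = \alpha + \beta \gamma$. The OLS estimator $\hat{\alpha}$ is unbiased for $\tau$ (the marginal effects), but not for $\alpha$.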
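The bias formula $\tau = \alpha + \beta\gamma$ can be checked numerically. The following sketch simulates the confounded model with made-up values of $\gamma$, $\alpha$, and $\beta$ (all assumptions for the example) and regresses $Y$ on $X$ alone, as one would when $Z$ is unmeasured.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, d = 20000, 20, 2

X = rng.normal(size=(n, 1))
gamma = np.array([[0.8], [-0.5]])            # d x 1, strength of confounding (made up)
W = rng.normal(size=(n, d))                  # part of Z independent of X
Z = X @ gamma.T + W                          # Z = X gamma^T + W

alpha = np.zeros((p, 1))
alpha[:3] = 1.0
beta = rng.normal(size=(p, d))
Y = X @ alpha.T + Z @ beta.T + rng.normal(size=(n, p))

tau = alpha + beta @ gamma                   # marginal effects

# Naive OLS of each column of Y on X only (Z is treated as unmeasured).
alpha_naive = np.linalg.lstsq(X, Y, rcond=None)[0].T   # p x 1

err_vs_tau = np.max(np.abs(alpha_naive - tau))      # small: OLS targets tau
err_vs_alpha = np.max(np.abs(alpha_naive - alpha))  # large: bias equals beta @ gamma
```

With a large sample the naive estimate sits on top of `tau`, and its distance from `alpha` matches the size of `beta @ gamma`.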

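12 Factor analysis. $Y_{n \times p} = X_{n \times 1} \alpha^T_{p \times 1} + Z_{n \times d} \beta^T_{p \times d} + \epsilon = X_{n \times 1} \tau^T_{p \times 1} + W_{n \times d} \beta^T_{p \times d} + \epsilon$. $\beta$ can be estimated from factor analysis: 1. Regress $X$ out of $Y$; 2. Run factor analysis (e.g. PCA) on the residual matrix.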
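A minimal sketch of these two steps, using a truncated SVD (PCA) as the factor analysis. The function name `estimate_loadings`, the scaling convention, and the simulated data are assumptions for illustration. Note that $\beta$ is only recovered up to an invertible $d \times d$ transformation, so the check compares column spaces rather than entries.

```python
import numpy as np

def estimate_loadings(Y, X, d):
    """Step 1: regress X out of Y. Step 2: PCA on the residual matrix."""
    n = Y.shape[0]
    tau_hat = np.linalg.lstsq(X, Y, rcond=None)[0].T      # p x 1 marginal effects
    residual = Y - X @ tau_hat.T
    U, s, Vt = np.linalg.svd(residual, full_matrices=False)
    W_hat = U[:, :d] * np.sqrt(n)                         # n x d factor scores
    beta_hat = Vt[:d].T * s[:d] / np.sqrt(n)              # p x d loadings
    return tau_hat, beta_hat, W_hat

def projector(B):
    """Orthogonal projector onto the column space of B."""
    Q, _ = np.linalg.qr(B)
    return Q @ Q.T

# Simulated data with d = 2 latent factors (all values are made up).
rng = np.random.default_rng(3)
n, p, d = 500, 100, 2
X = rng.normal(size=(n, 1))
W = rng.normal(size=(n, d))
tau = rng.normal(size=(p, 1))
beta = rng.normal(size=(p, d))
Y = X @ tau.T + W @ beta.T + rng.normal(size=(n, p))

tau_hat, beta_hat, W_hat = estimate_loadings(Y, X, d)

# Spectral-norm distance between the two column spaces (0 means identical).
subspace_gap = np.linalg.norm(projector(beta_hat) - projector(beta), 2)
```

Here `d` is treated as known; in practice it has to be estimated, which this sketch does not attempt.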

13 Cross-sectional regression. Back to the decomposition of marginal effects: $\tau_{p \times 1} = \alpha_{p \times 1} + \beta_{p \times d} \gamma_{d \times 1}$. Now that we have good estimates of $\tau$ and $\beta$, can we estimate $\alpha$ from this formula? There are $p + d$ parameters but only $p$ equations, so NO...? We need additional assumptions for identifiability, such as sparsity. Proposition: if $\|\alpha\|_0 \le (p - d)/2$ and $\beta$ is good, then $\alpha$ is identifiable. Regress $\hat{\tau}$ on $\hat{\beta}$ with a robust loss function (equivalently, a sparsity penalty on $\alpha$).
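One way to sketch the robust cross-sectional regression is Huber regression fitted by iteratively reweighted least squares; the nonzero entries of $\alpha$ then show up as outliers. This is an illustrative stand-in, not necessarily the estimator used in the talk, and the function name `huber_regression`, the tuning constant 1.345, and the simulated inputs are all assumptions.

```python
import numpy as np

def huber_regression(tau, B, delta=1.345, n_iter=100):
    """Regress tau (length p) on B (p x d) with Huber loss via IRLS."""
    gamma = np.linalg.lstsq(B, tau, rcond=None)[0]         # OLS start
    for _ in range(n_iter):
        r = tau - B @ gamma
        # Robust residual scale from the median absolute deviation.
        scale = np.median(np.abs(r - np.median(r))) / 0.6745 + 1e-12
        u = np.abs(r) / scale
        w = delta / np.maximum(u, delta)                   # 1 inside, delta/u outside
        sw = np.sqrt(w)
        gamma = np.linalg.lstsq(B * sw[:, None], tau * sw, rcond=None)[0]
    return gamma

# Simulated inputs standing in for tau_hat and beta_hat (all values are made up).
rng = np.random.default_rng(4)
p, d = 200, 2
beta = rng.normal(size=(p, d))
gamma = np.array([0.8, -0.5])
alpha = np.zeros(p)
alpha[:10] = 3.0                                           # sparse true effects
tau = alpha + beta @ gamma + 0.05 * rng.normal(size=p)

gamma_hat = huber_regression(tau, beta)
alpha_hat = tau - beta @ gamma_hat                         # residuals estimate alpha
selected = np.flatnonzero(np.abs(alpha_hat) > 1.0)
```

Because most entries of $\alpha$ are zero, the bulk of the points pin down $\gamma$, and the few large residuals are read off as the nonzero effects.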

14 Does sparsity make sense? Not always. Reasonable in our examples: 1. Most genes are most likely unrelated to the treatment. 2. Most mutual funds have no skill by economic game theory [Berk and Green, 2004].

15 Entire procedure. Three steps for $Y_{n \times p} = X_{n \times 1} \alpha^T_{p \times 1} + Z_{n \times d} \beta^T_{p \times d} + \epsilon$: 1. Row regression (also called regular, time-series, or longitudinal regression). 2. Factor analysis on residuals. 3. Column regression (cross-sectional regression).

16 Outline

17 A biology example: COPD study. COPD = chronic obstructive pulmonary disease. Singh et al. [2011] tried to find genes associated with the severity of COPD (moderate or severe). [Figure: histogram of t-statistics with an N(0.024, 2.6^2) density overlaid.] Distribution of t-statistics: overdispersed and skewed.

18 COPD data: severity as primary variable. [Figure: histograms of t-statistics with the N(0, 1) density overlaid. (a) Naive linear regression. (b) After adjustment.] $\hat{d} = 1$ [Onatski, 2010]. $\hat{\gamma}$: 0.98; confounded variance of $X$ is approximately 22%. Test of confounding: p-value

19 COPD data: gender as primary variable. Genes associated with gender should come from the X/Y chromosomes (positive controls). [Figure: histograms of t-statistics with the N(0, 1) density overlaid. (a) Naive linear regression. (b) After adjustment.] $\hat{\gamma}$: 0.27; variance explained is approximately 3%. Test of confounding: p-value

20 COPD data: gender as primary variable. Can we control FDR? [Figure: FDP against nominal FDR for LEAPP(RR), Naive, Limma, and SVA.]

21 Outline

22 Mutual fund selection. Two definitions of mutual fund skill: 1. The $\alpha$ in the Capital Asset Pricing Model (CAPM), which uses just one market factor; 2. The $\alpha$ adjusted for known and unknown factors. I will call it $\tilde{\alpha}$. Surprisingly, finance researchers find that most investors are chasing the CAPM-$\alpha$, which comes from a Nobel Prize-winning model that was introduced 50 years ago.

23 A simulation experiment. At the beginning of every year from 1996 to 2014, find all the mutual funds that have existed for the last 5 years. Estimate their CAPM-$\alpha$ and $\tilde{\alpha}$ using the 5-year data. Form decile groups based on the estimated $\alpha$ and $\tilde{\alpha}$, and compare their monthly returns in the next year. Note: I'm actually using the Treynor index, computed from $\tilde{\alpha}$ and $\mathrm{sd}(\gamma_j^T Z)$.
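The estimate-then-rank step for a single formation date could look like the sketch below. It covers only the CAPM-$\alpha$ side, on synthetic returns; the factor-adjusted $\tilde{\alpha}$ and the Treynor normalization are not implemented, and every number and variable name here is an assumption for illustration rather than the data or code behind the talk.

```python
import numpy as np

rng = np.random.default_rng(5)
n_months, n_funds = 60, 100                       # 5 years of monthly data

# Synthetic excess returns: fund = alpha + beta * market + noise (values made up).
market = rng.normal(0.005, 0.04, size=n_months)
true_alpha = rng.normal(0.0, 0.002, size=n_funds)
true_beta = rng.normal(1.0, 0.2, size=n_funds)
returns = (true_alpha + np.outer(market, true_beta)
           + rng.normal(0.0, 0.01, size=(n_months, n_funds)))

# Time-series regression of every fund on an intercept and the market factor.
design = np.column_stack([np.ones(n_months), market])
coef, *_ = np.linalg.lstsq(design, returns, rcond=None)
capm_alpha_hat = coef[0]                          # intercepts, one per fund

# Decile groups: 0 holds the lowest estimated alphas, 9 the highest.
ranks = np.argsort(np.argsort(capm_alpha_hat))
decile = ranks * 10 // n_funds
top_funds = np.flatnonzero(decile == 9)
```

Repeating this at each formation date and tracking the following year's returns of each decile gives the kind of comparison described in the experiment.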

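24 Top 10% funds. [Table: Performance of the top funds. Columns: ER, SR, CAPM-$\alpha$, FFC-$\alpha$, AUM, Monthly Flow. Rows distinguish funds that are in the top decile by only one of $\alpha$ and $\tilde{\alpha}$ or by both. Surviving parenthesized entries, row by row: (-1.96), (-2.06); (-0.69), (-1.00); (2.41), (1.97).] ER is excess return, SR is Sharpe ratio ($\mu/\sigma$), AUM is assets under management.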

25 [Figure: Cumulative log-return over time for four strategies: all funds, $\alpha$ only, $\tilde{\alpha}$ only, $\alpha$ and $\tilde{\alpha}$.]

26 [Figure: Return in the next 5 years of 10 deciles. Vertical axis: average monthly return * 12 (0% to 15%); horizontal axis: year. Deciles are labeled 0 for percentile (0, 10] through 9 for percentile (90, 100]; panels: largest $\alpha$, largest treynor($\tilde{\alpha}$).]

27 Outline

28 Confounding is a common problem across domains. Sometimes it's helpful to think about rows and columns in a similar way. Be wise when investing.

Confounder Adjustment in Multiple Hypothesis Testing

Confounder Adjustment in Multiple Hypothesis Testing. Department of Statistics, Stanford University. January 28, 2016. Slides are available at http://web.stanford.edu/~qyzhao/. Collaborators: Jingshu Wang, Trevor Hastie, Art Owen.