al Regression after Factor Analysis: Two Applications
Joint work with Jingshu, Trevor, Art; Yang Song (GSB)
May 7, 2016

Data matrix $Y \in \mathbb{R}^{n \times p}$

Panel data. Transposable data.
Modern datasets: usually high dimensional (both $n, p \gg 1$).
Two examples:
1. Gene expressions (row: cell; column: gene).
2. Mutual fund monthly returns (row: month; column: fund).

Gene discovery

Which genes are associated with a treatment/condition?
Let $X \in \mathbb{R}^{n \times 1}$ be the treatment vector.
Simple linear regression: $\mathrm{col}_j(Y) = \alpha_j X + \epsilon_j$.
Equivalently, $Y_{n \times p} = X_{n \times 1} \alpha^T_{p \times 1} + \epsilon_{n \times p}$.
Statistical significance of $\alpha_j$, multiple testing...

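The column-by-column regression above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `columnwise_ols`, the toy sizes, and the simulated data are all assumptions made for the example, and the model has no intercept, matching the formula on the slide.

```python
import numpy as np

def columnwise_ols(Y, x):
    """Regress every column of Y on the single covariate x (no intercept).

    Returns the slope estimates and their t-statistics, one per column.
    """
    n, p = Y.shape
    xtx = float(x @ x)
    alpha_hat = (x @ Y) / xtx                         # shape (p,)
    resid = Y - np.outer(x, alpha_hat)                # shape (n, p)
    sigma2_hat = (resid ** 2).sum(axis=0) / (n - 1)   # one slope fitted per column
    t_stat = alpha_hat / np.sqrt(sigma2_hat / xtx)
    return alpha_hat, t_stat

# Toy data (hypothetical sizes): 200 samples, 50 genes, 5 truly associated genes.
rng = np.random.default_rng(0)
n, p = 200, 50
x = rng.standard_normal(n)
alpha = np.zeros(p)
alpha[:5] = 1.0
Y = np.outer(x, alpha) + rng.standard_normal((n, p))
alpha_hat, t_stat = columnwise_ols(Y, x)
```

The resulting `t_stat` vector is what a multiple testing procedure would take as input.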
Mutual fund selection

How skillful is a mutual fund manager?
Let $Z \in \mathbb{R}^{n \times d}$ be the well-known systematic risk factors.
The Fama-French-Carhart four-factor model:
1. Market Minus Risk Free;
2. Small [market capitalization] Minus Big;
3. High [book-to-market ratio] Minus Low;
4. Momentum.
Simple linear regression for mutual fund $j$: $\mathrm{col}_j(Y) = \alpha_j 1_n + Z \beta_j + \epsilon_j$, where $\beta_j$ is the $j$-th row of $\beta$.
Equivalently, $Y_{n \times p} = 1_{n \times 1} \alpha^T_{p \times 1} + Z_{n \times d} \beta^T_{p \times d} + \epsilon_{n \times p}$.
$\alpha_j$ is usually regarded as the skill of manager $j$.

A common model

In both examples, we can model the data matrix by
$Y_{n \times p} = X_{n \times 1} \alpha^T_{p \times 1} + Z_{n \times d} \beta^T_{p \times d} + \epsilon$.
$\alpha$ is the parameter of interest.
$\beta$ is a nuisance parameter (not always included).
$\epsilon$ is noise, assumed Gaussian and column-independent.
In genomics testing, $X$ is the treatment and $Z$ contains other factors affecting $Y$.
In mutual fund selection, $X$ is the intercept and $Z$ contains the systematic risk factors.
Standard statistical method: linear regression for each column.

Unmeasured variables

The adjustment covariates $Z$ are not always all measured.
In the biology example, $Z$ can be gender, age, microarray platform, batch, ...
In the finance example, $Z$ can be other systematic risk factors (hundreds are documented).

Is this a problem?

$Y_{n \times p} = X_{n \times 1} \alpha^T_{p \times 1} + Z_{n \times d} \beta^T_{p \times d} + \epsilon$.
NO if $Z \perp X$ ($\alpha$ is unconfounded).
YES if $Z$ and $X$ are dependent ($\alpha$ is confounded).

Unconfounded case

$Y_{n \times p} = X_{n \times 1} \alpha^T_{p \times 1} + Z_{n \times d} \beta^T_{p \times d} + \epsilon$.
The least squares estimator $\hat\alpha$ is still unbiased, but its entries are dependent.
This is troublesome for multiple testing (FDR control) if the latent variables are ignored.
Solution: estimate $Z$ and $\beta$ by factor analysis, which gives the dependency structure.

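Confounded case

$Y_{n \times p} = X_{n \times 1} \alpha^T_{p \times 1} + Z_{n \times d} \beta^T_{p \times d} + \epsilon$.
The least squares estimator $\hat\alpha$ is biased (by how much?).
Assume $Z_{n \times d} = X_{n \times 1} \gamma^T_{d \times 1} + W$ and $W \perp X$, then
$Y_{n \times p} = X_{n \times 1} \tau^T_{p \times 1} + W_{n \times d} \beta^T_{p \times d} + \epsilon$, where $\tau = \alpha + \beta\gamma$.
The OLS estimator $\hat\alpha$ is unbiased for $\tau$ (the marginal effects), but not for $\alpha$.
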
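The step from the first display to the second is a direct substitution of the assumed form of $Z$. Written out with the same symbols (nothing new is assumed beyond $Z = X\gamma^T + W$):

```latex
\begin{aligned}
Y &= X\alpha^T + Z\beta^T + \epsilon \\
  &= X\alpha^T + (X\gamma^T + W)\beta^T + \epsilon \\
  &= X(\alpha^T + \gamma^T\beta^T) + W\beta^T + \epsilon \\
  &= X(\alpha + \beta\gamma)^T + W\beta^T + \epsilon
   = X\tau^T + W\beta^T + \epsilon .
\end{aligned}
```

Since $W \perp X$, regressing $Y$ on $X$ alone targets $\tau$, so the bias of $\hat\alpha$ as an estimator of $\alpha$ is $\tau - \alpha = \beta\gamma$.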
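Factor analysis

$Y_{n \times p} = X_{n \times 1} \alpha^T_{p \times 1} + Z_{n \times d} \beta^T_{p \times d} + \epsilon$
$\phantom{Y_{n \times p}} = X_{n \times 1} \tau^T_{p \times 1} + W_{n \times d} \beta^T_{p \times d} + \epsilon$
$\beta$ can be estimated from factor analysis:
1. Regress $X$ out of $Y$;
2. Run factor analysis (e.g. PCA) on the residual matrix.
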
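The two steps above can be sketched with a truncated SVD standing in for the factor analysis. This is only an illustration under stated assumptions: the function name `estimate_loadings`, the scaling of the loadings (factors normalized to unit variance), and the simulated data are choices made for the example, and $d$ is taken as known. The loadings are recovered only up to a rotation, so the check below compares column spaces rather than entries.

```python
import numpy as np

def estimate_loadings(Y, x, d):
    """Step 1: regress x out of Y. Step 2: PCA (truncated SVD) on the residuals.

    Returns tau_hat (p,), the marginal effects, and beta_hat (p, d), the
    estimated loadings, which are determined only up to a rotation.
    """
    n = Y.shape[0]
    tau_hat = (x @ Y) / (x @ x)               # column-wise OLS slopes
    resid = Y - np.outer(x, tau_hat)          # residual matrix, shape (n, p)
    U, s, Vt = np.linalg.svd(resid, full_matrices=False)
    # Assumed scaling: factors sqrt(n) * U[:, :d], loadings V * s / sqrt(n).
    beta_hat = Vt[:d].T * (s[:d] / np.sqrt(n))
    return tau_hat, beta_hat

# Toy confounded data (hypothetical sizes and parameters).
rng = np.random.default_rng(1)
n, p, d = 300, 100, 2
x = rng.standard_normal(n)
gamma = np.array([1.0, -0.5])
W = rng.standard_normal((n, d))
Z = np.outer(x, gamma) + W                    # Z = X gamma^T + W
beta = rng.standard_normal((p, d))
alpha = np.zeros(p)
Y = np.outer(x, alpha) + Z @ beta.T + rng.standard_normal((n, p))
tau_hat, beta_hat = estimate_loadings(Y, x, d)

# Relative distance between beta and its projection onto span(beta_hat).
coef, *_ = np.linalg.lstsq(beta_hat, beta, rcond=None)
subspace_error = np.linalg.norm(beta - beta_hat @ coef) / np.linalg.norm(beta)
```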
Cross-sectional regression

Back to the decomposition of marginal effects: $\tau_{p \times 1} = \alpha_{p \times 1} + \beta_{p \times d} \gamma_{d \times 1}$.
Now that we have good estimates of $\tau$ and $\beta$, can we estimate $\alpha$ from this formula?
There are $p + d$ parameters but only $p$ equations, so NO...?
Additional assumptions, like sparsity, are needed for identifiability.
Proposition: if $\|\alpha\|_0 \le (p - d)/2$ and $\beta$ is good, then $\alpha$ is identifiable.
Regress $\hat\tau$ on $\hat\beta$ with a robust loss function (sparsity penalty on $\alpha$).

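One way to make "robust loss function (sparsity penalty on $\alpha$)" concrete is an $\ell_1$ penalty on $\alpha$, which is equivalent to fitting $\gamma$ with a Huber-type loss on the residuals $\hat\tau - \hat\beta\gamma$. The sketch below uses that particular choice, with simple alternating updates; the function names, the penalty level `lam`, and the toy data are assumptions for illustration, not necessarily the loss used in the talk.

```python
import numpy as np

def soft_threshold(r, lam):
    """Shrink each entry of r towards zero by lam."""
    return np.sign(r) * np.maximum(np.abs(r) - lam, 0.0)

def sparse_alpha(tau_hat, beta_hat, lam, n_iter=200):
    """Minimise 0.5 * ||tau - beta @ gamma - alpha||^2 + lam * ||alpha||_1
    by alternating exact updates of gamma (least squares) and alpha
    (soft-thresholding)."""
    alpha = np.zeros_like(tau_hat)
    for _ in range(n_iter):
        gamma, *_ = np.linalg.lstsq(beta_hat, tau_hat - alpha, rcond=None)
        alpha = soft_threshold(tau_hat - beta_hat @ gamma, lam)
    return alpha, gamma

# Toy example (hypothetical): 200 columns, 2 factors, 10 non-zero effects.
rng = np.random.default_rng(2)
p, d = 200, 2
beta = rng.standard_normal((p, d))
gamma = np.array([1.0, -0.5])
alpha = np.zeros(p)
alpha[:10] = 3.0
tau = alpha + beta @ gamma + 0.1 * rng.standard_normal(p)
alpha_hat, gamma_hat = sparse_alpha(tau, beta, lam=0.5)
```

The non-zero entries of `alpha_hat` are shrunk towards zero by `lam`, so in practice one might refit or debias them after selecting the support.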
Does sparsity make sense?

Not always. It is reasonable in our examples:
1. Most genes are most likely unrelated to the treatment.
2. According to economic game theory, most mutual funds have no skill [Berk and Green, 2004].

Entire procedure

$Y_{n \times p} = X_{n \times 1} \alpha^T_{p \times 1} + Z_{n \times d} \beta^T_{p \times d} + \epsilon$.
Three steps:
1. Row regression / regular regression / time-series regression / longitudinal regression...
2. Factor analysis on residuals.
3. Column regression / cross-sectional regression.

A biology example: COPD study

COPD = chronic obstructive pulmonary disease.
Singh et al. [2011] tried to find genes associated with the severity of COPD (moderate or severe).
[Figure: histogram of the t-statistics (x-axis: t statistics; y-axis: density) with an $N(0.024, 2.6^2)$ curve.]
Distribution of t-statistics: overdispersed and skewed.

COPD data: severity as primary variable

[Figure: density of the t-statistics against the $N(0, 1)$ reference (x-axis: t statistics; y-axis: density). (a) Naive linear regression. (b) After adjustment.]
$\hat{d} = 1$ [Onatski, 2010].
$\hat\gamma \approx 0.98$, confounded variance of $X$ is approximately 22%.
Test of confounding: p-value $\approx 0$.

COPD data: gender as primary variable

Genes associated with gender should come from the X/Y chromosomes (positive controls).
[Figure: density of the t-statistics against the $N(0, 1)$ reference (x-axis: t statistics; y-axis: density). (a) Naive linear regression. (b) After adjustment.]
$\hat\gamma \approx 0.27$, variance explained is approximately 3%.
Test of confounding: p-value $\approx 1.2 \times 10^{-3}$.

COPD data: gender as primary variable

Can we control FDR?
[Figure: FDP (y-axis, 0.0 to 1.0) against nominal FDR (x-axis, 0.0 to 1.0) for four methods: LEAPP(RR), Naive, Limma, SVA.]

Mutual fund selection

Two definitions of mutual fund skill:
1. The $\alpha$ in the Capital Asset Pricing Model (CAPM), which uses just one market factor;
2. The $\alpha$ adjusted for known and unknown factors. I will call it $\tilde\alpha$.
Surprisingly, finance researchers find that most investors chase the CAPM-$\alpha$, which comes from a Nobel-prize-winning model but was introduced 50 years ago.

A simulation experiment

At the beginning of every year from 1996 to 2014, find all the mutual funds that existed throughout the previous 5 years.
Estimate their CAPM-$\alpha$ and $\tilde\alpha$ using the 5-year data.
Form decile groups based on the estimated $\alpha$ and $\tilde\alpha$, and compare their monthly returns in the next year.
Note: I'm actually using the Treynor index $\tilde\alpha / \mathrm{sd}(\gamma_j^T Z)$.

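The decile-formation step can be sketched as follows. This is a hypothetical illustration: the function names, the 100 simulated funds, and the way the next-year returns are generated are all assumptions for the example, not the fund data used in the talk. Labels run from 0 (bottom 10%) to 9 (top 10%).

```python
import numpy as np

def decile_labels(scores):
    """Assign each fund a decile label 0..9 (9 = top 10%) by rank of its score."""
    ranks = np.argsort(np.argsort(scores))    # 0 = smallest score
    return (ranks * 10) // len(scores)

def next_year_return_by_decile(scores, next_returns):
    """Average next-year return within each decile group of the scores."""
    labels = decile_labels(scores)
    return np.array([next_returns[labels == k].mean() for k in range(10)])

# Hypothetical example: 100 funds whose score is weakly predictive of returns.
rng = np.random.default_rng(3)
scores = rng.permutation(100).astype(float)   # stand-in for estimated skill
next_returns = 0.001 * scores + 0.05 * rng.standard_normal(100)
labels = decile_labels(scores)
group_means = next_year_return_by_decile(scores, next_returns)
```

Repeating this for every formation year and for each skill estimate gives the kind of comparison described on the slide.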
Top 10% funds                      ER    SR     CAPM-$\alpha$    FFC-$\alpha$     AUM      Monthly flow
$\alpha \setminus \tilde\alpha$    5.1   27.7   -2.75 (-1.96)    -2.98 (-2.06)    1259.3   10.11
$\alpha \cap \tilde\alpha$         6.2   37.6   -0.86 (-0.69)    -1.16 (-1.00)    1279.5   15.45
$\tilde\alpha \setminus \alpha$    9.1   59.3    2.42 (2.41)      2.00 (1.97)     1097.5   8.0

Table: Performance of the top funds. ER is excess return, SR is Sharpe ratio ($\mu/\sigma$), AUM is assets under management.

[Figure: cumulative log return (y-axis) over time (x-axis, 2000 to 2015) for four strategies: all funds, $\alpha$ only, $\tilde\alpha$ only, $\alpha$ and $\tilde\alpha$.]
Figure: Cumulative log-return.

[Figure: average monthly return * 12 (y-axis, 0% to 15%) by year (x-axis, 0 to 5) for the 10 decile groups, in two panels: largest $\alpha$ and largest treynor($\tilde\alpha$). Percentile labels: 0 = (0, 10], 1 = (10, 20], ..., 9 = (90, 100].]
Figure: Return in the next 5 years of 10 deciles.

Confounding is a common problem across domains.
Sometimes it's helpful to think of rows and columns in a similar way.
Be wise when investing.