Cross-Sectional Regression after Factor Analysis: Two Applications


Joint work with Jingshu, Trevor, Art; Yang Song (GSB)
May 7, 2016


Data matrix
$Y \in \mathbb{R}^{n \times p}$. Panel data. Transposable data.
Modern datasets: usually high-dimensional (both $n, p \gg 1$).
Two examples:
1. Gene expressions (row: cell; column: gene).
2. Mutual fund monthly returns (row: month; column: fund).

Gene discovery
Which genes are associated with a treatment/condition?
Let $X \in \mathbb{R}^{n \times 1}$ be the treatment vector.
Simple linear regression: $\mathrm{col}_j(Y) = \alpha_j X + \epsilon_j$.
Equivalently, $Y_{n \times p} = X_{n \times 1} \alpha^T_{p \times 1} + \epsilon_{n \times p}$.
Statistical significance of $\alpha_j$, multiple testing...
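
A minimal sketch of the column-by-column regression on this slide, in plain Python. The function name `column_regression` and the toy data are illustrative assumptions, not part of the talk; the model has no intercept, matching $\mathrm{col}_j(Y) = \alpha_j X + \epsilon_j$.

```python
import math

def column_regression(x, Y):
    """Regress every column of Y on x (no intercept).

    Returns the least squares slopes and their t-statistics.
    Hypothetical helper written for illustration only.
    """
    n, p = len(Y), len(Y[0])
    sxx = sum(v * v for v in x)
    slopes, tstats = [], []
    for j in range(p):
        col = [Y[i][j] for i in range(n)]
        a = sum(xi * yi for xi, yi in zip(x, col)) / sxx
        rss = sum((yi - a * xi) ** 2 for xi, yi in zip(x, col))
        se = math.sqrt(rss / (n - 1) / sxx)  # one fitted slope, so n - 1 residual df
        slopes.append(a)
        tstats.append(a / se if se > 0 else math.copysign(math.inf, a))
    return slopes, tstats

# Toy data (assumed): 4 samples, 2 "genes"; only the first responds to x.
x = [1.0, -1.0, 1.0, -1.0]
Y = [[3.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-3.0, -1.0]]
slopes, tstats = column_regression(x, Y)
```

The $p$ t-statistics are what a multiple testing procedure would then take as input.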

Mutual fund selection
How skillful is a mutual fund manager?
Let $Z \in \mathbb{R}^{n \times d}$ be the well-known systematic risk factors. The Fama-French-Carhart four-factor model:
1. Market Minus Risk Free;
2. Small [market capitalization] Minus Big;
3. High [book-to-market ratio] Minus Low;
4. Momentum.
Linear regression for mutual fund $j$: $\mathrm{col}_j(Y) = \alpha_j 1_n + Z \beta_j + \epsilon_j$.
Equivalently, $Y_{n \times p} = 1_{n \times 1} \alpha^T_{p \times 1} + Z_{n \times d} \beta^T_{p \times d} + \epsilon_{n \times p}$.
$\alpha_j$ is usually regarded as the skill of manager $j$.

A common model
In both examples, we can model the data matrix by $Y_{n \times p} = X_{n \times 1} \alpha^T_{p \times 1} + Z_{n \times d} \beta^T_{p \times d} + \epsilon$.
$\alpha$ is the parameter of interest.
$\beta$ is a nuisance parameter (not always included).
$\epsilon$ is noise, assumed Gaussian and column-independent.
In genomics testing, $X$ is the treatment and $Z$ contains other factors affecting $Y$.
In mutual fund selection, $X$ is the intercept and $Z$ contains the systematic risk factors.
Standard statistical method: linear regression for each column.

Unmeasured variables
The adjustment covariates $Z$ are not always all measured.
In the biology example, $Z$ can be gender, age, microarray platform, batch,...
In the finance example, $Z$ can be other systematic risk factors (hundreds are documented).

Is this a problem?
$Y_{n \times p} = X_{n \times 1} \alpha^T_{p \times 1} + Z_{n \times d} \beta^T_{p \times d} + \epsilon$.
NO if $Z \perp X$ ($\alpha$ is unconfounded).
YES if $Z$ and $X$ are dependent ($\alpha$ is confounded).

Unconfounded case
$Y_{n \times p} = X_{n \times 1} \alpha^T_{p \times 1} + Z_{n \times d} \beta^T_{p \times d} + \epsilon$.
The least squares estimator $\hat\alpha$ is still unbiased, but its entries are dependent across columns.
This is troublesome for multiple testing (FDR control) if the latent variables are ignored.
Solution: estimate $Z$ and $\beta$ by factor analysis, which gives the dependency structure.
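
To see where the dependence comes from, here is a short derivation under extra assumptions that are not on the slide: no intercept, the rows of $Z$ are i.i.d. with mean zero and covariance $I_d$, $Z$ and $\epsilon$ are independent of $X$ and of each other, and the rows of $\epsilon$ have diagonal covariance $\Sigma$.

```latex
\hat\alpha = \frac{Y^T X}{\|X\|^2}
           = \alpha + \frac{\beta Z^T X + \epsilon^T X}{\|X\|^2},
\qquad
\mathbb{E}[\hat\alpha \mid X] = \alpha,
\qquad
\mathrm{Cov}(\hat\alpha \mid X) = \frac{\beta\beta^T + \Sigma}{\|X\|^2}.
```

The off-diagonal entries of $\beta\beta^T$ are what couple $\hat\alpha_j$ and $\hat\alpha_k$; with $\beta = 0$ the covariance would be diagonal.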

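Confounded case
$Y_{n \times p} = X_{n \times 1} \alpha^T_{p \times 1} + Z_{n \times d} \beta^T_{p \times d} + \epsilon$.
The least squares estimator $\hat\alpha$ is biased (by how much?).
Assume $Z_{n \times d} = X_{n \times 1} \gamma^T_{d \times 1} + W$ with $W \perp X$. Then $Y_{n \times p} = X_{n \times 1} \tau^T_{p \times 1} + W_{n \times d} \beta^T_{p \times d} + \epsilon$, where $\tau = \alpha + \beta\gamma$.
The OLS estimator $\hat\alpha$ is unbiased for $\tau$ (the marginal effects), but not for $\alpha$.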
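
The substitution behind $\tau = \alpha + \beta\gamma$ can be written out step by step. This only expands the assumption $Z = X\gamma^T + W$ stated on the slide; the last line additionally assumes $W$ and $\epsilon$ have mean zero given $X$.

```latex
Y = X\alpha^T + Z\beta^T + \epsilon
  = X\alpha^T + (X\gamma^T + W)\beta^T + \epsilon
  = X(\alpha + \beta\gamma)^T + W\beta^T + \epsilon
  = X\tau^T + W\beta^T + \epsilon,
\\[4pt]
\mathbb{E}[\hat\alpha \mid X] = \tau = \alpha + \beta\gamma
\quad\Longrightarrow\quad
\text{bias of } \hat\alpha \text{ for } \alpha \text{ is } \beta\gamma .
```

So the answer to "by how much?" is $\beta\gamma$: gene or fund $j$ is biased by its loading $\beta_j$ times the strength $\gamma$ of the association between $X$ and the latent factors.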

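Factor analysis
$Y_{n \times p} = X_{n \times 1} \alpha^T_{p \times 1} + Z_{n \times d} \beta^T_{p \times d} + \epsilon = X_{n \times 1} \tau^T_{p \times 1} + W_{n \times d} \beta^T_{p \times d} + \epsilon$.
$\beta$ can be estimated from factor analysis:
1. Regress $X$ out of $Y$;
2. Run factor analysis (e.g. PCA) on the residual matrix.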
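
A minimal sketch of the two steps for the special case $d = 1$, in plain Python. The helper names `regress_out` and `leading_loading` are hypothetical, and power iteration stands in for a full PCA; it assumes the all-ones starting vector is not orthogonal to the leading direction. The loading is recovered only up to sign and scale.

```python
import math

def regress_out(x, Y):
    """Step 1: per-column slopes tau_hat and residuals R = Y - x tau_hat^T."""
    n, p = len(Y), len(Y[0])
    sxx = sum(v * v for v in x)
    tau_hat = [sum(x[i] * Y[i][j] for i in range(n)) / sxx for j in range(p)]
    R = [[Y[i][j] - x[i] * tau_hat[j] for j in range(p)] for i in range(n)]
    return tau_hat, R

def leading_loading(R, iters=100):
    """Step 2 (d = 1 only): unit-norm leading eigenvector of R^T R by power iteration."""
    n, p = len(R), len(R[0])
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        scores = [sum(R[i][j] * v[j] for j in range(p)) for i in range(n)]  # R v
        w = [sum(R[i][j] * scores[i] for i in range(n)) for j in range(p)]  # R^T (R v)
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    return v

# Noise-free toy data (assumed): Y = x tau^T + w beta^T with w orthogonal to x.
x = [1.0, 1.0, -1.0, -1.0]
w = [1.0, -1.0, 1.0, -1.0]
tau, beta = [1.0, 0.0, 2.0], [3.0, 4.0, 0.0]
Y = [[x[i] * tau[j] + w[i] * beta[j] for j in range(3)] for i in range(4)]

tau_hat, R = regress_out(x, Y)
loading = leading_loading(R)  # proportional to beta
```

For $d > 1$ one would take the top $d$ right singular vectors of the residual matrix instead, which identifies $\beta$ only up to rotation.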

Cross-sectional regression
Back to the decomposition of marginal effects: $\tau_{p \times 1} = \alpha_{p \times 1} + \beta_{p \times d} \gamma_{d \times 1}$.
Now that we have good estimates of $\tau$ and $\beta$, can we estimate $\alpha$ from this formula?
There are $p + d$ parameters but only $p$ equations, so NO...?
We need additional assumptions for identifiability, like sparsity.
Proposition: if $\|\alpha\|_0 \le (p - d)/2$ and $\beta$ is good, then $\alpha$ is identifiable.
Regress $\hat\tau$ on $\hat\beta$ with a robust loss function (sparsity penalty on $\alpha$).
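
A minimal sketch of the cross-sectional step for $d = 1$, using an L1 (least absolute deviation) loss as one possible robust loss; the talk does not say which loss is used, so this choice and the names `l1_slope` and `estimate_alpha` are assumptions. With a single factor, minimising $\sum_j |\tau_j - \beta_j g|$ over $g$ reduces to a weighted median of the ratios $\tau_j / \beta_j$ with weights $|\beta_j|$.

```python
def l1_slope(tau, beta):
    """Minimise sum_j |tau_j - beta_j * g| over g (d = 1, no intercept).

    The minimiser is the weighted median of tau_j / beta_j with weights |beta_j|.
    """
    pairs = sorted((t / b, abs(b)) for t, b in zip(tau, beta) if b != 0)
    if not pairs:
        raise ValueError("all loadings are zero")
    half = sum(wt for _, wt in pairs) / 2
    acc = 0.0
    for ratio, wt in pairs:
        acc += wt
        if acc >= half:
            return ratio

def estimate_alpha(tau, beta):
    """Return (gamma_hat, alpha_hat) with alpha_hat = tau - beta * gamma_hat."""
    g = l1_slope(tau, beta)
    return g, [t - b * g for t, b in zip(tau, beta)]

# Toy example (assumed): gamma = 0.5 and only the last unit has a nonzero alpha.
beta = [1.0, 2.0, 3.0, 4.0, 5.0]
alpha = [0.0, 0.0, 0.0, 0.0, 3.0]
tau = [a + b * 0.5 for a, b in zip(alpha, beta)]

gamma_hat, alpha_hat = estimate_alpha(tau, beta)
```

On these inputs the L1 fit recovers $\gamma = 0.5$ and the sparse $\alpha$ exactly, whereas ordinary least squares through the origin would give $42.5 / 55 \approx 0.77$ because the single nonzero $\alpha_j$ pulls the slope upward.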

Does sparsity make sense?
Not always. It is reasonable in our examples:
1. Most genes are most likely unrelated to the treatment.
2. By economic game theory, most mutual funds have no skill [Berk and Green, 2004].

Entire procedure
$Y_{n \times p} = X_{n \times 1} \alpha^T_{p \times 1} + Z_{n \times d} \beta^T_{p \times d} + \epsilon$.
Three steps:
1. Row regression/regular regression/time-series regression/longitudinal regression...
2. Factor analysis on residuals.
3. Column regression/cross-sectional regression.

A biology example: COPD study
COPD = chronic obstructive pulmonary disease.
Singh et al. [2011] tried to find genes associated with the severity of COPD (moderate or severe).
[Figure: histogram of t-statistics (density against t statistics) with an $N(0.024, 2.6^2)$ reference curve.]
Distribution of t-statistics: overdispersed and skewed.

COPD data: severity as primary variable
[Figure: histograms of t-statistics (density against t statistics), each with an $N(0, 1)$ reference curve. Panel (a): naive linear regression. Panel (b): after adjustment.]
$\hat d = 1$ [Onatski, 2010]. $\hat\gamma \approx 0.98$; the confounded variance of $X$ is approximately 22%.
Test of confounding: p-value $\approx 0$.

COPD data: gender as primary variable
Genes associated with gender should come from the X/Y chromosomes (positive controls).
[Figure: histograms of t-statistics (density against t statistics), each with an $N(0, 1)$ reference curve. Panel (a): naive linear regression. Panel (b): after adjustment.]
$\hat\gamma \approx 0.27$; the variance explained is approximately 3%.
Test of confounding: p-value $\approx 1.2 \times 10^{-3}$.

COPD data: gender as primary variable
Can we control FDR?
[Figure: false discovery proportion (FDP, 0 to 1) against nominal FDR (0 to 1) for LEAPP(RR), Naive, Limma and SVA.]

Mutual fund selection
Two definitions of mutual fund skill:
1. The $\alpha$ in the Capital Asset Pricing Model (CAPM), which uses just one market factor;
2. The $\alpha$ adjusted for known and unknown factors. I will call it $\tilde\alpha$.
Surprisingly, finance researchers find that most investors chase the CAPM-$\alpha$. The CAPM is Nobel Prize-winning work, but it was introduced 50 years ago.

A simulation experiment
At the beginning of every year from 1996 to 2014, find all the mutual funds that have existed over the last 5 years.
Estimate their CAPM-$\alpha$ and $\tilde\alpha$ using the 5-year data.
Form decile groups based on the estimated $\alpha$ and $\tilde\alpha$, and compare their monthly returns in the next year.
Note: I'm actually using the Treynor index $\tilde\alpha / \mathrm{sd}(\gamma_j^T Z)$.
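
A minimal sketch of the grouping step in plain Python. The helper `decile_groups`, the fund identifiers and the scores are all made up for illustration; the estimation of the two skill measures is not shown.

```python
def decile_groups(scores):
    """Split fund ids into 10 groups by ascending score; group 9 is the top 10%."""
    ranked = sorted(scores, key=scores.get)
    n = len(ranked)
    groups = [[] for _ in range(10)]
    for rank, fund in enumerate(ranked):
        groups[rank * 10 // n].append(fund)
    return groups

# Hypothetical estimates for 20 funds under the two skill definitions.
capm_alpha = {f"F{i}": float(i) for i in range(20)}
adj_alpha = {f"F{i}": float((7 * i) % 20) for i in range(20)}

top_capm = set(decile_groups(capm_alpha)[9])
top_adj = set(decile_groups(adj_alpha)[9])

capm_only = top_capm - top_adj  # picked by CAPM-alpha only
both = top_capm & top_adj       # picked by both measures
adj_only = top_adj - top_capm   # picked by the adjusted alpha only
```

Each of the three sets would then be held for the following year and its monthly returns recorded, repeating the selection at the start of every year.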

Top 10% Funds                    ER     SR     CAPM-α    FFC-α     AUM      Monthly Flow
$\alpha \setminus \tilde\alpha$  5.1    27.7   -2.75     -2.98     1259.3   10.11
                                               (-1.96)   (-2.06)
$\alpha \cap \tilde\alpha$       6.2    37.6   -0.86     -1.16     1279.5   15.45
                                               (-0.69)   (-1.00)
$\tilde\alpha \setminus \alpha$  9.1    59.3   2.42      2.00      1097.5   8.0
                                               (2.41)    (1.97)
Table: Performance of the top funds. ER is excess return, SR is Sharpe ratio ($\mu/\sigma$), AUM is assets under management.

[Figure: cumulative log return against time (2000 to 2015) for four strategies: all funds, $\alpha$ only, $\tilde\alpha$ only, $\alpha$ and $\tilde\alpha$.]
Figure: Cumulative log-return.

[Figure: average monthly return × 12 (0% to 15%) against year (0 to 5) for the ten decile groups, labelled 0 for percentile (0, 10] through 9 for (90, 100]. Panels: largest $\alpha$; largest treynor($\tilde\alpha$).]
Figure: Return in the next 5 years of 10 deciles.

Confounding is a common problem across domains.
Sometimes it's helpful to think about rows and columns in a similar way.
Be wise when investing.