Deep Learning in Asset Pricing

1 Deep Learning in Asset Pricing
Luyang Chen, Markus Pelger, Jason Zhu (Stanford University)
November 17th, 2018
Western Mathematical Finance Conference 2018

2 Motivation
Hype: Machine Learning in Investment (same reporter 3 weeks later)
Efficient markets: asset returns are dominated by unforecastable news.
Financial return data has a very low signal-to-noise ratio.
This paper: including financial constraints (no-arbitrage) in the learning algorithm significantly improves the signal.

3 Motivation: Asset Pricing
The Challenge of Asset Pricing
One of the most important questions in finance: why are asset prices different for different assets?
No-arbitrage pricing theory: the stochastic discount factor (SDF, also called pricing kernel or equivalent martingale measure) explains differences in risk and asset prices.
Fundamental question: what is the SDF?
Challenges
  The SDF should depend on all available economic information: a very large set of variables.
  The functional form of the SDF is unknown and likely complex.
  The SDF needs to capture time-variation in economic conditions.
  The risk premium in stock returns has a low signal-to-noise ratio.

4 Motivation: This paper
Goals of this paper: a general non-linear asset pricing model and optimal portfolio design.
Deep neural networks are applied to all U.S. equity data and large sets of macroeconomic and firm-specific information.
Why is it important?
1. Stochastic discount factor (SDF): generates the tradeable portfolio with the highest risk-adjusted return (Sharpe ratio = expected excess return / standard deviation).
2. Arbitrage opportunities: find underpriced assets and earn alpha.
3. Risk management: understand which information drives the SDF and how; manage the risk exposure of financial assets.

5 Motivation: Contribution of this paper
This paper: estimate the SDF with deep neural networks.
Crucial innovation: include the no-arbitrage condition in the neural network algorithm and combine four neural networks in a novel way.
Key elements of the estimator:
1. Non-linearity: a feed-forward network captures non-linearities.
2. Time-variation: a recurrent (LSTM) network finds a small set of economic state processes.
3. Pricing all assets: a generative adversarial network identifies the states and portfolios with the most unexplained pricing information.
4. Dimension reduction: regularization through the no-arbitrage condition.
5. Signal-to-noise ratio: no-arbitrage conditions increase the signal-to-noise ratio.
General model that includes all existing models as a special case.

6 Motivation: Contribution of this paper
Empirical Contributions
Empirically outperforms all benchmark models.
The optimal portfolio has an out-of-sample annual Sharpe ratio of 2.1.
Non-linearities and interactions between firm information matter.
The most relevant firm characteristics are price trends, profitability, and capital structure variables.

7 Motivation: Literature (partial list)
Deep learning for predicting asset prices:
  Gu, Kelly and Xiu (2018)
  Feng, Polson and Xu (2018)
  Messmer (2017)
  These predict future asset returns with a feed-forward network.
Linear or kernel methods for asset pricing of large data sets:
  Lettau and Pelger (2018): Risk-premium PCA
  Feng, Giglio and Xu (2017): Risk-premium lasso
  Freyberger, Neuhierl and Weber (2017): Group lasso
  Kelly, Pruitt and Su (2018): Instrumented PCA

8 Model: No-arbitrage pricing
$R^e_{i,t+1}$ = excess return (return minus risk-free rate) at time $t+1$ for asset $i = 1, \dots, N$.
Fundamental no-arbitrage condition: for all $t = 1, \dots, T$ and $i = 1, \dots, N$,
$E_t[M_{t+1} R^e_{i,t+1}] = 0$
$E_t[\cdot]$: expected value conditioned on the information set at time $t$.
$M_{t+1}$: stochastic discount factor (SDF) at time $t+1$.
Conditional moments imply infinitely many unconditional moments: for any $\mathcal{F}_t$-measurable variable $I_t$,
$E[M_{t+1} R^e_{i,t+1} I_t] = 0$

9 Model: No-arbitrage pricing
Without loss of generality, the SDF is a projection on the return space:
$M_{t+1} = 1 + \sum_{i=1}^{N} w_{i,t} R^e_{i,t+1}$
The optimal portfolio $\sum_{i=1}^{N} w_{i,t} R^e_{i,t+1}$ has the highest conditional Sharpe ratio.
Portfolio weights $w_{i,t}$ are a general function of macroeconomic information $I_t$ and firm-specific characteristics $I_{i,t}$:
$w_{i,t} = w(I_t, I_{i,t})$
This needs a non-linear estimator with many explanatory variables: use a feed-forward network to estimate $w_{i,t}$.
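The construction of the SDF from portfolio weights can be sketched in a few lines of NumPy. This is only an illustration under the sign convention written on the slide ($M_{t+1} = 1 + \sum_i w_{i,t} R^e_{i,t+1}$); the function name `sdf_from_weights` and the random inputs are assumptions made for the example, not part of the original talk.

```python
import numpy as np

def sdf_from_weights(weights, excess_returns):
    """Build M_{t+1} = 1 + sum_i w_{i,t} R^e_{i,t+1} (slide's sign convention).

    weights, excess_returns: arrays of shape (T, N); row t holds w_{.,t}
    and the next-period excess returns R^e_{.,t+1}.
    """
    factor = np.sum(weights * excess_returns, axis=1)  # SDF portfolio return
    return 1.0 + factor, factor

# Hypothetical inputs, for illustration only.
rng = np.random.default_rng(0)
T, N = 5, 3
w = rng.normal(size=(T, N))
r = rng.normal(scale=0.05, size=(T, N))
M, F = sdf_from_weights(w, r)
```

In the talk the weights come from a feed-forward network evaluated on $(I_t, I_{i,t})$; here they are random numbers so that the snippet stays self-contained.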

10 Model: Equivalent factor model representation
With factor $F_t = M_t - 1$, the no-arbitrage condition is equivalent to
$E_t[R^e_{i,t+1}] = \frac{\mathrm{cov}_t(R^e_{i,t+1}, F_{t+1})}{\mathrm{var}_t(F_{t+1})} \cdot E_t[F_{t+1}] = \beta_{i,t} \, \mu_{F,t}$
where $\beta_{i,t}$ is the covariance ratio and $\mu_{F,t} = E_t[F_{t+1}]$.
Without loss of generality we have a factor representation
$R^e_{t+1} = \beta_t F_{t+1} + \epsilon_{t+1}$

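11 Model: Objects of Interest
We use different approaches to estimate:
  The SDF factor $F_t$
  The risk loadings $\beta_t$
  The unexplained residual $\hat e_t = \left( I_N - \beta_{t-1} (\beta_{t-1}^\top \beta_{t-1})^{-1} \beta_{t-1}^\top \right) R^e_t$
Asset Pricing Performance Measures
  Sharpe ratio of the SDF factor: $SR = \frac{E[F]}{\sqrt{\mathrm{Var}(F)}}$
  Unexplained variation: $\frac{1}{T} \sum_{t=1}^{T} \frac{1}{N_t} \sum_{i=1}^{N_t} \hat e_{i,t}^2$
  Average pricing errors: $\frac{1}{N} \sum_{i=1}^{N} \left( \frac{1}{T_i} \sum_{t=1}^{T_i} \hat e_{i,t} \right)^2$
  Also time-weighted: $\frac{1}{N} \sum_{i=1}^{N} \frac{T_i}{T} \left( \frac{1}{T_i} \sum_{t=1}^{T_i} \hat e_{i,t} \right)^2$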
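The performance measures on this slide can be computed from a residual panel roughly as follows. This is a minimal sketch, assuming residuals are stored as a (T, N) array with NaN marking months in which an asset is not observed; the function names are hypothetical, and the Sharpe ratio uses the population standard deviation since the slide does not specify a degrees-of-freedom correction.

```python
import numpy as np

def sharpe_ratio(factor):
    # E[F] / sqrt(Var(F)), population variance (ddof=0) by assumption.
    return np.mean(factor) / np.std(factor)

def unexplained_variation(resid):
    # Average over months of the cross-sectional mean squared residual.
    return np.mean(np.nanmean(resid ** 2, axis=1))

def pricing_errors(resid, time_weighted=False):
    # Squared time-series mean residual per asset, averaged over assets.
    T = resid.shape[0]
    T_i = np.sum(~np.isnan(resid), axis=0)      # observations per asset
    alpha = np.nanmean(resid, axis=0)           # mean residual per asset
    weights = T_i / T if time_weighted else np.ones(resid.shape[1])
    return np.mean(weights * alpha ** 2)

# Tiny hypothetical panel: 3 months, 2 assets, one missing observation.
resid = np.array([[0.1, -0.2],
                  [0.3, np.nan],
                  [-0.1, 0.2]])
uv = unexplained_variation(resid)
pe = pricing_errors(resid)
pe_weighted = pricing_errors(resid, time_weighted=True)
```

Every asset and every month needs at least one observation for the NaN-aware means to be defined.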

12 Estimation: Loss Function
Objective Function for Estimation
Estimate the SDF portfolio weights $w(\cdot)$ to minimize the no-arbitrage moment conditions.
For a set of conditioning variables $\hat I_t$ the loss function is
$L(\hat I_t) = \frac{1}{N} \sum_{i=1}^{N} \frac{T_i}{T} \left( \frac{1}{T_i} \sum_{t=1}^{T_i} M_{t+1} R^e_{i,t+1} \hat I_t \right)^2$
This allows for an unbalanced panel.
How can we choose the conditioning variables $\hat I_t = f(I_t, I_{i,t})$ as general functions of the macroeconomic and firm-specific information?
A Generative Adversarial Network (GAN) chooses $\hat I_t$.
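A direct NumPy transcription of this loss for a single conditioning variable might look as follows. It is a sketch under stated assumptions: the panel is a (T, N) array with NaN for unobserved returns, the conditioning variable is given per asset and month, and the name `no_arbitrage_loss` is made up for the example.

```python
import numpy as np

def no_arbitrage_loss(M, excess_returns, cond):
    """(1/N) sum_i (T_i/T) * ((1/T_i) sum_t M_{t+1} R^e_{i,t+1} I_hat_{i,t})^2.

    M: (T,) SDF values; excess_returns: (T, N) with NaN where missing;
    cond: (T, N) values of one conditioning variable.
    """
    T = excess_returns.shape[0]
    observed = ~np.isnan(excess_returns)
    T_i = observed.sum(axis=0)                      # observations per asset
    moments = np.where(observed, M[:, None] * excess_returns * cond, 0.0)
    mean_moment = moments.sum(axis=0) / T_i         # sample moment per asset
    return np.mean((T_i / T) * mean_moment ** 2)

# Hypothetical two-month, two-asset panel with one missing return.
M = np.ones(2)
R = np.array([[0.1, 0.2],
              [-0.1, np.nan]])
loss = no_arbitrage_loss(M, R, np.ones_like(R))
```

With several conditioning variables, one natural extension is to sum the squared moments over them; the slide does not spell out that detail, so it is left out here.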

13 Estimation: Generative Adversarial Network (GAN)
Determining Moment Conditions
Two networks play a zero-sum game:
1. One network creates the SDF $M_{t+1}$.
2. The other network creates the conditioning variables $\hat I_t$.
Iteratively update the two networks:
1. For a given $\hat I_t$, the SDF network minimizes the loss.
2. For a given SDF, the conditional network finds the $\hat I_t$ with the largest loss (most mispricing).
Intuition: find the economic states and assets with the most pricing information.
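The alternating updates can be illustrated with a deliberately tiny toy problem. This sketch replaces both neural networks with low-dimensional parametric functions (a linear weight rule and a tanh conditioning function) and uses numerical gradients, so it only shows the min-max update pattern and is not the estimator from the talk; all names, step sizes and the simulated data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 120, 50
C = rng.normal(size=(T, N))                    # one characteristic per asset-month
F = 0.1 + 0.3 * rng.normal(size=T)             # latent factor (toy data)
R = C * F[:, None] + rng.normal(size=(T, N))   # simulated excess returns

def loss(theta, phi):
    w = theta * C / N                          # toy "SDF network": linear weights
    M = 1.0 + np.sum(w * R, axis=1)            # SDF, slide's sign convention
    g = np.tanh(phi[0] + phi[1] * C)           # toy "conditional network"
    return np.mean(np.mean(M[:, None] * R * g, axis=0) ** 2)

def num_grad(fn, x, eps=1e-5):
    # Central finite-difference gradient, adequate for a few parameters.
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = eps
        grad[k] = (fn(x + step) - fn(x - step)) / (2 * eps)
    return grad

theta = np.array([0.0])
phi = np.array([0.1, 0.1])
for _ in range(200):
    theta -= 0.5 * num_grad(lambda th: loss(th, phi), theta)  # SDF side: descend
    phi += 0.5 * num_grad(lambda ph: loss(theta, ph), phi)    # adversary: ascend
final_loss = loss(theta, phi)
```

In practice both players would be neural networks trained with automatic differentiation; simultaneous descent and ascent steps of this kind are not guaranteed to converge.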

14 Estimation: Recurrent Neural Network (RNN)
Transforming Macroeconomic Time-Series
Problems with economic time-series data:
  Time-series data is often non-stationary, so a transformation is necessary.
  Asset prices depend on economic states, so simple differencing of non-stationary data is not sufficient.
Solution: Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) cells.
Transform all macroeconomic time-series into a low-dimensional vector of stationary state variables.

15 Estimation: Example: Non-stationary Macroeconomic Variables
Figure: Macroeconomic variables. Panels: log RPI and log S&P500.

16 Estimation: Macroeconomic state processes
Figure: Macroeconomic state processes (LSTM outputs) based on 178 macroeconomic time-series.

17 Estimation: Neural Networks
Figure: Model architecture.
  SDF network (update parameters to minimize loss): state RNN, feed-forward network, construct SDF.
  Conditional network (update parameters to maximize loss): moment RNN, feed-forward network.
  Loss calculation; iterative optimizer with GAN.

18 Data
50 years of monthly observations: 01/ /2016.
Monthly stock returns for all U.S. securities from CRSP (around 31,000 stocks).
46 firm-specific characteristics $I_{i,t}$ for each stock and every month (usual suspects), normalized to cross-sectional quantiles.
178 macroeconomic variables $I_t$ (124 from FRED, 46 cross-sectional median time-series for characteristics, 8 from Goyal-Welch).
T-bill rates from the Kenneth French website.
Training/validation/test split is 20y/5y/25y.
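Normalizing a characteristic to cross-sectional quantiles means replacing each value by its rank within the same month. The sketch below is one possible way to do this, assuming a (T, N) float array with NaN for missing values and a mapping to rank/n in (0, 1]; the exact scaling used by the authors is not stated on the slide, and the function name is hypothetical.

```python
import numpy as np

def cross_sectional_quantiles(x):
    """Replace each value by its within-row rank divided by the row's count.

    x: float array of shape (T, N); NaN entries stay NaN.
    Ties are broken arbitrarily by the sort order.
    """
    out = np.full(x.shape, np.nan)
    for t in range(x.shape[0]):
        valid = ~np.isnan(x[t])
        n = valid.sum()
        if n == 0:
            continue
        ranks = np.argsort(np.argsort(x[t, valid])) + 1   # ranks 1..n
        out[t, valid] = ranks / n
    return out

# Two hypothetical months, four stocks, one missing value.
x = np.array([[10.0, 30.0, 20.0, np.nan],
              [5.0, 1.0, 3.0, 2.0]])
q = cross_sectional_quantiles(x)
```

Because only ranks enter, the transformed characteristic is insensitive to outliers and has the same distribution in every month.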

19 Data: Benchmark models
1. Linear factor models (CAPM, Fama-French 5 factors).
2. Instrumented PCA (Kelly et al. (2018)): estimate the SDF as a linear function of characteristics, $w_{i,t} = \theta^\top I_{i,t}$.
3. Deep learning return forecasting (Gu et al. (2018)):
  Predict conditional expected returns $E_t[R_{i,t+1}]$.
  Empirical loss function for prediction: $\frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \left( R_{i,t+1} - g(I_t, I_{i,t}) \right)^2$
  Use only a simple feed-forward network for forecasting.
  Conditional means are proportional to $\beta_t$, so the SDF factor and residual are obtained through projections on the conditional means.

20 Results: Sharpe Ratio
Table: Sharpe ratios of benchmark models. Columns: Model, SR (Train), SR (Valid), SR (Test). Rows: FF, FF, IPCA, RtnFcst.

21 Results: Sharpe Ratio
Table: Performances of our approach sorted by validation Sharpe ratio. Columns: SR (Train), SR (Valid), SR (Test), SMV, CSMV, HL, CHL, HU, CHU, LR ($10^{-4}$).
Optimal model: 4 moments, 4 macro states, 4 layers, 64 hidden units, 32 moments

22 Results: Optimal Portfolio Performance
Figure: Cumulated normalized SDF portfolio. Series: IPCA, RtnFcst, SDF. Vertical axis: cumulated excess return.

23 Results: Sharpe Ratio for Forecasting Approach
Table: Performances with the return forecast approach. Columns: Macro, Neurons, SR (Train), SR (Valid), SR (Test).
Rows (Macro, Neurons): Y, [32, 16, 8]; Y, [128, 128]; N, [32, 16, 8]; N, [128, 128].

24 Results: IPCA: Number of Factors
Figure: Sharpe ratio as a function of the number of factors K for IPCA. Series: train, valid, test.

25 Results: Asset Pricing Performance
Table: Asset pricing performance of the Forecasting and Ensemble GAN models on the train, valid and test samples.
Measures: unexplained variation; Fama-MacBeth type alpha ($10^{-3}$); Fama-MacBeth type alpha (weighted, $10^{-4}$).

26 Results: Sharpe Ratio Performance of Benchmark Models
Table: SDF portfolio vs. Fama-French 5 factors. Entries cover Mkt-RF, SMB, HML, RMW, CMA and the intercept, reporting coefficient and correlation.
Conventional factors do not span the SDF.

27 Results: Risk of Optimal Portfolios
Table: Max 1 month loss and max drawdown for IPCA, RtnFcst and Ensemble GAN.
Ensemble GAN is less risky.

28 Results: Variable Importance
Variables ranked by average absolute gradient for the SDF network:
ST_REV r12_2 SUV D2P NOA D2A CF2P r12_7 RNA Resid_Var MktBeta BEME OA E2P Investment AT Variance PM NI Q LTurnover LT_Rev Beta SGA2S LME AC Spread PROF IdioVol DPI2A r2_1 ATO S2P FC2Y OL Lev ROE C A2ME PCM ROA CTO CF Rel2High OP r36_

29 Results: Variable Importance
Variables ranked by reduction in $R^2$ for RtnFcst:
ST_REV SUV r12_2 LME r2_1 IdioVol Beta CF2P Resid_Var Rel2High FC2Y r12_7 E2P PM Spread ROA AT SGA2S LTurnover OL OP Variance DPI2A A2ME LT_Rev NI S2P NOA PCM Investment MktBeta CF PROF CTO ROE Q ATO D2P BEME r36_13 D2A AC Lev OA C RNA

30 Results: Variable Importance
Variables ranked by reduction in $R^2$ for IPCA:
ST_REV SUV Beta r36_13 r2_1 PCM r12_2 D2P C LME AT NI PM D2A MktBeta BEME CF Rel2High Lev OL CTO LT_Rev CF2P NOA ROA ROE E2P DPI2A ATO Investment Variance RNA SGA2S Q LTurnover OP Resid_Var FC2Y r12_7 OA A2ME AC S2P Spread PROF IdioVol

31 Results: Size Effect
Figure: SDF weight and market capitalization (LME) in test data.

32 Non-linearities: SDF Weights
Relationship between Weights and Characteristics
Figure: Weight as a function of size (LME) and book-to-market (BEME). Value has a close-to-linear effect, but the size effect is non-linear.

33 Non-linearities: SDF Weights
Relationship between Weights and Characteristics
Figure: Weight as a function of investment or profitability (axis labels: Investment, LME). Non-linear effect!

34 Non-linearities: SDF Weights
Relationship between Weights and Characteristics
Figure: Weight as a function of size (LME) and book-to-market (BEME). Size and value have a non-linear interaction!

35 Non-linearities: SDF Weights
Relationship between Weights and Characteristics
Figure: Weight as a function of size (LME), book-to-market (BEME) and ST-reversal (ST_REV). Complex interaction between multiple variables!

36 Non-linearities: IPCA Weights
Relationship between Weights and Characteristics
Figure: Weight as a function of size (LME) and book-to-market (BEME). Size and value can only be modeled linearly!

37 Non-linearities: Return Forecasting Weights
Relationship between Weights and Characteristics
Figure: Conditional return as a function of size (LME) and book-to-market (BEME). Size and value only load on extreme values.

38 Simulation: Simulation Setup
Consider a single-factor model
$R_{i,t+1} = \beta_{i,t} F_{t+1} + \varepsilon_{i,t+1}$
The only factor is sampled from $N(\mu_F, \sigma_F^2)$.
The loadings are $\beta_{i,t} = C_{i,t}$ with $C_{i,t}$ i.i.d. $N(0, 1)$.
The residuals are i.i.d. $N(0, 1)$.
$N = 500$ and $T = 600$. Training/validation/test periods: 250, 100, 250.
Consider $\sigma_F^2 \in \{0.01, 0.05, 0.1\}$.
Sharpe ratio of the factor: $SR = \mu_F / \sigma_F = 0.3$ or $SR = 1$.
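The simulation design above is simple enough to reproduce directly. The following sketch generates one panel for a chosen factor variance and Sharpe ratio and splits it chronologically; the function name, the seed and the choice of a chronological split are assumptions for illustration.

```python
import numpy as np

def simulate_panel(mu_F, sigma2_F, N=500, T=600, seed=0):
    """Single-factor model R_{i,t+1} = C_{i,t} F_{t+1} + eps_{i,t+1}."""
    rng = np.random.default_rng(seed)
    F = rng.normal(mu_F, np.sqrt(sigma2_F), size=T)   # factor draws
    C = rng.normal(size=(T, N))                       # loadings beta_{i,t} = C_{i,t}
    eps = rng.normal(size=(T, N))                     # idiosyncratic noise
    R = C * F[:, None] + eps
    return R, C, F

sigma2_F = 0.1
target_sr = 1.0
mu_F = target_sr * np.sqrt(sigma2_F)                  # SR = mu_F / sigma_F
R, C, F = simulate_panel(mu_F, sigma2_F)

# Split 600 periods into 250 training, 100 validation and 250 test periods.
train_idx, valid_idx, test_idx = np.split(np.arange(600), [250, 350])
```

With unit-variance noise and a factor variance of at most 0.1, the signal in any single return is weak, which is the regime the simulation is meant to probe.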

39 Simulation: Simulation Results: Intuition
Intuition: better noise diversification with our approach.
Simple return prediction: $\frac{1}{TN} \sum_{i=1}^{N} \sum_{t=1}^{T} \left( R^e_{i,t+1} - f(I_t) \right)^2$
SDF estimator: $\frac{1}{N} \sum_{i=1}^{N} \left( \frac{1}{T} \sum_{t=1}^{T} R^e_{i,t+1} M_{t+1} g(C_{i,t}) \right)^2$
The SDF estimator averages out the noise over the time-series.

40 Simulation: Simulation Results
Table: Sharpe ratio on the test dataset. Columns: $\sigma_F^2$, RtnFcst, SDF estimator. Row groups by SR.

41 Simulation: Simulation Results
Figure: Estimated loadings and SDF weights. Panels: our SDF estimator; return forecasting.
Our approach detects the SDF and loading structure. The simple forecasting approach fails.

42 Conclusion
Methodology
  Novel combination of deep neural networks to estimate the pricing kernel.
  Key innovation: use the no-arbitrage condition as criterion function.
  Time-variation explained by macroeconomic states and firm characteristics.
  General asset pricing model that includes all other models as special cases.
Empirical Results
  Outperforms benchmark models.
  Non-linearities and interactions are important.

43 IPCA: Number of Factors
Table: Performance with IPCA. Columns: Number of Factors, SR (Train), SR (Valid), SR (Test).

44 Hyper-Parameter Search
Search Space
  CV: number of conditional variables: 4, 8, 16, 32
  SMV: number of macroeconomic state variables: 4, 8, 16, 32
  HL: number of fully-connected layers: 2, 3, 4
  HU: number of hidden units in fully-connected layers: 32, 64, 128
  D: dropout rate (keep probability): 0.9, 0.95
  LR: learning rate: 0.001, , ,
Choose the best configuration among all possible combinations (1152) of hyper-parameters on the validation set.
Use the ReLU activation function $\mathrm{ReLU}(x)_i = \max(x_i, 0)$.
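The count of 1152 configurations follows from the sizes of the six lists (4 x 4 x 3 x 3 x 2 x 4), and the grid can be enumerated as sketched below. Only the first learning rate is legible in the source, so the remaining three entries are explicit placeholders and not values from the talk; the dictionary keys reuse the slide's abbreviations.

```python
from itertools import product

search_space = {
    "CV": [4, 8, 16, 32],          # number of conditional variables
    "SMV": [4, 8, 16, 32],         # number of macroeconomic state variables
    "HL": [2, 3, 4],               # number of fully-connected layers
    "HU": [32, 64, 128],           # hidden units per layer
    "D": [0.9, 0.95],              # dropout keep probability
    # Placeholders: the last three learning rates are not given in the source.
    "LR": [0.001, "lr_2", "lr_3", "lr_4"],
}

# One dict per hyper-parameter combination.
configs = [dict(zip(search_space, values))
           for values in product(*search_space.values())]
n_configs = len(configs)
```

Each configuration would then be trained and ranked by its validation Sharpe ratio, as described on the results slides.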

45 Economic Significance of Variables
Sensitivity
We define the sensitivity of a particular variable as the magnitude of the derivative of the weight $w$ with respect to this variable (averaged over the data):
$\mathrm{Sensitivity}(x_j) = \frac{1}{C} \sum_{i=1}^{N} \sum_{t} \left| \frac{\partial w(\tilde I_t, I_{i,t})}{\partial x_j} \right|$ (1)
with $C$ a normalization constant.
The analysis is performed with the feed-forward network, and we only consider the sensitivity of firm characteristics and state macro variables.
A sensitivity of value $z$ for a given variable means that the weight $w$ will approximately change (in magnitude) by $z \Delta$ if that variable is changed by a small amount $\Delta$.

46 Interactions between Variables
Significance of Interactions
We might also want to understand how the output simultaneously depends upon multiple variables. We can measure the economic significance of the interaction between variables $x_i$ and $x_j$ by the derivative:
$\mathrm{Sensitivity}(x_i, x_j) = \frac{1}{C} \sum_{i=1}^{N} \sum_{t} \left| \frac{\partial^2 w(\tilde I_t, I_{i,t})}{\partial x_i \partial x_j} \right|$ (2)
This derivative can be generalized to measure higher-order interactions.

47 Finite Difference Schemes
First-Order and Second-Order Finite Difference Schemes
Suppose we have some multivariate function $f$. Without actually measuring gradients, we can approximate them with finite difference methods:
$\frac{\partial f}{\partial x_j} \approx \frac{f(x_j + \Delta) - f(x_j)}{\Delta}$ (3)
$\frac{\partial^2 f}{\partial x_i \partial x_j} \approx \frac{f(x_i + \Delta, x_j + \Delta) - f(x_i + \Delta, x_j) - f(x_i, x_j + \Delta) + f(x_i, x_j)}{\Delta^2}$ (4)
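The two finite-difference schemes can be turned into sensitivity measures as follows. This is a minimal sketch, assuming the fitted model is available as a function `f` that maps a (n, d) feature matrix to n outputs; the function names, the step size, and the use of a plain mean in place of the normalization constant $C$ are choices made for the example.

```python
import numpy as np

def first_order_sensitivity(f, X, j, delta=1e-3):
    """Mean absolute forward difference of f with respect to feature j."""
    X_shift = X.copy()
    X_shift[:, j] += delta
    return np.mean(np.abs(f(X_shift) - f(X)) / delta)

def interaction_sensitivity(f, X, i, j, delta=1e-3):
    """Mean absolute second-order cross difference for features i and j."""
    X_i = X.copy()
    X_i[:, i] += delta
    X_j = X.copy()
    X_j[:, j] += delta
    X_ij = X_i.copy()
    X_ij[:, j] += delta
    cross = (f(X_ij) - f(X_i) - f(X_j) + f(X)) / delta ** 2
    return np.mean(np.abs(cross))

# Hypothetical test function with a known interaction between features 0 and 1.
def toy_model(X):
    return X[:, 0] * X[:, 1] + 2.0 * X[:, 2]

X = np.random.default_rng(0).normal(size=(100, 3))
s_linear = first_order_sensitivity(toy_model, X, j=2)
s_cross = interaction_sensitivity(toy_model, X, i=0, j=1)
s_none = interaction_sensitivity(toy_model, X, i=0, j=2)
```

For the toy function the linear feature has sensitivity 2, the pair (0, 1) has interaction strength 1, and the pair (0, 2) has none, which makes the scheme easy to sanity-check before applying it to a trained network.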

48 Feed Forward Network
Figure: Feed Forward Network with Dropout

49 Feed Forward Network
Network Structure
The input layer accepts the raw predictors (or features):
$h_0 = x_{i,t} = [I_{i,t}, \tilde I_t]$ (5)
Each hidden layer takes the output from the previous layer and transforms it into an output as
$h_k = f(h_{k-1} W_k + b_k), \quad k = 1, \dots, K$ (6)
In our implementation, we use the ReLU activation function:
$\mathrm{ReLU}(x)_i = \max(x_i, 0)$ (7)
The output layer is simply a linear transformation of the output from the last hidden layer to a scalar:
$w_{i,t} = h_K W_{K+1} + b_{K+1}$ (8)
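Equations (5) to (8) describe an ordinary fully-connected forward pass, which can be sketched in NumPy as below. The layer sizes, the random initialization and the helper names are assumptions for illustration, and dropout is left out to keep the example short.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def init_params(layer_sizes, seed=0):
    """Random (W_k, b_k) pairs; layer_sizes = [p_0, p_1, ..., p_K, 1]."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(scale=0.1, size=(p_in, p_out)), np.zeros(p_out))
            for p_in, p_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x, params):
    """h_k = ReLU(h_{k-1} W_k + b_k) for hidden layers, then a linear output."""
    h = x
    for W, b in params[:-1]:
        h = relu(h @ W + b)
    W_out, b_out = params[-1]
    return (h @ W_out + b_out).squeeze(-1)   # one scalar weight per input row

# Hypothetical example: 6 input features, hidden layers of 8 and 4 units.
params = init_params([6, 8, 4, 1])
x = np.random.default_rng(1).normal(size=(10, 6))
w = forward(x, params)
```

Each row of `x` plays the role of $[I_{i,t}, \tilde I_t]$ for one stock-month, and the corresponding entry of `w` is the portfolio weight $w_{i,t}$.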

50 Feed Forward Network
Model Complexity
Number of hidden layers: $K$.
Let $p_k$ denote the number of neurons (or hidden units) in layer $k$. The parameters in layer $k$ are
$W_k \in \mathbb{R}^{p_{k-1} \times p_k}$ and $b_k \in \mathbb{R}^{p_k}$ (9)
with $p_0 = \dim(I_{i,t}) + \dim(\tilde I_t)$ and $p_{K+1} = 1$.
Number of parameters: $\sum_{k=1}^{K+1} (p_{k-1} + 1) p_k$.
e.g. A 4-hidden-layer network with hidden units [128, 128, 64, 64] has parameters.
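The parameter count formula is easy to evaluate for any architecture. The sketch below assumes an input dimension of 50 (46 firm characteristics plus 4 macro state variables, as mentioned elsewhere in the slides); this may not be the configuration behind the slide's own example, and the function name is hypothetical.

```python
def count_parameters(input_dim, hidden_units):
    """Sum of (p_{k-1} + 1) * p_k over all layers, with a scalar output layer."""
    sizes = [input_dim] + list(hidden_units) + [1]
    return sum((p_prev + 1) * p for p_prev, p in zip(sizes[:-1], sizes[1:]))

# Assumed input dimension: 46 characteristics + 4 macro states.
n_params = count_parameters(50, [128, 128, 64, 64])
print(n_params)  # 35521
```

The `+ 1` inside the sum accounts for the bias term of each unit.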

51 State RNN
Two Reasons for RNN
Instead of directly passing macro variables $I_t$ as features to the feed-forward network, we apply a nonlinear transformation to them with an RNN.
Many macro variables themselves are not stationary and have trends. Appropriate transformations of $I_t$ are essential for generating a stable model.
Using an RNN allows us to encode all historical information of the macro economy. Intuitively, the RNN summarizes all historical macro information into a low-dimensional vector of state variables in a data-driven way.

52 Transform Macro Variables via RNN
Properties of RNN
For any $\mathcal{F}_t$-measurable sequence $I_t$, the output sequence $\tilde I_t$ is again $\mathcal{F}_t$-measurable. The transformation creates no look-ahead bias.
$\tilde I_t$ contains all the macro information in the past, while $I_t$ only uses current information.
The RNN helps create stationary macro inputs for the feed-forward network.

53 Recurrent Network
Figure panels: Recurrent Network with RNN Cell; Recurrent Network with LSTM Cell.

54 Recurrent Network: RNN Cell
RNN Cell Structure
A vanilla RNN model takes the current input variable $x_t = I_t$ and the previous hidden state $h_{t-1}$ and performs a nonlinear transformation to get the current hidden state $h_t$:
$h_t = f(h_{t-1} W_h + x_t W_x)$ (10)
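The recursion in equation (10) can be unrolled over a macro time series as sketched below, which also makes the absence of look-ahead bias easy to check: the state at time t depends only on inputs up to t. The tanh non-linearity, the zero initial state, the dimensions and the random weights are assumptions for this example.

```python
import numpy as np

def rnn_states(X, W_h, W_x, f=np.tanh):
    """Unroll h_t = f(h_{t-1} W_h + x_t W_x) over the rows of X."""
    T = X.shape[0]
    h = np.zeros(W_h.shape[0])          # initial hidden state h_0 = 0
    states = np.zeros((T, W_h.shape[0]))
    for t in range(T):
        h = f(h @ W_h + X[t] @ W_x)
        states[t] = h
    return states

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))            # 10 periods, 3 macro variables
W_h = 0.5 * rng.normal(size=(4, 4))     # 4 hidden state variables
W_x = rng.normal(size=(3, 4))
states = rnn_states(X, W_h, W_x)

# Changing inputs from period 6 onwards leaves the earlier states untouched.
X_future = X.copy()
X_future[6:] += 10.0
states_future = rnn_states(X_future, W_h, W_x)
```

Comparing `states[:6]` with `states_future[:6]` confirms that no future information leaks into earlier state variables.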

55 Recurrent Network: LSTM Cell
Figure: LSTM cell.

56 Recurrent Network: LSTM Cell Structure
An LSTM model creates a new memory cell $\tilde c_t$ with current input $x_t$ and previous hidden state $h_{t-1}$:
$\tilde c_t = \tanh(h_{t-1} W_h^{(c)} + x_t W_x^{(c)})$ (11)
An input gate $i_t$ and a forget gate $f_t$ are created to control the final memory cell:
$i_t = \sigma(h_{t-1} W_h^{(i)} + x_t W_x^{(i)})$ (12)
$f_t = \sigma(h_{t-1} W_h^{(f)} + x_t W_x^{(f)})$ (13)
$c_t = f_t \circ c_{t-1} + i_t \circ \tilde c_t$ (14)
Finally, an output gate $o_t$ is used to control the amount of information stored in the hidden state:
$o_t = \sigma(h_{t-1} W_h^{(o)} + x_t W_x^{(o)})$ (15)
$h_t = o_t \circ \tanh(c_t)$ (16)
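A single LSTM step following equations (11) to (16) can be written out as below. This is a sketch without bias terms, matching the equations as given; the parameter dictionary layout, the dimensions and the random weights are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM update; p maps names like 'W_h_i' to weight matrices."""
    c_new = np.tanh(h_prev @ p["W_h_c"] + x_t @ p["W_x_c"])   # new memory (11)
    i_t = sigmoid(h_prev @ p["W_h_i"] + x_t @ p["W_x_i"])     # input gate (12)
    f_t = sigmoid(h_prev @ p["W_h_f"] + x_t @ p["W_x_f"])     # forget gate (13)
    c_t = f_t * c_prev + i_t * c_new                          # final memory (14)
    o_t = sigmoid(h_prev @ p["W_h_o"] + x_t @ p["W_x_o"])     # output gate (15)
    h_t = o_t * np.tanh(c_t)                                  # hidden state (16)
    return h_t, c_t

# Hypothetical sizes: 3 macro inputs, 2 hidden state variables.
rng = np.random.default_rng(0)
d_in, d_h = 3, 2
p = {}
for gate in ("c", "i", "f", "o"):
    p["W_h_" + gate] = rng.normal(size=(d_h, d_h))
    p["W_x_" + gate] = rng.normal(size=(d_in, d_h))

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):      # run five time steps
    h, c = lstm_step(x_t, h, c, p)
```

Because the output gate lies in (0, 1) and tanh is bounded, every hidden state stays strictly inside (-1, 1), which is one reason the LSTM outputs are well-behaved inputs for the feed-forward network.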
