Estimation of High-dimensional Vector Autoregressive (VAR) models

Size: px

Start display at page:

Download "Estimation of High-dimensional Vector Autoregressive (VAR) models"

Lorraine Marshall
6 years ago
Views:

1 Estimation of High-dimensional Vector Autoregressive (VAR) models George Michailidis Department of Statistics, University of Michigan gmichail CANSSI-SAMSI Workshop, Fields Institute, Toronto May 2014 Joint work with Sumanta Basu George Michailidis (UM) High-dimensional VAR 1 / 47

2 Outline 1 Introduction 2 Modeling Framework 3 Theoretical Considerations 4 Implementation 5 Performance Evaluation George Michailidis (UM) High-dimensional VAR 2 / 47

3 Vector Autoregressive models (VAR) widely used for structural analysis and forecasting of time-varying systems capture rich dynamics among system components popular in diverse application areas control theory: system identification problems economics: estimate macroeconomic relationships (Sims, 1980) genomics: reconstructing gene regulatory network from time course data neuroscience: study functional connectivity among brain regions from fmri data (Friston, 2009) George Michailidis (UM) High-dimensional VAR 3 / 47

4 VAR models in Economics testing relationship between money and income (Sims, 1972) understanding stock price-volume relation (Hiemstra et al., 1994) dynamic effect of government spending and taxes on output (Blanchard and Jones, 2002) identify and measure the effects of monetary policy innovations on macroeconomic variables (Bernanke et al., 2005) George Michailidis (UM) High-dimensional VAR 4 / 47

5 Feb-60 Aug-60 Feb-61 Aug-61 Feb-62 Aug-62 Feb-63 Aug-63 Feb-64 Aug-64 Feb-65 Aug-65 Feb-66 Aug-66 Feb-67 Aug-67 Feb-68 Aug-68 Feb-69 Aug-69 Feb-70 Aug-70 Feb-71 Aug-71 Feb-72 Aug-72 Feb-73 Aug-73 Feb-74 Aug-74 VAR models in Economics 6 4 Employment Federal Funds Rate Consumer Price Index George Michailidis (UM) High-dimensional VAR 5 / 47

6 VAR models in Functional Genomics technological advances allow collecting huge amount of data DNA microarrays, RNA-sequencing, mass spectrometry capture meaningful biological patterns via network modeling difficult to infer direction of influence from co-expression transition patterns in time course data helps identify regulatory mechanisms George Michailidis (UM) High-dimensional VAR 6 / 47

7 VAR models in Functional Genomics (ctd) HeLa gene expression regulatory network [Courtesy: Fujita et al., 2007] George Michailidis (UM) High-dimensional VAR 7 / 47

8 VAR models in Neuroscience identify connectivity among brain regions from time course fmri data connectivity of VAR generative model (Seth et al., 2013) George Michailidis (UM) High-dimensional VAR 8 / 47

9 Model p-dimensional, discrete time, stationary process X t = {X t 1,...,Xt p} X t = A 1 X t A d X t d + ε t, ε t i.i.d N(0,Σ ε ) (1) A 1,...,A d : p p transition matrices (solid, directed edges) Σ 1 ε : contemporaneous dependence (dotted, undirected edges) stability: Eigenvalues of A (z) := I p d t=1 A tz t outside {z C, z 1} George Michailidis (UM) High-dimensional VAR 9 / 47

10 Why high-dimensional VAR? The parameter space grows quadratically (p 2 edges for p time series) order of the process (d) often unknown Economics: Forecasting with many predictors (De Mol et al., 2008) Understanding structural relationship - price puzzle" (Christiano et al., 1999) Functional Genomics: reconstruct networks among hundreds to thousands of genes experiments costly - small to moderate sample size Finance: structural changes - local stationarity George Michailidis (UM) High-dimensional VAR 10 / 47

11 Literature on high-dimensional VAR models Economics: Bayesian vector autoregression (lasso, ridge penalty; Litterman, Minnesota Prior) Factor model based approach (FAVAR, dynamic factor models) Bioinformatics: Discovering gene regulatory mechanisms using pairwise VARs (Fujita et al., 2007 and Mukhopadhyay and Chatterjee, 2007) Penalized VAR with grouping effects over time (Lozano et al., 2009) Truncated lasso and thesholded lasso variants (Shojaie and Michailidis, 2010 and Shojaie, Basu and Michailidis, 2012) Statistics: lasso (Han and Liu, 2013) and group lasso penalty (Song and Bickel, 2011) low-rank modeling with nuclear norm penalty (Negahban and Wainwright, 2011) sparse VAR modeling via two-stage procedures (Davis et al., 2012) George Michailidis (UM) High-dimensional VAR 11 / 47

12 Outline 1 Introduction 2 Modeling Framework 3 Theoretical Considerations 4 Implementation 5 Performance Evaluation George Michailidis (UM) High-dimensional VAR 12 / 47

13 Model p-dimensional, discrete time, stationary process X t = {X t 1,...,Xt p} X t = A 1 X t A d X t d + ε t, ε t i.i.d N(0,Σ ε ) (2) A 1,...,A d : p p transition matrices (solid, directed edges) Σ 1 ε : contemporaneous dependence (dotted, undirected edges) stability: Eigenvalues of A (z) := I p d t=1 A tz t outside {z C, z 1} George Michailidis (UM) High-dimensional VAR 13 / 47

14 Detour: VARs and Granger Causality Concept introduced by Granger (1969) A time series X is said to Granger-cause Y if it can be shown, usually through a series of F-tests on lagged values of X (and with lagged values of Y also known), that those X values provide statistically significant information about future values of Y. In the context of a high-dimensional VAR model we have that X T t j is Granger-causal for X T i if A t i,j 0. Granger-causality does not imply true causality; it is built on correlations Also, related to estimating a Directed Acyclic Graph (DAG) with (d + 1) p variables, with a known ordering of the variables George Michailidis (UM) High-dimensional VAR 14 / 47

15 Estimating VARs through regression data: {X 0,X 1,...,X T } - one replicate, observed at T + 1 time points construct autoregression (X T ) (X T 1 ) (X T 2 ) (X T d ) (X T 1 ) (X T 2 ) (X T 3 ) (X T 1 d ) = } (X d ) {{ } } (X d 1 ) (X d 2 ) {{ (X 0 ) } Y X A 1. A d } {{ } B + (ε T ) (ε T 1 ).. (ε d ) } {{ } E vec(y ) = vec(x B ) + vec(e) Y }{{} Np 1 = (I X )vec(b ) + vec(e) = }{{} Z Np q β }{{} q 1 +vec(e) }{{} Np 1 N = (T d + 1), q = dp 2 Assumption : A t are sparse, d t=1 A t 0 k vec(e) N (0,Σ ε I) George Michailidis (UM) High-dimensional VAR 15 / 47

16 Estimates l 1 -penalized least squares (l 1 -LS) 1 argmin β R q N Y Zβ 2 + λ N β 1 l 1 -penalized log-likelihood (l 1 -LL) (Davis et al., 2012) 1 argmin β R q N (Y ( Zβ) Σ 1 ε I ) (Y Zβ) + λ N β 1 George Michailidis (UM) High-dimensional VAR 16 / 47

17 Outline 1 Introduction 2 Modeling Framework 3 Theoretical Considerations 4 Implementation 5 Performance Evaluation George Michailidis (UM) High-dimensional VAR 17 / 47

18 Detour: Consistency of Lasso Regression Y n 1 LASSO : = X n p β 1 ˆβ := argmin β R p n Y Xβ 2 + λ n β 1 p 1 + ε { } S = j {1,...,p} βj i.i.d. 0, card(s) = k, k n, ε i N(0,σ 2 ) Restricted Eigenvalue (RE): Assume 1 α RE := min v R p, v 1, v S c 1 3 v S 1 n Xv 2 > 0 n 1 Estimation error: ˆβ β Q(X,σ) 1 k logp α RE n with high probability George Michailidis (UM) High-dimensional VAR 18 / 47

19 Verifying Restricted Eigenvalue Condition Raskutti et al. (2010): If the rows of X i.i.d. N(0,Σ X ) and Σ X satisfies RE, then X satisfies RE with high probability. Assumption of independence among rows crucial Rudelson and Zhou (2013): If the design matrix X can be factorized as X = ΨA where A satisfies RE and Ψ acts as (almost) an isometry on the images of sparse vectors under A, then X satisfies RE with high probability. George Michailidis (UM) High-dimensional VAR 19 / 47

20 Back to Vector Autoregression Random design matrix X, correlated with error matrix E (X T ) (X T 1 ) (X T 2 ) (X T d ) (X T 1 ) (X T 2 ) (X T 3 ) (X T 1 d ) A 1 = (X d ) (X d 1 ) (X d 2 ) (X 0 ) A d }{{}}{{}}{{} B Y X vec(y ) = vec(x B ) + vec(e) Y }{{} Np 1 = (I X )vec(b ) + vec(e) = }{{} Z Np q β }{{} q 1 +vec(e) }{{} Np 1 N = (T d + 1), q = dp 2 + vec(e) N (0,Σ ε I) (ε T ) (ε T 1 ). } (ε d ) {{ } E George Michailidis (UM) High-dimensional VAR 20 / 47

21 Vector Autoregression (ctd) Key Questions: How often does RE hold? How small is α RE? How does the cross-correlation affect convergence rates? George Michailidis (UM) High-dimensional VAR 21 / 47

22 Consistency of VAR estimates Restricted Eigenvalue (RE) assumption: (I X ) q q RE(α,τ(N,q)) with α > 0,τ(N,q) > 0 if θ ( I X X /N ) θ α θ 2 2 τ(n,q) θ 2 1 for all θ R q (3) Deviation Condition: There exists a function Q(β,Σ ε ) such that vec ( X E/N ) log d + 2 log p max Q(β,Σ ε ) N Key Result: Estimation Consistency: If (3) and (4) hold with kτ(n, q) α/32, then, for any λ N 4Q(β,Σ ε ) (log d + 2 log p)/n, lasso estimate ˆβ l1 satisfies ˆβ l1 β 64 Q(β,Σ ε ) k (log d + 2 log p) α N (4) George Michailidis (UM) High-dimensional VAR 22 / 47

23 Verifying RE and Deviation Condition Negahban and Wainwright, 2011: for VAR(1) models, assume A 1 < 1, where A := Λ max (A A) For p = 1, d = 1, X t = ρx t 1 + ε t, reduces to ρ < 1 - equivalent to stability Han and Liu, 2013: for VAR(d) models, reformulate as VAR(1): X t = Ã 1 X t 1 + ε t, where X t = X t X t 1. X t d+1 Assume Ã 1 < 1 dp 1 Ã 1 = A 1 A 2 A d 1 A d I p I p I p 0 dp dp ε t = ε t 0. 0 dp 1 George Michailidis (UM) High-dimensional VAR 23 / 47

24 VAR(1): Stability and A 1 < 1 A 1 1 for many stable VAR(1) models [ X t = A 1 X t 1 + ε t α 0, A 1 = β α ] (α, β): {X t } stable 5 β α t 2 X 1 t 1 X 1 α t X β α t 2 t 1 t X 2 X 2 X 2 (α, β): A 1 < 1 5 George Michailidis (UM) High-dimensional VAR 24 / 47

25 VAR(d): Stability and Ã 1 < 1 Ã 1 1 for any stable VAR(d) models, if d > 1 [ X t = 2αX t 1 α 2 X t 2 + ε t X t, X t 1 ] [ 2α α 2 = 1 0 ][ X t 1 X t 2 ] [ ε t + 0 ] 3 ~ A 1 α 2 2α X t 2 X t 1 X t 2 ~ A 1 = 1 0 α George Michailidis (UM) High-dimensional VAR 25 / 47

26 Stable VAR models V VAR(2), VAR(3),... VAR(d),... VAR(1) A < 1 George Michailidis (UM) High-dimensional VAR 26 / 47

27 Stable VAR models V VAR(2), VAR(3),... VAR(d),... VAR(1) A < 1 George Michailidis (UM) High-dimensional VAR 27 / 47

28 Quantifying Stability through the Spectral Density Spectral density function of a covariance stationary process {X t }, f X (θ) = 1 2π Γ X (l)e ilθ, l= θ [ π,π] Γ X (l) = E [ X t (X t+l ) ], autocovariance matrix of order l If the VAR process is stable, it has a closed form (Priestley, 1981) f X (θ) = 1 ( ) 1 ( ) 1 A (e iθ ) Σε A (e iθ ) 2π The two sources of dependence factorize in frequency domain George Michailidis (UM) High-dimensional VAR 28 / 47

29 Quantifying Stability by Spectral Density Quantifying Stability by Spectral Density For univariate processes, peak" of the spectral density measures stability of the For process univariate - (sharper processes, peak the = less peak" stable) of the spectral density measures stability of the process - (sharper peak = less stable) Autocovariance Γ(h) ρ=0.1 ρ=0.5 ρ=0.7 f(θ) ρ=0.1 ρ=0.5 ρ= lag (h) θ (f) Autocovariance of AR(1) (g) Spectral Density of AR(1) For multivariate processes, similar role is played by the maximum eigenvalue of For multivariate processes, similar role is played by the maximum eigenvalue of the (matrix-valued) spectral density the (matrix-valued) spectral density Sumanta Basu (UM) Network Estimation in Time Series 52 / 56 George Michailidis (UM) High-dimensional VAR 29 / 47

30 Quantifying Stability by Spectral Density For a stable VAR(d) process {X t }, the maximum eigenvalue of its spectral density captures its stability M (f X ) = max Λ max (f X (θ)) θ [ π,π] The minimum eigenvalue of the spectral density captures dependence among its components m(f X ) = min Λ min (f X (θ)) θ [ π,π] For stable VAR(1) processes, M (f X ) scales with (1 ρ(a 1 )) 2, ρ(a 1 ) is the spectral radius of A 1 m(f X ) scales with the capacity (maximum incoming + outgoing effect at a node) of the underlying graph George Michailidis (UM) High-dimensional VAR 30 / 47

31 Consistency of VAR estimates Theorem Consider a random realization {X 0,...,X T } generated according to a stable VAR(d) process with Λ min (Σ ε ) > 0. Then there exist deterministic functions φ i (A t,σ ε ) > 0 and constants c i > 0 such that for N φ 0 (A t,σ ε ) k(logd + 2logp)/N, the lasso estimate (l 1 -LS) with λ N (2logp + logd)/n satisfies, with probability at least 1 c 1 exp[ c 2 (2logp + logd)], d ( ) Aˆ h A h φ1 (A t,σ ε ) k(logd + 2logp)/N h=1 T 1 d ( ) N (Â h A h )X h φ 2 (A t,σ ε ) k(logd + 2logp)/N t=d h=1 ) Further, a thresholded version of lasso Ã = (Ât,ij 1 { Â t,ij > λ N } satisfies supp(ã 1:d )\supp(a 1:d ) φ 3 (A t,σ ε )k φ i (A t,σ ε ) are large when M (f X ) is large and m(f X ) is small. George Michailidis (UM) High-dimensional VAR 31 / 47

32 Some Remarks Convergence rates governed by: dimensionality parameters - dimension of the process (p), order of the process (d), number of parameters (k) in the transition matrices A i and sample size (N = T -d + 1) internal parameters - curvature (α), tolerance (τ) and the deviation bound Q(β,Σ ε ) The squared l 2 -errors of estimation and prediction scale with the dimensionality parameters as k(2logp + logd)/n, similar to the rates obtained when the observations are independent The temporal and cross-sectional dependence affect the rates only through the internal parameters. Typically, the rates are better when α is large and Q(β,Σ ε ),τ are small. This dependence is captured in the next results. George Michailidis (UM) High-dimensional VAR 32 / 47

33 Verifying RE Proposition Consider a random realization {X 0,...,X T } generated according to a stable VAR(d) process. Then there exist universal positive constants c i such that for all N max{1,ω 2 }k log(dp), with probability at least 1 c 1 exp( c 2 N min{ω 2,1}), where I p (X X /N) RE(α,τ), ω = Λ min(σ ε )/Λ max (Σ ε ) µ max (A )/µ min ( A ), α = Λ min(σ ε ) 2µ max (A ), Λ min (Σ ε ) τ(n,q) = c 3 µ max (A ) max{ω 2,1} log(dp). N George Michailidis (UM) High-dimensional VAR 33 / 47

34 Verifying Deviation Condition Proposition If q 2, then, for any A > 0, N logd + 2logp, with probability at least 1 12q A, we have vec ( X E/N ) logd + 2logp max Q(β,Σ ε ), N where Q(β,Σ ε ) = ( (A + 1)) [ Λ max (Σ ε ) + Λ max(σ ε ) µ min (A ) + Λ ] max(σ ε )µ max (A ) µ min (A ) George Michailidis (UM) High-dimensional VAR 34 / 47

35 Some Comments RE: the convergence rates are faster for larger α and smaller τ. From the expressions of ω,α and τ, it is clear that the VAR estimates have lower error bounds when Λ max (Σ ε ), µ max (A ) are smaller and Λ min (Σ ε ), µ min (A ) are larger. Deviation bound: VAR estimates exhibit lower error bounds when Λ max (Σ ε ), µ max (A ) are smaller and Λ min (Σ ε ), µ min (A ) are larger (similar to RE) George Michailidis (UM) High-dimensional VAR 35 / 47

36 Outline 1 Introduction 2 Modeling Framework 3 Theoretical Considerations 4 Implementation 5 Performance Evaluation George Michailidis (UM) High-dimensional VAR 36 / 47

37 l 1 -LS: Denote the i th column of a matrix M by M i. arg 1 min β R q N Y Zβ 2 + λ N β 1 1 arg min B 1,...,B p N p i X B i i=1 Y 2 p + λ N B i 1 i=1 Amounts to running p separate LASSO programs, each with dp predictors: Y i X, i = 1,...,p. George Michailidis (UM) High-dimensional VAR 37 / 47

38 l 1 -LL: Davis et al, 2012, proposed the following algorithm: arg 1 arg min β R q N 1 min β R q N (Y ( Zβ) Σ 1 ( Σ 1/2 ε ε I ) (Y Zβ) + λ N β 1 ) Σ 1/2 ε X β 2 + λ N β 1 ) ( I Y ( Amounts to ) running a single LASSO program with dp 2 predictors: Σ 1/2 ε I Y Σ 1/2 ε X - cannot be implemented in parallel. σ ij ε := (i,j) th entry of Σ 1. The objective function is 1 N p p i=1 j=1 ε σ ij ε (Y i X B i ) (Y j X B j ) + λ N p k=1 B k 1 George Michailidis (UM) High-dimensional VAR 38 / 47

39 Block Coordinate Descent for l 1 -LL 1 pre-select d. Run l 1 -LS to get ˆB, ˆΣ 1 ε. 2 iterate till convergence: 1 For i = 1,...,p, set r i := (1/2 ˆσ ii ε ) j i update ˆB i = argmin B i ˆσ ε ij ( ) Yj X ˆB j ˆσ ii ε N (Y i + r i ) X B i 2 + λ N B i 1 each iteration amounts to running p separate LASSO programs, each with dp predictors: Y i + r i X, i = 1,...,p. Can be implemented in parallel George Michailidis (UM) High-dimensional VAR 39 / 47

40 Outline 1 Introduction 2 Modeling Framework 3 Theoretical Considerations 4 Implementation 5 Performance Evaluation George Michailidis (UM) High-dimensional VAR 40 / 47

41 VAR models considered Small Size VAR, p = 10,d = 1,T = 30,50 Medium Size VAR, p = 30,d = 1,T = 80,120,160 In each setting, we generate an adjacency matrix A 1 with 5 10% non-zero edges selected at random and rescale to ensure that the process is stable with SNR = 2. We generate three different error processes with covariance matrix Σ ε from one of the following families: 1 Block-I: Σ ε = ((σ ε,ij )) 1 i,j p with σ ε,ii = 1, σ ε,ij = ρ if 1 i j p/2, 0 otherwise; 2 Block-II: Σ ε = ((σ ε,ij )) 1 i,j p with σ ε,ii = 1, σ ε,ij = ρ if 1 i j p/2 or p/2 < i j p, 0 otherwise; 3 bf Toeplitz: Σ ε = ((σ ε,ij )) 1 i,j p with σ ε,ij = ρ i j. George Michailidis (UM) High-dimensional VAR 41 / 47

42 VAR models considered (ctd) REGUALARIZED ESTIMATION IN TIME SERIES (a) A 1 (b) Σ ɛ : Block-I (c) Σ ɛ : Block-II (d) Σ ɛ : Toeplitz We let ρ vary in {0.5,0.7,0.9}. : Adjacency matrix A 1 and error covariance matrix Σ ɛ of different ty Larger values of ρ indicate that the error processes are more strongly in the simulation studies correlated. Block-I: Σ ɛ = ((σ ɛ,ij )) 1 i,j p with σ ɛ,ii = 1, σ ɛ,ij = ρ if 1 i j George Michailidis (UM) High-dimensional VAR 42 / 47

43 Comparisons and Performance Criteria Different methods for VAR estimation: OLS l 1 -LS l 1 -LL l 1 -LL-O (Oracle version, assuming Σ ε known) Ridge evaluated using the following performance metrics: 1 Model Selection: Area under receiving operator characteristic curve (AUROC) 2 Estimation error: Relative estimation accuracy measured by ˆB B F / B F George Michailidis (UM) High-dimensional VAR 43 / 47

44 Results I Table: VAR(1) model with p = 10, T = 30 BLOCK-I BLOCK-II Toeplitz ρ AUROC l 1 -LS l 1 -LL l 1 -LL-O Estimation OLS Error l 1 -LS l 1 -LL l 1 -LL-O ridge George Michailidis (UM) High-dimensional VAR 44 / 47

45 Results II Table: VAR(1) model with p = 30, T = 120 BLOCK-I BLOCK-II Toeplitz ρ AUROC l 1 -LS l 1 -LL l 1 -LL-O Estimation OLS Error l 1 -LS l 1 -LL l 1 -LL-O Ridge George Michailidis (UM) High-dimensional VAR 45 / 47

46 Summary/Discussion Investigated penalized VAR estimation in high-dimension Established estimation consistency for all stable VAR models, based on novel techniques using spectral representation of stationary processes Developed parallellizable algorithm for likelihood based VAR estimates There is extensive work on characterizing univariate time series, through mixing conditions or functional dependence measures. However, thre is little work for multivariate series, which is needed to be able to provide results in the current setting. George Michailidis (UM) High-dimensional VAR 46 / 47

47 References S. Basu and G. Michailidis, Estimation in High-dimensional Vector Autoregressive Models, arxiv: S. Basu, A. Shojaie and G. Michailidis, Network Granger Causality with Inherent Grouping Structure, revised for JMLR George Michailidis (UM) High-dimensional VAR 47 / 47

REGULARIZED ESTIMATION IN SPARSE HIGH-DIMENSIONAL TIME SERIES MODELS. BY SUMANTA BASU AND GEORGE MICHAILIDIS 1 University of Michigan

The Annals of Statistics 2015, Vol. 43, No. 4, 1535 1567 DOI: 10.1214/15-AOS1315 Institute of Mathematical Statistics, 2015 REGULARIZED ESTIMATION IN SPARSE HIGH-DIMENSIONAL TIME SERIES MODELS BY SUMANTA