Scaling up Vector Autoregressive Models with Operator Random Fourier Features


Romain Brault, Néhémy Lim, Florence d'Alché-Buc

August, 6

Abstract

A nonparametric approach to vector autoregressive modeling consists in working in vector-valued reproducing kernel Hilbert spaces (RKHS). The main idea is to build vector-valued models (OKVAR) using operator-valued kernels. Like scalar-valued kernels, operator-valued kernels enjoy representer theorems, and the resulting learning algorithms depend heavily on the training data. We present a new approach to scale up OKVAR models... This contribution aims at scaling up non-linear autoregressive models based on an operator-valued kernel $K$ by constructing an explicit feature map (ORFF) that transforms input data into a Hilbert space embedded in the RKHS induced by $K$. ORFF are constructed in the spirit of the Random Fourier Features introduced by Rahimi and Recht. We show that ORFF competes with VAR on stationary linear time series in terms of time and accuracy. Moreover, ORFF is able to match the accuracy of OVK on non-stationary, non-linear time series (being better than VAR) while keeping a low execution time, comparable to VAR.

1 Introduction

2 Models

We compare three models.

VAR: We fit the model to the data using the Python statsmodels package.

OVK: We fit the model to the data using the Python operalib package. The optimization problem is solved using L-BFGS.

ORFF: ORFF aims at approximating a kernel $K(x, z) = K_0(x - z)$ by finding an explicit feature map $\Phi$ such that $\Phi(x)^* \Phi(z) \approx K_0(x - z)$. In the following, suppose that $K_0 = k_0(\cdot) A$ is a decomposable kernel on $\mathcal{X} = (\mathbb{R}^d, +)$ and $\mathcal{Y} = \mathbb{R}^p$. Let $A = BB^*$. Then an approximate feature map for $K$ is

\[
\Phi_{\mathrm{dec}}(x) = \frac{1}{\sqrt{D}} \bigoplus_{j=1}^{D} \begin{pmatrix} \cos\langle x, \omega_j \rangle B^* \\ \sin\langle x, \omega_j \rangle B^* \end{pmatrix}, \qquad \omega_j \sim \mathcal{F}^{-1}[k_0],
\]

which can also be expressed as the Kronecker product of a scalar feature map with an operator:

\[
\Phi_{\mathrm{dec}}(x) = \phi_{\mathrm{dec}}(x) \otimes B^*,
\]

where

\[
\phi_{\mathrm{dec}}(x) = \frac{1}{\sqrt{D}} \bigoplus_{j=1}^{D} \begin{pmatrix} \cos\langle x, \omega_j \rangle \\ \sin\langle x, \omega_j \rangle \end{pmatrix}, \qquad \omega_j \sim \mathcal{F}^{-1}[k_0],
\]

is a scalar-valued feature map.
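As a hedged illustration of the decomposable feature map, the following NumPy sketch builds it for a Gaussian scalar kernel $\exp(-\|x - z\|^2 / (2\sigma^2))$, whose frequencies are normally distributed with standard deviation $1/\sigma$, and compares $\Phi(x)^* \Phi(z)$ with the exact kernel value. The function name `orff_decomposable`, the dimensions and the bandwidth are illustrative assumptions, not taken from the paper. The cosine and sine blocks are grouped rather than interleaved, which leaves the inner product unchanged.

```python
import numpy as np


def orff_decomposable(X, B, omegas):
    """Approximate feature map for K_0 = k_0(.) A with A = B B^T.

    X: (n, d) inputs, B: (p, r) factor of A, omegas: (D, d) frequencies.
    Returns an array of shape (n, 2 * D * r, p) such that
    Phi[i].T @ Phi[j] approximates k_0(x_i - x_j) * A.
    """
    D = omegas.shape[0]
    proj = X @ omegas.T                                        # (n, D)
    phi = np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(D)  # scalar map
    # Kronecker product of the scalar feature vector with B^T, per input.
    return np.stack([np.kron(row[:, None], B.T) for row in phi])


# Illustrative sizes (assumptions): d inputs, p outputs, D frequencies.
rng = np.random.default_rng(0)
d, p, D, sigma = 3, 2, 2000, 1.5
B = rng.standard_normal((p, p))
A = B @ B.T
omegas = rng.standard_normal((D, d)) / sigma   # Gaussian kernel frequencies
X = rng.standard_normal((4, d))

Phi = orff_decomposable(X, B, omegas)
x, z = X[0], X[1]
K_exact = np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2)) * A
K_approx = Phi[0].T @ Phi[1]
max_err = np.max(np.abs(K_exact - K_approx))
print("largest entrywise approximation error:", max_err)
```

The error shrinks roughly like $1/\sqrt{D}$, so the number of frequencies trades accuracy against the size of the feature vector.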
In particular, if $k_0$ is a Gaussian kernel of bandwidth $\sigma$, then $\mathcal{F}^{-1}[k_0] = \mathcal{N}(0, 1/\sigma^2)$. The optimization problem is solved using mini-batch block coordinate descent. Note that the convergence of the algorithm can be sped up by preconditioning with the Hessian of the system.
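The mini-batch block coordinate descent mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses a squared loss, a decomposable kernel with $A = BB^*$, a constant step size, and it stores the frequencies of every block instead of regenerating them from a seed. All names (`scalar_features`, `predict`, `block_sgd`) and all hyperparameter values are hypothetical.

```python
import numpy as np


def scalar_features(X, omegas, n_freq_total):
    """cos/sin features of one block, scaled by 1/sqrt(total frequencies)."""
    proj = X @ omegas.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(n_freq_total)


def predict(X, thetas, blocks, B):
    """Sum the contribution of every block: f(X) = sum_i Z_i theta_i B^T."""
    n_total = sum(len(om) for om in blocks)
    F = np.zeros((len(X), B.shape[0]))
    for om, th in zip(blocks, thetas):
        F += scalar_features(X, om, n_total) @ th @ B.T
    return F


def block_sgd(X, Y, blocks, B, T=3000, n=32, gamma=5.0, lam=1e-6, seed=0):
    """Cyclic block-coordinate mini-batch SGD on a ridge-penalised squared loss."""
    rng = np.random.default_rng(seed)
    n_total = sum(len(om) for om in blocks)
    thetas = [np.zeros((2 * len(om), B.shape[1])) for om in blocks]
    for t in range(T):
        idx = rng.choice(len(X), size=n, replace=False)      # mini-batch
        residual = predict(X[idx], thetas, blocks, B) - Y[idx]
        i = t % len(blocks)                                  # block to update
        Z_i = scalar_features(X[idx], blocks[i], n_total)
        grad = Z_i.T @ residual @ B / n + lam * thetas[i]
        thetas[i] -= gamma * grad
    return thetas


# Toy regression problem (assumption): learn y = sin(x) + noise.
rng = np.random.default_rng(1)
d = p = 2
X = rng.standard_normal((500, d))
Y = np.sin(X) + 0.1 * rng.standard_normal((500, p))
sigma = 1.0
# 10 blocks of 10 Gaussian-kernel frequencies each (illustrative sizes).
blocks = [rng.standard_normal((10, d)) / sigma for _ in range(10)]
B = np.eye(p)

thetas = block_sgd(X, Y, blocks, B)
mse_init = np.mean(Y ** 2)   # error of the all-zero initial model
mse_final = np.mean((predict(X, thetas, blocks, B) - Y) ** 2)
print("training MSE before / after:", mse_init, mse_final)
```

The step size looks large because the features are scaled by $1/\sqrt{D}$: the curvature seen by a single block is at most the fraction of frequencies it holds, so a larger step remains stable in this sketch.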

Algorithm 1: Block-coordinate mini-batch SGD.

    Data: $X$, $Y$, $K$, $\gamma_t$, $\lambda$, $T$, $n$, $D$, $b$
    Find $(\omega, x)$, $B(\omega)$ and $\mu(\omega)$ from $K$;
    for $i = 1$ to $D$ do
        $\theta^1_{i,\cdot} = 0$;
    for $t = 1$ to $T$ do
        $A_t = (X_t, Y_t) \sim P(x, y)$;  // Sample $n$ data from $X \times Y$.
        $f(X_t) = \mathrm{predict}(X_t, \theta^t, K)$;  // Make a prediction.
        $\Omega_i \sim \mu(\omega)$ with seed $i$, where $i = ((t - 1) \bmod D) + 1$;  // Sample $b$ features from $\mu(\omega)$.
        for $\omega \in \Omega_i$ do
            $\theta^{t+1}_{i,\omega} = \theta^t_{i,\omega} - \gamma_t \left( \frac{1}{|A_t|} \sum_{(x,y) \in A_t} (\omega, x) B(\omega) \ell'(f(x), y) + \lambda \theta^t_{i,\omega} \right)$;  // Update the gradient.
    return $\theta^{t+1}$

Algorithm 2: $f(X) = \mathrm{predict}(X, \theta, K)$.

    Data: $X$, $\theta$, $K$
    Find $(\omega, x)$, $B(\omega)$ and $\mu(\omega)$ from $K$;
    $f = 0$;
    for $x \in X$ do
        for $i = 1$ to $D$ do
            $\Omega_i \sim \mu(\omega)$ with seed $i$;
            for $\omega \in \Omega_i$ do
                $f(x) = f(x) + (\omega, x) B(\omega) \theta_{i,\omega}$;
    return $f(x)$

3 Experiments

3.1 Simulated data

3.1.1 Data generation

A non-linear multivariate time series $y_t$ of dimension $p$ and order one has the form

\[
\begin{cases}
y_0 \sim \mathcal{N}(0, \Sigma_x), \\
y_t = h(y_{t-1}) + u_t, & t > 0.
\end{cases}
\tag{1}
\]

Throughout the experiments the residuals are homoscedastic and distributed as $u_t \sim \mathcal{N}(0, \Sigma_u)$. We study two kinds of noise: an isotropic noise with covariance $\Sigma_u = \sigma^2 I_p$ and an anisotropic noise with Toeplitz structure $\Sigma_{u,ij} = \nu^{|i-j|}$, where $\nu \in (0, 1)$. We generated $N$ data points and used sequential cross-validation, with time windows $N_t$ set to a fixed fraction of $N$, to measure the performance of the different models.

3.1.2 Setting 1: Linear model

We first study the behavior of the three methods on a linear VAR model (i.e. $h(x) = Ax$). The generated time series are presented in fig. 1 and fig. 2. In this setting we do not see any advantage of OVK over the VAR model. Although OVK takes orders of magnitude more time than VAR to reach the same performance, ORFF (the approximation of OVK) is able to challenge VAR in terms of both time and accuracy. We fixed $D = 5$ features for ORFF.
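The data generation of eq. (1) and the sequential cross-validation can be sketched as below for the linear setting. Several choices are assumptions made only for illustration: the series length, the dimension, the number of windows, the way the sparse matrix $A$ with five non-zero entries is drawn and rescaled for stability, and the expanding-window form of the cross-validation. The least-squares one-step predictor is a simple stand-in for a VAR model of order one, not the statsmodels estimator used in the paper.

```python
import numpy as np


def toeplitz_cov(p, nu):
    """Covariance with entries nu ** |i - j| (anisotropic Toeplitz noise)."""
    i = np.arange(p)
    return nu ** np.abs(i[:, None] - i[None, :])


def simulate(h, Sigma_x, Sigma_u, N, rng):
    """Draw y_0 ~ N(0, Sigma_x), then y_t = h(y_{t-1}) + u_t, u_t ~ N(0, Sigma_u)."""
    p = Sigma_u.shape[0]
    L_x = np.linalg.cholesky(Sigma_x)
    L_u = np.linalg.cholesky(Sigma_u)
    y = np.empty((N, p))
    y[0] = L_x @ rng.standard_normal(p)
    for t in range(1, N):
        y[t] = h(y[t - 1]) + L_u @ rng.standard_normal(p)
    return y


def sequential_cv_mse(y, n_windows):
    """Expanding-window CV: fit on windows 0..k-1, score on window k."""
    X, T = y[:-1], y[1:]                      # pairs (y_{t-1}, y_t)
    bounds = np.linspace(0, len(X), n_windows + 1, dtype=int)
    errors = []
    for k in range(1, n_windows):
        W, *_ = np.linalg.lstsq(X[:bounds[k]], T[:bounds[k]], rcond=None)
        test = slice(bounds[k], bounds[k + 1])
        errors.append(np.mean((X[test] @ W - T[test]) ** 2))
    return float(np.mean(errors))


# Illustrative sizes (assumptions, not the values used in the paper).
rng = np.random.default_rng(0)
p, N = 5, 1000

# Sparse dependency structure with five non-zero interactions,
# rescaled so that the spectral norm is 0.9 (keeps the series stable).
A = np.zeros((p, p))
np.put(A, rng.choice(p * p, size=5, replace=False), rng.standard_normal(5))
A *= 0.9 / np.linalg.norm(A, 2)


def h_linear(x):
    return A @ x


y_white = simulate(h_linear, np.eye(p), 0.9 * np.eye(p), N, rng)
y_toeplitz = simulate(h_linear, np.eye(p), toeplitz_cov(p, 0.9), N, rng)

mse_white = sequential_cv_mse(y_white, n_windows=5)
mse_toeplitz = sequential_cv_mse(y_toeplitz, n_windows=5)
print("SCV-MSE, white noise:", mse_white)
print("SCV-MSE, Toeplitz noise:", mse_toeplitz)
```

Replacing `h_linear` by a function such as `lambda x: A @ np.sin(x)` gives a non-linear series of the same form.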

model    | VAR(1)  | VAR(1)   | ORFF    | ORFF     | OVK     | OVK      | Dumb  | Dumb
noise    | White   | Toeplitz | White   | Toeplitz | White   | Toeplitz | White | Toeplitz
SCV-MSE  |         |          |         |          |         |          |       |
variance |         |          |         |          |         |          |       |
time     | .467(s) | .48(s)   | .994(s) | .(s)     | .476(s) | .8946(s) | (s)   | (s)

Table 1: Sequential cross-validation MSE on setting 1.

Figure 1: Generated time series with isotropic noise of variance $\sigma^2 = .9$, no non-linearity and random dependency structure with five interactions.

Figure 2: Generated time series with Toeplitz noise of variance $\nu = .9$, no non-linearity and random dependency structure with five interactions.

3.1.3 Setting 2: Sine model

We now study the behavior of the three methods on a non-linear VAR model generated by means of sine functions (i.e. $h(x) = A \sin(x)$). For this setting the data are generated so that they incorporate a non-linear trend $f_t$. The data $y_t$ are generated according to eq. (2). We chose $f_t = \Phi(t)^* \theta$, where $\theta_{ij} \sim \mathcal{N}(0, 1)$.

\[
\begin{cases}
x_0 \sim \mathcal{N}(0, \Sigma_x), \\
x_t = h(x_{t-1}) + u_t, & t > 0, \\
y_t = x_t + f_t.
\end{cases}
\tag{2}
\]

The generated time series are presented in fig. 3 and fig. 4. We fixed $D = 5$ features for ORFF.

model    | VAR(1)  | VAR(1)   | ORFF     | ORFF     | OVK    | OVK      | Dumb  | Dumb
noise    | White   | Toeplitz | White    | Toeplitz | White  | Toeplitz | White | Toeplitz
SCV-MSE  |         |          |          |          |        |          |       |
variance |         |          |          |          |        |          |       |
time     | .998(s) | .79(s)   | .5684(s) | .68(s)   | .8(s)  | .799(s)  | (s)   | (s)

Table 2: Sequential cross-validation MSE on setting 2.

In this setting, with white noise, non-linear autoregression with ORFF and OVK has a clear advantage over VAR(1). ORFF is able to capture the non-linearity in a fraction of the time required by OVK.

Figure 3: Generated time series with isotropic noise of variance $\sigma^2 = .9$, sine non-linearity $\phi_s$ and random dependency structure with five interactions.

Figure 4: Generated time series with Toeplitz noise of variance $\nu = .9$, exponential non-linearity $\phi_e$ and random dependency structure with five interactions.

3.1.4 Setting 3: Exponential model

Setting 3 follows the same generation model as setting 2 (see eq. (2)), except that the non-linearities are exponential functions, i.e. $h(x) = A \exp(-\gamma x)$.

model    | VAR(1) | VAR(1)   | ORFF    | ORFF     | OVK     | OVK      | Dumb  | Dumb
noise    | White  | Toeplitz | White   | Toeplitz | White   | Toeplitz | White | Toeplitz
SCV-MSE  |        |          |         |          |         |          |       |
variance |        |          |         |          |         |          |       |
time     | .97(s) | .64(s)   | .865(s) | .745(s)  | .454(s) | (s)      | (s)   | (s)

Table 3: Sequential cross-validation MSE on setting 3.

4 Real data

5 Conclusions

Figure 5: Generated time series with isotropic noise of variance $\sigma^2 = .9$, sine non-linearity $\phi_s$ and random dependency structure with five interactions.

Figure 6: Generated time series with Toeplitz noise of variance $\nu = .9$, exponential non-linearity $\phi_e$ and random dependency structure with five interactions.

6 Supplementary material
