Bayesian Model Averaging for Multivariate Extreme Values
Philippe Naveau (naveau@lsce.ipsl.fr)
Laboratoire des Sciences du Climat et de l'Environnement (LSCE), Gif-sur-Yvette, France
Joint work with A. Sabourin and A-L. Fougères
FP7-ACQWA, GIS-PEPER, MIRACLE & ANR-McSim, MOPERA
14 November 2011
Environmental Data Analysis Extreme Value Theory Bayesian Model Averaging
Air pollutants (Leeds, UK, winter 94-98, daily max)
NO vs. PM10 (left), SO2 vs. PM10 (center), and SO2 vs. NO (right)
Heffernan & Tawn, 2004; Boldi & Davison, 2007; Cooley, Davis, Naveau, 2010
[Figure: three scatter plots of daily maxima: NO against PM10, SO2 against PM10, SO2 against NO]
Typical question What is the probability of observing data in the blue box?
100 largest extremes
[Figure: pairwise scatter plots of NO, SO2 and PM10 for the 100 largest observations]
Multivariate Extreme Value Theory (de Haan, Resnick and others)
Maxima: max-stability, $G^t(tz) = G(z)$
Regularly varying
High quantiles: scaling property, $\Lambda(tA_z) = t^{-1}\Lambda(A_z)$
Tail behavior
Counting exceedances
Siméon Denis Poisson (1781-1840)
[Figure: time series over 1900-2000, vertical axis from 0 to 100]
Counting excesses
As a sum of random binary events, the variable $N_n$ that counts the number of events above the threshold $u_n$ has mean $n \Pr(X > u_n)$.
Poisson's theorem in 1837
If $u_n$ is such that $\lim_{n \to \infty} n \Pr(X > u_n) = \lambda \in (0, \infty)$, then $N_n$ approximately follows a Poisson variable $N$.
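The Poisson approximation can be checked numerically. The sketch below assumes independent, identically distributed observations, so that $N_n$ is exactly Binomial with $n$ trials and success probability $\lambda / n$; the helper names and the choice $\lambda = 2$ are illustrative assumptions, not part of the talk.

```python
import math


def binomial_pmf(k, n, p):
    """P(N_n = k) when N_n counts successes among n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(N = k) for a Poisson variable with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def max_pmf_gap(n, lam, kmax=10):
    """Largest pointwise gap between the two pmfs for k = 0..kmax,
    with exceedance probability p_n = lam / n (so that n * p_n = lam)."""
    p = lam / n
    return max(
        abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(kmax + 1)
    )


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(n, max_pmf_gap(n, lam=2.0))
```

The printed gap shrinks as $n$ grows, which is the content of the limit statement.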
Still counting
$P(\max(X_1,\dots,X_n)/a_n \le x,\ \max(Y_1,\dots,Y_n)/a_n \le y) = P(N_n(A) = 0)$
[Figure: rescaled points $(X_i/a_n, Y_i/a_n)$ with the set $A$ and the thresholds $x$ and $y$]
Poisson again
If $\lim_{n \to \infty} E(N_n(A)) = \Lambda(A)$, then $\lim_{n \to \infty} P(N_n(A) = 0) = P(N(A) = 0) = \exp(-\Lambda(A))$.
Two questions
What is the sequence $a_n$?
What are the properties of $\Lambda(A)$?
Back to the univariate case: Fréchet margins
For the univariate heavy-tailed GEV case, we know that
$\lim_{n \to \infty} P(\max(X_1,\dots,X_n)/a_n \le x) = \exp(-x^{-\alpha})$
with $a_n$ such that $P(X > a_n) = 1/n$.
Poisson condition
$\lim_{n \to \infty} n P(X/a_n \in A_x) = \Lambda_x(A_x)$ with $\Lambda_x(A_x) = x^{-\alpha}$, for $A_x = [x, \infty)$
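As a quick numerical check of this limit, the sketch below assumes iid standard Pareto variables, $P(X > s) = s^{-\alpha}$ for $s \ge 1$, for which $a_n = n^{1/\alpha}$ solves $P(X > a_n) = 1/n$ exactly. The Pareto choice and the function names are assumptions made for illustration only.

```python
import math


def cdf_scaled_max(x, n, alpha):
    """Exact P(max(X_1..X_n) / a_n <= x) for iid Pareto X with
    P(X > s) = s**(-alpha), s >= 1 (an illustrative heavy-tailed choice)."""
    a_n = n ** (1.0 / alpha)  # solves P(X > a_n) = 1/n
    s = a_n * x
    if s < 1.0:
        return 0.0
    # P(X <= s)**n = (1 - x**(-alpha) / n)**n
    return (1.0 - s ** (-alpha)) ** n


def frechet_cdf(x, alpha):
    """Limiting Frechet distribution exp(-x**(-alpha)), x > 0."""
    return math.exp(-(x ** (-alpha)))


if __name__ == "__main__":
    alpha = 2.0
    for n in (10, 1000, 10**6):
        print(n, cdf_scaled_max(1.5, n, alpha), frechet_cdf(1.5, alpha))
```

Here $n P(X/a_n > x) = x^{-\alpha}$ holds for every $n$ with $a_n x \ge 1$, so the Poisson condition is met with $\Lambda_x(A_x) = x^{-\alpha}$.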
Scaling property
Univariate case, with $\Lambda_x(A_x) = x^{-\alpha}$: $\Lambda_x(tA_x) = t^{-\alpha}\Lambda_x(A_x)$
Multivariate case: $\Lambda(tA) = t^{-\alpha}\Lambda(A)$
Scaling property: an essential property for inference
$\Lambda(tA) = t^{-\alpha}\Lambda(A)$
[Figure: a set $A$ in the area with data points and its scaled version $tA$, related by the factor $t^{-\alpha}$]
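Interpreting the scaling property
$\Lambda(tA) = t^{-\alpha}\Lambda(A)$ with $\alpha = 1$ and $\|y\| = y_1 + y_2$
[Figure: the lines $y_1 + y_2 = 1$ and $y_1 + y_2 = t$, an angular set $B$ on the unit simplex, and the region $\{y : y/\|y\| \in B \text{ and } \|y\| > t\}$]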
Interpreting the scaling property
$\Lambda(tA) = t^{-1}\Lambda(A)$
A special case
$A = \{x : x/r \in B \text{ and } r > 1\}$, where $r = \|x\|$ and $B$ is any set belonging to the unit sphere
A surprising property
$tA = \{tx : x/r \in B \text{ and } r > 1\} = \{u : u/\|u\| \in B \text{ and } \|u\| > t\}$, with $u = tx$.
This implies
$\Lambda(\{u : u/\|u\| \in B \text{ and } \|u\| > t\}) = t^{-1} H(B)$
where $H(\cdot)$ is the spectral measure restricted to the unit sphere.
Independence between the radius $r = \|x\|$ and the spectral measure
The dependence among extremes is only captured by the spectral measure
Polar coordinates in 3D
Radius $r = x_1 + x_2 + x_3$ and angle vector: $w_1 = x_1/r$, $w_2 = x_2/r$, $w_3 = x_3/r$
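The change of variables can be sketched in a few lines. This is a minimal illustration that assumes non-negative components already on a common marginal scale; any marginal standardisation that precedes this step is not shown, and the function names are hypothetical.

```python
def to_polar(x):
    """Map a non-negative vector x to (radius, angle) with the sum norm:
    r = x_1 + ... + x_d and w_i = x_i / r, so that w lies on the unit simplex."""
    r = sum(x)
    return r, tuple(xi / r for xi in x)


def largest_radii(points, k):
    """Return the k (radius, angle) pairs with the largest radius."""
    polar = [to_polar(p) for p in points]
    polar.sort(key=lambda rw: rw[0], reverse=True)
    return polar[:k]


if __name__ == "__main__":
    # Made-up observations, for illustration only.
    data = [(1.0, 1.0, 1.0), (5.0, 0.0, 1.0), (2.0, 2.0, 0.0)]
    for r, w in largest_radii(data, 2):
        print(r, w)
```

Keeping only the angles attached to the largest radii is what produces the cloud of points on the simplex used to study dependence among extremes.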
100 largest extremes
[Figure: pairwise scatter plots of NO, SO2 and PM10 for the 100 largest observations]
Dependence among the 100 angles $W = (W_1, W_2, W_3)$
[Figure: the 100 angles plotted on the simplex, with vertices NO, SO2 and PM10]
Our main problems How to find appropriate models to describe the dependence over the simplex? How to infer the parameters of our models? How to combine competing models?
A unique moment constraint on the spectral measure $H$
$\int_{\text{Simplex}} w_i \, dH(w) = \frac{1}{d}$
Non-parametric versus parametric
"In theory, there is no difference between theory and practice. But, in practice, there is." (Jan L. A. van de Snepscheut or Yogi Berra)
Bayesian model averaging (BMA)
Model 1: likelihood $\{h_1(\cdot \mid \theta_1),\ \theta_1 \in \Theta_1\}$
Model 2: likelihood $\{h_2(\cdot \mid \theta_2),\ \theta_2 \in \Theta_2\}$
Each model is a collection of distribution functions on $S$. Model averaging consists in adding a prior layer to the models.
Objective
Compute the posterior predictive density of the quantity of interest:
$h(w \mid \text{data}) = p(\text{model 1} \mid \text{data})\, h_1(w \mid \text{data}) + p(\text{model 2} \mid \text{data})\, h_2(w \mid \text{data})$
with
$h_1(w \mid \text{data}) = \int h_1(w \mid \text{data}, \theta_1)\, [\text{posterior of } \theta_1 \mid \text{data}]\, d\theta_1$
and
$p(\text{model 1} \mid \text{data}) = \dfrac{p(\text{data} \mid \text{model 1})\, p(\text{model 1})}{p(\text{data})}$
where $p(\text{data} \mid \text{model 1})$ is the marginal likelihood with respect to $\theta_1$:
$p(\text{data} \mid \text{model 1}) = \int h_1(\text{data} \mid \theta_1)\, \text{prior}(\theta_1)\, d\theta_1$
BMA: Sloughter et al. (2010), Raftery et al. (2005), Hoeting et al. (1999)
Model 1: likelihood $\{h_1(\cdot \mid \theta_1),\ \theta_1 \in \Theta_1\}$, with priors
Model 2: likelihood $\{h_2(\cdot \mid \theta_2),\ \theta_2 \in \Theta_2\}$, with priors
Averaged model, with priors: likelihood $h(w \mid (\theta_1, \theta_2)) = p_1 h_1(w \mid \theta_1) + p_2 h_2(w \mid \theta_2)$
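Model 1
Marginal likelihood (computationally hard):
$p(\text{data} \mid \text{model 1}) = \int h_1(\text{data} \mid \theta_1)\, \text{prior}(\theta_1)\, d\theta_1$
Posterior weights (computationally easy):
$p(\text{model 1} \mid \text{data}) = \dfrac{p(\text{data} \mid \text{model 1})\, p(\text{model 1})}{p(\text{data})}$
Averaging Model 1 and 2 (computationally easy):
$h(w \mid \text{data}) = p(\text{model 1} \mid \text{data})\, h_1(w \mid \text{data}) + p(\text{model 2} \mid \text{data})\, h_2(w \mid \text{data})$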
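Once the marginal likelihoods are available, the two "easy" steps reduce to a few lines. The sketch below is one possible way to do it, assuming equal prior model probabilities by default and working on the log scale to avoid overflow; the function names are hypothetical and this is not claimed to be the implementation used for the talk.

```python
import math


def posterior_model_probs(log_marglik, prior=None):
    """p(model j | data) from log marginal likelihoods log p(data | model j).

    prior: prior model probabilities; equal weights are assumed if omitted.
    The largest log term is subtracted before exponentiating, which leaves
    the normalised weights unchanged but avoids overflow.
    """
    m = len(log_marglik)
    if prior is None:
        prior = [1.0 / m] * m
    log_post = [lm + math.log(p) for lm, p in zip(log_marglik, prior)]
    top = max(log_post)
    w = [math.exp(lp - top) for lp in log_post]
    total = sum(w)
    return [wi / total for wi in w]


def bma_density(model_densities, probs):
    """h(w | data) = sum_j p(model j | data) * h_j(w | data) at one point w,
    given the per-model predictive density values at that point."""
    return sum(p * h for p, h in zip(probs, model_densities))


if __name__ == "__main__":
    # Two marginal likelihoods in the ratio 1.85 : 2.27, equal priors.
    probs = posterior_model_probs([math.log(1.85e18), math.log(2.27e18)])
    print(probs)
```

With marginal likelihoods in the ratio 1.85 : 2.27 and equal priors, the weights come out at about 0.45 and 0.55.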
Multivariate Extreme Value Theory and BMA
A mixture of max-stable distributions is not max-stable.
A mixture of spectral measures is still a valid spectral measure, i.e.
$\int_{\text{Simplex}} w_i \, dH(w) = \frac{1}{d}$ for $dH = \sum_j p_j \, dH_j$
Multivariate Extreme Value Theory and BMA
Proposition 3.1. Let $H_1, \dots, H_J$ be $J$ angular spectral measures associated with $J$ max-stable measures $\nu_j(\cdot)$. Let $(p_1, \dots, p_J)$ be a vector of positive weights summing to one. If $H$ corresponds to their weighted average $H = \sum_{j=1}^{J} p_j H_j$, then
(i) $H$ is a valid spectral measure for a multivariate max-stable random vector $M$ with unit-Fréchet margins and exponent measure
$\nu([0, x]^c) = \sum_{j=1}^{J} p_j \, \nu_j([0, x]^c)$
(ii) $M$ has the max-combination representation
$M \overset{d}{=} \bigvee_{j=1}^{J} p_j M_j$
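The moment constraint is linear in $H$, which is why a weighted average of valid spectral measures stays valid. The sketch below illustrates this with two discrete measures on the simplex for $d = 3$: equal point masses at the three vertices and a single point mass at the centre. Both measures, the weights and the helper names are illustrative choices, not taken from the talk.

```python
def first_moments(atoms, masses):
    """Componentwise integral of w_i dH(w) for a discrete measure H
    that puts mass masses[k] on the point atoms[k] of the simplex."""
    d = len(atoms[0])
    return [sum(m * w[i] for w, m in zip(atoms, masses)) for i in range(d)]


def mix(measures, weights):
    """Weighted average sum_j p_j H_j of discrete measures (atoms, masses)."""
    atoms, masses = [], []
    for (a, m), p in zip(measures, weights):
        atoms.extend(a)
        masses.extend(p * mi for mi in m)
    return atoms, masses


if __name__ == "__main__":
    third = 1.0 / 3.0
    # H1: equal masses on the three vertices of the simplex.
    h1 = ([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)], [third] * 3)
    # H2: all mass at the centre of the simplex.
    h2 = ([(third, third, third)], [1.0])
    atoms, masses = mix([h1, h2], [0.3, 0.7])
    print(first_moments(atoms, masses))  # each component should equal 1/d
```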
Choosing two spectral parametric densities
Model 1 (PB): Pairwise Beta (Cooley, Davis, Naveau, 2010)
Model 2 (NL): Nested Asymmetric Logistic (Gumbel, 1960; Tawn, 1990)
Simulation from two spectral parametric densities
Model 1 PB, Model 2 NL
[Figure: contour plots of the two spectral densities on the simplex with vertices w1, w2, w3; annotated parameters: alpha = 0.9, beta[1] = 15, beta[2] = 8, beta[3] = 0.5]
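Our two spectral parametric densities
Model 1 PB
$h_{pb}(w \mid \alpha, \beta) = \sum_{1 \le i < j \le d} h_{i,j}(w \mid \alpha, \beta_{i,j})$
with
$h_{i,j}(w \mid \alpha, \beta_{i,j}) = K_d(\alpha)\, w_{ij}^{2\alpha - 1} (1 - w_{ij})^{(d-2)\alpha - d + 2}\, \dfrac{\Gamma(2\beta_{i,j})}{\Gamma^2(\beta_{i,j})}\, w_{i/ij}^{\beta_{i,j} - 1}\, w_{j/ij}^{\beta_{i,j} - 1}$
where
$w_{ij} = w_i + w_j$, $w_{i/ij} = \dfrac{w_i}{w_i + w_j}$
Model 2 NL
$h_{nl}(w_1, w_2) = \dfrac{1 - \alpha}{3\alpha}\, (1 - w_{12})^{-\frac{1}{\alpha} - 1}\, (w_1 w_2)^{-\frac{1}{\alpha \alpha_{12}} - 1} \left[ \dfrac{2 - \alpha}{\alpha}\, u^{2(\alpha_{12} - 1)} v^{\alpha - 3} + \dfrac{1 - \alpha_{12}}{\alpha_{12}\alpha}\, u^{\alpha_{12} - 2} v^{\alpha - 2} \right]$
where
$u = w_1^{-\frac{1}{\alpha \alpha_{12}}} + w_2^{-\frac{1}{\alpha \alpha_{12}}}$, $v = u^{\alpha_{12}} + (1 - (w_1 + w_2))^{-\frac{1}{\alpha}}$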
Simulation from two spectral parametric densities
Model 1 PB
Algorithm 1.
(i) Choose uniformly a pair $(i < j)$
(ii) Generate independently
$R_{ij} \sim \text{Beta}(2\alpha + 1, (d-2)\alpha)$
$\Theta_{ij} \sim \text{Beta}(\beta_{i,j}, \beta_{i,j})$
$S \sim \text{Dirichlet}_{d-2}(1, \dots, 1)$
(iii) Change variables back to define $W$ via
$W_i = R_{ij}\Theta_{ij}$, $W_j = R_{ij}(1 - \Theta_{ij})$, $W_{-(i,j)} = (1 - R_{ij})\, S$
Model 2 NL
Algorithm 2. (Stephenson, 2003)
(i) Generate independently $S \sim PS(\alpha)$ and $S_{12} \sim PS(\alpha_{12})$.
(ii) Simulate three independent standard exponentials $E_1, E_2, E_3$.
(iii) Set, for $i \in \{1, 2\}$, $X_i = \left( \dfrac{S_{12}\, S^{1/\alpha_{12}}}{E_i} \right)^{\alpha \alpha_{12}}$ and $X_3 = \left( \dfrac{S}{E_3} \right)^{\alpha}$.
Then $X = (X_1, X_2, X_3)$ has the desired distribution.
Proof. If … is generated according to the above algorithm, the cond…
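The three steps of Algorithm 1 translate directly into code. The sketch below uses only the Python standard library and requires $d \ge 3$. The dictionary `beta` mapping 0-based pairs to $\beta_{i,j}$ and the pairing of the three values 15, 8 and 0.5 with specific pairs are assumptions made for illustration; the Dirichlet(1, ..., 1) draw is obtained by normalising independent standard exponentials.

```python
import random
from itertools import combinations


def sample_pairwise_beta(alpha, beta, d=3, rng=random):
    """One draw W on the unit simplex following steps (i)-(iii) of Algorithm 1.

    beta maps each 0-based pair (i, j), i < j, to beta_ij > 0 (hypothetical
    container, chosen for this sketch). Requires d >= 3.
    """
    # (i) choose a pair uniformly
    i, j = rng.choice(list(combinations(range(d), 2)))
    # (ii) independent draws of R_ij, Theta_ij and S
    r = rng.betavariate(2 * alpha + 1, (d - 2) * alpha)
    theta = rng.betavariate(beta[(i, j)], beta[(i, j)])
    e = [rng.expovariate(1.0) for _ in range(d - 2)]
    s = [ek / sum(e) for ek in e]  # Dirichlet(1, ..., 1) on d - 2 components
    # (iii) change variables back to W
    w = [0.0] * d
    w[i] = r * theta
    w[j] = r * (1 - theta)
    others = [k for k in range(d) if k not in (i, j)]
    for k, sk in zip(others, s):
        w[k] = (1 - r) * sk
    return w


if __name__ == "__main__":
    rng = random.Random(0)
    beta = {(0, 1): 15.0, (0, 2): 8.0, (1, 2): 0.5}  # assumed pair labelling
    draws = [sample_pairwise_beta(0.9, beta, rng=rng) for _ in range(20000)]
    print([sum(w[k] for w in draws) / len(draws) for k in range(3)])
```

Because $\Theta_{ij}$ is symmetric around 1/2 and the pair is chosen uniformly, the sample means of the three components should each be close to $1/d$, which gives a simple sanity check against the moment constraint.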
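Metropolis-Hastings at work
Model 1
Marginal likelihood (computationally hard):
$p(\text{data} \mid \text{model 1}) = \int h_1(\text{data} \mid \theta_1)\, \text{prior}(\theta_1)\, d\theta_1$
Posterior weights (computationally easy):
$p(\text{model 1} \mid \text{data}) = \dfrac{p(\text{data} \mid \text{model 1})\, p(\text{model 1})}{p(\text{data})}$
Averaging Model 1 and 2 (computationally easy):
$h(w \mid \text{data}) = p(\text{model 1} \mid \text{data})\, h_1(w \mid \text{data}) + p(\text{model 2} \mid \text{data})\, h_2(w \mid \text{data})$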
A few simulation results (with parameters tuned to our Leeds data)

True model = PB
Model   Nsim        p(data|model)   stdev(p(data|model))   p(model|data)
PB      50 × 10^3   1.25 × 10^39    5 × 10^37              1
NL      50 × 10^3   4.76 × 10^17    6.5 × 10^16            6 × 10^-22

True model = NL
Model   Nsim        p(data|model)   stdev(p(data|model))   p(model|data)
PB      50 × 10^3   1.19 × 10^37    7 × 10^35              2 × 10^-7
NL      50 × 10^3   6.23 × 10^43    3 × 10^42              1
A few simulation results (with parameters tuned to our Leeds data)

True model = Mixture (PB + NL)/2
Model   Nsim         p(data|model)   stdev(p(data|model))   p(model|data)
PB      100 × 10^3   1.85 × 10^18    8 × 10^16              0.45
NL      100 × 10^3   2.27 × 10^18    4 × 10^16              0.55

Kullback-Leibler divergence (small = good)
KL(h_true, h_PB) = 0.084
KL(h_true, h_NL) = 0.187
KL(h_true, h_BMA) = 0.014
Back to our example
Back to our example: priors (dotted) and posteriors (histograms) for each parameter
Model PB: log(alpha), log(beta[1]), log(beta[2]), log(beta[3])
Model NL: alpha, alpha12
[Figure: one panel per parameter, with the prior as a dotted line and the posterior as a histogram]
Back to our example: posterior spectral densities
Model PB, Model NL
[Figure: contour plots of the posterior spectral densities on the simplex with vertices w1, w2, w3, one panel per model]
Back to our example: Model PB, Model NL
Back to our example: BMA verdict
Model   Nsim         p(data|model)   stdev(p(data|model))   p(model|data)
PB      300 × 10^3   1.05 × 10^38    6.7 × 10^36            1.11 × 10^-13
NL      300 × 10^3   9.53 × 10^50    1.6 × 10^49            1
Take home messages
Implementing BMA for multivariate extremes is feasible (in low dimensions)
Computations can quickly become intensive
The choice and number of parametric models are important
The asymmetric nested logistic model is well tailored to represent bridges
The pairwise beta model is flexible and can be generalized (see Ballani and Schlather's extensions)
More research is needed to extend BMA to mixtures
Going fully Bayesian non-parametric (Segers and colleagues, Boldi and Davison)