Study of classes of kernels suited to the simplification and interpretation of approximation models (Étude de classes de noyaux adaptées à la simplification et à l'interprétation des modèles d'approximation).


Slide 1: Study of classes of kernels suited to the simplification and interpretation of approximation models

PhD defense of N. Durrande, École des Mines de Saint-Étienne, 9 November 2011

Supervisors: Laurent Carraro, Rodolphe Le Riche
Reviewers: Béatrice Laurent, Henry Wynn
Co-advisors: David Ginsbourger, Olivier Roustant
Examiners: Yves Grandvalet, Alberto Pasanisi

Slide 2: Introduction to Gaussian Process models

Outline:
1. Introduction: the Gaussian process modeling framework, and the issues when d is large (lack of interpretability, large number of observations required)
2. Simplified models for high-dimensional modeling
3. Gaussian processes and the ANOVA representation

Slide 3: Introduction to Gaussian Process models

Let f : D → R be a function whose value is known only on a limited set of points X = (X_1, ..., X_n). [Figure: f plotted against x with the observed points.] This situation arises in many fields: engineering, geostatistics, numerical simulation.

Slide 4: Introduction to Gaussian Process models

Given a kernel K and its associated centered Gaussian process Z, one can look at the sample paths satisfying Z_ω(X_i) = f(X_i). [Figure: conditional sample paths interpolating the observations.] We can compute the conditional law of Z given the observations.

Slide 5: Introduction to Gaussian Process models

We approximate f(x) using the conditional expectation m(x) and the conditional variance v(x) of Z:

m(x) = k(x)^T K^{-1} F
v(x) = K(x, x) - k(x)^T K^{-1} k(x)

where k is the functional vector (k(·))_i = K(X_i, ·) and K is the covariance matrix (K)_{ij} = K(X_i, X_j).
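These equations translate directly into a few lines of linear algebra. Below is a minimal numerical sketch, assuming a one-dimensional squared exponential kernel; the function names and the toy data are illustrative, not taken from the talk.

```python
import numpy as np

def se_kernel(x, y, sigma2=1.0, theta=0.2):
    # Squared exponential kernel K(x, y), working on scalars or arrays
    return sigma2 * np.exp(-((x - y) ** 2) / (2 * theta ** 2))

X = np.array([0.1, 0.4, 0.6, 0.9])      # design points X_1, ..., X_n
F = np.sin(2 * np.pi * X)               # observations f(X_i)
K = se_kernel(X[:, None], X[None, :])   # covariance matrix (K)_ij = K(X_i, X_j)

def predict(x):
    k = se_kernel(X, x)                                  # (k(x))_i = K(X_i, x)
    m = k @ np.linalg.solve(K, F)                        # conditional mean m(x)
    v = se_kernel(x, x) - k @ np.linalg.solve(K, k)      # conditional variance v(x)
    return m, v

m05, v05 = predict(0.5)   # prediction at x = 0.5
```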

Slide 6: Introduction to Gaussian Process models

The best predictor m(x) can be seen as a linear combination of the basis functions K(X_i, ·). [Figure: the predictor together with the functions K(X_i, ·).]

Slide 7: Introduction to Gaussian Process models

The choice of the covariance kernel K has a great impact on the model. [Figure: models obtained from the same data with a squared exponential kernel with (σ², θ) = (1, 0.2), a Brownian kernel, and a squared exponential kernel with (σ², θ) = (1, 0.5).]

How can we choose the most appropriate kernel?

Slide 8: Introduction to Gaussian Process models

Definition: a kernel is a symmetric function of positive type:

K(x, y) = K(y, x)
∀n ∈ N, ∀a_1, ..., a_n ∈ R and ∀x_1, ..., x_n ∈ D,  Σ_{i=1}^n Σ_{j=1}^n a_i a_j K(x_i, x_j) ≥ 0.

Verifying the second condition directly is often intractable.

Slide 9: Introduction to Gaussian Process models

However, if K_1 and K_2 are kernels and f is a real-valued function, then K_1 × K_2, K_1 + K_2 and f(x) K_1(x, y) f(y) are covariance kernels. It's easy to create new kernels from old!

Example: construction of multidimensional kernels, K(x, y) = Π_i K_i(x_i, y_i).
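A sketch of these closure properties, assuming two simple univariate kernels; all names are illustrative:

```python
import numpy as np

def k1(x, y):
    return np.exp(-((x - y) ** 2))   # squared exponential kernel

def k2(x, y):
    return np.minimum(x, y)          # Brownian motion kernel on [0, 1]

# New kernels from old: product, sum, and rescaling f(x) K(x, y) f(y)
k_prod = lambda x, y: k1(x, y) * k2(x, y)
k_sum  = lambda x, y: k1(x, y) + k2(x, y)
f = lambda x: 1.0 + x ** 2
k_resc = lambda x, y: f(x) * k1(x, y) * f(y)

def k_tensor(x, y):
    # Multidimensional tensor product kernel: K(x, y) = prod_i K_i(x_i, y_i)
    return np.prod([k1(xi, yi) for xi, yi in zip(x, y)])

value = k_tensor((0.2, 0.7), (0.3, 0.5))
```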

Slide 10: Limitations of tensor product kernels

Most of the time, multidimensional kernels are based on a product of univariate kernels. For example, the squared exponential kernel over [0, 1]² is:

K(x, y) = exp(-||x - y||²) = exp(-(x_1 - y_1)² - (x_2 - y_2)²) = K_u(x_1, y_1) × K_u(x_2, y_2)

With such kernels, an observation f(X_i) may influence the model only in a neighborhood of X_i.

Slide 11: Limitations of tensor product kernels

For such kernels, the basis function K(X_i, ·) associated with an observation at X_i only has a local influence. [Figure: basis function centered at X_1 = (0.25, 0.25).] Conversely to other methods where the basis functions have a global meaning, kriging models are therefore difficult to interpret away from the observations.

Slide 12: Limitations of tensor product kernels

If the neighborhood influenced by an observation is a domain of size ε for d = 1, the proportion of the input space covered by one observation shrinks as the dimension increases. [Figure: the ε-neighborhoods for d = 1, 2, 3.]

This phenomenon is known as the curse of dimensionality: the number of observations has to grow exponentially with d. A quick order-of-magnitude check appears below.
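A one-line computation illustrates the growth; assuming each observation informs a hypercube of side ε = 0.1, covering [0, 1]^d takes on the order of (1/ε)^d points:

```python
eps = 0.1
for d in (1, 2, 3, 10):
    print(d, round((1 / eps) ** d))   # 10, 100, 1000, 10_000_000_000 observations
```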

Slide 13: Limitations of tensor product kernels

When using usual kernels in high dimension, kriging emulators face two issues: they require a large number of observations, and they are difficult to interpret. The aim of this presentation is to get round those issues.

Outline:
1. Additive models
2. Simplified models with interaction terms
3. Interpretation of high-dimensional models

Slide 14: Additive kernels

A popular approach to get round the curse of dimensionality is to consider simplified models such as additive models [Stone 85]:

f(x) ≈ m(x) = Σ_{i=1}^d m_i(x_i)

Examples of such models: regression without interaction terms, generalized additive models [Hastie 90].

Slide 15: Additive kernels

In order to obtain additive kriging models, we consider kernels of the form

K(x, y) = Σ_{i=1}^d K_i(x_i, y_i)

As seen previously, such sums of kernels are symmetric and of positive type. We will call them additive kernels.

Slide 16: Additive kernels

[Figure: three sample paths of a GP Y over (x_1, x_2) with an additive kernel.] The paths of Y are additive (up to a modification).
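A sketch of how such surfaces can be simulated, assuming squared exponential univariate kernels (the grid size and θ are arbitrary choices):

```python
import numpy as np

def k_se(x, y, theta=0.3):
    return np.exp(-((x - y) ** 2) / (2 * theta ** 2))

g = np.linspace(0, 1, 20)
grid = np.array([(a, b) for a in g for b in g])   # 400 points in [0, 1]^2

def k_add(x, y):
    return k_se(x[0], y[0]) + k_se(x[1], y[1])    # additive kernel K_1 + K_2

C = np.array([[k_add(p, q) for q in grid] for p in grid])
jitter = 1e-8 * np.eye(len(grid))                 # numerical regularization
path = np.random.multivariate_normal(np.zeros(len(grid)), C + jitter)
surface = path.reshape(20, 20)                    # surface[i, j] = Y(g[i], g[j])
```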

Slide 17: Additive kernels

Now, let us consider a GP model with an additive kernel. As an additive kernel is still a kernel, the kriging equations do not change:

m(x) = k(x)^T K^{-1} F
v(x) = K(x, x) - k(x)^T K^{-1} k(x)

Slide 18: Interpretability of additive kriging models

When the input space is high-dimensional, usual kriging models cannot easily be interpreted; they can be seen as black boxes. On the other hand, additive kriging models are easily interpretable:

m(x) = (k_1(x_1) + k_2(x_2))^T (K_1 + K_2)^{-1} F
     = k_1(x_1)^T (K_1 + K_2)^{-1} F + k_2(x_2)^T (K_1 + K_2)^{-1} F
     = m_1(x_1) + m_2(x_2)

For the sub-model m_1, the covariance matrix K_2 plays the role of observation noise.
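A minimal sketch of this decomposition for d = 2, reusing the kriging equations from slide 17; the kernel, design and test function are illustrative:

```python
import numpy as np

def k_se(x, y, theta=0.3):
    return np.exp(-((x - y) ** 2) / (2 * theta ** 2))

rng = np.random.default_rng(0)
X = rng.random((30, 2))                        # design points in [0, 1]^2
F = np.sin(4 * X[:, 0]) + X[:, 1] ** 2         # an additive test function

K1 = k_se(X[:, 0][:, None], X[:, 0][None, :])  # (K_1)_ij = K_1(X_i1, X_j1)
K2 = k_se(X[:, 1][:, None], X[:, 1][None, :])
alpha = np.linalg.solve(K1 + K2, F)            # (K_1 + K_2)^{-1} F

def m1(x1):
    return k_se(x1, X[:, 0]) @ alpha           # univariate sub-model m_1(x_1)

def m2(x2):
    return k_se(x2, X[:, 1]) @ alpha           # univariate sub-model m_2(x_2)
```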

Slide 19: Interpretability of additive kriging models

[Figure: the univariate sub-models m_1(x_1) and m_2(x_2) obtained from the decomposition.]

Slide 20: Additive models and linear budget

As previously, consider the neighborhood of a point X_i. [Figure: the region influenced by an observation under an additive kernel.] We observe that:
- the observations do not only have a local influence;
- the number of observations can grow linearly with the dimension.

Slide 21: Additive models and linear budget

Let:
- Z_p be a centered GP over [0, 1]^d with a tensor product kernel,
- Z_a be a centered GP over [0, 1]^d with an additive kernel,
- Z = Z_a + Z_p, with Z_a and Z_p assumed independent.

We compare, as d increases, the predictivity of the approximation of a path Z_ω by an additive kriging model and by a kriging model based on a tensor product kernel.

Slide 22: Additive models and linear budget

For d = 1, ..., 30, we choose n = 10d observations (a linear budget). [Figure: percentage of variance explained as a function of the dimension, for the additive kernel and the tensor product kernel.]

Slide 23: Additive models and linear budget

Those results depend on the value of θ. [Figure: percentage of variance explained vs. dimension for the additive and tensor product kernels, for θ = 0.25, θ = 0.5 and θ = 1.]

With a limited linear budget, simple models can outperform more complex models.

Slide 24: Additive models and linear budget

We have seen two advantages of additive models:
- they are easily interpretable;
- they require a reasonable number of observations.

However, those models are usually too basic for modeling real-life phenomena. How can we increase the complexity of the models?

Slide 25: ANOVA kernels

In order to take the interactions between variables into account, one can consider ANOVA kernels [Stitson 97]:

K(x, y) = Π_{i=1}^d (1 + K_i(x_i, y_i))
        = 1 + Σ_{i=1}^d K_i(x_i, y_i)  [additive part]
          + Σ_{i<j} K_i(x_i, y_i) K_j(x_j, y_j)  [2nd order interactions]
          + ... + Π_{i=1}^d K_i(x_i, y_i)  [full interaction]
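A direct sketch of this construction (the univariate kernel is illustrative):

```python
import numpy as np

def k_se(x, y, theta=0.3):
    return np.exp(-((x - y) ** 2) / (2 * theta ** 2))

def k_anova(x, y):
    # ANOVA kernel: K(x, y) = prod_i (1 + K_i(x_i, y_i)); expanding the
    # product yields 1 + the additive part + all interaction terms.
    return np.prod([1.0 + k_se(xi, yi) for xi, yi in zip(x, y)])
```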

Slide 26: ANOVA kernels

A decomposition of the best predictor is naturally associated with those kernels. Example: in 2D we have K = 1 + K_1 + K_2 + K_1 K_2, so the best predictor can be written as

m(x) = (1 + k_1(x_1) + k_2(x_2) + k_1(x_1) k_2(x_2))^T K^{-1} F
     = m_0 + m_1(x_1) + m_2(x_2) + m_12(x)

This decomposition looks like the ANOVA representation of m, but the m_I do not satisfy ∫_{D_i} m_I(x_I) dx_i = 0.

Slide 27: ANOVA kernels

The ANOVA representation is based on a functional decomposition of L²: if D = D_1 × ... × D_d and μ = μ_1 ⊗ ... ⊗ μ_d, we have

L²(D, μ) = ⊗_{i=1}^d (1_{D_i} ⊕ L²_0(D_i, μ_i))

If we can build a RKHS with the same structure, the ANOVA representation of m can be obtained naturally. How can we build a RKHS of zero-mean functions? One example is given in [Wahba 97].

Slide 28: Kernel ANOVA decomposition

Using the RKHS framework, we showed that from any usual one-dimensional kernel K one can extract a kernel K_0 associated with a RKHS of zero-mean functions. Let R be the Riesz representer of the integral functional f ↦ ∫ f(x) dx for the inner product ⟨·, ·⟩_H. We define H_1 = span(R) and H_0 as the orthogonal complement of R in H. [Diagram: H decomposed into span(R) and H_0.]

Slide 29: Kernel ANOVA decomposition

The expression of R(x) can be obtained easily:

R(x) = ⟨R, K(x, ·)⟩_H = ∫_D K(x, s) ds

[Figure: R(x) for the Brownian kernel, the Gaussian kernel with θ = 1, and the Gaussian kernel with θ = 0.25.]

Slide 30: Kernel ANOVA decomposition

Finally, we have H = H_1 ⊕ H_0 with H_1 = span(R) a one-dimensional RKHS and H_0 a RKHS of zero-mean functions. Those spaces have kernels:

K_1(x, y) = ∫ K(x, s) ds ∫ K(y, s) ds / ∫∫ K(s, t) ds dt
K_0(x, y) = K(x, y) - ∫ K(x, s) ds ∫ K(y, s) ds / ∫∫ K(s, t) ds dt
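These expressions only involve integrals of K, so K_0 is easy to approximate numerically. A sketch assuming D = [0, 1] and a simple Riemann-sum quadrature; the grid size and kernel are illustrative:

```python
import numpy as np

def k_se(x, y, theta=0.3):
    return np.exp(-((x - y) ** 2) / (2 * theta ** 2))

s = np.linspace(0, 1, 200)                    # quadrature grid on D = [0, 1]

def int_k(x):
    # int_D K(x, s) ds, approximated elementwise by a Riemann sum
    x = np.asarray(x, dtype=float)
    return k_se(x[..., None], s).mean(axis=-1)

kk = np.mean([int_k(si) for si in s])         # int_D int_D K(s, t) ds dt

def k1(x, y):
    return int_k(x) * int_k(y) / kk           # kernel of H_1 = span(R)

def k0(x, y):
    return k_se(x, y) - k1(x, y)              # kernel of the zero-mean RKHS H_0
```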

Slide 31: Kernel ANOVA decomposition

As for the ANOVA representation in L², we can build a RKHS

H = ⊗_{i=1}^d (1_{D_i} ⊕ H_i^0)  with kernel  K(x, y) = Π_{i=1}^d (1 + K_i^0(x_i, y_i))

With this space, the ANOVA representation is obtained naturally:

m(x) = (1 + k_1^0(x_1) + k_2^0(x_2) + k_1^0(x_1) k_2^0(x_2))^T K^{-1} F
     = m_0 + m_1(x_1) + m_2(x_2) + m_12(x)

Slide 32: Application 1: interpretation

Let us consider the random test function f : [0, 1]^10 → R:

x ↦ 10 sin(π x_1 x_2) + 20 (x_3 - 0.5)² + 10 x_4 + 5 x_5 + N(0, 1)

The steps for approximating f with a GP model are:
1. learn f on a DoE (here, a 180-point maximin LHS), as sketched below;
2. get the optimal values ψ* of the kernel parameters using MLE;
3. build the kriging predictor f̂ based on the ANOVA kernel built from the K_i^0.

As f̂ is a function of 10 variables, the model cannot easily be represented: it is usually considered as a black box. However, the structure of K allows us to split m into sub-models.
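A sketch of the setup for step 1, assuming the test function above; scipy's qmc module gives a plain LHS (maximin optimization of the design would be an additional step), and the model fit itself is omitted:

```python
import numpy as np
from scipy.stats import qmc

rng = np.random.default_rng(0)
X = qmc.LatinHypercube(d=10, seed=0).random(n=180)   # 180-point LHS in [0, 1]^10

def f(x):
    # noisy 10-dimensional test function (only x_1, ..., x_5 are active)
    return (10 * np.sin(np.pi * x[:, 0] * x[:, 1])
            + 20 * (x[:, 2] - 0.5) ** 2
            + 10 * x[:, 3] + 5 * x[:, 4]
            + rng.normal(size=len(x)))

F = f(X)   # observations used to fit the ANOVA-kernel GP model
```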

Slide 33: Application 1: interpretation

[Figure: the univariate sub-models m_i(x_i).] Recall that f(x) = 10 sin(π x_1 x_2) + 20 (x_3 - 0.5)² + 10 x_4 + 5 x_5 + N(0, 1), so the recovered sub-models can be compared with the true additive terms.

Slide 34: Application 2: computation of Sobol indices

Using K, the sensitivity indices S_I can be computed analytically:

S_I = var(m_I(X_I)) / var(m(X))
    = [F^T K^{-1} (⊙_{i∈I} Γ_i) K^{-1} F] / [F^T K^{-1} (⊙_{i=1}^d (1_{n×n} + Γ_i) - 1_{n×n}) K^{-1} F]

where Γ_i is the matrix Γ_i = ∫_{D_i} k_i^0(s_i) k_i^0(s_i)^T ds_i, 1_{n×n} is the n × n matrix of ones, and ⊙ is the term-wise (Hadamard) product.

Conversely to other methods, computing S_I does not require computing all the S_J for J ⊂ I.
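A sketch of the Γ_i matrices and of the index computation, reusing the k0 kernel and the Riemann-sum quadrature from slide 30; all names are illustrative, and alpha = K^{-1} F is assumed precomputed from the ANOVA-kernel covariance matrix:

```python
import numpy as np

def gamma(k0, Xi, s):
    # Gamma_i = int_{D_i} k_i^0(s) k_i^0(s)^T ds, approximated on the grid s;
    # Xi is the i-th column of the design matrix, k0 the univariate kernel K_i^0.
    k0s = k0(Xi[:, None], s[None, :])     # (n, len(s)) matrix of k_i^0(s) values
    return (k0s @ k0s.T) / len(s)

def sobol_index(I, gammas, alpha):
    # S_I with alpha = K^{-1} F; np.prod over a list of matrices with axis=0
    # performs the term-wise (Hadamard) product of the formula above.
    num = np.prod([gammas[i] for i in I], axis=0)
    ones = np.ones_like(gammas[0])
    den = np.prod([ones + g for g in gammas], axis=0) - ones
    return (alpha @ num @ alpha) / (alpha @ den @ alpha)
```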

Slide 35: Conclusion

Additive models:
- are useful for modeling high-dimensional phenomena;
- can be used for extracting an additive trend.

Kernels for sensitivity analysis:
- the kernels built from the K_i^0 correspond to a particular class of ANOVA kernels;
- they allow us to obtain the terms of the ANOVA representation efficiently, which is useful for showing the first terms for model interpretation and for computing the sensitivity indices.

Slide 36: Conclusion

Perspectives:
- confidence intervals for the sub-models;
- RKHS orthogonal to operators other than the integral f ↦ ∫ f(x) dx.

Future work:
- models with very high-dimensional inputs;
- kernels taking specific features into account.

Slide 37: Conclusion

Thank you for your attention.
