ABC random forest for parameter estimation

Jean-Michel Marin
Université de Montpellier
Institut Montpelliérain Alexander Grothendieck (IMAG)
Institut de Biologie Computationnelle (IBC)
Labex Numev

Joint work with Pierre Pudlo, Louis Raynal, Mathieu Ribatet and Christian Robert

ABCruise, Helsinki 1/39
Introduction

We consider statistical models for which no explicit form of the likelihood is available, or for which a single evaluation of the likelihood is too CPU-demanding:
⇒ numerous heterogeneous latent variables (as with the coalescent model), or an intractable normalizing constant of the likelihood (as with Gibbs random fields)

We focus on Approximate Bayesian Computation (ABC) methods.
The principle of ABC is to conduct Bayesian inference on a dataset through comparisons with numerous simulated datasets. We assume it is possible to generate realizations from the statistical model under consideration.

It suffers from two major difficulties:
- to ensure reliability of the method, the number of simulations must be large;
- calibration has always been a critical step in ABC implementation.
Idea: use regression or quantile Random Forests (RF) to estimate some quantities of interest: posterior expectations, variances, quantiles or covariances.

Why Random Forests?
⇒ RF regression and quantile methods were shown to be mostly insensitive both to strong correlations between predictors (here the summary statistics) and to the presence of noisy variables.
⇒ With such a strategy: fewer simulations and no calibration!
Extend the work of Pudlo et al. (2016) to the case of parameter estimation:

Pudlo et al. (incl. JMM & CPR) (2016) Reliable ABC model choice via random forests, Bioinformatics

Related methods:
- adjusted local linear: Beaumont et al. (2002) Approximate Bayesian computation in population genetics, Genetics
- ridge regression: Blum et al. (2013) A comparative review of dimension reduction methods in Approximate Bayesian Computation, Statistical Science
- adjusted neural networks: Blum and François (2010) Non-linear regression models for Approximate Bayesian Computation, Statistics and Computing
Outline of the talk
- Recap on Random Forests
  1. Classification And Regression Trees (CART)
  2. Bootstrap AGGregatING (bagging)
  3. Random Forests
- The ODOF methodology
  1. Posterior expectations
  2. Quantiles
  3. Variances
  4. Covariances
- Simulation study: a Gaussian toy example
- Simulation study: a regression toy example
Recap on Random Forests (inspired by slides of Adele Cutler, September 15-17, 2010, Ovronnaz, Switzerland)

The work of Leo Breiman (1928-2005):
- Breiman et al. (1984) Classification and Regression Trees, Wadsworth Statistics/Probability
- Breiman (1996) Bagging predictors, Machine Learning
- Breiman (2001) Random forests, Machine Learning
1. Classification and Regression Trees (CART)

Grow a binary tree:
- At each node, split the data into two daughter nodes
- Splits are chosen using a splitting criterion
- For regression, the predicted value at a terminal node (leaf) is the average response over all observations in that leaf
- For classification, the predicted class is the most common class in the leaf (majority vote)
Splitting criteria

Regression: residual sum of squares

RSS = Σ_{left} (y_i − ȳ_L)² + Σ_{right} (y_i − ȳ_R)²

where ȳ_L (resp. ȳ_R) is the mean y-value of the left (resp. right) node.

Classification: Gini criterion

N_L Σ_{k=1}^{K} p̂_{kL}(1 − p̂_{kL}) + N_R Σ_{k=1}^{K} p̂_{kR}(1 − p̂_{kR})

where p̂_{kL} (resp. p̂_{kR}) is the proportion of class k in the left (resp. right) node.
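As a rough illustration of the regression criterion (a plain-Python sketch with hypothetical helper names, not the implementation behind the talk), the best split on a single predictor is the threshold minimizing the summed left and right RSS:

```python
def rss(ys):
    """Residual sum of squares around the node mean."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((v - m) ** 2 for v in ys)

def best_split_1d(xs, ys):
    """Scan midpoints between sorted x-values; return (criterion, threshold)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    xs = [xs[i] for i in order]
    ys = [ys[i] for i in order]
    best_crit, best_thr = float("inf"), None
    for i in range(1, len(xs)):
        crit = rss(ys[:i]) + rss(ys[i:])          # left RSS + right RSS
        if crit < best_crit:
            best_crit, best_thr = crit, 0.5 * (xs[i - 1] + xs[i])
    return best_crit, best_thr

# A step function: the best split should land at the jump near x = 0.5
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [1.0, 1.1, 0.9, 1.0, 3.0, 3.1, 2.9, 3.0]
crit, thr = best_split_1d(xs, ys)
print(thr)  # ~0.5
```

CART applies this scan to every predictor at every node and keeps the overall best (variable, threshold) pair.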
Advantages
- Computationally simple and quick to fit, even for large problems
- No formal distributional assumptions (non-parametric)
- Can handle highly non-linear interactions and classification boundaries
- Automatic variable selection
- Very easy to interpret if the tree is small

Disadvantages
- Accuracy: current methods, such as support vector machines and ensemble classifiers, often have 30% lower error rates than CART
- Instability: if we change the data a little, the tree can change a lot
2. Bagging (Bootstrap AGGregatING)
Single Regression Tree
10 Regression Trees
Average of 100 Regression Trees
Fit classification or regression models to bootstrap samples from the data and combine them by voting (classification) or averaging (regression).

Bagging reduces the variance of the base learner but has limited effect on the bias.

It is most effective with strong base learners that have very little bias but high variance (unstable learners), e.g. trees.
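A minimal sketch of the bagging recipe (our own toy code, with a 1-nearest-neighbour regressor standing in as the unstable, low-bias base learner; names are hypothetical):

```python
import random
random.seed(0)

def fit_1nn(data):
    """1-nearest-neighbour regressor: low bias, high variance (unstable)."""
    def predict(x):
        return min(data, key=lambda p: abs(p[0] - x))[1]
    return predict

def bagged_predict(fit, data, x, B=100):
    """Average the predictions of B base learners, each fit on a bootstrap resample."""
    n = len(data)
    preds = []
    for _ in range(B):
        boot = [data[random.randrange(n)] for _ in range(n)]  # sample with replacement
        preds.append(fit(boot)(x))
    return sum(preds) / B

data = [(i / 10, (i / 10) ** 2) for i in range(11)]  # noiseless y = x^2 on a grid
print(bagged_predict(fit_1nn, data, 0.55))
```

Averaging over bootstrap resamples smooths the jumpy 1-NN predictor without changing what it can represent, which is exactly the variance-reduction effect described above.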
3. Random Forests

Grow a forest of many trees. Grow each tree on an independent bootstrap sample from the training data.

At each node:
1. Select m variables at random out of all M possible variables (independently for each node)
2. Find the best split on the selected m variables

Grow the trees to maximum depth (classification). Vote/average the trees to get predictions for new data.
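The two randomization steps above (bootstrap per tree, m-out-of-M variables per node) can be sketched end to end in plain Python; this is a deliberately naive illustration, not the implementation used in the talk:

```python
import random
random.seed(1)

def best_split(X, y, feats):
    """Best (feature, threshold, criterion) among candidate features, by RSS."""
    best = None
    for j in feats:
        vals = sorted(set(row[j] for row in X))
        for a, b in zip(vals, vals[1:]):
            thr = 0.5 * (a + b)
            left = [yi for row, yi in zip(X, y) if row[j] <= thr]
            right = [yi for row, yi in zip(X, y) if row[j] > thr]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            crit = (sum((v - ml) ** 2 for v in left)
                    + sum((v - mr) ** 2 for v in right))
            if best is None or crit < best[2]:
                best = (j, thr, crit)
    return best

def grow(X, y, m, min_size=2):
    """Grow a regression tree, drawing m of the M variables afresh at each node."""
    if len(y) < min_size or len(set(y)) == 1:
        return sum(y) / len(y)                     # leaf: mean response
    split = best_split(X, y, random.sample(range(len(X[0])), m))
    if split is None:
        return sum(y) / len(y)
    j, thr, _ = split
    li = [i for i in range(len(y)) if X[i][j] <= thr]
    ri = [i for i in range(len(y)) if X[i][j] > thr]
    if not li or not ri:
        return sum(y) / len(y)
    return (j, thr,
            grow([X[i] for i in li], [y[i] for i in li], m, min_size),
            grow([X[i] for i in ri], [y[i] for i in ri], m, min_size))

def predict(node, x):
    while isinstance(node, tuple):
        j, thr, left, right = node
        node = left if x[j] <= thr else right
    return node

def forest_predict(X, y, x, B=50, m=1):
    """Random forest: B trees, each on a bootstrap sample, predictions averaged."""
    n, preds = len(y), []
    for _ in range(B):
        idx = [random.randrange(n) for _ in range(n)]
        tree = grow([X[i] for i in idx], [y[i] for i in idx], m)
        preds.append(predict(tree, x))
    return sum(preds) / B

# Toy data: y depends on the first variable only; the second is pure noise
X = [[i / 10, random.random()] for i in range(20)]
y = [3.0 if row[0] > 0.95 else 1.0 for row in X]
print(forest_predict(X, y, [0.2, 0.5]))   # close to 1.0
print(forest_predict(X, y, [1.5, 0.5]))   # close to 3.0
```

Even with m = 1, so that half the nodes are forced to consider only the noise variable, the averaged forest recovers the signal, echoing the insensitivity to noisy predictors claimed earlier.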
Improvements over CART:
- Accuracy: Random Forests are competitive with the best known machine learning methods
- Instability: if we change the data a little, individual trees may change, but the forest is relatively stable because it is a combination of many trees

A case in the training data is absent from the bootstrap sample of about one third of the trees (we say the case is "out-of-bag" for those trees). Voting (or averaging) the predictions of these trees gives the out-of-bag predictor.
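The "about one third" comes from a simple calculation, which a two-line check confirms: a given case is left out of a bootstrap sample of size n with probability (1 − 1/n)^n → e^(−1) ≈ 0.368.

```python
# Probability that a fixed training case is absent from a bootstrap sample of size n
n = 1000
p_oob = (1 - 1 / n) ** n   # tends to exp(-1) ~ 0.368 as n grows
print(round(p_oob, 3))
```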
RF handles thousands of predictors. RF regression and classification methods were shown to be mostly insensitive both to strong correlations between predictors and to the presence of noisy variables.
The One Dimension One Forest (ODOF) methodology

Parametric statistical model: {f(y; θ) : y ∈ 𝒴, θ ∈ Θ}, 𝒴 ⊆ ℝⁿ, Θ ⊆ ℝᵖ

Prior distribution: π(θ)

Goal: estimate a quantity of interest ψ(y) ∈ ℝ: posterior means, variances, quantiles or covariances

Difficulty: the evaluation of f(· ; θ) is not possible
η : 𝒴 → ℝᵏ is an appropriate summary statistic.

Produce the Reference Table (RT) that will be used as learning dataset for the different RF methods: for t = 1, …, N
1. Simulate θ⁽ᵗ⁾ ∼ π(θ)
2. Simulate ỹ_t = (ỹ_{1,t}, …, ỹ_{n,t}) ∼ f(y; θ⁽ᵗ⁾)
3. Compute η(ỹ_t) = {η₁(ỹ_t), …, η_k(ỹ_t)}
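For concreteness, the three-step loop can be sketched in plain Python for the Gaussian toy model used later in the talk (θ1 | θ2 ∼ N(0, θ2), θ2 ∼ IG(4, 3)); the helper names are ours, not the authors' code:

```python
import random
random.seed(0)

def summaries(y):
    """Summary statistics eta(y): here, sample mean and sample variance."""
    n = len(y)
    m = sum(y) / n
    return (m, sum((v - m) ** 2 for v in y) / (n - 1))

def reference_table(N=1000, n=10):
    """Simulate (theta, eta(y~)) pairs from the prior predictive: the RF training set."""
    rows = []
    for _ in range(N):
        # theta2 ~ IG(4, 3): inverse of a Gamma(shape=4, rate=3) draw
        theta2 = 1.0 / random.gammavariate(4, 1.0 / 3.0)
        # theta1 | theta2 ~ N(0, theta2)
        theta1 = random.gauss(0.0, theta2 ** 0.5)
        y = [random.gauss(theta1, theta2 ** 0.5) for _ in range(n)]
        rows.append((theta1, theta2, summaries(y)))
    return rows

rt = reference_table()
print(len(rt))  # 1000
```

Each row of the table pairs a parameter draw with the summaries of its simulated dataset; the RF methods below never touch the likelihood itself, only this table.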
1. Posterior expectations

θ = (θ₁, …, θ_d) ∈ ℝᵈ. Construct d regression RFs, one per dimension; for dimension j:
- response: θ_j
- predictor variables: the summary statistics η(y) = {η₁(y), …, η_k(y)}

If L_b(η(y*)) denotes the leaf of the b-th tree associated with η(y*), i.e. the leaf reached after following the path of binary choices given by this tree, and |L_b(η(y*))| the number of response values in that leaf, then

Ê(θ_j | η(y*)) = (1/B) Σ_{b=1}^{B} (1/|L_b(η(y*))|) Σ_{t : η(ỹ_t) ∈ L_b(η(y*))} θ_j⁽ᵗ⁾
2. Quantiles

Meinshausen (2006) Quantile regression forests, JMLR

The posterior-mean estimate rewrites as a weighted sum over the reference table:

Ê(θ_j | η(y*)) = Σ_{t=1}^{N} w_t(η(y*)) θ_j⁽ᵗ⁾,

with w_t(η(y*)) = (1/B) Σ_{b=1}^{B} 1_{L_b(η(y*))}(η(ỹ_t)) / |L_b(η(y*))|

Estimate the posterior cdf of θ_j with

F̂(u | η(y*)) = Σ_{t=1}^{N} w_t(η(y*)) 1{θ_j⁽ᵗ⁾ ≤ u}

Posterior quantiles, and hence credible intervals, are then derived by inverting F̂.
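The weights and the cdf inversion are simple enough to demonstrate on a hand-made example: two toy "trees" over a reference table of N = 4 entries, with leaf memberships fixed by hand (a sketch of the mechanics only, not a fitted forest):

```python
def qrf_weights(leaf_ids, query_leaf):
    """Meinshausen weights: w_t = (1/B) sum_b 1{t in query leaf of tree b}/|leaf|."""
    N, B = len(leaf_ids[0]), len(leaf_ids)
    w = [0.0] * N
    for b in range(B):
        members = [t for t in range(N) if leaf_ids[b][t] == query_leaf[b]]
        for t in members:
            w[t] += 1.0 / (B * len(members))
    return w

def weighted_quantile(theta, w, alpha):
    """Invert the weighted posterior cdf F(u) = sum_t w_t 1{theta_t <= u}."""
    pairs = sorted(zip(theta, w))
    acc = 0.0
    for val, wt in pairs:
        acc += wt
        if acc >= alpha:
            return val
    return pairs[-1][0]

leaf_ids = [[0, 0, 1, 1],   # tree 1: entries {0,1} share a leaf, as do {2,3}
            [0, 1, 1, 0]]   # tree 2
query_leaf = [1, 1]         # eta(y*) falls in leaf 1 of each tree
theta = [1.0, 2.0, 3.0, 4.0]

w = qrf_weights(leaf_ids, query_leaf)
post_mean = sum(wt * th for wt, th in zip(w, theta))
print(w, post_mean)  # [0.0, 0.25, 0.5, 0.25] 3.0
```

Entry 2 sits in the query leaf of both trees, so it gets double weight; the same weights feed the posterior mean, the cdf, and (next slide) the variance estimate.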
3. Variances

While an approximation of Var(θ_j | η(y*)) can be derived in a natural way from F̂, we suggest a slightly more involved version.

In a given tree b, some entries of the reference table are not exploited, since the tree relies on a bootstrap subsample. These absent entries are called out-of-bag simulations, and they can be used to return an estimate θ̂_j⁽ᵗ⁾ of E{θ_j | η(ỹ_t)}.

Apply the weights w_t(η(y*)) to the squared out-of-bag residuals:

V̂ar(θ_j | η(y*)) = Σ_{t=1}^{N} w_t(η(y*)) (θ_j⁽ᵗ⁾ − θ̂_j⁽ᵗ⁾)²
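Given the out-of-bag predictions and the QRF weights, the variance estimator is a one-liner; the numbers below are made up purely to exercise the formula:

```python
def odof_variance(theta, theta_oob, w):
    """Weighted average of squared out-of-bag residuals, with QRF weights w."""
    return sum(wt * (t - o) ** 2 for t, o, wt in zip(theta, theta_oob, w))

theta     = [1.0, 2.0, 3.0, 4.0]    # theta_j^(t) from the reference table
theta_oob = [1.5, 2.5, 2.5, 3.0]    # out-of-bag RF estimates of E(theta_j | eta(y~_t))
w         = [0.0, 0.25, 0.5, 0.25]  # weights w_t(eta(y*))
print(odof_variance(theta, theta_oob, w))  # 0.4375
```

Using out-of-bag residuals avoids the optimism of in-sample residuals, which would understate the posterior spread.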
4. Covariances

For Cov(θ_j, θ_l | η(y*)), we propose to construct a specific RF with
- response: the product of the out-of-bag errors for θ_j and θ_l, {θ_j⁽ᵗ⁾ − θ̂_j⁽ᵗ⁾}{θ_l⁽ᵗ⁾ − θ̂_l⁽ᵗ⁾}
- predictor variables: the summary statistics η(y) = {η₁(y), …, η_k(y)}
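Building that response vector is the only new ingredient; the covariance estimate at η(y*) is then just an ordinary RF regression prediction for it. A sketch with made-up numbers (hypothetical names):

```python
def cov_forest_response(theta_j, oob_j, theta_l, oob_l):
    """Response of the covariance forest: products of out-of-bag errors."""
    return [(tj - oj) * (tl - ol)
            for tj, oj, tl, ol in zip(theta_j, oob_j, theta_l, oob_l)]

z = cov_forest_response([1.0, 2.0], [0.5, 2.5],   # theta_j and its OOB estimates
                        [0.0, 1.0], [0.2, 0.6])   # theta_l and its OOB estimates
print(z)  # [-0.1, -0.2]
```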
Simulation study: a Gaussian toy example

(y₁, …, y_n) | θ₁, θ₂ ∼ iid N(θ₁, θ₂), n = 10
θ₁ | θ₂ ∼ N(0, θ₂) and θ₂ ∼ IG(4, 3)

θ₁ | y ∼ T( n + 8, nȳ/(n + 1), (s² + 6)/((n + 1)(n + 8)) )
θ₂ | y ∼ IG( n/2 + 4, s²/2 + 3 )

⇒ straightforward to derive theoretical values such as ψ₁(y) = E(θ₁ | y), ψ₂(y) = E(θ₂ | y), ψ₃(y) = Var(θ₁ | y) and ψ₄(y) = Var(θ₂ | y)
Reference table of N = 10,000 replicates. Independent test set of size N_pred = 100.

k = 53 summary statistics: the sample mean, the sample variance, the sample median absolute deviation, and 50 independent noisy variables (uniform on [0, 1]).
Scatterplots of the theoretical values ψ₁, ψ₂, ψ₃, ψ₄ with their corresponding estimates ψ̃₁, ψ̃₂, ψ̃₃, ψ̃₄
Scatterplots of the theoretical 2.5% and 97.5% posterior quantiles of θ₁ and θ₂ with their corresponding estimates
Comparison of normalized mean absolute errors:

                        ODOF   adj local linear   adj ridge   adj neural net
ψ₁(y) = E(θ₁ | y)       0.21        0.42             0.38          0.42
ψ₂(y) = E(θ₂ | y)       0.11        0.20             0.26          0.22
ψ₃(y) = Var(θ₁ | y)     0.47        0.66             0.75          0.48
ψ₄(y) = Var(θ₂ | y)     0.46        0.85             0.73          0.98
Q_0.025(θ₁ | y)         0.69        0.55             0.78          0.53
Q_0.025(θ₂ | y)         0.06        0.45             0.68          1.02
Q_0.975(θ₁ | y)         0.48        0.55             0.79          0.50
Q_0.975(θ₂ | y)         0.18        0.23             0.23          0.38
Boxplot comparison of the estimates of Var(θ₁ | y) and Var(θ₂ | y): true values, ODOF, and the usual ABC methods (local linear, ridge, neural net)
Simulation study: a regression toy example

(y₁, …, y_n) | β₁, β₂, σ² ∼ N_n(Xβ, σ² I_n)
X = [x₁ x₂] an n × 2 design matrix, β = (β₁, β₂)ᵀ and n = 100
β₁, β₂ | σ² ∼ N₂(0, nσ²(XᵀX)⁻¹), σ² ∼ IG(4, 3)

⇒ this conjugate model leads to closed-form posteriors:

β₁, β₂ | y ∼ T₂( 8 + n, n/(n + 1) (XᵀX)⁻¹Xᵀy, {3 + yᵀ(I − X(XᵀX)⁻¹Xᵀ)y/2}/(4 + n/2) × n/(n + 1) (XᵀX)⁻¹ )
σ² | y ∼ IG( 4 + n/2, 3 + ½ yᵀ(I − X(XᵀX)⁻¹Xᵀ)y )
Reference table of N = 10,000 replicates. Independent test set of size N_pred = 100.

k = 62 summary statistics: the maximum likelihood estimates of β₁ and β₂, the residual sum of squares, the empirical covariances and correlations between y and the x_j, the sample mean, the sample variance, the sample median… and 50 independent noisy variables (uniform on [0, 1]).

X is chosen such that there is a significant posterior correlation between β₁ and β₂.
Scatterplots of the theoretical posterior variances Var(β₁ | y) and Var(β₂ | y) with their corresponding ODOF, adjusted ridge and adjusted neural-net estimates
Scatterplots of the theoretical posterior variance Var(σ² | y) with its corresponding ODOF, adjusted ridge and adjusted neural-net estimates
Scatterplots of the theoretical posterior covariances Cov(β₁, β₂ | y) with their corresponding ODOF, adjusted ridge and adjusted neural-net estimates
Comparison of normalized mean absolute errors:

                    ODOF   adj ridge   adj neural net
E(β₁ | y)           0.09      0.12          0.15
E(β₂ | y)           0.10      0.25          0.38
E(σ² | y)           0.04      0.06          0.07
Var(β₁ | y)         0.53      0.98          0.60
Var(β₂ | y)         0.49      0.85          0.57
Var(σ² | y)         0.32      0.80          0.75
Cov(β₁, β₂ | y)     0.29      0.86          0.62
Q_0.025(β₁ | y)     0.25      0.35          0.29
Q_0.975(β₁ | y)     0.40      0.85          0.78
Boxplot comparison of the estimates of Var(β₁ | y), Var(β₂ | y) and Var(σ² | y): true values, ODOF, adjusted neural net