
ABC random forest for parameter estimation
Jean-Michel Marin, Université de Montpellier, Institut Montpelliérain Alexander Grothendieck (IMAG), Institut de Biologie Computationnelle (IBC), Labex Numev
Joint work with Pierre Pudlo, Louis Raynal, Mathieu Ribatet and Christian Robert
ABCruise, Helsinki

Introduction. We consider statistical models for which no explicit form of the likelihood is available, or for which a single evaluation of the likelihood is too CPU demanding, e.g. because of numerous heterogeneous latent variables (as with the coalescent model) or an intractable normalizing constant (as with Gibbs random fields). We focus on Approximate Bayesian Computation (ABC) methods.

The principle of ABC is to conduct Bayesian inference on a dataset through comparisons with numerous simulated datasets. We assume that it is possible to generate realizations from the statistical model under consideration. ABC suffers from two major difficulties: to ensure the reliability of the method, the number of simulations should be large; and calibration has always been a critical step in ABC implementation.

Idea: use regression or quantile Random Forests (RF) to estimate quantities of interest: posterior expectations, variances, quantiles or covariances. Why Random Forests? RF regression and quantile methods were shown to be mostly insensitive both to strong correlations between predictors (here the summary statistics) and to the presence of noisy variables. With such a strategy, fewer simulations are needed and no calibration is required!

This extends the work of Pudlo et al. (2016) to parameter estimation: Pudlo et al. (JMM & CPR) (2016) Reliable ABC model choice via random forests, Bioinformatics. Related methods:
- adjusted local linear regression: Beaumont et al. (2002) Approximate Bayesian computation in population genetics, Genetics
- ridge regression: Blum et al. (2013) A Comparative Review of Dimension Reduction Methods in Approximate Bayesian Computation, Statistical Science
- adjusted neural networks: Blum and François (2010) Non-linear regression models for Approximate Bayesian Computation, Statistics and Computing

Outline of the talk
- Recap on Random Forests: 1. Classification And Regression Trees (CART); 2. Bootstrap AGGregatING; 3. Random Forests
- The ODOF methodology: 1. Posterior expectations; 2. Quantiles; 3. Variances; 4. Covariances
- Simulation study: a Gaussian toy example
- Simulation study: a regression toy example

Recap on Random Forests (inspired by slides of Adele Cutler, September 15-17, 2010, Ovronnaz, Switzerland). The work of Leo Breiman (1928-2005):
Breiman et al. (1984) Classification and Regression Trees, Wadsworth Statistics/Probability
Breiman (1996) Bagging predictors, Machine Learning
Breiman (2001) Random Forests, Machine Learning

1. Classification and Regression Trees (CART). Grow a binary tree. At each node, split the data into two daughter nodes; splits are chosen using a splitting criterion. For regression, the predicted value at a terminal node (leaf) is the average response over all observations in that leaf. For classification, the predicted class is the most common class in the leaf (majority vote).

Splitting criteria.
Regression: residual sum of squares,
  Σ_{left} (y_i − ȳ_L)² + Σ_{right} (y_i − ȳ_R)²,
where ȳ_L is the mean y-value of the left node and ȳ_R the mean y-value of the right node.
Classification: Gini criterion,
  N_L Σ_{k=1}^K p_{kL}(1 − p_{kL}) + N_R Σ_{k=1}^K p_{kR}(1 − p_{kR}),
where p_{kL} is the proportion of class k in the left node and p_{kR} the proportion of class k in the right node.
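
As a concrete illustration, here is a minimal Python sketch of both criteria evaluated on a single hypothetical candidate split; the data and the threshold are made up for illustration.

    import numpy as np

    def rss_split(y_left, y_right):
        # Regression criterion: residual sum of squares of the two daughter nodes
        return np.sum((y_left - y_left.mean()) ** 2) + np.sum((y_right - y_right.mean()) ** 2)

    def gini_split(labels_left, labels_right):
        # Classification criterion: size-weighted Gini impurity of the two daughter nodes
        def gini(labels):
            _, counts = np.unique(labels, return_counts=True)
            p = counts / counts.sum()
            return np.sum(p * (1 - p))
        return len(labels_left) * gini(labels_left) + len(labels_right) * gini(labels_right)

    rng = np.random.default_rng(0)
    x = rng.uniform(size=100)
    y = 2 * x + rng.normal(scale=0.1, size=100)
    left = x <= 0.5                      # one candidate split of a single predictor
    print(rss_split(y[left], y[~left]))  # the split minimizing this value is retained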


Advantages:
- computationally simple and quick to fit, even for large problems
- no formal distributional assumptions (non-parametric)
- can handle highly non-linear interactions and classification boundaries
- automatic variable selection
- very easy to interpret if the tree is small
Disadvantages:
- accuracy: current methods, such as support vector machines and ensemble classifiers, often have 30% lower error rates than CART
- instability: if we change the data a little, the tree picture can change a lot

2. Bagging (Bootstrap AGGregatING) predictors

[Figure: a single regression tree]

[Figure: 10 regression trees]

[Figure: the average of 100 regression trees]

Bagging fits classification or regression models to bootstrap samples from the data and combines them by voting (classification) or averaging (regression). Bagging reduces the variance of the base learner but has limited effect on the bias. It is most effective with base learners that have very little bias but high variance (unstable learners), e.g. trees.
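
A minimal bagging sketch with scikit-learn (assumed available); the data are simulated purely for illustration, and the default base learner of BaggingRegressor is an unpruned regression tree.

    import numpy as np
    from sklearn.ensemble import BaggingRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(500, 1))
    y = np.sin(6 * X[:, 0]) + rng.normal(scale=0.2, size=500)

    # 100 regression trees, each fit to a bootstrap sample; predictions are averaged
    bag = BaggingRegressor(n_estimators=100, random_state=0).fit(X, y)
    print(bag.predict(X[:3]))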

3. Random Forests. Grow a forest of many trees. Grow each tree on an independent bootstrap sample from the training data. At each node:
1. select m variables at random out of all M possible variables (independently for each node)
2. find the best split on the selected m variables
Grow the trees to maximum depth (classification). Vote/average the trees to get predictions for new data.
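
The same recipe can be sketched with scikit-learn (assumed available; data simulated for illustration): max_features plays the role of m, and bootstrap=True grows each tree on an independent bootstrap sample.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))                       # M = 20 candidate predictors
    y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=1000)

    rf = RandomForestRegressor(
        n_estimators=500,      # number of trees in the forest
        max_features="sqrt",   # m variables drawn at random at each node
        bootstrap=True,        # one independent bootstrap sample per tree
        random_state=0,
    ).fit(X, y)
    print(rf.predict(X[:5]))   # average of the 500 tree predictions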

Random Forests improve on CART with respect to:
- accuracy: Random Forests are competitive with the best known machine learning methods
- instability: if we change the data a little, the individual trees may change, but the forest is relatively stable because it is a combination of many trees
A case in the training data is not in the bootstrap sample for about one third of the trees (we say the case is out-of-bag). Voting (or averaging) the predictions of these trees gives the out-of-bag predictor.
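
A sketch of the out-of-bag predictor with scikit-learn, on simulated data: with oob_score=True the forest stores, for each training case, the average prediction of the trees whose bootstrap sample did not contain it.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(1)
    X = rng.normal(size=(1000, 10))
    y = X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=1000)

    rf = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0).fit(X, y)
    print(rf.oob_prediction_[:5])  # out-of-bag predictions for the first five training cases
    print(rf.oob_score_)           # R^2 of the out-of-bag predictions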

RF handles thousands of predictors. RF regression and classification methods were shown to be mostly insensitive both to strong correlations between predictors and to the presence of noisy variables.

The One Dimension One Forest (ODOF) Methodology. Parametric statistical model: {f(y; θ): y ∈ Y, θ ∈ Θ}, Y ⊆ R^n, Θ ⊆ R^p. Prior distribution π(θ). Goal: estimate a quantity of interest ψ(y) ∈ R, such as posterior means, variances, quantiles or covariances. Difficulty: the evaluation of f(·; θ) is not possible.

Let η: Y → R^k be an appropriate summary statistic. Produce the Reference Table (RT) that will be used as the learning dataset for the different RF methods: for t = 1, ..., N,
1. simulate θ^(t) ~ π(θ)
2. simulate ỹ_t = (ỹ_{1,t}, ..., ỹ_{n,t}) ~ f(y; θ^(t))
3. compute η(ỹ_t) = {η_1(ỹ_t), ..., η_k(ỹ_t)}
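
A generic sketch of this loop; prior_sampler, model_simulator and summaries are hypothetical placeholders to be supplied for the model at hand.

    import numpy as np

    def build_reference_table(N, prior_sampler, model_simulator, summaries):
        # Returns theta of shape (N, p) and eta of shape (N, k)
        thetas, etas = [], []
        for _ in range(N):
            theta = prior_sampler()           # 1. theta^(t) ~ pi(theta)
            y_t = model_simulator(theta)      # 2. y~_t ~ f(. ; theta^(t))
            etas.append(summaries(y_t))       # 3. eta(y~_t) = (eta_1(y~_t), ..., eta_k(y~_t))
            thetas.append(theta)
        return np.asarray(thetas), np.asarray(etas)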

1. Posterior expectations. θ = (θ_1, ..., θ_d) ∈ R^d. Construct d regression RFs, one per dimension: for dimension j, the response is θ_j and the predictor variables are the summary statistics η(y) = {η_1(y), ..., η_k(y)}. Let y* denote the observed dataset. If L_b(η(y*)) denotes the leaf of the b-th tree associated with η(y*), i.e. the leaf reached after following the path of binary choices given by this tree, and |L_b(η(y*))| the number of response values in that leaf, then
  E(θ_j | η(y*)) ≈ (1/B) Σ_{b=1}^B (1/|L_b(η(y*))|) Σ_{t: η(ỹ_t) ∈ L_b(η(y*))} θ_j^(t)
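
A sketch with scikit-learn: one regression forest for the coordinate θ_j, with the summaries of the reference table as predictors. The reference table below is a made-up stand-in, not one of the examples of the talk.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    N, k = 10_000, 3
    theta_j = rng.normal(size=N)                                  # one coordinate of theta
    eta = theta_j[:, None] + rng.normal(scale=0.5, size=(N, k))   # hypothetical summary statistics
    eta_obs = np.array([[0.3, 0.1, 0.2]])                         # eta(y*) for the observed dataset

    rf_j = RandomForestRegressor(n_estimators=500, min_samples_leaf=5, random_state=0)
    rf_j.fit(eta, theta_j)
    print(rf_j.predict(eta_obs)[0])  # average over trees of the within-leaf means: estimate of E(theta_j | eta(y*))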

2. Quantiles. Meinshausen (2006) Quantile Regression Forests, JMLR. The previous estimator can be rewritten as
  E(θ_j | η(y*)) ≈ Σ_{t=1}^N w_t(η(y*)) θ_j^(t),  with  w_t(η(y*)) = (1/B) Σ_{b=1}^B I_{L_b(η(y*))}(η(ỹ_t)) / |L_b(η(y*))|.
Estimate the posterior cdf of θ_j with
  F̂(u | η(y*)) = Σ_{t=1}^N w_t(η(y*)) I{θ_j^(t) ≤ u}.
Posterior quantiles, and hence credible intervals, are then derived by inverting F̂.
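
A sketch of the weights w_t(η(y*)) and of the inversion of F̂, reusing the hypothetical objects rf_j, eta, theta_j and eta_obs of the previous sketch; rf_j.apply returns the leaf index reached in every tree.

    leaves_train = rf_j.apply(eta)        # shape (N, B): leaf of each reference-table entry in each tree
    leaves_obs = rf_j.apply(eta_obs)[0]   # shape (B,): leaf L_b(eta(y*)) reached in each tree

    w = np.zeros(len(theta_j))
    for b, leaf in enumerate(leaves_obs):
        in_leaf = leaves_train[:, b] == leaf
        w[in_leaf] += 1.0 / in_leaf.sum()
    w /= rf_j.n_estimators                # w_t = (1/B) sum_b 1{eta(y~_t) in L_b(eta(y*))} / |L_b(eta(y*))|

    order = np.argsort(theta_j)           # weighted posterior cdf of theta_j ...
    cdf = np.cumsum(w[order])
    q025, q975 = np.interp([0.025, 0.975], cdf, theta_j[order])
    print(q025, q975)                     # ... inverted to get a 95% credible interval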

3. Variances. While an approximation of Var(θ_j | η(y*)) can be derived in a natural way from F̂, we suggest a slightly more involved version. In a given tree b, some entries of the reference table are not exploited, since the tree relies on a bootstrap subsample; these absent entries are called out-of-bag simulations and can be used to return an out-of-bag estimate θ_j^oob(t) of E{θ_j | η(ỹ_t)}. Applying the weights w_t(η(y*)) to the squared out-of-bag residuals gives
  Var(θ_j | η(y*)) ≈ Σ_{t=1}^N w_t(η(y*)) (θ_j^(t) − θ_j^oob(t))²
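
A sketch of this variance estimate, again reusing the hypothetical eta, theta_j and the weights w of the previous sketches; the out-of-bag estimates come from refitting the forest with oob_score=True.

    rf_oob = RandomForestRegressor(n_estimators=500, min_samples_leaf=5,
                                   oob_score=True, random_state=0).fit(eta, theta_j)
    oob_resid_j = theta_j - rf_oob.oob_prediction_   # theta_j^(t) minus its out-of-bag estimate
    var_j = np.sum(w * oob_resid_j ** 2)             # approximates Var(theta_j | eta(y*))
    print(var_j)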

4. Covariances. For Cov(θ_j, θ_l | η(y*)), we propose to construct a specific RF whose response is the product of the out-of-bag errors for θ_j and θ_l,
  (θ_j^(t) − θ_j^oob(t)) (θ_l^(t) − θ_l^oob(t)),
and whose predictor variables are the summary statistics η(y) = {η_1(y), ..., η_k(y)}.
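
A sketch of this covariance forest under the same hypothetical setup; theta_l is a second, made-up parameter coordinate, and its out-of-bag residuals oob_resid_l are built exactly as oob_resid_j above.

    theta_l = 0.5 * theta_j + rng.normal(scale=0.5, size=len(theta_j))
    rf_oob_l = RandomForestRegressor(n_estimators=500, min_samples_leaf=5,
                                     oob_score=True, random_state=0).fit(eta, theta_l)
    oob_resid_l = theta_l - rf_oob_l.oob_prediction_

    # Response of the covariance forest: product of the out-of-bag errors
    rf_cov = RandomForestRegressor(n_estimators=500, min_samples_leaf=5, random_state=0)
    rf_cov.fit(eta, oob_resid_j * oob_resid_l)
    print(rf_cov.predict(eta_obs)[0])  # approximates Cov(theta_j, theta_l | eta(y*))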

Simulation study: a Gaussian toy example.
  (y_1, ..., y_n) | θ_1, θ_2 ~ iid N(θ_1, θ_2), n = 10
  θ_1 | θ_2 ~ N(0, θ_2) and θ_2 ~ IG(4, 3)
  θ_1 | y ~ T(n + 8, nȳ/(n + 1), (s² + 6)/((n + 1)(n + 8)))
  θ_2 | y ~ IG(n/2 + 4, s²/2 + 3)
It is therefore straightforward to derive theoretical values such as ψ_1(y) = E(θ_1 | y), ψ_2(y) = E(θ_2 | y), ψ_3(y) = Var(θ_1 | y) and ψ_4(y) = Var(θ_2 | y).

Reference table of N = 10,000 replicates. Independent test set of size N_pred = 100. k = 53 summary statistics: the sample mean, the sample variance, the sample median absolute deviation, and 50 independent noisy variables (uniform on [0, 1]).
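
A sketch of how such a reference table could be simulated (NumPy and SciPy assumed available; the seed and the exact noise construction are arbitrary choices).

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    N, n = 10_000, 10
    theta2 = stats.invgamma.rvs(a=4, scale=3, size=N, random_state=0)  # theta_2 ~ IG(4, 3)
    theta1 = rng.normal(0.0, np.sqrt(theta2))                          # theta_1 | theta_2 ~ N(0, theta_2)
    y = rng.normal(theta1[:, None], np.sqrt(theta2)[:, None], size=(N, n))

    eta = np.column_stack([
        y.mean(axis=1),                           # sample mean
        y.var(axis=1, ddof=1),                    # sample variance
        stats.median_abs_deviation(y, axis=1),    # sample median absolute deviation
        rng.uniform(size=(N, 50)),                # 50 independent noisy variables on [0, 1]
    ])
    print(eta.shape)                              # (10000, 53)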

[Figure: scatterplots of the theoretical values ψ_1(y), ψ_2(y), ψ_3(y) and ψ_4(y) against their corresponding estimates.]

[Figure: scatterplots of the theoretical 2.5% and 97.5% posterior quantiles of θ_1 and θ_2 against their corresponding estimates.]

Comparison of normalized mean absolute errors:

                          ODOF   adj local linear   adj ridge   adj neural net
  ψ_1(y) = E(θ_1 | y)     0.21   0.42               0.38        0.42
  ψ_2(y) = E(θ_2 | y)     0.11   0.20               0.26        0.22
  ψ_3(y) = Var(θ_1 | y)   0.47   0.66               0.75        0.48
  ψ_4(y) = Var(θ_2 | y)   0.46   0.85               0.73        0.98
  Q_0.025(θ_1 | y)        0.69   0.55               0.78        0.53
  Q_0.025(θ_2 | y)        0.06   0.45               0.68        1.02
  Q_0.975(θ_1 | y)        0.48   0.55               0.79        0.50
  Q_0.975(θ_2 | y)        0.18   0.23               0.23        0.38

[Figure: boxplot comparison of the estimates of Var(θ_1 | y) and Var(θ_2 | y) from ODOF and the usual ABC methods (adjusted local linear, ridge, neural network) against the true values.]

Simulation study: a regression toy example.
  (y_1, ..., y_n) | β_1, β_2, σ² ~ N_n(Xβ, σ² I_n)
  X = [x_1 x_2] an n × 2 design matrix, β = (β_1, β_2)ᵀ and n = 100
  (β_1, β_2) | σ² ~ N_2(0, n σ² (XᵀX)⁻¹) and σ² ~ IG(4, 3)
This conjugate model leads to closed-form posteriors: (β_1, β_2) | y follows a bivariate Student t distribution with 8 + n degrees of freedom, location n/(n + 1) (XᵀX)⁻¹ Xᵀ y and scale matrix {3 + yᵀ(Id − X(XᵀX)⁻¹Xᵀ)y/2}/(4 + n/2) × n/(n + 1) (XᵀX)⁻¹, and
  σ² | y ~ IG(4 + n/2, 3 + yᵀ(Id − X(XᵀX)⁻¹Xᵀ)y/2)

Reference table of N = 10,000 replicates. Independent test set of size N_pred = 100. k = 62 summary statistics: the maximum likelihood estimates of β_1 and β_2, the residual sum of squares, the empirical covariance and correlation between y and x_j (j = 1, 2), the sample mean, the sample variance, the sample median, ... and 50 independent noisy variables (uniform on [0, 1]). X is chosen such that there is a significant posterior correlation between β_1 and β_2.

[Figure: scatterplots of the theoretical posterior variances Var(β_1 | y) and Var(β_2 | y) against their estimates from ODOF, adjusted ridge and adjusted neural networks.]

[Figure: scatterplots of the theoretical posterior variance Var(σ² | y) against its estimates from ODOF, adjusted ridge and adjusted neural networks.]

[Figure: scatterplots of the theoretical posterior covariance Cov(β_1, β_2 | y) against its estimates from ODOF, adjusted ridge and adjusted neural networks.]

Comparison of normalized mean absolute errors:

                       ODOF   adj ridge   adj neural net
  E(β_1 | y)           0.09   0.12        0.15
  E(β_2 | y)           0.10   0.25        0.38
  E(σ² | y)            0.04   0.06        0.07
  Var(β_1 | y)         0.53   0.98        0.60
  Var(β_2 | y)         0.49   0.85        0.57
  Var(σ² | y)          0.32   0.80        0.75
  Cov(β_1, β_2 | y)    0.29   0.86        0.62
  Q_0.025(β_1 | y)     0.25   0.35        0.29
  Q_0.975(β_1 | y)     0.40   0.85        0.78

[Figure: boxplot comparison of the estimates of Var(β_1 | y), Var(β_2 | y) and Var(σ² | y) from ODOF and adjusted neural networks against the true values.]