A novel algorithmic approach to Bayesian Logic Regression
1 A novel algorithmic approach to Bayesian Logic Regression
Hubin A.A., Storvik G.O., Frommlet F.
Department of Mathematics, University of Oslo, and Department of Medical Statistics (CEMSIIS)
Aliaksandr Hubin (University of Oslo) Bayesian Logic Regression / 32
2 Introduction. Issues
Logic regression was developed as a tool to construct predictors from Boolean combinations of binary covariates. It was previously used for inference, not for prediction. Among the main applications were:
- Modeling epistatic effects in genetic association studies
- Regulatory motif finding
- Identifying target populations for screening or not screening
3 Introduction. Issues
Logic regression has not become widely known because of:
- Combinatorial complexity
- Fit algorithms that were not performing sufficiently well
- Few applications being addressed
Efficient algorithms for fitting model probabilities in the space of logic regressions are required, since:
- The number of models to select from is doubly exponential in the number of input Boolean variables (leaves)
- The search space has numerous sparsely located local extrema
- Time and computing resources are limited
4 Bayesian Logic Regression (GLM context)
Y_i | \mu_i \sim f(y | \mu_i), \quad i \in \{1,\dots,n\} \qquad (1)
\mu_i = g^{-1}(\eta_i) \qquad (2)
\eta_i = \gamma_0 \beta_0 + \sum_{j=1}^{k} \gamma_j \beta_j L_{ij} \qquad (3)
- L_{ij} \in \{0,1\}, j \in \{1,\dots,k\}, are all feasible logical expressions (trees) based on the input leaves, e.g. L_{i1} = (X_{i1} \wedge X_{i2}) \vee X_{i3}^c, where \wedge is logical and, \vee is logical or, and ^c is logical not
- k is the total number of all possible trees of size up to C based on the p input leaves
- \beta_j \in \mathbb{R}, j \in \{0,\dots,k\}, are the regression coefficients of these trees
- g(\cdot) is a proper link function
- \gamma_j \in \{0,1\}, j \in \{0,\dots,k\}, are latent indicators defining whether tree L_{ij} is included in the model (\gamma_j = 1) or not (\gamma_j = 0)
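The model (1)-(3) can be sketched in code. This is a minimal illustration, not the authors' implementation: a logic tree is encoded as a nested tuple ("and", a, b), ("or", a, b), ("not", a) or ("leaf", j), and the linear predictor sums the included trees; all function names are hypothetical.

```python
import numpy as np

def eval_tree(tree, X):
    """Evaluate a logic tree on a binary design matrix X (n x p) -> {0,1}^n."""
    kind = tree[0]
    if kind == "leaf":
        return X[:, tree[1]]
    if kind == "not":
        return 1 - eval_tree(tree[1], X)
    a, b = eval_tree(tree[1], X), eval_tree(tree[2], X)
    return a * b if kind == "and" else np.maximum(a, b)  # and / or

def linear_predictor(X, trees, beta, gamma):
    """eta_i = gamma_0 beta_0 + sum_j gamma_j beta_j L_ij, as in (3)."""
    eta = np.full(X.shape[0], gamma[0] * beta[0], dtype=float)
    for j, tree in enumerate(trees, start=1):
        if gamma[j]:
            eta += beta[j] * eval_tree(tree, X)
    return eta

rng = np.random.default_rng(0)
X = rng.binomial(1, 0.3, size=(5, 4))
# L_1 = (X_0 and X_1) or not X_2
L1 = ("or", ("and", ("leaf", 0), ("leaf", 1)), ("not", ("leaf", 2)))
eta = linear_predictor(X, [L1], beta=[0.5, 2.0], gamma=[1, 1])
```

With one included tree and beta = (0.5, 2.0), each eta_i is 0.5 when the tree evaluates to 0 and 2.5 when it evaluates to 1.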
5-7 Model Priors
p(\gamma) \propto I\{\sum_{j=1}^{k} \gamma_j \le Q\} \prod_{j=1}^{k} v^{\gamma_j c(L_j)}, \qquad (4)
where c(L_j) is a measure of the complexity of term L_j, Q is the maximal allowed number of trees per model, and 0 < v < 1.
An inference-driven choice capable of controlling FDR is
p(\gamma) \propto I\{\sum_{j=1}^{k} \gamma_j \le Q\} \prod_{j=1}^{k} s_j!\, p^{-s_j}\, 2^{-(2 s_j - 2)}\, I\{s_j \le C\}, \qquad (5)
where s_j is the number of leaves in tree L_j, p is the number of input leaves, and C is the maximal allowed number of leaves per tree.
A prediction- or inference-driven prior is
p(\gamma) \propto I\{r \le Q\}\,(1 - 1/r)(1/r)^a\,(1 - 1/s)(1/s)^b \prod_{j=1}^{k} I\{s_j \le C\}, \quad a, b > 0, \qquad (6)
where r = \sum_{j=1}^{k} \gamma_j is the model size and s is r plus the total number of logical operators present in the model.
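A sketch of evaluating the tree-size prior (5) in log space, assuming the per-tree contribution s_j! p^{-s_j} 2^{-(2 s_j - 2)} and hard truncation at C leaves per tree and Q trees per model; the function name is illustrative.

```python
import math

def log_prior_trees(leaf_counts, p, C, Q):
    """Unnormalized log p(gamma) under a prior of the form (5).

    leaf_counts: sizes s_j of the trees included in the model.
    Returns -inf when the Q-trees or C-leaves constraints are violated.
    """
    if len(leaf_counts) > Q or any(s > C for s in leaf_counts):
        return float("-inf")
    return sum(
        math.lgamma(s + 1) - s * math.log(p) - (2 * s - 2) * math.log(2)
        for s in leaf_counts
    )

# A model with two trees of sizes 2 and 3 out of p = 50 input leaves:
lp = log_prior_trees([2, 3], p=50, C=5, Q=10)
```

Working with log-priors keeps this numerically stable when combined with log marginal likelihoods later in the pipeline.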
8 Marginal likelihood approximation
Assume the following prior on the regression coefficients:
\beta | \gamma \sim N_p(\mu_\beta, \Sigma_\beta). \qquad (7)
Then a Laplace approximation of the marginal likelihood can be obtained in the GLM context:
\hat{p}(D | \gamma) = e^{\log p(D | \gamma, \hat\theta_\gamma) - 0.5\,|\theta_\gamma| \log n}, \qquad (8)
where p(D | \gamma, \hat\theta_\gamma) is the likelihood evaluated at the maximum likelihood estimate \hat\theta_\gamma of the parameters for model \gamma (the corresponding regression coefficients and possibly a variance parameter), while n is the number of observations.
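For a logistic-regression model, the BIC-style approximation (8) can be sketched as follows: fit the MLE by Newton-Raphson, then subtract half the parameter dimension times log n from the maximized log-likelihood. This is an illustration under those assumptions, not the EMJMCMC implementation; all names are hypothetical.

```python
import numpy as np

def fit_logistic(X, y, iters=50):
    """MLE for logit(P(y=1)) = X theta via Newton-Raphson."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1 / (1 + np.exp(-X @ theta))
        W = mu * (1 - mu)                           # IRLS weights
        H = X.T @ (W[:, None] * X) + 1e-8 * np.eye(X.shape[1])
        theta += np.linalg.solve(H, X.T @ (y - mu))
    return theta

def log_marglik_laplace(X, y):
    """log p(D|gamma, theta_hat) - 0.5 * |theta| * log n, as in (8)."""
    n, d = X.shape
    theta = fit_logistic(X, y)
    mu = 1 / (1 + np.exp(-X @ theta))
    loglik = np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))
    return loglik - 0.5 * d * np.log(n)

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.binomial(1, 0.5, 200)])
y = rng.binomial(1, 1 / (1 + np.exp(-(0.3 + 1.2 * X[:, 1]))))
ml = log_marglik_laplace(X, y)
```

Here the columns of X would be the intercept and the evaluated logic trees L_ij of a candidate model gamma.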
9 Inference on the model
Let:
- \gamma = \{\gamma_1,\dots,\gamma_k\} define the model itself, i.e. which covariates are addressed;
- \theta_\gamma define the parameters of the model.
Goals:
- p(\gamma, \theta_\gamma | D), the posterior distribution of parameters and models;
- p(\gamma | D), the marginal posterior probabilities of the models;
- p(\Delta | D), the marginal posterior probabilities of the quantities of interest.
But:
- there are 2^k different models in \Omega_\gamma;
- k is huge;
- both k and \Omega_\gamma are extremely difficult to specify.
10 Possible pipeline
- Notice that p(\gamma, \theta_\gamma | D) = p(\theta_\gamma | \gamma, D)\, p(\gamma | D);
- Here p(D | \gamma) can be obtained by LA or similarly;
- Notice that p(\gamma | D) = \frac{p(D | \gamma) p(\gamma)}{\sum_{\gamma' \in \Omega_\gamma} p(D | \gamma') p(\gamma')};
- Approximate with
\hat{p}(\gamma | D) = \frac{p(D | \gamma) p(\gamma)}{\sum_{\gamma' \in V} p(D | \gamma') p(\gamma')}, \qquad (9)
where V is the subspace of \Omega_\gamma to be efficiently explored;
- Near-modal values in terms of marginal likelihood times prior are particularly important for the construction of a reasonable V \subset \Omega_\gamma; missing them can dramatically influence the posterior in the original space \Omega_\gamma.
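The renormalization (9) over the explored subset V can be sketched with a log-sum-exp computation for numerical stability; the inputs below are made-up values.

```python
import numpy as np

def model_posteriors(log_mliks, log_priors):
    """Renormalize p(D|gamma) p(gamma) over the explored models V, as in (9)."""
    lw = np.asarray(log_mliks) + np.asarray(log_priors)
    lw -= lw.max()                  # log-sum-exp trick: shift before exponentiating
    w = np.exp(lw)
    return w / w.sum()

# Three hypothetical explored models with their log marginal likelihoods and log priors:
post = model_posteriors([-120.3, -118.9, -125.0], [-3.1, -4.0, -2.5])
```

Subtracting the maximum before exponentiating avoids underflow, which matters because log marginal likelihoods of realistic models are large negative numbers.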
11 MJMCMC is efficient, but...
In Hubin and Storvik [6] we suggested efficient mode jumping proposals in discrete parameter spaces. But \Omega_\gamma and k must be clearly specified for MJMCMC. The latter is not feasible in logic regression.
Figure: Locally optimized proposals with randomization.
12 Genetically modified MJMCMC. Idea
MJMCMC is embedded in the iterative setting of a genetic algorithm.
- In each iteration only a given set S of trees (of fixed size d) is considered;
- Each S then induces a separate search space for MJMCMC; in the language of genetic algorithms, S is the population;
- S dynamically evolves as a Markov chain of populations through \{S_0,\dots,S_{t_{max}}\} to allow MJMCMC to explore different reasonable parts of the infeasibly large total search space;
- Each S_t, t \in \{1,\dots,t_{max}\}, is selected from the neighborhood N_{t-1} of S_{t-1}; N_{t-1} includes all populations reachable by applying the mutation, crossover, reduction and filtration operations to the current S_{t-1};
- Utilization of approximation (9) allows us to compute marginal inclusion probabilities:
\hat{p}(L | D) = \sum_{\gamma \in V : L \in T(\gamma)} \hat{p}(\gamma | D). \qquad (10)
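The inclusion probability (10) sums the (renormalized) posteriors of all visited models whose tree set contains L. A minimal sketch, with models encoded as hypothetical frozensets of tree labels:

```python
def inclusion_probability(tree, models, posteriors):
    """p(L|D) = sum of p(gamma|D) over visited models gamma containing tree L."""
    return sum(p for m, p in zip(models, posteriors) if tree in m)

# Explored models V with their (already renormalized) posteriors:
V = [frozenset({"L1"}), frozenset({"L1", "L2"}), frozenset({"L3"})]
post = [0.5, 0.3, 0.2]
p_L1 = inclusion_probability("L1", V, post)   # 0.5 + 0.3 = 0.8
```

These inclusion probabilities are exactly what the filtration and parent-selection operations below consume.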
13 Genetically modified MJMCMC. Pipeline
S_0 is the set of the p input binary leaves.
S_1 is constructed by:
1. Running MJMCMC for a given number of iterations N_{init} on S_0;
2. The first d_1 < d members of population S_1 are then defined by the filtration operation, whilst the p - d_1 filtered leaves from S_0 are kept in F;
3. The remaining d - d_1 members of S_1 are obtained by means of the crossover operation applied to S_0.
All other S_t, t \in \{2,\dots,t_{max}\}, are constructed by:
1. Running MJMCMC for a given number of iterations N_{expl} on S_{t-1};
2. The first d_t \le d members of population S_t are then defined by the filtration operation;
3. The remaining d - d_t members of S_t are obtained by means of the crossover, mutation and reduction operations applied to S_{t-1} and F.
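One population update S_{t-1} -> S_t can be sketched schematically: keep the trees whose estimated inclusion probability passes a threshold (filtration), then refill the population up to size d; the refill here uses a toy crossover only, and all names are illustrative stand-ins for the operations above.

```python
import random

def next_population(pop, incl_prob, d, threshold, rng):
    """Filter the current population by inclusion probability, then refill to size d."""
    kept = [L for L in pop if incl_prob.get(L, 0.0) >= threshold]
    dropped = [L for L in pop if L not in kept]        # would be stored in F for leaves
    while len(kept) < d:                               # refill via toy "and"-crossover
        a, b = rng.sample(pop, 2)
        kept.append(("and", a, b))
    return kept[:d], dropped

rng = random.Random(0)
pop = ["X1", "X2", "X3", "X4"]
incl = {"X1": 0.9, "X2": 0.1, "X3": 0.6, "X4": 0.05}
new_pop, dropped = next_population(pop, incl, d=4, threshold=0.5, rng=rng)
```

In the real pipeline the refill step mixes crossover, mutation and reduction, and an MJMCMC run on the new population supplies the next round of inclusion probabilities.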
14 Filtration operation. S_0 case
\hat{p}(F_1 | D) \le \dots \le \hat{p}(F_{p-d_1} | D) \le \hat{p}(L^1_1 | D) \le \hat{p}(L^1_2 | D) \le \dots \le \hat{p}(L^1_{d_1} | D), \qquad (11)
\hat{p}(L^1_1 | D) \ge p^s_o. \qquad (12)
Figure: Feature filtering. The leaves X_1,\dots,X_p are split by the threshold p^s_o into the filtered features F_1,\dots,F_{p-d_1} and the kept trees L^1_1,\dots,L^1_{d_1}.
15 Filtration operation. S_t, t \in \{1,\dots,t_{max}\} case
\hat{p}(D^{t+1}_1 | D) \le \dots \le \hat{p}(D^{t+1}_{d-d_{t+1}} | D) \le \hat{p}(L^{t+1}_{d_1+1} | D) \le \dots \le \hat{p}(L^{t+1}_{d_{t+1}} | D), \qquad (13)
\hat{p}(L^{t+1}_{d_1+1} | D) \ge p^s_t. \qquad (14)
Figure: Feature filtering. The trees L^t_{d_1+1},\dots,L^t_d are split by the threshold p^s_t into the deleted members D^{t+1}_1,\dots,D^{t+1}_{d-d_{t+1}} and the kept trees L^{t+1}_{d_1+1},\dots,L^{t+1}_{d_{t+1}}.
16 Crossover and mutation operations. Parent selection
The other d - d_{t+1} members are filled with either crossovers, with probability p_c, or mutations, with probability 1 - p_c. Crossovers inbreed parents from population S_t only; mutations also allow parents from F.
Figure: Class probabilities for the selection of parents are proportional to the current marginal inclusion probabilities (candidate parents L^t_1,\dots,L^t_d and filtered features F_1,\dots,F_{p-d_1}).
17 Crossover and mutation operations. Inbreeding of parents
Within each mutation or crossover, \wedge is used for inbreeding with probability p_\wedge, and \vee otherwise. The not operator ^c is applied to the parents with probability p_{not}.
Example: parents L_i and L_j, after choosing the ^c and \vee operators, yield L^c_i \vee L_j.
Figure: Tree engineering step illustration.
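A hedged sketch of the inbreeding step: join two parent trees with "and" with probability p_and (else "or"), and independently negate each parent with probability p_not. Trees are encoded as nested tuples, e.g. ("and", "X1", "X2"); the function name is illustrative.

```python
import random

def inbreed(left, right, p_and, p_not, rng):
    """Combine two parent trees into a child, as in the crossover/mutation step."""
    if rng.random() < p_not:            # negate left parent with probability p_not
        left = ("not", left)
    if rng.random() < p_not:            # negate right parent with probability p_not
        right = ("not", right)
    op = "and" if rng.random() < p_and else "or"
    return (op, left, right)

rng = random.Random(42)
child = inbreed("X1", "X2", p_and=0.7, p_not=0.2, rng=rng)
```

With p_and = 1 and p_not = 0 the operator is deterministic, which makes the step easy to unit-test.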
18 Reduction operator
- Reductions are applied to trees with more than C leaves;
- Each leaf is independently deleted with Bernoulli probability p_d;
- The surviving leaves are joined together with \wedge with probability p_\wedge, and with \vee otherwise.
Example: L_i = X_1 \wedge X_2 \wedge X_3 \wedge X_4 \wedge X_5 is reduced and re-joined to L_i = X_1 \wedge X_4.
Figure: Tree pruning step illustration.
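The reduction operator can be sketched on a flat list of leaves: each leaf of an oversized tree is deleted independently with probability p_d, and the survivors are rejoined with "and" with probability p_and (else "or"). Names and the flat-leaf encoding are illustrative assumptions.

```python
import random

def reduce_tree(leaves, C, p_d, p_and, rng):
    """Prune a tree with more than C leaves; returns (surviving leaves, join operator)."""
    if len(leaves) <= C:
        return leaves[:], None                     # small enough: no reduction applied
    survivors = [x for x in leaves if rng.random() >= p_d]
    if not survivors:                              # keep at least one leaf alive
        survivors = [rng.choice(leaves)]
    op = "and" if rng.random() < p_and else "or"
    return survivors, op

rng = random.Random(3)
survivors, op = reduce_tree(["X1", "X2", "X3", "X4", "X5"], C=3,
                            p_d=0.5, p_and=0.8, rng=rng)
```

Guarding the empty-survivor case is a design choice of this sketch: without it the operator could return a degenerate tree with no leaves.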
19 Genetically modified MJMCMC. Embarrassingly parallel
1. Run B GMJMCMC chains in parallel with different seeds on separate CPUs or clusters;
2. Combine all unique models visited by all B chains into V;
3. Compute model posteriors as in (9);
4. Compute marginal inclusion probabilities as in (10);
5. Compute posteriors of other parameters of interest as
p(\Delta | D) = \sum_{\gamma \in V} p(\Delta | \gamma, D)\, \hat{p}(\gamma | D). \qquad (15)
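The merge step (2)-(3) can be sketched as follows: each chain reports the models it visited with their log(marginal likelihood x prior); merging keeps the unique models and renormalizes over the combined V as in (9). The chain outputs below are made up.

```python
import numpy as np

def merge_chains(chain_results):
    """Combine per-chain {model: log weight} dicts and renormalize over unique models."""
    merged = {}
    for result in chain_results:
        merged.update(result)                # duplicates across chains collapse to one
    models = list(merged)
    lw = np.array([merged[m] for m in models])
    w = np.exp(lw - lw.max())                # log-sum-exp for stability
    return dict(zip(models, w / w.sum()))

chain1 = {frozenset({"L1"}): -100.0, frozenset({"L1", "L2"}): -98.0}
chain2 = {frozenset({"L1"}): -100.0, frozenset({"L3"}): -104.0}
post = merge_chains([chain1, chain2])
```

Because each model's log weight is deterministic given the data, duplicate visits across chains carry the same value and can simply be collapsed, which is what makes the scheme embarrassingly parallel.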
20 Simulation scenarios. Binary responses
For this scenario we generated N = 100 datasets with n = 1000 observations and p = 50 binary covariates. The covariates were assumed to be independent and were simulated as X_j \sim Bernoulli(0.3) for j \in \{1,\dots,50\}.
Binary responses: Bernoulli observations with individual success probability \pi:
- Scenario 1: logit(\pi) = X^c_1 \wedge X_4 + X_8 \wedge X_{11} + X_5 \wedge X_9
- Scenario 2: logit(\pi) = X^c_1 \wedge X_4 + X_8 \wedge X_{11} + X_5 \wedge X_9, with different regression coefficients
- Scenario 3: logit(\pi) = X_2 \wedge X_9 + 9\, X_7 \wedge X_{12} \wedge X_{20} - 9\, X_4 \wedge X_{10} \wedge X_{17} \wedge X_{30}
21 Binary responses. Results over 100 datasets
Table: Power for the individual trees (X^c_1 \wedge X_4, X_5 \wedge X_9 and X_8 \wedge X_{11} in Scenarios 1-2; X_2 \wedge X_9, X_7 \wedge X_{12} \wedge X_{20} and X_4 \wedge X_{10} \wedge X_{17} \wedge X_{30} in Scenario 3), together with overall power, FP, FDR and WL, for FBLR, MCLR and GMJMCMC.
22 Simulation scenarios. Continuous responses
For this scenario we generated N = 100 datasets with n = 1000 observations and p = 50 binary covariates. The covariates were assumed to be independent and were simulated as X_j \sim Bernoulli(0.5) for j \in \{1,\dots,50\}.
Continuous responses: Gaussian observations with error variance \sigma^2 = 1 and individual expectations specified as follows for the different scenarios:
- Scenario 4: E(Y) = X_5 \wedge X_9 + X_8 \wedge X_{11} + X_1 \wedge X_4
- Scenario 5: E(Y) = X + X_2 \wedge X_9 + X_7 \wedge X_{12} \wedge X_{20} + X_4 \wedge X_{10} \wedge X_{17} \wedge X_{30}
- Scenario 6: E(Y) = X + X + X_{18} \wedge X + X_2 \wedge X_9 + 9\, X_{12} \wedge X_{20} \wedge X + X_1 \wedge X_3 \wedge X + X_4 \wedge X_{10} \wedge X_{17} \wedge X + X_{11} \wedge X_{13} \wedge X_{19} \wedge X_{50}
23 Continuous responses. Results over 100 datasets
- Scenario 4 (trees X_5 \wedge X_9, X_8 \wedge X_{11}, X_1 \wedge X_4): overall power 0.99, FP 0.01, WL 0.00.
- Scenario 5: overall power 0.79 (0.88), FP 4.28 (2.05), FDR 0.38 (0.19), WL 0.03.
- Scenario 6: overall power 0.96, FP 0.37, FDR 0.06, WL 0.00.
24 Real data analysis. Arabidopsis thaliana QTL study
Phenotype: hypocotyl length under different light conditions. Expressions with \hat{p}(L | D) > 0.05:

Population  Phenotype       Chr   Marker expression
EstC        Blue Light      4     X
EstC        Blue Light      5     X
EstC        Blue Light      2     X
EstC        Blue Light      4, 2  X \wedge X
EstC        Red Light       2     MSAT
EstC        Red Light       2     PHYB
EstC        Red Light       2, 1  (1 - PHYB) \wedge X
EstC        Far Red Light   4     MSAT
EstC        Far Red Light   4     NGA
EstC        White Light     5     X
EstC        White Light     1     X
25 Real data analysis. Drosophila simulans study
Phenotype: PC1 of the size and shape of the posterior lobe of the male genital arch.

marker     chromosome  marker name     posterior  mBIC
m2         X           w                          x
m4         X           v                          x
m7         2           gl                         x
m9         2           cg
m10        2           gpdh                       x
m14        2           mhc             1          x
m18        2           sli                        x
m22        2           zip                        x
m23        2           lsp                        x
m26        3           dbi             1          x
m29        3           fz              1          x
m32        3           rdg                        x
m33        3           ht
m35        3           ninae                      x
m37        3           mst                        x
m40        3           hb
m41        3           rox                        x
m44        3           jan             1          x
m12, m34   2, 3        glt \wedge ant             x
m11, m35   2, 3        ninae \wedge ninac
26 Real data analysis. Drosophila mauritiana study
Phenotype: PC1 of the size and shape of the posterior lobe of the male genital arch.

marker     chromosome  marker name     posterior  mBIC
m1         X           ewg                        x
m4         X           v                          x
m9         2           cg                         x
m11        2           ninac                      x
m15        2           ddc                        x
m18        2           sli                        x
m22        2           zip                        x
m24        3           ve
m25        3           acr                        x
m26        3           dbi
m28        3           cyc                        x
m29        3           fz
m34        3           ant             1          x
m37        3           mst                        x
m39        3           tub
m40        3           hb                         x
m41        3           rox
m44        3           jan                        x
m1, m2     X, X        w \wedge ewg
m2, m36    X, 3        w \wedge fas               x
m29, m40   3, 3        fz \wedge hb               x
27 Real data analysis. simulans vs. mauritiana prediction
- Short runs of GMJMCMC with 1 thread are addressed;
- A less conservative prior is used, namely p(\gamma) \propto (1 - 1/r)(1/r)^{\log(500)}\,(1 - 1/s)(1/s)^2;
- Prediction is based on probabilities marginalized over all models, namely \hat{Y} = I\{\hat{p}(Y | D) \ge 0.5\}, with \hat{p}(Y | D) = \sum_{\gamma \in V} p(Y | \gamma, D)\, \hat{p}(\gamma | D).
Table: Comparison of performance (min/mean/max of precision, FN rate and FP rate) of different algorithms for Drosophila classification. Algorithms compared: RIDGE, GMJMCMC, NAIVEBAYES, lXGBoost, MJMCMC, LASSO, LR, DEEPNETS, tXGBoost, RFOREST, KMEANS.
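The model-averaged classifier above can be sketched directly: average the per-model predictive probabilities p(Y | gamma, D) with the model posteriors as weights, then threshold at 0.5. The per-model probabilities and weights below are hypothetical.

```python
import numpy as np

def averaged_prediction(per_model_probs, model_posteriors):
    """Y_hat = I{p_hat(Y|D) >= 0.5}, with p_hat averaged over the explored models."""
    p = np.average(per_model_probs, axis=0, weights=model_posteriors)
    return (p >= 0.5).astype(int), p

per_model = np.array([[0.9, 0.2, 0.6],     # p(Y_i = 1 | gamma_1, D) for 3 test points
                      [0.8, 0.4, 0.3]])    # p(Y_i = 1 | gamma_2, D)
y_hat, p_avg = averaged_prediction(per_model, model_posteriors=[0.7, 0.3])
```

Averaging before thresholding is what makes this Bayesian model averaging rather than a majority vote over the individual models' hard predictions.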
28 Real data analysis. simulans vs. mauritiana study
But no nonlinearities were found in the inferential study either.
Table: Drosophila data, whether simulans or mauritiana; expressions with \hat{p}(L | D) > 0.5:

Population  Phenotype  Chr  Marker expression
Both        S or M     X    run
Both        S or M     3    ninae
Both        S or M     2    ddc
29 Real data analysis. Simulated response prediction
Simulate responses as:
Y = I\{\mathrm{logit}^{-1}(0.4 - 9\, v \wedge eve \wedge eip \wedge gpdh + 9\, gl \wedge egfr \wedge glt - 5\, cg \wedge w) \ge 0.5\}
Now perform predictions:
Table: Comparison of performance (min/mean/max of precision, FN rate and FP rate) of different algorithms for Drosophila classification. Algorithms compared: tXGBoost, GMJMCMC, MJMCMC, lXGBoost, DEEPNETS, RFOREST, LASSO, RIDGE, NAIVEBAYES, LR, KMEANS.
30 Further (partly current) research
Generalization: allow general feature engineering instead of only logical trees.
Example: parents L_i and L_j, after choosing the + and g operators, yield g_l(L_i + L_j).
Figure: General feature engineering step illustration.
31 Further (partly current) research
Generalization will allow (allows) inference like the following:
Table: TP, FP, power and FDR under Kepler's 3rd law model (\hat{p}(L | D) \ge 0.5), for runs with 32, 16, 4 and 1 threads. Detected features include (HostStarMassSlrMass \times PeriodDays^2), (HostStarRadiusSlrRad \times PeriodDays^2) and (HostStarTempK \times PeriodDays^2), plus other detections; per-configuration totals, FDR, FP and power are reported for each thread count.
Here the observations a are the semi-major axes of the planetary orbits. Explanatory variables include TypeFlag, RadiusJpt, PeriodDays, HostStarMassSlrMass, Eccentricity, PlanetaryMassJpt, HostStarRadiusSlrRad, HostStarMetallicity, and PlanetaryDensJpt.
32 Concluding remarks
We introduced the GMJMCMC algorithm for Bayesian logic regression models, capable of estimating posterior model probabilities and carrying out Bayesian model averaging and selection.
The EMJMCMC R-package is available, with:
- flexibility in the choice of methods for marginal likelihoods and model selection criteria
- extensive parallel computing
- vectorized predictions with NA handling
Results showed that GMJMCMC:
- performs well in terms of search speed and quality
- addresses a more general class of models than its competitors
- provides good predictive and inferential performance in the applications
33 References
A. Hubin, G.O. Storvik, F. Frommlet. A novel algorithmic approach to Bayesian Logic Regression. arXiv preprint, v1.
A. Hubin and G.O. Storvik. Efficient mode jumping MCMC for Bayesian variable selection in GLMM. arXiv:1604.06398v3.
C. Kooperberg and I. Ruczinski. Identifying Interacting SNPs Using Monte Carlo Logic Regression. Genetic Epidemiology, 28.
A. Fritsch. A Full Bayesian Version of Logic Regression for SNP Data. PhD thesis.
34 The End. Thank you.
More informationBayesian Inference and MCMC
Bayesian Inference and MCMC Aryan Arbabi Partly based on MCMC slides from CSC412 Fall 2018 1 / 18 Bayesian Inference - Motivation Consider we have a data set D = {x 1,..., x n }. E.g each x i can be the
More informationSTA 4273H: Sta-s-cal Machine Learning
STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our
More informationBayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016
Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several
More informationComputational Genomics. Systems biology. Putting it together: Data integration using graphical models
02-710 Computational Genomics Systems biology Putting it together: Data integration using graphical models High throughput data So far in this class we discussed several different types of high throughput
More informationSTA414/2104. Lecture 11: Gaussian Processes. Department of Statistics
STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations
More informationCoupled Hidden Markov Models: Computational Challenges
.. Coupled Hidden Markov Models: Computational Challenges Louis J. M. Aslett and Chris C. Holmes i-like Research Group University of Oxford Warwick Algorithms Seminar 7 th March 2014 ... Hidden Markov
More informationNonparmeteric Bayes & Gaussian Processes. Baback Moghaddam Machine Learning Group
Nonparmeteric Bayes & Gaussian Processes Baback Moghaddam baback@jpl.nasa.gov Machine Learning Group Outline Bayesian Inference Hierarchical Models Model Selection Parametric vs. Nonparametric Gaussian
More informationPenalized Loss functions for Bayesian Model Choice
Penalized Loss functions for Bayesian Model Choice Martyn International Agency for Research on Cancer Lyon, France 13 November 2009 The pure approach For a Bayesian purist, all uncertainty is represented
More informationBayesian Inference. Chapter 4: Regression and Hierarchical Models
Bayesian Inference Chapter 4: Regression and Hierarchical Models Conchi Ausín and Mike Wiper Department of Statistics Universidad Carlos III de Madrid Master in Business Administration and Quantitative
More informationSTA216: Generalized Linear Models. Lecture 1. Review and Introduction
STA216: Generalized Linear Models Lecture 1. Review and Introduction Let y 1,..., y n denote n independent observations on a response Treat y i as a realization of a random variable Y i In the general
More informationDefault Priors and Effcient Posterior Computation in Bayesian
Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature
More informationBayesian GLMs and Metropolis-Hastings Algorithm
Bayesian GLMs and Metropolis-Hastings Algorithm We have seen that with conjugate or semi-conjugate prior distributions the Gibbs sampler can be used to sample from the posterior distribution. In situations,
More informationGeneralized Linear Models
Generalized Linear Models Advanced Methods for Data Analysis (36-402/36-608 Spring 2014 1 Generalized linear models 1.1 Introduction: two regressions So far we ve seen two canonical settings for regression.
More informationA Review of Pseudo-Marginal Markov Chain Monte Carlo
A Review of Pseudo-Marginal Markov Chain Monte Carlo Discussed by: Yizhe Zhang October 21, 2016 Outline 1 Overview 2 Paper review 3 experiment 4 conclusion Motivation & overview Notation: θ denotes the
More informationRiemann Manifold Methods in Bayesian Statistics
Ricardo Ehlers ehlers@icmc.usp.br Applied Maths and Stats University of São Paulo, Brazil Working Group in Statistical Learning University College Dublin September 2015 Bayesian inference is based on Bayes
More informationBayesian Inference of Multiple Gaussian Graphical Models
Bayesian Inference of Multiple Gaussian Graphical Models Christine Peterson,, Francesco Stingo, and Marina Vannucci February 18, 2014 Abstract In this paper, we propose a Bayesian approach to inference
More informationBayesian model selection: methodology, computation and applications
Bayesian model selection: methodology, computation and applications David Nott Department of Statistics and Applied Probability National University of Singapore Statistical Genomics Summer School Program
More informationDesign of Text Mining Experiments. Matt Taddy, University of Chicago Booth School of Business faculty.chicagobooth.edu/matt.
Design of Text Mining Experiments Matt Taddy, University of Chicago Booth School of Business faculty.chicagobooth.edu/matt.taddy/research Active Learning: a flavor of design of experiments Optimal : consider
More informationOnline appendix to On the stability of the excess sensitivity of aggregate consumption growth in the US
Online appendix to On the stability of the excess sensitivity of aggregate consumption growth in the US Gerdie Everaert 1, Lorenzo Pozzi 2, and Ruben Schoonackers 3 1 Ghent University & SHERPPA 2 Erasmus
More informationLinear Dynamical Systems
Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations
More informationLecture 2: From Linear Regression to Kalman Filter and Beyond
Lecture 2: From Linear Regression to Kalman Filter and Beyond Department of Biomedical Engineering and Computational Science Aalto University January 26, 2012 Contents 1 Batch and Recursive Estimation
More informationComputational statistics
Computational statistics Combinatorial optimization Thierry Denœux February 2017 Thierry Denœux Computational statistics February 2017 1 / 37 Combinatorial optimization Assume we seek the maximum of f
More informationEstimation of Parameters in Random. Effect Models with Incidence Matrix. Uncertainty
Estimation of Parameters in Random Effect Models with Incidence Matrix Uncertainty Xia Shen 1,2 and Lars Rönnegård 2,3 1 The Linnaeus Centre for Bioinformatics, Uppsala University, Uppsala, Sweden; 2 School
More informationIntroduction to Gaussian Processes
Introduction to Gaussian Processes Iain Murray murray@cs.toronto.edu CSC255, Introduction to Machine Learning, Fall 28 Dept. Computer Science, University of Toronto The problem Learn scalar function of
More informationSTA 414/2104, Spring 2014, Practice Problem Set #1
STA 44/4, Spring 4, Practice Problem Set # Note: these problems are not for credit, and not to be handed in Question : Consider a classification problem in which there are two real-valued inputs, and,
More informationOr How to select variables Using Bayesian LASSO
Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO On Bayesian Variable Selection
More informationIntroduction: MLE, MAP, Bayesian reasoning (28/8/13)
STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this
More informationSampling bias in logistic models
Sampling bias in logistic models Department of Statistics University of Chicago University of Wisconsin Oct 24, 2007 www.stat.uchicago.edu/~pmcc/reports/bias.pdf Outline Conventional regression models
More informationBayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence
Bayesian Inference in GLMs Frequentists typically base inferences on MLEs, asymptotic confidence limits, and log-likelihood ratio tests Bayesians base inferences on the posterior distribution of the unknowns
More informationKneib, Fahrmeir: Supplement to "Structured additive regression for categorical space-time data: A mixed model approach"
Kneib, Fahrmeir: Supplement to "Structured additive regression for categorical space-time data: A mixed model approach" Sonderforschungsbereich 386, Paper 43 (25) Online unter: http://epub.ub.uni-muenchen.de/
More information10708 Graphical Models: Homework 2
10708 Graphical Models: Homework 2 Due Monday, March 18, beginning of class Feburary 27, 2013 Instructions: There are five questions (one for extra credit) on this assignment. There is a problem involves
More informationA Brief and Friendly Introduction to Mixed-Effects Models in Linguistics
A Brief and Friendly Introduction to Mixed-Effects Models in Linguistics Cluster-specific parameters ( random effects ) Σb Parameters governing inter-cluster variability b1 b2 bm x11 x1n1 x21 x2n2 xm1
More informationDisease mapping with Gaussian processes
EUROHEIS2 Kuopio, Finland 17-18 August 2010 Aki Vehtari (former Helsinki University of Technology) Department of Biomedical Engineering and Computational Science (BECS) Acknowledgments Researchers - Jarno
More informationNonparametric Bayesian Methods (Gaussian Processes)
[70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent
More informationFitting Multidimensional Latent Variable Models using an Efficient Laplace Approximation
Fitting Multidimensional Latent Variable Models using an Efficient Laplace Approximation Dimitris Rizopoulos Department of Biostatistics, Erasmus University Medical Center, the Netherlands d.rizopoulos@erasmusmc.nl
More informationModeling Real Estate Data using Quantile Regression
Modeling Real Estate Data using Semiparametric Quantile Regression Department of Statistics University of Innsbruck September 9th, 2011 Overview 1 Application: 2 3 4 Hedonic regression data for house prices
More informationBayesian Image Segmentation Using MRF s Combined with Hierarchical Prior Models
Bayesian Image Segmentation Using MRF s Combined with Hierarchical Prior Models Kohta Aoki 1 and Hiroshi Nagahashi 2 1 Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology
More informationarxiv:astro-ph/ v1 14 Sep 2005
For publication in Bayesian Inference and Maximum Entropy Methods, San Jose 25, K. H. Knuth, A. E. Abbas, R. D. Morris, J. P. Castle (eds.), AIP Conference Proceeding A Bayesian Analysis of Extrasolar
More informationLinear Regression. Aarti Singh. Machine Learning / Sept 27, 2010
Linear Regression Aarti Singh Machine Learning 10-701/15-781 Sept 27, 2010 Discrete to Continuous Labels Classification Sports Science News Anemic cell Healthy cell Regression X = Document Y = Topic X
More information