Aspects of Feature Selection in Mass Spec Proteomic Functional Data


Phil Brown, University of Kent
Newton Institute, 11th December 2006

Based on work with Jeff Morris and others at MD Anderson, Houston, including Keith Baggerly and Kevin Coombes, and with Jim Griffin (Warwick).

Plan of Talk

- Background to proteomics data: aims
- Peak detection versus functional modelling
- Use of wavelets and the linear (mixed) model
- Slab-and-spike shrinkage and selection
- Forms of MCMC posterior inference
- Scale mixtures of normals as priors
- Algorithms: MCMC/MAP

Issues

- Responses v. factors: What is random? What is fixed? Direction of causation?
- Aims: prediction/discrimination or explanation?

Mice SELDI-TOF Cancer Data (MD Anderson)

Spectra of serum extracted from mice and processed by SELDI time-of-flight mass spectrometry. Each biological sample is characterised by the experimental conditions:

- one of 2 cancer cell lines, A375P or PC3MM2
- in one of 2 organs, either brain or lung
- 2 spectra per mouse, one at a low and one at a high intensity laser setting

Y spectra and X factors: 16 mice, 32 spectra.

It is natural to model the spectra as responses to the experimental factors. We use the part of the spectrum discretised to points between 2000 and 14000 Daltons, treating the response as a curve or, equivalently, as a 7985-dimensional discretised vector.


Strategies for Analysis

- Preprocess to remove baseline differences, to normalise and align (see for example Bioconductor packages); interpolate to the grid (from 2000 Daltons) with equal spacing on the time scale.
- Afterwards, peaks are commonly found by some form of signal-to-noise evaluation.
- The mainstream approach would reverse the direction of causation, using logistic or other regression on peak intensities.
- Our approach is rather to model the spectra themselves using wavelets; this leaves open the possibility of later identification of peaks, and of Bayes' theorem for discrimination.

Functional Mixed Models

For the $i$th sample spectrum $Y_i(t)$, $i = 1, \ldots, n$, thought of as a continuous function of time $t$,

$$Y_i(t) = \sum_{l=1}^{p} X_{il} B_l(t) + \sum_{k=1}^{m} Z_{ik} U_k(t) + E_i(t)$$

or, in reality, as a discretised version in terms of the $n$ observations on a grid of $T$ time points: the $n \times T$ matrix $Y$, with

$$Y = XB + ZU + E$$

Rows of $U$ are iid $\mathrm{MVN}_T(0, Q)$; rows of $E$ are iid $\mathrm{MVN}_T(0, S)$.

The $T \times T$ matrices $Q$, $S$ are to be specified. Orthogonally transform to wavelets by applying the Discrete Wavelet Transform (DWT) to each sample (rows of $Y$):

$$YW = XBW + ZUW + EW$$

where $W$ is an orthogonal matrix. The model becomes

$$Y^* = XB^* + ZU^* + E^*$$

Here also $Q^* = W'QW$ and $S^* = W'SW$, which we take to be diagonal.
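
A minimal sketch of this row-wise transform, assuming the PyWavelets package; the wavelet family ('db4') and grid size are illustrative choices, not those of the talk:

```python
# Transform each spectrum (row of Y) into the wavelet domain, where
# Q* and S* are modelled as diagonal, and map back with the inverse DWT.
import numpy as np
import pywt

n, T = 32, 2048                      # 32 spectra on a grid of T points
Y = np.random.randn(n, T)            # stand-in for the observed spectra

def dwt_rows(Y, wavelet="db4"):
    """Apply the DWT to each row, returning coefficients and bookkeeping."""
    coeffs = [pywt.wavedec(row, wavelet) for row in Y]
    flat = np.array([np.concatenate(c) for c in coeffs])
    lengths = [len(c) for c in coeffs[0]]   # per-scale lengths, for the inverse
    return flat, lengths

def idwt_rows(flat, lengths, wavelet="db4"):
    """Invert the transform row by row (posterior draws map back this way)."""
    out = []
    for row in flat:
        parts, start = [], 0
        for s in lengths:
            parts.append(row[start:start + s])
            start += s
        out.append(pywt.waverec(parts, wavelet))
    return np.array(out)

Ystar, lengths = dwt_rows(Y)         # model Y* = X B* + Z U* + E* here
Yback = idwt_rows(Ystar, lengths)
assert np.allclose(Y, Yback[:, :T])  # exact round trip
```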


Prior for the Transformed Fixed Effects

We employ a slab-and-spike prior here to shrink nonlinearly towards zero: for the $l$th fixed effect, $l = 1, \ldots, p$, at scale $j$ and location $k$,

$$B^*_{ljk} \sim \gamma_{ljk}\, \mathrm{Normal}(0, \tau_{ljk}) + (1 - \gamma_{ljk})\, \delta_0$$

Here $\gamma_{ljk} \sim \mathrm{Bernoulli}(\pi_{lj})$; that is, for the $l$th fixed effect the probability of acceptance is allowed to depend on the scale but not on the locations within the scale.
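
A minimal sketch of draws from this prior, with illustrative values of $\pi$ and $\tau$:

```python
# With probability pi_lj the coefficient is a Normal(0, tau) "slab" draw,
# otherwise it is exactly zero (the point mass delta_0).
import numpy as np

rng = np.random.default_rng(0)

def spike_slab_draws(size, pi=0.1, tau=4.0):
    gamma = rng.random(size) < pi            # gamma_ljk ~ Bernoulli(pi_lj)
    slab = rng.normal(0.0, np.sqrt(tau), size)
    return np.where(gamma, slab, 0.0)        # zero = the spike delta_0

draws = spike_slab_draws(10_000)
print((draws == 0).mean())                   # about 1 - pi are exactly zero
```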

The slab-and-spike prior may be thought of as a particular extreme case of a scale mixture of normals: a mixture on the variance of a normal,

$$\pi(\beta_l) = \int N(\beta_l \mid 0, \psi_l)\, G(d\psi_l) \qquad (1)$$

where $G(\cdot)$ weights the variances $\psi$, with high probability of a small variance. More later.

Analysis is at the wavelet-transformed level, using MCMC with empirical Bayes plug-ins for the hyperparameters of shrinkage, then transforming back with the inverse DWT to the original parameters for posterior inference.


General Scale Mixtures of Normals

The slab-and-spike prior is an extreme example of a scale mixture of normals. Suppose we wish to replace the slab and spike by a scale mixture of normals that concentrates mass around zero but also has fat tails, so that any really sizeable coefficient is not shrunk too much.

We hope to use such priors to speed up searches for a good model by concentrating on modes rather than full MCMC. We look for a prior whose negative log provides a suitable penalty to add to the negative log-likelihoods of a variety of models, suitable for example for logistic discrimination (e.g. modelling disease given spectrum).

Generalised Linear Model for Many Variables

Strategy: minimise

$$-\mathrm{loglik}(\beta) - \log[\pi(\beta)]$$

For example, this could be the logistic likelihood using wavelets as explanatory variables. Continuous mixture on the variance of a normal:

$$\pi(\beta_i) = \int N(\beta_i \mid 0, \psi_i)\, G(d\psi_i) \qquad (2)$$

where $G(\cdot)$ weights the variances $\psi$, with high probability of a small variance.
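
A minimal sketch of this strategy in Python: the negative logistic log-likelihood plus a penalty $-\log \pi(\beta)$, with the double-exponential (lasso-type) penalty standing in for any of the scale mixtures discussed below. The data, hyperparameter $\gamma$ and optimiser are illustrative assumptions, not from the talk:

```python
# Minimise -loglik(beta) - log pi(beta) for a logistic likelihood.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, k = 60, 20
X = rng.standard_normal((n, k))
beta_true = np.zeros(k); beta_true[:3] = 2.0
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

def objective(beta, gamma=0.5):
    eta = X @ beta
    negloglik = np.sum(np.logaddexp(0.0, eta) - y * eta)  # logistic -loglik
    penalty = np.sum(np.abs(beta)) / gamma                # -log DE prior
    return negloglik + penalty

fit = minimize(objective, np.zeros(k), method="Powell")
print(np.round(fit.x, 2))   # small coefficients are shrunk towards zero
```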

Normal Variance Mixtures

The mean-zero double exponential distribution, $\mathrm{DE}(0, 1/\gamma)$, with probability density function

$$\frac{1}{2\gamma} \exp\{-|\beta|/\gamma\}, \quad -\infty < \beta < \infty, \; 0 < \gamma < \infty,$$

is defined by an exponential mixing distribution, $\mathrm{Ex}\!\left(\frac{1}{2\gamma^2}\right)$, with probability density function

$$g(\psi_i) = \frac{1}{2\gamma^2} \exp\left\{-\psi_i/[2\gamma^2]\right\}. \qquad (3)$$
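
A quick simulation check of this representation, assuming only NumPy: mixing $N(0, \psi)$ over $\psi \sim \mathrm{Ex}(1/(2\gamma^2))$ should reproduce the $\mathrm{DE}(0, 1/\gamma)$ moments:

```python
# Draw psi from the exponential with density (1/(2 gamma^2)) exp(-psi/(2 gamma^2)),
# then beta | psi ~ N(0, psi); compare moments with the double exponential.
import numpy as np

rng = np.random.default_rng(2)
gamma = 0.7
psi = rng.exponential(scale=2 * gamma**2, size=200_000)  # mean 2*gamma^2
beta = rng.normal(0.0, np.sqrt(psi))                     # N(0, psi) given psi

# Double exponential DE(0, 1/gamma): Var = 2*gamma^2, E|beta| = gamma
print(beta.var(), 2 * gamma**2)
print(np.abs(beta).mean(), gamma)
```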

The normal-Jeffreys (NJ) prior distribution arises from the improper hyperprior (Kiiveri; Figueiredo)

$$g(\psi_i) \propto \frac{1}{\psi_i}, \qquad (4)$$

which in turn induces an improper prior for $\beta_i$ of the form $\pi(\beta_i) \propto 1/|\beta_i|$. This prior has an infinite spike at $\beta_i = 0$, a feature also of the penalised likelihood or posterior, which is consequently unnormalisable.

Normal Exponential Gamma (NEG)

Mixing on the scale of the exponential distribution: a gamma mixture gives

$$g(\psi_i) = \frac{\lambda}{\gamma^2}\left(1 + \frac{\psi_i}{\gamma^2}\right)^{-(\lambda+1)}, \quad 0 < \lambda, \gamma < \infty. \qquad (5)$$

The density of the marginal distribution of $\beta_i$ can be expressed as

$$\pi(\beta_i) = \frac{\lambda\, 2^\lambda}{\sqrt{\pi}}\, \frac{\Gamma(\lambda + 1/2)}{\gamma}\, \exp\left\{\frac{\beta_i^2}{4\gamma^2}\right\} D_{-2(\lambda+1/2)}\!\left(\frac{|\beta_i|}{\gamma}\right) \qquad (6)$$

where $D(\cdot)$ is the parabolic cylinder function. The parameter $\lambda$ controls the heaviness of the tails: for large $|\beta_i|/\gamma$,

$$\pi(\beta_i) \approx c \left(\frac{|\beta_i|}{\gamma}\right)^{-(2\lambda+1)}.$$
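
The density (6) can be evaluated directly with SciPy's parabolic cylinder function. A sketch, with the normalising constant taken from the reconstructed formula above (the numerical integral serves as a check on that reconstruction):

```python
# Evaluate the NEG marginal density via D_v(x) = scipy.special.pbdv.
import numpy as np
from scipy.special import pbdv, gamma as gamma_fn
from scipy.integrate import quad

def neg_density(beta, lam, gam):
    const = lam * 2**lam * gamma_fn(lam + 0.5) / (np.sqrt(np.pi) * gam)
    d, _ = pbdv(-2 * (lam + 0.5), abs(beta) / gam)   # D_v(x) and derivative
    return const * np.exp(beta**2 / (4 * gam**2)) * d

# Finite limits avoid overflow; the polynomial tails beyond +/-50 are tiny.
total, _ = quad(neg_density, -50, 50, args=(1.0, 1.0))
print(total)   # ~1 if the reconstructed normalising constant is right
```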

The case $\lambda = 0.5$ corresponds to the quasi-Cauchy slab of Johnstone and Silverman (Ann. Statist., 2005), who focus on the median of the posterior distribution. This same prior of J&S coincides with a univariate special case of the robustness prior of Berger (1985, pp. 236-240). Jeffreys (1939; 3rd ed. 1961) also has desiderata in hypothesis testing which lead to a prior with Cauchy tails.

Shrinkage of β

Define the penalty function $p(\beta) = -\log[\pi(\beta)]$. The penalised MLE $\tilde{\beta}$ and the MLE $\hat{\beta}$ satisfy

$$\frac{\hat{\beta} - \tilde{\beta}}{\sigma^2 / \sum_{i=1}^n x_i^2} = \mathrm{sign}(\tilde{\beta})\, p'(|\tilde{\beta}|),$$

which shows that the amount of shrinkage is directly controlled by the derivative of the penalty function.
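
A numerical check of this identity; a sketch assuming a single predictor, Gaussian errors, and the double-exponential penalty $p(\beta) = |\beta|/\gamma$ (so $p' = 1/\gamma$), with simulated placeholder data:

```python
# Compare (beta_hat - beta_tilde) / (sigma^2 / sum x_i^2) with sign * p'.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
n, sigma, gamma = 50, 1.0, 0.5
x = rng.standard_normal(n)
y = 1.5 * x + rng.normal(0.0, sigma, n)

beta_mle = (x @ y) / (x @ x)                       # ordinary least squares
pen = minimize_scalar(
    lambda b: ((y - b * x) ** 2).sum() / (2 * sigma**2) + abs(b) / gamma
)
beta_pen = pen.x                                   # penalised MLE

lhs = (beta_mle - beta_pen) / (sigma**2 / (x @ x))
rhs = np.sign(beta_pen) * (1.0 / gamma)            # sign(b~) * p'(|b~|)
print(lhs, rhs)                                    # agree when beta_pen != 0
```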

- Jeffreys prior: no shrinkage for large β (unbiased)
- Oracle property of Fan and Li (2001, JASA)

Parameter Updates in n Rather than k Dimensions

Use the singular value decomposition of $X$ ($n \times k$): $X = UDV^T$. If $\gamma = V^T\beta$ then we may update the distribution of this, simulating with MCMC in $n \ll k$ dimensions, and then translate back into $\beta$ via the full $k$-dimensional augmented pair $(\gamma, \gamma^\perp)$, with the $(k - n) \times 1$ vector $\gamma^\perp$ being independent of the data: its distribution comes from the prior, and the conditional distributions of $\gamma^\perp \mid \gamma$ are easy to calculate.

- Avoids inversion of (very large) $k \times k$ matrices
- $\min(n, k)$ non-zero solutions
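
A sketch of the reparametrisation, assuming nothing beyond NumPy; the "update" of $\gamma$ is a stand-in for an MCMC move:

```python
# With X = U D V^T and gamma = V^T beta, the likelihood depends on beta
# only through the n coordinates gamma, so updates can run in n << k dims.
import numpy as np

rng = np.random.default_rng(4)
n, k = 30, 1000
X = rng.standard_normal((n, k))
U, d, Vt = np.linalg.svd(X, full_matrices=False)    # Vt is n x k

beta = rng.standard_normal(k)
gamma = Vt @ beta                                   # n-dim working parameter
assert np.allclose(X @ beta, U @ (d * gamma))       # fit depends only on gamma

# After updating gamma, map back: keep the data-independent component of
# beta (orthogonal to the rows of Vt), swap in the new gamma.
beta_perp = beta - Vt.T @ gamma                     # prior-driven component
gamma_new = gamma + 0.1 * rng.standard_normal(n)    # stand-in for an update
beta_new = Vt.T @ gamma_new + beta_perp
assert np.allclose(X @ beta_new, U @ (d * gamma_new))
```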

Perfect Starts

For $n < k$ there will be many local boundary solutions. We explore the multiplicity of these solutions by generating alternative starting values, all of which fit the data perfectly. The minimum-length least squares (MLLS) fit (the ridge limit for small ridge constant) to the data for $n < k$ is

$$\hat{\beta}_{MLLS} = (X^TX)^+X^Ty$$

where $+$ denotes the Moore-Penrose generalised inverse. Using the singular value decomposition, $X = U\,\mathrm{Diag}(d_1, d_2, \ldots, d_n)\,V^T$, the orthogonal projection matrix onto the null space of $X$ is

$$I - P = I_k - VV^T$$

of rank $k - n$.

Consider generating a random $k$-vector $z$ and taking $w = (I - VV^T)z$; this may be added to $\hat{\beta}_{MLLS}$ to get another perfectly fitting starting point. This may induce a different posterior estimate, which will be close to perfectly fitting but with another set of $\beta$'s set to zero. We can thus explore the modes of the posterior, seeing which $\beta$ components come up regularly.
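
A minimal sketch of the construction, assuming NumPy; dimensions are placeholders:

```python
# The MLLS fit plus any null-space vector of X fits the data exactly,
# giving alternative starting points for exploring posterior modes.
import numpy as np

rng = np.random.default_rng(5)
n, k = 15, 200
X = rng.standard_normal((n, k))
y = rng.standard_normal(n)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
beta_mlls = np.linalg.pinv(X) @ y          # Moore-Penrose: (X^T X)^+ X^T y

z = rng.standard_normal(k)
w = z - Vt.T @ (Vt @ z)                    # (I - V V^T) z: null-space part
start = beta_mlls + w                      # another perfectly fitting point

assert np.allclose(X @ beta_mlls, y)       # both starts reproduce y exactly
assert np.allclose(X @ start, y)
```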

Algorithms

It is easy to show that for hierarchical priors

$$\frac{\partial p}{\partial \beta} = E\left\{-\frac{\partial \log f(\beta \mid \psi)}{\partial \beta} \,\Big|\, \beta\right\}$$

and for a Gaussian mixture prior

$$E_{\psi \mid \beta}\{\psi^{-1}\} = \frac{1}{\beta}\,\frac{\partial p}{\partial \beta},$$

giving a direct link between Newton-Raphson algorithms and EM, and allowing a direct extension to exponential family likelihoods.
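
A sketch of the resulting EM iteration for a Gaussian likelihood: the E-step sets weights $E\{\psi^{-1} \mid \beta\} = p'(|\beta|)/|\beta|$, here $1/(\gamma|\beta|)$ for the double-exponential penalty (chosen for concreteness), and the M-step is a weighted ridge solve. All data and settings are illustrative:

```python
# EM for a normal scale-mixture penalty: E-step computes E[1/psi | beta],
# M-step solves the resulting ridge system (X'X + diag(w)) beta = X'y.
import numpy as np

rng = np.random.default_rng(6)
n, k, gamma = 50, 10, 0.2
X = rng.standard_normal((n, k))
beta_true = np.zeros(k); beta_true[:2] = 3.0
y = X @ beta_true + rng.standard_normal(n)

beta = np.ones(k)
for _ in range(200):
    w = (1.0 / gamma) / np.maximum(np.abs(beta), 1e-10)    # E[1/psi | beta]
    beta = np.linalg.solve(X.T @ X + np.diag(w), X.T @ y)  # ridge M-step
print(np.round(beta, 2))   # near-zero entries for the irrelevant variables
```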

- Kiiveri (2003) uses EM with a ridge line search in the M-step in his GeneRave algorithm.
- Fan and Li (2001) use a Newton-Raphson-style algorithm for their SCAD penalty and general likelihoods.
- LARS (Efron et al., Ann. Statist., 2004) and approximate GLM variants (Park and Hastie, 2006).
- Code that cycles through one-dimensional searches for the logistic likelihood (Genkin et al., 2005; C. Hoggart, 2006).
- Hyperparameters (shape and scale) chosen by cross-validation with such fast algorithms.

Simulations

NEG with shape $\lambda$; scale set by the mean $\mu$ of the underlying gamma distribution. $n = 30$ multiple regression with $k = 1000$ points; 5 random splits into two sets of 15 for training and testing.

$$y_i = f(x_i) + \epsilon_i, \quad i = 1, \ldots, 15,$$

where a spline is fitted,

$$f(x) = \sum_{j=1}^{1000} \beta_j (x - k_j)_+, \quad k_j = -0.3 + 1.6\,\frac{j-1}{999};$$

$\epsilon_i \sim N(0, \sigma^2)$ with $\sigma = 0.1$, and $x_i$ iid Uniform$(0, 1)$.
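
A sketch of this simulation design, under the reconstruction above; the knot formula and the sine target (suggested by the "fit to sine curve" figure below) are assumptions:

```python
# Generate the truncated-linear spline design: n = 30 observations,
# k = 1000 basis functions (x - k_j)_+ with knots spanning (-0.3, 1.3).
import numpy as np

rng = np.random.default_rng(7)
n, k, sigma = 30, 1000, 0.1
x = rng.uniform(0.0, 1.0, n)
knots = -0.3 + 1.6 * np.arange(k) / (k - 1)          # k_j = -0.3 + 1.6(j-1)/999

B = np.maximum(x[:, None] - knots[None, :], 0.0)     # (x - k_j)_+ basis
f = np.sin(2 * np.pi * x)                            # assumed smooth target
y = f + rng.normal(0.0, sigma, n)

train, test = np.arange(15), np.arange(15, 30)       # one split of 15/15
print(B[train].shape, y[train].shape)                # design for the NEG fit
```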

[Figure 2: fit to sine curve]

Multiple regression simulation: $\sigma^2 = 1$, $X'X$ in correlation form, AR(1) structure with lag-1 correlation $\rho$; $n = 100$ observations, $k = 500$ variables, 10 nonzero coefficients of $\beta$. Fitted by 5-fold cross-validation and tested on 10 datasets of 100 observations.

                      β = 1              β = 5
                  ρ = 0.5  ρ = 0.8   ρ = 0.5  ρ = 0.8
Ridge               5.44     3.85     100.6    61.36
Lasso               1.74     1.42      1.71     1.72
NJeffreys           1.32     1.17      1.23     1.30
NEG (λ = 0.1)       1.15     1.20      1.18     1.24
NEG                 1.15     1.15      1.16     1.23

Table 1: MSE prediction results (oracle 1.00)

MAIN REFERENCES

Griffin, J.E. and Brown, P.J. (2005). Technical report, University of Kent/Warwick.
Morris, J.S., Brown, P.J., Herrick, R.C., Baggerly, K.A. and Coombes, K.R. (2006). Submitted to Biometrics; BePress preprint.
Morris, J.S. and Carroll, R.J. (2005). JRSS B.
Morris, J.S., Brown, P.J., Baggerly, K.A. and Coombes, K.R. (2006). In Bayesian Inference for Gene Expression and Proteomics, eds K.-A. Do, P. Mueller and M. Vannucci, 269-292.
Fan, J. and Li, R. (2001). JASA, 1348-1360.
Kiiveri, H. (2003). IMS Monograph: Festschrift for T. Speed, 127-143.
Park, M.Y. and Hastie, T. (2006). Web technical report, Stanford.
Zou, H. (2006). JASA, to appear.