Aspects of Feature Selection in Mass Spec Proteomic Functional Data
Phil Brown, University of Kent
Newton Institute, 11th December 2006
Based on work with Jeff Morris and others at MD Anderson, Houston, including Keith Baggerly and Kevin Coombes, and with Jim Griffin (Warwick)
Typeset by FoilTEX 1
Plan of Talk
- Background to proteomics data: aims
- Peak detection versus functional modelling
- Use of wavelets and a linear (mixed) model
- Slab-and-spike shrinkage and selection
- Forms of MCMC posterior inference
- Scale mixtures of normals as priors
- Algorithms: MCMC/MAP
Issues
- Responses v factors: what is random? What is fixed? Direction of causation?
- Aims: prediction/discrimination or explanation?
Mice SELDI-TOF Cancer Data (MD Anderson)
Spectra of serum extracted from mice and processed by SELDI time-of-flight mass spectrometry. Each biological sample is characterised by experimental conditions:
- one of 2 cancer cell lines, A375P or PC3MM2
- in one of 2 organs, either brain or lung
- 2 spectra per mouse, one at a low and one at a high laser intensity setting
Y spectra and X factors: 16 mice, 32 spectra
It is natural to model the spectra as responses to the experimental factors. We use the part of the spectrum discretised to points between 2000 and 14000 Daltons, treating the response as a curve or, in discretised form, as a 7985-dimensional vector.
Strategies for Analysis
- Preprocess to remove baseline differences, to normalise and to align (see for example Bioconductor packages); interpolate to a grid of 2000 points, equally spaced on the time scale.
- Afterwards, peaks are commonly found by some form of signal-to-noise evaluation.
- The mainstream approach would reverse the direction of causation, using logistic or other regression on peak intensities.
- Our approach is rather to model the spectra using wavelets; this leaves open the possibility of later identification of peaks, and of Bayes theorem for discrimination.
Functional Mixed Models
For the ith sample spectrum Y_i(t), i = 1, ..., n, thought of as a continuous function of time t,

Y_i(t) = \sum_{l=1}^{p} X_{il} B_l(t) + \sum_{k=1}^{m} Z_{ik} U_k(t) + E_i(t)

or in reality, as a discretised version in terms of the set of n observations on a grid of T time points, the n x T matrix Y:

Y = XB + ZU + E

Rows of U are iid MVN_T(0, Q); rows of E are iid MVN_T(0, S).
The T x T matrices Q, S are to be specified. Orthogonally transform to wavelets by applying the Discrete Wavelet Transform (DWT) to each sample (rows of Y):

YW = XBW + ZUW + EW

where W is an orthogonal matrix. The model becomes

Y* = XB* + ZU* + E*

Here also Q* = W'QW and S* = W'SW, which we take to be diagonal.
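As a small illustration of the transform step, the sketch below builds an orthonormal Haar wavelet matrix (a stand-in for whichever wavelet family the analysis actually uses) and applies it row-wise, so that the transformed model Y* = YW carries the same information as Y. The recursive construction and the toy sizes are my own choices, not from the talk.

```python
import numpy as np

def haar_matrix(T):
    """Orthonormal Haar wavelet matrix of size T x T (T a power of 2).
    Rows are basis vectors: coarse scaling rows first, then detail rows."""
    if T == 1:
        return np.array([[1.0]])
    H = haar_matrix(T // 2)
    top = np.kron(H, [1.0, 1.0])                 # scaling (coarse) rows
    bot = np.kron(np.eye(T // 2), [1.0, -1.0])   # wavelet (detail) rows
    return np.vstack([top, bot]) / np.sqrt(2.0)

rng = np.random.default_rng(0)
n, T = 6, 8
Y = rng.standard_normal((n, T))     # toy "spectra", one per row
W = haar_matrix(T)                  # basis vectors in rows, so Y* = Y W^T here
Ystar = Y @ W.T                     # wavelet coefficients of each spectrum
# W is orthogonal, so the transform is invertible and preserves norms,
# and the mixed model holds unchanged in the transformed coordinates.
```

Because W W^T = I, applying W^T to Y* recovers the original spectra exactly, which is what lets posterior inference be carried back from the wavelet domain by the inverse DWT.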
Prior for the Transformed Fixed Effects
We employ a slab-and-spike prior here to nonlinearly shrink towards zero: for the lth fixed effect, l = 1, ..., p, at scale j and location k,

B*_{ljk} ~ gamma_{ljk} Normal(0, tau_{ljk}) + (1 - gamma_{ljk}) delta_0

Here gamma_{ljk} ~ Bernoulli(pi_{lj}); that is, for the lth fixed effect the probability of acceptance is allowed to depend on the scale but not on the locations within the scale.
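A minimal sketch of drawing from this prior: each coefficient is exactly zero with probability 1 - pi_lj, and otherwise comes from the normal slab. The particular values of pi_lj and tau_ljk below are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_slab_and_spike(pi_lj, tau_ljk, size):
    """Draw B*_ljk = gamma * Normal(0, tau) + (1 - gamma) * delta_0,
    with gamma ~ Bernoulli(pi_lj). Hyperparameter values are illustrative."""
    gamma = rng.random(size) < pi_lj                 # inclusion indicators
    slab = rng.normal(0.0, np.sqrt(tau_ljk), size)   # the normal slab
    return np.where(gamma, slab, 0.0)                # spike at zero otherwise

b = draw_slab_and_spike(pi_lj=0.2, tau_ljk=4.0, size=100_000)
# About 80% of draws are exactly zero; the nonzero draws have variance tau.
```

The point mass at zero is what gives exact selection: coefficients not "accepted" by gamma are removed from the model rather than merely shrunk.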
The slab-and-spike prior may be thought of as a particular extreme case of scale mixtures of normals: a mixture on the variance of a normal,

pi(beta_l) = \int N(beta_l | 0, psi_l) G(d psi_l)    (1)

where G(.) weights the variances psi, with high probability of a small variance. More later.
Analysis is at the wavelet-transformed level, using MCMC with empirical Bayes plug-ins for the hyperparameters of shrinkage, transforming back with the inverse DWT to the original parameters for posterior inference.
General Scale Mixtures of Normals
The slab-and-spike prior is an extreme example of a scale mixture of normals. Suppose we wish to replace slab-and-spike by a scale mixture of normals that concentrates mass around zero but also has fat tails, so that any really sizeable coefficient is not shrunk too much. We hope to use such priors to speed up searches for a good model by concentrating on modes rather than full MCMC. We look for a prior whose negative log provides a suitable penalty to add to the negative log-likelihoods of a variety of models, suitable for logistic discrimination (e.g. modelling disease given the spectrum).
Generalised Linear Models for Many Variables
Strategy: minimise

-loglik(beta) - log[pi(beta)]

For example, this could be the logistic likelihood using wavelets as explanatory variables. Continuous mixture on the variance of a normal:

pi(beta_i) = \int N(beta_i | 0, psi_i) G(d psi_i)    (2)

G(.) weights the variances psi, with high probability of a small variance.
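The penalised objective is straightforward to write down; a sketch for the logistic case, with a generic penalty passed in as a function (here a double-exponential penalty with an illustrative scale), might look like this. The data and parameter values are toy examples of my own.

```python
import numpy as np

def penalised_logistic_objective(beta, X, y, penalty):
    """-loglik(beta) - log pi(beta) for logistic regression, where
    `penalty` is the negative log prior applied coordinate-wise."""
    eta = X @ beta
    # log-likelihood for y in {0,1}: sum_i [ y_i*eta_i - log(1 + exp(eta_i)) ]
    loglik = np.sum(y * eta - np.logaddexp(0.0, eta))
    return -loglik + np.sum(penalty(beta))

rng = np.random.default_rng(8)
X = rng.standard_normal((20, 5))
y = (rng.random(20) < 0.5).astype(float)
# Double-exponential penalty |b|/gamma with illustrative gamma = 0.5:
val = penalised_logistic_objective(np.zeros(5), X, y, lambda b: np.abs(b) / 0.5)
# At beta = 0 the penalty vanishes, so val = -loglik = 20 * log(2).
```

Any of the mixture priors below slots in by swapping the `penalty` function; minimising this objective gives the MAP estimate.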
Normal Variance Mixtures
The mean-zero double exponential distribution DE(0, 1/gamma), with probability density function

(1/(2 gamma)) exp{ -|beta| / gamma },   -inf < beta < inf,  0 < gamma < inf,

is defined by an exponential mixing distribution, Ex(1/(2 gamma^2)), with probability density function

g(psi_i) = (1/(2 gamma^2)) exp{ -psi_i / (2 gamma^2) }.    (3)
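This mixture representation is easy to check by simulation: drawing psi from the exponential with mean 2*gamma^2 and then beta | psi ~ N(0, psi) should reproduce the DE(0, 1/gamma) marginal, which has variance 2*gamma^2 and tail probability P(|beta| > t) = exp(-t/gamma). A quick Monte Carlo sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
gamma = 0.7
n = 200_000
# psi has density (1/(2*gamma^2)) exp(-psi/(2*gamma^2)), i.e. mean 2*gamma^2
psi = rng.exponential(2.0 * gamma**2, n)
# beta | psi ~ Normal(0, psi); marginally beta ~ DE(0, 1/gamma)
beta = rng.normal(0.0, np.sqrt(psi))
# Checks: Var(beta) = E(psi) = 2*gamma^2, and P(|beta| > gamma) = exp(-1)
```

The same recipe, with a different mixing density g, yields every prior in this family, which is why the choice of G is the whole story.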
The normal-Jeffreys (NJ) prior distribution arises from the improper hyperprior (Kiiveri/Figueiredo)

g(psi_i) \propto 1/psi_i,    (4)

which in turn induces an improper prior for beta_i of the form pi(beta_i) \propto 1/|beta_i|. The prior has an infinite spike at beta_i = 0, a feature also of the penalised likelihood or posterior, which is consequently unnormalisable.
Normal Exponential Gamma (NEG)
Mixing on the scale of the exponential distribution: a gamma mixture gives

g(psi_i) = (lambda / gamma^2) (1 + psi_i / gamma^2)^{-(lambda+1)},   0 < lambda, gamma < inf.    (5)

The density of the marginal distribution of beta_i can be expressed as

pi(beta_i) = (lambda 2^lambda Gamma(lambda + 1/2) / (sqrt(pi) gamma)) exp{ beta_i^2 / (4 gamma^2) } D_{-2(lambda+1/2)}( |beta_i| / gamma )    (6)

where D(.) is the parabolic cylinder function. The parameter lambda controls the heaviness of the tails: for large |beta_i|/gamma,

pi(beta_i) ~ c ( |beta_i| / gamma )^{-(2 lambda + 1)}.
The case lambda = 0.5 corresponds to the quasi-Cauchy slab of Johnstone and Silverman (Ann. Statist. 2005), who focus on the median of the posterior distribution. This same prior of J&S coincides with the univariate special case of the robustness prior of Berger (1985, pp. 236-240). Jeffreys (1939; 3rd ed. 1961) also has desiderata in hypothesis testing which lead to a prior with Cauchy tails.
Shrinkage of beta
Define the penalty function p(beta) = -log[pi(beta)]. The penalised MLE beta~ and the MLE beta^ satisfy

(beta^ - beta~) / (sigma^2 / \sum_{i=1}^n x_i^2) = sign(beta~) p'(|beta~|)

which shows that the amount of shrinkage is directly controlled by the derivative of the penalty function.
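The identity is easy to verify numerically in the simplest case, a single-covariate normal model with the double-exponential penalty p(beta) = |beta|/gamma, for which p'(|beta|) = 1/gamma and the penalised MLE is the familiar soft-threshold estimate. The data and gamma below are toy choices of my own.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma, gamma = 50, 1.0, 0.5
x = rng.standard_normal(n)
y = 2.0 * x + sigma * rng.standard_normal(n)   # strong signal, so beta~ != 0

beta_mle = x @ y / (x @ x)                     # ordinary MLE
# Soft thresholding: the penalised MLE under p(beta) = |beta|/gamma
beta_pen = np.sign(beta_mle) * max(abs(beta_mle) - sigma**2 / (x @ x * gamma), 0.0)
# The shrinkage identity: (beta^ - beta~) / (sigma^2 / sum x_i^2)
shrinkage = (beta_mle - beta_pen) / (sigma**2 / (x @ x))
# This equals sign(beta~) * p'(|beta~|) = 1/gamma here.
```

A flat p' (as for the double exponential) shrinks every coefficient by the same fixed amount; priors whose p' vanishes for large |beta| leave big coefficients essentially unshrunk.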
- Jeffreys prior: no shrinkage for large beta (unbiased)
- Oracle property of Fan and Li (2001, JASA)
Parameter Updates in n Rather Than k Dimensions
Use the singular value decomposition of X (n x k): X = UDV^T. If gamma = V^T beta then we may update the distribution of this, simulating with MCMC in n dimensions << k, and then translate back into beta via the full k-dimensional augmentation (gamma, gamma_perp), with gamma_perp, of dimension (k - n) x 1, being independent of the data: its distribution comes from the prior, and the conditional distributions of gamma_perp | gamma are easy to calculate.
- Avoids inversion of (very large) k x k matrices
- Min(n, k) non-zero solutions
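The key fact behind the reduction is that the likelihood sees beta only through the n-dimensional gamma = V^T beta; a short numerical check (toy sizes of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 10, 200
X = rng.standard_normal((n, k))
# Thin SVD: X = U diag(d) V^T with Vt of shape n x k
U, d, Vt = np.linalg.svd(X, full_matrices=False)
beta = rng.standard_normal(k)
gamma = Vt @ beta          # the n-dimensional part the data can inform
# X beta depends on beta only through gamma, so MCMC updates can live in
# n dimensions; the (k - n)-dimensional complement V_perp^T beta is
# untouched by the data and its conditional comes straight from the prior.
```

So each sweep works with n-vectors and n x n quantities, never inverting a k x k matrix.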
Perfect Starts
For n < k there will be many local boundary solutions. We explore the multiplicity of these solutions by generating alternative starting values, all of which fit the data perfectly. The minimum-length least squares (MLLS) fit (the ridge fit for small ridge constant) to the data for n < k is

beta^_MLLS = (X^T X)^+ X^T y

where + denotes the Moore-Penrose generalised inverse. Using the singular value decomposition X = U Diag(d_1, d_2, ..., d_n) V^T, the orthogonal projection matrix onto the null space of X is I_k - V V^T, of rank k - n.
Consider generating a random k-vector z and taking w = (I - V V^T) z; this may be added to beta^_MLLS to get another perfectly fitting starting point. This may induce a different posterior estimate, which will be close to perfectly fitting but with another set of betas set to zero. We can thus explore the modes of the posterior, seeing which beta components come up regularly.
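A sketch of the recipe (toy dimensions of my own): the MLLS fit interpolates the data exactly when n < k, and adding any null-space vector w = (I - V V^T) z preserves that perfect fit, giving a fresh starting value for each mode search.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 10, 200
X = rng.standard_normal((n, k))
y = rng.standard_normal(n)

# Minimum-length least squares via the Moore-Penrose generalised inverse
beta_mlls = np.linalg.pinv(X) @ y
U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T                                 # k x n
z = rng.standard_normal(k)
w = z - V @ (Vt @ z)                     # (I - V V^T) z: lies in the null space of X
beta_start = beta_mlls + w               # another perfectly fitting start
```

Repeating with fresh z gives a spread of interpolating starts, and watching which components survive the subsequent optimisation maps out the posterior modes.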
Algorithms
It is easy to show that for hierarchical priors

dp/dbeta = -E{ d log f(beta | psi) / dbeta | beta }

and for a Gaussian mixture prior

E_{psi|beta}{ psi^{-1} } = (1/beta) dp/dbeta

giving a direct link between Newton-Raphson algorithms and EM, and allowing a direct extension to exponential family likelihoods.
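One concrete instance of the EM link: for the double-exponential prior, p(beta) = |beta|/gamma, so E{psi^{-1} | beta} = p'(|beta|)/|beta| = 1/(gamma*|beta|), and each M-step becomes a ridge regression with coefficient-specific weights. A minimal sketch (my own toy data; the small floor on |beta| is a numerical guard, not part of the theory):

```python
import numpy as np

rng = np.random.default_rng(6)
n, k, gamma, sigma2 = 40, 10, 0.5, 1.0
X = rng.standard_normal((n, k))
beta_true = np.zeros(k)
beta_true[:3] = 2.0                      # three genuine signals
y = X @ beta_true + rng.standard_normal(n)

beta = np.ones(k)
for _ in range(200):
    # E-step: w_j = E{psi_j^{-1} | beta_j} = 1 / (gamma * |beta_j|)
    w = 1.0 / (gamma * np.maximum(np.abs(beta), 1e-8))
    # M-step: ridge with penalties w_j, i.e. solve
    # (X'X/sigma2 + diag(w)) beta = X'y/sigma2
    beta = np.linalg.solve(X.T @ X / sigma2 + np.diag(w), X.T @ y / sigma2)
```

Iterating the two steps drives the null coefficients towards zero while leaving the strong ones nearly unshrunk, and the same scheme carries over to exponential family likelihoods by swapping the M-step for a penalised GLM fit.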
- Kiiveri (2003) uses EM with a ridge line search in the M-step in his GENERAVE algorithm.
- Fan and Li (2001) use a Newton-Raphson-style algorithm for their SCAD penalty and general likelihoods.
- LARS (Efron et al., Ann. Statist. 2004) and approximate GLM variants (Park and Hastie, 2006).
- Code that cycles through one-dimensional searches for the logistic likelihood (Genkin et al., 2005; C. Hoggart, 2006).
- Hyperparameters (shape and scale) chosen by cross-validation with such fast algorithms.
Simulations
NEG with shape lambda, scale set via the mean mu of the underlying gamma distribution. n = 30 multiple regression with k = 1000 points; 5 random splits into two sets of 15 for training and testing.

y_i = f(x_i) + eps_i,   i = 1, ..., 15

where a spline is fitted:

f(x) = \sum_{j=1}^{1000} beta_j (x - k_j)_+,   k_j = -0.3 + 1.6 (j-1)/999;

eps_i ~ N(0, sigma^2), with sigma = 0.1, and x_i iid uniform(0, 1).
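A sketch of generating one such training set. Two assumptions are flagged: the sign on -0.3 in the knot formula (so the knots span the unit interval) and the sine target function, inferred from Figure 2; neither is stated explicitly in the slides.

```python
import numpy as np

rng = np.random.default_rng(7)
n, k, sigma = 30, 1000, 0.1
j = np.arange(1, k + 1)
knots = -0.3 + 1.6 * (j - 1) / 999          # knots on [-0.3, 1.3]; sign assumed
x = rng.uniform(0.0, 1.0, n)
# Truncated-linear spline basis: column j is (x - k_j)_+
B = np.maximum(x[:, None] - knots[None, :], 0.0)
eps = rng.normal(0.0, sigma, n)
y = np.sin(2 * np.pi * x) + eps             # sine target assumed from Figure 2
```

With n = 30 and k = 1000 columns the design is wildly overcomplete, which is exactly the regime where the sparsity-inducing priors above are needed.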
[Figure 2: fit to sine curve]
Multiple regression simulation: sigma^2 = 1, X'X of correlation form with AR(1) structure, lag-1 correlation rho. n = 100 observations, k = 500 variables, 10 nonzero coefficients of beta. Fitted by 5-fold cross-validation and tested on 10 datasets of 100 observations.

                       beta = 1              beta = 5
                  rho=0.5   rho=0.8     rho=0.5   rho=0.8
Ridge               5.44      3.85       100.6     61.36
Lasso               1.74      1.42        1.71      1.72
NJeffreys           1.32      1.17        1.23      1.30
NEG (lambda=0.1)    1.15      1.20        1.18      1.24
NEG                 1.15      1.15        1.16      1.23

Table 1: MSE prediction results (oracle 1.00)
MAIN REFERENCES
- Griffin, JE and Brown, PJ (2005) University of Kent/Warwick.
- Morris, JS, Brown, PJ, Herrick, RC, Baggerly, KA and Coombes, KR (2006) Submitted to Biometrics; BePress preprint.
- Morris, JS and Carroll, RJ (2005, JRSSB).
- Morris, JS, Brown, PJ, Baggerly, KA and Coombes, KR (2006) In Bayesian Inference for Gene Expression and Proteomics, eds K-A Do, P Mueller and M Vannucci, pp. 269-292.
- Fan, J and Li, R (2001) JASA, 1348-1360.
- Kiiveri, H (2003) IMS Monograph: Festschrift for T Speed, 127-143.
- Park, MY and Hastie, T (2006) Technical report (web), Stanford.
- Zou, H (2006) JASA, to appear.