Alignment and Analysis of Proteomics Data using Square Root Slope Function Framework


1 Alignment and Analysis of Proteomics Data using Square Root Slope Function Framework
J. Derek Tucker, Department of Statistics, Florida State University, Tallahassee, FL
CTW: Statistics of Warpings and Phase Variations, 2012
Tucker SRSF FDA - Proteomics 1/19

2 Problem Introduction
Given: a collection of observed Total Ion Count (TIC) chromatograms.
Goals: we would like to align the data, study their variability (FPCA), develop probability models to capture that variability, and generate random samples.
Requirement: a proper metric structure on the space of these functions.
Our method: phase and amplitude separation using the elastic metric, as presented by A. Srivastava.

3 Problem Introduction: Results on Proteomics Data
TIC (Total Ion Count) chromatograms of blood samples. Used in protein profiling, assuming that proteins with different abundances are functionally related to disease processes.
[Figure panels: Original Data; Aligned Data]

4 Problem Introduction: Zoom In on Alignment
TIC (Total Ion Count) chromatograms of blood samples. Used in protein profiling, assuming that proteins with different abundances are functionally related to disease processes.
[Figure panels: Original Data; Aligned Data]

5 Problem Introduction: Results on Proteomics Data
A partial answer key is available in which several peaks have been manually identified. The alignment at these peaks is good, even though the answer key was not used in the alignment.
[Figure panels: Original Data; Aligned Data]

6 Problem Introduction: Zoom In on Alignment
A partial answer key is available in which several peaks have been manually identified. The alignment at these peaks is good, even though the answer key was not used in the alignment.
[Figure panels: Original Data; Aligned Data]

7 Problem Introduction: Comparison with Other Methods
Comparison with the MBM (James 2007) and MSE (Ramsay and Silverman 2005) methods.
[Figure panels: MBM; MSE]

8 Problem Introduction: Alignment Performance
Alignment performance can also be quantified by the decrease in the cumulative cross-sectional variance of the aligned functions:
\[ \mathrm{Var}(\{g_i\}) = \frac{1}{n-1} \int_0^1 \sum_{i=1}^{n} \Big( g_i(t) - \frac{1}{n} \sum_{j=1}^{n} g_j(t) \Big)^2 \, dt \]
Define: Original Variance = $\mathrm{Var}(\{f_i\})$, Amplitude Variance = $\mathrm{Var}(\{\tilde{f}_i\})$ for the aligned functions $\tilde{f}_i$, Phase Variance = $\mathrm{Var}(\{\mu_f \circ \gamma_i\})$.
[Table: amplitude-variance and phase-variance for the original data and for the Elastic Method, MBM, and MSE]
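As a rough illustration of the cumulative cross-sectional variance defined on this slide, the sketch below evaluates it for functions sampled on a common grid. It is a minimal NumPy version under stated assumptions, not the implementation used in the talk: the helper names `trapezoid` and `cumulative_variance` and the synthetic Gaussian bumps are hypothetical, and the integral is approximated with the trapezoidal rule.

```python
import numpy as np

def trapezoid(y, t):
    """Trapezoidal rule for samples y on the grid t."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t)))

def cumulative_variance(G, t):
    """Cumulative cross-sectional variance of the rows of G.

    G has shape (n, T): n functions sampled on the grid t in [0, 1].
    """
    G = np.asarray(G, dtype=float)
    n = G.shape[0]
    cross_mean = G.mean(axis=0)                    # (1/n) sum_j g_j(t)
    sq_dev = ((G - cross_mean) ** 2).sum(axis=0)   # sum_i (g_i(t) - mean(t))^2
    return trapezoid(sq_dev, t) / (n - 1)

# Synthetic example (assumption): the same bump, shifted versus perfectly aligned.
t = np.linspace(0.0, 1.0, 501)
shifts = np.array([-0.1, 0.0, 0.1])
unaligned = np.array([np.exp(-((t - 0.5 - s) ** 2) / 0.005) for s in shifts])
aligned = np.array([np.exp(-((t - 0.5) ** 2) / 0.005) for _ in shifts])

var_unaligned = cumulative_variance(unaligned, t)
var_aligned = cumulative_variance(aligned, t)
print(var_unaligned, var_aligned)
```

In this toy case the aligned set has essentially zero variance because the bumps differ only in phase; with real chromatograms some amplitude variance would remain after alignment.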

9 Phase-Variability: Analysis of Warping Functions using Horizontal fPCA
We have a collection of warping functions in the space $\Gamma$ and want to model their variability.
$\Gamma$ is a nonlinear manifold, so FPCA cannot be performed on it directly.
We represent warping functions by their SRSFs, as presented by A. Srivastava: $\psi(t) = \sqrt{\dot{\gamma}(t)}$.
The squared $L^2$ norm of this SRSF is
\[ \int_0^1 \psi(t)^2 \, dt = \int_0^1 \dot{\gamma}(t) \, dt = \gamma(1) - \gamma(0) = 1. \]
Hence the space of such SRSFs is a unit Hilbert sphere in $L^2$; call it $\Psi$.
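The unit-norm property of the SRSF of a warping function can be checked numerically. The sketch below is an illustration under assumptions: the warping function $\gamma(t) = t^2$, the grid, and the helper names `trapezoid` and `warp_to_srsf` are hypothetical choices, and the derivative is approximated with finite differences via `np.gradient`.

```python
import numpy as np

def trapezoid(y, t):
    """Trapezoidal rule for samples y on the grid t."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t)))

def warp_to_srsf(gamma, t):
    """SRSF of a warping function: psi(t) = sqrt(d gamma / dt)."""
    dgamma = np.gradient(gamma, t)
    # Clip tiny negative values that finite differences could produce.
    return np.sqrt(np.clip(dgamma, 0.0, None))

t = np.linspace(0.0, 1.0, 1001)
gamma = t ** 2            # example warping: gamma(0) = 0, gamma(1) = 1, increasing
psi = warp_to_srsf(gamma, t)

sq_norm = trapezoid(psi ** 2, t)   # should be close to gamma(1) - gamma(0) = 1
print(sq_norm)
```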

10 Phase-Variability: Results on Proteomics Data
[Figure] From left to right: the observed warping functions, their Karcher mean, and the first three principal directions of the observed data.

11 Amplitude Variability: Analysis of Aligned Functions using Vertical fPCA
The aligned functions can be statistically analyzed in a standard way (in $L^2$) using cross-sectional computations in the SRSF space.
To do this properly we need a joint FPCA that also accounts for the vertical variability of the functions: each function $f_i$ is analyzed using the pair $h_i = (q_i, f_i(0))$ rather than just $q_i$.
Define the covariance operator
\[ K_h(s, t) = \frac{1}{n-1} \sum_{i=1}^{n} E\big[ (h_i(s) - \mu_h(s)) (h_i(t) - \mu_h(t)) \big], \]
where $\mu_h = [\mu_q \;\; \bar{f}(0)]$.
Taking the SVD, $K_h = U_h \Sigma_h V_h^T$.
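The joint (vertical) fPCA described on this slide can be sketched on discretized data as follows. This is a minimal illustration, not the talk's code: the function name `vertical_fpca`, the random stand-in data, and the choice to ignore the grid spacing when forming the sample covariance are all assumptions.

```python
import numpy as np

def vertical_fpca(Q, f0, k):
    """Joint fPCA on h_i = (q_i, f_i(0)).

    Q  : (n, T) array of aligned SRSFs sampled on a common grid
    f0 : (n,) array of initial values f_i(0)
    k  : number of principal directions to keep
    """
    H = np.column_stack([Q, f0])              # each row is h_i
    mu_h = H.mean(axis=0)                     # [mu_q, mean of f(0)]
    Hc = H - mu_h
    K = Hc.T @ Hc / (H.shape[0] - 1)          # sample covariance, (T+1, T+1)
    U, S, Vt = np.linalg.svd(K)               # K = U diag(S) V^T
    coeffs = Hc @ U[:, :k]                    # principal coefficients c
    return mu_h, U[:, :k], S[:k], coeffs

# Random stand-in data (assumption), only to exercise the function.
rng = np.random.default_rng(0)
n, T, k = 20, 50, 3
Q = rng.normal(size=(n, T))
f0 = rng.normal(size=n)
mu_h, U_h, S_h, c = vertical_fpca(Q, f0, k)
print(S_h)
```

The sample variance of each column of `c` equals the corresponding singular value, which is one way to sanity-check the decomposition.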

12 Amplitude Variability: Results on Proteomics Data
[Figure panels: first 2 vertical principal-geodesic paths; first 5 eigenvalues]
Most of the information is captured in the first few directions.

13 Modeling of Phase and Amplitude Components
Let $c = (c_1, \dots, c_{k_1})$ and $z = (z_1, \dots, z_{k_2})$ be the dominant principal coefficients of the amplitude and phase components, respectively.
Recall that $c_j = \langle h, U_{h,j} \rangle$ and $z_j = \langle v, U_{\psi,j} \rangle$.
We can reconstruct the amplitude component using
\[ q = \mu_q + \sum_{j=1}^{k_1} c_j U_{h,j}. \]
Similarly, for the phase component we form $v = \sum_{j=1}^{k_2} z_j U_{\psi,j}$ and then
\[ \psi = \cos(\|v\|)\, \mu_\psi + \sin(\|v\|)\, \frac{v}{\|v\|}, \qquad \gamma_s(t) = \int_0^t \psi(s)^2 \, ds. \]
Combining the two random quantities, we obtain a random function $f_s \circ \gamma_s$.
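The phase reconstruction on this slide (tangent vector to sphere point to warping function) can be sketched numerically. The example below is an illustration under assumptions: the mean SRSF is taken to be the identity warping ($\mu_\psi \equiv 1$), the tangent vector `v` and test function `f` are made up, and the helper names are hypothetical. Composition $f \circ \gamma_s$ is approximated by linear interpolation.

```python
import numpy as np

def trapezoid(y, t):
    """Trapezoidal rule for samples y on the grid t."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t)))

def cumulative_trapezoid(y, t):
    """Cumulative trapezoidal integral of y, starting from 0 at t[0]."""
    steps = 0.5 * (y[1:] + y[:-1]) * np.diff(t)
    return np.concatenate([[0.0], np.cumsum(steps)])

def exp_map(mu_psi, v, t):
    """psi = cos(||v||) mu_psi + sin(||v||) v / ||v|| on the unit sphere."""
    norm_v = np.sqrt(trapezoid(v ** 2, t))
    if norm_v < 1e-12:
        return mu_psi.copy()
    return np.cos(norm_v) * mu_psi + np.sin(norm_v) * v / norm_v

def srsf_to_warp(psi, t):
    """gamma(t) = integral from 0 to t of psi(s)^2 ds, rescaled so gamma(1) = 1."""
    gamma = cumulative_trapezoid(psi ** 2, t)
    return gamma / gamma[-1]

t = np.linspace(0.0, 1.0, 1001)
mu_psi = np.ones_like(t)                          # SRSF of the identity warping
v = 0.3 * np.sqrt(2.0) * np.cos(np.pi * t)        # tangent at mu_psi: <v, mu_psi> = 0
psi = exp_map(mu_psi, v, t)
gamma_s = srsf_to_warp(psi, t)

f = np.exp(-((t - 0.5) ** 2) / 0.01)              # stand-in for an amplitude sample
f_warped = np.interp(gamma_s, t, f)               # f composed with gamma_s
```

The rescaling in `srsf_to_warp` only corrects small numerical drift; in exact arithmetic the integral of $\psi^2$ over $[0, 1]$ is already 1.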

14 Modeling of Phase and Amplitude Components: Modeling Types
Gaussian models on fPCA coefficients:
Model $f_s(0)$, $c$, and $z$ as multivariate normal random variables.
The mean of $f_s(0)$ is $\bar{f}(0)$, while the means of $c$ and $z$ are zero vectors.
Their joint covariance matrix is of the type
\[ \begin{bmatrix} \sigma_0^2 & L_1 & L_2 \\ L_1^T & \Sigma_h & S \\ L_2^T & S^T & \Sigma_\psi \end{bmatrix} \in \mathbb{R}^{(k_1+k_2+1) \times (k_1+k_2+1)}. \]
Here $L_1 \in \mathbb{R}^{1 \times k_1}$ captures the covariance between $f(0)$ and $c$, $L_2 \in \mathbb{R}^{1 \times k_2}$ between $f(0)$ and $z$, and $S \in \mathbb{R}^{k_1 \times k_2}$ between $c$ and $z$.
Non-parametric models on fPCA coefficients:
Use kernel density estimation, where the density of $f_s(0)$, of each of the $k_1$ components of $c$, and of each of the $k_2$ components of $z$ can be estimated using
\[ p_{ker}(x) = \frac{1}{nb} \sum_{i=1}^{n} K\Big( \frac{x - x_i}{b} \Big). \]
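Both modeling types on this slide can be sketched in a few lines. The example below is a hedged illustration, not the talk's implementation: the coefficients are random stand-ins, the sizes `k1` and `k2` are arbitrary, the Gaussian kernel and the bandwidth `b = 0.4` are assumptions (the slide does not specify $K$ or $b$), and the name `kde` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k1, k2 = 40, 3, 2

# Stand-in data (assumption): initial values and principal coefficients.
f0 = rng.normal(5.0, 0.5, size=n)
c = rng.normal(size=(n, k1))
z = rng.normal(size=(n, k2))

# Gaussian model: joint covariance of (f(0), c, z), with zero means for c and z.
X = np.column_stack([f0, c, z])
cov = np.cov(X, rowvar=False)                    # (k1 + k2 + 1) square matrix
mean = np.concatenate([[f0.mean()], np.zeros(k1 + k2)])
L1 = cov[0, 1:1 + k1]                            # covariance of f(0) with c
L2 = cov[0, 1 + k1:]                             # covariance of f(0) with z
S = cov[1:1 + k1, 1 + k1:]                       # covariance of c with z
draws = rng.multivariate_normal(mean, cov, size=5)

# Non-parametric model: p_ker(x) = 1/(n b) * sum_i K((x - x_i) / b).
def kde(x, samples, b):
    """Kernel density estimate with a Gaussian kernel K (assumed choice)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]
    u = (x - samples[None, :]) / b
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return kernel.sum(axis=1) / (samples.size * b)

grid = np.linspace(-10.0, 10.0, 4001)
density = kde(grid, c[:, 0], b=0.4)
```

Each row of `draws` is one sample of $(f_s(0), c, z)$, which could then be pushed through the amplitude and phase reconstructions to produce a random function.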

15 Modeling of Phase and Amplitude Components: Modeling Results
[Figure panels: Amplitude Random Samples; Random Warping Functions; Random Samples]
Comparing them with the original data set, we conclude that the random samples are similar to the original data.

16 Modeling of Phase and Amplitude Components: Classification using Pair-Wise Distances
[Figure panels: pairwise-distance matrices for $d_a$, $d_p$, and $L^2$]
The pairwise-distance matrices for $d_a$ and $d_p$ show more structure than the one for the standard $L^2$ distance.
Classification rates: $d_a$ = 87% (13/15), $d_p$ = 33% (5/15), $L^2$ = 27% (4/15).

17 Modeling of Phase and Amplitude Components: Cumulative Match Characteristic Curve
A CMC curve plots the probability of correct classification against the size of the returned candidate list.
We also compare with a naive distance, $d_{Naive} = \min_{\gamma \in \Gamma} \| f_i - f_j \circ \gamma \|$.
[Figure: classification rate versus list size for $L^2$, $d_{Naive}$, $d_p$, and $d_a$]
Classification performance of $d_{Naive}$: 80% (12/15).
Our method rapidly reaches a classification rate above 90%, in contrast to the $d_{Naive}$ and standard $L^2$ distances.
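A CMC curve can be computed directly from a pairwise-distance matrix and class labels. The sketch below assumes one particular convention, which the slides do not spell out: each sample is used in turn as the probe, the remaining samples are ranked by distance, and a probe counts as matched at list size k if a sample of its own class appears among its k nearest candidates. The function name `cmc_curve` and the toy data are hypothetical.

```python
import numpy as np

def cmc_curve(D, labels):
    """Leave-one-out CMC curve from an (n, n) distance matrix D."""
    D = np.asarray(D, dtype=float)
    labels = np.asarray(labels)
    n = D.shape[0]
    first_hit = np.empty(n, dtype=int)
    for i in range(n):
        others = np.delete(np.arange(n), i)                  # exclude the probe
        order = others[np.argsort(D[i, others], kind="stable")]
        hits = np.flatnonzero(labels[order] == labels[i])
        # Rank (1-based) of the nearest same-class sample; n means "never matched".
        first_hit[i] = hits[0] + 1 if hits.size else n
    list_sizes = np.arange(1, n)
    rates = np.array([(first_hit <= k).mean() for k in list_sizes])
    return list_sizes, rates

# Toy example (assumption): four points on a line with interleaved classes.
x = np.array([0.0, 1.0, 10.0, 11.0])
toy_labels = np.array([0, 1, 0, 1])
D_toy = np.abs(x[:, None] - x[None, :])
list_sizes, rates = cmc_curve(D_toy, toy_labels)
print(list_sizes, rates)
```

With this convention, the rate at list size 1 is the leave-one-out nearest-neighbor classification rate, so the same routine could be applied to distance matrices built from $d_a$, $d_p$, $d_{Naive}$, or $L^2$.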

18 Summary and Conclusions: Summary and Future Work
Conclusions:
Excellent alignment was achieved using our square-root slope function framework.
We used this framework to separate the amplitude and phase of the given data.
We performed fPCA on amplitude and phase and imposed models on the components.
We verified the model using random sampling.
The theory behind this work has been submitted to Computational Statistics and Data Analysis (2012).
Future work:
Extend the classification analysis to probabilistic models, given more samples.
Analyze and understand how additive noise impacts SRSFs and the Karcher mean calculation.

19 Questions?
