Efficient Processing of Models for Large-scale Shotgun Proteomics Data

1 Efficient Processing of Models for Large-scale Shotgun Proteomics Data
Himanshu Grover, Ph.D.
Vanathi Gopalakrishnan, Ph.D.
University of Pittsburgh
C-Big 2012, Pittsburgh, USA
14th October, 2012

2 Outline
- Background on Proteins and Shotgun Proteomics
- Computational modeling framework: Context-sensitive Peptide Identification (CSPI)
- Problem Statement
- Methods for efficient handling
- Challenges and Future Work

3 Proteomics: Interactions, PTMs, Expression

4 Proteomics

5 Mass Spectrometry
- Analytical tool to identify unknown compounds
[Figure labels: Sample, Complex, Ionization, Mass Analyzer, Collaborative, Detector]

6 Amino Acids and Proteins

7 Shotgun Proteomics: Protein/Peptide Identification
[Figure: shotgun sequencing workflow. Labels: protein sample, enzymatic digest, MS/MS (CID), peptide spectrum, fragmentation spectrum (relative intensity vs. m/z)]

8 Database Searching
- Predominant methodology for peptide ID from MS/MS

9 Fact!!
< 30% of spectra are confidently assigned peptides
- Noise
- Variability
- Inadequate scoring systems

10 Computational Bottlenecks
- High volume and rate of data generation
  - 24* ^ 3 spectra per day from moderate sized labs
- Large protein databases: ~90 K protein sequences for humans
  - Constrained searches: ~5-10 ^ 6 unique peptides in database, ~10-20 ^ 3 peptides per spectrum
  - Unconstrained searches: over a billion peptides

11 Context-Sensitive Peptide Identification (CSPI) Framework Demystified
Grover et al. (2012), OMICS (submitted for
- Novel probabilistic framework
  - Scalable and flexible
- Specific goal: model the influence of peptide physicochemical context on the observed peak heights (intensities) in fragmentation spectra

12 Input-Output Hidden Markov Models (IO-HMM)
Classical Hidden Markov Model:
- Hidden layer: $\ldots, q_{t-1}, q_t, q_{t+1}, \ldots$ with transition probability $P(q_t \mid q_{t-1}; \Theta)$
- Output layer: $y_{t-1}, y_t, y_{t+1}$ with emission probability $P(y_t \mid q_t; \Theta)$
Input-output Hidden Markov Model:
- Input layer: $x_{t-1}, x_t, x_{t+1}$
- Hidden layer: $\ldots, q_{t-1}, q_t, q_{t+1}, \ldots$ with transition probability $P(q_t \mid q_{t-1}, x_t; \Theta)$
- Output layer: $y_{t-1}, y_t, y_{t+1}$ with emission probability $P(y_t \mid q_t, x_t; \Theta)$

13 CSPI Model Structure
- Input layer: $x_{t-1}, x_t, x_{t+1}$
- Hidden layer: $\ldots, q_{t-1}, q_t, q_{t+1}, \ldots$ with $P(q_t \mid q_{t-1}, x_t; \Theta)$
- Output layer: $y_{t-1}, y_t, y_{t+1}$ with $P(y_t \mid q_t; \Theta)$

14 Input Layer: Peptide Physicochemical Context
[Figure: experimental spectrum of peptide S G F L E E D E L K (relative intensity, 0 to 100, vs. m/z) with annotated b- and y-ion peaks; global and local context marked on the sequence]

15 Context in the context of CSPI
Peptide: S G F L E E D E L K
$x_t = \{x_{t,0}, x_{t,1}, x_{t,2}, \ldots, x_{t,47}\}$
[Figure: CSPI model with input layer ($x_{t-1}, x_t, x_{t+1}$), hidden layer ($q_{t-1}, q_t, q_{t+1}$) and output layer ($y_{t-1}, y_t, y_{t+1}$)]

16 Matching a Peptide with Experimental Spectra
[Figure: peptide S G F L E E D E L K with its b ions and y ions matched to peaks in the experimental spectrum (relative intensity vs. m/z)]

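17 Normalized Intensities in context of CSPI
[Figure: b-ion and y-ion intensities of peptide S G F L E E D E L K feeding the output layer ($y_{t-1}, y_t, y_{t+1}$) of the CSPI model, below the input layer ($x_{t-1}, x_t, x_{t+1}$) and hidden layer ($q_{t-1}, q_t, q_{t+1}$)]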

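18 Summary
- Positions along the peptide-spectrum match (PSM): $t = 0, 1, \ldots, t-1, t, t+1, \ldots, T$
- Peptide: S G F L E E D E L K
- Inputs: $x_{t-1,\,0 \ldots 47}$, $x_{t,\,0 \ldots 47}$, $x_{t+1,\,0 \ldots 47}$
- Hidden states: $q_{t-1}, q_t, q_{t+1}$
- Outputs: $y_{t-1} = I_{b/y,\,t-1}$, $y_t = I_{b/y,\,t}$, $y_{t+1} = I_{b/y,\,t+1}$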

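19 Parameterization: Transition/Emission Functions
Transition probabilities (logistic function):
$$P(q_t \mid q_{t-1} = j, x_t; \Theta_{q_t}) = \begin{cases} \dfrac{1}{1 + \sum_{k=1}^{S} \exp(w_k^T x_t)} & \text{if } y_t = \text{"NA"} \\[2ex] \dfrac{\exp(w_i^T x_t)}{1 + \sum_{k=1}^{S} \exp(w_k^T x_t)},\quad i = 1, 2, \ldots, S-1 & \text{if } y_t \neq \text{"NA"} \end{cases}$$
where $w_i^T$ are the logistic regression weight vectors.
Emission distributions:
$$y_t \mid q_t \sim \begin{cases} 1.0 & \text{if } y_t = 0 \\ P(\Theta) & \text{if } y_t > 0 \end{cases} \qquad \text{where } P \in \{\mathrm{Exp}(\lambda),\ \mathrm{Be}(\alpha, \beta),\ N(\mu, \sigma^2)\}$$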
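The logistic transition function can be sketched in a few lines of Python. This is only an illustration under an assumption: it uses the standard multinomial-logistic form with S-1 weight vectors and one reference state, which may not match the slide's exact indexing. The function name `transition_probs` and the layout of `weights` are hypothetical.

```python
import math


def transition_probs(weights, x_t):
    """Return transition probabilities over all S states for one previous state j.

    Assumption: `weights` holds S-1 weight vectors (w_1 .. w_{S-1}) for the
    fixed previous state j; the last state is a reference class with score 0.
    """
    # Linear scores w_i^T x_t for the S-1 parameterized states.
    scores = [sum(w * x for w, x in zip(w_i, x_t)) for w_i in weights]
    # Reference state: exp(0) = 1 supplies the "1 +" term of the denominator.
    scores.append(0.0)
    # Subtract the maximum score before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

With all-zero weights every state gets the same probability, which is a quick sanity check on the normalization.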

20 Parameter Estimation
- Parameters to estimate per CSPI model (4 hidden states):
  - Over 700 (logistic function weights, emission distribution parameters)
- Maximum likelihood
  - Generalized Expectation Maximization algorithm (GEM)

21 Inference: Log-likelihood Ratio
- Score: log-likelihood ratio
$$\text{CSPI\_Score} = \log\left(\frac{P(\text{Spectrum intensities} \mid \text{PeptideSeq}; \Theta_{\text{True}})}{P(\text{Spectrum intensities} \mid \text{PeptideSeq}; \Theta_{\text{Null}})}\right)$$
- Computed using the Forward Procedure
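A minimal sketch of how such a score could be computed with a scaled forward procedure for an input-output HMM. The function names, the callable interfaces `trans_fn` and `emit_fn`, and the explicit initial state distribution `init` are assumptions made for illustration; the slides do not specify them.

```python
import math


def forward_log_likelihood(init, trans_fn, emit_fn, xs, ys):
    """Scaled forward procedure; returns log P(ys | xs) under one model.

    init            -- list of S initial state probabilities (assumed given)
    trans_fn(j, x)  -- list of S probabilities P(q_t = i | q_{t-1} = j, x_t = x)
    emit_fn(i, y)   -- emission probability (or density) P(y_t = y | q_t = i)
    """
    n_states = len(init)
    alpha = [init[i] * emit_fn(i, ys[0]) for i in range(n_states)]
    log_lik = 0.0
    for t in range(len(ys)):
        if t > 0:
            rows = [trans_fn(j, xs[t]) for j in range(n_states)]
            alpha = [
                emit_fn(i, ys[t]) * sum(alpha[j] * rows[j][i] for j in range(n_states))
                for i in range(n_states)
            ]
        norm = sum(alpha)
        if norm == 0.0:
            return float("-inf")  # the observations are impossible under this model
        log_lik += math.log(norm)  # accumulate the log of each scaling factor
        alpha = [a / norm for a in alpha]  # rescale to avoid underflow
    return log_lik


def log_likelihood_ratio(ll_true, ll_null):
    """Log-likelihood ratio of the 'true' model against the 'null' model."""
    return ll_true - ll_null
```

Running the forward pass once with the parameters of each model and subtracting the two log-likelihoods gives the ratio; working in the log domain avoids dividing two very small probabilities.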

22 Computational bottleneck
- Database searching
  - Extract candidate peptides (sub-strings) for each spectrum
- Candidate peptides scoring
  - ^ 3 spectra * ~10-20 ^ 3 peptides
- CSPI:
  - Increases performance, but
  - takes ~5-8 seconds per spectrum to evaluate candidates (under constrained searches)

23 Database Searching
- Mass-range query
  - Amino acids (characters) have masses
- Goal:
  - Search for sub-strings with a (roughly) specific mass
- Naïve approach:
  - Scan the protein database for each query
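The naïve approach can be sketched as a direct scan over every sub-string. This is an illustrative sketch, not the authors' code: it assumes standard monoisotopic residue masses, adds one water per peptide, applies no enzyme constraint, and expects sequences made only of the 20 standard amino acid letters. The function names are hypothetical.

```python
# Monoisotopic residue masses in daltons (standard values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # one H2O is added to the residue sum of a free peptide


def peptide_mass(peptide):
    """Neutral monoisotopic mass of a peptide string."""
    return WATER + sum(MONO[aa] for aa in peptide)


def naive_mass_query(proteins, target_mass, tol):
    """Scan all proteins for sub-strings whose mass is within tol of target_mass.

    Returns (protein index, start offset, length) tuples.
    """
    hits = []
    for pid, seq in enumerate(proteins):
        for start in range(len(seq)):
            mass = WATER
            for end in range(start, len(seq)):
                mass += MONO[seq[end]]
                if mass > target_mass + tol:
                    break  # residue masses are positive, so longer is only heavier
                if mass >= target_mass - tol:
                    hits.append((pid, start, end - start + 1))
    return hits
```

Every query rescans the whole database, which is why the cost grows with both the number of spectra and the database size.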

24 Indexed Database Searching
- Berkeley DB: key-value store
- Pre-compute
  - Key: mass of peptide
  - Value: location and length of peptide
- Multiple index files
- Time (per query): < 1 sec
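The idea of a pre-computed mass index can be illustrated with a sorted in-memory list and binary search. This is a stand-in for the Berkeley DB store named above, not its API; the `max_len` cap, the function names, and the tuple layout are assumptions.

```python
import bisect

# Monoisotopic residue masses in daltons (standard values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565


def build_mass_index(proteins, max_len=30):
    """Pre-compute (mass, protein index, start, length) entries, sorted by mass."""
    entries = []
    for pid, seq in enumerate(proteins):
        for start in range(len(seq)):
            mass = WATER
            for end in range(start, min(len(seq), start + max_len)):
                mass += MONO[seq[end]]
                entries.append((mass, pid, start, end - start + 1))
    entries.sort()
    return entries


def query_mass_index(index, target_mass, tol):
    """Return (protein index, start, length) for entries within tol of target_mass."""
    # A 1-tuple sorts before any longer tuple with the same first element.
    lo = bisect.bisect_left(index, (target_mass - tol,))
    # Pairing the upper bound with +inf keeps entries exactly at the bound.
    hi = bisect.bisect_right(index, (target_mass + tol, float("inf")))
    return [(pid, start, length) for _, pid, start, length in index[lo:hi]]
```

The index is built once; each query is then two binary searches plus a slice, instead of a full scan of the protein database.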

25 Challenge
- Works well for constrained database searches:
  - Time to generate
  - Size
- Issues with unconstrained searches
- Potential solution:
  - Parallel generation and query
  - Simple synchronization primitives and multiple index files facilitate this

26 Candidate Peptide Scoring
- Embarrassingly parallel
  - For each spectrum, searching and scoring/ranking is independent of the others
- Utilize multiprocessing

27 Parallel Implementation
- Inputs: spectra; protein database (index)
- Main (parent) process:
  1. Read and preprocess spectra
  2. Query protein database
  - Put spectrum/candidates on shared queue
- FIFO task (input) queue; each object holds:
  1. i-th spectrum
  2. Candidates
- Child processes 1 to N: extract object from queue, score and rank, put object on results queue
- FIFO results (output) queue: scored results for the i-th spectrum
- Output (child) process: extract object from queue, write results to file
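The queue layout above can be sketched with Python's standard `multiprocessing` module. This is a minimal illustration under assumptions, not the authors' implementation: `score_and_rank` is a hypothetical placeholder that ranks candidates by closeness to a precursor mass rather than by the CSPI score, and the task tuple layout, sentinel convention, and output format are made up for the example.

```python
import multiprocessing as mp

SENTINEL = None  # assumed convention: one sentinel per worker signals "no more tasks"


def score_and_rank(precursor_mass, candidates):
    """Placeholder scorer (NOT the CSPI score): closer mass means higher score."""
    scored = [(pep, -abs(mass - precursor_mass)) for pep, mass in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def worker(task_q, result_q):
    """Child process: pull tasks until the sentinel arrives, push ranked results."""
    while True:
        task = task_q.get()
        if task is SENTINEL:
            break
        spec_id, precursor_mass, candidates = task
        result_q.put((spec_id, score_and_rank(precursor_mass, candidates)))


def writer(result_q, path, n_results):
    """Output process: write the top-ranked candidate for each spectrum."""
    with open(path, "w") as fh:
        for _ in range(n_results):
            spec_id, ranked = result_q.get()
            best = ranked[0][0] if ranked else ""
            fh.write(f"{spec_id}\t{best}\n")


def run_pipeline(tasks, path, n_workers=2):
    """Parent process: start the children, feed the task queue, wait for completion."""
    task_q, result_q = mp.Queue(), mp.Queue()
    children = [mp.Process(target=worker, args=(task_q, result_q)) for _ in range(n_workers)]
    children.append(mp.Process(target=writer, args=(result_q, path, len(tasks))))
    for proc in children:
        proc.start()
    for task in tasks:
        task_q.put(task)
    for _ in range(n_workers):
        task_q.put(SENTINEL)
    for proc in children:
        proc.join()
```

When used as a script, the call to `run_pipeline` belongs under an `if __name__ == "__main__":` guard so that child processes can import the module safely. Results may arrive out of order, which is why each result carries its spectrum identifier.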

29 Challenges and Potential Solutions
- Spectrum-level parallelization
- Candidate-level optimization can provide further gains
  - Non-trivial: careful profiling of individual steps
  - IPC overhead vs. performance gain
  - Protein database size
  - Search constraints

30 Conclusions and Future Work
- Complex and computationally intensive algorithms
- Collaborative efforts are required for robust analyses (evidence combination)
  - requires efficient processing
  - better parameter estimates
- Further efficiency improvements
- Other applications:
  - Time-series
  - Gene-expression + protein-expression
  - MicroRNA expression + gene expression
  - Stimulus/response

31 Acknowledgements
- Funding agencies: This work was supported in part by the following grants: NIGMS Award Number K25GM071951, NIH Award Number P41RR and NLM Award Number R01LM to Dr. Vanathi Gopalakrishnan.

32 Thanks
Questions?
