SeqAn and OpenMS Integration Workshop
Temesgen Dadi, Julianus Pfeuffer, Alexander Fillbrunn
The Center for Integrative Bioinformatics (CIBI)

Mass-spectrometry data analysis in KNIME
Julianus Pfeuffer, Alexander Fillbrunn

OpenMS
  OpenMS: an open-source C++ framework for computational mass spectrometry
  Jointly developed at ETH Zürich, FU Berlin, and the University of Tübingen
  Open source: BSD 3-clause license
  Portable: available on Windows, OS X, and Linux
  Vendor-independent: supports all standard formats, and vendor formats through ProteoWizard

OpenMS TOPP tools
  TOPP: The OpenMS Proteomics Pipeline tools
  Building blocks: one application for each analysis step
  All applications share identical user interfaces
  Uses PSI standard formats
  Can be integrated into various workflow systems: Galaxy, WS-PGRADE/gUSE, KNIME
  Kohlbacher et al., Bioinformatics (2007), 23:e191

OpenMS Tools in KNIME
  OpenMS tools are wrapped in KNIME via GenericKNIMENodes (GKN)
  Every tool writes its CommonToolDescription (CTD) via its command line parser
  GKN generates Java source code so that the nodes show up in KNIME
  GKN wraps the C++ executables and provides file handling nodes

Installation of the OpenMS plugin
  Community-contributions update site (stable & trunk)
  Bioinformatics & NGS
  Provides > 180 OpenMS TOPP tools as Community nodes
  SILAC, iTRAQ, TMT, label-free, SWATH, SIP
  Search engines: OMSSA, Mascot, X!Tandem, MS-GF+
  Protein inference: Fido

Data Flow in Shotgun Proteomics
  [Figure: Sample -> HPLC/MS -> Raw Data (100 GB) -> Sig. Proc. -> Peak Data (1 GB) -> Data Reduction -> Maps (50 MB) -> Identification -> Annotated Maps (50 MB) -> Diff. Quant. -> Differentially Expressed Proteins (50 kB)]

Quantification Strategies
  Quantitative Proteomics
    Relative Quantification
      Labeled
        In vivo: 14N/15N, SILAC
        In vitro: iTRAQ, TMT, 16O/18O
      Label-Free
        Spectral Counting
        MRM
        Feature-Based
    Absolute Quantification
      AQUA
      SISCAPA
  After: Lau et al., Proteomics, 2007, 7, 2787

Quantitative Data: LC-MS Maps
  Spectra are acquired at rates of up to dozens per second
  Stacking the spectra yields maps
  Resolution: up to millions of points per spectrum, tens of thousands of spectra per LC run
  Huge 2D datasets of up to hundreds of GB per sample
  MS intensity follows the chromatographic concentration
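The idea that stacking spectra yields a map can be sketched in a few lines of Python. This is only an illustration: the function names, the tuple layout `(rt, [(mz, intensity), ...])`, the bin width, and the peak values are all assumptions made for this example, not part of OpenMS.

```python
from collections import defaultdict


def stack_spectra(spectra, mz_bin_width=1.0):
    """Stack (rt, [(mz, intensity), ...]) spectra into a sparse 2D map.

    The result maps (rt, mz_bin_index) to the summed intensity in that cell.
    """
    lcms_map = defaultdict(float)
    for rt, peaks in spectra:
        for mz, intensity in peaks:
            mz_bin = int(mz // mz_bin_width)
            lcms_map[(rt, mz_bin)] += intensity
    return dict(lcms_map)


def chromatogram(lcms_map, mz_bin):
    """Return the intensity over retention time for one m/z bin, sorted by rt."""
    trace = [(rt, inten) for (rt, b), inten in lcms_map.items() if b == mz_bin]
    return sorted(trace)


# Three made-up spectra acquired at retention times 10, 11 and 12.
spectra = [
    (10.0, [(500.2, 100.0), (500.7, 20.0), (733.4, 5.0)]),
    (11.0, [(500.3, 400.0), (733.4, 6.0)]),
    (12.0, [(500.2, 150.0)]),
]
lcms_map = stack_spectra(spectra)
trace = chromatogram(lcms_map, 500)
```

Reading the map along the retention-time axis for one m/z bin gives a trace that rises and falls as the analyte elutes, which is the sense in which MS intensity follows the chromatographic concentration.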

LC-MS Data (Map)
  Quantification (15 nmol/µl, 3x over-expressed)

Label-Free Quantification (LFQ)
  Label-free quantification is probably the most natural way of quantifying
  No labeling required: this removes further sources of error, places no restriction on sample generation, and is cheap
  Data on different samples are acquired in different measurements, so higher reproducibility is needed
  Manual analysis is difficult
  Scales very well with the number of samples: there is basically no limit, and the analysis is the same for 2 or 100 samples

LFQ Analysis Strategy
  1. Find features in all maps
  2. Align maps
  3. Link corresponding features
  4. Identify features (example: GDAFFGMSCK)
  5. Quantify (example: GDAFFGMSCK at 1.0 : 1.2 : 0.5)
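The final quantification step reports a ratio such as 1.0 : 1.2 : 0.5 for one peptide across maps. A minimal sketch of that arithmetic follows, assuming the linked feature carries one intensity per map. The intensity values and the function name `relative_abundances` are invented for illustration; only the peptide sequence and the resulting ratio come from the slide.

```python
def relative_abundances(intensities, reference_index=0):
    """Express each map's feature intensity relative to a reference map."""
    ref = intensities[reference_index]
    if ref <= 0:
        raise ValueError("reference intensity must be positive")
    return [round(x / ref, 2) for x in intensities]


# One linked (consensus) feature; the three intensities are made up so that
# they reproduce the ratio shown on the slide.
consensus_feature = {
    "peptide": "GDAFFGMSCK",
    "intensities": [2.0e6, 2.4e6, 1.0e6],
}
ratios = relative_abundances(consensus_feature["intensities"])
ratio_text = " : ".join(f"{r:.1f}" for r in ratios)
```

Using the first map as the reference is an arbitrary choice here. Any map, or a summary over all maps, could serve as the denominator.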

Feature-Based Alignment
  LC-MS maps can contain millions of peaks
  Retention time of peptides and metabolites can shift between experiments
  In label-free quantification, maps thus need to be aligned in order to identify corresponding features
  Alignment can be done on the raw maps (where it is usually called dewarping) or on already identified features
  The latter is simpler, as it does not require the alignment of millions of peaks, but just of tens of thousands of features
  Disadvantage: it relies on accurate feature finding
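A feature-based alignment of two maps can be sketched as follows, assuming each feature is reduced to an `(rt, mz)` pair and that the retention-time shift is roughly linear. Both are simplifying assumptions for this example. The helper names, tolerances, and data are hypothetical, and this is not the algorithm used by the OpenMS map alignment tools.

```python
def match_features(ref, other, mz_tol=0.01, rt_tol=60.0):
    """Pair each feature in `other` with the closest-rt feature in `ref`
    that lies within the m/z and rt tolerances."""
    pairs = []
    for rt_o, mz_o in other:
        candidates = [
            (abs(rt_r - rt_o), rt_r)
            for rt_r, mz_r in ref
            if abs(mz_r - mz_o) <= mz_tol and abs(rt_r - rt_o) <= rt_tol
        ]
        if candidates:
            pairs.append((rt_o, min(candidates)[1]))
    return pairs


def fit_linear(pairs):
    """Least-squares fit of rt_ref = slope * rt_other + intercept."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two matched features")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    if sxx == 0:
        raise ValueError("matched features share one retention time")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up features: the second map elutes 10 s later than the reference.
reference = [(100.0, 400.10), (200.0, 500.20), (300.0, 600.30)]
shifted = [(110.0, 400.10), (210.0, 500.20), (310.0, 600.30)]

slope, intercept = fit_linear(match_features(reference, shifted))
aligned = [(slope * rt + intercept, mz) for rt, mz in shifted]
```

Only three feature pairs are needed here to estimate the transformation, which illustrates why working on features is cheaper than dewarping millions of raw peaks. It also shows the disadvantage named above: a missed or misplaced feature directly distorts the fit.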

Feature-Based Alignment
  ~350,000 peaks, ~700 features

Feature Finding
  Identify all peaks belonging to one peptide
  Key idea:
    Identify suspicious regions (e.g. highest peaks)
    Fit a model to that region and identify the peaks explained by it

Feature Finding
  Extension: collect all data points close to the seed
  Refinement: remove peaks that are not consistent with the model
  Fit an optimal model for the reduced set of peaks
  Iterate until no further improvement can be achieved
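The seed, extension, and refinement loop can be illustrated with a deliberately simplified sketch. It assumes peaks are `(rt, mz, intensity)` tuples and models only the elution profile, as a Gaussian centred on the seed with a width estimated from intensity-weighted moments. The windows, the tolerance, and all function names are assumptions for this example. Real feature finders, including those in OpenMS, also model isotope patterns and charge states.

```python
import math


def fit_elution_model(seed, region):
    """Gaussian elution model pinned to the seed: (center, sigma, height)."""
    center, height = seed[0], seed[2]
    total = sum(i for _, _, i in region)
    var = sum(i * (rt - center) ** 2 for rt, _, i in region) / total
    return center, max(math.sqrt(var), 1e-6), height


def model_intensity(rt, model):
    center, sigma, height = model
    return height * math.exp(-0.5 * ((rt - center) / sigma) ** 2)


def find_feature(peaks, rt_window=5.0, mz_window=0.05, tolerance=0.3, max_iter=10):
    # Seed: the most intense peak.
    seed = max(peaks, key=lambda p: p[2])
    # Extension: collect all peaks close to the seed in rt and m/z.
    region = [p for p in peaks
              if abs(p[0] - seed[0]) <= rt_window and abs(p[1] - seed[1]) <= mz_window]
    for _ in range(max_iter):
        model = fit_elution_model(seed, region)
        # Refinement: keep only peaks whose intensity the model explains.
        kept = [p for p in region
                if abs(p[2] - model_intensity(p[0], model)) <= tolerance * model[2]]
        if not kept or len(kept) == len(region):
            break  # nothing removed (or nothing left): stop iterating
        region = kept
    return region


# Made-up peaks: a Gaussian-like elution profile at m/z 500.25, one weak
# interfering peak nearby, and two peaks outside the extension windows.
peaks = [
    (10.0, 500.25, 135.0), (11.0, 500.25, 607.0), (12.0, 500.25, 1000.0),
    (13.0, 500.25, 607.0), (14.0, 500.25, 135.0),
    (13.0, 500.29, 40.0),    # inconsistent with the elution model
    (12.0, 700.40, 300.0),   # too far away in m/z
    (30.0, 500.25, 200.0),   # too far away in rt
]
feature = find_feature(peaks)
feature_rts = sorted(p[0] for p in feature)
```

In this toy data the interfering peak is dropped in the first refinement pass, the model is refitted on the remaining five peaks, and the second pass removes nothing, so the loop stops.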

Multiple Alignment
  Dewarp k maps onto a comparable coordinate system
  Choose one map (usually the one with the largest number of features) as the reference map (here: map 2, so T_2 = 1, the identity)
  [Figure: Map 1, Map 2, ..., Map k (axes: rt, m/z) are mapped by transformations T_1, T_2, ..., T_k onto a consensus map]
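Choosing the reference map and deriving one transformation per map might look like the sketch below. It assumes features are `(rt, mz)` pairs and, as a strong simplification, that each transformation T_i is a constant retention-time offset estimated as the median difference over features matched by m/z. The names and values are hypothetical.

```python
from statistics import median


def choose_reference(maps):
    """Index of the map with the largest number of features."""
    return max(range(len(maps)), key=lambda i: len(maps[i]))


def rt_shift(ref, other, mz_tol=0.01):
    """Median rt offset (ref minus other) over features matched by m/z."""
    diffs = [rt_r - rt_o
             for rt_o, mz_o in other
             for rt_r, mz_r in ref
             if abs(mz_r - mz_o) <= mz_tol]
    return median(diffs) if diffs else 0.0


def align_maps(maps):
    ref_idx = choose_reference(maps)
    # The reference map keeps its coordinates (identity transformation).
    shifts = [0.0 if i == ref_idx else rt_shift(maps[ref_idx], m)
              for i, m in enumerate(maps)]
    aligned = [[(rt + s, mz) for rt, mz in m] for m, s in zip(maps, shifts)]
    return ref_idx, shifts, aligned


# Three made-up maps; the second has the most features.
maps = [
    [(105.0, 400.10), (205.0, 500.20)],
    [(100.0, 400.10), (200.0, 500.20), (300.0, 600.30)],
    [(97.0, 400.10), (297.0, 600.30)],
]
ref_idx, shifts, aligned = align_maps(maps)
```

After alignment all maps share the reference map's retention-time axis, so corresponding features can be linked into a consensus map.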

LFQ with OpenMS in KNIME
  Identification
  Feature finding and mapping
  Map alignment
  Feature linking
  Statistical analysis with R Snippets
  Visualization with KNIME plotting nodes

Preprocessing of single maps

Combining information of maps

Statistical post-processing and visualization