Statistical Clustering of Vesicle Patterns: Practical Aspects of the Analysis of Large Datasets with R

Mirko Birbaumer (birbaumer@imsb.biol.ethz.ch)
Rmetrics Workshop, 3rd July 2008

Introduction: Pattern Recognition in Biological Systems

What happens if we silence a gene in a cell and then add infectious viruses?
- RNA interference enables the silencing of single genes (Nobel Prize in Physiology or Medicine, 2006).
- A library covering 22,000 genes in human cells is exploited to study the function of genes.
- Clinical and pharmaceutical purpose: which proteins are essential for virus entry? Are there drugs that target these proteins?
- Required technology and techniques: fully automated microscopy, image analysis and statistical data analysis.
- Bioconductor: an R project for the analysis and comprehension of genomic data.

Introduction: RNA Interference Experiment

Each well contains thousands of cells in which the expression of one particular gene is knocked down.
[Figure: how a virus enters a cell and how it is transported within the cell.]

High-Throughput Screening: Vesicle Patterns

[Figure: patterns of transferrin in Hep2Beta cells upon silencing of a gene.]

High-Throughput Screening: Overview of Image Analysis Pipeline

High-Throughput Screening: Cell Classification

A GUI is used for cell classification based on nuclei features and an SVM (R package e1071). The GUI was written in Python (Tkinter), with rpy serving as the interface to R.

High-Throughput Screening: Cell Classification

- 9 cell types are distinguished and classified via the GUI according to 52 intensity and texture features (CellProfiler).
- Based on the SVM, all detected cells are classified, and the cells that are in focus and in interphase are selected.
- The cell types are grouped into 5 classes: Interphase (1), Prophase/Prometaphase/Apoptotic (2), Metaphase (3), Anaphase I/Anaphase II/Telophase (4) and Artefact (5).
- Total accuracy of 10-fold cross-validation: 96.63%.

Confusion matrix, based on 4000 training data points (nuclei) and tested on 2800 nuclei:

                     Class 1  Class 2  Class 3  Class 4  Class 5
predicted Class 1       1998        1        0        0        0
predicted Class 2          2       99        3        0        0
predicted Class 3          0        0      297        0        0
predicted Class 4          0        0        0       50        7
predicted Class 5          0        0        0        0      342
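
The confusion matrix can be summarised with a few lines of Python. This is only a sketch: it assumes that rows are predicted classes and columns are actual classes (the slide labels only the rows), and the function names are illustrative. Note that the entries as transcribed sum to 2799 rather than 2800.

```python
# Confusion matrix from the slide.
# Assumption: rows = predicted class, columns = actual class.
CONFUSION = [
    [1998, 1, 0, 0, 0],
    [2, 99, 3, 0, 0],
    [0, 0, 297, 0, 0],
    [0, 0, 0, 50, 7],
    [0, 0, 0, 0, 342],
]


def overall_accuracy(matrix):
    """Fraction of nuclei on the diagonal (predicted class == actual class)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def recall_per_class(matrix):
    """Per actual class (column): fraction of its nuclei that were recognised."""
    n = len(matrix)
    column_totals = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    return [matrix[c][c] / column_totals[c] for c in range(n)]


def precision_per_class(matrix):
    """Per predicted class (row): fraction of its predictions that were right."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


if __name__ == "__main__":
    print("accuracy :", round(overall_accuracy(CONFUSION), 4))
    print("recall   :", [round(v, 4) for v in recall_per_class(CONFUSION)])
    print("precision:", [round(v, 4) for v in precision_per_class(CONFUSION)])
```

This accuracy describes the held-out test nuclei in the table, so it need not equal the 10-fold cross-validation figure quoted on the slide.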

Case Study: Transferrin Phenotypes

How can we classify these patterns in an unsupervised manner?

Case Study: Feature Selection

Vesicle features:
- relative distance to the nucleus, radius, ellipticity, flux and area of vesicles, number of vesicles per cell area

Spatial point patterns:
- number of vesicles within 18 and 25 pixels around each vesicle
- radius (in pixels) around each vesicle within which 40 and 60 percent of all vesicles are contained
- clustering tendency (Ripley's K function)
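
The spatial point-pattern features can be sketched in plain Python as follows. This is not the pipeline used in the talk: the function names are hypothetical, a vesicle is assumed not to count itself as a neighbour, and the Ripley's K estimate is the naive one without edge correction.

```python
import math


def pairwise_distances(points):
    """Euclidean distance between every pair of vesicle centres (in pixels)."""
    return [[math.dist(p, q) for q in points] for p in points]


def neighbours_within(points, radius):
    """For each vesicle, the number of other vesicles within `radius` pixels."""
    distances = pairwise_distances(points)
    return [
        sum(1 for j, d in enumerate(row) if j != i and d <= radius)
        for i, row in enumerate(distances)
    ]


def radius_containing(points, fraction):
    """For each vesicle, the smallest radius that contains `fraction`
    of all other vesicles (assumption: the vesicle itself is excluded)."""
    distances = pairwise_distances(points)
    radii = []
    for i, row in enumerate(distances):
        others = sorted(d for j, d in enumerate(row) if j != i)
        k = max(1, math.ceil(fraction * len(others)))
        radii.append(others[k - 1])
    return radii


def ripley_k(points, radius, area):
    """Naive Ripley's K estimate without edge correction:
    K(r) = area / (n * (n - 1)) * number of ordered pairs with distance <= r."""
    n = len(points)
    ordered_pairs = sum(neighbours_within(points, radius))
    return area * ordered_pairs / (n * (n - 1))


if __name__ == "__main__":
    vesicles = [(0, 0), (3, 4), (6, 8), (100, 100)]  # toy coordinates
    print(neighbours_within(vesicles, 18))
    print(radius_containing(vesicles, 0.4))
```

For a completely random pattern K(r) is close to pi * r^2, so values well above that suggest clustering at scale r.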

Case Study: Outline of Statistical Analysis

- All detected vesicles (11 features each) are written into one file.
- Principal component analysis of this data file: 3 principal components are chosen.
- Parameterized Gaussian mixture modeling (package MCLUST) is applied to data files that each contain all vesicles of a single cell, described by the 3 principal components:

$$ f(x_i) = \sum_{k=1}^{G} \tau_k \, \Phi_k(x_i \mid \mu_k, \Sigma_k) \tag{1} $$

where $x_i$ is the feature vector of a vesicle, $G$ is the number of components, $\tau_k$ is the probability that an observation belongs to the $k$th component ($\sum_{k=1}^{G} \tau_k = 1$), and $\Phi_k$ is a normal distribution with mean $\mu_k$ and covariance matrix $\Sigma_k$.
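
To make the mixture density concrete, here is a minimal sketch that evaluates it and assigns an observation to its most probable component. It is deliberately simplified to one dimension (the talk uses 3 principal components and full covariance matrices), and the parameter values in the demo are made up for illustration.

```python
import math


def normal_pdf(x, mean, variance):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / variance) / math.sqrt(
        2.0 * math.pi * variance
    )


def mixture_density(x, weights, means, variances):
    """f(x) = sum_k tau_k * Phi_k(x | mu_k, sigma_k^2)."""
    return sum(
        t * normal_pdf(x, m, v) for t, m, v in zip(weights, means, variances)
    )


def posterior_probabilities(x, weights, means, variances):
    """Probability that x belongs to each component (Bayes' rule)."""
    joint = [t * normal_pdf(x, m, v) for t, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]


def classify(x, weights, means, variances):
    """Index of the component with the highest posterior probability."""
    post = posterior_probabilities(x, weights, means, variances)
    return max(range(len(post)), key=post.__getitem__)


if __name__ == "__main__":
    # Made-up parameters, for illustration only.
    tau = [0.5, 0.3, 0.2]
    mu = [-2.0, 0.0, 3.0]
    var = [1.0, 0.5, 2.0]
    for x in (-2.1, 0.2, 3.5):
        print(x, classify(x, tau, mu, var))
```

Assigning each observation to the component with the largest posterior probability is the usual way a fitted mixture model is turned into a classification.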

Case Study: Vesicle Classification

Based on the Bayesian Information Criterion (BIC), 5 vesicle classes are assumed; each vesicle is assigned to one of these 5 groups.
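
How a BIC comparison selects the number of components can be sketched as follows. The sketch uses the sign convention of mclust (BIC = 2 log L - m log n, larger is better) and assumes an unconstrained covariance matrix per component; the talk does not state which covariance model was selected, and the log-likelihood values in the demo are made up.

```python
import math


def n_params_unconstrained(n_components, dim):
    """Free parameters of a Gaussian mixture with a full covariance
    matrix per component (assumed model, not stated in the talk)."""
    mixing = n_components - 1                            # tau_k sum to 1
    means = n_components * dim                           # mu_k
    covariances = n_components * dim * (dim + 1) // 2    # symmetric Sigma_k
    return mixing + means + covariances


def bic(log_likelihood, n_params, n_obs):
    """BIC in the mclust sign convention: larger values are better."""
    return 2.0 * log_likelihood - n_params * math.log(n_obs)


def best_number_of_components(log_likelihoods, dim, n_obs):
    """Pick G with the highest BIC from a {G: maximised log-likelihood} dict."""
    scores = {
        g: bic(ll, n_params_unconstrained(g, dim), n_obs)
        for g, ll in log_likelihoods.items()
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Made-up log-likelihoods for G = 1..7, for illustration only.
    example = {1: -5200.0, 2: -4900.0, 3: -4750.0, 4: -4650.0,
               5: -4560.0, 6: -4545.0, 7: -4535.0}
    best_g, scores = best_number_of_components(example, dim=3, n_obs=1000)
    print("selected G:", best_g)
```

The penalty term grows with the number of parameters, so an extra component is only accepted if it raises the log-likelihood enough to pay for itself.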

Case Study: Principal Component Analysis

Coordinate projection of vesicles and projected covariance matrices of the vesicle groups.

Case Study: Principal Component Analysis

Biplot: the original features of all vesicles are projected onto the first 2 principal components.

Case Study: Vesicle Groups

All vesicles are marked according to the vesicle group they belong to.

Case Study: Clustering Tree

Hierarchical clustering tree: cells with similar patterns (and similar biological perturbations) group together.
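
A clustering tree of this kind can be built from a matrix of pairwise distances between cells. The sketch below uses average linkage, which is an assumption: the talk does not say which linkage or which distance between cells was used, and the toy distance matrix is invented.

```python
def average_linkage(dist):
    """Agglomerative clustering on a symmetric distance matrix.

    Returns the merge history as a list of
    (members_of_cluster_a, members_of_cluster_b, merge_height) tuples.
    """
    clusters = {i: [i] for i in range(len(dist))}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for pos, a in enumerate(keys):
            for b in keys[pos + 1:]:
                # Average distance over all pairs of members.
                pairs = [dist[i][j] for i in clusters[a] for j in clusters[b]]
                height = sum(pairs) / len(pairs)
                if best is None or height < best[0]:
                    best = (height, a, b)
        height, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), height))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges


if __name__ == "__main__":
    # Toy distances between 4 cells: cells 0/1 and cells 2/3 look alike.
    toy = [
        [0.0, 1.0, 10.0, 10.0],
        [1.0, 0.0, 10.0, 10.0],
        [10.0, 10.0, 0.0, 1.5],
        [10.0, 10.0, 1.5, 0.0],
    ]
    for left, right, height in average_linkage(toy):
        print(left, "+", right, "at height", height)
```

The merge heights are what a dendrogram draws on its vertical axis: cells that merge at a low height have similar vesicle patterns.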

Case Study: Evaluation of Kinome Screens

- Aim: find, in an unsupervised manner, well-distinguishable patterns and the corresponding proteins, and characterize these patterns as functional modules of proteins.
- Ultimate reduction of the data sets: a distance matrix.

Distributed Computing: Large Dataset Handling

Principal component analysis of a data matrix of 3 GB size
- Solution: calculate the covariance matrix with a Fortran program; the eigenvectors and principal components are then calculated in R.
- Question: how flexible are R functions when it comes to large datasets?

2500 data files need to be processed by the mclust package
- Processing time per file: 10-15 minutes (at least 3 weeks in total).
- Solution: Condor. Distributing the jobs among 60 to 70 computers brings this down to less than 2 days of processing time.

Condor is an open-source software framework for distributing computationally intensive tasks across many machines (it runs under Linux, Mac and Windows). It can be used to manage the workload on a dedicated cluster of computers or on non-dedicated desktop machines.
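
The reason the covariance step can be moved out of R is that a covariance matrix can be accumulated in a single pass, without ever holding the whole data matrix in memory. The talk used a Fortran program for this; the following is a Python sketch of the same idea using a Welford-style update, with a hypothetical function name.

```python
def streaming_covariance(rows):
    """One-pass mean vector and sample covariance matrix.

    `rows` may be any iterable of equal-length numeric sequences, for
    example a generator that parses one line of a large file at a time.
    """
    n = 0
    mean = None
    comoment = None
    for row in rows:
        if mean is None:
            d = len(row)
            mean = [0.0] * d
            comoment = [[0.0] * d for _ in range(d)]
        n += 1
        # Deviation from the mean *before* this row is included.
        delta = [x - m for x, m in zip(row, mean)]
        mean = [m + dl / n for m, dl in zip(mean, delta)]
        for i in range(d):
            for j in range(i, d):
                # Uses the old mean for i and the updated mean for j.
                comoment[i][j] += delta[i] * (row[j] - mean[j])
    if n < 2:
        raise ValueError("need at least two rows")
    cov = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i, d):
            cov[i][j] = cov[j][i] = comoment[i][j] / (n - 1)
    return mean, cov


if __name__ == "__main__":
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
    print(streaming_covariance(iter(data)))
```

The resulting matrix is only d x d (11 x 11 for the vesicle features), so its eigenvectors can be computed in any environment regardless of how many rows were streamed.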

Distributed Computing: How to Submit a Condor Job

A job, described in the description file Genname_descr.txt, is submitted for execution under Condor by:

    condor_submit Genname_descr.txt

Organisation of the description file Genname_descr.txt:

    getenv = True
    when_to_transfer_output = ON_EXIT_OR_EVICT
    notification = never
    universe = vanilla
    Executable = condor.py
    log = condorlogs/genname.log
    output = condorlogs/genname.out
    error = condorlogs/genname.error
    GetEnv = True
    transfer_input_files = Genname.mat,mcl.R
    arguments = Genname.mat
    queue
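
With 2500 data files, the description files are best generated rather than written by hand. The sketch below shows one way to do that; the helper names and the placeholder gene name are hypothetical, only the settings listed on the slide are emitted (with the underscore spellings that Condor expects), and it assumes one `<gene>.mat` input file per job.

```python
from pathlib import Path

# Settings taken from the slide; {gene} is filled in per job.
DESCRIPTION_TEMPLATE = """\
getenv = True
when_to_transfer_output = ON_EXIT_OR_EVICT
notification = never
universe = vanilla
Executable = condor.py
log = condorlogs/{gene}.log
output = condorlogs/{gene}.out
error = condorlogs/{gene}.error
transfer_input_files = {gene}.mat,mcl.R
arguments = {gene}.mat
queue
"""


def render_description(gene):
    """Return the text of the submit description file for one gene."""
    return DESCRIPTION_TEMPLATE.format(gene=gene)


def write_description(gene, directory="."):
    """Write <gene>_descr.txt into `directory` and return its path."""
    path = Path(directory) / f"{gene}_descr.txt"
    path.write_text(render_description(gene))
    return path


if __name__ == "__main__":
    # "GENE1" is a placeholder, not a gene from the screen.
    print(render_description("GENE1"))
```

Each generated file would then be handed to `condor_submit` in turn, for example from a shell loop.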

Outlook: Pattern Matching

Michael Rissi's ideas about a vectorised pattern-matching algorithm using NVIDIA's GPU programming tool CUDA
- GPU: vector processor with fast access to 64k of shared memory (shareable between threads).
- CUDA: NVIDIA's programming tools for their 8xxx series graphics cards. Standard C with some extensions for low-level programming, as well as an API for high-level programming (e.g. math libraries, FFT). Announced: double precision.
- General idea: produce a database of vesicle patterns within a cell, then compare cells from a dataset with these patterns. This is a trivially vectorizable problem.
- Huge acceleration possible on GPUs (estimated speed-up factor: 40-100).

Contact: rissim@particle.phys.ethz.ch
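
Why this problem vectorizes so easily can be seen in a small sketch. It assumes, purely for illustration, that each cell and each database pattern is summarised by a fixed-length numeric vector and compared by squared Euclidean distance; the talk does not specify the representation or the similarity measure, and the sketch is plain Python rather than CUDA.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_match(cell, patterns):
    """Index of the closest database pattern and its squared distance."""
    distances = [squared_distance(cell, p) for p in patterns]
    index = min(range(len(distances)), key=distances.__getitem__)
    return index, distances[index]


def match_all(cells, patterns):
    """Match every cell against the whole pattern database.

    Each (cell, pattern) comparison is independent of all others, which
    is what would allow one GPU thread per comparison.
    """
    return [best_match(cell, patterns) for cell in cells]


if __name__ == "__main__":
    database = [(0.0, 0.0), (10.0, 10.0)]   # toy pattern vectors
    cells = [(1.0, 1.0), (9.0, 8.0)]        # toy cell vectors
    print(match_all(cells, database))
```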

Thanks for your attention!