Statistical Clustering of Vesicle Patterns
Practical Aspects of the Analysis of Large Datasets with R

Mirko Birbaumer (birbaumer@imsb.biol.ethz.ch)
Rmetrics Workshop, 3rd July 2008
Introduction: Pattern Recognition in Biological Systems

What happens if we silence a gene in a cell and add infectious viruses?

- RNA interference enables the silencing of single genes (Nobel Prize in Physiology or Medicine, 2006).
- A library covering 22,000 genes in human cells is exploited to study the function of genes.
- Clinical and pharmaceutical purpose: which proteins are essential for virus entry? Are there drugs that target these proteins?
- Required technology and techniques: fully automated microscopy, image analysis, and statistical data analysis.
- Bioconductor: an R project for the analysis and comprehension of genomic data.

Answer: for some genes we observe strikingly different patterns!
Introduction

RNA interference experiment: each well contains thousands of cells in which the expression of a particular gene is knocked down.

Figure: how a virus enters a cell and how it is transported within the cell.
High-Throughput Screening: Vesicle Patterns

Figure: patterns of transferrin in Hep2Beta cells upon silencing of a gene.
High-Throughput Screening: Overview of the Image Analysis Pipeline
High-Throughput Screening: Cell Classification

A GUI supports cell classification based on nuclei features and an SVM (R package e1071). The GUI was written in Python (Tkinter), with rpy used as the interface to R.
High-Throughput Screening: Cell Classification

- 9 cell types are distinguished and classified via the GUI according to 52 intensity and texture features (CellProfiler).
- Based on the SVM, all detected cells are classified, and in-focus interphase cells are selected.
- 5 classes: Interphase (1), Prophase/Prometaphase/Apoptotic (2), Metaphase (3), Anaphase I/Anaphase II/Telophase (4) and Artefact (5).
- Total accuracy of 10-fold CV: 96.63%.

Confusion matrix based on 4000 training data points (nuclei) and tested on 2800 nuclei:

                     Class 1  Class 2  Class 3  Class 4  Class 5
  predicted Class 1     1998        1        0        0        0
  predicted Class 2        2       99        3        0        0
  predicted Class 3        0        0      297        0        0
  predicted Class 4        0        0        0       50        7
  predicted Class 5        0        0        0        0      342
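As a small worked example, the test-set figures in the confusion matrix can be summarised as follows. This is only a sketch: it assumes that columns are the true classes and rows the predicted classes, and the function names are hypothetical, not taken from the talk.

```python
# Confusion matrix from the slide.
# Assumption: rows = predicted class 1..5, columns = true class 1..5.
confusion = [
    [1998, 1, 0, 0, 0],
    [2, 99, 3, 0, 0],
    [0, 0, 297, 0, 0],
    [0, 0, 0, 50, 7],
    [0, 0, 0, 0, 342],
]


def overall_accuracy(matrix):
    """Fraction of nuclei on the diagonal (predicted class equals true class)."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def recall_per_class(matrix):
    """For each true class (column), the fraction that was predicted correctly."""
    return [
        matrix[k][k] / sum(row[k] for row in matrix)
        for k in range(len(matrix))
    ]


accuracy = overall_accuracy(confusion)
recalls = recall_per_class(confusion)
```

On these numbers, 2786 of the 2799 tabulated nuclei lie on the diagonal (about 99.5%). This is a test-set figure and is not the same quantity as the 10-fold cross-validation accuracy quoted on the slide.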
Case Study: Transferrin Phenotypes

How can we classify these patterns in an unsupervised manner?
Case Study: Feature Selection

Vesicle features:
- relative distance to the nucleus, radius, ellipticity, flux, area of vesicles, number of vesicles per cell area

Spatial point patterns:
- number of vesicles within 18 and 25 pixels around each vesicle
- radius [pixels] around each vesicle within which 40 and 60 percent of all vesicles are contained
- clustering tendency (Ripley's K function)
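The spatial point-pattern features listed above could be computed along the following lines. This is a minimal sketch and not the pipeline used in the talk: the function names, the exclusion of the vesicle itself from its own neighbourhood, and the naive Ripley's K estimate without edge correction (with a hypothetical `area` parameter for the cell area) are all assumptions.

```python
import math


def neighbours_within(points, radius):
    """For each point, count the other points within `radius` (Euclidean)."""
    return [
        sum(1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= radius)
        for i, p in enumerate(points)
    ]


def radius_containing(points, fraction):
    """For each point, the smallest radius holding `fraction` of the others."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q)
                       for j, q in enumerate(points) if j != i)
        k = max(1, math.ceil(fraction * len(dists)))
        radii.append(dists[k - 1])
    return radii


def ripley_k_naive(points, radius, area):
    """Naive Ripley's K estimate (no edge correction):
    K(r) = area / (n * (n - 1)) * number of ordered pairs closer than r."""
    n = len(points)
    pairs = sum(neighbours_within(points, radius))
    return area * pairs / (n * (n - 1))


# Toy vesicle centres in pixel coordinates (made-up values).
vesicles = [(0, 0), (3, 4), (6, 8), (30, 40)]
counts_18 = neighbours_within(vesicles, 18)
radius_60 = radius_containing(vesicles, 0.6)
k_18 = ripley_k_naive(vesicles, 18, area=100.0)
```

The pairwise loop is quadratic in the number of vesicles per cell, which is acceptable for a sketch; a spatial index would be the natural replacement for large cells.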
Case Study: Outline of Statistical Analysis

- All detected vesicles (11 features) are written into one file.
- Principal component analysis of this data file: 3 principal components are chosen.
- Parameterized Gaussian mixture modeling (package MCLUST) on data files containing all vesicles of a single cell, described by the 3 principal components:

  $$ f(x_i) = \sum_{k=1}^{G} \tau_k \, \Phi_k(x_i \mid \mu_k, \Sigma_k) \qquad (1) $$

  where $x_i$ is the feature vector of a vesicle, $G$ is the number of components, $\tau_k$ is the probability that an observation belongs to the $k$th component ($\sum_{k=1}^{G} \tau_k = 1$), and $\Phi_k$ is a normal distribution with mean $\mu_k$ and covariance matrix $\Sigma_k$.
- The number of components is determined based on the BIC.
- The symmetrized Kullback-Leibler divergence is used as a distance (dissimilarity) measure between two vesicle distributions.
- The distance matrix is represented via hierarchical clustering.

In collaboration with Prof. P. Buehlmann and Dr. M. Kalisch from the SFS, ETH Zurich.
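The slides do not say how the symmetrized Kullback-Leibler divergence between two fitted mixtures is evaluated. Since it has no closed form for Gaussian mixtures, one possibility is a Monte Carlo estimate, sketched below. Everything here is an assumption made for illustration: the function names, the restriction to diagonal covariance matrices (to keep the example free of linear algebra libraries), and the convention of summing both directions without a factor of 1/2.

```python
import math
import random

# A mixture is a list of (tau_k, mean_k, variances_k) tuples, i.e. the
# density of equation (1) restricted to diagonal covariance matrices.


def log_normal_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def log_mixture(x, mix):
    """Log of f(x) = sum_k tau_k * Phi_k(x | mu_k, Sigma_k), via log-sum-exp."""
    terms = [math.log(tau) + log_normal_diag(x, mu, var)
             for tau, mu, var in mix]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def sample_mixture(mix, rng):
    """Pick a component by its weight, then sample each coordinate."""
    weights = [tau for tau, _, _ in mix]
    _, mu, var = rng.choices(mix, weights=weights, k=1)[0]
    return [rng.gauss(m, math.sqrt(v)) for m, v in zip(mu, var)]


def kl_monte_carlo(p, q, n, rng):
    """Estimate KL(p || q) = E_p[log p(x) - log q(x)] from n samples of p."""
    total = 0.0
    for _ in range(n):
        x = sample_mixture(p, rng)
        total += log_mixture(x, p) - log_mixture(x, q)
    return total / n


def symmetrized_kl(p, q, n=2000, seed=0):
    """KL(p || q) + KL(q || p), each estimated by Monte Carlo."""
    rng = random.Random(seed)
    return kl_monte_carlo(p, q, n, rng) + kl_monte_carlo(q, p, n, rng)


# Two toy single-component "cells" in one dimension (made-up parameters).
cell_a = [(1.0, [0.0], [1.0])]
cell_b = [(1.0, [3.0], [1.0])]
d_aa = symmetrized_kl(cell_a, cell_a)
d_ab = symmetrized_kl(cell_a, cell_b)
```

Evaluating this for every pair of cells would fill the distance matrix that is then passed on to hierarchical clustering. The estimate is noisy, so the sample size `n` trades accuracy against run time.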
Case Study: Vesicle Classification

Based on the Bayesian Information Criterion (BIC), 5 vesicle classes are assumed; vesicles are classified according to these 5 groups.
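For reference, the criterion can be written out as below. The formula follows the sign convention documented for the mclust package, in which larger values are preferred; the symbols $\hat{L}_G$, $\nu_G$ and $n$ are introduced here and do not appear on the slides.

```latex
\mathrm{BIC}_G = 2 \log \hat{L}_G - \nu_G \log n ,
\qquad
\hat{G} = \arg\max_{G} \, \mathrm{BIC}_G
```

Here $\hat{L}_G$ is the maximized likelihood of the mixture with $G$ components, $\nu_G$ is its number of free parameters, and $n$ is the number of observations (vesicles). The penalty term $\nu_G \log n$ discourages adding components that improve the fit only marginally.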
Case Study: Principal Component Analysis

Figure: coordinate projection of the vesicles and projected covariance matrices of the vesicle groups.
Case Study: Principal Component Analysis

Biplot: the original features of all vesicles are projected onto the first 2 principal components.
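To illustrate what projecting onto a principal component involves, the sketch below extracts the leading eigenvector of a covariance matrix by power iteration and projects a feature vector onto it. The talk performs this step in R; power iteration is swapped in here only to keep the example dependency-free, and all names are hypothetical. It assumes a non-zero covariance matrix and a start vector that is not orthogonal to the leading eigenvector.

```python
import math


def leading_component(cov, iterations=200):
    """Leading eigenvalue and unit eigenvector of a symmetric matrix,
    obtained by repeated multiplication and renormalisation."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v' C v gives the eigenvalue for the unit vector v.
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)
    )
    return eigenvalue, v


def project(row, mean, direction):
    """Score of one centred feature vector along a principal direction."""
    return sum((x - m) * d for x, m, d in zip(row, mean, direction))


# Toy 2x2 covariance matrix (made-up values).
variance, direction = leading_component([[2.0, 0.0], [0.0, 1.0]])
score = project([3.0, 5.0], [1.0, 1.0], direction)
```

Further components could be obtained by deflating the matrix and repeating, although a library eigen-decomposition is the more usual route in practice.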
Case Study: Vesicle Groups

Each vesicle is marked according to the vesicle group it belongs to.
Case Study: Clustering Tree

Hierarchical clustering tree: cells with similar patterns (and biological perturbation) group together.
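How such a tree arises from a distance matrix can be sketched with a plain agglomerative procedure. The slides do not state which linkage was used; average linkage is assumed here purely for illustration, and the function name and toy distances are made up.

```python
def average_linkage(dist):
    """Agglomerative clustering on a symmetric distance matrix.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of original item indices.
    """
    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between the two clusters.
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


# Toy distance matrix for four cells: 0 and 1 are similar, as are 2 and 3.
distances = [
    [0, 1, 8, 9],
    [1, 0, 7, 8],
    [8, 7, 0, 1.5],
    [9, 8, 1.5, 0],
]
tree = average_linkage(distances)
```

Each entry of `tree` is one join of the dendrogram, listed from the most similar pair upwards, so cells with similar vesicle distributions end up in the same low-level branch.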
Case Study: Evaluation of Kinome Screens

- Aim: find, in an unsupervised manner, well-distinguishable patterns and the corresponding proteins, and characterize these patterns: functional modules of proteins.
- Ultimate reduction of the data sets: the distance matrix.

Figure: phylogenetic tree of the most distant wells within a kinome screen.
Distributed Computing: Large Dataset Handling

Problem: principal component analysis of a data matrix of 3 GB size.
Solution: calculate the covariance matrix with a Fortran program; the eigenvectors and principal components are then calculated in R.
Question: how flexible are R functions when it comes to large datasets?

Problem: 2500 data files need to be processed by the mclust package; processing time per file is 10-15 minutes (minimum 3 weeks in total).
Solution: Condor. Jobs are distributed among 60 to 70 computers: less than 2 days of processing time!

- Condor is an open-source software framework for the distributed parallelization of computationally intensive tasks (runs under Linux, Mac and Windows).
- It can be used to manage the workload on a dedicated cluster of computers or on non-dedicated desktop machines.
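The Fortran program itself is not shown in the talk. As an illustration of why the covariance step does not need the whole 3 GB matrix in memory, the sketch below accumulates sums and cross-products in a single pass over the rows. The function name is hypothetical, and the sum-of-products formula used here can lose precision when the means are large relative to the spread; it is meant as a sketch, not a production routine.

```python
def streaming_covariance(rows):
    """Sample covariance matrix from one pass over an iterable of rows.

    Only the running sums (p values) and cross-products (p x p values)
    are kept in memory, independent of the number of rows.
    """
    n = 0
    sums = None
    cross = None
    for row in rows:
        if sums is None:
            p = len(row)
            sums = [0.0] * p
            cross = [[0.0] * p for _ in range(p)]
        n += 1
        for i, xi in enumerate(row):
            sums[i] += xi
            # Upper triangle only; the matrix is symmetric.
            for j in range(i, p):
                cross[i][j] += xi * row[j]
    if n < 2:
        raise ValueError("need at least two rows")
    cov = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i, p):
            value = (cross[i][j] - sums[i] * sums[j] / n) / (n - 1)
            cov[i][j] = cov[j][i] = value
    return cov


# Rows could equally come from a generator that reads a file line by line.
cov = streaming_covariance(iter([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]))
```

With 11 features the accumulated state is an 11 x 11 matrix, after which the eigen-decomposition is a small problem regardless of how many vesicles were read.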
Distributed Computing: How to Submit a Condor Job

A job, described in the description file Genname_descr.txt, is submitted for execution under Condor by:

    condor_submit Genname_descr.txt

Organisation of the description file Genname_descr.txt:

    getenv = True
    when_to_transfer_output = ON_EXIT_OR_EVICT
    notification = never
    universe = vanilla
    executable = condor.py
    log = condorlogs/genname.log
    output = condorlogs/genname.out
    error = condorlogs/genname.error
    transfer_input_files = Genname.mat,mcl.R
    arguments = Genname.mat
    queue

Contents of the executable file condor.py:

    import os
    import sys

    # Data file name from "arguments"; handed to the R script as filemat="...".
    condorfile = sys.argv[1]
    command = "R CMD BATCH --no-save --no-restore '--args filemat=\"" + condorfile + "\"' mcl.R"
    os.system(command)
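With 2500 data files, the description files would presumably be generated rather than written by hand. The sketch below shows one way to do that from a template that mirrors the description file above. The helper names, the directory arguments, and the `<gene>_descr.txt` naming are assumptions for illustration, not part of the talk.

```python
from pathlib import Path

# Template mirroring the description file on the slide; {gene} is filled in
# per data file.
TEMPLATE = """\
universe = vanilla
executable = condor.py
arguments = {gene}.mat
getenv = True
transfer_input_files = {gene}.mat,mcl.R
when_to_transfer_output = ON_EXIT_OR_EVICT
notification = never
log = condorlogs/{gene}.log
output = condorlogs/{gene}.out
error = condorlogs/{gene}.error
queue
"""


def make_description(gene):
    """Return the text of the description file for one gene."""
    return TEMPLATE.format(gene=gene)


def write_descriptions(data_dir, out_dir):
    """Write one <gene>_descr.txt per <gene>.mat file; return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for mat in sorted(Path(data_dir).glob("*.mat")):
        target = out / f"{mat.stem}_descr.txt"
        target.write_text(make_description(mat.stem))
        written.append(target)
    return written


example = make_description("GeneA")
```

Each generated file can then be passed to `condor_submit` in turn, for instance from a small shell loop.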
Outlook: Pattern Matching

Michael Rissi's ideas about a vectorised pattern matching algorithm using NVIDIA's GPU programming tool CUDA:

- GPU: a vector processor with fast access to 64k of shared memory (shareable between threads).
- CUDA: NVIDIA's programming tools for their 8xxx series graphics cards. Standard C with some extensions for low-level programming, as well as an API for high-level programming (e.g. math libraries, FFT). Announced: double precision.
- General idea: produce a database of vesicle patterns within a cell, then compare cells from a dataset with these patterns. This is a trivially vectorizable problem.
- A huge acceleration is possible on GPUs (estimated factor: 40-100).

Contact: rissim@particle.phys.ethz.ch
Thanks for your attention!