Statistical Clustering of Vesicle Patterns: Practical Aspects of the Analysis of Large Datasets with R

Mirko Birbaumer (birbaumer@imsb.biol.ethz.ch)
Rmetrics Workshop, 3rd July 2008

Introduction: Pattern Recognition in Biological Systems

What happens if we silence a gene in a cell and add infectious viruses?

- RNA interference enables the silencing of single genes (Nobel Prize in Physiology or Medicine, 2006).
- A library for 22,000 genes in human cells is exploited to study the function of genes.
- Clinical and pharmaceutical purpose: which proteins are essential for virus entry? Are there drugs that target these proteins?
- Required technology and techniques: fully automated microscopy, image analysis and statistical data analysis.
- Bioconductor: R project for the analysis and comprehension of genomic data.

Answer: for some genes we observe strikingly different patterns!

Introduction: RNA Interference Experiment

Each well contains thousands of cells in which the expression of a particular gene is knocked down. The experiment addresses how a virus enters a cell and how it is transported within the cell.

High-Throughput Screening: Vesicle Patterns

[Figure: patterns of Transferrin in Hep2Beta cells upon silencing of a gene]

High-Throughput Screening: Overview of the Image Analysis Pipeline

High-Throughput Screening: Cell Classification

A GUI supports cell classification based on nuclei features and an SVM (R package e1071). The GUI was written in Python (Tkinter), with rpy used as the interface to R.

High-Throughput Screening: Cell Classification (continued)

- 9 cell types are distinguished and classified via the GUI according to 52 intensity and texture features (CellProfiler).
- Based on the SVM, all detected cells are classified, and in-focus and interphase cells are selected.
- 5 classes: Interphase (1), Prophase/Prometaphase/Apoptotic (2), Metaphase (3), Anaphase I/Anaphase II/Telophase (4) and Artefact (5).
- Total accuracy of 10-fold cross-validation: 96.63%.

Confusion matrix based on 4000 training data points (nuclei) and tested on 2800 nuclei:

                       Class 1   Class 2   Class 3   Class 4   Class 5
    predicted Class 1     1998         1         0         0         0
    predicted Class 2        2        99         3         0         0
    predicted Class 3        0         0       297         0         0
    predicted Class 4        0         0         0        50         7
    predicted Class 5        0         0         0         0       342
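As a quick check of the confusion matrix, the following minimal sketch sums the diagonal to obtain the accuracy on the test nuclei. The variable names are illustrative and not taken from the slides; note that the entries as transcribed sum to 2799 rather than the stated 2800.

```python
# Confusion matrix from the slide: rows are predicted classes 1..5,
# columns are the classes 1..5 given in the table header.
confusion = [
    [1998, 1, 0, 0, 0],
    [2, 99, 3, 0, 0],
    [0, 0, 297, 0, 0],
    [0, 0, 0, 50, 7],
    [0, 0, 0, 0, 342],
]

def overall_accuracy(matrix):
    """Return (correct, total, accuracy) for a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct, total, correct / total

correct, total, accuracy = overall_accuracy(confusion)
print("correct: %d of %d, accuracy: %.2f%%" % (correct, total, 100 * accuracy))
```

This test-set figure is a different quantity from the 10-fold cross-validation accuracy quoted on the slide, which is estimated on the training data.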

Case Study: Transferrin Phenotypes

How can we classify these patterns in an unsupervised manner?

Case Study: Feature Selection

Vesicle features:
- Relative distance to nucleus
- Radius
- Ellipticity
- Flux
- Area of vesicles
- Number of vesicles per cell area

Spatial point patterns:
- Number of vesicles within 18 and within 25 pixels around each vesicle
- Radius [pixels] around each vesicle within which 40 and 60 percent of all vesicles are contained
- Clustering tendency (Ripley's K function)
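The two neighbourhood features in the spatial point pattern list can be sketched as follows. This is only an illustration under stated assumptions: vesicle positions are taken to be 2D pixel coordinates, distances are Euclidean, and the "radius containing a given percentage" is interpreted as the smallest radius around a vesicle that encloses that fraction of the other vesicles. The function names are hypothetical; the slides do not show the original implementation.

```python
import math

def neighbour_counts(points, radius):
    """For each point, count the other points within `radius` pixels."""
    counts = []
    for i, (xi, yi) in enumerate(points):
        n = sum(
            1
            for j, (xj, yj) in enumerate(points)
            if j != i and math.hypot(xi - xj, yi - yj) <= radius
        )
        counts.append(n)
    return counts

def containing_radius(points, fraction):
    """For each point, smallest radius enclosing `fraction` of the other points."""
    radii = []
    for i, (xi, yi) in enumerate(points):
        dists = sorted(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        # Number of neighbours needed to reach the requested fraction.
        k = max(1, math.ceil(fraction * len(dists)))
        radii.append(dists[k - 1])
    return radii

# Toy vesicle positions (pixels), purely illustrative.
vesicles = [(0, 0), (3, 4), (6, 8), (100, 100)]
print(neighbour_counts(vesicles, 18))
print(containing_radius(vesicles, 0.4))
```

The quadratic loop is fine for the vesicles of a single cell; for much larger point sets a spatial index would be the natural replacement.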

Case Study: Outline of Statistical Analysis

1. All detected vesicles (11 features each) are written into one file.
2. Principal component analysis of this data file: 3 principal components are chosen.
3. Parameterized Gaussian mixture modeling (package MCLUST) on data files containing all vesicles of a single cell, described by the 3 principal components:

   $$ f(x_i) = \sum_{k=1}^{G} \tau_k \, \Phi_k(x_i \mid \mu_k, \Sigma_k) \tag{1} $$

   where $x_i$ is the feature vector of a vesicle, $G$ is the number of components, $\tau_k$ is the probability that an observation belongs to the $k$th component ($\sum_{k=1}^{G} \tau_k = 1$), and $\Phi_k$ is a normal distribution with mean $\mu_k$ and covariance matrix $\Sigma_k$.
4. The number of components is determined based on the BIC.
5. The symmetrized Kullback-Leibler divergence is used as a distance (dissimilarity) measure between two vesicle distributions.
6. The distance matrix is represented via hierarchical clustering.

In collaboration with Prof. P. Buehlmann and Dr. M. Kalisch from the SFS, ETH Zurich.
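The slides do not say how the symmetrized Kullback-Leibler divergence between two fitted mixtures was evaluated. Since it has no closed form for Gaussian mixtures, one common option is a Monte Carlo estimate. The sketch below assumes one-dimensional mixtures for brevity (the deck uses 3 principal components) and defines the symmetrized divergence as the sum of the two directed divergences; some authors use the average instead. All function names are hypothetical.

```python
import math
import random

def mixture_pdf(x, weights, means, sds):
    """Density of a 1D Gaussian mixture at x (equation (1) with d = 1)."""
    return sum(
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for w, m, s in zip(weights, means, sds)
    )

def sample_mixture(rng, weights, means, sds):
    """Draw one sample: pick a component by weight, then sample from it."""
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.gauss(means[k], sds[k])

def kl_monte_carlo(p, q, n=20000, seed=0):
    """Estimate KL(p || q) as the mean of log(p(x) / q(x)) over x drawn from p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = sample_mixture(rng, *p)
        # Assumes q has non-negligible density wherever p puts mass.
        total += math.log(mixture_pdf(x, *p) / mixture_pdf(x, *q))
    return total / n

def symmetrized_kl(p, q, n=20000, seed=0):
    """Symmetrized divergence, taken here as KL(p || q) + KL(q || p)."""
    return kl_monte_carlo(p, q, n, seed) + kl_monte_carlo(q, p, n, seed)

# Two illustrative mixtures given as (weights, means, standard deviations).
cell_a = ([0.5, 0.5], [0.0, 3.0], [1.0, 1.0])
cell_b = ([0.3, 0.7], [0.5, 4.0], [1.0, 1.5])
print(symmetrized_kl(cell_a, cell_b))
```

Evaluating this for every pair of cells (or wells) fills the distance matrix that is then passed to hierarchical clustering.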

Case Study: Vesicle Classification

Based on the Bayesian Information Criterion (BIC), 5 vesicle classes are assumed; vesicles are classified into these 5 groups.
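To illustrate how a BIC-based choice of the number of components works, here is a minimal sketch using the sign convention of the mclust package (BIC = 2 log L - m log n, larger is better). It assumes unconstrained covariance matrices for the parameter count; the slides do not state which covariance model was selected, and the log-likelihood values below are invented purely for the example.

```python
import math

def n_params_unconstrained(n_components, dim):
    """Free parameters of a Gaussian mixture with full covariance matrices."""
    weights = n_components - 1
    means = n_components * dim
    covariances = n_components * dim * (dim + 1) // 2
    return weights + means + covariances

def bic(log_likelihood, n_params, n_obs):
    """BIC in the mclust convention: larger values indicate a better model."""
    return 2.0 * log_likelihood - n_params * math.log(n_obs)

def select_components(log_likelihoods, n_obs, dim):
    """Pick the number of components with the highest BIC.

    `log_likelihoods` maps a number of components to the maximised
    log-likelihood of the corresponding fitted mixture.
    """
    scores = {
        g: bic(ll, n_params_unconstrained(g, dim), n_obs)
        for g, ll in log_likelihoods.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical log-likelihoods for 1, 2 and 3 components on 1000 vesicles.
best, scores = select_components({1: -5200.0, 2: -4900.0, 3: -4890.0}, 1000, 3)
print(best, scores)
```

In this toy example the third component improves the likelihood too little to pay for its 10 extra parameters, so two components are selected.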

Case Study: Principal Component Analysis

Coordinate projection of vesicles and projected covariance matrices of the vesicle groups.

Case Study: Principal Component Analysis (Biplot)

Biplot: the original features of all vesicles are projected onto the first 2 principal components.

Case Study: Vesicle Groups

All vesicles are marked according to the vesicle group they belong to.

Case Study: Clustering Tree

Hierarchical clustering tree: cells with similar patterns (and biological perturbation) group together.
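The step from a distance matrix to a clustering tree can be sketched with a naive agglomerative procedure. The slides do not state which linkage was used, so average linkage is an assumption here, and the function name and toy distances are illustrative only.

```python
def average_linkage(dist, labels):
    """Naive agglomerative clustering on a symmetric distance matrix.

    Returns the list of merges as (sorted member labels, merge distance),
    in the order in which the clusters were joined.
    """
    clusters = {i: [i] for i in range(len(labels))}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for pos, a in enumerate(keys):
            for b in keys[pos + 1:]:
                # Average pairwise distance between members of a and b.
                pairs = len(clusters[a]) * len(clusters[b])
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b]) / pairs
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        members = clusters[a] + clusters[b]
        merges.append((sorted(labels[i] for i in members), d))
        clusters[a] = members
        del clusters[b]
    return merges

# Toy distances between four cells: A/B are similar, C/D are similar.
toy_dist = [
    [0, 1, 10, 10],
    [1, 0, 10, 10],
    [10, 10, 0, 2],
    [10, 10, 2, 0],
]
for members, height in average_linkage(toy_dist, ["A", "B", "C", "D"]):
    print(members, height)
```

The merge order and heights are exactly what a dendrogram draws: cells joined at a low height have similar vesicle distributions.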

Case Study: Evaluation of Kinome Screens

- Aim: find, in an unsupervised manner, well-distinguishable patterns and the corresponding proteins, and characterize these patterns as functional modules of proteins.
- Ultimate reduction of the data sets: the distance matrix.
- Phylogenetic tree of the most distant wells within a kinome screen.

Distributed Computing: Large Dataset Handling

Principal component analysis of a data matrix of 3 GB size:
- Solution: calculate the covariance matrix with a Fortran program; eigenvectors and principal components are then calculated in R.
- Question: how flexible are R functions when applied to large datasets?

2500 data files need to be processed by the mclust package; the processing time per file is 10-15 minutes (a minimum of 3 weeks in total):
- Solution: Condor. Distributing the jobs among 60 to 70 computers reduces this to less than 2 days of processing time.
- Condor is an open-source software framework for distributed parallelization of computationally intensive tasks (runs under Linux, Mac and Windows).
- It can be used to manage workload on a dedicated cluster of computers or on non-dedicated desktop machines.
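The reason the covariance matrix can be computed outside R is that it only needs running sums, so the 3 GB matrix never has to be held in memory at once. The sketch below illustrates that idea in Python; it is not the Fortran program from the talk, and the chunked input format is an assumption.

```python
def streaming_covariance(chunks, dim):
    """One-pass sample covariance over an iterable of row chunks.

    Only the column sums and the sums of cross products are kept in memory,
    so the chunks can be read from disk one at a time.
    """
    n = 0
    col_sum = [0.0] * dim
    cross_sum = [[0.0] * dim for _ in range(dim)]
    for chunk in chunks:
        for row in chunk:
            n += 1
            for a in range(dim):
                col_sum[a] += row[a]
                for b in range(dim):
                    cross_sum[a][b] += row[a] * row[b]
    mean = [s / n for s in col_sum]
    cov = [
        [(cross_sum[a][b] - n * mean[a] * mean[b]) / (n - 1) for b in range(dim)]
        for a in range(dim)
    ]
    return mean, cov

# Two tiny chunks standing in for blocks of a large feature file.
example_chunks = [[(1.0, 2.0), (3.0, 4.0)], [(5.0, 9.0)]]
mean, cov = streaming_covariance(example_chunks, 2)
print(mean, cov)
```

The resulting matrix is small (11 x 11 for the features in this study), so the eigen-decomposition itself is cheap. For data with large means, this sum-of-products formula can lose precision, and a Welford-style update is the more robust choice.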

Distributed Computing: How to Submit a Condor Job

A job, described in the description file Genname_descr.txt, is submitted for execution under Condor with:

    condor_submit Genname_descr.txt

Organisation of the description file Genname_descr.txt:

    getenv = True
    when_to_transfer_output = ON_EXIT_OR_EVICT
    notification = never
    universe = vanilla
    Executable = condor.py
    log = condorlogs/genname.log
    output = condorlogs/genname.out
    error = condorlogs/genname.error
    transfer_input_files = Genname.mat,mcl.R
    arguments = Genname.mat
    queue

Contents of the executable file condor.py:

    import os
    import sys

    condorfile = sys.argv[1]
    # Hand the data file name to mcl.R through R's --args mechanism.
    command = "R CMD BATCH --no-save --no-restore '--args filemat=\"" + condorfile + "\"' mcl.R"
    os.system(command)
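With 2500 data files, the description files are best generated rather than written by hand. The following sketch builds the text of one description file per gene using the same submit commands as the slide; the helper name `make_submit_description` and the gene names are hypothetical, and the slides do not show how the original files were produced.

```python
def make_submit_description(gene_name, script="mcl.R",
                            executable="condor.py", log_dir="condorlogs"):
    """Return the text of a Condor description file for one gene's data file."""
    data_file = gene_name + ".mat"
    lines = [
        "getenv = True",
        "when_to_transfer_output = ON_EXIT_OR_EVICT",
        "notification = never",
        "universe = vanilla",
        "executable = " + executable,
        "log = %s/%s.log" % (log_dir, gene_name),
        "output = %s/%s.out" % (log_dir, gene_name),
        "error = %s/%s.error" % (log_dir, gene_name),
        "transfer_input_files = %s,%s" % (data_file, script),
        "arguments = " + data_file,
        "queue",
    ]
    return "\n".join(lines) + "\n"

# Illustrative gene names; each text would be saved as <gene>_descr.txt
# and then submitted with condor_submit.
for gene in ["GeneA", "GeneB"]:
    print(make_submit_description(gene))
```

Keeping the generator as a pure function that returns text makes it easy to inspect a description file before anything is written to disk or submitted.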

Outlook: Pattern Matching

Michael Rissi's ideas about a vectorised pattern matching algorithm using NVIDIA's GPU programming tool CUDA:

- GPU: vector processor with fast 64k shared memory access (shareable between threads).
- CUDA: NVIDIA's programming tools for their 8xxx series graphics cards. Standard C with some extensions for low-level programming, as well as an API for high-level programming (e.g. math libraries, FFT). Announced: double precision.
- General idea: produce a database of vesicle patterns within a cell, then compare cells from a dataset with these patterns.
- This is a trivially vectorizable problem; a huge acceleration is possible on GPUs (estimated factor: 40-100).

Contact: rissim@particle.phys.ethz.ch

Thanks for your attention!