Clustering. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein. Some slides adapted from Jacques van Helden


Gene expression profiling: a quick review
Which molecular processes/functions are involved in a certain phenotype (e.g., disease, stress response, etc.)?
The Gene Ontology (GO) Project:
- Provides a shared vocabulary/annotation
- GO terms are linked in a complex structure
Enrichment analysis:
- Find the most differentially expressed genes
- Identify functional annotations that are over-represented (modified Fisher's exact test)
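The over-representation test above amounts to a one-sided Fisher's exact test, i.e., a hypergeometric tail probability. A minimal stdlib-only sketch (all counts below are hypothetical, chosen only for illustration):

```python
from math import comb

def enrichment_pvalue(k, K, n, N):
    """P(X >= k) for X ~ Hypergeometric(N, K, n):
    N genes in the background, K of them annotated with the GO term,
    n genes in the differentially expressed (DE) list, k of those annotated."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# Hypothetical example: 40 of 200 DE genes carry a term that annotates
# only 140 of 6000 background genes -- strongly over-represented
# (the expected count under the null is only 200 * 140 / 6000 ~ 4.7).
p = enrichment_pvalue(40, 140, 200, 6000)
```

In practice one would use a library routine (e.g., SciPy's `fisher_exact`) and correct the resulting p-values for testing many GO terms at once.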

Gene Set Enrichment Analysis (GSEA): a quick review, cont.
Calculates a score for the enrichment of an entire set of genes:
- Does not require setting a cutoff!
- Identifies the set of relevant genes!
- Provides a more robust statistical framework!
GSEA steps:
1. Calculation of an enrichment score (ES) for each functional category
2. Estimation of the significance level
3. Adjustment for multiple hypothesis testing
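Step 1 can be sketched as a running-sum statistic over a ranked gene list: step up on set members, step down otherwise, and report the largest deviation from zero. This is an unweighted variant for illustration; the gene names are hypothetical:

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score over a ranked gene list."""
    members = set(gene_set)
    hits = [g in members for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    up, down = 1.0 / n_hit, 1.0 / n_miss   # steps sum to zero over the list
    running = best = 0.0
    for h in hits:
        running += up if h else -down
        if abs(running) > abs(best):
            best = running
    return best

# A set concentrated at the top of the ranking scores highly:
es_top = enrichment_score(["g1", "g2", "g3", "g4", "g5", "g6"], {"g1", "g2"})
# A set concentrated at the bottom yields a strongly negative score:
es_bottom = enrichment_score(["g1", "g2", "g3", "g4"], {"g4"})
```

The significance level (step 2) is then estimated by recomputing the ES over many permutations of the gene labels.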

The clustering problem
The goal of the gene clustering process is to partition the genes into distinct sets such that genes assigned to the same cluster are similar, while genes assigned to different clusters are dissimilar.

What are we clustering? We can cluster genes, conditions (samples), or both.

Why clustering

Clustering genes or conditions is a basic tool for the analysis of expression profiles, and can be useful for many purposes, including:
- Inferring functions of unknown genes (assuming a similar expression pattern implies a similar function)
- Identifying disease profiles (tissues with similar pathology should yield similar expression profiles)
- Deciphering regulatory mechanisms: co-expression of genes may imply co-regulation
- Reducing dimensionality

Different views of clustering


Measuring similarity/distance
An important step in many clustering methods is the selection of a distance measure (metric), defining the distance between two data points (e.g., two genes):
Point 1: [0.1 0.0 0.6 1.0 2.1 0.4 0.2]
Point 2: [0.2 1.0 0.8 0.4 1.4 0.5 0.3]
Genes are points in the multi-dimensional space R^n (where n denotes the number of conditions).

Measuring similarity/distance
So how do we measure the distance between two points in a multi-dimensional space?

Measuring similarity/distance
Common distance functions:
- The Euclidean distance (a.k.a. "as the crow flies" distance; the 2-norm)
- The Manhattan distance (a.k.a. taxicab distance; the 1-norm)
- The maximum norm (a.k.a. infinity distance; the infinity-norm)
- The Hamming distance (the number of substitutions required to change one point into another)
The first three are special cases of the general p-norm. Note also the distinction between symmetric and asymmetric distances.
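A minimal sketch of these four functions in plain Python (no external libraries), using the two points from the earlier slide:

```python
import math

def euclidean(p, q):
    # 2-norm: square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # 1-norm: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))

def max_norm(p, q):
    # infinity-norm: largest absolute coordinate difference
    return max(abs(a - b) for a, b in zip(p, q))

def hamming(p, q):
    # number of coordinates in which the two points differ
    return sum(a != b for a, b in zip(p, q))

point1 = [0.1, 0.0, 0.6, 1.0, 2.1, 0.4, 0.2]
point2 = [0.2, 1.0, 0.8, 0.4, 1.4, 0.5, 0.3]
```

For example, `euclidean(point1, point2)` and `manhattan(point1, point2)` already rank the same pair of genes differently in magnitude, which is exactly why the choice of metric matters.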

Correlation as distance
Another approach is to use the correlation between two data points as a distance metric:
- Pearson correlation
- Spearman correlation
- Absolute value of correlation
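A minimal Pearson-based distance, where 1 − r maps perfectly co-expressed profiles to distance 0 and perfectly anti-correlated ones to distance 2 (a sketch; real analyses typically use NumPy or SciPy):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_distance(x, y):
    # 1 - r: co-expressed profiles are "close", anti-correlated ones are "far"
    return 1.0 - pearson(x, y)
```

Using `1 - abs(pearson(x, y))` instead treats correlated and anti-correlated genes as equally close, which is sometimes preferable when co-regulation can act in either direction.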

Metric matters!
The metric of choice has a marked impact on the shape of the resulting clusters: some elements may be close to one another in one metric and far from one another in a different metric. Consider, for example, the point (x=1, y=1) and the origin:
- What's their distance using the 2-norm (Euclidean distance)?
- What's their distance using the 1-norm (a.k.a. taxicab/Manhattan norm)?
- What's their distance using the infinity-norm?
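Working the example through gives three different answers for the same pair of points:

```python
import math

p = (1.0, 1.0)  # the point (x=1, y=1), measured against the origin

d2 = math.hypot(p[0], p[1])        # 2-norm: sqrt(2) ~ 1.414
d1 = abs(p[0]) + abs(p[1])         # 1-norm: 2.0
dinf = max(abs(p[0]), abs(p[1]))   # infinity-norm: 1.0
```

So the same pair of points is at distance 1, about 1.414, or 2 depending on the metric, and a clustering built on one of these metrics can group the data quite differently from one built on another.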

The clustering problem
A good clustering solution should have two features:
1. High homogeneity: homogeneity measures the similarity between genes assigned to the same cluster.
2. High separation: separation measures the distance/dissimilarity between clusters. (If two clusters have similar expression patterns, they should probably be merged into one cluster.)
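These two criteria can be made concrete as average pairwise distances (a toy sketch; the point coordinates are hypothetical):

```python
from itertools import combinations

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def homogeneity(cluster):
    # mean pairwise distance within a cluster (lower = more homogeneous)
    pairs = list(combinations(cluster, 2))
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)

def separation(cluster_a, cluster_b):
    # mean distance between points of two clusters (higher = better separated)
    return (sum(euclidean(a, b) for a in cluster_a for b in cluster_b)
            / (len(cluster_a) * len(cluster_b)))

c1 = [(0, 0), (0, 1), (1, 0)]   # a tight cluster near the origin
c2 = [(5, 5), (5, 6), (6, 5)]   # a second tight cluster far away
```

Here `separation(c1, c2)` is much larger than either cluster's homogeneity score, which is exactly what a good clustering solution looks like under these definitions.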

The philosophy of clustering
- Unsupervised learning problem: no single solution is necessarily the true/correct one!
- There is usually a tradeoff between homogeneity and separation:
  More clusters: increased homogeneity but decreased separation.
  Fewer clusters: increased separation but reduced homogeneity.
- Method matters; metric matters; definitions matter!
- There are many formulations of the clustering problem; most of them are NP-hard (why?). In most cases, heuristic methods or approximations are used.

One problem, numerous solutions
Many algorithms: hierarchical clustering, k-means, self-organizing maps (SOM), KNN, PCC, CAST, CLICK.
The results (i.e., the obtained clusters) can vary drastically depending on:
- The clustering method
- Parameters specific to each clustering method (e.g., the number of centers for the k-means method, the agglomeration rule for hierarchical clustering, etc.)
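As one concrete example, k-means (Lloyd's algorithm) alternates between assigning each point to its nearest center and recomputing each center as the mean of its cluster. A minimal sketch with hypothetical toy points:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: points are tuples of equal dimension."""
    rnd = random.Random(seed)
    centers = rnd.sample(points, k)  # initialize centers at k random points
    clusters = []
    for _ in range(iters):
        # assignment step: attach each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # update step: move each center to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(col) / len(cl) for col in zip(*cl))
    return centers, clusters

# Two well-separated toy groups; k-means recovers them with k=2.
centers, clusters = kmeans(
    [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], k=2)
```

Note the parameter dependence the slide warns about: with a poor choice of k, or unlucky initialization on harder data, the same algorithm returns a very different partition.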

Hierarchical clustering

Hierarchical clustering
Hierarchical clustering is an agglomerative clustering method:
- Takes as input a distance matrix
- Progressively regroups the closest objects/groups
- Tree representation: the resulting tree has a root, internal nodes (clusters), branches, and leaf nodes (the objects)

Distance matrix:
            obj 1  obj 2  obj 3  obj 4  obj 5
  obj 1      0.00   4.00   6.00   3.50   1.00
  obj 2      4.00   0.00   6.00   2.00   4.50
  obj 3      6.00   6.00   0.00   5.50   6.50
  obj 4      3.50   2.00   5.50   0.00   4.00
  obj 5      1.00   4.50   6.50   4.00   0.00
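The merge order on this matrix can be traced with a minimal single-linkage sketch (plain Python; `D` encodes the upper triangle of the matrix above):

```python
# Upper triangle of the slide's distance matrix, keyed by object pair.
D = {(1, 2): 4.0, (1, 3): 6.0, (1, 4): 3.5, (1, 5): 1.0,
     (2, 3): 6.0, (2, 4): 2.0, (2, 5): 4.5,
     (3, 4): 5.5, (3, 5): 6.5,
     (4, 5): 4.0}

def cluster_dist(a, b):
    # single linkage: distance between the closest members of a and b
    return min(D[tuple(sorted((x, y)))] for x in a for y in b)

clusters = [frozenset([i]) for i in range(1, 6)]  # start: one object each
merges = []
while len(clusters) > 1:
    # find the closest pair of clusters and merge them
    a, b = min(((c1, c2) for i, c1 in enumerate(clusters)
                for c2 in clusters[i + 1:]),
               key=lambda pair: cluster_dist(*pair))
    merges.append((sorted(a | b), cluster_dist(a, b)))
    clusters = [c for c in clusters if c not in (a, b)] + [a | b]
```

On this matrix the algorithm first joins objects 1 and 5 (distance 1.0), then 2 and 4 (2.0), then those two groups (3.5), and finally attaches object 3 (5.5), matching the tree on the slide.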

Déjà vu, anyone?

Hierarchical clustering algorithm
1. Assign each object to a separate cluster.
2. Find the pair of clusters with the shortest distance, and regroup them into a single cluster.
3. Repeat step 2 until there is a single cluster.
The result is a tree:
- Intermediate nodes represent clusters
- Branch lengths represent distances between clusters

Hierarchical clustering result: cutting the resulting tree yields, in this example, five clusters.

Hierarchical clustering
1. Assign each object to a separate cluster.
2. Find the pair of clusters with the shortest distance, and regroup them into a single cluster.
3. Repeat step 2 until there is a single cluster.
One needs to define a (dis)similarity metric between two groups. There are several possibilities:
- Average linkage: the average distance between objects from groups A and B
- Single linkage: the distance between the closest objects from groups A and B
- Complete linkage: the distance between the most distant objects from groups A and B
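On a pair of toy one-dimensional groups (hypothetical values), the three rules can give quite different group distances:

```python
# Two toy groups of 1-D points (hypothetical values).
A = [0.0, 1.0]
B = [4.0, 9.0]

# all pairwise distances between a member of A and a member of B
pair_dists = [abs(a - b) for a in A for b in B]   # 4, 9, 3, 8

single = min(pair_dists)                       # closest pair
complete = max(pair_dists)                     # most distant pair
average = sum(pair_dists) / len(pair_dists)    # mean over all pairs
```

Since the merge order is driven entirely by these group distances, swapping the linkage rule can change which clusters get merged first and hence the whole tree.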

Impact of the agglomeration rule
These four trees were built from the same distance matrix, using four different agglomeration rules:
- Single linkage typically creates nested clusters
- Complete linkage creates more balanced trees
Note: these trees were computed from a matrix of random numbers. The impression of structure is thus a complete artifact.

Clustering in both dimensions