Hierarchical Clustering

Some slides by Serafim Batzoglou.

From expression profiles to distances
- From the raw data matrix (genes by experiments, entries are expression levels) we compute the similarity matrix S.
- $S_{ij}$ reflects the similarity of the expression patterns of gene i and gene j.
- In some situations the input for clustering is only the similarities / distances.
[Figure: raw data matrix of expression levels, genes by experiments]

More generally
- In K-means and SOM the input was a vector for each item (e.g. a point in $\mathbb{R}^n$).
- Here we have a matrix of pairwise distances between items, and we wish to cluster the items.
- This calls for a distance-based clustering algorithm.

An alternative view of clustering
Form a tree hierarchy of the input elements satisfying:
- More similar elements are placed closer along the tree.
- Or: tree distances reflect element similarity.
Note: no explicit partition into clusters.

Partitioning vs. hierarchical representations
[Figure: dendrogram]

Hierarchical representations (2)
Ultrametric: a rooted tree in which all root-to-leaf distances are equal.
[Figure: dendrogram over leaves 1-4, with heights labeled 5.0, 4.5 and 2.8]
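A distance matrix can be tested for compatibility with an ultrametric tree without building the tree. The sketch below uses the standard three-point condition (for every triple of elements, the two largest of the three pairwise distances are equal), which is not stated on the slide but is equivalent to the definition above for metric input. The function name `is_ultrametric` and the example matrices are made up for illustration.

```python
from itertools import combinations

def is_ultrametric(d, tol=1e-9):
    """Check the three-point condition on a symmetric distance matrix d.

    For every triple (i, j, k) the two largest of d[i][j], d[i][k], d[j][k]
    must be equal (up to tol).
    """
    for i, j, k in combinations(range(len(d)), 3):
        _, middle, largest = sorted((d[i][j], d[i][k], d[j][k]))
        if abs(largest - middle) > tol:
            return False
    return True

# Toy matrix read off a tree with merge heights 1, 2 and 3 (hypothetical data).
tree_like = [
    [0, 2, 4, 6],
    [2, 0, 4, 6],
    [4, 4, 0, 6],
    [6, 6, 6, 0],
]
# Toy matrix that violates the condition: the triple has distances 2, 4, 5.
not_tree_like = [
    [0, 2, 5],
    [2, 0, 4],
    [5, 4, 0],
]

print(is_ultrametric(tree_like))      # True
print(is_ultrametric(not_tree_like))  # False
```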

UPGMA clustering (unweighted pair group method using arithmetic averages)
Approach:
- Form a tree; species that are closer according to the input distances should be closer in the tree.
- Build the tree bottom up, each time merging two smaller trees.
- All leaves are at the same distance from the root.

Hierarchical clustering: UPGMA (Sokal & Michener 1958; Lance & Williams 1967)
UPGMA: unweighted pair group method using arithmetic averages.
Given two disjoint clusters $C_i$, $C_j$, their distance is the average over all cross pairs:
$$d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq}$$
If $C_k = C_i \cup C_j$, then the distance from $C_k$ to another cluster $C_l$ is:
$$d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|}$$
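The update rule for $d_{kl}$ follows directly from the definition of the average distance. A short derivation, assuming only that $C_i$ and $C_j$ are disjoint (so $|C_k| = |C_i| + |C_j|$) and that $C_l$ is disjoint from both:

```latex
\begin{aligned}
d_{kl}
  &= \frac{1}{|C_k|\,|C_l|} \sum_{p \in C_k,\; q \in C_l} d_{pq} \\
  &= \frac{1}{(|C_i| + |C_j|)\,|C_l|}
     \left( \sum_{p \in C_i,\; q \in C_l} d_{pq}
          + \sum_{p \in C_j,\; q \in C_l} d_{pq} \right) \\
  &= \frac{|C_i|\,|C_l|\, d_{il} + |C_j|\,|C_l|\, d_{jl}}{(|C_i| + |C_j|)\,|C_l|} \\
  &= \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|}
\end{aligned}
```

This is why the algorithm only needs the current cluster-to-cluster distances and the cluster sizes, never the original pairwise sums.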

Algorithm: UPGMA
Initialization:
- Assign each $x_i$ to its own cluster $C_i$.
- Define one leaf per sequence, at height 0.
Iteration:
- Find two clusters $C_r$, $C_s$ such that $d_{rs}$ is minimal.
- Define a new cluster $C_t = C_r \cup C_s$.
- Define a node $A_{rs}$ connecting $C_r$ and $C_s$, at height $d_{rs}/2$.
- Set $\mathrm{length}(C_r, A_{rs}) = \mathrm{height}(A_{rs}) - \mathrm{height}(C_r)$ and $\mathrm{length}(C_s, A_{rs}) = \mathrm{height}(A_{rs}) - \mathrm{height}(C_s)$.
- For every other cluster $C_i$, set $d_{it} = d_{ti} = (|C_r|\, d_{ir} + |C_s|\, d_{is}) / (|C_r| + |C_s|)$.
- Delete $C_r$ and $C_s$.
Termination: when all sequences belong to one cluster.
Theorem: if the input distances match an ultrametric tree, UPGMA finds it.
Time: naive $O(n^3)$; one can show $O(n^2 \log n)$ (exercise) and $O(n^2)$ (harder exercise).
[Figure: example tree built over leaves 1-5]
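The steps above can be sketched in a few lines of Python. This is a minimal, naive $O(n^3)$ version for illustration only; the function name `upgma`, the nested-tuple tree format `(left, right, height)` and the toy matrix are assumptions, not part of the slides. Branch lengths are not stored explicitly; they can be recovered as the difference between a node's height and its child's height.

```python
def upgma(d, labels):
    """Naive UPGMA on a symmetric distance matrix d.

    Returns a nested tuple (left_subtree, right_subtree, height);
    leaves are the plain labels and sit at height 0.
    """
    n = len(labels)
    # cluster id -> (subtree, number of leaves)
    clusters = {i: (labels[i], 1) for i in range(n)}
    # pairwise cluster distances, keyed by (smaller id, larger id)
    dist = {(i, j): d[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # find the closest pair of clusters C_r, C_s
        (r, s), d_rs = min(dist.items(), key=lambda kv: kv[1])
        tree_r, size_r = clusters.pop(r)
        tree_s, size_s = clusters.pop(s)
        # size-weighted average distance from the merged cluster to the rest
        new = {}
        for i in clusters:
            d_ir = dist[min(i, r), max(i, r)]
            d_is = dist[min(i, s), max(i, s)]
            new[(i, next_id)] = (size_r * d_ir + size_s * d_is) / (size_r + size_s)
        # delete C_r, C_s and add the new cluster C_t at height d_rs / 2
        dist = {k: v for k, v in dist.items() if r not in k and s not in k}
        dist.update(new)
        clusters[next_id] = ((tree_r, tree_s, d_rs / 2), size_r + size_s)
        next_id += 1
    (tree, _size), = clusters.values()
    return tree

# Toy ultrametric input (hypothetical data): merge heights 1, 2 and 3.
D = [
    [0, 2, 4, 6],
    [2, 0, 4, 6],
    [4, 4, 0, 6],
    [6, 6, 6, 0],
]
print(upgma(D, ["A", "B", "C", "D"]))
# ('D', ('C', ('A', 'B', 1.0), 2.0), 3.0)
```

Because the toy input is ultrametric, the recovered heights reproduce the input distances exactly (each pairwise distance is twice the height of the lowest common ancestor), in line with the theorem on the slide.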

http://lectures.molgen.mpg.de/phylogeny/ultrametric/

Robert R. Sokal (1926-2012)
- Ph.D. 1952, University of Chicago.
- Was at the Dept. of Ecology and Evolution, SUNY Stony Brook.
- Member of the National Academy of Sciences and the American Academy of Arts and Sciences.
- Promoted the use of statistics in biology and co-founded the field of numerical taxonomy.
- Together with P.H.A. Sneath, authored the two defining texts in this field.
- Along with F. James Rohlf, authored the very popular biostatistics book Biometry.
- Editor of the American Naturalist; president of several learned societies.

Results (2)
- 10 major groups with similar patterns of co-occurrence, confirming that specific groups of phenotypes co-occur within families.
- Certain malformations co-occur in more than one group, e.g. TGA, AVSD.
- Some differences from a proposed taxonomy (Houyel 11).
- (Also: co-occurrence of defects in families is caused by shared susceptibility genes.)
- A starting point for further biomedical research.

Variants on hierarchical clustering
Input: distance matrix $D_{ij}$. Initially each element is a cluster.
- Find the minimum element $D_{rs}$ in D; merge clusters r, s.
- Delete elements r, s; add a new element t with updated weights.
- Repeat.
Variants:
- Average linkage: UPGMA
- Single linkage: $D_{it} = \min(D_{ir}, D_{is})$
- Max (complete) linkage: $D_{it} = \max(D_{ir}, D_{is})$
Sometimes the number of clusters is needed. Methods for this abound.
Sometimes leaf order matters and not only topology.
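The three variants differ only in the rule that updates the distances to the merged element. A small sketch makes this concrete; the names `updated_distance` and `merge_order`, the linkage labels and the toy data (four points on a line at 0, 1, 3, 7) are illustrative assumptions.

```python
def updated_distance(d_ir, d_is, size_r, size_s, linkage):
    """Distance from element i to the merged element t = r + s."""
    if linkage == "single":
        return min(d_ir, d_is)
    if linkage == "max":  # often called complete linkage
        return max(d_ir, d_is)
    if linkage == "average":  # UPGMA
        return (size_r * d_ir + size_s * d_is) / (size_r + size_s)
    raise ValueError(f"unknown linkage: {linkage}")

def merge_order(d, linkage):
    """Run the generic agglomerative loop; return [(members, merge distance), ...]."""
    clusters = {i: (i,) for i in range(len(d))}
    dist = {(i, j): d[i][j] for i in clusters for j in clusters if i < j}
    merges = []
    next_id = len(d)
    while len(clusters) > 1:
        (r, s), d_rs = min(dist.items(), key=lambda kv: kv[1])
        members_r, members_s = clusters.pop(r), clusters.pop(s)
        for i in clusters:
            d_ir = dist[min(i, r), max(i, r)]
            d_is = dist[min(i, s), max(i, s)]
            dist[i, next_id] = updated_distance(
                d_ir, d_is, len(members_r), len(members_s), linkage
            )
        # drop every entry that still refers to the deleted elements r, s
        dist = {k: v for k, v in dist.items() if r not in k and s not in k}
        clusters[next_id] = members_r + members_s
        merges.append((sorted(members_r + members_s), d_rs))
        next_id += 1
    return merges

# Hypothetical data: pairwise distances between points 0, 1, 3, 7 on a line.
D = [
    [0, 1, 3, 7],
    [1, 0, 2, 6],
    [3, 2, 0, 4],
    [7, 6, 4, 0],
]
for linkage in ("single", "max", "average"):
    print(linkage, [dist for _, dist in merge_order(D, linkage)])
```

On this input all three rules merge the elements in the same order, but the merge distances differ: single linkage reports the smallest gap between clusters, max linkage the largest, and average linkage a value in between.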

Hierarchical clustering of gene expression (GE) data (Eisen et al., PNAS 1998)
- Growth response: starved human fibroblast cells, then added serum.
- Monitored levels of 8600 genes over 13 time points.
- $t_{ij}$: level of target gene i in condition j; $r_{ij}$: same for the reference.
- $D_{ij} = \log(t_{ij} / r_{ij})$
- $D^*_{ij} = [D_{ij} - E(D_i)] / \mathrm{std}(D_i)$
- Similarity of genes k, l: $S_{kl} = \left(\sum_j D^*_{kj} D^*_{lj}\right) / N_{cond}$
- Applied the average linkage method.
- Ordered leaves by increasing subtree weight: average expression level, time of maximal induction, or other criteria.
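The normalization and similarity score above can be sketched as follows. The function names and the toy profiles are assumptions for illustration; the sketch uses the natural logarithm and the population standard deviation (dividing by $N_{cond}$), under which $S_{kl}$ is the Pearson correlation of the two log-ratio profiles. The choice of log base does not affect the result, because a constant factor cancels in the standardization.

```python
import math

def standardize(row):
    """Return (x - mean) / std for each entry, using the population std."""
    mean = sum(row) / len(row)
    std = math.sqrt(sum((x - mean) ** 2 for x in row) / len(row))
    # A constant profile has std == 0 and cannot be standardized.
    return [(x - mean) / std for x in row]

def similarity(target_k, ref_k, target_l, ref_l):
    """S_kl = (sum_j D*_kj D*_lj) / N_cond on log-ratio profiles."""
    d_k = standardize([math.log(t / r) for t, r in zip(target_k, ref_k)])
    d_l = standardize([math.log(t / r) for t, r in zip(target_l, ref_l)])
    return sum(a * b for a, b in zip(d_k, d_l)) / len(d_k)

# Hypothetical profiles over four conditions, reference level 1 throughout.
ref = [1, 1, 1, 1]
gene_up = [1, 2, 4, 8]           # steadily induced
gene_up_strong = [1, 4, 16, 64]  # same shape, larger amplitude
gene_down = [8, 4, 2, 1]         # steadily repressed

print(round(similarity(gene_up, ref, gene_up_strong, ref), 6))  # 1.0
print(round(similarity(gene_up, ref, gene_down, ref), 6))       # -1.0
```

The two induced genes score 1 despite their different amplitudes, which is the point of standardizing each gene before comparing shapes.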

Clustering the same data after random permutation within rows (1), within columns (2), and both (3).

Observations
- Distinct measurements of the same genes cluster together.
- Genes of similar function cluster together.
- Many cluster-function specific insights.
- Interpretation is a REAL biological challenge.

Yeast GE data

Mike Eisen & Pat Brown

More on hierarchical methods (2)
- The methods described above are agglomerative (bottom up).
- An alternative approach: divisive (top down).
Advantages:
- Gives a single coherent global picture.
- Intuitive for biologists (from phylogeny).
Disadvantages:
- No single partition; no specific clusters.
- Forces all elements to fit a tree.
There are other methods that do not assume an ultrametric solution, notably Neighbor Joining. In genomics, UPGMA still rules.

Hierarchical Clustering & Congenital Heart Defects
Ellsoe et al. (Soren Brunak lab), European Heart Journal (2017)

CHD
- Congenital heart defects (CHD) affect almost 1% of all live-born children.
- The number of adults with CHD is increasing.
- Recurrence patterns in families are poorly understood.
- Do cases in the same family tend to have similar types of malformations?

Study
- 1163 families, 3080 family members with clinical diagnosis (avg. 2.65 CHD cases per family).
- Each case is identified as having one or more of 41 different types of CHD lesions: AVD, BSD, VSD, ...

Concordant & discordant disease pairs
- Concordant: (ASD, ASD), (ASD, VSD)
- Discordant: (BAV, BAV)

Gender ratio, concordance & discordance

Scoring pairs of defects
- $N(A, B)$: # families with A and B
- $N(A, \neg B)$: # families with A, not B
- $N(\neg A, B)$: # families with B, not A
- $N(\neg A, \neg B)$: # families with neither
The odds ratio (OR) between phenotypes A and B:
$$OR(A,B) = \frac{N(A,B)\, N(\neg A, \neg B)}{N(A, \neg B)\, N(\neg A, B)}$$
??! Perhaps $OR(A,B) = N(A,B) / [N(A, \neg B) + N(\neg A, B)]$
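Both candidate scores can be computed from the same four family counts. The sketch below represents each family as the set of defect types observed in it; the function names and the toy families are made up for illustration and are not data from the study. No smoothing is applied, so a zero count in the denominator raises `ZeroDivisionError`.

```python
def pair_counts(families, a, b):
    """Return (N(A,B), N(A,not B), N(not A,B), N(not A,not B)) over families."""
    both = sum(1 for f in families if a in f and b in f)
    a_only = sum(1 for f in families if a in f and b not in f)
    b_only = sum(1 for f in families if b in f and a not in f)
    neither = len(families) - both - a_only - b_only
    return both, a_only, b_only, neither

def odds_ratio(families, a, b):
    """Standard odds ratio: N(A,B) N(not A,not B) / [N(A,not B) N(not A,B)]."""
    both, a_only, b_only, neither = pair_counts(families, a, b)
    return (both * neither) / (a_only * b_only)

def alternative_score(families, a, b):
    """The alternative raised on the slide: N(A,B) / [N(A,not B) + N(not A,B)]."""
    both, a_only, b_only, _ = pair_counts(families, a, b)
    return both / (a_only + b_only)

# Hypothetical families: 4 with both defects, 2 with ASD only,
# 1 with VSD only, 3 with neither.
families = (
    [{"ASD", "VSD"}] * 4
    + [{"ASD"}] * 2
    + [{"VSD"}] * 1
    + [{"BAV"}] * 3
)
print(pair_counts(families, "ASD", "VSD"))  # (4, 2, 1, 3)
print(odds_ratio(families, "ASD", "VSD"))   # 6.0
print(alternative_score(families, "ASD", "VSD"))
```

The two scores differ in whether families with neither defect contribute: the odds ratio grows with $N(\neg A, \neg B)$, while the alternative ignores it.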

Results