Clustering gene expression data & the EM algorithm


1 CG, Fall. Clustering gene expression data & the EM algorithm. CG 08 Ron Shamir

2 How Gene Expression Data Looks. Entries of the raw data matrix: ratio values or absolute values. Row = a gene's expression pattern / fingerprint vector. Column = an experiment's/condition's profile. [Figure: raw data matrix of expression levels; rows are genes, columns are conditions.]


3 Data Preprocessing. Input: real-valued raw data matrix. Compute the similarity matrix (dot product, correlation, ...); alternatively, compute distances. From the raw data matrix we compute the similarity matrix $S$. $S_{ij}$ reflects the similarity of the expression patterns of gene $i$ and gene $j$.
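As an illustration of the preprocessing step, here is a minimal sketch that builds $S$ using Pearson correlation between gene rows. The choice of correlation (rather than dot product or a distance), the function names, and the toy matrix are all assumptions made for the example; the slides do not fix a particular measure here.

```python
import math

def pearson(u, v):
    """Pearson correlation of two equal-length expression vectors.

    Assumes neither vector is constant (otherwise the denominator is zero).
    """
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)

def similarity_matrix(raw):
    """S[i][j] = correlation between rows (genes) i and j of the raw matrix."""
    n = len(raw)
    return [[pearson(raw[i], raw[j]) for j in range(n)] for i in range(n)]

# Hypothetical toy data: 3 genes measured under 4 conditions.
raw = [
    [1.0, 2.0, 3.0, 4.0],  # gene 0
    [2.0, 4.0, 6.0, 8.0],  # gene 1: same pattern as gene 0, scaled
    [4.0, 3.0, 2.0, 1.0],  # gene 2: opposite trend
]
S = similarity_matrix(raw)
```

Genes 0 and 1 come out maximally similar and genes 0 and 2 maximally dissimilar, which is the behavior one wants from a pattern-based similarity.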

4 DNA chips: Applications. Deducing functions of unknown genes (similar expression pattern → similar function). Identifying disease profiles. Deciphering regulatory mechanisms (co-expression → co-regulation). Classification of biological conditions. Drug development. Analysis requires clustering of genes/conditions.

5 Clustering: Objective. Group elements (genes) into clusters satisfying: Homogeneity: elements inside a cluster are highly similar to each other. Separation: elements from different clusters have low similarity to each other. Needs formal objective functions. Most useful versions are NP-hard.

6 The Clustering Bazaar

7 Hierarchical clustering

8 An Alternative View. Instead of a partition into clusters, form a tree hierarchy of the input elements satisfying: more similar elements are placed closer along the tree. Or: tree distances reflect element similarity.

9 Hierarchical Representation. Dendrogram: rooted tree, usually binary; all leaf-root distances are equal.

10 Hierarchical Clustering: Average Linkage (Sokal & Michener 58, Lance & Williams 67). Input: distance matrix $(D_{ij})$. Iterative algorithm. Initially each element is a cluster; $n_r$ is the size of cluster $r$. Find the minimum element $D_{rs}$ in $D$; merge clusters $r, s$. Delete elements $r, s$; add a new element $t$ with $D_{it} = D_{ti} = \frac{n_r}{n_r+n_s} D_{ir} + \frac{n_s}{n_r+n_s} D_{is}$. Repeat.
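The iteration above can be sketched in a few lines of plain Python. This is an illustrative, unoptimized version: the function name, the dictionary-based bookkeeping, and the toy distance matrix are assumptions of the example, not part of the slides.

```python
def average_linkage(D):
    """Agglomerative clustering with the average-linkage update.

    D is a symmetric distance matrix (list of lists). Returns the merges as
    (members_r, members_s, distance) tuples in the order they were performed.
    """
    n = len(D)
    # Active clusters: id -> list of original element indices.
    clusters = {i: [i] for i in range(n)}
    # Distances between active clusters, keyed by the unordered pair of ids.
    dist = {frozenset((i, j)): D[i][j] for i in range(n) for j in range(i + 1, n)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        # Find the minimum element D_rs among the active pairs.
        pair = min(dist, key=dist.get)
        r, s = sorted(pair)
        d_rs = dist.pop(pair)
        n_r, n_s = len(clusters[r]), len(clusters[s])
        t = next_id
        next_id += 1
        # Distance from every other cluster i to the new cluster t:
        # the size-weighted average of D_ir and D_is.
        for i in clusters:
            if i in (r, s):
                continue
            d_ir = dist.pop(frozenset((i, r)))
            d_is = dist.pop(frozenset((i, s)))
            dist[frozenset((i, t))] = (n_r * d_ir + n_s * d_is) / (n_r + n_s)
        merges.append((clusters[r], clusters[s], d_rs))
        clusters[t] = clusters.pop(r) + clusters.pop(s)
    return merges

# Hypothetical 4-element distance matrix.
D = [
    [0, 1, 4, 5],
    [1, 0, 3, 6],
    [4, 3, 0, 2],
    [5, 6, 2, 0],
]
merges = average_linkage(D)
```

On this matrix the final merge joins {0, 1} with {2, 3} at distance 4.5, which equals the plain average of the four cross-cluster distances (4, 5, 3, 6). That is the property the size-weighted update is designed to preserve.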

11 A General Framework (Lance & Williams 67). Find the minimum element $D_{rs}$; merge clusters $r, s$. Delete elements $r, s$; add a new element $t$ with $D_{it} = D_{ti} = \alpha_r D_{ir} + \alpha_s D_{is} + \gamma |D_{ir} - D_{is}|$. Single-linkage: $D_{it} = \min\{D_{ir}, D_{is}\}$. Complete-linkage: $D_{it} = \max\{D_{ir}, D_{is}\}$. Note: there is an analogous formulation in terms of a similarity matrix (rather than distances).
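To see how the single update rule covers all three linkage methods, the sketch below plugs in the standard coefficient choices: $\alpha_r = \alpha_s = 1/2$ with $\gamma = -1/2$ for single linkage, $\gamma = +1/2$ for complete linkage, and size-proportional $\alpha$ with $\gamma = 0$ for average linkage. The function names are made up for the example. The full Lance-Williams recurrence also carries a $\beta D_{rs}$ term, which is zero for these three methods and is omitted here as on the slide.

```python
def lance_williams(d_ir, d_is, alpha_r, alpha_s, gamma):
    """Distance from element i to the merged cluster t = r ∪ s."""
    return alpha_r * d_ir + alpha_s * d_is + gamma * abs(d_ir - d_is)

def single_linkage(d_ir, d_is):
    # (a + b)/2 - |a - b|/2 equals min(a, b).
    return lance_williams(d_ir, d_is, 0.5, 0.5, -0.5)

def complete_linkage(d_ir, d_is):
    # (a + b)/2 + |a - b|/2 equals max(a, b).
    return lance_williams(d_ir, d_is, 0.5, 0.5, 0.5)

def average_linkage_update(d_ir, d_is, n_r, n_s):
    # Size-weighted average, with no |.| term.
    return lance_williams(d_ir, d_is, n_r / (n_r + n_s), n_s / (n_r + n_s), 0.0)

d_single = single_linkage(3.0, 7.0)
d_complete = complete_linkage(3.0, 7.0)
d_average = average_linkage_update(3.0, 7.0, n_r=1, n_s=3)
```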

12 Hierarchical clustering of GE data (Eisen et al., PNAS 1998). Growth response: starved human fibroblast cells, added serum. Monitored 8600 genes over 13 time-points. $t_{ij}$ is the fluorescence level of gene $i$ in condition $j$; $r_{ij}$ is the same for the reference. $s_{ij} = \log(t_{ij}/r_{ij})$. $S_{kl} = \left(\sum_j s_{kj} s_{lj}\right) / \left(\|s_k\| \, \|s_l\|\right)$ (cosine of the angle). Applied the average linkage method. Ordered leaves by increasing element weight: average expression level, time of maximal induction, or other criteria.
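The similarity used here is straightforward to compute. Below is a minimal sketch with made-up fluorescence values; the function names and the data are assumptions of the example. The natural logarithm is used, but the cosine is unchanged by the choice of log base, since a change of base rescales every $s_{ij}$ by the same factor.

```python
import math

def log_ratios(t, r):
    """s[i][j] = log(t[i][j] / r[i][j]) for sample and reference intensities."""
    return [
        [math.log(t_ij / r_ij) for t_ij, r_ij in zip(t_row, r_row)]
        for t_row, r_row in zip(t, r)
    ]

def cosine(a, b):
    """Cosine of the angle between two vectors (uncentered correlation)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical intensities: 3 genes, 3 time points, reference fixed at 1.
t = [
    [2.0, 4.0, 8.0],      # induced
    [4.0, 16.0, 64.0],    # induced twice as strongly (in log space)
    [0.5, 0.25, 0.125],   # repressed
]
r = [[1.0, 1.0, 1.0] for _ in t]

s = log_ratios(t, r)
S = [[cosine(s[k], s[l]) for l in range(len(s))] for k in range(len(s))]
```

Unlike Pearson correlation, this measure does not subtract the row mean, so a log ratio of zero (no change relative to the reference) keeps its meaning as the origin.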


14 Eisengrams for the same data randomly permuted within rows (1), columns (2) and both (3).

15 Yeast stress data

16 Comments. Distinct measurements of the same genes cluster together. Genes of similar function cluster together. Many cluster-function specific insights. Interpretation is a REAL biological challenge.

17 More on hierarchical methods. Agglomerative vs. the more natural divisive. Advantages: gives a single coherent global picture; intuitive for biologists (from phylogeny). Disadvantages: no single partition, no specific clusters; forces all elements to fit a tree hierarchy.

18 Non-Hierarchical Clustering

19 K-means (Lloyd 57, MacQueen 67). Input: a vector $v_i$ for each element $i$; number of clusters $k$. Define the centroid $c_p$ of a cluster $C_p$ as its average vector. Goal: minimize $\sum_{\text{clusters } p} \sum_{i \in C_p} d(v_i, c_p)$. Objective = homogeneity only ($k$ fixed). NP-hard already for $k=2$.

20 K-means algorithm. Initialize an arbitrary partition $P$ into $k$ clusters. Repeat the following until convergence: update the centroids (optimize $c$ with $P$ fixed); assign each point to its closest centroid (optimize $P$ with $c$ fixed). Can be shown to have polynomial expected time under various assumptions on the data distribution. A variant: perform a single best modification (the one that decreases the score the most).
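The two alternating steps can be sketched as follows. This is a minimal illustration, assuming squared Euclidean distance for $d$ and an initial partition in which every cluster label is used at least once; the function names and the toy points are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, assignment, k, max_iter=100):
    """Lloyd-style iterations starting from an initial partition.

    assignment[i] is the cluster label (0..k-1) of points[i]. Every label
    must appear at least once in the initial assignment.
    """
    centroids = [None] * k
    for _ in range(max_iter):
        # Update centroids with the partition fixed.
        for p in range(k):
            members = [v for v, a in zip(points, assignment) if a == p]
            if members:  # a cluster that became empty keeps its old centroid
                centroids[p] = [sum(c) / len(members) for c in zip(*members)]
        # Reassign each point to its closest centroid with centroids fixed.
        new_assignment = [
            min(range(k), key=lambda p: sq_dist(v, centroids[p])) for v in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
    return assignment, centroids

# Two well-separated groups, deliberately started from a bad partition.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
assignment, centroids = kmeans(points, [0, 1, 0, 1, 0, 1], k=2)
```

Neither step can increase the objective, which is why the loop terminates. It may still stop at a local optimum that depends on the initial partition.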


23 A Soft Version. Based on a probabilistic model of the data as coming from a mixture of Gaussians: $P(z_i = j) = \pi_j$, $P(x_i \mid z_i = j) \sim N(\mu_j, \sigma^2 I)$. Goal: evaluate the parameters $\theta$ (assume $\sigma$ is known). Method: apply EM to maximize the likelihood of the data, $L(\theta) \propto \prod_i \sum_j \pi_j \exp\left(-\frac{d(x_i, \mu_j)^2}{2\sigma^2}\right)$.

24 EM, soft version. Iteratively, compute the soft assignment and use it to derive expectations of $\pi$, $\mu$. [Update formulas not recovered.]
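A minimal sketch of one such iteration is given below. It uses the standard mixture-model updates (soft assignment proportional to $\pi_j \exp(-d(x_i,\mu_j)^2 / 2\sigma^2)$, then weighted averages for $\mu_j$ and mean weights for $\pi_j$), which is an assumption about what the slide showed. The function names and the toy data are likewise hypothetical.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def em_step(xs, pis, mus, sigma):
    """One EM iteration for a mixture of spherical Gaussians, sigma known."""
    k, n, dim = len(pis), len(xs), len(xs[0])
    # Soft assignment: w[i][j] is the posterior probability that point i
    # belongs to component j under the current parameters.
    w = []
    for x in xs:
        scores = [
            pis[j] * math.exp(-sq_dist(x, mus[j]) / (2 * sigma ** 2))
            for j in range(k)
        ]
        total = sum(scores)
        w.append([s / total for s in scores])
    # Re-estimate pi and mu from the soft assignment.
    new_pis, new_mus = [], []
    for j in range(k):
        weight_j = sum(w[i][j] for i in range(n))
        new_pis.append(weight_j / n)
        new_mus.append(
            [sum(w[i][j] * xs[i][d] for i in range(n)) / weight_j for d in range(dim)]
        )
    return new_pis, new_mus, w

# Toy data: two tight groups around (0, 0) and (5, 5).
xs = [[0.0, 0.0], [0.2, -0.1], [-0.2, 0.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
pis, mus = [0.5, 0.5], [[1.0, 1.0], [4.0, 4.0]]
for _ in range(20):
    pis, mus, w = em_step(xs, pis, mus, sigma=1.0)
```

With groups this well separated the soft assignments are nearly 0 or 1, so the result is close to what k-means would produce. The methods differ more when clusters overlap.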

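25 Soft vs. hard clustering. The soft version maximizes $L(\theta) \propto \prod_i \sum_j \pi_j \exp\left(-\frac{d(x_i, \mu_j)^2}{2\sigma^2}\right)$. If we assume that each element is in one cluster (hard assignment), then, up to terms that do not depend on $\mu$, $\log L(\theta) \propto -\sum_i d(x_i, \mu_{c(i)})^2$, where $c(i)$ is the cluster of element $i$. Maximizing the likelihood therefore amounts to minimizing the sum of squared distances. This is exactly the k-means criterion!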

26 Expectation-maximization: the probabilistic setting. Input: data $x$ coming from a probabilistic model with hidden information $y$. Goal: learn the model's parameters so that the likelihood of the data is maximized. Example: a mixture of two Gaussians, with $P(y_i = 1) = p_1$; $P(y_i = 2) = p_2 = 1 - p_1$; $P(x_i \mid y_i = j) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x_i - \mu_j)^2}{2\sigma^2}\right)$.

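27 The likelihood function. $P(y_i = 1) = p_1$; $P(y_i = 2) = p_2 = 1 - p_1$; $P(x_i \mid y_i = j) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x_i - \mu_j)^2}{2\sigma^2}\right)$. $L(\theta) = \prod_i P(x_i \mid \theta) = \prod_i \sum_j P(x_i, y_i = j \mid \theta)$, so $\log L(\theta) = \sum_i \log \sum_j \frac{p_j}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x_i - \mu_j)^2}{2\sigma^2}\right)$.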

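28 The EM algorithm. Goal: maximize $\log P(x \mid \theta) = \log \sum_y P(x, y \mid \theta)$. Assume we have a model $\theta^t$ which we wish to improve. Note: $P(x \mid \theta) = P(x, y \mid \theta) / P(y \mid x, \theta)$, so $\log P(x \mid \theta) = \log P(x, y \mid \theta) - \log P(y \mid x, \theta)$. Multiplying by $P(y \mid x, \theta^t)$ and summing over $y$ (these weights sum to 1):
$$\log P(x \mid \theta) = \sum_y P(y \mid x, \theta^t) \log P(x, y \mid \theta) - \sum_y P(y \mid x, \theta^t) \log P(y \mid x, \theta)$$
In particular, at $\theta = \theta^t$:
$$\log P(x \mid \theta^t) = \sum_y P(y \mid x, \theta^t) \log P(x, y \mid \theta^t) - \sum_y P(y \mid x, \theta^t) \log P(y \mid x, \theta^t)$$
Subtracting, and writing $Q(\theta \mid \theta^t) = \sum_y P(y \mid x, \theta^t) \log P(x, y \mid \theta)$:
$$\log P(x \mid \theta) - \log P(x \mid \theta^t) = Q(\theta \mid \theta^t) - Q(\theta^t \mid \theta^t) + \sum_y P(y \mid x, \theta^t) \log \frac{P(y \mid x, \theta^t)}{P(y \mid x, \theta)}$$
Here $Q(\theta^t \mid \theta^t)$ is a constant, and the last sum is a relative entropy, hence $\geq 0$. So any $\theta$ that increases $Q(\theta \mid \theta^t)$ above $Q(\theta^t \mid \theta^t)$ also increases the log-likelihood.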

29 The EM algorithm (cont.). Main component: $Q(\theta \mid \theta^t) = \sum_y P(y \mid x, \theta^t) \log P(x, y \mid \theta) = E^t[\log P(x, y \mid \theta)]$ is the expectation of $\log P(x, y \mid \theta)$ over the distribution of $y$ given by the current parameters $\theta^t$. The algorithm: E-step: calculate the $Q$ function. M-step: maximize $Q(\theta \mid \theta^t)$ with respect to $\theta$.
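One practical consequence of this construction is that the log-likelihood never decreases from one iteration to the next. The sketch below checks this numerically on the two-Gaussian mixture with known $\sigma$. The data, starting values, and function names are made up for the illustration, and the M-step uses the usual closed-form maximizers (weighted means and mean weight), stated here as an assumption rather than taken from the slides.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def log_likelihood(xs, p1, mu1, mu2, sigma):
    """log L(theta) for a two-component mixture with shared, known sigma."""
    return sum(
        math.log(p1 * normal_pdf(x, mu1, sigma) + (1 - p1) * normal_pdf(x, mu2, sigma))
        for x in xs
    )

def em_iteration(xs, p1, mu1, mu2, sigma):
    # E-step: posterior probability that each point came from component 1.
    w1 = []
    for x in xs:
        a = p1 * normal_pdf(x, mu1, sigma)
        b = (1 - p1) * normal_pdf(x, mu2, sigma)
        w1.append(a / (a + b))
    # M-step: closed-form maximizers of Q for p1, mu1, mu2 (sigma stays fixed).
    n1 = sum(w1)
    n2 = len(xs) - n1
    new_p1 = n1 / len(xs)
    new_mu1 = sum(w * x for w, x in zip(w1, xs)) / n1
    new_mu2 = sum((1 - w) * x for w, x in zip(w1, xs)) / n2
    return new_p1, new_mu1, new_mu2

xs = [-1.2, -0.8, -0.1, 0.3, 0.9, 3.7, 4.1, 4.6, 5.2, 5.5]  # made-up data
p1, mu1, mu2, sigma = 0.5, 0.0, 1.0, 1.0  # deliberately poor starting point
history = [log_likelihood(xs, p1, mu1, mu2, sigma)]
for _ in range(25):
    p1, mu1, mu2 = em_iteration(xs, p1, mu1, mu2, sigma)
    history.append(log_likelihood(xs, p1, mu1, mu2, sigma))
```

Here `history` holds the log-likelihood after each iteration, so plotting or printing it is a cheap sanity check for any EM implementation.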

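30 Application to the mixture model. $Q(\theta \mid \theta^t) = \sum_y P(y \mid x, \theta^t) \log P(x, y \mid \theta) = E^t[\log P(x, y \mid \theta)]$. Define the indicator $y_{ij} = 1$ if $y_i = j$, and $0$ otherwise. Then $P(x_i, y_i \mid \theta) = \prod_j P(x_i, y_i = j \mid \theta)^{y_{ij}}$, so $\log P(x, y \mid \theta) = \sum_i \sum_j y_{ij} \log P(x_i, y_i = j \mid \theta)$ and $E^t[\log P(x, y \mid \theta)] = \sum_i \sum_j E^t[y_{ij}] \log P(x_i, y_i = j \mid \theta)$.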

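31 Application (cont.). $E^t[\log P(x, y \mid \theta)] = \sum_i \sum_j E^t[y_{ij}] \log P(x_i, y_i = j \mid \theta)$. Define $w_{ij}^t := E^t[y_{ij}] = P(y_{ij} = 1 \mid x_i, \theta^t) = \frac{P(x_i, y_i = j \mid \theta^t)}{\sum_{j'} P(x_i, y_i = j' \mid \theta^t)}$. Then $Q(\theta \mid \theta^t) = \sum_i \sum_j w_{ij}^t \left[ \log \frac{1}{\sqrt{2\pi}} - \log \sigma - \frac{(x_i - \mu_j)^2}{2\sigma^2} + \log p_j \right]$.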
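The expression for $Q$ can be maximized in closed form. The following is a sketch of that M-step under the setting of these slides (shared, known $\sigma$; $n$ data points; constraint $\sum_j p_j = 1$). The multiplier $\lambda$ is notation introduced for this derivation only.

```latex
% Maximize Q over mu_j: only the quadratic term depends on mu_j.
\frac{\partial Q}{\partial \mu_j}
  = \sum_i w_{ij}^t \, \frac{x_i - \mu_j}{\sigma^2} = 0
  \quad\Longrightarrow\quad
  \mu_j^{t+1} = \frac{\sum_i w_{ij}^t \, x_i}{\sum_i w_{ij}^t}

% Maximize Q over p_j subject to sum_j p_j = 1 (Lagrange multiplier lambda).
\frac{\partial}{\partial p_j}
  \Bigl[ \sum_i \sum_{j'} w_{ij'}^t \log p_{j'}
         + \lambda \Bigl( 1 - \sum_{j'} p_{j'} \Bigr) \Bigr]
  = \frac{\sum_i w_{ij}^t}{p_j} - \lambda = 0
  \quad\Longrightarrow\quad
  p_j = \frac{1}{\lambda} \sum_i w_{ij}^t

% Summing over j and using sum_j w_ij = 1 for every i gives lambda = n.
p_j^{t+1} = \frac{1}{n} \sum_i w_{ij}^t
```

So each new mean is the $w$-weighted average of the data, and each new mixing proportion is the average soft membership of its component.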
