Gibbs sampling
Massimo Andreatta
Center for Biological Sequence Analysis
Technical University of Denmark
massimo@cbs.dtu.dk

Monte Carlo simulations
MC methods use repeated random sampling to numerically approximate solutions to problems.

Monte Carlo simulations
A simple example: computing π with sampling
- A circle of radius r sits inside a square of side 2r
- Area of the circle: $A_c = \pi r^2$
- Area of the square: $A_s = (2r)^2$
- Ratio of the areas: $\frac{A_c}{A_s} = \frac{\pi r^2}{4 r^2} = \frac{\pi}{4}$, so $\pi = 4\,\frac{A_c}{A_s}$
- Throw darts randomly at the square: $\frac{\text{hit circle}}{\text{hit square}} = \frac{\text{hit}}{\text{hit} + \text{miss}} \approx \frac{A_c}{A_s}$

Monte Carlo simulations
A simple example: computing π with sampling

    hit = 0
    for N iterations
        x = random(-1,1)
        y = random(-1,1)
        dist = sqrt(x^2 + y^2)
        if (dist < 1) hit++
    pi = 4 * hit/N

using $\pi = 4\,\frac{A_c}{A_s}$

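The pseudocode on this slide can be turned into a small runnable sketch. The function name `estimate_pi` and the `seed` parameter are illustrative choices, not part of the original slides.

```python
import random


def estimate_pi(n_iterations, seed=None):
    """Estimate pi by throwing random darts at the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)  # seed is an assumption, added for reproducibility
    hits = 0
    for _ in range(n_iterations):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        # The dart is inside the unit circle if its distance from the origin is < 1.
        # Comparing squared distances avoids the square root.
        if x * x + y * y < 1.0:
            hits += 1
    # hits / n_iterations approximates A_c / A_s = pi / 4
    return 4.0 * hits / n_iterations


if __name__ == "__main__":
    print(estimate_pi(1_000_000, seed=1))
```

The estimate changes from run to run unless a seed is fixed, and its error shrinks roughly with the square root of the number of iterations.
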
Monte Carlo simulations
A simple example: computing π with sampling
- More iterations give a more accurate estimate
- After 1,000,000 iterations I got π ≈ 3.14182...

Gibbs sampling
A special kind of Monte Carlo method (Markov Chain Monte Carlo, or MCMC)
- estimates a distribution by sampling from it
- the samples are taken with pseudo-random steps
- stepping to the next state depends only on the current state (memory-less chain)

Gibbs sampling
Stochastic search over the states Z of a fitness landscape f(Z)
- $de = f(Z_i) - f(Z_{i-1})$
- $P = \min\left(1, \exp\left(\frac{de}{T}\right)\right)$
- $Z_i$ = current state of the system
- P = probability of accepting the move
- T = a scalar lowered during the search

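As a minimal sketch, the acceptance rule P = min(1, exp(de/T)) can be written as two small functions. The names `acceptance_probability` and `accept_move` are illustrative, and the code assumes T > 0.

```python
import math
import random


def acceptance_probability(de, temperature):
    """P = min(1, exp(de / T)); moves that do not lower the fitness are always accepted."""
    if de >= 0:
        return 1.0  # also avoids overflow of exp() for large positive de / T
    return math.exp(de / temperature)


def accept_move(de, temperature, rng=random):
    """Draw a uniform number in [0, 1) and accept the move with probability P."""
    return rng.random() < acceptance_probability(de, temperature)
```

A move that improves the fitness (de > 0) is always taken. A move that worsens it is taken only sometimes, and less often as T is lowered.
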
Gibbs sampling - down to biology
Sequence alignment

    SLFIGLKGDIRESTV
    DGEEEVQLIAAVPGK
    VFRLKGGAPIKGVTF
    SFSCIAIGIITLYLG
    IDQVTIAGAKLRSLN
    WIQKETLVTFKNPHAKKQDV
    KMLLDNINTPEGIIP
    ELLEFHYYLSSKLNK
    LNKFISPKSVAGRFA
    ESLHNPYPDYHWLRT

- $de = f(Z_i) - f(Z_{i-1})$
- $P = \min\left(1, \exp\left(\frac{de}{T}\right)\right)$
- $Z_i$ = current state of the system
- P = probability of accepting the move
- T = a scalar lowered during the search
- For an alignment the fitness is the energy $E = \sum_{p,a} C_{p,a} \log\frac{p_{p,a}}{q_a}$, so $de = E_i - E_{i-1}$

Gibbs sampling - sequence alignment
State transition
- Move to state +1 [Figure: the alignment before and after the move]
- $E = \sum_{p,a} C_{p,a} \log\frac{p_{p,a}}{q_a}$
- $de = E_i - E_{i-1}$
- Accept or reject the move? $P = \min\left(1, \exp\left(\frac{de}{T}\right)\right)$
- Note that the probability of going to the new state depends only on the previous state

Gibbs sampling - sequence alignment
Numerical example - 1 (move to state +1, T = 0.2)
- $E_{i-1} = 2.44$, $E_i = 2.52$, so $de = 0.08$
- $P = \min\left(1, \exp\left(\frac{0.08}{0.2}\right)\right) = \min(1, 1.49) = 1$
- Accept move with Prob = 100%

Numerical example - 2 (move to state +1, T = 0.2)
- $E_{i-1} = 2.44$, $E_i = 2.35$, so $de = -0.09$
- $P = \min\left(1, \exp\left(\frac{-0.09}{0.2}\right)\right) = \min(1, 0.638) = 0.638$
- Accept move with Prob = 63.8%

Gibbs sampling - sequence alignment
Now, one thing at a time

What is the MC temperature T?
- It's a scalar decreased during the simulation (plotted against iteration)
- E.g. same de = -0.3 but at different temperatures:
    - $t_1 = 0.4$: $P(t_1) = \min\left(1, \exp\left(\frac{-0.3}{0.4}\right)\right) = 0.47$
    - $t_2 = 0.1$: $P(t_2) = \min\left(1, \exp\left(\frac{-0.3}{0.1}\right)\right) = 0.05$
    - $t_3 = 0.02$: $P(t_3) = \min\left(1, \exp\left(\frac{-0.3}{0.02}\right)\right) \approx 0$

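The effect of the temperature can be checked numerically. The sketch below reproduces the three probabilities for de = -0.3 and adds a linear cooling schedule. The slides only say that T is decreased during the simulation, so the linear form and the name `linear_cooling` are assumptions made for illustration.

```python
import math


def acceptance_probability(de, temperature):
    """P = min(1, exp(de / T)), assuming T > 0."""
    return 1.0 if de >= 0 else math.exp(de / temperature)


def linear_cooling(t_start, t_end, n_iterations):
    """Yield one temperature per iteration, going linearly from t_start to t_end.

    The linear shape is an assumption; other decreasing schedules would also fit the slide.
    """
    for it in range(n_iterations):
        yield t_start + (t_end - t_start) * it / max(1, n_iterations - 1)


if __name__ == "__main__":
    for t in (0.4, 0.1, 0.02):
        print(f"T = {t}: P = {acceptance_probability(-0.3, t):.2f}")
```

The same unfavourable move is accepted almost half of the time at T = 0.4 and practically never at T = 0.02.
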
Move freely around states when the system is warm, then cool it off to force it into a state of high fitness.
[Figure: fitness f(Z) plotted against state Z]

Gibbs sampling - sequence alignment
Why sampling?
- 50 sequences, 12 amino acids long
- Try all possible combinations with a 9-mer overlap
- $4^{50} \approx 10^{30}$ possible combinations... computationally unfeasible

    SLFIGLKGDIRESTV
    DGEEEVQLIAAVPGK
    VFRLKGGAPIKGVTF
    SFSCIAIGIITLYLG
    IDQVTIAGAKLRSLN
    WIQKETLVTFKNPHAKKQDV
    KMLLDNINTPEGIIP
    ELLEFHYYLSSKLNK
    LNKFISPKSVAGRFA
    ESLHNPYPDYHWLRT
    ............
    DFAAQVDYPSTGLY

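The count on this slide follows from a short calculation: a 9-mer window fits at 12 - 9 + 1 = 4 positions in a 12-residue sequence, and each of the 50 sequences picks its position independently. A small sketch (the function name `n_alignments` is illustrative):

```python
def n_alignments(n_sequences, seq_length, core_length):
    """Number of ways to choose one core position in each sequence."""
    positions_per_sequence = seq_length - core_length + 1
    return positions_per_sequence ** n_sequences


if __name__ == "__main__":
    print(n_alignments(50, 12, 9))
```

Python integers have arbitrary precision, so the full 31-digit count is printed exactly.
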
Single sequence move
- Move to state +1 [Figure: the alignment before and after the move]
- $E = \sum_{p,a} C_{p,a} \log\frac{p_{p,a}}{q_a}$
- $de = E_i - E_{i-1}$
- Accept or reject the move? $P = \min\left(1, \exp\left(\frac{de}{T}\right)\right)$

Phase shift move
- Move to state +1: shift all sequences [Figure: the alignment before and after the move]
- $E = \sum_{p,a} C_{p,a} \log\frac{p_{p,a}}{q_a}$
- $de = E_i - E_{i-1}$
- Accept or reject the move? $P = \min\left(1, \exp\left(\frac{de}{T}\right)\right)$

A sketch for the alignment algorithm

    Start from a random alignment
    Set initial temperature
    For N iterations
        pick a random sequence
        suggest a shift move
        accept or reject the move depending on P = min(1, exp(de/T))
        every P_sh moves, attempt a phase shift move
        decrease temperature

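The sketch above can be fleshed out into a small runnable example. Several details are assumptions made for illustration and are not stated on the slides: $C_{p,a}$ is read as the count of amino acid a at core position p, $p_{p,a}$ as its pseudocount-corrected frequency, $q_a$ as a flat background, the cooling is linear, and all names and parameter values (`gibbs_align`, `phase_every`, `PSEUDOCOUNT`, the temperatures) are hypothetical. Sequences are assumed to be at least `core_length` long and to contain only the 20 standard amino acids.

```python
import math
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}  # assumption: flat q_a
PSEUDOCOUNT = 1.0  # assumption: simple additive pseudocount


def energy(sequences, offsets, core_length):
    """E = sum over p, a of C_pa * log(p_pa / q_a) for the aligned cores."""
    total = 0.0
    n = len(sequences)
    for p in range(core_length):
        counts = Counter(seq[off + p] for seq, off in zip(sequences, offsets))
        for a, c in counts.items():
            freq = (c + PSEUDOCOUNT * BACKGROUND[a]) / (n + PSEUDOCOUNT)
            total += c * math.log(freq / BACKGROUND[a])
    return total


def gibbs_align(sequences, core_length, n_iterations=2000,
                t_start=1.0, t_end=0.01, phase_every=100, seed=0):
    rng = random.Random(seed)
    max_off = [len(s) - core_length for s in sequences]
    offsets = [rng.randint(0, m) for m in max_off]  # start from a random alignment
    e_current = energy(sequences, offsets, core_length)
    for it in range(n_iterations):
        # linear cooling from t_start to t_end (assumed schedule)
        temperature = t_start + (t_end - t_start) * it / max(1, n_iterations - 1)
        proposal = list(offsets)
        if phase_every and (it + 1) % phase_every == 0:
            # phase shift move: shift the core of all sequences by the same step
            step = rng.choice([-1, 1])
            proposal = [o + step for o in offsets]
            if any(o < 0 or o > m for o, m in zip(proposal, max_off)):
                continue  # the shifted core would fall off a sequence
        else:
            # single sequence move: new core position for one random sequence
            i = rng.randrange(len(sequences))
            proposal[i] = rng.randint(0, max_off[i])
        e_new = energy(sequences, proposal, core_length)
        de = e_new - e_current
        if de >= 0 or rng.random() < math.exp(de / temperature):
            offsets, e_current = proposal, e_new
    return offsets, e_current
```

Calling `gibbs_align(sequences, 9)` returns one core offset per sequence together with the energy of that alignment. Because this energy is summed over all sequences, the temperature range would need tuning for real data.
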
Does it work?

Gibbs sequence alignment - performance

More Gibbs sampling
Aligning scoring matrices

Alignment of scoring matrices
- 4 networks trained on HLA-DRB1*0401
- Combined logo: equally valid solutions, but with different core registers

The PSSM-align algorithm
- Input: all individual PSSMs, each a 20 x L matrix
- 1. Extend matrix with BG frequencies
- 2. Apply random shift
- 3. Do Gibbs sampling for many iterations
    - Accept moves with probability $P = \min\left(1, \exp\left(\frac{de}{T}\right)\right)$
    - Maximize combined Information Content of the core
- Output: one offset per matrix (e.g. 2, -3, 0, 0, -8, 0, 3) and the average matrix over the core

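Steps 1 and 2 and the quantity being maximized can be sketched as follows. This is only an illustration under stated assumptions: each matrix is represented as a list of per-position frequency dictionaries rather than log-odds scores, the information content is computed as a Kullback-Leibler divergence in bits against the background, and the function names are hypothetical. Offsets are assumed to stay within the padding.

```python
import math


def extend_with_background(pssm, pad, background):
    """Step 1: pad a matrix with `pad` background columns on each side."""
    left = [dict(background) for _ in range(pad)]
    right = [dict(background) for _ in range(pad)]
    return left + [dict(col) for col in pssm] + right


def core_information(matrices, offsets, pad, core_length, background):
    """Information content (bits) of the average matrix over the shared core.

    matrices: extended matrices; offsets: one shift per matrix (step 2), |shift| <= pad.
    """
    total = 0.0
    for p in range(core_length):
        for a, q in background.items():
            avg = sum(m[pad + off + p][a]
                      for m, off in zip(matrices, offsets)) / len(matrices)
            if avg > 0:
                total += avg * math.log2(avg / q)
    return total
```

A Gibbs sampler for step 3 would propose a new offset for one matrix, compute de as the change in `core_information`, and accept the move with the usual probability. Matrices that are in register give a sharper average, and therefore a higher information content, than matrices that are out of register.
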
Alignment of scoring matrices
[Figure: before alignment vs. after alignment]

And more Gibbs sampling
Clustering peptide data

Gibbs clustering
Multiple motifs

    SLFIGLKGDIRESTV
    DGEEEVQLIAAVPGK
    VFRLKGGAPIKGVTF
    SFSCIAIGIITLYLG
    IDQVTIAGAKLRSLN
    WIQKETLVTFKNPHAKKQDV
    KMLLDNINTPEGIIP
    ELLEFHYYLSSKLNK
    LNKFISPKSVAGRFA
    ESLHNPYPDYHWLRT
    NKVKSLRILNTRRKL
    MMGMFNMLSTVLGVS
    AKSSPAYPSVLGQTI
    RHLIFCHSKKKCDELAAK

Cluster 1

    ----SLFIGLKGDIRESTV--
    --DGEEEVQLIAAVPGK----
    ------VFRLKGGAPIKGVTF
    ---SFSCIAIGIITLYLG---
    ----IDQVTIAGAKLRSLN--
    WIQKETLVTFKNPHAKKQDV-
    ------KMLLDNINTPEGIIP

Cluster 2

    --ELLEFHYYLSSKLNK----
    ------LNKFISPKSVAGRFA
    ESLHNPYPDYHWLRT------
    -NKVKSLRILNTRRKL-----
    --MMGMFNMLSTVLGVS----
    AKSSPAYPSVLGQTI------
    --RHLIFCHSKKKCDELAAK-

Gibbs clustering - the algorithm
1. List of peptides

    FIGLKGDIR EEEVQLIAA RLKGGAPIK SCIAIGIIT QVTIAGAKL QKETLVTFK LLDNINTPE
    LEFHYYLSS KFISPKSVA LHNPYPDYH VKSLRILNT GMFNMLSTV SSPAYPSVL LIFCHSKKK

2. Create N random groups (g_1, g_2, ..., g_N), for example:

    -----QVTIAGAKL-----
    -----QKETLVTFK-----
    -----LEFHYYLSS-----
    -----GMFNMLSTV-----
    -----SSPAYPSVL-----

    -----SLFIGLKGD-----
    -----SFSCIAIGI-----
    -----KMLLDNINT-----
    -----KYVHGTWRS-----
    -----NKVKSLRIL-----

    -----LHNPYPDYH-----
    -----LIFCHSKKK-----
    -----RLKGGAPIK-----
    -----KFISPKSVA-----
    -----EEEVQLIAA-----

3. Move sequence (here GMFNMLSTV)
4b. Remove peptide from its group I
5b. Score peptide to a new random group R and in its original group I: $de = S_R - S_I$
6b. Accept or reject move: $P = \min\left(1, \exp\left(\frac{de}{T}\right)\right)$

And iterate many times, gradually decreasing T

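Steps 3 to 6b can be sketched in a few lines. The code below is a minimal illustration rather than the actual implementation: it assumes peptides of equal length, a flat background, a simple additive pseudocount and linear cooling, and it leaves out the alignment (shift) moves. The names `score`, `gibbs_cluster`, `BETA` and the default parameter values are hypothetical.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
Q = 1.0 / len(AMINO_ACIDS)  # assumption: flat background frequency
BETA = 1.0                  # assumption: pseudocount weight


def score(peptide, group):
    """Log-odds score of a peptide against the matrix built from `group`."""
    total = 0.0
    for p, a in enumerate(peptide):
        count = sum(1 for other in group if other[p] == a)
        freq = (count + BETA * Q) / (len(group) + BETA)
        total += math.log(freq / Q)
    return total


def members(peptides, assignment, group, skip):
    """Peptides currently in `group`, leaving out the peptide at index `skip`."""
    return [p for j, p in enumerate(peptides)
            if assignment[j] == group and j != skip]


def gibbs_cluster(peptides, n_groups, n_iterations=2000,
                  t_start=1.0, t_end=0.01, seed=0):
    """Return one group index per peptide; needs n_groups >= 2."""
    rng = random.Random(seed)
    assignment = [rng.randrange(n_groups) for _ in peptides]  # 2. random groups
    for it in range(n_iterations):
        temperature = t_start + (t_end - t_start) * it / max(1, n_iterations - 1)
        k = rng.randrange(len(peptides))   # 3. pick a peptide to move
        origin = assignment[k]             # 4b. it is left out of its group I below
        target = rng.choice([g for g in range(n_groups) if g != origin])
        s_r = score(peptides[k], members(peptides, assignment, target, k))
        s_i = score(peptides[k], members(peptides, assignment, origin, k))
        de = s_r - s_i                     # 5b. de = S_R - S_I
        if de >= 0 or rng.random() < math.exp(de / temperature):
            assignment[k] = target         # 6b. accept the move
    return assignment
```

Scoring the peptide against its original group without itself keeps the comparison between S_R and S_I fair, a point the slides return to when choosing the number of clusters.
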
Does it work?
Two MHC class I alleles: HLA-A*0101 and HLA-B*4402
Mixture of 100 binders for the two alleles

    ATDKAAAAY  A*0101
    EVDQTKIQY  A*0101
    AETGSQGVY  B*4402
    ITDITKYLY  A*0101
    AEMKTDAAT  B*4402
    FEIKSAKKF  B*4402
    LSEMLNKEY  A*0101
    GELDRWEKI  B*4402
    LTDSSTLLV  A*0101
    FTIDFKLKY  A*0101
    TTTIKPVSY  A*0101
    EEKAFSPEV  B*4402
    AENLWVPVY  B*4402

Two MHC class I alleles: HLA-A*0101 and HLA-B*4402
Mixed -> Resolved

            A0101   B4402
    G1         97       3
    G2          3      97

Five MHC class I alleles

            A0101   A0201   A0301   B0702   B4402   Main allele
    G0          0       1      76       1       0   HLA-A0301 97%
    G1          2       4       0       0      95   HLA-B4402 94%
    G2          5      87       5       1       0   HLA-A0201 89%
    G3         93       2      19       0       2   HLA-A0101 80%
    G4          0       6       0      98       3   HLA-B0702 92%

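The percentages on this slide can be reproduced from the counts, assuming the numbers are read with one row per group (G0 to G4) and one column per allele. Under that reading every allele column sums to 100 and the fraction of each group taken by its main allele matches the quoted value. The names `COUNTS` and `purity` are illustrative.

```python
ALLELES = ["A0101", "A0201", "A0301", "B0702", "B4402"]

# Counts from the slide, read as rows = groups and columns = alleles (an assumption).
COUNTS = {
    "G0": [0, 1, 76, 1, 0],
    "G1": [2, 4, 0, 0, 95],
    "G2": [5, 87, 5, 1, 0],
    "G3": [93, 2, 19, 0, 2],
    "G4": [0, 6, 0, 98, 3],
}


def purity(row):
    """Return the main allele of a group and the fraction of the group it makes up."""
    best = max(range(len(row)), key=row.__getitem__)
    return ALLELES[best], row[best] / sum(row)


if __name__ == "__main__":
    for group, row in COUNTS.items():
        allele, frac = purity(row)
        print(f"{group}: HLA-{allele} {frac:.0%}")
```
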
HLA-A*02:01 sub-motifs
666 peptide binders (aff < 500 nM)
- <Aff> = 10 nM, <Th> = 4 hours
- <Aff> = 10 nM, <Th> = 1.5 hours

Splitting with Gibbs clustering
- <Aff> = 10 nM, <Th> = 3.5 hours
- <Aff> = 10 nM, <Th> = 2.25 hours

Gibbs clustering
And what if we don't know a priori the number of clusters?

How many clusters?
- We could run the algorithm with different numbers of clusters k and choose the k with the highest information content
- What's going on? Smaller groups tend to have higher information content

How many clusters?
Let's look back at the Energy function
- $E = \sum_{p,a} C_{p,a} \log\frac{p_{p,a}}{q_a}$
- This is equivalent to scoring each sequence S to its matrix (20 x L): $E = \sum_{S} \sum_{p,a} \log\frac{p_{p,a}}{q_a}$, where for each position p of S, a is the amino acid found there
- What is the problem? Overfitting. S was also used to calculate the log-odds matrix. The contribution of S to the matrix will be larger if the cluster is small.
- Before scoring S, remove it and update the matrix: $E = \sum_{S} \sum_{p,a} \log\frac{p^{-S}_{p,a}}{q_a}$, where $p^{-S}_{p,a}$ is the frequency calculated without S

How many clusters?
Is this so important..? YES

    YQAFRTKVH SPRTLNAWV YALTVVWLL LSSIGIPAY AVAKCNLNH
    TPYDINQML LLMMTLPSI KELENEYYF IENATFFIF AEMLASIDL ...

- SCORE w/o removing: $E = \sum_{S} \sum_{p,a} \log\frac{p_{p,a}}{q_a}$
- SCORE removing: $E = \sum_{S} \sum_{p,a} \log\frac{p^{-S}_{p,a}}{q_a}$

Score YALTVVWLL to a matrix, including vs. excluding YALTVVWLL in the matrix construction:

    Num of sequences in the cluster     100      20       3
    SCORE w/o removing                 5.52   10.42   26.78
    SCORE removing                     4.11    2.57    0.05

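The effect in this table can be reproduced in spirit with a small sketch that scores a peptide against a matrix built with and without the peptide itself. The exact numbers on the slide depend on the data and on weighting details that are not given here, so this code is not expected to match them. A flat background and an additive pseudocount are assumed, and all names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
Q = 1.0 / len(AMINO_ACIDS)  # assumption: flat background frequency
BETA = 1.0                  # assumption: pseudocount weight


def log_odds_score(peptide, group):
    """Sum over positions of log(p_pa / q_a), with p_pa estimated from `group`."""
    total = 0.0
    for p, a in enumerate(peptide):
        count = sum(1 for other in group if other[p] == a)
        freq = (count + BETA * Q) / (len(group) + BETA)
        total += math.log(freq / Q)
    return total


def score_including(peptide, others):
    """Score w/o removing: the peptide also contributes to its own matrix."""
    return log_odds_score(peptide, others + [peptide])


def score_excluding(peptide, others):
    """Score removing: the matrix is built from the other cluster members only."""
    return log_odds_score(peptide, others)
```

With these definitions the score that includes the peptide is never lower than the score that excludes it, which is the optimistic bias the slide refers to as overfitting.
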
How many clusters?
Quality of clustering is not only determined by the information content of individual clusters (intra-cluster distance), but also by the ability of different groups to discriminate (inter-cluster distance).
- Before: $E = \sum_{S} \sum_{p,a} \log\frac{p^{-S}_{p,a}}{q_a}$
- Now: $E = \sum_{S} \sum_{p,a} \log\frac{p^{-S}_{p,a}}{q^{-S}_{p,a}}$
- $q^{-S}_{p,a}$ is a position- and cluster-specific background (the background is calculated on all groups not containing S; it accounts for inter-cluster distance)

One last thing and we are ready.
- $E = \sum_{S} \sum_{p,a} \log\frac{p^{-S}_{p,a}}{q^{-S}_{p,a}} - \lambda n$
- The frequencies $p^{-S}_{p,a}$ are calculated by removing the sequence being scored, S
- $q^{-S}_{p,a}$ is the position- and cluster-specific background described above
- A parameter λ modulates the tightness of the clustering (n is the number of clusters)

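The role of the penalty can be illustrated with a toy calculation. The energies below are invented for illustration only and are not results from the slides. The sketch assumes that the log-odds sum has already been computed for each number of clusters n, and that the penalty enters as a subtraction of λn.

```python
def penalized_energy(log_odds_sum, n_clusters, lam):
    """E = (log-odds sum) - lambda * n."""
    return log_odds_sum - lam * n_clusters


def best_number_of_clusters(energy_by_n, lam):
    """Pick the n with the highest penalized energy."""
    return max(energy_by_n,
               key=lambda n: penalized_energy(energy_by_n[n], n, lam))


if __name__ == "__main__":
    # Toy numbers (made up): the unpenalized score keeps creeping up with n.
    toy = {1: 3.0, 2: 3.5, 3: 3.51, 4: 3.515}
    for lam in (0.0, 0.02):
        print(lam, best_number_of_clusters(toy, lam))
```

Without the penalty the largest n wins because splitting the data further never hurts the unpenalized score. With a small λ the tiny gains beyond two clusters no longer pay for the extra groups.
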
How many clusters?
[Figure: KLD sum (y-axis) against number of groups (x-axis, 2 to 12) at lambda = 0.02, one panel each for binders for 2 to 9 MHC class I alleles]

How many clusters?
[Figure: number of clusters found against number of alleles (2 to 8) for random allele combinations, with lambda penalty 0, 0.02 and 0.04]

In conclusion
- Sampling methods can solve problems where the search space is too large to be exhaustively explored
- Gibbs sampling can detect even weak motifs in a sequence alignment (e.g. MHC class II)
- More than 1,000 papers in PubMed using Gibbs sampling methods
    - Transcription start-sites
    - Receptor binding sites
    - Acceptor:Donor sites
    - ...