Gibbs sampling. Massimo Andreatta Center for Biological Sequence Analysis Technical University of Denmark.


Gibbs sampling Massimo Andreatta Center for Biological Sequence Analysis Technical University of Denmark massimo@cbs.dtu.dk Technical University of Denmark 1

Monte Carlo simulations

MC methods use repeated random sampling to numerically approximate solutions to problems.

A simple example: computing π with sampling

Inscribe a circle of radius r in a square of side 2r:

$A_c = \pi r^2, \qquad A_s = (2r)^2 = 4r^2$

$\frac{A_c}{A_s} = \frac{\pi r^2}{4r^2} = \frac{\pi}{4} \quad\Rightarrow\quad \pi = 4\,\frac{A_c}{A_s}$

Throw darts randomly at the square:

$\frac{\text{hit circle}}{\text{hit square}} = \frac{\text{hit}}{\text{hit} + \text{miss}} \approx \frac{A_c}{A_s}$

    hit = 0
    for N iterations
        x = random(-1,1)
        y = random(-1,1)
        dist = sqrt(x^2 + y^2)
        if (dist < 1) hit++
    pi = 4 * hit / N

- More iterations → more accurate estimate
- After 1,000,000 iterations I got pi ≈ 3.14182...
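The dart-throwing pseudocode can be turned into a short runnable sketch. The function name `estimate_pi` and the fixed seed are illustrative choices, not part of the slides; the squared distance is compared with 1 directly, which is equivalent to testing `dist < 1`.

```python
import random

def estimate_pi(n_iterations, seed=0):
    """Estimate pi from the fraction of random points that land inside the unit circle."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    hit = 0
    for _ in range(n_iterations):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y < 1.0:  # same test as dist < 1, without the square root
            hit += 1
    return 4.0 * hit / n_iterations

if __name__ == "__main__":
    # The estimate should tighten around pi as the number of darts grows.
    for n in (1_000, 100_000, 1_000_000):
        print(n, estimate_pi(n))
```

The error shrinks roughly as one over the square root of the number of iterations, so each extra digit of accuracy costs about 100 times more samples.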

Gibbs sampling

A special kind of Monte Carlo method (Markov Chain Monte Carlo, or MCMC):
- estimates a distribution by sampling from it
- the samples are taken with pseudo-random steps
- stepping to the next state only depends on the current state (memory-less chain)

Gibbs sampling: stochastic search

[Figure: fitness f(Z) plotted against the state Z]

$dE = f(Z_i) - f(Z_{i-1})$

$P = \min\left(1, \exp\left(\frac{dE}{T}\right)\right)$

Z_i = current state of the system
P = probability of accepting the move
T = a scalar lowered during the search

Gibbs sampling - down to biology

Sequence alignment

    SLFIGLKGDIRESTV
    DGEEEVQLIAAVPGK
    VFRLKGGAPIKGVTF
    SFSCIAIGIITLYLG
    IDQVTIAGAKLRSLN
    WIQKETLVTFKNPHAKKQDV
    KMLLDNINTPEGIIP
    ELLEFHYYLSSKLNK
    LNKFISPKSVAGRFA
    ESLHNPYPDYHWLRT

The state Z_i is now the current alignment, and its fitness is the energy E:

$E = \sum_{p,a} C_{p,a} \log \frac{p_{p,a}}{q_a}$

$dE = E_i - E_{i-1}$

Gibbs sampling - sequence alignment

State transition: move to state +1, then accept or reject the move with probability

$P = \min\left(1, \exp\left(\frac{dE}{T}\right)\right)$

Note that the probability of going to the new state only depends on the previous state.
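The energy function can be sketched in a few lines. The slides do not spell out the symbols, so this sketch assumes that C_{p,a} is the count of amino acid a at core position p, p_{p,a} is the pseudocount-smoothed frequency, and q_a is a flat background of 1/20. The name `alignment_energy` and the `pseudocount` parameter are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def alignment_energy(cores, pseudocount=1.0):
    """E = sum over positions p and residues a of C[p][a] * log(p[p][a] / q[a])."""
    length = len(cores[0])
    q = 1.0 / len(AMINO_ACIDS)  # assumed flat background frequency
    total = len(cores) + pseudocount * len(AMINO_ACIDS)
    energy = 0.0
    for p in range(length):
        counts = Counter(core[p] for core in cores)  # C[p][a] for residues seen at p
        for a, c in counts.items():
            freq = (c + pseudocount) / total  # smoothed frequency p[p][a]
            energy += c * math.log(freq / q)
    return energy

if __name__ == "__main__":
    print(alignment_energy(["AAA"] * 5))                        # conserved columns
    print(alignment_energy(["ACD", "EFG", "HIK", "LMN", "PQR"]))  # diverse columns
```

Conserved columns give a higher energy than diverse ones, which is what the sampler exploits when it looks for a shared motif.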

Numerical example - 1

Move to state +1 at T = 0.2:

$E_{i-1} = 2.44, \qquad E_i = 2.52$

$P = \min\left(1, \exp\left(\frac{0.08}{0.2}\right)\right) = \min(1, 1.49) = 1$

Accept move with Prob = 100%

Numerical example - 2

Move to state +1 at T = 0.2:

$E_{i-1} = 2.44, \qquad E_i = 2.35$

$P = \min\left(1, \exp\left(\frac{-0.09}{0.2}\right)\right) = \min(1, 0.638) = 0.638$

Accept move with Prob = 63.8%
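The two numerical examples can be checked with a small helper. The function name `acceptance_probability` is an illustrative choice; the rule itself is the P = min(1, exp(dE/T)) of the slides, where a higher energy is a better state.

```python
import math

def acceptance_probability(e_new, e_old, temperature):
    """P = min(1, exp(dE / T)) with dE = e_new - e_old (higher energy is better)."""
    d_e = e_new - e_old
    if d_e >= 0:
        return 1.0  # improving moves are always accepted
    return math.exp(d_e / temperature)

print(round(acceptance_probability(2.52, 2.44, 0.2), 3))  # 1.0
print(round(acceptance_probability(2.35, 2.44, 0.2), 3))  # 0.638
```

Returning early for non-negative dE also avoids overflowing `math.exp` when dE/T is very large.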

Now, one thing at a time

What is the MC temperature T?

It's a scalar decreased during the simulation.

[Figure: temperature T plotted against iteration]

E.g. the same dE = -0.3 but at different temperatures:

$t_1 = 0.4: \quad P(t_1) = \min\left(1, \exp\left(\frac{-0.3}{0.4}\right)\right) = 0.47$

$t_2 = 0.1: \quad P(t_2) = \min\left(1, \exp\left(\frac{-0.3}{0.1}\right)\right) = 0.05$

$t_3 = 0.02: \quad P(t_3) = \min\left(1, \exp\left(\frac{-0.3}{0.02}\right)\right) \approx 0$

[Figure: fitness f(Z) plotted against the state Z]

Move freely around states when the system is warm, then cool it off to force it into a state of high fitness.
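The effect of the temperature can be reproduced numerically. The slides only say that T is decreased during the simulation, so the linear shape of `linear_schedule` below is an assumption, and both function names are hypothetical.

```python
import math

def accept_prob(d_e, temperature):
    """P = min(1, exp(dE / T)); improving moves (dE >= 0) are always accepted."""
    if d_e >= 0:
        return 1.0
    return math.exp(d_e / temperature)

def linear_schedule(t_start, t_end, n_steps):
    """Yield n_steps (>= 2) temperatures going linearly from t_start to t_end (assumed shape)."""
    for step in range(n_steps):
        yield t_start + (t_end - t_start) * step / (n_steps - 1)

# Same dE = -0.3 at the three temperatures used on the slide.
for t in (0.4, 0.1, 0.02):
    print(f"T={t}: P={accept_prob(-0.3, t):.2f}")
# T=0.4: P=0.47
# T=0.1: P=0.05
# T=0.02: P=0.00
```

A worsening move that is accepted almost half of the time when the system is warm is practically never accepted once it has cooled.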

Why sampling?

- 50 sequences, 12 amino acids long
- try all possible combinations with a 9-mer overlap
- $4^{50} \sim 10^{30}$ possible combinations... computationally infeasible

    SLFIGLKGDIRESTV
    DGEEEVQLIAAVPGK
    ...
    DFAAQVDYPSTGLY
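The size of the search space follows from counting the placements of a 9-mer core in each sequence. The symbols L (sequence length), k (core length) and N (number of sequences) are introduced here only for this worked step.

```latex
\begin{align*}
\text{placements per sequence} &= L - k + 1 = 12 - 9 + 1 = 4 \\
\text{alignments of } N = 50 \text{ sequences} &= 4^{50} = 2^{100} \approx 1.27 \times 10^{30}
\end{align*}
```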

Single sequence move

Move to state +1 by shifting a single sequence.

Phase shift move

Move to state +1 by shifting all sequences.

In both cases the move is scored in the same way:

$E = \sum_{p,a} C_{p,a} \log \frac{p_{p,a}}{q_a}, \qquad dE = E_i - E_{i-1}$

Accept or reject the move?

$P = \min\left(1, \exp\left(\frac{dE}{T}\right)\right)$

A sketch for the alignment algorithm

    Start from a random alignment
    Set initial temperature
    For N iterations
        pick a random sequence
        suggest a shift move
        accept or reject the move depending on P = min(1, exp(dE/T))
        every P_sh moves, attempt a phase shift move
        decrease temperature
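The sketch above can be written out as a minimal Python implementation. Everything not named on the slides is an assumption: the flat background, the pseudocount, the linear cooling schedule, the default temperatures, and the helper names `energy`, `accept` and `gibbs_align`. The energy is not normalised here, so the temperature range is an arbitrary choice.

```python
import math
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def energy(seqs, offsets, core_len, pseudocount=1.0):
    """E = sum_{p,a} C[p][a] * log(p[p][a] / q[a]) over the aligned cores."""
    q = 1.0 / len(AMINO_ACIDS)  # assumed flat background
    total = len(seqs) + pseudocount * len(AMINO_ACIDS)
    e = 0.0
    for p in range(core_len):
        counts = Counter(s[o + p] for s, o in zip(seqs, offsets))
        e += sum(c * math.log((c + pseudocount) / total / q) for c in counts.values())
    return e

def accept(d_e, temperature, rng):
    """Accept with probability min(1, exp(dE / T))."""
    return d_e >= 0 or rng.random() < math.exp(d_e / temperature)

def gibbs_align(seqs, core_len=9, n_iter=5000, t_start=0.5, t_end=0.01,
                phase_every=50, seed=0):
    rng = random.Random(seed)
    max_off = [len(s) - core_len for s in seqs]
    offsets = [rng.randint(0, m) for m in max_off]  # start from a random alignment
    e_cur = energy(seqs, offsets, core_len)
    for it in range(n_iter):
        # decrease temperature linearly (assumed schedule)
        temperature = t_start + (t_end - t_start) * it / max(1, n_iter - 1)
        proposal = list(offsets)
        if (it + 1) % phase_every == 0:
            # phase shift move: shift all sequences together by one position
            step = rng.choice((-1, 1))
            proposal = [o + step for o in offsets]
            if any(o < 0 or o > m for o, m in zip(proposal, max_off)):
                continue  # shift would push a core outside its sequence
        else:
            # single sequence move: new random core position for one sequence
            i = rng.randrange(len(seqs))
            proposal[i] = rng.randint(0, max_off[i])
        e_new = energy(seqs, proposal, core_len)
        if accept(e_new - e_cur, temperature, rng):
            offsets, e_cur = proposal, e_new
    return offsets, e_cur
```

Calling `gibbs_align(list_of_peptides)` returns one core start position per sequence and the energy of that alignment. Recomputing the full energy for every proposal keeps the sketch short; a practical implementation would update the counts incrementally.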

Does it work?

Gibbs sequence alignment - performance

More Gibbs sampling

Aligning scoring matrices

Alignment of scoring matrices

- 4 networks trained on HLA*DRB1-0401
- Combined logo: equally valid solutions, but with different core registers

The PSSM-align algorithm

Each individual PSSM is a 20 x L matrix. For all individual PSSMs:

1. Extend matrix with BG frequencies
2. Apply random shift
3. Do Gibbs sampling for many iterations
   - Accept moves with probability $P = \min\left(1, \exp\left(\frac{dE}{T}\right)\right)$
   - Maximize combined Information Content of the core

Example offsets after alignment: 2, -3, 0, 0, -8, 0, 3. The aligned cores are combined into an average matrix.

Alignment of scoring matrices

[Figure: combined logo before alignment and after alignment]
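Step 1 and the objective of step 3 can be illustrated with a small sketch. It assumes that each PSSM is stored as a list of columns of frequencies (not log-odds), that offsets are limited to the padding width, and that information content is measured in bits against the background; the function names are hypothetical and the sampling loop itself is left out.

```python
import math

def extend_with_background(pssm, pad, bg):
    """Pad a frequency matrix (list of columns) with `pad` background columns on each side."""
    left = [list(bg) for _ in range(pad)]
    right = [list(bg) for _ in range(pad)]
    return left + [list(col) for col in pssm] + right

def combined_core_information(extended, offsets, pad, core_len, bg):
    """Average the core windows of all matrices; return the information content in bits."""
    total = 0.0
    for p in range(core_len):
        # column p of each matrix's core window, after applying that matrix's offset
        cols = [m[pad + off + p] for m, off in zip(extended, offsets)]
        avg = [sum(values) / len(cols) for values in zip(*cols)]
        total += sum(x * math.log2(x / q) for x, q in zip(avg, bg) if x > 0)
    return total

if __name__ == "__main__":
    bg = [0.5, 0.5]  # toy two-letter alphabet to keep the example readable
    m = [[0.9, 0.1], [0.1, 0.9]]
    ext = [extend_with_background(m, 1, bg) for _ in range(2)]
    print(combined_core_information(ext, (0, 0), 1, 2, bg))  # same register
    print(combined_core_information(ext, (0, 1), 1, 2, bg))  # shifted register
```

Two identical matrices in the same register keep their information content when averaged, while a one-column shift blurs the average towards the background. A sampler would propose offset changes and accept them with the same min(1, exp(dE/T)) rule, using this information content as the energy.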

And more Gibbs sampling

Clustering peptide data

Gibbs clustering: multiple motifs

    SLFIGLKGDIRESTV
    DGEEEVQLIAAVPGK
    VFRLKGGAPIKGVTF
    SFSCIAIGIITLYLG
    IDQVTIAGAKLRSLN
    WIQKETLVTFKNPHAKKQDV
    KMLLDNINTPEGIIP
    ELLEFHYYLSSKLNK
    LNKFISPKSVAGRFA
    ESLHNPYPDYHWLRT
    NKVKSLRILNTRRKL
    MMGMFNMLSTVLGVS
    AKSSPAYPSVLGQTI
    RHLIFCHSKKKCDELAAK

Cluster 1

    ----SLFIGLKGDIRESTV--
    --DGEEEVQLIAAVPGK----
    ------VFRLKGGAPIKGVTF
    ---SFSCIAIGIITLYLG---
    ----IDQVTIAGAKLRSLN--
    WIQKETLVTFKNPHAKKQDV-
    ------KMLLDNINTPEGIIP

Cluster 2

    --ELLEFHYYLSSKLNK----
    ------LNKFISPKSVAGRFA
    ESLHNPYPDYHWLRT------
    -NKVKSLRILNTRRKL-----
    --MMGMFNMLSTVLGVS----
    AKSSPAYPSVLGQTI------
    --RHLIFCHSKKKCDELAAK-

Gibbs clustering - the algorithm

1. List of peptides

    FIGLKGDIR EEEVQLIAA RLKGGAPIK SCIAIGIIT QVTIAGAKL QKETLVTFK LLDNINTPE
    LEFHYYLSS KFISPKSVA LHNPYPDYH VKSLRILNT GMFNMLSTV SSPAYPSVL LIFCHSKKK

2. Create N random groups g_1, g_2, ..., g_N, for example:

    QVTIAGAKL  QKETLVTFK  LEFHYYLSS  GMFNMLSTV  SSPAYPSVL
    SLFIGLKGD  SFSCIAIGI  KMLLDNINT  KYVHGTWRS  NKVKSLRIL
    LHNPYPDYH  LIFCHSKKK  RLKGGAPIK  KFISPKSVA  EEEVQLIAA

3. Move sequence: pick a peptide (here GMFNMLSTV)

4b. Remove peptide from its group I

5b. Score peptide to a new random group R and in its original group I

$dE = S_R - S_I$

6b. Accept or reject move

$P = \min\left(1, \exp\left(\frac{dE}{T}\right)\right)$

In the example, GMFNMLSTV leaves the first group and joins the third.

And iterate many times, gradually decreasing T.
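The clustering steps can be sketched as follows. The scoring function is an assumption (log-odds of the peptide against pseudocount-smoothed group frequencies with a flat background), as are the linear cooling schedule, the default temperatures, and the names `score`, `members_without` and `gibbs_cluster`. The peptide is always scored against groups that do not contain it, which matches step 4b.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def score(peptide, group, pseudocount=1.0):
    """Log-odds score of a peptide against the position frequencies of a group."""
    q = 1.0 / len(AMINO_ACIDS)  # assumed flat background
    total = len(group) + pseudocount * len(AMINO_ACIDS)
    s = 0.0
    for p, a in enumerate(peptide):
        count = sum(1 for other in group if other[p] == a)
        s += math.log((count + pseudocount) / total / q)
    return s

def members_without(peptides, assign, group, skip):
    """Peptides currently in `group`, leaving out the peptide at index `skip`."""
    return [pep for j, pep in enumerate(peptides) if assign[j] == group and j != skip]

def gibbs_cluster(peptides, n_groups=2, n_iter=5000, t_start=1.0, t_end=0.05, seed=0):
    """Return one group index per peptide; needs n_groups >= 2."""
    rng = random.Random(seed)
    assign = [rng.randrange(n_groups) for _ in peptides]  # step 2: random groups
    for it in range(n_iter):
        temperature = t_start + (t_end - t_start) * it / max(1, n_iter - 1)
        i = rng.randrange(len(peptides))                   # step 3: pick a peptide
        old = assign[i]                                    # its original group I
        new = rng.choice([g for g in range(n_groups) if g != old])  # random group R
        s_r = score(peptides[i], members_without(peptides, assign, new, i))
        s_i = score(peptides[i], members_without(peptides, assign, old, i))
        d_e = s_r - s_i                                    # step 5b: dE = S_R - S_I
        if d_e >= 0 or rng.random() < math.exp(d_e / temperature):
            assign[i] = new                                # step 6b: accept the move
    return assign
```

With this plain score nothing stops all peptides from collecting in a single group, which is one reason a clustering energy needs more than the within-group fit.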

Does it work?

Two MHC class I alleles: HLA-A*0101 and HLA-B*4402

Mixture of 100 binders for the two alleles, for example:

    ATDKAAAAY  A*0101
    EVDQTKIQY  A*0101
    AETGSQGVY  B*4402
    ITDITKYLY  A*0101
    AEMKTDAAT  B*4402
    FEIKSAKKF  B*4402
    LSEMLNKEY  A*0101
    GELDRWEKI  B*4402
    LTDSSTLLV  A*0101
    FTIDFKLKY  A*0101
    TTTIKPVSY  A*0101
    EEKAFSPEV  B*4402
    AENLWVPVY  B*4402

Mixed → Resolved:

    Group   A0101   B4402
    G1         97       3
    G2          3      97

Five MHC class I alleles

    Group   A0101   A0201   A0301   B0702   B4402
    G0          0       1      76       1       0
    G1          2       4       0       0      95
    G2          5      87       5       1       0
    G3         93       2      19       0       2
    G4          0       6       0      98       3

Share of the dominant allele in its group: HLA-A0301 97%, HLA-A0101 80%, HLA-A0201 89%, HLA-B4402 94%, HLA-B0702 92%.
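The percentages quoted for the five-allele experiment can be recomputed from the 25 counts on the slide, reading them as one row of five allele counts per group (G0 to G4). That reading is an assumption, but it is the one under which the quoted percentages come out; `dominant_allele` is a hypothetical helper name.

```python
ALLELES = ["A0101", "A0201", "A0301", "B0702", "B4402"]

# Counts per group, in the order given on the slide (one row of five per group).
COUNTS = {
    "G0": [0, 1, 76, 1, 0],
    "G1": [2, 4, 0, 0, 95],
    "G2": [5, 87, 5, 1, 0],
    "G3": [93, 2, 19, 0, 2],
    "G4": [0, 6, 0, 98, 3],
}

def dominant_allele(row):
    """Return the most frequent allele in a group and its share of the group."""
    best = max(range(len(row)), key=lambda k: row[k])
    return ALLELES[best], row[best] / sum(row)

for group, row in COUNTS.items():
    allele, share = dominant_allele(row)
    print(f"{group}: HLA-{allele} {share:.0%}")
# G0: HLA-A0301 97%
# G1: HLA-B4402 94%
# G2: HLA-A0201 89%
# G3: HLA-A0101 80%
# G4: HLA-B0702 92%
```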

HLA-A*02:01 sub-motifs

666 peptide binders (aff < 500 nM)

- <Aff> = 10 nM, <Th> = 4 hours
- <Aff> = 10 nM, <Th> = 1.5 hours

Splitting with Gibbs clustering

- <Aff> = 10 nM, <Th> = 3.5 hours
- <Aff> = 10 nM, <Th> = 2.25 hours

Gibbs clustering

And what if we don't know a priori the number of clusters?

How many clusters?

We could run the algorithm with different numbers of clusters k and choose the k with the highest information content.

What's going on? Smaller groups tend to have higher information content.

Let's look back at the energy function:

$E = \sum_{p,a} C_{p,a} \log \frac{p_{p,a}}{q_a}$

This is equivalent to scoring each sequence S to its 20 x L matrix, where a is the amino acid found at position p of S:

$E = \sum_{S} \sum_{p,a} \log \frac{p_{p,a}}{q_a}$

What is the problem? Overfitting. S was also used to calculate the log-odds matrix. The contribution of S to the matrix will be larger if the cluster is small.

Before scoring S, remove it and update the matrix:

$E = \sum_{S} \sum_{p,a} \log \frac{p^{-S}_{p,a}}{q_a}$

Is this so important? YES.

    YQAFRTKVH SPRTLNAWV YALTVVWLL LSSIGIPAY AVAKCNLNH
    TPYDINQML LLMMTLPSI KELENEYYF IENATFFIF AEMLASIDL ...

Score YALTVVWLL to a matrix, including vs. excluding YALTVVWLL in the matrix construction:

    Num of sequences in the cluster     100      20       3
    Score w/o removing                 5.52   10.42   26.78
    Score removing                     4.11    2.57    0.05

Quality of clustering is not only determined by the information content of individual clusters (intra-cluster distance), but also by the ability of different groups to discriminate (inter-cluster distance):

$E = \sum_{S} \sum_{p,a} \log \frac{p^{-S}_{p,a}}{q^{-S}_{p,a}}$

One last thing and we are ready.

$E = \sum_{S} \sum_{p,a} \log \frac{p^{-S}_{p,a}}{q^{-S}_{p,a}} - \lambda n$

- $p^{-S}_{p,a}$: frequencies are calculated by removing the sequence being scored, S
- $q^{-S}_{p,a}$: position and cluster-specific background (the background is calculated on all groups not containing S; it accounts for inter-cluster distance)
- $\lambda$: a parameter to modulate the tightness of the clustering (n is the number of clusters)
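The overfitting effect is easy to reproduce on random peptides. This sketch assumes a pseudocount-smoothed log-odds score with a flat background; `score` and `self_score_gap` are hypothetical names, and the numbers it prints are not expected to match the ones on the slide, only the trend.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def score(peptide, cluster, pseudocount=1.0):
    """Log-odds score of a peptide against the position frequencies of a cluster."""
    q = 1.0 / len(AMINO_ACIDS)  # assumed flat background
    total = len(cluster) + pseudocount * len(AMINO_ACIDS)
    return sum(
        math.log((sum(1 for s in cluster if s[p] == a) + pseudocount) / total / q)
        for p, a in enumerate(peptide)
    )

def self_score_gap(cluster_size, length=9, seed=0):
    """Score the first peptide of a random cluster with and without itself in the matrix."""
    rng = random.Random(seed)
    cluster = ["".join(rng.choice(AMINO_ACIDS) for _ in range(length))
               for _ in range(cluster_size)]
    target, rest = cluster[0], cluster[1:]
    return score(target, cluster), score(target, rest)

if __name__ == "__main__":
    for size in (100, 20, 3):
        including, excluding = self_score_gap(size)
        print(size, round(including, 2), round(excluding, 2))
```

Including the peptide in its own matrix always raises its score, and the inflation tends to be largest for small clusters, which is why an uncorrected energy favours many small groups.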

How many clusters?

[Figure: KLD sum plotted against the number of groups (2 to 12), one panel each for binders of 2 to 9 MHC class I alleles, all with lambda = 0.02]

How many clusters?

[Figure: number of clusters found plotted against the number of alleles (2 to 8) for 10 random allele combinations, with lambda penalty = 0, 0.02 and 0.04]

In conclusion

- Sampling methods can solve problems where the search space is too large to be exhaustively explored
- Gibbs sampling can detect even weak motifs in a sequence alignment (e.g. MHC class II)
- More than 1,000 papers in PubMed using Gibbs sampling methods:
  - Transcription start-sites
  - Receptor binding sites
  - Acceptor:Donor sites
  - ...