Protein Threading. Combinatorial optimization approach. Stefan Balev.

Similar documents
Practical Bioinformatics

SEQUENCE ALIGNMENT BACKGROUND: BIOINFORMATICS. Prokaryotes and Eukaryotes. DNA and RNA

Advanced topics in bioinformatics

SUPPORTING INFORMATION FOR. SEquence-Enabled Reassembly of β-lactamase (SEER-LAC): a Sensitive Method for the Detection of Double-Stranded DNA

High throughput near infrared screening discovers DNA-templated silver clusters with peak fluorescence beyond 950 nm

Crick s early Hypothesis Revisited

Supplementary Information for

Supplemental data. Pommerrenig et al. (2011). Plant Cell /tpc

3. Evolution makes sense of homologies. 3. Evolution makes sense of homologies. 3. Evolution makes sense of homologies

SUPPLEMENTARY DATA - 1 -

Clay Carter. Department of Biology. QuickTime and a TIFF (Uncompressed) decompressor are needed to see this picture.

NSCI Basic Properties of Life and The Biochemistry of Life on Earth

Nature Structural & Molecular Biology: doi: /nsmb Supplementary Figure 1

Characterization of Pathogenic Genes through Condensed Matrix Method, Case Study through Bacterial Zeta Toxin

The Trigram and other Fundamental Philosophies

SSR ( ) Vol. 48 No ( Microsatellite marker) ( Simple sequence repeat,ssr),

Number-controlled spatial arrangement of gold nanoparticles with

Electronic supplementary material

Codon Distribution in Error-Detecting Circular Codes

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

Supporting Information for. Initial Biochemical and Functional Evaluation of Murine Calprotectin Reveals Ca(II)-

Supporting Information

Introduction to Molecular Phylogeny

Building a Multifunctional Aptamer-Based DNA Nanoassembly for Targeted Cancer Therapy

Modelling and Analysis in Bioinformatics. Lecture 1: Genomic k-mer Statistics

Table S1. Primers and PCR conditions used in this paper Primers Sequence (5 3 ) Thermal conditions Reference Rhizobacteria 27F 1492R

Supplemental Figure 1.

Evolvable Neural Networks for Time Series Prediction with Adaptive Learning Interval

Regulatory Sequence Analysis. Sequence models (Bernoulli and Markov models)

Objective: You will be able to justify the claim that organisms share many conserved core processes and features.

SUPPLEMENTARY INFORMATION

Supplementary Information

TM1 TM2 TM3 TM4 TM5 TM6 TM bp

Aoife McLysaght Dept. of Genetics Trinity College Dublin

part 3: analysis of natural selection pressure

Supplemental Table 1. Primers used for cloning and PCR amplification in this study

THE MATHEMATICAL STRUCTURE OF THE GENETIC CODE: A TOOL FOR INQUIRING ON THE ORIGIN OF LIFE

evoglow - express N kit distributed by Cat.#: FP product information broad host range vectors - gram negative bacteria

Evolutionary Analysis of Viral Genomes

Why do more divergent sequences produce smaller nonsynonymous/synonymous

SUPPLEMENTARY INFORMATION

The role of the FliD C-terminal domain in pentamer formation and

Biosynthesis of Bacterial Glycogen: Primary Structure of Salmonella typhimurium ADPglucose Synthetase as Deduced from the

evoglow - express N kit Cat. No.: product information broad host range vectors - gram negative bacteria

Re- engineering cellular physiology by rewiring high- level global regulatory genes

It is the author's version of the article accepted for publication in the journal "Biosystems" on 03/10/2015.

The 3 Genomic Numbers Discovery: How Our Genome Single-Stranded DNA Sequence Is Self-Designed as a Numerical Whole

Sex-Linked Inheritance in Macaque Monkeys: Implications for Effective Population Size and Dispersal to Sulawesi

Evolutionary dynamics of abundant stop codon readthrough in Anopheles and Drosophila

Encoding of Amino Acids and Proteins from a Communications and Information Theoretic Perspective

ChemiScreen CaS Calcium Sensor Receptor Stable Cell Line

AtTIL-P91V. AtTIL-P92V. AtTIL-P95V. AtTIL-P98V YFP-HPR

part 4: phenomenological load and biological inference. phenomenological load review types of models. Gαβ = 8π Tαβ. Newton.

Timing molecular motion and production with a synthetic transcriptional clock

Using algebraic geometry for phylogenetic reconstruction

Near-instant surface-selective fluorogenic protein quantification using sulfonated

Diversity of Chlamydia trachomatis Major Outer Membrane

Sequence Divergence & The Molecular Clock. Sequence Divergence

From DNA to protein, i.e. the central dogma

HADAMARD MATRICES AND QUINT MATRICES IN MATRIX PRESENTATIONS OF MOLECULAR GENETIC SYSTEMS

Using an Artificial Regulatory Network to Investigate Neural Computation

Lecture 15: Realities of Genome Assembly Protein Sequencing

Supplementary Information

Capacity of DNA Data Embedding Under. Substitution Mutations

In this article, we investigate the possible existence of errordetection/correction

The Mathematics of Phylogenomics

Codon-model based inference of selection pressure. (a very brief review prior to the PAML lab)

Chain-like assembly of gold nanoparticles on artificial DNA templates via Click Chemistry

Pathways and Controls of N 2 O Production in Nitritation Anammox Biomass

Insects act as vectors for a number of important diseases of

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche

Properties of amino acids in proteins

Packing of Secondary Structures

Supplementary Figure 1. Schematic of split-merger microfluidic device used to add transposase to template drops for fragmentation.

Supporting Information. An Electric Single-Molecule Hybridisation Detector for short DNA Fragments

Lecture IV A. Shannon s theory of noisy channels and molecular codes

Supplemental Figure 1. Phenotype of ProRGA:RGAd17 plants under long day

An Analytical Model of Gene Evolution with 9 Mutation Parameters: An Application to the Amino Acids Coded by the Common Circular Code

FliZ Is a Posttranslational Activator of FlhD 4 C 2 -Dependent Flagellar Gene Expression

Biology 155 Practice FINAL EXAM

Evidence for Evolution: Change Over Time (Make Up Assignment)

Programme Last week s quiz results + Summary Fold recognition Break Exercise: Modelling remote homologues

Symmetry Studies. Marlos A. G. Viana

A p-adic Model of DNA Sequence and Genetic Code 1

In previous lecture. Shannon s information measure x. Intuitive notion: H = number of required yes/no questions.

Identification of a Locus Involved in the Utilization of Iron by Haemophilus influenzae

160, and 220 bases, respectively, shorter than pbr322/hag93. (data not shown). The DNA sequence of approximately 100 bases of each

Translation. A ribosome, mrna, and trna.

Chemistry Chapter 22

Motif Finding Algorithms. Sudarsan Padhy IIIT Bhubaneswar

Probabilistic modeling and molecular phylogeny

DNA sequence analysis of the imp UV protection and mutation operon of the plasmid TP110: identification of a third gene

codon substitution models and the analysis of natural selection pressure

Supplementary information. Porphyrin-Assisted Docking of a Thermophage Portal Protein into Lipid Bilayers: Nanopore Engineering and Characterization.

Advanced Topics in RNA and DNA. DNA Microarrays Aptamers

Sequence analysis and comparison

UNIVERSITY OF CAMBRIDGE INTERNATIONAL EXAMINATIONS General Certifi cate of Education Advanced Subsidiary Level and Advanced Level

Lecture 15: Programming Example: TASEP

Administration. ndrew Torda April /04/2008 [ 1 ]

Introduction to Comparative Protein Modeling. Chapter 4 Part I

Transcription:

Protein Threading Combinatorial optimization approach Stefan Balev Stefan.Balev@univ-lehavre.fr Laboratoire d informatique du Havre Université du Havre Stefan Balev Cours DEA 30/01/2004 p.1/42

Outline Biological background Mathematical models network flow model integer programming model Algorithms Experimental results Stefan Balev Cours DEA 30/01/2004 p.2/42

Biological background Stefan Balev Cours DEA 30/01/2004 p.3/42

Proteins r e p RNA translation proteins DNA transcription li c a ti o n complex biological macromolecules composed of sequences of 20 amino acids (50-2500) key elements of many cellular functions fibrous proteins contribute to hair, skin, bone, muscles,... membrane proteins mediate the exchange of molecules and information across cellular boundaries globular proteins mediate and catalyze the biochemical reactions The human genome codes 100,000 proteins, bacteria 500-1500 Stefan Balev Cours DEA 30/01/2004 p.4/42

Amino acids Stefan Balev Cours DEA 30/01/2004 p.5/42

Genetic code T C A G TTT PHE TCT SER TAT TYR TGT CYS T TTC PHE TCC SER TAC TYR TGC CYS TTA LEU TCA SER TAA stop TGA stop TTG LEU TCG SER TAG stop TGG TRP CTT LEU CCT PRO CAT HIS CGT ARG C CTC LEU CCC PRO CAC HIS CGC ARG CTA LEU CCA PRO CAA GLN CGA ARG CTG LEU CCG PRO CAG GLN CGG ARG ATT ILE ACT THR AAT ASN AGT SER A ATC ILE ACC THR AAC ASN AGC SER ATA ILE ACA THR AAA LYS AGA ARG ATG MET ACG THR AAG LYS AGG ARG GTT VAL GCT ALA GAT ASP GGT GLY G GTC VAL GCC ALA GAC ASP GGC GLY GTA VAL GCA ALA GAA GLU GGA GLY GTG VAL GCG ALA GAG GLU GGG GLY Stefan Balev Cours DEA 30/01/2004 p.6/42

Polypeptide chains Stefan Balev Cours DEA 30/01/2004 p.7/42

The four levels of protein structure primary (1D) structure (amino acid sequence in the protein chain) Stefan Balev Cours DEA 30/01/2004 p.8/42

The four levels of protein structure primary (1D) structure (amino acid sequence in the protein chain) secondary (2D) structure (α-helices and β-sheets) Stefan Balev Cours DEA 30/01/2004 p.8/42

The four levels of protein structure primary (1D) structure (amino acid sequence in the protein chain) secondary (2D) structure (α-helices and β-sheets) tertiary (3D) structure Stefan Balev Cours DEA 30/01/2004 p.8/42

The four levels of protein structure primary (1D) structure (amino acid sequence in the protein chain) secondary (2D) structure (α-helices and β-sheets) tertiary (3D) structure quaternary structure Stefan Balev Cours DEA 30/01/2004 p.8/42

The importance of protein structure The structure determines the function (?). One cannot understand biological reactions without understanding the structure of the participating molecules. Goals: protein engineering: mutating the gene of an existing protein in order to alter its function in a predictable way. protein design: designing de novo a protein to fulfill a desired function Applications: pharmaceutical industry (drug design, drug docking) chemical industry (enzymes) agriculture (pesticides, herbicides, modification of composition of plant oils,... ) Stefan Balev Cours DEA 30/01/2004 p.9/42

Protein folding problem Input: a1a2... an a word over the 20-letter amino acid alphabet Output: (xj, yj, zj), j = 1,..., N the coordinates of aj Stefan Balev Cours DEA 30/01/2004 p.10/42

Structure determination methods Experimental (in vitro) methods: x-ray crystallography, nuclear magnetic resonance. Slow and expensive, cannot cope with the explosion of sequences becoming available. Computational (in silico) methods. Still not accurate and reliable enough. Direct approach: based on modeled atomic force fields and approximation from classical mechanics, seeks to minimize the free energy. Difficult to model, sensitive to approximation errors, computationally expensive. Protein threading: seeks to assign a protein to already-known structure. Still with limited capabilities, but promising and evolving method. Stefan Balev Cours DEA 30/01/2004 p.11/42

Protein threading basic assumptions the sequence (1D structure) determines the 3D structure homologous proteins have similar structure (and function) homologous proteins have conserved structural cores and variable loop regions there are only around 1,000 and 10,000 different protein structural families Stefan Balev Cours DEA 30/01/2004 p.12/42

Protein threading main steps constructing a library of potential core folds (structural templates) choosing an objective function (score function) to evaluate any alignment of a sequence to a structure template finding the best (with respect to the score function) alignment of the query sequence and each structural template in the library choosing the most appropriate template based on the (normalized) scores of the optimal alignments found on the previous step Stefan Balev Cours DEA 30/01/2004 p.13/42

Mathematical model Stefan Balev Cours DEA 30/01/2004 p.14/42

Structure templates Set of m blocks of length li, i = 1,..., m. The blocks correspond to conserved elements of the 2D structure (α-helices and β-sheets) Contact map graph describes the interactions between the amino acids in the blocks Generalized contact map graph describes the interactions between the blocks Stefan Balev Cours DEA 30/01/2004 p.15/42

Query-to-structure alignment query sequence structure template Alignment (threading): covering of segments of of the query sequence by the template blocks A threading is completely determined by the starting positions of the blocks Stefan Balev Cours DEA 30/01/2004 p.16/42

Query-to-structure alignment rules the blocks preserve their order query sequence structure template the blocks do not overlap no gaps in the blocks Stefan Balev Cours DEA 30/01/2004 p.17/42

Absolute and relative positions query sequence structure template abs. position 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 rel. position block 1 1 2 3 4 5 6 7 8 9 rel. position block 2 1 2 3 4 5 6 7 8 9 rel. position block 3 1 2 3 4 5 6 7 8 9 If j is the absolute position of block i, then πi = j i 1 k=1 l k is its relative position The relative position of each block is between 1 and n = N + 1 m i=1 l i The set of threadings is T = {(π1,..., πm) 1 π1 πm n} The number of possible threadings is T = ( m+n 1 m (Example: for m = 20 and n = 100, T 2.5 10 22 ) ) Stefan Balev Cours DEA 30/01/2004 p.18/42

Score function incorporates all the biological and physical knowledge on the problem describes the degree of compatibility between sequence residues and their positions in the structure template essential for the quality of the threading we assume that is additive it can be computed considering no more than two blocks at a time Stefan Balev Cours DEA 30/01/2004 p.19/42

Score function cijkl, (i, k) E, 1 j l n score for putting block i on position j and block k on position l c ijkl block i block k position j position l ϕ(π) = (i,k) E c 1238 c 1225 c 2538 ciπikπk 2 5 8 ϕ(2, 5, 8) = c1225 + c1238 + c2538 The coefficients cijkl are precomputed and stored before the start of threading algorithm Stefan Balev Cours DEA 30/01/2004 p.20/42

Protein threading problem where: min{ϕ(π) π T } ϕ(π) = (i,k) E ciπikπk T = {(π1,..., πm) 1 π1 πm n} The problem is known to be NP-hard and MAX-SNP-hard. Stefan Balev Cours DEA 30/01/2004 p.21/42

Network flow formulation position j = 4 j = 3 j = 2 j = 1 S i = 1 i = 2 i = 3 i = 4 i = 5 i = 6 Each possible threading corresponds to a path from S to T in the T block graph and vice versa. Stefan Balev Cours DEA 30/01/2004 p.22/42

Network flow formulation position j = 4 j = 3 j = 2 j = 1 S i = 1 i = 2 i = 3 i = 4 i = 5 i = 6 Each possible threading corresponds to a path from S to T in the T block graph and vice versa. The red path correspons to the threading (1, 2, 2, 3, 4, 4) Stefan Balev Cours DEA 30/01/2004 p.22/42

Simple case no remote links The costs ci,j,i+1,l are associated to the corresponding edges. The problem reduces to finding the shortest path from S to T. This can be done by simple dynamic programming: F (i + 1, l) = min {F (i, j) + c i,j,i+1,l} j=1,...,l Complexity O(mn 2 ) l j c i,j,i+1,l F(i,j) i i+1 F(i+1,l) Stefan Balev Cours DEA 30/01/2004 p.23/42

Taking into account the remote links position j = 4 j = 3 j = 2 j = 1 S c 1122 c 2232 c 3243 c 4354 c 5464 i = 1 i = 2 i = 3 i = 4 i = 5 i = 6 T block A path from S to T activates complementary edges corresponding to the remote links. We call it augmented path. Stefan Balev Cours DEA 30/01/2004 p.24/42

Taking into account the remote links position j = 4 j = 3 j = 2 j = 1 S c c 1122 1132 c 2243 c 2232 c 3243 i = 1 i = 2 i = 3 i = 4 i = 5 i = 6 c 3264 c 3254 c 4354 c 5464 T block A path from S to T activates complementary edges corresponding to the remote links. We call it augmented path. Protein threading problem: find the augmented path of minimal length. Stefan Balev Cours DEA 30/01/2004 p.24/42

Integer programming models Stefan Balev Cours DEA 30/01/2004 p.25/42

Non-linear model Variables: Objective function: yij {0, 1}, i = 1,..., m, j = 1,..., n yij = 1 block i is on position j f(y) = Constraints: n j=1 yi+1,l (i,k) E 1 j l n cijklyij ykl yij = 1 block i is on exactly one position l j=1 yij if block i + 1 is on position l then block i is before position l Stefan Balev Cours DEA 30/01/2004 p.26/42

Linear model (1) Variables: xi,j,i+1,l = 1 block i is on position j and block i + 1 is on position l zijkl = 1 block i is on position j and block k is on position l Objective function: f(x, z) = m 1 i=1 1 j l n ci,j,i+1,lxi,j,i+1,l + (i,k) R 1 j l n cijklzijkl Stefan Balev Cours DEA 30/01/2004 p.27/42

Linear model (2) Network flow constraints: j l=1 xi 1,l,i,j = n l=j xi,j,i+1,l j n x x i 1,j,i,j + i,j,i+1,n + x i,j,i+1,j x i 1,1,i,j 1 i i+1 xsj = n j=1 n l=j x1j2l xsj = 1 S j n x S x x x Sj Sn S1 1,j,2,n + x 1,j,2,j n + 1 1 Stefan Balev Cours DEA 30/01/2004 p.28/42

Linear model (3) Constraints connecting x and z: n l=j xi,j,i+1,l = n l=j zijkl x i,j,i+1,n x i,j,i+1,j i i+1 k n j l z z ijkn ijkj j=1 xk 1,j,k,l = l j=1 zijkl l 1 z z j,l,k,l j,1,k,l k 1,l,k,l i k 1 k x x k 1,1,k,l Stefan Balev Cours DEA 30/01/2004 p.29/42

Other linear models x X feasible threadings in terms of x variables (network flow constraints) y Y feasible threadings in terms of y variables (as in the nonlinear model) Starting from X (or Y ) one can add different sets of constraints connecting z-variables to x-variables (or y variables). Such models are proposed and compared in [Balev et al.] Stefan Balev Cours DEA 30/01/2004 p.30/42

Solving the IP models Stefan Balev Cours DEA 30/01/2004 p.31/42

LP relaxation IP problem z IP = min cx s.t. x X linear constraints x {0, 1} integrality constraints LP relaxation z LP = min cx s.t. x X 0 x 1 the LP relaxation is easier to solve than the IP problem (simplex method) the LP relaxation provides a lower bound on the optimal objective value of the IP problem (i.e. z LP z IP ) Stefan Balev Cours DEA 30/01/2004 p.32/42

Branch-and-bound using LP Solve LP Choose a variable xj with fractional value Solve LP with xj=0 Choose a variable with fractional value xj=0 xj=1 Solve LR with xj=1 Choose a variable with fractional value Solve LP All variables have integer values Solve LP z > record new record = z LP fathom node LP Stefan Balev Cours DEA 30/01/2004 p.33/42

Experimental results Instances are solved by CPLEX (a general purpose LP/IP solver). Conclusions: much faster than the previously used methods [Lathrop et al] the efficiency depends on the way the model is formulated for more than 90% of real life instances the optimal solution is found in the root of the B&B tree (i.e. z LP = z IP ), the rest 10% require less than 10 nodes this is not the case for randomly generated instances (why?) for large instances even solving the LP relaxation is slow because of the big number of variables and constraints in the model Running time: from 30s on instances of size 10 20 to 2h on instances of size 10 40 Stefan Balev Cours DEA 30/01/2004 p.34/42

Lagrangian relaxation and duality Idea: drop a part of constraints making the problem easier to solve, introduce penalties for violating them in the objective function IP problem: z IP = min cx s.t. x X easy constraints Ax = b complicating constraints Lagrangian relaxation: z LR (λ) = min{cx + λ(b Ax) x X} LR is also an IP problem, but easier to solve than IP LR is relaxation of IP for any λ (i.e. z LR (λ) z IP ) Stefan Balev Cours DEA 30/01/2004 p.35/42

Lagrangian relaxation and duality Idea: drop a part of constraints making the problem easier to solve, introduce penalties for violating them in the objective function IP problem: z IP = min cx s.t. x X easy constraints Ax = b complicating constraints Lagrangian relaxation: z LR (λ) = min{cx + λ(b Ax) x X} Lagrangian dual: z LD (λ) = maxλ z LR (λ) LR is also an IP problem, but easier to solve than IP LR is relaxation of IP for any λ (i.e. z LR (λ) z IP ) LD is better than LP : z LP z LD z IP Stefan Balev Cours DEA 30/01/2004 p.35/42

Solving LD by subgradient optimization Initialization: λ 0 = 0, θ0, t = 0 Iteration t: Solve LR(λ t ). Let x t be the solution found. Let s t = b Ax t (s t is subgradient of z LR (λ) for λ = λ t ). If s t = 0 then stop (x t is optimal solution of IP). Otherwise let θt+1 = θtρ. If θt+1 < ε then stop. Otherwise let λ t+1 = λ t + θt+1s t, t = t + 1 λ 0 λ 3 s 3 λ * λ 4 s 2 λ 2 s 0 s 1 λ 1 Stefan Balev Cours DEA 30/01/2004 p.36/42

LR for protein threding The complicating constraints are those connecting x- and z-variables j = 5 j = 4 j = 3 j = 2 j = 1 i = 1 i = 2 i = 3 i = 4 i = 5 i = 6 i = 7 (a) original problem j = 5 j = 4 j = 3 j = 2 j = 1 i = 1 i = 2 i = 3 i = 4 i = 5 i = 6 i = 7 (c) the constraints corresponding to one of the ends of each link are relaxed j = 5 j = 4 j = 3 j = 2 j = 1 i = 1 i = 2 i = 3 i = 4 i = 5 i = 6 i = 7 (b) all connecting constraints are relaxed j = 5 j = 4 j = 3 j = 2 j = 1 i = 1 i = 2 i = 3 i = 4 i = 5 i = 6 i = 7 (d) like (c) but order on the free ends of the links is imposed Stefan Balev Cours DEA 30/01/2004 p.37/42

Solving protein threading problem LR is solved using dynamic programming algorithm similar to the one proposed by Lathrop et al. Complexity O((m + r)n 2 ), where r is the number of remote links between the blocks. In the worst case this is O(m 2 n 2 ), but for real-life instances it is O(mn 2 ). LD is computed using subgradient optimization limited to 500 iterations. Protein threading problem is solved by branch-and-bound algorithm using the LD bounds. Stefan Balev Cours DEA 30/01/2004 p.38/42

Experimental results Running times for 10,000 instances 100 10 1 time(s) - logscale 0.1 much faster than LP relaxation 0.01 1e+10 1e+15 1e+20 1e+25 1e+30 1e+35 1e+40 size - logscale the optimal solution is found in the root in more than 90% of the cases the algorithm is less sensitive to perturbations in the score coefficients Stefan Balev Cours DEA 30/01/2004 p.39/42

FROST FROST (Fold Recognition Oriented Search Tool) a threading tool developed by MIG, INRA (J.-F. Gibrat, A. Marin et al) contains a library of about 1,200 structure templates. optimization algorithms in FROST branch-and-bound (Lathrop et al.) a steepest descent heuristic (Zimmermann) branch-and-bound using LP relaxation (Balev et al) branch-and-bound using Lagrangian relaxation (Balev) Stefan Balev Cours DEA 30/01/2004 p.40/42

Score normalization in FROST (1) The scores of alignment of a query to two templates are not directly comparable. To normalize the scores FROST aligns each template from the database to 1,000 query sequences. The score distribution is approximated by this empirical data. When a real query is threaded to the template the score is interpreted according to the distribution significant difference significant similarity score 0.25 0.75 1 % of the queries Stefan Balev Cours DEA 30/01/2004 p.41/42

Score normalization in FROST (2) The computing of score distributions involves about 1,200,000 threadings It must be repeated after each modification in the score scheme Using the other algorithms it takes about 3 months on a cluster of 16 PCs Using Lagrangian relaxation algorithm it takes less than a week Stefan Balev Cours DEA 30/01/2004 p.42/42