CAP 5510 Lecture 3 Protein Structures Su-Shing Chen Bioinformatics CISE 8/19/2005 Su-Shing Chen, CISE 1
Protein Conformation
Protein Conformational Structures
Shaped by hydrophobicity (lack of affinity for water), hydrogen bonding, handedness, and the tension between hierarchy and interactions (electrostatic and van der Waals), a protein structure is a complex geometric pattern of the polypeptide, side chains, and the solvent environment.
A protein in solvent adopts the conformation of minimum free energy.
Molecular dynamics on the potential energy, which includes some nonlinear force terms, gives the conformational structure.
Geometric Features
[Figure: backbone atoms C (carbonyl), O, Cα (α carbon), N (nitrogen), and H (hydrogen), including C, O, and Cα of residue n-1, labeled with the bond distance d, bond angle Φ, and torsion angle Ψ.]
Basic structure of an amino acid
[Figure: a central Cα bonded to H, the amino group (H2N), the carboxyl group (COOH), and the side chain R.]
Secondary Structures
Alpha helix: repeated curvature (bond) and torsion (φ, ψ) angles; a repeating pattern of hydrogen bonds between the CO of residue n and the NH of residue n+4.
Beta sheet: a repeating pattern of hydrogen bonds between distant parts of the backbone.
Random coil.
Secondary Structures
Structure Databases
Structure databases hold 3-D biomolecular structures for protein amino acid sequences.
3-D structures are determined by X-ray crystallography and nuclear magnetic resonance (NMR).
Protein folding is a grand challenge problem: a protein's primary sequence determines its 3-D structure (Anfinsen et al., 1961).
How to Form 3-D Structures
Starting from the NH2 terminus, we identify each amino acid side chain by comparing the atomic structure of each residue with the chemical structures of the 20 amino acids.
Each atom has x, y, z coordinates; together they form a ball-and-stick structure.
A chemical graph carries the chemical data associated with the ball-and-stick model.
Atoms, Bonds and Energy
The bond length: average length of a stable X-X bond is about ? angstroms.
The bond (curvature) angle: Φ = κ.
The torsion angle: Ψ = τ.
Potential energy:
$E = \frac{1}{2}\sum c_d (d - d_0)^2 + \frac{1}{2}\sum c_\kappa (\kappa - \kappa_0)^2 + \frac{1}{2}\sum c_\tau \left(1 + \cos(n\tau - \delta)\right) + \sum \left(\frac{A}{r^{12}} - \frac{B}{r^{6}} + \frac{q_1 q_2}{D r}\right)$
RMSE (Root Mean Square Error)
Similarity measure of 3-D structures.
$X = \{(x_1, y_1, z_1), \ldots, (x_n, y_n, z_n)\}$
$X' = \{(x'_1, y'_1, z'_1), \ldots, (x'_n, y'_n, z'_n)\}$
In Cartesian coordinates:
$R(X, X') = \sqrt{\sum_i \left[(x_i - x'_i)^2 + (y_i - y'_i)^2 + (z_i - z'_i)^2\right]}$
In internal coordinates (bond distance, curvature, torsion):
$R(X, X') = \sqrt{\sum_i \left[(d_i - d'_i)^2 + (\kappa_i - \kappa'_i)^2 + (\tau_i - \tau'_i)^2\right]}$
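The Cartesian form of the measure can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code: the function name `rms_distance` and the `mean` flag are assumptions introduced here. The slide's formula sums without dividing by n; the flag switches on the division by n that the name "root mean square" implies. The sketch also assumes the two structures are already superimposed and list corresponding atoms in the same order.

```python
import math

def rms_distance(X, Xp, mean=True):
    """Root (mean) square distance between two equal-length lists of 3-D points.

    mean=False reproduces the plain square root of the summed squared
    differences; mean=True divides the sum by the number of points first.
    """
    if len(X) != len(Xp) or not X:
        raise ValueError("X and X' must be non-empty and of equal length")
    # Sum squared coordinate differences over all points and all three axes.
    total = sum((a - b) ** 2 for p, q in zip(X, Xp) for a, b in zip(p, q))
    if mean:
        total /= len(X)
    return math.sqrt(total)

# Two points, each displaced by one unit along z.
X = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
Xp = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rms_distance(X, Xp))              # 1.0
print(rms_distance(X, Xp, mean=False))  # square root of 2
```

The same function applies to the internal-coordinate form if each point is replaced by a (d, κ, τ) triple.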
Inverse Protein Folding - Threading
Find amino acid sequences that fold into a known 3-D structure.
Sequence similarities > 30%.
Profile method: compatible environments are described by the area of the buried residue that is inaccessible to solvent, the side-chain area covered by polar O and N atoms, and the local secondary structure.
Protein Superfamilies & Domain Superfolds
Many protein structures are similar.
Protein domains with more than 30% sequence similarity adopt the same fold.
Some proteins with statistically insignificant sequence similarity nonetheless have similar folds.
Dayhoff: families > 50% similarity; superfamilies > 30-40% similarity.
Geometric Features of Proteins
S. Chen, Characterizing and learning of protein conformations, 1993.
A set of points P(i) on the backbone, each with a right-handed orthonormal basis {Ti, Ni, Bi}.
Ti is the (tangent) vector from P(i) to P(i+1).
The binormal vector is Bi = (Ti-1 x Ti) / |Ti-1 x Ti|, normal to the plane through P(i-1), P(i), P(i+1).
The normal is Ni = Bi x Ti.
The curvature κi is the angle between Ti-1 and Ti.
The torsion τi is the angle between Bi and Bi+1.
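The frame construction above can be sketched in plain Python. This is an illustrative reading of the definitions, not the code from the cited work: the helper names (`sub`, `dot`, `cross`, `unit`, `angle`, `curvature_torsion`) are introduced here, tangents are normalized to unit length so that the basis is orthonormal, and the torsion is returned as an unsigned angle, as the slide defines it.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(a):
    n = math.sqrt(dot(a, a))
    if n == 0:
        raise ValueError("zero-length vector (repeated or collinear points)")
    return tuple(x / n for x in a)

def angle(a, b):
    # Angle between two unit vectors, clamped against rounding error.
    return math.acos(max(-1.0, min(1.0, dot(a, b))))

def curvature_torsion(P):
    """Return (kappa, tau, N) for a list of backbone points P."""
    # T[i] is the unit tangent from P[i] to P[i+1].
    T = [unit(sub(P[i + 1], P[i])) for i in range(len(P) - 1)]
    # B[k] is the binormal at point k+1: T[k] x T[k+1], normalized.
    B = [unit(cross(T[i - 1], T[i])) for i in range(1, len(T))]
    # N[k] = B[k] x T[k+1] completes the right-handed frame.
    N = [cross(B[k], T[k + 1]) for k in range(len(B))]
    kappa = [angle(T[i - 1], T[i]) for i in range(1, len(T))]
    tau = [angle(B[k], B[k + 1]) for k in range(len(B) - 1)]
    return kappa, tau, N

# Four points making right-angle turns along x, then y, then z.
P = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (1.0, 1.0, 1.0)]
kappa, tau, N = curvature_torsion(P)
print(kappa, tau)  # both curvatures and the single torsion equal pi/2
```

Three collinear points have no defined binormal; the sketch raises an error in that case rather than choosing an arbitrary frame.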
[Figure: backbone points P(i), P(i+1), P(i+2) with tangent vectors Ti and Ti+1, the normal Ni+1, and the binormal Bi+1.]
Motivations to Study Protein Structures
Proteins are interesting to look at!
Gene-sequencing projects are accumulating gene data and protein sequences at a rapid rate.
However, information about their structure is available for only a small fraction.
Understanding protein structures might help reduce this gap.
Secondary Structure Prediction
Protein structure prediction is one of the most significant tasks in computational structural biology. Its aim is to determine the three-dimensional structure of proteins from their amino acid sequences. In more formal terms, this is the prediction of protein tertiary structure from primary structure.
Protein structure is a valuable resource in drug design, and its prediction is a highly active field of research.
The output of experimentally determined protein structures, typically obtained by time-consuming and relatively expensive X-ray crystallography or NMR spectroscopy, lags far behind the output of protein sequences.
Chou-Fasman
Based on frequencies of residues in alpha helices, beta sheets, and turns.
Accuracy: 50-60%.
Chou-Fasman
Chou-Fasman: Assign Pij values
1. Assign all of the residues the appropriate set of parameters.
Residue    T    S    P    T    A    E    L    M    R    S    T    G
P(H)      69   77   57   69  142  151  121  145   98   77   69   57
P(E)     147   75   55  147   83   37  130  105   93   75  147   75
P(turn)  114  143  152  114   66   74   59   60   95  143  114  156
Chou-Fasman: Scan peptide for α-helix regions
Identify regions where 4 out of 6 residues have P(H) > 100: an alpha-helix nucleus.
Residue    T    S    P    T    A    E    L    M    R    S    T    G
P(H)      69   77   57   69  142  151  121  145   98   77   69   57
Chou-Fasman: Extend α-helix nucleus
Extend the helix in both directions until a set of four residues has an average P(H) < 100.
Repeat steps 1-3 for the entire peptide.
Residue    T    S    P    T    A    E    L    M    R    S    T    G
P(H)      69   77   57   69  142  151  121  145   98   77   69   57
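The nucleation and extension steps can be sketched in Python using the P(H) values from the slides. The function names `helix_nuclei` and `extend_helix` are introduced here for illustration, and the extension rule is one reading of the slide: the helix grows by one residue at a time as long as the four-residue window containing the new edge residue averages at least 100. Other implementations of Chou-Fasman differ in such details.

```python
# P(H) values for the residues that occur in the example peptide (from the slides).
P_H = {"T": 69, "S": 77, "P": 57, "A": 142, "E": 151,
       "L": 121, "M": 145, "R": 98, "G": 57}

def helix_nuclei(seq, window=6, min_hits=4, threshold=100):
    """Start indices of windows where at least min_hits residues have P(H) > threshold."""
    starts = []
    for i in range(len(seq) - window + 1):
        hits = sum(P_H[r] > threshold for r in seq[i:i + window])
        if hits >= min_hits:
            starts.append(i)
    return starts

def extend_helix(seq, start, end, span=4, threshold=100):
    """Grow the half-open region [start, end) in both directions.

    Assumed rule: keep adding the next residue while the span-residue
    window that includes it has an average P(H) of at least threshold.
    """
    def avg(i):
        return sum(P_H[r] for r in seq[i:i + span]) / span

    while start > 0 and avg(start - 1) >= threshold:
        start -= 1
    while end < len(seq) and avg(end - span + 1) >= threshold:
        end += 1
    return start, end

seq = "TSPTAELMRSTG"
print(helix_nuclei(seq))        # [2, 3, 4]
print(extend_helix(seq, 2, 8))  # (2, 10)
```

For the example peptide, the windows starting at indices 2, 3, and 4 (0-based) qualify as nuclei, and the first nucleus extends rightward by two residues before the window average drops below 100.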
Chou-Fasman: Scan peptide for β-sheet regions
Identify regions where 3 out of 5 residues have P(E) > 100: a β-sheet nucleus.
Extend the β-sheet until 4 contiguous residues have an average P(E) < 100.
If the region average P(E) > 105 and the average P(E) > average P(H), then the region is predicted as β-sheet.
Residue    T    S    P    T    A    E    L    M    R    S    T    G
P(H)      69   77   57   69  142  151  121  145   98   77   69   57
P(E)     147   75   55  147   83   37  130  105   93   75  147   75
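A small Python sketch of the β-sheet nucleation test and the final acceptance rule, using the P(H) and P(E) values from the slides. The names `sheet_nuclei` and `is_sheet` are hypothetical, and the sketch applies the acceptance rule to a region given by its start and end indices without performing the extension step.

```python
# Parameters for the residues in the example peptide (from the slides).
P_H = {"T": 69, "S": 77, "P": 57, "A": 142, "E": 151,
       "L": 121, "M": 145, "R": 98, "G": 57}
P_E = {"T": 147, "S": 75, "P": 55, "A": 83, "E": 37,
       "L": 130, "M": 105, "R": 93, "G": 75}

def sheet_nuclei(seq, window=5, min_hits=3, threshold=100):
    """Start indices of windows where at least min_hits residues have P(E) > threshold."""
    return [i for i in range(len(seq) - window + 1)
            if sum(P_E[r] > threshold for r in seq[i:i + window]) >= min_hits]

def is_sheet(seq, start, end):
    """Acceptance rule: region average P(E) > 105 and average P(E) > average P(H)."""
    region = seq[start:end]
    avg_e = sum(P_E[r] for r in region) / len(region)
    avg_h = sum(P_H[r] for r in region) / len(region)
    return avg_e > 105 and avg_e > avg_h

seq = "TSPTAELMRSTG"
print(sheet_nuclei(seq))    # [3, 6]
print(is_sheet(seq, 3, 8))  # False: average P(E) is 100.4
print(is_sheet(seq, 6, 11)) # True: average P(E) 110 exceeds 105 and average P(H) 102
```

The first nucleus (TAELM) passes the 3-out-of-5 test but fails the acceptance rule, which shows why the two checks are kept separate.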
Chou-Fasman: Turns
To identify a bend at residue number j, calculate
p(t) = f(j) f(j+1) f(j+2) f(j+3)
where f(j) is the value for residue j, f(j+1) the value for residue j+1, f(j+2) the value for residue j+2, and f(j+3) the value for residue j+3.
A beta-turn is predicted at that location if:
(1) p(t) > 0.000075;
(2) the average value of P(turn) in the tetrapeptide is > 1.00 (that is, > 100 on the scale used in the parameter table); and
(3) the averages for the tetrapeptide obey the inequality P(a-helix) < P(turn) > P(b-sheet).
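The three turn conditions can be sketched as a single Python predicate. The slides do not list the positional bend frequencies f, so the `f` table below is a made-up placeholder (every entry set to 0.1) used only to make the example run; real values come from the Chou-Fasman tables. The function name `predicts_turn` is also introduced here. Averages are compared on the scale of the parameter table, so the threshold 1.00 appears as 100.

```python
P_H = {"T": 69, "S": 77, "P": 57, "A": 142, "E": 151,
       "L": 121, "M": 145, "R": 98, "G": 57}
P_E = {"T": 147, "S": 75, "P": 55, "A": 83, "E": 37,
       "L": 130, "M": 105, "R": 93, "G": 75}
P_TURN = {"T": 114, "S": 143, "P": 152, "A": 66, "E": 74,
          "L": 59, "M": 60, "R": 95, "G": 156}

def predicts_turn(seq, j, f):
    """Apply the three turn conditions to the tetrapeptide starting at index j.

    f maps a residue to its four positional bend frequencies
    (position 0 for residue j, ..., position 3 for residue j+3).
    """
    tetra = seq[j:j + 4]
    if len(tetra) < 4:
        return False
    p_t = 1.0
    for pos, res in enumerate(tetra):
        p_t *= f[res][pos]

    def avg(table):
        return sum(table[r] for r in tetra) / 4

    turn, helix, sheet = avg(P_TURN), avg(P_H), avg(P_E)
    # Condition (3) uses a chained comparison: helix < turn and turn > sheet.
    return p_t > 0.000075 and turn > 100 and helix < turn > sheet

seq = "TSPTAELMRSTG"
# Placeholder frequencies, NOT real Chou-Fasman values.
f = {res: (0.1, 0.1, 0.1, 0.1) for res in set(seq)}
print(predicts_turn(seq, 0, f))  # True for TSPT with these placeholder f values
print(predicts_turn(seq, 4, f))  # False for AELM: average P(turn) is 64.75
```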
Comparative (Homology) Modeling
Homology modeling is based on the reasonable assumption that two homologous proteins will share very similar structures.
Given the amino acid sequence of an unknown structure and the solved structure of a homologous protein, each amino acid in the solved structure is mutated, computationally, into the corresponding amino acid from the unknown structure.
Comparative (Homology) Modeling
Homology Modeling
In homology modeling the overall fold of a protein is known. The goal is to predict the detailed conformation of a protein given a homologous protein.
Comparative ("homology") modeling approximates the 3D structure of a target protein for which only the sequence is available, provided an empirical 3D "template" structure is available with >30% sequence identity.
Suppose you want to know the 3D structure of a target protein that has not been solved empirically by X-ray crystallography or NMR. You have only the sequence. If an empirically determined 3D structure is available for a sufficiently similar protein (50% or better sequence identity would be good), you can use software that arranges the backbone of your sequence identically to this template. This is called "comparative modeling" or "homology modeling".
It is, at best, moderately accurate for the positions of alpha carbons in the 3D structure, in regions where the sequence identity is high. It is inaccurate for the details of side-chain positions, and for inserted loops with no matching sequence in the solved structure.
SWISS-PDB Viewer
Protein Threading
Protein threading scans the amino acid sequence of an unknown structure against a database of solved structures. In each case, a scoring function is used to assess the compatibility of the sequence with the structure, thus yielding possible three-dimensional models.
It is possible for two proteins to have less than 25% pairwise sequence identity and yet have similar structures. In these cases remote homology modelling is required.
Protein Threading
The algorithm starts with the target protein sequence aligned with SWISS-PROT protein sequences.
The resulting multiple sequence alignment is converted into a 1D structural profile, so the amino acid sequence has now been translated into a 1D string of structure symbols.
The idea is then to find a 3D fold that is similar to our structure.
Finally, the predicted and observed 1D structure profiles are optimally aligned by a dynamic programming algorithm.
The best hit of the alignment procedure is recorded and a 3D model is built from there.
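The dynamic programming step can be illustrated with a small global alignment (Needleman-Wunsch style) over 1D structure strings. This is a sketch under stated assumptions rather than the scoring scheme of any particular threading program: the symbols H, E, L (helix, strand, loop), the match, mismatch, and gap scores, the fold names, and the function names `align_score` and `best_fold` are all hypothetical.

```python
def align_score(pred, obs, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of two 1D structure strings."""
    n, m = len(pred), len(obs)
    # S[i][j] = best score aligning pred[:i] with obs[:j].
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if pred[i - 1] == obs[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + pair,  # align the two symbols
                          S[i - 1][j] + gap,       # gap in obs
                          S[i][j - 1] + gap)       # gap in pred
    return S[n][m]

def best_fold(pred, library):
    """Name of the library fold whose observed profile aligns best with pred."""
    return max(library, key=lambda name: align_score(pred, library[name]))

predicted = "HHHHLLEEEE"
library = {                      # hypothetical observed 1D profiles
    "fold_a": "HHHHLLEEEE",
    "fold_b": "EEEELLHHHH",
    "fold_c": "HHHLLEEEE",
}
print(align_score("HHHH", "HHH"))     # 1: three matches and one gap
print(best_fold(predicted, library))  # fold_a
```

A real threading profile would carry more than three symbols (for example, solvent accessibility classes) and position-specific scores, but the recurrence is the same.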
Protein Threading
Ab Initio Folding
Researchers have pursued the problem of predicting three-dimensional protein structure from the amino acid sequence alone.
Ab initio folding is based on the global optimization of a potential energy function and in general does not use knowledge of experimentally determined protein structures.
Present ab initio folding methods require intense and exhaustive computing time, which increases as a function of the length of the protein.
This limitation is due in part to the assumption that the initial condition for ab initio folding is the linear sequence of residues comprising the protein as encoded by the gene. It is also due to optimizing all-atom potential energy functions and to the use of suboptimal global optimization techniques.
Prediction of Transmembrane Proteins
Transmembrane proteins: the polypeptide chain actually traverses the lipid bilayer.
Why Are They Important?
Membrane proteins are important for several processes and functions in all biological systems:
Receptors for neurotransmitters or hormones
Form ion channels
Serve as the respiratory chain
Nearly 30% of known proteins are membrane bound.
Why Is Prediction Of Transmembrane Regions Important?
Bad news: even though X-ray crystallography is becoming more popular, transmembrane proteins are very difficult to crystallize.
Good news: it is commonly accepted that topology prediction of transmembrane proteins is easier and yields higher accuracy than the prediction of the secondary structure of globular proteins.
Properties Of A Membrane Protein
Traverses the lipid bilayer once or several times
Generally possesses sequences of hydrophobic residues
α-helical transmembrane structure
Typically 17 to 25 residues in length
Brief Transmembrane Prediction History
The cell membrane is a nonpolar lipid layer.
First attempts used this information to label sequences of nonpolar residues as potential transmembrane regions.
Accuracy was increased by considering the charge distribution between the segments inside the cell and those outside the cell, since the environment inside the cell differs from that outside.
Prediction using neural nets.
Prediction using HMMs (Hidden Markov Models).
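The earliest approach, labeling runs of nonpolar residues, can be sketched as a sliding-window hydropathy scan. The slides do not name a hydrophobicity scale; this sketch assumes the Kyte-Doolittle scale, a 19-residue window, and a cutoff of 1.6, all of which are choices made here for illustration. The function name `hydrophobic_windows` and the test sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumed scale; not specified in the slides).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
      "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
      "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
      "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

def hydrophobic_windows(seq, window=19, threshold=1.6):
    """Start indices of windows whose mean hydropathy reaches the threshold."""
    starts = []
    for i in range(len(seq) - window + 1):
        mean = sum(KD[r] for r in seq[i:i + window]) / window
        if mean >= threshold:
            starts.append(i)
    return starts

# A made-up sequence: charged flanks around a 19-residue hydrophobic stretch.
seq = "DDKK" + "LIVALIVALIVALIVALIV" + "KKDD"
print(hydrophobic_windows(seq))
print(hydrophobic_windows("DDKKDDKKDDKKDDKKDDKKDD"))  # [] (no hydrophobic stretch)
```

Overlapping hits around the same stretch would normally be merged into one candidate segment; the later refinements on the slide (charge distribution, neural nets, HMMs) address what this simple scan cannot, such as which side of the membrane each loop lies on.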
Polar and nonpolar amino acids
Neural Networks
The network attempts to determine the next state given the current state and input. This approach is recursive because the state calculated is used in the next step as the previous state for the network.
Neural networks were chosen as the empirical learning system on which to build for a couple of reasons. One basic reason is that networks provide a very general mechanism for representing concepts: a neural network, given the proper number of hidden units and hidden layers, can learn almost any type of concept. A second reason is that they generally deal very well with noisy and incorrect data.
Among the limitations of neural networks, one basic problem is how to select the topology of the network.
Example
Neural Network for Protein Structure Prediction
Hidden Markov Model
Widely used in bioinformatics: sequence alignment, generating profiles for protein families, and database searching.
Can be tailored to particular problems: any known structural knowledge can be incorporated into the model's architecture in order to obtain a more accurate prediction.
Consists of a set of states, rules for changing states, and probabilities of state transitions.
HMM Architecture
Parameters Of The Model
Fixed-length sequences
Helix length: min 17 and max 25 residues
Tail length: min 1 and max 15 residues
Train HMM
HMM
By defining states for transmembrane helix residues and other states for residues in loops on either side of the membrane, and connecting them in a cycle, we can produce a model whose architecture closely resembles the biological system we are modelling.
If the model parameters are tuned to capture the biological reality, the highest-probability path of a protein sequence through the states should predict the true topology.
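Finding the highest-probability path through the states is the Viterbi algorithm. The sketch below uses a deliberately tiny two-state model, loop (L) and membrane helix (M), with made-up start, transition, and emission probabilities and a hypothetical hydrophobic residue set; none of these values come from the lecture. It does not enforce the helix and tail length limits that the full architecture encodes with chains of states.

```python
import math

STATES = ("L", "M")                        # loop, membrane helix
START = {"L": 0.9, "M": 0.1}               # made-up probabilities
TRANS = {"L": {"L": 0.9, "M": 0.1},
         "M": {"L": 0.1, "M": 0.9}}
HYDROPHOBIC = set("AILMFVWC")              # assumed residue classes
EMIT = {"L": {"h": 0.3, "p": 0.7},         # h = hydrophobic, p = polar
        "M": {"h": 0.9, "p": 0.1}}

def viterbi(seq):
    """Most probable state path for seq, computed in log space."""
    if not seq:
        return ""
    obs = ["h" if r in HYDROPHOBIC else "p" for r in seq]
    V = [{s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}]
    back = []
    for o in obs[1:]:
        row, ptr = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in s at this position.
            prev = max(STATES, key=lambda q: V[-1][q] + math.log(TRANS[q][s]))
            row[s] = V[-1][prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][o])
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: V[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))

seq = "KDKDK" + "L" * 10 + "KDKDK"
print(viterbi(seq))  # LLLLLMMMMMMMMMMLLLLL
```

With these toy parameters the ten-residue hydrophobic run is labeled as membrane and the charged flanks as loop; training replaces the hand-set numbers with values estimated from proteins of known topology.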
HMM Results
Problem Studied in Earlier Classes
No structures for Cellulose Synthase