Large-Scale Genomic Surveys

Similar documents
Bioinformatics. Molecular Biophysics & Biochemistry 447b3 / 747b3. Class 3, 1/19/98. Mark Gerstein. Yale University

Week 10: Homology Modelling (II) - HHpred

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Large-Scale Genomic Surveys

BIOINFORMATICS Sequences

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Introduction to Bioinformatics

Algorithms in Bioinformatics

Tools and Algorithms in Bioinformatics

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

In-Depth Assessment of Local Sequence Alignment

Scoring Matrices. Shifra Ben-Dor Irit Orr

Bioinformatics. Proteins II. - Pattern, Profile, & Structure Database Searching. Robert Latek, Ph.D. Bioinformatics, Biocomputing

Quantifying sequence similarity

Computational Biology

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

Sequence analysis and Genomics

Moreover, the circular logic

Substitution matrices

Basic Local Alignment Search Tool

Bioinformatics. Dept. of Computational Biology & Bioinformatics

Pairwise sequence alignments

Tools and Algorithms in Bioinformatics

Pairwise sequence alignments. Vassilios Ioannidis (From Volker Flegel )

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Sequence Alignment Techniques and Their Uses

Sequence analysis and comparison

Single alignment: Substitution Matrix. 16 march 2017

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Pairwise & Multiple sequence alignments

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University

Practical considerations of working with sequencing data

Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter

Alignment & BLAST. By: Hadi Mozafari KUMS

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Bioinformatics and BLAST

Motifs, Profiles and Domains. Michael Tress Protein Design Group Centro Nacional de Biotecnología, CSIC

Protein function prediction based on sequence analysis

Christian Sigrist. November 14 Protein Bioinformatics: Sequence-Structure-Function 2018 Basel

Overview Multiple Sequence Alignment

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Collected Works of Charles Dickens

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Bioinformatics for Biologists

Sequence Comparison. mouse human

Syllabus of BIOINF 528 (2017 Fall, Bioinformatics Program)

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Sequence Database Search Techniques I: Blast and PatternHunter tools

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences

Multiple sequence alignment

Pairwise sequence alignment

EBI web resources II: Ensembl and InterPro

A profile-based protein sequence alignment algorithm for a domain clustering database

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Similarity searching summary (2)

Exercise 5. Sequence Profiles & BLAST

An Introduction to Sequence Similarity ( Homology ) Searching

HMM applications. Applications of HMMs. Gene finding with HMMs. Using the gene finder

Sequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University

Genomics and bioinformatics summary. Finding genes -- computer searches

EBI web resources II: Ensembl and InterPro. Yanbin Yin Spring 2013

An Introduction to Bioinformatics Algorithms Hidden Markov Models

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

G4120: Introduction to Computational Biology

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

ALL LECTURES IN SB Introduction

Computational Molecular Biology (

Sequence and Structure Alignment Z. Luthey-Schulten, UIUC Pittsburgh, 2006 VMD 1.8.5

Scoring Matrices. Shifra Ben Dor Irit Orr

Homology Modeling. Roberto Lins EPFL - summer semester 2005

HMMs and biological sequence analysis

Similarity or Identity? When are molecules similar?

Lecture 1, 31/10/2001: Introduction to sequence alignment. The Needleman-Wunsch algorithm for global sequence alignment: description and properties

CS612 - Algorithms in Bioinformatics

Tutorial 4 Substitution matrices and PSI-BLAST

Hidden Markov Models

Pairwise Sequence Alignment

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

HIDDEN MARKOV MODELS FOR REMOTE PROTEIN HOMOLOGY DETECTION

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega

Chapter 7: Rapid alignment methods: FASTA and BLAST

Introduction to sequence alignment. Local alignment the Smith-Waterman algorithm

Introductory course on Multiple Sequence Alignment Part I: Theoretical foundations

PROTEIN FUNCTION PREDICTION WITH AMINO ACID SEQUENCE AND SECONDARY STRUCTURE ALIGNMENT SCORES

Homology and Information Gathering and Domain Annotation for Proteins

Fundamentals of database searching

Sequence comparison: Score matrices

Sequence Analysis, '18 -- lecture 9. Families and superfamilies. Sequence weights. Profiles. Logos. Building a representative model for a gene.

Advanced topics in bioinformatics

bioinformatics 1 -- lecture 7

Local Alignment Statistics

Transcription:

Bioinformatics Subtopics Fold Recognition Secondary Structure Prediction Docking & Drug Design Protein Geometry Protein Flexibility Homology Modeling Sequence Alignment Structure Classification Gene Prediction Function Classification Database Design Genome Annotation E-literature Expression Clustering Large-Scale Genomic Surveys

Databases Some Specific Informatics tools NCBI GenBank- Protein and DNA sequence NCBI Human Map - Human Genome Viewer NCBI Ensembl - Genome browsers for human, mouse, zebra fish, mosquito TIGR - The Institute for Genome Research SwissProt - Protein Sequence and Function ProDom - Protein Domains Pfam - Protein domain families ProSite - Protein Sequence Motifs Protein Data Base (PDB) - Coordinates for Protein 3D structures SCOP Database- Domain structures organized into evolutionary families HSSP - Domain database using Dali FlyBase WormBase PubMed / MedLine of Bioinformatics Sequence Alignment Tools BLAST Clustal MSAs FASTA PSI-Blast Hidden Markov Models 3D Structure Alignments / Classifications Dali VAST PRISM CATH SCOP

Multiple Sequence Alignment (MSA)

Aligning Text Strings and Gaps Raw Data??? T C A T G C A T T G 2 matches, 0 gaps T C A T G C A T T G 3 matches (2 end gaps) T C A T G.. C A T T G 4 matches, 1 insertion T C A - T G. C A T T G 4 matches, 1 insertion T C A T - G. C A T T G

Sequence Alignment E-value: Expect Value Each sequence alignment has a bit score or similarity score (S), a measure of the similarity between the hit and the query; normalized for effective length" The E-value of the hit is related to the number of alignments in the database you are searching with similarity score S that you expect to find by chance; likelihood of the match relative to a pair of random sequences with the same amino acid composition E = 10-50. Much more likely than a random occurrence E = 10-5. This could be an accidental event E = 1. It is easy to find another hit in the database that is as good The lower the Expect Value (E_val), the more significant the hit

E-value Expect Value Depends on: Similarity Score (Bit Score): Higher similarity score (e.g., high % seq id) corresponds to smaller E-value Length of the query: Since a particular Similarity Score is more easily obtained by chance with a longer query sequence, longer queries have larger E-values Size of the database: Since a larger database makes a particular Similarity Score easier to obtain, a larger database results in larger E-values

Calculating E-val: Raw Score: calculated by counting the number of identities, mismatches, gaps, etc in the alignment Bit Score: Normalizes the raw score to provide a measure of sequence similarity E-value: E = mn2 -S where m - effective length of the query (accounts for the fact that ends may not line up); n - length of the database (number of residues or bases) S - Bit score

Blast against Structural Genomic Target Registry (TargetDB) or against latest PDB or PDB-on-hold listings

Simple Score Simplest score - for identity match matrix S(i,j) = 1 if matches S(I,j) = 0 otherwise S = S(i, j) i, j ng S = Total Score S(i,j) = similarity matrix score for aligning residues i and j Sum is carried out over all aligned i and j residues n = number of gaps G = gap penalty

Dynamic Programming Needleman-Wunsch (1970) provided first automatic method for sequence alignment Dynamic Programming to Find Best Global Alignment Test Data ABCNYRQCLCRPM AYCYNRCKCRBP

Needle-Wunsch Algorithm

Final MAT Array Best Score = 8 A B C N Y - R Q C L C R - P M A Y C - Y N R - C K C R B P

Aligning Text Strings and Gaps Raw Data??? T C A T G C A T T G 2 matches, 0 gaps T C A T G C A T T G 3 matches (2 end gaps) T C A T G.. C A T T G 4 matches, 1 insertion T C A - T G. C A T T G 4 matches, 1 insertion T C A T - G. C A T T G

Scores in Each Box are Defined by Possible Paths Originating from That Box

Step 1 -- Make a Similarity Matrix (Match Scores Determined by Identity Matrix) Put 1's where characters are identical.

Step 2 -- Start Computing the Sum Matrix new_value_cell(r,c) <= cell(r,c) { Old value, either 1 or 0 } + Max[ cell (R+1, C+1), { Diagonally Down, no gaps } cells(r+1, C+2 to C_max),{ Down a row, making col. gap } cells(r+2 to R_max, C+1) { Down a col., making row gap } ]

Step 3 -- Keep Going

Step 4 -- Sum Matrix All Done Alignment Score is 8 matches.

Step 5 -- Traceback Find Best Score (8) and Trace Back A B C N Y - R Q C L C R - P M A Y C - Y N R - C K C R B P

Step 6 -- Alternate Tracebacks A B C - N Y R Q C L C R - P M A Y C Y N - R - C K C R B P

Gap Penalties The score at a position can also factor in a penalty for introducing gaps (i. e., not going from i, j to i- 1, j- 1). Gap penalties are often of linear form: GAP = a + bn GAP is the gap penalty a = cost of opening a gap b = cost of extending the gap by one N = length of the gap (e.g. assume b=0, a=1/2, so GAP = 1/2 regardless of length.)

Step 2 -- Computing the Sum Matrix with Gaps new_value_cell(r,c) <= cell(r,c) { Old value, either 1 or 0 } + Max[ cell (R+1, C+1), { Diagonally Down, no gaps } ] cells(r+1, C+2 to C_max) - GAP,{ Down a row, making col. gap } cells(r+2 to R_max, C+1) - GAP { Down a col., making row gap } GAP =1/2 1.5

Key Idea in Dynamic Programming The best alignment that ends at a given pair of positions (i and j) in the 2 sequences is the score of the best alignment previous to this position PLUS the score for aligning those two positions. An Example Below Aligning R to K does not affect alignment of previous N-terminal residues. Once this is done it is fixed. Then go on to align D to E. How could this be violated? Aligning R to K changes best alignment in box. C- ACSQRP--LRV-SH RSENCV A-SNKPQLVKLMTH VKDFCV ACSQRP--LRV-SH -R SENCV A-SNKPQLVKLMTH VK DFCV

Substitution Matrices Count number of amino acid identities (or nonidentities) Count the minimum number of mutations in the DNA needed to account for the non-identical pairs Measure of similarity based on frequency of mutations observed in homologous protein sequences Measure similarity based on physical properties

Similarity (Substitution) Matrix Identity Matrix Match L with L => 1 Match L with D => 0 Match L with V => 0?? S(aa-1,aa-2) Match L with L => 1 Match L with D => 0 Match L with V =>.5 Number of Common Ones PAM Blossum Gonnet A R N D C Q E G H I L K M F P S T W Y V A 4-1 -2-2 0-1 -1 0-2 -1-1 -1-1 -2-1 1 0-3 -2 0 R -1 5 0-2 -3 1 0-2 0-3 -2 2-1 -3-2 -1-1 -3-2 -3 N -2 0 6 1-3 0 0 0 1-3 -3 0-2 -3-2 1 0-4 -2-3 D -2-2 1 6-3 0 2-1 -1-3 -4-1 -3-3 -1 0-1 -4-3 -3 C 0-3 -3-3 8-3 -4-3 -3-1 -1-3 -1-2 -3-1 -1-2 -2-1 Q -1 1 0 0-3 5 2-2 0-3 -2 1 0-3 -1 0-1 -2-1 -2 E -1 0 0 2-4 2 5-2 0-3 -3 1-2 -3-1 0-1 -3-2 -2 G 0-2 0-1 -3-2 -2 6-2 -4-4 -2-3 -3-2 0-2 -2-3 -3 H -2 0 1-1 -3 0 0-2 7-3 -3-1 -2-1 -2-1 -2-2 2-3 I -1-3 -3-3 -1-3 -3-4 -3 4 2-3 1 0-3 -2-1 -3-1 3 L -1-2 -3-4 -1-2 -3-4 -3 2 4-2 2 0-3 -2-1 -2-1 1 K -1 2 0-1 -3 1 1-2 -1-3 -2 5-1 -3-1 0-1 -3-2 -2 M -1-1 -2-3 -1 0-2 -3-2 1 2-1 5 0-2 -1-1 -1-1 1 F -2-3 -3-3 -2-3 -3-3 -1 0 0-3 0 6-4 -2-2 1 3-1 P -1-2 -2-1 -3-1 -1-2 -2-3 -3-1 -2-4 6-1 -1-4 -3-2 S 1-1 1 0-1 0 0 0-1 -2-2 0-1 -2-1 4 1-3 -2-2 T 0-1 0-1 -1-1 -1-2 -2-1 -1-1 -1-2 -1 1 5-2 -2 0 W -3-3 -4-4 -2-2 -3-2 -2-3 -2-3 -1 1-4 -3-2 10 2-3 Y -2-2 -2-3 -2-1 -2-3 2-1 -1-2 -1 3-3 -2-2 2 6-1 V 0-3 -3-3 -1-2 -2-3 -3 3 1-2 1-1 -2-2 0-3 -1 4

Where do matrices come from? 1 Manually align protein structures (or, more risky, sequences) 2 Look at frequency of a.a. substitutions at structurally constant sites. -- i.e. pair i - j exchanges 3 Compute log-odds S(aa-1,aa-2) = log 2 ( freq(o) / freq(e) ) O = observed exchanges, E = expected exchanges odds = freq(observed) / freq(expected) Sij = log odds freq(expected) = f(i)*f(j) = is the chance of getting amino acid i in a column and then having it change to j e.g. A-R pair observed only a tenth as often as expected + > More likely than random 0 > At random base rate - > Less likely than random AAVLL AAVQI AVVQL ASVLL 90% 45%

Amino Acid Frequencies of Occurrence 1978 1991 L 0.085 0.091 A 0.087 0.077 G 0.089 0.074 S 0.070 0.069 V 0.065 0.066 E 0.050 0.062 T 0.058 0.059 K 0.081 0.059 I 0.037 0.053 D 0.047 0.052 R 0.041 0.051 P 0.051 0.051 N 0.040 0.043 Q 0.038 0.041 F 0.040 0.040 Y 0.030 0.032 M 0.015 0.024 H 0.034 0.023 C 0.033 0.020 W 0.010 0.014

Different Matrices are Appropriate at Different Evolutionary Distances (Adapted from D Brutlag, Stanford)

PAM-250 (distant) Change in Matrix with Ev. Dist. PAM-78 (Adapted from D Brutlag, Stanford)

Are the evolutionary rates uniform over the whole of the protein sequence? (No.) The BLOSUM matrices: Henikoff & Henikoff (Henikoff, S. & Henikoff J.G. (1992) PNAS 89:10915-10919). Use blocks of sequence fragments from different protein families which can be aligned without the introduction of gaps. Amino acid pair frequencies can be compiled from these blocks Different evolutionary distances are incorporated into this scheme with a clustering procedure: two sequences that are identical to each other for more than a certain threshold of positions are clustered. More sequences are added to the cluster if they are identical to any sequence already in the cluster at the same level. All sequences within a cluster are then simply averaged. (A consequence of this clustering is that the contribution of closely related sequences to the frequency table is reduced, if the identity requirement is reduced. ) This leads to a series of matrices, analogous to the PAM series of matrices. BLOSUM80: derived at the 80% identity level. The BLOSUM Matrices BLOSUM62 is the BLAST default