Malik Hammoutène - SSC. Sequence Analysis


CONTENTS
Introduction
Analysis of individual sequences
Secondary structure prediction
Pairwise sequence comparison
Database searching I: single heuristic algorithms
Alignment and search statistics
Multiple sequence alignment
Multiple alignment and database searching
Protein families and protein domains
Conclusion

INTRODUCTION DNA STORES AND PASSES ON GENETIC INFORMATION FROM ONE GENERATION TO ANOTHER. THE STEPS OF THE LADDER ARE MADE OF PAIRS OF NITROGEN BASES:
1. ADENINE = A
2. GUANINE = G
3. CYTOSINE = C
4. THYMINE = T

INTRODUCTION DNA HELD BY HYDROGEN BONDS

INTRODUCTION STOP 2 START 2 STOP 1 START 1

INTRODUCTION THE RIBOSOME, WITH THE HELP OF tRNAs, MATCHES EACH CODON TO ITS AMINO ACID

INTRODUCTION PRIMARY STRUCTURE OF A PROTEIN = A CHAIN OF AMINO ACIDS

INTRODUCTION SECONDARY STRUCTURE -> 3RD DIMENSION

ANALYSIS OF INDIVIDUAL SEQUENCES AMINO ACIDS
1. NON-POLAR AND NEUTRAL
2. POLAR AND NEUTRAL
3. ACIDIC AND POLAR
4. BASIC AND POLAR

ANALYSIS OF INDIVIDUAL SEQUENCES HYDROPHOBICITY - = HYDROPHILIC

ANALYSIS OF INDIVIDUAL SEQUENCES AMINO ACID [Figure: amino acid structure with its amino group, acid group and R-group labelled]

ANALYSIS OF INDIVIDUAL SEQUENCES

SECONDARY STRUCTURE PREDICTION CHOU FASMAN
Table 8: Chou & Fasman Secondary Structure Propensity of the Amino Acids

    | Pα   | Pβ   | Pc   |     | Pα   | Pβ   | Pc
A   | 1.42 | 0.83 | 0.75 | M   | 1.45 | 1.05 | 0.5
C   | 0.7  | 1.19 | 1.11 | N   | 0.67 | 0.89 | 1.44
D   | 1.01 | 0.54 | 1.45 | P   | 0.57 | 0.55 | 1.88
E   | 1.51 | 0.37 | 1.12 | Q   | 1.11 | 1.1  | 0.79
F   | 1.13 | 1.38 | 0.49 | R   | 0.98 | 0.93 | 1.09
G   | 0.57 | 0.75 | 1.68 | S   | 0.77 | 0.75 | 1.48
H   | 1    | 0.87 | 1.13 | T   | 0.83 | 1.19 | 0.98
I   | 1.08 | 1.6  | 0.32 | V   | 1.06 | 1.7  | 0.24
K   | 1.16 | 0.74 | 1.1  | W   | 1.08 | 1.37 | 0.45
L   | 1.21 | 1.3  | 0.49 | Y   | 0.69 | 1.47 | 0.84
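To make the propensity table concrete, here is a minimal Python sketch that averages Pα and Pβ over a sliding window and makes a crude helix/strand/coil call. It is a simplification under stated assumptions, not the full Chou-Fasman procedure (which uses nucleation and extension rules): the window length of 6, the threshold of 1.0 and the function names are illustrative choices, while the numbers are copied from Table 8.

```python
# (P_alpha, P_beta) propensities copied from Table 8.
PROPENSITY = {
    "A": (1.42, 0.83), "C": (0.70, 1.19), "D": (1.01, 0.54), "E": (1.51, 0.37),
    "F": (1.13, 1.38), "G": (0.57, 0.75), "H": (1.00, 0.87), "I": (1.08, 1.60),
    "K": (1.16, 0.74), "L": (1.21, 1.30), "M": (1.45, 1.05), "N": (0.67, 0.89),
    "P": (0.57, 0.55), "Q": (1.11, 1.10), "R": (0.98, 0.93), "S": (0.77, 0.75),
    "T": (0.83, 1.19), "V": (1.06, 1.70), "W": (1.08, 1.37), "Y": (0.69, 1.47),
}


def window_propensity(seq, window=6):
    """Return (mean P_alpha, mean P_beta) for every window start position."""
    means = []
    for i in range(len(seq) - window + 1):
        chunk = seq[i:i + window]
        p_alpha = sum(PROPENSITY[r][0] for r in chunk) / window
        p_beta = sum(PROPENSITY[r][1] for r in chunk) / window
        means.append((p_alpha, p_beta))
    return means


def crude_call(seq, window=6, threshold=1.0):
    """Label each window H (helix), E (strand) or C (coil).

    Assumption: a window is called H or E when the corresponding mean
    propensity exceeds the threshold; ties go to helix.
    """
    calls = []
    for p_alpha, p_beta in window_propensity(seq, window):
        if p_alpha > threshold and p_alpha >= p_beta:
            calls.append("H")
        elif p_beta > threshold:
            calls.append("E")
        else:
            calls.append("C")
    return "".join(calls)


if __name__ == "__main__":
    print(crude_call("AAAAAAVVVVVVGGGGGG"))
```

Each output character describes one window, so a sequence of length L yields L - 5 calls with the default window.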

SECONDARY STRUCTURE PREDICTION GOR METHOD

SECONDARY STRUCTURE PREDICTION PHD

SECONDARY STRUCTURE PREDICTION TABLE OF PERFORMANCES [Chart: scores (%), axis from 45 to 75, for the methods CF, GOR I, LIM, LEVIN, PTIT, JASEP7, GOR III, ZHANG, PHD]

PAIRWISE SEQUENCE COMPARISON DOT PLOTS

PAIRWISE SEQUENCE COMPARISON DOT PLOTS UNFILTERED FILTERED
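The difference between an unfiltered and a filtered dot plot can be sketched in a few lines of Python. This is an illustrative sketch, not the method used to draw the slides: the function names, the window length and the match threshold are assumptions. With `window=1` every identical pair of letters gives a dot (unfiltered); with a longer window, a dot is kept only when enough positions in the window agree (filtered).

```python
def dot_plot(a, b, window=1, min_matches=1):
    """Return the set of (i, j) window starts whose windows share
    at least min_matches identical positions (0-based indices)."""
    dots = set()
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            matches = sum(a[i + k] == b[j + k] for k in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots


def render(a, b, dots):
    """Draw the plot as text: rows follow sequence a, columns sequence b."""
    rows = []
    for i in range(len(a)):
        rows.append("".join("*" if (i, j) in dots else "." for j in range(len(b))))
    return "\n".join(rows)


if __name__ == "__main__":
    seq = "GATTACA"
    print(render(seq, seq, dot_plot(seq, seq)))                            # unfiltered
    print()
    print(render(seq, seq, dot_plot(seq, seq, window=3, min_matches=3)))   # filtered
```

Filtering removes the scattered off-diagonal dots that come from chance single-letter matches and leaves the diagonal runs that indicate real similarity.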

PAIRWISE SEQUENCE COMPARISON SEQUENCE ALIGNMENT
GATTATACCA
GATTACA

GATTATACCA
GATTA---CA
GAP OF LENGTH 3
INSERTIONS AND DELETIONS (INDELS) ARE REPRESENTED BY GAPS IN ALIGNMENTS

PAIRWISE SEQUENCE COMPARISON WHAT ABOUT THIS:
GCTACTAGTTCGCTTAGC
GCTACTAGCTCTAGCGCGTATAGC

WHICH ONE?

GCTACTAG-T-T--CGC-T-TAGC
GCTACTAGCTCTAGCGCGTATAGC

GCTACTAGTT------CGCTTAGC
GCTACTAGCTCTAGCGCGTATAGC

PAIRWISE SEQUENCE COMPARISON NEEDLEMAN/WUNSCH ALGORITHM (EXAMPLE WITH NO PENALTY)
SEQUENCE #1: GAATTCAGTTA; M = 11 (LETTERS)
SEQUENCE #2: GGATCGA; N = 7 (LETTERS)
SCORING SCHEME:
Si,j = 1 IF POS I OF #1 IS THE SAME AS POS J OF #2 (MATCH)
Si,j = 0 IF MISMATCH
w = 0 (GAP PENALTY)
STEPS:
1) INITIALIZATION
2) MATRIX FILL
3) TRACEBACK

PAIRWISE SEQUENCE COMPARISON SEQUENCE #1: GAATTCAGTTA; M = 11 (LETTERS) SEQUENCE #2: GGATCGA; N = 7 (LETTERS) INITIALIZATION

PAIRWISE SEQUENCE COMPARISON SEQUENCE #1: GAATTCAGTTA; M = 11 (LETTERS) SEQUENCE #2: GGATCGA; N = 7 (LETTERS) MATRIX FILL
FOR EACH POSITION IN THE MATRIX, Mi,j IS DEFINED TO BE THE MAXIMUM SCORE AT THE POSITION i,j:
$$M_{i,j} = \max\bigl[\, M_{i-1,j-1} + S_{i,j},\;\; M_{i,j-1} + w,\;\; M_{i-1,j} + w \,\bigr]$$
M_{i-1,j-1} + S_{i,j}: match/mismatch in the diagonal
M_{i,j-1} + w: gap in sequence #1
M_{i-1,j} + w: gap in sequence #2

PAIRWISE SEQUENCE COMPARISON SEQUENCE #1: GAATTCAGTTA; M = 11 (LETTERS) SEQUENCE #2: GGATCGA; N = 7 (LETTERS) TRACEBACK
G _ A A T T C A G T T A
G G _ A _ T C _ G _ _ A
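The three steps (initialization, matrix fill, traceback) can be sketched in Python as follows. This is a minimal sketch of the Needleman/Wunsch recurrence with the scoring scheme of the example (match = 1, mismatch = 0, gap penalty w = 0); the function name and the tie-breaking order in the traceback (diagonal first, then gap in sequence #2, then gap in sequence #1) are assumptions, so the returned alignment may be a different, equally optimal one than the alignment drawn on the slide.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=0, gap=0):
    """Global alignment; returns (score, aligned seq1, aligned seq2)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1

    # Initialization: first row and column hold accumulated gap penalties.
    M = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        M[i][0] = i * gap
    for j in range(1, cols):
        M[0][j] = j * gap

    # Matrix fill: M[i][j] = max(diagonal + S, left + w, up + w).
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,   # match/mismatch
                          M[i][j - 1] + gap,     # gap in sequence #1
                          M[i - 1][j] + gap)     # gap in sequence #2

    # Traceback from the bottom-right corner to the top-left corner.
    top, bottom = [], []
    i, j = len(seq1), len(seq2)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if M[i][j] == M[i - 1][j - 1] + s:
                top.append(seq1[i - 1])
                bottom.append(seq2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and M[i][j] == M[i - 1][j] + gap:
            top.append(seq1[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(seq2[j - 1])
            j -= 1
    return M[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    score, a, b = needleman_wunsch("GAATTCAGTTA", "GGATCGA")
    print(score)
    print(a)
    print(b)
```

With a zero gap penalty and a zero mismatch score, the optimal score simply counts the identical aligned pairs, which is why several different alignments reach the same maximum.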

PAIRWISE SEQUENCE COMPARISON GAP PENALTY
GATCGCTACGCTCAGC
 A.C.C..C..T
IF GAPS COST NOTHING, A SHORT SEQUENCE CAN BE SPREAD OVER A LONGER ONE AT WILL: PERFECT SIMILARITY EVERY TIME!

PAIRWISE SEQUENCE COMPARISON ALIGNMENT
GLOBAL ALIGNMENT
G-ATES
GRATED
LOCAL ALIGNMENT: DO NOT NEED TO ALIGN ALL THE BASES IN ALL SEQUENCES
ALIGN BILLGATESLIKESCHEESE AND GRATEDCHEESE
G-ATESLIKESCHEESE
GRATED-----CHEESE
OR
G-ATES & CHEESE
GRATED & CHEESE
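Local alignment can be illustrated with the Smith-Waterman algorithm, a standard dynamic programming method that the slides do not name explicitly. The sketch below is an assumption-laden illustration: the scoring values (match = 1, mismatch = -1, gap = -1) and the function name are chosen for the example only. The key differences from global alignment are that cell scores never drop below zero and that the traceback starts at the best-scoring cell and stops at the first zero.

```python
def smith_waterman(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Local alignment; returns (score, aligned part of seq1, aligned part of seq2)."""
    M = [[0] * (len(seq2) + 1) for _ in range(len(seq1) + 1)]
    best, best_pos = 0, (0, 0)

    # Matrix fill: like global alignment, but scores are floored at 0.
    for i in range(1, len(seq1) + 1):
        for j in range(1, len(seq2) + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            M[i][j] = max(0,
                          M[i - 1][j - 1] + s,
                          M[i - 1][j] + gap,
                          M[i][j - 1] + gap)
            if M[i][j] > best:
                best, best_pos = M[i][j], (i, j)

    # Traceback from the best cell until a zero score is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and M[i][j] > 0:
        s = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + s:
            top.append(seq1[i - 1])
            bottom.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif M[i][j] == M[i - 1][j] + gap:
            top.append(seq1[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(seq2[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(smith_waterman("BILLGATESLIKESCHEESE", "GRATEDCHEESE"))
```

Under these scores the single best local alignment is the shared CHEESE segment; the weaker GATES/GRATED similarity would only be reported by a variant that also returns suboptimal local alignments.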

SINGLE SEQUENCE HEURISTIC ALGORITHMS DATABASE SEARCHING
> fasta myquery swissprot -ktup 2
fasta = search program
myquery = query sequence
swissprot = sequence database
-ktup 2 = optional parameters

SINGLE SEQUENCE HEURISTIC ALGORITHMS RESULTS
The best scores are: initn init1 opt z-sc E(77110)
gi 1706794 sp P49789 FHIT_HUMAN BIS(5'-ADENOSYL)- 996 996 996 1262.1 0
gi 1703339 sp P49776 APH1_SCHPO BIS(5'-NUCLEOSYL) 412 382 395 507.6 1.4e-21
gi 1723425 sp P49775 HNT2_YEAST HIT FAMILY PROTEI 238 133 316 407.4 5.4e-16
gi 3915958 sp Q58276 Y866_METJA HYPOTHETICAL HIT- 153 98 190 253.1 2.1e-07
gi 3916020 sp Q11066 YHIT_MYCTU HYPOTHETICAL 15.7 163 163 184 244.8 6.1e-07
gi 3023940 sp O07513 HIT_BACSU HIT PROTEIN 164 164 170 227.2 5.8e-06
gi 2506515 sp Q04344 HNT1_YEAST HIT FAMILY PROTEI 130 91 157 210.3 5.1e-05
gi 2495235 sp P75504 YHIT_MYCPN HYPOTHETICAL 16.1 125 125 148 199.7 0.0002
gi 418447 sp P32084 YHIT_SYNP7 HYPOTHETICAL 12.4 42 42 140 191.3 0.00058
gi 3025190 sp P94252 YHIT_BORBU HYPOTHETICAL 15.9 128 73 139 188.7 0.00082
gi 1351828 sp P47378 YHIT_MYCGE HYPOTHETICAL HIT- 76 76 133 181.0 0.0022
gi 418446 sp P32083 YHIT_MYCHR HYPOTHETICAL 13.1 27 27 119 165.2 0.017
gi 1708543 sp P49773 IPK1_HUMAN HINT PROTEIN (PRO 66 66 118 163.0 0.022
gi 2495231 sp P70349 IPK1_MOUSE HINT PROTEIN (PRO 65 65 116 160.5 0.03
gi 1724020 sp P49774 YHIT_MYCLE HYPOTHETICAL HIT- 52 52 117 160.3 0.031
gi 1170581 sp P16436 IPK1_BOVIN HINT PROTEIN (PRO 66 66 115 159.3 0.035
gi 2495232 sp P80912 IPK1_RABIT HINT PROTEIN (PRO 66 66 112 155.5 0.057
gi 1177047 sp P42856 ZB14_MAIZE 14 KD ZINC-BINDIN 73 73 112 155.4 0.058
gi 1177046 sp P42855 ZB14_BRAJU 14 KD ZINC-BINDIN 76 76 110 153.8 0.072
gi 1169825 sp P31764 GAL7_HAEIN GALACTOSE-1-PHOSP 58 58 104 138.5 0.51
gi 113999 sp P16550 APA1_YEAST 5',5'''-P-1,P-4-TE 47 47 103 137.8 0.56
gi 1351948 sp P49348 APA2_KLULA 5',5'''-P-1,P-4-T 63 63 98 131.3 1.3
gi 123331 sp P23228 HMCS_CHICK HYDROXYMETHYLGLUTA 58 58 99 129.4 1.6
gi 1170899 sp P06994 MDH_ECOLI MALATE DEHYDROGENA 70 48 91 122.9 3.7
gi 3915666 sp Q10798 DXR_MYCTU 1-DEOXY-D-XYLULOSE 75 50 92 121.9 4.3
gi 124341 sp P05113 IL5_HUMAN INTERLEUKIN-5 PRECU 36 36 85 121.3 4.7
gi 1170538 sp P46685 IL5_CERTO INTERLEUKIN-5 PREC 36 36 84 120.0 5.5
gi 121369 sp P15124 GLNA_METCA GLUTAMINE SYNTHETA 45 45 90 118.9 6.3
gi 2506868 sp P33937 NAPA_ECOLI PERIPLASMIC NITRA 48 48 92 117.4 7.6
gi 119377 sp P10403 ENV1_DROME RETROVIRUS-RELATED 59 59 89 117.0 8
gi 1351041 sp P48415 SC16_YEAST MULTIDOMAIN VESIC 48 48 97 117.0 8
gi 4033418 sp O67501 IPYR_AQUAE INORGANIC PYROPHO 38 38 83 116.8 8.3

SINGLE SEQUENCE HEURISTIC ALGORITHMS FASTA (FAST-ALL)
1) FIND K-TUPLES IN THE TWO SEQUENCES
2) IDENTIFY THE 10 HIGHEST SCORING REGIONS
3) INTRODUCE GAPS
4) DETERMINE BEST SEGMENT OF SIMILARITY

SINGLE SEQUENCE HEURISTIC ALGORITHMS 1) FIND K-TUPS IN THE TWO SEQUENCES

position  | 1 2 3 4 5 6 7 8 9 10 11
protein 1 | n c s p t a . . . . .
protein 2 | . . . . . a c s p r k

amino acid | position in protein A | position in protein B | offset (pos A - pos B)
a          | 6                     | 6                     | 0
c          | 2                     | 7                     | -5
k          | -                     | 11                    |
n          | 1                     | -                     |
p          | 4                     | 9                     | -5
r          | -                     | 10                    |
s          | 3                     | 8                     | -5
t          | 5                     | -                     |

Note the common offset for the 3 amino acids c, s and p. A possible alignment is thus quickly found:
protein 1 | n c s p t a
protein 2 | a c s p r k
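The offset table can be reproduced with a short Python sketch. This illustrates only the k-tuple lookup step, not the full FASTA program; the function name is hypothetical, and the dots in the second string are placeholders standing for the unspecified positions 1 to 5 of protein 2.

```python
from collections import Counter


def ktuple_offsets(seq_a, seq_b, k=1):
    """Count offsets (pos A - pos B, 1-based) over all shared k-tuples."""
    # Index every k-tuple of seq_b by its starting positions.
    positions_b = {}
    for j in range(len(seq_b) - k + 1):
        positions_b.setdefault(seq_b[j:j + k], []).append(j + 1)

    # Look up each k-tuple of seq_a and record the offset of every hit.
    offsets = Counter()
    for i in range(len(seq_a) - k + 1):
        for j in positions_b.get(seq_a[i:i + k], []):
            offsets[(i + 1) - j] += 1
    return offsets


if __name__ == "__main__":
    protein_1 = "ncspta"
    protein_2 = "....." + "acsprk"   # dots: placeholder positions 1-5
    print(ktuple_offsets(protein_1, protein_2).most_common())
```

The most frequent offset marks the diagonal on which the two sequences share the most k-tuples, which is where FASTA starts looking for a high-scoring region.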

SINGLE SEQUENCE HEURISTIC ALGORITHMS 2) IDENTIFY THE 10 HIGHEST SCORING REGIONS

SINGLE SEQUENCE HEURISTIC ALGORITHMS 3) INTRODUCE GAPS 4) DETERMINE BEST SEGMENT OF SIMILARITY

SINGLE SEQUENCE HEURISTIC ALGORITHMS BLAST (BASIC LOCAL ALIGNMENT SEARCH TOOL)
1) COMPILE THE LIST OF POSSIBLE WORDS
2) SCAN DATABASE FOR EXACT MATCHES
3) SCAN THROUGH THE LIST AND TRY TO EXTEND EACH MATCH

SINGLE SEQUENCE HEURISTIC ALGORITHMS http://www.ebi.ac.uk/ http://www.ncbi.nlm.nih.gov/

ALIGNMENT AND SEARCH STATISTICS
ALIGNMENT SCORE: SUM OF THE WEIGHTS OF EVERY ALIGNED PAIR OF THE ALIGNMENT
Z-SCORE: NUMBER OF STANDARD DEVIATIONS BETWEEN THE SCORE AND THE MEAN OF THE RANDOM SCORES
$$Z = \frac{s - m}{e}$$
s = INITIAL SCORE
m = MEAN OF THE RANDOM SCORES
e = STANDARD DEVIATION OF THE RANDOM SCORES
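As a small worked example of the Z-score, the following Python sketch computes (s - m)/e from an observed score and a list of scores obtained on randomized sequences. The function name and the sample numbers are illustrative assumptions; the sketch also assumes the population standard deviation, since the slides do not say which estimator is meant.

```python
from statistics import mean, pstdev


def z_score(score, random_scores):
    """Z = (s - m) / e, with m and e taken from the random scores."""
    m = mean(random_scores)
    e = pstdev(random_scores)   # assumption: population standard deviation
    return (score - m) / e


if __name__ == "__main__":
    # Hypothetical scores from aligning shuffled versions of the sequences.
    shuffled = [2, 4, 4, 4, 5, 5, 7, 9]
    print(z_score(10, shuffled))
```

In practice the random scores come from re-aligning the query against shuffled or unrelated sequences, so that m and e describe what chance alone produces.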

ALIGNMENT AND SEARCH STATISTICS EXTREME VALUE DISTRIBUTION (EVD) DENSITY FUNCTION:
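The formula itself did not survive in this transcript. For orientation, a common way to write the extreme value (Gumbel) density is given below; the location parameter μ and the scale parameter λ are assumed symbols, and the slide may have used a different parametrization.

$$f(x) = \lambda \, e^{-\lambda (x - \mu)} \, \exp\!\bigl(-e^{-\lambda (x - \mu)}\bigr)$$

Under the same assumptions, the probability that a random score reaches at least x is

$$P(S \ge x) = 1 - \exp\!\bigl(-e^{-\lambda (x - \mu)}\bigr)$$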

ALIGNMENT AND SEARCH STATISTICS EXPECT VALUE
E = NUMBER OF DATABASE HITS YOU EXPECT TO FIND BY CHANCE
$$E = K m n \, e^{-\lambda s}$$
$$S = \frac{\lambda s - \ln K}{\ln 2}$$
K = scale for search space
λ = scale for scoring system
s = alignment score
S = bit score
m = effective length of query
n = effective length of database
[Plot: number of hits (Number) against Score]
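The two formulas are linked: substituting the bit score into the expect value gives E = m n 2^(-S). The Python sketch below checks this numerically. The function names are hypothetical, and the values of K, λ, m and n are illustrative assumptions, not parameters taken from the slides or from any particular search.

```python
import math


def expect_value(s, m, n, K, lam):
    """E = K * m * n * exp(-lambda * s)."""
    return K * m * n * math.exp(-lam * s)


def bit_score(s, K, lam):
    """S = (lambda * s - ln K) / ln 2."""
    return (lam * s - math.log(K)) / math.log(2)


if __name__ == "__main__":
    # Illustrative values only.
    K, lam = 0.041, 0.267
    m, n = 250, 5_000_000
    s = 100
    S = bit_score(s, K, lam)
    print(S, expect_value(s, m, n, K, lam), m * n * 2 ** (-S))
```

Expressing scores in bits removes the dependence on K and λ, which is why bit scores from different scoring systems can be compared directly.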

MULTIPLE SEQUENCE ALIGNMENT
SEQUENCES: NYLS, NKYLS, NFS, NFLS
ALIGNMENT:
N-YLS
NKYLS
N-F-S
N-FLS
EVOLUTION: [Tree: ancestor NYLS, with the events +K, -L and Y -> F on its branches]

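MULTIPLE SEQUENCE ALIGNMENT SUM-OF-PAIRS SCORE
S(x, x) = 1, S(x, y) = -1, S(x, -) = -2, S(-, -) = 0 (EXAMPLE: S(A,G) = -1)

COLUMN | SEQ1 | SEQ2 | SEQ3 | S(SEQ1,SEQ2) | S(SEQ1,SEQ3) | S(SEQ2,SEQ3) | COLUMN SUM
1      | A    | A    | G    | 1            | -1           | -1           | -1
2      | A    | -    | T    | -2           | -1           | -2           | -5
3      | C    | C    | C    | 1            | 1            | 1            | 3
4      | G    | G    | G    | 1            | 1            | 1            | 3
5      | T    | T    | T    | 1            | 1            | 1            | 3
6      | A    | A    | A    | 1            | 1            | 1            | 3
7      | C    | -    | -    | -2           | -2           | 0            | -4
8      | G    | A    | -    | -1           | -2           | -2           | -5
9      | A    | A    | T    | 1            | -1           | -1           | -1
10     | T    | T    | T    | 1            | 1            | 1            | 3
11     | A    | G    | A    | -1           | 1            | -1           | -1

SUM-OF-PAIRS SCORE = -1 - 5 + 3 + 3 + 3 + 3 - 4 - 5 - 1 + 3 - 1 = -2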
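The sum-of-pairs computation can be sketched in Python as follows, using the scoring scheme stated on the slide. The function names are hypothetical, and the three aligned rows are the slide's example columns read row by row.

```python
from itertools import combinations


def pair_score(x, y):
    """S(x,x) = 1, S(x,y) = -1, S(x,-) = -2, S(-,-) = 0."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return -2
    return 1 if x == y else -1


def column_scores(rows):
    """Sum the scores of all sequence pairs, column by column."""
    return [sum(pair_score(x, y) for x, y in combinations(column, 2))
            for column in zip(*rows)]


def sum_of_pairs(rows):
    """Total sum-of-pairs score of a multiple alignment."""
    return sum(column_scores(rows))


if __name__ == "__main__":
    alignment = ["AACGTACGATA",   # SEQ1
                 "A-CGTA-AATG",   # SEQ2
                 "GTCGTA--TTA"]   # SEQ3
    print(column_scores(alignment))
    print(sum_of_pairs(alignment))
```

Because every pair of rows is scored in every column, the cost of evaluating one alignment grows with the square of the number of sequences.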

MULTIPLE SEQUENCE ALIGNMENT DYNAMIC PROGRAMMING
[Figure: dynamic programming example on short sequences]
FOR n SEQUENCES OF SIZE l:
SPACE: $O(l^n)$
TIME: $O(2^n l^n)$

MULTIPLE SEQUENCE ALIGNMENT PROFILE EXAMPLE
s(a,1)p(a,1) + s(b,1)p(b,1) + s(c,1)p(c,1) + s(d,1)p(d,1) = S(A,1)

MULTIPLE SEQUENCE ALIGNMENT CLUSTAL 1) COMPUTE A SIMILARITY OF PAIRS MATRIX

MULTIPLE SEQUENCE ALIGNMENT CLUSTAL 2) COMPUTE A TREE

MULTIPLE SEQUENCE ALIGNMENT CLUSTAL 3) BUILD THE FINAL MULTIPLE ALIGNMENT

MULTIPLE ALIGNMENT AND DB SEARCHING PROFILE SEARCHING
1) MULTIPLE SEQUENCE ALIGNMENT
2) CONSTRUCTION OF A PROFILE
3) COMPARISON TO A DATABASE
4) BEST SIMILARITIES FOUND

MULTIPLE ALIGNMENT AND DB SEARCHING HIDDEN MARKOV MODEL Emission Probabilities Transition probabilities

MULTIPLE ALIGNMENT AND DB SEARCHING HIDDEN MARKOV MODEL SCORING
#1: T G C T - - A G G
VS.
#2: A C A C - - A T C
REGULAR EXPRESSION ([AT] [CG] [AC] [ACTG]* A [TG] [GC]): #1 = MEMBER, #2 = MEMBER
HMM: #1 = SCORE OF 0.0023%, #2 = SCORE OF 4.7%
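The regular-expression half of this comparison is easy to check in Python. The sketch below only tests pattern membership with the standard `re` module; it does not implement the HMM, whose emission and transition probabilities are not given in this transcript. The helper name is hypothetical.

```python
import re

# The slide's pattern, written without the spaces between character classes.
PATTERN = re.compile(r"[AT][CG][AC][ACTG]*A[TG][GC]")


def is_member(aligned):
    """True if the sequence (gaps and spaces removed) matches the whole pattern."""
    residues = aligned.replace("-", "").replace(" ", "")
    return PATTERN.fullmatch(residues) is not None


if __name__ == "__main__":
    for name, seq in (("#1", "TGCT--AGG"), ("#2", "ACAC--ATC")):
        print(name, is_member(seq))
```

Both sequences are accepted, so the regular expression gives a plain yes/no answer, whereas the HMM scores on the slide separate the two sequences by several orders of magnitude.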

PROTEIN FAMILIES AND PROTEIN DOMAINS PROSITE DATABASE http://www.expasy.org/prosite/

PROTEIN FAMILIES AND PROTEIN DOMAINS PFAM http://www.sanger.ac.uk/software/pfam/

CONCLUSION HIGH-THROUGHPUT COMPUTATIONAL ANALYSIS TOOLS: A BLESSING, A PROBLEM

REFERENCES
David W. Mount, Bioinformatics: Sequence and Genome Analysis, Cold Spring Harbor Laboratory Press
Notes and PowerPoint: http://www.hammoutene.com/epfl/isbio/

Thank you for your attention!