Sequence Database Search. 北京大学生物信息学中心高歌 Ge Gao, Ph.D. Center for Bioinformatics, Peking University

Similar documents
Bioinformatics and BLAST

Tools and Algorithms in Bioinformatics

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

Grundlagen der Bioinformatik, SS 08, D. Huson, May 2,

Computational Biology

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

In-Depth Assessment of Local Sequence Alignment

BLAST. Varieties of BLAST

EECS730: Introduction to Bioinformatics

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Sequence Database Search Techniques I: Blast and PatternHunter tools

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Basic Local Alignment Search Tool

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Sequence analysis and Genomics

Single alignment: Substitution Matrix. 16 march 2017

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

Heuristic Alignment and Searching

Week 10: Homology Modelling (II) - HHpred

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

BLAST: Target frequencies and information content Dannie Durand

Tools and Algorithms in Bioinformatics

Algorithms in Bioinformatics

Introduction to Sequence Alignment. Manpreet S. Katari

Ensembl focuses on metazoan (animal) genomes. The genomes currently available at the Ensembl site are:

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Bioinformatics for Biologists

PAM-1 Matrix 10,000. From: Ala Arg Asn Asp Cys Gln Glu To:

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Whole Genome Alignments and Synteny Maps

EECS730: Introduction to Bioinformatics

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

Chapter 7: Rapid alignment methods: FASTA and BLAST

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega

An Introduction to Sequence Similarity ( Homology ) Searching

1.5 Sequence alignment

Building 3D models of proteins

bioinformatics 1 -- lecture 7

BLAST: Basic Local Alignment Search Tool

Mass spectrometry has been used a lot in biology since the late 1950 s. However it really came into play in the late 1980 s once methods were

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

Quantifying sequence similarity

Hands-On Nine The PAX6 Gene and Protein

A profile-based protein sequence alignment algorithm for a domain clustering database

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Comparative Bioinformatics Midterm II Fall 2004

Introduction to Bioinformatics

Motivating the need for optimal sequence alignments...

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison

Introduction to protein alignments

- conserved in Eukaryotes. - proteins in the cluster have identifiable conserved domains. - human gene should be included in the cluster.

Exercise 5. Sequence Profiles & BLAST

USING BLAST TO IDENTIFY PROTEINS THAT ARE EVOLUTIONARILY RELATED ACROSS SPECIES

frmsdalign: Protein Sequence Alignment Using Predicted Local Structure Information for Pairs with Low Sequence Identity

Genomics and bioinformatics summary. Finding genes -- computer searches

Introduction to sequence alignment. Local alignment the Smith-Waterman algorithm

Bioinformatics. Dept. of Computational Biology & Bioinformatics

Pairwise sequence alignment

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

Biosequence Alignment 徐鹰佐治亚大学生化系 吉林大学计算机学院

Substitution matrices

CSE : Computational Issues in Molecular Biology. Lecture 6. Spring 2004

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Sequence-specific sequence comparison using pairwise statistical significance

Fast profile matching algorithms A survey

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

Large-Scale Genomic Surveys

Sequence Alignment Techniques and Their Uses

Reducing storage requirements for biological sequence comparison

Closed-Form Solutions in Low-Rank Subspace Recovery Models and Their Implications. Zhouchen Lin ( 林宙辰 ) 北京大学 Nov. 7, 2015

Tandem Mass Spectrometry: Generating function, alignment and assembly

B I O I N F O R M A T I C S

Comparing whole genomes

Similarity or Identity? When are molecules similar?

Christian Sigrist. November 14 Protein Bioinformatics: Sequence-Structure-Function 2018 Basel

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Using Bioinformatics to Study Evolutionary Relationships Instructions

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Tertiary Structure Prediction

CMPS 3110: Bioinformatics. Tertiary Structure Prediction

Sequencing alignment Ameer Effat M. Elfarash

IMPLEMENTING HIERARCHICAL CLUSTERING METHOD FOR MULTIPLE SEQUENCE ALIGNMENT AND PHYLOGENETIC TREE CONSTRUCTION

Global alignments - review

Fundamentals of database searching

An IntRoduction to grey methods by using R

Protein Structure Prediction Using Neural Networks

MPIPairwiseStatSig: Parallel Pairwise Statistical Significance Estimation of Local Sequence Alignment

Introduction to spectral alignment

Bioinformatics Exercises

Key questions of proteomics. Bioinformatics 2. Proteomics. Foundation of proteomics. What proteins are there? Protein digestion

Protein Threading. BMI/CS 776 Colin Dewey Spring 2015

Ping-Chiang Lyu. Institute of Bioinformatics and Structural Biology, Department of Life Science, National Tsing Hua University.

Sequence analysis and comparison

SUPPLEMENTARY INFORMATION

Sequence alignment methods. Pairwise alignment. The universe of biological sequence analysis

Similarity searching summary (2)

Transcription:

Sequence Database Search 北京大学生物信息学中心高歌 Ge Gao, Ph.D. Center for Bioinformatics, Peking University

Unit 2: BLAST Algorithm: a Primer 北京大学生物信息学中心高歌 Ge Gao, Ph.D. Center for Bioinformatics, Peking University

There are nm entries in the matrix. Sequence Y of length n Sequence X of length m Dynamic programming matrix Each entry requires a constant number c of operation(s). c*m*n operations needed in total, for one pair wise alignment.

BLAST Ideas: Seeding and extending 1. Find matches (seed) between the query and subject 2. Extend seed into High Scoring Segment Pairs (HSPs) Run Smith Waterman algorithm on the specified region only. 3. Assess the reliability of the alignment.

Seeding For a given word length w (usually 3 for proteins and 11 for nucleotides), slicing the query sequence into multiple continuous seed words Query Sequence M V L S P A D K T N V K A A W

Speedup: Index database The database was pre indexed to quickly locate all positions in the database for a given seed. Database o o Database sequence #1 o o D K T o o Query sequence o o o

Diagonal and Two hits One of candidate sequences Query sequence Hit clusters

One of candidate sequences Query sequence (Source: Bedell et al. 2003) 0 1, 1,, 1 1, max, 0 0,0 d j i F d j i F y x s j i F j i F F j i

Speedup: mask low complexity Low complexity sequences yield false positives. Alphabet size (4 or 20) CACACACACACACACA 1 L! K log N KLKLKLKLKLKL L n i! i Window length Frequency of the ith letter

For example, for typical microsatellite CACACACACACACACA, with window length 6: K 1 6 1 6 1 6 log log log 1 log 6 0.36 4 4 4 4 n 20 A 3!*!* 6! 3!* 3! n C 6!!* n 6! 3!* 0!* G 0!!* n T!

To improve sensitivity, in addition to the seed word itself, the BLAST also use these highly similar neighbourhood words (based on the substitution matrix) for seeding. D K T 6+2+5 = 13 D R T DKT 16 DRT 13 DET 12 DKS 12 DQT 12 EKT 12 DKA 11 DKN 11 DKV 11 DNT 11 DST 11 NKT 11 DAT 10 DDT 10 DHT 10 DKC 10 DKD 10 DKE 10 DKI 10 DKK 10 DKL 10 DKM 10 DKP 10 DKQ 10 DKR 10 DMT 10 DPT 10 DTT 10 QKT 10 SKT 10

Quality Assessment Given the large data volume, it s critical to provide some measures for assessing the statistical significance of a given hit. 8000 7000 6000 5000 4000 3000 2000 1000 0 <20 30 40 50 60 70 80 90 100 110 >120

For a given amino acid, we have the chance of 1/20 to have a random match (as there are just 20 different amino acids in total). Thus, for an amino acid sequence with length L, the probability of having a random match across the full length is (1/20) L Say you have a 6 AA peptide (not so unusual, e.g. the Tryptophyllin T2 6 in Phyllomedusa azurea or Orange legged monkey frog, 橙腿猴树蛙 is a 6 AA peptide), then the odd would be (1/20) 6 = 1.56 * 10 8 Looks not so big, huh? But you re searching Swiss Prot databases which contains 192,206,270 amino acids in 540,958 sequences (Sept 18 th, 2013). Then you would expect to have (1/20) 6 * 192,206,270 = 3.00 matches by chance, for any 6 AA peptide!

E Value: How a match is likely to arise by chance The expected number of alignments with a given score that would be expected to occur at random in the database that has been searched e.g. if E=10, 10 matches with scores this high are expected to be found by chance

p 1 e E p P (0.05) = E (0.0513) E value

Heuristic (pronounced hyu RIS tik, Greek: "Εὑρίσκω", "find" or "discover") refers to experience based techniques for problem solving, learning, and discovery. (Source: Wikipedia) Not best, but good enough

Key heuristics in BLAST Seeding and extending: looking for seeds of high scoring alignments ONLY Use dynamic programming selectively Tradeoff: speed vs. sensitivity Empirically, 1000 ~ 10000 times faster than plain Dynamic Programming based local alignment But suffer from low sensitivity, especially for distant sequences (e.g. E.coli human)

Summary Questions Why do you need E value? Could you give an alignment that the Smith Waterman algorithm, but not the BLAST algorithm, can identify? Explain your case.

生物信息学 : 导论与方法 Bioinformatics: Introduction and Methods https://www.coursera.org/course/pkubioinfo