Sequencing alignment Ameer Effat M. Elfarash

Sequencing alignment Ameer Effat M. Elfarash Dept. of Genetics Fac. of Agriculture, Assiut Univ. amir_effat@yahoo.com

Why perform a multiple sequence alignment?
MSAs are at the heart of comparative genomics studies, which seek to study evolutionary histories and the functional and structural aspects of sequences, and to understand phenotypic differences between species.

The Building Blocks ATGC VLMFNQEDHKRCSTPYW

MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE
Alignment: comparing two (pairwise) or more (multiple) sequences by searching for a series of identical or similar characters in the sequences.
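
The two protein fragments on this slide have the same length, so they can be compared position by position without gaps. The sketch below counts identical positions; the function names are illustrative and are not taken from any alignment tool.

```python
# The two sequences shown on the slide (same length, no gaps needed).
SEQ_1 = "MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE"
SEQ_2 = "MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE"


def count_matches(seq_a: str, seq_b: str) -> int:
    """Count positions where two equal-length sequences carry the same character."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a == b)


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Identical positions as a percentage of the compared length."""
    return 100.0 * count_matches(seq_a, seq_b) / len(seq_a)


if __name__ == "__main__":
    print(count_matches(SEQ_1, SEQ_2), "identical positions")
    print(round(percent_identity(SEQ_1, SEQ_2), 1), "% identity")
```

Counting only identical characters ignores "similar" residues; scoring those would need a substitution matrix, which this sketch leaves out.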

Why Align Sequences?
- Discover functional, structural, and evolutionary information
- Similar sequences may have similar function (gene regulation, biochemical function) and similar structure
- Identify primers and probes to search for homologous sequences in other organisms
- Crucial for genome sequencing

Genome Sequencing Strategy
1. Obtain a large collection of BAC clones
2. Map them onto the genome (physical mapping)
3. Select a minimum tiling path
4. Sequence each clone in the path with shotgun sequencing
5. Assemble
6. Put everything together

Why Align Sequences? Homology
- Similar sequences may have a common ancestor.
- In 1990 Carl Woese and colleagues proposed the Tree of Life, using a molecular phylogenetic approach.
- It is based on sequencing ribosomal RNA (rRNA; 16S and 18S), which all organisms share.
- All organisms were separated into three domains: Bacteria, Archaea, Eukarya.

16S rRNA
[Figure: the same rRNA region compared in Escherichia, Saccharomyces, and Methanococcus]

Evolutionary changes in sequences
Three types of nucleotide changes:
1. Substitution: a replacement of one (or more) sequence characters by another
2. Insertion: an insertion of one (or more) sequence characters
3. Deletion: a deletion of one (or more) sequence characters
Insertion + Deletion = Indel
[Figure: examples of each change applied to the sequence AAGA, such as AACA]
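
The three change types can be read directly off the columns of a pairwise alignment once one row is treated as the ancestral sequence. The sketch below assumes gaps are written as "-" and that the first argument is the older sequence; the function name and the example strings other than AAGA and AACA are hypothetical.

```python
def classify_changes(ancestral: str, derived: str) -> dict:
    """Count matches, substitutions, insertions and deletions column by column.

    Both arguments are rows of one alignment (equal length, '-' marks a gap).
    A gap in the ancestral row is read as an insertion in the derived sequence,
    a gap in the derived row as a deletion.
    """
    if len(ancestral) != len(derived):
        raise ValueError("aligned rows must have equal length")
    counts = {"match": 0, "substitution": 0, "insertion": 0, "deletion": 0}
    for old, new in zip(ancestral, derived):
        if old == "-" and new == "-":
            continue  # column carries no information for this pair
        if old == "-":
            counts["insertion"] += 1
        elif new == "-":
            counts["deletion"] += 1
        elif old != new:
            counts["substitution"] += 1
        else:
            counts["match"] += 1
    return counts


if __name__ == "__main__":
    print(classify_changes("AAGA", "AACA"))    # one substitution
    print(classify_changes("AAG-A", "AAGTA"))  # one insertion
    print(classify_changes("AAGA", "A-GA"))    # one deletion
```

When it is not known which sequence is older, an insertion in one row cannot be told apart from a deletion in the other, which is why the two are grouped as indels.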

How DNA Sequence Data is Compared
1. Obtain samples: blood, saliva, hair follicles, feathers, scales
2. Extract DNA from cells (genetic data)
3. Sequence DNA
4. Compare DNA sequences to one another:
TTCACCAACAGGCCCACA
TTCAACAACAGGCCCAC
TTCACCAACAGGCCCAC
TTCATCAACAGGCCCAC
GOALS:
- Identify the organism from which the DNA was obtained.
- Compare DNA sequences to each other.

Sequence Both Strands of DNA
Sequence: A T G A C G G A T C A G C
Bioinformatics tools like BLAST can be used to compare the sequences from different individuals.
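
When both strands are sequenced, the read from the second strand is the reverse complement of the first, so one of the two has to be reverse-complemented before they are compared. A minimal sketch, assuming plain upper-case A, C, G, T input (ambiguity codes are not handled):

```python
# Translation table for the four standard DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    forward = "ATGACGGATCAGC"  # the sequence shown on the slide
    print(reverse_complement(forward))
```

Applying the function twice returns the original sequence, which is a quick sanity check for any strand-handling code.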

Comparing two sequences
MVNLTSDEKTAVLALWNKVDVEDCGGE
MVHLTPEEKTAVNALWGKVNVDAVGGE

Multiple sequence alignment
Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG
Seq2 VTISCTGTSSNIGS--ITVNWYQQLPG
Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG
Seq4 LSLTCTVSGTSFDD--YYSTWVRQPPG
Seq5 PEVTCVVVDVSHEDPQVKFNWYVDG--
Seq6 ATLVCLISDFYPGA--VTVAWKADS--
Seq7 AALGCLVKDYFPEP--VTVSWNSG---
Seq8 VSLTCLVKGFYPSD--IAVEWWSNG--
Similar to pairwise alignment, BUT n sequences are aligned instead of just 2.
Each row represents an individual sequence.
Each column represents the same position.

Multiple sequence alignment
Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG
Seq2 VTISCTGTSSNIGS--ITVNWYQQLPG
Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG
Seq4 LSLTCTVSGTSFDD--YYSTWVRQPPG
Seq5 PEVTCVVVDVSHEDPQVKFNWYVDG--
Seq6 ATLVCLISDFYPGA--VTVAWKADS--
Seq7 AALGCLVKDYFPEP--VTVSWNSG---
Seq8 VSLTCLVKGFYPSD--IAVEWWSNG--
Columns are either variable or conserved.
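
One simple way to separate conserved from variable columns is to ask whether every row carries the same residue. The sketch below uses that strict definition (identical in all rows, no gaps); real tools usually apply softer conservation scores, so treat the rule and the names as assumptions.

```python
# The eight aligned rows from the slide.
ALIGNMENT = [
    "VTISCTGSSSNIGAG-NHVKWYQQLPG",
    "VTISCTGTSSNIGS--ITVNWYQQLPG",
    "LRLSCSSSGFIFSS--YAMYWVRQAPG",
    "LSLTCTVSGTSFDD--YYSTWVRQPPG",
    "PEVTCVVVDVSHEDPQVKFNWYVDG--",
    "ATLVCLISDFYPGA--VTVAWKADS--",
    "AALGCLVKDYFPEP--VTVSWNSG---",
    "VSLTCLVKGFYPSD--IAVEWWSNG--",
]


def conserved_columns(rows):
    """Return 0-based indices of columns where all rows share one non-gap residue."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    conserved = []
    for index, column in enumerate(zip(*rows)):
        residues = set(column)
        if len(residues) == 1 and "-" not in residues:
            conserved.append(index)
    return conserved


if __name__ == "__main__":
    for index in conserved_columns(ALIGNMENT):
        print(index, ALIGNMENT[0][index])
```

For this alignment the strict rule picks out only the cysteine and tryptophan columns; every other column varies in at least one row.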

Multiple Sequence Alignment: Approaches
Local alignment finds regions of high similarity in parts of the sequences:
ADLGAVFALCDRYFQ
ADLGRTQN CDRYYQ
Global alignment finds the best alignment across the entire two sequences:
ADLGAVFALCDRYFQ
ADLGRTQN-CDRYYQ
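
The difference between the two approaches shows up clearly in the dynamic-programming recurrences usually associated with them: Needleman-Wunsch for global alignment and Smith-Waterman for local alignment. The slides do not name these algorithms or a scoring scheme, so the sketch below is an illustration under assumed scores (match +1, mismatch -1, gap -1) and returns only the score, not the alignment itself.

```python
def global_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Needleman-Wunsch score: both sequences must be aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]  # first row: leading gaps are charged
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Smith-Waterman score: best-scoring pair of substrings, never below zero."""
    best = 0
    prev = [0] * (len(b) + 1)  # first row: an alignment may start anywhere for free
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    # The two sequences from the slide, written without gaps.
    x, y = "ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ"
    print("global:", global_score(x, y))
    print("local: ", local_score(x, y))
```

The only structural differences are the zero floor and the free start in `local_score`; they let a strong shared region score well even when the flanking sequence is unrelated.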

How to optimize alignment algorithms?
- Sequences often contain highly conserved regions.
- These regions can be used for an initial alignment.
- By analyzing a number of small, independent fragments, the algorithmic complexity can be drastically reduced!

Choosing an alignment
Many different alignments between two sequences are possible:

AAGCTGAATTCGAA
AGGCTCATTTCTGA

AAGCGAAATTCGAAC
A-G-GAA-CTCGAAC

AAGCGAAATTCGAAC
AGG---AACTCGAAC

How do we determine which is the best alignment?
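
One common answer is to give every alignment a numerical score and prefer the highest. The sketch below scores the two gapped candidates from this slide under two assumed schemes: a linear gap penalty, and an affine one where opening a gap costs more than extending it. The score values are illustrative choices, not taken from the slides.

```python
def score_alignment(row_a: str, row_b: str, match: int = 1, mismatch: int = -1,
                    gap_open: int = -2, gap_extend: int = -2) -> int:
    """Score two aligned rows; a gap run of length L costs gap_open + (L - 1) * gap_extend."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    gap_row = None  # which row held a gap in the previous column, if any
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            score += gap_extend if gap_row == row else gap_open
            gap_row = row
        else:
            score += match if x == y else mismatch
            gap_row = None
    return score


TOP = "AAGCGAAATTCGAAC"
SCATTERED_GAPS = "A-G-GAA-CTCGAAC"  # three separate one-base gaps
SINGLE_GAP = "AGG---AACTCGAAC"      # one three-base gap

if __name__ == "__main__":
    for name, row in (("scattered gaps", SCATTERED_GAPS), ("single gap", SINGLE_GAP)):
        linear = score_alignment(TOP, row)  # every gap position costs -2
        affine = score_alignment(TOP, row, gap_open=-3, gap_extend=-1)
        print(f"{name}: linear={linear}, affine={affine}")
```

With the linear penalty the alignment with scattered gaps scores higher (4 against 2); with the affine penalty the single long gap wins (3 against 1). Which alignment is "best" therefore depends on the scoring scheme chosen.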

Assessing the significance of an alignment score
True pair: AAGCTGAATTCGAA and AGGCTCATTTCTGA
AAGCTGAATTC-GAA
AGGCTCATTTCTGA-
Score: 28.0
Random pair: AGATCAGTAGACTA and GAGTAGCTATCTCT
AGATCAGTAGACTA-----
----GAGTAG-CTATCTCT
Score: 26.0
Random pair: CGATAGATAGCATA and GCATGTCATGATTC
CGATAGATAGCATA---------
---------GCATGTCATGATTC
Score: 16.0
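
A common way to judge whether a score is meaningful is to compare it with the scores obtained after shuffling one of the sequences, which keeps its composition but destroys its order. The sketch below follows that idea with an assumed scoring scheme (match +1, mismatch -1, gap -1), so its numbers are not comparable to the scores shown on the slide; the function names, the number of shuffles, and the seed are arbitrary.

```python
import random


def global_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def shuffle_test(a: str, b: str, n_shuffles: int = 200, seed: int = 0):
    """Return (observed score, empirical p-value) from shuffling sequence b."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = global_score(a, b)
    letters = list(b)
    at_least_as_good = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        if global_score(a, "".join(letters)) >= observed:
            at_least_as_good += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return observed, (at_least_as_good + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    score, p_value = shuffle_test("AAGCTGAATTCGAA", "AGGCTCATTTCTGA")
    print("observed score:", score, "empirical p-value:", round(p_value, 3))
```

A small p-value means few shuffled sequences align as well as the real one; with only a few hundred shuffles the estimate is coarse, so it is a rough screen rather than a rigorous statistic.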

Optimal vs. correct alignment
For a given group of sequences, there is no single correct alignment, only an alignment that is optimal according to some set of calculations.
This is partly due to:
- the complexity of the problem,
- limitations of the scoring systems used,
- our limited understanding of life and evolution.
Determining what alignment is best for a given set of sequences is really up to the judgment of the investigator.
Success of the alignment will depend on the similarity of the sequences. If sequence variation is great, it will be very difficult to find an optimal alignment.

Web servers for pairwise alignment

NCBI

Basic Local Alignment Search Tool (BLAST)
- blastn: nucleotide
- blastp: protein

BLAST programs

BLAST bl2seq

bl2seq results: Match, Similarity, Dissimilarity, Gaps

MSA input: multiple sequence FASTA file
>gi|4885397|ref|NP_005323.1| hemoglobin, zeta [Homo sapiens]
MSLTKTERTIIVSMWAKISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAAVGDA
VKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHCLLVTLAARFPADFTAEAHAAWDKFLSVVSSVLTEK
YR
>gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens]
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN
ALAHKYH
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH
>gi|4885393|ref|NP_005321.1| epsilon globin [Homo sapiens]
MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLT
SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAI
ALAHKYH
>gi|6715607|ref|NP_000175.1| G-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS
ALSSRYH
>gi|28302131|ref|NP_000550.2| A-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDATKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVAS
ALSSRYH
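
A multi-sequence FASTA file has a simple structure: each record starts with a header line beginning with ">", followed by one or more sequence lines. The reader below is a minimal sketch of that structure using only the standard library; the function name and the demo records are hypothetical, and established libraries handle many edge cases that this version ignores.

```python
def parse_fasta(lines):
    """Map each header (without the leading '>') to its concatenated sequence."""
    records = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


if __name__ == "__main__":
    demo = [
        ">seq1 demo record",
        "MVHLT",
        "PEEK",
        ">seq2 demo record",
        "MGHFT",
    ]
    for name, sequence in parse_fasta(demo).items():
        print(name, len(sequence), sequence)
```

To read a file rather than a list, an open file handle can be passed in place of `demo`, since the function only needs something that yields lines.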

Query type: AA or DNA?
For coding sequences, AA (protein) data are better:
- Selection operates most strongly at the protein level, so the homology is more evident.
- AA has a 20-character alphabet and DNA a 4-character alphabet, so the chance of a random match is lower for AA.
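
The alphabet-size argument can be made concrete by computing the probability that two unrelated sequences show the same character at a given position. The sketch below assumes all characters are equally frequent and independent, which is a simplification: real base and residue frequencies are uneven, which raises both numbers somewhat.

```python
def chance_identity(frequencies):
    """Probability that two independently drawn characters are identical."""
    return sum(p * p for p in frequencies)


if __name__ == "__main__":
    dna = chance_identity([1 / 4] * 4)        # uniform 4-letter alphabet
    protein = chance_identity([1 / 20] * 20)  # uniform 20-letter alphabet
    print(f"DNA:     {dna:.2f} expected identity by chance")
    print(f"protein: {protein:.2f} expected identity by chance")
```

Under these assumptions two random DNA sequences agree at about one position in four, and two random protein sequences at about one in twenty, so a given level of identity is stronger evidence in a protein comparison.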

Evolutionary distance: phylogenetic trees

Phylogenetic trees
Rooted and unrooted trees
Unrooted trees: compare one feature of a group of related organisms.
[Figure: two drawings of the same unrooted tree joining A, B, and C]
Only one shape is possible for an unrooted tree with 3 species.
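
The single shape for three species is the first case of a standard counting result for strictly bifurcating trees with labelled tips: there are (2n - 5)!! unrooted and (2n - 3)!! rooted topologies for n species, where !! is the double factorial over odd numbers. The slides do not state these formulas, so the sketch below is a supplementary illustration with made-up function names.

```python
def odd_double_factorial(k: int) -> int:
    """Product of the odd numbers 1 * 3 * 5 * ... * k (returns 1 when k <= 1)."""
    product = 1
    for value in range(3, k + 1, 2):
        product *= value
    return product


def unrooted_tree_count(n_species: int) -> int:
    """Number of distinct unrooted bifurcating trees for n labelled species (n >= 3)."""
    if n_species < 3:
        raise ValueError("need at least 3 species")
    return odd_double_factorial(2 * n_species - 5)


def rooted_tree_count(n_species: int) -> int:
    """Number of distinct rooted bifurcating trees for n labelled species (n >= 2)."""
    if n_species < 2:
        raise ValueError("need at least 2 species")
    return odd_double_factorial(2 * n_species - 3)


if __name__ == "__main__":
    for n in (3, 4, 5, 10):
        print(n, "species:", unrooted_tree_count(n), "unrooted,", rooted_tree_count(n), "rooted")
```

The counts grow very quickly, which is why tree-building methods rely on heuristics rather than on examining every possible tree.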

Distance matrix: UPGMA
UPGMA = unweighted pair group method with arithmetic mean

Distance matrix: UPGMA
c & d are considered as one unit.
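
UPGMA repeatedly joins the two closest clusters and then treats the joined pair as a single unit, whose distance to every other cluster is the size-weighted mean of the two old distances. The sketch below applies this to four taxa a, b, c and d; the distance values are invented for illustration (the slide's matrix is not reproduced here) and were chosen so that c and d join first.

```python
from itertools import combinations


def upgma(names, distances):
    """Cluster taxa by UPGMA.

    distances maps frozenset({x, y}) to the distance between x and y.
    Returns (tree, heights): the tree as nested tuples and the height
    (half the joining distance) of each merge, in the order performed.
    """
    sizes = {name: 1 for name in names}  # cluster -> number of taxa inside it
    dist = dict(distances)
    heights = []
    while len(sizes) > 1:
        # Find the closest pair of current clusters.
        x, y = min(combinations(sizes, 2), key=lambda pair: dist[frozenset(pair)])
        merged = (x, y)
        heights.append(dist[frozenset((x, y))] / 2)
        # Distance from the new cluster to each remaining one: size-weighted mean.
        for other in sizes:
            if other not in (x, y):
                dist[frozenset((merged, other))] = (
                    sizes[x] * dist[frozenset((x, other))]
                    + sizes[y] * dist[frozenset((y, other))]
                ) / (sizes[x] + sizes[y])
        merged_size = sizes.pop(x) + sizes.pop(y)
        sizes[merged] = merged_size
    return next(iter(sizes)), heights


# Hypothetical pairwise distances for four taxa.
EXAMPLE_DISTANCES = {
    frozenset(("c", "d")): 2,
    frozenset(("a", "b")): 4,
    frozenset(("a", "c")): 8,
    frozenset(("a", "d")): 8,
    frozenset(("b", "c")): 8,
    frozenset(("b", "d")): 8,
}

if __name__ == "__main__":
    tree, heights = upgma(["a", "b", "c", "d"], EXAMPLE_DISTANCES)
    print(tree)
    print(heights)
```

UPGMA assumes that all lineages evolve at the same rate; when that assumption fails, the resulting tree can group taxa incorrectly.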

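Other method: maximum parsimony
Sequence 1: ATTA
Sequence 2: TAAC
Sequence 3: AAAT
Sequence 4: TTTT
[Figure: two candidate trees with inferred ancestral sequences at the internal nodes and the number of changes marked on each branch. Tree lengths: 6 and 8.]
The tree that requires the fewest changes (length 6) is the most likely tree.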
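
The length of a tree under parsimony can be computed with Fitch's algorithm, which counts the minimum number of substitutions needed at each alignment column. The slides do not say how their lengths were obtained, so the sketch below is one standard way to do it: it scores the three possible groupings of the four sequences, with each tree written as nested tuples and the names `fitch_site` and `tree_length` chosen for illustration.

```python
SEQUENCES = {"1": "ATTA", "2": "TAAC", "3": "AAAT", "4": "TTTT"}


def fitch_site(tree, states):
    """Return (possible states at this node, changes needed below it) for one column."""
    if isinstance(tree, str):  # a leaf: its state is fixed by the data
        return {states[tree]}, 0
    left, right = tree
    left_states, left_changes = fitch_site(left, states)
    right_states, right_changes = fitch_site(right, states)
    shared = left_states & right_states
    if shared:  # children can agree: no extra change at this node
        return shared, left_changes + right_changes
    return left_states | right_states, left_changes + right_changes + 1


def tree_length(tree, sequences):
    """Total number of changes over all columns (the parsimony length)."""
    n_sites = len(next(iter(sequences.values())))
    total = 0
    for site in range(n_sites):
        column = {name: seq[site] for name, seq in sequences.items()}
        total += fitch_site(tree, column)[1]
    return total


if __name__ == "__main__":
    candidates = [
        (("1", "2"), ("3", "4")),
        (("1", "3"), ("2", "4")),
        (("1", "4"), ("2", "3")),
    ]
    for tree in candidates:
        print(tree, "length", tree_length(tree, SEQUENCES))
```

For these four sequences the groupings give lengths 8, 7 and 6, so the tree that pairs sequence 1 with sequence 4 and sequence 2 with sequence 3 is the most parsimonious, in line with the lengths 6 and 8 shown on the slide.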

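The alignment problem
What happens when a sequence alignment is wrong?
Unaligned sequences:
A: AGT
B: AT
C: ATC
Three possible alignments:

A: AGT
B: A-T
C: ATC

A: AGT-
B: A-T-
C: A-TC

A: AGT
B: AT-
C: ATC

[Figure: three different tree topologies for A, B, and C, one for each alignment]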

Central role of multiple alignments
- Comparative genomics
- Phylogenetic studies
- Hierarchical function annotation: homologs, domains, motifs
- Gene identification, validation
- Structure comparison, modelling
- RNA sequence, structure, function
- Human genetics, SNPs
- Therapeutics, drug design
- Therapeutics, drug discovery
[Figure: a multiple alignment at the centre, annotated with DBD, insertion domain, binding sites / mutations, and LBD]

Most important sequence databases
GenBank: maintained by the USA National Center for Biotechnology Information (NCBI)
- All biological sequences: www.ncbi.nlm.nih.gov/genbank/genbankoverview.html
- Genomes: www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=genome
Swiss-Prot: maintained by the Swiss Institute of Bioinformatics (SIB) and the EMBL-European Bioinformatics Institute (EBI)
- Protein sequences: www.ebi.ac.uk/swissprot/

Sequence Similarity Searching: Basic Local Alignment Search Tool (BLAST)