Sequencing alignment Ameer Effat M. Elfarash

Sequencing alignment
Ameer Effat M. Elfarash
Dept. of Genetics, Fac. of Agriculture, Assiut Univ.
aelfarash@aun.edu.eg

Why perform a multiple sequence alignment?
Multiple sequence alignments (MSAs) are at the heart of comparative genomics studies, which seek to study evolutionary histories and the functional and structural aspects of sequences, and to understand phenotypic differences between species.

The Building Blocks
DNA: A T G C
Protein: V L M F N Q E D H K R C S T P Y W

Three types of nucleotide changes:
Substitution: a replacement of one (or more) sequence characters by another
Insertion: an insertion of one (or more) sequence characters
Deletion: a deletion of one (or more) sequence characters
(Slide figure: example strings A A GA, AAG A T, AAGA, AACA.)

MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE
Alignment: comparing two (pairwise) or more (multiple) sequences by searching for a series of identical or similar characters in the sequences.

Why Align Sequences?
Crucial for genome sequencing.
Discover functional, structural, and evolutionary information, because similar sequences may have similar functions.
Identify primers and probes to search for homologous sequences in other organisms.

Why Align Sequences? Genome Sequencing Strategy
(Slide figure: genome, labeled 1 to 11.)
1. Obtain a large collection of BAC clones
2. Map them onto the genome (physical mapping)
3. Select a minimum tiling path
4. Sequence each clone in the path with shotgun sequencing
5. Assemble
6. Put everything together

Why Align Sequences? Homology
Similar sequences may have a common ancestor.
In 1990, Carl Woese and colleagues proposed the Tree of Life using a molecular phylogenetic approach. It is based on sequencing ribosomal RNA (16S rRNA in prokaryotes, 18S rRNA in eukaryotes), a molecule that all organisms share.
All organisms were separated into three domains: Bacteria, Archaea, and Eukarya.

16S rRNA

How DNA Sequence Data Is Compared
Obtain samples: blood, saliva, hair follicles, feathers, scales
Genetic data:
Extract DNA from cells
Sequence DNA
Compare DNA sequences to one another:
TTCACCAACAGGCCCACA
TTCAACAACAGGCCCAC
TTCACCAACAGGCCCAC
TTCATCAACAGGCCCAC
GOALS: Identify the organism from which the DNA was obtained. Compare DNA sequences to each other.
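The "compare DNA sequences to one another" step can be sketched as a count of differing positions for every pair of reads. This is only an illustration: the sample names are hypothetical, and because the first read on the slide is one base longer than the others, the sketch assumes the reads start at the same position and compares only the length they share.

```python
from itertools import combinations

# The four reads shown on the slide; the sample names are made up for illustration.
samples = {
    "sample1": "TTCACCAACAGGCCCACA",
    "sample2": "TTCAACAACAGGCCCAC",
    "sample3": "TTCACCAACAGGCCCAC",
    "sample4": "TTCATCAACAGGCCCAC",
}


def count_differences(a, b):
    """Count mismatching positions over the length both sequences share.

    zip() stops at the shorter sequence, so trailing extra bases are ignored.
    """
    return sum(x != y for x, y in zip(a, b))


def pairwise_differences(seqs):
    """Return {(name1, name2): number of differing positions} for every pair."""
    return {
        (m, n): count_differences(seqs[m], seqs[n])
        for m, n in combinations(seqs, 2)
    }


for pair, diff in pairwise_differences(samples).items():
    print(pair, diff)
```

A position-by-position count like this only makes sense when the reads are already in register. Once insertions or deletions are involved, an alignment step is needed first.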

Confirm insert sequence
Sequence 1
Sequence : A T G A C G G A T C A G C

Comparing two sequences
MVNLTSDEKTAVLALWNKVDVEDCGGE
MVHLTPEEKTAVNALWGKVNVDAVGGE

Multiple sequence alignment
Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG
Seq2 VTISCTGTSSNIGS--ITVNWYQQLPG
Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG
Seq4 LSLTCTVSGTSFDD--YYSTWVRQPPG
Seq5 PEVTCVVVDVSHEDPQVKFNWYVDG--
Seq6 ATLVCLISDFYPGA--VTVAWKADS--
Seq7 AALGCLVKDYFPEP--VTVSWNSG---
Seq8 VSLTCLVKGFYPSD--IAVEWWSNG--
Each row represents an individual sequence.
Each column represents the same position.

How to optimize alignment algorithms?
Sequences often contain highly conserved regions.
These regions can be used for an initial alignment.
By analyzing a number of small, independent fragments, the algorithmic complexity can be drastically reduced.
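One common way to find short conserved fragments is to look for words of length k (k-mers) that occur in both sequences and use them as anchors for an initial alignment. The sketch below is an illustration of that general idea, not the method of any particular program. The function name `shared_kmers`, the word length, and the example sequences are assumptions.

```python
from collections import defaultdict


def shared_kmers(a, b, k=5):
    """Return (pos_in_a, pos_in_b, word) for every k-mer found in both sequences."""
    # Index every k-mer of the first sequence by its start position.
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)

    # Look up every k-mer of the second sequence in that index.
    hits = []
    for j in range(len(b) - k + 1):
        word = b[j:j + k]
        for i in index.get(word, []):
            hits.append((i, j, word))
    return hits


# Example sequences (ungapped) chosen for illustration.
seq_a = "AAGCGAAATTCGAAC"
seq_b = "AGGAACTCGAAC"

for i, j, word in shared_kmers(seq_a, seq_b):
    # Hits with the same offset i - j lie on one diagonal and can be chained.
    print(word, "a:", i, "b:", j, "offset:", i - j)
```

Building the index and scanning the second sequence each take time roughly proportional to the sequence length. A full dynamic-programming alignment fills a table whose size grows with the product of the two lengths.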

Choosing an alignment
Many different alignments between two sequences are possible:

AAGCTGAATTCGAA
AGGCTCATTTCTGA

AAGCGAAATTCGAAC
A-G-GAA-CTCGAAC

AAGCGAAATTCGAAC
AGG---AACTCGAAC

How do we determine which is the best alignment?
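One common answer is to give every alignment a numerical score and prefer the highest. The sketch below scores the two gapped alignments shown above under two scoring schemes. The scheme values (match +1, mismatch -1, and the gap penalties) are assumptions chosen for illustration, not values given on the slides.

```python
def score_alignment(a, b, match=1, mismatch=-1, gap_open=-2, gap_extend=-2):
    """Score two already-aligned strings of equal length ('-' marks a gap).

    The first column of a gap run costs gap_open and each further column costs
    gap_extend. With gap_open == gap_extend this is a linear gap penalty.
    For simplicity, adjacent gaps count as one run even if they switch strands.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score


top = "AAGCGAAATTCGAAC"
scattered_gaps = "A-G-GAA-CTCGAAC"   # three separate one-base gaps
single_gap = "AGG---AACTCGAAC"       # one three-base gap

for name, aligned in [("scattered", scattered_gaps), ("single", single_gap)]:
    linear = score_alignment(top, aligned)
    affine = score_alignment(top, aligned, gap_open=-4, gap_extend=-1)
    print(name, "linear:", linear, "affine:", affine)
```

Under the linear penalty the alignment with scattered gaps scores higher, while the affine penalty (expensive to open a gap, cheap to extend it) favours the single long gap. Which alignment counts as "best" therefore depends on the scoring scheme.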

Assessing the significance of an alignment score
Sequence pairs (true and random):
AAGCTGAATTCGAA
AGGCTCATTTCTGA
AGATCAGTAGACTA
GAGTAGCTATCTCT
..
CGATAGATAGCATA
GCATGTCATGATTC
Alignments and scores:
AAGCTGAATTC-GAA
AGGCTCATTTCTGA-   28.0
AGATCAGTAGACTA-----
----GAGTAG-CTATCTCT   26.0
CGATAGATAGCATA---------
---------GCATGTCATGATTC   16.0
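A simple way to judge whether a score is higher than expected by chance is a shuffle test: keep one sequence fixed, shuffle the other many times, realign each shuffled copy, and see how often the shuffled score reaches the real one. The sketch below assumes a basic global alignment (Needleman-Wunsch) with made-up scoring values, so it will not reproduce the scores 28.0, 26.0 and 16.0 shown on the slide.

```python
import random


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score (Needleman-Wunsch, linear gap penalty)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def shuffle_test(a, b, n=200, seed=0):
    """Return (observed score, empirical p-value) from n shuffles of b."""
    rng = random.Random(seed)          # fixed seed keeps the sketch reproducible
    observed = nw_score(a, b)
    letters = list(b)
    at_least_as_good = 0
    for _ in range(n):
        rng.shuffle(letters)           # same composition, random order
        if nw_score(a, "".join(letters)) >= observed:
            at_least_as_good += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least_as_good + 1) / (n + 1)


observed, p_value = shuffle_test("AAGCTGAATTCGAA", "AGGCTCATTTCTGA")
print("observed score:", observed, "empirical p-value:", p_value)
```

Shuffling preserves the length and base composition of the sequence, so the comparison isolates the effect of residue order. A small p-value suggests the similarity is unlikely to be a chance effect.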

Distance matrix: UPGMA
UPGMA = unweighted pair group method with arithmetic mean
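The slide names UPGMA without giving details. As a rough sketch of the general procedure: repeatedly merge the two closest clusters, place the new node at half their distance, and recompute distances to the merged cluster as a size-weighted average. The function name `upgma`, the input format, and the toy distance matrix below are assumptions for illustration.

```python
def upgma(names, dist):
    """Cluster with UPGMA.

    names: list of taxon labels (strings)
    dist:  dict mapping (label, label) pairs to distances
    Returns (tree as a nested-parenthesis string, list of (a, b, height) merges).
    """
    sizes = {name: 1 for name in names}                 # cluster label -> leaf count
    d = {frozenset(pair): value for pair, value in dist.items()}
    merges = []
    tree = names[0]
    while len(sizes) > 1:
        pair = min(d, key=d.get)                        # closest pair of clusters
        a, b = sorted(pair)
        height = d[pair] / 2                            # node sits at half the distance
        tree = f"({a},{b})"
        merges.append((a, b, height))
        # Distance from the merged cluster to every other cluster:
        # average of the old distances, weighted by cluster size.
        for c in sizes:
            if c in (a, b):
                continue
            d_ac = d.pop(frozenset((a, c)))
            d_bc = d.pop(frozenset((b, c)))
            d[frozenset((tree, c))] = (
                sizes[a] * d_ac + sizes[b] * d_bc
            ) / (sizes[a] + sizes[b])
        del d[pair]
        sizes[tree] = sizes.pop(a) + sizes.pop(b)
    return tree, merges


# Toy distance matrix, invented for illustration.
names = ["A", "B", "C", "D"]
dist = {
    ("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
    ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10,
}
tree, merges = upgma(names, dist)
print(tree)
for a, b, height in merges:
    print(f"merge {a} + {b} at height {height}")
```

UPGMA assumes a constant rate of change along all branches (a molecular clock), so it yields a rooted tree in which every leaf is equally far from the root.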

Optimal vs. correct alignment
For a given group of sequences, there is no single correct alignment, only an alignment that is optimal according to some set of calculations.
Determining which alignment is best for a given set of sequences is really up to the judgment of the investigator.
The success of the alignment depends on the similarity of the sequences. If sequence variation is great, it will be very difficult to find an optimal alignment.

Web servers for pairwise alignment

NCBI

BLAST programs

Query type: AA or DNA?
For coding sequences, AA (protein) data are better:
Selection operates most strongly at the protein level, so the homology is more evident.
AA has a 20-character alphabet and DNA a 4-character alphabet, so chance matches are less likely for AA.
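The alphabet-size argument can be made concrete with a short calculation. The probability that two independent random residues are identical is the sum of the squared residue frequencies. The sketch below assumes uniform frequencies, which real sequences do not have, so the numbers are only a rough guide.

```python
def chance_identity(freqs):
    """Probability that two independently drawn residues are identical."""
    return sum(p * p for p in freqs.values())


def uniform(alphabet):
    """Equal frequency for every letter (a simplifying assumption)."""
    return {letter: 1 / len(alphabet) for letter in alphabet}


dna = chance_identity(uniform("ACGT"))
protein = chance_identity(uniform("ACDEFGHIKLMNPQRSTVWY"))

print(f"DNA:     {dna:.2%} of aligned positions match by chance")
print(f"Protein: {protein:.2%} of aligned positions match by chance")
```

With uniform frequencies this is one chance match in 4 positions for DNA and one in 20 for protein, so a given level of identity is more informative between protein sequences.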

BLAST bl2seq

Results
Match
Similarity, that is, the substitution of amino acids whose side chains have similar biochemical properties
Dissimilarity
Gaps
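The four categories can be illustrated by classifying each column of a pairwise protein alignment. The residue groups below are a simplified assumption made for this sketch. BLAST itself counts a pair as a "positive" when its substitution-matrix score is above zero, so its numbers can differ from a grouping like this one.

```python
# Simplified groups of biochemically similar amino acids (an assumption).
SIMILAR_GROUPS = ["DE", "KRH", "ST", "NQ", "ILVM", "FYW", "AG"]


def classify_columns(a, b, groups=SIMILAR_GROUPS):
    """Count match / similar / dissimilar / gap columns in an aligned pair."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    counts = {"match": 0, "similar": 0, "dissimilar": 0, "gap": 0}
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            counts["gap"] += 1
        elif x == y:
            counts["match"] += 1
        elif any(x in group and y in group for group in groups):
            counts["similar"] += 1
        else:
            counts["dissimilar"] += 1
    return counts


# The two 27-residue fragments from the "Comparing two sequences" slide.
seq1 = "MVNLTSDEKTAVLALWNKVDVEDCGGE"
seq2 = "MVHLTPEEKTAVNALWGKVNVDAVGGE"
print(classify_columns(seq1, seq2))
```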

MSA input: multiple-sequence FASTA file
>gi|4885397|ref|NP_005323.1| hemoglobin, zeta [Homo sapiens]
MSLTKTERTIIVSMWAKISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAAVGDA
VKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHCLLVTLAARFPADFTAEAHAAWDKFLSVVSSVLTEK
YR
>gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens]
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN
ALAHKYH
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH
>gi|4885393|ref|NP_005321.1| epsilon globin [Homo sapiens]
MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLT
SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAI
ALAHKYH
>gi|6715607|ref|NP_000175.1| G-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS
ALSSRYH
>gi|28302131|ref|NP_000550.2| A-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDATKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVAS
ALSSRYH
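A multiple-sequence FASTA file is simple to read: each record starts with a ">" header line, and the sequence follows on one or more lines. The parser below is a minimal sketch of that format. The function name `parse_fasta` and the short example records are assumptions for illustration, and established libraries offer more robust readers.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into {header: sequence}.

    Sequence lines belonging to one record are joined into a single string.
    """
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]             # drop the leading ">"
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


# Short made-up records in the same layout as the globin file.
example = """>seq1 first example record
MVHLTPEEKT
AVNALWGKVN
>seq2 second example record
MGHFTEEDKA
"""

for name, sequence in parse_fasta(example).items():
    print(name, len(sequence), sequence)
```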

Central role of multiple alignments
Multiple alignment is central to:
Comparative genomics
Phylogenetic studies
Hierarchical function annotation: homologs, domains, motifs
Gene identification, validation
Structure comparison, modelling
RNA sequence, structure, function
Human genetics, SNPs
Therapeutics, drug design
Therapeutics, drug discovery
(Slide figure labels: DBD, insertion domain, binding sites / mutations, LBD.)

Most important sequence databases
GenBank - maintained by the U.S. National Center for Biotechnology Information (NCBI)
All biological sequences: www.ncbi.nlm.nih.gov/genbank/genbankoverview.html
Genomes: www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=genome
Swiss-Prot - maintained by the EMBL European Bioinformatics Institute (EBI) together with the SIB Swiss Institute of Bioinformatics
Protein sequences: www.ebi.ac.uk/swissprot/

Sequence Similarity Searching
Basic Local Alignment Search Tool (BLAST)