Bioinformatics. Part 8. Sequence Analysis An introduction. Mahdi Vasighi



Sequence analysis
Some of the earliest problems in genomics concerned how to measure the similarity of DNA and protein sequences, either within a genome, across the genomes of different individuals, or across the genomes of different species.
Why is sequence similarity important?
- Similar sequence = similar information = same ancestor
- Similar sequence = similar structure = similar function

Homology of sequences
Homology between protein or DNA sequences is defined in terms of shared ancestry. Two segments of DNA can share ancestry because of either:
- a speciation event (orthologs), or
- a duplication event (paralogs).

Mutation
Mutations are changes in a genomic sequence: the DNA sequence of a cell's genome, or the DNA or RNA sequence of a virus. Causes include:
- Environmental factors: radiation, chemicals
- Mistakes in replication or repair

Mutation
Classification of mutation types:
- By effect on structure: small-scale mutations, large-scale mutations
- By impact on protein sequence

Mutation classification, by effect on structure: small-scale mutations
Point mutations, often caused by chemicals or by a malfunction of DNA replication, exchange a single nucleotide for another.
Transition:
    AGGCGTATTGCATCGTTAAACGCGC
    AGATGTATTGCGTCGCTAAACGCGC
The most common point mutation is the transition, which exchanges a purine for a purine (A ↔ G) or a pyrimidine for a pyrimidine (C ↔ T).

Mutation classification, by effect on structure: small-scale mutations
Point mutations, often caused by chemicals or by a malfunction of DNA replication, exchange a single nucleotide for another.
Transversion:
    AGGCGTATTGCATCGTTAAACGCGC
    AGTGGTATTGCCTCGATAAACGCGC
Less common is the transversion, which exchanges a purine for a pyrimidine or a pyrimidine for a purine (C/T ↔ A/G).
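The transition/transversion distinction can be checked mechanically. The following is a minimal Python sketch (Python is used here only so the example runs without the MATLAB toolbox; the function names are illustrative and not part of the slides) that compares two equal-length DNA strings and labels each differing position.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Label a single-base change as a transition or a transversion."""
    if ref == alt:
        raise ValueError("the two bases are identical")
    pair = {ref, alt}
    # A transition stays within one chemical class (purine or pyrimidine).
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def point_mutations(original, mutated):
    """Return (position, ref, alt, kind) for each differing position (1-based)."""
    if len(original) != len(mutated):
        raise ValueError("sequences must have the same length")
    return [
        (pos, ref, alt, classify_substitution(ref, alt))
        for pos, (ref, alt) in enumerate(zip(original, mutated), start=1)
        if ref != alt
    ]


original = "AGGCGTATTGCATCGTTAAACGCGC"
transition_example = "AGATGTATTGCGTCGCTAAACGCGC"
transversion_example = "AGTGGTATTGCCTCGATAAACGCGC"

for label, seq in (("transition slide", transition_example),
                   ("transversion slide", transversion_example)):
    print(label, point_mutations(original, seq))
```

Both slide examples differ from the original sequence at positions 3, 4, 12 and 16.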


Mutation classification, by impact on protein sequence: small-scale mutations
Point mutations that occur within the protein-coding region of a gene are classified by what the altered codon codes for.
Silent mutation:
    M   R   I   A   S   L   N   Stop
    ATG CGT ATT GCA TCG TTA AAC TAA  C...
    ATG CGT ATC GCA TCA TTG AAC TAA  C...
    M   R   I   A   S   L   N   Stop
The new codon is translated into the same amino acid.

Mutation classification, by impact on protein sequence: small-scale mutations
Point mutations that occur within the protein-coding region of a gene are classified by what the altered codon codes for.
Missense mutation:
    M   R   I   A   S   L   N   Stop
    ATG CGT ATT GCA TCG TTA AAC TAA  C...
    ATG CGT ACT GCA TTG TTA AAC TAA  C...
    M   R   T   A   L   L   N   Stop
The new codon is translated into a different amino acid.

Mutation classification, by impact on protein sequence: small-scale mutations
Point mutations that occur within the protein-coding region of a gene are classified by what the altered codon codes for.
Nonsense mutation:
    M   R   I   A   S   L   N   Stop
    ATG CGT ATT GCA TCG TTA AAC TAA  C...
    ATG CGT ATT GCA TCG TAA AAC TAA  C...
    M   R   I   A   S   Stop
The new codon is a stop codon, which can truncate the protein.

Mutation classification, by impact on protein sequence: small-scale mutations
Point mutations that occur within the protein-coding region of a gene are classified by what the altered codon codes for.
Conservative mutation:
    M   R   I   A   S   L   N   Stop
    ATG CGT ATT GCA TCG TTA AAC TAA  C...
    ATG CGT ATT GCA TCG TTC AAC TAA  C...
    M   R   I   A   S   F   N   Stop
Both L and F are non-polar (hydrophobic). A conservative mutation is a change in a DNA or RNA sequence that leads to the replacement of one amino acid with a biochemically similar one.
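The four slide examples can be reproduced by translating each sequence codon by codon and comparing the results. This is a minimal Python sketch, not the course's own code: the standard genetic code table and all function names are additions, the trailing "C..." of the slide sequences is dropped, and a conservative change is reported simply as missense because telling the two apart would need an amino acid property grouping that the slides do not define.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons(seq):
    """Split a coding sequence into complete codons, ignoring leftover bases."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def translate(seq):
    """Translate up to and including the first stop codon."""
    protein = []
    for codon in codons(seq):
        aa = CODON_TABLE[codon]
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)


def classify_codon_changes(original, mutated):
    """Return (codon number, ref, alt, ref aa, alt aa, kind) per changed codon."""
    changes = []
    pairs = zip(codons(original), codons(mutated))
    for number, (ref, alt) in enumerate(pairs, start=1):
        if ref == alt:
            continue
        ref_aa, alt_aa = CODON_TABLE[ref], CODON_TABLE[alt]
        if ref_aa == alt_aa:
            kind = "silent"
        elif alt_aa == "*":
            kind = "nonsense"
        else:
            kind = "missense"  # includes conservative replacements
        changes.append((number, ref, alt, ref_aa, alt_aa, kind))
    return changes


original = "ATGCGTATTGCATCGTTAAACTAA"
examples = {
    "silent": "ATGCGTATCGCATCATTGAACTAA",
    "missense": "ATGCGTACTGCATTGTTAAACTAA",
    "nonsense": "ATGCGTATTGCATCGTAAAACTAA",
    "conservative": "ATGCGTATTGCATCGTTCAACTAA",
}
for label, seq in examples.items():
    print(label, translate(seq), classify_codon_changes(original, seq))
```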

Mutation classification, by effect on structure: large-scale mutations
Large-scale mutations occur in chromosomal structure.
[Figures not preserved in this transcript]

Sequence Alignment
Sequence comparison can be used:
- to establish evolutionary relationships among organisms
- to identify functionally conserved sequences
- to find structural similarity
- to identify corresponding genes in organisms
Sequence alignment is the arrangement of two or more sequences (DNA, RNA or protein) by searching for a series of individual characters or character patterns that appear in the same order in the sequences.
    ATGCGTATTGCATCGTTAAACTAA
    ATGCGTATTGCA---TTAAACTAA
    AT-CGT---GCATCGTTAAACTAA

Sequence Alignment
    ACGTCTAG
    ACTCTAG-    2 matches, 5 mismatches, 1 not aligned

    ACGTCTAG
    -ACTCTAG    5 matches, 2 mismatches, 1 not aligned

    ACGTCTAG
    AC-TCTAG    7 matches, 0 mismatches, 1 not aligned
This seemingly simple alignment operation is not as simple as it sounds!
    ...aactgagtttacgctcataga...
                T---CT-A--G
How can we measure the distance between two strings or biological sequences? Edit distance.
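The counts on this slide, and the edit distance it introduces, can be computed in a few lines. The sketch below is illustrative Python with hypothetical function names; it assumes the unit-cost (Levenshtein) definition of edit distance, in which each substitution, insertion or deletion costs 1, since the slide does not say which variant is meant.

```python
def alignment_counts(top, bottom, gap="-"):
    """Count (matches, mismatches, unaligned columns) of a given alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


def edit_distance(s, t):
    """Unit-cost edit distance, computed row by row with dynamic programming."""
    previous = list(range(len(t) + 1))  # distance from "" to each prefix of t
    for i, a in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to ""
        for j, b in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,             # delete a
                current[j - 1] + 1,          # insert b
                previous[j - 1] + (a != b),  # keep or substitute
            ))
        previous = current
    return previous[-1]


for bottom in ("ACTCTAG-", "-ACTCTAG", "AC-TCTAG"):
    print(bottom, alignment_counts("ACGTCTAG", bottom))
print("edit distance:", edit_distance("ACGTCTAG", "ACTCTAG"))
```

The third alignment, with seven matches and one gap, corresponds to an edit distance of 1: deleting the single G turns ACGTCTAG into ACTCTAG.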

Sequence Alignment
There are two types of sequence alignment:
1. Global alignment
    TTGCGTATTGCATCG
    TTGCCTTTTCCAT--
2. Local alignment
    -----TACG------
    -----TTCG------

Sequence Alignment
Our approach is guided by biology: evolutionarily related proteins and nucleic acids may differ at particular positions through:
- substitution (point mutation)
- insertion of short segments
- deletion of short segments
- segmental duplication
- inversion
- translocation
    GTATTGCATCGTTAAA
    GTATTGCA---TTAAA
Insertions and deletions are collectively called indels. When comparing two genes, it is generally impossible to tell whether an indel is an insertion in one gene or a deletion in the other, unless the ancestry is known.

Pairwise Sequence Alignment
Two sequences are aligned using one of the following methods:
1. Dot matrix analysis
2. The dynamic programming (DP) algorithm
3. Word or k-tuple methods, such as those used in BLAST and FASTA
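To make the dynamic programming method concrete, here is a minimal Python sketch of the two classic recurrences, Needleman-Wunsch for global alignment and Smith-Waterman for local alignment. The slides do not name these algorithms or give a scoring scheme, so the algorithm choice and the scores (match +1, mismatch -1, gap -1) are assumptions made for illustration. The functions return only the optimal score, not the alignment itself.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best end-to-end alignment score (Needleman-Wunsch, linear gap cost)."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        current = [i * gap]
        for j, y in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if x == y else mismatch)
            current.append(max(diagonal,               # align x with y
                               previous[j] + gap,      # gap in b
                               current[j - 1] + gap))  # gap in a
        previous = current
    return previous[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score over all pairs of substrings (Smith-Waterman)."""
    best = 0
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if x == y else mismatch)
            # Clamping at zero lets an alignment restart anywhere.
            cell = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


print(global_score("ACGTCTAG", "ACTCTAG"))
print(local_score("AAACGTAAA", "TTTCGTTTT"))
```

The only difference between the two recurrences is the clamp at zero and where the answer is read from: the last cell for a global alignment, the maximum cell for a local one.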

Pairwise Sequence Alignment: dot matrix analysis
A dot matrix analysis is primarily a method for comparing two sequences to look for possible alignments of characters between them. The method is also used for:
- finding direct or inverted repeats in protein and DNA sequences
- predicting regions in RNA that are self-complementary

Pairwise Sequence Alignment: dot matrix analysis
[Dot matrix of ATGCCCATAG against TTGCATAG]
Any region of similar sequence is revealed by a diagonal row of dots. Isolated dots not on the diagonal represent random matches that are probably not related to any significant alignment.
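An unfiltered dot matrix is simply a table of character equalities. The following Python sketch (function names are illustrative, and the text rendering stands in for the lost figure) builds one for the two sequences on this slide.

```python
def dot_matrix(seq_rows, seq_cols):
    """matrix[i][j] is True when seq_rows[i] equals seq_cols[j]."""
    return [[r == c for c in seq_cols] for r in seq_rows]


def render(seq_rows, seq_cols, matrix):
    """Draw the matrix as text: '*' for a dot, '.' for an empty cell."""
    lines = ["  " + " ".join(seq_cols)]
    for label, row in zip(seq_rows, matrix):
        lines.append(label + " " + " ".join("*" if dot else "." for dot in row))
    return "\n".join(lines)


horizontal = "ATGCCCATAG"
vertical = "TTGCATAG"
matrix = dot_matrix(vertical, horizontal)
print(render(vertical, horizontal, matrix))
```

In the printed grid the shared segments TGC and CATAG appear as two diagonal runs of `*`, offset from each other by the two extra bases in the longer sequence.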

Pairwise Sequence Alignment: dot matrix analysis
Detection of matching regions may be improved by filtering out random matches in a dot matrix, which is done by defining a window:
[Dot matrix of TTGCATAG against ATGCCCATAG, window size = 3, stringency = 2]
A typical window size for DNA sequences is 15, with a suitable match requirement of 10 within the window. For protein sequences the matrix is often not filtered, but a window size of 3 and a match requirement of 2 will highlight matching regions.
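The window filter can be sketched as follows. This is an illustrative Python version with a hypothetical function name; it assumes the dot is placed at the start of each window, which is one of several possible conventions and is not stated on the slide.

```python
def filtered_dot_matrix(seq_rows, seq_cols, window=3, stringency=2):
    """Place a dot at (i, j) when at least `stringency` of the `window`
    diagonal positions starting at (i, j) match."""
    n_rows = max(len(seq_rows) - window + 1, 0)
    n_cols = max(len(seq_cols) - window + 1, 0)
    matrix = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            matches = sum(seq_rows[i + k] == seq_cols[j + k]
                          for k in range(window))
            row.append(matches >= stringency)
        matrix.append(row)
    return matrix


vertical = "TTGCATAG"
horizontal = "ATGCCCATAG"
loose = filtered_dot_matrix(vertical, horizontal, window=3, stringency=2)
strict = filtered_dot_matrix(vertical, horizontal, window=3, stringency=3)
for name, m in (("stringency 2", loose), ("stringency 3", strict)):
    print(name)
    for row in m:
        print("".join("*" if dot else "." for dot in row))
```

Raising the stringency removes dots that come from partial matches, such as TTG against ATG at the top-left corner, while the exact diagonal runs remain.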

Pairwise Sequence Alignment: dot matrix analysis
[Dot matrix of ATGCCCATAG against TTGCATAG, with the alignment read from its diagonals]
    ATGCCCATAG
    TTGC--ATAG

Pairwise Sequence Alignment: dot matrix analysis
[Dot matrix of GCTAGTCG against ATGCTGATCG, with the alignment read from its diagonals]
    ATGCT-GATCG
    --GCTAG-TCG

Pairwise Sequence Alignment: dot matrix analysis
[Dot matrix of two sequences, GCTAGT and GCTGAT, labeled Seq 1 and Seq 2 on the axes, with the alignment read from its diagonals]
    GCT-GAT
    GCTAG-T

Pairwise Sequence Alignment: dot matrix analysis
[Dot matrix of ATCGTGATCG against itself]

Pairwise Sequence Alignment: dot matrix analysis
[Dot matrix panels with Gene 1 and Gene 2 on the axes]

Pairwise Sequence Alignment: dot matrix analysis
MATLAB (matrix laboratory) is a high-level language and interactive environment that enables you to perform computationally intensive tasks faster than with traditional programming languages such as C, C++, and Fortran.
Bioinformatics Toolbox offers an integrated software environment for genome and proteome analysis. In particular, it provides access to genomic and proteomic data formats, analysis techniques, and specialized visualizations for genomic and proteomic sequence and microarray analysis.

Pairwise Sequence Alignment: dot matrix analysis
getgenbank
Purpose: Retrieve sequence from GenBank database
Syntax: Data = getgenbank('AccessionNumber', ... 'PropertyName', PropertyValue ...)
Accession number: unique identifier for a sequence record.
Example:
    S = getgenbank('M10051')
    S = getgenbank('M10051', 'sequence', true)
Output of S = getgenbank('M10051'):
    S =
                    LocusName: 'HUMINSR'
          LocusSequenceLength: '4723'
         LocusNumberofStrands: ''
                LocusTopology: 'linear'
            LocusMoleculeType: 'mRNA'
         LocusGenBankDivision: 'PRI'
        LocusModificationDate: '06-JAN-1995'
                   Definition: 'Human insulin receptor mRNA, complete cds.'
                    Accession: 'M10051'
                      Version: 'M10051.1'
                           GI: '186439'
                     Keywords: 'insulin receptor; tyrosine kinase.'
                      Segment: []
                       Source: 'Homo sapiens (human)'
               SourceOrganism: [3x65 char]
                    Reference: {[1x1 struct]}
                      Comment: [14x67 char]
                     Features: [51x74 char]
                          CDS: [139 4287]
                     Sequence: [1x4723 char]
                    SearchURL: [1x105 char]
                  RetrieveURL: [1x95 char]

Pairwise Sequence Alignment: dot matrix analysis
getembl
Purpose: Retrieve sequence information from EMBL database
Syntax: EMBLData = getembl(AccessionNumber)
Example:
    emblout = getembl('X00558')
    emblout = getembl('X00558', 'ToFile', 'c:\project\rat_protein.txt')
Output of emblout = getembl('X00558'):
    emblout =
                Identification: [1x1 struct]
                     Accession: 'X00558'
               SequenceVersion: 'X00558.1'
                   DateCreated: '13-JUN-1985 (Rel. 06, Created)'
                   DateUpdated: [1x46 char]
                   Description: 'Rat liver apolipoprotein A-I mRNA (apoA-I)'
                       Keyword: [1x44 char]
               OrganismSpecies: 'Rattus norvegicus (Norway rat)'
        OrganismClassification: [3x75 char]
                     Organelle: ''
                     Reference: {[1x1 struct]}
        DatabaseCrossReference: ''
                      Comments: ''
                      Assembly: ''
                       Feature: [22x75 char]
                     BaseCount: [1x1 struct]
                      Sequence: [1x877 char]
                   RetrieveURL: [1x64 char]

getpdb
Purpose: Retrieve protein structure data from Protein Data Bank (PDB) database
Syntax: PDBStruct = getpdb(PDBid)
Example:
    pdbstruct = getpdb('5CYT')
Output of pdbstruct = getpdb('5CYT'):
    pdbstruct =
                Header: [1x1 struct]
                 Title: 'REFINEMENT OF MYOGLOBIN AND CYTOCHROME C'
              Compound: [4x23 char]
                Source: [4x38 char]
              Keywords: 'ELECTRON TRANSPORT (HEME PROTEIN)'
        ExperimentData: 'X-RAY DIFFRACTION'
               Authors: 'T.TAKANO'
          RevisionDate: [1x2 struct]
            Superseded: [1x1 struct]
               Journal: [1x1 struct]
               Remark1: [1x1 struct]
               Remark2: [1x1 struct]
               Remark3: [1x1 struct]
               Remark4: [2x59 char]
             Remark100: [2x59 char]
             Remark200: [49x59 char]
             Remark280: [6x59 char]

Pairwise Sequence Alignment: dot matrix analysis
seqdotplot
Purpose: Create dot plot of two sequences
Syntax:
    seqdotplot(Seq1, Seq2)
    seqdotplot(Seq1, Seq2, Window, Number)
Window: an integer for the size of a window.
Number: an integer for the number of characters within the window that match.
Example:
    moufflon = getgenbank('ab060288','sequence',true);
    takin = getgenbank('ab060290','sequence',true);
    seqdotplot(moufflon,takin,11,7)
Prion protein is a small protein found in high quantity in the brain of animals infected with mad-cow disease.


Pairwise Sequence Alignment Dot matrix analysis http://myhits.isb-sib.ch/cgi-bin/dotlet

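Pairwise Sequence Alignment: dot matrix analysis
nt2aa
Purpose: Convert nucleotide sequence to amino acid sequence
Syntax: SeqAA = nt2aa(SeqNT)
Example:
    S = getgenbank('M10051')
    p = nt2aa(S.Sequence([2699:2885, 3673:3818]))
    >> S.CDS
    ans =
           location: 'join(2699..2885,3673..3818)'
               gene: 'INS'
            product: 'insulin'
        codon_start: '1'
            indices: [2699 2885 3673 3818]
         protein_id: 'AAA59173.1'
            db_xref: 'GI:386829'
               note: ''
        translation: [1x110 char]
               text: [9x58 char]
Note: S must be the full record structure, retrieved without the 'sequence', true option, for S.Sequence and S.CDS to exist. The CDS shown here (gene INS, product insulin) does not match M10051, the insulin receptor mRNA whose CDS is [139 4287], so this output comes from a different GenBank record.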

Pairwise Sequence Alignment: dot matrix analysis
nt2int
Purpose: Convert nucleotide sequence from letter to integer representation
Syntax: SeqInt = nt2int(SeqChar)
Example:
    s = nt2int('actgctagc')

randseq
Purpose: Generate random sequence from finite alphabet
Syntax: Seq = randseq(SeqLength)
Example:
    S = randseq(20)
    S = randseq(20, 'alphabet', 'amino')
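For readers without the toolbox, the idea behind these two helpers can be sketched in Python. These are not the MATLAB functions: the names are hypothetical, the integer codes (A=1, C=2, G=3, T or U=4) are an assumption modeled on the toolbox convention, and the `seed` parameter is added here only to make runs repeatable.

```python
import random

# Assumed integer codes; T and U share a code so DNA and RNA both work.
NT_CODES = {"A": 1, "C": 2, "G": 3, "T": 4, "U": 4}
AMINO_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def nt_to_int(seq):
    """Convert a nucleotide string to a list of integer codes."""
    return [NT_CODES[base] for base in seq.upper()]


def random_sequence(length, alphabet="ACGT", seed=None):
    """Draw `length` characters uniformly at random from `alphabet`."""
    rng = random.Random(seed)
    return "".join(rng.choice(alphabet) for _ in range(length))


print(nt_to_int("actgctagc"))
print(random_sequence(20))
print(random_sequence(20, alphabet=AMINO_ALPHABET))
```

An integer representation is convenient when sequences are used as indices into scoring tables or counted position by position.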

Pairwise Sequence Alignment: dot matrix analysis
molviewer
Purpose: Display and manipulate 3-D molecule structure
Syntax:
    molviewer
    molviewer(PDBid)
Example:
    molviewer
    molviewer('5CYT')

Pairwise Sequence Alignment: dot matrix analysis
    -GCGC-ATGGATTGAGCGA
    TGCGCCATTGAT-GACC-A

    ------GCGCATGGATTGAGCGA
    TGCGCC----ATTGATGACCA--
Which one is better?
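One way to answer the question is to assign each alignment a numerical score. The sketch below is illustrative Python with a hypothetical function name; it assumes an affine gap scheme (match +1, mismatch -1, gap opening -2, each further gap position -1), and these values are placeholders rather than figures taken from the slides.

```python
def alignment_score(top, bottom, match=1, mismatch=-1,
                    gap_open=-2, gap_extend=-1):
    """Score a given alignment; a run of gaps in one row pays gap_open
    for its first column and gap_extend for each further column."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0
    previous_gap_row = None  # which row held a gap in the previous column
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            gap_row = "top" if a == "-" else "bottom"
            score += gap_extend if gap_row == previous_gap_row else gap_open
            previous_gap_row = gap_row
        else:
            score += match if a == b else mismatch
            previous_gap_row = None
    return score


first = ("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A")
second = ("------GCGCATGGATTGAGCGA", "TGCGCC----ATTGATGACCA--")
print("first :", alignment_score(*first))
print("second:", alignment_score(*second))
```

Under this scheme the first alignment (13 matches, 2 mismatches, 4 single gaps) scores higher than the second (5 matches, 6 mismatches, 12 gap positions). The ranking of two alignments can change with the scoring parameters, which is why the choice of substitution scores and gap penalties matters.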

Exercise: Find the coding sequences of hemoglobin subunit beta for human, chimpanzee and rat. Analyze them using the MATLAB dot plot tool (seqdotplot).