Introduction to Bioinformatics

Similar documents
Background: comparative genomics. Sequence similarity. Homologs. Similarity vs homology (2) Similarity vs homology. Sequence Alignment (chapter 6)

Sequence Alignment (chapter 6)

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

Gene Families part 2. Review: Gene Families /727 Lecture 8. Protein family. (Multi)gene family

B I O I N F O R M A T I C S

Orthology Part I: concepts and implications Toni Gabaldón Centre for Genomic Regulation (CRG), Barcelona

Pairwise & Multiple sequence alignments

Bioinformatics. Part 8. Sequence Analysis An introduction. Mahdi Vasighi

Bioinformatics Exercises

Practical considerations of working with sequencing data

RELATIONSHIPS BETWEEN GENES/PROTEINS HOMOLOGUES

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment


Example of Function Prediction

Sequence alignment methods. Pairwise alignment. The universe of biological sequence analysis

Bio 1B Lecture Outline (please print and bring along) Fall, 2007

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Phylogeny and systematics. Why are these disciplines important in evolutionary biology and how are they related to each other?

Sequence analysis and Genomics

Orthology Part I concepts and implications Toni Gabaldón Centre for Genomic Regulation (CRG), Barcelona

CHAPTERS 24-25: Evidence for Evolution and Phylogeny

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

Motivating the need for optimal sequence alignments...

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming

Exploring Evolution & Bioinformatics

BLAST. Varieties of BLAST

Lecture 2: Pairwise Alignment. CG Ron Shamir

Evolutionary Tree Analysis. Overview

Biol478/ August

Computational approaches for functional genomics

Analysis and Design of Algorithms Dynamic Programming

EECS730: Introduction to Bioinformatics

Algorithms in Bioinformatics

BIOINFORMATICS: An Introduction

Comparative genomics: Overview & Tools + MUMmer algorithm

Orthologs Detection and Applications

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

Research Proposal. Title: Multiple Sequence Alignment used to investigate the co-evolving positions in OxyR Protein family.

Dynamic Programming: Edit Distance

EECS730: Introduction to Bioinformatics

Single alignment: Substitution Matrix. 16 march 2017

Pairwise sequence alignment

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Outline. Sequence-comparison methods. Buzzzzzzzz. Why compare sequences? Gerard Kleywegt Uppsala University

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

17 Non-collinear alignment Motivation A B C A B C A B C A B C D A C. This exposition is based on:

Page 1. Evolutionary Trees. Why build evolutionary tree? Outline

Sequencing alignment Ameer Effat M. Elfarash

Dr. Amira A. AL-Hosary

Lecture 4: September 19

Sequencing alignment Ameer Effat M. Elfarash

Phylogenetic trees 07/10/13

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

Dynamic programming. Curs 2017

Lecture 1, 31/10/2001: Introduction to sequence alignment. The Needleman-Wunsch algorithm for global sequence alignment: description and properties

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Tools and Algorithms in Bioinformatics

Quantifying sequence similarity

Phylogenetic inference

Introduction to protein alignments

Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST

Graph Alignment and Biological Networks

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME

String Matching Problem

Processes of Evolution

... and searches for related sequences probably make up the vast bulk of bioinformatics activities.

Network alignment and querying

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

EECS730: Introduction to Bioinformatics

Hands-On Nine The PAX6 Gene and Protein

Chapter 26: Phylogeny and the Tree of Life Phylogenies Show Evolutionary Relationships

Aside: Golden Ratio. Golden Ratio: A universal law. Golden ratio φ = lim n = 1+ b n = a n 1. a n+1 = a n + b n, a n+b n a n

A PARSIMONY APPROACH TO ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS

BIOINFORMATICS LAB AP BIOLOGY

Computational methods for predicting protein-protein interactions

Pairwise sequence alignments

USING BLAST TO IDENTIFY PROTEINS THAT ARE EVOLUTIONARILY RELATED ACROSS SPECIES

Bioinformatics for Biologists

CREATING PHYLOGENETIC TREES FROM DNA SEQUENCES

Molecular Evolution and Comparative Genomics

Sequence Alignment. Johannes Starlinger

Homology and Information Gathering and Domain Annotation for Proteins

Dynamic programming. Curs 2015

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Bioinformatics: Investigating Molecular/Biochemical Evidence for Evolution

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Cladistics and Bioinformatics Questions 2013

Genomes and Their Evolution

Genômica comparativa. João Carlos Setubal IQ-USP outubro /5/2012 J. C. Setubal

Ontology Alignment in the Presence of a Domain Ontology

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

Outline Sequence-comparison methods. Buzzzzzzzz. MB330 - The class of 2008

Inferring phylogeny. Constructing phylogenetic trees. Tõnu Margus. Bioinformatics MTAT

Pairwise sequence alignments. Vassilios Ioannidis (From Volker Flegel )

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison

Multiple Alignment. Slides revised and adapted to Bioinformática IST Ana Teresa Freitas

Cluster Analysis of Gene Expression Microarray Data. BIOL 495S/ CS 490B/ MATH 490B/ STAT 490B Introduction to Bioinformatics April 8, 2002

EVOLUTIONARY DISTANCES

Transcription:

Introduction to Bioinformatics Lecture : p he biological problem p lobal alignment p Local alignment p Multiple alignment 6 Background: comparative genomics p Basic question in biology: what properties are shared among organisms? p enome sequencing allows comparison of organisms at DN and protein levels p Comparisons can be used to 64 Find evolutionary relationships between organisms Identify functionally conserved sequences Identify corresponding genes in human and model organisms: develop models for human diseases omologs 65 p wo genes (sequences in general) and evolved from the same ancestor gene g are called homologs p omologs usually exhibit conserved functions p Close evolutionary relationship => expect a high number of homologs g = agtgtccgttaagtgcgttc = agtgccgttaaagttgtacgtc = ctgactgtttgtggttc Sequence similarity p e expect homologs to be similar to each other p Intuitively, similarity of two sequences refers to the degree of match between corresponding positions in sequence p hat about sequences that differ in length? 66 agtgccgttaaagttgtacgtc ctgactgtttgtggttc Similarity vs homology p Sequence similarity is not sequence homology If the two sequences and have accumulated enough mutations, the similarity between them is likely to be low #mutations agtgtccgttaagtgcgttc agtgtccgttatagtgcgttc agtgtccgcttatagtgcgttc 4 agtgtccgcttaagggcgttc 8 agtgtccgcttcaaggggcgt 6 gggccgttcatgggggt gcagggcgtcactgagggct 67 #mutations 64 acagtccgttcgggctattg 8 cagagcactaccgc 56 cacgagtaagatatagct 5 taatcgtgata 4 acccttatctacttcctggagtt 48 agcgacctgcccaa 496 caaac omology is more difficult to detect over greater evolutionary distances.

Similarity vs homology () p Sequence similarity can occur by chance Similarity does not imply homology p Consider comparing two short sequences against each other Orthologs and paralogs p e distinguish between two types of homology Orthologs: homologs from two different species, separated by a speciation event Paralogs: homologs within a species, separated by a gene duplication event g ene duplication event g Organism g g Organism B Orthologs OrganismC Paralogs 68 69 Orthologs and paralogs () p Orthologs typically retain the original function p In paralogs, one copy is free to mutate and acquire new function (no selective pressure) g g g g Organism Paralogy example: hemoglobin p emoglobin is a protein complex which transports oxygen p In humans, hemoglobin consists of four protein subunits and four nonprotein heme groups Organism B OrganismC Sickle cell diseases are caused by mutations in hemoglobin genes emoglobin, www.rcsb.org/pdb/explore.do?structureid=zx 7 7 http://en.wikipedia.org/wiki/image:sicklecells.jpg Paralogy example: hemoglobin p In adults, three types are normally present emoglobin : alpha and beta subunits emoglobin : alpha and delta subunits emoglobin F: alpha and gamma subunits p Each type of subunit (alpha, beta, gamma, delta) is encoded by a separate gene emoglobin, www.rcsb.org/pdb/explore.do?structureid=zx Paralogy example: hemoglobin p he subunit genes are paralogs of each other, i.e., they have a common ancestor gene p Exercise: hemoglobin human paralogs in NCBI sequence databases http://www.ncbi.nlm.nih.gov/sites/entre z?db=nucleotide Find human hemoglobin alpha, beta, gamma and delta Compare sequences emoglobin, www.rcsb.org/pdb/explore.do?structureid=zx 7 7

Orthology example: insulin p he genes coding for insulin in human (omo sapiens) and mouse (Mus musculus) are orthologs: hey have a common ancestor gene in the ancestor species of human and mouse Exercise: find insulin orthologs from human and mouse in NCBI sequence databases p lignment specifies which positions in two sequences match actctag matches 5 mismatches not aligned actctag 5 matches mismatches not aligned actctag 7 matches mismatches not aligned 74 75 Mutations: Insertions, deletions and substitutions p Maximum alignment length is the total length of the two sequences Indel: insertion or deletion of a base with respect to the ancestor sequence actctag Mismatch: substitution (point mutation) of a single base actctag matches mismatches 5 not aligned actctag matches mismatches 5 not aligned p Insertions and/or deletions are called indels e can t tell whether the ancestor sequence had a base or not at indel position! 76 77 Problems Sequence lignment (chapter 6) p hat sorts of alignments should be considered? p ow to score alignments? p ow to find optimal or good scoring alignments? p ow to evaluate the statistical significance of scores? p he biological problem p lobal alignment p Local alignment p Multiple alignment In this course, we discuss each of these problems briefly. 78 79

lobal alignment p Problem: find optimal scoring alignment between two sequences (Needleman & unsch 97) p Every position in both sequences is included in the alignment p e give score for each position in alignment Identity (match) + Substitution (mismatch) µ Indel p otal score: sum of position scores Scoring: oy example p Consider two sequences with characters drawn from the English language alphabet:, S(/) = + µ S(/) = µ µ µ 8 8 Dynamic programming p ow to find the optimal alignment? p e use previous solutions for optimal alignments of smaller subsequences p his general approach is known as dynamic programming Introduction to dynamic programming: the money change problem p Suppose you buy a pen for 4. and pay for with a 5 note p ou get 77 cents in change what coins is the cashier going to give you if he or she tries to minimise the number of coins? p he usual algorithm: start with largest coin (denominator), proceed to smaller coins until no change is left: 5,, 5 and cents p his greedy algorithm is incorrect, in the sense that it does not always give you the correct answer 8 8 he money change problem he money change problem p ow else to compute the change? p e could consider all possible ways to reduce the amount of change p Suppose we have 77 cents change, and the following coins: 5,, 5 cents p e can compute the change with recursion C(n) = min { C(n 5) +, C(n ) +, C(n 5) + } p Figure shows the recursion tree for the example 77 5 5 7 57 7 7 7 7 5 5 67 p Many values are computed more than once! p his leads to a correct but very inefficient algorithm p e can speed the computation up by solving the change problem for all i n Example: solve the problem for 9 cents with available coins being, and 5 cents p Solve the problem in steps, first for cent, then cents, and so on p In each step, utilise the solutions from the previous steps 84 85 4

he money change problem mount of change left Number of coins used 4 5 6 7 8 9 p lgorithm runs in time proportional to Md, where M is the amount of change and d is the number of coin types p he same technique of storing solutions of subproblems can be utilised in aligning sequences Representing alignments and scores lignments can be represented in the following tabular form. Each alignment corresponds to a path through the table. 86 87 Representing alignments and scores Representing alignments and scores lobal alignment score S,4 = µ µ 88 89 Filling the alignment matrix Filling the alignment matrix () Case Case Case Consider the alignment process at shaded square. Case. lign against (match) Case. lign in against (indel) in Case. lign in against (indel) in Case Case Case Scoring the alternatives. Case. S, = S, + s(, ) Case. S, = S, Case. S, = S, s(i, j) = for matching positions, s(i, j) = µ for substitutions. Choose the case (path) that yields the maximum score. Keep track of path choices. 9 9 5

lobal alignment: formal development Scoring partial alignments = a a a a n, B = b b b b m b b b b 4 a a a ny alignment can be written as a unique path through the matrix Score for aligning and B up to positions i and j: S i,j = S(a a a a i, b b b b j ) a a a b b b 4 b 4 p lignment of = a a a a i with B = b b b b j can be end in three possible ways Case : (a a a i ) a i (b b b j ) b j Case : (a a a i ) a i (b b b j ) Case : (a a a i ) (b b b j ) b j 9 9 Scoring alignments Scoring alignments () p Scores for each case: Case : (a a a i ) a i (b b b j ) b j Case : (a a a i ) a i (b b b j ) Case : (a a a i ) (b b b j ) b j + ifa i = b j s(a i, b j ) = { µ otherwise s(a i, ) = s(, b j ) = p First row and first column correspond to initial alignment against indels: S(i, ) = i S(, j) = j p Optimal global alignment score S(, B) = S n,m a a b b b 4 b 4 4 a 94 95 lgorithm for global alignment lobal alignment: example Input sequences, B, n =, m = B Set S i, := i for all i Set S,j := j for all j for i := to n for j := to m S i,j := max{s i,j, S i,j + s(a i,b j ), S i,j } end end µ = = C 4 6 8 4 6 8? lgorithm takes O(nm) time 96 97 6

lobal alignment: example lobal alignment: example () µ = 4 6 8 µ = 4 6 8 = = 5 7 9 4 4 4 4 6 C 6 8 C C 6 8 5 5 5 4? 7 4 98 99 7